Issue 
4open
Volume 5, 2022
Logical Entropy



Article Number  1  
Number of page(s)  33  
Section  Physics  Applied Physics  
DOI  https://doi.org/10.1051/fopen/2021004  
Published online  13 January 2022 
Research Article
Introduction to logical entropy and its relationship to Shannon entropy
Faculty of Social Sciences, University of Ljubljana, Ljubljana 1000, Slovenia
^{*} Corresponding author: david@ellerman.org
Received:
23
August
2021
Accepted:
5
October
2021
We live in the information age. Claude Shannon, as the father of the information age, gave us a theory of communications that quantified an “amount of information,” but, as he pointed out, “no concept of information itself was defined.” Logical entropy provides that definition. Logical entropy is the natural measure of the notion of information based on distinctions, differences, distinguishability, and diversity. It is the (normalized) quantitative measure of the distinctions of a partition on a setjust as the Boole–Laplace logical probability is the normalized quantitative measure of the elements of a subset of a set. And partitions and subsets are mathematically dual concepts – so the logic of partitions is dual in that sense to the usual Boolean logic of subsets, and hence the name “logical entropy.” The logical entropy of a partition has a simple interpretation as the probability that a distinction or dit (elements in different blocks) is obtained in two independent draws from the underlying set. The Shannon entropy is shown to also be based on this notion of informationasdistinctions; it is the average minimum number of binary partitions (bits) that need to be joined to make all the same distinctions of the given partition. Hence all the concepts of simple, joint, conditional, and mutual logical entropy can be transformed into the corresponding concepts of Shannon entropy by a uniform nonlinear ditbit transform. And finally logical entropy linearizes naturally to the corresponding quantum concept. The quantum logical entropy of an observable applied to a state is the probability that two different eigenvalues are obtained in two independent projective measurements of that observable on that state.
Key words: Logical entropy / Shannon entropy / Partitions / MaxEntropy / Quantum logical entropy / Von Neumann entropy
© D. Ellerman et al., Published by EDP Sciences, 2022
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Introduction
This paper is an introduction to the concept of logical entropy as the direct measure of the definition of information in terms of distinctions, differences, distinguishability, and diversity. The formula for logical entropy goes back to the early twentieth century, but the current development comes out of seeing the formula as the quantification of information in a partition as the normalized number of distinctions or dits (ordered pairs of elements in different blocks) of the partition. Just as the Laplace–Boole notion of probability, as the normalized number of elements in a subset, quantifies the logic of subsets, so logical entropy, as the normalized number of distinctions in a partition, quantifies the logic of partitions – and hence the adjective “logical.” The logical entropy of a partition is, in fact, a probability measure – the probability of obtaining a distinction of the partition in two independent draws from the universe set, just as the logical Laplace–Boole probability of a subset (or event) is the onedraw probability of obtaining an element of the subset.
Far from displacing the usual notion of Shannon entropy; the point is to show that the Shannon entropy of a partition is a different quantification of the same notion of informationasdistinctions, i.e., the average minimum number of binary partitions (bits) that need to be joined together to make the same distinctions of a partition. In fact, there is a nonlinear dittobit transformation that transforms all the concepts of simple, joint, conditional and mutual logical entropy into the corresponding formulas for Shannon entropy, where the latter are especially suited for the theory of coding and communications.
Edwin Jaynes’ MaxEntropy method is intended to generalize the Laplace indifference principle by determining the “best” probability distribution consistent with given constraints (e.g., that rule out the uniform distribution of the indifference principle) by maximizing Shannon entropy subject to those constraints. We show that maximizing logical entropy subject to the same constraints gives a different probability distribution. The logical entropy solution is the closest to the uniform distribution in terms of the usual notion of (Euclidean) distance while the Jaynes solution is the closest in terms of the Kullback–Leibler (KL) divergence from the uniform distribution. The notion of informationasdifferences also connects to ordinary statistical theory since the metrical version of logical entropy is just twice the usual notion of variance (or equals the variance if one counts unordered pairs), and similarly for the notion of covariance.
There is a quasialgorithmic method, linearization, that transforms setbased concepts into vectorspace concepts. Applied to the setbased concepts of “classical” logical entropy, the linearization to Hilbert spaces generates the quantum versions of logical entropy. The quantum logical entropy of an observable applied to a quantum state is the probability of getting different eigenvalues in two independent (projective) measurements of the observable on that state.
Logical entropy
Partitions on a set
A partition π = {B_{1}, …, B_{m}} on a finite set U = {u_{1}, …,u_{n}} is a set of nonempty subsets B_{i} ⊆ U called blocks that are disjoint and whose union is all of U. A distinction or dit of π is an ordered pair (u_{j}, u_{k}) ∈ U × U where u_{j} and u_{k} are in different blocks of π. The set of all distinctions of π is the ditset dit(π) ⊆ U × U. An ordered pair (u_{j}, u_{k}) ∈ U × U is an indistinction or indit of π if u_{j} and u_{k} are in the same block of π, and the set of all indits of π is the inditset $\mathrm{indit}\left(\pi \right)={\cup}_{j=1}^{m}\left({B}_{j}\times {B}_{j}\right)$. A binary relation $E\subseteq U\times U$ is an equivalence relation on U if it is reflexive (i.e., for all u ∈ U, (u, u) ∈ E), symmetric (i.e., for all (u, u′) ∈ E, (u′, u) ∈ E), and transitive (i.e., if (u, u′) ∈ E and (u′, u″) ∈ E, then (u′, u″) ∈ E). The inditset indit(π) of a partition on U is an equivalence relation on U. Given an equivalence relation E on U, two elements are said to be equivalent, u ~ u′, if (u,u′) ∈ E. Let [u]_{E} ⊆ U be the set of elements of U equivalent to u ∈ U, i.e., an equivalence class of E. The set of equivalence classes of E is a partition on U and the inditset of that partition is E. Hence the notion of an equivalence relation and an inditset of a partition are equivalent notions.
Since each ordered pair (u_{j}, u_{k}) ∈ U × U is either an dit of π or an indit of π but not both, the ditset dit(π) = U × U − indit(π) is the complement of the inditset in U × U = U ^{2}. As a binary relation dit(π) ⊆ U × U, the ditsets of a partition are called a partition relation or an apartness relation. Partition relations P ⊆ U × U can be characterized as being irreflexive (i.e., for any u ∈ U, (u, u) ∉ P), symmetric, and antitransitive (i.e., for any (u_{j}, u_{k}) ∈ P and for any sequence ${u}_{j}={u}_{{j}_{0}},{u}_{{j}_{1}},...,{u}_{{j}_{k}},{u}_{{j}_{k+1}}={u}_{k}$ of elements of U, there is a pair $\left({u}_{{j}_{i}},{u}_{{j}_{i+1}}\right)\in P$). Every ditset of a partition is a partition relation and viceversa.
Given another partition σ = {C_{1}, …, C_{k}} on U, the partition π refines σ, written σ ≾ π, if for every block B ∈ π, there is a block C ∈ σ such that B ⊆ C. Intuitively, π is obtained from σ by splitting up some of the blocks of σ which creates more distinctions. Indeed, σ ≾ π if and only if (iff) dit(σ) ⊆ dit(π). The refinement relation on the partitions on U is a partial order in the sense that it is reflexive, antisymmetric (i.e., if σ ≾ π and π ≾ σ then σ = π), and transitive. The partial order has a maximal or top partition and a minimal or bottom partition. The top is the discrete partition 1_{U} = {{u}}_{(u∈U)} where all the blocks are singletons, and the bottom is the indiscrete partition or “blob” 0_{U} = {U} with only one block U. Both the join (least upper bound) and meet (greatest lower bound) of two partitions always exist so the refinement partial order is a lattice Π(U).^{1} Only the join operation is used here, but all the Boolean operations on subsets can be extended to partitions to form the logic of partitions [3, 4] that is the dual counterpart to the Boolean logic of subsets (which is usually presented in the special case of propositional logic). Given π and σ, the join π ∨ σ is the partition on U whose blocks are all the nonempty intersections B ∩ C for B ∈ π and C ∈ σ.
One of the easiest ways to see the dual pairing of the concepts of a subset and a partition is to consider a function f: X → Y from a set X to a set Y. The image is the subset f(X) = {y ∈ Y: ∃_{x} ∈ X, f(x) = y} of the codomain Y, and the inverseimage or coimage is the partition {f ^{−1}(y)}_{y∈f(X)} on the domain X (Fig. 1).^{2}
Figure 1 Image subset and inverseimage partition of a function f: X→Y. 
Logical entropy: the quantification of distinctions
The set of all subsets of a set U, the powerset ℘(U), also forms a lattice under the inclusion partial order with the top U, the bottom ∅, and the join and meet being set union and intersection respectively. Given the duality between subsets and partitions, it is natural to see what concepts carry over from subsets to partitions. In particular, the quantitative measure of a subset S ⊆ U is its cardinality S, and the normalized cardinality of a subset S is the logical notion of probability $\mathrm{Pr}\left(S\right)=\frac{\leftS\right}{\leftU\right}$ developed by Boole and Laplace (where each point u ∈ U is considered equiprobable). GianCarlo Rota in his Fubini Lectures [6] and in his lectures at MIT on probability theory [7] argued that information or entropy should be to partitions what probability was to subsets, i.e.,$$\frac{\mathrm{Probability}}{\mathrm{Subsets}}\approx \frac{\mathrm{Information}}{\mathrm{Partitions}}.$$(1)
The quantitative notion attached to a subset is its number of elements S, so the question is; What is the quantitative notion associated with a partition? The duality between subsets and partitions can be analyzed back to its conceptual building blocks which are the dual notions of elements (its) of a subset and the distinctions (dits) of a partition [8]. Hence, the natural notion of information in a partition would, by this reasoning, be the normalized number of distinctions, and that is our definition of the logical entropy of a partition π;$$\begin{array}{cc}h\left(\pi \right)& =\frac{\left\mathrm{dit}\left(\pi \right)\right}{\leftU\times U\right}=\frac{\leftU\times U\right\left\mathrm{indit}\left(\pi \right)\right}{\leftU\times U\right}\\ & =1\frac{\left{\cup}_{i}\left({B}_{i}\times {B}_{i}\right)\right}{\leftU\times U\right}=1\sum _{i=1}^{m}\mathrm{}{\left(\frac{\left{B}_{i}\right}{\leftU\right}\right)}^{2}=1\sum _{i}\mathrm{}\mathrm{Pr}{\left({B}_{i}\right)}^{2},\end{array}$$(2)where $\mathrm{Pr}\left({B}_{i}\right)=\frac{\left{B}_{i}\right}{\leftU\right}$ is the probability of a random draw from U will give an element of B_{i} (with equiprobable points).
When there are point probabilities p = (p_{1}, …, p_{n}) for p_{j} as the probability of the outcome u_{j} ∈ U with ${\sum}_{j=1}^{n}\mathrm{}{p}_{j}=1$, then $\mathrm{Pr}\left({B}_{i}\right)=\sum \mathrm{}\left\{{p}_{j}:{u}_{j}\in {B}_{i}\right\}$ in the formula for logical entropy. This also gives the definition of logical entropy for any probability distribution p = (p_{1}, …, p_{n}),$$h\left(p\right)=1\sum _{j=1}^{n}\mathrm{}{p}_{j}^{2}.$$(3)
Logical entropy always has an ultrasimple and logical interpretation. Logical information theory is built on the idea that information is about distinctions, differences, distinguishability, and diversity. The notion of difference requires two things in order to have a difference. Hence, given a partition π = {B_{1}, …, B_{m}} or a probability distribution p = (p_{1}, …, p_{n}), the obvious measure for idea of information as distinctions or difference is the probability that in two independent samples or draws from U or from the distribution p, one will obtain elements in different blocks of π, i.e., a distinction of π, or different outcomes p_{j}, p_{k} for j ≠ k. And that is precisely the interpretation of logical entropy, the “probability of difference.” The probability of obtaining elements from the same block of π is ${\sum}_{i}\mathrm{}\mathrm{Pr}{\left({B}_{i}\right)}^{2}$ so the probability of getting elements from different blocks is h(π) = 1 − ∑_{i}Pr(B_{i})^{2}. And similarly for the logical entropy of a probability distribution $h\left(p\right)=1{\sum}_{j}\mathrm{}{p}_{j}^{2}$. Another way to express this result is the formula:$$1={1}^{2}=\left({p}_{1}+...+{p}_{n}\right)\left({p}_{1}+...+{p}_{n}\right)=\sum _{j=1}^{n}\mathrm{}{p}_{i}^{2}+\sum _{j\ne k}\mathrm{}{p}_{j}{p}_{k}$$(4)so that:$$h\left(p\right)=1\sum _{j=1}^{n}\mathrm{}{p}_{i}^{2}=\sum _{j=1}^{n}\mathrm{}{p}_{j}\left(1{p}_{j}\right)=\sum _{j\ne k}\mathrm{}{p}_{j}{p}_{k}=2\sum _{j<k}\mathrm{}{p}_{j}{p}_{k}$$(5)for j, k = 1, …, n. Thus, to be more specific, logical entropy is the probability of getting an ordered pair of distinct indices p_{j} and p_{k} for j ≠ k – which is twice the probability of getting an unordered pair of different indices such as p_{j} and p_{k} for j < k.
There is a simple way to picture the logical entropy. Given partition π = {{u_{1}, u_{2}},{u_{3}},{u_{4}}} with the corresponding point probabilities p = (p_{1}, p_{2}, p_{3}, p_{4}). Since the sum of the probabilities is 1, the logical entropy h(π) can be pictured in a 1 × 1 box, Figure 2, as the shaded area outside the boxed diagonal.
Figure 2 Logical entropy box diagram. 
Logical entropy is also a measure, indeed, a probability measure, in the usual sense of measure theory ([9], p. 30) (although terminology differs) which includes being nonnegative. A finitely additive set function (the values on disjoint sets add together) that can take negative values is usually called a “signed measure” ([9], p. 118) (or a “charge” [10]), and, as we will see, Shannon mutual information can be negative.
Partitions often arise as the inverseimages of random variables $X:U\to \mathbb{R}$. To use an example that we will have use of later, consider the throw of one fair die followed by the throw of a second fair die. All that is recorded is whether the face up on each die was even or odd, i.e., its parity (or mod(2) value). With even represented by 0 and odd by 1, then the space of possible outcomes for the throws of the dice is U = {(0, 0), (0, 1), (1, 0), (1, 1)}. Let X: U → 2 = {0,1} represent the outcome of the first die, the $X$die, and Y: U → 2 the outcome of the second die, the $Y$die. For instance, the point (1, 0) ∈ U represents that the first die came up with odd parity 1 and the second die with even parity 0.
A measure on a finite set is determined by just an assignment of a nonnegative number to each point in the set. The set on which logical entropy is a (probability) measure is U × U so it can again be represented in a box diagram with the equiprobable outcome pairs in U along each edge. Each square in Figure 3, representing a pair ((x, y), (x′, y′)) of pairs, has the probability weight of $\frac{1}{4}\times \frac{1}{4}=\frac{1}{16}$ assigned to it. The inverseimage partition of the random variable $X$ is$${X}^{1}=\left\{{X}^{1}\left(0\right),{X}^{1}\left(1\right)\right\}=\left\{\left\{\left(\mathrm{0,0}\right),\left(\mathrm{0,1}\right)\right\},\left\{\left(\mathrm{1,0}\right),\left(\mathrm{1,1}\right)\right\}\right\}.$$(6)
Figure 3 Box diagram for $h\left(X\right)=\sum \mathrm{}\left\{p\left(x,y\right)p\left({x}^{\mathrm{\prime}},{y}^{\mathrm{\prime}}\right):x\ne {x}^{\mathrm{\prime}}\right\}=\frac{8}{16}=\frac{1}{2}$ which can also be seen as a Venn diagram. 
The ditset dit(X^{−1}) is the set of pairs of pairs, i.e., points in U × U, that differ in the first coordinate:$$\mathrm{dit}\left({X}^{1}\right)=\left\{\left(\left(\mathrm{0,0}\right),\left(\mathrm{1,0}\right)\right),\left(\left(\mathrm{0,0}\right),\left(\mathrm{1,1}\right)\right),\left(\left(\mathrm{0,1}\right),\left(\mathrm{1,0}\right)\right),\left(\left(\mathrm{0,1}\right),\left(\mathrm{1,1}\right)\right),\dots \right\},$$(7)where the ellipsis … represents the pairs of pairs with the reversed order. The shaded squares in Figure 3 box diagram are the ones included in the logical entropy h(X) since they are the ones which differ in the first coordinate of the ordered pairs of outcomes, i.e.,, the pairs where the first die’s outcomes had different parities. Each outcome (x, y) has probability $p\left(x,y\right)=\frac{1}{4}$ and the only squares that count for the logical entropy of X are the ones for ((x, y), (x′,y′)) where x ≠ x′.
The logical entropy of the random variable Y: U → 2 is computed and represented in Figure 4 in the same manner except that the relevant pairs of pairs are those that differ in the second coordinate representing the parity of the second die.
Figure 4 Box/Venn diagram for $h\left(Y\right)=\sum \mathrm{}\left\{p\left(x,y\right)p\left({x}^{\mathrm{\prime}},{y}^{\mathrm{\prime}}\right):y\ne {y}^{\mathrm{\prime}}\right\}=\frac{1}{2}$. 
The logical entropy h(X) for X (the parity of the outcome for the first die) and h(Y) for Y (the parity of outcome for the second die) is the probability that on two independent throws of the relevant die, one will obtain outcomes of different parity.
History of the logical entropy formula
The concept of information as a measure of differences goes back to 1641, the year before Isaac Newton was born, when the polymath John Wilkins (1614–1672) anonymously published one of the earliest books on cryptography, Mercury or the Secret and Swift Messenger. This book not only pointed out the fundamental role of differences but noted that any (finite) set of different things could be encoded by words in a binary code.
For in the general we must note, That whatever is capable of a competent Difference, perceptible to any Sense, may be a sufficient Means whereby to express the Cogitations. It is more convenient, indeed, that these Differences should be of as great Variety as the Letters of the Alphabet; but it is sufficient if they be but twofold, because Two alone may, with somewhat more Labour and Time, be well enough contrived to express all the rest. ([11], Chap. XVII, p. 69)
Wilkins explains that a five letter binary code would be sufficient to code the letters of the alphabet since 2^{5} = 32:
Thus any two Letters or Numbers, suppose A.B. being transposed through five Places, will yield Thirty Two Differences, and so consequently will superabundantly serve for the Four and twenty Letters... .([11], Chap. XVII, p. 69)
In James Gleick’s 2011 book, The Information: A History, A Theory, A Flood, he noted that:
Any difference meant a binary choice. Any binary choice began the expressing of cogitations. Here, in this arcane and anonymous treatise of $1641$, the essential idea of information theory poked to the surface of human thought, saw its shadow, and disappeared again for [three] hundred years. ([12], p. 161)^{3}
The idea that information is about differences was also expressed by the polymath, Gregory Bateson, who noted that (the transmission of) “[i]nformation consists of differences that make a difference.” ([13], p. 99)
The formula that is a measure of differences, $h\left(p\right)=1{\sum}_{j}\mathrm{}{p}_{j}^{2}$ (or its complementary form $1h\left(p\right)={\sum}_{j}\mathrm{}{p}_{j}^{2}$), goes back at least to Corrado Gini (1884–1965) who published it as an index of mutability [14] in 1912 (not to be confused with Gini’s betterknown index of inequality). Some of the immediate following history of the formula was connected to cryptology as foreshadowed by Wilkins. William F. Friedman, an American cryptologist, devoted a 1922 book [15] to the “index of coincidence” (i.e., $\sum \mathrm{}{p}_{i}^{2}$). Solomon Kullback worked as an assistant to Friedman and wrote a book on cryptology which used the index [16].
During World War II, Alan M. Turing worked for a time in the Government Code and Cypher School at the Bletchley Park facility in England. Probably unaware of the earlier work, Turing used $\rho =\sum \mathrm{}{p}_{i}^{2}$ in his cryptoanalysis work and called it the repeat rate since it is the probability of a repeat in a pair of independent draws from a population with those probabilities. Polish cryptographers had independently used the repeat rate in their work on the Enigma [17]. After WWII, Edward H. Simpson, a British statistician, proposed ${\sum}_{B}\mathrm{}{p}_{B}^{2}$ as a measure of species concentration (the opposite of diversity) where π is the partition of animals or plants according to species and where each animal or plant is considered as equiprobable. And Simpson gave the interpretation of this homogeneity measure as “the probability that two individuals chosen at random and independently from the population will be found to belong to the same group.” ([18], p. 688) Hence $1{\sum}_{B}\mathrm{}{p}_{B}^{2}$ is the probability that a random ordered pair will belong to different species, i.e., will be distinguished by the species partition. The biodiversity literature [19] refers to the formula as “ Simpson’s index of diversity” or sometimes, the “GiniSimpson diversity index.” In the bioinformatics literature, Masatoshi Nei [20] introduced the logical entropy formula as a measure of gene diversity.
But the Simpson story has a twist. Simpson along with I.J. Good worked at Bletchley Park during WWII, and, according to Good, “E.H. Simpson and I both obtained the notion [the repeat rate] from Turing.” ([21], p. 395) When Simpson published the index in 1948, he (again, according to Good) did not acknowledge Turing “ fearing that to acknowledge him would be regarded as a breach of security.” ([22], p. 562) Perhaps logical entropy should be called “Turing entropy” to compete with the “big names” attached to Shannon entropy and von Neumann entropy. But given its frequent discovery and rediscovery, Good also negated that idea.
If p_{1}, p_{2}, …, p_{n} are the probabilities of mutually exclusive and exhaustive events, any statistician of this century who wanted a measure of homogeneity would have taken about two seconds to suggest $\sum \mathrm{}{p}_{i}^{2}$, which I shall call ρ. … Thus it is unjust to associate ρ with any one person. It would be better to use such names as “repeat rate” or “quadratic index of homogeneity” for ρ and perhaps “quadratic index of heterogeneity or diversity” for 1 − ρ. ([22], pp. 561–2]
Thus the name “logical entropy” seems appropriate, particularly in view of Stigler’s Law of Eponymy, i.e., “No scientific discovery is named after its original discoverer” [23], and since it is the quantitative measure associated with partitions in the logic of partitions just as finite probability is the quantitative measure associated with subsets in the usual Boolean logic of subsets.
In economics, Albert O. Hirschman ([24], p. 159) suggested in 1945 using $\sqrt{\sum \mathrm{}{p}_{i}^{2}}$ as an index of trade concentration (where p_{i} is the relative share of trade in a certain commodity or with a certain partner). A few years afterwards, Orris Herfindahl [25] independently suggested using $\sum \mathrm{}{p}_{i}^{2}$ as an index of industrial concentration (where p_{i} is the relative share of the ith firm in an industry). In the literature on industrial economics, the index $H=\sum \mathrm{}{p}_{i}^{2}$ is variously called the Hirschman–Herfindahl index, the HH index, or just the H index of concentration.
Another way to look at logical entropy is that two elements from U = {u_{1}, …, u_{n}} are either identical or distinct. Gini [14] introduced d_{ij} = 1 − δ_{ij} (the complement of the Kronecker delta function) as the “distance” between the ith and jth elements where d_{ij} = 1 for i ≠ j and d_{ii} = 0. Then Gini’s index of mutability, $h\left(p\right)={\sum}_{i}\mathrm{}{d}_{\mathrm{ij}}{p}_{i}{p}_{j}$, is the average (logical) distance between a pair of independently drawn elements. But one might generalize by allowing other nonnegative distances d_{ij} = d_{ji} for i ≠ j (but always d_{ii} = 0) so that $Q={\sum}_{i}\mathrm{}{d}_{\mathrm{ij}}{p}_{i}{p}_{j}$ would be the average distance between a pair of independently drawn elements from $U$. In 1982, C.R. (Calyampudi Radhakrishna) Rao introduced precisely this concept as quadratic entropy [26]. The logical entropy is also the quadratic special case of the Tsallis–Havrda–Charvat entropy [27, 28].
Časlav Brukner and Anton Zeilinger have also developed the logical entropy formula $1{\sum}_{i=1}^{n}\mathrm{}{p}_{i}^{2}$ in their treatment of quantum information [29, 30] and have also used the normalized form of the (Euclidean) distance squared of a probability distribution from the uniform distribution, which is closely related to the logical entropy since: ${\sum}_{i=1}^{n}\mathrm{}{\left({p}_{i}\frac{1}{n}\right)}^{2}=\left(1\frac{1}{n}\right)h\left(p\right)$.
Compound notions of logical entropy
We now consider a joint probability distribution {p(x, y)} on the finite sample space X × Y (where to avoid trivialities, assume X, Y ≥ 2), with the marginal distributions {p(x)} and {p(y)} where $p\left(x\right)={\sum}_{y}\mathrm{}p\left(x,y\right)$ and $p\left(y\right)={\sum}_{x}\mathrm{}p\left(x,y\right)$. The setting is a pair of random variables X and Y where we also consider X as the set of possible values x of the r.v. X and similarly for the r.v. Y. Then the joint probability distribution is p(x, y) = Pr(X = x, Y = y), and the marginals are p(x) = Pr(X = x), and p(y) = Pr(Y = y). For notational simplicity, the entropies can be considered as functions of the random variables or of their probability distributions, e.g., h({p(x)}) = h(X) and h({p(y)}) = h(Y). Logical entropy is characterized in terms of the probability that in two independent draws (x, y) and (x′, y′) from the sample space, one will get different outcomes. Hence in this case,$$h\left(X\right)=\sum _{x,y}\mathrm{}\left\{p\left(x,y\right)p\left(x\mathrm{\prime},y\mathrm{\prime}\right):x\ne x\mathrm{\prime}\right\},$$(8) $$h\left(Y\right)=\sum _{x,y}\mathrm{}\left\{p\left(x,y\right)p\left(x\mathrm{\prime},y\mathrm{\prime}\right):y\ne y\mathrm{\prime}\right\}.$$(9)
Then the joint entropy h(X, Y) is just the logical entropy h({p(x, y)}_{((x, y) ∈ X × Y)}) of the joint probability distribution which can also be characterized as:$$h\left(X,Y\right)=\sum _{x,y}\mathrm{}\left\{p\left(x,y\right)p\left(x\mathrm{\prime},y\mathrm{\prime}\right):x\ne x\mathrm{\prime}\mathrm{or}y\ne y\mathrm{\prime}\right\}.$$(10)
In the previous evenodd dice example of throwing an Xdie and a Ydie, each die had an outcome set of {0, 1} so X × Y = {(0, 0), (0, 1), (1, 0), (1, 1)} = U. The space on which the probabilities are assigned is U × U = (X × Y) × (X × Y) so the probability assigned to each point ((x, y),(x′, y′)) is p(x, y)p(x′, y′). The points in the space (X × Y)^{2} whose probabilities add up to give h(X, Y) are just the union of the points for h(X), i.e., where x ≠ x′, and for h(Y), i.e., where y ≠ y′. Since each point in (X × Y)^{2} is represented by a square with probability $\frac{1}{16}$, the shaded squares for h(X, Y) are just the union of the squares for h(X) and h(Y) as shown in Figure 5.
Figure 5 Union of Box/Venn diagrams for h(X) and h(Y) gives the box diagram for joint entropy $h\left(X,Y\right)=\frac{12}{16}=\frac{3}{4}$. 
The usual interpretation carries over to the compound notions such as the joint entropy; in two independent throws of the pair of dice, the probability that one will get a different parity in the Xdie or in the Ydie (or both) is $h\left(X,Y\right)=\frac{3}{4}$.
In a Venn diagram that is merely illustrative, the logical entropies would be represented as circles and the union of the circles would represent the joint entropy as in Figure 6.
Figure 6 Illustrative Venn diagram for the compound logical entropies. 
Figure 6 also illustrates the “formulas” for the other compound logical entropies. The conditional logical entropy $$h\left(XY\right)=\sum _{x,y}\mathrm{}\left\{p\left(x,y\right)p\left(x\mathrm{\prime},y\mathrm{\prime}\right):x\ne x\mathrm{\prime}\mathrm{and}y=y\mathrm{\prime}\right\}$$(11)represents the distinctions made by X (i.e., the cases where the two throws of Xdie had different parities) after the distinctions made by Y are taken away (so y = y′), and viceversa for h(YX). And the mutual logical information $$m\left(X,Y\right)=\sum _{x,y}\mathrm{}\left\{p\left(x,y\right)p\left({x}^{\mathrm{\prime}},{y}^{\mathrm{\prime}}\right):x\ne {x}^{\mathrm{\prime}}\mathrm{and}y\ne {y}^{\mathrm{\prime}}\right\}$$(12)is the probability that in the two throws of the pair of dice, the pair of pairs of outcomes will have different parity in the Xdie and in the Ydie – as one can easily see from the shaded squares for m(X, Y) in Figure 7.
Figure 7 Box diagrams representing the two conditional logical entropies and the mutual logical information all with the value $\frac{1}{4}$. 
These specific box/Venn diagrams illustrate general relationships such as the two conditional entropies and mutual information all being disjoint and adding to the joint entropy. In general (not just for this example), the compound logical entropies stand in the relationships shown by the areas in the illustrative Figure 6:$$h\left(X,Y\right)=h\left(XY\right)+h\left(YX\right)+m\left(X,Y\right),$$(13) $$h\left(X\right)=h\left(XY\right)+m\left(X,Y\right),$$(14) $$h\left(Y\right)=h\left(YX\right)+m\left(X,Y\right)$$(15) $$h\left(X,Y\right)=h\left(X\right)+h\left(Y\right)m\left(X,Y\right).$$(16)
Shannon entropy
The basic definitions
Both the logical and the Shannon entropies are defined for probability distributions regardless of whether the distribution is derived from the blocks of a partition Pr(B_{i}) or the values of a random variable Pr(X = x). Hence we can start the treatment of Shannon entropy [31, 32] defined on a probability distribution p = (p_{1}, …, p_{n}):$$H\left(p\right)=\sum _{i=1}^{n}\mathrm{}{p}_{i}{\mathrm{log}}_{2}\left({p}_{i}\right)=\sum _{i=1}^{n}\mathrm{}{p}_{i}{\mathrm{log}}_{2}\left(\frac{1}{{p}_{i}}\right)$$(17)where for ${p}_{i}=0$, ${p}_{i}{\mathrm{log}}_{2}\left(\frac{1}{{p}_{i}}\right)$ is defined to be $0$. Henceforth, the logs are to base 2 unless otherwise specified. For a random variable X with p(x) = Pr(X = x), then:$$H\left(X\right)=\sum _{x\in X}\mathrm{}p\left(x\right)\mathrm{log}\left(\frac{1}{p\left(x\right)}\right).$$(18)
Given a joint probability distribution p(x, y) on X × Y, the joint Shannon entropy is:$$H\left(X,Y\right)=\sum _{\left(x,y\right)\in X\times Y}\mathrm{}p\left(x,y\right)\mathrm{log}\left(\frac{1}{p\left(x,y\right)}\right).$$(19)
The conditional Shannon entropy H(XY) is defined as the average of the Shannon entropies for conditional probability distributions. Given the joint distribution {p(x, y)} on X × Y, then for a specific y_{0} ∈ Y, then the conditional probability distribution is $p\left(x{y}_{0}\right)=\frac{p\left(x,{y}_{0}\right)}{p\left({y}_{0}\right)}$ which has the Shannon entropy: $H\left(X{y}_{0}\right)={\sum}_{x}\mathrm{}p\left(x{y}_{0}\right)\mathrm{log}\left(\frac{1}{p\left(x{y}_{0}\right)}\right)$. Then the Shannon conditional entropy is defined as the average of these entropies:$$H\left(XY\right)=\sum _{y\in Y}\mathrm{}p\left(y\right)\sum _{x}\mathrm{}\frac{p\left(x,y\right)}{p\left(y\right)}\mathrm{log}\left(\frac{p\left(y\right)}{p\left(x,y\right)}\right)=\sum _{x,y}\mathrm{}p\left(x,y\right)\mathrm{log}\left(\frac{p\left(y\right)}{p\left(x,y\right)}\right).$$(20)
Since the Venn diagram for any measure like logical entropy satisfies a relationship like h(X) + h(Y) − h(X,Y) = m(X,Y), Shannon defined the mutual Shannon information as:$$I\left(X;Y\right)=\sum _{x\in X,y\in Y}\mathrm{}p\left(x,y\right)\left[\mathrm{log}\left(\frac{1}{p\left(x\right)}\right)+\mathrm{log}\left(\frac{1}{p\left(y\right)}\right)\mathrm{log}\left(\frac{1}{p\left(x,y\right)}\right)\right].$$(21)
Then it is perhaps no surprise that these compound Shannon entropies satisfy the Venn diagram relationship as if the Shannon entropy was defined as a measure on a set. Hence one finds in the textbooks on Shannon’s theory of communications, a Venn diagram like Figure 8 to serve at least as a mnemonic about the interrelationships.
Figure 8 Venn diagram mnemonic for the compound Shannon entropies. 
Shannon’s communications theory and “information theory”
This paper presents a different version of “information theory” than the received version. There is no difference in the part of information theory where Shannon entropy actually does its work, namely the theory of coding and communication. Shannon himself did not name his original paper or book as “information theory” but rather as the “mathematical theory of communication” ([31, 32]). Thus the notion that the theory of communications (including coding theory) was an “information theory” was a creation of the science press, science popularizers, and textbook writers. Shannon even reacted against the “bandwagon” that inflated “information theory” far beyond the actual technical results of communications theory.
Information theory has, in the last few years, become something of a scientific bandwagon. Starting as a technical tool for the communication engineer, it has received an extraordinary amount of publicity in the popular as well as the scientific press. In part, this has been due to connections with such fashionable fields as computing machines, cybernetics, and automation; and in part, to the novelty of its subject matter. As a consequence, it has perhaps been ballooned to an importance beyond its actual accomplishments. Our fellow scientists in many different fields, attracted by the fanfare and by the new avenues opened to scientific analysis, are using these ideas in their own problems. Applications are being made to biology, psychology, linguistics, fundamental physics, economics, the theory of organization, and many others. In short, information theory is currently partaking of a somewhat heady draught of general popularity. ([33], p. 462)
Shannon repeated the points in a 1961 interview with Myron Tribus.
In 1961 Professor Shannon, in a private conversation, made it quite clear to me that he considered applications of his work to problems outside of communication theory to be suspect and he did not attach fundamental significance to them. ([34], p. 1)
Moreover, while Shannon noted that while his entropy formula indicates the “amount of information” (i.e., the average numbers of binary distinctions needed to distinguish all the “messages”), “no concept of information itself was defined” ([35], p. 458) in communications theory. Perhaps the most common idea about Shannon entropy is that it a measure of “amount of uncertainty.” But there are many other interpretations.
Other terms used to convey an intuitive feeling for entropy include randomness, disorganization, “ mixedupness” (Gibbs), missing information, incomplete knowledge, complexity, chaos, ignorance, and uncertainty. ([36], p. 9)
There is also the view that entropy and information were in fact opposites or complements; “Gain in entropy always means loss of information, and nothing more.” ([37], p. 573) That view was later popularized by Leon Brillouin who claimed that:
information must be considered as a negative term in the entropy of a system; in short, information is negentropy. ... Entropy measures the lack of information. ([38], p. xii)
However, there is no need for this conceptual chaos; the (simple) Shannon entropy is another way to quantify the notion of informationasdistinctions. That is, Shannon entropy is the minimum average number of binary partitions (bits) that need to be joined in order to make the distinctions that distinguish all the “messages.” And simple logical entropy is the direct measure of distinctions.
Is Shannon entropy a “measure”?
Shannon entropy and a host of other entropy “formulas” (sans interpretation) are routinely called “measures” of information [39]. A prominent information theorist, Lorne Campbell, has noted in 1965 the analogies between Shannon entropy and measures (in the usual nonnegative sense).
Certain analogies between entropy and measure have been noted by various authors. These analogies provide a convenient mnemonic for the various relations between entropy, conditional entropy, joint entropy, and mutual information. It is interesting to speculate whether these analogies have a deeper foundation. It would seem to be quite significant if entropy did admit an interpretation as the measure of some set. ([40], p. 112)
We only need be concerned with the simplest case of a measure [9] on a finite set where for any finite set U, a measure μ is a function from the powerset of U (the subsets of U) to the reals μ: $\wp \left(\mathrm{U}\right)\to \mathbb{R}$ such that:

μ(∅) = 0,

for any E ⊆ U, μ(E) ≥ 0, and

for any disjoint subsets E_{1} and E_{2}, μ(E_{1} ∪ E_{2}) = μ(E_{1}) + μ(E_{2}).
The whole measure is determined by the values on singletons and simply summed over larger finite subsets.
It would be desirable for Shannon entropy to be a measure in this technical sense so:
that H(α) and H(β) are measures of sets, that H(α, β) is the measure of their union, that I(α, β) is the measure of their intersection, and that H(αβ) is the measure of their difference. The possibility that I(α, β) is the entropy of the “ intersection” of two partitions is particularly interesting. This “ intersection,” if it existed, would presumably contain the information common to the partitions α and β. ([40], p. 113)
Logical entropy satisfies all those desiderata.
There are some differences in the use of the word “measure.” It would seem that the usual notion of a measure is always nonnegative [9,41] and then there is an extended notion of a “signed measure” that can take on negative values. Other authors define a “measure” to allow negative values and then define a “positive measure” to have only nonnegative values. The most general usage, adopted here, is that a measure is nonnegative and the generalized notion to allow negative value is a “signed measure.” This is important since logical entropy is defined as a measure, indeed a probability measure, while Shannon entropy is not defined as a measure on a set. Given any Venn diagram of Shannon entropies, then, as with any Venn diagram, an ex post measue or signed measure can always be trivially constructed. Both measures and signed measures can be represented as additive set functions ([42], Part 8, Chap. 1, Prob. 26) [43] ([44], Chap. 2) [45] that satisfy the inclusionexclusion principle (or overcountundercount relationships) that can be associated with Venn diagrams (if we allow negative areas).
For logical entropy, consider a set U = {u_{1}, …, u_{n}} with point probabilities ${\left\{{p}_{i}\right\}}_{i=1}^{n}$ and a random variable $X:U\to \mathbb{R}$ which induces a partition X ^{−1} on U and similarly for $Y:U\to \mathbb{R}$. The set on which the logical entropy measure is defined is U × U and the value assigned to a point (u_{j}, u_{k}) ∈ U × U is μ({(u_{j}, u_{k})}) = p_{j} p_{k}. The logical entropy associated with the random variable is:$$h\left(X\right)=\mu \left(\mathrm{dit}\left({X}^{1}\right)\right)=\sum _{{u}_{j},{u}_{k}\in U}\mathrm{}\left\{{p}_{j}{p}_{k}:\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left({X}^{1}\right)\right\},$$(22)namely the sum of all the products p_{j} p_{k} for which X(u_{j}) ≠ X(u_{k}). Thus it is interpreted as the probability that on two independent trials, the random variable X will give different values. That illustrates how logical entropy measures differences. If the values of X have no differences, i.e., if it is constant, then X ^{−1} = 0_{U} and h(0_{U}) = 0. The more refined the inverseimage partition ${X}^{1}$, the higher the logical entropy. Then all the usual Venn diagram relationships hold such as$$\begin{array}{cc}h\left(X\right)& =\sum \left\{{p}_{j}{p}_{k}:\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left({X}^{1}\right)\right\}\\ & =\sum _{{u}_{j},{u}_{k}\in U}\mathrm{}\left\{{p}_{j}{p}_{k}:\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left({X}^{1}\right)\mathrm{and}\left({u}_{j},{u}_{k}\right)\notin \mathrm{dit}\left({Y}^{1}\right)\right\}\\ \begin{array}{c}\\ \end{array}& \begin{array}{c}+\sum _{{u}_{j},{u}_{k}\in U}\mathrm{}\left\{{p}_{j}{p}_{k}:\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left({X}^{1}\right)\mathrm{and}\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left({Y}^{1}\right)\right\}\\ =h\left(XY\right)+m\left(X,Y\right)\end{array}\end{array}$$(23)and all of Campbell’s desiderata are satisfied. For instance, the logical conditional entropy is the measure on the difference of the sets for $h\left(X\right)$ and $h\left(Y\right)$:$$h\left(XY\right)=\sum _{{u}_{j},{u}_{k}\in U}\mathrm{}\left\{{p}_{j}{p}_{k}:\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left({X}^{1}\right)\mathrm{dit}\left({Y}^{1}\right)\right\}.$$(24)
To see why Shannon entropy is not in general a (nonnegative) measure, consider the previous evenodd dice example of two random variables $X,Y:U\to 2$ for U = {(0, 0), (0, 1), (1, 0), (1, 1)} where X was the parity of the first die thrown and Y the parity of a second die thrown. Each point (x, y) ∈ U = X × Y has probability $p\left(x,y\right)=\frac{1}{4}$ and marginal distributions have $p\left(x\right)=\frac{1}{2}=p\left(y\right)$. A twovariable joint distribution p(x, y) has the independence property if p(x, y) = p(x)p(y) for all (x, y) ∈ U. Hence the two r.v.s X and Y are independent. One of the original “selling points” of Shannon entropy was that for independent r.v.s, H(X, Y) = H(X) + H(Y), i.e., independent r.v.s have “no information in common” so that I(X, Y) = 0. It might be noted that having an overlap of H(X) and H(Y) of 0 is not the same as the Venn diagrams for H(X) and H(Y) not overlapping.
Consider a third random variable Z: U → 2 whose value is the parity of the sum X + Y so Z((0, 0)) = Z((1, 1)) = 0 and Z((0, 1)) = Z((1, 0)) = 1. Then $\mathrm{Pr}\left(Z=0\right)=\mathrm{Pr}\left(Z=1\right)=\frac{1}{2}=p\left(z\right)$ and p(x, z) = p(x)p(z) for all x, z ∈ {0, 1} = 2 so X and Z are also independent and similarly for Y and Z. Thus the three variables are pairwise independent but they are not mutually independent for the simple reason that if you know the values of any two of them, you know the value of the third variable. Hence in the Venn diagram for the Shannon entropies of X, Y, and Z, each pair of areas must have zero overlap but the three areas must intersect in nonzero overlap. The only way this can happen is for the threeway overlap to be negative and the twoway overlaps be the sum of that negative triple overlap and the equal positive remaining twoway overlap so all the twoway overlaps are zero as shown in Figure 9.
Figure 9 Venn diagram for threeway negative Shannon mutual information I(X; Y; Z). 
Thus the intuitively satisfactory idea of the twoway overlaps for independent variables being zero (“no information in common”) leads to the interpretive “problem” of threeway mutual information being possibly negative. Shannon dealt with this problem in the simplest possible way; he never defined mutual information for more than two variables. Or, as perhaps the most definitive monograph on information theory casually put it; “There isn’t really a notion of mutual information common to three random variables.” ([46], p. 49) But the threeway definition is automatically given by the usual inclusionexclusion formulas that hold for measures and signed measures. For two variables, H(X, Y) = H(X) + H(Y) − I(X; Y), and for three variables, it is:$$H\left(X,Y,Z\right)=H\left(X\right)+H\left(Y\right)+H\left(Z\right)I\left(X;Y\right)I\left(X;Z\right)I\left(Y;Z\right)+I\left(X;Y;Z\right)$$(25)where Shannon defined all the terms in the equation except the last one I(X; Y; Z) which is thus determined. The Shannon entropy for each variable X, Y, and Z, is:$$H\left(X\right)=H\left(Y\right)=H\left(Z\right)=p\left(0\right)\mathrm{log}\left(1/p\left(0\right)\right)+p\left(1\right)\mathrm{log}\left(1/p\left(1\right)\right)=\frac{1}{2}\mathrm{log}\left(2\right)+\frac{1}{2}\mathrm{log}\left(2\right)=1.$$(26)And all the twoway overlaps have the values:$$I\left(X;Y\right)=I\left(X;Z\right)=I\left(Y;Z\right)=0.$$(27)
The threeway joint entropy is the Shannon entropy of the probability distribution p(x, y, z) which is easily computed in Table 1. Since the values of any two variables determine the third, the probabilities are either $\frac{1}{4}$ if the third value agrees with the values of the other two or $0$ otherwise.
Probability distribution p(x, y, z) and computation of H(X, Y, Z).
The sum of the last column gives the threeway joint Shannon entropy of H(X, Y, Z) = 2. Hence the inclusionexclusion formula gives:$$\begin{array}{cc}I\left(X;Y;Z\right)& =H\left(X,Y,Z\right)H\left(X\right)H\left(Y\right)H\left(Z\right)+I\left(X;Y\right)+I\left(X;Z\right)+I\left(Y;Z\right)\\ & =2111+0+0+0=1.\end{array}$$(28)
Thinking in term of underlying points, the threeway overlap has points that are common to H(X), H(Y), and H(Z), so some of the points must have negative values. Thus all three, H(X), H(Y), and H(Z), cannot be the value of a (nonnegative) measure on some set. Moreover, the intuitive appeal of I(X; Y) = 0 as meaning “no information in common” for independent variables is lessened when it turns out to mean not disjoint or nonoverlapping areas but that the positive information in each of the three twoway overlap of these independent random variables must be balanced by “negative information” in the threeway overlap, which, as Csiszar and Kröner remark, has “no natural intuitive meaning.” ([47], p. 53)
Finally, we might consider how this example is treated by logical entropy. The r.v. Z has a logical entropy h(Z) as the sum of the shaded $\frac{1}{16}$ squares in Figure 10.
Figure 10 Venn diagram for logical entropy $h\left(Z\right)=\frac{8}{16}=\frac{1}{2}$. 
The twoway mutual logical information, say for m(X, Y), is given by the shaded squares that are in common, i.e., the twoway overlap of h(X) and h(Y), and the threeway mutual logical entropy m(X, Y, Z) is given by the shaded squares in common to all three. But since logical entropy is a measure (in the usual nonnegative sense), we can compute the threeway mutual information by using the undercountovercount formula:$$m\left(X,Y,Z\right)=h\left(X,Y,Z\right)h\left(X\right)h\left(Y\right)h\left(Z\right)+m\left(X,Y\right)+m\left(X,Z\right)+m\left(Y,Z\right).$$(29)
The threeway joint logical entropy includes all squares except the diagonal so its value is $\frac{12}{16}=\frac{3}{4}$. The single logical entropies are all $\frac{1}{2}$ and the twoway mutual informations are all $\frac{4}{16}=\frac{1}{4}$ so the formula yields:$$m\left(X,Y,Z\right)=\frac{3}{4}\frac{1}{2}\frac{1}{2}\frac{1}{2}+\frac{1}{4}+\frac{1}{4}+\frac{1}{4}=\frac{6}{4}\frac{3}{2}=0.$$(30)
The twoway logical mutual informations for independent variables are not zero since logical entropy is a probability distribution – so for independent variables, it is multiplicative, e.g., m(X, Y) = h(X)h(Y). The calculation of the threeway mutual logical information can be intuitively checked by considering the three areas for h(X), h(Y), and h(Z) in Figure 11.
Figure 11 The box diagrams for h(X), h(Y), and h(Z). 
It can then be checked by inspection that there is no shaded square common to all three diagrams so the threeway overlap is zero.
It is interesting to note that all twoway mutual logical informations such as m(X, Y), m(X, Z), and m(Y, Z), are in general never zero when all point probabilities are positive. This is the result of the CommonDits Theorem that any two nonempty ditsets have a nonempty intersection.^{4}
Theorem 3.1.
Common Dits. Any two nonempty ditsets intersect, i.e., have some dits in common.
Proof. A ditset dit(π) = ∅ iff π = 0_{U}, the indiscrete partition or blob. Consider any two nonempty ditsets dit(π) and dit(σ). Since π is not the blob 0_{U}, consider two elements u and u′ distinguished by π but identified by σ; otherwise (u, u′) ∈ dit(π)∩dit(σ) and we are finished. Since $\sigma $ is also not the blob, there must be a third element u″ not in the same block of σ as u and u′, as shown in Figure 12.
Figure 12 Solid circles = blocks of π, dashed circles = blocks of σ, and (u′, u′′) as a common dit to π and σ. 
But since $u$ and u′ are in different blocks of π, the third element u″ must be distinguished from one or the other or both in π, e.g., distinguished from u′ in both partitions as in Figure 12. Hence (u, u″) or (u′, u″) must be distinguished by both partitions and thus must be in dit(π) ∩ dit(σ). □
The three nontrivial partitions on a threeelement set show that there are no common dits to all three of them (as in the dice example), only to each pair of partitions.
The connection between the logical and Shannon entropies
One question lingers. If, as we have seen, Shannon entropy is not defined as a measure in the usual nonnegative sense, then what accounts for the compound Shannon entropies satisfying the Venn diagram relationships? As one author surmised: “Shannon carefully contrived for this ‘accident’ to occur” ([49], p. 153), and Campbell asked “whether these analogies have a deeper foundation.” ([40], p. 112) Since Shannon arranged or “contrived” for the compound entropies to satisfy the Venn diagram relationships for two random variables, they can be extended to any number of variables using the inclusionexclusion formulas [50, 51]. As we have seen, mutual information can be negative for three or more variables.
But there is an interesting twist to the story. Information theorists do not define Shannon entropy as a signed measure on a given set. But such a set can be trivially constructed ex post in the manner shown by Hu [43] and Yeung [52] but the underlying mathematical fact about additive set functions goes back at least to the 1925 first edition of PolyaSzego’s book [42]. In the Venn diagram showing all possible overlaps of three “circles” for three random variables, there are 2^{3} = 8 atomic areas (2^{ n } in the general finite case of n random variables), each of which can trivially be taken as a single element in a set. Then numbers (positive or negative) can be assigned arbitrarily to those points and then summed to get the values attached to the circles. Then all the Venn diagram relationships are automatically satisfied. Since these sets are constructed in terms of the independently defined Shannon entropies, the set and value assignments may change when more variables come into play. In the even–odd dice example, as long as only X and Y are considered, then all the compound Shannon entropies are nonnegative and a set with a (nonnegative) measure on it can be constructed to yield the values of H(X) and H(Y). But when the variable Z is brought into consideration, then the underlying set must be reconstructed to have a negativevalued point representing I(X; Y; Z) so that the signed measure on that set will give the values of all the compound Shannon entropies.^{5} This serves to underline the fact that Shannon entropy is not defined as a measure on a set in the first place.
In contrast, the logical entropy defines the set beforehand, namely U × U, and the values assigned to the points is determined beforehand, namely p_{i} p_{j} is assigned to (u_{i}, u_{j}), and then the simple and compound logical entropies are defined by collections of those points and their values. Nothing changes when new random variables are considered; it just means considering a different set of points. Thus it is not a simple matter of saying logical entropy is a (nonnegative) measure and Shannon entropy is a signed measure. Logical entropy is defined as a probability measure on a set given beforehand, and the Shannon entropies are only a signed measure on a set ex post constructed for the purpose after all the numerical values are independently given in the Venn diagram formulas for a given set of random variables.
There is, however, a deeper connection between the two entropies since there is a transform from all the compound logical entropy formulas to the corresponding compound Shannon entropy formulas that preserves the Venn diagram relationships. To understand this transform, consider the canonical examples of 2^{ m } equiprobable points in U. For any m such as m = 3, the binary partitions required to distinguish the 2^{3} = 8 “messages” (or leaves) can be pictured in the (upside–down) binary tree of Figure 13.
Figure 13 Three equiprobable binary partitions distinguish the 2^{3} = 8 leaves on the tree. 
It was previously asserted that logical entropy and Shannon entropy are two different ways to quantify the definition of informationasdistinctions. Logical entropy is the direct (normalized) count of the distinctions or dits in a partition and Shannon entropy is the minimum average number of binary partitions that need to be joined together to make the same distinctions.
This connection can be easily demonstrated using Figure 13 example. Let π = {{u_{1}}, …, {u_{8}}} be the discrete partition on the equiprobable outcomes (messages or leaves in the tree) in U = {u_{1}, …, u_{8}}. The number of distinctions dit(π) is U × U − Δ = 64 – 8 = 56 (where Δ is the diagonal of selfpairs {(u_{1}, u_{1}), …, (u_{8}, u_{8})}), and the logical entropy h(π) of π is $\frac{\left\mathrm{dit}\left(\pi \right)\right}{\leftU\times U\right}=\frac{56}{64}=\frac{7}{8}=1\frac{1}{8}$, the probability that in two independent draws from U, different elements of U are obtained, i.e., the probability $1\frac{1}{8}$ that the second draw isn’t the same as the first draw. Since the outcomes u_{i} are the leaves of the tree in Figure 13, one could image a marble rolling down from the root (like on a Galton board) and then going one way or the other with equal probability at each branching. The logical entropy is the probability that two such marbles will end up in different leaves.
We need to show that the Shannon entropy of π is the minimum number of binary partitions (corresponding to yesorno questions in the game of 20questions) necessary to make all the same distinctions of π. Recall that given π and σ partitions on U, the join π ∨ σ is the partition on U whose blocks are all the nonempty intersections B ∩ C for B ∈ π and C ∈ σ. Since dit(π ∨ σ) = dit(π) ∪ dit(σ), the join of partitions accumulates the distinctions made by each of the partitions. GianCarlo Rota formulated the problem as the Devil selecting a particular ${u}_{i}$ or message and not revealing it to the questioner but having to answer any yesorno question truthfully. Or less colorfully, “To determine an object, we need to ensure that the responses to the sequence of questions uniquely identifies the object from the set of possible objects” ([46], p. 120). But the problems of 1) making all the distinctions, and 2) uniquely determining any given outcome or message, are equivalent. If the binary partitions or binary questioning does not distinguish u_{i} from u_{j}, then it would not determine the hidden message if it happened to be u_{i} or u_{j} – and if the questioning cannot determine the message if it was u_{i} or u_{j}, then the corresponding binary partitions do not distinguish u_{i} and u_{j}. This means that the usual Shannon interpretation about the minimum average number of binary questions necessary to uniquely determine the message is also the minimum average number of binary partitions necessary to make all the distinctions between messages.
This can be illustrated with the example of Figure 13. The first binary partition π ^{1} corresponds to the first branching point in Figure 13 and the first binary digit in the codes (reading from left to right in the code words):$${\pi}^{1}=\left\{\left\{{u}_{1},...,{u}_{4}\right\},\left\{{u}_{5},...,{u}_{8}\right\}\right\},$$(31)where u_{1}, …, u_{4} have 0 as the first digit and u_{5}, …, u_{8} have 1 as the first digit. The binary partition π ^{1} corresponds to the yesorno question, “Is the first letter in the code for u_{i} a 0?”. The partition has 16 distinctions from {u_{1}, …, u_{4}} × {u_{5}, …, u_{8}} and another 16 from the reverse ordering for a total of 32 distinctions.
The second binary partition π ^{2} in effect asks about the second digit in the codes for the u_{i}, and it is:$${\pi}^{2}=\left\{\left\{{u}_{1},{u}_{2},{u}_{5},{u}_{6}\right\},\left\{{u}_{3},{u}_{4},{u}_{7},{u}_{8}\right\}\right\}$$(32)so the join is:$${\pi}^{1}\vee {\pi}^{2}=\left\{\left\{{u}_{1},{u}_{2}\right\},\left\{{u}_{3},{u}_{4}\right\},\left\{{u}_{5},{u}_{6}\right\},\left\{{u}_{7},{u}_{8}\right\}\right\}.$$(33)
Comparing π ^{1} ∨ π ^{2} to π ^{1}, we see the splitting {u_{1}, …, u_{4}} into {u_{1}, u_{2}} and {u_{3}, u_{4}} so that creates the new distinctions {u_{1}, u_{2}} × {u_{3}, u_{4}} × 2 = 8 and similarly for {u_{5}, u_{6}} and {u_{7}, u_{8}}, so π ^{2} makes 8 + 8 = 16 new distinctions for 32 + 16 = 48 distinctions. Equivalently, one could compute the distinctions of π ^{1} ∨ π ^{2} from scratch to get the same total.
The third binary partition π ^{3} in effect asks about the third digit in the codes for the u_{i}, and it is:$${\pi}^{3}=\left\{\left\{{u}_{1},{u}_{3},{u}_{5},{u}_{7}\right\},\left\{{u}_{2},{u}_{4},{u}_{6},{u}_{8}\right\}\right\}$$(34)and the final join is:$${\pi}^{1}\vee {\pi}^{2}\vee {\pi}^{3}=\left\{\left\{{u}_{1}\right\},...,\left\{{u}_{8}\right\}\right\}=\pi .$$(35)
Since π ^{3} distinguishes each of the four pairs in π ^{1} ∨ π ^{2}, it introduces {u_{1}} × {u_{2}} × 4 × 2 = 8 new distinctions for a total of 48 + 8 = 56 distinctions. Thus the three partitions together make the same 56 distinctions, but Shannon entropy counts the number of those binary partitions (bits) necessary to make the distinctions instead of counting the distinctions or dits themselves. This illustrates that Shannon entropy $H\left(p\right)={\sum}_{i=1}^{8}\mathrm{}\frac{1}{8}{\mathrm{log}}_{2}\left(\frac{1}{1/8}\right)={\mathrm{log}}_{2}\left(\frac{1}{1/8}\right)=3$ is also quantifying distinctions in the sense of counting the minimum number of binary partitions, namely three in this case, needed to make the same distinctions. Hence Shannon entropy is a different quantification of the same notion of information, informationasdistinctions (of a partition). It is not just a quantification of the “amount of uncertainty” – whatever that may be.
Moreover, the example shows how to transform the ditquantification of informationasdistinctions (logical entropy) into the bitquantification of informationasdistinctions (Shannon entropy). In this canonical case (all ${p}_{i}=\frac{1}{{2}^{n}}$), $h\left(p\right)=1{p}_{i}$ and $H\left(p\right)={\mathrm{log}}_{2}\left(\frac{1}{{p}_{i}}\right)$ so the ditcount and bitcount are precisely related: $h\left(p\right)=1\frac{1}{{2}^{H\left(p\right)}}$ and $H\left(p\right)={\mathrm{log}}_{2}\left(\frac{1}{1h\left(p\right)}\right)$. In general, the two entropies are the probability averages ${\sum}_{i}\mathrm{}{p}_{i}\left(\cdots \right)$ of those canonical values $1{p}_{i}$ and ${\mathrm{log}}_{2}\left(\frac{1}{{p}_{i}}\right)$. Hence the transform$$1{p}_{i}\u21dd{\mathrm{log}}_{2}\left(\frac{1}{{p}_{i}}\right)$$(36)transforms logical entropy into Shannon entropy in general:$$\begin{array}{c}h\left(p\right)=\sum _{i}\mathrm{}{p}_{i}\left(1{p}_{i}\right)\u21ddH\left(p\right)=\sum _{i}\mathrm{}{p}_{i}{\mathrm{log}}_{2}\left(\frac{1}{{p}_{i}}\right).\\ \mathrm{The}\mathrm{Dit}\mathrm{Bit}\mathrm{Transform}.\end{array}$$(37)
Since the ditbit transform works for the simple entropies, let us consider the conditional entropies where Shannon constructed H(XY) as the average of the Shannon entropies for the conditional probability distributions for y ∈ Y,$$H\left(XY\right)=\sum _{y}\mathrm{}p\left(y\right)\sum _{x}\mathrm{}\frac{p\left(x,y\right)}{p\left(y\right)}\mathrm{log}\left(\frac{p\left(y\right)}{p\left(x,y\right)}\right)=\sum _{x,y}\mathrm{}p\left(x,y\right)\mathrm{log}\left(\frac{p\left(y\right)}{p\left(x,y\right)}\right).$$(38)
First, we express the logical conditional entropy as a probability average:$$\begin{array}{c}h\left(XY\right)=h\left(X,Y\right)h\left(Y\right)=\sum _{x,y}\mathrm{}p\left(x,y\right)\left(1p\left(x,y\right)\right)\sum _{y}\mathrm{}p\left(y\right)\left(1p\left(y\right)\right)\\ =\sum _{x,y}\mathrm{}p\left(x,y\right)\left[\left(1p\left(x,y\right)\right)\left(1p\left(y\right)\right)\right],\end{array}$$(39)and then we make the substitutions of the ditbit transform: $1p\left(x,y\right)\u21dd\mathrm{log}\left(1/p\left(x,y\right)\right)$ and $1p\left(y\right)\u21dd\mathrm{log}\left(1/p\left(y\right)\right)$ to get:$$\sum _{x,y}\mathrm{}p\left(x,y\right)\left[\mathrm{log}\left(1/p\left(x,y\right)\right)\mathrm{log}\left(1/p\left(y\right)\right)\right]=\sum _{x,y}\mathrm{}p\left(x,y\right)\mathrm{log}\left(\frac{p\left(y\right)}{p\left(x,y\right)}\right)=H\left(XY\right).$$(40)
The other ditbit transforms go in the same manner at indicated in Table 2.
The ditbit transform from the compound logical entropies to the corresponding Shannon entropies.
As one can see, the preservation of the Venn diagram relationships is built into the ditbit transformation. For instance,$$m\left(X,Y\right)=\sum _{x,y}\mathrm{}p\left(x,y\right)\left[\left[1p\left(x\right)\right]+\left[1p\left(y\right)\right]\left[1p\left(x,y\right)\right]\right]=h\left(X\right)+h\left(Y\right)h\left(X,Y\right)$$(41)transforms to:$$\begin{array}{c}I\left(X;Y\right)=\sum _{x,y}\mathrm{}p\left(x,y\right)\left[\mathrm{log}\left(\frac{1}{p\left(x\right)}\right)+\mathrm{log}\left(\frac{1}{p\left(y\right)}\right)\mathrm{log}\left(\frac{1}{p\left(x,y\right)}\right)\right]\\ =H\left(X\right)+H\left(Y\right)H\left(X,Y\right)\end{array}$$(42)so that Venn diagram relationships are preserved. The ditbit transform thus provides the “deeper foundation” ([40], p. 112) sought more than a halfcentury ago by Lorne Campbell for the Shannon entropies satisfying the Venn diagram relationships in spite of not being defined as a measure on a set.
A basic inequality in Shannon’s communications theory is that for positive x, 1 – x ≤ ln(1/x). Substituting logs to base 2 for natural logs, it is still true that 1 − p_{i} ≤ log_{2} (1/p_{i}) for 0 < p_{i} ≤ 1 as shown in Figure 14.
Figure 14 Ditbit transform and inequality: 1 – p ≤ log_{2} (1/p) for 0 < p ≤ 1. 
The ditbit transform is just replacing the lefthand side with the righthand side of the inequality so the transform is highly nonlinear – unlike converting units of measurement like feet and meters. Hence Shannon is correct when he terms his entropy as the “amount of information” ([35], p. 458) denominated in bits. The logical entropy h(π) of a partition is a direct measure of the distinctions made by a partition and the Shannon entropy H(π) of a partition is statistically the minimum average number of binary partitions that must be joined to make all the distinctions of the partition.
Boltzmann and Shannon entropies: A conceptual connection?
When Shannon showed his formula to John von Neumann, then von Neumann suggested calling it “entropy” for two reasons: there is a similar formula in Boltzmann’s statistical mechanics and you can win more arguments using the name “entropy” since no one knows what it really is. (34], pp. 2–3) How does the Shannon formula ${\sum}_{i=1}^{m}\mathrm{}{p}_{i}\mathrm{ln}\left(1/{p}_{i}\right)$ (using natural logs) arise in Boltzmann’s statistical mechanics?
The context is n particles that can be in m different states (e.g., energy levels) with a configuration (or macrostate) being defined by having n_{i} particles in the ith state so ${\sum}_{i=1}^{m}\mathrm{}{n}_{i}=n$. If all the m ^{ n } possible assignments (“microstates”) of the n particles to the m states are equiprobable, then Boltzmann’s idea was that the system would evolve to the macrostate that had the highest probability. Since the number of microstates for any configuration is the multinomial coefficient $\left(\begin{array}{l}n\\ {n}_{1},...,{n}_{m}\end{array}\right)=\frac{n!}{{n}_{1}!...{n}_{m}!}$, the larger the multinomial coefficient, the larger the probability of that configuration. Hence to find the equilibrium configuration, the problem is to maximize the multinomial coefficient subject to the relevant constraints. In addition to ${\sum}_{i=1}^{m}\mathrm{}{n}_{i}=n$, each of the m states would have an associated energy level ε_{i} and the total energy ${\sum}_{i=1}^{m}\mathrm{}{n}_{i}{\epsilon}_{i}$ should equal a constant value E. Where did the Shannon formula come from in Boltzmann’s nineteenthcentury statistical mechanics?
Since the natural log is a monotonic transformation, it is equivalent to maximize $\mathrm{ln}\left(\frac{n!}{{n}_{1}!...{n}_{m}!}\right)$. Moreover, the log gives an additive quantity to be associated with the extensive quantity of entropy in thermodynamics. However, maximizing $\mathrm{ln}\left(\frac{n!}{{n}_{1}!...{n}_{m}!}\right)$ subject to the constraints is not very analytically tractable due to the presence of the factorials n! and n_{i}!. But there is the Stirling infinite series expression for ln(n!) and for large n, just the first few terms in the series will give a “for all practical purposes” good approximation. In particular, the first two terms give the approximation: nln(n) – n ≈ ln(n!). Using that numerical approximation, normalizing by dividing by n, and ignoring the physical Boltzmann’s constant, yields a familiar expression as the approximation:$$\begin{array}{cc}S& =\frac{1}{n}\mathrm{ln}\left(\frac{n!}{{n}_{1}!...{n}_{m}!}\right)=\frac{1}{n}\left[\mathrm{ln}(n!)\sum _{i=1}^{m}\mathrm{}\mathrm{ln}({n}_{i}!)\right]\\ & \approx \frac{1}{n}\left[n\left[\mathrm{ln}\left(n\right)1\right]\sum _{i=1}^{m}\mathrm{}{n}_{i}\left[\mathrm{ln}\left({n}_{i}\right)1\right]\right]=\frac{1}{n}\left[n\mathrm{ln}\left(n\right)\sum _{i}\mathrm{}{n}_{i}\mathrm{ln}\left({n}_{i}\right)\right]\\ & \begin{array}{c}=\frac{1}{n}\left[\sum \mathrm{}{n}_{i}\mathrm{ln}\left(n\right)\sum \mathrm{}{n}_{i}\mathrm{ln}\left({n}_{i}\right)\right]=\frac{1}{n}\sum _{i}\mathrm{}{n}_{i}\mathrm{ln}\left(\frac{{n}_{i}}{n}\right)\\ =\sum _{i=1}^{m}\mathrm{}{p}_{i}\mathrm{ln}\left({p}_{i}\right)=\sum _{i=1}^{m}\mathrm{}{p}_{i}\mathrm{ln}\left(1/{p}_{i}\right)={H}_{e}\left(p\right)\hspace{1em}\mathrm{where}{p}_{i}={n}_{i}/n.\end{array}\end{array}$$(43)
That is how the twoterm Stirling approximation brings the Shannon formula into the statistical mechanics of Boltzmann (and Gibbs). It should be noted that the two probability distributions are quite different. For the exact maximal configuration (n_{1}, …, n_{m}), there are $\frac{n!}{{n}_{1}!...{n}_{m}!}$ terms in the equiprobable distribution over the microstates. For the twoterm Stirling approximate probability distribution, there are only m terms (p_{1}, …, p_{m}). It is the total quantity $\frac{1}{n}\mathrm{ln}\left(\frac{n!}{{n}_{1}!...{n}_{m}!}\right)$ that is approximated by the Shannon formula H_{e}(p) for large n.
And by taking more terms in the Stirling approximation, one information theorist notes that one would have an even better approximation ([53], p. 2), and a prominent physical chemist notes that $\mathrm{ln}\left(n!\right)\approx \sqrt{\mathrm{ln}\left(2\pi \right)}+\left(n+\frac{1}{2}\right)\mathrm{ln}\left(n\right)n$ is a much better approximation ([54], p. 533). But neither uses those formulas since the purpose at hand is analytical tractability, not better approximations, and the Shannon formula leads to a very nice development in statistical mechanics – in particular to the beautiful partition function Z that connects statistical mechanics to thermodynamics. Unfortunately, the role of what became later known as the Shannon formula as a very convenient numerical approximation to Boltzmann entropy is often “forgotten” in the literature where one even sees expressions like “ShannonBoltzmann entropy” ([46], p. 11) or “Boltzmann–Gibbs–Shannon entropy” (e.g., [36]). Perhaps nowhere else in mathematical physics has a numerical approximation been attributed such conceptual significance.
Another way to emphasize the conceptual difference is to consider a small n example where we can compute both entropies since the original Boltzmann problem is a tractable integer programming problem not using any approximation. Consider an example of n = 10 particles with three possible energy levels of ε = (ε_{1}, ε_{2}, ε_{3}) = (1, 2, 3) and a total energy of E = 22. For n_{i} as the number of particles at energy level i, the energy constraint is ${\sum}_{i=1}^{3}\mathrm{}{\epsilon}_{i}{n}_{i}=E$ and of course ${\sum}_{i=1}^{3}\mathrm{}{n}_{i}=n$.^{6} There are only four nonnegative integer solutions satisfying the two constraints (see Tab. 3).
Feasible integer solutions.
The exact Boltzmann solution giving the maximum multinomial coefficient is (n_{1}, n_{2}, n_{3}) = (2, 4, 4) and the (normalized) Boltzmann entropy is:$$S=\frac{1}{10}\mathrm{ln}\left(\frac{10!}{2!4!4!}\right)=\frac{1}{10}\mathrm{ln}\left(3150\right)=\frac{1}{10}8.055=0.8055,$$(44)while maximizing the usual Shannon approximation gives the noninteger result (n_{1}, n_{2}, n_{3}) = (2.3837, 3.2326, 4.3837) (to four decimal places) with the Shannon entropy of H_{e}(p) = 1.0684 (where p_{i} = n_{i}/n). The probability distribution in the Boltzmann case has $3150$ equal terms with the value $\frac{1}{3150}$ while the probability distribution in the “Shannon case” has 3 terms, $\frac{1}{10}\left(\mathrm{2.3837,3.2326,4.3837}\right)$. The maximization of the multinomial coefficient (or its normalized logarithm) and the Shannon expression are obviously different for low n. But for the enormous number of particles in a system of statistical mechanics, that numerical difference fades into insignificance – unless one forgets about it altogether and attaches conceptual significance to the Shannon formula in Boltzmannian statistical mechanics.
MaxEntropy with which entropy for discrete distributions?
Edwin T. Jaynes [56] started a whole “MaxEntropy” subdiscipline in information theory by arguing that the classical indifference principle used to give an equiprobable probability distribution (in the lack of other knowledge) should be generalized to other more constrained contexts by choosing the probabilities that maximize the Shannon entropy subject to those constraints. His motivation was based, in significant part, on attaching conceptual significance to the maximizing of Shannon entropy in Boltzmannian statistical mechanics:
the “method of the most probable distribution” dating back to Boltzmann ... which turns out in the end to be mathematically equivalent to maximum [Shannon] entropy. ([56], p. 441)
The question naturally arises: “What about maximizing logical entropy subject to the same constraints?” If there are no constraints, then maximizing both entropies yields the classical result of the equiprobable distribution, i.e., the indifference principle. But when there are constraints, then the two maximums yield different probability distributions.
Consider a function $X:U\to \mathbb{R}$ with values $X\left({u}_{i}\right)={x}_{i}$ for $i=1,...,n$ with unknown probabilities p = (p_{1}, …, p_{n}). A standard discrete MaxEntropy problem is to find the “best” probabilities so that the average value ${\sum}_{i=1}^{n}\mathrm{}{p}_{i}{x}_{i}=m$ for some given value of m (which must be between the maximum and minimum values of the ${x}_{i}$).
Where “best” is defined by maximizing the Shannon entropy, the procedure is to maximize the Lagrangian:$$\mathcal{L}=\sum \mathrm{}{p}_{i}\mathrm{ln}\left({p}_{i}\right)+\lambda \left(1\sum \mathrm{}{p}_{i}\right)+\tau \left(m\sum \mathrm{}{p}_{i}{x}_{i}\right)$$(45)so the firstorder conditions are:$$\mathrm{\partial}\mathcal{L}/\mathrm{\partial}{p}_{i}=\mathrm{ln}\left({p}_{i}\right){p}_{i}\frac{1}{{p}_{i}}\lambda {x}_{i}\tau =0,$$(46)where it should be noted at the outset that the use of the log function ln(p_{i}) (and the term 1/p_{i} in the firstorder conditions) assumes p_{i} ≠ 0 for all i. Then exponentiating gives: ${p}_{i}={\mathrm{e}}^{\left(1+\lambda +\tau {x}_{i}\right)}$. Substituting into the constraints to determine the Lagrange multipliers:$$1=\sum \mathrm{}{p}_{i}=\sum \mathrm{}{\mathrm{e}}^{\left(1+\lambda +\tau {x}_{i}\right)}={\mathrm{e}}^{(1+\lambda )}\sum \mathrm{}{\mathrm{e}}^{\tau {x}_{i}}{\hspace{1em}\mathrm{so}\mathrm{e}}^{1+\lambda}=\sum \mathrm{}{\mathrm{e}}^{\tau {x}_{i}}$$(47)which yields p_{MaxH} in terms of the Lagrange multiplier τ as:$${p}_{i}={\mathrm{e}}^{\tau {x}_{i}}/\sum _{j=1}^{n}\mathrm{}{\mathrm{e}}^{\tau {x}_{j}}.$$(48)
And $m=\sum \mathrm{}{x}_{i}{\mathrm{e}}^{\left(1+\lambda +\tau {x}_{i}\right)}=\sum \mathrm{}{x}_{i}{\mathrm{e}}^{\tau {x}_{i}}/\sum \mathrm{}{\mathrm{e}}^{\tau {x}_{i}}$. Rather than trying to solve directly for τ, it is best to let w = e^{−τ} and then numerically solve for a real root w (aside from w = 0) of the equation:$$\sum _{i}\mathrm{}{x}_{i}{w}^{{x}_{i}}m\left(\sum _{i}\mathrm{}{w}^{{x}_{i}}\right)=0.$$(49)
Given such a real w, τ = −ln(w) and then the p_{MaxH} = (p_{1}, …, p_{n}) for maximizing Shannon entropy H(p) are determined by the above formula: ${p}_{i}={\mathrm{e}}^{\tau {x}_{i}}/{\sum}_{j=1}^{n}\mathrm{}{\mathrm{e}}^{\tau {x}_{j}}$. Moreover, it is clear from the formula that all the p_{i} are positive (and sum to $1$).
Where “best” is defined by maximizing logical entropy, the procedure is to solve the quadratic programming problem of maximizing $h\left(p\right)=1{\sum}_{i}\mathrm{}{p}_{i}^{2}$ subject to the same constraints ∑_{i} p_{i} x_{i} = m and ∑_{i} p_{i} = 1 plus the additional nonnegativity constraints 0 ≤ p_{i} for i = 1, …, n. For a certain range of values of m, the nonnegativity constraints will be automatically satisfied so one can approach that part of the problem using the Lagrangian approach.$$\mathcal{L}\left({p}_{1},...,{p}_{n}\right)=1\sum _{i}\mathrm{}{p}_{i}^{2}\lambda \left(1\sum \mathrm{}{p}_{i}\right)+\tau \left(m\sum \mathrm{}{p}_{i}{x}_{i}\right)$$(50)so the firstorder conditions are:$$\mathrm{\partial}\mathcal{L}/\mathrm{\partial}{p}_{i}=2{p}_{i}+\lambda \tau {\mathrm{x}}_{i}=0,$$(51)so$${p}_{i}=\frac{1}{2}\left(\lambda \tau {x}_{i}\right).$$(52)
Using the first constraint:$$1=\sum \mathrm{}{p}_{i}=\frac{1}{2}\left(\mathrm{n\lambda}\tau \sum \mathrm{}{x}_{i}\right)=\frac{n}{2}\lambda \frac{1}{2}\tau \sum \mathrm{}{x}_{i},$$(53)and using the second constraint:$$m=\sum \mathrm{}{p}_{i}{x}_{i}=\sum \mathrm{}{x}_{i}\frac{1}{2}\left(\lambda \tau {x}_{i}\right)=\lambda \frac{1}{2}\sum \mathrm{}{x}_{i}\tau \frac{1}{2}\sum \mathrm{}{x}_{i}^{2}$$(54)so we have two linear equations that can be used to solve for the Lagrange multipliers λ and τ. Before going forward, it is useful to consider the mean and variance of the x_{i}’s if they were equiprobable. Then $\mu =\frac{\sum \mathrm{}{x}_{i}}{n}$ and $\mathrm{Var}\left(X\right)=E\left({X}^{2}\right){\mu}^{2}=\frac{\sum \mathrm{}{x}_{i}^{2}}{n}{\mu}^{2}={\sigma}^{2}$ so $n\mathrm{Var}\left(X\right)=\sum \mathrm{}{x}_{i}^{2}n{\mu}^{2}$. Then the two equations are:$$1=\frac{n}{2}\lambda \frac{\mathrm{n\mu}}{2}\tau \hspace{1em}\mathrm{and}\hspace{1em}m=\frac{\mathrm{n\mu}}{2}\lambda \frac{n}{2}\left[\mathrm{Var}\left(X\right)+{\mu}^{2}\right]\tau .$$(55)
After a bit of algebra, one arrives at the informative formula for the p_{i} in p_{Maxh} = (p_{1}, …, p_{n}) that results from maximizing logical entropy subject to the same constraints:$${p}_{i}=\frac{1}{n}+\frac{\left(\mu m\right)\left(\mu {x}_{i}\right)}{n\mathrm{Var}\left(X\right)}=\frac{1}{n}+\frac{1}{n}\left(\frac{m\mu}{\sigma}\right)\left(\frac{{x}_{i}\mu}{\sigma}\right).$$(56)
Since all the operations in the formula are rational (e.g., no square roots, not to mention transcendental functions), the probabilities are all rational if all the x_{i} are rational. One test of intuitiveness is: if x_{i} is equal to the equiprobable mean μ, then shouldn’t that p_{i} equal the equiprobable value $\frac{1}{n}$ regardless of the other values? That is true as we see from the formula for p_{i}. If any x_{i} = μ then that ${p}_{i}=\frac{1}{n}$ and if m = μ, then all the ${p}_{i}=\frac{1}{n}$, the equiprobable solution. The condition for all the p_{i} ≥ 0 is that $\frac{\left(\mu m\right)\left(\mu {x}_{i}\right)}{n\mathrm{Var}\left(X\right)}\ge \frac{1}{n}$ or $\left(\mu m\right)\left(\mu {x}_{i}\right)\ge \mathrm{Var}\left(X\right)$ for all $i$. If that condition is not satisfied for some p_{i}, then the nonnegativity constraints must be enforced by using quadratic programming techniques instead of the Lagrangian technique used above.^{7}
One of the bestknown examples is Jaynes’s Brandeis dice problem ([59], p. 47 or [60], p. 427). If a die was fair, then the average of the equiprobable outcomes is μ = 3.5. But suppose that it is a given constraint that the average outcome is 4.5, then what is the “best” estimate of the probabilities for the six sides?
To maximize Shannon entropy, the x_{i}’s are 1, 2, 3, 4, 5, 6 so the equation to be numerically solved is:$$\sum _{i=1}^{6}\mathrm{}i{w}^{i}m\left(\sum _{i=1}^{6}\mathrm{}{w}^{i}\right)=0.$$(57)for m = 4.5. In addition to w = 0, the relevant real root to four decimal places is w = 1.4493. Then τ = −ln(1.449 3) = −0.371 08 and the Jaynes solution for the probabilities to four decimal places is:$${p}_{\mathrm{Max}H}=\left(\mathrm{0.0543,0.0788,0.1142,0.1654,0.2398,0.3475}\right).$$(58)
To maximize logical entropy, $\mu =\frac{7}{2}$, $m=\frac{9}{2}=4.5$, and $\mathrm{Var}\left(X\right)=\frac{35}{12}$, so the formula ${p}_{i}=\frac{1}{n}+\frac{\left(\mu m\right)\left(\mu {x}_{i}\right)}{\mathrm{nVar}\left(X\right)}$ can be used to solve for the rational maximum logical entropy solution:$$\begin{array}{cc}{p}_{\mathrm{Max}h}& =\frac{1}{210}\left(\mathrm{5,17,29,41,53,65}\right)\\ & =\left(\mathrm{0.0238,0.0810,0.1381,0.1952,0.2524,0.3095}\right).\end{array}$$(59)
In this case, $\left(\mu m\right)\left(\mu {x}_{i}\right)=\left(\frac{7}{2}\frac{9}{2}\right)\left(\frac{7}{2}{x}_{i}\right)=\left(\frac{7}{2}{x}_{i}\right)\ge \mathrm{Var}\left(X\right)=\frac{35}{12}$. The RHS is the smallest for x_{1} = 1 where $\left(\frac{7}{2}\frac{2}{2}\right)=\frac{30}{12}\ge \frac{35}{12}$ so all the probabilities are positive. Equality holds when $\mu m=\frac{35}{12}\frac{2}{5}=\frac{7}{6}$ or $m=\frac{7}{2}+\frac{7}{6}=\frac{28}{6}=4\frac{2}{3}$. Hence for any $m>4\frac{2}{3}$, p_{1} and possibly other p_{i} will be 0 so quadratic programming must be used. A little calculation shows that $\frac{7}{3}$ is the lower bound so that for $m\frac{7}{3}$, there will be some zero probabilities. For instance, for m = 5, the probabilities for logical entropy are: ${p}_{\mathrm{Maxh}}=\frac{1}{10}\left(\mathrm{0,0},\mathrm{1,2},\mathrm{3,4}\right)$, while the Jaynes solution is p_{MaxH} = (0.0205, 0.0385, 0.0723, 0.1357, 0.2548, 0.4781) to four decimal places.
It is interesting that the only alternative to maximizing Shannon entropy that Jaynes considers ([56], pp. 345–6) is minimizing ${\sum}_{i}\mathrm{}{p}_{i}^{2}$ which is the same as maximizing logical entropy $h\left(p\right)=1{\sum}_{i}\mathrm{}{p}_{i}^{2}$. But then he criticizes it because some of the ${p}_{i}$ may be negative if one uses only the Lagrangian method.
The formal solution for minimum ${\sum}_{i}\mathrm{}{p}_{i}^{2}$ lacks the property of nonnegativity. We might try to patch this up in an ad hoc way by replacing the negative values by zero and adjusting the other probabilities to keep the constraint satisfied. [56], p. 346]
It is unclear if Jaynes was aware of the field of quadratic programming which was welldeveloped in the 1960s ([61], p. 490) and which hardly proceeds in “an ad hoc way by replacing the negative values by zero and adjusting the other probabilities to keep the constraint satisfied.”
Clearly the two solutions are different in general^{8} and each one maximizes the corresponding type of entropy. How can one determine which probability distribution is “best”? One criterion that immediately suggests itself is the distribution (p_{1}, …, p_{n}) of numbers that is the most uniform in the sense of having the least variance Var(p) where each of the numbers p_{i} is considered equally probable. The minimum is Var(p) = 0 for the uniform probability distribution which maximizes both entropies in the absence of constraints. In the two cases where m = 4.5 and m = 5, the logical entropy maximizing distribution p_{Maxh} has the lower variance Var(p). But is that true in general?
At first, it seems rather intractable to prove in general which of the two distributions has least variance in the discrete case since the Jaynes solution involves finding the roots of a highdegree polynomial. But there is an easy and general proof that the logical entropy solution has a variance less than (or equal to) the Jaynes solution when both are maximized subject to the same constraints – and similarly for being closest to the uniform distribution in terms of the usual notion of Euclidean distance in ${\mathbb{R}}^{n}$.
Proposition 3.2.
Var(p_{Maxh}) ≤ Var(p_{MaxH}).
Proof. For any constraint set on the probability distributions p = (p_{1}, …, p_{n}), minimize the variance itself over all the feasible distributions (rather than maximize either of the two entropies – or any other entropy for that matter), and then show that the minimum variance distribution p_{MinVar} is the same as the maximum logical entropy distribution p_{Maxh}. The equality p_{MinVar} = p_{Maxh} is shown by computing the relationship between Var(p) and h(p). Looking at (p_{1}, …, p_{n}) as just a set of equiprobable numbers with ∑_{i} p_{i} = 1, it has the variance:$$\begin{array}{cc}\mathrm{Var}\left(p\right)& =\sum _{i=1}^{n}\mathrm{}\frac{1}{n}{\left({p}_{i}\frac{1}{n}\right)}^{2}=E\left({p}^{2}\right)E{\left(p\right)}^{2}\\ & =\frac{1}{n}\sum _{i}\mathrm{}{p}_{i}^{2}{\left(\frac{1}{n}\sum _{i}\mathrm{}{p}_{i}\right)}^{2}=\frac{1}{n}\sum _{i}\mathrm{}{p}_{i}^{2}{\left(\frac{1}{n}\right)}^{2}=\frac{1}{n}\left[\left(1\frac{1}{n}\right)h\left(p\right)\right]\end{array}$$(60)since ${\sum}_{i=1}^{n}\mathrm{}{p}_{i}^{2}=1h\left(p\right)$. Since h(p) appears with a negative sign in the expression for Var(p), minimizing Var(p) is the same as maximizing h(p) over the set of feasible probability distributions, so p_{MinVar} = p_{Maxh}. □
Corollary 3.3.
${p}_{\mathrm{Maxh}}$ minimizes the (Euclidean) distance to the uniform distribution $\left(\frac{1}{n},...,\frac{1}{n}\right)$ .
Proof. Minimizing Euclidean distance is the same as minimizing the distance squared and ${\sum}_{i=1}^{n}\mathrm{}{\left({p}_{i}\frac{1}{n}\right)}^{2}=\left(1\frac{1}{n}\right)h\left(p\right)$.□
There is another specialized notion of “distance,” namely the Kullback–Leibler divergence [62] $D\left(p\left\rightq\right)={\sum}_{i=1}^{n}\mathrm{}{p}_{i}\mathrm{log}\left(\frac{{p}_{i}}{{q}_{i}}\right)$ (p and q are probability distributions on the same index set with all q_{i} > 0) which is neither symmetrical nor satisfies the triangle inequality. But $D\left(p\left\right\left(\frac{1}{n},...,\frac{1}{n}\right)\right)=\mathrm{log}\left(n\right)H\left(p\right)$ so that maximizing H(p) subject to the constraints is equivalent to minimizing the Kullback–Leibler divergence of p from the uniform distribution.
The corresponding asymmetrical divergence formula for logical entropy, also for probability distributions p and q where q_{i} > 0 for all i, is the directed logical divergence:$${d}^{\mathrm{*}}\left(p\left\rightq\right):=\sum _{i=1}^{n}\mathrm{}\frac{1}{{q}_{i}}{\left({q}_{i}{p}_{i}\right)}^{2}=\sum _{i}\mathrm{}{p}_{i}\left(\frac{{p}_{i}}{{q}_{i}}1\right)=\sum _{i}\mathrm{}\frac{{p}_{i}^{2}}{{q}_{i}}1\ge 0\hspace{1em}\mathrm{with}\mathrm{equality}\mathrm{iff}p=q.$$(61)
Another way to prove nonnegativity is to note that $\left({q}_{i}{p}_{i}\right)\left(1\frac{{p}_{i}}{{q}_{i}}\right)\ge 0$ since both terms are negative or both are nonnegative, and ${\sum}_{i}\mathrm{}\left({q}_{i}{p}_{i}\right)\left(1\frac{{p}_{i}}{{q}_{i}}\right)={\sum}_{i}\mathrm{}\frac{{p}_{i}^{2}}{{q}_{i}}1$. Since the KL divergence uses probability ratios inside the log term, we do the same for ditbit transform so that: $1\frac{{p}_{i}}{{q}_{i}}\u21dd\mathrm{log}\left(\frac{1}{{p}_{i}/{q}_{i}}\right)$ and thus:$${d}^{\mathrm{*}}\left(p\left\rightq\right)=\sum _{i}\mathrm{}{p}_{i}\left(1\frac{{p}_{i}}{{q}_{i}}\right)\u21dd\sum _{i}\mathrm{}{p}_{i}\mathrm{log}\left(\frac{1}{{p}_{i}/{q}_{i}}\right)=\sum _{i}\mathrm{}{p}_{i}\mathrm{log}\left(\frac{{q}_{i}}{{p}_{i}}\right)=D\left(p\left\rightq\right)$$(62)so ${d}^{\mathrm{*}}\left(p\left\rightq\right)\u21ddD\left(p\left\rightq\right)$, i.e., the KL divergence is the ditbit transform of the directed logical divergence. What is the probability distribution closest to the uniform distribution using the directed logical divergence?$$\begin{array}{c}{d}^{\mathrm{*}}\left(p\left\right\left(\frac{1}{n},...,\frac{1}{n}\right)\right)=\sum _{i}\mathrm{}{p}_{i}\left(n{p}_{i}1\right)=n\sum _{i}\mathrm{}{p}_{i}^{2}1\\ =n\left[\left(1\frac{1}{n}\right)h\left(p\right)\right]=n\sum _{i=1}^{n}\mathrm{}{\left({p}_{i}\frac{1}{n}\right)}^{2}\end{array}$$(63)so it is the logical entropy solution that is the closest to the uniform distribution by the logical notion of directed divergence.
Metrical logical entropy = (twice) variance
The above results suggest a broader connection between the usual notion of the variance of a random variable and the logical entropy of “differences” when the differences have metrical significance. The logical entropy h(X) of a random variable $X:U\to \mathbb{R}$ with n distinct values (x_{1}, …, x_{n}) with the probabilities p = (p_{1}, …, p_{n}) is computed as h(X) = ∑_{i ≠ j} p_{i} p_{j} which only takes notice of when values are the same or different. Logical entropy in that sense is a special case of C.R. Rao’s notion of quadratic entropy ∑_{i,j} d_{ij} p_{i} p_{j}, where d_{ij} is a nonnegative “distance function” such that d_{ii} = 0 and d_{ij} = d_{ji} [26, 63], for the logical distance function d_{ij} = 1 − δ_{ij}, the complement of the Kronecker delta. A natural metrical distance function is the Euclidean distance squared d_{ij} = (x_{i} − x_{j})^{2}.
Proposition 3.4.
${\sum}_{j}{p}_{i}{p}_{j}{\left({x}_{i}{x}_{j}\right)}^{2}=2\mathrm{Var}\left(X\right)$ .
Proof. Firstly, since for $i=j$, ${\left({x}_{i}{x}_{j}\right)}^{2}=0$, we can sum over all $i,j$. $$\begin{array}{cc}\sum _{j\ne i}\mathrm{}{p}_{i}{p}_{j}{\left({x}_{i}{x}_{j}\right)}^{2}& =\sum _{i,j}\mathrm{}{p}_{i}{p}_{j}{\left({x}_{i}{x}_{j}\right)}^{2}\\ & =\sum _{i,j}\mathrm{}{p}_{i}{p}_{j}\left({x}_{i}^{2}2{x}_{i}{x}_{j}+{x}_{j}^{2}\right)=E\left({X}^{2}\right)2E{\left(X\right)}^{2}+E\left({X}^{2}\right)=2\mathrm{Var}\left(X\right)\end{array}$$(64)□
It was previously noted that when counting distinctions (u_{i}, u_{j}) ∈ dit(π), both (u_{i}, u_{j}) and (u_{j}, u_{i}) are included. If only the distinctions (u_{i}, u_{j}) for i < j are counted, then one get half the number as is evident in the logical entropy box diagrams such as Figure 2.
Corollary 3.5.
${\sum}_{i}{p}_{i}{p}_{j}{\left({x}_{i}{x}_{j}\right)}^{2}=\mathrm{Var}\left(X\right)$ .
Thus the variance of a metrical random variable X is the average distance squared between the values in an unordered pair of independent trials.
The result extends to covariances as well. Consider two realvalued random variables X with distinct values x_{i} for i = 1, …, n and Y with distinct values y_{j} for j = 1, …, m with the joint probability distribution $p\left({x}_{i},{y}_{j}\right):X\times Y\to \mathbb{R}$. Two ordered draws from X × Y gives two ordered pairs: (x_{i}, y_{j}) and $\left({x}_{{i}^{\mathrm{\prime}}},{y}_{{j}^{\mathrm{\prime}}}\right)$. For this bivariate distribution, the generalization of ∑_{j ≠ i} p_{i} p_{j} (x_{i} −x_{j})^{2} is:$$\sum _{\left(i,j\right)\ne \left({i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}\right)}\mathrm{}\mathrm{p}\left({x}_{i},{y}_{j}\right)p\left({x}_{{i}^{\mathrm{\prime}}},{y}_{{j}^{\mathrm{\prime}}}\right)\left({x}_{i}{x}_{{i}^{\mathrm{\prime}}}\right)\left({y}_{j}{y}_{{j}^{\mathrm{\prime}}}\right).$$(65)
Metrical logical entropy for bivariate distributions of metrical random variables which is no longer a special case of quadratic entropy since $\left({x}_{i}{x}_{{i}^{\mathrm{\prime}}}\right)\left({y}_{i}{y}_{{i}^{\mathrm{\prime}}}\right)$ can be negative. The (unordered) twodraw notion of metrical variation for a bivariate distribution reproduces the usual notion of covariance Cov(X, Y) = E(XY) − E(X)E(Y).^{9}
Proposition 3.6.
${\sum}_{\left(i,j\right)}p\left({x}_{i},{y}_{j}\right)p\left({x}_{{i}^{\text{'}}},{y}_{{j}^{\text{'}}}\right)\left({x}_{i}{x}_{{i}^{\text{'}}}\right)\left({y}_{j}{y}_{{j}^{\text{'}}}\right)=2\mathrm{Cov}\left(X,Y\right)$ .
Proof. Since $\left({x}_{i}{x}_{{i}^{\mathrm{\prime}}}\right)\left({y}_{j}{y}_{{j}^{\mathrm{\prime}}}\right)=0$ if i = i′ or j = j′, we can sum over all i, j. Abbreviating p(x_{i}, y_{j}) = p_{ij}, we have:$$\begin{array}{c}\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}\left({x}_{i}{x}_{{i}^{\mathrm{\prime}}}\right)\left({y}_{j}{y}_{{j}^{\mathrm{\prime}}}\right)\\ =\sum _{i,j,\mathrm{i\text{'}},\mathrm{j\text{'}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}\left[{x}_{i}{y}_{j}{x}_{i}{y}_{{j}^{\mathrm{\prime}}}{x}_{{i}^{\mathrm{\prime}}}{y}_{j}+{x}_{{i}^{\mathrm{\prime}}}{y}_{{j}^{\mathrm{\prime}}}\right]\\ =\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}{x}_{i}{y}_{j}\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}{x}_{i}{y}_{{j}^{\mathrm{\prime}}}\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}{x}_{{i}^{\mathrm{\prime}}}{y}_{j}+\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}{x}_{{i}^{\mathrm{\prime}}}{y}_{{j}^{\mathrm{\prime}}}.\end{array}$$(66)
Then using:$$\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}{x}_{i}{y}_{j}=\sum _{i,j}\mathrm{}{p}_{\mathrm{ij}}{x}_{i}{y}_{j}\sum _{{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}=\sum _{\mathrm{ij}}\mathrm{}{p}_{\mathrm{ij}}{x}_{i}{y}_{j}=E\left(\mathrm{XY}\right),$$(67)and$$\begin{array}{cc}\sum _{i,j,{i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}{x}_{i}{y}_{{j}^{\mathrm{\prime}}}& =\sum _{i,{j}^{\mathrm{\prime}}}\mathrm{}{x}_{i}{y}_{{j}^{\mathrm{\prime}}}\sum _{{i}^{\mathrm{\prime}}}\mathrm{}\sum _{j}\mathrm{}{p}_{\mathrm{ij}}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}=\sum _{i,{j}^{\mathrm{\prime}}}\mathrm{}{x}_{i}{y}_{{j}^{\mathrm{\prime}}}\sum _{{i}^{\mathrm{\prime}}}\mathrm{}{p}_{i}{p}_{{i}^{\mathrm{\prime}}{j}^{\mathrm{\prime}}}\\ & =\sum _{i,j/}\mathrm{}{p}_{i}{x}_{i}{y}_{{j}^{\mathrm{\prime}}}{p}_{{j}^{\mathrm{\prime}}}=\left(\sum _{i}\mathrm{}{p}_{i}{x}_{i}\right)\left(\sum _{{j}^{\mathrm{\prime}}}\mathrm{}{p}_{{j}^{\mathrm{\prime}}}{y}_{{j}^{\mathrm{\prime}}}\right)=E\left(X\right)E\left(Y\right)\end{array}$$(68)and similarly for the other cases, so we have:$$\sum _{\left(i,j\right)\ne \left({i}^{\mathrm{\prime}},{j}^{\mathrm{\prime}}\right)}\mathrm{}p\left({x}_{i},{y}_{j}\right)p\left({x}_{{i}^{\mathrm{\prime}}},{y}_{{j}^{\mathrm{\prime}}}\right)\left({x}_{i}{x}_{{i}^{\mathrm{\prime}}}\right)\left({y}_{j}{y}_{{j}^{\mathrm{\prime}}}\right)=E\left(\mathrm{XY}\right)E\left(X\right)E\left(Y\right)E\left(Y\right)E\left(X\right)+E\left(\mathrm{XY}\right)=2\mathrm{Cov}(X,Y).$$(69)□
The linear ordering on indices i and j can be extended to the linear lexicographic (or dictionary) ordering on ordered pairs of indices where (i, j) < (i′, j′) if i < i′ or if i = i′, then j < j′. Then for each pair of distinct ordered pairs (i, j) ≠ (i′, j′), either (i, j) < (i′, j′) or (i′, j′) < (i, j) but not both, so (i, j) < (i′, j′) picks out half the cases of (i, j) ≠ (i′, j′).
Corollary 3.7.
${\sum}_{\left(i,j\right)}p\left({x}_{i},{y}_{i}\right)p\left({x}_{{i}^{\text{'}}},{y}_{{j}^{\text{'}}}\right)\left({x}_{i}{x}_{{i}^{\text{'}}}\right)\left({y}_{j}{y}_{{j}^{\text{'}}}\right)=\mathrm{Cov}\left(X,Y\right)$ .
In the switch from logical entropy to metrical logical entropy, the interpretation switches from being a twodraw probability (and thus always nonnegative) to being a twodraw average metrical quantity which, like the covariance, might be positive or negative.
Thus logical entropy connects naturally with the notions of variance and covariance in statistics. Although beyond the scope of this paper, the metrical logical entropy for a discrete random variable $g\left(X\right)$, ${\sum}_{j}\mathrm{}{p}_{i}{p}_{j}{\left(g\left({x}_{i}\right)g\left({x}_{j}\right)\right)}^{2}=2\mathrm{Var}\left(g\left(X\right)\right)$, shows how to generalize to the logical entropy of a continuous random variable g(X) where X has the probability density $f\left(x\right)$:$$h\left(g\left(X\right)\right)=\int \mathrm{}f\left(x\right)\int \mathrm{}f\left({x}^{\mathrm{\prime}}\right){\left(g\left(x\right)g\left({x}^{\mathrm{\prime}}\right)\right)}^{2}\mathrm{d}{x}^{\mathrm{\prime}}\mathrm{d}x=2\mathrm{Var}\left(g\left(X\right)\right).$$(70)
The interpretation of $h\left(g\left(X\right)\right)$ is the average (Euclidean) distance squared between the values of g(X) on two independent trials – which is twice the variance.
Quantum logical entropy
Logical entropy via density matrices
The transition from “classical” (i.e., nonquantum) logical entropy to quantum logical entropy is facilitated by reformulating logical entropy using density matrices over the real numbers. A stepping stone in that reformulation is the notion of an incidence matrix of a binary relation. For a finite U = {u_{1}, ..., u_{n}}, a binary relation R on U is a subset R ⊆ U × U. The n × n incidence matrix In(R) is defined by:$$\mathrm{In}{\left(R\right)}_{\mathrm{ij}}=\{\begin{array}{l}1\hspace{1em}\mathrm{if}\left({u}_{i},{u}_{j}\right)\in R\\ 0\hspace{1em}\mathrm{if}\left({u}_{i},{u}_{j}\right)\notin R.\end{array}$$(71)
Then the incidence matrix associated with a partition π = {B_{1}, …, B_{m}} is In(indit(π)), the incidence matrix of the partition’s inditset, i.e., the associated equivalence relation. And then for equiprobable points in U, the density matrix ρ(π) associated with π is the incidence matrix In(indit(π)) rescaled to be of trace 1 (trace = sum of diagonal elements):$$\rho \left(\pi \right)=\frac{1}{n}\mathrm{In}\left(\mathrm{indit}\left(\pi \right)\right).$$(72)
Each offdiagonal element has two associated diagonal elements in its row and column. If an offdiagonal element in In(indit(π)) or ρ(π) is nonzero, then the corresponding diagonal elements are for elements u_{i}, u_{j} ∈ B_{k} for some block B_{k} ∈ π.
For U with point probabilities p = (p_{1},...,p_{n}), the density matrix ρ(π) can be constructed block by block. For a block B_{i} ∈ π, let ${B}_{i}\rangle $ be the column vector with the jth entry being $\sqrt{\frac{{p}_{j}}{\mathrm{Pr}\left({B}_{i}\right)}}$ if u_{j} ∈ B_{i} and otherwise 0. Then the ρ(B_{i}) is the n × n matrix formed by the product of column vector B_{i}〉 times its row vector transpose B_{i}〉^{t}, and the density matrix ρ(π) is the probabilityweighted sum:$$\rho \left(\pi \right)=\sum _{{B}_{i}\in \pi}\mathrm{}\mathrm{Pr}\left({B}_{i}\right)\rho \left({B}_{i}\right).$$(73)
Then each jk entry in ρ(π) is:$$\rho {\left(\pi \right)}_{\mathrm{jk}}=\{\begin{array}{l}\sqrt{{p}_{j}{p}_{k}}\hspace{1em}\mathrm{if}\left({u}_{j},{u}_{k}\right)\in \mathrm{indit}\left(\pi \right)\\ 0\hspace{1em}\mathrm{otherwise}.\end{array}$$(74)
These values are the square roots of the unshaded squares in the logical entropy box diagrams, e.g., Figures 2–5.
For instance, if π = {{u_{1}, u_{3}},{u_{2}, u_{4}}}, then:$$\rho \left(\pi \right)=\left[\begin{array}{llll}{p}_{1}& 0& \sqrt{{p}_{1}{p}_{3}}& 0\\ 0& {p}_{2}& 0& \sqrt{{p}_{2}{p}_{4}}\\ \sqrt{{p}_{3}{p}_{1}}& 0& {p}_{3}& 0\\ 0& \sqrt{{p}_{4}{p}_{2}}& 0& {p}_{4}\end{array}\right],$$(75)where the nonzero offdiagonal elements indicate which elements are in the same block of the partition. With a suitable interchange of rows and columns, the matrix would become blockdiagonal – where the entries squared correspond to the values of the unshaded squares in the box diagram for logical entropy. The density matrix is symmetric, has trace 1, and all nonnegative elements.
The most important calculation for our purposes is the trace of the square ρ(π)^{2} of a density matrix. Consider a diagonal element (ρ(π)^{2})_{jj} where u_{j} ∈ B_{i} which is the product of the jth row times the jth column of ρ(π):$${\left(\rho {\left(\pi \right)}^{2}\right)}_{\mathrm{jj}}=\sum _{{u}_{k}\in {B}_{i}}\mathrm{}\sqrt{{p}_{j}{p}_{k}}\sqrt{{p}_{k}{p}_{j}}={p}_{j}\sum _{{u}_{k}\in {B}_{i}}\mathrm{}{p}_{k}={p}_{j}\mathrm{Pr}\left({B}_{i}\right).$$(76)
Then summing over all those diagonal elements for u_{j} ∈ B_{i} gives ${\sum}_{{u}_{j}}\mathrm{}{p}_{j}\mathrm{Pr}\left({B}_{i}\right)=\mathrm{Pr}{\left({B}_{i}\right)}^{2}$. These block probabilities squared were the values assigned to the unshaded blocks in the box diagrams for logical entropy. Finally summing over all the diagonal elements yields the basic result about the trace of ρ(π)^{2}:$$\mathrm{tr}\left[\rho {\left(\pi \right)}^{2}\right]=\sum _{{B}_{i}\in \pi}\mathrm{}\mathrm{Pr}{\left({B}_{i}\right)}^{2}.$$(77)
This result immediately yields the translation of the logical entropy h(π) into the density matrix formalism:$$h\left(\pi \right)=1\mathrm{tr}\left[\rho {\left(\pi \right)}^{2}\right],$$(78)i.e., the sum of the shaded squares in the box diagrams for logical entropy.
We will define the tensor product of matrices by considering the example of a 2 × 2 matrix A times a 3 × 3 matrix B:$$\begin{array}{cc}\mathbf{A}\otimes \mathbf{B}& =\left[\begin{array}{ll}{a}_{11}& {a}_{12}\\ {a}_{21}& {a}_{22}\end{array}\right]\otimes \left[\begin{array}{lll}{b}_{11}& {b}_{12}& {b}_{13}\\ {b}_{21}& {b}_{22}& {b}_{23}\\ {b}_{31}& {b}_{32}& {b}_{33}\end{array}\right]=\left[\begin{array}{ll}{a}_{11}B& {a}_{12}B\\ {a}_{21}B& {a}_{22}B\end{array}\right]\\ & =\begin{array}{l}\left(\mathrm{1,1}\right)\\ \left(\mathrm{1,2}\right)\\ \left(\mathrm{1,3}\right)\\ \left(\mathrm{2,1}\right)\\ \left(\mathrm{2,2}\right)\\ \left(\mathrm{2,3}\right)\end{array}\left[\begin{array}{llllll}{a}_{11}{b}_{11}& {a}_{11}{b}_{12}& {a}_{11}{b}_{13}& {a}_{12}{b}_{11}& {a}_{12}{b}_{12}& {a}_{12}{b}_{13}\\ {a}_{11}{b}_{21}& {a}_{11}{b}_{22}& {a}_{11}{b}_{23}& {a}_{12}{b}_{21}& {a}_{12}{b}_{22}& {a}_{12}{b}_{23}\\ {a}_{11}{b}_{31}& {a}_{11}{b}_{32}& {a}_{11}{b}_{33}& {a}_{12}{b}_{31}& {a}_{12}{b}_{32}& {a}_{12}{b}_{33}\\ {a}_{21}{b}_{11}& {a}_{21}{b}_{12}& {a}_{21}{b}_{13}& {a}_{22}{b}_{11}& {a}_{22}{b}_{12}& {a}_{22}{b}_{13}\\ {a}_{21}{b}_{21}& {a}_{21}{b}_{22}& {a}_{21}{b}_{23}& {a}_{22}{b}_{21}& {a}_{22}{b}_{22}& {a}_{22}{b}_{23}\\ {a}_{21}{b}_{31}& {a}_{21}{b}_{32}& {a}_{21}{b}_{33}& {a}_{22}{b}_{31}& {a}_{22}{b}_{32}& {a}_{22}{b}_{33}\end{array}\right]\end{array}.$$(79)
In particular, it might be noted that all the diagonal elements have the form a_{ii} b_{jj} but their (row, column) designators are (i, j)(i, j). Thus a_{11} b_{33} is the diagonal element in the (1, 3) row and the (1, 3) column, i.e., a diagonal element of the tensor product A ⊗ B.
One can take the tensor product of an n × n density matrix ρ(π) (where π = f ^{1} for some $f:U\to \mathbb{R}$) with itself to obtain a n ^{2} × n ^{2} matrix whose diagonal elements are ${\left(\rho \left(\pi \right)\otimes \rho \left(\pi \right)\right)}_{\left(i,j\right)\left(i,j\right)}={p}_{i}{p}_{j}$. Let P_{dit}(π) be the n ^{2} × n ^{2} diagonal (projection) matrix with diagonal elements ${\left({P}_{\mathrm{dit}\left(\mathrm{\pi}\right)}\right)}_{\left(i,j\right)\left(i,j\right)}={\chi}_{\mathrm{dit}\left(\pi \right)}\left({u}_{i},{u}_{j}\right)$. Then the matrix product ${P}_{\mathrm{dit}\left(\pi \right)}\rho \left(\pi \right)\otimes \rho \left(\pi \right)$ will have the nonzero diagonal elements p_{i} p_{j} for $\left({u}_{i},{u}_{j}\right)\in \mathrm{dit}\left(\pi \right)$, and thus:$$h\left({f}^{1}\right)=h\left(\pi \right)=\sum _{\left({u}_{i},{u}_{j}\right)\in \mathrm{dit}\left(\pi \right)}\mathrm{}{p}_{i}{p}_{j}=\mathrm{tr}\left[{P}_{\mathrm{dit}\left(\pi \right)}\rho \left(\pi \right)\otimes \rho \left(\pi \right)\right].$$(80)
That formula will carry over to the quantum case.
In general, a density matrix ρ is said to represent a pure state if tr[ρ ^{2}] = 1, and otherwise a mixed state. For partitions, the only pure state density matrix is ρ(0_{U}), the density matrix of the indiscrete partition ${\mathbf{0}}_{U}=\left\{U\right\}$ on U which has zero logical entropy.
Given another partition $\sigma =\left\{{C}_{1},...,{C}_{{m}^{\mathrm{\prime}}}\right\}$ on U, the join partition $\pi \vee \sigma $ is the partition whose blocks are all the nonempty intersections ${B}_{i}\cap {C}_{j}$ for ${B}_{i}\in \pi $ and ${C}_{j}\in \sigma $. Then $\mathrm{dit}\left(\pi \vee \sigma \right)=\mathrm{dit}\left(\pi \right)\cup \mathrm{dit}\left(\sigma \right)$ so that $\mathrm{indit}\left(\pi \vee \sigma \right)=\mathrm{indit}\left(\pi \right)\cap \mathrm{indit}\left(\sigma \right)$. The logical entropy $h\left(\pi \vee \sigma \right)$, also the joint logical entropy h(π, σ), is: $h\left(\pi \vee \sigma \right)=1{\sum}_{{B}_{i}}\mathrm{}\mathrm{Pr}{\left({B}_{i}\cap {C}_{j}\right)}^{2}$. This has an elegant formulation in the density matrix formalism which implies the earlier result since $\pi \vee \pi =\pi $.
Lemma 4.1.
$h\left(\pi \vee \sigma \right)=1\mathrm{tr}\left[\rho \left(\pi \right)\rho \left(\sigma \right)\right]$ .
Proof. The kth diagonal entry in ρ(π)ρ(σ) is the scalar product ${\sum}_{j}\mathrm{}\rho {\left(\pi \right)}_{\mathrm{kj}}\rho {\left(\sigma \right)}_{\mathrm{jk}}$ with $\rho {\left(\pi \right)}_{\mathrm{kj}}=\sqrt{{p}_{k}{p}_{j}}$ if, $\left({u}_{j},{u}_{k}\right)\in \mathrm{indit}\left(\pi \right)$ and otherwise 0, and similarly for $\rho {\left(\sigma \right)}_{\mathrm{jk}}$. Hence the only nonzero terms in that sum are for $\left({u}_{k},{u}_{j}\right)\in \mathrm{indit}\left(\pi \right)\cap \mathrm{indit}\left(\sigma \right)=\mathrm{indit}\left(\pi \vee \sigma \right)$. Hence$$\mathrm{tr}\left[\rho \left(\pi \right)\rho \left(\sigma \right)\right]=\sum _{\left({u}_{j},{u}_{k}\right)\in \mathrm{indit}\left(\pi \vee \sigma \right)}\mathrm{}{p}_{j}{p}_{k}=1\sum _{\left({u}_{j},{u}_{k}\right)\in \mathrm{dit}\left(\pi \vee \sigma \right)}\mathrm{}{p}_{j}{p}_{k}=1h\left(\pi \vee \sigma \right)$$(81)so$$h\left(\pi \vee \sigma \right)=1\mathrm{tr}\left[\rho \left(\pi \right)\rho \left(\sigma \right)\right].$$(82)□
In coding theory, the differencebased notion of distance between two 0,1 nvectors is the Hamming distance ([65], p. 66) which is just the number of places where the corresponding entries in the two vectors are different. If we think of the 0,1 nvectors as characteristic functions of subsets S and T of an nelement set, then the Hamming distance is the cardinality of the symmetric difference: $\leftST\right+\leftTS\right=\leftS\cup T\right\leftS\cap T\right$. This motivates the definition of the logical distance (or Hamming distance) between two partitions as: $h\left(\pi \sigma \right)+h\left(\sigma \pi \right)=h\left(\pi \vee \sigma \right)m\left(\pi ,\sigma \right)$, the product probability measure on the dits that in one partition but not the other. But there is the HilbertSchmidt distance measure, $\mathrm{tr}\left[{\left(\rho \tau \right)}^{2}\right]$ between density matrices ρ and τ which does not mention logical entropy at all (see [66]). Taking the two density matrices as ρ(π) and ρ(σ), we have the following result that the logical (Hamming) distance between partitions is the HilbertSchmidt distance between the partitions.
Proposition 4.2.
$\mathrm{tr}\left[{\left(\rho \left(\pi \right)\rho \left(\sigma \right)\right)}^{2}\right]=h\left(\pi \sigma \right)+h\left(\sigma \pi \right)$ .
Proof. $\mathrm{tr}\left[{\left(\rho \left(\pi \right)\rho \left(\sigma \right)\right)}^{2}\right]=\mathrm{tr}\left[\rho {\left(\pi \right)}^{2}\right]\mathrm{tr}\left[\rho \left(\pi \right)\rho \left(\sigma \right)\right]\mathrm{tr}\left[\rho \left(\sigma \right)\rho \left(\pi \right)\right]+\mathrm{tr}\left[\rho {\left(\sigma \right)}^{2}\right]$ so:$$\begin{array}{c}\mathrm{tr}\left[{\left(\rho \left(\pi \right)\rho \left(\sigma \right)\right)}^{2}\right]=2\left[1\mathrm{tr}\left[\rho \left(\pi \right)\rho \left(\sigma \right)\right]\right]\left(1\mathrm{tr}\left[\rho {\left(\pi \right)}^{2}\right]\left(1\mathrm{tr}\left[\rho {\left(\sigma \right)}^{2}\right]\right)\right)\\ =2h\left(\pi \vee \sigma \right)h\left(\pi \right)h\left(\sigma \right)=h\left(\sigma \pi \right)+h\left(\pi \sigma \right).\end{array}$$(83)□
One point of developing these results in this classical case, is that the same theorems hold, mutatis mutandis, for quantum logical entropy [67].
Another set of classical results about logical entropy that extend to the quantum case are concerned with the quantum notion of (projective) measurement which is described by the Lüders mixture operation ([68], p. 279)]. For the partition $\sigma =\left\{{C}_{1},...,{C}_{{m}^{\mathrm{\prime}}}\right\}$ on U, let P_{C} be the diagonal n × n projection matrix whose diagonal entries are just the characteristic function χ_{C} (u_{i}) for C ∈ σ. Then the Lüders mixture operation of performing a “σmeasurement” on ρ(π) is defined as: ${\sum}_{C}\mathrm{}{P}_{C}\rho \left(\pi \right){P}_{C}$.
Theorem 4.3. Luders mixture operation = partition join operation.
${\sum}_{C}{P}_{C}\rho \left(\pi \right){P}_{C}=\rho \left(\pi \vee \sigma \right)$ .
Proof. A nonzero entry in ρ(π) has the form $\rho {\left(\pi \right)}_{\mathrm{jk}}=\sqrt{{p}_{j}{p}_{k}}$ iff there is some block B ∈ π such that $\left({u}_{j},{u}_{k}\right)\in B\times B$, i.e., if ${u}_{j},{u}_{k}\in B$. The matrix operation P_{C} ρ(π) will preserve the entry $\sqrt{{p}_{j}{p}_{k}}$ if ${u}_{j}\in C$, otherwise the entry is zeroed. And if the entry was preserved, then the further matrix operation $\left({P}_{C}\rho \left(\pi \right)\right){P}_{C}$ will preserve the entry $\sqrt{{p}_{j}{p}_{k}}$ if ${u}_{k}\in C$, otherwise it is zeroed. Hence the entries $\sqrt{{p}_{j}{p}_{k}}$ in ρ(π) that are preserved in ${P}_{C}\rho \left(\pi \right){P}_{C}$ are the entries where both ${u}_{j},{u}_{k}\in B$ for some B ∈ π and ${u}_{j},{u}_{k}\in C$. These are the entries in $\rho \left(\pi \vee \sigma \right)$ corresponding to the nonempty blocks $B\cap C$ of $\pi \vee \sigma $ for some $B\in \pi $, so summing over $C\in \sigma $ gives the result: ${\sum}_{C}\mathrm{}{P}_{C}\rho \left(\pi \right){P}_{C}=\rho \left(\pi \vee \sigma \right)$. □
Thus projective quantum measurement is modeled classically by the distinctioncreating partition join. Hence the logical information created by the σmeasurement of ρ(π) is $h\left(\sigma \vee \pi \right)h\left(\pi \right)=h\left(\sigma \pi \right)$. Moreover, this increase in logical entropy can be computed from the changes in the entries in the density matrices. A nonzero offdiagonal entry in a density matrix ρ(π) indicates that the u_{j} and u_{k} for the corresponding diagonal elements must “cohere” together in the same block of π. If such a nonzero offdiagonal element of ρ(π) was zeroed in the transition to the density matrix $\rho \left(\pi \vee \sigma \right)$ of the σmeasurement result, then it means that u_{j} and u_{k} were “decohered” by σ, i.e., were in different blocks of σ.
Corollary 4.4.
The sum of all the squares p_{j} p_{k} of all the nonzero offdiagonal entries $\sqrt{{p}_{j}{p}_{j}}$ of ρ(π) that were zeroed in the Lüders mixture operation that transforms ρ(π) into ${\sum}_{C}{P}_{C}\rho \left(\pi \right){P}_{C}=\rho \left(\pi \vee \sigma \right)$ is $h\left(\pi \vee \sigma \right)h\left(\pi \right)=h\left(\sigma \pi \right)$ .
Proof. All the entries $\sqrt{{p}_{j}{p}_{j}}$ that got zeroed were for ordered pair $\left({u}_{j},{u}_{k}\right)$ that were indits of π but not indits of $\pi \vee \sigma $, i.e., $\left({u}_{j},{u}_{k}\right)\in \mathrm{indit}\left(\pi \right)\cap \mathrm{indit}{\left(\pi \vee \sigma \right)}^{c}=\mathrm{dit}{\left(\pi \right)}^{c}\cap \mathrm{dit}\left(\pi \vee \sigma \right)=\mathrm{dit}\left(\pi \vee \sigma \right)\mathrm{dit}\left(\pi \right)$. The sum of products p_{j} p_{k} for those pairs (u_{j} , u_{k}) is just the product probability measure on that set $\mathrm{dit}\left(\pi \vee \sigma \right)\mathrm{dit}\left(\pi \right)$ which is $h\left(\pi \vee \sigma \pi \right)$. And since $\mathrm{dit}\left(\pi \right)\subseteq \mathrm{dit}\left(\pi \vee \sigma \right)$, the measure on $\mathrm{dit}\left(\pi \vee \sigma \right)\mathrm{dit}\left(\pi \right)$ is $h\left(\pi \vee \sigma \pi \right)=h\left(\pi \vee \sigma \right)h\left(\pi \right)=h\left(\sigma \pi \right)$. □
It might be noted that nothing about logical entropy was used in the definition of the Lüders mixture operation that describes the “σmeasurement of ρ(π).” Yet the logical information created by the σmeasurement of ρ(π) is $h\left(\sigma \pi \right)$, the logical information that is in σ over and above the information in π. And that logical entropy $h\left(\sigma \pi \right)$ can be computed directly from the terms in the density matrix ρ(π) that were zeroed in the Lüders operation.
Now we are ready to make the transition to quantum logical information theory where all the corresponding results will hold.
Linearizing “classical” to quantum logical entropy
One of the developers of quantum information theory, Charles Bennett, said that information was fundamentally about distinguishability.
[Information] is the notion of distinguishability abstracted away from what we are distinguishing, or from the carrier of information. ... And we ought to develop a theory of information which generalizes the theory of distinguishability to include these quantum properties... ([69], pp. 155–157)
Given a normalized vector $\psi \rangle $ in an ndimensional Hilbert space V, a pure state density matrix is formed as $\rho \left(\psi \right)=\psi \rangle \langle \psi =\psi \rangle {\left(\psi \rangle \right)}^{\u2020}$ (where ${\left(\right)}^{\u2020}$ is the conjugatetranspose) and mixed state density matrix is a probability mixture $\rho ={\sum}_{i}\mathrm{}{p}_{i}\rho \left({\psi}_{i}\right)$ of pure state density matrices. Any such density matrix always has a spectral decomposition into the form $\rho =\sum \mathrm{}{p}_{i}\rho \left({\psi}_{i}\right)$ where the different vectors ψ_{i} and ${\psi}_{{i}^{\mathrm{\prime}}}$ are orthogonal. The general definition of the quantum logical entropy of a density matrix is: h(ρ) = 1 − tr[ρ ^{2}], where if ρ is a pure state if and only if tr[ρ ^{2}] = 1 so h(ρ) = 0, and tr[ρ ^{2}] < 1 for mixed states so 1 > h(p) > 0 for mixed states. The formula h(ρ) = 1 − tr[ρ ^{2}] is hardly new. Indeed, tr[ρ ^{2}] is usually called the purity of the density matrix so the complement 1 − tr[ρ ^{2}] has been called the “ mixedness” ([70], p. 5) or “ impurity” of the state ρ. The seminal paper of Manfredi and Feix [71] approaches the same formula 1 − tr[ρ ^{2}] (which they denote as S_{2}) from the advanced viewpoint of Wigner functions, and they present strong arguments for this notion of quantum entropy.
While that definition is an easy generalization of the classical one formulated using density matrices, our goal is to develop quantum logical entropy in a manner that brings out the analogy with classical logical entropy and relates it closely to quantum measurement.
There is a method (or what GianCarlo Rota would call a “yoga”) to linearize set concepts to vectorspace concepts:
The Yoga of Linearization:
Apply the set concept to a basisset of a vector space (i.e., treat the basis set as a set universe U) and whatever is generated is the corresponding vectorspace concept.
For instance, there is the classical Boolean logic of subsets, and a subset of a basis set generates a subspace so the Boolean logic of subsets linearizes to vector spaces as the logic of subspaces, and specializing to Hilbert spaces yields the usual quantum logic of subspaces [72].
In view of the categorytheoretic duality of subsets of a set and partitions on a set (e.g., the image subset of the codomain and the inverseimage partition on the domain of a set function), there is a dual “classical” logic of partitions on a set ([3, 4]). To linearize the set concept of a partition to vector spaces, one considers a set partition on a basis set of a vector space and then sees what it generates. Each block generates a subspace and the set of subspaces corresponding to the blocks form a directsum decomposition (DSD) of the vector space. A directsum decomposition of a vector space V is a set ${\left\{{V}_{i}\right\}}_{i\in I}$ of subspaces such that ${V}_{i}\cap {\sum}_{{i}^{\mathrm{\prime}}}\mathrm{}{V}_{i}=\left\{0\right\}$ (where ${\sum}_{{i}^{\mathrm{\prime}}}\mathrm{}{V}_{i}$ is the subspace generated by those ${V}_{{i}^{\mathrm{\prime}}}$, and {0} is the zero subspace), and which span the space V and is written $V={\oplus}_{i\in I}{V}_{i}$. Then every nonzero vector $v\in V$ is a unique sum of vectors from the subspaces {V_{i}}. That is the vectorspace version of characterizing a partition π on a set U as a collection of subsets B_{i} (blocks) of U such that every nonempty subset $S\subseteq U$ is uniquely expressed as a union of subsets of the blocks, i.e., $S\cap {B}_{i}$ for ${B}_{i}\in \pi $. Hence the logic of partitions linearizes to the dual logic of DSDs of a vector space which specialized to Hilbert spaces yields the quantum logic of DSDs [74] dual to the usual Birkhoff–vonNeumann quantum logic of subspaces.
Another basic set concept is the notion of a numerical attribute $f:U\to \mathbb{K}$ that evaluates the points of U in a field $\mathbb{K}$. Taking U to be a basis set of a vector space V over the field $\mathbb{K}$, the corresponding vectorspace notion that can be seen as generated is the notion of a diagonalizable linear operator $F:V\to V$ defined by Fu = f(u)u linearly extended to V. The values of f linearize to the eigenvalues of F, the constant sets of f linearize to the eigenvectors of F, and the set of constant sets for a specific value linearizes to the eigenspace of eigenvectors for that eigenvalue. For instance, if we let “rS” stand for assigning the value r to each element of the subset $S\subseteq U$, then the set version of the eigenvector equation Fv = λv is f(S) = rS.
The Cartesian product of two basis sets of two vector spaces (same base field) generates the tensor product of the two vector spaces – so the linearization of the direct or Cartesian product of sets is not the direct product (as might be suggested by category theory) but the tensor product of vector spaces. And the cardinality of sets linearizes to the dimension of vector spaces and so forth as illustrated in Table 4. Those examples show how the setbased classical logical information theory will linearize to vector spaces and particularly to Hilbert spaces for the logical version of quantum information theory.^{10}
Linearization of set concepts to corresponding vectorspace concepts.
Let $F:V\to V$ be a selfadjoint (or Hermitian) operator (observable) on a ndimensional Hilbert space $V$ with the real eigenvalues ϕ_{1},…,ϕ_{I}, and let U={u_{1},…,u_{n}} be an orthonormal (ON) basis of eigenvectors of F. The quantum version of a “dit” is a “qudit.” A qudit is defined by the DSD of eigenspaces of an observable, just as classically, a distinction or dit is defined by the partition ${\left\{{f}^{1}\left(r\right)\right\}}_{r\in f\left(U\right)}$ determined a numerical attribute $f:U\to \mathbb{R}$. Then, there is a set partition $\pi ={\left\{{B}_{i}\right\}}_{i=1,\dots ,I}$ on the ON basis U so that B_{i} is a basis for the eigenspace of the eigenvalue ϕ_{i} and $\left{B}_{i}\right$ is the “ multiplicity” (dimension of the eigenspace) of the eigenvalue ϕ_{i} for i = 1,…,I. The eigenspaces V_{i} generated by the blocks B_{i} for the eigenvalues ϕ_{i} form a directsum decomposition of V. Note that the realvalued numerical attribute or eigenvalue function $f:U\to \mathbb{R}$ that takes each eigenvector in ${u}_{j}\in {B}_{i}\subseteq U$ to its eigenvalue ϕ_{i} so that ${f}^{1}\left({\varphi}_{i}\right)={B}_{i}$ contains all the information in the selfadjoint operator $F:V\to V$ since F can be reconstructed by defining it on the basis U as $F{u}_{j}=f\left({u}_{j}\right){u}_{j}$. The important informationtheoretic aspect of the eigenvalues is not their numerical value but when they are the same or different, and that information is there in the eigenspaces ${\left\{{V}_{i}\right\}}_{i\in I}$ of the directsum decomposition.^{11}
Classically, a dit of the partition ${\left\{{f}^{1}\left({\varphi}_{i}\right)\right\}}_{i\in I}$ on U, defined by $f:U\to \mathbb{R}$, is a pair $\left({u}_{k},{u}_{{k}^{\mathrm{\prime}}}\right)\in U\times U$ of points in distinct blocks of the partition, i.e., $f\left({u}_{k}\right)\ne f\left({u}_{{k}^{\mathrm{\prime}}}\right)$. Hence, a qudit of F is a pair $\left({u}_{k},{u}_{{k}^{\mathrm{\prime}}}\right)$ (interpreted as ${u}_{k}\otimes {u}_{{k}^{\mathrm{\prime}}}\in V\otimes V$) of vectors in the eigenbasis distinguished by F, i.e., $f\left({u}_{k}\right)\ne f\left({u}_{{k}^{\mathrm{\prime}}}\right)$ for the eigenvalue function $f:U\to \mathbb{R}$. Let $G:V\to V$ be another selfadjoint operator on V, which commutes with F so that we may then assume that U is an orthonormal basis of simultaneous eigenvectors of F and G ([75], p. 177). The assumption that F and G commute plays the role of considering partitions π = f ^{−1} for $f:U\to \mathbb{R}$ and σ = g ^{−1} for $g:U\to \mathbb{R}$ being defined on the same universe U. Let ${\left\{{\gamma}_{j}\right\}}_{j\in J}$ be the set of eigenvalues of G, and let $g:U\to \mathbb{R}$ be the eigenvalue function so a pair $\left({u}_{k},{u}_{{k}^{\mathrm{\prime}}}\right)$ is a qudit of G if $g\left({u}_{k}\right)\ne g\left({u}_{{k}^{\mathrm{\prime}}}\right)$, i.e., if the two eigenvectors have distinct eigenvalues of G.
As Kolmogorov suggested:
Information theory must precede probability theory, and not be based on it. By the very essence of this discipline, the foundations of information theory have a finite combinatorial character. ([76], p. 39)
In classical logical information theory, information is defined prior to probabilities by certain subsets (e.g., ditsets and differences between and intersections of ditsets) or, in the quantum case, quantum information is defined by certain subspaces prior to the introduction of any probabilities (unlike the case with Shannon or von Neumann entropies). Since the transition from classical to quantum logical information theory is straightforward, it will be first presented in a table (which does not involve any probabilities), where the qudits $\left({u}_{k},{u}_{{k}^{\mathrm{\prime}}}\right)$ are interpreted as ${u}_{k}\otimes {u}_{{k}^{\mathrm{\prime}}}$. The qudit space, the vectorspace analogue of the ditset, associated with F (the vectorspace analogue of $f:U\to \mathbb{R}$) is the subspace $\left[\mathrm{qudit}\left(F\right)\right]\subseteq V\otimes V$ generated by the qudits ${u}_{k}\otimes {u}_{{k}^{\mathrm{\prime}}}$ of F.
If F = λI is a scalar multiple of the identity I (the vectorspace analogue of a constant function $f:U\to \mathbb{R}$), then it has no qudits, so its qudit subspace [qudit(λI)] is the zero subspace (the analogue of the empty ditset of the indiscrete partition). The Common Dits Theorem says that any two nonempty ditsets have a nonzero intersection. In the quantum case, this means any two nonzero qudit spaces [qudit(F)] and [qudit(G)] for commuting F and G have a nonzero intersection, i.e., have a nonzero mutual information space. That is, for commuting F and G, there are always two simultaneous eigenvectors u_{k} and ${u}_{{k}^{\mathrm{\prime}}}$ that have different eigenvalues both by F and by G.
The observables do not provide the point probabilities in a measurement; the probabilities come from the pure (normalized) state ψ being measured. Let $\psi \rangle ={\sum}_{j=1}^{n}\mathrm{}\langle {u}_{j}\psi \rangle {u}_{j}\rangle ={\sum}_{j=1}^{n}\mathrm{}{\alpha}_{j}{u}_{j}\rangle $ be the resolution of $\psi \rangle $ in terms of the orthonormal basis U = {u_{1},…, u_{n}} of simultaneous eigenvectors for F and G. Then, ${p}_{j}={\alpha}_{j}{\alpha}_{j}^{\mathrm{*}}$ (${\alpha}_{j}^{\mathrm{*}}$ is the complex conjugate of α_{j}) for j = 1,…, n are the point probabilities on U, and the pure state density matrix $\rho \left(\psi \right)=\psi \rangle \langle \psi $ (where $\langle \psi ={\psi \rangle}^{\u2020}$ is the conjugatetranspose) has the entries: ${\rho}_{\mathrm{jk}}\left(\psi \right)={\alpha}_{j}{\alpha}_{k}^{\mathrm{*}}$, so the diagonal entries ${\rho}_{\mathrm{jj}}\left(\psi \right)={\alpha}_{j}{\alpha}_{j}^{\mathrm{*}}={p}_{j}$ are the point probabilities. Then we have Table 5 giving the remaining parallel development with the probabilities provided by the pure state ψ where we write $\rho {\left(\psi \right)}^{\u2020}\rho \left(\psi \right)$ as $\rho {\left(\psi \right)}^{2}$.
Ditsets and qudit subspaces without probabilities.
The definition of quantum logical entropy$$h\left(F:\psi \right)=\mathrm{tr}\left[{P}_{\left[\mathrm{qudit}\left(F\right)\right]}\rho \left(\psi \right)\otimes \rho \left(\psi \right)\right]$$(84)is just the quantum version of the formulation of the classical logical entropy$$h\left({f}^{1}\right)=h\left(\pi \right)=\sum _{\left({u}_{i},{u}_{j}\right)\in \mathrm{dit}\left(\pi \right)}\mathrm{}{p}_{i}{p}_{j}=\mathrm{tr}\left[{P}_{\mathrm{dit}\left(\pi \right)}\rho \left(\pi \right)\otimes \rho \left(\pi \right)\right]$$(85)for $f:U\to \mathbb{R}$ with the point probabilities (p_{1},..., p_{n}) on U and thus p×p on U×U. The tensor product $\rho \left(\psi \right)\otimes \rho \left(\psi \right)$ is an n ^{2} × n ^{2} matrix with the diagonal entries ${\left(\rho \left(\psi \right)\otimes \rho \left(\psi \right)\right)}_{\left(j,k\right),\left(j,k\right)}=\rho {\left(\psi \right)}_{\mathrm{jj}}\rho {\left(\psi \right)}_{\mathrm{kk}}={p}_{j}{p}_{k}$ where ${p}_{j}={\alpha}_{j}{\alpha}_{j}^{\mathrm{*}}$ for $\psi \rangle ={\sum}_{j=1}^{n}\mathrm{}\langle {u}_{j}\psi \rangle {u}_{j}\rangle ={\sum}_{j=1}^{n}\mathrm{}{\alpha}_{j}{u}_{j}\rangle $ where U = {u_{1},..., u_{n}} is an ON basis of eigenvectors of the observable F. The n ^{2} × n ^{2} diagonal projection matrix P_{[qudit(F)]} has a diagonal element ${\left({P}_{\left[\mathrm{qudit}\left(F\right)\right]}\right)}_{\left(j,k\right)\left(j,k\right)}=1$ if ${u}_{j}\otimes {u}_{k}\in \left[\mathrm{qudit}\left(F\right)\right]$. i.e., if the eigenvectors u_{j} and u_{k} have different eigenvalues, and 0 otherwise. Hence the product ${P}_{\left[\mathrm{qudit}\left(F\right)\right]}\rho \left(\psi \right)\otimes \rho \left(\psi \right)$ will pick out the products p_{j} p_{k} for ${u}_{j}\otimes {u}_{k}\in \left[\mathrm{qudit}\left(F\right)\right]$ and the trace will sum them. Hence we have the result that:$$\begin{array}{c}h\left(F:\psi \right)=\mathrm{tr}\left[{\mathrm{P}}_{\left[\mathrm{qudit}\left(F\right)\right]}\rho \left(\psi \right)\otimes \rho \left(\psi \right)\right]\\ =\sum _{j,k}\mathrm{}\left\{{p}_{j}{p}_{k}:{u}_{j}\otimes {u}_{k}\in \left[\mathrm{qudit}\left(F\right)\right]\right\}=\sum _{j,k}\mathrm{}\left\{{p}_{j}{p}_{k}:f\left({u}_{j}\right)\ne f\left({u}_{k}\right)\right\}\end{array}$$(86)where $f:U\to \mathbb{R}$ is the eigenvalue function taking each eigenvector to its eigenvalue.
With those preliminaries, the definitions in Table 6 might be better motivated and the statements clearer.
Probabilities applied to ditsets and qudit spaces.
Some basic results about quantum logical entropy
A selfadjoint operator F on V, i.e., an observable, alone defines the eigenvalue partition ${f}^{1}={\left\{{f}^{1}\left({\varphi}_{i}\right)\right\}}_{i\in I}$ on a basis U of ON eigenvectors for F. But the points have no probabilities associated with them. The probabilities are supplied by a normalized vector $\psi \rangle \in V$ as ${p}_{i}={\alpha}_{i}{\alpha}_{i}^{\mathrm{*}}$ for ${\alpha}_{i}=\langle {u}_{i}\psi \rangle $. Then we have a completely classical situation, a set partition f ^{1} on a set U with point probabilities provided by $\psi \rangle $ which will be denoted π(F:ψ). Hence that partition will have a (classical) logical entropy h(π(F:ψ)). Since the blocks in that π(F:ψ) partition on U are the sets of basis vectors each for a certain eigenvalue, the probabilities for those blocks are ${\sum}_{j}\mathrm{}\left\{{p}_{j}:f\left({u}_{j}\right)={\varphi}_{i}\right\}=\mathrm{Pr}\left({f}^{1}\left({\varphi}_{i}\right)\right)=\mathrm{Pr}\left({B}_{i}\right)$ for i = 1,..., I. Hence we have:$$\begin{array}{c}h\left(\pi \left(F:\psi \right)\right)=1\sum _{i\in I}\mathrm{}\mathrm{Pr}{\left({f}^{1}\left({\varphi}_{i}\right)\right)}^{2}=1\sum _{i\in I}\mathrm{}{\left(\sum _{j}\mathrm{}\left\{{p}_{j}:f\left({u}_{j}\right)={\varphi}_{i}\right\}\right)}^{2}\\ =1\sum _{i}\mathrm{}\left(\sum _{f\left({u}_{j}\right)={\varphi}_{i}}\mathrm{}{p}_{j}^{2}+\sum _{j\ne k}\mathrm{}\left\{{p}_{j}{p}_{k}:f\left({u}_{j}\right)={\varphi}_{i}=f\left({u}_{k}\right)\right\}\right)\\ =\sum _{j,k}\mathrm{}\left\{{p}_{j}{p}_{k}:f\left({u}_{j}\right)\ne f\left({u}_{k}\right)\right\}=h\left(F:\psi \right)=\mathrm{tr}\left[{P}_{\left[\mathrm{qudit}\left(F\right)\right]}\rho \left(\psi \right)\otimes \rho \left(\psi \right)\right]\end{array}$$(87)
And there is another way to arrive at this logical entropy, namely perform the Fmeasurement on the pure state density matrix ρ(ψ). The results of the Fmeasurement is given by the Lüders mixture operation ([68], p. 279) on the density matrix ρ(ψ). The block ${B}_{i}\in \pi \left(F:\psi \right)$ generates the eigenspace V_{i} corresponding to the eigenvalue ϕ_{i} so ${P}_{{V}_{i}}$ is the projection matrix to that eigenspace for i = 1, …, I. Then the Lüders mixture operation, representing the Fmeasurement of ψ, gives the mixed state density matrix:$$\widehat{\rho}\left(\psi \right)=\sum _{i\in I}\mathrm{}{P}_{{V}_{i}}\rho \left(\psi \right){P}_{{V}_{i}}.$$(88)
To show that $h\left(\widehat{\rho}\left(\psi \right)\right)=1\mathrm{tr}\left[\widehat{\rho}{\left(\psi \right)}^{2}\right]=h\left(\pi \left(F:\psi \right)\right)$ for $\widehat{\rho}\left(\psi \right)={\sum}_{i=1}^{I}\mathrm{}{P}_{{V}_{i}}\rho \left(\psi \right){P}_{{V}_{i}}$, we need to compute $\mathrm{tr}\left[\widehat{\rho}{\left(\psi \right)}^{2}\right]$. An offdiagonal element in ${\rho}_{\mathrm{jk}}\left(\psi \right)={\alpha}_{j}{\alpha}_{k}^{\mathrm{*}}$ of ρ(ψ) survives (i.e., is not zeroed and has the same value) the Lüders operation if and only if f(u_{j}) = f(u_{k}). Hence, the jth diagonal element of $\widehat{\rho}{\left(\psi \right)}^{2}$ is:$$\sum _{k=1}^{n}\mathrm{}\left\{{\alpha}_{j}^{\mathrm{*}}{\alpha}_{k}{\alpha}_{j}{\alpha}_{k}^{\mathrm{*}}:\varphi \left({u}_{j}\right)=\varphi \left({u}_{k}\right)\right\}=\sum _{k=1}^{n}\mathrm{}\left\{{p}_{j}{p}_{k}:f\left({u}_{j}\right)=f\left({u}_{k}\right)\right\}={p}_{j}\mathrm{Pr}\left({B}_{i}\right)$$(89)where u_{j}∈B_{i}. Then, grouping the jth diagonal elements for u_{j}∈B_{i} gives ${\sum}_{{u}_{j}}\mathrm{}{p}_{j}\mathrm{Pr}\left({B}_{i}\right)=\mathrm{Pr}{\left({B}_{i}\right)}^{2}$. Hence, the whole trace is: $\mathrm{tr}\left[\widehat{\rho}{\left(\psi \right)}^{2}\right]={\sum}_{i=1}^{I}\mathrm{}\mathrm{Pr}{\left({B}_{i}\right)}^{2}$, and thus:$$h\left(\widehat{\rho}\left(\psi \right)\right)=1\mathrm{tr}\left[\widehat{\rho}{\left(\psi \right)}^{2}\right]=1\sum _{i=1}^{I}\mathrm{}\mathrm{Pr}{\left({B}_{i}\right)}^{2}=h\left(F:\psi \right).$$(90)
This completes the proof of the following theorem which shows the different ways to characterize h(F: ψ).
Theorem 4.5.
$h\left(F:\psi \right)=h\left(\pi \left(F:\psi \right)\right)=h\left(\widehat{\rho}\left(\psi \right)\right).$
Like the classical join operation on partitions, quantum measurement creates distinctions, i.e., turns coherences into “decoherences,”^{12} which, classically, is the operation of distinguishing elements by classifying them according to some attribute like classifying the faces of a die by their parity. The fundamental theorem about quantum logical entropy and projective measurement, in the density matrix version, shows how the quantum logical entropy created (starting with h(ρ(ψ)) = 0 for the pure state ψ) by the measurement can be computed directly from the coherences of ρ(ψ) that are decohered in $\widehat{\rho}\left(\psi \right)$.
Theorem 4.6. Fundamental theorem about quantum measurement and logical entropy.
The increase in quantum logical entropy, $h\left(\widehat{\rho}\left(\psi \right)\right)=h\left(F:\psi \right)$ due to the Fmeasurement of the pure state ψ is the sum of the absolute squares of the nonzero offdiagonal terms (coherences) in ρ(ψ) (represented in an ON basis of Feigenvectors) that are zeroed (“decohered”) in the postmeasurement Lüders mixture density matrix $\widehat{\rho}\left(\psi \right)={\sum}_{i=1}^{I}{P}_{{V}_{i}}\rho \left(\psi \right){P}_{{V}_{i}}$ .
Proof. $h\left(\widehat{\rho}\left(\psi \right)\right)h\left(\rho \left(\psi \right)\right)=\left(1\mathrm{tr}\left[\widehat{\rho}{\left(\psi \right)}^{2}\right]\right)\left(1\mathrm{tr}\left[\rho {\left(\psi \right)}^{2}\right]\right)={\sum}_{j}\mathrm{}\left({\left{\rho}_{\mathrm{jk}}\left(\psi \right)\right}^{2}{\left\widehat{\rho}\left(\psi \right)\right}^{2}\right)$ since $\mathrm{tr}\left[{\rho}^{2}\right]={\sum}_{i}\mathrm{}{\left{\rho}_{\mathrm{ij}}\right}^{2}$ is the sum of the absolute squares of all the elements of ρ ([78], p. 77). If u_{j} and u_{k} are a qudit of F, then and only then are the corresponding offdiagonal terms zeroed by the Lüders mixture operation ${\sum}_{i=1}^{I}\mathrm{}{P}_{{V}_{i}}\rho \left(\psi \right){P}_{{V}_{i}}$ to obtain $\widehat{\rho}\left(\psi \right)$ from ρ(ψ). □
Example: For a simple quantum example, consider a system with two spinobservable σ eigenstates$\uparrow \rangle $ and $\downarrow \rangle $ (like electron spin up or down along the zaxis) where the given normalized superposition state is $\psi \rangle ={\alpha}_{\uparrow}\uparrow \rangle +{\alpha}_{\downarrow}\downarrow \rangle =\left[\begin{array}{l}{\alpha}_{\uparrow}\\ {\alpha}_{\downarrow}\end{array}\right]$ so the density matrix is $\rho \left(\psi \right)=\left[\begin{array}{ll}{p}_{\uparrow}& {\alpha}_{\uparrow}{\alpha}_{\downarrow}^{\mathrm{*}}\\ {\alpha}_{\downarrow}{\alpha}_{\uparrow}^{\mathrm{*}}& {p}_{\downarrow}\end{array}\right]$ where ${p}_{\uparrow}={\alpha}_{\uparrow}{\alpha}_{\uparrow}^{\mathrm{*}}$ and ${p}_{\downarrow}={\alpha}_{\downarrow}{\alpha}_{\downarrow}^{\mathrm{*}}$. Using the Lüders mixture operation, the measurement of that spinobservable σ goes from the pure state ρ(ψ) to$$\begin{array}{c}{P}_{\uparrow}\rho \left(\psi \right){P}_{\uparrow}+{P}_{\downarrow}\rho \left(\psi \right){P}_{\downarrow}=\left[\begin{array}{ll}1& 0\\ 0& 0\end{array}\right]\left[\begin{array}{ll}{p}_{\uparrow}& {\alpha}_{\uparrow}{\alpha}_{\downarrow}^{\mathrm{*}}\\ {\alpha}_{\downarrow}{\alpha}_{\uparrow}^{\mathrm{*}}& {p}_{\downarrow}\end{array}\right]\left[\begin{array}{ll}1& 0\\ 0& 0\end{array}\right]+\left[\begin{array}{ll}0& 0\\ 0& 1\end{array}\right]\left[\begin{array}{ll}{p}_{\uparrow}& {\alpha}_{\uparrow}{\alpha}_{\downarrow}^{\mathrm{*}}\\ {\alpha}_{\downarrow}{\alpha}_{\uparrow}^{\mathrm{*}}& {p}_{\downarrow}\end{array}\right]\left[\begin{array}{ll}0& 0\\ 0& 1\end{array}\right]\\ =\left[\begin{array}{ll}{p}_{\uparrow}& 0\\ 0& {p}_{\downarrow}\end{array}\right]=\widehat{\rho}\left(\psi \right)\end{array}.$$(91)
The logical entropy of any pure state such as ρ(ψ) is 0. The logical entropy of $\widehat{\rho}\left(\psi \right)$ is $h\left(\widehat{\rho}\left(\psi \right)\right)=1\mathrm{tr}\left[\widehat{\rho}{\left(\psi \right)}^{2}\right]=1{p}_{\uparrow}^{2}{p}_{\downarrow}^{2}$. The entries that were zeroed in the Lüders mixture operation were the two offdiagonal elements ${\alpha}_{\uparrow}{\alpha}_{\downarrow}^{\mathrm{*}}$ and ${\alpha}_{\downarrow}{\alpha}_{\uparrow}^{\mathrm{*}}$ so the sum of their absolute squares is $2{\alpha}_{\uparrow}{\alpha}_{\downarrow}^{\mathrm{*}}{\alpha}_{\downarrow}{\alpha}_{\uparrow}^{\mathrm{*}}=2{p}_{\uparrow}{p}_{\downarrow}$ which equals $1{p}_{\uparrow}^{2}{p}_{\downarrow}^{2}$ since $1={\left({p}_{\uparrow}+{p}_{\downarrow}\right)}^{2}={p}_{\uparrow}^{2}+{p}_{\downarrow}^{2}+2{p}_{\uparrow}{p}_{\downarrow}$.
Conclusions
The underlying thesis is that information is defined in terms of distinctions, differences, distinguishability, and diversity – or, with similar uses of the diprefix (which means “two”), discriminations, divisions, or differentiations. Yet those are all vague concepts so this notion of informationasdistinctions is made precise using the basic mathematical concept that represents differences and nondifferences (or equivalences), namely partitions (including the inverseimage partitions of random variables). The elements in the same block of a partition are similar or equivalent (block = equivalence class), and the ordered pairs of elements in different blocks are the distinctions or dits. Hence logical entropy measures informationasdistinctions by the probability measure on distinctions, so the logical entropy of a partition is the probability that a distinction of the partition is obtained in two independent draws from the underling universe set of elements. This notion of informationasdistinctions then encompasses the Shannon notion of entropy as the average minimum number of binary partitions (bits) that have to be joined to make the same distinctions of the partition. Moreover, there is the ditbit transform that derives all of Shannon’s definitions of entropy, joint entropy, conditional entropy, and mutual information from the corresponding definitions of logical entropy that are based on logical entropy being defined as a (probability) measure in the sense of measure theory. A few applications were discussed; distinguishing the Boltzmann and Shannon entropies, developing the MaxEntropy method with logical entropy, and showing how the metrical notion of logical entropy gives the notion of variance in statistical theory.
There is a method, linearization, to lift setbased concepts to the corresponding vectorspace concepts, and that provides the method to develop the corresponding quantum notions from the “classical” or nonquantum notions of logical entropy. There are two equivalent formulations of quantum mechanics; one using wave functions and the other using density matrices ([79], p. 102). But only one of those formulations maps naturally to the mathematics of partitions, namely the density matrix formulation.
At the beginning of our presentation, density matrices were foreshadowed by the box diagrams representing logical entropy. The box diagrams led to the incidence matrices for $\mathrm{indit}\left(\pi \right)$, or the complementary ones for $\mathrm{dit}\left(\pi \right)$, and then point probabilities are introduced into the matrices so that when normalized by their trace, the matrices are density matrices. In that manner, a reformulation of the classical logical entropy framework is first presented using density matrices over the real numbers to foreshadow the later quantum results over the complex numbers. Every density matrix over the complex numbers has a spectral decomposition into a probability mixture of orthogonal pure states which correspond classically to the disjoint blocks and block probabilities of a partition.
The fundamental theorem for logical entropy and measurement shows there is a simple, direct and quantitative connection between density matrices and logical entropy. The theorem directly connects the changes in the density matrix due to a projective measurement (sum of absolute squares of zeroed offdiagonal terms) with the increase in logical entropy due to the $F$measurement $h\left(F:\psi \right)=h\left(\widehat{\rho}\left(\psi \right)\right)$ (where $h\left(\rho \left(\psi \right)\right)=0$ for the pure state $\psi $). Moreover, the quantum logical entropy has a simple “ twodraw probability” interpretation, i.e., $h\left(F:\psi \right)=h\left(\widehat{\rho}\left(\psi \right)\right)$ is the probability that two independent Fmeasurements of ψ will yield distinct Feigenvalues, i.e., will yield a qudit of F. In contrast, the von Neumann entropy has no such simple interpretation, and there seems to be no such intuitive connection between pre and postmeasurement density matrices and von Neumann entropy, although von Neumann entropy also increases in a projective measurement ([79], Thm. 11.9, p. 515).
This direct quantitative connection between state discrimination and quantum logical entropy reinforces the judgment of Boaz Tamir and Eliahu Cohen [66, 80] that quantum logical entropy is a natural and informative entropy concept for quantum mechanics.
We find this framework of partitions and distinction most suitable (at least conceptually) for describing the problems of quantum state discrimination, quantum cryptography and in general, for discussing quantum channel capacity. In these problems, we are basically interested in a distance measure between such sets of states, and this is exactly the kind of knowledge provided by logical entropy (reference to [81]). ([80], p. 1])
In summary, the basic idea of information as distinctions, differences, distinguishability, and diversity is naturally quantified at the “classical” level in terms of logical entropy and then naturally linearized to the quantum notion of logical entropy.^{13}
Conflicts of interest
The author declares no conflict of interest.
Funding
This research did not receive any specific funding.
In category theory, the notion of a subset generalizes to the notion of a subobject or “part” and the “dual notion (obtained by reversing the arrows) of ‘part’ is the notion of partition” ([5], p. 85).
This is a restatement of the graphtheoretic result that the complement of any disconnected graph is connected ([48], p. 30). In terms of inditsets or equivalence relations E and E′, if E ∪ E′ = U × U, then E = U × U or E′ = U × U.
The construction is easy; take the seven atomic areas inside the three circles in Figure 9 as containing one point having the value assigned to that atomic area. The eighth area outside the three circles can have an arbitrary value – at least until another random variable appears.
The example was inspired by Eric Johnson’s excellent treatment of Boltzmann’s entropy [55].
These formulas, for the equiprobable case, were derived using the “difference method” by Zhang et al. [64] as new formulas for variance and covariance although the formulas may be much older “folk theorems.”
Since setconcepts can be formulated in vector spaces over ${\mathbb{Z}}_{2}$, that means there is a pedagogical or “toy” model of quantum mechanics over ${\mathbb{Z}}_{2}$, i.e., over sets [73].
That is why the quantum logic of DSDs [74] is essentially the quantum logic of observables – in much the same sense that the logic of partitions on U is essentially the logic of numerical attributes $f:U\to \mathbb{R}$ on U.
This notion of “decoherence” is used in an older sense, not the recent sense given by the work of Zurek [77] and others.
References
 Birkhoff G (1948), Lattice theory, American Mathematical Society, New York. [Google Scholar]
 Grätzer G (2003), General Lattice Theory, 2nd edn., Birkhäuser Verlag, Boston. [Google Scholar]
 Ellerman D (2010), The logic of partitions: introduction to the dual of the logic of subsets. Rev Symb Log 3, 287–350. https://doi.org/10.1017/S1755020310000018. [CrossRef] [Google Scholar]
 Ellerman D (2014), An introduction to partition logic. Log J IGPL 22, 94–125. https://doi.org/10.1093/jigpal/jzt036. [CrossRef] [Google Scholar]
 Lawvere FW, Rosebrugh R (2003), Sets for mathematics, Cambridge University Press, Cambridge, MA. [Google Scholar]
 Rota GC (2001), Twelve problems in probability no one likes to bring up, in: H. Crapo, D. Senato (Eds.), Algebraic combinatorics and computer science: a tribute to GianCarlo Rota, Springer, Milano, pp. 57–93. [Google Scholar]
 Rota GC (1998) Probability Vol. I & II: The guidi notes, MIT Copy Services, Cambridge, MA. [Google Scholar]
 Ellerman D (2021), The logical theory of canonical maps: the elements and distinctions analysis of the morphisms, duality, canonicity and universal constructions in Set. https://ArXiv.org, https://arxiv.org/abs/2104.08583. [Google Scholar]
 Halmos PR (1974), Measure theory, SpringerVerlag, New York. [Google Scholar]
 Rao KPSB, Rao MB (1983), Theory of charges: a study of finitely additive measures, Academic Press, London. [Google Scholar]
 Wilkins J (1707), Mercury or the secret and swift messenger, London. Original in 1641. [Google Scholar]
 Gleick J (2011), The information: a history, a theory, a flood, Pantheon, New York. [Google Scholar]
 Bateson G (1979), Mind and nature: a necessary unity, Dutton, New York. [Google Scholar]
 Gini C (1912), Variabilità e mutabilità, Tipografia di Paolo Cuppini, Bologna. [Google Scholar]
 Friedman WF (1922), The index of coincidence and its applications in cryptography, Riverbank Laboratories, Geneva IL. [Google Scholar]
 Kullback S (1976), Statistical methods in cryptanalysis, Aegean Park Press, Walnut Creek, CA. [Google Scholar]
 Rejewski M (1981), How Polish mathematicians deciphered the enigma. IEEE Ann Hist Comput 3, 213–234. [CrossRef] [Google Scholar]
 Simpson EH (1949), Measurement of diversity. Nature 163, 688. [CrossRef] [Google Scholar]
 Ricotta C, Szeidl L (2006), Towards a unifying approach to diversity measures: bridging the gap between the Shannon entropy and Rao’s quadratic index. Theor Popul Biol 70, 237–243. https://doi.org/10.1016/j.tpb.2006.06.003. [CrossRef] [PubMed] [Google Scholar]
 Nei M (1973), Analysis of Gene Diversity in subdivided populations. Proc Nat Acad Sci USA 70, 3321–3323. [CrossRef] [PubMed] [Google Scholar]
 Good IJ (1979), A.M. Turing’s statistical work in World War II. Biometrika 66, 393–396. [CrossRef] [Google Scholar]
 Good IJ (1982), Comment (on Patil and Taillie: diversity as a concept and its measurement). J Am Stat Assoc 77, 561–563. [Google Scholar]
 Stigler SM (1999), Statistics on the table, Harvard University Press, Cambridge. [Google Scholar]
 Hirschman AO (1945), National power and the structure of foreign trade, University of California Press, Berkeley. [Google Scholar]
 Herfindahl OC (1950), Concentration in the US Steel Industry, Unpublished Doctoral Dissertation, Columbia University. [Google Scholar]
 Rao CR (1982), Diversity and dissimilarity coefficients: a unified approach. Theor Popul Biol 21, 24–43. [CrossRef] [Google Scholar]
 Havrda J, Charvat F (1967), Quantification methods of classification processes: concept of structural αentropy. Kybernetika (Prague) 3, 30–35. [Google Scholar]
 Tsallis C (1988), Possible generalization for BoltzmannGibbs statistics. J Stat Phys 52, 479–487. [CrossRef] [Google Scholar]
 Brukner Č, Zeilinger A (2000), Operationally invariant information in quantum measurements. https://ArXiv.org. https://arxiv.org/abs/quantph/0005084, 19 May 2000. [Google Scholar]
 Brukner Č, Zeilinger A (2003), Information and fundamental elements of the structure of quantum theory, in: L. Castell, O. Ischebeck (Eds.), Time, quantum and information, SpringerVerlag, Berlin, pp. 323–354. [Google Scholar]
 Shannon CE (1948), A mathematical theory of communication. Bell Syst Tech J 27, 379–423, 623–656. [CrossRef] [Google Scholar]
 Shannon CE, Weaver W (1964), The mathematical theory of communication, University of Illinois Press, Urbana. [Google Scholar]
 Shannon CE (1993), The Bandwagon, in: N.J.A. Sloane, A.D. Wyner (Eds.), Claude E. Shannon: Collected Papers, IEEE Press, Piscataway, NJ, p. 462. [Google Scholar]
 Tribus M (1978), Thirty years of information theory, in: R.D. Levine, M. Tribus (Eds.), The maximum entropy formalism, MIT, Cambridge, MA, pp. 1–14. [Google Scholar]
 Shannon CE (1993), Some topics in information theory, in: N.J.A. Sloane, A.D. Wyner (Eds.), Claude E. Shannon: Collected Papers, IEEE Press, Piscataway, NJ, pp. 458–459. [Google Scholar]
 Ramshaw JD (2018), The Statistical Foundations of Entropy, World Scientific Publishing, Singapore. [Google Scholar]
 Lewis GN (1930), The Symmetry of Time in Physics. Science 71, 569–577. [CrossRef] [PubMed] [Google Scholar]
 Brillouin L (1962), Science and Information Theory, Academic Press, New York. [Google Scholar]
 Aczel J, Daroczy Z (1975), On Measures of Information and Their Characterization, Academic Press, New York. [Google Scholar]
 Campbell LL (1965), Entropy as a Measure. IEEE Trans Inform Theory IT11, 112–114. [Google Scholar]
 Doob JL (1994), Measure Theory, Springer Science+Business Media, New York. [Google Scholar]
 Polya G, Szego G (1998), Problems and Theorems in Analysis, Vol. II, SpringerVerlag, Berlin. [Google Scholar]
 Hu KT (1962), On the amount of information. Probability Theory and Its Applications 7, 439–447. https://doi.org/10.1137/1107041. [CrossRef] [Google Scholar]
 Ryser HJ (1963), Combinatorial Mathematics, Mathematical Association of America, Washington DC. [Google Scholar]
 Takacs L (1967), On the method of inclusion and exclusion. J Am Stat Assoc 62, 102–113. https://doi.org/10.1080/01621459.1967.10482891. [Google Scholar]
 Cover T, Thomas J (2006), Elements of information theory, 2nd edn., John Wiley and Sons, Hoboken, NJ. [Google Scholar]
 Csiszar I, Körner J (1981), Information theory: coding theorems for discrete memoryless systems, Academic Press, New York. [Google Scholar]
 Wilson RJ (1972), Introduction to graph theory, Longman, London. [Google Scholar]
 Rozeboom WW (1968), The theory of abstract partials: an introduction. Psychometrika 33, 133–167. [CrossRef] [PubMed] [Google Scholar]
 McGill WJ (1954), Multivariate information transmission. Trans IRE Prof Group Inform Theory 4, 93–111. https://doi.org/10.1109/TIT.1954.1057469. [CrossRef] [Google Scholar]
 Fano RM (1961), Transmission of Information, MIT Press, Cambridge, MA. [Google Scholar]
 Yeung RW (1991), A new outlook on Shannon’s information measures. IEEE Trans on Information Theory 37, 466–474. https://doi.org/10.1109/18.79902. [CrossRef] [Google Scholar]
 MacKay DJC (2003), Information theory, inference, and learning algorithms, Cambridge University Press, Cambridge, UK. [Google Scholar]
 Atkins P, de Paula J, Keeler J (2018), Atkins’ physical chemistry, 11th edn., Oxford University Press, Oxford UK. [Google Scholar]
 Johnson E (2018), Anxiety and the equation: Understanding Boltzmann’s entropy, MIT Press, Cambridge, MA. [Google Scholar]
 Jaynes ET (2003), Probability theory: The logic of science, Cambridge University Press, Cambridge, UK. [Google Scholar]
 Kaplan W (1999), Maxima and minima with applications: practical optimization and duality, John Wiley & Sons, New York. [Google Scholar]
 Best MJ (2017), Quadratic programming with computer programs, CRC Press, Boca Raton FL. [Google Scholar]
 Jaynes ET (1978), Where do we stand on maximum entropy? in: R.D. Levine, M. Tribus (Eds.), The Maximum Entropy Formalism, MIT, Cambridge, MA, pp. 15–118. [Google Scholar]
 Papoulis A (1990), Probability and statistics, PrenticeHall, Englewood Cliffs, NJ. [Google Scholar]
 Dantzig GB (1963), Linear programming and extensions, Princeton University Press, Princeton. [Google Scholar]
 Kullback S, Leibler RA (1951), On information and sufficiency. Ann Math Stat 22, 79–86. https://doi.org/10.1214/aoms/1177729694. [CrossRef] [Google Scholar]
 Rao CR (2010), Quadratic entropy and analysis of diversity. Sankhyā Indian J Stat 72A, 70–80. [Google Scholar]
 Zhang Y, Wu H, Cheng L (2012), Some new deformation formulas about variance and covariance, in: Proceedings of 2012 International Conference on Modelling, Identification and Control (ICMIC2012), pp. 987–992. [Google Scholar]
 McEliece RJ (1977), The theory of information and coding: a mathematical framework for communication (Encyclopedia of Mathematics and its Applications, Vol. 3). AddisonWesley, Reading, MA. [Google Scholar]
 Tamir B, Cohen E (2015), A Holevotype bound for a Hilbert Schmidt distance measure. J Quantum Inf Sci 5, 127–133. https://doi.org/10.4236/jqis.2015.54015. [CrossRef] [Google Scholar]
 Ellerman D (2018), Logical entropy: introduction to classical and quantum logical information theory. Entropy 20, 679. https://doi.org/10.3390/e20090679. [CrossRef] [Google Scholar]
 Auletta G, Fortunato M, Parisi G (2009), Quantum mechanics, Cambridge University Press, Cambridge, UK. [Google Scholar]
 Bennett CH (2003), Quantum information: qubits and quantum error correction. Int J Theor Phys 42, 153–176. https://doi.org/10.1023/A:1024439131297. [CrossRef] [Google Scholar]
 Jaeger G (2007), Quantum information: an overview, Springer Science+Business Media, New York. [Google Scholar]
 Manfredi G, Feix MR (2000), Entropy and Wigner Functions. Phys Rev E 62, 4665–4674. https://doi.org/10.1103/PhysRevE.62.4665. [CrossRef] [PubMed] [Google Scholar]
 Birkhoff G, Von Neumann J (1936), The logic of quantum mechanics. Ann Math 37, 823–843. [CrossRef] [Google Scholar]
 Ellerman D (2017), Quantum mechanics over sets: a pedagogical model with noncommutative finite probability theory as its quantum probability calculus. Synthese 194, 4863–4896. https://doi.org/10.1007/s1122901611750. [CrossRef] [Google Scholar]
 Ellerman D (2018), The quantum logic of directsum decompositions: the dual to the quantum logic of subspaces. Logic J IGPL 26, 1–13. https://doi.org/10.1093/jigpal/jzx026. [CrossRef] [Google Scholar]
 Hoffman K, Kunze R (1961), Linear algebra, PrenticeHall, Englewood Cliffs, NJ. [Google Scholar]
 Kolmogorov AN (1983), Combinatorial foundations of information theory and the calculus of probabilities. Russian Math Surv 38, 29–40. [CrossRef] [Google Scholar]
 Zurek WH (2003), Decoherence, einselection, and the quantum origins of the classical. Rev Modern Phys 75, 715–775. [CrossRef] [Google Scholar]
 Fano U (1957), Description of states in quantum mechanics by density matrix and operator techniques. Rev Mod Phys 29, 74–93. [CrossRef] [Google Scholar]
 Nielsen M, Chuang I (2000), Quantum computation and quantum information, Cambridge University Press, Cambridge. [Google Scholar]
 Tamir B, Cohen E (2014), Logical entropy for quantum states. https://ArXiv.org. http://arxiv.org/abs/1412.0616v2 [Google Scholar]
 Ellerman D (2009), Counting distinctions: on the conceptual foundations of Shannon’s Information Theory. Synthese 168, 119–149. https://doi.org/10.1007/s1122900893337. [CrossRef] [Google Scholar]
 Ellerman D (2021), New foundations for information theory: logical entropy and Shannon entropy, Springer Nature, Cham, Switzerland. [Google Scholar]
 Tamir B, Piava IL, SchwartzmanNowik Z, Cohen E (2021), Quantum logical entropy: fundamentals and general properties. https://ArXiv.org. https://arxiv.org/abs/2108.02726. [Google Scholar]
Cite this article as: Ellerman D 2022. Introduction to logical entropy and its relationship to Shannon entropy. 4open, 5, 1.
All Tables
The ditbit transform from the compound logical entropies to the corresponding Shannon entropies.
All Figures
Figure 1 Image subset and inverseimage partition of a function f: X→Y. 

In the text 
Figure 2 Logical entropy box diagram. 

In the text 
Figure 3 Box diagram for $h\left(X\right)=\sum \mathrm{}\left\{p\left(x,y\right)p\left({x}^{\mathrm{\prime}},{y}^{\mathrm{\prime}}\right):x\ne {x}^{\mathrm{\prime}}\right\}=\frac{8}{16}=\frac{1}{2}$ which can also be seen as a Venn diagram. 

In the text 
Figure 4 Box/Venn diagram for $h\left(Y\right)=\sum \mathrm{}\left\{p\left(x,y\right)p\left({x}^{\mathrm{\prime}},{y}^{\mathrm{\prime}}\right):y\ne {y}^{\mathrm{\prime}}\right\}=\frac{1}{2}$. 

In the text 
Figure 5 Union of Box/Venn diagrams for h(X) and h(Y) gives the box diagram for joint entropy $h\left(X,Y\right)=\frac{12}{16}=\frac{3}{4}$. 

In the text 
Figure 6 Illustrative Venn diagram for the compound logical entropies. 

In the text 
Figure 7 Box diagrams representing the two conditional logical entropies and the mutual logical information all with the value $\frac{1}{4}$. 

In the text 
Figure 8 Venn diagram mnemonic for the compound Shannon entropies. 

In the text 
Figure 9 Venn diagram for threeway negative Shannon mutual information I(X; Y; Z). 

In the text 
Figure 10 Venn diagram for logical entropy $h\left(Z\right)=\frac{8}{16}=\frac{1}{2}$. 

In the text 
Figure 11 The box diagrams for h(X), h(Y), and h(Z). 

In the text 
Figure 12 Solid circles = blocks of π, dashed circles = blocks of σ, and (u′, u′′) as a common dit to π and σ. 

In the text 
Figure 13 Three equiprobable binary partitions distinguish the 2^{3} = 8 leaves on the tree. 

In the text 
Figure 14 Ditbit transform and inequality: 1 – p ≤ log_{2} (1/p) for 0 < p ≤ 1. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.