Binary entropy function
from Wikipedia
Entropy of a Bernoulli trial (in shannons) as a function of binary outcome probability, called the binary entropy function.

In information theory, the binary entropy function, denoted $\operatorname{H}(p)$ or $\operatorname{H}_\text{b}(p)$, is defined as the entropy of a Bernoulli process (i.i.d. binary variable) with probability $p$ of one of two values, and is given by the formula:

$\operatorname{H}(p) = -p \log p - (1 - p) \log (1 - p).$

The base of the logarithm corresponds to the choice of units of information; base e corresponds to nats and is mathematically convenient, while base 2 (binary logarithm) corresponds to shannons and is conventional (as shown in the graph); explicitly:

$\operatorname{H}_\text{b}(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$

Note that the values at 0 and 1 are given by the limit $0 \log 0 := \lim_{x \to 0^+} x \log x = 0$ (by L'Hôpital's rule); and that "binary" refers to two possible values for the variable, not the units of information.

When $p = \tfrac{1}{2}$, the binary entropy function attains its maximum value, 1 shannon (1 binary unit of information); this is the case of an unbiased coin flip. When $p = 0$ or $p = 1$, the binary entropy is 0 (in any units), corresponding to no information, since there is no uncertainty in the variable.
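As a quick sanity check of these values, here is a minimal Python sketch of the base-2 definition above (the function name is illustrative, not from any particular library):

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H_b(p) in shannons (bits), with H_b(0) = H_b(1) = 0."""
    if p == 0.0 or p == 1.0:
        return 0.0  # limit value: x * log x -> 0 as x -> 0+
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0   (unbiased coin flip)
print(binary_entropy(0.0))  # 0.0   (no uncertainty)
print(binary_entropy(0.9))  # ~0.469
```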

Notation


Binary entropy $\operatorname{H}_\text{b}(p)$ is a special case of $\mathrm{H}(X)$, the entropy function. $\operatorname{H}_\text{b}(p)$ is distinguished from the entropy function $\mathrm{H}(X)$ in that the former takes a single real number as a parameter whereas the latter takes a distribution or random variable as a parameter. Thus the binary entropy (of $p$) is the entropy of the distribution $\operatorname{Bernoulli}(p)$, so $\operatorname{H}_\text{b}(p) = \mathrm{H}(\operatorname{Bernoulli}(p))$.

Writing the probability of each of the two values being $p$ and $q$, so $p + q = 1$ and $q = 1 - p$, this corresponds to

$\operatorname{H}(p, q) = -p \log_2 p - q \log_2 q = \operatorname{H}_\text{b}(p).$

Sometimes the binary entropy function is also written as $\operatorname{H}_2(p)$. However, it is different from and should not be confused with the Rényi entropy of order 2, which is denoted $\mathrm{H}_2(X)$.

Explanation


In terms of information theory, entropy is considered to be a measure of the uncertainty in a message. To put it intuitively, suppose $p = 0$. At this probability, the event is certain never to occur, and so there is no uncertainty at all, leading to an entropy of 0. If $p = 1$, the result is again certain, so the entropy is 0 here as well. When $p = 1/2$, the uncertainty is at a maximum; if one were to place a fair bet on the outcome in this case, there is no advantage to be gained with prior knowledge of the probabilities. In this case, the entropy is maximum at a value of 1 bit. Intermediate values fall between these cases; for instance, if $p = 3/4$, there is still a measure of uncertainty on the outcome, but one can still predict the outcome correctly more often than not, so the uncertainty measure, or entropy, is less than 1 full bit.
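For instance, at $p = 3/4$ the value works out to about 0.811 bits:

$\operatorname{H}_\text{b}(3/4) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.311 + 0.500 = 0.811.$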

Properties


Derivative


The derivative of the binary entropy function may be expressed as the negative of the logit function:

$\frac{d}{dp} \operatorname{H}_\text{b}(p) = -\operatorname{logit}_2(p) = -\log_2\!\left(\frac{p}{1-p}\right).$
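A quick numerical check of this identity (a sketch; the helper names are illustrative): the centered finite difference of $\operatorname{H}_\text{b}$ should match $-\operatorname{logit}_2(p)$.

```python
import math

def h(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def neg_logit2(p):  # -logit_2(p) = -log2(p / (1 - p))
    return -math.log2(p / (1 - p))

p, eps = 0.3, 1e-6
finite_diff = (h(p + eps) - h(p - eps)) / (2 * eps)  # numerical derivative of h at p
print(finite_diff, neg_logit2(p))  # both ~1.222
```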

Convex conjugate


The convex conjugate (specifically, the Legendre transform) of the binary entropy (with base e) is the negative softplus function. This is because (following the definition of the Legendre transform: the derivatives are inverse functions) the derivative of negative binary entropy is the logit, whose inverse function is the logistic function, which is the derivative of softplus.

Softplus can be interpreted as logistic loss, so by duality, minimizing logistic loss corresponds to maximizing entropy. This justifies the principle of maximum entropy as loss minimization.

Taylor series


The Taylor series of the binary entropy function at 1/2 is

$\operatorname{H}_\text{b}(p) = 1 - \frac{1}{2 \ln 2} \sum_{n=1}^{\infty} \frac{(1 - 2p)^{2n}}{n(2n-1)},$

which converges to the binary entropy function for all values $0 \leq p \leq 1$.
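As an illustration, a short Python sketch comparing the truncated series against the exact function (truncation depths chosen arbitrarily):

```python
import math

def h(p):  # exact binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def h_taylor(p, terms=10):
    # 1 - (1 / (2 ln 2)) * sum_{n>=1} (1 - 2p)^(2n) / (n (2n - 1))
    s = sum((1 - 2 * p) ** (2 * n) / (n * (2 * n - 1)) for n in range(1, terms + 1))
    return 1 - s / (2 * math.log(2))

print(h(0.4), h_taylor(0.4))        # ~0.971 for both; 10 terms suffice near 1/2
print(h(0.1), h_taylor(0.1, 200))   # ~0.469; more terms needed away from 1/2
```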

Bounds


The following bounds hold for $0 < p < 1$:[1]

$\ln(p)\,\ln(1-p) \;\leq\; \operatorname{H}_\text{b}(p) \;\leq\; \log_2(p)\,\log_2(1-p)$

and

$4p(1-p) \;\leq\; \operatorname{H}_\text{b}(p) \;\leq\; \bigl(4p(1-p)\bigr)^{1/\ln 4},$

where $\ln$ denotes the natural logarithm.
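A numerical spot-check of the bounds as stated above, on a few sample points (a sketch, not a proof):

```python
import math

def h(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.01, 0.1, 0.3, 0.5, 0.7, 0.99]:
    lower1 = math.log(p) * math.log(1 - p)        # ln(p) ln(1-p)
    upper1 = math.log2(p) * math.log2(1 - p)      # log2(p) log2(1-p)
    lower2 = 4 * p * (1 - p)
    upper2 = (4 * p * (1 - p)) ** (1 / math.log(4))
    assert lower1 <= h(p) <= upper1
    assert lower2 <= h(p) <= upper2

print("all bounds hold on the sample grid")
```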

from Grokipedia
The binary entropy function, often denoted $H(p)$, quantifies the average uncertainty, or information content, of a binary random variable that takes the value 1 with probability $p$ (and 0 with probability $1-p$), where $0 \leq p \leq 1$. It is mathematically defined by the formula
$H(p) = -p \log_2 p - (1-p) \log_2 (1-p),$
with the convention that $0 \log_2 0 = 0$. This function, a special case of Shannon entropy for two outcomes, serves as a foundational measure in information theory for assessing the uncertainty of a binary source and the information required to encode it.
The binary entropy function exhibits several key properties that highlight its role in probabilistic modeling. It is symmetric around $p = 0.5$, concave, and achieves its maximum value of 1 bit at $p = 0.5$, corresponding to maximum uncertainty in a fair coin flip. At the endpoints, $H(0) = H(1) = 0$, reflecting zero uncertainty for deterministic outcomes. These characteristics make $H(p)$ a concave function that bounds the entropy of any binary distribution and facilitates analysis in optimization problems.

Introduced by Claude Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," the binary entropy function emerged as part of the broader framework for quantifying information transmission over noisy channels. Shannon demonstrated that it represents the fundamental limit on the efficiency of encoding binary sources, influencing the development of source coding theorems. Its definition aligns with the entropy of an ideal binary source, providing a baseline for measuring information loss or gain in communication systems.

In applications, the binary entropy function is central to calculating channel capacities, such as that of the binary symmetric channel, where the capacity is given by $C = 1 - H(p)$ and $p$ is the crossover probability, determining the maximum reliable transmission rate. It also appears in error-correcting codes, data compression algorithms such as Huffman coding for binary alphabets, and analyses in statistics and machine learning, where it evaluates model uncertainty or feature importance. Beyond communications, extensions of $H(p)$ inform other fields by modeling binary decision entropy.

Definition and Interpretation

Mathematical Definition

The binary entropy function is the Shannon entropy of a Bernoulli random variable taking values in {0, 1} with success probability $p$. It is conventionally defined on the domain $[0, 1]$ using the base-2 logarithm to measure information in bits:

$h(p) = \begin{cases} -p \log_2 p - (1 - p) \log_2 (1 - p) & 0 < p < 1, \\ 0 & p = 0 \text{ or } p = 1. \end{cases}$

The values at the endpoints follow from continuity: direct substitution yields the indeterminate form $0 \cdot (-\infty)$, but the relevant limits satisfy $\lim_{p \to 0^+} p \log_2 p = 0$ and $\lim_{p \to 1^-} (1 - p) \log_2 (1 - p) = 0$.

In mathematical analyses, a variant using the natural logarithm is often employed, yielding entropy in nats:

$H(p) = \begin{cases} -p \ln p - (1 - p) \ln (1 - p) & 0 < p < 1, \\ 0 & p = 0 \text{ or } p = 1, \end{cases}$

with endpoint values again obtained by limits using $\lim_{p \to 0^+} p \ln p = 0$ and $\lim_{p \to 1^-} (1 - p) \ln (1 - p) = 0$. The two forms are related by the change-of-base formula $h(p) = H(p) / \ln 2$, since $\log_2 y = \ln y / \ln 2$ for $y > 0$.
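A small Python sketch illustrating the change-of-base relation $h(p) = H(p)/\ln 2$ (function names are illustrative):

```python
import math

def h_bits(p):  # binary entropy in bits (base 2)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def H_nats(p):  # binary entropy in nats (base e)
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

p = 0.2
print(h_bits(p), H_nats(p) / math.log(2))  # identical: h(p) = H(p) / ln 2, ~0.722
```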

Probabilistic Interpretation

The binary entropy function represents the Shannon entropy $H(X)$ of a Bernoulli random variable $X$ with success probability $p$, where $X$ takes the value 1 with probability $p$ and 0 with probability $1-p$. This measures the average number of bits required to encode the possible outcomes of $X$ using an optimal code, providing a fundamental limit on the efficiency of data compression for such binary sources.

Probabilistically, the binary entropy quantifies the average uncertainty, or surprise, associated with the outcomes of $X$. It is expressed as the expectation $\mathbb{E}[-\log_2 P(X)]$, where $-\log_2 P(X = x)$ denotes the self-information of outcome $x$, which captures the surprise or informational value of observing that specific binary event: the rarer the event, the higher its self-information. Thus, $h(p)$ averages these self-informations over the distribution of $X$, serving as a measure of the inherent unpredictability of binary probabilistic events.

For illustration, when $p = 0.5$, as in a fair coin flip, $h(0.5) = 1$ bit, signifying the maximum achievable entropy for a binary random variable, where each outcome requires a full bit to specify on average. In contrast, for $p = 0$ or $p = 1$, $h(p) = 0$ bits, indicating complete certainty with no information needed to encode the deterministic outcome.
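To illustrate the expectation interpretation, a Monte Carlo sketch (sample size and seed chosen arbitrarily) comparing the empirical average self-information with $h(p)$:

```python
import math
import random

def h(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.3
random.seed(0)
samples = [1 if random.random() < p else 0 for _ in range(100_000)]

# average self-information -log2 P(X = x) over the drawn samples
avg_surprise = sum(-math.log2(p if x == 1 else 1 - p) for x in samples) / len(samples)
print(avg_surprise, h(p))  # both ~0.881 bits
```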

Mathematical Properties

Symmetry and Basic Bounds

The binary entropy function $h(p)$, defined for $p \in [0,1]$, exhibits a fundamental symmetry property: $h(p) = h(1-p)$. This arises directly from the functional form $h(p) = -p \log_2 p - (1-p) \log_2 (1-p)$, where substituting $1-p$ for $p$ yields the identical expression, reflecting the equivalence in uncertainty between a binary event with success probability $p$ and one with success probability $1-p$. This symmetry implies that the function is invariant under reflection across $p = 0.5$, so that its graph is a mirror image on either side of the midpoint. As a result, the behavior of $h(p)$ for $p < 0.5$ precisely mirrors that for $p > 0.5$, which simplifies analyses in information-theoretic contexts where binary probabilities are interchangeable.

The function is bounded between 0 and 1: $0 \leq h(p) \leq 1$ for all $p \in [0,1]$. The lower bound of 0 is achieved at the endpoints $p = 0$ and $p = 1$, corresponding to complete certainty (no uncertainty), while the upper bound of 1 is attained uniquely at $p = 0.5$, representing maximum uncertainty in a fair coin flip. Graphically, $h(p)$ forms a bell-shaped curve symmetric about $p = 0.5$, starting at 0 when $p = 0$, rising smoothly to its peak of 1 at $p = 0.5$, and then descending symmetrically back to 0 at $p = 1$. This concave profile underscores the function's role in quantifying balanced versus imbalanced binary uncertainties.

Derivatives and Critical Points

The first derivative of the binary entropy function $h(p) = -p \log_2 p - (1-p) \log_2 (1-p)$ for $p \in (0,1)$ is derived using the product and chain rules. Differentiating the first term gives $-\log_2 p - p \cdot \frac{1}{p \ln 2} = -\log_2 p - \frac{1}{\ln 2}$, and the second term yields $\log_2 (1-p) + (1-p) \cdot \frac{1}{(1-p) \ln 2} = \log_2 (1-p) + \frac{1}{\ln 2}$. The constants cancel, resulting in $h'(p) = \log_2 \left( \frac{1-p}{p} \right)$. This expression shows that $h'(p) > 0$ for $p < 0.5$ (the function is increasing) and $h'(p) < 0$ for $p > 0.5$ (the function is decreasing), consistent with the function rising from $h(0) = 0$ to a peak and then falling to $h(1) = 0$.

The critical point occurs where $h'(p) = 0$, so $\log_2 \left( \frac{1-p}{p} \right) = 0$, implying $\frac{1-p}{p} = 1$ and thus $p = 0.5$. To confirm this is a maximum, the second derivative is computed by differentiating $h'(p)$: $h''(p) = \frac{d}{dp} \left[ \frac{\ln(1-p) - \ln p}{\ln 2} \right] = \frac{1}{\ln 2} \left( -\frac{1}{1-p} - \frac{1}{p} \right) = -\frac{1}{p(1-p) \ln 2}$. Since $h''(p) < 0$ for all $p \in (0,1)$, the function is strictly concave, verifying that $p = 0.5$ is a global maximum where $h(0.5) = 1$.

Near the endpoints, the asymptotic behavior of the first derivative reveals the steepness: as $p \to 0^+$, $\frac{1-p}{p} \to +\infty$, so $h'(p) \to +\infty$; as $p \to 1^-$, $\frac{1-p}{p} \to 0^+$, so $h'(p) \to -\infty$. This indicates rapid initial growth from zero and a sharp decline toward one, underscoring the function's peaked shape around the midpoint.
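A brief numerical confirmation of the second-derivative formula at the peak (step size chosen arbitrarily):

```python
import math

def h(p):  # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p, eps = 0.5, 1e-4
# centered second difference approximates h''(p)
second_diff = (h(p + eps) - 2 * h(p) + h(p - eps)) / eps**2
print(second_diff, -1 / (p * (1 - p) * math.log(2)))  # both ~ -5.77 = -4 / ln 2
```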

Series Expansions

The Taylor series of the binary entropy function around $p = 0.5$ provides a useful polynomial approximation for values near the maximum point, where the function is symmetric and reaches its peak of 1 bit. This expansion is particularly valuable in information theory for analyzing small deviations from balanced probabilities. The series is given by
$h(p) = 1 - \frac{1}{2 \ln 2} \sum_{n=1}^{\infty} \frac{(1-2p)^{2n}}{n (2n-1)},$
which converges for $p \in (0,1)$. The leading quadratic correction ($n=1$) is $-\frac{2}{\ln 2} (p - 0.5)^2$, so $h(p) \approx 1 - \frac{2}{\ln 2}(p - 0.5)^2$ near the peak, consistent with the second derivative $h''(0.5) = -\frac{4}{\ln 2}$.

Near the endpoint $p = 0$, the binary entropy function exhibits rapid decay, and a simple asymptotic approximation captures the dominant behavior for small $p > 0$. Substituting the first-order Taylor expansion $\log_2(1-p) \approx -p / \ln 2$ into the definition yields
$h(p) \approx p \log_2 \frac{1}{p} + \frac{p}{\ln 2}.$
This leading-order form, equivalent to $-p \log_2 p + p \log_2 e$, arises directly from the series expansion $-\log_2(1-p) \approx p / \ln 2$, neglecting higher-order terms of order $O(p^2)$. Higher-order expansions can include additional terms such as $-p^2 / (2 \ln 2)$, but the given approximation suffices for many asymptotic analyses.

These series representations do not typically involve Bernoulli numbers directly, though related entropy functions or integrals (e.g., in logistic distributions) may connect to them via polylogarithms or generating functions; no standard expression for $h(p)$ uses Bernoulli numbers explicitly. The expansions are instrumental for numerical computation, enabling efficient evaluation via truncated polynomials without evaluating logarithms near singularities or the peak, and for asymptotic analysis in contexts like channel capacity bounds or large-deviation principles in coding theory, where closed-form approximations simplify proofs and rate computations.
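A short sketch comparing the small-$p$ approximation with the exact function; the error shrinks roughly like $p^2$:

```python
import math

def h(p):  # exact binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.1, 0.01, 0.001]:
    approx = p * math.log2(1 / p) + p / math.log(2)  # p log2(1/p) + p log2(e)
    print(p, h(p), approx)  # e.g. p=0.01: 0.0808 vs 0.0809
```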

Convexity and Conjugate

The binary entropy function $h(p) = -p \log_2 p - (1-p) \log_2 (1-p)$ for $p \in [0,1]$ is concave. This concavity follows from the second-derivative test: the first derivative is $h'(p) = \log_2 \frac{1-p}{p}$, and the second derivative is $h''(p) = -\frac{1}{p(1-p) \ln 2} < 0$ for $p \in (0,1)$, confirming that the function is strictly concave on its domain. This property is fundamental in information theory, as it underpins many standard inequalities for entropy measures.

Since $h(p)$ is concave, the negative binary entropy $-h(p)$ is a convex function on $[0,1]$, which is particularly relevant in optimization problems within information theory, such as rate-distortion theory, where minimizing rate subject to distortion constraints leverages the convexity of rate functions derived from $-h(p)$. In rate-distortion settings, the convexity of $-h(p)$ ensures that the achievable rate-distortion curve is convex, facilitating duality-based solutions and bounds on achievable rates.

The convex conjugate (Fenchel-Legendre transform) of the convex function $f(p) = -h(p)$ (using the natural logarithm for the entropy definition to align with standard convex analysis, scaled appropriately for base 2) is given by $f^*(\lambda) = \sup_{p \in [0,1]} \{\lambda p + h(p)\}$, which evaluates to the softplus function $\log(1 + e^{\lambda})$ (for the natural-log version; for base 2, it scales by $1/\ln 2$ to $\log_2(1 + 2^{\lambda})$). The maximizer $p^*(\lambda) = \frac{1}{1 + e^{-\lambda}}$ (the sigmoid function) corresponds to the tilted distribution in the exponential family. This conjugate arises in the dual formulation of maximum entropy problems, where maximizing $h(p)$ subject to moment constraints $\mathbb{E}[g(X)] = \mu$ yields a dual objective involving $f^*(\lambda)$, enabling efficient computation via Lagrange multipliers. In practice, this duality transforms entropy maximization into unconstrained optimization over the dual variables $\lambda$, with the conjugate providing the closed-form expression for the partition function in binary exponential families.
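A numerical sketch of the conjugacy claim (natural-log version): a brute-force grid search over $p$ should recover the softplus value, with the maximizer near the sigmoid of $\lambda$ (grid resolution is arbitrary):

```python
import math

def H_nats(p):  # binary entropy in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def conjugate(lam, grid=10_000):
    # numerically evaluate sup_p [lam * p + H(p)] over a uniform grid on [0, 1]
    return max(lam * (i / grid) + (H_nats(i / grid) if 0 < i < grid else 0.0)
               for i in range(grid + 1))

for lam in [-2.0, 0.0, 1.5]:
    softplus = math.log(1 + math.exp(lam))
    sigmoid = 1 / (1 + math.exp(-lam))  # the maximizing p
    print(lam, conjugate(lam), softplus, sigmoid)
```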

Applications and Extensions

In Information Theory

The binary entropy function serves as a fundamental quantity in information theory, particularly in quantifying the limits of communication over noisy channels and the efficiency of source coding. For the binary symmetric channel (BSC), a channel in which bits are transmitted with crossover probability $p$ (the probability that a 0 is received as 1 or vice versa), the capacity $C$—the supremum of reliable transmission rates in bits per channel use—is given by
$C = 1 - h(p),$
where $h(p) = -p \log_2 p - (1-p) \log_2 (1-p)$. This expression, first derived by Shannon, reflects how noise reduces the effective information throughput from the noiseless binary channel's capacity of 1 bit, with $h(p)$ measuring the uncertainty introduced by the symmetric errors; the capacity is maximized at $p = 0$ (where $C = 1$) and drops to 0 as $p$ approaches 0.5.

In source coding, the binary entropy establishes the theoretical minimum for lossless compression of binary memoryless sources. For a Bernoulli source emitting symbols 0 and 1 with probability $p$ of a 1, the entropy $h(p)$ bits per symbol represents the average information content, serving as the lower bound on the expected codeword length of any uniquely decodable code; the source coding theorem guarantees that rates approaching $h(p)$ are achievable in the asymptotic limit of long sequences. Huffman coding, an algorithmic method for constructing optimal prefix codes, attains an average length $L$ satisfying $h(p) \leq L < h(p) + 1$ per source symbol, thereby approaching the limit closely for typical probability distributions and enabling efficient encoding of binary data streams in practice.

The binary entropy also relates to binary hypothesis testing, where it bounds the information extractable about a binary hypothesis from observations. In this framework, if the prior distribution over the two hypotheses has entropy $h(\pi)$ for prior probability $\pi$, the mutual information $I(H; Y)$ between the hypothesis $H$ and the observation $Y$ satisfies $I(H; Y) \leq h(\pi)$, quantifying the reduction in uncertainty about the hypothesis; this bound underscores the binary entropy's role in assessing detection performance, as developed in information-theoretic analyses of testing that link error exponents to divergence measures.

Furthermore, the binary entropy features prominently in the asymptotic equipartition property (AEP) for binary sequences, which describes the concentration of probability mass on typical sequences from i.i.d. Bernoulli sources. For $n$ independent draws from a Bernoulli($p$) distribution, the AEP asserts that with probability approaching 1 as $n \to \infty$, the empirical frequency of 1's is near $p$, and the sequences in the typical set—those with type probabilities close to the source distribution—have roughly equal probability $2^{-n h(p)}$, while the set's size is approximately $2^{n h(p)}$; this equipartition enables the partitioning of the sequence space into entropy-sized bins, directly supporting the achievability of optimal compression for binary sources by identifying the compressible typical sequences.
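A minimal sketch of the BSC capacity formula $C = 1 - h(p)$ (values shown are approximate):

```python
import math

def h(p):  # binary entropy in bits, with h(0) = h(1) = 0
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p, in bits per use."""
    return 1 - h(p)

for p in [0.0, 0.11, 0.5]:
    print(p, bsc_capacity(p))  # 1.0, ~0.5, 0.0
```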

In Statistics and Machine Learning

In machine learning, the binary cross-entropy loss serves as a fundamental objective for training models on binary classification problems, given by
$L(y, \hat{y}) = -\left[ y \log_2 \hat{y} + (1 - y) \log_2 (1 - \hat{y}) \right],$
where $y \in \{0, 1\}$ is the true binary label and $\hat{y} \in (0, 1)$ is the model's predicted probability of the positive class. This loss measures the dissimilarity between the predicted probabilities and the true labels, and its expected value over the true distribution approaches the binary entropy $h(p)$ when the predictions align with the true probability $p$; any deviation increases the expected loss beyond the inherent uncertainty captured by $h(p)$.

In logistic regression, which models binary outcomes via the logistic (sigmoid) function, parameter estimation relies on maximum likelihood, equivalent to minimizing the average binary cross-entropy loss across the training data. The negative log-likelihood of a set of independent binary observations is the sum of the individual cross-entropy terms, ensuring that the fitted model maximizes the probability of the observed labels under the Bernoulli assumption. This approach yields interpretable probabilistic predictions and underpins many generalized linear models for binary data.

The connection to the Kullback-Leibler (KL) divergence for binary distributions further highlights the binary entropy's role: the cross-entropy $H(P, Q) = -\sum_i p_i \log_2 q_i$ decomposes as $H(P, Q) = h(P) + D_{\text{KL}}(P \| Q)$, where $h(P)$ is the binary entropy of the true distribution $P$ and $D_{\text{KL}}(P \| Q)$ quantifies the additional information loss from using $Q$ to approximate $P$. Because $h(P)$ is fixed by the data, minimizing the cross-entropy is equivalent to minimizing the KL divergence, promoting predictions that efficiently encode the data's uncertainty.

In variational inference for probabilistic models with binary latent variables, the binary entropy arises in the evidence lower bound (ELBO), where it contributes the entropy term of the variational posterior over Bernoulli-distributed latents. This term encourages diversity in the approximate posterior, balancing reconstruction fidelity with regularization via the KL divergence to the prior, as seen in mean-field approximations for binary latent-variable models such as topic models.
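A short sketch verifying the decomposition $H(P, Q) = h(P) + D_{\mathrm{KL}}(P \| Q)$ for Bernoulli distributions (base-2 logarithms, matching the loss above; the chosen $p$ and $q$ are arbitrary):

```python
import math

def h(p):  # binary entropy of Bernoulli(p), in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def cross_entropy(p, q):  # expected binary cross-entropy when P(X=1)=p and the model predicts q
    return -p * math.log2(q) - (1 - p) * math.log2(1 - q)

def kl(p, q):  # D_KL(Bernoulli(p) || Bernoulli(q)), in bits
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

p, q = 0.3, 0.45
print(cross_entropy(p, q), h(p) + kl(p, q))  # equal, ~0.949 bits
```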

Historical Development

Origins in Information Theory

The binary entropy function was first introduced by Claude Shannon in his seminal 1948 paper "A Mathematical Theory of Communication," where it served as a measure of uncertainty or information content for binary sources, specifically in the context of binary digits, or bits. Shannon developed this concept as part of a broader framework for quantifying the information transmitted through communication systems, laying the groundwork for information theory. In the paper, he presented entropy as a fundamental quantity that captures the average information produced by a stochastic source, with the binary case emerging naturally from considerations of discrete events with two possible outcomes.

The motivation for introducing the binary entropy stemmed from practical challenges in early communication technologies, such as telegraphy, where messages were encoded in binary signals and susceptible to noise that could alter or corrupt the transmission. Shannon sought to address the fundamental problem of reproducing a message at the receiver despite such distortions in binary channels, drawing on the statistical structure of messages and signals to define reliable communication limits. This focus on binary channels was influenced by the era's telegraph systems, which operated on on-off keying akin to binary digits, and by the need to model noise as probabilistic errors in these simple yet prevalent setups.

Shannon defined the entropy using the logarithm, with the base-2 logarithm yielding units in bits for the binary case, expressed as $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$, where $p$ is the probability of one outcome. He employed the notation $H(p)$ for this binary case, emphasizing its role in calculating the average information per symbol of a binary source. This choice reflected the paper's emphasis on practical applicability to binary communication engineering.

Subsequent Developments

In the 1960s, Alfréd Rényi generalized the binary entropy function through the introduction of the Rényi entropy of order $\alpha$, defined for a binary probability distribution with probabilities $p$ and $1-p$ as $H_\alpha(p) = \frac{1}{1-\alpha} \log_2 \left( p^\alpha + (1-p)^\alpha \right)$ for $\alpha \neq 1$, which reduces to the Shannon binary entropy in the limit as $\alpha$ approaches 1. This generalization preserved additivity for independent events while allowing tunable sensitivity to the underlying probabilities, finding applications in robust statistics and non-extensive systems.

During the 1970s, the binary entropy function gained prominence in computer science for analyzing the efficiency of data compression algorithms, serving as a theoretical lower bound on the average number of bits required to encode binary sources. This adoption extended to algorithm design in areas like universal coding, where the function quantified redundancy and informed adaptive encoding strategies. In later decades, numerical implementations of the binary entropy function appeared in scientific computing software, enabling efficient computation and visualization; these tools supported simulations in communications and signal processing, where iterative evaluations informed system design without manual derivation.

Post-2000, the binary von Neumann entropy—equivalent to the binary Shannon entropy for a qubit state with eigenvalues $p$ and $1-p$—has become central to quantum information theory, quantifying mixedness and correlations in quantum states. Seminal works, such as Nielsen and Chuang's 2000 textbook, applied it to derive capacities of quantum channels and protocols, where it establishes bounds analogous to classical source coding theorems. Subsequent research has leveraged it across quantum information processing, highlighting its role in measuring quantum uncertainty beyond classical limits.
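To illustrate the limiting behavior, a brief sketch showing the binary Rényi entropy approaching the Shannon value as $\alpha \to 1$ (sample values are arbitrary):

```python
import math

def shannon_h(p):  # binary Shannon entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def renyi_h(p, alpha):
    # H_alpha(p) = (1 / (1 - alpha)) * log2(p^alpha + (1 - p)^alpha), alpha != 1
    return math.log2(p**alpha + (1 - p)**alpha) / (1 - alpha)

p = 0.3
for alpha in [0.5, 0.9, 0.99, 1.001]:
    print(alpha, renyi_h(p, alpha))  # approaches shannon_h(p) ~ 0.881 as alpha -> 1
print(shannon_h(p))
```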