Entropy coding
In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method must have an expected code length greater than or equal to the entropy of the source.[1]
More precisely, the source coding theorem states that for any source distribution, the expected code length satisfies $ \mathbb{E}_{x \sim P}[\ell(d(x))] \geq \mathbb{E}_{x \sim P}[-\log_b(P(x))] $, where $ \ell $ is the function specifying the number of symbols in a code word, $ d $ is the coding function, $ b $ is the number of symbols used to make output codes, and $ P $ is the probability of the source symbol. An entropy coding attempts to approach this lower bound.
Two of the most common entropy coding techniques are Huffman coding and arithmetic coding.[2] If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).
Since 2014, data compressors have started using the asymmetric numeral systems family of entropy coding techniques, which combine the compression ratio of arithmetic coding with a processing cost similar to that of Huffman coding.
Entropy as a measure of similarity
Besides using entropy coding as a way to compress digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data and already existing classes of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.
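This classification-by-compression idea can be sketched with Python's built-in zlib (DEFLATE combines LZ77 with Huffman coding, so it serves as a rough stand-in for a per-class entropy coder). The class names and training strings below are illustrative, not from any particular system:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size of data after DEFLATE compression (a stand-in for an entropy coder)."""
    return len(zlib.compress(data, 9))

def classify(unknown: bytes, classes: dict) -> str:
    """Assign `unknown` to the class whose training data compresses it best.

    For each class, compare the size of (training + unknown) compressed
    together against the training data compressed alone; a small increase
    means the class model already "explains" the unknown data well.
    """
    def extra_bytes(train: bytes) -> int:
        return compressed_size(train + unknown) - compressed_size(train)
    return min(classes, key=lambda name: extra_bytes(classes[name]))

classes = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "digits": b"0123456789" * 80,
}
print(classify(b"the lazy dog jumps over the quick brown fox", classes))  # english
```

Sharing whole substrings with the English training data makes the unknown stream nearly free to encode under that class's model, while it adds many novel literals to the digit stream.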
References
[edit]- ^ Duda, Jarek; Tahboub, Khalid; Gadgil, Neeraj J.; Delp, Edward J. (May 2015). "The use of asymmetric numeral systems as an accurate replacement for Huffman coding". 2015 Picture Coding Symposium (PCS). pp. 65–69. doi:10.1109/PCS.2015.7170048. ISBN 978-1-4799-7783-3. S2CID 20260346.
- ^ Huffman, David (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE. 40 (9). Institute of Electrical and Electronics Engineers (IEEE): 1098–1101. doi:10.1109/jrproc.1952.273898. ISSN 0096-8390.
External links
- Information Theory, Inference, and Learning Algorithms, by David MacKay (2003), gives an introduction to Shannon theory and data compression, including Huffman coding and arithmetic coding.
- Source Coding, by T. Wiegand and H. Schwarz (2011).
Introduction
Definition and Principles
Entropy coding is a lossless data compression technique that assigns variable-length binary codes to symbols from a finite alphabet based on their probabilities of occurrence, with more probable symbols receiving shorter codes to minimize the average number of bits required per symbol.[1] This approach exploits the statistical redundancy in the source data, ensuring that the encoded bitstream can be uniquely decoded back to the original sequence without any loss of information.[4]

The fundamental principles of entropy coding revolve around achieving an average code length that approaches the theoretical lower bound given by the source's entropy, while maintaining unique decodability of the codewords.[1] Codes must be designed to be instantaneous, meaning they are prefix-free (no codeword is a prefix of another), allowing the decoder to identify symbol boundaries immediately upon reading the bitstream without lookahead.[4] More broadly, the codes ensure unique decodability, where any concatenation of codewords corresponds to exactly one possible sequence of symbols, preventing ambiguity in reconstruction.[5] The Shannon entropy of the source serves as the irreducible limit for the average code length in lossless compression of a memoryless source.[6]

To illustrate the benefits, consider a simple alphabet with three symbols A, B, and C having probabilities 0.5, 0.25, and 0.25, respectively. A fixed-length coding scheme would require 2 bits per symbol (e.g., A: 00, B: 01, C: 10), yielding an average length of 2 bits.
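The entropy of this three-symbol source, which sets the target any code is measured against, can be computed directly; a minimal sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three-symbol source from the example: A, B, C.
probs = {"A": 0.5, "B": 0.25, "C": 0.25}
h = entropy(probs.values())
print(h)  # 1.5 -> no lossless code can average fewer bits per symbol
```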
In contrast, a variable-length prefix-free code can assign A: 0 (1 bit), B: 10 (2 bits), and C: 11 (2 bits), reducing the average length to 1.5 bits while remaining uniquely decodable.[4] Unlike dictionary-based compression methods such as LZ77, which exploit dependencies between symbols by replacing repeated strings with references, entropy coding operates solely on the independent statistical frequencies of symbols without modeling higher-order correlations.[7]

Historical Context
The foundations of entropy coding trace back to the development of information theory in the mid-20th century. In 1948, Claude Shannon published "A Mathematical Theory of Communication," where he introduced the concept of entropy as the fundamental lower bound on the average number of bits required to represent information from a source, establishing the theoretical limits for lossless data compression.[6] This work was later expanded in a 1949 book co-authored with Warren Weaver, who provided an accessible interpretation of Shannon's ideas for broader scientific audiences, emphasizing their implications for communication systems.[8]

Key milestones in the practical realization of entropy coding followed soon after. In the 1950s, Peter Elias contributed to early advancements in source coding at MIT, including predictive coding techniques that built on Shannon's principles to address efficient encoding for discrete sources.[9] This laid groundwork for David Huffman's 1952 algorithm, which provided an optimal method for constructing prefix-free codes that minimize redundancy based on symbol probabilities, marking a seminal advance in block-based entropy coding.[10] Later, in 1976, Jorma Rissanen developed arithmetic coding, patented the following year, which represented a breakthrough by enabling stream-based encoding that could approach Shannon's entropy limit more closely than Huffman codes for certain sources.[11]

The evolution of entropy coding shifted from block-oriented methods like Huffman to more flexible stream-based approaches during the 1980s and 1990s, driven by the need for higher efficiency in real-time applications.
Arithmetic coding gained practical traction in this period, overcoming earlier implementation challenges related to precision and complexity.[12] This transition facilitated integration into international standards, such as the JPEG image compression standard released in 1992, which employed Huffman coding as its baseline entropy method while optionally supporting arithmetic coding for improved performance.[13]

Entropy coding saw practical deployments in fax machines and modems starting in the 1980s, where Huffman-based schemes like Modified Huffman in CCITT Group 3 standards (adopted in 1980) enabled efficient transmission of binary images over telephone lines.[14] Open-source implementations emerged in the mid-1990s, exemplified by the zlib library released in 1995, which incorporated Huffman coding within the DEFLATE algorithm for widespread use in software compression.

Post-2010, entropy coding has been revitalized in deep learning contexts, particularly for neural network-based compression, where techniques like hyperprior models estimate probabilities for autoregressive entropy coders to achieve near-optimal rates in image and video tasks. Seminal works, such as Ballé et al.'s 2018 variational framework, demonstrated how learned entropy models could surpass traditional methods in end-to-end learned compression systems.[15]

Theoretical Foundations
Shannon's Entropy
Shannon entropy, denoted as $ H(X) $, quantifies the average uncertainty or information content associated with a random variable $ X $ taking values from a discrete alphabet with probabilities $ p_i $ for each symbol $ i $. Formally, it is defined as $ H(X) = -\sum_i p_i \log_2 p_i $, measured in bits when the logarithm is taken base 2.

Source Coding Theorem and Kraft Inequality
The source coding theorem, also known as Shannon's noiseless coding theorem, establishes the fundamental limits of lossless compression for discrete sources. For a discrete memoryless source with entropy $ H(X) $, the theorem states that the average codeword length $ \bar{L} $ of any uniquely decodable code satisfies $ \bar{L} \geq H(X) $, where lengths are measured in bits assuming a binary code alphabet.[6] This lower bound is approachable asymptotically: for block codes of length $ n $, there exist codes with $ \bar{L}/n \to H(X) $ as $ n \to \infty $, provided the source is memoryless.[6]

The theorem has weak and strong versions. The weak version applies to independent and identically distributed (i.i.d.) sources, proving that the entropy rate is both a lower bound on the compression rate and achievable in the limit using block coding.[17] The strong version extends this to stationary ergodic processes, where the entropy rate $ H $ (defined as the limit of the per-symbol entropy of $ n $-blocks) serves as the optimal compression rate, achievable with vanishing error probability for rates above $ H $.[17]

The Kraft inequality provides a necessary and sufficient condition for the existence of a binary prefix code with specified codeword lengths $ \ell_1, \ldots, \ell_m $: $ \sum_i 2^{-\ell_i} \leq 1 $.[18] This inequality ensures that the codewords do not overlap in the prefix sense, allowing instantaneous decoding. For a $ D $-ary alphabet, the generalized form is $ \sum_i D^{-\ell_i} \leq 1 $.[19]

A proof sketch uses a binary tree representation: each prefix code corresponds to leaves in a binary tree at depths $ \ell_i $, where no leaf is an ancestor of another. To show necessity, extend the tree to a full complete binary tree by adding dummy leaves; the sum $ \sum_i 2^{-\ell_i} $ then equals the proportion of leaves used by the original code, which cannot exceed 1.
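The inequality is easy to check numerically for candidate length sets; a small sketch (the length sets are illustrative):

```python
def kraft_sum(lengths, arity=2):
    """Left-hand side of the Kraft inequality: sum of arity**(-l) over codeword lengths."""
    return sum(arity ** -l for l in lengths)

# Lengths 1, 2, 2 (e.g. codes 0, 10, 11) exactly fill the binary tree:
print(kraft_sum([1, 2, 2]))  # 1.0 -> a prefix code with these lengths exists
# Lengths 1, 1, 2 oversubscribe the tree:
print(kraft_sum([1, 1, 2]))  # 1.25 -> no prefix code is possible
```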
Sufficiency follows by constructing the code via a tree where branches are allocated proportionally to the lengths satisfying the inequality.[20] The Kraft inequality implies that Huffman codes are optimal among prefix codes with integer lengths, as their lengths satisfy the inequality (often with equality for complete codes), achieving average length within 1 bit of the entropy bound. For example, consider a 3-symbol source with probabilities $ 1/2 $, $ 1/4 $, $ 1/4 $; code lengths 1, 2, 2 yield $ 2^{-1} + 2^{-2} + 2^{-2} = 1 $, verifying feasibility and optimality.[19]

McMillan's theorem generalizes the Kraft inequality to all uniquely decodable codes (not just prefix codes), stating that the same condition is necessary and sufficient for the existence of such a code with the given lengths.[21] This extension ensures the inequality applies broadly to lossless coding schemes. The Kraft inequality originated in the late 1940s, with Kraft's 1949 work building upon related ideas in Shannon's 1948 paper, stemming from early discussions on pulse coding at MIT.[18][6]

Core Techniques
Huffman Coding
Huffman coding is a foundational entropy coding technique that constructs variable-length prefix codes optimized for a known probability distribution of symbols, assigning shorter codes to more frequent symbols to minimize the average code length. Developed by David A. Huffman in 1952 as part of his doctoral work at MIT, the method builds a binary tree where the path from the root to a leaf represents the code for a symbol, ensuring no code is a prefix of another to enable instantaneous decoding.[22] This approach achieves an average code length that is at most the entropy plus one bit per symbol, though it is constrained to integer code lengths.[22]

The algorithm employs a bottom-up greedy strategy to construct the code tree using a priority queue (typically a min-heap) ordered by symbol probabilities. It begins by creating a leaf node for each symbol with its probability. Repeatedly, the two nodes with the smallest probabilities are extracted, merged into a new internal node with probability equal to their sum, and reinserted into the queue; this process continues until only the root remains. Code assignment occurs via a traversal of the final tree, appending '0' for left branches and '1' for right branches to determine each symbol's binary code.[22] The resulting tree satisfies the Kraft inequality with equality for complete codes, ensuring prefix-freeness.[22]

To illustrate, consider four symbols A, B, C, and D with probabilities 0.4, 0.3, 0.2, and 0.1, respectively. Initialize the priority queue with these leaf nodes. Merge the lowest-probability nodes D (0.1) and C (0.2) into an internal node X (0.3). The queue now holds A (0.4), B (0.3), and X (0.3); next, merge B (0.3) and X (0.3) into node Y (0.6). Finally, merge Y (0.6) and A (0.4) into the root (1.0). Traversing the tree (left as 0, right as 1) yields codes A: 0 (length 1), B: 10 (length 2), C: 110 (length 3), and D: 111 (length 3).
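The greedy merge can be sketched as follows. Note that tie-breaking among equal probabilities may swap which sibling receives '0' or '1', so the exact bit patterns can differ from the worked example above even though the code lengths agree:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code by repeatedly merging the two lightest nodes.

    `probs` maps symbols to probabilities; returns a symbol -> bit-string dict.
    """
    tick = count()  # tie-breaker so the heap never has to compare dicts
    # Heap entries: (probability, tie, {symbol: partial code bits})
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # lightest subtree: prefix '0'
        p1, _, right = heapq.heappop(heap)  # second lightest: prefix '1'
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

codes = huffman_codes({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1})
print(codes)  # lengths 1, 2, 3, 3 for A, B, C, D
```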
The average code length is $ 0.4 \times 1 + 0.3 \times 2 + 0.2 \times 3 + 0.1 \times 3 = 1.9 $ bits per symbol.[22]

Huffman coding exists in static and adaptive variants. In static Huffman coding, the probability distribution is assumed known beforehand, allowing the full tree to be built and transmitted once before encoding the data.[22] Adaptive Huffman coding, introduced by Robert G. Gallager in 1978, dynamically updates the tree during encoding based on observed symbol frequencies, enabling real-time adaptation to changing statistics without prior knowledge of the distribution. Another variant, canonical Huffman coding, generates codes deterministically from the sorted list of code lengths rather than storing the full tree, reducing overhead for transmission or storage in applications like image compression standards.[23]

Arithmetic Coding
Arithmetic coding is an entropy encoding technique that represents an entire message as a single fractional number within the unit interval [0, 1), achieving compression by allocating code space proportionally to symbol probabilities rather than assigning fixed-length codes to individual symbols.[24] Introduced by Jorma Rissanen in his 1976 paper on the generalized Kraft inequality and arithmetic coding, this method allows for fractional bit allocation per symbol, overcoming the inherent one-bit-per-symbol overhead of Huffman coding and approaching the theoretical entropy limit more closely for skewed probability distributions.[25] However, its early adoption was limited by patents held by IBM, which expired in the mid-1990s, allowing broader implementation thereafter.[11] The approach was further developed by Rissanen and Glenn G. Langdon in their 1979 work, which generalized the technique for practical implementation.

The mechanism of arithmetic coding begins by initializing the coding interval to [0, 1). For each symbol in the message, the interval is subdivided according to the symbol's probability distribution, and the subinterval corresponding to the actual symbol is selected, cumulatively narrowing the range. This subdivision uses the cumulative distribution function of the probabilities: if the current interval is [low, high) with width = high - low, the subinterval for a symbol with probability p_i starts at low + (cumulative probability of all preceding symbols) * width and has length p_i * width.[24] The process continues until the entire message is encoded, resulting in a final subinterval, any value of which (e.g., the midpoint) can serve as the code.
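The interval-narrowing update can be sketched with a toy floating-point implementation (real coders use fixed-precision integers with renormalization, as discussed below); the symbol model matches the example that follows:

```python
def narrow(low, high, symbol, model):
    """One arithmetic-coding step: shrink [low, high) to the symbol's slice.

    `model` is an ordered dict of symbol -> probability; a symbol's slice
    starts at the cumulative probability of all preceding symbols.
    """
    width = high - low
    cum = 0.0
    for sym, p in model.items():
        if sym == symbol:
            return low + cum * width, low + (cum + p) * width
        cum += p
    raise KeyError(symbol)

model = {"A": 0.5, "B": 0.25, "C": 0.25}
low, high = 0.0, 1.0
for s in "BAC":
    low, high = narrow(low, high, s, model)
print(low, high)  # 0.59375 0.625 -> any value in this interval encodes "BAC"
```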
Decoding reverses this by starting from the code value and iteratively identifying the symbol whose subinterval contains it, using the same probability model.[24]

To illustrate, consider encoding the message "BAC" over an alphabet {A, B, C} with probabilities P(A)=0.5, P(B)=0.25, P(C)=0.25. Start with [0, 1). For 'B' (cumulative probability up to A: 0.5), the subinterval is [0.5, 0.75). For 'A' within this (width 0.25, P(A)=0.5, so length 0.125, starting at 0.5), it becomes [0.5, 0.625). For 'C' within [0.5, 0.625) (width 0.125), the subintervals are A: [0.5, 0.5625), B: [0.5625, 0.59375), and C: [0.59375, 0.625); thus, select [0.59375, 0.625) for 'C'. A binary representation of a value in [0.59375, 0.625), such as 0.59375 (binary 0.10011), can be output as the code.[24]

Arithmetic coding requires arbitrary-precision arithmetic to represent the shrinking intervals accurately, as finite fixed-point representations can lead to precision loss and decoding errors over long messages.[24] To address this in practice, renormalization is employed: when the interval width falls below a threshold (e.g., $ 1/2^k $ for $ k $-bit precision), the interval is rescaled by shifting out leading bits, outputting those bits to the code stream and maintaining sufficient precision for streaming encoding and decoding.[24] This renormalization keeps the method efficient, typically adding negligible bits beyond the entropy bound.[24]

Advanced Variants (Range and ANS Coding)
Range coding is a practical implementation of arithmetic coding principles, adapting interval-based encoding to fixed-precision integer arithmetic for improved computational efficiency. Introduced by G. Nigel N. Martin in 1979, it maintains a current interval defined by a lower bound and a range width, typically using a fixed integer scale (commonly a power of two such as $ 2^{16} $ or $ 2^{32} $) to avoid floating-point operations. When encoding a symbol, the interval is subdivided proportionally to symbol probabilities, and the lower bound and range are updated accordingly. To prevent underflow, where the range becomes too small for adequate precision, renormalization occurs by shifting bits from the lower bound to the output bitstream and rescaling the range, effectively handling overflow in the lower bound by outputting whole bytes at once. This approach reduces multiplication overhead compared to full arithmetic coding while preserving near-optimal compression ratios.[26]

A key application of range coding appears in the JPEG 2000 standard, where the MQ-coder serves as an adaptive binary arithmetic coder built on range coding techniques to entropy-encode quantized wavelet coefficients. The MQ-coder employs 46 probability states per context for adaptation and performs renormalization through bit shifting, enabling efficient processing of binary symbols without floating-point arithmetic, which contributes to the standard's superior compression performance over JPEG for high-quality images.[27]

Asymmetric numeral systems (ANS) offer a further advancement, proposed by Jarek Duda in 2007 as a family of entropy coders that model encoding as probabilistic state transitions within a discrete state space.
Unlike arithmetic coding's continuous interval updates, ANS represents the encoded data as a state value in a predefined range, where each symbol emission or absorption corresponds to a deterministic mapping derived from symbol probabilities, often implemented via precomputed lookup tables for rapid table-based encoding and decoding.[28] This table-driven mechanism avoids multiplications entirely, relying instead on bit shifts and modular arithmetic, which enhances speed in software implementations, typically achieving rates comparable to Huffman coding while matching arithmetic coding's compression efficiency.[29]

ANS has seen adoption in modern compression tools, such as Facebook's Zstandard algorithm introduced in 2015, where variants like finite-state entropy (FSE) and streamed ANS enable fast, adaptive entropy coding atop dictionary-based stages. In neural network compression, post-2014 methods leverage ANS for entropy coding quantized weights and activations; for instance, end-to-end learned image compression frameworks use ANS to code symbols under learned hyperprior probability models, achieving state-of-the-art rate-distortion performance with low overhead. Key advantages include better parallelization potential due to independent state updates and reduced computational overhead, making ANS particularly suitable for resource-constrained environments like mobile inference.

Performance Evaluation
Compression Efficiency Metrics
The primary metric for evaluating the compression efficiency of an entropy code is the average code length $ \bar{L} $, defined as the expected number of bits required to encode a single symbol from the source, given by $ \bar{L} = \sum_i p_i \ell_i $, where $ p_i $ is the probability of symbol $ i $ and $ \ell_i $ is the length of its codeword. The redundancy of a code is the gap $ \bar{L} - H(X) $ between this average and the source entropy.

Comparative Analysis of Techniques
Entropy coding techniques exhibit distinct trade-offs in compression efficiency, computational complexity, and practical implementation. Huffman coding, while straightforward and optimal among prefix codes for a fixed distribution, incurs redundancy overhead, particularly for skewed distributions, where integer codeword constraints can cost up to 1 bit per symbol over the ideal.[36] In contrast, arithmetic coding and asymmetric numeral systems (ANS) achieve compression rates closer to the Shannon entropy limit with negligible redundancy, often approaching 0 bits per symbol, but at the cost of higher computational demands involving multiplication and state updates.[37] ANS mitigates some of arithmetic coding's slowdown by using table-based operations, offering speeds comparable to Huffman while retaining near-optimal efficiency.[37] These differences are summarized in the following comparison:

| Method | Average Redundancy (bits/symbol) | Time Complexity (Encoding) |
|---|---|---|
| Huffman | <1 (higher for skewed distributions) | O(n log n) build + O(n) |
| Arithmetic | ~0 (near entropy limit) | O(n) |
| ANS | ~0 (near entropy limit) | O(n) |
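The Huffman redundancy entry can be made concrete with a skewed binary source, where a Huffman code cannot spend less than one bit per symbol while the entropy (approachable by arithmetic coding or ANS) is far lower; a small sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A heavily skewed binary source: the shortest possible Huffman codes for
# two symbols are '0' and '1', so the average length is pinned at 1 bit.
p = [0.9, 0.1]
h = entropy(p)
huffman_avg = 1.0
print(f"entropy         = {h:.3f} bits/symbol")
print(f"Huffman average = {huffman_avg:.3f} bits/symbol")
print(f"redundancy      = {huffman_avg - h:.3f} bits/symbol")
```

Grouping symbols into blocks (or switching to arithmetic/ANS coding) recovers most of this roughly half-bit-per-symbol gap.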