Entropy coding
In information theory, an entropy coding (or entropy encoding) is any lossless data compression method that attempts to approach the lower bound declared by Shannon's source coding theorem, which states that any lossless data compression method must have an expected code length greater than or equal to the entropy of the source.[1]
More precisely, the source coding theorem states that for any source distribution, the expected code length satisfies $ \mathbb{E}_{x \sim P}[\ell(d(x))] \geq \mathbb{E}_{x \sim P}[-\log_b(P(x))] $, where $ \ell $ is the function specifying the number of symbols in a code word, $ d $ is the coding function, $ b $ is the number of symbols used to make output codes, and $ P $ is the probability of the source symbol. An entropy coding attempts to approach this lower bound.
Two of the most common entropy coding techniques are Huffman coding and arithmetic coding.[2] If the approximate entropy characteristics of a data stream are known in advance (especially for signal compression), a simpler static code may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding).
Since 2014, data compressors have started using the asymmetric numeral systems family of entropy coding techniques, which combine the compression ratio of arithmetic coding with a processing cost similar to that of Huffman coding.
Entropy as a measure of similarity
Besides using entropy coding as a way to compress digital data, an entropy encoder can also be used to measure the amount of similarity between streams of data and already existing classes of data. This is done by generating an entropy coder/compressor for each class of data; unknown data is then classified by feeding the uncompressed data to each compressor and seeing which compressor yields the highest compression. The coder with the best compression is probably the coder trained on the data that was most similar to the unknown data.
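This classification-by-compression idea can be sketched with Python's built-in zlib (DEFLATE combines LZ77 with Huffman coding, so it serves as a rough stand-in for a per-class entropy coder). The class names and training strings below are illustrative, not from any particular system:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size of data after DEFLATE compression (a stand-in for an entropy coder)."""
    return len(zlib.compress(data, 9))

def classify(unknown: bytes, classes: dict) -> str:
    """Assign `unknown` to the class whose training data compresses it best.

    For each class, compare the size of (training + unknown) compressed
    together against the training data compressed alone; a small increase
    means the class model already "explains" the unknown data well.
    """
    def extra_bytes(train: bytes) -> int:
        return compressed_size(train + unknown) - compressed_size(train)
    return min(classes, key=lambda name: extra_bytes(classes[name]))

classes = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "digits": b"0123456789" * 80,
}
print(classify(b"the lazy dog jumps over the quick brown fox", classes))  # english
```

Sharing whole substrings with the English training data makes the unknown stream nearly free to encode under that class's model, while it adds many novel literals to the digit stream.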
References
[edit]- ^ Duda, Jarek; Tahboub, Khalid; Gadgil, Neeraj J.; Delp, Edward J. (May 2015). "The use of asymmetric numeral systems as an accurate replacement for Huffman coding". 2015 Picture Coding Symposium (PCS). pp. 65–69. doi:10.1109/PCS.2015.7170048. ISBN 978-1-4799-7783-3. S2CID 20260346.
- ^ Huffman, David (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE. 40 (9). Institute of Electrical and Electronics Engineers (IEEE): 1098–1101. doi:10.1109/jrproc.1952.273898. ISSN 0096-8390.
External links
- Information Theory, Inference, and Learning Algorithms, by David MacKay (2003), gives an introduction to Shannon theory and data compression, including Huffman coding and arithmetic coding.
- Source Coding, by T. Wiegand and H. Schwarz (2011).
Introduction
Definition and Principles
Entropy coding is a lossless data compression technique that assigns variable-length binary codes to symbols from a finite alphabet based on their probabilities of occurrence, with more probable symbols receiving shorter codes to minimize the average number of bits required per symbol.[1] This approach exploits the statistical redundancy in the source data, ensuring that the encoded bitstream can be uniquely decoded back to the original sequence without any loss of information.[4]

The fundamental principles of entropy coding revolve around achieving an average code length that approaches the theoretical lower bound given by the source's entropy, while maintaining unique decodability of the codewords.[1] Codes must be designed to be instantaneous, meaning they are prefix-free (no codeword is a prefix of another), allowing the decoder to identify symbol boundaries immediately upon reading the bitstream without lookahead.[4] More broadly, the codes ensure unique decodability, where any concatenation of codewords corresponds to exactly one possible sequence of symbols, preventing ambiguity in reconstruction.[5] The Shannon entropy of the source serves as the irreducible limit for the average code length in lossless compression of a memoryless source.[6]

To illustrate the benefits, consider a simple alphabet with three symbols A, B, and C having probabilities 0.5, 0.25, and 0.25, respectively. A fixed-length coding scheme would require 2 bits per symbol (e.g., A: 00, B: 01, C: 10), yielding an average length of 2 bits.
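The entropy of this three-symbol source, which sets the target any code is measured against, can be computed directly; a minimal sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three-symbol source from the example: A, B, C.
probs = {"A": 0.5, "B": 0.25, "C": 0.25}
h = entropy(probs.values())
print(h)  # 1.5 -> no lossless code can average fewer bits per symbol
```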
In contrast, a variable-length prefix-free code can assign A: 0 (1 bit), B: 10 (2 bits), and C: 11 (2 bits), reducing the average length to 1.5 bits while remaining uniquely decodable.[4] Unlike dictionary-based compression methods such as LZ77, which exploit dependencies between symbols by replacing repeated strings with references, entropy coding operates solely on the independent statistical frequencies of symbols without modeling higher-order correlations.[7]

Historical Context
The foundations of entropy coding trace back to the development of information theory in the mid-20th century. In 1948, Claude Shannon published "A Mathematical Theory of Communication," where he introduced the concept of entropy as the fundamental lower bound on the average number of bits required to represent information from a source, establishing the theoretical limits for lossless data compression.[6] This work was later expanded in a 1949 book co-authored with Warren Weaver, who provided an accessible interpretation of Shannon's ideas for broader scientific audiences, emphasizing their implications for communication systems.[8]

Key milestones in the practical realization of entropy coding followed soon after. In the 1950s, Peter Elias contributed to early advancements in source coding at MIT, including predictive coding techniques that built on Shannon's principles to address efficient encoding for discrete sources.[9] This laid groundwork for David Huffman's 1952 algorithm, which provided an optimal method for constructing prefix-free codes that minimize redundancy based on symbol probabilities, marking a seminal advance in block-based entropy coding.[10] Later, in 1976, Jorma Rissanen developed arithmetic coding, patented the following year, which represented a breakthrough by enabling stream-based encoding that could approach Shannon's entropy limit more closely than Huffman codes for certain sources.[11]

The evolution of entropy coding shifted from block-oriented methods like Huffman to more flexible stream-based approaches during the 1980s and 1990s, driven by the need for higher efficiency in real-time applications.
Arithmetic coding gained practical traction in this period, overcoming earlier implementation challenges related to precision and complexity.[12] This transition facilitated integration into international standards, such as the JPEG image compression standard released in 1992, which employed Huffman coding as its baseline entropy method while optionally supporting arithmetic coding for improved performance.[13]

Entropy coding saw practical deployments in fax machines and modems starting in the 1980s, where Huffman-based schemes like Modified Huffman in CCITT Group 3 standards (adopted in 1980) enabled efficient transmission of binary images over telephone lines.[14] Open-source implementations emerged in the mid-1990s, exemplified by the zlib library released in 1995, which incorporated Huffman coding within the DEFLATE algorithm for widespread use in software compression.

Post-2010, entropy coding has been revitalized in deep learning contexts, particularly for neural network-based compression, where techniques like hyperprior models estimate probabilities for autoregressive entropy coders to achieve near-optimal rates in image and video tasks. Seminal works, such as Ballé et al.'s 2018 variational framework, demonstrated how learned entropy models could surpass traditional methods in end-to-end learned compression systems.[15]

Theoretical Foundations
Shannon's Entropy
Shannon entropy, denoted as $ H(X) $, quantifies the average uncertainty or information content associated with a random variable $ X $ taking values from a discrete alphabet with probabilities $ p_i $ for each symbol $ i $. Formally, it is defined as $ H(X) = -\sum_i p_i \log_2 p_i $, measured in bits when the logarithm is taken base 2.

Source Coding Theorem and Kraft Inequality
The source coding theorem, also known as Shannon's noiseless coding theorem, establishes the fundamental limits of lossless compression for discrete sources. For a discrete memoryless source with entropy $ H(X) $, the theorem states that the average codeword length $ \bar{L} $ of any uniquely decodable code satisfies $ \bar{L} \geq H(X) $, where lengths are measured in bits assuming a binary code alphabet.[6] This lower bound is approachable asymptotically: for block codes of length $ n $, there exist codes with $ \bar{L}/n \to H(X) $ as $ n \to \infty $, provided the source is memoryless.[6]

The theorem has weak and strong versions. The weak version applies to independent and identically distributed (i.i.d.) sources, proving that the entropy rate is both a lower bound on the compression rate and achievable in the limit using block coding.[17] The strong version extends this to stationary ergodic processes, where the entropy rate $ H $ (defined as the limit of the per-symbol entropy of $ n $-blocks) serves as the optimal compression rate, achievable with vanishing error probability for rates above $ H $.[17]

The Kraft inequality provides a necessary and sufficient condition for the existence of a binary prefix code with specified codeword lengths $ \ell_1, \ldots, \ell_m $: $ \sum_i 2^{-\ell_i} \leq 1 $.[18] This inequality ensures that the codewords do not overlap in the prefix sense, allowing instantaneous decoding. For a $ D $-ary alphabet, the generalized form is $ \sum_i D^{-\ell_i} \leq 1 $.[19]

A proof sketch uses a binary tree representation: each prefix code corresponds to leaves in a binary tree at depths $ \ell_i $, where no leaf is an ancestor of another. To show necessity, extend the tree to a full complete binary tree by adding dummy leaves; the sum $ \sum_i 2^{-\ell_i} $ then equals the proportion of leaves used by the original code, which cannot exceed 1.
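The inequality is easy to check numerically for candidate length sets; a small sketch (the length sets are illustrative):

```python
def kraft_sum(lengths, arity=2):
    """Left-hand side of the Kraft inequality: sum of arity**(-l) over codeword lengths."""
    return sum(arity ** -l for l in lengths)

# Lengths 1, 2, 2 (e.g. codes 0, 10, 11) exactly fill the binary tree:
print(kraft_sum([1, 2, 2]))  # 1.0 -> a prefix code with these lengths exists
# Lengths 1, 1, 2 oversubscribe the tree:
print(kraft_sum([1, 1, 2]))  # 1.25 -> no prefix code is possible
```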
Sufficiency follows by constructing the code via a tree where branches are allocated proportionally to the lengths satisfying the inequality.[20] The Kraft inequality implies that Huffman codes are optimal among prefix codes with integer lengths, as their lengths satisfy the inequality (often with equality for complete codes), achieving average length within 1 bit of the entropy bound. For example, consider a 3-symbol source with probabilities $ 1/2 $, $ 1/4 $, $ 1/4 $; code lengths 1, 2, 2 yield $ 2^{-1} + 2^{-2} + 2^{-2} = 1 $, verifying feasibility and optimality.[19]

McMillan's theorem generalizes the Kraft inequality to all uniquely decodable codes (not just prefix codes), stating that the same condition is necessary and sufficient for the existence of such a code with the given lengths.[21] This extension ensures the inequality applies broadly to lossless coding schemes. The Kraft inequality originated in the late 1940s, with Kraft's 1949 work building upon related ideas in Shannon's 1948 paper, stemming from early discussions on pulse coding at MIT.[18][6]

Core Techniques
Huffman Coding
Huffman coding is a foundational entropy coding technique that constructs variable-length prefix codes optimized for a known probability distribution of symbols, assigning shorter codes to more frequent symbols to minimize the average code length. Developed by David A. Huffman in 1952 as part of his doctoral work at MIT, the method builds a binary tree where the path from the root to a leaf represents the code for a symbol, ensuring no code is a prefix of another to enable instantaneous decoding.[22] This approach achieves an average code length that is at most the entropy plus one bit per symbol, though it is constrained to integer code lengths.[22]

The algorithm employs a bottom-up greedy strategy to construct the code tree using a priority queue (typically a min-heap) ordered by symbol probabilities. It begins by creating a leaf node for each symbol with its probability. Repeatedly, the two nodes with the smallest probabilities are extracted, merged into a new internal node with probability equal to their sum, and reinserted into the queue; this process continues until only the root remains. Code assignment occurs via a traversal of the final tree, appending '0' for left branches and '1' for right branches to determine each symbol's binary code.[22] The resulting tree satisfies the Kraft inequality with equality for complete codes, ensuring prefix-freeness.[22]

To illustrate, consider four symbols A, B, C, and D with probabilities 0.4, 0.3, 0.2, and 0.1, respectively. Initialize the priority queue with these leaf nodes. Merge the lowest-probability nodes D (0.1) and C (0.2) into an internal node X (0.3). The queue now holds A (0.4), B (0.3), and X (0.3); next, merge B (0.3) and X (0.3) into node Y (0.6). Finally, merge Y (0.6) and A (0.4) into the root (1.0). Traversing the tree (left as 0, right as 1) yields codes A: 0 (length 1), B: 10 (length 2), C: 110 (length 3), and D: 111 (length 3).
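The greedy merge can be sketched as follows. Note that tie-breaking among equal probabilities may swap which sibling receives '0' or '1', so the exact bit patterns can differ from the worked example above even though the code lengths agree:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code by repeatedly merging the two lightest nodes.

    `probs` maps symbols to probabilities; returns a symbol -> bit-string dict.
    """
    tick = count()  # tie-breaker so the heap never has to compare dicts
    # Heap entries: (probability, tie, {symbol: partial code bits})
    heap = [(p, next(tick), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # lightest subtree: prefix '0'
        p1, _, right = heapq.heappop(heap)  # second lightest: prefix '1'
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]

codes = huffman_codes({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1})
print(codes)  # lengths 1, 2, 3, 3 for A, B, C, D
```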
The average code length is $ 0.4 \times 1 + 0.3 \times 2 + 0.2 \times 3 + 0.1 \times 3 = 1.9 $ bits per symbol.[22]

Huffman coding exists in static and adaptive variants. In static Huffman coding, the probability distribution is assumed known beforehand, allowing the full tree to be built and transmitted once before encoding the data.[22] Adaptive Huffman coding, introduced by Robert G. Gallager in 1978, dynamically updates the tree during encoding based on observed symbol frequencies, enabling real-time adaptation to changing statistics without prior knowledge of the distribution. Another variant, canonical Huffman coding, generates codes deterministically from the sorted list of code lengths rather than storing the full tree, reducing overhead for transmission or storage in applications like image compression standards.[23]

Arithmetic Coding
Arithmetic coding is an entropy encoding technique that represents an entire message as a single fractional number within the unit interval [0, 1), achieving compression by allocating code space proportionally to symbol probabilities rather than assigning fixed-length codes to individual symbols.[24] Introduced by Jorma Rissanen in his 1976 paper on the generalized Kraft inequality and arithmetic coding, this method allows for fractional bit allocation per symbol, overcoming the inherent one-bit-per-symbol overhead of Huffman coding and approaching the theoretical entropy limit more closely for skewed probability distributions.[25] However, its early adoption was limited by patents held by IBM, which expired in the mid-1990s, allowing broader implementation thereafter.[11] The approach was further developed by Rissanen and Glenn G. Langdon in their 1979 work, which generalized the technique for practical implementation.

The mechanism of arithmetic coding begins by initializing the coding interval to [0, 1). For each symbol in the message, the interval is subdivided according to the symbol's probability distribution, and the subinterval corresponding to the actual symbol is selected, cumulatively narrowing the range. This subdivision uses the cumulative distribution function of the probabilities: if the current interval is [low, high) with width = high - low, the subinterval for a symbol with probability p_i starts at low + (cumulative probability of all preceding symbols) * width and has length p_i * width.[24] The process continues until the entire message is encoded, resulting in a final subinterval, any value of which (e.g., the midpoint) can serve as the code.
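The interval-narrowing update can be sketched with a toy floating-point implementation (real coders use fixed-precision integers with renormalization, as discussed below); the symbol model matches the example that follows:

```python
def narrow(low, high, symbol, model):
    """One arithmetic-coding step: shrink [low, high) to the symbol's slice.

    `model` is an ordered dict of symbol -> probability; a symbol's slice
    starts at the cumulative probability of all preceding symbols.
    """
    width = high - low
    cum = 0.0
    for sym, p in model.items():
        if sym == symbol:
            return low + cum * width, low + (cum + p) * width
        cum += p
    raise KeyError(symbol)

model = {"A": 0.5, "B": 0.25, "C": 0.25}
low, high = 0.0, 1.0
for s in "BAC":
    low, high = narrow(low, high, s, model)
print(low, high)  # 0.59375 0.625 -> any value in this interval encodes "BAC"
```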
Decoding reverses this by starting from the code value and iteratively identifying the symbol whose subinterval contains it, using the same probability model.[24]

To illustrate, consider encoding the message "BAC" over an alphabet {A, B, C} with probabilities P(A)=0.5, P(B)=0.25, P(C)=0.25. Start with [0, 1). For 'B' (cumulative probability up to A: 0.5), the subinterval is [0.5, 0.75). For 'A' within this (width 0.25, P(A)=0.5, so length 0.125, starting at 0.5), it becomes [0.5, 0.625). For 'C' within [0.5, 0.625) (width 0.125), the subintervals are A: [0.5, 0.5625), B: [0.5625, 0.59375), and C: [0.59375, 0.625); thus, select [0.59375, 0.625) for 'C'. A binary representation of a value in [0.59375, 0.625), such as 0.59375 (binary 0.10011), can be output as the code.[24]

Arithmetic coding requires arbitrary-precision arithmetic to represent the shrinking intervals accurately, as finite fixed-point representations can lead to precision loss and decoding errors over long messages.[24] To address this in practice, renormalization is employed: when the interval width falls below a threshold (e.g., $ 1/2^k $ for $ k $-bit precision), the interval is rescaled by shifting out leading bits, outputting those bits to the code stream and maintaining sufficient precision for streaming encoding and decoding.[24] This renormalization keeps the method efficient, typically adding negligible bits beyond the entropy bound.[24]

Advanced Variants (Range and ANS Coding)
Range coding is a practical implementation of arithmetic coding principles, adapting interval-based encoding to fixed-precision integer arithmetic for improved computational efficiency. Introduced by G. Nigel N. Martin in 1979, it maintains a current interval defined by a lower bound and a range width, typically using a fixed integer scale (commonly a power of two such as $ 2^{16} $ or $ 2^{32} $) to avoid floating-point operations. When encoding a symbol, the interval is subdivided proportionally to symbol probabilities, and the lower bound and range are updated accordingly. To prevent underflow, where the range becomes too small for adequate precision, renormalization occurs by shifting bits from the lower bound to the output bitstream and rescaling the range, effectively handling overflow in the lower bound by outputting whole bytes at once. This approach reduces multiplication overhead compared to full arithmetic coding while preserving near-optimal compression ratios.[26]

A key application of range coding appears in the JPEG 2000 standard, where the MQ-coder serves as an adaptive binary arithmetic coder built on range coding techniques to entropy-encode quantized wavelet coefficients. The MQ-coder employs 46 probability states per context for adaptation and performs renormalization through bit shifting, enabling efficient processing of binary symbols without floating-point arithmetic, which contributes to the standard's superior compression performance over JPEG for high-quality images.[27]

Asymmetric numeral systems (ANS) offer a further advancement, proposed by Jarek Duda in 2007 as a family of entropy coders that model encoding as probabilistic state transitions within a discrete state space.
Unlike arithmetic coding's continuous interval updates, ANS represents the encoded data as a state value in a predefined range, where each symbol emission or absorption corresponds to a deterministic mapping derived from symbol probabilities, often implemented via precomputed lookup tables for rapid table-based encoding and decoding.[28] This table-driven mechanism avoids multiplications entirely, relying instead on bit shifts and modular arithmetic, which enhances speed in software implementations, typically achieving rates comparable to Huffman coding while matching arithmetic coding's compression efficiency.[29]

ANS has seen adoption in modern compression tools, such as Facebook's Zstandard algorithm introduced in 2015, where variants like finite-state entropy (FSE) and streamed ANS enable fast, adaptive entropy coding atop dictionary-based stages. In neural network compression, post-2014 methods leverage ANS for entropy coding quantized weights and activations; for instance, end-to-end learned image compression frameworks use ANS to code symbols under learned hyperprior probability models, achieving state-of-the-art rate-distortion performance with low overhead. Key advantages include better parallelization potential due to independent state updates and reduced computational overhead, making ANS particularly suitable for resource-constrained environments like mobile inference.

Performance Evaluation
Compression Efficiency Metrics
The primary metric for evaluating the compression efficiency of an entropy code is the average code length $ \bar{L} $, defined as the expected number of bits required to encode a single symbol from the source, given by $ \bar{L} = \sum_i p_i \ell_i $, where $ p_i $ is the probability of symbol $ i $ and $ \ell_i $ is the length of its codeword. The redundancy of a code is the gap $ \bar{L} - H(X) $ between this average and the source entropy.

Comparative Analysis of Techniques
Entropy coding techniques exhibit distinct trade-offs in compression efficiency, computational complexity, and practical implementation. Huffman coding, while straightforward and optimal among prefix codes for a fixed distribution, incurs redundancy overhead, particularly for skewed distributions, where integer codeword constraints can cost up to 1 bit per symbol over the ideal.[36] In contrast, arithmetic coding and asymmetric numeral systems (ANS) achieve compression rates closer to the Shannon entropy limit with negligible redundancy, often approaching 0 bits per symbol, but at the cost of higher computational demands involving multiplication and state updates.[37] ANS mitigates some of arithmetic coding's slowdown by using table-based operations, offering speeds comparable to Huffman while retaining near-optimal efficiency.[37] These differences are summarized in the following comparison:

| Method | Average Redundancy (bits/symbol) | Time Complexity (Encoding) |
|---|---|---|
| Huffman | <1 (higher for skewed distributions) | O(n log n) build + O(n) |
| Arithmetic | ~0 (near entropy limit) | O(n) |
| ANS | ~0 (near entropy limit) | O(n) |
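The Huffman redundancy entry can be made concrete with a skewed binary source, where a Huffman code cannot spend less than one bit per symbol while the entropy (approachable by arithmetic coding or ANS) is far lower; a small sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A heavily skewed binary source: the shortest possible Huffman codes for
# two symbols are '0' and '1', so the average length is pinned at 1 bit.
p = [0.9, 0.1]
h = entropy(p)
huffman_avg = 1.0
print(f"entropy         = {h:.3f} bits/symbol")
print(f"Huffman average = {huffman_avg:.3f} bits/symbol")
print(f"redundancy      = {huffman_avg - h:.3f} bits/symbol")
```

Grouping symbols into blocks (or switching to arithmetic/ANS coding) recovers most of this roughly half-bit-per-symbol gap.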