Prefix code
A prefix code is a type of code system distinguished by its possession of the prefix property, which requires that no whole code word in the system is a prefix (initial segment) of any other code word in the system. The property is trivially satisfied by fixed-length codes, so it is only a point of consideration for variable-length codes.
For example, a code with code words {9, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because 5 is a prefix of 59 and also of 55. A prefix code is a uniquely decodable code: given a complete and accurate sequence, a receiver can identify each word without requiring a special marker between words. However, there are uniquely decodable codes that are not prefix codes; for instance, the reverse of a prefix code is still uniquely decodable (it is a suffix code), but it is not necessarily a prefix code.
Prefix codes are also known as prefix-free codes, prefix condition codes and instantaneous codes. Although Huffman coding is just one of many algorithms for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes", even when the code was not produced by a Huffman algorithm. The term comma-free code is sometimes also applied as a synonym for prefix-free codes[1][2] but in most mathematical books and articles (e.g.[3][4]) a comma-free code is used to mean a self-synchronizing code, a subclass of prefix codes.
Using prefix codes, a message can be transmitted as a sequence of concatenated code words, without any out-of-band markers or (alternatively) special markers between words to frame the words in the message. The recipient can decode the message unambiguously by repeatedly finding and removing sequences that form valid code words. This is not generally possible with codes that lack the prefix property, for example {0, 1, 10, 11}: a receiver reading a 1 at the start of a code word would not know whether that was the complete code word 1, or merely the prefix of the code word 10 or 11; so the string 10 could be interpreted either as a single code word or as the concatenation of the words 1 then 0.
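As an illustration of this decoding procedure, the following short Python sketch (not drawn from the cited sources; the function names and the symbol assignment are chosen for the example) checks the prefix property and decodes a bit string by repeatedly stripping a leading code word:

def is_prefix_free(codewords):
    """Return True if no code word is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

def decode(bits, code):
    """Decode a bit string by repeatedly removing a leading code word.

    `code` maps code words to symbols and is assumed to be prefix-free,
    so at most one code word can match at each step.
    """
    symbols = []
    while bits:
        for word, symbol in code.items():
            if bits.startswith(word):
                symbols.append(symbol)
                bits = bits[len(word):]
                break
        else:
            raise ValueError("remaining bits do not start with any code word")
    return symbols

print(is_prefix_free({"0", "1", "10", "11"}))              # False: 1 prefixes 10 and 11
print(decode("01011", {"0": "A", "10": "B", "11": "C"}))   # ['A', 'B', 'C']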
The variable-length Huffman codes, country calling codes, the country and publisher parts of ISBNs, the Secondary Synchronization Codes used in the UMTS W-CDMA 3G Wireless Standard, and the instruction sets (machine language) of most computer microarchitectures are prefix codes.
Prefix codes are not error-correcting codes. In practice, a message might first be compressed with a prefix code, and then encoded again with channel coding (including error correction) before transmission.
For every uniquely decodable code there is a prefix code that has the same code word lengths.[5] Kraft's inequality characterizes the sets of code word lengths that are possible in a uniquely decodable code.[6]
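For binary codes this characterization is easy to check numerically; the sketch below (our illustration, not from the cited sources) computes the Kraft sum for a list of proposed code word lengths:

def kraft_sum(lengths, radix=2):
    """Kraft sum for the given code word lengths; a prefix code with these
    lengths over a radix-sized alphabet exists if and only if the sum is <= 1."""
    return sum(radix ** -length for length in lengths)

print(kraft_sum([1, 2, 2]))   # 1.0  -> achievable, e.g. by {0, 10, 11} (complete)
print(kraft_sum([1, 1, 2]))   # 1.25 -> no binary prefix code has these lengths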
Techniques
If every word in the code has the same length, the code is called a fixed-length code, or a block code (though the term block code is also used for fixed-size error-correcting codes in channel coding). For example, ISO 8859-15 letters are always 8 bits long. UTF-32/UCS-4 letters are always 32 bits long. ATM cells are always 424 bits (53 bytes) long. A fixed-length code of length k bits can encode up to 2^k source symbols.
A fixed-length code is necessarily a prefix code. It is possible to turn any code into a fixed-length code by padding fixed symbols to the shorter prefixes in order to meet the length of the longest prefixes. Alternatively, such padding codes may be employed to introduce redundancy that allows autocorrection and/or synchronisation. However, fixed-length encodings are inefficient in situations where some words are much more likely to be transmitted than others.
Truncated binary encoding is a straightforward generalization of fixed-length codes to deal with cases where the number of symbols n is not a power of two. Source symbols are assigned codewords of length k and k + 1, where k is chosen so that 2^k < n < 2^(k+1).
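A minimal sketch of this assignment (our illustration, assuming the n source symbols are numbered 0 through n − 1):

def truncated_binary(x, n):
    """Truncated binary code word for symbol x, 0 <= x < n, assuming n >= 2.

    The first u = 2**(k+1) - n symbols receive k-bit code words; the
    remaining symbols receive (k+1)-bit code words, offset to keep the
    set prefix-free.
    """
    k = n.bit_length() - 1          # k = floor(log2 n)
    u = 2 ** (k + 1) - n            # number of short code words
    if x < u:
        return format(x, "0%db" % k)
    return format(x + u, "0%db" % (k + 1))

print([truncated_binary(x, 5) for x in range(5)])
# ['00', '01', '10', '110', '111']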
Huffman coding is a more sophisticated technique for constructing variable-length prefix codes. The Huffman coding algorithm takes as input the frequencies that the code words should have, and constructs a prefix code that minimizes the weighted average of the code word lengths. (This is closely related to minimizing the entropy.) This is a form of lossless data compression based on entropy encoding.
Some codes mark the end of a code word with a special "comma" symbol (also called a sentinel value), different from normal data.[7] This is somewhat analogous to the spaces between words in a sentence; they mark where one word ends and another begins. If every code word ends in a comma, and the comma does not appear elsewhere in a code word, the code is automatically prefix-free. However, reserving an entire symbol only for use as a comma can be inefficient, especially for languages with a small number of symbols. Morse code is an everyday example of a variable-length code with a comma. The long pauses between letters, and the even longer pauses between words, help people recognize where one letter (or word) ends, and the next begins. Similarly, Fibonacci coding uses a "11" to mark the end of every code word.
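For instance, the trailing "11" of Fibonacci coding plays exactly this comma role; the sketch below (our illustration, not drawn from the cited sources) produces the first few code words, each ending in "11" and containing no other occurrence of "11":

def fibonacci_code(n):
    """Fibonacci code word for a positive integer n: the Zeckendorf digits
    of n, least significant first, followed by an extra terminating '1'."""
    fibs = [1, 2]                      # Fibonacci numbers F(2), F(3), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                         # drop the first Fibonacci number > n
    bits = []
    remainder = n
    for f in reversed(fibs):           # greedy Zeckendorf decomposition
        if f <= remainder:
            bits.append("1")
            remainder -= f
        else:
            bits.append("0")
    bits.reverse()                     # least significant digit first
    return "".join(bits) + "1"         # appended '1' creates the "11" comma

print([fibonacci_code(n) for n in range(1, 7)])
# ['11', '011', '0011', '1011', '00011', '10011']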
Self-synchronizing codes are prefix codes that allow frame synchronization.
Related concepts
A suffix code is a set of words none of which is a suffix of any other; equivalently, a set of words which are the reverse of a prefix code. As with a prefix code, the representation of a string as a concatenation of such words is unique. A bifix code is a set of words which is both a prefix and a suffix code.[8] An optimal prefix code is a prefix code with minimal average length. That is, assume an alphabet of n symbols with probabilities p(A_1), …, p(A_n), and let C be a prefix code whose codewords have lengths λ_1, …, λ_n. If C′ is another prefix code and λ′_1, …, λ′_n are the lengths of its codewords, then Σ_{i=1}^{n} λ_i p(A_i) ≤ Σ_{i=1}^{n} λ′_i p(A_i).[9]
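As a small worked illustration of this comparison (our own example; the probabilities are arbitrary), the expected length of a code is simply the probability-weighted sum of its code word lengths:

def average_length(codewords, probabilities):
    """Expected code word length: sum of len(c_i) * p_i over all symbols."""
    return sum(len(c) * p for c, p in zip(codewords, probabilities))

p = [0.5, 0.25, 0.25]                            # assumed symbol probabilities
print(average_length(["0", "10", "11"], p))      # 1.5 bits per symbol
print(average_length(["00", "01", "10"], p))     # 2.0 bits per symbol

Under this distribution the first code matches the source entropy of 1.5 bits per symbol and is therefore optimal, while the fixed-length code spends an extra half bit per symbol.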
Prefix codes in use today
Examples of prefix codes include:
- variable-length Huffman codes
- country calling codes
- Chen–Ho encoding
- the country and publisher parts of ISBNs
- the Secondary Synchronization Codes used in the UMTS W-CDMA 3G Wireless Standard
- VCR Plus+ codes
- Unicode Transformation Format, in particular the UTF-8 system for encoding Unicode characters, which is both a prefix-free code and a self-synchronizing code[10]
- variable-length quantity
Techniques
Commonly used techniques for constructing prefix codes include Huffman codes and the earlier Shannon–Fano codes, as well as universal codes such as the following (a short sketch of two of these appears after the list):
- Elias delta coding
- Elias gamma coding
- Elias omega coding
- Fibonacci coding
- Levenshtein coding
- Unary coding
- Golomb Rice code
- Straddling checkerboard (simple cryptography technique which produces prefix codes)
- binary coding[11]
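The sketch below (our illustration; note that conventions for unary coding vary) shows encoders for two of the codes listed above:

def unary(n):
    """Unary code for n >= 1, using the convention of n - 1 ones and a final zero."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias gamma code for n >= 1: floor(log2 n) zeros, then n in binary."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

print([unary(n) for n in range(1, 5)])        # ['0', '10', '110', '1110']
print([elias_gamma(n) for n in range(1, 6)])  # ['1', '010', '011', '00100', '00101']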
Notes
- ^ US Federal Standard 1037C
- ^ ATIS Telecom Glossary 2007, archived from the original on July 8, 2010, retrieved December 4, 2010
- ^ Berstel, Jean; Perrin, Dominique (1985), Theory of Codes, Academic Press
- ^ Golomb, S. W.; Gordon, Basil; Welch, L. R. (1958), "Comma-Free Codes", Canadian Journal of Mathematics, 10 (2): 202–209, doi:10.4153/CJM-1958-023-9, S2CID 124092269
- ^ Le Boudec, Jean-Yves, Patrick Thiran, and Rüdiger Urbanke. Introduction aux sciences de l'information: entropie, compression, chiffrement et correction d'erreurs. PPUR Presses polytechniques, 2015.
- ^ Berstel et al (2010) p.75
- ^ A. Jones, J. "Development of Trigger and Control Systems for CMS" (PDF). High Energy Physics, Blackett Laboratory, Imperial College, London. p. 70. Archived from the original (PDF) on Jun 13, 2011.
- ^ Berstel et al (2010) p.58
- ^ McGill COMP 423 Lecture notes
- ^ Pike, Rob (2003-04-03). "UTF-8 history".
- ^ Shevchuk, Y. V. (2018), "Vbinary: variable length integer coding revisited" (PDF), Program Systems: Theory and Applications, 9 (4): 239–252, doi:10.25209/2079-3316-2018-9-4-239-252
References
- Berstel, Jean; Perrin, Dominique; Reutenauer, Christophe (2010). Codes and automata. Encyclopedia of Mathematics and its Applications. Vol. 129. Cambridge: Cambridge University Press. ISBN 978-0-521-88831-8. Zbl 1187.94001.
- Elias, Peter (1975). "Universal codeword sets and representations of the integers". IEEE Trans. Inf. Theory. 21 (2): 194–203. doi:10.1109/tit.1975.1055349. ISSN 0018-9448. Zbl 0298.94011.
- D.A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., Sept. 1952, pp. 1098–1102 (Huffman's original article)
- Profile: David A. Huffman, Scientific American, Sept. 1991, pp. 54–58 (Background story)
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 16.3, pp. 385–392.
This article incorporates public domain material from Federal Standard 1037C. General Services Administration. Archived from the original on 2022-01-22.
External links
- Codes, trees and the prefix property by Kona Macphee
Fundamentals
Definition
A prefix code is a variable-length code over a finite alphabet where no codeword is a prefix of any other codeword in the set.[5][8] This prefix-free property ensures instantaneous decodability, meaning that codewords can be decoded sequentially from a concatenated stream without lookahead or ambiguity, as the receiver can uniquely identify each codeword as soon as it is complete.[1][9] For example, consider a binary prefix code for symbols {A, B, C} with codewords {0, 10, 11}: here, no codeword prefixes another, allowing unambiguous decoding of sequences like 01011 (A, then B, then C). In contrast, the set {0, 01, 11} is not prefix-free because 0 prefixes 01, leading to ambiguity when 01 is received (is it the codeword 01, or the codeword 0 followed by the start of 11?).[9][8] The concept of prefix codes was introduced by Claude Shannon in 1948 as part of his foundational work on the source coding theorem in information theory.[10] The Kraft inequality gives a necessary and sufficient condition for the existence of a prefix code with given codeword lengths.[11]

Properties
The prefix property of a code ensures that any concatenation of codewords can be unambiguously decoded without requiring explicit delimiters or lookahead beyond the current codeword, as no codeword serves as a prefix for another. This structural advantage allows for instantaneous decoding, where the receiver identifies complete codewords sequentially in a single pass through the encoded stream.[12]

An optimal prefix code achieves the minimal possible expected codeword length for a given source probability distribution among all uniquely decodable codes, bounded below by the source entropy and constrained by the Kraft inequality. This efficiency arises because prefix codes can be designed to closely approximate the ideal lengths dictated by symbol probabilities, minimizing redundancy in variable-length encoding while preserving decodability.[13]

A complete prefix code saturates the available code space: its codeword lengths satisfy the Kraft inequality with equality, leaving no unused branches and hence no gaps that could otherwise allow for shorter encodings. This completeness ensures that the code is as compact as possible without violating the prefix condition. For binary codes, completeness means that the sum of 2^{-l_i} over the codeword lengths l_i equals exactly 1.[5]

Prefix codes extend naturally to a binary tree representation, in which each codeword corresponds to a unique path from the root to a leaf, and no codeword is assigned to an internal node along any path. In this model, the tree's leaves represent the symbols, and the branching reflects the binary decisions in decoding, providing a visual and algorithmic foundation for code design and analysis.[14]

All prefix codes are uniquely decodable, guaranteeing that every encoded sequence maps back to a single original message sequence. However, the converse does not hold: there exist uniquely decodable codes that violate the prefix condition and thus require more complex decoding procedures.[15]

Mathematical Foundations
Kraft Inequality
The Kraft inequality provides a fundamental necessary and sufficient condition for the existence of a prefix code with specified codeword lengths over a finite alphabet. For a prefix code consisting of codewords of lengths l_1, l_2, …, l_n drawn from an alphabet of size D, the inequality states that

Σ_{i=1}^{n} D^{-l_i} ≤ 1.

This condition ensures that the codewords can be arranged in a tree structure without prefix conflicts, bounding the total "capacity" allocated to the codewords.

A proof of the inequality can be sketched using the code tree model, where each codeword corresponds to a path from the root to a leaf in a complete D-ary tree. Each internal node branches into D children, and the leaves represent the codewords; the measure D^{-l_i} of a codeword of length l_i equals the proportion of the total tree volume occupied by the subtree rooted at that codeword's node. Since these subtrees are disjoint and the entire tree has volume 1, their measures sum to at most 1, establishing the necessity. The sufficiency follows by constructing the code via a greedy assignment of codewords to available branches, ensuring no overlap.

When the sum equals 1, the prefix code is complete, meaning it fully utilizes the available tree space and is optimal in the sense that no additional codeword can be added while preserving the prefix property and the given lengths. This completeness implies the code achieves the tightest possible packing for those lengths. McMillan's theorem (1956) generalizes the Kraft inequality to uniquely decodable codes that are not necessarily prefix codes, proving that the same inequality is necessary and sufficient for the existence of a uniquely decodable code with those lengths, even if instantaneous decodability is not required. This extension broadens the applicability to more general coding scenarios beyond strict prefix constraints.

As an illustrative example for binary codes (D = 2) with lengths 1, 2, and 2, the sum is 2^{-1} + 2^{-2} + 2^{-2} = 1, satisfying the inequality with equality and confirming the existence of a complete prefix code, such as {0, 10, 11}.

Instantaneous Decodability
Instantaneous decodability refers to the property of prefix codes that allows symbols to be decoded in real time as the encoded bits arrive, without requiring any lookahead into future bits or buffering to resolve ambiguities. In the decoding process, the receiver sequentially scans the incoming bitstream and identifies complete codewords by matching prefixes of the received sequence against the code dictionary. As soon as a prefix matches a full codeword, guaranteed not to be a prefix of any longer codeword by the prefix-free condition, the corresponding symbol is output, and decoding continues immediately with the next bit. This property eliminates the need for delimiters or additional markers between codewords.[16][7]

The efficiency of this decoding stems from the structured nature of prefix codes, often implemented via a binary tree where each codeword corresponds to a path from the root to a leaf. Tree traversal for decoding takes time proportional to the length of the codeword, so the expected cost per symbol is bounded by the average codeword length; with lookup tables for small alphabets or bounded codeword lengths, the cost per decoded symbol becomes effectively constant.[17]

In contrast, non-instantaneous uniquely decodable codes require lookahead or delays to disambiguate potential codeword boundaries, as a received sequence may correspond to multiple possible parses until additional bits arrive. For example, consider the code with symbols mapped to 0, 01, and 11: the sequence 01 could be the single codeword 01, or the codeword 0 followed by the start of 11, so subsequent bits must be buffered before a decision can be made. Such delays arise because some codewords may be prefixes of concatenations of others, violating the instantaneous property.[15][14]

This instantaneous property makes prefix codes particularly valuable in applications demanding low-latency decoding, such as real-time data streaming, where any delay in symbol recovery could disrupt continuous playback or transmission. In streaming media, for instance, the ability to decode symbols on the fly ensures smooth processing without accumulating buffers that might cause lag.[18]

Prefix codes, also known as instantaneously decodable codes, form a subclass of uniquely decodable codes. The broader class of uniquely decodable codes includes non-instantaneous variants; the prefix condition guarantees instantaneous decodability, while unique decodability in general can be verified using the Sardinas–Patterson algorithm, which checks for problematic overlaps among code extensions. This distinction aligns with the Kraft inequality, which provides a necessary and sufficient condition for the existence of codes in both classes.[16][19]

Construction Methods
Huffman Algorithm
The Huffman algorithm is a greedy technique for constructing an optimal prefix code from a set of symbols with known probabilities, minimizing the average length of the encoded symbols. Developed by David A. Huffman, then a graduate student at MIT, and published in 1952, it builds a binary tree structure in which the code for each symbol corresponds to the path from the root to the symbol's leaf node, with branches assigned 0 for left and 1 for right.[20][21]

The algorithm begins by creating a leaf node for each symbol, weighted by its probability, and placing these nodes into a priority queue ordered by increasing weight. It then repeatedly extracts the two nodes with the smallest weights, creates a new internal node with a weight equal to their sum, and makes the extracted nodes its left and right children before reinserting the new node into the queue. This merging process continues until only a single root node remains, forming the complete Huffman tree from which the prefix codes are derived by traversing paths from the root.[22][23]

A pseudocode outline of the algorithm is as follows:

function HUFFMAN(symbols, probabilities):
    create a priority queue Q
    for i from 1 to length(symbols):
        create leaf node l with symbol[i] and weight probabilities[i]
        insert l into Q
    while size(Q) > 1:
        extract node x with smallest weight from Q
        extract node y with smallest weight from Q
        create internal node z with weight = weight(x) + weight(y)
        set left child of z to x
        set right child of z to y
        insert z into Q
    return the single node in Q as the root of the Huffman tree
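The same construction can be written as a short runnable sketch in Python (an illustrative implementation rather than reference code from the cited sources; the function name, tie-breaking counter, and example probabilities are our own choices), using the standard heapq module as the priority queue:

import heapq
from collections import namedtuple

Leaf = namedtuple("Leaf", "symbol")
Node = namedtuple("Node", "left right")

def huffman_code(probabilities):
    """Build a prefix code from {symbol: probability} with Huffman's algorithm.

    Returns {symbol: bit string}. A running counter breaks ties between
    equal weights so that tree nodes never need to be compared directly.
    """
    heap = [(p, i, Leaf(s)) for i, (s, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, n1 = heapq.heappop(heap)          # two smallest weights
        p2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, Node(n1, n2)))
        counter += 1
    codes = {}
    def walk(node, path):
        if isinstance(node, Leaf):
            codes[node.symbol] = path or "0"     # degenerate one-symbol case
        else:
            walk(node.left, path + "0")          # 0 for the left branch
            walk(node.right, path + "1")         # 1 for the right branch
    walk(heap[0][2], "")
    return codes

print(huffman_code({"a": 0.45, "b": 0.25, "c": 0.2, "d": 0.1}))
# {'a': '0', 'b': '10', 'd': '110', 'c': '111'} with this tie-breaking

The resulting codeword lengths (1, 2, 3, 3) satisfy the Kraft inequality with equality, as expected for a Huffman code, whose tree has no unused branches.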
