Range coding
Range coding (or range encoding) is an entropy coding method defined by G. Nigel N. Martin in a 1979 paper,[1] which effectively rediscovered the FIFO arithmetic code first introduced by Richard Clark Pasco in 1976.[2] Given a stream of symbols and their probabilities, a range coder produces a space-efficient stream of bits to represent these symbols and, given the stream and the probabilities, a range decoder reverses the process.
Range coding is very similar to arithmetic coding, except that coding is done with digits in any base, instead of with bits, and so it is faster when using larger bases (e.g. a byte) at small cost in compression efficiency.[3] After the expiration of the first (1978) arithmetic coding patent,[4] range coding appeared to clearly be free of patent encumbrances. This particularly drove interest in the technique in the open source community. Since that time, patents on various well-known arithmetic coding techniques have also expired.
How range coding works
Range coding conceptually encodes all the symbols of the message into one number, unlike Huffman coding which assigns each symbol a bit-pattern and concatenates all the bit-patterns together. Thus range coding can achieve greater compression ratios than the one-bit-per-symbol lower bound on Huffman coding and it does not suffer the inefficiencies that Huffman does when dealing with probabilities that are not an exact power of two.
The central concept behind range coding is this: given a large-enough range of integers, and a probability estimation for the symbols, the initial range can easily be divided into sub-ranges whose sizes are proportional to the probability of the symbol they represent. Each symbol of the message can then be encoded in turn, by reducing the current range down to just that sub-range which corresponds to the next symbol to be encoded. The decoder must have the same probability estimation the encoder used, which can either be sent in advance, derived from already transferred data or be part of the compressor and decompressor.
When all symbols have been encoded, merely identifying the sub-range is enough to communicate the entire message (presuming of course that the decoder is somehow notified when it has extracted the entire message). A single integer is actually sufficient to identify the sub-range, and it may not even be necessary to transmit the entire integer; if there is a sequence of digits such that every integer beginning with that prefix falls within the sub-range, then the prefix alone is all that's needed to identify the sub-range and thus transmit the message.
Example
Suppose we want to encode the message "AABA<EOM>", where <EOM> is the end-of-message symbol. For this example it is assumed that the decoder knows that we intend to encode exactly five symbols in the base 10 number system (allowing for 10^5 different combinations of symbols with the range [0, 100000)) using the probability distribution {A: .60; B: .20; <EOM>: .20}. The encoder breaks down the range [0, 100000) into three subranges:
A:     [     0,  60000)
B:     [ 60000,  80000)
<EOM>: [ 80000, 100000)
Since our first symbol is an A, it reduces our initial range down to [0, 60000). The second symbol choice leaves us with three sub-ranges of this range. We show them following the already-encoded 'A':
AA:     [     0,  36000)
AB:     [ 36000,  48000)
A<EOM>: [ 48000,  60000)
With two symbols encoded, our range is now [0, 36000) and our third symbol leads to the following choices:
AAA:     [     0,  21600)
AAB:     [ 21600,  28800)
AA<EOM>: [ 28800,  36000)
This time it is the second of our three choices that represents the message we want to encode, and our range becomes [21600, 28800). It may look harder to determine our sub-ranges in this case, but it is actually not: we can merely subtract the lower bound from the upper bound to determine that there are 7200 numbers in our range; the first 4320 of them represent 0.60 of the total, the next 1440 represent the next 0.20, and the remaining 1440 represent the final 0.20 of the total. Adding back the lower bound gives us our ranges:
AABA:     [21600, 25920)
AABB:     [25920, 27360)
AAB<EOM>: [27360, 28800)
Finally, with our range narrowed down to [21600, 25920), we have just one more symbol to encode. Using the same technique as before for dividing up the range between the lower and upper bound, we find the three sub-ranges are:
AABAA:     [21600, 24192)
AABAB:     [24192, 25056)
AABA<EOM>: [25056, 25920)
And since <EOM> is our final symbol, our final range is [25056, 25920). Because all five-digit integers starting with "251" fall within our final range, it is one of the three-digit prefixes we could transmit that would unambiguously convey our original message. (The fact that there are actually eight such prefixes in all implies we still have inefficiencies. They have been introduced by our use of base 10 rather than base 2.)
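The narrowing steps of this example can be replayed mechanically. The following sketch (Python, not part of the original article; it uses the integer frequencies 6/2/2 out of 10 instead of the decimal probabilities, so every step is exact) reproduces the subdivision of "AABA<EOM>" and lands on the final interval:

```python
# Integer frequency model: (cumulative start, size) out of a total of 10,
# matching the distribution {A: .60; B: .20; <EOM>: .20} from the text.
model = {"A": (0, 6), "B": (6, 2), "<EOM>": (8, 2)}
total = 10

low, high = 0, 100000          # initial range [0, 100000)
for symbol in ["A", "A", "B", "A", "<EOM>"]:
    width = high - low
    start, size = model[symbol]
    # Narrow to the sub-range proportional to the symbol's probability.
    low, high = low + width * start // total, low + width * (start + size) // total

print(low, high)               # prints 25056 25920, i.e. [25056, 25920)
```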
The central problem may appear to be selecting an initial range large enough that no matter how many symbols we have to encode, we will always have a current range large enough to divide into non-zero sub-ranges. In practice, however, this is not a problem, because instead of starting with a very large range and gradually narrowing it down, the encoder works with a smaller range of numbers at any given time. After some number of digits have been encoded, the leftmost digits will not change. In the example after coding just three symbols, we already knew that our final result would start with "2". More digits are shifted in on the right as digits on the left are sent off. This is illustrated in the following code:
int low = 0;
int range = 100000;

void Run()
{
    Encode(0, 6, 10); // A
    Encode(0, 6, 10); // A
    Encode(6, 2, 10); // B
    Encode(0, 6, 10); // A
    Encode(8, 2, 10); // <EOM>
    // emit final digits - see below
    while (range < 10000)
        EmitDigit();
    low += 10000;
    EmitDigit();
}

void EmitDigit()
{
    Console.Write(low / 10000);
    low = (low % 10000) * 10;
    range *= 10;
}

void Encode(int start, int size, int total)
{
    // adjust the range based on the symbol interval
    range /= total;
    low += start * range;
    range *= size;
    // check if left-most digit is same throughout range
    while (low / 10000 == (low + range) / 10000)
        EmitDigit();
    // readjust range - see reason for this below
    if (range < 1000)
    {
        EmitDigit();
        EmitDigit();
        range = 100000 - low;
    }
}
To finish off we may need to emit a few extra digits. The top digit of low is probably too small so we need to increment it, but we have to make sure we don't increment it past low+range. So first we need to make sure range is large enough.
// emit final digits
while (range < 10000)
    EmitDigit();
low += 10000;
EmitDigit();
One problem that can occur with the Encode function above is that range might become very small but low and low+range still have differing first digits. This could result in the interval having insufficient precision to distinguish between all of the symbols in the alphabet. When this happens we need to fudge a little, output the first couple of digits even though we might be off by one, and re-adjust the range to give us as much room as possible.
For example, imagine the input stream has led the encoder to the right-open interval [59888, 60188), that is, low = 59888 and range = 300. The trick is to narrow down the interval to [59888, 60000) = [59888, 59999], which allows the encoder to emit two of the left-most digits of low, and readjust the interval to [88800, 99999] = [88800, 100000), that is, low = 88800 and range = 100000 - low.
The decoder will be following the same steps so it will know when it needs to do this to keep in sync.
// this goes just before the end of Encode() above
if (range < 1000)
{
    EmitDigit();
    EmitDigit();
    range = 100000 - low;
}
Base 10 was used in this example, but a real implementation would just use binary, with the full range of the native integer data type. Instead of 10000 and 1000 you would likely use hexadecimal constants such as 0x1000000 and 0x10000. Instead of emitting a digit at a time you would emit a byte at a time and use a byte-shift operation instead of multiplying by 10.
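As a rough illustration of that substitution, here is one possible byte-based port of the example coder (a sketch in Python rather than the article's C#; the names TOP, BOTTOM and FULL, the symbol table, and the clamp in the decoder are this sketch's own, mirroring the 10000, 1000 and 100000 constants of the decimal version):

```python
TOP = 1 << 24      # 0x1000000, plays the role of 10000
BOTTOM = 1 << 16   # 0x10000, plays the role of 1000
FULL = 1 << 32     # plays the role of 100000

MODEL = {"A": (0, 6), "B": (6, 2), "<EOM>": (8, 2)}  # (start, size) out of TOTAL
TOTAL = 10

def encode(symbols):
    low, rng, out = 0, FULL, bytearray()

    def emit_byte():
        nonlocal low, rng
        out.append(low // TOP)          # emit the leading byte of low
        low = (low % TOP) << 8
        rng <<= 8

    for sym in symbols:
        start, size = MODEL[sym]
        rng //= TOTAL
        low += start * rng
        rng *= size
        while low // TOP == (low + rng) // TOP:   # leading byte settled
            emit_byte()
        if rng < BOTTOM:                          # underflow fudge, as in the text
            emit_byte()
            emit_byte()
            rng = FULL - low
    while rng < TOP:                              # emit final bytes
        emit_byte()
    low += TOP
    emit_byte()
    return bytes(out)

def decode(data):
    stream = iter(data)
    low, rng, code, out = 0, 1, 0, []

    def append_byte():
        nonlocal low, rng, code
        code = (code % TOP) << 8 | next(stream, 0)  # zero-pad past end of input
        low = (low % TOP) << 8
        rng <<= 8

    for _ in range(4):                  # four bytes raise range from 1 to FULL
        append_byte()
    while True:
        v = min((code - low) // (rng // TOTAL), TOTAL - 1)  # clamp truncation slack
        sym = "A" if v < 6 else ("B" if v < 8 else "<EOM>")
        if sym == "<EOM>":
            return "".join(out)
        out.append(sym)
        start, size = MODEL[sym]
        rng //= TOTAL
        low += start * rng
        rng *= size
        while low // TOP == (low + rng) // TOP:
            append_byte()
        if rng < BOTTOM:
            append_byte()
            append_byte()
            rng = FULL - low

print(decode(encode(["A", "A", "B", "A", "<EOM>"])))  # prints AABA
```

With the larger base, the whole five-symbol message happens to fit in a single output byte, illustrating why byte-wise renormalization costs so little compression in exchange for speed.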
Decoding uses exactly the same algorithm with the addition of keeping track of the current code value consisting of the digits read from the compressor. Instead of emitting the top digit of low you just throw it away, but you also shift out the top digit of code and shift in a new digit read from the compressor. Use AppendDigit below instead of EmitDigit.
int code = 0;
int low = 0;
int range = 1;

void InitializeDecoder()
{
    AppendDigit(); // with this example code, only 1 of these is actually needed
    AppendDigit();
    AppendDigit();
    AppendDigit();
    AppendDigit();
}

void AppendDigit()
{
    code = (code % 10000) * 10 + ReadNextDigit();
    low = (low % 10000) * 10;
    range *= 10;
}

void Decode(int start, int size, int total) // same as Encode, with EmitDigit replaced by AppendDigit
{
    // adjust the range based on the symbol interval
    range /= total;
    low += start * range;
    range *= size;
    // check if left-most digit is same throughout range
    while (low / 10000 == (low + range) / 10000)
        AppendDigit();
    // readjust range - see reason for this above
    if (range < 1000)
    {
        AppendDigit();
        AppendDigit();
        range = 100000 - low;
    }
}
In order to determine which probability intervals to apply, the decoder needs to look at the current value of code within the interval [low, low+range) and decide which symbol this represents.
void Run()
{
    int start = 0;
    int size;
    int total = 10;
    InitializeDecoder(); // need to get range/total > 0
    while (start < 8) // stop when we receive <EOM>
    {
        int v = GetValue(total); // in which symbol range does code lie?
        switch (v) // convert value to symbol
        {
            case 0:
            case 1:
            case 2:
            case 3:
            case 4:
            case 5: start = 0; size = 6; Console.Write("A"); break;
            case 6:
            case 7: start = 6; size = 2; Console.Write("B"); break;
            default: start = 8; size = 2; Console.WriteLine(""); break;
        }
        Decode(start, size, total);
    }
}

int GetValue(int total)
{
    return (code - low) / (range / total);
}
For the AABA<EOM> example above, this would return a value in the range 0 to 9. Values 0 through 5 would represent A, 6 and 7 would represent B, and 8 and 9 would represent <EOM>.
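Putting the encoder and decoder halves together, the example can be run end to end. The following Python transliteration of the C# snippets (a sketch; names such as emit_digit mirror EmitDigit, and the digit stream is zero-padded on the decoding side, as the text assumes) emits the "251" prefix derived in the worked example and decodes it back:

```python
SYMBOLS = {"A": (0, 6), "B": (6, 2), "<EOM>": (8, 2)}  # (start, size) out of 10
TOTAL = 10

def encode(message):
    low, rng, out = 0, 100000, []

    def emit_digit():
        nonlocal low, rng
        out.append(low // 10000)
        low = (low % 10000) * 10
        rng *= 10

    for sym in message:
        start, size = SYMBOLS[sym]
        rng //= TOTAL
        low += start * rng
        rng *= size
        while low // 10000 == (low + rng) // 10000:  # leading digit settled
            emit_digit()
        if rng < 1000:                               # underflow fudge
            emit_digit()
            emit_digit()
            rng = 100000 - low
    while rng < 10000:                               # emit final digits
        emit_digit()
    low += 10000
    emit_digit()
    return out

def decode(digits):
    stream = iter(digits)
    low, rng, code, out = 0, 1, 0, []

    def append_digit():
        nonlocal low, rng, code
        code = (code % 10000) * 10 + next(stream, 0)  # zero-pad past end of input
        low = (low % 10000) * 10
        rng *= 10

    for _ in range(5):
        append_digit()
    while True:
        v = (code - low) // (rng // TOTAL)   # which tenth of the range holds code?
        sym = "A" if v < 6 else ("B" if v < 8 else "<EOM>")
        if sym == "<EOM>":
            return "".join(out)
        out.append(sym)
        start, size = SYMBOLS[sym]
        rng //= TOTAL
        low += start * rng
        rng *= size
        while low // 10000 == (low + rng) // 10000:
            append_digit()
        if rng < 1000:
            append_digit()
            append_digit()
            rng = 100000 - low

digits = encode(["A", "A", "B", "A", "<EOM>"])
print(digits)          # prints [2, 5, 1] — the "251" prefix from the worked example
print(decode(digits))  # prints AABA
```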
Relationship with arithmetic coding
Arithmetic coding is the same as range coding, but with the integers taken as being the numerators of fractions. These fractions have an implicit, common denominator, such that all the fractions fall in the range [0,1). Accordingly, the resulting arithmetic code is interpreted as beginning with an implicit "0". As these are just different interpretations of the same coding methods, and as the resulting arithmetic and range codes are identical, each arithmetic coder is its corresponding range encoder, and vice versa. In other words, arithmetic coding and range coding are just two, slightly different ways of understanding the same thing.
In practice, though, so-called range encoders tend to be implemented pretty much as described in Martin's paper,[1] while arithmetic coders more generally tend not to be called range encoders. An often noted feature of such range encoders is the tendency to perform renormalization a byte at a time, rather than one bit at a time (as is usually the case). In other words, range encoders tend to use bytes as coding digits, rather than bits. While this does reduce the amount of compression that can be achieved by a very small amount, it is faster than when performing renormalization for each bit.
References

1. G. Nigel N. Martin, "Range Encoding: An Algorithm for Removing Redundancy from a Digitised Message", Video & Data Recording Conference, Southampton, UK, July 24–27, 1979.
2. Richard Clark Pasco, "Source Coding Algorithms for Fast Data Compression", Ph.D. thesis, Stanford University, 1976.
3. Timothy B. Terriberry, "On the Overhead of Range Coders", technical note, 2008.
4. U.S. Patent 4,122,440 (IBM), filed March 4, 1977, granted October 24, 1978 (now expired).
Range coding
Introduction
Definition and Purpose
Range coding is an entropy encoding technique used in data compression that represents an entire sequence of symbols as a single real number within a predefined interval, typically [0, 1), by iteratively subdividing the interval into sub-ranges proportional to the probabilities of the symbols.[5] This method achieves compression by exploiting the statistical redundancies in the data, assigning shorter effective code lengths to more probable symbols through the proportional allocation of interval space.[6] The primary purpose of range coding is to enable lossless compression of variable-length symbol sequences in a manner that closely approaches the theoretical entropy limit, outperforming prefix codes such as Huffman coding by allowing fractional bit allocations per symbol rather than discrete integer bits.[5][7]

Unlike Huffman coding, which requires precomputing fixed codeword lengths based on symbol frequencies, range coding dynamically adapts to probabilities during encoding, resulting in more efficient representations for sources with non-uniform distributions. This makes it particularly suitable for applications requiring high compression ratios, such as multimedia and archival data storage, where minimizing output bits is critical.

At its core, range coding maps the cumulative probability of a message into an interval of real numbers (or their integer approximations for practical implementation), enabling the encoding of multiple symbols within the precision of a single number rather than separate codes for each.[5] This fractional bit usage per symbol provides a key advantage over traditional methods, as it eliminates the overhead of aligning codewords to bit boundaries. Range coding is closely related to arithmetic coding, serving as a practical variant that often operates on larger digit bases (e.g., bytes) for improved speed with minimal efficiency loss.[5]

Historical Development
Range coding emerged as an entropy encoding technique in the late 1970s, building on earlier concepts in arithmetic coding. It was formally defined by G. N. N. Martin in his 1979 paper "Range Encoding: An Algorithm for Removing Redundancy from a Digitised Message", presented at the Video & Data Recording Conference organized by the IBM UK Scientific Center in Southampton. This work described a method for encoding messages by maintaining and subdividing a range of possible values, effectively rediscovering the FIFO (first-in, first-out) arithmetic coding approach originally proposed by Richard Clark Pasco in his 1976 PhD thesis "Source Coding Algorithms for Fast Data Compression" at Stanford University.[8] Pasco's thesis explored arithmetic coding variants to address precision issues in fixed-precision implementations, laying foundational ideas that Martin's range encoding simplified and adapted for practical use.[9]

The adoption of range coding was initially limited by intellectual property constraints on arithmetic coding. A key U.S. Patent 4,122,440, titled "Method and Means for Arithmetic String Coding", issued to IBM in 1978, covered core aspects of arithmetic coding and deterred widespread implementation until its expiration in 1995 (17 years from issuance).[10] Following this expiration, range coding saw increased popularity in open-source and academic projects during the 1990s, as developers sought efficient, patent-free alternatives to Huffman coding for lossless compression.[5] This period marked the technique's evolution from theoretical proposal to viable component in algorithms like PPM (Prediction by Partial Matching) variants.

In the early 2000s, binary-adapted forms of range coding influenced high-impact standards, notably the Context-Adaptive Binary Arithmetic Coding (CABAC) method introduced in the H.264/AVC video compression standard in 2003, where it provided efficient entropy coding for binary decisions in video streams.

Core Principles
State Representation
In range coding, the encoder maintains an internal state that represents the current interval of possible code values for the encoded message, typically denoted as a half-open interval [low, high) where low and high are non-negative integers satisfying low < high. This state serves as the foundation for mapping input symbols to subintervals based on their probabilities.[5]

To ensure computational efficiency and avoid the precision limitations of floating-point arithmetic, the state is implemented using fixed-precision integer arithmetic, commonly with 32-bit or larger unsigned integers for the low and range values (where range = high - low). This approach scales the interval multiplicatively while keeping all operations within integer bounds, preventing overflow or underflow through periodic renormalization steps. Renormalization occurs when the range falls below a predefined threshold (e.g., half the maximum representable value), involving left-shifts of low and range by a fixed number of bits (often 8 for byte-aligned output) and emitting the most significant bits of low as output bits.[5]

The state update for a given symbol incorporates the symbol's probability prob and its cumulative probability cum_prob (the probability mass up to but excluding the symbol, derived from the probability model). The new interval boundaries are computed as:

new_low  = low + (high - low) × cum_prob
new_high = low + (high - low) × (cum_prob + prob)

These updates narrow the interval proportionally to the symbol's probability, with all multiplications and additions performed in integer arithmetic to maintain exactness within the finite precision.[5]

Range Subdivision
In range coding, the current coding interval, bounded by state variables low and high representing the lower and upper endpoints respectively, is subdivided into contiguous subintervals, one for each possible symbol in the alphabet. The width of each subinterval is proportional to the estimated probability of the corresponding symbol, ensuring that the subdivision reflects the symbol's likelihood under the employed probability model. This proportional allocation allows the encoder to map the entire message to a uniquely narrow subinterval within the initial range, typically starting from [0, 1) or an equivalent integer representation for precision.[9]

The probability model determines the subinterval widths and can be either static, where probabilities are predefined and fixed throughout encoding, or adaptive, where estimates are updated dynamically based on the encoded data to better capture non-stationary statistics. For instance, adaptive models often employ finite-state machines or frequency counters to refine probability estimates after each symbol, enabling more efficient coding for evolving sources. Upon selecting a symbol for encoding, the current interval is narrowed to the corresponding subinterval by adjusting low and high to its endpoints, effectively accumulating information by reducing uncertainty in proportion to the symbol's probability. This iterative narrowing integrates the sequence of symbols into a single, progressively smaller interval whose position and size uniquely identify the original message.[11][9]

Output of encoded bits occurs through renormalization, triggered when the interval becomes narrow enough that its entire span shares a leading digit in the target output base (e.g., a leading byte for base-256 output), preventing precision loss in finite arithmetic implementations. At this point, the most significant digits common to low and high are emitted as output, and the interval is rescaled by shifting low and high to restore a workable precision range, typically by multiplying by the base (e.g., 256 for bytes). This mechanism ensures variable-length encoding without immediate symbol-by-symbol output, achieving compression close to the source entropy while maintaining decodability from the final interval.[9][11]

Encoding and Decoding
Encoding Procedure
The encoding procedure in range coding represents a sequence of symbols as a single number within a shrinking interval of the unit space, using integer arithmetic to avoid floating-point precision issues. This process maps the symbols' probabilities to subintervals, progressively narrowing the current range until the final interval can be encoded into a compact bitstream. The algorithm maintains two key variables: the lower bound low (initially 0) and the range size range (initially 2^32 - 1), representing the current interval [low, low + range). For each symbol in the input sequence, the interval is subdivided based on the symbol's probability, as determined by a frequency model with total frequency T, symbol frequency f_s, and cumulative frequency C_s (where C_s is the cumulative frequency up to but not including s). Compute scaled_increment = range / T. Then update range ← scaled_increment × f_s and low ← low + scaled_increment × C_s. These integer operations truncate toward zero, ensuring the subinterval lies within the current interval and selects the sub-range corresponding to the symbol's probability, narrowing the interval proportionally. Renormalization follows immediately if necessary to maintain sufficient precision. Renormalization occurs when the range becomes too small (when range < 2^16), to prevent underflow and ensure the interval spans enough integer values for accurate subdivision. In this step, the range and low are left-shifted by 8 bits (multiplied by 256) to restore the range size, while emitting the high byte of low to the output stream. Specifically, output the byte (low >> 24), then low ← (low & 0x00FFFFFF) << 8, and range ← range << 8. This process repeats until range ≥ 2^16. Advanced implementations handle cases where the interval straddles powers of 2 (underflow) by adjusting output bits to avoid ambiguity. The shifting ensures the interval remains representable within the fixed integer precision, typically 32 bits.
Upon processing all symbols, the final interval is flushed to the output by emitting the shortest bit prefix that uniquely identifies a value within it, ensuring the decoder can unambiguously reconstruct the sequence. This typically involves outputting up to 4 additional bytes from low, with trailing zeros if needed for byte alignment. The total output length approximates the negative log of the message probability, achieving near-optimal compression. The following pseudocode outlines a simplified core encoding algorithm in a 32-bit integer implementation, assuming a static probability model with precomputed cumulative frequencies. Note that production systems like LZMA include additional logic for underflow handling:

Initialize:
    low ← 0
    range ← 2^32 - 1
    output_buffer ← empty

For each symbol s in the message:
    scaled_increment ← range / T           // T is total frequency
    low ← low + scaled_increment * C_s     // C_s is cumulative freq up to s
    range ← scaled_increment * f_s         // f_s is frequency of s
    While range < 2^16:
        output_buffer ← output_buffer + (low >> 24)   // output high byte
        low ← (low & 0x00FFFFFF) << 8
        range ← range << 8

// Final flush: repeat renormalization up to 4 times or until range is full
While range < 2^32 - 1:
    output_buffer ← output_buffer + (low >> 24)
    low ← (low & 0x00FFFFFF) << 8
    range ← range << 8

Output the output_buffer as the compressed bitstream, adding trailing zeros if needed for byte alignment
Decoding Procedure
The decoding procedure in range coding reverses the encoding process by extracting symbols from a compressed bitstream, leveraging the same probabilistic model to subdivide the current interval and locate the code value within it. The decoder tracks three primary state variables: the lower bound of the interval (low), the width of the current interval (range), and a code value representing the fractional position derived from consumed input bits. These variables are maintained as integers to enable efficient computation without floating-point operations.
Initialization begins by setting low = 0 and range = 2^32 - 1 for 32-bit arithmetic to maximize precision while avoiding overflow. The code is then populated by reading the initial 32 bits from the bitstream (e.g., four bytes shifted into position: code = (b1 << 24) | (b2 << 16) | (b3 << 8) | b4). This establishes the starting point, with the input bitstream serving as the encoded representation of the final narrow interval from encoding.
For each symbol, the decoder computes subinterval boundaries using the probability estimates from the shared model, where the total frequency sums to a fixed value T. Compute scaled = range / T. The scaled position (code - low) / scaled determines which subinterval contains the value, identifying the output symbol s such that C_s ≤ (code - low) / scaled < C_s + f_s, where C_s is the cumulative frequency up to s and f_s is the frequency of s. The state is then updated as low += scaled * C_s and range = scaled * f_s, narrowing the interval to the selected subrange while preserving the relative position of code. All operations use integer arithmetic with the scaled approach to minimize precision loss.
Renormalization occurs reactively when range falls below 2^16, to prevent precision loss, consuming additional input bits as needed. In this step, the decoder shifts range left by 8 bits (range <<= 8), applies the same shift to low (low <<= 8), and updates code = (code << 8) | read_byte(). This repeats until range is at least 2^16. The code value is masked to 32 bits after each update to maintain precision. This input consumption is driven by range narrowing, contrasting with encoding's output emission. Advanced implementations adjust for potential carries from low shifts to ensure accuracy.
Synchronization relies on the encoder and decoder employing identical probability models—updated adaptively or statically in tandem—and identical renormalization conditions, allowing stateless alignment without side-channel communication. Any mismatch in models or triggers would desynchronize the process, leading to incorrect symbol extraction.
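As a small illustration of this interval lookup (a sketch with hypothetical names; the table reuses the A/B/<EOM> frequencies 6, 2 and 2 out of a total of T = 10 from the example earlier on this page):

```python
# Hypothetical cumulative-frequency table: (symbol, C_s, f_s) with T = 10.
TABLE = [("A", 0, 6), ("B", 6, 2), ("<EOM>", 8, 2)]
T = 10

def identify(code, low, rng):
    """Return (symbol, C_s, f_s) such that C_s <= (code - low) / scaled < C_s + f_s."""
    scaled = rng // T
    v = (code - low) // scaled
    for sym, c, f in TABLE:
        if c <= v < c + f:
            return sym, c, f
    return TABLE[-1]            # clamp slack at the top edge of the interval

# With code = 25100, low = 0, range = 100000: v = 2, which lies in A's band [0, 6).
print(identify(25100, 0, 100000)[0])   # prints A
```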
Illustrative Example
Consider encoding the sequence "ab" where the alphabet consists of symbols 'a' and 'b' with probabilities P(a) = 3/4 and P(b) = 1/4. Range coding uses an integer range, starting with [low, high) = [0, 256) for simplicity (8-bit precision).

- Initial range: [0, 256).
- Encode 'a': The range for 'a' is the first 3/4 of the current range, so [0, 192). Update: low = 0, high = 192.
- Encode 'b': Within [0, 192), the range for 'b' is the last 1/4, which is [144, 192) (since 192 * 1/4 = 48, starting from 192 - 48 = 144). Update: low = 144, high = 192.
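The two narrowing steps can be checked directly (a minimal sketch; integer arithmetic with the same truncation the example uses):

```python
low, high = 0, 256                      # initial 8-bit range [0, 256)

# Encode 'a': keep the first 3/4 of the interval.
high = low + (high - low) * 3 // 4
assert (low, high) == (0, 192)

# Encode 'b': skip past the cumulative P(a) = 3/4 and keep the final 1/4.
low = low + (high - low) * 3 // 4
assert (low, high) == (144, 192)
```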
