Recent from talks
Nothing was collected or created yet.
Variable-length quantity
View on WikipediaA variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. VLQ is identical to LEB128 except in endianness. See the example below.
Applications and history
[edit]Base-128 compression is known by many names – VB (Variable Byte), VByte, Varint, VInt, EncInt etc.[1]
A variable-length quantity (VLQ) was defined for use in the standard MIDI file format[2] to save additional space for a resource-constrained system, and is also used in the later Extensible Music Format (XMF).
Base-128 is also used in ASN.1 BER encoding to encode tag numbers and object identifiers.[3] It is also used in the WAP environment, where it is called variable length unsigned integer or uintvar. The DWARF debugging format[4] defines a variant called LEB128 (or ULEB128 for unsigned numbers), where the least significant group of 7 bits is encoded in the first byte, and the most significant bits are in the last byte (so effectively it is the little-endian analog of a VLQ). Google Protocol Buffers use a similar format to have compact representation of integer values,[5] as does Oracle Portable Object Format (POF)[6] and the Microsoft .NET Framework "7-bit encoded int" in the BinaryReader and BinaryWriter classes.[7]
It is also used extensively in web browsers for source mapping – which contain a lot of integer line and column number mappings – to keep the size of the map to a minimum.[8]
Variable-width integers in LLVM use a similar principle. The encoding chunks are little-endian and need not be 8 bits in size. The LLVM documentation describes a field that uses 4-bit chunk, with each chunk consisting of 1 bit continuation and 3 bits payload.[9]
Benefits
[edit]- Compactness: One of the primary benefits of VLQ encoding is its compactness. Since it uses a variable number of bytes to encode an integer, smaller integers can be represented using fewer bytes, resulting in a smaller overall file size. This is particularly useful in scenarios where storage space is at a premium, such as in embedded systems or mobile devices.
- Efficiency: VLQ encoding is an efficient way to store and transmit data. Since smaller integers are represented using fewer bytes, this reduces the amount of data that needs to be transmitted, which in turn reduces the time and bandwidth required to transmit the data.
- Flexibility: Another advantage of VLQ encoding is its flexibility. Since the number of bytes used to represent an integer is based on its magnitude, VLQ encoding can handle integers of different sizes. This means that VLQ encoding can be used to represent integers of any size, from small 8-bit integers to large 64-bit integers.
General structure
[edit]The encoding assumes an octet (an 8-bit byte) where the most significant bit (MSB), also commonly known as the sign bit, is reserved to indicate whether another VLQ octet follows.
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|
| 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 |
| A | Bn | ||||||
If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.
B is a 7-bit number [0x00, 0x7F], and n is the position of the VLQ octet where B0 is the least significant. The VLQ octets are arranged most significant first in a stream.
Variants
[edit]The general VLQ encoding is simple, but in basic form is only defined for unsigned integers (nonnegative, positive or zero), and is somewhat redundant, since prepending 0x80 octets corresponds to zero padding. There are various signed number representations to handle negative numbers, and techniques to remove the redundancy.
Group Varint Encoding
[edit]Google developed Group Varint Encoding (GVE) after observing that traditional VLQ encoding incurs many CPU branches during decompression. GVE uses a single byte as a header for 4 variable-length uint32 values. The header byte has 4 2-bit numbers representing the storage length of each of the following 4 uint32s. Such a layout eliminates the need to check and remove VLQ continuation bits. Data bytes can be copied directly to their destination. This layout reduces CPU branches, making GVE faster than VLQ on modern pipelined CPUs.[10]
PrefixVarint is a similar design but with a uint64 maximum. It is said to have "been invented multiple times independently".[11] It is possible to be changed into a chained version with infinitely many continuations.
Signed numbers
[edit]Sign bit
[edit]Negative numbers can be handled using a sign bit, which only needs to be present in the first octet.
In the data format for Unreal Packages used by the Unreal Engine, a variable-length quantity scheme called Compact Indices[12] is used. The only difference in this encoding is that the first VLQ octet has the sixth bit reserved to indicate whether the encoded integer is positive or negative. Any consecutive VLQ octet follows the general structure.
| First VLQ octet | Other VLQ octets | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 |
| A | B | C0 | A | Cn (n > 0) | |||||||||||
If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.
If B is 0, then the VLQ represents a positive integer. If B is 1, then the VLQ represents a negative number.
C is the number chunk being encoded, and n is the position of the VLQ octet where C0 is the least significant. The VLQ octets are arranged least significant first in a stream.
Zigzag encoding
[edit]An alternative way to encode negative numbers is to use the least significant bit for sign. This is notably done for Google Protocol Buffers, and is known as a zigzag encoding for signed integers.[13] One can encode the numbers so that encoded 0 corresponds to 0, 1 to −1, 10 to 1, 11 to −2, 100 to 2, etc.: counting up alternates between nonnegative (starting at 0) and negative (since each step changes the least-significant bit, hence the sign), whence the name "zigzag encoding". Concretely, transform the integer as (n << 1) ^ (n >> k - 1) for fixed k-bit integers.
Two's complement
[edit]LEB128 uses two's complement to represent signed numbers. In this scheme of representation, n bits encode a range from −2n to 2n − 1, and all negative numbers start with a 1 in the most significant bit. In Signed LEB128, the input is sign-extended so that its length is a multiple of 7 bits. From there the encoding proceeds as usual.[14]
In LEB128, the stream is arranged least significant first.[14]
Removing redundancy
[edit]With the VLQ encoding described above, any number that can be encoded with N octets can also be encoded with more than N octets simply by prepending additional 0x80 octets as zero-padding. For example, the decimal number 358 can be encoded as the 2-octet VLQ 0x8266, or the number 0358 can be encoded as 3-octet VLQ 0x808266, or 00358 as the 4-octet VLQ 0x80808266 and so forth.
However, the VLQ format used in Git[15] removes this prepending redundancy and extends the representable range of shorter VLQs by adding an offset to VLQs of 2 or more octets in such a way that the lowest possible value for such an (N + 1)-octet VLQ becomes exactly one more than the maximum possible value for an N-octet VLQ. In particular, since a 1-octet VLQ can store a maximum value of 127, the minimum 2-octet VLQ (0x8000) is assigned the value 128 instead of 0. Conversely, the maximum value of such a 2-octet VLQ (0xFF7F) is 16511 instead of just 16383. Similarly, the minimum 3-octet VLQ (0x808000) has a value of 16512 instead of zero, which means that the maximum 3-octet VLQ (0xFFFF7F) is 2113663 instead of just 2097151.
In this way, there is one and only one encoding of each integer, making this a base-128 bijective numeration.
Examples
[edit]
Here is a worked-out example for the decimal number 137:
- Represent the value in binary notation (e.g. 137 as 10001001)
- Break it up in groups of 7 bits starting from the lowest significant bit (e.g. 137 as 0000001 0001001). This is equivalent to representing the number in base 128.
- Take the lowest 7 bits, and that gives you the least significant byte (0000 1001). This byte comes last.
- For all the other groups of 7 bits (in the example, this is 000 0001), set the MSb to 1 (which gives 1000 0001 in our example). Thus 137 becomes 1000 0001 0000 1001, where the bits in boldface are something we added. These added bits denote whether there is another byte to follow or not. Thus, by definition, the very last byte of a variable-length integer will have 0 as its MSb.
Another way to look at this is to represent the value in base-128 and then set the MSB of all but the last base-128 digit to 1.
The Standard MIDI File format specification gives more examples:[2][16]
| Integer (decimal) |
Integer (binary) | Variable-length quantity (binary) | Integer (hexadecimal) |
Variable-length quantity (hexadecimal) |
|---|---|---|---|---|
| 0 | 00000000 00000000 00000000 00000000 |
00000000 |
00000000 | 00 |
| 127 | 00000000 00000000 00000000 01111111 |
01111111 |
0000007F | 7F |
| 128 | 00000000 00000000 00000000 10000000 |
10000001 00000000 |
00000080 | 81 00 |
| 8192 | 00000000 00000000 00100000 00000000 |
11000000 00000000 |
00002000 | C0 00 |
| 16383 | 00000000 00000000 00111111 11111111 |
11111111 01111111 |
00003FFF | FF 7F |
| 16384 | 00000000 00000000 01000000 00000000 |
10000001 10000000 00000000 |
00004000 | 81 80 00 |
| 2097151 | 00000000 00011111 11111111 11111111 |
11111111 11111111 01111111 |
001FFFFF | FF FF 7F |
| 2097152 | 00000000 00100000 00000000 00000000 |
10000001 10000000 10000000 00000000 |
00200000 | 81 80 80 00 |
| 134217728 | 00001000 00000000 00000000 00000000 |
11000000 10000000 10000000 00000000 |
08000000 | C0 80 80 00 |
| 268435455 | 00001111 11111111 11111111 11111111 |
11111111 11111111 11111111 01111111 |
0FFFFFFF | FF FF FF 7F |
References
[edit]- ^ Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression" Archived 2019-12-07 at the Wayback Machine. 2017. doi:10.1145/3035918.3064007.
- ^ a b MIDI File Format: Variable Quantities.
- ^ "ITU-T Recommendation X.690" (PDF). International Telecommunication Union. 2002.
- ^ DWARF Standard.
- ^ Google Protocol Buffers.
- ^ Oracle Portable Object Format (POF) Archived 2013-12-27 at the Wayback Machine.
- ^ System.IO.BinaryWriter.Write7BitEncodedInt(int) method and System.IO.BinaryReader.Read7BitEncodedInt() method.
- ^ Introduction to javascript source maps.
- ^ "LLVM Bitcode File Format", section "Variable Width Integers". Accessed 2019-10-01.
- ^ Jeff Dean. "Challenges in Building Large-Scale Information Retrieval Systems" (PDF). p. 58. Retrieved 2020-05-30.
- ^ Olesen, Jakob Stoklund (31 May 2020). "stoklund/varint". Archived from the original on 19 November 2020. Retrieved 9 July 2020.
- ^ "Unreal Packages". 1999-07-21. Archived from the original on 2010-08-20. Retrieved 2021-08-29.
- ^ Protocol Buffers: Encoding: Signed Integers.
- ^ a b Free Standards Group (December 2005). "DWARF Debugging Information Format Specification Version 3.0" (PDF). p. 70. Retrieved 2009-07-19.
- ^ "Git – fast, scalable, distributed revision control system". 28 October 2021.
- ^ Standard MIDI-File Format Spec. 1.1.
External links
[edit]- MIDI Manufacturers Association (MMA) - Source for English-language MIDI specs
- Association of Musical Electronics Industry Archived 2010-01-17 at the Wayback Machine (AMEI) -Source for Japanese-language MIDI specs
Variable-length quantity
View on GrokipediaOverview
Definition and Purpose
A variable-length quantity (VLQ) is a method for encoding non-negative integers using a variable number of bytes, where the number of bytes adapts to the magnitude of the value to ensure smaller integers occupy minimal space. This encoding scheme represents arbitrarily large integers through a sequence of octets, primarily targeting non-negative values up to 28 bits in its original MIDI form or more depending on the variant. The core mechanism relies on continuation bits: each byte except the last has its most significant bit (MSB) set to 1 to signal additional bytes follow, while the last byte has its MSB set to 0; the remaining 7 bits in each byte contribute to the integer value, concatenated in order from most to least significant.[1] The primary purpose of VLQ is to optimize storage and transmission efficiency in binary data formats, particularly where small values are predominant, thereby reducing average byte usage compared to fixed-width integer representations. By allocating only 1 byte for values up to 127 and additional bytes only as needed for larger numbers, VLQ minimizes overhead in scenarios with skewed distributions favoring low-magnitude integers, such as timing deltas or identifiers. This approach enables lossless compression without requiring prior knowledge of value ranges, making it suitable for streaming or serialized data. VLQ originated from the requirements of compact binary formats seeking to eliminate the inefficiencies of fixed-width fields, allowing flexible representation of integers in resource-constrained environments. The term was coined in the Standard MIDI Files format around 1988. It has been adopted in protocols like Protocol Buffers for varint encoding to enhance data interchange efficiency, though varints use a little-endian variant.[4]Historical Development
The concept of variable-length encoding for integers traces its roots to the 1970s and 1980s, when computing systems emphasized compact data representation amid limited storage and transmission resources. Early influences included variable-length encodings in emerging standards for network protocols and databases, addressing needs for scalable data formats without fixed-size overhead. A pivotal milestone occurred in the late 1980s with the standardization of similar encodings in key protocols. The Abstract Syntax Notation One (ASN.1) framework, specified in ITU-T Recommendation X.208 (1988), incorporated base-128 variable-length encoding (similar to VLQ) for Object Identifier arcs under Basic Encoding Rules (BER) in X.690 (1988), enabling variable-length representation of hierarchical identifiers in telecommunications and distributed systems. Concurrently, the Musical Instrument Digital Interface (MIDI) protocol, released in 1983 by the MIDI Manufacturers Association, adopted VLQ in its Standard MIDI Files format (formalized in 1988) to encode delta times for musical events, conserving space in resource-constrained audio hardware and files.[1] The 1990s reinforced the role of such encodings through broader adoption in internet and software standards, building on ASN.1 BER for protocols like X.500 directory services. By the 2000s, open-source tools amplified its use: Git, launched in 2005 by Linus Torvalds, integrated a VLQ variant in packfile headers to encode object lengths and deltas, enhancing repository efficiency.[5] Google's Protocol Buffers, prototyped in 2001 and publicly released in 2008, employed varints—little-endian base-128 encodings similar to VLQ—for integer serialization in distributed applications.[6] In the 2010s, similar encodings evolved within big data and web ecosystems. Apache Avro, introduced in 2009 as part of the Hadoop project, utilized zig-zag varints (a signed extension of VLQ-like encoding) for compact, schema-aware data storage and transmission. The Concise Binary Object Representation (CBOR), defined in RFC 7049 (2013), provided integer encoding using major types with additional information for length, differing from VLQ by using fixed octet counts for larger values rather than continuation bits. Post-2020, WebAssembly's binary format, with its minimum viable product in 2017 and core specification in 2019, standardized LEB128 (little-endian base-128, a VLQ relative) for all integers, optimizing module parsing and execution in browsers and beyond.[7] This progression underscores the adaptation of VLQ and related encodings from niche protocols to foundational elements in modern computing.Applications
In Data Serialization Protocols
Variable-length quantities (VLQs), particularly in the form of base-128 varints, play a central role in Protocol Buffers (protobuf), a widely used data serialization format developed by Google. In protobuf's wire format, varints encode field tags, lengths of length-delimited fields, and integer values, allowing small integers to use as few as one byte while supporting up to 64-bit values across up to ten bytes. This compact encoding facilitates schema evolution by enabling backward- and forward-compatible changes to message structures without breaking serialization. Additionally, by treating integers as unsigned during varint encoding, protobuf avoids sign extension issues that could lead to integer overflow or inefficiency in cross-platform serialization of signed values, particularly when using zigzag encoding for signed integers (sint32/sint64) to map negative numbers efficiently.[4][8] The Concise Binary Object Representation (CBOR), specified in RFC 7049, employs a form of variable-length encoding for integers and lengths to achieve compactness in binary data interchange. CBOR's major types use an initial byte with 5-bit additional information for short values (0-23), extended by 1, 2, 4, or 8 bytes for larger values, effectively providing VLQ-like efficiency for integers ranging from -2^64 to 2^64-1 and indefinite-length items. This design supports major types like integers, byte strings, and text strings, where lengths are encoded variably to minimize overhead for small payloads. MessagePack, another binary serialization format, utilizes similar variable-length integer encodings to produce small, fast outputs comparable to JSON but in binary form. Integers are encoded as fixints (single-byte for -32 to 127), or extended to uint8/16/32/64 and int8/16/32/64 based on magnitude, with strings and arrays prefixed by variable-length sizes using the same scheme. This approach ensures typical short strings and small integers require minimal bytes, enhancing efficiency in network transmission and storage. In the Abstract Syntax Notation One (ASN.1) standards, particularly under Basic Encoding Rules (BER) and Distinguished Encoding Rules (DER), INTEGER types employ variable-length representations prefixed by a definite length octet. The contents octets represent the integer in two's complement form using the minimal number of bytes (1 to 8 for typical values), with the length field itself supporting short (1 octet for 0-127) or long definite forms for larger encodings, ensuring flexibility for arbitrary-sized integers in protocols like X.509 certificates. DER adds determinism by mandating minimal octet lengths and primitive encoding for integers. More recent adoptions include gRPC, introduced in 2015, which leverages Protocol Buffers' varint-based encoding over HTTP/2 for efficient remote procedure calls, where varints optimize frame lengths and metadata in the multiplexed transport. FlatBuffers, released in 2014, supports zero-copy deserialization through prefixed lengths for variable-sized fields like strings, though it primarily uses fixed-size offsets rather than pure VLQs for core integer encoding.[9]In Version Control Systems
In version control systems, variable-length quantities (VLQs) are integral to compactly representing numeric data such as object sizes, offsets, and lengths in storage formats, particularly for delta-compressed revisions. This approach is prominent in systems handling large histories of incremental changes, where short values predominate, allowing minimal byte usage for the majority of entries while supporting larger values as needed. Git extensively employs VLQ in its packfile format to mitigate repository bloat from frequent small modifications. Each object entry begins with a variable-length header encoding both the object type (using 3 bits) and size (up to 4 bits in the first byte, extended by 7 bits per continuation byte, where the most significant bit signals more bytes). For deltified objects, the base and result sizes follow the same encoding, and offsets in OBJ_OFS_DELTA entries use a dedicated variable-length scheme: the first byte encodes 7 bits of offset, with subsequent bytes adding powers of 128 shifted by 7 bits each, prefixed by a special type byte. These encodings enable efficient delta chains in packfiles, ideal for repositories with numerous minor updates, as most deltas remain small and consume few bytes. Pack index files (version 2), however, use fixed-width 4- or 8-byte offsets for direct access, complementing the VLQ usage in data sections.[5] Apache Subversion applies VLQ in its svndiff delta format, used for compressing revision differences in both internal storage and dump files. All integers—such as source/target offsets, copy lengths, instruction lengths, and new data lengths—are encoded in a big-endian variable-length scheme: each byte holds 7 data bits, with the high bit set (1) for continuation bytes and cleared (0) for the final byte. For instance, a length of 130 is represented as two bytes (0x81 0x02). This format supports concise representation of node revisions and paths in dump files, facilitating efficient serialization of change histories without fixed-size overhead for small values.[10] Modern extensions like Git Large File Storage (LFS), introduced in 2015, leverage Git's packfile VLQ mechanisms indirectly for storing pointer metadata, including file sizes, in compressed objects, extending space efficiency to large-file handling in versioned repositories.Advantages and Limitations
Space Efficiency Benefits
Variable-length quantity (VLQ) encoding provides substantial space savings over fixed-length representations by allocating fewer bytes to smaller integers, which are prevalent in many datasets. For instance, a fixed 32-bit unsigned integer always requires 4 bytes, regardless of value, whereas VLQ uses 1 byte for values from 0 to 127, 2 bytes for 128 to 16,383, 3 bytes for 16,384 to 2,097,151, 4 bytes for 2,097,152 to 268,435,455, and 5 bytes for larger values up to 2^{32}-1. In typical web and protocol data, where most integers are small (e.g., IDs, lengths, or indices below 1,000), the average VLQ length is approximately 1.5 to 2 bytes, yielding 50-62.5% space reduction compared to 4-byte fixed encoding.[4] The mathematical foundation for these savings lies in the variable allocation based on the value's magnitude. The length in bytes for a positive integer is given by , since each byte contributes 7 payload bits (with the 8th bit as a continuation flag). This can be approximated as for , where the base-128 logarithm reflects the 7-bit grouping (). To derive this, note that the minimal number of 7-bit groups needed to represent is , plus one byte per group; the approximation holds because , and the ceiling adds the initial byte offset. For a uniform distribution over 0 to , the expected length exceeds 4 bytes (approximately 4.94 bytes, calculated as the weighted sum: , where is the proportion in the -byte range, dominated by ~93.75% in 5 bytes), offering minimal savings against 4-byte fixed but highlighting VLQ's overhead for large uniform ranges. However, for skewed distributions with small values (e.g., exponential or Zipf-like for IDs), the expected length drops to 1-2 bytes, achieving up to 75-80% savings versus fixed 32-bit, as 80-90% of values fit in 1 byte.[11] In real-world applications like data serialization, VLQ-based varints in Protocol Buffers reduce overall message sizes significantly compared to text-based JSON. Benchmarks show Protobuf messages averaging 1,157 bytes versus 12,187 bytes for JSON equivalents, a reduction exceeding 90% due to binary compactness and VLQ for integers, though typical cases yield 20-30% savings for mixed payloads.[12] Recent studies on database compression further demonstrate VLQ's benefits in columnar stores. In evaluations of lightweight integer compression, varint-based techniques like bitpacking achieve up to 53% memory reduction over dictionary encoding in TPC-H workloads and 27% over fixed-width vectors, particularly effective for skewed integer columns in analytical queries. For example, in ClickHouse and similar systems, VLQ variants in binary encodings and codecs compress variable-length integers in columns, reducing storage by 40-60% for datasets with many small values, as seen in 2021 analyses of in-memory OLAP systems.[13]Performance Considerations
The encoding of variable-length quantities (VLQs) typically employs a straightforward bit-shifting loop that processes the integer in 7-bit chunks, appending continuation bits, which proves faster than fixed-width encoding for small values due to reduced byte writes but introduces conditional branches for larger values requiring multiple bytes. Microbenchmarks indicate that VLQ encoding can achieve throughputs of up to 237 million integers per second on Apple M1 hardware using optimized unrolled loops, yet it remains approximately 10-20% slower than direct uint32 writes in scenarios with mixed value sizes owing to the overhead of length determination and shifting.[14] Decoding VLQs involves an iterative loop that accumulates bits until encountering a byte with the most significant bit (MSB) set to 0, introducing potential overhead in performance-critical paths due to variable iteration counts and branch checks per byte. This process is optimized in libraries such as Protocol Buffers' C++ implementation through branchless techniques and unrolled loops, reducing decoding latency to around 0.287 nanoseconds per operation in targeted benchmarks.[15][16] The time complexity of VLQ operations is O(log n) in the worst case for an n-bit integer, reflecting the proportional number of bytes needed, but averages O(1) for typical small values in real-world distributions like those in serialization protocols. Modern implementations leverage hardware support, such as SIMD instructions in AVX2-enabled CPUs, to unroll decoding loops and process multiple VLQs in parallel, achieving throughputs exceeding 500 million integers per second for uniform small values.[17][18] Key limitations include branch prediction misses when handling mixed-size data streams, which disrupt CPU pipelines and inflate cycle counts by up to 50% compared to predictable fixed-width accesses. For datasets with uniformly large values, fixed-width encodings are preferred to avoid these penalties, as VLQ's adaptive nature yields minimal benefits without size variability.[16] Post-2020 optimizations in Rust's prost crate have addressed decoding bottlenecks in Protocol Buffers by adopting zero-copy Bytes types and unsafe slice manipulations, reducing deserialization time by over 75% for high-volume requests like those in time-series databases. Similarly, Java's Chronicle Wire employs stop-bit encoding akin to VLQ for integers in low-latency trading systems, minimizing garbage collection and enabling microsecond-level serialization without excessive CPU overhead.[19][20]Core Encoding Mechanism
General Structure for Unsigned Integers
The general structure of variable-length quantity (VLQ) encoding for unsigned integers involves representing the value as a sequence of bytes where each byte carries 7 bits of the integer's binary representation, starting from the most significant bits. The value is divided into 7-bit groups, with the most significant group placed in the first byte and subsequent groups in following bytes. For all bytes except the last, the most significant bit (MSB, bit 7) is set to 1 (equivalent to OR'ing the 7-bit value with 0x80 in hexadecimal) to indicate continuation. The final byte has its MSB set to 0, signaling the end of the encoding. This scheme allows for compact representation of small values while accommodating larger ones through additional bytes.[1] Mathematically, for an unsigned integer value , the encoding produces bytes such that where is the number of bytes, denotes bitwise AND, and the final byte (MSB = 0). Each extracts the 7-bit payload from the -th group, and the powers of 128 (base ) account for the positional weighting in big-endian order of the groups. This formula ensures unambiguous decoding by accumulating shifted payloads from the byte stream.[1] The number of bytes required depends on the magnitude of , with each additional byte extending the range by a factor of 128. The following table summarizes the ranges for common byte counts, assuming unsigned integers up to 64 bits:| Bytes | Range | Maximum Value (Decimal) |
|---|---|---|
| 1 | 0 to 127 | |
| 2 | 128 to 16,383 | |
| 3 | 16,384 to 2,097,151 | |
| 4 | 2,097,152 to 268,435,455 | |
| 5–9 | Successive extensions | Up to |
| 10 | Full 64-bit range |
function encodeVLQ(unsigned v):
if v == 0:
return [0x00]
payloads = empty list
temp = v
while temp > 0:
payloads.append(temp & 0x7F)
temp >>= 7
// Now payloads has LSB to MSB groups
reverse payloads // Now MSB to LSB
bytes = empty list
for i = 0 to length(payloads) - 1:
byte = payloads[i]
if i < length(payloads) - 1:
byte |= 0x80 // Set continuation for all but last
bytes.append(byte)
return bytes
function encodeVLQ(unsigned v):
if v == 0:
return [0x00]
payloads = empty list
temp = v
while temp > 0:
payloads.append(temp & 0x7F)
temp >>= 7
// Now payloads has LSB to MSB groups
reverse payloads // Now MSB to LSB
bytes = empty list
for i = 0 to length(payloads) - 1:
byte = payloads[i]
if i < length(payloads) - 1:
byte |= 0x80 // Set continuation for all but last
bytes.append(byte)
return bytes
function decodeVLQ(bytes stream):
result = 0
while True:
if no more bytes:
error
byte = next byte from stream
payload = byte & 0x7F
result = (result << 7) | payload
if (byte & 0x80) == 0:
return result
if result > 2^64 - 1: // For 64-bit limit
error overflow
function decodeVLQ(bytes stream):
result = 0
while True:
if no more bytes:
error
byte = next byte from stream
payload = byte & 0x7F
result = (result << 7) | payload
if (byte & 0x80) == 0:
return result
if result > 2^64 - 1: // For 64-bit limit
error overflow
Byte-Level Representation
In variable-length quantity (VLQ) encoding, each byte consists of an 8-bit structure where the most significant bit (MSB) serves as a continuation flag and the remaining 7 bits carry payload data. A byte with MSB set to 1 (binary pattern 1xxxxxxx) indicates continuation, meaning additional bytes follow to complete the value, while an MSB of 0 (binary pattern 0xxxxxxx) marks the final byte. This design allows efficient representation by dedicating only 7 bits per byte to the actual value until the stop condition is reached.[1] Common hexadecimal examples illustrate this byte-level structure. The value 0 encodes as a single byte00, 127 as 7F, 128 as the two-byte sequence 81 00, and 16384 as the three-byte sequence 81 80 00. These encodings reflect the progressive addition of continuation bytes as the numerical value exceeds powers of 128 (2^7).[1]
To visualize the breakdown, the following table shows the binary composition for selected values, highlighting the MSB flag and 7-bit payload in each byte:
| Value (decimal) | Bytes (hex) | Binary Breakdown (MSB | Payload) |
|---|---|---|---|
| 0 | 00 | 0 | 0000000 | |
| 127 | 7F | 0 | 1111111 | |
| 128 | 81 00 | 1 | 0000001 0 | 0000000 | |
| 255 | 81 7F | 1 | 0000001 0 | 1111111 |
Encoding Variants
Group Varint Encoding
Group varint encoding is a variant of variable-length quantity (VLQ) designed to compress arrays of small, typically unsigned integers by processing them in fixed-size groups, thereby reducing overhead compared to individual VLQ encodings.[21] It groups up to four integers together, using a shared header to specify the byte length for each value in the group, which allows for more efficient decoding by minimizing branching instructions during processing.[22] This approach is particularly effective for contiguous sequences of integers, such as document IDs or term frequencies in inverted indexes. The structure begins with a single 8-bit header byte that encodes the lengths of the four values using 2 bits per value. Each 2-bit field (nibble) determines the number of bytes for the corresponding integer: 00 indicates 1 byte, 01 indicates 2 bytes, 10 indicates 3 bytes, and 11 indicates 4 bytes. The formula for the byte count per value is , where is the 2-bit value extracted from the header. Following the header, the values are concatenated directly as fixed-length fields without continuation bits, filling up to 17 bytes total per group (1 header + up to 4 × 4 bytes). For encoding, integers are partitioned into groups of four (padding with zeros if necessary for incomplete groups), the minimal byte length is selected for each to fit the value (capped at 4 bytes for 32-bit integers), the header is constructed by packing the 2-bit codes, and the byte streams are appended. Decoding involves reading the header, extracting the 2-bit codes, and sequentially consuming the specified number of bytes for each value, often accelerated via lookup tables for offsets and masks to enable SIMD optimizations.[23][21] This encoding is optimized for sorted or small-range integer lists, such as those in search engine indexes, where values often fit within fewer bytes and benefit from batch processing.[22] By eliminating the 1-bit continuation overhead per byte in traditional VLQs, it reduces per-value metadata to approximately 0.5 bits on average for small groups, improving space efficiency for dense, low-entropy data. Additionally, it achieves higher decoding throughput—up to 400 million integers per second—compared to standard 7-bit VLQs (around 180 million) or fixed 30-bit encodings (around 240 million), due to the predictable group structure that reduces CPU branch predictions.[21] For illustration, consider encoding the integers [42, 255, 1000, 2000000]:- 42 fits in 1 byte (00).
- 255 fits in 1 byte (00).
- 1000 fits in 2 bytes (01).
- 2000000 fits in 3 bytes (10).
Signed Integer Encodings
Variable-length quantity (VLQ) encodings are inherently designed for non-negative integers, efficiently representing them with a variable number of bytes based on magnitude. Extending VLQ to signed integers, which span negative and positive values, requires transforming the signed value into a non-negative equivalent suitable for standard unsigned VLQ application. This transformation maintains the core benefits of VLQ, such as compactness for small values, while covering the full signed range without fixed-length overhead.[4] Common methods for signed VLQ include ZigZag encoding and signed LEB128. ZigZag encoding, as implemented in Protocol Buffers, maps signed integers to unsigned ones by interleaving sign and magnitude bits, ensuring that values with small absolute magnitudes—regardless of sign—produce compact encodings, and preserving numerical order for applications like sorted data storage. For instance, it transforms negative values to odd unsigned numbers and positive to even, allowing seamless use of unsigned VLQ. This approach is particularly effective in serialization protocols where order preservation aids compression and querying efficiency.[4][25] Signed LEB128, used in WebAssembly binaries and DWARF debugging formats, directly encodes signed integers in a two's complement form with variable-length bytes, incorporating sign extension through continuation bits set to 1 for negative values. Unlike pure unsigned VLQ, it handles negatives natively by filling higher bits with the sign bit, which can result in longer encodings for large negative numbers but simplifies decoding without additional mapping. This method is optimized for low-level binary formats where direct signed representation aligns with hardware integer operations.[7] Other adaptations, such as sign-magnitude with a dedicated sign bit prepended to an unsigned VLQ-encoded magnitude or two's complement offsets, treat the signed value as an unsigned after adjustment to ensure compatibility with base VLQ structures. These variants appear in specialized systems like FlatBuffers, where signed offsets enable relative addressing in buffer layouts for more flexible data structures. For 32-bit signed integers (-2^{31} to 2^{31}-1), these encodings typically span 1 to 5 bytes, while 64-bit versions extend to 1 to 10 bytes, depending on value size and method. The choice of transformation often prioritizes monotonicity in encoded values to support efficient indexing in sorted datasets.[7]Advanced Techniques
Redundancy Reduction Methods
In standard variable-length quantity (VLQ) encodings, a potential redundancy arises from the possibility of non-minimal representations, where additional continuation bytes with zero lower bits (e.g., 0x80) could theoretically pad the encoding without altering the decoded value if the shifting rules are not strictly enforced. This ambiguity can lead to multiple byte sequences mapping to the same integer, complicating decoding and storage efficiency. To address this, implementations impose a strict canonical rule: encodings must use the minimal number of bytes, prohibiting additional continuation bytes with zero value bits, and ensuring the most significant bit of the first byte is 0 for values fitting in one byte, while subsequent bytes have their most significant bit set to 1 until the final byte.[26] This guarantees a unique representation for each unsigned integer, eliminating redundancy while preserving the variable-length benefits. VLQ schemes inherently employ prefix-free codes, where no valid encoding is a prefix of another, allowing instantaneous decoding in streams without explicit delimiters between values.[27] This property avoids the need for length prefixes or separators, reducing overhead in concatenated sequences. For applications with known bounded ranges, hybrid approaches combine VLQ with fixed-width encodings; small values use compact fixed bytes, while larger ones fall back to VLQ, optimizing space for predictable data distributions.[28] Advanced redundancy reduction often layers delta encoding atop VLQ for sequential data, computing differences (deltas) between consecutive values—typically by subtracting the prior value—and encoding these smaller deltas with VLQ to exploit locality and small changes.[28] This is particularly effective for sorted or time-series integers, as deltas are often positive and compact. In Git's pack files, a variant of big-endian VLQ avoids redundancy by using the shortest possible byte sequence, prohibiting additional 0x80 bytes that would pad the value without change, ensuring minimal and unique forms even in delta-compressed object storage.[29] For integer data in columnar formats, frame-of-reference (FOR) encoding can integrate with delta schemes by subtracting a base value (e.g., the minimum in a block) from each value and encoding the residuals, often using bit-packing for small ranges to reduce redundancy in clustered data.[28] This method, as implemented in systems like DuckDB for dates and timestamps, achieves high compression ratios for numerical columns by aligning values to a common reference frame before encoding.[30]Handling Negative Numbers
Variable-length quantity (VLQ) encodings, originally designed for unsigned integers, require adaptations to represent signed integers efficiently while maintaining compactness for small values. Common schemes transform signed values into a form compatible with unsigned VLQ mechanisms, addressing challenges like sign representation and preservation of numerical order. One straightforward method is the sign-magnitude approach, where a single sign bit is appended (0 for positive or zero, 1 for negative), followed by the VLQ encoding of the absolute value of the integer. This adds minimal overhead for the sign but effectively doubles the encoding length for negative values compared to their positive counterparts of the same magnitude, as the absolute value must still be encoded fully. The simplicity of this method makes it suitable for basic implementations, though it does not preserve sorted order between positive and negative numbers. A more efficient technique is zigzag encoding, which maps signed integers to unsigned integers in a way that interleaves positives and negatives while ensuring small-magnitude values (both positive and negative) use fewer bytes. In this scheme, a signed integer is transformed to an unsigned value using the formula: The reverse decoding from to is: This mapping ensures that consecutive signed integers, such as -1, 0, 1, map to 1, 0, 2, preserving relative order for applications like sorted keys in databases.[4] Zigzag is particularly effective when combined with VLQ, as it avoids the space inefficiency of negative values in other representations. Direct use of two's complement for signed integers in VLQ is possible but challenging due to sign extension, where negative numbers have leading 1 bits that force the varint to consume the maximum length (e.g., 10 bytes for a 32-bit negative value like -1).[4] This occurs because the continuation bits in VLQ interpret the high 1 bits as requiring additional bytes, making pure two's complement rarely used without modification in variable-length contexts. Zigzag encoding is preferred in systems like Protocol Buffers for its efficiency in sorted data scenarios, while the sign-magnitude method remains simpler for non-order-preserving uses despite its length penalty for negatives.[4] In Apache Avro (introduced in 2009 with stable specifications by 2011), a hybrid approach employs zigzag for signed integers within union types, prefixing the encoded value with a schema index to handle polymorphic data efficiently.[31]Practical Examples
Encoding Simple Values
The standard unsigned variable-length quantity (VLQ) encoding represents non-negative integers by packing 7 bits of data into each byte, using the most significant bit (MSB) as a continuation flag: 1 indicates more bytes follow, while 0 marks the final byte.[1] This approach ensures efficient storage for small values, which fit into a single byte, while larger values use additional bytes as needed. The bytes are ordered from most to least significant 7-bit group (big-endian). To encode, the value is divided into 7-bit groups starting from the least significant bit; these groups are then arranged in reverse order (highest first), with the continuation bit (0x80) ORed onto all bytes except the last one.[1] To encode a value, begin with the integer and extract 7-bit groups from the least significant bits, collect them, reverse the order, and apply continuation flags. For instance, consider encoding 0. The value has no bits set, so it requires a single byte with all zeros and no continuation: [0x00]. This single stop byte fully represents the value without any further bytes.[1] For 150 (binary: 10010110), the 7-bit groups are 22 (0010110, least significant) and 1 (0000001). Arrange as most significant first: 1, then 22. Set continuation on the first: 0x01 | 0x80 = 0x81. The second has no continuation: 0x16. Thus, the encoding is [0x81, 0x16].[1] Encoding 256 (binary: 100000000) gives groups 0 (least significant) and 2 (0000010). Arrange as 2, then 0. Continuation on first: 0x02 | 0x80 = 0x82. Second: 0x00. The result is [0x82, 0x00]. This demonstrates how powers of 128 often lead to trailing zero payloads in final bytes, emphasizing the variable nature of the format.[1]Decoding Process Illustration
The decoding process for a variable-length quantity (VLQ) in MIDI reverses the encoding by iteratively reading bytes from the stream and accumulating the value starting from the most significant 7-bit group. The most significant bit (MSB) of each byte serves as a continuation flag: if set (1), additional bytes follow; if clear (0), the current byte is the last. The decoder initializes a result to 0, then for each byte, shifts the current result left by 7 bits and ORs in the lower 7 bits of the byte (byte & 0x7F). If the MSB is clear, decoding stops. This big-endian grouping ensures efficient reconstruction of the original unsigned integer.[1] Consider the byte sequence [0x81, 0x16], which represents 150. The first byte 0x81 (binary 10000001) has MSB 1, so result = 0 << 7 | 1 = 1. The second byte 0x16 (binary 00010110) has MSB 0, so result = 1 << 7 | 22 = 128 + 22 = 150; stop.[1] For a larger value like 1000, the sequence is [0x87, 0x68]. The first byte 0x87 (binary 10000111) has MSB 1, so result = 0 << 7 | 7 = 7. The second byte 0x68 (binary 01101000) has MSB 0, so result = 7 << 7 | 104 = 896 + 104 = 1000; stop. This demonstrates handling multi-byte sequences where values exceed 127 (one byte).[1] Error conditions arise if the stream ends prematurely with a byte's MSB set (indicating expected continuation bytes are missing), rendering the value invalid. Additionally, for 32-bit unsigned integers (as limited in MIDI), sequences longer than 4 bytes are invalid, as they would exceed 28 bits (7 bits × 4 bytes). Implementations typically enforce these checks to ensure data integrity.[1] The following pseudocode illustrates the decoding loop using bitwise operations:function decodeVLQ(stream):
result = 0
while true:
if no more bytes in stream:
raise Error("Unexpected end of stream")
byte = readByte(stream)
result = (result << 7) | (byte & 0x7F)
if (byte & 0x80) == 0: // MSB clear, last byte
break
if result > 0x0FFFFFFF: // For [MIDI](/page/MIDI) 28-bit limit
raise Error("Value too large")
return result
function decodeVLQ(stream):
result = 0
while true:
if no more bytes in stream:
raise Error("Unexpected end of stream")
byte = readByte(stream)
result = (result << 7) | (byte & 0x7F)
if (byte & 0x80) == 0: // MSB clear, last byte
break
if result > 0x0FFFFFFF: // For [MIDI](/page/MIDI) 28-bit limit
raise Error("Value too large")
return result
