Lempel–Ziv–Welch
from Wikipedia

Lempel–Ziv–Welch (LZW) is a universal lossless compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improvement to the LZ78 algorithm published by Lempel and Ziv in 1978. Claimed advantages include simplicity of implementation and the potential for high throughput in hardware implementations.[1]

A large English text file can typically be compressed via LZW to about half its original size.

The algorithm became the first widely used universal data compression method used on computers. The algorithm was used in the compress program commonly included in Unix systems starting around 1986. It has since disappeared from many distributions, both because it infringed the LZW patent and because gzip produced better compression ratios using the LZ77-based DEFLATE algorithm. The algorithm found wide use when it became part of the GIF image format in 1987. It may optionally be used in TIFF and PDF files. Although LZW is available in Adobe Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image data in PDF files.

Algorithm


The scenario described by Welch's 1984 paper[1] encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.

In the practical application of image compression and for an image based on a color table, a natural character alphabet is the set of color table indexes. In the 1980s, many images had small color tables (on the order of 16 colors). For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a variable-width code was introduced: codes typically start one bit wider than the symbols being encoded, and as each code size is used up, the code width increases by 1 bit, up to some prescribed maximum (typically 12 bits). When the maximum code value is reached, encoding proceeds using the existing table, but new codes are not generated for addition to the table.

Further refinements include reserving a code to indicate that the code table should be cleared and restored to its initial state (a "clear code", typically the first value immediately after the values for the individual alphabet characters), and a code to indicate the end of data (a "stop code", typically one greater than the clear code). The clear code lets the table be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data. Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well.

Since codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on the variety of LZW used: the size of the alphabet, the maximum table size (and code width), whether variable-width encoding is used, initial code size, and whether to use the clear and stop codes (and what values they have). Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.

Encoding


The encoding process can be described as:

  1. Initialize the dictionary to contain all strings of length one.
  2. Find the longest string W in the dictionary that matches the current input.
  3. Emit the dictionary index for W to output and remove W from the input.
  4. Add W followed by the next symbol in the input to the dictionary.
  5. Go to Step 2.

A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string without the last character (i.e., the longest substring that is in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.
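The following Python sketch is a minimal illustration of this encoding loop under the assumption of a byte-oriented alphabet, unbounded integer codes, and no clear or stop codes; the function name lzw_encode is illustrative rather than taken from any library.

def lzw_encode(data: bytes) -> list[int]:
    """Minimal LZW encoder: emits a list of integer codes (no bit packing)."""
    # Initialize the dictionary with all single-byte strings, codes 0-255.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    output = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                        # keep extending the current sequence
        else:
            output.append(dictionary[w])  # emit code for the longest known match
            dictionary[wc] = next_code    # register the new, longer sequence
            next_code += 1
            w = bytes([byte])             # restart matching from this byte
    if w:
        output.append(dictionary[w])      # flush the final sequence
    return output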

In this way, successively longer strings are registered in the dictionary and available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum (i.e., the compression factor or ratio improves on an increasing curve, and not linearly, approaching a theoretical maximum inside a limited time period rather than over infinite time).[2]

Decoding


The decoding process can be described as:

  1. Initialize the dictionary to contain all strings of length one.
  2. Read the next encoded symbol.
  3. If the symbol is not encoded in the dictionary, go to step 7.
  4. Emit the corresponding string W to output.
  5. Concatenate the previous string emitted to output with the first symbol of W; add this to the dictionary.
  6. Go to step 9.
  7. Concatenate the previous string emitted to output with its first symbol; call this string V.
  8. Add V to the dictionary and emit V to output.
  9. Repeat step 2 until end of input.

The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the dictionary. However, the full dictionary is not needed, only the initial dictionary that contains single-character strings (and that is usually hard coded in the program, instead of sent with the encoded data). Instead, the full dictionary is rebuilt during the decoding process the following way: after decoding a value and outputting a string, the decoder concatenates it with the first character of the next decoded string (or the first character of current string, if the next one can't be decoded; since if the next value is unknown, then it must be the value added to the dictionary in this iteration, and so its first character is the same as the first character of the current string), and updates the dictionary with the new string. The decoder then proceeds to the next input (which was already read in the previous iteration) and processes it as before, and so on until it has exhausted the input stream.
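A matching decoder can be sketched in the same style (same assumptions as the encoder sketch above; the case of a code not yet in the dictionary, described in this paragraph, appears as the elif branch). With these two sketches, lzw_decode(lzw_encode(data)) reproduces data exactly.

def lzw_decode(codes: list[int]) -> bytes:
    """Minimal LZW decoder for the encoder sketch above."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    result = bytearray()
    w = dictionary[codes[0]]
    result += w
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code was just created by the encoder: it must be w + first char of w.
            entry = w + w[:1]
        else:
            raise ValueError("invalid LZW code")
        result += entry
        dictionary[next_code] = w + entry[:1]  # conjectured entry: w + first char of entry
        next_code += 1
        w = entry
    return bytes(result)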

Variable-width codes


If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in the encoded data so they don't disagree on boundaries between individual codes in the stream. In the standard version, the encoder increases the width from p to p + 1 when a sequence ω + s is encountered that is not in the table (so that a code must be added for it) but the next available code in the table is 2^p (the first code requiring p + 1 bits). The encoder emits the code for ω at width p (since that code does not require p + 1 bits), and then increases the code width so that the next code emitted is p + 1 bits wide.

The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it generates an entry for code 2^p − 1. Since this is the point where the encoder increases the code width, the decoder must increase the width here as well—at the point where it generates the largest code that fits in p bits.

Unfortunately, some early implementations of the encoding algorithm increase the code width and then emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "early change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in the header of each LZW-compressed stream to indicate whether early change is being used. Of the graphics file formats that support LZW compression, TIFF uses early change, while GIF and most others don't.

When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following the clear code.
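As a minimal sketch of the encoder-side bookkeeping described in this section (the decoder applies the same rule one entry earlier, at 2^p − 1, as explained above; the function name is illustrative):

def updated_code_width(last_assigned_code: int, width: int, max_width: int = 12) -> int:
    """Width to use for codes emitted after a new dictionary entry was assigned.

    In the standard convention, the width grows from p to p + 1 once an entry
    with code value 2**p (the first code needing p + 1 bits) has been created.
    "Early change" implementations (e.g. TIFF) grow one entry sooner, when the
    assigned code equals 2**width - 1.
    """
    if width < max_width and last_assigned_code == (1 << width):
        return width + 1
    return width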

Packing order


Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes. The two common methods are LSB-first ("least significant bit first") and MSB-first ("most significant bit first"). In LSB-first packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary. MSB-first packing aligns the first code so that its most significant bit falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with MSB going into the most significant bit not yet used in the current stream byte.

GIF files use LSB-first packing order. TIFF files and PDF files use MSB-first packing order.
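A minimal sketch of LSB-first packing (the GIF convention) follows, assuming the caller supplies each code together with its current bit width; MSB-first packing is the mirror image, filling each byte from its high bit downward.

def pack_lsb_first(codes_and_widths) -> bytes:
    """Pack (code, width) pairs into bytes, least significant bit first."""
    out = bytearray()
    bit_buffer = 0   # pending bits; the least significant bit is the oldest
    bit_count = 0    # number of valid bits currently in bit_buffer
    for code, width in codes_and_widths:
        bit_buffer |= code << bit_count   # append the new code above the pending bits
        bit_count += width
        while bit_count >= 8:
            out.append(bit_buffer & 0xFF) # emit the lowest 8 bits first
            bit_buffer >>= 8
            bit_count -= 8
    if bit_count:
        out.append(bit_buffer & 0xFF)     # pad the final byte with zero bits
    return bytes(out)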

Further coding


Many applications extend the algorithm by applying further encoding to the sequence of output symbols. Some package the coded stream as printable characters using some form of binary-to-text encoding. This increases the encoded length and decreases the compression rate. Conversely, increased compression can often be achieved with an adaptive entropy encoder. Such a coder estimates the probability distribution for the value of the next symbol, based on the observed frequencies of values so far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses shorter codes for values with higher probabilities.

Example


The following example illustrates the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the data. This example has been constructed to give reasonable compression on a very short message. In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency.

The plaintext to be encoded (from an alphabet using only the capital letters) is:

TOBEORNOTTOBEORTOBEORNOT#

There are 26 symbols in the plaintext alphabet (the capital letters A through Z). # is used to represent a stop code: a code outside the plaintext alphabet that triggers special handling. We arbitrarily assign these the values 1 through 26 for the letters, and 0 for the stop code '#'. (Most flavors of LZW would put the stop code after the data alphabet, but nothing in the basic algorithm requires that. The encoder and decoder only have to agree what value it has.)

A computer renders these as strings of bits. Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary is initialized with these 27 values. As the dictionary grows, the codes must grow in width to accommodate the additional entries. A 5-bit code gives 2^5 = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm must switch at that point from 5-bit strings to 6-bit strings (for all code values, including those previously output with only five bits). Note that since the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry is labeled 32. (Previously generated output is not affected by the code-width change, but once a 6-bit value is generated in the dictionary, it could conceivably be the next code emitted, so the width for subsequent output shifts to 6 bits to accommodate that.)

The initial dictionary, then, consists of the following entries:

Symbol Binary Decimal
# 00000 0
A 00001 1
B 00010 2
C 00011 3
D 00100 4
E 00101 5
F 00110 6
G 00111 7
H 01000 8
I 01001 9
J 01010 10
K 01011 11
L 01100 12
M 01101 13
N 01110 14
O 01111 15
P 10000 16
Q 10001 17
R 10010 18
S 10011 19
T 10100 20
U 10101 21
V 10110 22
W 10111 23
X 11000 24
Y 11001 25
Z 11010 26

Encoding


Buffer input characters in a sequence ω until ω + next character is not in the dictionary. Emit the code for ω, and add ω + next character to the dictionary. Start buffering again with the next character. (The string to be encoded is "TOBEORNOTTOBEORTOBEORNOT#".)

Current Sequence   Next Char   Output Code   Output Bits   Extended Dictionary   Comments
NULL T
T O 20 10100 27: TO 27 = first available code after 0 through 26
O B 15 01111 28: OB
B E 2 00010 29: BE
E O 5 00101 30: EO
O R 15 01111 31: OR
R N 18 10010 32: RN 32 requires 6 bits, so for next output use 6 bits
N O 14 001110 33: NO
O T 15 001111 34: OT
T T 20 010100 35: TT
TO B 27 011011 36: TOB
BE O 29 011101 37: BEO
OR T 31 011111 38: ORT
TOB E 36 100100 39: TOBE
EO R 30 011110 40: EOR
RN O 32 100000 41: RNO
OT # 34 100010 # stops the algorithm; send the cur seq
0 000000 and the stop code
Unencoded length = 25 symbols × 5 bits/symbol = 125 bits
Encoded length = (6 codes × 5 bits/code) + (11 codes × 6 bits/code) = 96 bits.

Using LZW has saved 29 bits out of 125, reducing the message by more than 23%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, sending repeated words very compactly.

Decoding


To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed as they are always simply concatenations of previous entries.

Input Bits   Input Code   Output Sequence   New Dictionary Entry (Full)   New Dictionary Entry (Conjecture)   Comments
10100 20 T 27: T?
01111 15 O 27: TO 28: O?
00010 2 B 28: OB 29: B?
00101 5 E 29: BE 30: E?
01111 15 O 30: EO 31: O?
10010 18 R 31: OR 32: R? created code 31 (last to fit in 5 bits)
001110 14 N 32: RN 33: N? so start reading input at 6 bits
001111 15 O 33: NO 34: O?
010100 20 T 34: OT 35: T?
011011 27 TO 35: TT 36: TO?
011101 29 BE 36: TOB 37: BE? 36 = TO + 1st symbol (B) of
011111 31 OR 37: BEO 38: OR? next coded sequence received (BE)
100100 36 TOB 38: ORT 39: TOB?
011110 30 EO 39: TOBE 40: EO?
100000 32 RN 40: EOR 41: RN?
100010 34 OT 41: RNO 42: OT?
000000 0 #

At each stage, the decoder receives a code X; it looks X up in the table and outputs the sequence χ it codes, and it conjectures χ + ? as the entry the encoder just added – because the encoder emitted X for χ precisely because χ + ? was not in the table, and the encoder goes ahead and adds it. But what is the missing letter? It is the first letter in the sequence coded by the next code Z that the decoder receives. So the decoder looks up Z, decodes it into the sequence ω and takes the first letter z and tacks it onto the end of χ as the next dictionary entry.

This works as long as the codes received are in the decoder's dictionary, so that they can be decoded into sequences. What happens if the decoder receives a code Z that is not yet in its dictionary? Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder just generated it, when emitting the previous code X for χ. Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows:

  1. The decoder sees X and then Z, where X codes the sequence χ and Z codes some unknown sequence ω.
  2. The decoder knows that the encoder just added Z as a code for χ + some unknown character c, so ω = χ + c.
  3. Since c is the first character in the input stream after χ, and since ω is the string appearing immediately after χ, c must be the first character of the sequence ω.
  4. Since χ is an initial substring of ω, c must also be the first character of χ.
  5. So even though the Z code is not in the table, the decoder is able to infer the unknown sequence and adds χ + (the first character of χ) to the table as the value of Z.

This situation occurs whenever the encoder encounters input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary, but cSc is not. The encoder emits the code for cS, putting a new code for cSc into the dictionary. Next it sees cSc in the input (starting at the second c of cScSc) and emits the new code it just inserted. The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this.

Although input of form cScSc might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition. In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort.

Patents


Various patents have been issued in the United States and other countries for LZW and similar algorithms. LZ78 was covered by U.S. patent 4,464,650 by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 10, 1981. Two US patents were issued for the LZW algorithm: U.S. patent 4,814,746 by Victor S. Miller and Mark N. Wegman and assigned to IBM, originally filed on June 1, 1983, and U.S. patent 4,558,302 by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983.

In addition to the above patents, Welch's 1983 patent also includes citations to several other patents that influenced it, including two 1980 Japanese patents (JP9343880A and JP17790880A) from NEC's Jun Kanatsu, U.S. patent 4,021,782 (1974) from John S. Hoerning, U.S. patent 4,366,551 (1977) from Klaus E. Holtz, and a 1981 German patent (DE19813118676) from Karl Eckhart Heinz.[3]

In 1993–94, and again in 1999, Unisys Corporation received widespread condemnation when it attempted to enforce licensing fees for LZW in GIF images. The 1993–1994 Unisys-CompuServe controversy (CompuServe being the creator of the GIF format) prompted a Usenet comp.graphics discussion Thoughts on a GIF-replacement file format, which in turn fostered an email exchange that eventually culminated in the creation of the patent-unencumbered Portable Network Graphics (PNG) file format in 1995.

Unisys's US patent on the LZW algorithm expired on June 20, 2003,[4] 20 years after it had been filed. Patents that had been filed in the United Kingdom, France, Germany, Italy, Japan and Canada all expired in 2004,[4] likewise 20 years after they had been filed.

Variants


LZMW


LZMW (by V. Miller, M. Wegman, 1985[5]) searches input for the longest string already in the dictionary (the "current" match); adds the concatenation of the previous match with the current match to the dictionary. Dictionary entries thus grow more rapidly; but this scheme is much more complicated to implement. Miller and Wegman suggest deleting low-frequency entries from the dictionary when the dictionary fills up.

LZAP


LZAP (by James Storer, 1988[6]) is a modification of LZMW. Instead of adding just the concatenation of the previous match with the current match to the dictionary, add the concatenations of the previous match with each initial substring of the current match ("AP" stands for "all prefixes"). For example, if the previous match is "wiki" and the current match is "pedia", then the LZAP encoder adds 5 new sequences to the dictionary: "wikip", "wikipe", "wikiped", "wikipedi", and "wikipedia", whereas the LZMW encoder adds only the single sequence "wikipedia". This eliminates some of the complexity of LZMW, at the price of adding more dictionary entries.

LZWL


LZWL is a syllable-based variant of LZW.

from Grokipedia
Lempel–Ziv–Welch (LZW) is a lossless compression method that encodes sequences of symbols by building and referencing a dynamic dictionary of substrings, replacing repeated patterns with shorter variable-length codes to reduce file size without loss. Developed by Terry A. Welch in 1984 as an adaptation of the earlier LZ78 algorithm by Abraham Lempel and Jacob Ziv, LZW operates adaptively, initializing a dictionary with single characters and incrementally adding longer strings encountered in the input stream. This dictionary-based approach enables efficient compression of text, images, and other data exhibiting redundancy, with typical compression ratios varying based on input but often achieving 2:1 or better for structured files.

LZW's core mechanism involves scanning the input sequentially: it outputs the code for the longest recognized string from the dictionary, appends the next input symbol to that string to form a new entry, and continues until no match is found, at which point the current symbol is output and used to extend the previous string. Decompression mirrors this process symmetrically, reconstructing the dictionary on the fly without needing a separate pass, ensuring both encoder and decoder remain synchronized. The algorithm uses fixed initial code lengths (e.g., 9 bits for 256 single-byte symbols) that can extend up to 12 or 13 bits as the dictionary grows to 4,096 or 8,192 entries, after which it may reset or clear to manage memory.

Renowned for its simplicity and speed, LZW found widespread adoption in early applications, notably powering the compression in the GIF image format since 1987, as well as optional use in TIFF and PDF files for lossless storage of graphics and documents. It also underpinned Unix's compress utility, facilitating efficient archiving of text files on systems with limited storage. However, LZW's deployment was complicated by a patent held by Unisys Corporation (expired in 2003 in the United States and 2004 internationally), which required licensing fees for commercial use and sparked alternatives such as PNG for web images. Variants such as LZW with early reset or modified dictionary management have since enhanced its robustness for modern streaming and error-prone environments.

Overview

Description

The Lempel–Ziv–Welch (LZW) algorithm is a universal lossless data compression technique developed by Terry Welch as an improvement to the LZ78 algorithm by Abraham Lempel and Jacob Ziv. It enables efficient reduction of data size without any loss of information, making it suitable for various applications including file storage and transmission. The algorithm was first described in detail by Welch in his 1984 paper, "A Technique for High-Performance Data Compression," published in IEEE Computer.

At its core, LZW operates by constructing a dynamic dictionary of strings during the compression process, which allows repeated sequences in the input data to be substituted with shorter numeric codes. This dictionary-based approach incrementally learns patterns from the input, assigning unique codes to increasingly longer phrases as compression proceeds. The input consists of a stream of symbols, such as text or binary files, while the output comprises a sequence of variable-length codes that represent these phrases, enabling decoding back to the original data. LZW typically reduces large English text files to about half their original size, demonstrating its effectiveness on data with inherent redundancies. It represents an improvement over the earlier LZ78 algorithm by Lempel and Ziv, incorporating refinements for better practical performance.

Key Principles

The Lempel–Ziv–Welch (LZW) algorithm employs a dynamic dictionary, also known as a code table, that serves as the core mechanism for compression. This dictionary begins with 256 predefined entries, each representing a single 8-bit symbol from the input alphabet, such as the ASCII character set ranging from 0 to 255. These initial entries allow the algorithm to handle basic symbols without prior learning, establishing a foundation for recognizing and encoding longer patterns efficiently.

Central to LZW's operation is the process of phrase building, where the encoder identifies the longest prefix of the current input string that matches an existing entry. Upon finding such a match, it outputs the corresponding code for that prefix and then forms a new potential entry by appending the next input symbol to this prefix. If this extended string is not already in the dictionary, it is added as a new entry, enabling the recognition of increasingly longer redundant sequences in the input. Code assignment occurs sequentially in a growing code space, with unique codes allocated starting from 258 (following reserved codes at 256 and 257), ensuring that each distinct phrase receives a distinct, compact representation whose range expands as the dictionary grows.

The dictionary includes two special reserved codes to manage its operation: code 256 serves as the clear code, which resets the dictionary to its initial state of 256 single-symbol entries when transmitted, preventing excessive growth and adapting to changes in data patterns. Code 257 acts as the end-of-information code, optionally signaling the termination of the data stream. LZW's lossless nature stems from the symmetric rules used in both encoding and decoding; the decoder independently reconstructs the dictionary by applying the same phrase-building logic to the received codes, guaranteeing exact recovery of the original input without any loss of information.
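A minimal sketch of this initial table layout, assuming the GIF-style convention just described (256 single-byte entries, clear code 256, end-of-information code 257, first dynamic code 258); the constant and function names are illustrative:

CLEAR_CODE = 256          # resets the dictionary to its initial state
END_OF_INFORMATION = 257  # optionally marks the end of the code stream
FIRST_DYNAMIC_CODE = 258  # first code assigned to a multi-symbol phrase

def initial_dictionary() -> dict:
    """Return the encoder's starting string table for an 8-bit alphabet."""
    return {bytes([i]): i for i in range(256)}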

History

Development

The Lempel–Ziv–Welch (LZW) algorithm traces its origins to the LZ78 compression method, developed by Abraham Lempel and Jacob Ziv and published in 1978. LZ78 employed a dictionary-based approach that parsed the input into distinct phrases, where each new phrase was formed by appending a novel symbol to an existing dictionary entry, using a dynamic dictionary that explicitly stored these phrases for reference during encoding. This work built upon Lempel and Ziv's earlier LZ77 algorithm from 1977, which utilized a sliding window mechanism to identify and encode repeated substrings from recent input history.

In 1984, Terry A. Welch, a researcher at the Sperry Research Center (later part of Unisys Corporation), refined LZ78 into the LZW algorithm. Welch's primary contribution was adapting the dictionary construction for real-time operation: instead of storing full strings explicitly, LZW maintained only indices in the dictionary and generated new entries implicitly by combining prior codes with input symbols, enabling immediate output of variable-length codes without buffering the entire input. The modifications addressed limitations in LZ78 for practical deployment, particularly by enhancing speed and reducing memory demands to support efficient hardware implementations, such as in disk controllers, and stream processing where low latency is essential. An early software implementation of LZW appeared in the Unix compress utility, developed by Spencer W. Thomas at the University of Utah and initially released in 1984 shortly after Welch's publication. Welch filed a patent for LZW in 1983, which was granted in 1985.

Adoption and Impact

Lempel–Ziv–Welch (LZW) compression saw initial adoption in the Unix operating system through the compress utility, initially released in 1984 and integrated into distributions like 4.3BSD around 1986, which produced files with the .Z extension and became a standard tool for lossless data compression in Unix environments. This early integration demonstrated LZW's efficiency in handling general-purpose file compression, leveraging its adaptive dictionary to achieve typical reductions of about 50-60% on text files without requiring prior knowledge of the data.

At its peak in the late 1980s, LZW gained widespread influence in image formats, notably as the core compression method in the Graphics Interchange Format (GIF), developed by CompuServe in 1987, where it efficiently compressed color-indexed image data and enabled compact transmission over early networks. Similarly, LZW was incorporated into the Tagged Image File Format (TIFF) with Revision 5.0 in October 1988, supporting lossless compression for raster images across various bit depths and proving particularly effective for repetitive patterns in palette-color data. These adoptions solidified LZW's role in multimedia and document standards, with optional use extending to formats like PDF for embedded images. LZW's broader impact lay in popularizing dictionary-based coding techniques, establishing adaptive string tables as a foundational approach in data compression and inspiring subsequent algorithms, including DEFLATE, which combined LZ77 sliding-window methods with Huffman coding to build on dictionary principles for improved performance.

The algorithm's prominence declined in the 1990s due to aggressive patent enforcement by Unisys Corporation starting in late 1994, which demanded royalties for LZW implementations and disrupted its use in web and software ecosystems. This accelerated the development of patent-free alternatives, such as the Portable Network Graphics (PNG) format in 1995, which employed DEFLATE for superior compression without licensing barriers, and gzip in 1992, an open-source tool offering better ratios than LZW-based compress via LZ77 and Huffman integration. Despite its decline, LZW's legacy persists in resource-constrained environments like embedded systems, where its low memory footprint (typically around 2KB for an implementation) makes it suitable for firmware and instruction memory compression in devices with limited RAM. The patent disputes also fueled broader debates on open-source compression, highlighting the need for royalty-free algorithms and contributing to the shift toward open, royalty-free methods in tools such as gzip and bzip2.

Algorithm

Dictionary Management

In the Lempel–Ziv–Welch (LZW) algorithm, the dictionary, also known as the string table or code table, is initialized with 256 entries indexed from 0 to 255, each representing a single-byte string corresponding to the possible values in an 8-bit alphabet. This setup ensures that the initial dictionary covers all individual characters without requiring any prior transmission of dictionary information between encoder and decoder. In the original formulation, the next available code value begins at 256 for the first dynamically added entry.

The dictionary is updated dynamically during processing by appending new phrases formed from the current matching prefix (the longest recognized string) concatenated with the next input symbol, assigning this new entry the next sequential code value if not already present. This promotes efficient recognition of repeating patterns without redundant storage. For instance, if the current prefix is the string "AB" (code 260) and the next symbol is "C", the entry for "ABC" is added as code 261 if absent. In the original LZW, there is no special clear code; when the dictionary reaches its maximum size (commonly 4096 entries for 12-bit codes), new additions cease, and encoding continues using existing entries. Some implementations, such as in the GIF format, reserve code 256 for a special clear code to reinitialize the dictionary to its original 256 entries, clearing the current prefix, with dynamic codes starting at 258 (257 being the end-of-information code). Upon receiving the clear code, the decoder performs the same reinitialization, ensuring synchronization.

The dictionary does not store full strings explicitly; instead, it maintains indices or pointers to facilitate quick lookups and reconstructions, minimizing storage overhead while allowing phrases up to the maximum length supported by the code size. This implicit representation contributes to the algorithm's efficiency in both hardware and software implementations. The synchronization between encoder and decoder is a core feature of LZW, as both independently construct and update identical dictionaries using the same rules based on the sequence of codes processed; no dictionary contents are transmitted over the channel. This self-synchronizing property ensures that decoding mirrors encoding precisely, provided the input stream begins after initialization (or a clear code in variants), making LZW suitable for streaming applications.

Encoding Process

The Lempel–Ziv–Welch (LZW) encoding process compresses an input stream of symbols by dynamically building a dictionary of phrases and outputting variable-length codes representing the longest matching phrases found in the input. The algorithm reads the input one symbol at a time, maintaining a current phrase (initially empty) and extending it greedily while matches exist in the dictionary.

The process begins with the dictionary initialized to contain single-symbol entries (e.g., codes 0 to 255 for an 8-bit alphabet), as detailed in the dictionary management phase. The encoder sets the current phrase w to empty. It then reads the next input symbol k, forming a candidate phrase w + k. If w + k exists in the dictionary, the encoder appends k to w and continues reading the next symbol, seeking the longest possible match. If w + k is not in the dictionary, the encoder outputs the code corresponding to the current w, adds the new phrase w + k to the dictionary with the next available code, and resets w to k. This loop repeats until all input symbols are processed. At the end of the input stream, the encoder outputs the code for the final w. In the original formulation, no explicit stop code is used; however, some implementations, such as the GIF format, append an end-of-information (EOI) code (typically 257 for an 8-bit alphabet) to signal termination.

In the original LZW, when the dictionary reaches its maximum size (e.g., 4096 entries for 12-bit codes), the encoder ceases adding new phrases and continues matching existing ones until the input ends, without resetting. Variants like the GIF implementation introduce a clear code (typically 256) that, when output periodically or upon fullness, resets the dictionary to its initial state and reinitializes phrase building, allowing continued compression on subsequent data blocks. The following pseudocode outlines the core encoding loop, assuming a function to check dictionary membership and assign codes (for original LZW without special codes):

initialize dictionary with single symbols (codes 0 to alphabet_size - 1)
w ← ""                          // current phrase
while there is input:
    k ← read next input symbol
    if w + k is in dictionary:
        w ← w + k
    else:
        output code for w
        add w + k to dictionary with next available code
        w ← k
output code for w               // final phrase


This procedure achieves compression by exploiting repeated patterns through adaptive dictionary growth.

Decoding Process

The decoding process in Lempel–Ziv–Welch (LZW) reconstructs the original input data from a sequence of variable-length codes by dynamically building a dictionary that mirrors the one constructed during encoding, ensuring lossless decompression without transmitting the dictionary itself. The algorithm processes the codes sequentially, outputting strings from the dictionary and extending it based on the decoded content, which allows the decoder to adaptively learn longer phrases as decompression proceeds.

Initialization begins with the dictionary populated identically to the encoder's starting state, typically containing 256 entries for single-byte symbols (codes 0–255 representing the standard character set). In variants like GIF, codes 256 and 257 are reserved for clear and EOI, respectively. For the original LZW without special codes: read the first code c, output the corresponding single-character string from the dictionary, and set the current prefix w to this value. In the main loop, while additional codes are available, read the next code c. Retrieve the associated string as entry and output it; set k to the first character of entry. Form the new phrase as the previous w plus k, add this to the dictionary with the next code value, and update w to entry. For the special case where c is not yet in the dictionary (it equals the next available code), construct entry as w appended with the first character of w, output it, and set k to that first character.

In implementations with special codes like GIF, special handling occurs before dictionary lookup: if c is the clear code (256), reset the dictionary to its initial state without output; if c is EOI (257), halt decoding. Otherwise, proceed with the lookup or special case as above. This mechanism addresses dictionary overflow by allowing periodic resets, maintaining efficiency. The process is nearly symmetric to encoding but operates inversely. The following pseudocode outlines the core decoding procedure for variants with special codes (e.g., GIF):

initialize dictionary with codes 0–255 mapping to single characters
// reserve 256: clear, 257: EOI (no string is stored for them)
read first code c
if c == 257:                        // EOI: end immediately
    stop
if c == 256:                        // clear: dictionary is already in its initial state
    read next code c
entry = string(c)
output entry
w = entry
while there is input:
    read next code c
    if c == 256:                    // clear code
        reset dictionary to initial state
        read next code c            // the code after a clear is handled like the first code
        entry = string(c)
        output entry
        w = entry
        continue
    if c == 257:                    // end-of-information code
        break
    if c is in dictionary:
        entry = string(c)
    else:
        entry = w + w[0]            // code not yet defined: must be w plus its first character
    output entry
    add w + entry[0] to dictionary with next code
    w = entry


Implementation Details

Variable-Length Codes

In the Lempel–Ziv–Welch (LZW) algorithm, variable-length codes enable efficient compression by dynamically adjusting the bit width of codes as the dictionary grows, starting with a typical width of 9 bits for 8-bit input data to accommodate the initial 256 entries (codes 0–255) plus room for 256 additional entries (up to code 511). This initial 9-bit width ensures that single-byte symbols can be represented while allowing early growth without excessive overhead. The code length increases at specific thresholds tied to dictionary size: it grows to 10 bits when code 512 is assigned, to 11 bits at code 1024, and to 12 bits at code 2048, with a common maximum of 12 bits supporting up to 4096 total codes (indices 0–4095). When the dictionary reaches its maximum size, such as code 4095 in a 12-bit scheme, the encoder outputs a special clear code (typically 256) to reset the dictionary to its initial state and revert the code length to 9 bits, preventing overflow and restarting the growth process. The maximum number of assignable codes at a given bit length b is 2^b − 1 after accounting for reserved codes like clear and end-of-information. Both the encoder and decoder maintain synchronization by tracking the current code length based on the number of dictionary entries added, ensuring that output codes are always padded or read to the prevailing length during transmission. This adaptive mechanism enhances efficiency by using shorter codes when the dictionary is small and fewer repetitions are encoded, avoiding the overhead of fixed longer codes from the outset and achieving better compression ratios on repetitive data without early bloat.

Bit Packing and Output Order

In the Lempel–Ziv–Welch (LZW) algorithm, variable-length codes are converted to binary representations and packed into a continuous bitstream for output, with the width of each code determined by the current dictionary size as established in prior stages of the encoding process. These binary codes are typically written most significant bit (MSB) first or least significant bit (LSB) first, depending on the implementation; for instance, the GIF format packs bits LSB-first, filling bytes from the least significant bit to the most significant bit, while the Tagged Image File Format (TIFF) and Portable Document Format (PDF) use MSB-first packing. This packing ensures efficient use of storage by allowing codes to straddle byte boundaries without alignment restrictions, where the bits of one code may continue into the next byte after the previous code's bits are placed.

When the total bits from a sequence of codes do not exactly fill an output byte, the remaining bits in that byte are handled either by padding with zeros or by carrying over to the next code's placement, maintaining the bitstream's continuity without introducing gaps. In GIF, for example, the compressed data stream is divided into blocks of up to 255 bytes, each preceded by a byte indicating the block length, and the final block is terminated by a zero-length indicator, ensuring decoders can process the stream sequentially. Similarly, TIFF LZW streams for each image strip begin with a clear code (value 256) to initialize the dictionary and end with an end-of-information code (value 257), with no additional padding required as the data is byte-aligned at strip boundaries.

The output order of LZW codes follows the sequence in which they are generated during encoding, preserving the temporal order of the compressed symbols for faithful decoding. Byte-level endianness (little-endian or big-endian) generally does not affect the bitstream itself, as LZW outputs are processed as a sequence of bytes independent of host byte order, though specific formats like GIF employ little-endian conventions for multi-byte elements surrounding the LZW data. Implementations often include a header prefix specifying the initial code size; in GIF, this is a single byte denoting the minimum code size (typically 2–8 bits, with encoding starting at one more bit), which signals the decoder to begin with that bit width. This approach to bit packing contributes to efficiency by minimizing overhead compared to fixed-length coding, where a 12-bit code, for example, occupies exactly 1.5 bytes on average across the stream.

Additional Optimizations

One common optimization to the basic LZW algorithm involves early clearing of the dictionary, where the clear code is output before the dictionary reaches its maximum size if the compression ratio begins to degrade. This technique monitors the recent output bits versus the number of codes emitted; if the ratio drops below a threshold, indicating diminishing returns from dictionary growth, the encoder sends the clear code to reset the dictionary and restart with shorter codes, potentially improving overall compression for data with varying redundancy patterns. For instance, in implementations processing large files, clearing after a code-size increase but before the dictionary fills can yield better ratios by avoiding inefficient long codes on less compressible segments.

Another enhancement is post-coding, where the sequence of variable-length LZW codes is further compressed using a secondary entropy coder such as Huffman or arithmetic coding to exploit statistical biases in the code distribution. This hybrid approach cascades LZW's dictionary-based compression with a final pass that assigns shorter bit lengths to more frequent codes, achieving higher overall ratios without altering the core LZW process. Experiments on text and image data have shown improvements of 10-20% in compression ratio depending on the statistics of the code stream after LZW, with arithmetic coding often outperforming Huffman due to its adaptability. In some variants, such as certain extensions in image formats, this post-processing is integrated to handle the LZW output stream more efficiently.

Dictionary pruning represents a less common but targeted optimization, involving the selective removal of infrequently used or redundant entries from the dictionary instead of a full reset via the clear code, thereby extending the effective code space and reducing memory usage. Typically employing a least-recently-used (LRU) policy, this method deletes entries based on access recency during encoding, preventing dictionary bloat on data with transient patterns while preserving key phrases for ongoing compression. Studies on text files demonstrate modest gains in compression ratio (up to 5%) and speed for large inputs, though it increases computational overhead for entry management. This approach is rarely used in standard implementations due to added complexity but proves useful in resource-constrained environments.

Block coding optimizes LZW by dividing the input stream into fixed-size blocks, compressing each independently with its own dictionary reset at block boundaries, which limits memory per block and adapts to localized data characteristics. This prevents the dictionary from growing indefinitely across the entire input and allows parallel processing, though it may introduce minor overhead from repeated initializations. On repetitive or structured data like images, block sizes of 4-8 KB have been shown to balance ratio and efficiency, with further gains possible when block-level statistical preprocessing is combined with the LZW pass. These methods collectively trade algorithmic simplicity for better performance on specific data types, increasing implementation complexity but enhancing adaptability.
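A hedged sketch of the ratio-monitoring idea behind early clearing follows; the window, threshold, and function name are illustrative only and are not taken from any particular implementation.

def should_clear(input_bytes_in_window: int, output_bits_in_window: int,
                 best_ratio_so_far: float, slack: float = 1.1) -> bool:
    """Heuristic: suggest emitting the clear code when recent compression degrades.

    Compares the compression ratio over a recent window of input against the
    best ratio observed so far; if it has worsened beyond a slack factor, the
    encoder may emit the clear code and rebuild the dictionary from scratch.
    """
    if output_bits_in_window == 0:
        return False
    current_ratio = (8.0 * input_bytes_in_window) / output_bits_in_window
    return current_ratio * slack < best_ratio_so_far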

Example

Encoding Illustration

To illustrate the LZW encoding process, consider the sample input string "TOBEORNOTTOBEORTOBEORNOT#", where uppercase letters use their standard ASCII codes (e.g., T=84, O=79, B=66, E=69, R=82, N=78) and # serves as an end-of-input marker with ASCII code 35. The initial dictionary consists of 256 entries for single-byte values (codes 0–255 corresponding to all possible 8-bit characters). The process starts with an empty prefix string w and reads the input character by character, extending w with the current character c if w + c exists in the dictionary; otherwise, it outputs the code for w, adds the new entry w + c to the dictionary with the next available code (starting from 256), and sets w = c. At the end of the input, the code for the final w is output. The encoding proceeds as follows, building longer phrases over time due to repetitions in the string:
  • Read T (84): w = "T" (exists).
  • Read O (79): "TO" not in dictionary; output 84 ("T"), add "TO" as code 256, set w = "O".
  • Read B (66): "OB" not in dictionary; output 79 ("O"), add "OB" as code 257, set w = "B".
  • Read E (69): "BE" not in dictionary; output 66 ("B"), add "BE" as code 258, set w = "E".
  • Read O (79): "EO" not in dictionary; output 69 ("E"), add "EO" as code 259, set w = "O".
  • Read R (82): "OR" not in dictionary; output 79 ("O"), add "OR" as code 260, set w = "R".
  • Read N (78): "RN" not in dictionary; output 82 ("R"), add "RN" as code 261, set w = "N".
  • Read O (79): "NO" not in dictionary; output 78 ("N"), add "NO" as code 262, set w = "O".
  • Read T (84): "OT" not in dictionary; output 79 ("O"), add "OT" as code 263, set w = "T".
  • Read T (84): "TT" not in dictionary; output 84 ("T"), add "TT" as code 264, set w = "T".
  • Read O (79): "TO" in dictionary (256); set w = "TO".
  • Read B (66): "TOB" not in dictionary; output 256 ("TO"), add "TOB" as code 265, set w = "B".
  • Read E (69): "BE" in dictionary (258); set w = "BE".
  • Read O (79): "BEO" not in dictionary; output 258 ("BE"), add "BEO" as code 266, set w = "O".
  • Read R (82): "OR" in dictionary (260); set w = "OR".
  • Read T (84): "ORT" not in dictionary; output 260 ("OR"), add "ORT" as code 267, set w = "T".
  • Read O (79): "TO" in dictionary (256); set w = "TO".
  • Read B (66): "TOB" in dictionary (265); set w = "TOB".
  • Read E (69): "TOBE" not in dictionary; output 265 ("TOB"), add "TOBE" as code 268, set w = "E".
  • Read O (79): "EO" in dictionary (259); set w = "EO".
  • Read R (82): "EOR" not in dictionary; output 259 ("EO"), add "EOR" as code 269, set w = "R".
  • Read N (78): "RN" in dictionary (261); set w = "RN".
  • Read O (79): "RNO" not in dictionary; output 261 ("RN"), add "RNO" as code 270, set w = "O".
  • Read T (84): "OT" in dictionary (263); set w = "OT".
  • Read # (35): "OT#" not in dictionary; output 263 ("OT"), add "OT#" as code 271, set w = "#".
At the end of the input, output the code for the final w ("#"), which is 35.

The dictionary grows incrementally with each new phrase encountered, starting with single characters and adding progressively longer sequences like "TO" (256), "OB" (257), "BE" (258), "EO" (259), "OR" (260), "RN" (261), "NO" (262), "OT" (263), "TT" (264), "TOB" (265), "BEO" (266), "ORT" (267), "TOBE" (268), "EOR" (269), "RNO" (270), and "OT#" (271). This adaptive growth captures repetitions, such as the recurring "TOB" and "OR" patterns.

The complete sequence of output codes before any bit-level packing is: 84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263, 35. This represents the compressed form, where initial single-character codes are followed by multi-character phrase codes as the dictionary expands.
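For readers who want to reproduce this trace, the following self-contained Python sketch (using the ASCII-initialized dictionary assumed in this example; the function name is illustrative) prints the same code sequence:

def encode_ascii_lzw(text: str) -> list[int]:
    """LZW encode over an ASCII-initialized dictionary (codes 0-255)."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    codes = []
    for c in text:
        if w + c in dictionary:
            w += c
        else:
            codes.append(dictionary[w])
            dictionary[w + c] = next_code
            next_code += 1
            w = c
    if w:
        codes.append(dictionary[w])
    return codes

print(encode_ascii_lzw("TOBEORNOTTOBEORTOBEORNOT#"))
# [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263, 35]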

Decoding Illustration

To illustrate the LZW decoding process, consider the sequence of codes produced during encoding of the string "TOBEORNOTTOBEORTOBEORNOT#": 84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263, 35. The decoder initializes the dictionary with codes 0 to 255 mapping to single ASCII characters (e.g., 84 to 'T', 79 to 'O', 66 to 'B', 69 to 'E', 82 to 'R', 78 to 'N', 35 to '#'). It reads codes one by one, outputs the corresponding strings, and dynamically adds new entries to the dictionary to maintain synchronization with the encoder.

The process starts by reading the first code and setting the previous string w to the corresponding entry. For each subsequent code, the decoder retrieves the string S for the code; if the code exists in the dictionary, it outputs S and adds the entry w + S[0] (the previous string concatenated with the first character of S) to the dictionary as the next code. It then updates w to S. The dictionary grows identically to the encoding phase, ensuring consistent mappings.

A special case arises when the code equals the next available dictionary index (not yet defined). In this situation, the decoder constructs S as w + w[0] (the previous string concatenated with its own first character), outputs S, adds S to the dictionary as the next code, and sets w to S. This handles scenarios where the encoder outputs a code for a newly formed phrase just after adding it, allowing the decoder to infer the missing entry without prior knowledge. The following table traces the initial steps of decoding for clarity, showing code input, output string, dictionary addition, and cumulative output (continuing similarly until the end):
Step | Code Read | Retrieved String (S) | Output | Dictionary Addition (Code) | w After Step | Cumulative Output
1 | 84 | 'T' | 'T' | (none for first code) | 'T' | 'T'
2 | 79 | 'O' | 'O' | 'T' + 'O' = 'TO' (256) | 'O' | 'TO'
3 | 66 | 'B' | 'B' | 'O' + 'B' = 'OB' (257) | 'B' | 'TOB'
4 | 69 | 'E' | 'E' | 'B' + 'E' = 'BE' (258) | 'E' | 'TOBE'
5 | 79 | 'O' | 'O' | 'E' + 'O' = 'EO' (259) | 'O' | 'TOBEO'
6 | 82 | 'R' | 'R' | 'O' + 'R' = 'OR' (260) | 'R' | 'TOBEOR'
7 | 78 | 'N' | 'N' | 'R' + 'N' = 'RN' (261) | 'N' | 'TOBEORN'
8 | 79 | 'O' | 'O' | 'N' + 'O' = 'NO' (262) | 'O' | 'TOBEORNO'
9 | 84 | 'T' | 'T' | 'O' + 'T' = 'OT' (263) | 'T' | 'TOBEORNOT'
10 | 256 | 'TO' | 'TO' | 'T' + 'T' = 'TT' (264) | 'TO' | 'TOBEORNOTTO'
11 | 258 | 'BE' | 'BE' | 'TO' + 'B' = 'TOB' (265) | 'BE' | 'TOBEORNOTTOBE'
... | ... | ... | ... | ... | ... | ...
17 | 35 | '#' | '#' | 'OT' + '#' = 'OT#' (next, if continued) | '#' | 'TOBEORNOTTOBEORTOBEORNOT#'
The trace continues through code 263 ('OT'), yielding the full reconstructed string "TOBEORNOTTOBEORTOBEORNOT#". The final code 35 outputs the terminator '#', completing the decompression without loss. This parallel dictionary growth ensures the decoder reconstructs the exact original data.

Applications

File Formats

The Lempel–Ziv–Welch (LZW) compression algorithm has been integrated into several file formats for lossless data compression, particularly in graphics and document storage. One of the earliest and most prominent adoptions is in the Graphics Interchange Format (GIF), introduced in 1987 with GIF87a and extended in GIF89a. In GIF, LZW compresses sequences of color-indexed pixels from the image's active color table, rasterized in left-to-right, top-to-bottom order. The compressed data is packaged into sub-blocks of up to 255 bytes each, preceded by a block size byte and terminated by a zero-length block (0x00), allowing flexible streaming. The initial code size is specified in the header via the LZW Minimum Code Size byte, typically 8 bits for the pixel values with encoding starting one bit wider (e.g., 9 bits), and increasing up to a maximum of 12 bits as the dictionary grows. A clear code (2 raised to the power of the minimum code size) is inserted at the start of each data stream to reset the compression and decompression tables, ensuring independent processing per block. This implementation faced patent-related controversies in the 1990s, leading to temporary restrictions on usage.

In the Tagged Image File Format (TIFF), LZW serves as an optional compression method, designated by compression code 5 in the Image File Directory (IFD). Introduced in TIFF Revision 5.0 (1988) and formalized in the TIFF 6.0 specification (1992), it is applied horizontally to individual scanlines of image data, treating each as a separate compression stream. The algorithm supports variable-length codes up to 12 bits, beginning with 9-bit codes for the initial dictionary entries (256 literals plus reserved codes, with up to 4096 total), incrementing to 10 bits after 512 entries, 11 bits after 1024, and 12 bits after 2048, with a hard limit of 12 bits to avoid exceeding the dictionary size. This makes LZW suitable for palette-color or grayscale images in TIFF, though it was later supplemented by other methods like Deflate due to licensing concerns.

The Portable Document Format (PDF) incorporated LZW via the LZWDecode filter starting with PDF 1.2 (1993), applying it to general data streams including content streams, ASCII text, inline images, embedded fonts, and thumbnails. The filter uses variable code lengths from 9 to 12 bits, with an optional EarlyChange parameter (default 1) to allow earlier code length increases for better efficiency on certain data patterns, and supports predictor functions (Predictor value > 1) for differencing samples. However, LZWDecode has been deprecated since PDF 2.0 (ISO 32000-2:2017) in favor of the FlateDecode filter, which provides superior compression ratios using the zlib/deflate algorithm and avoids LZW's patent encumbrances.

LZW has also appeared in other contexts where it can be implemented programmatically for compressing image or text data despite the absence of native filter support. Additionally, some arcade games from the late 1980s and 1990s employed LZW for compressing graphics or data assets to fit hardware constraints.
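As an illustration of the GIF sub-block framing described above, the following hypothetical helper wraps an already LZW-compressed byte stream in length-prefixed sub-blocks; the function name is illustrative, and the layout follows the length-byte convention stated in this section.

def gif_sub_blocks(lzw_bytes: bytes) -> bytes:
    """Wrap an LZW-compressed byte stream in GIF data sub-blocks.

    Each sub-block holds at most 255 bytes and is preceded by a length byte;
    a zero-length block terminates the sequence.
    """
    out = bytearray()
    for start in range(0, len(lzw_bytes), 255):
        chunk = lzw_bytes[start:start + 255]
        out.append(len(chunk))   # block size byte
        out += chunk
    out.append(0)                # block terminator
    return bytes(out)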

Software Implementations

The Unix compress utility, first released in the mid-1980s and commonly included in Unix systems by about 1986, implements the LZW algorithm to create compressed .Z files, employing variable code lengths (9 bits up to a configurable maximum, commonly 16 bits) and emitting clear codes to reset the dictionary when necessary. This tool prioritized speed over maximum compression, making it suitable for general-purpose file compression in early computing environments.

The open-source GIFLIB library, originally developed as libgif in the late 1980s and actively maintained since, includes robust LZW encoder and decoder routines specifically tailored for handling GIF data streams. It supports code sizes from 2 to 12 bits as specified in the GIF standard, enabling efficient reading and writing of LZW-compressed GIF files across various platforms.

In Python, while the built-in zlib module focuses on DEFLATE compression and does not natively support LZW, third-party packages such as python-lzw offer pure-Python implementations for general LZW compression and decompression. For image processing, the Pillow library (a fork of PIL) applies LZW encoding when saving files in GIF format, ensuring compatibility with legacy graphics workflows. Example usage involves loading an image and saving it with image.save('output.gif', format='GIF'), which applies LZW automatically.

Java's Image I/O API (ImageIO), part of the standard library since Java 1.4, provides support for reading LZW-compressed GIF images, with TIFF support available through plugins or, in later Java versions, bundled readers. Writing GIF files with LZW is handled directly, while TIFF output with LZW compression often requires the Java Advanced Imaging (JAI) Image I/O Tools plugin for full encoder capabilities.

Following the expiration of key LZW patents (Unisys's U.S. patent in June 2003 and European and Japanese counterparts in 2004), open-source implementations have proliferated without licensing restrictions. Today, LZW is seldom deployed as a standalone compression tool due to superior alternatives like DEFLATE, but it remains embedded in format-specific libraries like Pillow and GIFLIB for maintaining compatibility with legacy files such as GIF and TIFF.
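A minimal example of the Pillow usage mentioned above (file names are placeholders):

from PIL import Image

# Loading an existing image and re-saving it as GIF applies LZW encoding
# automatically through Pillow's GIF plugin.
img = Image.open("input.png")
img.save("output.gif", format="GIF")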

Patent History

The Lempel–Ziv–Welch (LZW) compression algorithm is primarily covered by two key U.S. patents: U.S. Patent No. 4,558,302, issued to Terry A. Welch on December 10, 1985, for a "High speed data compression and decompression device," and U.S. Patent No. 4,814,746, issued to Victor S. Miller and Mark N. Wegman on March 21, 1989, for a "Data compression method." The Welch patent, filed on June 20, 1983, and assigned to Sperry Corporation (later Unisys Corporation following a 1986 merger), describes a method for compressing data streams by building a dynamic string table (dictionary) that stores encountered character sequences and outputs variable-length codes referencing those strings, enabling on-the-fly dictionary construction without explicit transmission of the dictionary contents. The Miller and Wegman patent, filed on August 11, 1986 (with priority to a June 1, 1983, application) and assigned to IBM, covers an apparatus and method variant focused on adaptive data compression for communication systems, incorporating similar dictionary-based encoding with modifications for terminal-host interactions.

These patents stem from Welch's 1984 refinement of the earlier LZ78 algorithm by Abraham Lempel and Jacob Ziv, adapting it for practical high-performance use in hardware and software. Their claims broadly encompass dictionary-based techniques that parse input into phrases, assign codes to novel phrases formed by concatenating existing dictionary entries, and update the dictionary dynamically during encoding and decoding; this distinguishes LZW from the original LZ78, which the patents do not cover because it was protected separately under U.S. Patent No. 4,464,650. International equivalents exist for the Welch/Sperry patent, including European Patent No. 0 127 886 (granted in multiple countries) and Japanese patents, extending coverage to similar dictionary-building mechanisms abroad.

In the United States, the Welch patent expired on June 20, 2003, twenty years from its filing date, under pre-1995 term rules that grant the longer of twenty years from filing or seventeen years from issuance. The Miller and Wegman patent expired on August 11, 2006. European counterparts to the Welch patent lapsed on June 18, 2004, and Japanese equivalents expired on June 20, 2004. Prior to expiration, Unisys enforced licensing for LZW implementations, including in formats like GIF; for instance, in 1999 the company sought a $5,000 flat fee plus royalties (such as 0.45% of product sales price) from software vendors using LZW in GIF encoders. IBM similarly licensed its patent, though less aggressively in practice. These licensing fees contributed to industry efforts to develop alternatives such as PNG during the patent era.

Licensing Controversies

In 1994, Unisys Corporation began enforcing its patents on the Lempel–Ziv–Welch (LZW) compression algorithm, sending letters to software developers and companies such as CompuServe, the creator of the Graphics Interchange Format (GIF), demanding royalties for its use in GIF files. This action sparked widespread backlash in the developer community, including online campaigns like "Burn All GIFs," which urged users to abandon the format and avoid LZW-based tools in protest of the perceived overreach of software patents. The controversy prompted the rapid development of the Portable Network Graphics (PNG) format in 1995 as a royalty-free alternative to GIF, replacing LZW compression with the patent-unencumbered DEFLATE algorithm (a derivative of LZ77) while supporting similar features such as transparency and lossless compression. Unisys pursued additional claims against other entities, including a settlement with CompuServe over the GIF implementation and licensing agreements with Adobe Systems for LZW use in formats like the Tagged Image File Format (TIFF) and aspects of the Portable Document Format (PDF); these disputes were resolved out of court without public litigation details.

The enforcement significantly affected free and open-source software, where projects deliberately avoided LZW to evade potential infringement, removing or disabling GIF write support in various tools until the patents expired and thereby delaying adoption in those ecosystems. This reluctance stemmed from Unisys's refusal to grant royalty-free licenses to non-commercial or open-source developers, forcing alternatives such as uncompressed formats or patent-free compressors. The core U.S. patent (No. 4,558,302) expired on June 20, 2003, with international counterparts following in 2004, enabling unrestricted global use of LZW and restoring GIF's viability in open-source contexts without licensing fears. The episode underscored broader concerns about software patents encumbering de facto standards, influencing subsequent debates on patent reform and the promotion of open alternatives in graphics and data compression.

Variants

LZMW

The Lempel–Ziv–Miller–Wegman (LZMW) algorithm is a dictionary-based lossless data compression method developed by Victor S. Miller and Mark N. Wegman in 1985 as a variant of the LZW algorithm. It builds phrases in the dictionary by concatenating the entire previous and current matches, enabling faster growth of long entries compared to LZW's one-symbol-at-a-time appending. In LZMW, the encoder initializes the dictionary with all individual symbols of the input alphabet, each assigned a unique code. It then scans the input stream, identifying the longest prefix starting at the current position that matches an existing entry (the current match); the code for this match is output, and the input position advances by the length of the match. The dictionary is then updated: if the concatenation of the previous match and the current match is not already an entry, it is added under the next available code, and the current match becomes the previous match for the next iteration. Decoding mirrors this process symmetrically, requiring the decoder to maintain its own dictionary and track the previous match in order to reconstruct and extend entries, which introduces additional state management not needed in LZW. This multi-symbol concatenation accelerates dictionary growth, particularly benefiting compression of repetitive or structured data where long phrases emerge quickly, and it often yields better ratios than LZW on such inputs. However, the approach complicates decoding because entries depend on whole prior phrases, increasing implementation overhead, and LZMW has seen far less adoption than LZW owing to these intricacies and LZW's established use in formats such as GIF and TIFF.
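A minimal Python sketch of the LZMW encoding rule described above (illustrative only; a practical implementation would use a trie and a bounded code space). Note that, unlike LZW, the LZMW dictionary is not prefix-closed, so this sketch tries every candidate length instead of stopping at the first missing prefix:

```python
def lzmw_encode(data: bytes):
    # Dictionary maps byte strings to integer codes, seeded with all single bytes.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    output = []
    previous = b""
    pos = 0
    while pos < len(data):
        # Longest dictionary entry matching at the current position.  The LZMW
        # dictionary is not prefix-closed, so every length must be considered.
        match = data[pos:pos + 1]
        for end in range(pos + 1, len(data) + 1):
            if data[pos:end] in dictionary:
                match = data[pos:end]
        output.append(dictionary[match])
        # LZMW rule: add previous match + current match as one new phrase.
        if previous and previous + match not in dictionary:
            dictionary[previous + match] = next_code
            next_code += 1
        previous = match
        pos += len(match)
    return output

# The multi-symbol entry b"abab" enters the dictionary after only a few steps.
print(lzmw_encode(b"abababab"))   # [97, 98, 256, 256, 256]
```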

LZAP

LZAP, or Lempel–Ziv All Prefixes, is a dictionary-based variant introduced by James A. Storer as an extension of the LZMW algorithm. Unlike LZMW, which adds only the full concatenation of the previous match and the current match to the dictionary, LZAP adds the previous match concatenated with every nonempty prefix of the current match as new entries. For instance, if the previous match is "AB" and the current match is "CD", LZAP adds both "ABC" and "ABCD", ensuring that intermediate forms are immediately available for subsequent matching. This modification accelerates the recognition of recurring substrings by pre-populating the dictionary with partial phrases. Implementations typically use a tree-structured dictionary, commonly realized as a trie, to support fast lookups and insertions, which is particularly advantageous for inputs with large alphabets such as text or genomic data. Encoding proceeds adaptively, with the dictionary evolving according to the symbols encountered in the input stream, so recently used phrases become available without explicit frequency tracking; because the LZAP dictionary remains prefix-closed, longest-match searches can walk the trie symbol by symbol, keeping lookup cost proportional to the phrase length rather than to the dictionary size. In terms of performance, LZAP achieves compression ratios comparable to LZW on typical data, but at the cost of more work per step because several dictionary entries may be added for each match. This leads to faster dictionary exhaustion and greater memory demands, making it suitable for applications with ample resources but less ideal for constrained environments. It has been applied in specialized text compression scenarios where rapid phrase reuse predominates, such as early archiving systems. LZW can be viewed as a conceptual simplification of this approach, limiting updates to a single entry per step and trading some parsing efficiency for reduced overhead.
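A minimal Python sketch of the "all prefixes" update rule (illustrative only; a real implementation would store the dictionary in a trie and bound the code space). Because the LZAP dictionary stays prefix-closed, the longest match can be found by extending the candidate one symbol at a time:

```python
def lzap_encode(data: bytes):
    # Dictionary maps byte strings to integer codes, seeded with all single bytes.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    output = []
    previous = b""
    pos = 0
    while pos < len(data):
        # The dictionary is prefix-closed, so stop at the first missing prefix.
        match = data[pos:pos + 1]
        end = pos + 2
        while end <= len(data) and data[pos:end] in dictionary:
            match = data[pos:end]
            end += 1
        output.append(dictionary[match])
        # LZAP rule: add previous match + every nonempty prefix of the match.
        if previous:
            for k in range(1, len(match) + 1):
                candidate = previous + match[:k]
                if candidate not in dictionary:
                    dictionary[candidate] = next_code
                    next_code += 1
        previous = match
        pos += len(match)
    return output

print(lzap_encode(b"banana bandana banana"))
```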

LZWL

LZWL is a variant of the Lempel–Ziv–Welch (LZW) algorithm developed in the mid-2000s, adapted to compress natural-language text by operating on syllables rather than individual characters as the primary units. It keeps LZW's dictionary-based approach but exploits linguistic structure to capture redundancy better in morphologically rich languages with frequent syllable repetition, addressing a limitation of standard LZW, which has no notion of sub-word units aligned with phonetic or semantic patterns. In LZWL, the dictionary initially stores a set of frequent syllables extracted from language-specific corpora, serving as a predefined vocabulary that accelerates compression of common patterns. During encoding, the input text is first decomposed into syllables using hyphenation-style algorithms that respect word boundaries, so phrases are built around meaningful linguistic segments; new dictionary entries are then added by concatenating the longest matching phrase with the next syllable, just as character-based LZW extends phrases by one symbol. Optional optimizations, such as hashing for faster lookups, can improve efficiency for larger dictionaries, and genetic algorithms may be used to tune the initial dictionary for a particular language.

This syllable-oriented modification is most beneficial for texts with repetitive sub-word elements, particularly small to medium-sized files. For instance, on English texts of 100 bytes to 1 KB, LZWL has been reported to achieve 5.43–5.61 bits per character depending on the initial dictionary (including a "C65" dictionary), slightly worse than bzip2's 5.28 bits per character in the same range; similar results hold for Czech texts, where LZWL reaches about 6.15 bits per character against bzip2's 6.05. These outcomes indicate only minor gains from genetic optimization of the initial dictionary on small files, reflecting the relatively small and predictable set of syllables in natural-language text. Despite its conceptual appeal for linguistic data, LZWL remains largely experimental and has seen little adoption outside research settings, chiefly because it requires language-specific preprocessing and because general-purpose LZW implementations dominate in file formats. It is best suited to short documents in languages such as Czech, German, or English, where syllable decomposition exposes redundancy without extensive training.
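A toy sketch of the syllable-level idea follows. Assumptions: naive_syllables is a crude vowel-group splitter standing in for a real hyphenation algorithm, unseen syllables are silently assigned codes rather than being escaped and transmitted as an actual LZWL coder would do, and the initial dictionary is passed in instead of being learned from a corpus:

```python
import re

def naive_syllables(text: str):
    # Crude stand-in for a hyphenation-based splitter: cut after each vowel group.
    return re.findall(r"[^aeiou\s]*[aeiou]+|[^aeiou\s]+|\s+", text.lower())

def lzwl_encode(text: str, initial_syllables=()):
    dictionary = {}                      # maps tuples of syllables to integer codes
    def code_for(unit):
        if unit not in dictionary:
            dictionary[unit] = len(dictionary)
        return dictionary[unit]
    for syl in initial_syllables:        # corpus-derived seed vocabulary
        code_for((syl,))
    output = []
    phrase = ()
    for syl in naive_syllables(text):
        code_for((syl,))                 # see lead-in: unseen syllables not escaped here
        if phrase + (syl,) in dictionary:
            phrase = phrase + (syl,)     # extend the phrase by one syllable
        else:
            output.append(dictionary[phrase])
            dictionary[phrase + (syl,)] = len(dictionary)
            phrase = (syl,)
    if phrase:
        output.append(dictionary[phrase])
    return output

print(lzwl_encode("banana bandana banana", initial_syllables=("ba", "na", " ")))
```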

Comparisons

With LZ77 and LZ78

The Lempel–Ziv–Welch (LZW) algorithm belongs to the family of Lempel–Ziv compression methods and shares core similarities with LZ77 and LZ78: all are lossless, universal algorithms that compress by dynamically building a dictionary of substrings from the input and replacing repeated patterns with references to dictionary entries, without requiring a statistical model of the source. LZ77 operates over a fixed-size sliding window of recent input, finding the longest match within that window and encoding it as a back-reference giving the distance from the current position and the match length (optionally followed by a literal for the next unmatched symbol). LZW instead maintains an explicit, expandable dictionary that is not constrained by a window, allowing it to detect and encode non-local or widely separated repeats without rescanning a limited history.

LZW builds directly on LZ78's dictionary construction, in which each new phrase is the longest recognized prefix of the remaining input extended by one further symbol. LZ78, however, encodes each new phrase as a pair consisting of the index of the prefix and the appended literal symbol, whereas LZW outputs only the index of the recognized prefix (after pre-loading the dictionary with all single symbols), relying on the decoder to infer the symbol that extends the entry and thereby eliminating the redundant literal from the compressed stream. These modifications make LZW well suited to real-time and streaming use: its output needs fewer bits per phrase than LZ78's symbol-inclusive pairs, and it avoids LZ77's window management. LZW emerged in 1984 as a hardware-friendly refinement of LZ78, prioritizing encoding simplicity and throughput for practical high-speed applications, and it achieves compression ratios broadly similar to the other Lempel–Ziv methods across diverse data types, often with faster decoding.
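To make the LZ78-versus-LZW output difference concrete, the following toy Python encoders (unbounded dictionaries, no variable-width bit packing) show that LZ78 emits (index, symbol) pairs while LZW emits indices alone:

```python
def lz78_encode(data: str):
    """Toy LZ78 encoder: emits (dictionary index, next symbol) pairs."""
    dictionary = {"": 0}
    output, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        output.append((dictionary[phrase], ""))   # trailing phrase, no new symbol
    return output

def lzw_encode(data: str):
    """Toy LZW encoder: dictionary pre-seeded with single symbols, emits indices only."""
    dictionary = {chr(i): i for i in range(256)}
    output, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            output.append(dictionary[phrase])
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output

print(lz78_encode("ABABABA"))   # [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A')]
print(lzw_encode("ABABABA"))    # [65, 66, 256, 258]
```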

Performance Characteristics

The Lempel–Ziv–Welch (LZW) algorithm typically achieves compression ratios of 2:1 to 3:1 on English text, reducing a large file to roughly half its original size by identifying repetitive sequences. For random or otherwise incompressible data, LZW performs poorly, often yielding no size reduction or even slight expansion because of coding overhead. In contrast, it excels on repetitive data such as certain images or structured files, where ratios can reach 4:1 or higher by efficiently encoding common substrings. On 1990s-era hardware, LZW encoding speeds ranged from approximately 1 to 5 MB/s, depending on implementation and data characteristics, and decoding was faster, often 1.5 to 2 times quicker, because it performs simple table lookups rather than string matching. Memory usage for the dictionary is modest, typically 4 to 64 KB for a 12-bit code size supporting 4096 entries, since each entry need only store a prefix code and an appended symbol; larger code sizes scale memory roughly linearly but increase the risk of table overflow. Compared with modern alternatives such as DEFLATE (LZ77 combined with Huffman coding), LZW generally yields worse compression on text and mixed data, though it remains simpler and adequate for specific uses such as image formats. It is slower than ultra-fast options like LZ4 but requires little memory, which keeps it usable in real-time applications. A key limitation is dictionary overflow: once the code table fills, the encoder must either freeze the table or reset it, discarding learned patterns and temporarily reducing efficiency until new entries rebuild the redundancy capture. LZW is also vulnerable to adversarial inputs designed to exhaust the dictionary with unique sequences, leading to near-zero compression on crafted data. For illustration, encoding the repetitive string "TOBEORNOTTOBEORTOBEORNOT" (24 bytes) produces only 16 LZW codes, because recurring phrases such as "TO", "BE", "OR", and "TOB" are each emitted as a single code; with 9-bit codes this amounts to 144 output bits against 192 input bits, about 75% of the original size.
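A quick way to check such figures is a toy encoder that counts output bits under a simple variable-width scheme (a sketch only; it ignores clear and stop codes and the "early change" convention used by GIF and TIFF):

```python
def lzw_bit_count(data: bytes, max_width: int = 12):
    """Encode `data` with a toy variable-width LZW and return
    (number of codes emitted, total output bits)."""
    dictionary = {bytes([i]): i for i in range(256)}
    width, bits, n_codes = 9, 0, 0
    phrase = b""
    for byte in data:
        ch = bytes([byte])
        if phrase + ch in dictionary:
            phrase += ch
        else:
            bits += width
            n_codes += 1
            if len(dictionary) < (1 << max_width):
                dictionary[phrase + ch] = len(dictionary)
                # Widen the code once the next code no longer fits.
                if len(dictionary) > (1 << width) and width < max_width:
                    width += 1
            phrase = ch
    if phrase:
        bits += width
        n_codes += 1
    return n_codes, bits

data = b"TOBEORNOTTOBEORTOBEORNOT"
codes, bits = lzw_bit_count(data)
print(codes, bits, len(data) * 8)   # 16 codes, 144 output bits, 192 input bits
```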

References
