Lempel–Ziv–Welch

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improvement to the LZ78 algorithm published by Lempel and Ziv in 1978. Its claimed advantages include simplicity of implementation and the potential for high throughput in a hardware implementation.

A large English text file can typically be compressed via LZW to about half its original size.

The algorithm became the first widely used universal data compression method on computers. It was used in the compress program commonly included in Unix systems starting around 1986. It has since disappeared from many distributions, both because it infringed the LZW patent and because gzip produced better compression ratios using the LZ77-based DEFLATE algorithm. The algorithm found wide use when it became part of the GIF image format in 1987. It may optionally be used in TIFF and PDF files. Although LZW is available in Adobe Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image data in PDF files.

The scenario described by Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary. The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.
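As a rough sketch, the following Python function implements this fixed-width scheme for 8-bit input; the name lzw_encode and the max_code parameter are choices made for this illustration, not part of any particular format.

```python
def lzw_encode(data: bytes, max_code: int = 4095) -> list[int]:
    # Codes 0-255 stand for the 256 one-byte sequences; learned
    # sequences receive codes from 256 upward, capped at max_code.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    output = []
    seq = b""
    for byte in data:
        candidate = seq + bytes([byte])
        if candidate in table:
            # Keep extending the sequence while the table knows it.
            seq = candidate
        else:
            # Emit the code for the longest matched sequence ...
            output.append(table[seq])
            # ... and, if there is room, define a code for that
            # sequence extended by the new character.
            if next_code <= max_code:
                table[candidate] = next_code
                next_code += 1
            seq = bytes([byte])
    if seq:
        output.append(table[seq])
    return output
```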

In the practical application of image compression and for an image based on a color table, a natural character alphabet is the set of color table indexes. In the 1980s, many images had small color tables (on the order of 16 colors). For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a variable-width code was introduced: codes typically start one bit wider than the symbols being encoded, and as the codes of each width are used up, the code width increases by 1 bit, up to some prescribed maximum (typically 12 bits). When the maximum code value is reached, encoding proceeds using the existing table, but new codes are not generated for addition to the table.
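One way to realize this width rule is sketched below; the helper and its signature are invented for this example, and format-specific quirks (such as TIFF's "early change" convention) are ignored.

```python
def code_width(next_code: int, start_width: int, max_width: int = 12) -> int:
    # Widen by one bit whenever the next free code no longer fits in
    # the current width; at max_width the width stays fixed and no
    # further table entries are created.
    width = start_width
    while next_code > (1 << width) - 1 and width < max_width:
        width += 1
    return width
```

For a 16-color image, for example, symbols are 4 bits wide, so codes would start at 5 bits and grow toward the 12-bit maximum.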

Further refinements include reserving a code to indicate that the code table should be cleared and restored to its initial state (a "clear code", typically the first value immediately after the values for the individual alphabet characters), and a code to indicate the end of data (a "stop code", typically one greater than the clear code). The clear code lets the table be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data. Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well.
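Under those conventions, building the initial table might look like the following sketch; the names are invented for this illustration, and the exact values depend on the format.

```python
def make_initial_table(symbol_bits: int):
    alphabet_size = 1 << symbol_bits
    # One code per alphabet symbol, keyed here by 1-tuples of indexes.
    table = {(i,): i for i in range(alphabet_size)}
    clear_code = alphabet_size       # first value after the alphabet
    stop_code = alphabet_size + 1    # one greater than the clear code
    next_code = alphabet_size + 2    # first free slot for learned sequences
    return table, clear_code, stop_code, next_code
```

On emitting or receiving the clear code, both sides would discard the learned entries and reset next_code to alphabet_size + 2.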

Since codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on the variety of LZW used: the size of the alphabet, the maximum table size (and code width), whether variable-width encoding is used, initial code size, and whether to use the clear and stop codes (and what values they have). Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.
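A decoder matching the fixed-width encoder sketch above could look like this; the special case covers the one code a decoder can receive before it has defined it, which arises when the encoder emits a code for a sequence it added on the immediately preceding step.

```python
def lzw_decode(codes: list[int], max_code: int = 4095) -> bytes:
    if not codes:
        return b""
    # Start from the same 256 one-byte codes as the encoder.
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    stream = iter(codes)
    prev = table[next(stream)]
    out = bytearray(prev)
    for code in stream:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The encoder runs one table entry ahead of the decoder;
            # the missing sequence is prev plus its own first byte.
            entry = prev + prev[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        if next_code <= max_code:
            # Mimic the encoder's table construction, one code behind.
            table[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return bytes(out)
```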

The encoding process can be described as follows:

1. Initialize the dictionary to contain all strings of length one.
2. Find the longest string W in the dictionary that matches the current input.
3. Emit the dictionary index for W to the output and remove W from the input.
4. Add W followed by the next symbol in the input to the dictionary.
5. Go to step 2.
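A quick round-trip check using the two sketches above (the sample string is arbitrary):

```python
data = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_encode(data)
assert lzw_decode(codes) == data
print(codes[:8])  # [84, 79, 66, 69, 79, 82, 78, 79]
```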
