Data compression
from Wikipedia

In information theory, data compression, source coding,[1] or bit-rate reduction is the process of encoding information using fewer bits than the original representation.[2] Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information.[3] Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.

The process of reducing the size of a data file is often referred to as data compression. In the context of data transmission, it is called source coding: encoding is done at the source of the data before it is stored or transmitted.[4] Source coding should not be confused with channel coding (for error detection and correction) or line coding (the means for mapping data onto a signal).

Data compression algorithms present a space–time complexity trade-off between the bytes needed to store or transmit information, and the computational resources needed to perform the encoding and decoding. The design of data compression schemes involves balancing the degree of compression, the amount of distortion introduced (when using lossy data compression), and the computational resources or time required to compress and decompress the data.[5]

Lossless


Lossless data compression algorithms usually exploit statistical redundancy to represent data without losing any information, so that the process is reversible. Lossless compression is possible because most real-world data exhibits statistical redundancy. For example, an image may have areas of color that do not change over several pixels; instead of coding "red pixel, red pixel, ..." the data may be encoded as "279 red pixels". This is a basic example of run-length encoding; there are many schemes to reduce file size by eliminating redundancy.
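
As an illustration of the idea, a minimal run-length encoder might look like the following Python sketch; the (value, count) output format is an arbitrary choice for illustration, not a standard file format.

```python
def rle_encode(pixels):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

# "red pixel" repeated 279 times becomes a single ("red", 279) pair.
assert rle_decode(rle_encode(["red"] * 279)) == ["red"] * 279
```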

The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage.[6] DEFLATE is a variation on LZ optimized for decompression speed and compression ratio,[7] but compression can be slow. In the mid-1980s, following work by Terry Welch, the Lempel–Ziv–Welch (LZW) algorithm rapidly became the method of choice for most general-purpose compression systems. LZW is used in GIF images, programs such as PKZIP, and hardware devices such as modems.[8] LZ methods use a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded. Grammar-based codes can compress highly repetitive input extremely effectively, for instance a biological data collection for the same or closely related species, a huge versioned document collection, or an internet archive. The basic task of grammar-based codes is constructing a context-free grammar deriving a single string. Practical grammar compression algorithms include Sequitur and Re-Pair.
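
The table-based substitution described above can be sketched in a few lines. The following illustrative Python encoder builds the LZW dictionary on the fly; output codes are left as plain integers, whereas a real implementation would also pack them into a fixed- or variable-width bitstream.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Minimal LZW encoder: grow a table of seen byte strings and emit the
    table index of the longest known prefix at each step."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc                      # keep extending the current match
        else:
            out.append(table[w])        # emit the code for the longest match
            table[wc] = len(table)      # add the new string to the table
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

print(lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT"))
```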

The strongest modern lossless compressors use probabilistic models, such as prediction by partial matching. The Burrows–Wheeler transform can also be viewed as an indirect form of statistical modelling.[citation needed] In a further refinement of the direct use of probabilistic modelling, statistical estimates can be coupled to an algorithm called arithmetic coding. Arithmetic coding is a more modern coding technique that uses the mathematical calculations of a finite-state machine to produce a string of encoded bits from a series of input data symbols. It can achieve superior compression compared to other techniques such as the better-known Huffman algorithm. It uses an internal memory state to avoid the need to perform a one-to-one mapping of individual input symbols to distinct representations that use an integer number of bits, and it clears out the internal memory only after encoding the entire string of data symbols. Arithmetic coding applies especially well to adaptive data compression tasks where the statistics vary and are context-dependent, as it can be easily coupled with an adaptive model of the probability distribution of the input data. An early example of the use of arithmetic coding was in an optional (but not widely used) feature of the JPEG image coding standard.[9] It has since been applied in various other designs including H.263, H.264/MPEG-4 AVC and HEVC for video coding.[10]
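
The interval-narrowing idea behind arithmetic coding can be shown with exact fractions. The sketch below is a toy encoder for a fixed symbol model; practical coders, including those in the standards named above, instead use integer arithmetic with renormalization and adaptive probability models.

```python
from fractions import Fraction

def arithmetic_encode(message, probs):
    """Narrow the interval [0, 1) once per symbol; any number inside the final
    interval, together with the message length and the model, identifies the
    message. `probs` maps each symbol to its Fraction probability."""
    cumulative, total = {}, Fraction(0)
    for symbol, p in probs.items():
        cumulative[symbol] = (total, total + p)
        total += p
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        span = high - low
        sym_low, sym_high = cumulative[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high  # a real coder transmits a short binary fraction in [low, high)

model = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
# Likely messages leave a wide final interval, which needs fewer bits to pin down.
print(arithmetic_encode("aaab", model))
```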

Archive software typically has the ability to adjust the "dictionary size", where a larger size demands more random-access memory during compression and decompression but yields stronger compression, especially on repeating patterns in file contents.[11][12]

Lossy

Composite image showing JPG and PNG image compression. Left side of the image is from a JPEG image, showing lossy artifacts; the right side is from a PNG image.

In the late 1980s, digital images became more common, and standards for lossless image compression emerged. In the early 1990s, lossy compression methods began to be widely used.[13] In these schemes, some loss of information is accepted as dropping nonessential detail can save storage space. There is a corresponding trade-off between preserving information and reducing size. Lossy data compression schemes are designed by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to the variations in color. JPEG image compression works in part by rounding off nonessential bits of information.[14] A number of popular compression formats exploit these perceptual differences, including psychoacoustics for sound, and psychovisuals for images and video.

Most forms of lossy compression are based on transform coding, especially the discrete cosine transform (DCT). It was first proposed in 1972 by Nasir Ahmed, who then developed a working algorithm with T. Natarajan and K. R. Rao in 1973, before introducing it in January 1974.[15][16] DCT is the most widely used lossy compression method, and is used in multimedia formats for images (such as JPEG and HEIF),[17] video (such as MPEG, AVC and HEVC) and audio (such as MP3, AAC and Vorbis).

Lossy image compression is used in digital cameras to increase storage capacity. Similarly, DVDs, Blu-ray and streaming video use lossy video coding formats; lossy compression is extensively used in video.

In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the audio signal. Compression of human speech is often performed with even more specialized techniques; speech coding is distinguished as a separate discipline from general-purpose audio compression. Speech coding is used in internet telephony, for example, while audio compression is used for CD ripping and is decoded by audio players.[citation needed]

Lossy compression can cause generation loss.

Theory


The theoretical basis for compression is provided by information theory and, more specifically, Shannon's source coding theorem; domain-specific theories include algorithmic information theory for lossless compression and rate–distortion theory for lossy compression. These areas of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Other topics associated with compression include coding theory and statistical inference.[18]

Machine learning


There is a close connection between machine learning and compression. A system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution). Conversely, an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as a justification for using data compression as a benchmark for "general intelligence".[19][20][21]

In an alternative view, compression algorithms implicitly map strings into implicit feature space vectors, and compression-based similarity measures compute similarity within these feature spaces. For each compressor C(.) an associated vector space ℵ is defined, such that C(.) maps an input string x to the vector norm ||~x||. An exhaustive examination of the feature spaces underlying all compression algorithms is impractical; instead, analyses typically examine three representative lossless compression methods: LZW, LZ77, and PPM.[22]

According to AIXI theory, a connection more directly explained in the Hutter Prize, the best possible compression of x is the smallest possible software that generates x. For example, in that model, a zip file's compressed size includes both the zip file and the unzipping software, since you cannot unzip it without both, but there may be an even smaller combined form.

Examples of AI-powered audio/video compression software include NVIDIA Maxine, AIVC.[23] Examples of software that can perform AI-powered image compression include OpenCV, TensorFlow, MATLAB's Image Processing Toolbox (IPT) and High-Fidelity Generative Image Compression.[24]

In unsupervised machine learning, k-means clustering can be utilized to compress data by grouping similar data points into clusters. This technique simplifies handling extensive datasets that lack predefined labels and finds widespread use in fields such as image compression.[25]

Data compression aims to reduce the size of data files, enhancing storage efficiency and speeding up data transmission. K-means clustering, an unsupervised machine learning algorithm, is employed to partition a dataset into a specified number of clusters, k, each represented by the centroid of its points. This process condenses extensive datasets into a more compact set of representative points. Particularly beneficial in image and signal processing, k-means clustering aids in data reduction by replacing groups of data points with their centroids, thereby preserving the core information of the original data while significantly decreasing the required storage space.[26]
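
A minimal sketch of this idea, assuming an (N, 3) NumPy array of RGB pixels; the helper name kmeans_palette and the default of 16 clusters are illustrative choices, not from any particular library.

```python
import numpy as np

def kmeans_palette(pixels: np.ndarray, k: int = 16, iters: int = 10, seed: int = 0):
    """Cluster (N, 3) RGB pixels into k colors; storing the small palette plus a
    per-pixel index approximates the image with far fewer bytes."""
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every pixel to its nearest centroid (Euclidean distance in RGB space)
        distances = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the pixels assigned to it
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # For k <= 256 the per-pixel index fits in a single byte
    return centroids.astype(np.uint8), labels.astype(np.uint8)
```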

Large language models (LLMs) are also efficient lossless data compressors on some data sets, as demonstrated by DeepMind's research with the Chinchilla 70B model. Developed by DeepMind, Chinchilla 70B effectively compressed data, outperforming conventional methods such as Portable Network Graphics (PNG) for images and Free Lossless Audio Codec (FLAC) for audio. It achieved compression of image and audio data to 43.4% and 16.4% of their original sizes, respectively. There is, however, some reason to be concerned that the data set used for testing overlaps the LLM training data set, making it possible that the Chinchilla 70B model is only an efficient compression tool on data it has already been trained on.[27][28]

Data differencing

Comparison of two revisions of a file

Data compression can be viewed as a special case of data differencing.[29][30] Data differencing consists of producing a difference given a source and a target, with patching reproducing the target given a source and a difference. Since there is no separate source and target in data compression, one can consider data compression as data differencing with empty source data, the compressed file corresponding to a difference from nothing. This is the same as considering absolute entropy (corresponding to data compression) as a special case of relative entropy (corresponding to data differencing) with no initial data.

The term differential compression is used to emphasize the data differencing connection.

Uses


Image


Entropy coding originated in the 1940s with the introduction of Shannon–Fano coding,[31] the basis for Huffman coding which was developed in 1950.[32] Transform coding dates back to the late 1960s, with the introduction of fast Fourier transform (FFT) coding in 1968 and the Hadamard transform in 1969.[33]

An important image compression technique is the discrete cosine transform (DCT), a technique developed in the early 1970s.[15] DCT is the basis for JPEG, a lossy compression format which was introduced by the Joint Photographic Experts Group (JPEG) in 1992.[34] JPEG greatly reduces the amount of data required to represent an image at the cost of a relatively small reduction in image quality and has become the most widely used image file format.[35][36] Its highly efficient DCT-based compression algorithm was largely responsible for the wide proliferation of digital images and digital photos.[37]

Lempel–Ziv–Welch (LZW) is a lossless compression algorithm developed in 1984. It is used in the GIF format, introduced in 1987.[38] DEFLATE, a lossless compression algorithm specified in 1996, is used in the Portable Network Graphics (PNG) format.[39]

Wavelet compression, the use of wavelets in image compression, began after the development of DCT coding.[40] The JPEG 2000 standard was introduced in 2000.[41] In contrast to the DCT algorithm used by the original JPEG format, JPEG 2000 instead uses discrete wavelet transform (DWT) algorithms.[42][43][44] JPEG 2000 technology, which includes the Motion JPEG 2000 extension, was selected as the video coding standard for digital cinema in 2004.[45]

Audio


Audio data compression, not to be confused with dynamic range compression, has the potential to reduce the transmission bandwidth and storage requirements of audio data. Audio compression algorithms are implemented in software as audio codecs. In both lossy and lossless compression, information redundancy is reduced, using methods such as coding, quantization, DCT and linear prediction to reduce the amount of information used to represent the uncompressed data.

Lossy audio compression algorithms provide higher compression and are used in numerous audio applications including Vorbis and MP3. These algorithms almost all rely on psychoacoustics to eliminate or reduce fidelity of less audible sounds, thereby reducing the space required to store or transmit them.[2][46]

The acceptable trade-off between loss of audio quality and transmission or storage size depends upon the application. For example, one 640 MB compact disc (CD) holds approximately one hour of uncompressed high fidelity music, less than 2 hours of music compressed losslessly, or 7 hours of music compressed in the MP3 format at a medium bit rate. A digital sound recorder can typically store around 200 hours of clearly intelligible speech in 640 MB.[47]

Lossless audio compression produces a representation of digital data that can be decoded to an exact digital duplicate of the original. Compression ratios are around 50–60% of the original size,[48] which is similar to those for generic lossless data compression. Lossless codecs use curve fitting or linear prediction as a basis for estimating the signal. Parameters describing the estimation and the difference between the estimation and the actual signal are coded separately.[49]
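
A first-order predictor already illustrates the idea: each sample is estimated from the previous one and only the (typically small) residual is kept for a later entropy-coding stage. A minimal NumPy sketch, assuming integer PCM samples; real codecs use higher-order predictors and residual coders such as Rice/Golomb codes.

```python
import numpy as np

def encode_residuals(samples: np.ndarray) -> np.ndarray:
    """Order-1 prediction: predict each sample as the previous sample and
    store only the difference. Residuals cluster near zero, so a subsequent
    entropy coder can represent them compactly."""
    return np.diff(samples, prepend=0)

def decode_residuals(residuals: np.ndarray) -> np.ndarray:
    """Undo the prediction by accumulating the residuals."""
    return np.cumsum(residuals)

pcm = np.array([1000, 1002, 1005, 1006, 1004], dtype=np.int32)
assert np.array_equal(decode_residuals(encode_residuals(pcm)), pcm)
```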

A number of lossless audio compression formats exist. See list of lossless codecs for a listing. Some formats are associated with a distinct system, such as Direct Stream Transfer, used in Super Audio CD, and Meridian Lossless Packing, used in DVD-Audio, Dolby TrueHD, Blu-ray and HD DVD.

Some audio file formats feature a combination of a lossy format and a lossless correction; this allows stripping the correction to easily obtain a lossy file. Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack, and OptimFROG DualStream.

When audio files are to be processed, either by further compression or for editing, it is desirable to work from an unchanged original (uncompressed or losslessly compressed). Processing of a lossily compressed file for some purpose usually produces a final result inferior to the creation of the same compressed file from an uncompressed original. In addition to sound editing or mixing, lossless audio compression is often used for archival storage, or as master copies.

Lossy audio compression

Comparison of spectrograms of audio in an uncompressed format and several lossy formats. The lossy spectrograms show bandlimiting of higher frequencies, a common technique associated with lossy audio compression.

Lossy audio compression is used in a wide range of applications. In addition to standalone audio-only applications of file playback in MP3 players or computers, digitally compressed audio streams are used in most video DVDs, digital television, streaming media on the Internet, satellite and cable radio, and increasingly in terrestrial radio broadcasts. Lossy compression typically achieves far greater compression than lossless compression, by discarding less-critical data based on psychoacoustic optimizations.[50]

Psychoacoustics recognizes that not all data in an audio stream can be perceived by the human auditory system. Most lossy compression reduces redundancy by first identifying perceptually irrelevant sounds, that is, sounds that are very hard to hear. Typical examples include high frequencies or sounds that occur at the same time as louder sounds. Those irrelevant sounds are coded with decreased accuracy or not at all.

Due to the nature of lossy algorithms, audio quality suffers a digital generation loss when a file is decompressed and recompressed. This makes lossy compression unsuitable for storing the intermediate results in professional audio engineering applications, such as sound editing and multitrack recording. However, lossy formats such as MP3 are very popular with end-users as the file size is reduced to 5-20% of the original size and a megabyte can store about a minute's worth of music at adequate quality.

Several proprietary lossy compression algorithms have been developed that provide higher quality audio performance by using a combination of lossless and lossy algorithms with adaptive bit rates and lower compression ratios. Examples include aptX, LDAC, LHDC, MQA and SCL6.

Coding methods

To determine what information in an audio signal is perceptually irrelevant, most lossy compression algorithms use transforms such as the modified discrete cosine transform (MDCT) to convert time domain sampled waveforms into a transform domain, typically the frequency domain. Once transformed, component frequencies can be prioritized according to how audible they are. Audibility of spectral components is assessed using the absolute threshold of hearing and the principles of simultaneous masking—the phenomenon wherein a signal is masked by another signal separated by frequency—and, in some cases, temporal masking—where a signal is masked by another signal separated by time. Equal-loudness contours may also be used to weigh the perceptual importance of components. Models of the human ear-brain combination incorporating such effects are often called psychoacoustic models.[51]

Other types of lossy compressors, such as the linear predictive coding (LPC) used with speech, are source-based coders. LPC uses a model of the human vocal tract to analyze speech sounds and infer the parameters used by the model to produce them moment to moment. These changing parameters are transmitted or stored and used to drive another model in the decoder which reproduces the sound.

Lossy formats are often used for the distribution of streaming audio or interactive communication (such as in cell phone networks). In such applications, the data must be decompressed as the data flows, rather than after the entire data stream has been transmitted. Not all audio codecs can be used for streaming applications.[50]

Latency is introduced by the methods used to encode and decode the data. Some codecs will analyze a longer segment, called a frame, of the data to optimize efficiency, and then code it in a manner that requires a larger segment of data at one time to decode. The inherent latency of the coding algorithm can be critical; for example, when there is a two-way transmission of data, such as with a telephone conversation, significant delays may seriously degrade the perceived quality.

In contrast to the speed of compression, which is proportional to the number of operations required by the algorithm, here latency refers to the number of samples that must be analyzed before a block of audio is processed. In the minimum case, latency is zero samples (e.g., if the coder/decoder simply reduces the number of bits used to quantize the signal). Time domain algorithms such as LPC also often have low latencies, hence their popularity in speech coding for telephony. In algorithms such as MP3, however, a large number of samples have to be analyzed to implement a psychoacoustic model in the frequency domain, and latency is on the order of 23 ms.

Speech encoding

Speech encoding is an important category of audio data compression. The perceptual models used to estimate what aspects of speech a human ear can hear are generally somewhat different from those used for music. The range of frequencies needed to convey the sounds of a human voice is normally far narrower than that needed for music, and the sound is normally less complex. As a result, speech can be encoded at high quality using a relatively low bit rate.

This is accomplished, in general, by some combination of two approaches:

  • Only encoding sounds that could be made by a single human voice.
  • Throwing away more of the data in the signal—keeping just enough to reconstruct an "intelligible" voice rather than the full frequency range of human hearing.

The earliest algorithms used in speech encoding (and audio data compression in general) were the A-law algorithm and the μ-law algorithm.

History

Solidyne 922: The world's first commercial audio bit compression sound card for PC, 1990

Early audio research was conducted at Bell Labs. There, in 1950, C. Chapin Cutler filed the patent on differential pulse-code modulation (DPCM).[52] In 1973, Adaptive DPCM (ADPCM) was introduced by P. Cummiskey, Nikil S. Jayant and James L. Flanagan.[53][54]

Perceptual coding was first used for speech coding compression, with linear predictive coding (LPC).[55] Initial concepts for LPC date back to the work of Fumitada Itakura (Nagoya University) and Shuzo Saito (Nippon Telegraph and Telephone) in 1966.[56] During the 1970s, Bishnu S. Atal and Manfred R. Schroeder at Bell Labs developed a form of LPC called adaptive predictive coding (APC), a perceptual coding algorithm that exploited the masking properties of the human ear, followed in the early 1980s with the code-excited linear prediction (CELP) algorithm which achieved a significant compression ratio for its time.[55] Perceptual coding is used by modern audio compression formats such as MP3[55] and AAC.

Discrete cosine transform (DCT), developed by Nasir Ahmed, T. Natarajan and K. R. Rao in 1974,[16] provided the basis for the modified discrete cosine transform (MDCT) used by modern audio compression formats such as MP3,[57] Dolby Digital,[58][59] and AAC.[60] MDCT was proposed by J. P. Princen, A. W. Johnson and A. B. Bradley in 1987,[61] following earlier work by Princen and Bradley in 1986.[62]

The world's first commercial broadcast automation audio compression system was developed by Oscar Bonello, an engineering professor at the University of Buenos Aires.[63] This broadcast automation system was launched in 1987 under the name Audicom.[64][65]

A literature compendium for a large variety of audio coding systems was published in the IEEE's Journal on Selected Areas in Communications (JSAC), in February 1988. While there were some papers from before that time, this collection documented an entire variety of finished, working audio coders, nearly all of them using perceptual techniques and some kind of frequency analysis and back-end noiseless coding.[66]

Video


Uncompressed video requires a very high data rate. Although lossless video compression codecs perform at a compression factor of 5 to 12, a typical H.264 lossy compression video has a compression factor between 20 and 200.[67]

The two key video compression techniques used in video coding standards are the DCT and motion compensation (MC). Most video coding standards, such as the H.26x and MPEG formats, typically use motion-compensated DCT video coding (block motion compensation).[68][69]

Most video codecs are used alongside audio compression techniques to store the separate but complementary data streams as one combined package using so-called container formats.[70]

Encoding theory


Video data may be represented as a series of still image frames. Such data usually contains abundant amounts of spatial and temporal redundancy. Video compression algorithms attempt to reduce redundancy and store information more compactly.

Most video compression formats and codecs exploit both spatial and temporal redundancy (e.g. through difference coding with motion compensation). Similarities can be encoded by only storing differences between e.g. temporally adjacent frames (inter-frame coding) or spatially adjacent pixels (intra-frame coding). Inter-frame compression (a temporal delta encoding) (re)uses data from one or more earlier or later frames in a sequence to describe the current frame. Intra-frame coding, on the other hand, uses only data from within the current frame, effectively being still-image compression.[51]

The intra-frame video coding formats used in camcorders and video editing employ simpler compression that uses only intra-frame prediction. This simplifies video editing software, as it prevents a situation in which a compressed frame refers to data that the editor has deleted.

Usually, video compression additionally employs lossy compression techniques like quantization that reduce aspects of the source data that are (more or less) irrelevant to the human visual perception by exploiting perceptual features of human vision. For example, small differences in color are more difficult to perceive than are changes in brightness. Compression algorithms can average a color across these similar areas in a manner similar to those used in JPEG image compression.[9] As in all lossy compression, there is a trade-off between video quality and bit rate, cost of processing the compression and decompression, and system requirements. Highly compressed video may present visible or distracting artifacts.

Methods other than the prevalent DCT-based transform formats, such as fractal compression, matching pursuit and the use of a discrete wavelet transform (DWT), have been the subject of some research, but are typically not used in practical products. Wavelet compression is used in still-image coders and video coders without motion compensation. Interest in fractal compression seems to be waning, due to recent theoretical analysis showing a comparative lack of effectiveness of such methods.[51]

Inter-frame coding

In inter-frame coding, individual frames of a video sequence are compared from one frame to the next, and the video compression codec records the differences to the reference frame. If the frame contains areas where nothing has moved, the system can simply issue a short command that copies that part of the previous frame into the next one. If sections of the frame move in a simple manner, the compressor can emit a (slightly longer) command that tells the decompressor to shift, rotate, lighten, or darken the copy. This longer command still remains much shorter than data generated by intra-frame compression. Usually, the encoder will also transmit a residue signal which describes the remaining more subtle differences to the reference imagery. Using entropy coding, these residue signals have a more compact representation than the full signal. In areas of video with more motion, the compression must encode more data to keep up with the larger number of pixels that are changing. Commonly, the high-frequency detail of explosions, flames, flocks of animals, and some panning shots, will lead to decreases in quality or increases in the variable bitrate.
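
A highly simplified sketch of temporal delta coding (frame differencing only, without motion compensation, residual transforms, or entropy coding), assuming frames are 8-bit grayscale NumPy arrays of equal shape:

```python
import numpy as np

def encode_sequence(frames):
    """Send the first frame intact (intra), then only the per-pixel difference
    from the previous frame (inter). Static regions produce runs of zeros that
    a later entropy coder can compress very well."""
    previous = None
    for frame in frames:
        if previous is None:
            yield ("intra", frame.copy())
        else:
            yield ("inter", frame.astype(np.int16) - previous.astype(np.int16))
        previous = frame

def decode_sequence(packets):
    previous = None
    for kind, payload in packets:
        if kind == "intra":
            previous = payload.copy()
        else:
            previous = (previous.astype(np.int16) + payload).astype(np.uint8)
        yield previous
```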

Hybrid block-based transform formats

Processing stages of a typical video encoder

Many commonly used video compression methods (e.g., those in standards approved by the ITU-T or ISO) share the same basic architecture that dates back to H.261 which was standardized in 1988 by the ITU-T. They mostly rely on the DCT, applied to rectangular blocks of neighboring pixels, and temporal prediction using motion vectors, as well as nowadays also an in-loop filtering step.

In the prediction stage, various deduplication and difference-coding techniques are applied that help decorrelate data and describe new data based on already transmitted data.

Then rectangular blocks of remaining pixel data are transformed to the frequency domain. In the main lossy processing stage, frequency domain data gets quantized in order to reduce information that is irrelevant to human visual perception.

In the last stage statistical redundancy gets largely eliminated by an entropy coder which often applies some form of arithmetic coding.

In an additional in-loop filtering stage, various filters can be applied to the reconstructed image signal. By computing these filters inside the encoding loop, they can aid compression because they can be applied to reference material before it gets used in the prediction process, and they can be guided using the original signal. The most popular examples are deblocking filters that blur out blocking artifacts from quantization discontinuities at transform block boundaries.

History


In 1967, A.H. Robinson and C. Cherry proposed a run-length encoding bandwidth compression scheme for the transmission of analog television signals.[71] The DCT, which is fundamental to modern video compression,[72] was introduced by Nasir Ahmed, T. Natarajan and K. R. Rao in 1974.[16][73]

H.261, which debuted in 1988, commercially introduced the prevalent basic architecture of video compression technology.[74] It was the first video coding format based on DCT compression.[72] H.261 was developed by a number of companies, including Hitachi, PictureTel, NTT, BT and Toshiba.[75]

The most popular video coding standards used for codecs have been the MPEG standards. MPEG-1 was developed by the Moving Picture Experts Group (MPEG) in 1991, and it was designed to compress VHS-quality video. It was succeeded in 1994 by MPEG-2/H.262,[74] which was developed by a number of companies, primarily Sony, Thomson and Mitsubishi Electric.[76] MPEG-2 became the standard video format for DVD and SD digital television.[74] In 1999, it was followed by MPEG-4/H.263.[74] It was also developed by a number of companies, primarily Mitsubishi Electric, Hitachi and Panasonic.[77]

H.264/MPEG-4 AVC was developed in 2003 by a number of organizations, primarily Panasonic, Godo Kaisha IP Bridge and LG Electronics.[78] AVC commercially introduced the modern context-adaptive binary arithmetic coding (CABAC) and context-adaptive variable-length coding (CAVLC) algorithms. AVC is the main video encoding standard for Blu-ray Discs, and is widely used by video sharing websites and streaming internet services such as YouTube, Netflix, Vimeo, and iTunes Store, web software such as Adobe Flash Player and Microsoft Silverlight, and various HDTV broadcasts over terrestrial and satellite television.[citation needed]

Genetics


Genetics compression algorithms are the latest generation of lossless algorithms that compress data (typically sequences of nucleotides) using both conventional compression algorithms and genetic algorithms adapted to the specific datatype. In 2012, a team of scientists from Johns Hopkins University published a genetic compression algorithm that does not use a reference genome for compression. HAPZIPPER was tailored for HapMap data and achieves over 20-fold compression (95% reduction in file size); it provides 2- to 4-fold better compression and is less computationally intensive than the leading general-purpose compression utilities. For this, Chanda, Elhaik, and Bader introduced MAF-based encoding (MAFE), which reduces the heterogeneity of the dataset by sorting SNPs by their minor allele frequency, thus homogenizing the dataset.[79] Other algorithms developed in 2009 and 2013 (DNAZip and GenomeZip) have compression ratios of up to 1200-fold, allowing 6 billion basepair diploid human genomes to be stored in 2.5 megabytes (relative to a reference genome or averaged over many genomes).[80][81] A benchmark of genetics/genomics data compressors is also available.[82]

Outlook and currently unused potential


It is estimated that the total amount of data that is stored on the world's storage devices could be further compressed with existing compression algorithms by a remaining average factor of 4.5:1.[83] It is estimated that the combined technological capacity of the world to store information provided 1,300 exabytes of hardware digits in 2007, but when the corresponding content is optimally compressed, this only represents 295 exabytes of Shannon information.[84]

from Grokipedia
Data compression is the process of encoding information using fewer bits than the original representation, thereby reducing its size for more efficient storage and transmission while preserving the essential information. This technique eliminates redundancies in the data, transforming it into a more compact form that can be decoded back to an approximate or exact original depending on the method used. By increasing effective storage and transmission capacity, compression plays a critical role in applications ranging from file archiving and data transmission to multimedia processing.

There are two primary categories of data compression: lossless and lossy. Lossless compression allows the original data to be perfectly reconstructed without any loss of information, making it suitable for text files, executables, and scenarios where data integrity is essential; typical compression ratios for lossless methods range from 2:1 to 4:1. In contrast, lossy compression discards less perceptible details to achieve higher compression rates, often significantly better than lossless, and is commonly applied to images, audio, and video, as seen in formats like JPEG and MP3.

The foundations of modern data compression trace back to information theory developed by Claude Shannon in the mid-20th century, which established fundamental limits like entropy as the theoretical minimum for lossless encoding. Practical advancements accelerated in the 1960s with applications in space missions, where both lossless and lossy methods were employed to manage telemetry data. Key milestones include David Huffman's 1952 algorithm for optimal prefix codes and the 1977–1978 work by Jacob Ziv and Abraham Lempel, leading to the LZW algorithm that powers tools like ZIP and GIF. These innovations have made data compression indispensable in computing, enabling everything from efficient web browsing to high-definition streaming.

Common techniques in lossless compression include run-length encoding for repetitive data, Huffman coding for variable-length symbol assignment based on frequency, and dictionary-based methods like LZW for substituting repeated sequences. Lossy approaches often rely on perceptual models, such as discrete cosine transforms in JPEG for images or modified discrete cosine transforms in formats like MP3 and AAC for audio, prioritizing human perception over exact replication. Compression ratios are typically measured as the ratio of the original size to the compressed size, with higher ratios indicating better efficiency (e.g., 2:1 means the compressed file is half the original size). Ongoing research continues to push boundaries, particularly in hardware-accelerated and AI-enhanced methods for emerging data-intensive fields such as the Internet of Things.

Fundamentals

Definition and Principles

Data compression is the process of encoding information using fewer bits than the original representation to reduce size while preserving the essential content. This technique aims to minimize storage requirements and optimize data transmission by eliminating unnecessary bits. The primary purposes include enabling efficient storage on limited-capacity devices, accelerating file transfers over networks, and conserving bandwidth in communication systems. At its core, data compression exploits statistical redundancy inherent in data, such as repeated patterns or predictable sequences, to represent the data more compactly without altering its meaning.

A fundamental principle is entropy, introduced by Claude Shannon as a measure of the average information content, or uncertainty, in a source, which establishes the theoretical limit for the shortest possible encoding length in bits. The entropy $H$ for a discrete source with symbols having probabilities $p_i$ is calculated as $H = -\sum_i p_i \log_2 p_i$, where the summation is over all possible symbols, and $-\log_2 p_i$ quantifies the bits needed to encode each symbol based on its probability of occurrence.

Key trade-offs in compression include the compression ratio, defined as the original data size divided by the compressed size (higher values indicate better efficiency), balanced against the computational cost of encoding and decoding, which can impact processing time and resource usage. For instance, a simple text file of 10 KB containing repetitive phrases might compress to 4-5 KB, yielding a ratio of about 2:1 to 2.5:1, while a binary file filled with uniform patterns, like all zeros, can achieve near-total reduction to a few bytes representing the pattern and length. Compression methods fall into lossless and lossy categories, with the former ensuring exact data recovery and the latter allowing minor losses for greater size reduction.
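
The entropy bound is easy to compute for concrete data. A minimal Python sketch of the empirical (order-0) entropy of a byte string, assuming an i.i.d. symbol model:

```python
import math
from collections import Counter

def empirical_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte: a lower bound on the average
    code length of any symbol-by-symbol lossless code for this distribution."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(empirical_entropy(b"aaaaaaab"))        # highly skewed: about 0.54 bits per byte
print(empirical_entropy(bytes(range(256))))  # uniform: exactly 8.0 bits per byte
```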

Types of Compression

Data compression is broadly categorized into two primary types: lossless and lossy, each designed to reduce data size while addressing different requirements for fidelity and efficiency. Lossless compression ensures the original data can be exactly reconstructed without any loss of information, making it essential for applications where fidelity is paramount. In contrast, lossy compression permits some irreversible data loss to achieve significantly higher compression ratios, prioritizing perceptual quality over exact reproduction.

Lossless compression algorithms exploit statistical redundancies in the data to encode it more compactly, guaranteeing bit-for-bit recovery of the source upon decompression. This type is particularly suitable for text files, executable programs, and other structured data where even minor alterations could render the content unusable or introduce errors. For instance, compressing source code or database records requires lossless methods to preserve functionality and accuracy. Typical compression ratios for lossless techniques range from 2:1 to 4:1, depending on the data's entropy and redundancy patterns.

Lossy compression, on the other hand, discards less perceptually significant information, such as high-frequency details in images or inaudible frequencies in audio, to achieve greater size reduction while maintaining acceptable quality for human observers. It is ideal for content like photographs, videos, and music streams, where exact replication is unnecessary and bandwidth constraints are critical. Compression ratios in lossy methods often exceed 10:1 for images and can reach 100:1 or more for video, enabling efficient storage and transmission in resource-limited environments. The choice between lossless and lossy compression involves trade-offs in fidelity, efficiency, and applicability, as summarized in the following comparison:
Aspect | Lossless Compression | Lossy Compression
Fidelity | Exact reconstruction; no data loss | Approximate reconstruction; some data discarded
Compression ratio | Typically 2:1 to 4:1 for general data | Often 10:1 to 100:1 or more for media content
Pros | Preserves all information; suitable for critical data | Higher efficiency; better for perceptual media
Cons | Lower ratios; less effective on random data | Irreversible loss; potential quality degradation
Use cases | Archival storage, software distribution | Streaming services, mobile devices
Hybrid approaches combine elements of both, applying lossless compression to critical data portions and lossy compression to others, to optimize overall performance in scenarios such as scientific simulations. Selecting the appropriate type depends on data sensitivity (opting for lossless when exactness is required, such as in legal documents or financial records) and domain-specific needs, like archival preservation versus real-time streaming where bandwidth savings outweigh minor quality trade-offs. Underlying these trade-offs is rate-distortion theory, which provides a framework for balancing the rate (amount of information used) against distortion (deviation from the original), often visualized as the rate-distortion curve that identifies optimal operating points for a given quality threshold.

Theory

Information Theory Foundations

The foundations of compression are rooted in information theory, particularly Claude Shannon's seminal work establishing fundamental limits on data representation. Shannon's source coding theorem, also known as the noiseless coding theorem, states that for a discrete memoryless source, the minimum average number of bits required per symbol for reliable reconstruction is equal to the source's entropy; it is impossible to compress below this rate without introducing errors on average. This theorem provides the theoretical boundary for compression efficiency, demonstrating that redundancy in data can be exploited up to but not beyond the entropy limit.

Central to this theorem is the concept of entropy, which quantifies the uncertainty or information content in a random variable. For a discrete random variable $X$ with possible outcomes $x_i$ and probabilities $p(x_i)$, the entropy $H(X)$ is derived as the expected value of the self-information, where the self-information of an outcome is $-\log_2 p(x_i)$, measuring the surprise or bits needed to specify it. The formula arises from the requirement that the code length for each symbol should be approximately $-\log_2 p(x_i)$ to achieve optimality, leading to the average code length being at least $H(X) = -\sum_i p(x_i) \log_2 p(x_i)$. To illustrate, consider a fair coin flip where $X$ takes the values heads or tails, each with probability 0.5; here, $H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ bit, indicating full compressibility to 1 bit per flip without loss. In contrast, English text exhibits lower entropy per character (around 1-1.5 bits due to predictability) compared to a uniform random string (5 bits for 32 possible characters), allowing significant compression by encoding probable sequences more efficiently.

Another theoretical measure of compressibility is Kolmogorov complexity, defined as the length of the shortest program that outputs a given string on a universal Turing machine. This provides an absolute, uncomputable limit on description length, where a string is incompressible if its Kolmogorov complexity equals its length, highlighting intrinsic randomness without probabilistic assumptions.

Information theory distinguishes source coding, which removes redundancy from the data source to minimize bits for representation, from channel coding, which adds controlled redundancy to protect against transmission errors over noisy channels. Shannon's framework separates these to optimize end-to-end communication, with source coding achieving limits in noiseless settings and channel coding approaching capacity in noisy ones. Universal coding methods, such as arithmetic coding, approximate these limits without prior knowledge of exact source statistics by adaptively partitioning the probability interval [0,1] for the message, assigning shorter codes to more probable sequences. Developed from early theoretical ideas and refined into practical forms, arithmetic coding achieves compression close to the entropy bound, often within 1-2 bits of the theoretical minimum for finite blocks.

Algorithmic Techniques

Dictionary methods form a cornerstone of lossless data compression by exploiting redundancies through the construction and substitution of repeated substrings in the input data. These techniques build a dictionary of previously seen phrases or substrings, replacing future occurrences with references to the dictionary entries, thereby reducing the overall data size without loss of information. The Lempel-Ziv (LZ) family of algorithms exemplifies this approach, with LZ77 and LZ78 being seminal variants. LZ77 employs a sliding window mechanism over the input stream, searching backward within a fixed-size window to find the longest matching substring preceding the current position; it then encodes the output as a literal character or a pair consisting of the distance back to the match and its length. In contrast, LZ78 builds an explicit dictionary incrementally by parsing the input into non-overlapping phrases, where each new entry is formed by appending the next input symbol to a previous dictionary phrase, and the output references the dictionary index along with the symbol. These methods achieve efficiency by adapting to local redundancies, with LZ77 favoring recently seen patterns and LZ78 maintaining an explicit, incrementally grown dictionary.

Entropy coding techniques assign variable-length codes to symbols based on their probabilities, ensuring that more frequent symbols receive shorter codes to minimize the average code length, approaching the theoretical bound established in information theory. Huffman coding constructs an optimal prefix code using a binary tree built from symbol frequencies: symbols are repeatedly combined into nodes with probabilities equal to the sum of their children's, forming a tree where code lengths reflect the inverse of symbol probabilities, and the resulting codes are instantaneous and uniquely decodable. Arithmetic coding, on the other hand, encodes the entire message as a single fractional number within the unit interval [0,1), subdividing the interval according to cumulative symbol probabilities and narrowing it progressively with each symbol; this fractional representation is then quantized to a binary string, often achieving compression closer to the entropy limit than fixed-code methods by avoiding codeword boundaries.

Transform coding alters the data representation to concentrate redundancies, making subsequent compression more effective; a simple example is run-length encoding (RLE), which replaces sequences of identical symbols with a single instance of the symbol paired with the count of repetitions, ideal for data with long runs like binary images or sparse arrays. This transformation groups consecutive identical values, reducing the number of symbols to encode while preserving exact recovery during decoding. Prediction and residual coding leverage correlations by estimating future values from past ones and encoding only the prediction errors, which are typically smaller and more compressible. In delta encoding, commonly applied to time-series data, each value is predicted as the previous one, and the difference (residual) is stored instead; for monotonically increasing sequences, these deltas are often small non-negative integers that entropy coders can represent efficiently.

The effectiveness of these algorithmic techniques is evaluated using metrics such as compression ratio (ratio of original to compressed size), encoding and decoding speed (in bytes per second), and memory usage during compression. Standard benchmarks like the Calgary corpus, a collection of 14 files totaling about 3.2 MB representing diverse data types such as programs, binaries, and English text, provide a consistent framework for comparing algorithms, with higher ratios and faster speeds indicating superior performance under resource constraints.
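
A brute-force LZ77 encoder makes the sliding-window idea above concrete. This sketch emits (distance, length, next byte) triples and favors clarity over speed; real implementations use hash chains or suffix structures to find matches, and the window and minimum-match parameters here are arbitrary.

```python
def lz77_encode(data: bytes, window: int = 4096, min_match: int = 3):
    """Greedy LZ77: at each position, search the preceding window for the
    longest match and emit (distance, length, literal) triples."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) - 1          # keep one byte for the literal
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len, data[i + best_len]))
            i += best_len + 1
        else:
            out.append((0, 0, data[i]))   # no useful match: emit a literal
            i += 1
    return out
```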

Machine Learning Approaches

Machine learning approaches to data compression leverage neural networks to learn data representations and probability distributions directly from training data, often surpassing traditional methods in rate-distortion performance for complex data like images and videos. These methods typically frame compression as an optimization problem, balancing bitrate (rate) and reconstruction fidelity (distortion) through end-to-end trainable models. Unlike classical techniques, ML-based compression adapts to data statistics without hand-crafted features, enabling flexible, variable-rate encoding.

A foundational architecture in neural compression involves autoencoders, particularly variational autoencoders (VAEs), optimized with a rate-distortion loss. In this setup, an encoder compresses input data into a latent representation, which is quantized and entropy-coded, while a decoder reconstructs the output; the loss combines a distortion term (e.g., mean squared error) with a rate term estimating the bitrate via the latent's entropy. The seminal work by Ballé, Laparra, and Simoncelli introduced an end-to-end optimized nonlinear transform coder using this framework, achieving compression superior to JPEG and JPEG 2000 across bitrates on standard benchmarks. Subsequent extensions, such as variational models with hyperpriors, further refine latent distributions for better entropy modeling, enabling state-of-the-art results on standard image benchmarks.

Learned entropy models enhance compression by using neural networks to predict symbol probabilities more accurately than fixed parametric assumptions, often integrated with arithmetic coding for near-optimal entropy rates. Recurrent neural networks (RNNs) or transformers model spatial or sequential dependencies in latents, estimating conditional probabilities to guide the coder; for instance, hyperprior networks learn a side-information latent to parameterize the primary latent's distribution, reducing redundancy. This approach, building on VAE architectures, has been shown to outperform conventional codecs in multi-scale structural similarity (MS-SSIM) metrics at low bitrates. Transformers, with their attention mechanisms, have recently improved long-range dependency capture in entropy modeling, as seen in lossless schemes where they replace RNNs for faster inference. Entropy coding integration allows these models to approach theoretical limits while maintaining compatibility with standard arithmetic decoders.

Generative models extend this framework by prioritizing perceptual quality over pixel-wise fidelity, using adversarial or diffusion processes for realistic reconstructions. Generative adversarial networks (GANs) train a generator to produce outputs indistinguishable from originals, conditioned on compressed latents, often yielding visually pleasing results at ultra-low bitrates despite higher distortion metrics. Diffusion models, which iteratively denoise a sample starting from noise, have been adapted for compression by encoding conditioning information that guides the reverse process, achieving better perceptual scores than earlier baselines on image datasets. For example, conditional diffusion frameworks optimize end-to-end for rate-distortion-perception trade-offs, demonstrating gains over BPG in human evaluations.

Post-2020 advances have pushed neural compression toward practical deployment and theoretical efficiency. Bits-back coding enables lossless compression with latent variable models by "returning" bits from sampling auxiliary variables, improving rates over standard ANS-based schemes; variants address approximation gaps, yielding up to 10-20% better compression on text and images compared to earlier schemes. Neural entropy coding replaces traditional coders with differentiable approximations, allowing gradient-based optimization of the entire pipeline, as in recurrent models for edge devices that achieve around 15% bitrate savings over classical codecs. Google's early neural efforts, evolving into hyperprior-based systems, have influenced standards, with models rivaling state-of-the-art codecs in efficiency. In 2025, methods like LMCompress leverage large language models to achieve rates that halve those of state-of-the-art codecs such as JPEG-XL for images and H.264 for video, and improve text compression by nearly one-third over zpaq, by exploiting semantic understanding of the data.

Despite these gains, challenges persist in ML compression. Training requires vast datasets to generalize across domains, often leading to overfitting on specific distributions like natural images, unlike classical methods that are dataset-agnostic. Computational costs are high: encoding and decoding with deep networks demand GPU acceleration, in contrast to the lightweight nature of Huffman or arithmetic coding, with inference times 10-100x slower in early models. Ongoing research addresses these issues via techniques such as pruning and quantization, but deployment in resource-constrained settings remains limited.

Lossless Compression

Core Mechanisms

Lossless compression algorithms reduce data size by eliminating statistical redundancy while ensuring the original data can be perfectly reconstructed. This process relies on identifying patterns and correlations in the data, such as repeated sequences or uneven symbol frequencies, without discarding any information. Applying the same lossless compression algorithm multiple times does not reduce file size further after the initial compression, as redundancies are already eliminated; subsequent attempts may slightly increase size due to metadata overhead. Further size reduction may occur only when using a superior algorithm. The theoretical foundation is Shannon's entropy, which defines the minimum average number of bits needed to represent the data based on its statistical distribution.

Key mechanisms include entropy coding, which assigns variable-length codes to symbols proportional to their occurrence probabilities (shorter codes for frequent symbols and longer ones for rare symbols) to approach the entropy limit. Unlike fixed-length coding, this exploits the non-uniform distribution of data symbols. Dictionary-based methods build a dictionary of common phrases or substrings during compression, replacing subsequent occurrences with references to dictionary entries, thus avoiding repetition. Preprocessing techniques, such as run-length encoding (RLE), further enhance efficiency by encoding consecutive identical symbols as a single value and count, particularly effective for sparse or repetitive data like text or simple images. These mechanisms are often combined, with prediction models estimating future symbols from context to residual-encode differences, followed by entropy coding of the residuals. The Burrows-Wheeler transform (BWT) represents another core approach, rearranging the data to group similar characters together, creating long runs of identical symbols that can be efficiently compressed via RLE and entropy coding. Overall, lossless methods achieve typical compression ratios of 2:1 to 4:1 for general data, depending on redundancy, with no distortion introduced.
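
A naive Burrows-Wheeler transform shows how the rearrangement groups similar characters; production implementations build the rotation order with suffix arrays rather than sorting full rotations, and the sentinel character here is an illustrative choice.

```python
def bwt(text: str, sentinel: str = "\x00") -> str:
    """Sort all rotations of text+sentinel and keep the last column; the result
    contains long runs of identical characters that RLE and entropy coding exploit."""
    s = text + sentinel                     # unique end marker makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))  # the characters of "banana" are regrouped so that runs of 'n' and 'a' appear
```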

Specific Algorithms

Huffman coding, developed in 1952, constructs an optimal prefix code based on symbol frequencies, ensuring no code is a prefix of another for unambiguous decoding. Symbols are traversed from the root to leaves, assigning binary codes along the path; for example, in English text, 'e' might receive a short code while rare letters get longer ones, reducing the average code length close to the entropy. It is widely used as a building block in formats such as ZIP and JPEG. The Lempel–Ziv–Welch (LZW) algorithm, introduced in 1984 as a variant of LZ78, uses a dynamically built dictionary to encode sequences. It parses input into previously unseen phrases, adds them to the dictionary, and outputs the index of the longest matching prior phrase; decoding rebuilds the dictionary sequentially. LZW powers the GIF and TIFF formats, achieving good ratios for textual and graphical data without lookahead. Arithmetic coding, an advanced entropy coder from the 1970s, treats the entire message as a fractional number within [0,1), narrowing an interval based on cumulative symbol probabilities rather than emitting discrete codes. This allows fractional bits per symbol, outperforming Huffman coding for small alphabets or skewed distributions, and is employed in modern tools such as JBIG2 for fax compression and the entropy stage of H.264 video. DEFLATE, standardized in 1996 for PNG and ZIP, combines LZ77 sliding-window dictionary matching with Huffman coding of literals and distances. It scans for matches within a 32 KB window, encoding unmatched literals or match (length, distance) pairs, followed by Huffman compression of the output stream, balancing speed and ratio for general-purpose use.
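
A compact Huffman construction using a min-heap, as a sketch: it returns the code table only, whereas a real codec must also transmit the table (or the frequencies) so the decoder can rebuild it.

```python
import heapq

def huffman_codes(freqs: dict) -> dict:
    """Repeatedly merge the two least-frequent subtrees; the sequence of 0/1
    merge decisions becomes each symbol's prefix-free code."""
    heap = [(weight, idx, {symbol: ""}) for idx, (symbol, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_idx = len(heap)                      # unique tiebreaker for equal weights
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_idx, merged))
        next_idx += 1
    return heap[0][2] if heap else {}

print(huffman_codes({"e": 12, "t": 9, "a": 8, "q": 1}))  # frequent symbols get shorter codes
```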

Lossy Compression

Core Mechanisms

Lossy compression fundamentally relies on the irreversible discarding of data to achieve substantial reductions in storage or transmission requirements, prioritizing perceptual fidelity over exact reconstruction. This irreversibility is achieved primarily through quantization, a process that approximates continuous or high-precision signal values with a discrete set of representation levels, eliminating nuances below a precision threshold that contribute minimally to the overall perceived quality. By mapping input values to the nearest quantization level, subtle details are lost, but the resulting approximation maintains acceptable quality for human observers when guided by perceptual models.

Central to effective lossy compression are perceptual models that exploit limitations in human sensory systems to determine which data can be discarded without noticeable degradation. In audio compression, psychoacoustic models leverage masking effects, where a dominant sound masks weaker simultaneous or nearby signals, allowing quantization noise to be introduced below computed masking thresholds in the time-frequency domain. For visual data, psychovisual models incorporate concepts like the just-noticeable difference (JND), derived from psychophysical principles such as Weber's law, which quantifies the minimal perceptible change in luminance or color, enabling the safe removal of variations below these thresholds. These models ensure that discarded information remains largely imperceptible, optimizing compression while preserving subjective quality.

Transform coding forms a cornerstone mechanism, converting the signal into a domain where redundancies are more compactly represented, followed by selective quantization. Techniques such as the discrete cosine transform (DCT) or the fast Fourier transform (FFT) shift the data into the frequency domain, concentrating signal energy into fewer low-frequency coefficients, while high-frequency components, often less perceptually salient, receive coarser quantization. Quantization can be scalar, applying uniform or non-uniform steps to individual coefficients, or vector-based, grouping multiple coefficients into vectors and mapping them to codebook entries for improved rate-distortion performance by exploiting statistical dependencies. This combination decorrelates the data and targets the discarding of less critical frequency content.

The trade-off between compression rate and resulting distortion is formalized by rate-distortion theory, which provides a theoretical lower bound on the bitrate needed for a given distortion level. In practice, Lagrangian optimization minimizes the objective function J = D + λR, where D measures reconstruction error (e.g., mean squared error), R is the bitrate, and λ > 0 tunes the emphasis on distortion versus rate. For instance, in designing a quantizer for a simple uniform source, one iterates over possible step sizes: finer steps reduce D but increase R due to more bits per sample; for a given λ, the step size minimizing J balances, say, a 10% distortion increase against a 20% rate saving for efficient encoding. This approach provides a principled way to select parameters consistently across the compression pipeline.

Despite these optimizations, lossy compression often introduces visible or audible artifacts from aggressive data discarding and block-based processing. Blocking artifacts manifest as discontinuities at processing block edges due to independent quantization, while ringing appears as oscillatory halos around sharp transitions caused by truncated high frequencies. Artifact-reduction techniques, such as post-filtering, apply adaptive smoothing to blend block boundaries and suppress oscillations without reintroducing excessive blur, preserving edge details through edge-directed filters.
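The Lagrangian step-size search described above can be sketched numerically in Python; the Gaussian test source, the candidate step sizes, the entropy-based rate estimate, and the value of λ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 1.0, 10_000)          # toy continuous source

def rate_distortion(signal, step):
    """Quantize with a uniform step and estimate (rate, distortion)."""
    q = np.round(signal / step)                 # quantizer indices
    recon = q * step                            # dequantized values
    distortion = np.mean((signal - recon) ** 2) # MSE
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    rate = -np.sum(p * np.log2(p))              # empirical entropy as a rate proxy (bits/sample)
    return rate, distortion

lam = 0.1                                       # Lagrange multiplier: larger values favour lower rate
best_step, best_cost, best_rd = None, float("inf"), None
for step in np.linspace(0.05, 2.0, 40):
    r, d = rate_distortion(signal, step)
    cost = d + lam * r                          # J = D + lambda * R
    if cost < best_cost:
        best_step, best_cost, best_rd = step, cost, (r, d)

print(f"best step={best_step:.2f}  rate={best_rd[0]:.2f} b/sample  MSE={best_rd[1]:.4f}")
```

Sweeping λ traces out different operating points on the rate-distortion curve: small λ favours fine quantization (low distortion, high rate), large λ the opposite.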

Specific Algorithms

Prominent lossy compression algorithms leverage techniques such as transform coding, predictive coding, subband coding, and vector quantization to achieve efficient data reduction while introducing controlled distortion through quantization. These methods transform or predict the input data to concentrate energy or redundancy, followed by quantization that discards less perceptually important information, as described under the core lossy mechanisms above.

Transform-based algorithms, exemplified by the discrete cosine transform (DCT) in the JPEG baseline standard, operate on small blocks of data to decorrelate spatial information. In JPEG, the image is divided into 8x8 pixel blocks, each undergoing a two-dimensional DCT to convert spatial-domain values into frequency coefficients, where the lower frequencies capture most of the energy. These coefficients are then quantized using standard or custom quantization tables that scale values inversely with perceptual importance, discarding high-frequency details to achieve compression ratios often exceeding 10:1 with acceptable visual quality for still images. The process concludes with entropy coding of the quantized coefficients, enabling scalable quality adjustment via table modifications.

Predictive coding methods, such as differential pulse-code modulation (DPCM), exploit temporal or spatial correlations by estimating the current sample from previous ones and encoding only the prediction error. In DPCM, a linear predictor—typically a simple average of adjacent samples for images or an autoregressive filter for audio—generates the estimate, and the difference (error) is quantized with fewer bits due to its smaller dynamic range compared to the original signal. Quantization introduces loss by rounding the error to discrete levels, often using uniform or non-uniform steps tailored to signal statistics, yielding savings of 2-4 bits per sample in early audio systems while maintaining intelligibility in speech. This approach forms the basis for more advanced codecs, balancing prediction accuracy with quantization granularity.

Subband coding divides the signal into frequency bands using filter banks, followed by independent quantization and coding of each band, with wavelet transforms providing a multi-resolution alternative to fixed subbands. The Embedded Zerotree Wavelet (EZW) algorithm applies a wavelet transform to decompose the signal into hierarchical trees, exploiting the statistical similarity of coefficients across scales, where insignificance (below a threshold) forms zerotrees—sparse structures that allow efficient progressive transmission. Starting from the coarsest scale, EZW scans coefficients in dominant and subordinate passes, encoding significant ones and refining approximations iteratively, which supports embedded bitstreams for quality scalability and achieves superior rate-distortion performance over DCT coding at low bitrates, such as below 0.5 bits per pixel for images.

Vector quantization (VQ) maps multidimensional input vectors—such as blocks of pixels or speech frames—to the nearest entry in a predefined codebook of representative patterns, encoding only the index of the match. Codebooks are designed using algorithms like the Linde-Buzo-Gray (LBG) method, which iteratively partitions training data via nearest-neighbor clustering to minimize distortion, typically with codebook sizes of 256-1024 vectors for 8-10 bit indices in speech applications. In speech compression, VQ groups spectral parameters or prediction coefficients into vectors, enabling rates as low as 1-2 kbps by capturing joint statistics, though it requires large codebooks to avoid block artifacts and is often combined with other techniques for robustness.
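A minimal NumPy sketch of the JPEG-style block transform follows: an 8x8 DCT, quantization with an illustrative table (not JPEG's standard luminance table), and reconstruction, showing how most high-frequency coefficients quantize to zero. The sample block and table values are placeholders.

```python
import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)                       # orthonormal DCT-II basis matrix

# Illustrative quantization table: coarser steps for higher frequencies
# (a real JPEG encoder would use the standard luminance table instead).
Q = 8 + 4 * (k + n)

block = np.round(127 + 60 * np.sin(n / 2.0) + 40 * np.cos(k / 3.0))  # smooth toy 8x8 block
coeffs = C @ (block - 128) @ C.T                 # forward 2D DCT on the level-shifted block
quantized = np.round(coeffs / Q)                 # lossy step: most high-freq entries become 0
recon = C.T @ (quantized * Q) @ C + 128          # dequantize and inverse 2D DCT

print("nonzero coefficients:", int(np.count_nonzero(quantized)), "of 64")
print("max reconstruction error:", float(np.max(np.abs(recon - block))))
```

Because the block is smooth, only a handful of low-frequency coefficients survive quantization, which is exactly what the subsequent entropy coder exploits.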
Key parameters in these algorithms control the trade-off between bitrate and distortion, including quantization step sizes that determine the distortion level—analogous to early constant rate factor (CRF) mechanisms in video coding precursors, where a fixed quantization parameter adjusts bitrate dynamically to target a consistent perceptual quality. Bit allocation strategies further optimize the trade-off by distributing available bits non-uniformly across coefficients, bands, or vectors based on rate-distortion criteria, prioritizing perceptually sensitive components to minimize overall error for a given rate.
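The predictive-coding idea can likewise be shown with a toy first-order DPCM loop in Python; the step size and test signal are arbitrary illustrative choices rather than parameters of any standard.

```python
import numpy as np

def dpcm_encode(samples, step=4):
    """Toy first-order DPCM: predict each sample from the previous reconstruction."""
    indices = []
    prediction = 0
    for x in samples:
        error = x - prediction                 # prediction residual
        q = int(round(error / step))           # quantize the residual (the lossy step)
        indices.append(q)
        prediction = prediction + q * step     # decoder-matched reconstruction
    return indices

def dpcm_decode(indices, step=4):
    out, prediction = [], 0
    for q in indices:
        prediction = prediction + q * step
        out.append(prediction)
    return out

signal = (100 * np.sin(np.linspace(0, 4 * np.pi, 64))).astype(int)
codes = dpcm_encode(signal)
recon = dpcm_decode(codes)
# Closed-loop prediction keeps the error bounded by half a quantization step.
print("max reconstruction error:", int(np.max(np.abs(np.array(recon) - signal))))
```

Because the encoder predicts from the reconstructed (not the original) signal, quantization errors do not accumulate, which is the essential design point of DPCM.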

Applications

Image Compression

Image compression techniques are designed to reduce the storage and transmission requirements of visual data while preserving perceptual quality, with methods differing significantly between raster and vector formats. Raster images, composed of a grid of pixels, rely on pixel-level encoding and thus employ both lossless and lossy compression to manage the large file sizes inherent to their resolution-dependent nature. In contrast, vector images represent graphics through mathematical paths, points, and curves, enabling inherently scalable and compact representations that typically use lossless compression without quality degradation upon scaling. For example, the Scalable Vector Graphics (SVG) format, defined by the W3C, stores vector data in XML, achieving compression through text-based optimization rather than pixel manipulation, resulting in files that remain lightweight even for complex illustrations.

The JPEG family of standards dominates lossy compression for raster photographs, with baseline JPEG (ISO/IEC 10918-1) utilizing the discrete cosine transform (DCT) to convert spatial data into frequency coefficients, followed by quantization and entropy coding to achieve high compression ratios. This approach excels at reducing file sizes for continuous-tone images but introduces artifacts such as blocking—visible 8x8 pixel grid discontinuities—and ringing around edges, particularly at high compression levels, which can degrade perceived quality in areas of fine detail or smooth gradients. JPEG 2000 (ISO/IEC 15444-1), an advancement using discrete wavelet transforms, offers superior compression efficiency and supports both lossy and lossless modes, mitigating JPEG's blocking artifacts through better energy compaction and providing region-of-interest coding for selective quality preservation; however, its computational complexity has limited widespread adoption compared to baseline JPEG.

For lossless raster compression, particularly suited to graphics and web imagery, the Portable Network Graphics (PNG) format employs the DEFLATE algorithm—a combination of LZ77 dictionary coding and Huffman entropy coding—to achieve portable, bit-depth-agnostic compression without data loss, making it ideal for images requiring exact reproduction such as diagrams or screenshots. PNG's filtering step before DEFLATE compression exploits spatial redundancies in scanlines, yielding compression ratios superior to uncompressed formats but generally larger files than lossy alternatives for photographs.

Modern formats like WebP and AVIF address the limitations of legacy standards by integrating video codec technologies for enhanced efficiency in both lossy and lossless scenarios. WebP, developed by Google, adapts intra-frame coding from the VP8 video codec to deliver 25-34% smaller files than JPEG at equivalent quality, supporting transparency and animation while reducing artifacts through predictive coding and advanced entropy methods. Similarly, AVIF leverages the AV1 video codec within the HEIF container to achieve up to 50% better compression than JPEG or WebP, with native support for high dynamic range (HDR), wide color gamuts, and lossless modes, though its encoding speed remains a drawback for real-time applications. Emerging formats like JPEG XL offer even higher efficiency, with up to 55% smaller files than JPEG and roughly 25% smaller than AVIF, supporting lossless conversion from legacy JPEG and progressive decoding, though adoption is still growing as of 2025.
Quality in image compression is evaluated using metrics that quantify fidelity between the original and compressed versions. Peak signal-to-noise ratio (PSNR) measures pixel-wise error in decibels—higher values indicate less distortion, though it correlates poorly with perceived quality—while the structural similarity index (SSIM) assesses luminance, contrast, and structural similarity on a scale of -1 to 1, better aligning with visual judgments by emphasizing edge preservation. Use cases dictate format selection: lossy methods like JPEG and WebP suit web delivery where minor artifacts are tolerable in exchange for faster loading, while lossless PNG or lossless WebP excels in professional graphics and archival storage to ensure pixel-perfect reproduction.
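A short Python sketch of the PSNR calculation mentioned above, for 8-bit images held in NumPy arrays; the test images are synthetic placeholders.

```python
import numpy as np

def psnr(original, compressed, max_val=255.0):
    """Peak signal-to-noise ratio in dB for two same-shaped 8-bit images."""
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(img + rng.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(img, noisy):.1f} dB")   # roughly 34 dB for sigma ≈ 5 noise
```

Values around 30-40 dB are commonly treated as acceptable for lossy image coding, but as noted above, PSNR alone is a poor proxy for perceived quality.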

Audio Compression

Audio compression involves reducing the size of digital sound data while preserving perceptual quality, leveraging the temporal nature of audio signals and human auditory perception. Uncompressed audio typically uses pulse-code modulation (PCM), where analog waveforms are sampled at regular intervals and quantized into digital values. For compact disc (CD) quality, PCM employs a sampling rate of 44.1 kHz and a 16-bit depth per channel, capturing frequencies up to 22.05 kHz per the Nyquist theorem to cover the human hearing range of approximately 20 Hz to 20 kHz. This results in a bitrate of about 1.411 Mbps for stereo audio, making compression essential for storage and transmission.

Central to audio compression are psychoacoustic principles that exploit auditory limitations to discard inaudible information. Frequency masking occurs when a louder sound at one frequency reduces the perception of nearby frequencies, allowing encoders to allocate fewer bits to masked regions. Equal-loudness contours describe how human sensitivity varies across frequencies, with lower sensitivity at the extremes (e.g., below 100 Hz or above 10 kHz), enabling coarser quantization in those bands. Filter banks, such as polyphase or quadrature mirror filters, decompose the signal into subbands aligned with the critical bands of hearing (roughly 25 bands), facilitating precise perceptual modeling.

Lossy audio compression formats apply these principles to achieve high efficiency by removing imperceptible details. MP3 (MPEG-1 Audio Layer III) uses perceptual coding with a hybrid filter bank combining 32 subband filters and the modified discrete cosine transform (MDCT) for finer frequency resolution, followed by quantization guided by a psychoacoustic model that computes masking thresholds. A bit reservoir mechanism allows borrowing bits across frames to handle variable bitrate needs, enabling compression to 128 kbps with near-transparent quality for many listeners. Advanced Audio Coding (AAC), a successor standard, improves efficiency through better filter banks, parametric stereo coding, and tools like Spectral Band Replication (SBR), which reconstructs high frequencies from low-band data at bitrates as low as 48 kbps, offering 30-50% better compression than MP3 at equivalent quality.

Lossless formats compress audio without data loss, typically achieving 40-60% size reduction. FLAC (Free Lossless Audio Codec) employs linear predictive coding (LPC) to estimate samples from prior ones, encoding the prediction residuals with Rice codes—a variable-length prefix code well suited to the roughly geometric distribution of residuals. Apple's ALAC (Apple Lossless Audio Codec) uses a similar LPC-based approach with adaptive prediction orders up to 31, entropy-coding the residuals to maintain bit-perfect reconstruction, and supports metadata embedding within Apple's ecosystem.

In applications such as streaming and storage, compressed audio balances quality and bandwidth. Services such as Spotify employ Ogg Vorbis, a lossy format using the MDCT and perceptual noise shaping, at up to 320 kbps for "Very High" quality, enabling efficient delivery over variable networks. Subjective quality is often assessed via the mean opinion score (MOS), a 1-5 scale from ITU recommendations where scores above 4 indicate excellent perceived fidelity, guiding optimizations for real-world listening.
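To illustrate the predict-then-entropy-code pattern used by lossless audio codecs such as FLAC, here is a toy Python sketch with a fixed second-order predictor and a simple Rice encoder; the Rice parameter, warm-up handling, and sample values are illustrative simplifications, not FLAC's actual bitstream format.

```python
def rice_encode(value, k):
    """Rice code for one signed integer: zigzag map, then unary quotient + k-bit remainder."""
    u = (value << 1) ^ (value >> 63)          # zigzag: ..., -2, -1, 0, 1, 2 -> 3, 1, 0, 2, 4
    q, r = u >> k, u & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def encode_block(samples, k=3):
    """Fixed 2nd-order linear prediction with Rice-coded residuals (toy sketch)."""
    bits = []
    for i, x in enumerate(samples):
        if i < 2:
            pred = 0                           # warm-up samples: predicted as 0 here for simplicity
        else:
            pred = 2 * samples[i - 1] - samples[i - 2]   # linear extrapolation from two prior samples
        bits.append(rice_encode(x - pred, k))
    return "".join(bits)

samples = [100, 104, 109, 115, 120, 124, 127, 128, 127, 124]
stream = encode_block(samples)
print(len(stream), "bits vs", 16 * len(samples), "bits for 16-bit PCM")
```

For smooth audio the residuals are tiny, so most samples cost only a few bits; a real codec would additionally choose the predictor order and Rice parameter adaptively per block.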

Video Compression

Video compression techniques reduce the data required to represent moving images by exploiting both spatial redundancies within individual frames and temporal redundancies across consecutive frames. This process typically involves dividing video into sequences of frames and applying hybrid coding methods that combine prediction with transform-based compression. Intra-frame coding treats each frame independently, similar to still-image compression, using techniques like the discrete cosine transform (DCT) to encode keyframes (I-frames) that serve as reference points without relying on other frames. Inter-frame coding, in contrast, predicts subsequent frames from previous ones using motion estimation, where predicted frames (P-frames) and bi-directionally predicted frames (B-frames) incorporate motion vectors to describe changes between frames, significantly lowering bit rates for sequences with consistent motion. Motion compensation extends this prediction by estimating and compensating for object movement across frames, typically through block-matching algorithms.

Major standards define the frameworks for these methods. H.264/AVC (Advanced Video Coding) is a foundational block-based approach that uses 16x16 macroblocks for motion estimation and compensation, along with context-adaptive binary arithmetic coding (CABAC) for efficient entropy encoding. H.265/HEVC (High Efficiency Video Coding) improves upon H.264 by employing larger coding tree units (CTUs) of up to 64x64 pixels, enabling more flexible partitioning and achieving approximately 50% better compression efficiency at the same quality, which is particularly beneficial for high-resolution content. AV1, developed by the Alliance for Open Media, offers a royalty-free alternative with comparable or superior efficiency to HEVC, supporting advanced features like film grain synthesis while remaining open for broad adoption in web and streaming applications.

Hybrid video formats integrate transform coding, such as integer DCT approximations, to concentrate energy in low-frequency coefficients after motion-compensated prediction, followed by quantization and entropy coding. Motion estimation employs search algorithms, such as full search or diamond search patterns, to compute vectors minimizing differences between blocks in reference and current frames, often with sub-pixel accuracy for smoother predictions. Deblocking filters are applied post-reconstruction to mitigate artifacts at block boundaries, enhancing perceptual quality without significantly increasing bit rates.

Various profiles tailor these standards to specific uses, such as streaming 4K HDR content, where high dynamic range and wide color gamut require extended bit depths and appropriate chroma sampling. Metrics like Video Multimethod Assessment Fusion (VMAF) evaluate perceived quality by fusing multiple quality models, correlating strongly with human judgments for optimized encoding in bandwidth-constrained environments. Challenges in video compression include achieving real-time encoding for live streams, where the high computational demands of motion estimation and transform processing can introduce latencies exceeding acceptable thresholds for interactive applications, necessitating hardware acceleration or simplified algorithms.
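A toy Python sketch of the block-matching step at the heart of motion estimation, performing an exhaustive search over a small window and minimizing the sum of absolute differences (SAD); the block size, search range, and synthetic frames are illustrative.

```python
import numpy as np

def best_motion_vector(ref, cur, top, left, block=16, search=8):
    """Exhaustive block matching: find the (dy, dx) minimizing SAD for one block."""
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue                          # candidate block falls outside the reference frame
            cand = ref[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(target - cand).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

rng = np.random.default_rng(2)
ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=(3, -2), axis=(0, 1))    # simulate a global shift of the whole frame
mv, sad = best_motion_vector(ref, cur, top=16, left=16)
print("motion vector:", mv, "SAD:", sad)          # expect (-3, 2) with SAD 0 for this block
```

Real encoders replace this brute-force search with faster patterns (diamond, hexagonal) and refine vectors to sub-pixel precision, but the cost function and the idea are the same.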

Scientific and Other Uses

In genomics, compression plays a vital role in managing the vast volumes of sequence data generated by high-throughput technologies. Reference-based compression methods, such as DNAzip, leverage a known reference genome to encode differences, exploiting the repetitive nature of genomic sequences where identical or similar segments recur frequently across individuals or species. This approach achieves significant space savings by storing only variations from the reference, making it particularly effective for resequencing projects. For instance, DNAzip can compress genomic data at ratios exceeding 100:1 in some cases, facilitating efficient storage and analysis in bioinformatics pipelines. More recent advancements include tools that optimize read ordering for improved compression of FASTQ files (2025) and methods that compress hundreds of terabytes of genomic data into gigabytes (as of 2024).

Scientific datasets, including time-series observations from telescopes, often employ delta encoding to capture small incremental changes between consecutive data points, which is ideal for signals with low variability over time. In astronomical applications, tools like DeltaComp apply delta coding combined with adaptive techniques to compress timelines from instruments monitoring variable stars or exoplanets, reducing storage needs while preserving precision for downstream analysis. Similarly, for large-scale simulations in fields like climate modeling or physics, the Hierarchical Data Format 5 (HDF5) integrates lossy compressors such as SZ, which bound errors to user-specified tolerances and achieve compression ratios of 10:1 or higher on multidimensional simulation arrays. The SZ compressor, designed for floating-point scientific data, uses predictive modeling to exploit spatial correlations, enabling faster I/O in high-performance computing environments.

In general storage systems, techniques like deduplication in ZFS file systems identify and eliminate redundant blocks across datasets, storing only unique instances to optimize disk usage in environments with repetitive data patterns, such as virtual machine images. For backups, NTFS compression on Windows volumes transparently reduces file sizes using algorithms like LZNT1, which is particularly useful for archiving logs or documents without impacting application compatibility. Archival formats combine these with tools like tar for bundling files and gzip for DEFLATE-based compression, as in TAR.GZ archives, which balance portability and efficiency for long-term preservation of heterogeneous data collections. In big data ecosystems, Hadoop integrates Snappy compression for intermediate map-reduce outputs, prioritizing speed over maximum ratio to accelerate processing of petabyte-scale datasets in distributed clusters.

Domain-specific redundancies further enhance compression in scientific contexts, such as sparsity in matrices from graph analytics or finite element methods, where formats like compressed sparse row (CSR) store only non-zero elements, reducing memory requirements by orders of magnitude compared to dense representations. Additionally, error-resilient codes, including modifications to Tunstall encoding, incorporate redundancy to detect and correct bit errors in compressed streams transmitted over noisy channels, ensuring data integrity in remote or space-based experiments without excessive overhead. These techniques underscore compression's adaptation to structured scientific data, emphasizing exactness, error control, and efficiency over perceptual quality.
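The delta encoding mentioned above for slowly varying time series can be sketched in a few lines of Python; the synthetic sensor series is a placeholder.

```python
import numpy as np

def delta_encode(series):
    """Store the first sample plus successive differences (small for smooth series)."""
    series = np.asarray(series, dtype=np.int64)
    return series[0], np.diff(series)

def delta_decode(first, deltas):
    """Exact inverse: rebuild the series by cumulative summation."""
    return np.concatenate(([first], first + np.cumsum(deltas)))

# A slowly drifting measurement: large absolute values, tiny step-to-step changes.
series = 1_000_000 + np.cumsum(np.random.default_rng(3).integers(-2, 3, 1000))
first, deltas = delta_encode(series)
assert np.array_equal(delta_decode(first, deltas), series)   # lossless round trip
print("delta range:", deltas.min(), "to", deltas.max())      # deltas fit in a single byte
```

The raw values need 4-8 bytes each, while the differences fit comfortably in one byte, which a subsequent general-purpose compressor can shrink even further.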

History

Early Developments

The concept of data compression emerged in the 19th century with efforts to optimize communication efficiency in telegraphy. Morse code, developed in the 1830s and 1840s, assigned shorter sequences of dots and dashes to more frequently used letters in English, such as 'E' (a single dot) and 'T' (a single dash), while rarer letters received longer codes; this variable-length encoding reduced the average transmission time for messages over limited-bandwidth telegraph lines. In 1948, Claude Shannon published "A Mathematical Theory of Communication," which laid the theoretical foundation for data compression by introducing the concept of entropy as a measure of the minimum average number of bits required to represent a source's output, enabling the quantification of redundancy in signals.

The 1950s saw practical algorithmic advances, with David Huffman introducing Huffman coding in his 1952 paper "A Method for the Construction of Minimum-Redundancy Codes," which builds optimal prefix codes by assigning shorter binary strings to more probable symbols based on their frequencies, achieving compression close to the entropy limit. This method was implemented in hardware for early digital systems, including telecommunications equipment and nascent computer applications on mainframes in the 1960s, where it optimized storage and transmission for punch-card and tape-based processing. Practical advancements accelerated in the 1960s and 1970s with applications in space exploration, where data compression was used on deep space missions to manage limited downlink capacity. Significant flight history exists for both lossless and lossy methods, including implementations on Voyager and other missions to reduce bandwidth requirements for interplanetary communication.

By the 1970s, dictionary-based approaches gained prominence with the Lempel-Ziv algorithms developed by Abraham Lempel and Jacob Ziv. Their 1977 LZ77 algorithm used a sliding window to replace repeated substrings with references to prior occurrences, followed by LZ78 in 1978, which built an explicit dictionary of phrases during compression. Concurrently, run-length encoding (RLE), which represents consecutive identical data elements by a single value and its count, was widely adopted in facsimile (fax) machines starting in the early 1970s to efficiently transmit scanned lines of black-and-white pixels, reducing bandwidth needs for document imaging over telephone lines. These early developments, driven by key figures like Huffman, Lempel, and Ziv, primarily found initial applications in telecommunications for minimizing signal redundancy and improving transmission efficiency in bandwidth-constrained environments.

Modern Evolution

The modern evolution of data compression from the 1980s onward marked a shift toward standardized algorithms for digital media, driven by the explosion of personal computing, internet growth, and multimedia applications. In the late 1980s and early 1990s, key developments included the emergence of image and audio compression techniques that laid the groundwork for widespread digital formats. The JPEG standard, formally adopted in 1992 as ISO/IEC 10918-1, introduced discrete cosine transform (DCT)-based lossy coding for continuous-tone images, achieving compression ratios of 10:1 to 20:1 with minimal perceptual loss and revolutionizing still-image storage and transmission. Similarly, the GIF format, released in 1987 by CompuServe, utilized Lempel–Ziv–Welch (LZW) dictionary-based compression for lossless encoding of indexed-color images, supporting animations and becoming ubiquitous in early web graphics despite its 256-color limitation. Precursors to audio compression also arose in the late 1980s, with projects like the Adaptive Spectral Perceptual Entropy Coding (ASPEC) algorithm developed under the Eureka 147 project, which employed psychoacoustic modeling and transform coding to reduce audio file sizes by factors of 10-12 while preserving quality.

The 1990s and 2000s saw compression integrate deeply into file archiving, video, and broadband ecosystems, with standards promoting interoperability. The ZIP format, introduced in 1989 by PKWARE and later built around DEFLATE (combining LZ77 sliding-window matching with Huffman coding), achieved ubiquity through software like PKZIP and WinZip and became the de facto standard for general-purpose archiving, offering ratios around 2:1 to 3:1 for text and executables. Video compression advanced through the MPEG family of standards; MPEG-1 (1993, ISO/IEC 11172-2) enabled CD-ROM video playback with 30:1 ratios, while MPEG-2 (1995, ISO/IEC 13818-2) supported DVD and broadcast TV at similar efficiencies, and MPEG-4 (1999 onward) introduced object-based coding for interactive media, facilitating streaming with adaptive bitrates. For lossless text and binary data, bzip2 (1996) employed the Burrows–Wheeler transform followed by move-to-front encoding and Huffman coding, delivering superior ratios (up to 20-30% better than gzip) for large files at the cost of higher computational demands. Milestones included patent disputes, such as Unisys's enforcement of LZW patents in the 1990s, which led to royalty fees for GIF implementations and spurred alternatives like PNG, highlighting tensions between innovation and intellectual property. Open-source initiatives further democratized compression in this era; the zlib library (1995), implementing DEFLATE without patent restrictions, became integral to PNG, HTTP, and countless applications, fostering widespread adoption through its permissive BSD-like license.

Entering the 2010s, high-efficiency codecs addressed 4K and beyond: HEVC (High Efficiency Video Coding, 2013, ISO/IEC 23008-2) halved bitrates compared to H.264 for equivalent quality, enabling efficient UHD streaming with 50:1 ratios in some scenarios. Image formats evolved with WebP (2010, Google), supporting both lossy (VP8-based) and lossless modes and achieving 25-34% better compression than JPEG or PNG for web use. Neural network-based compression research began gaining traction around 2016-2018, with early works like Ballé et al.'s learned image compression models demonstrating end-to-end learned transforms that outperformed traditional methods on images by 10-20% in rate-distortion performance.

In the 2020s, compression has emphasized royalty-free, efficient, and resilient designs amid rising data volumes and environmental concerns. AV1 (AOMedia Video 1, finalized in 2018 with broad adoption in the 2020s) has seen uptake in platforms like YouTube and Netflix, offering roughly 30% bitrate savings over HEVC without licensing fees and supporting up to 50% efficiency gains for 8K video. Sustainable compression efforts focus on energy-efficient algorithms, such as hardware-optimized neural codecs that reduce encoding power compared to traditional methods, addressing the energy footprint of data centers. Emerging quantum-resistant approaches incorporate post-quantum cryptography into compression pipelines to safeguard against future quantum attacks, ensuring long-term security for archived data without significant overhead. As of 2025, advancements in AI-driven compression include large-model-based lossless methods like LMCompress, which leverage neural networks to achieve record-breaking compression ratios on diverse datasets. These innovations reflect a trajectory toward accessible, performant, and future-proof compression integral to digital infrastructure.

Future Directions

Emerging Technologies

AI-driven neural codecs represent a significant advancement in data compression, leveraging deep learning to achieve higher efficiency and perceptual quality compared to traditional methods. These codecs employ end-to-end neural networks for encoding and decoding, often incorporating generative models to reconstruct compressed data with minimal loss in visual fidelity. For instance, generative perceptual neural video compression frameworks integrate foundation models to enhance reconstruction while maintaining superior subjective quality on benchmark datasets. Streaming providers have integrated machine-learning enhancements into AV1 encoding pipelines, using neural networks for tasks such as dynamic optimization and downscaling, which improve encoding efficiency without compromising viewer experience. Recent developments, including real-time neural video codecs, further enable practical deployment by balancing compression ratios with low latency, outperforming prior models by 10.7% in BD-rate reduction on standard video sequences.

Quantum compression is a theoretical frontier that could surpass classical limits through the use of entangled states. Schumacher's quantum data compression theorem establishes the von Neumann entropy as the fundamental limit for compressing quantum information, allowing faithful reconstruction with high probability using quantum typical subspaces. Entanglement-assisted protocols extend this by exploiting shared entanglement to reduce the required quantum communication rate below the Schumacher limit, achieving, for certain sources, savings of up to half the communication cost. While still largely theoretical, these approaches hold promise for quantum networks, where entangled states could enable more efficient storage and transmission of quantum data, though practical implementations remain constrained by current quantum hardware limitations.

In edge computing for the IoT, lightweight algorithms prioritize speed and low resource usage to handle constrained devices. Algorithms like Zstandard offer high compression ratios with significantly faster decompression than gzip, making them suitable for real-time IoT data processing on edge nodes (a brief comparison sketch appears at the end of this section). Recent innovations, such as SZ4IoT, adapt error-bounded lossy compression techniques for sensor data, achieving high size reduction with adjustable error bounds while consuming minimal CPU cycles on resource-limited hardware. These methods reduce latency and bandwidth demands in distributed IoT networks, enabling efficient local processing at the edge before transmission to central servers.

Adaptive streaming technologies, particularly when integrated with machine learning, dynamically adjust bitrates based on predicted network conditions and user quality preferences. ML models in adaptive bitrate frameworks predict optimal streaming quality by analyzing playback logs and historical throughput, selecting bitrates that minimize rebuffering while maximizing perceptual quality, with prediction accuracies exceeding 85% in varied network scenarios. This approach enhances quality of experience in heterogeneous environments without unnecessary quality degradation.

Sustainability-focused low-power codecs address the energy demands of data centers by minimizing computational overhead and data volume. Energy-efficient lossless algorithms reduce storage and transfer energy through faster processing and smaller payloads, directly lowering the carbon footprint of large-scale data handling. For multimedia in green data centers, compact visual representations using neural compression cut transmission energy while preserving fidelity, supporting sustainable practices amid rising data volumes. These codecs prioritize hardware-friendly designs, enabling deployment on energy-constrained servers to curb the sector's substantial share of global electricity consumption.
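As a rough illustration of the speed/ratio trade-off mentioned for edge workloads, the sketch below compares the standard-library zlib module with Zstandard via the third-party `zstandard` package (assumed to be installed via pip); the payload and compression levels are arbitrary.

```python
import time
import zlib

import zstandard as zstd   # third-party binding for Zstandard (pip install zstandard)

payload = b'{"sensor":"temp","value":21.5,"ts":1700000000}\n' * 20_000

def bench(name, compress, decompress):
    """Compress, verify a lossless round trip, and report size and timings."""
    t0 = time.perf_counter()
    blob = compress(payload)
    t1 = time.perf_counter()
    assert decompress(blob) == payload
    t2 = time.perf_counter()
    print(f"{name:>4}: {len(blob):>8} bytes  "
          f"compress {1e3 * (t1 - t0):.1f} ms  decompress {1e3 * (t2 - t1):.1f} ms")

bench("zlib", lambda d: zlib.compress(d, 6), zlib.decompress)
cctx, dctx = zstd.ZstdCompressor(level=3), zstd.ZstdDecompressor()
bench("zstd", cctx.compress, dctx.decompress)
```

On repetitive telemetry like this, both achieve strong ratios; the practical difference for edge nodes is typically in decompression throughput and CPU cost, which is why relative timings matter more than the exact sizes shown here.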

Unused Potential

Despite significant advances in data compression algorithms, substantial untapped potential remains in leveraging artificial intelligence, particularly large language models and related deep learning techniques, to achieve more efficient encoding across diverse data types. Recent surveys highlight that while classical compression methods rely on statistical redundancy, AI-driven approaches can exploit semantic understanding to push beyond traditional limits, enabling task-oriented or goal-directed compression that preserves the information relevant to specific applications rather than raw fidelity. For instance, integrating large models into lossless compression has demonstrated unprecedented ratios, such as roughly halving the rates of established standards like JPEG-XL for images and FLAC for audio, by linking deeper data comprehension to encoding efficiency. Future enhancements in model capabilities could further unlock this potential, potentially revolutionizing compression for complex, multimodal data streams.

In domain-specific areas like genomics, current compressors often fail to fully capture subtle structural patterns in biological sequences, leaving room for algorithms that incorporate evolutionary or contextual redundancies to achieve dramatically smaller file sizes. The explosive growth of genomic datasets—projected to reach exabytes within the coming decade—underscores this gap, as existing methods like gzip or specialized tools such as CRAM provide only modest reductions, typically 2-5x for raw sequences, while advanced pattern-aware techniques could exceed 10x in targeted scenarios. Research emphasizes the need for hybrid approaches combining machine learning with biological priors to exploit these untapped redundancies, potentially reducing storage costs for large-scale sequencing projects and enabling broader reuse of compressed archives. Similarly, in lossy compression for scientific simulations, error-bounded techniques show promise but remain underutilized due to challenges in error bounding and reconstruction fidelity, with ongoing work pointing to adaptive, application-specific reducers as a key frontier.

Another area of unused potential lies in cross-modal and real-time compression for emerging applications, such as IoT sensor networks and edge devices, where energy constraints limit adoption of sophisticated algorithms. Probabilistic modeling advancements, including neural autoregressive flows, offer pathways to more practical neural compression that balances rate-distortion trade-offs without excessive computational overhead. Overall, these opportunities hinge on interdisciplinary efforts to address open challenges such as scalability for massive datasets and standardization of AI-compressed formats, ensuring that compression evolves in tandem with data generation rates.
