Video coding format
A video coding format[a] (or sometimes video compression format) is an encoded format of digital video content, such as in a data file or bitstream. It typically uses a standardized video compression algorithm, most commonly based on discrete cosine transform (DCT) coding and motion compensation. A computer software or hardware component that compresses or decompresses a specific video coding format is a video codec.
Some video coding formats are documented by a detailed technical specification document known as a video coding specification. Some such specifications are written and approved by standardization organizations as technical standards, and are thus known as video coding standards. There are de facto standards and formal standards.
Video content encoded using a particular video coding format is normally bundled with an audio stream (encoded using an audio coding format) inside a multimedia container format such as AVI, MP4, FLV, RealMedia, or Matroska. As such, the user normally does not have an H.264 file, but instead has a video file, which is an MP4 container of H.264-encoded video, normally alongside AAC-encoded audio. Multimedia container formats can contain one of several different video coding formats; for example, the MP4 container format can contain video coding formats such as MPEG-2 Part 2 or H.264. Another example is the initial specification for the file type WebM, which specifies not only the container format (Matroska) but also exactly which video (VP8) and audio (Vorbis) compression formats are inside the Matroska container, even though Matroska itself is capable of containing other formats; support for VP9 video and Opus audio was later added to the WebM specification.
Distinction between format and codec
A format is the layout plan for data produced or consumed by a codec.
Although video coding formats such as H.264 are sometimes referred to as codecs, there is a clear conceptual difference between a specification and its implementations. Video coding formats are described in specifications, and software, firmware, or hardware to encode/decode data in a given video coding format from/to uncompressed video are implementations of those specifications. As an analogy, the video coding format H.264 (specification) is to the codec OpenH264 (specific implementation) what the C Programming Language (specification) is to the compiler GCC (specific implementation). Note that for each specification (e.g., H.264), there can be many codecs implementing that specification (e.g., x264, OpenH264, H.264/MPEG-4 AVC products and implementations).
This distinction is not consistently reflected terminologically in the literature. The H.264 specification calls H.261, H.262, H.263, and H.264 video coding standards and does not contain the word codec.[2] The Alliance for Open Media clearly distinguishes between the AV1 video coding format and the accompanying codec they are developing, but calls the video coding format itself a video codec specification.[3] The VP9 specification calls the video coding format VP9 itself a codec.[4]
As an example of conflation, Chromium's[5] and Mozilla's[6] pages listing their supported video formats both call video coding formats such as H.264 codecs. As another example, in Cisco's announcement of a free-as-in-beer video codec, the press release refers to the H.264 video coding format as a codec ("choice of a common video codec"), but calls Cisco's implementation of an H.264 encoder/decoder a codec shortly thereafter ("open-source our H.264 codec").[7]
A video coding format does not dictate all algorithms used by a codec implementing the format. For example, a large part of how video compression typically works is by finding similarities between video frames (block-matching) and then achieving compression by copying previously-coded similar subimages (such as macroblocks) and adding small differences when necessary. Finding optimal combinations of such predictors and differences is an NP-hard problem,[8] meaning that it is practically impossible to find an optimal solution. Though the video coding format must support such compression across frames in the bitstream format, by not needlessly mandating specific algorithms for finding such block-matches and other encoding steps, the codecs implementing the video coding specification have some freedom to optimize and innovate in their choice of algorithms. For example, section 0.5 of the H.264 specification says that encoding algorithms are not part of the specification.[2] Free choice of algorithm also allows different space–time complexity trade-offs for the same video coding format, so a live feed can use a fast but space-inefficient algorithm, and a one-time DVD encoding for later mass production can trade long encoding-time for space-efficient encoding.
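As an illustration of one algorithm an encoder is free to choose, the following minimal sketch performs an exhaustive (full-search) block match using the sum of absolute differences (SAD) as the cost. The function names, the SAD metric, the tiny synthetic frames, and the search range are all assumptions made for this example; no video coding format mandates them, and real encoders typically use much faster heuristic searches.

```python
# Exhaustive block matching with a sum-of-absolute-differences (SAD) cost.
# Frames are lists of rows of integer luma samples; all names are illustrative.

def sad(ref, cur, rx, ry, cx, cy, size):
    """SAD between the block at (rx, ry) in `ref` and the block at (cx, cy) in `cur`."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
    return total


def full_search(ref, cur, cx, cy, size, search_range):
    """Return (dx, dy, cost) of the best-matching reference block for the
    current block at (cx, cy), trying every offset within +/- search_range."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def make_frame(width, height, ox, oy):
    """Dark frame containing a bright 4x4 square with top-left corner (ox, oy)."""
    return [[200 if ox <= x < ox + 4 and oy <= y < oy + 4 else 16
             for x in range(width)]
            for y in range(height)]


reference = make_frame(16, 16, 4, 4)
current = make_frame(16, 16, 6, 5)  # the square moved 2 right and 1 down
motion = full_search(reference, current, cx=6, cy=5, size=4, search_range=3)
print(motion)  # (-2, -1, 0): the block is found 2 left and 1 up in the reference
```

A decoder only needs the resulting motion vector and any residual, so an encoder could replace `full_search` with a cheaper search and still produce a bitstream in the same format; that is the trade-off between encoding time and compression efficiency described above.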
History
The concept of analog video compression dates back to 1929, when R.D. Kell in Britain proposed the concept of transmitting only the portions of the scene that changed from frame-to-frame. The concept of digital video compression dates back to 1952, when Bell Labs researchers B.M. Oliver and C.W. Harrison proposed the use of differential pulse-code modulation (DPCM) in video coding. In 1959, the concept of inter-frame motion compensation was proposed by NHK researchers Y. Taki, M. Hatori and S. Tanaka, who proposed predictive inter-frame video coding in the temporal dimension.[9] In 1967, University of London researchers A.H. Robinson and C. Cherry proposed run-length encoding (RLE), a lossless compression scheme, to reduce the transmission bandwidth of analog television signals.[10]
The earliest digital video coding algorithms were either for uncompressed video or used lossless compression, both methods inefficient and impractical for digital video coding.[11][12] Digital video was introduced in the 1970s,[11] initially using uncompressed pulse-code modulation (PCM), requiring high bitrates around 45–200 Mbit/s for standard-definition (SD) video,[11][12] which was up to 2,000 times greater than the telecommunication bandwidth (up to 100 kbit/s) available until the 1990s.[12] Similarly, uncompressed high-definition (HD) 1080p video requires bitrates exceeding 1 Gbit/s, significantly greater than the bandwidth available in the 2000s.[13]
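The order of magnitude of these bitrates can be checked with simple arithmetic. The sketch below is illustrative only: the function name, the choice of 8-bit samples, the chroma subsampling layouts, and the example resolutions and frame rates are assumptions, not figures taken from the cited sources.

```python
def uncompressed_bitrate(width, height, fps, bits_per_sample=8, chroma="4:2:0"):
    """Raw video bitrate in bits per second for a Y'CbCr picture."""
    # Average samples per pixel: one luma sample plus the (possibly
    # subsampled) two chroma samples.
    samples_per_pixel = {"4:4:4": 3.0, "4:2:2": 2.0, "4:2:0": 1.5}[chroma]
    return width * height * samples_per_pixel * bits_per_sample * fps


# Assumed SD example: 720x576 at 25 frames/s, 4:2:2 sampling.
sd = uncompressed_bitrate(720, 576, 25, chroma="4:2:2")
# Assumed HD example: 1920x1080 at 60 frames/s, 4:2:0 sampling.
hd = uncompressed_bitrate(1920, 1080, 60)

print(f"SD: {sd / 1e6:.0f} Mbit/s")  # SD: 166 Mbit/s
print(f"HD: {hd / 1e9:.2f} Gbit/s")  # HD: 1.49 Gbit/s
```

Under these assumptions the SD figure falls inside the 45–200 Mbit/s range quoted above, and the HD figure exceeds 1 Gbit/s.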
Motion-compensated DCT
Practical video compression emerged with the development of motion-compensated DCT (MC DCT) coding,[12][11] also called block motion compensation (BMC)[9] or DCT motion compensation. This is a hybrid coding algorithm,[9] which combines two key data compression techniques: discrete cosine transform (DCT) coding[12][11] in the spatial dimension, and predictive motion compensation in the temporal dimension.[9]
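The spatial half of this hybrid scheme can be sketched in a few lines. The example below applies an orthonormal 8×8 DCT-II to a smooth block, quantizes the coefficients with a single uniform step, and reconstructs the block. The block contents, the step size of 16, and all function names are assumptions chosen for illustration; real formats define their own transforms, quantization rules, and entropy coding.

```python
import math

N = 8  # block size assumed for this sketch


def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix C, so that coefficients = C * block * C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


C = dct_matrix()
CT = transpose(C)


def dct2(block):
    """Forward 2-D DCT of a square block."""
    return matmul(matmul(C, block), CT)


def idct2(coeffs):
    """Inverse 2-D DCT (C is orthonormal, so its inverse is its transpose)."""
    return matmul(matmul(CT, coeffs), C)


# A smooth horizontal gradient, typical of low-detail image areas.
block = [[100 + 4 * x for x in range(N)] for _ in range(N)]

step = 16  # uniform quantizer step (assumed)
quantized = [[round(c / step) for c in row] for row in dct2(block)]
nonzero = sum(1 for row in quantized for q in row if q != 0)

# Decoder side: rescale the quantized levels and invert the transform.
restored = idct2([[q * step for q in row] for row in quantized])
max_error = max(abs(restored[y][x] - block[y][x])
                for y in range(N) for x in range(N))

print(nonzero)        # 2 of the 64 coefficients remain nonzero
print(max_error < 4)  # True: the reconstruction is close but not exact
```

The transform concentrates the energy of the smooth block in very few coefficients, and quantization discards the rest, which is where the loss occurs. In the hybrid scheme, the same transform is applied to the residual left after motion-compensated prediction rather than to the raw samples.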
DCT coding is a lossy block compression transform coding technique that was first proposed by Nasir Ahmed, who initially intended it for image compression, while he was working at Kansas State University in 1972. It was then developed into a practical image compression algorithm by Ahmed with T. Natarajan and K. R. Rao at the University of Texas in 1973, and was published in 1974.[14][15][16]
The other key development was motion-compensated hybrid coding.[9] In 1974, Ali Habibi at the University of Southern California introduced hybrid coding,[17][18][19] which combines predictive coding with transform coding.[9][20] He examined several transform coding techniques, including the DCT, Hadamard transform, Fourier transform, slant transform, and Karhunen-Loeve transform.[17] However, his algorithm was initially limited to intra-frame coding in the spatial dimension. In 1975, John A. Roese and Guner S. Robinson extended Habibi's hybrid coding algorithm to the temporal dimension, using transform coding in the spatial dimension and predictive coding in the temporal dimension, developing inter-frame motion-compensated hybrid coding.[9][21] For the spatial transform coding, they experimented with different transforms, including the DCT and the fast Fourier transform (FFT), developing inter-frame hybrid coders for them, and found that the DCT is the most efficient due to its reduced complexity, capable of compressing image data down to 0.25-bit per pixel for a videotelephone scene with image quality comparable to a typical intra-frame coder requiring 2-bit per pixel.[22][21]
The DCT was applied to video encoding by Wen-Hsiung Chen,[23] who developed a fast DCT algorithm with C.H. Smith and S.C. Fralick in 1977,[24][25] and founded Compression Labs to commercialize DCT technology.[23] In 1979, Anil K. Jain and Jaswant R. Jain further developed motion-compensated DCT video compression.[26][9] This led to Chen developing a practical video compression algorithm, called motion-compensated DCT or adaptive scene coding, in 1981.[9] Motion-compensated DCT later became the standard coding technique for video compression from the late 1980s onwards.[11][27]
Video coding standards
The first digital video coding standard was H.120, developed by the CCITT (now ITU-T) in 1984.[28] H.120 was not usable in practice, as its performance was too poor.[28] H.120 used motion-compensated DPCM coding,[9] a lossless compression algorithm that was inefficient for video coding.[11] During the late 1980s, a number of companies began experimenting with discrete cosine transform (DCT) coding, a much more efficient form of compression for video coding. The CCITT received 14 proposals for DCT-based video compression formats, in contrast to a single proposal based on vector quantization (VQ) compression. The H.261 standard was developed based on motion-compensated DCT compression.[11][27] H.261 was the first practical video coding standard,[28] and uses patents licensed from a number of companies, including Hitachi, PictureTel, NTT, BT, and Toshiba, among others.[29] Since H.261, motion-compensated DCT compression has been adopted by all the major video coding standards (including the H.26x and MPEG formats) that followed.[11][27]
MPEG-1, developed by the Moving Picture Experts Group (MPEG), followed in 1991, and it was designed to compress VHS-quality video.[28] It was succeeded in 1994 by MPEG-2/H.262,[28] which was developed with patents licensed from a number of companies, primarily Sony, Thomson and Mitsubishi Electric.[30] MPEG-2 became the standard video format for DVD and SD digital television.[28] Its motion-compensated DCT algorithm was able to achieve a compression ratio of up to 100:1, enabling the development of digital media technologies such as video on demand (VOD)[12] and high-definition television (HDTV).[31] In 1999, it was followed by MPEG-4/H.263, which was a major leap forward for video compression technology.[28] It uses patents licensed from a number of companies, primarily Mitsubishi, Hitachi and Panasonic.[32]
The most widely used video coding format as of 2019 is H.264/MPEG-4 AVC.[33] It was developed in 2003, and uses patents licensed from a number of organizations, primarily Panasonic, Godo Kaisha IP Bridge and LG Electronics.[34] In contrast to the standard DCT used by its predecessors, AVC uses the integer DCT.[23][35] H.264 is one of the video encoding standards for Blu-ray Discs; all Blu-ray Disc players must be able to decode H.264. It is also widely used by streaming internet sources, such as videos from YouTube, Netflix, Vimeo, and the iTunes Store, web software such as the Adobe Flash Player and Microsoft Silverlight, and also various HDTV broadcasts over terrestrial (ATSC standards, ISDB-T, DVB-T or DVB-T2), cable (DVB-C), and satellite (DVB-S2).[36]
Patents have been a major problem for many video coding formats, making the formats expensive to use or exposing users to the risk of a lawsuit over submarine patents. The motivation behind many recently designed video coding formats such as Theora, VP8, and VP9 has been to create a (libre) video coding standard covered only by royalty-free patents.[37] Patent status has also been a major point of contention in the choice of which video formats mainstream web browsers support inside the HTML video tag.
The current-generation video coding format is HEVC (H.265), introduced in 2013. AVC uses the integer DCT with 4x4 and 8x8 block sizes, and HEVC uses integer DCT and DST transforms with varied block sizes between 4x4 and 32x32.[38] HEVC is heavily patented, mostly by Samsung Electronics, GE, NTT, and JVCKenwood.[39] It is challenged by the AV1 format, intended for free license. As of 2019, AVC is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video developers, followed by HEVC, which is used by 43% of developers.[33]
List of video coding standards
| Basic algorithm | Video coding standard | Year | Publishers | Committees | Licensors | Market presence (2019)[33] | Popular implementations |
|---|---|---|---|---|---|---|---|
| DPCM | H.120 | 1984 | CCITT | VCEG | — | — | Unknown |
| DCT | H.261 | 1988 | CCITT | VCEG | Hitachi, PictureTel, NTT, BT, Toshiba, etc.[29] | — | Videoconferencing, videotelephony |
| DCT | Motion JPEG (MJPEG) | 1992 | JPEG | JPEG | ISO / Open Source does NOT mean free! [40] | — | QuickTime |
| DCT | MPEG-1 Part 2 | 1993 | ISO, IEC | MPEG | Fujitsu, IBM, Matsushita, etc.[41] | — | Video CD, Internet video |
| DCT | H.262 / MPEG-2 Part 2 (MPEG-2 Video) | 1995 | ISO, IEC, ITU-T | MPEG, VCEG | Sony, Thomson, Mitsubishi, etc.[30] | 29% | DVD Video, Blu-ray, DVB, ATSC, SVCD, SDTV |
| DCT | DV | 1995 | IEC | IEC | Sony, Panasonic | Unknown | Camcorders, digital cassettes |
| DCT | H.263 | 1996 | ITU-T | VCEG | Mitsubishi, Hitachi, Panasonic, etc.[32] | Unknown | Videoconferencing, videotelephony, H.320, ISDN,[42][43] mobile video (3GP), MPEG-4 Visual |
| DCT | MPEG-4 Part 2 (MPEG-4 Visual) | 1999 | ISO, IEC | MPEG | Mitsubishi, Hitachi, Panasonic, etc.[32] | Unknown | Internet video, DivX, Xvid |
| DWT | Motion JPEG 2000 (MJ2) | 2001 | JPEG[44] | JPEG[45] | — | Unknown | Digital cinema[46] |
| DCT | Advanced Video Coding (H.264 / MPEG-4 AVC) | 2003 | ISO, IEC, ITU-T | MPEG, VCEG | Panasonic, Godo Kaisha IP Bridge, LG, etc.[34] | 91% | Blu-ray, HD DVD, HDTV (DVB, ATSC), video streaming (YouTube, Netflix, Vimeo), iTunes Store, iPod Video, Apple TV, videoconferencing, Flash Player, Silverlight, VOD |
| DCT | Theora | 2004 | Xiph | Xiph | — | Unknown | Internet video, web browsers |
| DCT | VC-1 | 2006 | SMPTE | SMPTE | Microsoft, Panasonic, LG, Samsung, etc.[47] | Unknown | Blu-ray, Internet video |
| DCT | Apple ProRes | 2007 | Apple | Apple | Apple | Unknown | Video production, post-production |
| DCT | High Efficiency Video Coding (H.265 / MPEG-H HEVC) | 2013 | ISO, IEC, ITU-T | MPEG, VCEG | Samsung, GE, NTT, JVCKenwood, etc.[39][48] | 43% | UHD Blu-ray, DVB, ATSC 3.0, UHD streaming, HEIF, macOS High Sierra, iOS 11 |
| DCT | AV1 | 2018 | AOMedia | AOMedia | — | 7% | HTML video |
| DCT | Versatile Video Coding (VVC / H.266) | 2020 | JVET | JVET | Unknown | — | — |
Lossless, lossy, and uncompressed
Consumer video is generally compressed using lossy video codecs, since that results in significantly smaller files than lossless compression. Some video coding formats are designed explicitly for either lossy or lossless compression, and some video coding formats such as Dirac and H.264 support both.[49]
Uncompressed video formats, such as Clean HDMI, are a form of lossless video used in some circumstances, such as when sending video to a display over an HDMI connection. Some high-end cameras can also capture video directly in this format.
Intra-frame
Interframe compression complicates editing of an encoded video sequence.[50] One subclass of relatively simple video coding formats are the intra-frame video formats, such as DV, in which each frame of the video stream is compressed independently without referring to other frames in the stream, and no attempt is made to take advantage of correlations between successive pictures over time for better compression. One example is Motion JPEG, which is simply a sequence of individually JPEG-compressed images. This approach is quick and simple, at the expense of the encoded video being much larger than with a video coding format that supports inter-frame coding.
Because interframe compression copies data from one frame to another, if the original frame is simply cut out (or lost in transmission), the following frames cannot be reconstructed properly. Making cuts in intraframe-compressed video while video editing is almost as easy as editing uncompressed video: one finds the beginning and ending of each frame, and simply copies bit-for-bit each frame that one wants to keep, and discards the frames one does not want. Another difference between intraframe and interframe compression is that, with intraframe systems, each frame uses a similar amount of data. In most interframe systems, certain frames (such as I-frames in MPEG-2) are not allowed to copy data from other frames, so they require much more data than other frames nearby.[51]
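The effect of a missing frame can be sketched with a deliberately simplified dependency model. In the example below, which is an assumption made for illustration rather than the behavior of any particular format, an `I` frame is self-contained and each `P` frame references only the frame immediately before it; real formats allow more complex reference structures, including B-frames.

```python
def decodable_after_loss(frame_types, lost_index):
    """Return the indices of frames that can still be decoded when the frame
    at `lost_index` is cut out or lost.

    Simplified model: 'I' frames need no reference; every other frame needs
    the frame immediately before it to have been decoded.
    """
    decodable = []
    previous_ok = False
    for index, kind in enumerate(frame_types):
        if index == lost_index:
            ok = False
        elif kind == "I":
            ok = True
        else:
            ok = previous_ok
        if ok:
            decodable.append(index)
        previous_ok = ok
    return decodable


# Interframe stream: losing frame 1 breaks every frame up to the next I-frame.
print(decodable_after_loss("IPPPIPPP", lost_index=1))  # [0, 4, 5, 6, 7]

# Intraframe stream: only the lost frame itself is affected.
print(decodable_after_loss("IIIIIIII", lost_index=1))  # [0, 2, 3, 4, 5, 6, 7]
```

This is why an editor working on interframe-compressed material has to cut at I-frames or re-encode the frames around the cut, while intraframe material can be cut at any frame boundary.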
It is possible to build a computer-based video editor that spots problems caused when I-frames are edited out while other frames need them. This has allowed newer formats like HDV to be used for editing. However, this process demands a lot more computing power than editing intraframe-compressed video with the same picture quality. This kind of compression is also not very effective for audio formats.[52]
Profiles and levels
A video coding format can define optional restrictions to encoded video, called profiles and levels. It is possible to have a decoder that only supports decoding a subset of profiles and levels of a given video format, for example, to make the decoder program/hardware smaller, simpler, or faster.
A profile restricts which encoding techniques are allowed. For example, the H.264 format includes the profiles baseline, main and high (and others). While P-slices (which can be predicted based on preceding slices) are supported in all profiles, B-slices (which can be predicted based on both preceding and following slices) are supported in the main and high profiles but not in baseline.[53]
A level is a restriction on parameters such as maximum resolution and data rates.[53]
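How a profile and a level jointly constrain a stream can be sketched as a simple conformance check. In the example below, only the rule that B-slices are allowed in the main and high profiles but not in baseline comes from the text above. The level names `A`, `B`, and `C`, their numeric limits, the `Stream` class, and the `violations` function are hypothetical and are not taken from the H.264 specification.

```python
from dataclasses import dataclass

# Coding tools permitted per profile (B-slice rule as described in the text).
PROFILE_TOOLS = {
    "baseline": {"P-slices"},
    "main": {"P-slices", "B-slices"},
    "high": {"P-slices", "B-slices"},
}

# Hypothetical level limits, NOT the values defined by any real specification.
LEVEL_LIMITS = {
    "A": {"max_pixels": 352 * 288, "max_kbps": 768},
    "B": {"max_pixels": 1280 * 720, "max_kbps": 14_000},
    "C": {"max_pixels": 1920 * 1080, "max_kbps": 20_000},
}


@dataclass(frozen=True)
class Stream:
    profile: str
    tools: frozenset
    width: int
    height: int
    kbps: int


def violations(stream, level):
    """List the reasons why `stream` does not fit its profile and the given level."""
    problems = []
    disallowed = stream.tools - PROFILE_TOOLS[stream.profile]
    if disallowed:
        problems.append(
            f"{', '.join(sorted(disallowed))} not allowed in {stream.profile} profile")
    limits = LEVEL_LIMITS[level]
    if stream.width * stream.height > limits["max_pixels"]:
        problems.append(f"resolution exceeds level {level}")
    if stream.kbps > limits["max_kbps"]:
        problems.append(f"bitrate exceeds level {level}")
    return problems


clip = Stream("baseline", frozenset({"P-slices", "B-slices"}), 1920, 1080, 8_000)
print(violations(clip, "B"))
# ['B-slices not allowed in baseline profile', 'resolution exceeds level B']
```

A decoder that advertises support for a given profile and level can reject or fall back on any stream for which such a check reports problems, which is what allows simpler decoders to implement only a subset of the format.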
Notes
- ^ The term video coding includes Advanced Video Coding, High Efficiency Video Coding, and Video Coding Experts Group.[1]
References
- ^ Thomas Wiegand; Gary J. Sullivan; Gisle Bjontegaard & Ajay Luthra (July 2003). "Overview of the H.264 / AVC Video Coding Standard" (PDF). IEEE Transactions on Circuits and Systems for Video Technology.
- ^ a b "SERIES H: AUDIOVISUAL AND MULTIMEDIA SYSTEMS : Infrastructure of audiovisual services – Coding of moving video : Advanced video coding for generic audiovisual services". Itu.int. Retrieved January 6, 2015.
- ^ "Front Page". Alliance for Open Media. Retrieved May 23, 2016.
- ^ Adrian Grange; Peter de Rivaz & Jonathan Hunt. "VP9 Bitstream & Decoding Process Specification" (PDF).
- ^ "Audio/Video". The Chromium Projects. Retrieved May 23, 2016.
- ^ "Media formats supported by the HTML audio and video elements". Mozilla. Retrieved May 23, 2016.
- ^ Rowan Trollope (October 30, 2013). "Open-Sourced H.264 Removes Barriers to WebRTC". Cisco. Archived from the original on May 14, 2019. Retrieved May 23, 2016.
- ^ "Chapter 3 : Modified A* Prune Algorithm for finding K-MCSP in video compression" (PDF). Shodhganga.inflibnet.ac.in. Retrieved January 6, 2015.
- ^ a b c d e f g h i j "History of Video Compression". ITU-T. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6). July 2002. pp. 11, 24–9, 33, 40–1, 53–6. Retrieved November 3, 2019.
- ^ Robinson, A. H.; Cherry, C. (1967). "Results of a prototype television bandwidth compression scheme". Proceedings of the IEEE. 55 (3). IEEE: 356–364. doi:10.1109/PROC.1967.5493.
- ^ a b c d e f g h i Ghanbari, Mohammed (2003). Standard Codecs: Image Compression to Advanced Video Coding. Institution of Engineering and Technology. pp. 1–2. ISBN 9780852967102.
- ^ a b c d e f Lea, William (1994). Video on demand: Research Paper 94/68. House of Commons Library. Retrieved September 20, 2019.
- ^ Lee, Jack (2005). Scalable Continuous Media Streaming Systems: Architecture, Design, Analysis and Implementation. John Wiley & Sons. p. 25. ISBN 9780470857649.
- ^ Ahmed, Nasir (January 1991). "How I Came Up With the Discrete Cosine Transform". Digital Signal Processing. 1 (1): 4–5. Bibcode:1991DSP.....1....4A. doi:10.1016/1051-2004(91)90086-Z.
- ^ Ahmed, Nasir; Natarajan, T.; Rao, K. R. (January 1974), "Discrete Cosine Transform", IEEE Transactions on Computers, C-23 (1): 90–93, doi:10.1109/T-C.1974.223784
- ^ Rao, K. R.; Yip, P. (1990), Discrete Cosine Transform: Algorithms, Advantages, Applications, Boston: Academic Press, ISBN 978-0-12-580203-1
- ^ a b Habibi, Ali (1974). "Hybrid Coding of Pictorial Data". IEEE Transactions on Communications. 22 (5): 614–624. doi:10.1109/TCOM.1974.1092258.
- ^ Chen, Z.; He, T.; Jin, X.; Wu, F. (2019). "Learning for Video Compression". IEEE Transactions on Circuits and Systems for Video Technology. 30 (2): 566–576. arXiv:1804.09869. Bibcode:2020ITCSV..30..566C. doi:10.1109/TCSVT.2019.2892608.
- ^ Pratt, William K. (1984). Advances in Electronics and Electron Physics: Supplement. Academic Press. p. 158. ISBN 9780120145720.
A significant advance in image coding methodology occurred with the introduction of the concept of hybrid transform/DPCM coding (Habibi, 1974).
- ^ Ohm, Jens-Rainer (2015). Multimedia Signal Coding and Transmission. Springer. p. 364. ISBN 9783662466919.
- ^ a b Roese, John A.; Robinson, Guner S. (October 30, 1975). Tescher, Andrew G. (ed.). "Combined Spatial And Temporal Coding Of Digital Image Sequences". Efficient Transmission of Pictorial Information. 0066. International Society for Optics and Photonics: 172–181. Bibcode:1975SPIE...66..172R. doi:10.1117/12.965361.
- ^ Huang, T. S. (1981). Image Sequence Analysis. Springer Science & Business Media. p. 29. ISBN 9783642870378.
- ^ a b c Stanković, Radomir S.; Astola, Jaakko T. (2012). "Reminiscences of the Early Work in DCT: Interview with K.R. Rao" (PDF). Reprints from the Early Days of Information Sciences. 60. Retrieved October 13, 2019.
- ^ Chen, Wen-Hsiung; Smith, C. H.; Fralick, S. C. (September 1977). "A Fast Computational Algorithm for the Discrete Cosine Transform". IEEE Transactions on Communications. 25 (9): 1004–1009. doi:10.1109/TCOM.1977.1093941.
- ^ "T.81 – Digital compression and coding of continuous-tone still images – Requirements and guidelines" (PDF). CCITT. September 1992. Retrieved July 12, 2019.
- ^ Cianci, Philip J. (2014). High Definition Television: The Creation, Development and Implementation of HDTV Technology. McFarland. p. 63. ISBN 9780786487974.
- ^ a b c Li, Jian Ping (2006). Proceedings of the International Computer Conference 2006 on Wavelet Active Media Technology and Information Processing: Chongqing, China, 29-31 August 2006. World Scientific. p. 847. ISBN 9789812709998.
- ^ a b c d e f g "The History of Video File Formats Infographic". RealNetworks. April 22, 2012. Retrieved August 5, 2019.
- ^ a b "ITU-T Recommendation declared patent(s)". ITU. Retrieved July 12, 2019.
- ^ a b "MPEG-2 Patent List" (PDF). MPEG LA. Archived from the original (PDF) on May 29, 2019. Retrieved July 7, 2019.
- ^ Shishikui, Yoshiaki; Nakanishi, Hiroshi; Imaizumi, Hiroyuki (1994). "An HDTV Coding Scheme using Adaptive-Dimension DCT". Signal Processing of HDTV. pp. 611–618. doi:10.1016/B978-0-444-81844-7.50072-3. ISBN 978-0-444-81844-7.
- ^ a b c "MPEG-4 Visual - Patent List" (PDF). MPEG LA. Archived from the original (PDF) on July 6, 2019. Retrieved July 6, 2019.
- ^ a b c "Video Developer Report 2019" (PDF). Bitmovin. 2019. Retrieved November 5, 2019.
- ^ a b "AVC/H.264 – Patent List" (PDF). MPEG LA. Archived from the original (PDF) on January 25, 2023. Retrieved July 6, 2019.
- ^ Wang, Hanli; Kwong, S.; Kok, C. (2006). "Efficient prediction algorithm of integer DCT coefficients for H.264/AVC optimization". IEEE Transactions on Circuits and Systems for Video Technology. 16 (4): 547–552. Bibcode:2006ITCSV..16..547W. doi:10.1109/TCSVT.2006.871390.
- ^ "Digital Video Broadcasting (DVB); Specification for the use of video and audio coding in DVB services delivered directly over IP" (PDF).
- ^ "World, Meet Thor – a Project to Hammer Out a Royalty Free Video Codec". August 11, 2015.
- ^ Thomson, Gavin; Shah, Athar (2017). "Introducing HEIF and HEVC" (PDF). Apple Inc. Retrieved August 5, 2019.
- ^ a b "HEVC Patent List" (PDF). MPEG LA. Archived from the original (PDF) on April 10, 2021. Retrieved July 6, 2019.
- ^ ISO. "Home". International Standards Organization. ISO. Retrieved August 3, 2022.
- ^ "ISO Standards and Patents". ISO. Retrieved July 10, 2019.
- ^ Davis, Andrew (June 13, 1997). "The H.320 Recommendation Overview". EE Times. Retrieved November 7, 2019.
- ^ Li Ding; Takaya, K. (1997). "H.263 based facial image compression for low bitrate communications". IEEE WESCANEX 97 Communications, Power and Computing. Conference Proceedings. pp. 30–34. doi:10.1109/WESCAN.1997.627108. ISBN 0-7803-4147-3. p. 30:
H.263 is similar to, but more complex than H.261. It is currently the most widely used international video compression standard for video telephony on ISDN (Integrated Services Digital Network) telephone lines.
- ^ "Motion JPEG 2000 Part 3". Joint Photographic Experts Group, JPEG, and Joint Bi-level Image experts Group, JBIG. Archived from the original on October 5, 2012. Retrieved June 21, 2014.
- ^ Taubman, David; Marcellin, Michael (2012). JPEG2000 Image Compression Fundamentals, Standards and Practice: Image Compression Fundamentals, Standards and Practice. Springer Science & Business Media. ISBN 9781461507994.
- ^ Swartz, Charles S. (2005). Understanding Digital Cinema: A Professional Handbook. Taylor & Francis. p. 147. ISBN 9780240806174.
- ^ "VC-1 Patent List" (PDF). MPEG LA. Archived from the original (PDF) on July 6, 2019. Retrieved July 11, 2019.
- ^ "HEVC Advance Patent List". HEVC Advance. Archived from the original on August 24, 2020. Retrieved July 6, 2019.
- ^ Filippov, Alexey; Norkin, Aney; Alvarez, José Roberto (April 2020). "RFC 8761 - Video Codec Requirements and Evaluation Methodology". datatracker.ietf.org. Retrieved February 10, 2022.
- ^ Bhojani, D.R. "4.1 Video Compression" (PDF). Hypothesis. Archived from the original (PDF) on May 10, 2013. Retrieved March 6, 2013.
- ^ Jaiswal, R.C. (2009). Audio-Video Engineering. Pune, Maharashtra: Nirali Prakashan. p. 3.55. ISBN 9788190639675.
- ^ "WebCodecs". www.w3.org. Retrieved February 10, 2022.
- ^ a b Jan Ozer. "Encoding options for H.264 video". Adobe.com. Retrieved January 6, 2015.
Video coding format
Definitions and Fundamentals
Distinction between format, codec, and container
A video coding format defines the syntax, semantics, and decoding processes for representing compressed video data in a bitstream, enabling interoperability across encoding and playback systems. It outlines the structure of the encoded video stream, including how frames, parameters, and compression artifacts are organized to minimize data size while preserving visual quality. For instance, standards like ITU-T H.264 specify the bitstream format and a conformant decoding process that reconstructs video from the compressed data, ensuring that any compliant decoder can render the content accurately.
In contrast, a codec refers to the specific software or hardware implementation that encodes raw video into a given coding format's bitstream or decodes it back to playable frames. Codecs handle the algorithmic compression and decompression tasks, such as applying transforms and quantization, but adhere strictly to the rules of their associated format. Software codecs, like the FFmpeg library, provide versatile encoding tools across multiple formats in open-source environments, while hardware codecs, such as application-specific integrated circuit (ASIC) chips in media processors, optimize real-time processing for formats like H.265 in devices like smartphones and set-top boxes. A notable example is x264, an open-source software codec that implements the H.264 format for efficient encoding in applications ranging from streaming to archiving.[6]
A container format serves as a wrapper that multiplexes the coded video bitstream with audio tracks, subtitles, chapters, and metadata into a cohesive file, without modifying the underlying compression. It defines synchronization, timing, and packaging rules to facilitate storage, transmission, and playback of multimedia content. Unlike coding formats or codecs, containers are agnostic to the video compression method and can hold streams from various codecs; for example, the MP4 container (based on the ISO Base Media File Format) commonly encapsulates H.264 video and AAC audio, while the Matroska (MKV) container supports flexible combinations like H.265 video with multiple subtitle tracks for advanced home theater use. The distinction ensures that the raw video bitstream remains separate from delivery logistics, allowing remuxing into different containers without re-encoding.[7][8]
Historically, the terminology has evolved with standardization efforts, where "format" and "standard" were frequently used interchangeably in early literature to describe the specifications from bodies like ITU-T and ISO/IEC MPEG, reflecting the interchangeable roles in defining bitstream rules during the development of initial codecs like H.261 in 1990. This overlap persists in some contexts, but modern usage clarifies "format" as the abstract specification, "codec" as its practical realization, and "container" as the multimedia packaging layer introduced prominently with formats like QuickTime and AVI in the 1990s.[9]
Compression types: lossless, lossy, and uncompressed
Uncompressed video represents raw pixel data without any form of data reduction, preserving every detail of the original footage at the highest possible quality but requiring substantial storage and bandwidth resources. Common formats include RGB, which stores full color information for each pixel, and YUV variants such as YUV 4:2:0, where chroma (color) information is subsampled to reduce data while maintaining luma (brightness) at full resolution.[10][11] For instance, professional cinema workflows often employ uncompressed 4K formats to ensure fidelity during production and post-processing.[12] A typical uncompressed bitrate for 1080p at 60 frames per second in YUV 4:2:0 8-bit format is approximately 1.5 Gbps, highlighting the immense data demands.[13]
Lossless compression achieves reversible data reduction by exploiting statistical redundancies in the video signal, ensuring the original data can be perfectly reconstructed without any loss of information. This is typically accomplished through entropy coding techniques, such as Huffman coding, which assigns shorter codes to more frequent symbols, or arithmetic coding, which encodes the entire message as a single fractional number for higher efficiency. Examples include the lossless mode in H.264 (also known as AVC), which supports bit-exact reconstruction within its High 4:4:4 Predictive profile, and the FFV1 codec, designed specifically for archival purposes with intra-frame coding.[14][15] Compression ratios for lossless video are generally modest, often around 2:1, though they can reach up to 5:1 depending on content complexity.[16]
Lossy compression, in contrast, involves irreversible removal of data deemed perceptually insignificant, enabling dramatically smaller file sizes at the expense of some quality degradation. It targets perceptual redundancies using psycho-visual models that discard details below human visual perception thresholds, such as subtle color variations or high-frequency spatial details, often informed by properties like contrast sensitivity and visual masking.[17] This approach allows for compression ratios up to 1000:1 in severe cases, though typical ratios for acceptable quality range from 50:1 to 100:1.[18] Quality in lossy compression is commonly assessed using metrics like Peak Signal-to-Noise Ratio (PSNR), which quantifies the difference between original and compressed signals in decibels, with higher values indicating better fidelity.[19]
The choice among these compression types involves key trade-offs in quality, efficiency, and application. Uncompressed video is ideal for editing and mastering in professional environments, where absolute fidelity is paramount despite the high bandwidth needs.[20] Lossless compression suits archiving and preservation, as seen with FFV1 in institutional workflows, balancing perfect reconstruction with moderate size reduction.[21] Lossy methods dominate streaming and distribution, prioritizing bandwidth savings for consumer delivery while relying on PSNR or similar metrics to ensure perceptual quality.[22]
Core Coding Techniques
Intra-frame coding
Intra-frame coding, also known as intra-picture coding, compresses individual video frames independently by exploiting spatial redundancies within the frame, treating it as a standalone still image without reference to other frames. This approach forms the foundation for random access points in video streams, allowing decoding to begin at any intra-coded frame (I-frame), and supports error recovery by isolating corruption to a single frame.[23]
Key techniques in intra-frame coding include spatial prediction, transform coding, quantization, and entropy coding. Spatial prediction estimates pixel values in a block based on neighboring pixels already decoded within the same frame, using directional modes to capture edges and textures; for example, H.264/AVC defines nine intra prediction modes for 4×4 luma blocks, including vertical, horizontal, DC, and several diagonal directions. Transform coding applies a frequency-domain transform, typically a discrete cosine transform (DCT) or its integer approximation, to the prediction residual to concentrate energy in low-frequency coefficients. Quantization then discards less perceptible high-frequency details by scaling coefficients with a quantization parameter, while entropy coding, such as context-adaptive variable-length coding (CAVLC) or arithmetic coding, efficiently represents the quantized data using codes tailored to the symbol probability distributions.[23][3]
The detailed process begins with block-based partitioning of the frame into fixed or adaptive sizes, such as 8×8 blocks in MPEG-1 or 16×16 macroblocks with prediction blocks as small as 4×4 in H.264/AVC, to handle varying content complexity. For each block, a predictor is generated from adjacent pixels, and the residual (difference between original and predicted block) is computed.
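The prediction and residual steps described above can be sketched for a single 4×4 block. This is a simplified illustration covering only three of the nine H.264/AVC 4×4 modes (vertical, horizontal, DC). The function names are hypothetical, and the mode is chosen by sum of absolute differences (SAD), a stand-in assumption for the rate-distortion cost a real encoder would typically use.

```python
def predict_4x4(top, left, mode):
    """Build a 4x4 prediction from already-decoded neighbouring pixels.

    top:  the 4 pixels directly above the block
    left: the 4 pixels directly to the left of the block
    """
    if mode == "vertical":      # every row repeats the pixels above
        return [list(top) for _ in range(4)]
    if mode == "horizontal":    # every column repeats the pixels to the left
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":            # flat block at the rounded mean of 8 neighbours
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unknown mode: {mode}")

def residual_and_sad(block, pred):
    """Return the residual block and its sum of absolute differences."""
    residual = [[block[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
    sad = sum(abs(v) for row in residual for v in row)
    return residual, sad

def best_intra_mode(block, top, left, modes=("vertical", "horizontal", "dc")):
    """Pick the mode whose prediction leaves the smallest residual (by SAD)."""
    best = None
    for mode in modes:
        residual, sad = residual_and_sad(block, predict_4x4(top, left, mode))
        if best is None or sad < best[1]:
            best = (mode, sad, residual)
    return best

# A block of vertical stripes is predicted exactly from the row above it.
top = [10, 50, 90, 130]
left = [10, 10, 10, 10]
block = [list(top) for _ in range(4)]
mode, sad, residual = best_intra_mode(block, top, left)
print(mode, sad)
```

Only the chosen mode and the residual need to be coded; for the striped block in this example the residual is all zeros.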
The residual then undergoes a DCT, for instance an 8×8 DCT in MPEG-1 (typically implemented using fixed-point approximations), to decorrelate spatial data, followed by quantization to reduce precision, and finally entropy coding to compress the coefficient stream. In H.264/AVC, the process evolves with adaptive block sizes and additional DC coefficient transforms for larger blocks to further minimize residuals in smooth regions.[24][23]
Intra-frame coding enables frame-level editing and splicing in post-production, as each I-frame is self-contained, and provides robustness to packet loss in transmission by limiting error propagation to one frame. However, it requires higher bitrates compared to inter-frame methods due to the absence of temporal redundancy exploitation, often comprising 20-50% more bits per frame in hybrid codecs. In lossless compression modes, intra-frame techniques can be adapted by disabling quantization to preserve all data, though this increases file sizes significantly. Specific implementations include the 8×8 DCT for intra-coding in MPEG-1, which processes luminance and chrominance blocks separately, and the shift to adaptive partitioning in later standards like H.264/AVC, allowing 4×4 blocks for detailed textures and 16×16 for uniform areas to optimize compression efficiency.[23][24][3]
Inter-frame coding and motion compensation
Inter-frame coding exploits temporal redundancy in video sequences by predicting the content of a current frame based on one or more previously encoded reference frames, encoding only the residual differences to achieve significant bitrate reduction, often 50-90% compared to intra-frame coding alone.[25] This approach forms the basis of predictive coding in standards like H.261 and later codecs, where frames are classified as P-frames (predictive, using forward prediction from a prior reference) or B-frames (bi-predictive, using both forward and backward references for enhanced efficiency).[26] The residuals from this prediction are then compressed, typically using intra-frame techniques on the difference signal.[27] Motion compensation is the core mechanism enabling this prediction, involving the estimation of motion vectors that describe 2D displacements between blocks in the current frame and corresponding regions in reference frames, with vector precision ranging from full-pixel in early standards to sub-pixel levels for better accuracy.[26] In block-based motion estimation, the frame is divided into macroblocks (e.g., 16×16 pixels for luminance in H.261), and for each block, a matching block in the reference frame is found within a defined search window, often ±15 pixels, using metrics like mean absolute difference (MAD) to minimize prediction error.[27] Seminal block-matching techniques, introduced in 1981, perform this by exhaustively comparing candidate positions or using faster approximations to reduce computational demands.[28] Common motion estimation algorithms include full search (exhaustive block matching over the entire search range), which provides optimal results but high complexity—often exceeding 80% of total encoding computation and up to 10^9 operations per frame for typical resolutions due to evaluating hundreds of candidates per block.[29] Faster alternatives like the diamond search algorithm, proposed in 2000, use predefined large and small 
diamond-shaped patterns to iteratively refine the motion vector, significantly lowering complexity while maintaining near-optimal performance in many scenarios. Sub-pixel accuracy, such as 1/4-pixel interpolation in H.264/AVC, further refines these vectors by applying filters to reference frames, improving prediction quality at the cost of added processing.[30]
The overall process unfolds in key steps: first, motion estimation identifies the best-matching block and vector; second, motion compensation generates the predicted block by shifting the reference block according to the vector; third, the residual (difference between actual and predicted blocks) is computed and encoded; and finally, loop filtering is applied to the reconstructed reference frames to minimize error propagation or drift across the sequence.[31] These elements are organized within a group of pictures (GOP), a structural unit in standards like MPEG-1 and H.264 that sequences I-frames (intra-coded anchors), P-frames, and B-frames (e.g., IBBPBBP pattern) to balance compression efficiency and random access capabilities.[26]
Transform-based compression
Transform-based compression is a fundamental technique in video coding that operates in the frequency domain to achieve data compaction. It begins by applying an orthogonal transform to spatial-domain residuals, obtained from intra-frame or inter-frame prediction, to convert them into frequency coefficients. This transformation concentrates the signal's energy into a small number of low-frequency coefficients, enabling subsequent quantization to discard or coarsely represent high-frequency components with minimal perceptual impact.[32]
The most widely adopted transform is the Discrete Cosine Transform (DCT), particularly the Type-II DCT, which is applied separably to 8×8 blocks in early standards like JPEG for images and MPEG-1 for video. In its orthonormal form, the 1D DCT of N samples is given by:
$$X_k = c_k \sum_{n=0}^{N-1} x_n \cos\left[\frac{(2n+1)k\pi}{2N}\right], \qquad k = 0, 1, \ldots, N-1,$$
where $x_n$ are the input samples, $X_k$ are the transform coefficients, $c_0 = \sqrt{1/N}$, and $c_k = \sqrt{2/N}$ for $k > 0$.
Profiles, Levels, and Extensions
Profiles
In video coding standards, a profile defines a specific subset of the syntax and tools within the overall format, enabling tailored implementations for varying application needs while ensuring interoperability among compliant devices and software.[23] These subsets constrain the use of certain coding features to balance computational complexity, compression efficiency, and compatibility, allowing encoders and decoders to signal adherence to a particular profile for seamless playback across ecosystems.[35]
Key aspects of profiles include the selective inclusion or exclusion of advanced tools, such as bi-predictive B-frames for improved temporal prediction, context-adaptive binary arithmetic coding (CABAC) for enhanced entropy efficiency, or 8×8 integer transforms for better handling of high-frequency details.[35] Profiles are signaled in the bitstream via parameters like profile_idc in the sequence parameter set, which indicates the active profile, together with constraint flags that signal compatibility with other profiles.[35] This signaling ensures decoders can verify and apply only the necessary decoding processes, reducing overhead in resource-limited environments.
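As an illustration of this signaling, the sketch below reads profile_idc and the constraint-flag byte from the start of an H.264 sequence parameter set (SPS) NAL unit and maps the value to a profile name. The helper names are hypothetical and the table lists only a few common profile_idc values. The code assumes the NAL unit is passed without its start code, and it reads only the two bytes that directly follow the one-byte NAL header.

```python
# A few common H.264 profile_idc values (not an exhaustive list).
H264_PROFILE_IDC = {
    66: "Baseline",
    77: "Main",
    88: "Extended",
    100: "High",
    110: "High 10",
    122: "High 4:2:2",
    244: "High 4:4:4 Predictive",
}

def parse_sps_profile(nal_unit: bytes):
    """Return (profile_idc, constraint_flags) from an H.264 SPS NAL unit."""
    if len(nal_unit) < 3:
        raise ValueError("NAL unit too short to hold an SPS header")
    nal_unit_type = nal_unit[0] & 0x1F   # low 5 bits of the NAL header
    if nal_unit_type != 7:               # type 7 is a sequence parameter set
        raise ValueError(f"not an SPS (nal_unit_type={nal_unit_type})")
    return nal_unit[1], nal_unit[2]

def describe_profile(profile_idc, constraint_flags=0):
    """Map profile_idc (plus constraint_set1_flag) to a readable name."""
    constraint_set1 = bool(constraint_flags & 0x40)
    if profile_idc == 66 and constraint_set1:
        return "Constrained Baseline"
    return H264_PROFILE_IDC.get(profile_idc, f"other (profile_idc={profile_idc})")

# Two illustrative SPS prefixes (header byte, profile_idc, flags, next byte).
for hex_prefix in ("6764001f", "6742c01e"):
    idc, flags = parse_sps_profile(bytes.fromhex(hex_prefix))
    print(hex_prefix, "->", describe_profile(idc, flags))
```

A player could compare the parsed value against the set of profiles its decoder implements before attempting playback.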
In H.264/AVC, the Baseline Profile omits B-frames and CABAC to support low-latency applications like real-time video conferencing on mobile devices, prioritizing simplicity over maximum efficiency.[35] The Main Profile adds B-frames, CABAC, and weighted prediction for broadcast and storage use cases, while the High Profile further adds 8×8 transforms and custom quantization scaling matrices to achieve higher quality for high-definition content, such as Blu-ray discs.[35] Similarly, in HEVC (H.265), the Main Profile targets 8-bit 4:2:0 video for standard dynamic range applications up to 4K resolution, whereas the Main 10 Profile supports 10-bit depths and 4:2:0 chroma for high dynamic range (HDR) content with wider color gamuts.[36]
The primary purpose of profiles is to facilitate interoperability across diverse devices, from low-power mobiles requiring minimal features to high-end broadcast systems handling complex tools, thus enabling widespread adoption without universal decoder over-specification.[23] This approach evolved from earlier standards like MPEG-2, which introduced multiple profiles (e.g., Simple and Main) but emphasized a dominant Main Profile for general use, to the more granular structure in H.264/AVC, where higher profiles largely build on the tools of lower ones.
Profiles directly influence decoder requirements, with advanced ones like H.264 High Profile demanding greater computational resources (such as increased processing cycles for CABAC and larger reference frame buffers) compared to Baseline, potentially requiring up to 50% more operations per macroblock in some implementations.[23] In general, a decoder conforming to a higher profile is also expected to decode bitstreams of the lower profiles it subsumes (for example, a High Profile decoder handles Main and Constrained Baseline streams), ensuring gradual deployment in mixed environments.[35]
Levels
In video coding standards such as H.264/AVC and HEVC (H.265), levels define a set of constraints on key operational parameters, including maximum macroblock processing rate, bitrate, picture size, and frame buffer requirements, to ensure compatibility with specific classes of decoder hardware and software implementations. These constraints cap computational demands and data throughput, allowing encoders to produce bitstreams tailored to target devices ranging from low-power mobiles to high-end broadcast systems.[37]
The primary purpose of levels is to prevent decoder overload by limiting factors like hypothetical reference decoder (HRD) buffer sizes and decoding processing rates, thereby guaranteeing real-time performance without excessive memory or processing power. This facilitates device certification and interoperability; for instance, the Blu-ray Disc specification requires players to decode H.264 High Profile up to Level 4.1, which covers 1080p playback at video bitrates up to 40 Mbps. Levels also enable standardized testing of decoder conformance, ensuring that compliant devices can handle bitstreams up to the specified limits without failure.
Key parameters vary by standard and level but typically include maximum luma picture size in samples, maximum bitrate for the video coding layer (VCL), and maximum macroblocks or luma samples per second. In H.264/AVC, Level 3.1 supports up to 108,000 macroblocks per second and a maximum VCL bitrate of 14 Mbps (for Baseline, Main, and Extended profiles), enabling resolutions such as 1280×720 at 30 fps or 720×576 at roughly 66 fps; 1920×1080 at 30 fps requires Level 4. Similarly, in HEVC, Level 4 allows a maximum luma sample rate of 66,846,720 samples per second and a maximum bitrate of 12 Mbps (Main tier), accommodating resolutions such as 1920×1080 at 30 fps for high-definition content. These limits are tabulated in Annex A of each standard, balancing compression efficiency with practical implementation constraints.
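The macroblock-rate arithmetic behind these limits can be sketched as follows. The table holds the maximum macroblock rate (MaxMBPS) and maximum frame size in macroblocks (MaxFS) for only three H.264 levels, and the function names are hypothetical. A real conformance check would also cover bitrate, decoded picture buffer size, and other Annex A constraints, which this sketch deliberately leaves out.

```python
import math

# (MaxMBPS, MaxFS) for a few H.264 levels: macroblocks per second and
# macroblocks per frame. Only a subset of the levels is listed here.
H264_LEVEL_LIMITS = {
    "1":   (1_485, 99),
    "3.1": (108_000, 3_600),
    "4":   (245_760, 8_192),
}

def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks needed to cover the picture."""
    return math.ceil(width / 16) * math.ceil(height / 16)

def fits_level(width, height, fps, level):
    """Check the frame-size and macroblock-rate limits of one level."""
    max_mbps, max_fs = H264_LEVEL_LIMITS[level]
    frame_mbs = macroblocks_per_frame(width, height)
    return frame_mbs <= max_fs and frame_mbs * fps <= max_mbps

def lowest_listed_level(width, height, fps):
    """Return the lowest level in the table that fits, or None."""
    for level in ("1", "3.1", "4"):
        if fits_level(width, height, fps, level):
            return level
    return None

for w, h, fps in ((176, 144, 15), (1280, 720, 30), (1280, 720, 60), (1920, 1080, 30)):
    print(f"{w}x{h}@{fps}: level {lowest_listed_level(w, h, fps)}")
```

For example, 1280×720 is 80×45 = 3,600 macroblocks per frame, so 30 fps consumes exactly the 108,000 macroblocks per second that Level 3.1 allows.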
Levels are signaled in the bitstream via the level_idc syntax element, an 8-bit code in the sequence parameter set (SPS) for H.264 or in the general profile, tier, and level structure for HEVC, which decoders use to verify compliance and allocate resources accordingly.[37] For example, level_idc=31 indicates Level 3.1 in H.264, while HEVC uses a general_level_idc value equal to 30 times the level number (e.g., 120 for Level 4).
Illustrative examples span device capabilities: H.264 Level 1 targets mobile applications with QCIF (176×144) resolution at 15 fps and 64 kbps bitrate, suitable for low-bandwidth wireless networks. At the high end, HEVC Level 6.2 supports 8K (7680×4320) at 120 fps with bitrates up to 800 Mbps (High tier), addressing cinema and professional broadcast needs.
Levels are inherently hierarchical, with each higher level relaxing the limits of the levels below it, so a decoder certified for a given level can process any lower-level bitstream without modification. The same level numbering applies across profiles, though bitrate and sample limits may differ by profile (e.g., higher bitrates are permitted in H.264 High Profile than in Baseline at the same level). This design promotes broad interoperability while permitting profile-specific optimizations.[37]
Extensions and scalability features
Extensions in video coding formats refer to optional add-ons that enhance the base standard's capabilities, such as multiview video coding (MVC) in H.264 for 3D video applications, scalable video coding (SVC) in H.264 for layered adaptability, and range extensions (RExt) in HEVC for higher bit depths (up to 16 bits) and chroma formats such as 4:2:2 and 4:4:4.[38] These extensions build upon core profiles and levels to enable specialized use cases without altering the fundamental decoding process for compatible base layers. For instance, MVC, defined in Annex H of H.264, allows efficient coding of multiple views by adding inter-view (disparity-compensated) prediction on top of a base view.
Scalability features in video coding involve layered bitstream structures that permit extraction of subsets for adaptation to varying network conditions or device capabilities, supporting spatial, temporal, and quality (signal-to-noise ratio or SNR) scalability. Spatial scalability encodes layers at different resolutions, using inter-layer prediction to reference lower-resolution base layers for efficient enhancement, enabling bitstream extraction to match display sizes. Temporal scalability adjusts frame rates by structuring layers hierarchically, often with hierarchical B-frames where lower layers provide reference frames at reduced rates (e.g., every other frame), allowing decoders to drop higher layers for lower temporal resolution without re-encoding.
Quality scalability refines SNR through progressive refinement layers, where base layers offer basic fidelity and enhancement layers add detail, facilitating graceful degradation in bandwidth-limited scenarios.[39] Key techniques for scalability include hierarchical B-frames for temporal layering, which organize bi-predictive frames in a pyramid structure to minimize drift between layers, and inter-layer prediction mechanisms that reuse motion data or textures from base to enhancement layers to reduce redundancy. In H.264 SVC, medium-grained scalability (MGS) provides finer SNR control than coarse-grained scalability (CGS) by allowing partial layer extraction at the slice or macroblock level, while fine-grained scalability (FGS), inherited from earlier MPEG-4 standards, enables bit-by-bit refinement for even more precise rate adaptation, though at a slight efficiency cost. For HEVC's scalable extension (SHVC), these techniques extend to support up to 8K resolution in layered configurations, with inter-layer syntax elements ensuring compatibility across layers. Scalability introduces a moderate complexity overhead, typically requiring 20-50% more encoding time due to additional prediction modes, but enables bitstream extraction without full re-decoding.[40][41][42][39] Applications of these extensions and scalability features are prominent in adaptive streaming protocols like Dynamic Adaptive Streaming over HTTP (DASH), where SVC or SHVC layers allow servers to deliver a single encoded stream that clients can truncate based on available bandwidth, reducing storage needs and latency. In error-prone networks, such as mobile or wireless environments, temporal and quality scalability supports unequal error protection, prioritizing base layers for robustness while enhancement layers tolerate packet loss. 
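The temporal layering and bitstream extraction described above can be sketched for a dyadic hierarchy. The sketch assumes a GOP size that is a power of two and frames numbered in display order. The function names are hypothetical, and the model ignores the actual bitstream syntax used by SVC or SHVC.

```python
def temporal_id(frame_index, gop_size=8):
    """Temporal layer of a frame in a dyadic hierarchy.

    Layer 0 holds one frame per GOP; each further layer doubles the frame
    rate, and a frame only references frames in the same or lower layers.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    pos = frame_index % gop_size
    if pos == 0:
        return 0
    levels = gop_size.bit_length() - 1              # log2(gop_size)
    trailing_zeros = (pos & -pos).bit_length() - 1  # largest power of 2 dividing pos
    return levels - trailing_zeros

def extract_substream(frame_indices, max_tid, gop_size=8):
    """Keep only the frames whose temporal layer is at most max_tid."""
    return [i for i in frame_indices if temporal_id(i, gop_size) <= max_tid]

frames = list(range(17))
print([temporal_id(i) for i in frames[:9]])
for max_tid in (3, 2, 1, 0):
    kept = extract_substream(frames, max_tid)
    print(f"layers 0..{max_tid}: {len(kept)} frames ->", kept)
```

Dropping the highest layer in this example removes every odd-numbered frame, roughly halving the frame rate without touching the remaining frames.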
H.264 SVC, for example, has been deployed for mobile video adaptation, enabling seamless switching between low and high frame rates over fluctuating connections.[43][44][45]
Historical Development
Early innovations: analog to digital transition
Analog video systems, such as the NTSC and PAL standards, relied on continuous electrical signals to represent luminance and chrominance information without any form of data compression. These standards transmitted composite video signals over bandwidth-limited channels, typically 6 MHz for terrestrial broadcast, which constrained horizontal resolution to approximately 330-400 TV lines and introduced artifacts like cross-color due to inseparable luma and chroma components.[46] The absence of compression meant that analog video was highly susceptible to noise accumulation during transmission and storage, degrading quality over distance or repeated copying, as seen in formats like VHS tapes.[46] The shift from analog to digital video was driven by the need for improved storage reliability and transmission efficiency, particularly as consumer and professional demands grew for higher-quality media like the transition from VHS analog tapes to DVD optical discs, which enabled error-corrected digital playback. Early digital formats, such as the D1 standard introduced in 1986, provided uncompressed standard-definition component video at bitrates around 173 Mbps, facilitating studio-grade recording without generational loss but requiring substantial bandwidth.[47] This transition was further motivated by digital signals' ability to regenerate without noise buildup, supporting efficient multiplexing over communication lines and paving the way for compressed formats to reduce storage and bandwidth needs.[48] Initial digital techniques began with Pulse Code Modulation (PCM) to sample and quantize analog signals into binary data, typically requiring high bitrates like 70 Mbps for 5 MHz video bandwidth due to its direct representation without prediction. 
To address PCM's inefficiency, Differential PCM (DPCM) emerged in the late 1970s and 1980s, encoding differences between adjacent samples to reduce redundancy and lower bitrates by up to 18 Mbps while improving signal-to-noise ratios by 14 dB in video applications.[49] ITU studies in the 1980s, through the CCIR (now ITU-R), explored these methods for component video, culminating in the 1982 adoption of Recommendation BT.601, which standardized sampling at 13.5 MHz for luminance and 6.75 MHz for color-difference signals to accommodate both 525/60 and 625/50 systems.[48]
A pivotal milestone was the 1984 ITU-T Recommendation H.120, the first international digital video coding standard for videoconferencing at primary digital rates of 1.544 Mbps (NTSC) and 2.048 Mbps (PAL), employing DPCM with conditional replenishment and scalar quantization for basic compression. Still-image compression efforts, such as the DCT-based approach later formalized in the JPEG standard (work beginning in the late 1980s), served as a precursor by demonstrating effective intra-frame transform techniques that influenced subsequent video coding.[3][50]
This analog-to-digital transition faced challenges including aliasing, where frequencies above half the sampling rate folded into lower frequencies, potentially distorting images if anti-aliasing filters were inadequate, and quantization noise from rounding continuous amplitudes to discrete levels, introducing granular errors that reduced perceived quality in early low-bit-depth systems.[51] These issues were mitigated through higher sampling rates like those in BT.601 and dithering techniques, ensuring digital video maintained fidelity despite the conversion process.[48]
Motion-compensated DCT and initial standards
The motion-compensated discrete cosine transform (MC-DCT) emerged during the 1980s as a breakthrough hybrid video coding technique, integrating temporal prediction through motion compensation with spatial compression via the discrete cosine transform (DCT). This method addressed limitations in earlier intra-frame and simple inter-frame approaches by reducing temporal redundancy more effectively while concentrating energy in fewer DCT coefficients for efficient quantization and entropy coding. Seminal research, such as the 1987 work at AT&T Bell Laboratories by H.-M. Hang and J. W. Woods, explored block-based motion tracking to estimate displacements, followed by orthogonal transforms on frame differences, demonstrating substantial bitrate savings for moving images compared to non-compensated coding. At its core, MC-DCT employs block-based motion estimation, where video frames are partitioned into small blocks—typically 16×16 macroblocks in early designs—to compute motion vectors that predict the current block from a reference frame, with the residual difference then transformed using an 8×8 DCT. This hybrid structure became the foundational paradigm for digital video standards by 1990, balancing computational feasibility with high compression ratios suitable for emerging digital networks. The 8×8 DCT block size, proposed by Didier Le Gall for its optimal trade-off between frequency resolution and boundary effects in block transforms, was central to this efficiency, as it concentrated most signal energy in low-frequency coefficients for subsequent quantization. Early MC-DCT systems incorporated conditional replenishment principles for motion handling, selectively coding and transmitting only the compensated residuals in active regions to minimize overhead, though they omitted loop filters to avoid added decoder complexity and delay. 
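The hybrid structure described above (block matching, then a DCT of the residual) can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions: it uses an 8×8 block for both matching and transform, an exhaustive ±4 pixel search with SAD as the cost, a floating-point orthonormal DCT, and a synthetic test frame. All names are hypothetical, and quantization and entropy coding are omitted.

```python
import math

def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the current block at (bx, by) and the reference block
    displaced by (dx, dy)."""
    return sum(
        abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
        for r in range(n)
        for c in range(n)
    )

def full_search(cur, ref, bx, by, n=8, search=4):
    """Exhaustively test every displacement in a +/-search window."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best

def dct_1d(values):
    """Orthonormal Type-II DCT of a sequence."""
    n = len(values)
    return [
        math.sqrt((1 if k == 0 else 2) / n) * sum(
            v * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for i, v in enumerate(values)
        )
        for k in range(n)
    ]

def dct_2d(block):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# Synthetic reference frame; the current frame is the reference shifted
# right by 2 and down by 1 pixel, and brightened by 3.
SIZE = 24
ref = [[(37 * x + 101 * y + 7 * x * y) % 256 for x in range(SIZE)] for y in range(SIZE)]
cur = [[ref[max(y - 1, 0)][max(x - 2, 0)] + 3 for x in range(SIZE)] for y in range(SIZE)]

bx = by = 8
dx, dy, cost = full_search(cur, ref, bx, by)
residual = [[cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c] for c in range(8)]
            for r in range(8)]
coeffs = dct_2d(residual)
print("motion vector:", (dx, dy), "SAD:", cost)
print("DC coefficient:", round(coeffs[0][0], 3))
```

Because the motion-compensated residual in this example is a constant brightness offset, the DCT packs it into the single DC coefficient, which shows the energy compaction that makes the hybrid scheme effective.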
An approximate bitrate target for such systems can be estimated as frame rate × resolution (in pixels) × bits per pixel / compression ratio, providing a practical guideline for deployment at constrained rates like 1 Mbps.
The initial standardization of MC-DCT appeared in ITU-T Recommendation H.261, ratified in December 1990, which specified a codec for audiovisual services at p×64 kbps over ISDN lines, primarily for video telephony and conferencing. H.261 utilized 16×16 macroblocks for luminance motion compensation (with 8×8 blocks for chrominance), applying the 8×8 DCT to residuals and supporting resolutions of QCIF (176×144 pixels) and CIF (352×288 pixels) at up to 30 frames per second. Building on this, the MPEG-1 standard (ISO/IEC 11172), finalized in 1993, adapted the MC-DCT framework for consumer storage applications like Video CD, achieving approximately 1.5 Mbps for progressive SIF (352×240 or 352×288) video at 30 or 25 Hz. MPEG-1 retained H.261's core tools, including block motion compensation and DCT on residuals, but introduced bidirectionally predicted B-frames to enhance efficiency for non-real-time playback.
Modern advancements post-2010
Following the standardization of Advanced Video Coding (AVC/H.264) in 2003, High Efficiency Video Coding (HEVC/H.265) emerged in 2013 as a major advancement, achieving approximately 50% bitrate reduction compared to AVC for equivalent video quality through enhanced block partitioning and larger coding units (CUs) supporting sizes up to 64×64 pixels. HEVC introduced tools like Sample Adaptive Offset (SAO) filtering to reduce banding artifacts and improve reconstruction quality by adaptively offsetting pixel values based on local statistics.[52] These innovations built on the hybrid coding framework but optimized it for higher resolutions, such as 4K, by employing more flexible prediction structures and improved transform coding.[53]
Subsequent standards further pushed efficiency boundaries. Versatile Video Coding (VVC/H.266), finalized in 2020 by the Joint Video Experts Team (JVET), delivers 30–50% better compression than HEVC, particularly for high-definition and ultra-high-definition content, through larger coding tree units (CTUs) up to 128×128 pixels and advanced motion compensation techniques like affine motion models that handle complex deformations such as rotation and scaling. VVC also incorporates a sophisticated in-loop Adaptive Loop Filter (ALF) based on Wiener filtering, together with Decoder-side Motion Vector Refinement (DMVR), which enhances motion accuracy without additional signaling overhead.
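To illustrate how an affine model differs from a single translational vector, the sketch below derives a motion vector for each position in a block from two control-point vectors at its top-left and top-right corners (a four-parameter model covering translation, rotation, and zoom). It is a floating-point simplification with hypothetical names, not the normative fixed-point derivation of any standard, and the 4×4 sampling grid is an assumption.

```python
def affine_mv(x, y, mv0, mv1, block_width):
    """Motion vector at (x, y), measured from the block's top-left corner.

    mv0 and mv1 are the (horizontal, vertical) control-point vectors at the
    top-left and top-right corners of the block.
    """
    a = (mv1[0] - mv0[0]) / block_width   # zoom-like component
    b = (mv1[1] - mv0[1]) / block_width   # rotation-like component
    return (a * x - b * y + mv0[0], b * x + a * y + mv0[1])

def affine_field(mv0, mv1, block_width, block_height, step=4):
    """Sample the affine motion field on a step x step grid."""
    return {
        (x, y): affine_mv(x, y, mv0, mv1, block_width)
        for y in range(0, block_height, step)
        for x in range(0, block_width, step)
    }

# Equal control points reduce to plain translation; differing ones make the
# vector vary smoothly across the block.
print(affine_mv(8, 8, (3, -1), (3, -1), 16))
field = affine_field((2.0, 0.0), (2.0, 1.0), 16, 16)
for pos in ((0, 0), (12, 0), (0, 12), (12, 12)):
    print(pos, field[pos])
```

A single pair of control-point vectors in this example describes a slowly rotating block that a translational model would need many smaller blocks to approximate.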
Meanwhile, AOMedia Video 1 (AV1), released in 2018 as a royalty-free alternative, provides 20–30% bitrate savings over HEVC while supporting similar tools like extended partition trees and compound prediction modes, making it suitable for internet streaming.[54]
Post-2010 advancements have increasingly integrated machine learning to address rate-distortion optimization (RDO), where neural networks predict optimal coding parameters, such as mode decisions and quantization levels, reducing computational redundancy while maintaining quality; hybrid frameworks of this kind have been reported to outperform traditional RDO by up to 5–10% in bitrate efficiency for specific sequences.[55] Additional tools like template matching in VVC refine motion estimation by comparing reconstructed templates, further improving inter-frame prediction for diverse content. These developments tackle emerging challenges, including support for 8K resolution, high dynamic range (HDR) with wider color gamuts, and 360° immersive video through enhanced geometry handling and projection mapping, though VVC incurs about 30% higher computational complexity than HEVC, primarily in encoding, to achieve these gains.[56][57]
By 2025, AV1 has seen widespread adoption, comprising over 50% of streaming content on platforms like YouTube and more than 95% at Netflix, driven by its open-source nature and hardware acceleration in modern devices.[58] Looking ahead, the JVET is exploring H.267 with a focus on AI-based paradigms, including end-to-end neural codecs that replace traditional block-based processing with learned representations, potentially yielding at least 40% additional efficiency over VVC for machine-analyzed video while adapting to generative content creation.[59]
Major Standards and Comparisons
ITU-T H.26x and MPEG family
The ITU-T H.26x series, developed by the Video Coding Experts Group (VCEG) within ITU-T Study Group 16, forms the foundational lineage of formally standardized, royalty-bearing video coding standards emphasizing block-based hybrid coding with motion compensation and transform techniques. H.261, standardized in 1990, targeted video telephony and conferencing over integrated services digital network (ISDN) lines at bitrates of p×64 kbit/s (where p ranges from 1 to 30), employing 8×8 discrete cosine transform (DCT) blocks, intra/inter-frame prediction, and quantization for efficient compression of CIF (352×288) and QCIF (176×144) resolutions. This standard laid the groundwork for subsequent advancements by integrating motion estimation to exploit temporal redundancies in video sequences.
Building on H.261, H.263 was published in 1996 to optimize low-bitrate coding for videophone and video conferencing, introducing enhancements such as unrestricted motion vectors (allowing estimation beyond picture boundaries), advanced prediction modes (PB-frames for bidirectional coding), and a median filter-based deblocking loop to reduce artifacts. These features delivered approximately a 30% bitrate reduction compared to H.261 at equivalent quality levels, as demonstrated by PSNR gains of about 2 dB at 64 kbit/s for typical sequences.[60]
H.264, also known as Advanced Video Coding (AVC), emerged in 2003 through joint efforts and incorporated context-adaptive binary arithmetic coding (CABAC) for entropy efficiency, in-loop deblocking filters to improve visual quality, and multiple reference frames for motion compensation, enabling up to 50% better compression than prior standards for high-definition content. H.265, or High Efficiency Video Coding (HEVC), followed in 2013 with larger coding tree units (up to 64×64), flexible partitioning, and advanced intra-prediction modes to handle 4K and beyond, achieving roughly 50% bitrate savings over H.264.
Most recently, H.266, termed Versatile Video Coding (VVC), was finalized in 2020 to support emerging applications like 8K video and 360-degree formats, featuring adaptive color space transforms and enhanced affine motion models for greater flexibility and efficiency. Parallel to the H.26x series, the MPEG family from ISO/IEC JTC 1/SC 29/WG 11 has produced complementary standards, often harmonized with ITU-T efforts, sharing the block-based hybrid architecture while bearing royalties managed through patent pools. MPEG-1, completed in 1993 (ISO/IEC 11172), focused on progressive-scan video storage for CD-ROMs at up to 1.5 Mbit/s, supporting VCD applications with similar DCT-based tools to H.261 but optimized for single-pass decoding. MPEG-2, standardized in 1995 (ISO/IEC 13818), extended capabilities for interlaced video in DVD and broadcasting, incorporating scalability profiles (e.g., spatial and SNR) to enable layered transmission, and saw widespread global adoption by the 2000s in Digital Video Broadcasting (DVB) and Advanced Television Systems Committee (ATSC) standards for terrestrial, cable, and satellite delivery.[61] MPEG-4 Part 2 (ISO/IEC 14496-2, 1999) introduced object-based coding for interactive multimedia, allowing independent manipulation of video objects via sprite and global motion compensation. MPEG-4 Part 10, identical to H.264/AVC (ISO/IEC 14496-10), resulted from direct collaboration, while HEVC aligns with H.265 as MPEG-H Part 2 (ISO/IEC 23008-2). These standards exhibit common traits as royalty-bearing technologies, licensed via collective pools such as the MPEG LA (now Via Licensing Alliance) consortium, which aggregates essential patents to facilitate broad implementation while ensuring fair access. 
Key collaborations include the Joint Video Team (JVT), formed in 2001 between VCEG and MPEG to develop H.264/AVC, and the Joint Collaborative Team on Video Coding (JCT-VC), established in 2010 for H.265/HEVC, streamlining efforts across organizations for unified specifications.[62][63] This synergy has driven the evolution from basic telephony to immersive, high-resolution video delivery.
Royalty-free and open-source standards
Royalty-free and open-source video coding standards emerged as alternatives to proprietary formats, prioritizing accessibility for web and streaming applications without licensing fees. These standards, developed through collaborative efforts by organizations like the Xiph.Org Foundation and the Alliance for Open Media (AOMedia), facilitate widespread adoption by providing freely implementable codecs under open-source licenses.[64] Theora, introduced in the 2000s by the Xiph.Org Foundation, serves as an early example of such a standard, derived from On2 Technologies' VP3 codec and integrated into the Ogg container for multimedia streaming.[65] Released in version 1.0 in 2008, Theora supports resolutions up to 4096x2304 and uses a discrete cosine transform (DCT)-based approach for compression, making it suitable for general-purpose video distribution without royalties.[66] Building on this foundation, Google released VP8 in 2010 as part of the WebM project, an open-source initiative to promote royalty-free video on the web.[67] VP8, originally developed by On2 Technologies before its acquisition by Google, employs motion-compensated prediction and entropy coding to achieve efficient compression comparable to H.264, while being licensed under BSD terms with no patent fees.[68] The libvpx library provides the reference open-source implementation for VP8 encoding and decoding. VP9, announced by Google in 2013, extends VP8 with enhancements that deliver approximately 50% better compression efficiency at equivalent quality levels, positioning it as a royalty-free bridge from older formats like H.264 toward higher performance. Key improvements include larger block sizes up to 64x64, advanced intra-prediction modes, and support for 10-bit color depth, all implemented in the updated libvpx codebase. 
Like its predecessor, VP9 incurs no royalties and is optimized for web delivery, with widespread use on platforms like YouTube.[67]
The most advanced royalty-free standard, AV1 (AOMedia Video 1), was finalized in 2018 by AOMedia, a consortium founded in 2015 whose members include major players such as Apple, Netflix, Google, and Intel.[64] AV1 draws on open-source projects including Google's VP9, Xiph.Org's Daala (which emphasized perceptual coding techniques for better visual quality at low bitrates), and Cisco's Thor, and achieves compression efficiency on par with or better than HEVC without any licensing costs.[69] The reference implementation, libaom, is maintained under a BSD-like license, ensuring free availability for developers.[69]
AV1 incorporates advanced features for modern video needs, including support for 10-bit and 12-bit color depths to reduce banding in gradients, as well as HDR formats like HDR10 and HLG.[70] Its film grain synthesis tool denoises source material during encoding and regenerates grain after decoding, preserving artistic intent while improving compressibility for grainy content.[71] Efficiency gains stem from innovations like compound prediction modes, which blend multiple reference frames, and loop restoration filters, which reduce artifacts left by the earlier in-loop filtering stages.[69]
By 2025, AV1 hardware acceleration had become standard in consumer chips, with Intel's Arc GPUs and Core processors supporting both encoding and decoding, and AMD's Ryzen and Radeon series integrating AV1 capabilities for 4K and beyond.[72] Adoption milestones include native support in Google Chrome starting in 2018 and in Microsoft Edge shortly thereafter, enabling efficient web playback.[73] By 2023, AV1 had emerged as a de facto standard for 4K streaming on platforms like Netflix and YouTube, reducing bandwidth demands for high-resolution delivery.
In September 2025, AOMedia announced the impending year-end launch of AV2, aiming to further enhance compression efficiency beyond AV1.[74]
Performance comparisons and adoption trends
Performance comparisons among video coding standards reveal significant advancements in compression efficiency, with each successive generation achieving substantial bitrate reductions at equivalent perceptual quality levels. For instance, High Efficiency Video Coding (HEVC, or H.265) delivers approximately 50% bitrate savings compared to Advanced Video Coding (AVC, or H.264) for 1080p and 4K content, as measured by Bjøntegaard Delta Rate (BD-Rate) in objective tests using PSNR and SSIM metrics.[75] AV1, developed by the Alliance for Open Media, further improves on HEVC with 30-38% bitrate savings for similar resolutions, enabling higher quality streams at lower bandwidths, such as 4K video at bitrates under 10 Mbps.[76] Versatile Video Coding (VVC, or H.266) extends this trend, offering 30-50% savings over HEVC, particularly pronounced at 8K resolutions where gains reach up to 70% in some sequences.[77]

| Comparison | Bitrate savings vs. predecessor (same quality, 1080p/4K) | Representative benchmark (BD-Rate, VMAF/SSIM) |
|---|---|---|
| HEVC vs. AVC | ~50% | 40-50% reduction; VMAF gains of 6-12 points[75] |
| AV1 vs. HEVC | 30-38% | 24-38% BD-Rate; SSIM improvements in MSU tests for 4K[76][78] |
| VVC vs. HEVC | 30-50% | 5-50% BD-Rate; up to 70% at 8K per high-resolution evaluations[77] |
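
The BD-Rate figures above summarize the average bitrate difference between two rate-quality curves over the quality range they share. As a rough illustration of how such a number is obtained, the following Python sketch computes a simplified BD-Rate. It is an approximation under stated assumptions: it uses piecewise-linear interpolation of log-bitrate rather than the cubic polynomial fit of the original Bjøntegaard method, and the function names and sample points are hypothetical, not measurements from the cited tests.

```python
import math


def _interp(x, xs, ys):
    """Linearly interpolate y at x, given strictly increasing xs."""
    x = min(max(x, xs[0]), xs[-1])  # clamp against float rounding at the ends
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("xs must be strictly increasing")


def _mean_log_rate(points, lo, hi, steps=1000):
    """Mean of ln(bitrate) over the quality interval [lo, hi] (trapezoidal rule)."""
    pts = sorted(points, key=lambda p: p[1])  # order the curve by quality
    quals = [q for _, q in pts]
    log_rates = [math.log(r) for r, _ in pts]
    total = 0.0
    prev = _interp(lo, quals, log_rates)
    for i in range(1, steps + 1):
        cur = _interp(lo + (hi - lo) * i / steps, quals, log_rates)
        total += 0.5 * (prev + cur)
        prev = cur
    return total / steps


def bd_rate(anchor, test):
    """Average bitrate change (%) of `test` relative to `anchor` at equal quality.

    Each curve is a list of (bitrate, quality) pairs with distinct quality
    values. A negative result means `test` needs fewer bits than `anchor`.
    """
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical (kbps, quality score) points, not taken from any benchmark.
    older = [(1000, 34.0), (2000, 37.0), (4000, 40.0), (8000, 42.0)]
    newer = [(r / 2, q) for r, q in older]  # same quality at half the bitrate
    print(f"BD-Rate: {bd_rate(older, newer):.1f}%")
```

With these made-up points the newer curve reaches every quality level at half the bitrate, so the result comes out at about -50%, which is how a "~50% savings" entry in the table is usually read. Published BD-Rate values depend on the quality metric, test sequences, and encoder settings, so figures from different studies are not directly comparable.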