H.262/MPEG-2 Part 2
| H.262 / MPEG-2 Part 2 | |
|---|---|
| Information technology – Generic coding of moving pictures and associated audio information: Video | |
| Status | In force |
| Year started | 1995 |
| First published | May 1996 |
| Latest version | ISO/IEC 13818-2:2013 October 2013 |
| Organization | ITU-T, ISO/IEC JTC 1 |
| Committee | ITU-T Study Group 16 VCEG, MPEG |
| Base standards | H.261, MPEG-2 |
| Related standards | H.222.0, H.263, H.264, H.265, H.266, ISO/IEC 14496-2 |
| Predecessor | H.261 |
| Successor | H.263 |
| Domain | Video compression |
| License | Expired patents[1] |
| Website | https://www.itu.int/rec/T-REC-H.262 |
H.262[2] or MPEG-2 Part 2 (formally known as ITU-T Recommendation H.262 and ISO/IEC 13818-2,[3] also known as MPEG-2 Video) is a video coding format standardised and jointly maintained by ITU-T Study Group 16 Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG), and developed with the involvement of many companies. It is the second part of the ISO/IEC MPEG-2 standard. The ITU-T Recommendation H.262 and ISO/IEC 13818-2 documents are identical.
The standard is available for a fee from the ITU-T[2] and ISO. MPEG-2 Video is very similar to MPEG-1, but also provides support for interlaced video (an encoding technique used in analog NTSC, PAL and SECAM television systems). MPEG-2 video is not optimized for low bit-rates (e.g., less than 1 Mbit/s), but somewhat outperforms MPEG-1 at higher bit rates (e.g., 3 Mbit/s and above), although not by a large margin unless the video is interlaced. All standards-conforming MPEG-2 Video decoders are also fully capable of playing back MPEG-1 Video streams.[4]
History
The ISO/IEC approval process was completed in November 1994.[5] The first edition was approved in July 1995[6] and published by ITU-T[2] and ISO/IEC in 1996.[7] Didier LeGall of Bellcore chaired the development of the standard[8] and Sakae Okubo of NTT was the ITU-T coordinator and chaired the agreements on its requirements.[9]
The technology was developed with contributions from a number of companies. Hyundai Electronics (now SK Hynix) developed the first MPEG-2 SAVI (System/Audio/Video) decoder in 1995.[10]
The majority of patents that were later asserted in a patent pool to be essential for implementing the standard came from three companies: Sony (311 patents), Thomson (198 patents) and Mitsubishi Electric (119 patents).[11]
In 1996, it was extended by two amendments to include the registration of copyright identifiers and the 4:2:2 Profile.[2][12] ITU-T published these amendments in 1996 and ISO in 1997.[7]
There are also other amendments published later by ITU-T and ISO/IEC.[2][13] The most recent edition of the standard was published in 2013 and incorporates all prior amendments.[3]
Editions
| Edition | Release date | Latest amendment | ISO/IEC standard | ITU-T Recommendation |
|---|---|---|---|---|
| First edition | 1995 | 2000 | ISO/IEC 13818-2:1996[7] | H.262 (07/95) |
| Second edition | 2000 | 2010[2][14] | ISO/IEC 13818-2:2000[15] | H.262 (02/00) |
| Third edition | 2013 | | ISO/IEC 13818-2:2013[3] | H.262 (02/12), incorporating Amendment 1 (03/13) |
Video coding
Picture sampling
An HDTV camera with 8-bit sampling generates a raw video stream of 25 × 1920 × 1080 × 3 = 155,520,000 bytes per second for 25 frame-per-second video (using the 4:4:4 sampling format). This stream of data must be compressed if digital TV is to fit in the bandwidth of available TV channels and if movies are to fit on DVDs. Video compression is practical because the data in pictures is often redundant in space and time. For example, the sky can be blue across the top of a picture and that blue sky can persist for frame after frame. Also, because of the way the eye works, it is possible to delete or approximate some data from video pictures with little or no noticeable degradation in image quality.
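The arithmetic above can be checked directly with a few lines of Python (an illustrative sketch; the function name is ours, not part of any standard):

```python
# Raw (uncompressed) data rate of 4:4:4 video with one byte per sample.
def raw_bytes_per_second(width, height, fps, samples_per_pixel=3):
    return width * height * samples_per_pixel * fps

# 1080-line HDTV at 25 frames per second, 8-bit 4:4:4 sampling:
print(raw_bytes_per_second(1920, 1080, 25))  # 155520000 bytes/s, as above
```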
A common (and old) trick to reduce the amount of data is to separate each complete "frame" of video into two "fields" upon broadcast/encoding: the "top field", which is the odd numbered horizontal lines, and the "bottom field", which is the even numbered lines. Upon reception/decoding, the two fields are displayed alternately with the lines of one field interleaving between the lines of the previous field; this format is called interlaced video. The typical field rate is 50 (Europe/PAL) or 59.94 (US/NTSC) fields per second, corresponding to 25 (Europe/PAL) or 29.97 (North America/NTSC) whole frames per second. If the video is not interlaced, then it is called progressive scan video and each picture is a complete frame. MPEG-2 supports both options.
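Field splitting and re-interleaving can be sketched as follows (illustrative only; a frame is modelled as a list of scan lines, with line 1 at the top):

```python
# Split a frame into its two fields: lines 1, 3, 5, ... (1-based) form the
# top field and lines 2, 4, 6, ... the bottom field.
def split_fields(frame):
    return frame[0::2], frame[1::2]

# Re-interleave the fields on display, restoring the original line order.
def weave_fields(top, bottom):
    frame = []
    for t, b in zip(top, bottom):
        frame += [t, b]
    return frame

lines = [f"line {n}" for n in range(1, 7)]
top, bottom = split_fields(lines)
assert weave_fields(top, bottom) == lines
```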
Digital television requires that these pictures be digitized so that they can be processed by computer hardware. Each picture element (a pixel) is then represented by one luma number and two chroma numbers. These describe the brightness and the color of the pixel (see YCbCr). Thus, each digitized picture is initially represented by three rectangular arrays of numbers.
Another common practice to reduce the amount of data to be processed is to subsample the two chroma planes (after low-pass filtering to avoid aliasing). This works because the human visual system better resolves details of brightness than details in the hue and saturation of colors. The term 4:2:2 is used for video with the chroma subsampled by a ratio of 2:1 horizontally, and 4:2:0 is used for video with the chroma subsampled by 2:1 both vertically and horizontally. Video that has luma and chroma at the same resolution is called 4:4:4. The MPEG-2 Video document considers all three sampling types, although 4:2:0 is by far the most common for consumer video, and there are no defined "profiles" of MPEG-2 for 4:4:4 video (see below for further discussion of profiles).
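The savings from chroma subsampling are easy to quantify (a sketch; the function name is ours):

```python
# Total samples per frame: one full-resolution luma plane plus two chroma
# planes whose size depends on the subsampling format.
def samples_per_frame(width, height, chroma_format):
    luma = width * height
    chroma = {"4:4:4": luma, "4:2:2": luma // 2, "4:2:0": luma // 4}[chroma_format]
    return luma + 2 * chroma

# 4:2:0 halves the total sample count relative to 4:4:4:
print(samples_per_frame(720, 576, "4:2:0") / samples_per_frame(720, 576, "4:4:4"))  # 0.5
```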
While the discussion below in this section generally describes MPEG-2 video compression, there are many details that are not discussed, including details involving fields, chrominance formats, responses to scene changes, special codes that label the parts of the bitstream, and other pieces of information. Aside from features for handling fields for interlaced coding, MPEG-2 Video is very similar to MPEG-1 Video (and even quite similar to the earlier H.261 standard), so the entire description below applies equally well to MPEG-1.
I-frames, P-frames, and B-frames
MPEG-2 includes three basic types of coded frames: intra-coded frames (I-frames), predictive-coded frames (P-frames), and bidirectionally-predictive-coded frames (B-frames).
An I-frame is a separately compressed version of a single uncompressed (raw) frame. The coding of an I-frame takes advantage of spatial redundancy and of the inability of the eye to detect certain changes in the image. Unlike P-frames and B-frames, I-frames do not depend on data in the preceding or the following frames, so their coding is very similar to how a still photograph would be coded (roughly similar to JPEG picture coding).

Briefly, the raw frame is divided into 8 pixel by 8 pixel blocks. The data in each block is transformed by the discrete cosine transform (DCT). The result is an 8×8 matrix of coefficients that have real number values. The transform converts spatial variations into frequency variations, but it does not change the information in the block; if the transform is computed with perfect precision, the original block can be recreated exactly by applying the inverse DCT (also with perfect precision). The conversion from 8-bit integers to real-valued transform coefficients actually expands the amount of data used at this stage of the processing, but the advantage of the transformation is that the image data can then be approximated by quantizing the coefficients. Many of the transform coefficients, usually the higher-frequency components, will be zero after the quantization, which is basically a rounding operation. The penalty of this step is the loss of some subtle distinctions in brightness and color.

The quantization may be either coarse or fine, as selected by the encoder. If the quantization is not too coarse and one applies the inverse transform to the matrix after it is quantized, one gets an image that looks very similar to the original image but is not quite the same. Next, the quantized coefficient matrix is itself compressed. Typically, one corner of the 8×8 array of coefficients contains only zeros after quantization is applied.
By starting in the opposite corner of the matrix, then zigzagging through the matrix to combine the coefficients into a string, then substituting run-length codes for consecutive zeros in that string, and then applying Huffman coding to that result, one reduces the matrix to a smaller quantity of data. It is this entropy coded data that is broadcast or that is put on DVDs. In the receiver or the player, the whole process is reversed, enabling the receiver to reconstruct, to a close approximation, the original frame.
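The transform-quantize-scan-run-length pipeline just described can be sketched in a few dozen lines of Python. This is a toy illustration: the quantizer uses a single flat step rather than MPEG-2's quantization matrices, and the final Huffman stage is omitted.

```python
import math

N = 8

# 2-D DCT-II of an 8x8 block (direct O(N^4) form, for clarity not speed).
def dct2(block):
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [[c(u) * c(v) * sum(block[x][y]
                               * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                               * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                               for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

# Uniform scalar quantization (MPEG-2 actually uses per-frequency matrices).
def quantize(coeffs, step=16):
    return [[round(value / step) for value in row] for row in coeffs]

# Zigzag scan: serialize the matrix diagonal by diagonal, low frequencies first.
def zigzag(matrix):
    order = sorted(((u, v) for u in range(N) for v in range(N)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [matrix[u][v] for u, v in order]

# Run-level coding: each non-zero value is paired with the count of zeros
# preceding it; trailing zeros are dropped entirely.
def run_level(seq):
    pairs, run = [], 0
    for value in seq:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs

flat_block = [[100] * 8 for _ in range(8)]  # a featureless 8x8 block
print(run_level(zigzag(quantize(dct2(flat_block)))))  # [(0, 50)]: DC term only
```

For the featureless block only the DC coefficient survives quantization, so the whole 8×8 block reduces to a single run-level pair before the entropy-coding stage.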
The processing of B-frames is similar to that of P-frames except that B-frames use the picture in a subsequent reference frame as well as the picture in a preceding reference frame. As a result, B-frames usually provide more compression than P-frames. B-frames are never reference frames in MPEG-2 Video.
Typically, every 15th frame or so is made into an I-frame. P-frames and B-frames might follow an I-frame like this, IBBPBBPBBPBB(I), to form a Group of Pictures (GOP); however, the standard is flexible about this. The encoder selects which pictures are coded as I-, P-, and B-frames.
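Because each B-frame depends on a later reference, the encoder transmits pictures out of display order: each I- or P-frame is sent before the B-frames that precede it on screen. A minimal sketch of that reordering (our own helper, not a normative algorithm):

```python
# Convert display order to bitstream (coded) order: emit each reference
# (I or P) first, then the B-frames that precede it in display order.
def bitstream_order(display):
    coded, pending_b = [], []
    for frame in display:
        if frame == "B":
            pending_b.append(frame)
        else:
            coded.append(frame)
            coded.extend(pending_b)
            pending_b = []
    return coded + pending_b  # trailing Bs would reference the next GOP's I

print(bitstream_order(list("IBBPBBP")))  # ['I', 'P', 'B', 'B', 'P', 'B', 'B']
```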
Macroblocks
P-frames provide more compression than I-frames because they take advantage of the data in a previous I-frame or P-frame – a reference frame. To generate a P-frame, the previous reference frame is reconstructed, just as it would be in a TV receiver or DVD player. The frame being compressed is divided into 16 pixel by 16 pixel macroblocks. Then, for each of those macroblocks, the reconstructed reference frame is searched to find a 16 by 16 area that closely matches the content of the macroblock being compressed. The offset is encoded as a "motion vector". Frequently, the offset is zero, but if something in the picture is moving, the offset might be something like 23 pixels to the right and 4-and-a-half pixels up. In MPEG-1 and MPEG-2, motion vector values can either represent integer offsets or half-integer offsets. The match between the two regions will often not be perfect. To correct for this, the encoder takes the difference of all corresponding pixels of the two regions, and on that macroblock difference then computes the DCT and strings of coefficient values for the four 8×8 areas in the 16×16 macroblock as described above. This "residual" is appended to the motion vector and the result sent to the receiver or stored on the DVD for each macroblock being compressed. Sometimes no suitable match is found. Then, the macroblock is treated like an I-frame macroblock.
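The search for a matching area can be illustrated with a brute-force block matcher that minimizes the sum of absolute differences (SAD). This is a sketch under simplifying assumptions: a 4×4 block, full-pixel offsets only, and a tiny search window, whereas real encoders use 16×16 macroblocks, half-pixel interpolation, and much faster search strategies.

```python
# Sum of absolute differences between the current block at (bx, by) and the
# reference area displaced by the candidate vector (dx, dy).
def sad(ref, cur, bx, by, dx, dy, n=4):
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

# Exhaustive search over a small window; returns (cost, dx, dy) of the best match.
def best_vector(ref, cur, bx, by, search=2, n=4):
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if 0 <= bx + dx and bx + dx + n <= w and 0 <= by + dy and by + dy + n <= h:
                cost = sad(ref, cur, bx, by, dx, dy, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# Reference frame with a bright 4x4 patch; the current frame shifts it right by 1.
ref = [[0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        ref[y][x] = 200
cur = [row[-1:] + row[:-1] for row in ref]

cost, dx, dy = best_vector(ref, cur, bx=3, by=2)
print(dx, dy)  # -1 0: the matching area sits one pixel to the left in the reference
```

When no displacement yields an acceptable SAD, the macroblock falls back to intra coding, as described above.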
Video profiles and levels
MPEG-2 video supports a wide range of applications from mobile to high quality HD editing. For many applications, it is unrealistic and too expensive to support the entire standard. To allow such applications to support only subsets of it, the standard defines profiles and levels.
A profile defines sets of features such as B-pictures, 3D video, chroma format, etc. The level limits the memory and processing power needed, defining maximum bit rates, frame sizes, and frame rates.
An MPEG application then specifies its capabilities in terms of a profile and a level. For example, a DVD player may state that it supports up to Main Profile at Main Level (often written as MP@ML), meaning it can play back any MPEG-2 stream encoded at MP@ML or below.
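Capability negotiation can then be reduced to an ordering test. The sketch below linearizes the profile hierarchy, which is a simplification: the 4:2:2 and Multi-view profiles sit outside this simple ordering, and real conformance checking consults the constraint tables.

```python
# Simplified, strictly ordered subsets of the profile/level hierarchy.
PROFILES = ["SP", "MP", "SNR", "Spatial", "HP"]
LEVELS = ["LL", "ML", "H-14", "HL"]

# A decoder can play a stream whose profile and level are both at or below
# the decoder's advertised capability.
def can_decode(decoder, stream):
    dec_profile, dec_level = decoder
    str_profile, str_level = stream
    return (PROFILES.index(str_profile) <= PROFILES.index(dec_profile)
            and LEVELS.index(str_level) <= LEVELS.index(dec_level))

print(can_decode(("MP", "ML"), ("SP", "LL")))  # True: at or below MP@ML
print(can_decode(("MP", "ML"), ("MP", "HL")))  # False: level too high
```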
The tables below summarize the limitations of each profile and level, though there are constraints not listed here.[2]: Annex E  Note that not all profile and level combinations are permissible, and that scalable modes modify the level restrictions.
| Abbr. | Name | Picture Coding Types | Chroma Format | Scalable modes | Intra DC Precision |
|---|---|---|---|---|---|
| SP | Simple profile | I, P | 4:2:0 | none | 8, 9, 10 |
| MP | Main profile | I, P, B | 4:2:0 | none | 8, 9, 10 |
| SNR | SNR Scalable profile | I, P, B | 4:2:0 | SNR[a] | 8, 9, 10 |
| Spatial | Spatially Scalable profile | I, P, B | 4:2:0 | SNR,[a] spatial[b] | 8, 9, 10 |
| HP | High profile | I, P, B | 4:2:2 or 4:2:0 | SNR,[a] spatial[b] | 8, 9, 10, 11 |
| 422 | 4:2:2 profile | I, P, B | 4:2:2 or 4:2:0 | none | 8, 9, 10, 11 |
| MVP | Multi-view profile | I, P, B | 4:2:0 | Temporal[c] | 8, 9, 10 |
- ^ a b c SNR-scalability sends the transform-domain differences to a lower quantization level of each block, raising the quality and bitrate when both streams are combined. The Main stream can be recreated losslessly.
- ^ a b Spatial-scalability encodes the difference between the HD and the upscaled SD streams, which is combined with the SD to recreate the HD stream. A Main stream cannot be recreated losslessly.
- ^ Temporal-scalability inserts extra frames between every base frame, to raise the frame rate or add a 3D viewpoint. This is the only MPEG-2 profile allowing adaptive frame references, a prominent feature of H.264/AVC. A Main stream may be recreated losslessly only if extended references are not used.
| Abbr. | Name | Frame rates (Hz) | Max width (px) | Max height (px) | Max luminance samples per second (≈ width × height × frame rate) | Max bit rate in MP (Mbit/s) |
|---|---|---|---|---|---|---|
| LL | Low Level | 23.976, 24, 25, 29.97, 30 | 352 | 288 | 3,041,280 | 4 |
| ML | Main Level | 23.976, 24, 25, 29.97, 30 | 720 | 576 | 10,368,000; in High profile: 14,475,600 for 4:2:0 and 11,059,200 for 4:2:2 | 15 |
| H-14 | High 1440 | 23.976, 24, 25, 29.97, 30, 50, 59.94, 60 | 1440 | 1152 | 47,001,600; in High profile: 62,668,800 for 4:2:0 | 60 |
| HL | High Level | 23.976, 24, 25, 29.97, 30, 50, 59.94, 60 | 1920 | 1152 | 62,668,800; in High profile: 83,558,400 for 4:2:0 | 80 |
A few common MPEG-2 Profile/Level combinations are presented below, with particular maximum limits noted:
| Profile @ Level | Resolution (px) | Max. framerate (Hz) | Sampling | Max. bitrate (Mbit/s) | Example Application |
|---|---|---|---|---|---|
| SP@LL | 176 × 144 | 15 | 4:2:0 | 0.096 | Wireless handsets |
| SP@ML | 352 × 288 | 15 | 4:2:0 | 0.384 | PDAs |
| | 320 × 240 | 24 | | | |
| MP@LL | 352 × 288 | 30 | 4:2:0 | 4 | Set-top boxes (STB) |
| MP@ML | 720 × 480 | 30 | 4:2:0 | 15 | DVD (9.8 Mbit/s), SD DVB (15 Mbit/s) |
| | 720 × 576 | 25 | | | |
| MP@H-14 | 1440 × 1080 | 30 | 4:2:0 | 60 | HDV (25 Mbit/s) |
| | 1280 × 720 | 30 | | | |
| MP@HL | 1920 × 1080 | 30 | 4:2:0 | 80 | ATSC (18.3 Mbit/s), SD DVB (31 Mbit/s), HD DVB (50.3 Mbit/s) |
| | 1280 × 720 | 60 | | | |
| 422P@ML | 720 × 480 | 30 | 4:2:2 | 50 | Sony IMX (I only), Broadcast Contribution (I&P only) |
| | 720 × 576 | 25 | | | |
| 422P@H-14 | 1440 × 1080 | 30 | 4:2:2 | 80 | |
| 422P@HL | 1920 × 1080 | 30 | 4:2:2 | 300 | Sony MPEG HD422 (50 Mbit/s), Canon XF Codec (50 Mbit/s), Convergent Design Nanoflash recorder (up to 160 Mbit/s) |
| | 1280 × 720 | 60 | | | |
Applications
Some applications are listed below.
- DVD-Video – a standard definition consumer video format. Uses 4:2:0 color subsampling and variable video data rate up to 9.8 Mbit/s.
- MPEG IMX – a standard definition professional video recording format. Uses intraframe compression, 4:2:2 color subsampling and user-selectable constant video data rate of 30, 40 or 50 Mbit/s.
- HDV – a tape-based high definition video recording format. Uses 4:2:0 color subsampling and 19.4 or 25 Mbit/s total data rate.
- XDCAM – a family of tapeless video recording formats, which, in particular, includes formats based on MPEG-2 Part 2. These are: standard definition MPEG IMX (see above), high definition MPEG HD, high definition MPEG HD422. MPEG IMX and MPEG HD422 employ 4:2:2 color subsampling, MPEG HD employs 4:2:0 color subsampling. Most subformats use selectable constant video data rate from 25 to 50 Mbit/s, although there is also a variable bitrate mode with maximum 18 Mbit/s data rate.
- XF Codec – a professional tapeless video recording format, similar to MPEG HD and MPEG HD422 but stored in a different container file.
- HD DVD – defunct high definition consumer video format.
- Blu-ray Disc – high definition consumer video format.
- Broadcast TV – in some countries MPEG-2 Part 2 is used for digital broadcast in high definition. For example, ATSC specifies both several scanning formats (480i, 480p, 720p, 1080i, 1080p) and frame/field rates at 4:2:0 color subsampling, with up to 19.4 Mbit/s data rate per channel.
- Digital cable TV
- Satellite TV
Patent holders
The following organizations have held patents for MPEG-2 video technology, as listed at MPEG LA. All of these patents are now expired in the US and most other territories.[1]
| Organization | Patents[16] |
|---|---|
| Sony Corporation | 311 |
| Thomson Licensing | 198 |
| Mitsubishi Electric | 119 |
| Philips | 99 |
| GE Technology Development, Inc. | 75 |
| Panasonic Corporation | 55 |
| CIF Licensing, LLC | 44 |
| JVC Kenwood | 39 |
| Samsung Electronics | 38 |
| Alcatel Lucent (including Multimedia Patent Trust) | 33 |
| Cisco Technology, Inc. | 13 |
| Toshiba Corporation | 9 |
| Columbia University | 9 |
| LG Electronics | 8 |
| Hitachi | 7 |
| Orange S.A. | 7 |
| Fujitsu | 6 |
| Robert Bosch GmbH | 5 |
| General Instrument | 4 |
| British Telecommunications | 3 |
| Canon Inc. | 2 |
| KDDI Corporation | 2 |
| Nippon Telegraph and Telephone (NTT) | 2 |
| ARRIS Technology, Inc. | 2 |
| Sanyo Electric | 1 |
| Sharp Corporation | 1 |
| Hewlett-Packard Enterprise Company | 1 |
References
- ^ a b "MPEG-2 patent expiration opens door for royalty-free use". TechRepublic. 15 February 2018. Retrieved 13 December 2021.
- ^ a b c d e f g "H.262 : Information technology – Generic coding of moving pictures and associated audio information: Video". ITU-T Website. International Telecommunication Union – Telecommunication Standardization Sector (ITU-T). February 2000. Retrieved 13 August 2009.
- ^ a b c ISO. "ISO/IEC 13818-2:2013 – Information technology – Generic coding of moving pictures and associated audio information: Video". ISO. Retrieved 24 July 2014.
- ^ The Moving Picture Experts Group. "MPEG-2 Video". Archived from the original on 25 March 2019. Retrieved 15 June 2019 – via mpeg.chiariglione.org.
- ^ P.N. Tudor (December 2005). "MPEG-2 Video compression". Retrieved 1 November 2009.
- ^ H.262 (07/95) Information Technology – Generic Coding of Moving Picture and Associated Audio Information: Video, ITU, retrieved 3 November 2009
- ^ a b c ISO. "ISO/IEC 13818-2:1996 – Information technology – Generic coding of moving pictures and associated audio information: Video". ISO. Retrieved 24 July 2014.
- ^ "Didier LeGall, Executive Vice President". Ambarella Inc. Retrieved 2 June 2017.
- ^ "Sakae Okubo". ITU. Retrieved 27 January 2017.
- ^ "History: 1990s". SK Hynix. Archived from the original on 5 February 2021. Retrieved 6 July 2019.
- ^ "MPEG-2 Patent List" (PDF). MPEG LA. Archived from the original (PDF) on 8 March 2021. Retrieved 7 July 2019.
- ^ Leonardo Chiariglione – Convenor (October 2000). "Short MPEG-2 description". Retrieved 1 November 2009.
- ^ a b MPEG. "MPEG standards". chiariglione.org. Retrieved 24 July 2014.
- ^ ISO. "ISO/IEC 13818-2:2000/Amd 3 – New level for 1080@50p/60p". Retrieved 24 July 2014.
- ^ ISO. "ISO/IEC 13818-2:2000 – Information technology – Generic coding of moving pictures and associated audio information: Video". ISO. Retrieved 24 July 2014.
- ^ "MPEG-2 Patent List" (PDF). MPEG LA. Archived from the original (PDF) on 8 March 2021. Retrieved 7 July 2019.
External links
- Official MPEG web site
- MPEG-2 Video Encoding (H.262) – The Library of Congress
History and development
Standardization process
The development of H.262/MPEG-2 Part 2 represented a collaborative effort between the ISO/IEC Moving Picture Experts Group (MPEG) under JTC1/SC29/WG11 and the ITU-T Video Coding Experts Group (VCEG) within Study Group 16. This joint initiative for MPEG-2 began in 1990, initially focusing on extending video coding capabilities beyond the constraints of early digital storage media to address emerging needs in broadcast television and higher-quality applications.[6][7]

Building on the foundational MPEG-1 standard, which was completed and published in 1992, the MPEG-2 project shifted emphasis to advanced requirements outlined in early 1993, such as handling interlaced video signals and supporting scalable bit rates for diverse transmission environments. The Test Model editing process commenced later that year, involving iterative refinements through expert contributions and verification tests to ensure compatibility and performance across profiles and levels. Key decisions emerged from MPEG meetings, including the establishment of a committee draft for the video component in November 1993, which served as the basis for further harmonization with ITU-T efforts.[8][9]

The standardization culminated in the completion of the approval process for the first edition as ISO/IEC 13818-2 by ISO/IEC in November 1994, with publication in May 1996. The ITU-T subsequently endorsed the identical technical content as Recommendation H.262 on July 10, 1995, marking formal international recognition.[10][11]

Central objectives of the process included enabling efficient compression for interlaced formats prevalent in analog broadcasting, accommodating resolutions from standard definition (SD) up to early high definition (HD) variants, and promoting interoperability across storage media and transmission systems like satellite and cable networks.[12][13]

Editions and amendments
The H.262/MPEG-2 Part 2 standard was first published as ISO/IEC 13818-2:1996 (May 1996), in alignment with ITU-T Recommendation H.262 (1995), establishing the core specifications for video coding of moving pictures and associated data.[14] This first edition received multiple amendments to extend its applicability; notable among these were Amendment 1 (1997), adding registration of copyright identifiers, and Amendment 2 (1997), which added support for the 4:2:2 Profile to enable higher-quality sampling for professional video applications.[15][16] Additional amendments to this edition included Amendment 3 (1998) for video signal-to-noise ratio estimation[17] and Amendment 4 (1999) for further refinements,[18] along with Corrigendum 1 (1997) addressing minor technical corrections.[19]

The second edition, ISO/IEC 13818-2:2000 (aligned with ITU-T H.262, 02/2000),[20] consolidated the 1996 edition along with its prior amendments and corrigenda, while introducing enhancements such as improved error resilience mechanisms to better handle transmission errors in bitstreams. This edition was subsequently updated through Amendment 1 (2001) for additional syntax elements,[21] Amendment 2 (2007) enabling extended-gamut color transfer characteristics,[22] Amendment 3 (2010) defining a new level for 1080-line 50p/60p formats,[23] and Amendment 4 (2012) for further bitstream extensions; it also included Corrigendum 1 (2002)[24] and Corrigendum 2 (2007),[25] the latter providing fixes related to bitstream conformance in the inverse discrete cosine transform (IDCT) process.
The third edition, ISO/IEC 13818-2:2013 (aligned with ITU-T H.262, 02/2013), replaced the 2000 edition and integrated all preceding amendments and corrigenda, applying minor technical revisions primarily for improved clarity in normative text and corrections to identified bugs in decoding processes without altering the fundamental coding structure.[1] This edition was last reviewed and confirmed in 2019, maintaining the standard's stability for ongoing implementations.[1]

Technical overview
Compression principles
H.262/MPEG-2 Part 2 utilizes a block-based hybrid coding framework to achieve efficient video compression by addressing both spatial and temporal redundancies in the source material. This approach integrates motion-compensated prediction for exploiting inter-frame correlations, the discrete cosine transform (DCT) for intra-block spatial decorrelation, and scalar quantization to manage data volume while introducing controlled loss. The video sequence is partitioned into 8×8 pixel blocks, typically after subtracting a predicted version from a reference frame to form a residual; this residual undergoes DCT to concentrate energy into fewer coefficients, facilitating subsequent compression stages.[3]

The DCT process transforms each 8×8 block of pixel differences (or original pixels for intra-coded blocks) into an 8×8 matrix of frequency-domain coefficients, where the DC coefficient represents the average intensity and AC coefficients capture higher spatial frequencies. These coefficients are then quantized by dividing each by a corresponding value from a scaling matrix and rounding to the nearest integer, effectively discarding less perceptually significant high-frequency details. MPEG-2 defines default quantization matrices for luminance and chrominance, with the intra luminance matrix emphasizing finer quantization for low frequencies (e.g., smaller scaling values near the DC component) to preserve visual quality, while coarser steps for high frequencies reduce bitrate. Custom matrices can be transmitted for adaptation, but defaults ensure baseline compliance.[26]

Following quantization, the coefficients are entropy-coded to further eliminate statistical redundancy using variable-length coding (VLC) based on Huffman principles. The process employs run-level encoding, where sequences of zero-valued AC coefficients (runs) followed by a non-zero level are paired into symbols; these pairs, along with the DC coefficient, are mapped to short codes from predefined tables optimized for typical coefficient distributions: one table for intra blocks and another for inter blocks. This method assigns shorter codes to more frequent symbols, achieving additional compression without loss.[27][28]

Rate control in H.262/MPEG-2 is implicitly enforced via the Video Buffering Verifier (VBV), a theoretical model of the decoder's input buffer that constrains encoder output to prevent overflow or underflow. Specified in the standard's annex, the VBV assumes a fixed buffer size and bitrate, requiring the cumulative bits up to any point in the stream to stay within defined bounds; this guides quantization adjustments during encoding to maintain consistent quality and decoder compatibility across varying scene complexities.[29]

Picture formats and sampling
H.262/MPEG-2 Part 2 supports both progressive scanning, where each picture is a complete frame displayed sequentially, and interlaced scanning, where pictures consist of alternating odd and even fields suitable for traditional television systems like 525/60 or 625/50.[30] In interlaced mode, the standard specifies support for top-field-first or bottom-field-first ordering, indicated in the picture header to ensure proper field interleaving during decoding.[30]

The standard accommodates various chroma subsampling formats to balance color fidelity and bandwidth efficiency. The 4:2:0 format, the most common for broadcast applications, subsamples chroma components to one-quarter the luma resolution (half horizontally and half vertically), reducing color detail while preserving luminance sharpness.[30] In contrast, 4:2:2 subsamples chroma horizontally to half the luma resolution, maintaining full vertical color detail for professional video production, while 4:4:4 provides full-resolution chroma matching luma for high-quality applications like studio editing, though at higher data rates.[30] These formats are signaled via the chroma_format parameter in the sequence header, with implications for color resolution that affect applications ranging from consumer SDTV to professional workflows.[31]

Resolution support in H.262/MPEG-2 Part 2 spans from low-bitrate formats like QCIF (176×144 pixels) for early mobile or video telephony to high-definition television (HDTV) up to 1920×1080 pixels, enabling compatibility across consumer electronics and broadcast systems.[30] Common standard-definition resolutions include 704×480 for NTSC (29.97 Hz) and 704×576 for PAL (25 Hz), while HDTV examples feature 1440×1152 or 1920×1080 at 25 or 30 Hz.[30] Aspect ratios are flexibly supported, including 4:3 for traditional square-pixel displays, 16:9 for widescreen, and anamorphic modes that encode non-square pixels to fit standard containers without letterboxing.[30]

To ensure alignment with the 16×16-pixel macroblock structure, coded picture dimensions must adhere to specific constraints: horizontal sizes are rounded up to multiples of 16 pixels, and vertical sizes to multiples of 16 pixels for progressive frames or 32 pixels for interlaced frames, facilitating efficient partitioning and processing.[30] These sampling and format choices influence overall compression efficiency by determining the spatial data volume prior to encoding.[30]

Coding structure
Frame types: I, P, B
In H.262/MPEG-2 Part 2, video sequences are composed of three primary picture types, distinguished by their coding methods and prediction dependencies, which enable efficient compression by exploiting both spatial and temporal redundancies.[14] These types are defined by the picture_coding_type parameter in the picture header, allowing decoders to process each picture accordingly.[14]

Intra-coded pictures, denoted as I-pictures, are encoded independently without reference to other pictures in the sequence. They rely solely on spatial compression techniques, such as the discrete cosine transform (DCT) applied to 8×8 blocks within macroblocks, to reduce intra-frame redundancy. I-pictures serve as essential random access points, enabling decoding to commence from any such picture, and they also act as reference frames for subsequent predictive coding, making them crucial for error recovery and scene transitions.[14][3]

Predictive-coded pictures, or P-pictures, achieve greater compression efficiency by using motion-compensated prediction from a previous I- or P-picture in display order. The prediction residual is then DCT-coded, with macroblocks classified as either intra-coded (similar to I-pictures) or inter-coded using forward motion vectors to reference the past frame. P-pictures can themselves serve as references for future predictions, forming a chain of dependencies that links back to the most recent I-picture, thus balancing compression gains with decoding complexity.[14][3]

Bidirectionally predictive-coded pictures, known as B-pictures, provide the highest compression ratios by employing motion compensation from both a preceding and a subsequent I- or P-picture, allowing for interpolated predictions that capture more accurate motion across frames. Like P-pictures, B-pictures encode the prediction error via DCT after classifying macroblocks as intra-coded, forward-predicted, backward-predicted, or bidirectionally predicted, but they do not serve as references for other pictures in non-scalable profiles. This bidirectional approach enhances efficiency for scenes with smooth motion but introduces latency, as pictures must be reordered from display order to bitstream order during encoding and decoding, with the delay proportional to the number of consecutive B-pictures.[14][3]

These picture types are organized into groups of pictures (GOPs), which begin with an I-picture followed by a configurable sequence of P- and B-pictures, providing a framework for random access and closed GOP decoding. Common GOP patterns include IBBPBBP, where two B-pictures precede each P-picture, though the standard allows flexible arrangements with up to three B-pictures between reference (I- or P-) pictures in typical implementations. The GOP header, optional but recommended, specifies parameters like the temporal distance to the next I-picture, aiding in editing and fast-forward operations.[14][3]

Macroblocks and partitioning
In H.262/MPEG-2 Part 2, the macroblock serves as the basic unit of video coding, consisting of a 16×16 array of luma samples divided into four 8×8 blocks, along with corresponding chroma samples whose count depends on the chroma subsampling format.[30] In the common 4:2:0 format, the chroma components are subsampled by a factor of 2 both horizontally and vertically, yielding one 8×8 block each for Cb and Cr, for a total of six 8×8 blocks per macroblock.[27] At 8-bit precision, this structure represents 384 bytes of raw pixel data per macroblock: 256 bytes for luma and 128 bytes for chroma.[30]

Macroblocks are encoded in one of several modes to balance compression efficiency and quality, with the choice depending on the picture type (I, P, or B). In intra mode, the macroblock is coded independently: the DCT is applied directly to the original pixel values, without reference to other pictures, which is essential for I-frames and for periodic refresh in P- and B-frames.[27] Inter mode applies motion compensation to predict the macroblock from a reference picture, followed by a DCT of the residual difference, supporting forward prediction in P-frames and bidirectional prediction in B-frames.[30] A skipped macroblock transmits no residual data or motion vector at all: in P-frames it implies a zero motion vector, with the decoder copying the co-located region of the reference directly, while in B-frames it reuses the prediction of the previous macroblock; either way it is highly efficient for regions with minimal change.[27] In intra and inter modes the DCT coefficients are quantized to reduce bitrate, with the quantizer scale adjustable per macroblock; skipped macroblocks carry no coefficient data.[30]

For interlaced video, macroblocks support optional partitioning to handle field structure more effectively, whereas progressive video uses only the full 16×16 luma block.
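The block count and raw data size per macroblock for each chroma format can be tabulated as follows; a small sketch with illustrative helper names, assuming 8-bit samples as above:

```python
# 8x8 chroma blocks contributed by Cb and Cr together, per chroma format.
CHROMA_BLOCKS = {"4:2:0": 2, "4:2:2": 4, "4:4:4": 8}

def macroblock_layout(chroma_format="4:2:0"):
    """Return (blocks per macroblock, raw bytes at 8-bit precision)."""
    luma_blocks = 4                          # 16x16 luma = four 8x8 blocks
    total_blocks = luma_blocks + CHROMA_BLOCKS[chroma_format]
    raw_bytes = total_blocks * 8 * 8         # 64 one-byte samples per block
    return total_blocks, raw_bytes

print(macroblock_layout("4:2:0"))  # (6, 384): 256 B luma + 128 B chroma
```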
In frame-coding mode, the entire 16×16 macroblock is treated as a unit, while field-coding mode splits it into two 16×8 partitions (one per field), each with independent motion vectors, to better capture motion between fields.[27] These options are signaled in the macroblock header and apply only to interlaced content; there is no support for the smaller sub-8×8 partitions found in later standards.[30]

After the DCT, the 8×8 coefficient matrix of each block is scanned to serialize the data for entropy coding, ordering low-frequency coefficients first to favour run-length encoding. The standard zigzag scan starts at the DC coefficient in the top-left corner and proceeds diagonally through the matrix, grouping the higher-energy coefficients ahead of the trailing zeros.[27] For intra-coded macroblocks, the DC coefficient is treated separately and differentially coded relative to the previous block's DC value to exploit spatial redundancy.[30] For interlaced material, an alternate scan may be selected instead of zigzag, as it better matches the coefficient statistics of field-coded blocks; zigzag remains the default for frame-coded and progressive content.[27]

Motion compensation and estimation
Motion compensation in H.262/MPEG-2 Part 2 reduces temporal redundancy by predicting the current picture from one or more reference pictures, using motion vectors that describe the displacement of macroblocks between pictures.[32] The process involves motion estimation at the encoder to determine these vectors, and motion compensation at both encoder and decoder to generate prediction blocks by shifting and interpolating pixels from the reference.[33] This mechanism is applied in P-frames (predicted from the previous reference) and B-frames (bidirectionally predicted from previous and future references). The precision and range of motion vectors are controlled by the f_code parameters (forward_f_code and backward_f_code) in the picture header.[2]

Motion estimation employs block-matching algorithms to find the best-matching block in the reference picture for each macroblock of the current picture, minimizing a prediction-error measure such as the mean absolute difference or the sum of squared differences.[33] The standard does not specify the estimation method, leaving encoders free to choose; full (exhaustive) search over all candidate positions in a window gives the most accurate match but is computationally intensive.[34] Compensation then constructs the prediction by applying the selected motion vector to fetch pixels from the reference picture, interpolating at half-sample positions with a bilinear filter.[9]

Motion vectors in H.262 have half-sample accuracy in both the horizontal and vertical components and are coded in units of half-pels to support this interpolation.[32] Their range is determined by the f_code parameters: valid values run from 1 to 9 (the value 15 signals that the corresponding vectors are absent), allowing displacements of up to about ±2048 samples in frame pictures at the largest f_code, with adjusted limits for field pictures to account for interlacing.[35] Dual-prime prediction, available in P-pictures when no B-pictures separate the predicted picture from its reference, transmits one motion vector plus a small differential from which vectors for the opposite-parity field are derived; the two field predictions are then averaged, giving some of the benefit of bidirectional prediction in low-delay scenarios.[36]

To handle both progressive and interlaced content, H.262 supports frame-based and field-based prediction modes. Frame-based prediction treats the whole frame as a single entity, suitable for progressive sequences, where one motion vector applies uniformly to both fields.[11] Field-based prediction, used for interlaced video, estimates motion separately for each field, avoiding the artifacts that arise from assuming uniform motion within a frame; each field-based vector selects either the top or bottom field of the reference picture via a field-select flag, and its vertical range is effectively halved because each field has half the frame's vertical resolution.[11]

Motion vectors are coded efficiently using variable-length codes (VLC) applied to their differences from a predictor, exploiting the spatial correlation of motion to minimize bitrate.[37] Separate codes cover the horizontal and vertical components of the motion vector difference, with short codewords for zero or small differences and progressively longer ones for larger displacements.[27] The predictor is the motion vector of the previously decoded macroblock in the same slice, reset at the start of each slice, which keeps the differential coding robust against transmission errors.[37]

Profiles, levels, and constraints
Profiles
In H.262/MPEG-2 Part 2, profiles define subsets of the syntax and tools, allowing encoders and decoders to target specific applications by balancing compression efficiency, quality, and computational requirements. Each profile specifies supported frame types, chroma formats, and optional scalability features, enabling trade-offs between simplicity and advanced capability. The standard outlines five primary profiles: Simple, Main, SNR Scalable, Spatially Scalable, and High, with the 4:2:2 Profile as a specialized variant for professional use. The profiles build hierarchically: each higher profile incorporates all tools of the lower ones plus additional features.

The Simple Profile (SP) is designed for low-complexity decoding in resource-constrained environments, supporting only intra-coded (I) and predictive (P) frames; bidirectional (B) frames are excluded to minimize memory and processing demands. It uses 4:2:0 chroma subsampling and lacks scalability tools, making it suitable for basic video applications with moderate compression needs. The profile trades some compression efficiency for decoder simplicity, as the absence of B-frames reduces buffering requirements but limits temporal prediction accuracy.[14][30]

The Main Profile (MP) extends the Simple Profile by incorporating B-frames for enhanced compression through bidirectional prediction, together with the full set of interlaced coding tools. It maintains 4:2:0 chroma and does not include scalability, focusing on general-purpose video for consumer applications like broadcasting and storage.
MP achieves better efficiency than SP by exploiting temporal redundancy more fully, at the cost of increased decoding complexity from B-frame handling; it is the most widely adopted profile for its balance of performance and compatibility.[14][38][30]

The SNR Scalable Profile builds on the Main Profile by adding signal-to-noise ratio (SNR) scalability, allowing a base layer (conforming to MP) plus one or more enhancement layers that improve quality without altering resolution. It supports I, P, and B frames with 4:2:0 chroma, enabling progressive refinement for variable-bitrate scenarios. The profile trades increased encoding and decoding overhead for flexibility in layered transmission, and adoption has been limited because of the added complexity over non-scalable options.[14][30]

The Spatially Scalable Profile extends the SNR Scalable Profile to support multi-layer spatial resolution: a lower-resolution base layer (MP-compliant) is enhanced by higher-resolution layers, up to three layers in total. It includes I, P, and B frames with 4:2:0 chroma, facilitating adaptive decoding for differing display sizes. The trade-off is the higher computational cost of motion estimation across layers, prioritizing versatility for hierarchical video services over single-layer efficiency.[14][30]

The High Profile (HP) provides the most advanced feature set, supporting I, P, and B frames with 4:2:0 or 4:2:2 chroma formats for superior color fidelity, and optional SNR or spatial scalability similar to the scalable profiles. It is tailored for professional video production, enabling high-quality encoding with enhanced chroma tools and error resilience. HP offers the greatest flexibility but at the highest complexity, suitable for studio and broadcast environments where quality outweighs decoding simplicity.
The 4:2:2 Profile, a derivative focused on professional multi-generation workflows, mirrors HP's frame types but without scalability, mandating 4:2:2 chroma for better color handling in editing, with reduced efficiency at lower bitrates compared to 4:2:0 profiles.[14][39][30][40]

| Profile | Frame Types | Chroma Formats | Scalability | Key Trade-offs |
|---|---|---|---|---|
| Simple | I, P | 4:2:0 | None | Low complexity vs. reduced efficiency |
| Main | I, P, B | 4:2:0 | None | Balanced efficiency vs. moderate complexity |
| SNR Scalable | I, P, B | 4:2:0 | SNR (quality layers) | Layered quality vs. overhead |
| Spatially Scalable | I, P, B | 4:2:0 | Spatial (resolution layers) | Resolution flexibility vs. computation |
| High | I, P, B | 4:2:0, 4:2:2 | Optional SNR/Spatial | High quality vs. high complexity |
| 4:2:2 | I, P, B | 4:2:2 | None | Professional chroma vs. bitrate inefficiency |
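The capability summary in the table above can be encoded as a simple conformance check. This is a sketch following the table's simplified view; the `PROFILES` dictionary and `conforms` helper are illustrative, and the normative constraints in the standard are considerably more detailed:

```python
# Capability summary mirroring the profile table above (simplified).
PROFILES = {
    "Simple":             {"pictures": {"I", "P"},      "chroma": {"4:2:0"},          "scalable": False},
    "Main":               {"pictures": {"I", "P", "B"}, "chroma": {"4:2:0"},          "scalable": False},
    "SNR Scalable":       {"pictures": {"I", "P", "B"}, "chroma": {"4:2:0"},          "scalable": True},
    "Spatially Scalable": {"pictures": {"I", "P", "B"}, "chroma": {"4:2:0"},          "scalable": True},
    "High":               {"pictures": {"I", "P", "B"}, "chroma": {"4:2:0", "4:2:2"}, "scalable": True},
    "4:2:2":              {"pictures": {"I", "P", "B"}, "chroma": {"4:2:2"},          "scalable": False},
}

def conforms(profile, picture_types, chroma_format, uses_scalability=False):
    """True if a stream's tool set fits within the given profile."""
    caps = PROFILES[profile]
    return (set(picture_types) <= caps["pictures"]
            and chroma_format in caps["chroma"]
            and (caps["scalable"] or not uses_scalability))

print(conforms("Simple", "IPB", "4:2:0"))  # False: Simple excludes B-pictures
print(conforms("Main", "IPB", "4:2:0"))    # True
```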
Levels
In H.262/MPEG-2 Part 2, levels define quantitative constraints on decoder performance and encoded bitstream parameters, including maximum picture resolution, frame or field rates, bitrates, and video buffering verifier (VBV) buffer sizes. They ensure compatibility across applications and interact with profiles, which specify the supported coding tools.[41] Levels form tiers of increasing capability, with each higher level encompassing the limits of the lower ones, so decoders designed for advanced levels can handle simpler content without modification.[42]

The Low Level (LL) targets basic video applications, supporting a maximum resolution of 352 × 288 pixels at up to 30 frames per second (or the equivalent field rate for interlaced video) with a peak bitrate of 4 Mbps and a VBV buffer of approximately 0.06 MB.[30] This level is suited to low-complexity decoding, such as in early digital video systems or constrained environments.[41]

The Main Level (ML) serves as the standard tier for standard-definition television (SDTV), accommodating up to 720 × 576 pixels at 30 frames per second with a bitrate limit of 15 Mbps and a VBV buffer of about 0.23 MB.[42] Its sample-rate limit corresponds to ITU-R BT.601-style 13.5 MHz luminance sampling with 4:2:0 chroma subsampling, enabling efficient handling of broadcast-quality interlaced video.[30]

Higher tiers include the High 1440 Level, which extends to 1440 × 1152 pixels at up to 60 fields per second with a 60 Mbps bitrate, and the High Level (HL), designed for high-definition television (HDTV) with support for 1920 × 1080 (or 1920 × 1152 interlaced) at 30 frames per second and 80 Mbps in the Main Profile.[41] Profile-specific variations adjust these limits; for instance, the 4:2:2 Profile at High Level permits bitrates up to 300 Mbps and higher sample rates for professional applications requiring enhanced chroma resolution.[30] VBV sizes scale accordingly, reaching about 1.22 MB at High Level, Main Profile, to manage larger bitstreams.[42]

The following table summarizes key parameters for selected level-profile combinations:

| Level | Profile | Max Resolution (pixels) | Max Picture Rate (Hz) | Max Bitrate (Mbps) | VBV Buffer Size (MB) |
|---|---|---|---|---|---|
| Low | Main | 352 × 288 | 30 | 4 | 0.06 |
| Main | Main | 720 × 576 | 30 | 15 | 0.23 |
| High 1440 | Main | 1440 × 1152 | 60 (fields/s) | 60 | 0.92 |
| High | Main | 1920 × 1080 | 30 | 80 | 1.22 |
| High | 4:2:2 | 1920 × 1080 | 30 | 300 | 1.23 |
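The Main Profile rows of the table above can likewise be turned into a lookup that picks the lowest sufficient level for a stream. A sketch using the table's rounded values rather than the normative sample-rate limits; `LEVELS` and `minimum_level` are illustrative names:

```python
# Main Profile level limits from the table above (rounded; the High 1440
# rate is interpreted here simply as 60 pictures/s).
LEVELS = [  # (name, max_width, max_height, max_rate_hz, max_mbps)
    ("Low",       352,  288,  30,  4),
    ("Main",      720,  576,  30, 15),
    ("High 1440", 1440, 1152, 60, 60),
    ("High",      1920, 1152, 30, 80),
]

def minimum_level(width, height, rate_hz, mbps):
    """Lowest Main Profile level whose limits cover the stream, or None."""
    for name, max_w, max_h, max_rate, max_mbps in LEVELS:
        if (width <= max_w and height <= max_h
                and rate_hz <= max_rate and mbps <= max_mbps):
            return name
    return None  # exceeds even High Level

print(minimum_level(720, 576, 25, 9))     # Main: typical SDTV broadcast
print(minimum_level(1920, 1080, 30, 20))  # High
```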
