RTP payload formats
The Real-time Transport Protocol (RTP) specifies a general-purpose data format and network protocol for transmitting digital media streams on Internet Protocol (IP) networks. The details of media encoding, such as signal sampling rate, frame size and timing, are specified in an RTP payload format. The format parameters of the RTP payload are typically communicated between transmission endpoints with the Session Description Protocol (SDP), but other protocols, such as the Extensible Messaging and Presence Protocol (XMPP), may be used.
Payload types and formats
The technical parameters of payload formats for audio and video streams are standardised in the following documents, which also describe the process of registering new payload types with IANA.
- RFC 3550 – "RTP: A Transport Protocol for Real-Time Applications,"[1] Internet Standard 64.
- RFC 3551 – "RTP Profile for Audio and Video Conferences with Minimal Control,"[2] Internet Standard 65.
- RFC 3611 – "RTP Control Protocol Extended Reports (RTCP XR),"[3] Proposed Standard.
- RFC 4856 – "Media Type Registration of Payload Formats in the RTP Profile for Audio and Video Conferences,"[4] Proposed Standard.
Text messaging payload types
Payload formats and types for text messaging are defined in the following specifications:
- RFC 4103 – "RTP Payload for Text Conversation,"[5] Proposed Standard.
- RFC 9071 – "RTP-Mixer Formatting of Multiparty Real-Time Text,"[6] Proposed Standard.
MIDI payload types
Payload formats and types for MIDI are defined in the following specifications:
- RFC 6295 – "RTP Payload Format for MIDI,"[7] Proposed Standard.
- RFC 4696 – "An Implementation Guide for RTP MIDI,"[8] Informational.
Audio and video payload types
Payload formats and types for audio and video are defined in the following specifications:
- RFC 2029 – "RTP Payload Format of Sun's CellB Video Encoding,"[9] Proposed Standard.
- RFC 2190 – "RTP Payload Format for H.263 Video Streams,"[10] Historic.
- RFC 2198 – "RTP Payload for Redundant Audio Data,"[11] Proposed Standard.
- RFC 2250 – "RTP Payload Format for MPEG1/MPEG2 Video,"[12] Proposed Standard.
- RFC 2343 – "RTP Payload Format for Bundled MPEG,"[13] Experimental.
- RFC 2435 – "RTP Payload Format for JPEG-compressed Video,"[14] Proposed Standard.
- RFC 2586 – "The Audio/L16 MIME content type,"[15] Informational.
- RFC 2658 – "RTP Payload Format for PureVoice(tm) Audio,"[16] Proposed Standard.
- RFC 3190 – "RTP Payload Format for 12-bit DAT Audio and 20- and 24-bit Linear Sampled Audio,"[17] Proposed Standard.
- RFC 3389 – "Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN),"[18] Proposed Standard.
- RFC 3497 – "RTP Payload Format for Society of Motion Picture and Television Engineers (SMPTE) 292M Video,"[19] Informational.
- RFC 3640 – "RTP Payload Format for Transport of MPEG-4 Elementary Streams,"[20] Proposed Standard.
- RFC 3952 – "Real-time Transport Protocol (RTP) Payload Format for internet Low Bit Rate Codec (iLBC) Speech,"[21] Experimental.
- RFC 4175 – "RTP Payload Format for Uncompressed Video,"[22] Proposed Standard.
- RFC 4184 – "RTP Payload Format for AC-3 Audio,"[23] Proposed Standard.
- RFC 4352 – "RTP Payload Format for the Extended Adaptive Multi-Rate Wideband (AMR-WB+) Audio Codec,"[24] Proposed Standard.
- RFC 4587 – "RTP Payload Format for H.261 Video Streams,"[25] Proposed Standard.
- RFC 4598 – "Real-time Transport Protocol (RTP) Payload Format for Enhanced AC-3 (E-AC-3) Audio,"[26] Proposed Standard.
- RFC 4629 – "RTP Payload Format for ITU-T Rec. H.263 Video,"[27] Proposed Standard.
- RFC 4733 – "RTP Payload for DTMF Digits, Telephony Tones, and Telephony Signals,"[28] Proposed Standard.
- RFC 4749 – "RTP Payload Format for the G.729.1 Audio Codec,"[29] Proposed Standard.
- RFC 4788 – "Enhancements to RTP Payload Formats for EVRC Family Codecs,"[30] Proposed Standard.
- RFC 4867 – "RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs,"[31] Proposed Standard.
- RFC 5188 – "RTP Payload Format for the Enhanced Variable Rate Wideband Codec (EVRC-WB) and the Media Subtype Updates for EVRC-B Codec,"[32] Proposed Standard.
- RFC 5215 – "RTP Payload Format for Vorbis Encoded Audio,"[33] Proposed Standard.
- RFC 5371 – "RTP Payload Format for JPEG 2000 Video Streams,"[34] Proposed Standard.
- RFC 5391 – "RTP Payload Format for ITU-T Recommendation G.711.1,"[35] Proposed Standard.
- RFC 5404 – "RTP Payload Format for G.719,"[36] Proposed Standard.
- RFC 5574 – "RTP Payload Format for the Speex Codec,"[37] Proposed Standard.
- RFC 5577 – "RTP Payload Format for ITU-T Recommendation G.722.1,"[38] Proposed Standard.
- RFC 5584 – "RTP Payload Format for the Adaptive TRansform Acoustic Coding (ATRAC) Family,"[39] Proposed Standard.
- RFC 5686 – "RTP Payload Format for mU-law EMbedded Codec for Low-delay IP Communication (UEMCLIP) Speech Codec,"[40] Proposed Standard.
- RFC 5993 – "RTP Payload Format for Global System for Mobile Communications Half Rate (GSM-HR),"[41] Proposed Standard.
- RFC 6184 – "RTP Payload Format for H.264 Video,"[42] Proposed Standard.
- RFC 6190 – "RTP Payload Format for Scalable Video Coding,"[43] Proposed Standard.
- RFC 6416 – "RTP Payload Format for MPEG-4 Audio/Visual Streams,"[44] Proposed Standard.
- RFC 6469 – "RTP Payload Format for DV (IEC 61834) Video,"[45] Proposed Standard.
- RFC 7310 – "RTP Payload Format for Standard apt-X and Enhanced apt-X Codecs,"[46] Proposed Standard.
- RFC 7587 – "RTP Payload Format for the Opus Speech and Audio Codec,"[47] Proposed Standard.
- RFC 7741 – "RTP Payload Format for VP8 Video,"[48] Proposed Standard.
- RFC 7798 – "RTP Payload Format for High Efficiency Video Coding (HEVC),"[49] Proposed Standard.
- RFC 9134 – "RTP Payload Format for ISO/IEC 21122 (JPEG XS),"[50] Proposed Standard.
- RFC 9607 – "RTP Payload Format for the Secure Communication Interoperability Protocol (SCIP) Codec,"[51] Proposed Standard.
- RFC 9628 – "RTP Payload Format for VP9 Video,"[52] Proposed Standard.
Payload identifiers 96–127 are used for payloads defined dynamically during a session. It is recommended to dynamically assign port numbers, although port numbers 5004 and 5005 have been registered for use of the profile when a dynamically assigned port is not required.
Applications should always support PCMU (payload type 0). Previously, DVI4 (payload type 5) was also recommended, but this was removed in 2013.[53]
| Payload type (PT) | Name | Type | No. of channels | Clock rate (Hz)[note 1] | Frame size (byte) | Default packet interval (ms) | Description | References |
|---|---|---|---|---|---|---|---|---|
| 0 | PCMU | audio | 1 | 8000 | any | 20 | ITU-T G.711 PCM μ-Law audio 64 kbit/s | RFC 3551 |
| 1 | reserved (previously FS-1016 CELP) | audio | 1 | 8000 | reserved, previously FS-1016 CELP audio 4.8 kbit/s | RFC 3551 | ||
| 2 | reserved (previously G721 or G726-32) | audio | 1 | 8000 | reserved, previously ITU-T G.721 ADPCM audio 32 kbit/s or ITU-T G.726 audio 32 kbit/s | RFC 3551 | ||
| 3 | GSM | audio | 1 | 8000 | 20 | 20 | European GSM Full Rate audio 13 kbit/s (GSM 06.10) | RFC 3551 |
| 4 | G723 | audio | 1 | 8000 | 30 | 30 | ITU-T G.723.1 audio | RFC 3551 |
| 5 | DVI4 | audio | 1 | 8000 | any | 20 | IMA ADPCM audio 32 kbit/s | RFC 3551 |
| 6 | DVI4 | audio | 1 | 16000 | any | 20 | IMA ADPCM audio 64 kbit/s | RFC 3551 |
| 7 | LPC | audio | 1 | 8000 | any | 20 | Experimental Linear Predictive Coding audio 5.6 kbit/s | RFC 3551 |
| 8 | PCMA | audio | 1 | 8000 | any | 20 | ITU-T G.711 PCM A-Law audio 64 kbit/s | RFC 3551 |
| 9 | G722 | audio | 1 | 8000[note 2] | any | 20 | ITU-T G.722 audio 64 kbit/s | RFC 3551 |
| 10 | L16 | audio | 2 | 44100 | any | 20 | Linear PCM 16-bit stereo audio 1411.2 kbit/s, uncompressed[15][54]: 62 [4]: 18 | RFC 3551: 27 |
| 11 | L16 | audio | 1 | 44100 | any | 20 | Linear PCM 16-bit audio 705.6 kbit/s, uncompressed | RFC 3551: 27 |
| 12 | QCELP | audio | 1 | 8000 | 20 | 20 | Qualcomm Code Excited Linear Prediction | RFC 2658, RFC 3551: 28 |
| 13 | CN | audio | 1 | 8000 | Comfort noise. Payload type used with audio codecs that do not support comfort noise as part of the codec itself such as G.711, G.722.1, G.722, G.726, G.727, G.728, GSM 06.10, Siren, and RTAudio. | RFC 3389 | ||
| 14 | MPA | audio | 1, 2 | 90000 | 8–72 | MPEG-1 or MPEG-2 audio only | RFC 2250, RFC 3551 | |
| 15 | G728 | audio | 1 | 8000 | 2.5 | 20 | ITU-T G.728 audio 16 kbit/s | RFC 3551 |
| 16 | DVI4 | audio | 1 | 11025 | any | 20 | IMA ADPCM audio 44.1 kbit/s | RFC 3551 |
| 17 | DVI4 | audio | 1 | 22050 | any | 20 | IMA ADPCM audio 88.2 kbit/s | RFC 3551 |
| 18 | G729 | audio | 1 | 8000 | 10 | 20 | ITU-T G.729 and G.729a audio 8 kbit/s; Annex B is implied unless the annexb=no parameter is used | RFC 3551: 20, RFC 4856: 12 |
| 19 | reserved (previously CN) | audio | reserved, previously comfort noise | RFC 3551 | ||||
| 25 | CELLB | video | 90000 | Sun CellB video[55] | RFC 2029 | |||
| 26 | JPEG | video | 90000 | JPEG video | RFC 2435 | |||
| 28 | nv | video | 90000 | Xerox PARC's Network Video (nv)[56][57] | RFC 3551: 32 | |||
| 31 | H261 | video | 90000 | ITU-T H.261 video | RFC 4587 | |||
| 32 | MPV | video | 90000 | MPEG-1 and MPEG-2 video | RFC 2250 | |||
| 33 | MP2T | audio/video | 90000 | MPEG-2 transport stream | RFC 2250 | |||
| 34 | H263 | video | 90000 | H.263 video, first version (1996) | RFC 2190, RFC 3551 | |||
| 72–76 | reserved | reserved because RTCP packet types 200–204 would otherwise be indistinguishable from RTP payload types 72–76 with the marker bit set | RFC 3550, RFC 3551 | |||||
| 77–95 | unassigned | note that RTCP packet type 207 (XR, Extended Reports) would be indistinguishable from RTP payload type 79 with the marker bit set | RFC 3551, RFC 3611 | | | | | |
| dynamic | H263-1998 | video | 90000 | H.263 video, second version (1998) | RFC 2190, RFC 3551, RFC 4629 | |||
| dynamic | H263-2000 | video | 90000 | H.263 video, third version (2000) | RFC 4629 | |||
| dynamic (or profile) | H264 AVC | video | 90000 | H.264 video (MPEG-4 Part 10) | RFC 6184 | |||
| dynamic (or profile) | H264 SVC | video | 90000 | H.264 video | RFC 6190 | |||
| dynamic (or profile) | H265 | video | 90000 | H.265 video (HEVC) | RFC 7798 | |||
| dynamic (or profile) | theora | video | 90000 | Theora video | draft-barbato-avt-rtp-theora | |||
| dynamic | iLBC | audio | 1 | 8000 | 20, 30 | 20, 30 | Internet low Bitrate Codec 13.33 or 15.2 kbit/s | RFC 3952 |
| dynamic | PCMA-WB | audio | 1 | 16000 | 5 | ITU-T G.711.1 A-law | RFC 5391 | |
| dynamic | PCMU-WB | audio | 1 | 16000 | 5 | ITU-T G.711.1 μ-law | RFC 5391 | |
| dynamic | G718 | audio | 32000 (placeholder) | 20 | ITU-T G.718 | draft-ietf-payload-rtp-g718 | ||
| dynamic | G719 | audio | (various) | 48000 | 20 | ITU-T G.719 | RFC 5404 | |
| dynamic | G7221 | audio | 16000, 32000 | 20 | ITU-T G.722.1 and G.722.1 Annex C | RFC 5577 | ||
| dynamic | G726-16 | audio | 1 | 8000 | any | 20 | ITU-T G.726 audio 16 kbit/s | RFC 3551 |
| dynamic | G726-24 | audio | 1 | 8000 | any | 20 | ITU-T G.726 audio 24 kbit/s | RFC 3551 |
| dynamic | G726-32 | audio | 1 | 8000 | any | 20 | ITU-T G.726 audio 32 kbit/s | RFC 3551 |
| dynamic | G726-40 | audio | 1 | 8000 | any | 20 | ITU-T G.726 audio 40 kbit/s | RFC 3551 |
| dynamic | G729D | audio | 1 | 8000 | 10 | 20 | ITU-T G.729 Annex D | RFC 3551 |
| dynamic | G729E | audio | 1 | 8000 | 10 | 20 | ITU-T G.729 Annex E | RFC 3551 |
| dynamic | G7291 | audio | 16000 | 20 | ITU-T G.729.1 | RFC 4749 | ||
| dynamic | GSM-EFR | audio | 1 | 8000 | 20 | 20 | ITU-T GSM-EFR (GSM 06.60) | RFC 3551 |
| dynamic | GSM-HR-08 | audio | 1 | 8000 | 20 | ITU-T GSM-HR (GSM 06.20) | RFC 5993 | |
| dynamic (or profile) | AMR | audio | (various) | 8000 | 20 | Adaptive Multi-Rate audio | RFC 4867 | |
| dynamic (or profile) | AMR-WB | audio | (various) | 16000 | 20 | Adaptive Multi-Rate Wideband audio (ITU-T G.722.2) | RFC 4867 | |
| dynamic (or profile) | AMR-WB+ | audio | 1, 2 or omit | 72000 | 13.3–40 | Extended Adaptive Multi Rate – WideBand audio | RFC 4352 | |
| dynamic (or profile) | vorbis | audio | (various) | (various) | Vorbis audio | RFC 5215 | ||
| dynamic (or profile) | opus | audio | 1, 2 | 48000[note 3] | 2.5–60 | 20 | Opus audio | RFC 7587 |
| dynamic (or profile) | speex | audio | 1 | 8000, 16000, 32000 | 20 | Speex audio | RFC 5574 | |
| dynamic | mpa-robust | audio | 1, 2 | 90000 | 24–72 | Loss-Tolerant MP3 audio | RFC 5219 | |
| dynamic (or profile) | MP4A-LATM | audio | 90000 or others | MPEG-4 Audio (includes AAC) | RFC 6416 | |||
| dynamic (or profile) | MP4V-ES | video | 90000 or others | MPEG-4 Visual | RFC 6416 | |||
| dynamic (or profile) | mpeg4-generic | audio/video | 90000 or other | MPEG-4 Elementary Streams | RFC 3640 | |||
| dynamic | VP8 | video | 90000 | VP8 video | RFC 7741 | |||
| dynamic | VP9 | video | 90000 | VP9 video | RFC 9628 | |||
| dynamic | AV1 | video | 90000 | AV1 video | av1-rtp-spec | |||
| dynamic | L8 | audio | (various) | (various) | any | 20 | Linear PCM 8-bit audio with 128 offset | RFC 3551: § 4.5.10 : Table 5 |
| dynamic | DAT12 | audio | (various) | (various) | any | 20 (by analogy with L16) | IEC 61119 12-bit nonlinear audio | RFC 3190: §3 |
| dynamic | L16 | audio | (various) | (various) | any | 20 | Linear PCM 16-bit audio | RFC 3551,: § 4.5.11 RFC 2586 |
| dynamic | L20 | audio | (various) | (various) | any | 20 (by analogy with L16) | Linear PCM 20-bit audio | RFC 3190: § 4 |
| dynamic | L24 | audio | (various) | (various) | any | 20 (by analogy with L16) | Linear PCM 24-bit audio | RFC 3190: § 4 |
| dynamic | raw | video | 90000 | Uncompressed Video | RFC 4175 | |||
| dynamic | ac3 | audio | (various) | 32000, 44100, 48000 | Dolby AC-3 audio | RFC 4184 | ||
| dynamic | eac3 | audio | (various) | 32000, 44100, 48000 | Enhanced AC-3 audio | RFC 4598 | ||
| dynamic | t140 | text | 1000 | Text over IP | RFC 4103 | |||
| dynamic | EVRC, EVRC0, EVRC1 | audio | 8000 | EVRC audio | RFC 4788 | | | |
| dynamic | EVRCB, EVRCB0, EVRCB1 | audio | 8000 | EVRC-B audio | RFC 4788 | | | |
| dynamic | EVRCWB, EVRCWB0, EVRCWB1 | audio | 16000 | EVRC-WB audio | RFC 5188 | | | |
| dynamic | jpeg2000 | video | 90000 | JPEG 2000 video | RFC 5371 | |||
| dynamic | UEMCLIP | audio | 8000, 16000 | UEMCLIP audio | RFC 5686 | |||
| dynamic | ATRAC3 | audio | 44100 | ATRAC3 audio | RFC 5584 | |||
| dynamic | ATRAC-X | audio | 44100, 48000 | ATRAC3+ audio | RFC 5584 | |||
| dynamic | ATRAC-ADVANCED-LOSSLESS | audio | (various) | ATRAC Advanced Lossless audio | RFC 5584 | |||
| dynamic | DV | video | 90000 | DV video | RFC 6469 | |||
| dynamic | BT656 | video | ITU-R BT.656 video | RFC 3555 | ||||
| dynamic | BMPEG | video | Bundled MPEG-2 video | RFC 2343 | ||||
| dynamic | SMPTE292M | video | SMPTE 292M video | RFC 3497 | ||||
| dynamic | RED | audio | Redundant Audio Data | RFC 2198 | ||||
| dynamic | VDVI | audio | Variable-rate DVI4 audio | RFC 3551 | ||||
| dynamic | MP1S | video | MPEG-1 Systems Streams video | RFC 2250 | ||||
| dynamic | MP2P | video | MPEG-2 Program Streams video | RFC 2250 | ||||
| dynamic | tone | audio | 8000 (default) | tone | RFC 4733 | |||
| dynamic | telephone-event | audio | 8000 (default) | DTMF tone | RFC 4733 | |||
| dynamic | aptx | audio | 2 – 6 | (equal to sampling rate) | 4000 ÷ sample rate | 4[note 4] | aptX audio | RFC 7310 |
| dynamic | jxsv | video | 90000 | JPEG XS video | RFC 9134 | |||
| dynamic | scip | audio/video | 8000 or 90000 | SCIP | RFC 9607 |
- ^ The "clock rate" is the rate at which the timestamp in the RTP header is incremented, which need not be the same as the codec's sampling rate. For instance, video codecs typically use a clock rate of 90000 so their frames can be more precisely aligned with the RTCP NTP timestamp, even though video sampling rates are typically in the range of 1–60 samples per second.
- ^ Although the sampling rate for G.722 is 16000, its clock rate is 8000 to remain backwards compatible with RFC 1890, which incorrectly used this value.[2]: 14
- ^ Because Opus can change sampling rates dynamically, its clock rate is fixed at 48000, even when the codec will be operated at a lower sampling rate. The maxplaybackrate and sprop-maxcapturerate parameters in SDP can be used to indicate hints/preferences about the maximum sampling rate to encode/decode.
- ^ For aptX, the packetization interval must be rounded down to the nearest interval that can contain an integer number of samples. So at sampling rates of 11025, 22050, or 44100 Hz, the default packetization interval of 4 ms is rounded down to 3.99 ms.
References
- ^ H. Schulzrinne; S. Casner; R. Frederick; V. Jacobson (July 2003). RTP: A Transport Protocol for Real-Time Applications. Network Working Group. doi:10.17487/RFC3550. STD 64. RFC 3550. Internet Standard 64. Updated by RFC 8860, 7160, 5761, 5506, 6051, 6222, 7022, 7164 and 8083. Obsoletes RFC 1889.
- ^ a b H. Schulzrinne; S. Casner (July 2003). RTP Profile for Audio and Video Conferences with Minimal Control. Network Working Group. doi:10.17487/RFC3551. STD 65. RFC 3551. Internet Standard 65. Updated by RFC 8860, 5761 and 7007. Obsoletes RFC 1890.
- ^ T. Friedman; R. Caceres; A. Clark, eds. (November 2003). RTP Control Protocol Extended Reports (RTCP XR). Network Working Group. doi:10.17487/RFC3611. RFC 3611. Proposed Standard.
- ^ a b S. Casner (March 2007). Media Type Registration of Payload Formats in the RTP Profile for Audio and Video Conferences. Network Working Group. doi:10.17487/RFC4856. RFC 4856. Proposed Standard. Obsoletes RFC 3555.
- ^ G. Hellstrom; P. Jones (June 2005). RTP Payload for Text Conversation. Network Working Group. doi:10.17487/RFC4103. RFC 4103. Proposed Standard. Obsoletes RFC 2793. Updated by RFC 9071.
- ^ G. Hellström (July 2021). RTP-Mixer Formatting of Multiparty Real-Time Text. Internet Engineering Task Force. doi:10.17487/RFC9071. ISSN 2070-1721. RFC 9071. Proposed Standard. Updates RFC 4103.
- ^ J. Lazzaro; J. Wawrzynek (June 2011). RTP Payload Format for MIDI. Internet Engineering Task Force. doi:10.17487/RFC6295. ISSN 2070-1721. RFC 6295. Proposed Standard. Obsoletes RFC 4695.
- ^ J. Lazzaro; J. Wawrzynek (November 2006). An Implementation Guide for RTP MIDI. Network Working Group. doi:10.17487/RFC4696. RFC 4696. Informational.
- ^ M. Speer; D. Hoffman (October 1996). RTP Payload Format of Sun's CellB Video Encoding. Network Working Group. doi:10.17487/RFC2029. RFC 2029. Proposed Standard.
- ^ C. Zhu (September 1997). RTP Payload Format for H.263 Video Streams. IETF Network Working Group. doi:10.17487/RFC2190. RFC 2190. Historic.
- ^ C. Perkins; I. Kouvelas; O. Hodson; V. Hardman; M. Handley; J.C. Bolot; A. Vega-Garcia; S. Fosse-Parisis (September 1997). RTP Payload for Redundant Audio Data. IETF Network Working Group. doi:10.17487/RFC2198. RFC 2198. Proposed Standard. Updated by RFC 6354.
- ^ D. Hoffman; G. Fernando; V. Goyal; M. Civanlar (January 1998). RTP Payload Format for MPEG1/MPEG2 Video. Network Working Group. doi:10.17487/RFC2250. RFC 2250. Proposed Standard. Obsoletes RFC 2038.
- ^ M. Civanlar; G. Cash; B. Haskell (May 1998). RTP Payload Format for Bundled MPEG. Network Working Group. doi:10.17487/RFC2343. RFC 2343. Experimental.
- ^ L. Berc; W. Fenner; R. Frederick; S. McCanne; P. Stewart (October 1998). RTP Payload Format for JPEG-compressed Video. Network Working Group. doi:10.17487/RFC2435. RFC 2435. Proposed Standard. Obsoletes RFC 2035.
- ^ a b J. Salsman; H. Alvestrand (May 1999). The Audio/L16 MIME content type. Network Working Group. doi:10.17487/RFC2586. RFC 2586. Informational.
- ^ K. McKay (August 1999). RTP Payload Format for PureVoice(tm) Audio. Network Working Group. doi:10.17487/RFC2658. RFC 2658. Proposed Standard.
- ^ K. Kobayashi; A. Ogawa; C. Bormann (January 2002). RTP Payload Format for 12-bit DAT Audio and 20- and 24-bit Linear Sampled Audio. Network Working Group. doi:10.17487/RFC3190. RFC 3190. Proposed Standard.
- ^ R. Zopf (September 2002). Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN). Network Working Group. doi:10.17487/RFC3389. RFC 3389. Proposed Standard.
- ^ L. Gharai; C. Perkins; G. Goncher; A. Mankin (March 2003). RTP Payload Format for Society of Motion Picture and Television Engineers (SMPTE) 292M Video. Network Working Group. doi:10.17487/RFC3497. RFC 3497. Informational.
- ^ J. van der Meer; D. Mackie; V. Swaminathan; D. Singer; P. Gentric (November 2003). RTP Payload Format for Transport of MPEG-4 Elementary Streams. Network Working Group. doi:10.17487/RFC3640. RFC 3640. Proposed Standard. Updated by RFC 5691.
- ^ A. Duric; S. Andersen (December 2004). Real-time Transport Protocol (RTP) Payload Format for internet Low Bit Rate Codec (iLBC) Speech. Network Working Group. doi:10.17487/RFC3952. RFC 3952. Experimental.
- ^ L. Gharai; C. Perkins (September 2005). RTP Payload Format for Uncompressed Video. Network Working Group. doi:10.17487/RFC4175. RFC 4175. Proposed Standard. Updated by RFC 4421.
- ^ B. Link; T. Hager; J. Flaks (October 2005). RTP Payload Format for AC-3 Audio. Network Working Group. doi:10.17487/RFC4184. RFC 4184. Proposed Standard.
- ^ J. Sjoberg; M. Westerlund; A. Lakaniemi; S. Wenger (January 2006). RTP Payload Format for the Extended Adaptive Multi-Rate Wideband (AMR-WB+) Audio Codec. Network Working Group. doi:10.17487/RFC4352. RFC 4352. Proposed Standard.
- ^ R. Even (August 2006). RTP Payload Format for H.261 Video Streams. Network Working Group. doi:10.17487/RFC4587. RFC 4587. Proposed Standard. Obsoletes RFC 2032.
- ^ B. Link (August 2006). Real-time Transport Protocol (RTP) Payload Format for Enhanced AC-3 (E-AC-3) Audio. Network Working Group. doi:10.17487/RFC4598. RFC 4598. Proposed Standard.
- ^ J. Ott; C. Bormann; G. Sullivan; S. Wenger (January 2007). R. Even (ed.). RTP Payload Format for ITU-T Rec. H.263 Video. Network Working Group. doi:10.17487/RFC4629. RFC 4629. Proposed Standard. Obsoletes RFC 2429. Updates RFC 3555.
- ^ H. Schulzrinne; T. Taylor (October 2006). RTP Payload for DTMF Digits, Telephony Tones, and Telephony Signals. IETF Network Working Group. doi:10.17487/RFC4733. RFC 4733. Proposed Standard. Updated by RFC 4734, 5244. Obsoletes RFC 2833.
- ^ A. Sollaud (October 2006). RTP Payload Format for the G.729.1 Audio Codec. IETF Network Working Group. doi:10.17487/RFC4749. RFC 4749. Proposed Standard. Updated by RFC 5459.
- ^ Q. Xie; R. Kapoor (October 2006). Enhancements to RTP Payload Formats for EVRC Family Codecs. IETF Network Working Group. doi:10.17487/RFC4788. RFC 4788. Proposed Standard. Updated by RFC 5188. Updates RFC 3558.
- ^ J. Sjoberg; M. Westerlund; A. Lakaniemi; Q. Xie (April 2007). RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs. Network Working Group. doi:10.17487/RFC4867. RFC 4867. Proposed Standard. Obsoletes RFC 3267.
- ^ H. Desineni; Q. Xie (February 2008). RTP Payload Format for the Enhanced Variable Rate Wideband Codec (EVRC-WB) and the Media Subtype Updates for EVRC-B Codec. Network Working Group. doi:10.17487/RFC5188. RFC 5188. Proposed Standard. Updates RFC 4788.
- ^ L. Barbato (August 2008). RTP Payload Format for Vorbis Encoded Audio. Network Working Group. doi:10.17487/RFC5215. RFC 5215. Proposed Standard.
- ^ S. Futemma; E. Itakura; A. Leung (October 2008). RTP Payload Format for JPEG 2000 Video Streams. IETF Network Working Group. doi:10.17487/RFC5371. RFC 5371. Proposed Standard.
- ^ A. Sollaud (November 2008). RTP Payload Format for ITU-T Recommendation G.711.1. IETF Network Working Group. doi:10.17487/RFC5391. RFC 5391. Proposed Standard.
- ^ M. Westerlund; I. Johansson (January 2009). RTP Payload Format for G.719. Network Working Group. doi:10.17487/RFC5404. RFC 5404. Proposed Standard.
- ^ G. Herlein; J. Valin; A. Heggestad; A. Moizard (June 2009). RTP Payload Format for the Speex Codec. Internet Engineering Task Force Network Working Group. doi:10.17487/RFC5574. ISSN 2070-1721. RFC 5574. Proposed Standard.
- ^ P. Luthi; R. Even (July 2009). RTP Payload Format for ITU-T Recommendation G.722.1. Internet Engineering Task Force Network Working Group. doi:10.17487/RFC5577. ISSN 2070-1721. RFC 5577. Proposed Standard. Obsoletes RFC 3047.
- ^ M. Hatanaka; J. Matsumoto (July 2009). RTP Payload Format for the Adaptive TRansform Acoustic Coding (ATRAC) Family. Internet Engineering Task Force Network Working Group. doi:10.17487/RFC5584. ISSN 2070-1721. RFC 5584. Proposed Standard.
- ^ Y. Hiwasaki; H. Ohmuro (October 2009). RTP Payload Format for mU-law EMbedded Codec for Low-delay IP Communication (UEMCLIP) Speech Codec. IETF Network Working Group. doi:10.17487/RFC5686. RFC 5686. Proposed Standard.
- ^ X. Duan; S. Wang; M. Westerlund; K. Hellwig; I. Johansson (October 2010). RTP Payload Format for Global System for Mobile Communications Half Rate (GSM-HR). Internet Engineering Task Force. doi:10.17487/RFC5993. ISSN 2070-1721. RFC 5993. Proposed Standard.
- ^ Y.-K. Wang; R. Even; T. Kristensen; R. Jesup (May 2011). RTP Payload Format for H.264 Video. Internet Engineering Task Force (IETF). doi:10.17487/RFC6184. RFC 6184. Proposed Standard. Obsoletes RFC 3984.
- ^ S. Wenger; Y.-K. Wang; T. Schierl; A. Eleftheriadis (May 2011). RTP Payload Format for Scalable Video Coding. Internet Engineering Task Force. doi:10.17487/RFC6190. ISSN 2070-1721. RFC 6190. Proposed Standard.
- ^ M. Schmidt; F. de Bont; S. Doehla; J. Kim (October 2011). RTP Payload Format for MPEG-4 Audio/Visual Streams. Internet Engineering Task Force. doi:10.17487/RFC6416. ISSN 2070-1721. RFC 6416. Proposed Standard. Obsoletes RFC 3016.
- ^ K. Kobayashi; K. Mishima; S. Casner; C. Bormann (December 2011). RTP Payload Format for DV (IEC 61834) Video. Internet Engineering Task Force. doi:10.17487/RFC6469. ISSN 2070-1721. RFC 6469. Proposed Standard. Obsoletes RFC 3189.
- ^ J. Lindsay; H. Foerster (July 2014). RTP Payload Format for Standard apt-X and Enhanced apt-X Codecs. Internet Engineering Task Force. doi:10.17487/RFC7310. ISSN 2070-1721. RFC 7310. Proposed Standard.
- ^ J. Spittka; K. Vos; JM. Valin (June 2015). RTP Payload Format for the Opus Speech and Audio Codec. Internet Engineering Task Force. doi:10.17487/RFC7587. ISSN 2070-1721. RFC 7587. Proposed Standard.
- ^ P. Westin; H. Lundin; M. Glover; J. Uberti; F. Galligan (March 2016). RTP Payload Format for VP8 Video. Internet Engineering Task Force. doi:10.17487/RFC7741. ISSN 2070-1721. RFC 7741. Proposed Standard.
- ^ Y.-K. Wang; Y. Sanchez; T. Schierl; S. Wenger; M. M. Hannuksela (March 2016). RTP Payload Format for High Efficiency Video Coding (HEVC). Internet Engineering Task Force. doi:10.17487/RFC7798. ISSN 2070-1721. RFC 7798. Proposed Standard.
- ^ T. Bruylants; A. Descampe; C. Damman; T. Richter (June 2022). RTP Payload Format for ISO/IEC 21122 (JPEG XS). Internet Engineering Task Force. doi:10.17487/RFC9134. ISSN 2070-1721. RFC 9134. Proposed Standard.
- ^ D. Hanson; M. Faller; K. Maver (July 2024). RTP Payload Format for the Secure Communication Interoperability Protocol (SCIP) Codec. Internet Engineering Task Force. doi:10.17487/RFC9607. ISSN 2070-1721. RFC 9607. Proposed Standard.
- ^ J. Uberti; S. Holmer; M. Flodman; D. Hong (March 2025). RTP Payload Format for VP9 Video. Internet Engineering Task Force. doi:10.17487/RFC9628. ISSN 2070-1721. RFC 9628. Proposed Standard.
- ^ T. Terriberry (August 2013). Update to Remove DVI4 from the Recommended Codecs for the RTP Profile for Audio and Video Conferences with Minimal Control (RTP/AVP). Internet Engineering Task Force. doi:10.17487/RFC7007. ISSN 2070-1721. RFC 7007. Proposed Standard. Updates RFC 3551.
- ^ R. Kumar; M. Mostafa (May 2001). Conventions for the use of the Session Description Protocol (SDP) for ATM Bearer Connections. Network Working Group. doi:10.17487/RFC3108. RFC 3108. Proposed Standard.
- ^ XIL Programmer's Guide, Chapter 22 "CellB Codec". August 1997. Retrieved on 2014-07-19.
- ^ nv - network video on Henning Schulzrinne's website, Network Video on The University of Toronto's website, Retrieved on 2009-07-09.
- ^ Ron Frederick Github with source code
Overview and Fundamentals
Definition and Role in RTP
RTP payload formats are standardized specifications, primarily defined in Request for Comments (RFC) documents by the Internet Engineering Task Force (IETF), that outline the rules for structuring application-layer media data—such as audio or video—within the payload field of Real-time Transport Protocol (RTP) packets. These formats ensure that diverse media encodings can be reliably transported over IP networks in real-time applications, like voice over IP or video conferencing, by defining how the media data is packetized, including the placement of headers, markers, and extensions specific to the media type.[1][2]

In RTP, payload formats play a critical role in facilitating the multiplexing of multiple media streams within a single session, where each stream is distinguished by synchronization source identifiers (SSRCs) and payload type fields, allowing endpoints to demultiplex and process incoming data correctly. They also support synchronization by leveraging RTP timestamps to indicate the timing of media samples, enabling receivers to reconstruct the original timing and sequence despite network jitter or packet loss. Additionally, these formats address the challenges of variable-length payloads by specifying mechanisms for handling media units of differing sizes, ensuring efficient use of bandwidth without exceeding packet size limits.[1][2]

Payload formats serve as the essential bridge between upper-layer codecs—which generate encoded media data, such as those producing H.264 video streams—and the RTP transport layer, by defining how codec output is fragmented into smaller units for transmission or aggregated from multiple units into a single packet when appropriate. This includes precise rules for fragmentation, where large access data units (ADUs) from the codec are split across multiple RTP packets, and aggregation, where small ADUs are combined to optimize transmission efficiency and reduce overhead. Such mechanisms are vital for maintaining real-time performance, as they adapt the media data to RTP's fixed-header structure and underlying UDP transport.[1]

Registration of RTP payload formats occurs through the Internet Assigned Numbers Authority (IANA), which maintains a registry of media types associated with these formats. Static payload types, numbered 0 through 95, are pre-assigned by IETF profiles for common encodings and require formal standardization for use. In contrast, dynamic payload types, ranging from 96 to 127, are negotiated out-of-band during session setup—typically via protocols like Session Description Protocol (SDP)—allowing flexibility for proprietary or less common formats without permanent IANA assignment.[2]

Historical Development
The development of RTP payload formats originated within the Internet Engineering Task Force (IETF) Audio/Video Transport (AVT) working group during the early 1990s, driven by the need for standardized transport of real-time multimedia over IP networks. The foundational RTP specification, RFC 1889, published in January 1996, introduced the core RTP framework along with initial payload formats for common audio and video codecs, such as PCMU and H.261, enabling end-to-end delivery of time-sensitive data with features like payload type identification.[5] This marked the formal establishment of the IANA registry for RTP payload types in 1996, which began assigning static identifiers to ensure interoperability across diverse network applications.[6]

Early advancements focused on enhancing reliability and flexibility for multimedia transmission. In 1997, RFC 2198 defined a payload format for redundant audio data, allowing senders to include backup encodings within RTP packets to mitigate packet loss in unreliable networks without requiring separate retransmission mechanisms. As RTP adoption grew in the 2000s, the AVT working group evolved into the AVTCORE subgroup to address scalability and maintenance of core protocols, emphasizing guidelines for new payload formats to support emerging codecs and network conditions. This shift facilitated broader standardization, with RFC 4855 in 2007 updating the registration procedures for RTP payload formats as media subtypes, promoting dynamic payload type assignment and expert review to manage the expanding ecosystem.

By the 2010s, RTP payload formats increasingly integrated with web-based real-time communication, particularly through WebRTC, which adapted RTP for browser-native applications with enhanced congestion control and security. As of 2025, over 100 RFCs have defined RTP payload formats, reflecting sustained growth in support for diverse media types, including a rising emphasis on secure formats using SRTP encryption (RFC 3711) with payload formats registered as media types per RFC 4855.[6][7][8] Recent efforts, including the closure of the dedicated RTP payload format media types registry in RFC 9751 (March 2025), underscore the maturity of the framework by consolidating registrations under broader IANA media types.

Technical Foundations
RTP Packet Structure
The Real-time Transport Protocol (RTP) packet is structured to enable efficient delivery of real-time media data over IP networks, consisting of a fixed 12-byte header, optional contributing source (CSRC) identifiers, an optional extension header, and a variable-length payload field that carries the formatted media content.[9] This design supports synchronization, ordering, and identification of media streams while minimizing overhead for time-sensitive applications.[10]

The fixed RTP header occupies the first 12 bytes (octets 0 through 11) and includes essential fields for packet processing. It begins with a 1-byte field containing the version (V: 2 bits, set to 2 for the current RTP version), padding indicator (P: 1 bit, set to 1 if padding octets are present at the packet's end to align the total length to a multiple of 4 bytes), extension flag (X: 1 bit, set to 1 if an extension header follows), and CSRC count (CC: 4 bits, indicating the number of CSRC identifiers, ranging from 0 to 15).[10] The second byte comprises the marker bit (M: 1 bit, interpreted by the profile to signal events such as frame boundaries) and payload type (PT: 7 bits, which identifies the format of the data in the payload field).[10] Following these are the sequence number (16 bits, bytes 2-3, which increments by 1 for each RTP packet sent from the source to detect losses or reordering), timestamp (32 bits, bytes 4-7, representing the sampling instant of the first octet in the payload, in the clock rate of the media), and synchronization source identifier (SSRC: 32 bits, bytes 8-11, a random value that uniquely identifies the source of the stream within the RTP session).[10] The payload type field in the header is used during session setup to negotiate and select the appropriate format for encapsulating media data.[10]

Immediately after the fixed header comes the optional CSRC list, which consists of 0 to 15 SSRC identifiers (each 32 bits or 4 bytes), totaling up to 60 bytes if all 15 are present; this list is included when CC > 0, typically by RTP mixers to identify the contributing sources in a mixed stream.[10] If the extension flag (X) is set, an extension header follows the CSRC list (or the fixed header if no CSRC list), adding 4 bytes of fixed fields—a 16-bit profile-specific identifier and a 16-bit length field (indicating the number of 32-bit words in the extension)—plus a variable amount of profile-defined data for custom metadata, such as additional timing or control information.[11] The payload field, positioned after the fixed header, CSRC list, and any extension header, serves as a variable-length container for the actual media data in its specified format, with the total packet length determined by the underlying transport protocol (e.g., UDP).[10]

In diagram form, the RTP packet layout can be visualized as: fixed header (bytes 0-11), followed by CSRC list (0 to 15 × 4 bytes), optional extension header (4 bytes + variable), and the remaining bytes as payload; padding, if indicated, appears at the very end to ensure alignment without affecting the payload integrity.[9] This structure ensures that the payload integrates seamlessly with the header fields for real-time transport, allowing receivers to reconstruct timing and order the media stream effectively.[9]
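For illustration, the following sketch packs and parses the 12-byte fixed header described above using Python's standard struct module. It is a simplified example rather than a reference implementation: the names RtpHeader, build_rtp_header, and parse_rtp_header are chosen here for clarity, and CSRC lists, header extensions, and padding are not handled.

```python
import struct
from dataclasses import dataclass

@dataclass
class RtpHeader:
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int

def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header (version, flags, PT, sequence, timestamp, SSRC)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,             # V: 2 bits, 2 for the current RTP version
        padding=bool(b0 & 0x20),     # P: 1 bit
        extension=bool(b0 & 0x10),   # X: 1 bit
        csrc_count=b0 & 0x0F,        # CC: 4 bits
        marker=bool(b1 & 0x80),      # M: 1 bit
        payload_type=b1 & 0x7F,      # PT: 7 bits
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )

def build_rtp_header(pt: int, seq: int, ts: int, ssrc: int, marker: bool = False) -> bytes:
    """Build a minimal fixed header: version 2, no padding, extension, or CSRC list."""
    b0 = 2 << 6
    b1 = (0x80 if marker else 0x00) | (pt & 0x7F)
    return struct.pack("!BBHII", b0, b1, seq & 0xFFFF, ts & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# Example: header for the first 20 ms PCMU packet (payload type 0) of a stream.
header = build_rtp_header(pt=0, seq=1, ts=0, ssrc=0x12345678)
print(parse_rtp_header(header + bytes(160)))  # 160 zero bytes stand in for the G.711 payload
```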
Payload Type Identification

In RTP, payload types serve as identifiers that specify the format and encoding of the media data carried in the payload, enabling receivers to decode it correctly. These types are 7-bit values embedded in the RTP packet header, allowing for up to 128 distinct formats per session.[12] The assignment of payload types is divided into static and dynamic categories to balance standardization with flexibility for emerging media types.

Static payload types, ranging from 0 to 95, are pre-assigned by the Internet Assigned Numbers Authority (IANA) for commonly used audio and video encodings, as defined in the RTP/AVP profile. For instance, payload type 0 is assigned to PCMU (G.711 mu-law) audio at 8000 Hz, providing a fixed mapping that requires no negotiation for interoperability.[13] These assignments originated from earlier RTP specifications and were standardized in RFC 3551, which obsoletes prior profiles like RFC 1890, establishing a closed registry to prevent conflicts.[3] Updates to static assignments occur rarely through IETF processes, with new media types typically allocated dynamic values instead.

Dynamic payload types, numbered 96 to 127, are reserved for session-specific assignments and must be negotiated between endpoints to bind them to particular encodings, clock rates, or parameters. This negotiation commonly occurs via the Session Description Protocol (SDP) during session setup in protocols like SIP or WebRTC. In the SDP offer, the sender lists supported payload types in the media description line (e.g., m=audio 5004 RTP/AVP 0 96) and provides mappings for dynamic types using attributes like a=rtpmap:96 opus/48000/2. The receiver responds in its SDP answer by selecting compatible types from the offer, confirming the bindings.[13] If a mismatch occurs during negotiation—such as unsupported dynamic types—the receiver may reject the offer with an error code, like 488 Not Acceptable Here in SIP, prompting re-negotiation or session failure. This process ensures robust identification across diverse networks, with RFC 3551 providing the foundational guidelines for both static and dynamic usage.[4]
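As a simplified illustration of this binding, the sketch below extracts the payload-type mapping from the a=rtpmap lines of an SDP media description. The SDP text and the parse_rtpmap helper are hypothetical examples; real negotiation also reconciles offer and answer and honours a=fmtp parameters, which this sketch ignores.

```python
import re

SDP_OFFER = """\
m=audio 5004 RTP/AVP 0 8 96
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:96 opus/48000/2
a=fmtp:96 maxplaybackrate=16000
"""

def parse_rtpmap(sdp: str) -> dict:
    """Map payload type number -> (encoding name, clock rate, channel count)."""
    table = {}
    for m in re.finditer(r"a=rtpmap:(\d+) ([^/\s]+)/(\d+)(?:/(\d+))?", sdp):
        pt, name, clock, channels = m.groups()
        table[int(pt)] = (name, int(clock), int(channels) if channels else 1)
    return table

mapping = parse_rtpmap(SDP_OFFER)
print(mapping)  # {0: ('PCMU', 8000, 1), 8: ('PCMA', 8000, 1), 96: ('opus', 48000, 2)}
# Static types 0-95 keep their registered meaning; types 96-127 mean whatever
# the rtpmap lines of this particular session declare them to mean.
```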
General Payload Format Mechanisms
Encapsulation Principles
Encapsulation in RTP payload formats involves structuring media data, such as audio or video samples, within the variable-length payload field of an RTP packet, ensuring compatibility with the protocol's end-to-end transport functions for real-time applications.[2] The core principle is to preserve the original media timing information, achieved through the RTP timestamp field, which indicates the sampling instant of the first octet in the payload and enables synchronization at the receiver.[2] Additionally, RTP payloads support out-of-order delivery by leveraging the sequence number field, which increments monotonically for each packet, allowing receivers to reorder data and detect losses without relying on underlying network ordering guarantees.[1] This design facilitates robust transmission over potentially unreliable networks like IP/UDP.[2]

A key aspect of encapsulation is handling variable-sized media units, often termed Application Data Units (ADUs), through fragmentation and aggregation to optimize packet efficiency and avoid excessive overhead. For large ADUs, such as video frames exceeding the typical Maximum Transmission Unit (MTU) size of 1500 bytes, fragmentation splits the unit across multiple RTP packets, with the marker (M) bit in the RTP header set to indicate the final fragment of the unit, enabling proper reassembly.[1] Conversely, aggregation combines multiple small ADUs, for instance, comfort noise frames in audio streams, into a single RTP payload to reduce header overhead and improve bandwidth utilization, often using a table-of-contents structure to delineate boundaries within the payload.[1] These mechanisms ensure that payloads remain decodable independently where possible, aligning with RTP's application-level framing principle.[1]

Optional padding supports alignment requirements, particularly for fixed-size data blocks or encryption needs, by appending zeroed octets to the payload end, signaled via the padding (P) bit in the RTP header and a length indicator in the last padding octet.[2] This allows payloads to meet transport or security constraints without altering the media data itself. For enhanced reliability, encapsulation principles also accommodate forward error correction (FEC), as defined in RFC 2733, which specifies a generic payload format for protecting media streams against packet loss through redundant parity data integrated into RTP packets.[14] Payload type negotiation, typically via protocols like SDP, ensures the selected format adheres to these principles during session setup.[1]
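The two strategies can be sketched in a codec-agnostic way as follows. This is an illustrative example only: real payload formats define their own fragment headers and aggregation layouts, whereas here a packet is reduced to a (payload, marker) pair and the fragment_adu and aggregate_adus helpers are hypothetical.

```python
MAX_PAYLOAD = 1400  # rough room left in a 1500-byte Ethernet MTU after IP/UDP/RTP headers

def fragment_adu(adu: bytes, max_payload: int = MAX_PAYLOAD):
    """Split one large application data unit across several RTP payloads.
    The marker flag is set only on the final fragment of the unit."""
    packets = []
    for offset in range(0, len(adu), max_payload):
        chunk = adu[offset:offset + max_payload]
        last = offset + max_payload >= len(adu)
        packets.append((chunk, last))
    return packets

def aggregate_adus(adus, max_payload: int = MAX_PAYLOAD) -> bytes:
    """Bundle several small ADUs into one payload, each prefixed with a
    2-byte length so the receiver can locate the boundaries again."""
    payload = bytearray()
    for adu in adus:
        if len(payload) + 2 + len(adu) > max_payload:
            break  # anything that does not fit would go in the next packet
        payload += len(adu).to_bytes(2, "big") + adu
    return bytes(payload)

frame = bytes(5000)                                    # e.g. one large video frame
print([(len(p), m) for p, m in fragment_adu(frame)])   # three 1400-byte fragments plus a final 800-byte one with the marker
print(len(aggregate_adus([bytes(60)] * 5)))            # 5 x (2 + 60) = 310 bytes in one payload
```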
Handling of Timestamps and Sequence Numbers

In RTP, the sequence number serves as a 16-bit field in the packet header that is incremented by one for each successive RTP data packet sent by the source, enabling receivers to detect packet loss, duplication, and reordering while reconstructing the original transmission order.[2] This field wraps around after reaching 65535, at which point receivers employ extended sequence number tracking—combining the 16-bit value with an inferred cycle count—to maintain accurate loss detection over long sessions.[2] Initial sequence numbers are chosen randomly to enhance security against certain attacks.[2]

The timestamp field, a 32-bit unsigned integer in the RTP header, indicates the sampling instant of the first octet in the RTP data packet's payload, providing a reference for playout buffering, jitter compensation, and synchronization across media streams.[2] It derives from a clock with a frequency specified by the payload format, such as 8000 Hz for common audio encodings, where the timestamp increments by the number of samples per packet (e.g., 160 for a 20 ms frame at that rate).[2] Unlike wall-clock time, the RTP timestamp uses a monotonic counter that wraps around after 2^32 (approximately 4.29 billion) units, equivalent to about 13.3 hours at 90 kHz for video, and starts from a random offset to support encryption.[2]

Payload formats define precise mappings from media-specific timestamps to RTP timestamps, which may be absolute (starting from zero or a fixed reference at session initiation) or relative (incremental from prior packets), ensuring consistent synchronization regardless of encoding variations.[1] Clock rates for these timestamps are registered in the IETF media types registry as outlined in RFC 3550, allowing interoperability by associating each payload type with a defined sampling frequency.[2] For layered encodings, such as scalable video codecs, RFC 6051 extends RTP with header options carrying 56- or 64-bit NTP timestamps in packets, facilitating rapid synchronization of multiple interdependent flows by aligning relative timing across layers without relying solely on infrequent RTCP reports.[15]
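A minimal sketch of the extended sequence number idea is shown below. The SequenceTracker class and its thresholds are illustrative simplifications of the receiver bookkeeping that RFC 3550 describes, not a complete loss or jitter estimator.

```python
class SequenceTracker:
    """Track a 16-bit RTP sequence number across wraparound.

    Keeps a cycle count so the 16-bit field becomes a monotonically increasing
    integer that can be used for loss statistics over long sessions.
    """
    MOD = 1 << 16  # sequence numbers live in [0, 65535]

    def __init__(self, first_seq: int):
        self.cycles = 0
        self.highest = first_seq

    def update(self, seq: int) -> int:
        delta = (seq - self.highest) % self.MOD
        if delta < self.MOD // 2:      # plausible forward movement
            if seq < self.highest:     # e.g. 65535 -> 0: a wrap occurred
                self.cycles += 1
            self.highest = seq
        # larger deltas are treated as late or duplicate packets and do not
        # advance the highest extended sequence number seen so far
        return self.cycles * self.MOD + self.highest

tracker = SequenceTracker(65534)
for seq in (65535, 0, 1, 2):
    print(seq, "->", tracker.update(seq))  # extended values keep increasing past the wrap
```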
Audio Payload Formats
Core Audio Encoding Formats
Core audio encoding formats form the backbone of RTP-based audio transmission, particularly for real-time applications like telephony and streaming where low latency is paramount. These formats prioritize simplicity, robustness, and minimal processing overhead to ensure reliable delivery over packet-switched networks. Foundational examples include pulse-code modulation (PCM) variants, which provide uncompressed or lightly compressed audio suitable for narrowband voice communications.

The G.711 codec, standardized by the ITU-T, represents a core PCM variant widely used in RTP payloads. It employs either μ-law or A-law companding to encode 8-bit samples at a 64 kbps bit rate, supporting a sampling rate of 8 kHz for narrowband audio (300–3400 Hz bandwidth). In RTP, G.711 uses payload types 0 (μ-law) and 8 (A-law), with packets typically encapsulating a fixed 160 samples—equivalent to 20 ms of audio—to align with common timing intervals and facilitate synchronization. This fixed-packet approach minimizes jitter and supports low-latency telephony, as the straightforward octet-aligned packing requires no complex decompaction at the receiver.[16]

The G.729 codec, another ITU-T standard for low-bitrate speech, operates at 8 kbps using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). It supports an 8 kHz sampling rate for narrowband audio and uses RTP payload type 18, with packets carrying 160 samples (20 ms) in a fixed octet-aligned format. This format enables efficient transmission for bandwidth-constrained VoIP applications while maintaining toll-quality speech.[17]

For wideband audio, the G.722 codec extends capabilities to a 7 kHz bandwidth (50–7000 Hz), enabling higher-fidelity speech suitable for modern VoIP and conferencing. Defined in ITU-T Recommendation G.722, it operates at 48, 56, or 64 kbps using sub-band adaptive differential pulse-code modulation (ADPCM). The RTP payload format, specified in RFC 3551, uses payload type 9 and packs 320 samples per 20 ms packet at a nominal 16 kHz sampling rate, though the RTP timestamp clock rate is 8 kHz for backward compatibility. This format maintains low latency by avoiding inter-frame dependencies, making it ideal for interactive streaming applications.[18]

More advanced core formats like Opus provide versatile, variable-bitrate encoding for both speech and music, supporting bandwidths from narrowband (4 kHz) to fullband (20 kHz). Standardized in RFC 7587, Opus achieves bit rates from 6 kbps up to 510 kbps, adapting dynamically to network conditions while preserving low latency (frame sizes as small as 2.5 ms). The RTP payload format allows flexible packetization, including in-band forward error correction and multistream support for stereo or multichannel audio, making it a high-impact choice for low-delay streaming in WebRTC and similar protocols. The Opus codec, which incorporates SILK for narrowband and wideband modes, provides in-band Forward Error Correction (FEC) to mitigate packet losses by embedding redundant data within the audio stream; its specification was updated by RFC 8251 in October 2017, addressing issues like stereo state resets, resampler buffer management, and hybrid mode folding to improve error resilience and audio fidelity under network variability.[19][20]

To handle silence periods efficiently without transmitting unnecessary data, comfort noise generation is integrated via RFC 3389.
This payload format transports parameters for synthetic noise mimicking background ambiance, using a VAD (voice activity detection) signal to switch between active audio and comfort noise frames. Primarily designed for codecs like G.711 and G.729, it reduces bandwidth during quiet intervals while preventing unnatural silence, thus supporting natural-sounding low-latency telephony. The format uses a dedicated payload type (e.g., 13 for G.711) and includes spectral envelope and noise energy descriptors for reconstruction.[21]

Adaptive payload mechanisms further enhance flexibility by enabling mixed encodings within a single RTP stream, as exemplified by the format for Adaptive Multi-Rate (AMR) codecs in RFC 3267. AMR supports multiple modes (bit rates from 4.75 to 12.2 kbps for narrowband), allowing the encoder to select optimal configurations per frame based on channel conditions. The RTP payload uses a Table of Contents (TOC) field to indicate mode switches and frame interleaving within packets, facilitating seamless transitions without stream interruptions—crucial for robust, low-latency mobile telephony over variable networks.
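As a small worked example of the packetization arithmetic above, the snippet below computes the per-packet RTP timestamp increment for a 20 ms interval at the clock rates discussed in this section. The dictionary of clock rates is illustrative, and G.722 deliberately uses its 8000 Hz RTP clock rather than its 16 kHz sampling rate.

```python
PACKET_MS = 20  # common packetization interval for narrowband telephony

rtp_clock_rates = {   # RTP timestamp clock rate in Hz (not always the sampling rate)
    "PCMU (G.711)": 8000,
    "G.722": 8000,    # clock stays at 8000 for backward compatibility despite 16 kHz sampling
    "Opus": 48000,    # Opus always uses a 48000 Hz RTP clock
}

for name, clock in rtp_clock_rates.items():
    increment = clock * PACKET_MS // 1000
    print(f"{name}: timestamp += {increment} per {PACKET_MS} ms packet")

# PCMU (G.711): timestamp += 160 per 20 ms packet
# G.722: timestamp += 160 per 20 ms packet
# Opus: timestamp += 960 per 20 ms packet
```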
Specific Audio Codecs and RFCs

The Speex codec, designed as a variable bitrate speech codec optimized for low-bitrate applications, incorporates Voice Activity Detection (VAD) to enable silence suppression and efficient bandwidth usage. Its RTP payload format, defined in RFC 5574 published in June 2009, supports dynamic bit-rate switching ranging from 2.15 kbit/s to 44 kbit/s, with VBR modes activated via Session Description Protocol (SDP) parameters such as "vbr=on" or "vbr=vad" for VAD-integrated operation.[22] The payload structure encapsulates one or more Speex frames following the standard RTP header, ensuring octet alignment and in-band signaling for parameters like sampling rate (8, 16, or 32 kHz) and bit-rate, while prohibiting frame fragmentation across packets to maintain robustness.[22]

Advanced Audio Coding (AAC), a perceptual audio codec supporting high-quality stereo and multichannel audio, utilizes an RTP payload format initially specified in RFC 3016 from November 2000 for MPEG-4 audio streams.[23] This format employs the Low-overhead MPEG-4 Audio Transport Multiplex (LATM) to map audioMuxElements—each containing one or more audio frames—directly into the RTP payload, with the marker bit indicating complete or final fragments.[23] For High Efficiency AAC (HE-AAC), an extension enhancing low-bitrate performance, the payload was refined in RFC 3640 from November 2003, incorporating an AU (Access Unit) Header Section to delineate individual AAC frames and the Auxiliary Section (kept empty), alongside support for configurable timestamp rates like 48 kHz and scalable layering across separate RTP packets.[24]

To further bolster audio payload resilience against bursty packet losses common in IP networks, RFC 6015 defines a 1-D interleaved parity Forward Error Correction (FEC) scheme for RTP, where protection parameters like interleaving depth (L) are tuned to distribute losses across packets, outperforming non-interleaved methods in burst conditions while maintaining low overhead for typical audio streams.[25]
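The parity idea behind that scheme can be illustrated with the hedged sketch below: every L-th payload is XORed into one parity block, so a burst of up to L consecutive losses touches each parity group at most once. The xor_parity and interleaved_parity helpers are hypothetical simplifications; the actual RFC 6015 format also protects selected RTP header fields and carries the parity data in its own payload structure.

```python
def xor_parity(payloads):
    """XOR a group of payloads into one parity block, padding shorter ones with zeros."""
    size = max(len(p) for p in payloads)
    parity = bytearray(size)
    for p in payloads:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def interleaved_parity(payloads, L, D):
    """1-D interleaved parity over a D x L block of media payloads: parity
    column i protects payloads i, i+L, i+2L, ... so a burst of up to L
    consecutive losses hits each parity group at most once."""
    assert len(payloads) == L * D
    return [xor_parity(payloads[col::L]) for col in range(L)]

media = [bytes([n]) * 10 for n in range(12)]   # 12 media payloads of 10 bytes each
fec = interleaved_parity(media, L=4, D=3)      # 4 parity payloads

# Recover a single lost payload (say index 5) from the survivors in its parity group.
lost = 5
recovered = xor_parity([fec[lost % 4]] + [media[i] for i in range(lost % 4, 12, 4) if i != lost])
assert recovered == media[lost]
print("recovered payload", lost)
```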
Video Payload Formats
Video Encoding Basics
Video encoding relies on compression techniques that exploit spatial and temporal redundancies in video data to reduce bitrate while maintaining quality. Intra-frame coding compresses individual frames independently, treating each as a standalone image using spatial prediction within the frame, similar to still image codecs. Inter-frame coding, in contrast, predicts frames based on differences from previous or subsequent frames, employing motion compensation to model movement between them. These methods form the basis of modern video codecs, where a Group of Pictures (GOP) structures the sequence as a series of intra-coded I-frames (key frames for random access), predictive P-frames (forward-referenced), and optionally bi-directional B-frames (referencing both directions for higher efficiency).[26][27]

For real-time delivery over RTP, low-latency modes are essential, often achieved by avoiding B-frames, which introduce decoding delays due to their dependence on future frames. Short GOP structures with frequent I- or P-frames minimize buffering and enable quicker error recovery, prioritizing responsiveness in interactive applications like videoconferencing. This contrasts with storage-oriented encoding, where longer GOPs with B-frames optimize compression at the expense of latency.[28]

RTP payload formats must address challenges arising from video's variable frame sizes and structured data units. Encoded frames consist of slices—self-contained segments of macroblocks—that can vary significantly in size, with I-frames typically larger than P- or B-frames due to less redundancy. The RTP marker bit in the header signals the end of a frame across multiple packets, allowing receivers to delineate boundaries and synchronize decoding without full frame reassembly.[29][30]

Scalability in video encoding supports adaptive transmission by layering content for varying network conditions. Scalable Video Coding (SVC), an extension to H.264, organizes the bitstream into a base layer—providing a baseline video stream decodable by legacy H.264 devices—and one or more enhancement layers that incrementally improve spatial resolution, temporal frame rate, or signal-to-noise ratio (quality). In RTP, these layers can be transmitted in single or multiple sessions, enabling selective decoding based on bandwidth availability.[31]

A key constraint in RTP video transport is the network's Maximum Transmission Unit (MTU), typically 1500 bytes for Ethernet, leaving approximately 1400 bytes for payload after headers. High-definition (HD) video frames often exceed this, necessitating fragmentation into multiple RTP packets to prevent IP-layer fragmentation, which complicates loss recovery and increases overhead.[32][33]
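A hedged receive-side sketch of the marker-bit framing described above is shown below: packets sharing a timestamp belong to one frame, and the marker flags its final packet. The Packet tuple and reassemble_frames function are illustrative only and assume loss-free, in-order delivery, which a real depacketizer cannot assume.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "seq timestamp marker payload")

def reassemble_frames(packets):
    """Group packets into frames: same timestamp = same frame, marker = last packet."""
    frames, current, current_ts = [], [], None
    for pkt in packets:
        if current_ts is not None and pkt.timestamp != current_ts:
            current, current_ts = [], None    # timestamp changed without a marker: drop the partial frame
        current.append(pkt.payload)
        current_ts = pkt.timestamp
        if pkt.marker:                        # final packet of the frame
            frames.append(b"".join(current))
            current, current_ts = [], None
    return frames

stream = [
    Packet(1, 0, False, b"frame0-a"),
    Packet(2, 0, True, b"frame0-b"),
    Packet(3, 3000, True, b"frame1"),   # 90 kHz clock: 3000 ticks is roughly 33 ms later
]
print(reassemble_frames(stream))  # [b'frame0-aframe0-b', b'frame1']
```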
Key Video Codecs and Extensions

The RTP payload format for H.264/AVC, originally defined in RFC 3984 and now specified in RFC 6184, encapsulates Network Abstraction Layer (NAL) units into RTP packets to enable efficient transmission of compressed video over networks.[34] This format supports single NAL unit packets for straightforward transmission, while aggregation packets such as Single-Time Aggregation Packets (STAP-A) combine multiple NAL units with the same timestamp into one RTP packet, reducing header overhead for smaller units like parameter sets or slice headers.[34] Fragmentation Units (FU-A) address larger NAL units by splitting them across multiple RTP packets, using start (S) and end (E) indicators in the FU header to facilitate reassembly at the receiver, ensuring compatibility with varying MTU sizes and enhancing reliability in lossy environments.[34]

For the VP8 video codec, RFC 7741 specifies an RTP payload format that emphasizes error resilience through partition-based structures inherent to VP8's encoding.[35] Key frames, which are intra-coded and do not depend on prior frames, are signaled by setting the P-bit to 0 in the VP8 payload header, allowing decoders to identify synchronization points for recovery from packet loss.[35] The format partitions each frame into up to eight independent segments, with each partition typically carried in a separate RTP packet; the Partition Index (PID) field and Start-of-Partition (S) bit in the header guide the assembly, enabling graceful degradation if a partition is lost, as subsequent partitions can still be decoded with reduced quality.[35] This approach supports low-latency applications like video conferencing by minimizing re-encoding needs during transmission errors.[35]

The High Efficiency Video Coding (HEVC/H.265) RTP payload format, outlined in RFC 7798, builds on prior standards with enhanced support for scalability and metadata.[36] Temporal scalability extensions leverage the TemporalId in NAL unit headers to layer video streams, allowing middleboxes to drop higher temporal layers during congestion without affecting lower-layer decodability; parameters like sprop-sub-layer-id in SDP signal the maximum supported layers for session negotiation.[36] Supplemental Enhancement Information (SEI) messages, such as those for picture timing or user data, are embedded directly in the RTP payloads as NAL units, providing supplementary metadata for rendering and error handling without impacting core video decoding.[36] Additionally, parameter sets—including Video Parameter Set (VPS), Sequence Parameter Set (SPS), and Picture Parameter Set (PPS)—can be transmitted out-of-band via SDP attributes like sprop-vps, enabling session setup without requiring full keyframes and ensuring decoders initialize correctly before receiving coded slices.[36]

The Versatile Video Coding (VVC/H.266) RTP payload format, defined in RFC 9328, extends the NAL unit-based approach of previous codecs to support VVC's advanced compression and scalability features.[37] It allows packetization of one or more NAL units per RTP packet, with aggregation packets (type 28) combining multiple units from the same access unit to reduce overhead, and fragmentation units (type 29) splitting larger NAL units using start (S), end (E), and reserved (R) bits for reassembly.[37] Scalability is handled through LayerId and TemporalId fields in NAL headers, enabling temporal, spatial, and quality layers; SDP parameters such as sprop-sublayer-id and sprop-ols-id facilitate negotiation of supported layers and output layer sets.
Parameter sets (VPS, SPS, PPS) and SEI messages can be delivered in-band or out-of-band via SDP, supporting efficient initialization and metadata handling in real-time applications.[37]

The RTP payload format for the AV1 video codec, specified by the Alliance for Open Media in version 1.0.0 (December 2024), uses Open Bitstream Units (OBUs) for encapsulation, suitable for low-bitrate peer-to-peer to high-quality streaming scenarios.[38] It supports aggregation of multiple OBUs into a single RTP packet via an aggregation header indicating OBU count and sizes, and fragmentation of individual OBUs across packets with start (Z) and end (Y) flags for reassembly. Error resilience is enhanced by AV1's inherent structure, including key frame indicators in OBU headers and partition independence; the format leverages the Dependency Descriptor RTP header extension to describe frame dependencies in scalable streams, allowing selective layer transmission and decoder recovery from losses. As of 2025, this format is implemented in tools like FFmpeg and used in WebRTC for efficient, royalty-free video transport.[38]
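For concreteness, the sketch below fragments a single H.264 NAL unit into FU-A payloads in the way described above, copying the F and NRI bits into the FU indicator and setting the S and E bits in the FU header. It is a simplified, hedged example: the fragment_nal_fu_a helper is hypothetical, and a complete packetizer would also handle STAP-A aggregation, RTP headers, timestamps, and the marker bit on the last packet of an access unit.

```python
FU_A = 28  # NAL unit type 28 = Fragmentation Unit A in the H.264 payload format

def fragment_nal_fu_a(nal: bytes, max_payload: int = 1400):
    """Split one H.264 NAL unit into FU-A RTP payloads (illustrative sketch)."""
    if len(nal) <= max_payload:
        return [nal]                               # small enough: single NAL unit packet
    nal_header, body = nal[0], nal[1:]
    fu_indicator = (nal_header & 0xE0) | FU_A      # keep F and NRI bits, set type to 28
    nal_type = nal_header & 0x1F
    chunk = max_payload - 2                        # room left after the 2 FU bytes
    payloads = []
    for offset in range(0, len(body), chunk):
        start = offset == 0
        end = offset + chunk >= len(body)
        fu_header = (0x80 if start else 0) | (0x40 if end else 0) | nal_type
        payloads.append(bytes([fu_indicator, fu_header]) + body[offset:offset + chunk])
    return payloads

# A fake 4000-byte IDR slice NAL unit (header byte 0x65: F=0, NRI=3, type=5).
nal_unit = bytes([0x65]) + bytes(3999)
for i, p in enumerate(fragment_nal_fu_a(nal_unit)):
    print(i, len(p), f"S={p[1] >> 7}", f"E={(p[1] >> 6) & 1}")
```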
Text and Messaging Payload Formats

Text Transmission Protocols
Text transmission protocols in RTP payload formats enable the real-time conveyance of textual data, such as conversational messages or timed captions, within multimedia sessions. These formats are designed to handle low-latency requirements, resilience to packet loss, and synchronization with other media streams such as audio or video. Unlike audio or video payloads, text formats prioritize simplicity and efficiency because of the typically low data rates involved, and they generally use UTF-8 encoding for international character support. The key protocols focus either on character-by-character streaming or on structured markup for timed presentation, ensuring compatibility with applications such as IP telephony, video conferencing, and broadcast subtitles.[39][40][41]

The primary format for real-time text conversation is defined in RFC 4103, which specifies an RTP payload for ITU-T Recommendation T.140 text encoded in UTF-8. This format transmits text in small blocks or individual characters within dedicated RTP packets, using the RTP header's sequence numbers for ordering and timestamps at a 1000 Hz clock rate for synchronization. To mitigate packet loss on unreliable networks, it supports optional redundancy payloads as per RFC 2198, in which previous text blocks are interleaved with new ones. The MIME type is text/t140, with an optional "cps" parameter to cap the character transmission rate (30 characters per second by default), and the specification recommends roughly 300 ms of buffering to balance latency and reliability. This approach suits interactive scenarios, such as text telephony in multimedia calls, and obsoletes the earlier RFC 2793.[39]

For timed text applications, particularly subtitles in mobile and broadcast contexts, RFC 4396 defines an RTP payload format for 3GPP Timed Text. This standard encapsulates text samples from the 3GPP file format, including strings and modifier boxes for styling (e.g., font, color, position), into RTP packets. Each sample covers a duration (typically up to 8 seconds) and is sized to fit MTU constraints, with timestamps indicating display start times relative to the session. The format uses a 3GPP-specific header carrying the sample duration and supporting both aggregation of small samples and fragmentation of large ones across packets, and it is registered under the MIME type text/3gpp-tt. It is optimized for low-bandwidth mobile streaming and integrates with 3GPP multimedia services such as packet-switched streaming.[40]

Another structured approach is provided by RFC 8759, which defines an RTP payload for the Timed Text Markup Language (TTML), a W3C XML-based standard for synchronized text rendering. TTML documents, including their timing attributes and styling elements, are fragmented across multiple RTP packets when they exceed the path MTU, with the marker bit signaling the final fragment of a document. Timestamps reference the document's epoch using a default 1000 Hz clock, and the MIME type application/ttml+xml requires parameters such as the media timeBase for alignment with other streams. This format targets broadcast and streaming workflows, supports profiles such as EBU-TT for television subtitles, and emphasizes semantic richness over conversational simplicity.[41]
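To illustrate the document-level fragmentation behaviour described for RFC 8759, the following Python sketch splits a serialized TTML document into payload-sized fragments that share one RTP timestamp and sets the marker bit only on the final fragment; the packet structure, default fragment size, and names are illustrative assumptions rather than values taken from the RFC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RtpTextPacket:
    sequence_number: int
    timestamp: int   # identical for every fragment of one TTML document
    marker: bool     # set only on the final fragment
    payload: bytes


def fragment_ttml(document: str, timestamp: int, first_seq: int,
                  max_payload: int = 1200) -> List[RtpTextPacket]:
    """Split a UTF-8 TTML document into RTP-sized fragments.

    All fragments carry the document's RTP timestamp; the marker bit on the
    last fragment tells the receiver the document is complete and can be
    reassembled before XML parsing.
    """
    data = document.encode("utf-8")
    chunks = [data[i:i + max_payload]
              for i in range(0, len(data), max_payload)] or [b""]
    return [
        RtpTextPacket(
            sequence_number=first_seq + i,
            timestamp=timestamp,
            marker=(i == len(chunks) - 1),
            payload=chunk,
        )
        for i, chunk in enumerate(chunks)
    ]
```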
An extension for multiparty scenarios is defined in RFC 9071, which provides RTP-mixer formatting enhancements to the real-time text format of RFC 4103. This enables centralized mixing of text from multiple participants in conferences, using a mixer-specific payload structure to combine contributions while preserving timing and redundancy mechanisms.[42]

These protocols collectively address diverse text needs in RTP, from interactive chat to captioned media, while adhering to RTP's core mechanisms such as sequence numbering and optional forward error correction (e.g., RFC 2733). Implementations must consider security via SRTP to protect sensitive text content, as well as SDP offers for session negotiation, including rate limits and redundancy levels. Adoption has been prominent in standards such as SIP for real-time communication and 3GPP for mobile multimedia.[39][40][41]
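The redundancy mechanism that RFC 4103 borrows from RFC 2198 can be sketched as follows: a "text/red" payload announces each redundant T.140 block with a four-byte header and the primary block with a single byte. The payload type number and timestamp offset in this Python example are placeholder values that would normally be negotiated in SDP.

```python
import struct


def build_red_t140_payload(primary: bytes, redundant: bytes,
                           t140_pt: int = 98, ts_offset: int = 300) -> bytes:
    """Build a redundant text payload carrying one previous T.140 block.

    Each redundant block header (RFC 2198) is 4 bytes:
      F=1 (1 bit) | block PT (7 bits) | timestamp offset (14 bits) | length (10 bits)
    The primary block header is a single byte with F=0 and the block PT.
    A 300-tick offset corresponds to 300 ms at the 1000 Hz text clock.
    """
    if not 0 <= t140_pt < 128:
        raise ValueError("payload type must fit in 7 bits")
    if len(redundant) >= (1 << 10) or ts_offset >= (1 << 14):
        raise ValueError("redundant block or offset too large for the header")
    red_header = struct.pack(
        "!I", (1 << 31) | (t140_pt << 24) | (ts_offset << 10) | len(redundant)
    )
    primary_header = bytes([t140_pt])  # F=0 marks the final (primary) block
    return red_header + primary_header + redundant + primary
```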
Specialized Payload Formats

MIDI Data Encapsulation
MIDI data encapsulation in RTP involves packaging Musical Instrument Digital Interface (MIDI) 1.0 commands into RTP payloads, enabling real-time transmission of symbolic music performance data over IP networks, as distinct from sampled audio waveforms. The format supports all legal MIDI commands, including voice messages such as note-on and note-off, which are typically 3 bytes each (a status octet plus note number and velocity), as well as variable-length system exclusive (SysEx) messages that can span multiple packets. The encapsulation preserves MIDI's event-based structure, allowing efficient networked applications such as remote instrument control and collaborative music performance.[43]

The RTP payload consists of a MIDI command section with a variable-length header followed by a list of timestamped MIDI events, optionally followed by a recovery journal for error resilience. Events are timestamped relative to the RTP header timestamp using delta times encoded in 1 to 4 octets, enabling per-event timing; the cumulative deltas, taken modulo 2^32, must not exceed the next packet's timestamp. This delta-timed approach aligns with Standard MIDI File (SMF) semantics in "comex" mode, ensuring precise synchronization. Running status is preserved, so the status octet can be omitted for consecutive commands of the same type after the first, reducing payload size; for instance, a run of note-off events on the same channel can shrink from 3 bytes to 2 bytes per event. The RTP timestamp clock rate is configurable via SDP and is typically set to an audio sampling rate such as 44.1 kHz. In "buffer" mode, implementations may poll the MIDI buffer 1000 times per second, with the sampling interval (mperiod) computed as the clock rate divided by 1000.[43][44]

Extensions in the updated specification address challenges in live performance, particularly jitter-buffer adaptation to handle network variability while minimizing latency. Receivers can employ adaptive buffering strategies, such as processing recovery journals with delays informed by session parameters like guardtime, to maintain synchronization in ensemble settings. Transport over UDP or TCP is supported, and UDP multicast enables group communication akin to the MIDI DIN party-line topology, which suits distributed music ensembles where multiple participants share timing without individual acknowledgments. Sequence numbers in the RTP header aid packet reordering and loss detection, complementing the payload's timing mechanisms.[43][44]
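The delta times described above use a variable-length coding in which each octet carries seven value bits and a continuation flag, capped at four octets. The following Python sketch encodes a delta time in that SMF-style form; the helper name and example values are illustrative rather than taken from the specification.

```python
def encode_delta_time(ticks: int) -> bytes:
    """Encode a MIDI delta time as a 1-4 octet variable-length quantity.

    Each octet holds 7 bits of the value, most significant group first; the
    high bit of every octet except the last is set as a continuation flag,
    so values up to 2**28 - 1 fit in four octets.
    """
    if not 0 <= ticks < (1 << 28):
        raise ValueError("delta time must fit in four 7-bit groups")
    groups = [ticks & 0x7F]          # final octet, continuation flag clear
    ticks >>= 7
    while ticks:
        groups.append(0x80 | (ticks & 0x7F))  # earlier octets, flag set
        ticks >>= 7
    return bytes(reversed(groups))


# Example: at a 44.1 kHz RTP clock, an event 10 ms after the packet's RTP
# timestamp has a delta of 441 ticks; encode_delta_time(441) == b"\x83\x39".
```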
