Transmission Control Protocol
| Abbreviation | TCP |
|---|---|
| Developer(s) | Vint Cerf and Bob Kahn |
| Introduction | 1974 |
| Based on | Transmission Control Program |
| OSI layer | Transport layer (4) |
| RFC(s) | 9293 |
The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation in which it complemented the Internet Protocol (IP). Therefore, the entire suite is commonly referred to as TCP/IP. TCP provides reliable, ordered, and error-checked delivery of a stream of octets (bytes) between applications running on hosts communicating via an IP network. Major internet applications such as the World Wide Web, email, remote administration, file transfer and streaming media rely on TCP, which is part of the transport layer of the TCP/IP suite. SSL/TLS often runs on top of TCP. Today, TCP remains a core protocol for most Internet communication, ensuring reliable data transfer across diverse networks.[1]
TCP is connection-oriented, meaning that sender and receiver first need to establish a connection based on agreed parameters; they do this through a three-way handshake procedure.[2] The server must be listening (passive open) for connection requests from clients before a connection is established. The three-way handshake (active open), retransmission, and error detection add to reliability but lengthen latency. Applications that do not require reliable data stream service may use the User Datagram Protocol (UDP) instead, which provides a connectionless datagram service that prioritizes time over reliability. TCP employs network congestion avoidance. However, there are vulnerabilities in TCP, including denial of service, connection hijacking, TCP veto, and reset attacks.
Historical origin
In May 1974, Vint Cerf and Bob Kahn described an internetworking protocol for sharing resources using packet switching among network nodes.[3] The authors had been working with Gérard Le Lann to incorporate concepts from the French CYCLADES project into the new network.[4] The specification of the resulting protocol, RFC 675 (Specification of Internet Transmission Control Program), was written by Vint Cerf, Yogen Dalal, and Carl Sunshine, and published in December 1974.[5] It contains the first attested use of the term internet, as a shorthand for internetwork.[citation needed]
The Transmission Control Program incorporated both connection-oriented links and datagram services between hosts. In version 4, the monolithic Transmission Control Program was divided into a modular architecture consisting of the Transmission Control Protocol and the Internet Protocol.[6][7] This resulted in a networking model that became known informally as TCP/IP, although formally it was variously referred to as the DoD internet architecture model (DoD model for short) or DARPA model.[8][9][10] Later, it became part of, and synonymous with, the Internet Protocol Suite. TCP continues to evolve, with incremental updates and best practices formalized in RFCs such as RFC 9293 (2022).[11]
The following Internet Experiment Note (IEN) documents describe the evolution of TCP into the modern version:[12]
- IEN #5 Specification of Internet Transmission Control Program TCP Version 2 (March 1977)
- IEN #21 Specification of Internetwork Transmission Control Program TCP Version 3 (January 1978)
- IEN #27 A Proposal for TCP Version 3.1 Header Format (February 1978)
- IEN #40 Transmission Control Protocol Draft Version 4 (June 1978)
- IEN #44 Latest Header Formats (June 1978)
- IEN #55 Specification of Internetwork Transmission Control Protocol Version 4 (September 1978)
- IEN #81 Transmission Control Protocol Version 4 (February 1979)
- IEN #112 Transmission Control Protocol (August 1979)
- IEN #124 DOD STANDARD TRANSMISSION CONTROL PROTOCOL (December 1979)
TCP was standardized in January 1980 as RFC 761.
In 2004, Vint Cerf and Bob Kahn received the Turing Award for their foundational work on TCP/IP.[13][14]
Network function
The Transmission Control Protocol provides a communication service at an intermediate level between an application program and the Internet Protocol. It provides host-to-host connectivity at the transport layer of the Internet model. An application does not need to know the particular mechanisms for sending data via a link to another host, such as the required IP fragmentation to accommodate the maximum transmission unit of the transmission medium. At the transport layer, TCP handles all handshaking and transmission details and presents an abstraction of the network connection to the application typically through a network socket interface.
At the lower levels of the protocol stack, due to network congestion, traffic load balancing, or unpredictable network behavior, IP packets may be lost, duplicated, or delivered out of order. TCP detects these problems, requests re-transmission of lost data, rearranges out-of-order data and even helps minimize network congestion to reduce the occurrence of the other problems. If the data still remains undelivered, the source is notified of this failure. Once the TCP receiver has reassembled the sequence of octets originally transmitted, it passes them to the receiving application. Thus, TCP abstracts the application's communication from the underlying networking details.
TCP is optimized for accurate delivery rather than timely delivery and can incur relatively long delays (on the order of seconds) while waiting for out-of-order messages or re-transmissions of lost messages. Therefore, it is not particularly suitable for real-time applications such as voice over IP. For such applications, protocols like the Real-time Transport Protocol (RTP) operating over the User Datagram Protocol (UDP) are usually recommended instead.[15]
TCP is a reliable byte stream delivery service that guarantees that all bytes received will be identical and in the same order as those sent. Since packet transfer by many networks is not reliable, TCP achieves this using a technique known as positive acknowledgment with re-transmission. This requires the receiver to respond with an acknowledgment message as it receives the data. The sender keeps a record of each packet it sends and maintains a timer from when the packet was sent. The sender re-transmits a packet if the timer expires before receiving the acknowledgment. The timer is needed in case a packet gets lost or corrupted.[15]
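The acknowledge-and-retransmit loop described above can be sketched as a toy stop-and-wait sender over a hypothetical lossy channel. The channel, loss model, and function names here are illustrative, not part of TCP itself:

```python
import random

def send_with_retransmission(packets, loss_rate=0.3, max_tries=10, seed=42):
    """Sketch of positive acknowledgment with retransmission over a lossy
    channel (illustrative model, not a real TCP sender)."""
    rng = random.Random(seed)
    delivered = []
    for seq, payload in enumerate(packets):
        for attempt in range(max_tries):
            lost = rng.random() < loss_rate   # packet or its ACK is lost
            if not lost:
                delivered.append((seq, payload))  # receiver ACKs this packet
                break
            # timer expires without an ACK: retransmit with the same seq
        else:
            raise TimeoutError(f"packet {seq} undelivered; notify the source")
    return delivered

result = send_with_retransmission(["a", "b", "c"])
assert [p for _, p in result] == ["a", "b", "c"]
```

The per-packet timer and unchanged sequence number on retransmission mirror the mechanism described above; real TCP pipelines many segments rather than stopping and waiting.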
While IP handles actual delivery of the data, TCP keeps track of segments – the individual units of data transmission that a message is divided into for efficient routing through the network. For example, when an HTML file is sent from a web server, the TCP software layer of that server divides the file into segments and forwards them individually to the internet layer in the network stack. The internet layer software encapsulates each TCP segment into an IP packet by adding a header that includes (among other data) the destination IP address. When the client program on the destination computer receives them, the TCP software in the transport layer re-assembles the segments and ensures they are correctly ordered and error-free as it streams the file contents to the receiving application.
TCP segment structure
Transmission Control Protocol accepts data from a data stream, divides it into chunks, and adds a TCP header creating a TCP segment. The TCP segment is then encapsulated into an Internet Protocol (IP) datagram, and exchanged with peers.[16]
The term TCP packet appears in both informal and formal usage, whereas in more precise terminology segment refers to the TCP protocol data unit (PDU), datagram[17] to the IP PDU, and frame to the data link layer PDU:
Processes transmit data by calling on the TCP and passing buffers of data as arguments. The TCP packages the data from these buffers into segments and calls on the internet module [e.g. IP] to transmit each segment to the destination TCP.[18]
A TCP segment consists of a segment header and a data section. The segment header contains 10 mandatory fields, and an optional extension field (Options). The data section follows the header and is the payload data carried for the application.[19] The length of the data section is not specified in the segment header; it can be calculated by subtracting the combined length of the segment header and IP header from the total IP datagram length specified in the IP header.[citation needed]
The header layout, one 32-bit word per row (octet offset of the word on the left):

| Octet offset | Bits 0–31 |
|---|---|
| 0 | Source Port (16 bits), Destination Port (16 bits) |
| 4 | Sequence Number |
| 8 | Acknowledgement Number (meaningful when ACK bit set) |
| 12 | Data Offset (4 bits), Reserved (4 bits), flags CWR, ECE, URG, ACK, PSH, RST, SYN, FIN (1 bit each), Window (16 bits) |
| 16 | Checksum (16 bits), Urgent Pointer (16 bits, meaningful when URG bit set)[20] |
| 20–56 | (Options) If present, Data Offset will be greater than 5. Padded with zeroes to a multiple of 32 bits, since Data Offset counts words of 4 octets. |
| 60 onward | Data |
- Source Port: 16 bits
- Identifies the sending port.
- Destination Port: 16 bits
- Identifies the receiving port.
- Sequence Number: 32 bits
- Has a dual role:
- If the SYN flag is set (1), then this is the initial sequence number. The sequence number of the actual first data byte and the acknowledged number in the corresponding ACK are then this sequence number plus 1.
- If the SYN flag is unset (0), then this is the accumulated sequence number of the first data byte of this segment for the current session.
- Acknowledgment Number: 32 bits
- If the ACK flag is set then the value of this field is the next sequence number that the sender of the ACK is expecting. This acknowledges receipt of all prior bytes (if any).[21] The first ACK sent by each end acknowledges the other end's initial sequence number itself, but no data.[22]
- Data Offset (DOffset): 4 bits
- Specifies the size of the TCP header in 32-bit words. The minimum size header is 5 words and the maximum is 15 words thus giving the minimum size of 20 bytes and maximum of 60 bytes, allowing for up to 40 bytes of options in the header. This field gets its name from the fact that it is also the offset from the start of the TCP segment to the actual data.[citation needed]
- Reserved (Rsrvd): 4 bits
- For future use and should be set to zero; senders should not set these and receivers should ignore them if set, in the absence of further specification and implementation.
- From 2003 to 2017, the last bit (bit 103 of the header) was defined as the NS (Nonce Sum) flag by the experimental RFC 3540, ECN-nonce. ECN-nonce never gained widespread use and the RFC was moved to Historic status.[23]
- An RFC draft[24] proposes a new use for this bit: negotiating the use of Accurate ECN.
- Flags: 8 bits
- Contains 8 1-bit flags (control bits) as follows. When using tcpdump, a set flag is indicated with the character in parentheses.
- CWR (W): 1 bit
- Congestion window reduced (CWR) flag is set by the sending host to indicate that it received a TCP segment with the ECE flag set and had responded in congestion control mechanism.[25][a]
- ECE (E): 1 bit
- ECN-Echo has a dual role, depending on the value of the SYN flag. It indicates:
- If the SYN flag is set (1), the TCP peer is ECN capable.[26]
- If the SYN flag is unset (0), a packet with the Congestion Experienced flag set (ECN=11) in its IP header was received during normal transmission.[a] This serves as an indication of network congestion (or impending congestion) to the TCP sender.[27]
- URG (U): 1 bit
- Indicates that the Urgent pointer field is significant.
- ACK (.): 1 bit
- Indicates that the Acknowledgment field is significant. All packets after the initial SYN packet sent by the client should have this flag set.[28]
- PSH (P): 1 bit
- Push function. Asks to push the buffered data to the receiving application.
- RST (R): 1 bit
- Reset the connection
- SYN (S): 1 bit
- Synchronize sequence numbers. Only the first packet sent from each end should have this flag set. Some other flags and fields change meaning based on this flag, and some are only valid when it is set, and others when it is clear.
- FIN (F): 1 bit
- Last packet from sender
- Window: 16 bits
- The size of the receive window, which specifies the number of window size units[b] that the sender of this segment is currently willing to receive.[c] (See § Flow control and § Window scaling.)
- Checksum: 16 bits
- The 16-bit checksum field is used for error-checking of the TCP header, the payload and an IP pseudo-header. The pseudo-header consists of the source IP address, the destination IP address, the protocol number for the TCP protocol (6) and the length of the TCP headers and payload (in bytes).
- Urgent Pointer: 16 bits
- If the URG flag is set, then this 16-bit field is an offset from the sequence number indicating the last urgent data byte.
- Options (TCP Option): Variable 0–320 bits, in units of 32 bits; size in bits is `(Data Offset − 5) × 32`
- The length of this field is determined by the Data Offset field. The TCP header padding is used to ensure that the TCP header ends, and data begins, on a 32-bit boundary. The padding is composed of zeros.[18]
- Options have up to three fields: Option-Kind (1 byte), Option-Length (1 byte), Option-Data (variable). The Option-Kind field indicates the type of option and is the only field that is not optional. Depending on the Option-Kind value, the next two fields may be present. Option-Length indicates the total length of the option, and Option-Data contains data associated with the option, if applicable. For example, an Option-Kind byte of 1 indicates a no-operation option used only for padding, and is not followed by Option-Length or Option-Data fields. An Option-Kind byte of 0 marks the end of options, and is also only one byte. An Option-Kind byte of 2 indicates the Maximum Segment Size option, and is followed by an Option-Length byte specifying the length of the MSS field. Option-Length is the total length of the given option field, including the Option-Kind and Option-Length bytes. So while the MSS value is typically expressed in two bytes, Option-Length will be 4. As an example, an MSS option field with a value of 0x05B4 is coded as (0x02 0x04 0x05B4) in the TCP options section.
- Some options may only be sent when SYN is set; they are indicated below as [SYN]. Option-Kind and standard lengths are given as (Option-Kind, Option-Length).
| Option-Kind | Option-Length | Option-Data | Purpose | Notes |
|---|---|---|---|---|
| 0 | — | — | End of options list | |
| 1 | — | — | No operation | This may be used to align option fields on 32-bit boundaries for better performance. |
| 2 | 4 | SS | Maximum segment size | See § Maximum segment size for details. [SYN] |
| 3 | 3 | S | Window scale | See § Window scaling for details.[29] [SYN] |
| 4 | 2 | — | Selective Acknowledgement permitted | See § Selective acknowledgments for details.[30] [SYN] |
| 5 | N (10, 18, 26, or 34) | BBBB, EEEE, ... | Selective ACKnowledgement (SACK)[31] | These first two bytes are followed by a list of 1–4 blocks being selectively acknowledged, specified as 32-bit begin/end pointers. |
| 8 | 10 | TTTT, EEEE | Timestamp and echo of previous timestamp | See § TCP timestamps for details.[29] |
| 28 | 4 | — | User Timeout Option | See RFC 5482. |
| 29 | N | — | TCP Authentication Option (TCP-AO) | For message authentication, replacing MD5 authentication (option 19) originally designed to protect BGP sessions.[32] See RFC 5925. |
| 30 | N | — | Multipath TCP (MPTCP) | See Multipath TCP for details. |
- The remaining Option-Kind values are historical, obsolete, experimental, not yet standardized, or unassigned. Option number assignments are maintained by the Internet Assigned Numbers Authority (IANA).[33]
- Data: Variable
- The payload of the TCP packet
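As a sketch of the layout above, the ten mandatory fields can be unpacked from raw bytes with Python's standard `struct` module. The function name and the sample segment are illustrative:

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the 10 mandatory TCP header fields from raw bytes,
    following the field layout described above."""
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = offset_flags >> 12      # header length in 32-bit words
    flags = offset_flags & 0x00FF         # CWR..FIN control bits
    header_len = data_offset * 4
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "data_offset": data_offset,
        "flags": {name: bool(flags & bit) for name, bit in
                  [("CWR", 0x80), ("ECE", 0x40), ("URG", 0x20), ("ACK", 0x10),
                   ("PSH", 0x08), ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01)]},
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],
        "payload": segment[header_len:],
    }

# A minimal SYN segment: ports 1234 -> 80, seq 1000, Data Offset 5, SYN set.
syn = struct.pack("!HHIIHHHH", 1234, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
assert hdr["flags"]["SYN"] and not hdr["flags"]["ACK"]
assert hdr["data_offset"] == 5 and hdr["options"] == b""
```

Since Data Offset is 5 here, the header is exactly 20 bytes and there are no options; a Data Offset greater than 5 would leave a non-empty `options` slice.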
Protocol operation
TCP protocol operations may be divided into three phases. Connection establishment is a multi-step handshake process that establishes a connection before entering the data transfer phase. After data transfer is completed, the connection termination closes the connection and releases all allocated resources.
A TCP connection is managed by an operating system through a resource that represents the local end-point for communications, the Internet socket. During the lifetime of a TCP connection, the local end-point undergoes a series of state changes:[34]
| State | Endpoint | Description |
|---|---|---|
| LISTEN | Server | Waiting for a connection request from any remote TCP end-point. |
| SYN-SENT | Client | Waiting for a matching connection request after having sent a connection request. |
| SYN-RECEIVED | Server | Waiting for a confirming connection request acknowledgment after having both received and sent a connection request. |
| ESTABLISHED | Server and client | An open connection, data received can be delivered to the user. The normal state for the data transfer phase of the connection. |
| FIN-WAIT-1 | Server and client | Waiting for a connection termination request from the remote TCP, or an acknowledgment of the connection termination request previously sent. |
| FIN-WAIT-2 | Server and client | Waiting for a connection termination request from the remote TCP. |
| CLOSE-WAIT | Server and client | Waiting for a connection termination request from the local user. |
| CLOSING | Server and client | Waiting for a connection termination request acknowledgment from the remote TCP. |
| LAST-ACK | Server and client | Waiting for an acknowledgment of the connection termination request previously sent to the remote TCP (which includes an acknowledgment of its connection termination request). |
| TIME-WAIT | Server or client | Waiting for enough time to pass to be sure that all remaining packets on the connection have expired. |
| CLOSED | Server and client | No connection state at all. |
Connection establishment
Before a client attempts to connect with a server, the server must first bind to and listen at a port to open it up for connections: this is called a passive open. Once the passive open is established, a client may establish a connection by initiating an active open using the three-way (or 3-step) handshake:
- SYN: The active open is performed by the client sending a SYN to the server. The client sets the segment's sequence number to a random value A.
- SYN-ACK: In response, the server replies with a SYN-ACK. The acknowledgment number is set to one more than the received sequence number i.e. A+1, and the sequence number that the server chooses for the packet is another random number, B.
- ACK: Finally, the client sends an ACK back to the server. The sequence number is set to the received acknowledgment value i.e. A+1, and the acknowledgment number is set to one more than the received sequence number i.e. B+1.
Steps 1 and 2 establish and acknowledge the sequence number for one direction (client to server). Steps 2 and 3 establish and acknowledge the sequence number for the other direction (server to client). Following the completion of these steps, both the client and server have received acknowledgments and a full-duplex communication is established.
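In practice the operating system's TCP implementation performs this handshake; applications see only the passive and active opens through the socket API. A minimal loopback sketch (the payload and thread structure are illustrative):

```python
import socket
import threading

def passive_open(ready: threading.Event, port_holder: list):
    """Server side: bind and listen (passive open); the kernel completes
    the three-way handshake when a client connects."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
    srv.listen(1)                 # LISTEN state
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()        # returns once the connection is ESTABLISHED
    conn.sendall(b"hello")
    conn.close()
    srv.close()

ready, port_holder = threading.Event(), []
threading.Thread(target=passive_open, args=(ready, port_holder)).start()
ready.wait()

# Client side: active open; connect() sends the SYN and blocks until the
# SYN-ACK / ACK exchange completes.
cli = socket.create_connection(("127.0.0.1", port_holder[0]))
data = cli.recv(5)
cli.close()
assert data == b"hello"
```

`accept()` on the server and `create_connection()` on the client both return only after the kernels have exchanged the SYN, SYN-ACK, and ACK segments described in steps 1–3.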
Connection termination
The connection termination phase uses a four-way handshake, with each side of the connection terminating independently. When an endpoint wishes to stop its half of the connection, it transmits a FIN packet, which the other end acknowledges with an ACK. Therefore, a typical tear-down requires a pair of FIN and ACK segments from each TCP endpoint. After the side that sent the first FIN has responded with the final ACK, it waits for a timeout before finally closing the connection, during which time the local port is unavailable for new connections; this state lets the TCP client resend the final acknowledgment to the server in case the ACK is lost in transit. The time duration is implementation-dependent, but some common values are 30 seconds, 1 minute, and 2 minutes. After the timeout, the client enters the CLOSED state and the local port becomes available for new connections.[35]
It is also possible to terminate the connection by a 3-way handshake, when host A sends a FIN and host B replies with a FIN & ACK (combining two steps into one) and host A replies with an ACK.[36]
Some operating systems, such as Linux,[37] implement a half-duplex close sequence. If the host actively closes a connection while still having unread incoming data available, the host sends a RST (losing any received data) instead of a FIN. This assures a TCP application that there was data loss.[38]
A connection can be in a half-open state, in which case one side has terminated the connection, but the other has not. The side that has terminated can no longer send any data into the connection, but the other side can. The terminating side should continue reading the data until the other side terminates as well.[39][40]
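The half-close described above is exposed to applications through `shutdown`: the closing side sends its FIN but can keep reading. A minimal loopback sketch (the uppercase-echo behavior is purely illustrative):

```python
import socket
import threading

def read_all_then_reply(srv):
    """Server: read until the peer's FIN (recv() returns b'' at end of
    stream), then reply on our still-open direction."""
    conn, _ = srv.accept()
    received = b""
    while chunk := conn.recv(1024):
        received += chunk
    conn.sendall(received.upper())   # our half of the connection is still open
    conn.close()
    srv.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=read_all_then_reply, args=(srv,)).start()

cli = socket.create_connection(srv.getsockname())
cli.sendall(b"half close")
cli.shutdown(socket.SHUT_WR)   # send FIN: half-open state, we can still receive
reply = cli.recv(1024)
cli.close()
assert reply == b"HALF CLOSE"
```

After `shutdown(SHUT_WR)` the client can no longer send data, but it continues reading until the server closes its half as well, matching the half-open behavior described above.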
Resource usage
Most implementations allocate an entry in a table that maps a session to a running operating system process. Because TCP packets do not include a session identifier, both endpoints identify the session using the client's address and port. Whenever a packet is received, the TCP implementation must perform a lookup on this table to find the destination process. Each entry in the table is known as a Transmission Control Block or TCB. It contains information about the endpoints (IP and port), status of the connection, running data about the packets that are being exchanged and buffers for sending and receiving data.
The number of sessions in the server side is limited only by memory and can grow as new connections arrive, but the client must allocate an ephemeral port before sending the first SYN to the server. This port remains allocated during the whole conversation and effectively limits the number of outgoing connections from each of the client's IP addresses. If an application fails to properly close unrequired connections, a client can run out of resources and become unable to establish new TCP connections, even from other applications.
Both endpoints must also allocate space for unacknowledged packets and received (but unread) data.
Data transfer
The Transmission Control Protocol differs from the User Datagram Protocol in several key features:
- Ordered data transfer: the destination host rearranges segments according to a sequence number[15]
- Retransmission of lost packets: any cumulative stream not acknowledged is retransmitted[15]
- Error-free data transfer: corrupted packets are treated as lost and are retransmitted[16]
- Flow control: limits the rate a sender transfers data to guarantee reliable delivery. The receiver continually informs the sender of how much data can be received. When the receiving host's buffer fills, the next acknowledgment suspends the transfer and allows the data in the buffer to be processed.[15]
- Congestion control: lost packets (presumed due to congestion) trigger a reduction in data delivery rate[15]
Reliable transmission
TCP uses a sequence number to identify each byte of data. The sequence number identifies the order of the bytes sent from each computer so that the data can be reconstructed in order, regardless of any out-of-order delivery that may occur. The sequence number of the first byte is chosen by the transmitter for the first packet, which is flagged SYN. This number can be arbitrary, and should, in fact, be unpredictable to defend against TCP sequence prediction attacks.
Acknowledgments (ACKs) are sent with a sequence number by the receiver of data to tell the sender that data has been received to the specified byte. ACKs do not imply that the data has been delivered to the application, they merely signify that it is now the receiver's responsibility to deliver the data.
Reliability is achieved by the sender detecting lost data and retransmitting it. TCP uses two primary techniques to identify loss: retransmission timeout (RTO) and duplicate cumulative acknowledgments (DupAcks).
When a TCP segment is retransmitted, it retains the same sequence number as the original delivery attempt. This conflation of delivery and logical data ordering means that, when acknowledgment is received after a retransmission, the sender cannot tell whether the original transmission or the retransmission is being acknowledged, the so-called retransmission ambiguity.[41] TCP incurs complexity due to retransmission ambiguity.[42]
Duplicate-ACK-based retransmission
If a single segment (say segment number 100) in a stream is lost, then the receiver cannot acknowledge packets above that segment number (100) because it uses cumulative ACKs. Hence the receiver acknowledges packet 99 again on the receipt of another data packet. This duplicate acknowledgement is used as a signal for packet loss. That is, if the sender receives three duplicate acknowledgments, it retransmits the last unacknowledged packet. A threshold of three is used because the network may reorder segments causing duplicate acknowledgements. This threshold has been demonstrated to avoid spurious retransmissions due to reordering.[43] Some TCP implementations use selective acknowledgements (SACKs) to provide explicit feedback about the segments that have been received. This greatly improves TCP's ability to retransmit the right segments.
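The duplicate-ACK counting described above can be sketched as a small state machine. This is illustrative only, not a full TCP sender:

```python
def fast_retransmit(acks, threshold=3):
    """Count duplicate cumulative ACKs; after `threshold` duplicates of the
    same ACK number, signal a fast retransmit of the segment starting at
    that sequence number (illustrative sketch)."""
    last_ack, dup_count, retransmitted = None, 0, []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                retransmitted.append(ack)  # retransmit segment starting at `ack`
        else:
            last_ack, dup_count = ack, 0
    return retransmitted

# Segment at seq 100 is lost: the receiver keeps ACKing 100 as later
# segments arrive, so the sender sees three duplicates and retransmits.
assert fast_retransmit([100, 100, 100, 100]) == [100]
# Mild reordering (fewer than 3 duplicates) does not trigger a retransmit.
assert fast_retransmit([100, 100, 101]) == []
```

The threshold of three duplicates matches the classic fast-retransmit rule; RACK-style implementations replace this counter with timer-based detection.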
Retransmission ambiguity can cause spurious fast retransmissions and congestion avoidance if there is reordering beyond the duplicate acknowledgment threshold.[44] In the last two decades more packet reordering has been observed over the Internet,[45] which led TCP implementations, such as the one in the Linux kernel, to adopt heuristic methods to scale the duplicate acknowledgment threshold.[46] Recently, there have been efforts to completely phase out duplicate-ACK-based fast retransmissions and replace them with timer-based ones[47] (not to be confused with the classic RTO discussed below). The time-based loss detection algorithm called Recent Acknowledgment (RACK)[48] has been adopted as the default algorithm in Linux and Windows.[49]
Timeout-based retransmission
When a sender transmits a segment, it initializes a timer with a conservative estimate of the arrival time of the acknowledgment. The segment is retransmitted if the timer expires, with a new timeout threshold of twice the previous value, resulting in exponential backoff behavior. Typically, the initial timer value is smoothed RTT + max(G, 4 × RTT variation), where G is the clock granularity.[50] This guards against excessive transmission traffic due to faulty or malicious actors, such as man-in-the-middle denial of service attackers.
Accurate RTT estimates are important for loss recovery, as it allows a sender to assume an unacknowledged packet to be lost after sufficient time elapses (i.e., determining the RTO time).[51] Retransmission ambiguity can lead a sender's estimate of RTT to be imprecise.[51] In an environment with variable RTTs, spurious timeouts can occur:[52] if the RTT is under-estimated, then the RTO fires and triggers a needless retransmit and slow-start. After a spurious retransmission, when the acknowledgments for the original transmissions arrive, the sender may believe them to be acknowledging the retransmission and conclude, incorrectly, that segments sent between the original transmission and retransmission have been lost, causing further needless retransmissions to the extent that the link truly becomes congested;[53][54] selective acknowledgement can reduce this effect.[55] RFC 6298 specifies that implementations must not use retransmitted segments when estimating RTT.[56] Karn's algorithm ensures that a good RTT estimate will be produced—eventually—by waiting until there is an unambiguous acknowledgment before adjusting the RTO.[57] After spurious retransmissions, however, it may take significant time before such an unambiguous acknowledgment arrives, degrading performance in the interim.[58] TCP timestamps also resolve the retransmission ambiguity problem in setting the RTO,[56] though they do not necessarily improve the RTT estimate.[59]
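The RTO formula above can be sketched as a small estimator following RFC 6298, with Jacobson's smoothing constants alpha = 1/8 and beta = 1/4 (times in seconds; the class name is illustrative):

```python
class RtoEstimator:
    """RTT smoothing and RTO computation per RFC 6298 (alpha=1/8, beta=1/4)."""
    def __init__(self, clock_granularity=0.001):
        self.g = clock_granularity
        self.srtt = None     # smoothed round-trip time
        self.rttvar = None   # round-trip time variation

    def sample(self, rtt):
        if self.srtt is None:                 # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        return self.rto()

    def rto(self, floor=1.0):
        # RTO = SRTT + max(G, 4 * RTTVAR), clamped to at least 1 second
        return max(floor, self.srtt + max(self.g, 4 * self.rttvar))

est = RtoEstimator()
est.sample(0.100)   # SRTT = 0.1, RTTVAR = 0.05; 0.1 + 0.2 clamps up to 1.0 s
assert abs(est.rto() - 1.0) < 1e-9
```

On timer expiry a real implementation would double this RTO (exponential backoff) and, per Karn's algorithm, exclude retransmitted segments from future RTT samples.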
Error detection
[edit]Sequence numbers allow receivers to discard duplicate packets and properly sequence out-of-order packets. Acknowledgments allow senders to determine when to retransmit lost packets.
To assure correctness a checksum field is included; see § Checksum computation for details. The TCP checksum is a weak check by modern standards and is normally paired with a CRC integrity check at layer 2, below both TCP and IP, such as is used in PPP or the Ethernet frame. However, introduction of errors in packets between CRC-protected hops is common and the 16-bit TCP checksum catches most of these.[60]
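The checksum over the pseudo-header, TCP header, and payload can be computed as a one's-complement sum, as described above. A minimal IPv4 sketch (the sample addresses and ports are arbitrary):

```python
import struct

def tcp_checksum(src_ip: bytes, dst_ip: bytes, tcp_segment: bytes) -> int:
    """One's-complement checksum over the IPv4 pseudo-header (source IP,
    destination IP, zero byte, protocol 6, TCP length) plus the segment."""
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(tcp_segment))
    data = pseudo + tcp_segment
    if len(data) % 2:                     # pad odd length with a zero byte
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total > 0xFFFF:                 # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# When the Checksum field (at byte offset 16) holds the correct value,
# recomputing the checksum over the whole segment yields 0.
src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
seg = bytearray(struct.pack("!HHIIHHHH", 1234, 80, 0, 0, 5 << 12, 65535, 0, 0))
csum = tcp_checksum(src, dst, bytes(seg))
struct.pack_into("!H", seg, 16, csum)
assert tcp_checksum(src, dst, bytes(seg)) == 0
```

The zero-result verification trick is the same one receivers use: a segment with a valid checksum sums (with its checksum included) to the all-ones value, whose complement is zero.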
Flow control
TCP uses an end-to-end flow control protocol to avoid having the sender send data too fast for the TCP receiver to receive and process it reliably. Having a mechanism for flow control is essential in an environment where machines of diverse network speeds communicate. For example, if a PC sends data to a smartphone that is slowly processing received data, the smartphone must be able to regulate the data flow so as not to be overwhelmed.[15]
TCP uses a sliding window flow control protocol. In each TCP segment, the receiver specifies in the receive window field the amount of additionally received data (in bytes) that it is willing to buffer for the connection. The sending host can send only up to that amount of data before it must wait for an acknowledgment and receive window update from the receiving host.

When a receiver advertises a window size of 0, the sender stops sending data and starts its persist timer. The persist timer is used to protect TCP from a deadlock situation that could arise if a subsequent window size update from the receiver is lost, and the sender cannot send more data until receiving a new window size update from the receiver. When the persist timer expires, the TCP sender attempts recovery by sending a small packet so that the receiver responds by sending another acknowledgment containing the new window size.
If a receiver is processing incoming data in small increments, it may repeatedly advertise a small receive window. This is referred to as the silly window syndrome, since it is inefficient to send only a few bytes of data in a TCP segment, given the relatively large overhead of the TCP header.
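The sender-side window arithmetic described above can be sketched in a few lines (the function name is illustrative):

```python
def sender_allowance(last_ack, advertised_window, next_seq):
    """Bytes the sender may still transmit under sliding-window flow control:
    the receiver's advertised window minus the data already in flight."""
    in_flight = next_seq - last_ack        # sent but not yet acknowledged
    return max(0, advertised_window - in_flight)

# Receiver advertised 4096 bytes; 1000 bytes are unacknowledged in flight.
assert sender_allowance(last_ack=5000, advertised_window=4096, next_seq=6000) == 3096
# Zero window: the sender must stop and rely on the persist timer.
assert sender_allowance(last_ack=5000, advertised_window=0, next_seq=5000) == 0
```

When the allowance reaches zero, a real sender starts the persist timer and probes with small packets until a non-zero window update arrives, as described above.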
Congestion control
The final main aspect of TCP is congestion control. TCP uses a number of mechanisms to achieve high performance and avoid congestive collapse, a gridlock situation where network performance is severely degraded. These mechanisms control the rate of data entering the network, keeping the data flow below a rate that would trigger collapse. They also yield an approximately max-min fair allocation between flows.
Acknowledgments for data sent, or the lack of acknowledgments, are used by senders to infer network conditions between the TCP sender and receiver. Coupled with timers, TCP senders and receivers can alter the behavior of the flow of data. This is more generally referred to as congestion control or congestion avoidance.
Modern implementations of TCP contain four intertwined algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery.[61]
In addition, senders employ a retransmission timeout (RTO) that is based on the estimated round-trip time (RTT) between the sender and receiver, as well as the variance in this round-trip time.[62] There are subtleties in the estimation of RTT. For example, senders must be careful when calculating RTT samples for retransmitted packets; typically they use Karn's algorithm or TCP timestamps.[29] These individual RTT samples are then averaged over time to create a smoothed round-trip time (SRTT) using Jacobson's algorithm. The SRTT value is used as the round-trip time estimate.
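The smoothing just described can be written out as in RFC 6298, which standardizes Jacobson's algorithm. The sketch below uses the constants recommended by that RFC (alpha = 1/8, beta = 1/4, K = 4, a 1-second minimum); the function name and the clock-granularity value `g` are illustrative choices.

```python
# Sketch of the RTO calculation from RFC 6298 (Jacobson's algorithm).
# SRTT is an exponentially weighted average of RTT samples; RTTVAR tracks
# their variation; RTO = SRTT + max(G, K * RTTVAR), floored at 1 second.

def update_rto(srtt, rttvar, rtt_sample, k=4, alpha=1/8, beta=1/4, g=0.1):
    if srtt is None:                       # first RTT measurement
        srtt = rtt_sample
        rttvar = rtt_sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    rto = max(1.0, srtt + max(g, k * rttvar))   # RFC 6298 1-second floor
    return srtt, rttvar, rto

srtt, rttvar, rto = update_rto(None, None, 0.5)   # first sample: 500 ms
assert (srtt, rto) == (0.5, 1.5)
srtt, rttvar, rto = update_rto(srtt, rttvar, 0.7)  # second sample: 700 ms
assert abs(rto - 1.475) < 1e-9
```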
Enhancing TCP to reliably handle loss, minimize errors, manage congestion and perform well in very high-speed environments are ongoing areas of research and standards development. As a result, there are a number of TCP congestion avoidance algorithm variations.
Maximum segment size
The maximum segment size (MSS) is the largest amount of data, specified in bytes, that TCP is willing to receive in a single segment. For best performance, the MSS should be set small enough to avoid IP fragmentation, which can lead to packet loss and excessive retransmissions. To accomplish this, typically the MSS is announced by each side using the MSS option when the TCP connection is established. The option value is derived from the maximum transmission unit (MTU) size of the data link layer of the networks to which the sender and receiver are directly attached. TCP senders can use path MTU discovery to infer the minimum MTU along the network path between the sender and receiver, and use this to dynamically adjust the MSS to avoid IP fragmentation within the network.
MSS announcement may also be called MSS negotiation but, strictly speaking, the MSS is not negotiated. Two completely independent values of MSS are permitted for the two directions of data flow in a TCP connection,[63][18] so there is no need to agree on a common MSS configuration for a bidirectional connection.
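The usual derivation of the MSS option value from the link MTU is simple arithmetic, sketched below under the assumption of minimal, option-free headers (20 bytes of IPv4 and 20 bytes of TCP); with IP or TCP options present, the usable MSS shrinks accordingly.

```python
# MSS derived from the link MTU, assuming minimal 20-byte IPv4 and
# 20-byte TCP headers (no options).

def mss_for_mtu(mtu, ip_header=20, tcp_header=20):
    return mtu - ip_header - tcp_header

assert mss_for_mtu(1500) == 1460                 # standard Ethernet
assert mss_for_mtu(9000) == 8960                 # jumbo frames
assert mss_for_mtu(1280, ip_header=40) == 1220   # IPv6 minimum MTU, 40-byte header
```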
Selective acknowledgments
Relying purely on the cumulative acknowledgment scheme employed by the original TCP can lead to inefficiencies when packets are lost. For example, suppose bytes with sequence number 1,000 to 10,999 are sent in 10 different TCP segments of equal size, and the second segment (sequence numbers 2,000 to 2,999) is lost during transmission. In a pure cumulative acknowledgment protocol, the receiver can only send a cumulative ACK value of 2,000 (the sequence number immediately following the last sequence number of the received data) and cannot say that it received bytes 3,000 to 10,999 successfully. Thus the sender may then have to resend all data starting with sequence number 2,000.
To alleviate this issue TCP employs the selective acknowledgment (SACK) option, defined in 1996 in RFC 2018, which allows the receiver to acknowledge discontinuous blocks of packets that were received correctly, in addition to the sequence number immediately following the last contiguous byte received, as in the basic TCP acknowledgment. The acknowledgment can include a number of SACK blocks, where each SACK block is conveyed by the Left Edge of Block (the first sequence number of the block) and the Right Edge of Block (the sequence number immediately following the last sequence number of the block), with a block being a contiguous range that the receiver correctly received. In the example above, the receiver would send an ACK segment with a cumulative ACK value of 2,000 and a SACK option header with sequence numbers 3,000 and 11,000. The sender would accordingly retransmit only the second segment with sequence numbers 2,000 to 2,999.
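The receiver-side bookkeeping behind that example can be sketched as below. This is an illustration of the arithmetic only, not real stack code; the function name and the half-open (start, end) range representation are conveniences of the sketch.

```python
# Compute the cumulative ACK and SACK blocks from a set of received byte
# ranges (start inclusive, end exclusive), reproducing the article's
# example: ten 1,000-byte segments with the second one lost.

def ack_and_sack(received, isn):
    # Merge received ranges into maximal contiguous blocks.
    blocks = []
    for start, end in sorted(received):
        if blocks and start <= blocks[-1][1]:
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    # Cumulative ACK covers the contiguous block at the initial sequence number.
    cum_ack = blocks[0][1] if blocks and blocks[0][0] == isn else isn
    sack = [b for b in blocks if b[0] != isn]
    return cum_ack, sack

segments = [(1000 + i * 1000, 2000 + i * 1000) for i in range(10)]
del segments[1]                       # segment 2,000-2,999 lost in transit
cum_ack, sack = ack_and_sack(segments, isn=1000)
assert cum_ack == 2000                # only bytes up to 1,999 are contiguous
assert sack == [(3000, 11000)]        # left edge 3,000, right edge 11,000
```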
A TCP sender may interpret an out-of-order segment delivery as a lost segment. If it does so, the TCP sender will retransmit the segment previous to the out-of-order packet and slow its data delivery rate for that connection. The duplicate-SACK option, an extension to the SACK option that was defined in May 2000 in RFC 2883, solves this problem. When the TCP receiver detects a second duplicate packet, it sends a D-SACK to indicate that no segments were lost, allowing the TCP sender to reinstate the higher transmission rate.
The SACK option is not mandatory and comes into operation only if both parties support it. This is negotiated when a connection is established. SACK uses a TCP header option (see § TCP segment structure for details). The use of SACK has become widespread—all popular TCP stacks support it. Selective acknowledgment is also used in Stream Control Transmission Protocol (SCTP).
Selective acknowledgements can be 'reneged', where the receiver unilaterally discards the selectively acknowledged data. RFC 2018 discouraged such behavior, but did not prohibit it to allow receivers the option of reneging if they, for example, ran out of buffer space.[64] The possibility of reneging leads to implementation complexity for both senders and receivers, and also imposes memory costs on the sender.[65]
Window scaling
For more efficient use of high-bandwidth networks, a larger TCP window size may be used. The 16-bit TCP window size field controls the flow of data and its value is limited to 65,535 bytes. Since the size field cannot be expanded beyond this limit, a scaling factor is used. The TCP window scale option, defined in RFC 1323, increases the maximum window size to 1 gigabyte. Scaling up to these larger window sizes is necessary for TCP tuning.
The window scale option is used only during the TCP 3-way handshake. The window scale value represents the number of bits to left-shift the 16-bit window size field when interpreting it. The window scale value can be set from 0 (no shift) to 14 for each direction independently. Both sides must send the option in their SYN segments to enable window scaling in either direction.
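The effect of the option is a plain left shift, which a short sketch makes concrete; the function name here is illustrative.

```python
# The effective receive window is the 16-bit window field left-shifted by
# the scale value announced in the SYN. Scale 14 is the maximum allowed.

def effective_window(window_field, scale):
    assert 0 <= window_field <= 0xFFFF and 0 <= scale <= 14
    return window_field << scale

assert effective_window(65535, 0) == 65535            # no scaling
assert effective_window(65535, 14) == 1_073_725_440   # just under 1 GiB
```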
Some routers and packet firewalls rewrite the window scaling factor during a transmission. This causes sending and receiving sides to assume different TCP window sizes. The result is non-stable traffic that may be very slow. The problem is visible on some sites behind a defective router.[66]
TCP timestamps
TCP timestamps, defined in RFC 1323 in 1992, can help TCP determine in which order packets were sent. TCP timestamps are not normally aligned to the system clock and start at some random value. Many operating systems will increment the timestamp for every elapsed millisecond; however, the RFC only states that the ticks should be proportional.
There are two timestamp fields:
- a 4-byte sender timestamp value (my timestamp)
- a 4-byte echo reply timestamp value (the most recent timestamp received from you).
TCP timestamps are used in an algorithm known as Protection Against Wrapped Sequence numbers, or PAWS. PAWS is used when the receive window crosses the sequence number wraparound boundary. In the case where a packet was potentially retransmitted, it answers the question: "Is this sequence number in the first 4 GB or the second?", with the timestamp used to break the tie.
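The tie-break rests on modular ("serial number") comparison of 32-bit values that wrap around, which can be sketched as follows; the function name is illustrative, and the rule shown is the usual signed-difference test rather than a full PAWS implementation.

```python
# Modular comparison of 32-bit timestamps: a is treated as "older than" b
# when the wrapped difference (a - b) mod 2**32 lands in the upper half of
# the space, i.e. the signed 32-bit difference is negative.

MOD = 2 ** 32

def ts_older(a, b):
    return a != b and (a - b) % MOD >= MOD // 2

assert ts_older(100, 200)            # plainly older
assert not ts_older(200, 100)
assert ts_older(MOD - 5, 10)         # still older across the wraparound
```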
Also, the Eifel detection algorithm uses TCP timestamps to determine if retransmissions are occurring because packets are lost or simply out of order.[67]
TCP timestamps are enabled by default in Linux,[68] and disabled by default in Windows Server 2008, 2012 and 2016.[69]
Recent statistics show that the level of TCP timestamp adoption has stagnated at approximately 40%, owing to Windows Server disabling timestamps by default since Windows Server 2008.[70]
Out-of-band data
It is possible to interrupt or abort the queued stream instead of waiting for the stream to finish. This is done by specifying the data as urgent. This marks the transmission as out-of-band data (OOB) and tells the receiving program to process it immediately. When finished, TCP informs the application and resumes the stream queue. An example is when TCP is used for a remote login session where the user can send a keyboard sequence that interrupts or aborts the remotely running program without waiting for the program to finish its current transfer.[15]
The urgent pointer only alters the processing on the remote host and does not expedite any processing on the network itself. The capability is implemented differently or poorly on different systems or may not be supported. Where it is available, it is prudent to assume only single bytes of OOB data will be reliably handled.[71][72] Since the feature is not frequently used, it is not well tested on some platforms and has been associated with vulnerabilities, WinNuke for instance.
Forcing data delivery
Normally, TCP waits for 200 ms for a full packet of data to send (Nagle's algorithm tries to group small messages into a single packet). This wait creates small, but potentially serious delays if repeated constantly during a file transfer. For example, a typical send block would be 4 KB, a typical MSS is 1460, so 2 packets go out on a 10 Mbit/s Ethernet taking ~1.2 ms each, followed by a third carrying the remaining 1176 bytes after a 197 ms pause because TCP is waiting for a full buffer. In the case of telnet, each user keystroke is echoed back by the server before the user can see it on the screen. This delay would become very annoying.
Setting the socket option TCP_NODELAY overrides the default 200 ms send delay. Application programs use this socket option to force output to be sent after writing a character or line of characters.
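Setting the option with the Berkeley sockets API is a one-liner; the sketch below uses Python's standard `socket` module on an unconnected socket, though in practice the option is typically set before or right after `connect()`.

```python
import socket

# Disable Nagle's algorithm on a TCP socket via TCP_NODELAY, so small
# writes are sent immediately instead of being coalesced.

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
assert nodelay != 0
s.close()
```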
RFC 793 defines the PSH push bit as "a message to the receiving TCP stack to send this data immediately up to the receiving application".[15] There is no way to indicate or control it in user space using Berkeley sockets; it is controlled by the protocol stack only.[73]
Vulnerabilities
TCP may be attacked in a variety of ways. The results of a thorough security assessment of TCP, along with possible mitigations for the identified issues, were published in 2009,[74] and the work was pursued within the IETF through 2012.[75] Notable vulnerabilities include denial of service, connection hijacking, TCP veto and TCP reset attack.
Denial of service
By using a spoofed IP address and repeatedly sending purposely assembled SYN packets, followed by many ACK packets, attackers can cause the server to consume large amounts of resources keeping track of the bogus connections. This is known as a SYN flood attack. Proposed solutions to this problem include SYN cookies and cryptographic puzzles, though SYN cookies come with their own set of vulnerabilities.[76] Sockstress is a similar attack that might be mitigated with system resource management.[77] An advanced DoS attack involving the exploitation of the TCP persist timer was analyzed in Phrack No. 66.[78] PUSH and ACK floods are other variants.[79]
Connection hijacking
An attacker who is able to eavesdrop on a TCP session and redirect packets can hijack a TCP connection. To do so, the attacker learns the sequence number from the ongoing communication and forges a false segment that looks like the next segment in the stream. A simple hijack can result in one packet being erroneously accepted at one end. When the receiving host acknowledges the false segment, synchronization is lost.[80] Hijacking may be combined with ARP spoofing or other routing attacks that allow an attacker to take permanent control of the TCP connection.
Impersonating a different IP address was not difficult prior to RFC 1948 when the initial sequence number was easily guessable. The earlier implementations allowed an attacker to blindly send a sequence of packets that the receiver would believe came from a different IP address, without the need to intercept communication through ARP or routing attacks: it is enough to ensure that the legitimate host of the impersonated IP address is down, or bring it to that condition using denial-of-service attacks. This is why the initial sequence number is now chosen at random.
TCP veto
An attacker who can eavesdrop and predict the size of the next packet to be sent can cause the receiver to accept a malicious payload without disrupting the existing connection. The attacker injects a malicious packet with the sequence number and a payload size of the next expected packet. When the legitimate packet is ultimately received, it is found to have the same sequence number and length as a packet already received and is silently dropped as a normal duplicate packet—the legitimate packet is vetoed by the malicious packet. Unlike in connection hijacking, the connection is never desynchronized and communication continues as normal after the malicious payload is accepted. TCP veto gives the attacker less control over the communication but makes the attack particularly resistant to detection. The only evidence to the receiver that something is amiss is a single duplicate packet, a normal occurrence in an IP network. The sender of the vetoed packet never sees any evidence of an attack.[81]
TCP ports
A TCP connection is identified by a four-tuple of the source address, source port, destination address, and destination port.[d][82][83] Port numbers are used to identify different services, and to allow multiple connections between hosts.[16] TCP uses 16-bit port numbers, providing 65,536 possible values for each of the source and destination ports.[19] The dependency of connection identity on addresses means that TCP connections are bound to a single network path; TCP cannot use other routes that multihomed hosts have available, and connections break if an endpoint's address changes.[84]
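The four-tuple can be observed directly with ordinary sockets. The sketch below builds a loopback connection on an ephemeral port (the addresses and port 0 trick are just conveniences of the example) and reads the tuple back with `getsockname()`/`getpeername()`.

```python
import socket

# Observe the (source address, source port, destination address,
# destination port) tuple of a loopback TCP connection.

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))           # port 0: let the OS pick a free port
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

four_tuple = client.getsockname() + client.getpeername()
assert four_tuple[0] == four_tuple[2] == "127.0.0.1"
assert four_tuple[3] == server.getsockname()[1]   # destination = listener port

for s in (conn, client, server):
    s.close()
```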
Port numbers are divided into three basic categories: well-known, registered, and dynamic or private. The well-known ports are assigned by the Internet Assigned Numbers Authority (IANA) and are typically used by system-level processes. Well-known applications running as servers and passively listening for connections typically use these ports. Some examples include FTP (20 and 21), SSH (22), TELNET (23), SMTP (25), HTTP over SSL/TLS (443), and HTTP (80).[e] Registered ports are typically used by end-user applications as ephemeral source ports when contacting servers, but they can also identify named services that have been registered by a third party. Dynamic or private ports can also be used by end-user applications; however, these ports typically do not carry any meaning outside a particular TCP connection.
Network Address Translation (NAT) typically uses dynamic port numbers on the public-facing side to disambiguate the flow of traffic that is passing between a public network and a private subnetwork, thereby allowing many IP addresses (and their ports) on the subnet to be serviced by a single public-facing address.
Development
TCP is a complex protocol. However, while significant enhancements have been made and proposed over the years, its most basic operation has not changed significantly since its first specification RFC 675 in 1974, and the v4 specification RFC 793, published in September 1981. RFC 1122, published in October 1989, clarified a number of TCP protocol implementation requirements. A list of the 8 required specifications and over 20 strongly encouraged enhancements is available in RFC 7414. Among these is RFC 2581, TCP Congestion Control, one of the most important TCP-related RFCs in recent years, which describes updated algorithms that avoid undue congestion. In 2001, RFC 3168 was written to describe Explicit Congestion Notification (ECN), a congestion avoidance signaling mechanism.
The original TCP congestion avoidance algorithm was known as TCP Tahoe, but many alternative algorithms have since been proposed (including TCP Reno, TCP Vegas, FAST TCP, TCP New Reno, and TCP Hybla).
Multipath TCP (MPTCP)[85][86] is an ongoing effort within the IETF that aims at allowing a TCP connection to use multiple paths to maximize resource usage and increase redundancy. The redundancy offered by Multipath TCP in the context of wireless networks enables the simultaneous use of different networks, which brings higher throughput and better handover capabilities. Multipath TCP also brings performance benefits in datacenter environments.[87] The reference implementation[88] of Multipath TCP was developed in the Linux kernel.[89] Multipath TCP is used to support the Siri voice recognition application on iPhones, iPads and Macs.[90]
tcpcrypt is an extension proposed in July 2010 to provide transport-level encryption directly in TCP itself. It is designed to work transparently and not require any configuration. Unlike TLS (SSL), tcpcrypt itself does not provide authentication, but exposes simple primitives to the application to do that. The tcpcrypt RFC was published by the IETF in May 2019.[91]
TCP Fast Open is an extension to speed up the opening of successive TCP connections between two endpoints. It works by skipping the three-way handshake using a cryptographic cookie. It is similar to an earlier proposal called T/TCP, which was not widely adopted due to security issues.[92] TCP Fast Open was published as RFC 7413 in 2014.[93]
Proposed in May 2013, Proportional Rate Reduction (PRR) is a TCP extension developed by Google engineers. PRR ensures that the TCP window size after recovery is as close to the slow start threshold as possible.[94] The algorithm is designed to improve the speed of recovery and is the default congestion control algorithm in Linux 3.2+ kernels.[95]
Deprecated proposals
TCP Cookie Transactions (TCPCT) is an extension proposed in December 2009[96] to secure servers against denial-of-service attacks. Unlike SYN cookies, TCPCT does not conflict with other TCP extensions such as window scaling. TCPCT was designed to meet the needs of DNSSEC, where servers have to handle large numbers of short-lived TCP connections. In 2016, TCPCT was deprecated in favor of TCP Fast Open. The status of the original RFC was changed to historic.[97]
Hardware implementations
One way to overcome the processing power requirements of TCP is to build hardware implementations of it, widely known as TCP offload engines (TOE). The main problem of TOEs is that they are hard to integrate into computing systems, requiring extensive changes in the operating system of the computer or device.
TCP low power (TCPlp) has been demonstrated to work in resource-constrained environments where otherwise UDP-based CoAP is preferred.[98][99]
Wire image and ossification
The wire data of TCP provides significant information-gathering and modification opportunities to on-path observers, as the protocol metadata is transmitted in cleartext.[100][101] While this transparency is useful to network operators[102] and researchers,[103] information gathered from protocol metadata may reduce the end-user's privacy.[104] This visibility and malleability of metadata has led to TCP being difficult to extend—a case of protocol ossification—as any intermediate node (a 'middlebox') can make decisions based on that metadata or even modify it,[105][106] breaking the end-to-end principle.[107] One measurement found that a third of paths across the Internet encounter at least one intermediary that modifies TCP metadata, and 6.5% of paths encounter harmful ossifying effects from intermediaries.[108] Avoiding extensibility hazards from intermediaries placed significant constraints on the design of MPTCP,[109][110] and difficulties caused by intermediaries have hindered the deployment of TCP Fast Open in web browsers.[111] Another source of ossification is the difficulty of modification of TCP functions at the endpoints, typically in the operating system kernel[112] or in hardware with a TCP offload engine.[113]
Performance
As TCP provides applications with the abstraction of a reliable byte stream, it can suffer from head-of-line blocking: if packets are reordered or lost and need to be retransmitted (and thus are reordered), data from sequentially later parts of the stream may be received before sequentially earlier parts of the stream; however, the later data cannot typically be used until the earlier data has been received, incurring network latency. If multiple independent higher-level messages are encapsulated and multiplexed onto a single TCP connection, then head-of-line blocking can cause processing of a fully-received message that was sent later to wait for delivery of a message that was sent earlier.[114] Web browsers attempt to mitigate head-of-line blocking by opening multiple parallel connections. This incurs the cost of connection establishment repeatedly, as well as multiplying the resources needed to track those connections at the endpoints.[115] Parallel connections also have congestion control operating independently of each other, rather than being able to pool information together and respond more promptly to observed network conditions;[116] TCP's aggressive initial sending patterns can cause congestion if multiple parallel connections are opened; and the per-connection fairness model leads to a monopolization of resources by applications that take this approach.[117]
Connection establishment is a major contributor to latency as experienced by web users.[118][119] TCP's three-way handshake introduces one RTT of latency during connection establishment before data can be sent.[119] For short flows, these delays are very significant.[120] Transport Layer Security (TLS) requires a handshake of its own for key exchange at connection establishment. Because of the layered design, the TCP handshake and the TLS handshake proceed serially; the TLS handshake cannot begin until the TCP handshake has concluded.[121] Two RTTs are required for connection establishment with TLS 1.2 over TCP.[122] TLS 1.3 allows for zero RTT connection resumption in some circumstances, but, when layered over TCP, one RTT is still required for the TCP handshake, and this cannot assist the initial connection; zero RTT handshakes also present cryptographic challenges, as efficient, replay-safe and forward secure non-interactive key exchange is an open research topic.[123] TCP Fast Open allows the transmission of data in the initial (i.e., SYN and SYN-ACK) packets, removing one RTT of latency during connection establishment.[124] However, TCP Fast Open has been difficult to deploy due to protocol ossification; as of 2020[update], no Web browsers used it by default.[111]
TCP throughput is affected by packet reordering. Reordered packets can cause duplicate acknowledgments to be sent, which, if they cross a threshold, will then trigger a spurious retransmission and congestion control. Transmission behavior can also become bursty, as large ranges are acknowledged all at once when a reordered packet at the range's start is received (in a manner similar to how head-of-line blocking affects applications).[125] Blanton & Allman (2002) found that throughput was inversely related to the amount of reordering, up to a threshold where all reordering triggers spurious retransmission.[126] Mitigating reordering depends on a sender's ability to determine that it has sent a spurious retransmission, and hence on resolving retransmission ambiguity.[127] Reducing reordering-induced spurious retransmissions may slow recovery from genuine loss.[128]
Selective acknowledgment can provide a significant benefit to throughput; Bruyeron, Hemon & Zhang (1998) measured gains of up to 45%.[129] An important factor in the improvement is that selective acknowledgment can more often avoid going into slow start after a loss and can hence better use available bandwidth.[130] However, TCP can only selectively acknowledge a maximum of three blocks of sequence numbers. This can limit the retransmission rate and hence loss recovery or cause needless retransmissions, especially in high-loss environments.[131][132]
TCP was originally designed for wired networks where packet loss is considered to be the result of network congestion and the congestion window size is reduced dramatically as a precaution. However, wireless links are known to experience sporadic and usually temporary losses due to fading, shadowing, handoff, interference, and other radio effects that are not strictly congestion. After the (erroneous) back-off of the congestion window size, due to wireless packet loss, there may be a congestion avoidance phase with a conservative decrease in window size. This causes the radio link to be underused. Extensive research on combating these harmful effects has been conducted. Suggested solutions can be categorized as end-to-end solutions, which require modifications at the client or server,[133] link layer solutions, such as Radio Link Protocol in cellular networks, or proxy-based solutions which require some changes in the network without modifying end nodes.[133][134] A number of alternative congestion control algorithms, such as Vegas, Westwood, Veno, and Santa Cruz, have been proposed to help solve the wireless problem.[citation needed]
Acceleration
The idea of a TCP accelerator is to terminate TCP connections inside the network processor and then relay the data to a second connection toward the end system. The data packets that originate from the sender are buffered at the accelerator node, which is responsible for performing local retransmissions in the event of packet loss. Thus, in case of losses, the feedback loop between the sender and the receiver is shortened to the one between the acceleration node and the receiver which guarantees a faster delivery of data to the receiver.[135]
Since TCP is a rate-adaptive protocol, the rate at which the TCP sender injects packets into the network is directly proportional to the prevailing load condition within the network as well as the processing capacity of the receiver. The prevalent conditions within the network are judged by the sender on the basis of the acknowledgments received by it. The acceleration node splits the feedback loop between the sender and the receiver and thus guarantees a shorter round trip time (RTT) per packet. A shorter RTT is beneficial as it ensures a quicker response time to any changes in the network and a faster adaptation by the sender to combat these changes.
Disadvantages of the method include the fact that the TCP session has to be directed through the accelerator; this means that if routing changes so that the accelerator is no longer in the path, the connection will be broken. It also destroys the end-to-end property of the TCP ACK mechanism; when the ACK is received by the sender, the packet has been stored by the accelerator, not delivered to the receiver.
Debugging
A packet sniffer, which taps TCP traffic on a network link, can be useful in debugging networks, network stacks, and applications that use TCP by showing an engineer what packets are passing through a link. Some networking stacks support the SO_DEBUG socket option, which can be enabled on the socket using setsockopt. That option dumps all the packets, TCP states, and events on that socket, which is helpful in debugging. Netstat is another utility that can be used for debugging.
Alternatives
For many applications TCP is not appropriate. The application cannot normally access the packets coming after a lost packet until the retransmitted copy of the lost packet is received. This causes problems for real-time applications such as streaming media, real-time multiplayer games and voice over IP (VoIP) where it is generally more useful to get most of the data in a timely fashion than it is to get all of the data in order.
For historical and performance reasons, most storage area networks (SANs) use Fibre Channel Protocol (FCP) over Fibre Channel connections. For embedded systems, network booting, and servers that serve simple requests from huge numbers of clients (e.g. DNS servers) the complexity of TCP can be a problem. Tricks such as transmitting data between two hosts that are both behind NAT (using STUN or similar systems) are far simpler without a relatively complex protocol like TCP in the way.
Generally, where TCP is unsuitable, the User Datagram Protocol (UDP) is used. This provides the same application multiplexing and checksums that TCP does, but does not handle streams or retransmission, giving the application developer the ability to code them in a way suitable for the situation, or to replace them with other methods such as forward error correction or error concealment.
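The contrast with TCP is visible in the API itself: a UDP exchange needs no handshake and no connection state, as the loopback sketch below shows (the addresses and message are illustrative; note that loopback delivery is reliable in practice, while real networks may drop or reorder datagrams).

```python
import socket

# Minimal UDP exchange over loopback: no handshake, no ordering, no
# retransmission — each sendto() is an independent datagram.

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))         # ephemeral port chosen by the OS

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", receiver.getsockname())

data, addr = receiver.recvfrom(1024)
assert data == b"hello"

sender.close()
receiver.close()
```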
Stream Control Transmission Protocol (SCTP) is another protocol that provides reliable stream-oriented services similar to TCP. It is newer and considerably more complex than TCP, and has not yet seen widespread deployment. However, it is specifically designed to be used in situations where reliability and near-real-time considerations are important.
Venturi Transport Protocol (VTP) is a patented proprietary protocol that is designed to replace TCP transparently to overcome perceived inefficiencies related to wireless data transport.
The TCP congestion avoidance algorithm works very well for ad-hoc environments where the data sender is not known in advance. If the environment is predictable, a timing-based protocol such as Asynchronous Transfer Mode (ATM) can avoid TCP's retransmission overhead.
UDP-based Data Transfer Protocol (UDT) has better efficiency and fairness than TCP in networks that have high bandwidth-delay product.[136]
Multipurpose Transaction Protocol (MTP/IP) is patented proprietary software that is designed to adaptively achieve high throughput and transaction performance in a wide variety of network conditions, particularly those where TCP is perceived to be inefficient.
Checksum computation
[edit]TCP checksum for IPv4
When TCP runs over IPv4, the method used to compute the checksum is defined as follows:[18]
The checksum field is the 16-bit ones' complement of the ones' complement sum of all 16-bit words in the header and text. The checksum computation needs to ensure the 16-bit alignment of the data being summed. If a segment contains an odd number of header and text octets, alignment can be achieved by padding the last octet with zeros on its right to form a 16-bit word for checksum purposes. The pad is not transmitted as part of the segment. While computing the checksum, the checksum field itself is replaced with zeros.
In other words, after appropriate padding, all 16-bit words are added using ones' complement arithmetic. The sum is then bitwise complemented and inserted as the checksum field. A pseudo-header that mimics the IPv4 packet header used in the checksum computation is as follows:
| Offset | Octet | 0 | 1 | 2 | 3 | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Octet | Bit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 0 | 0 | Source address | |||||||||||||||||||||||||||||||
| 4 | 32 | Destination address | |||||||||||||||||||||||||||||||
| 8 | 64 | Zeroes | Protocol (6) | TCP length | |||||||||||||||||||||||||||||
| 12 | 96 | Source port | Destination port | ||||||||||||||||||||||||||||||
| 16 | 128 | Sequence number | |||||||||||||||||||||||||||||||
| 20 | 160 | Acknowledgement number | |||||||||||||||||||||||||||||||
| 24 | 192 | Data offset | Reserved | Flags | Window | ||||||||||||||||||||||||||||
| 28 | 224 | Checksum | Urgent pointer | ||||||||||||||||||||||||||||||
| 32 | 256 | (Options) | |||||||||||||||||||||||||||||||
| 36 | 288 | Data | |||||||||||||||||||||||||||||||
| 40 | 320 | ||||||||||||||||||||||||||||||||
| ⋮ | ⋮ | ||||||||||||||||||||||||||||||||
The checksum is computed over the following fields:
- Source address: 32 bits
- The source address in the IPv4 header
- Destination address: 32 bits
- The destination address in the IPv4 header
- Zeroes: 8 bits
- All zeroes
- Protocol: 8 bits
- The protocol value for TCP: 6
- TCP length: 16 bits
- The length of the TCP header and data (measured in octets). For example, consider an IPv4 packet with a Total Length of 200 bytes and an IHL value of 5, which indicates a header length of 5 × 32 bits = 160 bits = 20 bytes. The TCP length is then (Total Length) − (IPv4 header length), i.e. 200 − 20 = 180 bytes.
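The ones' complement procedure and pseudo-header described above can be sketched in Python. This is an illustrative sketch, not a standard library API; `ones_complement_checksum` and `tcp_checksum_ipv4` are hypothetical names, and the segment passed in is assumed to already have its checksum field zeroed.

```python
import struct

def ones_complement_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of all 16-bit words."""
    if len(data) % 2:                # pad an odd-length input with a zero octet
        data += b"\x00"              # (the pad is not transmitted)
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # big-endian 16-bit word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

def tcp_checksum_ipv4(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum of a TCP segment (checksum field zeroed) with its IPv4 pseudo-header.

    Pseudo-header: source address, destination address, a zero octet,
    the protocol value for TCP (6), and the TCP length in octets.
    """
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    return ones_complement_checksum(pseudo + segment)
```

A receiver runs the same sum over the segment as received, with the checksum field included; a correct segment sums to 0xFFFF, whose complement is zero.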
TCP checksum for IPv6
When TCP runs over IPv6, the method used to compute the checksum is changed:[137]
Any transport or other upper-layer protocol that includes the addresses from the IP header in its checksum computation must be modified for use over IPv6, to include the 128-bit IPv6 addresses instead of 32-bit IPv4 addresses.
A pseudo-header that mimics the IPv6 header for computation of the checksum is shown below.
| Offset | Octet | 0 | 1 | 2 | 3 | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Octet | Bit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 0 | 0 | Source address | |||||||||||||||||||||||||||||||
| 4 | 32 | ||||||||||||||||||||||||||||||||
| 8 | 64 | ||||||||||||||||||||||||||||||||
| 12 | 96 | ||||||||||||||||||||||||||||||||
| 16 | 128 | Destination address | |||||||||||||||||||||||||||||||
| 20 | 160 | ||||||||||||||||||||||||||||||||
| 24 | 192 | ||||||||||||||||||||||||||||||||
| 28 | 224 | ||||||||||||||||||||||||||||||||
| 32 | 256 | TCP length | |||||||||||||||||||||||||||||||
| 36 | 288 | Zeroes | Next header (6) | ||||||||||||||||||||||||||||||
| 40 | 320 | Source port | Destination port | ||||||||||||||||||||||||||||||
| 44 | 352 | Sequence number | |||||||||||||||||||||||||||||||
| 48 | 384 | Acknowledgement number | |||||||||||||||||||||||||||||||
| 52 | 416 | Data offset | Reserved | Flags | Window | ||||||||||||||||||||||||||||
| 56 | 448 | Checksum | Urgent pointer | ||||||||||||||||||||||||||||||
| 60 | 480 | (Options) | |||||||||||||||||||||||||||||||
| 64 | 512 | Data | |||||||||||||||||||||||||||||||
| 68 | 544 | ||||||||||||||||||||||||||||||||
| ⋮ | ⋮ | ||||||||||||||||||||||||||||||||
The checksum is computed over the following fields:
- Source address: 128 bits
- The address in the IPv6 header.
- Destination address: 128 bits
- The final destination. If the IPv6 packet does not contain a Routing header, TCP uses the destination address in the IPv6 header. Otherwise, at the originating node, it uses the address in the last element of the Routing header, and, at the receiving node, it uses the destination address in the IPv6 header.
- TCP length: 32 bits
- The length of the TCP header and data (measured in octets).
- Zeroes: 24 bits
- All zeroes.
- Next header: 8 bits
- The protocol value for TCP: 6.
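As a sketch of the layout above (the function name is illustrative, not a library API), the 40-byte IPv6 pseudo-header can be assembled as follows; the same ones' complement sum described for IPv4 is then taken over this prefix plus the TCP segment.

```python
import struct

def tcp_pseudo_header_ipv6(src: bytes, dst: bytes, tcp_length: int) -> bytes:
    """Build the 40-byte IPv6 pseudo-header used in the TCP checksum."""
    assert len(src) == 16 and len(dst) == 16   # 128-bit IPv6 addresses
    # 32-bit TCP length, then 24 bits of zeroes, then Next Header = 6 (TCP)
    return src + dst + struct.pack("!I", tcp_length) + b"\x00\x00\x00\x06"
```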
Checksum offload
Many TCP/IP software stack implementations provide options to use hardware assistance to automatically compute the checksum in the network adapter prior to transmission onto the network or upon reception from the network for validation. This may reduce CPU load associated with calculating the checksum, potentially increasing overall network performance.
This feature may cause packet analyzers that are unaware or uncertain about the use of checksum offload to report invalid checksums in outbound packets that have not yet reached the network adapter.[138] This will only occur for packets that are intercepted before being transmitted by the network adapter; all packets transmitted by the network adapter on the wire will have valid checksums.[139] This issue can also occur when monitoring packets being transmitted between virtual machines on the same host, where a virtual device driver may omit the checksum calculation (as an optimization), knowing that the checksum will be calculated later by the VM host kernel or its physical hardware.
See also
- Micro-bursting (networking)
- TCP global synchronization
- TCP fusion
- TCP pacing
- TCP Stealth
- Transport layer § Comparison of transport layer protocols
- WTCP, a proxy-based modification of TCP for wireless networks
Notes
- ^ a b Added to header by RFC 3168
- ^ Window size units are, by default, bytes.
- ^ Window size is relative to the segment identified by the sequence number in the acknowledgment field.
- ^ Equivalently, a pair of network sockets for the source and destination, each of which is made up of an address and a port
- ^ As of the latest standard, HTTP/3, QUIC is used as a transport instead of TCP.
References
- ^ Comer, D. E. (2021). Internetworking with TCP/IP (6th ed.). Pearson.
- ^ Labrador, Miguel A.; Perez, Alfredo J.; Wightman, Pedro M. (2010). Location-Based Information Systems Developing Real-Time Tracking Applications. CRC Press. ISBN 9781000556803.
- ^ Vinton G. Cerf; Robert E. Kahn (May 1974). "A Protocol for Packet Network Intercommunication" (PDF). IEEE Transactions on Communications. 22 (5): 637–648. doi:10.1109/tcom.1974.1092259. Archived from the original (PDF) on March 4, 2016.
- ^ Bennett, Richard (September 2009). "Designed for Change: End-to-End Arguments, Internet Innovation, and the Net Neutrality Debate" (PDF). Information Technology and Innovation Foundation. p. 11. Archived (PDF) from the original on 29 August 2019. Retrieved 11 September 2017.
- ^ RFC 675.
- ^ Russell, Andrew Lawrence (2008). 'Industrial Legislatures': Consensus Standardization in the Second and Third Industrial Revolutions (Thesis). "See Abbate, Inventing the Internet, 129–30; Vinton G. Cerf (October 1980). "Protocols for Interconnected Packet Networks". ACM SIGCOMM Computer Communication Review. 10 (4): 10–11.; and RFC 760. doi:10.17487/RFC0760."
- ^ Postel, Jon (15 August 1977), Comments on Internet Protocol and TCP, IEN 2, archived from the original on May 16, 2019, retrieved June 11, 2016,
We are screwing up in our design of internet protocols by violating the principle of layering. Specifically we are trying to use TCP to do two things: serve as a host level end to end protocol, and to serve as an internet packaging and routing protocol. These two things should be provided in a layered and modular way.
- ^ Cerf, Vinton G. (1 April 1980). "Final Report of the Stanford University TCP Project".
- ^ Cerf, Vinton G; Cain, Edward (October 1983). "The DoD internet architecture model". Computer Networks. 7 (5): 307–318. doi:10.1016/0376-5075(83)90042-9.
- ^ "The TCP/IP Guide – TCP/IP Architecture and the TCP/IP Model". www.tcpipguide.com. Retrieved 2020-02-11.
- ^ RFC 9293. https://datatracker.ietf.org/doc/html/rfc9293
- ^ "Internet Experiment Note Index". www.rfc-editor.org. Retrieved 2024-01-21.
- ^ "Robert E Kahn – A.M. Turing Award Laureate". amturing.acm.org. Archived from the original on 2019-07-13. Retrieved 2019-07-13.
- ^ "Vinton Cerf – A.M. Turing Award Laureate". amturing.acm.org. Archived from the original on 2021-10-11. Retrieved 2019-07-13.
- ^ a b c d e f g h i Comer, Douglas E. (2006). Internetworking with TCP/IP: Principles, Protocols, and Architecture. Vol. 1 (5th ed.). Prentice Hall. ISBN 978-0-13-187671-2.
- ^ a b c RFC 9293, 2.2. Key TCP Concepts.
- ^ RFC 791, pp. 5–6.
- ^ a b c d RFC 9293.
- ^ a b c RFC 9293, 3.1. Header Format.
- ^ RFC 9293, 3.8.5 The Communication of Urgent Information.
- ^ RFC 9293, 3.4. Sequence Numbers.
- ^ RFC 9293, 3.4.1. Initial Sequence Number Selection.
- ^ "Change RFC 3540 "Robust Explicit Congestion Notification (ECN) Signaling with Nonces" to Historic". datatracker.ietf.org. Retrieved 2023-04-18.
- ^ "More Accurate Explicit Congestion Notification (AccECN) Feedback in TCP". datatracker.ietf.org. Retrieved 2025-10-24.
- ^ RFC 3168, p. 13-14.
- ^ RFC 3168, p. 15.
- ^ RFC 3168, p. 18-19.
- ^ RFC 793.
- ^ a b c RFC 7323.
- ^ RFC 2018, 2. Sack-Permitted Option.
- ^ RFC 2018, 3. Sack Option Format.
- ^ Heffernan, Andy (August 1998). "Protection of BGP Sessions via the TCP MD5 Signature Option". IETF. Retrieved 2023-12-30.
- ^ "Transmission Control Protocol (TCP) Parameters: TCP Option Kind Numbers". IANA. Archived from the original on 2017-10-02. Retrieved 2017-10-19.
- ^ RFC 9293, 3.3.2. State Machine Overview.
- ^ Kurose, James F. (2017). Computer networking : a top-down approach. Keith W. Ross (7th ed.). Harlow, England. p. 286. ISBN 978-0-13-359414-0. OCLC 936004518.
- ^ Tanenbaum, Andrew S. (2003-03-17). Computer Networks (Fourth ed.). Prentice Hall. ISBN 978-0-13-066102-9.
- ^ "linux/net/ipv4/tcp_minisocks.c at master · torvalds/linux". GitHub. Retrieved 2025-04-24.
- ^ RFC 1122, 4.2.2.13. Closing a Connection.
- ^ "TCP (Transmission Control Protocol) – The transmission protocol explained". IONOS Digital Guide. 2020-03-02. Retrieved 2025-04-24.
- ^ "The TCP/IP Guide - TCP Connection Termination". www.tcpipguide.com. Retrieved 2025-04-24.
- ^ Karn & Partridge 1991, p. 364.
- ^ RFC 9002, 4.2. Monotonically Increasing Packet Numbers.
- ^ Mathis; Mathew; Semke; Mahdavi; Ott (1997). "The macroscopic behavior of the TCP congestion avoidance algorithm". ACM SIGCOMM Computer Communication Review. 27 (3): 67–82. CiteSeerX 10.1.1.40.7002. doi:10.1145/263932.264023. S2CID 1894993.
- ^ RFC 3522, p. 4.
- ^ Leung, Ka-cheong; Li, Victor O.k.; Yang, Daiqin (2007). "An Overview of Packet Reordering in Transmission Control Protocol (TCP): Problems, Solutions, and Challenges". IEEE Transactions on Parallel and Distributed Systems. 18 (4): 522–535. doi:10.1109/TPDS.2007.1011.
- ^ Johannessen, Mads (2015). Investigate reordering in Linux TCP (MSc thesis). University of Oslo.
- ^ Cheng, Yuchung (2015). RACK: a time-based fast loss detection for TCP draft-cheng-tcpm-rack-00 (PDF). IETF94. Yokohama: IETF.
- ^ RFC 8985.
- ^ Cheng, Yuchung; Cardwell, Neal; Dukkipati, Nandita; Jha, Priyaranjan (2017). RACK: a time-based fast loss recovery draft-ietf-tcpm-rack-02 (PDF). IETF100. Yokohama: IETF.
- ^ RFC 6298, p. 2.
- ^ a b Zhang 1986, p. 399.
- ^ Karn & Partridge 1991, p. 365.
- ^ Ludwig & Katz 2000, p. 31-33.
- ^ Gurtov & Ludwig 2003, p. 2.
- ^ Gurtov & Floyd 2004, p. 1.
- ^ a b RFC 6298, p. 4.
- ^ Karn & Partridge 1991, p. 370-372.
- ^ Allman & Paxson 1999, p. 268.
- ^ RFC 7323, p. 7.
- ^ Stone; Partridge (2000). "When the CRC and TCP checksum disagree". Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM SIGCOMM Computer Communication Review. pp. 309–319. CiteSeerX 10.1.1.27.7611. doi:10.1145/347059.347561. ISBN 978-1581132236. S2CID 9547018. Archived from the original on 2008-05-05. Retrieved 2008-04-28.
- ^ RFC 5681.
- ^ RFC 6298.
- ^ RFC 1122.
- ^ RFC 2018, p. 10.
- ^ RFC 9002, 4.4. No Reneging.
- ^ "TCP window scaling and broken routers". LWN.net. Archived from the original on 2020-03-31. Retrieved 2016-07-21.
- ^ RFC 3522.
- ^ "IP sysctl". Linux Kernel Documentation. Archived from the original on 5 March 2016. Retrieved 15 December 2018.
- ^ Wang, Eve. "TCP timestamp is disabled". Technet – Windows Server 2012 Essentials. Microsoft. Archived from the original on 2018-12-15. Retrieved 2018-12-15.
- ^ David Murray; Terry Koziniec; Sebastian Zander; Michael Dixon; Polychronis Koutsakis (2017). "An Analysis of Changing Enterprise Network Traffic Characteristics" (PDF). The 23rd Asia-Pacific Conference on Communications (APCC 2017). Archived (PDF) from the original on 3 October 2017. Retrieved 3 October 2017.
- ^ Gont, Fernando (November 2008). "On the implementation of TCP urgent data". 73rd IETF meeting. Archived from the original on 2019-05-16. Retrieved 2009-01-04.
- ^ Peterson, Larry (2003). Computer Networks. Morgan Kaufmann. p. 401. ISBN 978-1-55860-832-0.
- ^ Richard W. Stevens (November 2011). TCP/IP Illustrated. Vol. 1, The protocols. Addison-Wesley. pp. Chapter 20. ISBN 978-0-201-63346-7.
- ^ "Security Assessment of the Transmission Control Protocol (TCP)" (PDF). Archived from the original on March 6, 2009. Retrieved 2010-12-23.
- ^ Survey of Security Hardening Methods for Transmission Control Protocol (TCP) Implementations
- ^ Jakob Lell (13 August 2013). "Quick Blind TCP Connection Spoofing with SYN Cookies". Archived from the original on 2014-02-22. Retrieved 2014-02-05.
- ^ "Some insights about the recent TCP DoS (Denial of Service) vulnerabilities" (PDF). Archived from the original (PDF) on 2013-06-18. Retrieved 2010-12-23.
- ^ "Exploiting TCP and the Persist Timer Infiniteness". Archived from the original on 2010-01-22. Retrieved 2010-01-22.
- ^ "PUSH and ACK Flood". f5.com. Archived from the original on 2017-09-28. Retrieved 2017-09-27.
- ^ Laurent Joncheray (1995). "Simple Active Attack Against TCP" (PDF). Retrieved 2023-06-04.
- ^ John T. Hagen; Barry E. Mullins (2013). "TCP veto: A novel network attack and its Application to SCADA protocols". 2013 IEEE PES Innovative Smart Grid Technologies Conference (ISGT). pp. 1–6. doi:10.1109/ISGT.2013.6497785. ISBN 978-1-4673-4896-6. S2CID 25353177.
- ^ RFC 9293, 4. Glossary.
- ^ RFC 8095, p. 6.
- ^ RFC 6182.
- ^ RFC 6824.
- ^ Raiciu; Barre; Pluntke; Greenhalgh; Wischik; Handley (2011). "Improving datacenter performance and robustness with multipath TCP". ACM SIGCOMM Computer Communication Review. 41 (4): 266. CiteSeerX 10.1.1.306.3863. doi:10.1145/2043164.2018467. Archived from the original on 2020-04-04. Retrieved 2011-06-29.
- ^ "MultiPath TCP – Linux Kernel implementation". Archived from the original on 2013-03-27. Retrieved 2013-03-24.
- ^ Raiciu; Paasch; Barre; Ford; Honda; Duchene; Bonaventure; Handley (2012). "How Hard Can It Be? Designing and Implementing a Deployable Multipath TCP". Usenix NSDI: 399–412. Archived from the original on 2013-06-03. Retrieved 2013-03-24.
- ^ Bonaventure; Seo (2016). "Multipath TCP Deployments". IETF Journal. Archived from the original on 2020-02-23. Retrieved 2017-01-03.
- ^ Cryptographic Protection of TCP Streams (tcpcrypt). May 2019. doi:10.17487/RFC8548. RFC 8548.
- ^ Michael Kerrisk (2012-08-01). "TCP Fast Open: expediting web services". LWN.net. Archived from the original on 2014-08-03. Retrieved 2014-07-21.
- ^ RFC 7413.
- ^ RFC 6937.
- ^ Grigorik, Ilya (2013). High-performance browser networking (1. ed.). Beijing: O'Reilly. ISBN 978-1449344764.
- ^ RFC 6013.
- ^ RFC 7805.
- ^ Kumar, Sam; P Andersen, Michael; Kim, Hyung-Sin; E. Culler, David (2020). "Performant TCP for Low-Power Wireless Networks". University of California, Berkeley. USENIX.
- ^ "Rethinking System Design for Expressive Cryptography" (PDF). samkumar.org. University of California, Berkeley. 2023. Archived from the original (PDF) on 13 Oct 2025.
- ^ RFC 8546, p. 6.
- ^ RFC 8558, p. 3.
- ^ RFC 9065, 2. Current Uses of Transport Headers within the Network.
- ^ RFC 9065, 3. Research, Development, and Deployment.
- ^ RFC 8558, p. 8.
- ^ RFC 9170, 2.3. Multi-party Interactions and Middleboxes.
- ^ RFC 9170, A.5. TCP.
- ^ Papastergiou et al. 2017, p. 620.
- ^ Edeline & Donnet 2019, p. 175-176.
- ^ Raiciu et al. 2012, p. 1.
- ^ Hesmans et al. 2013, p. 1.
- ^ a b Rybczyńska 2020.
- ^ Papastergiou et al. 2017, p. 621.
- ^ Corbet 2015.
- ^ Briscoe et al. 2016, pp. 29–30.
- ^ Marx 2020, HOL blocking in HTTP/1.1.
- ^ Marx 2020, Bonus: Transport Congestion Control.
- ^ IETF HTTP Working Group, Why just one TCP connection?.
- ^ Corbet 2018.
- ^ a b RFC 7413, p. 3.
- ^ Sy et al. 2020, p. 271.
- ^ Chen et al. 2021, p. 8-9.
- ^ Ghedini 2018.
- ^ Chen et al. 2021, p. 3-4.
- ^ RFC 7413, p. 1.
- ^ Blanton & Allman 2002, p. 1-2.
- ^ Blanton & Allman 2002, p. 4-5.
- ^ Blanton & Allman 2002, p. 3-4.
- ^ Blanton & Allman 2002, p. 6-8.
- ^ Bruyeron, Hemon & Zhang 1998, p. 67.
- ^ Bruyeron, Hemon & Zhang 1998, p. 72.
- ^ Bhat, Rizk & Zink 2017, p. 14.
- ^ RFC 9002, 4.5. More ACK Ranges.
- ^ a b "TCP performance over CDMA2000 RLP". Archived from the original on 2011-05-03. Retrieved 2010-08-30.
- ^ Muhammad Adeel; Ahmad Ali Iqbal (2007). "TCP Congestion Window Optimization for CDMA2000 Packet Data Networks". Fourth International Conference on Information Technology (ITNG'07). pp. 31–35. doi:10.1109/ITNG.2007.190. ISBN 978-0-7695-2776-5. S2CID 8717768.
- ^ "TCP Acceleration". Archived from the original on 2024-04-22. Retrieved 2024-04-18.
- ^ Yunhong Gu, Xinwei Hong, and Robert L. Grossman. "An Analysis of AIMD Algorithm with Decreasing Increases" Archived 2016-03-05 at the Wayback Machine. 2004.
- ^ RFC 8200.
- ^ "Wireshark: Offloading". Archived from the original on 2017-01-31. Retrieved 2017-02-24.
Wireshark captures packets before they are sent to the network adapter. It won't see the correct checksum because it has not been calculated yet. Even worse, most OSes don't bother initialize this data so you're probably seeing little chunks of memory that you shouldn't. New installations of Wireshark 1.2 and above disable IP, TCP, and UDP checksum validation by default. You can disable checksum validation in each of those dissectors by hand if needed.
- ^ "Wireshark: Checksums". Archived from the original on 2016-10-22. Retrieved 2017-02-24.
Checksum offloading often causes confusion as the network packets to be transmitted are handed over to Wireshark before the checksums are actually calculated. Wireshark gets these "empty" checksums and displays them as invalid, even though the packets will contain valid checksums when they leave the network hardware later.
Bibliography
Requests for Comments
[edit]- Cerf, Vint; Dalal, Yogen; Sunshine, Carl (December 1974). Specification of Internet Transmission Control Program, December 1974 Version. doi:10.17487/RFC0675. RFC 675.
- Postel, Jon (September 1981). Internet Protocol. doi:10.17487/RFC0791. RFC 791.
- Postel, Jon (September 1981). Transmission Control Protocol. doi:10.17487/RFC0793. RFC 793.
- Braden, Robert, ed. (October 1989). Requirements for Internet Hosts – Communication Layers. doi:10.17487/RFC1122. RFC 1122.
- Jacobson, Van; Braden, Bob; Borman, Dave (May 1992). TCP Extensions for High Performance. doi:10.17487/RFC1323. RFC 1323.
- Bellovin, Steven M. (May 1996). Defending Against Sequence Number Attacks. doi:10.17487/RFC1948. RFC 1948.
- Mathis, Matt; Mahdavi, Jamshid; Floyd, Sally; Romanow, Allyn (October 1996). TCP Selective Acknowledgment Options. doi:10.17487/RFC2018. RFC 2018.
- Allman, Mark; Paxson, Vern; Stevens, W. Richard (April 1999). TCP Congestion Control. doi:10.17487/RFC2581. RFC 2581.
- Floyd, Sally; Mahdavi, Jamshid; Mathis, Matt; Podolsky, Matthew (July 2000). An Extension to the Selective Acknowledgement (SACK) Option for TCP. doi:10.17487/RFC2883. RFC 2883.
- Ramakrishnan, K. K.; Floyd, Sally; Black, David (September 2001). The Addition of Explicit Congestion Notification (ECN) to IP. doi:10.17487/RFC3168. RFC 3168.
- Ludwig, Reiner; Meyer, Michael (April 2003). The Eifel Detection Algorithm for TCP. doi:10.17487/RFC3522. RFC 3522.
- Spring, Neil; Weatherall, David; Ely, David (June 2003). Robust Explicit Congestion Notification (ECN) Signaling with Nonces. doi:10.17487/RFC3540. RFC 3540.
- Allman, Mark; Paxson, Vern; Blanton, Ethan (September 2009). TCP Congestion Control. doi:10.17487/RFC5681. RFC 5681.
- Simpson, William Allen (January 2011). TCP Cookie Transactions (TCPCT). doi:10.17487/RFC6013. RFC 6013.
- Ford, Alan; Raiciu, Costin; Handley, Mark; Barre, Sebastien; Iyengar, Janardhan (March 2011). Architectural Guidelines for Multipath TCP Development. doi:10.17487/RFC6182. RFC 6182.
- Paxson, Vern; Allman, Mark; Chu, H.K. Jerry; Sargent, Matt (June 2011). Computing TCP's Retransmission Timer. doi:10.17487/RFC6298. RFC 6298.
- Ford, Alan; Raiciu, Costin; Handley, Mark; Bonaventure, Olivier (January 2013). TCP Extensions for Multipath Operation with Multiple Addresses. doi:10.17487/RFC6824. RFC 6824.
- Mathis, Matt; Dukkipati, Nandita; Cheng, Yuchung (May 2013). Proportional Rate Reduction for TCP. doi:10.17487/RFC6937. RFC 6937.
- Borman, David; Braden, Bob; Jacobson, Van (September 2014). Scheffenegger, Richard (ed.). TCP Extensions for High Performance. doi:10.17487/RFC7323. RFC 7323.
- Duke, Martin; Braden, Robert; Eddy, Wesley M.; Blanton, Ethan; Zimmermann, Alexander (February 2015). A Roadmap for Transmission Control Protocol (TCP) Specification Documents. doi:10.17487/RFC7414. RFC 7414.
- Cheng, Yuchung; Chu, Jerry; Radhakrishnan, Sivasankar; Jain, Arvind (December 2014). TCP Fast Open. doi:10.17487/RFC7413. RFC 7413.
- Zimmermann, Alexander; Eddy, Wesley M.; Eggert, Lars (April 2016). Moving Outdated TCP Extensions and TCP-Related Documents to Historic or Informational Status. doi:10.17487/RFC7805. RFC 7805.
- Fairhurst, Gorry; Trammell, Brian; Kuehlewind, Mirja, eds. (March 2017). Services Provided by IETF Transport Protocols and Congestion Control Mechanisms. doi:10.17487/RFC8095. RFC 8095.
- Cheng, Yuchung; Cardwell, Neal; Dukkipati, Nandita; Jha, Priyaranjan, eds. (February 2021). The RACK-TLP Loss Detection Algorithm for TCP. doi:10.17487/RFC8985. RFC 8985.
- Deering, Stephen E.; Hinden, Robert M. (July 2017). Internet Protocol, Version 6 (IPv6) Specification. doi:10.17487/RFC8200. RFC 8200.
- Trammell, Brian; Kuehlewind, Mirja (April 2019). The Wire Image of a Network Protocol. doi:10.17487/RFC8546. RFC 8546.
- Hardie, Ted, ed. (April 2019). Transport Protocol Path Signals. doi:10.17487/RFC8558. RFC 8558.
- Iyengar, Jana; Swett, Ian, eds. (May 2021). QUIC Loss Detection and Congestion Control. doi:10.17487/RFC9002. RFC 9002.
- Fairhurst, Gorry; Perkins, Colin (July 2021). Considerations around Transport Header Confidentiality, Network Operations, and the Evolution of Internet Transport Protocols. doi:10.17487/RFC9065. RFC 9065.
- Thomson, Martin; Pauly, Tommy (December 2021). Long-Term Viability of Protocol Extension Mechanisms. doi:10.17487/RFC9170. RFC 9170.
- Eddy, Wesley M., ed. (August 2022). Transmission Control Protocol (TCP). doi:10.17487/RFC9293. RFC 9293.
Other documents
- Allman, Mark; Paxson, Vern (October 1999). "On estimating end-to-end network path properties". ACM SIGCOMM Computer Communication Review. 29 (4): 263–274. doi:10.1145/316194.316230. hdl:2060/20000004338.
- Bhat, Divyashri; Rizk, Amr; Zink, Michael (June 2017). "Not so QUIC: A Performance Study of DASH over QUIC". NOSSDAV'17: Proceedings of the 27th Workshop on Network and Operating Systems Support for Digital Audio and Video. pp. 13–18. doi:10.1145/3083165.3083175. S2CID 32671949.
- Blanton, Ethan; Allman, Mark (January 2002). "On making TCP more robust to packet reordering" (PDF). ACM SIGCOMM Computer Communication Review. 32: 20–30. doi:10.1145/510726.510728. S2CID 15305731.
- Briscoe, Bob; Brunstrom, Anna; Petlund, Andreas; Hayes, David; Ros, David; Tsang, Ing-Jyh; Gjessing, Stein; Fairhurst, Gorry; Griwodz, Carsten; Welzl, Michael (2016). "Reducing Internet Latency: A Survey of Techniques and Their Merits". IEEE Communications Surveys & Tutorials. 18 (3): 2149–2196. doi:10.1109/COMST.2014.2375213. hdl:2164/8018. S2CID 206576469.
- Bruyeron, Renaud; Hemon, Bruno; Zhang, Lixa (April 1998). "Experimentations with TCP selective acknowledgment". ACM SIGCOMM Computer Communication Review. 28 (2): 54–77. doi:10.1145/279345.279350. S2CID 15954837.
- Chen, Shan; Jero, Samuel; Jagielski, Matthew; Boldyreva, Alexandra; Nita-Rotaru, Cristina (2021). "Secure Communication Channel Establishment: TLS 1.3 (Over TCP Fast Open) versus QUIC". Journal of Cryptology. 34 (3) 26. doi:10.1007/s00145-021-09389-w. S2CID 235174220.
- Corbet, Jonathan (8 December 2015). "Checksum offloads and protocol ossification". LWN.net.
- Corbet, Jonathan (29 January 2018). "QUIC as a solution to protocol ossification". LWN.net.
- Edeline, Korian; Donnet, Benoit (2019). A Bottom-Up Investigation of the Transport-Layer Ossification. 2019 Network Traffic Measurement and Analysis Conference (TMA). doi:10.23919/TMA.2019.8784690.
- Ghedini, Alessandro (26 July 2018). "The Road to QUIC". The Cloudflare Blog. Cloudflare.
- Gurtov, Andrei; Floyd, Sally (February 2004). Resolving Acknowledgment Ambiguity in non-SACK TCP (PDF). Next Generation Teletraffic and Wired/Wireless Advanced Networking (NEW2AN'04).
- Gurtov, Andrei; Ludwig, Reiner (2003). Responding to Spurious Timeouts in TCP (PDF). IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies. doi:10.1109/INFCOM.2003.1209251.
- Hesmans, Benjamin; Duchene, Fabien; Paasch, Christoph; Detal, Gregory; Bonaventure, Olivier (2013). Are TCP extensions middlebox-proof?. HotMiddlebox '13. CiteSeerX 10.1.1.679.6364. doi:10.1145/2535828.2535830.
- IETF HTTP Working Group. "HTTP/2 Frequently Asked Questions".
- Karn, Phil; Partridge, Craig (November 1991). "Improving round-trip time estimates in reliable transport protocols". ACM Transactions on Computer Systems. 9 (4): 364–373. doi:10.1145/118544.118549.
- Ludwig, Reiner; Katz, Randy Howard (January 2000). "The Eifel algorithm: making TCP robust against spurious retransmissions". ACM SIGCOMM Computer Communication Review. doi:10.1145/505688.505692.
- Marx, Robin (3 December 2020). "Head-of-Line Blocking in QUIC and HTTP/3: The Details".
- Paasch, Christoph; Bonaventure, Olivier (1 April 2014). "Multipath TCP". Communications of the ACM. 57 (4): 51–57. doi:10.1145/2578901. hdl:2078.1/141195. S2CID 17581886.
- Papastergiou, Giorgos; Fairhurst, Gorry; Ros, David; Brunstrom, Anna; Grinnemo, Karl-Johan; Hurtig, Per; Khademi, Naeem; Tüxen, Michael; Welzl, Michael; Damjanovic, Dragana; Mangiante, Simone (2017). "De-Ossifying the Internet Transport Layer: A Survey and Future Perspectives". IEEE Communications Surveys & Tutorials. 19: 619–639. doi:10.1109/COMST.2016.2626780. hdl:2164/8317. S2CID 1846371.
- Rybczyńska, Marta (13 March 2020). "A QUIC look at HTTP/3". LWN.net.
- Sy, Erik; Mueller, Tobias; Burkert, Christian; Federrath, Hannes; Fischer, Mathias (2020). "Enhanced Performance and Privacy for TLS over TCP Fast Open". Proceedings on Privacy Enhancing Technologies. 2020 (2): 271–287. arXiv:1905.03518. doi:10.2478/popets-2020-0027.
- Zhang, Lixia (5 August 1986). "Why TCP timers don't work well". ACM SIGCOMM Computer Communication Review. 16 (3): 397–405. doi:10.1145/1013812.18216.
Further reading
- Stevens, W. Richard (1994-01-10). TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley Pub. Co. ISBN 978-0-201-63346-7.
- Stevens, W. Richard; Wright, Gary R (1994). TCP/IP Illustrated, Volume 2: The Implementation. Addison-Wesley. ISBN 978-0-201-63354-2.
- Stevens, W. Richard (1996). TCP/IP Illustrated, Volume 3: TCP for Transactions, HTTP, NNTP, and the UNIX Domain Protocols. Addison-Wesley. ISBN 978-0-201-63495-2.
External links
[edit]Transmission Control Protocol
View on GrokipediaHistory and Development
Origins and Early Design
The Transmission Control Protocol (TCP) originated in the early 1970s as part of the United States Department of Defense Advanced Research Projects Agency (DARPA) efforts to interconnect diverse packet-switched networks under the ARPANET project. In 1973, Vint Cerf, then at Stanford University, and Bob Kahn, at DARPA, began collaborating on a protocol to enable reliable communication across heterogeneous networks that might include varying transmission media and topologies. Their work addressed the limitations of the existing Network Control Protocol (NCP), which was confined to ARPANET's homogeneous environment and struggled with emerging multi-network scenarios. By September 1973, at a meeting in Sussex, England, Cerf and Kahn presented initial concepts for what would become TCP, emphasizing end-to-end host responsibilities rather than network-level reliability. This foundational effort culminated in their seminal 1974 paper, "A Protocol for Packet Network Intercommunication," published in IEEE Transactions on Communications, which outlined the core architecture for internetworking.[3][4]
The initial design of TCP was driven by the need for a robust transport layer protocol that could operate over unreliable underlying networks, providing reliable data delivery in the face of potential failures. Key goals included establishing connection-oriented communication between processes on different hosts, ensuring ordered delivery of data streams, and incorporating mechanisms for flow control, error detection, and recovery from transmission issues. Cerf and Kahn envisioned TCP as a gateway-agnostic solution that would handle variations in packet sizes, sequencing, and end-to-end acknowledgments, thereby abstracting the complexities of diverse networks from upper-layer applications.
This approach was particularly motivated by the challenges of early internetworks, such as packet loss due to congestion or errors, out-of-order arrivals from differing route delays, and potential duplications from retransmissions or routing loops. By shifting reliability to the endpoints, TCP aimed to foster scalable interconnection without requiring modifications to individual networks.[3][5]
Early documentation of TCP appeared in December 1974 with RFC 675, "Specification of Internet Transmission Control Program," authored by Cerf, Yogen Dalal, and Carl Sunshine, which detailed the protocol's interface and functions for internetwork transmission. Initially, TCP encompassed both transport and internetworking responsibilities in a single layer. However, by 1978, growing requirements for distinct handling of datagram routing led to its separation into the Transmission Control Protocol for host-to-host transport and the Internet Protocol (IP) for network-layer addressing and forwarding, forming the basis of the TCP/IP suite. This split was formalized through subsequent revisions, with the baseline specification for TCP established in RFC 793, "Transmission Control Protocol," published in September 1981 by Jon Postel, which defined the standard connection establishment, data transfer, and termination procedures still in use today.[6][7][8]
Standardization and Evolution
The Transmission Control Protocol (TCP) was formally standardized as a Department of Defense (DoD) standard in RFC 793, published in September 1981 and edited by Jon Postel of the Information Sciences Institute at the University of Southern California.[9] This document established the baseline specification for TCP, defining it as a reliable, connection-oriented transport protocol for internetwork communication, including mechanisms for connection establishment via a three-way handshake, data transfer with sequence numbering and acknowledgments, flow control, and connection termination.[9] Subsequent updates refined these requirements to promote host interoperability and address implementation ambiguities. In October 1989, RFC 1122, edited by R. Braden, provided a comprehensive set of requirements for Internet hosts, updating RFC 793 by clarifying TCP behaviors such as segment processing (e.g., handling of the Push flag and window sizing), retransmission timeout calculations using algorithms like Jacobson's and Karn's, and support for options including Maximum Segment Size (MSS).[10] It also addressed urgent data handling, specifying that the Urgent Pointer points to the last byte of urgent data and requires asynchronous notification to applications, while making Push flag processing optional for delivery to the application layer.[10] Major evolutionary changes focused on improving network stability and performance amid growing Internet traffic. 
Congestion control was first introduced in RFC 896, published in January 1984 by John Nagle, which identified "congestion collapse" risks in IP/TCP networks due to gateway overloads and proposed mitigations like reducing small packet transmissions, later formalized as Nagle's algorithm to coalesce short segments and avoid inefficient "silly window" effects.[11] Refinements to Nagle's algorithm appeared in subsequent specifications, such as RFC 1122's integration with Silly Window Syndrome (SWS) avoidance, though its use became tunable (e.g., via TCP_NODELAY) to balance latency and throughput in diverse applications.[10]
Further advancements in congestion management came with RFC 2001, published in January 1997 by W. Richard Stevens, which standardized the slow start algorithm as part of TCP's core congestion control mechanisms.[12] Slow start initializes the congestion window to one segment for new connections, exponentially increasing it based on acknowledgment rates to probe available bandwidth without overwhelming the network, transitioning to congestion avoidance once a threshold is reached.[12] This built on earlier proposals to prevent the aggressive window growth seen in pre-1988 TCP implementations that contributed to Internet congestion episodes.[12]
The Internet Engineering Task Force (IETF) has overseen TCP's ongoing specification through its working groups, including the historical TCP Extensions Working Group and the modern TCP Maintenance and Minor Extensions (TCPM) Working Group, which handle clarifications, minor extensions, and updates to ensure protocol robustness.[13] For instance, out-of-band data via the Urgent mechanism received clarifications in RFC 1122 but was later discouraged for new applications in evolutions like RFC 9293 (2022, ed. W. Eddy), due to inconsistent implementations and middlebox interference, though legacy support remains mandatory.[10][14]
Recent Extensions and Proposals
In the 2010s, efforts to reduce latency in TCP connection establishment led to the development of TCP Fast Open (TFO), an experimental extension that allows data to be included in the initial SYN packet, enabling the receiver to process it during the handshake without waiting for a full connection.[15] This mechanism can save up to one round-trip time (RTT) for short connections, such as those in web browsing, by carrying up to 60 bytes of application data in the SYN and optionally in the SYN-ACK.[15] TFO uses a cookie-based approach to mitigate SYN flood attacks, where the server provides a cryptographically generated cookie in the SYN-ACK for subsequent connections from the same client.[15] Deployed in kernels like Linux since version 3.7 and supported by major browsers, TFO has demonstrated latency reductions of 10-30% in real-world scenarios with repeated short flows.[15] Building on multipath capabilities, Multipath TCP (MPTCP) was standardized in 2020 to allow a single TCP connection to aggregate multiple network paths simultaneously, enhancing throughput and resilience in heterogeneous environments like mobile networks.[16] MPTCP introduces new TCP options for path management, subflow establishment, and data scheduling across paths, while maintaining compatibility with legacy single-path TCP endpoints through a fallback mechanism.[16] It employs coupled congestion control to fairly share capacity among paths, preventing over-utilization of any single link, and supports failover by seamlessly switching traffic during path disruptions.[16] Widely implemented in iOS, Android, and Linux kernels, MPTCP has shown throughput increases of up to 80% in Wi-Fi/cellular aggregation tests.[16] As an alternative to traditional loss-based congestion control algorithms like Reno or CUBIC, Google's BBR (Bottleneck Bandwidth and Round-trip propagation time) algorithm, introduced in 2016, adopts a model-based approach that estimates the network's bottleneck bandwidth and RTT 
to set the congestion window, aiming to maximize throughput while minimizing latency.[17] Unlike loss-based methods that react to packet drops, BBR proactively paces sends to match available bandwidth and controls queues by estimating the minimum RTT, reducing bufferbloat in bottleneck links.[17] Deployed across Google's B4 inter-data center network since 2016, BBR has achieved up to 26 times higher throughput and three times lower latency compared to CUBIC in long-lived flows over high-bandwidth-delay product paths.[17] Its integration into Linux kernel 4.9 and subsequent refinements, such as BBRv2, address fairness issues with competing flows.[17] Recent analyses in 2025 have highlighted persistent gaps between TCP RFC specifications and their implementations, particularly in security features, underscoring challenges in protocol evolution. A study using automated differential checks on intermediate representations of RFCs and codebases identified 15 inconsistencies across major TCP stacks, including improper Initial Sequence Number (ISN) generation vulnerable to prediction attacks and flawed TCP Challenge ACK responses that fail to validate spoofed segments. These mismatches, observed in implementations like Linux and FreeBSD, stem from incomplete updates to RFCs such as 793 and 5961, potentially exposing networks to off-path injection or denial-of-service exploits. The research emphasizes the need for LLM-assisted validation tools to bridge these gaps, as manual audits struggle with the protocol's complexity. For high-performance computing (HPC) and artificial intelligence (AI) workloads, ongoing proposals in 2024-2025 seek to extend TCP with support for collective communication primitives and direct device memory access, addressing limitations in distributed training and simulation. 
Discussions at Netdev 0x18 highlighted extensions enabling TCP to handle all-reduce and broadcast operations natively, integrating with Message Passing Interface (MPI) semantics to reduce overhead in GPU clusters.[18] These include Device Memory TCP (devmem TCP), merged into the Linux kernel in 2024 (version 6.12), which allows TCP payloads to be directly mapped to GPU or accelerator memory, bypassing host CPU copies.[19] Prototypes for collective operations have demonstrated up to 3x throughput gains over standard TCP in GPU communication benchmarks.[20]

Overview and Network Role
Core Principles and Functions
The Transmission Control Protocol (TCP) operates as a connection-oriented transport protocol, establishing a virtual connection between two endpoints before data transfer begins, which allows for managed, stateful communication. This model supports full-duplex operation, enabling simultaneous data flow in both directions over the same connection, identified by a pair of sockets consisting of IP addresses and port numbers.[21] Such a design facilitates reliable process-to-process communication in packet-switched networks, where the protocol maintains connection state to coordinate data exchange.[22] At its core, TCP provides end-to-end reliability, ensuring that data is delivered accurately and in the exact order sent, without duplicates or losses, even across unreliable underlying networks. This is achieved through mechanisms such as sequence numbering for each octet of data, positive acknowledgments from the receiver, and retransmission of lost segments upon timeout detection. Error recovery is further supported by checksums that detect corrupted data, prompting discards and subsequent retransmissions to maintain integrity.[23] These guarantees make TCP suitable for applications requiring dependable transfer, contrasting with the best-effort delivery of lower-layer protocols.[24] TCP offers key services to applications, including a stream abstraction that presents data as a continuous, byte-oriented flow rather than discrete packets, hiding the complexities of network segmentation. It enables multiplexing through port numbers, allowing multiple concurrent connections on a single host by distinguishing between different application processes. 
Additionally, TCP handles segmentation and reassembly, dividing application data into manageable segments for transmission and reconstructing the original stream at the receiver, ensuring seamless interaction for higher-layer protocols.[24] In distinction from datagram protocols like the User Datagram Protocol (UDP), TCP imposes additional overhead—such as larger headers and connection management—to achieve its reliability features, whereas UDP provides a lightweight, connectionless service with no ordering or delivery assurances, prioritizing simplicity and low latency for real-time applications.[22][25] This trade-off positions TCP as the preferred choice for scenarios demanding accuracy over speed.[25] Common use cases for TCP include web browsing via HTTP and HTTPS, which rely on its reliability for transferring hypertext documents and secure content; email transmission through SMTP, ensuring complete message delivery; and file transfers with FTP, where ordered and error-free octet streams are essential for data integrity.[26]

Integration in TCP/IP Stack
The Transmission Control Protocol (TCP) operates at the transport layer (layer 4) of the OSI model, positioned directly above the Internet Protocol (IP) at the network layer (layer 3), forming a key component of the TCP/IP protocol suite.[14] This layering enables TCP to provide end-to-end services while relying on IP for routing and delivery across interconnected networks.[27] In the TCP/IP architecture, TCP interfaces with higher-layer protocols such as application services and lower-layer mechanisms including IP and its associated protocols, ensuring modular operation in heterogeneous environments.[28] TCP segments are encapsulated within IP datagrams, where the TCP header immediately follows the IP header, allowing IP to handle addressing, fragmentation, and transmission of the combined packet.[29] Source and destination port numbers in the TCP header, combined with IP addresses, form socket pairs that enable demultiplexing of multiple concurrent connections at each host, directing incoming data to the appropriate processes.[30] This encapsulation supports TCP's role in facilitating reliable host-to-host communication over diverse, packet-switched networks, abstracting underlying variations in link-layer technologies.[28] TCP interacts with the Internet Control Message Protocol (ICMP), which operates within the IP layer, to receive error reports such as destination unreachable or time exceeded messages; TCP implementations must process these to adjust behavior, such as aborting connections or reducing segment sizes via Path MTU Discovery.[31] For address resolution, TCP relies indirectly on the Address Resolution Protocol (ARP), invoked by the IP layer to map IP addresses to link-layer (e.g., Ethernet) addresses before transmitting TCP-carrying datagrams on local networks.[32] In the context of IPv6, TCP's checksum computation has evolved to include a pseudo-header incorporating the IPv6 source and destination addresses, the upper-layer packet length, three 
zero octets, and the next header value (6 for TCP), enhancing protection against misdelivery compared to IPv4.[29] This adjustment, defined in the IPv6 specification, ensures checksum integrity aligns with IPv6's header structure while maintaining backward compatibility with TCP's reliability mechanisms.[33]

TCP Segment Structure
Header Format and Fields
The TCP header consists of a minimum of 20 bytes in a fixed format, which precedes the data payload in each TCP segment. This structure is defined to provide essential control information for reliable data transfer over IP networks. The header fields are arranged in a specific byte order, with the first 12 bytes containing port numbers and sequence information, followed by control flags, window size, checksum, and urgent pointer.[34] Key fields in the TCP header include the source port and destination port, each 16 bits long, which identify the sending and receiving applications, respectively, enabling multiplexing of multiple connections over a single IP address. The sequence number field, 32 bits, represents the position of the first byte of data in the sender's byte stream, treating the data as a continuous sequence of octets for reliable ordering and retransmission. The acknowledgment number field, also 32 bits, specifies the next expected byte from the remote sender when the ACK flag is set, confirming receipt of prior data.[34] The data offset field uses 4 bits to indicate the length of the TCP header in 32-bit words, with a minimum value of 5 corresponding to the 20-byte fixed header. A 4-bit reserved field follows, which must be set to zero and is ignored by receivers. The flags field comprises 8 bits for control purposes: CWR (Congestion Window Reduced), ECE (ECN-Echo), URG (urgent data present), ACK (acknowledgment valid), PSH (request immediate data delivery to application), RST (reset the connection), SYN (synchronize sequence numbers for connection setup), and FIN (no more data from sender). These flags manage connection states, error handling, and data processing semantics. The 16-bit window field advertises the receiver's available buffer space in octets for flow control, while the 16-bit checksum covers the header, payload, and a pseudo-header for integrity verification. 
The urgent pointer, 16 bits, provides an offset from the sequence number to the end of urgent data when the URG flag is set.[34] TCP employs byte-stream oriented sequence numbering, where each segment's sequence number increments based on the amount of data sent, wrapping around after 2^32 - 1 to ensure continuous tracking without gaps. To enhance security against prediction attacks, the initial sequence number (ISN) is randomized using a cryptographically strong generator, incorporating factors like timestamps and connection identifiers to prevent off-path guessing. This randomization mitigates risks such as session hijacking while maintaining compatibility with the protocol's core reliability mechanisms.[35][36]

| Field | Size (bits) | Description |
|---|---|---|
| Source Port | 16 | Identifies the sending port. |
| Destination Port | 16 | Identifies the receiving port. |
| Sequence Number | 32 | Byte position of first data octet in stream. |
| Acknowledgment Number | 32 | Next expected byte sequence number (if ACK set). |
| Data Offset | 4 | Header length in 32-bit words. |
| Reserved | 4 | Must be zero. |
| Flags (CWR, ECE, URG, ACK, PSH, RST, SYN, FIN) | 8 | Control bits for congestion notification, connection management, and data handling. |
| Window | 16 | Receiver's buffer capacity in octets. |
| Checksum | 16 | Integrity check over header and data. |
| Urgent Pointer | 16 | Offset to end of urgent data (if URG set). |
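The fixed header layout summarized above can be decoded with a short script. This is an illustrative sketch: the function name, dictionary keys, and sample segment are inventions for the example, not any real API, and the sample checksum is left at zero rather than computed.

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Decode the 20-byte fixed TCP header (network byte order)."""
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    flag_bits = [("CWR", 0x80), ("ECE", 0x40), ("URG", 0x20), ("ACK", 0x10),
                 ("PSH", 0x08), ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01)]
    return {
        "src_port": src_port, "dst_port": dst_port, "seq": seq, "ack": ack,
        "data_offset": (offset_flags >> 12) & 0xF,  # header length in 32-bit words
        "flags": {name: bool(offset_flags & bit) for name, bit in flag_bits},
        "window": window, "checksum": checksum, "urgent": urgent,
    }

# Hypothetical SYN segment: ephemeral port 49152 to port 80, ISN 1000,
# data offset 5 (no options), SYN flag set, 65535-octet window.
syn_segment = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0,
                          (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn_segment)
```

A data offset greater than 5 would indicate that options follow the fixed header, occupying (data_offset − 5) × 4 bytes before the payload.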
Header Options and Extensions
The TCP header includes a variable-length options field that allows for extensibility beyond the fixed 20-byte header, enabling negotiation of parameters to optimize connection performance.[14] Each option follows a general format consisting of a 1-byte Kind field identifying the option type, a 1-byte Length field specifying the total size of the option (including Kind and Length), and a variable-length Data field containing option-specific information; options with Kind values of 0 or 1 omit the Length field.[14] The options are padded with zeros to ensure 32-bit alignment, preventing misalignment in the header.[14] The fixed header's Data Offset field specifies the total header length in 32-bit words, accounting for the options' variable size.[14] Common TCP options include the Maximum Segment Size (MSS) option (Kind 2), which specifies the largest segment the sender can receive excluding the TCP and IP headers, typically exchanged to avoid fragmentation; the Window Scale option (Kind 3), which enables scaling of the receive window for high-bandwidth connections; the Selective Acknowledgment (SACK) Permitted option (Kind 4), which indicates support for selective acknowledgments; and the Timestamps option (Kind 8), which adds timestamp values for better round-trip time estimation and protection against wrapped sequence numbers.[37][14] These options are defined in RFC 9293 for the core format, with specifics for Window Scale and Timestamps in RFC 7323, and SACK Permitted in RFC 2018.[14] Options are primarily negotiated during the SYN phase of the three-way handshake, where the SYN sender proposes supported options, and the SYN-ACK receiver responds by echoing accepted ones or proposing modifications, establishing mutual agreement for the connection.[14] To manage variable lengths, the End of Option List (Kind 0) marks the conclusion of options with a single byte, while the No-Operation (NOP, Kind 1) serves as a single-byte padding element to align subsequent 
options without affecting semantics.[14][37] The total length of options is limited to 40 bytes, as the maximum TCP header size is 60 bytes (20 bytes fixed plus 40 for options), ensuring compatibility with IP packet constraints and avoiding excessive overhead or fragmentation risks.[14] This cap is enforced to maintain efficiency in the protocol's design.[14]

Protocol Operation
Connection Establishment
The Transmission Control Protocol (TCP) establishes a reliable, connection-oriented communication channel through a process known as the three-way handshake, which synchronizes sequence numbers and confirms mutual agreement to proceed.[9] This mechanism ensures that both endpoints are ready for data transfer and prevents issues from delayed or duplicate packets. The process begins when a client initiates a connection by sending a SYN (synchronize) segment to the server. This segment sets the SYN flag in the TCP header and includes the client's initial sequence number (ISN), a 32-bit value that marks the starting point for byte-level numbering of data sent by the client.[9] The ISN is generated using a clock-based procedure to promote uniqueness, typically incrementing every 4 microseconds to cycle through the 32-bit space approximately every 4.55 hours.[9] Upon receiving the SYN, the server responds with a SYN-ACK (synchronize-acknowledge) segment, which sets both the SYN and ACK flags, acknowledges receipt of the client's SYN by setting the acknowledgment number to the client's ISN plus one, and includes the server's own ISN.[9] This dual-flag segment confirms the server's willingness to establish the connection while advancing its own sequence numbering. The client then completes the handshake by sending an ACK segment, which sets the ACK flag and acknowledges the server's SYN-ACK by setting the acknowledgment number to the server's ISN plus one.[9] At this point, both endpoints transition to the ESTABLISHED state, with their sequence numbers synchronized, allowing subsequent data segments to carry meaningful acknowledgments. 
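The sequence and acknowledgment arithmetic of the handshake can be sketched with a toy model; the ISNs are fixed here purely for illustration, whereas real stacks randomize them.

```python
# Toy model of the three-way handshake's sequence/acknowledgment arithmetic.
client_isn, server_isn = 1000, 5000   # illustrative fixed ISNs

syn     = {"flags": {"SYN"}, "seq": client_isn}            # client -> server
syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn,     # server -> client
           "ack": syn["seq"] + 1}                          # acknowledges client's SYN
ack     = {"flags": {"ACK"}, "seq": client_isn + 1,        # client -> server
           "ack": syn_ack["seq"] + 1}                      # acknowledges server's SYN
# Both sides are now ESTABLISHED with synchronized sequence numbers.
```

Note that each SYN consumes one sequence number, which is why the acknowledgment numbers are ISN + 1 rather than ISN.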
The three-way exchange ensures bidirectional confirmation, as a two-way process could not reliably verify the initiator's receipt of the responder's commitment.[9] TCP also handles the rare case of simultaneous open, where both endpoints initiate a connection at the same time by sending SYN segments to each other.[9] In this scenario, each side receives the other's SYN and responds with a SYN-ACK that acknowledges the incoming SYN and carries its own ISN; upon receiving the peer's SYN-ACK, each side enters the ESTABLISHED state, so the four-segment exchange completes without conflict.[9] This symmetric handling maintains protocol robustness even under concurrent initiation attempts. To enhance security, modern TCP implementations randomize the ISN rather than relying solely on predictable clock increments, as predictable ISNs can enable off-path attackers to forge packets and hijack connections, a weakness first analyzed by Robert T. Morris in 1985.[36] RFC 6528 recommends generating the ISN using a pseudorandom function (PRF) that incorporates a secret key along with connection parameters such as IP addresses and ports, for example, ISN = M + F(localip, localport, remoteip, remoteport, secret), where M is a monotonic timer and F is a cryptographic hash like MD5 with a periodically refreshed 128-bit key.[36] This randomization makes ISN prediction computationally infeasible, significantly reducing the risk of sequence number attacks.
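A minimal sketch of the RFC 6528 scheme follows. The encoding of the connection identifier is an assumption of this example (the RFC specifies the inputs to F, not a byte layout, and permits hashes other than MD5); the timer granularity follows RFC 793's 4-microsecond clock.

```python
import hashlib

def rfc6528_isn(timer_us: int, laddr: str, lport: int,
                raddr: str, rport: int, secret: bytes) -> int:
    """ISN = (M + F(connection id, secret)) mod 2^32, per the RFC 6528 scheme.
    M is a timer tick count; F is modeled here with MD5, as in the RFC's example."""
    conn_id = f"{laddr}:{lport}->{raddr}:{rport}".encode()  # illustrative encoding
    f = int.from_bytes(hashlib.md5(conn_id + secret).digest()[:4], "big")
    m = timer_us // 4                  # one tick per 4 microseconds (RFC 793)
    return (m + f) % (1 << 32)

isn = rfc6528_isn(1_000_000, "192.0.2.1", 49152, "198.51.100.2", 443, b"k" * 16)
```

Because the secret key is folded into F, an off-path attacker who observes ISNs for one connection tuple learns nothing useful about the ISNs of another.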
Additionally, TCP addresses half-open connections—where a SYN is received but no final ACK follows—through mechanisms like SYN cookies to mitigate denial-of-service (DoS) attacks such as SYN floods, which exhaust server resources by creating numerous incomplete connections.[38] In SYN cookie mode, the server does not allocate a full transmission control block (TCB) upon receiving a SYN; instead, it encodes the connection state into the SYN-ACK's sequence number using a 32-bit cookie derived from a hash of the connection tuple (IP addresses, ports) and a timestamp counter, typically structured as 24 bits of hash + 3 bits for maximum segment size (MSS) + 5 bits from a 64-second counter.[38] When the client responds with an ACK, the server reconstructs the state from the cookie only if it validates against the hash and recent timestamps, rejecting invalid ones without resource commitment. This stateless approach allows servers to handle high volumes of SYNs during attacks while maintaining responsiveness to legitimate traffic.[38]

Reliable Data Transfer
TCP ensures reliable data transfer over unreliable networks by implementing mechanisms for error detection, sequence numbering, acknowledgments, and retransmissions, guaranteeing that data is delivered in order without loss or duplication.[9] The protocol treats data as a byte stream, assigning a 32-bit sequence number to each byte, which allows for ordered delivery and handling of wrap-around in long-lived connections.[9] Cumulative acknowledgments form the core of TCP's reliability, where the acknowledgment (ACK) number in a segment specifies the next expected sequence number from the sender, implicitly confirming receipt of all preceding bytes.[9] This cumulative approach simplifies the protocol by requiring only one ACK per segment to acknowledge multiple prior segments, reducing overhead while ensuring ordered delivery.[9] Receivers discard out-of-order segments and send duplicate ACKs for the last correctly received byte, signaling potential losses without explicit negative acknowledgments.[9] Retransmissions in TCP are triggered either by a timeout or by fast retransmit upon detecting loss. 
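The cumulative and duplicate acknowledgment behavior can be modeled with a toy receiver that, as described above, discards out-of-order segments; the function and its segment tuples are illustrative inventions.

```python
def receiver_acks(segments, next_expected=0):
    """Toy in-order receiver. Each segment is (seq, length); returns the
    cumulative ACK emitted after each arrival. Out-of-order segments are
    discarded, re-eliciting a duplicate ACK for the next byte expected."""
    acks = []
    for seq, length in segments:
        if seq == next_expected:       # in order: accept and advance
            next_expected += length
        acks.append(next_expected)     # cumulative ACK (duplicate if discarded)
    return acks

# The 100-byte segment at seq 100 is delayed; segments 200 and 300 arrive
# first and are discarded, each triggering a duplicate ACK for byte 100.
acks = receiver_acks([(0, 100), (200, 100), (300, 100), (100, 100)])
```

The run of identical ACKs for byte 100 is exactly the signal the fast-retransmit heuristic watches for; note that because this toy receiver discards rather than buffers, the final ACK advances only to 200, illustrating the Go-Back-N cost that SACK avoids.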
The sender maintains a retransmission timeout (RTO) based on measured round-trip time (RTT), retransmitting unacknowledged segments if the RTO expires.[9] For faster recovery, TCP performs fast retransmit when three duplicate ACKs arrive, indicating a likely lost segment ahead of correctly received data; the sender then retransmits that segment without waiting for the timeout.[39] These mechanisms complement error detection via the TCP checksum, which verifies segment integrity upon receipt.[9] TCP's basic retransmission strategy follows a Go-Back-N automatic repeat request (ARQ) model, where upon loss detection, the sender retransmits the missing segment and all subsequent unacknowledged segments, regardless of their receipt status at the receiver.[9] This approach is efficient for low error rates but can waste bandwidth on high-loss paths by resending already-delivered data. Extensions like Selective Acknowledgments (SACK) enable selective repeat retransmissions, allowing the sender to retransmit only lost segments while skipping acknowledged ones, though basic TCP relies on cumulative ACKs alone.[40] To prevent duplicates from confusing the receiver, especially with 32-bit sequence numbers that wrap around after 4 GB of data (modulo 2^32 arithmetic), TCP uses the sequence numbers and ACKs to uniquely identify and discard duplicate segments.[9] The sender's initial sequence number (ISN), chosen randomly during connection setup, further mitigates risks from old or duplicate connections.[9] When the receiver advertises a zero receive window—indicating no buffer space—the sender stops transmitting but periodically probes with zero-window probes: small segments sent at intervals that back off exponentially (commonly capped around one minute) to check whether the window has reopened.[9] If the receiver responds with a non-zero window, transmission resumes; otherwise, probing continues until the connection times out.[9] This prevents indefinite stalls due to temporary receiver overload.

Flow Control Mechanisms
TCP employs a sliding window protocol for flow control, allowing the sender to transmit multiple segments before requiring acknowledgments while respecting the receiver's buffer capacity. The receiver advertises its available buffer space, known as the receive window (rwnd), in the Window field of every TCP segment header, indicating the number of octets it can accept starting from the next expected sequence number.[9] This mechanism ensures that the sender does not overwhelm the receiver by limiting outstanding unacknowledged data to the advertised rwnd. Window updates are conveyed in acknowledgment (ACK) segments, where the receiver dynamically adjusts the advertised rwnd based on its current buffer availability; the window can grow as space becomes available or shrink if the buffer fills.[9] To optimize efficiency, receivers are encouraged to defer sending window updates until the available space increases significantly, such as by at least 20-40% of the maximum window size, thereby avoiding frequent small adjustments.[9] The silly window syndrome (SWS) arises when either the sender transmits or the receiver advertises very small windows, leading to inefficient use of network bandwidth with numerous tiny segments. 
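Using RFC 793's state variables SND.UNA (oldest unacknowledged byte) and SND.NXT (next byte to send), the sender's remaining usable window can be sketched as follows; the function name is an invention for the example.

```python
def usable_window(snd_una: int, snd_nxt: int, rwnd: int) -> int:
    """Bytes the sender may still transmit: the advertised window is
    measured from SND.UNA, and data already in flight (SND.UNA..SND.NXT)
    consumes part of it."""
    return max(0, snd_una + rwnd - snd_nxt)

# 4000 bytes acknowledged, 6000 sent, receiver advertises 8000 bytes:
# 2000 bytes are in flight, leaving 6000 usable.
remaining = usable_window(4000, 6000, 8000)
```

When the in-flight data fills the entire advertised window, the usable window is zero and the sender must pause until an ACK or window update arrives.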
To mitigate SWS, TCP implementations use delayed acknowledgments, where the receiver postpones sending an ACK for up to 200-500 milliseconds or until a full segment's worth of data arrives, unless a push bit is set or the window changes substantially.[41] Complementing this, Nagle's algorithm on the sender side buffers small amounts of outgoing data until either an ACK arrives for prior data or the buffer reaches the maximum segment size, preventing the transmission of undersized segments during interactive applications like Telnet.[11] Together, these strategies substantially reduce overhead and maintain high throughput by promoting larger, more efficient transfers.[41] When the receiver's advertised window reaches zero, indicating no buffer space, the sender halts transmission but must periodically probe the connection to check for updates. This zero-window probing involves sending a one-octet segment (or the smallest allowable unit) after a retransmission timeout, with subsequent probes doubling in interval exponentially until a non-zero window is advertised or the connection times out.[14] The receiver processes these probes by responding with an ACK containing the current window size, allowing the connection to resume without closing.[14] The effective amount of data a TCP sender can transmit is further constrained by the minimum of the receive window (rwnd) and the congestion window (cwnd), ensuring flow control coordinates with network congestion management to prevent both receiver overload and link saturation.[42]

Congestion Control Algorithms
TCP congestion control algorithms aim to prevent network overload by dynamically adjusting the sender's transmission rate based on inferred network conditions, primarily through management of the congestion window (cwnd), which limits the amount of unacknowledged data in flight. These algorithms follow an additive increase/multiplicative decrease (AIMD) policy, where the cwnd grows gradually during periods of low congestion and halves upon detecting congestion, ensuring fair sharing of bandwidth among competing flows.[43] This approach, foundational to TCP's stability, was introduced to address early Internet congestion collapses observed in the late 1980s.[43] The core phases of TCP congestion control include slow start and congestion avoidance. In slow start, the cwnd begins at a small initial value (typically 2–10 segments) and doubles approximately every round-trip time (RTT) upon receiving acknowledgments, allowing exponential growth to quickly probe available bandwidth without immediate risk of overload.[44] This phase transitions to congestion avoidance when the cwnd reaches the slow start threshold (ssthresh), typically set to half the cwnd at the onset of congestion. During congestion avoidance, the cwnd increases linearly by incrementing it by 1/cwnd for each acknowledgment received, effectively adding one segment per RTT (cwnd ← cwnd + 1/cwnd per ACK, with cwnd measured in segments). This linear growth promotes fairness among flows while avoiding excessive aggression.[44] Congestion detection triggers multiplicative reduction: upon timeout or receipt of three duplicate acknowledgments (indicating packet loss), the ssthresh is set to half the current cwnd, and the cwnd is adjusted accordingly.
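A simplified per-RTT trace of the two phases can be sketched as follows. This collapses the per-ACK increments into whole-RTT steps and measures cwnd in units of one maximum segment size; real implementations also track recovery states that this sketch omits.

```python
def next_rtt_cwnd(cwnd: float, ssthresh: float) -> float:
    """cwnd (in MSS units) after one RTT in which every segment is acknowledged."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)   # slow start: exponential growth per RTT
    return cwnd + 1                      # congestion avoidance: +1 MSS per RTT

cwnd, ssthresh = 1.0, 8.0
trajectory = [cwnd]
for _ in range(5):                       # five loss-free RTTs
    cwnd = next_rtt_cwnd(cwnd, ssthresh)
    trajectory.append(cwnd)

# On loss detection, multiplicative decrease halves the threshold
# (floored at 2 MSS, as in the standard algorithms):
ssthresh_after_loss = max(cwnd / 2, 2.0)
```

The trajectory doubles (1, 2, 4, 8) until cwnd reaches ssthresh, then grows by one segment per RTT (9, 10), matching the exponential-then-linear shape described above.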
Fast retransmit and fast recovery enhance efficiency by retransmitting lost segments upon three duplicate ACKs without waiting for a timeout, and then recovering by inflating the cwnd temporarily (to ssthresh plus three) before deflating it to ssthresh upon new ACKs, avoiding unnecessary slow start restarts.[44] To manage retransmission timeouts (RTO), TCP computes an estimated RTT using the smoothed RTT (SRTT) and RTT variance. The SRTT is updated as SRTT ← (1 - α) × SRTT + α × SampleRTT, where α = 0.125, and the variance (RTTVAR) as RTTVAR ← (1 - β) × RTTVAR + β × |SampleRTT - SRTT|, with β = 0.25; the RTO is then RTO ← SRTT + 4 × RTTVAR, clamped between 1 second and 60 seconds. This adaptive timer prevents premature or delayed retransmissions, balancing throughput and reliability. Variants of these algorithms address limitations in diverse network conditions. TCP Reno, specified in RFC 2581, integrates fast recovery with AIMD for improved performance over lossy links by reducing the penalty for isolated losses.[45] TCP Cubic, designed for high-bandwidth, long-delay networks, modifies the congestion avoidance phase with a cubic function for cwnd growth that is less aggressive at low rates but scales better at high rates, achieving greater throughput while remaining friendly to Reno flows. Bottleneck Bandwidth and Round-trip propagation time (BBR), a model-based approach from Google, estimates available bandwidth and delay to set cwnd more precisely, offering higher utilization in constrained paths (detailed further in recent extensions).[17] These algorithms interact with flow control via the effective window, the minimum of cwnd and the receiver's advertised window, to respect both network and endpoint limits.[44]

Connection Termination
TCP connection termination ensures that both endpoints agree to end communication gracefully, preventing data loss and resource leaks while handling potential network anomalies. The process typically involves a four-way handshake using the Finish (FIN) flag in TCP segments to signal the end of data transmission from one side, allowing for orderly shutdown. This mechanism is defined in the original TCP specification, which outlines state transitions to manage the closure reliably.[9] In an active close, one endpoint (the active closer) initiates termination by sending a segment with the FIN flag set, transitioning to the FIN-WAIT-1 state. The remote endpoint (passive closer) acknowledges this with an ACK, prompting the active closer to enter FIN-WAIT-2. Upon deciding to close, the passive endpoint sends its own FIN, which the active closer acknowledges, leading to the TIME-WAIT state before fully closing. This sequence ensures all data is transmitted and acknowledged before release.[9] The passive close mirrors the active process but from the receiving side: upon receiving the initial FIN, the endpoint sends an ACK and enters the CLOSE-WAIT state, notifying its application to stop sending data. Once the application issues a close command, the endpoint sends a FIN and transitions to LAST-ACK, awaiting the final ACK from the active closer to reach the CLOSED state. A symmetric FIN exchange thus coordinates the bilateral shutdown, with the active side's final ACK completing the process after a brief delay.[9] TCP supports half-close, enabling one direction of the connection to terminate while the other continues transmitting data. For instance, after the active closer sends its FIN and receives the ACK, the passive endpoint can still send remaining data before issuing its FIN. 
This feature, useful in applications like file transfers where one side finishes sending but needs to receive more, maintains reliability in unidirectional scenarios without forcing full closure.[9] For abrupt termination on errors, such as invalid segments or application aborts, TCP uses the Reset (RST) flag in a segment to immediately close the connection. The RST causes both endpoints to discard the connection state and flush associated queues, bypassing graceful sequences; it is sent in response to out-of-window data or explicit abort requests, ensuring quick recovery from anomalies. The FIN and RST flags, part of the TCP header's control bits, facilitate these distinct closure modes.[9] The TIME-WAIT state, entered by the active closer after acknowledging the remote FIN, enforces a delay of twice the Maximum Segment Lifetime (2 × MSL; with the RFC 793 MSL of 2 minutes, a 4-minute wait) before deleting the connection record and releasing the local port. This wait absorbs any lingering duplicate segments from the prior incarnation of the connection, preventing them from confusing a new instance using the same tuple and mitigating risks like port exhaustion in high-throughput environments. Without this safeguard, delayed packets could corrupt subsequent connections, compromising TCP's reliability.[9]

Resource Allocation and Management
TCP employs a Transmission Control Block (TCB) for each active connection to maintain essential state variables, including local and remote socket identifiers, sequence numbers, window sizes, and timer information.[14] This per-connection structure ensures isolated management of resources, preventing interference between concurrent sessions.[14] Additionally, TCP allocates dynamic buffers for send and receive queues to handle data temporarily before transmission or delivery to the application, with buffer sizes typically adjustable based on system memory availability to optimize performance without excessive allocation.[14] Port allocation in TCP distinguishes between well-known ports, assigned by the Internet Assigned Numbers Authority (IANA) in the range 0–1023 for standard services like HTTP on port 80, and ephemeral ports used by clients for outgoing connections.[46] The IANA-recommended ephemeral port range spans 49152–65535, providing a pool of 16,384 dynamic ports to support multiple simultaneous client connections from a single host, though operating systems may configure slightly different ranges for compatibility.[46] This separation facilitates orderly resource assignment, with servers binding to well-known ports and clients selecting unused ephemeral ports to form unique socket pairs. TCP relies on several timers to manage resources efficiently during operation.
The retransmission timer, set for each unacknowledged segment based on the estimated round-trip time (RTT) plus variance, triggers resends if acknowledgments are not received within the computed timeout, adhering to a standardized algorithm that bounds the initial value at one second and doubles it on subsequent retries up to a maximum.[47] The persistence timer activates when the receiver advertises a zero receive window, periodically probing with small segments to elicit window updates and avoid deadlocks from lost advertisements.[14] The keepalive timer, optional but recommended for long-idle connections, defaults to an interval of no less than two hours, sending probe segments to detect if the peer has crashed or become unreachable, thereby enabling timely resource release.[10] To mitigate resource exhaustion, TCP servers maintain a SYN backlog queue during connection establishment, queuing incoming SYN segments in the SYN-RECEIVED state up to a configurable limit—often 128 or more in modern implementations—before rejecting further attempts with resets.[38] This queue consumes memory for partial TCBs and helps defend against floods that could deplete port or memory resources by limiting half-open connections, though excessive backlog pressure may lead to dropped SYNs and incomplete handshakes.[38] Resource cleanup occurs through state-specific timeouts to reclaim allocations after connection termination begins. 
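The retransmission timeout computation referenced above follows the standardized algorithm of RFC 6298; a minimal Python sketch of the smoothing arithmetic, with illustrative (not measured) RTT samples:

```python
# Sketch of the RFC 6298 retransmission timeout (RTO) computation.
ALPHA, BETA, G = 1 / 8, 1 / 4, 0.001   # smoothing gains and clock granularity (seconds)

def update_rto(srtt, rttvar, r):
    """Fold a new RTT sample r (seconds) into the smoothed estimates."""
    if srtt is None:                     # first measurement initializes the estimators
        srtt, rttvar = r, r / 2
    else:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
        srtt = (1 - ALPHA) * srtt + ALPHA * r
    rto = max(1.0, srtt + max(G, 4 * rttvar))   # lower bound of one second
    return srtt, rttvar, rto

srtt = rttvar = None
for sample in (0.100, 0.120, 0.080):     # hypothetical RTT samples
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
print(round(rto, 3))
```

On retransmission the computed RTO is doubled (exponential backoff) up to an implementation maximum; that step is omitted here for brevity.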
In the FIN_WAIT_2 state, where the local endpoint awaits the peer's FIN after sending its own, many implementations impose a timeout—typically on the order of minutes—to forcibly close lingering connections and free the TCB if no response arrives, preventing indefinite resource holds.[14] Orphan connections, arising when the owning process terminates without closing the socket, are managed by the kernel via accelerated keepalive probes or 2MSL (twice maximum segment lifetime) timers, ensuring buffers and TCBs are released after detecting inactivity, with limits on the number of such orphans to avoid system-wide exhaustion.[10]
Advanced Features
Segment Size Negotiation
The Transmission Control Protocol (TCP) employs the Maximum Segment Size (MSS) option during connection establishment to specify the largest amount of data, in octets, that a sender should transmit in a single segment, excluding TCP and IP headers.[48] This option is included in the SYN segment of the three-way handshake, where each endpoint announces its receive-side MSS independently, allowing the sender to limit segment sizes to the minimum of its own send MSS and the peer's announced receive MSS.[48] The MSS value is calculated as the interface's Maximum Transmission Unit (MTU) minus the fixed sizes of the IP and TCP headers—typically 20 octets each for IPv4, yielding an MSS of MTU - 40.[49] For example, on an Ethernet interface with a 1500-octet MTU, the MSS would be 1460 octets.[49] If no MSS option is received during connection setup, TCP implementations must assume a default MSS of 536 octets, corresponding to the minimum IPv4 MTU of 576 octets minus 40 octets for headers.[49] This conservative default ensures compatibility across diverse networks but may lead to fragmentation or inefficiency on paths with larger MTUs.[48] The MSS option format, as defined in the TCP header options field, consists of a 1-octet Kind (value 2), a 1-octet Length (4), and a 2-octet MSS value.[9] TCP integrates with Path MTU Discovery (PMTUD) to dynamically adjust the effective MSS based on the smallest MTU along the path, using ICMP "Datagram Too Big" messages (Type 3, Code 4 for IPv4) to signal reductions.[50] Upon receiving such feedback, the sender lowers its Path MTU estimate and recomputes the MSS accordingly, setting the Don't Fragment (DF) bit on outgoing IP datagrams to probe for the optimal size without fragmentation.[50] The minimum Path MTU is 68 octets for IPv4, below which MSS adjustments are not applied.[50] PMTUD failures, often termed "black holes," arise when ICMP feedback is blocked by firewalls or misconfigured routers, causing large segments to be silently 
dropped and connections to stall.[51] To mitigate this, TCP implementations incorporate black hole detection by monitoring for timeouts on probe packets and falling back to smaller segment sizes, such as the default 536-octet MSS, or disabling PMTUD temporarily.[51] MSS clamping is a common countermeasure, where intermediate devices or endpoints proactively adjust the MSS value in SYN segments to a safe limit based on known path constraints, preventing oversized packets from entering the network.[51] For IPv6, PMTUD relies on ICMPv6 "Packet Too Big" messages (Type 2, Code 0) and assumes a minimum link MTU of 1280 octets, leading to a default MSS of 1220 octets (1280 minus 40 for the IPv6 header and 20 for the TCP header).[52] Hosts must not reduce the Path MTU below this minimum, ensuring reliable transmission even on low-MTU links.[52]
Selective and Cumulative Acknowledgments
In TCP, cumulative acknowledgments form the foundational mechanism for confirming receipt of data, where the acknowledgment number specifies the next expected sequence number, thereby verifying all preceding octets as successfully received. This approach, defined in the original TCP specification, ensures reliable ordered delivery by allowing the receiver to send a single acknowledgment that covers contiguous data up to the highest in-sequence sequence number, without needing to individually acknowledge each segment.[9] The sender advances its unacknowledged sequence pointer (SND.UNA) upon receiving such an acknowledgment, removing confirmed data from its retransmission queue.[9] To address limitations of cumulative acknowledgments in scenarios with out-of-order or lost segments, TCP supports selective acknowledgments (SACK) as an optional extension. Negotiated during connection establishment via the SACK-permitted option in SYN segments, SACK enables the receiver to report multiple non-contiguous blocks of successfully received data beyond the cumulative acknowledgment point.[40] Each SACK option, identified by kind value 5, can include up to four blocks, with each block defined by a left edge (starting sequence number) and right edge (one beyond the last received octet), allowing the sender to identify specific holes in the data stream for targeted retransmissions.[40] This selective repeat policy reduces the recovery time from multiple losses within a single window by avoiding unnecessary retransmission of already-received data.[40] An extension to SACK, known as duplicate SACK (D-SACK), further refines loss detection by using the SACK mechanism to report receipt of duplicate segments. 
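The SACK option layout described above (kind 5, up to four blocks of 32-bit left and right edges) can be illustrated with a small parser; the option bytes below are hand-constructed for the example, not captured traffic:

```python
import struct

# Parse a TCP SACK option (Kind 5) from raw option bytes.
def parse_sack(option: bytes):
    kind, length = option[0], option[1]
    assert kind == 5, "not a SACK option"
    blocks = []
    for off in range(2, length, 8):
        # Each block: left edge, then right edge (one past the last octet).
        left, right = struct.unpack("!II", option[off:off + 8])
        blocks.append((left, right))
    return blocks

# One block reporting octets 1000..1499 as received out of order.
sack_opt = bytes([5, 10]) + struct.pack("!II", 1000, 1500)
print(parse_sack(sack_opt))   # [(1000, 1500)]
```

A sender receiving this option while SND.UNA is below 1000 can infer a hole at sequence numbers below the block's left edge and retransmit only that gap.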
In D-SACK, the first SACK block in an option specifies the range of the duplicate data, enabling the sender to distinguish between true losses and artifacts like packet reordering or premature retransmissions.[53] This helps mitigate false fast retransmits, where the sender might otherwise interpret delayed acknowledgments as losses, by confirming that the sender's scoreboard already marked the data as acknowledged.[53] TCP implementations supporting SACK maintain a scoreboard data structure at the sender to track the state of transmitted segments, including those cumulatively acknowledged, selectively acknowledged in SACK blocks, and outstanding holes indicating potential losses. This scoreboard, typically implemented as a list or gap-based representation, updates with each incoming SACK to precisely delineate received and missing data ranges, facilitating efficient gap-filling retransmissions without inflating the congestion window unnecessarily.[54] The use of SACK, including D-SACK, provides significant benefits in high-loss networks by minimizing spurious retransmissions and accelerating recovery, often improving throughput by up to 30-50% in environments with multiple segment drops per window compared to cumulative-only schemes.[55] For instance, in wireless links prone to non-congestion losses, SACK enables finer-grained recovery, reducing the time spent in slow-start after loss events and better utilizing available bandwidth.[55]
Window Scaling for High Bandwidth
The TCP window scaling option enables the protocol to support much larger receive windows than the 16-bit window field in the base TCP header would otherwise allow, addressing performance limitations in high-speed networks with significant latency. This extension is particularly vital for "long fat networks" (LFNs), characterized by high bandwidth-delay products (BDP), where the product of available bandwidth and round-trip time exceeds the unscaled maximum window size of 65,535 bytes. Without scaling, TCP connections in such environments would underutilize the link, as the sender could only transmit data up to the receiver's advertised window before pausing for acknowledgments. The option was originally specified in RFC 1323 and refined in RFC 7323 to provide clearer definitions and behaviors for modern implementations.[56] Negotiation of window scaling occurs exclusively during the initial connection establishment phase, with the scale factor exchanged in the SYN segments. Each endpoint includes a three-byte Window Scale option in its SYN, containing a shift count value between 0 and 14, indicating the number of bits to shift the window field value leftward (equivalent to multiplying the 16-bit field by 2^s, where s is the shift count). If both endpoints advertise the option, scaling is enabled for the connection; otherwise, it remains disabled, and the base 16-bit window applies. The scale factor is fixed once negotiated and cannot be altered during the connection's lifetime, ensuring consistent interpretation of window advertisements. This negotiation allows for asymmetric scaling, where the sender and receiver may use different shift counts tailored to their respective capabilities.[56] With a maximum scale factor of 14, the effective window size can reach approximately 1 GiB (65,535 × 2^14 bytes, just under 2^30), vastly expanding TCP's capacity to handle high-BDP paths without frequent acknowledgments.
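The scaling arithmetic is a simple left shift; the Python sketch below uses an assumed 10 Gb/s, 100 ms path to find the smallest shift count whose scaled window covers the bandwidth-delay product:

```python
# Effective receive window under RFC 7323 window scaling: the 16-bit
# header field is shifted left by the negotiated count (0-14).
def effective_window(field: int, shift: int) -> int:
    return field << shift

# Bandwidth-delay product of a 10 Gb/s link with 100 ms RTT.
bdp = int(10e9 / 8 * 0.100)          # 125,000,000 bytes
shift = next(s for s in range(15)
             if effective_window(0xFFFF, s) >= bdp)
print(shift)                          # 11
print(effective_window(0xFFFF, 14))   # 1073725440, just under 2**30
```

A shift count of 11 scales the maximum 16-bit field to about 134 MB, enough to keep such a path full between acknowledgments.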
For instance, on a 10 Gbps link with a 100 ms round-trip time, the BDP is approximately 125 MB, which scaling accommodates efficiently. RFC 7323 clarifies handling of window scaling during scenarios like window retraction or probing, emphasizing that scaled windows must be monotonically non-decreasing to avoid ambiguity in interpretation. This mechanism has become a standard feature in TCP implementations, enabling reliable high-throughput transfers over diverse network conditions.[56]
Timestamps for RTT Measurement
The TCP Timestamps option, defined as a 10-byte TCP header extension with Kind value 8 and Length 10, includes two 32-bit fields: TSval (Timestamp Value), which carries the sender's current timestamp clock value, and TSecr (Timestamp Echo Reply), which echoes the most recent TSval received from the peer.[57] This option enables precise round-trip time (RTT) estimation by allowing the sender to compute the elapsed time between sending a segment and receiving its acknowledgment, using the difference between the current TSval and the echoed TSecr value in the ACK.[58] Specifically, RTT measurements are filtered to update the smoothed RTT estimate only for acknowledged segments that advance the left edge of the send window (SND.UNA), ensuring accuracy by excluding ambiguous retransmissions.[59] A primary benefit of the Timestamps option is its role in the Protection Against Wrapped Sequences (PAWS) mechanism, which mitigates issues arising from 32-bit sequence number wraparound in high-bandwidth or long-lived connections.[60] PAWS uses the monotonically non-decreasing TSval to detect and discard outdated duplicate segments; upon receiving a segment, the receiver checks if its TSval is at least as recent as the previously recorded TS.Recent value, rejecting the segment if it appears stale.[61] This timestamp-based validation occurs before standard TCP sequence number checks, providing robust protection without relying solely on sequence numbers that may have wrapped multiple times.[59] The timestamp clock must be monotonically increasing and typically operates with a granularity of 1 millisecond, though it can range from 1 ms to 1 second per tick to balance precision and overhead.[62] To support high-speed networks, the clock should tick at least once every 2^31 bytes of data sent, ensuring sufficient resolution for PAWS over paths with large bandwidth-delay products.[63] While the Timestamps option is optional to minimize header overhead in low-latency environments, it 
is strongly recommended for high-performance scenarios, as it enhances RTT accuracy for congestion control and enables PAWS for reliable sequence validation.[64]
Out-of-Band and Urgent Data
The Transmission Control Protocol (TCP) provides a mechanism for signaling urgent data through the URG (Urgent) flag and the associated urgent pointer field in the TCP header. When the URG flag is set in a segment, it indicates that the segment contains urgent data, and the urgent pointer specifies the sequence number of the octet immediately following the last byte of urgent data, thereby defining the end of the urgent portion within the stream. This pointer is only interpreted when the URG flag is asserted, allowing the receiver to identify and prioritize the urgent bytes ahead of normal data in the receive buffer.[10] Out-of-band (OOB) data in TCP refers to the urgent data marked by the URG flag, which is intended to be processed separately from the regular byte stream to enable expedited handling by the application. However, TCP supports only up to one byte of true OOB data per urgent indication, as subsequent urgent data may overwrite it in some implementations, and the mechanism is designed to deliver this byte via a distinct path to the application layer.[65] Interpretations of urgent data delivery vary across implementations: some treat it as inline data within the normal stream (per the original specification), while others extract the final urgent byte for OOB delivery, leading to inconsistencies influenced by middlebox behaviors that may strip or alter the URG flag.[65] RFC 6093 clarifies these variations, recommending inline delivery of all urgent data to avoid compatibility issues and emphasizing that the urgent mechanism does not create a true separate channel but rather a priority signal within the stream.[65] A primary historical use case for urgent data is in the Telnet protocol, where it signals interrupts such as a break character to allow immediate user attention, such as aborting a lengthy command without waiting for the full stream buffer. 
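On platforms exposing the BSD socket API, urgent data is sent and received with the MSG_OOB flag; a minimal loopback sketch in Python (assuming a Unix-like host that supports out-of-band data on TCP sockets):

```python
import socket
import time

# Loopback sketch of TCP urgent data: only a single urgent byte is kept.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"norm")             # ordinary in-band data
cli.send(b"!", socket.MSG_OOB)   # sets URG; the urgent pointer marks this byte
time.sleep(0.2)                  # let both arrive before reading

inline = conn.recv(4)                   # normal reads stop at the urgent mark
urgent = conn.recv(1, socket.MSG_OOB)   # the single out-of-band byte
print(inline, urgent)
```

With the SO_OOBINLINE option set instead, the urgent byte would be delivered within the normal stream, matching the inline-delivery behavior RFC 6093 recommends.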
Despite this legacy role, the urgent mechanism is largely deprecated in modern applications due to its inconsistent implementation, limited utility beyond legacy protocols, and the availability of higher-level alternatives for priority signaling in protocols like SSH.[65] Upon receipt of urgent data, operating systems notify the application differently. In Unix-like systems, the kernel delivers a SIGURG signal to the process or process group owning the socket, enabling asynchronous handling of the urgent byte via mechanisms like signal handlers or socket options such as SO_OOBINLINE.[66] On Windows, urgent data is accessed through the Winsock API using recv() with the MSG_OOB flag, allowing the application to retrieve the single OOB byte separately from the inline stream, though support is limited to one byte and requires explicit polling or event-based notification.[65]
Security and Vulnerabilities
Denial-of-Service Attacks
The Transmission Control Protocol (TCP) is susceptible to denial-of-service (DoS) attacks that exploit its stateful nature and resource allocation during connection establishment and maintenance, leading to exhaustion of memory, processing capacity, or bandwidth on targeted hosts or intermediate devices. These vulnerabilities arise from TCP's reliance on sequence numbers, acknowledgments, and timeouts to ensure reliable delivery, allowing attackers to flood systems with malformed or spoofed packets without completing legitimate handshakes or data transfers. Such attacks disrupt availability by overwhelming the victim's backlog queues or forcing unnecessary computations, often using spoofed source addresses to amplify impact while remaining stealthy.[38] A prominent example is the SYN flood attack, which targets the TCP three-way handshake by sending a large volume of SYN segments with spoofed IP addresses to a listening server. The server responds with SYN-ACK segments and allocates resources for half-open connections in its backlog queue, typically limited to dozens of entries, each consuming 280–1,300 bytes for transmission control blocks (TCBs). Without the final ACK from the spoofed client, these entries persist until timeout, filling the queue and preventing legitimate connections; for instance, a barrage of SYNs can exhaust the backlog in seconds, rendering the server unresponsive. This method, well-documented since the 1990s, leverages TCP's state retention in the LISTEN mode and is mitigated in part by techniques like SYN cookies, which encode connection state into the SYN-ACK sequence number without allocating a TCB until validation.[38] ACK floods extend this disruption to post-handshake phases or stateful intermediaries by inundating the target with spoofed TCP ACK packets that lack corresponding connections.
In TCP, ACKs confirm data receipt and advance the acknowledgment number, but illegitimate ones force servers or firewalls to search session tables—often millions of entries—for matches, consuming CPU cycles and memory; a flood of such packets can saturate bandwidth or crash devices by processing overhead alone, with attack volumes reaching gigabits per second via botnets. Similarly, RST floods abuse TCP's reset mechanism, where spoofed RST packets with guessed sequence numbers terminate purported connections, prompting the victim to scan state tables and discard resources for non-existent sessions, leading to widespread disruption of active flows. These attacks are effective because TCP endpoints trust incoming control flags without robust validation, amplifying resource strain on high-traffic systems.[67][68][69] Resource exhaustion can also occur through low-rate DoS variants that mimic legitimate slow traffic, such as shrew attacks, which periodically burst packets at rates tuned to TCP's minimum retransmission timeout (typically 1 second) to trigger repeated backoffs and reduce throughput to near zero without exceeding detection thresholds. By exploiting TCP's additive increase-multiplicative decrease congestion control, these attacks throttle flows intermittently—sending at line rates for milliseconds followed by silence—forcing the victim to retransmit and probe, thereby tying up buffers and CPU over extended periods; experimental evaluations show throughput drops of over 90% for TCP sessions under shrew bursts of just 100–500 ms. 
Analogous slow-rate tactics abuse small advertised receive windows or delayed ACKs to prolong connection states: an attacker opens multiple connections, advertises minimal windows (e.g., 1 byte), and dribbles data slowly, compelling the server to send tiny segments or zero-window probes while holding TCBs open, exhausting port pools or memory akin to application-layer slowloris attacks but at the transport level.[70][71] Amplification attacks leverage ICMP messages in TCP's Path MTU Discovery (PMTUD) process, where forged "Packet Too Big" ICMP errors desynchronize endpoints by falsely lowering the perceived path MTU, causing excessive fragmentation and retransmissions. Off-path attackers spoof these ICMP messages to redirect TCP traffic into black holes or induce repeated PMTUD probes, amplifying DoS impact; internet-wide scans have found over 43,000 websites vulnerable to such attacks. This exploits the cross-layer trust between IP and TCP, where unverified ICMP alters TCP behavior without direct packet injection.[72] As of 2025, persistent gaps between RFC specifications and implementations exacerbate these DoS risks, with analyses identifying inconsistencies in 15 areas across major operating systems like Linux and BSD variants. For instance, incomplete adherence to RFC 5961 omits challenge ACKs for invalid RST or SYN segments in older kernels (e.g., Linux 2.6.39), enabling spoofed floods to inject resets and cause blind in-window disruptions; similarly, lapses in RFC 6528's secret key rotation for initial sequence numbers facilitate prediction-based floods, allowing low-effort DoS via targeted packet injection. These discrepancies, detected via LLM-assisted differential testing,[73] highlight ongoing vulnerabilities in flood handling despite RFC updates, affecting real-world deployments and underscoring the need for rigorous compliance verification.[74][75]
Connection Hijacking and Spoofing
Connection hijacking and spoofing in TCP involve unauthorized interference with established sessions by injecting forged packets, compromising the integrity of data exchange. These attacks exploit the protocol's reliance on sequence numbers to ensure ordered delivery and prevent duplication, allowing attackers to impersonate legitimate endpoints or disrupt communications. TCP sequence numbers, which are 32-bit values incremented per byte transmitted, must be predicted or observed to succeed in such exploits.[76] TCP sequence prediction attacks, first described by Robert T. Morris in 1985, target predictable initial sequence numbers (ISNs) generated by early implementations like Berkeley's 4.2BSD and 4.3BSD TCP/IP stacks. In these systems, ISNs were incremented by a fixed amount—such as 128 or 125,000 per second—making them guessable based on timing or observed patterns. An attacker could spoof a trusted host's IP address, predict the ISN, and send a forged SYN segment to initiate a connection, followed by an ACK and injected data segments with anticipated sequence numbers. This enabled the execution of malicious commands, such as via rsh, without receiving server responses, as the spoofed packets appeared valid to the victim. The vulnerability was detailed in Steve Bellovin's 1989 analysis, highlighting how it allowed off-path attackers to hijack trust relationships in UNIX networks.[76] A prominent example of the broader impact of such vulnerabilities occurred with the Morris worm in 1988, which, while primarily exploiting buffer overflows in fingerd and sendmail, underscored the dangers of weak TCP security in the early Internet. Although the worm did not directly employ sequence prediction, Morris's prior discovery amplified awareness of spoofing risks, infecting thousands of machines and disrupting about 10% of the Internet for several days.
This event catalyzed improvements in network security practices and protocol robustness.[77] Blind spoofing represents an off-path variant where an attacker, without direct network access, forges packets by guessing sequence numbers within the receiver's window. In early TCP, large receive windows (up to 65,535 bytes) increased the probability of successful guesses, allowing injection of RST segments to terminate sessions or data to hijack them. For instance, at gigabit speeds with 100 ms latency, an attacker could probe the sequence space with 10-100 packets. Man-in-the-middle variants, such as those enabled by ARP cache poisoning, position the attacker on-path by sending spoofed ARP replies that falsify IP-to-MAC mappings, redirecting traffic through the attacker. Once interposed, the attacker can observe sequence numbers and inject forged segments to hijack sessions, such as resetting connections or altering data flows. This technique, common in local networks, can compromise even encrypted sessions by exploiting timing and size patterns in traffic.[78][79] To counter these threats, modern TCP implementations employ ISN randomization as specified in RFC 6528, generating ISNs using a formula that incorporates a monotonic timer and a pseudorandom function: ISN = M + F(localip, localport, remoteip, remoteport, secretkey), where M is a 4-microsecond-resolution timer and F is a hash like MD5 with a secret key refreshed periodically. This assigns unique, unpredictable sequence spaces per connection quadruple, rendering blind prediction computationally infeasible for off-path attackers.
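The RFC 6528 formula can be sketched in a few lines of Python; the key handling and encoding of the four-tuple below are illustrative simplifications, not a production generator:

```python
import hashlib
import os
import time

# Sketch of RFC 6528 ISN generation: ISN = (M + F(4-tuple, key)) mod 2^32,
# with M a 4-microsecond-resolution timer and F a keyed hash (MD5 per the RFC).
SECRET_KEY = os.urandom(16)   # real stacks refresh this key periodically

def initial_sequence_number(laddr, lport, raddr, rport):
    m = int(time.monotonic() / 4e-6) & 0xFFFFFFFF     # 4 microsecond ticks
    tup = f"{laddr},{lport},{raddr},{rport}".encode()
    f = int.from_bytes(hashlib.md5(tup + SECRET_KEY).digest()[:4], "big")
    return (m + f) & 0xFFFFFFFF

isn = initial_sequence_number("10.0.0.1", 40000, "192.0.2.7", 443)
print(isn)   # unpredictable without the secret key
```

Because F depends on the secret key, an off-path observer cannot compute the offset for a given four-tuple, while the timer term preserves the monotonic spacing the specification requires.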
Additional mitigations include cryptographic protections, such as the deprecated TCP MD5 Authentication (using pre-shared keys to sign segments)[80] or the recommended TCP Authentication Option (TCP-AO), which provides stronger cryptographic protection with key rotation and support for multiple algorithms, or IPsec for network-layer integrity, which validate packet authenticity and prevent injection even in man-in-the-middle scenarios. These measures, rooted in responses to early spoofing exploits, have significantly reduced the prevalence of TCP hijacking in contemporary networks.[36][78][80]
Mitigation Strategies and Best Practices
To mitigate SYN flooding attacks, where an attacker exhausts server resources by sending numerous SYN packets without completing the handshake, TCP implementations can employ SYN cookies. This technique encodes connection state information into the initial sequence number of the SYN-ACK response using a cryptographic hash, allowing the server to avoid allocating resources for half-open connections until a valid ACK is received.[38] SYN cookies are particularly effective in hash-based state avoidance, as they prevent backlog queue overflow without requiring changes to the core TCP protocol.[38] TCP stack tweaks further enhance resilience by adjusting parameters such as reducing the maximum backlog queue size, enabling SYN cookies via configuration (e.g., sysctl net.ipv4.tcp_syncookies=1 in Linux), and limiting the rate of incoming SYN packets per source IP. These adjustments minimize resource consumption during floods while maintaining legitimate connection acceptance, as recommended in standard mitigation guidelines.[38] For protecting against connection hijacking and spoofing, which rely on forged IP source addresses, network operators should implement BCP 38 ingress filtering. This practice involves routers discarding packets at network edges if the source IP does not match the expected prefix of the originating interface, effectively blocking spoofed traffic before it reaches TCP endpoints.[81] Widespread adoption of BCP 38 significantly reduces the feasibility of IP spoofing-based attacks across the Internet.[81] To ensure confidentiality and authentication in TCP communications, encapsulating TCP within IPsec or TLS is a standard best practice. IPsec provides end-to-end encryption and integrity protection at the network layer via protocols like Encapsulating Security Payload (ESP), preventing eavesdropping and tampering during transmission. 
Similarly, TLS operates at the transport layer over TCP, offering mutual authentication and secure key exchange to thwart man-in-the-middle attacks on TCP sessions.[82] These encapsulations address vulnerabilities in plain TCP by adding cryptographic safeguards without altering the underlying protocol.[82] Recent guidance emphasizes enabling limits on Selective Acknowledgment (SACK) recovery to counter denial-of-service exploits that manipulate SACK options to induce excessive retransmissions or resource exhaustion. The conservative SACK-based loss recovery algorithm bounds the sender's estimate of data outstanding in the network (the pipe) during recovery, preventing attackers from inflating perceived lost segments beyond verifiable data. Implementing such limits, as updated in modern TCP stacks after the 2019 SACK vulnerabilities, reduces the attack surface from SACK-related panics. Ongoing monitoring for anomalies, such as unusually high rates of RST or FIN segments, enables early detection of disruptive attacks like reset floods. Tools applying packet header anomaly detection can profile normal TCP flag usage and alert on deviations, allowing proactive filtering or rate limiting. Best practices include integrating such detection into network intrusion systems to maintain TCP reliability under adversarial conditions.[83]
Implementations and Deployment
Software Stacks and Variations
The Transmission Control Protocol (TCP) is implemented in various software stacks across operating systems and libraries, each tailored to platform-specific requirements while aiming for RFC compliance. These implementations differ in default congestion control algorithms, tunable parameters, and optimizations for loss recovery, reflecting trade-offs in performance, resource usage, and network conditions. In the Linux kernel, TCP is integrated into the networking subsystem with TCP Cubic as the default congestion control algorithm since kernel version 2.6.19, optimizing for high-bandwidth-delay product networks through a cubic congestion window growth function. Administrators can tune congestion window (cwnd) behavior via sysctl parameters, such as net.ipv4.tcp_congestion_control to switch algorithms (e.g., to BBR or Reno) and net.ipv4.tcp_slow_start_after_idle to control cwnd reset after idle periods, enabling fine-grained adjustments for throughput and latency in diverse environments.[84][85]
Microsoft's Next Generation TCP/IP stack, introduced in Windows Vista and used in subsequent versions, incorporates Compound TCP (CTCP) as an optional but prominent feature for compound scaling, combining loss-based and delay-based congestion control to achieve higher throughput on high-speed, long-distance links without compromising fairness to standard TCP. CTCP dynamically adjusts the cwnd based on both packet loss and round-trip time variations, making it suitable for broadband scenarios, and can be enabled with the netsh utility (netsh interface tcp set global congestionprovider=ctcp).[86]
BSD variants, such as FreeBSD, traditionally default to NewReno for congestion control, which enhances Reno by allowing multiple segments to be recovered per window during fast recovery, but FreeBSD 14 and later shifted to Cubic as the default for better scalability. FreeBSD supports advanced loss recovery through the RACK (Recent ACKnowledgment) mechanism with Tail Loss Probe (TLP), implemented in the tcp_rack module, which uses time-based detection to initiate fast recovery more promptly than duplicate acknowledgment thresholds, reducing retransmission delays in modern networks.[87][88][89]
Cross-platform libraries like lwIP provide lightweight TCP implementations for embedded systems, emphasizing minimal RAM and ROM usage (tens of kilobytes) while supporting core RFC features such as connection management and retransmission. lwIP's timer granularity varies by host system but defaults to a coarse-grained interval of 250 ms via TCP_TMR_INTERVAL, with one-shot timers at least 200 ms resolution for tasks like delayed acknowledgments and retransmission timeouts, allowing adaptations for resource-constrained microcontrollers where finer granularity might increase overhead.[90]
TCP implementations are tested for compliance against RFC standards, such as RFC 9293 for core protocol behavior and RFC 1122 for optional features, using tools that simulate edge cases to verify sequence number handling, window scaling, and error recovery. Common divergences include variations in keepalive mechanisms, where RFC 1122 recommends a minimum 2-hour idle interval before probes but implementations differ: Linux uses 7200 seconds idle with 75-second probe intervals and up to 9 probes, while some older stacks employ shorter timeouts (e.g., 5 seconds), risking premature connection drops in congested networks as documented in known implementation surveys.[10][91]
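The keepalive defaults above can also be overridden per socket rather than system-wide. A hedged sketch for Linux follows (TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are Linux-specific socket options; the chosen values are arbitrary examples):

```python
import socket

# Illustrative override of the kernel defaults (Linux: 7200 s idle,
# 75 s probe interval, 9 probes) on a single socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)   # idle before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # unanswered probes before drop
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))  # 600 on Linux
s.close()
```

Per-socket tuning like this is how long-lived connections (databases, message queues) avoid the premature drops that overly short stack-wide timeouts can cause.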
Hardware and Offload Implementations
TCP offload engines (TOEs) are specialized hardware components integrated into network interface cards (NICs) that handle the processing of the TCP/IP protocol stack, including tasks such as segmentation, reassembly, and checksum computation, thereby relieving the host CPU from these operations.[92] This offloading is particularly valuable in high-speed networks, where traditional software-based TCP processing can become a bottleneck due to the increasing disparity between network throughput and CPU processing speeds.[92] TOEs can be implemented as full offload solutions, which manage the entire TCP/IP stack in hardware, or partial offload mechanisms that target specific functions. Full offload TOEs process connection management, acknowledgments, and data transfer entirely on the NIC, enabling sustained gigabit and 10-gigabit Ethernet performance with minimal host intervention.[92] In contrast, partial offloads, such as TCP Segmentation Offload (TSO) and Large Send Offload (LSO), allow the host to send large data buffers to the NIC, which then performs the segmentation into compliant packet sizes and adds necessary headers, as seen in modern NICs from NVIDIA (formerly Mellanox).[93] These partial approaches are more widely adopted due to their simplicity and compatibility with existing software stacks, though they do not eliminate all CPU involvement in TCP handling.[92] Field-programmable gate arrays (FPGAs) enable custom TCP offload implementations tailored for datacenter environments, where hyperscalers deploy them to accelerate network processing in cloud infrastructures. 
For instance, FPGA-based TOEs can achieve full 10 Gbps throughput with low latency by hardware-accelerating the TCP stack, as demonstrated in deployments by companies like Tencent for high-performance heterogeneous computing.[94] These programmable devices offer flexibility for specialized workloads, such as integrating TCP with storage protocols, but require careful design to balance resource usage and performance.[95] The primary benefit of hardware and offload implementations is significant CPU relief in high-throughput scenarios, allowing processors to focus on application logic rather than protocol overhead, which can improve overall system efficiency by up to several times in bandwidth-intensive applications.[92] However, drawbacks include reduced flexibility, as hardware-fixed behaviors may limit adaptability to evolving TCP extensions or custom configurations, along with higher initial costs for specialized silicon or FPGAs.[92] Standards like iWARP (Internet Wide Area RDMA Protocol) facilitate convergence by enabling remote direct memory access (RDMA) over TCP, offloading data transfers to network hardware while maintaining compatibility with standard IP networks. Defined in RFC 5040, iWARP uses mappings such as Marker PDU Aligned Framing (MPA) over TCP to ensure reliable, low-latency operations suitable for storage and clustering applications.[96]
Debugging and Analysis Tools
Debugging and analyzing Transmission Control Protocol (TCP) issues requires specialized tools to inspect packet flows, monitor connection states, measure performance metrics, and identify bottlenecks in network stacks. These tools enable network engineers to diagnose problems such as packet loss, congestion, or misconfigurations without disrupting live traffic. By capturing and examining TCP headers, states, and behaviors, administrators can pinpoint root causes like retransmissions or stalled windows, ensuring reliable data transmission over IP networks. Packet capture tools are foundational for TCP analysis, allowing detailed inspection of headers and payloads. Wireshark, a widely used open-source network protocol analyzer, dissects TCP packets and provides expert analysis features, including detection of anomalies like retransmissions or zero-window probes through display filters such as "tcp.analysis.retransmission" or "tcp.analysis.zero_window". Similarly, tcpdump, a command-line packet analyzer, captures TCP traffic in real-time or from files, supporting filters like "tcp" to isolate protocol-specific data and options such as "-i any" to monitor all interfaces for comprehensive traces. These tools facilitate header examination for fields like sequence numbers, acknowledgments, and flags, aiding in the identification of protocol violations or errors. Kernel-level utilities offer insights into active TCP connections and socket states without requiring packet captures. The ss command in Linux displays detailed socket statistics, including TCP states (e.g., ESTABLISHED, TIME_WAIT) and associated processes, using options like "ss -tuln" to list listening TCP/UDP sockets or "ss -tan state established" to filter active connections. 
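Alongside ss, the kernel also exposes aggregate TCP counters in /proc/net/snmp, from which a retransmission rate can be derived without any packet capture. A small sketch (the function name is ours, and the sample text is abbreviated to the relevant fields; the real file carries many more counters on its Tcp: lines):

```python
def tcp_retransmit_rate(snmp_text: str) -> float:
    """Compute RetransSegs/OutSegs from /proc/net/snmp-formatted text."""
    tcp_lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    counters = dict(zip(header[1:], (int(v) for v in values[1:])))
    return counters["RetransSegs"] / counters["OutSegs"]

# Abbreviated sample in the /proc/net/snmp layout: one header line
# and one value line, both prefixed "Tcp:".
sample = (
    "Tcp: RtoAlgorithm OutSegs RetransSegs\n"
    "Tcp: 1 100000 1500\n"
)
print(f"{tcp_retransmit_rate(sample):.2%}")   # 1.50%
```

On a live Linux host the same function can be fed Path("/proc/net/snmp").read_text(); sampling it twice and differencing the counters gives the rate over an interval rather than since boot.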
Netstat, though increasingly superseded by ss, provides similar functionality for viewing TCP socket information, such as active connections and port usage via "netstat -an | grep TCP", helping diagnose state mismatches or resource exhaustion. For performance evaluation, tools like iperf measure TCP throughput by simulating bidirectional traffic between endpoints, reporting bandwidth, jitter, and packet loss rates to assess link capacity under load. Tcptrace processes tcpdump captures to generate summaries and plots of TCP metrics, including round-trip time (RTT) variations and congestion window evolution, enabling visualization of throughput trends and loss events through graphical outputs like RTT histograms. Advanced diagnostics target deeper stack-level issues. Flame graphs, developed by Brendan Gregg, visualize profiled stack traces to highlight CPU bottlenecks in the TCP implementation, such as excessive time in kernel functions handling congestion control, by stacking sampled call paths proportionally to resource usage. On Windows, Event Tracing for Windows (ETW) captures kernel and user-mode events related to TCP performance, allowing analysis of latency in socket operations or driver interactions via tools like Windows Performance Analyzer. Common TCP diagnostics often focus on indicators like retransmit rates and window stalls, which signal underlying problems. High retransmit rates, observable in Wireshark via "tcp.analysis.retransmission" filters or tcptrace summaries, typically arise from packet loss due to congestion or errors, with rates exceeding 1-2% warranting investigation into network paths. Window stalls, where the receive window shrinks to zero causing sender pauses, can be detected in packet traces showing prolonged zero-window probes and are often linked to receiver-side buffer limitations, as analyzed in studies of TCP performance degradation.
Performance and Optimization
Key Metrics and Bottlenecks
The throughput of TCP connections is fundamentally limited by the bandwidth-delay product (BDP), defined as the product of the available bandwidth and the round-trip time (RTT): BDP = bandwidth × RTT. This metric represents the amount of data that can be in flight on the network path without acknowledgment, and TCP's congestion window must scale to at least the BDP to achieve maximum throughput on high-bandwidth or high-latency paths. For instance, on a 10 Gbps link with a 100 ms RTT, the BDP exceeds 100 MB, necessitating window scaling extensions as specified in earlier TCP standards to avoid underutilization.[97] Latency in TCP encompasses both connection establishment and data transfer phases. The three-way handshake requires at least 1.5 RTTs, accounting for the partial round trip in the initial SYN exchange followed by the full SYN-ACK and ACK. Data transfer latency accumulates as the number of segments multiplied by RTT, divided by the degree of parallelism enabled by the congestion window, highlighting how larger windows reduce effective delay through pipelining. In practice, this means short transfers (e.g., a few kilobytes) incur significant relative latency from the handshake alone, while bulk transfers benefit from sustained window growth.[98][14] Goodput, the effective rate of useful data delivery, differs from raw throughput by excluding protocol overheads such as TCP and IP headers. Header overhead depends on the maximum segment size (MSS): for an MSS of 1460 bytes on a 1500-byte MTU it is approximately 2.7% (40 bytes of headers), but it rises sharply for smaller MSS values common in fragmented or tunneled traffic. This distinction is critical in bandwidth-constrained environments, where overhead can reduce effective utilization by up to a factor of five for very small payloads.[49][99] Common bottlenecks in TCP performance include high packet loss rates and bufferbloat.
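The BDP and header-overhead figures above can be checked with a quick back-of-envelope sketch (the function names are ours; 40 bytes assumes option-free TCP and IPv4 headers):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8          # bits in flight, converted to bytes

def header_overhead(mss: int, headers: int = 40) -> float:
    """Fraction of each packet consumed by TCP+IP headers (no options)."""
    return headers / (mss + headers)

print(bdp_bytes(10e9, 0.100) / 1e6)       # 125.0 MB for a 10 Gbps, 100 ms path
print(f"{header_overhead(1460):.1%}")     # 2.7% on a 1500-byte MTU
print(f"{header_overhead(100):.1%}")      # overhead balloons for tiny segments
```

The 125 MB figure is why the classic 64 KB receive window is hopeless on long fat networks and window scaling is mandatory there.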
Loss rates exceeding 1% often trigger congestion control mechanisms, as TCP interprets such losses as network overload rather than rare bit errors, leading to multiplicative window reductions and throughput collapse. Bufferbloat exacerbates this by causing excessive queuing delays in oversized router buffers, inflating RTT and inducing further losses without providing proportional bandwidth gains. These factors can degrade performance in asymmetric or wireless links, where loss rates may naturally hover near or above the threshold.[42][100] TCP performance metrics are measured using active or passive techniques. Active methods, such as Pathload, inject probe traffic to estimate available bandwidth and detect bottlenecks by analyzing dispersion patterns in packet trains. Passive approaches, exemplified by tcptrace, analyze existing TCP traces to compute RTT, loss, and throughput without additional load, making them suitable for production monitoring. Active measurements provide direct path characterization but risk perturbing the network, while passive ones offer non-intrusive insights limited to observed flows.
Acceleration Techniques
Several techniques have been developed to accelerate TCP performance by reducing overheads, optimizing resource usage, and leveraging additional network capabilities without altering the core protocol. These methods address limitations in latency, throughput, and congestion handling, enabling TCP to operate more efficiently in diverse environments such as data centers and wide-area networks. Recent IETF efforts include RFC 9743 (March 2025), which provides a framework for specifying and evaluating new congestion control algorithms to enhance performance without harming the Internet ecosystem.[101] TCP splicing and proxy mechanisms enable kernel bypass for user-space acceleration, allowing intermediate systems to forward traffic without unnecessary data copying between kernel and user spaces. In traditional proxies, data received in the kernel must be copied to user space for processing and then back for transmission, introducing latency and CPU overhead. Splicing merges the incoming and outgoing TCP connections at the proxy, effectively creating a direct pipe that avoids these copies while maintaining connection state. This approach, originally proposed for URL-aware redirection in web proxies, can improve throughput by up to 30% for large transfers by minimizing context switches and buffer management. Modern implementations, such as those in high-performance network function virtualization (NFV), use splicing in user-space stacks to achieve line-rate forwarding, reducing per-packet processing time from microseconds to nanoseconds in kernel-bypassed environments.[102] Multipath TCP (MPTCP) provides acceleration through link aggregation, allowing a single TCP connection to utilize multiple network paths simultaneously for increased bandwidth and resilience. Defined in RFC 8684, MPTCP extends standard TCP by adding subflow management, where multiple TCP subflows operate in parallel over different interfaces or routes, with the scheduler aggregating their capacities. 
This bonding can double or triple effective throughput in scenarios like cellular-Wi-Fi handover or data center multipathing, as demonstrated in field trials where MPTCP achieved up to 1.5x higher goodput than single-path TCP under varying link conditions. As an IETF-standardized extension, MPTCP maintains compatibility with legacy TCP while enabling proactive path selection to minimize latency spikes.[103] Explicit Congestion Notification (ECN), specified in RFC 3168, accelerates TCP by enabling proactive congestion avoidance through marking rather than packet drops. ECN uses two bits in the IP header to signal incipient congestion from routers to endpoints, allowing TCP senders to reduce rates early without losing packets. This leads to higher throughput and lower latency, particularly for short-lived flows, with studies showing up to 20% improvement in web page load times over drop-based mechanisms like RED. RFC 8087 further outlines benefits including reduced retransmissions and better fairness in mixed-traffic networks, making ECN a widely adopted feature in modern TCP stacks for avoiding the "bufferbloat" penalty of excessive queuing delays.[104][105] Buffer tuning techniques, such as autotuning in TCP stacks, enhance performance by dynamically adjusting receive and send buffers to match network conditions while mitigating bufferbloat. In Linux, for instance, the TCP autotuner increases the receive window (rmem) up to a maximum based on bandwidth-delay product estimates, preventing underutilization on high-speed links without fixed large buffers that cause queuing delays. This adaptive sizing avoids the latency inflation from overbuffering, where excessive queues amplify round-trip times; evaluations show autotuning reduces average latency by 10-50% in long-fat networks compared to static configurations. 
By coupling with congestion control algorithms, autotuning ensures efficient resource use without introducing bloat, as buffers scale proportionally to observed throughput.[106] Hybrid approaches combining TCP with Remote Direct Memory Access (RDMA), such as iWARP, deliver low-latency acceleration by offloading data transfer directly between application memories over TCP/IP networks. iWARP, defined in RFC 5040 and RFC 5041, encapsulates RDMA operations within TCP for reliable, lossless delivery on standard Ethernet, bypassing kernel involvement for zero-copy transfers. This reduces latency to sub-microsecond levels for small messages (up to 50% lower than pure TCP in storage applications) while maintaining TCP's congestion control for wide-area compatibility. Deployed in converged data centers, iWARP hybrids achieve throughputs exceeding 100 Gbps with minimal CPU overhead, making them suitable for latency-sensitive workloads like distributed databases.
Comparative Analysis with Alternatives
The Transmission Control Protocol (TCP) provides reliable, ordered, and error-checked delivery of data streams between applications, making it suitable for bulk data transfer where completeness is essential.[14] In contrast, the User Datagram Protocol (UDP) offers a simple, connectionless datagram service without guarantees of delivery, ordering, or error correction, prioritizing low overhead and minimal latency.[25] This trade-off positions UDP as ideal for real-time applications like the Real-time Transport Protocol (RTP), which transmits audio and video streams tolerant of minor packet loss to avoid delays.[107] TCP, however, excels in scenarios requiring full reliability, such as file transfers or web content loading, where retransmissions ensure data integrity despite added complexity from acknowledgments and congestion control.[14] Compared to the Stream Control Transmission Protocol (SCTP), TCP operates as a single-stream protocol, delivering data as a continuous byte stream without inherent support for message boundaries or multi-streaming.[14] SCTP, designed for reliable transport over connectionless networks, supports multiple independent streams within a single association, preserving message boundaries and enabling partial reliability options, which reduces head-of-line blocking in multi-stream scenarios.[108] These features make SCTP particularly advantageous for telephony signaling, where it transports Public Switched Telephone Network (PSTN) messages over IP, offering multi-homing for failover and congestion avoidance tailored to signaling traffic.[108] TCP remains preferable for legacy applications lacking SCTP support, but SCTP's multi-streaming provides better efficiency for applications like voice over IP gateways handling concurrent signaling channels.
QUIC, standardized in 2021, builds on UDP to deliver TCP-like reliability with integrated security, multiplexing, and faster connection establishment, addressing TCP's limitations in modern networks.[109] Unlike TCP, which requires separate handshakes for connection setup and encryption (often via TLS), QUIC embeds TLS 1.3 cryptography within its protocol, enabling 0-RTT resumption for resuming sessions without full negotiation, reducing latency in mobile and web scenarios.[109] QUIC's stream multiplexing avoids TCP's head-of-line blocking by allowing independent stream delivery, even if one is delayed, and it supports seamless connection migration across network paths.[109] This design mitigates TCP's vulnerability to wire ossification, where middleboxes like firewalls inspect and modify TCP headers, hindering protocol evolution; QUIC's encapsulation in UDP evades such interference, facilitating deployment. Selection of TCP versus alternatives depends on application needs and network constraints: TCP ensures broad compatibility with existing infrastructure for reliable bulk transfers, while UDP suits low-latency, loss-tolerant real-time flows; SCTP fits multi-stream telephony or failover-critical uses; and QUIC is optimal for latency-sensitive web and mobile applications requiring built-in security and migration.[14][25][108][109] In environments with middlebox restrictions or evolving requirements like low-latency streaming, alternatives like QUIC offer superior performance without TCP's ossification challenges.
Error Handling and Checksum
Checksum Computation Process
The TCP checksum is a 16-bit error-detection field included in the TCP header to verify the integrity of the transmitted segment, covering the header, payload data, and a conceptual pseudo-header derived from the IP layer.[110] This mechanism detects corruption caused by transmission errors but does not guarantee delivery or ordering. The sender computes the checksum before transmission, and the receiver recomputes it upon receipt; a mismatch indicates an error, prompting discard of the segment.[110] The checksum computation encompasses all 16-bit words in the TCP header (with the checksum field temporarily set to zero), the TCP payload (padded with a trailing zero octet if its length is odd, though this pad is not transmitted), and the pseudo-header. The pseudo-header, not transmitted but constructed at both sender and receiver, includes the source and destination IP addresses, a reserved zero field, the IP protocol number (6 for TCP), and the total length of the TCP segment (header plus data). For IPv4, the pseudo-header is 12 octets long; for IPv6, it follows a different structure as defined in RFC 8200. This inclusion protects against misdelivery to the wrong host or protocol.[110] The core algorithm uses one's complement arithmetic to compute a 16-bit checksum, as specified for Internet protocols. The process begins by concatenating the pseudo-header, TCP header (checksum field zeroed), and padded data into a sequence of 16-bit words. These words are then summed using 16-bit arithmetic, folding back any carry bits from the most significant bit into the least significant bit (end-around carry addition) to handle overflow. If the data length results in an odd number of octets, the final 16-bit word is formed by appending a zero to the last octet. The final checksum value is the one's complement (bitwise inversion) of this sum, inserted into the TCP header. 
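The one's-complement procedure just described can be sketched in Python. This is an illustrative implementation over an IPv4 pseudo-header, not a reference one: the function names, addresses, and header field values are made up for the example.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry (fold carries back in)."""
    if len(data) % 2:
        data += b"\x00"                          # conceptual pad for odd lengths
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total

def tcp_checksum(src: bytes, dst: bytes, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header (src, dst, zero, proto 6, length) + segment."""
    pseudo = src + dst + struct.pack("!BBH", 0, 6, len(segment))
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF

# A 20-byte TCP header (SYN, made-up ports, checksum field at offset 16 zeroed).
src, dst = bytes([192, 0, 2, 1]), bytes([198, 51, 100, 2])
header = struct.pack("!HHIIBBHHH", 12345, 80, 0, 0, 5 << 4, 0x02, 65535, 0, 0)
payload = b"hi"
csum = tcp_checksum(src, dst, header + payload)

# Receiver check: summing the segment with the checksum in place yields 0xFFFF
# when the data arrived intact.
filled = header[:16] + struct.pack("!H", csum) + header[18:]
pseudo = src + dst + struct.pack("!BBH", 0, 6, len(filled + payload))
assert ones_complement_sum(pseudo + filled + payload) == 0xFFFF
```

Because one's-complement addition is commutative and associative, the receiver's all-ones test works regardless of where the checksum field sits in the byte stream, which is also what makes incremental checksum updates possible.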
At the receiver, the same sum is computed including the received checksum; a correct transmission yields all ones (0xFFFF) in one's complement representation.[111] To illustrate, consider a simplified example with a short TCP segment. Suppose the concatenated 16-bit words (after zeroing the checksum field) are w1, w2, ..., wn. The intermediate sum is computed as S = w1 + w2 + ... + wn using end-around carry: if the sum exceeds 16 bits, the carry is added back into the low-order 16 bits, repeating until no carry remains. The checksum is then C = ~S, where ~ denotes bitwise complement. This method also permits efficient incremental updates for protocols like TCP during retransmissions or option changes.[111] Verification at the receiver follows identical steps, confirming a result of 0xFFFF in one's complement. Implementations must adhere strictly to this process to avoid interoperability issues, such as those arising from ambiguous zero representations in older RFC clarifications.[110]
IPv4 and IPv6 Specifics
The TCP checksum computation incorporates a pseudo-header derived from the underlying IP layer to provide protection against misdelivery and certain types of errors, with distinct formats for IPv4 and IPv6. In IPv4 environments, the pseudo-header consists of the 32-bit source address, the 32-bit destination address, an 8-bit zero field, an 8-bit protocol field set to 6 (indicating TCP), and a 16-bit TCP length field representing the length of the TCP header plus data in octets. This structure, totaling 12 octets, ensures that the checksum verifies the segment's association with the correct IPv4 endpoints and payload size. For IPv6, the pseudo-header is expanded to accommodate larger addresses and includes the 128-bit source address, the 128-bit destination address, a 32-bit upper-layer packet length (the size of the TCP segment, excluding the IPv6 header and any preceding extension headers), three zero octets for padding, and an 8-bit next header field set to 6 (for TCP). This 40-octet pseudo-header maintains compatibility while leveraging IPv6's addressing scheme. When IPv6 extension headers precede the TCP header, the TCP checksum calculation uses the pseudo-header's upper-layer packet length to encompass the full TCP segment (header and data), ensuring integrity over the transport-layer payload irrespective of preceding extensions. In transition mechanisms such as 6to4 tunnels, where IPv6 packets are encapsulated within IPv4, the inner TCP checksum employs the IPv6-formatted pseudo-header based on the constructed IPv6 addresses derived from the IPv4 tunnel endpoints, without altering the core computation process. The use of 128-bit addresses in IPv6 results in a larger pseudo-header compared to IPv4's 32-bit addresses, introducing a slight increase in header overhead and checksum computation cost, though this is mitigated by the overall protocol efficiencies.
Offload and Hardware Support
Checksum offload refers to the delegation of TCP checksum computation to the network interface card (NIC) hardware, which verifies incoming packets and computes checksums for outgoing packets, thereby reducing host CPU involvement during transmission (TX) and reception (RX). This feature is supported for both IPv4/TCP and IPv6/TCP traffic, with support typically indicated through driver configurations rather than explicit flags in the Ethernet frame header.[112] Offload types include partial and full checksum computation: partial offload handles only the TCP header checksum, leaving the pseudo-header and payload sums to software, while full offload enables the NIC to compute the entire checksum, including payload, for greater efficiency. In IPv6/TCP scenarios, the pseudo-header incorporates IPv6-specific fields such as source and destination addresses.[113][114] Early standards like TCP Offload Engines (TOEs), which aimed to offload the full TCP/IP stack to hardware, have largely been supplanted by stateless offloads such as TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO), focusing on targeted accelerations like checksums without maintaining full connection state.[115][116] The primary benefit is a reduction in CPU cycles, with reported savings of approximately 15% in utilization for jumbo frames (MTU 9000) on certain systems, particularly beneficial on high-throughput systems favoring bandwidth over latency; however, drawbacks include potential error handling issues from hardware bugs or timing mismatches, leading to invalid checksums or communication failures.[117][118][119] In Linux implementations, checksum offload can be enabled or disabled using ethtool -K <interface> tx-checksum-ip-generic <on|off> for transmit and ethtool -K <interface> rx-checksumming <on|off> for receive, with current status verified via ethtool -k <interface>.[120][121]
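The status output of ethtool -k can be consumed programmatically, for example when auditing offload settings across a fleet. A hedged sketch over sample output follows (the function name is ours and the sample is abbreviated; real output lists many more features, some tagged "[fixed]" when the driver does not allow changing them):

```python
def parse_offload_features(ethtool_k_output: str) -> dict:
    """Parse `ethtool -k <iface>`-style output into {feature: enabled}."""
    features = {}
    for line in ethtool_k_output.splitlines():
        name, _, state = line.partition(":")
        state = state.strip()
        if state:                                   # skip the banner line
            features[name.strip()] = state.split()[0] == "on"
    return features

sample = (
    "Features for eth0:\n"
    "rx-checksumming: on\n"
    "tx-checksumming: on\n"
    "tcp-segmentation-offload: off [fixed]\n"
)
print(parse_offload_features(sample))
```

In practice the sample string would come from running ethtool -k via subprocess; features marked "[fixed]" parse like any other but cannot be toggled with ethtool -K.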