from Wikipedia
Border Gateway Protocol
Communication protocol
Figure: BGP state machine
Abbreviation: BGP
Purpose: Exchange Internet Protocol routing information
Introduction: June 1, 1989[1]
Based on: EGP
OSI layer: Application layer
Port(s): TCP/179
RFC(s): See § Standards documents

Border Gateway Protocol (BGP) is a standardized exterior gateway protocol designed to exchange routing and reachability information among autonomous systems (AS) on the Internet.[2] BGP is classified as a path-vector routing protocol,[3] and it makes routing decisions based on paths, network policies, or rule-sets configured by a network administrator.

BGP used for routing within an autonomous system is called Interior Border Gateway Protocol (iBGP). In contrast, the Internet application of the protocol is called Exterior Border Gateway Protocol (eBGP).

History

The genesis of BGP was in 1989, when Kirk Lougheed, Len Bosack and Yakov Rekhter were sharing a meal at an IETF conference. They famously sketched the outline of their new routing protocol on the back of some napkins, which is why BGP is often referred to as the “Two Napkin Protocol”.[4][5][6]

BGP was first described in 1989 in RFC 1105 and has been in use on the Internet since 1994.[7] Support for carrying IPv6 routes came with the multiprotocol extensions, first defined in RFC 2283 in 1998.

The current version of BGP is version 4 (BGP4), which was first published as RFC 1654 in 1994, subsequently updated by RFC 1771 in 1995 and RFC 4271 in 2006.[8] RFC 4271 corrected errors, clarified ambiguities and updated the specification with common industry practices. The major enhancement of BGP4 was the support for Classless Inter-Domain Routing (CIDR) and use of route aggregation to decrease the size of routing tables. The Multiprotocol Extensions (RFC 4760), also known as Multiprotocol BGP (MP-BGP), allow BGP4 to carry a wide range of IPv4 and IPv6 "address families".

Operation

BGP neighbors, called peers, are established by manual configuration among routers to create a TCP session on port 179. A BGP speaker sends 19-byte keep-alive messages every 30 seconds (protocol default value, tunable) to maintain the connection.[9] Among routing protocols, BGP is unique in using TCP as its transport protocol.

When BGP runs between two peers in the same autonomous system (AS), it is referred to as Internal BGP (iBGP or Interior Border Gateway Protocol). When it runs between different autonomous systems, it is called External BGP (eBGP or Exterior Border Gateway Protocol). Routers on the boundary of one AS exchanging information with another AS are called border or edge routers or simply eBGP peers and are typically connected directly, while iBGP peers can be interconnected through other intermediate routers. Other deployment topologies are also possible, such as running eBGP peering inside a VPN tunnel, allowing two remote sites to exchange routing information in a secure and isolated manner.

The main difference between iBGP and eBGP peering is in the way routes that were received from one peer are typically propagated by default to other peers:

  • New routes learned from an eBGP peer are re-advertised to all iBGP and eBGP peers.
  • New routes learned from an iBGP peer are re-advertised to all eBGP peers only.

These route-propagation rules effectively require that all iBGP peers inside an AS are interconnected in a full mesh with iBGP sessions.
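
The two default propagation rules above can be sketched in a few lines of Python. This is a simplified illustration, not a real implementation: the `Peer` class, the `readvertise_targets` function and the peer names are hypothetical, and the sketch also assumes that a route is never sent back to the peer it was learned from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str
    kind: str  # "ibgp" or "ebgp"


def readvertise_targets(learned_from: Peer, peers: list[Peer]) -> list[Peer]:
    """Return the peers a newly learned route is re-advertised to by default."""
    targets = []
    for peer in peers:
        if peer == learned_from:
            continue  # assumption: do not echo the route back to its sender
        if learned_from.kind == "ibgp" and peer.kind == "ibgp":
            continue  # iBGP-learned routes are not passed on to other iBGP peers
        targets.append(peer)
    return targets


peers = [Peer("r1", "ibgp"), Peer("r2", "ibgp"), Peer("isp-a", "ebgp"), Peer("isp-b", "ebgp")]
from_ebgp = [p.name for p in readvertise_targets(peers[2], peers)]
from_ibgp = [p.name for p in readvertise_targets(peers[0], peers)]
print(from_ebgp)  # ['r1', 'r2', 'isp-b']
print(from_ibgp)  # ['isp-a', 'isp-b']
```

Because `r1` never forwards an iBGP-learned route to `r2`, `r2` only hears about it if it peers directly with the router that learned the route, which is where the full-mesh requirement comes from.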

How routes are propagated can be controlled in detail via the route-maps mechanism. This mechanism consists of a set of rules. Each rule describes, for routes matching some given criteria, what action should be taken. The action could be to drop the route, or it could be to modify some attributes of the route before inserting it in the routing table.

Extensions negotiation

During the peering handshake, when OPEN messages are exchanged, BGP speakers can negotiate optional capabilities of the session,[10] including multiprotocol extensions[11] and various recovery modes. If the multiprotocol extensions to BGP are negotiated at the time of creation, the BGP speaker can prefix the Network Layer Reachability Information (NLRI) it advertises with an address family prefix. These families include the IPv4 (default), IPv6, IPv4/IPv6 Virtual Private Networks and multicast BGP. Increasingly, BGP is used as a generalized signaling protocol to carry information about routes that may not be part of the global Internet, such as VPNs.[12]

In order to make decisions in its operations with peers, a BGP peer uses a simple finite-state machine (FSM) that consists of six states: Idle, Connect, Active, OpenSent, OpenConfirm, and Established. For each peer-to-peer session, a BGP implementation maintains a state variable that tracks which of these six states the session is in. BGP defines the messages that each peer should exchange in order to change the session from one state to another.

The first state is the Idle state. In the Idle state, BGP initializes all resources, refuses all inbound BGP connection attempts and initiates a TCP connection to the peer. The second state is Connect. In the Connect state, the router waits for the TCP connection to complete and transitions to the OpenSent state if successful. If unsuccessful, it starts the ConnectRetry timer and transitions to the Active state upon expiration. In the Active state, the router resets the ConnectRetry timer to zero and returns to the Connect state. In the OpenSent state, the router sends an Open message and waits for one in return in order to transition to the OpenConfirm state. Keepalive messages are exchanged and, upon successful receipt, the router is placed into the Established state. In the Established state, the router can send Keepalive, Update, and Notification messages to its peer and receive them from it.

  • Idle State:
    • Refuses all incoming BGP connections.
    • Starts the initialization of event triggers.
    • Initiates a TCP connection with its configured BGP peer.
    • Listens for a TCP connection from its peer.
    • Changes its state to Connect.
    • If an error occurs at any state of the FSM process, the BGP session is terminated immediately and returned to the Idle state. Some of the reasons why a router does not progress from the Idle state are:
      • TCP port 179 is not open.
      • A random TCP port over 1023 is not open.
      • Peer address configured incorrectly on either router.
      • AS number configured incorrectly on either router.
  • Connect State:
    • Waits for successful TCP negotiation with peer.
    • BGP does not spend much time in this state if the TCP session has been successfully established.
    • Sends Open message to peer and changes state to OpenSent.
    • If an error occurs, BGP moves to the Active state. Some reasons for the error are:
      • TCP port 179 is not open.
      • A random TCP port over 1023 is not open.
      • Peer address configured incorrectly on either router.
      • AS number configured incorrectly on either router.
  • Active State:
    • If the router was unable to establish a successful TCP session, then it ends up in the Active state.
    • BGP FSM tries to restart another TCP session with the peer and, if successful, then it sends an Open message to the peer.
    • If it is unsuccessful again, the FSM is reset to the Idle state.
    • Repeated failures may result in a router cycling between the Idle and Active states. Some of the reasons for this include:
      • TCP port 179 is not open.
      • A random TCP port over 1023 is not open.
      • BGP configuration error.
      • Network congestion.
      • Flapping network interface.
  • OpenSent State:
    • BGP FSM listens for an Open message from its peer.
    • Once the message has been received, the router checks the validity of the Open message.
    • If there is an error it is because one of the fields in the Open message does not match between the peers, e.g., BGP version mismatch, the peering router expects a different My AS, etc. The router then sends a Notification message to the peer indicating why the error occurred.
    • If there is no error, a Keepalive message is sent, various timers are set and the state is changed to OpenConfirm.
  • OpenConfirm State:
    • The peer is listening for a Keepalive message from its peer.
    • If a Keepalive message is received and no timer has expired before reception of the Keepalive, BGP transitions to the Established state.
    • If a timer expires before a Keepalive message is received, or if an error condition occurs, the router transitions back to the Idle state.
  • Established State:
    • In this state, the peers send Update messages to exchange information about each route being advertised to the BGP peer.
    • If there is any error in the Update message then a Notification message is sent to the peer, and BGP transitions back to the Idle state.
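
The state list above can be condensed into a transition table. The following Python sketch is a deliberately simplified model: the event names (`"start"`, `"tcp_established"` and so on) are made up for illustration, timers are not modeled, and any event that is not listed is treated as an error that returns the session to Idle.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# (current state, event) -> next state; event names are illustrative only
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_established"): State.OPEN_SENT,   # Open message is sent
    (State.CONNECT, "tcp_failed"): State.ACTIVE,
    (State.ACTIVE, "tcp_established"): State.OPEN_SENT,
    (State.ACTIVE, "tcp_failed"): State.IDLE,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,  # Keepalive is sent
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def step(state: State, event: str) -> State:
    """Apply one event; anything unexpected drops the session back to Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)


def run(events, state: State = State.IDLE) -> State:
    for event in events:
        state = step(state, event)
    return state


happy_path = ["start", "tcp_established", "open_received", "keepalive_received"]
print(run(happy_path).value)  # Established
```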

Router connectivity and learning routes

In the simplest arrangement, all routers within a single AS and participating in BGP routing must be configured in a full mesh: each router must be configured as a peer to every other router. This causes scaling problems, since the number of required connections grows quadratically with the number of routers involved. To alleviate the problem, BGP implements two options: route reflectors (RFC 4456) and BGP confederations (RFC 5065). The following discussion of basic update processing assumes a full iBGP mesh.

A given BGP router may accept network-layer reachability information (NLRI) updates from multiple neighbors and advertise NLRI to the same, or a different set, of neighbors. The BGP process maintains several routing information bases:

  • RIB: the router's main routing information base table.
  • Loc-RIB: the local routing information base. BGP maintains its own master routing table, separate from the main routing table of the router.
  • Adj-RIB-In: For each neighbor, the BGP process maintains a conceptual adjacent routing information base, incoming, containing the NLRI received from the neighbor.
  • Adj-RIB-Out: For each neighbor, the BGP process maintains a conceptual adjacent routing information base, outgoing, containing the NLRI sent to the neighbor.

The physical storage and structure of these conceptual tables are decided by the implementer of the BGP code. Their structure is not visible to other BGP routers, although they usually can be interrogated with management commands on the local router. It is quite common, for example, to store the Adj-RIB-In, Adj-RIB-Out and the Loc-RIB together in the same data structure, with additional information attached to the RIB entries. The additional information tells the BGP process such things as whether individual entries belong in the Adj-RIBs for specific neighbors, whether the per-neighbor route selection process made received routes eligible for the Loc-RIB, and whether Loc-RIB entries are eligible to be submitted to the local router's routing table management process.

BGP submits the routes that it considers best to the main routing table process. Depending on the implementation of that process, the BGP route is not necessarily selected. For example, a directly connected prefix, learned from the router's own hardware, is usually most preferred. As long as that directly connected route's interface is active, the BGP route to the destination will not be put into the routing table. Once the interface goes down, and there are no more preferred routes, the Loc-RIB route would be installed in the main routing table.

BGP carries the information with which rules inside BGP-speaking routers can make policy decisions. Information carried that is explicitly intended to be used in policy decisions includes communities and multi-exit discriminators (MEDs).

Route selection process

The BGP standard specifies a number of decision factors, more than the ones that are used by any other common routing process, for selecting NLRI to go into the Loc-RIB. The first decision point for evaluating NLRI is that its next-hop attribute must be reachable (or resolvable). Another way of saying the next-hop must be reachable is that there must be an active route, already in the main routing table of the router, to the prefix in which the next-hop address is reachable.

Next, for each neighbor, the BGP process applies various standard and implementation-dependent criteria to decide which routes conceptually should go into the Adj-RIB-In. The neighbor could send several possible routes to a destination, but the first level of preference is at the neighbor level. Only one route to each destination will be installed in the conceptual Adj-RIB-In. This process will also delete, from the Adj-RIB-In, any routes that are withdrawn by the neighbor.

Whenever a conceptual Adj-RIB-In changes, the main BGP process decides if any of the neighbor's new routes are preferred to routes already in the Loc-RIB. If so, it replaces them. If a given route is withdrawn by a neighbor, and there is no other route to that destination, the route is removed from the Loc-RIB and no longer sent by BGP to the main routing table manager. If the router does not have a route to that destination from any non-BGP source, the withdrawn route will be removed from the main routing table.

As long as there is a tie, the route selection process moves to the next step.

Steps to determine best path, in order of tiebreaker:[13][14]

| Step | Scope | Name | Default | Preferred | BGP field | Notes |
|---|---|---|---|---|---|---|
| 1 | Local to router | Local weight | "Off" | Higher | | Cisco-specific parameter |
| 2 | Internal to AS | Local preference | "Off", all set to 100 | Higher | LOCAL_PREF | If there are several iBGP routes from the neighbor, the one with the highest local preference is selected unless there are several routes with the same local preference. |
| 3 | Internal to AS | Accumulated Interior Gateway Protocol (AIGP) | "Off" | Lowest | AIGP | RFC 7311 |
| 4 | External to AS | Autonomous system (AS) jumps | "On", skipped if ignored in configuration | Lowest | AS-path | AS jumps is the number of AS numbers that must be traversed to reach the advertised destination. AS1–AS2–AS3 is a shorter path with fewer jumps than AS4–AS5–AS6–AS7. |
| 5 | External to AS | Origin type | "IGP" | Lowest | ORIGIN | 0 = IGP, 1 = EGP, 2 = Incomplete |
| 6 | External to AS | Multi-exit discriminator (MED) | "On", imported from peer AS | Lowest | MULTI_EXIT_DISC | By default only routes with the same peer autonomous system (AS) are compared. Can be set to ignore this. By default, IGP metric is not added. Can be set to add IGP metric. See the note on step 6. |
| 7 | Local to router (Loc-RIB) | eBGP over iBGP paths | "On" | Directly connected, over indirectly | | |
| 8 | Local to router (Loc-RIB) | IGP metric to BGP next hop | "On", imported from IGP | Lowest | | Prefer the route with the lowest interior cost to the next hop, according to the main routing table. See the note on step 8. |
| 9 | Local to router (Loc-RIB) | Path that was received first | "On" | Oldest | | Used to ignore changes on the next steps to minimize route flapping. |
| 10 | Local to router (Loc-RIB) | Neighbor Router ID | "On" | Lowest | | |
| 11 | Local to router (Loc-RIB) | Cluster list length | "On" | Lowest | | |
| 12 | Local to router (Loc-RIB) | Neighbor IP address | "On" | Lowest | | |

Note on step 6: Before the most recent edition of the BGP standard, if an update had no MED value, several implementations created a MED with the highest possible value. The current standard specifies that missing MEDs are treated as the lowest possible value. Since the current rule may cause different behavior than the vendor interpretations, BGP implementations that used the nonstandard default value have a configuration feature that allows the old or standard rule to be selected.

Note on step 8: If two neighbors advertised the same route, but one neighbor is reachable via a low-bitrate link and the other by a high-bitrate link, and the interior routing protocol calculates lowest cost based on highest bitrate, the route through the high-bitrate link would be preferred and other routes dropped. If a BGP extension is used to support multipath routing, the best path selection may stop here and all paths selected up to this point will be added to the routing table.
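
Because each step only matters when all earlier steps tie, the selection can be modeled as a comparison of tuples. The sketch below is an illustration under stated assumptions, not a vendor algorithm: the `Path` class and its field names are hypothetical, several steps (AIGP, oldest path, cluster list length, neighbor IP address) are left out, and MED is compared across all paths rather than only between paths from the same peer AS.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100
    as_path: tuple = ()
    origin: int = 0          # 0 = IGP, 1 = EGP, 2 = Incomplete
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0
    router_id: int = 0


def preference_key(p: Path) -> tuple:
    """Smaller tuples win; later fields only matter when earlier ones tie."""
    return (
        -p.weight,               # step 1: higher weight preferred
        -p.local_pref,           # step 2: higher local preference preferred
        len(p.as_path),          # step 4: fewer AS jumps preferred
        p.origin,                # step 5: lower origin type preferred
        p.med,                   # step 6: lower MED preferred (simplified)
        0 if p.is_ebgp else 1,   # step 7: eBGP over iBGP
        p.igp_metric,            # step 8: lower IGP metric to next hop
        p.router_id,             # step 10: lower neighbor router ID
    )


def best_path(paths):
    return min(paths, key=preference_key)


a = Path("a", as_path=(64500, 64501), router_id=1)
b = Path("b", as_path=(64502,), router_id=2)
c = Path("c", local_pref=200, as_path=(64503, 64504, 64505), router_id=3)

print(best_path([a, b]).name)     # b: same local preference, shorter AS path
print(best_path([a, b, c]).name)  # c: higher local preference beats AS path length
```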

The local preference, weight, and other criteria can be manipulated by local configuration and software capabilities. Such manipulation, although commonly used, is outside the scope of the standard. For example, the community attribute (see below) is not directly used by the BGP selection process. The BGP neighbor process can, however, have a manually programmed rule that sets local preference or another factor if the community value matches some pattern-matching criterion. If the route was learned from an external peer, the per-neighbor BGP process computes a local preference value from local policy rules and then compares the local preference of all routes from the neighbor.

Communities

BGP communities are attribute tags that can be applied to incoming or outgoing prefixes to achieve some common goal.[15] While it is common to say that BGP allows an administrator to set policies on how prefixes are handled by ISPs, this is generally not possible, strictly speaking. For instance, BGP natively has no concept to allow one AS to tell another AS to restrict advertisement of a prefix to only North American peering customers. Instead, an ISP generally publishes a list of well-known or proprietary communities with a description for each one, which essentially becomes an agreement of how prefixes are to be treated.

Well-known BGP communities[16]

| Attribute value | Attribute | Description | Reference |
|---|---|---|---|
| 0x00000000–0x0000FFFF | Reserved | | RFC 1997 |
| 0x00010000–0xFFFEFFFF | Reserved for private use | | RFC 1997 |
| 0xFFFF0000 | GRACEFUL_SHUTDOWN | At the neighbor AS peer, set a lower LOCAL_PREF to route away from the source. | RFC 8326 |
| 0xFFFF0001 | ACCEPT_OWN | Used to modify how a route originated within one VRF is imported into other VRFs | RFC 7611 |
| 0xFFFF0002 | ROUTE_FILTER_TRANSLATED_v4 | | draft-l3vpn-legacy-rtc |
| 0xFFFF0003 | ROUTE_FILTER_v4 | | draft-l3vpn-legacy-rtc |
| 0xFFFF0004 | ROUTE_FILTER_TRANSLATED_v6 | | draft-l3vpn-legacy-rtc |
| 0xFFFF0005 | ROUTE_FILTER_v6 | | draft-l3vpn-legacy-rtc |
| 0xFFFF0006 | LLGR_STALE | Stale routes are retained for longer after a session failure | RFC 9494 |
| 0xFFFF0007 | NO_LLGR | LLGR capability should not apply | RFC 9494 |
| 0xFFFF0008 | accept-own-nexthop | | draft-agrewal-idr-accept-own-nexthop |
| 0xFFFF0009 | Standby PE | Allow for faster recovery of connectivity on different types of failures, with multicast in BGP/MPLS VPNs. | RFC 9026 |
| 0xFFFF029A | BLACKHOLE | To temporarily protect against denial-of-service attack by asking the neighbour AS to discard all traffic to the prefix (blackholing) | RFC 7999 |
| 0xFFFFFF01 | NO_EXPORT | Limit to a BGP confederation boundary | RFC 1997 |
| 0xFFFFFF02 | NO_ADVERTISE | Limit to a BGP peer | RFC 1997 |
| 0xFFFFFF03 | NO_EXPORT_SUBCONFED | Limit to an AS | RFC 1997 |
| 0xFFFFFF04 | NOPEER | "No need" to advertise over a peer link | RFC 3765 |

Examples of common communities include:

  • local preference adjustments
  • geographic or peer type restrictions
  • denial-of-service attack identification
  • AS prepending options

An ISP might publish communities such as the following for routes received from customers:

  • To Customers North America (East Coast) 3491:100
  • To Customers North America (West Coast) 3491:200

The customer simply adjusts their configuration to include the correct community or communities for each route, and the ISP is responsible for controlling who the prefix is advertised to. The end user has no technical ability to enforce correct actions being taken by the ISP, though problems in this area are generally rare and accidental.[17][18]
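
A standard community is a single 32-bit value that is conventionally written as two 16-bit halves, `ASN:value`. The following minimal Python sketch (the function names are illustrative) converts between the `3491:100` notation used above and the hexadecimal form used in the well-known communities table.

```python
def encode_community(asn: int, value: int) -> int:
    """Pack 'ASN:value' into the 32-bit community value."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (asn << 16) | value


def decode_community(raw: int) -> str:
    """Render a 32-bit community value as 'high:low'."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


east_coast = encode_community(3491, 100)
print(hex(east_coast))               # 0xda30064
print(decode_community(east_coast))  # 3491:100
print(decode_community(0xFFFFFF01))  # 65535:65281 (NO_EXPORT)
```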

It is a common tactic for end customers to use BGP communities (usually ASN:70,80,90,100) to control the local preference the ISP assigns to advertised routes instead of using MED (the effect is similar). The community attribute is transitive, but communities applied by the customer are very rarely propagated outside the next-hop AS. Not all ISPs give out their communities to the public.[19]

BGP Extended Community Attribute

The BGP Extended Community Attribute was added in 2006[20] to extend the range of such attributes and to give community attributes structure by means of a type field. The extended format consists of one or two octets for the type field followed by seven or six octets for the respective community attribute content. The definition of this Extended Community Attribute is documented in RFC 4360. The IANA administers the registry for BGP Extended Communities Types.[21] The Extended Communities Attribute itself is a transitive optional BGP attribute. A bit in the type field within the attribute decides whether the encoded extended community is of a transitive or non-transitive nature. The IANA registry therefore provides different number ranges for the attribute types. Due to the extended attribute range, its usage can be manifold. As examples, RFC 4360 defines the "Two-Octet AS Specific Extended Community", the "IPv4 Address Specific Extended Community", the "Opaque Extended Community", the "Route Target Community", and the "Route Origin Community". A number of BGP QoS drafts also use this Extended Community Attribute structure for inter-domain QoS signalling.[22]

With the introduction of 32-bit AS numbers, some issues were immediately obvious with the community attribute, which only defines a 16-bit ASN field and therefore cannot match a real 32-bit ASN value. Since RFC 7153, extended communities are compatible with 32-bit ASNs. RFC 8092 and RFC 8195 introduce a Large Community attribute of 12 bytes, divided into three fields of 4 bytes each (AS:function:parameter).[23]
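
The 12-byte layout can be illustrated with Python's `struct` module. This is a minimal sketch; the function names and the example values are made up, and the three fields follow the AS:function:parameter reading given above.

```python
import struct

LARGE_COMMUNITY = struct.Struct("!III")  # three unsigned 32-bit fields, network byte order


def pack_large_community(asn: int, function: int, parameter: int) -> bytes:
    """Encode AS:function:parameter as 12 bytes."""
    return LARGE_COMMUNITY.pack(asn, function, parameter)


def unpack_large_community(data: bytes) -> str:
    """Decode 12 bytes back into the 'AS:function:parameter' notation."""
    if len(data) != LARGE_COMMUNITY.size:
        raise ValueError("a large community is exactly 12 bytes")
    return ":".join(str(field) for field in LARGE_COMMUNITY.unpack(data))


# 4200000000 is an example 32-bit ASN that would not fit in a 16-bit field
encoded = pack_large_community(4200000000, 1, 2)
print(len(encoded))                     # 12
print(unpack_large_community(encoded))  # 4200000000:1:2
```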

Multi-exit discriminators

MEDs, defined in the main BGP standard, were originally intended to show a neighboring AS which of several links the advertising AS prefers for inbound traffic. Another application of MEDs is for multiple ASs that have a presence at an IXP to advertise the value, typically based on delay, that they impose to send traffic to some destination.

Some routers (such as those from Juniper) use the metric from OSPF to set the MED.

Examples of MED used with BGP when exported to BGP on Juniper SRX

# run show ospf route   
Topology default Route Table:
Prefix             Path  Route      NH       Metric NextHop       Nexthop     
                   Type  Type       Type            Interface     Address/LSP
10.32.37.0/24      Inter Discard    IP       16777215
10.32.37.0/26      Intra Network    IP          101 ge-0/0/1.0    10.32.37.241
10.32.37.64/26     Intra Network    IP          102 ge-0/0/1.0    10.32.37.241
10.32.37.128/26    Intra Network    IP          101 ge-0/0/1.0    10.32.37.241

# show route advertising-protocol bgp 10.32.94.169
  Prefix               Nexthop       MED    Lclpref            AS path
* 10.32.37.0/24           Self                 16777215           I
* 10.32.37.0/26           Self                 101                I
* 10.32.37.64/26          Self                 102                I
* 10.32.37.128/26         Self                 101                I

Packet format

Message header format

BGP version 4 message header format[24]

| Bit offset | Field |
|---|---|
| 0–127 | Marker (always: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff) |
| 128–143 | Length |
| 144–151 | Type |

  • Marker: Included for compatibility, must be set to all ones.
  • Length: Total length of the message in octets, including the header.
  • Type: Type of BGP message. The following values are defined:
    • Open (1)
    • Update (2)
    • Notification (3)
    • KeepAlive (4)
    • Route-Refresh (5)

Note: "Marker" and "Length" are omitted from the examples.
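
The header layout above maps directly onto a few bytes. The following Python sketch builds and parses the 19-octet header; the helper names are hypothetical and only the five message types listed above are recognized.

```python
import struct

MARKER = b"\xff" * 16          # 16 octets, all ones
HEADER_LEN = 19                # marker (16) + length (2) + type (1)
MESSAGE_TYPES = {1: "Open", 2: "Update", 3: "Notification", 4: "KeepAlive", 5: "Route-Refresh"}


def build_message(msg_type: int, body: bytes = b"") -> bytes:
    """Prepend the header to a message body; Length covers header plus body."""
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), msg_type) + body


def parse_header(data: bytes) -> tuple[int, str]:
    """Return (total length, type name) from the first 19 octets."""
    if len(data) < HEADER_LEN or data[:16] != MARKER:
        raise ValueError("not a valid BGP header")
    length, msg_type = struct.unpack("!HB", data[16:HEADER_LEN])
    return length, MESSAGE_TYPES.get(msg_type, "Unknown")


keepalive = build_message(4)   # a KeepAlive is a header with no body
print(len(keepalive), parse_header(keepalive))  # 19 (19, 'KeepAlive')
```

A KeepAlive carries nothing beyond the header, which is why it is exactly 19 bytes long.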

Open Packet

Version (8 bits)
Version of BGP used.
My AS (16 bits)
Sender's autonomous system number.
Hold Time (16 bits)
Timeout timer, used to calculate the interval between KeepAlive messages. Default 90 seconds.
BGP Identifier (32 bits)
IP address of sender.
Optional Parameters Length (8 bits)
Total length of the Optional Parameters field.

Example of Open Message

Type: Open Message (1)
Version: 4
My AS: 64496
Hold Time: 90
BGP Identifier: 192.0.2.254
Optional Parameters Length: 16
Optional Parameters:
 Capability: Multiprotocol extensions capability (1)
 Capability: Route refresh capability (2)
 Capability: Route refresh capability (Cisco) (128)
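
The fixed part of the Open message can be packed with the field widths listed above. This Python sketch is illustrative only: the function names are made up, the message header is left out (as in the examples), and optional parameters such as capabilities are treated as an opaque byte string rather than being encoded.

```python
import ipaddress
import struct

# version (8), My AS (16), Hold Time (16), BGP Identifier (32), Opt. Params Length (8)
OPEN_FIXED = struct.Struct("!BHH4sB")


def build_open_body(my_as: int, hold_time: int, bgp_id: str,
                    optional: bytes = b"", version: int = 4) -> bytes:
    ident = ipaddress.IPv4Address(bgp_id).packed
    return OPEN_FIXED.pack(version, my_as, hold_time, ident, len(optional)) + optional


def parse_open_body(body: bytes) -> dict:
    version, my_as, hold_time, ident, opt_len = OPEN_FIXED.unpack(body[:OPEN_FIXED.size])
    return {
        "version": version,
        "my_as": my_as,
        "hold_time": hold_time,
        "bgp_identifier": str(ipaddress.IPv4Address(ident)),
        "optional_parameters_length": opt_len,
    }


body = build_open_body(my_as=64496, hold_time=90, bgp_id="192.0.2.254")
fields = parse_open_body(body)
print(fields["my_as"], fields["hold_time"], fields["bgp_identifier"])  # 64496 90 192.0.2.254
```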

Update Packet

After the initial exchange, only differences (added, changed, or removed routes) are sent.

Example of UPDATE Message

Type: UPDATE Message (2)
Withdrawn Routes Length: 0
Total Path Attribute Length: 25
Path attributes
 ORIGIN: IGP
 AS_PATH: 64500
 NEXT_HOP: 192.0.2.254
 MULTI_EXIT_DISC: 0
Network Layer Reachability Information (NLRI)
 192.0.2.0/27
 192.0.2.32/27
 192.0.2.64/27
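
On the wire, each IPv4 prefix in the NLRI field is encoded as a one-octet prefix length in bits, followed by only as many octets of the address as that length requires. The following Python sketch (with illustrative function names, IPv4 only) encodes and decodes the three prefixes from the example above.

```python
import ipaddress


def encode_nlri(prefix: str) -> bytes:
    """Encode one IPv4 prefix as <length in bits><minimal prefix octets>."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def decode_nlri(data: bytes) -> list[str]:
    """Decode a run of encoded IPv4 prefixes."""
    prefixes = []
    i = 0
    while i < len(data):
        plen = data[i]
        nbytes = (plen + 7) // 8
        raw = data[i + 1:i + 1 + nbytes].ljust(4, b"\x00")  # pad back to 4 octets
        prefixes.append(f"{ipaddress.IPv4Address(raw)}/{plen}")
        i += 1 + nbytes
    return prefixes


routes = ["192.0.2.0/27", "192.0.2.32/27", "192.0.2.64/27"]
nlri = b"".join(encode_nlri(r) for r in routes)
print(len(nlri))          # 15
print(decode_nlri(nlri))  # ['192.0.2.0/27', '192.0.2.32/27', '192.0.2.64/27']
```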

Notification

If there is an error it is because one of the fields in the OPEN or UPDATE message does not match between the peers, e.g., BGP version mismatch, the peering router expects a different My AS, etc. The router then sends a Notification message to the peer indicating why the error occurred.

Error codes

| Error code | Name | Subcode | Subcode name |
|---|---|---|---|
| 1 | Message Header Error | 1 | Connection Not Synchronized |
| | | 2 | Bad Message Length |
| | | 3 | Bad Message Type |
| 2 | OPEN Message Error | 1 | Unsupported Version Number |
| | | 2 | Bad Peer AS |
| | | 3 | Bad BGP Identifier |
| | | 4 | Unsupported Authentication Code |
| | | 5 | Authentication Failure |
| | | 6 | Unacceptable Hold Time |
| 3 | UPDATE Message Error | 1 | Malformed Attribute List |
| | | 2 | Unrecognized Well-known Attribute |
| | | 3 | Missing Well-known Attribute |
| | | 4 | Attribute Flags Error |
| | | 5 | Attribute Length Error |
| | | 6 | Invalid ORIGIN Attribute |
| | | 7 | AS Routing Loop |
| | | 8 | Invalid NEXT_HOP Attribute |
| | | 9 | Optional Attribute Error |
| | | 10 | Invalid Network Field |
| | | 11 | Malformed AS_PATH |
| 4 | Hold Timer Expired | | |
| 5 | Finite State Machine Error | | |
| 6 | Cease | | |

Example of NOTIFICATION Message

Type: NOTIFICATION Message (3)
Major error Code: OPEN Message Error (2)
Minor error Code (Open Message): Bad Peer AS (2)
Bad Peer AS: 65200
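
A Notification body consists of a one-octet error code, a one-octet subcode and optional data. The Python sketch below reproduces the example above; the function names are illustrative, only the OPEN subcodes are named, and carrying the rejected AS number as two octets of data is an assumption made to mirror the example.

```python
import struct

ERROR_NAMES = {
    1: "Message Header Error", 2: "OPEN Message Error", 3: "UPDATE Message Error",
    4: "Hold Timer Expired", 5: "Finite State Machine Error", 6: "Cease",
}
OPEN_SUBCODES = {
    1: "Unsupported Version Number", 2: "Bad Peer AS", 3: "Bad BGP Identifier",
    4: "Unsupported Authentication Code", 5: "Authentication Failure",
    6: "Unacceptable Hold Time",
}


def build_notification_body(code: int, subcode: int, data: bytes = b"") -> bytes:
    return struct.pack("!BB", code, subcode) + data


def describe_notification(body: bytes) -> str:
    code, subcode = struct.unpack("!BB", body[:2])
    name = ERROR_NAMES.get(code, "Unknown")
    # Only OPEN subcodes are named in this sketch
    sub = OPEN_SUBCODES.get(subcode, "Unknown") if code == 2 else "subcode"
    return f"{name} ({code}) / {sub} ({subcode})"


# Assumption: the data field carries the unexpected AS number as 16 bits
body = build_notification_body(2, 2, struct.pack("!H", 65200))
print(body.hex())                   # 0202feb0
print(describe_notification(body))  # OPEN Message Error (2) / Bad Peer AS (2)
```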

KeepAlive

KeepAlive messages are sent periodically to verify that the remote peer is still alive. KeepAlives should be sent at intervals of one third of the hold time.

Example of KEEPALIVE Message

Type: KEEPALIVE Message (4)

Route-Refresh

Defined in RFC 2918 and extended by RFC 7313.

Allows for soft updating of the Adj-RIB-In without resetting the connection.

Example of ROUTE-REFRESH Message

Type: ROUTE-REFRESH Message (5)
Address family identifier (AFI): IPv4 (1)
Subtype: Normal route refresh request [RFC2918] with/without ORF [RFC5291] (0)
Subsequent address family identifier (SAFI): Unicast (1)
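
The body of a Route-Refresh message is four octets: a 16-bit AFI, one octet that RFC 2918 reserves and that appears as the subtype in the example above, and an 8-bit SAFI. A minimal Python sketch follows; the function name is hypothetical and the 19-octet header is built inline so the block stands alone.

```python
import struct

MARKER = b"\xff" * 16
ROUTE_REFRESH = 5


def build_route_refresh(afi: int = 1, safi: int = 1, subtype: int = 0) -> bytes:
    """Build a complete Route-Refresh message (defaults: IPv4 unicast, normal request)."""
    body = struct.pack("!HBB", afi, subtype, safi)
    header = MARKER + struct.pack("!HB", 19 + len(body), ROUTE_REFRESH)
    return header + body


msg = build_route_refresh()
print(len(msg))        # 23
print(msg[-4:].hex())  # 00010001
```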

Internal scalability

BGP is "the most scalable of all routing protocols."[25]

An autonomous system with internal BGP (iBGP) must have all of its iBGP peers connect to each other in a full mesh (where everyone speaks to everyone directly). This full-mesh configuration requires that each router maintain a session with every other router. In large networks, this number of sessions may degrade the performance of routers, due to either a lack of memory, or high CPU process requirements.

Route reflectors

Route reflectors (RRs) reduce the number of connections required in an AS. A single router (or two for redundancy) can be made an RR: other routers in the AS need only be configured as peers to them. An RR offers an alternative to the logical full-mesh requirement of iBGP. The purpose of the RR is concentration. Multiple BGP routers can peer with a central point, the RR – acting as an RR server – rather than peer with every other router in a full mesh. All the other iBGP routers become RR clients.[26]

This approach, similar to OSPF's DR/BDR feature, provides large networks with added iBGP scalability. In a fully meshed iBGP network of 10 routers, 90 individual CLI statements (spread throughout all routers in the topology) are needed just to define the remote-AS of each peer: this quickly becomes a headache to manage. An RR topology can cut these 90 statements down to 18, offering a viable solution for the larger networks administered by ISPs.

An RR is a single point of failure, so at least a second RR may be configured to provide redundancy. As it is an additional peer for the other 10 routers, it approximately doubles the number of CLI statements, requiring an additional 11 × 2 − 2 = 20 statements in this case. In a BGP multipath environment, the additional RR can also benefit the network by adding local routing throughput if the RRs act as traditional routers rather than in a dedicated RR server role.
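
The statement counts quoted above follow from counting one neighbor statement at each end of every iBGP session. The Python sketch below reproduces the arithmetic; the function names are illustrative, and the second RR is assumed to be an additional router that peers with all existing ones, as in the text.

```python
def full_mesh_sessions(n: int) -> int:
    """Number of iBGP sessions when n routers are fully meshed."""
    return n * (n - 1) // 2


def full_mesh_statements(n: int) -> int:
    """Each session needs a neighbor statement on both routers."""
    return 2 * full_mesh_sessions(n)


def single_rr_statements(n: int) -> int:
    """One of the n routers is the RR; the other n - 1 each peer only with it."""
    return 2 * (n - 1)


def second_rr_statements(n: int) -> int:
    """Adding one more router as a second RR that peers with all n existing routers."""
    return 2 * (n + 1) - 2


n = 10
print(full_mesh_sessions(n), full_mesh_statements(n))       # 45 90
print(single_rr_statements(n), second_rr_statements(n))     # 18 20
```

The full-mesh count grows quadratically with the number of routers, while the route reflector counts grow linearly.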

RRs and confederations both reduce the number of iBGP peers to each router and thus reduce processing overhead. RRs are a pure performance-enhancing technique, while confederations also can be used to implement more fine-grained policy.

Rules

Figure: A typical configuration of a BGP RR deployment, as proposed by Section 6 of RFC 4456.

RR servers propagate routes inside the AS based on the following rules:

  • Routes are always reflected to eBGP peers.
  • Routes are never reflected to the originator of the route.
  • If a route is received from a non-client peer, reflect to client peers.
  • If a route is received from a client peer, reflect to client and non-client peers.
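
The reflection rules above can be written as a small function. This is a sketch under stated assumptions: the role names (`"client"`, `"non_client"`, `"ebgp"`) and peer names are made up, the "originator" is approximated by the iBGP peer the route was received from, and routes learned from eBGP peers are outside its scope.

```python
def reflect_targets(received_from: str, peers: dict[str, str]) -> list[str]:
    """Return the peers an RR sends a route to, given who it was received from.

    peers maps a peer name to its role: "client", "non_client" or "ebgp".
    """
    source_role = peers[received_from]
    if source_role not in ("client", "non_client"):
        raise ValueError("this sketch only covers routes received from iBGP peers")
    targets = []
    for name, role in peers.items():
        if name == received_from:
            continue                      # never back to the originator
        if role == "ebgp":
            targets.append(name)          # always sent to eBGP peers
        elif role == "client":
            targets.append(name)          # clients hear client and non-client routes
        elif role == "non_client" and source_role == "client":
            targets.append(name)          # non-clients only hear client routes
    return targets


peers = {"c1": "client", "c2": "client", "n1": "non_client", "n2": "non_client", "x1": "ebgp"}
print(reflect_targets("c1", peers))  # ['c2', 'n1', 'n2', 'x1']
print(reflect_targets("n1", peers))  # ['c1', 'c2', 'x1']
```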

Cluster

An RR and its clients form a cluster. The cluster ID is then attached to every route advertised by the RR to its client or nonclient peers. A cluster ID is a cumulative, non-transitive BGP attribute, and every RR must prepend the local cluster ID to the cluster list to avoid routing loops.

Confederation

Confederations are sets of autonomous systems. In common practice,[27] only one of the confederation AS numbers is seen by the Internet as a whole. Confederations are used in very large networks where a large AS can be configured to encompass smaller more manageable internal ASs.

The confederated AS is composed of multiple ASs. Each confederated AS alone has iBGP fully meshed and has connections to other ASs inside the confederation. Even though these ASs have eBGP peers to ASs within the confederation, the ASs exchange routing as if they used iBGP. In this way, the confederation preserves next hop, metric, and local preference information. To the outside world, the confederation appears to be a single AS. This resolves the problems of iBGP in a transit AS, where the full mesh required between all BGP routers leads to a large number of TCP sessions and unnecessary duplication of routing traffic.

Confederations can be used in conjunction with route reflectors. Both confederations and route reflectors can be subject to persistent oscillation unless specific design rules, affecting both BGP and the interior routing protocol, are followed.[28]

These alternatives can introduce problems of their own, including the following:

  • route oscillation
  • sub-optimal routing
  • increase of BGP convergence time[29]

Additionally, route reflectors and BGP confederations were not designed to ease BGP router configuration. Nevertheless, these are common tools for experienced BGP network architects. These tools may be combined, for example, as a hierarchy of route reflectors.

Stability

The routing tables managed by a BGP implementation are adjusted continually to reflect actual changes in the network, such as links or routers going down and coming back up. In the network as a whole, it is normal for these changes to happen almost continuously, but for any particular router or link, changes are expected to be relatively infrequent. If a router is misconfigured or mismanaged, it may get into a rapid cycle between down and up states. This pattern of repeated withdrawal and re-announcement, known as route flapping, can cause excessive activity in all the other routers that know about the cycling entity, as the same route is continually injected into and withdrawn from the routing tables. The BGP design is such that delivery of traffic may not function while routes are being updated. On the Internet, a BGP routing change may cause outages for several minutes.

A feature known as route flap damping (RFC 2439) is built into many BGP implementations in an attempt to mitigate the effects of route flapping. Without damping, the excessive activity can cause a heavy processing load on routers, which may in turn delay updates on other routes, and so affect overall routing stability. With damping, the penalty accumulated by a flapping route decays exponentially. At the first instance when a route becomes unavailable and quickly reappears, damping does not take effect, so as to maintain the normal fail-over times of BGP. At the second occurrence, BGP shuns that prefix for a certain length of time; after subsequent occurrences, the prefix is ignored for exponentially longer periods. After the abnormalities have ceased and a suitable length of time has passed for the offending route, prefixes can be reinstated with a clean slate. Damping can also mitigate denial-of-service attacks.

RFC 2439 (Section 4) also suggests that route flap damping is more desirable on Exterior Border Gateway Protocol sessions (eBGP sessions, or exterior peers) than on Interior Border Gateway Protocol sessions (iBGP sessions, or internal peers). With this approach, when a route flaps inside an autonomous system, the flap is not propagated to external ASs; flapping a route to an eBGP peer would otherwise cause a chain of flapping for that route throughout the backbone. This method also avoids the overhead of route flap damping on iBGP sessions.

Subsequent research has shown that flap damping can actually lengthen convergence times in some cases, and can cause interruptions in connectivity even when links are not flapping.[30][31] Moreover, as backbone links and router processors have become faster, some network architects have suggested that flap damping may not be as important as it used to be, since changes to the routing table can be handled much faster by routers.[32] This has led the RIPE Routing Working Group to write, "With the current implementations of BGP flap damping, the application of flap damping in ISP networks is NOT recommended. ... If flap damping is implemented, the ISP operating that network will cause side-effects to their customers and the Internet users of their customers' content and services ... . These side-effects would quite likely be worse than the impact caused by simply not running flap damping at all."[33] Improving stability without the problems of flap damping is the subject of current research.[34]

Routing table growth

BGP table growth on the Internet
Number of AS on the Internet vs number of registered AS

One of the largest problems faced by BGP, and indeed the Internet infrastructure as a whole, is the growth of the Internet routing table. If the global routing table grows to the point where some older, less capable routers cannot cope with the memory requirements or the CPU load of maintaining the table, these routers will cease to be effective gateways between the parts of the Internet they connect. In addition, and perhaps even more importantly, larger routing tables take longer to stabilize after a major connectivity change, leaving network service unreliable, or even unavailable, in the interim.

Until late 2001, the global routing table was growing exponentially, threatening an eventual widespread breakdown of connectivity. In an attempt to prevent this, ISPs cooperated in keeping the global routing table as small as possible, by using Classless Inter-Domain Routing (CIDR) and route aggregation. While this slowed the growth of the routing table to a linear process for several years, with the expanded demand for multihoming by end-user networks the growth was once again superlinear by the middle of 2004.

512k day

A Y2K-like overflow was triggered in 2014 on router models that had not been appropriately updated.

While a full IPv4 BGP table as of August 2014 (512k day)[35][36] was in excess of 512,000 prefixes,[37] many older routers had a limit of 512k (512,000–524,288)[38][39] routing table entries. On August 12, 2014, outages resulting from full tables hit eBay, LastPass and Microsoft Azure among others.[40] A number of Cisco routers commonly in use had TCAM, a form of high-speed content-addressable memory, for storing BGP advertised routes. On impacted routers, the TCAM was by default allocated as 512k IPv4 routes and 256k IPv6 routes. While the reported number of IPv6 advertised routes was only about 20k, the number of advertised IPv4 routes reached the default limit, causing a spillover effect as routers attempted to compensate for the issue by using slow software routing (as opposed to fast hardware routing via TCAM). The main method for dealing with this issue involves operators changing the TCAM allocation to allow more IPv4 entries, by reallocating some of the TCAM reserved for IPv6 routes, which requires a reboot on most routers. The 512k problem was predicted by a number of IT professionals.[41][42][43]

The event that actually pushed the number of routes above 512k was the announcement of about 15,000 new routes in short order, starting at 07:48 UTC. Almost all of these routes were to Verizon Autonomous Systems 701 and 705, created as a result of deaggregation of larger blocks, which introduced thousands of new /24 routes and brought the routing table to 515,000 entries. The new routes appear to have been reaggregated within 5 minutes, but instability across the Internet apparently continued for a number of hours.[44] Even if Verizon had not caused the routing table to exceed 512k entries in the short spike, natural growth would soon have done so.

Route summarization is often used to improve aggregation of the BGP global routing table, thereby reducing the necessary table size in routers of an AS. Suppose AS1 has been allocated the large address space 172.16.0.0/16. This would be counted as one route in the table, but due to customer requirements or traffic engineering purposes, AS1 also wants to announce the smaller, more specific routes 172.16.0.0/18, 172.16.64.0/18, and 172.16.128.0/18. The prefix 172.16.192.0/18 does not have any hosts, so AS1 does not announce a specific route for it. In total, AS1 announces four routes.

AS2 will see the four routes from AS1 (172.16.0.0/16, 172.16.0.0/18, 172.16.64.0/18, and 172.16.128.0/18) and it is up to the routing policy of AS2 to decide whether or not to take a copy of the four routes or, as 172.16.0.0/16 overlaps all the other specific routes, to just store the summary, 172.16.0.0/16.

If AS2 wants to send data to prefix 172.16.192.0/18, it will be sent to the routers of AS1 on route 172.16.0.0/16. At AS1, it will either be dropped or a destination unreachable ICMP message will be sent back, depending on the configuration of AS1's routers.

If AS1 later decides to drop the route 172.16.0.0/16, leaving 172.16.0.0/18, 172.16.64.0/18, and 172.16.128.0/18, the number of routes AS1 announces drops to three. Depending on the routing policy of AS2, it will store a copy of the three routes, or aggregate 172.16.0.0/18 and 172.16.64.0/18 to 172.16.0.0/17, thereby reducing the number of routes AS2 stores to two (172.16.0.0/17 and 172.16.128.0/18).

If AS2 now wants to send data to prefix 172.16.192.0/18, it will be dropped or a destination unreachable ICMP message will be sent back at the routers of AS2 (not AS1 as before), because 172.16.192.0/18 is not in the routing table.
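The aggregation and lookup behaviour in this example can be reproduced with Python's standard `ipaddress` module. This is only an illustrative sketch: the `lookup` helper is a hypothetical stand-in for a router's longest-prefix match, not part of any BGP implementation.

```python
import ipaddress

# Prefixes AS1 still announces after withdrawing the covering /16.
announced = [
    ipaddress.ip_network("172.16.0.0/18"),
    ipaddress.ip_network("172.16.64.0/18"),
    ipaddress.ip_network("172.16.128.0/18"),
]

# collapse_addresses merges adjacent networks into the fewest covering prefixes,
# which is what AS2 may choose to do under its own routing policy.
aggregated = list(ipaddress.collapse_addresses(announced))


def lookup(table, destination):
    """Hypothetical longest-prefix match: most specific covering prefix, or None."""
    address = ipaddress.ip_address(destination)
    matches = [net for net in table if address in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


print(aggregated)                          # two routes instead of three
print(lookup(aggregated, "172.16.70.1"))   # covered by the aggregated /17
print(lookup(aggregated, "172.16.200.1"))  # inside 172.16.192.0/18: no route at AS2
```

The last lookup returns no match, which corresponds to AS2 dropping the traffic itself once the covering /16 is gone.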

AS number depletion and 32-bit ASNs

The RFC 1771 BGP-4 specification encoded AS numbers in 16 bits, allowing 64,510 possible public AS numbers.[a] In 2011, only 15,000 AS numbers were still available, and projections[45] envisioned a complete depletion of available AS numbers by September 2013.

RFC 6793 extends AS coding from 16 to 32 bits,[b] which allows up to about 4 billion AS numbers. An additional private AS range is also defined in RFC 6996.[c] To allow routes to traverse groups of routers that cannot handle the new ASNs, the new attribute AS4_PATH (optional transitive) and the special 16-bit ASN AS_TRANS (AS23456) are used.[46] 32-bit ASN assignments started in 2007.
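As a rough illustration of the AS_TRANS substitution, the sketch below shows how a path containing 32-bit ASNs might be presented to a speaker that only understands 16-bit ASNs. The function names are hypothetical and the example ASNs are arbitrary; a real implementation also has to reconstruct the path from AS4_PATH on the far side, which is not shown.

```python
AS_TRANS = 23456        # reserved 16-bit placeholder ASN
MAX_16BIT_ASN = 65535   # largest value that fits in the old 2-octet field


def legacy_asn(asn):
    """Return the ASN an old 16-bit-only speaker would see in AS_PATH."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("ASN does not fit in 32 bits")
    return asn if asn <= MAX_16BIT_ASN else AS_TRANS


def split_for_legacy_peer(as_path):
    """Return (AS_PATH as seen by the old speaker, AS4_PATH carrying the real ASNs)."""
    return [legacy_asn(asn) for asn in as_path], list(as_path)


as_path, as4_path = split_for_legacy_peer([64500, 4200000000, 65536])
print(as_path)   # 32-bit ASNs replaced by AS_TRANS
print(as4_path)  # original ASNs preserved for 32-bit-capable speakers
```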

Load balancing

Another factor contributing to the growth of the routing table is the need for load balancing of multi-homed networks. It is not a trivial task to balance the inbound traffic to a multi-homed network across its multiple inbound paths, due to limitation of the BGP route selection process. For a multi-homed network, if it announces the same network blocks across all of its BGP peers, the result may be that one or several of its inbound links become congested while the other links remain under-utilized, because external networks all picked that set of congested paths as optimal. Like most other routing protocols, BGP does not detect congestion.

To work around this problem, BGP administrators of that multihomed network may divide a large contiguous IP address block into smaller blocks and tweak the route announcement to make different blocks look optimal on different paths, so that external networks will choose a different path to reach different blocks of that multi-homed network. Such cases will increase the number of routes as seen on the global BGP table.
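A small sketch of this deaggregation technique, using Python's standard `ipaddress` module, is shown below. The address block and the upstream names `ISP-A` and `ISP-B` are assumptions chosen for illustration only.

```python
import ipaddress

# Hypothetical block held by the multihomed network.
block = ipaddress.ip_network("172.16.0.0/16")

# Split the block into two halves, one per upstream link.
first_half, second_half = block.subnets(prefixlen_diff=1)

# Each upstream hears the covering block (for failover) plus one more-specific
# half, so that remote networks prefer a different link for each half.
announcements = {
    "ISP-A": [block, first_half],
    "ISP-B": [block, second_half],
}

# Count the distinct prefixes that now appear in the global table.
distinct_routes = {net for nets in announcements.values() for net in nets}
print(sorted(str(net) for net in distinct_routes))
print(len(distinct_routes))  # three routes where one would have sufficed
```

The extra more-specific prefixes are what contribute to the table growth described above.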

One method to address the routing table issue associated with load balancing is to deploy Locator/Identifier Separation Protocol (BGP/LISP) gateways within an Internet exchange point to allow ingress traffic engineering across multiple links. This technique does not increase the number of routes seen on the global BGP table.

Security

By design, routers running BGP accept advertised routes from other BGP routers by default. This allows for automatic and decentralized routing of traffic across the Internet, but it also leaves the Internet potentially vulnerable to accidental or malicious disruption, known as BGP hijacking. Due to the extent to which BGP is embedded in the core systems of the Internet, and the number of different networks operated by many different organizations which collectively make up the Internet, correcting this vulnerability (such as by introducing the use of cryptographic keys to verify the identity of BGP routers) is a technically and economically challenging problem.[47]

Extensions

Multiprotocol Extensions for BGP (MBGP), sometimes referred to as Multiprotocol BGP or Multicast BGP and defined in RFC 4760, is an extension to BGP that allows different types of addresses (known as address families) to be distributed in parallel. Whereas standard BGP supports only IPv4 unicast addresses, Multiprotocol BGP supports IPv4 and IPv6 addresses and it supports unicast and multicast variants of each. Multiprotocol BGP allows information about the topology of IP multicast-capable routers to be exchanged separately from the topology of normal IPv4 unicast routers. Thus, it allows a multicast routing topology different from the unicast routing topology. Although MBGP enables the exchange of inter-domain multicast routing information, other protocols such as the Protocol Independent Multicast family are needed to build trees and forward multicast traffic. Multiprotocol BGP is also widely deployed in MPLS L3 VPNs to exchange the VPN labels learned for routes from customer sites over the MPLS network, so that a provider edge router can distinguish between customer sites when traffic from other customer sites arrives for routing.

Another extension to BGP is multipath routing. This typically requires identical MED, weight, origin, and AS-path although some implementations provide the ability to relax the AS-path checking to only expect an equal path length rather than the actual AS numbers in the path being expected to match too. This can then be extended further with features like Cisco's dmzlink-bw which enables a ratio of traffic sharing based on bandwidth values configured on individual links.

By default, BGP only supports the advertisement of a single locally selected best path to its neighbors, through its Update messages. RFC 7911 defines the ADD-PATH extension, which allows a BGP speaker to advertise multiple paths for the same destination to peers. One application for this is when using route reflectors (RRs), because then the RR can advertise all known route paths to its clients, instead of sending only a single route it selected based on its local decision process, which is likely not the best path for all of its clients.

Uses

BGP4 is standard for Internet routing and required of most Internet service providers (ISPs) to establish routing between one another. Very large private IP networks use BGP internally. An example use case is the joining of a number of large Open Shortest Path First (OSPF) networks when OSPF by itself does not scale to the size required. Another reason to use BGP is multihoming a network for better redundancy, either to multiple access points to a single ISP or to multiple ISPs.

Implementations

Routers, especially small ones intended for small office/home office (SOHO) use, may not include BGP capability. Other commercial routers may need a specific software executable image that supports BGP, or a license that enables it. Devices marketed as layer-3 switches are less likely to support BGP than devices marketed as routers, but many high-end layer-3 switches can run BGP.

Products marketed as switches may have a size limitation on BGP tables that is far smaller than a full Internet table plus internal routes. These devices may be perfectly reasonable and useful when used for BGP routing of some smaller part of the network, such as a confederation-AS representing one of several smaller enterprises that are linked by a BGP backbone of backbones, or a small enterprise that announces routes to an ISP but only accepts a default route and perhaps a small number of aggregated routes.

A BGP router used only for a network with a single point of entry to the Internet may have a much smaller routing table size (and hence RAM and CPU requirement) than a multihomed network. Even simple multihoming can have modest routing table size. The actual amount of memory required in a BGP router depends on the amount of BGP information exchanged with other BGP speakers and the way in which the particular router stores BGP information. The router may have to keep more than one copy of a route, so it can manage different policies for route advertising and acceptance to a specific neighboring AS. The term view is often used for these different policy relationships on a running router.

If one router implementation takes more memory per route than another implementation, this may be a legitimate design choice, trading processing speed against memory. A full IPv4 BGP table as of August 2015 was in excess of 590,000 prefixes.[37] Large ISPs may add another 50% for internal and customer routes. Again depending on implementation, separate tables may be kept for each view of a different peer AS.

from Grokipedia
The Border Gateway Protocol (BGP), particularly its version 4 (BGP-4), is an interdomain routing protocol that enables autonomous systems (ASes), distinct networks under single administrative control, to exchange routing and reachability information across the Internet. Defined in RFC 4271, BGP-4 supports Classless Inter-Domain Routing (CIDR) by advertising IP prefixes and aggregating routes, while using path attributes like AS_PATH to select routes based on policy preferences and prevent loops. It runs over TCP on port 179, establishing persistent sessions between BGP speakers to maintain a stable topology of global connectivity.

BGP originated in the late 1980s as a successor to the aging Exterior Gateway Protocol (EGP), with its initial specification published as RFC 1105 in June 1989 by designers Yakov Rekhter of IBM and Kirk Lougheed of Cisco Systems. The protocol evolved through versions BGP-2 (RFC 1163, 1990) and BGP-3 (RFC 1267, 1991), before BGP-4 introduced CIDR support in RFC 1654 (1994) and RFC 1771 (1995) and was refined in RFC 4271 (January 2006) to address scaling needs amid Internet growth. Over time, extensions such as route reflectors (RFC 4456) and AS confederations (RFC 5065) have enhanced scalability for internal BGP (iBGP) within large ASes, while the IETF's Secure Inter-Domain Routing (SIDR) working group has standardized security features like the Resource Public Key Infrastructure (RPKI) and BGPsec (RFC 8205, 2017).

In operation, BGP employs a finite-state machine (Idle, Connect, Active, OpenSent, OpenConfirm, Established) to manage peer sessions, exchanging four message types: OPEN to negotiate parameters, UPDATE to advertise or withdraw routes with attributes (e.g., NEXT_HOP, LOCAL_PREF), KEEPALIVE to sustain connections, and NOTIFICATION for errors. This design allows ASes to enforce complex routing policies, such as preferring certain paths for traffic engineering or load balancing, while external BGP (eBGP) handles inter-AS exchanges and iBGP synchronizes routes within an AS.

BGP's deployment since 1989 has made it the backbone of Internet routing, supporting over 78,000 ASes visible in the IPv4 global table and more than 35,000 in IPv6 as of November 2025, with millions of routes enabling worldwide connectivity for diverse networks from small enterprises to major ISPs. Its policy-driven flexibility has proven resilient across heterogeneous environments, from low-bandwidth links to high-speed 10 Gbps+ backbones, but vulnerabilities to prefix hijacking and route leaks persist, prompting recent IETF efforts like the deprecation of insecure AS_SET path segments (RFC 9774, 2025) and ongoing updates to BGP operations and security guidelines.

History

Origins and Early Development

The Border Gateway Protocol (BGP) originated in 1989 as a response to the limitations of the Exterior Gateway Protocol (EGP), which relied on a distance-vector approach and assumed a hierarchical, tree-like topology centered on a single backbone network. Developed by Yakov Rekhter of IBM and Kirk Lougheed of Cisco, the protocol's initial concept emerged during a lunch meeting at the 12th Internet Engineering Task Force (IETF) meeting in January 1989, where the core ideas were sketched on two napkins. This informal design addressed the need for a more flexible inter-autonomous system (AS) routing mechanism capable of supporting arbitrary network topologies and allowing administrators to enforce routing policies based on business or operational preferences rather than mere distance metrics.

BGP version 1 (BGP-1) was formalized shortly thereafter in RFC 1105, published in June 1989, without initially undergoing a full standards process as a proposed standard. The protocol introduced path-vector routing, which propagates full AS paths to prevent loops and enable informed decisions, marking a shift from EGP's restrictive model that struggled with the Internet's evolving, decentralized structure. This innovation facilitated the first true inter-AS routing independent of a centralized backbone, allowing diverse networks to interconnect while preserving administrative autonomy.

Initial operational deployment of BGP occurred in 1989 on the National Science Foundation Network (NSFNET) T1 backbone, where it replaced EGP to exchange routing information between regional networks and the core infrastructure. This rollout addressed EGP's shortcomings, such as its inability to handle non-hierarchical topologies and policy enforcement, amid the Internet's rapid expansion; by late 1991, the number of ASes had grown to approximately 300, underscoring the urgency for a robust replacement protocol. The NSFNET implementation demonstrated BGP's viability in production environments, paving the way for its broader adoption in interdomain routing.

Standardization and Version Evolution

The Border Gateway Protocol (BGP) underwent formal standardization through a series of Request for Comments (RFC) documents published by the Internet Engineering Task Force (IETF), evolving from its initial versions to address growing scale and policy needs. BGP version 2, specified in RFC 1163 (June 1990) alongside its application guidelines in RFC 1164, introduced path attributes as a core mechanism for routing control. These attributes, categorized as well-known mandatory (e.g., AS_PATH for loop prevention), well-known discretionary, optional transitive, and optional non-transitive, enabled routers to enforce interdomain policies by evaluating metrics like origin type and inter-AS costs during route selection. This marked a shift from BGP-1's simpler structure, adding support for incremental updates and hop-by-hop policy decisions to better manage autonomous system (AS) interactions.

BGP version 3, detailed in RFC 1267 (October 1991) with application notes in RFC 1268, built on these foundations by enhancing efficiency in route information exchange. Key additions included the ability to advertise multiple networks in a single UPDATE message, reducing protocol overhead, and optimizations for route aggregation through unreachable network announcements with minimal attributes. It also relaxed restrictions on NEXT_HOP attributes, allowing flexible border router designations across AS boundaries, which laid groundwork for handling larger, more hierarchical topologies and foreshadowed later confederation mechanisms that subdivide AS internals without altering external views. These changes improved scalability for classful addressing environments while maintaining compatibility with BGP-2.

BGP version 4, first published as RFC 1654 (July 1994) and revised in RFC 1771 (March 1995) with companion application RFC 1772, became the foundational standard still in use today; RFC 1771 was obsoleted by RFC 4271 (January 2006), which clarified the specification, while RFC 4272 (January 2006) separately analyzes BGP security vulnerabilities. The primary innovation was support for Classless Inter-Domain Routing (CIDR), allowing advertisement of IP prefixes of arbitrary length rather than fixed classful networks, which dramatically reduced routing table sizes amid Internet growth. Route aggregation was further advanced with AS_SET and AS_SEQUENCE constructs to summarize paths efficiently. Multiprotocol extensions were initially formalized in RFC 2283 (February 1998) to enable BGP to carry routing information for protocols beyond IPv4, such as IPv6 and MPLS VPNs, using address family indicators, with subsequent updates including RFC 4760 (January 2007).

A critical milestone in BGP's evolution addressed the depletion of 16-bit AS numbers (1–65,535), projected to exhaust around 2009–2011 based on allocation trends. The transition began in 2007 with initial extensions in RFC 4893, culminating in RFC 6793 (November 2012), which standardized 32-bit AS support (up to 4,294,967,295) through extended encoding in path attributes, ensuring seamless interoperability during the phased rollout from 2007 to 2012; by 2015, the transition was largely complete.

Post-2012 updates have focused on operational refinements, such as RFC 8203 (July 2017), which enhances BGP session management by allowing administrative shutdown notifications with optional free-text reasons, improving transparency during maintenance without full route withdrawals. More recent developments include RFC 9774 (March 2025), which deprecates the insecure AS_SET path segment to mitigate route aggregation risks. These iterative improvements, including graceful restart capabilities from RFC 4724 (2006) with subsequent enhancements, underscore BGP's adaptability to modern network demands.

Fundamentals

Role in Interdomain Routing

The Border Gateway Protocol (BGP) serves as the primary exterior gateway protocol for the Internet, facilitating the exchange of routing information between autonomous systems (ASes) to enable interdomain connectivity. An autonomous system is defined as a collection of IP networks and routers under the control of one or more network operators that presents a common routing policy to the Internet. BGP operates as a path-vector protocol, which allows network administrators to make policy-based decisions on route selection rather than relying solely on metrics like distance or link state, distinguishing it from interior gateway protocols (IGPs) that use link-state or distance-vector algorithms within a single domain. This policy flexibility is essential for interdomain routing, where diverse administrative entities negotiate traffic flows based on business agreements, security considerations, and performance goals.

A key mechanism in BGP is the AS_PATH attribute, which records the sequence of ASes traversed by a route advertisement, enabling loop prevention by discarding routes that would create cycles: if a receiving router detects its own AS number in the AS_PATH, it excludes the route from further consideration. This attribute supports BGP's scalability, allowing the protocol to handle the global Internet's vast topology with approximately 78,000 ASes as of November 2025, far exceeding the capabilities of traditional distance-vector protocols that struggle with large-scale loop detection. BGP peering occurs in two main forms: external BGP (eBGP) for connections between routers in different ASes, which directly propagates routing updates across domain boundaries, and internal BGP (iBGP) for distributing routes within the same AS, ensuring consistent policy application internally without altering the AS_PATH.

In practice, BGP maintains a global routing table that, as of November 2025, contains approximately 1.04 million IPv4 routes and 236,000 IPv6 routes, reflecting the protocol's ability to scale with the Internet's growth while accommodating policy-driven filtering and aggregation to manage this volume efficiently. This interdomain role underscores BGP's robustness in supporting a decentralized, policy-oriented architecture that has underpinned global Internet routing since its standardization.
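The AS_PATH loop check and the eBGP prepend step described in this section can be sketched in a few lines. The function names are hypothetical, the AS numbers are arbitrary examples, and the AS_PATH is modelled as a plain list with the most recently added AS first.

```python
def advertise_to_ebgp_peer(local_as, as_path):
    """Prepend the local AS number before sending a route to an external peer."""
    return [local_as] + list(as_path)


def advertise_to_ibgp_peer(local_as, as_path):
    """Routes passed to internal peers keep the AS_PATH unchanged."""
    return list(as_path)


def accept_route(local_as, as_path):
    """Reject any route whose AS_PATH already contains the local AS number."""
    return local_as not in as_path


# A route originated by AS 64500 and passed on by AS 64501.
path = advertise_to_ebgp_peer(64501, advertise_to_ebgp_peer(64500, []))
print(path)
print(accept_route(64502, path))  # a third AS may use the route
print(accept_route(64500, path))  # the originator sees itself and discards it
```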

Comparison to Interior Protocols

The Border Gateway Protocol (BGP) operates as a path-vector protocol, distinguishing it from interior gateway protocols (IGPs) by prioritizing policy-based decisions over shortest-path optimization. While IGPs such as Open Shortest Path First (OSPF), a link-state protocol, and the Routing Information Protocol (RIP), a distance-vector protocol, focus on computing the lowest-cost routes within a single autonomous system (AS) using metrics like bandwidth or hop count, BGP evaluates paths based on attributes that reflect administrative policies, such as local preferences and AS path lengths. This design enables BGP to enforce interdomain routing policies that align with business agreements, rather than solely minimizing latency or distance.

BGP's scalability for the global Internet topology stems from its use of AS aggregation and avoidance of full topology flooding, allowing routers to exchange summarized information without disseminating every link detail across domains. In contrast, IGPs like OSPF flood link-state advertisements (LSAs) throughout the AS to build a complete topology map, enabling rapid shortest-path calculations via algorithms like Dijkstra's, while RIP periodically broadcasts entire routing tables. This flooding mechanism suits intradomain environments but becomes unstable and resource-intensive at scale, potentially leading to loops or excessive bandwidth consumption if applied interdomain. BGP's path-vector approach, by appending AS numbers to routes, prevents loops and supports aggregation to manage the vast number of prefixes (over 1 million IPv4 routes as of November 2025) without overwhelming the routing system.

A core operational difference lies in transport and update mechanisms: BGP relies on TCP port 179 for reliable, connection-oriented sessions, ensuring ordered delivery and retransmission of incremental updates triggered only by changes, which promotes stability in policy-driven environments. IGPs, however, typically use UDP or direct IP encapsulation with multicast or broadcast for faster propagation within an AS, as seen in OSPF's LSA flooding or RIP's periodic updates, prioritizing speed over absolute reliability in controlled internal topologies. This TCP foundation in BGP supports multihop peering across distant ASes, whereas IGP adjacencies are limited to local links.

Convergence times further highlight their suited scopes: BGP may take minutes to stabilize after failures due to deliberate timers and validations that prevent oscillations across the Internet, whereas IGPs like OSPF converge in seconds through immediate recalculations. In hybrid deployments common in large ASes, internal BGP (iBGP) complements IGPs by carrying external routes learned from external BGP (eBGP) peers, offloading interdomain traffic decisions from the IGP to avoid prefix overload and maintain internal efficiency.

Core Operation

Session Establishment and Maintenance

Border Gateway Protocol (BGP) establishes sessions between peers over TCP connections using port 179 as the destination port for reliable transport. In external BGP (eBGP), sessions typically connect routers in different autonomous systems (ASes) and require direct IP adjacency by default, though multi-hop configurations allow connections across multiple IP hops while preserving the next-hop attribute. Internal BGP (iBGP) sessions, in contrast, occur between routers within the same AS and do not enforce adjacency, often spanning multiple hops within the internal network topology.

Once the TCP three-way handshake completes, peers exchange OPEN messages to negotiate session parameters and establish the BGP session. The OPEN message includes the sender's AS number as a 2-octet unsigned integer and proposes a Hold Time, the maximum interval that may elapse without hearing from the peer before it is declared dead; 180 seconds is a common implementation default, while RFC 4271 suggests 90 seconds. Each peer uses the smaller of its own proposed Hold Time and the value received in the other side's OPEN message. A negotiated Hold Time of zero disables periodic keepalives; any non-zero Hold Time must be at least 3 seconds. Session maintenance relies on periodic KEEPALIVE messages, transmitted at intervals no greater than one-third of the negotiated Hold Time (every 60 seconds for a 180-second Hold Time) to prevent timeouts.

BGP operates via a finite-state machine (FSM) with six states: Idle (initial state, awaiting manual or automatic start), Connect (TCP connection initiation), Active (TCP retry after failure), OpenSent (OPEN message sent, awaiting response), OpenConfirm (parameters accepted, awaiting KEEPALIVE or NOTIFICATION), and Established (session active for route exchange). Transitions between states handle events like connection establishment, timer expirations, or message receipts, ensuring robust session lifecycle management.
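The Hold Time negotiation and keepalive spacing described above come down to a few lines of arithmetic. The sketch below uses hypothetical function names and treats the one-third rule as the keepalive interval itself, which is a simplifying assumption; implementations are free to send keepalives more often.

```python
def negotiate_hold_time(local_proposal, remote_proposal):
    """Return the Hold Time both peers use: the smaller of the two proposals."""
    for value in (local_proposal, remote_proposal):
        # A Hold Time is either zero (keepalives disabled) or at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError("Hold Time must be 0 or at least 3 seconds")
    return min(local_proposal, remote_proposal)


def keepalive_interval(hold_time):
    """Return the keepalive spacing in seconds, or None when keepalives are off."""
    if hold_time == 0:
        return None
    return hold_time // 3


hold_time = negotiate_hold_time(180, 90)
print(hold_time)                      # the smaller proposal wins
print(keepalive_interval(hold_time))  # one third of the negotiated Hold Time
print(keepalive_interval(180))        # the 180-second case mentioned above
```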
Extensions and optional features are negotiated during session establishment through the Capabilities Optional Parameter (Type 2) in the OPEN message, originally defined in RFC 2842 and now specified in RFC 5492, allowing peers to advertise supported capabilities without disrupting compatibility. For instance, multiprotocol BGP extensions (RFC 4760) are advertised via this mechanism using Capability Code 1, enabling support for address families beyond IPv4 unicast. More recent extensions, such as those for advertising Segment Routing (SR) policies in RFC 9830, introduce a new Subsequent Address Family Identifier (SAFI 73) advertised in OPEN capabilities, allowing BGP to distribute SR Policy Candidate Paths with attributes like color and endpoint for advanced traffic engineering.

Errors during establishment or maintenance, such as mismatched AS numbers, unsupported capabilities, or Hold Timer expirations, trigger a NOTIFICATION message, which closes the TCP connection and resets the FSM to Idle, terminating the session. This error-handling ensures session integrity while permitting rapid recovery attempts.
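To make the OPEN message and the Capabilities Optional Parameter more concrete, the following sketch packs an OPEN message that advertises the multiprotocol capability for IPv6 unicast. The field layout follows the RFC 4271 and RFC 4760 wire formats as understood here; the helper names, AS number, and router ID are illustrative assumptions, and the code does not open a socket or speak to a real peer.

```python
import ipaddress
import struct

BGP_MARKER = b"\xff" * 16   # 16-octet marker, all ones
MSG_TYPE_OPEN = 1
PARAM_CAPABILITIES = 2      # optional parameter type for capabilities
CAP_MULTIPROTOCOL = 1       # capability code for multiprotocol extensions
AFI_IPV6, SAFI_UNICAST = 2, 1


def multiprotocol_capability(afi, safi):
    """Encode one multiprotocol capability wrapped in a Capabilities parameter."""
    value = struct.pack("!HBB", afi, 0, safi)  # AFI, reserved octet, SAFI
    capability = struct.pack("!BB", CAP_MULTIPROTOCOL, len(value)) + value
    return struct.pack("!BB", PARAM_CAPABILITIES, len(capability)) + capability


def build_open(my_as, hold_time, bgp_id, optional_params=b""):
    """Encode a BGP-4 OPEN message: common header plus OPEN body."""
    body = struct.pack(
        "!BHH4sB",
        4,                                      # protocol version
        my_as,                                  # 2-octet AS number
        hold_time,                              # proposed Hold Time in seconds
        ipaddress.IPv4Address(bgp_id).packed,   # BGP identifier
        len(optional_params),
    ) + optional_params
    total_length = len(BGP_MARKER) + 3 + len(body)  # header is 19 octets
    return BGP_MARKER + struct.pack("!HB", total_length, MSG_TYPE_OPEN) + body


params = multiprotocol_capability(AFI_IPV6, SAFI_UNICAST)
message = build_open(64500, 180, "192.0.2.1", params)
print(len(message), message.hex())
```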

Route Exchange and Selection Process

BGP routers exchange routing updates through UPDATE messages, which serve to advertise feasible routes or withdraw unfeasible ones. An UPDATE message includes a variable-length list of withdrawn routes (IP prefixes to remove from the neighbor's routing table), followed by path attributes and Network Layer Reachability Information (NLRI) for newly advertised prefixes. These attributes apply to all NLRIs in the message, allowing efficient grouping of multiple destinations under common path properties. This incremental update mechanism avoids retransmitting the full routing table, reducing bandwidth consumption and processing overhead during changes.

After receiving and validating UPDATE messages, a BGP speaker computes the best path for each IP prefix from the set of available paths in its Adj-RIBs-In (the routing information bases holding the routes learned from each neighbor). The decision process follows a deterministic sequence of criteria to ensure consistent selection across implementations, though the exact ordering of some steps may vary by vendor. The algorithm first discards any paths containing the speaker's own AS number in the AS_PATH to prevent loops. Among valid paths, it prefers the highest LOCAL_PREF value (a policy-driven preference for outbound traffic). If values tie, it selects the shortest AS_PATH length (fewest AS numbers). Next, it chooses the lowest ORIGIN code (IGP < EGP < INCOMPLETE). For paths from the same neighboring AS, it prefers the lowest MULTI_EXIT_DISC (MED) value, which lets the neighboring AS influence where traffic enters it. It then favors eBGP-learned paths over iBGP-learned ones. Subsequent tie-breakers include the lowest IGP metric to the NEXT_HOP, the oldest route (for eBGP paths), the lowest originating router ID, the shortest Cluster List (for iBGP with route reflectors), and finally the lowest neighbor IP address. The selected best path is installed in the Loc-RIB (local routing information base) and propagated via further UPDATE messages to peers, subject to outbound policy filters.
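The first steps of the decision process can be expressed as a sort key. The sketch below is a simplified illustration, not any vendor's algorithm: the `Path` class and its default values are assumptions, MED is compared across all paths rather than only those from the same neighboring AS, and the later tie-breakers (route age, Cluster List, neighbor address) are omitted.

```python
from dataclasses import dataclass
from typing import List

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}


@dataclass
class Path:
    as_path: List[int]
    local_pref: int = 100   # assumed default; real defaults are implementation-specific
    origin: str = "IGP"
    med: int = 0
    from_ebgp: bool = True
    igp_metric: int = 0
    router_id: int = 0


def preference_key(path):
    """Smaller tuples are better; fields are compared in decision-process order."""
    return (
        -path.local_pref,            # highest LOCAL_PREF first
        len(path.as_path),           # then shortest AS_PATH
        ORIGIN_RANK[path.origin],    # then lowest ORIGIN code
        path.med,                    # then lowest MED (simplified: always compared)
        0 if path.from_ebgp else 1,  # then eBGP over iBGP
        path.igp_metric,             # then lowest IGP metric to NEXT_HOP
        path.router_id,              # then lowest router ID
    )


def best_path(local_as, candidates):
    """Drop looped paths, then pick the most preferred remaining path."""
    loop_free = [p for p in candidates if local_as not in p.as_path]
    return min(loop_free, key=preference_key, default=None)


short = Path(as_path=[64501, 64510])
preferred = Path(as_path=[64502, 64503, 64510], local_pref=200)
looped = Path(as_path=[64504, 64500, 64510], local_pref=300)

chosen = best_path(64500, [short, preferred, looped])
print(chosen)  # LOCAL_PREF outranks AS_PATH length; the looped path is ignored
```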
BGP prevents routing loops through the mandatory AS_PATH attribute: a speaker prepends its own AS number to the path when advertising to external peers and leaves the path unmodified when advertising to internal peers. A receiving speaker rejects any route whose AS_PATH already contains its own AS number, so a route cannot re-enter an AS it has already traversed. AS_PATH prepending builds on the same attribute by letting an AS insert multiple copies of its own number, artificially lengthening the path so that remote ASes are less likely to select it, without altering connectivity.
To suppress the propagation of unstable routes that flap (repeatedly withdraw and readvertise), BGP employs route flap damping, which tracks a penalty score for each prefix based on update frequency. Penalties accumulate on flaps and decay exponentially with a configurable half-life (typically 15 minutes); routes whose penalty exceeds a suppress threshold (e.g., 2000) are suppressed until the penalty decays below a reuse threshold (e.g., 750). While intended to reduce CPU load from churn, empirical studies showed that damping often delays convergence for otherwise stable routes and prolongs outages, and many operators now disable it entirely. As a more recent mechanism for preserving stability without broad suppression, Long-Lived Graceful Restart (LLGR) enables BGP speakers to retain routes from a failed session, marked as stale, for a negotiated Long-Lived Stale Time (LLST), preserving forwarding while new paths converge.
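The exponential decay behind flap damping can be illustrated with a small model. This is a sketch only: it uses the 15-minute half-life and the 2000 and 750 figures quoted above, while the per-flap penalty of 1000 and the class and method names are assumptions made for the example.

```python
class DampingState:
    """Per-prefix penalty tracker (illustrative, not a vendor implementation)."""

    def __init__(self, half_life_s=900.0, suppress=2000.0, reuse=750.0,
                 flap_penalty=1000.0):
        self.half_life_s = half_life_s
        self.suppress = suppress        # penalty above this suppresses the route
        self.reuse = reuse              # penalty below this releases the route
        self.flap_penalty = flap_penalty  # assumed increment per flap
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves once per half-life.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        self._decay(now)
        self.penalty += self.flap_penalty
        if self.penalty > self.suppress:
            self.suppressed = True

    def is_suppressed(self, now):
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False
        return self.suppressed
```

With these figures, three flaps within a few seconds push the penalty to roughly 3000, and the route stays suppressed for about two half-lives (around 30 minutes) before the penalty falls below 750, which shows why damping can keep a recovered route unreachable for a long time.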

Path Attributes and Policies

Standard and Well-Known Attributes

In BGP, path attributes provide metadata associated with advertised routes, enabling routers to apply policies and select paths without modifying the underlying network topology. These attributes are categorized as well-known or optional, with well-known attributes being recognized by all BGP implementations. Well-known attributes further divide into mandatory (must be included in every UPDATE message containing reachable NLRI) and discretionary (may be omitted but must be recognized if present). Attributes also differ in how far they propagate: some are passed on across AS boundaries, while others (such as LOCAL_PREF) stay within the local AS. All of them can influence route selection during the best-path algorithm.
The well-known mandatory attributes form the core of BGP's path information and are always present in UPDATE messages that advertise routes. The ORIGIN attribute specifies how the routing information was injected into BGP, with possible values of IGP (learned via an interior gateway protocol), EGP (learned via the Exterior Gateway Protocol), or INCOMPLETE (learned by other means, such as redistribution or static configuration). It is set by the speaker that originates the route and should not be altered by intermediate BGP speakers. The AS_PATH attribute records the sequence of Autonomous Systems (ASes) that a route has traversed, with each AS prepending its own number when advertising to external peers. It is essential for loop prevention: a BGP speaker discards any route containing its own AS number in the path. Additionally, the length of the AS_PATH serves as a key metric for inter-AS path selection, with shorter paths preferred once LOCAL_PREF values tie.
To support 32-bit AS numbers (extending the AS space from 65,536 to over 4 billion), RFC 4893 (since superseded by RFC 6793) introduces encoding mechanisms in AS_PATH, including the use of a special AS_TRANS value (23456) for non-mappable 32-bit ASNs when interoperating with legacy 16-bit implementations, alongside the new optional attribute AS4_PATH for full 32-bit propagation. The NEXT_HOP attribute identifies the IP address of the next router to which packets for the advertised destinations should be forwarded. A speaker typically sets it to its own address when advertising to eBGP peers, unless overridden by configuration, and leaves it unchanged when advertising to iBGP peers. Routers resolve the NEXT_HOP recursively through the IGP to find the actual outgoing interface, so a route whose NEXT_HOP is unreachable cannot be used.
Well-known discretionary attributes are recognized by all BGP speakers but are not required in every UPDATE message. The LOCAL_PREF attribute conveys a 32-bit preference value (0 to 4,294,967,295) for route selection within an AS, allowing network operators to influence outbound traffic paths by assigning higher values to preferred routes. It is advertised only to iBGP peers and not to external peers (except within confederations), thereby keeping internal policy preferences private. In the route selection process, LOCAL_PREF is compared first to determine the best exit point from the AS. The ATOMIC_AGGREGATE attribute, which has a fixed length of zero, signals that a route is an aggregate for which some path information from the more specific routes may have been lost. It must be preserved across AS boundaries, and recipients must not de-aggregate the route into more specific prefixes. This ensures that aggregated advertisements are treated as indivisible units.
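The AS_TRANS substitution can be sketched in a few lines. This is an illustrative simplification that treats an AS path as a flat list of numbers; the function name `to_legacy` is an assumption made for the example, and real implementations operate on AS_PATH segments.

```python
AS_TRANS = 23456  # reserved 2-octet placeholder for non-mappable 4-octet ASNs

def to_legacy(as_path):
    """Return (AS_PATH for a 2-octet-only peer, AS4_PATH or None).

    AS4_PATH is only produced when at least one ASN had to be replaced.
    """
    legacy = [asn if asn <= 0xFFFF else AS_TRANS for asn in as_path]
    as4_path = list(as_path) if legacy != list(as_path) else None
    return legacy, as4_path
```

For example, `to_legacy([64500, 4200000000, 65001])` replaces the middle entry with 23456 and carries the full path separately, so a 32-bit-capable speaker further along can reconstruct it.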

Optional Attributes: Communities and MED

The Border Gateway Protocol (BGP) employs optional attributes to enable fine-grained policy control, allowing autonomous systems (ASes) to implement sophisticated routing decisions without mandating universal adoption. Among these, the Communities attribute provides a mechanism for tagging routes with 32-bit identifiers, facilitating the grouping of destinations that share common properties as defined by AS administrators. This optional transitive attribute, with Type Code 8, consists of variable-length sequences of four-octet values, where the first two octets typically represent the originating AS number and the last two are administrator-defined, enabling policies such as no-transit rules or adjustments to local preference (LOCAL_PREF). For instance, well-known community values like NO_EXPORT (0xFFFFFF01) instruct BGP speakers not to advertise tagged routes outside a confederation boundary, while NO_ADVERTISE (0xFFFFFF02) prevents advertisement to any peers, and these can be matched using regular expressions in router configurations to enforce propagation controls. To address limitations in the 32-bit scope of basic Communities, the Extended Communities attribute introduces a more structured type-length-value (TLV) format, expanding applicability to scenarios like virtual private networks (VPNs). Defined as an optional transitive attribute with Type Code 16 and an 8-octet length, it features a 1- or 2-octet Type field (indicating transitivity and subtype) followed by a Value field that supports global administrator subfields like AS numbers or IPv4 addresses. This design enables larger-scale tagging and policy enforcement across AS boundaries, particularly in MPLS-based VPNs, where subtypes such as Route Target (e.g., Type 0x0002 or 0x0102) identify which routers should import or export specific routes, thereby segmenting traffic flows. 
Another key optional attribute is the Multi-Exit Discriminator (MED), which assists in optimizing traffic exit points at AS boundaries by conveying relative preferences for multiple inter-AS links. As a non-transitive optional attribute with Type Code 4, MED is a four-octet unsigned integer that neighboring ASes use to select the preferred entry point, with the lowest value indicating the most desirable path when other factors are equal. Unlike transitive attributes, MED is not propagated beyond the immediate neighboring AS, allowing the advertising AS to control inbound traffic without influencing further propagation. In route selection, MED influences decisions among paths from the same AS by prioritizing lower metrics, though implementations may alter or omit it based on local policy.
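The four-octet community layout described in this section can be sketched as follows. The helper names are assumptions made for the example; the attribute flags value 0xC0 (optional, transitive) and Type Code 8 follow the description above, and the one-octet length form limits this sketch to short lists.

```python
import struct

NO_EXPORT = 0xFFFFFF01
NO_ADVERTISE = 0xFFFFFF02

def encode_community(asn, value):
    """Pack an ASN:value pair into one 32-bit community."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (asn << 16) | value

def decode_community(community):
    """Split a 32-bit community back into (asn, value)."""
    return community >> 16, community & 0xFFFF

def communities_attribute(communities):
    """Serialize a COMMUNITIES path attribute (flags, type, length, value)."""
    value = b"".join(struct.pack("!I", c) for c in communities)
    if len(value) > 255:
        raise ValueError("extended-length encoding is not handled in this sketch")
    return struct.pack("!BBB", 0xC0, 8, len(value)) + value
```

A community written as `64500:100` in router configuration therefore travels on the wire as the single value 0xFBF40064.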

Message Formats

Common Header Structure

All BGP messages share a fixed-size header of 19 octets, which precedes any message-specific data and enables peers to identify, validate, and process incoming transmissions reliably. This header consists of three fields: a 16-octet Marker, a 2-octet Length, and a 1-octet Type. The structure lets BGP, operating as an application-layer protocol over TCP on port 179, delimit messages within the TCP byte stream and detect loss of synchronization between peers. The Marker field, occupying the first 16 octets, is set to all ones (0xFFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF). Earlier versions of the specification also allowed the Marker to carry authentication data, but that role has been superseded by more robust options like TCP MD5 signatures. (https://datatracker.ietf.org/doc/html/rfc2385) Following the Marker, the Length field specifies the total size of the entire BGP message in octets, ranging from a minimum of 19 (for header-only messages) to a maximum of 4096. This includes the Marker, Length, Type, and any data portion, allowing the receiver to determine when a complete message has arrived before processing. The Type field, the final octet of the header, identifies the message's purpose using values such as 1 for OPEN, 2 for UPDATE, 3 for NOTIFICATION, and 4 for KEEPALIVE. Upon receiving a message, a BGP speaker first inspects the header for validity: if the Marker is not all ones, the Length is outside the allowed range, or the Type is unrecognized, the speaker sends a NOTIFICATION message with a Message Header Error code and the matching subcode (Connection Not Synchronized, Bad Message Length, or Bad Message Type) and closes the TCP connection. (https://datatracker.ietf.org/doc/html/rfc4271#section-8.2.2) This error-handling mechanism promotes stability by rejecting malformed traffic early in the session.
Field | Size (octets) | Description
Marker | 16 | Fixed pattern (all ones) for synchronization and authentication.
Length | 2 | Total message length (19–4096 octets), including header and data.
Type | 1 | Message type code (1=OPEN, 2=UPDATE, 3=NOTIFICATION, 4=KEEPALIVE).
This uniform header format underpins BGP's reliability across diverse network environments, where messages are only processed after full receipt over the reliable TCP transport.
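The header checks described in this section can be sketched with Python's `struct` module. The exception class and function name are assumptions made for the example; the subcode numbers are those of the Message Header Error in RFC 4271.

```python
import struct

HEADER_LEN = 19
MAX_LEN = 4096
MARKER = b"\xff" * 16
TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}

class HeaderError(Exception):
    """Maps to a NOTIFICATION with Error Code 1 (Message Header Error)."""
    def __init__(self, subcode, name):
        super().__init__(name)
        self.subcode = subcode

def parse_header(data):
    """Validate the 19-octet header and return (length, type name)."""
    if len(data) < HEADER_LEN:
        raise ValueError("need at least 19 octets to parse a header")
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if marker != MARKER:
        raise HeaderError(1, "Connection Not Synchronized")
    if not HEADER_LEN <= length <= MAX_LEN:
        raise HeaderError(2, "Bad Message Length")
    if msg_type not in TYPES:
        raise HeaderError(3, "Bad Message Type")
    return length, TYPES[msg_type]
```

A receiver would call this on the first 19 octets read from the TCP stream and then read `length - 19` further octets before handing the message to a type-specific parser.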

OPEN and KEEPALIVE Messages

The OPEN message initiates a BGP peering session between two BGP speakers, establishing parameters for communication and advertising capabilities. It follows the common 19-octet BGP message header and contains a fixed-length body of 10 octets plus variable-length optional parameters. The message structure is defined as follows:
Field | Size (octets) | Description
Version | 1 | Specifies the BGP protocol version supported by the sender; the current value is 4 for BGP-4.
My Autonomous System | 2 | Contains the sender's Autonomous System number as a 2-octet unsigned integer; for support of 4-octet AS numbers, this field may use the transitional value 23456 (AS_TRANS) if no unique 2-octet AS is available, with the full 4-octet AS advertised separately via capabilities.
Hold Time | 2 | Proposes the maximum time interval (in seconds) between KEEPALIVE and/or UPDATE messages before the sender considers the peer dead; a value of 0 disables the Hold Timer, while non-zero values must be at least 3 seconds.
BGP Identifier | 4 | A 4-octet unsigned integer representing a unique identifier for the BGP speaker, typically set to one of its IPv4 addresses at startup and remaining constant across sessions.
Optional Parameters Length | 1 | Total length of the Optional Parameters field in octets; 0 indicates that no optional parameters are present.
Optional Parameters | Variable | A sequence of <Parameter Type, Parameter Length, Parameter Value> triplets advertising optional features, such as capabilities (e.g., multiprotocol extensions or 4-octet AS support via Capability Code 65).
Upon receiving an OPEN message, the recipient validates the fields and negotiates parameters, such as selecting the smaller of the locally configured Hold Time and the proposed Hold Time; an invalid value, namely a Hold Time of 1 or 2 seconds, triggers a NOTIFICATION message with an OPEN Message Error (subcode Unacceptable Hold Time).
The KEEPALIVE message maintains an established BGP session by periodically confirming the viability of the peer connection. It consists solely of the 19-octet common header with no additional data payload, serving as a lightweight heartbeat. BGP speakers transmit KEEPALIVE messages at intervals no greater than one-third of the negotiated Hold Time, and each KEEPALIVE or UPDATE received resets the Hold Timer to prevent session termination due to inactivity. If the negotiated Hold Time is 0, the Hold Timer is disabled and periodic KEEPALIVE messages are not sent. If no KEEPALIVE or UPDATE messages are received within the Hold Time, the session is considered dead, prompting a connection closure.
Optional parameters in the OPEN message often include capability advertisements, which inform the peer of supported extensions. A speaker ignores capabilities it does not recognize; if it cannot operate without a capability that the peer did not advertise, it may send a NOTIFICATION message with Error Code 2 (OPEN Message Error) and Subcode 7 (Unsupported Capability) and terminate the session. This mechanism ensures backward compatibility while enabling advanced features like 4-octet AS numbers, where the capability (Code 65) carries the full AS value, overriding the 2-octet field if both peers support it.
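The OPEN layout and Hold Time rules above can be sketched as follows. The function names are assumptions made for the example, and the message carries only one capability (4-octet AS, Code 65) to keep the sketch short.

```python
import struct
from ipaddress import IPv4Address

AS_TRANS = 23456
MARKER = b"\xff" * 16

def build_open(asn, hold_time, bgp_id):
    """Build an OPEN message advertising the 4-octet AS capability."""
    if hold_time in (1, 2):
        raise ValueError("Hold Time must be 0 or at least 3 seconds")
    my_as = asn if asn <= 0xFFFF else AS_TRANS
    capability = struct.pack("!BBI", 65, 4, asn)          # code, length, 4-octet ASN
    opt_params = struct.pack("!BB", 2, len(capability)) + capability  # type 2 = Capabilities
    body = struct.pack("!BHH4sB", 4, my_as, hold_time,
                       IPv4Address(bgp_id).packed, len(opt_params)) + opt_params
    return MARKER + struct.pack("!HB", 19 + len(body), 1) + body

def negotiate_hold_time(local, remote):
    """Use the smaller of the two proposals; 1 and 2 are not acceptable."""
    if local in (1, 2) or remote in (1, 2):
        raise ValueError("Unacceptable Hold Time")
    return min(local, remote)

def keepalive_interval(hold_time):
    """Upper bound on the KEEPALIVE interval, or None when timers are disabled."""
    return None if hold_time == 0 else hold_time / 3
```

With a local Hold Time of 90 seconds and a peer proposing 30, the session uses 30 seconds and KEEPALIVE messages are sent at least every 10 seconds.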

UPDATE and NOTIFICATION Messages

The BGP UPDATE message serves as the primary mechanism for exchanging routing information between peers, enabling the advertisement of feasible routes and the withdrawal of unfeasible ones. It begins with a 2-octet Withdrawn Routes Length field (called Unfeasible Routes Length in older versions of the specification), which specifies the total length in octets of the subsequent Withdrawn Routes field; this value is set to zero if no routes are being withdrawn. The Withdrawn Routes field itself is variable-length and contains a sequence of IP address prefixes, each encoded as a 1-octet length (indicating prefix length in bits) followed by the minimum number of octets needed to hold the prefix, representing routes that are no longer reachable. Following the withdrawn routes, the UPDATE message includes a 2-octet Total Path Attribute Length field, indicating the length of the Path Attributes field in octets, which is zero if no new attributes or reachable routes are advertised. The Path Attributes field is variable-length and consists of one or more path attributes, each structured as a 2-octet Attribute Type (a flags octet followed by a Type Code octet), a Length (1 or 2 octets, depending on the Extended Length flag), and a Value (variable); these attributes, such as ORIGIN, AS_PATH, and NEXT_HOP, provide policy information and path details for the advertised routes. The message concludes with the Network Layer Reachability Information (NLRI) field, a variable-length sequence of IP address prefixes (encoded in the same way as withdrawn routes) that identify the destinations to which the preceding path attributes apply. UPDATE messages support route aggregation to reduce the volume of information exchanged; for instance, multiple prefixes can share the same path attributes within a single message, and techniques like AS_SET in AS_PATH or the ATOMIC_AGGREGATE attribute allow summarization of routes from multiple autonomous systems. The maximum size of an UPDATE message is 4096 octets, including the header. If a single route's encoding would exceed this limit, the route is not advertised.
The BGP NOTIFICATION message is used to report errors and terminate BGP sessions, ensuring peers can detect and respond to protocol violations or administrative actions. It has a minimum length of 21 octets and includes a 1-octet Error Code field that categorizes the issue, such as 1 for Message Header Error, 3 for UPDATE Message Error, or 6 for Cease. This is followed by a 1-octet Error Subcode field providing more specific details within the error code; for example, under Error Code 6 (Cease), Subcode 2 denotes Administrative Shutdown and Subcode 4 denotes Administrative Reset, both of which signal an intentional closure of the session rather than a protocol fault. The NOTIFICATION message ends with a variable-length Data field, which may contain diagnostic information relevant to the error, such as the portion of a malformed message or an erroneous attribute. Specific error types include Malformed AS_PATH under UPDATE Message Error (Error Code 3, Subcode 11), raised when the AS_PATH attribute is syntactically incorrect. Upon sending or receiving a NOTIFICATION, the BGP session is immediately terminated, and the transport connection is closed.
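The length-plus-prefix encoding used by both the Withdrawn Routes and NLRI fields can be sketched as follows. The function names are assumptions made for the example, which handles IPv4 only and builds a withdraw-only UPDATE to show how the length fields fit together.

```python
import struct
from ipaddress import IPv4Network

MARKER = b"\xff" * 16

def encode_prefix(prefix):
    """One length octet (in bits) followed by only the octets that are needed."""
    net = IPv4Network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]

def decode_prefixes(data):
    """Decode a Withdrawn Routes or NLRI field into a list of prefix strings."""
    prefixes, i = [], 0
    while i < len(data):
        plen = data[i]
        if plen > 32:
            raise ValueError("invalid IPv4 prefix length")
        nbytes = (plen + 7) // 8
        raw = data[i + 1:i + 1 + nbytes]
        if len(raw) < nbytes:
            raise ValueError("truncated prefix")
        address = int.from_bytes(raw + b"\x00" * (4 - nbytes), "big")
        # strict=False masks any trailing bits beyond the prefix length.
        prefixes.append(str(IPv4Network((address, plen), strict=False)))
        i += 1 + nbytes
    return prefixes

def build_withdraw(prefixes):
    """UPDATE that withdraws prefixes and carries no attributes or NLRI."""
    withdrawn = b"".join(encode_prefix(p) for p in prefixes)
    body = struct.pack("!H", len(withdrawn)) + withdrawn + struct.pack("!H", 0)
    return MARKER + struct.pack("!HB", 19 + len(body), 2) + body
```

A /24 therefore costs four octets on the wire and a /8 only two, which is part of why many prefixes sharing the same attributes fit into one message.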

Route-Refresh and Other Optional Messages

The Route Refresh capability in BGP-4 enables BGP speakers to dynamically request the re-advertisement of routing information without tearing down the BGP session, facilitating efficient policy changes and route validation. Defined in RFC 2918, this optional capability is advertised during session establishment via the BGP Capabilities Advertisement mechanism in the OPEN message, using capability code 2 with a length of 0. Upon receiving the capability advertisement, a BGP speaker can send a Route Refresh message (message type 5) to its peer, specifying an Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI) to request the re-sending of the peer's Adj-RIB-Out for that address family. The message format includes a 16-bit AFI, an 8-bit reserved field (set to 0), and an 8-bit SAFI, allowing targeted refreshes for specific address families without affecting the entire routing table. This mechanism avoids the need for soft reconfiguration, which requires storing unmodified routes and consumes significant memory and CPU resources, by instead triggering the peer to apply its outbound policy and re-advertise only the current valid routes. For instance, if a BGP speaker changes its inbound policy, it can request a route refresh from its peers to receive the routes again and re-evaluate them, ensuring consistency without session resets. The capability works with multiprotocol extensions, enabling refreshes for diverse address families such as IPv4 unicast (AFI 1, SAFI 1) or VPNv4 (AFI 1, SAFI 128). RFC 7313 enhances the original Route Refresh capability by introducing subtypes to demarcate the start and end of a refresh cycle, improving support for non-disruptive validation and correction of inconsistencies like missing withdrawals. This enhanced capability uses code 70 in the OPEN message and redefines the reserved octet in the Route Refresh message for subtypes: 0 for normal refresh (as in RFC 2918), 1 for Beginning of Route Refresh (BoRR), and 2 for End of Route Refresh (EoRR).
Upon receiving a BoRR, the receiving speaker marks existing routes as stale, processes incoming UPDATE messages during the refresh to replace or withdraw them, and purges remaining stale routes after EoRR, thus enabling precise synchronization without a session reset. This extension is particularly useful for detecting and resolving discrepancies in large-scale deployments, such as validating the absence of withdrawn routes between peers. Beyond route refresh, BGP includes other optional messages for mid-session adjustments, such as the Dynamic Capability message introduced in draft-ietf-idr-dynamic-cap, which allows peers to enable, disable, or update capabilities without resetting the session. This message (type 6) carries capability codes similar to those in OPEN, enabling dynamic negotiation of features like route refresh itself or multiprotocol extensions during an active session. Recent extensions, such as those in RFC 9832 for BGP Classful Transport Planes, leverage the Route Refresh capability with new AFI/SAFI combinations (e.g., AFI 1/SAFI 76 for IPv4 classful transport) to request re-advertisements of transport routes annotated with transport classes, supporting intent-driven networking without disrupting established sessions. These optional messages enhance BGP's flexibility, allowing incremental upgrades and policy refinements in operational environments.
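The Route Refresh message layout and the stale-marking cycle of the enhanced variant can be sketched as follows. The class and method names are assumptions made for the example; only the field layout (16-bit AFI, subtype octet, 8-bit SAFI) and message type 5 come from the description above.

```python
import struct

MARKER = b"\xff" * 16

def build_route_refresh(afi, safi, subtype=0):
    """Message type 5; subtype 0 = normal refresh, 1 = begin, 2 = end."""
    body = struct.pack("!HBB", afi, subtype, safi)
    return MARKER + struct.pack("!HB", 19 + len(body), 5) + body

class AdjRibIn:
    """Illustrative receiver-side table for one peer and address family."""

    def __init__(self):
        self.routes = {}  # prefix -> (attributes, stale flag)

    def on_begin_refresh(self):
        # Mark everything stale; fresh UPDATEs will clear the flag.
        self.routes = {p: (attrs, True) for p, (attrs, _) in self.routes.items()}

    def on_update(self, prefix, attrs):
        self.routes[prefix] = (attrs, False)

    def on_withdraw(self, prefix):
        self.routes.pop(prefix, None)

    def on_end_refresh(self):
        # Anything still stale was not re-advertised and is removed.
        purged = sorted(p for p, (_, stale) in self.routes.items() if stale)
        for prefix in purged:
            del self.routes[prefix]
        return purged
```

A route that the peer silently dropped without sending a withdrawal is still marked stale when the end marker arrives, so it is purged at that point.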

Scalability Techniques

Route Reflectors and Clusters

In internal BGP (iBGP), the requirement for a full mesh of sessions among all speakers—scaling as O(n²) where n is the number of speakers—poses significant operational challenges in large autonomous systems (ASes). Route reflection, defined in RFC 4456, introduces a designated router called a route reflector (RR) that relaxes this constraint by allowing the RR to reflect iBGP-learned routes to its peers, thereby eliminating the need for a complete mesh. Specifically, the RR advertises routes learned from its clients (a subset of iBGP peers configured to peer exclusively with the RR) to both other clients and non-client peers, while routes from non-clients are reflected only to clients; non-clients must still form a full mesh among themselves to ensure proper route propagation. This design breaks the traditional iBGP split-horizon rule, which prohibits advertising iBGP-learned routes to other iBGP peers, and reduces the total number of required sessions to O(n). The reflection process follows specific rules to maintain consistency with standard BGP path selection. An RR selects its best path using the standard BGP decision process and reflects it only under certain conditions: a route learned from a client is advertised to all other iBGP peers (clients and non-clients), while a route from a non-client is advertised only to clients. If multiple paths to the same destination exist, the RR advertises the best path but may also support advertising additional paths via extensions like BGP Additional Paths (RFC 7911) for enhanced redundancy. 
To prevent routing loops introduced by this reflection, two optional non-transitive attributes are used: the ORIGINATOR_ID, which carries the BGP identifier of the speaker that originated the route within the AS and causes the route to be discarded if it matches the local router's identifier, and the CLUSTER_LIST, to which the RR prepends its CLUSTER_ID (a 4-octet value, often the RR's BGP identifier) before reflection; a route is discarded if the local CLUSTER_ID appears in the list. For redundancy and scalability, route reflectors are organized into clusters, where a cluster is a group of clients served by one or more RRs sharing the same CLUSTER_ID. In a single-RR cluster, the CLUSTER_ID is simply the RR's BGP identifier, but multiple RRs can form a redundant cluster by configuring the same CLUSTER_ID on all of them, allowing clients to peer with any RR while ensuring loop prevention via the shared identifier in CLUSTER_LIST. This setup provides failover without introducing loops, as routes reflected within the same cluster are not re-reflected. Standard route reflection can lead to suboptimal path selection if the RR is not ideally placed in the network topology, as the RR's "hot-potato" routing decision (favoring the closest exit point based on its own IGP metrics) may not align with clients' perspectives. BGP Optimal Route Reflection (BGP-ORR), specified in RFC 9107, addresses this by extending RR behavior to compute paths using IGP costs from configured client locations or sets, enabling the advertisement of more optimal routes tailored to client positions and potentially reducing intra-AS latency. This increases computational overhead on the RR and can be combined with BGP Additional Paths, but it allows flexible RR placement without compromising efficiency in hierarchical or non-hierarchical topologies.
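The reflection rules and the two loop-prevention checks can be sketched as follows. Routes are modeled as plain dictionaries and all function names are assumptions made for the example, not part of any implementation.

```python
def reflect_targets(learned_from, peers):
    """peers maps peer name -> 'client' or 'nonclient'; returns who gets the route."""
    if peers[learned_from] == "client":
        # Client routes go to every other iBGP peer.
        return sorted(name for name in peers if name != learned_from)
    # Non-client routes go to clients only.
    return sorted(name for name, role in peers.items() if role == "client")

def prepare_reflection(route, source_router_id, cluster_id):
    """Attach ORIGINATOR_ID (if absent) and prepend the local CLUSTER_ID."""
    reflected = dict(route)
    reflected.setdefault("originator_id", source_router_id)
    reflected["cluster_list"] = (cluster_id,) + tuple(route.get("cluster_list", ()))
    return reflected

def accept_reflected(route, router_id, cluster_id):
    """Apply the ORIGINATOR_ID and CLUSTER_LIST loop checks on receipt."""
    if route.get("originator_id") == router_id:
        return False
    if cluster_id in route.get("cluster_list", ()):
        return False
    return True
```

A second reflector configured with the same CLUSTER_ID rejects the reflected route because its own identifier is already in the list, which is the behavior that lets redundant reflectors share a cluster safely.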

Confederations and Internal Hierarchies

BGP confederations provide a mechanism to scale internal BGP (iBGP) operations within a large autonomous system (AS) by logically partitioning it into multiple sub-autonomous systems, known as Member-ASes, while presenting a unified external identity to the broader Internet. This approach, defined in RFC 5065, allows an organization to divide its network into smaller, more manageable segments without requiring a full iBGP mesh across all routers, thereby reducing the number of peering sessions from O(n²) to a more hierarchical structure. Each Member-AS within the confederation is assigned a unique identifier, typically drawn from the private AS number range (64512–65534) as reserved by RFC 6996, ensuring these numbers remain invisible to external peers. Peering between Member-ASes emulates external BGP (eBGP) procedures but occurs inside the AS, including eBGP-like AS path prepending and loop prevention, while attributes such as LOCAL_PREF and NEXT_HOP are preserved as they would be between iBGP peers. This hybrid model enables finer-grained policy enforcement, such as traffic engineering or access controls, at the boundaries between sub-ASes, enhancing overall network manageability in complex environments. To maintain path transparency internally while concealing the hierarchical structure externally, BGP introduces two additional AS_PATH segment types: AS_CONFED_SEQUENCE and AS_CONFED_SET. An AS_CONFED_SEQUENCE segment records an ordered list of Member-AS numbers traversed by a route within the confederation, functioning similarly to the standard AS_SEQUENCE for loop detection inside the AS. In contrast, AS_CONFED_SET captures an unordered collection of Member-AS numbers for routes that do not require sequencing, such as those involving aggregation. When advertisements exit the confederation to external peers, these segments are stripped, and the resulting AS_PATH reflects only the single external AS number, preserving the privacy of the internal topology.
Confederations are particularly valuable for large service providers seeking to isolate policies across geographic or administrative divisions without fragmenting their public AS identity, though they are often combined with other techniques like route reflectors for optimal scaling. This method supports hierarchical routing designs, where inter-Member-AS connections form a sparser topology, significantly easing deployment and management in expansive networks.
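The stripping of confederation information at the AS boundary can be sketched as follows. An AS_PATH is modeled as a list of `(segment_type, [asns])` tuples, and the function name is an assumption made for the example.

```python
CONFED_TYPES = ("AS_CONFED_SEQUENCE", "AS_CONFED_SET")

def export_as_path(segments, confed_id):
    """Build the AS_PATH sent to a peer outside the confederation.

    Confederation segments are removed and the public confederation
    identifier is prepended, so Member-AS numbers never leave the AS.
    """
    kept = [(seg_type, list(asns)) for seg_type, asns in segments
            if seg_type not in CONFED_TYPES]
    if kept and kept[0][0] == "AS_SEQUENCE":
        kept[0] = ("AS_SEQUENCE", [confed_id] + kept[0][1])
    else:
        kept.insert(0, ("AS_SEQUENCE", [confed_id]))
    return kept
```

For a route that crossed Member-ASes 65001 and 65002 after arriving from AS 64496, an external peer of confederation 64500 sees only the path `64500 64496`.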

Stability and Growth Challenges

Mechanisms for Route Stability

BGP employs various mechanisms to enhance route stability, primarily by mitigating route flapping: rapid oscillations in route advertisements that can lead to prolonged convergence times and network disruptions. These techniques aim to suppress unstable updates, preserve forwarding during disruptions, and promote reliable convergence without introducing undue delays. Key methods include advertisement throttling, damping algorithms, restart capabilities, and multi-path advertising, which collectively reduce churn in large-scale deployments. A core stability feature is the Minimum Route Advertisement Interval (MRAI), which limits the frequency of UPDATE messages sent to a peer for the same set of destinations, thereby curbing excessive announcements and withdrawals. Under MRAI, a BGP speaker delays sending an update until the interval has elapsed since the last advertisement or withdrawal affecting those destinations, allowing several changes to be combined into fewer messages. For external BGP (eBGP) peers, the suggested default MRAI is 30 seconds, while for internal BGP (iBGP) peers, it is 5 seconds, balancing convergence speed with stability. This mechanism, part of BGP-4 since its standardization, prevents router overload from bursty updates during topology changes. Route flap damping, introduced to suppress persistently unstable routes, assigns a penalty to prefixes exhibiting frequent state changes, such as transitions between reachable and unreachable. Each flap incurs a penalty increment (typically 1000 for unreachability and 500 for attribute changes), tracked via a figure of merit that decays exponentially over time, with separately configurable half-lives (e.g., 5 minutes and 15 minutes). If the figure exceeds a suppression threshold (e.g., 3000), the route is suppressed until the figure decays below a reuse threshold (e.g., 2000).
Defined in RFC 2439, this approach reduces propagation of flaps across the network but has largely fallen out of favor in modern deployments due to its potential to cause prolonged unreachability for otherwise stable routes; operators now favor disabling it or using refined parameters per RFC 7196 and RIPE recommendations. To maintain forwarding continuity during BGP session restarts, the Graceful Restart capability allows a restarting speaker to preserve its forwarding state (e.g., in the FIB) while re-establishing sessions with neighbors. Upon restart, the speaker advertises a Graceful Restart Capability in the OPEN message, specifying a Restart Time (up to 4095 seconds) that estimates how long session re-establishment will take and a Forwarding State bit indicating preserved routes per address family. Neighbors mark affected routes as stale but continue using them for forwarding until receiving fresh updates or an End-of-RIB marker signaling completion; remaining stale routes are then purged. Specified in RFC 4724, this minimizes transient blackholing and loops, significantly improving stability during planned or unplanned restarts in high-availability environments. Building on Graceful Restart, Long-Lived Graceful Restart (LLGR) extends stale route retention beyond short-term restarts, enabling holding times of up to days for better resilience in scenarios like software upgrades or link failures. Peers negotiate LLGR via a separate capability, including a Long-Lived Stale Time parameter (up to about 16.7 million seconds) per address family; retained routes are marked with the LLGR_STALE community (0xFFFF0006) and treated as least preferred to avoid loops. Stale routes are advertised only to LLGR-capable peers and purged after the stale time elapses, with the NO_LLGR community (0xFFFF0007) allowing specific prefixes to be excluded from this treatment. Defined in RFC 9494 (2023), LLGR reduces reconvergence overhead but requires careful deployment to prevent suboptimal paths.
For added resilience against single-path failures, the ADD-PATH capability enables advertising multiple paths for the same prefix, rather than replacing prior ones, using a 4-octet Path Identifier to distinguish them in UPDATE messages. Peers negotiate ADD-PATH via BGP Capability Code 69, specifying send/receive support per address family; upon mutual agreement, multiple paths can be sent for a prefix, with the sender selecting which ones based on policy. Standardized in RFC 7911, this mitigates oscillations from path withdrawals and enhances load balancing, contributing to faster convergence and stability in diverse environments. Recent trends indicate these mechanisms have sustained BGP stability amid growing update volumes; in 2023, daily IPv4 updates averaged 180,000 and IPv6 updates 60,000–100,000, with no unsustainable spikes, while 2024 saw a net increase of 53,000 entries yet stable churn levels concentrated in a few autonomous systems. The growing routing table size underscores the ongoing importance of these techniques in handling expanded scale without proportional instability.
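The rate limiting performed by MRAI, described earlier in this section, can be illustrated with a small queue that holds back changes until the interval has elapsed. This is a simplified sketch: it times each prefix separately, whereas many implementations run one timer per peer, and the class and method names are assumptions made for the example.

```python
class MraiQueue:
    """Coalesces updates for one peer so each prefix is sent at most once per interval."""

    def __init__(self, interval_s=30.0):
        self.interval_s = interval_s
        self.last_sent = {}  # prefix -> time of the last advertisement
        self.pending = {}    # prefix -> most recent attributes still waiting

    def submit(self, prefix, attrs, now):
        """Return attrs if they may be sent immediately, otherwise hold them."""
        last = self.last_sent.get(prefix)
        if last is None or now - last >= self.interval_s:
            self.last_sent[prefix] = now
            self.pending.pop(prefix, None)
            return attrs
        self.pending[prefix] = attrs  # a newer change overwrites an older one
        return None

    def flush(self, now):
        """Release held updates whose interval has expired."""
        ready = {p: a for p, a in self.pending.items()
                 if now - self.last_sent[p] >= self.interval_s}
        for prefix in ready:
            del self.pending[prefix]
            self.last_sent[prefix] = now
        return ready
```

If a prefix changes three times within 30 seconds, the peer sees the first change immediately and only the final state once the interval expires; the intermediate change is never sent.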

Routing Table Expansion and Limits

The expansion of the Border Gateway Protocol (BGP) routing table has been driven primarily by IPv4 address exhaustion, which prompts networks to announce more specific prefixes to conserve and optimize scarce address space; increased multihoming, where organizations connect to multiple upstream providers and advertise finer-grained routes for traffic engineering; and the rise of cloud providers, such as Amazon, which in 2024 alone added over 109 million IPv4 addresses through numerous prefix announcements. By the end of 2022, the global IPv4 BGP routing table had reached approximately 940,000 entries, reflecting a 4% annual growth rate that year. This growth has continued, with the IPv4 table reaching 996,000 prefixes by the end of 2024 and 1,038,438 as of November 2025 (FIB), surpassing 1 million entries as projected under linear growth models. In parallel, the routing table has grown more steadily, expanding from 172,400 entries at the end of 2022 to 221,500 by the end of 2024 and 236,461 as of November 2025 (FIB), stabilizing around 200,000 to 250,000 entries as anticipated depending on deployment trends. Post-2023, overall table growth has slowed to approximately 4% annually on average, with IPv4 showing near-zero increase in 2023 before resuming at 6% in 2024, and decelerating from 17% in 2023 to 10% in 2024 due to maturing adoption and reduced de-aggregation incentives. Key events have highlighted the challenges of this expansion. On August 12, 2014—known as "512k Day"—the IPv4 prefix count exceeded 512,000, triggering hardware limitations in many routers' ternary (TCAM), which often defaulted to 512,000-entry caps, resulting in dropped routes, performance degradation, and temporary outages for affected networks. 
Another critical milestone was the near-depletion of 16-bit Autonomous System Numbers (ASNs), resolved through the deployment of 32-bit ASNs as defined in RFC 6793, which extended the ASN space to over 4 billion unique identifiers and averted a crisis in network identifier allocation.

To address these limits, operators implement mitigations such as route aggregation, which combines multiple contiguous prefixes into a single summary entry to reduce table size while preserving reachability, and the use of default routes on edge devices to avoid downloading the full global table. Additionally, load balancing via equal-cost multipath (ECMP) enables routers to distribute traffic across multiple BGP paths to the same destination without expanding the table, improving utilization of available bandwidth in multihomed environments. These techniques help sustain BGP's scalability amid ongoing pressures from address scarcity and network complexity.
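Route aggregation as described above can be sketched with Python's standard `ipaddress` module. The prefixes are documentation ranges chosen for illustration, and the sketch assumes every component prefix is reachable through the same origin; real aggregation is governed by operator policy rather than by contiguity alone.

```python
import ipaddress

# Four contiguous /26 prefixes plus one unrelated /25 (illustrative values).
specifics = [
    ipaddress.ip_network(p)
    for p in (
        "198.51.100.0/26",
        "198.51.100.64/26",
        "198.51.100.128/26",
        "198.51.100.192/26",
        "203.0.113.0/25",
    )
]

# collapse_addresses merges adjacent and overlapping networks into the
# smallest equivalent set of summary prefixes.
aggregated = list(ipaddress.collapse_addresses(specifics))

for network in aggregated:
    print(network)
```

Here the four /26 prefixes collapse into a single /24, while the lone /25 cannot be summarized further, so five table entries become two.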

Security Considerations

Common Vulnerabilities and Hijacking Risks

BGP's core protocol lacks inherent authentication or validation mechanisms for route announcements, making it reliant on the security of the underlying TCP transport for session integrity. This design exposes the protocol to threats such as session hijacking, where an attacker spoofs TCP packets to disrupt or impersonate peering sessions, and route injection, which allows unauthorized prefixes to propagate across the global routing system. The optional TCP MD5 Signature Option, defined in RFC 2385, offers limited protection against such session-based attacks by appending a hashed signature to TCP segments, but it is inherently weak due to MD5's susceptibility to collision attacks and other known cryptanalytic weaknesses, rendering it insufficient against determined adversaries. Despite these known flaws, many legacy implementations continue to use MD5 or forgo additional safeguards entirely, amplifying BGP's exposure in production environments.

A primary vulnerability stems from BGP's trust model, which accepts route announcements without verifying the origin or path integrity, enabling prefix hijacking. In this attack, a malicious AS announces a victim's IP prefix as though it were the legitimate origin, often using more specific prefixes or shorter paths to divert traffic intended for the victim to the attacker's network. Hijackers can exploit this to intercept sensitive data, such as in man-in-the-middle scenarios, or perform blackholing by announcing more specific prefixes (e.g., a /24 within a legitimate /8) that attract traffic destined for the victim and then drop it, effectively denying service. Such hijacks have been documented in serial attacks, where persistent actors reuse AS numbers to target address blocks for spam distribution or traffic monetization, with episodes affecting thousands of prefixes over months.
Route leaks represent another prevalent risk, typically arising from misconfigurations where an AS inadvertently advertises internal or customer routes to external peers in violation of intended policies, leading to suboptimal or unstable global routing. A prominent example occurred on November 6, 2017, when Level 3 (AS3356) leaked over 1,000 routes learned from Verizon, propagating them globally and causing widespread service degradation for approximately 90 minutes, with major providers affected. Pre-2020 analyses reported around 2,000 confirmed hijacking incidents annually, though the total including leaks reached over 14,000 events in 2017 alone, underscoring the scale of inadvertent disruptions. Hijacks and leaks also facilitate DDoS attacks, where attackers leverage BGP announcements to redirect traffic toward victims, exploiting the protocol's path-vector nature to flood networks with unintended routes that exacerbate volumetric attacks.

In recent years, accidental leaks have persisted, particularly among cloud provider ASes; for instance, quarterly reports from 2023 to 2024 indicate over 3,000 unique ASes involved in route leaks, with cloud environments like those operated by major hyperscalers contributing to incidents due to rapid scaling and complex configurations. Without comprehensive adoption of validation tools, these vulnerabilities remain pervasive.
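The reason a more specific announcement diverts traffic is longest-prefix matching: a router forwards on the most specific route that covers the destination. The following is a minimal sketch of that lookup; the table layout, the function name, and the AS numbers (taken from the range reserved for documentation) are illustrative assumptions, not a router's actual data structures.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the (network, origin) entry with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    candidates = [(net, origin) for net, origin in table.items() if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0].prefixlen)


# The legitimate origin announces a covering /8.
table = {ipaddress.ip_network("10.0.0.0/8"): "AS64500"}
before = longest_prefix_match(table, "10.1.2.3")

# A hijacker injects a more specific /24 inside the /8.
table[ipaddress.ip_network("10.1.2.0/24")] = "AS64511"
after = longest_prefix_match(table, "10.1.2.3")
unaffected = longest_prefix_match(table, "10.9.9.9")

print(before[1], after[1], unaffected[1])
```

Only destinations inside the injected /24 follow the hijacker's route; the rest of the /8 still resolves to the legitimate origin, which is why such hijacks can go unnoticed by the victim.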

Mitigation Strategies and Extensions

To mitigate BGP vulnerabilities such as route hijacking, the Resource Public Key Infrastructure (RPKI) provides a framework for validating the origin of BGP routes through digitally signed Route Origin Authorizations (ROAs). Defined in RFC 6480 and the related documents that followed it, RPKI enables resource holders, under the Regional Internet Registries, to issue ROAs that cryptographically attest to the authorized origin Autonomous System (AS) for specific IP prefixes. Route Origin Validation (ROV) then allows BGP speakers to check incoming routes against these ROAs, discarding those with invalid origins to prevent unauthorized advertisements. As of November 2025, RPKI covers approximately 58% of global IPv4 prefixes and 60% of IPv6 prefixes, reflecting steady growth in adoption.

The Mutually Agreed Norms for Routing Security (MANRS) initiative promotes RPKI deployment among network operators, with actions including ROA issuance and ROV implementation as core requirements for participation. By the end of 2023, 66% of MANRS members managed prefixes covered by valid ROAs, far exceeding the global average of around 34% for all ASes, demonstrating the initiative's role in accelerating secure routing practices. ROV deployment has also advanced, with about 27% of networks actively validating routes using RPKI data as of mid-2025, helping to filter out invalid announcements at scale.

While RPKI provides origin authentication via ROAs, it does not validate the full AS path, leaving gaps that BGPsec aims to fill through end-to-end cryptographic path signatures. Specified in RFC 8205, BGPsec extends BGP by requiring each AS along the path to sign updates with its private key, allowing receivers to verify the integrity and authenticity of the entire propagation chain.
However, BGPsec adoption remains limited, with almost no production deployment due to challenges in key management, computational overhead, and the need for coordinated global rollout; pilots have highlighted these barriers without achieving production-scale use. Recent extensions include Autonomous System Provider Authorization (ASPA), developed in the IETF SIDROPS working group, which validates AS provider-customer relationships to detect unauthorized path segments, with growing adoption in 2025 to complement RPKI. Additionally, RFC 9234 defines BGP Roles and the Only to Customer (OTC) attribute, which let peers declare their relationship when a session is established and help prevent and detect route leaks.

Additional lightweight mitigations include the Generalized TTL Security Mechanism (GTSM), outlined in RFC 5082, which protects against spoofed BGP sessions from unauthorized sources by requiring that packets from directly connected eBGP peers arrive with a TTL of 255, so that packets from off-link attackers, whose TTL has been decremented in transit, are discarded. GTSM, also known as BGP TTL security, is widely implemented in routers and complements cryptographic approaches by reducing the attack surface for forged control-plane messages.
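The Route Origin Validation step described above can be sketched as a three-way classification in the style of RFC 6811: a route is valid if some validated ROA payload covers its prefix, permits its length, and matches its origin AS; invalid if it is covered but nothing matches; and not-found if no payload covers it. The `Vrp` type, the `validate` function, and the sample data are hypothetical names and values for illustration, not the API of any RPKI validator.

```python
import ipaddress
from typing import NamedTuple


class Vrp(NamedTuple):
    """Validated ROA payload: prefix, maximum allowed length, origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix, origin_asn, vrps):
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # subnet_of() requires both networks to use the same IP version.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if route.prefixlen <= vrp.max_length and origin_asn == vrp.asn:
            return "valid"
    return "invalid" if covered else "not-found"


vrps = [Vrp("192.0.2.0/24", 24, 64500)]

results = {
    "authorized": validate("192.0.2.0/24", 64500, vrps),
    "wrong_origin": validate("192.0.2.0/24", 64511, vrps),
    "too_specific": validate("192.0.2.0/25", 64500, vrps),
    "no_roa": validate("198.51.100.0/24", 64500, vrps),
}
print(results)
```

The "too specific" case shows why the maximum length matters: a more specific announcement from the correct origin is still rejected unless the ROA explicitly allows that length. A router applying ROV would typically drop the invalid routes and keep the not-found ones, though that choice is local policy.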

Modern Extensions

Multiprotocol and Segment Routing Support

Multiprotocol BGP (MP-BGP), defined in RFC 4760, extends the Border Gateway Protocol version 4 (BGP-4) to support the advertisement of routing information for multiple protocols beyond IPv4 unicast, using Address Family Identifiers (AFIs) and Subsequent Address Family Identifiers (SAFIs) to specify the protocol and type of routes being exchanged. This allows BGP to handle diverse address families, such as IPv6 unicast (AFI 2, SAFI 1), multicast routes, and labeled VPN routes like VPNv4 (AFI 1, SAFI 128) for IPv4-based Layer 3 VPNs (L3VPNs). By encapsulating protocol-specific next-hop and prefix information within the Multiprotocol Reachable NLRI (MP_REACH_NLRI) and Multiprotocol Unreachable NLRI (MP_UNREACH_NLRI) attributes, MP-BGP maintains backward compatibility with classic BGP-4 while enabling scalable distribution of routes for services such as VPNs across autonomous systems.

In the context of Segment Routing (SR), BGP extensions facilitate traffic engineering by distributing topology and policy information. BGP-Link State (BGP-LS), specified in RFC 7752, enables the northbound distribution of link-state and traffic engineering (TE) data from interior gateway protocols (IGPs) like OSPF and IS-IS to external controllers or applications via BGP, using a dedicated address family (AFI 16388, SAFI 71) to advertise link, node, and prefix attributes such as bandwidth and affinities. This supports SR egress peer engineering by allowing BGP to signal peer node SIDs and adjacency SIDs, enabling source-based path steering without per-flow state in the network core. Further integration of SR with BGP occurs through mechanisms for advertising SR policies and supporting SRv6. RFC 9830 defines a BGP Subsequent Address Family Identifier (SAFI 73) for distributing candidate paths of SR policies, which consist of ordered segment lists for source-routed traffic steering, with sub-TLVs such as preference and binding SID, and with the policy identified by its color and endpoint.
For SR over IPv6 (SRv6), RFC 9252 outlines procedures for BGP overlay services, where SRv6 Segment Identifiers (SIDs) are carried in VPN routes (e.g., VPNv6 with SAFI 128) to enable L3VPN encapsulation and end-to-end IPv6-based path programming without MPLS labels. Additionally, RFC 9832 introduces BGP Classful Transport (BGP-CT) as a new address family (AFI 1, SAFI 76) for intent-driven service mapping, classifying underlay routes by transport classes (e.g., low-latency or high-bandwidth) to steer overlay services like SR policies based on explicit intents. These extensions collectively enhance BGP's role in SR environments by providing flexible, scalable control for traffic engineering across IPv4, IPv6, and hybrid networks.
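To make the role of the AFI and SAFI fields concrete, the following sketch builds the value portion of an MP_REACH_NLRI attribute following the field order given in RFC 4760: AFI (2 octets), SAFI (1 octet), next-hop length (1 octet), next hop, one reserved octet, then the NLRI as length-prefixed prefixes. It omits the surrounding path-attribute header (flags, type code, length), and the function names and addresses are illustrative assumptions.

```python
import ipaddress
import struct

AFI_IPV6 = 2       # Address Family Identifier for IPv6
SAFI_UNICAST = 1   # Subsequent AFI for unicast forwarding


def encode_prefix(prefix):
    """Encode a prefix as <length in bits><minimum number of prefix octets>."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def encode_mp_reach(afi, safi, next_hop, prefixes):
    """Build the MP_REACH_NLRI attribute value (without the attribute header)."""
    nh = ipaddress.ip_address(next_hop).packed
    header = struct.pack("!HBB", afi, safi, len(nh))
    nlri = b"".join(encode_prefix(p) for p in prefixes)
    return header + nh + b"\x00" + nlri  # one reserved octet precedes the NLRI


value = encode_mp_reach(
    AFI_IPV6, SAFI_UNICAST, "2001:db8::1", ["2001:db8:100::/48"]
)
print(value.hex())
```

Because the address family is named inside the attribute rather than implied by the message format, a speaker that did not negotiate the corresponding capability never receives it, which is how backward compatibility with IPv4-only peers is preserved.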

EVPN and BGP-LS Applications

Ethernet VPN (EVPN) extends BGP to provide scalable Layer 2 and Layer 3 VPN services, particularly in data center environments using VXLAN overlays. Defined in RFC 7432, EVPN enables control-plane learning of MAC and IP addresses through BGP advertisements, replacing traditional data-plane flooding and learning mechanisms. Provider Edge (PE) devices advertise MAC/IP Advertisement routes using the EVPN Address Family (AFI 25, SAFI 70), which include fields such as the MAC address, Ethernet Segment Identifier (ESI), Ethernet Tag ID, and an optional IP address, allowing for efficient distribution of endpoint reachability information across the network. This approach supports multihoming with all-active or single-active redundancy via ESIs and enhances load balancing by providing multiple next-hop options in BGP updates.

For integrated Layer 2 and Layer 3 services, EVPN incorporates symmetric Integrated Routing and Bridging (IRB), where PEs use a common gateway address as the default gateway for inter-subnet routing. This is achieved by advertising the gateway's MAC/IP pair with the Default Gateway Extended Community in MAC/IP Advertisement routes, ensuring consistent forwarding behavior without asymmetric issues. Symmetric IRB unifies the bridging and routing tables on PEs, facilitating seamless L2 extension and L3 gateway functions in VXLAN-based overlays. Since 2020, EVPN has seen widespread adoption in data center fabrics for its ability to scale multi-tenant overlays, support VM mobility, and integrate with Network Virtualization over Layer 3 (NVO3) architectures, as outlined in subsequent applicability guidance.

BGP Link-State (BGP-LS) extends BGP to distribute Interior Gateway Protocol (IGP) topology and traffic engineering information to external controllers, enabling centralized network management in software-defined networking (SDN) environments. Specified in RFC 7752, BGP-LS uses a dedicated Address Family (AFI 16388, SAFI 71 for non-VPN) to encode link-state data in BGP Network Layer Reachability Information (NLRI) with types for nodes, links, and prefixes, formatted as Type-Length-Value (TLV) structures.
This allows controllers, such as Path Computation Elements (PCEs), to receive a complete topology view from BGP speakers within the network, supporting applications like path computation and application-layer traffic optimization without requiring direct IGP peering. Recent extensions in RFC 9815 introduce BGP-LS support for Shortest Path First (SPF) routing by defining a new BGP-LS-SPF Subsequent Address Family Identifier (SAFI 80), which enables Dijkstra-based path computation directly on distributed topology data. This facilitates fast convergence and Equal-Cost Multi-Path (ECMP) in large-scale environments through incremental updates and a Link State Database (LSDB) maintained by receivers. In Clos fabrics common to data centers, RFC 9816 describes the applicability of these BGP-LS SPF extensions, recommending sparse peering models with route reflectors or controllers to reduce session overhead while providing full topology visibility for underlay routing and traffic engineering. These mechanisms address the need for policy-controlled distribution in multi-stage topologies, improving operational simplicity over traditional IGP flooding.
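The Dijkstra-based computation with ECMP mentioned above can be sketched on a toy two-leaf, two-spine Clos topology. This is a generic shortest-path-first routine that tracks equal-cost first hops, assuming strictly positive link costs; it is not the procedure text of RFC 9815, and the graph and node names are made up for illustration.

```python
import heapq


def spf_with_ecmp(graph, source):
    """Dijkstra's algorithm that also records every equal-cost first hop."""
    dist = {source: 0}
    next_hops = {source: set()}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            nd = d + cost
            # Directly attached neighbors are their own first hop; farther
            # nodes inherit the first hops of the node they are reached through.
            hops = {neighbor} if node == source else next_hops[node]
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                next_hops[neighbor] = set(hops)
                heapq.heappush(heap, (nd, neighbor))
            elif nd == dist[neighbor]:
                next_hops[neighbor] |= hops  # equal-cost path: merge first hops
    return dist, next_hops


# Link-state database for a small Clos fabric, all links with cost 1.
lsdb = {
    "leaf1": {"spine1": 1, "spine2": 1},
    "leaf2": {"spine1": 1, "spine2": 1},
    "spine1": {"leaf1": 1, "leaf2": 1},
    "spine2": {"leaf1": 1, "leaf2": 1},
}

dist, next_hops = spf_with_ecmp(lsdb, "leaf1")
print(dist["leaf2"], sorted(next_hops["leaf2"]))
```

From leaf1, the other leaf is reached at cost 2 through both spines, so both appear as ECMP next hops; removing one spine from the database and rerunning the function would leave a single next hop.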

Implementations and Uses

Software and Hardware Implementations

The Border Gateway Protocol (BGP) is implemented across a range of open-source daemons and commercial operating systems, enabling its use in diverse networking environments from Linux-based servers to enterprise routers. Open-source implementations provide flexible, cost-effective options for testing and production deployments, often emphasizing community-driven enhancements.

Among open-source solutions, FRRouting (FRR), a fork of the earlier Quagga project initiated in 2017, stands out for its comprehensive support of BGP features, including Multiprotocol BGP (MP-BGP) for IPv4 and IPv6 routing, and Resource Public Key Infrastructure (RPKI) for route origin validation to mitigate hijacking risks. FRR's architecture separates protocol daemons like bgpd for BGP from the zebra daemon, which interfaces with the kernel's forwarding information base (FIB) to install routes, allowing seamless integration with host-based routing on platforms such as Cumulus Linux. This kernel integration enables FRR to manage dynamic routing tables efficiently on Linux distributions, with widespread adoption in data centers and internet exchange points due to its stability and support for over 150 BGP-related RFCs as of late 2024. Quagga, the predecessor to FRR, introduced a modular zebra-based design that influenced modern implementations, though it has largely been superseded by FRR for active use owing to enhanced performance and bug fixes identified in behavioral testing across both. BIRD, another prominent open-source routing suite, excels in high-performance scenarios, demonstrating superior memory efficiency and convergence speed compared to FRR when handling full internet routing tables, making it suitable for resource-constrained environments like embedded systems or large-scale deployments.

Commercial implementations integrate BGP deeply into vendor-specific operating systems, offering hardware-accelerated features for high-scale deployments.
Cisco's IOS and IOS XR platforms provide robust BGP capabilities, including advanced policy-based routing and support for extensions like Segment Routing over IPv6 (SRv6), with recent software updates from 2023 to 2025 enabling SRv6 locator advertisements and service SID allocations for simplified VPN and traffic engineering. Juniper's Junos OS emphasizes operational simplicity in BGP configuration, supporting dynamic capability negotiation and multipath routing, which enhances interoperability in multi-vendor environments. Arista's Extensible Operating System (EOS) implements BGP-4 with the multiprotocol extensions of RFC 4760, facilitating efficient IPv6 route exchange and EVPN overlays. In 2025, Cisco launched AI-optimized routing systems, such as upgrades to its Silicon One-based platforms, incorporating BGP to handle intense inter-data-center traffic for AI workloads, achieving higher throughput and lower latency through automated path optimization. These hardware integrations, including SRv6 support in vendor software like Cisco's IOS XR releases, bridge traditional BGP operations with emerging IPv6-based segment routing for scalable, programmable networks.

Deployment in Networks and Services

Border Gateway Protocol (BGP) plays a central role in facilitating interconnections between autonomous systems (ASes) at Internet Exchange Points (IXPs), where networks establish peering sessions to exchange traffic directly without traversing upstream providers. At IXPs, BGP enables efficient route advertisement and selection among multiple peers, often through route servers that simplify configuration by allowing a single BGP session to aggregate announcements from numerous participants, reducing the complexity of maintaining individual sessions. This deployment enhances global connectivity by minimizing latency and costs for high-volume traffic exchanges between ISPs and content providers.

In content delivery networks (CDNs) and Domain Name System (DNS) services, BGP supports anycast addressing, where the same IP prefix is advertised from multiple geographically dispersed locations, allowing routers to direct traffic to the nearest instance based on BGP path attributes like AS path length. For DNS, anycast ensures resilient query resolution by routing requests to the optimal server via BGP's dynamic updates, improving availability during failures or attacks. Similarly, CDNs leverage BGP to optimize content distribution, reducing latency for end-users accessing media or applications from edge servers worldwide.

BGP multihoming allows organizations to connect to multiple upstream providers for redundancy and load distribution, using techniques such as selective prefix announcements to control inbound traffic flows across links. Traffic engineering in these setups often relies on BGP communities, optional transitive attributes appended to routes, to influence path selection, such as by tagging prefixes for local preference adjustments or AS path prepending at the provider level, enabling fine-tuned control over traffic symmetry without altering core BGP metrics.
For DDoS mitigation, BGP FlowSpec, as defined in RFC 8955, extends the protocol to propagate filtering rules as network layer reachability information (NLRI), allowing rapid dissemination of traffic specifications (e.g., source/destination ports, protocols) to downstream routers for real-time blackholing or redirection of malicious flows. This capability is widely deployed in service provider networks to counter volumetric attacks by coordinating defenses across AS boundaries, often integrated with scrubbing centers for automated response.

In 5G and edge computing environments, BGP integrates with Segment Routing (SR-EVPN) to provide scalable Layer 2/3 services, where BGP advertises EVPN routes over MPLS or SRv6 segments to support low-latency interconnects between core networks and edge nodes. This deployment enables dynamic endpoint discovery and traffic steering in distributed architectures, facilitating services like network slicing and mobile edge computing by unifying control-plane operations under BGP. This breadth of deployment underscores the protocol's foundational role in inter-domain connectivity as of 2025. Emerging trends include AI-driven load balancing, where machine learning models analyze BGP updates and traffic patterns to predictively adjust communities or path selections, optimizing traffic distribution in dynamic environments like data centers and SD-WANs.
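The community-based traffic engineering described above can be sketched as a two-step process: an import policy maps a community to a local preference, and best-path selection prefers higher local preference before shorter AS paths. The community values, the policy table, and the `Route` type are hypothetical; real providers publish their own community schemes, and the full BGP decision process has more tie-breakers than the two modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    prefix: str
    as_path: list
    communities: set = field(default_factory=set)
    local_pref: int = 100  # a common default local preference


# Hypothetical provider policy: community -> local preference to apply.
COMMUNITY_LOCAL_PREF = {"64500:80": 80, "64500:120": 120}


def apply_import_policy(route):
    """Set local preference from any recognized community on the route."""
    for community, pref in COMMUNITY_LOCAL_PREF.items():
        if community in route.communities:
            route.local_pref = pref
    return route


def best_path(routes):
    """Prefer the highest local preference, then the shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


direct = Route("203.0.113.0/24", as_path=[64501], communities={"64500:80"})
transit = Route("203.0.113.0/24", as_path=[64502, 64501, 64501])

before = best_path([direct, transit])  # no policy applied yet
after = best_path([apply_import_policy(r) for r in (direct, transit)])
print(before.as_path, after.as_path)
```

Before the policy runs, the shorter AS path wins; once the customer's tag lowers the local preference on the direct link, the longer path is chosen, which is how a customer can steer inbound traffic away from a link without withdrawing the route.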
