Minimizing or reducing traffic loss when an external border gateway protocol (eBGP) peer goes down
11546246 · 2023-01-03
Assignee
Inventors
- Rafal Jan Szarecki (San Jose, CA, US)
- Kaliraj Vairavakkalai (Fremont, CA)
- Natrajan Venkataraman (Sunnyvale, CA, US)
Cpc classification
H04L12/66
ELECTRICITY
International classification
H04L12/28
ELECTRICITY
H04L45/50
ELECTRICITY
H04L12/66
ELECTRICITY
Abstract
A router configured as an autonomous system border router (ASBR) in a local autonomous system (AS), includes: (1) a control component for communicating and computing routing information, the control component running a Border Gateway Protocol (BGP) and peering with at least one BGP peer device in an outside autonomous system (AS) different from the local AS; and (2) a forwarding component for forwarding packets using forwarding information derived from the routing information computed by the control component, wherein the control component (i) receives reachability information for an external prefix corresponding to a device outside the local AS, and (ii) associates the external prefix, as a BGP next hop (B_NH), an abstract next hop (ANH) that identifies a set of BGP (eBGP) sessions that contains at least one eBGP session over which given external prefix has been learned, each of the at least one eBGP sessions being between the ASBR and a BGP peer device in an AS outside the AS, wherein the device located outside the local AS is reachable via the BGP peer device.
Claims
1. A router configured as an autonomous system border router (ASBR) in a local autonomous system (AS), the router comprising: a) a control component for communicating and computing routing information, the control component running a Border Gateway Protocol (BGP) and peering with at least one BGP peer device in an outside autonomous system (AS) different from the local AS; and b) a forwarding component for forwarding packets using forwarding information derived from the routing information computed by the control component, wherein the control component (1) receives reachability information for at least one external prefix, each of the at least one external prefix corresponding to a device located outside the local AS, and (2) associates the at least one external prefix, as a BGP next hop (B_NH), with an abstract next hop (ANH) that identifies either (A) a set of at least two BGP (eBGP) sessions, wherein each of the at least two eBGP sessions is between the ASBR and a BGP peer device through which the device corresponding to one of the at least one external prefix located outside the local AS is reachable, the BGP peer device being located in the outside AS, or (B) an eBGP session between the ASBR and a BGP peer device through which each of at least two devices corresponding to at least two external prefixes is reachable.
2. The router of claim 1 wherein the ANH is an IP address.
3. The router of claim 2 wherein the control component further advertises the ANH using an Interior Gateway Protocol (IGP) of the local AS.
4. The router of claim 2 wherein the control component further advertises the ANH via a Multiprotocol Label Switching (MPLS) label distribution control protocol of the local AS.
5. The router of claim 1 wherein each eBGP session in the set of BGP sessions identified by the ANH is between the router and at least two peer devices in the outside AS through which the device is reachable.
6. The router of claim 1 wherein the set of BGP sessions identified by the ANH includes (1) a BGP session between the router and at least one peer device in the outside AS through which the device is reachable, and (2) a BGP session between at least one other ASBR router in the local AS and at least one peer device in the outside AS through which the device is reachable.
7. The router of claim 1 wherein the set of BGP sessions identified by the ANH includes a BGP session between the router and at least two peer devices in at least two ASes outside the local AS through which the device is reachable.
8. The router of claim 1 wherein the set of BGP sessions identified by the ANH includes (1) a BGP session between the router and at least one peer device in an AS outside the local AS through which the device is reachable, and (2) a BGP session between at least one other ASBR router in the local AS and at least one peer device in an AS outside the local AS through which the device is reachable.
9. The router of claim 1 wherein the control component further advertises to a route reflector (RR) within the local AS, the external prefix with the abstract next hop as a single path, regardless of how many eBGP sessions are associated with the ANH and regardless of whether the external prefix was learned from more than one of the eBGP sessions.
10. The router of claim 1 wherein the control component further advertises the external prefix with the abstract next hop as a single path, regardless of how many eBGP sessions are associated with the ANH and regardless of whether the external prefix was learned from more than one of the eBGP sessions.
11. The router of claim 1 wherein abstract next hop (ANH) that identifies an eBGP session between the ASBR and a BGP peer device through which each of at least two devices corresponding to at least two external prefixes is reachable.
12. A non-transitory storage medium provided on an autonomous system border router (ASBR) in a local autonomous system (AS) storing a data structure comprising: a) an external prefix corresponding to a device located outside the local AS; and b) an abstract next hop Internet protocol (IP) address (ANH) that (1) is associated with the external prefix, and (2) identifies a set of BGP (eBGP) sessions that contains at least one eBGP session, each of the at least one eBGP session being between the ASBR and a BGP peer device in an AS outside the AS, wherein the device located outside the local AS is reachable via the BGP peer device.
13. The non-transitory storage medium of claim 12, wherein the ANH does not identify, and is not associated with, any object other than the at least BGP session with which it is associated.
14. The non-transitory computer-readable storage medium of claim 12 wherein the ANH is selected from a set of Address Families and Sub-Address Families comprising: (A) Internet Protocol version 4 (IPv4), (B) Internet Protocol version 6 (IPv6), (C) Virtual Private Network version 4 (VPNv4), (D) Virtual Private Network version 6 (VPNv6), (E) layer 2 Virtual Private Network/Virtual Private LAN Service (L2VPN/VPLS) and (F) Ethernet Virtual Private Network (EVPN).
15. A method for configuring an autonomous system border router (ASBR) in a local autonomous system (AS) having at least one BGP peer device in an outside autonomous system (AS) different from the local AS, the method comprising: a) receiving reachability information for at least one external prefix, each of the at least one external prefix corresponding to a device located outside the local AS; and b) associating with the at least one external prefix, as a BGP next hop (B_NH), an abstract next hop Internet protocol (IP) address (ANH) that (1) is associated with the external prefix, and (2) identifies either (A) a set of at least two BGP (eBGP) sessions, wherein each of the at least two eBGP sessions is between the ASBR and a BGP peer device through which the device corresponding to one of the at least one external prefix located outside the local AS is reachable, the BGP peer device being located in the outside AS, or (B) an eBGP session between the ASBR and a BGP peer device through which each of at least two devices corresponding to at least two external prefixes is reachable.
16. The method of claim 15 wherein the ANH does not identify, and is not associated with, any object associated with the prefix, other than the at least BGP sessions with which it is associated.
17. The method of claim 15 wherein the reachability information for the external prefix is received via a user interface of the ASBR, and wherein the ANH is associated with the external prefix via the user interface.
18. The method of claim 15 wherein the reachability information for the external prefix and the ANH associated with the external prefix are received as manually-entered configuration information stored on a non-transitory computer-readable storage medium.
19. The method of claim 15 wherein the abstract next hop (ANH) that identifies a set of at least two BGP (eBGP) sessions, wherein each of the at least two eBGP sessions is between the ASBR and a BGP peer device through which the device corresponding to one of the at least one external prefix located outside the local AS is reachable, the BGP peer device being located in the outside AS.
20. The method of claim 15 wherein the abstract next hop (ANH) that identifies an eBGP session between the ASBR and a BGP peer device through which each of at least two devices corresponding to at least two external prefixes is reachable.
Description
§ 4. BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15) Request for Comments 7938 (Internet Engineering Task Force, August 2016), referred to as “RFC 7938” and incorporated herein by reference.)
§ 5. DETAILED DESCRIPTION
(16) The present description may involve novel methods, apparatus, message formats, and/or data structures for improving convergence by removing dependency on per-BGP-prefix withdrawal operations in response to a lost connection with an eBGP peer (e.g., by minimizing or reducing traffic loss when an external border gateway protocol (eBGP) peer (or an eBGP session) goes down). The following description is presented to enable one skilled in the art to make and use the invention, and is provided in the context of particular applications and their requirements. Thus, the following description of embodiments consistent with the present invention provides illustration and description, but is not intended to be exhaustive or to limit the present invention to the precise form disclosed. Various modifications to the disclosed embodiments will be apparent to those skilled in the art, and the general principles set forth below may be applied to other embodiments and applications. For example, although a series of acts may be described with reference to a flow diagram, the order of acts may differ in other implementations when the performance of one act is not dependent on the completion of another act. Further, non-dependent acts may be performed in parallel. No element, act or instruction used in the description should be construed as critical or essential to the present invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Thus, the present invention is not intended to be limited to the embodiments shown and the inventors regard their invention as any patentable subject matter described.
(17) Example embodiments consistent with the present description provide a so-called Abstract Next Hop (or ANH). Referring back to
(18) An ANH may simply be an IP address that identifies an eBGP peer or a set of eBGP peers. The set of eBGP peers may be defined by a human operator via a user interface of the ASBR, or remotely. Thus, the set of eBGP sessions may be defined by a human operator in local configuration, according to network design needs. As one example, the set of eBGP peers may be defined as those eBGP peers belonging to the same peer AS and handled by a given single ASBR. As another example, a set of eBGP peers may be defined as those eBGP peers belonging to the same peer AS and handled by one or more ASBR(s) at a given site. As yet another example, a set of eBGP peers may be defined as eBGP peers belonging to any upstream provider AS. As yet still another example, a set of eBGP peers may be defined as BGP sessions with a given peer device and handled by one or more ASBRs of the local AS. Naturally, other sets or groupings of eBGP peers are possible.
(19) A host route to the ANH is installed in the relevant RIB and redistributed into the IGP. BGP maintains the ANH host route based on the state of the associated group of BGP sessions as follows. As soon as all BGP sessions in the set go “DOWN,” the ANH route is removed. When at least one BGP session of the set comes “UP,” the ANH route is created only after initial route convergence is complete for the peer (e.g., when an End-of-RIB (EoR) marker (See, e.g., “Graceful Restart Mechanism for BGP,” Request for Comments 4724 (Internet Engineering Task Force, January 2007) (referred to as “RFC 4724” and incorporated herein by reference)) is received). Taken together, these procedures ensure that as soon as the final eBGP session in the set goes DOWN, ingress routers will see the associated ANH withdrawn from the IGP. Since the ANH is used to resolve the BGP next hops of BGP prefixes, the ingress routers are triggered to converge to send traffic to their alternate (new best) route. These procedures also ensure that as soon as one session in the set comes UP and is synchronized (that is, the EoR is received), ingress routers will see the ANH advertised in the IGP and will be able to re-converge to use routes that are associated with that next hop.
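The maintenance rules just described can be sketched as a small state tracker. This is only an illustrative model, not the claimed implementation: the class name AnhRouteTracker, its method names, the peer names, and the address 192.0.2.1/32 are all hypothetical. The sketch assumes that an installed ANH route is kept until every session in the set is DOWN, and that the route is (re)created only once at least one UP session has delivered its End-of-RIB marker.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass
class AnhRouteTracker:
    """Hypothetical model of one ANH and the set of eBGP sessions it represents."""
    anh: str                                        # ANH host route (illustrative value)
    members: FrozenSet[str]                         # sessions configured in the set
    up: Set[str] = field(default_factory=set)       # sessions currently UP
    synced: Set[str] = field(default_factory=set)   # UP sessions that have sent EoR
    installed: bool = False                         # is the host route in the RIB/IGP?

    def _check(self, peer: str) -> None:
        if peer not in self.members:
            raise KeyError(f"{peer} is not part of the session set for {self.anh}")

    def _refresh(self) -> bool:
        if not self.up:
            # All sessions in the set are DOWN: remove the ANH route.
            self.installed = False
        elif self.synced:
            # At least one session is UP and has completed initial convergence.
            self.installed = True
        # Otherwise (sessions UP, none synchronized yet) keep the current state.
        return self.installed

    def session_up(self, peer: str) -> bool:
        self._check(peer)
        self.up.add(peer)
        return self._refresh()

    def end_of_rib(self, peer: str) -> bool:
        self._check(peer)
        if peer in self.up:
            self.synced.add(peer)
        return self._refresh()

    def session_down(self, peer: str) -> bool:
        self._check(peer)
        self.up.discard(peer)
        self.synced.discard(peer)
        return self._refresh()
```

Each method returns whether the ANH host route should currently be installed, so a caller could use a change in the return value as the trigger to add or remove the route from the IGP.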
(20) By way of background, RFC 4724 recognized that usually, when BGP on a router restarts, all the BGP peers detect that the session went “DOWN” and then came “UP.” This down-to-up transition results in a “routing flap” and causes BGP route re-computation, generation of BGP routing updates, and unnecessary churn to the forwarding tables (which could spread across multiple routing domains). Such routing flaps may create undesirable transient forwarding blackholes and/or transient forwarding loops. They also consume resources on the control plane of the routers affected by the flap. As such, they are detrimental to the overall network performance. RFC 4724 describes a mechanism to help minimize the negative effects caused by BGP restart. More specifically, per RFC 4724, an End-of-RIB marker is specified and can be used to convey routing convergence information. RFC 4724 defines a new BGP capability, termed “Graceful Restart Capability”, that would allow a BGP speaker to express its ability to preserve forwarding state during BGP restart. Finally, RFC 4724 outlines procedures for temporarily retaining routing information across a TCP session termination/re-establishment. A BGP UPDATE message with no reachable Network Layer Reachability Information (NLRI) and empty withdrawn NLRI is specified as the “End-of-RIB marker” that can be used by a BGP speaker to indicate to its peer the completion of the initial routing update after the session is established. For the IPv4 unicast address family, the End-of-RIB marker is an UPDATE message with the minimum length (See, e.g., RFC 4271). For any other address family, it is an UPDATE message that contains only the MP_UNREACH_NLRI attribute (See, e.g., RFC 4760.) with no withdrawn routes for that <AFI, SAFI>. Although the End-of-RIB marker is specified for the purpose of BGP graceful restart, it is noted that the generation of such a marker upon completion of the initial update would be useful for routing convergence in general. 
In addition, it would be beneficial for routing convergence if a BGP speaker can indicate up-front to its peer that it will generate the End-of-RIB marker (regardless of its ability to preserve its forwarding state during BGP restart).
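As a rough illustration of the End-of-RIB marker formats summarized above, the following sketch builds and recognizes the marker at the byte level. The function names build_update and end_of_rib_family are hypothetical, and the parser is deliberately simplified: it assumes a complete, well-formed message and handles only the non-extended-length encoding of the MP_UNREACH_NLRI attribute.

```python
import struct

BGP_HEADER_LEN = 19        # 16-byte marker + 2-byte length + 1-byte type (RFC 4271)
BGP_TYPE_UPDATE = 2
ATTR_MP_UNREACH_NLRI = 15  # attribute type code (RFC 4760)
ATTR_FLAG_EXT_LEN = 0x10   # extended-length flag bit


def build_update(withdrawn: bytes = b"", attrs: bytes = b"", nlri: bytes = b"") -> bytes:
    """Assemble a BGP UPDATE message from already-encoded fields."""
    body = struct.pack("!H", len(withdrawn)) + withdrawn
    body += struct.pack("!H", len(attrs)) + attrs + nlri
    header = b"\xff" * 16 + struct.pack("!HB", BGP_HEADER_LEN + len(body), BGP_TYPE_UPDATE)
    return header + body


def end_of_rib_family(msg: bytes):
    """Return (AFI, SAFI) if msg is an End-of-RIB marker, otherwise None."""
    if len(msg) < BGP_HEADER_LEN + 4:
        return None
    length, msg_type = struct.unpack("!HB", msg[16:19])
    if msg_type != BGP_TYPE_UPDATE or length != len(msg):
        return None
    (withdrawn_len,) = struct.unpack("!H", msg[19:21])
    if withdrawn_len != 0:
        return None
    (attr_len,) = struct.unpack("!H", msg[21:23])
    if 23 + attr_len != len(msg):
        return None                      # trailing NLRI present, or truncated
    if attr_len == 0:
        return (1, 1)                    # minimum-length UPDATE: IPv4 unicast
    attrs = msg[23:]
    # Otherwise expect exactly one MP_UNREACH_NLRI carrying only AFI and SAFI.
    if (len(attrs) == 6 and not attrs[0] & ATTR_FLAG_EXT_LEN
            and attrs[1] == ATTR_MP_UNREACH_NLRI and attrs[2] == 3):
        afi, safi = struct.unpack("!HB", attrs[3:6])
        return (afi, safi)
    return None
```

A receiver following the procedures of paragraph (19) might call a check like this on each incoming UPDATE and treat a non-None result as the signal that initial convergence for that address family is complete.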
(21) A host route to the ANH (/32 for IPv4 or /128 for IPv6) is installed in an IP Route Information Base (IP RIB, such as inet.0 or inet6.0 in routers from Juniper Networks, Inc. of Sunnyvale, Calif.) or in a Labeled IP RIB (such as inet.3 and inet6.3 in routers from Juniper Networks) and redistributed into IGP/LDP (transport protocols). In the Junos OS from Juniper Networks, the routing tables “inet.0” and “inet6.0” are used for IP version 4 (IPv4) and IP version 6 (IPv6) unicast routes, respectively, and are used to construct the forwarding structure, namely the FIB (Forwarding Information Base). These tables store interface local and direct routes, static routes, and dynamically learned routes. In the Junos OS from Juniper Networks, the Labeled IP RIB routing tables are “inet.3” and “inet6.3,” used for IPv4 MPLS and IPv6 MPLS, respectively. These tables store the MPLS FEC (typically the egress address of an MPLS label-switched path (LSP)), the LSP name, and the outgoing interface name. These routing tables are used only when the local device is the ingress node to an LSP, for the purpose of next hop resolution. The IGPs and BGP store their routing information in the inet.0/inet6.0 routing table, the main IP routing table. To do so, for BGP routes, the BGP NH needs to be resolved. If traffic-engineering BGP is enabled (the implicit default on the Junos OS from Juniper Networks), thereby allowing only BGP to use MPLS paths for forwarding traffic, BGP can access the inet.3 routing table. BGP uses both inet.0 and inet.3 to resolve next-hop addresses. If the traffic-engineering BGP-IGP command is configured, thereby allowing the IGPs to use MPLS paths for forwarding traffic, MPLS path information is stored in the inet.0 routing table. The inet.3 routing table contains the MPLS FEC addresses, typically the host address of each LSP's egress router. BGP uses the inet.3 routing table on the ingress router to help in resolving next-hop addresses.
When BGP resolves a BGP next-hop attribute of a given prefix, it examines both the inet.0 and inet.3 routing tables, seeking the next hop with the best match (longest prefix match) and of best preference. If it finds a next-hop entry with an equal preference in both routing tables, BGP prefers the entry in the inet.3 routing table.
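The resolution behavior described in this paragraph can be approximated as follows. This is a simplified sketch rather than the Junos OS algorithm: the Route type and the resolve_next_hop function are hypothetical, lower preference values are assumed to be better, and the ordering of the criteria (longest match first, then preference, then inet.3 on a tie) is an assumption drawn from the prose above.

```python
import ipaddress
from typing import Iterable, NamedTuple, Optional


class Route(NamedTuple):
    table: str                      # "inet.0" or "inet.3"
    prefix: ipaddress.IPv4Network   # destination covered by this route
    preference: int                 # assumed: lower value is preferred


def resolve_next_hop(bgp_nh: str,
                     inet0: Iterable[Route],
                     inet3: Iterable[Route]) -> Optional[Route]:
    """Return the route used to resolve bgp_nh, or None if it is unresolvable."""
    addr = ipaddress.ip_address(bgp_nh)
    candidates = [r for r in list(inet3) + list(inet0) if addr in r.prefix]
    if not candidates:
        return None
    # Longest prefix match, then best preference, then inet.3 wins a tie.
    return min(
        candidates,
        key=lambda r: (-r.prefix.prefixlen, r.preference, r.table != "inet.3"),
    )
```

With an ANH host route present in both tables at equal preference, the inet.3 entry is returned; if no covering route exists at all, the function returns None, which models a BGP route whose next hop has become unresolvable.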
(22) An ANH IP address can be any value that a user wants to assign based on IP address management. As one example, ANHx can be PeerX's lo0 address, when the ANH is to represent a single eBGP peer device connected to the local AS.
(23) The IGP route to ANH/32 or ANH/128 can be withdrawn, or advertised with a less preferred metric, to drain traffic away from the eBGP peer(s).
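The draining effect can be pictured with a toy selection function. This sketch assumes that an ingress router has otherwise equal BGP paths for a prefix and therefore falls back on the IGP metric toward each path's next hop; the function name preferred_anh and the ANH labels used with it are hypothetical.

```python
from typing import Dict, Optional


def preferred_anh(igp_metric: Dict[str, int]) -> Optional[str]:
    """Pick the reachable ANH with the lowest IGP metric.

    igp_metric maps each ANH currently advertised in the IGP to its metric;
    a withdrawn ANH is simply absent from the mapping.
    """
    if not igp_metric:
        return None
    # Sort the names first so that equal metrics are resolved deterministically.
    return min(sorted(igp_metric), key=lambda anh: igp_metric[anh])
```

Raising the metric of one ANH above that of the alternatives moves traffic away from it while leaving it usable as a last resort, whereas removing it from the mapping models an outright withdrawal.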
§ 5.1 Example Methods
(24)
(25) Still referring to
§ 5.2 Illustrative Example of Operations of Example Embodiment
(26)
(27) Referring first to
(28) Further note that the BGP RIB of PE2 320′ associates an abstract next hop (ANH_PE2) with both the prefix Pfx1 and the prefix Pfx2, as shown. ANH_PE2 is associated with the IP address of the remote end of the eBGP session with Peer 3 370 (10.0.26.2) in PE2's RIB. As shown, the iBGP update message advertising the prefixes includes the association of Pfx1 with ANH_PE2 and the association of Pfx2 with ANH_PE2.
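To see why sharing one abstract next hop across prefixes helps convergence, consider the following sketch of an ingress router's view. The prefixes Pfx1 and Pfx2 and the next hop written here as ANH_PE2 come from the example above; the alternate next hop ANH_PE1, the table layout, and the function active_next_hops are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional, Set

# Ingress view: each prefix lists its candidate BGP next hops, best first.
# ANH_PE1 is an assumed alternate path and is not taken from the example above.
ingress_rib: Dict[str, List[str]] = {
    "Pfx1": ["ANH_PE2", "ANH_PE1"],
    "Pfx2": ["ANH_PE2", "ANH_PE1"],
}


def active_next_hops(rib: Dict[str, List[str]],
                     reachable: Set[str]) -> Dict[str, Optional[str]]:
    """For each prefix, pick the first candidate next hop that the IGP can reach."""
    return {
        prefix: next((nh for nh in candidates if nh in reachable), None)
        for prefix, candidates in rib.items()
    }


# While ANH_PE2 is advertised in the IGP, both prefixes use it.
before = active_next_hops(ingress_rib, {"ANH_PE1", "ANH_PE2"})
# A single IGP withdrawal of ANH_PE2 moves every dependent prefix at once,
# without waiting for a BGP withdrawal per prefix.
after = active_next_hops(ingress_rib, {"ANH_PE1"})
```

The point of the sketch is that the amount of work triggered by the failure is tied to one next-hop event rather than to the number of prefixes learned over the failed session.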
(29) Referring next to
(30) Still referring to
(31) Finally, referring to
(32) Still referring to
§ 5.3 Example Apparatus
(33)
(34) As just discussed above, and referring to
(35) The control component 810 may include an operating system (OS) kernel 820, routing protocol process(es) 830, label-based forwarding protocol process(es) 840, interface process(es) 850, configuration API(s) 852, a user interface (e.g., command line interface) process(es) 854, programmatic API(s) 856, and chassis process(es) 870, and may store routing table(s) 839, label forwarding information 845, configuration information in a configuration database(s) 860 and forwarding (e.g., route-based and/or label-based) table(s) 880. As shown, the routing protocol process(es) 830 may support routing protocols such as the routing information protocol (“RIP”) 831, the intermediate system-to-intermediate system protocol (“ISIS”) 832, the open shortest path first protocol (“OSPF”) 833, the enhanced interior gateway routing protocol (“EIGRP”) 834 and the border gateway protocol (“BGP”) 835, and the label-based forwarding protocol process(es) 840 may support protocols such as BGP 835, the label distribution protocol (“LDP”) 836 and the resource reservation protocol (“RSVP”) 837. One or more components (not shown) may permit a user to interact, directly or indirectly (via an external device), with the router configuration database(s) 860 and control behavior of router protocol process(es) 830, the label-based forwarding protocol process(es) 840, the interface process(es) 850, and the chassis process(es) 870. For example, the configuration database(s) 860 may be accessed via SNMP 885, configuration API(s) (e.g., the Network Configuration Protocol (NetConf), the Yet Another Next Generation (YANG) data modeling language, etc.) 852, a user command line interface (CLI) 854, and/or programmatic API(s) 856. Control component processes may send information to an outside device via SNMP 885, syslog, streaming telemetry (e.g., Google's network management protocol (gNMI), the IP Flow Information Export (IPFIX) protocol, etc.), etc.
Similarly, one or more components (not shown) may permit an outside device to interact with one or more of the router protocol process(es) 830, the label-based forwarding protocol process(es) 840, the interface process(es) 850, configuration database(s) 860, and the chassis process(es) 870, via programmatic API(s) (e.g. gRPC) 856. Such processes may send information to an outside device via streaming telemetry. In this way, one or more ANHs consistent with the present description may be configured onto a router, such as an ASBR for example. That is, channels such as user CLI 854, SNMP 885, configuration API(s) (e.g. Netconf/XML/YANG, so an external computer system can be used to provide configuration information) 852, and/or programmatic API(s) to routing protocol process (e.g., Google's remote procedure call (gRPC) protocol, so an external software application can directly create and manipulate states of routing protocol process) 856 may be used to instantiate the ANH within the configuration database(s) 860.
(36) The packet forwarding component 890 may include a microkernel 892, interface process(es) 893, distributed ASICs 894, chassis process(es) 895 and forwarding (e.g., route-based and/or label-based) table(s) 896.
(37) In the example router 800 of
(38) Still referring to
(39) Referring to the routing protocol process(es) 830 of
(40) Still referring to
(41) The example control component 810 may provide several ways to manage the router. For example, it 810 may provide a user interface process(es) 860 which allows a system operator to interact with the system through configuration, modifications, and monitoring. The SNMP 885 allows SNMP-capable systems to communicate with the router platform. This also allows the platform to provide necessary SNMP information to external agents. For example, the SNMP 885 may permit management of the system from a network management station running software, such as Hewlett-Packard's Network Node Manager (“HP-NNM”), through a framework, such as Hewlett-Packard's OpenView. Further, as already noted above, the configuration database(s) 860 may be accessed via SNMP 885, configuration API(s) (e.g. NetConf, YANG, etc.) 852, a user CLI 854, and/or programmatic API(s) 856. Control component processes may send information to an outside device via SNMP 885, syslog, streaming telemetry (e.g., gNMI, IPFIX, etc.), etc. Similarly, one or more components (not shown) may permit an outside device to interact with one or more of the router protocol process(es) 830, the label-based forwarding protocol process(es) 840, the interface process(es) 850, and the chassis process(es) 870, via programmatic API(s) (e.g., gRPC) 856. Such processes may send information to an outside device via streaming telemetry. In this way, one or more ANHs consistent with the present description may be configured onto a router, such as an ASBR for example. That is, channels such as user CLI 854, SNMP 885, configuration API(s) (e.g. Netconf/XML/YANG, so an external computer system can be used to provide configuration information) 852, and/or programmatic API(s) to routing protocol process (e.g., gRPC, so an external software application can directly create and manipulate states of routing protocol process) 856 may be used to instantiate the ANH. 
In any of these ways, one or more ANHs may be configured onto the example router 800. Accounting of packets (generally referred to as traffic statistics) may be performed by the control component 810, thereby avoiding slowing traffic forwarding by the packet forwarding component 890.
(42) Although not shown, the example router 800 may provide for out-of-band management, RS-232 DB9 ports for serial console and remote management access, and tertiary storage using a removable PC card. Further, although not shown, a craft interface positioned on the front of the chassis provides an external view into the internal workings of the router. It can be used as a troubleshooting tool, a monitoring tool, or both. The craft interface may include LED indicators, alarm indicators, control component ports, and/or a display screen. Finally, the craft interface may provide interaction with a command line interface (“CLI”) 854 via a console port, an auxiliary port, and/or a management Ethernet port. In any of these ways, one or more ANHs may be configured onto the example router 800.
(43) The packet forwarding component 890 is responsible for properly outputting received packets as quickly as possible. If there is no entry in the forwarding table for a given destination or a given label and the packet forwarding component 890 cannot perform forwarding by itself, it 890 may send the packets bound for that unknown destination off to the control component 810 for processing. The example packet forwarding component 890 is designed to perform Layer 2 and Layer 3 switching, route lookups, and rapid packet forwarding.
(44) As shown in
(45) In the example router 800, the example method 500 consistent with the present disclosure may be implemented in BGP component 835, and perhaps partly in the user CLI processes 854, or remotely (e.g., on the cloud) via configuration API(s) 852 and/or programmatic API(s) 856.
(46) Referring back to distributed ASICs 894 of
(47) Still referring to
(48) An FPC 920 can contain one or more PICs 910, and may carry the signals from the PICs 910 to the midplane/backplane 930 as shown in
(49) The midplane/backplane 930 holds the line cards. The line cards may connect into the midplane/backplane 930 when inserted into the example router's chassis from the front. The control component (e.g., routing engine) 810 may plug into the rear of the midplane/backplane 930 from the rear of the chassis. The midplane/backplane 930 may carry electrical (or optical) signals and power to each line card and to the control component 810.
(50) The system control board 940 may perform forwarding lookup. It 940 may also communicate errors to the routing engine. Further, it 940 may also monitor the condition of the router based on information it receives from sensors. If an abnormal condition is detected, the system control board 940 may immediately notify the control component 810.
(51) Referring to
(52) The I/O manager ASIC 922 on the egress FPC 920/920′ may perform some value-added services. In addition to decrementing time to live (“TTL”) values and re-encapsulating the packet for handling by the PIC 910, it can also apply class-of-service (CoS) rules. To do this, it may queue a pointer to the packet in one of the available queues, each having a share of link bandwidth, before applying the rules to the packet. Queuing can be based on various rules. Thus, the I/O manager ASIC 922 on the egress FPC 920/920′ may be responsible for receiving the blocks from the second DBM ASIC 935/935′, decrementing TTL values, queuing a pointer to the packet, if necessary, before applying CoS rules, re-encapsulating the blocks, and sending the encapsulated packets to the PIC I/O manager ASIC 915.
(53)
(54) Referring back to block 1170, the packet may be queued. Actually, as stated earlier with reference to
(55) Referring back to block 1180 of
(56) Although example embodiments consistent with the present disclosure may be implemented on the example routers of
(57)
(58) In some embodiments consistent with the present disclosure, the processors 1210 may be one or more microprocessors and/or ASICs. The bus 1240 may include a system bus. The storage devices 1220 may include system memory, such as read only memory (ROM) and/or random access memory (RAM). The storage devices 1220 may also include a hard disk drive for reading from and writing to a hard disk, a magnetic disk drive for reading from or writing to a (e.g., removable) magnetic disk, an optical disk drive for reading from or writing to a removable (magneto-) optical disk such as a compact disk or other (magneto-) optical media, or solid-state non-volatile storage.
(59) Some example embodiments consistent with the present disclosure may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may be non-transitory and may include, but is not limited to, flash memory, optical disks, CD-ROMs, DVD ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards or any other type of machine-readable media suitable for storing electronic instructions. For example, example embodiments consistent with the present disclosure may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of a communication link (e.g., a modem or network connection) and stored on a non-transitory storage medium. The machine-readable medium may also be referred to as a processor-readable medium.
(60) Example embodiments consistent with the present disclosure (or components or modules thereof) might be implemented in hardware, such as one or more field programmable gate arrays (“FPGA”s), one or more integrated circuits such as ASICs, one or more network processors, etc. Alternatively, or in addition, embodiments consistent with the present disclosure (or components or modules thereof) might be implemented as stored program instructions executed by a processor. Such hardware and/or software might be provided in an addressed data (e.g., packet, cell, etc.) forwarding device (e.g., a switch, a router, etc.), a laptop computer, desktop computer, a tablet computer, a mobile phone, or any device that has computing and networking capabilities.
§ 5.4 Refinements, Alternatives and Extensions
(61) As noted in the '929 provisional, many large-scale service provider networks use some form of scale-out architecture at peering sites. In such an architecture, each participating Autonomous System (AS) deploys multiple independent Autonomous System Border Routers (ASBRs) for peering, and Equal Cost Multi-Path (ECMP) load balancing is used between them. There are numerous benefits to this architecture, including, but not limited to, N+1 redundancy and the ability to flexibly increase capacity as needed. A cost of this architecture is an increase in the amount of state in both the control and data planes, which has negative consequences for network convergence time and scale. Configuring routing protocols (e.g., both BGP and IGP) to use ANH in a manner consistent with the present description may be used to mitigate these negative consequences. For example, using ANH allows the number of BGP paths in the control plane to be reduced and enables rapid path withdrawal (and hence, rapid network convergence and traffic restoration).
(62)
(63) AS2 1310b includes peer devices (PEER 2.1, . . . , PEER 2.t) 1370 that may have an eBGP session with one or more of the ASBRs of site 1 1320a, site 2 1320b, and/or site 3 1320c, though not all sessions are shown.
(64) In traditional configurations such as those described with reference to
(65) Note that reachability of the ANH address in the IGP depends on eBGP session state and not inter-AS interface state, although of course, interface state may impact session state. The manner in which the IP route to the ANH address is instantiated on an ASBR and inserted into the IGP on a particular device is a matter of local implementation.
§ 5.4.1 (Egress ASBR-PEER AS) Abstract Next Hop (AP-ANH)
(66) The AP-ANH is unique to an ASBR and its peer AS. For example, in the network of
(67) Provided that all ASBRs in a given site (e.g., site 1320a in
§ 5.4.2 (SITE-PEER AS) Abstract Next Hop (SP-ANH)
(68) The AP-ANH works on an ASBR level. From a given local AS perspective, the number of ANH is proportional to the number of pairs of ASBRs and ASes each of them peers with. With hundreds of peer ASes, tens of sites and ~10 ASBRs per site, the number of AP-ANH may scale into the thousands. At the same time, it might not be necessary or even desirable for every BGP speaker in the network to have visibility to every path down to individual egress ASBR granularity. With symmetrical multiplane backbone and/or leaf-spine designs (See, e.g.,
(69) At the same time, when multiple paths are available on BGP speakers, every change is propagated, with consequent transmission and processing costs on all BGP speakers across the network. This will be true even if the route change doesn't impact the forwarding plane. For example, in the network of
(70) To avoid the above drawbacks, the RR of a given site (e.g., site 1 1320a in
§ 5.4.3 Assignment of Abstract Next Hops
(71) More details of how abstract next hops can be injected in several different common network architectures are discussed in §§ 5.4.3.1-5.4.3.3 below.
§ 5.4.3.1 Native IP Networks
(72) In native IP networks every router, including core routers, has full BGP routing information and forwards each packet based on destination IP lookup. Provided that all routers at an egress site receive multiple paths with BGP-NH set to AP-ANH (and not SP-ANH), the human operator may decide which node (RR, ASBR or CR) will inject the SP-ANH route into the IGP. One operator may believe that injection of SP-ANH by ASBRs may be simpler, as it will be done by the same procedure and policy as injection of AP-ANH. Another operator may prefer injection at RR, as it limits the number of configuration touch-points.
§ 5.4.3.2 MPLS
(73) First, assume that identical BGP address space and paths are received on all ASBRs. In the MPLS network, since traffic is carried over LSP tunnels, the SP-ANH should be injected into the IGP by a node that has the ability to perform an IP lookup. This eliminates the RR, and possibly CRs (in “BGP-free core” architectures). Instead, all ASBRs may be used to insert SP-ANH addresses into the IGP. In the case of LDP-based networks, this is sufficient. The CR will create an ECMP forwarding structure for labels of SP-ANH FEC coming from other sites. In RSVP-TE based networks, ECMP needs to happen on the ingress LSR and therefore, every BGP speaker needs to establish an LSP to every ASBR, and the SP-ANH address needs to be part of the FEC for its respective LSP. If SP-ANH is used as an RSVP (signaling) destination, some other means (such as affinity groups) needs to be used to ensure the desired 1:1 mapping of LSP to egress ASBR. Note that if MPLS is used to advertise an ANH, the ANH should be advertised with an implicit-null or explicit-null label (Penultimate-Hop-Popping or Ultimate-Hop-Popping, respectively). This is to facilitate IP lookup for packets coming from the core network going towards the device reachable through the peer-ASBR nodes. A non-null label can also be used, but only if the ANH identifies a set of eBGP sessions that all provide exactly the same set of prefixes (e.g., when eBGP over parallel links between two routers is used).
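The label rule in the preceding paragraph can be sketched as a small check. This is an illustrative sketch under stated assumptions, not an implementation from the present description: the helper names `non_null_label_allowed` and `label_for_anh`, the `prefer_php` flag, and the example label `100001` are hypothetical. The reserved values 3 (implicit null) and 0 (IPv4 explicit null) are those defined in RFC 3032.

```python
IMPLICIT_NULL = 3        # reserved label (RFC 3032); results in penultimate-hop popping
EXPLICIT_NULL_IPV4 = 0   # reserved label (RFC 3032); popped at the egress itself


def non_null_label_allowed(prefix_sets) -> bool:
    """True only if every eBGP session in the ANH's set provides the same prefixes."""
    distinct = {frozenset(s) for s in prefix_sets}
    return len(distinct) <= 1


def label_for_anh(prefix_sets, requested_label=None, prefer_php=True) -> int:
    """Pick the label to advertise for an ANH (hypothetical helper)."""
    if requested_label is not None:
        # A non-null label skips the IP lookup, so all sessions must be equivalent.
        if not non_null_label_allowed(prefix_sets):
            raise ValueError("sessions provide different prefix sets; use a null label")
        return requested_label
    return IMPLICIT_NULL if prefer_php else EXPLICIT_NULL_IPV4


default_label = label_for_anh([{"P1", "P2"}, {"P1"}])
parallel_links_label = label_for_anh([{"P1", "P2"}, {"P1", "P2"}], requested_label=100001)
```

Defaulting to a null label keeps the IP lookup on the receiving ASBR, which is what lets it choose the correct eBGP session when the sessions differ.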
(74) Alternatively, assume that different address space sets or paths are received on different ASBRs. If the set of prefixes received from a given peer AS by one ASBR is different from the set received by another one, a combination of SP-ANH and MPLS-based load balancing on a CR may lead to a situation in which an IP packet will be directed to an ASBR that lacks external routing information, and consequently can't forward traffic directly out of the AS. Similarly, if path attributes for a given prefix received by one ASBR are different from those received by another, again, packets can be directed to the “wrong” ASBR. In this case the ASBR would use the iBGP route it learned from another ASBR of the same site (via RR, with AP-ANH) and forward traffic over an LSP to the “correct” ASBR. This extra hop constitutes a sub-optimal traffic path through the network.
(75) For example, in the network of FIG. 2 of the '929 provisional, assume that prefix P2 is advertised to BR1.2-BR1.N by AS2, but not to BR1.1. Border router BR3.1 has a BGP best route to P2 with its BGP-NH set to the SP-ANH of (site 1, AS2). It resolves this BGP-NH (SP-ANH) by ECMP over N MPLS LSPs, terminating on BR1.1-BR1.N. So, some packets are forwarded by BR3.1 over an LSP that passes through CR1.x and terminates on BR1.1. Border router BR1.1 has no external route to P2, but it has (N−1) iBGP routes to P2 with BGP-NHs equal to the AP-ANHs of BR1.2-BR1.N. Therefore, BR1.1 performs an IP lookup and forwards these packets over LSPs that pass through CR1.x and terminate on BR1.2-BR1.N. Traffic is U-turned on BR1.1 and traverses the CRs at site 1 twice.
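The U-turn in this example can be traced with a toy model. This is a sketch for illustration only: the function `site1_hops` is hypothetical, N is taken to be 4, and ECMP is modeled as a flow identifier modulo the number of next hops, whereas real routers hash packet header fields.

```python
def site1_hops(flow_id: int, n_asbrs: int, with_external_route: set) -> list:
    """Return the site 1 ASBRs (1-based indices) that a flow toward P2 visits."""
    # Assumption: ECMP at BR3.1 is modeled as flow_id modulo N.
    first = flow_id % n_asbrs + 1
    if first in with_external_route:
        return [first]                      # forwarded directly out of the AS
    # No external route: fall back to the iBGP routes learned from the other ASBRs.
    candidates = sorted(with_external_route)
    second = candidates[flow_id % len(candidates)]
    return [first, second]                  # U-turn through the site's CRs


N = 4
HAS_P2 = {2, 3, 4}                          # BR1.2-BR1.N learned P2 from AS2; BR1.1 did not
paths = [site1_hops(f, N, HAS_P2) for f in range(8)]
u_turned = [p for p in paths if len(p) == 2]
```

Under these assumptions, 2 of the 8 flows (a share of 1/N) land on BR1.1 first and take the extra hop, while the remaining flows exit through the ASBR they first reach.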
(76) Such asymmetry may be considered acceptable by the provider, as long as it's a transient condition. However, in the general case, such a situation could be persistent as the result of intentional configuration on the peer AS's ASBRs. Therefore, a better solution would be to insert the SP-ANH into the IGP on CRs. In this case, CRs need to perform forwarding based on destination IP lookup. Therefore, CRs would have to be able to learn and handle large IP routing and forwarding tables—at least all prefixes learned from peer ASes by the local ASBRs.
§ 5.4.3.3 SPRING
(77) First, assume that identical BGP address space and paths are received on all ASBRs. For SPRING based networks, one can take advantage of the unique capability of Anycast-SID. (See, e.g., “Segment Routing Architecture,” Request for Comments 8402 (Internet Engineering Task Force, July 2018)(referred to as “RFC 8402” and incorporated herein by reference).) The ASBRs of a single site allocate an Anycast-SID for each SP-ANH address. This SID can be used as the only SID by an ingress BGP speaker or, if a TE routed path is desired, depending on TE constraints, the TE controller can provision a SPRING path with the Anycast-SID at the end, instructing the CR to perform load balancing among connected ASBRs.
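The two ways of using the Anycast-SID described above can be sketched as the construction of a segment list. This is a minimal sketch: the function `segment_list` and the SID values `16100`, `16011`, and `16012` are hypothetical and chosen only for illustration.

```python
def segment_list(anycast_sid: int, te_sids=()) -> list:
    """Build a SPRING segment list that ends with the SP-ANH Anycast-SID."""
    # With no TE constraints, the Anycast-SID is the only SID on the path.
    return [*te_sids, anycast_sid]


SP_ANH_ANYCAST_SID = 16100                  # hypothetical SID shared by a site's ASBRs

shortest_path = segment_list(SP_ANH_ANYCAST_SID)
te_path = segment_list(SP_ANH_ANYCAST_SID, te_sids=(16011, 16012))
```

Because the Anycast-SID is always last, the final CR on either path is the node that load-balances among the connected ASBRs.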
(78) Alternatively, assume that different address space sets or paths are received on different ASBRs. Similar to a classic MPLS environment, such a situation may lead to suboptimal routing (redirecting from one ASBR to another), or may require the CR (instead of ASBR) to insert the SP-ANH into the IGP and generate a PREFIX-SID (or Anycast-SID if there is more than one CR) for it.
§ 5.4.4 Use of ANH in Clos-Network Data Center Fabrics
(79) Referring to
§ 5.5 Conclusions
(80) Abstract Next Hop (ANH), as described above, does not require any changes to the BGP protocol itself. Rather, ANH is an architectural solution to network configuration. It uses the capabilities of existing protocols while achieving higher scale and faster routing convergence (especially in a network configured with scale-out peering sites).
(81) When the same ANH is used to represent a set of peers, it also reduces route scale and routing churn in the iBGP network. This is because one path can be advertised (or withdrawn) instead of advertising (or withdrawing) multiple paths.
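The reduction in route scale can be illustrated by counting distinct (prefix, next hop) pairs. This is a sketch under stated assumptions: the function `ibgp_paths` and the label `ANH(AS2)` are hypothetical, each of three prefixes is assumed to be learned over four eBGP sessions, and the per-peer case assumes every path is propagated with a next hop that is specific to the peer.

```python
def ibgp_paths(learned, next_hop_of) -> set:
    """Distinct (prefix, BGP next hop) pairs that would be advertised into iBGP."""
    return {(prefix, next_hop_of(peer)) for prefix, peer in learned}


prefixes = ["P1", "P2", "P3"]
peers = ["PEER 2.1", "PEER 2.2", "PEER 2.3", "PEER 2.4"]
learned = [(p, peer) for p in prefixes for peer in peers]

per_peer = ibgp_paths(learned, lambda peer: peer)          # next hop tied to each peer
with_anh = ibgp_paths(learned, lambda peer: "ANH(AS2)")    # one ANH for the whole set
```

Here the per-peer case yields 12 paths and the ANH case yields 3, so the number of paths to advertise or withdraw shrinks by a factor equal to the number of sessions the ANH represents.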
(82) ANH can also be used to drain traffic from the iBGP core, for example when an eBGP peer is being taken out of service for maintenance.