LOGICAL OVERLAY TUNNEL SELECTION
20230163997 · 2023-05-25
Assignee
Inventors
- Stephen SAUER (Bar-le-Duc, FR)
- Benoit SARDA (Les Mureaux, FR)
- Dominic FOLEY (Brentwood, GB)
- Yann SIMONET (Alfortville, FR)
CPC classification
H04L12/4604
ELECTRICITY
H04L2012/4629
ELECTRICITY
International classification
H04L45/00
ELECTRICITY
Abstract
Example methods and systems for logical overlay tunnel selection are described. One example may involve a first computer system generating and sending probe packets over multiple logical overlay tunnels and configuring routing information associated with a destination based on a comparison between tunnel state information measured using the probe packets and a desired state. In response to detecting an egress packet that is destined for the destination, the first computer system may select a first logical overlay tunnel that satisfies the desired state over a second logical overlay tunnel that does not satisfy the desired state. An encapsulated packet is then generated and sent over the first logical overlay tunnel to reach the destination. The encapsulated packet may include the egress packet and an outer header that is addressed from a first virtual tunnel endpoint (VTEP) on the first computer system to a second VTEP on a second computer system.
Claims
1. A method for a first computer system to perform logical overlay tunnel selection, wherein the method comprises: generating and sending probe packets over multiple logical overlay tunnels via which a destination is reachable from the first computer system; configuring routing information associated with the destination based on a comparison between tunnel state information measured using the probe packets and a desired state; and in response to detecting, from a virtualized computing instance on the first computer system, an egress packet that is destined for the destination, based on the routing information, selecting a first logical overlay tunnel that satisfies the desired state over a second logical overlay tunnel that does not satisfy the desired state, wherein the first logical overlay tunnel is established between a first virtual tunnel endpoint (VTEP) on the first computer system and a second VTEP on a second computer system; and generating and sending an encapsulated packet over the first logical overlay tunnel towards the second computer system to reach the destination, wherein the encapsulated packet includes the egress packet and an outer header that is addressed from the first VTEP to the second VTEP.
2. The method of claim 1, wherein configuring the routing information comprises: configuring the routing information based on control information received from a management entity that is capable of at least one of the following: (a) identifying the desired state based on one or more service level agreements and (b) performing the comparison between the tunnel state information measured and the desired state.
3. The method of claim 1, wherein the method further comprises: based on performance degradation detected using subsequent probe packets, reconfiguring the routing information to indicate that the first logical overlay tunnel no longer satisfies the desired state.
4. The method of claim 3, wherein the method further comprises: in response to detecting a subsequent egress packet from the virtualized computing instance to the destination, switching from the first logical overlay tunnel to the second logical overlay tunnel or a third logical overlay tunnel that satisfies the desired state.
5. The method of claim 1, wherein generating and sending the probe packets comprises: generating and sending the probe packets over the multiple logical overlay tunnels to cause multiple second computer systems to respond with reply packets; and based on the reply packets, generating the tunnel state information that measures at least one of the following: two-way latency, jitter, packet loss and connectivity status.
6. The method of claim 1, wherein generating and sending the probe packets comprises: generating and sending the probe packets over the multiple logical overlay tunnels to cause multiple second computer systems to generate the tunnel state information that measures at least one of the following: one-way latency, jitter, packet loss and connectivity status.
7. The method of claim 1, wherein the method comprises: prior to generating and sending the probe packets, establishing multiple monitoring sessions with multiple second computer systems in the form of a cluster of edge nodes operating in an active-active mode to provide one or more networking services for the first computer system.
8. A non-transitory computer-readable storage medium that includes a set of instructions which, in response to execution by a processor of a first computer system, cause the processor to perform a method of logical overlay tunnel selection, wherein the method comprises: generating and sending probe packets over multiple logical overlay tunnels via which a destination is reachable from the first computer system; configuring routing information associated with the destination based on a comparison between tunnel state information measured using the probe packets and a desired state; and in response to detecting, from a virtualized computing instance on the first computer system, an egress packet that is destined for the destination, based on the routing information, selecting a first logical overlay tunnel that satisfies the desired state over a second logical overlay tunnel that does not satisfy the desired state, wherein the first logical overlay tunnel is established between a first virtual tunnel endpoint (VTEP) on the first computer system and a second VTEP on a second computer system; and generating and sending an encapsulated packet over the first logical overlay tunnel towards the second computer system to reach the destination, wherein the encapsulated packet includes the egress packet and an outer header that is addressed from the first VTEP to the second VTEP.
9. The non-transitory computer-readable storage medium of claim 8, wherein configuring the routing information comprises: configuring the routing information based on control information received from a management entity that is capable of at least one of the following: (a) identifying the desired state based on one or more service level agreements and (b) performing the comparison between the tunnel state information measured and the desired state.
10. The non-transitory computer-readable storage medium of claim 8, wherein the method further comprises: based on performance degradation detected using subsequent probe packets, reconfiguring the routing information to indicate that the first logical overlay tunnel no longer satisfies the desired state.
11. The non-transitory computer-readable storage medium of claim 10, wherein the method further comprises: in response to detecting a subsequent egress packet from the virtualized computing instance to the destination, switching from the first logical overlay tunnel to the second logical overlay tunnel or a third logical overlay tunnel that satisfies the desired state.
12. The non-transitory computer-readable storage medium of claim 8, wherein generating and sending the probe packets comprises: generating and sending the probe packets over the multiple logical overlay tunnels to cause multiple second computer systems to respond with reply packets; and based on the reply packets, generating the tunnel state information that measures at least one of the following: two-way latency, jitter, packet loss and connectivity status.
13. The non-transitory computer-readable storage medium of claim 8, wherein generating and sending the probe packets comprises: generating and sending the probe packets over the multiple logical overlay tunnels to cause multiple second computer systems to generate the tunnel state information that measures at least one of the following: one-way latency, jitter, packet loss and connectivity status.
14. The non-transitory computer-readable storage medium of claim 8, wherein the method comprises: prior to generating and sending the probe packets, establishing multiple monitoring sessions with multiple second computer systems in the form of a cluster of edge nodes operating in an active-active mode to provide one or more networking services for the first computer system.
15. A computer system, being a first computer system, configured to perform logical overlay tunnel selection, wherein the computer system comprises: a processor; and a non-transitory computer-readable medium having stored thereon instructions that, when executed by the processor, cause the processor to: generate and send probe packets over multiple logical overlay tunnels via which a destination is reachable from the first computer system; configure routing information associated with the destination based on a comparison between tunnel state information measured using the probe packets and a desired state; and in response to detecting, from a virtualized computing instance on the first computer system, an egress packet that is destined for the destination, based on the routing information, select a first logical overlay tunnel that satisfies the desired state over a second logical overlay tunnel that does not satisfy the desired state, wherein the first logical overlay tunnel is established between a first virtual tunnel endpoint (VTEP) on the first computer system and a second VTEP on a second computer system; and generate and send an encapsulated packet over the first logical overlay tunnel towards the second computer system to reach the destination, wherein the encapsulated packet includes the egress packet and an outer header that is addressed from the first VTEP to the second VTEP.
16. The computer system of claim 15, wherein the instructions for configuring the routing information cause the processor to: configure the routing information based on control information received from a management entity that is capable of at least one of the following: (a) identifying the desired state based on one or more service level agreements and (b) performing the comparison between the tunnel state information measured and the desired state.
17. The computer system of claim 15, wherein the instructions further cause the processor to: based on performance degradation detected using subsequent probe packets, reconfigure the routing information to indicate that the first logical overlay tunnel no longer satisfies the desired state.
18. The computer system of claim 17, wherein the instructions further cause the processor to: in response to detecting a subsequent egress packet from the virtualized computing instance to the destination, switch from the first logical overlay tunnel to the second logical overlay tunnel or a third logical overlay tunnel that satisfies the desired state.
19. The computer system of claim 15, wherein the instructions for generating and sending the probe packets cause the processor to: generate and send the probe packets over the multiple logical overlay tunnels to cause multiple second computer systems to respond with reply packets; and based on the reply packets, generate the tunnel state information that measures at least one of the following: two-way latency, jitter, packet loss and connectivity status.
20. The computer system of claim 15, wherein the instructions for generating and sending the probe packets cause the processor to: generate and send the probe packets over the multiple logical overlay tunnels to cause multiple second computer systems to generate the tunnel state information that measures at least one of the following: one-way latency, jitter, packet loss and connectivity status.
21. The computer system of claim 15, wherein the instructions further cause the processor to: prior to generating and sending the probe packets, establish multiple monitoring sessions with multiple second computer systems in the form of a cluster of edge nodes operating in an active-active mode to provide one or more networking services for the first computer system.
Description
BRIEF DESCRIPTION OF DRAWINGS
DETAILED DESCRIPTION
[0010] According to examples of the present disclosure, logical overlay tunnel selection may be implemented more dynamically based on tunnel state information. One example may involve a first computer system (e.g., host-A 210A in
[0011] In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the drawings, can be arranged, substituted, combined, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein. Although the terms “first” and “second” are used to describe various elements, these elements should not be limited by these terms. These terms are used to distinguish one element from another. For example, a first element may be referred to as a second element, and vice versa.
[0013] In the example in
[0014] Referring also to
[0015] Hypervisor 214A/214B maintains a mapping between underlying hardware 212A/212B and virtual resources allocated to respective VMs. Virtual resources are allocated to respective VMs 231-234 to support a guest operating system (OS; not shown for simplicity) and application(s); see 241-244, 251-254. For example, the virtual resources may include virtual CPU, guest physical memory, virtual disk, virtual network interface controller (VNIC), etc. Hardware resources may be emulated using virtual machine monitors (VMMs). For example in
[0016] Although examples of the present disclosure refer to VMs, it should be understood that a “virtual machine” running on a host is merely one example of a “virtualized computing instance” or “workload.” A virtualized computing instance may represent an addressable data compute node (DCN) or isolated user space instance. In practice, any suitable technology may be used to provide isolated user space instances, not just hardware virtualization. Other virtualized computing instances may include containers (e.g., running within a VM or on top of a host operating system without the need for a hypervisor or separate operating system or implemented as an operating system level virtualization), virtual private servers, client computers, etc. Such container technology is available from, among others, Docker, Inc. The VMs may also be complete computational environments, containing virtual equivalents of the hardware and software components of a physical computing system.
[0017] The term “hypervisor” may refer generally to a software layer or component that supports the execution of multiple virtualized computing instances, including system-level software in guest VMs that supports namespace containers such as Docker, etc. Hypervisors 214A-B may each implement any suitable virtualization technology, such as VMware ESX® or ESXi™ (available from VMware, Inc.), Kernel-based Virtual Machine (KVM), etc. The term “packet” may refer generally to a group of bits that can be transported together, and may be in another form, such as “frame,” “message,” “segment,” etc. The term “traffic” or “flow” may refer generally to multiple packets. The term “layer-2” may refer generally to a link layer or media access control (MAC) layer; “layer-3” a network or Internet Protocol (IP) layer; and “layer-4” a transport layer (e.g., using Transmission Control Protocol (TCP), User Datagram Protocol (UDP), etc.), in the Open System Interconnection (OSI) model, although the concepts described herein may be used with other networking models.
[0018] SDN controller 270 and SDN manager 272 are example network management entities in SDN environment 100. One example of an SDN controller is the NSX controller component of VMware NSX® (available from VMware, Inc.) that operates on a central control plane. SDN controller 270 may be a member of a controller cluster (not shown for simplicity) that is configurable using SDN manager 272. Network management entity 270/272 may be implemented using physical machine(s), VM(s), or both. To send or receive control information, a local control plane (LCP) agent (not shown) on host 210A/210B may interact with SDN controller 270 via control-plane channel 201/202.
[0019] Through virtualization of networking services in SDN environment 100, logical networks (also referred to as overlay networks or logical overlay networks) may be provisioned, changed, stored, deleted and restored programmatically without having to reconfigure the underlying physical hardware architecture. Hypervisor 214A/214B implements virtual switch 215A/215B and logical distributed router (DR) instance 217A/217B to handle egress packets from, and ingress packets to, VMs 231-234. In SDN environment 100, logical switches and logical DRs may be implemented in a distributed manner and can span multiple hosts.
[0020] For example, a logical switch (LS) may be deployed to provide logical layer-2 connectivity (i.e., an overlay network) to VMs 231-234. A logical switch may be implemented collectively by virtual switches 215A-B and represented internally using forwarding tables 216A-B at respective virtual switches 215A-B. Forwarding tables 216A-B may each include entries that collectively implement the respective logical switches. Further, logical DRs that provide logical layer-3 connectivity may be implemented collectively by DR instances 217A-B and represented internally using routing tables (not shown) at respective DR instances 217A-B. Each routing table may include entries that collectively implement the respective logical DRs.
[0021] Packets may be received from, or sent to, each VM via an associated logical port. For example, logical switch ports 265-268 (labelled “LSP1” to “LSP4”) are associated with respective VMs 231-234. Here, the term “logical port” or “logical switch port” may refer generally to a port on a logical switch to which a virtualized computing instance is connected. A “logical switch” may refer generally to a software-defined networking (SDN) construct that is collectively implemented by virtual switches 215A-B, whereas a “virtual switch” may refer generally to a software switch or software implementation of a physical switch. In practice, there is usually a one-to-one mapping between a logical port on a logical switch and a virtual port on virtual switch 215A/215B. However, the mapping may change in some scenarios, such as when the logical port is mapped to a different virtual port on a different virtual switch after migration of the corresponding virtualized computing instance (e.g., when the source host and destination host do not have a distributed virtual switch spanning them).
[0022] A logical overlay network may be formed using any suitable tunneling protocol, such as Virtual eXtensible Local Area Network (VXLAN), Stateless Transport Tunneling (STT), Generic Network Virtualization Encapsulation (GENEVE), Generic Routing Encapsulation (GRE), etc. For example, VXLAN is a layer-2 overlay scheme on a layer-3 network that uses tunnel encapsulation to extend layer-2 segments across multiple hosts which may reside on different layer 2 physical networks. Hypervisor 214A/214B may implement virtual tunnel endpoint (VTEP) 219A/219B to encapsulate and decapsulate packets with an outer header (also known as a tunnel header) identifying the relevant logical overlay network (e.g., VNI). Hosts 210A-B may maintain data-plane connectivity with each other via physical network 205 to facilitate east-west communication among VMs 231-234.
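As a concrete illustration of the encapsulation described above, the following sketch builds the 8-byte VXLAN shim header defined in RFC 7348 and prepends it to an inner frame. The helper names are illustrative only, and the outer IP/UDP header (addressed between the source and destination VTEPs) is omitted for brevity.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348).

    Byte 0 carries the I flag (0x08), marking a valid VNI; the
    24-bit VNI occupies bytes 4-6, followed by a reserved byte.
    """
    flags = 0x08
    # "!B3xI" = flags byte, 3 reserved bytes, 4-byte field whose
    # top 3 bytes hold the VNI (hence the left shift by 8).
    return struct.pack("!B3xI", flags, vni << 8)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    # Only the VXLAN shim plus inner frame are shown; the sending
    # VTEP would prepend the outer Ethernet/IP/UDP headers.
    return vxlan_header(vni) + inner_frame

hdr = vxlan_header(5001)
assert len(hdr) == 8
assert hdr[0] == 0x08
assert int.from_bytes(hdr[4:7], "big") == 5001
```

The same pattern applies to other tunneling protocols named above (GENEVE, GRE, STT); only the shim layout differs.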
[0023] Hosts 210A-B may also maintain data-plane connectivity with cluster 110 of multiple (M) EDGE nodes 111-11M in
[0024] In the example in
[0025] In practice, however, logical overlay tunnels 101-103 may be susceptible to various performance issues. At one point in time, one tunnel may have better performance than another. For example, a tunnel that is selected using the hash-based approach may have high latency that affects the quality of packet flows. As a result, in some cases, a data center service provider may be unable to fulfil a service level agreement (SLA) signed with a data center customer, which is undesirable.
[0026] Logical Overlay Tunnel Selection
[0027] According to examples of the present disclosure, logical overlay tunnel selection may be implemented more dynamically based on tunnel state information that is measured in real time. Some examples will be described using
[0028] At 310 in
[0029] At 320 in
[0030] The term “desired state” may include a target performance level or threshold for a particular network characteristic or metric. For example, the desired state may specify one threshold (e.g., maximum latency in
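A minimal sketch of such a threshold comparison is shown below. The field names and threshold values are hypothetical, chosen only to illustrate how measured tunnel state could be checked against a desired state.

```python
# Hypothetical thresholds derived from an SLA profile; the field
# names are illustrative and not taken from the disclosure.
DESIRED_STATE = {"max_latency_ms": 50, "max_jitter_ms": 10, "max_loss_pct": 1.0}

def satisfies_desired_state(state: dict, desired: dict = DESIRED_STATE) -> bool:
    """Return True when every measured metric meets its threshold."""
    return (state["latency_ms"] <= desired["max_latency_ms"]
            and state["jitter_ms"] <= desired["max_jitter_ms"]
            and state["loss_pct"] <= desired["max_loss_pct"]
            and state["status"] == "UP")

tun_fast = {"latency_ms": 20, "jitter_ms": 2, "loss_pct": 0.0, "status": "UP"}
tun_slow = {"latency_ms": 80, "jitter_ms": 2, "loss_pct": 0.0, "status": "UP"}
assert satisfies_desired_state(tun_fast)
assert not satisfies_desired_state(tun_slow)
```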
[0031] At 330-340 in
[0032] At 350-360 in
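The selection step can be sketched as a routing-information lookup that prefers a tunnel satisfying the desired state. Tunnel labels such as TUN-1 follow the examples in this disclosure; the data layout itself is a hypothetical illustration.

```python
def select_tunnel(destination: str, routing_info: dict) -> str:
    """Pick the first candidate tunnel marked as satisfying the
    desired state; fall back to the first candidate otherwise
    (at which point a notification could be raised)."""
    candidates = routing_info[destination]   # ordered (tunnel, ok) pairs
    for tunnel, satisfies in candidates:
        if satisfies:
            return tunnel
    return candidates[0][0]

routing_info = {"10.0.0.0/24": [("TUN-1", False), ("TUN-2", True), ("TUN-3", True)]}
assert select_tunnel("10.0.0.0/24", routing_info) == "TUN-2"
```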
[0033] Using examples of the present disclosure, logical overlay tunnel selection may better adapt to varying network characteristics. Unlike conventional hash-based approaches that are agnostic to network conditions, tunnel state information may be measured in real time to improve logical overlay tunnel selection to achieve better packet flow quality and VM performance. Since the desired state may be derived based on SLA(s) between a data center service provider and a service customer, examples of the present disclosure may be implemented to improve the likelihood of SLA fulfilment during overlay network traffic forwarding in SDN environment 100.
[0034] Further, as will be described using
[0035] Logical Overlay Tunnel Monitoring
[0036]
[0037] (a) Logical Overlay Tunnels
[0038] At 410-415 in
[0039] In the example in
[0040] In practice, EDGE cluster 110 with M=3 nodes in
[0041] Each EDGE node may be located at the same geographical site as host-A 210A, or a different site. In practice, multiple EDGE nodes may be deployed at different sites for failover and disaster recovery purposes. For example, one or more service providers may be selected for site A. When there is a failure affecting external connectivity at site A, EDGE node(s) at site B may be selected as an exit point for VMs located at site A. Using examples of the present disclosure, traffic may be spread across multiple EDGE nodes that are deployed at different sites and operate in an active-active mode. Depending on the desired implementation, various constraints may be considered when deploying a cluster of EDGE nodes across multiple sites, such as security, return traffic, hairpinning traffic, etc.
[0042] (b) Monitoring Sessions
[0043] At 420-425 in
[0044] Using BFD as an example in
[0045] (c) Tunnel State Information
[0046] At 430 in
[0047] In the example in
[0048] Any suitable tunnel state information (denoted as STATE-i for TUN-i) may be measured or generated in real time based on probe packets 521-523. For example, tunnel state information may include at least one of the following metrics: connectivity status (e.g., UP or DOWN), packet latency or delay, packet loss, jitter, etc. In practice, one-way latency is the time required to transmit a packet from a source to a destination. For two-way latency, the round-trip time (RTT) is the time required to transmit a packet from the source to the destination, then back to the source. Packet loss may refer generally to the number of packets lost per a fixed number (e.g., 100) sent. In this case, block 430 may involve host-A 210A tagging each probe packet with a monotonically increasing index or sequence number for packet loss detection. Jitter may refer generally to a variance in latency over time. As network characteristics vary, the tunnel state information measured for a particular tunnel also changes in real time.
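The metrics named above can be derived from probe records as sketched below. Jitter is taken here as the standard deviation of the measured round-trip times, one of several common definitions of latency variance; the function signature is a hypothetical illustration.

```python
from statistics import pstdev

def tunnel_metrics(sent_seq: list, received_seq: list, rtts_ms: list) -> dict:
    """Derive tunnel state from probe records.

    Probes are tagged with monotonically increasing sequence
    numbers; sequence numbers missing from the replies indicate
    packet loss.
    """
    lost = len(set(sent_seq) - set(received_seq))
    return {
        "latency_ms": sum(rtts_ms) / len(rtts_ms),  # mean two-way latency
        "jitter_ms": pstdev(rtts_ms),               # variance in latency
        "loss_pct": 100.0 * lost / len(sent_seq),   # loss per packets sent
        "status": "UP" if received_seq else "DOWN",
    }

m = tunnel_metrics(sent_seq=list(range(100)),
                   received_seq=[s for s in range(100) if s != 7],
                   rtts_ms=[10.0, 12.0, 11.0])
assert m["loss_pct"] == 1.0 and m["status"] == "UP"
```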
[0049] At 435 in
[0050] Alternatively or additionally, at 445-450 in
[0051] In the example in
[0052] Based on reply packets 531-533, monitoring agent 218A on host-A 210A may generate tunnel state information (denoted as STATE-i) for each tunnel (TUN-i). For example in
[0053] (d) Routing Information Configuration
[0054] At 455-460 in
[0055] In practice, the desired state may be configured using any suitable approach, such as based on SLA(s), etc. In general, an SLA is a contract between a data center service provider and a service customer to identify service(s) supported by the service provider, performance metric(s) for each service, target performance threshold for each metric, etc. For example, SDN manager 272 on the management plane may provide a user interface to create a service profile based on network objects and apply an SLA profile to the service profile. Example network objects may include layer-3 objects (e.g., IP addresses, IP address groups, prefixes) and layer-4 objects (e.g., TCP/UDP ports). In a first example, an SLA profile may be configured to select all possible paths with latency under a maximum latency (t-max). If no path satisfies this requirement, the “best” path may be selected and a notification is sent to a network administrator. In another example, an SLA profile may be configured to select the path with the lowest latency, or the path with the lowest combination of jitter and packet loss for voice over IP (VoIP) packets.
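The first example SLA profile above can be sketched as follows: keep every path whose latency is under the maximum (t-max), and fall back to the single best path when none qualifies. The data layout and function name are hypothetical.

```python
def select_paths(states: dict, t_max_ms: float) -> list:
    """Return all tunnels with latency under t_max_ms; if none
    qualifies, return the lowest-latency tunnel as the 'best'
    path (a notification to the administrator would be sent)."""
    ok = [t for t, s in states.items() if s["latency_ms"] < t_max_ms]
    if ok:
        return ok
    # ...notify the network administrator that no path meets t_max...
    return [min(states, key=lambda t: states[t]["latency_ms"])]

states = {"TUN-1": {"latency_ms": 30},
          "TUN-2": {"latency_ms": 45},
          "TUN-3": {"latency_ms": 90}}
assert select_paths(states, t_max_ms=50) == ["TUN-1", "TUN-2"]
assert select_paths(states, t_max_ms=10) == ["TUN-1"]  # best-path fallback
```

The VoIP variant mentioned above would differ only in its key function, e.g. minimizing a weighted combination of jitter and packet loss instead of latency.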
[0056] In the example in
[0057] At 475 in
[0058] Dynamic Tunnel Selection
[0059] At 480-485 in
[0060] Some examples will now be discussed using
[0061] At 611 in
[0062] At 620 in
[0063] At 630 in
[0064] At 640 in
[0065] At 650 in
[0066] Performance Degradation
[0067] Using examples of the present disclosure, logical overlay tunnel selection may be updated dynamically according to real-time tunnel state information. An example will be discussed using
[0068] At 710 in
[0069] At 720 in
[0070] At 740-750 in
[0071] At 760 in
[0072] At 770 in
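Reconfiguration on performance degradation can be sketched as marking the degraded tunnel so that subsequent egress packets switch to a healthy one. The routing-information layout here is a hypothetical illustration.

```python
def reconfigure(routing_info: dict, destination: str, degraded: str) -> None:
    """Mark the degraded tunnel as no longer satisfying the desired
    state so later lookups for this destination skip it."""
    routing_info[destination] = [
        (tunnel, False if tunnel == degraded else satisfies)
        for tunnel, satisfies in routing_info[destination]
    ]

routing_info = {"10.0.0.0/24": [("TUN-1", True), ("TUN-2", True)]}
reconfigure(routing_info, "10.0.0.0/24", "TUN-1")
assert routing_info["10.0.0.0/24"] == [("TUN-1", False), ("TUN-2", True)]
```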
[0073] In practice, any suitable approach may be used to resolve issues relating to out-of-order delivery using multiple tunnels, such as the Stream Control Transmission Protocol (SCTP), which provides sequenced delivery of user messages within multiple streams, with an option for order-of-arrival delivery of individual messages. SCTP is standardized by the IETF in RFC 4960, which is incorporated herein by reference.
[0074] Multiple VTEP Configuration
[0075] Examples of the present disclosure may be implemented by host-A 210A with multiple VTEPs. Some examples will be described using
[0076] In the example in
[0077] According to the example in
[0078] At 840 in
[0079] At 870-880 in
[0080] At 890-891 in
[0081] At EDGE1 111, any suitable processing may be performed before the egress packet (P3) is forwarded towards destination 104. Note that routing information 880 may be reconfigured over time based on tunnel state information measured in real time. In the event of performance degradation, the selected tunnel 831 (see TUN-1) may be excluded from routing information 880 and a different tunnel may be selected for subsequent packets. Reconfiguration of routing information has been described using
[0082] Container Implementation
[0083] Although discussed using VMs 231-234, it should be understood that logical overlay tunnel selection may be performed for other virtualized computing instances, such as containers, etc. The term “container” (also known as “container instance”) is used generally to describe an application that is encapsulated with all its dependencies (e.g., binaries, libraries, etc.). For example, multiple containers may be executed as isolated processes inside VM1 231, where a different VNIC is configured for each container. Each container is “OS-less”, meaning that it does not include any OS that could weigh tens of gigabytes (GB). This makes containers more lightweight, portable, efficient and suitable for delivery into an isolated OS environment. Running containers inside a VM (known as the “containers-on-virtual-machine” approach) not only leverages the benefits of container technologies but also those of virtualization technologies.
[0084] Computer System
[0085] The above examples can be implemented by hardware (including hardware logic circuitry), software or firmware or a combination thereof. The above examples may be implemented by any suitable computing device, computer system, etc. The computer system may include processor(s), memory unit(s) and physical NIC(s) that may communicate with each other via a communication bus, etc. The computer system may include a non-transitory computer-readable medium having stored thereon instructions or program code that, when executed by the processor, cause the processor to perform processes described herein with reference to
[0086] The techniques introduced above can be implemented in special-purpose hardwired circuitry, in software and/or firmware in conjunction with programmable circuitry, or in a combination thereof. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), and others. The term ‘processor’ is to be interpreted broadly to include a processing unit, ASIC, logic unit, or programmable gate array etc.
[0087] The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or any combination thereof.
[0088] Those skilled in the art will recognize that some aspects of the embodiments disclosed herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computing systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and or firmware would be well within the skill of one of skill in the art in light of this disclosure.
[0089] Software and/or firmware to implement the techniques introduced here may be stored on a non-transitory computer-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “computer-readable storage medium”, as the term is used herein, includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant (PDA), mobile device, manufacturing tool, any device with a set of one or more processors, etc.). A computer-readable storage medium may include recordable/non-recordable media (e.g., read-only memory (ROM), random access memory (RAM), magnetic disk or optical storage media, flash memory devices, etc.).
[0090] The drawings are only illustrations of an example, wherein the units or procedure shown in the drawings are not necessarily essential for implementing the present disclosure. Those skilled in the art will understand that the units in the device in the examples can be arranged in the device in the examples as described or can be alternatively located in one or more devices different from that in the examples. The units in the examples described can be combined into one module or further divided into a plurality of sub-units.