Congestion notification packet generation by switch
11005770 · 2021-05-11
Assignee
Inventors
- Barak Gafni (Campbell, CA)
- Eitan Zahavi (Zichron Yaakov, IL)
- Gil Levy (Hod Hasharon, IL)
- Aviv Kfir (Nili, IL)
- Liron Mula (Ramat Gan, IL)
Cpc classification
H04L47/35
ELECTRICITY
H04L47/263
ELECTRICITY
International classification
Abstract
Network communication is carried out by sending packets from a source network interface toward a destination network interface, receiving one of the packets in an intermediate switch of the network, determining that the intermediate switch is experiencing network congestion, generating in the intermediate switch a congestion notification packet for the received packet, and transmitting the congestion notification packet from the intermediate switch to the source network interface via the network. The received packet is forwarded from the intermediate switch toward the destination network interface. The source network interface may modify a rate of packet transmission responsively to the congestion notification packet.
Claims
1. A method of communication, comprising the steps of: sending packets over a network from a source network interface toward a destination network interface; receiving one of the packets in an intermediate switch of the network; determining that the intermediate switch is experiencing network congestion; generating in the intermediate switch a congestion notification packet for the received packet; transmitting the congestion notification packet from the intermediate switch to the source network interface via the network; responsively to the congestion notification packet modifying a rate of packet transmission to the destination network interface from the source network interface; forwarding the received packet from the intermediate switch toward the destination network interface; and prior to forwarding the received packet: determining whether all other intermediate switches are capable of transmitting congestion notification packets to the source network interface; responsive to determining that all other intermediate switches are capable of transmitting congestion notification packets to the source network interface, marking the received packet in a first manner indicating that the received packet is ineligible to cause other intermediate switches of the network to generate and transmit new congestion notification packets; and responsive to determining that less than all other intermediate switches are capable of transmitting congestion notification packets to the source network interface, marking the received packets in a second manner.
2. The method according to claim 1, wherein the received packet is RoCEV2-compliant.
3. The method according to claim 1, wherein the received packet is a tunnel packet.
4. The method according to claim 1, wherein sending and receiving the packets are performed using a source queue pair (source QP) and a destination queue pair (destination QP), respectively, wherein the step of generating comprises obtaining the source QP by maintaining in the intermediate switch a translation table between the destination QP and the source QP.
5. The method according to claim 1, wherein sending and receiving the packets are performed using a source queue pair (source QP) and a destination queue pair (destination QP), respectively, wherein the step of generating comprises calculating a hash function on a destination address and the destination QP of the received packet.
6. A communication apparatus, comprising: a source network interface; a destination network interface, wherein the source network interface is operative for sending packets over a network toward the destination network interface and the destination network interface is operative for accepting the packets from the source network interface; and an intermediate switch in the network that receives one of the packets, the intermediate switch being operative for: determining that the intermediate switch is experiencing network congestion; generating a congestion notification packet for the received packet; transmitting the congestion notification packet to the source network interface via the network; and forwarding the received packet toward the destination network interface via at least one other intermediate switch, wherein responsively to the congestion notification packet the source network interface is operative for modifying a rate of packet transmission to the destination network interface, and the intermediate switch is also operative for, prior to forwarding the received packet: determining whether all other intermediate switches are capable of transmitting congestion notification packets to the source network interface; responsive to determining that all other intermediate switches are capable of transmitting congestion notification packets to the source network interface, marking the received packet in a first manner indicating that the received packet is ineligible to cause other intermediate switches of the network to generate and transmit new congestion notification packets; and responsive to determining that less than all other intermediate switches are capable of transmitting congestion notification packets to the source network interface, marking the received packets in a second manner.
7. The apparatus according to claim 6, wherein the received packet is RoCEV2-compliant.
8. The apparatus according to claim 6, wherein the received packet is a tunnel packet.
9. The apparatus according to claim 6, wherein sending the packets from the source network interface and accepting the packets in the destination network interface are performed using a source queue pair (source QP) and a destination queue pair (destination QP), respectively, wherein the step of generating comprises obtaining the source QP by maintaining in the intermediate switch a translation table between the destination QP and the source QP.
10. The apparatus according to claim 6, wherein sending the packets from the source network interface and accepting the packets in the destination network interface are performed using a source queue pair (source QP) and a destination queue pair (destination QP), respectively, wherein the step of generating comprises calculating a hash function on a destination address and the destination QP of the received packet.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
(1) For a better understanding of the present invention, reference is made to the detailed description of the invention, by way of example, which is to be read in conjunction with the following drawings, wherein like elements are given like reference numerals, and wherein:
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION OF THE INVENTION
(8) In the following description, numerous specific details are set forth in order to provide a thorough understanding of the various principles of the present invention. It will be apparent to one skilled in the art, however, that not all these details are necessarily always needed for practicing the present invention. In this instance, well-known circuits, control logic, and the details of computer program instructions for conventional algorithms and processes have not been shown in detail in order not to obscure the general concepts unnecessarily.
(9) Documents incorporated by reference herein are to be considered an integral part of the Application except that, to the extent that any terms are defined in these incorporated documents in a manner that conflicts with definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
(10) System Description.
(11)
(12) Each HCA comprises a congestion control unit 44, which takes measures for mitigating congestion of packets in network 32. Congestion control unit 44 comprises a pool of rate limiters 48 (RL) that regulate the transmission rate of packets. The congestion control methods applied by congestion control unit 44 are described in detail further below.
(13) The example of
(14) Packets that are sent from HCA 24 to HCA 25 may traverse various network elements in network 32. In the present example, the packets traverse a certain path in the network that passes through a switch 52. Switch 52 comprises multiple queues 56 that queue the packets traversing the switch, shown representatively in the present example as four queues. In alternative embodiments, the packets may traverse various paths that may each pass through multiple network elements.
(15) The HCAs 24, 25, switch 52 and the system configurations shown in
(16) In some embodiments, certain HCA functions may be implemented using a general-purpose computer, which is programmed in software to carry out the functions described herein. In one example embodiment, such functions may be performed by a processor of host 28. The software may be downloaded to the computer in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
(17) In some practical cases, certain components of the network elements in network 32 are loaded with heavy traffic and may cause large delays or even packet loss. In the example of
(18) Congestion Control Scheme.
(19) A RoCEV2 packet has the format shown in
(20) :
(21) TABLE 2
ECN field (binary)    Meaning
00                    Not ECT capable
01                    ECT capable
10                    ECT capable
11                    Congestion Encountered
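The codepoints of Table 2 can be sketched in code as follows. This is a minimal illustration, not part of the described embodiments. It assumes the usual IP header layout, in which the ECN field occupies the two low-order bits of the IPv4 TOS (or IPv6 traffic class) byte and the DSCP occupies the upper six bits. The names `Ecn`, `ecn_of` and `with_ecn` are hypothetical.

```python
from enum import IntEnum


class Ecn(IntEnum):
    """ECN codepoints as listed in Table 2 (names are illustrative)."""
    NOT_ECT = 0b00  # not ECT capable
    ECT_01 = 0b01   # ECT capable
    ECT_10 = 0b10   # ECT capable
    CE = 0b11       # congestion encountered


def ecn_of(tos: int) -> Ecn:
    """Extract the ECN codepoint from the two low-order bits of the TOS byte."""
    return Ecn(tos & 0b11)


def with_ecn(tos: int, ecn: Ecn) -> int:
    """Return the TOS byte with its ECN bits replaced and its DSCP bits kept."""
    return (tos & 0xFC) | int(ecn)
```

For instance, `with_ecn(tos, Ecn.CE)` models the conventional marking performed by a congested switch, while leaving the DSCP bits of `tos` untouched.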
(22) In conventional RoCE networks the switches signal congestion by setting the ECN field 62 to the binary value 11. The packet then continues its transit through the network, and reaches a destination node. Once the destination node receives a packet whose ECN field has the binary value 11, it generates a CNP designated for the sender of the received packet. The CNP has the format shown in
(23) In embodiments of the invention, an intermediate fabric switch receiving a packet that experiences traffic congestion generates a congestion notification packet (CNP). Local congestion at a switch can be recognized by known methods, for example, because its queues are filled. However, instead of marking the ECN fields, a CNP is generated. The CNP is sent by the switch to the source node of the received packet. When the source node receives the CNP it reduces the traffic injection rate of the congesting flow as if the CNP had been generated from the packet destination node. The term switch includes network elements or nodes that have network interfaces such as NICs, in which the network elements perform the function of a switch.
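The reaction of the source node to a CNP can be sketched as below. The description states only that the source reduces the injection rate of the congesting flow; it does not give the reduction rule. The multiplicative halving, the rate floor, and the names `RateLimiterPool`, `rate` and `on_cnp` are therefore assumptions chosen purely for illustration.

```python
class RateLimiterPool:
    """Toy model of a pool of per-flow rate limiters (hypothetical API)."""

    def __init__(self, line_rate: float, decrease_factor: float = 0.5,
                 floor: float = 0.1):
        self.line_rate = line_rate              # rate used for unthrottled flows
        self.decrease_factor = decrease_factor  # assumed reduction per CNP
        self.floor = floor                      # assumed minimum rate
        self._rates = {}                        # flow key -> current rate

    def rate(self, flow) -> float:
        """Current rate of a flow; flows that never saw a CNP run at line rate."""
        return self._rates.get(flow, self.line_rate)

    def on_cnp(self, flow) -> float:
        """Reduce the rate of the flow named by a received CNP."""
        new_rate = max(self.floor, self.rate(flow) * self.decrease_factor)
        self._rates[flow] = new_rate
        return new_rate
```

The handler is the same whether the CNP came from the destination node or from an intermediate switch, which matches the statement above that the source reacts as if the CNP had been generated by the destination.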
(24) The switch may be configured to handle each of the following embodiments:
First Embodiment (Nonvirtualized Networks Handling Native Packets)
(25) RoCEV2-compliant switches are ECN-aware, and switches indicate network congestion by marking the ECN field 62 as binary 11.
(26) Regarding the CNP, in the format of
(27) L2-MAC segment 66 is built as defined for normal router flow in the switch. Fields of the L3-IP segment 68 are assigned according to the RoCEV2 standard.
(28) Reference is now made to
(29) At initial step 70, a RoCEV2 packet, generally belonging to a flow, is received in a network switch or other network element.
(30) Next, at decision step 72, it is determined if a condition of network congestion at the switch exists. This determination is made by any suitable method of congestion determination known in the art. For example, the switch may monitor its internal queues. If the determination at decision step 72 is negative, then the received packet is non-congested. Control proceeds to final step 74, where the non-congested packet is forwarded or processed conventionally by the switch in accordance with its routing information.
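One way decision step 72 might be realized by queue monitoring is sketched below. The description leaves the detection method open, so the fixed occupancy threshold and the function name `is_congested` are assumptions made for illustration.

```python
from typing import Iterable


def is_congested(queue_depths: Iterable[int], threshold: int) -> bool:
    """Report congestion when any monitored queue holds at least `threshold`
    entries. The threshold rule is an assumption, not taken from the text."""
    return any(depth >= threshold for depth in queue_depths)
```

A switch with four queues, as in the example system, would call this with the four current depths each time a packet is received.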
(31) If the determination at decision step 72 is affirmative, then control proceeds to decision step 76, where it is determined if the received packet (referred to as the congested packet) is eligible for application of the principles of the invention. Referring again to
(32) 1) The congested packet is RoCEV2-compliant. RoCEV2 compliance requires that the ECN field in the L3-IP segment 60 is available for signaling congestion.
(33) 2) The congested packet is eligible for ECN marking. A packet having the binary value 00 in ECN field 62 is not eligible.
(34) 3) The congested packet is in fact facing congestion as discussed above.
(35) 4) The congested packet is not being dropped by the switch (excluding buffer congestion). It is an advantage that a CNP is sent to the source node even when the packet is dropped due to buffer congestion. In conventional networks in which the destination node generates the CNP, a CNP would not be generated in this case.
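The four conditions of decision step 76 can be sketched as a single predicate. This is an illustrative reading only. Identifying RoCEV2 traffic by UDP destination port 4791 is an assumption about how a switch might classify packets. The `PacketInfo` record, its `drop_reason` field and the function `cnp_eligible` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ROCEV2_UDP_PORT = 4791  # assumed classifier for RoCEV2 traffic


@dataclass
class PacketInfo:
    udp_dport: Optional[int]           # None when the packet is not UDP
    ecn: int                           # two-bit ECN field of the L3-IP segment
    drop_reason: Optional[str] = None  # None, "buffer", or another reason


def cnp_eligible(pkt: PacketInfo, congested: bool) -> bool:
    """Evaluate conditions 1) to 4) for generating a CNP."""
    if pkt.udp_dport != ROCEV2_UDP_PORT:        # 1) RoCEV2-compliant
        return False
    if pkt.ecn == 0b00:                         # 2) eligible for ECN marking
        return False
    if not congested:                           # 3) actually facing congestion
        return False
    if pkt.drop_reason not in (None, "buffer"):  # 4) not dropped, except for
        return False                             #    buffer congestion
    return True
```

Note that a packet dropped because of buffer congestion still passes the check, which reflects the advantage stated in condition 4.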
(36) If the determination at decision step 76 is negative, then the procedure ends at final step 74.
(37) If the determination at decision step 76 is affirmative, then in step 78 a CNP is generated by the switch. Step 78 comprises steps 80, 82, 84. In step 80 the DIP and SIP are read from the L3-IP segment 60 of the congested packet, then exchanged into the L3-IP segment 68 of the CNP, such that the SIP of the congested packet becomes the DIP of the CNP, and the DIP of the congested packet becomes the SIP of the CNP. In some embodiments, the IP address of the switch itself can be used as the SIP of the CNP. L2-MAC segments 58 do not undergo address swapping, but are treated as described above.
(38) In step 82 the DSCP field of the L3-IP segment of the CNP header is set to a high priority value in order to ensure that the CNP reaches the source node of the congested packet as fast as possible.
(39) A CNP is not eligible for ECN marking. The ECN field of the L3-IP segment of the CNP is set to binary 00 in step 84, so that the CNP is unaffected by congestion management procedures of other nodes during its transit through the network.
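Steps 80, 82 and 84 can be sketched as follows. This is a minimal model of the L3-IP fields only. The `IpHeader` record, the function `build_cnp_ip_header`, and the DSCP value 48 are hypothetical; the description requires only that the DSCP be a high priority value.

```python
from dataclasses import dataclass
from typing import Optional

CNP_DSCP = 48  # hypothetical high-priority DSCP value


@dataclass
class IpHeader:
    sip: str   # source IP address
    dip: str   # destination IP address
    dscp: int  # six-bit DSCP
    ecn: int   # two-bit ECN field


def build_cnp_ip_header(congested: IpHeader,
                        switch_ip: Optional[str] = None) -> IpHeader:
    """Build the L3-IP segment of a CNP from that of the congested packet."""
    return IpHeader(
        sip=switch_ip or congested.dip,  # step 80: DIP becomes SIP (or switch IP)
        dip=congested.sip,               # step 80: SIP becomes DIP
        dscp=CNP_DSCP,                   # step 82: high priority
        ecn=0b00,                        # step 84: CNP is not ECN-eligible
    )
```

Passing `switch_ip` models the alternative embodiment in which the switch uses its own IP address as the SIP of the CNP.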
(40) Once the switch has committed to generating a CNP, it would be undesirable for downstream switches to repeat the procedure when they subsequently handle the congested packet, because the packet source node would then receive multiple CNPs concerning a single packet. The packet source node might respond by throttling the flow more than necessary, possibly violating QoS requirements and even causing the flow to become unusable at its destination.
(41) Next, the ECN fields of the L3-IP segment of the congested packet are reset in block 86, in which one of a series of options is selected, depending on the capability of the switches in the network. The sequence begins at decision step 88, where it is determined if all switches in the network are interoperable, which in this context means that they are all capable of not applying the Weighted Random Early Detection (WRED) method to packets that are not ECN-capable, while applying ECN marking for packets that are ECN capable. WRED is a well-known queuing discipline in which packets are selectively dropped, based on IP precedence.
(42) If the determination at decision step 88 is affirmative, then the network is ideally suited for application of the principles of the invention. Control proceeds to step 90, where the ECN field of the L3-IP segment in the congested packet is set to binary 00. For purposes of congestion management, the congested packet will be deemed ineligible in subsequent iterations of decision step 76 in downstream network nodes. Performance of step 90 ensures that exactly one CNP is generated for the congested packet. This behavior resembles CNP generation by an end node, where all switch congestions along the path coalesce into a single CNP generation.
(43) If the determination at decision step 88 is negative, then at decision step 92, it is determined if the network nodes are configured or configurable to recognize only a selected one of the binary values 01 and 10 as indicating packet eligibility in step 90. In this example, eligible packets are denoted by the binary value 01, while the binary value 10 (as well as the value 00) denotes ineligibility. The binary values 01 and 10 are used arbitrarily herein to distinguish packet eligibility from ineligibility. The binary values in this example have no significance with respect to the actual configuration of the method.
(44) If the determination at decision step 92 is affirmative, then in step 94 the ECN field is set to binary 10. Network switches configured to perform step 94 generate CNP packets only for packets with the selected ECN field value indicating eligibility for CNP generation, binary 01 in this example, and prevent multiple CNP generation by setting the ECN field to indicate ineligibility. This ensures single CNP generation per packet.
(45) If the determination at decision step 92 is negative, then the ECN field of the congested packet is left unchanged. In cases where no other switches in the path of the congested packet are experiencing congestion, there is no harm, as only one CNP will be generated. It is only where multiple network congestion points are present that the undesired effect of multiple CNPs may be experienced.
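The selection made in block 86 (decision steps 88 and 92, steps 90 and 94) can be sketched as a small function. The function name `remark_ecn` and its two boolean parameters are hypothetical stand-ins for whatever network-capability information the switch holds. The value 10 as the ineligibility marker follows the arbitrary choice made in the example above.

```python
def remark_ecn(ecn: int, all_interoperable: bool,
               single_value_eligibility: bool) -> int:
    """Return the ECN value to write into the congested packet before it is
    forwarded, after a CNP has been generated for it."""
    if all_interoperable:
        # Decision step 88 affirmative, step 90: mark as not ECN-capable so
        # that no downstream switch generates a second CNP.
        return 0b00
    if single_value_eligibility:
        # Decision step 92 affirmative, step 94: 01 means eligible in this
        # example, so 10 marks the packet as ineligible downstream.
        return 0b10
    # Decision step 92 negative: leave the field unchanged; duplicate CNPs
    # remain possible if another switch on the path is also congested.
    return ecn
```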
(46) After exiting block 86, at final step 96 the congested packet is forwarded according to its routing instructions, and the CNP is transmitted through the network toward the source node of the congested packet.
Second Embodiment (Virtualized Networks Handling Tunnel Packets)
(47) As is known in the art, a tunneling protocol is a communications protocol that allows for the movement of data from one network to another. The payload of a tunnel packet encapsulates another packet that is compliant with another protocol, and transports the payload of the other packet. This embodiment is described with reference to the example of the protocol Virtual Extensible LAN (VXLAN), described in IETF RFC 7348, which is herein incorporated by reference. VXLAN is a Layer 2 overlay scheme on a Layer 3 network.
(48) A RoCEV2-over-VXLAN tunnel packet (referred to as a non-congested tunnel packet or congested tunnel packet as the case may be) has the format shown in
(49) When forwarding a congested tunnel packet, the L3-IP segment 102 in the outer header 98 denotes any congestion in the underlay network. Once the packet is decapsulated, the ECN markings of the L3-IP segment 102 may be copied to the L3-IP segment 104 in the inner header 100.
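The copy of ECN markings from the outer header to the inner header at decapsulation can be sketched as below. This is a plain copy, as the paragraph above describes. It does not model any stricter combination rules that a particular decapsulator may apply. The function name `copy_outer_ecn` is hypothetical, and the ECN field is assumed to occupy the two low-order bits of the TOS byte.

```python
def copy_outer_ecn(outer_tos: int, inner_tos: int) -> int:
    """Return the inner TOS byte carrying the ECN bits of the outer header.

    The inner DSCP (upper six bits) is preserved; only the two ECN bits
    are taken from the outer L3-IP segment.
    """
    return (inner_tos & 0xFC) | (outer_tos & 0x03)
```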
(50) A CNP-over-VXLAN has the format shown in
(51) In the case of the inner header 116, step 78 (
(52) Regarding the outer header 110, the L2-MAC segment 112 and L3-IP segment 114 are treated in the same way as the L2-MAC segment 66 and L3-IP segment 68 of a CNP for regular RoCEV2 congested packets (step 78,
(53) Implementation Details.
(54) One of the challenges in CNP generation is to retrieve the destination QP for the CNP.
(55) For reliable connection transport services, the source QP is not available in the packet header of the congested packet (or congested tunnel packet). This issue can be solved by one of the following options:
(56) A translation table may be maintained between {destination IP, destination QP} and the source QP in the switch. Such a table typically exists in an NIC, and can be exploited when it generates the CNP. It is possible to have this table in the switch; however, it may introduce scale issues. In addition, a central network controller needs to be aware of the entire network to configure the switch.
(57) Instead of using the QP, the NIC maps a CNP to a congestion control context by another field. A typical solution is to calculate a hash function on the following fields: the destination IP of the original packet (which is the SIP of the CNP packet); and the destination QP of the original packet.
(58) This option requires that the CNP packet reports this destination QP. As the source QP of the original packet is not available (otherwise there is no issue) the switch will report the destination QP in the CNP.
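The switch-side handling of the two options above can be sketched together. A table keyed by {destination IP, destination QP} supplies the source QP when an entry exists. Otherwise the switch reports the destination QP in the CNP. Combining the two options as a lookup with a fallback is a design choice made for this sketch, not something the description prescribes. The names `QpTranslationTable` and `qp_for_cnp` are hypothetical.

```python
from typing import Dict, Optional, Tuple


class QpTranslationTable:
    """Maps {destination IP, destination QP} to the source QP."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], int] = {}

    def add(self, dst_ip: str, dst_qp: int, src_qp: int) -> None:
        self._entries[(dst_ip, dst_qp)] = src_qp

    def lookup(self, dst_ip: str, dst_qp: int) -> Optional[int]:
        return self._entries.get((dst_ip, dst_qp))


def qp_for_cnp(table: QpTranslationTable, dst_ip: str,
               dst_qp: int) -> Tuple[int, str]:
    """Pick the QP number to place in the CNP for a congested packet.

    Returns the source QP when the table knows it (first option), else the
    destination QP of the congested packet (second option).
    """
    src_qp = table.lookup(dst_ip, dst_qp)
    if src_qp is not None:
        return src_qp, "source"
    return dst_qp, "destination"
```

The table grows with the number of connections crossing the switch, which is the scale concern noted above; the fallback path needs no per-connection state in the switch.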
(59) In a reliable connection transport service, the source node may use IP or other alternatives such as a hash table to correlate between a packet and its congestion control context.
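On the source node, the hash-based mapping from a CNP to a congestion control context might look like the following sketch. The choice of CRC-32 as the hash, the fixed number of contexts, and the function name `context_index` are assumptions. The 3-byte encoding reflects the 24-bit width of a QP number.

```python
import ipaddress
import zlib


def context_index(dst_ip: str, dst_qp: int, num_contexts: int) -> int:
    """Hash {destination IP, destination QP} of the original packet to a
    congestion control context slot.

    When handling a CNP, `dst_ip` is the SIP of the CNP and `dst_qp` is the
    destination QP reported in the CNP. Raises OverflowError if `dst_qp`
    does not fit in 24 bits.
    """
    key = ipaddress.ip_address(dst_ip).packed + dst_qp.to_bytes(3, "big")
    return zlib.crc32(key) % num_contexts
```

Because the source node can compute the same index when it transmits a packet and when it later receives the CNP, both events reach the same context without the source QP ever appearing in the CNP. Distinct flows may collide in one slot; how collisions are handled is outside this sketch.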
(60) Conventionally, when the destination node generates the CNP, the destination QP of the CNP packet (which is the source QP of the congested packet) is retrieved from a lookup table on the destination QP of the congested packet. This is used for reliable transport services, where the source QP is not denoted in the packet. While this approach can be used by an intermediate switch, it is more difficult to scale up than the above options.
(61) It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.