Using consistent hashing for ECMP routing
09853900 · 2017-12-26
Assignee
Inventors
CPC classification
International classification
H04L12/28
ELECTRICITY
Abstract
ECMP routing is carried out in a fabric of network entities by representing valid destinations and invalid destinations in a group of the entities by a member vector. The order of the elements in the member vector is permuted, and the permuted vector is fanned out. A portion of the elements in the fanned-out vector is pseudo-randomly masked. A flow of packets is transmitted to the first valid destination in the masked member vector.
Claims
1. A method, comprising the steps of: defining in a fabric of network entities a group of the entities; representing the group by a member vector having elements, including valid elements and invalid elements that respectively reference valid destinations and invalid destinations in the group; permuting an order of the elements in the member vector to define a permuted member vector; fanning out the permuted member vector to form a series of consecutive instances of the permuted member vector; pseudo-randomly masking a portion of the elements of the fanned out member vector; and transmitting a flow of packets to the valid destination represented by a first valid element of the masked member vector.
2. The method according to claim 1, wherein the destinations are next hops in an equal cost multi-path (ECMP) group.
3. The method according to claim 1, wherein the destinations are addresses in a link aggregation group (LAG).
4. The method according to claim 1, wherein permuting an order comprises applying a hash function to a packet header and varying the order according to the hash function.
5. The method according to claim 1, wherein masking comprises applying a masking hash function to the fanned out member vector.
6. The method according to claim 1, further comprising deleting one of the valid destinations from the group prior to representing the group by a member vector and redistributing the flow of packets to remaining valid destinations using the masked member vector.
7. The method according to claim 6, wherein redistributing the flow of packets comprises avoiding migrating flows of packets in the remaining valid destinations to other remaining valid destinations.
8. The method according to claim 1, further comprising adding a new valid destination to the group prior to representing the group by a member vector and redistributing the flow of packets to include the new valid destination using the masked member vector.
9. The method according to claim 8, wherein redistributing the flow of packets comprises avoiding migrating flows of packets in preexisting valid destinations to other preexisting valid destinations.
10. A system, comprising: a fabric of network entities for communication of data packets, wherein the network entities have a plurality of physical ports and a memory storing a forwarding database, respectively; and programmable hardware logic in the network entities configured for performing the steps of: representing a group of the network entities by a member vector having elements, including valid elements and invalid elements in the forwarding database that respectively reference valid destinations and invalid destinations in the group; permuting an order of the elements in the member vector to define a permuted member vector; fanning out the permuted member vector to form a series of consecutive instances of the permuted member vector; pseudo-randomly masking a portion of the elements of the fanned out member vector; and transmitting a flow of packets to the valid destination represented by a first valid element of the masked member vector.
11. The system according to claim 10, wherein the destinations are next hops in an equal cost multi-path (ECMP) group.
12. The system according to claim 10, wherein the destinations are addresses in a link aggregation group (LAG).
13. The system according to claim 10, wherein permuting an order comprises applying a hash function to a packet header and varying the order according to the hash function.
14. The system according to claim 10, wherein masking comprises applying a masking hash function to the fanned out member vector.
15. The system according to claim 10, wherein the hardware logic is configured for deleting one of the valid destinations from the group prior to representing the group by a member vector and redistributing the flow of packets to remaining valid destinations using the masked member vector.
16. The system according to claim 15, wherein redistributing the flow of packets comprises avoiding migrating flows of packets in the remaining valid destinations to other remaining valid destinations.
17. The system according to claim 10, wherein the hardware logic is configured for adding a new valid destination to the group prior to representing the group by a member vector and redistributing the flow of packets to include the new valid destination using the masked member vector.
18. The system according to claim 17, wherein redistributing the flow of packets comprises avoiding migrating flows of packets in preexisting valid destinations to other preexisting valid destinations.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
(1) For a better understanding of the present invention, reference is made to the detailed description of the invention, by way of example, which is to be read in conjunction with the following drawings, wherein like elements are given like reference numerals, and wherein:
DETAILED DESCRIPTION OF THE INVENTION
(15) In the following description, numerous specific details are set forth in order to provide a thorough understanding of the various principles of the present invention. It will be apparent to one skilled in the art, however, that not all these details are necessarily always needed for practicing the present invention. In this instance, well-known circuits, control logic, and the details of computer program instructions for conventional algorithms and processes have not been shown in detail in order not to obscure the general concepts unnecessarily.
(16) Documents incorporated by reference herein are to be considered an integral part of the application except that, to the extent that any terms are defined in these incorporated documents in a manner that conflicts with definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
Definitions
(17) A packet that is subject to ECMP/LAG forwarding belongs to an ECMP/LAG group. A group is defined by its valid members. A member can be an egress port (in the LAG case) or a router-interface (ECMP case). A group can be represented by a vector of static pointers, which are either valid or invalid. A pointer points to all the properties of the member, e.g., port number, router interface number. Selecting a member is equivalent to selecting a pointer to that member. The principles of the invention are applicable to ECMP and LAG forwarding, although the discussion that follows often specifies ECMP routes for convenience. Those skilled in the art can readily adapt these principles to LAG forwarding.
(18) Overview.
(19) When a failure occurs in one or more paths through a fabric, conventionally recalculating the next hop for all paths typically results in the reordering of all flows. Consistent load balancing according to principles of the invention overrides this behavior so that only flows for links that have become inactive are redirected. Thus, in a consistent hashing scheme, some flows are migrated from one ECMP destination (or LAG port) and redistributed among the remaining destinations, while other flows remain in their original destinations. In a case that the number of available destinations increases, some flows will use the new destinations. Assume that in a group of four destinations, each destination handles 25% of the flows, and a fifth destination becomes active. It is desirable that each of the five destinations hold 20% of the flows. This can be accomplished by moving 20% of the flows from each of the preexisting four destinations to the new destination. However, flows are not migrated from one preexisting destination to another preexisting destination. Similarly, when the number of destinations decreases, the flows of an inactivated destination may be distributed to the remaining destinations, while the existing flows in the remaining destinations remain in place. Preferably, the scheme that follows is implemented in hardware for reasons of efficiency. In particular, the methods employed in embodiments of the invention incur less memory overhead as compared to fine grain ECMP rearrangements.
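The consistency property described above can be illustrated by a minimal software sketch (not the patented hardware design): a per-flow seeded shuffle stands in for the hash-selected permutation, and each flow is sent to the first valid member in its permuted order. Removing a member then moves only that member's flows.

```python
import random

def pick_destination(flow_id, valid):
    # valid: list of booleans, one per group member
    members = list(range(len(valid)))
    random.Random(flow_id).shuffle(members)  # flow-keyed pseudo-random permutation
    for m in members:
        if valid[m]:
            return m  # first valid member in the permuted order
    return None

valid = [True] * 4
before = {f: pick_destination(f, valid) for f in range(1000)}
valid[2] = False  # destination 2 becomes inactive
after = {f: pick_destination(f, valid) for f in range(1000)}
moved = [f for f in before if before[f] != after[f]]
# Every flow that moved was previously on the failed destination;
# flows on surviving destinations stay in place.
assert all(before[f] == 2 for f in moved)
```

Because each flow's permutation is fixed, a flow whose first choice survives is unaffected by the failure, which is the behavior claimed for consistent load balancing.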
(20) Turning now to the drawings, reference is now made to
(21) in the pictured embodiment, decision logic 14 receives packets 16, each containing a header 18 and payload data 20. A processing pipeline 22 in decision logic 14 extracts a classification key from each packet, typically (although not necessarily) including the contents of certain fields of header 18. For example, the key may comprise the source and destination addresses and ports and a protocol identifier. Pipeline 22 matches the key against a matching database 24 containing a set of rule entries, which is stored in an SRAM 26 in network element 10, as described in detail hereinbelow. SRAM 26 also contains a list of actions 28 to be performed when a key is found to match one of the rule entries and may include a forwarding database. For this purpose, each rule entry typically contains a pointer to the particular action that decision logic 14 is to apply to packets 16 in case of a match. Pipeline 22 typically comprises dedicated or programmable hardware logic, which is configured to carry out the functions described herein.
(22) In addition, network element 10 typically comprises a cache 30, which contains rules that have not been incorporated into the matching database 24 in SRAM 26. Cache 30 may contain, for example, rules that have recently been added to network element 10 and not yet incorporated into the data structure of matching database 24, and/or rules having rule patterns that occur with low frequency, so that their incorporation into the data structure of matching database 24 would be impractical. The entries in cache 30 likewise point to corresponding actions 28 in SRAM 26. Pipeline 22 may match the classification keys of all incoming packets 16 against both matching database 24 in SRAM 26 and cache 30. Alternatively, cache 30 may be addressed only if a given classification key does not match any of the rule entries in database 24 or if the matching rule entry indicates (based on the value of a designated flag, for example) that cache 30 should be checked, as well, for a possible match to a rule with higher priority.
Traffic Flow
(23) When a LAG stack is based on standard Ethernet links, each device performs standard layer 2 (bridge) forwarding. Reference is now made to
(24) Reference is now made to
(25) The process steps of the following algorithm are shown in a particular linear sequence in
(26) At initial step 60 a member vector is prepared, based on the fabric topology. Individual member vectors may be maintained in each element of the fabric. Alternatively, a fabric controller (not shown) may maintain respective versions of the member vector for the fabric, recognizing that each switch or forwarder sees only the next hop.
(27) Next, at step 62 the member vector prepared in initial step 60 is permuted, using a hash function. Selection of the hash function from a group of suitably chosen hash functions produces nearly ideal pseudorandom rearrangements of the member vector. Suitable hash functions include Bitwise-XOR, CRC32, CRC16, Toeplitz-Hash, and MD5. In the example of
(28) It should be noted that a packet that is to be forwarded in a manner not involving ECMP or LAG according to the forwarding database of the switch does not activate the permutation mechanism.
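The permutation step can be sketched in software as follows (a minimal sketch; the vector width, header fields, and table size are illustrative assumptions). A hash of the packet header, here CRC32, one of the functions named above, selects one of a table of precomputed permutations, so packets of the same flow always see the same rearrangement.

```python
import random
import zlib

N = 8                 # member-vector size (illustrative)
NUM_PERMS = 2 * N     # size of the permutation table (illustrative)
PERM_TABLE = [random.Random(i).sample(range(N), N) for i in range(NUM_PERMS)]

def permute_member_vector(member_vec, header_bytes):
    # hash the packet header to select a permutation
    idx = zlib.crc32(header_bytes) % NUM_PERMS
    perm = PERM_TABLE[idx]
    return [member_vec[perm[i]] for i in range(N)]

vec = [1, 1, 0, 1, 0, 0, 1, 1]
permuted = permute_member_vector(vec, b"src=10.0.0.1,dst=10.0.0.2,proto=6")
assert sorted(permuted) == sorted(vec)  # a permutation preserves the elements
```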
(29) The following discussion explains how the first valid member of a permuted member vector is made to appear random.
(30) Reference is now made to
(31) The set of valid members changes when a new route is introduced. Thus, if a new route 68 (indicated by a broken line) were introduced that directly connected switches 36, 56 then ECMP group 40 would increase from three to four valid members.
(32) Permutation of the member vector 64 is indicated by block 70 according to hash function 72. The algorithm represented by
(33) Reference is now made to
(34) Moreover, other permutation creation techniques known in the art may be substituted in step 62
(35) in
(36) A conventional method for choosing a member of a group, such as an element of the member vector 64, begins by choosing an entry point, and scanning the group from the entry point forward, wrapping around, and continuing back to the entry point until a valid member is found. In the case of member vector 64 scanning the elements from left to right would find member M1 prior to member M0 for almost every entry point in the vector, which leads to unfairness in load-balancing between members M0 and M1.
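The unfairness of the conventional scan can be seen in a short sketch with a hypothetical member vector in which the two valid members are adjacent: almost every entry point wraps around to the earlier member first.

```python
def first_valid_from(valid, entry):
    n = len(valid)
    for step in range(n):
        i = (entry + step) % n  # scan forward, wrapping around
        if valid[i]:
            return i
    return None

# hypothetical vector: valid members at positions 0 and 1 only
valid = [True, True, False, False, False, False, False, False]
counts = {0: 0, 1: 0}
for entry in range(len(valid)):
    counts[first_valid_from(valid, entry)] += 1
# Member 0 is found first from 7 of the 8 entry points; member 1 from only 1.
assert counts == {0: 7, 1: 1}
```

Even with a uniformly random entry point, the load split is 7:1 rather than 1:1, which is the imbalance the permutation step is designed to remove.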
(37) After permutation of the member vector 64 into permuted member vector 82, using a selection from a well-chosen set of permutation functions, for any pair of elements 84 either member has an equal probability of preceding the other in a scan. Indeed, for any set of elements 84, any member of the set has an almost equal likelihood of preceding another member. In a practical system there should be at least 2N permutations available for selection, where N is the ECMP vector size.
(38) The permutation operation in block 70 consumes O(log₂(number of permutations)) time.
(39) Reverting to
(40) Reference is now made to
(41) Step 88 is accomplished in block 94 by applying a hash masking function 96 to the member vector 90. In an efficient hardware implementation, the hash masking function 96 is performed by computing random bits, using the hash function, and applying an AND (“&”) operation between the fan-out-vector and the random-bits-vector, in which each bit is pseudo-randomly set to 1 with a chosen probability, e.g., 0.5. Masking out the elements 84 is done simultaneously, resulting in a masked permuted member vector 98. The masking operation in block 94 is thus done in O(1) time. The intent is to randomly eliminate a predefined percentage of the valid members, which in this example is 50%.
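The masking step can be sketched as follows (a software sketch; a seeded generator stands in for the masking hash function, and the vector width is illustrative). ANDing the fanned-out vector with a pseudo-random bit vector clears roughly half of the set bits in a single parallel operation.

```python
import random

def mask_vector(fanned_out_bits, seed, p=0.5):
    rng = random.Random(seed)  # stands in for the masking hash function 96
    random_bits = [1 if rng.random() < p else 0 for _ in fanned_out_bits]
    # the AND eliminates each valid member with probability 1 - p
    return [a & b for a, b in zip(fanned_out_bits, random_bits)]

fanned_out = [1, 0, 1, 1] * 3  # three consecutive instances (fan-out)
masked = mask_vector(fanned_out, seed=42)
assert all(m <= f for m, f in zip(masked, fanned_out))  # masking only clears bits
```

Because all bits are masked at once, the operation costs O(1) time regardless of the vector width, as stated above.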
(42) Reference is now made to
(43) An additional permutation operation 110 may be performed on the member 102 in block 112, resulting in a 7-bit output vector 114, which is interpreted as an integer that identifies the ECMP (or LAG)-member to which the packet will be forwarded.
(44) Reverting to
Example
(45) The following example illustrates nearly ideal randomization in selection of member elements in a simulation of the above-described procedure, which was conducted using the pseudocode simulation design shown in Listing 1.
Listing 1
(46) A member vector (mem_vec) is loaded; the vector is 128b for LAG and 64b for ECMP.
1) /* If a 64b member vector is loaded, it is duplicated twice for a 128b vector.
A hash in the range [0 . . . ch_max_perm−1] is generated, called ch_permute_hash.
mem_vec is permuted to create mem_vec_permuted, using ch_permute_hash. /* [128]×[ch_max_perm]-to-1 muxes.
mem_vec_permuted is duplicated (fanned out) ch_mem_vec_fan_out_size times.
[128]×[ch_mem_vec_fan_out_size] random bits are generated such that the probabilities of the bits are (counting from the MSB):
[1/64]×64 bits
[1/32]×64 bits
[1/4]×128 bits
[1/2]×128 bits
mem_vec_permuted_mask = mem_vec_permuted & random bits is computed.
mem_vec_permuted_mask_1st = Find_1st_set(mem_vec_permuted_mask)
2) /* Find_1st_set over [ch_mem_vec_fan_out_size]×[128] bits:
If (~|mem_vec_permuted_mask_1st) → mem_vec_permuted_mask_1st = Find_last_set(mem_vec_permuted)
mem_vec_permuted_mask_fan_in = fold(mem_vec_permuted_mask_1st, 128) /* This is 128×ch_mem_vec_fan_out_size “xor” gates.
Out_port = reassemble(mem_vec_permuted_mask_fan_in, ch_permute_hash) /* This is 7 (each output bit)×ch_max_perm-to-1 muxes.
3) /* Design parameters: ch_max_perm = 64; ch_mem_vec_fan_out_size = 3
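A software analogue of the listing may clarify the data flow (a sketch only; the listing targets hardware, and the vector width here is reduced from 128 to 16 bits for readability, with the bit-probability groups scaled accordingly). The steps are: permute, fan out, mask with graded probabilities, find the first set bit with a last-set fallback, fold back to the original width, and recover the member number through the inverse of the permutation.

```python
import random

VEC = 16        # member-vector width (128 in the listing; reduced here)
FAN_OUT = 3     # ch_mem_vec_fan_out_size
NUM_PERMS = 64  # ch_max_perm
PERMS = [random.Random(i).sample(range(VEC), VEC) for i in range(NUM_PERMS)]

def select_member(mem_vec, flow_hash):
    perm = PERMS[flow_hash % NUM_PERMS]              # ch_permute_hash
    permuted = [mem_vec[perm[i]] for i in range(VEC)]
    fanned = permuted * FAN_OUT                      # fan out
    rng = random.Random(flow_hash)
    # graded probabilities, rarest bits first (cf. 1/64, 1/32, 1/4, 1/2), scaled
    probs = [1/64] * 8 + [1/32] * 8 + [1/4] * 16 + [1/2] * 16
    masked = [b & (1 if rng.random() < p else 0) for b, p in zip(fanned, probs)]
    try:
        first = masked.index(1)                          # Find_1st_set
    except ValueError:                                   # mask cleared everything:
        first = len(fanned) - 1 - fanned[::-1].index(1)  # fall back to Find_last_set
    folded = first % VEC                                 # fold back to VEC bits
    return perm[folded]                                  # reassemble: member index

mem_vec = [0] * VEC
for m in (1, 5, 9, 12):
    mem_vec[m] = 1  # four valid members
chosen = select_member(mem_vec, flow_hash=12345)
assert mem_vec[chosen] == 1  # selection always lands on a valid member
```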
(47) Reference is now made to
(48) ECMP Group Size Change.
(49) ECMP group size changes are handled in hardware. When a new member is introduced, some of the existing flows switch to the new member while respecting current activity status: an active flow remains on its old ECMP member, and when the flow becomes inactive, it may then be reassigned to the new ECMP member. Non-active flows stay on their original ECMP members.
(50) Reference is now made to
(51) Reference is now made to
(52) Each slot in the activity shadow memory 130 represents an activity state of one or more flows. When a new member is introduced, all those slots are initialized as ‘active=3’, although their activity status is not yet known. Each time the packet-aging timer 132 ticks, the activity state of all slots is changed from ‘active=3’ → ‘maybe active=2’ → ‘not active=0’.
(53) Each time a packet arrives and its slot activity is not ‘not active=0’, that packet is served using the work memory 128. Its slot activity state is changed from ‘maybe active=2’ → ‘active=3’.
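The aging scheme above can be sketched as a small state machine (a software sketch of the shadow-memory behavior; slot count and class names are illustrative):

```python
ACTIVE, MAYBE_ACTIVE, NOT_ACTIVE = 3, 2, 0

class ActivityShadow:
    def __init__(self, num_slots):
        # new members start as 'active' even though their status is unknown yet
        self.slots = [ACTIVE] * num_slots

    def timer_tick(self):
        # age every slot one step: active -> maybe active -> not active
        for i, s in enumerate(self.slots):
            if s == ACTIVE:
                self.slots[i] = MAYBE_ACTIVE
            elif s == MAYBE_ACTIVE:
                self.slots[i] = NOT_ACTIVE

    def on_packet(self, slot):
        # a packet on a non-idle slot is served from the work memory
        # and refreshes its slot back to 'active'
        if self.slots[slot] != NOT_ACTIVE:
            self.slots[slot] = ACTIVE
            return True   # serve from work memory
        return False      # slot idle: its flows may be reassigned

shadow = ActivityShadow(4)
shadow.timer_tick()                  # all slots: active -> maybe active
assert shadow.on_packet(0) is True   # slot 0 refreshed back to active
shadow.timer_tick()
shadow.timer_tick()
assert shadow.slots[1] == NOT_ACTIVE  # slot 1 aged out without traffic
```

A flow is thus only reassigned once its slot has aged to ‘not active’, which prevents reordering of packets in flows that are still in flight.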
(54) Reference is now made to
(55) Reference is now made to
(56) Reference is now made to
(57) It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.