High-density, fail-in-place switches for computer and data networks
10749817 · 2020-08-18
Assignee
Inventors
- Paul W. Coteus (Yorktown, NY)
- Fuad E. Doany (Katonah, NY, US)
- Shawn A. Hall (Pleasantville, NY)
- Mark D. Schultz (Ossining, NY)
- Todd E. Takken (Brewster, NY, US)
- Shurong Tian (Mount Kisco, NY, US)
CPC classification
- H05K7/20509 (ELECTRICITY)
- H05K7/20572 (ELECTRICITY)
- H05K7/20254 (ELECTRICITY)
International classification
Abstract
A structure for a network switch. The network switch may include a plurality of spine chips arranged on a plurality of spine cards, where one or more spine chips are located on each spine card; and a plurality of leaf chips arranged on a plurality of leaf cards, wherein one or more leaf chips are located on each leaf card, where each spine card is connected to every leaf chip and the plurality of spine chips are surrounded on at least two sides by leaf cards.
Claims
1. A network switch comprising: a plurality of spine cards each comprising a liquid cooling plate in direct contact with one or more spine chips, the plurality of spine cards are stacked vertically one on top of another and separated by a predetermined space; a plurality of leaf cards each comprising one or more leaf chips in direct contact with a heat sink, the plurality of leaf cards are perpendicular to and circumferentially arrayed around the stack of spine cards, the spine chips of each spine card are electrically connected to the leaf chips of each leaf card, wherein the plurality of leaf cards along any one side of the stack of spine cards are arranged at an angle relative to one another such that no two leaf cards along that side of the stack of spine cards are parallel; a heat exchanger comprising supply piping connecting it to the liquid cooling plate and return piping connecting it from the liquid cooling plate; a lower card guide plate comprising perforations and guiding features corresponding with each of the plurality of leaf cards; and an upper card guide plate comprising perforations and guiding features corresponding with each of the plurality of leaf cards.
2. The structure according to claim 1, wherein each leaf card is removably coupled to all of the spine cards.
3. The structure according to claim 1, wherein an orthogonal connector connects all the spine cards to each of the leaf cards.
4. The structure according to claim 1, wherein the leaf cards surround the stack of spine cards on all four sides, and all the leaf cards along each side of the stack of spine cards are separated from each other by a space at all four corners of the stack of spine cards.
5. The structure according to claim 1, wherein the supply piping is routed through a space between adjacent leaf cards at a first corner of the stack of spine cards, and the return piping is routed through a space between adjacent leaf cards at a second corner of the stack of spine cards.
6. The structure according to claim 1, wherein each of the one or more leaf chips is positioned nearest to an innermost edge of each of the plurality of leaf cards to minimize electrical path length in the leaf card.
7. The structure according to claim 1, wherein each of the plurality of leaf cards comprises a vapor chamber situated between the one or more leaf chips and the heat sink to transfer the heat with low resistance from an innermost edge of the leaf card to an outermost edge of the leaf card.
8. The structure according to claim 1, wherein the heat sink comprises cooling fins.
9. The structure according to claim 8, wherein a height of the cooling fins gradually increases from an innermost edge of the leaf card to an outermost edge of the leaf card to provide clearance for adjacent leaf cards.
10. The structure according to claim 1, wherein each of the plurality of leaf cards comprises one orthogonal receptacle corresponding to each spine card in the stack of spine cards.
11. The structure according to claim 1, wherein each of the plurality of leaf cards comprises multiple rows of RJ45 connectors arranged along an edge opposite from the stack of spine cards.
12. The structure according to claim 1, wherein the supply piping is routed through a space between adjacent leaf cards at a first corner of the stack of spine cards, and the return piping is routed through a space between adjacent leaf cards at a second corner of the stack of spine cards.
13. The structure according to claim 1, wherein each of the plurality of leaf cards comprises a vapor chamber situated between the one or more leaf chips and the heat sink to transfer the heat with low resistance from an innermost edge of the leaf card to an outermost edge of the leaf card.
14. A network switch comprising: a plurality of spine cards each comprising a liquid cooling plate in direct contact with one or more spine chips, the plurality of spine cards are stacked vertically one on top of another and separated by a predetermined space; a plurality of leaf cards each comprising one or more leaf chips in direct contact with a heat sink, the plurality of leaf cards are perpendicular to and circumferentially arrayed around the stack of spine cards, the spine chips of each spine card are electrically connected to the leaf chips of each leaf card, wherein each of the plurality of spine cards comprises one orthogonal receptacle corresponding to each of the plurality of leaf cards, and, wherein each orthogonal receptacle along any one side of each of the plurality of leaf cards are arranged at an angle relative to one another such that no two orthogonal receptacles along that side of each of the plurality of leaf cards are parallel; a heat exchanger comprising supply piping connecting it to the liquid cooling plate and return piping connecting it from the liquid cooling plate; a lower card guide plate comprising perforations and guiding features corresponding with each of the plurality of leaf cards; and an upper card guide plate comprising perforations and guiding features corresponding with each of the plurality of leaf cards.
15. The structure according to claim 14, wherein each leaf card is removably coupled to all of the spine cards.
16. The structure according to claim 14, wherein an orthogonal connector connects all the spine cards to each of the leaf cards.
17. The structure according to claim 14, wherein the leaf cards surround the stack of spine cards on all four sides, and all the leaf cards along each side of the stack of spine cards are separated from each other by a space at all four corners of the stack of spine cards.
18. The structure according to claim 14, wherein each of the one or more leaf chips is positioned nearest to an innermost edge of each of the plurality of leaf cards to minimize electrical path length in the leaf card.
19. The structure according to claim 14, wherein the heat sink comprises cooling fins.
20. The structure according to claim 19, wherein a height of the cooling fins gradually increases from an innermost edge of the leaf card to an outermost edge of the leaf card to provide clearance for adjacent leaf cards.
21. The structure according to claim 14, wherein each of the plurality of leaf cards comprises one orthogonal receptacle corresponding to each spine card in the stack of spine cards.
22. The structure according to claim 14, wherein each of the plurality of leaf cards comprises multiple rows of RJ45 connectors arranged along an edge opposite from the stack of spine cards.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
(1) The following detailed description, given by way of example and not intended to limit the invention solely thereto, will best be appreciated in conjunction with the accompanying drawings, in which:
(13) The drawings are not necessarily to scale. The drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention. In the drawings, like numbering represents like elements.
DETAILED DESCRIPTION
(14) Detailed embodiments of the claimed structures and methods are disclosed herein; however, it can be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods that may be embodied in various forms. This invention may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of this invention to those skilled in the art. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments.
(15) References in the specification to one embodiment, an embodiment, an example embodiment, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
(16) For purposes of the description hereinafter, the terms upper, lower, right, left, vertical, horizontal, top, bottom, and derivatives thereof shall relate to the disclosed structures and methods, as oriented in the drawing figures. The terms overlying, atop, on top, positioned on or positioned atop mean that a first element, such as a first structure, is present on a second element, such as a second structure, wherein intervening elements, such as an interface structure may be present between the first element and the second element. The term direct contact means that a first element, such as a first structure, and a second element, such as a second structure, are connected without any intermediary conducting, insulating or semiconductor layers at the interface of the two elements.
(17) In the interest of not obscuring the presentation of embodiments of the present invention, in the following detailed description, some processing steps or operations that are known in the art may have been combined together for presentation and for illustration purposes and in some instances may have not been described in detail. In other instances, some processing steps or operations that are known in the art may not be described at all. It should be understood that the following description is rather focused on the distinctive features or elements of various embodiments of the present invention.
(18) Referring now to
(19) Each of a first set of switch chips is called a leaf chip and each leaf chip has p.sub.L ports. Four leaf chips are illustrated and each leaf chip has four ports (i.e., p.sub.L=4). Typically, in a core switch, the leaf chips are placed on a plurality of circuit cards, often called line cards or leaf cards. Generally, each leaf chip has externally facing ports equal to p.sub.L/2 connected on a first edge of the leaf card to cables running to external computer elements (i.e., nodes), such as, for example, a processor or a storage node. There also may be internally facing ports equal to p.sub.L/2 connected through leaf-to-spine connectors located on a second edge of the leaf card to a second set of switch chips called spine chips, such as, for example, spine chip S1 and spine chip S2. The spine chips are illustrated as having p.sub.S ports. Typically, spine chips and leaf chips are identical, where p.sub.S=p.sub.L=p. One or more spine chips are typically placed on a circuit card called a spine card. Every port of every spine chip is connected to an internally facing port of a leaf chip. Consequently, the maximum number of externally facing ports (P.sub.ext) in a two-layer core switch may be computed as follows.
(20) Let:
N.sub.L = Number of leaf chips in the core switch (1)
N.sub.S = Number of spine chips in the core switch (2)
p.sub.L = Number of ports per leaf chip (3)
p.sub.S = Number of ports per spine chip (4)
P.sub.ext = Total number of externally facing ports for the core switch (5)
(21) Because one port of each spine chip must connect to a port on a leaf chip, it follows that:
N.sub.L = p.sub.S (6)
(22) For a non-blocking switch, only half of the leaf chip ports are externally facing, so it follows that:
P.sub.ext = N.sub.L*p.sub.L/2 (7)
(23) Substituting (6) into (7) yields:
P.sub.ext = p.sub.S*p.sub.L/2 (8)
(24) When spine chips are identical to leaf chips, it follows that:
p.sub.S = p.sub.L ≡ p (9)
P.sub.ext = p.sup.2/2 (10)
(25) Because the total number of internally facing ports on the leaf chips, N.sub.L*p.sub.L/2, must equal the total number of ports N.sub.S*p.sub.S on the spine chips, it follows also that:
N.sub.S = N.sub.L*p.sub.L/(2*p.sub.S) (11)
(26) whence, for the special case (9), it follows that:
N.sub.S=N.sub.L/2(12)
(27) For the example above, in the special case (9) with p=48, a non-blocking core switch has P.sub.ext=1152 externally facing ports according to equation (10).
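The non-blocking sizing relationships of equations (6), (10), and (12) can be sketched as a short calculation (an illustrative sketch, not part of the patent; the function and variable names are chosen here for clarity):

```python
def nonblocking_fat_tree(p):
    """Sizes of a two-layer non-blocking fat-tree built from identical
    p-port switch chips (the special case (9): p_S = p_L = p)."""
    n_leaf = p                 # equation (6): N_L = p_S
    n_spine = n_leaf // 2      # equation (12): N_S = N_L / 2
    p_ext = p * p // 2         # equation (10): P_ext = p^2 / 2
    return n_leaf, n_spine, p_ext

# The p=48 example from the text: 1152 externally facing ports.
print(nonblocking_fat_tree(48))   # (48, 24, 1152)
```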
(28) Referring now to
(29) A limitation of the current approach is that the maximum value of P.sub.ext, that can be economically packaged, is often smaller than the number of devices desired for interconnection. For example, economic packaging of a 10GBASE-T Ethernet is accomplished using inexpensive RJ45 connectors and low-cost category-6 electrical cables, but these inexpensive components are bulky, so a small number of leaf cards can accommodate only a modest value of P.sub.ext. To achieve larger values of P.sub.ext, expensive alternative components are often used, such as a QSFP. A QSFP can accept four-channel-wide optical cables, which, for 10G Ethernet, must be connected to computing or storage nodes using expensive 1-to-4 optical breakout cables.
(30) The fact that P.sub.ext is limited by packaging constraints is further aggravated by the development of new, faster switch chips with larger p values. The higher speed of the port can be handled by appropriate IO cell design, proper high-speed electrical packaging techniques, and use of commensurately faster cables. However, because the number of externally facing ports P.sub.ext in a two-level fat-tree is quadratic in p, as stated by equation (10), the problem of packaging a core switch becomes dramatically more difficult as p increases.
(31) In some switches, a port is implemented by a single electrical channel. For example, a 10 Gb/s Ethernet port might be made by two differential wiring pairs, one receiving data and one sending data, carried on so-called twin-axial electrical cable. In other switches, a port may be implemented optically: using optical transceivers, the same 10 Gb/s signals may be sent over a pair of optical fibers, one sending data and one receiving data. Such a pair of signals is called a lane. A port can also include multiple lanes: if the signaling rate is 25 Gb/s per lane, four lanes can deliver, in parallel, 100 Gb/s per port. There is a new Ethernet standard that also uses four parallel lanes at 25 Gb/s. Assessment of the limits of packaging and chip technology suggests that 64-port chips, each port with four lanes, are possible with today's technologies.
(32) Referring now to
(33) Equation (10) applies for non-blocking switches, in which only half of a leaf switch's ports are externally facing, such that all cable ports may send or receive data at full bandwidth without slowdown. A more popular structure includes oversubscribed switches, in which there are more leaf cards than there are spine cards to support them at full bandwidth. In other words, the fraction of a leaf switch's ports that are externally facing is greater than ½.
(34) Let:
f = Fraction of a leaf switch's ports that are externally facing.
(35) Then equation (7) generalizes to:
P.sub.ext = f*N.sub.L*p.sub.L (13)
(36) Substituting (6) into (13) yields:
P.sub.ext = f*p.sub.S*p.sub.L (14)
(37) Consequently, for the typical case (9) (i.e., p.sub.S = p.sub.L ≡ p), it follows that:
P.sub.ext = f*p.sup.2 (15)
(38) Because the total number of internally facing ports on the leaf chips, (1-f)*N.sub.L*p.sub.L, must equal the total number of ports N.sub.S*p.sub.S on the spine chips, it follows also that:
N.sub.S = (1-f)*N.sub.L*p.sub.L/p.sub.S (16)
(39) whence, for the special case (9), it follows that:
N.sub.S = (1-f)*N.sub.L (17)
(40) In an example, f=¾. Then, for special case (9), P.sub.ext=0.75*p.sup.2. An f greater than ½ may yield data collisions if more than ½ of the nodes connected to the leaf cable ports want to send or receive data at once. However, such collisions seldom, if ever, occur, accounting for the popularity of the over-subscribed switches.
(41) As indicated by equation (15), the number of externally facing ports P.sub.ext for over-subscribed switches (f>½) is greater than stated by the previous examples of non-blocking switches (f=½). For example, for p=48 with f=¾, equation (15) gives P.sub.ext=1728.
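The over-subscribed generalization of equations (15) and (17) can be checked the same way (an illustrative sketch, not part of the patent; names are descriptive stand-ins for N_L, N_S, and P_ext):

```python
from fractions import Fraction

def oversubscribed_fat_tree(p, f):
    """Sizes of a two-layer fat-tree of identical p-port chips in which a
    fraction f of each leaf chip's ports faces externally (f = 1/2 is the
    non-blocking case)."""
    n_leaf = p                       # equation (6): N_L = p_S = p
    n_spine = (1 - f) * n_leaf       # equation (17): N_S = (1 - f) * N_L
    p_ext = f * p * p                # equation (15): P_ext = f * p^2
    return n_leaf, n_spine, p_ext

# The example from the text: p=48, f=3/4 gives P_ext = 1728.
print(oversubscribed_fat_tree(48, Fraction(3, 4)))   # (48, 12, 1728)
```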
(42) Today there are data centers with tens of thousands of nodes, many more nodes than can be connected by existing core switches. That situation is handled by cascading core switches in various ways to make larger switches. This comes at the cost of more cables and switches, more power for serialization-deserialization (SERDES) at each cable connection, and additional latency to move through the extra switches, cables, and SERDES stages.
(43) In some switches the spine and leaf cards can be concurrently maintained, and there can be an extra spine card. If a spine card fails, data is routed to other spine cards, and the faulty spine card can be replaced without turning off the switch, which continues to route data. If a leaf card fails, then the compute or storage nodes connected to it are lost from the network, but modern data centers are designed to tolerate the failure of a compute or storage node, so that loss is typically acceptable.
(44) A new packaging architecture for obtaining very-large-port-count network switches and obtaining a large value of P.sub.ext is described below. The architecture is based on the idea that leaf-to-spine connections in a fat-tree are inherently redundant. For example, in
(45) The proposed, large-P.sub.ext packaging architecture deliberately exploits the aforesaid redundancy: it allows spine ports or even entire spine chips to fail without replacement, a strategy called fail in place. Where, under normal conditions, if a port on a spine chip, or even the entire spine chip, fails, it is not repaired. Rather, traffic is redirected to the other, working spine ports. Despite the failure, there is still connectivity between all cable ports; the throughput of the switch is merely reduced. If N.sub.S is large, this throughput reduction is minimal. For example, for the previously mentioned switch chip having p=104 ports, there are N.sub.S=p/2=52 spine chips and N.sub.S*p=52*104=5408 spine ports. The loss of a few of these ports, the likely cost of the large-P.sub.ext packaging architecture proposed herein, is acceptable in exchange for the much-larger benefits afforded thereby. That is, compared to the large-P.sub.ext alternative of combining smaller core switches to make larger ones, the large-P.sub.ext architecture may provide three benefits: first, reduced power because there are fewer SERDES (serializer/deserializer) channels; second, lower latency because there are fewer hops through switch chips; and third, reduced cost because there are fewer cables.
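The minimal throughput penalty of fail-in-place can be illustrated with a back-of-the-envelope calculation (a sketch only; it assumes traffic spreads evenly over the remaining working spine ports, which the patent does not specify):

```python
def spine_throughput_fraction(p, failed_ports):
    """Fraction of spine-layer bandwidth remaining after some spine ports
    fail in place, for a non-blocking switch of identical p-port chips."""
    n_spine = p // 2                    # N_S = p / 2
    total_spine_ports = n_spine * p     # N_S * p total spine ports
    return (total_spine_ports - failed_ports) / total_spine_ports

# The p=104 example from the text: 52 spine chips, 5408 spine ports.
# Losing even 8 of them costs only about 0.15% of spine bandwidth.
print(spine_throughput_fraction(104, 8))
```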
(46) The large-P.sub.ext packaging architecture can achieve up to a four-fold increase in P.sub.ext compared to current switches. That is, because the spine cards are never removed for the lifetime of the switch, they may be surrounded by leaf cards on multiple sides rather than on only one side. Thus, compared to current architecture where leaf cards are connected to spine cards on the front only, the number of leaf chips N.sub.L, and hence the number of externally facing ports P.sub.ext, can be doubled if the leaf cards are connected to two edges of each spine card. Similarly, compared to current architecture, the number of externally facing ports P.sub.ext can be quadrupled if leaf cards are connected to all four edges of each spine card. Embodiments achieving large P.sub.ext by exploiting the fail-in-place strategy are described below.
(47) Referring now to
(48) Integers n.sub.S and n.sub.L are related to the previously defined integers N.sub.S and N.sub.L, where:
N.sub.S = n.sub.S*k.sub.S and N.sub.L = n.sub.L*k.sub.L (18)
k.sub.S = Number of switch chips per spine card (19)
k.sub.L = Number of switch chips per leaf card (20)
(49) In embodiments illustrated in
n.sub.S = 13; k.sub.S = 4; N.sub.S = 52 (21)
n.sub.L = 104; k.sub.L = 1; N.sub.L = 104 (22)
(50) The switch chip on both spine and leaf cards has p=104 ports. The spine-and-leaf array 400 achieves:
P.sub.ext = p.sup.2/2 = 5408 (23)
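Equations (18) through (23) for this embodiment can be verified with a short calculation (an illustrative sketch; the parameter names are descriptive stand-ins for n.sub.S, k.sub.S, n.sub.L, k.sub.L, and p):

```python
def card_counts(n_spine_cards, chips_per_spine, n_leaf_cards, chips_per_leaf, p):
    """Chip totals per equation (18) and the resulting non-blocking P_ext."""
    n_s_chips = n_spine_cards * chips_per_spine   # N_S = n_S * k_S
    n_l_chips = n_leaf_cards * chips_per_leaf     # N_L = n_L * k_L
    p_ext = p * p // 2                            # equation (23): P_ext = p^2/2
    return n_s_chips, n_l_chips, p_ext

# The embodiment of equations (21)-(23): 13 spine cards of 4 chips each,
# 104 leaf cards of 1 chip each, p = 104 ports per chip.
print(card_counts(13, 4, 104, 1, 104))   # (52, 104, 5408)
```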
(51) In the illustrated embodiment, there are 112 cards shown, 104 of which may be leaf assemblies 404 and the remaining cards may be power cards, spine-control cards and/or leaf-control cards. Power, spine-control, and leaf-control are redundant, such that global failure never occurs from failure of one component, or even from failure of several components. In the illustrated embodiment, if fewer than eight peripheral power and control cards are needed, then peripheral cards nearest the corners of the array may be eliminated.
(52) The spine-and-leaf array 400 may be the heart of a core switch, but the switch may also include mechanical and cooling infrastructure (e.g., water cooling infrastructure). Spine assemblies 402 are linearly arrayed at a spine-to-spine pitch H along a z-direction of coordinate system 410, and the leaf assemblies 404 are circumferentially arrayed around the stack of spine assemblies 402, such that each leaf assembly 404 may be electrically or optically connected to each spine assembly 402.
(53) In an embodiment, the central stack may include one or more central leaf-control assemblies (not illustrated) in addition to the n.sub.S spine assemblies 402. Each central leaf-control assembly may be similar in size, shape and orientation to the spine assembly 402, but may include electronics (instead of spine chips) for initializing and controlling the peripheral array of leaf assemblies 404. In an alternative embodiment, an initialization and control function for leaf cards may be provided by one or more peripheral leaf-control assemblies (not illustrated). Each peripheral leaf-control assembly may be similar in size, shape and orientation to a leaf assembly 404, but may include electronics (instead of leaf chips) for initializing and controlling the leaf assemblies 404. In the illustrated embodiments shown in
(54) Referring now to
(55) Referring now to
(56) Referring now to
(57) To achieve large n.sub.L with a spine card having limited perimeter, adjacent headers 412 should be close together. However, if the headers 412 were close together and leaf cards were parallel, the space between leaf cards would be too small to accommodate the appropriate number of external connectors (e.g., dual-stacked RJ-45 connectors 416 illustrated in
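The widening effect of fanning the leaf cards can be illustrated with simple trigonometry (a hypothetical calculation, not from the patent; the header pitch, card length, and fan angle below are invented purely for illustration):

```python
import math

def outer_edge_spacing(header_pitch_mm, card_length_mm, fan_angle_deg):
    """Spacing between adjacent leaf cards at their outer edge when each
    card is rotated by fan_angle_deg relative to its neighbor.
    Hypothetical dimensions, for illustration only."""
    # At the spine, adjacent headers sit header_pitch_mm apart; rotating
    # neighboring cards by a small relative angle opens an extra gap of
    # roughly card_length * sin(angle) at the outer (cable) edge.
    extra = card_length_mm * math.sin(math.radians(fan_angle_deg))
    return header_pitch_mm + extra

# E.g., a 300 mm card fanned by 2 degrees gains about 10.5 mm at its
# outer edge over a 20 mm header pitch, making room for bulky RJ45s.
print(outer_edge_spacing(20, 300, 2))
```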
(58) Referring now to
(59) Referring now to
(60) Referring now to
(61) Referring to
(62) Referring now to
(63) Referring now to
(64) Referring now to
(65) Referring now to
(66) Referring now to
(67) Referring now to
(68) The leaf card 408, upon insertion, is guided into position by two card guides 752, one at the bottom, which is affixed to the lower card-guide plate 702, and the other at the top (not illustrated), which is affixed to the upper card-guide plate 704. In
(69) Referring now to
(70) Referring now to
(71) Referring now to
(72) Referring now to
(73) Referring now to
(74) Referring now to
(75) Referring now to
(76) The fail-in-place strategy disclosed herein provides, for computer and data networks, a switch having a large number of externally facing ports. This is achieved by making improved use of the perimeter of the spine cards for connection to leaf cards. Accessibility of spine cards is thereby sacrificed, precluding their easy repair. This tradeoff may be advantageous because the two-level spine-leaf topology typically used in switches is inherently redundant: the fail-in-place strategy disclosed herein causes only minor performance penalties, yet allows the number of externally facing ports to increase significantly, by as much as a factor of four compared to the prior art, thereby significantly increasing the number of computing and storage elements that may be interconnected without an undesirable cascading of switches.
(77) While the description above contains much specificity, these should not be construed as limitations on the scope, but rather as exemplifications of preferred embodiments thereof. Many other variations are possible, such as, for example, the number of switch chips on the spine and leaf cards may vary, the number of spine and leaf boards may vary, the connections between spine and leaf boards may be optical as well as electrical, and cooling of the chips may be accomplished in a variety of ways. The terminology used herein was chosen to best explain the principles of the embodiment, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.