Data center interconnect as a switch
10477288 · 2019-11-12
Abstract
An interconnect module (ICAS module) includes n optical data ports, each comprising n-1 optical interfaces, and an interconnecting network implementing a full mesh topology for interconnecting each optical interface of each port to a respective one of the optical interfaces of each of the other ports. In one embodiment, each optical interface exchanges data signals over a communication medium with an optical transceiver. The interconnect module may implement the full mesh topology using optical fibers. The interconnect module may be used to replace fabric switches, as well as serve as a building block for a spine switch.
Claims
1. A data center network having a plurality of data interfaces for receiving and transmitting data signals from and to a plurality of servers, comprising: a first plurality of data switching units, wherein each of the first plurality of data switching units comprises (a) a plurality of first level interconnecting devices providing a subset of the data interfaces; and (b) one or more second level interconnecting devices routing data signals of the subset of data signals among the subset of data interfaces of the first level interconnecting device and a data interface with an external network, wherein the first and second interconnecting devices are interconnected to implement a full mesh network of a predetermined number of nodes; and a second plurality of data switching units, wherein each of the second plurality of the data switching units comprises a plurality of third level interconnecting devices, wherein each third level interconnecting device routes data signals received from, or to be transmitted to, a corresponding one of the first level interconnecting devices in each of the first plurality of data switching units.
2. The data center network of claim 1, wherein the data signals of each data interface are received from or to be transmitted to an optical transceiver.
3. The data center network of claim 1, wherein the first level interconnecting devices comprise top-of-rack (TOR) switches.
4. The data center network of claim 1, wherein the second level interconnecting devices comprise fabric switches, wherein the full mesh network comprises the predetermined number of first level interconnecting devices and the predetermined number less one fabric switches.
5. The data center network of claim 1, wherein the second level interconnecting devices of each of the first plurality of data switching units comprise an interconnect module, which comprises: n optical data ports each comprising n-1 optical interfaces, wherein each optical interface receives and transmits data signals over a communication medium; and an interconnecting network implementing a full mesh topology for interconnecting the optical interfaces of each port each to a respective one of the optical interfaces of each of the other ports.
6. The data center network of claim 5, wherein one of the optical data ports is provided as an uplink interface for connecting to an external network.
7. The data center network of claim 5, wherein the interconnecting network comprises optical fibers.
8. The data center network of claim 5, wherein the optical interfaces each receive and transmit optical signals to and from an optical transceiver.
9. The data center network of claim 5, wherein the second level interconnecting devices provide communication links for traffic among the first level interconnecting devices.
10. The data center network of claim 1, wherein the third level interconnecting devices each comprise a spine switch.
11. The data center network of claim 10, wherein each first level interconnecting device is connected by communication links to each of the spine switches in each of the second plurality of data switching units.
12. The data center network of claim 11, wherein the communication links in each of the second plurality of data switching units are implemented at least in part by a fanout cable transpose rack.
13. The data center network of claim 12, wherein each first level interconnecting device is connected by a fanout cable transpose rack to each of the spine switches in each of the second plurality of data switching units.
14. The data center network of claim 13, wherein the fanout cable transpose rack comprises a fraction of a full-rack, a standalone full rack, or multiple, full racks.
15. The data center network of claim 10, wherein the spine switch comprises one or more switching elements connected to one or more interconnect modules, each interconnect module comprising: n optical data ports each comprising n-1 optical interfaces, wherein each optical interface receives and transmits data signals over a communication medium; and an interconnecting network implementing a full mesh topology for interconnecting the optical interfaces of each port each to a respective one of the optical interfaces of each of the other ports.
16. The data center network of claim 15, wherein one of the optical data ports is provided as an uplink interface for connecting to an external network.
17. The data center network of claim 15, wherein the interconnecting network comprises optical fibers.
18. The data center network of claim 15, wherein the optical interfaces each receive and transmit optical signals to and from an optical transceiver.
19. The data center network of claim 15, wherein the switching elements and the interconnection modules are each packaged in a stackable, rack-mount chassis.
20. The data center network of claim 19, wherein the stackable switching device comprises multiple network devices, each being 1U high.
21. The data center network of claim 15, wherein the switching elements and the interconnection modules are packaged in a printed circuit board-based multi-unit rack mount chassis.
22. The data center network of claim 21, wherein the multi-unit switching device is multiple rack units high.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(27) To facilitate cross-referencing among the figures and to simplify the detailed description, like elements are assigned like reference numerals.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
(28) The present invention simplifies the network architecture by eliminating the switches in the fabric layer based on a new fabric topology, referred to herein as the interconnect-as-a-switch (ICAS) topology. The ICAS topology of the present invention is based on the full mesh topology. In a full mesh topology, each node is connected to all other nodes. The example of a 9-node full mesh topology is illustrated in
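By way of a non-limiting illustration (the helper name below is not part of the disclosure), the size of a full mesh follows directly from its definition: each of the n nodes connects to the other n-1 nodes, for a total of n(n-1)/2 point-to-point links. A short Python sketch for the 9-node example:

```python
def full_mesh_sizing(n: int) -> tuple[int, int]:
    """Return (connections per node, total point-to-point links) for an
    n-node full mesh, where every node links directly to every other node."""
    return n - 1, n * (n - 1) // 2

per_node, total_links = full_mesh_sizing(9)   # the 9-node example above
print(per_node, total_links)                  # 8 connections per node, 36 links
```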
(29) As discussed in further detail below, the ICAS topology enables a data center network that is far superior to a network of the fat-tree topology used in prior art data center networks. Unlike other network topologies, the ICAS topology imposes a structure on the network which avoids congestion, while allowing aggregation of traffic using a multipath routing scheme (e.g., ECMP). According to one embodiment, the present invention provides an ICAS module as a component for interconnecting communicating devices.
(32) As switching in ICAS module 400 is achieved passively by its connectivity, no power is dissipated in performing the switching function. Typical port-to-port delay through an ICAS passive switch is around 10 ns (at 5 ns/meter for an optical fiber), making it very desirable for a data center application, or for big data, AI and HPC environments.
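The quoted latency follows from fiber propagation delay alone; the arithmetic sketch below (illustrative only, with an assumed internal fiber length of about 2 meters) shows how the figure arises:

```python
FIBER_DELAY_NS_PER_M = 5.0          # propagation delay quoted above

def port_to_port_delay_ns(fiber_length_m: float) -> float:
    """Delay through a passive ICAS interconnect of the given fiber length."""
    return fiber_length_m * FIBER_DELAY_NS_PER_M

print(port_to_port_delay_ns(2.0))   # ~10 ns for roughly 2 m of internal fiber
```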
(33) The indexing scheme of external-to-internal connections in ICAS module 400 of
(34) TABLE 1

                     ICAS External Connection
ICAS Port     0    1    2    3    4    5    6    7
    0         1    2    3    4    5    6    7    8
    1         0    2    3    4    5    6    7    8
    2         0    1    3    4    5    6    7    8
    3         0    1    2    4    5    6    7    8
    4         0    1    2    3    5    6    7    8
    5         0    1    2    3    4    6    7    8
    6         0    1    2    3    4    5    7    8
    7         0    1    2    3    4    5    6    8
    8         0    1    2    3    4    5    6    7
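The regularity of Table 1 can be expressed compactly. The following sketch (a non-limiting illustration, not the disclosed implementation) reproduces the table for an n-port ICAS module, with the external connections of port p facing the other n-1 ports taken in ascending order:

```python
def icas_port_map(n: int) -> dict[int, list[int]]:
    """For an n-port ICAS module, map each port to the ordered list of peer
    ports reached through its external connections 0 .. n-2."""
    return {p: [q for q in range(n) if q != p] for p in range(n)}

for port, peers in icas_port_map(9).items():   # reproduces Table 1 (9 ports)
    print(port, peers)
```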
(37) In full mesh topology network 500, the interfaces of each TOR switch are divided into ports, such that each port contains 8 interfaces (connections). To illustrate this arrangement, port 2 from each TOR switch connects to ICAS module 400. As each TOR switch has a dedicated path through ICAS module 400 to each of the other TOR switches, no congestion can result from two or more flows from different source switches being routed to the same destination switch (the Single-Destination-Multiple-Source Traffic Aggregation case). In that case, for example, when TOR switches 51-1 to 51-8 each have a 10-G data flow destined for TOR switch 51-0, all the flows would be routed on separate paths through their respective connections. Table 2 summarizes the separate designated paths:
(38) TABLE 2

Source          ICAS Internal       ICAS Internal       Destination
                Source              Destination
T1.p2.c0   →    ICAS2.p1.c0    →    ICAS2.p0.c1    →    T0.p2.c0
T2.p2.c0   →    ICAS2.p2.c0    →    ICAS2.p0.c2    →    T0.p2.c1
T3.p2.c0   →    ICAS2.p3.c0    →    ICAS2.p0.c3    →    T0.p2.c2
T4.p2.c0   →    ICAS2.p4.c0    →    ICAS2.p0.c4    →    T0.p2.c3
T5.p2.c0   →    ICAS2.p5.c0    →    ICAS2.p0.c5    →    T0.p2.c4
T6.p2.c0   →    ICAS2.p6.c0    →    ICAS2.p0.c6    →    T0.p2.c5
T7.p2.c0   →    ICAS2.p7.c0    →    ICAS2.p0.c7    →    T0.p2.c6
T8.p2.c0   →    ICAS2.p8.c0    →    ICAS2.p0.c8    →    T0.p2.c7
(39) In Table 2 (as well as in all Tables herein), the switch source and the switch destination are each specified by 3 values: Ti.pj.ck, where Ti is the TOR switch with index i, pj is the port with port number j and ck is the connection with connection number k. Likewise, the source and destination connections in ICAS module 400 are also each specified by 3 values: ICASj.pi.ck, where ICASj is the ICAS module with index j, pi is the port with port number i and ck is the internal or external connection with connection number k.
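The rows of Table 2 follow mechanically from this notation together with the indexing of Table 1. The sketch below is illustrative only; the helper names are not part of the disclosure, and the indexing conventions are inferred from the tables rather than stated normatively: TOR-side connection numbers use the compacted 0..7 index of Table 1, while ICAS-internal connections are indexed by the peer port number.

```python
def external_index(n: int, port: int, peer: int) -> int:
    """Compacted index of `peer` among the external connections of `port`
    (i.e., the column of Table 1 in which `peer` appears on row `port`)."""
    return [q for q in range(n) if q != port].index(peer)

def dedicated_path(n: int, icas: int, src: int, dst: int) -> str:
    """Dedicated one-hop path from TOR `src` to TOR `dst` through ICAS `icas`."""
    c_out = external_index(n, src, dst)    # connection used on the source TOR
    c_in = external_index(n, dst, src)     # connection used on the destination TOR
    return (f"T{src}.p{icas}.c{c_out} -> ICAS{icas}.p{src}.c{dst} -> "
            f"ICAS{icas}.p{dst}.c{src} -> T{dst}.p{icas}.c{c_in}")

# All-to-one traffic of Table 2: T1..T8 each sending one flow to T0 via ICAS2.
for src in range(1, 9):
    print(dedicated_path(9, 2, src, 0))
```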
(40) By convention, an ICAS module whose ports are connected to port i of all the TOR switches is labeled ICASi, with index i.
(41) Congestion can also be avoided in full mesh topology network 500 with a suitable routing method, even when a source switch receives a large burst of aggregated data (e.g., 80 Gbits per second) from all its connected servers to be routed to the same destination switch (the Port-to-Port Traffic Aggregation case). In this case, it is helpful to imagine the TOR switches as consisting of two groups: the source switch i and the rest of the switches 0 to i-1 and i+1 to 8. The rest of the switches are herein collectively referred to as the spine group. Suppose TOR switch 51-1 receives 80 Gbits per second (e.g., eight 10G flows) from all its connected servers, all destined for TOR switch 51-0. The routing method for the Port-to-Port Traffic Aggregation case allocates the aggregated traffic to its eight 10G-connections with port 1 of ICAS module 400, such that the data packets on each 10G-connection are routed to a separate TOR switch in the spine group (Table 3A):
(42) TABLE 3A

Source          ICAS Internal       ICAS Internal       Destination
                Source              Destination
T1.p2.c0   →    ICAS2.p1.c0    →    ICAS2.p0.c1    →    T0.p2.c0
T1.p2.c1   →    ICAS2.p1.c2    →    ICAS2.p2.c1    →    T2.p2.c1
T1.p2.c2   →    ICAS2.p1.c3    →    ICAS2.p3.c1    →    T3.p2.c1
T1.p2.c3   →    ICAS2.p1.c4    →    ICAS2.p4.c1    →    T4.p2.c1
T1.p2.c4   →    ICAS2.p1.c5    →    ICAS2.p5.c1    →    T5.p2.c1
T1.p2.c5   →    ICAS2.p1.c6    →    ICAS2.p6.c1    →    T6.p2.c1
T1.p2.c6   →    ICAS2.p1.c7    →    ICAS2.p7.c1    →    T7.p2.c1
T1.p2.c7   →    ICAS2.p1.c8    →    ICAS2.p8.c1    →    T8.p2.c1
(43) Note that the data routed to TOR switch 51-0 has arrived at its destination and therefore would not be routed further. Each TOR switch in the spine group, other than TOR switch 51-0, then allocates its received 10G flow to its connection 0 for forwarding the received data to TOR switch 51-0 (Table 3B):
(44) TABLE 3B

Source          ICAS Internal       ICAS Internal       Destination
                Source              Destination
T2.p2.c0   →    ICAS2.p2.c0    →    ICAS2.p0.c2    →    T0.p2.c1
T3.p2.c0   →    ICAS2.p3.c0    →    ICAS2.p0.c3    →    T0.p2.c2
T4.p2.c0   →    ICAS2.p4.c0    →    ICAS2.p0.c4    →    T0.p2.c3
T5.p2.c0   →    ICAS2.p5.c0    →    ICAS2.p0.c5    →    T0.p2.c4
T6.p2.c0   →    ICAS2.p6.c0    →    ICAS2.p0.c6    →    T0.p2.c5
T7.p2.c0   →    ICAS2.p7.c0    →    ICAS2.p0.c7    →    T0.p2.c6
T8.p2.c0   →    ICAS2.p8.c0    →    ICAS2.p0.c8    →    T0.p2.c7
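Tables 3A and 3B together describe a spread-then-forward scheme: the source TOR first fans the aggregated burst out across all of its ICAS connections (Table 3A), and each intermediate spine-group TOR then relays its share to the destination over its own dedicated connection (Table 3B). The following is an illustrative sketch of that allocation, using the same indexing conventions inferred above (the helper names are not part of the disclosure):

```python
def _ext(n: int, port: int, peer: int) -> int:
    """Compacted external-connection index of `peer` on `port` (Table 1)."""
    return [q for q in range(n) if q != port].index(peer)

def _path(n: int, icas: int, s: int, d: int) -> str:
    """Dedicated one-hop path from TOR s to TOR d through ICAS `icas`."""
    return (f"T{s}.p{icas}.c{_ext(n, s, d)} -> ICAS{icas}.p{s}.c{d} -> "
            f"ICAS{icas}.p{d}.c{s} -> T{d}.p{icas}.c{_ext(n, d, s)}")

def spread_then_forward(n: int, icas: int, src: int, dst: int) -> list[str]:
    """Spread src's aggregated burst one flow per peer, then relay to dst."""
    routes = []
    for mid in (q for q in range(n) if q != src):      # one flow per ICAS connection
        first_hop = _path(n, icas, src, mid)           # Table 3A
        if mid == dst:
            routes.append(first_hop)                   # already delivered
        else:                                          # Table 3B second hop
            routes.append(first_hop + "  then  " + _path(n, icas, mid, dst))
    return routes

for route in spread_then_forward(9, 2, 1, 0):          # the example of Tables 3A/3B
    print(route)
```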
(45) Thus, the full mesh topology network of the present invention provides performance that is in stark contrast to prior art network topologies (e.g., fat-tree), in which congestion at the source or destination switches cannot be avoided under the Single-Destination-Multiple-Source Traffic Aggregation and Port-to-Port Traffic Aggregation cases.
(46) Also, as discussed above, when TOR switches 51-0 to 51-8 abide by the rule m ≥ 2n−2, where m is the number of network-side connections (e.g., the connections with a port in ICAS module 400) and n is the number of the TOR switch's input connections (e.g., connections to the servers within the data center), a strict blocking condition is avoided. In other words, a static path is available between any pair of input connections under any traffic condition. Avoiding such a blocking condition is essential in a circuit-switched network but is rarely significant in a flow-based switched network.
(47) In the full mesh topology network 500 of
(48) In full mesh topology network 500, uniform traffic may be spread out to the spine group and then forwarded to its destination. In network 620 of
(49) The inventor of the present invention investigated in detail the similarities and the differences between the full mesh topology of the present invention and other network topologies, such as the fat-tree topology in the data center network of
(50) The inventor discovered that an n-node full mesh graph is embedded in a fabric-leaf network represented by a bipartite graph with (n-1, n) nodes (i.e., a network with n-1 fabric nodes and n server leaves).
(51) This discovery leads to the following rather profound results: (a) An n-node full mesh graph is embedded in an (n-1, n)-bipartite graph; and the (n-1, n) bipartite graph and the data center Fabric/TOR topology have similar connectivity characteristics; (b) A network in the (n-1, n) Fabric/TOR topology (i.e., with n-1 fabric switches and n TOR switches) can operate with the same connectivity characteristics as a network with full mesh topology (e.g., network 500 of
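One way to make the claimed similarity concrete (an illustrative check, not a proof taken from the disclosure) is to count internally disjoint paths between a pair of leaf nodes: an n-node full mesh offers the direct link plus n-2 two-hop detours, and an (n-1, n) fabric-leaf bipartite graph offers one two-hop path through each of its n-1 fabric nodes — n-1 disjoint paths in both cases.

```python
def full_mesh_paths(n: int, i: int, j: int) -> list[tuple]:
    """Internally disjoint paths between nodes i and j in an n-node full mesh:
    the direct link plus a two-hop detour through each other node."""
    return [(i, j)] + [(i, k, j) for k in range(n) if k not in (i, j)]

def bipartite_paths(n: int, i: int, j: int) -> list[tuple]:
    """Internally disjoint paths between leaves i and j in an (n-1, n)
    fabric-leaf bipartite graph: one two-hop path through each fabric node."""
    return [(i, f"F{f}", j) for f in range(n - 1)]

n = 9
assert len(full_mesh_paths(n, 0, 5)) == len(bipartite_paths(n, 0, 5)) == n - 1
print(f"{n - 1} internally disjoint paths between any pair of leaves in both topologies")
```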
(52) In the following, a data center network that incorporates ICAS modules in place of fabric switches may be referred to as an ICAS-based data center network. An ICAS-based data center network has the following advantages:
(53) (a) less costly, as fabric switches are not used;
(54) (b) lower power consumption, as ICAS modules are passive;
(55) (c) less congestion;
(56) (d) lower latency;
(57) (e) fewer network layers; and
(58) (f) greater scalability as a data center network.
(59) These results may be advantageously used to improve typical state-of-the-art data center networks.
(61) Details of a spine plane of
(62) Details of a server pod of
(63) The data traffic through the fabric switches is primarily limited to intra-pod. The TOR switches now route both the intra-pod traffic and the inter-pod traffic and are more complex. The independent link types achieve massive scalability in data center network implementations. (Additional independent links, provided by a higher-radix switching ASIC, may be created to achieve larger-scale connectivity objectives.) Additionally, data center network 800 incorporates the full mesh topology concept (without physically incorporating an ICAS module) to remove redundant network devices and allow the use of innovative switching methods, in order to achieve a lean and mean data center fabric with improved data traffic characteristics.
(64) As shown in
(65) As the network in the intra-pod region of each server pod can operate with the same connectivity characteristics as a full mesh topology network, all 20 fabric switches of the server pod may be replaced by an ICAS module. ICAS-based data center network 900, resulting from substituting fabric switches 83-0 to 83-19 of data center network 800, is shown in
(67) Details of a spine plane of
(68) As pointed out earlier in this detailed description, state-of-the-art data centers and switch silicon are designed with 4 interfaces (TX, RX) at 10 Gb/s or 25 Gb/s each per port in mind. Switching devices are interconnected at the connection level in an ICAS-based data center. In such a configuration, a QSFP cable coming out of a QSFP transceiver is separated into 4 connections, and 4 connections from different QSFP transceivers are combined into a QSFP cable for connecting to another QSFP transceiver. Also, a spine plane may interconnect a large and varying number of ICAS pods (e.g., in the hundreds) because of the scalability of an ICAS-based data center network. Such a cabling scheme is more suitably organized in a fanout cable transpose rack (e.g., fanout cable transpose rack 921), which may occupy one or multiple racks and may be integrated into the spine planes. Specifically, the spine switches and the TOR switches may each connect to the fanout cable transpose rack with QSFP straight cables. Such an arrangement simplifies the cabling in a data center.
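The split-and-regroup step can be pictured as a small transpose. The toy sketch below illustrates the idea only; the actual grouping in the transpose rack is defined by the figures, not by this code. It regroups the 4 connections of each of 4 source QSFP transceivers so that destination QSFP j collects one connection from each source:

```python
def transpose_qsfp(num_qsfp: int = 4, lanes: int = 4) -> dict:
    """Map (source QSFP, connection) -> (destination QSFP, connection) so that
    destination QSFP j gathers connection j of every source QSFP."""
    return {(s, c): (c, s) for s in range(num_qsfp) for c in range(lanes)}

for (src, conn), (dst, dst_conn) in sorted(transpose_qsfp().items()):
    print(f"spine-side QSFP {src}, connection {conn} -> TOR-side QSFP {dst}, connection {dst_conn}")
```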
(69) Details of an ICAS pod of
(70) The data traffic through the ICAS module is primarily limited to intra-pod. The TOR switches now perform routing for the intra-pod traffic as well as the inter-pod traffic and are more complex. The independent link types achieve massive scalability in data center network implementations. (Additional independent links, provided by a higher-radix switching ASIC, may be created to achieve larger-scale connectivity objectives.)
(71) As shown in
(72) Together, the ICAS pods and the spine planes form a modular network topology capable of accommodating hundreds of thousands of 10G-connected servers, scaling to multi-petabit bisection bandwidth, and covering a data center with improved congestion and non-oversubscribed rack-to-rack performance.
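A back-of-envelope check of that scale claim (illustrative arithmetic using the server-side transceiver count from Table 4 below and the 4 × 10G-per-QSFP convention discussed above; the one-connection-per-server reading is an assumption):

```python
SERVER_SIDE_QSFP = 56400      # "Server-side QSFP XCVR" in the ICAS columns of Table 4
CONNECTIONS_PER_QSFP = 4      # each QSFP transceiver carries 4 x 10G connections (assumed)
GBPS_PER_CONNECTION = 10

servers = SERVER_SIDE_QSFP * CONNECTIONS_PER_QSFP
server_bandwidth_pbps = servers * GBPS_PER_CONNECTION / 1e6   # Gb/s -> Pb/s

print(f"{servers:,} 10G-connected servers")                            # 225,600 servers
print(f"{server_bandwidth_pbps:.2f} Pb/s of server-facing bandwidth")  # ~2.26 Pb/s
```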
(73) According to one embodiment of the present invention, a spine switch can be implemented using a high-radix (e.g., 240×10G) single-chip switching device, as shown in
(74) To work around the CMOS semiconductor limitation, one may create a stackable switching device, in which multiple ICAS modules and switching devices are put in a rack or in multiple racks to form a larger high-radix (i.e., high port-count) spine switch, such as shown in
(75) Details of an ICAS-based stackable switching device 950 are shown in
(76) An ICAS-based stackable switching device offers improved network congestion together with cost, power consumption and space savings, relative to the switching devices implemented in state-of-the-art data centers. As shown in the ICAS + Stackable Chassis column of Table 4, a data center with ICAS modules and ICAS-based stackable switching devices performs remarkably, with a 53.5% saving in total switching ASICs, a 26.0% saving in total power consumption, a 25.6% saving in total space, and much improved network congestion performance. However, total QSFP transceiver usage increases by 2.3%.
(77) The above stackable switching device is for illustrative purposes only. A person of ordinary skill in the art may expand the scalability of the stackable switching device. The present invention is not limited by the illustration.
(78) One embodiment of the present invention also provides a multi-unit (rack unit) chassis-based switching device. The multi-unit chassis switching device groups many switching integrated circuits across multiple line cards. The multi-unit chassis-based switching device interconnects line cards, control cards and CPU cards through PCB-based fabric cards or a backplane, and saves the proportional cost of the transceivers, cables and rack space otherwise required to interconnect them. It is instructive to note that a rack unit (RU, or simply U) is a measure of chassis height in a data center and equals 1.75 inches. A full rack is a 48U (48 rack unit) tall rack.
(79) Details of an ICAS-based multi-unit chassis switching device 970 are shown in
(80) A multi-unit chassis-based switching device whose fabric cards implement the ICAS-based full mesh topology offers improved network congestion and cost and power consumption savings relative to an ASIC-based fabric card implementation with a fat-tree topology. As shown in the ICAS + Multi-unit Chassis column of Table 4, a data center with ICAS modules and ICAS-based multi-unit chassis-based switching devices performs remarkably, with a 12.6% saving in total QSFP transceivers, a 53.5% saving in total switching ASICs, a 32.7% saving in total power consumption, a 29.95% saving in total space, and much improved network congestion performance.
(81) The above multi-unit chassis switching device is for illustrative purposes only. A person of ordinary skill in the art may expand the scalability of the multi-unit chassis switching device. The present invention is not limited by the illustration.
(82) The multi-unit chassis-based switching device has the disadvantages of a much longer development time and a higher manufacturing cost, due to its system complexity, and is also limited overall by the form factor of the multi-unit chassis. The multi-unit chassis-based switching device, though, provides a much larger port count than the single-chip switching device. Although the stackable switching device requires more transceivers and cables than the multi-unit chassis-based approach, the stackable approach has the advantages of greater manageability of the internal network interconnection and virtually unlimited scalability, and requires significantly less time for assembling a much larger switching device.
(83) The material required for (i) the data center networks of
(84) TABLE 4

                                          Fat-tree +        ICAS +             ICAS +
                                          Multi-unit        Multi-unit         Stackable
                                          Chassis           Chassis            Chassis
Intralink (within Pod)                    N/A               5                  5
Interlink (Across Pod)                    4                 5                  5
Downlink (to Server)                      12                15                 15
Total                                     16                25                 25
D:U ratio                                 3                 3                  3
D:I ratio                                 N/A               3                  3
Number of 10G Interfaces (for comparison) 96                184.3              184.3
QSFP XCVR Module (Watt)                   4                 4                  4
TOR Switch (Watt)                         150               200                200
Multi-unit Chassis (Watt)                 1660              0                  0
Spine-side Interlink QSFP XCVR            18432             18800              38000
TOR-side Interlink QSFP XCVR              18432             18800              18800
Fabric/TOR-side Intralink QSFP XCVR       36864             18800              18800
Server-side QSFP XCVR                     55296             56400              56400
Total QSFP XCVR                           129024            112800 (−12.6%)    132000 (+2.3%)
ASIC in Spine Switch                      2304              1600               1600
ASIC in Fabric Switch                     4608              0                  0
ASIC in TOR Switch                        4608              3760               3760
Total Switching ASIC                      11520             5360 (−53.5%)      5360 (−53.5%)
Spine Switch (kW)                         392.448           327.2              472.0
Fabric Switch (kW)                        784.896           0                  0
TOR Switch (kW)                           986.112           1128.0             1128.0
Total Power Consumption (kW)              2163.456          1455.2 (−32.7%)    1600 (−26.0%)
96 x QSFP Spine Switch (8U)               1536              0                  0
96 x QSFP Fabric Switch (8U)              3072              0                  0
48 x QSFP Spine Switch (4U)               0                 1600               1600
TOR Switch (1U)                           4608              3760               3760
ICAS1X5TRIPLE (1U)                        0                 0                  400
ICAS5X21 (2U)                             0                 376                376
Transpose Rack (36U)                      0                 720                720
ICAS2X9 (1U)                              0                 0                  0
ICAS8X33 (4U)                             0                 0                  0
ICAS10X41 (6U)                            0                 0                  0
ICAS16X65 (16U)                           0                 0                  0
Total Rack Units (U)                      9216              6456 (−29.95%)     6856 (−25.6%)
Pod Interlink Bandwidth (Tbps)            7.7               4.0                4.0
Pod Intralink Bandwidth (Tbps)            7.7               4.0                4.0
Total Data Link Bandwidth (Pbps)          2.2               2.2                2.2
Per Plane Uplink Bandwidth (Tbps)         7.7/plane         0                  0
Total Spine Uplink Bandwidth (Tbps)       0                 150.4              601.6
Total ICAS Uplink Bandwidth (Tbps)        0                 37.6               37.6
Spine-side Interlink QSFP Cable           18432             18800              18800
QSFP Fanout Cable (Transpose Rack)        0                 37600              37600
QSFP Fanout Cable (ICAS5X21)              0                 19740              19740
TOR-side Interlink QSFP Cable             0                 18800              18800
TOR-side Intralink QSFP Cable             18432             18800              18800
Spine Switch QSFP Cable                   0                 0                  19200
QSFP Fanout Cable (ICAS1X5TRIPLE)         0                 0                  19200
Total QSFP Cable                          36864             56400              75600
Total QSFP Fanout Cable                   0                 57340              76540
(85) As shown in Table 4, the ICAS-based systems require significantly less power, fewer ASICs and less space, resulting in reduced material cost and energy consumption.
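The percentages quoted in paragraphs (76) and (80) follow directly from the raw counts in Table 4 (baseline: the Fat-tree + Multi-unit Chassis column); a short arithmetic check, provided for verification only:

```python
def saving(baseline: float, icas: float) -> float:
    """Relative reduction versus the fat-tree baseline, in percent
    (a negative value indicates an increase)."""
    return (baseline - icas) / baseline * 100

print(f"Switching ASICs:          {saving(11520, 5360):.1f}%")       # 53.5%
print(f"Power, multi-unit (kW):   {saving(2163.456, 1455.2):.1f}%")  # 32.7%
print(f"Power, stackable (kW):    {saving(2163.456, 1600):.1f}%")    # 26.0%
print(f"QSFP XCVR, multi-unit:    {saving(129024, 112800):.1f}%")    # 12.6%
print(f"QSFP XCVR, stackable:     {saving(129024, 132000):.1f}%")    # -2.3% (increase)
print(f"Rack units, multi-unit:   {saving(9216, 6456):.2f}%")        # 29.95%
print(f"Rack units, stackable:    {saving(9216, 6856):.1f}%")        # 25.6%
```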
(86) The above detailed description is provided to illustrate specific embodiments of the present invention and is not intended to be limiting. Numerous modifications and variations within the scope of the present invention are possible. The present invention is set forth in the accompanying claims.