Data center interconnect as a switch

10477288 · 2019-11-12

Assignee

Inventors

CPC classification

International classification

Abstract

An interconnect module (ICAS module) includes n optical data ports each comprising n-1 optical interfaces, and an interconnecting network implementing a full mesh topology for interconnecting the optical interfaces of each port each to a respective one of the optical interfaces of each of the other ports. In one embodiment, each optical interface exchanges data signals over a communication medium with an optical transceiver. The interconnecting module may implement the full mesh topology using optical fibers. The interconnecting module may be used to replace fabric switches, as well as serve as a building block for a spine switch.

Claims

1. A data center network having a plurality of data interfaces for receiving and transmitting data signals from and to a plurality of servers, comprising: a first plurality of data switching units, wherein each of the first plurality of data switching units comprises (a) a plurality of first level interconnecting devices providing a subset of the data interfaces; and (b) one or more second level interconnecting devices routing data signals of the subset of data signals among the subset of data interfaces of the first level interconnecting device and a data interface with an external network, wherein the first and second interconnecting devices are interconnected to implement a full mesh network of a predetermined number of nodes; and a second plurality of data switching units, wherein each of the second plurality of the data switching units comprises a plurality of third level interconnecting devices, wherein each third level interconnecting device routes data signals received from, or to be transmitted to, a corresponding one of the first level interconnecting devices in each of the first plurality of data switching units.

2. The data center network of claim 1, wherein the data signals of each data interface are received from or to be transmitted to optical transceiver.

3. The data center network of claim 1, wherein the first level interconnecting devices comprise top-of-rack (TOR) switches.

4. The data center network of claim 1, wherein the second level interconnecting devices comprise fabric switches, wherein the full mesh network comprises the predetermined number of first level interconnecting devices and the predetermined number less one fabric switches.

5. The data center network of claim 1, wherein the second level interconnecting devices of each of the first plurality of data switching units comprise an interconnect module, which comprise: n optical data ports each comprising n-1 optical interfaces, wherein each optical interface receives and transmits data signals over a communication medium; and an interconnecting network implementing a full mesh topology for interconnecting the optical interfaces of each port each to a respective one of the optical interfaces of each of the other ports.

6. The data center network of claim 5, wherein one of the optical data ports is provided as an uplink interface for connecting to an external network.

7. The data center network of claim 5, wherein the interconnecting network comprises optical fibers.

8. The data center network of claim 5, wherein the optical interfaces each receive and transmit optical signals to and from optical transceiver.

9. The data center network of claim 5, wherein the second level connecting devices provide communication links for traffic among the first level interconnecting devices.

10. The data center network of claim 1, wherein the third level interconnecting devices each comprise a spine switch.

11. The data center network of claim 10, wherein each first level interconnecting device is connected by communication links to each of the spine switches in each of the second plurality of data switching units.

12. The data center network of claim 11, wherein the communication links in each of the second plurality of data switching units are implemented at least in part by a fanout cable transpose rack.

13. The data center network of claim 12, wherein each first level interconnecting device is connected by a fanout cable transpose rack to each of the spine switches in each of the second plurality of data switching units.

14. The data center network of claim 13, wherein the fanout cable transpose rack comprises a fraction of a full rack, a standalone full rack, or multiple full racks.

15. The data center network of claim 10, wherein the spine switch comprises one or more switching elements connected to one or more interconnect modules, each interconnect modules comprising: n optical data ports each comprising n-1 optical interfaces, wherein each optical interface receives and transmits data signals over a communication medium; and an interconnecting network implementing a full mesh topology for interconnecting the optical interfaces of each port each to a respective one of the optical interfaces of each of the other ports.

16. The data center network of claim 15, wherein one of the optical data ports is provided as an uplink interface for connecting to an external network.

17. The data center network of claim 15, wherein the interconnecting network comprises optical fibers.

18. The data center network of claim 15, wherein the optical interfaces each receive and transmit optical signals to and from optical transceiver.

19. The data center network of claim 15, wherein the switching elements and the interconnection modules are each packaged in a stackable, rack-mount chassis.

20. The data center network of claim 19, wherein the stackable switching device comprises multiple network devices, each being 1U high.

21. The data center network of claim 15, wherein the switching elements and the interconnection modules are packaged in a printed circuit board-based multi-unit rack mount chassis.

22. The data center network of claim 21, wherein the multi-unit switching device is multiple rack units high.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1a illustrates congestion due to hash collision in a fat-tree network under ECMP.

(2) FIG. 1b illustrates aggregation congestion in a fat-tree network topology.

(3) FIG. 1c illustrates congestion due to blocking condition in a fat-tree network.

(4) FIG. 2a shows the architecture of a state-of-the-art data center network.

(5) FIG. 2b shows in detail an implementation of a spine plane of the data center network of FIG. 2a.

(6) FIG. 2c shows in detail an implementation of a server pod of FIG. 2a using four fabric switches to distribute machine-to-machine traffic across 48 top-of-rack switches.

(7) FIG. 2d shows in detail an implementation of an edge pod of FIG. 2a using four edge switches to provide uplink interfaces to connect to one or more external networks.

(8) FIG. 3 illustrates a full mesh topology in a network of 9 nodes.

(9) FIG. 4a shows ICAS module 400, which interconnects 9 nodes, according to the full mesh topology of FIG. 3.

(10) FIG. 4b illustrates the connections between the internal connections and the external connections of port 7 of the 9-node ICAS module 400, in accordance with one embodiment of the present invention.

(11) FIG. 5a shows ICAS module 400 connecting port 2 of each of TOR switches 51-0 to 51-8 in a full mesh topology network 500, in accordance with one embodiment of the present invention.

(12) FIG. 5b illustrates, in the full mesh topology network 500 of FIG. 5a, port 2 of TOR switch 51-1 routing a data packet to port 2 of TOR switch 51-7 through internal connection 52-1-7 of port 50-1 and internal connection 52-7-1 of port 50-7 of ICAS module 400, in accordance with one embodiment of the present invention.

(13) FIG. 6a shows network 600, which is a more compact representation of the network of FIG. 5a.

(14) FIG. 6b shows network 620, after additional ICAS modules are added to network 600 of FIG. 6a, so as to provide greater bandwidth and path diversity.

(15) FIG. 7a shows that, in the architecture of the data center of FIG. 2a, the topology of a server pod may be reduced to a (4, 48) bipartite graph.

(16) FIG. 7b shows, as an example, network 720 represented as a (5, 6) bipartite graph.

(17) FIG. 7c shows the 6-node full mesh graph embedded in the (5, 6) bipartite graph of FIG. 7b.

(18) FIG. 8a shows an improved data center network 800, in accordance with one embodiment of the present invention; data center network 800 includes 20 spine planes, providing optional uplinks 801, and 188 server pods, providing optional uplinks 802, with uplinks 801 and 802 connecting to one or more external networks.

(19) FIG. 8b shows in detail an implementation of modified spine plane 820, having 20 spine switches, providing optional uplink interface 821 for connecting to an external network.

(20) FIG. 8c shows in detail an implementation of modified server pod 830 in a (20, 21) fabric/TOR topology, having 20 fabric switches for distributing machine-to-machine traffic across 20 top-of-rack switches, in accordance with one embodiment of the present invention; the 21st TOR switch is removed from modified server pod 830, so that its connections are provided as optional uplink interfaces 831 for connecting the fabric switches to an external network.

(21) FIG. 9a shows ICAS-based data center network 900, achieved by replacing the server pods of network 800 of FIG. 8a (e.g., server pod 830 of FIG. 8c) with ICAS pods 91-0 to 91-187, each ICAS pod being shown in greater detail in FIG. 9c, according to one embodiment of the present invention; in FIG. 9a, optional uplinks 901, shared by 20 spine planes, and optional uplinks 902, shared by 188 ICAS pods, are provided for connecting to an external network.

(22) FIG. 9b shows in detail spine plane 920, which implements one of the spine planes in data center network 900 and which is achieved by integrating a fanout cable transpose rack into spine plane 820 of FIG. 8b, according to one embodiment of the present invention; the spine switches in spine plane 920 provide optional uplink 921 for connecting to an external network.

(23) FIG. 9c shows in detail an implementation of ICAS pod 930, which is achieved by replacing fabric switches 83-0 to 83-19 in server pod 830 of FIG. 8c with an ICAS module, according to one embodiment of the present invention; each ICAS pod provides 20×10G uplinks 931 for connecting to an external network.

(24) FIG. 9d illustrates a spine switch implemented with a single-chip high-radix (i.e., high port count) switching integrated circuit; such a spine switch makes use of the highest port count switching integrated circuit available at the present time.

(25) FIG. 9e shows a spine switch formed by stacking four switch boxes, each implemented with a Trident-II ASIC (96×10G configuration each), and one ICAS box 953; ICAS box 953 includes four ICAS modules 95-0 to 95-3 in one 1U chassis, with each ICAS module having three copies of the ICAS1X5 configuration, such that ICAS box 953 provides a non-blocking 1:1 subscription ratio to each of the four switch boxes 96-0 to 96-3.

(26) FIG. 9f shows a spine switch of an ICAS-based multi-unit switching device in which four ICAS-based fabric cards 97-0 to 97-3 connect in a full mesh topology to switching ASICs 98-0 to 98-3, with switching ASICs 98-0 and 98-1 being housed in line card 973 and switching ASICs 98-2 and 98-3 being housed in line card 974.

(27) To facilitate cross-referencing among the figures and to simplify the detailed description, like elements are assigned like reference numerals.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

(28) The present invention simplifies the network architecture by eliminating the switches in the fabric layer based on a new fabric topology, referred to herein as the interconnect-as-a-switch (ICAS) topology. The ICAS topology of the present invention is based on the full mesh topology. In a full mesh topology, each node is connected to all other nodes. The example of a 9-node full mesh topology is illustrated in FIG. 3. The inherent connectivity of a full mesh network can be exploited to provide fabric layer switching.

(29) As discussed in further detail below, the ICAS topology enables a data center network that is far superior to a network of the fat-tree topology used in prior art data center networks. Unlike other network topologies, the ICAS topology imposes a structure on the network which avoids congestion, while allowing aggregation of traffic using a multipath routing scheme (e.g., ECMP). According to one embodiment, the present invention provides an ICAS module as a component for interconnecting communicating devices. FIG. 4a shows ICAS module 400, which interconnects 9 nodes according to the full mesh topology of FIG. 3.

(30) FIG. 4a shows ICAS module 400 having ports 40-0 to 40-8, each providing 8 external connections and 8 internal connections. In ICAS module 400, each internal connection of each port connects to a corresponding internal connection of another port. In fact, each port is connected to every one of the other ports through exactly one internal connection. In this context, each connection includes a receive-transmit pair of optical fibers capable of, for example, a 10 Gbits per second data rate. In FIG. 4a, the internal connections for port i are indexed 0-8, except that index i is skipped. (For example, the internal connections for port 7 are 0, 1, 2, 3, 4, 5, 6 and 8.) Furthermore, internal connection j of port i is connected to internal connection i of port j. The external connections for each port of ICAS module 400 are indexed sequentially as 0-7.

(31) FIG. 4b illustrates in detail the connections between the internal connections and the external connections of port 7 in ICAS module 400, in accordance with the present invention. As shown in FIG. 4b, the external connections are connected one-to-one to the internal connections in index order. (For example, for port 7, external connections 42-0 to 42-7 are connected in index order to internal connections 41-0 to 41-6 and 41-8.) Thus, in FIGS. 4a and 4b, for port i, external connections 0-7 connect, respectively, to internal connections 0, . . . , 8, with internal connection i skipped. Thus, as can be easily seen from FIGS. 4a and 4b, any pair of ports x and y are connected through internal connection x of port y and internal connection y of port x. This indexing scheme allows an external switching device to assign routes for data packets using the internal port indices of the source and destination ports. No congestion condition (e.g., due to hash collision, aggregation model, or strict-sense blocking) can occur between any pair of ports.

(32) As switching in ICAS module 400 is achieved passively by its connectivity, no power is dissipated in performing the switching function. Typical port-to-port delay through an ICAS passive switch is around 10 ns (approximately 5 ns per meter of optical fiber), making it very desirable for a data center application, or for big data, AI and HPC environments.

(33) The indexing scheme of external-to-internal connections in ICAS module 400 of FIG. 4a is summarized in Table 1 below:

(34) TABLE 1

                 ICAS External Connection
    ICAS Port    0    1    2    3    4    5    6    7
        0        1    2    3    4    5    6    7    8
        1        0    2    3    4    5    6    7    8
        2        0    1    3    4    5    6    7    8
        3        0    1    2    4    5    6    7    8
        4        0    1    2    3    5    6    7    8
        5        0    1    2    3    4    6    7    8
        6        0    1    2    3    4    5    7    8
        7        0    1    2    3    4    5    6    8
        8        0    1    2    3    4    5    6    7

    (Each entry is the internal connection index to which the external connection of that port maps.)
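The mapping of Table 1 can be stated compactly: the internal connections of port i are the indices 0-8 with i skipped, and external connection e of port i simply takes the e-th entry of that list. The following is a minimal Python sketch, not part of the patent, that reproduces Table 1 and looks up which external connection a source port must use to reach a given destination port; the function names are illustrative only.

    def internal_targets(port, n=9):
        # Internal connection indices of a port in an n-node ICAS module:
        # 0..n-1 with the port's own index skipped (one row of Table 1).
        return [j for j in range(n) if j != port]

    def external_for_destination(src, dst, n=9):
        # External connection index on port `src` that reaches port `dst`.
        # Internal connection `dst` of port `src` is wired to internal
        # connection `src` of port `dst`, and external connections map to
        # internal connections in index order.
        return internal_targets(src, n).index(dst)

    # Reproduce Table 1 for the 9-node ICAS module of FIG. 4a.
    for port in range(9):
        print(port, internal_targets(port))

    # Example of FIG. 5b below: TOR switch 51-1 reaches TOR switch 51-7 by
    # sending on external connection 6 of port 50-1 (53-1-6), arriving on
    # external connection 1 of port 50-7 (53-7-1).
    assert external_for_destination(1, 7) == 6
    assert external_for_destination(7, 1) == 1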

(35) FIG. 5a shows network 500, in which ICAS module 400 connects port 2 of each of TOR switches 51-0 to 51-8 in a full mesh topology, in accordance with one embodiment of the present invention.

(36) FIG. 5b illustrates, in the full mesh topology network 500 of FIG. 5a, TOR switch 51-1 routing a data packet to TOR switch 51-7 through external connection 53-1-6 and internal connection 52-1-7 of port 50-1, and internal connection 52-7-1 and external connection 53-7-1 of port 50-7, within ICAS module 400, in accordance with one embodiment of the present invention. As shown in FIG. 5b, TOR switch 51-1, which is connected to port 50-1 of ICAS module 400, receives a data packet with a destination reachable through internal connection 52-1-7 of ICAS module 400. TOR switch 51-1 has a port that includes 8 connections 54-1-0 to 54-1-7 (provided as two QSFP ports) mapping one-to-one to external connections 53-1-0 to 53-1-7 of port 50-1 of ICAS module 400, which in turn map one-to-one, in sequential order, to internal connections 52-1-0 and 52-1-2 to 52-1-8 of port 50-1 of ICAS module 400. TOR switch 51-7 has a port that includes 8 connections 54-7-0 to 54-7-7 (provided as two QSFP ports) mapping one-to-one to external connections 53-7-0 to 53-7-7 of port 50-7 of ICAS module 400, which in turn map one-to-one, in sequential order, to internal connections 52-7-0 to 52-7-6 and 52-7-8 of port 50-7 of ICAS module 400. Each connection in a TOR switch port may be a 10-G connection, for example. As ports 50-1 and 50-7 of ICAS module 400 are connected through the ports' respective internal connections 52-1-7 and 52-7-1, TOR switch 51-1 sends the data packet through its connection 54-1-6 to external connection 53-1-6 of ICAS module 400. Because of the full mesh topology of the connections within ICAS module 400, the data packet is routed to external connection 53-7-1 of ICAS module 400.

(37) In full mesh topology network 500, the interfaces of each TOR switch are divided into ports, such that each port contains 8 interfaces (connections). To illustrate this arrangement, port 2 from each TOR switch connects to ICAS module 400. As each TOR switch has a dedicated path through ICAS module 400 to each of the other TOR switches, no congestion can result from two or more flows from different source switches being routed to the same destination switch (the Single-Destination-Multiple-Source Traffic Aggregation case). In that case, for example, when TOR switches 51-1 to 51-8 each have a 10-G data flow that has TOR switch 51-0 as destination, all the flows would be routed on paths through their respective designated connections. Table 2 summarizes the separate designated paths:

(38) TABLE 2

    Source → ICAS Source Internal → ICAS Destination Internal → Destination
    T1.p2.c0 → ICAS2.p1.c0 → ICAS2.p0.c1 → T0.p2.c0
    T2.p2.c0 → ICAS2.p2.c0 → ICAS2.p0.c2 → T0.p2.c1
    T3.p2.c0 → ICAS2.p3.c0 → ICAS2.p0.c3 → T0.p2.c2
    T4.p2.c0 → ICAS2.p4.c0 → ICAS2.p0.c4 → T0.p2.c3
    T5.p2.c0 → ICAS2.p5.c0 → ICAS2.p0.c5 → T0.p2.c4
    T6.p2.c0 → ICAS2.p6.c0 → ICAS2.p0.c6 → T0.p2.c5
    T7.p2.c0 → ICAS2.p7.c0 → ICAS2.p0.c7 → T0.p2.c6
    T8.p2.c0 → ICAS2.p8.c0 → ICAS2.p0.c8 → T0.p2.c7

(39) In Table 2 (as well as in all Tables herein), the switch source and the switch destination are each specified by three values, Ti.pj.ck, where Ti is the TOR switch with index i, pj is the port with port number j, and ck is the connection with connection number k. Likewise, the source and destination connections in the ICAS module are also each specified by three values, ICASj.pi.ck, where ICASj is the ICAS module with index j, pi is the port with port number i, and ck is the internal or external connection with connection number k.

(40) By convention, an ICAS module whose ports are connected to port i of all the TOR switches is labeled ICASi, i.e., it is assigned index i.

(41) Congestion can also be avoided in full mesh topology network 500 with a suitable routing method, even when a source switch receives a large burst of aggregated data (e.g., 80 Gbits per second) from all its connected servers to be routed to the same destination switch (the Port-to-Port Traffic Aggregation case). In this case, it is helpful to imagine the TOR switches as consisting of two groups: the source switch i and the rest of the switches 0 to i-1 and i+1 to 8. The rest of the switches are herein collectively referred to as the spine group. Suppose TOR switch 51-1 receives 80 Gbits per second (e.g., eight 10G flows) from all its connected servers, all destined for TOR switch 51-0. The routing method for the Port-to-Port Traffic Aggregation case allocates the aggregated traffic to the eight 10G connections with port 50-1 of ICAS module 400, such that the data packets in each 10G connection are routed to a separate TOR switch in the spine group (Table 3A):

(42) TABLE 3A

    Source → ICAS Source Internal → ICAS Destination Internal → Destination
    T1.p2.c0 → ICAS2.p1.c0 → ICAS2.p0.c1 → T0.p2.c0
    T1.p2.c1 → ICAS2.p1.c2 → ICAS2.p2.c1 → T2.p2.c1
    T1.p2.c2 → ICAS2.p1.c3 → ICAS2.p3.c1 → T3.p2.c1
    T1.p2.c3 → ICAS2.p1.c4 → ICAS2.p4.c1 → T4.p2.c1
    T1.p2.c4 → ICAS2.p1.c5 → ICAS2.p5.c1 → T5.p2.c1
    T1.p2.c5 → ICAS2.p1.c6 → ICAS2.p6.c1 → T6.p2.c1
    T1.p2.c6 → ICAS2.p1.c7 → ICAS2.p7.c1 → T7.p2.c1
    T1.p2.c7 → ICAS2.p1.c8 → ICAS2.p8.c1 → T8.p2.c1

(43) Note that the data routed to TOR switch 51-0 has arrived at its destination and therefore would not be routed further. Each TOR switch in the spine group, other than TOR switch 51-0, then forwards its received data to TOR switch 51-0 on its connection 0 (Table 3B):

(44) TABLE 3B

    Source → ICAS Source Internal → ICAS Destination Internal → Destination
    . . .
    T2.p2.c0 → ICAS2.p2.c0 → ICAS2.p0.c2 → T0.p2.c1
    T3.p2.c0 → ICAS2.p3.c0 → ICAS2.p0.c3 → T0.p2.c2
    T4.p2.c0 → ICAS2.p4.c0 → ICAS2.p0.c4 → T0.p2.c3
    T5.p2.c0 → ICAS2.p5.c0 → ICAS2.p0.c5 → T0.p2.c4
    T6.p2.c0 → ICAS2.p6.c0 → ICAS2.p0.c6 → T0.p2.c5
    T7.p2.c0 → ICAS2.p7.c0 → ICAS2.p0.c7 → T0.p2.c6
    T8.p2.c0 → ICAS2.p8.c0 → ICAS2.p0.c8 → T0.p2.c7
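Tables 3A and 3B amount to a simple two-phase spreading rule: the source first distributes its aggregated flows, one per member of the spine group (Table 3A), and every intermediate switch then forwards its share to the destination on its connection 0 (Table 3B). A simplified Python sketch of that allocation is given below, using hypothetical labels rather than the full Ti.pj.ck notation; it is an illustration only, not the patent's routing implementation.

    def port_to_port_paths(src, dst, n=9):
        # Two-phase spreading of an aggregated burst from TOR `src` to TOR
        # `dst` over an n-node full mesh (cf. Tables 3A and 3B).
        spine_group = [t for t in range(n) if t != src]   # every switch but the source
        first_hop = []                                    # Table 3A: one flow per member
        for conn, mid in enumerate(spine_group):
            first_hop.append((f"T{src}.c{conn}", f"T{mid}"))
        second_hop = []                                    # Table 3B: forward on connection 0
        for mid in spine_group:
            if mid == dst:
                continue                                   # already delivered in phase one
            second_hop.append((f"T{mid}.c0", f"T{dst}"))
        return first_hop, second_hop

    hop1, hop2 = port_to_port_paths(src=1, dst=0)
    print(hop1)   # T1 spreads its 8 flows across T0 and T2..T8
    print(hop2)   # T2..T8 each forward their received flow to T0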

(45) Thus, the full mesh topology network of the present invention provides performance that is in stark contrast to prior art network topologies (e.g., fat-tree), in which congestion in the source or destination switches cannot be avoided under the Single-Destination-Multiple-Source Traffic Aggregation and Port-to-Port Traffic Aggregation cases.

(46) Also, as discussed above, when TOR switches 51-0 to 51-8 abide by the rule m ≥ 2n-2, where m is the number of network-side connections (e.g., the connections with a port of the ICAS module) and n is the number of the TOR switch's input connections (e.g., connections to the servers within the data center), a strict blocking condition is avoided. In other words, a static path is available between any pair of input connections under any traffic condition. Avoiding such a blocking condition is essential in a circuit-switched network but is rarely significant in a flow-based switched network.

(47) In the full mesh topology network 500 of FIG. 5a, each port of ICAS module 400 has 8 connections connected to an 8-connection port of a corresponding TOR switch (e.g., eight 10-G connections). Full mesh topology network 500 of FIG. 5a may be redrawn in a more compact form in FIG. 6a, with a slight modification. FIG. 6a illustrates ICAS module 60-2 (labeled ICAS2) interconnecting port 2 of each of TOR switches 61-0 to 61-8. In FIG. 6a, the connections between port 2 of a TOR switch and the corresponding port of ICAS module 60-2 are represented as a single line (e.g., the single line between port 2 of TOR switch 61-0 and port 0 of ICAS module 60-2). Such a line, of course, represents all eight connections between the TOR switch and the corresponding port in ICAS module 60-2. The same convention is used in FIG. 6b, where each of TOR switches 63-0 to 63-8 is shown with 4 ports, allowing network 620 of FIG. 6b to be configured by adding three additional ICAS modules 62-0, 62-1 and 62-3 (in addition to 62-2), together with their respective connections, to network 600 of FIG. 6a.

(48) In full mesh topology network 500, uniform traffic may be spread out to the spine group and then forwarded to its destination. In network 620 of FIG. 6b, the additional ICAS modules may be used to provide greater bandwidth. So long as the additional ports are available in the TOR switches, additional ICAS modules may be added to the network to increase path diversity and bandwidth.

(49) The inventor of the present invention investigated in detail the similarities and the differences between the full mesh topology of the present invention and other network topologies, such as the fat-tree topology in the data center network of FIG. 2a. The inventor first observes that, in the architecture of the data center network of FIG. 2a, the fat-tree network represented in a server pod (the fabric/TOR topology) can be reduced to a (4, 48) bipartite graph, so long as the fabric switches merely perform an interconnect function for traffic originating among the TOR switches. This (4, 48) bipartite graph is shown in FIG. 7a. In FIG. 7a, the upper set of nodes, nodes 0-3 (fabric nodes) 70-0 to 70-3, represent the four fabric switches in the server pod of FIG. 2a and the lower set of 48 nodes (i.e., leaf 0-47), labeled 71-0 to 71-47, represent the 48 TOR switches in a server pod of FIG. 2a.

(50) The inventor discovered that an n-node full mesh graph is embedded in a fabric-leaf network represented by a bipartite graph with (n-1, n) nodes (i.e., a network with n-1 fabric nodes and n server leaves). FIG. 7b shows, as an example, a (5, 6) bipartite graph with 5 nodes 72-0 to 72-4 and 6 leaves 73-0 to 73-5. FIG. 7c shows the 6-node full mesh graph 740 with 6 nodes 74-0 to 74-5 embedded in the (5, 6) bipartite graph of FIG. 7b.

(51) This discovery leads to the following rather profound results: (a) An n-node full mesh graph is embedded in an (n-1, n)-bipartite graph, and the (n-1, n) bipartite graph and the data center Fabric/TOR topology have similar connectivity characteristics; (b) A network in the (n-1, n) Fabric/TOR topology (i.e., with n-1 fabric switches and n TOR switches) can operate with the same connectivity characteristics as a network with full mesh topology (e.g., network 500 of FIG. 5a); (c) Fabric switches are unnecessary in an (n-1, n) Fabric/TOR topology network, as the fabric switches merely perform an interconnecting function among the TOR switches (i.e., these fabric switches can be replaced by direct connections among the TOR switches); and (d) A data center network based on a fat-tree topology (e.g., the Fabric/TOR topology) can be improved significantly using ICAS modules.
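Result (a) can be made concrete with a small construction: assign each pair of leaves to a fabric node so that no leaf uses the same fabric node for two different peers, which is a proper edge coloring of the n-node full mesh with n-1 colors (possible for even n, as in the (5, 6) example of FIGS. 7b and 7c, via the classic round-robin schedule). The following Python sketch is provided for illustration only and is not a construction given in the patent; it verifies the embedding for the (5, 6) example.

    def fabric_assignment(n):
        # Assign each pair of leaves (i, j) in an n-leaf pod to one of n-1
        # fabric nodes so that no leaf uses the same fabric node twice
        # (round-robin 1-factorization of the n-node full mesh; n assumed even).
        # Every leaf pair then has its own two-hop path, which is the n-node
        # full mesh embedded in the (n-1, n) bipartite graph.
        assert n % 2 == 0
        assignment = {}
        for f in range(n - 1):                    # one matching per fabric node
            pairs = [(f, n - 1)]                  # the fixed leaf n-1 is paired with leaf f
            for k in range(1, n // 2):
                pairs.append(((f + k) % (n - 1), (f - k) % (n - 1)))
            for a, b in pairs:
                assignment[frozenset((a, b))] = f
        return assignment

    # (5, 6) example of FIGS. 7b and 7c: 6 leaves, 5 fabric nodes.
    asg = fabric_assignment(6)
    assert len(asg) == 15                         # all 15 leaf pairs are covered
    for leaf in range(6):                         # each leaf's 5 uplinks go to 5 distinct fabric nodes
        used = sorted(asg[frozenset((leaf, other))] for other in range(6) if other != leaf)
        assert used == list(range(5))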

(52) In the following, a data center network that incorporates ICAS modules in place of fabric switches may be referred to as an ICAS-based data center network. An ICAS-based data center network has the following advantages:

(53) (a) less costly, as fabric switches are not used;

(54) (b) lower power consumption, as ICAS modules are passive;

(55) (c) less congestion;

(56) (d) lower latency;

(57) (e) fewer network layers; and

(58) (f) greater scalability as a data center network.

(59) These results may be advantageously used to improve typical state-of-the-art data center networks. FIG. 8a shows an improved data center network 800, in accordance with one embodiment of the present invention. Data center network 800 uses the same types of components as the data center network of FIG. 2a (i.e., spine switches, fabric switches and TOR switches), except that the number of fabric switches is increased to one less than the number of TOR switches. (FIG. 8c shows an equal number of fabric switches and TOR switches because one of the TOR switches, the 21st, is removed, so that its connections are provided as an uplink interface to connect to one or more external networks, thus providing an uplink from each of the 20 fabric switches.)

(60) FIG. 8a shows the architecture of an improved data center network, organized in three layers of switching devices (top-of-rack (TOR) switches and fabric switches, implemented in 188 server pods 81-0 to 81-187, and spine switches, implemented in 20 spine planes 80-0 to 80-19), interconnected by interlinks in a fat-tree topology. An interlink refers to the network connections between a server pod and a spine plane. For example, interlink k of each of the 188 server pods is connected to spine plane k; interlink p of each of the 20 spine planes is connected to server pod p. The 20 spine planes each provide an optional uplink (e.g., uplink 801) and the 188 server pods each provide an optional uplink (e.g., uplink 802) for connection to one or more external networks. In this example, to allow comparison, the numbers of server pods and spine planes are chosen so that the improved data center network and the typical state-of-the-art data center network have identical network characteristics (2.2 Pbps total server-side bandwidth; 3:1 oversubscription ratio, i.e., server-side to network-side bandwidth ratio; Trident-II ASIC). Other configurations of the improved data center network are also possible, for instance, a 32-TOR server pod or a 48-TOR server pod, but with higher radix switching silicon than the Trident-II ASIC.

(61) Details of a spine plane of FIG. 8a are shown in FIG. 8b. In FIG. 8b, spine plane 820 consists of 20 spine switches 82-0 to 82-19 each connecting to 188 server pods. The connections from all 20 spine switches are grouped into 188 interlinks, with each interlink including a connection from each spine switch 82-0 to 82-19, for a total of 20 connections per interlink.

(62) Details of a server pod of FIG. 8a are shown in FIG. 8c. In FIG. 8c, the network-side interface (as opposed to the server-side interface) of the server pod is separated into intra-pod links and inter-pod links (i.e., the interlinks). The two link types are made independent from each other. The intra-pod region 832 consists of the intra-pod links, the intra-pod link side of the 20 TOR switches 84-0 to 84-19 and the 20 fabric switches 83-0 to 83-19, interconnected by 10G connections in a fat-tree topology. For example, connection k in each of the 20 TOR switches is connected to fabric switch k; connection p of each of the 20 fabric switches is connected to TOR switch p. The 20 fabric switches each provide an optional uplink (e.g., uplink 831) to connect to an external network. The inter-pod region consists of the inter-pod links (i.e., the interlinks) and the interlink side of the 20 TOR switches 84-0 to 84-19, each TOR switch providing an interlink of 20 10-G connections to connect to all 20 spine switches on the same spine plane, for a total of 20 interlinks per server pod. For example, the interlink of TOR switch k in each of the 188 server pods is connected to spine plane k; interlink p of each of the 20 spine planes is connected to server pod p. Each TOR switch provides 48×10G connections in 12×QSFP interfaces as downlink to connect to servers.

(63) The data traffic through the fabric switches is primarily limited to intra-pod traffic. The TOR switches now route both the intra-pod traffic and the inter-pod traffic and are therefore more complex. The independent link types achieve massive scalability in data center network implementations. (Additional independent link types may be provided from a higher radix switching ASIC to achieve larger-scale connectivity objectives.) Additionally, data center network 800 incorporates the full mesh topology concept (without physically incorporating an ICAS module) to remove redundant network devices and allow the use of innovative switching methods, in order to achieve a lean and mean data center fabric with improved data traffic characteristics.

(64) As shown in FIG. 8c, FIG. 8b and FIG. 8a, data center network 800 includes 20×188 TOR switches and 20×188 fabric switches equally distributed over 188 server pods, and 20×20 spine switches equally distributed over 20 spine planes. In FIG. 8a, each TOR switch has 100 10G-connections (i.e., 25 QSFPs of bandwidth in 10G mode), of which 60 10G-connections are provided server-side and 40 10G-connections are provided network-side. (Among the network-side connections, 20 10G-connections are used for intra-pod traffic and 20 10G-connections are used for inter-pod traffic.) In each server pod, fabric switches 83-0 to 83-19 each include 21 10G-connections, of which 20 10G-connections are allocated to connect with a 10G-connection in each of TOR switches 84-0 to 84-19, with the remaining connection provided as an uplink interface to connect to an external network. In this manner, fabric switches 83-0 to 83-19 support the intra-pod region data traffic and the uplinks in the server pod by a 21-node full mesh topology (with the uplinks of fabric switches 0-19 collectively seen as one node). Using a suitable routing algorithm, such as any of those described above in conjunction with the Single-Destination-Multiple-Source Traffic Aggregation and Port-to-Port Traffic Aggregation cases, network congestion can be eliminated from all fabric switches.
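As a back-of-the-envelope check (a sketch using the connection counts quoted in the preceding paragraph, and reading the 3:1 oversubscription figure as the ratio of server-side connections to inter-pod interlink connections), the headline numbers of data center network 800 follow directly:

    pods            = 188
    tors_per_pod    = 20
    server_conns    = 60        # 10G connections per TOR switch, server side
    interlink_conns = 20        # 10G connections per TOR switch, inter-pod side

    server_bw_pbps   = pods * tors_per_pod * server_conns * 10e9 / 1e15
    oversubscription = server_conns / interlink_conns

    print(f"total server-side bandwidth ~ {server_bw_pbps:.2f} Pbps")   # ~2.26 Pbps (quoted as 2.2 Pbps)
    print(f"server-side : network-side = {oversubscription:.0f}:1")     # 3:1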

(65) As the network in the intra-pod region of each server pod can operate with the same connectivity characteristics as a full mesh topology network, all 20 fabric switches of the server pod may be replaced by an ICAS module. ICAS-based data center network 900, resulting from substituting an ICAS module for fabric switches 83-0 to 83-19 in each server pod of data center network 800, is shown in FIG. 9a. To distinguish it from the server pod of data center network 800, a server pod with its fabric switches replaced by an ICAS module is referred to as an ICAS pod.

(66) FIG. 9a shows the architecture of an ICAS-based data center network, organized in two layers of switching devices (top-of-rack (TOR) switches and an ICAS module, implemented in each of 188 ICAS pods 91-0 to 91-187, and spine switches, implemented in 20 spine planes 90-0 to 90-19), interconnected by interlinks in a fat-tree topology. Interlinks refer to the network connections between ICAS pods and spine planes. For example, interlink k of each of the 188 ICAS pods is connected to spine plane k; interlink p of each of the 20 spine planes is connected to ICAS pod p. The 20 spine planes provide optional uplinks 901 and the 188 ICAS pods provide optional 20×10G uplinks 902 for connecting to an external network.

(67) Details of a spine plane of FIG. 9a are shown in FIG. 9b. In FIG. 9b, spine plane 920 includes 20 spine switches 92-0 to 92-19 and fanout cable transpose rack 921. Through fanout cable transpose rack 921, connections from all 20 spine switches 92-0 to 92-19 are grouped into 188 interlinks, with each interlink including a connection from each of spine switches 92-0 to 92-19, for a total of 20 connections per interlink. (Each interlink may include 5 QSFP straight cables.) On one side of fanout cable transpose rack 921 are 20 groups of 47 stackable MPO adapters, indicated by reference numeral 923, with each group of 47 adapters providing 47 QSFP straight cables for connecting a spine switch. On the other side of fanout cable transpose rack 921 are 188 groups of 5 stackable MPO adapters, labeled by reference numeral 924, with each group of 5 adapters providing 5 QSFP straight cables to form an interlink for connecting to an ICAS pod. The fiber connections from the spine switches and the fiber connections from the interlinks are made at LC adapter-type mounting panel 922.

(68) As pointed out earlier in this detailed description, state-of-the-art data centers and switch silicon are designed with 4 interfaces (TX, RX) at 10 Gb/s or 25 Gb/s each per port in mind. In an ICAS-based data center, switching devices are interconnected at the connection level. In such a configuration, a QSFP cable coming out of a QSFP transceiver is separated into 4 connections, and 4 connections from different QSFP transceivers are combined in a QSFP cable for connecting to another QSFP transceiver. Also, a spine plane may interconnect a large and varying number of ICAS pods (e.g., in the hundreds) because of the scalability of an ICAS-based data center network. Such a cabling scheme is more suitably organized in a fanout cable transpose rack (e.g., fanout cable transpose rack 921), which may occupy one or multiple racks and may be integrated into the spine planes. Specifically, the spine switches and the TOR switches may each connect to the fanout cable transpose rack with QSFP straight cables. Such an arrangement simplifies the cabling in a data center. FIG. 9b illustrates such an arrangement for data center network 900 of FIG. 9a.

(69) Details of an ICAS pod of FIG. 9a are shown in FIG. 9c. In FIG. 9c, the network-side interface (as opposed to the server-side interface) of an ICAS pod is divided into intra-pod links and inter-pod links (i.e., interlinks), and the two link types are made independent from each other. The intra-pod region consists of the intra-pod links between the 20 TOR switches 93-0 to 93-19 and ICAS module 931, interconnected by 10G connections in a full mesh topology. ICAS module 931 provides optional 20×10G uplinks to connect to one or more external networks. The inter-pod region consists of the inter-pod links (i.e., the interlinks), each provided between one of the 20 TOR switches 93-0 to 93-19 and a corresponding one of the 20 spine planes. Each interlink includes 20 10-G connections for connecting to all 20 spine switches on the corresponding spine plane. For example, the interlink of TOR switch k in each of the 188 ICAS pods is connected to spine plane k; interlink p of each of the 20 spine planes is connected to ICAS pod p. Each TOR switch provides 60×10G connections in 15×QSFP interfaces as downlink for connecting to servers.

(70) The data traffic through the ICAS module is primarily limited to intra-pod traffic. The TOR switches now perform routing for both the intra-pod traffic and the inter-pod traffic and are therefore more complex. The independent link types achieve massive scalability in data center network implementations. (Additional independent link types may be provided from a higher radix switching ASIC to achieve larger-scale connectivity objectives.)

(71) As shown in FIG. 9c, FIG. 9b and FIG. 9a, each TOR switch allocates 20×10G connections (5×QSFPs in 10G mode) to connect to its associated ICAS module (e.g., ICAS module 931) to support intra-pod traffic, and 5 QSFPs in 10G mode (20×10G connections) to connect to the fanout cable transpose rack to support inter-pod traffic. As shown in FIG. 9c, each ICAS pod includes 20×5 QSFP transceivers for intra-pod traffic, connected by 100 QSFP straight cables, 20×5 QSFP transceivers for inter-pod traffic, and 20×15 QSFP (10G mode) transceivers for server traffic, for a total of 500 QSFP transceivers. The 20 TOR switches in an ICAS pod may be implemented by 20 Trident-II ASICs. Although 20 TOR switches are shown in each ICAS pod in FIG. 9c, the ICAS module is scalable to connect up to 48 TOR switches in an ICAS pod (based on a 32×QSFP Trident-II+ switch ASIC). Also, each ICAS pod operates as a layer-3 cluster running BGP and ECMP.
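The 500-transceiver figure can be checked from the per-TOR allocation just described (5 intra-pod QSFPs, 5 inter-pod QSFPs and 15 server-facing QSFPs per TOR switch); the short sketch below is an arithmetic illustration only:

    tors = 20
    qsfp_intra, qsfp_inter, qsfp_server = 5, 5, 15

    per_tor      = qsfp_intra + qsfp_inter + qsfp_server   # 25 QSFP transceivers per TOR switch
    pod_total    = tors * per_tor                           # 500 QSFP transceivers per ICAS pod
    intra_cables = tors * qsfp_intra                        # 100 QSFP straight cables to the ICAS module

    print(per_tor, pod_total, intra_cables)                 # 25 500 100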

(72) Together, the ICAS pods and the spine planes form a modular network topology capable of accommodating hundreds of thousands of 10G-connected servers, scaling to multi-petabit bisection bandwidth, and covering a data center with improved congestion characteristics and non-oversubscribed rack-to-rack performance.

(73) According to one embodiment of the present invention, a spine switch can be implemented using a high-radix (e.g., 240×10G) single-chip switching device, as shown in FIG. 9d. A single-chip implementation saves the cost of extra transceivers, cables and rack space, and reduces latency and power consumption, relative to multi-unit (rack unit) chassis-based switching device and stackable switching device implementations. The disadvantage of the single-chip spine switch approach is its limited network scalability, which restricts the system to 240 ICAS pods at this time. As mentioned above, the semiconductor implementation limits the scale of a high-radix switching integrated circuit.

(74) To work around this CMOS semiconductor limitation, one may create a stackable switching device, in which multiple ICAS modules and switching devices are put in a rack, or in multiple racks, to form a larger high-radix (i.e., high port-count) spine switch, such as shown in FIG. 9e.

(75) Details of an ICAS-based stackable switching device 950 are shown in FIG. 9e. FIG. 9e shows ICAS modules 95-0 to 95-3 each connecting in a full mesh topology to four Trident-II ASIC-based switches 96-0 to 96-3, illustrating how such switches may be used to build a stackable spine switch. FIG. 9e shows each of switches 96-0 to 96-3 having a switching bandwidth of 24 QSFPs in 10G mode provided at a 1:1 subscription ratio, and an ICAS box 953 integrating four ICAS modules 95-0 to 95-3 in one 1U chassis, with each ICAS module containing three duplicate copies of ICAS1X5 sub-modules and each sub-module providing a 4×10G uplink 951. The four switches 96-0 to 96-3 provide data links 952 of 1.92 Tbps bandwidth to connect to servers. ICAS-based stackable switching device 950 provides a total uplink bandwidth of 480 Gbps (4×3×40 Gbps) to connect to one or more external networks, facilitates a non-blocking 1:1 subscription ratio and provides a full-mesh non-blocking interconnect, with a total of 1.92 Tbps of switching bandwidth.
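The bandwidth figures quoted for stackable switching device 950 appear to follow from simple multiplication; the sketch below shows one plausible derivation (the 1:1 subscription ratio is read here as an even split of each switch box's ports between servers and the ICAS box, which is an assumption rather than a statement from the figure):

    switch_boxes  = 4
    qsfp_per_box  = 24           # QSFPs per Trident-II switch box, 10G mode
    gbps_per_qsfp = 40           # 4 x 10G per QSFP

    raw_bw    = switch_boxes * qsfp_per_box * gbps_per_qsfp   # 3840 Gbps of ports in total
    server_bw = raw_bw // 2                                    # 1920 Gbps (1.92 Tbps) at a 1:1 split

    icas_modules   = 4
    submodules     = 3           # three ICAS1X5 copies per ICAS module
    uplink_per_sub = 40          # 4 x 10G uplink per sub-module
    uplink_bw = icas_modules * submodules * uplink_per_sub     # 480 Gbps = 4 x 3 x 40 Gbps

    print(server_bw, uplink_bw)  # 1920 480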

(76) An ICAS-based stackable switching device has the benefits of improved network congestion as well as cost, power consumption and space savings over the switching devices implemented in a state-of-the-art data center. As shown in the ICAS+Stackable Chassis column of Table 4, a data center with ICAS and ICAS-based stackable switching devices performs remarkably well, reducing the total switching ASIC count by 53.5%, total power consumption by 26.0% and total space by 25.6%, with much improved network congestion performance. However, total QSFP transceiver usage is increased by 2.3%.

(77) The above stackable switching device is for illustrative purposes only. A person of ordinary skill in the art may expand the scalability of the stackable switching device. The present invention is not limited by the illustration.

(78) One embodiment of the present invention also provides a multi-unit (rack unit) chassis-based switching device. The multi-unit chassis switching device groups many switching integrated circuits across multiple line cards. The multi-unit chassis-based switching device interconnects line cards, control cards and CPU cards through PCB-based fabric cards or a backplane, and saves a proportional cost of the transceivers, cables and rack space required to interconnect them. It is instructive to note that a rack unit (RU, or simply U) is a measure of chassis height in a data center and equals 1.75 inches. A full rack is a 48U (48 rack unit) tall rack.

(79) Details of an ICAS-based multi-unit chassis switching device 970 are shown in FIG. 9f. FIG. 9f shows four ICAS-based fabric cards 97-0 to 97-3, each connected in a full mesh topology to switching ASICs 98-0 to 98-3. In FIG. 9f, switching ASICs 98-0 and 98-1 are housed in line card 973 and switching ASICs 98-2 and 98-3 are housed in line card 974. Line cards 973 and 974 are connected through high speed printed circuit board (PCB) connectors to fabric cards 97-0 to 97-3. As shown in FIG. 9f, four Trident-II ASIC-based switches 98-0 to 98-3 may be used to build a multi-unit chassis switch, each having a switching bandwidth of 24 QSFPs in 10G mode provided at a 1:1 subscription ratio, with the four ICAS-based fabric cards 97-0 to 97-3 each containing three duplicate copies of ICAS1X5 sub-modules and each sub-module providing a 4×10G uplink 971. The two line cards provide data links 972 of 1.92 Tbps of bandwidth to connect to servers. ICAS-based multi-unit chassis switching device 970 provides a total uplink bandwidth of 480 Gbps (4×3×40 Gbps) to connect to one or more external networks and facilitates a full-mesh non-blocking 1:1 subscription ratio interconnect, with a total of 1.92 Tbps of switching bandwidth.

(80) A multi-unit chassis-based switching device whose fabric cards implement the ICAS-based full-mesh topology has the benefits of improved network congestion as well as cost and power consumption savings over an implementation using ASIC-based fabric cards in a fat-tree topology. As shown in the ICAS+Multi-unit Chassis column of Table 4, a data center with ICAS and ICAS-based multi-unit chassis-based switching devices performs remarkably well, reducing the total QSFP transceiver count by 12.6%, the total switching ASIC count by 53.5%, total power consumption by 32.7% and total space by 29.95%, with much improved network congestion performance.

(81) The above multi-unit chassis switching device is for illustrative purpose only. A person of ordinary skill in the art may expand the scalability of the multi-unit chassis switching device. The present invention is not limited by the illustration.

(82) The multi-unit chassis-based switching device has the disadvantage of a much longer development time and a higher cost to manufacture, due to its system complexity, and is also limited overall by the form factor of the multi-unit chassis. The multi-unit chassis-based switching device, though, provides a much larger port count than the single-chip switching device. Although the stackable switching device requires more transceivers and cables than the multi-unit chassis-based approach, the stackable switching device approach has the advantages of greater manageability in the internal network interconnection and virtually unlimited scalability, and requires significantly less time for assembling a much larger switching device.

(83) The material required for (i) the data center network of FIG. 2a, using state-of-the-art multi-unit switching devices (Fat-tree+Multi-unit Chassis), (ii) an implementation of data center network 900 of FIG. 9a, using ICAS-based multi-unit switching devices (ICAS+Multi-unit Chassis), and (iii) an implementation of data center network 900 of FIG. 9a, using ICAS-based stackable switching devices (ICAS+Stackable Chassis), is summarized and compared in Table 4.

(84) TABLE 4

                                              Fat-tree +     ICAS +         ICAS +
                                              Multi-unit     Multi-unit     Stackable
                                              Chassis        Chassis        Chassis
    Intralink (within Pod)                    N/A            5              5
    Interlink (Across Pod)                    4              5              5
    Downlink (to Server)                      12             15             15
    Total                                     16             25             25
    D:U ratio                                 3              3              3
    D:I ratio                                 N/A            3              3
    Number of 10G Interfaces (for comparison) 96             184.3          184.3
    QSFP XCVR Module (Watt)                   4              4              4
    TOR Switch (Watt)                         150            200            200
    Multi-unit Chassis (Watt)                 1660           0              0
    Spine-side Interlink QSFP XCVR            18432          18800          38000
    TOR-side Interlink QSFP XCVR              18432          18800          18800
    Fabric/TOR-side Intralink QSFP XCVR       36864          18800          18800
    Server-side QSFP XCVR                     55296          56400          56400
    Total QSFP XCVR                           129024         112800         132000
                                                             (-12.6%)       (+2.3%)
    ASIC in Spine Switch                      2304           1600           1600
    ASIC in Fabric Switch                     4608           0              0
    ASIC in TOR Switch                        4608           3760           3760
    Total Switching ASIC                      11520          5360           5360
                                                             (-53.5%)       (-53.5%)
    Spine Switch (KW)                         392.448        327.2          472.0
    Fabric Switch (KW)                        784.896        0              0
    TOR Switch (KW)                           986.112        1128.0         1128.0
    Total Power Consumption (KW)              2163.456       1455.2         1600
                                                             (-32.7%)       (-26.0%)
    96 x QSFP Spine Switch (8U)               1536           0              0
    96 x QSFP Fabric Switch (8U)              3072           0              0
    48 x QSFP Spine Switch (4U)               0              1600           1600
    TOR Switch (1U)                           4608           3760           3760
    ICAS1X5TRIPLE (1U)                        0              0              400
    ICAS5X21 (2U)                             0              376            376
    Transpose Rack (36U)                      0              720            720
    ICAS2X9 (1U)                              0              0              0
    ICAS8X33 (4U)                             0              0              0
    ICAS10X41 (6U)                            0              0              0
    ICAS16X65 (16U)                           0              0              0
    Total Rack Unit (U)                       9216           6456           6856
                                                             (-29.95%)      (-25.6%)
    Pod Interlink Bandwidth (Tbps)            7.7            4.0            4.0
    Pod Intralink Bandwidth (Tbps)            7.7            4.0            4.0
    Total Data Link Bandwidth (Pbps)          2.2            2.2            2.2
    Per Plane Uplink Bandwidth (Tbps)         7.7/plane      0              0
    Total Spine Uplink Bandwidth (Tbps)       0              150.4          601.6
    Total ICAS Uplink Bandwidth (Tbps)        0              37.6           37.6
    Spine-side Interlink QSFP Cable           18432          18800          18800
    QSFP Fanout Cable (Transpose Rack)        0              37600          37600
    QSFP Fanout Cable (ICAS5X21)              0              19740          19740
    TOR-side Interlink QSFP Cable             0              18800          18800
    TOR-side Intralink QSFP Cable             18432          18800          18800
    Spine Switch QSFP Cable                   0              0              19200
    QSFP Fanout Cable (ICAS1X5TRIPLE)         0              0              19200
    Total QSFP Cable                          36864          56400          75600
    Total QSFP Fanout Cable                   0              57340          76540

    (Percentages are relative to the Fat-tree+Multi-unit Chassis column.)

(85) As shown in Table 4, the ICAS-based systems require significantly less power dissipation, fewer ASICs and less space, resulting in reduced material costs and energy consumption.

(86) The above detailed description is provided to illustrate specific embodiments of the present invention and is not intended to be limiting. Numerous modifications and variations within the scope of the present invention are possible. The present invention is set forth in the accompanying claims.