Mouse over elephant
11044539 · 2021-06-22
CPC classification
H04L49/602 (ELECTRICITY)
G06F15/17337 (PHYSICS)
H04Q2011/0064 (ELECTRICITY)
H04Q2011/0081 (ELECTRICITY)
H04Q2011/0073 (ELECTRICITY)
International classification
G06F15/173 (PHYSICS)
Abstract
An optical switch plane with one or more switch layers, each layer containing multiple switches, is provided. In a data center, an optical circuit switch plane is added between the device plane and the packet switch plane. Direct speed-of-light connections may be created between devices, the data center temporally shrunk, remote devices localized, elephant flows kept out of mouse switches, mouse switch spend reduced, stranded resources recovered, layer 1 reconfigured and optimized, bare metal bent, secure tunnels created, networks physically isolated, failure resiliency increased, and packet switch congestion avoided.
Claims
1. An optical data network with up/down connectivity, including: an optical circuit switch plane, between an end device plane, and an optical packet switch plane; said end device plane includes a multitude of end devices; said end devices include: servers, GPUs, FPGAs, ASICs, neural networks, memory, or storage; said circuit switch plane includes two or more circuit switch layers with north/south connectivity; said packet switch plane includes two or more packet switch layers with north/south connectivity; and said two or more layers of said circuit plane are interconnected with up/down connectivity to respective layers of said two or more layers of said packet plane.
2. The network of claim 1, where said circuit switch plane includes multiple hierarchical optical switch layers; and each layer including at least two optical circuit switches.
3. The network of claim 2, where said circuit switch plane includes two or more switch layers; 200K or more ports; and the maximum insertion loss of said plane is <=3 dB.
4. The network of claim 2, where said circuit switch plane includes three or more switch layers; 1M or more ports; and the maximum insertion loss of said plane is <=5 dB.
5. The network of claim 2, where said plane includes 20K or more optical ports; and the maximum insertion loss of said plane is <=3 dB.
6. The network of claim 1, where a majority of ports from said device plane couple to said circuit switch plane.
7. The network of claim 1, where a majority of ports from said packet switch plane couple to said circuit switch plane.
8. The network of claim 1, where a majority of said devices include one or more PSM optics modules.
9. The network of claim 1, where a majority of said devices include multiple optical network ports; and a said circuit switch couples some said ports from said devices to a said packet switch, without circuit switching said connections.
10. The network of claim 1, where two said devices, each located in a different rack, are optically connected via a circuit that does not traverse said packet plane.
11. The network of claim 1, where two said devices, each located in a different pod, are optically connected via a circuit that does not traverse said packet plane.
12. The network of claim 1, where southbound ports of a single said packet switch are optically coupled via said circuit switch plane to said devices located in different racks.
13. The network of claim 1, where one or more WAN ports are coupled to a said device plane via said circuit switch plane, without coupling to said packet switch plane.
14. The network of claim 1, where said packet switch ports are oversubscribed; and the oversubscription ratio of device ports to packet switch ports is 2:1 or greater.
15. A method to configure the network of claim 1, including: accepting a request for resources, determining a low-cost route, verifying the route meets requirements, and configuring said circuit network and said devices.
16. The method of claim 15, further including: configuring said circuit switched plane to order connections to said packet switch ports, so as to produce a more desired packet switch latency.
17. The method of claim 15, further including: dynamically clustering a multitude of said devices by reconfiguring said circuit switched plane.
18. The method of claim 15, further including: direct connecting GPU, FPGA, or ASIC devices to a server or another GPU, FPGA, or ASIC.
19. The network of claim 1, where 33% or more of ports from said device plane couple to said circuit switch plane; where 33% or more of ports from said packet switch plane couple to said circuit switch plane; and where the sum of the number of ports of all said circuit switches is greater than 20,000, with a circuit switch plane insertion loss of <=5 dB.
20. The network of claim 1, where said up/down connectivity between said circuit/packet planes also has a north/south connectivity between different said layers of said planes.
Description
DETAILED DESCRIPTION
(12) Devices 112 may be housed in racks 114, and racks 114 organized into pods 116. Pods 116 may be containers.
(13) Devices 112 have a network port. Ports may be Omni Path Architecture (OPA) with PSM4 optics. PSM ports may be switched independently, in 25 GBPS increments from a 100 GBPS module. Other protocols may be used, such as Ethernet, Infiniband, FibreChannel, or another protocol over fiber, or any mix of the above. Other optics may be used, such as CWDM4, coarse WDM, dense WDM, BX, LX, SX, or another optics module, or any mix of the above. Each end of a link must have compatible protocol and optics.
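By way of illustration, the following minimal Python sketch (all class and field names are hypothetical, not part of this disclosure) models a 100 GBPS PSM4 module as four independently switchable 25 GBPS lanes and checks that both ends of a link have compatible protocol and optics:

```python
# Illustrative sketch only: a 100 GBPS PSM4 module as four
# independently switchable 25 GBPS lanes. Names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class OpticalPort:
    protocol: str   # e.g. "OPA", "Ethernet", "Infiniband", "FibreChannel"
    optics: str     # e.g. "PSM4", "CWDM4", "BX", "LX", "SX"
    gbps: int       # per-lane rate

def psm4_lanes(protocol: str) -> list[OpticalPort]:
    """Split one 100 GBPS PSM4 module into four 25 GBPS lanes, each of
    which may be circuit switched independently."""
    return [OpticalPort(protocol, "PSM4", 25) for _ in range(4)]

def compatible(a: OpticalPort, b: OpticalPort) -> bool:
    """Each end of a link must have compatible protocol and optics."""
    return (a.protocol, a.optics, a.gbps) == (b.protocol, b.optics, b.gbps)

lanes = psm4_lanes("OPA")
assert compatible(lanes[0], OpticalPort("OPA", "PSM4", 25))
```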
(14) Switches may have a number of ports for east/west connectivity to other switches in the same layer, and a number of ports for north/south connectivity to higher or lower layers. Higher layers may span multiple physical locations. As used here, east/west does not mean traffic within a datacenter, nor does north/south mean traffic leaving the datacenter. A novel up/down routing dimension may be available between circuit plane 120 and packet plane 130. This is a mouse network over an elephant network. East/west is illustrated on the page as left/right, north/south as up/down, and up/down with squiggly lines. East/west and north/south flows may exist on both the circuit and packet planes. A single circuit connection may have multiple east/west and/or north/south components.
(15) A packet layer may have the same or different topology as a circuit layer. Elephant traffic need not traverse a mouse switch, saving many mouse switch ports. The up/down direction may be oversubscribed by 7:1, 3:1, 2:1, or other circuit:packet ratio. The oversubscription ratio may have local variations. The number of packet switches over a circuit switch may vary. Some circuit switches may not have a directly corresponding packet switch. Connections to packet switches elsewhere in the network may be made.
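The up/down oversubscription ratio is simple arithmetic; a short illustrative sketch, with hypothetical port counts:

```python
# Illustrative arithmetic for the up/down oversubscription ratio.
def updown_oversubscription(circuit_ports: int, packet_ports: int) -> float:
    """Ratio of circuit-plane ports to packet-switch ports in the
    up/down direction, e.g. 7:1, 3:1, or 2:1."""
    return circuit_ports / packet_ports

# Hypothetical counts: 84 circuit ports over 12 packet ports gives 7:1.
assert updown_oversubscription(84, 12) == 7.0
```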
(16) High port-count, low insertion-loss, non-blocking transparent optical circuit switches are required for large switch planes. An example of such a switch is described in copending application Ser. No. 16/041,815, "Optical Switch," which is incorporated by reference. Two or three switch layers may have an insertion loss of <=3 dB or <=5 dB. Switches may have 500, 1000, 2000, 4000, or more ports. Planes may have 20K, 50K, 200K, 1M, or more ports, with an insertion loss of <=3, 4, or 5 dB. Insertion losses exclude interconnect fiber. Existing low port-count switches have insufficient connectivity to form a large plane: too many ports are used for up/down, east/west, and north/south traffic compared to the number of device ports. Existing high insertion-loss switches limit hops through multiple optical switches.
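As an illustrative check of such a loss budget (per-switch values hypothetical), the worst-case path through the plane may be summed per switch, excluding interconnect fiber:

```python
# Illustrative loss-budget check; per-switch losses are hypothetical.
def plane_loss_ok(per_switch_loss_db: list[float], budget_db: float) -> bool:
    """True if the worst-case optical path through the circuit plane
    stays within the loss budget (interconnect fiber excluded)."""
    return sum(per_switch_loss_db) <= budget_db

assert plane_loss_ok([1.5, 1.5, 1.5], 5.0)       # three 1.5 dB layers fit in 5 dB
assert not plane_loss_ok([1.5, 1.5, 1.5], 3.0)   # but not in a 3 dB budget
```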
(18) Connections between A/B sub-layers add another routing dimension. This allows traffic to load-balance between the A/B sub-layers rather than treating them as distinct redundant networks until necessary. This may be advantageous in relieving congestion. If more than one circuit switch is required in a pod, they may be partitioned horizontally as A/B layers, instead of vertically with each switch serving different racks.
(19) Dashed lines in circuit switch 223A indicate fixed (not switched) connections between a device and a packet switch. This may be desired if all nodes have some amount of packet-switched traffic. Conversely, if most nodes have point-to-point traffic, such as GPU clusters, HPC, direct connection, or storage replication, fixed connections to a packet switch may waste ports. Fixed connections may also be used with non-redundant network 100. This split of fixed ports from a PSM4 fiber bundle may be done inside the switch, without connecting to the switch fabric, to simplify datacenter cabling.
(23) Backup traffic need not burden a packet switch or other network traffic. Circuit switched plane 120 may connect SSD 112f and cold store 112j. SSD 112f and cold store 112j are preferably able to stream files, with server 112a (not shown) running the backup application only needing to generate transfer requests and confirm completion. Optionally, some devices, such as cold store 112j, may have a low percentage of use within their pod and may be hard-connected directly to the spine layer. Files may also be moved between store 112c, SSD 112f, JBOD, NVM, hard disk, hot storage 112h, warm storage 112i, cold storage 112j, backup, or other devices.
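An illustrative sketch of such device-to-device backup follows; the CircuitPlane class and its methods are hypothetical stand-ins, not the disclosed implementation:

```python
# Illustrative sketch: the backup server only issues transfer requests
# and confirms completion, while file bytes stream directly between the
# two storage devices over a circuit. All names are hypothetical.
class CircuitPlane:
    def __init__(self):
        self.circuits = set()
    def connect(self, a: str, b: str) -> tuple:
        circuit = (a, b)
        self.circuits.add(circuit)   # program a transparent light path a<->b
        return circuit
    def disconnect(self, circuit: tuple) -> None:
        self.circuits.discard(circuit)

def backup(plane: CircuitPlane, src: str, dst: str, files: list[str]) -> None:
    circuit = plane.connect(src, dst)            # e.g. SSD 112f to cold store 112j
    for name in files:
        print(f"request transfer of {name}")     # server generates the request
        print(f"confirm completion of {name}")   # ... and confirms completion
    plane.disconnect(circuit)

backup(CircuitPlane(), "ssd_112f", "cold_112j", ["db.bak"])
```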
(25) Storage 112c may connect to multiple servers 112a in various pods 116. Storage 112c may be intelligent storage, capable of processing Hadoop requests. Hadoop replication may be a logical function, moving the network and not moving the data. Additionally, storage 112c may connect to other storage 112c, allowing replication between devices without burdening servers 112a or the packet network 130. This may be necessary due to limited ports on storage 112c. Alternatively, a packet switch 133 may be used for aggregation.
(27) Packet switch 133 may be a Top Of Cluster (TOC) switch. All nodes of the cluster connect to the same switch, regardless of which rack or pod they may be in. Any node within the cluster may reach any other node within the cluster with one-hop latency.
(31) Other functionality (not illustrated) may include diagnostic 112o with a time-domain reflectometer (TDR), power meter, or other diagnostic hardware, which may be scripted to validate switch ports, fiber, and fiber connectivity. Human errors in fiber connectivity might be corrected by updating the port mapping, instead of moving fiber to correct the wiring error.
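An illustrative sketch of correcting such a wiring error in software rather than in fiber (the mapping structure is hypothetical):

```python
# Illustrative sketch: if diagnostics show that fiber intended for
# logical port 17 actually lands on physical port 23, updating the
# logical-to-physical map corrects the error without moving fiber.
port_map = {logical: logical for logical in range(1024)}  # identity map

def remap(port_map: dict[int, int], logical: int, physical: int) -> None:
    """Point a logical port at the physical port the fiber really reaches."""
    port_map[logical] = physical

remap(port_map, 17, 23)
assert port_map[17] == 23
```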
(33) Resource requests may be accepted in step 510. The request may contain: customer contract requirements; direct connect requests; hardware allocation request; drive mount/unmount requests; open/close stream request; Hadoop requests; packet switch traffic statistics; packet switch latency; composition script request; application request; and/or other sources. Certain OS functions may be modified to simplify request generation.
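By way of illustration, the request of step 510 might be represented as follows; the field names are hypothetical, drawn from the request types listed above:

```python
# Illustrative request record for step 510; field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ResourceRequest:
    contract: dict = field(default_factory=dict)          # customer contract requirements
    direct_connects: list = field(default_factory=list)   # (src, dst) device pairs
    hardware: list = field(default_factory=list)          # hardware allocation requests
    mounts: list = field(default_factory=list)            # drive mount/unmount requests
    streams: list = field(default_factory=list)           # open/close stream requests
    hadoop: list = field(default_factory=list)            # Hadoop replication requests
    stats: dict = field(default_factory=dict)             # packet switch traffic/latency
```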
(34) Direct connect, drive mount/unmount, and open/close stream requests may configure a circuit switched route between source and destination devices.
(35) Hadoop replication requests may configure circuit switched routes between multiple servers and a single storage device, moving the network instead of moving the data.
(36) Latency within a packet switch depends on the source and destination ports. QOS or packet traffic statistics may be used to configure the circuit switched network to reorder packet switch ports.
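One illustrative way to perform such reordering (data shapes hypothetical) is to assign the heaviest flows to the lowest-latency port pairs, and let the circuit plane rewire accordingly:

```python
# Illustrative reordering: heaviest flows get the lowest-latency pairs.
def reorder_ports(flows: dict[str, float],
                  pair_latency: dict[tuple, float]) -> dict:
    """Map each flow to a (source, destination) packet switch port pair,
    heaviest flow first onto the lowest-latency pair; the circuit plane
    is then reconfigured to realize the mapping."""
    heavy_first = sorted(flows, key=flows.get, reverse=True)
    fast_first = sorted(pair_latency, key=pair_latency.get)
    return dict(zip(heavy_first, fast_first))

flows = {"A": 900.0, "B": 120.0}               # traffic statistics (hypothetical)
latency = {(1, 9): 300.0, (2, 10): 450.0}      # ns per port pair (hypothetical)
assert reorder_ports(flows, latency) == {"A": (1, 9), "B": (2, 10)}
```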
(37) Backup applications may configure various routes between source and destination.
(38) FEA, HPC, or AI HPC applications may cluster servers into a network topology that best matches the data topology.
(39) Clustered devices may have a packet switch port. All packet switch ports for the cluster may connect to a single TOC packet switch.
(40) A composer may provide the requested resources and connection topology.
(41) The lowest cost of the available resource allocations may be determined in step 520. The cost of resources required to meet the resource request may be calculated using: number and type of ports and devices consumed; packet switch port-to-port latency; wire latency; minimum quantity of unallocated resources; billable cost; and/or other factors. If a sufficiently low-cost route is not available, existing routes may be moved to a higher-cost route that is still within acceptable cost.
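An illustrative weighted-cost sketch of step 520 (weights and field names hypothetical); when no candidate falls within the acceptable cost, the caller may first relocate existing routes as described above:

```python
# Illustrative weighted cost for step 520; weights and fields hypothetical.
def route_cost(alloc: dict, w: dict) -> float:
    return (w["ports"] * alloc["ports_consumed"]
            + w["latency"] * alloc["port_to_port_latency_ns"]
            + w["wire"] * alloc["wire_latency_ns"]
            + w["stranding"] * alloc["unallocated_remaining"]
            + w["billable"] * alloc["billable_cost"])

def lowest_cost(candidates: list, w: dict, max_cost: float):
    """Return the cheapest acceptable allocation, or None; on None the
    caller may move existing routes to higher (still acceptable) cost
    routes and retry."""
    best = min(candidates, key=lambda a: route_cost(a, w), default=None)
    return best if best is not None and route_cost(best, w) <= max_cost else None
```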
(42) The proposed resource allocation is verified in step 530. If the allocation fails to meet latency requirements, physical isolation requirements, and/or other factors, the request may be rejected.
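A corresponding illustrative check for step 530 (field names hypothetical):

```python
# Illustrative verification for step 530; field names hypothetical.
def verify(alloc: dict, req: dict) -> bool:
    """Reject allocations that miss latency or physical isolation needs."""
    if alloc["end_to_end_latency_ns"] > req["max_latency_ns"]:
        return False
    if req.get("physical_isolation") and not alloc["isolated"]:
        return False
    return True
```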
(43) Network and devices are configured in step 540.
(44) The orchestration request returns in step 550.
(45) The previous examples are intended to be illustrative. Countless additional variations and applications are readily envisioned. Planes and layers may be partially implemented. Resource orchestration will vary.