OPTICAL INTERCONNECTION MODULES FOR AI NETWORKS
20250234117 · 2025-07-17
Assignee
Inventors
Cpc classification
H04Q2011/0081
ELECTRICITY
International classification
Abstract
An optical fabric includes a plurality of optical waveguides. The fabric has Np input ports with index X and Np output ports with index Y. An interconnection map between input ports, index X, and output ports, index Y, is provided by a non-linear function Y=F(X) that satisfies the reversible properties F(Y)=X, or X=F(F(X)), or F.sup.-1(X)=F(X). The fabric provides full connectivity from any group of M.sub.1 adjacent input ports to any group of M.sub.2 adjacent output ports, where at least one of M.sub.1 or M.sub.2 is an even number and M.sub.1×M.sub.2=Np.
Claims
1. An optical fabric comprising a plurality of optical waveguides wherein the fabric has Np input ports with index, X, and Np output ports with index, Y, an interconnection map between input ports, index X, and output ports, index Y is provided by a non-linear function Y=F(X) that satisfies reversible properties given by, F(Y)=X or, X=F(F(X)) or F.sup.-1(X)=F(X), and the fabric provides full connectivity from any group of M.sub.1 adjacent input ports to any group of M.sub.2 adjacent output ports where at least one number, M.sub.1 or M.sub.2 is an even number, and wherein M.sub.1×M.sub.2=Np.
2. The fabric of claim 1, wherein an interconnection table is defined by a function, F(X)=Bin2Dec(FlipDigit(Dec2Bin(X-1))+1).
3. The fabric of claim 1, wherein the fabric is used in a Spine and Leaf network with Ns Spines and Nl Leaf switches wherein M.sub.1/M.sub.2=K×Ns/Nl and K is an integer positive number.
4. The fabric of claim 1, wherein the fabric is used in a Spine and Leaf network, with Ns Spines and Nl Leaf switches wherein M.sub.2/M.sub.1=K×Ns/Nl where K is an integer positive number.
5. The fabric of claim 3, wherein the fabric can connect all Ns Spines to all Nl Leaf switches where Nl or Ns is an even number.
6. The fabric of claim 3, wherein the fabric can connect all Ns Leaf switches to all Nl Servers where Nl or Ns is an even number.
7. An apparatus for forming an optical fabric comprising a plurality of multi-fiber connector adapters where the adapters connect to network equipment in a data communications network, such as Spine and Leaf switches, and an internal mesh having at least 128 optical waveguides, wherein a light path of connected transmitters and receivers are matched to provide proper optical connections and wherein the internal mesh is designed to enable an arbitrary even number of uplinks from Leaf switches to Spine switches or Servers to Leaf switches.
8. The apparatus of claim 7, wherein the apparatus can be installed in a rack and can be stacked to provide folded Clos network topology of different sizes and radixes.
9. The apparatus of claim 7, wherein the apparatus can be used to scale optical networks from four to thousands of switches.
10. The apparatus of claim 7, wherein the apparatus can be stacked to provide folded Clos network topology for switches using an even number of uplinks where each of those uplinks comprises multi-fiber connectors.
11. The apparatus of claim 7, wherein the apparatus can be used to implement fabrics to connect several hundred thousand GPUs.
12. The apparatus of claim 7, wherein the apparatus provides redundant paths, reducing the risk of network failure due to interconnection errors.
13. The apparatus of claim 7, wherein the apparatus has a small form factor that enables stacking of at least 2 apparatuses in one RU, allowing the stacking of up to 132 apparatuses per rack.
14. A structured cable system comprising a stack of modules, wherein each module has a plurality of optical parallel connector adapters and incorporates an internal fabric or mesh, and wherein the internal mesh is designed to enable full connectivity from any group of M.sub.1 adjacent input ports to any group of M.sub.2 adjacent output ports wherein at least one number, M.sub.1 or M.sub.2 is an even number, and a number of input ports is equal to a number of output ports, and given by, M.sub.1×M.sub.2, wherein the stack of modules can be used to deploy or scale various Clos network topologies.
15. The structured cable system of claim 14, wherein the structured cable system can be used to scale optical networks from two to ten thousand switches.
16. The structured cable system of claim 14, wherein the structured cable system provides redundant paths, reducing a risk of network failure due to interconnection errors.
17. The structured cable system of claim 14, wherein the structured cable system enables fabrics with an arbitrarily even number of uplinks.
18. A fiber optic module apparatus, which comprises, a main body, an internal fabric made of optical waveguides, a front face, a rear side, a left side, and a right side wherein the front face accommodates a multiplicity of multi-fiber connectors, the rear face accommodates a multiplicity of multi-fiber connectors, identical in number to the front face, an internal structure providing space for optical lanes of optical fibers or optical waveguides, wherein the internal mesh is designed to enable full connectivity from any group of M.sub.1 adjacent input ports to any group of M.sub.2 adjacent output ports where at least one number, M.sub.1 or M.sub.2 is an even number, and where the number of input ports is equal to the number of output ports, where the total number of ports is given by 2×M.sub.1×M.sub.2.
19. The fiber optic module of claim 18, wherein the fiber optic module can be stacked to provide folded Clos network topology of various radixes.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0020]-[0052] [Figure descriptions not available in this text.]
DESCRIPTION OF INVENTION
[0053] Modular apparatuses and a general method to deploy optical networks with a diversity of uplinks and radices are disclosed in this document. The module and method can be used with standalone, stacked, or chassis-based network switches, as long as the modular connections utilize single-ferrule or multi-ferrule (MF) MPO connectors (or other multi-fiber connectors) with more than 8 fiber pairs. In particular, they apply to switches supporting Ethernet-specified SR or DR transceivers in their ports, such as 100GBASE-SR4, 200GBASE-SR4, 400GBASE-DR4, 400GBASE-SR8, 800GBASE-SR8, or 1.6T SR8 (Terabit BiDi), or InfiniBand 400G or 800G NDR, or future 1.6T XDR.
[0054]
[0055] The MF ports can be implemented with arrays of small MPO ferrules, such as commercially available SN-MT or MMC connectors. Each ferrule can have 8, 12, or 16 fibers. For example, in
[0056] The module 400 width, W, is in the range of 12 inches up to 19 inches, and the height, H, is in the range of 0.4 to 0.64 inches. Rails, 405, on both sides of the module, would enable the modules to be inserted into a chassis structure if required. Alternatively, using brackets 406, the modules can be directly attached to the rack. By using the specified height range for this embodiment, up to four modules can be stacked in less than 2 RU depending on density requirements.
[0057]
[0058]
[0059] In this document, we employ the nomenclature {Ns, Nl, Ml} to represent a Spine-and-Leaf fabric that can connect Ns Spine switches to Nl Leaf switches, where each Leaf switch has Ml uplinks. This fabric has Np input and Np output ports, where Np=Ns×Ms=Nl×Ml. In general, a fabric with 2Np ports, where each of the Np input ports is connected to exactly one of the Np output ports, can be implemented in many different configurations. The number of possible input-to-output port permutations is Np!, where ! represents the factorial function.
[0060] From that large set of possible configurations, the number of fabrics that can be used in a specific Spine-and-Leaf network, {Ns, Nl, Ml}, is given by (Nl!).sup.Ns, assuming that Ms=Nl and Ml=Ns. Almost all of those fabrics become useless when the number of Spine or Leaf switches changes, even if the total number of ports is kept identical. This might look irrelevant for networks that are implemented only once and never modified. However, as AI models grow nearly 10× per year, GPU networks must scale and their configuration can change. Also, different sections of the network, such as the GPU network (backend) and the CPU network (frontend), can require different Spine-and-Leaf configurations.
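To give a sense of scale for the two counts above, the following minimal sketch evaluates Np! and (Nl!).sup.Ns. The values Ns=4 and Nl=8 are an illustrative assumption, taken from the small fabric discussed later in this description, and the variable names are hypothetical.

```python
import math

# Illustrative values (assumed): the small fabric {Ns=4, Nl=8}, with Ms=Nl and Ml=Ns.
Ns, Nl = 4, 8
Ms, Ml = Nl, Ns
Np = Ns * Ms
assert Np == Nl * Ml  # Np = Ns x Ms = Nl x Ml

all_fabrics = math.factorial(Np)           # Np! one-to-one port maps
usable_fabrics = math.factorial(Nl) ** Ns  # (Nl!)^Ns maps usable for {Ns, Nl, Ml}

print(f"Np = {Np}")
print(f"Np! = {all_fabrics:.3e}")
print(f"(Nl!)^Ns = {usable_fabrics:.3e}")
print(f"usable fraction = {usable_fabrics / all_fabrics:.3e}")
```

Even for this small example, the usable fabrics are a very small fraction of all possible port permutations.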
[0061] Considering these cases, prior art fabric modules do not provide the flexibility to absorb those changes. Moreover, since most of the modules work only for one specific fabric, their use in large network deployments requires several types of fabric modules, which impacts cost, inventory management, and the complexity of the deployment. In addition, when considering future scaling of the network, a small change in the number of Spines or Leaves can require a major change in the fabric modules.
[0062] To illustrate the problem, we assume a small fabric, and for simplicity, we assume that Ml/Ns=1, which implies that each Spine connects to each Leaf using only one port. This simple fabric, {Ns=4, Nl=8}, with Np=32 ports, is designed to provide full connectivity between four Spine switches and eight Leaf switches as shown in
[0063] In
[0064] Full modeling of a large number of fabrics shows the same problem. Although they can provide full connectivity for a bespoke Spine and Leaf switch configuration, they cannot operate when the number of Spine or Leaf ports changes.
[0065] One might consider that this is an inherent limitation of the fabrics and that a universal module usable for multiple networks is therefore not feasible. However, a deeper analysis of the problem performed by the inventors indicates that this is not the case. We found a mapping function, Y=F(X), where X is the index of the input ports, X=1 to Np, and Y is the index of the output ports, that enables full connectivity not only for a Spine-and-Leaf {Ns, Nl, Ml} but also for a large set of potential variations in its configuration. We can estimate approximately that the variations on the Spine-and-Leaf network are represented by {Ns×2.sup.k, Nl×2.sup.-k}, where k is an integer that ranges from -log.sub.2(min(Ns,Nl)) to +log.sub.2(min(Ns,Nl)).
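The set of variations can be enumerated with a short sketch. It assumes that the variations keep the product of Spine and Leaf counts constant, that is, {Ns×2^k, Nl×2^-k}, and that Ns and Nl are powers of two. The function name `fabric_variations` is hypothetical.

```python
import math


def fabric_variations(ns: int, nl: int) -> list[tuple[int, int]]:
    """List (spines, leaves) pairs {ns*2^k, nl*2^-k} for k in [-kmax, +kmax].

    Assumes ns and nl are powers of two, so every division is exact.
    """
    kmax = int(math.log2(min(ns, nl)))
    variations = []
    for k in range(-kmax, kmax + 1):
        if k >= 0:
            spines, leaves = ns * 2**k, nl // 2**k
        else:
            spines, leaves = ns // 2**(-k), nl * 2**(-k)
        variations.append((spines, leaves))
    return variations


for spines, leaves in fabric_variations(4, 8):
    print(f"Ns={spines:2d}  Nl={leaves:2d}  Ns*Nl={spines * leaves}")
```

For Ns=4 and Nl=8, k runs from -2 to +2, giving five configurations that all share the same 32 ports.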
[0066] In general, the mapping function Y=F(X) can be described as,
[0067] F(X)=Bin2Dec(FlipDigit(Dec2Bin(X-1))+1)
[0068] The mapping function converts any input index to an output index, which represents the interconnection between two ports. We provide a detailed example of the mapping for this type of fabric with 32 input and 32 output ports. In this fabric, we select port #2, port index X=2, and compute the binary representation of X-1=1 as 00001. The bits are flipped (their order is reversed), producing 10000, which results in output index Y=17 after conversion to a decimal number and an increase by one. Therefore, input port 2 interconnects to output port 17. We can use this function for all the ports of the fabric {Ns=4, Nl=8, Ml=4} and produce the interconnection diagrams shown in
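The worked example for port 2 can be reproduced with a minimal sketch of the mapping. It assumes that "flipping" means reversing the bit order, as in the 00001 to 10000 example, and that the number of ports is a power of two. The name `fabric_map` is hypothetical.

```python
def fabric_map(x: int, num_ports: int) -> int:
    """Map 1-based input port x to its 1-based output port.

    Steps: subtract one, write the result with log2(num_ports) bits,
    reverse the bit order, convert back to decimal, and add one.
    num_ports is assumed to be a power of two.
    """
    n_bits = num_ports.bit_length() - 1
    if num_ports != 1 << n_bits:
        raise ValueError("num_ports must be a power of two")
    if not 1 <= x <= num_ports:
        raise ValueError("x must be in 1..num_ports")
    bits = format(x - 1, f"0{n_bits}b")  # Dec2Bin(X-1), zero padded
    flipped = bits[::-1]                 # FlipDigit: reverse the bit order
    return int(flipped, 2) + 1           # Bin2Dec, then add one


NP = 32
print(fabric_map(2, NP))  # input port 2 -> output port 17, as in the text

# Full interconnection table for the 32-port fabric (index 0 holds F(1)).
table = [fabric_map(x, NP) for x in range(1, NP + 1)]
print(table)
```

Because every input maps to a distinct output, the table is a permutation of the ports 1 to 32.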
[0069] This fabric provides full connectivity between Spine and Leaf switches for all the variations described in Table I. For example, in
[0070] Therefore, this fabric, labeled here as a universal-type fabric, can be used for multiple network configurations, creating opportunities for a new type of fabric module, such as module 400 and other embodiments shown in this disclosure, that not only encapsulates sections of the network but can also be used as identical building blocks (such as bricks in a building) to facilitate the deployment of large datacenters, AI clusters, or other types of optical networks.
[0071] The function F(X) was used to produce the fabrics of a diverse number of ports, for example, details of fabric F-64-001, used in module 400 (
[0072] General properties of the fabric are the non-linear characteristic of the function Y=F(X), which satisfies the reversible property given by F(Y)=X, or X=F(F(X)), or F.sup.-1(X)=F(X). For example, in Table II, for universal-type fabric F-16-001, we can select any X value, e.g., X=2, and show that F(F(2))=2 and therefore F.sup.-1(2)=F(2). Those properties enable a reversible fabric. In addition, from Table II, and in general from the described equation Y=F(X), it can be shown that we can connect any group of M.sub.1 adjacent input ports to any group of M.sub.2 adjacent output ports when either M.sub.1 or M.sub.2 is an even number and M.sub.1×M.sub.2=Np. For example,
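The reversibility and group-connectivity properties can be checked numerically. The sketch below assumes the bit-reversal form of F(X) and a power-of-two port count. It interprets "full connectivity" as every block of M1 adjacent inputs reaching every block of M2 adjacent outputs. The helper names are hypothetical.

```python
def fabric_map(x: int, num_ports: int) -> int:
    """Bit-reversal map on 1-based port indices (num_ports a power of two)."""
    n_bits = num_ports.bit_length() - 1
    return int(format(x - 1, f"0{n_bits}b")[::-1], 2) + 1


def is_reversible(num_ports: int) -> bool:
    """Check F(F(X)) == X for every input port."""
    return all(
        fabric_map(fabric_map(x, num_ports), num_ports) == x
        for x in range(1, num_ports + 1)
    )


def fully_connects(num_ports: int, m1: int) -> bool:
    """Check that each block of m1 adjacent inputs reaches every block of
    m2 = num_ports // m1 adjacent outputs."""
    m2 = num_ports // m1
    all_output_groups = set(range(num_ports // m2))
    for start in range(1, num_ports + 1, m1):
        reached = {
            (fabric_map(x, num_ports) - 1) // m2
            for x in range(start, start + m1)
        }
        if reached != all_output_groups:
            return False
    return True


print(is_reversible(16))  # 16-port case, e.g. F(F(2)) == 2
for m1 in (2, 4, 8, 16):
    print(m1, 32 // m1, fully_connects(32, m1))
```

The loop exercises several M1×M2=32 splits of the same 32-port map, which is the behavior that lets one fabric serve multiple Spine-and-Leaf configurations.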
[0073] In
[0074] The method using the described function F(X) also helps in the construction of the fabric modules since it produces symmetric fabrics, which show periodic patterns. Other properties, F(X)-F(X-1)=Np/2 (for even X, assuming Ns/Ml=1) and F(X-1)>F(X) for odd X>1, produce repeated crossing points. These and other periodicities are advantageous for the decomposition of the fabric into smaller pieces, something similar to factoring polynomial functions, so complex fabrics can be implemented based on smaller ones.
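The two periodicity properties can be verified with a brief sketch. It reads them as F(X)-F(X-1)=Np/2 and F(X-1)>F(X), and assumes the bit-reversal form of F(X) with a power-of-two port count. For that map, the difference property is checked at even X and the ordering property at odd X>1. The function names are hypothetical.

```python
def fabric_map(x: int, num_ports: int) -> int:
    """Bit-reversal map on 1-based port indices (num_ports a power of two)."""
    n_bits = num_ports.bit_length() - 1
    return int(format(x - 1, f"0{n_bits}b")[::-1], 2) + 1


def check_periodicity(num_ports: int) -> tuple[bool, bool]:
    """Return (difference_ok, ordering_ok) for the two properties."""
    half = num_ports // 2
    # F(X) - F(X-1) == Np/2 at every even X.
    difference_ok = all(
        fabric_map(x, num_ports) - fabric_map(x - 1, num_ports) == half
        for x in range(2, num_ports + 1, 2)
    )
    # F(X-1) > F(X) at every odd X > 1, which creates the crossing points.
    ordering_ok = all(
        fabric_map(x - 1, num_ports) > fabric_map(x, num_ports)
        for x in range(3, num_ports + 1, 2)
    )
    return difference_ok, ordering_ok


for num_ports in (16, 32, 64):
    print(num_ports, check_periodicity(num_ports))
```

Each pair of consecutive inputs (odd, even) lands exactly Np/2 ports apart, which is the regularity that makes the fabric decomposable into smaller identical pieces.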
[0075] In general, for a given number of ports Np, there is only one universal-type fabric, one in Np! fabrics, that has the mentioned properties, the flexibility to accommodate diverse networks, and symmetries. For example, any universal-type fabric such as the one shown in
[0076] Applications of the modules with the universal-type fabrics, F-Np-001, are shown in the next section of this disclosure.
Applications of Module 400
[0077] Universal-type fabrics, F-Np-001 for different numbers of ports can be implemented in modules 400 of less than 0.5 RU with multi-fiber connectors MPO or multi-fiber multi-ferrule connectors such as SN-MT or MMC. Some of the fabrics that can be used in modules 400 are shown in
[0078] Here we use F-64-001 to illustrate how the modules can be used in machine learning training networks where often two types of Spine-and-Leaf networks are used, one between the GPU servers and the Leaf switches, and another one from Leaf to Spine switches.
[0079] We will assume a cluster with 32 servers, e.g., Nvidia DGX servers, each with eight H100 GPUs and 16 optical uplinks that connect to 16 Leaf switches, each with 8 uplinks that connect to 8 Spine switches. The fabric that represents the interconnections from the GPU servers to the Leaf switches resembles the fabric shown in
[0080] Using the modules 400, this network can be implemented in less than 4RU space, with a stack of eight modules, each containing a universal-type fabric F-64-001 as shown in
[0081] Similarly, a stack of four modules 400, occupying less than 2 RU space, can be used to connect the 16 Leaf switches to 8 Spine switches, as shown in
[0082] Using large chassis switches such as Nexus 9000 or Arista 7800 as Spines, it is possible to increase the number of GPU servers to several tens of thousands. In all those cases, modules 400 can simplify the scaling of AI networks.
[0083] The previous examples showed applications of modules 400 for the backend network that connects the GPU servers. In AI clusters, some fabrics connect servers to storage or CPU servers. Those datacenter fabrics tend to have oversubscription ratios greater than one and to use fewer uplinks. The same type of modules 400 can be used.
[0084] While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.