OPTICAL INTERCONNECTION MODULES FOR AI NETWORKS

20250234117 · 2025-07-17

Abstract

An optical fabric includes a plurality of optical waveguides. The fabric has Np input ports with index X and Np output ports with index Y. An interconnection map between the input ports, index X, and the output ports, index Y, is provided by a non-linear function Y=F(X) that satisfies the reversible properties given by F(Y)=X, or X=F(F(X)), or F^-1(X)=F(X). The fabric provides full connectivity from any group of M_1 adjacent input ports to any group of M_2 adjacent output ports, where at least one number, M_1 or M_2, is an even number, and wherein M_1×M_2=Np.

Claims

1. An optical fabric comprising a plurality of optical waveguides, wherein the fabric has Np input ports with index X and Np output ports with index Y, an interconnection map between the input ports, index X, and the output ports, index Y, is provided by a non-linear function Y=F(X) that satisfies the reversible properties given by F(Y)=X, or X=F(F(X)), or F^-1(X)=F(X), and the fabric provides full connectivity from any group of M_1 adjacent input ports to any group of M_2 adjacent output ports, where at least one number, M_1 or M_2, is an even number, and wherein M_1×M_2=Np.

2. The fabric of claim 1, wherein an interconnection table is defined by a function, F(X)=Bin2Dec(FlipDigit(Dec2Bin(X-1))+1).

3. The fabric of claim 1, wherein the fabric is used in a Spine and Leaf network with Ns Spines and Nl Leaf switches, wherein M_1/M_2=K×Ns/Nl and K is a positive integer.

4. The fabric of claim 1, wherein the fabric is used in a Spine and Leaf network with Ns Spines and Nl Leaf switches, wherein M_2/M_1=K×Ns/Nl and K is a positive integer.

5. The fabric of claim 3, wherein the fabric can connect all Ns Spines to all Nl Leaf switches where Nl or Ns is an even number.

6. The fabric of claim 3, wherein the fabric can connect all Ns Leaf switches to all Nl Servers where Nl or Ns is an even number.

7. An apparatus for forming an optical fabric comprising a plurality of multi-fiber connector adapters, where the adapters connect to network equipment in a data communications network, such as Spine and Leaf switches, and an internal mesh having at least 128 optical waveguides, wherein light paths of connected transmitters and receivers are matched to provide proper optical connections, and wherein the internal mesh is designed to enable an arbitrary even number of uplinks from Leaf switches to Spine switches or from Servers to Leaf switches.

8. The apparatus of claim 7, wherein the apparatus can be installed in a rack and can be stacked to provide a folded Clos network topology of different sizes and radixes.

9. The apparatus of claim 7, wherein the apparatus can be used to scale optical networks from four to thousands of switches.

10. The apparatus of claim 7, wherein the apparatus can be stacked to provide folded Clos network topology for switches using an even number of uplinks where each of those uplinks comprises multi-fiber connectors.

11. The apparatus of claim 7, wherein the apparatus can be used to implement fabrics to connect several hundred thousand GPUs.

12. The apparatus of claim 7, wherein the apparatus provides redundant paths, reducing the risk of network failure due to interconnection errors.

13. The apparatus of claim 7, wherein the apparatus has a small form factor that enables stacking of at least 2 apparatuses in one RU, allowing the stacking of up to 132 apparatuses per rack.

14. A structured cable system comprising a stack of modules, wherein each module has a plurality of optical parallel connector adapters and incorporates an internal fabric or mesh, and wherein the internal mesh is designed to enable full connectivity from any group of M_1 adjacent input ports to any group of M_2 adjacent output ports, wherein at least one number, M_1 or M_2, is an even number, and a number of input ports is equal to a number of output ports and is given by M_1×M_2, wherein the stack of modules can be used to deploy or scale various Clos network topologies.

15. The structured cable system of claim 14, wherein the structured cable system can be used to scale optical networks from two to ten thousand switches.

16. The structured cable system of claim 14, wherein the structured cable system provides redundant paths, reducing a risk of network failure due to interconnection errors.

17. The structured cable system of claim 14, wherein the structured cable system enables fabrics with an arbitrary even number of uplinks.

18. A fiber optic module apparatus, which comprises a main body, an internal fabric made of optical waveguides, a front face, a rear side, a left side, and a right side, wherein the front face accommodates a multiplicity of multi-fiber connectors, the rear face accommodates a multiplicity of multi-fiber connectors, identical in number to the front face, an internal structure providing space for optical lanes of optical fibers or optical waveguides, wherein the internal mesh is designed to enable full connectivity from any group of M_1 adjacent input ports to any group of M_2 adjacent output ports, where at least one number, M_1 or M_2, is an even number, and where the number of input ports is equal to the number of output ports, where the total number of ports is given by 2×M_1×M_2.

19. The fiber optic module of claim 18, wherein the fiber optic module can be stacked to provide folded Clos network topology of various radixes.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] FIG. 1A shows a spine and leaf fabric having 16 spines fully connected to 32 Leaf switches {Ns=16, Nl=32, Ml=16}.

[0021] FIG. 1B highlights that each leaf switch of the fabric of FIG. 1A has 16 uplinks.

[0022] FIG. 2A shows a front of module 400.

[0023] FIG. 2B shows a rear of module 400.

[0024] FIG. 2C shows one MF (multi-ferrule) port of module 400.

[0025] FIG. 3A shows a top view of module 400.

[0026] FIG. 3B shows a top view of module 400 with the cover removed to display the interconnections.

[0027] FIG. 4A, along with FIGS. 4B-4D, shows how a typical fabric can only be used for a single spine and leaf configuration. FIG. 4A shows that for {Ns=4, Nl=8, Ml=4} there is full connectivity between spines (S) and leaves (L).

[0028] FIG. 4B shows that for {Ns=2, Nl=16, Ml=2} there is no full connectivity between S and L.

[0029] FIG. 4C shows that for {Ns=16, Nl=2, Ml=16} there is no full connectivity between S and L.

[0030] FIG. 4D shows that for {Ns=8, Nl=4, Ml=8} there is no full connectivity between S and L.

[0031] FIG. 5A, along with FIGS. 5B-5D, shows how a universal-type fabric can be used for several S and L configurations. FIG. 5A shows that for {Ns=4, Nl=8, Ml=4} the fabric provides full connectivity between S and L.

[0032] FIG. 5B shows that for {Ns=2, Nl=16, Ml=2} the fabric provides full connectivity between S and L.

[0033] FIG. 5C shows that for {Ns=16, Nl=2, Ml=16} there is full connectivity between S and L.

[0034] FIG. 5D shows that for {Ns=8, Nl=4, Ml=8} there is full connectivity between S and L.

[0035] FIG. 6A shows a universal type fabric using the function F-32-001.

[0036] FIG. 6B shows a universal type fabric using the function F-64-001.

[0037] FIG. 6C shows a universal type fabric using the function F-128-001.

[0038] FIG. 7A shows a first example of a universal type fabric F-32-001, showing connections of M_1 adjacent input ports to M_2 adjacent output ports wherein M_2=8, M_1=4.

[0039] FIG. 7B shows a second example of a universal type fabric F-32-001, showing connections of M_1 adjacent input ports to M_2 adjacent output ports wherein M_2=16, M_1=2.

[0040] FIG. 7C shows a third example of a universal type fabric F-32-001, showing connections of M_1 adjacent input ports to M_2 adjacent output ports wherein M_2=4, M_1=8.

[0041] FIG. 7D shows a fourth example of a universal type fabric F-32-001, showing connections of M_1 adjacent input ports to M_2 adjacent output ports wherein M_2=2, M_1=16.

[0042] FIG. 8 shows an example of a GPU cluster displaying the backend network used for GPU server communication.

[0043] FIG. 9A shows the front or back of modules 400 with fabric F-64-001 for an AI cluster with 32 GPU servers, G, with 16 uplinks (e.g., 8 NDR 800G breakout to 16 NDR 400G) connected to 16 Leaf switches.

[0044] FIG. 9B shows the back or front of modules 400 with fabric F-64-001 for an AI cluster with 32 GPU servers, G, with 16 uplinks (e.g., 8 NDR 800G breakout to 16 NDR 400G) connected to 16 Leaf switches.

[0045] FIG. 10A shows the front or back of modules 400 with fabric F-64-001 for an AI cluster with 16 L with 16 uplinks to 8 S.

[0046] FIG. 10B shows the back or front of modules 400 with fabric F-64-001 for an AI cluster with 16 L with 16 uplinks to 8 S.

[0047] FIG. 11A shows the front or back of modules 400 with fabric F-64-001 for an AI cluster with 64 L with 32 uplinks to 32 S.

[0048] FIG. 11B shows the back or front of modules 400 with fabric F-64-001 for an AI cluster with 64 L with 32 uplinks to 32 S.

[0049] FIG. 12A shows the front or back of modules 400 with fabric F-64-001 for data center applications for a fabric with 32 L with 6 uplinks to 6 S.

[0050] FIG. 12B shows the back or front of modules 400 with fabric F-64-001 for data center applications for a fabric with 32 L with 6 uplinks to 6 S.

[0051] FIG. 13 shows Table I in which three fabric modifications keeping the same number of ports Np=32 were generated.

[0052] FIG. 14 shows Table II, F(x) for fiber universal-type fabrics. The table shows mapping up to port 64. Additional ports can be obtained from F(X).

DESCRIPTION OF INVENTION

[0053] Modular apparatuses and a general method to deploy optical networks with a diversity of uplinks and radices are disclosed in this document. The module and method can be used with standalone, stacked, or chassis-based network switches, as long as the modular connections utilize single-ferrule or multi-ferrule (MF) MPO connectors (or other multi-fiber connectors) with more than 8 fiber pairs. In particular, they apply to switches supporting Ethernet-specified SR or DR transceivers in their ports, such as 100GBASE-SR4, 200GBASE-SR4, 400GBASE-DR4, 400GBASE-SR8, 800GBASE-SR8, or 1.6T SR8 (Terabit BiDi), or InfiniBand 400G or 800G NDR, or future 1.6T XDR.

[0054] FIGS. 2A and 2B show front and rear views of the disclosed module 400, which is the key element in facilitating optical network deployment, reshaping, and scaling. In this embodiment, the module has 16 MF connector ports in each of the front and rear sections, for a total of 32 MF ports (or 128 single-ferrule ports), as shown in FIGS. 2A and 2B.

[0055] The MF ports can be implemented with arrays of small MPO ferrules, such as commercially available SN-MT or MMC connectors. Each ferrule can have 8, 12, or 16 fibers. For example, in FIG. 2C, port 1F-a-d has 4 sub-ports, 1Fa, 1Fb, 1Fc, and 1Fd, each with one 16-fiber MPO ferrule.

[0056] The module 400 width, W, is in the range of 12 inches up to 19 inches, and the height, H, is in the range of 0.4 to 0.64 inches. Rails, 405, on both sides of the module, would enable the modules to be inserted into a chassis structure if required. Alternatively, using brackets 406, the modules can be directly attached to the rack. By using the specified height range for this embodiment, up to four modules can be stacked in less than 2 RU depending on density requirements.
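The stacking statement can be checked with simple arithmetic. The sketch below assumes the standard rack unit height of 1.75 inches (a value not stated in this disclosure) and ignores any gap between stacked modules; the constant and function names are illustrative only.

```python
RU_INCHES = 1.75  # assumption: standard rack unit height, not stated in the text

def stack_height(n_modules: int, module_height_in: float) -> float:
    """Total height in inches of n_modules stacked with no spacing (assumed)."""
    return n_modules * module_height_in

for h in (0.4, 0.64):  # ends of the height range H given for module 400
    total = stack_height(4, h)
    print(f"H={h} in: four modules = {total:.2f} in, "
          f"fits in 2 RU: {total < 2 * RU_INCHES}")
```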

[0057] FIG. 3A shows a top view of the depicted module 400 of depth B>3 inches with cover.

[0058] FIG. 3B shows the interconnection scheme of the internal fabric, F-64-001. The internal fabric, F-64-001, which stands for a fabric with Np=64 links, configuration design #001, comprises arrays of multiple optical waveguides and can be implemented by a variety of methods. For example, glass or plastic optical fibers are organized in the desired interconnection pattern and embedded and glued inside plastic film (optical flex-circuit method). Alternatively, polymer waveguides embedded in a plurality of layers in a printed circuit board (PCB), or direct laser written waveguides, where waveguides are written in 3D inside a glass using a femtosecond laser, can also be used.

[0059] In this document, we employ the nomenclature {Ns, Nl, Ml} to represent a Spine-and-Leaf fabric that can connect Ns Spine switches to Nl Leaf switches, where each Leaf switch has Ml uplinks. This fabric has Np input and Np output ports, where Np=Ns×Ms=Nl×Ml and Ms is the number of ports per Spine switch. In general, a fabric with 2Np ports, where each of the Np input ports is connected to only one of the Np output ports, can be implemented in many different configurations. The number of configurations equals the number of permutations of the input-to-output port connections, given by Np!, where ! represents the factorial function.

[0060] From that large set of possible configurations, the number of fabrics that can be used in a specific Spine-and-Leaf network, {Ns, Nl, Ml}, is given by (Nl!)^Ns, assuming that Ms=Nl and Ml=Ns. Almost all those fabrics become useless when the number of Spine or Leaf switches changes, even if the total number of ports is kept identical. This might look irrelevant for networks that are implemented only once and never modified. However, as AI models grow nearly 10× per year, GPU networks may need to change configuration as they scale. Also, different sections of the network, such as the GPU network (backend) and the CPU network (frontend), can require different Spine-and-Leaf configurations.
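To give a sense of scale for these two counts, the following sketch evaluates Np! and (Nl!)^Ns. The values Ns=4 and Nl=8 (so Np=32 with Ms=Nl and Ml=Ns) are chosen here only for illustration, and the function names are hypothetical.

```python
from math import factorial

def count_all_fabrics(np_ports: int) -> int:
    """Number of one-to-one input/output maps for Np ports: Np!."""
    return factorial(np_ports)

def count_fabrics_for_config(ns: int, nl: int) -> int:
    """Count given in the text for {Ns, Nl, Ml} with Ms=Nl and Ml=Ns: (Nl!)^Ns."""
    return factorial(nl) ** ns

ns, nl = 4, 8            # illustrative values only
np_ports = ns * nl       # Np = Ns x Ms with Ms = Nl
total = count_all_fabrics(np_ports)
usable = count_fabrics_for_config(ns, nl)
print(f"Np = {np_ports}: {total:.3e} possible fabrics")
print(f"(Nl!)^Ns = {usable:.3e} fabrics usable for Ns={ns}, Nl={nl}")
print(f"fraction usable: {usable / total:.3e}")
```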

[0061] Prior art fabric modules do not provide the flexibility to absorb those changes. Moreover, since most of the modules work only for one specific fabric, their use in large network deployments requires several types of fabric modules, which affects cost, inventory management, and the complexity of the deployment. In addition, when considering future scaling of the network, a small change in the number of spines or leaves can require a major change in the fabric modules.

[0062] To illustrate the problem, we assume a small fabric and, for simplicity, that Ml/Ns=1, which implies that each Spine connects to each Leaf using only one port. This simple fabric, {Ns=4, Nl=8}, with Np=32 ports, is designed to provide full connectivity between four Spine switches and eight Leaf switches, as shown in FIG. 4A. We verify that for this, and almost all simulated networks, the fabric will not allow changes to the number of Spine or Leaf switches, even for an identical number of ports. For example, in Table I we generate three fabric modifications keeping the same number of ports, Np=32.

[0063] In FIG. 4A, the original configuration shows a full connection between the spine and leaf switches. As designed, at least one port of each Leaf connects to one port of each Spine. For FIG. 4B, we increased the number of Leaf switches to 16, each with two ports, and changed the number of Spines to 2, each with 16 ports. In this case, as shown in FIG. 4B, the Leaf switches cannot connect with all the Spines. For example, two ports from Leaf #1 connect to Spine #1, but no ports of Leaf #1 connect to Spine #2. Similar issues occur with all the other Leaf switches. Therefore, if a network configuration changes, the fabric would need to be replaced even if the number of ports per module is kept the same. FIGS. 4C and 4D show a similar problem for all the other cases in Table I; there is no full connectivity between spine and leaf switches when their number changes.
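The failure mode described above can be reproduced with a short connectivity check. The sketch below does not use the fabric of FIG. 4A, whose exact wiring is not listed in the text. It assumes a simple hypothetical bespoke fabric in which port p of Leaf l connects to port l of Spine p, and then regroups the same 32 ports for the other configurations. The function names and the 0-based indexing are illustrative.

```python
def transpose_fabric(nl: int, ml: int) -> list[int]:
    """Hypothetical bespoke fabric: port p of leaf l goes to port l of spine p (0-based)."""
    return [p * nl + l for l in range(nl) for p in range(ml)]

def fully_connected(fabric: list[int], ml: int, ms: int) -> bool:
    """True if every group of ml adjacent inputs reaches every group of ms adjacent outputs."""
    n_spines = len(fabric) // ms
    for start in range(0, len(fabric), ml):
        spines = {fabric[x] // ms for x in range(start, start + ml)}
        if len(spines) != n_spines:
            return False
    return True

fabric = transpose_fabric(nl=8, ml=4)      # built for {Ns=4, Nl=8, Ml=4}
for ns, nl in [(4, 8), (2, 16), (16, 2), (8, 4)]:
    ml, ms = ns, nl                        # one link per Spine-Leaf pair (Ml/Ns=1)
    print(f"Ns={ns}, Nl={nl}: full connectivity = {fully_connected(fabric, ml, ms)}")
```

Under this assumed wiring, only the original {Ns=4, Nl=8} grouping reports full connectivity. For {Ns=2, Nl=16}, both ports of the first Leaf land on the first Spine, consistent with the behavior described for FIG. 4B.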

[0064] Full modeling of a large number of fabrics shows the same problem. Although they can provide full connectivity for a bespoke Spine and Leaf switch configuration, they cannot operate when the number of Spines or Leaf ports changes.

[0065] One might consider that this is an inherent limitation of the fabrics and that a universal module that can be used for multiple networks is therefore not feasible. However, a deeper analysis of the problem performed by the inventors indicates that this is not the case. We found a mapping function, Y=F(X), where X is the index of the input ports, X=1 to Np, and Y is the index of the output ports, that enables full connectivity not only for a Spine-and-Leaf fabric {Ns, Nl, Ml} but also for a large set of potential variations in its configuration. We can estimate that the variations of the Spine-and-Leaf network are approximately represented by {Ns×2^k, Nl×2^-k}, where k is an integer that ranges from -log2(min(Ns,Nl)) to +log2(min(Ns,Nl)).

[0066] In general, the mapping function Y=F(X), can be described as,

F(X) = Bin2Dec(FlipDigit(Dec2Bin(X-1))) + 1,   (1)

[0067] where Dec2Bin(.) is a function that converts a decimal number to binary, FlipDigit(.) is a function that reverses the bits of the binary number, and Bin2Dec(.) is a function that converts a binary number to decimal.

[0068] The mapping function converts any input index to an output index, which represents the interconnection between two ports. We provide a detailed example of the mapping for this type of fabric with 32 input and 32 output ports. In this fabric, we select port #2, port index X=2, and compute the binary representation of X-1=1 as 00001. The bits are flipped, producing 10000, which results in output index Y=17 after conversion to a decimal number (16) and an increase by one. Therefore, input port 2 interconnects to output port 17. We can use this function for all the ports of the fabric {Ns=4, Nl=8, Ml=4} and produce the interconnection diagrams shown in FIGS. 5A-5D.
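The worked example can be reproduced with a few lines of code. This is a minimal sketch, assuming that Np is a power of two and that the binary representation is zero-padded to log2(Np) bits, as the five-bit example for Np=32 suggests; the function name `universal_map` is illustrative and does not appear in the disclosure.

```python
def universal_map(x: int, np_ports: int) -> int:
    """F(X) = Bin2Dec(FlipDigit(Dec2Bin(X - 1))) + 1 for a 1-based port index X."""
    width = (np_ports - 1).bit_length()   # bits needed for 0..Np-1 (assumes Np = 2^n)
    bits = format(x - 1, f"0{width}b")    # Dec2Bin, zero-padded to a fixed width
    flipped = bits[::-1]                  # FlipDigit: reverse the bit order
    return int(flipped, 2) + 1            # Bin2Dec, then back to a 1-based index

print(universal_map(2, 32))               # worked example from the text: X=2 maps to Y=17
print([universal_map(x, 32) for x in range(1, 9)])
```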

[0069] This fabric provides full connectivity between Spine and Leaf switches for all the variations described in Table I. For example, in FIG. 5A the fabric fully connects four Spines, with 8 ports each, to eight Leaf switches, 4 ports each. In FIG. 5B we reduce the number of Spines and increase the Leaf switches as shown in Table I and notice that full connectivity is maintained. Similar behavior occurs for the other cases as shown in FIGS. 5C and 5D.
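The connectivity described for FIGS. 5A-5D can be checked numerically. The sketch below assumes the same bit-reversal reading of F(X) with Np=32, treats groups of Ml adjacent input ports as Leaf switches and groups of Ms adjacent output ports as Spine switches, and uses the four configurations listed in the figure descriptions; all function names are illustrative.

```python
def universal_map(x: int, np_ports: int) -> int:
    """Bit-reversal map F(X) on 1-based indices (assumes Np is a power of two)."""
    width = (np_ports - 1).bit_length()
    return int(format(x - 1, f"0{width}b")[::-1], 2) + 1

def leaf_to_spines(np_ports: int, ns: int, nl: int) -> dict[int, list[int]]:
    """For each Leaf (1-based), list the Spines reached through the fabric."""
    ml, ms = np_ports // nl, np_ports // ns   # ports per Leaf, ports per Spine
    reach = {}
    for leaf in range(nl):
        inputs = range(leaf * ml + 1, (leaf + 1) * ml + 1)
        reach[leaf + 1] = sorted({(universal_map(x, np_ports) - 1) // ms + 1 for x in inputs})
    return reach

for ns, nl in [(4, 8), (2, 16), (16, 2), (8, 4)]:
    reach = leaf_to_spines(32, ns, nl)
    full = all(spines == list(range(1, ns + 1)) for spines in reach.values())
    print(f"Ns={ns}, Nl={nl}: every Leaf reaches every Spine = {full}")
```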

[0070] Therefore, this fabric, labeled here as a universal-type fabric, can be used for multiple network configurations, creating opportunities for a new type of fabric module, such as module 400 and other embodiments shown in this disclosure, that not only encapsulates sections of the network but can also be used as identical building blocks (such as bricks in a building) to facilitate the deployment of large datacenters, AI clusters, or other types of optical networks.

[0071] The function F(X) was used to produce fabrics with diverse numbers of ports. For example, details of fabric F-64-001, used in module 400 (FIG. 3B), are shown in FIG. 6B. Other fabrics with smaller or larger numbers of ports that share the same capacity to operate with modifications of the original fabric are shown in FIG. 6A for Np=32 ports, F-32-001, and in FIG. 6C for Np=128 ports, F-128-001. As mentioned, there are a large number of potential permutations of these fabrics, but only one fabric, the fabric with the index 001 in our nomenclature, represents universal-type fabrics (flexible and symmetric). Interconnection maps for some universal-type fabrics obtained from F(X) are shown in Table II.

[0072] General properties of the fabric are the non-linear characteristic of the function Y=F(X) and the reversible property given by F(Y)=X, or X=F(F(X)), or F^-1(X)=F(X). For example, in Table II, for universal-type fabric F-16-001, we can select any X value, e.g., X=2, and show that F(F(2))=2 and therefore F^-1(2)=F(2). Those properties enable a reversible fabric. In addition, from Table II, and in general from the described equation Y=F(X), it can be shown that we can connect any group of M_1 adjacent input ports to any group of M_2 adjacent output ports when either M_1 or M_2 is an even number and M_1×M_2=Np. For example, FIGS. 7A-7D show 4 interconnection cases for universal-type fabric F-32-001.
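Both properties can be tested exhaustively for small fabrics. The following sketch assumes the bit-reversal reading of F(X), port counts that are powers of two (16 to 128), and group sizes M_1 and M_2 that are powers of two with M_1×M_2=Np; the helper names are illustrative.

```python
def universal_map(x: int, np_ports: int) -> int:
    """Bit-reversal map F(X) on 1-based indices (assumes Np is a power of two)."""
    width = (np_ports - 1).bit_length()
    return int(format(x - 1, f"0{width}b")[::-1], 2) + 1

def is_involution(np_ports: int) -> bool:
    """Check the reversible property F(F(X)) = X for every port."""
    return all(universal_map(universal_map(x, np_ports), np_ports) == x
               for x in range(1, np_ports + 1))

def links_between_groups(np_ports: int, m1: int, m2: int) -> dict[tuple[int, int], int]:
    """Count links between each input group (size m1) and each output group (size m2)."""
    counts: dict[tuple[int, int], int] = {}
    for x in range(1, np_ports + 1):
        key = ((x - 1) // m1, (universal_map(x, np_ports) - 1) // m2)
        counts[key] = counts.get(key, 0) + 1
    return counts

for np_ports in (16, 32, 64, 128):
    assert is_involution(np_ports)
    m1 = 2
    while m1 < np_ports:
        m2 = np_ports // m1
        counts = links_between_groups(np_ports, m1, m2)
        n_pairs = (np_ports // m1) * (np_ports // m2)
        # every (input group, output group) pair is joined by exactly one link
        assert len(counts) == n_pairs and set(counts.values()) == {1}
        m1 *= 2
    print(f"Np={np_ports}: involution and group connectivity hold")
```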

[0073] In FIG. 7A, there are 4 groups of M_2=8 adjacent output ports connected to 8 groups of M_1=4 input ports. As shown in the figure, any input group connects to any output group using the same number of connections. For example, in FIG. 7B, which has two M_2=16 output groups and 16 M_1=2 input groups, output group 1 connects to any of the 16 input groups and vice versa. In FIG. 7D, the reverse case with M_2=2 and M_1=16, any of the 16 output groups connects to any of the two input groups.

[0074] The method using the described function F(X) also helps in the construction of the fabric modules, since it produces symmetric fabrics that show periodic patterns. Other properties, F(X)-F(X-1)=Np/2 (for X>1 assuming Ns/Ml=1) and F(X-1)>F(X) for X odd>1, produce repeated crossing points. These and other periodicities are advantageous for decomposing the fabric into smaller pieces, similar to factoring polynomial functions, so complex fabrics can be implemented based on smaller ones.
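The two relations can be explored numerically. The sketch below is an assumption-laden check: it reads the first relation as F(X)-F(X-1)=Np/2 and the second as F(X-1)>F(X), uses the bit-reversal form of F(X) with Np=32, and tests the first relation at even X, which is where it holds in this sketch. The function name is illustrative.

```python
def universal_map(x: int, np_ports: int) -> int:
    """Bit-reversal map F(X) on 1-based indices (assumes Np is a power of two)."""
    width = (np_ports - 1).bit_length()
    return int(format(x - 1, f"0{width}b")[::-1], 2) + 1

np_ports = 32
for x in range(2, np_ports + 1):
    step = universal_map(x, np_ports) - universal_map(x - 1, np_ports)
    if x % 2 == 0:
        # even X: the neighbouring outputs are exactly Np/2 apart
        assert step == np_ports // 2
    else:
        # odd X > 1: the previous port maps higher, so the two links cross
        assert step < 0
print("difference pattern holds for Np =", np_ports)
```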

[0075] In general, for a given number of ports Np, there is only one universal-type fabric, one in Np! fabrics, that has the mentioned properties: flexibility to accommodate diverse networks, and symmetries. For example, any universal-type fabric such as the ones shown in FIGS. 6A-6C can be vertically flipped without producing changes in the interconnections.

[0076] Applications of the modules with the universal-type fabrics, F-Np-001, are shown in the next section of this disclosure.
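One way to read the flip symmetry is that reversing the port order on both the input and output sides leaves the map unchanged, i.e. F(Np+1-X)=Np+1-F(X). This interpretation is an assumption, as is the bit-reversal form of F(X) used below; the sketch checks it for three port counts and does not address the uniqueness statement.

```python
def universal_map(x: int, np_ports: int) -> int:
    """Bit-reversal map F(X) on 1-based indices (assumes Np is a power of two)."""
    width = (np_ports - 1).bit_length()
    return int(format(x - 1, f"0{width}b")[::-1], 2) + 1

def is_flip_symmetric(np_ports: int) -> bool:
    """True if reversing the port order on both sides reproduces the same links."""
    return all(
        universal_map(np_ports + 1 - x, np_ports) == np_ports + 1 - universal_map(x, np_ports)
        for x in range(1, np_ports + 1)
    )

for np_ports in (32, 64, 128):
    print(f"F-{np_ports}-001 flip symmetric: {is_flip_symmetric(np_ports)}")
```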

Applications of Module 400

[0077] Universal-type fabrics, F-Np-001, for different numbers of ports can be implemented in modules 400 of less than 0.5 RU with MPO multi-fiber connectors or multi-fiber multi-ferrule connectors such as SN-MT or MMC. Some of the fabrics that can be used in modules 400 are shown in FIGS. 6A-6C. For relatively small networks, module 400 can be implemented with universal fabrics F-16-001 or F-32-001 for Leaf switches with an even number of uplinks, Ml, ranging from 2 to 12. Larger fabrics using a larger number of uplinks, Ml≥16, such as some used for machine learning training networks, can require universal-type fabrics F-64-001 or even F-128-001.

[0078] Here we use F-64-001 to illustrate how the modules can be used in machine learning training networks, where often two types of Spine-and-Leaf networks are used: one between the GPU servers and the Leaf switches, and another from the Leaf to the Spine switches. FIG. 8 shows a schematic that represents a machine learning training cluster, showing the GPU servers. The servers can accommodate 4, 8, or more advanced GPUs, with storage, NICs, internal switches, and CPU, and have 4, 8, or more optical ports, QSFP-DD or OSFP, that can operate at Ethernet or InfiniBand rates up to 800 Gbps per port today and 1.6 Tbps in the near future. Those transceivers, which typically utilize MPO connectors with 16 fibers or duplex MPO ports each with 8 fibers, can break out into lower-speed transceivers that use MPO connectors with 8 fibers. For example, an 800G transceiver (2 MPOs with a total of 16 fibers) can break out into two 400G transceivers (8 fibers each). Therefore, even in a small AI cluster, the fiber optic links between servers and switches can easily exceed several thousand fibers. In this cluster, we will focus first on the connection between GPU servers (G) and the Leaf switches (L switches) and then on the connection between L switches and Spine switches (S switches). In the figure, we show potential locations where the fiber modules (FM) 400 can be installed.

[0079] We will assume a cluster with 32 servers, e.g., Nvidia DGX servers, each with eight H100 GPUs and 16 optical uplinks that connect to 16 Leaf switches, each with 8 uplinks that connect to 8 Spine switches. The fabric that represents the interconnections from the GPU servers to the Leaf switches resembles the fabric shown in FIG. 1A, where the Leaves (Ls) are replaced by servers (Gs) and the Spines by Leaves. In that case, each link corresponds to 8 fibers.

[0080] Using the modules 400, this network can be implemented in less than 4 RU of space, with a stack of eight modules, each containing a universal-type fabric F-64-001 as shown in FIGS. 9A and 9B, where the Leaf switches connect on one side of the modules.
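The link and fiber counts for the server-to-Leaf stage follow directly from these figures. The short calculation below uses only the numbers in the paragraph above (32 servers, 16 uplinks per server, 16 Leaf switches, 8 fibers per link); the variable names are illustrative.

```python
servers = 32
uplinks_per_server = 16
leaf_switches = 16
fibers_per_link = 8

links = servers * uplinks_per_server        # server-to-Leaf links
fibers = links * fibers_per_link            # individual fibers in this stage
links_per_leaf = links // leaf_switches     # downlinks landing on each Leaf

print(f"{links} links, {fibers} fibers, {links_per_leaf} links per Leaf switch")
```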

[0081] Similarly, a stack of four modules 400, occupying less than 2 RU of space, can be used to connect the 16 Leaf switches to 8 Spine switches, as shown in FIGS. 10A and 10B. Using the modules 400, the AI cluster can grow to tens of thousands of GPUs. For example, a network with 32 Spine and 64 Leaf switches can be implemented using 32 modules 400 occupying less than 16 RU, as shown in FIGS. 11A and 11B. This implementation can potentially connect 256 GPU servers, or around 2048 GPUs.
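The module counts quoted for these stacks can be reproduced with a simple estimate. The sketch assumes that each module 400 carries one F-64-001 fabric (Np=64 links) and occupies at most 0.5 RU, and it takes the uplink counts from the descriptions of FIGS. 9A-11B (32 servers with 16 uplinks, 16 Leaf switches with 16 uplinks, 64 Leaf switches with 32 uplinks). The constants and the function name are illustrative.

```python
from math import ceil

LINKS_PER_MODULE = 64   # assumption: one F-64-001 fabric (Np=64) per module 400
RU_PER_MODULE = 0.5     # assumption: upper bound, "less than 0.5 RU" per module

def modules_needed(n_devices: int, uplinks_per_device: int) -> int:
    """Modules required to carry every uplink of the given devices."""
    return ceil(n_devices * uplinks_per_device / LINKS_PER_MODULE)

cases = {
    "32 GPU servers x 16 uplinks": (32, 16),
    "16 Leaf switches x 16 uplinks": (16, 16),
    "64 Leaf switches x 32 uplinks": (64, 32),
}
for label, (n, m) in cases.items():
    mods = modules_needed(n, m)
    print(f"{label}: {mods} modules, under {mods * RU_PER_MODULE:g} RU")
```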

[0082] Using large chassis switches such as Nexus 9000 or Arista 7800 as Spines, it is possible to increase the number of GPU servers to several tens of thousands. In all those cases, modules 400 can simplify the scaling of AI networks.

[0083] The previous examples showed applications of modules 400 for the network that connects the GPU servers and the backend network. In AI clusters, some fabrics connect servers to storage or CPU servers. Those datacenter fabrics tend to have oversubscriptions greater than one and to use fewer uplinks. The same type of modules 400 can be used. FIGS. 12A and 12B show a case using 32 Leaf switches with six uplinks that connect to six Spine switches. Only three modules 400 are needed for this network. However, the last one has the inputs and outputs organized differently to accommodate the six uplinks. In this example, the fabric {Ns=6, Nl=32, Ml=6} was divided into two fabrics: {Ns=4, Nl=32, Ml=4}, which requires two modules 400 (modules #1 and #2), and {Ns=2, Nl=32, Ml=2}, which requires one module 400 (module #3). A similar method can be applied to any other fabric with an even number of uplinks.
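The six-uplink split can be generalized as a small planning routine. The sketch below assumes that an uplink count is split into powers of two, which reproduces the 4+2 split of this example but is only one possible generalization, and that each module carries Np=64 links. Function names are hypothetical.

```python
from math import ceil

LINKS_PER_MODULE = 64   # assumption: one F-64-001 fabric (Np=64) per module 400

def split_uplinks(ml: int) -> list[int]:
    """Split an uplink count into powers of two, largest first (6 -> [4, 2])."""
    parts = []
    bit = 1
    while ml:
        if ml & 1:
            parts.append(bit)
        ml >>= 1
        bit <<= 1
    return parts[::-1]

def plan_modules(nl: int, ml: int) -> list[tuple[int, int]]:
    """Return (uplinks in sub-fabric, modules required) for each sub-fabric."""
    return [(part, ceil(nl * part / LINKS_PER_MODULE)) for part in split_uplinks(ml)]

plan = plan_modules(nl=32, ml=6)
for part, mods in plan:
    print(f"sub-fabric with Ml={part}: {mods} module(s)")
print("total modules:", sum(mods for _, mods in plan))
```

For an odd uplink count this split would produce a one-uplink sub-fabric, a case the text does not cover.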

[0084] While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.