OPTICAL SWITCH FABRICS FOR HIGH PERFORMANCE COMPUTING
20250142235 · 2025-05-01
Abstract
This invention is related to two-stage optical packet switch fabrics for TE (terabit Ethernet)-based HPCNs (high performance computing networks). The main features of the invented optical switch architecture are the following: (1) it can support thousands of TE links; (2) it has a low signal power loss and requires no optical amplifiers; (3) it can switch WDM packets simultaneously without using wavelength converters.
The patent application presents various embodiments of the two-stage switch architecture. In one embodiment, the first stage comprises K N×N (N inputs and N outputs) AWGs and the second stage comprises N K×K OSSes (optical space switches). This results in a port count of KN. Each input can transmit N wavelengths, and a total of KN^2 packets can pass through the switch fabric simultaneously without blocking. The switch fabric is named AS for the technologies used in the two stages. Currently, 32×32 AWGs are available. This allows an AS switch fabric to easily support more than a couple of thousand TE links.
In another embodiment, the first stage comprises N K×K OSSes and the second stage comprises K N×N AWGs. It is named SA for the same reason given above. Similar to the first embodiment, the total number of source and destination ports supported by an SA switch fabric equals KN, and KN^2 packets can be transmitted simultaneously through the switch fabric. All the features and advantages of the AS architecture are inherited by the SA architecture.
In still another embodiment, two AS or SA switch fabrics are used in parallel to construct a switching system capable of handling any kind of unbalanced traffic load. A port processor that keeps the delay bounded and processes and re-sequences packets in such a switching system is also presented in this patent application.
Claims
1. An optical switch fabric for switching WDM (Wavelength Division Multiplexed) packets between a plurality of source ports and a plurality of destination ports, comprising: a first switching stage comprising a plurality of N×N (N inputs and N outputs) AWGs (Arrayed Waveguide Gratings) to route WDM packets, received from the plurality of source ports, to a second switching stage; and the second switching stage comprising a plurality of K×K (K inputs and K outputs) OSSes (Optical Space Switches) configured to switch WDM packets, received from the first switching stage, to the plurality of destination ports.
2. The optical switch fabric of claim 1, wherein the first switching stage comprises K AWGs and the second switching stage comprises N OSSes.
3. The optical switch fabric of claim 2, wherein a total of KN source ports and KN destination ports supported by the switch fabric are divided into K groups, numbered from 0 to K-1, and each source-port and destination-port group comprises N members, numbered from 0 to N-1; the p-th member of the q-th source-port group, where 0 ≤ p ≤ N-1 and 0 ≤ q ≤ K-1, is connected to the p-th input of the q-th AWG of the first switching stage; the m-th output of the n-th AWG of the first switching stage, where 0 ≤ m ≤ N-1 and 0 ≤ n ≤ K-1, is connected to the n-th input of the m-th OSS of the second switching stage; and the r-th output of the s-th OSS of the second switching stage, where 0 ≤ r ≤ K-1 and 0 ≤ s ≤ N-1, is connected to the s-th member of the r-th destination-port group.
4. The optical switch fabric of claim 1, wherein the OSSes of the second switching stage operate in a TDM (time division multiplexing) mode.
5. An optical switch fabric for switching WDM (Wavelength Division Multiplexed) packets between a plurality of source ports and a plurality of destination ports, comprising: a first switching stage comprising a plurality of K×K (K inputs and K outputs) OSSes (Optical Space Switches) configured to switch WDM packets, received from the plurality of source ports, to a second switching stage; and the second switching stage comprising a plurality of N×N (N inputs and N outputs) AWGs (Arrayed Waveguide Gratings) to route WDM packets, received from the first switching stage, to the plurality of destination ports.
6. The optical switch fabric of claim 5, wherein the first switching stage comprises N OSSes and the second switching stage comprises K AWGs.
7. The optical switch fabric of claim 6, wherein a total of KN source ports and KN destination ports supported by the switch fabric are divided into K groups, numbered from 0 to K-1, and each source-port and destination-port group comprises N members, numbered from 0 to N-1; the p-th member of the q-th source-port group, where 0 ≤ p ≤ N-1 and 0 ≤ q ≤ K-1, is connected to the q-th input of the p-th OSS of the first switching stage; the m-th output of the n-th OSS of the first switching stage, where 0 ≤ m ≤ K-1 and 0 ≤ n ≤ N-1, is connected to the n-th input of the m-th AWG of the second switching stage; and the r-th output of the s-th AWG of the second switching stage, where 0 ≤ r ≤ N-1 and 0 ≤ s ≤ K-1, is connected to the r-th member of the s-th destination-port group.
8. The optical switch fabric of claim 5, wherein the OSSes of the first switching stage operate in a TDM (time division multiplexing) mode.
9. A port processor for processing packets in a switching system that uses two switch fabrics, named phase-1 and phase-2, operating in parallel, comprising: a phase-1 port processor, connected to the phase-1 switch fabric, comprising a phase-1 input processor and a phase-1 output processor; and a phase-2 port processor, connected to the phase-2 switch fabric, comprising a phase-2 input processor and a phase-2 output processor; wherein the phase-1 input processor evenly distributes cells (fixed-length packets) received from an external port to outputs of the phase-1 switch fabric; the phase-1 output processor passes cells, received from the phase-1 switch fabric, either to the phase-2 input port processor or to the phase-1 input port processor; the phase-2 input processor routes cells, received from the phase-1 output processor, to outputs of the phase-2 switch fabric; and the phase-2 output processor re-sequences cells, received from the phase-2 switch fabric, before sending the cells to an external port.
10. The port processor of claim 9, wherein the phase-2 input processor puts a cell, received from the phase-1 output processor, into a queue, called VOQ (virtual output queue), containing cells destined for the same output of the phase-2 switch; the phase-1 output processor passes a cell, received from the phase-1 switch fabric, to the phase-2 input port processor if the length of the VOQ of the cell is smaller than a given limit; and the phase-1 output processor passes a cell, received from the phase-1 switch fabric, to the phase-1 input port processor if the length of the VOQ of the cell equals the given limit.
11. The port processor of claim 9, wherein the phase-2 output processor puts a cell, received from the phase-2 switch fabric, into the location (sequence-number % L_viq) of a queue, called VIQ (virtual input queue), containing cells originating from the same input of the phase-1 switch, wherein sequence-number is the cell's arriving time slot in the phase-1 switch, and L_viq is the total number of cells provided to a VIQ.
Description
DETAILED DESCRIPTION
[0027] The following describes a scalable AWG-based optical switch architecture in various exemplary embodiments. An optical switch fabric (e.g. 100 shown in
[0028] Although a port processor (e.g. 112A) is considered part of the switch fabric, it is typically integrated with a packet processor (e.g. 111A) on the same chip or in the same package. With today's advanced IC and packaging technology, the bandwidth of the interface (i.e. 113A-C/114A-C) between a packet processor and a port processor can easily exceed the bandwidth required to support a TE link. However, the per-wire bandwidth outside a chip is much lower, which is why WDM (wavelength division multiplexing) is necessary for transmitting packets between an HPC card (e.g. 110A) and the switch fabric 100. It is worth noting that there is a one-to-one correspondence between an HPC-card source port (e.g. 113A) and a switch source port (e.g. 116A) (or between 114A and 117A) in
[0029] Two embodiments of the optical switch fabric 100 are described below. The switch fabrics are named AS and SA for the technologies (i.e. AWG and Space Switching) used by the two switching stages of each switch fabric.
AS Switch Fabric
[0030] An AS switch fabric comprises two stages of switching devices:
[0031] the first stage comprising K N×N AWGs (e.g. 510A-E); and
[0032] the second stage comprising N K×K OSSes (e.g. 520A-D).
The total port count of an AS switch fabric equals KN. As K can be greater than N, the total port count can exceed N^2, which is the total port count of a TASA or ASA switch.
[0033] How the two switching stages are connected in an AS switch fabric is described below; this interconnection is a crucial element of the invention. Each input/output link of an AWG or an OSS is assigned a two-tuple address [group, member], where group refers to the switching device (i.e. AWG or OSS) to which the link is connected and member refers to the link number within that switching device. Let L_1[ ] and L_2[ ] denote the sets of input and output links of the first-stage AWGs. For example, L_1[i,k] refers to the k-th input link of the i-th AWG, where 0 ≤ k ≤ N-1 and 0 ≤ i ≤ K-1. Let L_3[ ] and L_4[ ] denote the sets of input and output links of the OSSes in the second stage (see
[0034] Let all OSSes operate in a TDM mode (with the frame size=K). The connection patterns used by an OSS in each slot of a TDM frame are stored in its control memory (e.g. 560 in
where j is an input of an OSS, (j) its connected output, and s the given TDM slot. Since s ranges from 0 to K-1, there are K connection patterns in total.
[0035] The connection patterns of (2) and the switch's topology described by (1) result in the following connectivity for the entire switch:
[0036] In a TDM slot, all source ports of a source-port group are connected to destination ports belonging to one destination-port group.
This can be seen by examining how a link L_2[i,j] is connected to a destination port in L_6[ ] in a given TDM slot s:
We can see in (3) that the group value of the connected port depends only on i, not j. If we fix i and only change j in L_2[i,j], all the connected destination ports in (3) belong to the same group. This implies that a source-port group is connected to only one destination-port group in a TDM slot (see
[0037] As with any TDM switch, an AS switch fabric operating in a TDM mode can experience performance issues if traffic is not distributed evenly. To address this problem, we propose a switching system that utilizes two AS fabrics in parallel. More details are given in the section titled "Two-Phase Switching and Port Processor" below.
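The wiring recited in claim 3 and the per-slot group connectivity stated in paragraphs [0035] and [0036] can be traced with a short sketch. Two modeling choices are assumptions rather than the patent's stated equations: the AWG is modeled as a cyclic router that sends wavelength w on input p to output (p + w) mod N, and each OSS is assumed to use the cyclic-shift pattern output = (input + s) mod K in TDM slot s. The function name `as_route` and the values of K and N are hypothetical.

```python
def as_route(q, p, w, s, K, N):
    """Trace one packet through an AS fabric.

    q, p : source-port group and member (0 <= q < K, 0 <= p < N)
    w    : wavelength index carrying the packet (0 <= w < N)
    s    : TDM slot within a frame of K slots
    Returns (destination group, destination member).
    """
    # Source port [q, p] feeds input p of AWG q (claim 3).
    # ASSUMPTION: cyclic AWG, wavelength w on input p exits output m.
    m = (p + w) % N
    # Output m of AWG q feeds input q of OSS m (claim 3).
    # ASSUMPTION: in slot s, OSS input q is connected to output r.
    r = (q + s) % K
    # Output r of OSS m feeds member m of destination-port group r (claim 3).
    return r, m


K, N = 4, 3  # hypothetical small fabric: 12 ports
for s in range(K):
    seen = set()
    for q in range(K):
        groups = set()
        for p in range(N):
            for w in range(N):
                r, m = as_route(q, p, w, s, K, N)
                groups.add(r)
                seen.add((r, m, w))
        # Every member of source group q reaches one destination group only.
        assert len(groups) == 1
    # K*N^2 distinct (destination port, wavelength) pairs: no collisions.
    assert len(seen) == K * N * N
```

Under these assumptions the TDM slot selects the destination group and the wavelength selects the member within that group, which is consistent with the statement that KN^2 packets can cross the fabric in one slot without blocking.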
[0038] An AS switch fabric has several advantages over a TASA or an ASA switch fabric. First, it is more scalable, with a total port count (=KN) that can exceed N^2 (since K can easily be made larger than N). Second, it has only two stages of switching devices, making it more cost-effective to implement than the TASA switch, which has three stages. Third, signal loss is reduced by at least 10 dB in the AS switch fabric due to the two-stage switching design, eliminating the need for optical amplifiers (e.g. 340A-E and 440A-D), which are necessary in an ASA and a TASA switch. Finally, only one N×N AWG is located between a source-port group and a destination-port group in a given TDM slot (see
SA Switch Fabric
[0039] We can apply a similar principle to design an SA (Space Switching-AWG) switch fabric as illustrated in
[0045] Compared to a TASA or an ASA switch fabric, an SA switch fabric offers advantages similar to those of the AS architecture, including: (a) improved scalability; (b) lower implementation costs due to the use of only two stages of switching devices; (c) the elimination of optical amplifiers; and (d) the ability to use commercially available, even-sized AWGs.
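The SA wiring recited in claim 7 can be sketched in the same way, together with the inverse question a port processor would face: in which TDM slot and on which wavelength a given source port reaches a given destination port. As assumptions not taken from the patent text, each OSS uses the cyclic-shift pattern output = (input + s) mod K in slot s, and each AWG routes wavelength w on input p to output (p + w) mod N. The function names are hypothetical.

```python
def sa_route(q, p, w, s, K, N):
    """Trace one packet through an SA fabric; returns (dest group, dest member)."""
    # Source port [q, p] feeds input q of OSS p (claim 7).
    # ASSUMPTION: in slot s, OSS input q is connected to output g.
    g = (q + s) % K
    # Output g of OSS p feeds input p of AWG g (claim 7).
    # ASSUMPTION: cyclic AWG, wavelength w on input p exits output r.
    r = (p + w) % N
    # Output r of AWG g feeds member r of destination-port group g (claim 7).
    return g, r


def sa_slot_and_wavelength(src, dst, K, N):
    """Invert sa_route: the slot and wavelength that connect src to dst."""
    (q, p), (g, r) = src, dst
    return (g - q) % K, (r - p) % N


K, N = 4, 3                      # hypothetical small fabric
src, dst = (3, 2), (1, 0)        # [group, member] addresses
s, w = sa_slot_and_wavelength(src, dst, K, N)
assert sa_route(*src, w, s, K, N) == dst
```

With these assumed patterns, every source/destination pair is connected exactly once per frame of K slots, on exactly one of the N wavelengths.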
Two-Phase Switching and Port Processor
[0046] TDM switch fabrics, as already mentioned, can suffer from poor performance when traffic is unevenly distributed. To address this issue, we have developed a two-phase switching system 700 (see
[0047] It is worth noting that the two-phase switching system presented above differs from the conventional cascading approach depicted in
[0048] To accommodate the two switch fabrics 710 and 720 in one switching system 700, the port processor 112A (or 112B, 112C) comprises two port processors 810 and 820, one for each switch fabric. The two port processors are integrated into the same chip so that they can exchange cells, which is crucial for achieving a bounded delay for the entire switch 700. Each port processor 810 (or 820) can be further divided into two component processors: input/output port processors 811/812 (or 821/822). Their functions are described below:
[0049] ph1_pp 810 (the port processor of the phase-1 switch):
[0050] ph1_ppi (e.g. 811): The input processor of the ph1_pp (e.g. 810). It receives cells from an HPC card's source port (e.g. 113A) and distributes them to the outputs of the phase-1 switch (e.g. 710) in a round-robin fashion.
[0051] ph1_ppo (e.g. 812): The output processor of the ph1_pp (e.g. 810). It hands over received cells either to the ph2_ppi (e.g. 821) or back to the ph1_ppi (e.g. 811) (see discussions below).
[0052] ph2_pp 820 (the port processor of the phase-2 switch):
[0053] ph2_ppi (e.g. 821): The input processor of the ph2_pp (e.g. 820). It routes cells through the phase-2 switch (e.g. 720) to their original destination ports.
[0054] ph2_ppo (e.g. 822): The output processor of the ph2_pp (e.g. 820). It re-sequences cells, received from the phase-2 switch (e.g. 720), before sending them to the connected HPC card.
[0055] The data rate of a future TE link can reach several terabits per second. This rate determines the cell transmission time, denoted by iSlot, inside a port processor (e.g. 112A). The port processor and all of its component processors (i.e. 811, 812, 821, and 822) must operate at the iSlot time scale, processing cells at a rate of one cell per iSlot. WDM must be used to transmit data between an HPC card and the optical switch fabric. Assuming N wavelengths are used, the per-wavelength data transmission rate is 1/N of the rate of a TE link, which determines the cell transmission time, denoted by sSlot, of one wavelength. Switch fabric 710 or 720 operates at the sSlot time scale. It is clear that the equation sSlot=N iSlots, as shown in
[0056] Each cell transmitted in the switch contains a header that carries various pieces of information, including three essential ones:
[0057] source port address: This is the input port address of the phase-1 switch.
[0058] destination port address: This is the output port address of the phase-2 switch.
[0059] sequence number: This refers to the arriving time slot number (in iSlots) of a cell. It is treated as the cell arrival time.
When a cell arrives from a packet processor (e.g. 113A), it is placed into a single Distribution Queue (DQ) (e.g. 813). The ph1_ppi can take N cells from the DQ in one sSlot and distribute them to the outputs of the phase-1 switch. When a cell reaches an output of the phase-1 switch, the ph1_ppo (e.g. 812) passes it to the ph2_ppi (e.g. 821), which puts the cell into a queue containing all cells destined for the same output port. This queue is traditionally known as a Virtual Output Queue (VOQ), and since there are KN destination ports, there will be KN VOQs (e.g. 823) in the ph2_ppi. VOQs are served in a round-robin fashion, and each VOQ is served only once in a frame (i.e., K TDM slots). In contrast, there is a single DQ, which means that the DQ is served at a rate KN times that of a VOQ. Consequently, the delay of the phase-1 switch, denoted by Ph1Delay, is much smaller than the delay of the phase-2 switch, denoted by Ph2Delay. To derive the delay bound of the switch, denoted by Ph12DB, we should focus on the Ph2Delay first.
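The factor of KN between the DQ and VOQ service rates follows from the slot definitions and can be checked with a small calculation. The sketch below measures time in iSlots; the function name `service_rates` and the values K=64 and N=32 are illustrative assumptions, not figures from the patent.

```python
from fractions import Fraction


def service_rates(K, N):
    """Return (sSlot, frame, DQ rate, VOQ rate), with time counted in iSlots."""
    s_slot = N                      # sSlot = N iSlots
    frame = K * s_slot              # one TDM frame = K sSlots
    dq_rate = Fraction(N, s_slot)   # DQ: N cells taken per sSlot
    voq_rate = Fraction(1, frame)   # each VOQ: one cell served per frame
    return s_slot, frame, dq_rate, voq_rate


s_slot, frame, dq_rate, voq_rate = service_rates(K=64, N=32)
print(frame)                # 2048 (iSlots per frame)
print(dq_rate / voq_rate)   # 2048, i.e. K*N
```

The DQ therefore drains one cell per iSlot, while a single VOQ drains one cell every KN iSlots, which is why Ph2Delay dominates the overall delay bound.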
[0060] To bound Ph2Delay, we limit the length of each VOQ to a specified value a. As a result, the Ph2Delay is bounded by (a × frame size). When a cell arrives at the ph1_ppo (e.g. 812), if the length of the corresponding VOQ has already reached a, the ph2_ppi will instruct the ph1_ppo to hand the cell back to the ph1_ppi (e.g. 811) co-located in the same chip. The cell will then pass through the phase-1 switch 710 again and get distributed to a different output. This loopback scheme implies that a cell may pass through the phase-1 switch multiple times. Although the Ph1Delay is small, we still need to limit the number of loopbacks (which is recorded in a cell's header) to bound the Ph1Delay. This is done by limiting the maximum number of loopbacks to a specified value y. When this limit is exceeded, the ph1_ppi will discard the cell. It should be noted that our cell discarding scheme is based on the number of loopbacks, which is different from conventional timer-based packet discarding schemes (e.g. M. Sammour, et al, Method and apparatus for PCDP discard, U.S. Pat. No. 10,630,819, 2020).
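The admission decision made at the phase-1 output can be sketched as follows. This is a minimal model under stated assumptions: `voq_limit` stands for the VOQ length limit a and `max_loopbacks` for the loopback limit y; the class and function names are hypothetical; the loopback counter is incremented at the moment the cell is handed back; and the discard check, which the text assigns to the ph1_ppi, is folded into the same function for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Cell:
    src: int            # source port address (phase-1 input)
    dst: int            # destination port address (phase-2 output)
    seq: int            # arrival time in iSlots
    loopbacks: int = 0  # number of loopbacks, carried in the header


class Ph2Input:
    """ph2_ppi model: one bounded VOQ per destination port."""

    def __init__(self, num_ports, voq_limit):
        self.voq_limit = voq_limit
        self.voqs = [deque() for _ in range(num_ports)]

    def has_room(self, dst):
        return len(self.voqs[dst]) < self.voq_limit


def ph1_output(cell, ph2_in, max_loopbacks):
    """Return 'voq', 'loopback', or 'discard' for a cell leaving phase 1."""
    if ph2_in.has_room(cell.dst):
        ph2_in.voqs[cell.dst].append(cell)
        return "voq"
    cell.loopbacks += 1  # ASSUMPTION: counted when the cell is handed back
    if cell.loopbacks > max_loopbacks:
        return "discard"
    return "loopback"


ph2 = Ph2Input(num_ports=4, voq_limit=2)
results = [ph1_output(Cell(0, 1, t), ph2, max_loopbacks=1) for t in range(3)]
print(results)  # ['voq', 'voq', 'loopback']
```

Because a VOQ never holds more than `voq_limit` cells and is served once per frame, a cell that is admitted waits at most `voq_limit` frames in phase 2, matching the bound (a × frame size).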
[0061] Once the bounds for Ph1Delay and Ph2Delay are given, the value of Ph12DB can be derived. A bounded delay, as shown below, simplifies the cell re-sequencing task. Note that cell re-sequencing is done for each source port. When a cell is received, the ph2_ppo places it into a queue, traditionally called a Virtual Input Queue (VIQ), containing cells with the same source port address. As there are KN source ports, there will be KN VIQs (e.g., 824) in the ph2_ppo. VIQs serve as cell-resequencing buffers in our design and are implemented as circular buffers (
(sequence-number % s_viq),
where % is the mod operator. Therefore, VIQ entries of the same position are for cells having the same arrival time (i.e., the same sequence number). Selecting a VIQ for transmission is position-based (see
must be satisfied, where t_current denotes the current time (in iSlots). This guarantees that if a VIQ has a cell belonging to the current position, the cell must have already arrived when the VIQ is selected. If no VIQs can satisfy the condition set by (5), the ph2_ppo temporarily suspends its VIQ selection. The selection process resumes when t_current gets updated in the next iSlot. As shown above, the delay bound Ph12DB simplifies the selection task to just checking whether (5) is satisfied.
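A single VIQ operated as a circular re-sequencing buffer can be sketched as below. Conditions (4) and (5) are not restated here, so the sketch adopts an assumed form of them: the buffer size must exceed the delay bound, and the position for arrival time t becomes eligible once t + Ph12DB ≤ t_current. The class name `VIQ`, the parameter `delay_bound` (standing for Ph12DB), and the choice to release several overdue cells in one call are all illustrative assumptions.

```python
class VIQ:
    """Circular re-sequencing buffer for cells from one source port."""

    def __init__(self, size, delay_bound):
        # ASSUMPTION: the buffer must be larger than the delay bound so that
        # a position is emptied before a later cell maps onto it.
        assert size > delay_bound
        self.size = size
        self.delay_bound = delay_bound
        self.slots = [None] * size
        self.next_seq = 0  # arrival time whose position is served next

    def put(self, seq, payload):
        # A cell is stored at position (sequence-number % size).
        self.slots[seq % self.size] = (seq, payload)

    def poll(self, t_current):
        """Release, in arrival order, every cell whose bound has expired."""
        out = []
        # ASSUMED eligibility test: next_seq + delay_bound <= t_current.
        while self.next_seq + self.delay_bound <= t_current:
            pos = self.next_seq % self.size
            entry = self.slots[pos]
            if entry is not None and entry[0] == self.next_seq:
                out.append(entry[1])
                self.slots[pos] = None
            self.next_seq += 1  # an empty position is simply skipped
        return out


viq = VIQ(size=8, delay_bound=3)
viq.put(1, "b")   # cells arrive out of order
viq.put(0, "a")
viq.put(2, "c")
print(viq.poll(2))  # []
print(viq.poll(3))  # ['a']
print(viq.poll(5))  # ['b', 'c']
```

Because every cell is guaranteed to have arrived within the delay bound, the buffer never needs to search or sort: it only checks whether the position for the oldest outstanding arrival time has become eligible.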
[0062] The above cell resequencing scheme is simple and fast to implement. However, equipping each VIQ with s_viq cells is wasteful because most slots in a VIQ are usually empty. In practice, we use a separate memory unit called VIQ Storage to store the bodies of cells. The VIQ Storage is a linked list shared by all VIQs, while each element in a VIQ contains only a pointer (several bytes) pointing to the cell location in the VIQ Storage. With this implementation, memory efficiency is no longer a concern.
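One way to realize such a shared store is a fixed pool of cell locations threaded by a free list, with each VIQ entry holding only the integer location returned by the pool. The sketch below is an illustration under that assumption; the class `VIQStorage` and its methods `store` and `release` are hypothetical names, and the patent does not specify how the linked list is organized.

```python
class VIQStorage:
    """Shared pool of cell bodies, managed as a linked free list."""

    def __init__(self, capacity):
        self.bodies = [None] * capacity
        # next_free[i] links free location i to the next free one; -1 ends it.
        self.next_free = list(range(1, capacity)) + [-1]
        self.head = 0 if capacity else -1

    def store(self, body):
        """Place a cell body in a free location and return its pointer."""
        if self.head == -1:
            raise MemoryError("VIQ Storage is full")
        loc = self.head
        self.head = self.next_free[loc]
        self.bodies[loc] = body
        return loc

    def release(self, loc):
        """Return the body at loc and put the location back on the free list."""
        body = self.bodies[loc]
        self.bodies[loc] = None
        self.next_free[loc] = self.head
        self.head = loc
        return body


storage = VIQStorage(capacity=2)
ptr = storage.store(b"cell-body")   # a VIQ entry would keep only `ptr`
assert storage.release(ptr) == b"cell-body"
```

The pool is sized for the number of cells that can be in flight at once rather than for KN full-length VIQs, so each VIQ position costs only a few bytes of pointer storage.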
[0063] Finally, it should be noted that various VIQ-based schemes have been proposed to maintain packet sequence in a buffered network in which cells can travel through different paths to reach their destination outputs. One example is given in Park, et al, Maintain packet sequence using cell flow control, U.S. Pat. No. 7,688,816, 2010 which uses VIQs and flow control to maintain cell sequence in a buffered network. However, these methods do not apply to our two-phase switch, which uses two un-buffered switch fabrics in parallel.