Data center network with multiplexed communication of data packets across servers
11469922 · 2022-10-11
Assignee
Inventors
- Deepak Goel (San Jose, CA)
- Pradeep Sindhu (Los Altos Hills, CA)
- Srihari Raju Vegesna (San Jose, CA, US)
- Robert William Bowdidge (San Jose, CA, US)
- Ayaskant Pani (Fremont, CA, US)
CPC classification
H04L12/4633
ELECTRICITY
H04J14/0212
ELECTRICITY
International classification
H04L45/00
ELECTRICITY
Abstract
A network system for a data center is described in which a switch fabric provides interconnectivity such that any servers may communicate packet data to any other of the servers using any of a number of parallel data paths. Moreover, according to the techniques described herein, edge-positioned access nodes, permutation devices and core switches of the switch fabric may be configured and arranged in a way such that the parallel data paths provide single L2/L3 hop, full mesh interconnections between any pairwise combination of the access nodes, even in massive data centers having tens of thousands of servers. The access nodes may be arranged within access node groups, and permutation devices may be used within the access node groups to spray packets across the access node groups prior to injection within the switch fabric, thereby increasing the fanout and scalability of the network system.
Claims
1. A system comprising: a plurality of servers; a plurality of access nodes, each of the access nodes coupled to a subset of the servers to communicate data packets between the servers; and an electrical permutation device coupled to a subset of the access nodes and configured to communicate the data packets to other access nodes within the plurality of access nodes, wherein the electrical permutation device comprises a set of input ports and a set of output ports to communicate the data packets between the subset of the access nodes, wherein each of the input ports receives data packets of a plurality of packet flows that each have a unique source address for the packet flows received on the same input port, and wherein the electrical permutation device is configured to permute, based on the input ports, the plurality of packet flows received on each of the input ports across the output ports of the electrical permutation device to provide connectivity between the input ports and each of the output ports such that each output port receives a different unique permutation of the input ports and the respective source addresses of the packet flows.
2. The system of claim 1, further comprising: a plurality of electrical permutation devices including the electrical permutation device, wherein each of the plurality of electrical permutation devices is connected to a different subset of the access nodes.
3. The system of claim 2, wherein the plurality of access nodes and the plurality of electrical permutation devices are configured to provide full mesh connectivity between any pairwise combination of the servers.
4. The system of claim 2, wherein the plurality of access nodes and the plurality of electrical permutation devices are configured to connect any pairwise combination of the access nodes by at most a single layer three (L3) hop.
5. The system of claim 4, wherein the plurality of access nodes and the plurality of electrical permutation devices are configured to provide a plurality of parallel data paths between the access nodes.
6. The system of claim 5, wherein the plurality of servers includes a source server and a destination server, wherein the plurality of access nodes includes a source access node coupled to the source server and a destination access node coupled to the destination server, and wherein the source access node, when communicating a packet flow of packets between the source server and the destination server, sprays the packets of the packet flow across at least a subset of the access nodes via a plurality of data paths to the destination access node, and wherein the destination access node reorders the packets into an original sequence of the packet flow and delivers the reordered packets to the destination server.
7. The system of claim 6, wherein the source access node sprays the packets of the packet flow across the plurality of parallel data paths by directing each of the packets to a randomly or round-robin selected one of the access nodes.
8. The system of claim 6, wherein the source access node sprays the packets of the packet flow across the plurality of parallel data paths by directing each of the packets to one of the access nodes based on bandwidth.
9. The system of claim 1, wherein the electrical permutation device comprises: a shared packet buffer configured to buffer the data packets of the packet flows received on the input ports; a set of egress queues configured to store descriptors of the data packets within the shared packet buffer for transmission on the output ports; and a packet writer configured to direct the different unique permutations of the packet flows to the output ports so that each output port receives a different one of the unique permutations of combinations of input ports on which the packet flows were received and the respective source addresses of the packet flows.
10. The system of claim 9, wherein each of the packets comprises an Ethernet packet, wherein the source address comprises a source Media Access Control (MAC) address for the respective packets, and wherein the packet writer is configured to direct the unique permutations of the packets to the output ports based on the permuted combinations of input ports and a set of low order bits of each MAC address for the respective packet flows.
11. The system of claim 1, wherein each of the plurality of access nodes comprises: a source component operable to receive traffic from one or more of the plurality of servers; a source switching component operable to switch source traffic to other source switching components of different access nodes; a destination switching component operable to switch inbound traffic received from other source switching components; and a destination component operable to reorder packet flows received via the destination switching component and provide the packet flows to a destination server coupled to the access node.
12. The system of claim 1, wherein one or more of the access nodes comprise storage devices configured to provide network accessible storage for use by applications executing on the servers.
13. The system of claim 1, wherein the subset of access nodes is a first subset of access nodes, the system further comprising: a switch fabric comprising a plurality of core switches; and a first optical permutation device optically coupling the first subset of access nodes to the core switches by optical links to communicate data packets between the first subset of access nodes and the core switches as optical signals, wherein the first optical permutation device comprises a set of input optical ports and a set of output optical ports to direct optical signals between the first subset of access nodes and the core switches, and wherein the first optical permutation device is configured such that optical communications received from the input optical ports are permuted across the output optical ports based on wavelength to provide optical connectivity between the input optical ports and each of the output optical ports.
14. A method comprising: interconnecting a plurality of servers by an intermediate network comprising: a plurality of access nodes, each of the access nodes coupled to a subset of the servers to communicate data packets between the servers, an electrical permutation device coupled to a subset of the access nodes and configured to communicate the data packets to other access nodes within the plurality of access nodes, wherein the electrical permutation device comprises a set of input ports and a set of output ports to communicate the data packets between the subset of the access nodes, wherein each of the input ports receives data packets of a plurality of packet flows that each have a unique source address for the packet flows received on the same input port, and wherein the electrical permutation device is configured to permute, based on the input ports, the plurality of packet flows received on each of the input ports across the output ports of the electrical permutation device to provide connectivity between the input ports and each of the output ports such that each output port receives a different unique permutation of the input ports and the respective source addresses of the packet flows; and communicating a packet flow between the servers across the intermediate network.
15. The method of claim 14, wherein the intermediate network further comprises a plurality of electrical permutation devices including the electrical permutation device, wherein each of the plurality of electrical permutation devices is connected to a different subset of the access nodes.
16. The method of claim 15, wherein the plurality of access nodes and the plurality of electrical permutation devices are configured to provide full mesh connectivity between any pairwise combination of the servers.
17. The method of claim 15, wherein the plurality of access nodes and the plurality of electrical permutation devices are configured to connect any pairwise combination of the access nodes by at most a single layer three (L3) hop.
18. The method of claim 15, wherein the plurality of access nodes and the plurality of electrical permutation devices are configured to provide a plurality of parallel data paths between the access nodes.
19. The method of claim 14, wherein one or more of the access nodes comprise storage devices configured to provide network accessible storage for use by applications executing on the servers.
20. A system comprising: a switch fabric comprising a plurality of core switches; a plurality of servers; a plurality of access nodes, each of the access nodes coupled to a subset of the servers to communicate data packets between the servers; an electrical permutation device coupled to a subset of the access nodes and configured to communicate the data packets to other access nodes within the plurality of access nodes, wherein the electrical permutation device comprises a set of input ports and a set of output ports to communicate the data packets between the subset of the access nodes, wherein each of the input ports receives data packets of a plurality of packet flows that each have a unique source address for the packet flows received on the same input port, and wherein the electrical permutation device is configured to permute, based on the input ports, the plurality of packet flows received on each of the input ports across the output ports of the electrical permutation device to provide connectivity between the input ports and each of the output ports such that each output port receives a different unique permutation of the input ports and the respective source addresses of the packet flows; and an optical permutation device optically coupling the subset of access nodes to the core switches by optical links to communicate the data packets between the access nodes and the core switches as optical signals, wherein each of the optical permutation devices comprises a set of input optical ports and a set of output optical ports to direct optical signals between the access nodes and the core switches to communicate the data packets.
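The permutation recited in claims 1, 9, and 10, in which each output port receives a distinct combination of input port and low-order source-address bits, can be sketched as follows. The mapping formula is illustrative only, as the claims do not specify one; the function name is hypothetical.

```python
def select_output_port(input_port: int, src_mac: bytes, num_ports: int) -> int:
    """Illustrative port permutation: combine the input port with the
    low-order byte of the source MAC address (as claim 10 suggests) so
    that flows arriving on the same input port with distinct source
    addresses are spread across distinct output ports."""
    low_bits = src_mac[-1]  # low-order byte of the source MAC
    return (input_port + low_bits) % num_ports

# For a 4-port device, the four flows on each input port land on four
# different output ports, and each input port yields a different
# rotation of the outputs:
for in_port in range(4):
    outs = [select_output_port(in_port, bytes([0, 0, 0, 0, 0, m]), 4)
            for m in range(4)]
    assert sorted(outs) == [0, 1, 2, 3]
```

With this mapping, for a fixed source address, distinct input ports also map to distinct output ports, so each output port sees a different unique pairing of input port and source address, as the claims require.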
Description
BRIEF DESCRIPTION OF DRAWINGS
DETAILED DESCRIPTION
(27) In some examples, data center 10 may represent one of many geographically distributed network data centers.
(28) In this example, data center 10 includes a set of storage systems and application servers 12 interconnected via a high-speed switch fabric 14. In some examples, servers 12 are arranged into multiple different server groups, each including any number of servers up to, for example, n servers 12.sub.1-12.sub.n. Servers 12 provide computation and storage facilities for applications and data associated with customers 11 and may be physical (bare-metal) servers, virtual machines running on physical servers, virtualized containers running on physical servers, or combinations thereof.
(30) Although not shown, data center 10 may also include, for example, one or more non-edge switches, routers, hubs, gateways, security devices such as firewalls, intrusion detection, and/or intrusion prevention devices, servers, computer terminals, laptops, printers, databases, wireless mobile devices such as cellular phones or personal digital assistants, wireless access points, bridges, cable modems, application accelerators, or other network devices.
(32) In example implementations, access nodes 17 are configurable to operate in a standalone network appliance having one or more access nodes. For example, access nodes 17 may be arranged into multiple different access node groups 19, each including any number of access nodes up to, for example, x access nodes 17.sub.1-17.sub.x. As such, multiple access nodes 17 may be grouped (e.g., within a single electronic device or network appliance), referred to herein as an access node group 19, for providing services to a group of servers supported by the set of access nodes internal to the device. In one example, an access node group 19 may comprise four access nodes 17, each supporting four servers so as to support a group of sixteen servers.
(34) As one example, each access node group 19 of multiple access nodes 17 may be configured as a standalone network device, and may be implemented as a two rack unit (2 RU) device that occupies two rack units (e.g., slots) of an equipment rack. In another example, access node 17 may be integrated within a server, such as a single 1 RU server in which four CPUs are coupled to the forwarding ASICs described herein on a mother board deployed within a common computing device. In yet another example, one or more of access nodes 17 and servers 12 may be integrated in a suitable size (e.g., 10 RU) frame that may, in such an example, become a network storage compute unit (NSCU) for data center 10. For example, an access node 17 may be integrated within a mother board of a server 12 or otherwise co-located with a server in a single chassis.
(35) According to the techniques herein, example implementations are described in which access nodes 17 interface and utilize switch fabric 14 so as to provide full mesh (any-to-any) interconnectivity such that any of servers 12 may communicate packet data for a given packet flow to any other of the servers using any of a number of parallel data paths within the data center 10. For example, example network architectures and techniques are described in which access nodes, in example implementations, spray individual packets for packet flows between the access nodes and across some or all of the multiple parallel data paths in the data center switch fabric 14 and reorder the packets for delivery to the destinations so as to provide full mesh connectivity.
(37) As described herein, the techniques of this disclosure introduce a new data transmission protocol referred to as a Fabric Control Protocol (FCP) that may be used by the different operational networking components of any of access nodes 17 to facilitate communication of data across switch fabric 14. As further described, FCP is an end-to-end admission control protocol in which, in one example, a sender explicitly requests a receiver with the intention to transfer a certain number of bytes of payload data. In response, the receiver issues a grant based on its buffer resources, QoS, and/or a measure of fabric congestion. In general, FCP enables spray of packets of a flow to all paths between a source and a destination node, and may provide any of the advantages and techniques described herein, including resilience against request/grant packet loss, adaptive and low latency fabric implementations, fault recovery, reduced or minimal protocol overhead cost, support for unsolicited packet transfer, support for FCP capable/incapable nodes to coexist, flow-aware fair bandwidth distribution, transmit buffer management through adaptive request window scaling, receive buffer occupancy based grant management, improved end to end QoS, security through encryption and end to end authentication and/or improved ECN marking support. More details on the FCP are available in U.S. Provisional Patent Application No. 62/566,060, filed Sep. 29, 2017, entitled “Fabric Control Protocol for Data Center Networks with Packet Spraying Over Multiple Alternate Data Paths,” the entire content of which is incorporated herein by reference.
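The request/grant exchange described above, in which the receiver issues grants against its buffer resources, can be sketched as follows; the class and method names are hypothetical and not drawn from the referenced provisional application.

```python
class FcpReceiver:
    """Sketch of the receiver side of an FCP-style request/grant
    exchange: grants are issued against remaining receive-buffer
    space, so a sender never transfers more than the receiver can
    absorb."""

    def __init__(self, buffer_bytes: int):
        self.free_bytes = buffer_bytes

    def handle_request(self, requested_bytes: int) -> int:
        """Grant up to the requested payload size, limited by the
        buffer space still available."""
        granted = min(requested_bytes, self.free_bytes)
        self.free_bytes -= granted
        return granted

    def release(self, consumed_bytes: int) -> None:
        """Return buffer space once delivered payload is consumed."""
        self.free_bytes += consumed_bytes

# A sender asking to transfer 3000 bytes twice against a 4096-byte
# buffer is granted 3000 bytes, then only the remaining 1096:
rx = FcpReceiver(buffer_bytes=4096)
assert rx.handle_request(3000) == 3000
assert rx.handle_request(3000) == 1096
```

A production protocol would also cover the resilience features listed above (request/grant loss recovery, unsolicited transfer, window scaling); the sketch shows only the core admission-control idea.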
(38) The techniques may provide certain advantages. For example, the techniques may increase significantly the bandwidth utilization of the underlying switch fabric 14. Moreover, in example implementations, the techniques may provide full mesh interconnectivity between the servers of the data center and may nevertheless be non-blocking and drop-free.
(41) In some example implementations, each access node 17 may, therefore, have multiple parallel data paths for reaching any given other access node 17 and the servers 12 reachable through those access nodes. In some examples, rather than being limited to sending all of the packets of a given flow along a single path in the switch fabric, switch fabric 14 may be configured such that access nodes 17 may, for any given packet flow between servers 12, spray the packets of the packet flow across all or a subset of the M parallel data paths of switch fabric 14 by which a given destination access node 17 for a destination server 12 can be reached.
(42) According to the disclosed techniques, access nodes 17 may spray the packets of individual packet flows across the M paths end-to-end, forming a virtual tunnel between a source access node and a destination access node. In this way, the number of layers included in switch fabric 14, or the number of hops along the M parallel data paths, may not matter for implementation of the packet spraying techniques described in this disclosure.
(43) The technique of spraying packets of individual packet flows across all or a subset of the M parallel data paths of switch fabric 14, however, enables the number of layers of network devices within switch fabric 14 to be reduced, e.g., to a bare minimum of one. Further, it enables fabric architectures in which the switches are not connected to each other, reducing the likelihood of failure dependence between two switches and thereby increasing the reliability of the switch fabric. Flattening switch fabric 14 may reduce cost by eliminating layers of network devices that require power, and reduce latency by eliminating layers of network devices that perform packet switching. In one example, the flattened topology of switch fabric 14 may result in a core layer that includes only one level of spine switches, e.g., core switches 22, that may not communicate directly with one another but form a single hop along the M parallel data paths. In this example, any access node 17 sourcing traffic into switch fabric 14 may reach any other access node 17 by a single, one-hop L3 lookup by one of core switches 22.
(44) An access node 17 sourcing a packet flow for a source server 12 may use any technique for spraying the packets across the available parallel data paths, such as available bandwidth, random, round-robin, hash-based, or another mechanism designed to maximize, for example, utilization of bandwidth or otherwise avoid congestion. In some example implementations, flow-based load balancing need not necessarily be utilized, and more effective bandwidth utilization may be achieved by allowing packets of a given packet flow (five tuple) sourced by a server 12 to traverse different paths of switch fabric 14 between access nodes 17 coupled to the source and destination servers. The respective destination access node 17 associated with the destination server 12 may be configured to reorder the variable length IP packets of the packet flows and deliver the packets to the destination server in the sequence in which they were sent.
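The spraying mechanisms named above (random, round-robin) and the destination-side reordering can be sketched as follows; the helper names and the sequence-number reorder buffer are illustrative, not the patent's implementation.

```python
import itertools
import random

def make_round_robin_sprayer(paths):
    """Round-robin spraying: successive packets of a flow take
    successive parallel paths."""
    cycle = itertools.cycle(paths)
    return lambda packet: next(cycle)

def random_spray(paths, packet):
    """Random spraying across the available parallel paths."""
    return random.choice(paths)

class ReorderBuffer:
    """Destination-side reordering: packets sprayed over different
    paths may arrive out of order and are released to the server in
    their original sequence."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def receive(self, seq, packet):
        """Buffer an arriving packet and release any in-order run."""
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released

# Packet 1 arrives before packet 0; both are released in order once
# packet 0 shows up:
rb = ReorderBuffer()
assert rb.receive(1, "pkt1") == []
assert rb.receive(0, "pkt0") == ["pkt0", "pkt1"]
```

A bandwidth-aware sprayer would instead weight path selection by measured utilization, per the available-bandwidth variant described above.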
(45) In some example implementations, each access node 17 implements at least four different operational networking components or functions: (1) a source component operable to receive traffic from server 12, (2) a source switching component operable to switch source traffic to other source switching components of different access nodes 17 (possibly of different access node groups) or to core switches 22, (3) a destination switching component operable to switch inbound traffic received from other source switching components or from core switches 22, and (4) a destination component operable to reorder packet flows and provide the packet flows to destination servers 12.
(46) In this example, servers 12 are connected to source components of the access nodes 17 to inject traffic into the switch fabric 14, and servers 12 are similarly coupled to the destination components within the access nodes 17 to receive traffic therefrom. Because of the full-mesh, parallel data paths provided by switch fabric 14, each source switching component and destination switching component within a given access node 17 need not perform L2/L3 switching. Instead, access nodes 17 may apply spraying algorithms to spray packets of a packet flow, e.g., based on available bandwidth, randomly, round-robin, or based on QoS/scheduling, to efficiently forward packets without, in some examples, requiring packet analysis and lookup operations.
(47) Destination switching components of access nodes 17 may provide a limited lookup necessary only to select the proper output port for forwarding packets to local servers 12. As such, with respect to full routing tables for the data center, only core switches 22 may need to perform full lookup operations. Thus, switch fabric 14 provides a highly-scalable, flat, high-speed interconnect in which servers 12 are, in some embodiments, effectively one L2/L3 hop from any other server 12 within the data center.
(48) Access nodes 17 may need to connect to a fair number of core switches 22 in order to communicate packet data to any other of access nodes 17 and the servers 12 accessible through those access nodes. In some cases, to provide a link multiplier effect, access nodes 17 may connect to core switches 22 via top of rack (TOR) Ethernet switches, electrical permutation devices, or optical permutation (OP) devices (not shown).
(49) Flow-based routing and switching over Equal Cost Multi-Path (ECMP) paths through a network may be susceptible to highly variable load-dependent latency. For example, the network may include many small bandwidth flows and a few large bandwidth flows. In the case of routing and switching over ECMP paths, the source access node may select the same path for two of the large bandwidth flows, leading to large latencies over that path. In order to avoid this issue and keep latency low across the network, an administrator may be forced to keep the utilization of the network below 25-30%, for example. The techniques described in this disclosure of configuring access nodes 17 to spray packets of individual packet flows across all available paths enable higher network utilization, e.g., 85-90%, while maintaining bounded or limited latencies. The packet spraying techniques enable a source access node 17 to fairly distribute packets of a given flow across all the available paths while taking link failures into account. In this way, regardless of the bandwidth size of the given flow, the load can be fairly spread across the available paths through the network to avoid over utilization of a particular path. The disclosed techniques enable the same number of networking devices to pass three times the amount of data traffic through the network while maintaining low latency characteristics and reducing the number of layers of network devices that consume energy.
(51) As described, each access node group 19 may be configured as a standalone network device, and may be implemented as a device configured for installation within a compute rack, a storage rack or a converged rack. In general, each access node group 19 may be configured to operate as a high-performance I/O hub designed to aggregate and process network and/or storage I/O for multiple servers 12. As described above, the set of access nodes 17 within each of the access node groups 19 provide highly-programmable, specialized I/O processing circuits for handling networking and communications operations on behalf of servers 12. In addition, in some examples, each of access node groups 19 may include storage devices 27, such as high-speed solid-state hard drives, configured to provide network accessible storage for use by applications executing on the servers. Each access node group 19 including its set of access nodes 17, storage devices 27, and the set of servers 12 supported by the access nodes 17 of that access node group may be referred to herein as a network storage compute unit (NSCU) 40.
(54) In one example implementation, access nodes 17 within access node group 19 connect to servers 52 and solid state storage 41 using Peripheral Component Interconnect express (PCIe) links 48, 50, and connect to other access nodes and the datacenter switch fabric 14 using Ethernet links 42, 44, 46. For example, each of access nodes 17 may support six high-speed Ethernet connections, including two externally-available Ethernet connections 42 for communicating with the switch fabric, one externally-available Ethernet connection 44 for communicating with other access nodes in other access node groups, and three internal Ethernet connections 46 for communicating with other access nodes 17 in the same access node group 19. In one example, each of externally-available connections 42 may be a 100 Gigabit Ethernet (GE) connection. In this example, access node group 19 has 8×100 GE externally-available ports to connect to the switch fabric 14.
(55) Within access node group 19, connections 42 may be copper, i.e., electrical, links arranged as 8×25 GE links between each of access nodes 17 and optical ports of access node group 19. Between access node group 19 and the switch fabric, connections 42 may be optical Ethernet connections coupled to the optical ports of access node group 19. The optical Ethernet connections may connect to one or more optical devices within the switch fabric, e.g., optical permutation devices described in more detail below. The optical Ethernet connections may support more bandwidth than electrical connections without increasing the number of cables in the switch fabric. For example, each optical cable coupled to access node group 19 may carry 4×100 GE optical fibers with each fiber carrying optical signals at four different wavelengths or lambdas. In other examples, the externally-available connections 42 may remain as electrical Ethernet connections to the switch fabric.
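As a rough check on the figures above, assuming each wavelength carries 25 GE (an assumption; the text states only that each fiber carries four different wavelengths), a cable of 4×100 GE fibers delivers 400 GE, so the group's 8×100 GE of external ports would fit on two such cables:

```python
fibers_per_cable = 4     # each optical cable carries 4x100 GE fibers
lambdas_per_fiber = 4    # four wavelengths (lambdas) per fiber
ge_per_lambda = 25       # assumption: 25 GE per wavelength

ge_per_fiber = lambdas_per_fiber * ge_per_lambda   # 100 GE per fiber
ge_per_cable = fibers_per_cable * ge_per_fiber     # 400 GE per cable

# The 8x100 GE externally-available ports of an access node group
# would then fit on two such optical cables:
group_external_ge = 8 * 100
assert group_external_ge // ge_per_cable == 2
```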
(56) The four remaining Ethernet connections supported by each of access nodes 17 include one Ethernet connection 44 for communication with other access nodes within other access node groups, and three Ethernet connections 46 for communication with the other three access nodes within the same access node group 19. In some examples, connections 44 may be referred to as “inter-access node group links” and connections 46 may be referred to as “intra-access node group links.”
(57) Ethernet connections 44, 46 provide full-mesh connectivity between access nodes within a given structural unit. In one example, such a structural unit may be referred to herein as a logical rack (e.g., a half-rack or a half physical rack) that includes two NSCUs 40 having two AGNs 19 and supports an 8-way mesh of eight access nodes 17 for those AGNs. In this particular example, connections 46 would provide full-mesh connectivity between the four access nodes 17 within the same access node group 19, and connections 44 would provide full-mesh connectivity between each of access nodes 17 and four other access nodes within one other access node group of the logical rack (i.e., structural unit). In addition, access node group 19 may have enough, e.g., sixteen, externally-available Ethernet ports to connect to the four access nodes in the other access node group.
(58) In the case of an 8-way mesh of access nodes, i.e., a logical rack of two NSCUs 40, each of access nodes 17 may be connected to each of the other seven access nodes by a 50 GE connection. For example, each of connections 46 between the four access nodes 17 within the same access node group 19 may be a 50 GE connection arranged as 2×25 GE links. Each of connections 44 between the four access nodes 17 and the four access nodes in the other access node group may include four 50 GE links. In some examples, each of the four 50 GE links may be arranged as 2×25 GE links such that each of connections 44 includes 8×25 GE links to the other access nodes in the other access node group. This example is described in more detail below.
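The 8-way mesh arithmetic above can be verified directly; the variable names are illustrative:

```python
nodes_per_group = 4
groups_per_logical_rack = 2
ge_per_peer = 50  # each pair of access nodes linked at 50 GE

# Each access node meshes with every other node in the logical rack:
peers_per_node = nodes_per_group * groups_per_logical_rack - 1  # 7 peers
intra_group_peers = nodes_per_group - 1  # via connections 46
inter_group_peers = nodes_per_group      # via connections 44

assert intra_group_peers + inter_group_peers == peers_per_node
mesh_ge_per_node = peers_per_node * ge_per_peer  # 350 GE of mesh bandwidth
```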
(59) In another example, Ethernet connections 44, 46 provide full-mesh connectivity between access nodes within a given structural unit that is a full-rack or a full physical rack that includes four NSCUs 40 having four AGNs 19 and supports a 16-way mesh of access nodes 17 for those AGNs. In this example, connections 46 provide full-mesh connectivity between the four access nodes 17 within the same access node group 19, and connections 44 provide full-mesh connectivity between each of access nodes 17 and twelve other access nodes within three other access node groups. In addition, access node group 19 may have enough, e.g., forty-eight, externally-available Ethernet ports to connect to the twelve access nodes in the three other access node groups.
(60) In the case of a 16-way mesh of access nodes, each of access nodes 17 may be connected to each of the other fifteen access nodes by a 25 GE connection, for example. In other words, in this example, each of connections 46 between the four access nodes 17 within the same access node group 19 may be a single 25 GE link. Each of connections 44 between the four access nodes 17 and the twelve other access nodes in the three other access node groups may include 12×25 GE links.
(62) In one example, solid state storage 41 may include twenty-four SSD devices with six SSD devices for each of access nodes 17. The twenty-four SSD devices may be arranged in four rows of six SSD devices with each row of SSD devices being connected to one of access nodes 17. Each of the SSD devices may provide up to 16 Terabytes (TB) of storage for a total of 384 TB per access node group 19. As described in more detail below, in some cases, a physical rack may include four access node groups 19 and their supported servers 52. In that case, a typical physical rack may support approximately 1.5 Petabytes (PB) of local solid state storage. In another example, solid state storage 41 may include up to 32 U.2×4 SSD devices. In other examples, NSCU 40 may support other SSD devices, e.g., 2.5″ Serial ATA (SATA) SSDs, mini-SATA (mSATA) SSDs, M.2 SSDs, and the like.
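The storage figures above can be checked with simple arithmetic:

```python
ssds_per_group = 24   # six SSD devices for each of four access nodes
tb_per_ssd = 16       # up to 16 TB per SSD device
groups_per_rack = 4   # four access node groups per physical rack

tb_per_group = ssds_per_group * tb_per_ssd     # 384 TB per access node group
tb_per_rack = tb_per_group * groups_per_rack   # 1536 TB, i.e. ~1.5 PB per rack
```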
(63) In the above described example in which each of the access nodes 17 is included on an individual access node sled with local storage for the access node, each of the access node sleds may include four SSD devices and some additional storage that may be hard drive or solid state drive devices. In this example, the four SSD devices and the additional storage may provide approximately the same amount of storage per access node as the six SSD devices described in the previous example.
(64) In one example, each of access nodes 17 supports a total of 96 PCIe lanes. In this example, each of connections 48 may be an 8×4-lane PCIe Gen 3.0 connection via which each of access nodes 17 may communicate with up to eight SSD devices within solid state storage 41. In addition, each of connections 50 between a given access node 17 and the four server nodes 12 within the server 52 supported by the access node 17 may be a 4×16-lane PCIe Gen 3.0 connection. In this example, access node group 19 has a total of 256 external facing PCIe links that interface with servers 52. In some scenarios, access nodes 17 may support redundant server connectivity such that each of access nodes 17 connects to eight server nodes 12 within two different servers 52 using an 8×8-lane PCIe Gen 3.0 connection.
(65) In another example, each of access nodes 17 supports a total of 64 PCIe lanes. In this example, each of connections 48 may be an 8×4-lane PCIe Gen 3.0 connection via which each of access nodes 17 may communicate with up to eight SSD devices within solid state storage 41. In addition, each of connections 50 between a given access node 17 and the four server nodes 12 within the server 52 supported by the access node 17 may be a 4×8-lane PCIe Gen 4.0 connection. In this example, access node group 19 has a total of 128 external facing PCIe links that interface with servers 52.
(66)
(67) Each of access node groups 19 connects to servers 52 using PCIe links 50, and to switch fabric 14 using Ethernet links 42. Access node groups 19.sub.1 and 19.sub.2 may each include four access nodes connected to each other using Ethernet links and local solid state storage connected to the access nodes using PCIe links as described above with respect to
(68) In addition, each of access node groups 19 supports PCIe connections 50 to servers 52. In one example, each of connections 50 may be a 4×16-lane PCIe Gen 3.0 connection such that access node group 19 has a total of 256 externally-available PCIe links that interface with servers 52. In another example, each of connections 50 may be a 4×8-lane PCIe Gen 4.0 connection for communication between access nodes within access node group 19 and server nodes within servers 52. In either example, connections 50 may provide a raw throughput of 512 Gigabits per access node 17 or approximately 128 Gigabits of bandwidth per server node without accounting for any overhead bandwidth costs.
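The throughput figures in this example can be checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 8 Gbps of raw bandwidth per PCIe Gen 3.0 lane (8 GT/s with 128b/130b encoding); the variable names are illustrative:

```python
# Rough bandwidth check for the 4x16-lane PCIe Gen 3.0 example above.
# Assumes ~8 Gbps of raw bandwidth per Gen 3.0 lane (an approximation).
GBPS_PER_GEN3_LANE = 8

lanes_per_connection = 16
connections_per_access_node = 4

per_connection = lanes_per_connection * GBPS_PER_GEN3_LANE       # 128 Gbps
per_access_node = per_connection * connections_per_access_node   # 512 Gbps

# Each access node serves four server nodes, so per server node:
per_server_node = per_access_node // 4                           # 128 Gbps

print(per_access_node, per_server_node)  # 512 128
```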
(69) As discussed above with respect to
(70)
(71) In the illustrated configuration of an 8-way mesh interconnecting two access node groups 19, each access node 17 connects via full mesh connectivity to each of the other seven access nodes in the cluster. The mesh topology between access nodes 17 includes intra-access node group links 46 between the four access nodes included in the same access node group 19, and inter-access node group links 44 between access nodes 17.sub.1-17.sub.4 in access node group 19.sub.1 and access nodes 17.sub.5-17.sub.8 in access node group 19.sub.2. Although illustrated as a single connection between each of access nodes 17, each of connections 44, 46 is bidirectional such that each access node connects to each other access node in the cluster via a separate link.
(72) Each of access nodes 17.sub.1-17.sub.4 within first access node group 19.sub.1 has three intra-access node group connections 46 to the other access nodes in first access node group 19.sub.1. As illustrated in first access node group 19.sub.1, access node 17.sub.1 supports connection 46A to access node 17.sub.4, connection 46B to access node 17.sub.3, and connection 46C to access node 17.sub.2. Access node 17.sub.2 supports connection 46C to access node 17.sub.1, connection 46D to access node 17.sub.4, and connection 46E to access node 17.sub.3. Access node 17.sub.3 supports connection 46B to access node 17.sub.1, connection 46E to access node 17.sub.2, and connection 46F to access node 17.sub.4. Access node 17.sub.4 supports connection 46A to access node 17.sub.1, connection 46D to access node 17.sub.2, and connection 46F to access node 17.sub.3. The access nodes 17.sub.5-17.sub.8 are similarly connected within second access node group 19.sub.2.
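The intra-group wiring described above can be captured as a small table. The sketch below records one self-consistent labeling of the six links 46A-46F among the four access nodes and checks the full-mesh properties stated in the text:

```python
# Four access nodes in one access node group form a full mesh:
# C(4, 2) = 6 bidirectional intra-group links, labeled 46A-46F.
nodes = ["17.1", "17.2", "17.3", "17.4"]
links = {
    "46A": ("17.1", "17.4"),
    "46B": ("17.1", "17.3"),
    "46C": ("17.1", "17.2"),
    "46D": ("17.2", "17.4"),
    "46E": ("17.2", "17.3"),
    "46F": ("17.3", "17.4"),
}

assert len(links) == 6  # a full mesh of four nodes has six links

# Every access node terminates exactly three intra-group connections 46.
for n in nodes:
    assert sum(n in pair for pair in links.values()) == 3
```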
(73) Each of access nodes 17.sub.1-17.sub.4 within first access node group 19.sub.1 also has four inter-access node group connections 44 to the access nodes 17.sub.5-17.sub.8 in second access node group 19.sub.2. As illustrated in
(74) Each of access nodes 17 may be configured to support up to 400 Gigabits of bandwidth to connect to other access nodes in the cluster. In the illustrated example, each of access nodes 17 may support up to eight 50 GE links to the other access nodes. In this example, since each of access nodes 17 only connects to seven other access nodes, 50 Gigabits of bandwidth may be leftover and used for managing the access node. In some examples, each of connections 44, 46 may be single 50 GE connections. In other examples, each of connections 44, 46 may be 2×25 GE connections. In still other examples, each of intra-access node group connections 46 may be 2×25 GE connections, and each of inter-access node group connections 44 may be single 50 GE connections to reduce a number of inter-box cables. For example, from each access node 17.sub.1-17.sub.4 within first access node group 19.sub.1, 4×50 GE links go off box to connect to access nodes 17.sub.5-17.sub.8 in second access node group 19.sub.2. In some examples, the 4×50 GE links may be taken out from each of the access nodes 17 using DAC cables.
(75)
(76) In the illustrated example, rack 70 includes four access node groups 19.sub.1-19.sub.4 that are each separate network appliances 2 RU in height. Each of the access node groups 19 includes four access nodes and may be configured as shown in the example of
(77) In this example, each of the access node groups 19 supports sixteen server nodes. For example, access node group 19.sub.1 supports server nodes A1-A16, access node group 19.sub.2 supports server nodes B1-B16, access node group 19.sub.3 supports server nodes C1-C16, and access node group 19.sub.4 supports server nodes D1-D16. A server node may be a dual-socket or dual-processor server sled that is ½ rack in width and 1 RU in height. As described with respect to
(78) Access node groups 19 and servers 52 are arranged into NSCUs 40 from
(79) NSCUs 40 may be arranged into logical racks 60, i.e., half physical racks, from
(80) Logical racks 60 within rack 70 may be connected to the switch fabric directly or through an intermediate top of rack device 72. As noted above, in one example, TOR device 72 comprises a top of rack Ethernet switch. In other examples, TOR device 72 comprises an optical permutor that transports optical signals between access nodes 17 and core switches 22 and that is configured such that optical communications are “permuted” based on wavelength so as to provide full-mesh connectivity between the upstream and downstream ports without any optical interference.
(81) In the illustrated example, each of the access node groups 19 may connect to TOR device 72 via one or more of the 8×100 GE links supported by the access node group to reach the switch fabric. In one case, the two logical racks 60 within rack 70 may each connect to one or more ports of TOR device 72, and TOR device 72 may also receive signals from one or more logical racks within neighboring physical racks. In other examples, rack 70 may not itself include TOR device 72, but instead logical racks 60 may connect to one or more TOR devices included in one or more neighboring physical racks.
(82) For a standard rack size of 40 RU it may be desirable to stay within a typical power limit, such as a 15 kilowatt (kW) power limit. In the example of rack 70, not taking the additional 2 RU TOR device 72 into consideration, it may be possible to readily stay within or near the 15 kW power limit even with the sixty-four server nodes and the four access node groups. For example, each of the access node groups 19 may use approximately 1 kW of power resulting in approximately 4 kW of power for access node groups. In addition, each of the server nodes may use approximately 200 W of power resulting in around 12.8 kW of power for servers 52. In this example, the 40 RU arrangement of access node groups 19 and servers 52, therefore, uses around 16.8 kW of power.
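The power arithmetic above can be reproduced directly (the per-component figures are the approximate values stated in the text):

```python
# Power budget sketch for the 40 RU rack described above.
ACCESS_NODE_GROUP_W = 1000   # ~1 kW per access node group
SERVER_NODE_W = 200          # ~200 W per server node

groups = 4
server_nodes = 64            # sixteen server nodes per access node group

total_w = groups * ACCESS_NODE_GROUP_W + server_nodes * SERVER_NODE_W
print(total_w / 1000)  # 16.8 (kW), slightly above the 15 kW target
```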
(83)
(84) In some examples, the different operational networking components of access node 17 may perform flow-based switching and ECMP based load balancing for Transmission Control Protocol (TCP) packet flows. Typically, however, ECMP load balances poorly as it randomly hashes the flows to paths such that a few large flows may be assigned to the same path and severely imbalance the fabric. In addition, ECMP relies on local path decisions and does not use any feedback about possible congestion or link failure downstream for any of the chosen paths.
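The ECMP behavior criticized here can be sketched as follows. The hash function and path count are illustrative; the point is that a flow's five-tuple pins it to one path regardless of load, so two large flows may collide:

```python
import zlib

NUM_PATHS = 4

def ecmp_path(five_tuple):
    """Classic ECMP: hash a flow's five-tuple to one fixed path.

    All packets of the flow take the same path (no reordering), but the
    choice ignores load and downstream congestion, so a few elephant
    flows can land on the same path and imbalance the fabric.
    """
    key = "|".join(str(f) for f in five_tuple).encode()
    return zlib.crc32(key) % NUM_PATHS

flow = ("10.0.0.1", "10.0.1.1", 6, 40000, 80)  # (src, dst, proto, sport, dport)
assert ecmp_path(flow) == ecmp_path(flow)      # same flow, same path, always
assert 0 <= ecmp_path(flow) < NUM_PATHS
```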
(85) The techniques described in this disclosure introduce a new data transmission protocol referred to as a Fabric Control Protocol (FCP) that may be used by the different operational networking components of access node 17. FCP is an end-to-end admission control protocol in which a sender explicitly requests permission from a receiver to transfer a certain number of bytes of payload data. In response, the receiver issues a grant based on its buffer resources, QoS, and/or a measure of fabric congestion.
(86) For example, the FCP includes admission control mechanisms through which a source node requests permission before transmitting a packet on the fabric to a destination node. For example, the source node sends a request message to the destination node requesting a certain number of bytes to be transferred, and the destination node sends a grant message to the source node after reserving the egress bandwidth. In addition, instead of the flow-based switching and ECMP forwarding used to send all packets of a TCP flow on the same path to avoid packet reordering, the FCP enables packets of an individual packet flow to be sprayed to all available links between a source node and a destination node. The source node assigns a packet sequence number to each packet of the flow, and the destination node uses the packet sequence numbers to put the incoming packets of the same flow in order.
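A minimal sketch of the request/grant admission control described above, assuming a receiver that grants only against free buffer space (real FCP also weighs QoS and fabric congestion); the class and field names are hypothetical:

```python
class FcpReceiver:
    """Destination node: grants transfers against its free buffer space."""

    def __init__(self, buffer_bytes):
        self.free = buffer_bytes

    def request(self, nbytes):
        """Handle a source's request message; return the granted bytes."""
        granted = min(nbytes, self.free)  # never grant more than is free
        self.free -= granted              # reserve before replying
        return granted

# The source asks before transmitting; the destination reserves first.
rx = FcpReceiver(buffer_bytes=4096)
grant = rx.request(3000)
assert grant == 3000
# A second oversized request is trimmed to the remaining buffer.
assert rx.request(3000) == 1096
```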
(87) SF component 30 of access node 17 is considered a source node of the fabric. According to the disclosed techniques, for FCP traffic, SF component 30 is configured to spray its input bandwidth (e.g., 200 Gbps) over links to multiple SX components of access nodes within a logical rack. For example, as described in more detail with respect to
(88) SX component 32 of access node 17 may receive incoming packets from multiple SF components of access nodes within the logical rack, e.g., SF component 30 and seven other SF components of other access nodes within the logical rack. For FCP traffic, SX component 32 is also configured to spray its incoming bandwidth over links to multiple core switches in the fabric. For example, as described in more detail with respect to
(89) DX component 34 of access node 17 may receive incoming packets from multiple core switches either directly or via one or more intermediate devices, e.g., TOR Ethernet switches, electrical permutation devices, or optical permutation devices. For example, DX component 34 may receive incoming packets from eight core switches, or four or eight intermediate devices. DX component 34 is configured to select a DF component to which to send the received packets. For example, DX component 34 may be connected to DF component 36 and seven other DF components of other access nodes within the logical rack. In some cases, DX component 34 may become a congestion point because DX component 34 may receive a large amount of bandwidth (e.g., 200 Gbps) that is all to be sent to the same DF component. In the case of FCP traffic, DX component 34 may avoid long term congestion using the admission control mechanisms of FCP.
(90) DF component 36 of access node 17 may receive incoming packets from multiple DX components of access nodes within the logical rack, e.g., DX component 34 and seven other DX components of other access nodes within the logical rack. DF component 36 is considered a destination node of the fabric. For FCP traffic, DF component 36 is configured to reorder packets of the same flow prior to transmitting the flow to a destination server 12.
(91) In some examples, SX component 32 and DX component 34 of access node 17 may use the same forwarding table to perform packet switching. In this example, the personality of access node 17 and the nexthop identified by the forwarding table for the same destination IP address may depend on a source port type of the received data packet. For example, if a source packet is received from an SF component, access node 17 operates as SX component 32 and determines a nexthop to forward the source packet over the fabric toward a destination node. If a packet is received from a fabric-facing port, access node 17 operates as DX component 34 and determines a final nexthop to forward the incoming packet directly to a destination node. In some examples, the received packet may include an input tag that specifies its source port type.
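The source-port-dependent forwarding described above might be sketched as follows; the table layout, port names, and addresses are illustrative, not taken from the patent:

```python
# One shared forwarding table; the nexthop chosen for the same
# destination IP depends on the source port type of the packet.
FORWARDING_TABLE = {
    # dest_ip: {source_port_type: nexthop}
    "10.1.2.3": {
        "SF": "fabric-port-5",      # acting as SX: forward into fabric
        "fabric": "server-port-2",  # acting as DX: deliver toward server
    },
}

def nexthop(dest_ip, source_port_type):
    """Look up the nexthop; personality follows the input tag."""
    return FORWARDING_TABLE[dest_ip][source_port_type]

assert nexthop("10.1.2.3", "SF") == "fabric-port-5"
assert nexthop("10.1.2.3", "fabric") == "server-port-2"
```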
(92)
(93) As shown in
(94) Thus, according to the disclosed techniques, upon receiving source traffic from one of servers 12, SF component 30A implemented by access node 17.sub.1, for example, performs an 8-way spray of packets of the same flow across all available links to SX components 32 implemented by access nodes 17 included in logical rack 60. More specifically, SF component 30A sprays across one internal SX component 32A of the same access node 17.sub.1 and seven external SX components 32B-32H of the other access nodes 17.sub.2-17.sub.8 within logical rack 60. In some implementations, this 8-way spray between SFs 30 and SXs 32 within logical rack 60 may be referred to as a first-stage spray. As described in other portions of this disclosure, a second-stage spray may be performed over a second-level network fanout within the switch fabric between access nodes 17 and core switches 22. For example, the second-stage spray may be performed through an intermediate device, such as a TOR Ethernet switch, an electric permutation device, or an optical permutation device, described in more detail below with respect to
(95) In some examples, as described in more detail above, the first four access nodes 17.sub.1-17.sub.4 may be included in a first access node group 19.sub.1 and the second four access nodes 17.sub.5-17.sub.8 may be included in a second access node group 19.sub.2. The access nodes 17 within the first and second access node groups 19 may be connected to each other via a full-mesh in order to allow the 8-way spray between SFs 30 and SXs 32 within logical rack 60. In some examples, logical rack 60 including the two access node groups together with their supported servers 12 may be referred to as a half-rack or a half physical rack. In other examples, more or fewer access nodes may be connected together using full-mesh connectivity. In one example, sixteen access nodes 17 may be connected together in a full-mesh to enable a first-stage 16-way spray within a full physical rack.
(96)
(97) According to the disclosed techniques, the switch fabric comprises a FCP fabric. The FCP fabric may be visualized as including multiple channels, e.g., a request channel, a grant channel, a FCP data channel and a non-FCP data channel. As illustrated in
(98) The request channel within the FCP fabric may be used to carry FCP request messages from the source node to the destination node. Similar to the FCP data packets, the FCP request messages may be sprayed over all available paths toward the destination node, but the request messages do not need to be reordered. In response, the grant channel within the FCP fabric may be used to carry FCP grant messages from the destination node to the source node. The FCP grant messages may also be sprayed over all available paths toward the source node, and the grant messages do not need to be reordered. The non-FCP data channel within the FCP fabric carries data packets that do not use the FCP protocol. The non-FCP data packets may be forwarded or routed using ECMP based load balancing, and, for a given flow identified by a five tuple, the packets are expected to be delivered in order to the destination node.
(99) The example of
(100) Upon receiving source FCP traffic from one of the servers 12, an SF component 30A of access node 17.sub.1 in the first logical rack 60.sub.1 performs an 8-way spray of packets of the FCP traffic flow across all available paths to SX components 32 implemented by the access nodes 17 in the first logical rack 60.sub.1. As further illustrated in
(101) Although illustrated in
(102) According to the disclosed techniques, in one example implementation, each of SF components 30 and SX components 32 uses an FCP spray engine configured to apply a suitable load balancing scheme to spray the packets of a given FCP packet flow across all available links to a destination node. For example, the FCP spray engine may track a number of bytes transmitted on each link in order to select a least loaded link on which to forward a packet. In addition, the FCP spray engine may track link failures downstream to provide flow fairness by spraying packets in proportion to bandwidth weight on each active link. In this way, the spray of packets may not be uniform across the available links toward the destination node, but bandwidth will be balanced across the active links even over relatively short periods.
(103) In this example, the source node, e.g., SF component 30A of access node 17.sub.1, within first logical rack 60.sub.1 sends a request message to the destination node, e.g., DF component 36B of access node 17.sub.2, within second logical rack 60.sub.2 requesting a certain weight or bandwidth, and the destination node sends a grant message to the source node after reserving the egress bandwidth. The source node also determines whether any link failures have occurred between core switches 22 and logical rack 60.sub.2 that includes the destination node. The source node may then use all active links in proportion to the source and destination bandwidths. As an example, assume there are N links between the source node and the destination node, each with source bandwidth Sb.sub.i and destination bandwidth Db.sub.i, where i=1 . . . N. The actual bandwidth from the source node to the destination node is equal to min(Sb, Db), determined on a link-by-link basis in order to take failures into account. More specifically, the source bandwidth (Sb) is equal to Σ.sub.i=1.sup.N Sb.sub.i, the destination bandwidth (Db) is equal to Σ.sub.i=1.sup.N Db.sub.i, and the bandwidth (b.sub.i) of each link is equal to min(Sb.sub.i, Db.sub.i). The weight of the bandwidth used on each link is equal to b.sub.i/Σ.sub.i=1.sup.N b.sub.i.
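A worked example of the per-link weight computation, using illustrative per-link bandwidths with one degraded and one failed link:

```python
# Per-link source and destination bandwidths (illustrative, in Gbps):
Sb = [50, 50, 50, 50]   # Sb_i, e.g. 4 x 50 GE on the source side
Db = [50, 50, 25, 0]    # Db_i, with link 3 degraded and link 4 failed

b = [min(s, d) for s, d in zip(Sb, Db)]   # b_i = min(Sb_i, Db_i)
total = sum(b)                            # sum of b_i over all links

weights = [bi / total for bi in b]        # weight_i = b_i / sum(b_i)
assert abs(sum(weights) - 1.0) < 1e-9     # weights form a distribution
assert weights[3] == 0.0                  # a failed link carries nothing
print(weights)  # [0.4, 0.4, 0.2, 0.0]
```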
(104) In the case of FCP traffic, SF components 30 and SX components 32 use the FCP spray engine to distribute packets of the FCP traffic flow based on the load on each link toward the destination node, in proportion to each link's weight. The spray engine maintains credit memory to keep track of credits (i.e., available bandwidth) per nexthop member link, uses the packet length included in the FCP header to deduct credits (i.e., reduce available bandwidth), and associates a given packet with the one of the active links having the most credits (i.e., the least loaded link). In this way, for FCP packets, the SF components 30 and SX components 32 spray packets across member links of a nexthop for a destination node in proportion to the member links' bandwidth weights.
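The credit-based link selection described above might be sketched as follows. Credit replenishment in proportion to link weight is omitted, and the names are illustrative:

```python
class SprayEngine:
    """Pick the least-loaded active link: the one with the most credits."""

    def __init__(self, link_credits):
        # Credit memory per nexthop member link, seeded in proportion
        # to each link's bandwidth weight.
        self.credits = dict(link_credits)

    def send(self, packet_len):
        # Associate the packet with the link having the most credits ...
        link = max(self.credits, key=self.credits.get)
        # ... and deduct the packet length from that link's credit.
        self.credits[link] -= packet_len
        return link

engine = SprayEngine({"link0": 3000, "link1": 1500})
assert engine.send(2000) == "link0"  # most credit -> least loaded
assert engine.send(500) == "link1"   # link0 is now down to 1000 credits
```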
(105) Core switches 22 operate as the single hop along logical tunnel 100 between the source node, e.g., SF component 30A of access node 17.sub.1, in first logical rack 60.sub.1 and the destination node, e.g., DF component 36B of access node 17.sub.2, in the second logical rack 60.sub.2. Core switches 22 perform a full lookup operation for L2/L3 switching of the received packets. In this way, core switches 22 may forward all the packets for the same traffic flow toward the destination node, e.g., DF component 36B of access node 17.sub.2, in the second logical rack 60.sub.2 that supports the destination server 12. Although illustrated in
(106) DX components 34 and DF components 36 of access nodes 17 within second logical rack 60.sub.2 also have full mesh connectivity in that each DX component 34 is connected to all of the DF components 36 within second logical rack 60.sub.2. When any of DX components 34 receive the packets of the traffic flow from core switches 22, the DX components 34 forward the packets on a direct path to DF component 36B of access node 17.sub.2. DF component 36B may perform a limited lookup necessary only to select the proper output port for forwarding the packets to the destination server 12. In response to receiving the packets of the traffic flow, DF component 36B of access node 17.sub.2 within second logical rack 60.sub.2 reorders the packets of the traffic flow based on sequence numbers of the packets. As such, with respect to full routing tables for the data center, only the core switches 22 may need to perform full lookup operations. Thus, the switch fabric provides a highly-scalable, flat, high-speed interconnect in which servers are effectively one L2/L3 hop from any other server 12 within the data center.
(107) A brief description of FCP and one example of its operation with respect to
(108) As described above, FCP data packets are sent from a source node, e.g., SF component 30A of access node 17.sub.1 within first logical rack 60.sub.1, to a destination node, e.g., DF component 36B of access node 17.sub.2 within second logical rack 60.sub.2, via logical tunnel 100. Before any traffic is sent over tunnel 100 using FCP, the connection must be established between the end points. A control plane protocol executed by access nodes 17 may be used to set up a pair of tunnels, one in each direction, between the two FCP end points. The FCP tunnels are optionally secured (e.g., encrypted and authenticated). Tunnel 100 is considered to be unidirectional from the source node to the destination node, and an FCP partner tunnel may be established in the other direction from the destination node to the source node. The control plane protocol negotiates the capabilities (e.g., block size, MTU size, etc.) of both end points, and establishes the FCP connection between the end points by setting up tunnel 100 and its partner tunnel and initializing a queue state context for each tunnel.
(109) Each of the end points is assigned a source tunnel ID and a corresponding destination tunnel ID. At each end point, a queue ID for a given tunnel queue is derived based on the assigned tunnel ID and priority. For example, each FCP end point may allocate a local tunnel handle from a pool of handles and communicate the handle to its FCP connection partner end point. The FCP partner tunnel handle is stored in a lookup table and referenced from the local tunnel handle. For the source end point, e.g., access node 17.sub.1 within first logical rack 60.sub.1, a source queue is identified by the local tunnel ID and priority, and a destination tunnel ID is identified from the lookup table based on the local tunnel ID. Similarly, for the destination end point, e.g., access node 17.sub.2 within second logical rack 60.sub.2, a destination queue is identified by the local tunnel ID and priority, and a source tunnel ID is identified from the lookup table based on the local tunnel ID.
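The tunnel-handle allocation and queue-ID derivation might be sketched as follows. The queue-ID encoding is an assumption, since the text only states that a queue is identified by tunnel ID and priority; the class and names are hypothetical:

```python
import itertools

class FcpEndpoint:
    """Sketch of local tunnel-handle allocation and partner lookup."""

    _handles = itertools.count(1)  # pool of local tunnel handles

    def __init__(self):
        self.partner_of = {}       # local handle -> partner end point's handle

    def open_tunnel(self, partner_handle):
        """Allocate a local handle and record the FCP partner's handle."""
        local = next(self._handles)
        self.partner_of[local] = partner_handle
        return local

def queue_id(tunnel_id, priority, num_queues=8):
    # A queue is identified by (tunnel ID, priority); this packing into
    # a single integer is one possible encoding, not the patent's.
    return tunnel_id * num_queues + priority

ep = FcpEndpoint()
local = ep.open_tunnel(partner_handle=42)
assert ep.partner_of[local] == 42           # lookup table: local -> partner
assert queue_id(local, priority=3) == local * 8 + 3
```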
(110) FCP tunnel queues are defined as buckets of independent traffic streams that use FCP to transport payload across the FCP fabric. An FCP queue for a given tunnel is identified by the tunnel ID and priority, and the tunnel ID is identified by the source/destination end point pair for the given tunnel. Alternatively, the end points may use a mapping table to derive the tunnel ID and priority based on an internal FCP queue ID for the given tunnel. In some examples, an FCP fabric tunnel, e.g., logical tunnel 100, may support 1, 2, 4, or 8 queues per tunnel. The number of queues per tunnel is an FCP fabric property and may be configured at the time of deployment. All tunnels within the FCP fabric may support the same number of queues per tunnel. Each end point may support a maximum of 16,000 queues.
(111) When the source node is communicating with the destination node, the source node encapsulates the packets using an FCP over UDP encapsulation. The FCP header carries fields identifying tunnel IDs, queue IDs, packet sequence numbers (PSNs) for packets, and request, grant, and data block sequence numbers between the two end points. At the destination node, the incoming tunnel ID is unique for all packets from the specific source node. The tunnel encapsulation carries the packet forwarding as well as the reordering information used by the destination node. A single tunnel carries packets for one or multiple queues between the source and destination nodes. Only the packets within the single tunnel are reordered based on sequence number tags that span across the queues of the same tunnel. The source node tags the packets with tunnel PSNs when they are sent over the tunnel toward the destination node. The destination node reorders the packets based on the tunnel ID and the PSNs. At the end of the reorder, the destination node strips the tunnel encapsulation and forwards the packets to the respective destination queues.
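The destination-side reordering by packet sequence number can be sketched with a simple reorder buffer. This is a minimal sketch for one tunnel; a real implementation would bound the buffer and handle loss:

```python
class ReorderBuffer:
    """Reorder packets of one tunnel by packet sequence number (PSN)."""

    def __init__(self):
        self.next_psn = 0
        self.pending = {}   # out-of-order packets held back, keyed by PSN

    def receive(self, psn, packet):
        """Accept a packet; return the in-order run that can be released."""
        self.pending[psn] = packet
        released = []
        while self.next_psn in self.pending:
            released.append(self.pending.pop(self.next_psn))
            self.next_psn += 1
        return released

rb = ReorderBuffer()
assert rb.receive(1, "p1") == []            # held back: PSN 0 still missing
assert rb.receive(0, "p0") == ["p0", "p1"]  # gap filled, both released in order
```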
(112) An example of how an IP packet entering FCP tunnel 100 at a source end point is transmitted to a destination end point is described here. A source server 12 having an IP address of A0 sends an IP packet for a destination server 12 having an IP address of B0. The source FCP end point, e.g., access node 17.sub.1 within first logical rack 60.sub.1, transmits an FCP request packet with source IP address A and destination IP address B. The FCP request packet has an FCP header to carry the Request Block Number (RBN) and other fields. The FCP request packet is transmitted over UDP over IP. The destination FCP end point, e.g., access node 17.sub.2 within second logical rack 60.sub.2, sends an FCP grant packet back to the source FCP end point. The FCP grant packet has an FCP header to carry the Grant Block Number (GBN) and other fields. The FCP grant packet is transmitted over UDP over IP. The source end point transmits the FCP data packet after receiving the FCP grant packet. The source end point appends a new (IP+UDP+FCP) data header on the input data packet. The destination end point removes the appended (IP+UDP+FCP) data header before delivering the packet to the destination host server.
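The append/strip of the (IP+UDP+FCP) data header can be illustrated abstractly. The header fields and serialization below are placeholders for illustration only, not the actual FCP wire format:

```python
import json

def encapsulate(payload, tunnel_id, psn):
    """Append an (IP+UDP+FCP)-style data header on the input packet."""
    header = {"tunnel_id": tunnel_id, "psn": psn}  # illustrative fields
    return json.dumps(header).encode() + b"\x00" + payload

def decapsulate(frame):
    """Strip the appended header before delivery to the destination host."""
    raw_header, payload = frame.split(b"\x00", 1)
    return json.loads(raw_header), payload

frame = encapsulate(b"original IP packet", tunnel_id=7, psn=12)
header, payload = decapsulate(frame)
assert header == {"tunnel_id": 7, "psn": 12}   # forwarding + reorder info
assert payload == b"original IP packet"        # inner packet untouched
```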
(113)
(114) Furthermore, as described herein, each optical permutor 132 is configured such that optical communications received from downstream ports on each of several wavelengths 136 are “permuted” across upstream ports 138 based on wavelength so as to provide full-mesh connectivity between the upstream and downstream ports without any optical interference. That is, each optical permutor 132 is configured to ensure that optical communications received from any one of downstream servers 12 can be directed to any upstream-facing optical ports 138 without optical interference with any simultaneous communications from any other server 12. Moreover, optical permutors 132 may be bi-directional, i.e., similarly configured to permute communications from upstream ports 138 across downstream ports 136 such that no optical interference occurs on any of the downstream ports. In this way, optical permutors 132 provide bi-directional, full-mesh point-to-point connectivity for transporting communications for servers 12 to/from core switches 22 at the granularity of individual wavelengths.
(115) For example, optical permutor 132.sub.1 is configured to optically direct optical communications from downstream-facing ports 136.sub.1-136.sub.x out upstream-facing ports 138.sub.1-138.sub.x such that each upstream port 138 carries a different one of the possible unique permutations of the combinations of downstream-facing ports 136 and the optical frequencies carried by those ports, where no single upstream-facing port 138 carries communications from servers 12 associated with the same wavelength. As such, in this example, each upstream-facing port 138 carries a non-interfering wavelength from each of the downstream facing ports 136, thus allowing a full mesh of communication. In
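The permutation property described here can be checked with a toy cyclic mapping. Real optical permutors may use a different mapping, but any valid one satisfies the same invariants: each upstream port carries every wavelength exactly once and hears every downstream port exactly once, so no two signals on a port share a wavelength:

```python
N = 4  # number of ports and wavelengths (illustrative size)

def permute(downstream_port, wavelength):
    """Map (downstream port, wavelength) -> upstream port, cyclically.

    One simple rule with the stated property; the real mapping in an
    optical permutor need not be this one.
    """
    return (downstream_port + wavelength) % N

for up in range(N):
    sources = {d for d in range(N) for w in range(N) if permute(d, w) == up}
    waves = {w for d in range(N) for w in range(N) if permute(d, w) == up}
    # Full mesh with no optical interference: every downstream port and
    # every wavelength appears exactly once on each upstream port.
    assert sources == set(range(N))
    assert waves == set(range(N))
```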
(116) In this way, switch fabric 14 may provide full mesh interconnectivity such that any of servers 12 may communicate packet data to any other of the servers 12 using any of a number of parallel data paths. Moreover, according to the techniques described herein, switch fabric 14 may be configured and arranged in a way such that the parallel data paths in switch fabric 14 provide single L2/L3 hop, full mesh interconnections (bipartite graph) between servers 12, even in massive data centers having hundreds of thousands of servers. In some example implementations, each access node 17 may logically be connected to each core switch 22 and, therefore, have multiple parallel data paths for reaching any given other access node and the servers 12 reachable through those access nodes. As such, in this example, for M core switches 22, M possible data paths exist between each pair of access nodes 17. Each access node 17 may be viewed as effectively directly connected to each core switch 22 (even though it is connected through an optical permutor) and thus any access node sourcing traffic into switch fabric 14 may reach any other access node 17 by a single, one-hop L3 lookup by an intermediate device (core switch).
(117) Further example details of optical permutors are described in U.S. Provisional Appl. No. 62/478,414, filed Mar. 29, 2017, entitled “NON-BLOCKING, FULL-MESH DATA CENTER NETWORK HAVING OPTICAL PERMUTORS,” the entire contents of which are incorporated herein by reference.
(118)
(119) In this example, network 200 represents a multi-tier network having M groups of Z physical network core switches 202A-1-202M-Z (collectively, “switches 202”) that are optically interconnected to O optical permutors 204-1-204-O (collectively, “OPs 204”), which in turn interconnect endpoints (e.g., servers 215) via Y groups of X access nodes 206A-1-206Y-X (collectively, “ANs 206”). Endpoints (e.g., servers 215) may include storage systems, application servers, compute servers, and network appliances such as firewalls and/or gateways.
(120) In the example of
(121) Each optical permutor from OPs 204 receives light at a set of wavelengths from each of a set of multiple optical fibers coupled to the optical permutor and redistributes and outputs the wavelengths among each of another set of multiple optical fibers optically coupled to the optical permutor. Each optical permutor 204 may simultaneously input wavelengths from access nodes 206 for output to switches 202 and input wavelengths from switches 202 for output to access nodes 206.
(122) In the example of
(123) Network 200 may interconnect endpoints using one or more switching architectures, such as multi-tier multi-chassis link aggregation group (MC-LAG), virtual overlays, and IP fabric architectures. Each of switches 202 may represent a layer 2/layer 3 (e.g., Ethernet/IP) switch that participates in the one or more switching architectures configured for network 200 to provide point-to-point connectivity between pairs of access nodes 206. In the case of an IP fabric, each of switches 202 and access nodes 206 may execute a layer 3 routing protocol (e.g., BGP and/or OSPF) to exchange routes for subnets behind each of the access nodes 206.
(124) In the example of
(125) Each of access nodes 206 includes at least one optical interface to couple to a port of one of optical permutors 204. For example, access node 206A-1 is optically coupled to a port of optical permutor 204-1. As another example, access node 206A-2 is optically coupled to a port of optical permutor 204-2. In the example of
(126) In the example of
(127) Full mesh 220A of group 211A enables each pair of access nodes 206A-1-206A-X (“access nodes 206A”) to communicate directly with one another. Each of access nodes 206A may therefore reach each of optical permutors 204 either directly (via a direct optical coupling, e.g., access node 206A-1 with optical permutor 204-1) or indirectly via another of access nodes 206A. For instance, access node 206A-1 may reach optical permutor 204-O (and, by extension due to operation of optical permutor 204-O, switches 202M-1-202M-Z) via access node 206A-X. Access node 206A-1 may reach other optical permutors 204 via other access nodes 206A. Each of access nodes 206A therefore has point-to-point connectivity with each of switch groups 209. Access nodes 206 of groups 211B-211Y have similar topologies to access nodes 206A of group 211A. As a result of the techniques of this disclosure, therefore, each of access nodes 206 has point-to-point connectivity with each of switch groups 209.
(128) The wavelength permutation performed by each of optical permutors 204 of permutation layer 212 may reduce a number of electrical switching operations required to perform layer 2 forwarding or layer 2/layer 3 forwarding of packets among pairs of access nodes 206. For example, access node 206A-1 may receive outbound packet data from a locally-coupled server 215 that is destined for an endpoint associated with access node 206Y-1. Access node 206A-1 may select a particular transport wavelength on which to transmit the data on the optical link coupled to optical permutor 204-1, where the selected transport wavelength is permuted by optical permutor 204-1 as described herein for output on a particular optical link coupled to a switch of switching tier 210, where the switch is further coupled by another optical link to optical permutor 204-O. The switch may convert the optical signal of the selected transport wavelength carrying the data to an electrical signal and layer 2 or layer 2/layer 3 forward the data to the optical interface for optical permutor 204-O, which converts the electrical signal for the data to an optical signal for a transport wavelength that is permuted by optical permutor 204-O to access node 206Y-1. In this way, access node 206A-1 may transmit data to any other access node, such as access node 206Y-1, via network 200 with as few as a single intermediate electrical switching operation by switching tier 210.
(129)
(130) In the example of
(131) In the example of
(132) As illustrated, the optical cables connect to the respective logical racks 60 via electro-optical circuits 226. Each logical rack 60 thus has four optical ports, one for each of electro-optical circuits 226. Electro-optical circuits 226 convert electrical signals into optical signals and convert optical signals into electrical signals. For example, in logical rack 60.sub.1 of rack 70.sub.1, EO1 may convert electrical signals from AN1 and AN2 of access node group 19.sub.1 into optical signals for transmission over optical cable 230A to optical permutor 132.sub.1. Although not fully illustrated in
(133) As described in detail above, each access node supports 2×100 GE connections toward the switch fabric such that a given logical rack 60 supports 16×100 GE connections from the eight access nodes AN1-AN8. The electro-optical circuits 226 within the given logical rack 60 convert the electrical signals carried on the 16×100 GE connections into optical signals for transmission over 4×400 GE optical cables to the four optical permutors 132. As an example, in logical rack 60.sub.1 of physical rack 70.sub.1, AN1 and AN2 together may have 4×100 GE connections for communicating to the switch fabric. Within access node group 19.sub.1, the 4×100 GE connections are copper links to EO1. In some examples, these copper links may have finer granularity, e.g., 16×25 GE links. Upon converting the electrical signals received on the copper links to optical signals, EO1 sends the converted optical signals over a single 400 GE optical cable 230A to a downstream-facing port 224 of optical permutor 132.sub.1.
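The aggregation arithmetic above can be checked with a short script. The figures (8 access nodes per logical rack, 2×100 GE per access node, 4×400 GE optical cables) come from the text; the variable names are ours.

```python
# Check of the electrical-to-optical link-aggregation arithmetic described
# above (figures from the text; names are illustrative).

GE = 1  # unit: 1 GE = 1 Gbps

access_nodes_per_logical_rack = 8
links_per_access_node = 2          # 2 x 100 GE toward the switch fabric
link_rate = 100 * GE

electrical = access_nodes_per_logical_rack * links_per_access_node * link_rate
assert electrical == 16 * 100 * GE  # 16 x 100 GE per logical rack

optical_cables = 4                  # one 400 GE cable per optical permutor
optical = optical_cables * 400 * GE

# Electro-optical circuits 226 carry the full electrical bandwidth optically.
assert electrical == optical == 1600 * GE
```

The same check holds at finer granularity: 16×25 GE copper links per access-node pair still total 400 Gbps into each electro-optical circuit.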
(134) Each of optical permutors 132 also has sixteen upstream-facing ports 222 to connect to sixteen core switches CX1-CX16 22 within a given one of switch groups 220.sub.1-220.sub.4. As described in more detail below with respect to
(135) Each of switch groups 220.sub.1-220.sub.4 has a set of switches 22 that are each optically coupled to a same one of optical permutors 132.sub.1-132.sub.4 via a respective optical cable 223. For example, each of the upstream-facing ports 222 may support a 400 GE optical cable 223 between optical permutor 132.sub.1 and one of core switches 22 within switch group 220.sub.1. Core switches 22 may convert optical signals received on optical cables 223 into electrical signals prior to performing full lookup and switching functions on the received traffic. Prior to forwarding the traffic back to the one of optical permutors 132 via optical cables 223, core switches 22 may convert the traffic back into optical signals.
(136)
(137) In the example of
(138) Logical racks 60 of each of the other racks 70.sub.2-70.sub.8 are similarly connected to each of the four optical permutors 132 via optical cables. For example, one of optical cables 240 from logical rack 60.sub.1 of rack 70.sub.8 connects to the second-to-last downstream-facing port 224 of each of optical permutors 132.sub.1-132.sub.4, and one of optical cables 242 from logical rack 60.sub.2 of rack 70.sub.8 connects to the last downstream-facing port 224 of each of optical permutors 132.sub.1-132.sub.4. Each of the optical cables 230, 232, 240, 242 may be 400 GE optical cables. In some examples, each of the 400 GE optical cables may include four 100 GE optical fibers that each carry multiplexed optical signals having four different wavelengths or lambdas.
(139) As described above, upon receipt of traffic from logical racks 60 on downstream-facing ports 224, optical permutors 132 spray the traffic across all upstream-facing ports 222. Optical permutors 132 then forward the traffic on upstream-facing ports 222 to each of sixteen core switches within a same switch group 220. For example, optical permutor 132.sub.1 transmits the traffic from each upstream-facing port 222 to one of core switches CX1-CX16 in switch group 220.sub.1 along optical cables 223. Each of optical cables 223 may be a 400 GE optical cable.
(140)
(141) Optical permutors 302 operate substantially similar to optical permutors 132 described above, but have eight upstream-facing ports 322 and eight downstream-facing ports 324. In the example network cluster architecture 318 of
(142) In the example of
(143) Each of the optical cables 330, 332, 334, 340 may be 400 GE optical cables. In some examples, each of the 400 GE optical cables may include four 100 GE optical fibers that each carry multiplexed optical signals having four different wavelengths or lambdas. As described in detail above, each physical rack 70 includes four access node groups that each include four access nodes. Each access node supports 2×100 GE connections toward the switch fabric such that a given physical rack 70 supports 32×100 GE connections from the sixteen access nodes. Electro-optical circuits within the given physical rack 70 convert the electrical signals carried on the 32×100 GE connections into optical signals for transmission over 8×400 GE optical cables to the eight optical permutors 302.
(144) As described above, upon receipt of traffic from racks 70 on downstream-facing ports 324, optical permutors 302 spray the traffic across all upstream-facing ports 322. Optical permutors 302 then forward the traffic on upstream-facing ports 322 to each of eight core switches within a same switch group 320. For example, optical permutor 302.sub.1 transmits the traffic from each upstream-facing port 322 to one of core switches CX1-CX8 in switch group 320.sub.1 along optical cables 323. Each of optical cables 323 may be a 400 GE optical cable.
(145) The example network cluster architectures 218, 318 illustrated in
(146) TABLE-US-00001

TABLE 1

  Racks    Access Nodes    Server Nodes    Sockets/Processors    Network Bandwidth
  ½        8               32              64                    1.6 Tbps
  1        16              64              128                   3.2 Tbps
  8        128             512             1,024                 25.6 Tbps
  100      1,600           6,400           12,800                320 Tbps
  1,000    16,000          64,000          128,000               3.2 Pbps
  2,000    32,000          128,000         256,000               6.4 Pbps
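The rows of Table 1 follow directly from the per-rack figures given earlier in the text: 16 access nodes per rack, 4 server nodes per access node, 2 sockets per server node, and 2×100 GE of fabric bandwidth per access node. A quick check (function name is ours):

```python
# Reproduce the rows of Table 1 from the per-rack figures in the text.

def scale(racks: float):
    access_nodes = int(racks * 16)       # 16 access nodes per physical rack
    servers = access_nodes * 4           # 4 server nodes per access node
    sockets = servers * 2                # 2 sockets per server node
    bandwidth_gbps = access_nodes * 200  # 2 x 100 GE per access node
    return access_nodes, servers, sockets, bandwidth_gbps

assert scale(0.5) == (8, 32, 64, 1_600)                       # 1.6 Tbps
assert scale(8) == (128, 512, 1_024, 25_600)                  # 25.6 Tbps
assert scale(2_000) == (32_000, 128_000, 256_000, 6_400_000)  # 6.4 Pbps
```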
(147)
(148) Each of clusters 350 includes a plurality of core switches 22 and a plurality of access nodes 17 that are each coupled to a plurality of servers 12.sub.1-12.sub.n. Although not shown in
(149) As illustrated in
(150) Border gateway devices 354 enable clusters 350 to connect to each other and to the outside world via a level of routers 356 and service provider network 360, e.g., the Internet or another public WAN. For example, border gateway devices 354 may be substantially similar to gateway device 20 from
(151) In the illustrated example of
(152) In one example of a packet being forwarded between clusters 350A and 350B, at time T0, an access node 17A within cluster 350A sends the packet to access node 352A over one of multiple links across fabric 14A. At time T1, access node 352A receives the packet from fabric 14A. At time T2, access node 352A sends the packet to border gateway device 354A over the PCIe connection. Border gateway device 354A performs a forwarding lookup on the destination IP address of the packet, and determines that the packet is destined for cluster 350B behind border gateway device 354B. At time T3, border gateway device 354A within cluster 350A sends the packet to border gateway device 354B within cluster 350B over one of multiple links across service provider network 360. For example, border gateway device 354A may send the packet over one of multiple paths to routers 356A, 356B using either the packet spraying techniques described in this disclosure or ECMP. At time T4, border gateway device 354B within cluster 350B receives the packet from routers 356A, 356B. Border gateway device 354B performs a forwarding lookup on the destination IP address of the packet, and sends the packet to access node 352B over the PCIe connection. At time T5, access node 352B sends the packet to access node 17B within cluster 350B over one of multiple links across fabric 14B. At time T6, access node 17B receives the packet from fabric 14B. Access node 17B performs a forwarding lookup on the destination IP address of the packet, and sends the packet to one of servers 12 at time T7.
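Where border gateway device 354A chooses among multiple paths to routers 356A, 356B using ECMP, the selection is conventionally a hash over the flow's addresses so that all packets of a flow take the same path. A generic sketch, assuming a SHA-256 hash over the 4-tuple (the hash and field choice are illustrative, not the gateway's actual logic):

```python
import hashlib

# Generic sketch of ECMP next-hop selection between routers 356A and 356B.
# The hash function and header fields are illustrative assumptions.

NEXT_HOPS = ["356A", "356B"]

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Hash the flow 4-tuple so all packets of a flow take the same path."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return NEXT_HOPS[digest % len(NEXT_HOPS)]

# Packets of the same flow always pick the same router (no reordering),
# which is the property that distinguishes ECMP from per-packet spraying.
a = ecmp_next_hop("10.0.0.1", "10.1.0.9", 5000, 80)
assert a == ecmp_next_hop("10.0.0.1", "10.1.0.9", 5000, 80)
assert a in NEXT_HOPS
```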
(153)
(154) In the example of
(155)
(156) For example, in
(157) The following provides a complete example for one implementation of optical permutor 400 of
(158) Further, each of the four optical fiber pairs for each of input ports P1-P16 is coupled to a different access node 17, thereby providing bidirectional optical connectivity from 64 different access nodes.
(159) Table 2 lists one example configuration for optical permutor 400 for optical communications in the core-facing direction. That is, Table 2 illustrates an example configuration of optical permutor 400 for producing, on the optical fibers of core-facing output ports P17-P32, a set of 64 unique permutations for combinations of optical input ports P1-P16 and optical wavelengths L1-L4 carried by those input ports, where no single optical output port carries multiple optical communications having the same wavelength. For example, the first column of Table 2 lists the wavelengths L1-L4 carried by the four fibers F1-F4 of each input optical interface for ports P1-P16, while the right column lists the unique and non-interfering permutation of input port fiber/wavelength combinations output on each optical output interface of ports P17-P32.
(160) TABLE-US-00002 TABLE 2 Core-switch facing Output Rack-facing Input ports Ports for Optical Permutor for Optical Permutor (permutation of wavelengths & input port) Input Port 1: Output Port 17: Fiber 1: P1F1L1-P1F1L4 Fiber 1: P1F1L1, P2F1L2, P3F1L3, P4F1L4 Fiber 2: P1F2L1-P1F2L4 Fiber 2: P5F1L1, P6F1L2, P7F1L3, P8F1L4 Fiber 3: P1F3L1-P1F3L4 Fiber 3: P9F1L1, P10F1L2, P11F1L3, Fiber 4: P1F4L1-P1F4L4 P12F1L4 Fiber 4: P13F1L1, P14F1L2, P15F1L3, P16F1L4 Input Port 2: Output Port 18: Fiber 1: P2F1L1-P2F1L4 Fiber 1: P1F2L1, P2F2L2, P3F2L3, P4F2L4 Fiber 2: P2F2L1-P2F2L4 Fiber 2: P5F2L1, P6F2L2, P7F2L3, P8F2L4 Fiber 3: P2F3L1-P2F3L4 Fiber 3: P9F2L1, P10F2L2, P11F2L3, Fiber 4: P2F4L1-P2F4L4 P12F2L4 Fiber 4: P13F2L1, P14F2L2, P15F2L3, P16F2L4 Input Port 3: Output Port 19: Fiber 1: P3F1L1-P3F1L4 Fiber 1: P1F3L1, P2F3L2, P3F3L3, P4F3L4 Fiber 2: P3F2L1-P3F2L4 Fiber 2: P5F3L1, P6F3L2, P7F3L3, P8F3L4 Fiber 3: P3F3L1-P3F3L4 Fiber 3: P9F3L1, P10F3L2, P11F3L3, Fiber 4: P3F4L1-P3F4L4 P12F3L4 Fiber 4: P13F3L1, P14F3L2, P15F3L3, P16F3L4 Input Port 4: Output Port 20: Fiber 1: P4F1L1-P4F1L4 Fiber 1: P1F4L1, P2F4L2, P3F4L3, P4F4L4 Fiber 2: P4F2L1-P4F2L4 Fiber 2: P5F4L1, P6F4L2, P7F4L3, P8F4L4 Fiber 3: P4F3L1-P4F3L4 Fiber 3: P9F4L1, P10F4L2, P11F4L3, Fiber 4: P4F4L1-P4F4L4 P12F4L4 Fiber 4: P13F4L1, P14F4L2, P15F4L3, P16F4L4 Input Port 5: Output Port 21: Fiber 1: P5F1L1-P5F1L4 Fiber 1: P2F1L1, P3F1L2, P4F1L3, P5F1L4 Fiber 2: P5F2L1-P5F2L4 Fiber 2: P6F1L1, P7F1L2, P8F1L3, P9F1L4 Fiber 3: P5F3L1-P5F3L4 Fiber 3: P10F1L1, P11F1L2, P12F1L3, Fiber 4: P5F4L1-P5F4L4 P13F1L4 Fiber 4: P14F1L1, P15F1L2, P16F1L3, P1F1L4 Input Port 6: Output Port 22: Fiber 1: P6F1L1-P6F1L4 Fiber 1: P2F2L1, P3F2L2, P4F2L3, P5F2L4 Fiber 2: P6F2L1-P6F2L4 Fiber 2: P6F2L1, P7F2L2, P8F2L3, P9F2L4 Fiber 3: P6F3L1-P6F3L4 Fiber 3: P10F2L1, P11F2L2, P12F2L3, Fiber 4: P6F4L1-P6F4L4 P13F2L4 Fiber 4: P14F2L1, P15F2L2, P16F2L3, P1F2L4 Input Port 7: Output Port 23: Fiber 1: P7F1L1-P7F1L4 Fiber 1: P2F3L1, P3F3L2, P4F3L3, P5F3L4 
Fiber 2: P7F2L1-P7F2L4 Fiber 2: P6F3L1, P7F3L2, P8F3L3, P9F3L4 Fiber 3: P7F3L1-P7F3L4 Fiber 3: P10F3L1, P11F3L2, P12F3L3, Fiber 4: P7F4L1-P7F4L4 P13F3L4 Fiber 4: P14F3L1, P15F3L2, P16F3L3, P1F3L4 Input Port 8: Output Port 24: Fiber 1: P8F1L1-P8F1L4 Fiber 1: P2F4L1, P3F4L2, P4F4L3, P5F4L4 Fiber 2: P8F2L1-P8F2L4 Fiber 2: P6F4L1, P7F4L2, P8F4L3, P9F4L4 Fiber 3: P8F3L1-P8F3L4 Fiber 3: P10F4L1, P11F4L2, P12F4L3, Fiber 4: P8F4L1-P8F4L4 P13F4L4 Fiber 4: P14F4L1, P15F4L2, P16F4L3, P1F4L4 Input Port 9: Output Port 25: Fiber 1: P9F1L1-P9F1L4 Fiber 1: P3F1L1, P4F1L2, P5F1L3, P6F1L4 Fiber 2: P9F2L1-P9F2L4 Fiber 2: P7F1L1, P8F1L2, P9F1L3, Fiber 3: P9F3L1-P9F3L4 P10F1L4 Fiber 4: P9F4L1-P9F4L4 Fiber 3: P11F1L1, P12F1L2, P13F1L3, P14F1L4 Fiber 4: P15F1L1, P16F1L2, P1F1L3, P2F1L4 Input Port 10: Output Port 26: Fiber 1: P10F1L1-P10F1L4 Fiber 1: P3F2L1, P4F2L2, P5F2L3, P6F2L4 Fiber 2: P10F2L1-P10F2L4 Fiber 2: P7F2L1, P8F2L2, P9F2L3, Fiber 3: P10F3L1-P10F3L4 P10F2L4 Fiber 4: P10F4L1-P10F4L4 Fiber 3: P11F2L1, P12F2L2, P13F2L3, P14F2L4 Fiber 4: P15F2L1, P16F2L2, P1F2L3, P2F2L4 Input Port 11: Output Port 27: Fiber 1: P11F1L1-P11F1L4 Fiber 1: P3F3L1, P4F3L2, P5F3L3, P6F3L4 Fiber 2: P11F2L1-P11F2L4 Fiber 2: P7F3L1, P8F3L2, P9F3L3, Fiber 3: P11F3L1-P11F3L4 P10F3L4 Fiber 4: P11F4L1-P11F4L4 Fiber 3: P11F3L1, P12F3L2, P13F3L3, P14F3L4 Fiber 4: P15F3L1, P16F3L2, P1F3L3, P2F3L4 Input Port 12: Output Port 28: Fiber 1: P12F1L1-P12F1L4 Fiber 1: P3F4L1, P4F4L2, P5F4L3, P6F4L4 Fiber 2: P12F2L1-P12F2L4 Fiber 2: P7F4L1, P8F4L2, P9F4L3, Fiber 3: P12F3L1-P12F3L4 P10F4L4 Fiber 4: P12F4L1-P12F4L4 Fiber 3: P11F4L1, P12F4L2, P13F4L3, P14F4L4 Fiber 4: P15F4L1, P16F4L2, P1F4L3, P2F4L4 Input Port 13: Output Port 29: Fiber 1: P13F1L1-P13F1L4 Fiber 1: P4F1L1, P5F1L2, P6F1L3, P7F1L4 Fiber 2: P13F2L1-P13F2L4 Fiber 2: P8F1L1, P9F1L2, P10F1L3, Fiber 3: P13F3L1-P13F3L4 P11F1L4 Fiber 4: P13F4L1-P13F4L4 Fiber 3: P12F1L1, P13F1L2, P14F1L3, P15F1L4 Fiber 4: P16F1L1, P1F1L2, P2F1L3, P3F1L4 Input Port 14: Output Port 30: 
Fiber 1: P14F1L1-P14F1L4 Fiber 1: P4F2L1, P5F2L2, P6F2L3, P7F2L4 Fiber 2: P14F2L1-P14F2L4 Fiber 2: P8F2L1, P9F2L2, P10F2L3, Fiber 3: P14F3L1-P14F3L4 P11F2L4 Fiber 4: P14F4L1-P14F4L4 Fiber 3: P12F2L1, P13F2L2, P14F2L3, P15F2L4 Fiber 4: P16F2L1, P1F2L2, P2F2L3, P3F2L4 Input Port 15: Output Port 31: Fiber 1: P15F1L1-P15F1L4 Fiber 1: P4F3L1, P5F3L2, P6F3L3, P7F3L4 Fiber 2: P15F2L1-P15F2L4 Fiber 2: P8F3L1, P9F3L2, P10F3L3, Fiber 3: P15F3L1-P15F3L4 P11F3L4 Fiber 4: P15F4L1-P15F4L4 Fiber 3: P12F3L1, P13F3L2, P14F3L3, P15F3L4 Fiber 4: P16F3L1, P1F3L2, P2F3L3, P3F3L4 Input Port 16: Output Port 32: Fiber 1: P16F1L1-P16F1L4 Fiber 1: P4F4L1, P5F4L2, P6F4L3, P7F4L4 Fiber 2: P16F2L1-P16F2L4 Fiber 2: P8F4L1, P9F4L2, P10F4L3, Fiber 3: P16F3L1-P16F3L4 P11F4L4 Fiber 4: P16F4L1-P16F4L4 Fiber 3: P12F4L1, P13F4L2, P14F4L3, P15F4L4 Fiber 4: P16F4L1, P1F4L2, P2F4L3, P3F4L4
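The entries of Table 2 follow a regular cyclic pattern: output port 17+j carries only fiber (j mod 4)+1 of the input ports, with the input-port sequence rotated by j÷4 positions. Under that reading (our reconstruction of the pattern from the listed entries, not text from the patent), the full mapping and its non-interference properties can be generated and checked:

```python
# Reconstruction of the Table 2 pattern (our reading of the listed entries,
# not authoritative). Output port 17+j carries input fiber f = (j % 4) + 1
# of all sixteen input ports, with the port sequence rotated by s = j // 4:
# output fiber i, wavelength slot w carries input port
# ((s + (i-1)*4 + (w-1)) % 16) + 1 on wavelength Lw.

def table2():
    mapping = {}  # (out_port, out_fiber, slot) -> (in_port, in_fiber, wavelength)
    for j in range(16):
        out_port, s, f = 17 + j, j // 4, (j % 4) + 1
        for i in range(1, 5):        # output fiber F1..F4
            for w in range(1, 5):    # wavelength slot L1..L4
                p = (s + (i - 1) * 4 + (w - 1)) % 16 + 1
                mapping[(out_port, i, w)] = (p, f, w)
    return mapping

m = table2()

# Spot-checks against listed entries: Output Port 17, Fiber 1 begins P1F1L1;
# Output Port 21, Fiber 4 ends P1F1L4.
assert m[(17, 1, 1)] == (1, 1, 1)
assert m[(21, 4, 4)] == (1, 1, 4)

# Non-interference: every (input port, fiber, wavelength) signal appears
# exactly once across all outputs, and no output fiber repeats a wavelength.
assert sorted(m.values()) == sorted(
    (p, f, w) for p in range(1, 17) for f in range(1, 5) for w in range(1, 5))
for (op, of) in {(k[0], k[1]) for k in m}:
    assert sorted(m[(op, of, w)][2] for w in range(1, 5)) == [1, 2, 3, 4]
```

The downstream configuration of Table 3 follows the same pattern with the port ranges exchanged (inputs P17-P32, outputs P1-P16).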
(161) Continuing the example, Table 3 lists an example configuration for optical permutor 400 with respect to optical communications in the reverse, downstream direction, i.e., from core switches 22 to access nodes 17. That is, Table 3 illustrates an example configuration of optical permutor 400 for producing, on the optical fibers of rack-facing output ports P1-P16, a set of 64 unique permutations for combinations of core-facing input ports P17-P32 and optical wavelengths L1-L4 carried by those input ports, where no single optical output port carries multiple optical communications having the same wavelength.
(162) TABLE-US-00003 TABLE 3 Access node-facing Output Core switch-facing Input Ports for Optical Permutor Ports for Optical Permutor (permutation of wavelengths & input port) Input Port 17: Output Port 1: Fiber 1: P17F1L1-P17F1L4 Fiber 1: P17F1L1, P18F1L2, P19F1L3, P20F1L4 Fiber 2: P17F2L1-P17F2L4 Fiber 2: P21F1L1, P22F1L2, P23F1L3, P24F1L4 Fiber 3: P17F3L1-P17F3L4 Fiber 3: P25F1L1, P26F1L2, P27F1L3, P28F1L4 Fiber 4: P17F4L1-P17F4L4 Fiber 4: P29F1L1, P30F1L2, P31F1L3, P32F1L4 Input Port 18: Output Port 2: Fiber 1: P18F1L1-P18F1L4 Fiber 1: P17F2L1, P18F2L2, P19F2L3, P20F2L4 Fiber 2: P18F2L1-P18F2L4 Fiber 2: P21F2L1, P22F2L2, P23F2L3, P24F2L4 Fiber 3: P18F3L1-P18F3L4 Fiber 3: P25F2L1, P26F2L2, P27F2L3, P28F2L4 Fiber 4: P18F4L1-P18F4L4 Fiber 4: P29F2L1, P30F2L2, P31F2L3, P32F2L4 Input Port 19: Output Port 3: Fiber 1: P19F1L1-P19F1L4 Fiber 1: P17F3L1, P18F3L2, P19F3L3, P20F3L4 Fiber 2: P19F2L1-P19F2L4 Fiber 2: P21F3L1, P22F3L2, P23F3L3, P24F3L4 Fiber 3: P19F3L1-P19F3L4 Fiber 3: P25F3L1, P26F3L2, P27F3L3, P28F3L4 Fiber 4: P19F4L1-P19F4L4 Fiber 4: P29F3L1, P30F3L2, P31F3L3, P32F3L4 Input Port 20: Output Port 4: Fiber 1: P20F1L1-P20F1L4 Fiber 1: P17F4L1, P18F4L2, P19F4L3, P20F4L4 Fiber 2: P20F2L1-P20F2L4 Fiber 2: P21F4L1, P22F4L2, P23F4L3, P24F4L4 Fiber 3: P20F3L1-P20F3L4 Fiber 3: P25F4L1, P26F4L2, P27F4L3, P28F4L4 Fiber 4: P20F4L1-P20F4L4 Fiber 4: P29F4L1, P30F4L2, P31F4L3, P32F4L4 Input Port 21: Output Port 5: Fiber 1: P21F1L1-P21F1L4 Fiber 1: P18F1L1, P19F1L2, P20F1L3, P21F1L4 Fiber 2: P21F2L1-P21F2L4 Fiber 2: P22F1L1, P23F1L2, P24F1L3, P25F1L4 Fiber 3: P21F3L1-P21F3L4 Fiber 3: P26F1L1, P27F1L2, P28F1L3, P29F1L4 Fiber 4: P21F4L1-P21F4L4 Fiber 4: P30F1L1, P31F1L2, P32F1L3, P17F1L4 Input Port 22: Output Port 6: Fiber 1: P22F1L1-P22F1L4 Fiber 1: P18F2L1, P19F2L2, P20F2L3, P21F2L4 Fiber 2: P22F2L1-P22F2L4 Fiber 2: P22F2L1, P23F2L2, P24F2L3, P25F2L4 Fiber 3: P22F3L1-P22F3L4 Fiber 3: P26F2L1, P27F2L2, P28F2L3, P29F2L4 Fiber 4: P22F4L1-P22F4L4 Fiber 4: P30F2L1, P31F2L2, 
P32F2L3, P17F2L4 Input Port 23: Output Port 7: Fiber 1: P23F1L1-P23F1L4 Fiber 1: P18F3L1, P19F3L2, P20F3L3, P21F3L4 Fiber 2: P23F2L1-P23F2L4 Fiber 2: P22F3L1, P23F3L2, P24F3L3, P25F3L4 Fiber 3: P23F3L1-P23F3L4 Fiber 3: P26F3L1, P27F3L2, P28F3L3, P29F3L4 Fiber 4: P23F4L1-P23F4L4 Fiber 4: P30F3L1, P31F3L2, P32F3L3, P17F3L4 Input Port 24: Output Port 8: Fiber 1: P24F1L1-P24F1L4 Fiber 1: P18F2L1, P19F2L2, P20F2L3, P21F2L4 Fiber 2: P24F2L1-P24F2L4 Fiber 2: P22F2L1, P23F2L2, P24F2L3, P25F2L4 Fiber 3: P24F3L1-P24F3L4 Fiber 3: P26F2L1, P27F2L2, P28F2L3, P29F2L4 Fiber 4: P24F4L1-P24F4L4 Fiber 4: P30F2L1, P31F2L2, P32F2L3, P17F2L4 Input Port 25: Output Port 9: Fiber 1: P25F1L1-P25F1L4 Fiber 1: P19F1L1, P20F1L2, P21F1L3, P22F1L4 Fiber 2: P25F2L1-P25F2L4 Fiber 2: P23F1L1, P24F1L2, P25F1L3, P26F1L4 Fiber 3: P25F3L1-P25F3L4 Fiber 3: P27F1L1, P28F1L2, P29F1L3, P30F1L4 Fiber 4: P25F4L1-P25F4L4 Fiber 4: P31F1L1, P32F1L2, P17F1L3, P18F1L4 Input Port 26: Output Port 10: Fiber 1: P26F1L1-P26F1L4 Fiber 1: P19F2L1, P20F2L2, P21F2L3, P22F2L4 Fiber 2: P26F2L1-P26F2L4 Fiber 2: P23F2L1, P24F2L2, P25F2L3, P26F2L4 Fiber 3: P26F3L1-P26F3L4 Fiber 3: P27F2L1, P28F2L2, P29F2L3, P30F2L4 Fiber 4: P26F4L1-P26F4L4 Fiber 4: P31F2L1, P32F2L2, P17F2L3, P18F2L4 Input Port 27: Output Port 11: Fiber 1: P27F1L1-P27F1L4 Fiber 1: P19F3L1, P20F3L2, P21F3L3, P22F3L4 Fiber 2: P27F2L1-P27F2L4 Fiber 2: P23F3L1, P24F3L2, P25F3L3, P26F3L4 Fiber 3: P27F3L1-P27F3L4 Fiber 3: P27F3L1, P28F3L2, P29F3L3, P30F3L4 Fiber 4: P27F4L1-P27F4L4 Fiber 4: P31F3L1, P32F3L2, P17F3L3, P18F3L4 Input Port 28: Output Port 12: Fiber 1: P28F1L1-P28F1L4 Fiber 1: P19F4L1, P20F4L2, P21F4L3, P22F4L4 Fiber 2: P28F2L1-P28F2L4 Fiber 2: P23F4L1, P24F4L2, P25F4L3, P26F4L4 Fiber 3: P28F3L1-P28F3L4 Fiber 3: P27F4L1, P28F4L2, P29F4L3, P30F4L4 Fiber 4: P28F4L1-P28F4L4 Fiber 4: P31F4L1, P32F4L2, P17F4L3, P18F4L4 Input Port 29: Output Port 13: Fiber 1: P29F1L1-P29F1L4 Fiber 1: P20F1L1, P21F1L2, P22F1L3, P23F1L4 Fiber 2: P29F2L1-P29F2L4 Fiber 2: P24F1L1, 
P25F1L2, P26F1L3, P27F1L4 Fiber 3: P29F3L1-P29F3L4 Fiber 3: P28F1L1, P29F1L2, P30F1L3, P31F1L4 Fiber 4: P29F4L1-P29F4L4 Fiber 4: P32F1L1, P17F1L2, P18F1L3, P19F1L4 Input Port 30: Output Port 14: Fiber 1: P30F1L1-P30F1L4 Fiber 1: P20F2L1, P21F2L2, P22F2L3, P23F2L4 Fiber 2: P30F2L1-P30F2L4 Fiber 2: P24F2L1, P25F2L2, P26F2L3, P27F2L4 Fiber 3: P30F3L1-P30F3L4 Fiber 3: P28F2L1, P29F2L2, P30F2L3, P31F2L4 Fiber 4: P30F4L1-P30F4L4 Fiber 4: P32F2L1, P17F2L2, P18F2L3, P19F2L4 Input Port 31: Output Port 15: Fiber 1: P31F1L1-P31F1L4 Fiber 1: P20F3L1, P21F3L2, P22F3L3, P23F3L4 Fiber 2: P31F2L1-P31F2L4 Fiber 2: P24F3L1, P25F3L2, P26F3L3, P27F3L4 Fiber 3: P31F3L1-P31F3L4 Fiber 3: P28F3L1, P29F3L2, P30F3L3, P31F3L4 Fiber 4: P31F4L1-P31F4L4 Fiber 4: P32F3L1, P17F3L2, P18F3L3, P19F3L4 Input Port 32: Output Port 16: Fiber 1: P32F1L1-P32F1L4 Fiber 1: P20F4L1, P21F4L2, P22F4L3, P23F4L4 Fiber 2: P32F2L1-P32F2L4 Fiber 2: P24F4L1, P25F4L2, P26F4L3, P27F4L4 Fiber 3: P32F3L1-P32F3L4 Fiber 3: P28F4L1, P29F4L2, P30F4L3, P31F4L4 Fiber 4: P32F4L1-P32F4L4 Fiber 4: P32F4L1, P17F4L2, P18F4L3, P19F4L4
(163) Table 4 lists a second example configuration for optical permutor 400 for optical communications in the core-facing direction. As with Table 2 above, Table 4 illustrates an example configuration of optical permutor 400 for producing, on the optical fibers of core-facing output ports P17-P32, a set of 64 unique permutations for combinations of optical input ports P1-P16 and optical wavelengths L1-L4 carried by those input ports, where no single optical output port carries multiple optical communications having the same wavelength. Similar to Table 2 above, the first column of Table 4 lists the wavelengths L1-L4 carried by the four fibers F1-F4 of each input optical interface for ports P1-P16, while the right column lists another example of unique and non-interfering permutation of input port fiber/wavelength combinations output on each optical output interface of ports P17-P32.
(164) TABLE-US-00004 TABLE 4 Rack-facing Core-switch facing Output Input ports for Ports for Optical Permutor Optical Permutor (permutation of wavelengths & input port) Input Port 1: Output Port 17: Fiber 1: P1F1L1-P1F1L4 Fiber 1: P1F1L1, P2F1L2, P3F1L3, P4F1L4 Fiber 2: P1F2L1-P1F2L4 Fiber 2: P5F1L1, P6F1L2, P7F1L3, P8F1L4 Fiber 3: P1F3L1-P1F3L4 Fiber 3: P9F1L1, P10F1L2, P11F1L3, P12F1L4 Fiber 4: P1F4L1-P1F4L4 Fiber 4: P13F1L1, P14F1L2, P15F1L3, P16F1L4 Input Port 2: Output Port 18: Fiber 1: P2F1L1-P2F1L4 Fiber 1: P2F1L1, P3F1L2, P4F1L3, P1F1L4 Fiber 2: P2F2L1-P2F2L4 Fiber 2: P6F1L1, P7F1L2, P8F1L3, P5F1L4 Fiber 3: P2F3L1-P2F3L4 Fiber 3: P10F1L1, P11F1L2, P12F1L3, P9F1L4 Fiber 4: P2F4L1-P2F4L4 Fiber 4: P14F1L1, P15F1L2, P16F1L3, P13F1L4 Input Port 3: Output Port 19: Fiber 1: P3F1L1-P3F1L4 Fiber 1: P3F1L1, P4F1L2, P1F1L3, P2F1L4 Fiber 2: P3F2L1-P3F2L4 Fiber 2: P7F1L1, P8F1L2, P5F1L3, P6F1L4 Fiber 3: P3F3L1-P3F3L4 Fiber 3: P11F1L1, P12F1L2, P9F1L3, P10F1L4 Fiber 4: P3F4L1-P3F4L4 Fiber 4: P15F1L1, P16F1L2, P13F1L3, P14F1L4 Input Port 4: Output Port 20: Fiber 1: P4F1L1-P4F1L4 Fiber 1: P4F1L1, P1F1L2, P2F1L3, P3F1L4 Fiber 2: P4F2L1-P4F2L4 Fiber 2: P8F1L1, P5F1L2, P6F1L3, P7F1L4 Fiber 3: P4F3L1-P4F3L4 Fiber 3: P12F1L1, P9F1L2, P10F1L3, P11F1L4 Fiber 4: P4F4L1-P4F4L4 Fiber 4: P16F1L1, P13F1L2, P14F1L3, P15F1L4 Input Port 5: Output Port 21: Fiber 1: P5F1L1-P5F1L4 Fiber 1: P1F2L1, P2F2L2, P3F2L3, P4F2L4 Fiber 2: P5F2L1-P5F2L4 Fiber 2: P5F2L1, P6F2L2, P7F2L3, P8F2L4 Fiber 3: P5F3L1-P5F3L4 Fiber 3: P9F2L1, P10F2L2, P11F2L3, P12F2L4 Fiber 4: P5F4L1-P5F4L4 Fiber 4: P13F2L1, P14F2L2, P15F2L3, P6F2L4 Input Port 6: Output Port 22: Fiber 1: P6F1L1-P6F1L4 Fiber 1: P2F2L1, P3F2L2, P4F2L3, P1F2L4 Fiber 2: P6F2L1-P6F2L4 Fiber 2: P6F2L1, P7F2L2, P8F2L3, P5F2L4 Fiber 3: P6F3L1-P6F3L4 Fiber 3: P10F2L1, P11F2L2, P12F2L3, P9F2L4 Fiber 4: P6F4L1-P6F4L4 Fiber 4: P14F2L1, P15F2L2, P16F2L3, P13F2L4 Input Port 7: Output Port 23: Fiber 1: P7F1L1-P7F1L4 Fiber 1: P3F2L1, P4F2L2, P1F2L3, P2F2L4 
Fiber 2: P7F2L1-P7F2L4 Fiber 2: P7F2L1, P8F2L2, P5F2L3, P6F2L4 Fiber 3: P7F3L1-P7F3L4 Fiber 3: P11F2L1, P12F2L2, P9F2L3, P10F2L4 Fiber 4: P7F4L1-P7F4L4 Fiber 4: P15F2L1, P16F2L2, P13F2L3, P14F2L4 Input Port 8: Output Port 24: Fiber 1: P8F1L1-P8F1L4 Fiber 1: P4F2L1, P1F2L2, P2F2L3, P3F2L4 Fiber 2: P8F2L1-P8F2L4 Fiber 2: P8F2L1, P5F2L2, P6F2L3, P7F2L4 Fiber 3: P8F3L1-P8F3L4 Fiber 3: P12F2L1, P9F2L2, P10F2L3, P11F2L4 Fiber 4: P8F4L1-P8F4L4 Fiber 4: P16F2L1, P13F2L2, P14F2L3, P15F2L4 Input Port 9: Output Port 25: Fiber 1: P9F1L1-P9F1L4 Fiber 1: P1F3L1, P2F3L2, P3F3L3, P4F3L4 Fiber 2: P9F2L1-P9F2L4 Fiber 2: P5F3L1, P6F3L2, P7F3L3, P8F3L4 Fiber 3: P9F3L1-P9F3L4 Fiber 3: P9F3L1, P10F3L2, P11F3L3, P12F3L4 Fiber 4: P9F4L1-P9F4L4 Fiber 4: P13F3L1, P14F3L2, P15F3L3, P16F3L4 Input Port 10: Output Port 26: Fiber 1: P10F1L1-P10F1L4 Fiber 1: P2F3L1, P3F3L2, P4F3L3, P1F3L4 Fiber 2: P10F2L1-P10F2L4 Fiber 2: P6F3L1, P7F3L2, P8F3L3, P5F3L4 Fiber 3: P10F3L1-P10F3L4 Fiber 3: P10F3L1, P11F3L2, P12F3L3, P9F3L4 Fiber 4: P10F4L1-P10F4L4 Fiber 4: P14F3L1, P15F3L2, P16F3L3, P13F3L4 Input Port 11: Output Port 27: Fiber 1: P11F1L1-P11F1L4 Fiber 1: P3F3L1, P4F3L2, P1F3L3, P2F3L4 Fiber 2: P11F2L1-P11F2L4 Fiber 2: P7F3L1, P8F3L2, P5F3L3, P6F3L4 Fiber 3: P11F3L1-P11F3L4 Fiber 3: P11F3L1, P12F3L2, P9F3L3, P10F3L4 Fiber 4: P11F4L1-P11F4L4 Fiber 4: P15F3L1, P16F3L2, P13F3L3, P14F3L4 Input Port 12: Output Port 28: Fiber 1: P12F1L1-P12F1L4 Fiber 1: P4F3L1, P1F3L2, P2F3L3, P3F3L4 Fiber 2: P12F2L1-P12F2L4 Fiber 2: P8F3L1, P5F3L2, P6F3L3, P7F3L4 Fiber 3: P12F3L1-P12F3L4 Fiber 3: P12F3L1, P9F3L2, P10F3L3, P11F3L4 Fiber 4: P12F4L1-P12F4L4 Fiber 4: P16F3L1, P13F3L2, P14F3L3, P15F3L4 Input Port 13: Output Port 29: Fiber 1: P13F1L1-P13F1L4 Fiber 1: P1F4L1, P2F4L2, P3F4L3, P4F4L4 Fiber 2: P13F2L1-P13F2L4 Fiber 2: P5F4L1, P6F4L2, P7F4L3, P8F4L4 Fiber 3: P13F3L1-P13F3L4 Fiber 3: P9F4L1, P10F4L2, P11F4L3, P12F4L4 Fiber 4: P13F4L1-P13F4L4 Fiber 4: P13F4L1, P14F4L2, P15F4L3, P16F4L4 Input Port 14: Output Port 30: 
Fiber 1: P14F1L1-P14F1L4 Fiber 1: P2F4L1, P3F4L2, P4F4L3, P1F4L4 Fiber 2: P14F2L1-P14F2L4 Fiber 2: P6F4L1, P7F4L2, P8F4L3, P5F4L4 Fiber 3: P14F3L1-P14F3L4 Fiber 3: P10F4L1, P11F4L2, P12F4L3, P9F4L4 Fiber 4: P14F4L1-P14F4L4 Fiber 4: P14F4L1, P15F4L2, P16F4L3, P13F4L4 Input Port 15: Output Port 31: Fiber 1: P15F1L1-P15F1L4 Fiber 1: P3F4L1, P4F4L2, P1F4L3, P2F4L4 Fiber 2: P15F2L1-P15F2L4 Fiber 2: P7F4L1, P8F4L2, P5F4L3, P6F4L4 Fiber 3: P15F3L1-P15F3L4 Fiber 3: P11F4L1, P12F4L2, P9F4L3, P10F4L4 Fiber 4: P15F4L1-P15F4L4 Fiber 4: P15F4L1, P16F4L2, P13F4L3, P14F4L4 Input Port 16: Output Port 32: Fiber 1: P16F1L1-P16F1L4 Fiber 1: P4F4L1, P1F4L2, P2F4L3, P3F4L4 Fiber 2: P16F2L1-P16F2L4 Fiber 2: P8F4L1, P5F4L2, P6F4L3, P7F4L4 Fiber 3: P16F3L1-P16F3L4 Fiber 3: P12F4L1, P9F4L2, P10F4L3, P11F4L4 Fiber 4: P16F4L1-P16F4L4 Fiber 4: P16F4L1, P13F4L2, P14F4L3, P15F4L4
(165) Continuing the example, Table 5 lists another example configuration for optical permutor 400 with respect to optical communications in the reverse, downstream direction, i.e., from core switches 22 to access nodes 17. Like Table 3 above, Table 5 illustrates another example configuration of optical permutor 400 for producing, on the optical fibers of rack-facing output ports P1-P16, a set of 64 unique permutations for combinations of core-facing input ports P17-P32 and optical wavelengths L1-L4 carried by those input ports, where no single optical output port carries multiple optical communications having the same wavelength.
(166) TABLE-US-00005 TABLE 5 Core switch-facing Access node-facing Output Input Ports for Ports for Optical Permutor Optical Permutor (permutation of wavelengths & input port) Input Port 17: Output Port 1: Fiber 1: P17F1L1-P17F1L4 Fiber 1: P17F1L1, P18F1L2, P19F1L3, P20F1L4 Fiber 2: P17F2L1-P17F2L4 Fiber 2: P21F1L1, P22F1L2, P23F1L3, P24F1L4 Fiber 3: P17F3L1-P17F3L4 Fiber 3: P25F1L1, P26F1L2, P27F1L3, P28F1L4 Fiber 4: P17F4L1-P17F4L4 Fiber 4: P29F1L1, P30F1L2, P31F1L3, P32F1L4 Input Port 18: Output Port 2: Fiber 1: P18F1L1-P18F1L4 Fiber 1: P18F1L1, P19F1L2, P20F1L3, P17F1L4 Fiber 2: P18F2L1-P18F2L4 Fiber 2: P22F1L1, P23F1L2, P24F1L3, P21F1L4 Fiber 3: P18F3L1-P18F3L4 Fiber 3: P26F1L1, P27F1L2, P28F1L3, P25F1L4 Fiber 4: P18F4L1-P18F4L4 Fiber 4: P30F1L1, P31F1L2, P32F1L3, P29F1L4 Input Port 19: Output Port 3: Fiber 1: P19F1L1-P19F1L4 Fiber 1: P19F1L1, P20F1L2, P17F1L3, P18F1L4 Fiber 2: P19F2L1-P19F2L4 Fiber 2: P23F1L1, P24F1L2, P21F1L3, P22F1L4 Fiber 3: P19F3L1-P19F3L4 Fiber 3: P27F1L1, P28F1L2, P25F1L3, P26F1L4 Fiber 4: P19F4L1-P19F4L4 Fiber 4: P31F1L1, P32F1L2, P29F1L3, P30F1L4 Input Port 20: Output Port 4: Fiber 1: P20F1L1-P20F1L4 Fiber 1: P20F1L1, P17F1L2, P18F1L3, P19F1L4 Fiber 2: P20F2L1-P20F2L4 Fiber 2: P24F1L1, P21F1L2, P22F1L3, P23F1L4 Fiber 3: P20F3L1-P20F3L4 Fiber 3: P28F1L1, P25F1L2, P26F1L3, P27F1L4 Fiber 4: P20F4L1-P20F4L4 Fiber 4: P32F1L1, P29F1L2, P30F1L3, P31F1L4 Input Port 21: Output Port 5: Fiber 1: P21F1L1-P21F1L4 Fiber 1: P17F2L1, P18F2L2, P19F2L3, P20F2L4 Fiber 2: P21F2L1-P21F2L4 Fiber 2: P21F2L1, P22F2L2, P23F2L3, P24F2L4 Fiber 3: P21F3L1-P21F3L4 Fiber 3: P25F2L1, P26F2L2, P27F2L3, P28F2L4 Fiber 4: P21F4L1-P21F4L4 Fiber 4: P29F2L1, P30F2L2, P31F2L3, P32F2L4 Input Port 22: Output Port 6: Fiber 1: P22F1L1-P22F1L4 Fiber 1: P18F2L1, P19F2L2, P20F2L3, P17F2L4 Fiber 2: P22F2L1-P22F2L4 Fiber 2: P22F2L1, P23F2L2, P24F2L3, P21F2L4 Fiber 3: P22F3L1-P22F3L4 Fiber 3: P26F2L1, P27F2L2, P28F2L3, P25F2L4 Fiber 4: P22F4L1-P22F4L4 Fiber 4: P30F2L1, P31F2L2, 
. . . P32F2L3, P29F2L4 (continuation of the preceding table entry)

Input Port 23:
  Fiber 1: P23F1L1-P23F1L4
  Fiber 2: P23F2L1-P23F2L4
  Fiber 3: P23F3L1-P23F3L4
  Fiber 4: P23F4L1-P23F4L4
Output Port 7:
  Fiber 1: P19F2L1, P20F2L2, P17F2L3, P18F2L4
  Fiber 2: P23F2L1, P24F2L2, P21F2L3, P22F2L4
  Fiber 3: P27F2L1, P28F2L2, P25F2L3, P26F2L4
  Fiber 4: P31F2L1, P32F2L2, P29F2L3, P30F2L4

Input Port 24:
  Fiber 1: P24F1L1-P24F1L4
  Fiber 2: P24F2L1-P24F2L4
  Fiber 3: P24F3L1-P24F3L4
  Fiber 4: P24F4L1-P24F4L4
Output Port 8:
  Fiber 1: P20F2L1, P17F2L2, P18F2L3, P19F2L4
  Fiber 2: P24F2L1, P21F2L2, P22F2L3, P23F2L4
  Fiber 3: P28F2L1, P25F2L2, P26F2L3, P27F2L4
  Fiber 4: P32F2L1, P29F2L2, P30F2L3, P31F2L4

Input Port 25:
  Fiber 1: P25F1L1-P25F1L4
  Fiber 2: P25F2L1-P25F2L4
  Fiber 3: P25F3L1-P25F3L4
  Fiber 4: P25F4L1-P25F4L4
Output Port 9:
  Fiber 1: P17F3L1, P18F3L2, P19F3L3, P20F3L4
  Fiber 2: P21F3L1, P22F3L2, P23F3L3, P24F3L4
  Fiber 3: P25F3L1, P26F3L2, P27F3L3, P28F3L4
  Fiber 4: P29F3L1, P30F3L2, P31F3L3, P32F3L4

Input Port 26:
  Fiber 1: P26F1L1-P26F1L4
  Fiber 2: P26F2L1-P26F2L4
  Fiber 3: P26F3L1-P26F3L4
  Fiber 4: P26F4L1-P26F4L4
Output Port 10:
  Fiber 1: P18F3L1, P19F3L2, P20F3L3, P17F3L4
  Fiber 2: P22F3L1, P23F3L2, P24F3L3, P21F3L4
  Fiber 3: P26F3L1, P27F3L2, P28F3L3, P25F3L4
  Fiber 4: P30F3L1, P31F3L2, P32F3L3, P29F3L4

Input Port 27:
  Fiber 1: P27F1L1-P27F1L4
  Fiber 2: P27F2L1-P27F2L4
  Fiber 3: P27F3L1-P27F3L4
  Fiber 4: P27F4L1-P27F4L4
Output Port 11:
  Fiber 1: P19F3L1, P20F3L2, P17F3L3, P18F3L4
  Fiber 2: P23F3L1, P24F3L2, P21F3L3, P22F3L4
  Fiber 3: P27F3L1, P28F3L2, P25F3L3, P26F3L4
  Fiber 4: P31F3L1, P32F3L2, P29F3L3, P30F3L4

Input Port 28:
  Fiber 1: P28F1L1-P28F1L4
  Fiber 2: P28F2L1-P28F2L4
  Fiber 3: P28F3L1-P28F3L4
  Fiber 4: P28F4L1-P28F4L4
Output Port 12:
  Fiber 1: P20F3L1, P17F3L2, P18F3L3, P19F3L4
  Fiber 2: P24F3L1, P21F3L2, P22F3L3, P23F3L4
  Fiber 3: P28F3L1, P25F3L2, P26F3L3, P27F3L4
  Fiber 4: P32F3L1, P29F3L2, P30F3L3, P31F3L4

Input Port 29:
  Fiber 1: P29F1L1-P29F1L4
  Fiber 2: P29F2L1-P29F2L4
  Fiber 3: P29F3L1-P29F3L4
  Fiber 4: P29F4L1-P29F4L4
Output Port 13:
  Fiber 1: P17F4L1, P18F4L2, P19F4L3, P20F4L4
  Fiber 2: P21F4L1, P22F4L2, P23F4L3, P24F4L4
  Fiber 3: P25F4L1, P26F4L2, P27F4L3, P28F4L4
  Fiber 4: P29F4L1, P30F4L2, P31F4L3, P32F4L4

Input Port 30:
  Fiber 1: P30F1L1-P30F1L4
  Fiber 2: P30F2L1-P30F2L4
  Fiber 3: P30F3L1-P30F3L4
  Fiber 4: P30F4L1-P30F4L4
Output Port 14:
  Fiber 1: P18F4L1, P19F4L2, P20F4L3, P17F4L4
  Fiber 2: P22F4L1, P23F4L2, P24F4L3, P21F4L4
  Fiber 3: P26F4L1, P27F4L2, P28F4L3, P25F4L4
  Fiber 4: P30F4L1, P31F4L2, P32F4L3, P29F4L4

Input Port 31:
  Fiber 1: P31F1L1-P31F1L4
  Fiber 2: P31F2L1-P31F2L4
  Fiber 3: P31F3L1-P31F3L4
  Fiber 4: P31F4L1-P31F4L4
Output Port 15:
  Fiber 1: P19F4L1, P20F4L2, P17F4L3, P18F4L4
  Fiber 2: P23F4L1, P24F4L2, P21F4L3, P22F4L4
  Fiber 3: P27F4L1, P28F4L2, P25F4L3, P26F4L4
  Fiber 4: P31F4L1, P32F4L2, P29F4L3, P30F4L4

Input Port 32:
  Fiber 1: P32F1L1-P32F1L4
  Fiber 2: P32F2L1-P32F2L4
  Fiber 3: P32F3L1-P32F3L4
  Fiber 4: P32F4L1-P32F4L4
Output Port 16:
  Fiber 1: P20F4L1, P17F4L2, P18F4L3, P19F4L4
  Fiber 2: P24F4L1, P21F4L2, P22F4L3, P23F4L4
  Fiber 3: P28F4L1, P25F4L2, P26F4L3, P27F4L4
  Fiber 4: P32F4L1, P29F4L2, P30F4L3, P31F4L4
(167)
(168) Optical permutor 500 includes a respective one of optical demultiplexers 600A-600N (herein, “optical demuxes 600”) for each optical input interface 520, where the optical demultiplexer is configured to demultiplex the optical communications for a given optical input onto internal optical pathways 640 based on the bandwidth of the optical communications. For example, optical demux 600A separates the optical communications received on optical input interface 520A onto a set of internal optical pathways 640A based on wavelengths λ.sub.1,1, λ.sub.1,2, . . . λ.sub.1,n. Optical demux 600B separates the optical communications received on optical input interface 520B onto a set of internal optical pathways 640B based on wavelengths λ.sub.2,1, λ.sub.2,2, . . . λ.sub.2,n. Each optical demux 600 operates in a similar fashion to separate the optical communications received from the respective input optical interface 520 so as to direct the optical communications through internal optical pathways 640 toward optical output ports 540A-540N (herein, “optical output ports 540”).
(169) Optical permutor 500 includes a respective one of optical multiplexers 620A-620N (herein, “optical muxes 620”) for each optical output port 540, where the optical multiplexer receives as input optical signals from optical pathways 640 that lead to each optical demux 600. In other words, optical pathways 640 internal to optical permutor 500 provide a full-mesh of N.sup.2 optical interconnects between optical demuxes 600 and optical muxes 620. Each optical multiplexer 620 receives N optical pathways as input and combines the optical signals carried by the N optical pathways into a single optical signal for output onto a respective optical fiber.
(170) Moreover, optical demuxes 600 are each configured such that optical communications received from input interface ports 520 are “permuted” across optical output ports 540 based on wavelength so as to provide full-mesh connectivity between the ports and in a way that ensures optical interference is avoided. That is, each optical demux 600 is configured to ensure that each optical output port 540 receives a different one of the possible unique permutations of the combinations of optical input ports 520 and the optical frequencies carried by those ports, and that no single optical output port 540 carries communications having the same wavelength.
(171) For example, optical demux 600A may be configured to direct the optical signal having wavelength λ.sub.1,1 to optical mux 620A, wavelength λ.sub.1,2 to optical mux 620B, wavelength λ.sub.1,3 to optical mux 620C, . . . and wavelength λ.sub.1,n to optical mux 620N. Optical demux 600B is configured to deliver a different (second) permutation of optical signals by outputting wavelength λ.sub.2,n to optical mux 620A, wavelength λ.sub.2,1 to optical mux 620B, wavelength λ.sub.2,2 to optical mux 620C, . . . and wavelength λ.sub.2,n−1 to optical mux 620N. Optical demux 600C is configured to deliver a different (third) permutation of optical signals by outputting wavelength λ.sub.3,n−1 to optical mux 620A, wavelength λ.sub.3,n to optical mux 620B, wavelength λ.sub.3,1 to optical mux 620C, . . . and wavelength λ.sub.3,n−2 to optical mux 620N. This example configuration pattern continues through optical demux 600N, which is configured to deliver a different (N.sup.th) permutation of optical signals by outputting wavelength λ.sub.N,2 to optical mux 620A, wavelength λ.sub.N,3 to optical mux 620B, wavelength λ.sub.N,4 to optical mux 620C, . . . and wavelength λ.sub.N,1 to optical mux 620N.
(172) In the example implementation, optical pathways 640 are arranged such that the different permutations of input interface/wavelengths are delivered to optical muxes 620. In other words, each optical demux 600 may be configured to operate in a similar manner, such as λ.sub.1 being provided to a first port of the demux, λ.sub.2 being provided to a second port of the demux, . . . and λ.sub.n being provided to an N.sup.th port of the demux. Optical pathways 640 are arranged to optically deliver a specific permutation of wavelengths to each optical mux 620 such that communications from any one of optical demuxes 600 can reach any optical mux 620 and, moreover, each permutation of wavelengths is selected to avoid any interference between the signals, i.e., to be non-overlapping.
(173) For example, as shown in
(174) In this way, a different permutation of input optical interface/wavelength combination is provided to each optical mux 620 and, moreover, each one of the permutations provided to the respective optical mux is guaranteed to include optical communications having non-overlapping wavelengths.
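The cyclic-shift pattern of paragraphs (171)-(172) can be sketched in a few lines of code. This is a hedged illustration, not the patent's implementation: it assumes each demux applies a cyclic shift of the wavelength indices, which matches the example permutations given above, and the function names are invented for this sketch.

```python
def permutation_table(n):
    """table[i][j] = 1-based wavelength index that demux i sends to mux j."""
    return [[((j - i) % n) + 1 for j in range(n)] for i in range(n)]

def is_interference_free(table):
    """Each mux (column) must see n distinct wavelengths, one per demux."""
    n = len(table)
    return all(len({table[i][j] for i in range(n)}) == n for j in range(n))

# For n = 4: demux 600A sends wavelengths 1,2,3,4 to muxes 620A..620N in
# order, and demux 600B sends 4,1,2,3 -- i.e., wavelength λ2,n to mux 620A
# and λ2,n−1 to mux 620N, matching the second permutation in the text.
table = permutation_table(4)
```

Because every column of the table is itself a permutation of the wavelength indices, no mux ever receives the same wavelength twice, which is the non-overlap property the text requires.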
(175) Optical permutor 500 illustrates one example implementation of the techniques described herein. In other example implementations, each optical interface 520 need not receive all N wavelengths from a single optical fiber. For example, different subsets of the N wavelengths can be provided by multiple fibers, which would then be combined (e.g., by a multiplexer) and subsequently permuted as described herein. As one example, optical permutor 500 may have 2N optical inputs 520 so as to receive 2N optical fibers, where a first subset of N optical fibers carries wavelengths λ.sub.1 . . . λ.sub.n/2 and a second subset of N optical fibers carries wavelengths λ.sub.n/2+1 . . . λ.sub.n. Light from pairs of the optical inputs from the first and second subsets may be combined to form optical inputs carrying N wavelengths, which may then be permuted as shown in the example of
(176) In the example implementation, optical permutor 500, including optical input ports 520, optical demuxes 600, optical pathways 640, optical muxes 620 and optical output ports 540, may be implemented as one or more application-specific integrated circuits (ASICs), such as a photonic integrated circuit or an integrated optical circuit. In other words, the optical functions described herein may be integrated on a single chip, thereby providing an integrated optical permutor that may be incorporated into electronic cards, devices and systems.
(177)
(178) In the example implementation of
(179) In other examples, optical permutors 132, 400, 500 may make use of star couplers and waveguide grating routers described in Kaminow, “Optical Integrated Circuits: A Personal Perspective,” Journal of Lightwave Technology, vol. 26, no. 9, May 1, 2008, the entire contents of which are incorporated herein by reference.
(180)
(181) In some example implementations, intra-access node group permutators may be used to provide multiplexed packet spraying between access nodes, which may further increase the scalability of the network systems described herein. That is, in some examples, one or more first stage permutation devices may be coupled between access nodes of the same access node group to increase the fan out as individual access nodes spray packets for packet flows across other access nodes within the same access node group.
(182)
(183) In general, electrical permutor 2000 permutes packet-based communications (e.g., packet flows) from a set of input ports 2020 across a set of output ports 2040. As further explained, electrical permutor 2000 is an interconnect device that multiplexes packet-based communications between access nodes, such as access nodes of a common access node group that service one or more racks of servers. Source components (SF) 2005 and source switching components (SX) 2007 of the access nodes of access node group 2001 are shown in the example of
(184) In this way, the SF component for each access node may be configured to spray packets it receives from servers toward all other access nodes in access node group 2001 such that the packets transmitted to electrical permutor 2000 from any given access node over any link 2002 are a mixture of packets from any of the servers supported by the respective access node. As described in further detail below, electrical permutor 2000 operates to provide seamless connectivity from any of the source components (SFs) to any of the source switching components (SXs) within access node group 2001. Moreover, use of electrical permutor 2000 may achieve significantly higher fanout from an individual SF to an increased number of SXs, thus effectively increasing the size of access node group 2001, without requiring additional cabling or network ports.
(185) For example, in the example of
(186) More specifically, example electrical permutor 2000 includes a plurality of input ports 2020A-2020P (herein, “input ports 2020”) to receive electrical signals from a respective access node, where each of the electrical signals for the different links carries packets having up to n unique source network addresses from the transmitting SF component of the respective access node. In this way, communications received by electrical permutor 2000 on each of links 2002 may be viewed as n unique electrical streams of packets designated as ES.sub.p,n, where the subscript p represents one of the P input ports and the subscript n represents the source network address used by the transmitting access node that initially received the packet from one of the servers and is spraying the packet toward the SXs of the access node group. Thus, using this nomenclature, input 2020A receives n electrical streams (ES) carrying variable-sized packets sprayed across the access node group(s), where the unique fixed-bandwidth electrical streams flowing through input port 2020A utilize n different source network addresses and are designated as electrical streams ES.sub.1,1, ES.sub.1,2, . . . ES.sub.1,n. Similarly, input 2020B receives n electrical streams carrying variable-sized packets sprayed across the access node group, where the different electrical streams are designated ES.sub.2,1, ES.sub.2,2, . . . ES.sub.2,n. This pattern continues for all P input ports connected to access nodes, where each access node is assigned n different network addresses for use as source network addresses.
(187) As shown in
(188) For example, in
(189) In this way, electrical permutor 2000 is configured to direct communications from input ports 2020 out of output ports 2040 such that each output port 2040 carries a different one of the possible unique permutations of the combinations of input ports 2020 and the electrical streams carried by those ports for the different network addresses used by the access nodes of the access node group, where each output port 2040 carries one of the electrical streams from every one of the access nodes and for each of the different input ports. As such, in this example, each output port 2040 carries an electrical stream of packet flows from each of the access nodes of the access node group, thus allowing a full mesh of communication. Electrical permutor 2000 may be used to achieve enhanced fanout in a first level of packet spraying within an access node group prior to a first layer three (L3) lookup and forwarding of the packets by the SX component of the access nodes within the access node group, and in some example implementations may be deployed in conjunction with any of the optical permutors described herein (e.g., OPs 132, OPs 204, OPs 302, OP 400 and OP 500) to provide bi-directional, full-mesh point-to-point connectivity for transporting communications for servers 12 to/from core switches 22. In some examples, electrical permutor 2000 has only a minimal packet buffer to absorb any collision of packets arriving from multiple input ports simultaneously, and, in some examples, no output port 2040 is oversubscribed for bandwidth.
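The permutation property stated in paragraph (189), that every output port carries exactly one electrical stream from every input port with no repeated source-address index, can be illustrated with a toy mapping. This sketch assumes the number of source addresses per access node equals the number of ports and uses a simple modular rule; the actual device keys on low-order source-address bits and need not use this exact permutation.

```python
def output_port(p, s, P):
    """Output port for electrical stream ES[p][s] arriving on input port p,
    under an assumed modular permutation rule (one valid choice of many)."""
    return (p + s) % P

def streams_per_output(P):
    """For each output port, the list of (input port, stream index) pairs
    that the permutation steers to it."""
    table = {o: [] for o in range(P)}
    for p in range(P):
        for s in range(P):
            table[output_port(p, s, P)].append((p, s))
    return table
```

Checking the resulting table shows the full-mesh property: every output port receives exactly one stream from each input port, and the stream indices it receives are all distinct, so no two flows on an output share a source network address.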
(190) As such, an access network or switch fabric 14 (see
(191) As described above, network implementations using the electrical permutor may be useful for increasing the radix (fanout) of any access node group. For example, large network implementations may readily be achieved, such as larger access node groups (e.g., 32, 64 or 128 access nodes) supporting a large number of racks (e.g., 1024 racks), as one example.
(192) In the example of
(193)
(194) In general, each EP operates in a manner similar to that described with reference to EP 2000 of
(195)
(196) In general, an example packet originating from a server is received by a respective access node (step 1). The SF component of the access node receiving the outbound packet sprays the packet to any other of the access nodes in the access node group (step 2). At this time, the access node selects the appropriate source MAC address to be used when encapsulating the packet within a tunnel packet and forwards the packet toward one of the electrical permutors (step 3).
(197) Upon receiving the packet from the source access node, the EP multiplexes the packet to its appropriate output port based on the low-order bits of the source network address (e.g., MAC address) (step 4). The receiving EP delivers the packet directly to its corresponding port without permutation (step 5). For example, for the EP receiving packets from the transmitting EP, all electrical streams of packets received on input port 0 would be delivered to output port 0, all electrical streams of packets received on input port 1 would be delivered to output port 1, and so on. In this way, the pair of EPs mimics a direct connection between access nodes.
(198) Upon receiving the sprayed packet, the SX component of the receiving access node within the access node group performs a layer 3 lookup on the destination IP address for the tunneled packet to determine a destination access node for terminating the transport of the packet (step 6). If the access node receiving the sprayed packet is the destination access node, such that the destination server is a local server to that access node, the receiving access node performs packet extraction from the tunnel packet, packet reordering and delivery to the destination server (step 7). If the destination access node is a different access node located within Rack Group 0 (i.e., within the same access node group), the access node forwards the packet to that local access node (e.g., via the EPs as discussed herein) for packet extraction, packet reordering and delivery to the destination server (step 8). Otherwise, the access node forwards the packet toward the Rack Group having the destination access node by way of one of the core switches (step 9), for packet reordering and delivery to the destination server (step 10).
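The paired-EP behavior of steps 4-5 above can be summarized as two small functions: the transmitting EP permutes on the low-order bits of the source MAC address, while the receiving EP applies an identity mapping so the pair behaves like a direct SF-to-SX connection. The 16-port radix and the bit mask are illustrative assumptions, not values fixed by the description.

```python
NUM_PORTS = 16  # assumed EP radix for this sketch

def transmitting_ep_port(src_mac: int) -> int:
    """Step 4: select the output port from the low-order bits of the
    source MAC address (here, the low log2(NUM_PORTS) bits)."""
    return src_mac & (NUM_PORTS - 1)

def receiving_ep_port(input_port: int) -> int:
    """Step 5: identity mapping -- input port k delivers to output port k,
    so the EP pair mimics a direct connection."""
    return input_port
```

For a source MAC whose low nibble is 0xB, the transmitting EP steers the packet to output port 11, and the receiving EP passes it straight through on port 11 toward the corresponding SX.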
(199) The example described with respect to
(200)
(201) In general, the Ingress Port Block receives input communications, such as Ethernet packets, and implements an Ethernet MAC with Physical Coding Sublayer and Forward Error Correction code. Upon receiving an Ethernet frame, the Ingress Port Block sends the received Ethernet frame to the parser block for parsing.
(202) The parser block parses the incoming Ethernet frame and is configured to determine the egress port to forward the frame in accordance with the permutation requirements described herein.
(203) The Packet Writer block of the electrical permutor writes incoming Ethernet frames in the Packet Buffer. Hardware logic manages the buffer in fixed size cell units of 256 bytes. The Packet Writer Block interfaces with a Free Cell Manager to obtain addresses of free cells within the Packet Buffer. The Packet Writer Block uses the free cell addresses received from the Free Cell Manager to write incoming frames for temporary storage.
(204) The Packet Write Arbiter provides a controlled interface to the Packet Buffer and multiplexes accesses from the different packet writers. For example, the Packet Write Arbiter arbitrates the packet buffer address port and multiplexes incoming cells to be written to packet buffer memory. The Packet Write Arbiter creates a packet in the form of a list of cells within the Packet Buffer and enqueues the cell addresses in the Egress Packet Queue block for subsequent transmission.
(205) In one example, the Packet Buffer is sized to buffer (n−1)*n*(packet MTU size) bytes to guarantee zero packet loss, where n represents the number of output ports. In the case where a shared buffer is used, as shown in
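The zero-loss buffer bound stated above, (n−1)*n*MTU bytes for n output ports, can be worked through numerically. The 16-port radix and 1500-byte MTU below are illustrative assumptions; the description does not fix either value.

```python
CELL_SIZE = 256  # bytes per cell, as managed by the Free Cell Manager

def min_buffer_bytes(n_ports, mtu):
    """Zero-loss buffer bound from the text: (n - 1) * n * MTU bytes."""
    return (n_ports - 1) * n_ports * mtu

def cells_required(n_ports, mtu):
    """Number of 256-byte cells needed to hold the worst-case backlog."""
    total = min_buffer_bytes(n_ports, mtu)
    return -(-total // CELL_SIZE)  # ceiling division

# For a 16-port permutor with 1500-byte frames:
# 15 * 16 * 1500 = 360,000 bytes, i.e. 1,407 cells of 256 bytes.
```

The bound reflects the worst case in which every one of the n input ports simultaneously targets the same output with near-MTU frames destined for each of the other outputs.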
(206) The Egress Packet Queue block implements a set of (e.g., 8 or 16) packet queues for each egress port of electrical permutor 2150. Packet lists are implemented in the form of linked lists, and the egress packet queues share a common storage structure.
(207) An Egress Scheduler accesses the cell addresses to be popped from the Egress Packet Queues and schedules the packet from non-empty queues for transmission out the corresponding egress port. The Egress Scheduler sends the packet with its description information to the corresponding packet reader block for reading the corresponding packet from the Packet Buffer.
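The Egress Packet Queue and Egress Scheduler interaction described above can be sketched minimally: packets are represented as lists of cell addresses, queued per egress port, and a pass over the non-empty queues selects the next packet to hand to a packet reader. The class name, queue-per-port layout, and in-order scheduling policy are assumptions made for this sketch; the description does not specify the scheduling discipline.

```python
from collections import deque

class EgressQueues:
    """Per-port FIFOs of packets, each packet a linked list of cell
    addresses (modeled here as a Python list)."""

    def __init__(self, num_ports):
        self.queues = [deque() for _ in range(num_ports)]

    def enqueue(self, port, cell_addresses):
        # Packet Write Arbiter hands over the cell-address list for a packet.
        self.queues[port].append(cell_addresses)

    def schedule(self):
        """One scheduler pass: yield (port, cell addresses) for each
        non-empty queue so the packet reader can fetch the cells."""
        for port, q in enumerate(self.queues):
            if q:
                yield port, q.popleft()
```

After a packet's cells are read out, the real device returns the cell addresses to the Free Cell Manager; that step is omitted here for brevity.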
(208) For example, the Packet Reader block of electrical permutor 2150 is responsible for traversing the linked list of the packet to be transmitted and generating addresses into the Packet Buffer to read the cell data. After reading the data from the Packet Buffer, the Packet Reader block sends the cell addresses to the Free Cell Manager to free the packet buffer space previously occupied by the packet.
(209) The Packet Read Address Mux multiplexes the different read addresses and issues a read command to the Packet Buffer block to read the packet data. The packet data read from the Packet Buffer is sent to the egress port group, which transmits the packet data with the appropriate Ethernet CRC.
(210) In the example of
(211)
(212) As shown in this example, a set of access nodes 17 exchange control plane messages to establish a logical tunnel over a plurality of parallel network paths that provide packet-based connectivity between the access nodes (3101). For example, with respect to
(213) Once the logical tunnel is established, one of the access nodes (referred to as the ‘transmitting access node’ in
(214) Upon receiving a grant from the access node associated with the destination of the packet data to be transmitted (3108), the transmitting access node encapsulates the outbound packets within tunnel packets, thereby forming each tunnel packet to have a tunnel header for traversing the logical tunnel and a payload containing one or more of the outbound packets (3104).
(215) Upon forming the tunnel packets, the transmitting access node forwards the tunnel packets by spraying the tunnel packets over the multiple, parallel paths through switch fabric 14 by which the receiving access node is reachable (3106). In some example implementations, the transmitting access node may, prior to forwarding the tunnel packets to switch fabric 14, spray the tunnel packets across multiple access nodes that, for example, form one or more access node groups (e.g., within one or more rack groups proximate to the transmitting access node, as one example), thereby providing a first level fanout for distributing the tunnel packets across the parallel paths. See, for example,
(216) Upon receipt, the receiving access node extracts the original packets that are encapsulated within the tunnel packets (3109), reorders the original packets in the order sent by the application or storage server, and delivers the packets to the intended destination server (3110).
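The request/grant, encapsulation, spraying, and reorder sequence of steps 3102-3110 above can be walked through schematically. All names below are hypothetical, the grant is issued unconditionally, and path selection is shown only as a random choice among parallel paths; this illustrates the ordering of the steps, not a real protocol implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class TunnelPacket:
    seq: int          # sequence number used for reordering at the receiver
    payload: bytes

@dataclass
class ReceivingNode:
    buffered: list = field(default_factory=list)

    def grant(self, requested_bytes: int) -> bool:
        # step 3108: the destination issues a grant (always, in this sketch)
        return True

    def receive(self, pkt: TunnelPacket):
        self.buffered.append(pkt)

    def deliver_in_order(self):
        # steps 3109-3110: extract payloads and reorder before delivery
        return [p.payload for p in sorted(self.buffered, key=lambda p: p.seq)]

def transmit(payloads, receiver, num_paths=4):
    # request/grant handshake before sending (steps 3102/3108)
    if not receiver.grant(sum(len(p) for p in payloads)):
        return
    for seq, payload in enumerate(payloads):
        pkt = TunnelPacket(seq, payload)      # step 3104: encapsulate
        _path = random.randrange(num_paths)   # step 3106: spray over a path
        receiver.receive(pkt)                 # every path reaches the receiver
```

Because each tunnel packet carries a sequence number, the receiver can restore the original order even though spraying lets packets of one flow traverse different parallel paths.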