CHIP-TO-CHIP INTERCONNECT WITH A LAYERED COMMUNICATION ARCHITECTURE
20240378097 · 2024-11-14
Assignee
Inventors
- Pankaj Kansal (Fremont, CA, US)
- Arvind Srinivasan (San Jose, CA)
- Harikrishna Madadi Reddy (San Jose, CA, US)
- Naader Hasani (Menlo Park, CA, US)
Cpc classification
G06F15/17337
PHYSICS
G06F15/7825
PHYSICS
International classification
Abstract
A system includes a first integrated circuit package including a first group of one or more artificial intelligence processing units and a first chip-to-chip interconnect communication unit and a second integrated circuit package including a second group of one or more artificial intelligence processing units and a second chip-to-chip interconnect communication unit. The system also includes an interconnect between the first integrated circuit package and the second integrated circuit package, wherein the first chip-to-chip interconnect communication unit and the second chip-to-chip interconnect communication unit manage Ethernet-based communication via the interconnect using a layered communication architecture supporting a credit-based data flow control and a retransmission data flow control.
Claims
1. A first integrated circuit device comprising: one or more processing units to generate data packets to be sent to a second integrated circuit device; and an interconnect communication unit to send the data packets to the second integrated circuit device via a chip-to-chip interconnect, and manage the data packets sent to the second integrated circuit device via the chip-to-chip interconnect using a layered communication architecture supporting a credit-based data flow control that updates a total-blocks-sent counter that counts a total number of blocks sent to the second integrated circuit device based on a block size of each sent data packet, wherein the chip-to-chip interconnect comprises a direct chip-to-chip connection between the first integrated circuit device as a first chip and the second integrated circuit device as a second chip to provide a direct communication between the first integrated circuit device and the second integrated circuit device.
2. The first integrated circuit device of claim 1, wherein the interconnect communication unit comprises a send buffer to store the data packets sent to the second integrated circuit device via the chip-to-chip interconnect, and wherein the interconnect communication unit is to remove one of the data packets stored in the send buffer upon receiving an acknowledgement of the data packet from the second integrated circuit device.
3. The first integrated circuit device of claim 1, wherein the interconnect communication unit is to send an updated count of the total-blocks-sent counter to the second integrated circuit device.
4. The first integrated circuit device of claim 3, wherein the interconnect communication unit is to: receive a credit-limit value of the second integrated circuit device, and determine whether to send another data packet to the second integrated circuit device based on whether a sum of the updated count of the total-blocks-sent counter and a block size of the other data packet is less than or equal to the credit-limit value of the second integrated circuit device.
5. The first integrated circuit device of claim 1, wherein the interconnect communication unit further comprises: a receive buffer to store data packets received from the second integrated circuit device via the chip-to-chip interconnect, and an absolute-blocks-received counter that counts a total number of blocks received from the second integrated circuit device based on block sizes of the received data packets.
6. The first integrated circuit device of claim 5, wherein the interconnect communication unit is to: calculate a credit limit of the first integrated circuit device by adding a count of the absolute-blocks-received counter and available space of the receive buffer, and send the calculated credit limit of the first integrated circuit device to the second integrated circuit device, wherein the second integrated circuit device is to determine whether to send another data packet to the first integrated circuit device based on the calculated credit limit of the first integrated circuit device.
7. The first integrated circuit device of claim 1, wherein the interconnect communication unit comprises circuitry logic configured to support a physical layer network communication protocol, including an Ethernet communication protocol, to send the data packets to the second integrated circuit device via the chip-to-chip interconnect.
8. The first integrated circuit device of claim 7, wherein the interconnect communication unit further comprises additional circuitry logic configured to support communication protocols beyond the physical layer network communication protocol.
9. The first integrated circuit device of claim 1, wherein the layered communication architecture of the interconnect communication unit comprises an application layer configured to manage work requests from a software application and access data in a high-bandwidth memory, and a message-dispatch-and-reassembly layer configured to manage segmentation and reassembly of the accessed data into the data packets for dispatching based on the work requests presented from the application layer.
10. The first integrated circuit device of claim 1, wherein the layered communication architecture of the interconnect communication unit further supports a retransmission data flow control that communicates with the second integrated circuit device when a data packet sent from the second integrated circuit device was received with error and causes the second integrated circuit device to retransmit the data packet received with error.
11. A method comprising: configuring a first integrated circuit device comprising one or more processing units to generate data packets and an interconnect communication unit to send the data packets to a second integrated circuit device; establishing a chip-to-chip interconnect between the first integrated circuit device and the second integrated circuit device, wherein the chip-to-chip interconnect comprises a direct chip-to-chip connection between the first integrated circuit device as a first chip and the second integrated circuit device as a second chip to provide a direct communication between the first integrated circuit device and the second integrated circuit device; causing the interconnect communication unit to send the data packets to the second integrated circuit device via the chip-to-chip interconnect; and causing the interconnect communication unit to manage the data packets sent to the second integrated circuit device via the chip-to-chip interconnect using a layered communication architecture supporting a credit-based data flow control that updates a total-blocks-sent counter that counts a total number of blocks sent to the second integrated circuit device based on a block size of each sent data packet.
12. The method of claim 11, further comprising: providing a send buffer in the interconnect communication unit to store the data packets sent to the second integrated circuit device via the chip-to-chip interconnect; and removing, by the interconnect communication unit, one of the data packets stored in the send buffer upon receiving an acknowledgement of the data packet from the second integrated circuit device.
13. The method of claim 11, further comprising: sending, by the interconnect communication unit, an updated count of the total-blocks-sent counter to the second integrated circuit device; receiving, by the interconnect communication unit, a credit-limit value of the second integrated circuit device; and determining, by the interconnect communication unit, whether to send another data packet to the second integrated circuit device via the chip-to-chip interconnect based on whether a sum of the updated count of the total-blocks-sent counter and a block size of the other data packet is less than or equal to the credit-limit value of the second integrated circuit device.
14. The method of claim 11, further comprising: providing a receive buffer in the interconnect communication unit to store received data packets sent from the second integrated circuit device via the chip-to-chip interconnect; and providing an absolute-blocks-received counter to count a total number of blocks received from the second integrated circuit device based on a block size of the received data packets.
15. The method of claim 14, further comprising: calculating, by the interconnect communication unit, a credit limit of the first integrated circuit device by adding a count of the absolute-blocks-received counter and available space of the receive buffer; and sending, by the interconnect communication unit, the calculated credit limit of the first integrated circuit device to the second integrated circuit device, wherein the second integrated circuit device determines whether to send another data packet to the first integrated circuit device based on the calculated credit limit of the first integrated circuit device.
16. The method of claim 11, wherein the interconnect communication unit comprises circuitry logic that supports a physical layer network communication protocol, including an Ethernet communication protocol, to send the data packets to the second integrated circuit device via the chip-to-chip interconnect.
17. A first integrated circuit device comprising: one or more processing units to generate data packets; and an interconnect communication unit comprising: circuitry logic configured to support a physical layer network communication protocol to send the data packets to a second integrated circuit device via a chip-to-chip interconnect, wherein the chip-to-chip interconnect comprises a direct chip-to-chip connection between the first integrated circuit device as a first chip and the second integrated circuit device as a second chip, a send buffer to store the data packets sent to the second integrated circuit device via the chip-to-chip interconnect, and a total-blocks-sent counter to count a total number of blocks sent to the second integrated circuit device, wherein the interconnect communication unit is configured with a layered communication architecture supporting a credit-based data flow control that updates the total-blocks-sent counter based on a block size of each sent data packet.
18. The first integrated circuit device of claim 17, wherein the interconnect communication unit is to: receive a credit-limit value of the second integrated circuit device, and determine whether to send another data packet to the second integrated circuit device via the chip-to-chip interconnect based on whether a sum of the updated count of the total-blocks-sent counter and a block size of the other data packet is less than or equal to the credit-limit value of the second integrated circuit device.
19. The first integrated circuit device of claim 17, wherein the interconnect communication unit further comprises: a receive buffer to store received data packets sent from the second integrated circuit device via the chip-to-chip interconnect, and an absolute-blocks-received counter that counts a total number of blocks received from the second integrated circuit device based on block sizes of the received data packets.
20. The first integrated circuit device of claim 19, wherein the interconnect communication unit is to: calculate a credit limit of the first integrated circuit device by adding a count of the absolute-blocks-received counter and available space of the receive buffer, and send the calculated credit limit of the first integrated circuit device to the second integrated circuit device, wherein the second integrated circuit device determines whether to send another data packet to the first integrated circuit device based on the calculated credit limit of the first integrated circuit device.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
[0004]
[0005]
[0006]
[0007]
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
DETAILED DESCRIPTION
[0017] The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term processor refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
[0018] A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
[0019] A system includes a first integrated circuit package including a first group of one or more artificial intelligence processing units and a first chip-to-chip interconnect communication unit and a second integrated circuit package including a second group of one or more artificial intelligence processing units and a second chip-to-chip interconnect communication unit. The system also includes an interconnect between the first integrated circuit package and the second integrated circuit package, wherein the first chip-to-chip interconnect communication unit and the second chip-to-chip interconnect communication unit manage Ethernet-based communication via the interconnect using a layered communication architecture supporting a credit-based data flow control and a retransmission data flow control.
[0020] In various embodiments, a protocol is defined to connect two compute devices. Such a connection can be described as a chip-to-chip interconnect. As used herein, a chip includes semiconductor material cut from a larger wafer of material on which transistors can be etched. Chips may also be referred to as computer chips, microchips, integrated circuits (ICs), silicon chips, etc. Advantages of the techniques disclosed herein include scalability and more efficient communication protocols. A problem to solve is how to achieve fast and flexible chip-to-chip communication. The problem arises because oftentimes chip-to-chip communication relies on chip-specific communication protocols that are neither flexible nor scalable. As described in further detail herein, a solution includes leveraging an existing physical layer protocol (e.g., Ethernet) that is widely used and/or implemented in existing devices for use for chip-to-chip communication. In order to leverage the existing physical layer protocol, in various embodiments, a layered communication architecture is deployed. In various embodiments, the layered communication architecture supports credit-based flow control and packet retransmission techniques.
[0021] The techniques disclosed herein address problems associated with the growth of artificial intelligence (AI) clusters. In AI clusters, many types of interfaces typically exist. For example, an Internet protocol may be used, which may determine network speed, but another interface may be used for chip-to-chip communication, which would determine chip-to-chip speed. Different proprietary chip-to-chip protocols may be used at the same time. Using the techniques disclosed herein, a standard physical protocol can be adopted for chip-to-chip communication to allow for improved scalability, and additional application and reliability protocol layers can be incorporated to support the standard physical protocol. Using these layers, it is possible to scale to any speed because the standard physical protocol is implemented by all devices, meaning device-to-device communication is possible without specialized switching hardware. As described in further detail herein, technological advantages of such an approach include: 1) a layered architecture that carves out responsibilities for each layer, offering a flexible application layer interface that can be tuned for any application, an interconnect that can be split into multiple parallel slices to meet a required interface bandwidth (packets can be sprayed among multiple slices using efficient load balancing and congestion control), and a scalable architecture with possible extensions to one-hop networks through a standard switch; 2) full duplex high-speed and low-latency links scalable to network speed growth; 3) point-to-point resilient and lossless connection; 4) support for credit-based flow control to manage traffic congestion and receive overflow; and 5) low overhead and area footprint.
[0022]
[0023] In the example shown, integrated circuit packages 102 and 110 are separate devices. In some embodiments, integrated circuit packages 102 and 110 are separate AI accelerators. An AI accelerator refers to a high-performance computation device that is specifically designed for the efficient processing of AI workloads, e.g., neural network training. Processing unit(s) 104 and processing unit(s) 112 of integrated circuit package 102 and integrated circuit package 110, respectively, can be comprised of various types of one or more processors. Examples of processors include application-specific integrated circuits (ASICs), graphics processing units (GPUs), central processing units (CPUs), field-programmable gate arrays (FPGAs), multicore scalar processors, spatial accelerators, or other types of components configured to perform logical and arithmetical operations on data as specified in instructions. In various embodiments, processing unit(s) 104 and processing unit(s) 112 handle AI workloads. Stated alternatively, in various embodiments, processing unit(s) 104 and processing unit(s) 112 are artificial intelligence processing units.
[0024] In the example illustrated, integrated circuit package 102 and integrated circuit package 110 are connected directly via interconnect 108. Thus, in the example shown, interconnect 108 is a chip-to-chip interconnect. Stated alternatively, interconnect 108 includes a physical path between integrated circuit package 102 and integrated circuit package 110 for data exchange. With respect to the physical path, in some embodiments, interconnect 108 comprises conventional electric wire interconnection. It is also possible for other signal transmission modes to be employed, e.g., interconnection through optical or wireless interconnection techniques. In various embodiments, as described in further detail herein, a layered architecture is utilized to interface integrated circuit package 102 and integrated circuit package 110. An interface for integrated circuit package 102 and/or integrated circuit package 110 refers to a logical stateful connection between integrated circuit packages 102 and 110 following a common protocol. In the example shown, integrated circuit packages 102 and 110 are connected in a direct connection topology. It is also possible for integrated circuit packages 102 and 110 to be connected through a switch topology. For example, integrated circuit packages 102 and 110 may be linked via a single hop connection in which integrated circuit package 102 communicates with integrated circuit package 110 through another device (e.g., a hardware switch). In such scenarios, the communication interfaces described herein would also be implemented on the intermediary device (e.g., the hardware switch). The example illustrated shows a point-to-point connection between two devices. This example is illustrative and not restrictive. Using the techniques disclosed herein, any number of devices may be connected in any of various manners (e.g., point-to-multi-point via star, mesh, bus, or other topologies).
[0025] In the example shown, chip-to-chip interconnect communication units 106 and 114 implement communication interfaces for integrated circuit packages 102 and 110, respectively. Stated alternatively, a chip-to-chip interconnect communication unit of a device handles communication of data (e.g., created by a processing unit) to another device through interconnect 108. In the example illustrated, each chip-to-chip interconnect communication unit is shown as separate from associated processing unit(s). It is also possible for the chip-to-chip interconnect communication unit to be integrated with one or more processing units (e.g., reside on a same integrated circuit). In some embodiments, the chip-to-chip interconnect communication unit is included on a hardware component that is specifically designed to connect different computers (e.g., a network interface controller). As described in further detail below, in some embodiments, the chip-to-chip interconnect communication unit includes Ethernet communication logic, additional protocol logic, and a hardware buffer.
[0026]
[0027] In the example illustrated, chip-to-chip interconnect communication unit 200 includes Ethernet communication logic 202, additional protocol logic 204 (computational logic), and hardware buffer 206. In some embodiments, a network interface controller (NIC) (also known as a network interface card, network adapter, or local area network adapter) is included in chip-to-chip interconnect communication unit 200 and includes Ethernet communication logic 202. Ethernet communication logic 202 includes electronic circuitry configured to communicate using a specific physical layer (e.g., interconnect 108 of
[0028] In various embodiments, additional protocol logic 204 includes electronic circuitry configured to communicate using various communication protocols of a layered communication architecture. Additional protocol logic 204 supports protocols beyond the physical layer communication that Ethernet communication logic 202 supports. In some embodiments, additional protocol logic 204 is included in a dedicated integrated circuit, e.g., on an ASIC. It is also possible for additional protocol logic 204 to be integrated on a same chip as Ethernet communication logic 202. In some embodiments, additional protocol logic 204 includes electronic circuitry configured to perform the processing of application layer 302 of
[0029] Hardware buffer 206 is configured to store data being processed during various phases of receive and transmit across various layers of a layered communication architecture (e.g., architecture 300 of
[0030]
[0031] In the example illustrated, architecture 300 is comprised of application layer 302, message dispatch and reassembly layer 304, reliability layer 306, and physical network layer 308. In the example illustrated, application layer 302 and message dispatch and reassembly layer 304 communicate with high bandwidth memory 310. Layering of communication protocols as shown for architecture 300 has the advantage of allowing for flexibility and lighter protocols. It is possible to independently change or add additional features to each layer.
[0032] In various embodiments, application layer 302 interfaces with a communication library and manages work requests from a software application. In various embodiments, from a workload standpoint, abstraction is provided to applications through a send buffer and receive buffer interface. The send buffer and receive buffer hold data between two communication partners and manage data reliability and credit flow. The send buffer is also used for holding unacknowledged packets and hence manages any data loss recovery. The receive buffer can be a data holding buffer for pipeline purposes. In some embodiments, the work requests from applications can be managed by work request managers and interface with direct memory access (DMA) engines. A DMA engine can move data between high-bandwidth memory (HBM) and send/receive buffers. Managing work requests, including managing transmit and receive data flow, is described in further detail herein (e.g., see
[0033] With respect to transmitting data, in various embodiments, message dispatch and reassembly layer 304 is responsible for segmentation and reassembly of the message packets based on work requests presented from application layer 302. As part of the message segmentation, message dispatch and reassembly layer 304 determines a load balancing scheme across various reliability layers and dispatches the packets accordingly. In some embodiments, switching within a device (e.g., integrated circuit packages 102 or 110 of
[0034] In various embodiments, reliability layer 306 is maintained per network path, and the network path can be configured as a logical/physical port of an endpoint device. In various embodiments, packets that go through reliability layer 306 are transmitted and received in order and reliability layer 306 maintains a monotonically incrementing sequence that indicates the order of transmission and thereby packet receipt order. In various embodiments, reliability layer 306 maintains end-to-end credits, acknowledgements for packet data, and retransmissions when packet drops occur. Credit-based data flow is described in further detail herein (e.g., see
[0035] In various embodiments, physical network layer 308 is the interface into a connection with another device. In some embodiments, the logic for physical network layer 308 resides in Ethernet communication logic 202 of
[0036]
[0037] In framework 400, with respect to transmit (TX) direction 402, data from re-transmit buffer 404 (or another data storage location) goes through MAC 406 and Serdes 408 functional blocks. With respect to receive (RX) direction 410, data is received via Serdes 408 and MAC 406 functional blocks and placed in RX buffer 412. In some embodiments, MAC 406 and Serdes 408 are included in physical network layer 308 of
[0038] In various embodiments, various features are deployed and implemented for reliable and lossless traffic between link partners, e.g., to prevent data corruption and packet loss. In some embodiments, MAC 406 utilizes a cyclic redundancy check (CRC) or other error-detecting code. The MAC of a communication partner (MAC 414) can check the CRC and flag an error if the CRC does not match, and receiver transport logic can drop packets received with bad CRCs.
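As an illustrative sketch only, the CRC check described above might look like the following. It assumes a 4-byte CRC-32 trailer appended to each frame; the framing and the helper names `append_crc` and `check_and_strip` are hypothetical and are not taken from this disclosure, and Python's `zlib.crc32` merely stands in for whatever error-detecting code a given MAC uses.

```python
import struct
import zlib
from typing import Optional


def append_crc(payload: bytes) -> bytes:
    """Transmit side: append a 4-byte CRC-32 of the payload (assumed framing)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def check_and_strip(frame: bytes) -> Optional[bytes]:
    """Receive side: return the payload if the CRC matches, otherwise None."""
    if len(frame) < 4:
        return None  # too short to carry a CRC trailer: treat as bad
    payload, trailer = frame[:-4], frame[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) != expected:
        return None  # CRC mismatch: flag the error and drop the packet
    return payload
```

A receiver built this way would hand only non-`None` payloads to the upper layers and treat every `None` as a dropped packet.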
[0039] In various embodiments, a credit-based flow control mechanism is used to prevent overflow at a receiver. For example, RX buffer 412 can store data received, and the receiver can periodically report the available space in the receive buffer to the transmitter. The transmitter can check the available credit at the receiver before sending packets. In this manner, two communication partners are aware of buffer availability of each other and can maintain traffic flow. Throttling of the transmitter can be performed if there is not enough credit left in the receive buffer. Credit-based flow control is described in further detail herein (e.g., see
[0040] In various embodiments, for reliable and lossless transmission, a mechanism for reporting loss and errors to the transmitter and for re-transmitting dropped/lost packets by the transmitter is utilized. In various embodiments, a Packet Sequence Number (PSN), a re-transmit buffer (e.g., re-transmit buffer 404), and an acknowledgement (ACK)/no acknowledgement (NAK) flow control framework are utilized to manage the retransmission of lost/dropped packets. In the example shown, re-transmit buffer 404 stores packets being transmitted to a communication partner. The packets can be retired after receiving an ACK and re-transmitted after receiving a NAK. Transmitters can attach a sequence number to each packet and keep track of the sequence number of packets received without any error. Receivers can periodically update the sequence number of the last good packet received. Retransmission using ACK and NAK signals is described in further detail herein (e.g., see
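The retire-on-ACK and resend-on-NAK behavior of a re-transmit buffer can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class name `RetransmitBuffer`, the cumulative-ACK semantics (an ACK for PSN n retires every packet up to n), and a NAK that carries the last good PSN are all assumptions made for the example, and PSN wrap-around is ignored.

```python
from collections import OrderedDict
from typing import List, Tuple


class RetransmitBuffer:
    """Holds sent-but-unacknowledged packets keyed by PSN (hypothetical model)."""

    def __init__(self) -> None:
        self.next_psn = 0
        self.unacked: "OrderedDict[int, bytes]" = OrderedDict()

    def send(self, payload: bytes) -> Tuple[int, bytes]:
        """Attach the next PSN to the payload and keep a copy until it is ACKed."""
        psn = self.next_psn
        self.unacked[psn] = payload
        self.next_psn += 1
        return psn, payload

    def on_ack(self, psn: int) -> None:
        """Assumed cumulative ACK: retire every stored packet with PSN <= psn."""
        for retired in [p for p in self.unacked if p <= psn]:
            del self.unacked[retired]

    def on_nak(self, last_good_psn: int) -> List[Tuple[int, bytes]]:
        """Assumed NAK: retire through last_good_psn, return the rest to resend."""
        self.on_ack(last_good_psn)
        return list(self.unacked.items())
```

Packets returned by `on_nak` stay in the buffer, so they are retired only when a later ACK covers them.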
[0041]
[0042] In the example illustrated, transmitter 502 sends packets to receiver 506 via link 504. Transmitter 502 (TX 502) and receiver 506 (RX 506) are communication partners. In some embodiments, transmitter 502 and receiver 506 are integrated circuit package 102 of
[0043] In various embodiments, credit-based flow control uses credit counts. A credit can be a specified size block of data (e.g., a credit can be a 64-byte block of data). In various embodiments, a maximum number of credits is supported (e.g., 2048 credits). With these example values, a total of 128 kilobytes (KB) of data is supported (2048 × 64 bytes = 131,072 bytes = 128 KB). In various embodiments, TX 502 (the transmitter) and RX 506 (the receiver) maintain an absolute credit count. In various embodiments, TX 502 maintains a Total Blocks Sent (TBS) counter, which tracks total blocks sent via link 504 after initialization and is updated for every packet sent via link 504, and TX 502 sends the TBS value to RX 506 periodically. In addition, in various embodiments, TX 502 periodically receives a Credit Limit (CL) from RX 506 and stores the CL locally. In various embodiments, TX 502 allows packets to be transmitted if TBS+Packet Size is less than or equal to CL. In various embodiments, RX 506 receives data in a receive buffer and maintains an Absolute Blocks Received (ABR) counter, which tracks the total blocks received via link 504 and is updated for every packet received, and RX 506 overrides ABR with the TBS value received from TX 502. In some embodiments, the receive buffer is included in hardware buffer 206 of
[0044] In various embodiments, at initialization, TX 502 and RX 506 update the available receive credits according to the size of the receive buffer. In various embodiments, at initialization, TX 502 sets TBS and CL to 0 and waits for a CL update from RX 506. In addition, in various embodiments, RX 506 sets ABR to 0 and updates CL according to the size of the receive buffer. In various embodiments, RX 506 sends the updated CL, which TX 502 receives and uses to override its local copy of CL. When TX 502 receives a request to send a packet over link 504, TX 502 transmits the packet if TBS+Packet Size is less than or equal to CL. Sending of the packet is illustrated in
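The counter rules above (transmit only if TBS+Packet Size is less than or equal to CL, with CL computed at the receiver as ABR plus available receive-buffer space, as recited in the claims) can be sketched as follows. This is an illustrative model only. The class and method names, the rounding of packet sizes up to whole 64-byte blocks, the direct method call standing in for the link, the `drain` step that frees buffer space, and the use of unbounded integers instead of wrapping hardware counters are all assumptions.

```python
from collections import deque

BLOCK_BYTES = 64    # example credit size from the text
MAX_CREDITS = 2048  # example maximum credit count from the text


def blocks(packet_bytes: int) -> int:
    """Credit blocks occupied by a packet, rounded up (assumed rounding)."""
    return -(-packet_bytes // BLOCK_BYTES)


class Receiver:
    def __init__(self, buffer_blocks: int = MAX_CREDITS) -> None:
        self.capacity = buffer_blocks
        self.used = 0
        self.abr = 0  # Absolute Blocks Received, set to 0 at initialization
        self.queue = deque()

    def credit_limit(self) -> int:
        """CL = ABR + available space in the receive buffer."""
        return self.abr + (self.capacity - self.used)

    def receive(self, packet_bytes: int) -> None:
        n = blocks(packet_bytes)
        self.abr += n
        self.used += n
        self.queue.append(n)

    def drain(self) -> None:
        """Hypothetical consumer step: remove the oldest packet, freeing space."""
        if self.queue:
            self.used -= self.queue.popleft()


class Transmitter:
    def __init__(self) -> None:
        self.tbs = 0  # Total Blocks Sent, set to 0 at initialization
        self.cl = 0   # local copy of the Credit Limit, 0 until RX updates it

    def update_credit_limit(self, cl: int) -> None:
        self.cl = cl  # override the local copy with the value sent by RX

    def try_send(self, packet_bytes: int, rx: Receiver) -> bool:
        n = blocks(packet_bytes)
        if self.tbs + n > self.cl:
            return False  # not enough credit: throttle the transmitter
        self.tbs += n
        rx.receive(packet_bytes)  # the link is modeled as a direct call
        return True
```

Because both counters are absolute totals, a lost credit update does not corrupt the accounting in this model; the next update simply carries the current value.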
[0045]
[0046] In the example illustrated, communication is shown between transmit side 602 and receive side 604. In some embodiments, transmit side 602 and receive side 604 correspond to integrated circuit package 102 of
[0047] In the above sequences of steps, transmit side 602 maintains two PSNs: a Last Acknowledged PSN that keeps track of the last ACK (good packet) received and a latest sent PSN. Receive side 604 also maintains two PSNs: a Last Acknowledged PSN of the last ACK (good packet) sent back to transmit side 602 and a Latest Received PSN keeping track of the latest received packet. In various embodiments, transmit side 602 also maintains a timer that re-starts every time a valid ACK is received, and receive side 604 maintains a count of continuous good packets received after the last ACK was sent. The steps illustrated in
[0048]
[0049] In the example illustrated, at step 706, receive side 704 receives a bad packet (e.g., with a CRC error, PSN count error, packet length error in which packet length is greater than available space in the receive buffer, etc.). Receive side 704 drops the bad packet. Because a bad packet is received, the ACK procedure in
[0050]
[0051] In the example shown, DMA engine 808 fetches data from high-bandwidth memory and segments it into maximum transmission unit (MTU) size payloads. In some embodiments, the high-bandwidth memory is high bandwidth memory 310 of
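Segmentation into MTU-size payloads, and the matching reassembly, can be sketched as follows. This is a simplified illustration: the function names `segment` and `reassemble` are hypothetical, an in-memory byte string stands in for data fetched from high-bandwidth memory, and in-order delivery is assumed so that no per-payload headers are modeled.

```python
from typing import Iterable, List


def segment(message: bytes, mtu: int) -> List[bytes]:
    """Split a message into payloads of at most mtu bytes each."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


def reassemble(payloads: Iterable[bytes]) -> bytes:
    """Concatenate payloads that arrived in order back into the message."""
    return b"".join(payloads)
```

Only the final payload can be shorter than the MTU, and an empty message yields no payloads.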
[0052]
[0053]
[0054] At 1002, a first integrated circuit package including a first group of one or more artificial intelligence processing units and a first chip-to-chip interconnect communication unit is configured. In some embodiments, the first integrated circuit package is integrated circuit package 102 of
[0055] At 1004, a second integrated circuit package including a second group of one or more artificial intelligence processing units and a second chip-to-chip interconnect communication unit is configured. In some embodiments, the second integrated circuit package is integrated circuit package 110 of
[0056] At 1006, an interconnect between the first integrated circuit package and the second integrated circuit package is established, wherein the first chip-to-chip interconnect communication unit and the second chip-to-chip interconnect communication unit manage Ethernet-based communication via the interconnect using a layered communication architecture supporting a credit-based data flow control and a retransmission data flow control. In some embodiments, the interconnect that is established is interconnect 108 of
[0057]
[0058]
[0059] In the example shown, integrated circuit packages 1204, 1206, 1208 are connected to integrated circuit package with switch 1202, which acts as a hub, by interconnects 1210, 1212, and 1214, respectively. In some embodiments, each interconnect in system 1200 is the same type of interconnect as interconnect 108 of
[0060] Thus, with a switched topology, it is possible to connect compute units (e.g., AI accelerators) to each other with fewer physical connections (in this case, three physical connections to connect four components in system 1200 instead of six physical connections in system 1100 of
[0061]
[0062] Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.