G06F13/36

Message based general register file assembly

In an example, an apparatus comprises a plurality of execution units, and logic, at least partially including hardware logic, to assemble a general register file (GRF) message and hold the GRF message in storage in a data port until all data for the GRF message is received. Other embodiments are also disclosed and claimed.
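The hold-until-complete behavior described above can be illustrated with a small software model. This is only a sketch, not the claimed hardware: the `GrfAssembler` class, the chunk indexing, and the assumption that the chunk count is known in advance are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class GrfAssembler:
    """Buffers partial data for one GRF message until every chunk has arrived."""
    expected_chunks: int                      # assumed: chunk count known up front
    chunks: dict = field(default_factory=dict)

    def receive(self, index: int, payload: bytes):
        """Store one chunk; return the assembled message once complete, else None."""
        if not 0 <= index < self.expected_chunks:
            raise ValueError(f"chunk index {index} out of range")
        self.chunks[index] = payload
        if len(self.chunks) < self.expected_chunks:
            return None                       # keep holding the message in storage
        # All data received: concatenate in index order and release the buffer.
        message = b"".join(self.chunks[i] for i in range(self.expected_chunks))
        self.chunks.clear()
        return message


assembler = GrfAssembler(expected_chunks=3)
first = assembler.receive(2, b"CC")      # arrives out of order, held
second = assembler.receive(0, b"AA")     # still incomplete, held
final = assembler.receive(1, b"BB")      # last chunk completes the message
```

Here `first` and `second` are `None` because the message is still being held, and `final` carries the chunks joined in index order.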

Data processor including relay circuits coupled through a ring bus and method for controlling the same
11494327 · 2022-11-08

A data processor capable of suppressing variation in latency during a bus access is provided. The data processor according to one embodiment includes a ring bus through which a plurality of relay circuits, being coupled to a plurality of bus masters and a plurality of slaves, are coupled in the shape of a ring. Each of the relay circuits includes: an arbitration circuit which arbitrates an adjacent request packet of an adjacent relay circuit and a bus request packet of a nearest bus master with use of priority of these request packets, and outputs the request packet after arbitration to a next relay circuit; and a priority adjustment circuit which adjusts priority of the bus request packet according to the number of relay circuits through which the bus request packet passes before reaching its destination.
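The interplay of priority adjustment and arbitration might be modeled as below. The abstract does not say in which direction priority is adjusted, so this sketch assumes one priority step per relay circuit still to be traversed, a unidirectional ring, and ties resolved in favor of the packet already on the ring. `Packet`, `adjust_priority`, and `arbitrate` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    source: int       # index of the relay circuit where the packet enters the ring
    dest: int         # index of the relay circuit serving the destination
    priority: int = 0


def hops_to_destination(source: int, dest: int, ring_size: int) -> int:
    """Relay circuits traversed on a unidirectional ring (assumed direction)."""
    return (dest - source) % ring_size


def adjust_priority(packet: Packet, ring_size: int) -> Packet:
    """Assumed policy: add one priority step per hop to the destination."""
    hops = hops_to_destination(packet.source, packet.dest, ring_size)
    return Packet(packet.source, packet.dest, packet.priority + hops)


def arbitrate(adjacent: Optional[Packet], local: Optional[Packet]) -> Optional[Packet]:
    """Choose the packet forwarded to the next relay circuit.

    Ties go to the packet already on the ring (an assumption).
    """
    if adjacent is None:
        return local
    if local is None:
        return adjacent
    return local if local.priority > adjacent.priority else adjacent


RING_SIZE = 8
in_flight = adjust_priority(Packet(source=1, dest=3), RING_SIZE)   # 2 hops
new_local = adjust_priority(Packet(source=2, dest=7), RING_SIZE)   # 5 hops
winner = arbitrate(in_flight, new_local)
```

Under these assumptions the locally injected packet wins because it has farther to travel, which is one plausible way to even out end-to-end latency.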

EFFICIENT SIGNALING SCHEME FOR HIGH-SPEED ULTRA SHORT REACH INTERFACES
20230097677 · 2023-03-30

A multi-chip module (MCM) includes a first integrated circuit (IC) chip to receive first data. The first IC chip includes a first transfer interface to transmit the first data off the first IC chip. A second IC chip includes an input interface to receive the first data from the first IC chip. The second IC chip includes switching circuitry to selectively forward the first data to one of a first output interface or a second output interface. The first output interface is communicatively coupled to a third IC chip, while the second output interface is configured to output the first data from the MCM.
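The selective forwarding performed by the second IC chip can be pictured with a toy model. The `Route` enum and `SwitchingChip` class are hypothetical, and the abstract does not say how the route is selected, so the caller simply passes it in.

```python
from enum import Enum


class Route(Enum):
    TO_THIRD_CHIP = 1   # first output interface, coupled to the third IC chip
    OFF_MODULE = 2      # second output interface, leaves the MCM


class SwitchingChip:
    """Toy model of the second IC chip's switching circuitry."""

    def __init__(self):
        self.first_output = []    # data forwarded toward the third IC chip
        self.second_output = []   # data output from the MCM

    def forward(self, data: bytes, route: Route) -> None:
        """Send the received data to exactly one of the two output interfaces."""
        if route is Route.TO_THIRD_CHIP:
            self.first_output.append(data)
        elif route is Route.OFF_MODULE:
            self.second_output.append(data)
        else:
            raise ValueError(f"unknown route: {route!r}")


chip = SwitchingChip()
chip.forward(b"pkt-a", Route.TO_THIRD_CHIP)
chip.forward(b"pkt-b", Route.OFF_MODULE)
```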

HIGH-SPEED DESERIALIZER WITH PROGRAMMABLE AND TIMING ROBUST DATA SLIP FUNCTION

Provided are embodiments for operating a high-speed deserializer. Embodiments can include receiving a clock slip signal to enable operation of a slip pulse generation circuit, and generating a slip pulse signal using a slip pulse-controlled clock generation circuit, wherein the slip pulse signal is programmable to slip one or more bits of serial input data. Embodiments can also include generating a plurality of deserialization clocks for sampling the serial input data using the slip pulse-controlled clock generation circuit, wherein the plurality of deserialization clocks are generated simultaneously with each other, and providing the plurality of deserialization clocks to the deserializer to selectively sample the serial input data.
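The effect of a programmable data slip on word alignment can be shown with a purely behavioral model. It does not model the clock generation circuitry; it only shows how skipping `slip` bits moves the word boundary. The function name and parameters are assumptions made for illustration.

```python
def deserialize(bits, width, slip=0):
    """Group serial bits into parallel words after skipping `slip` bits.

    Trailing bits that do not fill a whole word are dropped.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if slip < 0:
        raise ValueError("slip must be non-negative")
    aligned = bits[slip:]                          # slipped bits are never sampled
    usable = len(aligned) - len(aligned) % width   # whole words only
    return [aligned[i:i + width] for i in range(0, usable, width)]


serial = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
no_slip = deserialize(serial, width=4)             # boundary at bit 0
one_slip = deserialize(serial, width=4, slip=1)    # boundary moved by one bit
```

With a slip of one bit, the same serial stream yields different parallel words, which is how a receiver can hunt for the correct word alignment.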

Distributed AI training topology based on flexible cable connection

A data processing system includes a central processing unit (CPU) and accelerator cards coupled to the CPU over a bus, each of the accelerator cards having a plurality of data processing (DP) accelerators to receive DP tasks from the CPU and to perform the received DP tasks. At least two of the accelerator cards are coupled to each other via an inter-card connection, and at least two of the DP accelerators are coupled to each other via an inter-chip connection. Each of the inter-card connection and the inter-chip connection is capable of being dynamically activated or deactivated, such that in response to a request received from the CPU, any one of the accelerator cards or any one of the DP accelerators within any one of the accelerator cards can be enabled or disabled to process any one of the DP tasks received from the CPU.
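Dynamically activating or deactivating connections amounts to reshaping a graph at run time. The sketch below models that idea only; the `Topology` class, the node names such as `card0.dp0`, and the reachability query are hypothetical and are not taken from the abstract.

```python
from collections import deque


class Topology:
    """Accelerators joined by links that can be switched on or off at run time."""

    def __init__(self):
        self.links = {}  # frozenset({a, b}) -> True if the link is active

    def add_link(self, a, b, active=False):
        """Register an inter-chip or inter-card connection, inactive by default."""
        self.links[frozenset((a, b))] = active

    def set_active(self, a, b, active):
        """Activate or deactivate an existing connection."""
        key = frozenset((a, b))
        if key not in self.links:
            raise KeyError(f"no link between {a} and {b}")
        self.links[key] = active

    def reachable(self, start):
        """Return every accelerator reachable from `start` over active links."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for link, active in self.links.items():
                if active and node in link:
                    for peer in link - {node}:
                        if peer not in seen:
                            seen.add(peer)
                            queue.append(peer)
        return seen


topo = Topology()
topo.add_link("card0.dp0", "card0.dp1")    # inter-chip connection
topo.add_link("card0.dp1", "card1.dp0")    # inter-card connection
topo.set_active("card0.dp0", "card0.dp1", True)
small_group = topo.reachable("card0.dp0")
topo.set_active("card0.dp1", "card1.dp0", True)
large_group = topo.reachable("card0.dp0")
```

Activating the inter-card link grows the group of accelerators that can cooperate on a task from two to three in this example.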

Data processing network with flow compaction for streaming data transfer

An improved protocol is provided for data transfer between a request node and a home node of a data processing network that includes a number of devices coupled via an interconnect fabric. The protocol minimizes the number of response messages transported through the interconnect fabric. When congestion is detected in the interconnect fabric, a home node sends a combined response to a write request from a request node. The response is delayed until a data buffer is available at the home node and the home node has completed an associated coherence action. When the request node receives a combined response, the data to be written and the acknowledgment are coalesced in the data message.
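The message flow can be sketched as two small functions, one per node. The message names (`BUFFER_READY`, `COMPLETE`, `COMBINED`, `DATA`, `ACK`, `DATA_WITH_ACK`) are invented for this illustration, and the uncongested path with separate responses is an assumption, since the abstract describes only the congested case.

```python
def home_node_responses(congested, buffer_available, coherence_done):
    """Return the responses the home node can send now for one write request."""
    if congested:
        # Combined response is delayed until both conditions hold.
        if buffer_available and coherence_done:
            return ["COMBINED"]
        return []
    # Assumed uncongested behavior: separate responses, each sent when ready.
    responses = []
    if buffer_available:
        responses.append("BUFFER_READY")
    if coherence_done:
        responses.append("COMPLETE")
    return responses


def request_node_messages(responses, data):
    """Return the messages the request node sends after seeing the responses."""
    if "COMBINED" in responses:
        # Data and acknowledgment are coalesced into a single message.
        return [("DATA_WITH_ACK", data)]
    messages = []
    if "BUFFER_READY" in responses:
        messages.append(("DATA", data))
    if "COMPLETE" in responses:
        messages.append(("ACK", None))
    return messages


waiting = home_node_responses(congested=True, buffer_available=True, coherence_done=False)
combined = home_node_responses(congested=True, buffer_available=True, coherence_done=True)
compact = request_node_messages(combined, b"payload")
separate = request_node_messages(
    home_node_responses(congested=False, buffer_available=True, coherence_done=True),
    b"payload",
)
```

In this model the congested path moves two messages through the fabric (one combined response, one coalesced data message) where the assumed uncongested path moves four.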