G06F9/544

Fully pipelined hardware operator logic circuit for converting human-readable decimal character sequence floating-point representations to IEEE 754-2008 binary floating-point format representations
11635956 · 2023-04-25

A fully pipelined convertToBinaryFromDecimalCharacter hardware operator logic circuit configured to convert one or more human-readable decimal character sequence floating-point representations to IEEE 754-2008 binary floating-point representations every clock cycle. The circuit converts decimal character sequence floating-point representations up to 28 decimal digits in length to IEEE 754 binary64, binary32, or binary16 floating-point format representations.
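
As a software point of reference only (not the pipelined circuit the abstract describes), the input-to-output mapping can be sketched with Python's standard `struct` module. The helper name `decimal_to_ieee754_bits` and the `_FORMATS` table are illustrative assumptions. The sketch goes through a binary64 intermediate, so binary32 and binary16 results may in rare cases differ from a direct, correctly rounded conversion, and it does not model the 28-digit input limit.

```python
import struct

# Hypothetical lookup: (float pack code, same-width unsigned int code) per format.
_FORMATS = {
    "binary64": (">d", ">Q"),
    "binary32": (">f", ">I"),
    "binary16": (">e", ">H"),
}

def decimal_to_ieee754_bits(text, fmt="binary64"):
    """Return the IEEE 754 bit pattern (as an int) for a decimal character string."""
    float_code, int_code = _FORMATS[fmt]
    value = float(text)                      # decimal characters -> binary64
    packed = struct.pack(float_code, value)  # round to the target width
    return struct.unpack(int_code, packed)[0]

if __name__ == "__main__":
    for fmt in ("binary64", "binary32", "binary16"):
        print(fmt, hex(decimal_to_ieee754_bits("1.5", fmt)))
```

The bit pattern is returned as an integer so that sign, exponent, and fraction fields can be inspected with ordinary shifts and masks.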

INSTRUCTION SETS FOR GENERATING SCHEDULES FOR TASK EXECUTION IN COMPUTING SYSTEMS

One or more embodiments of the present disclosure relate to receiving application data indicative of a plurality of runnables corresponding to a computing application. Additionally, one or more embodiments may relate to generating, based at least on the application data, an execution schedule for execution of the plurality of runnables using a plurality of compute engines. The execution schedule may include one or more commands corresponding to one or more timing fences. The one or more timing fences may dictate a timing and order of execution between at least a first runnable and a second runnable of the plurality of runnables.
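
The idea of fence commands embedded in a per-engine schedule can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the disclosed compiler: the function names `build_schedule` and `execute`, the `("wait" | "run" | "signal", ...)` command tuples, and the engine labels are all hypothetical, and the sketch models only ordering between runnables, not timing.

```python
from collections import deque

def build_schedule(runnables, deps):
    """runnables: list of (name, engine) in per-engine order.
    deps: list of (producer, consumer). Returns {engine: [commands]}."""
    engine_of = dict(runnables)
    waits, signals = {}, {}
    for fence_id, (producer, consumer) in enumerate(deps):
        # Same-engine dependencies are already ordered by the command list.
        if engine_of[producer] != engine_of[consumer]:
            signals.setdefault(producer, []).append(fence_id)
            waits.setdefault(consumer, []).append(fence_id)
    schedule = {}
    for name, engine in runnables:
        cmds = schedule.setdefault(engine, [])
        cmds.extend(("wait", f) for f in waits.get(name, []))
        cmds.append(("run", name))
        cmds.extend(("signal", f) for f in signals.get(name, []))
    return schedule

def execute(schedule):
    """Simulate the engines round-robin, honoring fences; return run order."""
    queues = {e: deque(c) for e, c in schedule.items()}
    signaled, order = set(), []
    progress = True
    while progress:
        progress = False
        for q in queues.values():
            while q:
                kind, arg = q[0]
                if kind == "wait" and arg not in signaled:
                    break  # this engine stalls until the fence is signaled
                q.popleft()
                progress = True
                if kind == "run":
                    order.append(arg)
                elif kind == "signal":
                    signaled.add(arg)
    if any(queues.values()):
        raise RuntimeError("deadlock: a fence was never signaled")
    return order
```

For example, with runnables `camera` and `plan` on one engine and `detect` on another, and dependencies `camera -> detect -> plan`, the simulated run order is `camera`, `detect`, `plan`.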

BRANCH AND BOUND SORTING FOR SCHEDULING TASK EXECUTION IN COMPUTING SYSTEMS

One or more embodiments of the present disclosure relate to identifying, based on application data associated with a computing application that includes a set of runnables, a plurality of scheduling branches associated with scheduling execution of at least a subset of runnables of the set of runnables. Further, one or more embodiments relate to selecting a scheduling branch from the plurality of scheduling branches based at least on a coupling constraint that is applied to related runnables of at least the subset of runnables. The related runnables may include a first runnable that is designated for execution on a first compute engine and that triggers execution of a second runnable on a second compute engine. In addition, one or more embodiments may relate to determining an execution schedule of the set of runnables based at least on the scheduling branch.
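
A generic branch-and-bound search with a coupling check can be sketched as follows. It is an illustration under assumptions, not the disclosed method: `best_schedule`, the `triggers` mapping, and the `max_gap` parameter (the largest allowed delay between a trigger finishing and the triggered runnable starting) are hypothetical names, and makespan is assumed as the objective.

```python
def best_schedule(durations, engine, triggers, max_gap=0):
    """durations, engine: dict name -> value. triggers: dict consumer -> producer.
    Returns (order, makespan) of the best branch that satisfies the coupling."""
    best = {"makespan": float("inf"), "order": None}
    names = list(durations)

    def branch(order, end, free):
        makespan = max(end.values(), default=0)
        if makespan >= best["makespan"]:
            return  # bound: this branch cannot beat the best one found
        if len(order) == len(names):
            best.update(makespan=makespan, order=order)
            return
        for n in names:
            if n in end:
                continue  # already placed
            p = triggers.get(n)
            if p is not None and p not in end:
                continue  # trigger must be placed first
            ready = end[p] if p is not None else 0
            start = max(ready, free.get(engine[n], 0))
            if p is not None and start - ready > max_gap:
                continue  # coupling constraint violated: prune this branch
            finish = start + durations[n]
            branch(order + [n], {**end, n: finish}, {**free, engine[n]: finish})

    branch([], {}, {})
    return best["order"], best["makespan"]
```

With `a` (2 time units, CPU) triggering `b` (3 units, GPU) and an unrelated `c` (4 units, GPU), a strict coupling (`max_gap=0`) forces `b` to run directly after `a`, giving a makespan of 9; relaxing the coupling lets `c` run first and shortens the makespan to 7.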

SYSTEM TASK MANAGEMENT FOR COMPUTING SYSTEMS

One or more embodiments of the present disclosure relate to executing, by a plurality of compute engines, a plurality of runnables of a computing application based at least on an execution schedule and a set of commands associated with the execution schedule. The execution schedule may be generated using a compiling system to include the set of commands. The set of commands may include one or more individual commands corresponding to one or more timing fences dictating a timing and order of execution of one or more individual runnables of the plurality of runnables.

DATA PROCESSING PIPELINE
20230385114 · 2023-11-30

A data processing device includes a plurality of hardware accelerators, a scheduler circuit, and a blocking circuit. The scheduler circuit is coupled to the plurality of hardware accelerators, and includes a plurality of hardware task schedulers. Each hardware task scheduler is coupled to a corresponding hardware accelerator, and is configured to control execution of a task by the hardware accelerator. The blocking circuit is coupled to the plurality of hardware accelerators and configured to inhibit communication between a first hardware accelerator and a second hardware accelerator of the plurality of hardware accelerators.

CHIP-TO-CHIP INTERCONNECT WITH A LAYERED COMMUNICATION ARCHITECTURE

A system includes a first integrated circuit package including a first group of one or more artificial intelligence processing units and a first chip-to-chip interconnect communication unit, and a second integrated circuit package including a second group of one or more artificial intelligence processing units and a second chip-to-chip interconnect communication unit. The system also includes an interconnect between the first integrated circuit package and the second integrated circuit package, wherein the first chip-to-chip interconnect communication unit and the second chip-to-chip interconnect communication unit manage Ethernet-based communication via the interconnect using a layered communication architecture supporting a credit-based data flow control and a retransmission data flow control.
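
The two flow-control mechanisms named in the abstract can be illustrated with a toy sender. This is a sketch under assumptions, not the disclosed architecture: the class `CreditLink` and its methods are hypothetical, and it simplifies by treating an acknowledgement as also returning one credit, which a real link layer may handle separately.

```python
class CreditLink:
    """Toy sender side of a link with credit-based flow control and retransmission."""

    def __init__(self, credits):
        self.credits = credits  # receiver buffer slots the sender may still use
        self.next_seq = 0
        self.unacked = {}       # seq -> payload, kept until acknowledged

    def send(self, payload):
        """Return (seq, payload) to put on the wire, or None when out of credits."""
        if self.credits == 0:
            return None
        self.credits -= 1
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload

    def on_ack(self, seq):
        """Receiver accepted the frame and freed a slot: drop the copy, regain a credit."""
        if seq in self.unacked:
            del self.unacked[seq]
            self.credits += 1

    def retransmit(self):
        """Frames to resend after a timeout or negative acknowledgement, oldest first."""
        return sorted(self.unacked.items())
```

The sender stalls when it runs out of credits rather than overrunning the receiver, and it keeps a copy of every unacknowledged frame so that a lost frame can be resent.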

NEURAL PROCESSING DEVICE AND METHOD FOR JOB SCHEDULING THEREOF
20230385105 · 2023-11-30

A neural processing device and a method for job scheduling are provided. The neural processing device is configured to: receive, by an address space ID (ASID) manager, first and second requests from at least one context, respectively, and determine whether ASIDs are allocated; store, in entities, jobs of contexts to which the ASID manager has not allocated ASIDs; schedule, by a job scheduler, an execution order of the jobs stored in the entities and cause the ASID manager to allocate ASIDs to those contexts, among the at least one context, to which ASIDs have not been allocated; and sequentially receive, by a command queue, jobs of contexts to which ASIDs have been allocated, store the jobs as standby jobs, and sequentially execute the standby jobs.

Spatial locality transform of matrices

A method comprises accessing a flattened input stream that includes a set of parallel vectors representing a set of input values of a kernel-sized tile of an input tensor that is to be convolved with a kernel. An expanded kernel is received that is generated by permuting values from the kernel. A control pattern is received that includes a set of vectors each corresponding to the output value position for the kernel-sized tile of the output and indicating a vector of the flattened input stream to access input values. The method further comprises generating, for each output position of each kernel-sized tile of the output, a dot product between a first vector that includes values of the flattened input stream as selected by the control pattern, and a second vector corresponding to a vector in the expanded kernel corresponding to the output position.
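
The core observation, that each output value is a dot product between a flattened kernel-sized window and a flattened kernel, can be shown with a plain sliding-window sketch. It deliberately does not implement the expanded kernel or the control pattern from the abstract. The function name `conv2d_as_dot_products` is hypothetical, and the sketch assumes a 2D input, no padding, stride 1, and no kernel flip.

```python
def conv2d_as_dot_products(image, kernel):
    """Valid 2D sliding-window product where each output is one dot product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]  # kernel as a single vector
    out = []
    for r in range(len(image) - kh + 1):
        out_row = []
        for c in range(len(image[0]) - kw + 1):
            # Flatten the kernel-sized window at (r, c) in the same row-major order.
            window = [image[r + i][c + j] for i in range(kh) for j in range(kw)]
            out_row.append(sum(a * b for a, b in zip(window, flat_kernel)))
        out.append(out_row)
    return out

if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(conv2d_as_dot_products(image, kernel))
```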

Hardware coherent computational expansion memory
11556344 · 2023-01-17

Embodiments herein describe transferring ownership of data (e.g., cachelines or blocks of data comprising multiple cachelines) from a host to hardware in an I/O device. In one embodiment, the host and I/O device (e.g., an accelerator) are part of a cache-coherent system where ownership of data can be transferred from a home agent (HA) in the host to a local HA in the I/O device—e.g., a computational slave agent (CSA). That way, a function on the I/O device (e.g., an accelerator function) can request data from the local HA without these requests having to be sent to the host HA. Further, the accelerator function can indicate whether the local HA tracks the data on a cacheline-basis or by a data block (e.g., multiple cachelines). This provides flexibility that can reduce overhead from tracking the data, depending on the function's desired use of the data.

ARCHITECTURE FOR LARGE PAYLOAD HANDLING IN EVENT PIPELINE
20220405155 · 2022-12-22

Systems and methods are provided for automatically orchestrating the handling of events through a processing pipeline without limitation (or without a substantial limitation) as to the size of the event payload associated with the event. The event pipeline system stores event payloads in data stores and generates notifications regarding the events. The notifications may be placed into event streams for handling by various processing components of the event pipeline system. The processing components may receive notifications or events that they are to process, and may separately access event payloads from the data stores. The processing components may generate and save processed event payloads to the data stores in a streaming fashion such that the computing resources of the processing components do not limit (or substantially limit) the size of the event payloads that the processing components may handle.
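
The separation of small notifications from large payloads can be sketched in a few lines. This is a minimal illustration under assumptions, not the disclosed system: a dict stands in for the data store, a list stands in for the event stream, and the names `publish`, `process`, and `CHUNK` are hypothetical.

```python
import io

CHUNK = 4  # bytes per read; tiny so the example exercises several chunks

def publish(store, stream, event_id, payload):
    """Save the payload in the data store and emit only a small notification."""
    store[event_id] = io.BytesIO(payload)
    stream.append({"event_id": event_id})

def process(store, notification, transform):
    """Read the payload chunk by chunk, transform it, and save the result."""
    src = store[notification["event_id"]]
    src.seek(0)
    dst = io.BytesIO()
    while chunk := src.read(CHUNK):
        dst.write(transform(chunk))  # only one chunk is handled at a time
    out_id = notification["event_id"] + ".processed"
    store[out_id] = dst
    return {"event_id": out_id}

if __name__ == "__main__":
    store, stream = {}, []
    publish(store, stream, "evt-1", b"hello world")
    note = process(store, stream[0], bytes.upper)
    print(note, store[note["event_id"]].getvalue())
```

Because the processing step handles one chunk at a time, its working memory depends on the chunk size rather than on the payload size; in this toy version the in-memory store still holds the full payload, whereas a real data store would be external.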