G06F9/3851

RECONFIGURABLE SIMD ENGINE
20230214351 · 2023-07-06 · ·

An exemplary SIMD computing system comprises a SIMD processing element (SPE) configured to perform a selected operation on a portion of a processor input data word, with the operation selected by control signals read from a control memory location addressed by a decoded instruction. The SPE may comprise one or more adder, multiplier, or multiplexer coupled to the control signals. The control signals may comprise one or more bit read from the control memory. The control memory may be an MxN (M rows by N columns) memory having M possible SIMD operations and N control signals. Each instruction decoded may select an SPE operation from among N rows. A plurality of SPEs may receive the same control signals. The control memory may be rewritable, advantageously permitting customizable SIMD operations that are reconfigurable by storing in the control memory locations control signals designed to cause the SPE to perform selected operations.

Mechanism to trigger early termination of cooperating processes

Devices and techniques for triggering early termination of cooperating processes in a processor are described herein. A system includes multiple memory-compute nodes, wherein a memory-compute node comprises: event manager circuitry configured to establish a broadcast channel to receive event messages; and thread manager circuitry configured to organize a plurality of threads to perform portions of a cooperative task, wherein the plurality of threads each monitor the broadcast channel to receive event messages on the broadcast channel, and wherein upon achieving a threshold operation, the thread manager circuitry is to use the event manager circuitry to broadcast, on the broadcast channel, an event message indicating that the cooperative task is complete, causing other threads, in response to receiving the event message, to terminate execution of their respective portions of the cooperative task.

Compute optimizations for neural networks using ternary weight

One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including a multi-bit input value and a ternary weight associated with a neural network and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the ternary weight to generate an intermediate product and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.

Systems, methods, and apparatuses for heterogeneous computing

Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.

Detecting execution hazards in offloaded operations

Detecting execution hazards in offloaded operations is disclosed. A second offload operation is compared to a first offload operation that precedes the second offload operation. It is determined whether the second offload operation creates an execution hazard on an offload target device based on the comparison of the second offload operation to the first offload operation. If the execution hazard is detected, an error handling operation may be performed. In some examples, the offload operations are processing-in-memory operations.

Method of completing a programmable atomic transaction by ensuring memory locks are cleared
11693690 · 2023-07-04 · ·

Disclosed is an instruction for a programmable atomic transaction that is executed as the last instruction and that terminates the executing thread, waits for all outstanding store operations to finish, clears the programmable atomic lock, and sends a completion response back to the issuing process. This guarantees that the programmable atomic lock is cleared when the transaction completes. By coupling thread termination with clearing the lock bit, this guarantees that the thread cannot terminate without clearing the lock.

THREAD PRIORITIES USING MISPREDICTION RATE AND SPECULATIVE DEPTH

Methods and systems for determining a priority of a threads is described. A processor can execute branch instructions of the thread. The processor can predict branch instruction outcomes of the branch instructions of the thread. The processor can increment a misprediction count of the thread in response to an actual execution of a branch instruction of the thread being different from a corresponding branch instruction prediction outcome of the thread. The processor can determine the priority of the thread based on the misprediction count of the thread.

Memory-network processor with programmable optimizations

Various embodiments are disclosed of a multiprocessor system with processing elements optimized for high performance and low power dissipation and an associated method of programming the processing elements. Each processing element may comprise a fetch unit and a plurality of address generator units and a plurality of pipelined datapaths. The fetch unit may be configured to receive a multi-part instruction, wherein the multi-part instruction includes a plurality of fields. First and second address generator units may generate, based on different fields of the multi-part instruction, addresses from which to retrieve first and second data for use by an execution unit for the multi-part instruction or a subsequent multi-part instruction. The execution units may perform operations using a single pipeline or multiple pipelines based on third and fourth fields of the multi-part instruction.

MULTIPLE REGISTER ALLOCATION SIZES FOR THREADS

Provision of multiple register allocation sizes for threads is described. An example of a system includes one or more processors including a graphics processor, the graphics processor including at least a first local thread dispatcher (TDL) and multiple processing resources, each processing resource including a plurality of registers; and memory for storage of data for processing, wherein the one or more processors are to determine a register size for a first thread; identify one or more processing resources having sufficient register space for the first thread; select a processing resource of the one or more processing resources having sufficient register space to assign the first thread; select an available thread slot of the selected processing resource for the first thread; and allocate registers of the selected processing resource for the first thread.

SYNCHRONIZATION BARRIER

Apparatuses, systems, and techniques to implement a barrier operation. In at least one embodiment, a memory barrier operation causes accesses to memory by a plurality of groups of threads to occur in an order indicated by the memory barrier operation.