G06F9/30101

APPARATUS AND METHOD FOR VECTOR PACKED DUAL COMPLEX-BY-COMPLEX AND DUAL COMPLEX-BY-COMPLEX CONJUGATE MULTIPLICATION

An apparatus and method for multiplying packed real and imaginary components of complex numbers and complex conjugates. For example, one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed real and imaginary data elements; a second source register to store a second plurality of packed real and imaginary data elements; and execution circuitry to execute the decoded instruction. The execution circuitry includes multiplier circuitry to multiply select real and imaginary data elements in the first and second source registers to generate a plurality of real and imaginary products; adder circuitry to add/subtract various real and imaginary products, scale the results according to an immediate of the instruction, and round the scaled results; and saturation circuitry to saturate the rounded results.
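
The multiply, add/subtract, scale, round, and saturate flow described above can be sketched in Python. This is only an illustration under stated assumptions: the 16-bit signed element width, the reading of the immediate as a right-shift count, the round-half-up mode, and all function names are hypothetical and are not taken from the abstract.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed element width: 16-bit signed


def saturate16(x):
    """Clamp a wide intermediate result to the assumed 16-bit signed range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def scale_round(x, shift):
    """Scale by an arithmetic right shift of `shift` bits (assumed meaning of
    the immediate), rounding half up before the shift."""
    if shift == 0:
        return x
    return (x + (1 << (shift - 1))) >> shift


def complex_mul(a, b, shift, conjugate=False):
    """Multiply a = (ar, ai) by b = (br, bi), or by the conjugate of b."""
    ar, ai = a
    br, bi = b
    if conjugate:
        bi = -bi
    real = ar * br - ai * bi   # subtract the two real-valued products
    imag = ar * bi + ai * br   # add the two cross products
    return (saturate16(scale_round(real, shift)),
            saturate16(scale_round(imag, shift)))


def packed_dual_complex_mul(src1, src2, shift, conjugate=False):
    """Apply complex_mul to each pair of packed complex elements."""
    return [complex_mul(a, b, shift, conjugate) for a, b in zip(src1, src2)]
```

For example, `packed_dual_complex_mul([(3, 4), (1, 0)], [(1, 2), (0, 1)], 0)` multiplies two complex pairs in one call, mirroring the "dual" operation in the title.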

Control registers to store thread identifiers for threaded loop execution in a self-scheduling reconfigurable computing fabric
11567766 · 2023-01-31

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.

Reset and replay of memory sub-system controller in a memory sub-system

In an embodiment, a system includes a plurality of memory components and a processing device that is operatively coupled with the plurality of memory components. The processing device includes a host interface, an access management component, a media management component (MMC), and an MMC-restart manager that is configured to perform operations including detecting a triggering event for restarting the MMC, and responsively performing MMC-restart operations that include suspending operation of the access management component; determining whether the MMC is operating, and if so then suspending operation of the MMC; resetting the MMC; resuming operation of the MMC; and resuming operation of the access management component.
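
The ordering of the MMC-restart operations listed above can be sketched as a short Python model. The `Component` class, the `restart_mmc` function, and the log strings are hypothetical names introduced for illustration; the abstract specifies only the sequence of steps, not an implementation.

```python
class Component:
    """Hypothetical stand-in for a controller component that can be paused."""

    def __init__(self, name, running=True):
        self.name = name
        self.running = running

    def suspend(self):
        self.running = False

    def resume(self):
        self.running = True


def restart_mmc(access_mgr, mmc, log):
    """Perform the restart steps in the order given in the abstract."""
    access_mgr.suspend()
    log.append("suspend access management")
    if mmc.running:                 # suspend the MMC only if it is operating
        mmc.suspend()
        log.append("suspend MMC")
    log.append("reset MMC")         # the reset itself is not modeled here
    mmc.resume()
    log.append("resume MMC")
    access_mgr.resume()
    log.append("resume access management")
```

In this sketch the access management component is suspended first and resumed last, so the MMC is reset while no accesses are being managed.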

Instruction execution that broadcasts and masks data values at different levels of granularity

An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and to replicate the second data structure when executing the second instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
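
A minimal Python sketch of replicate-then-mask at two granularities follows. The 512-bit destination width, the 64-bit and 32-bit element sizes, the zeroing of masked-off lanes, and the function name `broadcast_and_mask` are all assumptions made for illustration; the abstract states only that one element size and mask granularity is twice the other.

```python
def broadcast_and_mask(src, elem_bits, mask, dest_bits=512):
    """Replicate the packed structure `src` across a destination of
    `dest_bits` bits, then apply one mask bit per element of `elem_bits` bits.
    Lanes whose mask bit is 0 are zeroed (assumed masking behavior)."""
    lanes = dest_bits // elem_bits
    if lanes % len(src) != 0:
        raise ValueError("source structure does not tile the destination")
    replicated = src * (lanes // len(src))
    return [v if (mask >> i) & 1 else 0 for i, v in enumerate(replicated)]


# First instruction: two 64-bit values, 8 lanes, 8 mask bits.
wide = broadcast_and_mask([1, 2], 64, 0b00001111)
# Second instruction: two 32-bit values, 16 lanes, 16 mask bits.
narrow = broadcast_and_mask([1, 2], 32, 0x00FF)
```

With the same destination width, halving the element size doubles the number of lanes and therefore the number of mask bits, which is the sense in which the second granularity is twice as fine.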

ZERO-COPY SPARSE MATRIX FACTORIZATION SYNTHESIS FOR HETEROGENEOUS COMPUTE SYSTEMS
20230024035 · 2023-01-26

A system, method, and computer-readable medium for synthesizing zero-copy sparse matrix factorization operations in heterogeneous compute systems are provided. The system includes a host device and an accelerator device. The host device is configured to divide an input matrix into a plurality of blocks which are transferred to a memory of the accelerator device. The host device is also configured to generate at least one index buffer that includes pointers to the blocks in the accelerator device's memory, where each index buffer represents a frontal matrix associated with a matrix decomposition algorithm. The host processor is configured to receive one or more kernels configured to process the index buffer(s) on the accelerator device. The index buffers are processed by the accelerator device and the modified block data is written back to a memory of the host device to generate a factorized output matrix.

PROCESSING WORK ITEMS IN PROCESSING LOGIC
20230229592 · 2023-07-20

A plurality of work items are processed through a processing pipeline comprising a plurality of stages in processing logic. The processing of a work item includes: (i) reading data in accordance with a memory address associated with the work item, (ii) updating the read data, and (iii) writing the updated data in accordance with the memory address associated with the work item. The method includes processing a first work item and a second work item through the processing pipeline, wherein the processing of the first work item through the pipeline is initiated earlier than the processing of the second work item, and where it is determined that the first and second work items are associated with the same memory address, first updated data of the first work item is written to a register in the processing logic, and the processing of the second work item comprises reading the first updated data from the register instead of reading data from the memory.
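
The read-update-write hazard and the register forwarding described above can be modeled with a short Python sketch. It assumes a simplified pipeline in which each item's write reaches memory only after the next item has performed its read; the function name `run_pipeline` and that timing model are illustrative assumptions, not details from the abstract.

```python
def run_pipeline(addresses, memory, update):
    """Process work items in order. `fwd` models the register holding the
    previous item's (address, updated data). Returns how many items read
    from that register instead of from memory."""
    fwd = None
    forwarded = 0
    for addr in addresses:
        if fwd is not None and fwd[0] == addr:
            data = fwd[1]            # same address: read the forwarded value
            forwarded += 1
        else:
            data = memory[addr]      # different address: read from memory
        if fwd is not None:
            memory[fwd[0]] = fwd[1]  # previous write lands after this read
        fwd = (addr, update(data))
    if fwd is not None:
        memory[fwd[0]] = fwd[1]      # drain the last write
    return forwarded


memory = {0: 10, 1: 20}
hits = run_pipeline([0, 0, 1, 0], memory, lambda x: x + 1)
```

Without the forwarding check, the second item for address 0 would read the stale value 10 from memory, because the first item's write has not yet landed in this model.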

Inferring future value for speculative branch resolution

Aspects of the invention include determining a first instruction in a processing pipeline, wherein the first instruction includes a compare instruction, determining a second instruction in the processing pipeline, wherein the second instruction includes a conditional branch instruction relying on the compare instruction, determining a predicted result of the compare instruction, and completing the conditional branch instruction using the predicted result prior to executing the compare instruction.

COMPUTATIONAL MEMORY
20230229450 · 2023-07-20

An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.

Marking current context data to control a context-data-dependent processing operation to save current or default context data to a data location

A data processing system includes processing circuitry for executing context-data-dependent program instructions which are decoded by decoder circuitry. Such context-data-dependent program instructions perform processing which is dependent upon currently existing context data. As an example, the context-data-dependent program instructions may be floating point instructions and the context data may be rounding mode information. The decoder circuitry supports a context save instruction which saves context data when it is marked as having been used and saves default context data when the current context data is marked as not having been used. The decoder circuitry further supports a context restore instruction which restores context data when the current context data is marked as having been used and permits the current context data to continue for future use when it is marked as currently unused.
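
The save and restore behavior described above can be sketched in Python as follows. The `Core` class, its method names, and the use of a rounding-mode dictionary as the context data are hypothetical; the abstract names rounding mode information only as one example of context data.

```python
DEFAULT_CONTEXT = {"rounding_mode": "nearest"}  # assumed default context


class Core:
    """Hypothetical model of a core with a 'used' marker on its context."""

    def __init__(self):
        self.context = dict(DEFAULT_CONTEXT)
        self.used = False

    def fp_op(self):
        """A context-data-dependent instruction marks the context as used."""
        self.used = True

    def context_save(self):
        """Save the current context if it is marked used, else the default."""
        return dict(self.context) if self.used else dict(DEFAULT_CONTEXT)

    def context_restore(self, saved):
        """Restore only if the current context is marked used; otherwise the
        current context continues unchanged."""
        if self.used:
            self.context = dict(saved)
```

A core whose context holds a non-default rounding mode but has executed no floating point instruction would, in this sketch, save the default context rather than its current one.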

Configuration of base clock frequency of processor based on usage parameters

A processing device includes a plurality of processing cores, a control register, associated with a first processing core of the plurality of processing cores, to store a first base clock frequency value at which the first processing core is to run, and a power management circuit to receive a base clock frequency request comprising a second base clock frequency value, store the second base clock frequency value in the control register to cause the first processing core to run at the second base clock frequency value, and expose the second base clock frequency value on a hardware interface associated with the power management circuit.