G06F9/3895

Point to point connected processing elements with data joiner components

A microprocessor system comprises a first processing element, a second processing element, a point-to-point connection between the first processing element and the second processing element, and a communication bus connecting together at least the first processing element and the second processing element. The first processing element includes a first matrix computing unit and the second processing element includes a second matrix computing unit. The point-to-point connection is configured to provide at least a result of the first processing element to a data joiner component of the second processing element configured to join at least the provided result of the first processing element with a result of the second matrix computing unit.
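The joining step can be illustrated with a small sketch. The abstract does not say how the data joiner combines the two results, so the element-wise sum used here is an assumption, chosen because it matches the common case of splitting one matrix product across two processing elements. The names `matmul` and `join_by_sum` are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def join_by_sum(upstream, local):
    """Hypothetical data joiner: element-wise sum of two equal-shaped results."""
    return [[u + v for u, v in zip(u_row, v_row)] for u_row, v_row in zip(upstream, local)]


# Split A (2x4) by columns and B (4x2) by rows so each element handles half the work.
a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]
a_left, a_right = [row[:2] for row in a], [row[2:] for row in a]
b_top, b_bottom = b[:2], b[2:]

pe1_result = matmul(a_left, b_top)      # computed by the first processing element
pe2_local = matmul(a_right, b_bottom)   # computed by the second matrix computing unit

# The first result arrives over the point-to-point link and is joined locally.
joined = join_by_sum(pe1_result, pe2_local)
print(joined)  # [[12, 5], [28, 13]]
```

Under this assumption the joined result equals the full product of `a` and `b`, without the partial result ever crossing the shared communication bus.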

Programmable Accelerator for Data-Dependent, Irregular Operations

Aspects of the disclosure provide for an accelerator capable of accelerating data dependent, irregular, and/or memory-bound operations. An accelerator as described herein includes a programmable engine for efficiently executing computations on-chip that are dynamic, irregular, and/or memory-bound, in conjunction with a co-processor configured to accelerate operations that are predictable in computational load and behavior on the co-processor during design and fabrication.

MEMORY-NETWORK PROCESSOR WITH PROGRAMMABLE OPTIMIZATIONS

Various embodiments are disclosed of a multiprocessor system with processing elements optimized for high performance and low power dissipation and an associated method of programming the processing elements. Each processing element may comprise a fetch unit, a plurality of address generator units, and a plurality of pipelined datapaths. The fetch unit may be configured to receive a multi-part instruction, wherein the multi-part instruction includes a plurality of fields. First and second address generator units may generate, based on different fields of the multi-part instruction, addresses from which to retrieve first and second data for use by an execution unit for the multi-part instruction or a subsequent multi-part instruction. The execution units may perform operations using a single pipeline or multiple pipelines based on third and fourth fields of the multi-part instruction.

SPECIALIZED FIXED FUNCTION HARDWARE FOR EFFICIENT CONVOLUTION

One embodiment provides a graphics processor comprising an instruction cache to store an instruction and a compute block configured to perform multiply-accumulate operations in response to execution of the instruction. The compute block includes a scheduler to schedule a plurality of threads for execution of the instruction and multiply-accumulate circuitry configured to execute the instruction via the plurality of threads, wherein the multiply-accumulate circuitry includes a plurality of functional units configured to process, in parallel via the plurality of threads, a corresponding plurality of matrix elements to multiply a first matrix and a second matrix, and wherein multiplying the first matrix and the second matrix includes multiplying data elements in a row of the first matrix by corresponding data elements in a column of the second matrix to generate a plurality of products.
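The row-by-column step described above can be written out as a short sketch. It runs sequentially and only shows the arithmetic: the parallel threads and functional units of the abstract are not modeled. The function names are hypothetical.

```python
def row_col_products(a, b, i, j):
    """Products of row i of `a` with the corresponding elements of column j of `b`."""
    return [a[i][k] * b[k][j] for k in range(len(b))]


def matmul_via_mac(a, b):
    """Build each output element by accumulating the row-by-column products."""
    result = []
    for i in range(len(a)):
        out_row = []
        for j in range(len(b[0])):
            acc = 0
            for product in row_col_products(a, b, i, j):
                acc += product  # the accumulate half of multiply-accumulate
            out_row.append(acc)
        result.append(out_row)
    return result


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(row_col_products(a, b, 0, 0))  # [5, 14]
print(matmul_via_mac(a, b))          # [[19, 22], [43, 50]]
```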

METHOD OF STORING REGISTER DATA ELEMENTS TO INTERLEAVE WITH DATA ELEMENTS OF A DIFFERENT REGISTER, A PROCESSOR THEREOF, AND A SYSTEM THEREOF

A method of storing register data elements to interleave with data elements of a different register, a processor thereof, and a system thereof, wherein non-consecutive data elements of a register are retrieved and stored so as to interleave with non-consecutive data elements of a different register upon execution of an interleaving store instruction, wherein a mask instruction directing a lane of a storage space in which the non-consecutive data elements are stored is executed in conjunction with the interleaving store instruction, and wherein a processor of a second type is configured to emulate a processor of a first type to store the non-consecutive data elements the same as non-consecutive data elements stored in the first type processor.
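One possible reading of the interleaving store is sketched below. The abstract fixes neither the element stride nor the mask semantics, so both are assumptions: every second element of each register is picked, and the mask is treated as a per-lane write enable. The function `interleaving_store` and its parameters are hypothetical.

```python
def interleaving_store(reg_a, reg_b, storage, lane_mask, stride=2):
    """Store strided elements of two registers, interleaved, into enabled lanes.

    Assumptions (not stated in the abstract): "non-consecutive" means every
    `stride`-th element, and `lane_mask` enables or disables each storage lane.
    """
    picked_a = reg_a[::stride]
    picked_b = reg_b[::stride]
    # Alternate a0, b0, a1, b1, ...
    values = iter([x for pair in zip(picked_a, picked_b) for x in pair])

    out = list(storage)
    for lane, enabled in enumerate(lane_mask):
        if not enabled:
            continue  # masked-off lanes keep their previous contents
        value = next(values, None)
        if value is None:
            break  # nothing left to store
        out[lane] = value
    return out


reg_a = [0, 1, 2, 3]
reg_b = [10, 11, 12, 13]
stored = interleaving_store(reg_a, reg_b, [-1] * 8, [1, 0, 1, 0, 1, 0, 1, 0])
print(stored)  # [0, -1, 10, -1, 2, -1, 12, -1]
```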

METHOD OF INTERLEAVED PROCESSING ON A GENERAL-PURPOSE COMPUTING CORE
20230367604 · 2023-11-16

A method of “interleaved processing” (IP) is proposed which generalizes the functional principle of memory interleaving by extending the interleaved memory system into the processor chip and prepending a data-transforming operation to each write access to one of the extended interleaved memory banks. The method opens a new dimension of large scale software parallelization and is implemented in autonomous processing units called “parallel processing channels” (PPC) that integrate processor and memory at a very low machine balance, which solves the memory wall problem, and execute on-chip machine transactions at a 1 Flop/cycle throughput. IP computing systems are linearly performance scalable and capable of pipelined execution of very large and complex HPC workloads. They have unique performance advantages in strided vector, tensor, and data set operations; for relevant HPC workload types, up to 10×-100× per-Watt single-processor performance gains compared to today's technologies are expected.
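The underlying principle can be sketched in a few lines. Conventional low-order interleaving maps consecutive addresses to consecutive banks, and the sketch adds a transform applied before each write. The class `InterleavedBanks` is hypothetical, and the doubling transform stands in for whatever operation a parallel processing channel would actually perform.

```python
class InterleavedBanks:
    """Toy model of interleaved memory with a transform applied before each write."""

    def __init__(self, num_banks, transform):
        self.num_banks = num_banks
        self.transform = transform  # hypothetical stand-in for a channel's operation
        self.banks = [dict() for _ in range(num_banks)]

    def locate(self, address):
        """Low-order interleaving: consecutive addresses land in consecutive banks."""
        return address % self.num_banks, address // self.num_banks

    def write(self, address, value):
        bank, offset = self.locate(address)
        self.banks[bank][offset] = self.transform(value)  # transform, then store

    def read(self, address):
        bank, offset = self.locate(address)
        return self.banks[bank][offset]


memory = InterleavedBanks(num_banks=4, transform=lambda x: 2 * x)
for address in range(8):
    memory.write(address, address)

print(memory.locate(5))  # (1, 1)
print(memory.read(5))    # 10
```

Because each bank sees only its own share of the addresses, a strided or sequential write stream spreads across all banks, and each bank can apply its transform independently.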

Specialized fixed function hardware for efficient convolution

One embodiment provides an apparatus comprising an instruction cache to store a plurality of instructions, a scheduler unit coupled to the instruction cache, the scheduler unit to schedule the plurality of instructions for execution, an instruction fetch and decode unit to decode the plurality of instructions to determine a set of operations to perform in response, one or more compute blocks to perform parallel multiply-accumulate operations based on the instruction fetch and decode unit decoding a first instruction of the plurality of instructions, and matrix multiplication logic to perform matrix multiplication operations based on the instruction fetch and decode unit decoding a second instruction of the plurality of instructions.

COMPUTE OPTIMIZATION MECHANISM

An apparatus to facilitate compute optimization is disclosed. The apparatus includes a mixed-precision core including mixed-precision execution circuitry to execute one or more mixed-precision instructions to perform a mixed-precision dot-product operation comprising a set of multiply and accumulate operations.
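A mixed-precision dot product can be sketched as follows. The abstract does not name the precisions involved, so the sketch assumes a common arrangement: operands rounded to IEEE 754 half precision and accumulation in a wider format (here Python's 64-bit float). The helper names are hypothetical.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mixed_precision_dot(xs, ys):
    """Multiply half-precision operands and accumulate in a wider format.

    Assumption: narrow multiply inputs with a wide accumulator; the abstract
    does not specify which precisions are mixed.
    """
    acc = 0.0  # wide accumulator
    for x, y in zip(xs, ys):
        acc += to_half(x) * to_half(y)
    return acc


print(mixed_precision_dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
print(to_half(0.1) == 0.1)                          # False: 0.1 is not exact in binary16
```

Keeping the accumulator wide limits the rounding error that would otherwise build up when many narrow products are summed.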

Systems and methods for improving computational speed of planning by enabling interactive processing in hypercubes
11416262 · 2022-08-16

A system for assigning a workload to compute resources includes an interface and a processor. The interface is configured to receive a workload. The processor is configured to break the workload into a set of subproblems; and for a subproblem of the set of subproblems: determine whether the subproblem benefits from intersheet parallelism; determine whether the subproblem benefits from intrasheet parallelism; determine whether the subproblem benefits from directed acyclic graph (DAG) partitioning; and assign the subproblem, wherein assigning the subproblem utilizes optimization when appropriate based at least in part on benefits from the intersheet parallelism, the intrasheet parallelism, and the DAG partitioning.
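The decision flow for a single subproblem can be sketched as below. The abstract does not give the criteria for deciding whether a subproblem benefits from each technique, so the fields of `Subproblem` and the thresholds in `choose_optimizations` are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Subproblem:
    name: str
    sheets: int           # hypothetical: number of independent sheets
    cells_per_sheet: int  # hypothetical: size of each sheet
    has_dag: bool         # hypothetical: dependencies form a partitionable DAG


def choose_optimizations(sp, intrasheet_threshold=1000):
    """Apply the three checks from the abstract, using made-up criteria."""
    chosen = []
    if sp.sheets > 1:
        chosen.append("intersheet")
    if sp.cells_per_sheet >= intrasheet_threshold:
        chosen.append("intrasheet")
    if sp.has_dag:
        chosen.append("dag_partitioning")
    return chosen


def assign(workload):
    """Map each subproblem to the optimizations that apply to it."""
    return {sp.name: choose_optimizations(sp) for sp in workload}


workload = [
    Subproblem("forecast", sheets=4, cells_per_sheet=50, has_dag=False),
    Subproblem("rollup", sheets=1, cells_per_sheet=5000, has_dag=True),
]
plan = assign(workload)
print(plan)  # {'forecast': ['intersheet'], 'rollup': ['intrasheet', 'dag_partitioning']}
```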

Method of storing register data elements to interleave with data elements of a different register, a processor thereof, and a system thereof

A method of storing register data elements to interleave with data elements of a different register, a processor thereof, and a system thereof, wherein non-consecutive data elements of a register are retrieved and stored so as to interleave with non-consecutive data elements of a different register upon execution of an interleaving store instruction, wherein a mask instruction directing a lane of a storage space in which the non-consecutive data elements are stored is executed in conjunction with the interleaving store instruction, and wherein a processor of a second type is configured to emulate a processor of a first type to store the non-consecutive data elements the same as non-consecutive data elements stored in the first type processor.