G06F9/38873

Vector Processor Architectures

The present disclosure relates to an integrated circuit device that includes a plurality of vector registers configurable to store a plurality of vectors and switch circuitry communicatively coupled to the plurality of vector registers. The switch circuitry is configurable to route a portion of the plurality of vectors. Additionally, the integrated circuit device includes a plurality of vector processing units communicatively coupled to the switch circuitry. The plurality of vector processing units is configurable to receive the portion of the plurality of vectors, perform one or more operations involving the portion of the plurality of vectors, and output a second plurality of vectors generated by performing the one or more operations.
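
As an illustration only, the following minimal Python sketch models the dataflow this abstract describes: a register file holding vectors, switch circuitry that routes a selected portion of them, and vector processing units that operate on the routed vectors. The names (switch_route, vector_processing_units) and the choice of elementwise addition are assumptions for the sketch, not details taken from the disclosure.

    import numpy as np

    # Vector register file: 8 registers, each holding a 4-element vector.
    vector_registers = np.arange(32, dtype=np.int32).reshape(8, 4)

    def switch_route(registers, selection):
        """Switch circuitry: route a selected portion of the stored
        vectors to the vector processing units (a simple crossbar read)."""
        return registers[selection]

    def vector_processing_units(operands_a, operands_b):
        """Each VPU performs an operation on its pair of routed input
        vectors and outputs a new vector (here: elementwise addition)."""
        return operands_a + operands_b

    # Route a portion of the stored vectors, operate, and produce new vectors.
    a = switch_route(vector_registers, [0, 2])
    b = switch_route(vector_registers, [1, 3])
    outputs = vector_processing_units(a, b)
    print(outputs)  # the second plurality of vectors generated by the operation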

AI accelerator apparatus using full mesh connectivity chiplet devices for transformer workloads

An AI accelerator apparatus using in-memory compute chiplet devices. The apparatus includes a first semiconductor substrate having a plurality of chiplets, each of which includes a plurality of tiles. Each tile includes a plurality of slices, a central processing unit (CPU), and a hardware dispatch device. Each slice can include a digital in-memory compute (DIMC) device configured to perform high-throughput computations. In particular, the DIMC device can be configured to accelerate the computations of attention functions for transformer-based models (a.k.a. transformers) applied to machine learning applications. The chiplets are in a full mesh connectivity configuration such that at least one of the die-to-die (D2D) interconnects of each chiplet is coupled to one of the D2D interconnects of each other chiplet using a non-diagonal link. The chiplets can also include other interfaces to facilitate communication between the chiplets, memory, and a server or host system.
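
A minimal sketch of the full mesh connectivity property described here: with one D2D link per unordered pair of chiplets, every chiplet is directly coupled to every other chiplet, i.e. the edge set of a complete graph. The function name and link representation are illustrative; physical details such as which interconnect of each chiplet is used and the non-diagonal routing are not modeled.

    from itertools import combinations

    def full_mesh_links(num_chiplets):
        """Full mesh connectivity: one die-to-die (D2D) link couples every
        unordered pair of chiplets (the edge set of a complete graph)."""
        return list(combinations(range(num_chiplets), 2))

    links = full_mesh_links(4)
    print(links)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(len(links))  # n*(n-1)/2 = 6 links for 4 chiplets

    # Sanity check: every chiplet reaches every other chiplet directly.
    for a in range(4):
        for b in range(4):
            if a != b:
                assert (min(a, b), max(a, b)) in links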

Compute unit sorting for reduced divergence

Described herein are techniques for reducing divergence of control flow in a single-instruction-multiple-data processor. The method includes, at a point of divergent control flow, identifying control flow targets for different execution items, sorting the execution items based on the control flow targets, reorganizing the execution items based on the sorting, and executing the reorganized execution items after the point of divergent control flow.
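
The following toy Python model illustrates the sort-and-reorganize step, assuming a fixed SIMD group width and one branch target per execution item; the names and data are invented for the sketch, and the actual hardware mechanism is not modeled.

    def sort_by_target(items, targets, group_size):
        """At a point of divergent control flow, sort execution items by
        their control flow target so that each SIMD group of `group_size`
        items executes as few distinct branch targets as possible."""
        order = sorted(range(len(items)), key=lambda i: targets[i])
        reorganized = [items[i] for i in order]
        return [reorganized[i:i + group_size]
                for i in range(0, len(reorganized), group_size)]

    items   = list("abcdefgh")
    targets = [1, 0, 1, 0, 0, 1, 0, 1]   # branch taken (1) vs not taken (0)
    # Unsorted, every 4-wide group mixes both targets; sorted, each group
    # is uniform, so no lanes sit masked off after the divergence point.
    print(sort_by_target(items, targets, 4))
    # [['b', 'd', 'e', 'g'], ['a', 'c', 'f', 'h']]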

PROCESSOR FOR CONFIGURABLE PARALLEL COMPUTATIONS
20250272094 · 2025-08-28

A programmable data processor includes multiple configurable pipeline circuits, each including numerous arithmetic and logic operator circuits that can be configured into an execution pipeline controlled according to a state machine. Each configurable pipeline circuit also includes numerous building block circuits that can be configured into a sequencer for the state machine. The building block circuits may include (i) state elements for representing a state in the state machine, and (ii) loop elements for representing a loop in the state machine.
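
As a rough illustration, the sketch below assembles a sequencer from state elements and a loop element and uses it to drive a configured pipeline of operator circuits. The operator set, loop count, and all names are invented for the sketch, not taken from the disclosure.

    def make_sequencer(states, loop_count):
        """Sequencer for the pipeline's state machine, built from state
        elements (one entry per state) and a loop element that repeats
        the state sequence `loop_count` times."""
        for _ in range(loop_count):        # loop element
            for state in states:           # state elements
                yield state

    # Pipeline of arithmetic/logic operator circuits, configured as stages.
    operators = {
        "MUL": lambda x: x * 3,
        "ADD": lambda x: x + 1,
        "AND": lambda x: x & 0xFF,
    }

    value = 7
    for state in make_sequencer(["MUL", "ADD", "AND"], loop_count=2):
        value = operators[state](value)    # stage selected by current state
    print(value)                           # 67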

Structured Sparse Matrix Acceleration In Systolic Arrays

Aspects of the disclosure are directed to hardware acceleration of structured sparse workloads with block quantization. A hardware accelerator can receive compressed input matrices, for example as part of a workload for training or processing a machine learning model. The hardware accelerator can multiply a compressed input matrix with a gains matrix loaded in one or more matrix multiply units (MXUs) of the hardware accelerator. The input matrices can further be provided in a block data type format, in which blocks of mantissas are represented with a single shared scaling factor. An MXU can multiply the block data and shift or cast it according to the shared scaling factor to generate an output product. As a result, block data type matrices exhibiting structured sparsity patterns can be accelerated without affecting the overall accuracy or quality of the output of the workload being processed.
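
A minimal sketch of the block data type arithmetic described here: each block of integer mantissas carries one shared power-of-two scaling factor, and each per-block partial product is shifted by the sum of the two blocks' scales. Block size, scales, and values are made up, and the integer-multiply-then-shift flow is a software stand-in for what an MXU would do in hardware.

    import numpy as np

    def block_scaled_dot(mant_a, scale_a, mant_b, scale_b):
        """Dot product of two block-formatted vectors: each block of
        mantissas carries one shared power-of-two scaling factor, so the
        per-block partial product is shifted by the sum of the two scales."""
        total = 0.0
        for ma, sa, mb, sb in zip(mant_a, scale_a, mant_b, scale_b):
            partial = int(np.dot(ma, mb))          # integer mantissa multiply
            total += partial * 2.0 ** (sa + sb)    # shift by shared scales
        return total

    # Two vectors split into blocks of 4 mantissas, one scale per block.
    mant_a  = [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])]
    scale_a = [-2, -3]
    mant_b  = [np.array([1, 1, 1, 1]), np.array([2, 0, 0, 1])]
    scale_b = [0, -1]

    # Reference: reconstruct dense values and use a plain dot product.
    dense_a = np.concatenate([m * 2.0 ** s for m, s in zip(mant_a, scale_a)])
    dense_b = np.concatenate([m * 2.0 ** s for m, s in zip(mant_b, scale_b)])
    assert np.isclose(block_scaled_dot(mant_a, scale_a, mant_b, scale_b),
                      np.dot(dense_a, dense_b))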

Structured Sparse Matrix Acceleration In Systolic Arrays

Methods, systems, and apparatus, including computer-readable storage media, for processing block-scaled data on processing devices, where the block size of the data is smaller than the number of implemented processing lanes on the devices. An example process performed by the devices is matrix multiplication. The processing device is configured to load pre-computed scaling factors for static data, and to generate scaling factors for dynamic data as part of the matrix multiplication pipeline for the device. The processing device is configured to cause scaling factors of different blocks of either operand matrix being multiplied to be applied to the corresponding blocks during multiplication. The processing device can generate correct matrix multiplication products of block-scaled input, even when the block size is smaller (more granular) than the number of processing lanes. Aspects of the disclosure relate to generating scaling factors for input matrices received by a SIMD-configured processing device.
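
The sketch below illustrates the case this abstract addresses: applying per-block scales when the block size (here 4) is finer than the implemented lane count (here 8), with pre-computed scales for the static operand, scales generated on the fly for the dynamic operand, and each scale broadcast across its block's lanes. Lane width, block size, and the power-of-two scaling rule are assumptions for the sketch.

    import numpy as np

    LANES = 8          # SIMD lanes implemented by the device
    BLOCK = 4          # block size of the scaled data (finer than LANES)

    def expand_scales(block_scales):
        """Broadcast one scale per block across that block's lanes, so a
        single LANES-wide operation applies the right scale to each lane."""
        return np.repeat(block_scales, BLOCK)

    # Static operand: pre-computed block scales, loaded ahead of time.
    static_mant   = np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.int32)
    static_scales = np.array([-1, -3])              # one per block

    # Dynamic operand: scales generated inside the multiplication pipeline.
    dynamic    = np.array([0.5, 0.25, 1.0, 0.75, 2.0, 1.5, 0.5, 1.0])
    dyn_scales = np.array([int(np.ceil(np.log2(np.abs(blk).max())))
                           for blk in dynamic.reshape(-1, BLOCK)])
    dyn_mant   = dynamic / 2.0 ** expand_scales(dyn_scales)

    # Lane-wise multiply with per-block scales applied to each lane group.
    product = (static_mant * dyn_mant
               * 2.0 ** (expand_scales(static_scales) + expand_scales(dyn_scales)))
    assert np.allclose(product,
                       static_mant * 2.0 ** expand_scales(static_scales) * dynamic)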

PROCESSING PIPELINE WITH ZERO LOOP OVERHEAD
20250335197 · 2025-10-30

Techniques are disclosed for reducing or eliminating loop overhead caused by function calls in processors that form part of a pipeline architecture. The processors in the pipeline process data blocks in an iterative fashion, with each processor in the pipeline completing one of several iterations associated with a processing loop for a commonly-executed function. The described techniques leverage the use of message passing for pipelined processors to enable an upstream processor to signal to a downstream processor when processing has been completed, and thus a data block is ready for further processing in accordance with the next loop processing iteration. The described techniques facilitate a zero loop overhead architecture, enable continuous data block processing, and allow the processing pipeline to function indefinitely within the main body of the processing loop associated with the commonly-executed function where efficiency is greatest.
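
A rough sketch of the message-passing handoff, using Python threads and queues as a stand-in for pipelined processors: each stage blocks until the upstream stage signals that a data block is ready, performs its iteration of the common loop body, and signals downstream, so no per-block function call or loop setup is needed. The sketch bounds the iteration count so it terminates; the disclosure's pipeline would run indefinitely inside the main loop body.

    import queue
    import threading

    def stage(stage_id, inbox, outbox, iterations=3):
        """One pipelined processor: stays inside the main loop body and
        waits for an upstream message saying a data block is ready."""
        for _ in range(iterations):
            block = inbox.get()            # message: block ready for this stage
            block = block + [stage_id]     # this stage's iteration of the work
            outbox.put(block)              # signal the downstream processor

    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=stage, args=(0, q0, q1)),
               threading.Thread(target=stage, args=(1, q1, q2))]
    for w in workers:
        w.start()
    for block_id in range(3):              # continuous data block processing
        q0.put([block_id])
    for w in workers:
        w.join()
    while not q2.empty():
        print(q2.get())                    # [0, 0, 1], [1, 0, 1], [2, 0, 1]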

COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE

A method and apparatus for reducing divergence of control flow when executing multiple execution items in parallel are disclosed. The method comprises, at a point of divergent control flow, for each execution item, identifying a control flow target that designates a respective post-divergence code path; sorting the execution items in accordance with the identified control flow targets to obtain sorted execution-item groups; redistributing the execution items between distinct wavefronts of a workgroup or different time slots within a wavefront so that, within at least one wavefront or time slot, a greater proportion of the execution items share a common control flow target than prior to the redistribution; and continuing execution of the execution items after the point of divergent control flow using the redistributed execution items.
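
The following sketch illustrates only the redistribution step: execution items from all wavefronts of a workgroup are pooled, keyed by their control flow target, and repacked so each wavefront holds a larger share of items on one post-divergence path. Wavefront size, targets, and names are invented for the sketch.

    def redistribute(workgroup, targets, wavefront_size):
        """Pool the execution items of all wavefronts in a workgroup, sort
        by control flow target, and repack into wavefronts so that each
        wavefront holds a larger proportion of items sharing one target."""
        flat = [item for wavefront in workgroup for item in wavefront]
        flat.sort(key=lambda item: targets[item])
        return [flat[i:i + wavefront_size]
                for i in range(0, len(flat), wavefront_size)]

    # Two 4-wide wavefronts; targets map each item to its post-divergence path.
    workgroup = [[0, 1, 2, 3], [4, 5, 6, 7]]
    targets   = {0: "A", 1: "B", 2: "A", 3: "B", 4: "B", 5: "A", 6: "A", 7: "B"}
    print(redistribute(workgroup, targets, 4))
    # [[0, 2, 5, 6], [1, 3, 4, 7]]  -- each wavefront now runs a single path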

Systems and methods for automatically generating computer programming code and schedules and comparing their performance

Systems and methods utilize an auto-scheduler of a Domain Specific Language (DSL) to schedule one or more portions of a computer program written in a programming language other than the DSL. Portions of the computer program compatible with the DSL may be identified. The portions may be translated to a form compatible with the DSL. The DSL may generate schedules for the portions. Code may be generated for the computer program and the code may be executed. The schedules generated by the DSL may be utilized during execution of the code generated for the computer program.
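
As a stand-in for the generate-and-compare idea, the sketch below runs the same computation under two candidate schedules (loop orders) and times each; in the disclosure the candidate schedules would come from the DSL's auto-scheduler rather than being hand-written, and the loop-order choice here is purely illustrative.

    import time
    import numpy as np

    def run_schedule(a, b, order):
        """Stand-in for a generated schedule: the same matrix product
        computed under two candidate loop orders ('ij' vs 'ji')."""
        n = a.shape[0]
        out = np.zeros((n, n))
        if order == "ij":                       # row-major traversal
            for i in range(n):
                for j in range(n):
                    out[i, j] = a[i, :] @ b[:, j]
        else:                                   # column-major traversal
            for j in range(n):
                for i in range(n):
                    out[i, j] = a[i, :] @ b[:, j]
        return out

    a = np.random.rand(200, 200)
    b = np.random.rand(200, 200)
    for order in ("ij", "ji"):                  # compare schedule performance
        start = time.perf_counter()
        run_schedule(a, b, order)
        print(order, time.perf_counter() - start, "seconds")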