Patent classifications
G06F15/8046
SYSTOLIC ARRAY WITH EFFICIENT INPUT REDUCTION AND EXTENDED ARRAY PERFORMANCE
Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.
USING SPARSITY METADATA TO REDUCE SYSTOLIC ARRAY POWER CONSUMPTION
A processing apparatus can include a general-purpose parallel processing engine comprising a matrix accelerator including a multi-stage systolic array, where each stage includes multiple processing elements associated with multiple processing channels. The multiple processing elements are configured to receive output sparsity metadata that is independent of input sparsity of input matrix elements and perform processing operations on the input matrix elements based on the output sparsity metadata.
DUAL PIPELINE PARALLEL SYSTOLIC ARRAY
A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.
SYSTOLIC ARRAY HAVING SUPPORT FOR OUTPUT SPARSITY
A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to perform processing operations on input matrix elements based on output sparsity metadata. The output sparsity metadata indicates to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix.
SYSTOLIC ARRAY OF ARBITRARY PHYSICAL AND LOGICAL DEPTH
A processing apparatus includes a processing resource including a general-purpose parallel processing engine and a matrix accelerator. The matrix accelerator includes first circuitry to receive a command to perform operations associated with an instruction, second circuitry to configure the matrix accelerator according to a physical depth of a systolic array within the matrix accelerator and a logical depth associated with the instruction, third circuitry to read operands for the instruction from a register file associated with the systolic array, fourth circuitry to perform operations for the instruction via one or more passes through one or more physical pipeline stages of the systolic array based on a configuration performed by the second circuitry, and fifth circuitry to write output of the operations to the register file associated with the systolic array.
Vector processing unit
A vector processing unit is described, and includes processor units that each include multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth based on the placement of respective processor units relative to one another, and based on the placement of the vector memory relative to each processor unit.
Arbitrating throttling recommendations for a systolic array
Throttling recommendations for a systolic array may be arbitrated. Throttling recommendations may be received at an arbiter for a systolic array from different sources, such as one or more monitors implemented in an integrated circuit along with the systolic array or sources external to the integrated circuit with the systolic array. A strongest throttling recommendation may be selected. The rate at which data enters the systolic array may be modified according to the strongest throttling recommendation.
Multi-Mode Integrated Circuits With Balanced Energy Consumption
Aspects of the disclosure include methods, systems, and apparatus, including computer-readable storage media for multi-mode integrated circuits with balanced energy consumption. A method includes determining, by one or more processors and based at least on a maximum energy threshold for planned multi-mode system having one or more processing units, a respective number of operations that can be performed per clock cycle by the processing units for each operating mode. The system is configured to consume the same amount of energy per clock cycle in each operating mode, but perform more operations in operating modes corresponding to operations performed on smaller bit-width operands.
Multi-Dimensional Data Path Architecture
Various implementations described herein are directed to a device having a multi-layered logic structure with a first logic layer and a second logic layer arranged vertically in a stacked configuration. The device may have a memory array that provides data, and also, the device may have an inter-layer data bus that vertically couples the memory array to the multi-layered logic structure. The inter-layer data bus may provide multiple data paths to the first logic layer and the second logic layer for reuse of the data provided by the memory array.
Neural network accelerator including bidirectional processing element array
Provided is a neural network accelerator which performs a calculation of a neural network provided with layers, the neural network accelerator including a kernel memory configured to store kernel data related to a filter, a feature map memory configured to store feature map data which are outputs of the layers, and a Processing Element (PE) array including PEs arranged along first and second directions, wherein each of the PEs performs a calculation using the feature map data transmitted in the first direction from the feature map memory and the kernel data transmitted in the second direction from the kernel memory, and transmits a calculation result to the feature map memory in a third direction opposite to the first direction.