IPIQ

G06F15/8046

Synthesizing Zero-Loss Low-Power Approximate DNN Accelerators With Large-Scale Search

20230162010 · 2023-05-25 ·

Systems and methods are provided for designing approximate, low-power deep learning accelerator chips that have little to no accuracy loss when executing a deep learning model. A set of approximate systolic arrays may be generated. The performance of each approximate systolic array in the set of approximate systolic arrays processing a deep neural network (DNN) may be determined. Each layer in the DNN may be mapped to an approximate systolic array in the set of approximate systolic arrays. A subset of the set of approximate systolic arrays may be selected for inclusion in the inference chip design based on the mapping and the performance of each approximate systolic array in the set of approximate systolic arrays.

SYSTEMS AND METHODS FOR ACCELERATED NEURAL-NETWORK CONVOLUTION AND TRAINING

20220335283 · 2022-10-20 ·

An application-specific integrated circuit for an artificial neural network is integrated with a high-bandwidth memory. The neural network includes a systolic array of interconnected processing elements, including upstream processing elements and downstream processing elements. Each processing element includes input/output port pairs for concurrent forward and back propagation. The processing elements can be used for convolution, in which case the input/output port pairs can support the fast and efficient scanning of kernels relative to activations.

Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range

11467806 · 2022-10-11 ·

Amazon Technologies, Inc.

Thomas Elmer

Systems and methods are provided to perform multiply-accumulate operations of normalized numbers in a systolic array to enable greater computational density, reduce the size of systolic arrays required to perform multiply-accumulate operations of normalized numbers, and/or enable higher throughput operation. The systolic array can be provided normalized numbers by a column of normalizers and can lack support for denormal numbers. Each normalizer can normalize the inputs to each processing element in the systolic array. The systolic array can include a multiplier and an adder. The multiplier can have multiple data paths that correspond to the data type of the input. The multiplier and adder can employ expanded exponent range to operate on normalized floating-point numbers and can lack support for denormal numbers.

Using shared data bus to support systolic array tiling

11625453 · 2023-04-11 ·

Amazon Technologies, Inc.

To improve utilization of a systolic array, each row of the array is provided with a number of general purpose row input data buses. Each of the general purpose row input data buses can be operable to transfer either feature map (FMAP) input elements or weight values into the processing elements of the corresponding row of the array. By using such general purpose row input data buses, concurrent matrix multiplications as well as faster background weight loading can be achieved in the array.

Synchronizing operations in hardware accelerator

11468304 · 2022-10-11 ·

Amazon Technologies, Inc.

Ron Diamant

In one example, a hardware accelerator comprises an event register that stores an event; a hardware execution engine; and a controller configured to: extract, from an instruction, parameters of an operation to be performed by the hardware execution engine, and a synchronization primitive of a plurality of synchronization primitives for the event; and based on the synchronization primitive, perform at least one of: controlling a start time of the operation at the hardware execution engine, or determining whether to access the event register. The synchronization primitives include a set operation to set the event and/or a wait operation to suspend the operation at the hardware execution engine until the event is set. The plurality of synchronization primitive defines different conditions to be satisfied in order to perform the set operation.

SPARSE SYSTOLIC ARRAY DESIGN

20230109301 · 2023-04-06 ·

A systolic array can be configured to skip distributed operands that have zero-values, resulting in improved resource efficiency. A skip module is introduced to receive operands from memory, identify whether they have a zero value or not, and, if they are nonzero, generate an operand vector including an index before sending the operand vector to a processing element.

Systems and methods for improving cache efficiency and utilization

11620256 · 2023-04-04 ·

Intel Corporation

Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.

Application specific integrated circuit accelerators

11652484 · 2023-05-16 ·

Google Llc

An application specific integrated circuit (ASIC) chip includes: a systolic array of cells; and multiple controllable bus lines configured to convey data among the systolic array of cells, in which the systolic array of cells is arranged in multiple tiles, each tile of the multiple tiles including 1) a corresponding sub array of cells of the systolic array of cells, 2) a corresponding subset of controllable bus lines of the multiple controllable bus lines, and 3) memory coupled to the subarray of cells.

SYSTOLIC NEURAL CPU PROCESSOR

20230205729 · 2023-06-29 ·

Jie Gu
Yuhao Ju

A systolic neural CPU (SNCPU) including a two-dimensional systolic array of reconfigurable processing elements (PE's) fuses a conventional CPU with a convolutional neural network (CNN) accelerator in four phases of operation: row-CPU, column-accelerator, column-CPU, and row-accelerator. The SNCPU cycles through the four phases to avoid costly data movement across cores, reduce overhead, and reduce latency. The PE's communicate bidirectionally with neighboring PE's and memory units at an outer edge of the array. A row of PE's is configurable into a first deep neural network (DNN) accumulator at a first time and configurable into a first CPU pipeline at a second time. A column of PE's is configurable into a second DNN accumulator at a third time and configurable into a second CPU pipeline at a fourth time.

GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT

20230195685 · 2023-06-22 ·

Intel Corporation

Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.

Patent classifications

G06F15/8046