G06F5/01

DUAL EXPONENT BOUNDING BOX FLOATING-POINT PROCESSOR

Apparatus and methods are disclosed for performing matrix operations, including operations suited to neural network and other machine learning accelerators and applications, using dual exponent formats. Disclosed matrix formats include single exponent bounding box floating-point (SE-BBFP) and dual exponent bounding box floating-point (DE-BBFP) formats. Shared exponents for each element are determined based on whether the element is used as a row of a matrix tile or a column of a matrix tile, for example, for a dot product operation. Computing systems suitable for employing such neural networks include computers having general-purpose processors, neural network accelerators, or reconfigurable logic devices, such as field-programmable gate arrays (FPGAs). Certain techniques disclosed herein can provide improved system performance while reducing memory and network bandwidth used.
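
The idea of sharing one exponent across a row or a column of a tile can be sketched in a few lines of Python. This is an illustration only, not the disclosed format: the function names, the 8-bit mantissa width, and the choice of the largest element exponent as the shared exponent are all assumptions made for the example.

```python
import math

def shared_exponent(values):
    """Smallest e such that every |v| < 2**e (0 for an all-zero group)."""
    exps = [math.frexp(v)[1] for v in values if v != 0.0]
    return max(exps) if exps else 0

def quantize_group(values, mantissa_bits=8):
    """Return (shared_exp, signed integer mantissas) for one row or column."""
    e = shared_exponent(values)
    scale = 2.0 ** (mantissa_bits - 1 - e)
    limit = 2 ** (mantissa_bits - 1) - 1
    # Clamp so a value that rounds up to 2**(mantissa_bits-1) still fits.
    mants = [max(-limit, min(limit, round(v * scale))) for v in values]
    return e, mants

def dual_exponents(tile):
    """Hypothetical helper: one shared exponent per row and one per column."""
    row_exps = [shared_exponent(row) for row in tile]
    col_exps = [shared_exponent(col) for col in zip(*tile)]
    return row_exps, col_exps

def bbfp_dot(row, col, mantissa_bits=8):
    """Integer multiply-accumulate, then a single scale by both exponents."""
    er, mr = quantize_group(row, mantissa_bits)
    ec, mc = quantize_group(col, mantissa_bits)
    acc = sum(a * b for a, b in zip(mr, mc))
    return acc * 2.0 ** (er + ec - 2 * (mantissa_bits - 1))
```

In this sketch, an element at row i and column j is associated with two exponents: the row exponent applies when the tile is the left operand of a dot product, and the column exponent applies when it is the right operand. The multiply-accumulate itself runs on integers, and the two shared exponents are applied once per dot product.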

Dynamic tile parallel neural network accelerator

A dynamic-tile neural network accelerator allows for the number and size of computational tiles to be re-configured. Each sub-array of computational cells has edge cells on the left-most column that have an added vector mux that feeds the cell output back to an adder-comparator to allow Rectified Linear Unit (ReLU) and pooling operations that combine outputs shifted in from other cells. The edge cells drive external output registers and receive external weights. The weights and outputs are shifted in opposite directions horizontally between cells, while control and input data are shifted in the same direction vertically between cells. A column of row data selectors is inserted between sub-arrays to bypass weights and output data around sub-arrays, while a row of column data selectors is inserted between sub-arrays to bypass control and input data. Larger tiles are configured by passing data directly through these selectors without bypassing.

System and method for shift-based information mixing across channels for shufflenet-like neural networks
11615319 · 2023-03-28 · ·

Disclosed herein are a system, a method, and a device for performing a convolution on data of a current layer of a neural network, the data including a plurality of channels arranged in a first order and partitioned into a plurality of first partitions according to the first order. Each first partition includes a result of a convolution on a corresponding partition of channels in data of a previous layer of the neural network. The device shifts the plurality of channels arranged in the first order to a second order and partitions the shifted plurality of channels into a plurality of second partitions according to the second order. For each of the plurality of second partitions, the device performs a convolution on channels of the shifted plurality of channels that are in the corresponding second partition.
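
The reordering step can be illustrated with plain lists of channel indices. This sketch assumes the shift is a cyclic rotation and that the partitions are equal-sized; neither detail is stated in the abstract, and the function names are hypothetical. The convolution itself is left out, since only the channel bookkeeping is shown.

```python
def partition(channels, groups):
    """Split an ordered list of channels into equal-sized contiguous groups."""
    size = len(channels) // groups
    return [channels[i * size:(i + 1) * size] for i in range(groups)]

def shift_channels(channels, shift):
    """Cyclically rotate the channel order (assumed form of the shift)."""
    shift %= len(channels)
    return channels[shift:] + channels[:shift]

# First order: eight channels in two partitions, each produced by a
# group convolution on the previous layer.
first = partition(list(range(8)), groups=2)

# Second order: shift by two channels, then partition again. Each second
# partition now draws channels from more than one first partition.
second = partition(shift_channels(list(range(8)), shift=2), groups=2)
```

With these inputs, `first` is `[[0, 1, 2, 3], [4, 5, 6, 7]]` and `second` is `[[2, 3, 4, 5], [6, 7, 0, 1]]`, so a group convolution over the second partitions mixes information that was kept separate in the previous layer.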

Circuit
11614919 · 2023-03-28 · ·

A circuit comprises a first term operation circuit, a second term operation circuit, a third term operation circuit, and a second calculation circuit. Each of the first and the second term operation circuits comprises multiple higher bit operation circuits, a lowest bit operation circuit, and a first calculation circuit. Each of the higher bit operation circuits selectively left-shifts a multiplicand by different bits, outputs the shifted multiplicand, determines a sign of the shifted multiplicand, and left-shifts the shifted multiplicand. The lowest bit operation circuit outputs the multiplicand and determines a sign of the multiplicand. The first calculation circuit generates a term operation result. The third term operation circuit generates a third term operation result. The second calculation circuit adds the term operation results of the first and second term operation circuits and the third term operation result to generate a total operation result.

TININESS DETECTION
20230035159 · 2023-02-02 ·

An apparatus comprises floating-point processing circuitry to perform a floating-point operation with rounding to generate a floating-point result value; and tininess detection circuitry to detect a tininess status indicating whether an outcome of the floating-point operation is tiny. A tiny outcome corresponds to a non-zero number with a magnitude smaller than a minimum non-zero magnitude representable as a normal floating-point number in a floating-point format to be used for the floating-point result value. The tininess detection circuitry comprises hardware circuit logic configured to support both before rounding tininess detection and after rounding tininess detection for detecting the tininess status.
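
The difference between the two detection modes can be shown with exact rational arithmetic. The sketch below is a software model under stated assumptions, not the disclosed circuit: it assumes round-to-nearest-even, takes the precision `p` and minimum normal exponent `emin` as parameters, and uses hypothetical function names.

```python
from fractions import Fraction

def round_to_precision(x, p):
    """Round x to p significant bits (nearest-even), with unbounded exponent."""
    if x == 0:
        return Fraction(0)
    a = abs(x)
    # Find e with 2**e <= a < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    rounded = round(a / ulp) * ulp  # round() on a Fraction is half-to-even
    return rounded if x > 0 else -rounded

def is_tiny_before_rounding(x, emin):
    """Tiny if the exact non-zero result is below the smallest normal."""
    return x != 0 and abs(x) < Fraction(2) ** emin

def is_tiny_after_rounding(x, p, emin):
    """Tiny if the result, rounded as if the exponent range were unbounded,
    is still below the smallest normal."""
    return x != 0 and abs(round_to_precision(x, p)) < Fraction(2) ** emin

# Binary32-like parameters: 24-bit significand, smallest normal 2**-126.
P, EMIN = 24, -126
just_below = Fraction(2) ** EMIN * (1 - Fraction(1, 2 ** 25))
```

For `just_below`, the exact value lies under the smallest normal magnitude, so it is tiny before rounding, but it rounds up to exactly 2**-126 and is therefore not tiny after rounding. This is the case in which the two modes disagree.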

SINGLE-CYCLE KULISCH ACCUMULATOR
20230092574 · 2023-03-23 · ·

A processor to calculate a floating-point dot-product receives sequences of first and second floating-point numbers, each number having a sign, a mantissa value and an exponent value. A floating-point unit determines the floating-point dot-product of the sequences by adding the exponent values to determine an exponent product, calculating a shift amount as a one's complement of a low exponent, multiplying the mantissas of the sequences to determine a product value of the mantissas, right shifting the product value of the mantissas by the shift amount to generate a shifted product, selecting segments of an accumulator based on a high exponent, and adding the selected segments to the shifted product to generate a sum. The sum is then written into the selected segments of the accumulator.
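
The general idea behind a Kulisch-style accumulator, summing every product exactly in one wide fixed-point register, can be modelled with Python's arbitrary-precision integers. This sketch does not model the segment selection or the one's-complement shift described in the abstract; it uses a single big integer in place of the segmented accumulator, and the names and the `FRAC_BITS` constant are assumptions chosen to cover IEEE 754 double-precision inputs.

```python
import math
from fractions import Fraction

# Enough fractional bits to hold the product of the two smallest
# double-precision values after decomposition below (2 * 1126).
FRAC_BITS = 2252

def decompose(x):
    """Return (integer mantissa, exponent) with x == mantissa * 2**exponent."""
    m, e = math.frexp(x)          # x == m * 2**e, 0.5 <= |m| < 1
    return int(m * (1 << 53)), e - 53

def kulisch_dot(xs, ys):
    """Accumulate all products exactly in one wide fixed-point integer."""
    acc = 0
    for x, y in zip(xs, ys):
        mx, ex = decompose(x)
        my, ey = decompose(y)
        # Add exponents, multiply mantissas, align to the accumulator.
        acc += (mx * my) << (ex + ey + FRAC_BITS)
    return acc

def to_float(acc):
    """Round the exact accumulator value to the nearest double, once."""
    return float(Fraction(acc, 1 << FRAC_BITS))

result = to_float(kulisch_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]))
```

Here `result` is 1.0, whereas accumulating the same three products left to right in double precision loses the 1.0 when it is added to 1e16 and yields 0.0. Rounding happens only once, when the accumulator is converted back to a floating-point value.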

PERFORMING COMPARISON OPERATIONS USING EXTENDED EXPONENT RANGE FLOATING POINT VALUES
20220350566 · 2022-11-03 ·

A method and a processing module for performing a particular comparison operation using floating point values received in one or more input formats. The exponent range of the floating point values is extended. One or more of the following is performed: (a) a floating point value of zero is replaced with a non-zero substitute floating point value whose magnitude is small enough to behave like zero if all other values involved in the particular comparison operation are non-zero finite values in their input format; (b) one or more of the floating point values are shifted by a non-zero amount which is small enough to behave like zero if all other values involved in the particular comparison operation are non-zero finite values in their input format, wherein said non-zero amount is too small to be representable using the one or more input formats but is representable using the extended exponent range; and (c) a floating point value of infinity is replaced with a finite substitute floating point value whose magnitude is large enough to behave like infinity if all other values involved in the particular comparison operation are non-zero finite values in their input format, wherein said finite substitute floating point value has a magnitude that is too large to be representable using the one or more input formats but is representable using the extended exponent range.