G06F7/544

Fused Multiply-Add operator for mixed precision floating-point numbers with correct rounding
11550544 · 2023-01-10

A fused multiply-add hardware operator comprising a multiplier receiving two multiplicands as floating-point numbers encoded in a first precision format; an alignment circuit associated with the multiplier configured to convert the result of the multiplication into a first fixed-point number; and an adder configured to add the first fixed-point number and an addition operand. The addition operand is a floating-point number encoded in a second precision format, and the operator comprises an alignment circuit associated with the addition operand, configured to convert the addition operand into a second fixed-point number of reduced dynamic range relative to the dynamic range of the addition operand, having a number of bits equal to the number of bits of the first fixed-point number, extended on both sides by at least the size of the mantissa of the addition operand; the adder configured to add the first and second fixed-point numbers without loss.
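The benefit of adding in fixed point before rounding can be illustrated with a minimal sketch. It is not the patented circuit: the helper names `to_fixed` and `fixed_point_fma`, the 32-bit fractional width, and the use of Python doubles as the operand format are all assumptions made for illustration. The product and the addend are placed on a common fixed-point grid, added exactly as integers, and rounded only once at the end.

```python
from fractions import Fraction


def to_fixed(x: float, frac_bits: int) -> int:
    """Convert x to a fixed-point integer with frac_bits fractional bits.

    Hypothetical helper: raises if x does not fit the grid exactly,
    so no precision is ever lost silently.
    """
    scaled = Fraction(x) * (1 << frac_bits)
    if scaled.denominator != 1:
        raise ValueError("value is not representable on this fixed-point grid")
    return scaled.numerator


def fixed_point_fma(a: float, b: float, c: float, frac_bits: int = 32) -> float:
    """Compute a*b + c with one final rounding (illustrative sketch)."""
    # The exact product of two fixed-point values has 2*frac_bits fractional bits.
    product = to_fixed(a, frac_bits) * to_fixed(b, frac_bits)
    # Align the addend to the same grid as the product.
    addend = to_fixed(c, 2 * frac_bits)
    total = product + addend  # exact integer addition, no loss
    # Single rounding step when converting back to floating point.
    return float(Fraction(total, 1 << (2 * frac_bits)))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fixed_point_fma(a, b, c)  # rounds once, after the exact sum
unfused = a * b + c               # rounds the product first, then the sum
print(fused, unfused)
```

With these operands the exact product is 1 - 2^-60, which a separate multiplication rounds to 1.0, so the unfused result collapses to zero while the fused result keeps the small negative remainder.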

Artificial neural networks
11551075 · 2023-01-10

The present disclosure relates to a neuron for an artificial neural network. The neuron includes: a first dot product engine operative to: receive a first set of weights; receive a set of inputs; and calculate the dot product of the set of inputs and the first set of weights to generate a first dot product engine output. The neuron further includes a second dot product engine operative to: receive a second set of weights; receive an input based on the first dot product engine output; and generate a second dot product engine output based on the product of the first dot product engine output and a weight of the second set of weights. The neuron further includes an activation function module arranged to generate a neuron output based on the second dot product engine output. The first dot product engine and the second dot product engine are structurally or functionally different.
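The data flow of the neuron described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the second engine is reduced to a single multiplication by one weight, and ReLU is assumed as the activation function, none of which the abstract specifies.

```python
from typing import Sequence


def first_engine(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """First dot product engine: dot product of the inputs and the first weights."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


def second_engine(first_output: float, weight: float) -> float:
    """Second engine: product of the first engine output and one weight."""
    return first_output * weight


def relu(x: float) -> float:
    """Assumed activation function (the abstract does not name one)."""
    return max(0.0, x)


def neuron(inputs: Sequence[float],
           first_weights: Sequence[float],
           second_weight: float) -> float:
    """Chain the two engines and apply the activation."""
    return relu(second_engine(first_engine(inputs, first_weights), second_weight))


out = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 2.0)
print(out)
```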

Method and apparatus for configuring a reduced instruction set computer processor architecture to execute a fully homomorphic encryption algorithm

Systems and methods for configuring a reduced instruction set computer processor architecture to execute fully homomorphic encryption (FHE) logic gates as a streaming topology. The method includes parsing sequential FHE logic gate code, transforming the FHE logic gate code into a set of code modules that each have an input and an output that is a function of the input and which do not pass control to other functions, creating a node wrapper around each code module, and configuring at least one of the primary processing cores to implement the logic element equivalents of each element in a manner which operates in a streaming mode wherein data streams out of corresponding arithmetic logic units into the main memory and other ones of the plurality of arithmetic logic units.

Bipolar all-memristor circuit for in-memory computing
11694070 · 2023-07-04

A circuit for performing energy-efficient and high-throughput multiply-accumulate (MAC) arithmetic dot-product operations and convolution computations includes a two dimensional crossbar array comprising a plurality of row inputs and at least one column having a plurality of column circuits, wherein each column circuit is coupled to a respective row input. Each respective column circuit includes an excitatory memristor neuron circuit having an input coupled to a respective row input, a first synapse circuit coupled to an output of the excitatory memristor neuron circuit, the first synapse circuit having a first output, an inhibitory memristor neuron circuit having an input coupled to the respective row input, and a second synapse circuit coupled to an output of the inhibitory memristor neuron circuit, the second synapse circuit having a second output. An output memristor neuron circuit is coupled to the first output and second output of each column circuit and has an output.

Logarithmic addition-accumulator circuitry, processing pipeline including same, and methods of operation

An integrated circuit including a plurality of logarithmic addition-accumulator circuits, connected in series, to, in operation, perform logarithmic addition and accumulate operations, wherein each logarithmic addition-accumulator circuit includes: (i) a logarithmic addition circuit to add a first input data and a filter weight data, each having the logarithmic data format, and to generate and output first sum data having a logarithmic data format, and (ii) an accumulator, coupled to the logarithmic addition circuit of the associated logarithmic addition-accumulator circuit, to add a second input data and the first sum data output by the associated logarithmic addition circuit to generate first accumulation data. The integrated circuit may further include first data format conversion circuitry, coupled to the output of each logarithmic addition circuit, to convert the data format of the first sum data to a floating point data format wherein the accumulator may be a floating point type.
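A minimal sketch of the idea, assuming base-2 logarithms and strictly positive operands (the abstract does not state the base or how signs and zeros are encoded): adding two log-format values corresponds to multiplying the underlying numbers, and the sum is converted to floating point before accumulation. The names `stage` and `pipeline` are hypothetical.

```python
import math
from typing import Sequence


def stage(log_input: float, log_weight: float, acc_in: float) -> float:
    """One logarithmic addition-accumulator stage (illustrative model).

    log_input and log_weight are base-2 logarithms of positive values.
    """
    log_sum = log_input + log_weight  # log-domain add == linear-domain multiply
    linear = 2.0 ** log_sum           # format conversion to floating point
    return acc_in + linear            # floating-point accumulation


def pipeline(log_inputs: Sequence[float], log_weights: Sequence[float]) -> float:
    """Stages connected in series: each one adds its term to the running total."""
    acc = 0.0
    for log_x, log_w in zip(log_inputs, log_weights):
        acc = stage(log_x, log_w, acc)
    return acc


inputs = [2.0, 4.0, 8.0]
weights = [0.5, 0.25, 2.0]
result = pipeline([math.log2(x) for x in inputs],
                  [math.log2(w) for w in weights])
print(result)  # compare with sum(x * w for x, w in zip(inputs, weights))
```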

Key-value memory network for predicting time-series metrics of target entities

A system implements a key value memory network including a key matrix with key vectors learned from training static feature data and time-series feature data, a value matrix with value vectors representing time-series trends, and an input layer to receive, for a target entity, input data comprising a concatenation of static feature data of the target entity, time-specific feature data, and time-series feature data for the target entity. The key value memory network also includes an entity-embedding layer to generate an input vector from the input data, a key-addressing layer to generate a weight vector indicating similarities between the key vectors and the input vector, a value-reading layer to compute a context vector from the weight and value vectors, and an output layer to generate predicted time-series data for a target metric of the target entity by applying a continuous activation function to the context vector and the input vector.
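The key-addressing and value-reading steps can be sketched in a few lines. The sketch assumes dot-product similarity followed by a softmax, which is a common choice for key-value memory networks but is not stated in the abstract; the function names and the toy matrices are hypothetical, and the entity-embedding and output layers are omitted.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def key_addressing(keys: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Weight vector: similarity of each key vector to the input vector."""
    scores = [sum(k * q for k, q in zip(key, query)) for key in keys]
    return softmax(scores)


def value_reading(weights: Sequence[float],
                  values: Sequence[Sequence[float]]) -> List[float]:
    """Context vector: weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]      # toy key matrix (one key per row)
values = [[10.0, 0.0], [0.0, 10.0]]  # toy value matrix (one trend per row)
query = [1.0, 0.0]                   # stands in for the embedded input vector

weights = key_addressing(keys, query)
context = value_reading(weights, values)
print(weights, context)
```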

Single-stage hardware sorting blocks and associated multiway merge sorting networks
11693623 · 2023-07-04

A system and methods for designing single-stage hardware sorting blocks, and further using the single-stage hardware sorting blocks to reduce the number of stages in multistage sorting processes, or to define multiway merge sorting networks.

SRAM-based cell for in-memory computing and hybrid computations/storage memory architecture

An in-memory computing device includes in some examples a two-dimensional array of memory cells arranged in rows and columns, each memory cell made of a nine-transistor current-based SRAM. Each memory cell includes a six-transistor SRAM cell and a current source coupled by a switching transistor, which is controlled by input signals on an input line, to an output line associated with the column of memory cells the memory cell is in. The current source includes a switching transistor controlled by the state of the six-transistor SRAM cell, and a current regulating transistor adapted to generate a current at a level determined by a control signal applied at the gate. The control signal can be set such that the total current in each output line is increased by a factor of 2 in each successive column of the memory cells.
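The factor-of-2 scaling between columns can be read as binary weighting, which the following sketch models numerically. It assumes that each row stores one unsigned weight with its bits spread across the columns, least significant bit first, and that inputs are single bits; the function name and the unit current are hypothetical, and nothing here models the transistor-level behavior.

```python
from typing import List, Sequence


def column_currents(inputs: Sequence[int],
                    weights: Sequence[int],
                    n_bits: int,
                    unit_current: float = 1.0) -> List[float]:
    """Current on each output line for binary inputs and n_bits-wide weights.

    A cell contributes current only when its input bit and its stored
    bit are both 1; column `col` sources 2**col times the unit current.
    """
    currents = []
    for col in range(n_bits):
        active_cells = sum(x & ((w >> col) & 1) for x, w in zip(inputs, weights))
        currents.append(unit_current * (2 ** col) * active_cells)
    return currents


inputs = [1, 1]   # one input bit per row
weights = [3, 5]  # one stored weight per row (binary 011 and 101)

currents = column_currents(inputs, weights, n_bits=3)
total = sum(currents)  # equals the dot product of inputs and weights
print(currents, total)
```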

Systolic array with efficient input reduction and extended array performance

Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.
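Reducing an input by dropping trailing bits, with or without rounding, can be sketched as below. The formats are assumptions: a 32-bit IEEE 754 float is shortened to its top 16 bits (a bfloat16-like layout), and rounding uses round-to-nearest, ties-to-even. The helper names are hypothetical, and NaN and overflow cases are not treated specially.

```python
import struct


def fp32_bits(x: float) -> int:
    """Bit pattern of x encoded as a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_fp32(bits: int) -> float:
    """Value of a 32-bit float bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reduce_truncate(x: float, drop_bits: int = 16) -> float:
    """Reduce by zeroing the trailing drop_bits bits (no rounding)."""
    keep_mask = 0xFFFFFFFF & ~((1 << drop_bits) - 1)
    return bits_fp32(fp32_bits(x) & keep_mask)


def reduce_round(x: float, drop_bits: int = 16) -> float:
    """Reduce with round-to-nearest, ties-to-even, before dropping bits."""
    bits = fp32_bits(x)
    lsb = (bits >> drop_bits) & 1  # lowest bit that will be kept
    bias = (1 << (drop_bits - 1)) - 1 + lsb
    keep_mask = 0xFFFFFFFF & ~((1 << drop_bits) - 1)
    return bits_fp32(((bits + bias) & 0xFFFFFFFF) & keep_mask)


x = 1.005859375  # 1 + 2**-8 + 2**-9, exactly representable in 32 bits
print(reduce_truncate(x), reduce_round(x))
```

Truncation always moves the value toward zero, while rounding picks the nearest reduced value, which is why a reducer may offer both.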

Efficient hardware architecture for accelerating grouped convolutions

Hardware accelerators for accelerated grouped convolution operations. A first buffer of a hardware accelerator may receive a first row of an input feature map (IFM) from a memory. A first group comprising a plurality of tiles may receive a first row of the IFM. A plurality of processing elements of the first group may compute a portion of a first row of an output feature map (OFM) based on the first row of the IFM and a kernel. A second buffer of the accelerator may receive a third row of the IFM from the memory. A second group comprising a plurality of tiles may receive the third row of the IFM. A plurality of processing elements of the second group may compute a portion of a third row of the OFM based on the third row of the IFM and the kernel as part of a grouped convolution operation.