G06F2207/4802

Method of operation for a configurable number theoretic transform (NTT) butterfly circuit for homomorphic encryption

Fully homomorphic encryption integrated circuit (IC) chips, systems, and associated methods are disclosed. In one embodiment, a method of operation for a number theoretic transform (NTT) butterfly circuit is disclosed. The NTT butterfly circuit includes a high input word path cross-coupled with a low input word path. The high input word path includes a first adder/subtractor and a first multiplier. The low input word path includes a second adder/subtractor and a second multiplier. The method includes selectively bypassing the second adder/subtractor and the second multiplier, and reconfiguring the low and high input word paths into different logic processing units in response to different mode control signals.
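
As a rough behavioral illustration (not taken from the patent), the sketch below models a configurable butterfly in Python. It assumes textbook Cooley-Tukey (forward NTT) and Gentleman-Sande (inverse NTT) butterflies over an example prime modulus, with bypass modes that collapse the paths into a plain modular multiplier or adder; the mode names, the modulus, and the exact datapath wiring are hypothetical.

```python
# Minimal behavioral sketch of a mode-controlled butterfly (illustrative only).
Q = 8380417  # example NTT-friendly prime modulus (used in CRYSTALS-Dilithium)

def butterfly(hi, lo, w, mode):
    if mode == "NTT":    # both paths active: Cooley-Tukey butterfly
        t = (w * lo) % Q
        return (hi + t) % Q, (hi - t) % Q
    if mode == "INTT":   # Gentleman-Sande: add/subtract first, then multiply
        return (hi + lo) % Q, (w * (hi - lo)) % Q
    if mode == "MUL":    # low-path adder and multiplier bypassed: plain modmul
        return (hi * w) % Q, lo
    if mode == "ADD":    # multipliers bypassed: plain modular adder
        return (hi + lo) % Q, lo
    raise ValueError(f"unknown mode: {mode}")

print(butterfly(5, 7, 3, "NTT"))   # (26, 8380401): 5 +/- 3*7 mod Q
```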

RESISTIVE MEMORY ARRAYS FOR PERFORMING MULTIPLY-ACCUMULATE OPERATIONS

In one example in accordance with the present disclosure, a resistive memory array is described. The array includes a number of resistive memory elements to receive a common-valued read signal. The array also includes a number of multiplication engines to perform a multiply operation by receiving a memory element output from a corresponding resistive memory element, receiving an input signal, and generating a multiplication output based on the received memory element output and the received input signal. The array also includes an accumulation engine to sum the multiplication outputs from the number of multiplication engines.
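
A toy numeric model of the described signal flow, assuming ideal elements where a memory element's output is its conductance times the common read voltage; the function and variable names are illustrative, not from the disclosure.

```python
def resistive_mac(conductances, inputs, read_v=1.0):
    # Each element produces I = G * V_read (the "memory element output").
    element_outputs = [g * read_v for g in conductances]
    # Each multiplication engine scales its element output by an input signal.
    products = [i_out * x for i_out, x in zip(element_outputs, inputs)]
    # The accumulation engine sums the multiplication outputs.
    return sum(products)

print(resistive_mac([0.5, 1.0, 2.0], [3.0, -1.0, 0.5]))  # 1.5 - 1.0 + 1.0 = 1.5
```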

WEIGHTED MATRIX FOR INPUT DATA STREAM
20220198250 · 2022-06-23

Examples of performing convolution operations based on a weighted matrix are described. In an example, an input data stream vector is processed using a weighted matrix stored on a processing unit of a neural network accelerator. The weighted matrix may correspond to a first convolution filter and a second convolution filter.
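
A minimal sketch of the underlying idea, assuming (hypothetically) that the two filters are stacked as rows of one stored weight matrix so a single matrix product evaluates both NN-style convolutions (cross-correlations) over the input stream.

```python
import numpy as np

f1 = np.array([1.0, 0.0, -1.0])   # hypothetical first convolution filter
f2 = np.array([0.5, 0.5, 0.5])    # hypothetical second convolution filter
W = np.stack([f1, f2])            # one weighted matrix, one row per filter

stream = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
windows = np.stack([stream[i:i + 3] for i in range(len(stream) - 2)])
outputs = windows @ W.T           # both convolutions in one matrix product
print(outputs)                    # column 0: f1 outputs, column 1: f2 outputs
```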

Sum-of-products operator, sum-of-products operation method, logical operation device, and neuromorphic device
11340869 · 2022-05-24

A sum-of-products operator including: a first circuit configured to generate a plurality of signals, each corresponding to one of a plurality of data values; a second circuit including a first operation circuit configured to multiply each of the signals generated by the first circuit by a weight using a plurality of variable resistive elements having variable resistance values, and to calculate a sum of the results of the multiplications; a third circuit configured to calculate a sum of values corresponding to the data values, or an adjusted version of that sum; and a fourth circuit including a differential circuit configured to output a difference between the result calculated by the first operation circuit of the second circuit and the result calculated by the third circuit.
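
A behavioral sketch of the four circuits, assuming the common offset-cancellation scheme in which non-negative resistive weights store w + c and the third circuit's sum removes the constant offset; the offset value and scaling are illustrative assumptions, not taken from the claim.

```python
OFFSET = 1.0  # assumed constant added so stored weights are non-negative

def sum_of_products(data, signed_weights):
    signals = list(data)                                  # first circuit
    stored = [w + OFFSET for w in signed_weights]         # resistive weights
    weighted_sum = sum(s * w for s, w in zip(signals, stored))  # second circuit
    reference_sum = OFFSET * sum(signals)                 # third circuit
    return weighted_sum - reference_sum                   # fourth (differential)

# 1*0.5 + 2*(-0.25) = 0.0, recovered despite only non-negative stored weights
print(sum_of_products([1.0, 2.0], [0.5, -0.25]))
```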

DYNAMIC BIAS ANALOG VECTOR-MATRIX MULTIPLICATION OPERATION CIRCUIT AND OPERATION CONTROL METHOD THEREFOR

A dynamic bias analog vector-matrix multiplication operation circuit and an operation control method therefor. The circuit comprises positive value weight columns (10_1-10_N), constant columns (20_1-20_M), and subtractors (30_1-30_N), wherein the number of subtractors equals the number of positive value weight columns, the subtractors are connected to the positive value weight columns on a one-to-one basis, and the number of constant columns is less than the number of positive value weight columns. Minuend input ends of the subtractors are connected to output ends of the positive value weight columns, subtrahend input ends are connected to output ends of the constant columns, and output ends output the operation results; subtrahend input ends of a plurality of subtractors are connected to the same constant column. Before the weights are written to a programmable semiconductor device, a constant positive value is added to each element in the weight array to obtain the weight array to be configured; this array is written to the positive value weight columns, and the constant positive value is written to a constant column. A negative value weight column therefore does not need to be provided, which simplifies the circuit structure.
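
A numeric sketch of the scheme described above: a constant positive value c is added to every signed weight before programming, and one shared constant column supplies the subtrahend for several subtractors. The value of c and the array sizes are illustrative.

```python
import numpy as np

c = 2.0                                 # assumed constant positive value
W = np.array([[ 1.0, -0.5],
              [-1.5,  0.25]])           # signed weights we want to realize
W_prog = W + c                          # programmed array: all entries >= 0

x = np.array([0.5, 2.0])                # input vector applied to the rows
positive_outputs = x @ W_prog           # outputs of positive weight columns
constant_output = c * x.sum()           # output of the shared constant column
result = positive_outputs - constant_output  # per-column subtractors
print(result, x @ W)                    # both print [-2.5, 0.25]
```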

Analog multiplier-accumulators

An example electronic device includes a crossbar array, row driver circuitry, and a column output circuit for each of the column lines of the crossbar array. The crossbar array may include row lines, column lines, and memristors that are each connected between one of the row lines and one of the column lines. The row driver circuitry may apply a plurality of analog voltages to a first node during a plurality of time periods, respectively, and, for each of the row lines, selectively connect the row line to the first node during one of the plurality of time periods based on a digital input vector. Each column output circuit may include an integration capacitor, a switch that is controlled by an integration control signal, and current mirroring circuitry. When the switch is closed, the current mirroring circuitry may flow an integration current, whose magnitude mirrors the current flowing on the corresponding column line, to or from an electrode of the integration capacitor. The integration control signal may close the switch for a specified amount of time during each of the plurality of time periods.
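
A heavily simplified behavioral model, assuming one interpretation of the timing scheme: each row's digital input selects the single time period in which that row connects to the driven node, and the mirrored column current integrates onto the capacitor for a fixed window each period. All values and the connection rule are assumptions for illustration.

```python
voltages = [0.0, 0.25, 0.5, 1.0]       # analog voltage applied in each period
G = [[1.0, 0.5],
     [2.0, 1.0],
     [0.5, 0.0]]                        # memristor conductances (rows x cols)
digital_in = [3, 1, 2]                  # per-row: index of its time period
T_INT = 1e-3                            # switch closed this long each period

def column_charges():
    charge = [0.0, 0.0]
    for t, v in enumerate(voltages):            # one period per voltage
        for row, code in enumerate(digital_in):
            if code == t:                       # row connected this period
                for col in range(len(charge)):
                    # mirrored column current integrates onto the capacitor
                    charge[col] += G[row][col] * v * T_INT
    return charge

print(column_charges())  # accumulated charge per column output circuit
```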

Low latency long short-term memory inference with sequence interleaving

Systems, apparatuses, and methods for implementing a low latency long short-term memory (LSTM) machine learning engine using sequence interleaving techniques are disclosed. A computing system includes at least a host processing unit, a machine learning engine, and a memory. The host processing unit detects a plurality of sequences that will be processed by the machine learning engine, interleaves the sequences into data blocks, and stores the data blocks in the memory. When the machine learning engine receives a given data block, it performs, in parallel, a plurality of matrix multiplication operations on the plurality of sequences in the given data block and a plurality of coefficients. The outputs of the matrix multiplication operations are then coupled to one or more LSTM layers.
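
A simplified sketch of the interleaving idea: timestep t of every sequence is packed into one data block, so a single matrix multiply advances all sequences in parallel. Shapes and names are illustrative assumptions.

```python
import numpy as np

num_seq, seq_len, feat, hidden = 4, 6, 8, 16
sequences = np.random.randn(num_seq, seq_len, feat)
W = np.random.randn(feat, hidden)          # shared coefficient matrix

# Host side: interleave -> one block per timestep, one row per sequence.
blocks = [sequences[:, t, :] for t in range(seq_len)]

# Engine side: one matmul per block processes all sequences at once;
# the results would feed the LSTM layers downstream.
outputs = [block @ W for block in blocks]
print(outputs[0].shape)  # (4, 16): all four sequences advanced together
```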

PULSE GENERATION FOR UPDATING CROSSBAR ARRAYS

Provided are embodiments for a computer-implemented method, a system, and a computer program product for updating an analog crossbar array. Embodiments include receiving a number, used in matrix multiplication, that is to be represented using pulse generation for a crossbar array, and receiving a bit-length with which to represent the number. Embodiments also include selecting pulse positions in a pulse sequence having the bit-length to represent the number, performing a computation using the selected pulse positions in the pulse sequence, and updating the crossbar array using the computation.
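
A sketch of one plausible pulse-position coding, assuming the common stochastic-update scheme in which a value in [0, 1] becomes a pulse train of the given bit-length and coincident row/column pulses perform the multiply; the coding rule is an assumption, not the patent's.

```python
import random

def pulse_positions(value, bit_length, rng=random.Random(0)):
    # Choose round(value * bit_length) pulse slots out of bit_length.
    n_pulses = round(value * bit_length)
    return set(rng.sample(range(bit_length), n_pulses))

BL = 16
row = pulse_positions(0.5, BL)
col = pulse_positions(0.75, BL)
# A cell is nudged once per coincident pulse, approximating 0.5 * 0.75.
coincidences = len(row & col)
print(coincidences / BL)  # close to 0.375 on average
```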

POWER-EFFICIENT COMPUTE-IN-MEMORY POOLING
20220012580 · 2022-01-13

A multiply-and-accumulate (MAC) circuit having a plurality of compute-in-memory bitcells is configured to multiply a plurality of stored weight bits with a plurality of input bits to provide a MAC output voltage. A successive approximation analog-to-digital converter includes a capacitive digital-to-analog converter (CDAC) configured to subtract a bias voltage from the MAC output voltage to provide a CDAC output voltage. A comparator compares the CDAC output voltage to a fixed reference voltage.
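
A behavioral sketch of the conversion loop, assuming a textbook successive approximation in which the CDAC subtracts a trial bias from the MAC output voltage and the comparator checks the residue against a fixed reference (0 V here); the bit count and voltage range are assumptions.

```python
def sar_adc(v_mac, n_bits=4, v_full=1.0):
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)
        bias = trial * v_full / (1 << n_bits)  # CDAC-generated bias voltage
        if v_mac - bias >= 0.0:                # comparator vs. fixed reference
            code = trial                       # keep the trial bit
    return code

print(sar_adc(0.7))  # 11: the largest code whose bias (0.6875 V) fits under 0.7 V
```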

Adjustable precision for multi-stage compute processes

Disclosed techniques provide for dynamically changing the precision of a multi-stage compute process, for example, changing neural network (NN) parameters on a per-layer basis depending on properties of incoming data streams and per-layer performance of the NN, among other considerations. NNs include multiple layers that may each be calculated with a different degree of accuracy and, therefore, a different compute resource overhead (e.g., memory, processor resources, etc.). NNs are usually trained with 32-bit or 16-bit floating-point numbers. Once trained, an NN may be deployed in production. One approach to reducing compute overhead is to reduce the parameter precision of an NN to 16-bit or 8-bit for deployment. The conversion to an acceptable lower precision is usually determined manually before deployment, and precision levels are fixed while deployed. Disclosed techniques and implementations address automatic rather than manual determination of precision levels for different stages, and dynamically adjust precision for each stage at run-time.
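
A sketch of per-layer dynamic precision, assuming a hypothetical run-time rule that picks the smallest bit width whose quantization error stays within a per-layer budget; the budget, candidate widths, and quantizer are illustrative, not the disclosed method.

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantizer to the given bit width.
    scale = (2 ** (bits - 1) - 1) / (np.max(np.abs(x)) + 1e-12)
    return np.round(x * scale) / scale

def choose_bits(x, budget, candidates=(8, 16, 32)):
    # Pick the smallest width whose quantization error stays under budget.
    for bits in candidates:
        if np.max(np.abs(quantize(x, bits) - x)) <= budget:
            return bits
    return candidates[-1]

activations = np.random.randn(256).astype(np.float32)
bits = choose_bits(activations, budget=0.05)
print(bits, "bits chosen for this layer on this input")
```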