G06F7/5443

Neural network circuit

A neural network circuit having a novel structure is provided. A plurality of arithmetic circuits each including a register, a memory, a multiplier circuit, and an adder circuit are provided. The memory outputs different weight data in response to switching of a context signal. The multiplier circuit outputs multiplication data of the weight data and input data held in the register. The adder circuit performs a product-sum operation by adding the obtained multiplication data to data obtained by a product-sum operation in an adder circuit of another arithmetic circuit. The obtained product-sum operation data is output to an adder circuit of another arithmetic circuit, so that product-sum operations of different weight data and input data are performed.
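
The chained product-sum described above can be illustrated with a minimal behavioral sketch. It is not the patented circuit: the class name `ArithmeticCircuit`, the function `product_sum`, and the integer weights are hypothetical, and the context signal is modeled as a plain index into each circuit's weight memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArithmeticCircuit:
    """Hypothetical model of one arithmetic circuit: register, memory, multiplier, adder."""
    weights: List[int]  # memory: one weight per context
    register: int = 0   # input data held in the register

    def step(self, context: int, partial_sum: int) -> int:
        # Multiplier: weight selected by the context signal times the held input.
        product = self.weights[context] * self.register
        # Adder: add the product to the sum received from the previous circuit.
        return partial_sum + product


def product_sum(circuits: List[ArithmeticCircuit], inputs: List[int], context: int) -> int:
    """Pass a running sum through the chain of circuits for one context."""
    acc = 0
    for circuit, x in zip(circuits, inputs):
        circuit.register = x
        acc = circuit.step(context, acc)
    return acc


circuits = [ArithmeticCircuit([1, 4]), ArithmeticCircuit([2, 5]), ArithmeticCircuit([3, 6])]
inputs = [1, 2, 3]
print(product_sum(circuits, inputs, context=0))  # 1*1 + 2*2 + 3*3 = 14
print(product_sum(circuits, inputs, context=1))  # 4*1 + 5*2 + 6*3 = 32
```

Switching `context` reuses the same held inputs with a different weight set, which is the behavior the abstract attributes to the context signal.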

Semiconductor device having neural network

A semiconductor device capable of efficiently recognizing images utilizing a neural network is provided. The semiconductor device includes a shift register group, a D/A converter, and a product-sum operation circuit. The product-sum operation circuit includes an analog memory and stores a parameter of a filter. The shift register group captures image data and outputs part of the image data to the D/A converter while shifting the image data. The D/A converter converts the part of the input image data into analog data and outputs the analog data to the product-sum operation circuit.
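
As a rough illustration of the data flow above, the following sketch shifts a one-dimensional stream of pixels through a window and applies a stored filter to each window. It assumes a 1-D signal for brevity; `shift_register_convolve` is a hypothetical name, and the conversion to `float` merely stands in for the D/A converter.

```python
from collections import deque
from typing import List, Sequence


def shift_register_convolve(pixels: Sequence[int], kernel: Sequence[float]) -> List[float]:
    """Slide a window over the pixels and take a product-sum with the filter."""
    window = deque(maxlen=len(kernel))  # models the shift register group
    outputs = []
    for p in pixels:
        window.append(p)  # shift one pixel in; the oldest pixel falls out
        if len(window) == len(kernel):
            analog = [float(v) for v in window]  # stand-in for D/A conversion
            outputs.append(sum(w * x for w, x in zip(kernel, analog)))
    return outputs


print(shift_register_convolve([1, 2, 3, 4, 5], [1.0, 0.0, -1.0]))  # [-2.0, -2.0, -2.0]
```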

Bit matrix multiplication

Embodiments related to bit matrix multiplication in a processor are detailed. For example, in some embodiments a processor is described comprising: decode circuitry to decode an instruction having fields for an opcode, an identifier of a first source bit matrix, an identifier of a second source bit matrix, an identifier of a destination bit matrix, and an immediate; and execution circuitry to execute the decoded instruction to perform a multiplication of a matrix of S-bit elements of the identified first source bit matrix with S-bit elements of the identified second source bit matrix, wherein the multiplication and accumulation operations are selected by an operation selector, and to store a result of the matrix multiplication into the identified destination bit matrix, wherein S indicates a plural bit size.
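
A software sketch of a matrix multiplication with selectable multiply and accumulate operations might look as follows. The selector encodings in `MULTIPLY_OPS` and `ACCUMULATE_OPS` are hypothetical and are not taken from any instruction set; the abstract does not say which operations are available or how they are encoded.

```python
import operator
from functools import reduce
from typing import List

# Hypothetical selector encodings; the abstract does not define them.
MULTIPLY_OPS = {0: operator.and_, 1: operator.mul}
ACCUMULATE_OPS = {0: operator.xor, 1: operator.or_, 2: operator.add}


def bit_matrix_multiply(a: List[List[int]], b: List[List[int]],
                        s_bits: int, mul_sel: int, acc_sel: int) -> List[List[int]]:
    """Multiply matrices of S-bit elements using the selected operations."""
    mask = (1 << s_bits) - 1  # keep every element within S bits
    mul = MULTIPLY_OPS[mul_sel]
    acc = ACCUMULATE_OPS[acc_sel]
    inner, cols = len(b), len(b[0])
    result = []
    for row_a in a:
        row = []
        for j in range(cols):
            terms = (mul(row_a[k], b[k][j]) & mask for k in range(inner))
            row.append(reduce(acc, terms) & mask)
        result.append(row)
    return result


a = [[0b01, 0b11], [0b10, 0b01]]
b = [[0b11, 0b01], [0b10, 0b11]]
print(bit_matrix_multiply(a, b, s_bits=2, mul_sel=0, acc_sel=0))  # AND/XOR: [[3, 2], [2, 1]]
```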

Dynamic processing element array expansion

A computer-implemented method includes receiving a neural network model that includes a tensor operation, and dividing the tensor operation into sub-operations. The sub-operations include at least two sub-operations that have no data dependency between them. The computer-implemented method further includes assigning a first sub-operation in the two sub-operations to a first computing engine, assigning a second sub-operation in the two sub-operations to a second computing engine, and generating instructions for performing, in parallel, the first sub-operation by the first computing engine and the second sub-operation by the second computing engine. An inference is then made based on a result of the first sub-operation, a result of the second sub-operation, or both. The first computing engine and the second computing engine are in a same integrated circuit device or in two different integrated circuit devices.
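
One way to picture the division into independent sub-operations is to split a matrix multiplication by output columns, since each column block depends only on its own slice of the right-hand matrix. The sketch below is an assumption-laden illustration, not the claimed method: worker threads stand in for computing engines, and the names `split_columns` and `parallel_matmul` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix multiplication, used as the per-engine sub-operation."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(b: Matrix, parts: int) -> List[Matrix]:
    """Split b into column blocks that have no data dependency on each other."""
    cols = len(b[0])
    step = (cols + parts - 1) // parts
    return [[row[s:s + step] for row in b] for s in range(0, cols, step)]


def parallel_matmul(a: Matrix, b: Matrix, engines: int = 2) -> Matrix:
    """Assign one column block to each engine and join the partial results."""
    blocks = split_columns(b, engines)
    with ThreadPoolExecutor(max_workers=engines) as pool:
        partials = list(pool.map(lambda blk: matmul(a, blk), blocks))
    # Concatenate the column blocks row by row, in their original order.
    return [sum((p[i] for p in partials), []) for i in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6, 7, 8], [9, 10, 11, 12]]
print(parallel_matmul(a, b))  # [[23, 26, 29, 32], [51, 58, 65, 72]]
```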

Low-latency polynomial modulo multiplication over ring

A modular polynomial multiplier includes a plurality of processing elements. Each processing element includes a multiplication unit, an addition unit, and a delay unit. The addition unit has an input connected to the output of the multiplication unit. The delay unit, which is connected to the output of the addition unit, delays values by one clock cycle. The first input of the multiplication unit of each processing element carries a respective coefficient of a first polynomial, and the second input of the multiplication unit of each processing element is connected to one of an input line carrying a sequence of coefficients of a second polynomial having n coefficients and a delay line carrying the sequence of coefficients of the second polynomial delayed by n clock cycles and negated.
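
Feeding the second polynomial again after n cycles with its sign flipped is consistent with multiplication modulo x^n + 1, where terms that wrap past degree n - 1 re-enter negated. The abstract does not name the ring, so the following reference model assumes coefficients in Z_q and reduction modulo x^n + 1. It computes the arithmetic result only and is not a cycle-accurate model of the processing elements; `negacyclic_multiply` is a hypothetical name.

```python
from typing import List


def negacyclic_multiply(a: List[int], b: List[int], q: int) -> List[int]:
    """Multiply two length-n polynomials modulo (x^n + 1, q), lowest degree first."""
    n = len(a)
    result = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                result[k] = (result[k] + ai * bj) % q
            else:
                # x^n is congruent to -1, so wrapped terms are subtracted.
                result[k - n] = (result[k - n] - ai * bj) % q
    return result


# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2, and x^2 = -1, giving -5 + 10x = 12 + 10x mod 17
print(negacyclic_multiply([1, 2], [3, 4], q=17))  # [12, 10]
```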

Bit-serial computing device and test method for evaluating the same

A bit-serial computing device includes a computing circuit and a scaler. The computing circuit includes multiple MAC slices, and receives a multiplier vector and a multiplicand vector that contains multiple multiplicand inputs. Each multiplicand input contains multiple multiplicand segments that have different significances. The significances respectively correspond to the MAC slices. Correspondence between the significances and the MAC slices is variable. Each MAC slice calculates an inner product of the multiplier vector and a vector that is constituted by the multiplicand segments of the multiplicand inputs having the significance corresponding to the MAC slice. With respect to each MAC slice, the scaler multiplies the inner product that is calculated by the MAC slice by a weighting ratio that represents the significance corresponding to the MAC slice, so as to obtain a scaled inner product that corresponds to the MAC slice.
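
The per-significance decomposition can be sketched in software as follows. The sketch assumes unsigned multiplicands, a fixed segment width, and a weighting ratio of 2 raised to the segment's bit offset; none of these specifics come from the abstract. Summing the scaled inner products at the end is an added step to show that the slices together reproduce the ordinary inner product. The function names are hypothetical.

```python
from typing import List


def split_segments(value: int, segment_bits: int, num_segments: int) -> List[int]:
    """Split an unsigned value into segments, least significant first."""
    mask = (1 << segment_bits) - 1
    return [(value >> (segment_bits * s)) & mask for s in range(num_segments)]


def scaled_inner_products(multipliers: List[int], multiplicands: List[int],
                          segment_bits: int = 2, num_segments: int = 4) -> List[int]:
    """Return one scaled inner product per MAC slice."""
    segments = [split_segments(m, segment_bits, num_segments) for m in multiplicands]
    scaled = []
    for s in range(num_segments):
        # Vector of the segments that share significance s.
        slice_vector = [seg[s] for seg in segments]
        inner = sum(w * x for w, x in zip(multipliers, slice_vector))
        # Scaler: weighting ratio representing the significance of this slice.
        scaled.append(inner * (1 << (segment_bits * s)))
    return scaled


multipliers = [3, -1, 2]
multiplicands = [200, 17, 99]
parts = scaled_inner_products(multipliers, multiplicands)
print(sum(parts))  # 3*200 - 17 + 2*99 = 781
```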

Vector-vector multiplication techniques for processing systems

Vector-vector multiplication or matrix-matrix multiplication computation on computing systems can include computing a first portion of a vector-vector multiplication product based on a most-significant-bit set of a first vector and a most-significant-bit set of a second vector, and determining if the first portion of the vector-vector multiplication product is less than a threshold. If the first portion of the vector-vector multiplication product is not less than the threshold, a remaining portion of the vector-vector multiplication product can be computed, and a rectified linear vector-vector multiplication product can be determined for the sum of the first portion of the vector-vector multiplication product and the remaining portion of the vector-vector multiplication product. If the first portion of the vector-vector multiplication product is less than the threshold, computation of the remaining portion of the vector-vector multiplication product can be skipped and the rectified linear vector-vector multiplication product can be set to a zero scalar.
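
The early-exit idea can be sketched by splitting each integer into a most-significant part and a least-significant part. The split width `lsb_bits`, the default threshold, and the function names are assumptions for illustration. Skipping is exact only when the threshold is chosen so that the remaining portion cannot lift the sum above zero; otherwise it is an approximation.

```python
from typing import List, Tuple


def split_msb_lsb(v: int, lsb_bits: int) -> Tuple[int, int]:
    """Return (msb, lsb) such that v == (msb << lsb_bits) + lsb."""
    return v >> lsb_bits, v & ((1 << lsb_bits) - 1)


def relu_dot(a: List[int], b: List[int], lsb_bits: int = 4,
             threshold: int = 0) -> Tuple[int, bool]:
    """Return (rectified dot product, whether the remaining portion was skipped)."""
    a_parts = [split_msb_lsb(v, lsb_bits) for v in a]
    b_parts = [split_msb_lsb(v, lsb_bits) for v in b]
    # First portion: products of the most-significant parts only.
    first = sum(am * bm for (am, _), (bm, _) in zip(a_parts, b_parts)) << (2 * lsb_bits)
    if first < threshold:
        return 0, True  # skip the remaining portion and output zero
    # Remaining portion: the cross terms and the product of the low parts.
    remaining = sum(((am * bl + al * bm) << lsb_bits) + al * bl
                    for (am, al), (bm, bl) in zip(a_parts, b_parts))
    return max(0, first + remaining), False


print(relu_dot([100, 20], [50, 3]))   # (5060, False)
print(relu_dot([-100, 20], [50, 3]))  # (0, True)
```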

Device for computing an inner product
11567731 · 2023-01-31

A device for computing an inner product includes an index unit, a storage operation unit, a redundant to 2's complement (RTC) converter, a mapping table, and a multiplier-accumulate (MAC) module. The index unit, storing index values, is coupled to word lines. The storage operation unit includes the word lines and bit lines and stores data values. The mapping table stores coefficients corresponding to the index values. The index unit enables the word line according to a count value and the index value, such that the storage operation unit accumulates the data values corresponding to the bit lines and the enabled word line, thereby generating accumulation results. The RTC converter converts the accumulation results into a total data value in 2's complement format. The MAC module operates based on the total data value and the coefficient to generate an inner product value.
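
One reading of this scheme is that data values sharing the same coefficient are summed first, so that only one multiplication per distinct coefficient is needed. The sketch below follows that reading. It assumes the count value steps through the entries of the mapping table and enables the word lines whose stored index matches. The RTC conversion has no software counterpart here because Python integers are already signed. `indexed_inner_product` is a hypothetical name.

```python
from typing import List


def indexed_inner_product(data_values: List[int], index_values: List[int],
                          mapping_table: List[int]) -> int:
    """Compute sum(data[i] * mapping_table[index[i]]) by grouping on the index."""
    result = 0
    for count, coefficient in enumerate(mapping_table):
        # Accumulate the data values on the word lines enabled for this count.
        total = sum(d for d, idx in zip(data_values, index_values) if idx == count)
        # MAC step: one multiplication per coefficient, then accumulate.
        result += total * coefficient
    return result


data_values = [1, 2, 3, 4]
index_values = [0, 1, 0, 1]
mapping_table = [10, -3]
print(indexed_inner_product(data_values, index_values, mapping_table))  # (1+3)*10 + (2+4)*(-3) = 22
```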

Neural network inference circuit read controller with multiple operational modes
11568227 · 2023-01-31

Some embodiments provide a neural network inference circuit for executing a neural network with multiple layers. The neural network inference circuit includes a set of processing circuits for executing the layers of the neural network, a set of memories for storing data used by the set of processing circuits to execute the neural network layers, and a read controller for retrieving the data from the set of memories and storing the data in a cache for use by the set of processing circuits. The read controller retrieves the data in one of (i) a first mode for retrieving the data from sequential memory locations within the set of memories to store in the cache and (ii) a second mode for retrieving the data from non-sequential memory locations within the set of memories to store in the cache.

In-memory computation system with drift compensation circuit

A circuit includes a memory array with memory cells arranged in a matrix of rows and columns, where each row includes a word line connected to the memory cells of the row, and each column includes a bit line connected to the memory cells of the column. Computational weights for an in-memory compute operation (IMCO) are stored in the memory cells. A word line control circuit simultaneously actuates word lines in response to input signals providing coefficient data for the IMCO by applying word line signal pulses. A column processing circuit connected to the bit lines processes analog signals developed on the bit lines in response to the simultaneous actuation of the word lines to generate multiply and accumulate output signals for the IMCO. Pulse widths of the signal pulses are modulated to compensate for cell drift. The IMCO further handles positive/negative calculation for the coefficient data and computational weights.