Patent classifications
G06F5/01
In-memory computing architecture and methods for performing MAC operations
In-memory computing architectures and methods of performing multiply-and-accumulate operations are provided. The method includes sequentially shifting bits of first input bytes into each row in an array of memory cells arranged in rows and columns. Each memory cell is activated based on the bit to produce a bit-line current from each activated memory cell in a column on a shared bit-line proportional to a product of the bit and a weight stored therein. Charges produced by a sum of the bit-line currents in a column are accumulated in first charge-storage banks coupled to a shared bit-line in each of the columns. Concurrently, charges from second input bytes accumulated in second charge-storage banks previously coupled to the columns are sequentially converted into output bytes. The charge-storage banks are exchanged after the first input bytes have been accumulated and the charges from the second input bytes converted. The method then repeats.
In-memory computing architecture and methods for performing MAC operations
In-memory computing architectures and methods of performing multiply-and-accumulate operations are provided. The method includes sequentially shifting bits of first input bytes into each row in an array of memory cells arranged in rows and columns. Each memory cell is activated based on the bit to produce a bit-line current from each activated memory cell in a column on a shared bit-line proportional to a product of the bit and a weight stored therein. Charges produced by a sum of the bit-line currents in a column are accumulated in first charge-storage banks coupled to a shared bit-line in each of the columns. Concurrently, charges from second input bytes accumulated in second charge-storage banks previously coupled to the columns are sequentially converted into output bytes. The charge-storage banks are exchanged after the first input bytes have been accumulated and the charges from the second input bytes converted. The method then repeats.
Low latency matrix multiply unit
Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. Two chains of weight shift registers per column of the systolic array are in the matrix multiply unit. Each weight shift register is connected to only one chain and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
METHOD AND APPARATUS WITH FLOATING POINT PROCESSING
A processor-implemented includes receiving a first floating point operand and a second floating point operand, each having an n-bit format comprising a sign field, an exponent field, and a significand field, normalizing a binary value obtained by performing arithmetic operations for fields corresponding to each other in the first and second floating point operands for an n-bit multiplication operation, determining whether the normalized binary value is a number that is representable in the n-bit format or an extended normal number that is not representable in the n-bit format, according to a result of the determining, encoding the normalized binary value using an extension bit format in which an extension pin identifying whether the normalized binary value is the extended normal number is added to the n-bit format, and outputting the encoded binary value using the extended bit format, as a result of the n-bit multiplication operation.
Reconfigurable processor circuit architecture
A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
Reconfigurable processor circuit architecture
A representative reconfigurable processing circuit and a reconfigurable arithmetic circuit are disclosed, each of which may include input reordering queues; a multiplier shifter and combiner network coupled to the input reordering queues; an accumulator circuit; and a control logic circuit, along with a processor and various interconnection networks. A representative reconfigurable arithmetic circuit has a plurality of operating modes, such as floating point and integer arithmetic modes, logical manipulation modes, Boolean logic, shift, rotate, conditional operations, and format conversion, and is configurable for a wide variety of multiplication modes. Dedicated routing connecting multiplier adder trees allows multiple reconfigurable arithmetic circuits to be reconfigurably combined, in pair or quad configurations, for larger adders, complex multiplies and general sum of products use, for example.
Floating point to fixed point conversion
A binary logic circuit converts a number in floating point format having an exponent E of ew bits, an exponent bias B given by B=2.sup.ew-1−1, and a significand comprising a mantissa M of mw bits into a fixed point format with an integer width of iw bits and a fractional width of fw bits. The circuit includes a shifter operable to receive a significand input comprising a contiguous set of the most significant bits of the significand and configured to left-shift the significand input by a number of bits equal to the value represented by k least significant bits of the exponent to generate a shifter output, wherein min{(ew−1),bitwidth(iw−2−s.sub.y)}≤k≤(ew−1) where s.sub.y=1 for a signed floating point number and s.sub.y=0 for an unsigned floating point number, and a multiplexer coupled to the shifter and configured to: receive an input comprising a contiguous set of bits of the shifter output; and output the input if the most significant bit of the exponent is equal to one.
Fused convolution and batch normalization for neural networks
A processing unit implements a convolutional neural network (CNN) by fusing at least a portion of a convolution phase of the CNN with at least a portion of a batch normalization phase. The processing unit convolves two input matrices representing inputs and weights of a portion of the CNN to generate an output matrix. The processing unit performs the convolution via a series of multiplication operations, with each multiplication operation generating a corresponding submatrix (or “tile”) of the output matrix at an output register of the processing unit. While an output submatrix is stored at the output register, the processing unit performs a reduction phase and an update phase of the batch normalization phase for the CNN. The processing unit thus fuses at least a portion of the batch normalization phase of the CNN with a portion of the convolution.
Fused convolution and batch normalization for neural networks
A processing unit implements a convolutional neural network (CNN) by fusing at least a portion of a convolution phase of the CNN with at least a portion of a batch normalization phase. The processing unit convolves two input matrices representing inputs and weights of a portion of the CNN to generate an output matrix. The processing unit performs the convolution via a series of multiplication operations, with each multiplication operation generating a corresponding submatrix (or “tile”) of the output matrix at an output register of the processing unit. While an output submatrix is stored at the output register, the processing unit performs a reduction phase and an update phase of the batch normalization phase for the CNN. The processing unit thus fuses at least a portion of the batch normalization phase of the CNN with a portion of the convolution.
DUAL EXPONENT BOUNDING BOX FLOATING-POINT PROCESSOR
Apparatus and methods are disclosed for performing matrix operations, including operations suited to neural network and other machine learning accelerators and applications, using dual exponent formats. Disclosed matrix formats include single exponent bounding box floating-point (SE-BBFP) and dual exponent bounding box floating-point (DE-BBFP) formats. Shared exponents for each element are determined for each element based on whether the element is used as a row of matrix tile or a column of a matrix file, for example, for a dot product operation. Computing systems suitable for employing such neural networks include computers having general-purpose processors, neural network accelerators, or reconfigure both logic devices, such as Field programmable gate arrays (FPGA). Certain techniques disclosed herein can provide improved system performance while reducing memory and network bandwidth used.