Patent classifications
G06F2207/4824
TECHNIQUES FOR FAST DOT-PRODUCT COMPUTATION
Techniques are presented to improve the speed of calculating floating-point dot-products, such as in a floating-point unit (FPU). Rather than determining the full maximum exponent initially and waiting until the full individual shift amounts are calculated to right-shift each mantissa product, the exponent of each product is divided into two fields, a high field and a low field. The low field is used as a fine-grained shift amount to right-shift each mantissa product as soon as the mantissa product is ready, while only the high field participates in the maximum exponent calculation. This allows a dot-product computation to be sped up in two ways: right-shifting of the mantissa products can begin as soon as they are calculated, without waiting for the maximum exponent calculation; and calculation of the maximum exponent is faster because it operates only on the high fields of the exponents, not their full width.
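The two-field alignment described above can be illustrated with a small integer model. This is a minimal sketch, not the patented circuit: the 3-bit low field (`LOW_BITS`), the `(mantissa, exponent)` operand encoding, and the function names are all assumptions made for illustration. The fine shift is modeled as a right shift inside a window widened by `LOW_SPAN - 1` guard bits.

```python
LOW_BITS = 3                 # assumed width of the low exponent field
LOW_SPAN = 1 << LOW_BITS     # exponent steps covered by one high-field step
LOW_MASK = LOW_SPAN - 1


def split_exponent(e):
    """Split an exponent into (high field, low field); e == hi * LOW_SPAN + lo."""
    return e >> LOW_BITS, e & LOW_MASK


def dot_product(a, b):
    """Dot product of two non-empty lists of (mantissa, exponent) pairs.

    Each operand has the value mantissa * 2**exponent.
    Returns (total, exponent) with result value total * 2**exponent.
    """
    fine_shifted = []
    high_fields = []
    for (ma, ea), (mb, eb) in zip(a, b):
        product = ma * mb                  # mantissa product
        hi, lo = split_exponent(ea + eb)   # exponent of the product, split
        # Fine-grained shift: depends only on the low field, so it can start
        # as soon as the mantissa product is ready.
        widened = product << LOW_MASK      # place product in a widened window
        fine_shifted.append(widened >> (LOW_MASK - lo))
        high_fields.append(hi)

    # Maximum is taken over the narrow high fields only.
    max_hi = max(high_fields)

    # Coarse shift: multiples of LOW_SPAN, applied once max_hi is known.
    # Bits shifted out here are dropped, as in a fixed-width datapath.
    total = sum(f >> ((max_hi - hi) * LOW_SPAN)
                for f, hi in zip(fine_shifted, high_fields))
    return total, max_hi * LOW_SPAN
```

For example, `dot_product([(3, 2), (5, 0)], [(1, 1), (2, 3)])` computes 12·2 + 5·16 and returns `(104, 0)`. When the high fields differ, low-order bits of the smaller products are truncated by the coarse shift.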
Multiplication and accumulation (MAC) operator
A MAC operator includes a plurality of multipliers, a plurality of floating-point to fixed-point converters, an adder tree, an accumulator, and a fixed-point to floating-point converter. Each of the plurality of multipliers may perform a multiplication operation on first data and second data of a single-precision floating-point (FP32) format to output multiplication result data of the FP32 format. Each of the plurality of floating-point to fixed-point converters may convert the FP32 format into a fixed-point format. The adder tree may perform a first addition operation on the data of the fixed-point format. The accumulator may perform an accumulation operation on the data output from the adder tree. The fixed-point to floating-point converter may convert the data of the fixed-point format into data of the FP32 format.
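The data path described above (FP32 multiply, conversion to fixed point, adder tree, accumulator, conversion back to FP32) can be sketched in software as follows. The 16 fractional bits (`FRAC_BITS`), the class name `MacOperator`, and the helper names are assumptions for illustration; the abstract does not specify a fixed-point width.

```python
import struct

FRAC_BITS = 16  # assumed number of fractional bits in the fixed-point format


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float_to_fixed(x):
    """Convert a float to a fixed-point integer with FRAC_BITS fraction bits."""
    return int(round(x * (1 << FRAC_BITS)))


def fixed_to_float(q):
    """Convert a fixed-point integer back to an FP32 value."""
    return to_fp32(q / (1 << FRAC_BITS))


def adder_tree(values):
    """Sum integers pairwise, level by level, like a hardware adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


class MacOperator:
    def __init__(self):
        self.acc = 0  # fixed-point accumulator

    def step(self, first, second):
        """Multiply element-wise in FP32, add in fixed point, accumulate."""
        products = [to_fp32(to_fp32(a) * to_fp32(b))
                    for a, b in zip(first, second)]
        fixed = [float_to_fixed(p) for p in products]
        self.acc += adder_tree(fixed)
        return fixed_to_float(self.acc)
```

Calling `step` repeatedly on the same `MacOperator` instance accumulates across calls, which mirrors the accumulator stage that follows the adder tree.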
Processing-in-memory (PIM) devices
A Processing-In-Memory (PIM) device includes an error correction code (ECC) logic circuit and an error accumulation detection circuit. The ECC logic circuit is configured to detect erroneous bits included in first data to generate a parity bit, and to detect an error correction capability of the first data to generate an error correction fail signal. The error accumulation detection circuit is configured to generate an error accumulation signal counted by pulses of the error correction fail signal. The error correction capability is set to the maximum number of erroneous bits that can be corrected by performing an ECC operation on the first data.
Neural network processor for handling differing datatypes
Embodiments relate to a neural engine circuit that includes an input buffer circuit, a kernel extract circuit, and a multiply-accumulator (MAC) circuit. The MAC circuit receives input data from the input buffer circuit and a kernel coefficient from the kernel extract circuit. The MAC circuit contains several multiply-add (MAD) circuits and accumulators used to perform neural network operations on the received input data and kernel coefficients. The MAD circuits are configured to support fixed-point precision (e.g., INT8) and floating-point precision (FP16) of operands. In floating-point mode, each MAD circuit multiplies the integer bits of input data and kernel coefficients and adds their exponent bits to determine a binary point for alignment. In fixed-point mode, input data and kernel coefficients are multiplied. In both operation modes, the output data is stored in an accumulator, and may be sent back as accumulated values for further multiply-add operations in subsequent processing cycles.
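The two operating modes can be contrasted with a short model. This is a sketch under stated assumptions: operands in floating-point mode are represented as `(integer_bits, exponent)` pairs, the accumulator is an integer with a fixed binary point at `ACC_FRAC_BITS`, and the function names are hypothetical. None of these details come from the abstract.

```python
ACC_FRAC_BITS = 24  # assumed position of the accumulator's binary point


def mad_fixed(acc, x, k):
    """Fixed-point mode: multiply input by kernel coefficient and accumulate."""
    return acc + x * k


def mad_float(acc, x, k):
    """Floating-point mode.

    x and k are (integer_bits, exponent) pairs with value integer_bits * 2**exponent.
    The integer bits are multiplied; the exponents are added to find where the
    product's binary point sits relative to the accumulator's.
    """
    (sx, ex), (sk, ek) = x, k
    product = sx * sk
    shift = ex + ek + ACC_FRAC_BITS
    # Align the product to the accumulator's binary point.
    aligned = product << shift if shift >= 0 else product >> -shift
    return acc + aligned


def acc_to_float(acc):
    """Interpret the floating-point-mode accumulator as a real value."""
    return acc / (1 << ACC_FRAC_BITS)
```

In both modes the returned accumulator value is fed back as `acc` on the next call, matching the accumulate-and-reuse behavior described in the abstract.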
Inference apparatus, convolution operation execution method, and program
An inference apparatus comprises a plurality of PEs (Processing Elements) and a control part. By controlling the plurality of PEs, the control part executes a convolution operation in a convolutional neural network using each of a plurality of pieces of input data and a weight group including a plurality of weights corresponding to each of the plurality of pieces of input data. Further, each of the plurality of PEs executes a computation including multiplication of a single piece of the input data by a single weight and also executes multiplication included in the convolution operation using an element with a non-zero value included in each of the plurality of pieces of input data.
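The idea of multiplying only with non-zero input elements can be sketched for a single-channel 2-D convolution. This is an illustrative software model, not the apparatus itself: the function name, the list-of-lists layout, and the "valid" (no padding, stride 1) convolution are assumptions.

```python
def conv2d_skip_zeros(inp, weights):
    """2-D 'valid' convolution (no kernel flip) that skips zero-valued inputs."""
    in_h, in_w = len(inp), len(inp[0])
    k_h, k_w = len(weights), len(weights[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    out = [[0] * out_w for _ in range(out_h)]

    # Collect only the non-zero input elements with their positions.
    nonzero = [(r, c, v)
               for r, row in enumerate(inp)
               for c, v in enumerate(row) if v != 0]

    # Each (input element, weight) pair is one PE-style multiplication.
    for r, c, v in nonzero:
        for i in range(k_h):
            for j in range(k_w):
                o_r, o_c = r - i, c - j  # output position this product feeds
                if 0 <= o_r < out_h and 0 <= o_c < out_w:
                    out[o_r][o_c] += v * weights[i][j]
    return out
```

With an input that is mostly zeros (as is common after a ReLU activation), the number of multiplications scales with the count of non-zero elements rather than with the full input size.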
RELIABLE SUPERVISED MACHINE LEARNING USING INTERVAL ARITHMETIC
An interval arithmetic based system and method for solving a global optimization problem is contemplated and provided. Provisions are made for a bisection indexing scheme for a parameter domain of an objective function wherein unique codes are assigned to each iterative interval subset of the parameter domain. Relationships for, between and among iterative subdivisions are arithmetically delimited, with unique codes populating an integer field of a bisection queue system memory component for particular arrays in a bisection context system memory component, and a further integer array of the bisection context. In connection to depth first domain bisection, operations are undertaken relative to the bisection context which are memorialized in relation to the bisection queue, operations which include a work stealing scheme for simultaneous breadth first searching.
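The abstract does not spell out how the unique codes are assigned. One simple possibility, shown here purely as an assumption and not as the patent's scheme, is heap-style numbering over a one-dimensional domain: the root interval has code 1, and the two halves of interval `k` have codes `2k` and `2k + 1`. Parent, depth, and interval bounds then follow arithmetically from the integer code alone.

```python
def children(code):
    """Codes of the lower and upper halves of the interval with this code."""
    return 2 * code, 2 * code + 1


def parent(code):
    """Code of the interval that was bisected to produce this one."""
    return code >> 1


def depth(code):
    """Number of bisections separating this interval from the root (code 1)."""
    return code.bit_length() - 1


def interval(code, lo, hi):
    """Recover the sub-interval of [lo, hi] identified by the code."""
    # Bits after the leading 1 record the path: 0 = lower half, 1 = upper half.
    for bit in bin(code)[3:]:
        mid = (lo + hi) / 2
        if bit == "0":
            hi = mid
        else:
            lo = mid
    return lo, hi
```

Because a code fits in a single integer, a queue of pending sub-intervals can store integers instead of explicit bounds, which is in the spirit of the integer field of the bisection queue mentioned above.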
Reconfigurable input precision in-memory computing
Technology for reconfigurable input precision in-memory computing is disclosed herein. Reconfigurable input precision allows the bit resolution of input data to be changed to meet the requirements of in-memory computing operations. Voltage sources (that may include DACs) provide voltages that represent input data to memory cell nodes. The resolution of the voltage sources may be reconfigured to change the precision of the input data. In one parallel mode, the number of DACs in a DAC node is used to configure the resolution. In one serial mode, the number of cycles over which a DAC provides voltages is used to configure the resolution. The memory system may include relatively low resolution voltage sources, which avoids the need to have complex high resolution voltage sources (e.g., high resolution DACs). Lower resolution voltage sources can take up less area and/or use less power than higher resolution voltage sources.
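The serial mode can be modeled numerically: a low-resolution DAC presents one slice of each input per cycle, and the number of cycles sets the effective input precision. The sketch below assumes a 2-bit DAC (`DAC_BITS`), unsigned inputs, and shift-and-add recombination of the per-cycle results; these specifics, and the function name, are illustrative assumptions rather than details from the abstract.

```python
DAC_BITS = 2                       # assumed resolution of each DAC
SLICE_MASK = (1 << DAC_BITS) - 1


def serial_mac(inputs, weights, cycles):
    """Multiply-accumulate with input precision of cycles * DAC_BITS bits."""
    total = 0
    for cycle in range(cycles):
        shift = cycle * DAC_BITS
        # Slice of each input that the DAC drives during this cycle.
        slices = [(x >> shift) & SLICE_MASK for x in inputs]
        # Models the summed contribution of the memory cells on one column.
        partial = sum(s * w for s, w in zip(slices, weights))
        # Weight the partial result by the significance of the slice.
        total += partial << shift
    return total
```

For instance, `serial_mac([5, 12, 3], [1, 2, 3], cycles=2)` treats the inputs as 4-bit values and returns 38, while `cycles=1` uses only the low 2 bits of each input and returns 10.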
Power efficient near memory analog multiply-and-accumulate (MAC)
A near memory system is provided for the calculation of a layer in a machine learning application. The near memory system includes an array of memory cells for storing an array of filter weights. A multiply-and-accumulate circuit couples to columns of the array to perform the calculation of the layer.
Resistive matrix computation circuit
A resistive matrix computation circuit and methods for using the same are disclosed. In one embodiment, a resistive matrix computation circuit includes: a memory configured to store a first set of operands and a second set of operands, where the first set of operands and the second set of operands are programmable by a controller and are cross-multiplied to form a plurality of product pairs; a plurality of resistive multiplier circuits configured to generate a plurality of output voltages according to the plurality of product pairs; the controller, configured to control the plurality of resistive multiplier circuits to perform multiplications using the first set of operands and the second set of operands; and an aggregator circuit configured to aggregate the plurality of output voltages from the plurality of resistive multiplier circuits, where the plurality of output voltages represent an aggregated value of the plurality of product pairs.
NEURAL NETWORK DATA COMPUTATION USING MIXED-PRECISION
Techniques for mixed-precision data manipulation for neural network data computation are disclosed. A first left group comprising eight bytes of data and a first right group of eight bytes of data are obtained for computation using a processor. A second left group comprising eight bytes of data and a second right group of eight bytes of data are obtained. A sum of products is performed between the first left and right groups and the second left and right groups. The sum of products is performed on bytes of 8-bit integer data. A first result is based on a summation of eight values that are products of the first group’s left eight bytes and the second group’s left eight bytes. A second result is based on the summation of eight values that are products of the first group’s left eight bytes and the second group’s right eight bytes. Results are output.
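The two results named above can be written out directly. This is a minimal sketch: the function names and the explicit range check are assumptions, and Python integers stand in for the wider accumulator that a processor would use to hold sums of 8-bit products.

```python
def sum_of_products(a, b):
    """Sum of eight element-wise products of signed 8-bit integers."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("each group must contain exactly eight bytes")
    if any(not -128 <= v <= 127 for v in list(a) + list(b)):
        raise ValueError("values must fit in a signed 8-bit integer")
    # Products are accumulated at wider precision than the 8-bit inputs.
    return sum(x * y for x, y in zip(a, b))


def two_results(first_left, second_left, second_right):
    """First result: first-left x second-left. Second: first-left x second-right."""
    return (sum_of_products(first_left, second_left),
            sum_of_products(first_left, second_right))
```

A single product of two 8-bit values can reach 16384, so the sum of eight of them needs more than 16 bits, which is why the inputs and the accumulated result are held at different precisions.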