Patent classifications
G06F7/483
REGISTER FILE FOR SYSTOLIC ARRAY
A processing apparatus includes a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating-point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and to limit access to the second register file by the set of multiple processing elements.
Lookahead priority collection to support priority elevation
A queuing requester for access to a memory system is provided. Transaction requests are received from two or more requesters for access to the memory system. Each transaction request includes an associated priority value. A request queue of the received transaction requests is formed in the queuing requester. A highest priority value of all pending transaction requests within the request queue is determined. An elevated priority value is selected when the highest priority value is higher than the priority value of an oldest transaction request in the request queue; otherwise the priority value of the oldest transaction request is selected. The oldest transaction request in the request queue with the selected priority value is then provided to the memory system. An arbitration contest with other requesters for access to the memory system is performed using the selected priority value.
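The selection rule in this abstract can be illustrated with a minimal software sketch. The class name `QueuingRequester`, its method names, and the convention that a larger number means a higher priority are assumptions made for illustration; the abstract does not specify an encoding or an interface.

```python
from collections import deque


class QueuingRequester:
    """Hypothetical model of a FIFO requester with lookahead priority elevation."""

    def __init__(self):
        # Each entry is (request_id, priority); the leftmost entry is the oldest.
        self._queue = deque()

    def enqueue(self, request_id, priority):
        self._queue.append((request_id, priority))

    def head_with_selected_priority(self):
        """Return the oldest request paired with the priority used in arbitration."""
        if not self._queue:
            return None
        oldest_id, oldest_priority = self._queue[0]
        # Look ahead over every pending request, not only the head of the queue.
        highest = max(priority for _, priority in self._queue)
        # Elevate the head only when something behind it is more urgent.
        selected = highest if highest > oldest_priority else oldest_priority
        return oldest_id, selected

    def grant(self):
        """Remove the oldest request once it has won arbitration."""
        return self._queue.popleft()
```

In this sketch the queue order never changes. Only the priority presented to the arbiter is raised, so a low-priority head does not hold back an urgent request queued behind it.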
ARITHMETIC DEVICE
An arithmetic device according to an embodiment includes a first operation unit that calculates a shared exponent bias value for shifting a dynamic range of a floating-point operation; a second operation unit that calculates a sum-of-product arithmetic result of a second number of bits larger than a first number of bits by performing arithmetic operations corresponding to a large number of elements on a first data set formed of a shared exponent bias value and an activation value of a floating point of the first number of bits, and a second data set formed of a shared exponent bias value and a weight of a floating point of the first number of bits; and a quantizer that updates the activation value by quantizing the number of bits of the sum-of-product arithmetic result from the second number of bits to the first number of bits.
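A rough sketch of the idea, under several assumptions: the shared exponent bias is taken to be a power-of-two shift chosen from the largest magnitude in a data set, the narrow format is simulated by rounding the mantissa to a few bits, and Python floats stand in for the wider accumulator. The function names, the `top_exponent` target, and the 3-bit mantissa are all hypothetical and are not taken from the abstract.

```python
import math


def choose_shared_bias(values, top_exponent=15):
    """Pick one bias for the whole set so the largest magnitude lands at top_exponent."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 0
    _, exponent = math.frexp(peak)
    return exponent - top_exponent  # real value = stored code * 2**bias


def encode(value, bias, mantissa_bits=3):
    """Shift by the shared bias, then round the mantissa to simulate a narrow float."""
    scaled = math.ldexp(value, -bias)
    mantissa, exponent = math.frexp(scaled)
    steps = 1 << (mantissa_bits + 1)
    mantissa = round(mantissa * steps) / steps
    return math.ldexp(mantissa, exponent)


def dot_with_shared_bias(activations, weights):
    """Sum of products over narrow codes, accumulated in a wider type."""
    a_bias = choose_shared_bias(activations)
    w_bias = choose_shared_bias(weights)
    a_codes = [encode(a, a_bias) for a in activations]
    w_codes = [encode(w, w_bias) for w in weights]
    wide_sum = sum(a * w for a, w in zip(a_codes, w_codes))
    # The two shared biases add, so they are applied once after accumulation.
    return math.ldexp(wide_sum, a_bias + w_bias)


def requantize(value, mantissa_bits=3):
    """Quantize a wide result back to the narrow format with a fresh bias."""
    bias = choose_shared_bias([value])
    return encode(value, bias, mantissa_bits), bias
```

Because the bias is shared across a data set, it is applied once per sum of products rather than once per element, which is the property this sketch is meant to show. Exponent-range clamping of the narrow format is omitted.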
Power Saving Floating Point Multiplier-Accumulator with Precision-Aware Accumulation
A floating point multiplier-accumulator (MAC) multiplies and accumulates N pairs of floating point values using N MAC processors operating simultaneously, each pair of values comprising an input value and a coefficient value to be multiplied and accumulated. The pairs of floating point values are simultaneously processed by the plurality of MAC processors, each of which outputs a signed integer form fraction and a maximum exponent. A range estimator forms a possible range of values from the exponent differences and determines an adder precision. The integer form fractions are summed using the adder precision, a sign bit is extracted, and a floating point value is output. Each MAC processor provides its integer form fraction with a precision determined by the MAC processor's exponent difference.
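The alignment step can be sketched in software as follows. This is a simplified illustration, not the claimed circuit: it uses a single fixed `frac_bits` width (a hypothetical parameter) in place of the range estimator and the per-processor precision, and it processes the pairs sequentially rather than simultaneously.

```python
import math


def mac_aligned(inputs, coeffs, frac_bits=24):
    """Multiply pairs, align each product to the maximum exponent, sum as integers."""
    products = [a * b for a, b in zip(inputs, coeffs)]
    # frexp gives p == m * 2**e with 0.5 <= |m| < 1 (or m == 0 for a zero product).
    parts = [math.frexp(p) for p in products]
    exponents = [e for m, e in parts if m != 0.0]
    if not exponents:
        return 0.0
    max_exp = max(exponents)
    total = 0
    for m, e in parts:
        # Signed integer-form fraction, shifted right by the exponent difference.
        # Bits shifted out are dropped, so terms far below the maximum vanish.
        shift = max_exp - e
        total += int(m * (1 << frac_bits)) >> shift
    # Convert the integer sum back to a floating point value.
    return math.ldexp(float(total), max_exp - frac_bits)
```

A product whose exponent is far below the maximum contributes few or no bits after the shift, which is why the adder precision can be tied to the exponent differences.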
Digital Signal Processor and Method
A digital signal processor according to an embodiment comprises a processing stage. The processing stage is configured to receive Cartesian coordinates of a vector in a floating point format and to output polar coordinates of the vector in a floating point format. The processing stage comprises a first electronic circuit configured to iteratively implement, timed by a clock signal, a CORDIC algorithm in a floating point format.
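For orientation, here is a minimal software sketch of CORDIC in vectoring mode, which drives the y component toward zero while accumulating the rotation angle. The function name, the iteration count, and the quadrant pre-rotation are assumptions; the abstract describes a clocked floating point circuit, whereas this sketch only mirrors the iteration in ordinary Python floats.

```python
import math


def cordic_to_polar(x, y, iterations=32):
    """Convert Cartesian (x, y) to polar (radius, angle in radians) with CORDIC."""
    if x == 0.0 and y == 0.0:
        return 0.0, 0.0
    angle = 0.0
    # Pre-rotate by 90 degrees so the vector starts in the right half-plane.
    if x < 0.0:
        if y >= 0.0:
            x, y, angle = y, -x, math.pi / 2
        else:
            x, y, angle = -y, x, -math.pi / 2
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i
        # Each micro-rotation uses only a power-of-two scaling and an add or subtract.
        if y > 0.0:
            x, y = x + y * step, y - x * step
            angle += math.atan(step)
        else:
            x, y = x - y * step, y + x * step
            angle -= math.atan(step)
        gain *= math.sqrt(1.0 + step * step)
    # The micro-rotations stretch the vector by a known constant gain.
    return x / gain, angle
```

In hardware the `atan` values and the gain would normally come from a precomputed table. They are computed inline here to keep the sketch self-contained.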
QUANTIZATION EVALUATOR
A method of quantization evaluation, including: receiving a floating point data set, determining a floating point neural network model output utilizing the floating point data set, quantizing the floating point data set utilizing a quantization model yielding a quantized data set, determining a quantized neural network model output utilizing the quantized data set, determining whether an accuracy error between the floating point neural network model output and the quantized neural network model output exceeds a predetermined error tolerance, determining a floating point neural network tensor output utilizing the floating point data set if the predetermined error tolerance is exceeded, determining a quantized neural network tensor output utilizing the quantized data set if the predetermined error tolerance is exceeded, determining a per-tensor error based on the floating point neural network tensor output and the quantized neural network tensor output, and updating the quantization model based on the per-tensor error.
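The control flow of this method can be sketched with a toy model. Everything concrete here is an assumption made for illustration: the model is a list of plain Python functions, the quantization model is a single uniform step size `scale`, the error metric is maximum absolute difference, and the update rule (halving the step) is a placeholder, since the abstract does not say how the quantization model is updated.

```python
def quantize(values, scale):
    """Uniform quantization to multiples of scale (hypothetical quantization model)."""
    return [round(v / scale) * scale for v in values]


def run_model(layers, inputs):
    """Run the layers in order and keep every intermediate tensor output."""
    outputs = []
    x = inputs
    for layer in layers:
        x = layer(x)
        outputs.append(x)
    return outputs  # the last entry is the model output


def max_abs_error(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


def evaluate_quantization(layers, data, scale, tolerance):
    """Return (possibly updated scale, per-tensor errors)."""
    float_outputs = run_model(layers, data)
    quant_outputs = run_model(layers, quantize(data, scale))
    model_error = max_abs_error(float_outputs[-1], quant_outputs[-1])
    if model_error <= tolerance:
        return scale, []  # within tolerance: no per-tensor analysis needed
    # Tolerance exceeded: compare the two runs tensor by tensor.
    per_tensor = [max_abs_error(f, q) for f, q in zip(float_outputs, quant_outputs)]
    # Placeholder update rule (an assumption): use a finer quantization step.
    return scale / 2, per_tensor
```

The per-tensor comparison runs only when the model-level error exceeds the tolerance, matching the conditional steps in the abstract.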