G06F7/487

Reducing power consumption in a fused multiply-add (FMA) unit of a processor
09778911 · 2017-10-03

In one embodiment, the present invention includes a processor having a fused multiply-add (FMA) unit to perform FMA instructions and add-like instructions. This unit can include an adder with multiple segments each independently controlled by a logic. The logic can clock gate at least one segment during execution of an add-like instruction in another segment of the adder when the add-like instruction has a width less than a width of the FMA unit. Other embodiments are described and claimed.
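The segment gating described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the 16-bit segment size, the four-segment (64-bit) adder, and the names `segment_enables` and `segmented_add` are hypothetical and do not come from the patent, and plain integer addition stands in for the mantissa adder of a real FMA unit.

```python
SEGMENT_BITS = 16   # assumed segment width (hypothetical)
NUM_SEGMENTS = 4    # assumed 64-bit adder built from four segments


def segment_enables(op_width):
    """Return one enable flag per segment; False models a clock-gated segment."""
    needed = -(-op_width // SEGMENT_BITS)  # ceiling division
    return [index < needed for index in range(NUM_SEGMENTS)]


def segmented_add(a, b, op_width):
    """Add two operands using only the segments the operand width requires."""
    enables = segment_enables(op_width)
    mask = (1 << SEGMENT_BITS) - 1
    result, carry = 0, 0
    for index, enabled in enumerate(enables):
        if not enabled:
            continue  # gated segment: no work is done, modelling saved switching power
        shift = index * SEGMENT_BITS
        partial = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (partial & mask) << shift
        carry = partial >> SEGMENT_BITS
    return result & ((1 << op_width) - 1), enables
```

For a 32-bit add on this modelled 64-bit adder, only the two low segments are enabled and the upper two stay gated.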

Temporally split fused multiply-accumulate operation
09778908 · 2017-10-03

A microprocessor splits a fused multiply-accumulate operation of the form A*B+C into first and second multiply-accumulate sub-operations to be performed by a multiplier and an adder. The first sub-operation at least multiplies A and B, and conditionally also accumulates C to the partial products of A and B to generate an unrounded nonredundant sum. The unrounded nonredundant sum is stored in memory shared by the multiplier and adder for an indefinite time period, enabling the multiplier and adder to perform other operations unrelated to the multiply-accumulate operation. The second sub-operation conditionally accumulates C to the unrounded nonredundant sum if C is not already incorporated into the value, and then generates a final rounded result.
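A minimal sketch of the two sub-operations follows, assuming an exact rational value as a stand-in for the stored unrounded nonredundant sum (real hardware would use a finite-width intermediate format). The function names and the `c_included` flag are hypothetical illustrations, not terms from the patent.

```python
from fractions import Fraction


def first_suboperation(a, b, c=None):
    """Multiply A and B exactly; optionally accumulate C. No rounding happens here."""
    unrounded = Fraction(a) * Fraction(b)
    if c is not None:
        return unrounded + Fraction(c), True   # C already incorporated
    return unrounded, False                    # C still pending


def second_suboperation(unrounded, c_included, c):
    """Accumulate C if it is still pending, then round exactly once."""
    if not c_included:
        unrounded += Fraction(c)
    return float(unrounded)  # the single, final rounding step


a = 1 + 2.0**-30
b = 1 - 2.0**-30
c = -1.0

stored, c_included = first_suboperation(a, b)       # other work could run in between
fused = second_suboperation(stored, c_included, c)  # keeps the low-order bits
unfused = a * b + c                                 # product is rounded before the add
```

With these operands the split operation returns `-2**-60`, while the separately rounded multiply and add returns `0.0`, which is the precision benefit of deferring rounding to the second sub-operation.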

Processor Comprising Three-Dimensional Memory (3D-M) Array

The present invention discloses a processor comprising a three-dimensional memory (3D-M) array (a 3D-processor). Instead of logic-based computation (LBC), the 3D-processor uses memory-based computation (MBC). It comprises an array of computing elements, each comprising an arithmetic logic circuit (ALC) and a 3D-M-based look-up table (3DM-LUT). The ALC performs arithmetic operations on the LUT data, while the 3DM-LUT is stored in at least one 3D-M array.
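To illustrate memory-based computation in general terms, the sketch below evaluates a sine function from a look-up table, with a small amount of arithmetic (linear interpolation) standing in for the ALC. The choice of function, the table size, and the names `SIN_LUT` and `lut_sin` are assumptions made for illustration; the abstract does not specify them.

```python
import math

TABLE_SIZE = 256  # assumed number of table intervals (hypothetical)

# Precomputed table; in the described processor this data would reside in a 3D-M array.
SIN_LUT = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]


def lut_sin(x):
    """Approximate sin(x) by table look-up plus linear interpolation."""
    position = (x / (2 * math.pi)) % 1.0 * TABLE_SIZE
    index = min(int(position), TABLE_SIZE - 1)  # guard against position == TABLE_SIZE
    fraction = position - index
    # The "ALC" part: one subtraction, one multiplication, one addition on LUT data.
    return SIN_LUT[index] + fraction * (SIN_LUT[index + 1] - SIN_LUT[index])
```

The table replaces most of the logic that a direct evaluation would need, at the cost of memory and a bounded interpolation error.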

Neural network device for neural network operation, method of operating neural network device, and application processor including the neural network device

Provided are a neural network device for performing a neural network operation, a method of operating the neural network device, and an application processor including the neural network device. The neural network device includes a direct memory access (DMA) controller configured to receive floating-point data from a memory; a data converter configured to convert the floating-point data received through the DMA controller to integer-type data; and a processor configured to perform a neural network operation based on an integer operation by using the integer-type data provided from the data converter.
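The conversion step can be sketched as a simple symmetric quantization, followed by an integer-only dot product. This is an assumption about what such a data converter might do, not the scheme in the application: the 8-bit target range, the per-tensor scale, and the names `convert_to_int8` and `integer_dot` are hypothetical.

```python
def convert_to_int8(values):
    """Map floating-point values to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0) or 1.0
    scale = max_abs / 127
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def integer_dot(qa, qb):
    """Dot product computed entirely with integer multiplies and adds."""
    return sum(x * y for x, y in zip(qa, qb))


weights = [0.5, -1.0, 0.25]
activations = [1.0, 2.0, 4.0]

q_w, scale_w = convert_to_int8(weights)
q_a, scale_a = convert_to_int8(activations)

# Only the final rescaling uses floating point; the accumulation itself is integer.
approx = integer_dot(q_w, q_a) * scale_w * scale_a
exact = sum(w * a for w, a in zip(weights, activations))
```

Here `approx` stays close to the floating-point result `exact` (-0.5), with the difference coming from quantization error.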

Error unbiased approximate multiplier for normalized floating-point numbers and implementation method of error unbiased approximate multiplier
11429347 · 2022-08-30

The present invention discloses an error unbiased approximate multiplier for normalized floating-point numbers and an implementation method of the error unbiased approximate multiplier. The error unbiased approximate multiplier includes a sign and exponent bit module, a mantissa approximation module and a normalization module, wherein the sign and exponent bit module processes the sign operation and the exponent bit operation of the floating-point numbers; the mantissa approximation module obtains a mantissa approximation result under different accuracy requirements by summing the results of multilevel error correction modules; and the normalization module adjusts the exponent bits according to the result of the mantissa operation and handles exponent overflow to obtain the final product. According to the present invention, for multiplication of normalized floating-point numbers under the IEEE 754 standard, the error distribution is unbiased at each controllable accuracy level, and area, speed and energy efficiency are significantly improved.
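The three-module structure can be sketched in software. The mantissa approximation used here is NOT the patented error-correction scheme, which the abstract does not detail. It is a swapped-in textbook idea: for mantissas 1+a and 1+b, the product 1+a+b+ab is approximated by replacing the cross term ab with the constant 0.25, which is its mean if a and b are assumed independent and uniform on [0, 1). The function name and this constant correction are illustrative assumptions.

```python
import math


def approx_multiply(x, y):
    """Approximate x*y with an add-only mantissa path (illustrative sketch)."""
    if x == 0.0 or y == 0.0:
        return 0.0

    # Sign and exponent module: XOR of the signs, sum of the exponents.
    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    mx, ex = math.frexp(abs(x))   # abs(x) = mx * 2**ex with mx in [0.5, 1)
    my, ey = math.frexp(abs(y))
    a = 2 * mx - 1                # fraction in [0, 1): abs(x) = (1 + a) * 2**(ex - 1)
    b = 2 * my - 1
    exponent = (ex - 1) + (ey - 1)

    # Mantissa approximation module: no multiplier, constant stands in for a*b.
    mantissa = 1.0 + a + b + 0.25

    # Normalization module: bring the mantissa back into [1, 2).
    if mantissa >= 2.0:
        mantissa /= 2
        exponent += 1
    # Exponent overflow is not handled here; math.ldexp raises OverflowError.
    return sign * math.ldexp(mantissa, exponent)
```

For example, `approx_multiply(-2.0, 3.0)` gives -7.0 against an exact product of -6.0; individual errors can be large, but under the uniformity assumption they average out to zero in the mantissa sum.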

SCALING HALF-PRECISION FLOATING POINT TENSORS FOR TRAINING DEEP NEURAL NETWORKS
20220269931 · 2022-08-25

A graphics processor is described that includes a single instruction, multiple thread (SIMT) architecture including hardware multithreading. The multiprocessor can execute parallel threads of instructions associated with a command stream, where the multiprocessor includes a set of functional units to execute at least one of the parallel threads of the instructions. The set of functional units can include a mixed precision tensor processor to perform tensor computations. The functional units can also include circuitry to analyze statistics for output values of the tensor computations, determine a target format to which the output values are to be converted, the target format being determined based on the statistics for the output values and a precision associated with a second layer of the neural network, and convert the output values to the target format.
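As a rough illustration of choosing a representation from output statistics, the sketch below uses a single statistic (the maximum magnitude) to pick a power-of-two scale so that values land in a well-resolved part of the half-precision range. The headroom parameter, the use of only the maximum, and the function names are assumptions; the application's actual statistics and format selection are not specified in the abstract.

```python
import math
import struct


def to_half(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def choose_scale_exponent(values, headroom_bits=1):
    """Pick k so that max|v| * 2**k stays below 2**(15 - headroom_bits)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0
    _, exponent = math.frexp(max_abs)  # max_abs < 2**exponent
    return 15 - headroom_bits - exponent


outputs = [1e-6, 3e-6]                   # tiny values, subnormal in half precision
k = choose_scale_exponent(outputs)       # statistics-driven choice of scale
scaled = [to_half(math.ldexp(v, k)) for v in outputs]
restored = [math.ldexp(s, -k) for s in scaled]
unscaled = [to_half(v) for v in outputs]  # direct conversion, for comparison
```

Stored directly, `1e-6` falls in the subnormal range of half precision and loses about 1% of its value; stored with the shared scale exponent `k`, it is recovered to within roughly 0.03%.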

Method and apparatus for permuting streamed data elements

A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.
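The mapping step can be modelled as an index-driven gather. This is a minimal sketch: the `lane_map` encoding, in which entry i names the stream element routed to vector lane i, is a hypothetical control format and not one taken from the patent.

```python
def permute(stream_elements, lane_map):
    """Route streamed elements to vector lanes: lane i receives element lane_map[i]."""
    return [stream_elements[source] for source in lane_map]


streamed = [10, 20, 30, 40]
reversed_lanes = permute(streamed, [3, 2, 1, 0])   # reverse the element order
broadcast = permute(streamed, [0, 0, 0, 0])        # replicate element 0 to every lane
```

Because the map may repeat or omit source indices, the same mechanism covers reordering, broadcasting, and selection before the vector functional unit consumes the data.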

BINARY FUSED MULTIPLY-ADD FLOATING-POINT CALCULATIONS

A binary fused multiply-add floating-point unit is configured to operate on an addend, a multiplier, and a multiplicand. The unit is configured to receive, as the addend, an unrounded result of a prior operation executed in the unit via an early result feedback path; to perform an alignment shift of the unrounded addend on an unrounded exponent and an unrounded mantissa; and to perform a rounding correction for the addend in parallel with the actual alignment shift, responsive to a rounding-up signal.
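The reason the rounding correction can run alongside the alignment shift is a simple identity: shifting the rounded mantissa (m + r) equals shifting the unrounded mantissa m and adding a separately shifted round-up bit r. The sketch below checks this with integers, assuming the alignment is modelled as a left shift into a wide fixed-point frame; the function name and this modelling choice are assumptions, not the patented circuit.

```python
def align_unrounded_addend(unrounded_mantissa, shift, round_up):
    """Align an unrounded mantissa and apply the rounding correction separately."""
    # Path 1: alignment shift starts at once on the unrounded mantissa.
    shifted = unrounded_mantissa << shift
    # Path 2: correction term depends only on the shift amount and round-up signal,
    # so it can be formed in parallel with path 1.
    correction = (1 << shift) if round_up else 0
    return shifted + correction


mantissa = 0b1011_0111
aligned_late = align_unrounded_addend(mantissa, 5, round_up=True)
aligned_reference = (mantissa + 1) << 5   # round first, then shift (the slower order)
```

Both orders give the same aligned addend, so the unit does not have to wait for the prior operation's rounding to finish before starting the shift.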