Patent classifications
G06F2207/382
DEEP LEARNING ACCELERATION WITH MIXED PRECISION
A device for deep learning acceleration with mixed precision may include a first precision mode port to receive an indication of an input precision mode and a second precision mode port to receive an indication of an output precision mode. The device may include a first data port to receive map data and a second data port to receive kernel data. The device may include multiply-accumulate (MAC) components that are each configured to generate a MAC output based on the input precision mode, the map data, and the kernel data. The device may include an adder component to generate an adder component output based on the input precision mode and one or more MAC outputs. The device may include a rounding component to round the adder component output, based on the output precision mode, to generate a rounded output, and an output port to output the rounded output.
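As a rough illustration of the data flow in this abstract, the following Python sketch models the MAC components, the adder component, and the rounding component in software. The function names, the signed two's-complement reading of the words, the `shift` parameter, and the round-half-up with saturation behavior are all assumptions made for illustration; the abstract does not specify them.

```python
def mac(map_words, kernel_words, in_bits):
    """Multiply-accumulate words that must fit in in_bits (assumed signed)."""
    lo, hi = -(1 << (in_bits - 1)), (1 << (in_bits - 1)) - 1
    acc = 0
    for m, k in zip(map_words, kernel_words):
        if not (lo <= m <= hi and lo <= k <= hi):
            raise ValueError(f"word does not fit in {in_bits} bits")
        acc += m * k
    return acc


def round_output(value, shift, out_bits):
    """Drop `shift` low bits with round-half-up, then saturate to out_bits."""
    if shift > 0:
        value = (value + (1 << (shift - 1))) >> shift
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, value))


def accelerate(map_groups, kernel_groups, in_bits, out_bits, shift):
    """One MAC per (map, kernel) group, then the adder, then the rounding step."""
    mac_outputs = [mac(m, k, in_bits) for m, k in zip(map_groups, kernel_groups)]
    adder_output = sum(mac_outputs)
    return round_output(adder_output, shift, out_bits)
```

For example, `accelerate([[1, 2], [3, 4]], [[5, 6], [7, 8]], in_bits=8, out_bits=8, shift=2)` sums the two MAC outputs 17 and 53 to 70 and rounds 70/4 to 18.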
DEEP LEARNING ACCELERATION WITH MIXED PRECISION
A device for deep learning acceleration with mixed precision may include vector-vector (VV) components that are each configured to generate a VV output based on an input precision mode, an output precision mode, and at least one accumulation of products. Each accumulation of products may be calculated by adding products based on the input precision mode. Each product may be calculated by multiplying a map word and a kernel word based on the input precision mode. The input precision mode may indicate an input word length for the map word and for the kernel word, and the output precision mode may indicate an output word length for the VV output. The device may include one or more components configured to concatenate VV outputs, corresponding to the VV components, to generate a concatenated VV output. The device may include an output port configured to output the concatenated VV output.
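The concatenation step in this abstract can be sketched in software as packing fixed-width VV outputs into one wider word. This is a minimal sketch under stated assumptions: the names `vv_output` and `concatenate`, the saturation to the output word length, and the choice to place the first VV output in the least significant bits are hypothetical and not taken from the abstract.

```python
def vv_output(map_words, kernel_words, out_bits):
    """Accumulate products of map and kernel words, saturated to out_bits (assumed signed)."""
    acc = sum(m * k for m, k in zip(map_words, kernel_words))
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, acc))


def concatenate(vv_outputs, out_bits):
    """Pack VV outputs side by side; output 0 lands in the lowest bits (assumption)."""
    mask = (1 << out_bits) - 1
    packed = 0
    for i, value in enumerate(vv_outputs):
        # Masking keeps the two's-complement bit pattern of negative values.
        packed |= (value & mask) << (i * out_bits)
    return packed
```

With an 8-bit output word length, the VV outputs 11 and -1 pack into the 16-bit pattern `0xFF0B`.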
Adder capable of supporting addition and subtraction of up to n-bit data and method of supporting addition and subtraction of a plurality of data type using the adder
An adder for supporting multiple data types by controlling a carry propagation is provided. The adder includes a plurality of first addition areas configured to receive pieces of incoming operand data, wherein each of the plurality of first addition areas includes a predetermined unit number of bits, and a plurality of second addition areas configured to receive pieces of control data based on a type of the operand data and an operation type, wherein the plurality of second addition areas are alternately arranged between the plurality of first addition areas.
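One way to picture carry control with interleaved areas is the software model below: each operand chunk is followed by a one-bit control slot, and the value placed in that slot decides whether a carry crosses into the next chunk. This is only a sketch of the general idea, covering addition and not subtraction; the one-bit slot width, the helper names, and the `propagate` list are assumptions and may differ from the claimed circuit.

```python
def spread(value, lanes, unit_bits):
    """Place each unit_bits chunk into a slot of unit_bits + 1 bits (control bit on top)."""
    mask = (1 << unit_bits) - 1
    out = 0
    for i in range(lanes):
        out |= ((value >> (i * unit_bits)) & mask) << (i * (unit_bits + 1))
    return out


def gather(value, lanes, unit_bits):
    """Inverse of spread: drop the control bits and repack the chunks."""
    mask = (1 << unit_bits) - 1
    out = 0
    for i in range(lanes):
        out |= ((value >> (i * (unit_bits + 1))) & mask) << (i * unit_bits)
    return out


def segmented_add(a, b, lanes, unit_bits, propagate):
    """Add a and b; propagate[i] lets the carry out of lane i enter lane i + 1."""
    wide_a, wide_b = spread(a, lanes, unit_bits), spread(b, lanes, unit_bits)
    for i, allow in enumerate(propagate):
        if allow:
            # Control bits 1 + 0: an incoming carry overflows into the next lane.
            wide_a |= 1 << (i * (unit_bits + 1) + unit_bits)
        # Control bits 0 + 0 (the default): an incoming carry is absorbed here.
    return gather(wide_a + wide_b, lanes, unit_bits)
```

With two 8-bit lanes, adding `0x00FF` and `0x0001` gives `0x0000` when the lanes are independent (`propagate=[False]`) and `0x0100` when they form one 16-bit value (`propagate=[True]`).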
Method and device for dynamically adjusting decimal point positions in neural network computations
The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. A storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to the one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.
Dynamic precision management for integer deep learning primitives
One embodiment provides for a graphics processing unit to perform computations associated with a neural network, the graphics processing unit comprising a hardware processing unit having a dynamic precision fixed-point unit that is configurable to quantize elements of a floating-point tensor to convert the floating-point tensor into a dynamic fixed-point tensor.
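Quantizing a floating-point tensor into a dynamic fixed-point tensor can be sketched as choosing one shared power-of-two exponent for the whole tensor and storing each element as a small integer. The sketch below is an assumption-laden illustration: the function names, the 8-bit default, the symmetric clamp, and the rule for picking the exponent are hypothetical and are not described in the abstract.

```python
import math


def quantize_dynamic_fixed_point(values, bits=8):
    """Return (integers, shared_exponent) such that value ~= integer * 2**shared_exponent."""
    limit = (1 << (bits - 1)) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest exponent for which the largest magnitude still fits in `bits`.
    exponent = math.ceil(math.log2(max_abs / limit))
    scale = 2.0 ** exponent
    quantized = [max(-limit, min(limit, round(v / scale))) for v in values]
    return quantized, exponent


def dequantize(quantized, exponent):
    """Map the integers back to floating point using the shared exponent."""
    return [q * 2.0 ** exponent for q in quantized]
```

For the tensor `[0.5, -1.0, 0.25]` with 8 bits, this picks a shared exponent of -6 and stores the integers `[32, -64, 16]`, which dequantize back to the original values exactly.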
LOW-POWER PROCESSOR WITH SUPPORT FOR MULTIPLE PRECISION MODES
Multiple data wordlengths may be supported by a processor through a single data path and/or a single set of registers. For example, the processor may support 16-bit wordlengths and 24-bit wordlengths through a single datapath. For supported data wordlengths that are less than the wordlength of the registers and datapath, the data may be left-aligned within the registers and datapath. The left alignment of data may allow saturation detection in the processor to be performed by examining the same saturation point regardless of the wordlength of the data being operated on. A special saturation mode of the processor may set the lower bits to zero when a configuration register or instruction-bit is set and saturation is detected.
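The benefit of left alignment can be seen in a small software model: once a 16-bit value is shifted to the top of a 24-bit register, overflow occurs at the same 24-bit boundary as for native 24-bit data. The sketch below assumes signed two's-complement data and a 24-bit register; the names `left_align`, `saturating_add`, and `zero_low_bits` are hypothetical stand-ins for the behavior the abstract describes.

```python
REG_BITS = 24  # assumed register and datapath width


def left_align(value, word_bits):
    """Shift a word_bits-wide value to the top of the register."""
    return value << (REG_BITS - word_bits)


def saturating_add(a, b, word_bits, zero_low_bits=False):
    """Add two left-aligned values, saturating at the register limits."""
    lo, hi = -(1 << (REG_BITS - 1)), (1 << (REG_BITS - 1)) - 1
    total = a + b
    saturated = total < lo or total > hi
    total = max(lo, min(hi, total))  # same check for every wordlength
    if saturated and zero_low_bits:
        # Special saturation mode: clear the bits below the shorter word.
        total &= ~((1 << (REG_BITS - word_bits)) - 1)
    return total
```

Adding the left-aligned 16-bit values 30000 and 10000 saturates to `0x7FFFFF`; with `zero_low_bits=True` the result becomes `0x7FFF00`, which is the largest 16-bit value, 32767, in left-aligned form.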
Computing device and method
The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. A storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to the one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.
Integrated circuits with machine learning extensions
An integrated circuit with specialized processing blocks is provided. A specialized processing block may be optimized for machine learning algorithms and may include a multiplier data path that feeds an adder data path. The multiplier data path may be decomposed into multiple partial product generators, multiple compressors, and multiple carry-propagate adders of a first precision. Results from the carry-propagate adders may be added using a floating-point adder of the first precision. Results from the floating-point adder may be optionally cast to a second precision that is higher or more accurate than the first precision. The adder data path may include an adder of the second precision that combines the results from the floating-point adder with zero, with a general-purpose input, or with other dot product terms. Operated in this way, the specialized processing block provides a technical improvement of greatly increasing the functional density for implementing machine learning algorithms.
Dynamic precision management for integer deep learning primitives
One embodiment provides for a graphics processing unit to perform computations associated with a neural network, the graphics processing unit comprising a compute unit including a hardware logic unit having dynamic precision fixed-point logic, the compute unit to receive a set of dynamic fixed-point tensors, compute, via the dynamic precision fixed-point logic, a right-shift value using an absolute maximum value within the set of dynamic fixed-point tensors and a dynamic range of the set of dynamic fixed-point tensors, right-shift data values within the set of dynamic fixed-point tensors based on the right-shift value, increment a shared exponent associated with the set of dynamic fixed-point tensors based on the right-shift value, perform a compute operation on the set of dynamic fixed-point tensors, and generate an output tensor via the compute operation on the set of dynamic fixed-point tensors.
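The right-shift and shared-exponent steps in this abstract can be sketched as follows. The sketch treats the dynamic range as a target bit width and derives the shift from the bit length of the absolute maximum; that rule, the function name `rescale`, and the use of an arithmetic (flooring) shift are assumptions for illustration rather than details from the abstract.

```python
def rescale(values, shared_exponent, bits):
    """Right-shift integer data so it fits in `bits`, compensating in the shared exponent."""
    max_abs = max((abs(v) for v in values), default=0)
    # Number of low bits to drop so that max_abs fits in bits - 1 magnitude bits.
    shift = max(0, max_abs.bit_length() - (bits - 1))
    shifted = [v >> shift for v in values]  # arithmetic shift, rounds toward -infinity
    return shifted, shared_exponent + shift
```

For example, `rescale([1000, -512, 3], 0, 8)` computes a shift of 3, producing `[125, -64, 0]` with a shared exponent of 3, so each stored integer now stands for a multiple of 2**3.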
Instructions for fused multiply-add operations with variable precision input operands
Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
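A software reading of the asymmetric FMA can be sketched as below: each SIMD lane consumes as many narrow second-source elements as fit in the lane width, multiplies each by the corresponding first-source element, and accumulates the products into the destination lane. The element ordering, the one-accumulator-per-lane layout, and the function name `asymmetric_fma` are assumptions; the abstract does not fix them, and no range checks or overflow handling are modeled.

```python
def asymmetric_fma(dest, src1, src2, lane_bits=16, width2=4):
    """Accumulate src1[i] * src2[i] into dest, lane_bits // width2 products per lane."""
    per_lane = lane_bits // width2  # second-source elements that fit in one lane
    if len(src1) != len(src2) or len(src2) != per_lane * len(dest):
        raise ValueError("source vectors do not match the lane layout")
    result = list(dest)
    for lane in range(len(dest)):
        for j in range(per_lane):
            idx = lane * per_lane + j
            result[lane] += src1[idx] * src2[idx]
    return result
```

With a 16-bit lane and 4-bit second-source elements, four products feed each lane: `asymmetric_fma([10, 0], [1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1, 2, 2, 2, 2])` returns `[20, 52]`.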