Patent classifications
G06F7/487
Method and apparatus for permuting streamed data elements
A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.
Method and apparatus for implied bit handling in floating point multiplication
A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
Method and apparatus for implied bit handling in floating point multiplication
A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
Computing accelerator using a lookup table
A computing accelerator using a lookup table. The accelerator may accelerate floating point multiplications by retrieving the fraction portion of the product of two floating-point operands from a lookup table, or by retrieving the product of two floating-point operands of two floating-point operands from a lookup table, or it may retrieve dot products of floating point vectors from a lookup table. The accelerator may be implemented in a three-dimensional memory assembly. It may use approximation, the symmetry of a multiplication lookup table, and zero-skipping to improve performance.
SEMICONDUCTOR DEVICE
When the conversion arithmetic of the numerical type of floating-point data and integer data is performed by software, the load of the CPU becomes heavy. A semiconductor device includes a memory, a bus coupled to the memory, a bus master coupled to the bus, and a conversion arithmetic circuit coupled to the bus. The conversion arithmetic circuit includes a floating-point data adder-subtracter, an integer data adder-subtracter, and a shift operator. The semiconductor device converts the floating-point data to the integer data or converts the integer data to the floating-point data, without employing a multiplier and a divider of the floating-point data.
SEMICONDUCTOR DEVICE
When the conversion arithmetic of the numerical type of floating-point data and integer data is performed by software, the load of the CPU becomes heavy. A semiconductor device includes a memory, a bus coupled to the memory, a bus master coupled to the bus, and a conversion arithmetic circuit coupled to the bus. The conversion arithmetic circuit includes a floating-point data adder-subtracter, an integer data adder-subtracter, and a shift operator. The semiconductor device converts the floating-point data to the integer data or converts the integer data to the floating-point data, without employing a multiplier and a divider of the floating-point data.
ARITHMETIC UNITS AND RELATED CONVERTERS
Devices for adding floating point numbers, devices for multiplying floating point numbers, devices for floating-point fused multiply-add operations, devices for performing fixed point number operations, and associated converters thereof. A preprocessed fixed point format is a fixed point format wherein the LSD of all numbers exactly represented in said format is equal to B/2 (i.e. one for binary radix), and the rest are rounded to one of these numbers. A preprocessed floating point format is a floating point format wherein the significand is a preprocessed fixed point number.
Multiple mode arithmetic circuit
A tile of an FPGA includes a multiple mode arithmetic circuit. The multiple mode arithmetic circuit is configured by control signals to operate in an integer mode, a floating-point mode, or both. In some example embodiments, multiple integer modes (e.g., unsigned, two's complement, and sign-magnitude) are selectable, multiple floating-point modes (e.g., 16-bit mantissa and 8-bit sign, 8-bit mantissa and 6-bit sign, and 6-bit mantissa and 6-bit sign) are supported, or any suitable combination thereof. The tile may also fuse a memory circuit with the arithmetic circuits. Connections directly between multiple instances of the tile are also available, allowing multiple tiles to be treated as larger memories or arithmetic circuits. By using these connections, referred to as cascade inputs and outputs, the input and output bandwidth of the arithmetic circuit is further increased.
Scaling half-precision floating point tensors for training deep neural networks
One embodiment provides for a machine-learning accelerator device a multiprocessor to execute parallel threads of an instruction stream, the multiprocessor including a compute unit, the compute unit including a set of functional units, each functional unit to execute at least one of the parallel threads of the instruction stream. The compute unit includes compute logic configured to execute a single instruction to scale an input tensor associated with a layer of a neural network according to a scale factor, the input tensor stored in a floating-point data type, the compute logic to scale the input tensor to enable a data distribution of data of the input tensor to be represented by a 16-bit floating point data type.
Scaling half-precision floating point tensors for training deep neural networks
One embodiment provides for a machine-learning accelerator device a multiprocessor to execute parallel threads of an instruction stream, the multiprocessor including a compute unit, the compute unit including a set of functional units, each functional unit to execute at least one of the parallel threads of the instruction stream. The compute unit includes compute logic configured to execute a single instruction to scale an input tensor associated with a layer of a neural network according to a scale factor, the input tensor stored in a floating-point data type, the compute logic to scale the input tensor to enable a data distribution of data of the input tensor to be represented by a 16-bit floating point data type.