Patent classifications
G06F7/483
COMPUTER PROCESSOR FOR HIGHER PRECISION COMPUTATIONS USING A MIXED-PRECISION DECOMPOSITION OF OPERATIONS
Embodiments detailed herein relate to arithmetic operations of float-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands, values of which being in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among lower precision values converted from values for the plurality of the operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.
METHOD AND SYSTEM FOR COMPRESSING APPLICATION DATA FOR OPERATIONS ON MULTI-CORE SYSTEMS
A system and method to compress application control data, such as weights for a layer of a convolutional neural network, is disclosed. A multi-core system for executing at least one layer of the convolutional neural network includes a storage device storing a compressed weight matrix of a set of weights of the at least one layer of the convolutional network and a decompression matrix. The compressed weight matrix is formed by matrix factorization and quantization of a floating point value of each weight to a floating point format. A decompression module is operable to obtain an approximation of the weight values by decompressing the compressed weight matrix through the decompression matrix. A plurality of cores executes the at least one layer of the convolutional neural network with the approximation of weight values to produce an inference output.
FIXED-POINT MULTIPLICATION FOR NETWORK QUANTIZATION
Techniques for training a neural network having a plurality of computational layers with associated weights and activations for computational layers in fixed-point formats include determining an optimal fractional length for weights and activations for the computational layers; training a learned clipping-level with fixed-point quantization using a PACT process for the computational layers; and quantizing on effective weights that fuses a weight of a convolution layer with a weight and running variance from a batch normalization layer. A fractional length for weights of the computational layers is determined from current values of weights using the determined optimal fractional length for the weights of the computational layers. A fixed-point activation between adjacent computational layers is related using PACT quantization of the clipping-level and an activation fractional length from a node in a following computational layer. The resulting fixed-point weights and activation values are stored as a compressed representation of the neural network.
Fused Multiply-Add operator for mixed precision floating-point numbers with correct rounding
A fused multiply-add hardware operator comprising a multiplier receiving two multiplicands as floating-point numbers encoded in a first precision format; an alignment circuit associated with the multiplier configured to convert the result of the multiplication into a first fixed-point number; and an adder configured to add the first fixed-point number and an addition operand. The addition operand is a floating-point number encoded in a second precision format, and the operator comprises an alignment circuit associated with the addition operand, configured to convert the addition operand into a second fixed-point number of reduced dynamic range relative to the dynamic range of the addition operand, having a number of bits equal to the number of bits of the first fixed-point number, extended on both sides by at least the size of the mantissa of the addition operand; the adder configured to add the first and second fixed-point numbers without loss.
Fused Multiply-Add operator for mixed precision floating-point numbers with correct rounding
A fused multiply-add hardware operator comprising a multiplier receiving two multiplicands as floating-point numbers encoded in a first precision format; an alignment circuit associated with the multiplier configured to convert the result of the multiplication into a first fixed-point number; and an adder configured to add the first fixed-point number and an addition operand. The addition operand is a floating-point number encoded in a second precision format, and the operator comprises an alignment circuit associated with the addition operand, configured to convert the addition operand into a second fixed-point number of reduced dynamic range relative to the dynamic range of the addition operand, having a number of bits equal to the number of bits of the first fixed-point number, extended on both sides by at least the size of the mantissa of the addition operand; the adder configured to add the first and second fixed-point numbers without loss.
FLOATING POINT FUSED MULTIPLY ADD WITH MERGED 2'S COMPLEMENT AND ROUNDING
A method includes receiving an unrounded mantissa value and a round bit associated with the unrounded mantissa value. The method also includes receiving a 2's complement signal that indicates whether the unrounded mantissa value results from a 1's complement operation. The method includes incrementing the unrounded mantissa value to provide an incremented value. The unrounded mantissa value is a non-incremented value. The method further includes providing one of the incremented value or non-incremented value as a rounded mantissa value responsive to the 2's complement signal.
MULTI-PRECISION ARITHMETIC RIGHT SHIFT
A method includes receiving, by each of an upper shift circuit and a lower shift circuit, an operand for an arithmetic right shift operation. The upper shift circuit is configured to provide an upper output, the lower shift circuit is configured to provide a lower output, and the upper output concatenated with the lower output is a result of the arithmetic right shift operation. The method also includes receiving a shift value for the arithmetic right shift operation; responsive to the shift value, detecting a shift condition in which a portion of, but not all of, the operand could be shifted into bits corresponding to the lower output; and responsive to detecting the shift condition, providing, by a middle shift circuit, at least a portion of the operand to the lower shift circuit as a selectable input.
Methods to compress range doppler map (RDM) values from floating point to decibels (dB)
Embodiments of a telemetry device and methods to convert a binary floating point number to a compressed number is described herein. The binary floating point number may comprise a mantissa and an exponent. The telemetry device may determine a first number based on a product of the exponent and a constant, wherein the constant may be proportional to a logarithm of the number two. The telemetry device may determine a second number using one or more bits of the mantissa as an index into a predetermined lookup table. Values of the lookup table may be proportional to logarithms of candidate mantissa values. The telemetry device may determine the compressed number based on rounding of a sum. The sum may include the first and second numbers. The rounding may be based on a predetermined step size.
Method and apparatus for generating fixed point neural network
A method and apparatus for generating a fixed point neural network are provided. The method includes selecting at least one layer of a neural network as an object layer, wherein the neural network includes a plurality of layers, each of the plurality of layers corresponding to a respective one of plurality of quantization parameters; forming a candidate parameter set including candidate parameter values with respect to a quantization parameter of the plurality of quantization parameters corresponding to the object layer; determining an update parameter value from among the candidate parameter values based on levels of network performance of the neural network, wherein each of the levels of network performance correspond to a respective one of the candidate parameter values; and updating the quantization parameter with respect to the object layer based on the update parameter value.
Method and apparatus for generating fixed point neural network
A method and apparatus for generating a fixed point neural network are provided. The method includes selecting at least one layer of a neural network as an object layer, wherein the neural network includes a plurality of layers, each of the plurality of layers corresponding to a respective one of plurality of quantization parameters; forming a candidate parameter set including candidate parameter values with respect to a quantization parameter of the plurality of quantization parameters corresponding to the object layer; determining an update parameter value from among the candidate parameter values based on levels of network performance of the neural network, wherein each of the levels of network performance correspond to a respective one of the candidate parameter values; and updating the quantization parameter with respect to the object layer based on the update parameter value.