
G06F7/49947

Method and apparatus with neural network parameter quantization

11593625 · 2023-02-28 ·

Samsung Electronics Co., Ltd.

Provided is a processor-implemented method that includes performing training or an inference operation with a neural network by: obtaining a parameter for the neural network in a floating-point format; applying a fractional length of a fixed-point format to the parameter in the floating-point format; performing an operation with an integer arithmetic logic unit (ALU) to determine whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process; and quantizing the parameter in the floating-point format to a parameter in the fixed-point format, based on a result of the operation with the ALU.
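The rounding rule in this abstract can be illustrated with a short sketch. It assumes IEEE 754 single-precision input and uses only integer shifts and adds after the bit fields are extracted; the function name `quantize_to_fixed` and the sign-magnitude handling of negative values are illustrative assumptions, not details taken from the patent.

```python
import struct


def quantize_to_fixed(x: float, frac_len: int) -> int:
    """Quantize a float32 value to an integer with `frac_len` fractional bits.

    Hypothetical helper: the rounding decision uses only the most
    significant bit of the bits that are shifted out.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    frac_field = bits & 0x7FFFFF
    if exp_field == 0xFF:
        raise ValueError("infinity and NaN have no fixed-point value")
    if exp_field == 0:
        # Zero or subnormal: no implicit leading one.
        significand, exp = frac_field, -126 - 23
    else:
        significand, exp = frac_field | (1 << 23), exp_field - 127 - 23
    # The value is significand * 2**exp; scaling by 2**frac_len is a shift.
    shift = exp + frac_len
    if shift >= 0:
        magnitude = significand << shift
    else:
        drop = -shift
        magnitude = significand >> drop
        # Most significant bit among the discarded bits decides the round-up.
        round_bit = (significand >> (drop - 1)) & 1
        magnitude += round_bit
    return -magnitude if sign else magnitude
```

With four fractional bits, 0.3 scales to 4.8, the first discarded bit is 1, and the sketch rounds the magnitude up to 5.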

APPARATUS AND METHOD FOR VECTOR PACKED DUAL COMPLEX-BY-COMPLEX AND DUAL COMPLEX-BY-COMPLEX CONJUGATE MULTIPLICATION

20230004390 · 2023-01-05 ·

Intel Corporation

An apparatus and method for multiplying packed real and imaginary components of complex numbers and complex conjugates. For example, one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed real and imaginary data elements; a second source register to store a second plurality of packed real and imaginary data elements; and execution circuitry to execute the decoded instruction. The execution circuitry includes multiplier circuitry to multiply select real and imaginary data elements in the first and second source registers to generate a plurality of real and imaginary products; adder circuitry to add/subtract various real and imaginary products, scale the results according to an immediate of the instruction, and round the scaled results; and saturation circuitry to saturate the rounded results.
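The multiply, add/subtract, scale, round, and saturate sequence can be sketched for a single pair of complex values. Everything specific here is an assumption made for illustration: 16-bit signed elements, the immediate interpreted as a right-shift count, and round-half-up as the rounding rule. The names `cmul_scaled`, `round_shift`, and `saturate16` are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(v: int) -> int:
    """Clamp to the signed 16-bit range (assumed element width)."""
    return max(INT16_MIN, min(INT16_MAX, v))


def round_shift(v: int, shift: int) -> int:
    """Arithmetic right shift by `shift` bits, rounding half up."""
    if shift == 0:
        return v
    return (v + (1 << (shift - 1))) >> shift


def cmul_scaled(a, b, shift, conjugate=False):
    """Multiply complex a by b (or by conj(b)), then scale, round, saturate.

    a and b are (real, imag) integer pairs; `shift` stands in for the
    instruction immediate.
    """
    ar, ai = a
    br, bi = b
    if conjugate:
        bi = -bi
    re = ar * br - ai * bi  # real part: difference of two products
    im = ar * bi + ai * br  # imaginary part: sum of two products
    return (saturate16(round_shift(re, shift)),
            saturate16(round_shift(im, shift)))
```

A "dual" packed form would apply the same function to two such pairs held in each source register.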

STOCHASTIC ROUNDING FOR NEURAL PROCESSOR CIRCUIT

20230236799 · 2023-07-27 ·

Kenneth W. Waters

Embodiments relate to a neural processor circuit that includes a neural engine and a post-processing circuit. The neural engine performs a computational task related to a neural network to generate a processed value. The post-processing circuit includes a random bit generator, an adder circuit and a rounding circuit. The random bit generator generates a random string of bits. The adder circuit adds the random string of bits to a version of the processed value to generate an added value. The rounding circuit truncates the added value to generate an output value of the computational task. The random bit generator may include a linear-feedback shift register (LFSR) that generates random numbers based on a seed. The seed may be derived from a master seed that is specific to a task of the neural network.
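The add-random-bits-then-truncate scheme can be sketched as follows. The 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11 is a common textbook configuration chosen here as an assumption; the abstract does not specify the LFSR width or polynomial. The names `Lfsr16` and `stochastic_round` are hypothetical, and deriving the seed from a master seed is not shown.

```python
class Lfsr16:
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11), an assumed configuration."""

    def __init__(self, seed: int):
        if seed & 0xFFFF == 0:
            raise ValueError("an all-zero state would lock up the LFSR")
        self.state = seed & 0xFFFF

    def next_bits(self, n: int) -> int:
        """Return the next n pseudo-random bits as an integer."""
        out = 0
        for _ in range(n):
            s = self.state
            bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
            self.state = (s >> 1) | (bit << 15)
            out = (out << 1) | (self.state & 1)
        return out


def stochastic_round(value: int, drop_bits: int, rng: Lfsr16) -> int:
    """Add a random string of drop_bits bits, then truncate those bits."""
    if drop_bits == 0:
        return value
    noise = rng.next_bits(drop_bits)
    return (value + noise) >> drop_bits
```

A value whose discarded bits are all zero is never rounded up, while a value halfway between two outputs lands on either neighbour depending on the random bits.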

MULTIPLICATION BY A RATIONAL IN HARDWARE WITH SELECTABLE ROUNDING MODE

20230229397 · 2023-07-20 ·

Thomas Rose

A fixed logic circuit for performing multiplication of an input x by a constant rational p/q so as to calculate an output y according to a directed rounding or round-to-nearest rounding mode. Fixed logic hardware is derived comprising an addition array configured to operate on canonical signed digit (CSD) forms of binary values (a CSD array) so as to form an approximation of a multiplication of an input x [m−1:0] by a rational p/q. A truncated summation array of a finite sequence of most significant bits of an infinite CSD expansion of the rational p/q operating on the bits of the input x satisfies

$\Delta_{\text{high}} - \Delta_{\text{low}} < \frac{1}{q}.$

Registers define a plurality of corrective constants for a respective plurality of rounding modes, and selection logic selects the corrective constant that corresponds to the rounding mode in which the truncated summation array is to operate.

High-precision anchored-implicit processing

11704092 · 2023-07-18 ·

Arm Limited

An apparatus includes a processing circuit and a storage device. The processing circuit is configured to perform one or more processing operations in response to one or more instructions to generate an anchored-data element. The storage device is configured to store the anchored-data element. A format of the anchored-data element includes an identification item, an overlap item, and a data item. The data item is configured to hold a data value of the anchored-data element. The identification item indicates an anchor value for the data value or one or more special values.

Prepare for shorter precision (round for reround) mode in a decimal floating-point instruction

11698772 · 2023-07-11 ·

International Business Machines Corporation

An instruction is executed in round-for-reround mode wherein the permissible resultant value that is closest to and no greater in magnitude than the infinitely precise result is selected. If the selected value is not exact and the units digit of the selected value is either 0 or 5, then the digit is incremented by one and the selected value is delivered. In all other cases, the selected value is delivered.
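The rule stated in this abstract can be sketched with Python's `decimal` module. The helper name `round_for_reround` and the choice to pass the target quantum as a positive `Decimal` are assumptions made for illustration. Python's built-in `ROUND_05UP` mode follows the same rule and is used as a cross-check.

```python
from decimal import Decimal, ROUND_05UP, ROUND_DOWN, ROUND_HALF_EVEN


def round_for_reround(x: Decimal, quantum: Decimal) -> Decimal:
    """Round x to `quantum` so that a later, shorter rounding stays correct.

    `quantum` is a positive Decimal such as Decimal("0.001").
    """
    # Closest permissible value that is no greater in magnitude than x.
    truncated = x.quantize(quantum, rounding=ROUND_DOWN)
    if truncated == x:
        return truncated  # exact result: deliver as is
    last_digit = truncated.as_tuple().digits[-1]
    if last_digit in (0, 5):
        # Inexact and units digit is 0 or 5: increment that digit by one.
        return truncated + quantum.copy_sign(x)
    return truncated


# Usage: keep three places first, then reround to two places.
wide = round_for_reround(Decimal("1.2451"), Decimal("0.001"))
narrow = wide.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
builtin = Decimal("1.2451").quantize(Decimal("0.001"), rounding=ROUND_05UP)
```

Here `wide` is 1.246 rather than 1.245, so the second rounding yields 1.25, which is what rounding 1.2451 directly to two places gives. Ordinary round-half-even to three places would produce 1.245 and then 1.24.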

FIXED-POINT MULTIPLICATION FOR NETWORK QUANTIZATION

20230214639 · 2023-07-06 ·

Techniques for training a neural network having a plurality of computational layers, with associated weights and activations for the computational layers in fixed-point formats, include determining an optimal fractional length for weights and activations for the computational layers; training a learned clipping-level with fixed-point quantization using a PACT process for the computational layers; and quantizing on effective weights that fuse a weight of a convolution layer with a weight and running variance from a batch normalization layer. A fractional length for weights of the computational layers is determined from current values of weights using the determined optimal fractional length for the weights of the computational layers. A fixed-point activation between adjacent computational layers is related using PACT quantization of the clipping-level and an activation fractional length from a node in a following computational layer. The resulting fixed-point weights and activation values are stored as a compressed representation of the neural network.
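A scalar sketch may help relate the pieces named in this abstract. It is not the patented procedure: the fractional-length heuristic (use whatever bits remain after covering the largest magnitude), the 8-bit word size, and the function names `fuse_bn`, `fractional_length`, `to_fixed`, and `pact_quantize` are all assumptions made for illustration.

```python
import math


def fuse_bn(weight: float, gamma: float, running_var: float,
            eps: float = 1e-5) -> float:
    """Effective weight: conv weight fused with the BN scale and variance."""
    return weight * gamma / math.sqrt(running_var + eps)


def fractional_length(values, word_bits: int = 8) -> int:
    """Assumed heuristic: bits left over after the sign and integer part."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return word_bits - 1
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return word_bits - 1 - int_bits


def to_fixed(v: float, frac_len: int, word_bits: int = 8) -> int:
    """Round to the nearest multiple of 2**-frac_len and saturate."""
    q = math.floor(v * 2 ** frac_len + 0.5)
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, q))


def pact_quantize(x: float, alpha: float, frac_len: int) -> int:
    """PACT-style activation: clip to [0, alpha], then quantize."""
    clipped = min(max(x, 0.0), alpha)
    return math.floor(clipped * 2 ** frac_len + 0.5)
```

For weights 0.5 and -0.25 in an 8-bit word, this heuristic picks a fractional length of 7, giving the integers 64 and -32.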

Fused Multiply-Add operator for mixed precision floating-point numbers with correct rounding

11550544 · 2023-01-10 ·

Kalray

Nicolas Brunie

A fused multiply-add hardware operator comprising a multiplier receiving two multiplicands as floating-point numbers encoded in a first precision format; an alignment circuit associated with the multiplier configured to convert the result of the multiplication into a first fixed-point number; and an adder configured to add the first fixed-point number and an addition operand. The addition operand is a floating-point number encoded in a second precision format, and the operator comprises an alignment circuit associated with the addition operand, configured to convert the addition operand into a second fixed-point number of reduced dynamic range relative to the dynamic range of the addition operand, having a number of bits equal to the number of bits of the first fixed-point number, extended on both sides by at least the size of the mantissa of the addition operand; the adder configured to add the first and second fixed-point numbers without loss.

FLOATING POINT FUSED MULTIPLY ADD WITH MERGED 2'S COMPLEMENT AND ROUNDING

20230214179 · 2023-07-06 ·

Garrett Joseph LIES

A method includes receiving an unrounded mantissa value and a round bit associated with the unrounded mantissa value. The method also includes receiving a 2's complement signal that indicates whether the unrounded mantissa value results from a 1's complement operation. The method includes incrementing the unrounded mantissa value to provide an incremented value. The unrounded mantissa value is a non-incremented value. The method further includes providing one of the incremented value or non-incremented value as a rounded mantissa value responsive to the 2's complement signal.

Methods to compress range doppler map (RDM) values from floating point to decibels (dB)

11552650 · 2023-01-10 ·

Raytheon Company

Embodiments of a telemetry device and methods to convert a binary floating-point number to a compressed number are described herein. The binary floating-point number may comprise a mantissa and an exponent. The telemetry device may determine a first number based on a product of the exponent and a constant, wherein the constant may be proportional to a logarithm of the number two. The telemetry device may determine a second number using one or more bits of the mantissa as an index into a predetermined lookup table. Values of the lookup table may be proportional to logarithms of candidate mantissa values. The telemetry device may determine the compressed number based on rounding of a sum. The sum may include the first and second numbers. The rounding may be based on a predetermined step size.
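The exponent-times-constant plus lookup-table approach can be sketched as below. Several specifics are assumptions rather than details from the abstract: single-precision input, a power-style decibel scale (10·log10), a 16-entry table indexed by the top four mantissa bits and evaluated at bin midpoints, and a 0.5 dB step. The names `compress_to_db`, `LUT`, and `DB_PER_OCTAVE` are hypothetical.

```python
import math
import struct

LUT_BITS = 4                          # mantissa bits used as the table index
DB_PER_OCTAVE = 10 * math.log10(2)    # constant proportional to log(2)
# Table entries: dB value of candidate mantissas in [1, 2), at bin midpoints.
LUT = [10 * math.log10(1 + (i + 0.5) / (1 << LUT_BITS))
       for i in range(1 << LUT_BITS)]


def compress_to_db(x: float, step_db: float = 0.5) -> int:
    """Map a positive, normal float32 to an integer count of dB steps."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp_field = (bits >> 23) & 0xFF
    if bits >> 31 or exp_field in (0, 0xFF):
        raise ValueError("expects a positive, normal float32 value")
    exponent = exp_field - 127
    index = (bits >> (23 - LUT_BITS)) & ((1 << LUT_BITS) - 1)
    first = exponent * DB_PER_OCTAVE   # contribution of the exponent
    second = LUT[index]                # contribution of the mantissa
    return math.floor((first + second) / step_db + 0.5)
```

Multiplying the returned code by `step_db` recovers an approximate decibel value. With these table and step sizes the error stays below about 0.4 dB.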
