G06F7/49957

Method and apparatus with neural network parameter quantization

Provided is a processor-implemented method that includes performing training or an inference operation with a neural network by: obtaining a parameter for the neural network in a floating-point format; applying a fractional length of a fixed-point format to the parameter in the floating-point format; performing an operation with an integer arithmetic logic unit (ALU) to determine whether to round off a fixed point, based on the most significant bit among the bit values to be discarded during quantization; and quantizing the parameter from the floating-point format to the fixed-point format based on a result of the ALU operation.
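The round-off decision described in this abstract can be sketched in Python; the function name, bit widths, and saturation behavior are illustrative assumptions, not details taken from the patent:

```python
import math

def quantize_to_fixed(x: float, frac_len: int, bits: int = 8) -> int:
    """Quantize a float to a signed fixed-point integer with frac_len
    fractional bits, using only integer operations after the scale."""
    # Keep one extra bit below the target precision: it is the most
    # significant of the bits that will be discarded.
    scaled = math.floor(x * (1 << (frac_len + 1)))
    round_bit = scaled & 1            # MSB of the discarded bits
    q = (scaled >> 1) + round_bit     # drop the bit, round off if it was 1
    # Saturate to the representable range of a `bits`-wide signed value.
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, q))
```

For example, with a fractional length of 4, the value 0.3 scales to 4.8 and rounds up to the mantissa 5 because the most significant discarded bit is 1.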

Prepare for shorter precision (round for reround) mode in a decimal floating-point instruction

An instruction is executed in round-for-reround mode wherein the permissible resultant value that is closest to and no greater in magnitude than the infinitely precise result is selected. If the selected value is not exact and the units digit of the selected value is either 0 or 5, then the digit is incremented by one and the selected value is delivered. In all other cases, the selected value is delivered.
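A minimal sketch of this selection rule, assuming the truncated-toward-zero coefficient and the discarded decimal digits are already available as shown (the names and the string representation of discarded digits are illustrative):

```python
def round_for_reround(coeff: int, discarded_digits: str) -> int:
    """Sketch of round-for-reround on a decimal coefficient magnitude.

    coeff is the coefficient truncated toward zero (closest value no
    greater in magnitude than the exact result); discarded_digits holds
    the decimal digits that were cut off. If any discarded digit is
    nonzero (the result is inexact) and the units digit of coeff is 0
    or 5, the units digit is incremented so that a later re-rounding
    step can still detect the inexactness.
    """
    inexact = any(d != "0" for d in discarded_digits)
    if inexact and coeff % 10 in (0, 5):
        coeff += 1        # 0 -> 1 or 5 -> 6; no carry can occur
    return coeff
```

Because the units digit is only ever 0 or 5 when it is bumped, the increment never propagates a carry into higher digits.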

FLOATING POINT FUSED MULTIPLY ADD WITH MERGED 2'S COMPLEMENT AND ROUNDING
20230214179 · 2023-07-06 ·

A method includes receiving an unrounded mantissa value and a round bit associated with the unrounded mantissa value. The method also includes receiving a 2's complement signal that indicates whether the unrounded mantissa value results from a 1's complement operation. The method includes incrementing the unrounded mantissa value to provide an incremented value. The unrounded mantissa value is a non-incremented value. The method further includes providing one of the incremented value or non-incremented value as a rounded mantissa value responsive to the 2's complement signal.

MULTIPLIER FOR FLOATING-POINT OPERATION, METHOD, INTEGRATED CIRCUIT CHIP, AND CALCULATION DEVICE
20230076931 · 2023-03-09 ·

The present disclosure relates to a multiplier, a method, an integrated circuit chip, and a computation apparatus for floating-point computation. The computation apparatus may be included in a combined processing apparatus, which may also include a general interconnection interface and other processing apparatuses. The computation apparatus interacts with the other processing apparatuses to jointly complete computation operations specified by the user. The combined processing apparatus may also include a storage apparatus, which is connected to the computation apparatus and the other processing apparatuses respectively, and is used for storing their data. Solutions of the present disclosure may be widely used in various floating-point data computations.

Floating point instruction with selectable comparison attributes

An instruction to perform a comparison of a first value and a second value is executed. Based on a control of the instruction, a compare function to be performed is determined. The compare function is one of a plurality of compare functions configured for the instruction, and the compare function has a plurality of options for comparison. A compare option based on the first value and the second value is selected from the plurality of options defined for the compare function, and used to compare the first value and the second value. A result of the comparison is then placed in a selected location, the result to be used in processing within a computing environment.

Temporally split fused multiply-accumulate operation
09778908 · 2017-10-03 ·

A microprocessor splits a fused multiply-accumulate operation of the form A*B+C into first and second multiply-accumulate sub-operations to be performed by a multiplier and an adder. The first sub-operation at least multiplies A and B, and conditionally also accumulates C to the partial products of A and B to generate an unrounded nonredundant sum. The unrounded nonredundant sum is stored in memory shared by the multiplier and adder for an indefinite time period, enabling the multiplier and adder to perform other operations unrelated to the multiply-accumulate operation. The second sub-operation conditionally accumulates C to the unrounded nonredundant sum if C is not already incorporated into the value, and then generates a final rounded result.
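The temporal split can be illustrated with exact rational arithmetic standing in for the unrounded nonredundant sum; the function below is a behavioral sketch under that assumption, not the hardware datapath:

```python
from fractions import Fraction

def fma_split(a: float, b: float, c: float, accumulate_c_early: bool) -> float:
    """Behavioral sketch of a temporally split fused multiply-accumulate
    A*B + C. The first sub-operation multiplies exactly and may fold in
    C; the unrounded sum is then parked (here, just a local value) while
    other work could proceed. The second sub-operation adds C only if it
    was deferred, then rounds exactly once at the end."""
    # First sub-operation: exact product, optional early accumulation of C.
    unrounded = Fraction(a) * Fraction(b)
    if accumulate_c_early:
        unrounded += Fraction(c)
    # ... shared storage would hold `unrounded` for an indefinite period ...
    # Second sub-operation: deferred accumulation, then the single rounding.
    if not accumulate_c_early:
        unrounded += Fraction(c)
    return float(unrounded)   # one final rounding, as in a true fused FMA
```

The single final rounding is what distinguishes the fused result from a separate multiply then add, which rounds twice: `fma_split(0.1, 10.0, -1.0, False)` is a small nonzero residual, whereas `0.1 * 10.0 - 1.0` is exactly `0.0` in double precision.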

Deep neural network architecture using piecewise linear approximation

A log circuit for piecewise linear approximation is disclosed. The log circuit identifies an input associated with a logarithm operation to be performed using piecewise linear approximation. The log circuit then identifies a range that the input falls within from various ranges associated with piecewise linear approximation (PLA) equations for the logarithm operation, where the identified range corresponds to one of the PLA equations. The log circuit computes a result of the corresponding PLA equation based on the respective operands of the equation. The log circuit then returns an output associated with the logarithm operation, which is based at least partially on the result of the PLA equation.
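The range-selection structure can be sketched in Python for a base-2 logarithm on [1, 2); the segment count and chord-based coefficients are illustrative assumptions, not the patent's actual equations:

```python
import math

def make_segments(n: int = 4):
    """Precompute (lo, hi, slope, intercept) for n chords of log2 on [1, 2)."""
    segs = []
    xs = [1.0 + i / n for i in range(n + 1)]
    for lo, hi in zip(xs, xs[1:]):
        slope = (math.log2(hi) - math.log2(lo)) / (hi - lo)
        segs.append((lo, hi, slope, math.log2(lo) - slope * lo))
    return segs

SEGMENTS = make_segments()

def pla_log2(x: float) -> float:
    """Approximate log2(x) for x in [1, 2): identify the range the input
    falls within, then evaluate that range's PLA equation (one multiply
    and one add)."""
    for lo, hi, slope, intercept in SEGMENTS:
        if lo <= x < hi:
            return slope * x + intercept
    raise ValueError("input outside the supported range [1, 2)")
```

Restricting the input to [1, 2) is the usual normalization for floating-point mantissas; the exponent contributes the integer part of the logarithm separately. With four chords the worst-case error on this interval is on the order of 0.01.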

DECIMAL FLOATING-POINT ROUND-FOR-REROUND INSTRUCTION
20230342112 · 2023-10-26 ·

A decimal floating-point instruction is executed in a round-for-reround mode. The decimal floating-point instruction is configured to perform a decimal floating-point operation on a decimal floating-point operand. The executing includes forming, based on performing the decimal floating-point operation, an intermediate result having a high-order portion and a low-order portion. The high-order portion has a least significant digit. A rounded-for-reround number is created from the intermediate result. The rounded-for-reround number includes the high-order portion of the intermediate result; based on the least significant digit of the high-order portion being a selected value and the low-order portion having another selected value, the least significant digit of the rounded-for-reround number is incremented. The rounded-for-reround number is stored.

METHOD AND SYSTEM FOR TRAINING MACHINE LEARNING MODELS USING DYNAMIC FIXED-POINT DATA REPRESENTATIONS

Systems and methods for training a machine learning model. The methods comprise receiving a plurality of first data points, each of the first data points being represented in a floating-point representation. The methods further comprise converting the plurality of first data points into a corresponding plurality of second data points, each represented in a dynamic fixed-point representation. The plurality of second data points may include: for each second data point, the sign component of the corresponding first data point; for each second data point, a dynamic fixed-point mantissa component; and one or more shared fraction components, where at least two of the second data points share a value of a shared fraction component. The methods further comprise performing integer computations during training of the machine learning model using the second data points.
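One plausible realization of a shared fraction component is a block-wise fixed-point encoding, where a group of values shares a single fractional length chosen from the largest magnitude in the group. The sketch below assumes that scheme; the function names and 8-bit mantissa width are illustrative:

```python
import math

def to_dynamic_fixed_point(values, mantissa_bits: int = 8):
    """Convert a group of floats to signed integer mantissas plus one
    shared fractional length (the shared fraction component)."""
    # Pick the shared fraction length from the largest magnitude so that
    # every mantissa fits in `mantissa_bits` signed bits.
    max_mag = max(abs(v) for v in values)
    int_bits = max(1, math.ceil(math.log2(max_mag + 1)) + 1)  # +1 for sign
    frac_len = mantissa_bits - int_bits
    scale = 2.0 ** frac_len
    mantissas = [round(v * scale) for v in values]
    return mantissas, frac_len

def from_dynamic_fixed_point(mantissas, frac_len):
    """Reconstruct floats from mantissas and the shared fraction length."""
    return [m / 2.0 ** frac_len for m in mantissas]
```

Training kernels can then operate on the integer mantissas alone, applying the shared fractional length only when rescaling between groups.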

Floating-point unit stochastic rounding for accelerated deep learning

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements comprising a portion of a neural network accelerator performs flow-based computations on wavelets of data. Each processing element has a respective compute element and a respective routing element. Each compute element has a respective floating-point unit enabled to perform stochastic rounding, which in some circumstances reduces systematic bias in long dependency chains of floating-point computations. The long dependency chains of floating-point computations are performed, e.g., to train a neural network or to perform inference with a trained neural network.
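The bias-reduction property of stochastic rounding can be shown with a toy integer-rounding sketch (the hardware version operates on mantissa bits, but the principle is the same; the function name and seeding are illustrative):

```python
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x up with probability equal to its fractional part, so the
    expected value of the result equals x (zero rounding bias). By
    contrast, round-to-nearest always maps 0.25 to 0, so accumulating
    many such values loses their entire contribution."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)
```

Averaged over many applications, the rounded values recover the true mean, which is why long accumulation chains (as in gradient updates) do not drift systematically.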