G06F7/5525

Small multiplier after initial approximation for operations with increasing precision
11579844 · 2023-02-14

In an aspect, a processor includes circuitry for iterative refinement approaches, e.g., Newton-Raphson, to evaluating functions such as square root and reciprocal, and to performing division. The circuitry includes circuitry for producing an initial approximation, which can include a LookUp Table (LUT). The LUT may produce an output that (with implementation-dependent processing) forms an initial approximation of a value, with a number of bits of precision. A limited-precision multiplier multiplies that initial approximation with another value; an output of the limited-precision multiplier goes to a full-precision multiplier circuit that performs the remaining multiplications required for iteration(s) in the particular refinement process being implemented. For example, in division, the output being calculated is a reciprocal of the divisor. The full-precision multiplier circuit requires a first number of clock cycles to complete, and both the small multiplier and the initial approximation circuitry complete within the first number of clock cycles.
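The refinement flow described in this abstract can be sketched in software as follows. This is a minimal illustration and not the patented circuit: the 4-bit table size, the midpoint seeding, and the names `build_reciprocal_lut`, `reciprocal_nr`, and `divide` are assumptions made for the example, and ordinary floating-point multiplication stands in for both the limited-precision and the full-precision multiplier.

```python
LUT_BITS = 4  # assumed table index width; real designs choose this per precision target


def build_reciprocal_lut(bits=LUT_BITS):
    """Seed table for 1/d with d in [1, 2), indexed by the top fraction bits of d."""
    size = 1 << bits
    # Each entry holds the reciprocal of the midpoint of its sub-interval.
    return [1.0 / (1.0 + (i + 0.5) / size) for i in range(size)]


LUT = build_reciprocal_lut()


def reciprocal_nr(d, iterations=3):
    """Approximate 1/d for a normalized divisor d in [1, 2)."""
    if not 1.0 <= d < 2.0:
        raise ValueError("divisor must be normalized to [1, 2)")
    x = LUT[int((d - 1.0) * len(LUT))]  # initial approximation, a few bits precise
    for _ in range(iterations):
        # In hardware, the first d * x product only needs a small multiplier
        # because x carries few significant bits; later products need full width.
        x = x * (2.0 - d * x)  # Newton-Raphson step: error is squared each time
    return x


def divide(n, d):
    """Division as multiplication by the refined reciprocal (d in [1, 2))."""
    return n * reciprocal_nr(d)
```

Each Newton-Raphson step roughly doubles the number of correct bits, which is why a low-precision seed and a narrow first multiplication are sufficient at the start of the sequence.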

Use of a single instruction set architecture (ISA) instruction for vector normalization

Embodiments described herein are generally directed to an improved vector normalization instruction. An embodiment of a method includes, responsive to receipt by a GPU of a single instruction specifying a vector normalization operation to be performed on V vectors: (i) generating V squared length values, N at a time, by a first processing unit, by, for each N sets of inputs, each representing multiple component vectors for N of the vectors, performing N parallel dot product operations on the N sets of inputs; and (ii) generating V sets of outputs representing multiple normalized component vectors of the V vectors, N at a time, by a second processing unit, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implements a combination of a reciprocal square root function and a vector scaling function.
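The two-stage structure in this abstract (dot products first, then reciprocal square root combined with scaling) can be mirrored in a short software sketch. The function name `normalize_vectors` and the batch size default `n=4` are hypothetical, and the loops run sequentially where the described hardware would process N vectors in parallel.

```python
import math


def normalize_vectors(vectors, n=4):
    """Normalize V vectors, n at a time, in two stages per batch."""
    normalized = []
    for start in range(0, len(vectors), n):
        batch = vectors[start:start + n]
        # Stage 1 (first processing unit): one dot product per vector
        # yields its squared length.
        squared_lengths = [sum(c * c for c in v) for v in batch]
        # Stage 2 (second processing unit): reciprocal square root fused
        # with scaling of every component.
        for v, sq in zip(batch, squared_lengths):
            rsqrt = 1.0 / math.sqrt(sq)  # raises ZeroDivisionError for a zero vector
            normalized.append([c * rsqrt for c in v])
    return normalized
```

Working from the squared length avoids computing a square root followed by a separate division, since a single reciprocal square root and a multiply per component produce the same result.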

COMBINED DIVIDE/SQUARE ROOT PROCESSING CIRCUITRY AND METHOD
20230017462 · 2023-01-19

An apparatus comprises combined divide/square root processing circuitry to perform, in response to a divide instruction, a given radix-64 iteration of a radix-64 divide operation, and in response to a square root instruction, a given radix-64 iteration of a radix-64 square root operation; in which: the combined divide/square root processing circuitry comprises shared circuitry to generate at least one output value for the given radix-64 iteration on a same data path used for both the radix-64 divide operation and the radix-64 square root operation.

ON-THE-FLY CONVERSION
20230018056 · 2023-01-19

A data processing apparatus that converts a plurality of signed digits representing an input value in redundant representation comprises receiver circuitry to receive, at each of a plurality of iterations, a signed digit from the plurality of signed digits, and previous intermediate data from a previous iteration. Concatenation circuitry performs a concatenation of bits corresponding to the signed digit and bits of the previous intermediate data to produce updated intermediate data. Output circuitry provides the updated intermediate data as previous intermediate data of a next iteration. The previous intermediate data comprises S3[i] in non-redundant representation, which is at least part of the input value multiplied by 3 in non-redundant representation.
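For context, the textbook form of on-the-fly conversion turns signed digits into a conventional number using only concatenation, with no carry-propagating subtraction. The sketch below shows that general technique and does not model the S3[i] (value times 3) register described in the abstract. The name `on_the_fly_convert` and the use of Python integers are assumptions for illustration, and "concatenation" is written as `x * radix + digit`.

```python
def on_the_fly_convert(signed_digits, radix):
    """Convert most-significant-first signed digits to an integer.

    Keeps two registers: q (the value so far) and qm (the value so far
    minus one unit in the last place). Each step appends one non-negative
    digit to one of the two registers, so no borrow ever propagates.
    """
    q, qm = 0, -1
    for d in signed_digits:
        if abs(d) >= radix:
            raise ValueError("digit magnitude must be below the radix")
        if d > 0:
            new_q = q * radix + d
            new_qm = q * radix + (d - 1)
        elif d == 0:
            new_q = q * radix
            new_qm = qm * radix + (radix - 1)
        else:
            new_q = qm * radix + (radix + d)
            new_qm = qm * radix + (radix + d - 1)
        q, qm = new_q, new_qm
    return q
```

For example, the radix-4 digit string 1, -1, 0, -2 represents 64 - 16 + 0 - 2, and the function returns 46 for it.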

SQUARE ROOT PROCESSING CIRCUITRY AND METHOD
20230013054 · 2023-01-19

Square root processing circuitry performs a given radix-r iteration of a radix-r square root operation, by performing multiple radix-n sub-iterations in a same processing cycle, where n<r. The square root processing circuitry comprises, for a given radix-n sub-iteration: digit selection circuitry to select, based on a previous remainder estimate, a next radix-n result digit for a square root result; remainder update circuitry to adjust a previous remainder value to generate an updated remainder value; and remainder estimate circuitry to generate an updated remainder estimate indicative of an estimate of a portion of the updated remainder value. In a final radix-n sub-iteration of the given radix-r iteration, the remainder estimate circuitry generates the updated remainder estimate in parallel with the remainder update circuitry generating the updated remainder value.

SQUARE ROOT CALCULATIONS ON AN ASSOCIATIVE PROCESSING UNIT
20230221925 · 2023-07-13

A method for calculating a square root B having N bits of a number X having 2N bits includes iterating on bits b_i of square root B starting from the most significant bit until the least significant bit of square root B. For each iteration, the method includes locating a 1 at the squared location of bit b_i in a CHECK variable, determining the value of bit b_i from the result of a comparison of number X with a function of all previously found bits and a previous comparison outcome, shifting all previously found bits right 1 location in a CHECK variable, and adding the determined value of bit b_i into its squared location in the CHECK variable.
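The bit-serial procedure in this abstract resembles the classic shift-and-subtract integer square root, sketched below. This is the textbook algorithm and not the associative-processor implementation; the names `isqrt_bitwise`, `check`, and `remainder` are illustrative assumptions, and tracking a running remainder is one common way to realize the comparison against X.

```python
def isqrt_bitwise(x, n_bits):
    """Floor square root of x (at most 2 * n_bits bits), one result bit per step."""
    if not 0 <= x < (1 << (2 * n_bits)):
        raise ValueError("x must fit in 2 * n_bits bits")
    remainder = x
    check = 0                         # accumulates the found bits, pre-shifted
    bit = 1 << (2 * (n_bits - 1))     # "squared location" of the current result bit
    while bit:
        trial = check + bit           # previously found bits plus a 1 at the squared location
        if remainder >= trial:        # comparison decides the current bit
            remainder -= trial
            check = (check >> 1) + bit  # shift found bits right, add the new 1
        else:
            check >>= 1                 # current bit is 0; only shift
        bit >>= 2                     # move to the next squared location
    return check
```

Only shifts, additions, subtractions, and comparisons are needed, so no multiplier is involved. After the last step, `check` holds the N-bit root.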

SMALL MULTIPLIER AFTER INITIAL APPROXIMATION FOR OPERATIONS WITH INCREASING PRECISION
20230214186 · 2023-07-06

In an aspect, a processor includes circuitry for iterative refinement approaches, e.g., Newton-Raphson, to evaluating functions such as square root and reciprocal, and to performing division. The circuitry includes circuitry for producing an initial approximation, which can include a LookUp Table (LUT). The LUT may produce an output that (with implementation-dependent processing) forms an initial approximation of a value, with a number of bits of precision. A limited-precision multiplier multiplies that initial approximation with another value; an output of the limited-precision multiplier goes to a full-precision multiplier circuit that performs the remaining multiplications required for iteration(s) in the particular refinement process being implemented. For example, in division, the output being calculated is a reciprocal of the divisor. The full-precision multiplier circuit requires a first number of clock cycles to complete, and both the small multiplier and the initial approximation circuitry complete within the first number of clock cycles.

Modeling expectation mismatches

Embodiments determine mismatches in evaluations. Embodiments receive a first evaluation of an employee from a supervisor of the employee, the first evaluation including supervisor comment ratings and supervisor numerical ratings, each of the supervisor comment ratings and supervisor numerical ratings corresponding to an evaluation category. Embodiments receive a second evaluation of the employee from the employee, the second evaluation including employee comment ratings and employee numerical ratings, each of the employee comment ratings and employee numerical ratings corresponding to the evaluation category. Embodiments determine first sentiment polarity scores of the supervisor comment ratings and second sentiment polarity scores of the employee comment ratings. Embodiments determine polarity mismatch scores based on the first sentiment polarity scores and the second sentiment polarity scores and determine average differential ratings based on the supervisor numerical ratings and the employee numerical ratings. Embodiments combine the polarity mismatch scores and the average differential ratings.

BFLOAT16 SQUARE ROOT AND/OR RECIPROCAL SQUARE ROOT INSTRUCTIONS

Techniques for performing square root or reciprocal square root calculations on BF16 data elements in response to an instruction are described. An example of an instruction is one that includes fields for an opcode, an identification of a location of a packed data source operand, and an identification of a packed data destination operand, wherein the opcode is to indicate that execution circuitry is to perform, for each data element position of the packed data source operand, a calculation of a square root value of a BF16 data element in that position and store a result of each square root into a corresponding data element position of the packed data destination operand.
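The per-element behavior described here can be emulated in software by treating each BF16 value as the upper 16 bits of an IEEE 754 single-precision encoding. The sketch below is an assumption-laden illustration and not the instruction's defined semantics: the helper names are hypothetical, the square root is computed in double precision and then rounded twice (to float32, then to BF16 with round-to-nearest-even), and NaN and negative inputs are not handled specially.

```python
import math
import struct


def float_to_bf16_bits(value):
    """Round a Python float to BF16 and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16):
    """Expand a 16-bit BF16 pattern to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def packed_bf16_sqrt(source):
    """Square root of every element position of a packed BF16 operand.

    `source` is a list of 16-bit patterns; the result list has the same
    length, with each result stored at the corresponding position.
    math.sqrt raises ValueError for negative elements.
    """
    return [float_to_bf16_bits(math.sqrt(bf16_bits_to_float(b))) for b in source]
```

For instance, the patterns 0x4080 (4.0) and 0x3F80 (1.0) map to 0x4000 (2.0) and 0x3F80 (1.0) respectively.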

SYSTEMS AND METHODS FOR ACCELERATING THE COMPUTATION OF THE RECIPROCAL FUNCTION AND THE RECIPROCAL-SQUARE-ROOT FUNCTION

A field programmable gate array (FPGA) including a configurable interconnect fabric connecting a plurality of logic blocks configured to implement a reciprocal function data path including: a mantissa computation stage including a mantissa portion of the reciprocal function data path configured to: partition an M-bit mantissa component of an input floating-point value into L most-significant bits and M-L least significant bits; lookup a slope value and an offset value, based on the L most significant bits, from a reciprocal lookup table; and compute an output mantissa component of an output floating-point value by multiplying the slope value by the M-L least significant bits to compute a product and adding the offset value to the product; and an exponent computation stage configured to compute an output exponent component of the output floating-point value, the computing the output exponent component including negating an exponent component of the input floating-point value.
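The data path in this abstract amounts to piecewise-linear interpolation of the reciprocal over the mantissa, plus exponent negation. The following sketch shows that idea under stated assumptions: the widths `M = 23` and `L = 6`, the chord-based slope and offset values, and the names `build_table` and `approx_reciprocal` are choices made for this example and are not taken from the patent.

```python
import math

M = 23  # assumed mantissa width (single precision)
L = 6   # assumed number of most-significant mantissa bits used as table index


def build_table(l_bits=L):
    """(slope, offset) per segment of 1/m for m in [1, 2), using chords."""
    size = 1 << l_bits
    table = []
    for i in range(size):
        lo = 1.0 + i / size
        hi = 1.0 + (i + 1) / size
        offset = 1.0 / lo                              # value at the segment start
        slope = (1.0 / hi - 1.0 / lo) / (hi - lo)      # change per unit of mantissa
        table.append((slope, offset))
    return table


TABLE = build_table()


def approx_reciprocal(x):
    """Approximate 1/x for positive finite x with one lookup, one multiply, one add."""
    if not (x > 0.0 and math.isfinite(x)):
        raise ValueError("x must be positive and finite")
    frac, exp = math.frexp(x)                    # x = frac * 2**exp, frac in [0.5, 1)
    m, e = frac * 2.0, exp - 1                   # renormalize so m is in [1, 2)
    mant_bits = int((m - 1.0) * (1 << M))        # M-bit mantissa field (truncated)
    index = mant_bits >> (M - L)                 # L most-significant bits
    low = mant_bits & ((1 << (M - L)) - 1)       # M-L least-significant bits
    slope, offset = TABLE[index]
    recip_m = offset + slope * (low / (1 << M))  # mantissa stage: slope * low + offset
    return math.ldexp(recip_m, -e)               # exponent stage: negate the exponent
```

With a 64-entry table, the interpolation error of this sketch stays on the order of 1e-4 relative, and a larger L trades table size for accuracy.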