Patent classifications
G06F7/49915
Systems and methods to perform floating-point addition with selected rounding
Disclosed embodiments relate to performing floating-point addition with selected rounding. In one example, a processor includes circuitry to decode and execute an instruction specifying locations of first and second floating-point (FP) sources, and an opcode indicating the processor is to: bring the FP sources into alignment by shifting a mantissa of the smaller source FP operand to the right by a difference between their exponents, generating rounding controls based on any bits that escape; simultaneously generate a sum of the FP sources and of the FP sources plus one, the sums having a fuzzy-Jbit format having an additional Jbit into which a carry-out, if any, select one of the sums based on the rounding controls, and generate a result comprising a mantissa-wide number of most-significant bits of the selected sum, starting with the most significant non-zero Jbit.
METHOD AND APPARATUS FOR VECTOR PERMUTATION
A method is provided that includes performing, by a processor in response to a vector permutation instruction, permutation of values stored in lanes of a vector to generate a permuted vector, wherein the permutation is responsive to a control storage location storing permute control input for each lane of the permuted vector, wherein the permute control input corresponding to each lane of the permuted vector indicates a value to be stored in the lane of the permuted vector, wherein the permute control input for at least one lane of the permuted vector indicates a value of a selected lane of the vector is to be stored in the at least one lane, and storing the permuted vector in a storage location indicated by an operand of the vector permutation instruction.
Multiple modes for handling overflow conditions resulting from arithmetic operations
A method and apparatus for handling overflow conditions resulting from arithmetic operations involving floating point numbers. An indication is stored as part of a thread's context indicating one of two possible modes for handling overflow conditions. In a first mode, a result of an arithmetic operation is set to the limit representable in the floating point format. In a second mode, a result of an arithmetic operation is set to a NaN.
Converting floating point numbers to reduce the precision
A hardware module comprising at least one of: one or more field programmable gate arrays and one or more application specific integrated circuits configured to: receive a number in floating-point representation at a first precision level, the number comprising an exponent and a first mantissa; apply a first random number to the first mantissa to generate a first carry; truncate the first mantissa to a level specified by a second precision level; add the first carry to the least significant bit of the mantissa truncated to the level specified by the second precision level to form a mantissa for the number in floating-point representation at the second precision level.
Converting floating point data into integer data using a dynamically adjusted scale factor
The embodiments herein describe a conversion engine that converts floating point data into integer data using a dynamic scaling factor. To select the scaling factor, the conversion engine compares a default (or initial) scaling factor value to an exponent portion of the floating point value to determine a shift value with which to bit shift a mantissa of the floating point value. After bit shifting the mantissa, the conversion engine determines whether the shift value caused an overflow or an underflow and whether that overflow or underflow violates a predefined policy. If the policy is violated, the conversion engine adjusts the scaling factor and restarts the conversion process. In this manner, the conversion engine can adjust the scaling factor until identifying a scaling factor that converts all the floating point values in the batch without violating the policy.
OPERATION UNIT, FLOATING-POINT NUMBER CALCULATION METHOD AND APPARATUS, CHIP, AND COMPUTING DEVICE
An operation unit, and a floating-point number calculation method and apparatus are provided. The operation unit includes a disassembly circuit and an arithmetic unit. The disassembly circuit may obtain a mode and a to-be-calculated floating-point number that are included in a calculation instruction, and disassemble the to-be-calculated floating-point number according to a preset rule. Then, the arithmetic unit completes processing of the calculation instruction based on the mode and the disassembled to-be-calculated floating-point number.
METHOD AND APPARATUS FOR PERMUTING STREAMED DATA ELEMENTS
A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.
Method and Apparatus for Vector Based Finite Impulse Response (FIR) Filtering
A method includes executing, by a processor a vector finite impulse response (VFIR) filter instruction that specifies coefficients, data elements, and a storage location. The executing includes reordering a subset of the data elements to provide each data element of the reordered subset of the data elements to a respective slice multiply component of a vector multiplier of the processor, generating, by the vector multiplier, filter outputs based on the coefficients and data elements, and storing the filter outputs in the storage location.
APPARATUSES, METHODS, AND SYSTEMS FOR 8-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS
Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. A processor embodiment includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a destination matrix having single-precision elements, a first source matrix, and a second source matrix, the source matrices having elements that each comprise a quadruple of 8-bit floating-point values, the opcode to indicate execution circuitry is to cause, for each element of the first source matrix and corresponding element of the second source matrix, a conversion of the 8-bit floating-point values to single-precision values, a multiplication of different pairs of converted single-precision values to generate plurality of results, and an accumulation of the results with previous contents of a corresponding element of the destination matrix, decode circuitry to decode the fetched instruction, and the execution circuitry to respond to the decoded instruction as specified by the opcode.
Enhanced low precision binary floating-point formatting
Techniques for operating on and calculating binary floating-point numbers using an enhanced floating-point number format are presented. The enhanced format can comprise a single sign bit, six bits for the exponent, and nine bits for the fraction. Using six bits for the exponent can provide an enhanced exponent range that facilitates desirably fast convergence of computing-intensive algorithms and low error rates for computing-intensive applications. The enhanced format can employ a specified definition for the lowest binade that enables the lowest binade to be used for zero and normal numbers; and a specified definition for the highest binade that enables it to be structured to have one data point used for a merged Not-a-Number (NaN)/infinity symbol and remaining data points used for finite numbers. The signs of zero and merged NaN/infinity can be “don't care” terms. The enhanced format employs only one rounding mode, which is for rounding toward nearest up.