G06F7/5324

APPARATUS AND METHOD WITH PARALLEL DATA PROCESSING

An apparatus with parallel processing includes: a first processor module; and a second processor module configured to perform parallel processing in synchronization with the first processor module, wherein the first processor module is configured to: determine first operation result data using an operation process in a first time interval; transmit the first operation result data to the second processor module; determine second operation result data using the operation process in a second time interval; and determine whether to transmit the second operation result data to the second processor module, and wherein the second processor module is configured to determine second prediction result data corresponding to the second operation result data based on the first operation result data received from the first processor module and a prediction process in response to the first processor module determining not to transmit the second operation result data to the second processor module.

COMPUTER PROCESSOR FOR HIGHER PRECISION COMPUTATIONS USING A MIXED-PRECISION DECOMPOSITION OF OPERATIONS
20230214215 · 2023-07-06 ·

Embodiments detailed herein relate to arithmetic operations of float-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands, values of which being in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among lower precision values converted from values for the plurality of the operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.

Computer processor for higher precision computations using a mixed-precision decomposition of operations
11544057 · 2023-01-03 · ·

Embodiments detailed herein relate to arithmetic operations of float-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands, values of which being in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among lower precision values converted from values for the plurality of the operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.

PROCESSING ELEMENT, NEURAL PROCESSING DEVICE INCLUDING SAME, AND MULTIPLICATION OPERATION METHOD USING SAME
20220405560 · 2022-12-22 ·

The present disclosure discloses a processing element and a neural processing device including the processing element. The processing element includes a weight register configured to store a weight, an input activation register configured to store input activation, a flexible multiplier configured to generate result data by performing a multiplication operation of the weight and the input activation by using a first multiplier of a first precision or using both the first multiplier and a second multiplier of the first precision in response to a calculation mode signal and a saturating adder configured to generate a partial sum by using the result data.

Modular gated multiplier circuitry and multiplication technique

Various implementations described herein are related to a device having multiplier circuitry with an array of summation result cells that holds summation bit values for shifted arrays added together. The device may include latch circuitry having one or more gated elements disposed between the summation result cells, and the gated elements may be adapted to provide a portion of the summation bit values based on a gating signal.

Multi-partitioning for combination operations

Systems and methods are disclosed for processing and executing queries against one or more dataset. As part of processing the query, the system determines whether the query is susceptible to a significantly imbalanced partition. In the event, the query is susceptible to an imbalanced partition, the system monitors the query and determines whether to perform a multi-partitioning determination to avoid a significantly imbalanced partition.

Arithmetic circuit for performing product-sum arithmetic

An arithmetic circuit includes a LUT generation circuit (1) that, when coefficients c[n] (n=1, . . . , N) are paired two by two, outputs a value calculated for each of the pairs, and distributed arithmetic circuits (2-m) that calculate values z[m] that are sums of products of data x[m, n] of a data set X[m] containing M pairs of data x[m, n] and the coefficients c[n], in parallel for each of the M pairs. The distributed arithmetic circuit (2-m) includes binomial distributed arithmetic circuits that, for each of the pairs, calculate sums of products of a value obtained by pairing N data x[m, n] corresponding to the circuit two by two and a value obtained by pairing the coefficients c[n] two by two, and a figure matching circuit that matches a number of decimal figures of the sums with a predetermined number of decimal figures.

Outlier quantization for training and inference

Machine learning may include training and drawing inference from artificial neural networks, processes which may include performing convolution and matrix multiplication operations. Convolution and matrix multiplication operations are performed using vectors of block floating-point (BFP) values that may include outliers. BFP format stores floating-point values using a plurality of mantissas of a fixed bit width and a shared exponent. Elements are outliers when they are too large to be represented precisely with the fixed bit width mantissa and shared exponent. Outlier values are split into two mantissas. One mantissa is stored in the vector with non-outliers, while the other mantissa is stored outside the vector. Operations, such as a dot product, may be performed on the vectors in part by combining the in-vector mantissa and exponent of an outlier value with the out-of-vector mantissa and exponent.

Fast digital multiply-accumulate (MAC) by fast digital multiplication circuit

Certain aspects provide methods and apparatus for multiplication of digital signals. In accordance with certain aspects, a multiplication circuit may be used to multiply a portion of a first digital input signal with a portion of a second digital input signal via a first multiplier circuit to generate a first multiplication signal, and multiply another portion of the first digital input signal with another portion of the second digital input signal via a second multiplier circuit to generate a second multiplication signal. A third multiplier circuit and multiple adder circuits may be used to generate an output of the multiplication circuit based on the first and second multiplication signals.

Multiple mode arithmetic circuit

A tile of an FPGA includes a multiple mode arithmetic circuit. The multiple mode arithmetic circuit is configured by control signals to operate in an integer mode, a floating-point mode, or both. In some example embodiments, multiple integer modes (e.g., unsigned, two's complement, and sign-magnitude) are selectable, multiple floating-point modes (e.g., 16-bit mantissa and 8-bit sign, 8-bit mantissa and 6-bit sign, and 6-bit mantissa and 6-bit sign) are supported, or any suitable combination thereof. The tile may also fuse a memory circuit with the arithmetic circuits. Connections directly between multiple instances of the tile are also available, allowing multiple tiles to be treated as larger memories or arithmetic circuits. By using these connections, referred to as cascade inputs and outputs, the input and output bandwidth of the arithmetic circuit is further increased.