G06F7/544

Execution unit
11561799 · 2023-01-24

An execution unit comprising a processing pipeline configured to perform calculations to evaluate a plurality of mathematical functions. The processing pipeline comprises a plurality of stages through which each calculation for evaluating a mathematical function progresses to an end result. Each of a plurality of processing circuits in the pipeline is configured to perform an operation on input values during at least one stage of the plurality of stages. The plurality of processing circuits include multiplier circuits. A first multiplier circuit and a second multiplier circuit are configured to operate in parallel, such that at the same stage in the processing pipeline, the first multiplier circuit and the second multiplier circuit perform their processing. A third multiplier circuit is arranged in series with the first multiplier circuit and the second multiplier circuit and processes outputs from the first multiplier circuit and the second multiplier circuit.
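The multiplier arrangement can be pictured with a small software model. This is only a sketch under assumptions: the two-stage split, the function names, and the choice of computing (a·b)·(c·d) are illustrative and do not come from the abstract.

```python
def stage_one(a, b, c, d):
    # First and second multipliers operate in the same pipeline stage.
    return a * b, c * d


def stage_two(p, q):
    # Third multiplier, in series, consumes both partial products.
    return p * q


def run_pipeline(operands):
    """Push operand tuples through the two stages, one tuple per cycle."""
    results = []
    latch = None  # pipeline register between stage one and stage two
    for item in list(operands) + [None]:  # one extra cycle drains the pipeline
        if latch is not None:
            results.append(stage_two(*latch))
        latch = stage_one(*item) if item is not None else None
    return results
```

Calling `run_pipeline([(1, 2, 3, 4), (2, 2, 2, 2)])` yields one product per operand tuple, with the second tuple entering stage one while the first is in stage two.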

Memory processing unit

An in-memory computing system for computing vector-matrix multiplications includes an array of resistive memory devices arranged in columns and rows, such that resistive memory devices in each row of the array are interconnected by a respective word line and resistive memory devices in each column of the array are interconnected by a respective bitline. The in-memory computing system also includes an interface circuit that is electrically coupled to each bitline of the array of resistive memory devices and that computes the vector-matrix multiplication between an input vector applied to a given set of word lines and data values stored in the array. For each bitline, the interface circuit receives an output in response to the input being applied to the given word line, compares the output to a threshold, and increments a count maintained for each bitline when the output exceeds the threshold. The count for a given bitline represents a dot-product.
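The threshold-and-count scheme can be illustrated with an idealized model. The sketch below assumes binary inputs, binary stored states, word lines driven one at a time, and a noiseless bitline output; none of these details are stated in the abstract, and the function name is hypothetical.

```python
def dot_products_by_counting(inputs, stored, threshold=0.5):
    """inputs: one 0/1 value per word line.
    stored: rows x columns of 0/1 device states (assumed binary)."""
    n_cols = len(stored[0])
    counts = [0] * n_cols  # one counter per bitline
    for row, x in enumerate(inputs):  # drive one word line at a time
        for col in range(n_cols):
            output = x * stored[row][col]  # idealized bitline output
            if output > threshold:
                counts[col] += 1
    return counts
```

Under these assumptions each counter ends up equal to the dot product of the input vector with the corresponding column of stored values.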

HIGH DYNAMIC RANGE DIGITIZATION TECHNOLOGY FOR ANALOG COMPUTE-IN-MEMORY AND EDGE AI APPLICATIONS

Systems, apparatuses and methods may provide for compute-in-memory (CiM) accelerator technology that includes a multiply-accumulate (MAC) computation stage, an analog amplifier stage coupled to an output of the MAC computation stage, and an analog to digital conversion (ADC) stage coupled to an output of the analog amplifier stage, wherein a gain setting of the analog amplifier stage modifies a quantization granularity of the ADC stage.
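One way to see why the amplifier gain changes the quantization granularity is a toy model: an ideal amplifier followed by an ideal ADC with a fixed full-scale range. The bit width, the full-scale value, and both function names are assumptions made for illustration only.

```python
def adc_code(voltage, bits=4, full_scale=1.0):
    """Ideal ADC: map [0, full_scale) onto 2**bits codes, clipping outside."""
    levels = 2 ** bits
    code = int(voltage / full_scale * levels)
    return max(0, min(levels - 1, code))


def digitize(mac_voltage, gain, bits=4, full_scale=1.0):
    """Amplify the MAC output, convert it, and report the input-referred step."""
    code = adc_code(mac_voltage * gain, bits, full_scale)
    step = full_scale / (2 ** bits) / gain  # size of one code at the MAC output
    return code, step
```

In this model a gain of 4 shrinks the input-referred step by a factor of 4, at the cost of clipping MAC outputs above a quarter of full scale.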

HARDWARE ACCELERATOR FOR PERFORMING COMPUTATIONS OF DEEP NEURAL NETWORK AND ELECTRONIC DEVICE INCLUDING THE SAME

A hardware accelerator includes a processing core including a plurality of multipliers configured to perform one-dimensional (1D) sub-word parallelism between a sign and a mantissa of a first tensor and a sign and a mantissa of a second tensor, a first processing device configured to operate in a two-dimensional (2D) operation mode in which results of computation by the plurality of multipliers are output, and a second processing device configured to operate in a three-dimensional (3D) operation mode in which results of computation by the plurality of multipliers are accumulated in a channel direction and then a result of accumulating the results of computation is output.

Efficient Piecewise Polynomial Approximators
20230229732 · 2023-07-20

A method for approximating a mathematical function defined over a range includes initially dividing at least part of the range into a set of segments. For at least a subset of the segments, the mathematical function is approximated within each segment by a respective approximation polynomial. A series of one or more segment-merging iterations is performed, a given iteration including: selecting adjacent segments as candidates for merging; approximating the mathematical function by a candidate approximation polynomial, over at least a merged segment formed by merging the adjacent segments; and, if approximation of the mathematical function meets a specified condition, updating the set of segments by (i) replacing the adjacent segments with the merged segment and (ii) replacing the approximation polynomials of the adjacent segments with the candidate approximation polynomial.
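The merging loop can be sketched as follows. This is a minimal illustration, not the claimed method: it assumes degree-1 polynomials obtained by endpoint interpolation (a stand-in for whatever fitting procedure is actually used) and takes the "specified condition" to be a maximum sampled error no larger than a tolerance `tol`. All function names are hypothetical.

```python
import math


def fit_line(f, lo, hi):
    """Degree-1 polynomial through the segment endpoints, as (c0, c1)."""
    slope = (f(hi) - f(lo)) / (hi - lo)
    return (f(lo) - slope * lo, slope)


def max_error(f, poly, lo, hi, samples=64):
    """Largest absolute error of c0 + c1*x against f on a uniform grid."""
    c0, c1 = poly
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - (c0 + c1 * x)) for x in xs)


def merge_segments(f, lo, hi, n_initial, tol):
    # Initial division of the range into equal segments, one polynomial each.
    edges = [lo + (hi - lo) * i / n_initial for i in range(n_initial + 1)]
    segments = [(edges[i], edges[i + 1]) for i in range(n_initial)]
    polys = [fit_line(f, a, b) for a, b in segments]
    merged = True
    while merged:  # one segment-merging iteration per pass
        merged = False
        for i in range(len(segments) - 1):
            a, b = segments[i][0], segments[i + 1][1]
            candidate = fit_line(f, a, b)
            if max_error(f, candidate, a, b) <= tol:
                # Replace both segments and both polynomials with the merged ones.
                segments[i:i + 2] = [(a, b)]
                polys[i:i + 2] = [candidate]
                merged = True
                break
    return segments, polys


segments, polys = merge_segments(math.exp, 0.0, 2.0, 16, 0.01)
```

The loop stops once no adjacent pair can be merged without violating the tolerance, so flatter parts of the function end up with wider segments.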

HYBRID MULTIPLY-ACCUMULATION OPERATION WITH COMPRESSED WEIGHTS

A compute block can perform hybrid multiply-accumulate (MAC) operations. The compute block may include a weight compression module and a processing element (PE) array. The weight compression module may select a first group of one or more weights and a second group of one or more weights from a weight tensor of a DNN (deep neural network) layer. A weight in the first group is quantized to a power of two value. A weight in the second group is quantized to an integer. The integer and the exponent of the power of two value may be stored in a memory in lieu of the original values of the weights. A PE in the PE array includes a shifter configured to shift an activation of the layer by the exponent of the power of two value and a multiplier configured to multiply the integer with another activation of the layer.
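The shifter-plus-multiplier datapath can be sketched in a few lines. The sketch assumes integer activations and stores an explicit sign alongside each exponent; the sign handling, the treatment of negative exponents as right shifts, and the function names are assumptions not found in the abstract.

```python
def shift_mul(activation, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using shifts only."""
    if exponent >= 0:
        shifted = activation << exponent
    else:
        shifted = activation >> -exponent  # right shift floors the result
    return sign * shifted


def hybrid_mac(acts_pow2, pow2_weights, acts_int, int_weights):
    """Accumulate shifter products and multiplier products into one sum."""
    acc = 0
    for a, (sign, exponent) in zip(acts_pow2, pow2_weights):
        acc += shift_mul(a, sign, exponent)  # power-of-two group: shifter path
    for a, w in zip(acts_int, int_weights):
        acc += a * w  # integer group: multiplier path
    return acc
```

For example, `hybrid_mac([3, 8], [(1, 2), (-1, -1)], [5], [7])` accumulates 3·4, 8·(−1/2) and 5·7.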

PROCESSING-IN-MEMORY DEVICE WITH ALL OPERATION MODE AND DISPERSION OPERATION MODE
20230230622 · 2023-07-20

A processing-in-memory (PIM) device includes a plurality of multiplication and accumulation (MAC) units, each of the MAC units including a memory bank and a MAC operator, and a control circuit configured to control the plurality of MAC units to perform an all MAC mode operation in which MAC operations are performed in all MAC units, among the plurality of MAC units, or a dispersion MAC mode operation in which the MAC operations are performed in some MAC units, among the plurality of MAC units.

Methods for improving AI engine MAC utilization
11562214 · 2023-01-24

Embodiments of the invention disclose an integrated circuit and a method for improving utilization of multiply and accumulate (MAC) units on the integrated circuit in an artificial intelligence (AI) engine. In one embodiment, the integrated circuit can include a scheduler for allocating the MAC units to execute a neural network model deployed on the AI engine to process input data. The scheduler includes status information for the MAC units, and can select one or more idle MAC units based on the status information for use to process a feature map slice. The integrated circuit can dynamically map idle MAC units to an input feature map, thereby improving utilization of the MAC units. A pair of linked lists, each with a reference head, can be provided in a static random access memory (SRAM) to store only feature map slices and weights for a layer that is currently being processed. When processing a next layer, the two reference heads can be swapped so that output feature map slices for the current layer can be used as input feature maps for the next layer.
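The two mechanisms described here, idle-unit selection and head swapping, can be mimicked in software. The sketch below is a loose analogy under assumptions: a list of busy flags stands in for the status information, Python deques stand in for the SRAM linked lists, and the class and method names are hypothetical.

```python
from collections import deque


class MacScheduler:
    def __init__(self, n_units):
        self.busy = [False] * n_units  # status information, one flag per MAC unit

    def allocate(self, wanted):
        """Pick up to `wanted` idle units and mark them busy."""
        idle = [i for i, b in enumerate(self.busy) if not b][:wanted]
        for i in idle:
            self.busy[i] = True
        return idle

    def release(self, units):
        for i in units:
            self.busy[i] = False


class LayerBuffers:
    def __init__(self, first_inputs):
        self.input_head = deque(first_inputs)  # slices read by the current layer
        self.output_head = deque()             # slices written by the current layer

    def swap(self):
        """Make the current outputs the next layer's inputs, without copying."""
        self.input_head, self.output_head = self.output_head, self.input_head
        self.output_head.clear()  # discard slices consumed by the finished layer
```

Swapping the two references means only the current layer's slices occupy the buffers at any time.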

Parallel processing of a convolutional layer of a neural network with compute-in-memory array
11562205 · 2023-01-24

An apparatus includes first and second compute-in-memory (CIM) arrays. The first CIM array is configured to store weights corresponding to a filter tensor, to receive a first set of activations corresponding to a first receptive field of an input, and to process the first set of activations with the weights to generate a corresponding first tensor of output values. The second CIM array is configured to store a first copy of the weights corresponding to the filter tensor and to receive a second set of activations corresponding to a second receptive field of the input. The second CIM array is also configured to process the second set of activations with the first copy of the weights to generate a corresponding second tensor of output values. The first and second compute-in-memory arrays are configured to process the first and second receptive fields in parallel.
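The idea of duplicating filter weights so that two receptive fields are processed at once can be modeled in software. In this sketch each array is reduced to a plain matrix-vector product, threads stand in for hardware parallelism, and the names `CimArray` and `convolve_two_fields` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class CimArray:
    def __init__(self, weights):
        # Rows: flattened receptive-field positions; columns: output channels.
        self.weights = [row[:] for row in weights]  # each array keeps its own copy

    def process(self, activations):
        n_out = len(self.weights[0])
        return [
            sum(a * row[c] for a, row in zip(activations, self.weights))
            for c in range(n_out)
        ]


def convolve_two_fields(weights, field_a, field_b):
    first, second = CimArray(weights), CimArray(weights)
    with ThreadPoolExecutor(max_workers=2) as pool:
        out_a = pool.submit(first.process, field_a)   # first receptive field
        out_b = pool.submit(second.process, field_b)  # second receptive field
        return out_a.result(), out_b.result()
```

Each call returns one output vector per receptive field, computed against identical weights held in separate arrays.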