G06F7/544

Inference apparatus, convolution operation execution method, and program
11580369 · 2023-02-14

An inference apparatus comprises a plurality of PEs (Processing Elements) and a control part. By controlling the plurality of PEs, the control part performs a convolution operation in a convolutional neural network using each of a plurality of pieces of input data and a weight group including a plurality of weights corresponding to each piece of input data. Further, each of the plurality of PEs executes a computation including multiplication of a single piece of input data by a single weight, and performs the multiplications included in the convolution operation using only the elements with non-zero values included in each piece of input data.
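The zero-skipping idea described above can be sketched in plain Python (an illustrative sketch, not the patented apparatus): each input element plays the role of one PE's operand, and the multiply is simply never issued when that element is zero.

```python
# Sketch (assumption, not the patent's design): a 1-D convolution where each
# (input element, weight) pair is one scalar multiply, and multiplications
# are skipped entirely for zero-valued input elements.
def sparse_conv1d(inputs, weights):
    out_len = len(inputs) - len(weights) + 1
    out = [0] * out_len
    for i, x in enumerate(inputs):
        if x == 0:              # zero-skipping: no multiply is issued
            continue
        for k, w in enumerate(weights):
            j = i - k           # output position fed by this (input, weight) pair
            if 0 <= j < out_len:
                out[j] += x * w
    return out
```

For sparse activations (common after ReLU), the inner loop runs only for the non-zero fraction of the input, which is the source of the claimed efficiency.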

SECURE INVERSE SQUARE ROOT COMPUTATION SYSTEM, SECURE NORMALIZATION SYSTEM, METHODS THEREFOR, SECURE COMPUTATION APPARATUS, AND PROGRAM

The bit decomposition unit (11) generates a bit representation {a_0}, . . . , {a_{λ−1}} of a. A first bit sequence generator (12) calculates {a′_i} = {a_i} ∨ {a_{i+1}} to generate {a′_0}, . . . , {a′_{λ′−1}}. A flag sequence generator (13) generates {x_0}, . . . , {x_{λ′−1}} indicating the most significant bit of {a′_0}, . . . , {a′_{λ′−1}}. A normalization multiplier generator (14) generates [c′] by bit-concatenating {x_{λ′−1}}, . . . , {x_0}. A second bit sequence generator (15) sets {a″_i} = {a_{2i}} to generate {a″_0}, . . . . A flag calculator (16) sums {x_j}{a′_j} to calculate a share value {r}. A normalization unit (18) calculates [b] := [c′][c′][2a] when r = 1 and [b] := [c′][c′][a] when r = 0. An inverse square root calculator (19) calculates [w] := [1/√b]·√2 when r = 1 and [w] := [1/√b] when r = 0. An inverse normalization unit (20) calculates [1/√a] := [w][c′] by multiplication.
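Setting the secure-computation machinery aside, the normalization trick can be illustrated in plain arithmetic (a sketch assuming a is a positive integer; the share notation [·], {·} is dropped, and c plays the role of the normalization multiplier c′):

```python
import math

# Plain-arithmetic sketch of the normalize / inverse-sqrt / denormalize flow.
def inv_sqrt(a):
    e = a.bit_length() - 1                  # position of the most significant bit
    if e % 2 == 0:                          # r = 0 case
        c = 2.0 ** (-(e // 2))
        b = a * c * c                       # b = c*c*a, normalized into [1, 2)
        w = 1.0 / math.sqrt(b)
    else:                                   # r = 1 case
        c = 2.0 ** (-((e + 1) // 2))
        b = 2 * a * c * c                   # b = c*c*2a, normalized into [1, 2)
        w = math.sqrt(2.0) / math.sqrt(b)   # w = (1/sqrt(b)) * sqrt(2)
    return w * c                            # inverse normalization: 1/sqrt(a) = w*c
```

Because b always lands in the fixed interval [1, 2), the secure protocol only ever needs an inverse-square-root approximation on that narrow range.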

MEMORY DEVICE FOR PERFORMING CONVOLUTION OPERATION
20230043170 · 2023-02-09

A memory device performs a convolution operation. The memory device includes first to N-th processing elements (PEs), a first analog-to-digital converter (ADC), a first shift adder, and a first accumulator. The first to N-th PEs, where N is a natural number equal to or greater than 2, are respectively associated with at least one weight data included in a weight feature map and are configured to perform a partial convolution operation with at least one input data included in an input feature map. The first ADC is configured to receive a first partial convolution operation result from the first to N-th PEs. The first shift adder shifts an output of the first ADC. The first accumulator accumulates an output from the first shift adder.

NEURAL NETWORK FACILITATING FIXED-POINT EMULATION OF FLOATING-POINT COMPUTATION
20230008856 · 2023-01-12

A DNN accelerator can perform fixed-point emulation of floating-point computation. In a multiplication operation on two floating-point matrices, the DNN accelerator determines an extreme exponent for a row of the first floating-point matrix and another extreme exponent for a column of the second floating-point matrix. The row and column can be converted to fixed-point vectors based on the extreme exponents, and the two fixed-point vectors are fed into a PE array in the DNN accelerator. The PE array performs a multiplication operation on the two fixed-point vectors and generates a fixed-point inner product, which can be converted back to a floating-point inner product based on the extreme exponents. The floating-point inner product is an element of the matrix resulting from the multiplication of the two floating-point matrices. That matrix can be accumulated with another matrix resulting from a fixed-point emulation of a floating-point matrix multiplication.
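The emulation flow can be sketched as follows (a minimal sketch, assuming per-vector maximum exponents and a 16-bit fixed-point format; parameter names are illustrative):

```python
import math

# Quantize a row and a column with shared (extreme) exponents, take an
# integer dot product, then rescale back to floating point.
def fixed_point_dot(row, col, frac_bits=15):
    e_row = max(math.frexp(x)[1] for x in row)   # extreme exponent of the row
    e_col = max(math.frexp(x)[1] for x in col)   # extreme exponent of the column
    q_row = [round(x * 2.0 ** (frac_bits - e_row)) for x in row]  # fits int16
    q_col = [round(x * 2.0 ** (frac_bits - e_col)) for x in col]
    acc = sum(a * b for a, b in zip(q_row, q_col))   # integer MAC (the PE array)
    return acc * 2.0 ** (e_row + e_col - 2 * frac_bits)  # back to floating point
```

Sharing one exponent per row/column is what lets the PE array stay purely integer: only the final rescaling touches the exponents.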

Physics simulation on machine-learning accelerated hardware platforms
11550971 · 2023-01-10

At least one machine-accessible storage medium provides instructions that, when executed by a machine, cause the machine to perform operations. The operations comprise configuring a simulated environment to be representative of a physical device based, at least in part, on an initial description of the physical device that describes its structural parameters. The operations further comprise performing a physics simulation with an artificial intelligence (“AI”) accelerator, which includes a matrix multiply unit for computing convolution operations via a plurality of multiply-accumulate units. The operations further comprise computing a field response of the physical device to an excitation source within the simulated environment when performing the physics simulation. The field response is computed, at least in part, with the convolution operations, which perform spatial differencing.
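Spatial differencing maps onto convolution hardware because finite-difference stencils are convolution kernels. As a sketch (not the patented simulator), the standard second-difference stencil [1, −2, 1] computes a discrete second derivative of a 1-D field:

```python
# Second spatial derivative of a sampled field via a convolution stencil,
# the kind of operation a MAC-based matrix multiply unit can execute.
def second_difference(field, dx):
    kernel = [1.0, -2.0, 1.0]   # standard second-difference stencil
    return [
        sum(k * field[i + j - 1] for j, k in enumerate(kernel)) / dx ** 2
        for i in range(1, len(field) - 1)
    ]
```

Applied to samples of x², the result is the constant 2.0 at every interior point, as the analytic second derivative predicts.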

Power efficient near memory analog multiply-and-accumulate (MAC)
11574173 · 2023-02-07

A near memory system is provided for the calculation of a layer in a machine learning application. The near memory system includes an array of memory cells for storing an array of filter weights. A multiply-and-accumulate circuit coupled to the columns of the array performs the calculation of the layer.
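The dataflow (not the analog circuitry) can be sketched as a column-wise MAC over a stored weight array, where each column produces one output of the layer:

```python
# Sketch: filter weights stored as columns of a memory array; one
# multiply-and-accumulate per column computes one layer output.
def near_memory_layer(inputs, weight_array):
    n_cols = len(weight_array[0])
    return [
        sum(inputs[i] * weight_array[i][j] for i in range(len(inputs)))
        for j in range(n_cols)
    ]
```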

Resistive matrix computation circuit
11593456 · 2023-02-28

A resistive matrix computation circuit and methods for using the same are disclosed. In one embodiment, the circuit includes a memory configured to store a first set of operands and a second set of operands, both programmable by a controller; the two sets of operands are cross-multiplied to form a plurality of product pairs. A plurality of resistive multiplier circuits are configured to generate a plurality of output voltages according to the product pairs, with the controller configured to control the resistive multiplier circuits to perform the multiplications. An aggregator circuit aggregates the output voltages from the resistive multiplier circuits, where the output voltages represent an aggregated value of the product pairs.

Scalable matrix computation circuit
11593455 · 2023-02-28

A scalable matrix computation circuit and methods for using the same are disclosed. In one embodiment, the matrix computation circuit includes a plurality of first operand memories configured to store a first set of input operands and a plurality of second operand memories configured to store a second set of input operands, where both sets are programmable by a controller. A plurality of multiplier circuits are arranged in rows and columns: each row receives a corresponding operand from the first set, each column receives a corresponding operand from the second set, and each row operand is reused by all the multiplier circuits in that row to perform multiplications controlled by the controller. A plurality of aggregator circuits are configured to store the charges produced by the multiplier circuits.
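The operand-reuse pattern amounts to an outer product: each row operand is broadcast across its row of multipliers, each column operand down its column, so every multiplier sees exactly one (row, column) pair. A sketch of that pattern (not the circuit itself):

```python
# Each entry [i][j] is computed by the multiplier at row i, column j,
# reusing row_operands[i] across the whole row and col_operands[j] down
# the whole column -- an outer product.
def multiplier_grid(row_operands, col_operands):
    return [[r * c for c in col_operands] for r in row_operands]
```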

Methods and devices for fixed extrapolation error data simplification processes for telematics

Methods and devices for simplifying data collected from assets are provided. An example method involves obtaining raw data from a data source at an asset and determining that a data logging trigger is satisfied. The trigger is satisfied when a recently obtained point in the raw data differs from a corresponding point, predicted by extrapolation from previously saved points included in one or more previously generated simplified sets of data, by an amount of extrapolation error that is limited by an upper bound fixed as the raw data is collected over time. When the data logging trigger is satisfied, a dataset simplification algorithm is performed on the raw data to generate a simplified set of data.
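The fixed-bound trigger can be sketched with the simplest extrapolator, a straight line through the last two saved points (an illustrative sketch only; the patent's simplification step is more elaborate than "save the point"):

```python
# Save a raw (time, value) point only when it misses the linear
# extrapolation from the last two saved points by more than max_error.
def simplify_stream(points, max_error):
    saved = list(points[:2])              # need two points to extrapolate
    for t, v in points[2:]:
        (t0, v0), (t1, v1) = saved[-2], saved[-1]
        predicted = v1 + (v1 - v0) * (t - t1) / (t1 - t0)
        if abs(v - predicted) > max_error:    # data logging trigger
            saved.append((t, v))
    return saved
```

Data that stays on the extrapolated line never triggers logging, so the saved set stays small while the reconstruction error remains bounded by the fixed threshold.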

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, including a ray tracing operation and a matrix multiply operation, and a second processing cluster coupled to the first processing cluster. The first processing cluster includes a floating-point unit configured to process an instruction using the bfloat16 (BF16) format, with a multiplier that multiplies the second and third source operands while an accumulator adds the first source operand to the output of the multiplier.
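BF16 is float32 with the low 16 mantissa bits dropped, so the dot-product-accumulate step can be emulated in software (a sketch using truncation; real hardware typically uses round-to-nearest-even when converting to BF16):

```python
import struct

# Truncate a float to bfloat16 precision: keep only the top 16 bits of
# its IEEE-754 binary32 representation.
def to_bf16(x):
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Hybrid-format multiply-accumulate: BF16 sources, higher-precision
# accumulator, as in the instruction described above.
def dp_accumulate(acc, a, b):
    return acc + to_bf16(a) * to_bf16(b)
```

Keeping the accumulator at full float32 precision while the multiplier inputs are BF16 is the usual way this hybrid format avoids accumulation error.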