G06F7/5443

Processing apparatus and processing method

The present disclosure provides a computation device and method. The device may include an input module configured to acquire input data; a model generation module configured to construct an offline model according to an input network structure and weight data; a neural network operation module configured to generate a computation instruction based on the offline model, cache the computation instruction, and compute the data to be processed based on the computation instruction to obtain a computation result; and an output module configured to output the computation result. The device and method may avoid the overhead of running an entire software architecture, which is a problem with traditional methods.

Accelerated mathematical engine

Various embodiments of the disclosure relate to an accelerated mathematical engine. In certain embodiments, the accelerated mathematical engine is applied to image processing such that convolution of an image is accelerated by using a two-dimensional matrix processor comprising sub-circuits that include an ALU, an output register, and a shadow register. This design supports a clocked, two-dimensional architecture in which image data and weights are multiplied in a synchronized manner to allow a large number of mathematical operations to be performed in parallel.

Kernel Decomposition and Activation Broadcasting in Deep Neural Networks (DNNs)

A DNN accelerator may perform 1×N kernel decomposition to decompose a convolutional kernel into kernel vectors, each of which includes multiple weights. Through the kernel decomposition, a weight operand may be generated from a filter. The DNN accelerator converts an input tensor into input operands. An input operand includes activations and has the same size as the weight operand. The DNN accelerator may read a first activation in the input operand from memory to an internal memory of a first PE and read a second activation in the input operand from the memory to an internal memory of a second PE. The first PE may receive the second activation from the second PE through activation broadcasting between the two PEs and perform MAC operations on the input operand and weight operand. The second PE may perform MAC operations on another input operand in the input tensor and the weight operand.
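The decomposition step can be illustrated with a minimal sketch, assuming a 2D kernel split row by row into 1×N vectors. The function names `decompose_kernel` and `conv2d_by_rows` are hypothetical, and activation broadcasting between PEs is not modeled; the sketch only shows that summing per-row MAC results reproduces the full convolution.

```python
def decompose_kernel(kernel):
    """Split a 2D kernel (list of rows) into 1xN kernel vectors, one per row."""
    return [list(row) for row in kernel]


def conv2d_by_rows(image, kernel):
    """Valid 2D convolution (no kernel flip) computed as a sum of 1xN row MACs."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r, weights in enumerate(decompose_kernel(kernel)):
        for y in range(oh):
            for x in range(ow):
                # Input operand: same size as the weight operand (assumption: one image row slice).
                operand = image[y + r][x:x + kw]
                out[y][x] += sum(a * w for a, w in zip(operand, weights))
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
result = conv2d_by_rows(image, kernel)  # [[6, 8], [12, 14]]
```

Each pass of the outer loop uses one kernel vector as the weight operand, which is the unit of work a single PE would handle in this simplified picture.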

Decoupled Execution Of Workload For Crossbar Arrays

A computing system architecture is presented for decoupling execution of workload by crossbar arrays and similar memory modules. The computing system includes: a data bus; a core controller connected to the data bus; and a plurality of local tiles connected to the data bus. Each local tile in the plurality of local tiles includes a local controller and at least one memory module, where the memory module performs computation using the data stored in memory without reading the data out of the memory.
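As a rough illustration of the tile arrangement, the sketch below assumes each tile holds a slice of a weight matrix and computes a matrix-vector product locally, so only inputs and results cross the shared bus. The class names `LocalTile` and `CoreController` and the row-wise partitioning are assumptions for illustration, not details from the abstract.

```python
class LocalTile:
    """Hypothetical tile: weights stay inside the tile and are never read out."""

    def __init__(self, rows):
        self._rows = rows  # stored in the tile's memory module

    def matvec(self, x):
        # Computation happens where the data lives; only the result leaves the tile.
        return [sum(w * v for w, v in zip(row, x)) for row in self._rows]


class CoreController:
    """Hypothetical controller that dispatches one input vector to every tile."""

    def __init__(self, tiles):
        self.tiles = tiles

    def run(self, x):
        out = []
        for tile in self.tiles:
            out.extend(tile.matvec(x))
        return out


matrix = [[1, 2], [3, 4], [5, 6]]
controller = CoreController([LocalTile(matrix[:2]), LocalTile(matrix[2:])])
y = controller.run([1, 1])  # [3, 7, 11]
```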

Neural Network Architecture Using Single Plane Filters
20230214631 · 2023-07-06

Hardware for implementing a Deep Neural Network (DNN) having a convolution layer, the hardware comprising an input buffer configured to provide data windows to a plurality of convolution engines, each data window comprising a single input plane; and each of the plurality of convolution engines being operable to perform a convolution operation by applying a filter to a data window, each filter comprising a set of weights for combination with respective data values of a data window, and each of the plurality of convolution engines comprising: multiplication logic operable to combine a weight of the filter with a respective data value of the data window provided by the input buffer; and accumulation logic configured to accumulate the results of a plurality of combinations performed by the multiplication logic so as to form an output for a respective convolution operation.

COMPUTER PROCESSOR FOR HIGHER PRECISION COMPUTATIONS USING A MIXED-PRECISION DECOMPOSITION OF OPERATIONS
20230214215 · 2023-07-06

Embodiments detailed herein relate to arithmetic operations on floating-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands whose values are in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes: converting the value of each operand into a plurality of lower-precision values, with an exponent stored for each operand; performing arithmetic operations among the lower-precision values converted from the values of the plurality of operands; and generating a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and storing the floating-point value.
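The general idea of decomposing a value into lower-precision pieces can be sketched as follows. This is a simplified illustration, assuming a binary64 value is split into two binary32 pieces and that the partial products are accumulated in binary64; the per-operand exponent handling described in the abstract is not modeled, and the helper names are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split(x):
    """Represent x approximately as hi + lo, where both parts are binary32 values."""
    hi = to_f32(x)
    lo = to_f32(x - hi)
    return hi, lo


def mul_decomposed(a, b):
    """Multiply a and b using only products of their lower-precision pieces."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    # Add the smallest terms first to limit rounding error in the accumulation.
    return a_lo * b_lo + (a_hi * b_lo + a_lo * b_hi) + a_hi * b_hi


a, b = 1.0 / 3.0, 2.0 / 7.0
single_piece = to_f32(a) * to_f32(b)  # one binary32 piece per operand
two_piece = mul_decomposed(a, b)      # two binary32 pieces per operand
```

With these inputs, `two_piece` lands much closer to the binary64 product `a * b` than `single_piece` does, which is the benefit the decomposition is meant to deliver.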

METHOD AND APPARATUS WITH BIT-SERIAL DATA PROCESSING OF A NEURAL NETWORK

A processor-implemented data processing method includes encoding a plurality of weights of a filter of a neural network using an inverted two's complement fixed-point format; generating weight data based on values of the encoded weights corresponding to the same filter positions of a plurality of filters; and performing an operation on the weight data and input activation data using a bit-serial scheme to control when to perform an activation function with respect to the weight data and input activation data.
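A bit-serial operation can be sketched as a dot product that consumes the weights one bit plane at a time. The sketch assumes ordinary two's complement encoding as a stand-in, since the abstract does not define its inverted format, and it does not model the control of the activation function. The function name `bit_serial_dot` is hypothetical.

```python
def bit_serial_dot(weights, activations, bits=8):
    """Dot product that processes the weights one bit plane per step."""
    mask = (1 << bits) - 1
    encoded = [w & mask for w in weights]  # plain two's complement (assumption)
    total = 0
    for b in range(bits):
        # Sum the activations whose weight has bit b set.
        plane = sum(a for w, a in zip(encoded, activations) if (w >> b) & 1)
        # The most significant bit carries negative weight in two's complement.
        total += -(plane << b) if b == bits - 1 else plane << b
    return total


weights = [3, -2, 5, -7]
activations = [4, 1, 0, 2]
result = bit_serial_dot(weights, activations)  # -4, same as the direct dot product
```

The sketch is valid for weights that fit in `bits` bits as signed integers, here the range -128 to 127.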

TWO-DIMENSIONAL ARRAY-BASED NEUROMORPHIC PROCESSOR AND IMPLEMENTING METHOD

A 2D array-based neuromorphic processor includes: axon circuits each being configured to receive a first input corresponding to one bit from among bits indicating n-bit activation; first direction lines extending in a first direction from the axon circuits; second direction lines intersecting the first direction lines; synapse circuits disposed at intersections of the first direction lines and the second direction lines, and each being configured to store a second input corresponding to one bit from among bits indicating an m-bit weight and to output operation values of the first input and the second input; and neuron circuits connected to the first or second direction lines, each of the neuron circuits being configured to receive an operation value output from at least one of the synapse circuits, based on time information assigned individually to the synapse circuits, and to perform an arithmetic operation by using the operation values.
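The bit-level arrangement can be illustrated with a small sketch in which each activation bit meets each weight bit in a 1-bit operation and the partial results are combined by significance. Treating the per-synapse time information as a bit shift of `i + j`, and restricting the values to unsigned integers, are assumptions made for illustration only.

```python
def bitwise_multiply(activation, weight, n_bits, m_bits):
    """Multiply an n-bit activation by an m-bit weight using 1-bit operations."""
    total = 0
    for i in range(n_bits):              # one axon circuit per activation bit
        a_bit = (activation >> i) & 1
        for j in range(m_bits):          # one synapse circuit per weight bit
            w_bit = (weight >> j) & 1
            partial = a_bit & w_bit      # 1-bit operation value from the synapse
            # Shift stands in for the synapse's time information (assumption).
            total += partial << (i + j)
    return total


product = bitwise_multiply(13, 11, n_bits=4, m_bits=4)  # 143
```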

CALCULATING DEVICE, CALCULATION PROGRAM, AND CALCULATION METHOD
20230214446 · 2023-07-06

According to one embodiment, a calculating device includes a processor configured to perform matrix transformation processing and update processing. The matrix transformation processing includes deriving a second matrix by transforming first row vectors included in a first matrix. The update processing includes an update of first and second variable sets. The update of the second variable set includes obtaining the second variable set after the update by adding a first update function of the updated first variable set to the second variable set. The first update function includes at least one of a first or a second multiply-add operation. The first multiply-add operation includes a multiply-add operation of the updated first variable set and a component of the second matrix. The second multiply-add operation includes a multiply-add operation of a component of the second matrix and a variable dependent on the updated first variable set.

MULTIPURPOSE MULTIPLY-ACCUMULATOR ARRAY
20230214185 · 2023-07-06

Embodiments of the present disclosure include a multipurpose multiply-accumulator (MAC) array circuit comprising one or more input memories for receiving operands and a plurality of multiply-accumulator circuits each selectively coupled to the one or more input memories to receive at least a pair of operands and generate a result. Each of the plurality of multiply-accumulator circuits receives operands from the one or more input memories independently. Additionally, selection of operands from the one or more input memories is controlled based on at least an operation and/or data types, where different operation and/or data types configure the plurality of multiply-accumulator circuits to receive different pairs of operands from the one or more input memories to execute particular operation types.
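Operation-dependent operand selection can be sketched as below. The operation names `"dot"` and `"scale"`, the two input memories, and the function names are hypothetical; the sketch only shows how the same multiply units can receive different operand pairs depending on the operation type.

```python
def select_operands(op, mem_a, mem_b, index):
    """Pick the operand pair for MAC unit `index` based on the operation type."""
    if op == "dot":    # each unit pairs matching elements of the two memories
        return mem_a[index], mem_b[index]
    if op == "scale":  # every unit shares the first element of mem_b
        return mem_a[index], mem_b[0]
    raise ValueError(f"unsupported operation: {op}")


def run_mac_array(op, mem_a, mem_b):
    """Run one multiply per unit, then reduce only for the dot-product case."""
    products = []
    for index in range(len(mem_a)):
        a, b = select_operands(op, mem_a, mem_b, index)
        products.append(a * b)
    return sum(products) if op == "dot" else products


dot = run_mac_array("dot", [1, 2, 3], [4, 5, 6])       # 32
scaled = run_mac_array("scale", [1, 2, 3], [4, 5, 6])  # [4, 8, 12]
```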