Patent classifications
G06F7/527
COMBINED DIVIDE/SQUARE ROOT PROCESSING CIRCUITRY AND METHOD
An apparatus comprises combined divide/square root processing circuitry to perform, in response to a divide instruction, a given radix-64 iteration of a radix-64 divide operation, and in response to a square root instruction, a given radix-64 iteration of a radix-64 square root operation; in which: the combined divide/square root processing circuitry comprises shared circuitry to generate at least one output value for the given radix-64 iteration on a same data path used for both the radix-64 divide operation and the radix-64 square root operation.
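To make the idea of a radix-64 iteration concrete, here is a minimal software sketch of long division that retires one radix-64 digit (6 quotient bits) per iteration. It is only an illustration of the arithmetic. The function name `radix64_divide` and the `digits` parameter are hypothetical, and the sketch does not model the shared data path or the square root case described in the abstract.

```python
def radix64_divide(dividend, divisor, digits):
    """Integer long division producing one radix-64 quotient digit per loop pass.

    Assumption for this sketch: non-negative dividend below 64**digits and a
    positive divisor. Returns (quotient, remainder).
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("sketch handles dividend >= 0 and divisor > 0 only")
    if dividend >= 64 ** digits:
        raise ValueError("dividend does not fit in the requested number of digits")

    quotient = 0
    remainder = 0
    for i in reversed(range(digits)):
        # Bring down the next radix-64 digit of the dividend (6 bits).
        digit_in = (dividend >> (6 * i)) & 0x3F
        remainder = (remainder << 6) | digit_in
        # Select the quotient digit; it is always in 0..63 because the
        # previous remainder was smaller than the divisor.
        q_digit = remainder // divisor
        remainder -= q_digit * divisor
        quotient = (quotient << 6) | q_digit
    return quotient, remainder


if __name__ == "__main__":
    print(radix64_divide(1_000_000, 7, 4))
```

In hardware the digit selection would not use a full division; the `//` here simply stands in for whatever selection logic a real design uses.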
Processing apparatus and processing method
The present disclosure provides a computation device and method. The device may include an input module configured to acquire input data; a model generation module configured to construct an offline model according to an input network structure and weight data; a neural network operation module configured to generate a computation instruction based on the offline model, cache the computation instruction, and compute the data to be processed based on the computation instruction to obtain a computation result; and an output module configured to output the computation result. The device and method may avoid the overhead of running an entire software architecture, which is a problem in traditional methods.
Processing apparatus and processing method
The present disclosure relates to a processing device including a memory configured to store data to be computed; a computational circuit configured to compute the data to be computed, which includes performing acceleration computations on the data to be computed by using an adder circuit and a multiplier circuit; and a control circuit configured to control the memory and the computational circuit, which includes performing acceleration computations according to the data to be computed. The present disclosure may have high flexibility, good configurability, fast computational speed, low power consumption, and other features.
MATRIX MULTIPLICATION CIRCUIT MODULE AND MATRIX MULTIPLICATION METHOD
A matrix multiplication circuit module and a matrix multiplication method are provided by the embodiments of the present disclosure. The circuit module includes one or more row-column calculation units for realizing row-column multiplication calculation. Each of the row-column calculation units comprises one or more multiplying units and an adding unit. Each of the one or more multiplying units has an output end connected to an input end of the adding unit. Each of the multiplying units comprises an electrical signal regulating subunit and a load. The electrical signal regulating subunit is configured to regulate a magnitude of an input electrical signal. A multiplication operation is performed by the electrical signal regulating subunit and the load in response to an electrical signal inputted to the multiplying unit. The load has a fixed load value.
FPGA specialist processing block for machine learning
The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.
MEMORY DEVICE AND OPERATION METHOD THEREOF
A memory device and an operation method thereof are provided. The operation method includes: encoding input data, sending the encoded input data to at least one page buffer, and reading out the encoded input data in parallel; encoding a first part and a second part of weight data into an encoded first part and an encoded second part of the weight data, respectively, writing the encoded first part and the encoded second part of the weight data into a plurality of memory cells of the memory device, and reading out the encoded first part and the encoded second part of the weight data in parallel; multiplying the encoded input data with the encoded first part and the encoded second part of the weight data, respectively, to generate a plurality of partial products in parallel; and accumulating the partial products to generate an operation result.
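The partial-product step in the abstract above can be sketched in software. This is a minimal illustration under stated assumptions: the abstract does not say how the weight is split or encoded, so the sketch assumes an unsigned weight split into high and low bit fields with no further encoding. The names `split_weight` and `mac_from_parts` and the `low_bits` parameter are hypothetical.

```python
def split_weight(weight, low_bits=4):
    """Split an unsigned weight into (first part, second part).

    Assumption: the first part is the high bit field and the second part
    is the low `low_bits` bits.
    """
    high = weight >> low_bits
    low = weight & ((1 << low_bits) - 1)
    return high, low


def mac_from_parts(inputs, weights, low_bits=4):
    """Multiply-accumulate built from per-part partial products."""
    total = 0
    for x, w in zip(inputs, weights):
        high, low = split_weight(w, low_bits)
        # Each part yields its own partial product; the high part is
        # shifted back to its original bit position before accumulation.
        partial_high = (x * high) << low_bits
        partial_low = x * low
        total += partial_high + partial_low
    return total


if __name__ == "__main__":
    print(mac_from_parts([3, 5, 7], [200, 17, 255]))  # 2470
```

In the sketch the two partial products are computed one after the other; the abstract describes them as generated in parallel.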
Efficient constant multiplier implementation for programmable logic devices
Various techniques are provided to efficiently implement user designs in programmable logic devices (PLDs). In one example, a computer-implemented method includes receiving a design identifying operations to be performed by a PLD and synthesizing the design into a plurality of PLD components. The synthesizing includes detecting a constant multiplier operation in the design, determining a nearest boundary condition for the constant multiplier operation, and decomposing the constant multiplier operation using the nearest boundary condition to reduce the plurality of PLD components. The reduced plurality of PLD components comprise at least one look up table (LUT) configured to implement an addition or subtraction operation of the decomposed constant multiplier operation.
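As an illustration of decomposing a constant multiplier into additions and subtractions, the sketch below rewrites a positive constant as a signed sum of powers of two, always stepping toward the nearest power of two. Treating "nearest boundary condition" as "nearest power of two" is an assumption of this sketch, not something the abstract states. The function names are hypothetical.

```python
def decompose_constant(c):
    """Return a list of (sign, shift) terms with c == sum(sign * 2**shift).

    Assumption: c is a positive integer and the 'boundary' is the nearest
    power of two. Ties go to the lower power of two.
    """
    if c <= 0:
        raise ValueError("sketch handles positive constants only")
    terms = []
    sign = 1
    while c != 0:
        shift = c.bit_length() - 1
        lower = 1 << shift       # largest power of two <= c
        upper = lower << 1       # smallest power of two > c
        if upper - c < c - lower:
            # Closer to the upper boundary: take it and subtract the rest.
            terms.append((sign, shift + 1))
            c = upper - c
            sign = -sign
        else:
            # Closer to (or on) the lower boundary: take it and add the rest.
            terms.append((sign, shift))
            c -= lower
    return terms


def multiply_by_constant(x, c):
    """Multiply x by c using only shifts, additions, and subtractions."""
    return sum(sign * (x << shift) for sign, shift in decompose_constant(c))


if __name__ == "__main__":
    print(decompose_constant(15))        # [(1, 4), (-1, 0)], i.e. 16 - 1
    print(multiply_by_constant(9, 15))   # 135
```

For a constant such as 15 this uses one subtraction (16x - x) instead of the three additions that a plain binary expansion (8x + 4x + 2x + x) would need.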
Device for computing the inner product of vectors
A device for computing the inner product of vectors includes a vector data arranger, a vector data pre-accumulator, a number converter, and a post-accumulator. The vector data arranger stores a first vector and sequentially outputs a plurality of vector data based on the first vector. The vector data pre-accumulator stores a second vector, receives each of the vector data, and pre-accumulates the second vector, so as to generate a plurality of accumulation results. The number converter and the post-accumulator receive and process all the accumulation results corresponding to each of the vector data to generate an inner product value. The present invention implements a lookup table with the vector data pre-accumulator and the number converter to increase calculation speed and reduce power consumption.
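One well-known way to compute an inner product with a lookup table is bit-serial distributed arithmetic, sketched below. It is offered only as an analogy to the abstract above, not as the patented structure: the table of pre-accumulated sums plays a role similar to the pre-accumulator, and the shift-and-add loop plays a role similar to the post-accumulator. All names are hypothetical, and the first vector is assumed to hold unsigned integers below `2**bits`.

```python
def build_subset_sum_table(b):
    """Precompute the sum of every subset of b, indexed by a bit mask."""
    n = len(b)
    table = [0] * (1 << n)
    for mask in range(1, 1 << n):
        low = mask & -mask              # lowest set bit of the mask
        i = low.bit_length() - 1        # index of that bit
        table[mask] = table[mask ^ low] + b[i]
    return table


def inner_product_lut(a, b, bits=8):
    """Inner product of a and b using one table lookup per bit position of a.

    Assumption: every element of a is an unsigned integer below 2**bits.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    table = build_subset_sum_table(b)
    result = 0
    for j in range(bits):
        # Gather bit j of every element of a into a table index.
        mask = 0
        for i, a_i in enumerate(a):
            mask |= ((a_i >> j) & 1) << i
        # The looked-up value is the sum of b[i] over all i where bit j of a[i] is set.
        result += table[mask] << j
    return result


if __name__ == "__main__":
    print(inner_product_lut([3, 5, 7], [2, 4, 6]))  # 68
```

The table grows as 2**n in the vector length n, so practical designs of this kind typically work on short vector slices.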
AREA AND ENERGY EFFICIENT MULTI-PRECISION MULTIPLY-ACCUMULATE UNIT-BASED PROCESSOR
Systems, apparatuses and methods may provide for multi-precision multiply-accumulate (MAC) technology that includes a plurality of arithmetic blocks, each containing multiple multipliers, and logic to combine multipliers within each arithmetic block, across multiple arithmetic blocks, or both. In one example, one or more intermediate multipliers are of a size that is less than the precisions supported by the arithmetic blocks containing the one or more intermediate multipliers.
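The general idea of combining small multipliers into a wider one can be sketched as follows: an 8-bit by 8-bit unsigned product assembled from four 4-bit by 4-bit products. The bit widths, the unsigned format, and the function names `mul4` and `mul8_from_mul4` are assumptions made for illustration; the abstract does not specify them.

```python
def mul4(a, b):
    """Stand-in for a 4-bit x 4-bit unsigned hardware multiplier."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("mul4 operands must fit in 4 bits")
    return a * b


def mul8_from_mul4(a, b):
    """8-bit x 8-bit unsigned product built from four 4-bit multipliers."""
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must fit in 8 bits")
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # (16*a_hi + a_lo) * (16*b_hi + b_lo), expanded term by term.
    high = mul4(a_hi, b_hi) << 8
    middle = (mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4
    low = mul4(a_lo, b_lo)
    return high + middle + low


if __name__ == "__main__":
    print(mul8_from_mul4(200, 123))  # 24600
```

The same four `mul4` units could instead serve four independent 4-bit multiplications, which is the kind of trade-off a multi-precision design exploits.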
In-Memory Near-Data Approximate Acceleration
A random access memory may include memory banks and arithmetic approximation units. Each arithmetic approximation unit may be dedicated to one or more of the memory banks and include a respective multiply-and-accumulate unit and a respective lookup-table unit. The respective multiply-and-accumulate unit is configured to iteratively perform shift and add operations with two inputs and to provide a result of the shift and add operations to the respective lookup-table unit. The result approximates or is a product of the two inputs. The respective lookup-table unit is configured to produce an output by applying a pre-defined function to the result. The arithmetic approximation units are configured for parallel operation. The random access memory may also include a memory controller configured to receive instructions, from a processor, regarding locations within the memory banks from which to obtain the two inputs and in which to write the output.
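The shift-and-add multiplication followed by a table lookup can be sketched as below. This is a minimal illustration: the 4-bit unsigned inputs, the saturating function stored in the table, and all names are assumptions, and the sketch computes the exact product rather than an approximation.

```python
def shift_add_multiply(a, b, bits=4):
    """Multiply two unsigned integers with iterated shift and add steps.

    Assumption: b fits in `bits` bits. One loop pass handles one bit of b.
    """
    product = 0
    for i in range(bits):
        if (b >> i) & 1:
            product += a << i
    return product


# Hypothetical pre-defined function held in the lookup table:
# saturate the product to the range 0..127.
LUT = [min(p, 127) for p in range(256)]


def approximation_unit(a, b):
    """Shift-and-add product of two 4-bit inputs, then a table lookup."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("sketch assumes 4-bit unsigned inputs")
    return LUT[shift_add_multiply(a, b)]


if __name__ == "__main__":
    print(approximation_unit(3, 5))    # 15
    print(approximation_unit(13, 11))  # 127 (143 saturated)
```

Any function of the product can be placed in the table by changing how `LUT` is filled; the saturation used here is only a placeholder.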