Patent classifications
G06F7/533
Multiply-accumulate “0” data gating
In an example, an apparatus comprises a plurality of execution units and logic, at least partially including hardware logic, to gate at least one of a multiply unit or an accumulate unit in response to an input of value zero. Other embodiments are also disclosed and claimed.
Apparatus and method using neural network
An apparatus includes a first holding unit and a second holding unit configured to hold first-type data and second-type data, respectively, a first operation unit configured to execute a first product-sum operation based on the first-type data, a branch unit configured to output an operation result of the first product-sum operation in parallel, a sampling unit configured to sample the operation result and to output a sampling result, and a second operation unit configured to execute a second product-sum operation based on the second-type data and the sampling result.
Apparatus and method using neural network
An apparatus includes a first holding unit and a second holding unit configured to hold first-type data and second-type data, respectively, a first operation unit configured to execute a first product-sum operation based on the first-type data, a branch unit configured to output an operation result of the first product-sum operation in parallel, a sampling unit configured to sample the operation result and to output a sampling result, and a second operation unit configured to execute a second product-sum operation based on the second-type data and the sampling result.
Neural processing accelerator
A system for calculating. A scratch memory is connected to a plurality of configurable processing elements by a communication fabric including a plurality of configurable nodes. The scratch memory sends out a plurality of streams of data words. Each data word is either a configuration word used to set the configuration of a node or of a processing element, or a data word carrying an operand or a result of a calculation. Each processing element performs operations according to its current configuration and returns the results to the communication fabric, which conveys them back to the scratch memory.
Multiplier pipelining optimization with a bit folding correction
One embodiment provides a system. The system includes a register to store an operand; a multiplier; and optimizer logic to initiate a square/multiply stage to operate on the operand, initiate a reduction stage prior to completion of the square/multiply stage, and determine whether a carry propagation has occurred.
Multiplier pipelining optimization with a bit folding correction
One embodiment provides a system. The system includes a register to store an operand; a multiplier; and optimizer logic to initiate a square/multiply stage to operate on the operand, initiate a reduction stage prior to completion of the square/multiply stage, and determine whether a carry propagation has occurred.
METHOD AND APPARATUS FOR PERFORMING DEEP LEARNING OPERATIONS
Disclosed is an apparatus and method for performing deep learning operations. The apparatus includes a systolic array comprising multiplier accumulator (MAC) units, and a control circuit configured to control an operation of a multiplexer connected to at least one of the MAC units and operations of the MAC units according to a plurality of operation modes.
Method and apparatus for efficient binary and ternary support in fused multiply-add (FMA) circuits
An apparatus and method for efficiently performing a multiply add or multiply accumulate operation. For example, one embodiment of a processor comprises: a decoder to decode an instruction specifying an operation, the instruction comprising a first operand identifying a multiplier and a second operand identifying a multiplicand; and fused multiply-add (FMA) execution circuitry comprising first multiplication circuitry to perform a multiplication using the multiplicand and multiplier to generate a result for multipliers and multiplicands falling within a first precision range, and second multiplication circuitry to be used instead of the first multiplication circuitry for multipliers and multiplicands falling within a second precision range.
Neural processing accelerator
A system for calculating. A scratch memory is connected to a plurality of configurable processing elements by a communication fabric including a plurality of configurable nodes. The scratch memory sends out a plurality of streams of data words. Each data word is either a configuration word used to set the configuration of a node or of a processing element, or a data word carrying an operand or a result of a calculation. Each processing element performs operations according to its current configuration and returns the results to the communication fabric, which conveys them back to the scratch memory.
Processing apparatus and processing method
The present disclosure provides a processing device and method. The device includes: an input/output module, a controller module, a computing module, and a storage module. The input/output module is configured to store and transmit input and output data; the controller module is configured to decode a computation instruction into a control signal to control other modules to perform operation; the computing module is configured to perform four arithmetic operation, logical operation, shift operation, and complement operation on data; and the storage module is configured to temporarily store instructions and data. The present disclosure can execute a composite scalar instruction accurately and efficiently.