Patent classifications
G06F9/3893
Adaptive matrix multiplication accelerator for machine learning and deep learning applications
An adaptive matrix multiplier. In some embodiments, the matrix multiplier includes a first multiplying unit, a second multiplying unit, a memory load circuit, and an outer buffer circuit. The first multiplying unit includes a first inner buffer circuit and a second inner buffer circuit, and the second multiplying unit includes a first inner buffer circuit and a second inner buffer circuit. The memory load circuit is configured to load data from memory, in a single burst of a burst memory access mode, into the first inner buffer circuit of the first multiplying unit and into the first inner buffer circuit of the second multiplying unit.
Enhanced multiply accumulate device for neural networks
A device for performing multiply/accumulate operations processes values that are held in first and second buffers and have a first width, using a computational pipeline with a second width, such as half the first width. A sequencer processes combinations of portions (high-high, low-low, high-low, low-high) of the values in the first and second buffers using a multiply/accumulate circuit and adds the accumulated result of each combination of portions to a group accumulator. Adding to the group accumulator may be preceded by left shifting the accumulated result (by the first width for the high-high combination and by the second width for the low-high and high-low combinations).
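The shift amounts in this abstract follow from splitting each value into a high and a low half. A minimal sketch of that arithmetic, assuming unsigned operands, a 16-bit first width, and an 8-bit second width; the function names `split`, `narrow_mac`, and `split_width_dot` are hypothetical and not taken from the patent:

```python
def split(value, half_bits):
    """Return the (high, low) halves of an unsigned value."""
    mask = (1 << half_bits) - 1
    return value >> half_bits, value & mask


def narrow_mac(xs, ys):
    """Multiply/accumulate over one combination of half-width portions."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def split_width_dot(a, b, width=16):
    """Dot product of full-width values using only half-width multiplies."""
    half = width // 2
    a_parts = [split(v, half) for v in a]
    b_parts = [split(v, half) for v in b]
    a_hi = [p[0] for p in a_parts]
    a_lo = [p[1] for p in a_parts]
    b_hi = [p[0] for p in b_parts]
    b_lo = [p[1] for p in b_parts]

    group = 0
    group += narrow_mac(a_hi, b_hi) << width  # high-high: shift by first width
    group += narrow_mac(a_hi, b_lo) << half   # high-low: shift by second width
    group += narrow_mac(a_lo, b_hi) << half   # low-high: shift by second width
    group += narrow_mac(a_lo, b_lo)           # low-low: no shift
    return group
```

The result matches a direct full-width dot product because a*b = (a_hi*b_hi << width) + ((a_hi*b_lo + a_lo*b_hi) << half) + a_lo*b_lo when each value equals (hi << half) + lo.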
EXPLICIT SCHEDULING OF ON-CHIP OPERATIONS
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for obtaining a first schedule for a first hardware block of an integrated circuit device, where the first schedule identifies a first set of operations to be performed by the first hardware block. A second schedule is obtained for a second hardware block of the integrated circuit device, where the second schedule identifies a second set of operations to be performed by the second hardware block and where operations of the second schedule are coordinated with operations of the first schedule such that the first schedule triggers the first hardware block to send data to the second hardware block at a first pre-scheduled value of a counter, and the second schedule triggers the second hardware block to accept the data at an input at a second pre-scheduled value of the counter that is after the first pre-scheduled value. The first hardware block performs the first set of operations according to the first schedule, and the second hardware block performs the second set of operations according to the second schedule.
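The counter-based coordination described here can be illustrated with a small simulation. This is a sketch under stated assumptions, not the patented implementation: a single shared counter, one transfer, and a hypothetical `run_schedules` function whose parameters (`send_at`, `accept_at`, `max_count`) are invented for illustration.

```python
def run_schedules(send_at, accept_at, payload, max_count=16):
    """Step a shared counter; block 1 sends and block 2 accepts at preset values."""
    if accept_at <= send_at:
        # The abstract requires the accept value to come after the send value.
        raise ValueError("accept_at must be later than send_at")

    wire = None      # data in flight between the two blocks
    received = None  # what block 2 has latched at its input
    log = []
    for counter in range(max_count):
        if counter == send_at:
            wire = payload
            log.append((counter, "block1 send"))
        if counter == accept_at:
            received = wire
            log.append((counter, "block2 accept"))
    return received, log
```

Neither block signals the other at run time in this sketch; the ordering comes only from the two pre-scheduled counter values.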
IC including logic tile, having reconfigurable MAC pipeline, and reconfigurable memory
An integrated circuit including configurable multiplier-accumulator circuitry, wherein, during processing operations, a plurality of the multiplier-accumulator circuits are serially connected into pipelines to perform concatenated multiply and accumulate operations. The integrated circuit includes a first memory and a second memory, and a switch interconnect network, including configurable multiplexers arranged in a plurality of switch matrices. The first and second memories are configurable as either a dedicated read memory or a dedicated write memory and connected to a given pipeline, via the switch interconnect network, during a processing operation performed thereby; wherein, during a first processing operation, the first memory is dedicated to write data to a first pipeline and the second memory is dedicated to read data therefrom and, during a second processing operation, the first memory is dedicated to read data from a second pipeline and the second memory is dedicated to write data thereto.
Method and system for instruction block to execution unit grouping
A method for emulating a guest centralized flag architecture by using a native distributed flag architecture. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks, wherein each of the instruction blocks comprises two half blocks; scheduling the instructions of the instruction block to execute in accordance with a scheduler; and using a distributed flag architecture to emulate a centralized flag architecture for the emulation of guest instruction execution.
METHODS AND APPARATUS FOR DEEP LEARNING NETWORK EXECUTION PIPELINE ON MULTI-PROCESSOR PLATFORM
Methods and systems are disclosed using an execution pipeline on a multi-processor platform for deep learning network execution. In one example, a network workload analyzer receives a workload, analyzes a computation distribution of the workload, and groups the network nodes into groups. A network executor assigns each group to a processing core of the multi-core platform so that the respective processing core handles computation tasks of the received workload for the respective group.
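One way to picture the analyzer and executor roles is a cost-balanced split of an ordered node list. The sketch below is an assumption for illustration only: the abstract does not specify a grouping rule, and the greedy contiguous partition, the per-node `costs` list, and the names `group_nodes` and `assign_to_cores` are hypothetical.

```python
def group_nodes(costs, num_cores):
    """Split ordered per-node costs into at most num_cores contiguous groups."""
    target = sum(costs) / num_cores  # ideal load per core
    groups, current, load = [], [], 0
    for node, cost in enumerate(costs):
        current.append(node)
        load += cost
        # Close the group once it reaches the target, keeping one slot
        # free so the final group can absorb whatever remains.
        if load >= target and len(groups) < num_cores - 1:
            groups.append(current)
            current, load = [], 0
    if current:
        groups.append(current)
    return groups


def assign_to_cores(groups):
    """Map each group of node indices to a core index."""
    return {core: group for core, group in enumerate(groups)}
```

For example, `group_nodes([4, 1, 1, 2, 4], 3)` yields three groups of equal total cost, which `assign_to_cores` then maps to cores 0, 1, and 2.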
PARALLEL PROCESSING DEVICE
A parallel processing device includes: a plurality of memories configured to output a plurality of pieces of memory output data respectively; a plurality of input units configured to output a plurality of pieces of input unit output data respectively; a plurality of addition units configured to receive the plurality of pieces of input unit output data, perform a parallel processing function and a data path configuration function according to a plurality of configuration values, and output a plurality of pieces of addition unit output data; and a plurality of delay units configured to delay the plurality of pieces of addition unit output data according to a clock signal, and output the plurality of pieces of delay data respectively. The plurality of pieces of input unit output data are selected from the plurality of pieces of memory output data and a plurality of pieces of delay data respectively.
SUPPORTING 8-BIT FLOATING POINT FORMAT OPERANDS IN A COMPUTING ARCHITECTURE
An apparatus to facilitate supporting 8-bit floating point format operands in a computing architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that operates on 8-bit floating point operands to cause the processor to perform a parallel dot product operation; a controller to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating data format indicated by the decoded instruction; and systolic dot product circuitry to execute the decoded instruction using systolic layers, each systolic layer comprising one or more sets of interconnected multipliers, shifters, and adders, each set of multipliers, shifters, and adders to generate a dot product of the 8-bit floating point operands.
Multiplier-accumulator circuitry having processing pipelines and methods of operating same
An integrated circuit including memory to store image data and filter weights, and a plurality of multiply-accumulator execution pipelines, each multiply-accumulator execution pipeline coupled to the memory to receive (i) image data and (ii) filter weights, wherein each multiply-accumulator execution pipeline processes the image data, using associated filter weights, via a plurality of multiply and accumulate operations. In one embodiment, the multiply-accumulator circuitry of each multiply-accumulator execution pipeline, in operation, receives a different set of image data, each set including a plurality of image data, and, using filter weights associated with the received set of image data, processes the set of image data associated therewith, via performing a plurality of multiply and accumulate operations concurrently with the multiply-accumulator circuitry of the other multiply-accumulator execution pipelines, to generate output data. Each set of image data includes all of the image that correlates to the output data generated therefrom.
COMPUTER-READABLE RECORDING MEDIUM STORING INFORMATION PROCESSING PROGRAM, INFORMATION PROCESSING APPARATUS, AND INFORMATION PROCESSING METHOD
A recording medium stores a program for causing a computer to execute a process including: determining, by using each of processes included in a matrix process as a first process and using a process next to the first process as a second process, a synchronization method for processing units that process elements of a first portion of a matrix, based on the number of the processing units that process the elements of the first portion in the first process and the number of processing units that process elements of a second portion of the matrix in the second process; executing the first process by using the processing units that process the elements of the first portion; executing a synchronization process on the processing units that process the elements of the first portion; and executing the second process by using the processing units that process the elements of the second portion.