Patent classifications
G06F15/825
FLOW CONTROL FOR RECONFIGURABLE PROCESSORS
The technology disclosed relates to storing a dataflow graph with a plurality of compute nodes that transmit data along data connections, and controlling data transmission between compute nodes in the plurality of compute nodes along the data connections by using control connections to control writing of data.
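The abstract does not say how the control connections gate writes. As an illustration only, the following minimal sketch assumes a credit-based scheme: the producer may write over the data connection only while it holds a credit, and the consumer returns a credit over the control connection each time it frees a buffer slot. The names `Producer`, `Consumer`, and `run_flow` are hypothetical and not taken from the patent.

```python
from collections import deque


class Consumer:
    """Hypothetical consumer node with a bounded input buffer."""

    def __init__(self, capacity):
        self.buffer = deque()
        self.capacity = capacity

    def write(self, item):
        # Data connection: the producer writes into the consumer's buffer.
        assert len(self.buffer) < self.capacity, "write without permission"
        self.buffer.append(item)

    def read(self):
        # Consuming an item frees one buffer slot.
        return self.buffer.popleft()


class Producer:
    """Hypothetical producer node that writes only while it holds credits."""

    def __init__(self, consumer):
        self.consumer = consumer
        self.credits = consumer.capacity  # assumed: one credit per free slot

    def try_send(self, item):
        if self.credits == 0:
            return False  # control state forbids writing right now
        self.credits -= 1
        self.consumer.write(item)
        return True

    def on_credit(self):
        # Control connection: the consumer signals one freed slot.
        self.credits += 1


def run_flow(items, capacity):
    """Send all items; return (received items in order, number of stalls)."""
    consumer = Consumer(capacity)
    producer = Producer(consumer)
    pending = deque(items)
    received, stalls = [], 0
    while pending:
        if producer.try_send(pending[0]):
            pending.popleft()
        else:
            stalls += 1
            received.append(consumer.read())
            producer.on_credit()
    while consumer.buffer:
        received.append(consumer.read())
    return received, stalls
```

With a two-slot buffer, the producer stalls whenever both slots are full, and the data still arrives in order without overwriting unread entries.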
Mask field propagation among memory-compute tiles in a reconfigurable architecture
A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. The tiles can be arranged in an array or grid and can be communicatively coupled. In an example, a first node can include a tile cluster of N memory-compute tiles, and the N memory-compute tiles can be coupled using a first portion of a synchronous compute fabric. Operations performed by the respective processing and storage elements of the N memory-compute tiles can be selectively enabled or disabled based on information in a mask field of data propagated through the first portion of the synchronous compute fabric.
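To make the mask-field idea concrete, here is a minimal sketch that assumes the mask is an integer bit field in which bit `i` enables tile `i`. The abstract does not specify that encoding; the function name `run_tile_cluster` and the example operations are hypothetical.

```python
def run_tile_cluster(value, mask, tile_ops):
    """Pass (value, mask) through N tiles in order.

    Assumption: bit i of `mask` enables the operation of tile i.
    A disabled tile forwards the value unchanged.
    """
    for i, op in enumerate(tile_ops):
        if (mask >> i) & 1:
            value = op(value)
    return value


# Three hypothetical tile operations.
ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
```

For example, `run_tile_cluster(5, 0b101, ops)` applies only the first and third operations, while a mask of `0` forwards the input untouched.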
Buffer Splitting
A method in a reconfigurable computing system includes receiving a user program for execution on a reconfigurable dataflow computing system that comprises a grid of compute units and a grid of memory units interconnected with a switching array. The user program includes multiple tensor-based algebraic expressions that are converted to an intermediate representation comprising one or more logical operations executable via dataflow through compute units. Each of these logical operations is preceded or followed by a buffer, and each buffer corresponds to one or more memory units. The method includes determining whether splitting a selected buffer yields a reduced cost and, if so, splitting the selected buffer to produce first and second buffers. Dataflow through memory units corresponding to the first and second buffers is controlled by one or more memory units within the grid of memory units. The buffer splitting optimization reduces memory unit consumption.
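The split decision can be illustrated with a toy cost model. The abstract does not define the cost function, so everything here is an assumption: cost is the number of memory units a buffer occupies, a buffer keeps `depth` stages live, and splitting lets each half use its own depth. `UNIT_CAPACITY`, `Buffer`, `memory_units`, and `maybe_split` are hypothetical names.

```python
import math
from dataclasses import dataclass

UNIT_CAPACITY = 1024  # hypothetical number of elements per memory unit


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int   # elements stored per stage
    depth: int  # stages kept live at once (toy assumption)


def memory_units(buf):
    """Toy cost: memory units needed to hold size * depth elements."""
    return math.ceil(buf.size * buf.depth / UNIT_CAPACITY)


def maybe_split(buf, first_size, first_depth, second_depth):
    """Split `buf` in two only if the split lowers the toy cost."""
    first = Buffer(buf.name + "_a", first_size, first_depth)
    second = Buffer(buf.name + "_b", buf.size - first_size, second_depth)
    if memory_units(first) + memory_units(second) < memory_units(buf):
        return [first, second]
    return [buf]
```

Under this model, a 2048-element buffer held at depth 4 costs 8 units, whereas splitting it into a depth-4 half and a depth-1 half costs 5, so the split is taken. A small buffer that already fits in one unit is left alone.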
EXECUTION ENGINE FOR EXECUTING SINGLE ASSIGNMENT PROGRAMS WITH AFFINE DEPENDENCIES
The execution engine is a new organization for a digital data processing apparatus, suitable for highly parallel execution of structured fine-grain parallel computations. The execution engine includes a memory for storing data and a domain flow program; a controller for requesting the domain flow program from the memory and for translating the program into programming information; a processor fabric for processing the domain flow programming information; and a crossbar for sending tokens and the programming information to the processor fabric.
DATA FLOW CONTROL DEVICE, DATA FLOW CONTROL METHOD, AND DATA FLOW CONTROL PROGRAM
A data flow control device includes processing circuitry configured to calculate a third data flow from among data flows obtained by commonizing and integrating at least a part of a first data flow, which is in operation in a system that processes data, with an input second data flow. The third data flow is one whose resource use amount, when operated in the system, satisfies a predetermined condition. The processing circuitry is further configured to instruct the system to switch from the first data flow to the third data flow.
OVERLAY LAYER HARDWARE UNIT FOR NETWORK OF PROCESSOR CORES
Methods and systems for executing an application data flow graph on a set of computational nodes are disclosed. The computational nodes can each include a programmable controller from a set of programmable controllers, a memory from a set of memories, a network interface unit from a set of network interface units, and an endpoint from a set of endpoints. A disclosed method comprises configuring the programmable controllers with instructions. The method also comprises independently and asynchronously executing the instructions using the set of programmable controllers in response to a set of events exchanged between the programmable controllers themselves, between the programmable controllers and the network interface units, and between the programmable controllers and the set of endpoints. The method also comprises transitioning data in the set of memories on the computational nodes in accordance with the application data flow graph and in response to the execution of the instructions.
Reservoir computing data flow processor
A reservoir computing data flow processor includes a plurality of reservoir units that together constitute a reservoir. The reservoir can be reconfigured by changing the connection relationship between the reservoir units. Each of the reservoir units is an operation unit block configured to execute a predetermined operation. The operation unit block includes a first adder configured to perform an addition operation on at least two inputs, a nonlinear operator configured to apply a nonlinear function to an output from the first adder or to a result of multiplying that output by a predetermined coefficient, and a second adder configured to perform an addition operation on at least two inputs, including an output from the nonlinear operator or a result of multiplying that output by a predetermined coefficient.
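The operation unit block described above can be written as a single function. This is a minimal sketch with exactly two inputs per adder. The choice of `tanh` as the nonlinear function and the parameter names `pre_gain` and `post_gain` are assumptions, since the abstract only speaks of "a nonlinear function" and "a predetermined coefficient".

```python
import math


def reservoir_unit(a, b, c, pre_gain=1.0, post_gain=1.0, f=math.tanh):
    """One operation unit block: adder -> nonlinearity -> adder.

    a, b      : inputs to the first adder
    c         : second input to the second adder
    pre_gain  : assumed coefficient applied before the nonlinearity
    post_gain : assumed coefficient applied after the nonlinearity
    f         : nonlinear function (tanh chosen for illustration)
    """
    s = a + b                 # first adder
    y = f(pre_gain * s)       # nonlinear operator on the scaled sum
    return post_gain * y + c  # second adder
```

With both gains left at 1.0, the block reduces to `f(a + b) + c`. Connecting the output of one unit to an input of another is what changing the connection relationship would amount to in this sketch.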
COMPILER OPERATIONS FOR TENSOR STREAMING PROCESSOR
Embodiments are directed to a processor having a functional slice architecture. The processor is divided into tiles (or functional units) organized into a plurality of functional slices. The functional slices are configured to perform specific operations within the processor, which includes memory slices for storing operand data and arithmetic logic slices for performing operations on received operand data (e.g., vector processing, matrix manipulation). The processor includes a plurality of functional slices of a module type, each functional slice having a plurality of tiles. The processor further includes a plurality of data transport lanes for transporting data in a direction indicated in a corresponding instruction. The processor also includes a plurality of instruction queues, each instruction queue associated with a corresponding functional slice of the plurality of functional slices, wherein the instructions in the instruction queues comprise a functional slice specific operation code.
Instruction format and instruction set architecture for tensor streaming processor
Embodiments are directed to a processor having a functional slice architecture. The processor is divided into tiles (or functional units) organized into a plurality of functional slices. The functional slices are configured to perform specific operations within the processor, which includes memory slices for storing operand data and arithmetic logic slices for performing operations on received operand data (e.g., vector processing, matrix manipulation). The processor includes a plurality of functional slices of a module type, each functional slice having a plurality of tiles. The processor further includes a plurality of data transport lanes for transporting data in a direction indicated in a corresponding instruction. The processor also includes a plurality of instruction queues, each instruction queue associated with a corresponding functional slice of the plurality of functional slices, wherein the instructions in the instruction queues comprise a functional slice specific operation code.
Disaggregation of processing pipeline
A method for processing includes receiving a definition of a processing pipeline including multiple sequential processing stages. The processing pipeline is partitioned into a plurality of partitions. The first partition of the processing pipeline is executed on a first computational accelerator, whereby the first computational accelerator writes output data from a final stage of the first partition to an output buffer in a first memory. The output data are copied over a packet communication network to an input buffer in a second memory. The second partition of the processing pipeline is executed on a second computational accelerator using the copied output data in the second memory as input data to a first stage of the second partition.
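The partition-and-copy flow can be sketched in a few lines. This is only an illustration under simplifying assumptions: stages are plain functions on lists, the two accelerators are simulated by ordinary function calls, and the copy over the packet network is stood in for by a list copy. `partition`, `run_partition`, and `run_disaggregated` are hypothetical names.

```python
def partition(stages, split_at):
    """Split the sequential stages into two consecutive partitions."""
    return stages[:split_at], stages[split_at:]


def run_partition(stages, data):
    """Run the stages of one partition in order on one (simulated) accelerator."""
    for stage in stages:
        data = stage(data)
    return data


def run_disaggregated(stages, split_at, data):
    first, second = partition(stages, split_at)
    # First accelerator writes its final-stage output to a buffer in the first memory.
    output_buffer = run_partition(first, data)
    # Stand-in for the network copy into an input buffer in the second memory.
    input_buffer = list(output_buffer)
    # Second accelerator consumes the copied data as input to its first stage.
    return run_partition(second, input_buffer)


# Three hypothetical pipeline stages.
stages = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 1 for x in xs],
]
```

Whatever the split point, the disaggregated run should produce the same result as running all stages on one device, which is a useful check on any real partitioning.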