G06F15/825

All Reduce Across Multiple Reconfigurable Dataflow Processors
20230409520 · 2023-12-21

A method for a reconfigurable computing system includes receiving a compute graph for execution on multiple reconfigurable dataflow processors (RDPs) interconnected with a ring network having R interconnected RDPs. A node of the compute graph that specifies a reduction operation for a first and second tensor is detected. The detected compute graph node is partitioned into a compute subgraph corresponding to an RDP of the R interconnected RDPs. A first node is inserted into the compute subgraph that specifies a partial reduction operation for producing a partial reduction result corresponding to a shard of the first tensor and a shard of the second tensor. A second node is inserted for communicating the partial reduction result to an adjacent RDP. A third node is inserted that specifies a reduction operation for producing a total reduction result. A fourth node is inserted for communicating the total reduction result to at least one other RDP.
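The four inserted nodes can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the patented method: the reduction is assumed to be a dot product, each simulated device is assumed to hold one contiguous shard of each tensor, and the function name `ring_all_reduce` is hypothetical.

```python
from typing import List


def ring_all_reduce(a: List[float], b: List[float], r: int) -> List[float]:
    """Simulate r ring-connected devices that all end up with dot(a, b).

    Assumptions (not from the abstract): the reduction is a dot product and
    the tensors split evenly into r contiguous shards.
    """
    assert len(a) == len(b) and len(a) % r == 0
    n = len(a) // r

    # Node 1: each device reduces its own pair of shards to a partial result.
    partial = [
        sum(x * y for x, y in zip(a[i * n:(i + 1) * n], b[i * n:(i + 1) * n]))
        for i in range(r)
    ]

    # Nodes 2 and 3: the running value travels to the adjacent device, which
    # adds its own partial result; the last device holds the total.
    running = partial[0]
    for i in range(1, r):
        running = running + partial[i]

    # Node 4: the total is forwarded around the ring to every other device.
    results: List[float] = [0.0] * r
    holder = r - 1
    results[holder] = running
    for step in range(1, r):
        results[(holder + step) % r] = running
    return results


totals = ring_all_reduce([1, 2, 3, 4], [5, 6, 7, 8], r=2)
```

With two devices, the partial results are 17 and 53, so both devices end with the total 70.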

Dataflow Triggered Tasks for Accelerated Deep Learning

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element and a routing element. Each compute element has memory. Each router enables communication via wavelets with nearest neighbors in a 2D mesh. Routing is controlled by respective virtual channel specifiers in each wavelet and routing configuration information in each router. A compute element receives a particular wavelet comprising a particular virtual channel specifier and a particular data element. Instructions are read from the memory of the compute element based at least in part on the particular virtual channel specifier. The particular data element is used as an input operand to execute at least one of the instructions.
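The dispatch described above, where the virtual channel specifier selects the instructions and the data element supplies an operand, might be modeled as follows. This is a toy sketch: the class names `Wavelet` and `ComputeElement`, the accumulator, and the table of Python callables standing in for instruction memory are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Wavelet:
    virtual_channel: int  # selects which task (instructions) to run
    data: float           # used as an input operand of that task


class ComputeElement:
    def __init__(self, tasks: Dict[int, Callable[[float, float], float]]):
        # Hypothetical stand-in for instruction memory, indexed by channel.
        self.tasks = tasks
        self.accumulator = 0.0

    def receive(self, w: Wavelet) -> float:
        # Look up the task by virtual channel, then apply it to the data.
        task = self.tasks[w.virtual_channel]
        self.accumulator = task(self.accumulator, w.data)
        return self.accumulator


ce = ComputeElement({0: lambda acc, x: acc + x, 1: lambda acc, x: acc * x})
first = ce.receive(Wavelet(virtual_channel=0, data=3.0))   # add task
second = ce.receive(Wavelet(virtual_channel=1, data=2.0))  # multiply task
```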

EVENT PROCESSING
20210081145 · 2021-03-18

An event-processing unit (EPU) for processing tokens associated with a state or state transition, herein also referred to as an event, of an external device is disclosed. The EPU allows token-processing schemes in which the processing of incoming tokens and the further handling of a processing result by the EPU are determined not only by the token identifier, but also by the payload data of the incoming token or by data in the data memory. A flag-processing capability of a processing-control stage allows applying flag-processing operations such as logical operations to data obtained as a processing result of an ALU-processing operation. The result of these operations determines a subsequent handling of ALU-result data by the EPU. Thus, whether or not the ALU-result data is written to the data memory also influences the processing of any subsequent incoming tokens for which that data is used in the ALU-processing operation.

ENABLING ACCELERATED PROCESSING UNITS TO PERFORM DATAFLOW EXECUTION

Methods and systems are disclosed for performing dataflow execution by an accelerated processing unit (APU). Techniques disclosed include decoding information from one or more dataflow instructions. The decoded information is associated with dataflow execution of a computational task. Techniques disclosed further include configuring, based on the decoded information, dataflow circuitry, and, then, executing the dataflow execution of the computational task using the dataflow circuitry.

Reconfigurable computer accelerator providing stream processor and dataflow processor

A reconfigurable hardware accelerator for computers combines a high-speed dataflow processor, having programmable functional units rapidly reconfigured in a network of programmable switches, with a stream processor that may autonomously access memory in predefined access patterns after receiving simple stream instructions. The result is a compact, high-speed processor that may exploit parallelism associated with many application-specific programs susceptible to acceleration.

APPARATUSES, METHODS, AND SYSTEMS FOR TIME-MULTIPLEXING IN A CONFIGURABLE SPATIAL ACCELERATOR

Systems, methods, and apparatuses relating to time-multiplexing circuitry in a configurable spatial accelerator are described. In one embodiment, a configurable spatial accelerator (CSA) includes a plurality of processing elements; and a time-multiplexed, circuit switched interconnect network between the plurality of processing elements. In another embodiment, a configurable spatial accelerator (CSA) includes a plurality of time-multiplexed processing elements; and a time-multiplexed, circuit switched interconnect network between the plurality of time-multiplexed processing elements.

WAVELET REPRESENTATION FOR ACCELERATED DEEP LEARNING

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element with dedicated storage and a routing element. Each router enables communication with nearest neighbors in a 2D mesh. The communication is via wavelets in accordance with a representation comprising an index specifier, a virtual channel specifier, a task specifier, a data element specifier, and an optional control/data specifier. The virtual channel specifier and the task specifier are associated with one or more instructions. The index specifier and the data element are optionally associated with operands of the one or more instructions.

Commit logic and precise exceptions in explicit dataflow graph execution architectures

Systems and methods are disclosed for executing instructions with a block-based processor. Instructions can be executed in any order as their dependencies arrive, but the individual instructions are committed in a serial fashion. Further, exception handling can be performed by storing transient state for an instruction block and resuming by restoring the transient state. This allows programmers to see intermediate state for the instruction block before the subject block has committed. In one example of the disclosed technology, a method of operating a processor executing a block-based instruction set architecture includes: executing at least one instruction encoded for an instruction block; responsive to determining that an individual instruction of the instruction block can commit, advancing a commit frontier for the instruction block to include all instructions in the instruction block that can commit; and committing one or more instructions inside the advanced commit frontier.
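One way to picture the commit frontier is as an index into the block's instructions in program order. The sketch below assumes that "committed in a serial fashion" means the frontier only advances over a contiguous run of instructions that are ready to commit; the function name and the boolean list are hypothetical illustration, not the claimed hardware.

```python
from typing import List


def advance_commit_frontier(frontier: int, can_commit: List[bool]) -> int:
    """Return the new frontier: the index of the first instruction that
    cannot yet commit, scanning forward from the current frontier.

    Assumption: instructions may finish in any order, but commit is serial,
    so the frontier stops at the first instruction that is not ready.
    """
    while frontier < len(can_commit) and can_commit[frontier]:
        frontier += 1
    return frontier


# Instruction 3 finished early, but instruction 2 blocks the frontier.
frontier = advance_commit_frontier(0, [True, True, False, True])
# Once instruction 2 is ready, the frontier covers the whole block.
frontier_after = advance_commit_frontier(frontier, [True, True, True, True])
```

Instructions with an index below the returned frontier are the ones eligible to be committed.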

Data processing apparatus, data processing method, and program recording medium
10789203 · 2020-09-29

A process set selection unit generates, based on a process set comprising a processing block performing arithmetic on a group of inputs and a group of outputs produced by the processing block, a group of new inputs having a combination number less than that of the group of inputs and a new processing block for the group of new inputs. A reuse execution unit prepares, based on the new processing block for performing arithmetic on the group of new inputs and a group of outputs produced by the new processing block, an associated result which associates the group of new inputs with the group of outputs, produces the group of outputs obtained from the associated result if the group of new inputs has values equal to those of the group of inputs, and, if not, executes the new processing block and registers the execution result in the associated result.
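The reuse step resembles memoization: outputs are looked up by input values and the block is executed only on a miss. The sketch below covers only that lookup-or-execute behavior under that reading; the class name `ReuseTable` and the `executions` counter are hypothetical, and the input-reduction step performed by the process set selection unit is not modeled.

```python
from typing import Any, Callable, Dict, Tuple


class ReuseTable:
    """Associates input values with previously produced outputs."""

    def __init__(self, block: Callable[..., Any]):
        self.block = block
        self.table: Dict[Tuple[Any, ...], Any] = {}
        self.executions = 0  # counts how often the block actually ran

    def run(self, *inputs: Any) -> Any:
        key = tuple(inputs)
        if key in self.table:
            # Hit: reuse the recorded outputs without executing the block.
            return self.table[key]
        # Miss: execute the block and register the result for later reuse.
        self.executions += 1
        outputs = self.block(*inputs)
        self.table[key] = outputs
        return outputs


reuse = ReuseTable(lambda x, y: x + y)
first = reuse.run(1, 2)   # executes the block
second = reuse.run(1, 2)  # served from the table
```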

REACH-BASED EXPLICIT DATAFLOW PROCESSORS, AND RELATED COMPUTER-READABLE MEDIA AND METHODS

Exemplary reach-based explicit dataflow processors and related computer-readable media and methods are disclosed. The reach-based explicit dataflow processors are configured to support execution of producer instructions encoded with explicit naming of consumer instructions intended to consume the values produced by the producer instructions. The reach-based explicit dataflow processors are configured to make available produced values as inputs to explicitly named consumer instructions as a result of processing producer instructions. The reach-based explicit dataflow processors support execution of a producer instruction that explicitly names a consumer instruction by using the producer instruction itself as the relative reference point. This reach-based explicit naming architecture does not require instructions to be grouped in instruction blocks to support a fixed block reference point for explicit naming of consumer instructions, and thus is not limited to explicit naming of consumer instructions only within the same instruction block of the producer instruction.
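Reach-based naming can be illustrated with a toy interpreter in which each producer names its consumers by a forward offset from its own position. This is a minimal sketch under assumptions: the `Instr` record, the `reaches` field, the forward-only offsets, and in-order execution are all hypothetical simplifications, not the disclosed encoding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Instr:
    op: Callable[..., int]
    reaches: Tuple[int, ...] = ()  # forward offsets from this instruction
    inputs: List[int] = field(default_factory=list)


def run(program: List[Instr]) -> List[int]:
    results = []
    for pc, ins in enumerate(program):
        value = ins.op(*ins.inputs)
        results.append(value)
        # Deliver the produced value to each consumer named relative to pc.
        for reach in ins.reaches:
            program[pc + reach].inputs.append(value)
    return results


program = [
    Instr(op=lambda: 2, reaches=(2,)),  # consumer is 2 instructions ahead
    Instr(op=lambda: 3, reaches=(1,)),  # consumer is 1 instruction ahead
    Instr(op=lambda x, y: x + y),       # consumes both produced values
]
values = run(program)
```

Because each consumer is addressed relative to its producer, nothing in the sketch depends on a block boundary or a fixed block reference point.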