IPIQ

G06F15/825

All Reduce Across Multiple Reconfigurable Dataflow Processors

20230409520 · 2023-12-21 ·

SambaNova Systems, Inc.

Mingran Wang

A method for a reconfigurable computing system includes receiving a compute graph for execution on multiple reconfigurable dataflow processors (RDPs) interconnected by a ring network of R RDPs. A node of the compute graph that specifies a reduction operation for a first tensor and a second tensor is detected. The detected node is partitioned into a compute subgraph corresponding to an RDP of the R interconnected RDPs. A first node is inserted into the compute subgraph that specifies a partial reduction operation for producing a partial reduction result corresponding to a shard of the first tensor and a shard of the second tensor. A second node is inserted for communicating the partial reduction result to an adjacent RDP. A third node is inserted that specifies a reduction operation for producing a total reduction result. A fourth node is inserted for communicating the total reduction result to at least one other RDP.
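
The four inserted nodes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented method: the abstract does not name the reduction, so a dot product is assumed here, and `shard` and `ring_all_reduce` are hypothetical helper names.

```python
def shard(tensor, r):
    """Split a flat list into r contiguous shards (trailing shards may be shorter or empty)."""
    size = -(-len(tensor) // r)  # ceiling division
    return [tensor[i * size:(i + 1) * size] for i in range(r)]


def ring_all_reduce(first, second, r):
    """Simulate R ring-connected RDPs; the dot-product reduction is an assumption."""
    a_shards, b_shards = shard(first, r), shard(second, r)

    # Node 1: each RDP reduces its own pair of shards to a partial result.
    partials = [sum(x * y for x, y in zip(a, b))
                for a, b in zip(a_shards, b_shards)]

    # Nodes 2 and 3: the running value hops to the adjacent RDP, which folds
    # in its own partial; the last RDP in the ring ends up holding the total.
    running = partials[0]
    for rdp in range(1, r):
        running += partials[rdp]
    total = running

    # Node 4: the total is forwarded around the ring so every RDP has a copy.
    results = [None] * r
    holder = r - 1
    for hop in range(r):
        results[(holder + hop) % r] = total
    return results


print(ring_all_reduce([1, 2, 3, 4], [5, 6, 7, 8], 2))  # [70, 70]
```

The accumulate-then-forward order is one simple choice; a real ring implementation might instead overlap the reduction and distribution phases.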

Dataflow Triggered Tasks for Accelerated Deep Learning

20210056400 · 2021-02-25 ·

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element and a routing element. Each compute element has memory. Each router enables communication via wavelets with nearest neighbors in a 2D mesh. Routing is controlled by respective virtual channel specifiers in each wavelet and routing configuration information in each router. A compute element receives a particular wavelet comprising a particular virtual channel specifier and a particular data element. Instructions are read from the memory of the compute element based at least in part on the particular virtual channel specifier. The particular data element is used as an input operand to execute at least one of the instructions.
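
The dispatch step in the last three sentences can be sketched as follows. This is an illustrative model only: `Wavelet`, `ComputeElement`, the accumulator, and the use of Python callables as "instructions" are all assumptions, not details from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wavelet:
    virtual_channel: int  # selects which instructions to read from memory
    data: float           # used as an input operand


class ComputeElement:
    """Hypothetical compute element whose memory maps channels to instructions."""

    def __init__(self, instructions_by_channel):
        self.memory = dict(instructions_by_channel)
        self.accumulator = 0.0  # assumed local state, not named in the abstract

    def receive(self, wavelet):
        # Read instructions based on the wavelet's virtual channel specifier,
        # then execute them with the wavelet's data element as an operand.
        for instruction in self.memory[wavelet.virtual_channel]:
            self.accumulator = instruction(self.accumulator, wavelet.data)
        return self.accumulator


ce = ComputeElement({
    0: [lambda acc, x: acc + x],  # channel 0: accumulate
    1: [lambda acc, x: acc * x],  # channel 1: scale
})
ce.receive(Wavelet(virtual_channel=0, data=2.0))
print(ce.receive(Wavelet(virtual_channel=1, data=3.0)))  # 6.0
```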

EVENT PROCESSING

20210081145 · 2021-03-18 ·

Alexander WEISS

An event-processing unit (EPU) for processing tokens associated with a state or state transition of an external device, herein also referred to as an event, is disclosed. The EPU allows token-processing schemes in which the processing of incoming tokens and the further handling of a processing result by the EPU are determined not only by the token identifier, but also by the payload data of the incoming token or by data in the data memory. A flag-processing capability of a processing-control stage allows applying flag-processing operations, such as logical operations, to data obtained as a processing result of an ALU-processing operation. The result of these operations determines the subsequent handling of ALU-result data by the EPU. Thus, whether or not the ALU-result data is written to the data memory also influences the processing of any subsequent incoming tokens for which that data is used in the ALU-processing operation.

ENABLING ACCELERATED PROCESSING UNITS TO PERFORM DATAFLOW EXECUTION

20230418782 · 2023-12-28 ·

Advanced Micro Devices, Inc.

Methods and systems are disclosed for performing dataflow execution by an accelerated processing unit (APU). Techniques disclosed include decoding information from one or more dataflow instructions. The decoded information is associated with dataflow execution of a computational task. Techniques disclosed further include configuring dataflow circuitry based on the decoded information and then performing the dataflow execution of the computational task using the dataflow circuitry.

Reconfigurable computer accelerator providing stream processor and dataflow processor

11853244 · 2023-12-26 ·

Wisconsin Alumni Research Foundation

A reconfigurable hardware accelerator for computers combines a high-speed dataflow processor, having programmable functional units rapidly reconfigured in a network of programmable switches, with a stream processor that may autonomously access memory in predefined access patterns after receiving simple stream instructions. The result is a compact, high-speed processor that may exploit parallelism associated with many application-specific programs susceptible to acceleration.

APPARATUSES, METHODS, AND SYSTEMS FOR TIME-MULTIPLEXING IN A CONFIGURABLE SPATIAL ACCELERATOR

20200409709 · 2020-12-31 ·

Systems, methods, and apparatuses relating to time-multiplexing circuitry in a configurable spatial accelerator are described. In one embodiment, a configurable spatial accelerator (CSA) includes a plurality of processing elements and a time-multiplexed, circuit-switched interconnect network between the plurality of processing elements. In another embodiment, a CSA includes a plurality of time-multiplexed processing elements and a time-multiplexed, circuit-switched interconnect network between the plurality of time-multiplexed processing elements.

WAVELET REPRESENTATION FOR ACCELERATED DEEP LEARNING

20200364546 · 2020-11-19 ·

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element with dedicated storage and a routing element. Each router enables communication with nearest neighbors in a 2D mesh. The communication is via wavelets in accordance with a representation comprising an index specifier, a virtual channel specifier, a task specifier, a data element specifier, and an optional control/data specifier. The virtual channel specifier and the task specifier are associated with one or more instructions. The index specifier and the data element are optionally associated with operands of the one or more instructions.

Commit logic and precise exceptions in explicit dataflow graph execution architectures

10824429 · 2020-11-03 ·

Microsoft Technology Licensing, LLC

Systems and methods are disclosed for executing instructions with a block-based processor. Instructions can be executed in any order as their dependencies arrive, but the individual instructions are committed in a serial fashion. Further, exception handling can be performed by storing transient state for an instruction block and resuming by restoring the transient state. This allows programmers to see intermediate state for the instruction block before the subject block has committed. In one example of the disclosed technology, a method of operating a processor executing a block-based instruction set architecture includes: executing at least one instruction encoded for an instruction block; responsive to determining that an individual instruction of the instruction block can commit, advancing a commit frontier for the instruction block to include all instructions in the instruction block that can commit; and committing one or more instructions inside the advanced commit frontier.
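
The commit-frontier step can be sketched in a few lines. This is a minimal illustration that assumes the frontier is an index into the block in program order and that it stops at the first instruction that cannot yet commit; the abstract does not spell out either detail, and `advance_commit_frontier` is a hypothetical name.

```python
def advance_commit_frontier(frontier, can_commit):
    """Return the new frontier: every instruction with index < frontier is committed.

    `can_commit[i]` says whether instruction i of the block is ready to commit.
    Instructions may become ready in any order, but the frontier only moves
    past a contiguous run of ready instructions, so commits stay serial.
    """
    while frontier < len(can_commit) and can_commit[frontier]:
        frontier += 1
    return frontier


# Instruction 3 finished early, but instruction 2 has not, so it must wait.
ready = [True, True, False, True]
frontier = advance_commit_frontier(0, ready)
print(frontier)  # 2

ready[2] = True  # instruction 2 becomes ready
print(advance_commit_frontier(frontier, ready))  # 4
```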

Data processing apparatus, data processing method, and program recording medium

10789203 · 2020-09-29 ·

NEC Corporation

Takamichi Miyamoto

A process set selection unit generates, based on a process set comprising a processing block performing arithmetic on a group of inputs and a group of outputs produced by the processing block, a group of new inputs having a combination number less than that of the group of inputs and a new processing block for the group of new inputs. A reuse execution unit prepares, based on the new processing block for performing arithmetic on the group of new inputs and a group of outputs produced by the new processing block, an associated result which associates the group of new inputs with the group of outputs. The reuse execution unit produces the group of outputs obtained from the associated result if the group of new inputs has values equal to those of the group of inputs and, if not, executes the new processing block and registers the executed result in the associated result.
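
The lookup-or-execute behavior of the reuse execution unit resembles memoization. The sketch below covers only that part and makes several assumptions: the associated result is modeled as a dictionary keyed on input values, the processing block is an ordinary Python function, and `ReuseExecutor` is a hypothetical name. The selection of a smaller input group is not modeled.

```python
class ReuseExecutor:
    """Hypothetical reuse execution unit backed by a table of past results."""

    def __init__(self, block):
        self.block = block   # the (new) processing block
        self.table = {}      # associated result: inputs -> outputs
        self.executions = 0  # how often the block actually ran

    def run(self, *inputs):
        key = tuple(inputs)
        if key not in self.table:
            # No matching entry: execute the block and register the result.
            self.executions += 1
            self.table[key] = self.block(*inputs)
        # Matching entry (old or just registered): reuse the stored outputs.
        return self.table[key]


unit = ReuseExecutor(lambda a, b: (a + b, a * b))
print(unit.run(2, 3))   # (5, 6), computed
print(unit.run(2, 3))   # (5, 6), reused from the table
print(unit.executions)  # 1
```

A smaller input group means fewer distinct keys, which is presumably why reducing the combination number makes reuse more likely to hit.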

REACH-BASED EXPLICIT DATAFLOW PROCESSORS, AND RELATED COMPUTER-READABLE MEDIA AND METHODS

20200301877 · 2020-09-24 ·

Exemplary reach-based explicit dataflow processors and related computer-readable media and methods are disclosed. The reach-based explicit dataflow processors are configured to support execution of producer instructions encoded with explicit naming of consumer instructions intended to consume the values produced by the producer instructions. The reach-based explicit dataflow processors are configured to make available produced values as inputs to explicitly named consumer instructions as a result of processing producer instructions. The reach-based explicit dataflow processors support execution of a producer instruction that explicitly names a consumer instruction by its position relative to the producer instruction, which serves as the reference point. This reach-based explicit naming architecture does not require instructions to be grouped in instruction blocks to support a fixed block reference point for explicit naming of consumer instructions, and thus is not limited to explicit naming of consumer instructions only within the same instruction block of the producer instruction.
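
Reach-based naming can be pictured with a toy interpreter. This is a sketch under assumptions: each producer lists its consumers as a (reach, operand slot) pair, where reach is the forward distance in instructions from the producer. `Instr`, `run`, and this encoding are hypothetical and not taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Instr:
    op: Callable                                   # computes the produced value
    consumers: list = field(default_factory=list)  # (reach, operand slot) pairs
    operands: dict = field(default_factory=dict)   # filled in by producers


def run(program):
    """Execute in order, delivering each produced value to its named consumers."""
    values = []
    for pc, ins in enumerate(program):
        args = [ins.operands[slot] for slot in sorted(ins.operands)]
        value = ins.op(*args)
        values.append(value)
        for reach, slot in ins.consumers:
            # The consumer is named relative to the producer's own position,
            # so no block-level reference point is needed.
            program[pc + reach].operands[slot] = value
    return values


program = [
    Instr(op=lambda: 3, consumers=[(2, 0)]),  # feeds slot 0 of the instruction 2 ahead
    Instr(op=lambda: 4, consumers=[(1, 1)]),  # feeds slot 1 of the instruction 1 ahead
    Instr(op=lambda a, b: a + b),             # consumes both values
]
print(run(program))  # [3, 4, 7]
```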
