G06F15/825

DATA FLOWS IN A PROCESSOR WITH A DATA FLOW MANAGER

Methods, apparatuses, and systems for implementing data flows in a processor are described herein. A data flow manager may be configured to generate a configuration packet for a compute operation based on status information regarding multiple processing elements of the processor. Accordingly, multiple processing elements of a processor may concurrently process data flows based on the configuration packet. For example, the multiple processing elements may implement a mapping of processing elements to memory, while also implementing identified paths, through the processor, for the data flows. After executing the compute operation at certain processing elements of the processor, the processing results may be provided. In speech signal processing operations, the processing results may be compared to phonemes to identify such components of human speech in the processing results. Once dynamically identified, the processing elements may continue comparing additional components of human speech to facilitate processing of an audio recording, for example.

FPGA-BASED GRAPH DATA PROCESSING METHOD AND SYSTEM THEREOF
20200242072 · 2020-07-30 ·

An FPGA-based graph data processing method is provided for executing graph traversals on a graph having characteristics of a small-world network. The method uses a first processor, which is a CPU, and a second processor, which is an FPGA in communicative connection with the first processor. The first processor sends graph data to be traversed to the second processor and, after the second processor has completed the graph traversals by executing level traversals, obtains the result data from the second processor for output. The second processor comprises a sparsity processing module and a density processing module. The sparsity processing module operates in a beginning stage and/or an ending stage of the graph traversals, and the density processing module, which has a higher degree of parallelism than the sparsity processing module, operates in the intermediate stage of the graph traversals.
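The staged traversal can be illustrated in software. The following is a minimal sketch, not the disclosed FPGA design: it assumes an undirected graph, and the function name `hybrid_level_traversal`, the `dense_threshold` parameter, and the rule of switching modes on frontier size are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def hybrid_level_traversal(edges, source, dense_threshold=0.25):
    """Level traversal that switches between a sparse and a dense step.

    Assumption: a level is handled in 'dense' mode when the frontier holds
    at least `dense_threshold` of all vertices, and in 'sparse' mode otherwise.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # treat the graph as undirected
    vertices = set(adj) | {source}

    level = {source: 0}
    frontier = {source}
    modes = []  # mode used for each level, kept for inspection
    depth = 0
    while frontier:
        depth += 1
        if len(frontier) / len(vertices) < dense_threshold:
            # Sparse step: expand outward from the few frontier vertices.
            modes.append("sparse")
            nxt = {v for u in frontier for v in adj[u] if v not in level}
        else:
            # Dense step: every unvisited vertex checks for a frontier neighbor.
            # Each check is independent, so this step parallelizes well.
            modes.append("dense")
            nxt = {v for v in vertices if v not in level and adj[v] & frontier}
        for v in nxt:
            level[v] = depth
        frontier = nxt
    return level, modes
```

On a small tree-shaped graph the sparse step handles the first and last levels and the dense step handles the wide middle levels, which mirrors the beginning, intermediate, and ending stages described above.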

Convolution Engine for Neural Networks

A method and hardware system for mapping an input map of a convolutional neural network layer to an output map are disclosed. An array of processing elements is interconnected to support unidirectional dataflows through the array along at least three different spatial directions. Each processing element is adapted to combine values of dataflows along different spatial directions into a new value for at least one of the supported dataflows. For each data entry in the output map, a plurality of products from pairs of weights of a selected convolution kernel and selected data entries in the input map is provided and arranged into a plurality of associated partial sums. Products associated with the same partial sum are accumulated on the array, and the partial sums are in turn accumulated on the array into at least one data entry in the output map.
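The arithmetic behind the two accumulation stages can be sketched in plain software. This is an illustrative sketch only and does not model the processing element array or its dataflows. The function name `conv2d_partial_sums` is hypothetical, and grouping the products by kernel row is an assumption, since the abstract does not say how products are assigned to partial sums.

```python
def conv2d_partial_sums(inp, kernel):
    """Valid (no padding, stride 1) 2D convolution without kernel flipping.

    Assumption: products are grouped into one partial sum per kernel row.
    """
    height, width = len(inp), len(inp[0])
    k = len(kernel)  # square k x k kernel
    out = []
    for r in range(height - k + 1):
        row = []
        for c in range(width - k + 1):
            # Stage 1: accumulate the products belonging to each partial sum.
            partial_sums = [
                sum(kernel[i][j] * inp[r + i][c + j] for j in range(k))
                for i in range(k)
            ]
            # Stage 2: accumulate the partial sums into one output entry.
            row.append(sum(partial_sums))
        out.append(row)
    return out
```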

TASK SYNCHRONIZATION FOR ACCELERATED DEEP LEARNING

Techniques in advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements performs flow-based computations on wavelets of data. Each processing element has a compute element and a routing element. Each compute element has memory. Each router enables communication via wavelets with at least nearest neighbors in a 2D mesh. Routing is controlled by respective virtual channel specifiers in each wavelet and routing configuration information in each router. A compute element conditionally selects for task initiation a previously received wavelet specifying a particular one of the virtual channels. The conditional selecting excludes the previously received wavelet for selection until at least block/unblock state maintained for the particular virtual channel is in an unblock state. The compute element executes block/unblock instructions to modify the block/unblock state.
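The block/unblock gating of task initiation can be sketched as follows. This is a simplified software model under stated assumptions, not the disclosed hardware: the class `ComputeElement`, its method names, and the lowest-channel-first selection order are hypothetical.

```python
from collections import deque

class ComputeElement:
    """Holds received wavelets per virtual channel and a block flag per channel."""

    def __init__(self, num_channels):
        self.blocked = [False] * num_channels
        self.queues = [deque() for _ in range(num_channels)]

    def receive(self, channel, payload):
        # A wavelet arrives carrying its virtual channel specifier.
        self.queues[channel].append(payload)

    def block(self, channel):
        self.blocked[channel] = True

    def unblock(self, channel):
        self.blocked[channel] = False

    def select_task(self):
        """Pick a previously received wavelet whose channel is unblocked.

        Assumption: channels are scanned in ascending order.
        Returns (channel, payload), or None if nothing is eligible.
        """
        for channel, queue in enumerate(self.queues):
            if queue and not self.blocked[channel]:
                return channel, queue.popleft()
        return None
```

A wavelet on a blocked channel stays queued and is excluded from selection until the channel is unblocked, at which point it becomes eligible again.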

COMMIT LOGIC AND PRECISE EXCEPTIONS IN EXPLICIT DATAFLOW GRAPH EXECUTION ARCHITECTURES

Systems and methods are disclosed for executing instructions with a block-based processor. Instructions can be executed in any order as their dependencies arrive, but the individual instructions are committed in a serial fashion. Further, exception handling can be performed by storing transient state for an instruction block and resuming by restoring the transient state. This allows programmers to see intermediate state for the instruction block before the subject block has committed. In one example of the disclosed technology, a method of operating a processor executing a block-based instruction set architecture includes executing at least one instruction encoded for an instruction block, responsive to determining that an individual instruction of the instruction block can commit, advancing a commit frontier for the instruction block to include all instructions in the instruction block that can commit, and committing one or more instructions inside the advanced commit frontier.
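One way to picture the commit frontier is as an index into the instruction block. The sketch below assumes the frontier covers a contiguous prefix of the block, which is one reading of serial commitment and is not stated in the abstract. The function name `advance_commit_frontier` and the boolean list `can_commit` are hypothetical.

```python
def advance_commit_frontier(can_commit, frontier=0):
    """Advance the frontier past every leading instruction that can commit.

    can_commit[i] is True once instruction i has executed and may commit.
    Assumption: instructions commit in program order, so the frontier stops
    at the first instruction that cannot yet commit.
    """
    while frontier < len(can_commit) and can_commit[frontier]:
        frontier += 1
    return frontier

# Instructions 0, 1, and 3 finished out of order; instruction 2 is pending.
state = [True, True, False, True]
frontier = advance_commit_frontier(state)            # stops before index 2
state[2] = True                                      # instruction 2 completes
frontier = advance_commit_frontier(state, frontier)  # now covers the block
```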

Reconfigurable interconnected programmable processors

A plurality of software programmable processors is disclosed. The software programmable processors are controlled by rotating circular buffers. A first processor and a second processor within the plurality of software programmable processors are individually programmable. The first processor within the plurality of software programmable processors is coupled to neighbor processors within the plurality of software programmable processors. The first processor sends data to and receives data from the neighbor processors. The first processor and the second processor are configured to operate on a common instruction cycle. An output of the first processor from a first instruction cycle is an input to the second processor on a subsequent instruction cycle.
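The cycle-to-cycle handoff between two such processors can be sketched in software. This is a toy model under stated assumptions: each circular buffer is a Python list of one-argument functions indexed by cycle number modulo its length, and the function name `run_pipeline` and the single `latch` between the processors are hypothetical.

```python
def run_pipeline(inputs, prog_a, prog_b):
    """Run two processors in lockstep on a common instruction cycle.

    prog_a and prog_b stand in for rotating circular buffers of instructions.
    The first processor's output from cycle t is held in a latch and consumed
    by the second processor on cycle t + 1.
    """
    latch = None  # value passed from the first processor to the second
    outputs = []
    # One extra cycle drains the final latched value.
    for cycle, x in enumerate(list(inputs) + [None]):
        op_a = prog_a[cycle % len(prog_a)]  # buffer rotates every cycle
        op_b = prog_b[cycle % len(prog_b)]
        if latch is not None:
            outputs.append(op_b(latch))
        latch = op_a(x) if x is not None else None
    return outputs
```

Because both buffers advance on the same cycle counter, the instruction each processor applies depends only on the cycle number, not on the data.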

A CONFIGURABLE PROCESSING ARCHITECTURE
20240028554 · 2024-01-25 ·

A configurable processing unit including a core processing element and a plurality of assist processing elements can be coupled together by one or more networks. The core processing element can include a large processing logic, large non-volatile memory, input/output interfaces and multiple memory channels. The plurality of assist processing elements can each include smaller processing logic, smaller non-volatile memory and multiple memory channels. One or more bitstreams can be utilized to configure and reconfigure computation resources of the core processing element and memory management of the plurality of assist processing elements.

APPARATUSES, METHODS, AND SYSTEMS FOR CONDITIONAL OPERATIONS IN A CONFIGURABLE SPATIAL ACCELERATOR

Systems, methods, and apparatuses relating to conditional operations in a configurable spatial accelerator are described. In one embodiment, a hardware accelerator includes an output buffer of a first processing element coupled to an input buffer of a second processing element via a first data path that is to send a first dataflow token from the output buffer of the first processing element to the input buffer of the second processing element when the first dataflow token is received in the output buffer of the first processing element; an output buffer of a third processing element coupled to the input buffer of the second processing element via a second data path that is to send a second dataflow token from the output buffer of the third processing element to the input buffer of the second processing element when the second dataflow token is received in the output buffer of the third processing element; a first backpressure path from the input buffer of the second processing element to the first processing element to indicate to the first processing element when storage is not available in the input buffer of the second processing element; a second backpressure path from the input buffer of the second processing element to the third processing element to indicate to the third processing element when storage is not available in the input buffer of the second processing element; and a scheduler of the second processing element to cause storage of the first dataflow token from the first data path into the input buffer of the second processing element when both the first backpressure path indicates storage is available in the input buffer of the second processing element and a conditional token received in a conditional queue of the second processing element from another processing element is a first value.
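The scheduler's gating condition can be sketched as a single software step. This is a simplified model, not the disclosed circuit. The function name `schedule_step`, the use of `1` as the first value, and the choice to take the second data path for any other value are assumptions; the abstract only specifies the first-path case.

```python
from collections import deque

def schedule_step(input_buffer, capacity, cond_queue, first_path, second_path):
    """Try to store one dataflow token into the second element's input buffer.

    Returns the stored token, or None if the step stalls.
    Assumption: conditional value 1 selects the first data path and any
    other value selects the second data path.
    """
    if len(input_buffer) >= capacity:
        return None  # no storage available: backpressure holds the senders
    if not cond_queue:
        return None  # no conditional token has arrived yet
    source = first_path if cond_queue[0] == 1 else second_path
    if not source:
        return None  # the selected path has no token to deliver
    cond_queue.popleft()
    token = source.popleft()
    input_buffer.append(token)
    return token
```

A token is stored only when storage is available and the conditional token selects its path, so a full input buffer stalls both senders until a slot is freed.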

INSTRUCTION FORMAT AND INSTRUCTION SET ARCHITECTURE FOR TENSOR STREAMING PROCESSOR

Embodiments are directed to a processor having a functional slice architecture. The processor is divided into tiles (or functional units) organized into a plurality of functional slices. The functional slices are configured to perform specific operations within the processor, which includes memory slices for storing operand data and arithmetic logic slices for performing operations on received operand data (e.g., vector processing, matrix manipulation). The processor includes a plurality of functional slices of a module type, each functional slice having a plurality of tiles. The processor further includes a plurality of data transport lanes for transporting data in a direction indicated in a corresponding instruction. The processor also includes a plurality of instruction queues, each instruction queue associated with a corresponding functional slice of the plurality of functional slices, wherein the instructions in the instruction queues comprise a functional slice specific operation code.

ISSUING A SEQUENCE OF INSTRUCTIONS INCLUDING A CONDITION-DEPENDENT INSTRUCTION

An apparatus, method and computer program are provided, the apparatus comprising processing circuitry to execute instructions, issue circuitry to issue the instructions for execution by the processing circuitry, and candidate instruction storage circuitry to store a plurality of condition-dependent instructions, each specifying at least one condition. The issue circuitry is configured to issue a given condition-dependent instruction in response to a determination or a prediction of the at least one condition specified by the given condition-dependent instruction being met, and when the given condition-dependent instruction is a sequence-start instruction, the issue circuitry is responsive to the determination or prediction to issue a sequence of instructions comprising the sequence-start instruction and at least one subsequent instruction.
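The issue rule can be sketched in a few lines. This is a hypothetical software model, not the disclosed circuitry: the dictionary layout with `name`, `conditions`, and `sequence` keys and the function name `issue_ready` are assumptions, and condition prediction is not modeled.

```python
def issue_ready(candidates, conditions_met):
    """Return the names of instructions to issue, in issue order.

    Each candidate is a dict with a 'name', a list of 'conditions', and an
    optional 'sequence' listing the subsequent instructions that follow a
    sequence-start instruction.
    """
    issued = []
    for instr in candidates:
        if all(cond in conditions_met for cond in instr["conditions"]):
            issued.append(instr["name"])
            # A sequence-start instruction brings its followers with it.
            issued.extend(instr.get("sequence", []))
    return issued

candidates = [
    {"name": "start_x", "conditions": ["x_ready"], "sequence": ["x1", "x2"]},
    {"name": "solo_y", "conditions": ["y_ready"]},
    {"name": "needs_both", "conditions": ["x_ready", "z_ready"]},
]
ready = issue_ready(candidates, {"x_ready", "y_ready"})
```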