G06F15/7885

Lossless tiling in convolution networks-backward pass

Disclosed is a data processing system to receive a processing graph of an application. Compile time logic is configured to modify the processing graph and generate a modified processing graph. The modified processing graph is configured to apply a post-padding tiling after applying a cumulative input padding that confines padding to an input. The cumulative input padding pads the input into a padded input. The post-padding tiling tiles the padded input into a set of pre-padded input tiles with a same tile size, tiles an intermediate representation of the input into a set of intermediate tiles with a same tile size, and tiles an output representation of the input into a set of non-overlapping output tiles with a same tile size. Runtime logic is configured with the compile time logic to execute the modified processing graph, thereby executing the application.
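As a rough 1-D sketch of the padding-then-tiling scheme in the abstract above, under the assumption that the cumulative padding is appended on one side so that every tile comes out the same size (the function and parameter names are illustrative, not from the patent):

```python
def pad_and_tile(values, tile_size, pad_value=0):
    """Pad the input so its length is a multiple of tile_size (the
    cumulative padding is confined to the input), then cut the padded
    input into equal-size, non-overlapping tiles."""
    pad = (-len(values)) % tile_size           # padding needed on the right
    padded = values + [pad_value] * pad        # cumulative input padding
    # post-padding tiling: every tile has the same tile size
    return [padded[i:i + tile_size] for i in range(0, len(padded), tile_size)]
```

Because padding is applied once to the input, every downstream representation can be cut into uniform tiles without per-tile halo padding.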

RECONFIGURABLE DATAFLOW UNIT WITH STREAMING WRITE FUNCTIONALITY

A reconfigurable processing unit is disclosed, comprising a first internal network and a second internal network with different protocols, an interface to an external network with a different protocol, a first configurable unit connected to the first internal network, a second configurable unit connected to both the first internal network and the second internal network, and a third configurable unit connected to both the second internal network and the interface to the external network. The third configurable unit is configured to receive a payload from the external network and send a transaction type identifier and a source application ID from the payload to the second configurable unit over the second internal network. The second configurable unit sends information to the first configurable unit based on the transaction type identifier and on the source application ID matching a local application ID retrieved from a register.
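The match-and-forward behavior of the second configurable unit can be sketched as follows; the dictionary payload layout and field names are assumptions for illustration only:

```python
def route(payload, local_app_id_register):
    """Second-configurable-unit sketch: forward information toward the
    first configurable unit only when the payload's source application
    ID matches the local application ID held in the register."""
    ttype = payload["transaction_type"]
    src = payload["source_app_id"]
    if src == local_app_id_register:
        return {"to_first_unit": True, "transaction_type": ttype}
    return {"to_first_unit": False}
```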

Reconfigurable parallel processing
11971847 · 2024-04-30

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
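The sequencer/gasket-memory interplay can be sketched in miniature: each configuration is applied across the PEs, and the gasket memory carries one configuration's results into the next. All names here are illustrative, not from the abstract:

```python
def sequence(configs, initial):
    """Sequencer sketch: apply each PE configuration in turn; the gasket
    memory holds the previous configuration's PE execution results so
    the next configuration can consume them."""
    gasket = initial                           # gasket memory contents
    for config in configs:                     # sequencer distributes configs
        gasket = [config(x) for x in gasket]   # PEs execute; results to gasket
    return gasket
```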

LOSSLESS TILING IN CONVOLUTION NETWORKS - TILING CONFIGURATION BETWEEN TWO SECTIONS

Disclosed is a method that includes sectioning a graph into a sequence of sections, the sequence of sections including at least a first section followed by a second section. The first section is configured to generate a first output in a first target tiling configuration in response to processing a first input in a first input tiling configuration. The graph is configured to reconfigure the first output in the first target tiling configuration to a second input in a second input tiling configuration. The second section is configured to generate a second output in a second target tiling configuration in response to processing the second input in the second input tiling configuration.
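A minimal 1-D sketch of reconfiguring the first section's target tiling into the second section's input tiling, assuming tilings are simply equal-size cuts of the same data (names are illustrative):

```python
def retile(tiles, new_tile_size):
    """Reconfigure a first target tiling into a second input tiling:
    flatten the section-1 output tiles, then cut the result into equal
    tiles of the second section's input tile size. Lossless when the
    new tile size divides the total element count."""
    flat = [x for tile in tiles for x in tile]
    return [flat[i:i + new_tile_size] for i in range(0, len(flat), new_tile_size)]
```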

Interface unit for routing prioritized input data to a processor

An interface unit for data exchange between a first processor of a computer system and a peripheral environment. The interface unit has a number of input data channels for receiving input data from the peripheral environment and a first access management unit. The access management unit is configured to receive a request for providing the input data, stored in the number of input data channels, from a first interface processor stored in the interface unit and from a second interface processor stored in the interface unit and to provide or not to provide the input data, stored in the number of input data channels, to the first interface processor and the second interface processor. A first priority and a second priority can be stored in the first access management unit.
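The access management unit's grant decision can be sketched as a priority arbiter between the two interface processors; the convention that a lower number means higher priority is an assumption, as are the names:

```python
def grant(requests, priorities):
    """Access-management sketch: among interface processors currently
    requesting the channel data, grant the one whose stored priority is
    highest (lower number = higher priority); None if nobody requests."""
    requesters = [p for p in requests if requests[p]]
    if not requesters:
        return None
    return min(requesters, key=lambda p: priorities[p])
```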

COMPARISON-BASED SORT IN A RECONFIGURABLE ARRAY PROCESSOR HAVING MULTIPLE PROCESSING ELEMENTS FOR SORTING ARRAY ELEMENTS

An array processor includes a managing element having a load streaming unit coupled to multiple processing elements. The load streaming unit provides input data portions to each of a first subset of the processing elements and also receives output data from each of a second subset of the processing elements based on a comparatively sorted combination of the input data portions provided to the first subset of processing elements. Furthermore, each of the processing elements is configurable by the managing element to compare input data portions received from either the load streaming unit or two or more of the other processing elements, wherein the input data portions are stored for processing in respective queues. Each processing element is further configurable to select an input data portion to be output data based on the comparison, and in response to selecting the input data portion, remove a queue entry corresponding to the selected input data portion. Each processing element may be further configured to provide the selected output data portion either to the managing element or as an input to one of the processing elements.
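A single processing element's compare-select-dequeue step, and a loop that runs it to completion, can be sketched as a two-way merge over queues (a tree of such PEs yields the full comparison-based sort; function names are illustrative):

```python
from collections import deque

def pe_merge_step(q_a, q_b):
    """One PE step: compare the heads of the two input queues, select
    the smaller as output data, and remove that queue entry; when one
    queue is empty, drain the other."""
    if q_a and q_b:
        src = q_a if q_a[0] <= q_b[0] else q_b
    else:
        src = q_a or q_b
    return src.popleft() if src else None

def pe_merge(a, b):
    """Run steps until both input queues drain, producing the
    comparatively sorted combination of the two input streams."""
    q_a, q_b = deque(a), deque(b)
    out = []
    while q_a or q_b:
        out.append(pe_merge_step(q_a, q_b))
    return out
```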

Fast argument load in a reconfigurable data processor

A reconfigurable processor includes an array of configurable units connected by a bus system. Each configurable unit has a configuration data store, organized as a shift register, to store configuration data. The configuration data store also includes individually addressable argument registers respectively made up of word-sized portions of the shift register to provide arguments to the configurable unit. The configurable unit also includes program load logic to shift data into the configuration data store, and argument load logic to directly load data into the argument registers without shifting the received argument data through the shift register. A program load controller is associated with the array to respond to a program load command by executing a program load process, and a fast argument load (FAL) controller is associated with the array to respond to an FAL command by executing an FAL process.
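The contrast between the two load paths can be sketched with a word-granular model of the configuration data store; the class and method names, and the word granularity of the shift, are assumptions for illustration:

```python
class ConfigStore:
    """Configuration data store sketch: a shift register of words, with
    individually addressable argument registers overlaid on word-sized
    portions of it."""
    def __init__(self, num_words):
        self.words = [0] * num_words

    def program_load(self, stream):
        # program load: each incoming word is shifted in from one end
        for w in stream:
            self.words = self.words[1:] + [w]

    def fast_arg_load(self, index, value):
        # fast argument load: write the register directly, no shifting
        self.words[index] = value
```

The point of the FAL path is that updating one argument costs a single write rather than reshifting the whole configuration.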

Reconfigurable Parallel Processing
20240264975 · 2024-08-08

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.

Reconfigurable data processor with fast argument load using a runtime program on a host processor

Argument registers in a reconfigurable processor are loaded from a runtime program running on a host processor. The runtime program stores a configuration file in a memory. A program load controller reads the configuration file from the memory and distributes it to configurable units in the reconfigurable processor, which sequentially shift it into a shift register of the configuration data store. The runtime program stores an argument load file in the memory, and a fast argument load (FAL) controller reads the argument load file from memory and distributes (value, control) tuples to the configurable units in the reconfigurable processor. The configurable units process the tuples by writing the value directly into an argument register (made up of a portion of the shift register in the configuration data store) specified by the control of the tuple, without shifting the value through the shift register.
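The tuple-processing step in a configurable unit can be sketched directly; modeling the argument registers as a flat list, and the tuple as (value, control) with control naming a register index, are assumptions:

```python
def apply_fal_tuples(arg_regs, tuples):
    """FAL sketch: each (value, control) tuple writes `value` directly
    into the argument register selected by `control`, with no shifting
    of the value through the shift register."""
    for value, control in tuples:
        arg_regs[control] = value
    return arg_regs
```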

Lossless tiling in convolution networks—materialization of tensors

Disclosed is a data processing system that includes a plurality of reconfigurable processors and processor memory. Runtime logic, operatively coupled to the plurality of reconfigurable processors and the processor memory, is configured to configure at least one reconfigurable processor in the plurality of reconfigurable processors with a first subgraph in a sequence of subgraphs of a graph; load an input onto the processor memory; on a tile-by-tile basis, process a first set of input tiles from the input through the first subgraph and generate a first set of intermediate tiles, load the first set of intermediate tiles onto the processor memory, and process the first set of intermediate tiles through the first subgraph and generate a first set of output tiles; and compose output tiles in the first set of output tiles into a first composed input, and load the first composed input onto the processor memory.
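The tile-by-tile flow described above, reduced to a 1-D sketch with two placeholder compute stages standing in for the first subgraph (all names and the two-stage split are illustrative, not from the abstract):

```python
def run_subgraph(input_tiles, stage1, stage2):
    """Tile-by-tile sketch: process each input tile into an intermediate
    tile, process each intermediate tile into an output tile, then
    compose the output tiles into a single composed input for the next
    subgraph in the sequence."""
    intermediate = [stage1(t) for t in input_tiles]   # first set of intermediate tiles
    output = [stage2(t) for t in intermediate]        # first set of output tiles
    composed = [x for tile in output for x in tile]   # compose output tiles
    return composed
```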