Patent classifications
G06F15/7839
Lossless tiling in convolution networks—read-modify-write in backward pass
Disclosed is a data processing system which includes compile time logic configured to section a graph into a sequence of subgraphs, the sequence of subgraphs including at least a first subgraph. The compile time logic configures the first subgraph to generate a plurality of output tiles of an output tensor. A runtime logic configured with the compile time logic is to execute the sequence of subgraphs to generate, at the output of the first subgraph, the plurality of output tiles of the output tensor, and write the plurality of output tiles in a memory in an overlapping configuration. In an example, an overlapping region between any two neighboring output tiles of the plurality of output tiles comprises a summation of a corresponding region of a first neighboring output tile and a corresponding region of a second neighboring output tile.
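The overlapping write described in this abstract can be illustrated with a minimal one-dimensional sketch. It assumes each tile is a plain list and that tile i is written starting at offset i * stride. The function name `write_tiles_overlapping` and the `stride` and `length` parameters are hypothetical and do not come from the patent. Each write adds to the value already stored, so an overlapping region ends up holding the sum of the two neighboring tiles.

```python
def write_tiles_overlapping(tiles, stride, length):
    """Accumulate 1-D tiles into a zeroed buffer (hypothetical helper).

    Tile i is written starting at offset i * stride. When stride is smaller
    than the tile width, neighboring tiles overlap in memory.
    """
    memory = [0] * length
    for i, tile in enumerate(tiles):
        start = i * stride
        for j, value in enumerate(tile):
            # Read-modify-write: read the stored value, add, write back.
            memory[start + j] += value
    return memory


# Three tiles of width 4 written with stride 3 overlap by one element each.
tiles = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
result = write_tiles_overlapping(tiles, stride=3, length=10)
print(result)  # [1, 1, 1, 3, 2, 2, 5, 3, 3, 3]
```

Indices 3 and 6 are the overlapping regions: they hold 1 + 2 and 2 + 3 respectively.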
Tensor partitioning and partition access order
A method of processing partitions of a tensor in a target order includes receiving, by a reorder unit and from two or more producer units, a plurality of partitions of a tensor in a first order that is different from the target order, storing the plurality of partitions in the reorder unit, and providing, from the reorder unit, the plurality of partitions in the target order to one or more consumer units. In an example, the one or more consumer units process the plurality of partitions in the target order.
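A rough software analogue of the reorder unit is sketched below. The class `ReorderUnit` and its `receive` and `drain` methods are hypothetical names chosen for illustration. The sketch assumes partitions carry an identifier and that the unit releases a partition as soon as every earlier partition in the target order has also arrived, which is one possible policy and not necessarily the one claimed.

```python
class ReorderUnit:
    """Buffers partitions and releases them in a fixed target order (sketch)."""

    def __init__(self, target_order):
        self.target_order = list(target_order)
        self.stored = {}      # partition id -> data, held until its turn
        self.next_pos = 0     # index into target_order of the next id to emit

    def receive(self, partition_id, data):
        # Producers may deliver partitions in any order.
        self.stored[partition_id] = data

    def drain(self):
        """Return every partition that can now be emitted in target order."""
        ready = []
        while self.next_pos < len(self.target_order):
            pid = self.target_order[self.next_pos]
            if pid not in self.stored:
                break  # the next partition in target order has not arrived
            ready.append((pid, self.stored.pop(pid)))
            self.next_pos += 1
        return ready


unit = ReorderUnit(target_order=[0, 1, 2, 3])
delivered = []
# Arrival order differs from the target order (e.g. two producers interleaving).
for pid in [2, 0, 3, 1]:
    unit.receive(pid, f"partition-{pid}")
    delivered.extend(unit.drain())

print([pid for pid, _ in delivered])  # [0, 1, 2, 3]
```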
Multi-headed multi-buffer for buffering data for processing
An integrated circuit includes a plurality of configurable units, each configurable unit having two or more corresponding sections. The plurality of configurable units is arranged in a serial arrangement to form a chain of sections of the configurable units. A data bus, which communicates data at a clock rate, is connected to the plurality of configurable units. The chain of sections is to receive and write a series of tensors at the clock rate at a first end section of the chain of sections, and sequentially propagate the series of tensors through individual sections within the chain of sections at the clock rate. The chain of sections is to output the series of tensors at a second end section of the chain of sections. The chain of sections is also to output the series of tensors at an intermediate section of the chain of sections.
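The chain behaves somewhat like a shift register with more than one read point. The following is a simplified simulation under assumed conditions: one tensor enters per clock cycle, each section holds one tensor, and `None` marks an empty section. The function `run_chain` and the `tap_index` parameter are hypothetical names, not terms from the patent.

```python
def run_chain(tensors, num_sections, tap_index):
    """Shift tensors through a chain of sections, one step per cycle (sketch).

    Returns two lists of (cycle, tensor) pairs: what was visible at the
    intermediate section `tap_index` and at the final section.
    """
    sections = [None] * num_sections
    tap_out, end_out = [], []
    # Run extra cycles so the last tensor reaches the end of the chain.
    for cycle in range(len(tensors) + num_sections):
        incoming = tensors[cycle] if cycle < len(tensors) else None
        # Every section passes its content to the next; the first takes input.
        sections = [incoming] + sections[:-1]
        if sections[tap_index] is not None:
            tap_out.append((cycle, sections[tap_index]))
        if sections[-1] is not None:
            end_out.append((cycle, sections[-1]))
    return tap_out, end_out


tap_out, end_out = run_chain(["t0", "t1", "t2"], num_sections=4, tap_index=1)
print(tap_out)  # [(1, 't0'), (2, 't1'), (3, 't2')]
print(end_out)  # [(3, 't0'), (4, 't1'), (5, 't2')]
```

Both read points see the same series in the same order; the intermediate section simply sees it earlier.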
INDEPENDENT CONTROL OF MULTIPLE CONCURRENT APPLICATION GRAPHS IN A RECONFIGURABLE DATA PROCESSOR
A reconfigurable data processor includes a plurality of configurable units and a configuration controller. The configuration controller is configured to start execution of a first application graph in a first set of configurable units. Then, concurrently with the execution of the first application graph in the first set of configurable units, the configuration controller receives a command to load a configuration file into a second set of configurable units and obtains the configuration file. The configuration file contains information to configure the second set of configurable units to execute a second application graph. The configuration file is then loaded into the second set of configurable units and execution of the second application graph is started in the second set of configurable units.
METHODS, SYSTEMS, ARTICLES OF MANUFACTURE AND APPARATUS FOR NETWORK SERVICE MANAGEMENT
Methods, apparatus, systems, and articles of manufacture are disclosed for network service management. An example apparatus includes microservice translation circuitry to query, at a first time, a memory address range corresponding to a plurality of services, and generate state information corresponding to the plurality of services at the first time. The example apparatus also includes microservice request circuitry to query, at a second time, the memory address range to identify a memory address state change, the memory address state change indicative of an instantiation request for at least one of the plurality of services, and microservice instantiation circuitry to cause a first compute device to instantiate the at least one of the plurality of services.
COMPUTING TILE
Systems, apparatuses, and methods related to a computing tile are described. The computing tile may perform operations on received data to extract some of the received data. The computing tile may perform operations without intervening commands. The computing tile may perform operations on data streamed through the computing tile to extract relevant data from data received by the computing tile. In an example, the computing tile is configured to receive a command to initiate an operation to reduce a size of a block of data from a first size to a second size. The computing tile can then receive a block of data from a memory device coupled to the apparatus. The computing tile can then perform an operation on the block of data to extract predetermined data from the block of data to reduce a size of the block of data from a first size to a second size.
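As an illustration of reducing a block from a first size to a second size, the sketch below assumes the block is a sequence of fixed-size records and that the "predetermined data" is a fixed byte range within each record. The function `reduce_block` and its parameters are hypothetical; the abstract does not specify a record layout.

```python
def reduce_block(block, record_size, field_offset, field_size):
    """Keep only one fixed field from each record in the block (sketch)."""
    out = bytearray()
    for start in range(0, len(block), record_size):
        record = block[start:start + record_size]
        # Extract the predetermined bytes and discard the rest of the record.
        out += record[field_offset:field_offset + field_size]
    return bytes(out)


block = bytes(range(16))  # first size: 16 bytes, four 4-byte records
reduced = reduce_block(block, record_size=4, field_offset=1, field_size=2)
print(len(block), len(reduced))  # 16 8
print(list(reduced))             # [1, 2, 5, 6, 9, 10, 13, 14]
```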
TOP LEVEL NETWORK AND ARRAY LEVEL NETWORK FOR RECONFIGURABLE DATA PROCESSORS
A reconfigurable data processor comprises an array of configurable units and a bus system. The bus system is connected to the array of configurable units. The bus system includes a top level network and an array level network. The top level network is connected to an external data interface for communication with memory outside of the array of configurable units. The array level network is connected to configurable units in the array of configurable units.
Computing tile
Systems, apparatuses, and methods related to a computing tile are described. The computing tile may perform operations on received data to extract some of the received data. The computing tile may perform operations without intervening commands. The computing tile may perform operations on data streamed through the computing tile to extract relevant data from data received by the computing tile. In an example, the computing tile is configured to receive a command to initiate an operation to reduce a size of a block of data from a first size to a second size. The computing tile can then receive a block of data from a memory device coupled to the apparatus. The computing tile can then perform an operation on the block of data to extract predetermined data from the block of data to reduce a size of the block of data from a first size to a second size.
High performance processor
Implementations relate to a data processor that includes a data processing unit having a plurality of processing elements and a cache hierarchy including a plurality of levels of data caches. The data caches include a first level data cache connected to a second level data cache, and a main memory connected to the highest level cache of the cache hierarchy. At least one of the first level data cache or second level data cache is divided into a plurality of cache segments, and during operation of the data processor, at least some of the plurality of cache segments are excluded from cache operation. Each of the excluded cache segments is dedicated to an associated processing element as tightly coupled local access memory.
INSTRUCTION BASED CONTROL OF MEMORY ATTRIBUTES
Embodiments described herein provide techniques to facilitate instruction-based control of memory attributes. One embodiment provides a graphics processor comprising a processing resource, a memory device, a cache coupled with the processing resource and the memory device, and circuitry to process a memory access message received from the processing resource. The memory access message enables access to data of the memory device. To process the memory access message, the circuitry is configured to determine one or more cache attributes that indicate whether the data should be read from or stored in the cache. The cache attributes may be provided by the memory access message or stored in state data associated with the data to be accessed by the memory access message.
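The two sources of cache attributes can be sketched as a simple lookup. Everything named here is hypothetical: the `MemoryAccessMessage` class, the `"cached"` and `"uncached"` values, and the dictionary standing in for state data. The sketch also assumes that an attribute carried in the message takes precedence over the state data, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryAccessMessage:
    address: int
    cache_attr: Optional[str] = None  # hypothetical per-message attribute


def resolve_cache_attr(msg, state_data, default="cached"):
    """Pick the cache attribute for one access (assumed precedence)."""
    if msg.cache_attr is not None:
        # Attribute supplied by the instruction via the message itself.
        return msg.cache_attr
    # Otherwise fall back to state associated with the accessed data.
    return state_data.get(msg.address, default)


state = {0x1000: "cached", 0x2000: "uncached"}
from_message = resolve_cache_attr(MemoryAccessMessage(0x1000, "uncached"), state)
from_state = resolve_cache_attr(MemoryAccessMessage(0x2000), state)
fallback = resolve_cache_attr(MemoryAccessMessage(0x3000), state)
print(from_message, from_state, fallback)  # uncached uncached cached
```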