G06F15/7892

PROCESSOR WITH MEMORY CONTROLLER INCLUDING DYNAMICALLY PROGRAMMABLE FUNCTIONAL UNIT

A processor including a memory controller for interfacing an external memory and a programmable functional unit (PFU). The PFU is programmed by a PFU program to modify operation of the memory controller, in which the PFU includes programmable logic elements and programmable interconnectors. For example, the PFU is programmed by the PFU program to add a function or otherwise to modify an existing function of the memory controller to enhance its functionality during operation of the processor. In this manner, the functionality and/or operation of the memory controller is not fixed once the processor is manufactured, but instead the memory controller may be modified after manufacture to improve efficiency and/or enhance performance of the processor, such as when executing a corresponding process.

Incorporating a spatial array into one or more programmable processor cores

Functional units disposed in one or more processor cores are communicatively coupled using both a shared bypass network and a switched network. The shared bypass network enables the functional units to be operated conventionally for general processing while the switched network enables specialized processing in which the functional units are configured as a spatial array. In the spatial array configuration, operands produced by one functional unit can only be sent to a subset of functional units to which dependent instructions have been mapped a priori. The functional units may be dynamically reconfigured at runtime to toggle between operating in the general configuration and operating as the spatial array. Information to control the toggling between operating configurations may be provided in instructions received by the functional units.
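The two operating configurations described above can be pictured with a small model. This is only a sketch under stated assumptions: the class `FunctionalUnitFabric`, the unit names, and the per-instruction `spatial` flag are hypothetical, and the assumption that the shared bypass network reaches every functional unit is an illustrative simplification rather than something the abstract states.

```python
class FunctionalUnitFabric:
    """Toy model of functional units joined by a bypass network and a switched network."""

    def __init__(self, units, spatial_routes):
        self.units = list(units)
        # Routes fixed ahead of time: producer -> units holding its dependent instructions.
        self.spatial_routes = {src: list(dsts) for src, dsts in spatial_routes.items()}

    def destinations(self, producer, spatial_mode):
        """Return the units that can receive an operand produced by `producer`."""
        if spatial_mode:
            # Switched network: only the subset mapped a priori.
            return list(self.spatial_routes.get(producer, []))
        # Shared bypass network (assumed here to reach all units).
        return list(self.units)


fabric = FunctionalUnitFabric(
    units=["alu0", "alu1", "mul0", "ld0"],
    spatial_routes={"alu0": ["mul0"], "mul0": ["alu1"]},
)

# Each instruction carries a flag that selects the configuration at runtime.
instructions = [
    {"unit": "alu0", "spatial": False},
    {"unit": "alu0", "spatial": True},
    {"unit": "mul0", "spatial": True},
]
trace = [fabric.destinations(i["unit"], i["spatial"]) for i in instructions]
```

In this model the same producer, `alu0`, reaches four units in the general configuration and one unit in the spatial array configuration.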

INTERCONNECT-BASED RESOURCE ALLOCATION FOR RECONFIGURABLE PROCESSORS

The technology disclosed relates to interconnect-based resource allocation for reconfigurable processors. In particular, the technology disclosed relates to a runtime logic that is configured to receive target interconnect bandwidth and target interconnect latency, and rated interconnect bandwidth and rated interconnect latency. The runtime logic is further configured to respond by allocating, to configuration files defining an application graph, processing elements in a plurality of processing elements, and interconnects between the processing elements, and executing the configuration files using the allocated processing elements and the allocated interconnects.

Tensor Partitioning and Partition Access Order

A method of processing partitions of a tensor in a target order includes receiving, by a reorder unit and from two or more producer units, a plurality of partitions of a tensor in a first order that is different from the target order, storing the plurality of partitions in the reorder unit, and providing, from the reorder unit, the plurality of partitions in the target order to one or more consumer units. In an example, the one or more consumer units process the plurality of partitions in the target order.
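The receive, store, and provide steps can be sketched as a small buffer. This is a minimal illustration, not the claimed hardware: the class name `ReorderUnit`, its methods, and the use of integer partition identifiers are assumptions made for the example.

```python
class ReorderUnit:
    """Buffers tensor partitions and releases them in a fixed target order."""

    def __init__(self, target_order):
        self.target_order = list(target_order)
        self.buffer = {}      # partition id -> payload, held until its turn
        self.next_index = 0   # position in target_order to release next

    def receive(self, partition_id, payload):
        """Store one partition arriving from any producer unit."""
        self.buffer[partition_id] = payload

    def drain(self):
        """Return every partition that is now releasable, in target order."""
        ready = []
        while self.next_index < len(self.target_order):
            pid = self.target_order[self.next_index]
            if pid not in self.buffer:
                break  # the next partition in target order has not arrived yet
            ready.append((pid, self.buffer.pop(pid)))
            self.next_index += 1
        return ready


unit = ReorderUnit(target_order=[0, 1, 2, 3])

# Two producers deliver partitions in the order 2, 0, 3, 1.
unit.receive(2, "c")
first = unit.drain()     # nothing yet: partition 0 is still missing
unit.receive(0, "a")
second = unit.drain()    # partition 0 can go; partition 1 blocks partition 2
unit.receive(3, "d")
unit.receive(1, "b")
third = unit.drain()     # the remaining partitions leave in target order
```

A consumer unit that concatenates `first`, `second`, and `third` sees partitions 0, 1, 2, 3 regardless of the arrival order.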

Reconfigurable processor with routing node frequency based on the number of routing nodes

Provided is a reconfigurable processor capable of reducing the routing processing time of routing nodes by driving the routing nodes at a greater frequency than a driving frequency of the processing elements. The reconfigurable processor includes one or more processing elements configured to be driven at a first driving frequency, and one or more routing nodes configured to be provided on paths that are formed between the processing elements, and to be driven at a second driving frequency that is greater than the first driving frequency.
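A short calculation shows why a faster second driving frequency shortens routing time. The figures below are invented for illustration, and the one-cycle-per-routing-node cost is an assumption rather than something the abstract specifies.

```python
def routing_time_ns(num_routing_nodes, routing_freq_mhz, cycles_per_node=1):
    """Time to cross a path of routing nodes, assuming a fixed cycle cost per node."""
    return num_routing_nodes * cycles_per_node * 1000.0 / routing_freq_mhz


PE_FREQ_MHZ = 500.0                     # first driving frequency (hypothetical)
NUM_NODES = 4                           # routing nodes on the path (hypothetical)
pe_period_ns = 1000.0 / PE_FREQ_MHZ     # one processing-element cycle

# Routing nodes driven at the same frequency as the processing elements.
same_clock_ns = routing_time_ns(NUM_NODES, PE_FREQ_MHZ)

# Routing nodes driven at a second frequency scaled by the number of routing nodes.
fast_clock_ns = routing_time_ns(NUM_NODES, PE_FREQ_MHZ * NUM_NODES)
```

With these numbers the path takes 8 ns at the shared clock and 2 ns at the faster clock, which equals one processing-element cycle.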

Lossless tiling in convolution networks—read-modify-write in backward pass

Disclosed is a data processing system which includes compile time logic configured to section a graph into a sequence of subgraphs, the sequence of subgraphs including at least a first subgraph. The compile time logic configures the first subgraph to generate a plurality of output tiles of an output tensor. Runtime logic, configured with the compile time logic, executes the sequence of subgraphs to generate, at the output of the first subgraph, the plurality of output tiles of the output tensor, and writes the plurality of output tiles in a memory in an overlapping configuration. In an example, an overlapping region between any two neighboring output tiles of the plurality of output tiles comprises a summation of a corresponding region of a first neighboring output tile and a corresponding region of a second neighboring output tile.
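The overlapping write can be illustrated in one dimension. This is a simplified sketch: the function name, the flat list standing in for memory, and the tile offsets are assumptions, and a real output tensor would have more dimensions.

```python
def write_tiles_overlapping(length, tiles):
    """Accumulate tiles into zero-initialised memory.

    `tiles` is a list of (start_offset, values) pairs. Where two tiles overlap,
    the stored value is the sum of their corresponding regions.
    """
    memory = [0.0] * length
    for start, values in tiles:
        for i, value in enumerate(values):
            current = memory[start + i]            # read
            memory[start + i] = current + value    # modify, then write back
    return memory


# Two neighboring output tiles of width 4 that overlap at positions 2 and 3.
tiles = [
    (0, [1.0, 2.0, 3.0, 4.0]),
    (2, [10.0, 20.0, 30.0, 40.0]),
]
result = write_tiles_overlapping(6, tiles)
# result == [1.0, 2.0, 13.0, 24.0, 30.0, 40.0]
```

Positions 2 and 3 hold 3 + 10 and 4 + 20, the summation of the two tiles' corresponding regions, while the non-overlapping positions keep a single tile's values.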

Resource allocation for reconfigurable processors

A system is described that has a node and runtime logic. The node has a plurality of processing elements operatively coupled by interconnects. The runtime logic is configured to receive target interconnect bandwidth, target interconnect latency, rated interconnect bandwidth and rated interconnect latency. The runtime logic responds by allocating to configuration files defined by the application graph: (1) processing elements in the plurality of processing elements, and (2) interconnects between the processing elements. The runtime logic further responds by executing the configuration files using the allocated processing elements and the allocated interconnects.
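One way to picture the allocation step is a filter followed by a greedy search. The abstract does not say how targets and ratings are compared, so everything here is an assumption: the rule that an interconnect qualifies when its rated bandwidth is at least the target and its rated latency is at most the target, the greedy breadth-first selection, and all names and numbers.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Interconnect:
    src: str
    dst: str
    rated_bandwidth: float   # hypothetical unit: GB/s
    rated_latency: float     # hypothetical unit: ns


def allocate(links, target_bandwidth, target_latency, pes_needed):
    """Return (processing_elements, interconnects) meeting the targets, or None."""
    usable = [
        link for link in links
        if link.rated_bandwidth >= target_bandwidth
        and link.rated_latency <= target_latency
    ]
    neighbours = {}
    for link in usable:
        neighbours.setdefault(link.src, []).append(link.dst)
        neighbours.setdefault(link.dst, []).append(link.src)

    # Greedily grow a connected group of processing elements over usable links.
    for start in sorted(neighbours):
        seen = [start]
        queue = deque([start])
        while queue and len(seen) < pes_needed:
            pe = queue.popleft()
            for nxt in neighbours[pe]:
                if nxt not in seen and len(seen) < pes_needed:
                    seen.append(nxt)
                    queue.append(nxt)
        if len(seen) == pes_needed:
            chosen = set(seen)
            chosen_links = [
                link for link in usable
                if link.src in chosen and link.dst in chosen
            ]
            return seen, chosen_links
    return None


links = [
    Interconnect("pe0", "pe1", rated_bandwidth=100.0, rated_latency=5.0),
    Interconnect("pe1", "pe2", rated_bandwidth=40.0, rated_latency=5.0),
    Interconnect("pe2", "pe3", rated_bandwidth=100.0, rated_latency=4.0),
    Interconnect("pe1", "pe3", rated_bandwidth=100.0, rated_latency=20.0),
]
two_pes = allocate(links, target_bandwidth=80.0, target_latency=10.0, pes_needed=2)
three_pes = allocate(links, target_bandwidth=80.0, target_latency=10.0, pes_needed=3)
```

Here `two_pes` selects `pe0` and `pe1` with the interconnect between them, while `three_pes` is `None` because no three processing elements are joined by interconnects that satisfy both targets.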

Configuration of hardware devices

Methods are provided for configuring a reconfigurable hardware device to execute a user application. Such a method includes providing static shell logic on the device. The static shell logic is controlled by a primary management core for managing operation of the device, and has a predetermined hardware interface. The method includes configuring on the device, via the primary management core, dynamic shell logic for implementing dynamically-selected shell functionality. The dynamic shell logic includes a secondary management core, adapted to communicate with the primary management core via the hardware interface, for managing operation of the dynamic shell logic. The method further comprises configuring on the device, via the primary management core, application logic, having an interface with the dynamic shell logic, for executing the user application. The secondary management core uploads to the primary management core dynamic code to adapt the primary management core for use with the dynamic shell logic.

Processing of ethernet packets at a programmable integrated circuit

Methods, systems, and computer programs are presented for processing Ethernet packets at a Field Programmable Gate Array (FPGA). One programmable integrated circuit includes: an internal network on chip (iNOC) comprising rows and columns; clusters, coupled to the iNOC, comprising a network access point (NAP) and programmable logic; and an Ethernet controller coupled to the iNOC. When the controller operates in packet mode, each complete inbound Ethernet packet is sent from the controller to one of the NAPs via the iNOC, where two or more NAPs are configurable to receive the complete inbound Ethernet packets from the controller. The controller is configurable to operate in quad segment interface (QSI) mode where each complete inbound Ethernet packet is broken into segments, which are sent from the controller to different NAPs via the iNOC, where two or more NAPs are configurable to receive the complete inbound Ethernet packets from the controller.
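The difference between the two modes can be sketched as two dispatch functions. The round-robin choice of NAP, the fixed segment size, and the function and NAP names are assumptions for illustration; the abstract states only that whole packets go to one NAP in packet mode and that segments go to different NAPs in QSI mode.

```python
def dispatch_packet_mode(packets, naps):
    """Send each complete packet to a single NAP (round-robin is an assumption)."""
    return [(naps[i % len(naps)], packet) for i, packet in enumerate(packets)]


def dispatch_qsi_mode(packet, naps, segment_size):
    """Break one packet into segments and spread them across different NAPs."""
    segments = [
        packet[offset:offset + segment_size]
        for offset in range(0, len(packet), segment_size)
    ]
    return [(naps[i % len(naps)], segment) for i, segment in enumerate(segments)]


# Packet mode: two NAPs configured to receive complete packets.
whole = dispatch_packet_mode([b"AAAA", b"BBBB", b"CCCC"], ["nap0", "nap1"])

# QSI mode: one 12-byte packet split into four 3-byte segments (sizes are hypothetical).
split = dispatch_qsi_mode(b"0123456789AB", ["nap0", "nap1", "nap2", "nap3"], 3)
```

In `whole`, every tuple carries an intact packet; in `split`, the four tuples carry consecutive segments of one packet, each addressed to a different NAP.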