Patent classifications
G06F15/7871
INTER-NODE EXECUTION OF CONFIGURATION FILES ON RECONFIGURABLE PROCESSORS USING NETWORK INTERFACE CONTROLLER (NIC) BUFFERS
The technology disclosed relates to inter-node execution of configuration files on reconfigurable processors using network interface controller (NIC) buffers. In particular, the technology disclosed relates to a runtime logic that is configured to execute configuration files that define applications and application data for applications using a first reconfigurable processor connected to a first host, and a second reconfigurable processor connected to a second host. The first reconfigurable processor is configured to push input data for the applications in a first plurality of buffers. The first host is configured to cause a first network interface controller (NIC) to stream the input data to a second plurality of buffers from the first plurality of buffers. The second host is configured to cause a second NIC to stream the input data to the second reconfigurable processor from the second plurality of buffers.
EXECUTING A NEURAL NETWORK GRAPH USING A NON-HOMOGENOUS SET OF RECONFIGURABLE PROCESSORS
A system for executing a graph partitioned across a plurality of reconfigurable computing units includes a processing node that has a first computing unit reconfigurable at a first level of configuration granularity and a second computing unit reconfigurable at a second, finer, level of configuration granularity. The first computing unit is configured by a host system to execute a first dataflow segment of the graph using one or more dataflow pipelines to generate a first intermediate result and to provide the first intermediate result to the second computing unit without passing through the host system. The second computing unit is configured by the host system to execute a second dataflow segment of the graph, dependent upon the first intermediate result, to generate a second intermediate result and to send the second intermediate result to a third computing unit, without passing through the host system, to continue execution of the graph.
Fuseload architecture for system-on-chip reconfiguration and repurposing
Methods, systems, and devices that support fuseload architectures for system-on-chip (SoC) reconfiguration and repurposing are described. Trim data may be loaded from fuses to registers on a die based on a fuse header. For example, a set of registers coupled with a set of fuses on the die may be identified, where the set of fuses may store trim data to be copied to the registers as part of a fuseload procedure. In such cases, one or more fuse headers may be identified within the trim data, and each fuse header may correspond to a fuse group that includes a subset of fuses. Based on one or more subfields within a fuse header, a mapping between fuse addresses and register addresses may be determined, and the trim data from each fuse group may be copied into a set of registers based on the mapping.
Multi-headed multi-buffer for buffering data for processing
An integrated circuit includes a plurality of configurable units, each configurable unit having two or more corresponding sections. The plurality of configurable units is arranged in a serial arrangement to form a chain of sections of the configurable units. A data bus is connected to the plurality of configurable units which communicates data at a clock rate. The chain of sections is to receive and write a series of tensors at the clock rate at a first end section of the chain of sections, and sequentially propagate the series of tensors through individual sections within the chain of sections at the clock rate. The chain of sections is to output the series of tensors at a second end section of the chain of sections. The chain of sections is to also output the series of tensors at an intermediate section of the chain of sections.
ENERGY EFFICIENT MICROPROCESSOR WITH INDEX SELECTED HARDWARE ARCHITECTURE
An SoC maintains the full flexibility of a general-purpose microprocessor while providing energy efficiency similar to an ASIC by implementing software-controlled virtual hardware architectures that enable the SoC to function as a virtual ASIC. The SoC comprises a plurality of “Stella” Reconfigurable Multiprocessors (SRMs) supported by a Network-on-a-Chip that provides efficient data transfer during program execution. A hierarchy of programmable switches interconnects the programmable elements of each of the SRMs at different levels to form their virtual architectures. Arithmetic, data flow, and interconnect operations are also rendered programmable. An architecture index” points to a storage location where pre-determined hardware architectures are stored and extracted during program execution. The programmed architectures are able to mimic ASIC properties such as variable computation types, bit-resolutions, data flows, and amount and proportions of compute and data flow operations and sizes. Once established, each architecture remains in place as long as needed.
FPGA-BASED PROGRAMMABLE DATA ANALYSIS AND COMPRESSION FRONT END FOR GPU
Methods, devices, and systems for information communication. Information transmitted from a host to a graphics processing unit (GPU) is received by information analysis circuitry of a field-programmable gate array (FPGA). A pattern in the information is determined by the information analysis circuitry. A predicted information pattern is determined, by the information analysis circuitry, based on the information. An indication of the predicted information pattern is transmitted to the host. Responsive to a signal from the host based on the predicted information pattern, the FPGA is reprogrammed to implement decompression circuitry based on the predicted information pattern. In some implementations, the information includes a plurality of packets. In some implementations, the predicted information pattern includes a pattern in a plurality of packets. In some implementations, the predicted information pattern includes a zero data pattern.
Overlay layer hardware unit for network of processor cores
Methods and systems for executing an application data flow graph on a set of computational nodes are disclosed. The computational nodes can each include a programmable controller from a set of programmable controllers, a memory from a set of memories, a network interface unit from a set of network interface units, and an endpoint from a set of endpoints. A disclosed method comprises configuring the programmable controllers with instructions. The method also comprises independently and asynchronously executing the instructions using the set of programmable controllers in response to a set of events exchanged between the programmable controllers themselves, between the programmable controllers and the network interface units, and between the programmable controllers and the set of endpoints. The method also comprises transitioning data in the set of memories on the computational nodes in accordance with the application data flow graph and in response to the execution of the instructions.
TECHNOLOGIES FOR PROVIDING EFFICIENT REPROVISIONING IN AN ACCELERATOR DEVICE
Technologies for providing efficient reprovisioning in an accelerator device include an accelerator sled. The accelerator sled includes a memory and an accelerator device coupled to the memory. The accelerator device is to configure itself with a first bit stream to establish a first kernel, execute the first kernel to produce output data, write the output data to the memory, configure itself with a second bit stream to establish a second kernel, and execute the second kernel with the output data in the memory used as input data to the second kernel. Other embodiments are also described and claimed.
Parallel hardware hypervisor for virtualizing application-specific supercomputers
A parallel hypervisor system for virtualizing application-specific supercomputers is disclosed. The hypervisor system comprises (a) at least one software-virtual hardware pair consisting of a software application, and an application-specific virtual supercomputer for accelerating the said software application, wherein (i) The virtual supercomputer contains one or more virtual tiles; and (ii) The software application and the virtual tiles communicate among themselves with messages; (b) One or more reconfigurable physical tiles, wherein each virtual tile of each supercomputer can be implemented on at least one physical tile, by configuring the physical tile to perform the virtual tile’s function; and (c) A scheduler implemented substantially in hardware, for parallel pre-emptive scheduling of the virtual tiles on the physical tiles.
Configuration unload of a reconfigurable data processor
A reconfigurable data processor comprises a bus system, and an array of configurable units connected to the bus system, configurable units in the array including configuration data stores to store unit files comprising a plurality of sub-files of configuration data particular to the corresponding configurable units. Configurable units in the plurality of configurable units each include logic to execute a unit configuration load process, including receiving via the bus system, sub-files of a unit file particular to the configurable unit, and loading the received sub-files into the configuration store of the configurable unit. A configuration load controller connected to the bus system, including logic to execute an array configuration load process, including distributing a configuration file comprising unit files for a plurality of the configurable units in the array.