Patent classifications
G06F8/452
Decoupling loop dependencies using buffers to enable pipelining of loops
Decoupling loop dependencies using first in, first out (FIFO) buffers or other types of buffers to enable pipelining of loops is disclosed. By using buffers along with tailored ordering of their writes and reads, loop dependencies can be decoupled. This allows the loop to be pipelined and can lead to improved performance.
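As an illustration only (the abstract gives no code), the sketch below shows one simple reading of the idea: a loop in which each iteration consumes a result produced a fixed number of iterations earlier can pass that value through a FIFO instead of reading it back from the output array. The names DEPTH, recurrence_via_array, and recurrence_via_fifo are hypothetical and do not come from the patent.

```python
from collections import deque

DEPTH = 3  # assumed dependency distance: iteration i needs the result of iteration i - DEPTH


def recurrence_via_array(xs):
    """Reference loop: each iteration reads an earlier result back from the output array."""
    out = []
    for i, x in enumerate(xs):
        prev = out[i - DEPTH] if i >= DEPTH else 0
        out.append(x + prev)
    return out


def recurrence_via_fifo(xs):
    """Same computation, but the carried value travels through a FIFO holding DEPTH entries."""
    fifo = deque([0] * DEPTH)  # pre-filled so the first DEPTH reads have a value
    out = []
    for x in xs:
        prev = fifo.popleft()  # read: the value written DEPTH iterations ago
        y = x + prev
        fifo.append(y)         # write: consumed DEPTH iterations later
        out.append(y)
    return out
```

Because the read always happens before the write and the buffer always holds DEPTH entries, the order of writes and reads alone carries the dependency, which is the property a pipelined implementation would rely on.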
Skip Buffer Splitting
A compiler transforms a high-level program into configuration data for a coarse-grained reconfigurable (CGR) data processor with an array of CGR units. The compiler includes a method that identifies a skip buffer in a dataflow graph, determines limitations associated with the array, and searches for a lowest cost implementation topology and stage depth. At least three topologies are considered, including a cascaded buffer topology, a hybrid buffer topology, and a striped buffer topology. The lowest cost implementation topology and stage depth are based on the size of the buffered data (usually, the size of a tensor), the depth of the skip buffer, and the array's limitations. The hybrid buffer topology includes multiple sections of parallel memory units. The data travels from memory units in one section to adjacent memory units in a next section without intervening reorder buffers.
System and method for compiling high-level language code into a script executable on a blockchain platform
A computer-implemented method (and corresponding system) is provided that enables or facilitates the execution of a portion of source code, written in a high-level language (HLL), on a blockchain platform. The method and system can include a blockchain compiler, arranged to convert a portion of high-level source code into a form that can be used with a blockchain platform. This may be the Bitcoin blockchain or an alternative. The method can include: receiving the portion of source code as input; and generating an output script comprising a plurality of op codes. The op codes are a subset of op codes that are native to a functionally-restricted, blockchain scripting language. The outputted script is arranged and/or generated such that, when executed, the script provides, at least in part, the functionality specified in the source code. The blockchain scripting language is restricted such that it does not natively support complex control-flow constructs or recursion via jump-based loops or other recursive programming constructs. The step of generating the output script may comprise unrolling at least one looping construct provided in the source code. The method may further comprise providing or using an interpreter or virtual machine arranged to convert the output script into a form that is executable on a blockchain platform.
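The unrolling step can be pictured with a minimal sketch: a loop with a fixed iteration count is flattened into a straight-line list of op codes, since the target language has no jumps. The helper names unroll and run are hypothetical, and run is a toy evaluator covering only the op codes used here, not a real Script interpreter. OP_0, OP_2, and OP_ADD are standard Bitcoin Script op codes.

```python
def unroll(body_ops, count):
    """Flatten `count` iterations of a loop body into a straight-line op-code list."""
    return [op for _ in range(count) for op in body_ops]


def run(script):
    """Toy stack evaluator for the few op codes used in this example."""
    stack = []
    for op in script:
        if op == "OP_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op.startswith("OP_") and op[3:].isdigit():
            stack.append(int(op[3:]))  # OP_0 .. OP_16 push small integers
        else:
            raise ValueError(f"unsupported op code: {op}")
    return stack


# Source-level loop (assumed for illustration): acc = 0; repeat 4 times: acc = acc + 2
script = ["OP_0"] + unroll(["OP_2", "OP_ADD"], 4)
```

Running run(script) leaves the single value 8 on the stack, the same result the source-level loop would produce.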
Offloading server and offloading program
An offloading server includes: a data transfer designation section configured to analyze reference relationships of variables used in loop statements in an application and designate, for data that can be transferred outside a loop, a data transfer using an explicit directive that explicitly specifies a data transfer outside the loop; a parallel processing designation section configured to identify loop statements in the application and specify a directive specifying application of parallel processing by an accelerator and perform compilation for each of the loop statements; and a parallel processing pattern creation section configured to exclude loop statements causing a compilation error from loop statements to be offloaded and create a plurality of parallel processing patterns each of which specifies whether to perform parallel processing for each of the loop statements not causing a compilation error.
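The pattern creation step can be sketched as follows. The abstract does not say how the patterns are chosen, so exhaustive enumeration is an assumption made here for clarity, and it is only practical for a handful of loops. The function name offload_patterns and the loop identifiers are hypothetical.

```python
from itertools import product


def offload_patterns(loop_ids, compile_failed):
    """Return every on/off pattern over the loops that compiled with the parallel directive.

    Loops listed in `compile_failed` are excluded from offloading entirely.
    """
    candidates = [loop for loop in loop_ids if loop not in compile_failed]
    return [
        dict(zip(candidates, bits))
        for bits in product([False, True], repeat=len(candidates))
    ]


# Three loops, one of which failed to compile when marked for parallel processing.
patterns = offload_patterns(["loop1", "loop2", "loop3"], compile_failed={"loop2"})
```

Each entry in patterns maps a remaining loop to True (offload to the accelerator) or False (leave on the host), so two candidate loops yield four patterns.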
Systems and methods for increased bandwidth utilization regarding irregular memory accesses using software pre-execution
Systems and methods are configured to receive code containing an original loop that includes irregular memory accesses. The original loop can be split. A pre-execution loop that contains code to prefetch content of the memory can be generated. Execution of the pre-execution loop can access memory inclusively between a starting location and the starting location plus a prefetch distance. A modified loop that can perform at least one computation based on the content prefetched with execution of the pre-execution loop can be generated. Execution of the main loop can follow the execution of the pre-execution loop. The original loop can be replaced with the pre-execution loop and the modified loop.
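A rough sketch of the loop split follows, under stated assumptions: the irregular access is an indexed gather data[idx[k]], a Python dictionary stands in for the hardware cache that a real prefetch would warm, and the inclusive range is read as loop positions start through start + distance. The names original_loop and split_loop are hypothetical.

```python
def original_loop(data, idx):
    """Original loop: irregular (indexed) reads followed immediately by computation."""
    total = 0
    for i in idx:
        total += data[i]
    return total


def split_loop(data, idx, distance):
    """Process the loop in chunks: a pre-execution loop touches memory, then a modified loop computes."""
    total = 0
    for start in range(0, len(idx), distance + 1):
        end = min(start + distance, len(idx) - 1)  # inclusive upper position of this chunk
        prefetched = {}
        # Pre-execution loop: only performs the irregular accesses.
        for k in range(start, end + 1):
            prefetched[idx[k]] = data[idx[k]]
        # Modified loop: computes on the content fetched above.
        for k in range(start, end + 1):
            total += prefetched[idx[k]]
    return total
```

Both functions return the same sum; the split version only changes when the memory accesses are issued relative to the computation.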
Optimizing runtime alias checks
Optimizing runtime alias checks includes identifying, by a compiler, a base pointer and a plurality of different memory accesses based on the base pointer in a code loop; generating, by the compiler, a first portion of runtime code to determine a minimum access and a maximum access of the plurality of different memory accesses; and generating, by the compiler, a second portion of runtime code including one or more runtime alias checks for the minimum access and one or more runtime alias checks for the maximum access.
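One way to picture the two portions of runtime code is an interval test: first find the lowest and highest bytes touched relative to the base pointer, then compare only those two extremes against the other memory region instead of checking every access. This is a simplified sketch, not the patented method; addresses are modeled as plain integers, and the names access_bounds and may_alias are hypothetical.

```python
def access_bounds(offsets, width):
    """First portion: lowest and one-past-highest byte touched, relative to the base pointer."""
    return min(offsets), max(offsets) + width  # half-open interval [lo, hi)


def may_alias(base_a, offsets_a, width, base_b, length_b):
    """Second portion: compare only the two extremes against the region [base_b, base_b + length_b)."""
    lo, hi = access_bounds(offsets_a, width)
    return base_a + lo < base_b + length_b and base_b < base_a + hi
```

For example, four 8-byte accesses at offsets 0, 8, 16, and 24 from address 1000 cover bytes 1000 through 1031, so a region starting at 1032 does not alias while one starting at 1024 does. The test is conservative: it may report a possible alias for a region that falls in a gap between individual accesses.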
Using hardware-accelerated instructions
A computer-implemented method implements a computation using a hardware-accelerated instruction of a processor system by solving a constraint satisfaction problem. A solution to the constraint satisfaction problem represents a possible invocation of the hardware-accelerated instruction in the computation. The constraint satisfaction problem assigns nodes of a data flow graph of the computation to nodes of a data flow graph of the instruction. The constraint satisfaction problem comprises constraints enforcing that the assigned nodes of the computation data flow graph have equivalent data flow to the instruction data flow graph, and constraints restricting which nodes of the computation data flow graph can be assigned to the inputs of the hardware-accelerated instruction, with restrictions being imposed by the hardware-accelerated instruction and/or its programming interface.
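The matching problem can be sketched with a brute-force search in place of a real constraint solver. Everything below is an assumption for illustration: the graph encoding, the fused multiply-add instruction, the solve function, and the input_ok predicate that stands in for restrictions imposed by the instruction or its interface. The sketch pairs each instruction node with a distinct computation node and ignores commutativity.

```python
from itertools import permutations

# Computation DFG: node -> (operation, operand nodes). Computes t3 = (x * y + z) - w.
COMPUTATION = {
    "x": ("input", ()), "y": ("input", ()), "z": ("input", ()), "w": ("input", ()),
    "t1": ("mul", ("x", "y")),
    "t2": ("add", ("t1", "z")),
    "t3": ("sub", ("t2", "w")),
}

# Instruction DFG for a hypothetical fused multiply-add: out = a * b + c.
INSTRUCTION = {
    "a": ("input", ()), "b": ("input", ()), "c": ("input", ()),
    "p": ("mul", ("a", "b")),
    "out": ("add", ("p", "c")),
}


def solve(computation, instruction, input_ok=lambda inst_input, comp_node: True):
    """Return every pairing of instruction nodes with distinct computation nodes that meets the constraints."""
    inst_nodes = list(instruction)
    solutions = []
    for chosen in permutations(computation, len(inst_nodes)):
        mapping = dict(zip(inst_nodes, chosen))
        valid = True
        for node, (op, args) in instruction.items():
            if op == "input":
                # Input restriction constraint (hypothetical predicate).
                valid = input_ok(node, mapping[node])
            else:
                # Data flow equivalence: same operation and same operands under the mapping.
                c_op, c_args = computation[mapping[node]]
                valid = c_op == op and tuple(mapping[a] for a in args) == c_args
            if not valid:
                break
        if valid:
            solutions.append(mapping)
    return solutions
```

With the graphs above, solve(COMPUTATION, INSTRUCTION) finds the single pairing in which the multiply and add of the computation line up with the instruction, which corresponds to one possible invocation of the instruction.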
Hardware Acceleration Method, Compiler, and Device
A hardware acceleration method includes obtaining compilation policy information and a source code, where the compilation policy information indicates that a first code type matches a first processor and a second code type matches a second processor; analyzing a code segment in the source code according to the compilation policy information; determining a first code segment belonging to the first code type or a second code segment belonging to the second code type; compiling the first code segment into a first executable code; sending the first executable code to the first processor; compiling the second code segment into a second executable code; and sending the second executable code to the second processor.
Processor architecture
A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor, which may include memory slices (MEM) for storing operand data, and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension, and receive instructions across a second dimension orthogonal to the first dimension. The timing of data and instruction flows is configured such that corresponding data and instructions are received at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice is able to determine what operations to perform on received data based upon the timing at which the data is received.
Dataflow all-reduce for reconfigurable processor systems
Roughly described, a system is disclosed for data parallel training of a neural network on multiple reconfigurable units configured by a host with dataflow pipelines to perform different steps in the training. CGRA units are configured to evaluate first and second sequential sections of neural network layers based on a respective subset of training data, and to back-propagate the error through the sections to calculate parameter gradients for the respective subset. Gradient synchronization and reduction are performed by one or more units having finer grain reconfigurability, such as an FPGA. The FPGA performs synchronization and reduction of the gradients for the second section while the CGRA units perform back-propagation through the first sequential section. Intermediate results are transmitted using a P2P message passing protocol layer. Execution of dataflow segments in the different units is triggered by receipt of data, rather than by a command from any host system.
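The overlap between reduction and back-propagation can be sketched in software, with a worker thread standing in for the FPGA and the calling thread standing in for the CGRA units. The abstract does not specify the reduction operator, so element-wise averaging is an assumption, and the names reduce_mean and training_step are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_mean(grads_per_replica):
    """Element-wise average of gradient vectors, one vector per replica (assumed reduction)."""
    n = len(grads_per_replica)
    return [sum(values) / n for values in zip(*grads_per_replica)]


def training_step(section2_grads, section1_backprop):
    """Reduce second-section gradients while first-section back-propagation is still running.

    section2_grads: gradient vectors already computed for the second section, one per replica.
    section1_backprop: one callable per replica that returns its first-section gradients.
    """
    with ThreadPoolExecutor(max_workers=1) as reducer:
        pending = reducer.submit(reduce_mean, section2_grads)  # stands in for the FPGA
        section1_grads = [backprop() for backprop in section1_backprop]  # stands in for the CGRA units
        reduced2 = pending.result()
    reduced1 = reduce_mean(section1_grads)
    return reduced1, reduced2
```

The second section is reduced first because back-propagation reaches it first; its reduction is submitted before the first-section gradients exist, so the two activities can proceed concurrently.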