G06F8/445

Devices, methods, and media for efficient data dependency management for in-order issue processors

Methods, devices and media for efficient data dependency management for in-order issue processors are described. Various embodiments provide techniques for managing read-after-write (RAW) data dependencies between instructions in a constrained hardware environment. The described techniques include an initial wait station allocation for write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.

Nested loops reversal enhancements

Systems, apparatuses and methods may provide for technology to identify, in user code, a nested loop that would result in cache memory misses when executed. The technology further reverses the order of iterations of a first inner loop in the nested loop to obtain a modified nested loop. Reversing the order of iterations increases the number of cache memory hits that occur when the modified nested loop is executed.
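The effect described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patented method: it assumes a fully associative LRU cache, an outer loop whose body contains two inner loops that sweep the same array, and an array larger than the cache. The names `count_hits` and `nested_loop_trace` are hypothetical.

```python
from collections import OrderedDict

def count_hits(trace, capacity):
    """Count hits for an address trace on a fully associative LRU cache (assumed model)."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True
    return hits

def nested_loop_trace(n, outer_iters, reverse_first_inner):
    """Addresses touched by an outer loop containing two inner sweeps over n elements."""
    trace = []
    for _ in range(outer_iters):
        # First inner loop: iterate backwards when the reversal is applied.
        first = range(n - 1, -1, -1) if reverse_first_inner else range(n)
        trace.extend(first)
        # Second inner loop: always iterates forwards.
        trace.extend(range(n))
    return trace

original = count_hits(nested_loop_trace(8, 2, False), capacity=4)
modified = count_hits(nested_loop_trace(8, 2, True), capacity=4)
print(original, modified)  # prints "0 12"
```

In this toy model the forward-only sweeps evict every element before it is reused, while the reversed first inner loop starts each sweep at the elements the previous sweep touched last, so they are still cached.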

Unaligned instruction relocation

In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unaligned ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.

COMPILER FOR TRANSLATING BETWEEN A VIRTUAL IMAGE PROCESSOR INSTRUCTION SET ARCHITECTURE (ISA) AND TARGET HARDWARE HAVING A TWO-DIMENSIONAL SHIFT ARRAY STRUCTURE
20170242669 · 2017-08-24

A method is described that includes translating higher level program code, including higher level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates from an orthogonal coordinate system, into lower level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axes. The translating includes replacing the higher level instructions having the instruction format with lower level shift instructions that shift data within the shift register array structure.
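The replacement of coordinate-based pixel accesses with shifts can be sketched in a few lines. This is an illustrative model only: the functions `load`, `shift`, `high_level` and `lowered` are hypothetical, the stencil is an arbitrary example, and filling out-of-range pixels with zero is an assumption rather than the behavior of the described hardware.

```python
def load(image, x, y):
    """High-level access: read the pixel at coordinates (x, y); zero outside the image (assumed)."""
    h, w = len(image), len(image[0])
    return image[y][x] if 0 <= x < w and 0 <= y < h else 0

def shift(plane, dx, dy):
    """Low-level model: shift the whole 2D register plane so lane (x, y) holds plane[y+dy][x+dx]."""
    h, w = len(plane), len(plane[0])
    return [[load(plane, x + dx, y + dy) for x in range(w)] for y in range(h)]

def high_level(image):
    """Each lane addresses its right and lower neighbours by coordinates."""
    h, w = len(image), len(image[0])
    return [[load(image, x, y) + load(image, x + 1, y) + load(image, x, y + 1)
             for x in range(w)] for y in range(h)]

def lowered(image):
    """Same stencil after translation: neighbours arrive via shifts, lanes read locally."""
    h, w = len(image), len(image[0])
    right = shift(image, 1, 0)   # shift along the first axis
    down = shift(image, 0, 1)    # shift along the second axis
    return [[image[y][x] + right[y][x] + down[y][x] for x in range(w)] for y in range(h)]

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(high_level(img) == lowered(img))  # prints "True"
```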

PARALLEL PROCESSING ARCHITECTURE USING DISTRIBUTED REGISTER FILES
20220308872 · 2022-09-29

Techniques for task processing based on a parallel processing architecture using distributed register files are disclosed. A two-dimensional array of compute elements is accessed. Each compute element is known to a compiler and is coupled to its neighboring compute elements. The array of compute elements is controlled on a cycle-by-cycle basis. The controlling is enabled by a stream of wide control words generated by the compiler. Virtual registers are mapped to a plurality of physical register files distributed among one or more of the compute elements. Virtual registers are represented by the compiler. The mapping is performed by the compiler. A broadcast write operation is enabled to two or more of the physical register files. Operations contained in the control words are executed. Operations are enabled by at least one of the distributed physical register files. Implementation in separate compute elements enables parallel operation processing.

General purpose distributed data parallel computing using a high level language

General-purpose distributed data-parallel computing using a high-level language is disclosed. Data-parallel portions of a sequential program that is written by a developer in a high-level language are automatically translated into a distributed execution plan. The distributed execution plan is then executed on large compute clusters. The developer can therefore write the program using familiar programming constructs in the high-level language. Moreover, developers without experience with distributed compute systems are able to take advantage of such systems.

Application interface on multiple processors
09766938 · 2017-09-19

A method and an apparatus that execute a parallel computing program in a programming language for a parallel computing architecture are described. The parallel computing program is stored in memory in a system with parallel processors. The system includes a host processor, a graphics processing unit (GPU) coupled to the host processor and a memory coupled to at least one of the host processor and the GPU. The parallel computing program is stored in the memory to allocate threads between the host processor and the GPU. The programming language includes an API to allow an application to make calls using the API to allocate execution of the threads between the host processor and the GPU. The programming language includes host function data tokens for host functions performed in the host processor and kernel function data tokens for compute kernel functions performed in one or more compute processors, e.g. GPUs or CPUs, separate from the host processor. Standard data tokens in the programming language schedule a plurality of threads for execution on a plurality of processors, such as CPUs or GPUs in parallel. Extended data tokens in the programming language implement executables for the plurality of threads according to the schedules from the standard data tokens.

OFFLOADING SERVER AND OFFLOADING PROGRAM

An offloading server includes: a data transfer designation section configured to analyze reference relationships of variables used in loop statements in an application and designate, for data that can be transferred outside a loop, a data transfer using an explicit directive that explicitly specifies a data transfer outside the loop; a parallel processing designation section configured to identify loop statements in the application and specify a directive specifying application of parallel processing by an accelerator and perform compilation for each of the loop statements; and a parallel processing pattern creation section configured to exclude loop statements causing a compilation error from loop statements to be offloaded and create a plurality of parallel processing patterns each of which specifies whether to perform parallel processing for each of the loop statements not causing a compilation error.
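The pattern creation step can be sketched as follows. This is a simplified illustration, not the claimed implementation: it enumerates every on/off combination exhaustively, which the abstract does not specify, and it takes the set of loops that fail to compile as a given input instead of invoking a real compiler. The name `create_patterns` and the loop labels are hypothetical.

```python
from itertools import product

def create_patterns(loops, fails_to_compile):
    """Build parallel processing patterns over the loops that compile with the directive.

    loops:            loop identifiers found in the application (hypothetical labels)
    fails_to_compile: loops whose accelerator directive caused a compilation error
    """
    # Exclude loops causing a compilation error from the offload candidates.
    eligible = [loop for loop in loops if loop not in fails_to_compile]
    # Each pattern states, per eligible loop, whether to run it in parallel on the accelerator.
    patterns = [dict(zip(eligible, choice))
                for choice in product([False, True], repeat=len(eligible))]
    return eligible, patterns

eligible, patterns = create_patterns(["loop1", "loop2", "loop3"], {"loop2"})
print(eligible)       # prints "['loop1', 'loop3']"
print(len(patterns))  # prints "4"
```

Exhaustive enumeration grows as 2^n in the number of eligible loops, so a practical system would likely sample or search this space; the sketch only shows what a single pattern contains.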

Pre-instruction scheduling rematerialization for register pressure reduction

Examples are disclosed herein that relate to performing rematerialization operation(s) on program source code prior to instruction scheduling. In one example, a method includes prior to performing instruction scheduling on program source code, for each basic block of the program source code, determining a register pressure at a boundary of the basic block, determining whether the register pressure at the boundary is greater than a target register pressure, based on the register pressure at the boundary being greater than the target register pressure, identifying one or more candidate instructions in the basic block suitable for rematerialization to reduce the register pressure at the boundary, and performing a rematerialization operation on at least one of the one or more candidate instructions to reduce the register pressure at the boundary to be less than the target register pressure.
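The per-block check described above can be sketched as a small routine. This is a minimal model under stated assumptions: register pressure at the boundary is approximated by the number of values live out of the block, every rematerialized value frees exactly one register, and the set of cheaply recomputable values is supplied by the caller. `Block`, `remat_candidates` and `rematerialize` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    live_out: set                                        # values live at the block boundary
    remat_candidates: set = field(default_factory=set)   # assumed cheap to recompute later

def rematerialize(block, target):
    """Pick values to rematerialize so boundary pressure drops below the target."""
    chosen = set()
    pressure = len(block.live_out)   # simplified boundary register pressure
    if pressure <= target:
        return chosen                # pressure is not greater than the target: nothing to do
    for value in sorted(block.live_out & block.remat_candidates):
        if pressure < target:
            break                    # goal reached: pressure is below the target
        chosen.add(value)            # recompute this value after the boundary instead
        pressure -= 1
    block.live_out -= chosen         # rematerialized values no longer cross the boundary
    return chosen

bb = Block("bb0", {"v1", "v2", "v3", "v4", "v5"}, {"v2", "v4", "v5"})
print(sorted(rematerialize(bb, target=4)))  # prints "['v2', 'v4']"
print(sorted(bb.live_out))                  # prints "['v1', 'v3', 'v5']"
```

Running this before instruction scheduling, once per basic block, mirrors the ordering in the abstract; a real compiler would also weigh the cost of recomputing each value.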

Method of Using Multidimensional Blockification To Optimize Computer Program and Device Thereof

Disclosed embodiments relate to a method and device for optimizing compilation of source code. The proposed method receives a first intermediate representation code of a source code and analyzes each basic block instruction of the plurality of basic block instructions contained in the first intermediate representation code for blockification. To blockify identical instructions, one or more groups of basic block instructions are assessed for eligibility for blockification. Once a group is determined to be eligible, its basic block instructions are blockified using either one-dimensional or two-dimensional SIMD vectorization. The method further generates a second intermediate representation of the source code, which is translated to executable target code with more efficient processing capacity.