G06F8/445

Methods, devices, and media for reducing register pressure in flexible vector processors

Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread level parallelism is increased. A compiler may be used to perform a first method to partially or fully scalarize vectorized instructions of a code region of high register pressure. A compiler may be used to perform a second method to fully scalarize a sequence of vectorized instructions while preserving associations of the scalar instructions with their original vectorized instructions; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.

Offloading server and offloading program

An offloading server includes: a data transfer designation section configured to analyze reference relationships of variables used in loop statements in an application and designate, for data that can be transferred outside a loop, a data transfer using an explicit directive that explicitly specifies a data transfer outside the loop; a parallel processing designation section configured to identify loop statements in the application and specify a directive specifying application of parallel processing by an accelerator and perform compilation for each of the loop statements; and a parallel processing pattern creation section configured to exclude loop statements causing a compilation error from loop statements to be offloaded and create a plurality of parallel processing patterns each of which specifies whether to perform parallel processing for each of the loop statements not causing a compilation error.

Method of Using Multidimensional Blockification To Optimize Computer Program and Device Thereof

Disclosed embodiments relate to a method and device for optimizing compilation of source code. The proposed method receives a first intermediate representation code of a source code and analyses each basic block instruction of the plurality of basic block instructions contained in the first intermediate representation code for blockification. In order to blockify the identical instructions, the one or more groups of basic block instructions are assessed for eligibility of blockification. Upon determining as eligible, the group of basic block instructions are blockified using one of one dimensional SIMD vectorization and two-dimensional SIMD vectorization. The method further generates a second intermediate representation of the source code which is translated to executable target code with more efficient processing capacity.

PARALLEL PROCESSING ARCHITECTURE USING SPECULATIVE ENCODING
20220214885 · 2022-07-07 · ·

Techniques for program execution in a parallel processing architecture using speculative encoding are disclosed. A two-dimensional array of compute elements is accessed, where each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the array of compute elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide, variable length, control words generated by the compiler. Two or more operations are coalesced into a control word, where the control word includes a branch decision and operations associated with the branch decision. The coalesced control word includes speculatively encoded operations for at least two possible branch paths. The at least two possible branch paths generate independent side effects. Operations associated with the branch decision that are not indicated by the branch decision are suppressed.

PARALLEL PROCESSING ARCHITECTURE WITH DISTRIBUTED REGISTER FILES
20220291957 · 2022-09-15 · ·

Techniques for task processing based on a parallel processing architecture with distributed register files are disclosed. A two-dimensional array of compute elements is accessed. Each compute element is known to a compiler and is coupled to its neighboring compute elements. The array of compute elements is controlled on a cycle-by-cycle basis. The controlling is enabled by a stream of wide, variable length, control words generated by the compiler. Virtual registers are mapped to a plurality of physical register files distributed among one or more of the compute elements. Virtual registers are represented by the compiler. The mapping is performed by the compiler. A broadcast write operation is enabled to two or more of the physical register files. Operations contained in the control words are executed. Operations are enabled by at least one of the distributed physical register files. Implementation in separate compute elements enables parallel operation processing.

METHOD OF OPTIMIZING SCALAR REGISTER ALLOCATION AND A SYSTEM THEREOF

The present disclosure relates to a system and a method of optimizing scalar register allocation by a processor. The method comprises receiving an intermediate code and information about one or more available physical registers in a memory of the processor, as input. The method further comprises allocating one or more virtual registers based on the received information, wherein each virtual register is having size of each available physical register. The method also comprises mapping one or more groups of 8-bit location of the one or more virtual registers to one or more register classes. The method further comprises identifying a plurality of scalar variables from the input intermediate code, and dynamically assigning the one or more available physical registers to the identified scalar variables using the one or more register classes.

SYNCHRONIZATION INSTRUCTION INSERTION METHOD AND APPARATUS
20220113971 · 2022-04-14 ·

This application discloses example synchronization instruction insertion methods and example apparatuses. One example method includes obtaining a first program block comprising one or more statements, where each of the one or more statements includes one or more function instructions. A first function instruction and a second function instruction between which data dependency exists in the first program block can then be determined. A synchronization instruction pair between a first statement including the first function instruction and a second statement including the second function instruction can then be inserted.

DEVICES, METHODS, AND MEDIA FOR EFFICIENT DATA DEPENDENCY MANAGEMENT FOR IN-ORDER ISSUE PROCESSORS

Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.

METHODS, DEVICES, AND MEDIA FOR REDUCING REGISTER PRESSURE IN FLEXIBLE VECTOR PROCESSORS

Methods, devices and media for reducing register pressure in flexible vector processors are described. In various embodiments described herein, methods, devices and media are disclosed that selectively re-scalarize vector instructions in a sequence of instructions such that register pressure is reduced and thread level parallelism is increased. A compiler may be used to perform a first method to partially or fully scalarize vectorized instructions of a code region of high register pressure. A compiler may be used to perform a second method to fully scalarize a sequence of vectorized instructions while preserving associations of the scalar instructions with their original vectorized instructions; the scalar instructions may then be scheduled and selectively re-vectorized. Devices executing code compiled with either method are described, as are processor-readable media storing code compiled by either method.

HIGHLY PARALLEL PROCESSING ARCHITECTURE WITH COMPILER
20220075651 · 2022-03-10 · ·

Techniques for task processing using a highly parallel processing architecture with a compiler are disclosed. A two-dimensional array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. A set of directions is provided to the hardware, through a control word generated by the compiler, for compute element operation and memory access precedence. The set of directions enables the hardware to properly sequence compute element results. The set of directions controls data movement for the array of compute elements. A compiled task is executed on the array of compute elements, based on the set of directions. The compute element results are generated in parallel in the array, and the compute element results are ordered independently from control word arrival at each compute element.