G06F8/452

COMPUTER-READABLE RECORDING MEDIUM STORING CONVERSION PROGRAM AND CONVERSION METHOD
20230176851 · 2023-06-08

A recording medium stores a program causing a computer to execute a process including: generating, based on dependency relationships between statements in a program, a directed graph in which each statement in the program is a node and each dependency relationship is an edge; detecting, from the directed graph, based on the dependency relationships represented by the edges, a node in which a part of a loop process has a dependency relationship with another, preceding or following, node; updating the directed graph by dividing the detected node into a first node having the part of the loop process and a second node having the loop process other than that part, fusing the divided first node with the other node, and assigning dependency information based on a data access pattern to the node after fusing; and converting the program based on the updated directed graph.
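The divide-and-fuse step can be illustrated on plain loops. This is a minimal Python sketch, not code from the patent; the function names and the particular statements are hypothetical, and the two versions are only intended to show that splitting the dependent part out of the second loop and fusing it with its predecessor preserves the program's results.

```python
# Original program: two loop nodes.  Only the `b[i] = a[i] * 2` part of
# the second loop's body depends on the first loop's output.
def original(n):
    a, b, c = [0] * n, [0] * n, [0] * n
    for i in range(n):       # node 1
        a[i] = i + 1
    for i in range(n):       # node 2: only part of its body depends on node 1
        b[i] = a[i] * 2      # dependent part
        c[i] = i * i         # independent part
    return a, b, c

# Transformed program: node 2 is divided into a first node (the dependent
# part) and a second node (the rest); the first node is fused with node 1.
def transformed(n):
    a, b, c = [0] * n, [0] * n, [0] * n
    for i in range(n):       # fused node: node 1 + dependent part of node 2
        a[i] = i + 1
        b[i] = a[i] * 2
    for i in range(n):       # remaining independent part of node 2
        c[i] = i * i
    return a, b, c
```

In a real compiler the legality of the fusion would be established from the edges of the dependency graph rather than by inspection.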

Nested loops reversal enhancements

Systems, apparatuses and methods may provide for technology to identify, in user code, a nested loop which would result in cache memory misses when executed. The technology further reverses the order of iterations of a first inner loop in the nested loop to obtain a modified nested loop. Reversing the order of iterations increases the number of cache memory hits that occur when the modified nested loop is executed.
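The idea can be sketched in Python (a hypothetical example, not from the patent): when consecutive outer iterations sweep the same array, reversing the inner loop on alternate passes makes the elements touched last in one pass the first ones touched in the next, so they are still cache-resident when reused. Python cannot demonstrate the cache effect itself; the sketch only shows that the reversal is result-preserving when the inner loop has no loop-carried dependence.

```python
def forward(data, outer):
    # Baseline: every pass walks the array front-to-back, so each pass
    # starts at the element least recently touched (a likely cache miss).
    total = 0
    for _ in range(outer):
        for x in data:
            total += x
    return total

def alternating(data, outer):
    # Modified nested loop: the inner loop runs reversed on odd outer
    # iterations, reusing the most recently touched elements first.
    total = 0
    for k in range(outer):
        seq = data if k % 2 == 0 else reversed(data)
        for x in seq:
            total += x
    return total
```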

Methods and apparatus to eliminate partial-redundant vector loads

Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector loads. An example apparatus includes a node grouper to associate a vector operation with a node group based on a load type of the vector operation. The example apparatus also includes a candidate identifier to identify a candidate in the node group, the candidate including a subset of the vector operations of the node group. The example apparatus also includes a code optimizer to determine replacement code based on a characteristic of the candidate, and to compare an estimated cost of executing the replacement code to a threshold cost based on the cost of executing the candidate. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold cost.
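What a partial-redundant vector load looks like, and what replacement code might do about it, can be sketched as follows. This is an illustrative Python model (lists stand in for vector registers, slices for vector loads); the function names and the shift-based replacement are assumptions, not the patent's actual replacement code.

```python
def with_redundant_loads(a, n):
    # Each iteration issues two 4-wide "vector loads" that overlap in
    # three of their four elements: a[i:i+4] and a[i+1:i+5].
    out = []
    for i in range(n):
        v0 = a[i:i + 4]
        v1 = a[i + 1:i + 5]      # partially redundant with v0
        out.append([x + y for x, y in zip(v0, v1)])
    return out

def with_replacement_code(a, n):
    # Replacement code: one slightly wider load per iteration; the second
    # vector is formed by an in-register shift instead of a second memory
    # load, eliminating the partial redundancy.
    out = []
    for i in range(n):
        wide = a[i:i + 5]        # single 5-element load
        v0, v1 = wide[:4], wide[1:]
        out.append([x + y for x, y in zip(v0, v1)])
    return out
```

In the apparatus described above, the optimizer would emit the second form only when its estimated cost satisfies the threshold derived from the first.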

System and method of loop vectorization by compressing indexes and data elements from iterations based on a control mask

Loop vectorization methods and apparatus are disclosed. An example method includes generating a first control mask for a first set of iterations of a loop by evaluating a condition of the loop, wherein generating the first control mask includes setting a bit of the first control mask to a first value when the condition indicates that an operation of the loop is to be executed, and setting the bit of the first control mask to a second value when the condition indicates that the operation of the loop is to be bypassed. The example method also includes compressing indexes corresponding to the first set of iterations of the loop according to the first control mask.
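The mask-then-compress step can be modeled directly in Python (an illustrative sketch; real implementations would use hardware mask registers and compress instructions):

```python
def make_mask(values, cond):
    # One bit per iteration: the first value (1) when the loop operation
    # is to be executed, the second value (0) when it is to be bypassed.
    return [1 if cond(v) else 0 for v in values]

def compress(indexes, mask):
    # Pack the indexes of the executed iterations contiguously, so the
    # vectorized body only processes lanes whose mask bit is set.
    return [i for i, bit in zip(indexes, mask) if bit]

values = [3, -1, 4, -5, 9]
mask = make_mask(values, lambda v: v > 0)     # control mask
idx = compress(range(len(values)), mask)      # compressed indexes
```

The compressed index vector can then drive a gather of the corresponding data elements, so that every vector lane does useful work despite the conditional.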

Processor architecture

A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor; the slices may include memory slices (MEM) for storing operand data and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension and to receive instructions across a second dimension orthogonal to the first. The timing of the data and instruction flows is configured such that corresponding data and instructions arrive at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice determines what operations to perform on received data based upon the timing at which the data is received.

Methods and systems to vectorize scalar computer program loops having loop-carried dependences

Methods and systems to convert a scalar computer program loop having loop-carried dependences into a vector computer program loop are disclosed. One such method includes replacing a scalar recurrence operation in the scalar computer program loop with a first vector summing operation and a first vector recurrence operation. The first vector summing operation is to generate a first running sum, and the first vector recurrence operation is to generate a first vector. In some examples, the first vector recurrence operation is based on the scalar recurrence operation. Disclosed methods also include inserting: 1) a renaming operation to rename the first vector, 2) a second vector summing operation to generate a second running sum, and 3) a second vector recurrence operation to generate a second vector based on the renamed first vector.
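The core difficulty, and the running-sum workaround, can be sketched in Python. This is a simplified illustration, not the patent's exact sequence of operations: the inner per-chunk loop stands in for an in-register vector prefix-sum, and the carried scalar models how the recurrence value propagates between vector iterations.

```python
def scalar_recurrence(a):
    # The scalar loop: s carries a loop-carried dependence, which blocks
    # naive vectorization.
    s, out = 0, []
    for x in a:
        s = s + x
        out.append(s)
    return out

def vectorized_recurrence(a, vl=4):
    # Vector form: process the loop in vector-length (vl) chunks.  Within
    # each chunk a "vector summing operation" builds the running sum for
    # all lanes; the last lane is carried into the next chunk.
    out, carry = [], 0
    for base in range(0, len(a), vl):
        chunk = a[base:base + vl]
        running, acc = [], carry
        for x in chunk:          # stands in for the in-register prefix sum
            acc += x
            running.append(acc)
        carry = running[-1]      # recurrence value carried across chunks
        out.extend(running)
    return out
```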

Computer processor employing explicit operations that support execution of software pipelined loops and a compiler that utilizes such operations for scheduling software pipelined loops

A computer processor includes execution logic (having a number of functional units) configured to perform operations that access operand data values stored in a plurality of operand storage elements. Such operand data values include a predefined None operand data value indicative of a missing operand value. The operations include a RETIRE operation specifying a number of operand data values intended to be retired in a predefined machine cycle. During execution of the RETIRE operation, zero or more None operand data values are selectively retired in the predefined machine cycle based on the number of operand data values specified by the RETIRE operation and the number of operand data values to be retired as a result of execution of other operations by the execution logic in the same machine cycle. Other aspects and software tools are also described and claimed.

Parallelizing compile method, parallelizing compiler, parallelizing compile apparatus, and onboard apparatus

A parallelizing compile method includes: dividing a sequential program for an embedded system into multiple macro tasks; specifying (i) a starting end task and (ii) a termination end task; fusing (i) the starting end task, (ii) the termination end task, and (iii) a group of the multiple macro tasks into multiple new macro tasks; extracting, based on a data dependency, a group of multiple new macro tasks from the multiple new macro tasks produced by the fusing; performing static scheduling that assigns the multiple new macro tasks to multiple processor units, so that the group of the multiple new macro tasks is executable in parallel by the multiple processor units; and generating a parallelized program. In addition, a parallelizing compiler, a parallelizing compile apparatus, and an onboard apparatus are provided.
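Dependency-driven static scheduling of macro tasks can be sketched as follows. This is a hypothetical Python model (task names, the level-by-level strategy, and the round-robin fill are all assumptions; the patent's scheduler is not specified at this granularity), and it assumes the dependency graph is acyclic.

```python
# Hypothetical macro tasks; each entry maps a task to the tasks whose
# data it depends on ("start" and "end" play the starting/termination
# end-task roles).
deps = {"start": [], "t1": ["start"], "t2": ["start"],
        "t3": ["t1"], "end": ["t2", "t3"]}

def static_schedule(deps, num_units):
    # Level-by-level static scheduling: in each step, tasks whose data
    # dependencies are all satisfied are assigned to the processor units
    # (at most num_units tasks per step) and run in parallel.
    done, schedule = set(), []
    while len(done) < len(deps):
        ready = [t for t in deps if t not in done
                 and all(d in done for d in deps[t])]
        step = ready[:num_units]   # fill the available processor units
        schedule.append(step)
        done.update(step)
    return schedule
```

With two processor units, `t1` and `t2` land in the same step, which is exactly the parallelism the data-dependency extraction exposes.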

Loop nest parallelization without loop linearization
09760356 · 2017-09-12

Systems and methods may provide for identifying a nested loop iteration space in user code, wherein the nested loop iteration space includes a plurality of outer loop iterations, and distributing iterations from the nested loop iteration space across a plurality of threads, wherein each thread is assigned a group of outer loop iterations. Additionally, a compiler output may be automatically generated, wherein the compiler output contains serial code corresponding to each group of outer loop iterations and de-linearization code to be executed outside the plurality of outer loop iterations. In one example, the de-linearization code includes index recovery code that is positioned before one or more instances of the serial code in the compiler output.
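The distribution of outer-loop iteration groups, with index recovery hoisted outside the serial code, can be sketched in Python. This is an illustrative model, not the patented compiler output: the thread loop stands in for real threads, and the `start = tid * per` line plays the role of the index recovery code that runs once per group, before the serial inner code.

```python
def parallel_outer_groups(n_i, n_j, num_threads):
    # Each thread is assigned a contiguous group of outer-loop iterations;
    # the nested i/j space is never flattened into one linear loop.
    out = [[0] * n_j for _ in range(n_i)]
    per = -(-n_i // num_threads)        # ceil(n_i / num_threads)
    for tid in range(num_threads):      # stands in for real threads
        start = tid * per               # index recovery, outside the group
        for i in range(start, min(start + per, n_i)):
            for j in range(n_j):        # serial code for the group
                out[i][j] = i * n_j + j
    return out
```

Because the de-linearization arithmetic executes once per thread rather than once per iteration, the serial inner loops stay free of division/modulo index computations.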

PROGRAM, INFORMATION CONVERSION DEVICE, AND INFORMATION CONVERSION METHOD

A program causes a computer to serve as an information conversion device that is equipped with at least one of (A) through (E):

(A) a replication necessity analysis processor that specifies a location where an instruction referred to from phi functions present in one basic block is present, and inserts an inter-register transfer instruction there;

(B) an intra-loop constant analysis processor that specifies a closed path in which the references of the phi functions are circulated, and inserts the inter-register transfer instruction there;

(C) an inter-instruction dependency analysis processor that specifies a location where a data dependency is present between instructions that are reference destinations of the phi functions, and inserts the inter-register transfer instruction there;

(D) an identical instruction reference analysis processor that specifies, in a plurality of execution paths, a location where phi functions referring to a result of the identical instruction before branching are present, and inserts the inter-register transfer instruction there; and

(E) a spill-out effectiveness analysis processor that stores a parameter value, present in loop processing and targeted by the inter-register transfer instruction, in a storage element other than a general-purpose register before start of the loop processing, loads the value after end of the loop processing, and deletes the inter-register transfer instruction.