Patent classifications
G06F8/4441
PACKING CONDITIONAL BRANCH OPERATIONS
Disclosed in some examples, are systems, methods, devices, and machine readable mediums which use improved dynamic programming algorithms to pack conditional branch instructions. Conditional code branches may be modeled as directed acyclic graphs (DAGs) which have a topological ordering. These DAGs may be used to construct a dynamic programming table to find a partial mapping of one path onto the other path using dynamic programming algorithms.
METHOD AND SYSTEM FOR SOFTWARE ENHANCEMENT AND MANAGEMENT
A software enhancement and management system (E&M System) can include two ways to decompose existing software such that new functionality can be added: functional decomposition and time-affecting linear pathway (TALP) decomposition. Functional decomposition captures the inputs and outputs of the existing software's functions and attaches the new algorithmic constructs presented as other functions that receive the outputs of the existing software's functions. TALP decomposition allows for the generation of time-prediction polynomials that approximate time-complexity functions, speedup, and automatic dynamic loop-unrolling-based parallelization for each TALP.
Reduced memory consumption of compiler-transformed asynchronous methods
An asynchronous method is implemented in a manner that reduces the amount of runtime overhead needed to execute the asynchronous method. The data elements needed to suspend an asynchronous method to await completion of an asynchronous operation, to resume the asynchronous method at a resumption point, and to provide a completion status of the caller of the asynchronous method are consolidated into one or two reusable objects. An asynchronous method may be associated with a distinct object pool of reusable objects. The size of a pool and the total size of all pools can be configured statically or dynamically based on runtime conditions.
Device profiling in GPU accelerators by using host-device coordination
System and method of compiling a program having a mixture of host code and device code to enable Profile Guided Optimization (PGO) for device code execution. An exemplary integrated compiler can compile source code programmed to be executed by a host processor (e.g., CPU) and a co-processor (e.g., a GPU) concurrently. The compilation can generate an instrumented executable code which includes: profile instrumentation counters for the device functions; and instructions for the host processor to allocate and initialize device memory for the counters and to retrieve collected profile information from the device memory to generate instrumentation output. The output is fed back to the compiler for compiling the source code a second time to generate optimized executable code for the device functions defined in the source code.
Information processing apparatus, computer-readable recording medium storing compiling program, and compiling method
An information processing apparatus includes a processor configured to: for each of a plurality of loops, acquire loop information including a number of variables, a number of registers, a number of memory commands for inputting and outputting a value of the variable between the register and a main storage device, and a number of arithmetic commands for the value of the variable stored in the register, which are used in the loop; calculate the number of variables, the number of registers, the number of memory commands, and the number of arithmetic commands, which correspond to a combination of the loops that are candidates for loop fusion, for each of the combinations of the loops; determine a combination to which the loop fusion is to be applied among the combinations which are calculated for each of the combinations; and execute the loop fusion on the determined combination.
Configurable delay insertion in compiled instructions
Techniques are disclosed for utilizing configurable delays in an instruction stream. A set of instructions to be executed on a set of engines are generated. The set of engines are distributed between a set of hardware elements. A set of configurable delays are inserted into the set of instructions. Each of the set of configurable delays includes an adjustable delay amount that delays an execution of the set of instructions on the set of engines. The adjustable delay amount is adjustable by a runtime application that facilitates the execution of the set of instructions on the set of engines. The runtime application is configured to determine a runtime condition associated with the execution of the set of instructions on the set of engines and to adjust the set of configurable delays based on the runtime condition.
METHOD OF REORDERING CONDITION CHECKS
Described is a computer-implemented method of reordering condition checks. Two or more condition checks in computer code that may be reordered within the code are identified. It is determined that the execution frequency of a later one of the condition checks is satisfied at a greater frequency than a preceding one of the condition checks. It is determined that there is an absence of side effects in the two or more condition checks. The values of the condition checks are propagated and abstract interpretation is performed on the values that are propagated. It is determined that the condition checks are exclusive of each other, and the condition checks are reordered within the computer code.
METHODS AND APPARATUS TO ELIMINATE PARTIAL-REDUNDANT VECTOR LOADS
Methods, apparatus, systems and articles of manufacture are disclosed to eliminate partial-redundant vector load operations. An example apparatus includes a node grouper to associate a vector operation with a node group, a candidate verifier to perform a dependencies test on a subset of the node group, and identify a subset of the node group as a candidate when the subset satisfies the dependencies test, and a code optimizer to determine replacement code based on a characteristic of the candidate in the node group and compare an estimated cost associated with executing the replacement code to a threshold. The example apparatus also includes a code generator to generate machine code using the replacement code when the estimated cost of executing the replacement code satisfies the threshold.
Computation modification by amplification of stencil including stencil points
In a sequence of major computational steps or in an iterative computation, a stencil amplifier can increase the number of data elements accessed from one or more data structures in a single major step or iteration, thereby decreasing the total number of computations and/or communication operations in the overall sequence or the iterative computation. Stencil amplification, which can be optimized according to a specified parameter such as compile time, rune time, code size, etc., can improve the performance of a computing system executing the sequence or the iterative computation in terms of run time, memory load, energy consumption, etc. The stencil amplifier typically determines boundaries, to avoid erroneously accessing data elements not present in the one or more data structures.
Neural network operation reordering for parallel execution
Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.