G06F8/458

Compiler-Generated Asynchronous Enumerable Object
20190265956 · 2019-08-29 ·

An asynchronous method is implemented in a manner that reduces the amount of runtime overhead needed to execute the asynchronous method. The data elements needed to suspend an asynchronous method to await completion of an asynchronous operation, to resume the asynchronous method at a resumption point, and to provide a completion status of the caller of the asynchronous method are consolidated into one or two reusable objects. An asynchronous method may be associated with a distinct object pool of reusable objects. The size of a pool and the total size of all pools can be configured statically or dynamically based on runtime conditions.

System and method to accelerate reduce operations in graphics processor

Embodiments described herein provide a system, method, and apparatus to accelerate reduce operations in a graphics processor. One embodiment provides an apparatus including one or more processors, the one or more processors including a first logic unit to perform a merged write, barrier, and read operation in response to a barrier synchronization request from a set of threads in a work group, synchronize the set of threads, and report a result of an operation specified in association with the barrier synchronization request.

Optimize control-flow convergence on SIMD engine using divergence depth

There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running SPMD (Single Program Multiple Data) code on SIMD (Single Instruction Multiple Data) machine. The machine runs an instruction stream over input data streams. The machine increments lane depth counters of all active lanes upon the thread-PC reaching a branch operation. The machine updates the lane-PC of each active lane according to targets of the branch operation. The machine selects an active lane and activates only lanes whose lane-PCs match the thread-PC. The machine decrements the lane depth counters of the selected active lanes and updates the lane-PC of each active lane upon the instruction stream reaching a first instruction. The machine assigns the lane-PC of a lane with a largest lane depth counter value to the thread-PC and activates all lanes whose lane-PCs match the thread-PC.

METHOD OF REORDERING CONDITION CHECKS
20190188113 · 2019-06-20 ·

Described is a computer-implemented method of reordering condition checks. Two or more condition checks in computer code that may be reordered within the code are identified. It is determined that the execution frequency of a later one of the condition checks is satisfied at a greater frequency than a preceding one of the condition checks. It is determined that there is an absence of side effects in the two or more condition checks. The values of the condition checks are propagated and abstract interpretation is performed on the values that are propagated. It is determined that the condition checks are exclusive of each other, and the condition checks are reordered within the computer code.

Reduced Memory Consumption of Compiler-Transformed Asynchronous Methods
20190187965 · 2019-06-20 ·

An asynchronous method is implemented in a manner that reduces the amount of runtime overhead needed to execute the asynchronous method. The data elements needed to suspend an asynchronous method to await completion of an asynchronous operation, to resume the asynchronous method at a resumption point, and to provide a completion status of the caller of the asynchronous method are consolidated into one or two reusable objects. An asynchronous method may be associated with a distinct object pool of reusable objects. The size of a pool and the total size of all pools can be configured statically or dynamically based on runtime conditions.

Distributed synchronization scheme

A system including a machine-learning accelerator (MLA) hardware comprising computation-control units that each have a programmable dependency matrix; and a compiler computing module configured to generate, based on a machine-learning model, dependency instructions indicating dependencies between the computation-control units; wherein the computation-control units include at least: a first computation-control unit configured to generate, after completion of a first operation, a synchronization token representing the completion of the first operation, the synchronization token specifying a recipient identifier for an intended recipient computation-control unit of the synchronization token; a second computation-control unit configured to: configure the programmable dependency matrix of the second computation-control unit according to the dependency instructions to include dependency conditions for performing operations; receive the synchronization token based on the recipient identifier; update a dependency state to reflect the received synchronization token; and execute an operation in response to a determination that the dependency state satisfies the dependency condition.

SYNCHRONISATION OF EXECUTION THREADS ON A MULTI-THREADED PROCESSOR
20190155607 · 2019-05-23 ·

Method and apparatus are provided for synchronising execution of a plurality of threads on a multi-threaded processor. A program executed by a thread can have a number of synchronisation points corresponding to points where execution is to be synchronised with another thread. Execution of a thread is paused when it reaches a synchronisation point until at least one other thread with which it is intended to be synchronised reaches a corresponding synchronisation point. Execution is subsequently resumed. A control core maintains status data for threads and can cause a thread that is ready to run to use execution resources that were occupied by a thread that is waiting for a synchronisation event.

DEVICE PROFILING IN GPU ACCELERATORS BY USING HOST-DEVICE
20190146766 · 2019-05-16 ·

System and method of compiling a program having a mixture of host code and device code to enable Profile Guided Optimization (PGO) for device code execution. An exemplary integrated compiler can compile source code programmed to be executed by a host processor (e.g., CPU) and a co-processor (e.g., a GPU) concurrently. The compilation can generate an instrumented executable code which includes: profile instrumentation counters for the device functions; and instructions for the host processor to allocate and initialize device memory for the counters and to retrieve collected profile information from the device memory to generate instrumentation output. The output is fed back to the compiler for compiling the source code a second time to generate optimized executable code for the device functions defined in the source code.

METHOD AND SYSTEM FOR YIELD OPERATION SUPPORTING THREAD-LIKE BEHAVIOR

A method, system, and computer program product synchronize a group of workitems executing an instruction stream on a processor. The processor is yielded by a first workitem responsive to a synchronization instruction in the instruction stream. A first one of a plurality of program counters is updated to point to a next instruction following the synchronization instruction in the instruction stream to be executed by the first workitem. A second workitem is run on the processor after the yielding.

Method of reordering condition checks between variables based on a frequency that check is satisfied

Described is a computer-implemented method of reordering condition checks. Two or more condition checks in computer code that may be reordered within the code are identified. It is determined that the execution frequency of a later one of the condition checks is satisfied at a greater frequency than a preceding one of the condition checks. It is determined that there is an absence of side effects in the two or more condition checks. The values of the condition checks are propagated and abstract interpretation is performed on the values that are propagated. It is determined that the condition checks are exclusive of each other, and the condition checks are reordered within the computer code.