G06F9/38885

BOOSTING LOCAL MEMORY PERFORMANCE IN PROCESSOR GRAPHICS
20180300139 · 2018-10-18

In some cases, processor graphics with a slower local memory can compensate by using another memory in place of the lowest level or L3 cache. For example, in some processors, there is a large register space that can be used for the local memory function by allocating the local memory within those registers. Also, since the registers do not operate with barriers, barriers can be simulated by letting one execution unit thread execute more SIMD instructions. For example, one execution thread may simulate a whole work-group in the OpenCL API.
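The idea of one execution thread standing in for a whole work-group can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the list used as register-backed local memory, and the reversal workload are all assumptions chosen for the example. Each loop plays the role of one barrier-delimited phase of the kernel, so the end of the first loop acts as the simulated barrier.

```python
def run_work_group_in_one_thread(data):
    """Reverse `data` through local memory, with one thread simulating every work-item.

    Hypothetical sketch: `local_mem` stands in for local memory allocated in
    registers; the real register allocation is not modeled here.
    """
    n = len(data)
    local_mem = [None] * n

    # Phase 1: every simulated work-item stores its element to local memory.
    for work_item in range(n):
        local_mem[work_item] = data[work_item]

    # Simulated barrier: the loop above has finished, so all writes are
    # complete before any simulated work-item starts reading.

    # Phase 2: every simulated work-item reads the mirrored element.
    out = [None] * n
    for work_item in range(n):
        out[work_item] = local_mem[n - 1 - work_item]
    return out
```

Splitting the kernel body at the barrier and looping over work-items per phase is what removes the need for a hardware barrier in this sketch.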

Compute optimization mechanism for deep neural networks

An apparatus to facilitate compute optimization is disclosed. The apparatus includes at least one processor to perform operations to implement a neural network and compute logic to accelerate neural network computations.

ENHANCED PERFORMANCE FOR GRAPHICAL PROCESSING UNIT TRANSACTIONAL MEMORY

A computer system implementing transactional memory. The computer system includes a plurality of Single Instruction Multiple Thread (SIMT) cores and a conflicting address table (CAT) for each core. The CAT stores word addresses for reads and writes correlated with flags indicating whether a corresponding word is written or read by a committing transaction. The CATs for the different SIMT cores are coupled together by an interconnect. A commit unit (CU) is coupled to the SIMT cores and is configured to validate transactions. Each core accesses its CAT to obtain a first address of data affected by a first transaction to be committed at the CU. The first address is compared to a second address affected by a second transaction. When the first address matches the second address, the core delays or prevents committing the first transaction at the CU by pausing the first transaction or aborting the first transaction.
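The address comparison described above might look roughly like the following software model. It is only a sketch under stated assumptions: the class name, the `record` and `try_commit` helpers, and the `policy` parameter are hypothetical, and the interconnect between per-core CATs and the commit unit itself are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ConflictingAddressTable:
    """Hypothetical per-core CAT: word address -> read/written flags."""
    entries: dict = field(default_factory=dict)

    def record(self, address, *, read=False, written=False):
        # Note that a committing transaction read and/or wrote this word.
        flags = self.entries.setdefault(address, {"read": False, "written": False})
        flags["read"] |= read
        flags["written"] |= written

    def conflicts(self, address):
        # A match means another (committing) transaction touches this word.
        return address in self.entries


def try_commit(cat, tx_addresses, policy="pause"):
    """Return "commit" if no address matches, else the chosen policy.

    `policy` is an assumed knob selecting between pausing and aborting.
    """
    for address in tx_addresses:
        if cat.conflicts(address):
            return policy
    return "commit"
```

For example, after `cat.record(0x100, written=True)`, a transaction touching only `0x200` commits, while one touching `0x100` is paused or aborted before it reaches the commit unit.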

Execution of divergent threads using a convergence barrier

A method, system, and computer program product for executing divergent threads using a convergence barrier are disclosed. A first instruction in a program is executed by a plurality of threads, where the first instruction, when executed by a particular thread, indicates to a scheduler unit that the thread participates in a convergence barrier. A first path through the program is executed by a first divergent portion of the participating threads and a second path through the program is executed by a second divergent portion of the participating threads. The first divergent portion of the participating threads executes a second instruction in the program and transitions to a blocked state at the convergence barrier. The scheduler unit determines that all of the participating threads are synchronized at the convergence barrier and the convergence barrier is cleared.
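The barrier bookkeeping in this abstract can be modeled with two sets, one for participating threads and one for blocked threads. The sketch below is an assumption-laden illustration: the class and method names (`add`, `wait`, `try_clear`) are invented here and do not correspond to any real instruction set or scheduler interface.

```python
class ConvergenceBarrier:
    """Hypothetical model of a convergence barrier tracked by a scheduler."""

    def __init__(self):
        self.participants = set()
        self.blocked = set()

    def add(self, thread_id):
        # Models the first instruction: the thread joins the barrier.
        self.participants.add(thread_id)

    def wait(self, thread_id):
        # Models the second instruction: the thread blocks at the barrier.
        if thread_id not in self.participants:
            raise ValueError(f"thread {thread_id} does not participate")
        self.blocked.add(thread_id)

    def try_clear(self):
        # The scheduler clears the barrier once every participant is blocked.
        if self.participants and self.blocked == self.participants:
            self.participants.clear()
            self.blocked.clear()
            return True
        return False
```

With four participating threads split across two paths, `try_clear()` stays false while only the first divergent portion has called `wait`, and returns true once the second portion arrives.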

Architecture and execution for efficient mixed precision computations in single instruction multiple data/thread (SIMD/T) devices

A method for improving power, performance, area (PPA) for mixed precision computations in a processing environment. The method includes determining a braiding factor as a number of units of work encoded into a physical thread. A value of the braiding factor is determined based on a mix of precision requirements presented for individual units of work. Units of work are classified as instructions for applied code transformation based on associated precision requirements for the processing environment. Instruction inputs from specified registers are packed together into a destination register according to the determined value of the braiding factor. The packed instructions presented in vector form are executed with an instruction set architecture configured for executing packed instructions of different precisions.
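To make the packing step concrete, here is a small sketch that packs several low-precision operands into one wider register value. The rule used for the braiding factor (register width divided by operand width) is an assumption for illustration; the abstract only says the factor depends on the mix of precision requirements. All function names are hypothetical.

```python
def braiding_factor(register_bits, precision_bits):
    """Assumed rule: how many operands of this precision fit in one register."""
    return register_bits // precision_bits


def pack(values, precision_bits):
    """Pack unsigned operands into one integer, lowest index in the low bits."""
    mask = (1 << precision_bits) - 1
    packed = 0
    for i, value in enumerate(values):
        packed |= (value & mask) << (i * precision_bits)
    return packed


def unpack(packed, precision_bits, count):
    """Recover `count` operands from a packed register value."""
    mask = (1 << precision_bits) - 1
    return [(packed >> (i * precision_bits)) & mask for i in range(count)]
```

Under these assumptions, a 32-bit register holds two 16-bit units of work, so `pack([0x1234, 0xABCD], 16)` yields `0xABCD1234` and `unpack` reverses it.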

OPTIMIZE CONTROL-FLOW CONVERGENCE ON SIMD ENGINE USING DIVERGENCE DEPTH
20180232239 · 2018-08-16

There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running SPMD (Single Program Multiple Data) code on a SIMD (Single Instruction Multiple Data) machine. The machine runs an instruction stream over input data streams. The machine increments lane depth counters of all active lanes upon the thread-PC reaching a branch operation. The machine updates the lane-PC of each active lane according to targets of the branch operation. The machine selects an active lane and activates only lanes whose lane-PCs match the thread-PC. The machine decrements the lane depth counters of the selected active lanes and updates the lane-PC of each active lane upon the instruction stream reaching a first instruction. The machine assigns the lane-PC of a lane with a largest lane depth counter value to the thread-PC and activates all lanes whose lane-PCs match the thread-PC.
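The final selection step, picking the lane with the largest depth counter and activating every lane at the same PC, can be sketched as below. The function name and the list-based lane state are assumptions; counter increments and decrements at branches are not modeled, and ties are broken by lowest lane index purely as a convenience of this sketch.

```python
def select_active_lanes(lane_pcs, lane_depths):
    """Return (thread_pc, active_mask) using a deepest-lane-first policy.

    Hypothetical sketch: `lane_pcs[i]` and `lane_depths[i]` are the lane-PC
    and lane depth counter of lane i.
    """
    # max() returns the first lane with the largest depth counter.
    deepest = max(range(len(lane_pcs)), key=lambda lane: lane_depths[lane])
    thread_pc = lane_pcs[deepest]
    # Activate every lane whose lane-PC matches the new thread-PC.
    active = [pc == thread_pc for pc in lane_pcs]
    return thread_pc, active
```

For lane PCs `[40, 12, 12, 40]` with depth counters `[1, 2, 2, 1]`, the thread-PC becomes 12 and only the two middle lanes are activated, so the most deeply nested path runs first.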

Uniform load processing for parallel thread sub-sets

One embodiment of the present invention sets forth a technique for processing load instructions for parallel threads of a thread group when a sub-set of the parallel threads request the same memory address. The load/store unit determines if the memory addresses for each sub-set of parallel threads match based on one or more uniform patterns. When a match is achieved for at least one of the uniform patterns, the load/store unit transmits a read request to retrieve data for the sub-set of parallel threads. The number of read requests transmitted is reduced compared with performing a separate read request for each thread in the sub-set. A variety of uniform patterns may be defined based on common access patterns present in program instructions. A variety of uniform patterns may also be defined based on interconnect constraints between the load/store unit and the memory when a full crossbar interconnect is not available.
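A rough software model of the pattern check might look like this. The representation is an assumption: each uniform pattern is given as a list of thread sub-sets (tuples of thread indices), and the function name is hypothetical. A pattern matches when all threads within each sub-set request the same address, in which case one read request per sub-set is issued.

```python
def plan_read_requests(addresses, patterns):
    """Return the list of addresses to read, one entry per request.

    Hypothetical sketch: patterns are tried in order and the first match wins;
    if none matches, every thread gets its own request.
    """
    for pattern in patterns:
        uniform = all(
            len({addresses[thread] for thread in subset}) == 1
            for subset in pattern
        )
        if uniform:
            # One request per sub-set, using the sub-set's shared address.
            return [addresses[subset[0]] for subset in pattern]
    return list(addresses)
```

With patterns `[[(0, 1, 2, 3)], [(0, 1), (2, 3)]]`, four threads that all read address 8 need one request, threads reading `[8, 8, 16, 16]` need two, and four distinct addresses fall back to four requests.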

Method and apparatus for SIMD structured branching

An apparatus and method for SIMD structured branching. For example, one embodiment of a processor comprises: an execution unit having a plurality of channels to execute instructions; and a branch unit to process control flow instructions and to maintain a per channel count for each channel and a control instruction count for the control flow instructions, the branch unit to enable and disable the channels based at least on the per channel count.
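One way such counters could drive channel enabling is sketched below. The scheme is an assumption, not taken from the abstract: a per-channel count of 0 means the channel is enabled, and a nonzero value records the nesting depth (the control instruction count) at which the channel was switched off. The class and method names are hypothetical.

```python
class BranchUnit:
    """Hypothetical counter-based model of structured IF/ELSE/ENDIF masking."""

    def __init__(self, num_channels):
        self.control_count = 0                   # current nesting depth
        self.channel_count = [0] * num_channels  # 0 means enabled

    def enabled(self):
        return [count == 0 for count in self.channel_count]

    def if_(self, conditions):
        self.control_count += 1
        for ch, cond in enumerate(conditions):
            # Only currently enabled channels can be disabled by this IF.
            if self.channel_count[ch] == 0 and not cond:
                self.channel_count[ch] = self.control_count

    def else_(self):
        for ch, count in enumerate(self.channel_count):
            if count == 0:
                self.channel_count[ch] = self.control_count  # took the IF side
            elif count == self.control_count:
                self.channel_count[ch] = 0                   # takes the ELSE side

    def endif(self):
        for ch, count in enumerate(self.channel_count):
            if count == self.control_count:
                self.channel_count[ch] = 0
        self.control_count -= 1
```

Because a channel disabled at an outer level keeps its smaller count, nested ELSE and ENDIF instructions leave it disabled until the matching outer ENDIF is reached.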

SIMD CHANNEL UTILIZATION UNDER DIVERGENT CONTROL FLOW

Methods and apparatus relating to techniques for improved SIMD channel utilization in a divergent control flow environment. In an example, an apparatus comprises logic, at least partially comprising hardware logic, to determine instructions in an instruction set which are combinable into a super-instruction to execute in a divergent control flow environment, combine a first instruction and a second instruction to form a super-instruction, encode the super-instruction, and queue the super-instruction for execution on a processor. Other embodiments are also disclosed and claimed.
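The combine-and-queue flow can be illustrated with a toy model. The combinability test used here (the two instructions have disjoint channel masks, as the two sides of a divergent branch would) is an assumption, since the abstract does not state the criterion. The dictionary encoding and all function names are likewise hypothetical.

```python
from collections import deque


def combinable(first, second):
    """Assumed criterion: the instructions act on disjoint SIMD channels."""
    return first["mask"] & second["mask"] == 0


def combine(first, second):
    """Form a super-instruction covering the channels of both instructions."""
    return {"parts": [first, second], "mask": first["mask"] | second["mask"]}


def execute(super_instruction, registers):
    """Apply each part's operation to the channels selected by its mask."""
    out = list(registers)
    for part in super_instruction["parts"]:
        for ch in range(len(out)):
            if (part["mask"] >> ch) & 1:
                out[ch] = part["op"](out[ch])
    return out


# Usage: the two sides of a divergent branch issued as one queued instruction.
then_side = {"op": lambda x: x + 1, "mask": 0b0101}
else_side = {"op": lambda x: x * 2, "mask": 0b1010}
queue = deque()
if combinable(then_side, else_side):
    queue.append(combine(then_side, else_side))
result = execute(queue.popleft(), [10, 20, 30, 40])
```

In this model all four channels do useful work in a single issue slot, rather than each side of the branch running with half of the channels idle.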

Optimize control-flow convergence on SIMD engine using divergence depth

There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running SPMD (Single Program Multiple Data) code on a SIMD (Single Instruction Multiple Data) machine. The machine runs an instruction stream over input data streams. The machine increments lane depth counters of all active lanes upon the thread-PC reaching a branch operation. The machine updates the lane-PC of each active lane according to targets of the branch operation. The machine selects an active lane and activates only lanes whose lane-PCs match the thread-PC. The machine decrements the lane depth counters of the selected active lanes and updates the lane-PC of each active lane upon the instruction stream reaching a first instruction. The machine assigns the lane-PC of a lane with a largest lane depth counter value to the thread-PC and activates all lanes whose lane-PCs match the thread-PC.