G06F9/38885

Exception handling for debugging in a graphics environment

An apparatus to facilitate exception handling for debugging in a graphics environment is disclosed. The apparatus includes load store pipeline hardware circuitry to: in response to a page fault exception being enabled for a memory access request received from a thread of a plurality of threads, allocate a memory dependency token correlated to a scoreboard identifier (SBID) that is included with the memory access request; send, to memory fabric of the graphics processor, the memory access request comprising the memory dependency token; receive, from the memory fabric in response to the memory access request, a memory access response comprising the memory dependency token and indicating occurrence of a page fault error condition and fault details associated with the page fault error condition; and return the SBID associated with the memory access response and fault details of the page fault error condition to a debug register of the thread.

Efficient data sharing for graphics data processing operations

An apparatus to facilitate efficient data sharing for graphics data processing operations is disclosed. The apparatus includes a processing resource to generate a stream of instructions; an L1 cache communicably coupled to the processing resource and comprising an on-page detector circuit to determine that a set of memory requests in the stream of instructions access a same memory page and to set a marker in a first request of the set of memory requests; and arbitration circuitry communicably coupled to the L1 cache, the arbitration circuitry to route the set of memory requests to memory comprising the memory page and to, in response to receiving the first request with the marker set, remain with the processing resource to process the set of memory requests.
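
The on-page check can be illustrated with a minimal software sketch. It is not the disclosed circuit: the function name `mark_on_page`, the record layout, and the 4096-byte page size are all assumptions made for illustration, since the abstract specifies none of them.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the abstract gives none


def mark_on_page(addresses):
    """Return one record per request; mark the first if all share a page."""
    pages = {addr // PAGE_SIZE for addr in addresses}
    on_page = len(pages) == 1
    return [
        {"addr": addr, "marker": on_page and i == 0}
        for i, addr in enumerate(addresses)
    ]


# Three requests that fall in the same page, then two that do not.
same = mark_on_page([0x1000, 0x1040, 0x1FFC])
mixed = mark_on_page([0x1000, 0x2000])
```

In this sketch an arbiter would read the `marker` field of the first request to decide whether to stay with the same requester for the rest of the set.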

Implementing specialized instructions for accelerating dynamic programming algorithms

Various techniques for accelerating dynamic programming algorithms are provided. For example, a fused addition and comparison instruction, a three-operand comparison instruction, and a two-operand comparison instruction are used to accelerate a Needleman-Wunsch algorithm that determines an optimized global alignment of subsequences over two entire sequences. In another example, the fused addition and comparison instruction is used in an innermost loop of a Floyd-Warshall algorithm to reduce the number of instructions required to determine shortest paths between pairs of vertices in a graph. In another example, a two-way single instruction multiple data (SIMD) floating point variant of the three-operand comparison instruction is used to reduce the number of instructions required to determine the median of an array of floating point values.
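
As a rough illustration of the Floyd-Warshall case, the sketch below models the fused addition and comparison as a single helper. The name `fused_add_min` is hypothetical, and the semantics min(a + b, c) are an assumption; the abstract does not name the instruction or define its operands.

```python
import math


def fused_add_min(a, b, c):
    """Hypothetical stand-in for a fused add-and-compare: min(a + b, c)."""
    s = a + b
    return s if s < c else c


def floyd_warshall(dist):
    """All-pairs shortest paths on an adjacency matrix (math.inf = no edge)."""
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy of the input
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # One fused operation stands in for a separate add,
                # compare, and select in the innermost loop.
                d[i][j] = fused_add_min(d[i][k], d[k][j], d[i][j])
    return d


graph = [
    [0, 3, math.inf],
    [math.inf, 0, 1],
    [2, math.inf, 0],
]
shortest = floyd_warshall(graph)
```

On hardware the saving would come from issuing one instruction per relaxation step; in Python the helper only shows where that instruction would sit.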

Method and apparatus for unstructured control flow for SIMD execution engine

An apparatus and method for SIMD unstructured branching. For example, one embodiment of a processor comprises: an execution unit having a plurality of channels to execute instructions; and a branch unit to process unstructured control flow instructions and to maintain a per channel count value for each channel, the branch unit to store instruction pointer tags for the unstructured control flow instructions in a memory and identify the instruction pointer tags using tag addresses, the branch unit to further enable and disable the channels based at least on the per channel count value.

Tree-based thread management
09921847 · 2018-03-20

In one embodiment of the present invention, a streaming multiprocessor (SM) uses a tree of nodes to manage threads. Each node specifies a set of active threads and a program counter. Upon encountering a conditional instruction that causes an execution path to diverge, the SM creates child nodes corresponding to each of the divergent execution paths. Based on the conditional instruction, the SM assigns each active thread included in the parent node to at most one child node, and the SM temporarily discontinues executing instructions specified by the parent node. Instead, the SM concurrently executes instructions specified by the child nodes. After all the divergent paths reconverge to the parent path, the SM resumes executing instructions specified by the parent node. Advantageously, the disclosed techniques enable the SM to execute divergent paths in parallel, thereby reducing undesirable program behavior associated with conventional techniques that serialize divergent paths across thread groups.
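
The node tree described above can be sketched in a few lines. This is a simplified software model, not the SM's implementation: the `Node` class and the `diverge` and `reconverge` helpers are hypothetical names, and the model handles only a two-way branch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    active: set  # ids of the threads active at this node
    pc: int      # program counter for this node
    children: list = field(default_factory=list)


def diverge(parent, predicate, taken_pc, fallthrough_pc):
    """Split the parent's active threads into one child node per path."""
    taken = {t for t in parent.active if predicate(t)}
    not_taken = parent.active - taken
    # Each active thread lands in at most one child; empty paths get no node.
    for threads, pc in ((taken, taken_pc), (not_taken, fallthrough_pc)):
        if threads:
            parent.children.append(Node(active=threads, pc=pc))
    return list(parent.children)


def reconverge(parent, reconverge_pc):
    """Discard the children and resume the parent at the reconvergence point."""
    parent.children.clear()
    parent.pc = reconverge_pc


root = Node(active={0, 1, 2, 3}, pc=10)
kids = diverge(root, lambda t: t % 2 == 0, taken_pc=20, fallthrough_pc=30)
```

While `root.children` is non-empty, a scheduler built on this model would issue instructions only from the child nodes, then call `reconverge` once every child reaches the join point.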

System and method for managing static divergence in a SIMD computing architecture

A method is presented for processing one or more instructions to be executed on multiple threads in a Single-Instruction-Multiple-Data (SIMD) computing system. The method includes the steps of analyzing the instructions to collect divergent threads among a plurality of thread groups of the multiple threads; obtaining a redirection array for thread-operand association adjustment among the divergent threads according to the analysis, where the redirection array is used for exchanging a first operand associated with a first divergent thread in a first thread group with a second operand associated with a second divergent thread in a second thread group; and generating compiled code corresponding to the instructions according to the redirection array.

Methods and systems for managing an instruction sequence with a divergent control flow in a SIMT architecture

A computer-implemented method of executing an instruction sequence with a recursive function call of a plurality of threads within a thread group in a Single-Instruction-Multiple-Threads (SIMT) system is provided. Each thread is provided with a function call counter (FCC), an active mask, an execution mask and a per-thread program counter (PTPC). The instruction sequence with the recursive function call is executed by the threads in the thread group according to a program counter (PC) indicating a target. Upon executing the recursive function call, for each thread, the active mask is set according to the PTPC and the target indicated by the PC, the FCC is determined when entering or returning from the recursive function call, and the execution mask is determined according to the FCC and the active mask. It is determined whether an execution result of the recursive function call takes effect according to the execution mask.

Dynamic wavefront creation for processing units using a hybrid compactor

A method, a non-transitory computer readable medium, and a processor for repacking dynamic wavefronts during program code execution on a processing unit, each dynamic wavefront including multiple threads, are presented. If a branch instruction is detected, a determination is made whether all wavefronts following a same control path in the program code have reached a compaction point, which is the branch instruction. If no branch instruction is detected in executing the program code, a determination is made whether all wavefronts following the same control path have reached a reconvergence point, which is a beginning of a program code segment to be executed by both a taken branch and a not taken branch from a previous branch instruction. The dynamic wavefronts are repacked with all threads that follow the same control path, if all wavefronts following the same control path have reached the branch instruction or the reconvergence point.
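
The repacking step alone can be sketched as follows. The sketch assumes every wavefront has already reached the compaction or reconvergence point, so it omits that check; the function name `repack`, the `(thread_id, path)` pair layout, and the fixed wavefront width are illustrative assumptions.

```python
from collections import defaultdict


def repack(wavefronts, width):
    """Regroup threads so each new wavefront holds a single control path.

    wavefronts: list of lists of (thread_id, path) pairs.
    Returns a list of (path, [thread_id, ...]) with at most `width` threads.
    """
    by_path = defaultdict(list)
    for wavefront in wavefronts:
        for thread_id, path in wavefront:
            by_path[path].append(thread_id)

    repacked = []
    for path in sorted(by_path):
        thread_ids = by_path[path]
        for start in range(0, len(thread_ids), width):
            repacked.append((path, thread_ids[start:start + width]))
    return repacked


before = [
    [(0, "taken"), (1, "not_taken"), (2, "taken"), (3, "not_taken")],
    [(4, "not_taken"), (5, "taken"), (6, "taken"), (7, "taken")],
]
after = repack(before, width=4)
```

Here two half-divergent wavefronts become three wavefronts that each follow one path, so no lane in a repacked wavefront has to be masked off for divergence.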

Analysis system and method for reducing the control flow divergence in the Graphics Processing Units (GPUs)

The invention discloses an analysis system and method for reducing control flow divergence in Graphics Processing Units (GPUs). A computing unit is used to count the number of branches and the number of cycles, and to calculate at least one direction ratio. A profiler is used to determine whether the code has the optimized control flow structure and the specialized branch. An optimization decision unit determines which transform pattern can be used to transform the sub-control flow structure.

Composable neural network kernels

A technique for manipulating a generic tensor is provided. The technique includes receiving a first request to perform a first operation on a generic tensor descriptor associated with the generic tensor, responsive to the first request, performing the first operation on the generic tensor descriptor, receiving a second request to perform a second operation on generic tensor raw data associated with the generic tensor, and responsive to the second request, performing the second operation on the generic tensor raw data.
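
The split between descriptor operations and raw-data operations can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not say what a generic tensor descriptor contains, so `TensorDescriptor` models it as a shape plus strides, and `transpose` and `scale` are hypothetical examples of the two kinds of operation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorDescriptor:
    shape: tuple
    strides: tuple

    def offset(self, index):
        """Map a multi-dimensional index to a position in the raw data."""
        return sum(i * s for i, s in zip(index, self.strides))


def transpose(desc):
    """Descriptor-only operation: reverse the dimensions, move no data."""
    return TensorDescriptor(desc.shape[::-1], desc.strides[::-1])


def scale(raw, factor):
    """Raw-data operation: change the elements, leave the descriptor alone."""
    return [x * factor for x in raw]


raw = [1, 2, 3, 4, 5, 6]  # a 2x3 tensor stored row-major
desc = TensorDescriptor(shape=(2, 3), strides=(3, 1))
desc_t = transpose(desc)  # a 3x2 view over the same raw data
raw_scaled = scale(raw, 10)
```

Element (2, 1) of the transposed view resolves to the same raw position as element (1, 2) of the original, which is why the first kind of operation needs no access to the data itself.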