G06F9/38885

Architecture for block sparse operations on a systolic array

Embodiments described herein include software, firmware, and hardware logic that provides techniques to perform arithmetic on sparse data via a systolic processing unit. One embodiment provides for data aware sparsity via compressed bitstreams. One embodiment provides for block sparse dot product instructions. One embodiment provides for a depth-wise adapter for a systolic array.

Barrier state save and restore for preemption in a graphics environment

An apparatus to facilitate barrier state save and restore for preemption in a graphics environment is disclosed. The apparatus includes processing resources to execute a plurality of execution threads that are comprised in a thread group (TG) and mid-thread preemption barrier save and restore hardware circuitry to: initiate an exception handling routine in response to a mid-thread preemption event, the exception handling routine to cause a barrier signaling event to be issued; receive indication of a valid designated thread status for a thread of a thread group (TG) in response to the barrier signaling event; and in response to receiving the indication of the valid designated thread status for the thread of the TG, cause, by the thread of the TG having the valid designated thread status, a barrier save routine and a barrier restore routine to be initiated for named barriers of the TG.

SHADER LAUNCH SCHEDULING OPTIMIZATION
20240403056 · 2024-12-05 ·

A processing system is configured to implement techniques for dynamically selecting an order for executing operations of different branches in a set of branch instructions. In response to receiving the set of branch instructions, the processing system identifies whether a first branch instruction of the set of branch instructions is associated with a first latency value that meets (i.e., equals or exceeds) a latency threshold. Based on the first latency value meeting the latency threshold, the processing system executes operations associated with a second branch instruction of the set of branch instructions prior to executing operations associated with the first branch instruction.

Node encoding for spatially-organized ray tracing acceleration data structure

Disclosed techniques relate to acceleration data structure for ray intersection testing. In some embodiments, storage circuitry stores node data for a spatially organized acceleration data structure, including to store the following node information for a given node: origin coordinates for the node and, for a given child node of multiple child nodes, child information that includes: quantized bounding region information for a bounding region corresponding to the child node, where the quantized bounding region information encodes bounding region coordinates as offsets relative to the origin coordinates. Traversal circuitry may traverse multiple nodes of the data structure and determine whether a ray intersects a bounding region indicated by given a node of the data structure based on the node information. Disclosed techniques may provide substantial improvements to performance, data size, and power consumption.

METHODS AND SYSTEMS FOR MANAGING AN INSTRUCTION SEQUENCE WITH A DIVERGENT CONTROL FLOW IN A SIMT ARCHITECTURE

A computer-implemented method of executing an instruction sequence with a recursive function call of a plurality of threads within a thread group in a Single-Instruction-Multiple-Threads (SIMT) system is provided. Each thread is provided with a function call counter (FCC), an active mask, an execution mask and a per-thread program counter (PTPC). The instruction sequence with the recursive function call is executed by the threads in the thread group according to a program counter (PC) indicating a target. Upon executing the recursive function call, for each thread, the active mask is set according to the PTPC and the target indicated by the PC, the FCC is determined when entering or returning from the recursive function call, the execution mask is determined according to the FCC and the active mask. It is determined whether an execution result of the recursive function call takes effects according to the execution mask.

Method and apparatus for power reduction during lane divergence

A method and device for reducing power during an instruction lane divergence includes idling an inactive execution lane during the lane divergence.

IMPLEMENTING SPECIALIZED INSTRUCTIONS FOR ACCELERATING DYNAMIC PROGRAMMING ALGORITHMS

Various techniques for accelerating dynamic programming algorithms are provided. For example, a fused addition and comparison instruction, a three-operand comparison instruction, and a two-operand comparison instruction are used to accelerate a Needleman-Wunsch algorithm that determines an optimized global alignment of subsequences over two entire sequences. In another example, the fused addition and comparison instruction is used in an innermost loop of a Floyd-Warshall algorithm to reduce the number of instructions required to determine shortest paths between pairs of vertices in a graph. In another example, a two-way single instruction multiple data (SIMD) floating point variant of the three-operand comparison instruction is used to reduce the number of instructions required to determine the median of an array of floating point values.

STREAMING WAVE COALESCER CIRCUIT

A Streaming Wave Coalescer (SWC) circuit stores a first set of state values associated with a first subset of threads of a first wave in a bin based on each of the first subset of threads including a first set of instructions to be executed. A second set of state values associated with a second subset of threads of a second wave is stored in the bin based on each of the second subset of threads including the first set of instructions to be executed and based on the first wave and the second wave both being associated with a hard key. A third wave is formed from the threads of the first subset and the second subset and is emitted for execution. As a result of reorganizing the threads and reconstituting a different wave, thread divergence of waves sent for execution is reduced.

Large integer multiplication enhancements for graphics environment

An apparatus to facilitate large integer multiplication enhancements in a graphics environment is disclosed. The apparatus includes a processor comprising processing resources, the processing resources comprising multiplier circuitry to: receive operands for a multiplication operation, wherein the multiplication operation is part of a chain of multiplication operations for a large integer multiplication; and issue a multiply and add (MAD) instruction for the multiplication operation utilizing at least one of a double precision multiplier or a 48 bit output, wherein the MAD instruction to generate an output in a single clock cycle of the processor.

TASK EXECUTION IN A SIMD PROCESSING UNIT WITH PARALLEL GROUPS OF PROCESSING LANES
20250061536 · 2025-02-20 ·

A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.