Patent classifications
G06F9/38885
PROGRAMMABLE COARSE GRAINED AND SPARSE MATRIX COMPUTE HARDWARE WITH ADVANCED SCHEDULING
- Eriko Nurvitadhi
- Balaji Vembu
- Nicolas C. Galoppo Von Borries
- Rajkishore Barik
- Tsung-Han Lin
- Kamal Sinha
- Nadathur Rajagopalan Satish
- Jeremy Bottleson
- Farshad Akhbari
- Altug Koker
- Narayan Srinivasa
- Dukhwan Kim
- Sara S. Baghsorkhi
- Justin E. Gottschlich
- Feng Chen
- Elmoustapha Ould-Ahmed-Vall
- Kevin Nealis
- Xiaoming Chen
- Anbang Yao
One embodiment provides a parallel processor comprising a hardware scheduler to schedule pipeline commands for compute operations to one or more of multiple types of compute units, a plurality of processing resources including a first sparse compute unit configured for input at a first level of sparsity and hybrid memory circuitry including a memory controller, a memory interface, and a second sparse compute unit configured for input at a second level of sparsity that is greater than the first level of sparsity.
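As a rough software illustration of routing work by sparsity level, the sketch below measures the fraction of zero entries in an input and picks one of two sparse compute units. The function names, the unit labels, and the 0.9 threshold are hypothetical and are not taken from the abstract, which does not say how the scheduler decides.

```python
# Hypothetical threshold separating the two sparsity levels (an assumption).
HIGH_SPARSITY_THRESHOLD = 0.9


def sparsity(matrix):
    """Return the fraction of entries in a list-of-lists matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


def choose_unit(matrix, threshold=HIGH_SPARSITY_THRESHOLD):
    """Pick a (hypothetically named) sparse unit based on input sparsity."""
    if sparsity(matrix) >= threshold:
        return "second_sparse_unit"  # configured for the higher sparsity level
    return "first_sparse_unit"       # configured for the lower sparsity level
```

For example, a matrix that is half zeros would go to the first unit under this threshold, while one that is 95% zeros would go to the second.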
COMPUTE OPTIMIZATION MECHANISM FOR DEEP NEURAL NETWORKS
An apparatus to facilitate compute optimization is disclosed. The apparatus includes at least one processor to perform operations to implement a neural network and compute logic to accelerate neural network computations.
Apparatus and method of optimizing divergent processing in thread groups
A data processor is disclosed in which the execution threads of a thread group can execute a set of instructions in lockstep, and in which a plurality of execution lanes can perform processing operations for the execution threads. When an execution thread issuing circuit determines that a portion of the active threads of a first thread group and a portion of the active threads of a second thread group use different execution lanes of the plurality of execution lanes, the circuit issues both portions for execution. This can increase data processor efficiency, thereby increasing throughput and reducing latency.
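The lane-disjointness check described in this abstract can be modelled in software with bitmasks, one bit per execution lane. This is a minimal sketch under that assumption; the function names are hypothetical and the abstract does not specify how the hardware represents active lanes.

```python
def can_co_issue(mask_a: int, mask_b: int) -> bool:
    """True when the two active-lane bitmasks share no execution lane."""
    return (mask_a & mask_b) == 0


def issue_slots(mask_a: int, mask_b: int) -> list:
    """Return the lane masks issued per cycle for two thread-group portions.

    Disjoint portions are merged into one issue slot; overlapping portions
    are issued one after the other.
    """
    if can_co_issue(mask_a, mask_b):
        return [mask_a | mask_b]
    return [mask_a, mask_b]
```

With four lanes, masks `0b0011` and `0b1100` merge into a single slot `0b1111`, whereas `0b0011` and `0b0110` both need lane 1 and therefore take two slots.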
Data path and instruction set for packed pixel operations for video processing
One embodiment of the present invention discloses a method for processing video data within a video data processing path of a processing unit. The video data processing path includes three stages. In the first stage, source operands are extracted from a local register file and are ordered to map efficiently onto the downstream data path. In the second stage, arithmetic operations are performed on the source operands based on video processing instructions to generate intermediate results. In the third stage, additional operations are performed on the intermediate results based on the video processing instructions. In some embodiments, the intermediate results are combined with additional operands retrieved from the local register file.
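The three stages can be pictured with a toy packed-pixel example. Everything specific here is an assumption made for illustration: four 8-bit pixels packed into a 32-bit word, a multiply as the stage-two arithmetic, and a scale, add, and saturate as the stage-three operation. The abstract does not name the actual operations or formats.

```python
def stage1_unpack(word: int) -> list:
    """Stage 1 (assumed): extract four 8-bit pixels from a 32-bit register word."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def stage2_multiply(pixels: list, weights: list) -> list:
    """Stage 2 (assumed): per-pixel multiply producing wide intermediate results."""
    return [p * w for p, w in zip(pixels, weights)]


def stage3_accumulate(intermediate: list, extra: list) -> list:
    """Stage 3 (assumed): rescale, add an extra operand, saturate to 8 bits."""
    return [min(255, (v >> 8) + e) for v, e in zip(intermediate, extra)]


pixels = stage1_unpack(0xFF804000)              # [0, 64, 128, 255]
products = stage2_multiply(pixels, [128] * 4)   # weight 128/256 = 0.5
result = stage3_accumulate(products, [10, 10, 10, 200])
```

In this sketch the last pixel saturates at 255 because the scaled product plus the extra operand exceeds the 8-bit range.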
GPU divergence barrier
A device includes a memory, and at least one programmable processor configured to determine, for each warp of a plurality of warps, whether a Boolean expression is true for a corresponding thread of each warp, pause execution of each warp having a corresponding thread for which the expression is true, determine a number of active threads for each of the plurality of warps for which the expression is true, sort the plurality of warps for which the expression is true based on the number of active threads in each of the plurality of warps, swap thread data of an active thread of a first warp of the plurality of warps with thread data of an inactive thread of a second warp of the plurality of warps, and resume execution of the at least one of the plurality of warps for which the expression is true.
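The sort-and-swap step can be sketched in software as follows. This is a simplified model, not the claimed implementation: each warp is a list whose entries are thread identifiers (active) or `None` (inactive), and the function name is hypothetical. Warps are sorted by active-thread count, then active threads from the emptiest warps are moved into inactive slots of the fullest warps.

```python
def compact_warps(warps):
    """Sort warps by active-thread count and fill dense warps from sparse ones.

    warps: list of lists; an entry is a thread id, or None for an inactive slot.
    Returns new lists; the input is not modified.
    """
    ordered = sorted(
        (list(w) for w in warps),
        key=lambda w: sum(t is not None for t in w),
        reverse=True,
    )
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        dst, src = ordered[lo], ordered[hi]
        free = [i for i, t in enumerate(dst) if t is None]
        live = [i for i, t in enumerate(src) if t is not None]
        if not free:        # destination warp is full, move to the next one
            lo += 1
        elif not live:      # source warp is empty, move to the next one
            hi -= 1
        else:               # swap one active thread into an inactive slot
            dst[free[0]], src[live[0]] = src[live[0]], None
    return ordered


warps = [["a", None, None, None], ["b", "c", "d", None], [None, "e", None, None]]
packed = compact_warps(warps)
```

After compaction the first warp is fully populated and the last is empty, so fewer partially filled warps need to resume.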
SYSTEM AND METHOD FOR MANAGING STATIC DIVERGENCE IN A SIMD COMPUTING ARCHITECTURE
A method is presented for processing one or more instructions to be executed on multiple threads in a Single-Instruction-Multiple-Data (SIMD) computing system. The method includes the steps of analyzing the instructions to collect divergent threads among a plurality of thread groups of the multiple threads; obtaining a redirection array for thread-operand association adjustment among the divergent threads according to the analysis, where the redirection array is used for exchanging a first operand associated with a first divergent thread in a first thread group with a second operand associated with a second divergent thread in a second thread group; and generating compiled code corresponding to the instructions according to the redirection array.
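A redirection array can be thought of as a permutation over thread slots. The sketch below is an illustrative model only; the function names and the flat indexing of threads across groups are assumptions. It builds such an array from a list of operand exchanges and applies it to an operand vector.

```python
def build_redirection(n_threads, swaps):
    """Return an array where entry t names the slot whose operand thread t uses.

    swaps: pairs (a, b) of thread indices whose operands should be exchanged.
    """
    redirection = list(range(n_threads))  # identity: every thread keeps its operand
    for a, b in swaps:
        redirection[a], redirection[b] = redirection[b], redirection[a]
    return redirection


def apply_redirection(operands, redirection):
    """Reorder operands so that thread t receives operands[redirection[t]]."""
    return [operands[source] for source in redirection]


# Two groups of four threads, indexed 0-3 and 4-7; exchange threads 1 and 6.
redirection = build_redirection(8, [(1, 6)])
reordered = apply_redirection(list("abcdefgh"), redirection)
```

Here thread 1 of the first group ends up with the operand originally associated with thread 6 of the second group, and vice versa.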
Method and system for resolving thread divergences
A computing device detects divergences between threads in a thread group executing on a parallel processing unit. The computing device includes an address divergence unit that identifies a subset of non-divergent threads included in the thread group. The address divergence unit stores instructions related to the subset of non-divergent threads in a multi-issue queue. The address divergence unit causes the instructions related to the subset of non-divergent threads to be retrieved from the multi-issue queue when the parallel processing unit is available. The address divergence unit causes the subset of non-divergent threads to be issued for execution on the parallel processing unit. The address divergence unit repeats the identifying, storing, and causing steps for the remaining threads in the thread group that are not included in the subset of non-divergent threads.
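The identify, store, and repeat loop can be sketched as below. The grouping rule is an assumption made for the example: threads count as non-divergent when their addresses fall in the same 128-byte line as the first remaining thread. The abstract does not define the divergence criterion, and the function name and line size are hypothetical.

```python
from collections import deque


def split_into_passes(addresses, line_size=128):
    """Partition thread indices into passes of mutually non-divergent threads.

    addresses[t] is the address accessed by thread t. Each pass holds the
    threads that share a line with the first thread still waiting.
    """
    remaining = list(range(len(addresses)))
    queue = deque()  # stands in for the multi-issue queue
    while remaining:
        lead_line = addresses[remaining[0]] // line_size
        subset = [t for t in remaining if addresses[t] // line_size == lead_line]
        queue.append(subset)
        remaining = [t for t in remaining if t not in subset]
    return list(queue)


passes = split_into_passes([0, 4, 256, 8, 260])
```

In this example threads 0, 1, and 3 touch the same line and issue together; threads 2 and 4 form the second pass.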
Compressing execution cycles for divergent execution in a single instruction multiple data (SIMD) processor
In one embodiment, the present invention includes a processor with a vector execution unit to execute a vector instruction on a vector having a plurality of individual data elements, where the vector instruction is of a first width and the vector execution unit is of a smaller width. The processor further includes control logic coupled to the vector execution unit to compress the number of execution cycles consumed in executing the vector instruction when at least some of the individual data elements are not to be operated on by the vector instruction. Other embodiments are described and claimed.
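One way to picture cycle compression is to count how many execution-unit-wide chunks of the vector contain at least one enabled element. This is a sketch under the assumption that a chunk whose elements are all masked off can be skipped entirely; the function names are hypothetical and the abstract does not describe the compression rule in detail.

```python
def uncompressed_cycles(mask, exec_width):
    """Cycles needed when every chunk of the vector is executed."""
    return -(-len(mask) // exec_width)  # ceiling division


def compressed_cycles(mask, exec_width):
    """Cycles needed when chunks with no enabled element are skipped."""
    chunks = [mask[i:i + exec_width] for i in range(0, len(mask), exec_width)]
    return sum(1 for chunk in chunks if any(chunk))


# A 16-element instruction on a 4-wide unit with only 3 enabled elements.
mask = [1, 1, 0, 0,  0, 0, 0, 0,  0, 0, 0, 0,  1, 0, 0, 0]
```

For this mask the uncompressed count is 4 cycles and the compressed count is 2, since the two middle chunks have nothing to do.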
Task Execution in a SIMD Processing Unit with Parallel Groups of Processing Lanes
A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.
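The benefit of temporally aligning invalid work items can be shown with a toy model. The layout is an assumption: `task[lane][cycle]` is `True` when the work item that lane executes in that cycle is valid, and a cycle can be skipped only if every lane's item in it is invalid. The reordering shown (valid items first within each lane) is one simple way to achieve alignment and is not claimed to be the patent's assembly method.

```python
def cycles_used(task):
    """Count cycles in which at least one lane has a valid work item."""
    n_cycles = len(task[0])
    return sum(1 for c in range(n_cycles) if any(lane[c] for lane in task))


def align_invalid(task):
    """Reorder each lane so valid items come first and invalid items line up."""
    return [sorted(lane, reverse=True) for lane in task]


task = [
    [True, False, True, False],   # lane 0
    [False, True, True, False],   # lane 1
]
aligned = align_invalid(task)
```

Before alignment three of the four cycles do useful work on at least one lane; after alignment the invalid items share the last two cycles, so only two cycles are needed.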
Updated Origin Coordinates for Traversal of Ray Tracing Acceleration Data Structure
Disclosed techniques relate to traversal techniques for ray tracing. In some embodiments, ray intersect circuitry receives a ray intersect request that indicates origin and direction information for a ray in a graphics scene. The ray intersect circuitry may traverse multiple nodes of a spatially organized acceleration data structure, wherein a given node of the multiple nodes indicates coordinates corresponding to a bounding region of the graphics scene. In response to detection that the ray intersects with a first bounding volume, the ray intersect circuitry stores a local parametric value for the ray that indicates a point at which the ray intersected the first bounding volume and may use the local parametric value as an origin value of the ray for one or more intersection tests between the ray and one or more child bounding volumes of the first bounding volume.
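The idea of reusing the parametric hit value as a new origin can be sketched with a standard slab test against axis-aligned boxes. This is an illustrative software model; the function names are hypothetical, and the abstract does not state that the circuitry uses a slab test or axis-aligned bounding boxes.

```python
def ray_box_entry(origin, direction, box_min, box_max):
    """Return the parametric entry distance t of a ray into a box, or None."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            if o < lo or o > hi:   # parallel to this slab and outside it
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


def advance_origin(origin, direction, t):
    """Move the ray origin to the point at parametric distance t."""
    return tuple(o + t * d for o, d in zip(origin, direction))


origin, direction = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
t_parent = ray_box_entry(origin, direction, (2.0, -1.0, -1.0), (4.0, 1.0, 1.0))
local_origin = advance_origin(origin, direction, t_parent)
t_child = ray_box_entry(local_origin, direction, (3.0, -0.5, -0.5), (3.5, 0.5, 0.5))
```

Child tests then work with distances measured from the parent's entry point, which keeps the values involved small relative to distances measured from the original origin.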