Patent classifications
G06F9/38885
Partial sorting for coherency recovery
Devices and methods for partial sorting for coherence recovery are provided. The partial sorting is executed efficiently by utilizing existing hardware along the memory path (e.g., memory local to the compute unit). The devices include an accelerated processing device comprising memory and a processor. The processor is, for example, a compute unit of a GPU that comprises a plurality of SIMD units and is configured to: determine, for data entries each comprising a plurality of bits, the number of occurrences of each type of data entry, storing the counts in one or more portions of the memory local to the processor; sort the data entries based on those stored counts; and execute the sorted data entries.
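The described scheme amounts to a counting sort staged in compute-unit-local memory. Below is a minimal CUDA sketch under assumed simplifications not stated in the abstract: 8-bit data entries, a single 256-thread workgroup, and a hypothetical kernel name `partial_sort_keys`; shared memory stands in for the compute unit's local memory.

```cuda
#include <cstdint>

// Counting-sort sketch over 8-bit keys; assumes blockDim.x == 256 and one block.
__global__ void partial_sort_keys(const uint8_t* keys, uint8_t* sorted, int n)
{
    __shared__ unsigned int counts[256];   // occurrence table in local memory
    __shared__ unsigned int offsets[256];  // exclusive prefix sums

    int tid = threadIdx.x;
    counts[tid] = 0;
    __syncthreads();

    // Phase 1: count occurrences of each key type in local memory.
    for (int i = tid; i < n; i += blockDim.x)
        atomicAdd(&counts[keys[i]], 1u);
    __syncthreads();

    // Phase 2: exclusive prefix sum over the counts (single thread for clarity).
    if (tid == 0) {
        unsigned int running = 0;
        for (int k = 0; k < 256; ++k) {
            offsets[k] = running;
            running += counts[k];
        }
    }
    __syncthreads();

    // Phase 3: scatter each entry to its sorted position.
    for (int i = tid; i < n; i += blockDim.x) {
        unsigned int dst = atomicAdd(&offsets[keys[i]], 1u);
        sorted[dst] = keys[i];
    }
}
```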
MONITOR SUPPORT ON ACCELERATED PROCESSING DEVICE
A technique for implementing synchronization monitors on an accelerated processing device (APD) is provided. Work on an APD is organized into workgroups, each comprising one or more wavefronts, and all wavefronts of a workgroup execute on a single compute unit. A monitor is a synchronization construct that allows workgroups to stall until a particular condition is met. Responsive to all wavefronts of a workgroup executing a wait instruction, a monitor coordinator records the workgroup in an entry queue. The workgroup begins saving its state to general APD memory and, when that saving is complete, the monitor coordinator moves the workgroup to a condition queue. When the condition specified by the wait instruction is met, the monitor coordinator moves the workgroup to a ready queue, and, when sufficient resources are available on a compute unit, the APD schedules the ready workgroup for execution.
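The coordinator is essentially a three-queue state machine. The host-side C++ sketch below mirrors the queue transitions the abstract describes; the type and method names (`MonitorCoordinator`, `onWait`, `onStateSaved`, `onConditionMet`, `nextReady`) are illustrative inventions, not the patent's interfaces.

```cuda
#include <algorithm>
#include <deque>

enum class WgState { Entry, Condition, Ready };

struct Workgroup { int id; WgState state; };

struct MonitorCoordinator {
    std::deque<Workgroup*> entryQ, conditionQ, readyQ;

    // All wavefronts of the workgroup have executed the wait instruction:
    // record the workgroup in the entry queue while it saves its state.
    void onWait(Workgroup* wg) { wg->state = WgState::Entry; entryQ.push_back(wg); }

    // State save to general APD memory completed: entry -> condition queue.
    void onStateSaved(Workgroup* wg) {
        entryQ.erase(std::find(entryQ.begin(), entryQ.end(), wg));
        wg->state = WgState::Condition;
        conditionQ.push_back(wg);
    }

    // The condition named by the wait instruction is met: condition -> ready.
    void onConditionMet(Workgroup* wg) {
        conditionQ.erase(std::find(conditionQ.begin(), conditionQ.end(), wg));
        wg->state = WgState::Ready;
        readyQ.push_back(wg);
    }

    // When a compute unit has sufficient resources, the scheduler pops a
    // ready workgroup and dispatches it for execution.
    Workgroup* nextReady() {
        if (readyQ.empty()) return nullptr;
        Workgroup* wg = readyQ.front();
        readyQ.pop_front();
        return wg;
    }
};
```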
Pre-scheduled replays of divergent operations
One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline, and a pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data distributed across multiple cache lines, the pre-scheduled replay unit inserts pre-scheduled replay operations into the pipeline behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after the instruction and the pre-scheduled replay operations have executed, additional replay operations are inserted via the replay loop until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.
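A software analogue of the replay decision can be written with CUDA's warp match intrinsics (compute capability 7.0+): each lane computes which cache line its address touches, and the number of distinct lines in the warp is the number of passes (the initial access plus replays) needed to service everyone. The function name, the 128-byte line size, and the converged-warp assumption are illustrative, not the patent's hardware.

```cuda
// Counts how many serialized passes a warp-wide load would need if each
// pass can service only one cache line. Assumes a converged, full warp
// and a 1-D thread block.
__device__ int passes_needed(const float* addr)
{
    unsigned active = __activemask();
    unsigned long long line = (unsigned long long)addr / 128;   // cache-line id
    unsigned peers = __match_any_sync(active, line);            // lanes on my line
    // The lowest lane of each peer group acts as that group's "leader".
    bool leader = (__ffs((int)peers) - 1) == (int)(threadIdx.x & 31);
    // One leader per distinct cache line == number of passes needed.
    return __popc(__ballot_sync(active, leader));
}
```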
Systems and methods for voting among parallel threads
One embodiment of the present invention sets forth a technique for efficiently performing voting operations within a multi-threaded parallel-processing system. A group of related parallel program threads executes within a processor core together in parallel. A new instruction, called a vote instruction, is introduced that enables a parallel program thread to post an individual vote within the context of the group of related threads and to receive the result of the vote. In this fashion, the vote instruction advantageously reduces overhead associated with inter-thread communication, thereby improving overall system performance.
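CUDA exposes exactly this kind of warp-level vote through the `__any_sync`, `__all_sync`, and `__ballot_sync` intrinsics. The kernel below shows a ballot vote deciding, per warp, whether any lane found a match; it assumes the input length equals the total thread count and that `blockDim.x` is a multiple of 32, so every warp is full.

```cuda
// Each lane posts a vote; the ballot collects all 32 votes in one instruction.
__global__ void vote_demo(const int* data, int key, int* warpHit)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool found = (data[i] == key);                       // this lane's vote
    unsigned ballot = __ballot_sync(0xffffffffu, found); // collect the votes
    if ((threadIdx.x & 31) == 0)                         // one lane records
        warpHit[i / 32] = (ballot != 0);
}
```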
Processor and method for dynamically allocating processing elements to front end units using a plurality of registers
Embodiments include a processor capable of supporting multi-mode operation, and corresponding methods. The processor includes front end units; more processing elements than front end units; and a controller configured to determine whether thread divergence occurs due to conditional branching. If there is thread divergence, the processor may set control information to control the processing elements using the currently activated front end units. If there is not, the processor may set control information to control the processing elements using a single currently activated front end unit.
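For context, this is the kind of divergence the controller is watching for. In the CUDA kernel below, a data-dependent branch splits a warp into two lane groups that a single front end must serialize; hardware with multiple front ends, as described, could drive both sides concurrently. The kernel assumes the arrays are sized to the total thread count.

```cuda
// A conditional branch where lanes disagree causes thread divergence.
__global__ void divergent(int* out, const int* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (in[i] & 1)             // lanes disagree here -> divergence
        out[i] = in[i] * 3;    // taken path
    else
        out[i] = in[i] >> 1;   // not-taken path
}
```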
INTELLIGENT THREAD DISPATCH AND VECTORIZATION OF ATOMIC OPERATIONS
A mechanism is described for facilitating intelligent dispatching and vectorizing at autonomous machines. A method of embodiments, as described herein, includes detecting a plurality of threads corresponding to a plurality of workloads associated with tasks relating to a graphics processor. The method may further include determining a first set of threads of the plurality of threads that are similar to each other or have adjacent surfaces, and physically clustering the first set of threads close together using a first set of adjacent compute blocks.
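One simple software analogy for the clustering step is to sort pending work by a locality key before dispatch, so threads that touch adjacent surfaces end up in adjacent compute blocks. The host-side sketch below is a hypothetical illustration; the `WorkItem` type, `surfaceId` key, and `cluster_for_dispatch` name are invented for the example.

```cuda
#include <algorithm>
#include <vector>

struct WorkItem { int threadId; int surfaceId; };

// Sort work items so that items touching similar/adjacent surfaces sit next
// to each other, letting consecutive runs be dispatched to adjacent
// compute blocks.
void cluster_for_dispatch(std::vector<WorkItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const WorkItem& a, const WorkItem& b) {
                  return a.surfaceId < b.surfaceId;
              });
}
```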
PROGRAMMABLE COARSE GRAINED AND SPARSE MATRIX COMPUTE HARDWARE WITH ADVANCED SCHEDULING
- Eriko Nurvitadhi
- Balaji Vembu
- Nicolas C. Galoppo Von Borries
- Rajkishore Barik
- Tsung-Han Lin
- Kamal Sinha
- Nadathur Rajagopalan Satish
- Jeremy Bottleson
- Farshad Akhbari
- Altug Koker
- Narayan Srinivasa
- Dukhwan Kim
- Sara S. Baghsorkhi
- Justin E. Gottschlich
- Feng Chen
- Elmoustapha Ould-Ahmed-Vall
- Kevin Nealis
- Xiaoming Chen
- Anbang Yao
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex machine learning compute operation.
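The abstract does not spell out the operation, but the title points at coarse-grained sparse matrix compute. As a reference point, here is a plain CUDA CSR sparse matrix-vector multiply, an assumed example of the kind of workload such an instruction could drive rather than the patent's hardware.

```cuda
// CSR SpMV: one thread per row; only stored (nonzero) entries are touched.
__global__ void spmv_csr(int rows, const int* rowPtr, const int* colIdx,
                         const float* vals, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += vals[j] * x[colIdx[j]];
    y[row] = sum;
}
```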
MIXED INFERENCE USING LOW AND HIGH PRECISION
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising instruction decode logic to decode a single instruction including multiple operands into a single decoded instruction, the multiple operands having differing precisions, and a general-purpose graphics compute unit including a first logic unit and a second logic unit, the general-purpose graphics compute unit to execute the single decoded instruction, wherein to execute the single decoded instruction includes to perform a first instruction operation on a first set of operands of the multiple operands at a first precision and to simultaneously perform a second instruction operation on a second set of operands of the multiple operands at a second precision.
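Plain CUDA offers no single instruction with mixed-precision operands, but the idea can be sketched as one kernel step that performs an FP16 pair operation alongside an FP32 operation; whether the two actually issue simultaneously is then up to the hardware scheduler, unlike the single decoded instruction the abstract describes. The kernel name is illustrative, and the arrays are assumed sized to the total thread count.

```cuda
#include <cuda_fp16.h>

// One "step" pairing a half-precision operation with a single-precision one.
__global__ void mixed_precision_step(const __half2* a16, const __half2* b16,
                                     __half2* c16, const float* a32,
                                     const float* b32, float* c32)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c16[i] = __hmul2(a16[i], b16[i]);   // first operation at FP16 precision
    c32[i] = a32[i] * b32[i];           // second operation at FP32 precision
}
```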
MIXED INFERENCE USING LOW AND HIGH PRECISION
One embodiment provides for a graphics processing unit (GPU) to accelerate machine learning operations, the GPU comprising an instruction cache to store a first instruction and a second instruction, the first instruction to cause the GPU to perform a floating-point operation, including a multi-dimensional floating-point operation, and the second instruction to cause the GPU to perform an integer operation; and a general-purpose graphics compute unit having a single instruction, multiple thread (SIMT) architecture, the general-purpose graphics compute unit to simultaneously execute the first instruction and the second instruction, wherein the integer operation corresponds to a memory address calculation.
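The floating-point/integer pairing the abstract describes shows up naturally in a strided loop, where integer address arithmetic accompanies each floating-point multiply-add; NVIDIA's Volta-class GPUs, for instance, have separate INT32 and FP32 pipelines that can overlap this way. The sketch below is ordinary CUDA, assuming arrays of length `n` and an output slot per thread.

```cuda
// Per-thread partial dot product: the loop's integer adds are address
// calculation; the fmaf is the floating-point work they feed.
__global__ void fma_stream(const float* a, const float* b, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += stride)   // integer op: address calculation
        acc = fmaf(a[i], b[i], acc);        // floating-point op: multiply-add
    out[tid] = acc;
}
```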
GRAPHICS CONTROL FLOW MECHANISM
An apparatus to facilitate control flow in a graphics processing system is disclosed. The apparatus includes a plurality of execution units to execute single instruction, multiple data (SIMD) instructions and flow control logic to detect a diverging control flow in a plurality of SIMD channels and reduce execution of the control flow to a subset of the SIMD channels.
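The software-visible face of this narrowing exists in CUDA's cooperative groups: after a divergent branch, `coalesced_threads()` hands back exactly the subset of lanes that took the branch, and further work proceeds on that subset. The kernel assumes the input is sized to the total thread count.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Only lanes entering the branch form the coalesced group; one lane then
// acts on behalf of that narrowed subset of SIMD channels.
__global__ void narrowed_flow(int* counter, const int* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (in[i] > 0) {                          // divergence: some lanes enter
        cg::coalesced_group active = cg::coalesced_threads();
        if (active.thread_rank() == 0)        // leader of the active subset
            atomicAdd(counter, (int)active.size());
    }
}
```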