Patent classifications
G06F9/3888
Instruction Cache for Hardware Multi-Thread Microprocessor
Embodiments are provided for an instruction cache system for a hardware multi-thread microprocessor. In some embodiments, a cache controller device includes multiple interfaces connected to a hardware multi-thread microprocessor. A first interface of the multiple interfaces can receive a fetch request from a first execution thread during a first clock cycle. A second interface of the multiple interfaces can receive a fetch request from a second execution thread during a second clock cycle after the first clock cycle. The cache controller device also includes a multiplexer to send first response signals in response to the fetch request from the first execution thread, and to send second response signals in response to the fetch request from the second execution thread.
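The per-thread-interface arbitration described above can be sketched behaviorally. This is an illustrative simplification, not the patented design: the class name, the in-order response policy, and the response format are all assumptions.

```python
# Behavioral sketch of a cache controller with one request interface per
# hardware thread and a single multiplexer funneling responses back.
class CacheController:
    def __init__(self):
        self.pending = []  # (cycle_received, thread_id, address)

    def fetch(self, cycle, thread_id, address):
        """Accept a fetch request on the interface for thread_id."""
        self.pending.append((cycle, thread_id, address))

    def mux_responses(self):
        """Multiplex responses back in request-arrival order (assumed policy)."""
        responses = [(tid, f"line@{addr:#x}") for _, tid, addr in sorted(self.pending)]
        self.pending.clear()
        return responses

ctrl = CacheController()
ctrl.fetch(cycle=0, thread_id=0, address=0x1000)  # first clock cycle
ctrl.fetch(cycle=1, thread_id=1, address=0x2000)  # later clock cycle
responses = ctrl.mux_responses()
```

Each thread owns its own interface, so neither request blocks the other; only the response path is shared through the multiplexer.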
GRAPH-BASED MEMORY STORAGE
Apparatuses, systems, and techniques to cause information to be stored in one or more memory locations based, at least in part, on one or more graphs. In at least one embodiment, a compiler analyzes one or more graphs to determine one or more sets of data items to be stored in one or more consecutive memory locations.
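The compiler analysis described can be approximated with a simple graph walk. A hedged sketch, assuming edges mean "accessed together" and that each connected component forms one set of items placed in consecutive locations; the function name and traversal order are illustrative, not from the disclosure.

```python
# Sketch: assign consecutive addresses to data items that are connected in an
# access graph, so co-accessed items end up adjacent in memory.
from collections import defaultdict

def consecutive_layout(edges, items):
    """Place items in the same connected component at consecutive addresses."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    addr, layout, seen = 0, {}, set()
    for item in items:
        if item in seen:
            continue
        stack = [item]  # depth-first walk of one component
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            layout[node] = addr  # next consecutive slot
            addr += 1
            stack.extend(adj[node] - seen)
    return layout

# "a", "b", "c" are linked in the graph; "d" is isolated.
layout = consecutive_layout([("a", "b"), ("b", "c")], ["a", "b", "c", "d"])
```

The linked items land in one contiguous run of addresses, which is the locality property the abstract describes.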
SINGLE INSTRUCTION, MULTIPLE THREAD (SIMT) PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS
A processor of an aspect includes an instruction unit to receive a single instruction, multiple thread (SIMT) instruction. The SIMT instruction has at least one field to provide at least one value. The at least one value is to indicate a plurality of threads that are to execute the SIMT instruction. The processor also includes a SIMT processor coupled with the instruction unit. The SIMT processor is to execute the SIMT instruction for each of the plurality of threads. Other processors, methods, systems, and machine-readable media storing such SIMT instructions are also disclosed.
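The claimed semantics can be sketched in a few lines. Interpreting the instruction's value field as a per-lane execution mask is an assumption; the claim only requires that the value indicate which threads execute.

```python
# Sketch: a SIMT instruction carries a value (here, a bitmask) selecting which
# threads of the group execute it; unselected lanes keep their register values.
def execute_simt(mask, op, registers):
    """Apply op to each lane's register only where that lane's mask bit is set."""
    return [op(r) if (mask >> lane) & 1 else r
            for lane, r in enumerate(registers)]

# Threads 0 and 2 (mask 0b101) double their register; thread 1 is inactive.
result = execute_simt(0b101, lambda r: r * 2, [10, 20, 30])
```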
Cooperative Group Arrays
- Greg PALMER ,
- Gentaro HIROTA ,
- Ronny Krashinsky ,
- Ze Long ,
- Brian Pharris ,
- Rajballav DASH ,
- Jeff TUCKEY ,
- Jerome F. Duluk, Jr. ,
- Lacky Shah ,
- Luke DURANT ,
- Jack Choquette ,
- Eric WERNESS ,
- Naman GOVIL ,
- Manan PATEL ,
- Shayani DEB ,
- SANDEEP NAVADA ,
- John Edmondson ,
- Prakash Bangalore Prabhakar ,
- Wish Gandhi ,
- Ravi MANYAM ,
- Apoorv PARLE ,
- Olivier GIROUX ,
- Shirish Gadre ,
- Steve HEINRICH
A new level of hierarchy—Cooperative Group Arrays (CGAs)—and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
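The concurrency guarantee can be illustrated with an all-or-nothing placement sketch. This is a deliberate simplification, not NVIDIA's hardware work distributor: partitions are modeled as free-slot counts, and a CGA launches only where every one of its CTAs fits at once.

```python
# Sketch: co-schedule all CTAs of a CGA onto one hardware partition together,
# or not at all, so their concurrent execution is guaranteed.
def place_cga(cga_ctas, partition_free_slots):
    """Return (partition_index, ctas_placed) if all CTAs fit at once, else None."""
    for idx, free in enumerate(partition_free_slots):
        if free >= cga_ctas:
            return idx, cga_ctas  # all-or-nothing placement
    return None  # never partially scheduled

placed = place_cga(4, [2, 8])       # fits whole on partition 1
rejected = place_cga(4, [2, 3])     # no partition can hold all 4 CTAs
```

Because CTAs of one CGA never launch piecemeal, threads across the whole CGA can safely synchronize with and exchange data with each other.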
METHOD AND APPARATUS FOR PERFORMING REDUCTION OPERATIONS ON A PLURALITY OF ASSOCIATED DATA ELEMENT VALUES
Embodiments detailed herein relate to reduction operations on a plurality of data element values. In one embodiment, a processor comprises decoding circuitry to decode an instruction and execution circuitry to execute the decoded instruction. The instruction specifies a first input register containing a plurality of data element values, a first index register containing a plurality of indices, and an output register, where each index of the plurality of indices maps to one unique data element position of the first input register. The execution includes to identify data element values that are associated with one another based on the indices, perform one or more reduction operations on the associated data element values based on the identification, and store results of the one or more reduction operations in the output register.
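The execution steps above—group by index, reduce each group, write results—can be sketched directly. Using addition as the reduction operation is an assumption for illustration; the claim covers reduction operations generally.

```python
# Sketch of the instruction's behavior: values sharing an index are reduced
# together, producing one result per distinct index.
from collections import defaultdict

def indexed_reduce(values, indices, op=lambda a, b: a + b):
    """Reduce values that share an index; returns {index: reduced_value}."""
    groups = defaultdict(list)
    for idx, val in zip(indices, values):
        groups[idx].append(val)  # identify associated data element values
    out = {}
    for idx, vals in groups.items():
        acc = vals[0]
        for v in vals[1:]:
            acc = op(acc, v)     # perform the reduction operation
        out[idx] = acc           # store result in the output
    return out

# Element positions 0 and 2 share index 5, so their values 1 and 3 are summed.
result = indexed_reduce([1, 2, 3], [5, 7, 5])
```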
Fence Enforcement Techniques based on Stall Characteristics
Techniques are disclosed relating to channel stalls or deactivations based on the latency of prior operations. In some embodiments, a processor includes a plurality of channel pipelines for a plurality of channels and a plurality of execution pipelines shared by the channel pipelines and configured to perform different types of operations provided by the channel pipelines. First scheduler circuitry may assign threads to channels and second scheduler circuitry may assign an operation from a given channel to a given execution pipeline based on decode of an operation for that channel. Dependency circuitry may, for a first operation that depends on a prior operation that uses one of the execution pipelines, determine, based on status information for the prior operation from the one of the execution pipelines, whether to stall the first operation or to deactivate a thread that includes the first operation from its assigned channel.
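The stall-versus-deactivate decision can be sketched as a threshold on the prior operation's reported latency. The threshold value and the status field are assumptions; the embodiment only requires that the choice be driven by status information from the execution pipeline.

```python
# Sketch: when a dependent operation's producer will retire soon, stalling the
# channel is cheap; when the producer is long-latency (e.g. a memory access),
# the dependent thread is deactivated so the channel can run other threads.
STALL_THRESHOLD = 4  # cycles; assumed tuning parameter, not from the patent

def resolve_dependency(remaining_latency):
    """Return 'stall' for short waits, 'deactivate' for long ones."""
    return "stall" if remaining_latency <= STALL_THRESHOLD else "deactivate"

short_wait = resolve_dependency(2)   # e.g. a near-complete ALU operation
long_wait = resolve_dependency(40)   # e.g. an outstanding memory operation
```

A deactivated thread frees its channel for rescheduling, while a stalled one holds the channel but avoids the cost of a context switch—matching the trade-off the abstract describes.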
VIRTUAL MULTI-PORT MEMORY PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS
A processor includes a shared memory, and an instruction unit to receive a single instruction, multiple thread (SIMT) instruction having a first source register identifier and a second source register identifier. The SIMT instruction indicates a number of data values to be written to the shared memory concurrently. A SIMT processor includes processor elements each to execute instructions of a different corresponding thread of a parallel thread group. Each of a number of processor elements, equal in number to the number of data values, is to execute the SIMT instruction to concurrently write a different corresponding one of the number of data values from a first source register of the respective processor element identified by the first source register identifier to the shared memory at an address based on address information from a second source register of the respective processor element identified by the second source register identifier.
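The concurrent scatter-write behavior can be sketched sequentially. Register naming and the list-based shared memory are illustrative assumptions; in hardware the per-lane writes happen concurrently.

```python
# Sketch: each of n processor elements writes its own data value (from its
# first source register) to shared memory at the address held in its second
# source register.
def simt_scatter_write(shared_mem, data_regs, addr_regs, n):
    """Lanes 0..n-1 each write data_regs[lane] to shared_mem[addr_regs[lane]]."""
    for lane in range(n):  # hardware performs these writes concurrently
        shared_mem[addr_regs[lane]] = data_regs[lane]
    return shared_mem

mem = [0] * 8
simt_scatter_write(mem, data_regs=[7, 9], addr_regs=[3, 5], n=2)
```

The "virtual multi-port" effect is that a single-ported shared memory appears to accept n simultaneous writes, one per participating lane.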