Patent classifications
G06F9/383
SYSTEMS AND METHODS TO LOAD A TILE REGISTER PAIR
Embodiments detailed herein relate to systems and methods to load a tile register pair. In one example, a processor includes: decode circuitry to decode a load matrix pair instruction having fields for an opcode and source and destination identifiers to identify source and destination matrices, respectively, each matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded load matrix pair instruction to load every element of left and right tiles of the identified destination matrix from corresponding element positions of left and right tiles of the identified source matrix, respectively, wherein the executing operates on one row of the identified destination matrix at a time, starting with the first row.
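The row-at-a-time behavior described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: `TilePair` and `load_tile_pair` are hypothetical names, and modeling tiles as lists of rows is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TilePair:
    """Hypothetical model of a matrix whose PAIR parameter is TRUE."""
    left: List[List[int]]
    right: List[List[int]]
    pair: bool = True


def load_tile_pair(dst: TilePair, src: TilePair) -> None:
    """Copy every element of the source left/right tiles into the same
    positions of the destination tiles, one row at a time from row 0."""
    if not (dst.pair and src.pair):
        raise ValueError("both matrices must have PAIR == TRUE")
    if len(dst.left) != len(src.left) or len(dst.right) != len(src.right):
        raise ValueError("source and destination tiles must have equal row counts")
    for row in range(len(src.left)):
        # Copy the row so the destination does not alias the source.
        dst.left[row] = list(src.left[row])
        dst.right[row] = list(src.right[row])
```

For example, loading from a pair holding `[[1, 2], [3, 4]]` and `[[5, 6], [7, 8]]` leaves the destination with the same values in the same positions.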
METHOD AND DEVICE FOR PROVIDING A VECTOR STREAM INSTRUCTION SET ARCHITECTURE EXTENSION FOR A CPU
A method and device for providing a vector stream instruction set architecture extension for a CPU. In one aspect, there is provided a vector stream engine unit comprising: a first fast memory storage for temporarily storing data of vector data streams from a memory for loading into a vector register file; a second fast memory storage for temporarily storing data of the vector data streams from the vector register file for loading into the memory; a prefetcher configured to prefetch data of the vector data streams from the memory into the first fast memory storage, and prefetch data of the vector data streams from the vector register file into the second fast memory storage; and a stream configuration table (SCT) storing stream information for prefetching data from the vector data streams.
Data processing apparatus having streaming engine with read and read/advance operand coding
A streaming engine employed in a digital signal processor specifies a fixed data stream. Once started, the data stream is read only and cannot be written. Once fetched, the data stream is stored in a first-in-first-out buffer for presentation to functional units in the fixed order. Data use by the functional unit is controlled using the input operand fields of the corresponding instruction. A read only operand coding supplies the data to an input of the functional unit. A read/advance operand coding supplies the data and also advances the stream to the next sequential data elements. The read only operand coding permits reuse of data without requiring a register of the register file for temporary storage.
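The difference between the two operand codings can be sketched in a few lines of Python. The `Stream` class and its method names are hypothetical and chosen only for illustration; the sketch models the first-in-first-out buffer with `collections.deque` and ignores fetching from memory.

```python
from collections import deque
from typing import Iterable


class Stream:
    """Hypothetical read-only stream backed by a FIFO buffer."""

    def __init__(self, elements: Iterable[int]) -> None:
        self._fifo = deque(elements)

    def read(self) -> int:
        """Read-only coding: supply the head element, leave the stream in place."""
        return self._fifo[0]

    def read_advance(self) -> int:
        """Read/advance coding: supply the head element, then move to the next one."""
        return self._fifo.popleft()
```

Calling `read()` twice returns the same element, which is how data can be reused without a temporary register; `read_advance()` returns it once more and exposes the next element.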
Systems, methods, and apparatuses for heterogeneous computing
- Rajesh M. Sankaran,
- Gilbert Neiger,
- Narayan Ranganathan,
- Stephen R. Van Doren,
- Joseph Nuzman,
- Niall D. McDonnell,
- Michael A. O'Hanlon,
- Lokpraveen B. Mosur,
- Tracy Garrett Drysdale,
- Eriko Nurvitadhi,
- Asit K. Mishra,
- Ganesh Venkatesh,
- Deborah T. Marr,
- Nicholas P. Carter,
- Jonathan D. Pearce,
- Edward T. Grochowski,
- Richard J. Greco,
- Robert Valentine,
- Jesus Corbal,
- Thomas D. Fletcher,
- Dennis R. Bradford,
- Dwight P. Manley,
- Mark J. Charney,
- Jeffrey J. Cook,
- Paul Caprioli,
- Koichi Yamada,
- Kent D. Glossop,
- David B. Sheffield
Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more of a plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
Mechanism for interrupting and resuming execution on an unprotected pipeline processor
Techniques related to executing a plurality of instructions by a processor, comprising: receiving a first instruction for execution on an instruction execution pipeline; beginning execution of the first instruction; receiving one or more second instructions for execution on the instruction execution pipeline, the one or more second instructions associated with a higher priority task than the first instruction; storing a register state associated with the execution of the first instruction in one or more registers of a capture queue associated with the instruction execution pipeline; copying the register state from the capture queue to a memory; determining that the one or more second instructions have been executed; copying the register state from the memory to the one or more registers of the capture queue; and restoring the register state to the instruction execution pipeline from the capture queue.
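The save-and-restore sequence in this abstract can be traced with a small Python sketch. It is only an illustration under simplifying assumptions: register state is a dictionary, memory is a list used as a stack, and `preempt_and_resume` is a hypothetical name.

```python
from typing import Callable, Dict, List

Registers = Dict[str, int]


def preempt_and_resume(
    pipeline: Registers,
    memory: List[Registers],
    high_priority: Callable[[Registers], None],
) -> None:
    """Save in-flight register state, run a higher priority task, then restore."""
    # Store the register state of the interrupted instruction in the capture queue.
    capture_queue: Registers = dict(pipeline)
    # Copy the register state from the capture queue to memory.
    memory.append(dict(capture_queue))
    capture_queue.clear()
    # Execute the higher priority instructions; they may overwrite pipeline state.
    high_priority(pipeline)
    # Copy the saved state from memory back into the capture queue.
    capture_queue.update(memory.pop())
    # Restore the register state to the pipeline from the capture queue.
    pipeline.clear()
    pipeline.update(capture_queue)
```

After the call, the pipeline holds exactly the state it had before the higher priority task ran, regardless of what that task wrote.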
STREAMING ENGINE WITH STREAM METADATA SAVING FOR CONTEXT SWITCHING
A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces addresses of data elements. A stream head register stores data elements next to be supplied to functional units for use as operands. Stream metadata is stored in response to a stream store instruction. Stored stream metadata is restored to the stream engine in response to a stream restore instruction. An interrupt changes an open stream to a frozen state, discarding stored stream data. A return from interrupt changes a frozen stream to an active state.
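A stream defined by nested loops can be pictured as an address generator that walks each loop level with its own stride. The Python sketch below is an assumption-laden illustration, not the patented address generator: `stream_addresses` is a hypothetical name, and describing each loop as an `(iteration_count, byte_stride)` pair is a simplification.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def stream_addresses(base: int, loops: Sequence[Tuple[int, int]]) -> Iterator[int]:
    """Yield element addresses for a stream defined by nested loops.

    loops lists (iteration_count, byte_stride) pairs from outermost to innermost.
    """
    ranges = [range(count) for count, _ in loops]
    # product() varies the last range fastest, which matches the innermost loop.
    for indices in product(*ranges):
        yield base + sum(i * stride for i, (_, stride) in zip(indices, loops))
```

With a base of 100, an outer loop of 2 iterations at stride 16, and an inner loop of 3 iterations at stride 4, the generator yields 100, 104, 108, 116, 120, 124.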
INSTRUCTION TO QUERY FOR MODEL-DEPENDENT INFORMATION
An instruction is executed to perform a query function. The executing includes obtaining information relating to a selected model of a processor. The information includes at least one model-dependent data attribute of the selected model of the processor. The information is placed in a selected location for use by at least one application in performing one or more functions.
Data structure-aware prefetching method and device on graphics processing unit
The invention discloses a data structure-aware prefetching method and device on a graphics processing unit. The method comprises the steps of acquiring information for a memory access request in which a monitoring processor checks a graph data structure and read data, using a data structure access mode defined by a breadth first search and graph data structure information to generate four corresponding vector prefetching requests, and storing them into a prefetching request queue. The device comprises a data prefetching unit distributed into each processing unit; each data prefetching unit is respectively connected with a memory access monitor, a response FIFO and a primary cache of a load/store unit, and comprises an address space classifier, a runtime information table, prefetching request generation units and the prefetching request queue. According to the present invention, data required by graph traversal can be prefetched more accurately and efficiently using the breadth first search, thereby improving the performance of the GPU in solving graph computation problems.
Prefetch filter table for storing moderately-confident entries evicted from a history table
Disclosed is a computer-implemented method to increase the efficiency of a prefetch system. The method includes receiving a system call including an instruction address. The method includes determining a confidence score. The method further includes creating a first entry, including the instruction address, an associated data address, and the confidence score. The method includes determining the instruction address is not present in a history table, where the history table includes a plurality of entries. The method further includes determining, in response to adding the first entry to the history table, a second entry is evicted from the history table. The method includes entering the second entry into a filter table in response to determining a second confidence score of the second entry is a moderate confidence score, where the moderate confidence score is any confidence score that is greater than a predefined low threshold and less than a predefined high threshold.
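The eviction path in this abstract can be sketched in Python as follows. Several details are assumptions made for illustration only: the threshold values `LOW` and `HIGH`, the oldest-first eviction policy, and the names `Entry` and `Prefetcher` do not come from the abstract.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict

LOW, HIGH = 2, 6  # hypothetical low and high confidence thresholds


@dataclass
class Entry:
    instruction_address: int
    data_address: int
    confidence: int


class Prefetcher:
    """Hypothetical history table with a filter table for evicted entries."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.history: "OrderedDict[int, Entry]" = OrderedDict()
        self.filter_table: Dict[int, Entry] = {}

    def record(self, entry: Entry) -> None:
        if entry.instruction_address in self.history:
            return  # only addresses missing from the history table are added
        self.history[entry.instruction_address] = entry
        if len(self.history) > self.capacity:
            # Assumed policy: evict the oldest entry.
            _, evicted = self.history.popitem(last=False)
            # Keep the evicted entry only if its confidence is moderate.
            if LOW < evicted.confidence < HIGH:
                self.filter_table[evicted.instruction_address] = evicted
```

With a capacity of one, recording an entry with confidence 4 and then a second entry evicts the first into the filter table, while an evicted entry with confidence 7 would be dropped.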
Gateway pull model
A computer system comprising: (i) a computer subsystem configured to act as a work accelerator, and (ii) a gateway connected to the computer subsystem, the gateway enabling the transfer of data to the computer subsystem from external storage at pre-compiled data exchange synchronization points attained by the computer subsystem, which act as a barrier between a compute phase and an exchange phase of the computer subsystem, wherein the computer subsystem is configured to pull data from a gateway transfer memory of the gateway in response to the pre-compiled data exchange synchronization point attained by the subsystem, wherein the gateway comprises at least one processor configured to perform at least one operation to pre-load at least some of the data from a first memory of the gateway to the gateway transfer memory in advance of the pre-compiled data exchange synchronization point attained by the subsystem.