Patent classifications
G06F9/3888
Extension of CPU context-state management for micro-architecture state
A processor saves micro-architectural contexts to increase the efficiency of code execution and power management. A save instruction is executed to store a micro-architectural state and an architectural state of a processor in a common buffer of a memory upon a context switch that suspends the execution of a process. The micro-architectural state contains performance data resulting from the execution of the process. A restore instruction is executed to retrieve the micro-architectural state and the architectural state from the common buffer upon a resumed execution of the process. Power management hardware then uses the micro-architectural state as an intermediate starting point for the resumed execution.
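The save and restore flow described above can be illustrated with a minimal software sketch. It is not the patented hardware mechanism: `CommonBuffer`, `SavedContext`, and the field names (`pc`, `ipc_estimate`, `power_level`) are hypothetical names chosen for illustration, and the buffer is modeled as a dictionary keyed by process ID.

```python
from dataclasses import dataclass


@dataclass
class SavedContext:
    arch_state: dict   # architectural state, e.g. register values
    uarch_state: dict  # micro-architectural state, e.g. performance data


class CommonBuffer:
    """Hypothetical per-process save area holding both kinds of state."""

    def __init__(self):
        self._slots = {}

    def save(self, pid, arch_state, uarch_state):
        # Copy so later changes to the live state do not alter the saved image.
        self._slots[pid] = SavedContext(dict(arch_state), dict(uarch_state))

    def restore(self, pid):
        ctx = self._slots[pid]
        return dict(ctx.arch_state), dict(ctx.uarch_state)


buf = CommonBuffer()
live_arch = {"pc": 0x400, "r0": 7}
live_uarch = {"ipc_estimate": 1.8, "power_level": 2}
buf.save(pid=1, arch_state=live_arch, uarch_state=live_uarch)  # context switch out
live_uarch["power_level"] = 0   # another process disturbs the live state
arch, uarch = buf.restore(pid=1)  # resumed execution
```

After the restore, `uarch` again holds the values recorded at the context switch, which is the "intermediate starting point" role the abstract assigns to the saved micro-architectural state.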
Method and apparatus improving the execution of instructions by execution threads in data processing systems
In a data processing system, a program to be executed by a programmable processing unit of the data processing system is analyzed to identify a sequence of instructions that would produce the same result for plural execution threads were those plural execution threads each to execute the sequence of instructions using the same input data. Then, when the program is being executed, when an execution thread is to execute the identified sequence of instructions, it is determined whether a result produced by an earlier execution thread executing the sequence of instructions, and that used the same input data, is stored in memory or not. The current thread then either executes the sequence of instructions, or retrieves the stored result produced by the earlier execution of the sequence of instructions and skips execution of the sequence of instructions for which the result is stored, accordingly.
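The reuse decision described above resembles memoization keyed on the input data. The sketch below is a software analogy only, assuming hashable inputs and a plain dictionary as the result store; `run_thread`, `cache`, and `stats` are illustrative names, not terms from the patent.

```python
def run_thread(inputs, sequence, cache, stats):
    """Execute `sequence` for one thread, or reuse an earlier thread's result."""
    if inputs in cache:
        # An earlier thread already ran the sequence on the same input data.
        stats["skipped"] += 1
        return cache[inputs]
    result = sequence(*inputs)
    cache[inputs] = result
    stats["executed"] += 1
    return result


def identified_sequence(x, y):
    # Stand-in for the instruction sequence the analysis step identified.
    return x * x + y


cache = {}
stats = {"executed": 0, "skipped": 0}
thread_inputs = [(1, 2), (1, 2), (3, 4)]
results = [run_thread(i, identified_sequence, cache, stats) for i in thread_inputs]
```

Here the second thread shares its inputs with the first, so it retrieves the stored result and skips execution, while the third thread executes the sequence normally.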
METHOD AND SYSTEM TO PROVIDE USER-LEVEL MULTITHREADING
A method and system to provide user-level multithreading are disclosed. The method according to the present techniques comprises receiving programming instructions to execute one or more shared resource threads (shreds) via an instruction set architecture (ISA). One or more instruction pointers are configured via the ISA; and the one or more shreds are executed simultaneously with a microprocessor, wherein the microprocessor includes multiple instruction sequencers.
POLICIES FOR SHADER RESOURCE ALLOCATION IN A SHADER CORE
A method for use in a processor for arbitrating between multiple processes to select wavefronts for execution on a shader core is provided. The processor includes a compute pipeline configured to issue wavefronts to the shader core for execution, a hardware queue descriptor associated with the compute pipeline, and the shader core. The shader core is configured to execute work for the compute pipeline corresponding to a first memory queue descriptor, using data for the first memory queue descriptor that is loaded into a first hardware queue descriptor. The processor is configured to detect a context switch condition and, responsive to the context switch condition, perform a context switch operation that includes loading data for a second memory queue descriptor into the first hardware queue descriptor. The shader core is then configured to execute work corresponding to the second memory queue descriptor that is loaded into the first hardware queue descriptor.
INSTRUCTIONS FOR DUAL DESTINATION TYPE CONVERSION, MIXED PRECISION ACCUMULATION, AND MIXED PRECISION ATOMIC MEMORY OPERATIONS
Disclosed embodiments relate to instructions for dual-destination type conversion, accumulation, and atomic memory operations. In one example, a system includes a memory and a processor that includes: a fetch circuit to fetch an instruction from a code storage, the instruction including an opcode, a first destination identifier, and a source identifier to specify a source vector register, the source vector register including a plurality of single precision floating point data elements; a decode circuit to decode the fetched instruction; and an execution circuit to execute the decoded instruction to: convert the elements of the source vector register into double precision floating point values, store a first half of the double precision floating point values to a first location identified by the first destination identifier, and store a second half of the double precision floating point values to a second location.
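The dual-destination conversion can be sketched in software as follows. This is an illustration under stated assumptions, not the instruction's actual semantics: the helper names are hypothetical, single precision is emulated by round-tripping through `struct`, and assigning the lower-indexed elements to the first destination is an assumption the abstract does not confirm.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def convert_dual_dest(src):
    """Widen single precision elements to double and split them in two halves.

    Every single precision value is exactly representable as a double,
    so the widening step loses no information.
    """
    widened = [float(x) for x in src]
    half = len(widened) // 2
    first_dest = widened[:half]    # assumed: lower-indexed elements
    second_dest = widened[half:]   # assumed: upper-indexed elements
    return first_dest, second_dest


source_register = [to_single(v) for v in (1.5, 0.1, -2.0, 3.25)]
dest1, dest2 = convert_dual_dest(source_register)
```

Two destinations are needed because the widened elements occupy twice the storage of the source vector.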
GENERALIZED ACCELERATION OF MATRIX MULTIPLY ACCUMULATE OPERATIONS
A method, computer readable medium, and processor are disclosed for performing matrix multiply and accumulate (MMA) operations. The processor includes a datapath configured to execute the MMA operation to generate a plurality of elements of a result matrix at an output of the datapath. Each element of the result matrix is generated by calculating at least one dot product of corresponding pairs of vectors associated with matrix operands specified in an instruction for the MMA operation. A dot product operation includes the steps of: generating a plurality of partial products by multiplying each element of a first vector with a corresponding element of a second vector; aligning the plurality of partial products based on the exponents associated with each element of the first vector and each element of the second vector; and accumulating the plurality of aligned partial products into a result queue utilizing at least one adder.
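The three dot product steps (partial products, exponent-based alignment, accumulation) can be sketched in software. The function below is a rough model, not the datapath in the patent: `aligned_dot` and `frac_bits` are hypothetical names, mantissas are truncated to `frac_bits` bits, and alignment is done by shifting each integer partial product right to the largest exponent.

```python
import math


def aligned_dot(a, b, frac_bits=12):
    """Dot product via integer partial products aligned to a common exponent."""
    terms = []
    for x, y in zip(a, b):
        mx, ex = math.frexp(x)  # x == mx * 2**ex, with 0.5 <= |mx| < 1 or mx == 0
        my, ey = math.frexp(y)
        ix = int(mx * (1 << frac_bits))  # fixed-point mantissa, truncated
        iy = int(my * (1 << frac_bits))
        product = ix * iy                # partial product of the mantissas
        if product != 0:
            terms.append((product, ex + ey))
    if not terms:
        return 0.0
    max_exp = max(e for _, e in terms)
    acc = 0
    for product, e in terms:
        shift = max_exp - e  # align to the largest exponent
        if product >= 0:
            acc += product >> shift
        else:
            acc -= (-product) >> shift  # shift the magnitude, keep the sign
    # Each mantissa carries frac_bits fractional bits, so a product carries twice that.
    return math.ldexp(acc, max_exp - 2 * frac_bits)


result = aligned_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

With these small inputs every mantissa fits in the 12 fractional bits, so the result is exact; for general inputs the truncation and right shifts discard low-order bits, which is the precision trade-off alignment introduces.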
OPTIMIZED COMPUTE HARDWARE FOR MACHINE LEARNING OPERATIONS
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a fetch unit to fetch a single instruction having multiple input operands, wherein the multiple input operands have an unequal bit-length, a first input operand having a first bit-length and a second input operand having a second bit-length; a decode unit to decode the single instruction into a decoded instruction; an operand length unit to determine a smaller bit-length of the first bit-length and the second bit-length; and a compute unit to perform a matrix operation on the multiple input operands to generate an output value having a bit length of the smaller bit length.
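As a rough illustration of the operand length rule, the sketch below multiplies two integer matrices with different declared bit-lengths and narrows each output to the smaller of the two. The abstract does not say how values are narrowed, so the signed saturation used here is an assumption, and `mixed_matmul` and `saturate` are hypothetical names.

```python
def saturate(value, bits):
    """Clamp a signed integer to the range representable in `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mixed_matmul(a, a_bits, b, b_bits):
    """Multiply matrices a and b; outputs use the smaller operand bit-length."""
    out_bits = min(a_bits, b_bits)  # role of the operand length unit
    inner = len(b)
    cols = len(b[0])
    out = [
        [
            saturate(sum(row[k] * b[k][j] for k in range(inner)), out_bits)
            for j in range(cols)
        ]
        for row in a
    ]
    return out, out_bits


product, bits = mixed_matmul([[1, 2], [100, 2]], 16, [[3], [4]], 8)
```

The first output row fits in 8 bits unchanged, while the second (308 before narrowing) is clamped to the 8-bit maximum of 127.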
Processor and method for dynamically allocating processing elements to front end units using a plurality of registers
Embodiments include a processor capable of supporting multi-mode and corresponding methods. The processor includes front end units, a number of processing elements more than a number of the front end units; and a controller configured to determine if thread divergence occurs due to conditional branching. If there is thread divergence, the processor may set control information to control processing elements using currently activated front end units. If there is not, the processor may set control information to control processing elements using a currently activated front end unit.
Eliminating redundant store instructions from execution while maintaining total store order
A processor includes a front end including circuitry to decode instructions from an instruction stream, a data cache unit including circuitry to cache data for the processor, and a binary translator. The binary translator includes circuitry to identify a redundant store in the instruction stream, mark the start and end of a region of the instruction stream with the redundant store, remove the redundant store, and store an amended instruction stream with the redundant store removed.
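The removal step can be sketched as a simple pass over an instruction list. This is a single-thread illustration only: it does not model the region marking or the total store order guarantees the patent addresses, the tuple encoding of instructions is hypothetical, and a store is treated as redundant only when a later store overwrites the same address with no intervening load of that address.

```python
def remove_redundant_stores(stream):
    """Return a copy of `stream` without stores that are overwritten unread.

    Instructions are tuples: ("store", addr, value), ("load", addr),
    or anything else, which is conservatively treated as a barrier.
    """
    pending = {}  # addr -> index of the latest store not yet read
    dead = set()  # indices of stores found to be redundant
    for i, ins in enumerate(stream):
        if ins[0] == "store":
            addr = ins[1]
            if addr in pending:
                dead.add(pending[addr])  # earlier store is overwritten unread
            pending[addr] = i
        elif ins[0] == "load":
            pending.pop(ins[1], None)    # the stored value is observed
        else:
            pending.clear()              # unknown instruction: assume it matters
    return [ins for i, ins in enumerate(stream) if i not in dead]


original = [
    ("store", "A", 1),
    ("store", "A", 2),
    ("load", "A"),
    ("store", "A", 3),
]
amended = remove_redundant_stores(original)
```

Only the first store is dropped: the second is read by the load, and the third is the last write to the address.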
INSTRUCTIONS AND LOGIC TO PERFORM FLOATING-POINT AND INTEGER OPERATIONS FOR MACHINE LEARNING
- Himanshu Kaul
- Mark A. Anders
- Sanu K. Mathew
- Anbang Yao
- Joydeep Ray
- Ping T. Tang
- Michael S. Strickland
- Xiaoming Chen
- Tatiana Shpeisman
- Abhishek R. Appu
- Altug Koker
- Kamal Sinha
- Balaji Vembu
- Nicolas C. Galoppo Von Borries
- Eriko Nurvitadhi
- Rajkishore Barik
- Tsung-Han Lin
- Vasanth Ranganathan
- Sanjeev Jahagirdar
One embodiment provides for a machine-learning hardware accelerator comprising a compute unit having an adder and a multiplier that are shared between an integer datapath and a floating-point datapath, the upper bits of input operands to the multiplier to be gated during floating-point operation.
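The operand gating idea can be modeled loosely in software. The widths below (a 32-bit integer operand and a 24-bit mantissa) are assumptions for illustration, as are the names `shared_multiply`, `INT_WIDTH`, and `MANTISSA_BITS`; the abstract does not specify them.

```python
INT_WIDTH = 32       # assumed integer operand width
MANTISSA_BITS = 24   # assumed mantissa width used in floating-point mode


def shared_multiply(a, b, mode):
    """Model one multiplier serving both datapaths.

    In "int" mode the full operand width is used. In "fp" mode the upper
    operand bits are gated (forced to zero) so only mantissa bits are multiplied.
    """
    if mode == "fp":
        mask = (1 << MANTISSA_BITS) - 1
    elif mode == "int":
        mask = (1 << INT_WIDTH) - 1
    else:
        raise ValueError("mode must be 'int' or 'fp'")
    return (a & mask) * (b & mask)


fp_product = shared_multiply(0xFF000003, 2, "fp")    # upper 8 bits gated
int_product = shared_multiply(0xFF000003, 2, "int")  # full width used
```

In floating-point mode the upper eight bits of the first operand are ignored, so the product is 6; in integer mode the same inputs use all 32 bits.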