Patent classifications
G06F9/30079
Supporting large-word operations in a reduced instruction set computer (“RISC”) processor
A Reduced Instruction Set Computer (“RISC”) supporting large-word operations in a computing environment is disclosed. In one implementation, in response to receiving one or more control signals from a central processing unit (“CPU”), a set of operations are executed on a state of a special purpose execution unit (“SPU”) having a plurality of SPU registers, the SPU being associated with the CPU and the state of the SPU having word widths of one or more of the plurality of registers being greater in size than word widths of a plurality of CPU registers of a computing system and a set of state-master bits to synchronize the state of the SPU and a state of the CPU. The results of the set of operations are stored in the plurality of CPU registers or an alternative set of the plurality of SPU registers.
Systems and methods for improving cache efficiency and utilization
- Altug Koker ,
- Joydeep Ray ,
- Ben Ashbaugh ,
- Jonathan Pearce ,
- Abhishek Appu ,
- Vasanth Ranganathan ,
- Lakshminarayanan Striramassarma ,
- Elmoustapha Ould-Ahmed-Vall ,
- Aravindh Anantaraman ,
- Valentin Andrei ,
- Nicolas Galoppo von Borries ,
- Varghese George ,
- Yoav Harel ,
- Arthur Hunter, JR. ,
- Brent Insko ,
- Scott Janus ,
- Pattabhiraman K ,
- Mike Macpherson ,
- Subramaniam Maiyuran ,
- Marian Alin Petre ,
- Murali Ramadoss ,
- Shailesh Shah ,
- Kamal Sinha ,
- Prasoonkumar Surti ,
- Vikranth Vemulapalli
Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.
Processing pipeline where fast data passes slow data
Various embodiments relate to an inline encryption engine in a memory controller configured to process data read from a memory, including: a first data pipeline configured to receive data that is plaintext data and a first validity flag; a second data pipeline having the same length as the first data pipeline configured to: receive data that is encrypted data and a second validity flag; decrypt the encrypted data from the memory and output decrypted plaintext data; an output multiplexer configured to select and output data from either the first pipeline or the second pipeline; and control logic configured to control the output multiplexer, wherein the control logic is configured to output valid data from the first pipeline when the second pipeline does not have valid output decrypted plaintext data available.
Systems and Methods for Using Error Correction and Pipelining Techniques for an Access Triggered Computer Architecture
A method for improving performance of an access triggered architecture for a computer implemented application is provided. The method first executes typical operations of the access triggered architecture according to an execution time, wherein the typical operations comprise: obtaining a dataset and an instruction set; and using the instruction set to transmit the dataset to a functional block associated with an operation, wherein the functional block performs the operation using the dataset to generate a revised dataset. The method further creates a pipeline of the typical operations to reduce the execution time of the typical operations, to create a reduced execution time; and executes the typical operations according to the reduced execution time, using the pipeline.
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.
HYBRID VIRTUAL GPU CO-SCHEDULING
An embodiment of a semiconductor package apparatus may include technology to manage one or more virtual graphic processor units, and co-schedule the one or more virtual graphic processor units based on both general processor instructions and graphics processor instructions. Other embodiments are disclosed and claimed.
METHOD AND APPARATUS FOR EFFICIENT PROGRAMMABLE INSTRUCTIONS IN COMPUTER SYSTEMS
Systems, apparatuses, and methods for implementing as part of a processor pipeline a reprogrammable execution unit capable of executing specialized instructions are disclosed. A processor includes one or more reprogrammable execution units which can be programmed to execute different types of customized instructions. When the processor loads a program for execution, the processor loads a bitfile associated with the program. The processor programs a reprogrammable execution unit with the bitfile so that the reprogrammable execution unit is capable of executing specialized instructions associated with the program. During execution, a dispatch unit dispatches the specialized instructions to the reprogrammable execution unit for execution. The results of other instructions, such as integer and floating point instructions, are available immediately to instructions executing on the reprogrammable execution unit since the reprogrammable execution unit shares the processor registers with the integer and floating point execution units.
Cache coherence shared state suppression
A method includes receiving, by a level two (L2) controller, a first request for a cache line in a shared cache coherence state; mapping, by the L2 controller, the first request to a second request for a cache line in an exclusive cache coherence state; and responding, by the L2 controller, to the second request.
Tile-based result buffering in memory-compute systems
A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. A first tile in a first node can include a processor with a processor output and a first register network configured to receive information from the processor output and information from one or more of the multiple other tiles in the first node. In response to an output instruction and a delay instruction, the register network can provide an output signal to one of the multiple other tiles in the first node. Based on the output instruction, the output signal can include one or the other of the information from the processor output and the information from one or more of the multiple other tiles in the first node. A timing characteristic of the output signal can depend on the delay instruction.
Circuit for fast interrupt handling
A circuit for fast interrupt handling is disclosed. An apparatus includes a processor circuit having an execution pipeline and a table configured to store a plurality of pointers that correspond to interrupt routines stored in a memory circuit. The apparatus further includes an interrupt redirect circuit configured to receive a plurality of interrupt requests. The interrupt redirect circuit may select a first interrupt request among a plurality of interrupt requests of a first type. The interrupt redirect circuit retrieves a pointer from the table using information associated with the request. Using the pointer, the execution pipeline retrieves first program instruction from the memory circuit to execute a particular interrupt routine.