Patent classifications
G06F9/383
Store prefetches for dependent loads in a processor
An information handling system, method, and processor that detects a store instruction for data in a processor where the store instruction is a reliable indicator of a future load for the data; in response to detecting the store instruction, sends a prefetch request to memory for an entire cache line containing the data referenced in the store instruction, and preferably only the single cache line containing the data; and receives, in response to the prefetch request, the entire cache line containing the data referenced in the store instruction.
DECOUPLED ACCESS-EXECUTE PROCESSING AND PREFETCHING CONTROL
Apparatuses and methods are provided, relating to the control of data processing in devices which comprise both decoupled access-execute processing circuitry and prefetch circuitry. Control of the access portion of the decoupled access-execute processing circuitry may be dependent on a performance metric of the prefetch circuitry. Alternatively or in addition, control of the prefetch circuitry may be dependent on a performance metric of the access portion.
Programmable vision accelerator
In one embodiment of the present invention, a programmable vision accelerator enables applications to collapse multi-dimensional loops into one dimensional loops. In general, configurable components included in the programmable vision accelerator work together to facilitate such loop collapsing. The configurable elements include multi-dimensional address generators, vector units, and load/store units. Each multi-dimensional address generator generates a different address pattern. Each address pattern represents an overall addressing sequence associated with an object accessed within the collapsed loop. The vector units and the load store units provide execution functionality typically associated with multi-dimensional loops based on the address pattern. Advantageously, collapsing multi-dimensional loops in a flexible manner dramatically reduces the overhead associated with implementing a wide range of computer vision algorithms. Consequently, the overall performance of many computer vision applications may be optimized.
Prefetching
A technique is provided for prefetching data items. An apparatus has a storage structure with a plurality of entries to store data items. The storage structure is responsive to access requests from processing circuitry to provide access to the data items. The apparatus has prefetch circuitry to prefetch data and correlation information storage to store correlation information for a plurality of data items. The correlation information identifies, for each of the plurality of data items, one or more correlated data items. The prefetch circuitry is configured to monitor the access requests from the processing circuitry. In response to detecting a hit in the correlation information storage for a particular access request that identifies a requested data item for which the correlation information storage stores correlation information, the prefetch circuitry is configured to prefetch the one or more correlated data items identified by the correlation information for the requested data item.
PREFETCHING
A technique is provided for prefetching data items. An apparatus has a storage structure with a plurality of entries to store data items. The storage structure is responsive to access requests from processing circuitry to provide access to the data items. The apparatus has prefetch circuitry to prefetch data and correlation information storage to store correlation information for a plurality of data items. The correlation information identifies, for each of the plurality of data items, one or more correlated data items. The prefetch circuitry is configured to monitor the access requests from the processing circuitry. In response to detecting a hit in the correlation information storage for a particular access request that identifies a requested data item for which the correlation information storage stores correlation information, the prefetch circuitry is configured to prefetch the one or more correlated data items identified by the correlation information for the requested data item.
Systems and methods for improving cache efficiency and utilization
- Altug Koker ,
- Joydeep Ray ,
- Ben Ashbaugh ,
- Jonathan Pearce ,
- Abhishek Appu ,
- Vasanth Ranganathan ,
- Lakshminarayanan Striramassarma ,
- Elmoustapha Ould-Ahmed-Vall ,
- Aravindh Anantaraman ,
- Valentin Andrei ,
- Nicolas Galoppo von Borries ,
- Varghese George ,
- Yoav Harel ,
- Arthur Hunter, JR. ,
- Brent Insko ,
- Scott Janus ,
- Pattabhiraman K ,
- Mike Macpherson ,
- Subramaniam Maiyuran ,
- Marian Alin Petre ,
- Murali Ramadoss ,
- Shailesh Shah ,
- Kamal Sinha ,
- Prasoonkumar Surti ,
- Vikranth Vemulapalli
Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.
PROCESSOR AND CONTROL METHOD OF PROCESSOR
A processor includes: a storage unit that stores instructions; a counting unit that specifies an instruction to be decoded by a count value; a decoding unit that decodes an instruction; and a control unit that, when the decoded instruction is a repeat instruction, updates the count value of the counting unit so as to cause repeat target instructions in number corresponding to a designated number of instructions, out of instructions succeeding the repeat instruction, to be repeatedly executed a designated number of repetition times, and generates updated operands being operation objects of the repeat target instructions that are to be executed for the second or later time, and when the repeat target instructions are to be executed for the second or later time, updates operands of the repeat target instructions for use in the second or later time execution, to the generated updated operands and outputs the updated operands.
Providing code sections for matrix of arithmetic logic units in a processor
The present invention relates to a processor having a trace cache and a plurality of ALUs arranged in a matrix, comprising an analyser unit located between the trace cache and the ALUs, wherein the analyser unit analyses the code in the trace cache, detects loops, transforms the code, and issues to the ALUs sections of the code combined to blocks for joint execution for a plurality of clock cycles.
Multi-variate strided read operations for accessing matrix operands
In one embodiment, a matrix processor comprises a memory to store a matrix operand and a strided read sequence, wherein: the matrix operand is stored out of order in the memory; and the strided read sequence comprises a sequence of read operations to read the matrix operand in a correct order from the memory. The matrix processor further comprises circuitry to: receive a first instruction to be executed by the matrix processor, wherein the first instruction is to instruct the matrix processor to perform a first operation on the matrix operand; read the matrix operand from the memory based on the strided read sequence; and execute the first instruction by performing the first operation on the matrix operand.
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.