G06F2212/6028

EFFICIENT WORK UNIT PROCESSING IN A MULTICORE SYSTEM

Techniques are described in which a system having multiple processing units processes a series of work units in a processing pipeline, where some or all of the work units access or manipulate data stored in non-coherent memory. In one example, this disclosure describes a method that includes identifying, prior to completing processing of a first work unit with a processing unit of a processor having multiple processing units, a second work unit that is expected to be processed by the processing unit after the first work unit. The method also includes processing the first work unit, and prefetching, from non-coherent memory, data associated with the second work unit into a second cache segment of the buffer cache, wherein prefetching the data associated with the second work unit occurs concurrently with at least a portion of the processing of the first work unit by the processing unit.

DATA CACHE WITH PREDICTION HINTS FOR CACHE HITS
20230205703 · 2023-06-29 ·

Described is a data cache with prediction hints for a cache hit. The data cache includes a plurality of cache lines, where a cache line includes a data field, a tag field, and a prediction hint field. The prediction hint field is configured to store a prediction hint which directs alternate behavior for a cache hit against the cache line. The prediction hint field is integrated with the tag field or is integrated with a way predictor field.

USER INTERFACE EXECUTION APPARATUS AND USER INTERFACE DESIGNING APPARATUS

The present invention has an object of providing a user interface execution apparatus and a user interface designing apparatus which can estimate the maximum size of a storage area for storing data to be prefetched when a user interface is designed and can present updated data to the user even when the prefetched data is updated after the prefetch. A user interface execution apparatus in the present invention includes a processor to execute a program; and a memory to store the program which, when executed by the processor, performs processes of: transitioning a state of the user interface execution apparatus; issuing a prefetch request for data; storing the data; generating the code from an interface definition and a state transition definition; and selecting, before transitioning the state, data to be prefetched based on a difference between a data obtaining interface to be used in a state before the transitioning and a data obtaining interface to be used in a state after the transitioning.

EFFECTIVENESS AND PRIORITIZATION OF PREFETECHES

A method, system, and computer program product are provided for prioritizing prefetch instructions. The method includes a processor issuing a prefetch instruction and fetching elements from a cache that can include a memory or a higher level cache. The processor stores the elements in temporary storage and monitors for accesses by an instruction. The processor stores a record representing the prefetch instruction. The processor updates the record with an indicator and issues a new prefetch instruction by comparing the new prefetch instruction to the record, based on the new prefetch instruction matching the prefetch instruction, assigning the indicator to the new prefetch instruction as a priority value, based on the new prefetch instruction not matching the prefetch instruction, assigning a default value to the new prefetch instruction as the priority value, and determining whether to execute the new prefetch instruction, based on the priority value of the new prefetch instruction.

Way predictor and enable logic for instruction tightly-coupled memory and instruction cache
11687342 · 2023-06-27 · ·

Disclosed herein are systems and method for instruction tightly-coupled memory (iTIM) and instruction cache (iCache) access prediction. A processor may use a predictor to enable access to the iTIM or the iCache and a particular way (a memory structure) based on a location state and program counter value. The predictor may determine whether to stay in an enabled memory structure, move to and enable a different memory structure, or move to and enable both memory structures. Stay and move predictions may be based on whether a memory structure boundary crossing has occurred due to sequential instruction processing, branch or jump instruction processing, branch resolution, and cache miss processing. The program counter and a location state indicator may use feedback and be updated each instruction-fetch cycle to determine which memory structure(s) needs to be enabled for the next instruction fetch.

GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT

Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.

SYSTEM, DEVICE AND METHOD FOR ACCESSING DEVICE-ATTACHED MEMORY

A device connected to a host processor via a bus includes: an accelerator circuit configured to operate based on a message received from the host processor; and a controller configured to control an access to a memory connected to the device, wherein the controller is further configured to, in response to a read request received from the accelerator circuit, provide a first message requesting resolution of coherence to the host processor and prefetch first data from the memory.

EFFICIENT AND CONCURRENT MODEL EXECUTION

An accelerator is disclosed. A circuit may process a data to produce a processed data. A first tier storage may include a first capacity and a first latency. A second tier storage may include a second capacity and a second latency. The second capacity may be larger than the first capacity, and the second latency may be slower than the first latency. A bus may be used to transfer at least one of the data or the processed data between the first tier storage and the second tier storage.

TLB ACCESS MONITORING
20230185730 · 2023-06-15 ·

An apparatus includes circuitry couplable to a host system and a memory device. The circuitry is configured to determine whether a page table maintained on the circuitry includes a physical address of the memory device corresponding to a virtual address associated with a TLB fill request from the host system. Responsive to determining that the page table includes the physical address, the circuitry provides signaling indicative of a completion to the TLB fill request to the host system, prefetch a page of data at the physical address from the memory device using the physical address from the page table, and provide signaling indicative of the page of data to the host system.

PREFETCH MANAGEMENT IN A HIERARCHICAL CACHE SYSTEM

An apparatus includes a CPU core, a first memory cache with a first line size, and a second memory cache having a second line size larger than the first line size. Each line of the second memory cache includes an upper half and a lower half. A memory controller subsystem is coupled to the CPU core and to the first and second memory caches. Upon a miss in the first memory cache for a first target address, the memory controller subsystem determines that the first target address resulting in the miss maps to the lower half of a line in the second memory cache, retrieves the entire line from the second memory cache, and returns the entire line from the second memory cache to the first memory cache.