Patent classifications
G06F8/4442
Set associative cache memory with heterogeneous replacement policy
A set associative cache memory, comprising: an array of storage elements arranged as M sets by N ways; an allocation unit that allocates the storage elements in response to memory accesses that miss in the cache memory. Each memory access selects a set; for each parcel of a plurality of parcels, a parcel specifier specifies: a subset of ways of the N ways included in the parcel. The subsets of ways of parcels associated with a selected set are mutually exclusive; a replacement scheme associated with the parcel from among a plurality of predetermined replacement schemes. For each memory access, the allocation unit: selects the parcel specifier in response to the memory access; and uses the replacement scheme associated with the parcel to allocate into the subset of ways of the selected set included in the parcel.
MEMORY DEVICE HAVING MULTIPLE READ BUFFERS FOR READ LATENCY REDUCTION
A memory device can include: a memory array arranged to store data lines; an interface that receives a first read command requesting bytes of data in a consecutively addressed order from a starting byte; a cache memory having a first buffer storing a first data line including the starting byte, and a second buffer storing a second data line, from the cache memory or the memory array; output circuitry that accesses data from the first buffer, and sequentially outputs each byte from the starting byte through a highest addressed byte of the first data line; and from the second buffer and sequentially outputs each byte from a lowest addressed byte of the second data line until the requested bytes of data have been output in order to execute the first read command, the contents of the first and second buffers being maintained in the cache memory.
Software solution for cooperative memory-side and processor-side data prefetching
A solution for cooperative data prefetching that enables software control of a memory-side data prefetch and/or a processor-side data prefetch is provided. In one embodiment, the invention provides a solution for generating an application, in which access to application data for the application is improved (e.g., optimized) in program code for the application. In particular, a push request, for performing a memory-side data prefetch of the application data, and a prefetch request, for performing a processor-side data prefetch, are added to the program code. The memory-side data prefetch results in the application data being copied from a first data store to a second data store that is faster than the first data store while the processor-side data prefetch results in the application data being copied from the second data store to a third data store that is faster than the second data store.
Per-instance preamble for graphics processing
A method for processing data in a graphics processing unit (GPU) including receiving an instance identifier for an instance and a shader program comprising a preamble code block and a main shader code block, assigning, the instance identifier to a general purpose register at wave creation, allocating address space within the constant memory for instance uniforms, and determining the preamble code block has not been executed and the wave is a first wave of the instance to be executed, based on determining the preamble code block has not been executed and the wave is the first wave to be executed, executing the preamble code block to store the plurality of instance uniforms in the constant memory and based, at least in part, on executing the preamble code block, executing the wave of the plurality of waves using at least one of the plurality of instance constants stored inconstant memory.
Anticipated prefetching for a parent core in a multi-core chip
Embodiments relate to prefetching data on a chip having a scout core and a parent core coupled to the scout core. The method includes determining that a program executed by the parent core requires content stored in a location remote from the parent core. The method includes sending a fetch table address determined by the parent core to the scout core. The method includes accessing a fetch table that is indicated by the fetch table address by the scout core. The fetch table indicates how many of pieces of content are to be fetched by the scout core and a location of the pieces of content. The method includes based on the fetch table indicating, fetching the pieces of content by the scout core. The method includes returning the fetched pieces of content to the parent core.
Nested loops reversal enhancements
Systems, apparatuses and methods may provide for technology to identify in user code, a nested loop which would result in cache memory misses when executed. The technology further reverses an order of iterations of a first inner loop in the nested loop to obtain a modified nested loop. Reversing the order of iterations increases a number of times that cache memory hits occur when the modified nested loop is executed.
Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type
A processor includes a first prefetcher that prefetches data in response to memory accesses and a second prefetcher that prefetches data in response to memory accesses. Each of the memory accesses has an associated memory access type (MAT) of a plurality of predetermined MATs. The processor also includes a table that holds first scores that indicate effectiveness of the first prefetcher to prefetch data with respect to the plurality of predetermined MATs and second scores that indicate effectiveness of the second prefetcher to prefetch data with respect to the plurality of predetermined MATs. The first and second prefetchers selectively defer to one another with respect to data prefetches based on their relative scores in the table and the associated MATs of the memory accesses.
Information processing apparatus and compiling method
A memory stores code including a plurality of functions and a plurality of function calls each calling one of the plurality of functions. A processor calculates, for each of the plurality of functions, a plurality of index values including a first index value indicating an iteration status of a loop in the function and a second index value indicating the code size of the function. The processor calculates, for each of the plurality of function calls, an evaluation value based on the plurality of index values that are calculated for the function called by the function call. The processor selects one or more of the plurality of function calls, based on the evaluation value, and inlines the selected function calls.
INFORMATION PROCESSING DEVICE, COMPILING METHOD, AND NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM
An information processing device includes a memory, and a processor coupled to the memory and configured to detect an access pattern according to which a memory reference instruction in a first loop process to be executed posterior to a second loop process accesses first data elements in the memory every loop iteration, and insert a prefetch instruction to the second loop process based on the access pattern, the prefetch instruction being an instruction to transfer at least one of the first data elements from the memory to a first sector of a cache memory, the at least one of the first data elements transferred to the first sector of the cache memory being never cached out by a second data element different from each of the first data elements.
Compiler sub expression directed acyclic graph (DAG) remat for register pressure
The present disclosure relates to devices and methods for transforming program source code using a rematerialization operation. The devices and methods may identify at least one hot spot with high register pressure in a program source code for an application and identify a plurality of live variables within the at least one hot spot. The devices and methods may group the plurality of live variables by a basic block that has contained a define or single use of the plurality of live variables. The devices and methods may build a directed acyclic graph (DAG) for each basic block that has a grouped plurality of live variables. The devices and methods may save the DAG as a candidate instruction to move in the program source code and may generate transformed program source code for the application by moving the candidate instruction.