G06F15/8069

No-locality hint vector memory access processors, methods, systems, and instructions
11392500 · 2022-07-19

A processor of an aspect includes a plurality of packed data registers, and a decode unit to decode a no-locality hint vector memory access instruction. The no-locality hint vector memory access instruction is to indicate a packed data register of the plurality of packed data registers that is to have source packed memory indices. The source packed memory indices are to have a plurality of memory indices. The no-locality hint vector memory access instruction is to provide a no-locality hint to the processor for data elements that are to be accessed with the memory indices. The processor also includes an execution unit coupled with the decode unit and the plurality of packed data registers. The execution unit, in response to the no-locality hint vector memory access instruction, is to access the data elements at memory locations that are based on the memory indices.
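
The indexed access described above can be sketched as a purely functional model. This is an illustration only, not the patented implementation: the function name `gather`, the `base` and `scale` parameters, and the flat list standing in for memory are all assumptions. The no-locality hint affects caching behavior, not which elements are read, so it has no effect in this model.

```python
def gather(memory, base, indices, scale=1):
    """Read one data element per memory index (assumed form: base + index * scale)."""
    # A no-locality hint would only advise the hardware that these elements
    # are unlikely to be reused soon; the values returned are unchanged.
    return [memory[base + i * scale] for i in indices]


memory = list(range(100, 164))  # hypothetical flat memory of 64 elements
indices = [3, 40, 7, 21]        # stands in for the source packed memory indices
result = gather(memory, base=0, indices=indices)
```

Here `result` holds one element per index, in the same order as the indices.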

Methods and apparatus for eviction in dual datapath victim cache system

Methods, apparatus, systems and articles of manufacture are disclosed to evict in a dual datapath victim cache system. An example apparatus includes a cache storage, a cache controller operable to receive a first memory operation and a second memory operation concurrently, comparison logic operable to identify if the first and second memory operations missed in the cache storage, and a replacement policy component operable to, when at least one of the first and second memory operations corresponds to a miss in the cache storage, reserve an entry in the cache storage to evict based on the first and second memory operations.

Aggressive write flush scheme for a victim cache

A caching system including a first sub-cache and a second sub-cache in parallel with the first sub-cache, wherein the second sub-cache includes: line type bits configured to store an indication that a corresponding cache line of the second sub-cache is configured to store write-miss data, and an eviction controller configured to evict a cache line of the second sub-cache storing write-miss data based on an indication that the cache line has been fully written.
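
A minimal sketch of the "fully written" eviction check, under stated assumptions: the abstract does not say how full coverage is tracked, so the per-byte flags, the 8-byte line size, and the class name `VictimLine` are hypothetical. The `is_write_miss` flag stands in for the line type bits.

```python
LINE_BYTES = 8  # hypothetical line size


class VictimLine:
    def __init__(self, is_write_miss):
        self.is_write_miss = is_write_miss        # stands in for the line type bits
        self.data = bytearray(LINE_BYTES)
        self.byte_written = [False] * LINE_BYTES  # assumed per-byte tracking

    def write(self, offset, payload):
        # Record the data and mark each touched byte as written.
        for k, value in enumerate(payload):
            self.data[offset + k] = value
            self.byte_written[offset + k] = True

    def should_evict(self):
        # Evict only write-miss lines whose every byte has been written.
        return self.is_write_miss and all(self.byte_written)


line = VictimLine(is_write_miss=True)
line.write(0, b"abcd")
half = line.should_evict()  # only half the line has been written
line.write(4, b"efgh")
full = line.should_evict()  # every byte has now been written
```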

Write merging on stores with different privilege levels

A caching system including a first sub-cache, a second sub-cache, coupled in parallel with the first sub-cache, for storing write-memory commands that are not cached in the first sub-cache, the second sub-cache including privilege bits configured to store an indication that a corresponding cache line of the second sub-cache is associated with a level of privilege, and wherein the second sub-cache is further configured to receive a first write memory command for a memory address associated with a first level of privilege, store, in the second sub-cache, first data associated with the first write memory command and the level of privilege associated with the cache line, receive a second write memory command for the cache line, the second write memory command associated with a second level of privilege, merge the first level of privilege with the second level of privilege, and output the merged privilege level with the cache line.
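
The merge step can be illustrated as follows. The abstract does not specify the encoding or the merge rule, so this sketch assumes a one-hot privilege encoding and a bitwise OR as the merge; `USER`, `SUPERVISOR`, and `WriteMissEntry` are hypothetical names.

```python
# Hypothetical one-hot privilege encoding; the source does not give one.
USER = 0b01
SUPERVISOR = 0b10


class WriteMissEntry:
    def __init__(self, line_bytes=8):
        self.data = bytearray(line_bytes)
        self.privilege_bits = 0  # stands in for the privilege bits of the line

    def write(self, offset, payload, privilege):
        # Store the write data (offset + len(payload) is assumed to fit the line).
        self.data[offset:offset + len(payload)] = payload
        # Assumed merge rule: OR the writer's level into the line's bits.
        self.privilege_bits |= privilege


entry = WriteMissEntry()
entry.write(0, b"\x11\x22", USER)        # first write, first privilege level
entry.write(2, b"\x33\x44", SUPERVISOR)  # second write, second privilege level
merged = entry.privilege_bits            # would be output with the cache line
```

Under this assumed rule the merged value records every privilege level that contributed data to the line.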

Victim cache with write miss merging

A caching system including a first sub-cache, a second sub-cache, coupled in parallel with the first sub-cache, for storing cache data evicted from the first sub-cache and write-memory commands that are not cached in the first sub-cache, and a cache controller configured to receive two or more cache commands, determine a conflict exists between the received two or more cache commands, determine a conflict resolution between the received two or more cache commands, and send the two or more cache commands to the first sub-cache and the second sub-cache.

Reconfigurable Parallel Processing
20220100701 · 2022-03-31

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.

Methods and apparatus for multi-banked victim cache with dual datapath

Methods, apparatus, systems and articles of manufacture are disclosed for multi-banked victim cache with dual datapath. An example cache system includes a storage element that includes banks operable to store data, ports operable to receive memory operations in parallel, wherein each of the memory operations has a respective address, and a plurality of comparators coupled such that each of the comparators is coupled to a respective port of the ports and a respective bank of the banks and is operable to determine whether a respective address of a respective memory operation received by the respective port corresponds to the data stored in the respective bank.
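
The port-by-bank comparator arrangement can be sketched as a grid of membership tests. This is a software illustration under assumptions: each bank is modeled as a set of the addresses it currently holds, and `compare_ports_to_banks` is a hypothetical name. In hardware the comparisons would run in parallel rather than in a loop.

```python
def compare_ports_to_banks(port_addresses, bank_contents):
    """Return hits[p][b]: True when port p's address matches data held in bank b."""
    # One comparison per (port, bank) pair, mirroring one comparator per pair.
    return [[addr in bank for bank in bank_contents] for addr in port_addresses]


bank_contents = [{0x10, 0x40}, {0x24}]  # hypothetical addresses held by two banks
port_addresses = [0x24, 0x40]           # one memory operation per port
hits = compare_ports_to_banks(port_addresses, bank_contents)
```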

Apparatus and methods for combining vectors

Aspects for vector combination in a neural network are described herein. The aspects may include a direct memory access unit configured to receive a first vector, a second vector, and a controller vector. The first vector, the second vector, and the controller vector may each include one or more elements indexed in accordance with a same one-dimensional data structure. The aspects may further include a computation module configured to select one of the one or more control values, determine that the selected control value satisfies a predetermined condition, and select one of the one or more first elements that corresponds to the selected control value in the one-dimensional data structure as an output element based on a determination that the selected control value satisfies the predetermined condition.
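
The selection rule can be sketched element by element. Two points here are assumptions rather than statements from the abstract: the predetermined condition is taken to be "control value is nonzero", and the second vector's element is used when the condition is not satisfied. The function name `combine` is hypothetical.

```python
def combine(first, second, control, condition=lambda c: c != 0):
    """Pick first[i] where condition(control[i]) holds, otherwise second[i]."""
    if not (len(first) == len(second) == len(control)):
        raise ValueError("all three vectors must have the same length")
    # The three vectors share one index space, so zip pairs matching elements.
    return [a if condition(c) else b for a, b, c in zip(first, second, control)]


output = combine([1, 2, 3], [10, 20, 30], [1, 0, 1])
```

Passing a different `condition` callable models a different predetermined condition without changing the selection logic.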

Shared memory access for reconfigurable parallel processor using a plurality of memory ports each comprising an address calculation unit

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) each having a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads and a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit. Each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit.
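
A minimal sketch of per-thread address generation into a common area. The abstract does not give the address formula, so the base-plus-stride form, the parameter names, and the function name `shared_addresses` are assumptions for illustration.

```python
def shared_addresses(base, num_threads, stride=1):
    """Generate one address per thread inside a single common memory area."""
    # Assumed formula: thread t accesses base + t * stride.
    return [base + t * stride for t in range(num_threads)]


addresses = shared_addresses(base=0x100, num_threads=4, stride=4)
```

All generated addresses fall in the same region starting at `base`, which is what distinguishes this mode from per-thread private storage.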

Private memory access for reconfigurable parallel processor using a plurality of memory ports each comprising an address calculation unit

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) and a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit. Each PE may have a plurality of arithmetic logic units (ALUs) that are configured to execute a same instruction in parallel threads. Each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a different memory bank in the memory unit.
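
By contrast, private access sends each thread to its own memory bank. The mapping below (bank selected by thread index, with the same offset inside every bank) is an assumption, since the abstract does not state it, and `private_access` is a hypothetical name.

```python
def private_access(thread_id, offset, num_banks):
    """Return the (bank, offset) pair a thread would access."""
    # Assumed mapping: the thread index selects the bank.
    bank = thread_id % num_banks
    return bank, offset


targets = [private_access(t, offset=5, num_banks=4) for t in range(4)]
banks_used = {bank for bank, _ in targets}
```

With as many banks as threads, every thread lands in a different bank, so the accesses do not collide.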