G06F12/0844

Apparatus, systems, and methods for providing computational imaging pipeline

The present application relates generally to a parallel processing device. The parallel processing device can include a plurality of processing elements, a memory subsystem, and an interconnect system. The memory subsystem can include a plurality of memory slices, at least one of which is associated with one of the plurality of processing elements and comprises a plurality of random access memory (RAM) tiles, each tile having individual read and write ports. The interconnect system is configured to couple the plurality of processing elements and the memory subsystem. The interconnect system includes a local interconnect and a global interconnect.

Processor and arithmetic processing method

A processor includes request issuing units that issue an access request to a storage, a data array including banks that hold sub data divided from data read from the storage based on the access request, a switch to transfer the access request to one of the banks, and first and second determination units. The first determination unit determines a cache hit when a tag address included in the access address matches a tag address held therein in correspondence with an index address included in the access address. The second determination unit determines a cache hit when identification information corresponding to a first tag address included in the access address, together with a second tag address included in the access address, matches the identification information and second tag address held therein. A cache controller accesses the data array or the storage based on a determination result of the first or second determination unit.
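The two hit checks described in this abstract can be illustrated with a minimal sketch. The bit widths, the class and function names, the dictionary that maps a first tag address to identification information, and the choice to store the second unit's entries per index are all assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical address layout, not taken from the abstract.
OFFSET_BITS = 6       # assumed 64-byte lines
INDEX_BITS = 4        # assumed 16 sets
SECOND_TAG_BITS = 8   # assumed split of the tag into first and second parts


def split(address):
    """Split an access address into index, full tag, first tag, second tag."""
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    first_tag = tag >> SECOND_TAG_BITS
    second_tag = tag & ((1 << SECOND_TAG_BITS) - 1)
    return index, tag, first_tag, second_tag


class FirstDeterminationUnit:
    """Hit when the request tag equals the tag held for the request's index."""

    def __init__(self):
        self.tags = [None] * (1 << INDEX_BITS)

    def fill(self, address):
        index, tag, _, _ = split(address)
        self.tags[index] = tag

    def is_hit(self, address):
        index, tag, _, _ = split(address)
        return self.tags[index] == tag


class SecondDeterminationUnit:
    """Hit when (id of first tag, second tag) equals the held pair."""

    def __init__(self):
        self.ids = {}  # first tag -> identification information (assumed table)
        self.entries = [None] * (1 << INDEX_BITS)  # assumed per-index storage

    def fill(self, address):
        index, _, first_tag, second_tag = split(address)
        ident = self.ids.setdefault(first_tag, len(self.ids))
        self.entries[index] = (ident, second_tag)

    def is_hit(self, address):
        index, _, first_tag, second_tag = split(address)
        ident = self.ids.get(first_tag)
        return ident is not None and self.entries[index] == (ident, second_tag)
```

In this sketch the second unit stores a short identifier in place of the full first tag, which is one plausible reason to split the tag; whether the patent does so for that reason is not stated in the abstract.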

MEMORY ACCESS BOUNDS CHECKING FOR A PROGRAMMABLE ATOMIC OPERATOR
20220414004 · 2022-12-29

Devices and techniques for memory access bounds checking for a programmable atomic operator are described herein. A processor can execute a programmable atomic operator with a base memory address. The processor can obtain a memory interleave size indicator corresponding to the programmable atomic operator and calculate a contiguous memory address range from the base memory address and the memory interleave size. The processor can then check whether a memory request from the programmable atomic operator falls outside the contiguous memory address range, denying the request if it does and allowing it otherwise.
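As a rough illustration of the bounds check, the sketch below derives an address range from a base address and an interleave size indicator, then allows or denies a request. The encoding of the indicator (256 bytes shifted left by the indicator value), the choice to use the aligned interleave block that contains the base address, and all names are assumptions; the abstract does not define them.

```python
from dataclasses import dataclass

# Hypothetical encoding: indicator n selects an interleave of 256 << n bytes.
INTERLEAVE_BASE_BYTES = 256


@dataclass(frozen=True)
class AddressRange:
    start: int  # inclusive
    end: int    # exclusive

    def contains(self, address, length=1):
        """True when [address, address + length) lies fully inside the range."""
        return self.start <= address and address + length <= self.end


def interleave_size(indicator):
    """Decode the (assumed) interleave size indicator into a byte count."""
    return INTERLEAVE_BASE_BYTES << indicator


def contiguous_range(base_address, indicator):
    """Assumed rule: the aligned interleave block containing the base address."""
    size = interleave_size(indicator)
    start = base_address - (base_address % size)
    return AddressRange(start, start + size)


def check_request(base_address, indicator, request_address, length=1):
    """Return True to allow the memory request, False to deny it."""
    allowed = contiguous_range(base_address, indicator)
    return allowed.contains(request_address, length)
```

With a base address of `0x1040` and indicator `0`, this sketch allows accesses in `0x1000` through `0x10FF` and denies any request that starts or ends outside that block.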

TECHNIQUES FOR HANDLING CACHE COHERENCY TRAFFIC FOR CONTENDED SEMAPHORES

The techniques described herein improve cache traffic performance in the context of contended lock instructions. More specifically, each core maintains a lock address contention table that stores addresses corresponding to contended lock instructions. The lock address contention table also includes a state value that indicates progress through a series of states meant to track whether a load by the core in a spin-loop associated with semaphore acquisition has obtained the semaphore in an exclusive state. Upon detecting that a load in a spin-loop has obtained the semaphore in an exclusive state, the core responds to incoming requests for access to the semaphore with negative acknowledgments. This allows the core to maintain the semaphore cache line in an exclusive state, which allows it to acquire the semaphore faster and to avoid transmitting that cache line to other cores unnecessarily.
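The per-core table can be pictured as a small state machine keyed by lock address. The sketch below is only a guess at the shape of such a table: the three state names, the method names, and the `"ACK"`/`"NACK"` strings are hypothetical, since the abstract says only that a state value tracks progress toward a spin-loop load obtaining the semaphore exclusively.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states; the abstract does not enumerate them.
    CONTENDED = auto()        # address recorded as a contended lock
    SPIN_LOAD_SEEN = auto()   # spin-loop load observed, line not exclusive
    HELD_EXCLUSIVE = auto()   # spin-loop load obtained the line exclusively


class LockAddressContentionTable:
    """Per-core table of contended lock addresses and their tracking state."""

    def __init__(self):
        self.entries = {}

    def record_contended_lock(self, address):
        self.entries.setdefault(address, State.CONTENDED)

    def on_spin_load(self, address, exclusive):
        """Advance the state when a spin-loop load touches a tracked address."""
        if address not in self.entries:
            return
        self.entries[address] = (
            State.HELD_EXCLUSIVE if exclusive else State.SPIN_LOAD_SEEN
        )

    def on_probe(self, address):
        """Answer another core's request for the semaphore cache line."""
        if self.entries.get(address) is State.HELD_EXCLUSIVE:
            return "NACK"  # keep the line exclusive until the lock is taken
        return "ACK"

    def on_lock_acquired(self, address):
        """Stop refusing probes once the lock instruction has completed."""
        if address in self.entries:
            self.entries[address] = State.CONTENDED
```

A core would call `on_probe` for each incoming coherence request; only addresses in the `HELD_EXCLUSIVE` state are refused, so uncontended lines are shared normally.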

METHODS AND SYSTEMS FOR DISTRIBUTING MEMORY REQUESTS

A memory request, including an address, is accessed. The memory request also specifies a type of an operation (e.g., a read or write) associated with an instance (e.g., a block) of data. A group of caches is selected using a bit or bits in the address. A first hash of the address is performed to select a cache in the group. A second hash of the address is performed to select a set of cache lines in the cache. Unless the operation results in a cache miss, the memory request is processed at the selected cache. When there is a cache miss, a third hash of the address is performed to select a memory controller, and a fourth hash of the address is performed to select a bank group and a bank in memory.
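The routing steps in this abstract can be sketched as a chain of independent hashes over the same address. Everything concrete below is an assumption for illustration: the topology constants, the bit positions used to pick a cache group, the seeds, and the `mix` function, which is a generic 64-bit mixer standing in for whatever hash functions the patent actually uses.

```python
MASK64 = (1 << 64) - 1

# Hypothetical topology; none of these sizes come from the abstract.
GROUP_SHIFT = 6
NUM_GROUPS = 2
CACHES_PER_GROUP = 4
SETS_PER_CACHE = 1024
NUM_CONTROLLERS = 4
BANK_GROUPS = 4
BANKS_PER_GROUP = 4

# Distinct seeds so each of the four hashes behaves independently (assumed).
SEED_CACHE, SEED_SET, SEED_CONTROLLER, SEED_BANK = 0x9E37, 0x85EB, 0xC2B2, 0x27D4


def mix(address, seed):
    """Stand-in 64-bit mixing hash; not the patent's hash function."""
    x = (address ^ seed) & MASK64
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    return x


def route(address, cache_hit):
    """Pick the cache location and, on a miss, the memory location."""
    target = {
        "group": (address >> GROUP_SHIFT) % NUM_GROUPS,           # address bits
        "cache": mix(address, SEED_CACHE) % CACHES_PER_GROUP,     # first hash
        "set": mix(address, SEED_SET) % SETS_PER_CACHE,           # second hash
    }
    if not cache_hit:
        bank_hash = mix(address, SEED_BANK)                       # fourth hash
        target["controller"] = mix(address, SEED_CONTROLLER) % NUM_CONTROLLERS
        target["bank_group"] = bank_hash % BANK_GROUPS
        target["bank"] = (bank_hash // BANK_GROUPS) % BANKS_PER_GROUP
    return target
```

Calling `route(address, cache_hit=True)` returns only the cache-side fields, while `route(address, cache_hit=False)` also returns the controller, bank group, and bank chosen by the third and fourth hashes.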
