Patent classifications
G06F12/0857
METHOD AND APPARATUS FOR DYNAMICALLY ADJUSTING PIPELINE DEPTH TO IMPROVE EXECUTION LATENCY
Apparatus and method for managing pipeline depth of a data processing device. For example, one embodiment of an apparatus comprises: an interface to receive a plurality of work requests from a plurality of clients; and a plurality of engines to perform the plurality of work requests; wherein the work requests are to be dispatched to the plurality of engines from a plurality of work queues, the work queues to store a work descriptor per work request, each work descriptor to include information needed to perform a corresponding work request, wherein the plurality of work queues include a first work queue to store work descriptors associated with first latency characteristics and a second work queue to store work descriptors associated with second latency characteristics; and engine configuration circuitry to configure a first engine to have a first pipeline depth based on the first latency characteristics and to configure a second engine to have a second pipeline depth based on the second latency characteristics.
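A minimal behavioral sketch of the dispatch scheme described above, in Python; the class names, queue labels, and depth values are illustrative assumptions, not taken from the patent:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkDescriptor:
    client_id: int
    payload: bytes
    latency_class: str   # "low" or "high" -- hypothetical queue labels

class Engine:
    def __init__(self, pipeline_depth: int):
        self.pipeline_depth = pipeline_depth
        self.in_flight = deque()

    def accept(self, desc: WorkDescriptor) -> bool:
        # An engine holds at most pipeline_depth descriptors in flight.
        if len(self.in_flight) >= self.pipeline_depth:
            return False
        self.in_flight.append(desc)
        return True

# Engine configuration step: a shallow pipeline for the latency-sensitive
# queue, a deep pipeline for the throughput queue (depths illustrative).
engines = {"low": Engine(pipeline_depth=2), "high": Engine(pipeline_depth=16)}
queues = {"low": deque(), "high": deque()}

def dispatch():
    # Descriptors from each work queue go to the engine configured for
    # that queue's latency characteristics.
    for cls, q in queues.items():
        while q and engines[cls].accept(q[0]):
            q.popleft()
```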
Memory Controller with Programmable Atomic Operations
A memory controller circuit is disclosed which is coupleable to a first memory circuit, such as DRAM, and includes: a first memory control circuit to read from or write to the first memory circuit; a second memory circuit, such as SRAM; a second memory control circuit adapted to read from the second memory circuit in response to a read request when the requested data is stored in the second memory circuit, and otherwise to transfer the read request to the first memory control circuit; predetermined atomic operations circuitry; and programmable atomic operations circuitry adapted to perform at least one programmable atomic operation. The second memory control circuit also transfers a received programmable atomic operation request to the programmable atomic operations circuitry and sets a hazard bit for a cache line of the second memory circuit.
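An illustrative Python model of the two-level control path: reads are served from the SRAM model on a hit and otherwise fall through to the DRAM path, and a programmable atomic runs a user-registered program while a hazard bit is held on the target cache line. Names such as `atomic_programs` and `fetch_add` are assumptions, not the patent's terms:

```python
class MemoryController:
    def __init__(self, line_size=64):
        self.dram = {}             # first memory circuit (DRAM model)
        self.sram = {}             # second memory circuit (SRAM model)
        self.hazard = set()        # hazard bits, keyed by cache-line index
        self.line_size = line_size
        self.atomic_programs = {}  # opcode -> callable, loaded at runtime

    def line_of(self, addr):
        return addr // self.line_size

    def read(self, addr):
        # Serve from SRAM when the data is cached; otherwise fall
        # through to the first (DRAM) memory control path.
        if addr in self.sram:
            return self.sram[addr]
        return self.dram.get(addr, 0)

    def programmable_atomic(self, opcode, addr, operand):
        # Set the hazard bit for the target cache line so other accesses
        # to that line stall until the atomic completes, then run the
        # registered atomic program.
        line = self.line_of(addr)
        self.hazard.add(line)
        try:
            old = self.read(addr)
            self.sram[addr] = self.atomic_programs[opcode](old, operand)
            return old
        finally:
            self.hazard.discard(line)

mc = MemoryController()
mc.atomic_programs["fetch_add"] = lambda old, v: old + v
mc.programmable_atomic("fetch_add", 0x1000, 5)
```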
Memory IC with data loopback
A memory controller component of a memory system stores memory access requests within a transaction queue until serviced so that, over time, the transaction queue alternates between occupied and empty states. The memory controller transitions the memory system to a low power mode in response to detecting that the transaction queue has remained in the empty state for a predetermined time. In the transition to the low power mode, the memory controller disables oscillation of one or more timing signals required to time data signaling operations within synchronous communication circuits of one or more attached memory devices and also disables one or more power consuming circuits within the synchronous communication circuits of the one or more memory devices.
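A rough Python sketch of the queue-monitoring transition: after a predetermined number of consecutive empty cycles, the timing signals clocking the devices' synchronous I/O are gated off. The threshold and counter names are hypothetical:

```python
class PowerManager:
    def __init__(self, threshold_cycles):
        self.threshold = threshold_cycles   # "predetermined time" in cycles
        self.empty_cycles = 0
        self.clocks_enabled = True

    def tick(self, queue_depth):
        # Count consecutive cycles the transaction queue has been empty;
        # past the threshold, disable the timing signals that clock the
        # memory devices' synchronous communication circuits.
        if queue_depth == 0:
            self.empty_cycles += 1
            if self.empty_cycles >= self.threshold and self.clocks_enabled:
                self.clocks_enabled = False   # enter low power mode
        else:
            self.empty_cycles = 0
            self.clocks_enabled = True        # exit low power mode on new work
```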
Methods and systems for a stripe mode cache pool
N-way associative cache pools can be implemented in an N-way associative cache. Different cache pools can be indicated by pool values. Different processes running on a computer can use different cache pools. An N-way associative cache circuit can be configured to have one or more stripe mode cache pools that are N-way associative. A cache control circuit can receive a physical address for a memory location and can interpret the physical address as fields including a tag field that contains a tag value and a set field that contains a set value. The physical address can also be used to determine a pool value that identifies one of the stripe mode cache pools. A set of N cache entries in the one of the stripe mode cache pools can be concurrently searched for the tag value. The set of N cache entries is determined using the set value.
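The address decoding can be sketched in Python as follows; the field widths, the derivation of the pool value, and the way count are illustrative choices, since the abstract does not fix them:

```python
from dataclasses import dataclass

SET_BITS, LINE_BITS, POOL_BITS, N_WAYS = 10, 6, 2, 4   # illustrative widths

@dataclass
class Entry:
    tag: int
    data: bytes

def decode(phys_addr):
    # Interpret the physical address as tag | set | offset fields; the
    # pool value is derived here (hypothetically) from the low tag bits.
    offset = phys_addr & ((1 << LINE_BITS) - 1)
    set_value = (phys_addr >> LINE_BITS) & ((1 << SET_BITS) - 1)
    tag_value = phys_addr >> (LINE_BITS + SET_BITS)
    pool_value = tag_value & ((1 << POOL_BITS) - 1)
    return tag_value, set_value, pool_value, offset

def lookup(pools, phys_addr):
    # All N ways of the selected set in the selected stripe-mode pool
    # are searched for the tag (concurrently, in the hardware modeled).
    tag, set_value, pool, _ = decode(phys_addr)
    for way in range(N_WAYS):
        entry = pools[pool][set_value][way]
        if entry is not None and entry.tag == tag:
            return entry
    return None

# pools[pool][set][way], initialised empty:
pools = [[[None] * N_WAYS for _ in range(1 << SET_BITS)]
         for _ in range(1 << POOL_BITS)]
```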
Selectively writing back dirty cache lines concurrently with processing
A graphics pipeline includes a cache having cache lines that are configured to store data used to process frames in a graphics pipeline. The graphics pipeline is implemented using a processor that processes frames for the graphics pipeline using data stored in the cache. The processor processes a first frame and writes back a dirty cache line from the cache to a memory concurrently with processing of the first frame. The dirty cache line is retained in the cache and marked as clean subsequent to being written back to the memory. In some cases, the processor generates a hint that indicates a priority for writing back the dirty cache line based on a read command occupancy at a system memory controller.
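A condensed Python sketch of the writeback-with-hint behavior: dirty lines are written back only while the memory controller's read-command occupancy is low, and each written line is retained and re-marked clean. The watermark value and helper names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    data: bytes
    dirty: bool = False

def writeback_concurrent(cache, memory, read_occupancy, high_watermark=8):
    # Hint: only push dirty lines out while the system memory controller's
    # read-command occupancy is low, so writebacks do not delay the reads
    # needed to process the current frame.
    if read_occupancy >= high_watermark:
        return
    for addr, line in cache.items():
        if line.dirty:
            memory[addr] = line.data   # write back to memory
            line.dirty = False         # retain the line, marked clean
```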
Processor with reduced interrupt latency
A processor with reduced interrupt latency is disclosed. An apparatus includes a processor core and a cache subsystem having a cache controller and a cache. The processor core is configured to submit, to the cache controller, requests for access to the cache, wherein a given request for access to the cache specifies whether the given request is abandonable or non-abandonable in an event of an interrupt request. In response to a particular interrupt request, the processor core may provide an indication to cause the cache controller to abandon requests for access to the cache identified as abandonable. After receiving an acknowledgement from the cache controller that the abandonable requests have been abandoned, the processor core may begin execution of an interrupt handler in order to service the interrupt request.
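A toy Python model of abandonable versus non-abandonable requests; the request representation and the `ACK` handshake are stand-ins for the hardware protocol:

```python
from collections import deque

class CacheController:
    def __init__(self):
        self.pending = deque()   # (request_id, abandonable) pairs

    def submit(self, request_id, abandonable):
        # Each access request declares whether it may be dropped if an
        # interrupt arrives before it is serviced.
        self.pending.append((request_id, abandonable))

    def abandon_abandonable(self):
        # On the core's indication, drop every request marked abandonable
        # and acknowledge once only non-abandonable requests remain.
        self.pending = deque(r for r in self.pending if not r[1])
        return "ACK"

ctrl = CacheController()
ctrl.submit(1, abandonable=True)    # e.g. a prefetch
ctrl.submit(2, abandonable=False)   # e.g. a demand store
assert ctrl.abandon_abandonable() == "ACK"
# The core may now enter the interrupt handler without waiting on request 1.
```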
ARITHMETIC PROCESSOR AND METHOD FOR OPERATING ARITHMETIC PROCESSOR
An arithmetic processor includes a plurality of core groups, each including a plurality of cores and a cache unit, and a plurality of home agents, each including a tag directory and a store command queue. The store command queue enters received store requests into its entry queue in order of reception, and the cache unit stores the data of each store request in a data RAM. The store command queue sets a data ownership acquisition flag of a store request to valid when it obtains data ownership for that request, and issues a top-of-queue notification to the cache unit when the flag of the top-of-queue entry is valid. In response to the top-of-queue notification, the cache unit updates the cache tag to the modified state and issues a store request completion notification.
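The ordering invariant, that the top-of-queue entry completes only once it holds data ownership, can be sketched in Python (class and method names are hypothetical):

```python
from collections import deque

class CacheUnit:
    def complete_store(self, req):
        # Update the cache tag to the modified state and issue the
        # store request completion notification.
        print(f"store {req!r} complete (tag -> Modified)")

class StoreCommandQueue:
    def __init__(self):
        self.entries = deque()   # FIFO: entries held in order of reception

    def enqueue(self, store_req):
        self.entries.append({"req": store_req, "ownership": False})

    def grant_ownership(self, store_req):
        # Set the data-ownership-acquisition flag valid for this request.
        for e in self.entries:
            if e["req"] == store_req:
                e["ownership"] = True

    def drain(self, cache_unit):
        # A top-of-queue notification fires only while the head entry's
        # flag is valid, so stores complete strictly in reception order.
        while self.entries and self.entries[0]["ownership"]:
            cache_unit.complete_store(self.entries.popleft()["req"])

q = StoreCommandQueue()
q.enqueue("store-A"); q.enqueue("store-B")
q.grant_ownership("store-B")
q.drain(CacheUnit())          # nothing completes: head lacks ownership
q.grant_ownership("store-A")
q.drain(CacheUnit())          # now both complete, A before B
```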
Techniques for increasing the isolation of workloads within a multiprocessor instance
In various embodiments, an isolation application determines processor assignment(s) based on a performance cost estimate. The performance cost estimate is associated with an estimated level of cache interference arising from executing a set of workloads on a set of processors. Subsequently, the isolation application configures at least one processor included in the set of processors to execute at least a portion of a first workload that is included in the set of workloads based on the processor assignment(s). Advantageously, because the isolation application generates the processor assignment(s) based on the performance cost estimate, the isolation application can reduce interference in a non-uniform memory access (NUMA) microprocessor instance.
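A deliberately simple Python sketch under an assumed cost model (one interference unit per pair of workloads placed on processors sharing a last-level cache) and a brute-force search; the patent's actual estimator and assignment procedure are not specified here:

```python
from itertools import permutations

def interference_cost(assignment, shares_llc):
    # Hypothetical performance cost estimate: charge one unit whenever
    # two distinct workloads land on processors sharing a last-level cache.
    cost = 0
    placements = list(assignment.items())
    for i in range(len(placements)):
        for j in range(i + 1, len(placements)):
            (_, p1), (_, p2) = placements[i], placements[j]
            if shares_llc(p1, p2):
                cost += 1
    return cost

def best_assignment(workloads, processors, shares_llc):
    # Exhaustive search over placements, keeping the assignment with the
    # lowest estimated cache-interference cost (illustrative only).
    best, best_cost = None, float("inf")
    for perm in permutations(processors, len(workloads)):
        a = dict(zip(workloads, perm))
        c = interference_cost(a, shares_llc)
        if c < best_cost:
            best, best_cost = a, c
    return best
```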
DYNAMICALLY COALESCING ATOMIC MEMORY OPERATIONS FOR MEMORY-LOCAL COMPUTING
Dynamically coalescing atomic memory operations for memory-local computing is disclosed. In an embodiment, it is determined whether a first atomic memory access and a second atomic memory access are candidates for coalescing. In response to a triggering event, the atomic memory accesses that are candidates for coalescing are coalesced in a cache prior to requesting memory-local processing by a memory-local compute unit. The atomic memory accesses may be coalesced in the same cache line or atomic memory accesses in different cache lines may be coalesced using a multicast memory-local processing command.
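An illustrative Python coalescer along the lines sketched above: on a triggering event, same-line candidates are merged into one memory-local command and cross-line candidates into one multicast command. The buffering policy and callback names are assumptions:

```python
from collections import defaultdict

LINE = 64   # cache line size in bytes (illustrative)

class AtomicCoalescer:
    def __init__(self):
        # Candidate atomics buffered per cache line, awaiting a trigger.
        self.pending = defaultdict(list)   # line index -> [(addr, op, val)]

    def record(self, addr, op, val):
        self.pending[addr // LINE].append((addr, op, val))

    def flush(self, send_local, send_multicast):
        # Triggering event: atomics coalesced in the same cache line go
        # out as one memory-local command; atomics spread across lines
        # share one multicast memory-local processing command.
        for ops in self.pending.values():
            if len(ops) > 1:
                send_local(ops)
        singles = [ops[0] for ops in self.pending.values() if len(ops) == 1]
        if len(singles) > 1:
            send_multicast(singles)
        elif singles:
            send_local(singles)
        self.pending.clear()
```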
CONCURRENT PAGE CACHE RESOURCE ACCESS IN A MULTI-PLANE MEMORY DEVICE
A memory device includes a first memory array, a second memory array, and a page cache circuit coupled to the first memory array and the second memory array. The page cache circuit includes at least one set of concurrent resources and at least one shared resource, wherein the at least one set of concurrent resources are asynchronously and concurrently accessible by the first memory array and the second memory array, and wherein the at least one shared resource is accessible in a time-multiplexed fashion by the first memory array and the second memory array.
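A minimal Python analogy for the resource split: per-plane concurrent buffers that need no arbitration versus one lock-serialized (time-multiplexed) shared buffer. Buffer sizes and names are illustrative:

```python
import threading

class PageCache:
    def __init__(self):
        # One concurrent resource per plane: each memory array can use
        # its own buffer asynchronously and concurrently.
        self.concurrent = {"plane0": bytearray(4096),
                           "plane1": bytearray(4096)}
        # A single shared resource, time-multiplexed between the planes.
        self._shared = bytearray(4096)
        self._shared_lock = threading.Lock()

    def write_concurrent(self, plane, data):
        # No arbitration needed: the planes never contend here.
        self.concurrent[plane][: len(data)] = data

    def use_shared(self, data):
        # Access to the shared resource is serialized (time-multiplexed).
        with self._shared_lock:
            self._shared[: len(data)] = data
```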