G06F9/30047

AGGRESSIVE WRITE FLUSH SCHEME FOR A VICTIM CACHE
20230004500 · 2023-01-05 ·

A caching system including a first sub-cache and a second sub-cache in parallel with the first sub-cache, wherein the second sub-cache includes: line type bits configured to store an indication that a corresponding cache line of the second sub-cache is configured to store write-miss data, and an eviction controller configured to evict a cache line of the second sub-cache storing write-miss data based on an indication that the cache line has been fully written.
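The eviction policy in this abstract can be sketched in a few lines: the victim sub-cache tags each line with a "write-miss" type bit and tracks which bytes have been written, and the eviction controller flushes a write-miss line the moment it is fully written. This is a minimal behavioral model, not the claimed hardware; the class names and the 16-byte line size are illustrative assumptions.

```python
# Hypothetical sketch of the aggressive write-flush policy: a write-miss
# line in the victim sub-cache becomes eligible for eviction as soon as
# every byte of the line has been written. Names/sizes are assumptions.

LINE_SIZE = 16

class VictimLine:
    def __init__(self):
        self.is_write_miss = False   # the line type bit from the abstract
        self.written = 0             # bitmask of bytes written so far
        self.data = bytearray(LINE_SIZE)

    def write(self, offset, value):
        self.data[offset] = value
        self.written |= 1 << offset

    def fully_written(self):
        return self.written == (1 << LINE_SIZE) - 1

class EvictionController:
    """Evicts write-miss lines once they are fully written."""
    def __init__(self):
        self.flushed = []            # stand-in for the next cache level

    def maybe_evict(self, addr, line):
        if line.is_write_miss and line.fully_written():
            self.flushed.append((addr, bytes(line.data)))
            return True
        return False

# Usage: the line is flushed exactly when its last byte arrives.
ctrl = EvictionController()
line = VictimLine()
line.is_write_miss = True
for off in range(LINE_SIZE):
    line.write(off, off)
    evicted = ctrl.maybe_evict(0x100, line)
```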

CAT AWARE LOADS AND SOFTWARE PREFETCHES
20230236973 · 2023-07-27 ·

In one embodiment, a method of selectively reserving portions of a last level cache (LLC) for a multi-core processor, the method comprising: allocating, by an executive system, plural classes of service to the portions of the LLC, wherein the portions comprise ways, and wherein each of the plural classes of service is allocated to one or more of the ways; assigning, by the executive system, one of the plural classes of service to an application as a default class of service, wherein the assignment controls which of the ways the application can allocate into; and overriding, by the application, the default class of service to enable allocation by the application to the one or more of the ways associated with a non-default class of service.
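The claimed flow maps naturally onto a small model: the executive system (e.g. an OS) binds classes of service to sets of LLC ways, assigns an application a default class, and the application can override that default to allocate into ways of a non-default class. The class names and the 8-way LLC below are assumptions for illustration, not the patent's parameters.

```python
# Toy model of class-of-service (CLOS) way partitioning in an LLC.
# "CLOS0"/"CLOS1" and the way counts are illustrative assumptions.

class LLCPartitioner:
    def __init__(self, num_ways=8):
        self.num_ways = num_ways
        self.clos_ways = {}          # class of service -> set of way indices
        self.app_clos = {}           # application -> currently assigned CLOS

    def allocate_clos(self, clos, ways):
        # executive system: allocate a class of service to one or more ways
        self.clos_ways[clos] = set(ways)

    def assign_default(self, app, clos):
        # executive system: assign the application its default CLOS
        self.app_clos[app] = clos

    def override_clos(self, app, clos):
        # application-initiated override of its default class of service
        self.app_clos[app] = clos

    def allowed_ways(self, app):
        # which ways the application may currently allocate into
        return self.clos_ways[self.app_clos[app]]

llc = LLCPartitioner()
llc.allocate_clos("CLOS0", ways=[0, 1, 2, 3])   # default/shared ways
llc.allocate_clos("CLOS1", ways=[4, 5, 6, 7])   # reserved ways
llc.assign_default("app", "CLOS0")
before = llc.allowed_ways("app")
llc.override_clos("app", "CLOS1")               # app opts into reserved ways
after = llc.allowed_ways("app")
```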

COMPUTATIONAL MEMORY WITH COOPERATION AMONG ROWS OF PROCESSING ELEMENTS AND MEMORY THEREOF
20230004522 · 2023-01-05 ·

A computing device includes an array of processing elements mutually connected to perform single instruction multiple data (SIMD) operations, memory cells connected to each processing element to store data related to the SIMD operations, and a cache connected to each processing element to cache data related to the SIMD operations. Caches of adjacent processing elements are connected. The same or another computing device includes rows of mutually connected processing elements to share data. The computing device further includes a row arithmetic logic unit (ALU) at each row of processing elements. The row ALU of a respective row is configured to perform an operation with processing elements of the respective row.
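The row-ALU idea above can be illustrated with a toy array: each processing element holds local data, a SIMD step applies one instruction to every PE of every row, and a per-row ALU performs an operation across the PEs of its row. The array shape and the sum reduction are assumptions; the patent does not specify this operation set.

```python
# Toy model: rows of PEs with per-PE memory/cache and a row ALU that
# operates across the PEs of its row. Shapes/operations are assumptions.

class ProcessingElement:
    def __init__(self, value=0):
        self.memory = [value]        # memory cells attached to the PE
        self.cache = {}              # per-PE cache (neighbor links omitted)

class Row:
    def __init__(self, values):
        self.pes = [ProcessingElement(v) for v in values]

    def row_alu(self, op):
        # the row ALU performs an operation with all PEs of its row
        acc = self.pes[0].memory[0]
        for pe in self.pes[1:]:
            acc = op(acc, pe.memory[0])
        return acc

def simd_scale(rows, factor):
    # SIMD step: one instruction applied to every PE in every row
    for row in rows:
        for pe in row.pes:
            pe.memory[0] *= factor

rows = [Row([1, 2, 3, 4]), Row([5, 6, 7, 8])]
simd_scale(rows, 2)
sums = [row.row_alu(lambda a, b: a + b) for row in rows]
```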

COOPERATIVE GARBAGE COLLECTION BARRIER ELISION

Techniques are disclosed for eliding load and store barriers while maintaining garbage collection invariants. Embodiments described herein include techniques for identifying an instruction, such as a safepoint poll, that checks whether to pause a thread between execution of a dominant and dominated access to the same data field. If a poll instruction is identified between the two data accesses, then a pointer for the data field may be recorded in an entry associated with the poll instruction. When the thread is paused to execute a garbage collection operation, the recorded information may be used to update values associated with the data field in memory such that the dominated access may be executed without any load or store barriers.
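The elision condition described above can be sketched schematically: when a safepoint poll sits between two accesses to the same field, the field pointer is recorded in an entry tied to that poll; if GC runs at the poll, it uses the entry to fix up the recorded pointer, and the dominated access then executes as a plain load with no barrier. The forwarding-table relocation model below is an assumption for illustration.

```python
# Schematic of cooperative barrier elision: pointers recorded per poll
# site are fixed up by GC at the poll, so the dominated access needs no
# load/store barrier. The forwarding table is an illustrative stand-in
# for a moving collector's relocation metadata.

poll_entries = {}        # poll site -> recorded field pointers
forwarding = {}          # old address -> new address after GC moves objects

def record_for_poll(poll_id, ptr):
    # compiler identified a poll between dominant and dominated access
    poll_entries.setdefault(poll_id, []).append(ptr)

def gc_at_poll(poll_id):
    # GC updates the recorded pointers so the dominated access can run
    # barrier-free afterwards
    fixed = [forwarding.get(p, p) for p in poll_entries.get(poll_id, [])]
    poll_entries[poll_id] = fixed
    return fixed

heap = {0x10: "obj@old", 0x90: "obj@new"}
record_for_poll("poll#1", 0x10)      # dominant access records the pointer
forwarding[0x10] = 0x90              # GC relocates the object
fixed = gc_at_poll("poll#1")
value = heap[fixed[0]]               # dominated access: plain load, no barrier
```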

Write cache circuit, data write method, and memory
11714645 · 2023-08-01 ·

The present disclosure provides a write cache circuit, a data write method, and a memory. The write cache circuit includes: a control circuit configured to generate, on the basis of a mask write instruction, a first write pointer and a pointer to be positioned, generate a second write pointer on the basis of a write command, generate a first output pointer on the basis of a mask write shift instruction, and generate a second output pointer on the basis of a write shift instruction; a first cache circuit configured to cache, on the basis of the first write pointer, the pointer to be positioned and output a positioned pointer on the basis of the first output pointer, the positioned pointer being configured to instruct a second cache circuit to output a write address written by the second write pointer generated according to the mask write instruction.
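One way to read the pointer scheme above: the second cache circuit is an address FIFO indexed by the second write pointer; for each mask write, the first cache circuit records which address-FIFO slot that write occupies (the "pointer to be positioned"); and on a mask write shift, the positioned pointer it outputs selects which buffered write address to emit. The behavioral sketch below makes that concrete; FIFO depths, signal names, and the exact sequencing are all assumptions, not the claimed circuit.

```python
# Behavioral sketch (assumed interpretation) of the dual-pointer write
# cache: an address FIFO plus a mask-write pointer FIFO whose entries
# locate mask writes inside the address FIFO.

DEPTH = 8

addr_fifo = [None] * DEPTH       # second cache circuit: buffered addresses
mask_ptr_fifo = [None] * DEPTH   # first cache circuit: pointers to position
wp2 = 0                          # second write pointer (per write command)
wp1 = 0                          # first write pointer (per mask write)
op1 = 0                          # first output pointer (mask write shift)

def write_command(addr, is_mask_write):
    global wp1, wp2
    addr_fifo[wp2 % DEPTH] = addr
    if is_mask_write:
        # record which address slot this mask write landed in
        mask_ptr_fifo[wp1 % DEPTH] = wp2 % DEPTH
        wp1 += 1
    wp2 += 1

def mask_write_shift():
    global op1
    positioned = mask_ptr_fifo[op1 % DEPTH]   # the positioned pointer
    op1 += 1
    return addr_fifo[positioned]   # address of the buffered mask write

write_command(0xA0, is_mask_write=False)
write_command(0xB4, is_mask_write=True)
write_command(0xC8, is_mask_write=True)
first_mask_addr = mask_write_shift()
second_mask_addr = mask_write_shift()
```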

Streaming engine with early and late address and loop count registers to track architectural state

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements. A stream head register stores data elements next to be supplied to functional units for use as operands. For each of the nested loops, the streaming engine stores an early address of the next data elements to be fetched and a late address of a data element in the stream head register. Likewise, for each of the nested loops, the streaming engine stores early loop counts of the next data elements to be fetched and late loop counts of a data element in the stream head register.
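A minimal address generator makes the early/late distinction concrete: loop counts and per-level strides define the stream, the "early" state tracks the next element to fetch, and the "late" state tracks the element currently at the stream head. Two loop levels and the strides below are illustrative assumptions.

```python
# Nested-loop stream address generation with early (next fetch) and
# late (stream head) architectural state. Parameters are assumptions.

class StreamingEngine:
    def __init__(self, base, counts, strides):
        self.base = base
        self.counts = counts             # iterations per loop, inner first
        self.strides = strides           # byte stride per loop level
        self.early = [0] * len(counts)   # loop counts of next fetch
        self.late = None                 # loop counts at the stream head

    def next_address(self):
        addr = self.base + sum(i * s for i, s in zip(self.early, self.strides))
        self.late = list(self.early)     # this element reaches the head
        # advance early loop counts, innermost loop first
        for lvl in range(len(self.counts)):
            self.early[lvl] += 1
            if self.early[lvl] < self.counts[lvl]:
                break
            self.early[lvl] = 0
        return addr

# 4-element inner loop (stride 4 bytes), 2-iteration outer loop (stride 64)
se = StreamingEngine(base=0x1000, counts=[4, 2], strides=[4, 64])
addrs = [se.next_address() for _ in range(8)]
```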

COHERENCE-BASED DYNAMIC CODE REWRITING, TRACING AND CODE COVERAGE
20230028825 · 2023-01-26 ·

A device tracks accesses to pages of code executed by processors and modifies a portion of the code without terminating the execution of the code. The device is connected to the processors via a coherence interconnect and a local memory of the device stores the code pages. As a result, any requests to access cache lines of the code pages made by the processors will be placed on the coherence interconnect, and the device is able to track any cache-line accesses of the code pages by monitoring the coherence interconnect. In response to a request to read a cache line having a particular address, a modified code portion is returned in place of the code portion stored in the code pages.
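The interposition can be sketched in miniature: because all code-page reads travel the coherence interconnect, the device can both count accesses per cache line and answer a read with a patched code portion instead of the stored one, without stopping execution. The dict-based "interconnect" below is purely illustrative.

```python
# Schematic of coherence-based code tracking and live rewriting: reads of
# code-page cache lines are counted, and patched lines are substituted on
# read. The storage model is an assumption for illustration.

CACHE_LINE = 64

class CodePageDevice:
    def __init__(self, code_pages):
        self.code_pages = code_pages     # line address -> code bytes
        self.access_counts = {}          # tracked cache-line accesses
        self.patches = {}                # line address -> modified code

    def patch(self, line_addr, new_code):
        # modify a portion of the code without terminating its execution
        self.patches[line_addr] = new_code

    def read_line(self, addr):
        # a processor's read request observed on the coherence interconnect
        line = addr - (addr % CACHE_LINE)
        self.access_counts[line] = self.access_counts.get(line, 0) + 1
        return self.patches.get(line, self.code_pages[line])

dev = CodePageDevice({0x0: b"orig", 0x40: b"hot-loop"})
dev.read_line(0x40)                     # first access, original code
dev.patch(0x40, b"traced-hot-loop")     # e.g. insert tracing instrumentation
patched = dev.read_line(0x44)           # same cache line, modified code returned
```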

ZERO-COPY SPARSE MATRIX FACTORIZATION SYNTHESIS FOR HETEROGENEOUS COMPUTE SYSTEMS
20230024035 · 2023-01-26 ·

A system, method, and computer-readable medium for synthesizing zero-copy sparse matrix factorization operations in heterogeneous compute systems are provided. The system includes a host device and an accelerator device. The host device is configured to divide an input matrix into a plurality of blocks, which are transferred to a memory of the accelerator device. The host device is also configured to generate at least one index buffer that includes pointers to the blocks in the accelerator's memory, where each index buffer represents a frontal matrix associated with a matrix decomposition algorithm. The host device is further configured to receive one or more kernels configured to process the index buffer(s) on the accelerator device. The index buffers are processed by the accelerator device, and the modified block data is written back to a memory of the host device to generate a factorized output matrix.
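The host-side setup can be illustrated simply: the input matrix is split into blocks, the blocks are "transferred" to accelerator memory, and each frontal matrix becomes an index buffer of pointers (modeled here as indices) into that memory, so a kernel touches blocks in place without copying them. The block size, kernel, and frontal-matrix grouping below are assumptions, not the patent's decomposition algorithm.

```python
# Illustrative zero-copy setup: device-resident blocks plus index buffers
# of pointers into device memory, processed in place by a kernel.

def split_into_blocks(matrix, block):
    blocks = []
    for r in range(0, len(matrix), block):
        for c in range(0, len(matrix[0]), block):
            blocks.append([row[c:c + block] for row in matrix[r:r + block]])
    return blocks

matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]

accel_memory = split_into_blocks(matrix, block=2)   # "transferred" blocks

# one index buffer per frontal matrix: pointers into accelerator memory
index_buffers = [[0, 1], [2, 3]]

def run_kernel(index_buffer):
    # stand-in kernel: scales the referenced blocks in place (zero-copy);
    # a real kernel would perform a factorization step on the frontal matrix
    for ptr in index_buffer:
        accel_memory[ptr] = [[2 * v for v in row] for row in accel_memory[ptr]]

for buf in index_buffers:
    run_kernel(buf)
```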

Moving entries between multiple levels of a branch predictor based on a performance loss resulting from fewer than a pre-set number of instructions being stored in an instruction cache register

An instruction processing device and an instruction processing method are provided. The instruction processing device includes: a first-level branch target buffer, configured to store entries of a first plurality of branch instructions; a second-level branch target buffer, configured to store entries of a second plurality of branch instructions, wherein the entries in the first-level branch target buffer are accessed faster than the entries in the second-level branch target buffer; an instruction fetch unit coupled to the first-level branch target buffer and the second-level branch target buffer, the instruction fetch unit including circuitry configured to add, for a first branch instruction, one or more entries corresponding to the first branch instruction into the first-level branch target buffer when the one or more entries corresponding to the first branch instruction are identified in the second-level branch target buffer; and an execution unit including circuitry configured to execute the first branch instruction.
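The promotion path can be modeled in a few lines: a lookup tries the fast first-level BTB, falls back to the slower second level, and on a second-level hit the matching entry is added into the first level so later fetches of that branch resolve quickly. Capacities and the FIFO replacement below are assumptions for illustration.

```python
# Simplified two-level branch target buffer with promotion of entries
# from L2 to L1 on an L2 hit. Sizes/replacement are assumptions.

from collections import OrderedDict

class TwoLevelBTB:
    def __init__(self, l1_capacity=4):
        self.l1 = OrderedDict()      # fast first-level BTB
        self.l2 = {}                 # larger, slower second-level BTB
        self.l1_capacity = l1_capacity

    def lookup(self, pc):
        if pc in self.l1:
            return self.l1[pc], "L1"
        if pc in self.l2:
            self.promote(pc, self.l2[pc])   # add entry into L1 on L2 hit
            return self.l2[pc], "L2"
        return None, "miss"

    def promote(self, pc, target):
        if len(self.l1) >= self.l1_capacity:
            self.l1.popitem(last=False)     # evict the oldest L1 entry
        self.l1[pc] = target

btb = TwoLevelBTB()
btb.l2[0x400] = 0x800                # entry known only to the second level
target1, level1 = btb.lookup(0x400)  # L2 hit triggers promotion
target2, level2 = btb.lookup(0x400)  # now serviced by the fast L1
```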

LOCK FREE HIGH THROUGHPUT RESOURCE STREAMING
20230019646 · 2023-01-19 ·

Methods, systems and apparatuses may provide for technology that conducts, via a plurality of concurrent threads, transfers of graphics resources into and out of graphics memory, wherein the transfers bypass lock operations between the plurality of concurrent threads, generates frames based on the graphics resources in the graphics memory, and streams the frames to a display. In one example, the transfers also bypass explicit wait operations for the graphics resources to be fully resident in the graphics memory.
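The lock-bypassing transfer path can be sketched with a single-producer/single-consumer ring buffer: each side owns exactly one index, so no mutex is taken (and no explicit wait is issued) on the transfer path. Real GPU residency management is far more involved; this is only a shape sketch, and the resource names are invented for the example.

```python
# Toy SPSC ring buffer: producer and consumer each write only their own
# index, so transfers bypass lock operations between the two threads.

import threading

class SPSCRing:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0    # advanced only by the consumer
        self.tail = 0    # advanced only by the producer

    def push(self, item):             # producer side, no lock taken
        if self.tail - self.head >= self.capacity:
            return False              # full: caller retries, never blocks
        self.buf[self.tail % self.capacity] = item
        self.tail += 1
        return True

    def pop(self):                    # consumer side, no lock taken
        if self.head == self.tail:
            return None               # empty: no explicit wait operation
        item = self.buf[self.head % self.capacity]
        self.head += 1
        return item

ring = SPSCRing(capacity=64)
frames = []                           # stand-in for generated frames

def producer():
    for i in range(100):
        while not ring.push(("texture", i)):   # hypothetical resource
            pass

def consumer():
    got = 0
    while got < 100:
        item = ring.pop()
        if item is not None:
            frames.append(item)
            got += 1

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```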