Patent classifications
G06F2212/452
Hierarchical general register file (GRF) for execution block
In an example, an apparatus comprises a plurality of execution units, and a first general register file (GRF) communicatively couple to the plurality of execution units, wherein the first GRF is shared by the plurality of execution units. Other embodiments are also disclosed and claimed.
Sharing instruction cache footprint between multiple threads
Aspects are provided for sharing instruction cache footprint between multiple threads. A set/way pointer to an instruction cache line is derived from a system memory address associated with an instruction fetch from a memory page. It is determined that the instruction cache line is shareable between a first thread and a second thread. An alias table entry is created indicating that other instruction cache lines associated with the memory page are also shareable between threads. Another instruction fetch is received from another thread requesting an instruction from another system memory address associated with the memory page. A further set/way pointer to another instruction cache line is derived from the other system memory address. It is determined that the other instruction cache line is shareable based on the alias table entry.
STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently enabling trade off between the number of loops supported and the size of the loop counts and loop dimensions.
Throttling Schemes in Multicore Microprocessors
An electronic device includes a cache, a processing cluster having one or more processors, and prefetch throttling circuitry that determines a congestion level of the processing cluster based on an extent to which the data retrieval requests sent from the processors to the cache are not satisfied by the cache. Congestion criteria require that the congestion level of the cluster is above a cluster congestion threshold. In accordance with a determination that the congestion level of the cluster satisfies the congestion criteria, the prefetch throttling circuit causes one of the processors to limit prefetch requests to the cache to prefetch requests of at least a threshold quality. In accordance with a determination that the congestion level of the cluster does not satisfy the congestion criteria, the prefetch throttling circuit forgoes causing the processors to limit prefetch requests to the cache to prefetch requests of at least the threshold quality.
Instruction caching scheme for memory devices
Methods, systems, and devices for an enhanced instruction caching scheme are described. A memory controller may include a first closely-coupled memory component that is associated with storing data and control information and a second closely-coupled memory component that is associated with storing control information. The memory controller may be configured to retrieve data from the first memory closely-coupled component and control information from a second closely-coupled memory component. Control information may be stored in the first closely-coupled memory component, and a memory controller may access the control information stored in the first closely-coupled memory component by transferring, from the first closely-coupled memory component, the control information into the second closely-coupled memory component. After transferring the control information into the second closely-coupled memory component, the memory controller may access the control information from the second closely-coupled memory component.
Storage array invalidation maintenance
Techniques are disclosed relating to managing storage array invalidations. A computer system may comprise a processor core configured to operate in an idle state and operate in a run state in which the processor core executes instructions. The computer system may further comprise a power management circuit that is configured to receive, while the processor core is in the idle state, a set of invalidation requests directed to the processor core to invalidate a set of entries of a storage array of the processor core. The power management circuit may store invalidation information indicative of the set of invalidation requests. The power management circuit may determine that the processor core has received a request to transition to the run state. Prior to the processor core operating in the run state, the power management circuit may invalidate the set of entries of the storage array based on the invalidation information.
Link stack based instruction prefetch augmentation
A computer-implemented method of performing a link stack based prefetch augmentation using a sequential prefetching includes observing a call instruction in a program being executed, and pushing a return address onto a link stack for processing the next instruction. A stream of instructions is prefetched starting from a cached line address of the next instruction and is stored in an instruction cache.
Spatial and temporal merging of remote atomic operations
Disclosed embodiments relate to spatial and temporal merging of remote atomic operations. In one example, a system includes an RAO instruction queue stored in a memory and having entries grouped by destination cache line, each entry to enqueue an RAO instruction including an opcode, a destination identifier, and source data, optimization circuitry to receive an incoming RAO instruction, scan the RAO instruction queue to detect a matching enqueued RAO instruction identifying a same destination cache line as the incoming RAO instruction, the optimization circuitry further to, responsive to no matching enqueued RAO instruction being detected, enqueue the incoming RAO instruction; and, responsive to a matching enqueued RAO instruction being detected, determine whether the incoming and matching RAO instructions have a same opcode to non-overlapping cache line elements, and, if so, spatially combine the incoming and matching RAO instructions by enqueuing both RAO instructions in a same group of cache line queue entries at different offsets.
Processor device for executing SIMD instructions
In a processor device according to the present invention, a memory access unit reads data to be processed from an external memory and writes the data to a first register group that a plurality of processors does not access among a plurality of register groups. A control unit sequentially makes each of the plurality of processors implement a same instruction, in parallel with changing an address of a register group that stores the data to be processed. A scheduler, based on specified scenario information, specifies an instruction to be implemented and a register group to be accessed for the plurality of processors, and specifies a register group to be written to among the plurality of register groups and data to be processed that is to be written for the memory access unit.
Hardware compression and decompression engine
A method and system for compressing and decompressing data is disclosed. A compression command may initiate the prefetching of first data, which may be stored in a first buffer. Multiple words of the first data may be read from the first buffer and used to generate a plurality of compressed packets, each of which includes a command specifying a type of packet. The compressed packets may be combined into a group and multiple groups may be combined and stored in a second buffer. A decompression command may initiate the prefetching of second data, which is stored in the first buffer. A portion of the second data may be read from the first buffer and used to generate a group of compressed packets. Multiple output words may be generated dependent upon the group of compressed packets.