G06F9/30043

HANDLING OF SINGLE-COPY-ATOMIC LOAD/STORE INSTRUCTION
20230017802 · 2023-01-19 ·

In response to a single-copy-atomic load/store instruction for requesting an atomic transfer of a target block of data between the memory system and the registers, where the target block has a given size greater than a maximum data size supported for a single load/store micro-operation by a load/store data path, instruction decoding circuitry maps the single-copy-atomic load/store instruction to two or more mapped load/store micro-operations each for requesting transfer of a respective portion of the target block of data. In response to the mapped load/store micro-operations, load/store circuitry triggers issuing of a shared memory access request to the memory system to request the atomic transfer of the target block of data of said given size to or from the memory system, and triggers separate transfers of respective portions of the target block of data over the load/store data path.

INLINE DATA INSPECTION FOR WORKLOAD SIMPLIFICATION

A method, computer readable medium, and processor are described herein for inline data inspection by using a decoder to decode a load instruction, including a signal to cause a circuit in a processor to indicate whether data loaded by a load instruction exceeds a threshold value. Moreover, an indication of whether data loaded by a load instruction exceeds a threshold value may be stored.

Arithmetic processing device having multicore ring bus structure with turn-back bus for handling register file push/pull requests
11550576 · 2023-01-10 · ·

An arithmetic processing device includes arithmetic processing units, each having a calculator unit; a scheduler that controls a push instruction to write data to a register file in one of the arithmetic processing units and a pull instruction to read data from the register file; a pull request bus to which the scheduler outputs a pull request and which is connected to the arithmetic processing units; a push request bus to which the scheduler outputs a push request and which is connected to the arithmetic processing units; and a pull data bus that inputs, into the scheduler, pull data read from the register file in response to the pull request. Each of the arithmetic processing units includes a pull data turn-back bus that propagates pull data read from its register file to the pull data bus.

INFERRING FUTURE VALUE FOR SPECULATIVE BRANCH RESOLUTION IN A MICROPROCESSOR

A system, processor, programming product and/or method including: an instruction dispatch unit configured to dispatch instructions of a compare immediate-conditional branch instruction sequence; and a compare register having at least one entry to hold information in a plurality of fields. Operations include: writing information from a first instruction of the compare immediate-conditional branch instruction sequence into one or more of the plurality of fields in an entry in the compare register; writing an immediate field and the ITAG of a compare immediate instruction into the entry in the compare register; writing, in response to dispatching a conditional branch instruction, an inferred compare result value into the entry in the compare register; comparing a computed compare result value to the inferred compare result value stored in the entry in the compare register; and not execute the compare immediate instruction or the conditional branch instruction.

Co-scheduled loads in a data processing apparatus
11693665 · 2023-07-04 · ·

A data processing apparatus and method of operating such is disclosed. Issue circuitry buffers operations prior to execution until operands are available in a set of registers. A first and a second load operation are identified in the issue circuitry, when both are dependent on a common operand, and when the common operand is available in the set of registers. Load circuitry has a first address generation unit to generate a first address for the first load operation and a second address generation unit to generate a second address for the second load operation. An address comparison unit compares the first address and the second address. The load circuitry is arranged to cause a merged lookup to be performed in local temporary storage, when the address comparison unit determines that the first and the second address differ by less than a predetermined address range characteristic of the local temporary storage.

Victim cache that supports draining write-miss entries

A caching system including a first sub-cache and a second sub-cache in parallel with the first sub-cache, wherein the second sub-cache includes a set of cache lines, line type bits configured to store an indication that a corresponding cache line of the set of cache lines is configured to store write-miss data, and an eviction controller configured to flush stored write-miss data based on the line type bits.

Memory tagging apparatus and method

An apparatus and method for tagged memory management. For example, one embodiment of a processor comprises: execution circuitry to execute instructions and process data, at least one instruction to generate a system memory access request having a first address pointer; and address translation circuitry to determine whether to translate the first address pointer with or without metadata processing, wherein if the first address pointer is to be translated with metadata processing, the address translation circuitry to: perform a lookup in a memory metadata table to identify a memory metadata value, determine a pointer metadata value associated with the first address pointer, and compare the memory metadata value with the pointer metadata value, the comparison to generate a validation of the memory access request or a fault condition, wherein if the comparison results in a validation of the memory access request, then accessing a set of one or more address translation tables to translate the first address pointer to a first physical address and to return the first physical address responsive to the memory access request.

Routing network using global address map with adaptive main memory expansion for a plurality of home agents
11693805 · 2023-07-04 · ·

An adaptive memory expansion scheme is proposed, where one or more memory expansion capable Hosts or Accelerators can have their memory mapped to one or more memory expansion devices. The embodiments below describe discovery, configuration, and mapping schemes that allow independent SCM implementations and CPU-Host implementations to match their memory expansion capabilities. As a result, a memory expansion host (e.g., a memory controller in a CPU or an Accelerator) can declare multiple logical memory expansion pools, each with a unique capacity. These logical memory pools can be matched to physical memory in the SCM cards using windows in a global address map. These windows represent shared memory for the Home Agents (HAs) (e.g., the Host) and the Slave Agent (SAs) (e.g., the memory expansion device).

Compiler program, compiling method, information processing device

A compiler program causes a computer to execute optimization processing for an optimization target program. The optimization target program includes a loop including a vector store instruction and a vector load instruction for an array variable. The optimization processing includes (1) unrolling the vector store instruction and the vector load instruction in the loop by an unrolling number of times to generate a plurality of unrolled vector store instructions and a plurality of unrolled vector load instructions, and (2) scheduling to move an unrolled vector load instruction among the plurality of unrolled vector load instructions, which is located after a first unrolled vector store instruction that is located at first among the plurality of unrolled vector load instructions, before the first unrolled vector store instruction.

Graphics processing unit systems for performing data analytics operations in data science

Systems and methods are provided for efficiently performing processing intensive operations, such as those involving large volumes of data, that enable accelerated processing time of these operations. In at least one embodiment, a system includes a graphics processor unit (GPU) including a memory and a plurality of cores. The plurality of cores perform a plurality of data analytics operations on a respectively allocated portion of a dataset, each of the plurality of cores using only the memory to store data input for each of the plurality of data analytics operations performed by the plurality of cores. The data storage for the plurality of data analytics operations performed by the plurality of cores is also provided solely by the memory.