Patent classifications
G06F9/3834
MICROPROCESSOR THAT PREVENTS SAME ADDRESS LOAD-LOAD ORDERING VIOLATIONS
A microprocessor prevents same address load-load ordering violations. Each load queue entry holds a load physical memory line address (PMLA) and an indication of whether a load instruction has completed execution. The microprocessor fills a line specified by a fill PMLA into a cache entry and snoops the load queue with the fill PMLA, either before the fill or in an atomic manner with the fill with respect to ability of the filled entry to be hit upon by any load instruction, to determine whether the fill PMLA matches load PMLAs in load queue entries associated with load instructions that have completed execution and there are other load instructions in the load queue that have not completed execution. The microprocessor, if the condition is true, flushes at least the other load instructions in the load queue that have not completed execution.
STORE-TO-LOAD FORWARDING CORRECTNESS CHECKS AT STORE INSTRUCTION COMMIT
A microprocessor includes a load queue, a store queue, and a load/store unit that, during execution of a store instruction, records store information to a store queue entry. The store information comprises store address and store size information about store data to be stored by the store instruction. The load/store unit, during execution of a load instruction that is younger in program order than the store instruction, performs forwarding behavior with respect to forwarding or not forwarding the store data from the store instruction to the load instruction and records load information to a load queue entry, which comprises load address and load size information about load data to be loaded by the load instruction, and records the forwarding behavior in the load queue entry. The load/store unit, during commit of the store instruction, uses the recorded store information and the recorded load information and the recorded forwarding behavior to check correctness of the forwarding behavior.
Thwarting Store-to-Load Forwarding Side Channel Attacks by Pre-Forwarding Matching of Physical Address Proxies and/or Permission Checking
A method and system for mitigating against side channel attacks (SCA) that exploit speculative store-to-load forwarding is described. The method comprises ensuring that the physical load and store addresses match and/or that permissions are present before speculatively store-to-load forwarding. Various improvements maintain a short load-store pipeline, including usage of a virtual level-one data cache (DL1), usage of an inclusive physical level-two data cache (DL2), storage and lookup of physical data address equivalents in the DL1, and using a memory dependence predictor (MDP) to speed up or replace store queue camming of load data addresses against store data addresses.
System and Method for Implementing Strong Load Ordering in a Processor Using a Circular Ordering Ring
A system and corresponding method enforce strong load ordering in a processor. The system comprises an ordering ring that stores entries corresponding to in-flight memory instructions associated with a program order, scanning logic, and recovery logic. The scanning logic scans the ordering ring in response to execution or completion of a given load instruction of the in-flight memory instructions and detects an ordering violation in an event at least one entry of the entries indicates that a younger load instruction has completed and is associated with an invalidated cache line. In response to the ordering violation, the recovery logic allows the given load instruction to complete, flushes the younger load instruction, and restarts execution of the processor after the given load instruction in the program order, causing data returned by the given and younger load instructions to be returned consistent with execution according to the program order to satisfy strong load ordering.
APPARATUSES, METHODS, AND SYSTEMS TO PRECISELY MONITOR MEMORY STORE ACCESSES
Systems, methods, and apparatuses relating to circuitry to precisely monitor memory store accesses are described. In one embodiment, a system includes a memory, a hardware processor core comprising a decoder to decode an instruction into a decoded instruction, an execution circuit to execute the decoded instruction to produce a resultant, a store buffer, and a retirement circuit to retire the instruction when a store request for the resultant from the execution circuit is queued into the store buffer for storage into the memory, and a performance monitoring circuit to mark the retired instruction for monitoring of post-retirement performance information between being queued in the store buffer and being stored in the memory, enable a store fence after the retired instruction to be inserted that causes previous store requests to complete within the memory, and on detection of completion of the store request for the instruction in the memory, store the post-retirement performance information in storage of the performance monitoring circuit.
MEMORY TRANSACTION MANAGEMENT
A device includes a processor coupled to a memory. The processor is configured to assign distinct domain identifiers to each of multiple software threads. The processor is also configured to control operation of one or more components of the processor based on a number of memory transactions associated with a domain identifier.
A MULTI-PART COMPARE AND EXCHANGE OPERATION
A method for executing an atomic compare and exchange operation, the method may include processing a compare command and a conditional exchange command while considering hardware failures.
CACHE COHERENCE VALIDATION USING DELAYED FULFILLMENT OF L2 REQUESTS
Methods and systems for validating cache coherence in a data processing system are described. A processing element may detect a load instruction requesting the processing element to transfer data from a global memory location to a local memory location. The processing element may apply, in response to detecting the load instruction requesting the processing element to transfer data from the global memory location to the local memory location, a delay to the transfer of the data from the global memory location to the local memory location. The processing element may execute the load instruction and transferring the data from the global memory location to the local memory location with the applied delay. The processing element may validate, in response to executing the load instruction and transferring the data with the applied delay, a cache coherence of the data processing system.
Preserving memory ordering between offloaded instructions and non-offloaded instructions
Preserving memory ordering between offloaded instructions and non-offloaded instructions is disclosed. An offload instruction for an operation to be offloaded is processed and a lock is placed on a memory address associated with the offload instruction. In response to completing a cache operation targeting the memory address, the lock on the memory address is removed. For multithreaded applications, upon determining that a plurality of processor cores have each begun executing a sequence of offload instructions, the execution of non-offload instructions that are younger than any of the offload instructions is restricted. In response to determining that each processor core has completed executing its sequence of offload instructions, the restriction is removed. The remote device may be, for example, a processing-in-memory device or an accelerator coupled to a memory.
Systems, methods, and devices for queue availability monitoring
A method may include determining, with a queue availability module, that an entry is available in a queue, asserting a bit in a register based on determining that an entry is available in the queue, determining, with a processor, that the bit is asserted, and processing, with the processor, the entry in the queue based on determining that the bit is asserted. The method may further include storing the register in a tightly coupled memory associated with the processor. The method may further include storing the queue in the tightly coupled memory. The method may further include determining, with the queue availability module, that an entry is available in a second queue, and asserting a second bit in the register based on determining that an entry is available in the second queue. The method may further include finding the first bit in the register using a find first instruction.