G06F9/3834

PROCESSORS SUPPORTING ATOMIC WRITES TO MULTIWORD MEMORY LOCATIONS & METHODS

A system and method process atomic instructions. A processor system includes a load store unit (LSU), first and second registers, a memory interface, and a main memory. In response to a load link (LL) instruction, the LSU loads first data from memory into the first register and sets an LL bit (LLBIT) to indicate a sequence of atomic instructions is being executed. The LSU further loads second data from memory into the second register in response to a load (LD) instruction. The LSU places a value of the second register into the memory interface in response to a store conditional coupled (SCX) instruction. When the LLBIT is set and in response to a store (SC) instruction, the LSU places a value of the second register into the memory interface and commits the first and second register values in the memory interface into the main memory when the LLBIT is set.

Memory access management

A device includes a scoreboard and a processor. The scoreboard includes scoreboard entries configured to store information regarding one or more uncompleted memory access operations. The scoreboard also includes a dependency matrix configured to store dependency information corresponding to the scoreboard entries. The processor is configured to retrieve a first memory access instruction that indicates a first address range of a first memory access operation, and to add an indication of the first memory access instruction to a first scoreboard entry. The processor is further configured to, based on determining that the first address range at least partially overlaps a second address range associated with a second scoreboard entry that corresponds to a second memory access instruction, set an element of the dependency matrix to have a has-dependency value indicating a dependency of the first scoreboard entry on the second scoreboard entry.

Systems, apparatuses, and methods for data speculation execution

Systems, methods, and apparatuses for data speculation execution (DSX) are described. In some embodiments, a hardware apparatus for performing DSX comprises a hardware decoder to decode an instruction, the instruction to include an opcode and an operand to store a portion of a fallback address and an operand to store a stride value, execution hardware to execute the decoded instruction to initiate a data speculative execution (DSX) region by activating DSX tracking hardware to track speculative memory accesses and detect ordering violations in the DSX region, and storing the fallback address.

APPARATUSES, METHODS, AND SYSTEMS FOR NEURAL NETWORKS

Methods and apparatuses relating to processing neural networks are described. In one embodiment, an apparatus to process a neural network includes a plurality of fully connected layer chips coupled by an interconnect; a plurality of convolutional layer chips each coupled by an interconnect to a respective fully connected layer chip of the plurality of fully connected layer chips and each of the plurality of fully connected layer chips and the plurality of convolutional layer chips including an interconnect to couple each of a forward propagation compute intensive tile, a back propagation compute intensive tile, and a weight gradient compute intensive tile of a column of compute intensive tiles between a first memory intensive tile and a second memory intensive tile.

APPARATUS AND METHOD FOR NON-SERIALIZING SPLIT LOCKS
20170286115 · 2017-10-05 ·

An apparatus and method are described for performing split lock operations in a multi-core processor. For example, one embodiment of a processor comprises: a plurality of cores to execute instructions, each core comprising a core cache to cache data during instruction execution; a shared cache to be shared by two or more of the plurality of cores; a locking agent on a first core to initiate a split lock operation in response to detecting a transaction targeting at least two cache lines, the locking agent to transmit a request for the two cache lines to be set to an Exclusive state; at least one coherence enforcement engine to receive the request from the locking agent and to responsively cause any copies of the two cache lines in other cores to be invalidated; the locking agent to permit the transaction targeting the two cache lines to complete upon receipt of an indication that the cache lines are in the Exclusive state and, upon completion of the transaction, to transmit an indication that the transaction is complete to the coherence enforcement engine.

Thread waiting in a multithreaded processor architecture
09778949 · 2017-10-03 · ·

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for thread waiting. One of the methods includes starting, by a first thread on a processing core, a task by starting to execute a plurality of task instructions; initiating, by the first thread, an atomic memory transaction using a transactional memory system, including: specifying, to the transactional memory system, at least a first memory address for the atomic memory transaction and temporarily ceasing the task by not proceeding to execute the task instructions; receiving, by the first thread, a signal as a consequence of a second thread accessing the first memory address specified for the atomic memory transaction; and as a consequence of receiving the signal, resuming the task, by the first thread, and continuing to execute the task instructions.

Latent modification instruction for substituting functionality of instructions during transactional execution

An instruction stream includes a transactional code region. The transactional code region includes a latent modification instruction (LMI), a next sequential instruction (NSI) following the LMI, and a set of target instructions following the NSI in program order. Each target instruction has an associated function, and the LMI at least partially specifies a substitute function for the associated function. A processor executes the LMI, the NSI, and at least one of the target instructions, employing the substitute function at least partially specified by the LMI. The LMI, the NSI, and the target instructions may be executed by the processor in sequential program order or out of order.

Dynamic selection of OSC hazard avoidance mechanism

Methods, systems and computer program products for dynamically selecting an OSC hazard avoidance mechanism are provided. Aspects include receiving a load instruction that is associated with an operand store compare (OSC) prediction. The OSC prediction is stored in an entry of an OSC history table (OHT) and includes a multiple dependencies indicator (MDI). Responsive to determining the MDI is in a first state, aspects include applying a first OSC hazard avoidance mechanism in relation to the load instruction. Responsive to determining that the load instruction is dependent on more than one store instruction, aspects include placing the MDI in a second state. The MDI being in the second state provides an indication to apply a second OSC hazard avoidance mechanism in relation to the load instruction.

Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses
09733831 · 2017-08-15 · ·

In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.

Speculative execution of correlated memory access instruction methods, apparatuses and systems

A processor core, a processor, an apparatus, and an instruction processing method are disclosed. The processor core includes: an instruction fetch unit, where the instruction fetch unit includes a speculative execution predictor and the speculative execution predictor compares a program counter of a memory access instruction with a table entry stored in the speculative execution predictor and marks the memory access instruction; a scheduler unit adapted to adjust a send order of marked memory access instructions and send the marked memory access instructions according to the send order; an execution unit adapted to execute the memory access instructions according to the send order. In the instruction fetch unit, a memory access instruction is marked according to a speculative execution prediction result. In the scheduler unit, a send order of memory access instructions is determined according to the marked memory access instruction and the memory access instructions are sent. In the execution unit, the memory access instructions are executed according to the send order. This helps avoiding re-execution of a memory access instruction due to an address correlation of the memory access instruction. Consequently, this eliminates the need of adding an idle cycle in an instruction pipeline and the need of refreshing the pipeline to clear a memory access instruction that is incorrectly speculated.