G06F9/312

Microprocessor with ALU integrated into store unit

A superscalar pipelined microprocessor includes a register set defined by an instruction set architecture of the microprocessor, execution units, and a store unit, coupled to the cache memory and distinct from the other execution units of the microprocessor. The store unit comprises an ALU. The store unit receives an instruction that specifies a source register of the register set and an operation to be performed on a source operand to generate a result. The store unit reads the source operand from the source register. The ALU performs the operation on the source operand to generate the result, rather than forwarding the source operand to any of the other execution units of the microprocessor to perform the operation on the source operand to generate the result. The store unit operatively writes the result to the cache memory.

Method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
09928121 · 2018-03-27 · ·

A method for forwarding data from the store instructions to a corresponding load instruction in an out of order processor. The method includes accessing an incoming sequence of instructions; reordering the instructions in accordance with processor resources for dispatch and execution; ensuring a closest earlier store in machine order for to a corresponding load, by determining if said store has an actual age but said corresponding load does not have an actual age, then said store is earlier than said corresponding load; if said corresponding load has an actual age but said store does not have an actual age, then said corresponding load is earlier than said store; if neither said corresponding load or said store have an actual age, then a virtual identifier table is used to determine which is earlier; and if both said corresponding load and said store have actual ages, then the actual ages are used to determine which is earlier.

Instruction and logic for characterization of data access

A processor includes a front end to receive an instruction, a decoder to decode the instruction, a core to execute the first instruction, and a retirement unit to retire the first instruction. The core includes logic to execute the first instruction, including logic to repeatedly record a translation lookaside buffer (TLB) until a designated number of records are determined, and flush the TLB after a flush interval.

Virtual load store queue having a dynamic dispatch window with a distributed structure
09904552 · 2018-02-27 · ·

An out of order processor. The processor includes a distributed load queue and a distributed store queue that maintain single program sequential semantics while allowing an out of order dispatch of loads and stores across a plurality of cores and memory fragments; wherein the processor allocates other instructions besides loads and stores beyond the actual physical size limitation of the load/store queue; and wherein the other instructions can be dispatched and executed even though intervening loads or stores do not have spaces in the load store queue.

Apparatus and method for transferring a plurality of data structures between memory and a plurality of vector registers

An apparatus and method are provided for transferring a plurality of data structures between memory and a plurality of vector registers, each vector register being arranged to store a vector operand comprising a plurality of data elements. Access circuitry is used to perform access operations to move data elements of vector operands between the data structures in memory and specified vector registers, each data structure comprising multiple data elements stored at contiguous addresses in the memory. Decode circuitry is responsive to a single access instruction identifying a plurality of vector registers and a plurality of data structures that are located discontiguously with respect to each other in the memory, to generate control signals to control the access circuitry to perform a sequence of access operations to move the plurality of data structures between the memory and the plurality of vector registers such that the vector operand in each vector register holds a corresponding data element from each of the plurality of data structures. This provides a very efficient mechanism for performing complex access operations, resulting in an increase in execution speed, and potential reductions in power consumption.

Non-default instruction handling within transaction

Embodiments relate to non-default instruction handling within a transaction. An aspect includes entering a transaction, the transaction comprising a first plurality of instructions and a second plurality of instructions, wherein a default manner of handling of instructions in the transaction is one of atomic and non-atomic. Another aspect includes encountering a non-default specification instruction in the transaction, wherein the non-default specification instruction comprises a single instruction that specifies the second plurality of instructions of the transaction for handling in a non-default manner comprising one of atomic and non-atomic, wherein the non-default manner is different from the default manner. Another aspect includes handling the first plurality of instructions in the default manner. Yet another aspect includes handling the second plurality of instructions in the non-default manner.

Non-serialized push instruction for pushing a message payload from a sending thread to a receiving thread

In at least some embodiments, a processor core executes a sending thread including a first push instruction and a second push instruction subsequent to the first push instruction in a program order. Each of the first and second push instructions requests that a respective message payload be pushed to a mailbox of a receiving thread. In response to executing the first and second push instructions, the processor core transmits respective first and second co-processor requests to a switch in the data processing system via an interconnect fabric of the data processing system. The processor core transmits the second co-processor request to the switch without regard to acceptance of the first co-processor request by the switch.

Method and apparatus for speculative vectorization

An apparatus and method for speculative vectorization. For example, one embodiment of a processor comprises: a queue comprising a set of locations for storing addresses associated with vectorized memory access instructions; and execution logic to execute a first vectorized memory access instruction to access the queue and to compare a new address associated with the first vectorized memory access instruction with existing addresses stored within a specified range of locations within the queue to detect whether a conflict exists, the existing addresses having been previously stored responsive to one or more prior vectorized memory access instructions.

Coprocessor for out-of-order loads

Systems and methods for implementing certain load instructions, such as vector load instructions by cooperation of a main processor and a coprocessor. The load instructions which are identified by the main processor for offloading to the coprocessor are committed in the main processor without receiving corresponding load data. Post-commit, the load instructions are processed in the coprocessor, such that latencies incurred in fetching the load data are hidden from the main processor. By implementing an out-of-order load data buffer associated with an in-order instruction buffer, the coprocessor is also configured to avoid stalls due to long latencies which may be involved in fetching the load data from levels of memory hierarchy, such as L2, L3, L4 caches, main memory, etc.

Instruction and logic to provide vector loads with strides and masking functionality

Instructions and logic provide vector loads and/or stores with stride and mask functionality. Some embodiments, responsive to an instruction specifying: a set of loads, destination register, mask register, memory address, and stride length; execution units read values in the mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been loaded. For each having the first value, the corresponding multiple of said stride length is generated according to the data field's position in the mask register to load the data element from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. These instructions can restart after faults.