G06F9/3854

Linkable issue queue parallel execution slice for a processor

An execution slice circuit for a processor core has multiple parallel instruction execution slices and provides flexible and efficient use of internal resources. The execution slice circuit includes a master execution slice for receiving instructions of a first instruction stream and a slave execution slice for receiving instructions of a second instruction stream and instructions of the first instruction stream that require an execution width greater than a width of the slices. The execution slice circuit also includes a control logic that detects when a first instruction of the first instruction stream has the greater width and controls the slave execution slice to reserve a first issue cycle for issuing the first instruction in parallel across the master execution slice and the slave execution slice.

MULTI-NULLIFICATION

Apparatus and methods are disclosed for nullifying memory store instructions and one or more registers identified in a target field of a nullification instruction. In some examples of the disclosed technology, an apparatus can include memory and one or more block-based processor cores configured to fetch and execute a plurality of instruction blocks. One of the cores can include a control unit configured, based at least in part on receiving a nullification instruction, to obtain an instruction identification for a memory access instruction of a plurality of memory access instructions and a register identification of at least one of a plurality of registers, based on a first and second target fields of the nullification instruction. The at least one register and the memory access instruction associated with the instruction identification are nullified. Based on the nullified memory access instruction, a subsequent memory access instruction is executed.

Tightly coupled processor arrays using coarse grained reconfigurable architecture with iteration level commits

An apparatus and method for supporting simultaneous multiple iterations (SMI) in a course grained reconfigurable architecture (CGRA). In support of SMI, the apparatus includes: Hardware structures that connect all of multiple processing engines (PEs) to a load-store unit (LSU) configured to keep track of which compiled program code iterations have completed, which ones are in flight and which are yet to begin, and a control unit including hardware structures that are used to maintain synchronization and initiate and terminate loops within the PEs. SMI permits execution of the next instruction within any iteration (in flight). If instructions from multiple iterations are ready for execution (and are pre-decoded), then the hardware selects the lowest iteration number ready for execution. If in a particular clock cycle, a loop iteration with a lower iteration number is stalled (i.e., is waiting for data), the instruction from the next highest iteration number that is ready thereby will be automatically executed automatically allowing the CGRA to have high ILP by overlapping concurrent loop iterations.

Memory address collision detection of ordered parallel threads with bloom filters

A semiconductor chip is described having a load collision detection circuit comprising a first bloom filter circuit. The semiconductor chip has a store collision detection circuit comprising a second bloom filter circuit. The semiconductor chip has one or more processing units capable of executing ordered parallel threads coupled to the load collision detection circuit and the store collision detection circuit. The load collision detection circuit and the store collision detection circuit is to detect younger stores for load operations of said threads and younger loads for store operations of said threads.

CACHE STORING DATA FETCHED BY ADDRESS CALCULATING LOAD INSTRUCTION WITH LABEL USED AS ASSOCIATED NAME FOR CONSUMING INSTRUCTION TO REFER
20180293073 · 2018-10-11 ·

A processor architecture includes a register file hierarchy to implement virtual registers that provide a larger set of registers than those directly supported by an instruction set architecture to facilitate multiple copies of the same architecture register for different processing threads, where the register file hierarchy includes a plurality of hierarchy levels. The processor architecture further includes a plurality of execution units coupled to the register file hierarchy.

Processor, accelerator, and direct memory access controller within a core reading/writing local synchronization flag area for parallel
10095657 · 2018-10-09 · ·

It is provided a processor system comprising at least one processor core provided on a semiconductor chip and including a processor, a memory and an accelerator. The memory includes an instruction area, a synchronization flag area and a data area. The accelerator starts, even if the processor is executing another processing, acceleration processing and executes the task in a case of confirming that a flag indicating that the processor has completed predetermined processing has been written into the synchronization flag area; and stores the data subjected to the acceleration processing into the data area, and further writes a flag indicating that the completion of the acceleration processing. The processor starts, even if the accelerator is executing another processing, the task corresponding to a flag in a case of confirming that the flag indicating the completion of the acceleration processing has been written into the synchronization flag area.

Mechanism to preclude I/O-dependent load replays in an out-of-order processor

An apparatus including first and second reservation stations. The first reservation station dispatches a load micro instruction, and indicates on a hold bus if the load micro instruction is a specified load micro instruction directed to retrieve an operand from a prescribed resource other than on-core cache memory. The second reservation station is coupled to the hold bus, and dispatches one or more younger micro instructions therein that depend on the load micro instruction for execution after a number of clock cycles following dispatch of the first load micro instruction, and if it is indicated on the hold bus that the load micro instruction is the specified load micro instruction, the second reservation station is configured to stall dispatch of the one or more younger micro instructions until the load micro instruction has retrieved the operand. The resources include an input/output (I/O) unit, configured to perform I/O operations via an I/O bus coupling an out-of-order processor to I/O resources.

Instruction block address register

Apparatus and methods are disclosed for controlling instruction flow in block-based processor architectures. In one example of the disclosed technology, an instruction block address register stores an index address to a memory storing a plurality of instructions for an instruction block, the indexed address being inaccessible when the processor is in one or more unprivileged operational modes, one or more execution units configured to execute instructions for the instruction block, and a control unit configured to fetch and decode two or more of the plurality of instructions from the memory based on the indexed address.

HYBRID ATOMICITY SUPPORT FOR A BINARY TRANSLATION BASED MICROPROCESSOR
20180285112 · 2018-10-04 ·

A processing device including a first shadow register, a second shadow register, and an instruction execution circuit, communicatively coupled to the first shadow register and the second shadow register, to receive a sequence of instructions comprising a first local commit marker, a first global commit marker, and a first register access instruction referencing an architectural register, speculatively execute the first register access instruction to generate a speculative register state value associated with a physical register, responsive to identifying the first local commit marker, store, in the first shadow register, the speculative register state value, and responsive to identifying the first global commit marker, store, in the second shadow register, the speculative register state value.

DATA PROCESSING

Data processing circuitry comprises out-of-order instruction execution circuitry to execute program instructions in an instruction execution order; a data store, to store information on a set of instructions for which execution has been initiated, the data store providing ordering information indicating the relative position of each instruction in the set of instructions with respect to a program code order; commit circuitry to commit the results of instructions executed by the instruction execution circuitry; one or more cumulative status registers configured to be set in response to a respective condition generated by execution of an instruction and then to remain set until an unset instruction is executed; and an identifier store, to store for at least those of the one or more cumulative status registers which are not currently set, an identifier of an instruction which is earliest in the program code order in the set of instructions and which generated a condition to set that cumulative status register.