G06F9/3824

MANAGING A DIVIDED LOAD REORDER QUEUE

Managing a divided load reorder queue including storing load instruction data for a load instruction in an expanded LRQ entry in the LRQ; launching the load instruction from the expanded LRQ entry; determining that the load instruction is in a finished state; moving a subset of the load instruction data from the expanded LRQ entry to a compact LRQ entry in the LRQ, wherein the compact LRQ entry is smaller than the expanded LRQ entry; and removing the load instruction data from the expanded LRQ entry.

Data processing method and apparatus, and related product

The present disclosure provides a data processing method and an apparatus and a related product. The products include a control module including an instruction caching unit, an instruction processing unit, and a storage queue unit. The instruction caching unit is configured to store computation instructions associated with an artificial neural network operation; the instruction processing unit is configured to parse the computation instructions to obtain a plurality of operation instructions; and the storage queue unit is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or computation instructions to be executed in the sequence of the queue. By utilizing the above-mentioned method, the present disclosure can improve the operation efficiency of related products when performing operations of a neural network model.

Executing Memory Requests Out of Order
20170357512 · 2017-12-14 ·

An on-chip cache is described which receives memory requests and in the event of a cache miss, the cache generates memory requests to a lower level in the memory hierarchy (e.g. to a lower level cache or an external memory). Data returned to the on-chip cache in response to the generated memory requests may be received out-of-order. An instruction scheduler in the on-chip cache stores pending received memory requests and effects the re-ordering by selecting a sequence of pending memory requests for execution such that pending requests relating to an identical cache line are executed in age order and pending requests relating to different cache lines are executed in an order dependent upon when data relating to the different cache lines is returned. The memory requests which are received may be received from another, lower level on-chip cache or from registers.

OPERATION OF A MULTI-SLICE PROCESSOR IMPLEMENTING SIMULTANEOUS TWO-TARGET LOADS AND STORES

Operation of a multi-slice processor that includes a plurality of execution slices and a load/store superslice, where the load/store superslice includes a set predict array, a first load/store slice, and a second load/store slice. Operation of such a multi-slice processor includes: receiving a two-target load instruction directed to the first load/store slice and a store instruction directed to the second load/store slice; determining a first subset of ports of the set predict array as inputs for an effective address for the two-target load instruction; determining a second subset of ports of the set predict array as inputs for an effective address for the store instruction; and generating, in dependence upon logic corresponding to the set predict array that is less than logic implementing an entire load/store slice, output for performing the two-target load instruction in parallel with generating output for performing the store instruction.

Generation and use of memory access instruction order encodings

Apparatus and methods are disclosed for controlling execution of memory access instructions in a block-based processor architecture using a hardware structure that indicates a relative ordering of memory access instruction in an instruction block. In one example of the disclosed technology, a method of executing an instruction block having a plurality of memory load and/or memory store instructions includes selecting a next memory load or memory store instruction to execute based on dependencies encoded within the block, and on a store vector that stores data indicating which memory load and memory store instructions in the instruction block have executed. The store vector can be masked using a store mask. The store mask can be generated when decoding the instruction block, or copied from an instruction block header. Based on the encoded dependencies and the masked store vector, the next instruction can issue when its dependencies are available.

FETCHED DATA IN AN ULTRA-SHORT PIPED LOAD STORE UNIT
20170351521 · 2017-12-07 ·

Techniques are disclosed for receiving an instruction for processing data that includes a plurality of sectors. A method includes decoding the instruction to determine which of the plurality of sectors are needed to process the instruction and fetching at least one of the plurality of sectors from memory. The method includes determining whether each sector that is needed to process the instruction has been fetched. If all sectors needed to process the instruction have been fetched, the method includes transmitting a sector valid signal and processing the instruction. If all sectors needed to process the instruction have not been fetched, the method includes blocking a data valid signal from being transmitted, fetching an additional one or more of the plurality of sectors until all sectors needed to process the instruction have been fetched, transmitting a sector valid signal, and reissuing and processing the instruction using the fetched sectors.

Draining operation to cause store data to be written to persistent memory

An apparatus comprises a write buffer to buffer store requests issued by the processing circuitry, prior to the store data being written to at least one cache. Draining circuitry detects a draining trigger event having potential to cause loss of state stored in the at least one cache. In response to the draining trigger event, the draining circuitry performs a draining operation to identify whether the write buffer buffers any committed store requests requiring persistence, and when the write buffer buffers at least one committed store request requiring persistence, to cause the store data associated with the at least one committed store request to be written to persistent memory. This helps to eliminate barrier instructions from software, simplifying persistent programming and improving performance.

PIPELINE MERGING IN A CIRCUIT
20220374242 · 2022-11-24 ·

Devices and techniques for pipeline merging in a circuit are described herein. A parallel pipeline result can be obtained for a transaction index of a parallel pipeline. Here, the parallel pipeline is one of several parallel pipelines that share transaction indices. An element in a vector can be marked. The element corresponds to the transaction index. The vector is one of several vectors respectively assigned to the several parallel pipelines. Further each element in the several vectors corresponds to a possible transaction index with respective elements between vectors corresponding to the same transaction index. Elements between the several vectors that correspond to the same transaction index can be compared to determine when a transaction is complete. In response to the transaction being complete, the result can be released to an output buffer in response to the transaction being complete.

APPARATUSES, METHODS, AND SYSTEMS TO PRECISELY MONITOR MEMORY STORE ACCESSES
20230176870 · 2023-06-08 ·

Systems, methods, and apparatuses relating to circuitry to precisely monitor memory store accesses are described. In one embodiment, a system includes a memory, a hardware processor core comprising a decoder to decode an instruction into a decoded instruction, an execution circuit to execute the decoded instruction to produce a resultant, a store buffer, and a retirement circuit to retire the instruction when a store request for the resultant from the execution circuit is queued into the store buffer for storage into the memory, and a performance monitoring circuit to mark the retired instruction for monitoring of post-retirement performance information between being queued in the store buffer and being stored in the memory, enable a store fence after the retired instruction to be inserted that causes previous store requests to complete within the memory, and on detection of completion of the store request for the instruction in the memory, store the post-retirement performance information in storage of the performance monitoring circuit.

Address translation structures to provide separate translations for instruction fetches and data accesses

An address translation capability in which information is obtained from an address translation structure to be used to translate a first address to a second address, the first address being of a first address type and the second address being of a second address type. The address translation structure includes a first set of information to translate the first address to one address of the second address type and a second set of information to translate the first address to another address of the second address type. To obtain the information, the first set of information or the second set of information is selected as the information to be used to translate the first address to the second address, based on an attribute of the first address.