Patent classifications
G06F9/3854
COMPUTE-INTENSIVE KERNEL GENERATOR, MICRO-KERNEL CODE CACHE, FUSED KERNEL GENERATOR AND CYCLIC DEPENDENCE FREE GRAPH PARTITIONING FOR DEEP LEARNING WORKLOADS
Systems, apparatuses and methods may provide for technology that identifies a data layout associated with input tensors and output tensors, generates a micro-kernel based at least in part on the data layout, and generates a nested outer loop for a kernel, wherein the micro-kernel performs one or more subtasks associated with a task represented by the kernel. The technology also includes micro-kernel code caches, fused kernel generators and cyclic dependence free graph partitioning for deep learning workloads.
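As a rough illustration of the micro-kernel and nested outer loop split described above, the following sketch tiles a matrix multiplication in plain Python. The names `micro_kernel` and `tiled_matmul`, the tile size, and the row-major list-of-lists layout are all assumptions made for this example; the abstract does not specify the task, the layout, or how the micro-kernel is generated.

```python
def micro_kernel(a_tile, b_tile, c_tile):
    """Subtask: accumulate a_tile @ b_tile into c_tile in place."""
    for i, a_row in enumerate(a_tile):
        c_row = c_tile[i]
        for k, a_val in enumerate(a_row):
            b_row = b_tile[k]
            for j, b_val in enumerate(b_row):
                c_row[j] += a_val * b_val


def tiled_matmul(a, b, tile=2):
    """Kernel: nested outer loops over tiles, each calling the micro-kernel."""
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # outer loop over row tiles
        for j0 in range(0, n, tile):        # outer loop over column tiles
            rows = min(tile, m - i0)
            cols = min(tile, n - j0)
            c_tile = [[0] * cols for _ in range(rows)]
            for k0 in range(0, k_dim, tile):  # outer loop over the reduction
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                micro_kernel(a_tile, b_tile, c_tile)
            # Write the finished tile back into the output tensor.
            for di, row in enumerate(c_tile):
                c[i0 + di][j0:j0 + cols] = row
    return c
```

In this sketch the outer loops only decide which tiles are visited; all arithmetic lives in the micro-kernel, which is the part a generator would specialize for a given data layout.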
DATA PROCESSING DEVICE
A data processing device includes an instruction issue circuit configured to issue instructions; a plurality of execution circuits configured to execute, in parallel, the instructions issued from the instruction issue circuit; and a plurality of delay circuits configured to delay arrival timings of when the instructions issued from the instruction issue circuit arrive at the plurality of execution circuits, the plurality of delay circuits being arranged between the instruction issue circuit and the plurality of execution circuits. The arrival timings of the instructions arriving at at least two execution circuits included in the plurality of execution circuits are different from each other.
Physical address proxy reuse management
Each load/store queue entry holds a load/store physical address proxy (PAP) for use as a proxy for a load/store physical memory line address (PMLA). The load/store PAP comprises a set index and a way that uniquely identifies an L2 cache entry holding a memory line at the load/store PMLA when an L1 cache provides the load/store PAP during the load/store instruction execution. The microprocessor removes a line at a removal PMLA from an L2 entry, forms a removal PAP as a proxy for the removal PMLA that comprises a set index and a way, snoops the load/store queue with the removal PAP to determine whether the removal PAP is being used as a proxy for the removal PMLA, fills the removed entry with a line at a fill PMLA, and prevents the removal PAP from being used as a proxy for the removal PMLA and the fill PMLA concurrently.
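The idea of a proxy built from a set index and a way, and of snooping the load/store queue with a removal PAP, can be sketched as follows. The 4-way associativity, the bit layout, and the function names are assumptions for illustration only; the abstract does not state how the PAP is encoded or how concurrent use is prevented.

```python
NUM_WAYS = 4   # assumed L2 associativity
WAY_BITS = 2   # log2(NUM_WAYS)


def make_pap(set_index, way):
    """Form a PAP from an L2 set index and way (assumed encoding)."""
    return (set_index << WAY_BITS) | way


def split_pap(pap):
    """Recover (set_index, way) from a PAP."""
    return pap >> WAY_BITS, pap & (NUM_WAYS - 1)


def snoop(queue_paps, removal_pap):
    """Return indices of load/store queue entries still using removal_pap."""
    return [i for i, pap in enumerate(queue_paps) if pap == removal_pap]
```

If `snoop` returns a non-empty list, the removal PAP is still in use as a proxy for the removed line, so the same set and way must not yet be treated as a proxy for the newly filled line.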
PRECISE EXCEPTIONS FOR EDGE PROCESSORS
Systems and methods are disclosed for supporting debugging of programs in block-based processor architectures. In one example of the disclosed technology, a processor includes an exception event handler, a memory interface, and at least one block-based processor core coupled to the memory interface. The core is configured to, responsive to receiving an exception event signal while executing a first instruction block, store state data for the core generated by executing the instruction block, transfer control of the core to a second instruction block, and resume execution of the first instruction block by restoring state for the processor core from the stored state data.
MANAGING A DIVIDED LOAD REORDER QUEUE
Managing a divided load reorder queue (LRQ) includes: storing load instruction data for a load instruction in an expanded LRQ entry in the LRQ; launching the load instruction from the expanded LRQ entry; determining that the load instruction is in a finished state; moving a subset of the load instruction data from the expanded LRQ entry to a compact LRQ entry in the LRQ, wherein the compact LRQ entry is smaller than the expanded LRQ entry; and removing the load instruction data from the expanded LRQ entry.
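The sequence of steps above can be sketched as a small software model. The class name `DividedLRQ`, the field names, and the choice of which fields survive in the compact entry are hypothetical; the abstract says only that a subset of the data is moved and that the compact entry is smaller.

```python
class DividedLRQ:
    # Assumed subset of fields kept once a load has finished.
    COMPACT_FIELDS = ("tag", "address")

    def __init__(self):
        self.expanded = {}  # tag -> full load instruction data
        self.compact = {}   # tag -> reduced load instruction data

    def store(self, tag, address, operands):
        """Store load instruction data in an expanded entry."""
        self.expanded[tag] = {
            "tag": tag,
            "address": address,
            "operands": operands,
            "state": "stored",
        }

    def launch(self, tag):
        """Launch the load from its expanded entry."""
        self.expanded[tag]["state"] = "launched"

    def finish(self, tag):
        """Mark the load as finished."""
        self.expanded[tag]["state"] = "finished"

    def compact_finished(self):
        """Move finished loads to compact entries and free expanded ones."""
        finished = [t for t, e in self.expanded.items()
                    if e["state"] == "finished"]
        for tag in finished:
            entry = self.expanded.pop(tag)
            self.compact[tag] = {f: entry[f] for f in self.COMPACT_FIELDS}
```

A load that has been launched but not finished stays in its expanded entry; only after `finish` does `compact_finished` migrate it and release the larger entry.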
AN APPARATUS AND METHOD TO GENERATE TRACE DATA IN RESPONSE TO TRANSACTIONAL EXECUTION
There is provided an apparatus comprising processing circuitry to execute a transaction comprising a number of program instructions that execute to generate updates to state data, to commit the updates if the transaction completes without a conflict, and to generate trace control signals during execution of the number of program instructions. The processing circuitry uses at least one resource during execution of the program instructions. Transaction trace circuitry generates trace items in response to the trace control signals. In response to the trace control signals indicating that a change in a usage level of the at least one resource has occurred during execution of the program instructions, the transaction trace circuitry generates at least one trace item that indicates the usage level of the at least one resource.
Last branch record indicators for transactional memory
In one embodiment, a processor includes an execution unit and at least one last branch record (LBR) register to store address information of a branch taken during program execution. This register may further store a transaction indicator to indicate whether the branch was taken during a transactional memory (TM) transaction. This register may further store an abort indicator to indicate whether the branch was caused by a transaction abort. Other embodiments are described and claimed.
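One way to picture a register that holds branch address information together with a transaction indicator and an abort indicator is a packed 64-bit value. The bit positions and the 48-bit address width below are assumptions chosen for this sketch, not a layout taken from the abstract.

```python
ADDR_MASK = (1 << 48) - 1   # assumed address field width
IN_TX_BIT = 1 << 62         # assumed position of the transaction indicator
ABORT_BIT = 1 << 61         # assumed position of the abort indicator


def pack_lbr(address, in_tx, abort):
    """Pack branch address and the two indicators into one register value."""
    value = address & ADDR_MASK
    if in_tx:
        value |= IN_TX_BIT
    if abort:
        value |= ABORT_BIT
    return value


def unpack_lbr(value):
    """Return (address, in_tx, abort) from a packed register value."""
    return value & ADDR_MASK, bool(value & IN_TX_BIT), bool(value & ABORT_BIT)
```

A profiler reading such a record could use the two flags to separate branches taken inside a transaction from branches that were caused by a transaction abort.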
DEFER BUFFER
An apparatus comprises processing circuitry for executing instructions of two or more threads of processing, hardware registers to store context data for the two or more threads concurrently, and commit circuitry to commit results of executed instructions of the threads, where for each thread the commit circuitry commits the instructions of that thread in program order. At least one defer buffer is provided to buffer at least one blocked instruction for which execution by the processing circuitry is complete but execution of an earlier instruction of the same thread in the program order is incomplete. This can help to resolve inter-thread blocking and hence improve performance.
Speeding up younger store instruction execution after a sync instruction
Mechanisms are provided, in a processor, for executing instructions that are younger than a previously dispatched synchronization (sync) instruction. An instruction sequencer unit of the processor dispatches a sync instruction. The sync instruction is sent to a nest of one or more devices outside of the processor. The instruction sequencer unit dispatches a subsequent instruction after dispatching the sync instruction. The dispatching of the subsequent instruction after dispatching the sync instruction is performed prior to receiving a sync acknowledgement response from the nest. The instruction sequencer unit performs a completion of the subsequent instruction based on whether completion of the subsequent instruction is dependent upon receiving the sync acknowledgement from the nest and completion of the sync instruction.
Processor system and method based on instruction read buffer
This invention provides a cache system and method based on an instruction read buffer (IRB). When applied to the field of processors, it is capable of filling instructions into the instruction read buffer, which can be directly accessed by the processor core; the instruction read buffer autonomously outputs instructions to the processor core for execution and achieves a high cache hit rate.