G06F9/312

System and method for predicting memory dependence when a source register of a push instruction matches the destination register of a pop instruction

A system and method for efficiently reducing the latency and power of memory access operations. A processor includes a stack pointer (SP) load-store dependence (LSD) predictor which predicts whether a memory dependence exists on a store instruction. The processor also includes a register file (RF) LSD predictor which predicts whether a memory dependence exists on a store instruction or a load instruction by a subsequent load instruction in program order. Each of the SP-LSD predictor and the RF-LSD predictor predicts and performs register renaming in a pipeline stage earlier than a renaming pipeline stage. The RF-LSD predictor also determines whether any intervening instructions between a producer memory instruction and a consumer memory instruction modify a predicted dependence.

Executing load-store operations without address translation hardware per load-store unit port

Technical solutions are described for out-of-order (OoO) execution of one or more instructions by a processing unit includes receiving, by a load-store unit (LSU) of the processing unit, an OoO window of instructions including a plurality of instructions to be executed OoO, and issuing, by the LSU, instructions from the OoO window. The issuing includes selecting an instruction from the OoO window, the instruction using an effective address. Further, in response to the instruction being a load instruction, it is determined whether the effective address is present in an effective address directory (EAD). In response to the effective address being present in the EAD, the load instruction is issued using the effective address. Further, in response to the instruction being a store instruction, a real address mapped to the effective address is determined from an effective-real translation (ERT) table, and the store instruction is issued using the real address.

Computer vision processing in hardware data paths

An apparatus includes a memory and a processor. The memory may be configured to store a directed acyclic graph. The processor may be configured to (i) receive a command to run the directed acyclic graph, (ii) parse the directed acyclic graph into a data flow including one or more operators, (iii) schedule the operators in one or more data paths, and (iv) generate one or more output vectors by processing one or more input vectors in the data paths. The processor generally comprises a plurality of hardware engines. The data paths may be implemented with the hardware engines. The hardware engines may operate in parallel to each other.

Conflict mask generation
10691454 · 2020-06-23 · ·

Single Instruction, Multiple Data (SIMD) technologies are described. A processor can store a first bitmap and generate a second bitmap with each cell identifying a mask bit. The mask bit is set when 1) a corresponding cell in a first bitmap is not in conflict with other elements in the first bitmap or 2) a corresponding cell is in conflict with one or more other cells in the first bitmap and is a last cell in a sequential order of the first bitmap that conflicts with the one or more other cells, wherein a position of each cell in the second bitmap maps to a same position of the corresponding cell in the first bitmap. The processor can store the second bitmap as a mask for a scatter operation to avoid lane conflicts.

Executing processor instructions using minimal dependency queue

Examples of techniques for executing instructions out of order are described herein. An example computer-implemented method includes receiving, via a processor, a plurality of instructions to be executed. The method includes sending, via the processor, an instruction to a minimal dependency queue in response to detecting the instruction includes a minimally dependent instruction. The method also includes selecting, via the processor, an instruction from a set of instructions that are eligible to be executed based on a scheme. The method further includes executing, via the processor, the instruction.

Methods and systems for using state vector data in a state machine engine
10671295 · 2020-06-02 · ·

A state machine engine includes a state vector system. The state vector system includes an input buffer configured to receive state vector data from a restore buffer and to provide state vector data to a state machine lattice. The state vector system also includes an output buffer configured to receive state vector data from the state machine lattice and to provide state vector data to a save buffer.

Converting a stream of data using a lookaside buffer

A stream of data is accessed from a memory system by an autonomous memory access engine, converted on the fly by the memory access engine, and then presented to a processor for data processing. A portion of a lookup table (LUT) containing converted data elements is preloaded into a lookaside buffer associated with the memory access engine. As the stream of data elements is fetched from the memory system each data element in the stream of data elements is replaced with a respective converted data element obtained from the LUT in the lookaside buffer according to a content of each data element to thereby form a stream of converted data elements. The stream of converted data elements is then propagated from the memory access engine to a data processor.

Computation engine with strided dot product

In an embodiment, a computation engine may perform dot product computations on input vectors. The dot product operation may have a first operand and a second operand, and the dot product may be performed on a subset of the vector elements in the first operand and each of the vector elements in the second operand. The subset of vector elements may be separated in the first operand by a stride that skips one or more elements between each element to which the dot product operation is applied. More particularly, in an embodiment, the input operands of the dot product operation may be a first vector having second vectors as elements, and the stride may select a specified element of each second vector.

Dual data streams sharing dual level two cache access ports to maximize bandwidth utilization

A streaming engine employed in a digital data processor specifies fixed first and second read only data streams. Corresponding stream address generator produces address of data elements of the two streams. Corresponding steam head registers stores data elements next to be supplied to functional units for use as operands. The two streams share two memory ports. A toggling preference of stream to port ensures fair allocation. The arbiters permit one stream to borrow the other's interface when the other interface is idle. Thus one stream may issue two memory requests, one from each memory port, if the other stream is idle. This spreads the bandwidth demand for each stream across both interfaces, ensuring neither interface becomes a bottleneck.

Cache preload operations using streaming engine

A stream of data is accessed from a memory system using a stream of addresses generated in a first mode of operating a streaming engine in response to executing a first stream instruction. A block cache preload operation is performed on a cache in the memory using a block of addresses generated in a second mode of operating the streaming engine in response to executing a second stream instruction.