G06F9/3888

Cache structure and utilization

Embodiments are generally directed to cache structure and utilization. An embodiment of an apparatus includes one or more processors including a graphics processor; a memory for storage of data for processing by the one or more processors; and a cache to cache data from the memory; wherein the apparatus is to provide for dynamic overfetching of cache lines for the cache, including receiving a read request and accessing the cache for the requested data, and upon a miss in the cache, overfetching data from memory or a higher level cache in addition to fetching the requested data, wherein the overfetching of data is based at least in part on a current overfetch boundary, and provides for data is to be prefetched extending to the current overfetch boundary.

Load multiple primitives per thread in a graphics pipeline

Systems, apparatuses, and methods for loading multiple primitives per thread in a graphics pipeline are disclosed. A system includes a graphics pipeline frontend with a geometry engine, shader processor input (SPI), and a plurality of compute units. The geometry engine generates primitives which are accumulated by the SPI into primitive groups. While accumulating primitives, the SPI tracks the number of vertices and primitives per group. The SPI determines wavefront boundaries based on mapping a single vertex to each thread of the wavefront while allowing more than one primitive per thread. The SPI launches wavefronts with one vertex per thread and potentially multiple primitives per thread. The compute units execute a vertex phase and a multi-cycle primitive phase for wavefronts with multiple primitives per thread.

Reconfigurable Parallel Processing
20240264975 · 2024-08-08 ·

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer, a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs, and a gasket memory coupled to the plurality of PEs and being configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.

Method and arrangement for handling memory access for a TCF-aware processor

An arrangement for handling shared data memory access for a TCF-aware processor. The arrangement comprises at least a flexible latency handling unit (601) comprising local memory (602) and related control logic, said local memory being provided for storing shared data memory access related data. The arrangement is configured to receive at least one TCF comprising at least one instruction, the at least one instruction being associated with at least one fiber, wherein the flexible latency handling unit is configured to determine if shared data memory access is required by the at least one instruction, if shared data memory access is required, send a shared data memory access request, via the flexible latency handling unit, observe, essentially continuously, if a reply to the shared data memory access request is received, suspend continued execution of the instruction until a reply is received, and continue execution of the instruction after receiving the reply so that the delay associated with the shared data memory access is dynamically determined by the actual required shared data memory access latency.

Techniques for parallel execution
12056494 · 2024-08-06 · ·

Apparatuses, systems, and techniques to identify instructions for advanced execution. In at least one embodiment, a processor performs one or more instructions that have been identified by a compiler to be speculatively performed in parallel.

Apparatuses, methods, and systems for 8-bit floating-point matrix dot product instructions

Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. A processor embodiment includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a destination matrix having single-precision elements, a first source matrix, and a second source matrix, the source matrices having elements that each comprise a quadruple of 8-bit floating-point values, the opcode to indicate execution circuitry is to cause, for each element of the first source matrix and corresponding element of the second source matrix, a conversion of the 8-bit floating-point values to single-precision values, a multiplication of different pairs of converted single-precision values to generate plurality of results, and an accumulation of the results with previous contents of a corresponding element of the destination matrix, decode circuitry to decode the fetched instruction, and the execution circuitry to respond to the decoded instruction as specified by the opcode.

Compute optimization mechanism

An apparatus to facilitate compute optimization is disclosed. The apparatus includes a mixed precision core including mixed-precision execution circuitry to execute one or more of the mixed-precision instructions to perform a mixed-precision dot-product operation comprising to perform a set of multiply and accumulate operations.

Instructions and logic to provide SIMD SM4 cryptographic block cipher functionality

Instructions and logic provide for a Single Instruction Multiple Data (SIMD) SM4 round slice operation. Embodiments of an instruction specify a first and a second source data operand set, and substitution function indicators, e.g. in an immediate operand. Embodiments of a processor may include encryption units, responsive to the first instruction, to: perform a slice of SM4-round exchanges on a portion of the first source data operand set with a corresponding keys from the second source data operand set in response to a substitution function indicator that indicates a first substitution function, perform a slice of SM4 key generations using another portion of the first source data operand set with corresponding constants from the second source data operand set in response to a substitution function indicator that indicates a second substitution function, and store a set of result elements of the first instruction in a SIMD destination register.

Instruction and logic for optimization level aware branch prediction

A computer-readable storage medium, method and system for optimization-level aware branch prediction is described. A gear level is assigned to a set of application instructions that have been optimized. The gear level is also stored in a register of a branch prediction unit of a processor. Branch prediction is then performed by the processor based upon the gear level.

Pre-scheduled replays of divergent operations

One embodiment of the present disclosure sets forth an optimized way to execute pre-scheduled replay operations for divergent operations in a parallel processing subsystem. Specifically, a streaming multiprocessor (SM) includes a multi-stage pipeline configured to insert pre-scheduled replay operations into a multi-stage pipeline. A pre-scheduled replay unit detects whether the operation associated with the current instruction is accessing a common resource. If the threads are accessing data which are distributed across multiple cache lines, then the pre-scheduled replay unit inserts pre-scheduled replay operations behind the current instruction. The multi-stage pipeline executes the instruction and the associated pre-scheduled replay operations sequentially. If additional threads remain unserviced after execution of the instruction and the pre-scheduled replay operations, then additional replay operations are inserted via the replay loop, until all threads are serviced. One advantage of the disclosed technique is that divergent operations requiring one or more replay operations execute with reduced latency.