G06F9/30

Instruction address translation and caching for primary and alternate branch prediction paths

Techniques for performing instruction fetch operations are provided. The techniques include determining instruction addresses for a primary branch prediction path; requesting that a level 0 translation lookaside buffer (“TLB”) caches address translations for the primary branch prediction path; determining either or both of alternate control flow path instruction addresses and lookahead control flow path instruction addresses; and requesting that either the level 0 TLB or an alternative level TLB caches address translations for either or both of the alternate control flow path instruction addresses and the lookahead control flow path instruction addresses.

Processing pipeline with first and second processing modes having different performance or energy consumption characteristics

An apparatus 2 has a processing pipeline 4 supporting at least a first processing mode and a second processing mode with different energy consumption or performance characteristics. A storage structure 22, 30, 36, 50, 40, 64, 44 is accessible in both the first and second processing modes. When the second processing mode is selected, control circuitry 70 triggers a subset 102 of the entries of the storage structure to be placed in a power saving state.

Systems, apparatuses, and methods for controllable sine and/or cosine operations

Embodiments of systems, apparatuses, and methods for performing vector-packed controllable sine and/or cosine operations in a processor are described. For example, execution circuitry executes a decoded instruction to compute at least a real output value and an imaginary output value based on at least a cosine calculation and a sine calculation, the cosine and sine calculations each based on an index value from a packed data source operand, add the index value with an index increment value from the packed data source operand to create an updated index value, and store the real output value, the imaginary output value, and the updated index value to a packed data destination operand.

Instructions for vector operations with constant values

Disclosed embodiments relate to instructions for vector operations with immediate values. In one example, a system includes a memory and a processor that includes fetch circuitry to fetch the instruction from a code storage, the instruction including an opcode, a destination identifier to specify a destination vector register, a first immediate, and a write mask identifier to specify a write mask register, the write mask register including at least one bit corresponding to each destination vector register element, the at least one bit to specify whether the destination vector register element is masked or unmasked, decode circuitry to decode the fetched instruction, and execution circuitry to execute the decoded instruction, to, use the write mask register to determine unmasked elements of the destination vector register, and, when the opcode specifies to broadcast, broadcast the first immediate to one or more unmasked vector elements of the destination vector register.

Systems for performing instructions to quickly convert and use tiles as 1D vectors

Disclosed embodiments relate to systems for performing instructions to quickly convert and use matrices (tiles) as one-dimensional vectors. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode, locations of a two-dimensional (2D) matrix and a one-dimensional (1D) vector, and a group of elements comprising one of a row, part of a row, multiple rows, a column, part of a column, multiple columns, and a rectangular sub-tile of the specified 2D matrix, and wherein the opcode is to indicate a move of the specified group between the 2D matrix and the 1D vector, decode circuitry to decode the fetched instruction; and execution circuitry, responsive to the decoded instruction, when the opcode specifies a move from 1D, to move contents of the specified 1D vector to the specified group of elements.

Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
11579888 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Programmable instruction buffering for accumulating a burst of instructions

A processing system 2 includes a processing pipeline 12, 14, 16, 18, 28 which includes fetch circuitry 12 for fetching instructions to be executed from a memory 6, 8. Buffer control circuitry 34 is responsive to a programmable trigger, such as explicit hint instructions delimiting an instruction burst, or predetermined configuration data specifying parameters of a burst together with a synchronising instruction, to trigger the buffer control circuitry to stall a stallable portion of the processing pipeline (e.g. issue circuitry 16), to accumulate within one or more buffers 30, 32 fetched instructions starting from a predetermined starting instruction, and, when those instructions have been accumulated, to restart the stallable portion of the pipeline.

Discrete Three-Dimensional Processor

A discrete three-dimensional (3-D) processor comprises stacked first and second dice. The first die comprises 3-D memory (3D-M) arrays, whereas the second die comprises logic circuits and at least an off-die peripheral-circuit component of the 3D-M array(s). In one preferred embodiment, the first and second dice are face-to-face bonded. In another preferred embodiment, the first and second dice have a same die size.

Discrete Three-Dimensional Processor

A discrete three-dimensional (3-D) processor comprises first and second dice. The first die comprises 3-D memory (3D-M) arrays, whereas the second die comprises logic circuits and at least an off-die peripheral-circuit component of the 3D-M array(s). Typical off-die peripheral-circuit component could be an address decoder, a sense amplifier, a programming circuit, a read-voltage generator, a write-voltage generator, a data buffer, or a portion thereof.

METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
20230037321 · 2023-02-09 ·

A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.