G06F9/30196

Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
11579888 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Implementing specialized instructions for accelerating Smith-Waterman sequence alignments

Various techniques for accelerating Smith-Waterman sequence alignments are provided. For example, threads in a group of threads are employed to use an interleaved cell layout to store relevant data in registers while computing sub-alignment data for one or more local alignment problems. In another example, specialized instructions that reduce the number of cycles required to compute each sub-alignment score are utilized. In another example, threads are employed to compute sub-alignment data for a subset of columns of one or more local alignment problems while other threads begin computing sub-alignment data based on partial result data received from the preceding threads. After computing a maximum sub-alignment score, a thread stores the maximum sub-alignment score and the corresponding position in global memory.

SYSTEMS, METHODS, AND APPARATUSES FOR TILE LOAD

Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand to memory.

Systems, methods, and apparatuses for tile store

Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in at least a form of decode circuitry to decode an instruction having fields for an opcode, a source matrix operand identifier, and destination memory information, and execution circuitry to execute the decoded instruction to store each data element of configured rows of the identified source matrix operand to memory based on the destination memory information.

APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS FOR A HARDWARE ASSISTED HETEROGENEOUS INSTRUCTION SET ARCHITECTURE DISPATCHER

Systems, methods, and apparatuses to support instructions for a hardware assisted heterogeneous instruction set architecture dispatcher are described. In one embodiment, a hardware processor includes a plurality of processor cores comprising a first type of processor core that supports a first instruction set architecture and a second type of processor core that supports a second different instruction set architecture, a decoder circuit of a processor core of the plurality of processor cores to decode a single instruction into a decoded single instruction, the single instruction including a field that identifies a requested core type and an opcode that indicates an execution circuit of the processor core is to: read a register to determine a core type of the processor core, cause the processor core to enter a first mode, that only permits execution of the first instruction set architecture by the processor core, when the requested core type and the core type of the processor core are the first type, cause the processor core to enter a second mode, that only permits execution of the second different instruction set architecture by the processor core, when the requested core type and the core type of the processor core are the second type, cause the processor core to enter a third mode, that only permits execution of the first instruction set architecture by the processor core, when the requested core type is the second type and the core type of the processor core is the first type, and cause the processor core to enter a fourth mode, that only permits execution of the second different instruction set architecture by the processor core, when the requested core type is the first type and the core type of the processor core is the second type, and the execution circuit of the processor core to execute the decoded single instruction according to the opcode.

Apparatus and method for scalable qubit addressing
11531922 · 2022-12-20 · ·

An apparatus and method for scalable qubit addressing. For example, one embodiment of a processor comprises: a decoder comprising quantum instruction decode circuitry to decode quantum instructions to generate quantum microoperations (uops) and non-quantum decode circuitry to decode non-quantum instructions to generate non-quantum uops; execution circuitry comprising: an address generation unit (AGU) to generate a system memory address responsive to execution of one or more of the non-quantum uops; and quantum index generation circuitry to generate quantum index values responsive to execution of one or more of the quantum uops, each quantum index value uniquely identifying a quantum bit (qubit) in a quantum processor; wherein to generate a first quantum index value for a first quantum uop, the quantum index generation circuitry is to read the first quantum index value from a first architectural register identified by the first quantum uop.

Multiply-accumulation in a data processing apparatus
11513796 · 2022-11-29 · ·

A data processing apparatus, a method of operating a data processing apparatus, a non-transitory computer readable storage medium, and an instruction are provided. The instruction specifies a first source register, a second source register, and a set of N accumulation registers. In response to the instruction control signals are generated, causing processing circuitry to extract N data elements from content of the first source register, perform a multiplication of each of the N data elements by content of the second source register, and apply a result of each multiplication to content of a respective target register of the set of N accumulation registers. As a result plural (N) multiplications are performed in a manner that effectively provides a multiplier N times the register width, but without requiring the register file to be made N times larger.

Thread creation on local or remote compute elements by a multi-threaded, self-scheduling processor
11513840 · 2022-11-29 · ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Blackbox matching engine

A method and apparatus are disclosed for enhancing operable functionality of input source code files from a software program by identifying a first code snippet and a first library function which generate similar outputs from a shared input by parsing each and every line of code in a candidate code snippet to generate a templatized code snippet data structure for the first code snippet, and then testing the templatized code snippet data structure against extracted library function information to check for similarity of outputs between the first code snippet and the first library function in response to a shared input so that the developer is presented with a library function recommendation which includes the first code snippet, the first library function, and instructions for replacing the first code snippet with the first library function.

SHARED PREFETCH INSTRUCTION AND SUPPORT

Techniques for shared data prefetch are described. An exemplary instruction for shared data prefetch includes at least one field for an opcode, at least one field for a source operand to provide a memory address at least a byte of data, wherein the opcode is to indicate that circuitry is to fetch of a line of data from memory at the provided address that contains the byte specified with the source operand and store that byte in at least a cache local to a requester, wherein the byte of data is to be stored in a shared state.