G06F9/30149

DATA PROCESSING APPARATUS HAVING STREAMING ENGINE WITH READ AND READ/ADVANCE OPERAND CODING
20230325188 · 2023-10-12 ·

A streaming engine employed in a digital signal processor specified a fixed data stream. Once started the data stream is read only and cannot be written. Once fetched, the data stream is stored in a first-in-first-out buffer for presentation to functional units in the fixed order. Data use by the functional unit is controlled using the input operand fields of the corresponding instruction. A read only operand coding supplies the data an input of the functional unit. A read/advance operand coding supplies the data and also advances the stream to the next sequential data elements. The read only operand coding permits reuse of data without requiring a register of the register file for temporary storage.

METHOD AND APPARATUS FOR PERMUTING STREAMED DATA ELEMENTS

A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.

Method and Apparatus for Vector Based Finite Impulse Response (FIR) Filtering
20230333848 · 2023-10-19 ·

A method includes executing, by a processor a vector finite impulse response (VFIR) filter instruction that specifies coefficients, data elements, and a storage location. The executing includes reordering a subset of the data elements to provide each data element of the reordered subset of the data elements to a respective slice multiply component of a vector multiplier of the processor, generating, by the vector multiplier, filter outputs based on the coefficients and data elements, and storing the filter outputs in the storage location.

VARIABLE-LENGTH INSTRUCTION STEERING TO INSTRUCTION DECODE CLUSTERS

Embodiments of apparatuses and methods for variable-length instruction steering to instruction decode clusters are disclosed. In an embodiment, an apparatus includes a decode cluster and chunk steering circuitry. The decode cluster includes multiple instruction decoders. The chunk steering circuitry is to break a sequence of instruction bytes into a plurality of chunks, create a slice from a one or more of the plurality of chunks based on one or more indications of a number of instructions in each of the one or more of the plurality of chunks, wherein the slice has a variable size and includes a plurality of instructions, and steer the slice to the decode cluster.

Speculative usage of parallel decode units

Aspects of the present disclosure relate to an apparatus comprising fetch circuitry. The fetch circuitry comprises a pointer-based fetch queue for queuing processing instructions retrieved from a storage, and pointer storage for storing a pointer identifying a current fetch queue element. The apparatus comprises decode circuitry having a plurality of decode units, and fetch queue extraction circuitry to, based on the pointer, extract the content of a plurality of elements of the fetch queue; apply combinatorial logic to speculatively produce, from the content of said fetch queue entries, a plurality of speculative potential instructions; and transmit each speculative potential instruction to a corresponding one of said decode units. Each decode unit is configured to decode the corresponding speculative potential instruction. The instruction extraction circuitry is configured to extract a subset of said plurality of speculative potential instructions, and transmit said determined subset to pipeline component circuitry.

Tracking streaming engine vector predicates to control processor execution

In a method of operating a computer system, an instruction loop is executed by a processor in which each iteration of the instruction loop accesses a current data vector and an associated current vector predicate. The instruction loop is repeated when the current vector predicate indicates the current data vector contains at least one valid data element and the instruction loop is exited when the current vector predicate indicates the current data vector contains no valid data elements.

METHOD AND APPARATUS FOR VECTOR SORTING
20230099669 · 2023-03-30 ·

A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values are sorted in an order indicated by the vector sort instruction, and storing the sorted vector in a storage location.

Method and apparatus to sort a vector for a bitonic sorting algorithm

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.

PRE-STAGED INSTRUCTION REGISTERS FOR VARIABLE LENGTH INSTRUCTION SET MACHINE

Methods and systems relating to improved processing architectures with pre-staged instructions are disclosed herein. A disclosed processor includes a memory, at least one functional processing unit, a bus, a set of instruction registers configured to be loaded, using the bus, with a set of pre-staged instructions from the memory, and a logic circuit configured to provide the set of pre-staged instructions from the set of instruction registers to the at least one functional processing unit in response to receiving an instruction from the instruction memory.

Systems, methods, and apparatuses for tile load

Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand to memory.