G06F9/315

Method and apparatus for performing reduction operations on a set of vector elements

An apparatus and method are described for performing SIMD reduction operations. For example, one embodiment of a processor comprises: a value vector register containing a plurality of data element values to be reduced; an index vector register to store a plurality of index values indicating which values in the value vector register are associated with one another; single instruction multiple data (SIMD) reduction logic to perform reduction operations on the data element values within the value vector register by combining data element values from the value vector register which are associated with one another as indicated by the index values in the index vector register; and an accumulation vector register to store results of the reduction operations generated by the SIMD reduction logic.

Method and apparatus for efficiently managing architectural register state of a processor

An apparatus and method for efficiently managing the architectural state of a processor. For example, one embodiment of a processor comprises: a source mask register to be logically subdivided into at least a first portion to store a usable portion of a mask value and a second portion to store an indication of whether the usable portion of the mask value has been updated; a control register to store an unusable portion of the mask value; architectural state management logic to read the indication to determine whether the mask value has been updated prior to performing a store operation, wherein if the mask value has been updated, then the architectural state management logic is to read the usable portion of the mask value from the first portion of the source mask register and zero out bits of the unusable portion of the mask value to generate a final mask value to be saved to memory, and wherein if the mask value has not been updated, then the architectural state management logic is to concatenate the usable portion of the mask value with the unusable portion of the mask value read from the control register to generate a final mask value to be saved to memory.

Merging and sorting arrays on an SIMD processor

Methods, systems, and articles of manufacture for merging and sorting arrays on a processor are provided herein. A method includes splitting an input array into multiple sub-arrays across multiple processing elements; merging the multiple sub-arrays into multiple vectors; and sorting the multiple vectors by comparing and swapping one or more vector elements among the multiple vectors.

Instruction and logic to provide vector linear interpolation functionality

Instructions and logic provide vector linear interpolation functionality. In some embodiments, responsive to an instruction specifying: a first operand from a set of vector registers, a size of each of the vector elements, a portion of the vector elements upon which to compute linear interpolations, a second operand from a set of vector registers, and a third operand; an execution unit, reads a first, a second and a third value of the size of vector elements from corresponding data fields in the first, the second and the third operand respectively and computes an interpolated value as the first value multiplied by the second value minus the second value multiplied by the third value plus the third value.

Processor instruction to store indexes of source data elements in positions representing a sorted order of the source data elements
09766888 · 2017-09-19 · ·

A processor of an aspect includes packed data registers, and a decode unit to decode an instruction. The instruction may indicate a first source packed data to include at least four data elements, indicate a second source packed data to include at least four data elements, and indicate a destination storage location. An execution unit is coupled with the packed data registers and the decode unit. The execution unit, in response to the instruction, is to store a result packed data in the destination storage location. The result packed data may include at least four indexes that may identify corresponding data element positions in the first and second source packed data. The indexes may be stored in positions in the result packed data that are to represent a sorted order of corresponding data elements in the first and second source packed data.

Processors, methods, systems, and instructions to Partition a source packed data into lanes
11204764 · 2021-12-21 · ·

A processor includes a decode unit to decode an instruction that is to indicate a source packed data that is to include a plurality of adjoining data elements, a number of adjoining data elements, and a destination. The processor also includes an execution unit coupled with the decode unit. The execution unit, in response to the instruction, is to store a result packed data in the destination. The result packed data is to have a plurality of lanes that are each to store a different non-overlapping set of the indicated number of adjoining data elements aligned with a least significant end of the respective lane. The different non-overlapping sets of the indicated number of the adjoining data elements in adjoining lanes of the result packed data are to be separated from one another by at least one most significant data element position of the less significant lane of the adjoining lanes.

Automatic compiler dataflow optimization to enable pipelining of loops with local storage requirements

Systems, apparatuses and methods may provide for technology that detects one or more local variables in source code, wherein the local variable(s) lack dependencies across iterations of a loop in the source code, automatically generate pipeline execution code for the local variable(s), and incorporate the pipeline execution code into an output of a compiler. In one example, the pipeline execution code includes an initialization of a pool of buffer storage for the local variable(s).

Method and apparatus for vector permutation

A method is provided that includes performing, by a processor in response to a vector permutation instruction, permutation of values stored in lanes of a vector to generate a permuted vector, wherein the permutation is responsive to a control storage location storing permute control input for each lane of the permuted vector, wherein the permute control input corresponding to each lane of the permuted vector indicates a value to be stored in the lane of the permuted vector, wherein the permute control input for at least one lane of the permuted vector indicates a value of a selected lane of the vector is to be stored in the at least one lane, and storing the permuted vector in a storage location indicated by an operand of the vector permutation instruction.

Inserting predefined pad values into a stream of vectors

Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a pad value indicator. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. A padded stream vector is formed that includes a specified pad value without accessing the pad value from system memory.

Decimal load immediate instruction

An instruction generates a value for use in processing within a computing environment. The instruction obtains a sign control associated with the instruction, and shifts an input value of the instruction in a specified direction by a selected amount to provide a result. The result is placed in a first designated location in a register, and the sign, which is based on the sign control, is placed in a second designated location of the register. The result and the sign provide a signed value to be used in processing within the computing environment.