Patent classifications
G06F9/302
Method and apparatus for performing reduction operations on a set of vector elements
An apparatus and method are described for performing SIMD reduction operations. For example, one embodiment of a processor comprises: a value vector register containing a plurality of data element values to be reduced; an index vector register to store a plurality of index values indicating which values in the value vector register are associated with one another; single instruction multiple data (SIMD) reduction logic to perform reduction operations on the data element values within the value vector register by combining data element values from the value vector register which are associated with one another as indicated by the index values in the index vector register; and an accumulation vector register to store results of the reduction operations generated by the SIMD reduction logic.
Method and apparatus for efficiently managing architectural register state of a processor
An apparatus and method for efficiently managing the architectural state of a processor. For example, one embodiment of a processor comprises: a source mask register to be logically subdivided into at least a first portion to store a usable portion of a mask value and a second portion to store an indication of whether the usable portion of the mask value has been updated; a control register to store an unusable portion of the mask value; architectural state management logic to read the indication to determine whether the mask value has been updated prior to performing a store operation, wherein if the mask value has been updated, then the architectural state management logic is to read the usable portion of the mask value from the first portion of the source mask register and zero out bits of the unusable portion of the mask value to generate a final mask value to be saved to memory, and wherein if the mask value has not been updated, then the architectural state management logic is to concatenate the usable portion of the mask value with the unusable portion of the mask value read from the control register to generate a final mask value to be saved to memory.
Three source operand floating-point addition instruction with operand negation bits and intermediate and final result rounding
A processor includes a decode unit to decode a three source floating point addition instruction indicating a first source operand having a first floating point data element, a second source operand having a second floating point data element, and a third source operand having a third floating point data element. An execution unit is coupled with the decode unit. The execution unit, in response to the instruction, stores a result in a destination operand indicated by the instruction. The result includes a result floating point data element that includes a first floating point rounded sum. The first floating point rounded sum represents an additive combination of a second floating point rounded sum and the third floating point data element. The second floating point rounded sum represents an additive combination of the first floating point data element and the second floating point data element.
Floating point instruction with selectable comparison attributes
An instruction to perform a comparison of a first value and a second value is executed. Based on a control of the instruction, a compare function to be performed is determined. The compare function is one of a plurality of compare functions configured for the instruction, and the compare function has a plurality of options for comparison. A compare option based on the first value and the second value is selected from the plurality of options defined for the compare function, and used to compare the first value and the second value. A result of the comparison is then placed in a select location, the result to be used in processing within a computing environment.
Methods and apparatus to use an access triggered computer architecture
A method for using an access triggered architecture for a computer implemented application is provided. The method receives a set of data at a designated functional block associated with a system memory location; performs an operation at the designated functional block, using the set of data, to generate a result, wherein the operation is performed each time information is received at the designated functional block; and returns the generated result to the system memory location.
Division operations in memory
Examples of the present disclosure provide apparatuses and methods related to performing division operations in memory. An example apparatus might include a first group of memory cells coupled to a first access line and configured to store a dividend element. An example apparatus might include a second group of memory cells coupled to a second access line and configured to store a divisor element. An example apparatus might also include a controller configured to cause the dividend element to be divided by the divisor element by controlling sensing circuitry to perform a number of operations without transferring data via an input/output (I/O) line.
Apparatus and method for scaling pre-scaled results of complex multiply-accumulate operations on packed real and imaginary data elements
Apparatus and method to transform complex data including a processor that comprises: multiplier circuitry to multiply packed complex N-bit data elements with packed complex M-bit data elements to generate at least four real products; adder circuitry to subtract a first real product from a second real product to generate a first temporary result, subtract a third real product from a fourth real product to generate a second temporary result, add the first temporary result to a first packed N-bit data element to generate a first pre-scaled result, subtract the first temporary result from the first packed N-bit data element to generate a second pre-scaled result, add the second temporary result to a second packed N-bit data element to generate a third pre-scaled result, and subtract the second temporary result from the second packed N-bit data element to generate a fourth pre-scaled result; and scaling circuitry to scale the pre-scaled results.
Merging and sorting arrays on an SIMD processor
Methods, systems, and articles of manufacture for merging and sorting arrays on a processor are provided herein. A method includes splitting an input array into multiple sub-arrays across multiple processing elements; merging the multiple sub-arrays into multiple vectors; and sorting the multiple vectors by comparing and swapping one or more vector elements among the multiple vectors.
Multifunctional hexadecimal instruction form system and program product
A new zSeries floating-point unit has a fused multiply-add dataflow capable of supporting two architectures and fused MULTIPLY and ADD and Multiply and SUBTRACT in both RRF and RXF formats for the fused functions. Both binary and hexadecimal floating-point instructions are supported for a total of 6 formats. The floating-point unit is capable of performing a multiply-add instruction for hexadecimal or binary every cycle with a latency of 5 cycles. This supports two architectures with two internal formats with their own biases. This has eliminated format conversion cycles and has optimized the width of the dataflow. The unit is optimized for both hexadecimal and binary floating-point architecture supporting a multiply-add/subtract per cycle.
Vector instructions for selecting and extending an unsigned sum of products of words and doublewords for accumulation
Disclosed embodiments relate to executing a vector unsigned multiplication and accumulation instruction. In one example, a processor includes fetch circuitry to fetch a vector unsigned multiplication and accumulation instruction having fields for an opcode, first and second source identifiers, a destination identifier, and an immediate, wherein the identified sources and destination are same-sized registers, decode circuitry to decode the fetched instruction, and execution circuitry to execute the decoded instruction, on each corresponding pair of first and second quadwords of the identified first and second sources, to: generate a sum of products of two doublewords of the first quadword and either two lower words or two upper words of the second quadword, based on the immediate, zero-extend the sum to a quadword-sized sum, and accumulate the quadword-sized sum with a previous value of a destination quadword in a same relative register position as the first and second quadwords.