Patent classifications
G06F9/3893
VECTOR SIMD VLIW DATA PATH ARCHITECTURE
A Very Long Instruction Word (VLIW) digital signal processor particularly adapted for single instruction multiple data (SIMD) operation on various operand widths and data sizes. A vector compare instruction compares first and second operands and stores compare bits. A companion vector conditional instruction performs conditional operations based upon the state of a corresponding predicate data register bit. A predicate unit performs data processing operations on data in at least one predicate data register including unary operations and binary operations. The predicate unit may also transfer data between a general data register file and the predicate data register file.
Coprocessors with Bypass Optimization, Variable Grid Architecture, and Fused Vector Operations
In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given processor instruction, and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement a portion of the processing elements. The coprocessor control circuitry may be designed to operate with the full grid or partial grid, reissuing instructions in the partial grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.
MULTIPLE ACCUMULATE BUSSES IN A SYSTOLIC ARRAY
Systems and methods are provided to enable parallelized multiply-accumulate operations in a systolic array. Each column of the systolic array can include multiple busses enabling independent transmission of input partial sums along the respective bus. Each processing element of a given columnar bus can receive an input partial sum from a prior element of the given columnar bus, and perform arithmetic operations on the input partial sum. Each processing element can generate an output partial sum based on the arithmetic operations, provide the output partial sum to a next processing element of the given columnar bus, without the output partial sum being processed by a processing element of the column located between the two processing elements that uses a different columnar bus. Use of columnar busses can enable parallelization to increase speed or enable increased latency at individual processing elements.
Method and apparatus for linear function processing in pipelined storage circuits
An integrated circuit may have processing and storage circuits that perform read-modify-write operations on a wide data path. A CAD tool may partition the wide data path into data path subsets based on the width of the wide data path, the characteristics of the processing and storage circuits, and various constraints such as resource constraints and timing constraints. The CAD tool may also instantiate corresponding pipelined circuitry. The pipelined circuitry may be arranged in slices with cascaded processing and storage circuits. Each processing and storage circuit in a slice may perform a read-modify-write operation based on the corresponding data path subset and any prior result produced by other processing and storage circuits.
EFFICIENT MULTIPLY-ACCUMULATION BASED ON SPARSE MATRIX
Disclosed herein includes improving computational efficiency of multiply-accumulate (MAC) operation. In one aspect, a computing device identifies, a first vector including non-zero elements of a base matrix, and a second vector indicating a location of each of the non-zero elements of the base matrix. In one aspect, the device determines a first element and a second element of the first vector. In one aspect, the device determines a third element and a fourth element of the second vector. In one aspect, the device determines i) a fifth element of an input vector according to the third element of the second vector, and ii) a sixth element of the input vector according to the fourth element of the second vector. In one aspect, the device causes a MAC circuitry to perform a dot product according to the first element, the second element, the fifth element, and the sixth element.
Super multiply add (super MADD) instructions with three scalar terms
A processing core is described having execution unit logic circuitry having a first register to store a first vector input operand, a second register to a store a second vector input operand and a third register to store a packed data structure containing scalar input operands a, b, c. The execution unit logic circuitry further include a multiplier to perform the operation (a*(first vector input operand))+(b*(second vector operand))+c.
Temporally split fused multiply-accumulate operation
A microprocessor splits a fused multiply-accumulate operation of the form A*B+C into first and second multiply-accumulate sub-operations to be performed by a multiplier and an adder. The first sub-operation at least multiplies A and B, and conditionally also accumulates C to the partial products of A and B to generate an unrounded nonredundant sum. The unrounded nonredundant sum is stored in memory shared by the multiplier and adder for an indefinite time period, enabling the multiplier and adder to perform other operations unrelated to the multiply-accumulate operation. The second sub-operation conditionally accumulates C to the unrounded nonredundant sum if C is not already incorporated into the value, and then generates a final rounded result.
Method and apparatus for performing a shift and exclusive or operation in a single instruction
Method and apparatus for performing a shift and XOR operation. In one embodiment, an apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources perform a shift and XOR on at least one value.
Apparatus and method for scaling pre-scaled results of complex multiply-accumulate operations on packed real and imaginary data elements
Apparatus and method to transform complex data including a processor that comprises: multiplier circuitry to multiply packed complex N-bit data elements with packed complex M-bit data elements to generate at least four real products; adder circuitry to subtract a first real product from a second real product to generate a first temporary result, subtract a third real product from a fourth real product to generate a second temporary result, add the first temporary result to a first packed N-bit data element to generate a first pre-scaled result, subtract the first temporary result from the first packed N-bit data element to generate a second pre-scaled result, add the second temporary result to a second packed N-bit data element to generate a third pre-scaled result, and subtract the second temporary result from the second packed N-bit data element to generate a fourth pre-scaled result; and scaling circuitry to scale the pre-scaled results.
Coprocessors with bypass optimization, variable grid architecture, and fused vector operations
In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given processor instruction, and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement a portion of the processing elements. The coprocessor control circuitry may be designed to operate with the full grid or partial grid, reissuing instructions in the partial grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.