Patent classifications
G06F2207/3828
MULTI-MODE FUSION MULTIPLIER
A multiplier is configured to implement a binary single-multiplication operation A[m_1-1:0]×B[m_2-1:0], or an accumulated sum operation of 2N binary multiplications A0[m_3-1:0]×B0[m_4-1:0]. The multiplier includes P precoders, Q groups of fusion coders, and a compressor. The P precoders and the Q groups of fusion coders are configured to code a first value and a second value in the single-multiplication operation or the multi-multiplication accumulated sum operation, and output a plurality of partial products to the compressor. The compressor may be configured to compress the plurality of partial products corresponding to the single-multiplication operation or the multi-multiplication accumulated sum operation to obtain two accumulated values.
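The idea of compressing partial products down to two accumulated values can be illustrated with a small sketch. It is not the coder structure from the abstract: the partial products here come from plain shift-and-AND generation (a swapped-in stand-in for the precoders and fusion coders), and the compressor is assumed to be a chain of 3:2 carry-save steps. The function names `partial_products`, `carry_save`, and `compress` are hypothetical.

```python
def partial_products(a, b, bits):
    """Shift-and-AND partial products of a * b (assumption: unsigned operands)."""
    return [a << i for i in range(bits) if (b >> i) & 1]

def carry_save(x, y, z):
    """3:2 compression: three values in, a sum word and a carry word out."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word

def compress(partials):
    """Reduce any number of partial products to two accumulated values."""
    vals = list(partials) + [0, 0]  # guarantee at least two values
    while len(vals) > 2:
        x, y, z = vals.pop(), vals.pop(), vals.pop()
        vals.extend(carry_save(x, y, z))
    return vals[0], vals[1]

# Single multiplication: 13 * 11
s, c = compress(partial_products(13, 11, 4))
print(s + c)

# Accumulated sum of two multiplications: 3*5 + 7*9,
# obtained by feeding both sets of partial products to the same compressor
s2, c2 = compress(partial_products(3, 5, 4) + partial_products(7, 9, 4))
print(s2 + c2)
```

In both modes a final carry-propagate addition of the two returned values gives the result, which is why the same compressor can serve single multiplication and multi-multiplication accumulation.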
Adder capable of supporting addition and subtraction of up to n-bit data and method of supporting addition and subtraction of a plurality of data type using the adder
An adder for supporting multiple data types by controlling a carry propagation is provided. The adder includes a plurality of first addition areas configured to receive pieces of incoming operand data, wherein each of the plurality of first addition areas includes a predetermined unit number of bits, and a plurality of second addition areas configured to receive pieces of control data based on a type of the operand data and an operation type, wherein the plurality of second addition areas are alternately arranged between the plurality of first addition areas.
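A minimal software sketch of carry control through interleaved control bits, under stated assumptions: each first addition area is 8 bits wide, each second addition area is a single bit, and only addition is modeled (the abstract also covers subtraction). A control bit pair of (1, 0) lets a carry cross the boundary, while (0, 0) absorbs it. The helper names `spread`, `gather`, `control_bits`, and `packed_add` are hypothetical.

```python
def spread(value, lanes, unit):
    """Move each unit-bit data area up to leave one control bit above it."""
    mask = (1 << unit) - 1
    out = 0
    for i in range(lanes):
        out |= ((value >> (i * unit)) & mask) << (i * (unit + 1))
    return out

def gather(value, lanes, unit):
    """Drop the control bits and repack the data areas contiguously."""
    mask = (1 << unit) - 1
    out = 0
    for i in range(lanes):
        out |= ((value >> (i * (unit + 1))) & mask) << (i * unit)
    return out

def control_bits(lanes, unit, units_per_element):
    """Set a control bit wherever a carry must cross into the next data area."""
    ctrl = 0
    for i in range(lanes - 1):
        if (i + 1) % units_per_element != 0:  # boundary inside one element
            ctrl |= 1 << (i * (unit + 1) + unit)
    return ctrl

def packed_add(a, b, elem_bits, total_bits=32, unit=8):
    """Add packed elements of elem_bits each with one wide addition."""
    lanes = total_bits // unit
    ctrl = control_bits(lanes, unit, elem_bits // unit)
    wide = (spread(a, lanes, unit) | ctrl) + spread(b, lanes, unit)
    return gather(wide, lanes, unit)

for width in (8, 16, 32):
    print(width, hex(packed_add(0xFFFFFFFF, 0x00000001, width)))
```

The same wide addition serves every data type; only the control word changes with the element width.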
Apparatus and method for scaling pre-scaled results of complex multiply-accumulate operations on packed real and imaginary data elements
Apparatus and method to transform complex data including a processor that comprises: multiplier circuitry to multiply packed complex N-bit data elements with packed complex M-bit data elements to generate at least four real products; adder circuitry to subtract a first real product from a second real product to generate a first temporary result, subtract a third real product from a fourth real product to generate a second temporary result, add the first temporary result to a first packed N-bit data element to generate a first pre-scaled result, subtract the first temporary result from the first packed N-bit data element to generate a second pre-scaled result, add the second temporary result to a second packed N-bit data element to generate a third pre-scaled result, and subtract the second temporary result from the second packed N-bit data element to generate a fourth pre-scaled result; and scaling circuitry to scale the pre-scaled results.
Apparatus and method for vector horizontal add of signed/unsigned words and doublewords
An apparatus and method for performing a packed horizontal addition of words and doublewords. One embodiment of a processor includes a decoder to decode a packed horizontal add instruction which includes an opcode and one or more operands used to identify a plurality of packed words; a source register to store a plurality of packed words; execution circuitry to execute the decoded instruction; and a destination register to store a final result as a packed result word in a designated data element position. The execution circuitry includes operand selection circuitry to identify first and second packed words from the source register in accordance with the operands and opcode; adder circuitry to add the two packed words to generate a temporary sum; a temporary storage of at least 17 bits to store the temporary sum; and saturation circuitry to saturate the temporary sum if necessary to generate the final result.
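The add-then-saturate step can be sketched as follows. This is an illustration only: the source register is modeled as a Python list of 16-bit word values, and the function names `saturate` and `horizontal_add_words` and the index parameters are hypothetical. The sum of two 16-bit words always fits in 17 bits, which is why the temporary storage needs at least that width.

```python
def saturate(value, bits=16, signed=True):
    """Clamp value to the range representable in `bits` bits."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, value))

def horizontal_add_words(src, idx_a, idx_b, signed=True):
    """Add two words selected from the same source register, then saturate."""
    temp = src[idx_a] + src[idx_b]  # temporary sum, needs up to 17 bits
    return saturate(temp, 16, signed)

src = [30000, 10000, -5, 7]
print(horizontal_add_words(src, 0, 1))  # exceeds the signed 16-bit range, so it is clamped
print(horizontal_add_words(src, 2, 3))  # in range, so it passes through unchanged
```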
Two-dimensional multi-layer convolution for deep learning
This application relates to a multi-layer convolution operation. The multi-layer convolution operation is optimized for a vector processing unit having a number of data paths configured to operate on vector operands containing a number of elements processed in parallel by the data paths. The convolution operation specifies a convolution kernel utilized to filter a multi-channel input and generate a multi-channel output of the convolution operation. A number of threads are generated to process blocks of the multi-channel output, each block comprising a set of windows of a number of channels of the multi-channel output. Each window is a portion of the array of elements in a single layer of the multi-channel output. Each thread processes a block in accordance with an arbitrary width of the block, processing a set of instructions for each sub-block of the block having a well-defined width, the instructions optimized for the vector processing unit.
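One way to picture handling a block of arbitrary width is to split it into sub-blocks whose widths each match a fixed instruction sequence. The sketch below is an assumption-laden illustration: the supported widths (8, 4, 2, 1) and the function name `split_width` are hypothetical and do not come from the abstract.

```python
def split_width(width, supported=(8, 4, 2, 1)):
    """Split an arbitrary block width into (start, width) sub-blocks.

    Assumes `supported` is sorted in descending order and ends with 1,
    so that every width can be covered exactly.
    """
    sub_blocks = []
    start = 0
    for w in supported:
        while width - start >= w:
            sub_blocks.append((start, w))
            start += w
    return sub_blocks

# A block 13 windows wide becomes sub-blocks of width 8, 4, and 1
print(split_width(13))
```

A thread would then run the instruction sequence specialized for each sub-block width in turn.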
APPARATUS AND METHOD FOR SCALING PRE-SCALED RESULTS OF COMPLEX MULTIPLY-ACCUMULATE OPERATIONS ON PACKED REAL AND IMAGINARY DATA ELEMENTS
An apparatus and method for performing a transform on complex data. For example, one embodiment of a processor comprises: multiplier circuitry to multiply packed real N-bit data elements in the first source register with packed real M-bit data elements in the second source register and to multiply packed imaginary N-bit data elements in the first source register with packed imaginary M-bit data elements in the second source register to generate at least four real products, adder circuitry to subtract a first selected real product from a second selected real product to generate a first temporary result and to subtract a third selected real product from a fourth selected real product to generate a second temporary result, the adder circuitry to add the first temporary result to a first packed N-bit data element from the third source register to generate a first pre-scaled result, to subtract the first temporary result from the first packed N-bit data element to generate a second pre-scaled result, to add the second temporary result to a second packed N-bit data element from the third source register to generate a third pre-scaled result, and to subtract the second temporary result from the second packed N-bit data element to generate a fourth pre-scaled result; scaling circuitry to scale the first, second, third and fourth pre-scaled results to a specified bit width to generate first, second, third, and fourth final results; and a destination register to store the first, second, third, and fourth final results in specified data element positions.
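The data flow in this abstract can be followed with a small numeric sketch. Several details are assumptions made for illustration: each of the first two source registers holds two complex elements as (real, imaginary) pairs, the "selected" products are real×real and imaginary×imaginary of the same element, scaling is an arithmetic right shift followed by clamping, and the names `clamp`, `complex_mac_real`, `shift`, and `out_bits` are hypothetical.

```python
def clamp(value, bits):
    """Clamp to the signed range representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def complex_mac_real(src1, src2, src3, shift, out_bits=16):
    """Real-part multiply-accumulate with add/subtract outputs, then scaling."""
    (ar0, ai0), (ar1, ai1) = src1
    (br0, bi0), (br1, bi1) = src2
    c0, c1 = src3

    # Four real products: real*real and imaginary*imaginary per element
    rr0, ii0 = ar0 * br0, ai0 * bi0
    rr1, ii1 = ar1 * br1, ai1 * bi1

    # Temporary results: one product subtracted from the other
    temp1 = rr0 - ii0
    temp2 = rr1 - ii1

    # Four pre-scaled results: accumulate element plus and minus each temporary
    pre_scaled = [c0 + temp1, c0 - temp1, c1 + temp2, c1 - temp2]

    # Scale each pre-scaled result to the output width (assumed shift + clamp)
    return [clamp(v >> shift, out_bits) for v in pre_scaled]

print(complex_mac_real([(3, 1), (2, 2)], [(4, 5), (1, 3)], [100, 50], shift=1))
```

The add and subtract pairs have the shape of a butterfly step, which is consistent with the abstract framing the operation as part of a transform on complex data.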
Dense digital arithmetic circuitry utilization for fixed-point machine learning
Systems and methods are related to improving throughput of neural networks in integrated circuits by combining values in operands to increase compute density. A system includes an integrated circuit (IC) having multiplier circuitry. The IC receives a first value and a second value in a first operand. The IC performs a multiplication operation, via the multiplier circuitry, on the first operand and a second operand to produce a first multiplied product based at least in part on the first value and a second multiplied product based at least in part on the second value.
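A minimal sketch of how one multiplication can yield two products when two values share an operand. It assumes unsigned values and a gap between the packed values wide enough that the low product cannot spill into the high one; the function name `packed_multiply` and the parameter `value_bits` are hypothetical.

```python
def packed_multiply(a1, a2, b, value_bits=8):
    """Multiply two packed unsigned values by b using a single multiplication."""
    shift = 2 * value_bits            # room for a full a2 * b product
    packed = (a1 << shift) | a2       # first operand carries both values
    wide = packed * b                 # one pass through the multiplier
    low = wide & ((1 << shift) - 1)   # second product: a2 * b
    high = wide >> shift              # first product: a1 * b
    return high, low

print(packed_multiply(200, 255, 255))
```

Signed values would need a correction step for borrows from the low product, which this sketch leaves out.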
Method, device, and system for task processing
A number of RSA computing tasks that have different word lengths which are less than a maximum word length of an operand register are processed at the same time by combining a number of different word lengths to be equal to or less than the maximum word length of the operand register.
Processor supporting arithmetic instructions with branch on overflow and methods
A method provides for decoding, in a microprocessor, an instruction into data identifying a first register, a second register, an immediate value, and an opcode identifier. The opcode identifier is interpreted as indicating that an arithmetic operation is to be performed on the first register and the second register, and that the microprocessor is to perform a change of control operation in response to the addition of the first register and the second register causing overflow or underflow. The change of control operation is to a location in a program determined based on the immediate value. A processor can be provided with a decoder and other supporting circuitry to implement such method. Overflow/underflow can be checked on word boundaries of a double-word operation.
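The control flow can be sketched in a few lines. The abstract does not state how the target is derived from the immediate, so the sketch assumes a PC-relative target and a 4-byte fall-through; registers are modeled as a list of signed Python integers. The names `wrap` and `add_branch_on_overflow` are hypothetical.

```python
def wrap(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def add_branch_on_overflow(regs, pc, rs, rt, imm, bits=32):
    """Add two registers; branch when the signed sum does not fit in `bits` bits."""
    exact = regs[rs] + regs[rt]
    result = wrap(exact, bits)
    overflowed = result != exact  # covers both overflow and underflow
    # Assumption: PC-relative target on overflow, next instruction otherwise
    next_pc = pc + imm if overflowed else pc + 4
    return result, next_pc

regs = [0, 2**31 - 1, 1, -5]
print(add_branch_on_overflow(regs, 100, 1, 2, 64))  # sum overflows, so the branch is taken
print(add_branch_on_overflow(regs, 100, 2, 3, 64))  # sum fits, so execution falls through
```

Folding the check into the arithmetic instruction removes the separate compare-and-branch sequence that software overflow checks otherwise need.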