Patent classification: G06F7/53
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting a vector in a processor is provided that includes generating, by the processor in response to a vector sort instruction, a control input vector for vector permutation logic comprised in the processor, based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction, and storing the control input vector in a storage location.
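As a rough illustration (not taken from the patent), the control input vector can be thought of as a permutation whose entry i names the source lane whose value belongs in lane i of the sorted result; the permutation logic then routes lanes accordingly. The function names below are illustrative.

```python
def sort_control_vector(lanes, descending=False):
    """Generate a permutation control vector: entry i names the source
    lane whose value belongs in lane i of the sorted result."""
    return sorted(range(len(lanes)), key=lambda i: lanes[i], reverse=descending)

def permute(lanes, control):
    """Vector permutation step: route source lanes per the control vector."""
    return [lanes[c] for c in control]

# Ascending sort of a 3-lane vector via a generated control vector.
lanes = [30, 10, 20]
control = sort_control_vector(lanes)
sorted_lanes = permute(lanes, control)
```

In hardware, generating the control vector and applying the permutation would be separate steps, which is why the abstract describes storing the control input vector in a storage location for later use.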
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes sorting, by a processor in response to a vector sort instruction, values stored in lanes of a vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
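Sorting the two halves of a vector in opposite orders produces a bitonic sequence, the building block of the bitonic sorting algorithm the title refers to. A minimal sketch of that one step (half-split and order parameters are illustrative assumptions):

```python
def bitonic_sort_step(lanes, first_order="asc", second_order="desc"):
    """Sort the first half of the lanes in one order and the second half
    in another, yielding a bitonic sequence when the orders differ."""
    half = len(lanes) // 2
    first = sorted(lanes[:half], reverse=(first_order == "desc"))
    second = sorted(lanes[half:], reverse=(second_order == "desc"))
    return first + second

# [1, 4, 3, 2] rises then falls: a bitonic sequence.
result = bitonic_sort_step([4, 1, 3, 2])
```

A subsequent bitonic-merge network can then sort the bitonic sequence fully, which is why exposing this mixed-order sort as a single instruction is useful.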
APPARATUS AND METHOD WITH PARALLEL DATA PROCESSING
An apparatus with parallel processing includes: a first processor module; and a second processor module configured to perform parallel processing in synchronization with the first processor module, wherein the first processor module is configured to: determine first operation result data using an operation process in a first time interval; transmit the first operation result data to the second processor module; determine second operation result data using the operation process in a second time interval; and determine whether to transmit the second operation result data to the second processor module, and wherein the second processor module is configured to determine second prediction result data corresponding to the second operation result data based on the first operation result data received from the first processor module and a prediction process in response to the first processor module determining not to transmit the second operation result data to the second processor module.
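A toy sketch of the transmit-or-predict protocol described above. The threshold rule for skipping transmission and the hold-last-value prediction process are illustrative assumptions, not details from the patent:

```python
class FirstModule:
    """Produces operation results; may skip transmitting a new result."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None

    def step(self, result):
        """Return the result if it must be transmitted, else None."""
        if self.last_sent is None or abs(result - self.last_sent) > self.threshold:
            self.last_sent = result
            return result
        return None  # second module must predict this interval's result

class SecondModule:
    """Consumes transmitted results; predicts when nothing arrives."""
    def __init__(self):
        self.last_received = None

    def receive(self, transmitted):
        if transmitted is not None:
            self.last_received = transmitted
            return transmitted
        # Prediction process: assume the result changed negligibly.
        return self.last_received
```

The point of the scheme is that skipping a transmission saves inter-module bandwidth whenever the second module's prediction is good enough.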
HARDWARE ACCELERATOR FOR PERFORMING COMPUTATIONS OF DEEP NEURAL NETWORK AND ELECTRONIC DEVICE INCLUDING THE SAME
A hardware accelerator includes a processing core including a plurality of multipliers configured to perform one-dimensional (1D) sub-word parallelism between a sign and a mantissa of a first tensor and a sign and a mantissa of a second tensor, a first processing device configured to operate in a two-dimensional (2D) operation mode in which results of computation by the plurality of multipliers are output, and a second processing device configured to operate in a three-dimensional (3D) operation mode in which results of computation by the plurality of multipliers are accumulated in a channel direction and then a result of accumulating the results of computation is output.
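A simplified functional model of the two modes (the sign/mantissa lane representation and mode names below are illustrative, not the patent's exact datapath): each multiplier lane combines signs by XOR and multiplies mantissas; 2D mode emits each product, while 3D mode accumulates products along the channel direction first.

```python
def lane_multiply(sign_a, man_a, sign_b, man_b):
    """One sub-word multiplier lane: XOR the signs, multiply the mantissas."""
    return sign_a ^ sign_b, man_a * man_b

def mode_2d(lane_results):
    """2D operation mode: output each multiplier's result directly."""
    return lane_results

def mode_3d(lane_results):
    """3D operation mode: accumulate signed products along the channel
    direction, then output the single accumulated result."""
    return sum((-m if s else m) for s, m in lane_results)
```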
Multiplier and Adder in Systolic Array
The subject matter described herein provides systems and techniques for the design and use of multiply-and-accumulate (MAC) units to perform matrix multiplication by systolic arrays, such as those used in accelerators for deep neural networks (DNNs). These MAC units may take advantage of the particular way in which matrix multiplication is performed within a systolic array. For example, when a matrix A is multiplied with a matrix B, the scalar value, a, of the matrix A is reused many times, the scalar value, b, of the matrix B may be streamed into the systolic array and forwarded to a series of MAC units in the systolic array, and only the final values and not the intermediate values of the dot products, computed for the matrix multiplication, may be correct. MAC unit hardware that is particularized to take advantage of these observations is described herein.
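A behavioral sketch of the dataflow these MAC units exploit (a weight-stationary arrangement is assumed here for concreteness): each cell holds a reused scalar of A, values of B stream past, and partial sums are forwarded from MAC to MAC, with only the value leaving the last cell treated as a correct dot product.

```python
def systolic_matmul(A, B):
    """Model of systolic matrix multiply: cell (i, k) holds the stationary
    scalar A[i][k]; B streams through; partial sums flow cell to cell.
    Only the value leaving the last cell of a chain is consumed, so
    intermediate partial sums need not be exactly representable."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            partial = 0
            for k in range(inner):          # forwarded MAC to MAC
                partial += A[i][k] * B[k][j]
            C[i][j] = partial               # only this final value is read out
    return C
```

Hardware MAC units built around these observations can, for example, defer normalization or carry propagation until the final value, since intermediate partial sums are never observed.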
Computer processor for higher precision computations using a mixed-precision decomposition of operations
Embodiments detailed herein relate to arithmetic operations on floating-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands whose values are in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among the lower precision values converted from the values of the plurality of operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.
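The idea can be illustrated with a two-way decomposition (a common mixed-precision trick, used here as an assumption about the general scheme rather than the patent's exact format): split each float64 into a float32-representable high part plus a residual low part, form the cross products of the parts, and sum.

```python
import struct

def split_hi_lo(x):
    """Split a float64 into a float32-rounded high part and the residual
    low part, so that hi + lo reconstructs x (to float64 precision)."""
    hi = struct.unpack('f', struct.pack('f', x))[0]
    lo = x - hi
    return hi, lo

def mp_multiply(a, b):
    """Multiply two values via products of their lower-precision parts,
    then recombine into a single result."""
    ah, al = split_hi_lo(a)
    bh, bl = split_hi_lo(b)
    return ah * bh + ah * bl + al * bh + al * bl
```

The four partial products each involve narrower operands than the original multiply, which is what lets lower-precision multiplier hardware deliver a higher-precision result.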
CIRCUIT FOR HANDLING PROCESSING WITH OUTLIERS
A system and method for handling processing with outliers. In some embodiments, the method includes: reading a first activation and a second activation, each including a least significant part and a most significant part, multiplying a first weight and a second weight by the respective activations, the multiplying of the first weight by the first activation including multiplying the first weight by the least significant part of the first activation in a first multiplier, the multiplying of the second weight by the second activation including: multiplying the second weight by the least significant part of the second activation in a second multiplier, and multiplying the second weight by the most significant part of the second activation in a shared multiplier, the shared multiplier being associated with a plurality of rows of an array of activations.
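A functional sketch of the split, assuming unsigned integer activations and a 4-bit least significant part (both are illustrative choices): dedicated narrow multipliers handle the least significant part of every activation, while a shared multiplier handles the most significant part, which is nonzero only for outliers.

```python
def split_activation(x, lsb_bits=4):
    """Split an activation into (least significant part, most significant part)."""
    mask = (1 << lsb_bits) - 1
    return x & mask, x >> lsb_bits

def mac_with_outliers(weights, activations, lsb_bits=4):
    """Multiply-accumulate where per-row multipliers cover the LSB parts
    and a shared multiplier covers the rare nonzero MSB (outlier) parts."""
    total = 0
    for w, a in zip(weights, activations):
        lsb, msb = split_activation(a, lsb_bits)
        total += w * lsb                     # dedicated narrow multiplier
        if msb:                              # shared multiplier, outliers only
            total += (w * msb) << lsb_bits
    return total
```

Because most activations fit in the least significant part, one shared wide multiplier can serve many rows without becoming a bottleneck.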
EFFICIENT COMPLEX MULTIPLY AND ACCUMULATE
Two commands each perform a partial complex multiply and accumulate. By using these two commands together, a full complex multiply and accumulate operation is performed. As compared to traditional implementations, this reduces the number of commands used from eight (four multiplies, a subtraction and three adds) to two. In some example embodiments, a single-instruction/multiple-data (SIMD) architecture is used to enable each command to perform multiple partial complex multiply and accumulate operations simultaneously, further increasing efficiency. One application of a complex multiply and accumulate is in generating images from pulse data of a radar or lidar. For example, an image may be generated from a synthetic aperture radar (SAR) on an autonomous vehicle (e.g., a drone). The image may be provided to a trained machine learning model that generates an output. Based on the output, inputs to control circuits of the autonomous vehicle are generated.
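One natural way to split a complex multiply and accumulate into two commands (a plausible reading of the abstract, not the patent's specified encoding) is by the real and imaginary parts of the first factor: for (a + bi)(c + di), the first command accumulates ac and ad, the second accumulates -bd and bc.

```python
def cmac_part1(acc, x, y):
    """First command: accumulate products involving the real part of x."""
    ar, ai = acc
    xr, _ = x
    yr, yi = y
    return (ar + xr * yr, ai + xr * yi)

def cmac_part2(acc, x, y):
    """Second command: accumulate products involving the imaginary part of x."""
    ar, ai = acc
    _, xi = x
    yr, yi = y
    return (ar - xi * yi, ai + xi * yr)

# Issuing both commands performs a full complex multiply and accumulate.
acc = cmac_part2(cmac_part1((0, 0), (1, 2), (3, 4)), (1, 2), (3, 4))
```

Each command is two multiplies and two adds, so the pair replaces the traditional eight-command sequence; in a SIMD implementation each command applies this across all lanes at once.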