Patent classifications
G06F9/302
Instruction and logic to provide vector linear interpolation functionality
Instructions and logic provide vector linear interpolation functionality. In some embodiments, responsive to an instruction specifying: a first operand from a set of vector registers, a size of each of the vector elements, a portion of the vector elements upon which to compute linear interpolations, a second operand from a set of vector registers, and a third operand; an execution unit, reads a first, a second and a third value of the size of vector elements from corresponding data fields in the first, the second and the third operand respectively and computes an interpolated value as the first value multiplied by the second value minus the second value multiplied by the third value plus the third value.
Automatically introducing register dependencies to tests
Method, apparatus and product for automatically introducing register dependency into tests. A test template represents an abstract test scenario to be utilized for testing a target processor. The abstract test scenario requires that a value be assigned to a register. A test that implements the abstract test scenario is generated. The test is a set of instructions that are executable by the target processor. The generation of the test comprises: determining a memory address to retain the value in a memory that is accessible to the target processor; and adding to the test an instruction to load to the register the value from the memory address, whereby adding a register dependency to the test that is not required by the abstract test scenario. The test can be executed on the target processor or simulation thereof.
Hardware unit for performing matrix multiplication with clock gating
Hardware units and methods for performing matrix multiplication via a multi-stage pipeline wherein the storage elements associated with one or more stages of the pipeline are clock gated based on the data elements and/or portions thereof that known to have a zero value (or can be treated as having a zero value). In some cases, the storage elements may be clock gated on a per data element basis based on whether the data element has a zero value (or can be treated as having a zero value). In other cases, the storage elements may be clock gated on a partial element basis based on the bit width of the data elements. For example, if bit width of the data elements is less than a maximum bit width for the data elements then a portion of the bits related to that data element can be treated as having a zero value and a portion of the storage elements associated with that data element may not be clocked. In yet other cases the storage elements may be clock gated on both a per element and a partial element basis.
Optimizing hardware FIFO instructions
Methods, systems, and apparatus for scheduling first-in-first-out instructions are described. In one aspect, a method includes receiving data representing code of a program to be executed by a processing unit comprising hardware processors. For each of one or more of the hardware processors, an order of independent groups of first-in-first-out (FIFO) instructions for execution by the hardware processor is identified in the data representing the code of the program. For each independent group of FIFO instructions for execution by the hardware processor, a path length metric that represents how long it will take to reach an end of the program from the independent group of FIFO instructions is determined. A new order of the independent groups of FIFO instructions for execution by the hardware processor is generated based at least on the path length metric for each independent group of FIFO instructions for execution by the hardware processor.
Apparatus and method for complex multiplication
An embodiment of the invention is a processor including execution circuitry to calculate, in response to a decoded instruction, a result of a complex multiplication of a first complex number and a second complex number. The calculation includes a first operation to calculate a first term of a real component of the result and a first term of the imaginary component of the result. The calculation also includes a second operation to calculate a second term of the real component of the result and a second term of the imaginary component of the result. The processor also includes a decoder, a first source register, and a second source register. The decoder is to decode an instruction to generate the decoded instruction. The first source register is to provide the first complex number and the second source register is to provide the second complex number.
Register-based matrix multiplication with multiple matrices per register
Techniques for performing matrix multiplication in a data processing apparatus are disclosed, comprising apparatuses, matrix multiply instructions, methods of operating the apparatuses, and virtual machine implementations. Registers, each register for storing at least four data elements, are referenced by a matrix multiply instruction and in response to the matrix multiply instruction a matrix multiply operation is carried out. First and second matrices of data elements are extracted from first and second source registers, and plural dot product operations, acting on respective rows of the first matrix and respective columns of the second matrix are performed to generate a square matrix of result data elements, which is applied to a destination register. A higher computation density for a given number of register operands is achieved with respect to vector-by-element techniques.
Data processing apparatus and method
The disclosure provides a data processing device and method. The data processing device may include: a task configuration information storage unit and a task queue configuration unit. The task configuration information storage unit is configured to store configuration information of tasks. The task queue configuration unit is configured to configure a task queue according to the configuration information stored in the task configuration information storage unit. According to the disclosure, a task queue may be configured according to the configuration information.
Apparatus and methods for matrix multiplication
Aspects for matrix multiplication in neural network are described herein. The aspects may include a controller unit configured to receive a matrix-multiply-matrix (MM) instruction that includes a first starting address of a first matrix, a first size of the first matrix, a second starting address of a second matrix, and a second size of the second matrix; multiple computation modules configured to respectively multiply, in response to the MM instruction, row vectors of the first matrix with column vectors of the second matrix to generate one or more result elements; and an interconnection unit configured to combine the result elements to generate one or more row vectors of a result matrix.
Instructions and logic to perform floating point and integer operations for machine learning
- Himanshu Kaul ,
- Mark A. Anders ,
- Sanu K. Mathew ,
- Anbang Yao ,
- Joydeep Ray ,
- Ping T. Tang ,
- Michael S. Strickland ,
- Xiaoming Chen ,
- Tatiana Shpeisman ,
- Abhishek R. Appu ,
- Altug Koker ,
- Kamal Sinha ,
- Balaji Vembu ,
- Nicolas C. Galoppo Von Borries ,
- Eriko Nurvitadhi ,
- Rajkishore Barik ,
- Tsung-Han Lin ,
- Vasanth Ranganathan ,
- Sanjeev Jahagirdar
A processing apparatus is provided comprising a multiprocessor having a multithreaded architecture. The multiprocessor can execute at least one single instruction to perform parallel mixed precision matrix operations. In one embodiment the apparatus includes a memory interface and an array of multiprocessors coupled to the memory interface. At least one multiprocessor in the array of multiprocessors is configured to execute a fused multiply-add instruction in parallel across multiple threads.
Inserting predefined pad values into a stream of vectors
Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a pad value indicator. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. A padded stream vector is formed that includes a specified pad value without accessing the pad value from system memory.