Patent classification: G06F15/8076
Dynamic acceleration of data processor operations using data-flow analysis
A method and apparatus are provided for dynamically determining when an operation, specified by one or more instructions in a data processing system, is suitable for accelerated execution. Data indicators are maintained, for data registers of the system, that indicate when data-flow from a register derives from a restricted source. In addition, instruction predicates are provided for instructions to indicate which instructions are capable of accelerated execution. From the data indicators and the instruction predicates, the microarchitecture of the data processing system determines, dynamically, when the operation is a thread-restricted function and suitable for accelerated execution in a hardware accelerator. The thread-restricted function may be executed on a hardware processor, such as a vector, neuromorphic or other processor.
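The provenance-tracking idea above can be sketched in a few lines of Python. This is a behavioral model, not the patented microarchitecture: the register names, the `ACCELERABLE` set, and the trace format are all hypothetical, chosen only to show indicators propagating through data-flow and gating the acceleration decision.

```python
# Hypothetical model of the data indicators and instruction predicates.
RESTRICTED_SOURCE = {"r1", "r2"}   # registers whose data derives from a restricted source
ACCELERABLE = {"vadd", "vmul"}     # instructions predicated as capable of acceleration

def is_thread_restricted(trace):
    """Walk an instruction trace of (op, dest, srcs) tuples, propagating
    per-register restricted-source indicators. Returns True when every
    instruction is accelerable and every operand derives, transitively,
    from a restricted source -- i.e. the operation may be dispatched to
    the hardware accelerator."""
    restricted = set(RESTRICTED_SOURCE)
    for op, dest, srcs in trace:
        if op not in ACCELERABLE:
            return False               # instruction predicate fails
        if not all(s in restricted for s in srcs):
            return False               # data indicator fails
        restricted.add(dest)           # result inherits restricted provenance
    return True

trace = [("vadd", "r3", ("r1", "r2")),
         ("vmul", "r4", ("r3", "r1"))]
```

Here the second instruction qualifies only because the first one's destination (`r3`) inherited the restricted indicator, which is the dynamic data-flow analysis the abstract describes.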
System and method for reducing non-linearity in mixed signal processing using complex polynomial vector processor
A system for reducing non-linearity in mixed signal processing using a complex polynomial vector processor (102) is provided. The complex polynomial vector processor (102) includes a data processing unit (104) and a coefficient feeder unit (106). The data processing unit (104) converts a high-speed data stream into polar-like (PL) format data and calculates the required polynomial powers for the high-speed data stream using the PL format data. The data processing unit (104) includes a multiplier accumulator (MAC) unit (206) that generates processed high-speed data and a delay unit (208) that combines time-separated input with the processed high-speed data to generate output data with reduced non-linearity.
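A minimal numeric sketch of the processing path, under loose assumptions: the polar-like conversion is reduced to taking the sample magnitude, the MAC unit to a polynomial sum over |x|^k powers, and the delay unit to a one-sample memory term. The function name and coefficient layout are illustrative, not taken from the patent.

```python
def correct_nonlinearity(samples, coeffs, delay_coeff=0.0):
    """Apply a complex polynomial correction to a stream of complex samples.

    coeffs[k] weights the term x * |x|**k (the polynomial powers computed
    from the polar-like magnitude); delay_coeff combines the previous,
    time-separated input with the processed sample, as the delay unit does.
    """
    out, prev = [], 0j
    for x in samples:
        mag = abs(x)                              # polar-like magnitude
        y = sum(c * x * mag**k for k, c in enumerate(coeffs))  # MAC
        out.append(y + delay_coeff * prev)        # delay-unit combination
        prev = x
    return out
```

With `coeffs=[1.0]` and no delay term the correction is the identity, which is a convenient sanity check on the structure.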
PROVIDING RECONFIGURABLE FUSION OF PROCESSING ELEMENTS (PEs) IN VECTOR-PROCESSOR-BASED DEVICES
Providing reconfigurable fusion of processing elements (PEs) in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device provides a vector processor including a plurality of PEs and a decode/control circuit. The decode/control circuit receives an instruction block containing a vectorizable loop comprising a loop body. The decode/control circuit determines how many PEs of the plurality of PEs are required to execute the loop body, and reconfigures the plurality of PEs into one or more fused PEs, each including the determined number of PEs required to execute the loop body. The plurality of PEs, reconfigured into one or more fused PEs, then executes one or more loop iterations of the loop body. Some aspects further include a PE communications link interconnecting the plurality of PEs, to enable communications between PEs of a fused PE and communications of inter-iteration data dependencies between PEs without requiring vector register file access operations.
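The fusion step can be modeled abstractly: count the PEs one loop-body iteration needs, then partition the available PEs into fused groups of that size. The sketch below is an assumed software model of that decode/control decision, with hypothetical names; it says nothing about the PE communications link or the actual circuit.

```python
def fuse_pes(num_pes, loop_body_ops):
    """Partition num_pes processing elements into fused PEs, each
    containing exactly the number of PEs required to execute one
    iteration of the loop body (here, one PE per loop-body operation).

    Returns a list of fused PEs, each a list of PE indices; leftover
    PEs that cannot form a complete fused PE are left unassigned."""
    needed = len(loop_body_ops)            # PEs required per iteration
    fused_count = num_pes // needed        # complete fused PEs that fit
    return [list(range(i * needed, (i + 1) * needed))
            for i in range(fused_count)]
```

Eight PEs and a two-operation loop body yield four fused PEs, so four loop iterations can execute concurrently.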
Registers in vector processors to store addresses for accessing vectors
Disclosed herein are vector index registers in vector processors that each store multiple addresses for accessing multiple positions in vectors. It is known to use scalar index registers in vector processors to access multiple positions of vectors by changing the scalar index registers in vector operations. By using a vector indexing register for indexing positions of one or more operand vectors, the scalar index register can be replaced and at least the continual changing of the scalar index register can be avoided.
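Behaviorally, a vector index register turns a loop of scalar-index updates into a single gather. The following sketch models that contrast in plain Python (list-based, purely illustrative):

```python
def gather(vec, vindex):
    """Access multiple positions of vec in one operation, using a vector
    index register that holds all the addresses at once -- no repeated
    rewriting of a scalar index register between accesses."""
    return [vec[i] for i in vindex]

def gather_scalar_indexed(vec, positions):
    """The replaced approach: one scalar index register, continually
    changed between vector operations."""
    out, scalar_index = [], 0
    for p in positions:
        scalar_index = p                   # rewrite the scalar index register
        out.append(vec[scalar_index])
    return out
```

Both produce the same result; the point of the vector index register is eliminating the per-access index-register updates.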
Vector processor with vector first and multiple lane configuration
A vector processor with a vector first and multi-lane configuration. A vector operation for a vector processor can include a single vector or multiple vectors as input. Multiple lanes for the input can be used to accelerate the operation in parallel, and a vector first configuration can further enhance the multiple lanes by reducing the number of elements accessed in the lanes to perform the operation in parallel.
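One plausible reading of the combination, sketched here as an assumed model (not the patented design): a vector-first value skips the leading elements of the input, and the remaining elements are distributed round-robin across the lanes, so each lane touches fewer elements than it would for the full vector.

```python
def assign_lanes(vec, first, num_lanes):
    """Distribute the elements of vec, starting at position `first`
    (the vector-first value), round-robin across num_lanes lanes.
    Elements before `first` are never accessed by any lane."""
    active = vec[first:]                   # vector first: skip leading elements
    return [active[lane::num_lanes] for lane in range(num_lanes)]
```

With `first=2` on a six-element vector and two lanes, each lane handles two elements instead of three.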
SYSTEM AND METHOD OF LOADING AND REPLICATION OF SUB-VECTOR VALUES
A processor includes a vector register configured to load data responsive to a special purpose load instruction. The processor also includes circuitry configured to replicate a selected sub-vector value from the vector register.
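The replication circuitry can be illustrated with a short behavioral sketch: select a sub-vector from the loaded register and broadcast it across the full register width. The function and its parameters are hypothetical names for illustration only.

```python
def replicate_subvector(vreg, start, length):
    """Select the sub-vector vreg[start:start+length] and replicate it
    to fill a register of the same width as vreg."""
    sub = vreg[start:start + length]
    reps = -(-len(vreg) // length)         # ceiling division: fill the register
    return (sub * reps)[:len(vreg)]
```

Replicating a two-element sub-vector across an eight-element register yields four copies, the classic "splat" pattern used to broadcast per-structure constants.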
Instruction and logic to provide vector compress and rotate functionality
Instructions and logic provide vector compress and rotate functionality. A processor may include a mask register, a decoder, and an execution unit. The mask register may include a data field, wherein the data field corresponds to an element location in a vector. The decoder may be coupled to the mask register. The decoder may decode an instruction to obtain a decoded instruction. The decoded instruction may specify a vector source, the mask register, a vector destination, and a vector destination offset location. The execution unit is coupled to the decoder. The execution unit may read an unmasked value in the data field; copy a vector element from the vector source to a location adjacent to the element; change the unmasked value to a masked value; determine that the vector destination is full; store a vector destination operand associated with the vector destination in a memory; and re-execute the instruction using the masked value and the vector destination offset location.
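The execution sequence above is easier to see as a behavioral sketch: unmasked source elements are packed contiguously into a fixed-width destination starting at the offset; when the destination fills, it is flushed as if stored to memory and packing resumes from the start of a fresh destination, mirroring the re-execution with updated mask and offset. The function below is an assumed model, with the convention that a truthy mask bit means "unmasked, still to be copied".

```python
def compress_store(src, mask, dest_len, offset):
    """Compress the unmasked elements of src into dest_len-wide
    destinations, starting at `offset` in the first destination.

    Returns (stored, dest, pos): full destinations flushed to memory,
    the final partially filled destination, and the write position in it."""
    stored, dest, pos = [], [0] * dest_len, offset
    for i, x in enumerate(src):
        if not mask[i]:
            continue                       # element masked off: skip it
        dest[pos] = x                      # copy adjacent to the previous element
        pos += 1                           # (mask bit would now flip to masked)
        if pos == dest_len:                # destination full: store to memory,
            stored.append(dest)            # then effectively re-execute with a
            dest, pos = [0] * dest_len, 0  # fresh destination at offset 0
    return stored, dest, pos
```

The rotate aspect shows up in how the stream wraps from the end of one destination register to the start of the next.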
Apparatus and method for transferring a plurality of data structures between memory and one or more vectors of data elements stored in a register bank
An apparatus and method are provided for transferring a plurality of data structures from memory into one or more vectors of data elements stored in a register bank. The apparatus has first interface circuitry to receive data structures retrieved from memory, where each data structure has an associated identifier and comprises N data elements. Multi-axial buffer circuitry is provided having an array of storage elements, where along a first axis the array is organized as N sets of storage elements, each set containing a plurality VL of storage elements, and where along a second axis the array is organized as groups of N storage elements, with each group containing a storage element from each of the N sets. Access control circuitry then stores the N data elements of a received data structure in one of the groups selected in dependence on the associated identifier. Responsive to an indication that all required data structures have been stored in the multi-axial buffer circuitry, second interface circuitry then outputs the data elements stored in one or more of the sets of storage elements as one or more corresponding vectors of data elements for storage in a register bank, each vector containing VL data elements. Such an approach can significantly increase the performance of handling such load operations, and give rise to potential energy savings.
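Functionally this is an array-of-structures to structure-of-arrays transpose, staged through the multi-axial buffer. The sketch below models just the data movement (identifiers, N sets, VL-wide vectors); the names and the `(identifier, elements)` tuple format are assumptions for illustration.

```python
def deinterleave(structures, N, VL):
    """Model of the multi-axial buffer.

    Each received data structure of N data elements is written into the
    group (second-axis column) selected by its identifier; once all VL
    structures have arrived, the N sets (first-axis rows) are read out
    as N vectors of VL data elements each, ready for the register bank."""
    sets = [[None] * VL for _ in range(N)]
    for ident, elems in structures:        # ident selects the storage group
        for n in range(N):
            sets[n][ident] = elems[n]      # element n goes to set n
    return sets
```

Note that structures may arrive from memory in any order; the identifier, not arrival order, fixes each structure's position within the output vectors, which is what lets the load machinery run ahead out of order.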
VECTOR MULTIPLY-ADD INSTRUCTION
An apparatus comprises processing circuitry, a number of vector registers and a number of scalar registers. An instruction decoder is provided which supports decoding of a vector multiply-add instruction specifying at least one vector register and at least one scalar register. In response to the vector multiply-add instruction, the decoder controls the processing circuitry to perform a vector multiply-add operation in which each lane of processing generates a respective result data element corresponding to a sum or difference of a product value and an addend value, with the product value comprising the product of a respective data element of a first vector value and a multiplier value. In each lane of processing, at least one of the multiplier value and the addend value is specified as a portion of a scalar value stored in a scalar register.
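One lane-by-lane variant of the claimed operation, as a behavioral sketch: the multiplier is taken from a scalar register and the addends from a second vector, and each lane computes element * scalar + addend. This is only one of the claimed forms (the scalar may instead supply the addend, and a subtract form is also covered); the function name is illustrative.

```python
def vmla_scalar(vec, scalar, addend_vec):
    """Vector multiply-add with a scalar multiplier: each lane forms the
    product of its data element of the first vector and the scalar value,
    then adds that lane's addend element."""
    return [element * scalar + addend
            for element, addend in zip(vec, addend_vec)]
```

Sourcing the multiplier from a scalar register avoids first splatting the scalar into a full vector register, which is the practical benefit of the mixed vector/scalar encoding.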