G06F9/38873

Solution to divergent branches in a SIMD core using hardware pointers
09639371 · 2017-05-02 · ·

A system and method for efficiently processing instructions in hardware parallel execution lanes within a processor. In response to a given divergent point within an identified loop, a compiler generates code wherein when executed determines a size of a next very large instruction world (VLIW) to process and determine multiple pointer values to store in multiple corresponding PC registers in a target processor. The updated PC registers point to instructions intermingled from different basic blocks between the given divergence point and a corresponding convergence point. The target processor includes a single instruction multiple data (SIMD) micro-architecture. The assignment for a given lane is based on branch direction found at runtime for the given lane at the given divergent point. The processor includes a vector register for mapping PC registers to execution lanes.

MULTIPROCESSOR DEVICE
20170116153 · 2017-04-27 · ·

A multiprocessor device includes external memory, processors, a memory aggregate unit, register memory, a multiplexer, and an overall control unit. The memory aggregate unit aggregates memory accesses of the processors. The register memory is prepared by a number equal to the product of the number of registers managed by the processors and the maximum number of processes of the processors. The multiplexer accesses the register memory according to a command given against register access of the processors. The overall control unit extracts a parameter from the command and provides the parameter to the processors and multiplexer, and controls them, as well as has a given number of processes consecutively processed using the same command while having addressing for the register memory changed by the processors, and when the given number of processes ends, has the command switched to a next command and processing repeated for a given number of processes.

Vector processing engines (VPEs) employing tapped-delay line(s) for providing precision correlation / covariance vector processing operations with reduced sample re-fetching and power consumption, and related vector processor systems and methods

Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision correlation/covariance vector processing operations with reduced sample re-fetching and/or power consumption are disclosed. The VPEs disclosed herein are configured to provide correlation/covariance vector processing operations, such as code division multiple access (CDMA) correlation/covariance vector processing operations as a non-limiting example. A tapped-delay line(s) is included in the data flow paths between memory and execution units in the VPE. The tapped-delay line (s) is configured to receive and provide an input vector data sample set to execution units for performing correlation/covariance vector processing operations. The tapped-delay line(s) is also configured to shift the input vector data sample set for each filter delay tap and provide the shifted input vector data sample set to the execution units, so the shifted input vector data sample set need not be re-fetched from the vector data file during the filter vector processing operations.

Matrix multiplication operations using pair-wise load and splat operations

Mechanisms for performing a matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A pair-wise load and splat operation is performed to load a pair of scalar values of a second vector operand and replicate the pair of scalar values within a second target vector register. An operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored. This operation may be repeated for a second pair of scalar values of the second vector operand.

Vector processor architectures

The present disclosure relates to an integrated circuit device that includes a plurality of vector registers configurable to store a plurality of vectors and switch circuitry communicatively coupled to the plurality of vector registers. The switch circuitry is configurable to route a portion of the plurality of vectors. Additionally, the integrated circuit device includes a plurality of vector processing units communicatively coupled to the switch circuitry. The plurality of vector processing units is configurable to receive the portion of the plurality of vectors, perform one or more operations involving the portion of the plurality of vector inputs, and output a second plurality of vectors generated by performing the one or more operations.

Distributed shared memory

Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to also access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as hacked by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications previously possible only within the same processing core.

Predication in a vector processor

Embodiments relate to vector processor predication in an active memory device. An aspect includes a system for vector processor predication in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions.

Vector processing in an active memory device

Embodiments relate to vector processing in an active memory device. An aspect includes a method for vector processing in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Based on the iteration count, execution of the sub-instructions in parallel is repeated for multiple iterations by the processing element. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions.

INSTRUCTION FOR ACCELERATING SNOW 3G WIRELESS SECURITY ALGORITHM

Vector instructions for performing SNOW 3G wireless security operations are received and executed by the execution circuitry of a processor. The execution circuitry receives a first operand of the first instruction specifying a first vector register that stores a current state of a finite state machine (FSM). The execution circuitry also receives a second operand of the first instruction specifying a second vector register that stores data elements of a liner feedback shift register (LFSR) that are needed for updating the FSM. The execution circuitry executes the first instruction to produce a updated state of the FSM and an output of the FSM in a destination operand of the first instruction.

SCALABLE SINGLE-INSTRUCTION-MULTIPLE-DATA INSTRUCTIONS
20170046168 · 2017-02-16 ·

A method for executing scalable single-instruction-multiple-data (SIMD) instructions includes performing a query to determine a hardware vector length of a SIMD processor. The method also includes scaling a first instruction of the scalable SIMD instructions to a first scaled vector length to generate a first scaled instruction. The first scaled vector length is based on the hardware vector length, and the first instruction is a compiled instruction having an adaptable vector length. The method also includes adjusting a first number of iterations to be used by the SIMD processor to perform first operations associated with the first instruction based on the first scaled vector length.