Patent classifications
G06F9/38873
PARALLEL PROCESSING DEVICE
A parallel processing device includes: a plurality of memories configured to output a plurality of pieces of memory output data respectively; a plurality of input units configured to output a plurality of pieces of input unit output data respectively; a plurality of addition units configured to receive the plurality of pieces of input unit output data, perform a parallel processing function and a data path configuration function according to a plurality of configuration values, and output a plurality of pieces of addition unit output data; and a plurality of delay units configured to delay the plurality of pieces of addition unit output data according to a clock signal, and output the plurality of pieces of delay data respectively. The plurality of pieces of input unit output data are selected from the plurality of pieces of memory output data and a plurality of pieces of delay data respectively.
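The feedback structure described above, in which each lane's adder input is selected between a memory output and the delayed (registered) output of a previous cycle, can be sketched as a small cycle-by-cycle simulation. The specific adder operation and the per-lane select flag below are illustrative assumptions, not taken from the claim:

```python
def step(memory_out, delayed, use_delay):
    """One clock cycle of a simplified model of the claimed device:
    each lane's input unit selects between the memory output and the
    delay-unit feedback, and the addition unit combines them.
    With use_delay=True each lane acts as a running accumulator."""
    return [m + (d if use_delay else 0) for m, d in zip(memory_out, delayed)]

delayed = [0, 0, 0, 0]            # delay-unit contents at reset
for cycle in range(3):            # stream the same memory word 3 times per lane
    delayed = step([1, 2, 3, 4], delayed, use_delay=True)
# delayed == [3, 6, 9, 12]: each lane accumulated its memory word 3 times
```

Changing the per-lane configuration value (here `use_delay`) reconfigures the data path between a plain load and an accumulate, which is the data path configuration function the abstract refers to.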
PROCESSING GROUPS OF DATA IN PARALLEL
The present application discloses a method and an apparatus for a processor, and a computer-readable storage medium. The method for a processor includes: reading a plurality of groups of data from a data set by using a first vector instruction, where each group of data includes a plurality of pieces of data; performing an extremum operation on the plurality of groups of data in parallel by using a second vector instruction to obtain a first group of intermediate results; and calculating an extreme value of the data set based on the first group of intermediate results. In the foregoing technical solution, a plurality of groups of data in a data set are operated on in parallel to determine an extreme value of the data set, which helps to improve the speed of data processing.
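The two-stage reduction described above can be sketched with NumPy standing in for the vector instructions (the group size and the choice of maximum as the extremum are illustrative assumptions):

```python
import numpy as np

def parallel_max(data, group_size=4):
    """Sketch of the claimed method: load several groups at once (first
    vector instruction), take an elementwise extremum across the groups in
    parallel (second vector instruction) to get one group of intermediate
    results, then reduce that group plus any tail elements."""
    n = len(data) // group_size * group_size
    groups = np.asarray(data[:n]).reshape(-1, group_size)
    intermediate = groups.max(axis=0)      # one group of intermediate results
    return int(max([intermediate.max(), *data[n:]]))

parallel_max([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])  # -> 9
```

The speedup comes from the second step: one vector-wide max per group replaces `group_size` scalar comparisons, leaving only a short final reduction over the intermediate group.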
HIGH PERFORMANCE PROCESSOR SYSTEM AND METHOD BASED ON GENERAL PURPOSE UNITS
This invention provides a high performance processor system and a method based on common general purpose units; the system may be configured into a variety of different processor architectures. Before the processor executes instructions, the instructions are filled into an instruction read buffer that is directly accessed by the processor core. The instruction read buffer then actively provides instructions to the processor core for execution, achieving a high cache hit rate.
OPTIMIZE CONTROL-FLOW CONVERGENCE ON SIMD ENGINE USING DIVERGENCE DEPTH
There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running SPMD (Single Program Multiple Data) code on a SIMD (Single Instruction Multiple Data) machine. The machine runs an instruction stream over input data streams. The machine increments the lane depth counters of all active lanes when the thread-PC reaches a branch operation. The machine updates the lane-PC of each active lane according to the targets of the branch operation. The machine selects an active lane and activates only lanes whose lane-PCs match the thread-PC. The machine decrements the lane depth counters of the selected active lanes and updates the lane-PC of each active lane when the instruction stream reaches a first instruction. The machine assigns the lane-PC of the lane with the largest lane depth counter value to the thread-PC and activates all lanes whose lane-PCs match the thread-PC.
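The lane-selection step at the heart of the scheme, picking the thread-PC from the most deeply diverged lane so that the deepest divergence reconverges first, can be sketched in isolation (the PC values and depths below are hypothetical, and the full increment/decrement bookkeeping around branches is omitted):

```python
def select_active(lane_pcs, depths):
    """Simplified model of one convergence decision: the lane with the
    largest divergence depth donates its lane-PC to the thread-PC, and
    only lanes whose lane-PC matches become active."""
    deepest = max(range(len(depths)), key=lambda i: depths[i])
    thread_pc = lane_pcs[deepest]
    active = [pc == thread_pc for pc in lane_pcs]
    return thread_pc, active

# Four lanes after nested divergence: lanes 0 and 2 sit at PC 40 (depth 1),
# lanes 1 and 3 diverged again and sit at PC 12 (depth 2).
thread_pc, active = select_active([40, 12, 40, 12], [1, 2, 1, 2])
# thread_pc == 12; lanes 1 and 3 run first, so the inner divergence
# reconverges before the outer one.
```

Preferring the largest depth counter is what distinguishes this policy from naive per-branch reconvergence: it bounds how long any lane can stay inactive under nested control flow.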
Method and apparatus to process 4-operand SIMD integer multiply-accumulate instruction
According to one embodiment, a processor includes an instruction decoder to receive an instruction to process a multiply-accumulate operation, the instruction having a first operand, a second operand, a third operand, and a fourth operand. The first operand is to specify a first storage location to store an accumulated value; the second operand is to specify a second storage location to store a first value and a second value; and the third operand is to specify a third storage location to store a third value. The processor further includes an execution unit coupled to the instruction decoder to perform the multiply-accumulate operation to multiply the first value with the second value to generate a multiply result and to accumulate the multiply result and at least a portion of a third value to an accumulated value based on the fourth operand.
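The arithmetic of the instruction can be sketched as a scalar model. The abstract does not fix how the fourth operand selects the portion of the third value, so the byte-select below (and the 16-bit width of `c`) is an illustrative assumption:

```python
def mac4(acc, a, b, c, ctrl):
    """Scalar sketch of the described 4-operand multiply-accumulate:
    acc <- acc + a*b + (portion of c).
    ctrl (the fourth operand) is assumed here to pick the low or high
    byte of a 16-bit c; the real encoding is not given in the abstract."""
    part = (c >> 8) & 0xFF if ctrl else c & 0xFF
    return acc + a * b + part

mac4(acc=10, a=3, b=4, c=0x0102, ctrl=0)  # 10 + 12 + 0x02 = 24
```

In the actual instruction this runs per SIMD lane, with the first operand's storage location serving as both accumulator source and destination.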
CODE GENERATION APPARATUS AND CODE GENERATION METHOD
An apparatus includes a processor configured to execute a process of generating a second program according to a first program, the second program including: a first process in which an operation according to a first operation instruction is performed iteratively on operation elements; a second process in which a mask bit string including as many mask bits as the number of operand elements is set; and a third process in which the operation according to a second operation instruction is performed on as many elements as the number of operand elements, the elements including one or more remainder operation elements not subjected to the operation in the first process and one or more non-operation elements that are masked off and excluded from being operated on.
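The generated-code shape described above, a full-width vector loop followed by a single masked vector operation for the leftover elements, can be sketched with NumPy standing in for the vector instructions (the add-one operation, vector width, and function name are illustrative assumptions):

```python
import numpy as np

def vector_add_one(x, vec_len=4):
    """Sketch of the three generated processes: an unmasked vector loop
    over full vec_len-wide chunks, a mask with one bit per operand
    element, and one masked vector operation for the remainder."""
    out = np.empty_like(x)
    n_full = len(x) // vec_len * vec_len
    # First process: the operation runs iteratively, vec_len at a time.
    for i in range(0, n_full, vec_len):
        out[i:i + vec_len] = x[i:i + vec_len] + 1
    rem = len(x) - n_full
    if rem:
        # Second process: mask bit string -- only the first `rem` lanes
        # are real remainder elements; the rest are non-operation lanes.
        mask = np.arange(vec_len) < rem
        chunk = np.zeros(vec_len, dtype=x.dtype)
        chunk[:rem] = x[n_full:]
        # Third process: one masked vector operation; masked-off lanes
        # are left untouched and never stored back.
        result = np.where(mask, chunk + 1, chunk)
        out[n_full:] = result[:rem]
    return out

vector_add_one(np.arange(10))  # -> [1 2 3 4 5 6 7 8 9 10]
```

The payoff is that the remainder costs one masked vector instruction instead of a scalar cleanup loop of up to `vec_len - 1` iterations.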
PROCESSING PIPELINE WITH ZERO LOOP OVERHEAD
Techniques are disclosed for reducing or eliminating loop overhead caused by function calls in processors that form part of a pipeline architecture. The processors in the pipeline process data blocks in an iterative fashion, with each processor in the pipeline completing one of several iterations associated with a processing loop for a commonly-executed function. The described techniques leverage the use of message passing for pipelined processors to enable an upstream processor to signal to a downstream processor when processing has been completed, and thus a data block is ready for further processing in accordance with the next loop processing iteration. The described techniques facilitate a zero loop overhead architecture, enable continuous data block processing, and allow the processing pipeline to function indefinitely within the main body of the processing loop associated with the commonly-executed function where efficiency is greatest.
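The message-passing arrangement described above can be sketched with threads and queues standing in for the pipelined processors and their signaling channel. Each stage performs exactly one iteration of the shared loop body per block and signals readiness downstream by a message, so no stage executes branch-and-compare loop overhead per block (the add-one loop body and three-stage depth are illustrative assumptions):

```python
import queue
import threading

def make_stage(inbox, outbox):
    """One pipelined processor: completes one iteration of the shared
    loop body per block, then signals the downstream stage (a queue put)
    that the block is ready for the next iteration."""
    def run():
        while True:
            block = inbox.get()
            if block is None:          # shutdown sentinel, forwarded downstream
                outbox.put(None)
                return
            outbox.put(block + 1)      # the single loop iteration this stage owns
    return threading.Thread(target=run)

# A 3-stage pipeline realizes a 3-iteration loop with zero loop overhead:
# blocks stream through the main body continuously.
queues = [queue.Queue() for _ in range(4)]
stages = [make_stage(queues[i], queues[i + 1]) for i in range(3)]
for s in stages:
    s.start()
for block in [10, 20, 30]:
    queues[0].put(block)
queues[0].put(None)
results = []
while (r := queues[-1].get()) is not None:
    results.append(r)
for s in stages:
    s.join()
# results == [13, 23, 33]: each block passed through all three iterations
```

Because the iteration count is unrolled across processors rather than counted in software, the pipeline can process an unbounded stream of blocks while every processor stays inside the hot loop body.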
Vector processing engines (VPEs) employing reordering circuitry in data flow paths between execution units and vector data memory to provide in-flight reordering of output vector data stored to vector data memory, and related vector processor systems and methods
Vector processing engines (VPEs) employing reordering circuitry in data flow paths between execution units and vector data memory to provide in-flight reordering of output vector data stored to vector data memory are disclosed. Related vector processor systems and methods are also disclosed. Reordering circuitry is provided in data flow paths between execution units and vector data memory in the VPE. The reordering circuitry is configured to reorder output vector data sample sets from execution units as a result of performing vector processing operations in-flight while the output vector data sample sets are being provided over the data flow paths from the execution units to the vector data memory to be stored. In this manner, the output vector data sample sets are stored in the reordered format in the vector data memory without requiring additional post-processing steps, which may delay subsequent vector processing operations to be performed in the execution units.
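The in-flight reordering data path can be sketched functionally: the permutation is applied to the output sample set on its way to memory, in the same store step, rather than by a later pass over stored data (the doubling operation and the particular permutation below are illustrative assumptions):

```python
import numpy as np

def execute_and_store(inputs, memory, base, reorder):
    """Sketch of the described data path: the execution unit's output
    vector sample set passes through reordering circuitry while being
    stored, so it lands in vector data memory already permuted and no
    post-processing reordering pass is needed."""
    out = inputs * 2                            # stand-in for a vector operation
    memory[base:base + len(out)] = out[reorder]  # reorder in-flight, during the store
    return memory

mem = np.zeros(4, dtype=int)
execute_and_store(np.array([1, 2, 3, 4]), mem, 0, np.array([3, 1, 2, 0]))
# mem == [8, 4, 6, 2]
```

A software equivalent would store `out` first and permute it afterward; folding the permutation into the store path removes that extra memory traffic and avoids delaying the next vector operation.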