G06F9/38873

Predication in a vector processor

Embodiments relate to vector processor predication in an active memory device. An aspect includes a method for vector processor predication in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions.

APPARATUS AND METHOD FOR TRANSFERRING A PLURALITY OF DATA STRUCTURES BETWEEN MEMORY AND A PLURALITY OF VECTOR REGISTERS

An apparatus and method are provided for transferring a plurality of data structures between memory and a plurality of vector registers, each vector register being arranged to store a vector operand comprising a plurality of data elements. Access circuitry is used to perform access operations to move data elements of vector operands between the data structures in memory and specified vector registers, each data structure comprising multiple data elements stored at contiguous addresses in the memory. Decode circuitry is responsive to a single access instruction identifying a plurality of vector registers and a plurality of data structures that are located discontiguously with respect to each other in the memory, to generate control signals to control the access circuitry to perform a sequence of access operations to move the plurality of data structures between the memory and the plurality of vector registers such that the vector operand in each vector register holds a corresponding data element from each of the plurality of data structures. This provides a very efficient mechanism for performing complex access operations, resulting in an increase in execution speed, and potential reductions in power consumption.

Processor architecture and method for simplifying programming single instruction, multiple data within a register

The present disclosure provides a processor, and associated method, for performing parallel processing within a register. An exemplary processor may include a processing element having a compute unit and a register file. The register file includes a register that is divisible into lanes for parallel processing. The processor may further include a mask register and a predicate register. The mask register and the predicate register respective include a number of mask bits and predicate bits equal to a maximum number of divisible lanes of the register. A state of the mask bits and predicate bits is set to respectively achieve enabling/disabling of the lanes from executing an instruction and conditional performance of an operation defined by the instruction. Further, the processor is operable to perform a reduction operation across the lanes of the processing element and/or generate an address for each of the lanes of the processing element.

Data processing apparatus and method for performing segmented operations

A data processing apparatus and method are provided for performing segmented operations. The data processing apparatus comprises a vector register store for storing vector operands, and vector processing circuitry providing N lanes of parallel processing, and arranged to perform a segmented operation on up to N data elements provided by a specified vector operand, each data element being allocated to one of the N lanes. The up to N data elements forms a plurality of segments, and performance of the segmented operation comprises performing a separate operation on the data elements of each segment, the separate operation involving interaction between the lanes containing the data elements of the associated segment. Predicate generation circuitry is responsive to a compute descriptor instruction specifying an input vector operand comprising a plurality of segment descriptors, to generate per lane predicate information used by the vector processing circuitry when performing the segmented operation to maintain a boundary between each of the plurality of segments. As a result, interaction between lanes containing data elements from different segments is prevented. This allows very effective utilisation of the lanes of parallel processing within the vector processing circuitry to be achieved.

AI ACCELERATOR APPARATUS USING FULL MESH CONNECTIVITY CHIPLET DEVICES FOR TRANSFORMER WORKLOADS

An AI accelerator apparatus using in-memory compute chiplet devices. The apparatus includes a first semiconductor substrate having a plurality of chiplets, each of which includes a plurality of tiles. Each tile includes a plurality of slices, a central processing unit (CPU), and a hardware dispatch device. Each slice can include a digital in-memory compute (DIMC) device configured to perform high throughput computations. In particular, the DIMC device can be configured to accelerate the computations of attention functions for transformer-based models (a.k.a. transformers) applied to machine learning applications. The chiplets are in a full mesh connectivity configuration such that at least one of the die-to-die (D2D) interconnects of each chiplet is coupled to one of the D2D interconnects of each other chiplet using a non-diagonal link. The chiplets can also include other interfaces to facilitate communication between the chiplets, memory and a server or host system.

Integrated circuit with control node circuitry and processing circuitry

Traditionally, providing parallel processing within a multi-core system has been very difficult. Here, however, a system is provided where serial source code is automatically converted into parallel source code, and a processing cluster is reconfigured on the fly to accommodate the parallelized code based on an allocation of memory and compute resources. Thus, the processing cluster and its corresponding system programming tool provide a system that can perform parallel processing from a serial program that is transparent to a user. Generally, a control node connected to the address and data leads of a host processor uses messages to control the processing of data in a processing cluster. The cluster includes nodes of parallel processors, shared function memory, a global load/store, and hardware accelerators all connected to the control node by message busses. A crossbar data interconnect routes data to the cluster circuits separate from the message busses.

SEMICONDUCTOR DEVICE
20170017489 · 2017-01-19 ·

A semiconductor device includes a central processing unit capable of executing a vector instruction. The vector instruction is an instruction to calculate a vector register for every element, combine the additional information based on the calculated result for every element, shift the contents of a register different from the vector register to right or left, insert the combined additional information in an empty portion resulting from the shift, and accumulate the additional information in the register.

Vector processing in an active memory device

Embodiments relate to vector processing in an active memory device. An aspect includes a system for vector processing in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Execution of the sub-instructions is repeated in parallel for multiple iterations, by the processing element, based on the iteration count. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions.

Method and apparatus to process 4-operand SIMD integer multiply-accumulate instruction

According to one embodiment, a processor includes an instruction decoder to receive an instruction to process a multiply-accumulate operation, the instruction having a first operand, a second operand, a third operand, and a fourth operand. The first operand is to specify a first storage location to store an accumulated value; the second operand is to specify a second storage location to store a first value and a second value; and the third operand is to specify a third storage location to store a third value. The processor further includes an execution unit coupled to the instruction decoder to perform the multiply-accumulate operation to multiply the first value with the second value to generate a multiply result and to accumulate the multiply result and at least a portion of a third value to an accumulated value based on the fourth operand.

Analytical model to optimize deep learning models

Techniques for optimizing and deploying deep neural network (CNN) machine learning models for inference using static analysis are described. A method includes obtaining a deep neural network (DNN) machine learning (ML) model, generating an intermediate representation for the ML model, the intermediate representation including one or more nodes corresponding to one or more operators utilized by the ML model, identifying, for at least one node of the intermediate representation, an optimized schedule for at least one operator corresponding to the at least one node using a static analysis that is based on a hardware-specific cost model, generating an optimized intermediate representation using the optimized schedule that is optimized for execution on a hardware platform, and generating code corresponding to the ML model based at least in part on the optimized intermediate representation, wherein the code is specific to the hardware platform.