G06F15/8084

METHOD AND SYSTEM FOR PARTIAL WAVEFRONT MERGER

A method and system for partial wavefront merger is described. Vector processing machines employ the partial wavefront merger to merge partial wavefronts into one or more wavefronts. The system includes a partial wavefront manager and unified registers. The partial wavefront manager detects wavefronts in different single-instruction-multiple-data (SIMD) units which contain inactive work items and active work items (hereinafter referred to as partial wavefronts), moves the partial wavefronts into one or more SIMD unit(s) and merges the partial wavefronts into one or more wavefront(s). The unified registers allow each active work item in the one or more merged wavefront(s) to access the previously allocated registers in the originating SIMD units. Consequently, the contents of the unified registers do not have to be copied to the SIMD unit(s) executing the one or more merged wavefront(s).
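The merging step can be illustrated with a minimal sketch. It is not the claimed implementation: the `WorkItem` record, the `merge_partial_wavefronts` function, and the `(active, reg_base)` lane encoding are hypothetical names chosen for illustration. Each merged work item keeps a reference to its originating SIMD unit and register base, which models access through unified registers without copying register contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    simd_id: int   # SIMD unit that originally held the work item
    lane: int      # lane of the work item in its original wavefront
    reg_base: int  # base of the registers already allocated to the item


def merge_partial_wavefronts(wavefronts, width):
    """Pack the active work items of partial wavefronts into merged wavefronts.

    `wavefronts` maps a SIMD id to a list of (active, reg_base) pairs, one per
    lane (a hypothetical encoding). Returns a list of merged wavefronts, each
    holding at most `width` work items.
    """
    active = [
        WorkItem(simd_id, lane, reg_base)
        for simd_id, lanes in sorted(wavefronts.items())
        for lane, (is_active, reg_base) in enumerate(lanes)
        if is_active
    ]
    # Only references are regrouped; no register contents are copied.
    return [active[i:i + width] for i in range(0, len(active), width)]


partial = {
    0: [(True, 0), (False, 4), (True, 8), (False, 12)],
    1: [(False, 0), (True, 4), (True, 8), (False, 12)],
}
merged = merge_partial_wavefronts(partial, width=4)
```

In this example two half-empty wavefronts of width 4 collapse into one full wavefront, so one SIMD unit is freed for other work.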

General-Purpose Systolic Array

A systolic array cell is described, the cell including two general-purpose arithmetic logic units (ALUs) and a register file. A plurality of the cells may be configured in a matrix or array, such that the output of the first ALU in a first cell is provided to a second cell to the right of the first cell, and the output of the second ALU in the first cell is provided to a third cell below the first cell. The two ALUs in each cell of the array allow for processing of a different instruction in each cycle.

Vector processor with vector first and multiple lane configuration
11907158 · 2024-02-20

A vector processor with a vector first and multi-lane configuration is described. A vector operation for a vector processor can include a single vector or multiple vectors as input. Multiple lanes for the input can be used to accelerate the operation in parallel. A vector first configuration can further enhance the multiple lanes by reducing the number of elements accessed in the lanes to perform the operation in parallel.

PROVIDING MULTI-ELEMENT MULTI-VECTOR (MEMV) REGISTER FILE ACCESS IN VECTOR-PROCESSOR-BASED DEVICES

Providing multi-element multi-vector (MEMV) register file access in vector-processor-based devices is disclosed. In this regard, a vector-processor-based device includes a vector processor comprising multiple processing elements (PEs) communicatively coupled via a corresponding plurality of channels to a vector register file comprising a plurality of memory banks. The vector processor provides a direct memory access (DMA) controller that is configured to receive a plurality of vectors that each comprise a plurality of vector elements representing operands for processing a loop iteration. The DMA controller arranges the vectors in the vector register file such that, for each group of vectors to be accessed in parallel, vector elements for each vector are stored consecutively, but corresponding vector elements of consecutive vectors are stored in different memory banks of the vector register file. As a result, multiple elements of multiple vectors may be accessed with a single vector register file access operation.
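One arrangement that satisfies the stated property can be sketched as an address mapping. This is an assumption for illustration, not the layout the DMA controller is claimed to use: the function name `memv_location` and the parameters `num_banks` and `vec_len` are hypothetical. Each vector's elements sit at consecutive offsets in one bank, and consecutive vectors are assigned to different banks.

```python
def memv_location(vector_idx, elem_idx, num_banks, vec_len):
    """Map (vector, element) to a (bank, offset) pair in the register file.

    Elements of one vector occupy consecutive offsets in a single bank, and
    consecutive vectors go to different banks, so the same element of several
    vectors can be read in one access (one word per bank).
    """
    bank = vector_idx % num_banks
    offset = (vector_idx // num_banks) * vec_len + elem_idx
    return bank, offset


# Element 3 of four consecutive vectors, with 4 banks and 8-element vectors.
locations = [memv_location(v, 3, num_banks=4, vec_len=8) for v in range(4)]
```

Because the four locations fall in four distinct banks, a single register file access can return all of them without a bank conflict.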

Method and apparatus for performing a vector permute with an index and an immediate

A processor for performing a vector permute comprises: a source vector register to store a plurality of source data elements; a destination vector register to store a plurality of destination data elements; a control vector register to store a plurality of control data elements, each control data element corresponding to one of the destination data elements and including an N bit value indicating whether a source data element is to be copied to the corresponding destination data element; vector permute logic to compare the N bit value of each control data element to an N bit portion of an immediate to determine whether to copy a source data element to the corresponding destination data element, wherein if the N bit values match, then the vector permute logic is to identify a source data element using an index value included in the control data element.
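The comparison-then-index behaviour can be modelled with a short sketch. Several details are assumptions not stated in the abstract: each control element is represented as a hypothetical `(tag, index)` pair, the N-bit portion of the immediate is taken to be its low N bits, and destination elements whose tag does not match are left unchanged.

```python
def vector_permute(src, dst, ctrl, imm, n_bits):
    """Model of a permute gated by an immediate.

    src    : source data elements
    dst    : current destination data elements
    ctrl   : one (tag, index) pair per destination element (assumed encoding)
    imm    : immediate; its low n_bits are compared with each tag (assumption)
    n_bits : width N of the compared value
    """
    mask = (1 << n_bits) - 1
    key = imm & mask
    out = list(dst)
    for i, (tag, index) in enumerate(ctrl):
        if (tag & mask) == key:
            # Tags match: copy the source element selected by the index.
            out[i] = src[index]
        # Otherwise the destination element is left as it was (assumption).
    return out


result = vector_permute(
    src=[10, 20, 30, 40],
    dst=[0, 0, 0, 0],
    ctrl=[(1, 3), (2, 0), (1, 1), (0, 2)],
    imm=0b01,
    n_bits=2,
)
```

With the immediate set to 1, only destination elements 0 and 2 carry a matching tag, so only those two positions receive source data.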

Systems, apparatuses, and methods for setting an output mask in a destination writemask register from a source write mask register using an input writemask and immediate

Embodiments of systems, apparatuses, and methods for performing in a computer processor generation of a predicate mask based on vector comparison in response to a single instruction are described.

Hardware processors and methods for tightly-coupled heterogeneous computing

Methods and apparatuses relating to tightly-coupled heterogeneous computing are described. In one embodiment, a hardware processor includes a plurality of execution units in parallel, a switch to connect inputs of the plurality of execution units to outputs of a first buffer and a plurality of memory banks and connect inputs of the plurality of memory banks and a plurality of second buffers in parallel to outputs of the first buffer, the plurality of memory banks, and the plurality of execution units, and an offload engine with inputs connected to outputs of the plurality of second buffers.

ARITHMETIC UNIT
20190171614 · 2019-06-06

Provided is an arithmetic unit that reduces the number of parts, because it is not necessary to prepare an operation device for each input processing logic. A plurality of types of input processing logic is stored in a ROM, and a CPU selects one of the plurality of types of input processing logic and executes input processing according to the selected input processing logic. As a result, there is no need to prepare an ECU for each input processing logic, which reduces the number of parts.

NEOHARRY: HIGH-PERFORMANCE PARALLEL MULTI-LITERAL MATCHING ALGORITHM

Methods and embodiments of a high-performance parallel multi-literal matching algorithm called NeoHarry are described. A chunk of data comprising a character string of n bytes is sampled from a byte stream, and data in the sampled chunk are pre-shifted to create shifted copies of data at multiple sampled locations. A mask table is generated having column vectors containing match indicia identifying potential character matches. A lookup of the mask table at multiple sampled locations using the pre-shifted data is performed for a target literal character pattern. The mask table lookup results are combined to generate match candidates, and exact match verification is performed to identify any generated match candidates that match the target literal character pattern. NeoHarry uses a column-vector-based shift-or model and implements a cross-domain shift algorithm under which character patterns spanning two domains are identified.

Method and apparatus for performing a vector bit shuffle

A processor includes a first vector register for storing a plurality of source data elements, a second vector register for storing a plurality of control elements, and vector bit shuffle logic. Each of the control elements in the second vector register corresponds to a different source data element and includes a plurality of bit fields. Each of the bit fields is associated with a single corresponding bit position in a destination mask register and identifies a single bit from the corresponding source data element to be copied to the single corresponding bit position in the destination mask register. The vector bit shuffle logic is to read the bit fields from the second vector register and, for each bit field, to identify a single bit from a single corresponding source data element and copy it to a single corresponding bit position in the destination mask register.
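The bit-selection behaviour can be modelled in a few lines. This is a sketch under stated assumptions, not the instruction's actual encoding: the function name `vector_bit_shuffle` is hypothetical, each control element is represented as a list of bit indices (its bit fields), and mask positions are assigned in order, element by element and field by field.

```python
def vector_bit_shuffle(sources, controls):
    """Build a destination mask from selected source bits.

    sources  : source data elements as non-negative integers
    controls : one list of bit indices per source element; each index is one
               bit field naming the source bit to copy (assumed encoding)
    Returns the destination mask as an integer, filled from bit 0 upward.
    """
    mask = 0
    pos = 0  # next bit position in the destination mask (assumed ordering)
    for value, fields in zip(sources, controls):
        for bit_index in fields:
            selected = (value >> bit_index) & 1
            mask |= selected << pos
            pos += 1
    return mask


shuffled = vector_bit_shuffle(
    sources=[0b1010, 0b0101],
    controls=[[1, 0], [0, 1]],
)
```

Here each of the two source elements contributes two mask bits, so the result is a 4-bit mask whose bits come from source bits 1 and 0 of the first element and bits 0 and 1 of the second.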