Patent classifications
G06F9/30038
NEOHARRY: HIGH-PERFORMANCE PARALLEL MULTI-LITERAL MATCHING ALGORITHM
Methods and embodiments of a high-performance parallel multi-literal matching algorithm called NeoHarry. A chunk of data comprising a character string of n bytes is sampled from a byte stream, and data in the sampled chunk are pre-shifted to create shifted copies of data at multiple sampled locations. A mask table is generated having column vectors containing match indicia identifying potential character matches. A lookup of the mask table at multiple sampled locations using the pre-shifted data is performed for a target literal character pattern. The mask table lookup results are combined to generate match candidates, and exact match verification is performed to identify any generated match candidates that match the target literal character pattern. NeoHarry uses a column-vector-based shift-or model and implements a cross-domain shift algorithm under which character patterns spanning two domains are identified.
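The column-vector model described above builds on the classic shift-or (bitap) technique. The following is a minimal single-pattern sketch of that underlying model, not the patent's parallel multi-literal implementation; all names are illustrative.

```python
def shift_or_search(text: bytes, pattern: bytes):
    """Yield start offsets in `text` where `pattern` matches exactly."""
    m = len(pattern)
    # Mask table: for each byte value, a bitmask with 0 bits at the
    # pattern positions where that byte occurs.
    masks = [~0 for _ in range(256)]
    for i, b in enumerate(pattern):
        masks[b] &= ~(1 << i)
    state = ~0  # all 1s: no partial match in progress
    for pos, b in enumerate(text):
        state = (state << 1) | masks[b]
        if state & (1 << (m - 1)) == 0:  # bit m-1 cleared => full match
            yield pos - m + 1
```

The mask table here plays the same role as the patent's column vectors: one lookup per input byte, combined with shifts and ORs, with exact matches confirmed when the final bit position clears.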
Permute Instructions for Register-Based Lookups
Permute instructions for register-based lookups are described. In accordance with the described techniques, a processor is configured to perform a register-based lookup by retrieving a first result from a first lookup table based on a subset of bits included in an index of a destination register, retrieving a second result from a second lookup table based on the subset of bits included in the index of the destination register, selecting the first result or the second result based on a bit in the index of the destination register that is excluded from the subset of bits, and overwriting data included in the index of the destination register using a selected one of the first result or the second result.
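A scalar sketch of the selection scheme described above, assuming (illustratively, not from the patent) 16-entry tables, a 4-bit entry-selector subset, and bit 4 as the table-select bit:

```python
def permute2(table_lo, table_hi, indices):
    """Overwrite each index with a value looked up across two 16-entry
    tables; the excluded bit picks which table's result is kept."""
    out = []
    for idx in indices:
        lo = idx & 0x0F            # subset of bits: entry selector
        first = table_lo[lo]       # result from the first lookup table
        second = table_hi[lo]      # result from the second lookup table
        # Bit 4, excluded from the subset, selects between the results.
        out.append(second if idx & 0x10 else first)
    return out
```

Each destination element is thus replaced in place by one of two candidate lookups, mirroring how a pair of permute operations plus a select can emulate a single 32-entry table lookup in registers.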
DATA COMPRESSION USING INSTRUCTION SET ARCHITECTURE
In one embodiment, a computing system may set data to a first group of registers. The first group of registers may be configured to be accessed during a single operation cycle. The system may set a number of patterns to a second group of registers. Each pattern of the number of patterns may include an array of indices for the data stored in the first group of registers. The system may select, for a first vector register associated with a vector engine, a first pattern from the patterns stored in the second group of registers. The system may load a first portion of the data from the first group of registers to the first vector register based on the first pattern selected for the first vector register from the patterns stored in the second group of registers.
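The pattern-driven load described above amounts to a gather: a stored index array selects which data elements land in the vector register. A minimal sketch, with lists standing in for the two register groups (names are illustrative, not from the patent):

```python
def gather_by_pattern(data, patterns, pattern_id):
    """Load a vector register by indexing the data registers with the
    selected pattern (an array of indices into `data`)."""
    pattern = patterns[pattern_id]          # pattern from second group
    return [data[i] for i in pattern]       # gathered vector contents
```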
Content-addressable processing engine
A content-addressable processing engine, also referred to herein as CAPE, is provided. Processing-in-memory (PIM) architectures attempt to overcome the von Neumann bottleneck by combining computation and storage logic into a single component. CAPE provides a general-purpose PIM microarchitecture that provides acceleration of vector operations while being programmable with standard reduced instruction set computing (RISC) instructions, such as RISC-V instructions with standard vector extensions. CAPE can be implemented as a standalone core that specializes in associative computing, and that can be integrated in a tiled multicore chip alongside other types of compute engines. Certain embodiments of CAPE achieve average speedups of 14× (up to 254×) over an area-equivalent out-of-order processor core tile with three levels of caches across a diverse set of representative applications.
Systems, apparatuses, and methods for performing a double blocked sum of absolute differences
Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described.
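The real instruction produces multiple packed SAD results per execution, with the immediate controlling which second-source quadruplet each result uses. A scalar sketch of one such result (the offset parameter stands in for the immediate-controlled selection; names are illustrative):

```python
def sad4(a, b):
    """Sum of absolute differences of two 4-element blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))

def double_block_sad(src1, src2, offset):
    """One result lane: SAD of a quadruplet from the first source
    against a quadruplet selected from the second source at `offset`."""
    return sad4(src1[0:4], src2[offset:offset + 4])
```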
Convert from zoned format to decimal floating point format
Machine instructions, referred to herein as a long Convert from Zoned instruction (CDZT) and extended Convert from Zoned instruction (CXZT), are provided that read EBCDIC or ASCII data from memory, convert it to the appropriate decimal floating point format, and write it to a target floating point register or floating point register pair. Further, machine instructions, referred to herein as a long Convert to Zoned instruction (CZDT) and extended Convert to Zoned instruction (CZXT), are provided that convert a decimal floating point (DFP) operand in a source floating point register or floating point register pair to EBCDIC or ASCII data and store it to a target memory location.
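In zoned format, each byte holds a zone nibble and a digit nibble, with the sign carried in the zone of the rightmost byte (0xD meaning negative for EBCDIC). A software stand-in for the CDZT-style conversion direction, using Python's `Decimal` in place of a hardware DFP register:

```python
from decimal import Decimal

def zoned_to_decimal(zoned: bytes) -> Decimal:
    """Convert EBCDIC zoned-decimal bytes to a Decimal value."""
    digits = [b & 0x0F for b in zoned]      # digit nibble of each byte
    sign_zone = zoned[-1] >> 4              # zone nibble of last byte
    value = int("".join(str(d) for d in digits))
    return Decimal(-value if sign_zone == 0x0D else value)
```

The reverse (CZDT-style) direction would re-attach 0xF zones to each digit and encode the sign in the final zone nibble.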
Systems, apparatuses, and methods for cumulative summation
Systems, methods, and apparatuses for executing an instruction are described. For example, an instruction includes at least an opcode, a field for a packed data source operand, and a field for a packed data destination operand. When executed, the instruction causes, for each data element position of the packed data source operand, the addition of the value stored in that data element position to all values stored in preceding data element positions, and stores the result of the addition in the corresponding data element position of the packed data destination operand.
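The per-element behavior described above is an inclusive prefix sum. A scalar sketch (illustrative; the instruction performs this across packed data elements in one operation):

```python
def cumulative_sum(src):
    """Inclusive prefix sum: dst[i] = src[0] + src[1] + ... + src[i]."""
    dst, running = [], 0
    for v in src:
        running += v
        dst.append(running)
    return dst
```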
METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE POPULATION COUNT FUNCTIONALITY FOR GENOME SEQUENCING AND ALIGNMENT
Instructions and logic provide SIMD vector population count functionality. Some embodiments store, in each data field of a portion of n data fields of a vector register or memory vector, at least two bits of data. In a processor, a SIMD instruction for a vector population count is executed such that, for that portion of the n data fields in the vector register or memory vector, the occurrences of binary values equal to each of a first one or more predetermined binary values are counted, and the counted occurrences are stored, in a portion of a destination register corresponding to the portion of the n data fields in the vector register or memory vector, as a first one or more counts corresponding to the first one or more predetermined binary values.
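With genome data, each base typically occupies a 2-bit field, so the population count tallies how often each 2-bit value appears. A scalar sketch of one register's worth of fields (packing and names are illustrative assumptions, not from the patent):

```python
def count_2bit_values(word, n_fields, targets=(0b00, 0b01, 0b10, 0b11)):
    """Count occurrences of each predetermined 2-bit value among
    n_fields packed 2-bit fields of `word` (LSB-first packing)."""
    counts = {t: 0 for t in targets}
    for i in range(n_fields):
        field = (word >> (2 * i)) & 0b11   # extract the i-th 2-bit field
        if field in counts:
            counts[field] += 1
    return counts
```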
APPARATUSES, METHODS, AND SYSTEMS FOR ELEMENT SORTING OF VECTORS
Systems, methods, and apparatuses relating to element sorting of vectors are described. In one embodiment, a processor includes a decoder to decode an instruction into a decoded instruction; and an execution unit to execute the decoded instruction to: provide storage for a comparison matrix to store a comparison value for each element of an input vector compared against the other elements of the input vector, perform a comparison operation on elements of the input vector corresponding to storage of comparison values above a main diagonal of the comparison matrix, perform a different operation on elements of the input vector corresponding to storage of comparison values below the main diagonal of the comparison matrix, and store results of the comparison operation and the different operation in the comparison matrix.
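The asymmetric comparisons above and below the main diagonal give equal elements distinct ranks, so duplicates sort stably without collisions. A scalar sketch of the idea, using `<` above the diagonal and `<=` below it (an illustrative choice consistent with the abstract, not necessarily the patent's exact operations):

```python
def sort_by_comparison_matrix(vec):
    """Sort by counting, for each element, how many others compare
    below it; the asymmetric diagonal treatment makes ranks unique."""
    n = len(vec)
    rank = [0] * n
    for i in range(n):
        for j in range(n):
            if i < j and vec[j] < vec[i]:       # above main diagonal
                rank[i] += 1
            elif i > j and vec[j] <= vec[i]:    # below main diagonal
                rank[i] += 1
    out = [0] * n
    for i in range(n):
        out[rank[i]] = vec[i]   # ranks are a permutation of 0..n-1
    return out
```

In hardware, all n² comparisons can be evaluated in parallel and the row counts reduce directly to element ranks, which is what makes the comparison-matrix formulation attractive for a vector execution unit.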