G06F9/30025

Implementing specialized instructions for accelerating Smith-Waterman sequence alignments

Various techniques for accelerating Smith-Waterman sequence alignments are provided. For example, threads in a group of threads are employed to use an interleaved cell layout to store relevant data in registers while computing sub-alignment data for one or more local alignment problems. In another example, specialized instructions that reduce the number of cycles required to compute each sub-alignment score are utilized. In another example, threads are employed to compute sub-alignment data for a subset of columns of one or more local alignment problems while other threads begin computing sub-alignment data based on partial result data received from the preceding threads. After computing a maximum sub-alignment score, a thread stores the maximum sub-alignment score and the corresponding position in global memory.

Computing device and method

The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.

Bit string operations in memory
11709673 · 2023-07-25 · ·

Systems, apparatuses, and methods related to bit string operations in memory are described. The bit string operations may be performed within a memory array without transferring the bit strings or intermediate results of the operations to circuitry external to the memory array. For instance, sensing circuitry that can include a sense amplifier and a compute component can be coupled to a memory array. A controller can be coupled to the sensing circuitry and can be configured to cause one or more bit strings that are formatted according to a universal number format or a posit format to be transferred from the memory array to the sensing circuitry. The sensing circuitry can perform an arithmetic operation, a logical operation, or both using the one or more bit strings.

SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS

Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.

Apparatus and method for ray tracing instruction processing and execution

An apparatus and method to execute ray tracing instructions. For example, one embodiment of an apparatus comprises execution circuitry to execute a dequantize instruction to convert a plurality of quantized data values to a plurality of dequantized data values, the dequantize instruction including a first source operand to identify a plurality of packed quantized data values in a source register and a destination operand to identify a destination register in which to store a plurality of packed dequantized data values, wherein the execution circuitry is to convert each packed quantized data value in the source register to a floating point value, to multiply the floating point value by a first value to generate a first product and to add the first product to a second value to generate a dequantized data value, and to store the dequantized data value in a packed data element location in the destination register.

Bit string operations in memory
11714640 · 2023-08-01 · ·

Systems, apparatuses, and methods related to bit string operations in memory are described. The bit string operations may be performed within a memory array without transferring the bit strings or intermediate results of the operations to circuitry external to the memory array. For instance, sensing circuitry that can include a sense amplifier and a compute component can be coupled to a memory array. A controller can be coupled to the sensing circuitry and can be configured to cause one or more bit strings that are formatted according to a universal number format or a posit format to be transferred from the memory array to the sensing circuitry. The sensing circuitry can perform an arithmetic operation, a logical operation, or both using the one or more bit strings.

Computing device and method

The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.

COMPUTER PROCESSOR FOR HIGHER PRECISION COMPUTATIONS USING A MIXED-PRECISION DECOMPOSITION OF OPERATIONS
20230214215 · 2023-07-06 ·

Embodiments detailed herein relate to arithmetic operations of float-point values. An exemplary processor includes decoding circuitry to decode an instruction, where the instruction specifies locations of a plurality of operands, values of which being in a floating-point format. The exemplary processor further includes execution circuitry to execute the decoded instruction, where the execution includes to: convert the values for each operand, each value being converted into a plurality of lower precision values, where an exponent is to be stored for each operand; perform arithmetic operations among lower precision values converted from values for the plurality of the operands; and generate a floating-point value by converting a resulting value from the arithmetic operations into the floating-point format and store the floating-point value.

Neural network processor and neural network computation method

The present disclosure provides a neural network processor and neural network computation method that deploy a memory and a cache to perform a neural network computation, where the memory may be configured to store data and instructions of the neural network computation, the cache may be connected to the memory via a memory bus, thereby, the actual compute ability of hardware may be fully utilized, the cost and power consumption overhead may be reduced, parallelism of the network may be fully utilized, and the efficiency of the neural network computation may be improved.

System and method for INT9 quantization

A method of converting a data stored in a memory from a first format to a second format is disclosed. The method includes extending a number of bits in the data stored in a double data rate (DDR) memory by one bit to form an extended data. The method further includes determining whether the data stored in the DDR is signed or unsigned data. Moreover, responsive to determining that the data is signed, a sign value is added to the most significant bit of the extended data and the data is copied to lower order bits of the extended data. Responsive to determining that the data is unsigned, the data is copied to lower order bits of the extended data and the most significant bit is set to an unsigned value, e.g., zero. The extended data is stored in an on-chip memory (OCM) of a processing tile of a machine learning computer array.