G06F9/3455

INSTRUCTION AND LOGIC FOR SUM OF SQUARE DIFFERENCES

In an embodiment, a processor includes: a fetch circuit to fetch instructions, the instructions including a sum of squared differences (SSD) instruction; a decode circuit to decode the SSD instruction; and an execution circuit to, during an execution of the decoded SSD instruction, generate an SSD output vector based on a plurality of input vectors, the SSD output vector including a plurality of squared differences values. Other embodiments are described and claimed.

Automatic compiler dataflow optimization to enable pipelining of loops with local storage requirements

Systems, apparatuses and methods may provide for technology that detects one or more local variables in source code, wherein the local variable(s) lack dependencies across iterations of a loop in the source code, automatically generate pipeline execution code for the local variable(s), and incorporate the pipeline execution code into an output of a compiler. In one example, the pipeline execution code includes an initialization of a pool of buffer storage for the local variable(s).

DATA PROCESSING METHOD AND APPARATUS APPLIED TO GRAPHICS PROCESSING UNIT, AND ELECTRONIC DEVICE
20220188380 · 2022-06-16 ·

This application provides a data processing method applied to a GPU, including: reading a to-be-processed matrix from memory corresponding to a target GPU into registers in a target streaming multiprocessor, where the target streaming multiprocessor is a streaming multiprocessor adapted to perform a matrix computation on the to-be-processed matrix; in a process of reading the to-be-processed matrix from the registers into shared memory corresponding to the registers, performing a first preset operation on the to-be-processed matrix to obtain an initial matrix corresponding to the to-be-processed matrix; and in a process of reading the initial matrix from the shared memory into the registers, performing a second preset operation on the initial matrix to obtain a target matrix meeting a matrix multiplication operation requirement of a matrix-dedicated computation unit. The data processing method applied to a graphics processing unit (Graphics Processing Unit, GPU) can transform a to-be-processed matrix into a target matrix meeting a matrix multiplication operation requirement of the GPU, thereby improving applicability of the GPU for matrix multiplication operations.

INSERTING NULL VECTORS INTO A STREAM OF VECTORS

Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array, a null vector count (N), and a selected dimension. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. N null stream vectors are inserted into the stream of vectors for the selected dimension without fetching respective null data from the memory.

Memory device and computing in memory method thereof

A computing in memory method for a memory device is provided. The computing in memory method includes: based on a stride parameter, unfolding a kernel into a plurality of sub-kernels and a plurality of complement sub-kernels; based on the sub-kernels and the complement sub-kernels, writing a plurality of weights into a plurality of target memory cells of a memory array of the memory device; inputting an input data into a selected word line of the memory array; performing a stride operation in the memory array; temporarily storing a plurality of partial sums; and summing the stored partial sums into a stride operation result when all operation cycles are completed.

TWO-DIMENSIONAL ZERO PADDING IN A STREAM OF MATRIX ELEMENTS

Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for two selected dimensions of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When either selected dimension in the stream of vectors exceeds a respective specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.

ONE-DIMENSIONAL ZERO PADDING IN A STREAM OF MATRIX ELEMENTS
20220147484 · 2022-05-12 ·

Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for a selected dimension of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When the selected dimension in the stream of vectors exceeds the specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.

Method and apparatus for a page-local delta-based prefetcher

A method includes recording a first set of consecutive memory access deltas, where each of the consecutive memory access deltas represents a difference between two memory addresses accessed by an application, updating values in a prefetch training table based on the first set of memory access deltas, and predicting one or more memory addresses for prefetching responsive to a second set of consecutive memory access deltas and based on values in the prefetch training table.

Tensor-based memory access

A processor includes an internal memory and processing circuitry. The internal memory is configured to store a definition of a multi-dimensional array stored in an external memory, and indices that specify elements of the multi-dimensional array in terms of multi-dimensional coordinates of the elements within the array. The processing circuitry is configured to execute instructions in accordance with an Instruction Set Architecture (ISA) defined for the processor. At least some of the instructions in the ISA access the multi-dimensional array by operating on the multi-dimensional coordinates specified in the indices.

Vector generating instruction for generating a vector comprising a sequence of elements that wraps as required

An apparatus and method are provided for performing vector processing operations. In particular the apparatus has processing circuitry to perform the vector processing operations and an instruction decoder to decode vector instructions to control the processing circuitry to perform the vector processing operations specified by the vector instructions. The instruction decoder is responsive to a vector generating instruction identifying a scalar start value and wrapping control information, to control the processing circuitry to generate a vector comprising a plurality of elements. In particular, the processing circuitry is arranged to generate the vector such that the first element in the plurality is dependent on the scalar start value, and the values of the plurality of elements follow a regularly progressing sequence that is constrained to wrap as required to ensure that each value is within bounds determined from the wrapping control information. The vector generating instruction can be useful in a variety of situations, a particular use case being to implement a circular addressing mode within memory, where the vector generating instruction can be coupled with an associated vector memory access instruction. Such an approach can remove the need to provide additional logic within the memory access path to support such circular addressing.