G06F9/3455

ACCESSING DATA IN MULTI-DIMENSIONAL TENSORS
20170220352 · 2017-08-03 ·

Methods, systems, and apparatus, including an apparatus for processing an instruction for accessing a N-dimensional tensor, the apparatus including multiple tensor index elements and multiple dimension multiplier elements, where each of the dimension multiplier elements has a corresponding tensor index element. The apparatus includes one or more processors configured to obtain an instruction to access a particular element of a N-dimensional tensor, where the N-dimensional tensor has multiple elements arranged across each of the N dimensions, and where N is an integer that is equal to or greater than one; determine, using one or more tensor index elements of the multiple tensor index elements and one or more dimension multiplier elements of the multiple dimension multiplier elements, an address of the particular element; and output data indicating the determined address for accessing the particular element of the N-dimensional tensor.

Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type

A processor includes a first prefetcher that prefetches data in response to memory accesses and a second prefetcher that prefetches data in response to memory accesses. Each of the memory accesses has an associated memory access type (MAT) of a plurality of predetermined MATs. The processor also includes a table that holds first scores that indicate effectiveness of the first prefetcher to prefetch data with respect to the plurality of predetermined MATs and second scores that indicate effectiveness of the second prefetcher to prefetch data with respect to the plurality of predetermined MATs. The first and second prefetchers selectively defer to one another with respect to data prefetches based on their relative scores in the table and the associated MATs of the memory accesses.

RUN LENGTH ENCODING AWARE DIRECT MEMORY ACCESS FILTERING ENGINE FOR SCRATCHPAD ENABLED MULTICORE PROCESSORS

Techniques are described herein for efficient movement of data from a source memory to a destination memory. In an embodiment, in response to a particular memory location being pushed into a first register within a first register space, the first set of electronic circuits accesses a descriptor stored at the particular memory location. The descriptor indicates a width of a column of tabular data, a number of rows of tabular data, and one or more tabular data manipulation operations to perform on the column of tabular data. The descriptor also indicates a source memory location for accessing the tabular data and a destination memory location for storing data manipulation result from performing the one or more data manipulation operations on the tabular data. Based on the descriptor, the first set of electronic circuits determines control information indicating that the one or more data manipulation operations are to be performed on the tabular data and transmits the control information, using a hardware data channel, to a second set of electronic circuits to perform the one or more operations. Based on the control information, the second set of electronic circuits retrieve the tabular data from source memory location and apply the one or more data manipulation operations to generate the data manipulation result. The second set of electronic circuits cause the data manipulation result to be stored at the destination memory location.

INSTRUCTION AND LOGIC TO PROVIDE VECTOR LOADS AND STORES WITH STRIDES AND MASKING FUNCTIONALITY

Instructions and logic provide vector loads and/or stores with stride and mask functionality. In one implementation, a processor is provided that includes decode circuitry to decode an instruction specifying a memory address and a stride length for a set of load operations corresponding to a first plurality of data elements of a destination register. The processor further includes one or more execution units, responsive to the decoded first instruction, to load a first data element from the memory address into a first data element of the destination register, and load a second data element from a memory address that is non-zero multiple of the stride length into a first data element of the destination register.

Method and Computing System for Handling Instruction Execution Using Affine Register File on Graphic Processing Unit
20170269931 · 2017-09-21 ·

The present invention provides an affine engine design to the microarchitecture of the graphic processing unit, in which an operand type detection is performed, and then physical scalar, affine, or vector registers and corresponding ALUs with maximum performance improving and energy saving are allocated to perform instruction execution. In runtime, affine and uniform instructions are executed by the affine engine, while general vector instructions are executed by a vector engine, thereby the affine/uniform instruction execution can be dispatched to the affine engine, so the vector engine can enter a power-saving state to save the energy consumption of the GPU.

Two-dimensional zero padding in a stream of matrix elements

Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for two selected dimensions of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When either selected dimension in the stream of vectors exceeds a respective specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.

METHOD AND APPARATUS FOR A PAGE-LOCAL DELTA-BASED PREFETCHER
20210406183 · 2021-12-30 ·

A method includes recording a first set of consecutive memory access deltas, where each of the consecutive memory access deltas represents a difference between two memory addresses accessed by an application, updating values in a prefetch training table based on the first set of memory access deltas, and predicting one or more memory addresses for prefetching responsive to a second set of consecutive memory access deltas and based on values in the prefetch training table.

Arithmetic processing apparatus and method for controlling arithmetic processing apparatus
11200057 · 2021-12-14 · ·

An arithmetic processing apparatus includes: a memory; and a processor coupled to the memory, wherein the processor: detects whether intervals of a plurality of addresses to be accessed by a memory access instruction that performs memory access to the plurality of addresses by a single instruction are all the same; decodes the memory access instruction as the single instruction when detecting that the intervals are all the same; decodes the memory access instruction as a plurality of instructions when detecting that the intervals are not all the same; and performs the memory access in accordance with the single instruction or the plurality of instructions.

MEMORY ADDRESS GENERATOR
20210382820 · 2021-12-09 ·

A memory address generator for generating an address of a location in a memory includes a first address input for receiving a first address having a location in the memory being accessed during a first memory access cycle, and a next address output configured to output a next address comprising a location in the memory to be accessed during a subsequent memory access cycle based on the current address and a memory address increment value. The address increment unit includes a counter arrangement and a selector arrangement, wherein each counter of the counter arrangement is configured to provide an output signal at the output indicative of a maximum value being reached and the selector arrangement is configured to provide a candidate memory address increment value based on the output of the counter arrangement as the memory address increment value output by the address increment unit.

METHOD AND TENSOR TRAVERSAL ENGINE FOR STRIDED MEMORY ACCESS DURING EXECUTION OF NEURAL NETWORKS
20210373895 · 2021-12-02 ·

A tensor traversal engine in a processor system comprising a source memory component and a destination memory component, the tensor traversal engine comprising: a control signal register storing a control signal for a strided data transfer operation from the source memory component to the destination memory component, the control signal comprising an initial source address, an initial destination address, a first source stride length in a first dimension, and a first source stride count in the first dimension; a source address register communicatively coupled to the control signal register; a destination address register communicatively coupled to the control signal register; a first source stride counter communicatively coupled to the control signal register; and control logic communicatively coupled to the control signal register, the source address register, and the first source stride counter.