G06F9/345

Method and tensor traversal engine for strided memory access during execution of neural networks

A tensor traversal engine in a processor system comprising a source memory component and a destination memory component, the tensor traversal engine comprising: a control signal register storing a control signal for a strided data transfer operation from the source memory component to the destination memory component, the control signal comprising an initial source address, an initial destination address, a first source stride length in a first dimension, and a first source stride count in the first dimension; a source address register communicatively coupled to the control signal register; a destination address register communicatively coupled to the control signal register; a first source stride counter communicatively coupled to the control signal register; and control logic communicatively coupled to the control signal register, the source address register, and the first source stride counter.

Thread group scheduling for graphics processing

Embodiments are generally directed to thread group scheduling for graphics processing. An embodiment of an apparatus includes a plurality of processors including a plurality of graphics processors to process data; a memory; and one or more caches for storage of data for the plurality of graphics processors, wherein the one or more processors are to schedule a plurality of groups of threads for processing by the plurality of graphics processors, the scheduling of the plurality of groups of threads including the plurality of processors to apply a bias for scheduling the plurality of groups of threads according to a cache locality for the one or more caches.

METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
20230229448 · 2023-07-20 ·

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.

Multicast network and memory transfer optimizations for neural network hardware acceleration
11704548 · 2023-07-18 · ·

In one embodiment, a system to deterministically transfer partitions of contiguous computer readable data in constant time includes a computer readable memory and a modulo address generator. The computer readable memory is organized into D banks, to contain contiguous data including a plurality of data elements of size M which are constituent data elements of a vector with N data elements, the data elements to start at an offset address O. The modulo address generator is to generate the addresses of the data elements of a vector with i data elements stored in the computer readable memory, the modulo address generator including at least one forward permutaton to permute data elements with addresses of the form O+M*i where 0<=i<N. Other embodiments are described and claimed.

Multicast network and memory transfer optimizations for neural network hardware acceleration
11704548 · 2023-07-18 · ·

In one embodiment, a system to deterministically transfer partitions of contiguous computer readable data in constant time includes a computer readable memory and a modulo address generator. The computer readable memory is organized into D banks, to contain contiguous data including a plurality of data elements of size M which are constituent data elements of a vector with N data elements, the data elements to start at an offset address O. The modulo address generator is to generate the addresses of the data elements of a vector with i data elements stored in the computer readable memory, the modulo address generator including at least one forward permutaton to permute data elements with addresses of the form O+M*i where 0<=i<N. Other embodiments are described and claimed.

Accumulating data values and storing in first and second storage devices

Herein described is a method of operating an accumulation process in a data processing apparatus. The accumulation process comprises a plurality of accumulations which output a respective plurality of accumulated values, each based on a stored value and a computed value generated by a data processing operation. The method comprises storing a first accumulated value, the first accumulated value being one of said plurality of accumulated values, into a first storage device comprising a plurality of single-bit storage elements; determining that a predetermined trigger has been satisfied with respect to the accumulation process; and in response to the determining, storing at least a portion of a second accumulated value, the second accumulated value being one of said plurality of accumulated values, into a second storage device.

Accumulating data values and storing in first and second storage devices

Herein described is a method of operating an accumulation process in a data processing apparatus. The accumulation process comprises a plurality of accumulations which output a respective plurality of accumulated values, each based on a stored value and a computed value generated by a data processing operation. The method comprises storing a first accumulated value, the first accumulated value being one of said plurality of accumulated values, into a first storage device comprising a plurality of single-bit storage elements; determining that a predetermined trigger has been satisfied with respect to the accumulation process; and in response to the determining, storing at least a portion of a second accumulated value, the second accumulated value being one of said plurality of accumulated values, into a second storage device.

HANDLING OF SINGLE-COPY-ATOMIC LOAD/STORE INSTRUCTION
20230017802 · 2023-01-19 ·

In response to a single-copy-atomic load/store instruction for requesting an atomic transfer of a target block of data between the memory system and the registers, where the target block has a given size greater than a maximum data size supported for a single load/store micro-operation by a load/store data path, instruction decoding circuitry maps the single-copy-atomic load/store instruction to two or more mapped load/store micro-operations each for requesting transfer of a respective portion of the target block of data. In response to the mapped load/store micro-operations, load/store circuitry triggers issuing of a shared memory access request to the memory system to request the atomic transfer of the target block of data of said given size to or from the memory system, and triggers separate transfers of respective portions of the target block of data over the load/store data path.

STREAMING ADDRESS GENERATION

A digital signal processor having at least one streaming address generator, each with dedicated hardware, for generating addresses for writing multi-dimensional streaming data that comprises a plurality of elements. Each at least one streaming address generator is configured to generate a plurality of offsets to address the streaming data, and each of the plurality of offsets corresponds to a respective one of the plurality of elements. The address of each of the plurality of elements is the respective one of the plurality of offsets combined with a base address.

METHOD AND DEVICE FOR PROVIDING A VECTOR STREAM INSTRUCTION SET ARCHITECTURE EXTENSION FOR A CPU
20230214217 · 2023-07-06 ·

A method and device for providing a vector stream instruction set architecture extension for a CPU. In one aspect, there is provided a vector stream engine unit comprising: a first fast memory storage for temporarily storing data of vector data streams from a memory for loading into a vector register file; a second fast memory storage for temporarily storing data of the vector data streams from the vector register file for loading into the memory; a prefetcher configured to prefetch data of the vector data streams from the memory into the first fast storage memory, and prefetch data of the vector data streams from the vector register file into the second fast storage memory; and a stream configuration table (SCT) storing stream information for prefetching data from the vector data streams.