G06F9/3455

Apparatus for Memory Configuration for Array Processor and Associated Methods

An apparatus includes an array processor to process at least one array. The apparatus further includes a memory coupled to the array processor. The at least one array is stored in the memory with programmable per-dimension size and stride values.
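The abstract gives no implementation details, but per-dimension size and stride addressing can be sketched in a few lines. The function name and the byte-based layout below are illustrative assumptions, not taken from the patent:

```python
def element_offset(index, strides):
    """Flat byte offset of a multi-dimensional array element, given one
    programmable stride per dimension (a strided memory layout)."""
    assert len(index) == len(strides)
    return sum(i * s for i, s in zip(index, strides))

# Example: a 3x4 array of 4-byte elements stored row-major.
sizes = (3, 4)                   # per-dimension size values
strides = (16, 4)                # per-dimension stride values, in bytes
offset = element_offset((2, 1), strides)   # -> 36
```

Changing only the stride values lets the same index arithmetic walk the array column-major, access a sub-array, or skip elements, which is the flexibility programmable per-dimension values provide.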

Apparatus for Array Processor with Program Packets and Associated Methods

An apparatus includes an array processor to process array data in response to information contained in a packet, wherein the packet comprises a set of fields specifying configuration information for processing the array.

Apparatus for Processor with Macro-Instruction and Associated Methods

An apparatus includes an array processor to process array data in response to a set of macro-instructions. A macro-instruction in the set of macro-instructions performs loop operations, array iteration operations, and/or arithmetic logic unit (ALU) operations.

LEARNING DEEP LATENT VARIABLE MODELS BY SHORT-RUN MCMC INFERENCE WITH OPTIMAL TRANSPORT CORRECTION
20220398446 · 2022-12-15

Learning latent variable models with deep top-down architectures typically requires inferring the latent variables for each training example based on the posterior distribution of those latent variables. The inference step relies on either time-consuming long-run Markov chain Monte Carlo (MCMC) sampling or a separate inference model for variational learning. Embodiments of a short-run MCMC, such as a short-run Langevin dynamics, are used herein as an approximate flow-based inference engine. Bias existing in the output distribution of non-convergent short-run Langevin dynamics may be corrected by optimal transport (OT), which aims at transforming the biased distribution produced by finite-step MCMC to the prior distribution with a minimum transport cost. Experimental results verify the effectiveness of the OT correction for the short-run MCMC and demonstrate that latent variable models trained by the disclosed strategy perform better than the variational auto-encoder in terms of image reconstruction, generation, and anomaly detection.
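For intuition, the short-run Langevin update itself is standard and can be sketched as below. The toy target (a standard normal, so the score is simply -z) and all parameter values are illustrative assumptions; the patent's models, step counts, and the OT correction are not reproduced here:

```python
import numpy as np

def short_run_langevin(grad_log_post, z0, steps=20, step_size=0.1, rng=None):
    """K-step (deliberately non-convergent) Langevin dynamics:
    z <- z + (s^2 / 2) * grad log p(z | x) + s * noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = (z + 0.5 * step_size**2 * grad_log_post(z)
               + step_size * rng.standard_normal(z.shape))
    return z

# Toy posterior: standard normal, so grad log p(z|x) = -z.
z_init = np.full(8, 5.0)
z_short = short_run_langevin(lambda z: -z, z_init, steps=30)
```

Because the chain is cut off after a small, fixed number of steps, its output distribution is biased away from the target; the abstract's OT correction addresses exactly that residual bias.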

CONVOLUTION WITH KERNEL EXPANSION AND TENSOR ACCUMULATION

Certain aspects of the present disclosure provide techniques for kernel expansion. An input data tensor is received at a first layer in a neural network, and a first convolution is performed for a first kernel, where the first kernel has a size greater than a preferred size. Performing the first convolution comprises generating a plurality of intermediate tensors by performing a plurality of intermediate convolutions using a plurality of intermediate kernels with a size of the preferred size, and accumulating the plurality of intermediate tensors to generate an output tensor for the first convolution.
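The decomposition is easiest to see in one dimension: a correlation with an oversized kernel equals the sum of correlations with sub-kernels of the preferred size, each intermediate result accumulated at the sub-kernel's offset. The sketch below is a NumPy illustration of that identity, not the disclosure's hardware implementation; names and the sub-kernel size are assumptions:

```python
import numpy as np

def corr_valid(x, k):
    """1-D valid cross-correlation (direct reference computation)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def expanded_corr(x, k, sub=3):
    """Oversized-kernel correlation via sub-kernels of the preferred
    size `sub`, accumulating the intermediate tensors."""
    n_out = len(x) - len(k) + 1
    out = np.zeros(n_out)
    for off in range(0, len(k), sub):
        # Intermediate convolution with one preferred-size sub-kernel.
        part = corr_valid(x, k[off:off + sub])
        # Accumulate, shifted by the sub-kernel's offset in the big kernel.
        out += part[off:off + n_out]
    return out

x = np.arange(10.0)
k = np.arange(6.0)    # "too large": twice the preferred size of 3
```

The same accumulation argument extends to 2-D kernels, where a large spatial kernel is tiled into preferred-size pieces and the intermediate output tensors are summed.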

Systems for performing instructions for fast element unpacking into 2-dimensional registers

Disclosed embodiments relate to instructions for fast element unpacking. In one example, a processor includes fetch circuitry to fetch an instruction whose format includes fields to specify an opcode and the locations of an Array-of-Structures (AOS) source matrix and one or more Structure-of-Arrays (SOA) destination matrices. The specified opcode calls for unpacking elements of the specified AOS source matrix into the specified SOA destination matrices: the AOS source matrix is to contain N structures, each containing K elements of different types, with same-typed elements in consecutive structures separated by a stride, and the SOA destination matrices together contain K segregated groups, each containing N same-typed elements. The processor further includes decode circuitry to decode the fetched instruction, and execution circuitry, responsive to the decoded instruction, to unpack each element of the specified AOS matrix into one of the K element types of the one or more SOA matrices.
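The AOS-to-SOA transformation the instruction performs can be modeled in a few lines. This is a software sketch of the data movement only (with N=4 structures of K=3 fields assumed for the example), not the instruction's encoding or circuitry:

```python
import numpy as np

# AOS source: N structures of K differently-typed elements each, laid out
# contiguously, so same-typed elements are separated by a stride of K.
N, K = 4, 3
aos = np.arange(N * K)              # [s0.a, s0.b, s0.c, s1.a, s1.b, ...]

# SOA destinations: K segregated groups of N same-typed elements each.
soa = [aos[field::K] for field in range(K)]
# soa[0] gathers every 'a' field -> [0, 3, 6, 9]
```

Each destination group is just a strided slice of the source at a different field offset, which is why the hardware description centers on a per-type stride.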

Prefetch strategy control for parallel execution of threads based on one or more characteristics of a stream of program instructions indicative that a data access instruction within a program is scheduled to be executed a plurality of times

A single instruction multiple thread (SIMT) processor includes execution circuitry, prefetch circuitry, and prefetch strategy selection circuitry. The prefetch strategy selection circuitry serves to detect one or more characteristics of a stream of program instructions being executed in order to identify whether or not a given data access instruction within a program will be executed a plurality of times. The prefetch strategy to use is selected from a plurality of selectable prefetch strategies in dependence upon the detected characteristics.

ACCELERATION OF OPERATIONS
20220342666 · 2022-10-27

Apparatuses, systems, and techniques to reduce a sequence of operations to an equivalent sequence having a smaller number of operations. In at least one embodiment, a sequence of matrix operations is accelerated by combining operations that reorder a matrix with a matrix multiplication operation.
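The abstract does not name a specific reordering, but the transpose-then-multiply case is one illustrative instance of folding a reorder into the multiply. The NumPy sketch below contrasts materializing the reordered matrix with a single pass that consumes the original layout; it demonstrates the algebraic equivalence only, not the disclosure's mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 5))

# Unfused: explicitly materialize the reordered (transposed) matrix,
# then perform the matrix multiplication as a second operation.
explicit = np.ascontiguousarray(A.T) @ B

# Combined: one pass that reads A in its original layout and folds the
# reorder into the multiplication's index pattern.
fused = np.einsum('ki,kj->ij', A, B)
```

Eliminating the intermediate copy is where the acceleration comes from: the reorder costs memory traffic but no arithmetic, so absorbing it into the multiply removes an entire pass over the data.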

Method and Apparatus for Gather/Scatter Operations in a Vector Processor
20220342590 · 2022-10-27

In one implementation, a vector processor gather/scatter apparatus comprises a plurality of vector ports and a random access memory, where the plurality of vector ports are in communication with the random access memory, and where one or more of the vector ports uses one or more address registers and one or more stride registers in communication with the random access memory to allow the gather/scatter of random access memory contents.
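The address-register/stride-register pairing can be modeled in software as follows. This is a behavioral sketch of gather/scatter semantics only; the function names and the list-backed "RAM" are illustrative assumptions:

```python
def gather(memory, base, stride, count):
    """Read `count` elements starting at the address register value
    `base`, stepping by the stride register value `stride`."""
    return [memory[base + i * stride] for i in range(count)]

def scatter(memory, base, stride, values):
    """Write `values` back to memory using the same base/stride walk."""
    for i, v in enumerate(values):
        memory[base + i * stride] = v

ram = list(range(16))
gathered = gather(ram, base=1, stride=4, count=3)   # -> [1, 5, 9]
scatter(ram, base=0, stride=2, values=[-1, -2, -3])
```

Loading different base and stride values into the registers retargets the same port at a new access pattern without changing the datapath.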

Processing Device Using Variable Stride Pattern

For certain applications, parts of the application data held in the memory of a processing device (e.g. data produced as a result of operations performed by the execution unit) are arranged in regular repeating patterns in the memory. The execution unit may therefore set up a suitable striding pattern for use by a send engine. The send engine accesses the memory at locations determined in accordance with the configured striding pattern so as to access a plurality of items of data that are arranged together in a regular pattern. In a similar manner as done for sends, the execution unit may set up a striding pattern for use by a receive engine. The receive engine, upon receiving a plurality of items of data, causes those items of data to be stored at locations in the memory determined in accordance with the configured striding pattern.
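One simple form such a striding pattern could take is "read a group of consecutive items, then jump a fixed distance to the next group." The sketch below models a send engine walking that pattern; the parameter names and the specific group-plus-jump shape are assumptions for illustration, not the patent's configuration format:

```python
def strided_read(memory, base, item_count, group_size, group_stride):
    """Walk a configured striding pattern: read `group_size` consecutive
    items, then jump `group_stride` addresses to the start of the next
    group, until `item_count` items have been read."""
    out = []
    addr = base
    while len(out) < item_count:
        out.extend(memory[addr:addr + group_size])
        addr += group_stride
    return out[:item_count]

mem = list(range(20))
pattern = strided_read(mem, base=0, item_count=6,
                       group_size=2, group_stride=5)   # -> [0, 1, 5, 6, 10, 11]
```

A receive engine would run the same walk in reverse, scattering incoming items to the addresses the pattern generates rather than reading from them.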