G06F9/345

METHOD FOR ACCELERATING DEEP NEURAL NETWORKS EXECUTION WITH ADVANCED OPERATOR FUSION
20220413862 · 2022-12-29 ·

This disclosure has presented a new loop fusion framework called DNNFusion. The key advantages of DNNFusion include: 1) a new high-level abstraction comprising mapping type of operators and their combinations and the Extended Computational Graph, and analyses on these abstractions, 2) a novel mathematical-property-based graph rewriting, and 3) an integrated fusion plan generation. DNNFusion is extensively evaluated on 15 diverse DNN models on multiple mobile devices, and evaluation results show that it outperforms four state-of-the-art DNN execution frameworks by up to 8.8× speedup, and for the first time allows many cutting-edge DNN models not supported by prior end-to-end frameworks to execute on mobile devices efficiently (even in real-time). In addition, DNNFusion improves both cache performance and device utilization, enabling execution on devices with more restricted resources. It also reduces performance tuning time during compilation.

METHOD FOR ACCELERATING DEEP NEURAL NETWORKS EXECUTION WITH ADVANCED OPERATOR FUSION
20220413862 · 2022-12-29 ·

This disclosure has presented a new loop fusion framework called DNNFusion. The key advantages of DNNFusion include: 1) a new high-level abstraction comprising mapping type of operators and their combinations and the Extended Computational Graph, and analyses on these abstractions, 2) a novel mathematical-property-based graph rewriting, and 3) an integrated fusion plan generation. DNNFusion is extensively evaluated on 15 diverse DNN models on multiple mobile devices, and evaluation results show that it outperforms four state-of-the-art DNN execution frameworks by up to 8.8× speedup, and for the first time allows many cutting-edge DNN models not supported by prior end-to-end frameworks to execute on mobile devices efficiently (even in real-time). In addition, DNNFusion improves both cache performance and device utilization, enabling execution on devices with more restricted resources. It also reduces performance tuning time during compilation.

LEARNING DEEP LATENT VARIABLE MODELS BY SHORT-RUN MCMC INFERENCE WITH OPTIMAL TRANSPORT CORRECTION
20220398446 · 2022-12-15 · ·

Learning latent variable models with deep top-down architectures typically requires inferring latent variables for each training example based on posterior distribution of these latent variables. The inference step relies on either time-consuming long-run Markov chain Monte Carlo (MCMC) sampling or a separate inference model for variational learning. Embodiments of a short-run MCMC, such as a short-run Langevin dynamics, are used herein as an approximate flow-based inference engine. Bias existing in the output distribution of non-convergent short-run Langevin dynamics may be corrected by optimal transport (OT), which aims at transforming the biased distribution produced by finite-step MCMC to the prior distribution with a minimum transport cost. Experiment results verify the effectiveness of the OT correction for the short-run MCMC, and demonstrate that latent variable models trained by the disclosed strategy performed better than the variational auto-encoder in terms of image reconstruction, generation and anomaly detection.

CONVOLUTION WITH KERNEL EXPANSION AND TENSOR ACCUMULATION

Certain aspects of the present disclosure provide techniques for kernel expansion. An input data tensor is received at a first layer in a neural network, and a first convolution is performed for a first kernel, where the first kernel has a size greater than a preferred size. Performing the first convolution comprises generating a plurality of intermediate tensors by performing a plurality of intermediate convolutions using a plurality of intermediate kernels with a size of the preferred size, and accumulating the plurality of intermediate tensors to generate an output tensor for the first convolution.

Data structure-aware prefetching method and device on graphics processing unit

The invention discloses a data structure-aware prefetching method and device on a graphics processing unit. The method comprises the steps of acquiring information for a memory access request in which a monitoring processor checks a graph data structure and read data, using a data structure access mode defined by a breadth first search and graph data structure information to generate four corresponding vector prefetching requests and store into a prefetching request queue. The device comprises a data prefetching unit distributed into each processing unit, each data prefetching unit is respectively connected with an memory access monitor, a response FIFO and a primary cache of a load/store unit, and comprises an address space classifier, a runtime information table, prefetching request generation units and the prefetching request queue. According to the present invention, data required by graph traversal can be prefetched more accurately and efficiently using the breadth first search, thereby improving the performance of GPU to solve a graph computation problem.

Combining load or store instructions

Various aspects disclosed herein relate to combining instructions to load data from or store data in memory while processing instructions in a computer processor. More particularly, at least one pattern of multiple memory access instructions that reference a common base register and do not fully utilize an available bus width may be identified in a processor pipeline. In response to determining that the multiple memory access instructions target adjacent memory or non-contiguous memory that can fit on a single cache line, the multiple memory access instructions may be replaced within the processor pipeline with one equivalent memory access instruction that utilizes more of the available bus width than either of the replaced memory access instructions.

STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
20230053842 · 2023-02-23 ·

A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently enabling trade off between the number of loops supported and the size of the loop counts and loop dimensions.

STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
20230053842 · 2023-02-23 ·

A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently enabling trade off between the number of loops supported and the size of the loop counts and loop dimensions.

PROCESSOR WITH TABLE LOOKUP UNIT

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.

PROCESSOR WITH TABLE LOOKUP UNIT

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.