G06F15/8092

DEPTHWISE-CONVOLUTION IMPLEMENTATION ON A NEURAL PROCESSING CORE
20220405558 · 2022-12-22

A core of neural processing units is configured to efficiently process a depthwise convolution by maximizing spatial feature-map locality using adder trees. The data paths of activations and weights are inverted, and 2-to-1 multiplexers are placed every 2/9 multipliers along a row of multipliers. During a depthwise convolution operation, the core is operated using an RS×HW dataflow to maximize the locality of feature maps. For a normal convolution operation, the data paths of activations and weights may be set to a normal convolution configuration in which the multiplexers are idle.
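For reference, the arithmetic that a depthwise convolution computes can be sketched in plain Python. This is only an illustration of the operation, not the hardware dataflow described in the abstract: the function name `depthwise_conv2d`, the `[C][H][W]` and `[C][R][S]` layouts, and the stride-1, no-padding choice are all assumptions made for the example.

```python
def depthwise_conv2d(x, w):
    """Reference depthwise convolution (assumed stride 1, no padding).

    x: activations as nested lists with shape [C][H][W]
    w: one R x S filter per channel, shape [C][R][S]
    Returns nested lists with shape [C][H-R+1][W-S+1].
    """
    channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    r_size, s_size = len(w[0]), len(w[0][0])
    out = []
    for c in range(channels):
        # Each channel is filtered independently; there is no sum across channels.
        plane = []
        for i in range(height - r_size + 1):
            row = []
            for j in range(width - s_size + 1):
                # R*S products reduced to one value: the reduction an adder
                # tree would perform in hardware.
                acc = sum(
                    x[c][i + r][j + s] * w[c][r][s]
                    for r in range(r_size)
                    for s in range(s_size)
                )
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

Because every output value depends only on a small R×S window of a single channel, neighbouring outputs reuse most of the same activations, which is the spatial locality the abstract refers to.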

Vector processing unit

A vector processing unit is described that includes processor units, each of which includes multiple processing resources. The processor units are each configured to perform arithmetic operations associated with vectorized computations. The vector processing unit includes a vector memory in data communication with each of the processor units and their respective processing resources. The vector memory includes memory banks configured to store data used by each of the processor units to perform the arithmetic operations. The processor units and the vector memory are tightly coupled within an area of the vector processing unit such that data communications are exchanged at a high bandwidth based on the placement of respective processor units relative to one another, and based on the placement of the vector memory relative to each processor unit.

METHOD FOR REPRESENTING A DISTRIBUTED COMPUTING SYSTEM BY GRAPH EMBEDDING
20230055902 · 2023-02-23

A method of representing a distributed computing system that comprises a plurality of processing devices connected together according to a predefined topology. The method comprises: receiving at least one piece of data from an activity log file relating to at least one processing device among the plurality of processing devices; receiving at least one metric relating to at least one processing device among the plurality of processing devices; receiving at least the predefined topology of the distributed computing system; constructing a graph representative of the operation of the distributed computing system, the graph comprising the data item extracted from the received log file, the received metric, and the received topology; and embedding at least one part of the graph to obtain at least one state vector representing the at least one part of the embedded graph.

MULTI-ARCHITECTURE EXECUTION GRAPHS

Apparatuses, systems, and techniques to perform multi-architecture execution graphs. In at least one embodiment, a parallel processing platform, such as Compute Unified Device Architecture (CUDA), generates multi-architecture execution graphs comprising a plurality of software kernels to be performed by one or more processor cores having one or more processor architectures.

Systems and methods for systolic array design from a high-level program
11604758 · 2023-03-14

Systems and methods for automated systolic array design from a high-level program are disclosed. One implementation of a systolic array design supporting a convolutional neural network includes a two-dimensional array of reconfigurable processing elements arranged in rows and columns. Each processing element has an associated SIMD vector and is connected through a local connection to at least one other processing element. An input feature map buffer having a double buffer is configured to store input feature maps, and an interconnect system is configured to pass data to neighboring processing elements in accordance with a processing element scheduler. A CNN computation is mapped onto the two-dimensional array of reconfigurable processing elements using an automated system configured to determine suitable reconfigurable processing element parameters.
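The way data moves between neighbouring processing elements in a systolic array can be illustrated with a small timing model. The sketch below assumes an output-stationary array computing a matrix product, with rows of `A` entering from the left and columns of `B` entering from the top, each skewed by one cycle per row or column. The function name and this particular dataflow are assumptions for illustration; the abstract does not specify them, and the automated design flow and per-element SIMD vectors are not modelled.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle model of an output-stationary systolic array.

    a: n x k matrix, b: k x m matrix (nested lists).
    Processing element (i, j) keeps the running sum for c[i][j].
    With the assumed input skew, PE (i, j) sees a[i][kk] and b[kk][j]
    together at cycle t = i + j + kk.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    total_cycles = n + m + k - 2  # last useful cycle is (n-1)+(m-1)+(k-1)
    for t in range(total_cycles):
        for i in range(n):
            for j in range(m):
                kk = t - i - j  # which operand pair reaches PE (i, j) now
                if 0 <= kk < k:
                    c[i][j] += a[i][kk] * b[kk][j]
    return c
```

In this model every element only ever receives operands that its left and upper neighbours held one cycle earlier, which is why local connections between neighbouring elements are sufficient.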

VECTOR COMPUTATIONAL UNIT
20230115874 · 2023-04-13

A microprocessor system comprises a computational array and a vector computational unit. The computational array includes a plurality of computation units. The vector computational unit is in communication with the computational array and includes a plurality of processing elements. The processing elements are configured to receive output data elements from the computational array and process in parallel the received output data elements.

System and method for data-layout aware decompression and verification using a hardware accelerator chain
11657018 · 2023-05-23

A computer implemented method of data decompression and verification includes decompressing a compressed data segment to generate a decompressed data region. The method also includes generating a segment vector array (SVA) including a number of segment vectors corresponding to data segments within the decompressed data region, each segment vector indicating a location and a size of a corresponding data segment. The method also includes transmitting the SVA to a chain plugin module and transmitting segment vector array data to a SVA-based message constructor. The method also includes constructing a SVA-based message including the location and size of data segments within the decompressed data region, and transmitting the SVA-based message to a hardware accelerator. The method also includes performing verification sessions at the hardware accelerator, each verification session corresponding to a specific data segment indicated by the SVA-based message.
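A segment vector array of the kind described above can be pictured as a list of (location, size) records over the decompressed region. The following is a minimal sketch under stated assumptions: the names `SegmentVector`, `build_sva`, and `verify_segments` are hypothetical, `zlib` stands in for the unspecified compression scheme, and a CRC32 comparison stands in for whatever verification the hardware accelerator performs.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentVector:
    offset: int  # location of the segment within the decompressed region
    size: int    # length of the segment in bytes


def build_sva(segment_sizes):
    """Build a segment vector array from consecutive segment sizes."""
    sva, offset = [], 0
    for size in segment_sizes:
        sva.append(SegmentVector(offset, size))
        offset += size
    return sva


def verify_segments(region, sva, expected_crcs):
    """Run one verification per segment (CRC32 here is an assumption)."""
    return [
        zlib.crc32(region[v.offset:v.offset + v.size]) == crc
        for v, crc in zip(sva, expected_crcs)
    ]


# Usage: decompress, describe the layout, then verify each segment.
segments = [b"alpha", b"be", b"gamma!"]
region = zlib.decompress(zlib.compress(b"".join(segments)))
sva = build_sva([len(s) for s in segments])
results = verify_segments(region, sva, [zlib.crc32(s) for s in segments])
```

Carrying the location and size of each segment lets the verifier work on slices of the decompressed region directly, without re-parsing the data layout.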

HYBRID HARDWARE ACCELERATOR AND PROGRAMMABLE ARRAY ARCHITECTURE
20230205730 · 2023-06-29

Techniques are disclosed for the use of a hybrid architecture that combines a programmable processing array and a hardware accelerator. The hybrid architecture dedicates the most computationally intensive blocks to the hardware accelerator, while maintaining flexibility for additional computations to be performed by the programmable processing array. An interface is also described for coupling the processing array to the hardware accelerator, which achieves a division of functionality and connects the programmable processing array components to the hardware accelerator components without sacrificing flexibility. This results in a balance between power/area and flexibility.

Merging and sorting arrays on an SIMD processor

Methods, systems, and articles of manufacture for merging and sorting arrays on a processor are provided herein. A method includes splitting an input array into multiple sub-arrays across multiple processing elements; merging the multiple sub-arrays into multiple vectors; and sorting the multiple vectors by comparing and swapping one or more vector elements among the multiple vectors.
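The compare-and-swap primitive mentioned above can be illustrated with a small sketch. Note that this is a swapped-in technique, not the claimed method: it splits the input into sub-arrays, sorts each with an odd-even transposition network built only from compare-and-swap steps, and then merges the sorted sub-arrays, which orders the steps differently from the abstract. The function names and the `num_elements` parameter are hypothetical.

```python
import heapq


def compare_swap(a, i, j):
    """Put the smaller of a[i], a[j] at index i (one compare-and-swap)."""
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]


def odd_even_transposition_sort(values):
    """Sort with a fixed, data-independent sequence of compare-and-swaps."""
    a = list(values)
    n = len(a)
    for phase in range(n):  # n phases are enough for n elements
        # Pairs within a phase are disjoint, so they could run in parallel lanes.
        for i in range(phase % 2, n - 1, 2):
            compare_swap(a, i, i + 1)
    return a


def split_sort_merge(values, num_elements):
    """Split across num_elements workers, sort each part, merge the results."""
    if not values:
        return []
    chunk = max(1, -(-len(values) // num_elements))  # ceiling division
    subs = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    return list(heapq.merge(*(odd_even_transposition_sort(s) for s in subs)))
```

The compare-and-swap sequence does not depend on the data, which is what makes this style of sorting a natural fit for SIMD lanes that must all execute the same instruction.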

SYSTEM AND METHOD FOR DATA-LAYOUT AWARE DECOMPRESSION AND VERIFICATION USING A HARDWARE ACCELERATOR CHAIN
20220309030 · 2022-09-29

A computer implemented method of data decompression and verification includes decompressing a compressed data segment to generate a decompressed data region. The method also includes generating a segment vector array (SVA) including a number of segment vectors corresponding to data segments within the decompressed data region, each segment vector indicating a location and a size of a corresponding data segment. The method also includes transmitting the SVA to a chain plugin module and transmitting segment vector array data to a SVA-based message constructor. The method also includes constructing a SVA-based message including the location and size of data segments within the decompressed data region, and transmitting the SVA-based message to a hardware accelerator. The method also includes performing verification sessions at the hardware accelerator, each verification session corresponding to a specific data segment indicated by the SVA-based message.