G06F7/523

COMPUTE IN MEMORY ARCHITECTURE AND DATAFLOWS FOR DEPTH-WISE SEPARABLE CONVOLUTION
20230004350 · 2023-01-05

Certain aspects of the present disclosure provide a method, including: storing a depthwise convolution kernel in a first one or more columns of a compute-in-memory (CIM) array; storing a fused convolution kernel in a second one or more columns of the CIM array; storing pre-activations in one or more input data buffers associated with a plurality of rows of the CIM array; processing the pre-activations with the depthwise convolution kernel to generate depthwise output; modifying one or more of the pre-activations based on the depthwise output to generate modified pre-activations; and processing the modified pre-activations with the fused convolution kernel to generate fused output.
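
To make the dataflow concrete, here is a minimal NumPy sketch of a depthwise stage feeding a fused (pointwise, 1x1) stage. The shapes, the ReLU used as the "modification" step, and the function name `depthwise_separable` are illustrative assumptions, not details from the disclosure.

```python
import numpy as np

def depthwise_separable(pre_act, dw_kernel, pw_kernel):
    """pre_act:   (H, W, C)  input pre-activations
    dw_kernel: (K, K, C)  one KxK filter per channel (depthwise stage)
    pw_kernel: (C, M)     1x1 "fused" kernel mixing C channels into M outputs
    """
    H, W, C = pre_act.shape
    K = dw_kernel.shape[0]
    Ho, Wo = H - K + 1, W - K + 1

    # Depthwise stage: each channel is convolved with its own KxK filter.
    dw_out = np.zeros((Ho, Wo, C))
    for i in range(Ho):
        for j in range(Wo):
            patch = pre_act[i:i+K, j:j+K, :]           # (K, K, C)
            dw_out[i, j, :] = np.sum(patch * dw_kernel, axis=(0, 1))

    # "Modify the pre-activations based on the depthwise output" -- here we
    # assume the modification is a nonlinearity applied to the depthwise
    # result before it is fed back as input to the fused stage.
    modified = np.maximum(dw_out, 0.0)                 # assumed ReLU

    # Fused (pointwise) stage: a 1x1 convolution is a per-pixel matmul.
    return modified @ pw_kernel                        # (Ho, Wo, M)

out = depthwise_separable(np.random.randn(8, 8, 4),
                          np.random.randn(3, 3, 4),
                          np.random.randn(4, 16))
print(out.shape)  # (6, 6, 16)
```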

COMPLEMENTARY SPARSITY IN PROCESSING TENSORS
20230004800 · 2023-01-05

A hardware accelerator that is efficient at performing tensor computations is described. The hardware accelerator may store a complementary dense process tensor formed by combining a plurality of sparse process tensors whose active values occupy non-overlapping locations. The hardware accelerator may perform elementwise operations between the complementary dense process tensor and an activation tensor to generate a product tensor. The hardware accelerator may re-arrange the product tensor based on a permutation logic to separate the products into groups, each group corresponding to one of the sparse process tensors. Each group may be accumulated separately to generate a plurality of output values. The output values may then be filtered by an activation selection, which may be a dense activation or a sparse activation such as a k-winner activation that sets non-winners to zero.
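
The scheme can be illustrated with a toy NumPy sketch: three sparse kernels with non-overlapping active positions are packed into one dense weight vector, multiplied elementwise with the activations in a single pass, and the products are routed back to their owning kernel and accumulated. The `owner` routing table, the shapes, and k are assumptions for illustration.

```python
import numpy as np

n = 8                                        # elements per tensor
owner = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # which sparse kernel owns each slot
dense_w = np.random.randn(n)                 # packed "complementary dense" weights
activation = np.random.randn(n)

# One dense elementwise pass replaces three sparse ones.
products = dense_w * activation

# Permutation logic: gather each kernel's products into its own group,
# then accumulate each group separately.
num_kernels = owner.max() + 1
outputs = np.array([products[owner == g].sum() for g in range(num_kernels)])

# Activation selection: k-winner keeps the top-k outputs, zeroing the rest.
k = 2
winners = np.argsort(outputs)[-k:]
selected = np.zeros_like(outputs)
selected[winners] = outputs[winners]
print(outputs, selected)
```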

Accelerating binary neural networks within latch structure of non-volatile memory devices

A non-volatile memory device includes an array of non-volatile memory cells that are configured to store weights of a neural network. Associated with the array is a data latch structure that includes a page buffer, which can store weights for a layer of the neural network read out of the array, and a transfer buffer, which can store inputs for the neural network. The memory device can perform multiply-and-accumulate operations between inputs and weights of the neural network within the latch structure, avoiding the need to transfer data out of the array and associated latch structure for portions of an inference operation. By using binary weights and inputs, multiplication can be performed by bit-wise XNOR operations. The results can then be summed and an activation applied, all within the latch structure.
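
The XNOR trick is easy to show in a few lines of Python: with ±1 values encoded as bits (1 for +1, 0 for −1), elementwise multiplication becomes bitwise XNOR and accumulation becomes a popcount. The bit encoding and the sign-threshold activation below are standard assumptions for binary networks, not specifics of the latch implementation.

```python
def xnor_mac(weight_bits: int, input_bits: int, n: int) -> int:
    """n-bit XNOR-popcount dot product of two bit-packed +/-1 vectors."""
    mask = (1 << n) - 1
    xnor = ~(weight_bits ^ input_bits) & mask   # 1 where the bits agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # matches - mismatches

def binary_activation(acc: int) -> int:
    return 1 if acc >= 0 else 0                 # assumed sign activation

acc = xnor_mac(0b10110010, 0b10010110, 8)
print(acc, binary_activation(acc))              # 4 1
```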

Decoding System, Decoding Controller, and Decoding Control Method
20220416814 · 2022-12-29

A decoding system, a decoding controller, and a decoding control method are provided. In the decoding system, a decoding controller is disposed between two adjacent decoders. The decoding controller determines whether to perform turn-off based on a non-turn-off indication received by a previous-stage decoder, a turn-off indication output by the previous-stage decoder, and historical turn-off probability statistics. This is equivalent to adding a buffer zone between the two adjacent decoders.
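
The abstract names the three inputs to the controller's decision but not the rule that combines them; the sketch below fills that gap with an assumed rule (a turn-off indication from the previous stage, no upstream non-turn-off indication, and a historical turn-off probability above a threshold), purely for illustration.

```python
class DecodingController:
    """Speculative sketch of a controller sitting between two decoders."""

    def __init__(self, threshold: float = 0.9):
        self.turn_off_count = 0
        self.total = 0
        self.threshold = threshold   # assumed confidence threshold

    @property
    def turn_off_prob(self) -> float:
        # Historical turn-off probability statistics.
        return self.turn_off_count / self.total if self.total else 0.0

    def decide(self, prev_received_non_turn_off: bool,
               prev_output_turn_off: bool) -> bool:
        """Return True to turn off the next-stage decoder."""
        self.total += 1
        if prev_output_turn_off:
            self.turn_off_count += 1
        # Assumed rule: only turn off when the previous stage itself
        # signalled turn-off, it was not overridden upstream, and history
        # says turn-off decisions at this point are reliable.
        return (prev_output_turn_off
                and not prev_received_non_turn_off
                and self.turn_off_prob >= self.threshold)

ctrl = DecodingController()
print(ctrl.decide(prev_received_non_turn_off=False, prev_output_turn_off=True))
```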

USING SPARSITY METADATA TO REDUCE SYSTOLIC ARRAY POWER CONSUMPTION

A processing apparatus can include a general-purpose parallel processing engine comprising a matrix accelerator including a multi-stage systolic array, where each stage includes multiple processing elements associated with multiple processing channels. The multiple processing elements are configured to receive output sparsity metadata that is independent of input sparsity of input matrix elements and perform processing operations on the input matrix elements based on the output sparsity metadata.
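
A NumPy stand-in for the hardware makes the key distinction concrete: the mask below is output metadata, consulted before any multiply happens, rather than a test of whether the inputs happen to be zero; gated output channels skip their multiplications entirely. The per-output-column mask layout is an assumption.

```python
import numpy as np

def gated_matmul(A, B, out_mask):
    """A: (M, K), B: (K, N), out_mask: (N,) -- 0 marks output columns that
    are known-unneeded, so their multiplications are bypassed."""
    M, _ = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for col in range(N):
        if out_mask[col]:               # metadata check, not a zero check
            C[:, col] = A @ B[:, col]   # only live output channels compute
    return C

A, B = np.random.randn(4, 8), np.random.randn(8, 6)
mask = np.array([1, 0, 1, 1, 0, 1])
print(gated_matmul(A, B, mask))
```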

SYSTOLIC ARRAY HAVING SUPPORT FOR OUTPUT SPARSITY

A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to perform processing operations on input matrix elements based on output sparsity metadata. The output sparsity metadata indicates to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix.
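
Read literally, the metadata picks, within each pair of rows of the second matrix, which row is multiplied with the first matrix's columns and which is bypassed; the sketch below assumes that pairwise one-of-two grouping, which is an inference from the abstract's wording rather than a stated detail.

```python
import numpy as np

def row_bypass_matmul(B, A, select):
    """B: (R, K) second matrix, A: (K, N) first matrix (consumed as
    columns), select: (R//2,) -- within each pair of B's rows, the index
    (0 or 1) of the row that is multiplied; its sibling is bypassed."""
    R = B.shape[0]
    kept = np.stack([B[2 * i + select[i]] for i in range(R // 2)])
    return kept @ A      # bypassed rows never reach a multiplier

B = np.random.randn(6, 4)
A = np.random.randn(4, 5)
select = np.array([1, 0, 1])    # bypass rows 0, 3, and 4 of B
print(row_bypass_matmul(B, A, select).shape)   # (3, 5)
```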
