Patent classifications
G06N3/0495 (Quantised networks; Sparse networks; Compressed networks)
SPARSITY PROCESSING ON UNPACKED DATA
Sparsity processing within a compute block can be performed on unpacked data. The compute block includes a sparsity decoder that generates a combined sparsity vector from an activation sparsity vector and a weight sparsity vector. The activation sparsity vector indicates positions of non-zero valued activations in an activation context. The weight sparsity vector indicates positions of non-zero valued weights in a weight context. The combined sparsity vector comprises one or more zero valued bits and one or more non-zero valued bits. The sparsity decoder may determine the position of a non-zero valued bit in the combined sparsity vector and determine an address for the non-zero valued activation and the non-zero valued weight based on the position of the non-zero valued bit. The non-zero valued activation and the non-zero valued weight may be provided to a processing element (PE) for performing multiply-accumulate (MAC) operations.
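A minimal Python sketch of this idea, assuming the combined sparsity vector is the bitwise AND of the two sparsity vectors and that, for unpacked (dense) data, the address of a non-zero value is simply the bit position itself; the function and variable names are illustrative, not from the patent:

```python
def mac_with_sparsity(acts, wgts, act_sparsity, wgt_sparsity):
    """Multiply-accumulate only the positions that are non-zero on both sides."""
    # Combined sparsity vector: a bit is set only where both operands are non-zero.
    combined = [a & w for a, w in zip(act_sparsity, wgt_sparsity)]
    acc = 0
    for pos, bit in enumerate(combined):
        if bit:  # the position of a non-zero bit is the address into unpacked data
            acc += acts[pos] * wgts[pos]
    return acc

acts = [0, 3, 0, 5]
wgts = [2, 4, 0, 0]
act_sp = [0, 1, 0, 1]   # marks non-zero activations
wgt_sp = [1, 1, 0, 0]   # marks non-zero weights
print(mac_with_sparsity(acts, wgts, act_sp, wgt_sp))  # only 3 * 4 = 12 survives
```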
Methods and systems for implementing a convolution transpose layer of a neural network
Methods and systems for performing a convolution transpose operation between an input tensor having a plurality of input elements and a filter comprising a plurality of filter weights. The method includes: dividing the filter into a plurality of sub-filters; performing, using hardware logic, a convolution operation between the input tensor and each of the plurality of sub-filters to generate a plurality of sub-output tensors, each sub-output tensor comprising a plurality of output elements; and interleaving, using hardware logic, the output elements of the plurality of sub-output tensors to form a final output tensor for the convolution transpose.
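A minimal 1-D NumPy sketch of this decomposition: a convolution transpose with stride s is computed as s ordinary convolutions with sub-filters formed from every s-th filter weight, and the sub-outputs are interleaved into the final output. The helper names and the check against the direct scatter definition are illustrative:

```python
import numpy as np

def conv_transpose_1d(x, w, stride):
    """1-D convolution transpose via `stride` sub-filter convolutions plus interleaving."""
    n, k = len(x), len(w)
    out = np.zeros((n - 1) * stride + k)
    for r in range(stride):
        sub_filter = w[r::stride]              # every stride-th weight, offset r
        sub_out = np.convolve(x, sub_filter)   # ordinary (stride-1) convolution
        out[r::stride] = sub_out               # interleave sub-outputs
    return out

# Check against the direct definition: each input element scatters x[n] * w
# into output positions stride*n .. stride*n + len(w) - 1.
x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, 1.0, -0.5, 2.0])
y = conv_transpose_1d(x, w, stride=2)
ref = np.zeros_like(y)
for n_, xn in enumerate(x):
    ref[2 * n_: 2 * n_ + len(w)] += xn * w
assert np.allclose(y, ref)
print(y)
```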
MODEL QUANTIZATION FOR SOFTWARE ENGINEERING TASKS
A deep learning model is quantized during its training to perform a target software engineering task. During training, a portion of the full-precision floating-point weights is quantized into INT4 or INT8 data types through scalar quantization or product quantization, making the model more resilient to quantization and reducing the noise between the quantized and full-precision model outputs. In scalar quantization, each sub-block consists of a single weight that is mapped to a codeword of a codebook. In product quantization, an identity matrix and a codebook of centroids are used to map a quantized weight back to its original value.
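As a concrete illustration of scalar quantization, the sketch below maps each weight (a sub-block of size one) to the index of its nearest codeword in a 16-entry codebook, i.e. an INT4 code; the uniform codebook construction is an assumption made for illustration, not the patent's training procedure:

```python
import numpy as np

def scalar_quantize(weights, bits=4):
    """Map each weight to the index of its nearest codeword (an INT4 code for bits=4)."""
    n_codes = 2 ** bits
    lo, hi = weights.min(), weights.max()
    codebook = np.linspace(lo, hi, n_codes)   # illustrative uniform codebook
    # Nearest-codeword index for every weight.
    codes = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), codebook

def dequantize(codes, codebook):
    return codebook[codes]

w = np.random.randn(8).astype(np.float32)
codes, cb = scalar_quantize(w)
w_hat = dequantize(codes, cb)
print(np.abs(w - w_hat).max())  # the quantization noise the training tries to shrink
```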
System and method of executing deep tensor columns in neural networks
Embodiments of the invention may execute a NN by executing sub-tensor columns, each sub-tensor column including computations from portions of layers of the NN, and each sub-tensor column performing its computations entirely within a first layer of cache (e.g. L2 in one embodiment) and saving its output entirely within a second layer of cache (e.g. L3 in one embodiment). Embodiments may include partitioning the execution of the NN into such sub-tensor columns.
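A schematic sketch of the loop reordering behind sub-tensor columns, assuming pointwise layers so that spatial tiles are independent; a real implementation must also handle overlap/halo regions for layers with wider receptive fields and would size tiles to the cache levels:

```python
import numpy as np

def run_layer(x, w):
    return np.maximum(x @ w, 0.0)   # pointwise layer + ReLU

def execute_by_columns(x, weights, tile_rows):
    """Process one spatial tile through *all* layers before the next tile,
    so each tile's intermediates are small enough to stay in a fast cache."""
    out_tiles = []
    for start in range(0, x.shape[0], tile_rows):
        tile = x[start:start + tile_rows]   # sub-tensor column input
        for w in weights:                   # entire depth of the column
            tile = run_layer(tile, w)       # intermediates stay tile-sized
        out_tiles.append(tile)              # output written out once per column
    return np.concatenate(out_tiles)

x = np.random.randn(1024, 64)
weights = [np.random.randn(64, 64) for _ in range(4)]
ref = x
for w in weights:                           # conventional layer-by-layer order
    ref = run_layer(ref, w)
assert np.allclose(execute_by_columns(x, weights, tile_rows=128), ref)
```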
METHOD AND APPARATUS FOR ENCODING/DECODING DEEP LEARNING NETWORK
Disclosed herein are a method and apparatus for encoding/decoding a deep learning network. According to an embodiment, the method for decoding a deep learning network may include: decoding network header information regarding the deep learning network; decoding layer header information regarding a plurality of layers in the deep learning network; decoding layer data information regarding specific information of the plurality of layers; and obtaining the deep learning network and the plurality of layers in the deep learning network. The layer header information includes layer distinction information for distinguishing the plurality of layers.
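A toy illustration of the header/data structure described above; the byte layout (a network header carrying a layer count, then per-layer headers carrying a layer-distinction id and a payload length) is invented for this sketch and is not the codec defined by the patent:

```python
import io
import struct

def decode_network(stream):
    (num_layers,) = struct.unpack("<I", stream.read(4))               # network header
    layers = []
    for _ in range(num_layers):
        layer_id, payload_len = struct.unpack("<HI", stream.read(6))  # layer header
        payload = stream.read(payload_len)                            # layer data
        layers.append({"id": layer_id, "data": payload})
    return layers

# Round-trip check with a hand-built bitstream of two layers.
buf = struct.pack("<I", 2)
buf += struct.pack("<HI", 0, 3) + b"\x01\x02\x03"
buf += struct.pack("<HI", 1, 2) + b"\x04\x05"
print(decode_network(io.BytesIO(buf)))
```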
METHOD AND DEVICE FOR ENCODING/DECODING DEEP NEURAL NETWORK MODEL
Disclosed herein are a method and apparatus for encoding/decoding a deep neural network. According to the present disclosure, the method for decoding a deep neural network may include: entropy-decoding quantization information for a current layer among a plurality of layers of the deep neural network; performing dequantization on the current layer; and obtaining the plurality of layers of the deep neural network. At least one of global quantization and local quantization is performed on the current layer.
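A minimal sketch contrasting global and local dequantization of a decoded layer, where "global" uses one scale for the whole layer and "local" one scale per output channel; the granularity and the entropy coding of the quantization information are assumptions made for illustration:

```python
import numpy as np

def dequantize_global(q, scale):
    return q.astype(np.float32) * scale              # one scale for the whole layer

def dequantize_local(q, scales):
    return q.astype(np.float32) * scales[:, None]    # one scale per output channel (row)

q = np.array([[12, -3], [7, 25]], dtype=np.int8)     # decoded integer weights
print(dequantize_global(q, scale=0.02))
print(dequantize_local(q, scales=np.array([0.02, 0.01])))
```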
MACHINE LEARNING MODEL TRAINING METHOD, ELECTRONIC DEVICE AND STORAGE MEDIUM
A method for training a machine learning model includes: sending, by a first node, first indication information, where the first indication information is used by a second node to determine a first quantization strategy, and the first quantization strategy is in turn used to determine a parameter and/or an output result of the machine learning model.
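A toy sketch of this signalling, in which the first node's indication information drives the second node's choice of a quantization strategy; the message fields and the bit-width selection rule are invented for illustration:

```python
def first_node_indication():
    return {"link_budget_kbps": 128}          # first indication information

def choose_quantization_strategy(indication):
    # Second node: map the indication to a strategy (here, just a bit width).
    return {"bits": 4 if indication["link_budget_kbps"] < 256 else 8}

def quantize_output(values, strategy):
    # Apply the strategy to the model's output result (symmetric rounding + clamp).
    n = 2 ** (strategy["bits"] - 1)
    return [max(-n, min(n - 1, round(v * n))) for v in values]

strategy = choose_quantization_strategy(first_node_indication())
print(strategy, quantize_output([0.1, -0.7, 0.9], strategy))
```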
Weight data compression method, weight data decompression method, weight data compression device, and weight data decompression device
A weight data compression method includes: generating a 4-bit data string by dividing ternary weight data into 4-bit data items, each expressed as one of nine 4-bit values; and generating first compressed data that includes a first flag value string and a first non-zero value string. The first flag value string is generated by assigning one of 0 and 1 as the value of a 1-bit flag to each 4-bit data item equal to 0000, and the other of 0 and 1 to each 4-bit data item other than 0000. The first non-zero value string is generated by converting each 4-bit data item other than 0000 into a 3-bit data item having one of eight 3-bit values.
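A sketch of this flag-plus-non-zero-string scheme, assuming each 4-bit item packs two ternary weights at 2 bits each (00 = 0, 01 = +1, 10 = -1), which yields exactly nine possible items; the 2-bit encoding and the table mapping the eight non-zero items to 3-bit values are illustrative choices, not the patent's:

```python
TWO_BIT = {0: 0b00, 1: 0b01, -1: 0b10}

# The eight non-zero 4-bit items, each assigned one of eight 3-bit values.
NONZERO_ITEMS = sorted(
    (TWO_BIT[a] << 2) | TWO_BIT[b]
    for a in (-1, 0, 1) for b in (-1, 0, 1) if (a, b) != (0, 0)
)
TO_3BIT = {item: code for code, item in enumerate(NONZERO_ITEMS)}

def compress(ternary_weights):
    assert len(ternary_weights) % 2 == 0
    flags, nonzero = [], []
    for a, b in zip(ternary_weights[::2], ternary_weights[1::2]):
        item = (TWO_BIT[a] << 2) | TWO_BIT[b]    # one 4-bit data item
        if item == 0b0000:
            flags.append(0)                       # first flag value: all-zero item
        else:
            flags.append(1)                       # second flag value: non-zero item
            nonzero.append(TO_3BIT[item])         # 4-bit item -> 3-bit value
    return flags, nonzero

flags, nonzero = compress([0, 0, 1, -1, 0, 1])
print(flags, nonzero)   # [0, 1, 1] plus two 3-bit codes
```

With this split, an all-zero item costs 1 bit instead of 4, while a non-zero item costs 1 + 3 = 4 bits, so the compressed size never exceeds that of the 4-bit data string.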
Multi-layer neural network system and method
Provided is a multi-layer neural network technique that includes: calculating, from an input and using a first one or more layers of a plurality of layers of a neural network, a first intermediate output; reducing a size of one or more dimensions of the first intermediate output; calculating, from the first intermediate output and using a second one or more layers of the neural network, a second intermediate output (the second one or more layers including one or more ultra-low precision layers); reducing a size of one or more dimensions of the second intermediate output; combining a plurality of reduced intermediate outputs (including the reduced first intermediate output and the reduced second intermediate output) to derive a combined intermediate output; and calculating, using the combined intermediate output and one or more higher-precision layers of the plurality of layers, a neural network output.
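A NumPy sketch of this pipeline, emulating the ultra-low precision second stage with binarized weights and using mean-pooling as the dimension reduction and concatenation as the combination step; all shapes and the pooling/combination choices are assumptions made for illustration:

```python
import numpy as np

def dense(x, w):
    return np.maximum(x @ w, 0.0)

def reduce_dims(x):
    return x.mean(axis=0)                    # shrink the spatial dimension

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))                # input: 32 positions, 16 features

w1 = rng.normal(size=(16, 16))
inter1 = dense(x, w1)                        # first intermediate output
r1 = reduce_dims(inter1)                     # reduced first intermediate output

w2 = np.sign(rng.normal(size=(16, 16)))      # ultra-low precision: binary weights
inter2 = dense(inter1, w2)                   # second intermediate output
r2 = reduce_dims(inter2)                     # reduced second intermediate output

combined = np.concatenate([r1, r2])          # combined intermediate output
w_out = rng.normal(size=(32, 10))            # higher-precision output layer
print(combined @ w_out)                      # neural network output
```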
Generating Pretrained Sparse Student Model for Transfer Learning
A student model may be trained in two stages, using a different teacher model in each stage. The first teacher model has been trained with a pretraining dataset. The second teacher model has been trained with a training dataset that is specific to a task to be performed by the student model. In the first stage, the student model may be generated based on a structure of the first teacher model. Internal parameters of the student model are adjusted through a pretraining process based on the first teacher model and the pretraining dataset. Weights of the student model may be pruned during the pretraining process. In the second stage, a sparsity mask is generated for the student model to lock the sparsity pattern produced in the first stage. Further, some of the internal parameters of the student model are modified based on the second teacher model and the training dataset.
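A schematic sketch of the two-stage recipe, collapsed to a single weight matrix with squared-error "distillation" against each teacher's outputs; the magnitude-pruning schedule and hyperparameters are illustrative, not the patent's:

```python
import numpy as np

rng = np.random.default_rng(0)
t1 = rng.normal(size=(8, 8))    # stands in for the pretrained teacher
t2 = rng.normal(size=(8, 8))    # stands in for the task-specific teacher
w = t1 + 0.1 * rng.normal(size=(8, 8))   # student built from teacher 1's structure

def distill_step(w, x, teacher_w, lr=0.05):
    # Nudge the student to match the teacher's outputs on a batch.
    grad = x.T @ (x @ w - x @ teacher_w) / len(x)
    return w - lr * grad

# Stage 1: pretraining distillation with progressive magnitude pruning.
for step in range(300):
    x = rng.normal(size=(32, 8))
    w = distill_step(w, x, t1)
    if step % 50 == 49:
        thresh = np.quantile(np.abs(w), 0.5)   # prune the smallest half
        w[np.abs(w) < thresh] = 0.0

mask = (w != 0.0)               # sparsity mask locks the stage-1 pattern

# Stage 2: task-specific distillation; pruned weights stay pruned.
for step in range(300):
    x = rng.normal(size=(32, 8))
    w = distill_step(w, x, t2)
    w *= mask                   # re-apply the locked sparsity mask

print(f"sparsity kept: {(w == 0).mean():.2f}")
```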