Multiplier-Less Sparse Deep Neural Network
20230081531 · 2023-03-16
CPC Classification: G06N3/082 (PHYSICS)
Abstract
Deep neural networks (DNNs) have been used for various applications to provide inference, regression, classification, and prediction. Although the high potential of DNNs has been successfully demonstrated in the literature, most DNNs require high computational complexity and high-power operation for real-time processing due to a large number of multiply-accumulate (MAC) operations. The present invention provides a way to realize a hardware-friendly MAC-less DNN framework with round-accumulate (RAC) operations based on power-of-two (PoT) weights. The method and system are based on the realization that rounding-aware training for powers-of-two expansion can eliminate the need for multiplier components from the system without causing any performance loss. In addition, the method and system provide a way to reduce the number of PoT weights based on knowledge distillation using a progressive compression of an over-parameterized DNN. It can realize high compression, leading to power-efficient inference for resource-limited hardware implementations such as microprocessors and field-programmable gate arrays. A compacting rank is optimized with an additional DNN model in a reinforcement learning framework. A rounding granularity is also successively decremented, and mixed-order PoT weights are obtained for low-power processing. Another student model is also designed in parallel for knowledge distillation to find a Pareto-optimal trade-off between performance and complexity.
Claims
1. A computer-implemented method for training a set of artificial neural networks, performed by at least one computing processor, wherein the method uses the at least one processor coupled with a memory storing instructions implementing the method, wherein the instructions, when executed by the at least one processor, carry out steps of the method, comprising: (a) initializing a set of trainable parameters of an artificial neural network, wherein the set of trainable parameters comprises a set of trainable weights and a set of trainable biases; (b) training the set of trainable parameters using a set of training data; (c) generating a pruning mask based on the trained set of trainable parameters; (d) rewinding the set of trainable parameters; (e) pruning a selected set of trainable parameters based on the pruning mask; and (f) repeating the above steps from (b) to (e) for a specified number of times to generate a set of sparse neural networks having an incremental sparsity.
2. The method of claim 1, wherein the training further comprises the steps of: feeding the set of training data into a plurality of input nodes of the artificial neural network; propagating the set of training data across the artificial neural network according to a set of pruning masks and the trainable parameters; generating a set of output values from a plurality of output nodes of the artificial neural network; calculating a set of loss values for the set of training data based on the set of output values; updating the set of trainable parameters based on the set of loss values through a backward message passing; and repeating the above steps for a specified number of iterations.
3. The method of claim 2, wherein the updating of the set of trainable parameters is based on stochastic gradient descent, resilient backpropagation, root-mean-square propagation, the Broyden-Fletcher-Goldfarb-Shanno algorithm, adaptive momentum optimization, adaptive subgradient, adaptive delta, or a variant thereof.
4. The method of claim 2, wherein the set of loss values is based on mean-square error, mean absolute error, cross entropy, connectionist temporal classification loss, negative log-likelihood, Kullback-Leibler divergence, margin loss, ranking loss, embedding loss, hinge loss, Huber loss, or a variant thereof.
5. The method of claim 2, wherein the updating of the set of trainable parameters further comprises the step of rounding the trainable weights to quantized values based on power-of-two, additive powers-of-two, power-of-three, additive powers-of-three, or a variant thereof.
6. The method of claim 1, wherein the incremental sparsity is controlled by an auxiliary neural network trained with reinforcement learning.
7. The method of claim 5, wherein the rounding of the trainable weights is controlled by a rounding neural network trained with reinforcement learning.
8. A computer-implemented method for testing an artificial neural network, performed by at least one computing processor, wherein the method uses the at least one processor coupled with a memory storing instructions implementing the method, wherein the instructions, when executed by the at least one processor, carry out steps of the method, comprising: feeding a set of testing data into a plurality of input nodes of the artificial neural network; propagating the set of testing data across the artificial neural network according to a set of pruning masks; and generating a set of output values from a plurality of output nodes of the artificial neural network.
9. The method of claim 8, wherein the propagating further comprises the steps of: transforming values of neuron nodes according to a trained set of trainable parameters of the artificial neural network; modifying the transformed values of neuron nodes according to a set of activation functions; and repeating the above steps across a plurality of neuron layers of the artificial neural network.
10. The method of claim 9, wherein the transforming values of neuron nodes is based on sign flipping, bit shifting, accumulation, biasing, or a variant thereof, according to a quantized set of trainable parameters with power-of-two, additive powers-of-two, power-of-three, additive powers-of-three, or a variant thereof.
11. A system deployed for an artificial neural network comprises: at least one interface link; at least one computing processor; at least one memory bank configured to store at least one trained set of trainable parameters of the artificial neural network and a training method and a testing method, wherein the training method and the testing method use the at least one processor coupled with the at least one memory storing instructions implementing the training and testing methods, wherein the instructions, when executed by the at least one processor, carry out steps of the method, comprising: causing the at least one processor to execute the training method and the testing method based on the at least one trained set of trainable parameters; (a) initializing a set of trainable parameters of an artificial neural network, wherein the set of trainable parameters comprises a set of trainable weights and a set of trainable biases; (b) training the set of trainable parameters using a set of training data; (c) generating a pruning mask based on the trained set of trainable parameters; (d) rewinding the set of trainable parameters; (e) pruning a selected set of trainable parameters based on the pruning mask; and (f) repeating the above steps from (b) to (e) for a specified number of times to generate a set of sparse neural networks having an incremental sparsity.
12. The system of claim 11, wherein the artificial neural network is parallelized and serialized.
13. The system of claim 11, wherein the artificial neural network uses a set of pruning masks to reduce arithmetic operations.
14. The system of claim 11, wherein the artificial neural network uses a set of quantized parameters to reduce arithmetic multiplication operations using power-of-two, additive powers-of-two, power-of-three, additive powers-of-three, or a variant thereof.
Description
BRIEF DESCRIPTION OF THE DRAWING
[0018] The accompanying drawings, which are included to provide a further understanding of the invention, illustrate embodiments of the invention and, together with the description, explain the principle of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0031] Various embodiments of the present invention are described hereafter with reference to the figures. It should be noted that the figures are not drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of specific embodiments of the invention. They are not intended as an exhaustive description of the invention or as a limitation on the scope of the invention. In addition, an aspect described in conjunction with a particular embodiment of the invention is not necessarily limited to that embodiment and can be practiced in any other embodiments of the invention.
[0032] Some embodiments of the present disclosure provide a multiplier-less deep neural network (DNN) to mitigate fiber-nonlinear distortion of shaped constellations. The novel DNN achieves an excellent performance-complexity trade-off with progressive lottery ticket hypothesis (LTH) weight pruning and additive powers-of-two (APoT) quantization. Some other embodiments include, but are not limited to, inference and prediction for digital pre-distortion, channel equalization, channel estimation, nonlinear turbo equalization, speech recognition, image processing, biosignal sensing, and so on.
DNN Equalization for Probabilistic Amplitude Shaping (PAS) QAM Systems
[0033] The optical communications system 100 under consideration is depicted in the accompanying drawings.
[0034] Due to fiber nonlinearity, residual distortion after linear equalization (LE) will limit the achievable information rates and degrade the bit error rate performance.
[0036] To train the DNN, various loss functions can be used, including but not limited to softmax cross entropy, binary cross entropy, distance, mean-square error, mean absolute error, connectionist temporal classification loss, negative log-likelihood, Kullback-Leibler divergence, margin loss, ranking loss, embedding loss, hinge loss, Huber loss, and so on. Specifically, the trainable parameters of the DNN architecture are trained through the following steps of operations: feeding a set of training data into the input nodes of the DNN; propagating the training data across layers of the DNN; pruning trainable parameters according to pruning masks; generating output values from output nodes of the DNN; calculating loss values for the training data; updating the trainable parameters based on the loss values through a backward message passing; and repeating those steps for a specified number of iterations.
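As a minimal, non-limiting illustration of these training steps, the following sketch assumes a PyTorch model whose weight tensors are paired with fixed binary pruning masks; the optimizer, the cross-entropy loss, and the `masks` dictionary keyed by parameter name are illustrative assumptions, not the disclosed embodiment.

```python
import torch
import torch.nn as nn

def train_masked_dnn(model, masks, loader, num_iters=1000, lr=1e-3):
    """Sketch of the training steps in [0036]: feed data, propagate
    through the masked DNN, compute the loss, back-propagate, and
    update only the surviving (unpruned) weights."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()   # any loss listed in [0036] could be used
    it = 0
    while it < num_iters:
        for x, y in loader:
            # enforce the pruning masks before the forward pass
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if name in masks:
                        p.mul_(masks[name])
            out = model(x)                 # forward propagation across layers
            loss = loss_fn(out, y)         # loss value for this batch
            optimizer.zero_grad()
            loss.backward()                # backward message passing
            # zero the gradients of pruned weights so they stay pruned
            for name, p in model.named_parameters():
                if name in masks and p.grad is not None:
                    p.grad.mul_(masks[name])
            optimizer.step()
            it += 1
            if it >= num_iters:
                break
    return model
```

The mask is re-applied before every forward pass and on the gradients, which is one simple way to keep pruned weights at zero regardless of the optimizer's internal state.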
[0038] To compensate for the residual nonlinear distortion, we use DNN-based equalizers, which directly generate bit-wise soft-decision log-likelihood ratios (LLRs) for the decoder.
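By way of a non-limiting sketch, such an equalizer can be realized as a small fully connected network that maps a window of received symbol samples to one soft output per transmitted bit; the window length, bits per symbol, hidden sizes, and the binary cross-entropy training loss below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class LLREqualizer(nn.Module):
    """Toy DNN equalizer: real/imaginary parts of a symbol window in,
    bit-wise soft-decision logits (interpretable as LLRs) out."""
    def __init__(self, window=11, bits_per_symbol=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * window, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bits_per_symbol),  # one soft output per bit
        )

    def forward(self, x):
        return self.net(x)

# Binary cross entropy with logits matches the bit-wise soft outputs.
model = LLREqualizer()
x = torch.randn(32, 22)                      # batch of 32 symbol windows
bits = torch.randint(0, 2, (32, 6)).float()  # transmitted bit labels
loss = nn.BCEWithLogitsLoss()(model(x), bits)
```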
[0039] Zero-Multiplier DNN with Additive Powers-of-Two (APoT) Quantization
[0040] In order to reduce the computational complexity of the DNN for real-time processing, the present invention provides a way to integrate APoT quantization into a DeepShift framework. In the original DeepShift, DNN weights are quantized into a signed PoT as w = ±2^u, where u is an integer to be trained. Note that the PoT weights can fully eliminate multiplier operations from DNN equalizers, as multiplication by such a weight can be realized with a bit shift for fixed-point precision or an exponent addition for floating-point (FP) precision.
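The following sketch illustrates, under simplifying assumptions, how a weight is rounded to a signed power of two and how the resulting product reduces to a sign flip and an arithmetic shift for fixed-point data; the exponent clipping range and helper names are placeholders rather than the disclosed training procedure.

```python
import torch

def quantize_pot(w, min_exp=-8, max_exp=0):
    """Round each weight to a signed power of two: w_q = sign(w) * 2**u.
    Zero weights are not handled separately in this toy example."""
    sign = torch.sign(w)
    u = torch.round(torch.log2(w.abs().clamp_min(2.0 ** min_exp)))
    u = u.clamp(min_exp, max_exp)
    return sign, u.to(torch.int32)

def pot_multiply_fixed_point(x_int, sign, u):
    """Multiplier-less product for a fixed-point sample: a bit shift plus a
    sign flip. Negative exponents become right shifts (floor rounding here)."""
    shift = int(u)
    y = (x_int << shift) if shift >= 0 else (x_int >> -shift)
    return int(sign) * y

# Example: weight 0.3 is rounded to 2**-2 = 0.25, so multiplying a
# fixed-point sample reduces to an arithmetic shift by 2 bits.
sign, u = quantize_pot(torch.tensor([0.3]))
print(pot_multiply_fixed_point(1024, sign.item(), u.item()))   # -> 256
```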
Lottery Ticket Hypothesis (LTH) Pruning for Sparse DNN
[0045] Even though our DNN equalizer does not require any multipliers, it still needs a relatively large number of addition operations due to the over-parameterized DNN architecture having a huge number of trainable weights. We introduce a progressive version of the LTH pruning method to realize a low-power sparse DNN implementation. It is known that an over-parameterized DNN can be significantly sparsified without losing performance and that, per the LTH, the sparsified DNN can often outperform the dense DNN.
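One possible realization of a single progressive pruning round is sketched below, assuming a global magnitude criterion and rewinding of the surviving weights to their stored initial values; the per-round pruning fraction and the dictionary-of-masks representation are illustrative assumptions.

```python
import torch

def lth_prune_round(model, init_state, masks, prune_frac=0.2):
    """One LTH round: build an updated mask from trained weight magnitudes,
    rewind surviving weights to their initialization, and return the masks."""
    # collect magnitudes of the currently surviving weights
    alive = torch.cat([
        (p.abs() * masks[n]).flatten()[masks[n].flatten() > 0]
        for n, p in model.named_parameters() if n in masks
    ])
    threshold = torch.quantile(alive, prune_frac)   # prune the smallest fraction
    new_masks = {}
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                new_masks[n] = masks[n] * (p.abs() > threshold).float()
                p.copy_(init_state[n] * new_masks[n])     # rewind + prune
            elif n in init_state:
                # biases (not masked) are simply rewound as one possible choice
                p.copy_(init_state[n])
    return new_masks

# init_state would typically be captured once at initialization, e.g.
#   init_state = {k: v.clone() for k, v in model.state_dict().items()}
# Repeating train -> lth_prune_round yields networks of increasing sparsity.
```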
[0047] As discussed above, various DNN equalizers are provided for nonlinear compensation in optical fiber communications employing probabilistic amplitude shaping. We then proposed a zero-multiplier sparse DNN equalizer based on a trainable version of APoT quantization and LTH pruning techniques. We showed that APoT quantization can achieve floating-point arithmetic performance without using any multipliers, whereas the conventional PoT quantization suffers from a severe penalty. We also demonstrated that the progressive LTH pruning can eliminate 99% of the weights, enabling a highly power-efficient implementation of DNN equalization for real-time fiber-optic systems.
[0049] When the computer executable program is performed by the at least one computing processor 120, the program causes the at least one processor to execute a training method and a testing method based on the at least one trained set of trainable parameters, comprising: (a) initializing a set of trainable parameters of an artificial neural network, wherein the set of trainable parameters comprises a set of trainable weights and a set of trainable biases; (b) training the set of trainable parameters using a set of training data; (c) generating a pruning mask based on the trained set of trainable parameters; (d) rewinding the set of trainable parameters; (e) pruning a selected set of trainable parameters based on the pruning mask; and (f) repeating the above steps from (b) to (e) for a specified number of times to generate a set of sparse neural networks having an incremental sparsity.
[0050] The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component. Though, a processor may be implemented using circuitry in any suitable format.
[0051] Also, the embodiments of the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
[0052] Use of ordinal terms such as “first” and “second” in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term).
[0053] Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention.
[0054] Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.