Method, System, and Computer Program Product for Data Pre-Processing in Deep Learning

20200143236 ยท 2020-05-07

    Inventors

    Cpc classification

    International classification

    Abstract

    The goal of this invention is to develop smart and fast data processing scheme for more computational efficient deep learning to support adaptive and real-time applications. We propose to apply Singular-Value Decomposition (SVD)-QR algorithm to preprocessing of deep learning for large scale data input. For the mass data input, we apply Limited Memory Subspace Optimization for SVD (LMSVD)-QR algorithm to increase the data processing speed. Simulation results in automated handwritten digit recognition show that SVD-QR and LMSVD-QR can tremendously reduce the number of input to deep learning neural network without losing its performance, and both can tremendously increase the data processing speed for deep learning.

    Claims

    1. A method for smart and fast data pre-processing in deep learning comprises two approaches for different application scenarios.

    2. The method of claim 1, wherein said different application scenarios comprising large scale data input and mass data input.

    3. The method of claim 1, wherein said two approaches comprising SVD-QR and LMSVD-QR.

    4. The method of claim 3, wherein said SVD-QR is used for the said large scale data input in claim 2.

    5. The method of claim 3, wherein said LMSVD-QR is used for the said mass data input in claim 2.

    6. A computer-readable medium carrying one or more sequences of one or more instructions for input data pre-processing in deep learning, the one or more sequences of one or more instructions including instructions which, when executed by one or more processors, cause the one or more processors to perform the steps recited in any one of claims 1-5.

    7. An application system configured to include data pre-processor to perform the steps recited in any one of claims 1-5, as input to deep learning comprising: a device configured to compute the singular values using the said SVD or LMSVD; said device configured to determine the number of singular values to select; said device configured to perform the said QR computation to determine which columns in weight matrix should be selected.

    Description

    BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING

    [0034] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

    [0035] FIG. 1 is an example illustrating 25 pictures of handwritten digits in the data set.

    [0036] FIG. 2(a)(b) are graphs illustrating the probability of recognition accuracy of SVD-QR preprocessing and uniform downsampling. (a) Neural network-based approach, (b) linear classifier approach.

    [0037] FIG. 3 are graphs of probability of recognition accuracy of SVD-QR preprocessing in Neural network-based approach with different values.

    [0038] FIG. 4(a)(b) are graphs illustrating running time versus the number of inputs. (a) Neural network-based approach, (b) linear classifier approach.

    [0039] FIG. 5 is a graph illustrating probability of recognition accuracy of LMSVD-QR preprocessing for neural network.

    [0040] FIG. 6 are graphs illustrating the number of input versus the percentage of kept singular values.

    [0041] FIG. 7 are graphs illustrating running time versus the number of inputs based on LMSVD-QR in neural network-based approach.

    [0042] FIG. 8(a)(b) are pictures illustrating handwritten digits after SVD-QR, (a) with only 32 pixels left, and (b) with 70 pixels left.

    [0043] FIG. 9(a)(b) are pictures illustrating handwritten digits after uniform downsampling, (a) with 37 pixels left; and (b) with 100 pixels left.

    [0044] FIG. 10 (a)(b) are pictures illustrating handwritten digits after LMSVD-QR pre-processing with r=300. (a) with only 32 pixels left when =0.5; (b) with 69 pixels left when =0.7.

    [0045] FIG. 11(a)(b) are pictures illustrating the selected columns of matrix (i.e., the pixel index). (a) =0.5; (b) =0.7.

    DETAILED DESCRIPTION OF THE INVENTION

    Smart and Fast Data Processing for Deep Learning with Large Scale Data Input

    [0046] We propose to apply SVD-QR for pre-processing for deep learning. For deep learning applications with 1-D input, we can construct a matrix based on its multiple input from training set. The pre-processing procedure can be summarized as follows. [0047] 1. Calculate the SVD [custom-character] of as

    [00001] = U [ 0 0 0 ] .Math. V T and save V. [0048] 2. Based on the diagonal values of , .sub.1, .sub.2, . . . , .sub.r (r=rank()) and desired percentage of kept singular values , to determine {circumflex over (r)}, ({circumflex over (r)}r), where

    [00002] = .Math. i = 1 r ^ .Math. i .Math. i = 1 r .Math. i . stands for the percentage of the kept input power, and when =100%, there is no input reduction. [0049] 3. Partition

    [00003] V = [ V 11 V 12 V 21 V 22 ] , where V.sub.11custom-character.sup.{circumflex over (r)}{circumflex over (r)}, V.sub.12custom-character.sup.{circumflex over (r)}(M{circumflex over (r)}), V.sub.21custom-character.sup.(M{circumflex over (r)}){circumflex over (r)}, and V.sub.22custom-character.sup.(M{circumflex over (r)})(M{circumflex over (r)}). [0050] 4. Using QR decomposition with column pivoting, determine such that


    Q.sup.T[V.sub.11.sup.T,V.sub.21.sup.T]=[R.sub.11,R.sub.12](1) where Q is a unitary matrix. [0051] 5. The permutation matrix corresponds to the ordered most-significant singular values based on the position of 1's in each column, and we can choose the first {circumflex over (r)} columns. The row positions of 1's in the {circumflex over (r)} columns will enable us to find the {circumflex over (r)} most significant columns in , which are the {circumflex over (r)} most important input to the deep learning networks.

    Preprocessing for Mass Data Input

    [0052] We shall apply Limited Memory Subspace Optimization for SVD (LMSVD) algorithm [custom-character] to deep learning preprocessing with massive data input. LMSVD is used for computing dominant singular value decompositions of large matrices. The approach is based on a block Krylov subspace optimization technique which significantly accelerates the classic simultaneous iteration method, then QR could be applied following the LMSVD to obtain the {circumflex over (r)} most important columns in . The purpose of LMSVD is to to compute dominant SVDs of mass data in matrices, with desired precision via choosing appropriate k value, i.e., to consider a real matrix custom-character.sup.mn and a given positive integer k<<min(m, n), such that [custom-character]

    [00004] k = U k .Math. k .Math. V k T = arg .Math. .Math. min rank ( W ) < k .Math. .Math. - W .Math. F 2 ( 2 )

    [0053] where denotes the Frobenius norm of matrix, and U.sub.kcustom-character.sup.mk, V.sub.kcustom-character.sup.nk, a diagonal matrix .sub.kcustom-character.sup.kk whose entries .sub.1.sub.2 . . . .sub.k are the k largest singular values of . So the approximation factor k value is critical in determining the accuracy of this approximation.

    [0054] The main theoretical basis for LMSVD is that the k leading eigenvectors of .sup.T maximize the following Rayleigh-Ritz function under orthogonality constraint:

    [00005] max m k .Math. .Math. T .Math. X .Math. F 2 ( 3 )

    subject to X.sup.TX=I. The goal of LMSVD is to compute the kth dominant SVD of a matrix R.sup.mn as defined in (2) based on accelerating the simple subspace iteration (SSI) method via solving (3) in a chosen subspace at each iteration [custom-character]. Based on LMSVD, we are able to obtain U.sub.k, .sub.k V.sub.k. We propose to use LMSVD-QR on the basis of LMSVD results to select the desired inputs for deep learning.

    [0055] Based on the diagonal values of .sub.k, .sub.1, .sub.2, . . . , .sub.k and desired percentage of kept singular values , to determine {circumflex over (r)}, ({circumflex over (r)}<<k), where

    [00006] = .Math. i = 1 r ^ .Math. i .Math. i = 1 r .Math. i .

    Similarly we can partition V.sub.k, and use the same procedure as in Section 7 to determine the {circumflex over (r)} most important inputs to the deep learning networks.

    Simulation Results for SVD-QR Approach

    [0056] In this invention, handwritten digits recognition is used in our simulation. We apply SVD-QR pre-processing and LMSVD-QR pre-processing for deep learning neural networks to handwritten digits (from 0 to 9) recognition, as illustrated in FIG. 1. Handwritten alphabets recognition will be studied in our future works.

    [0057] Our simulation was based on data set in ex3data1.mat in www.coursera.org (Machine Learning) [custom-character] that contains 5000 training examples of handwritten digits. Each training/testing example contains a 2020 pixels grayscale image of the digit from 0 to 9, and each pixel is represented by a floating point number (from 0.1320 to 1.1277 in the data set we used) indicating the grayscale intensity at that location. The 2020 grid of pixels can be vectorized into a 400-dimensional vector. So a matrix can be constructed where each of these training examples becomes a single row. This gives us a 5000400 matrix where every row is a training example for a handwritten digit (0 to 9) image. The second part of the training set is a 5000-dimensional vector that contains labels (actual digit from 0 to 9) for the training set. Totally we have 5000 examples of handwritten digits in the database, and each digit (from 0 to 9) has 500 examples.

    [0058] A feedforward neural network (NN) [custom-character] was applied to this application with three layers. The input layer has 400 units because 2020 pixels (matrix 2020) could be vectorized into a vector with length 400. The hidden layer has 25 units, and output layer has 10 units (to represent the 10 digits from 0-9). The feedforward neural network was trained using steepest descent algorithm. We trained neural networks using backpropagation to compute the gradient for the neural network cost function. For regularized logistic regression, the cost function is defined as [custom-character]

    [00007] J ( ) = 1 m .Math. .Math. i = 1 m .Math. .Math. k = 1 K .Math. [ - y k ( i ) .Math. log ( ( h ( x ( i ) ) ) k ) - ( 1 - y k ( i ) ) .Math. log ( 1 - h ( x ( i ) ) ) k ) ] + 2 .Math. m [ .Math. j = 1 25 .Math. .Math. k = 1 400 .Math. ( j , k ( 1 ) ) 2 + .Math. j = 1 10 .Math. .Math. k = 1 25 .Math. ( j , k ( 2 ) ) 2 ] ( 4 )

    where m is the input data length, and K=10 is the total number of labels (from 0 to 9). h.sub.(.sup.(i)).sub.k=is the activation output of the k-th unit in the output layer. We randomly initialized the parameters .sup.(l) for symmetry breaking. The initial value range is chosen based on

    [00008] 6 L in + L out

    [custom-character] where L.sub.in and L.sub.out are the number of units in the layers adjacent to .sup.(l), so we chose 8.sup.(l) uniformly distributed within [0.12, 0.12]. We chose =0.1 since it could get better performance based on our experience, and we also compared it against other values of .

    [0059] In comparison, we also applied linear classifier (logistic regression model) [custom-character] to this application. We use multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, we need to train 10 separate logistic regression classifiers. For regularized logistic regression, the cost function is defined as [custom-character]

    [00009] J ( ) = 1 m .Math. .Math. i = 1 m .Math. [ - y ( i ) .Math. log ( h ( x ( i ) ) ) - ( 1 - y ( i ) ) .Math. log ( 1 - h ( x ( i ) ) ) ] + 2 .Math. m .Math. .Math. j = 1 n .Math. j 2 ( 5 )

    where m is the input data length, and for every example i, we compute h.sub.(.sup.(i))=g(.sup.T.sup.(i)) and

    [00010] g ( x ) = 1 1 + e - x

    is the sigmoid function. Steepest descent was used to train the parameters in the logistic regression model, and we chose =0.1 since it could get better performance.

    [0060] We ran simulations for two scenarios. [0061] 1. In the first scenario, all data (5000 examples) were used in training for 200 iterations, and then parameters in the feedforward neural network were frozen, and the 5000 examples were tested for recognition accuracy. The probability of recognition accuracy is 99.7%. In contrast, we applied the SVD-QR preprocessing to a feedforward network with the same number of neurons in hidden layer and output. For all 5000 examples (each example is a vector with length 400), we can get a matrix with size 5000400. Based on the procedure described above, we chose {circumflex over (r)} which satisfies that the ratio between the sum of {circumflex over (r)} largest singular values and the sum of all singular values is greater . We chose =0.5, 0.6, 0.7, 0.8, and the corresponding {circumflex over (r)}=32, 48, 70,103, respectively. In comparison, we applied uniform downsampling with 80 and 100 columns kept in . In FIG. 2a, the probability of recognition accuracy of SVD-QR preprocessing and uniform downsampling based on all data (5000 samples) are plotted. Observe that SVD-QR preprocessing performs very well, and for 103 inputs, the probability of recognition accuracy (99.7%) is the same as that of 400 inputs (99.7%), and it performs much better than that of uniform downsampling (92.66% for 100 inputs).

    [0062] All these simulations were based on =0.1. To see how other a values work, we also compared it against other values of , as summarized in FIG. 3. Observe that =0.1 performs the best, so we chose =0.1 in all remaining simulations.

    [0063] In this scenario, the linear classifiers were also trained for 200 iterations based on all 5000 examples. For linear classifier with 400 inputs, the probability of recognition accuracy is 96.36%. We applied SVD-QR for preprocessing, and the performances are summarized in FIG. 2b for different number of inputs. Observe that for 103 inputs after SVD-QR, the probability of recognition accuracy is 92.5%, and for 100 inputs after uniform downsampling, the performance is only 81%, which shows SVD-QR preprocessing performs much better than uniform downsampling. Comparing neural network classifier to linear classifier, it's clear that neural network-based approach performs much better. Our simulation results also demonstrate that SVD-QR preprocessing is very powerful for neural network-based deep learning. [0064] 2. In the second scenario, only 50% of the data (2500 examples, 250 examples for each digit) were used in training for 200 iterations, and then parameters in the deep feedforward network were frozen, and the remaining 2500 examples (250 examples for each digit) were tested for recognition accuracy. The probability of recognition accuracy is 90.2%. In contrast, we applied the SVD-QR preprocessing to a deep feedforward neural network with three layers. For the training set 2500 examples (each example is a vector with length 400), we can get a matrix with size 2500400. Similarly, we chose =0.5, 0.6, 0.7, 0.8, and the corresponding {circumflex over (r)}=31, 47, 68, 101, respectively. In comparison, we also applied uniform downsampling with 80 and 100 columns kept in . The probability of recognition accuracy of SVD-QR preprocessing and uniform downsampling based on 2500 examples for training and 2500 examples for testing are summarized in FIG. 2a (for neural network-based approaches) and FIG. 2b (for linear classifier-based approach).

    [0065] Observe the two sets of comparisons in FIG. 2a, the SVD-QR preprocessing tremendously reduces the number of inputs to neural networks. Based on 103 inputs (after SVD-QR, for all data training), the NN performs exactly the same as that of 400 inputs (99.7%). For the second scenario, 101 inputs (after SVD-QR preprocessing) can achieve recognition accuracy 89.84% comparing to accuracy of 90.2% with 400 inputs. Most important, the smaller number of inputs reduces the computational complexity and increases the speed of decision process. To compare the running time numerically, we ran the simulations for neural network-based approach with original 400 inputs, and the simulation time is 75 seconds based on MacBook Pro with 2.8 GHz Intel Core i7 Processor and 16 GB memory. For comparison with SVD-QR preprocessing, the running time versus the number of inputs are summarized for both neural network-based approach (in FIG. 4a) and linear classifier (in FIG. 4b). The energy consumption is proportional to the running time, so comparing to the original 400 inputs, SVD-QR approach has reduced the energy consumption around 70%, which is good for energy efficient IoT.

    Simulation Results for LMSVD-QR Approach

    [0066] As presented in Section 8, the approximation factor k value is critical in determining the accuracy of LMSVD approximation, and subsequently will help to determine the number of input {circumflex over (r)} to deep learning network. We ran simulations for different k values (k=200, 250, 300) and values (=0.5, 0.6, 0.7, 0.8).

    [0067] Similar to the two scenarios in Section 11, we also ran simulations for the same two scenarios, i.e., all data (5000 examples) were used in training for 200 iterations, versus only 50% data were used in training for 200 iterations. We summarize the probability of recognition accuracy of LMSVD-QR preprocessing for neural network in FIG. 5, the number of input versus the percentage of kept singular values () in FIG. 6, and running time versus the number of inputs in FIG. 7.

    [0068] Observe FIGS. 5-7, the performances of neural network in scenario one (based on all data for training) are much better than those in scenario two (based on 50% data in training and remaining 50% for testing). Observe FIG. 5, for k=300 with all data in training, the probability of recognition accuracy is 99.7% with only 104 inputs, same as that with the number of inputs 400 (with no input reduction); and for both scenarios, the probability of recognition accuracy with k=300 performs much better than k=250 and k=200, which verifies that large value of k has better approximation in LMSVD. Observe FIG. 6, the number of input monotonically increases when the percentage of kept singular values () increases, and the value of k or training scenario doesn't have big impact on the number of inputs. However, even for the same , different k results in different {circumflex over (r)} because k value determines the approximation accuracy in (2), and different singular values and singular vectors are obtained for different k.

    [0069] Regarding the deep learning processing speed, it is vastly increased because of LMSVD-QR. As mentioned in Section 11, for no input reduction (with the number of input 400), the running time is 75 seconds. Observe FIG. 7, the running time of has been vastly reduced because of smart and fast preprocessing using LMSVD-QR. For example, with k=300, and the number of input reduced to 104, it only takes 23 seconds to achieve the same recognition accuracy with 400 inputs. Comparing to the original 400 inputs, LMSVD-QR approach has reduced the energy consumption around 75%, which is more desirable for energy efficient IoT.

    Performance Analysis

    [0070] How did the SVD-QR and LMSVD-QR algorithms improve real-time preprocessing? The SVD-QR and LMSVD-QR selected only a small subset as input to neural network for deep learning. Observe (4), m, is the input data length. When m is smaller because of data pre-processing, much less number of computations will be involved, which improves deep learning speed. Since SVD-QR and LMSVD-QR are linear transformations, their computation speeds are very fast, so the data preprocessing time could be negligible comparing to the deep learning iterative process. Why did the SVD-QR preprocessing performs much better than uniform downsampling? To examine it visually, the handwritten digits after SVD-QR preprocessing were illustrated in FIG. 8a (with only 32 pixels left) and FIG. 8b (with 70 pixels left) based on 25 digits examples. Since only partial pixels were left, we filled in the reduced pixels using 0's to make the visual effects comparable to the original images in FIG. 1. Based on FIG. 2a, the probability of recognition accuracy for neural network and SVD-QR preprocessing-based approach is 95.87% for 32 pixels, and 99.52% for 70 pixels. We compared it against uniform downsampling, and same handwritten digits after uniform downsampling were illustrated in FIG. 9a (downsampling by 11, with 37 pixels left) and In FIG. 9b (downsampling by 4, with 100 pixels left). Based on FIG. 2a, the probability of recognition accuracy for neural network and uniform downsampling-based approach is 92.66% for 100 pixels. We ran simulations for uniform downsampling by 11 (with 37 pixels left), and obtained the probability of recognition accuracy 81.2%. Our visual observation of FIG. 8ab also testifies that they are easy to be identified after SVD-QR preprocessing, but FIG. 9ab are much more difficult to be identified.

    [0071] For the LMSVD-QR preprocessing algorithm, we also illustrated it when we chose r=300 in FIG. 10a (with only 32 pixels left) and FIG. 10b (with 69 pixels left) based on the same 25 digits examples. Based on FIG. 2a, the probability of recognition accuracy for LMSVD-QR preprocessing-based neural network approach is 95.32% for 32 pixels, and 99.24% for 69 pixels.

    [0072] To look into whether SVD-QR and LMSVD preprocessing algorithms kept the same pixels, we scattered the kept 32 pixels for SVD-QR and LMSVD-QR when =0.5 and the kept 70 pixels for =0.7 in FIGS. 11a and 11b, respectively. Observe these two figures, the kept pixels are not the same, which means, SVD-QR and LMSVD-QR could achieve good performance with different outcomes. Since SVD and LMSVD result in different singular values and singular vectors, the QR will have different selected pixels. SVD-QR is for large-scale data input, but for mass data input, LMSVD-QR is more appropriate to be used to increase the processing speed.

    BIBLIOGRAPHY

    [0073] [1] http://www.deeplearningbook.org 10

    [0074] [2] R. Baranuik, Compressive sensing, IEEE Signal Processing Magazine, Vol. 24, No. 4, pp. 118-121, July 2007. 2

    [0075] [3] E. Cands, Compressive sampling, Int. Congress of Mathematics, vol. 3, pp. 1433-1452, Madrid, Spain, 2006. 2

    [0076] [4] E. Cands and J. Romberg, Sparse and incoherence in compressive sampling, Inverse Problem, vol. 23, no. 3, pp. 969-985, 2007. 2

    [0077] [5] E. Cands and M. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21-30, March 2008. 2

    [0078] [6] D. Donoho Compressed sensing, IEEE Trans. on Information Theory, Vol. 52, No. 4, pp. 1289-1306, April 2006. 2

    [0079] [7] G. H. Golub and C. F. Van Loan, Matrix Computation, John Hopkins University Press, Baltimore, ML, 2013. 7

    [0080] [8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd Ed, Springer, New York, N.Y., 2008. 10

    [0081] [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nev. 1

    [0082] [10] X. Liu, Z. Wen, and Y. Zhang, Limited memory block Krylov subspace optimization for computing dominant singular value decompositions, SIAM Journal on Scientific Computing, vol. 35, no. 3, pp. 1641-1668, 2013. 8, 9

    [0083] [11] National Research Council, Frontiers in Massive Data Analysis, Washington, D.C.: The National Academies Press, https://doi.org/10.17226/18374, 2013. 3

    [0084] [12] A. Ng, Machine Learning, www.coursera.org 9, 10, 11

    [0085] [13] P. P. Vaidyanathan and P. Pal, Sparse sensing with co-prime samplers and arrays, IEEE Transactions on Signal Processing, vol. 59, No. 2, February 2011, pp. 573-586. 2

    [0086] [14] P. P. Vaidyanathan and P. Pal, Theory of sparse coprime sensing in multiple dimensions, IEEE Trans. on Signal Processing, vol. 59, no. 8, August 2011, pp. 3592-3608. 2