Method, System, and Computer Program Product for Data Pre-Processing in Deep Learning
20200143236 ยท 2020-05-07
Inventors
Cpc classification
G06F18/214
PHYSICS
G06F17/16
PHYSICS
G06V10/7715
PHYSICS
G06F18/213
PHYSICS
G06V30/19147
PHYSICS
International classification
Abstract
The goal of this invention is to develop smart and fast data processing scheme for more computational efficient deep learning to support adaptive and real-time applications. We propose to apply Singular-Value Decomposition (SVD)-QR algorithm to preprocessing of deep learning for large scale data input. For the mass data input, we apply Limited Memory Subspace Optimization for SVD (LMSVD)-QR algorithm to increase the data processing speed. Simulation results in automated handwritten digit recognition show that SVD-QR and LMSVD-QR can tremendously reduce the number of input to deep learning neural network without losing its performance, and both can tremendously increase the data processing speed for deep learning.
Claims
1. A method for smart and fast data pre-processing in deep learning comprises two approaches for different application scenarios.
2. The method of claim 1, wherein said different application scenarios comprising large scale data input and mass data input.
3. The method of claim 1, wherein said two approaches comprising SVD-QR and LMSVD-QR.
4. The method of claim 3, wherein said SVD-QR is used for the said large scale data input in claim 2.
5. The method of claim 3, wherein said LMSVD-QR is used for the said mass data input in claim 2.
6. A computer-readable medium carrying one or more sequences of one or more instructions for input data pre-processing in deep learning, the one or more sequences of one or more instructions including instructions which, when executed by one or more processors, cause the one or more processors to perform the steps recited in any one of claims 1-5.
7. An application system configured to include data pre-processor to perform the steps recited in any one of claims 1-5, as input to deep learning comprising: a device configured to compute the singular values using the said SVD or LMSVD; said device configured to determine the number of singular values to select; said device configured to perform the said QR computation to determine which columns in weight matrix should be selected.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING
[0034] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
DETAILED DESCRIPTION OF THE INVENTION
Smart and Fast Data Processing for Deep Learning with Large Scale Data Input
[0046] We propose to apply SVD-QR for pre-processing for deep learning. For deep learning applications with 1-D input, we can construct a matrix based on its multiple input from training set. The pre-processing procedure can be summarized as follows. [0047] 1. Calculate the SVD [] of as
.sup.{circumflex over (r)}{circumflex over (r)}, V.sub.12
.sup.{circumflex over (r)}(M{circumflex over (r)}), V.sub.21
.sup.(M{circumflex over (r)}){circumflex over (r)}, and V.sub.22
.sup.(M{circumflex over (r)})(M{circumflex over (r)}). [0050] 4. Using QR decomposition with column pivoting, determine such that
Q.sup.T[V.sub.11.sup.T,V.sub.21.sup.T]=[R.sub.11,R.sub.12](1) where Q is a unitary matrix. [0051] 5. The permutation matrix corresponds to the ordered most-significant singular values based on the position of 1's in each column, and we can choose the first {circumflex over (r)} columns. The row positions of 1's in the {circumflex over (r)} columns will enable us to find the {circumflex over (r)} most significant columns in , which are the {circumflex over (r)} most important input to the deep learning networks.
Preprocessing for Mass Data Input
[0052] We shall apply Limited Memory Subspace Optimization for SVD (LMSVD) algorithm [] to deep learning preprocessing with massive data input. LMSVD is used for computing dominant singular value decompositions of large matrices. The approach is based on a block Krylov subspace optimization technique which significantly accelerates the classic simultaneous iteration method, then QR could be applied following the LMSVD to obtain the {circumflex over (r)} most important columns in . The purpose of LMSVD is to to compute dominant SVDs of mass data in matrices, with desired precision via choosing appropriate k value, i.e., to consider a real matrix
.sup.mn and a given positive integer k<<min(m, n), such that [
]
[0053] where denotes the Frobenius norm of matrix, and U.sub.k.sup.mk, V.sub.k
.sup.nk, a diagonal matrix .sub.k
.sup.kk whose entries .sub.1.sub.2 . . . .sub.k are the k largest singular values of . So the approximation factor k value is critical in determining the accuracy of this approximation.
[0054] The main theoretical basis for LMSVD is that the k leading eigenvectors of .sup.T maximize the following Rayleigh-Ritz function under orthogonality constraint:
subject to X.sup.TX=I. The goal of LMSVD is to compute the kth dominant SVD of a matrix R.sup.mn as defined in (2) based on accelerating the simple subspace iteration (SSI) method via solving (3) in a chosen subspace at each iteration []. Based on LMSVD, we are able to obtain U.sub.k, .sub.k V.sub.k. We propose to use LMSVD-QR on the basis of LMSVD results to select the desired inputs for deep learning.
[0055] Based on the diagonal values of .sub.k, .sub.1, .sub.2, . . . , .sub.k and desired percentage of kept singular values , to determine {circumflex over (r)}, ({circumflex over (r)}<<k), where
Similarly we can partition V.sub.k, and use the same procedure as in Section 7 to determine the {circumflex over (r)} most important inputs to the deep learning networks.
Simulation Results for SVD-QR Approach
[0056] In this invention, handwritten digits recognition is used in our simulation. We apply SVD-QR pre-processing and LMSVD-QR pre-processing for deep learning neural networks to handwritten digits (from 0 to 9) recognition, as illustrated in
[0057] Our simulation was based on data set in ex3data1.mat in www.coursera.org (Machine Learning) [] that contains 5000 training examples of handwritten digits. Each training/testing example contains a 2020 pixels grayscale image of the digit from 0 to 9, and each pixel is represented by a floating point number (from 0.1320 to 1.1277 in the data set we used) indicating the grayscale intensity at that location. The 2020 grid of pixels can be vectorized into a 400-dimensional vector. So a matrix can be constructed where each of these training examples becomes a single row. This gives us a 5000400 matrix where every row is a training example for a handwritten digit (0 to 9) image. The second part of the training set is a 5000-dimensional vector that contains labels (actual digit from 0 to 9) for the training set. Totally we have 5000 examples of handwritten digits in the database, and each digit (from 0 to 9) has 500 examples.
[0058] A feedforward neural network (NN) [] was applied to this application with three layers. The input layer has 400 units because 2020 pixels (matrix 2020) could be vectorized into a vector with length 400. The hidden layer has 25 units, and output layer has 10 units (to represent the 10 digits from 0-9). The feedforward neural network was trained using steepest descent algorithm. We trained neural networks using backpropagation to compute the gradient for the neural network cost function. For regularized logistic regression, the cost function is defined as [
]
where m is the input data length, and K=10 is the total number of labels (from 0 to 9). h.sub.(.sup.(i)).sub.k=is the activation output of the k-th unit in the output layer. We randomly initialized the parameters .sup.(l) for symmetry breaking. The initial value range is chosen based on
[] where L.sub.in and L.sub.out are the number of units in the layers adjacent to .sup.(l), so we chose 8.sup.(l) uniformly distributed within [0.12, 0.12]. We chose =0.1 since it could get better performance based on our experience, and we also compared it against other values of .
[0059] In comparison, we also applied linear classifier (logistic regression model) [] to this application. We use multiple one-vs-all logistic regression models to build a multi-class classifier. Since there are 10 classes, we need to train 10 separate logistic regression classifiers. For regularized logistic regression, the cost function is defined as [
]
where m is the input data length, and for every example i, we compute h.sub.(.sup.(i))=g(.sup.T.sup.(i)) and
is the sigmoid function. Steepest descent was used to train the parameters in the logistic regression model, and we chose =0.1 since it could get better performance.
[0060] We ran simulations for two scenarios. [0061] 1. In the first scenario, all data (5000 examples) were used in training for 200 iterations, and then parameters in the feedforward neural network were frozen, and the 5000 examples were tested for recognition accuracy. The probability of recognition accuracy is 99.7%. In contrast, we applied the SVD-QR preprocessing to a feedforward network with the same number of neurons in hidden layer and output. For all 5000 examples (each example is a vector with length 400), we can get a matrix with size 5000400. Based on the procedure described above, we chose {circumflex over (r)} which satisfies that the ratio between the sum of {circumflex over (r)} largest singular values and the sum of all singular values is greater . We chose =0.5, 0.6, 0.7, 0.8, and the corresponding {circumflex over (r)}=32, 48, 70,103, respectively. In comparison, we applied uniform downsampling with 80 and 100 columns kept in . In
[0062] All these simulations were based on =0.1. To see how other a values work, we also compared it against other values of , as summarized in
[0063] In this scenario, the linear classifiers were also trained for 200 iterations based on all 5000 examples. For linear classifier with 400 inputs, the probability of recognition accuracy is 96.36%. We applied SVD-QR for preprocessing, and the performances are summarized in
[0065] Observe the two sets of comparisons in
Simulation Results for LMSVD-QR Approach
[0066] As presented in Section 8, the approximation factor k value is critical in determining the accuracy of LMSVD approximation, and subsequently will help to determine the number of input {circumflex over (r)} to deep learning network. We ran simulations for different k values (k=200, 250, 300) and values (=0.5, 0.6, 0.7, 0.8).
[0067] Similar to the two scenarios in Section 11, we also ran simulations for the same two scenarios, i.e., all data (5000 examples) were used in training for 200 iterations, versus only 50% data were used in training for 200 iterations. We summarize the probability of recognition accuracy of LMSVD-QR preprocessing for neural network in
[0068] Observe
[0069] Regarding the deep learning processing speed, it is vastly increased because of LMSVD-QR. As mentioned in Section 11, for no input reduction (with the number of input 400), the running time is 75 seconds. Observe
Performance Analysis
[0070] How did the SVD-QR and LMSVD-QR algorithms improve real-time preprocessing? The SVD-QR and LMSVD-QR selected only a small subset as input to neural network for deep learning. Observe (4), m, is the input data length. When m is smaller because of data pre-processing, much less number of computations will be involved, which improves deep learning speed. Since SVD-QR and LMSVD-QR are linear transformations, their computation speeds are very fast, so the data preprocessing time could be negligible comparing to the deep learning iterative process. Why did the SVD-QR preprocessing performs much better than uniform downsampling? To examine it visually, the handwritten digits after SVD-QR preprocessing were illustrated in
[0071] For the LMSVD-QR preprocessing algorithm, we also illustrated it when we chose r=300 in
[0072] To look into whether SVD-QR and LMSVD preprocessing algorithms kept the same pixels, we scattered the kept 32 pixels for SVD-QR and LMSVD-QR when =0.5 and the kept 70 pixels for =0.7 in
BIBLIOGRAPHY
[0073] [1] http://www.deeplearningbook.org 10
[0074] [2] R. Baranuik, Compressive sensing, IEEE Signal Processing Magazine, Vol. 24, No. 4, pp. 118-121, July 2007. 2
[0075] [3] E. Cands, Compressive sampling, Int. Congress of Mathematics, vol. 3, pp. 1433-1452, Madrid, Spain, 2006. 2
[0076] [4] E. Cands and J. Romberg, Sparse and incoherence in compressive sampling, Inverse Problem, vol. 23, no. 3, pp. 969-985, 2007. 2
[0077] [5] E. Cands and M. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21-30, March 2008. 2
[0078] [6] D. Donoho Compressed sensing, IEEE Trans. on Information Theory, Vol. 52, No. 4, pp. 1289-1306, April 2006. 2
[0079] [7] G. H. Golub and C. F. Van Loan, Matrix Computation, John Hopkins University Press, Baltimore, ML, 2013. 7
[0080] [8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd Ed, Springer, New York, N.Y., 2008. 10
[0081] [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nev. 1
[0082] [10] X. Liu, Z. Wen, and Y. Zhang, Limited memory block Krylov subspace optimization for computing dominant singular value decompositions, SIAM Journal on Scientific Computing, vol. 35, no. 3, pp. 1641-1668, 2013. 8, 9
[0083] [11] National Research Council, Frontiers in Massive Data Analysis, Washington, D.C.: The National Academies Press, https://doi.org/10.17226/18374, 2013. 3
[0084] [12] A. Ng, Machine Learning, www.coursera.org 9, 10, 11
[0085] [13] P. P. Vaidyanathan and P. Pal, Sparse sensing with co-prime samplers and arrays, IEEE Transactions on Signal Processing, vol. 59, No. 2, February 2011, pp. 573-586. 2
[0086] [14] P. P. Vaidyanathan and P. Pal, Theory of sparse coprime sensing in multiple dimensions, IEEE Trans. on Signal Processing, vol. 59, no. 8, August 2011, pp. 3592-3608. 2