SECOND-ORDER OPTIMIZATION METHODS FOR AVOIDING SADDLE POINTS DURING THE TRAINING OF DEEP NEURAL NETWORKS
20210357740 · 2021-11-18
Inventors
- Xi He (Allentown, PA, US)
- Ioannis Akrotirianakis (Princeton, NJ, US)
- Amit Chakraborty (East Windsor, NJ, US)
Cpc classification
G06N7/00
PHYSICS
International classification
Abstract
A computer-implemented method for training a deep neural network includes defining a loss function corresponding to the deep neural network, receiving a training dataset comprising training samples, and setting current parameter values to initial parameter values. An optimization method is performed which iteratively minimizes the loss function. During each iteration, a steepest direction of the loss function is calculated by determining the gradient of the loss function at the current parameter values. A batch of samples included in training samples is selected. A matrix-free CG solver is applied to obtain an inexact solution to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples. A descent direction is determined, and the parameter values are updated based on the descent direction. Following the optimization method, the parameter values are stored in relationship to the deep neural network.
Claims
1. A computer-implemented method for training a deep neural network, the method comprising: defining a loss function corresponding to the deep neural network; receiving a training dataset comprising a plurality of training samples; setting current parameter values to initial parameter values; perform an optimization method which iteratively minimizes the loss function over a plurality of iterations, wherein each iteration comprises: calculating a steepest direction of the loss function by determining the gradient of the loss function at the current parameter values, selecting a batch of samples included in the plurality of training samples, apply a matrix-free CG solver to obtain an inexact solution to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples, determining a descent direction based on the inexact solution to the linear system and the steepest direction of the loss function, and updating the current parameter values based on the descent direction; and following the optimization method, storing the current parameter values in relationship to the deep neural network.
2. The method of claim 1, wherein the current parameter values are updated based on the descent direction and a learning rate calculated using the steepest direction of the loss function and the descent direction.
3. The method of claim 2, wherein the learning rate is calculated using an Amijo line search method.
4. The method of claim 2, wherein the learning rate is calculated using a Goldstein line-search method.
5. The method of claim 1, wherein the batch of samples comprises a random sampling of the plurality of training samples.
6. The method of claim 5, wherein the random sampling the plurality of training samples is resampled during each of the plurality of iterations.
7. The method of claim 1, wherein the optimization method is performed using a parallel computing platform and computing operations associated with the optimization method are performed in parallel across a plurality of processors included in the parallel computing platform.
8. A computer-implemented method for training a deep neural network, the method comprising: defining a loss function corresponding to the deep neural network; receiving a training dataset comprising a plurality of training samples; setting current parameter values to initial parameter values; using a computing platform to perform an optimization method which iteratively minimizes the loss function over a plurality of iterations, wherein each iteration comprises: calculating a gradient for the loss function at the current parameter values; selecting a batch of samples included in the plurality of training samples, constructing a trust region subproblem that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function with respect to the batch of samples, determining a descent direction by applying a SteihaugCG solver to the trust region subproblem given a trust region radius, and conditionally updating the current parameter values and the trust region radius based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction; and following the optimization method, storing the current parameter values in relationship to the deep neural network.
9. The method of claim 8, wherein the batch of samples comprising a random sampling of the plurality of training samples.
10. The method of claim 9, wherein the random sampling the plurality of training samples is resampled during each of the plurality of iterations.
11. The method of claim 8, wherein the trust region radius corresponds as a spherical area in which the trust region subproblem lies.
12. The method of claim 8, wherein the trust region subproblem is a bounded quadratic minimization problem.
13. The method of claim 8, wherein the current parameter values are updated by: selecting a learning rate for the descent direction; determining a first set of parameters based on the product of the descent direction and the learning rate; determining a momentum descent direction at the first set of parameters; selecting a momentum rate for the momentum descent direction; and updating the current parameter values based on the first set of parameters and the product of the momentum descent direction and the momentum rate.
14. The method of claim 13, wherein the learning rate is determined using a backtracking line search based on the loss function, the current parameter values, and the descent direction.
15. The method of claim 13, wherein the momentum rate is determined using a backtracking line search based on the loss function, the first set of parameters, and the momentum descent direction.
16. The method of claim 8, wherein optimization method is performed using a parallel computing platform and computing operations associated with the optimization method are performed in parallel across a plurality of processors included in the parallel computing platform.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
DETAILED DESCRIPTION
[0024] Systems, methods, and apparatuses are described herein which relate generally to second-order optimization methods for avoiding saddle points during the training of deep neural networks. More specifically, the techniques described herein employ two stochastic Hessian-based methods: Inexact Stochastic Newton-CG (SINNC) and Inexact Stochastic Trust Region method (SINTR). These two methods use stochastic Hessian information for detecting the negative curvature direction efficiently. An earlier-terminated CG solver is given to find an approximate solution for the possibly indefinite sub-problem for SINNC and the SteihaugCG solver is applied and learned in SINTR. A number of illustrated examples are used to demonstrate the superior performance of SINNC and SINTR compared to MSGD and its variants in terms of loss objective value reduction and training accuracy. By using the proposed second-order methods, one could converge to a flatter minimizer which also provides better generalizations of the training model. Thus, SINNC and SINTR show promise in solving large DNNs and achieving better accuracy than MSGD type methods.
[0025] In the descriptions provided below, the following terminology is used. Denote [n]:={1, . . . , n}. We use f.sub.i to denote the loss function corresponding to the i-th sample and label pair (x.sub.i, y.sub.i), where i∈[n]. X and Y represent the samples matrix (x.sub.i . . . , x.sub.n) and labels vector (y.sub.i, . . . y.sub.n). We use
to denote the stochastic Hessian matrix with respect to batch θ≠S⊂[n].
[0026]
[0027] Starting at step 105, a loss function corresponding to the deep neural network is defined. As is generally understood in the art, a loss function is used to guide the training process of a deep neural network. Various loss functions known in the art (e.g., Cross-Entropy, Mean Squared Error, etc.) may be used with the techniques described herein, as well as custom loss functions designed for particular datasets or applications. In general, the loss function of the deep neural network will be known in advance based on the characteristics of the deep neural network. Thus, defining the loss function at step 105 may be simply a matter of specifying the details of the loss function.
[0028] At step 110, various inputs are received, for example, as parameters supplied by a user. These inputs comprise a training set comprising labeled pairs(x.sub.i,y.sub.i).sub.i=1.sup.n, an initial iterate Θ.sub.0, and an initial CG starter d.sub.0. Additionally, configuration information is supplied indicating a CG iteration limit k.sub.max, constant c∈(0,1), and a sample size β∈[n]. Finally, at step 115, the inputs are set to initial values, as necessary.
[0029] Steps 120-135 illustrate an optimization method which iteratively minimizes the loss function over a plurality of iterations. At step 120, the steepest direction of the loss function is calculated by determining the gradient of the loss function at the current parameter values. More generally, the full gradient is evaluated g.sub.t=∇F(Θ.sub.T). Next, a batch of training samples S.sub.T∈[n] is selected at step 125. This batch of samples may be created, for example, using a random sampling of the plurality of training samples such that that |S.sub.y|=β. Such a random sampling may be performed, for example, a single time during the first iteration of the optimization method. Alternatively, the batch can be resampled during each iteration.
[0030] A matrix-free CG solver is applied at step 130 to obtain an inexact solution d.sub.t to a linear system defined by the steepest direction of the loss function and a stochastic Hessian matrix with respect to the batch of samples (i.e., the possible indefinite linear system H.sub.S.sub.
[0031] The current parameter values are updated at step 135 based on the descent direction. In some embodiments, to ensure sufficient reduction of the loss function at each iteration, the current parameter values may also be updated using the learning rate calculated using the steepest direction of the loss function and the descent direction. This learning rate may be calculated, for example, using an Amijo line search method or a Goldstein line-search method. Examples of generic implementations of these methods are described in Nocedal J., and Wright S. J., “Numerical Optimization,” Springer Series in Operations Research and Financial Engineering, 2.sup.nd Edition, 2006. Thus, the learning rate η.sub.t may be selected as the largest element in the set {1, c, c.sup.2, . . . } such that F(Θ.sub.T+η.sub.tp.sub.t)≤F(Θ.sub.T)+cη.sub.tg.sub.t.sup.Tp.sub.t. The updating of the parameters at step 135 is then a matter of updating Θ.sub.T+1 to Θ.sub.T+η.sub.tp.sub.t. The optimization method then repeats again starting at step 120 until convergence or a desired number of steps are performed.
[0032] Following the optimization method, at step 140, the current parameter values are stored in relationship to the deep neural network. More specifically, the final parameter values are stored in a computer readable medium such that they can be used during deployment of the deep neural network on real-world data.
[0033] Note that, unlike truncated Newton-CG methods used in conventional systems, the method 100 considers negative curvature information indicated from the stochastic Hessian matrix. The method 100 also unitizes the stochastic Hessian-vector product but there is no need to evaluate the full Hessian, which is required by saddle-free Newton (SFN) methods. Pseudocode for an example implementation of the SINNC algorithm is set forth in
[0034] To train neural network by second-order methods, the stochastic Hessian matrix and stochastic general Gaussian-Newton matrix are adopted as the approximation of the Hessian matrix, and further build the stochastic quadratic approximated model depending on them. Because training a deep neural network always involves a very large number of parameters, the exact solution of minimizing the quadratic approximation is prohibitive. Instead, we try to achieve a reasonable inexact solution in a computationally cost effective manner. Because the conjugate gradient method (CG) is often used to achieve an increasingly accurate solution after several iterations, the techniques described herein apply CG to minimize our quadratic model.
[0035] A known deficiency of the CG method is that it becomes unstable when an indefinite Hessian matrix is encountered during the minimization of the quadratic model. The reason behind this is that with an indefinite Hessian matrix, we may not find a conjugate direction. Several strategies have been proposed to deal with that deficiency, such as to modify the indefinite Hessian matrix so that the matrix can be positive and apply the CG solver afterward, or to apply a trust region approach which can always find a descent direction, or to use truncated Newton method, which terminates CG iteration whenever the negative curvature is encountered. In embodiments of the present invention, an early-terminated CG solver is applied in order to find an inexact solution for the quadratic model. With a good initial point, one could build a sequence of conjugate directions. From which, we could guarantee to reduce the residue of the system until the terminated condition is satisfied.
[0036]
[0037] Starting at step 405, a loss function a loss function corresponding to the deep neural network is defined. In some embodiments, the loss function can be specified directly in the source code executing the method; while, in other embodiments, the loss function may be supplied as an input value to the source code. At step 410, input values are received by the computing system executing the method 400. These input values include, without limitation, a training set of labeled pairs (x.sub.i, y.sub.i).sub.i=1.sup.n, an initial iterate of parameter values Θ.sub.0, and an initial trust region radius r.sub.0∈(0,R). Additionally, constants η.sub.0, η.sub.1, γ.sub.1, η.sub.2, and ϵ are supplied as inputs, where 0<η.sub.0<η.sub.1<1, 0<γ.sub.1<1<γ.sub.2, ϵ>0 (see
[0038] An optimization method is performed at steps 420-440 to iteratively minimize the loss function over a plurality of iterations. Starting at step 420, the gradient for the loss function at the current parameter values is calculated (i.e., g.sub.t=−∇F(Θ.sub.T)). Next, at step 425, a batch of training samples is selected. As with the method 200 discussed above with respect to
[0039] Then, at step 430, a trust region subproblem is constructed that approximates the loss function using the gradient and a stochastic Hessian matrix of the loss function with respect to the batch of samples. That is, an approximation of F(Θ) at Θ.sub.T is built using a stochastic Hessian H.sub.S.sub.
[0040] Next step 435, a descent direction is determined by applying a SteihaugCG solver to the trust region subproblem given the trust region radius. More specifically, an earlier terminated SteihaugCG solver is applied obtain an inexact minimizer of m.sub.t(d), denoted herein as d.sub.t. The current parameter values and the trust region radius are conditionally updated at step 440 based on a comparison of (i) a true reduction value provided by the loss function given the current parameter values, and (ii) a predicted reduction value provided by the descent direction. Continuing with the terminology used above, the value of Θ.sub.T+1, is updated based on the m.sub.t and d.sub.t, and the following value is calculated:
Then, based on a comparison of ρ.sub.t with the constants η.sub.0 and η.sub.1, the values of Θ.sub.t+1 and the trust region radius r.sub.t+1 are set for the next iteration.
[0041] After updating the values, the method 400 then repeats again starting at step 420 until convergence or a desired number of steps is performed. The methodology for setting the values of Θ.sub.t+1 and r.sub.t+1 is set forth in the pseudocode presented in
[0042] In some embodiments, a momentum parameter may be added to SINTR to improve the escaping efficiency from saddle points. One example algorithm, referred to herein as SINTR+, is shown in
[0043]
[0044]
[0045]
[0046] Parallel portions of a deep learning application may be executed on the architecture 900 as “device kernels” or simply “kernels.” A kernel comprises parameterized code configured to perform a particular function. The parallel computing platform is configured to execute these kernels in an optimal manner across the architecture 900 based on parameters, settings, and other selections provided by the user. Additionally, in some embodiments, the parallel computing platform may include additional functionality to allow for automatic processing of kernels in an optimal manner with minimal input provided by the user.
[0047] The processing required for each kernel is performed by grid of thread blocks (described in greater detail below). Using concurrent kernel execution, streams, and synchronization with lightweight events, the architecture 900 of
[0048] The device 910 includes one or more thread blocks 930 which represent the computation unit of the device 910. The term thread block refers to a group of threads that can cooperate via shared memory and synchronize their execution to coordinate memory accesses. For example, in
[0049] Continuing with reference to
[0050] Each thread can have one or more levels of memory access. For example, in the architecture 900 of
[0051] The embodiments of the present disclosure may be implemented with any combination of hardware and software. For example, aside from parallel processing architecture presented in
[0052] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
[0053] An executable application, as used herein, comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
[0054] A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
[0055] The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
[0056] The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof. No claim element herein is to be construed under the provisions of 35 U.S.C. 112(t) unless the element is expressly recited using the phrase “means for.”