System and method for using a deep learning network over time
11580384 · 2023-02-14
Inventors
- Rahul Venkataramani (Bangalore, IN)
- Sai Hareesh Anamandra (Bangalore, IN)
- Hariharan Ravishankar (Bangalore, IN)
- Prasad Sudhakar (Bangalore, IN)
Abstract
The present approach relates to a system capable of life-long learning in a deep learning context. The system includes a deep learning network configured to process an input dataset and perform one or more tasks from among a first set of tasks. As an example, the deep learning network may be part of an imaging system, such as a medical imaging system, or may be used in industrial applications. The system further includes a learning unit communicatively coupled to the deep learning network and configured to modify the deep learning network so as to enable it to perform one or more tasks in a second task list without losing the ability to perform the tasks from the first list.
Claims
1. A method for updating a deep learning network over time, comprising the steps of: receiving a first set of parameters from a deep learning network, wherein the deep learning network is trained using a first training dataset to perform a first set of tasks, wherein the first set of parameters specify both a first feature extractor and a first classifier used to perform the first set of tasks; receiving a first feature set corresponding to the first training dataset; receiving an input comprising a second set of tasks and a second training dataset; generating a second set of parameters specifying both a second feature extractor and a second classifier for use by the deep learning network, wherein the second set of parameters are generated using the first set of parameters, the input, and the first feature set, and the first training dataset is not used in generating the second set of parameters; and modifying the deep learning network to use the second set of parameters so that the deep learning network is trained to perform tasks from the first set of tasks and the second set of tasks without degradation.
2. The method of claim 1, wherein the deep learning network comprises a memory augmented neural network.
3. The method of claim 2, wherein the first feature set is stored in a memory of the memory augmented neural network.
4. The method of claim 1, wherein the deep learning network is trained to process one or more of medical images or industrial images.
5. The method of claim 1, wherein the first set of parameters are generated by training the deep learning network to perform the first set of tasks using the first training dataset.
6. The method of claim 1, wherein the second set of parameters enables the deep learning network to perform the tasks from the first task list at a same level as the deep learning network performed the tasks from the first task list prior to generating the second set of parameters.
7. The method of claim 1, wherein the steps of the method are performed on a learning unit associated with the deep learning network and implemented using one or more processor units and at least one memory unit of the learning unit.
8. The method of claim 7, wherein the learning unit comprises a data set generator configured to receive at least the first set of parameters, the first feature set, and the second training dataset and to generate an intermediate feature set based on the first feature extractor and the second training dataset.
9. The method of claim 7, wherein the learning unit comprises a feature transformer unit configured to train a feature transformer based at least on the second training dataset.
10. The method of claim 9, wherein the feature transformer is trained by minimizing a model loss cost function.
11. The method of claim 7, wherein the learning unit comprises a deep learning network parameter generator configured to generate the second set of parameters based at least on a feature transformer.
12. A system, comprising: a deep learning network initially trained using a first training dataset to perform a first set of tasks; a learning unit in communication with the deep learning network, the learning unit comprising: one or more memory components storing data and computer logic; one or more processors configured to execute computer logic stored on the one or more memory components so as to cause acts to be performed comprising: receiving a first set of parameters from the deep learning network, wherein the first set of parameters specify both a first feature extractor and a first classifier used to perform the first set of tasks; receiving a first feature set corresponding to the first training dataset; receiving an input comprising a second set of tasks and a second training dataset; generating a second set of parameters specifying both a second feature extractor and a second classifier for use by the deep learning network, wherein the second set of parameters are generated using the first set of parameters, the input, and the first feature set, and the first training dataset is not used in generating the second set of parameters; and modifying the deep learning network to use the second set of parameters so that the deep learning network is trained to perform tasks from the first set of tasks and the second set of tasks without degradation.
13. The system of claim 12, wherein the one or more memory components and one or more processors facilitate the operation of or implement a dataset generator configured to receive at least the first set of parameters, the first feature set, and the second training dataset and to generate an intermediate feature set based on the first feature extractor and the second training dataset.
14. The system of claim 12, wherein the one or more memory components and one or more processors facilitate the operation of or implement feature transformer logic configured to train a feature transformer based at least on the second training dataset.
15. The system of claim 14, wherein the feature transformer is trained by minimizing a model loss cost function.
16. The system of claim 12, wherein the one or more memory components and one or more processors facilitate the operation of or implement a deep learning network parameter generator configured to generate the second set of parameters based at least on a feature transformer.
17. The system of claim 12, wherein the deep learning network comprises a memory augmented neural network.
18. The system of claim 12, wherein the deep learning network is trained to process one or more of medical images or industrial images.
19. The system of claim 12, wherein the first set of parameters are generated by training the deep learning network to perform the first set of tasks using the first training dataset.
20. The system of claim 12, wherein the second set of parameters enables the deep learning network to perform the tasks from the first task list at a same level as the deep learning network performed the tasks from the first task list prior to generating the second set of parameters.
Description
DRAWINGS
(1) These and other features and aspects of embodiments of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
DETAILED DESCRIPTION
(14) As will be described in detail hereinafter, systems and methods for deep learning networks are presented. More particularly, the systems and methods presented in the present specification relate to life-long learning in the context of deep learning networks. Further, the systems and methods described provide a unified representation framework for life-long learning.
(15) The phrase ‘life-long learning’ as used herein refers to learning techniques for performing already learned tasks with recently acquired data or acquiring the ability to perform newer tasks (i.e., tasks not previously learned) with newer or previously acquired data. The phrase ‘training dataset’ refers to a plurality of combinations of input and corresponding output data that may be used in implementing learning techniques. The phrase ‘feature extractor’ refers to an operator applied on input data vectors to determine a corresponding feature vector. The phrase ‘classifier’ or ‘neural network classifier’ refers to an operator applied on the output of the feature extractor to generate a classification label. The phrase ‘deep learning network’ may refer to a neural network based learning network and is defined in terms of a ‘first set of parameters’ while configured to perform a task from a first task list and a ‘second set of parameters’ while configured to perform a task from a second task list.
(16) As discussed herein, a neural network classifier, parameterized by (θ, κ), is a composition Ψ_κ ∘ Φ_θ : X → [C] of a feature extractor Φ_θ : X → F and a classifier Ψ_κ : F → [C], where X is the space of input data and F is a space of low-dimensional feature vectors. In a lifelong learning setup, at any time t−1, the model optimally classifies all of the seen data ∪_{t′=0}^{t−1} X^(t′) into the classes [C^(t−1)], and the corresponding features F^(t−1) are well separated. At time t, when new training data D^(t) = (X^(t), Y^(t)) is encountered, features extracted using the old feature extractor are not guaranteed to be optimized for classifying the new data and new classes. To alleviate this, the present approach discussed herein changes the feature representation at time t, prior to the classification stage. This is achieved by defining a feature transformer Φ_Δθ^(t) : F^(t−1) → F^(t), so that the updated feature extractor parameters are θ^(t) = θ^(t−1) ∪ Δθ^(t). Practically, this may be realized by augmenting the capacity of the feature extractor using dense layers (e.g., fully connected layers), as discussed in greater detail below.
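As an illustration only, this augmentation can be sketched in NumPy. The extractor, layer dimensions, and names below are hypothetical stand-ins, not the patented implementation; a real deployment would use trained network weights rather than fixed or random ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_extractor(x):
    # Stand-in for the previously trained, frozen extractor Phi_theta^(t-1):
    # a fixed linear projection followed by a ReLU nonlinearity.
    W = np.linspace(-1.0, 1.0, 8 * 4).reshape(8, 4)
    return np.maximum(x @ W, 0.0)

class FeatureTransformer:
    """One dense (fully connected) layer Phi_dtheta appended to the frozen
    extractor, so the augmented extractor is
    Phi_theta^(t) = Phi_dtheta o Phi_theta^(t-1)."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.standard_normal((in_dim, out_dim)) * 0.1
        self.b = np.zeros(out_dim)

    def __call__(self, f):
        return np.maximum(f @ self.W + self.b, 0.0)

x = rng.standard_normal((5, 8))           # batch of 5 raw inputs
old_features = base_extractor(x)          # features in the old space F^(t-1)
transformer = FeatureTransformer(4, 6)
new_features = transformer(old_features)  # features in the new space F^(t)
```

Because the transformer operates on features rather than raw inputs, it can be applied both to new data and to previously stored features, which is what the remainder of the procedure relies on.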
(17) With the preceding discussion in mind,
(18) The deep learning network 102 may be implemented using CPUs (Central Processing Units), such as the Core i7 from INTEL or Ryzen from AMD, or GPUs (Graphics Processing Units), such as the NVIDIA GTX 1080 Ti. Alternatively, the deep learning network 102 may be implemented using FPGAs (Field-Programmable Gate Arrays) and/or ASICs (Application-Specific Integrated Circuits), such as a TPU (Tensor Processing Unit) from GOOGLE. In alternative embodiments, the deep learning network 102 of the present specification may also be implemented using mobile processors having special instructions, optimized DSPs, and dedicated NPUs (Neural Processing Units). In other embodiments, the deep learning network 102 may be implemented using energy-efficient neuromorphic architectures, with or without memristors, or using quantum computers (QC).
(19) In one embodiment, the learning unit 106 is configured to receive the first set of parameters 108 from the deep learning network 102, and an external input 112 such as, but not limited to, a second training dataset and a second task list selected or specified by an operator. In embodiments where the deep learning network 102 is a MaNN, the learning unit 106 may be further configured to receive the first feature set 116 corresponding to the first training dataset. The learning unit 106 is further configured to generate a second set of parameters 128 (e.g., weights or other learned parameters or values) for use by the deep learning network 102 based on the first set of parameters 108, the external input 112, and the first feature set 116. For example, in one embodiment the second set of parameters 128 includes, specifies, or configures a second feature extractor and a second classifier. The learning unit 106 is further configured to modify the deep learning network 102 using the second set of parameters 128. In one embodiment, the first training dataset and the second training dataset may have the same probability distribution and the second task list may be the same as the first task list. In another embodiment, the probability distribution of the second training dataset may be different from that of the first training dataset. In a further embodiment, the second task list may be different from the first task list and the second training dataset may have a different distribution compared to that of the first training dataset. In all of these canonical scenarios, the second set of parameters 128 enables the deep learning network 102 to perform the tasks from the first task list without degradation, i.e., the deep learning network retains its training on the tasks of the first task list.
(20) The learning unit 106 may be configured to generate a second feature set and subsequently store the second feature set in the memory unit 132. In an alternate embodiment, the learning unit 106 is configured to store the first feature set 116 within an internal memory location and use the first feature set 116 for generating the second set of parameters 128. In the depicted example, the learning unit 106 includes a dataset generator unit 118, a feature transformer unit 120, and a deep learning network generator 122. The depicted learning unit 106 also includes a memory unit 132 and a processor unit 130 communicatively coupled to the other units 118, 120, 122 via a communications bus 134 and/or otherwise implementing the other units 118, 120, 122, such as via one or more stored and executable routines or program logic.
(21) The dataset generator unit 118 in this example is communicatively coupled to the deep learning network 102 and configured to receive a first set of parameters 108 from the deep learning network 102. In one implementation, the first set of parameters 108 is generated at a first time instant using the first training dataset. The dataset generator 118 is also configured to receive the first feature set 116 corresponding to the first training dataset either from a memory location within the deep learning network 102 or from a memory location in memory unit 132 associated with the learning unit 106. The dataset generator 118 is also configured to receive the second training dataset corresponding to a second task list via the external input 112. The dataset generator 118 may be further configured to receive a first feature extractor determined a priori and stored in the memory unit 132. In one implementation, the dataset generator 118 is configured to generate an intermediate feature set based on the first feature extractor and the second training dataset. Specifically, in one example the intermediate feature set is given by:
∂F^(t) = ∪_{τ∈T^(t)} Φ_θ^(t−1)(X_τ^(t)) (1)
where Φ_θ^(t−1) is the first feature extractor, T^(t) is the second task list, and X_τ^(t) is the input data of the second training dataset corresponding to task τ. The composite training dataset is then formed as:
D^(t) = (∂F^(t) ∪ F^(t−1), ∪_{t′∈[1,2, . . . ,t]} Y^(t′)), ∀t > 0 (2)
where D^(t) is the composite training dataset used at time t, ∂F^(t) is the intermediate feature set, F^(t−1) is the first feature set, ∪ is the set union operator, and ∪_{t′} Y^(t′) is the labelled output corresponding to the union of the intermediate feature set and the first feature set.
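The formation of the composite dataset in equations (1) and (2) can be sketched as follows; the stand-in extractor, array shapes, and label values here are illustrative assumptions only.

```python
import numpy as np

# Hypothetical stand-in for the old extractor Phi_theta^(t-1).
def old_extractor(x):
    return np.tanh(x)

x_new = np.ones((3, 4))          # new raw data X^(t)
y_new = np.array([2, 2, 3])      # new labels Y^(t)

# Equation (1): intermediate features of the new data under the old extractor.
d_f_new = old_extractor(x_new)

# Previously stored features F^(t-1) and their labels, read from memory.
f_old = np.zeros((2, 4))
y_old = np.array([0, 1])

# Equation (2): composite training set D^(t) as the union of the
# intermediate feature set and the first feature set, with all labels.
features = np.concatenate([d_f_new, f_old], axis=0)
labels = np.concatenate([y_new, y_old], axis=0)
```

Note that only features, not raw images, enter the composite set, which is why the first training dataset itself is never needed again.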
(22) The feature transformer unit 120 in the depicted example is communicatively coupled to the dataset generator unit 118 and is configured to determine a feature transformer based on the second training dataset using a learning technique. The training procedure for generating or training the feature transformer may, in one embodiment be given by:
TRAIN(Δθ^(t), κ^(t); D^(t)) (3)
where TRAIN is representative of the training procedure, Δθ^(t) parameterizes the feature transformer, κ^(t) parameterizes the classifier, and D^(t) is the composite training dataset of equation (2), which is used by the training procedure TRAIN to determine the feature transformer Δθ^(t) and the corresponding classifier κ^(t).
(23) In one embodiment, a model loss cost function is minimized as the objective of the training procedure. If the deep learning network 102 is to be trained to perform classification, the model loss cost function includes a classification loss cost function; in such an embodiment, the deep learning network 102 is trained to learn the class separation boundaries in the feature space. If the deep learning network 102 is also to provide separation among the classes, a center loss cost function is included in the model loss cost function as well. That is, the classification loss is augmented with a center loss; this composite loss explicitly forces the transformed features to have class-wise separation. In one embodiment, the classification loss function is the cross-entropy over the composite dataset:
(24) C_Classification Loss = −Σ_{(f,y)∈D} log [Ψ_κ^(t)(Φ_Δθ^(t)(f))]_y (4)
where D = D^(t) is the composite training dataset of equation (2) and [·]_y denotes the predicted probability of the true class y.
(25) Further, the center loss cost function is given by:
(26) C_Center Loss = Σ_{(f,y)∈D} ∥Φ_Δθ^(t)(f) − μ_y∥² (5)
where μ_c is the centroid of all the features corresponding to input data labelled as c. The model loss is given by:
C_Model Loss = C_Classification Loss + C_Center Loss (6)
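A minimal NumPy sketch of the composite loss of equation (6), assuming a standard softmax cross-entropy for the classification term and a squared-distance-to-centroid form for the center loss; function names and normalizations are illustrative choices, not the specification's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(logits, labels):
    # Softmax cross-entropy over the composite dataset, in the spirit of eq. (4).
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def center_loss(features, labels):
    # Squared distance of each feature to its class centroid mu_c, eq. (5).
    loss = 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)          # class centroid
        loss += np.sum((fc - mu_c) ** 2)
    return loss / len(features)

def model_loss(logits, features, labels):
    # Equation (6): composite loss forcing class-wise feature separation.
    return classification_loss(logits, labels) + center_loss(features, labels)
```

The center term vanishes when every feature coincides with its class centroid, so minimizing the composite loss pulls same-class features together while the cross-entropy term keeps the class boundaries discriminative.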
(27) The deep learning network generator 122 is communicatively coupled to the feature transformer unit 120 and configured to generate the second set of parameters 128. The second set of parameters is given by:
θ^(t) = θ^(t−1) ∪ Δθ^(t) (7)
where θ^(t) is the second set of feature extractor parameters, θ^(t−1) is the first set of feature extractor parameters, and Δθ^(t) is the set of feature transformer parameters. The feature transformer Φ_Δθ^(t) is given by the mapping:
Φ_Δθ^(t) : F^(t−1) → F^(t) (8)
where F^(t−1) is the first feature set and F^(t) is the second feature set. The second feature extractor operator is given by:
Φ_θ^(t) = Φ_Δθ^(t) ∘ Φ_θ^(t−1) (9)
where ∘ is the cascade (composition) operator and Φ_Δθ^(t) is the feature transformer. The second feature set is then computed as:
F^(t) = Φ_Δθ^(t)(∂F^(t)) ∪ Φ_Δθ^(t)(F^(t−1)) (10)
i.e., the second feature set is obtained as the union of the transformed intermediate feature set and the transformed first feature set.
(28) A pseudo code example of an implementation of the life-long learning technique is outlined in the Table-1 below. With respect to the outlined example, the present pseudo-rehearsal strategy is realized through the use of a finite memory module M equipped with READ( ), WRITE( ), and ERASE( ) procedures that can store a subset of F(t−1) and retrieve the same at time t. To limit the size of the memory footprint involved, only a subset of history (specified by sampling operator S) may be stored at every episode of lifelong learning. In practice two strategies that can be pursued for generating or deriving the subset include: (1) random sampling, in which a percentage of the memory is randomly retained, and (2) importance sampling, in which samples are retained that are farther from cluster centroids, given that center loss is optimized at every episode. In addition, storing low-dimensional features is more economical than storing entire images in terms of memory or storage footprint.
(29) TABLE 1
  Input: Training data (X^(t), Y^(t)), ∀t ≥ 0
  Output: (θ^(t), κ^(t)), ∀t
  t ← 0, ERASE(M)                               /* Set initial time, erase memory */
  D^(0) ← (X^(0), Y^(0))                        /* Obtain initial tasks and training data */
  TRAIN(θ^(0), κ^(0); D^(0))                    /* Train initial network */
  F^(0) ← Φ_θ^(0)(X^(0))                        /* Compute features */
  WRITE(M, S(F^(0), Y^(0)))                     /* Write select features to memory */
  while TRUE do
      t ← t + 1, obtain T^(t), (X^(t), Y^(t))   /* Obtain current tasks and data */
      Compute ∂F^(t) using equation (1)         /* Compute old model features on new data */
      (F^(t−1), Y^(t−1)) ← READ(M)              /* Read previously computed features */
      Form D^(t) using equation (2)             /* Form composite training data */
      TRAIN(Δθ^(t), κ^(t); D^(t))               /* Train feature transformer */
      Φ_θ^(t) ← Φ_Δθ^(t) ∘ Φ_θ^(t−1)            /* Obtain new feature extractor */
      Compute F^(t) using equation (10)         /* Compute new features */
      ERASE(M)                                  /* Erase old features */
      WRITE(M, S(F^(t), ∪_{t′∈[1,2,...,t]} Y^(t′)))  /* Write new select features */
  end
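The memory module M and the two subset-selection strategies can be sketched as below; the class and function names are illustrative, and the keep fraction is an assumed parameter rather than one prescribed by the specification.

```python
import numpy as np

rng = np.random.default_rng(0)

class FeatureMemory:
    """Finite memory module M with READ/WRITE/ERASE procedures for storing
    a sampled subset of features and labels between episodes."""
    def __init__(self):
        self.erase()

    def erase(self):
        self._store = None

    def write(self, features, labels):
        self._store = (features, labels)

    def read(self):
        return self._store

def random_sample(features, labels, keep_fraction=0.5):
    # Strategy (1): randomly retain a percentage of the stored features.
    n_keep = max(1, int(len(features) * keep_fraction))
    idx = rng.choice(len(features), size=n_keep, replace=False)
    return features[idx], labels[idx]

def importance_sample(features, labels, keep_fraction=0.5):
    # Strategy (2): retain the samples farthest from their class centroid,
    # which are informative when center loss is optimized every episode.
    dists = np.empty(len(features))
    for c in np.unique(labels):
        mask = labels == c
        mu_c = features[mask].mean(axis=0)
        dists[mask] = np.linalg.norm(features[mask] - mu_c, axis=1)
    n_keep = max(1, int(len(features) * keep_fraction))
    idx = np.sort(np.argsort(dists)[-n_keep:])  # keep original ordering
    return features[idx], labels[idx]
```

Storing sampled low-dimensional features rather than full images keeps the memory footprint of M bounded across episodes.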
(30) In the depicted example, the memory unit 132 is communicatively coupled to the processor unit 130 and configured to store programs, training datasets, the first feature extractor, the first classifier, the second feature extractor, and the second classifier. Although the memory unit 132 is shown as a separate unit for clarity and for the purpose of explanation, the memory unit 132 may in practice be a part of the dataset generator unit 118, the feature transformer unit 120, and/or the deep learning network generator 122, or may, in practice, be a memory or storage used to store routines and/or parameters that implement some or all of these functionalities when executed. In one embodiment, the memory unit 132 may be a dynamic random-access memory (DRAM) device, a static random-access memory (SRAM) device, flash memory, or another memory device. In another embodiment, the memory unit 132 may include a non-volatile memory or similar permanent storage device or media, such as a hard disk drive, a floppy disk drive, a compact disc read only memory (CD-ROM) device, a digital versatile disc read only memory (DVD-ROM) device, a digital versatile disc random access memory (DVD-RAM) device, a digital versatile disc rewritable (DVD-RW) device, a flash memory device, or other non-volatile storage devices. The memory unit 132 may also be a non-transitory computer readable medium encoded with a program or other executable logic to instruct the one or more processors 130 to generate the first set of parameters 108, the second set of parameters 128, and so forth.
(31) The processor unit 130 may include one or more processors either co-located within a single integrated circuit or distributed across multiple integrated circuits networked to share data and communication in a seamless manner. The processor unit 130 may, in one implementation, include at least one of an arithmetic logic unit, a microprocessor, a microcontroller, a general-purpose controller, a graphics processing unit (GPU), or a processor array to perform the desired computations or run the computer program. In one embodiment, the processor unit 130 may be configured to implement or otherwise aid the functionality of one or more of the dataset generator unit 118, the feature transformer unit 120, and/or the deep learning network generator 122. In some embodiments, the processor unit 130 may be representative of an FPGA, an ASIC, or any other special purpose hardware configured to implement one or more of the dataset generator unit 118, the feature transformer unit 120, and/or the deep learning network generator 122.
(33) At a subsequent second time instant 222, a second training dataset 214 corresponding to a second task list is available. A second set of parameters is generated at the update block 204 as explained with reference to
(36) The curves of the graph 400 were generated from continual learning simulations using the techniques discussed in the present specification as well as known conventional techniques (i.e., a naïve learner and a cumulative learner). A VGG (Visual Geometry Group) network was used as the base network in the simulations. Up to two dense layers were added to the base network in each feature transformer step; the feature transformer network essentially had one additional dense layer per step. Features from different layers of the VGG network (pooling layers 3 and 4 and fully connected layers 1 and 2) are stored in the memory 132 and used subsequently for training.
(37) The graph 400 compares performance of the different approaches on the validation dataset and includes a first curve 406 corresponding to a naive training and a second curve 410 corresponding to a feature transform based learning approach. The graph 400 further includes a third curve 412 corresponding to cumulative training using the previously stored training datasets. In particular,
(44) As depicted in
(46) While memory management has been described above, it may also be noted that additional steps may also be taken to control the growth of network capacity. For example, the present framework can be formulated as a base feature extractor and feature transformer layers, adapting the features for new tasks. In order to check the growth of feature transformer layers, the base feature extractor remains fixed and only the base features are stored and not the latest updated features. This makes existing feature transformer layers reusable for future episodes.
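This capacity-control scheme can be illustrated with a toy sketch in which the base extractor stays fixed and the accumulated transformer layers are reapplied to stored base features; the lambda transformers below are placeholders for trained dense layers, and all names are illustrative.

```python
import numpy as np

# Fixed base feature extractor; only its outputs (base features) are stored,
# never the latest transformed features.
def base_extractor(x):
    return np.tanh(x)

# Feature transformer layers accumulated across episodes. Because the base
# extractor never changes, each stored base feature can be re-run through the
# current stack, making earlier transformer layers reusable in future episodes.
transformers = [lambda f: f + 1.0, lambda f: f * 2.0]

def current_features(x):
    f = base_extractor(x)
    for t in transformers:
        f = t(f)
    return f
```

Appending a new episode's transformer then only requires adding one entry to the stack, so network capacity grows linearly in the number of episodes rather than compounding on already-transformed features.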
(47) It is to be understood that not necessarily all such objects or advantages described above may be achieved in accordance with any particular embodiment. Thus, for example, those skilled in the art will recognize that the systems and techniques described herein may be embodied or carried out in a manner that achieves or improves one advantage or group of advantages as taught herein without necessarily achieving other objects or advantages as may be taught or suggested herein.
(48) While the technology has been described in detail in connection with only a limited number of embodiments, it should be readily understood that the specification is not limited to such disclosed embodiments. Rather, the technology can be modified to incorporate any number of variations, alterations, substitutions or equivalent arrangements not heretofore described, but which are commensurate with the spirit and scope of the claims. Additionally, while various embodiments of the technology have been described, it is to be understood that aspects of the specification may include only some of the described embodiments. Accordingly, the specification is not to be seen as limited by the foregoing description.