Deep learning heterogeneous computing method based on layer-wide memory allocation and system thereof
11568268 · 2023-01-31
Assignee
Inventors
- Hai Jin (Hubei, CN)
- Xiaofei Liao (Hubei, CN)
- Long Zheng (Hubei, CN)
- Haikun Liu (Hubei, CN)
- Xi Ge (Hubei, CN)
CPC classification
G06N3/10
G06F9/5066
International classification
G06F9/50
G06N3/10
Abstract
A deep learning heterogeneous computing method based on layer-wide memory allocation at least comprises the steps of: traversing a neural network model so as to acquire a training operational sequence and a number of layers L thereof; calculating a memory room R.sub.1 required by data involved in operation at the i.sup.th layer of the neural network model under a double-buffer configuration, where 1≤i≤L; altering a layer structure of the i.sup.th layer and updating the training operational sequence; distributing all the data across a memory room of a CPU and a memory room of a GPU according to a data placement method; and performing iterative computation at each said layer successively based on the training operational sequence so as to complete neural network training.
Claims
1. A deep learning heterogeneous computing method based on layer-wide memory allocation to be executed by a CPU and a GPU jointly, the deep learning heterogeneous computing method comprising the steps of: traversing a neural network model so as to acquire a training operational sequence and a number of layers L thereof; calculating a memory room R.sub.1 required by data involved in operation at an i.sup.th layer of the neural network model under a double-buffer configuration, where 1≤i≤L; altering a layer structure of the i.sup.th layer and updating the training operational sequence when the memory room R.sub.1 required by the operation at the i.sup.th layer is greater than a memory room of the GPU, the step of altering further comprising: acquiring an operational type corresponding to each said layer of the neural network model based on the training operational sequence; when the i.sup.th layer is a convolution layer and convolution operation is to be performed, segmenting an input feature map required by it to perform the convolution operation according to a height or width dimension before the convolution layer by inserting a segment layer so as to obtain a plurality of locally-input feature maps; performing the convolution operation based on the locally-input feature maps, respectively, so as to acquire a plurality of corresponding locally-output feature maps; merging the plural locally-output feature maps by inserting a merge layer after the convolution layer, so as to form a complete output feature map corresponding to the convolution layer; and updating the training operational sequence; distributing all the data across a memory room of the CPU and the memory room of the GPU according to a data placement method when a memory room R.sub.2 required by all data involved in all the layers of the neural network model is greater than the memory room of the GPU, wherein the data placement further comprises: traversing the training operational sequence; marking data involved in the segment layer and the merge layer as first data; marking data involved in the other layers as second data; and initializing an available memory room M.sub.1 of the GPU that is equal to a total capacity of the GPU; traversing the second data so as to identify a layer L.sub.1 that requires the largest memory room and a layer L.sub.2 that requires the second largest memory room, a memory room R.sub.L1 required by all data involved during identification of the layer L.sub.1, a memory room R.sub.L2 required by all data involved during identification of the layer L.sub.2, and a memory room R.sub.3 required by the largest data block involved during identification of the layer L.sub.1; and updating a marking of the largest data block to third data when both relations of (R.sub.L1-R.sub.3)*2+R.sub.3<M.sub.1 and R.sub.L2*2+R.sub.3<M.sub.1 are satisfied; and updating a capacity of the available memory room M.sub.1 to M.sub.1-R.sub.3; and performing iterative computation at each said layer successively based on the training operational sequence so as to complete neural network training.
2. The deep learning heterogeneous computing method of claim 1, wherein the step of altering the layer structure of the i.sup.th layer further comprises the steps of: when the i.sup.th layer is a pooling layer, an activation layer or a batchnorm layer, segmenting the input feature map required by it to perform the operation according to a channel dimension by inserting the segment layer before the i.sup.th layer, so as to obtain the plurality of locally-input feature maps; performing the corresponding operation based on the locally-input feature maps, respectively, so as to acquire the plurality of corresponding locally-output feature maps; merging the plural locally-output feature maps by inserting the merge layer after the i.sup.th layer, so as to form the complete output feature map corresponding to the layer; and updating the training operational sequence.
3. The deep learning heterogeneous computing method of claim 2, wherein the data placement method further comprises the steps of: where either a relation of (R.sub.L1-R.sub.3)*2+R.sub.3≥M.sub.1 or a relation of R.sub.L2*2+R.sub.3≥M.sub.1 is satisfied, updating the capacity of the available room M.sub.1 to M.sub.1-R.sub.L1*2, and traversing all the second data and calculating a memory room R.sub.4 it requires, in which: where a relation of R.sub.4<M.sub.1 is satisfied, updating a marking of the second data to the third data; and updating the capacity of the available room M.sub.1 to M.sub.1-R.sub.4.
4. The deep learning heterogeneous computing method of claim 3, wherein the data placement method further comprises the steps of: traversing the second data so as to identify the layer L.sub.1 that requires the largest memory room and the layer L.sub.2 that requires the second largest memory room, a memory room R.sub.L1 required by all data involved during identification of the layer L.sub.1, the memory room R.sub.L2 required by all data involved during identification of the layer L.sub.2, and the memory room R.sub.3 required by the largest data block involved during identification of the layer L.sub.1; where both the relations of (R.sub.L1-R.sub.3)*2+R.sub.3<M.sub.1 and R.sub.L2*2+R.sub.3<M.sub.1 are satisfied, updating the marking of the largest data block to the third data; updating the capacity of the available memory room M.sub.1 to M.sub.1-R.sub.3; repeating the preceding steps until either the relation of (R.sub.L1-R.sub.3)*2+R.sub.3≥M.sub.1 or the relation of R.sub.L2*2+R.sub.3≥M.sub.1 is satisfied; where either the relation of (R.sub.L1-R.sub.3)*2+R.sub.3≥M.sub.1 or the relation of R.sub.L2*2+R.sub.3≥M.sub.1 is satisfied, traversing all the second data and calculating the memory room R.sub.4 it requires, in which, where the relation of R.sub.4<M.sub.1 is satisfied, updating the marking of the second data to the third data; and updating the capacity of the available room M.sub.1 to M.sub.1-R.sub.4.
5. The deep learning heterogeneous computing method of claim 4, wherein the data placement method further comprises a step of: storing the first data into the memory room of the CPU, storing the remaining second data into the memory room of the CPU, and storing the third data into the memory room of the GPU.
6. The deep learning heterogeneous computing method of claim 5, wherein the step of calculating the memory room R.sub.1 further comprises a step of: counting tensor shapes of input data and output data required by operation at every layer in the neural network model so as to determine the memory room R.sub.1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(7) The following description, in conjunction with the accompanying drawings and preferred embodiments, is set forth below to illustrate the present invention.
(8) It is noted that, for ease of understanding, like features bear like labels in the attached figures wherever possible.
(9) As used throughout this application, the term “may” is used in a permissive sense (i.e., meaning possibly) rather than a mandatory sense (i.e., meaning must). Similarly, the terms “comprising”, “including” and “consisting” mean “comprising but not limited to”.
(10) The phrases “at least one”, “one or more” and “and/or” are for open expression and shall cover both connected and separate operations. For example, each of “at least one of A, B and C”, “at least one of A, B or C”, “one or more of A, B and C”, “A, B or C” and “A, B and/or C” may refer to A solely, B solely, C solely, A and B, A and C, B and C or A, B and C.
(11) The terms “a” or “an” refer to one or more of the named articles. As such, the terms “a” (or “an”), “one or more” and “at least one” are interchangeable herein. It is also to be noted that the terms “comprising”, “including” and “having” used herein are interchangeable.
(12) As used herein, the term “automatic” and its variations refer to a process or operation that is done without physical manual input when the process or operation is performed. However, the process or operation may still be automatic, even if its performance uses physical or non-physical manual input, where that input is received before the process or operation is performed. Manual input is considered physical if it affects how the process or operation is performed; manual input that merely enables performance of the process or operation is not considered “physical”.
Embodiment 1
(13) As shown in the accompanying figures, the method for structural adjustment of a neural network at least comprises the following steps:
(14) S1: where Layer[i] is a convolution layer, and Layer[i+1] is neither an activation layer nor a pooling layer, segmenting the input feature map of the convolution layer in accordance with the height or width dimension by inserting segment layers so as to replace the convolution layer with many small convolution layers, wherein the small convolution layers take the locally-input feature maps as their inputs to output the corresponding locally-output feature maps, inserting merge layers to merge the locally-output feature maps, thereby generating a complete output feature map, and ending the flow; otherwise, proceeding with Step S2; wherein the term “Layer[i]” refers to the i.sup.th layer of the neural network;
(15) S2: where Layer[i] is a convolution layer, and Layer[i+1] is an activation layer or a pooling layer, segmenting the input feature map of the convolution layer in accordance with the height or width dimension by inserting segment layers so as to replace the convolution layer with many small convolution layers, directly activating or pooling the locally-output feature maps of the small convolution layers, merging the locally-output feature maps by inserting merge layers, and ending the flow; otherwise, proceeding with Step S3;
(16) S3: where Layer[i] is a convolution layer, and Layer[i+1] is an activation layer, while Layer[i+2] is a pooling layer, segmenting the input feature map of the convolution layer in accordance with the height or width dimension by inserting segment layers so as to replace the convolution layer with many small convolution layers, directly activating and pooling the locally-output feature maps of the small convolution layers, then merging the locally-output feature maps by inserting merge layers, thereby generating a complete output feature map, and ending the flow; otherwise, proceeding with Step S4;
(17) S4: where Layer[i] is an activation layer, a pooling layer or a batchnorm layer, segmenting the input feature map of the layer in accordance with the channel dimension by inserting segment layers, activating, pooling or batchnorming the segmented locally-input feature maps, then merging the locally-output feature maps by inserting merge layers, and ending the flow.
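For illustration only, the case analysis of Steps S1˜S4 can be sketched as a small decision function over layer types. The helper below is hypothetical (it is not part of the patent); it only reports which step would apply and omits the actual insertion of segment and merge layers, which in practice is performed only for layers whose memory requirement exceeds the memory of the GPU.

```python
# Sketch of the decision logic of Steps S1-S4, operating on layer types only.
# The actual insertion of segment layers and merge layers is omitted; this
# helper is hypothetical and purely illustrative.
def adjustment_case(layer_types, i):
    """Return which of Steps S1-S4 applies to Layer[i]."""
    t = layer_types[i]
    t1 = layer_types[i + 1] if i + 1 < len(layer_types) else None
    t2 = layer_types[i + 2] if i + 2 < len(layer_types) else None
    if t == "conv" and t1 not in ("act", "pool"):
        return "S1: segment input along H/W, run small convolutions, merge outputs"
    if t == "conv" and t1 in ("act", "pool"):
        return "S2: segment along H/W; activate/pool the local outputs, then merge"
    if t == "conv" and t1 == "act" and t2 == "pool":
        return "S3: segment along H/W; activate and pool the local outputs, then merge"
    if t in ("act", "pool", "batchnorm"):
        return "S4: segment the input along the channel dimension, then merge"
    return "no adjustment required"

print(adjustment_case(["conv", "batchnorm"], 0))   # falls under Step S1
print(adjustment_case(["pool"], 0))                # falls under Step S4
```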
(18) For clear explanation, the foregoing steps are described below using a convolutional neural network as an example.
(19) The convolutional neural network is composed of three parts. The first part is an input layer; the second part is composed of a plurality of convolution layers, pooling layers and activation layers; and the third part is a fully connected multilayer perceptron. The convolutional neural network may be structured in various ways, and is usually expressed as follows:
INPUT→[[CONV]*N→POOLING?]*M→[FC]*K
(20) The expression above denotes a structure in which N convolution layers are stacked, a pooling layer is optionally added, this structure is repeated M times, and finally K fully connected layers are appended.
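As a purely illustrative sketch that is not taken from the patent, the expression can be instantiated, for example with N=2, M=2 and K=2; the channel counts and the 32×32 input resolution below are arbitrary assumptions.

```python
# Illustrative instance of INPUT -> [[CONV]*N -> POOLING?]*M -> [FC]*K
# with N=2, M=2, K=2.  Channel counts and the input size are arbitrary
# assumptions chosen for the example, not values taken from the patent.
import torch
import torch.nn as nn

net = nn.Sequential(
    # block 1: N=2 convolution layers followed by an optional pooling layer
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # block 2: the same structure repeated (M=2)
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # K=2 fully connected layers
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```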
(24) In the process of training the neural network, processing such as convolution, pooling and activation is performed on the input feature map layer by layer, beginning from the input layer. The essence of Steps S1˜S4 is to, before computation at each layer, identify the type of the layer or the type of computation to be performed on its input feature map, and process the input feature map of the presently worked layer accordingly. Particularly, where the presently worked layer is a convolution layer, or convolution operation has to be performed on its input feature map, a segment layer is inserted before the presently worked layer so as to segment the input feature map in accordance with the height or width dimension before the convolution operation is performed, thereby obtaining a plurality of locally-input feature maps. Where the presently worked layer is any one of a pooling layer, an activation layer and a batchnorm layer, a segment layer is inserted before the presently worked layer, so as to segment its input feature map in accordance with the channel dimension, thereby obtaining a plurality of locally-input feature maps. After the locally-input feature maps are processed in accordance with the layer structure of the neural network and the locally-output feature maps are obtained, the locally-output feature maps are merged by inserting merge layers so as to obtain a complete output feature map.
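The following NumPy sketch illustrates the segment/merge principle just described, under simplifying assumptions: a single-channel feature map, a 3×3 kernel, stride 1 and no padding, segmentation along the height dimension with the two-row halo the kernel requires. All function names are hypothetical; this is not the patent's implementation.

```python
# Minimal sketch of the segment/merge idea, assuming a single-channel input,
# a 3x3 kernel, stride 1 and "valid" convolution.  Names are hypothetical and
# the code illustrates the principle only, not the patent's implementation.
import numpy as np

def conv2d_valid(x, k):
    """Naive valid convolution of a 2-D map x with kernel k."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def segmented_conv(x, k, parts=2):
    """Segment x along the height dimension (with a kh-1 row halo), convolve
    every locally-input feature map, then merge the locally-output feature
    maps into the complete output feature map."""
    kh = k.shape[0]
    rows_out = x.shape[0] - kh + 1                  # output rows of the full map
    bounds = np.linspace(0, rows_out, parts + 1, dtype=int)
    local_outputs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        local_in = x[lo:hi + kh - 1, :]             # segment layer: slice + halo
        local_outputs.append(conv2d_valid(local_in, k))
    return np.concatenate(local_outputs, axis=0)    # merge layer

x = np.random.rand(16, 16)
k = np.random.rand(3, 3)
assert np.allclose(conv2d_valid(x, k), segmented_conv(x, k, parts=4))

# For a pooling, activation or batchnorm layer the same pattern applies along
# the channel dimension instead, e.g. np.split(feature_map, parts, axis=0)
# for a (C, H, W) map, because those operations act on each channel separately.
```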
Embodiment 2
(27) This embodiment is a further improvement on Embodiment 1, and the repeated content is omitted. The data placement method of this embodiment at least comprises the following steps:
(28) S5: setting a first storage room and a second storage room, traversing the operation sequence of a neural network, marking data involved in computation at the segment layer and the merge layer of the neural network as first data, and marking data involved in computation at the other layers as second data. Therein, the first data is stored in the second storage room and the first storage room is initialized so that its available room R.sub.available is equal to its total capacity. Preferably, since more memory room is required by training computation at the segment layer and the merge layer, the training work is moved to the CPU so as to effectively reduce the memory overhead of the GPU and mitigate the effects of the adjustment of the neural network on the resulting performance;
(29) S6: counting all the second data in the neural network, so as to identify a layer L.sub.1 that requires the largest memory room and a layer L.sub.2 that requires the second largest memory room, wherein the memory room occupied by all the data involved in the computation at Layer L.sub.1 is R.sub.L1, and the largest data block has a size of R.sub.biggest, while the memory room occupied by all the data involved in the computation at Layer L.sub.2 is R.sub.L2;
(30) S7: where (R.sub.L1-R.sub.biggest)*2+R.sub.biggest<R.sub.available and R.sub.L2*2+R.sub.biggest<R.sub.available, marking the largest data block in Layer L.sub.1 as third data and storing it into the first storage room, while dynamically adjusting the available room of the first storage room to R.sub.available=R.sub.available−R.sub.biggest, and returning to Step S6; where (R.sub.L1-R.sub.biggest)*2+R.sub.biggest≥R.sub.available or R.sub.L2*2+R.sub.biggest≥R.sub.available, entering Step S8 for subsequent processing;
(31) S8: traversing all the data blocks composed of the second data, and where a data block has a size of R.sub.data<R.sub.available, storing the data block into the first storage room, and dynamically adjusting the available room of the first storage room to R.sub.available=R.sub.available−R.sub.data. Preferably, for minimizing the total data amount to be offloaded and prefetched during training, the data required by the segment layer and the merge layer are placed into the host memory. On the prerequisite that the memory room required by the largest layer to perform operation under the double-buffer configuration is reserved, as much of the remaining data as possible is placed into the memory of the GPU, thereby reducing the communication overhead.
(32) Preferably, the first storage room is the memory of the GPU, and the second storage room is the memory of the CPU. All the data involved in computation at the segment layer at least include the un-segmented input feature map and the generated output feature maps. All the data involved in computation at the merge layer at least include the un-merged input feature maps and the generated output feature map.
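The placement of Steps S5˜S8 can be sketched as the following greedy procedure. It is a simplified reading of the steps (layer footprints are computed over the data blocks that are still unplaced, and at least two non-segment/merge layers are assumed); the data structures, identifiers and demonstration sizes are assumptions made for illustration, not part of the patent.

```python
# Simplified sketch of the data-placement strategy of Steps S5-S8.  A layer is
# modelled as a dict naming its data blocks and their sizes (arbitrary units);
# all identifiers and the demo values below are illustrative assumptions.
def place_data(layers, gpu_capacity):
    placement = {}                               # block id -> "GPU" / "CPU_host"
    second = {}                                  # second data, still unplaced
    for layer in layers:
        for blk, size in layer["blocks"].items():
            if layer["seg_or_merge"]:
                placement[blk] = "CPU_host"      # S5: first data -> host memory
            else:
                second.setdefault(blk, size)
    second = {b: s for b, s in second.items() if b not in placement}
    available = gpu_capacity                     # R_available

    def footprint(layer):                        # memory of still-unplaced blocks
        return sum(s for b, s in layer["blocks"].items() if b in second)

    compute_layers = [l for l in layers if not l["seg_or_merge"]]
    while True:                                  # S6-S7: greedy marking loop
        ranked = sorted(compute_layers, key=footprint, reverse=True)
        r_l1, r_l2 = footprint(ranked[0]), footprint(ranked[1])
        unplaced_l1 = [(s, b) for b, s in ranked[0]["blocks"].items() if b in second]
        if not unplaced_l1:
            break
        r_big, biggest = max(unplaced_l1)
        if (r_l1 - r_big) * 2 + r_big < available and r_l2 * 2 + r_big < available:
            placement[biggest] = "GPU"           # third data
            del second[biggest]
            available -= r_big
        else:
            break
    # reserve the double-buffer room of the largest remaining layer, then S8
    available -= 2 * footprint(max(compute_layers, key=footprint))
    for blk, size in list(second.items()):
        if size < available:
            placement[blk] = "GPU"
            available -= size
        else:
            placement[blk] = "CPU_host"          # remaining second data -> host
    return placement

layers = [
    {"name": "seg1",  "seg_or_merge": True,  "blocks": {"t0": 4}},
    {"name": "conv1", "seg_or_merge": False, "blocks": {"t1": 6, "t2": 10}},
    {"name": "conv2", "seg_or_merge": False, "blocks": {"t2": 10, "t3": 8}},
]
print(place_data(layers, gpu_capacity=60))   # e.g. {'t0': 'CPU_host', 't2': 'GPU', ...}
```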
Embodiment 3
(34) This embodiment is a further improvement on Embodiments 1 and 2, and the repeated content is omitted.
(35) The present invention further provides a deep learning heterogeneous computing method based on layer-wide memory allocation which, as shown in the accompanying figures, at least comprises the following steps:
(36) S9: collecting the training operational sequence of a neural network by means of one virtual iteration, and counting tensor shapes of data to be input and output during operation at every layer of the neural network, wherein the memory space required by the double-buffer configuration of every layer is calculated based on the tensor shapes;
(37) S10: where the capacity of the first storage room is greater than the memory room required by every layer, having the neural network retain its original structure and entering Step S12; and where there are one or more layers in the neural network that require a memory room greater than the capacity of the first storage room, entering Step S11 for subsequent processing;
(39) S11: performing structural adjustment on the layers that require a memory room greater than the capacity of the first storage room using the method for structural adjustment of a neural network as described in Embodiment 1, and performing a virtual iteration on the adjusted neural network so as to collect its training operational sequence again;
(40) S12: where the total data amount involved in the computation at all the layers in the neural network is smaller than the capacity of the first storage room, storing all the data involved during training for the neural network into the first storage room; and where the total data amount involved in the computation at all the layers in the neural network is greater than the capacity of the first storage room, offloading a part of the data involved in computation at all the layers in the neural network to the host storage room, wherein, whether it is necessary to offload the data to the host storage room is determined using the data placement method as described in Embodiment 2; and
(41) S13: dispatching the computation resources of the CPU and the GPU according to the training operational sequence of the neural network so as to train the neural network.
(42) For clear explanation, the following description is further directed to Steps S9, S12 and S13.
(43) S9: collecting the training operational sequence of a neural network by means of one virtual iteration, and counting tensor shapes of data to be input and output during operation at every layer of the neural network, wherein the memory room required by the double-buffer configuration of every layer is calculated based on the tensor shapes.
(44) Preferably, the first storage room may be the memory of the GPU and the second storage room may be the memory of the CPU, while the host storage room may be a cache. The data are all expressed as tensors. The tensor shape represents the number of dimensions of a tensor and the length of each dimension. For example, a tensor of shape [2, 3] has two elements in its first dimension and three elements in its second dimension, and may be written out as [[1, 2, 3], [4, 5, 6]]. Assuming that a tensor shape is expressed as [N, C, H, W], the memory room required by the tensor is R=S*(N*C*H*W), where S is the number of bytes occupied by every datum of the tensor, while N, C, H and W represent the batch size, the number of channels, the height and the width of the tensor, respectively.
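For instance, the memory requirement of a tensor follows directly from its shape; the short helper below assumes 4-byte (float32) elements.

```python
# Memory required by a tensor of shape [N, C, H, W]: R = S * (N * C * H * W),
# where S is the number of bytes per element (4 bytes, i.e. float32, assumed).
def tensor_bytes(shape, bytes_per_elem=4):
    n, c, h, w = shape
    return bytes_per_elem * n * c * h * w

# e.g. a batch of 32 feature maps with 64 channels of size 56x56 in float32:
print(tensor_bytes([32, 64, 56, 56]))   # 25690112 bytes, i.e. 24.5 MiB
```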
(45) Preferably, the virtual iteration happens only once, before training of the neural network begins. The virtual iteration only records the training operational sequence of the neural network and does not execute the computation tasks of any layer.
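One way to picture such a virtual iteration is the sketch below: it walks a toy layer list once, records the operation sequence together with the input and output tensor shapes, and executes no computation. The layer descriptions and the shape rules are simplified assumptions, not the patent's implementation.

```python
# Sketch of a "virtual iteration": walk the layer list once, record the
# training operational sequence and the input/output tensor shapes, but do
# not execute any layer.  The shape rules below are simplified assumptions.
def virtual_iteration(model, input_shape):
    sequence = []                        # training operational sequence
    shape = list(input_shape)            # [N, C, H, W]
    for layer in model:
        in_shape = list(shape)
        if layer["type"] == "conv":      # 3x3, stride 1, padding 1 assumed
            shape[1] = layer["out_channels"]
        elif layer["type"] == "pool":    # 2x2, stride 2 assumed
            shape[2] //= 2
            shape[3] //= 2
        # activation / batchnorm layers keep the shape unchanged
        sequence.append({"layer": layer["name"], "type": layer["type"],
                         "in": in_shape, "out": list(shape)})
    return sequence

model = [{"name": "conv1", "type": "conv", "out_channels": 64},
         {"name": "relu1", "type": "act"},
         {"name": "pool1", "type": "pool"}]
for op in virtual_iteration(model, [32, 3, 224, 224]):
    print(op)
```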
(46) S12: where the total data amount involved in the computation at all the layers in the neural network is smaller than the capacity of the first storage room, storing all the data involved during training for the neural network into the first storage room; and where the total data amount involved in the computation at all the layers in the neural network is greater than the capacity of the first storage room, offloading a part of the data involved in computation at all the layers in the neural network to the host storage room, wherein, whether it is necessary to offload the data to the host storage room is determined using the data placement method as described in Embodiment 2.
(47) Preferably, configuring the neural network for double buffering of data helps to minimize the communication overhead and accelerate training of the neural network. Where the storage room occupied by all the data required by one iteration of training of the neural network is greater than the memory of the GPU, during the forward-propagation process of the neural network, the data not required by computation at the presently worked layer are offloaded to the host storage room. During the back-propagation process of the neural network, the data required by computation at the presently worked layer are pre-stored into the memory of the GPU. The computation time required by neural network training can thus hide the communication overhead caused by offloading and pre-storing the data.
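The overlap of communication and computation can be pictured with the toy sketch below: while the present layer is computed, the previous layer's data is offloaded in a background thread. The compute and offload functions are dummy placeholders invented for the illustration; a real implementation would use asynchronous device-to-host copies on a separate GPU stream.

```python
# Toy illustration of hiding communication behind computation during the
# forward pass: while layer i is computed, the data of layer i-1 that is no
# longer needed is offloaded to host memory in the background.  compute() and
# offload_to_host() are dummy placeholders, not a real GPU API.
import time
from concurrent.futures import ThreadPoolExecutor

def compute(layer):
    time.sleep(0.02)                    # stands for the GPU computation

def offload_to_host(layer):
    time.sleep(0.01)                    # stands for the device-to-host copy

layers = [f"layer{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=1) as copier:
    pending = None                      # offload of the previous layer
    for layer in layers:
        compute(layer)                  # overlaps with the pending offload
        if pending is not None:
            pending.result()            # the previous offload must be done now
        pending = copier.submit(offload_to_host, layer)
    pending.result()
```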
(48) S13: dispatching the computation resources of the CPU and the GPU according to the training operational sequence of the neural network so as to train the neural network.
Embodiment 4
(50) This embodiment is a further improvement on previous embodiments, and the repeated content is omitted.
(51) The present invention further provides a deep learning heterogeneous computing system based on layer-wide memory allocation, which at least comprises a neural network adjustment module, a data placement module, a scheduling module, an execution engine, a CPU, a GPU and a host memory. Therein, the neural network adjustment module serves to adjust the network structure, so that the neural network can use the layer-wide memory allocation method to perform training in the limited memory of the GPU while ensuring correct training. The data placement strategy is to take the memory of the GPU as a cache of the host memory, and to place as many data as possible in the memory of the GPU, thereby reducing communication overhead. The scheduling module plans the computation resources across the CPU and the GPU as a whole, and assigns computation tasks at the segment layer and the merge layer to the CPU, in order to leverage the available computation resources and mitigate the effects of the adjustment of the neural network on the resulting performance. The execution engine controls the execution sequence of the layers during the neural network training, on the basis of the training operational sequence generated during the virtual iteration.
(52) Preferably, where there is a layer in the neural network whose training requires a memory room greater than the memory capacity of the GPU, the neural network adjustment module is active. The neural network adjustment module converts the calculation of a certain layer into calculation at a plurality of small layers, so as to break the limitation of the memory of the GPU. The data placement strategy has influence on the communication overhead during the training. In order to reduce the total amount of data that needs to be offloaded and prefetched during the training, the data required by the segment layer and the merge layer are placed in the host memory. On the prerequisite that the memory room required by the largest layer to perform operation under the double-buffer configuration is reserved, as much of the remaining data as possible is placed into the memory of the GPU, thereby reducing the communication overhead. On the other hand, the computation operation for the training at the segment layer and the merge layer is moved to the CPU so as to satisfy its relatively large memory requirements. The scheduling module plans the computation resources of the CPU and the GPU as a whole to match their computation workloads and accelerate the training. The execution engine works on the actual training, and controls the training process according to the training operational sequence obtained through the virtual iteration. Training of a neural network requires many iterations, and the operation sequence of every iteration is identical; the training eventually results in a trained network model usable for prediction.
(53) Preferably, the neural network adjustment module is configured such that, when the memory room R.sub.1 required by the operation at the i.sup.th layer is greater than the memory of the GPU, it enters a working mode in which it dynamically adjusts the layer structure of the neural network model based on the manner in which the layer structure of the i.sup.th layer is altered. The data placement module is configured such that, when the memory room R.sub.2 required by all the data involved in the neural network model is greater than the memory of the GPU, it enters a working mode in which it dynamically distributes the data required by the training of the neural network model based on the data placement method. The scheduling module is configured such that it assigns computation tasks at the segment layer and the merge layer to the CPU. The execution engine is configured such that it controls computation at every layer to be performed according to the training operational sequence during the training of the neural network.
(54) Preferably, the deep learning heterogeneous computing system further comprises a host memory. The CPU is configured such that, when performing computation tasks at the segment layer or the merge layer, it pre-stores the locally-input feature maps obtained through computing into the memory of the GPU. The GPU is configured such that, while working on the present locally-input feature map, it pre-stores the previous locally-input feature map to the host memory. Therein, while the GPU continuously performs computation based on the locally-input feature maps so as to obtain the locally-output feature maps, the CPU merges the locally-output feature maps so as to obtain the complete output feature map.
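The division of work between the CPU and the GPU described in this paragraph can be pictured with the following producer/consumer toy, in which NumPy arrays stand in for feature maps, a worker thread stands in for the GPU, and the segment and merge steps run on the CPU side. All names are illustrative and no real GPU is involved.

```python
# Toy producer/consumer sketch of the CPU/GPU division of work: the CPU
# segments the input and stages the locally-input feature maps, a worker
# standing in for the GPU computes the locally-output feature maps, and the
# CPU merges them into the complete output feature map.
import threading, queue
import numpy as np

tiles_in, tiles_out = queue.Queue(maxsize=2), queue.Queue()

def gpu_worker():
    while True:
        item = tiles_in.get()
        if item is None:
            break
        idx, tile = item
        tiles_out.put((idx, np.maximum(tile, 0)))   # stands for the layer's op (ReLU here)

feature_map = np.random.randn(4, 16, 16)            # (C, H, W)
worker = threading.Thread(target=gpu_worker)
worker.start()

# CPU side: segment layer (split along the channel dimension), then merge layer
for idx, tile in enumerate(np.split(feature_map, 4, axis=0)):
    tiles_in.put((idx, tile))
tiles_in.put(None)
worker.join()

parts = dict(tiles_out.get() for _ in range(4))
merged = np.concatenate([parts[i] for i in range(4)], axis=0)
assert np.allclose(merged, np.maximum(feature_map, 0))
```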
Embodiment 5
(57) Preferably, in an experiment, the disclosed system was equipped with Ubuntu 16.04, an Intel® Xeon® CPU E5-2680 and an NVIDIA K80 GPU, while the network models used were ZFnet, VGG, siftflow-fcn32 and WRN-37-4. The data collected in the experiment are shown in the table below. The numbers following the network models represent the batch sizes. For instance, vgg(32) indicates that the VGG network is trained with a batch size of 32. Caffe is taken as the control, and my_system denotes the system of the present invention. The data shown in the table are results of 10 iterations, and the time consumed by training is expressed in seconds. The blank fields indicate that the caffe system was unable to train the relevant model. As proven by the experimental data, the system of the present invention can break layer-wide limitations to provide better model scalability and is capable of training larger and broader network models.
(58) TABLE 1. Time of 10 training iterations (seconds) with 4 GB of GPU memory:
Model (batch size): ZFnet(128), ZFnet(256), ZFnet(384), vgg(16), vgg(32), vgg(64), vgg(128), siftflow-fcn32, WRN-37-4
caffe: 13.0575, 15.1558; all other fields are blank (caffe was unable to train those models)
my_system: 10.0947, 25.6147, 40.6689, 23.5791, 35.3117, 57.719, 98.07, 90.1215, 542.368
(59) While the above description has illustrated the present invention in detail, it is obvious to those skilled in the art that many modifications may be made without departing from the scope of the present invention, and all such modifications are considered a part of the present disclosure. In view of the aforementioned discussion, relevant knowledge in the art, and the references or information referred to above in conjunction with the prior art (all incorporated herein by reference), further description is deemed unnecessary. In addition, it is to be noted that every aspect and every part of any embodiment of the present invention may be combined or interchanged in whole or in part. Also, people of ordinary skill in the art shall appreciate that the above description is only exemplary, and is not intended to limit the present invention.
(60) The above discussion has been provided for the purposes of exemplification and description of the present disclosure. This does not mean the present disclosure is limited to the forms disclosed in this specification. In the foregoing embodiments, for example, in order to streamline the present disclosure, various features of the present disclosure are combined in one or more embodiments, configurations or aspects. The features in these embodiments, configurations or aspects may be combined with alternative embodiments, configurations or aspects other than those described previously. This manner of disclosure shall not be interpreted as reflecting an intention that the present disclosure requires more features than those expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Therefore, the following claims are herein incorporated into the embodiments, wherein each claim itself acts as a separate embodiment of the present disclosure.
(61) Furthermore, while the description of the present disclosure comprises description of one or more embodiments, configurations or aspects and certain variations and modifications, other variations, combinations and modifications are also within the scope of the present disclosure, for example as may be within the skill and knowledge of those in the relevant field after understanding the present disclosure. This application is intended, to the extent permitted, to include rights to alternative embodiments, configurations or aspects, and rights to alternative, interchangeable and/or equivalent structures, functions, scopes or steps of those claimed, whether or not such alternative, interchangeable and/or equivalent structures, functions, scopes or steps are disclosed herein, and is not intended to surrender any patentable subject matter to the public.