A METHOD FOR A DISTRIBUTED LEARNING
20230222354 · 2023-07-13
Abstract
A computer implemented method for training a learning model by a distributed learning system that includes computing nodes. The computing nodes respectively implement the learning model and derive gradient information for updating the learning model based on training data. The method involves: encoding, by the computing nodes, the gradient information by exploiting a correlation across the gradient information from the respective computing nodes; exchanging, by the computing nodes, the encoded gradient information within the distributed learning system; determining an aggregate gradient information based on the encoded gradient information from the computing nodes; and updating the learning model of the computing nodes with the aggregate gradient information, thereby training the learning model.
Claims
1.-15. (canceled)
16. A computer implemented method for training a learning model based on training data by means of a distributed learning system comprising computing nodes, the computing nodes respectively implementing the learning model and deriving a gradient information, Gi, for updating the learning model based on the training data, the method comprising: encoding, by the computing nodes, the gradient information by exploiting a correlation across the gradient information from the respective computing nodes by means of an autoencoder model trained to extract gradient information common across the computing nodes; exchanging, by the computing nodes, the encoded gradient information within the distributed learning system; determining an aggregate gradient information, G′, based on the encoded gradient information from the computing nodes; and updating the learning model of the computing nodes with the aggregate gradient information, thereby training the learning model.
17. The computer implemented method according to claim 16, wherein the distributed learning system operates according to a ring-allreduce communication protocol.
18. The computer implemented method according to claim 17, wherein: the encoding comprises encoding, by the respective computing nodes, the gradient information, Gi, based on encoding parameters, thereby obtaining encoded gradient information, Gc,i, for a respective computing node; the exchanging comprises receiving, by the respective computing nodes, the encoded gradient information, Gc,i−1, from the other computing nodes; the determining comprises, by the respective computing nodes, aggregating the encoded gradient information from the respective computing nodes and decoding the aggregated encoded gradient information, G′, based on decoding parameters.
19. The computer implemented method according to claim 16, wherein the distributed learning system operates according to a parameter-server communication protocol.
20. The computer implemented method according to claim 19, wherein the encoding comprises selecting, by the respective computing nodes, most significant gradient information from the gradient information, Gs,i, thereby obtaining a coarse representation of the gradient information, Gs,i, for the respective computing nodes, and encoding, by a selected computing node configured to act as a server node, the gradient information based on encoding parameters, thereby obtaining an encoded gradient information, Gc,1; the exchanging comprises receiving, by the server node, the coarse representations from the other computing nodes; the determining comprises, by the server node, decoding the coarse representations and the encoded gradient information based on the decoding parameters, thereby obtaining decoded gradient information for the respective computing nodes, and aggregating the decoded gradient information, G′,i.
21. The computer implemented method according to claim 19, wherein the encoding comprises selecting, by the respective computing nodes, most significant gradient information from the gradient information, thereby obtaining a coarse representation of the gradient information for the respective computing nodes, and encoding, by a selected computing node, the gradient information based on encoding parameters, thereby obtaining encoded gradient information; the exchanging comprises receiving, by a further computing node configured to act as a server node, the coarse representations from the respective computing nodes and the encoded gradient information from the selected computing node; and the determining comprises, by the server node, decoding the coarse representations and the encoded gradient information based on the decoding parameters, thereby obtaining decoded gradient information for the respective computing nodes, and aggregating the decoded gradient information.
22. The computer implemented method according to claim 16, further comprising, by the respective computing nodes, compressing before the encoding the gradient information.
23. The computer implemented method according to claim 16, further comprising training the autoencoder model at a selected computing node based on the correlation across gradient information from the respective computing nodes.
24. The computer implemented method according to claim 23, wherein the training further comprises deriving the encoding and decoding parameters from the autoencoder model.
25. The computer implemented method according to claim 24, wherein the training further comprises exchanging the encoding and decoding parameters across the other computing nodes.
26. The computer implemented method according to claim 23, wherein the training of the autoencoder model is performed in parallel with training of the learning model.
27. The computer implemented method according to claim 16, wherein the distributed learning system is a convolutional neural network, a graph neural network or a recurrent neural network.
28. A computer program product comprising computer-executable instructions for causing a plurality of computing nodes forming a distributed learning system to perform the method according to claim 16 when the program is run on the plurality of computing nodes.
29. A computer readable storage medium comprising the computer program product according to claim 28.
30. A distributed learning system programmed for carrying out the method according to claim 16.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0048] Some example embodiments will now be described with reference to the accompanying drawings.
DETAILED DESCRIPTION OF EMBODIMENT(S)
[0062] The present disclosure provides a solution for exchanging gradient information for training a learning model across a distributed learning system with limited communication bandwidth. A distributed learning system may be for example a Recurrent Neural Network, RNN, a Graph Neural Network, GNN, or a Convolutional Neural Network, CNN, in which the learning model is replicated in the computing nodes in the system. The computing nodes may be any wired or wireless devices capable of processing data such as mobile phones or servers.
[0063] The general principle of distributed learning consists of training the learning model at the computing nodes based on local data and exchanging gradient information between the respective learning models to generate a global learning model. The training is performed in an iterative manner where at each iteration new gradient information is derived based on the local data. The exchange of gradient information may be performed at every iteration or every several iterations. Once the gradient information is exchanged, an aggregate gradient information for updating the respective learning models is derived. The manner in which the gradient information is exchanged within the system and how the aggregated gradient information is derived depends on the communication protocol employed by the system.
[0064] The communication protocol may be a Parameter Server, PS, or a Ring-AllReduce, RAR, communication protocol. In the parameter-server communication protocol, the system is organized in a server-worker configuration where a selected node acts as a server node, and the other nodes act as worker nodes. In such a configuration, the server node is responsible for deriving the aggregate gradient information by processing the gradient information from all nodes and for distributing the aggregated gradient information to the other nodes. In contrast, in the ring-allreduce communication protocol, each computing node is responsible for deriving the aggregate gradient information by processing the gradient information from all nodes and for updating its respective learning model with the aggregated information.
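By way of a non-limiting illustration, the following Python sketch contrasts how the aggregate gradient information may be obtained under the two protocols; the function names and the use of a simple mean are assumptions made for clarity only and are not part of the disclosed method.

```python
import numpy as np

# Hypothetical per-node gradients G_i (one flat vector per node).
rng = np.random.default_rng(0)
gradients = [rng.standard_normal(8) for _ in range(3)]

# Parameter-server style: a selected server node receives all G_i,
# aggregates them, and redistributes the result to the worker nodes.
def parameter_server_aggregate(grads):
    aggregate = np.mean(grads, axis=0)          # server computes G'
    return [aggregate.copy() for _ in grads]    # G' is sent back to every worker

# Ring-allreduce style: every node itself ends up with the same aggregate
# from the gradients exchanged around the ring (shown here as a plain mean;
# a real ring exchanges gradient chunks stepwise).
def ring_allreduce_aggregate(grads):
    return [np.mean(grads, axis=0) for _ in grads]   # each node derives G' locally

ps = parameter_server_aggregate(gradients)
rar = ring_allreduce_aggregate(gradients)
assert all(np.allclose(a, b) for a, b in zip(ps, rar))
```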
[0065] The method for training a learning model by means of a distributed learning system operating according to the parameter-server communication protocol will now be described with reference to
[0067] The training of the learning model is performed iteratively. At each iteration, the method performs the steps shown in
[0068] The method proceeds to the step of encoding 220. In this embodiment, the encoding is performed in two steps which may be performed in sequential order or in parallel. In the first step 221, the nodes respectively derive a coarse representation of their gradient information, Gs,i, 11-13. The coarse representation may be derived by, for example, selecting the β% of the gradients with the highest magnitudes. The selection is performed independently at each node; thus, each node selects its own β% of the gradients with the highest magnitudes. The coarse selection may be applied to the compressed gradient information, Gd,i, if the gradient information has been previously compressed, or to the uncompressed gradient information, Gi, if the compression step has been omitted. Depending on the bandwidth limitations, the selection may be applied with different selection rates. A very aggressive selection rate of, e.g., 0.0001% would result in a very compact gradient vector Gs,i containing a limited number of gradient values. By selecting the gradients with the highest magnitudes, the most significant gradient information is extracted. Alternatively, the coarse selection may be performed by applying a coarse information extraction and encoding or another algorithm suitable for the purpose. In the second step 230, the server node, i.e., node 111, encodes its gradient information Gd,1 based on encoding parameters in its encoder 131 to obtain an encoded gradient information Gc,1.
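The following is a minimal Python sketch of such a magnitude-based coarse selection, assuming a flat gradient vector and an illustrative selection rate β; the function name and the index/value representation of the result are hypothetical.

```python
import numpy as np

def coarse_representation(grad, beta_percent=1.0):
    """Keep only the beta_percent % of gradients with the highest magnitude.

    Returns the kept indices and values; all other positions are treated as
    zero. This is only an illustrative stand-in for the coarse selection
    step described above.
    """
    k = max(1, int(round(grad.size * beta_percent / 100.0)))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of the top-k magnitudes
    return idx, grad[idx]

rng = np.random.default_rng(0)
g = rng.standard_normal(1000)
idx, vals = coarse_representation(g, beta_percent=0.5)   # keeps 5 of 1000 values
```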
[0069] The method proceeds to step 250, where the encoded gradient information is exchanged across the system. For this purpose, nodes 112 and 113 forward their coarse representations Gs,2 and Gs,3 to the server node 111. In the next step 300, the server node 111 derives the aggregate gradient information as follows. First, the encoded gradient information Gc,1 is combined or fused with the coarse representations, Gs,i, by the logic circuit 15. The fusing may be performed by concatenating the respective coarse representations Gs,i from all nodes with the encoded gradient information Gc,1 from the server node, or with a learnt representation of Gc,1. Instead of concatenating, an elementwise addition or another attention mechanism may be employed. The resulting fused gradient information 11′-13′, i.e., one fused gradient vector for each respective node i, is then decoded individually by the decoder 140 to derive three decoded gradient vectors, G′,i, 21-23. The decoding step 311 may be performed in a sequential manner or in parallel, as shown in the figure. In the latter case, the decoder 140 comprises three decoders 141-143, one decoder for each node i. Finally, the decoded gradient information G′,i is aggregated in step 312. The aggregated gradient information 30 is then forwarded to the other nodes within the system so that all nodes update 350 their respective learning models with the aggregated gradient information, G′, 30.
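A minimal Python sketch of this server-side fusion, per-node decoding and aggregation is given below; the random linear maps merely stand in for the trained encoder 131 and decoders 141-143, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, K = 100, 20, 5     # gradient size, encoded size, coarse size (illustrative)

# Placeholder encoder and per-node decoders: random linear maps standing in
# for the trained autoencoder of the disclosure.
W_enc = rng.standard_normal((C, D)) / np.sqrt(D)
W_dec = [rng.standard_normal((D, C + K)) / np.sqrt(C + K) for _ in range(3)]

grads = [rng.standard_normal(D) for _ in range(3)]             # G_d,i at the three nodes
coarse = [g[np.argsort(np.abs(g))[-K:]] for g in grads]        # G_s,i (top-K values)

g_c1 = W_enc @ grads[0]                                        # server node encodes G_d,1

# Fuse by concatenating each node's coarse representation with G_c,1,
# decode per node, then aggregate the decoded gradients into G'.
decoded = [W_dec[i] @ np.concatenate([g_c1, coarse[i]]) for i in range(3)]
aggregate = np.mean(decoded, axis=0)                           # G'
```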
[0070] The encoder 131 and decoders 141-143 are respectively set to encode the gradient information by exploiting the correlation across the gradient information from the respective nodes and to decode the encoded information as closely as possible to the original, as follows.
[0071] For this purpose, the encoder 131 and the decoders 141-143 form part of an autoencoder, which is a neural network trained iteratively to capture the similarities or the correlation across the gradient information, i.e., the common gradient information, from the respective nodes. To this end, the autoencoder includes a control logic that encourages the encoder to capture information common across the gradient information from the different nodes. The control logic implements a similarity loss function, which estimates the similarities or the correlation across the gradient information from the respective nodes, and a reconstruction loss function. The similarity loss function aims at minimizing the difference across the encoded gradient information from the respective nodes, thereby enforcing the extraction of the common information, while the reconstruction loss function aims at minimizing the differences between the respective decoded gradient information and the original gradient information.
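One possible formulation of these two losses is sketched below in Python; the choice of mean squared error and the weighting factor lam are assumptions, as the disclosure does not fix the exact loss terms.

```python
import numpy as np

def autoencoder_losses(originals, encoded, decoded, lam=1.0):
    """One possible formulation of the two losses described above.

    reconstruction: mean squared error between each node's original and
    decoded gradient information; similarity: mean squared difference
    between the encoded gradient information of every pair of nodes.
    The exact norms and the weighting factor lam are assumptions.
    """
    recon = np.mean([np.mean((o - d) ** 2) for o, d in zip(originals, decoded)])
    pairs = [(i, j) for i in range(len(encoded)) for j in range(i + 1, len(encoded))]
    sim = np.mean([np.mean((encoded[i] - encoded[j]) ** 2) for i, j in pairs])
    return recon + lam * sim, recon, sim
```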
[0072] The autoencoder and the method for training the autoencoder will now be described with reference to
[0073] Those parts of the distributed learning system which are identical to those shown in
[0074] At each iteration, the autoencoder receives 410 gradient information 51-53 from the respective computing nodes. The gradient information may be uncompressed or compressed by the respective computing nodes by means of conventional compression techniques or by applying a coarse selection, i.e., selecting the α% of the gradients with the highest magnitude. Next, in step 411, a coarse representation Gs,i is derived from the respective gradient information Gd,i by, for example, extracting the β% of the gradients with the highest magnitudes, or by applying a coarse information extraction and encoding or another algorithm suitable for the purpose. Following step 411 or in parallel to it, the gradient information from the respective nodes Gd,i is encoded 420 by the encoder 131 in a sequential manner. At each convolutional layer of the encoder, the gradient information is essentially downsampled to obtain an encoded gradient information Gc,i 51′-53′. For example, if the gradient vector Gd,i is a one-dimensional gradient vector comprising 100 values, its encoded representation may be a gradient vector comprising 20 gradients. The respective encoded gradient information, Gc,i, 51′-53′ is then decoded 430 by the respective decoders 141-143 as follows. At each deconvolutional layer of the decoder, the encoded gradient information is upsampled, with the exception that at the fourth deconvolutional layer the gradient information is first upsampled and then concatenated with its respective coarse representation Gs,i. Once decoding is complete, the method proceeds to step 440 to derive similarities across the gradient information. For this purpose, the encoded and decoded gradient information as well as the original gradient information are all fed into the control logic 14, which minimizes 440 the differences between the respective encoded gradient information and the differences between the respective original and decoded gradient information. By minimizing the respective differences, it is ensured that the encoding enforces extraction of the common information across the gradient information and that the decoding enforces correct reconstruction of the encoded gradient information. As a result, one set of encoding parameters and three sets of decoding parameters, one set for each decoder, are derived. In a final step 450, the encoder and decoders of the selected node, i.e., node 111, are updated with the derived encoding and decoding parameters. The process is repeated until the differences between the respective encoded gradient information and the differences between the respective original and decoded gradient information have been minimized. Finally, once the training of the autoencoder has completed, it is assured that the coarse selectors in the respective nodes (not shown in
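The following PyTorch sketch illustrates one training step of such an autoencoder with one shared encoder and three per-node decoders; plain linear layers replace the convolutional and deconvolutional layers, and all sizes are illustrative assumptions, so this is a simplified stand-in rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 100-value gradient vectors, 20-value codes, 5-value
# coarse representations, three computing nodes.
L_GRAD, L_CODE, L_COARSE, N_NODES = 100, 20, 5, 3

encoder = nn.Sequential(nn.Linear(L_GRAD, 50), nn.ReLU(), nn.Linear(50, L_CODE))
decoders = [nn.Sequential(nn.Linear(L_CODE + L_COARSE, 50), nn.ReLU(),
                          nn.Linear(50, L_GRAD)) for _ in range(N_NODES)]
params = list(encoder.parameters()) + [p for d in decoders for p in d.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(grads, coarse):
    # grads, coarse: lists of tensors of shape (L_GRAD,) and (L_COARSE,)
    codes = [encoder(g) for g in grads]                        # shared encoder
    recons = [decoders[i](torch.cat([codes[i], coarse[i]]))    # fuse G_s,i, decode
              for i in range(N_NODES)]
    recon_loss = sum(((r - g) ** 2).mean() for r, g in zip(recons, grads))
    sim_loss = sum(((codes[i] - codes[j]) ** 2).mean()         # pull codes together
                   for i in range(N_NODES) for j in range(i + 1, N_NODES))
    loss = recon_loss + sim_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```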
[0075] Once the autoencoder 150 is trained, the training of the learning model may proceed based on the encoded gradient information as detailed above with reference to
[0076] Typically, the training of the learning model is performed in two stages. At the initial stage, the learning model is updated using the full gradient information derived from the respective computing nodes. This requires forwarding the full gradient information from the worker nodes to the server node. In the second stage, the learning model is updated with the gradient information compressed using conventional compression techniques. In this stage, the compressed gradient information is forwarded from the respective worker nodes to the server node.
[0077] In the present disclosure, the training of the learning model may be performed in two stages or in three stages. In the first case, the training starts by updating the learning model using the full gradient information derived from the respective computing nodes. At the same time, i.e., in parallel with the training of the learning model, the encoder-decoder model of the autoencoder may also be trained based on the full gradient information. Once the encoder-decoder model has been trained, e.g., after 1000 or more iterations, the training of the learning model proceeds to the second stage, where the learning model is updated based on the gradient information encoded with the encoding parameters derived by the trained encoder-decoder model. Thus, at the second stage, encoded gradient information is exchanged rather than full gradient information.
[0078] In the second case, the training starts by updating the learning model based on the full gradient information. After, for example, 1000 iterations, the training of the learning model proceeds to the second stage, where both the learning model and the encoder-decoder model are updated with gradient information compressed using a conventional compression technique. Once the encoder-decoder model of the autoencoder has been trained, e.g. after 100 to 300 iterations, the training of the learning model proceeds to the third stage with updating the learning model based on the gradient information encoded with the encoding parameters derived by the trained encoder-decoder model.
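A simple scheduling helper such as the hypothetical Python function below could select which form of gradient information is exchanged at a given iteration; the iteration counts are only the example values mentioned above, and the real switch-over point would depend on when the encoder-decoder model has converged.

```python
def gradient_exchange_mode(iteration, warmup_full=1000, warmup_compressed=300,
                           three_stage=True):
    """Illustrative stage schedule for the two- and three-stage training."""
    if iteration < warmup_full:
        return "full"           # stage 1: exchange full gradient information
    if three_stage and iteration < warmup_full + warmup_compressed:
        return "compressed"     # stage 2: conventionally compressed gradients
    return "encoded"            # final stage: autoencoder-encoded gradients
```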
[0079] Because the training of the encoder-decoder model in the second case is performed on the compressed gradient information rather than on the full gradient information as in the first case, the training of the autoencoder is completed after a significantly lower number of iterations. Further, the size of the encoded gradient information in the second case may be smaller than the size of the encoded gradient information in the first case.
[0081] The method for training a learning model by means of a distributed learning system operating according to the ring-allreduce communication protocol will now be described with reference to
[0082] As detailed above, in a ring-allreduce communication protocol, each computing node in the system is responsible for deriving the aggregate gradient information by processing the encoded gradient information from all nodes within the system. Accordingly, the respective nodes 111-113 are configured to encode and decode gradient information by exploiting the correlation across the gradient information from the respective nodes.
[0083] The distributed learning system 100 shown in
[0084] Similar to the other embodiments described above, the training of the learning model is performed iteratively. At each iteration, the method performs the steps shown in
[0085] The method proceeds to step 220. The encoding is performed in two steps performed in sequential order. In the first step 221, the nodes respectively derive a coarse representation of their gradient information, Gs,i. The coarse representation may be derived by, for example, selecting the β% of the gradients with the highest magnitudes. The selection is performed at a selected node, which selects the β% of its gradients with the highest magnitudes. The selected node is selected at random at each iteration of the training of the learning model. The selected node then shares the indices of the extracted gradients with all remaining nodes in the network, which in turn construct coarse representations of their gradient information based on the shared indices. The coarse selection may be applied to the compressed gradient information, Gd,i, if the gradient information has been previously compressed, or to the uncompressed gradient information, Gi, if the compression step has been omitted. Depending on the bandwidth limitations, the selection may be applied with different selection rates. A very aggressive selection rate of, e.g., 0.0001% would result in a very compact gradient vector Gs,i containing a limited number of gradient values. By selecting the gradients with the highest magnitudes, the most significant gradient information is extracted. Alternatively, the coarse selection may be performed by applying a coarse information extraction and encoding or another algorithm suitable for the purpose. In the second step 230, the respective nodes 111-113 encode their coarse gradient information, Gs,i, based on encoding parameters pre-set in their respective encoders 131-133. Thus, each node derives encoded gradient information Gc,i 11-13. In the next step 250, the nodes 111-113 exchange the encoded gradient information Gc,i within the system. At the end of this step, each node has received the encoded gradient information from the other nodes. In the next step 300, the respective nodes derive the aggregate gradient information G′ as follows. First, the nodes 111-113 aggregate 321 the respective encoded gradient information Gc,i. Secondly, the aggregated encoded gradient information Gc 20 is decoded 322 to obtain the aggregated gradient information, G′. As shown in
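The following Python sketch walks through these steps, with random linear maps standing in for the pre-set encoders 131-133 and decoder 141; the sizes, the index-sharing representation, and the summation used for aggregation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 100, 5, 3                          # gradient size, top-K, number of nodes
grads = [rng.standard_normal(D) for _ in range(N)]

# One node is picked at random; it selects the indices of its top-K gradients
# and shares those indices, so all nodes build their coarse representation
# G_s,i over the same positions.
selected = rng.integers(N)
shared_idx = np.argsort(np.abs(grads[selected]))[-K:]
coarse = [g[shared_idx] for g in grads]

# Placeholder encoder/decoder (random linear maps) standing in for the
# trained autoencoder; every node uses the same encoding parameters.
W_enc = rng.standard_normal((8, K)) / np.sqrt(K)
W_dec = rng.standard_normal((D, 8)) / np.sqrt(8)

encoded = [W_enc @ c for c in coarse]         # G_c,i at each node
aggregated_code = np.sum(encoded, axis=0)     # nodes exchange and aggregate G_c,i
aggregate_grad = W_dec @ aggregated_code      # each node decodes to obtain G'
```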
[0086] Similar to the other embodiments, the encoding and decoding parameters of the encoders 131-133 and the decoder 141 are derived by training an autoencoder 150, i.e., a neural network trained iteratively to capture the similarities or the correlation across the gradient information, i.e., the common gradient information, from the respective nodes. For this purpose, one of the nodes, e.g., node 111, comprises the autoencoder 150. The autoencoder 150 includes a control logic 14 that encourages the encoder to capture information common across the gradient information from the different nodes. Herein, the control logic implements a reconstruction loss function which minimizes the differences between the respective decoded gradient information and the original gradient information.
[0087] The autoencoder and the method for training the autoencoder will now be described with reference to
[0088] As shown in
[0089] At each iteration, the autoencoder receives 410 gradient information 51-53 from the respective nodes 111-113. The gradient information may be uncompressed or compressed by the respective nodes by means of conventional compression techniques. Next, the coarse selector 13 derives 411 a coarse representation by selecting 411 the β% of the gradients with the highest magnitudes. Similar to above, the coarse selection may be performed by another algorithm suitable for this purpose. At each iteration of the autoencoder training, the coarse selection is performed based on the indices of the β% of gradients with the highest magnitude for a node selected at random. For example, at the first iteration, the indices of the β% of the gradients of the gradient information Gd,1 of the node 111 may be used. By deriving the coarse representation from the gradient information of the other nodes based on the indices of the β% most significant gradient information of the selected node, coarse representations containing common information from the gradient information of the respective nodes are derived. The derived coarse representations Gs,i are then encoded 420 by the respective encoders 131-133, where the respective encoders 131-133 encode the corresponding coarse representations in the same manner, i.e., by employing the same encoding parameters. In other words, the encoders 131-133 may be considered as three instances of the same encoder. At each convolutional layer, the gradient information Gs,i is essentially downsampled to obtain an encoded gradient information Gc,i. For example, if the received gradient vector 121-123 comprises 100 gradients, its encoded representation may comprise 20 gradients. The respective encoded gradient information, Gc,i, are then aggregated 421 by the aggregation circuit 17. The resulting encoded gradient information, Gc, is then decoded 430 by the decoder 141. At each deconvolutional layer, the encoded gradient information, Gc, is upsampled to finally obtain decoded gradient information, Gi. The aggregated decoded gradient information as well as the original gradient information, i.e., Gs,i, are fed into the control logic 14, which minimizes 440 the differences between the respective original and decoded gradient information. As a result, three sets of encoding parameters, one set for each encoder, and one set of decoding parameters are derived. As mentioned above, the encoding parameters of each encoder are the same. In a final step 450, the node 111 shares the encoding and decoding parameters with the other nodes within the distributed learning system. As a result, the encoders and decoders of the respective nodes are pre-set with the derived encoding and decoding parameters. The process is repeated until the differences between the respective original and decoded gradient information have been minimized. Once the autoencoder is trained, the learning model may be trained based on the encoded gradient information.
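One training iteration of this embodiment's autoencoder might be sketched in PyTorch as follows; linear layers replace the convolutional and deconvolutional layers, and the reconstruction target is one plausible reading of "the original gradient information", so both are assumptions rather than the disclosed design.

```python
import torch
import torch.nn as nn

# Sketch of one training iteration: a single encoder applied to every node's
# coarse representation (the "three instances of the same encoder"),
# aggregation of the codes, one decoder, and a reconstruction loss.
# Layer sizes are illustrative.
K, C, N = 50, 10, 3
encoder = nn.Sequential(nn.Linear(K, 32), nn.ReLU(), nn.Linear(32, C))
decoder = nn.Sequential(nn.Linear(C, 32), nn.ReLU(), nn.Linear(32, K))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(coarse_reps):                        # list of N tensors of shape (K,)
    codes = [encoder(c) for c in coarse_reps]       # same encoding parameters for all nodes
    aggregated = torch.stack(codes).sum(dim=0)      # G_c
    decoded = decoder(aggregated)
    # Reconstruction target: here the mean of the original coarse
    # representations, which is an assumption for this simplified sketch.
    target = torch.stack(coarse_reps).mean(dim=0)
    loss = ((decoded - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```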
[0090] Similar to other embodiments, the training of the autoencoder is done in parallel with the training of the learning model of the system. The training of the learning model may be performed in two stages or three stages. In the first case, the training starts by updating the learning model using the full gradient information derived from the respective computing nodes. At the same time, i.e., in parallel with the training of the learning model, the encoder-decoder model of the autoencoder is also trained based on the full gradient information. Once the encoder-decoder model has been trained, e.g., after 1000 or more iterations, the training of the learning model proceeds to the second stage where the learning model is updated based on the gradient information decoded by the trained encoder-decoder model. Thus, at the second stage, encoded gradient information is exchanged rather than full gradient information.
[0091] In the second case, the training starts by updating the learning model based on the full gradient information. After, for example, 1000 iterations, the training of the learning model proceeds to the second stage, where both the learning model and the encoder-decoder model are updated with gradient information compressed using a conventional compression technique. Once the encoder-decoder model of the autoencoder has been trained, e.g., after 100 to 300 iterations, the training of the learning model proceeds to the third stage with updating the learning model based on the gradient information encoded with the encoding parameters derived by the trained encoder-decoder model.
[0092] As used in this application, the term “circuitry” may refer to one or more or all of the following:
[0093] (a) hardware-only circuit implementations such as implementations in only analog and/or digital circuitry and
[0094] (b) combinations of hardware circuits and software, such as (as applicable):
[0095] (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
[0096] (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
[0097] (c) hardware circuit(s) and/or processor(s), such as microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software may not be present when it is not needed for operation.
[0098] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
[0099] Although the present invention has been illustrated by reference to specific embodiments, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied with various changes and modifications without departing from the scope thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the scope of the claims are therefore intended to be embraced therein.
[0100] It will furthermore be understood by the reader of this patent application that the words “comprising” or “comprise” do not exclude other elements or steps, that the words “a” or “an” do not exclude a plurality, and that a single element, such as a computer system, a processor, or another integrated unit may fulfil the functions of several means recited in the claims. Any reference signs in the claims shall not be construed as limiting the respective claims concerned. The terms “first”, “second”, “third”, “a”, “b”, “c”, and the like, when used in the description or in the claims are introduced to distinguish between similar elements or steps and are not necessarily describing a sequential or chronological order. Similarly, the terms “top”, “bottom”, “over”, “under”, and the like are introduced for descriptive purposes and not necessarily to denote relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and embodiments of the invention are capable of operating according to the present invention in other sequences, or in orientations different from the one(s) described or illustrated above.