Communication compression method based on model weight distribution in federated learning
11468370 · 2022-10-11
CPC classification
H03M7/40
ELECTRICITY
G06F18/2415
PHYSICS
International classification
H03M7/30
ELECTRICITY
Abstract
A communication compression method based on model weight distribution in federated learning, belonging to the technical field of wireless communication. Based on the existing federated averaging idea in federated learning, the method counts the distribution of the model weight information to be transmitted between nodes during each communication, performs scalar quantization and compression with a Lloyd-Max quantizer according to the distribution characteristics, encodes the quantized values with the Huffman coding method, and finally sends the codes to the target node, so that the minimum mean square quantization error is achieved and the number of bits required for communication is reduced.
Claims
1. A communication compression method based on model weight distribution in federated learning, characterized in that it is used for a central communication system, wherein the central communication system includes K edge nodes and a central server, and each edge node is connected with the central server; the edge nodes communicate only with the central server during the whole federated learning process; the communication compression method is applied both to the process in which the central server broadcasts the global model summarized from all edge nodes to each edge node, and to the process in which each edge node uploads the updated local model obtained from training to the central server; the same parallel operation is performed on each vector parameter [w^(1), . . . , w^(S)] in the model parameter w of the edge node, where w^(s) (s=1, 2, . . . , S) denotes the model parameter s; taking the vector parameter w^(s) as an example, the communication compression method includes the following steps: i) fit the distribution of the elements [w_1^(s), w_2^(s), . . . , w_n^(s)] in the model parameter w^(s) to be compressed, with w_i^(s) (i=1, 2, . . . , n) denoting the element i of the model parameter w^(s), to obtain an approximate probability density function p_s(w) of [w_1^(s), w_2^(s), . . . , w_n^(s)]; ii) set the number of quantization intervals as M, and apply the Lloyd-Max algorithm to the probability density function p_s(w) to obtain the quantization interval endpoint vector b^(s)=[b_0^(s), . . . , b_M^(s)] that minimizes the mean square quantization error σ_q^(s)² and the quantization output value vector v^(s)=[v_1^(s), . . . , v_M^(s)], where b_m^(s) (m=0, 1, . . . , M) denotes the element m of the quantization interval endpoint vector and v_m^(s) (m=1, 2, . . . , M) denotes the quantization output value m; the quantization interval endpoint vector b^(s)=[b_0^(s), . . . , b_M^(s)] determines the quantization interval endpoints, and the quantization output value vector determines the quantization output value corresponding to each quantization interval; iii) map each element w_i^(s) in the model parameter w^(s) in sequence to obtain the quantized lossy model parameter Q(w^(s)); and iv) adopt the Huffman coding method to encode the lossy model parameter Q(w^(s)) obtained from the quantization into the binary codes of the compressed model parameter w^(s) for transmission.
2. The communication compression method based on model weight distribution in federated learning as set forth in claim 1, characterized in that step ii) "set the number of quantization intervals as M, and apply the Lloyd-Max algorithm to the probability density function p_s(w) to obtain the quantization interval endpoint vector b^(s)=[b_0^(s), . . . , b_M^(s)] that minimizes the mean square quantization error σ_q^(s)² and the quantization output value vector v^(s)=[v_1^(s), . . . , v_M^(s)]" is carried out as follows: a) find the maximum value w_max^(s) and the minimum value w_min^(s) of the elements w_i^(s) in the model parameter w^(s) to be quantized; initialize b_0^(s)=w_min^(s) and b_M^(s)=w_max^(s); and stochastically assign an initial value to the first quantization output value v_1^(s) such that w_min^(s)<v_1^(s)<w_max^(s); b) let the index m=1; c) substitute v_m^(s) and b_{m−1}^(s) into Equation (I):

$$v_m^{(s)} = \frac{\int_{b_{m-1}^{(s)}}^{b_m^{(s)}} w\,p_s(w)\,dw}{\int_{b_{m-1}^{(s)}}^{b_m^{(s)}} p_s(w)\,dw} \tag{I}$$

and solve for the endpoint b_m^(s).
3. The communication compression method based on model weight distribution in federated learning as set forth in claim 1, characterized in that step iii) "map each element w_i^(s) in the model parameter w^(s) in sequence to obtain the quantized lossy model parameter Q(w^(s))" is carried out as follows: utilize the optimal quantization interval endpoint vector b^(s) to divide the distribution interval of the elements w_i^(s) into M segments, and then map each element w_i^(s) falling in an interval to the value at the corresponding position in the quantization output value vector v^(s).
4. The communication compression method based on model weight distribution in federated learning as set forth in claim 1, characterized in that step iv) "adopt the Huffman coding method to encode the lossy model parameter Q(w^(s)) obtained from the quantization into the binary codes of the compressed model parameter w^(s) for transmission" is carried out as follows: a) according to all distinct elements in the model parameter Q(w^(s)) and their frequencies of occurrence, arrange the elements in descending order of probability, start encoding from the two smallest elements, and merge them into a single element, with the smaller one marked as 0 and the larger one marked as 1; b) update all elements as described in Sub-step a, rearrange them, and merge the smallest two elements; c) repeat Sub-steps a and b until the probability of the remaining element reaches 1, to obtain a Huffman tree; and d) starting from the final node with probability 1 in the Huffman tree, for each element there is one and only one path to the leaf node representing that element; read the binary marks along the path in turn to obtain the Huffman code of the element; finally, compress the model parameter vector to be compressed into the binary Huffman code for actual transmission.
Description
DETAILED DESCRIPTION
(4) The present invention is described in detail below in conjunction with the Drawings and Embodiments (but is not limited to the following descriptions).
Example 1
(5) Disclosed is a communication compression method based on model weight distribution in federated learning, which is used for a central communication system comprising K edge nodes and a central server, with each edge node connected to the central server.
(6) The edge node k stores the local data D_k. During the global model training cycle t, the edge node k obtains a new model parameter w_t^k through the stochastic gradient descent method, based on the global model w_t obtained from the central server in this iteration cycle and the local data D_k, where k=1, 2, . . . , K and k is a positive integer. The central server consolidates the updated local models of all edge nodes to obtain a new global model w_{t+1}.
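As an illustrative aid only (not part of the claims), the following Python sketch shows one federated-averaging round for a simple linear model; the function names, the toy data, and the data-size weighting are assumptions chosen for the example:

```python
import numpy as np

def local_update(w, X, y, lr=0.01, epochs=5):
    """Local training at edge node k: a few epochs of gradient descent on the
    local data D_k = (X, y) for a linear model y ~= X @ w."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def federated_average(local_models, data_sizes):
    """Central server: weighted average of the K updated local models, each
    weighted by its local data size |D_k|, yielding the new global model."""
    total = float(sum(data_sizes))
    return sum(w * (n / total) for w, n in zip(local_models, data_sizes))

# One global training cycle t with K = 3 edge nodes and toy data.
rng = np.random.default_rng(0)
w_global = np.zeros(4)
nodes = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]
local_models = [local_update(w_global, X, y) for X, y in nodes]
w_global = federated_average(local_models, [len(y) for _, y in nodes])
```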
(7) The edge nodes communicate only with the central server during the whole federated learning process. The communication compression method is applied both to the process in which the central server broadcasts the global model summarized from all edge nodes to each edge node, and to the process in which each edge node uploads the updated local model obtained from training to the central server. The same parallel operation is performed on each vector parameter [w^(1), . . . , w^(S)] in the model parameter w of the edge node, where w^(s) (s=1, 2, . . . , S) denotes the model parameter s. Taking the vector parameter w^(s) as an example, the communication compression method includes the following steps: (1) Fit the distribution of the elements [w_1^(s), w_2^(s), . . . , w_n^(s)] in the model parameter w^(s) to be compressed, with w_i^(s) (i=1, 2, . . . , n) denoting the element i of the model parameter w^(s), to obtain an approximate probability density function p_s(w) of [w_1^(s), w_2^(s), . . . , w_n^(s)]. For example, an image classification learning task on the CIFAR-10 dataset is carried out with a convolutional neural network model. This network model includes two convolutional layers and three fully connected layers, and each layer has two parameters, w and b. Taking the parameter w of the first fully connected layer as an example, it is a tensor containing 48,000 numbers, denoted as w^(s)=[w_1^(s), w_2^(s), . . . , w_48000^(s)]. (2) Set the number of quantization intervals as M, and apply the Lloyd-Max algorithm to the probability density function p_s(w) to obtain the quantization interval endpoint vector b^(s)=[b_0^(s), . . . , b_M^(s)] that minimizes the mean square quantization error σ_q^(s)² and the quantization output value vector v^(s)=[v_1^(s), . . . , v_M^(s)], where b_m^(s) (m=0, 1, . . . , M) denotes the element m of the quantization interval endpoint vector and v_m^(s) (m=1, 2, . . . , M) denotes the quantization output value m. The disclosed compression solution compresses each parameter [w^(1), . . . , w^(S)] of w in parallel, quantizing each parameter w^(s) into approximate values according to its distribution characteristics. More specifically, the numerical range of the model parameter w^(s) is divided into multiple quantization intervals, and an output value is determined in each interval so that all numbers in that interval are approximated by this output value; each quantization interval thus corresponds to one quantization output value. The quantization interval endpoint vector b^(s)=[b_0^(s), . . . , b_M^(s)] determines the quantization interval endpoints, and the quantization output value vector determines the quantization output value corresponding to each quantization interval.
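A minimal sketch of Step (1), assuming a Gaussian fit via scipy; the distribution family and the helper name fit_weight_distribution are illustrative choices, not prescribed by the disclosure:

```python
import numpy as np
from scipy import stats

def fit_weight_distribution(w_s):
    """Fit an approximate probability density p_s(w) to the elements of one
    model parameter tensor w^(s). A Gaussian family is an illustrative
    assumption; any parametric family matching the empirical histogram works."""
    mu, sigma = stats.norm.fit(np.ravel(w_s))
    return stats.norm(loc=mu, scale=sigma)

# Example: 48,000 weights of a fully connected layer, roughly bell-shaped.
w_s = np.random.default_rng(1).normal(0.0, 0.05, size=48_000)
p_s = fit_weight_distribution(w_s)
print(p_s.pdf(0.0))   # density of the fitted p_s(w) at w = 0
```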
(8) This Step is carried out as follows: 1) Find the maximum value w_max^(s) and the minimum value w_min^(s) of the elements w_i^(s) in the model parameter w^(s) to be quantized; initialize b_0^(s)=w_min^(s) and b_M^(s)=w_max^(s); and stochastically assign an initial value to the first quantization output value v_1^(s) such that w_min^(s)<v_1^(s)<w_max^(s). 2) Let the index m=1. 3) Substitute v_m^(s) and b_{m−1}^(s) into Equation (I):

(9) $$v_m^{(s)} = \frac{\int_{b_{m-1}^{(s)}}^{b_m^{(s)}} w\,p_s(w)\,dw}{\int_{b_{m-1}^{(s)}}^{b_m^{(s)}} p_s(w)\,dw} \tag{I}$$

and solve for the endpoint b_m^(s), so that v_m^(s) is the centroid of the interval [b_{m−1}^(s), b_m^(s)] under p_s(w).

(10) 4) Compute the next quantization output value from Equation (II): $$v_{m+1}^{(s)} = 2\,b_m^{(s)} - v_m^{(s)} \tag{II}$$ 5) Let m=m+1 and repeat Steps 3) and 4) until m=M.

(11) 6) Check whether the deviation between v_M^(s) and the centroid of the final interval [b_{M−1}^(s), b_M^(s)] is less than the preset threshold value; if not, adjust the initial value v_1^(s) and repeat Steps 2) to 5); otherwise stop, to obtain the required quantization interval endpoint vector b^(s)=[b_0^(s), . . . , b_M^(s)] and quantization output value vector v^(s)=[v_1^(s), . . . , v_M^(s)].
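The following Python sketch designs a Lloyd-Max quantizer consistent with the above; it uses the classical alternating form of the algorithm (centroid and midpoint conditions), which converges to the same fixed point as the marching scheme of Steps 1) to 6) because the midpoint condition is Equation (II) rearranged. The iteration count is an illustrative assumption:

```python
import numpy as np
from scipy import stats, integrate

def lloyd_max(pdf, w_min, w_max, M, iters=200):
    """Design an M-level minimum-mean-square-error scalar quantizer for a
    density pdf(w) supported on [w_min, w_max]."""
    b = np.linspace(w_min, w_max, M + 1)          # endpoints b_0 .. b_M
    v = 0.5 * (b[:-1] + b[1:])                    # output values v_1 .. v_M
    for _ in range(iters):
        # Centroid condition (Equation (I)): v_m is the conditional mean
        # of w on [b_{m-1}, b_m] under pdf.
        for m in range(M):
            den = integrate.quad(pdf, b[m], b[m + 1])[0]
            num = integrate.quad(lambda w: w * pdf(w), b[m], b[m + 1])[0]
            if den > 0:
                v[m] = num / den
        # Nearest-neighbor condition: each interior endpoint is the midpoint
        # of adjacent output values (equivalent to Equation (II)).
        b[1:-1] = 0.5 * (v[:-1] + v[1:])
    return b, v

# Example: 8-level quantizer matched to a zero-mean Gaussian weight density.
p = stats.norm(0.0, 0.05)
b, v = lloyd_max(p.pdf, -0.2, 0.2, M=8)
```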
(12) (3) Map each element w_i^(s) in the model parameter w^(s) to obtain the quantized lossy model parameter Q(w^(s)). After Step (2), all elements in the original model parameter w^(s) are replaced with the corresponding approximate values, and the new model parameter is denoted as Q(w^(s)), where Q(⋅) represents the mapping of each element of the model parameter according to Step (2). The resulting lossy model parameter Q(w^(s)) loses some of the information contained in the original model parameter w^(s); that is, this operation is lossy.
(13) Properly construct b^(s)=[b_0^(s), . . . , b_M^(s)] and v^(s)=[v_1^(s), . . . , v_M^(s)] according to the distribution p_s(w) of the elements w_i^(s) in the model parameter w^(s), to determine a mechanism for approximating all numbers in w^(s). This mechanism is then used to map w^(s) to Q(w^(s)), accomplishing the quantization. The compression process consists of this quantization procedure and the following coding procedure; the quantization leads to a loss of accuracy, while the coding does not.
(14) This Step is carried out as follows:
(15) Utilize the optimal quantization interval endpoint vector b^(s) to divide the distribution interval of the elements w_i^(s) into M segments, and then map each element w_i^(s) falling in an interval to the value at the corresponding position in the quantization output value vector v^(s). b^(s) and v^(s) jointly determine the mode for mapping each element w_i^(s) in the model parameter w^(s); Q(⋅) denotes the mapping of each element of the model parameter w^(s) according to this mode, and Q(w^(s)) denotes the resulting model parameter. Since Q(w^(s)) is obtained by mapping all its elements, every element of Q(w^(s)) is a value in v^(s).
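A minimal sketch of this mapping step, reusing the endpoint vector b, output vector v, and sample weights w_s from the sketches above:

```python
import numpy as np

def quantize(w_s, b, v):
    """Map each element w_i^(s) to the output value of the interval it falls
    in, producing the lossy parameter Q(w^(s)). b has M+1 endpoints, v has M
    output values; np.digitize returns the interval index of each element."""
    idx = np.clip(np.digitize(w_s, b[1:-1]), 0, len(v) - 1)
    return v[idx]

q_ws = quantize(w_s, b, v)   # every element of q_ws is one of the M values in v
```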
(16) (4) Adopt the Huffman coding method to encode the lossy model parameter Q(w^(s)) obtained from the quantization into the binary codes of the compressed model parameter w^(s) for transmission.
(17) This Step is carried out as follows: a. According to all distinct elements in the model parameter Q(w^(s)) and their frequencies of occurrence (for example, for Q(w^(s))=(1, 1, 2, 2, 2, 3), the frequencies of occurrence of the elements "1", "2" and "3" are 1/3, 1/2 and 1/6 respectively), arrange the elements in descending order of probability, start encoding from the two smallest elements, and merge them into a single element, with the smaller one marked as 0 and the larger one marked as 1. b. Update all elements as described in Sub-step a, rearrange them, and merge the smallest two elements; here "update" means that, after the smallest two of the T elements are merged into one, the resulting T−1 elements are rearranged in descending order of probability. c. Repeat Sub-steps a and b until the probability of the remaining element reaches 1, to obtain a Huffman tree. d. Starting from the final node with probability 1 in the Huffman tree, for each element there is one and only one path to the leaf node representing that element; read the binary marks along the path in turn to obtain the Huffman code of the element. Finally, compress the model parameter vector to be compressed into the binary Huffman code for actual transmission. The Huffman tree is a tree structure; once it is determined, the path from the final node to each initial node is unique. A leaf node is an initial node, that is, one of the distinct elements of the quantized model parameter Q(w^(s)) together with its occurrence probability.
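A compact Python sketch of Sub-steps a to d, using a heap to merge the two least probable elements repeatedly; the tie-breaking index is an implementation detail, not part of the disclosure:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code book from symbol frequencies: repeatedly merge
    the two least probable elements, marking the smaller branch 0 and the
    larger branch 1."""
    counts = Counter(symbols)
    if len(counts) == 1:                      # degenerate single-symbol case
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: partial code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)       # smaller element -> prefix "0"
        f1, _, c1 = heapq.heappop(heap)       # larger element  -> prefix "1"
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, i, merged))
        i += 1
    return heap[0][2]

# Example from the description: Q(w^(s)) = (1, 1, 2, 2, 2, 3).
book = huffman_codes([1, 1, 2, 2, 2, 3])
bits = "".join(book[s] for s in [1, 1, 2, 2, 2, 3])   # transmitted bitstream
```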
(18) Compared with existing federated learning compression methods, the method disclosed herein applies the federated averaging idea and can therefore greatly reduce the number of communication rounds, because local training and model updating can be carried out many times before each communication.
(19) The quantization procedure utilizes the Lloyd-Max algorithm, which minimizes the mean square quantization error of the scalar quantization process.
(20) The coding procedure adopts the Huffman coding method and ensures that the binary codes actually transmitted are as short as possible, since Huffman codes have the shortest average length among symbol-by-symbol prefix codes and are constructed entirely according to the occurrence probabilities of the symbols.
(21) The present invention compresses model parameters in model-transmission scenarios, and each local node can use gradient information to iterate its local model many times before each inter-node communication, so that each communication carries more model update information. Because the weights of each model parameter follow certain distribution characteristics, a compression method based on the distribution of model weights can reduce communication costs while preserving accuracy to the greatest extent.
(22) In the Example, each local node trains a CNN model on the image dataset CIFAR-10. First, each edge node uses its local data to train optimal local model parameters, which are then aggregated at the central server for weighted averaging. After the updated global model parameters are broadcast by the server to each local node, a global iteration cycle ends. In this cycle, both the aggregation and broadcasting processes use the disclosed communication compression method.
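Tying the earlier sketches together, one compressed transmission of a single parameter could look as follows; this composition, and the omission of codebook/side-information transfer, are assumptions of the illustration:

```python
import numpy as np

def compress_parameter(w_s, M=8):
    """End-to-end sketch for one parameter w^(s): fit p_s(w), design the
    Lloyd-Max quantizer, quantize, then Huffman-encode. Reuses
    fit_weight_distribution, lloyd_max, quantize, and huffman_codes from the
    sketches above; codebook/side-information transfer is omitted."""
    p_s = fit_weight_distribution(w_s)
    b, v = lloyd_max(p_s.pdf, w_s.min(), w_s.max(), M)
    q_ws = quantize(w_s, b, v)
    book = huffman_codes(q_ws.tolist())
    bits = "".join(book[x] for x in q_ws.tolist())
    return bits, book, v

bits, book, v = compress_parameter(w_s)      # w_s from the fitting sketch
print(32 * w_s.size / len(bits))             # compression ratio vs. 32-bit floats
```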
(23) Traditional methods include the baseline algorithm based on federated stochastic gradient descent (exchanging gradients), the Top-k algorithm, the QSGD algorithm, and the baseline algorithm based on federated averaging (exchanging models). The baseline algorithms refer to the uncompressed algorithms. The idea of Top-k is to keep, before each communication, only the k gradient components with the largest absolute values and set all other entries to 0, as sketched below. [S. U. Stich, J.-B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," in Proc. NeurIPS 2018, Montreal, QC, Canada, December 2018, pp. 4447-4458.]
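A hedged sketch of the Top-k baseline as described (the helper name is illustrative):

```python
import numpy as np

def top_k_sparsify(g, k):
    """Top-k baseline: keep the k gradient entries with the largest absolute
    value and set all other entries to 0 before communication."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out
```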
(24) QSGD first divides the parameter range into multiple intervals, stochastically maps each element of the parameter to one of the two adjacent interval endpoints so that the quantization is unbiased, and then uses the Elias coding method to reduce the number of bits required, taking advantage of the low frequency of large numbers. [D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Proc. NIPS 2017, Long Beach, CA, USA, December 2017, pp. 1710-1721.]
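A hedged sketch of QSGD-style stochastic quantization with s levels (the Elias coding stage is omitted; normalization by the l2 norm follows the cited paper):

```python
import numpy as np

def qsgd_quantize(g, s, rng=None):
    """QSGD-style stochastic quantization: scale by ||g||_2, round each
    magnitude stochastically to one of the two nearest of the s+1 uniform
    levels (keeping the quantizer unbiased), then rescale."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    scaled = np.abs(g) / norm * s          # position in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower               # round up with this probability
    levels = lower + (rng.random(g.shape) < prob_up)
    return np.sign(g) * levels * norm / s
```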