METHODS FOR RELIABLE OVER-THE-AIR COMPUTATION AND FEDERATED EDGE LEARNING
20220391696 · 2022-12-08
Inventors
- Alphan Sahin (Columbia, SC)
- BRYSON EVERETTE (COLUMBIA, SC, US)
- SAFI SHAMS MUHTASIMUL HOQUE (WEST COLUMBIA, SC, US)
Cpc classification
H04L5/0069
ELECTRICITY
International classification
Abstract
The disclosure deals with system and method for an over-the-air computation (AirComp) scheme for federated edge learning (FEEL) without channel state information (CSI) at the edge devices (EDs) or edge server (ES). The disclosure adopts the majority vote (MV) principle and defines multiple subcarriers and orthogonal frequency division multiplexing (OFDM) symbols for voting options, which reduces to frequency-shift keying (FSK) over OFDM subcarriers as a special case. Thus, FSK-based over-the-air computation is provided for federated edge learning without channel state information. Since the votes from EDs are separated on orthogonal resources, the proposed scheme eliminates the need for truncated-channel inversion (TCI) at the EDs and allows the ES to detect MV with a non-coherent detector. We also mitigate the peak-to-mean envelope power ratio (PMEPR) of the synthesized signals by using randomization symbols. Simulations show the proposed scheme provides high test accuracy in fading channels for both independent and identically distributed (IID) and non-IID data while resulting in OFDM symbols with lower PMEPRs as compared to one-bit broadband digital aggregation (OBDA) with quadrature amplitude modulation (QAM).
Claims
1. An over-the-air computation (AirComp) methodology for federated edge learning (FEEL) without using channel state information (CSI) at a plurality of edge devices (EDs) or at an edge server (ES), comprising: a distributed machine-learning model to be trained with the update vectors received at an edge server (ES) as transmitted from a plurality of edge devices (EDs); one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: transmitting local update vectors as weighted votes over selected multiple orthogonal subcarriers grouped based on the sign of the elements of the update vector from each respective of the plurality of edge devices (EDs) via a wireless multiple access channel, receiving the superposed local updates at the ES, determining the majority vote (MV) for each element of the update vector at the ES with an energy detector over orthogonal time and frequency resources, and inputting the MVs into the machine-learning model to be updated.
2. An over-the-air computation (AirComp) methodology according to claim 1, wherein the votes comprise orthogonal frequency division multiplexing (OFDM) symbols over multiple OFDM subcarriers, and aggregating operations use one-bit broadband digital aggregation (OBDA) and frequency-shift keying (FSK)-based methodology.
3. An over-the-air computation (AirComp) methodology according to claim 2, further comprising operations using randomization symbols on active subcarriers to reduce peak-to-mean envelope power ratio (PMEPR).
4. An over-the-air computation (AirComp) methodology according to claim 1, wherein the receiving operations include the ES detecting MV with a non-coherent detector.
5. An over-the-air computation (AirComp) methodology according to claim 1, wherein the machine learning model comprises artificial intelligence technology over wireless or sensor networks, 5G or higher, 6G wireless standardization, or IEEE 802.11 Wi-Fi.
6. An over-the-air computation (AirComp) methodology according to claim 1, wherein the transmitting local updates operation includes use of gradient averaging.
7. An over-the-air computation (AirComp) methodology according to claim 6, wherein the local gradient estimate g.sub.k.sup.(n) for the kth ED at the nth communication round between at least one ED and the ES comprises: ,
) is the sample loss function that measures the labelling error for (
,
) for the training parameters w.
8. An over-the-air computation (AirComp) methodology according to claim 7, further comprising global gradient operations that the ES determines and distributes a global gradient estimate to the EDs and the current machine-learning model is updated based on a common update rule, and the global gradient operations are repeated consecutively until a predetermined convergence criterion is achieved.
9. An over-the-air computation (AirComp) methodology according to claim 1, wherein the transmitting local updates operation includes use of signs of local gradients by the respective EDs with using a general weight function to increase the probability of the detecting the correct MV.
10. An over-the-air computation (AirComp) methodology according to claim 1, further comprising operations, after a signal passes from each ED through their own multipath channels, the ES observes the superposed symbols on the same subcarrier indices.
11. An over-the-air computation (AirComp) methodology according to claim 10, further comprising detector operations at the ES that the detector compares the energies on two adjacent subcarriers to determine the gradient vector.
12. An over-the-air computation (AirComp) methodology according to claim 1, wherein the machine-learning model is training to learn the task of handwritten-digit recognition.
13. An over-the-air computation (AirComp) methodology according to claim 12, wherein the machine-learning model comprises a convolution neural network with multiple convolutional layers, with each convolutional layer followed by a batch normalization layer and rectified-linear unit (ReLU) activation following each of them.
14. An over-the-air computation (AirComp) methodology according to claim 13, wherein the multiple convolutional layers each have a plurality of filters, and a fully connected layer with plural units and a softmax layer are used after one of the ReLU.
15. An over-the-air computation (AirComp) system for federated edge learning (FEEL) without using channel state information (CSI) at a plurality of edge devices (EDs) or at an edge server (ES), comprising: a machine-learning model training to process data received at an edge server (ES) as transmitted from a plurality of edge devices (EDs); one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the one or more processors to perform operations, the operations comprising: transmitting local updates as votes over selected multiple subcarriers from each respective of the plurality of edge devices (EDs) via a wireless multiple access channel, receiving the local updates at the ES, aggregating the local updates at the ES including separating votes from the EDs using orthogonal resources and majority vote (MV) principle, and inputting the obtained data into the machine-learning model as training data or data to process.
16. An over-the-air computation (AirComp) system according to claim 15, wherein the votes comprise orthogonal frequency division multiplexing (OFDM) symbols over multiple OFDM subcarriers, and aggregating operations use one-bit broadband digital aggregation (OBDA) and frequency-shift keying (FSK)-based methodology.
17. An over-the-air computation (AirComp) system according to claim 15, wherein the receiving operations include the ES detecting MV with a non-coherent detector.
18. An over-the-air computation (AirComp) system according to claim 15, wherein the transmitting local updates operation includes use of either gradient averaging or use of signs of local gradients by the respective EDs.
19. An over-the-air computation (AirComp) system according to claim 18, further comprising global gradient operations comprising that the ES determines and distributes a global gradient estimate to the EDs and the current machine-learning model is updated based on a common update rule, and the global gradient operations are repeated consecutively until a predetermined convergence criterion is achieved.
20. An over-the-air computation (AirComp) system according to claim 15, wherein the machine-learning model comprises a convolution neural network with multiple convolutional layers, with each convolutional layer followed by a batch normalization layer and rectified-linear unit (ReLU) activation following each of them.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0025] A full and enabling disclosure of the present subject matter, including the best mode thereof to one of ordinary skill in the art, is set forth more particularly in the remainder of the specification, including reference to the accompanying figures in which:
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046]
[0047]
[0048] Table 1 correlates Layers and Learnables for a Neural Network at the EDs;
[0049]
[0050]
[0051]
[0052]
[0053] Repeat use of reference characters in the present specification and drawings is intended to represent the same or analogous features, elements, or steps of the presently disclosed subject matter.
DETAILED DESCRIPTION OF THE PRESENTLY DISCLOSED SUBJECT MATTER
[0054] Reference will now be made in detail to various embodiments of the disclosed subject matter, one or more examples of which are set forth below. Each embodiment is provided by way of explanation of the subject matter, not limitation thereof. In fact, it will be apparent to those skilled in the art that various modifications and variations may be made in the present disclosure without departing from the scope or spirit of the subject matter. For instance, features illustrated or described as part of one embodiment, may be used in another embodiment to yield a still further embodiment. Thus, it is intended that the presently disclosed subject matter covers such modifications and variations as come within the scope of the appended claims and their equivalents.
[0055] In general, the present disclosure is directed to a system in which we consider an OFDM-based FEEL system with K users. Prior to the training, the initial values of the model parameters, denoted by w∈q, and its structure are distributed to the EDs from an ES to set up a common learning model at the EDs, where q is the model size. We denote the local dataset containing labeled data samples at the kth ED as |{(
,
)}∈D.sub.k| for k=1, . . . , K, where
and
are
th data sample and their associated label, respectively. The main goal of the FEEL system is to obtain the trained model parameters without uploading the local data to the ES.
A. Learning Model
[0056] The local loss function of the model with the parameters w at the kth ED can be calculated as:
where ƒ(w,,
) is the sample loss function that measures the labelling error for (
,
) for the parameters w.
[0057] Assuming identical local dataset sizes, i.e., |D.sub.k|=D for k=1, . . . , K, the global loss function can be measured as:
In this disclosure, we focus on a FEEL system based on gradient averaging.sup.[7]. For each communication round n of FEEL, the kth ED calculates an estimate of the global gradient of the loss function in Eq. (2) by using its local dataset Dk and the parameter vector w.sup.(n). Assuming that all data samples in D.sub.k are used for gradient estimation, the local gradient estimate for the kth ED at the nth communication round, denoted by g.sub.k.sup.(n) can be expressed as:
where ∇ represents the gradient operator.
[0058] Assuming that the local gradient estimates are reliably received at the ES, the ES can obtain the global estimate of the gradient of the loss function in Eq. (2) as:
[0059] Subsequently, the ES distributes the global gradient estimate ĝ.sup.(n) to the EDs and the current model is updated based on a common update rule, e.g., gradient descent given by w.sup.(n+1)=w.sup.(n)−ηĝ.sup.(n) where η is the learning rate and w.sup.(1)=w. This process is repeated consecutively until a predetermined convergence criterion is achieved.
[0060] In this disclosure, we adopted sin SGD [8] for FEEL. Instead of the actual values of local gradients, the EDs transmitted the signs of their local gradients, i.e., {tilde over (g)}.sub.k.sup.(n) for k=1, . . . , K, to the ES where the ith element of is {tilde over (g)}.sub.k,i.sup.(n)sin(g.sub.k,i.sup.(n)). Then the estimate of the global gradient for the ith parameter can be calculated by using the MV principle as given by:
The ES then transmitted v.sup.(n)=(v.sub.0.sup.(n), . . . , v.sub.q-1.sup.(n)) to the EDs and the models at the EDs are updated, e.g., w.sup.(n+1)=w.sup.(n)−ηv.sup.(n).
B. Signal Model
[0061] In this disclosure, we assume that the EDs access the wireless channel on the same time-frequency resources simultaneously for AirComp with S OFDM symbols consisting of M active subcarriers. We assume the transmissions from the EDs are synchronized in both time and frequency and arrive at the ES within the CP duration. We also assume that the CP duration is larger than the maximum-excess delays of the channels between the ES and the EDs. The superposed symbol on the l subcarrier of the mth OFDM symbol at the ES can then be written as:
where h.sub.k,l∈ is the channel coefficient between ES and the kth ED on the l subcarrier and
[|h.sub.k,l|.sup.2]=1, t.sub.k,l,m∈
is the transmitted symbol from the kth ED on the l subcarrier of the mth OFDM symbol, and n.sub.l is the zero mean additive white Gaussian noise (AWGN) with the variance σ.sub.n.sup.2 on the l subcarrier for l∈{0, 1, . . . , M−1} and m∈{0, 1, . . . , S−1}.
[0062] Let x(t)∈ be the baseband OFDM symbol in continuous time for t∈[0, T.sub.s), where Ts is the OFDM symbol duration. We defined the PMEPR of an OFDM symbol as max.sub.t∈[0,T.sub.
.sub.t[|x(t)|.sup.2] is the mean envelope power. For AirComp schemes, P.sub.tx changes based on the gradient information. In this disclosure, for a fair comparison, we calculate P.sub.tx when all subcarriers are actively utilized, i.e., P.sub.tx=M/N, where N is the inverse DFT (IDFT) size.
FSK-Based Majority Vote
A. Transmitter
[0063] Let ƒ be a bijective function that maps i∈{0, 1, . . . , q−1} to the distinct pairs (m.sub.0, l.sub.0) and (m.sub.1, l.sub.1) for m.sub.0, m.sub.1∈{0, 1, . . . , S−1}) and l.sub.0, l.sub.1∈{0, 1, . . . , M−1}. Based on {tilde over (g)}.sub.k,i.sup.(n), at the nth communication round, we propose to calculate the symbol t.sub.k,l.sub.
respectively, where E.sub.s=2 is the normalized symbol energy and S.sub.k,i is a randomization symbols for k∈{1, . . . , K}.
[0064] Therefore, the proposed scheme separates the options for voting over two different resources identified in time and frequency. In this disclosure, we chose S.sub.k,i based on a random quadrature phase shift keying (QPSK) symbol to reduce PMEPR by decreasing the correlation in the frequency domain.sup.[13].
[0065] In one implementation, when {tilde over (g)}.sub.k,i=1, the symbols t.sub.k,l.sub.
[0066] In one implementation, the symbols t.sub.k,l.sub.
where g.sub.k,i is the local stochastic gradient and ω(g.sub.k,i) is a weighting function. The weighting function may be an even-symmetric function that ranges from 0 to 1 in order to limit the power of the transmitted OFDM symbols. The main motivation for using a weight function is that it can lower the error probability of detecting the incorrect majority vote as compared to the sign operation. It may also increase the convergence rate in the case of heterogenous data distribution scenarios. Examples of the smooth, non-decreasing weight function for negative or positive g.sub.k,i are as follows:
where h, t, ρ are some non-negative coefficients. All of these examples ensures that gradual power increases if the magnitude of the gradient local gradient is large. Therefore, if an ED has a smaller absolute local gradient, its impact on the MV becomes smaller. Similarly, if an ED has a large absolute local gradient, its impact on the MV becomes larger. Hence, the convergence speed may improve.
[0067] In one implementation, ω(g.sub.k,i)=1 may be chosen to achieve a design based on signs as described. In one implementation, the parameters of the weight function may be tuned through the communications round. For example, the tuning may be based on maximum values of the absolute local gradients or update vectors or the communication round index.
[0068] The functionality of f can be divided into two different mappers, i.e., gradient mapper (GM) and resource mapper (RM). While GM shuffles the quantized gradients, RM identifies how the options for voting are distributed to the time and frequency resources. As a special case of RM, if m.sub.1=m.sub.0 and l.sub.1=l.sub.0+1 for all i, the adjacent subcarriers of moth OFDM symbol are used for voting, i.e., FSK over OFDM subcarriers. In this case, the weight of the kth ED's vote in the MV for the ith gradient is independent from its vote since these subcarriers are likely to experience similar channel conditions in practice, i.e., h.sub.k,l.sub.
[0069] Gradient mapper and resource mapper may be utilized with an interleaver or an encryption function to increase the security of the proposed scheme. For example, gradient mapper or resource mapper may map the votes to different subcarriers for each communication round based on an encryption operation. Hence, an eavesdropper cannot recover the order of the gradients by simply capturing the transmission.
[0070] In one implementation, the symbols t.sub.k,l.sub.
B. Receiver
[0071] At the ES, the pairs (m.sub.0, l.sub.0) and (m.sub.1, l.sub.1) are first calculated by using the mapping function ƒ for a given i. Assuming independent multipath channels between the ES and the EDs, it can be shown that:
where K.sub.0 and K.sub.1 are the number of EDs that vote for 1 and −1 for the ith gradient, respectively.
[0072] Therefore, the energies on the superposed symbols r.sub.l.sub.
where t is the maximum distance between |r.sub.l.sub.
[0073] In
C. Trade-offs and Comparisons
[0074] As prior literature approaches are opposed.sup.[6], [7], the proposed scheme does not need channel inversions at the EDs. From this aspect, it is compatible with time-varying channels (e.g., mobile networks.sup.[14]) and does not lose gradient information due to TCI. On the other hand, it quadruples the number of time-frequency resources for AirComp as compared to OBDA-QAM.sup.[7]; however, OBDA-QAM is not investigated in terms of PMEPR in the literature. As shown in, OBDA-QAM can suffer from high PMEPR, while the proposed scheme reduces PMEPR with a simple randomization technique that also leads to better accuracy results for non-IID data. As compared to approaches indicated in prior literature.sup.[11], [12], the proposed scheme also does not require CSI at the ES or multiple antennas.
Numerical Results
[0075] For the numerical results, we considered the learning task of handwritten digit recognition with a FEEL system and compared the proposed scheme with BAA.sup.[6] for gradient averaging and OBDA-QAM.sup.[7]. We used the MNIST dataset that contains 60,000 labelled handwritten digit images sized 28×28, from 0-9. From the IID dataset, we randomly partition 20,000 training images into equal shares to K∈{10, 50} EDs. For the non-IID data set, we chose 5 digits for each ED and selected the images randomly, i.e., different dataset can contain the same image. For a fair comparison, we used the same data randomization for different AirComp schemes.
[0076] For the model, we considered a convolution neural network (CNN) that includes one 5×5 and two 3×3 convolutional layers, where each of them is followed by a batch normalization layer and rectified-linear unit (ReLU) activation following each of them. All convolutional layers have 20 filters. After the third ReLU, a fully connected layer with 10 units and a softmax layer were utilized. At the input layer, no normalization was applied. Our model has q=123090 learnable parameters, which corresponds to S=206, S=103, and S=52 OFDM symbols for the OBDA-FSK, BAA, and OBDA-QAM for M=1200, respectively. The subcarrier spacing was set to 15 kHz, the TCI (the truncation threshold) was 0.2, and the threshold t was set to 0.01 for the proposed scheme.
[0077] To test the FEEL, we considered two different uplink signal-to-noise ratios (SNRs), i.e., 0 dB and 20 dB.
[0078] For the fading channel, we considered ITU Extended Pedestrian A (EPA) with no mobility and then regenerated the channels between the ES and the EDs to capture the long-term channel variations for each communication round. For TCI, we assumed that CSI was available at the EDs. For the update rule, we considered stochastic gradient descent with momentum, where the momentum is 0:9. The initial learning rate was 0:01 and the learning rate decayed with a rate of 0:05 for every communication round.
[0079] In
[0080] [|r.sub.l.sub.
[|r.sub.l.sub.
[0081] In
Concluding Remarks
[0082] In this disclosure, we proposed an AirComp scheme for FEEL. The proposed scheme relies on MV and forms the options for voting on different subcarriers and/or OFDM symbols, and thus, it allows the receiver to detect MV with a non-coherent detector and eliminates the need for TCI at the EDs as it is compatible with time-varying channels. Further, it can be used along with randomization methods in the frequency domain to reduce the PMEPR. Through simulations, we demonstrated that the proposed method provides a high-test accuracy in fading channel for both IID and non-IID data, which results in an acceptable PMEPR distribution at the expense of a larger number of time and frequency resources.
[0083] The proposed method can be improved in various ways. For example, to lower PMEPR further, the randomization symbols can be designed based on the gradients. The precoded-OFDM (e.g., discrete Fourier transform (DFT)-spread OFDM) or various mapping strategies can also be explored to improve the proposed method. In this disclosure, we focused on one-bit quantitation. Extending the proposed concept to different quantization levels is another interesting research direction that can be pursued. The system-level analysis of the proposed method with heterogeneous data is also another direction that can be investigated.
ADDITIONAL DISCLOSURE
[0084] Federated edge learning (FEEL) is an implementation of federated learning (FL) in a wireless network to train a model without moving the local data generated at the edge devices (EDs) to an edge server (ES).sup.[001], [002]. With FEEL, a large number of model parameters (or gradients) needs to be communicated between many EDs and the ES through wireless channels. However, typical user multiplexing methods such as orthogonal frequency division multiple access (OFDMA) can be inefficient to address the spectrum congestion due to a large number of EDs.sup.[003]. To address this issue, one of the promising solutions is to perform the calculations needed for FEEL, e.g., averaging, with an over-the-air computation (AirComp) method that harnesses the signal-superposition property of the wireless-multiple access channel.sup.[004]-[006]. However, developing an AirComp scheme is not a trivial task due to the multipath channel, power misalignment, and time-synchronization errors in practice. Also, the channel state information (CSI) needs to be available at the EDs or the ES with state-of-the-art solutions. In this study, we propose an AirComp scheme to address these issues.
[0085] In the literature, various AirComp schemes are proposed for FEEL. In .sup.[007], analog modulation over orthogonal frequency division multiplexing (OFDM) is investigated for broadband analog aggregation (BAA). Particularly, it is proposed to modulate the OFDM subcarriers with the model parameters at the EDs. To overcome the impact of the multipath channel on the transmitted signals, the symbols on the OFDM subcarriers are multiplied with the inverse of the channel coefficients and the subcarriers that fade are excluded from the transmissions, which is known as truncated-channel inversion (TCI) in the literature. In .sup.[008], an additional time-varying precoder is applied along with TCI to facilitate the aggregation. In .sup.[009], it is proposed to sparsify the gradient estimates and project the resultant sparse vector into a low-dimensional vector to reduce the bandwidth. The compressed data is transmitted with BAA. In .sup.[010], one-bit broadband digital aggregation (OBDA) is proposed to facilitate the implementation of FEEL for a practical wireless system. In this method, considering distributed training by majority vote (MV) with the sign stochastic gradient descend (signSGD).sup.[011], the EDs transmit quadrature phase-shift keying (QPSK) symbols over OFDM subcarriers along with TCI, where the real and imaginary parts of the QPSK symbols are formed by using the signs of the stochastic gradients, i.e., votes. At the ES, the signs of the real and imaginary components of the superposed received symbols on each subcarrier are calculated to obtain the MV for the sign of each gradient. However, the EDs still need the CSI for TCI as in BAA for AirComp. In .sup.[012] and .sup.[013], blind EDs are considered. However, it is assumed that the CSI for each ED is available at the ES. The impact of the channel on AirComp is mitigated through beamforming with a large number of antennas.
[0086] In this study, we investigate an AirComp method based on non-coherent detection to achieve FEEL without using CSI at the EDs and the ES. Inspired by the MV with signSGD.sup.[011], we use orthogonal resources, i.e., multiple subcarriers and/or OFDM symbols, to transmit the signs of local stochastic gradients. Hence, the votes from different EDs accumulate on the orthogonal resources non-coherently in fading channel with the proposed scheme. The ES then obtains the MV with an energy detector. Considering the randomness in the detected MVs due to the fading channel, path loss, and power control in the cell, we prove the convergence of learning in the presence of the proposed scheme for a non-convex loss function. We demonstrate that the proposed approach is robust against time-synchronization errors and power misalignment at the ES. We also show that it can be used with well-known peak-to-mean envelope power ratio (PMEPR) reduction techniques as it does not utilize the amplitude and the phase to encode the sign of local stochastic gradients. Finally, we evaluate the scheme by considering independent and identically distributed (IID) data and non-IID data where the data distribution is a function of the locations of EDs.
[0087] Notation: The complex and real numbers are denoted by and
, respectively.
[⋅] is the expectation of its argument.
[⋅] is the indicator function and
[⋅] is the probability of its argument. The sign function is denoted by sign(⋅) and results in 1, −1, or +1 at random for a positive, a negative, or a zero-valued argument, respectively.
System Model
A. Scenario
[0088] Consider a wireless network with K EDs that are connected to an ES, where each ED and the ES are equipped with single antennas. We assume that the frequency synchronization in the network is done before the transmissions with a control mechanism as done in 3GPP Fourth Generation (4G) Long Term Evolution (LTE) and/or Fifth Generation (5G) New Radio (NR) with random-access channel (RACH) and/or physical uplink control channel (PUCCH).sup.[014]. In this study, we consider the fact that the time synchronization among the EDs is not ideal, and the maximum difference between the time of arrivals of the EDs signals at the ES location is T.sub.sync seconds and it is equal to the reciprocal to the signal bandwidth.
[0089] In this study, the power alignment at the ES can be imperfect and the level of misalignment is controlled with a power control mechanism. We assume that the signal-to-noise ratio (SNR) of an ED at the ES is 1/σ.sub.n.sup.2 the reference distance R.sub.ref. We then set the received signal power of the kth ED at the ES as
where r.sub.k is the link distance between the kth ED and the ES, α is the path loss exponent, and β∈[0,α] is a coefficient that determines the amount of the path loss compensated. While β=0 means that there is no power control in the network, β=α leads to a system with perfect power alignment at the ES. We define the effective path loss exponent α.sub.eff as α.sub.effα−β.
[0090] In this study, we assume that the EDs are deployed in a cell, where the cell radius is R.sub.max meters and the minimum distance between the ES and the EDs is R.sub.min meters for R.sub.min≥R.sub.ref. It is worth emphasizing that we do not consider the impact of multiple cells (e.g., inter-cell interference) or a more complicated large-scale channel model (e.g., shadowing) on learning in this work as our goal is to provide insights into the impact of power misalignment and the path loss on distributed learning with a tractable analysis.
B. Signal Model
[0091] In this study, for AirComp, the EDs access the wireless channel on the same time-frequency resources simultaneously with S OFDM symbols consisting of M active subcarriers. We assume that the cyclic prefix (CP) duration is larger than T.sub.sync and the maximum-excess delays of the channel between the ES and the EDs. Considering independent frequency-selective channels between the EDs and the ES, the superposed symbol on the lth subcarrier of the mth OFDM symbol at the ES for the nth communication round of FEEL can be written as
where h.sub.k,l,m.sup.(n)∈ is the channel coefficient between the ES and the kth ED, t.sub.k,l,m.sup.(n)∈
is the transmitted symbol from the kth ED, and n.sub.l,m.sup.(n) is the symmetric additive white Gaussian noise (AWGN) with zero mean and the variance σ.sub.n.sup.2 on the lth subcarrier for l∈{0, 1, . . . , M−1} and m∈{0, 1, . . . , S−1}.
[0092] We consider the fact that the time synchronization at the receiver may not be precise. To model this, we assume that the synchronization point where the discrete Fourier transform (DFT) starts can deviate by N.sub.err samples within the CP window. Note that the uncertainty of the synchronization point within the CP window is often not an issue for traditional communications due to the channel estimation. However, it can cause a non-negligible impact on AirComp.
[0093] Let x(t.sub.time)∈ be a baseband OFDM symbol in continuous time for t.sub.time∈[0, T.sub.s), where T.sub.s is the OFDM symbol duration. We define the PMEPR of an OFDM symbol as
where P.sub.tx=[|x(t.sub.time)|.sup.2] is the mean-envelope power.
C. Learning Model
[0094] Let .sub.k denote the local data containing labeled data samples at the kth ED as {(
,
)}∈
.sub.k for k=1, . . . , K, where
and
are
th data sample and its associated label, respectively. The centralized learning problem can be expressed as
where =
.sub.1∪
.sub.2∪ . . . ∪
.sub.K and ƒ(w, x, y) is the sample loss function that measures the labeling error for (x, y) for the parameters w=[w.sub.1 . . . , w.sub.q].sup.T∈
.sub.q, and q is the number of parameters. With full-batch gradient descend, a local optimum point can be obtained as
w.sup.(n+1)=w.sup.(n)−ηg.sup.(n) (4)
where η is the learning rate and
where ith element of the vector g.sup.(n) is the gradient of F(w.sup.(n)) with respect to w.sub.i.sup.(n).
[0095] In .sup.[011], in the context of parallel processing, distributed training by MV with signSGD is investigated to solve (3). In this method, for the nth communication round, the kth ED.sup.1 first calculates the local stochastic gradient as
where .sub.k⊂
.sub.k is the selected data batch from the local data set and n.sub.b=|
.sub.k| as the batch size. Instead of the actual values of local gradients, the EDs then send the signs of their local stochastic gradients, denoted as {tilde over (g)}.sub.k.sup.(n) for k=1, . . . , K, to the ES, where the ith element of the vector {tilde over (g)}.sub.k.sup.(n) is {tilde over (g)}.sub.k,i.sup.(n)
sign({tilde over (g)}.sub.k,i.sup.(n)). The ES obtains the MV for the ith gradient as
[0096] Subsequently, the ES pushes v.sup.(n)=[v.sub.1.sup.(n), . . . , v.sub.q.sup.(n)].sup.T to the EDs and the models at the EDs are updated as .sup.1We refer to the workers and parameter-server mentioned in [011] as EDs and ES, respectively, to describe distributed training by MV with signSGD.
w.sup.(n+1)=w.sup.(n)−ηv.sup.(n) (8)
[0097] This procedure is repeated consecutively until a predetermined convergence criterion is achieved.
[0098] For FEEL, the optimization problem can also be expressed as (3) in a scenario where the local data samples and their labels are not available at the ES and the link between an ED and the ES experiences independent frequency-selective fading channel. To solve (3) under these constraints, in this study, we adopt the same procedure summarized for the distributed training by the MV. With the motivations of eliminating the latency caused by orthogonal multiple access and enabling distributed training in mobile wireless networks, we propose a simple-but-effective AirComp scheme to detect the MV in fading channel without using CSI at the EDs and the ES.
FSK-Based Majority Vote
A. Edge Device—Transmitter
[0099] With the proposed AirComp scheme, the EDs perform a low-complexity operation to transmit the signs of the gradients given in (6): Let ƒ be a bijective function that maps i∈{1, 2, . . . , q} to the distinct pairs (m.sup.+, l.sup.+) and (m.sup.−, l.sup.−) for m.sup.+, m.sup.−∈{0, 1, . . . , S−1}) and l.sup.+, l.sup.−∈{0, 1, . . . , M−1}. Based on the value of
respectively, where E.sub.s=2 is a factor to normalize the symbol energy and s.sub.k,i.sup.(n) is a randomization symbol on the unit circle. Therefore, to indicate the sign of a local stochastic gradient, our scheme dedicates two subcarriers with (9) and (10), as opposed to modulating the phase of a subcarrier as done in OBDA. Also, we do not use TCI to compensate the impact of multipath channel on transmitted symbols as our goal is to exploit the energy accumulation on two different subcarriers to detect the MV with a non-coherent detector.
[0100] As a special case of ƒ, if m.sup.−=m.sup.+ and l.sup.−=l.sup.++1 hold for all i, the adjacent subcarriers of m.sup.+th OFDM symbol forms the options for a vote, which corresponds to frequency-shift keying (FSK) over OFDM subcarriers. In this case, the kth ED's vote for the ith gradient becomes independent from its choice since the adjacent subcarriers are likely to experience similar channel conditions, i.e., h.sub.i,l.sub.
[0101] After the calculations of t.sub.k,l.sub.
B. Edge Server—Receiver
[0102] The receiver at the ES observes the superposed symbols at all subcarriers as expressed in (2). By using the mapping function ƒ, the superposed symbols for a given i can be shown as
respectively. The receiver at the ES detects the MV for the ith gradient with an energy detector as
v.sub.i.sup.(n)=sign(Δ.sub.i.sup.(n)) (13)
where Δ.sub.i.sup.(n)e.sub.i.sup.+−e.sub.i.sup.− for e.sub.i.sup.+
|r.sub.l.sub.
[0103] The proposed scheme leads to a fundamentally different training strategy since it determines the correct MV in (7) probabilistically by comparing el and el. To elaborate this, assume that the multipath channels between the ES and the EDs are independent. Let K.sub.i.sup.+ and K.sub.i.sup.−=K−K.sub.i.sup.+ be the number of EDs that vote for 1 and −1 for the ith gradient, respectively.
Lemma 1. [e.sub.i.sup.+] and
[e.sub.i.sup.−] can be calculated as
μ.sub.i.sup.+[e.sub.i.sup.+]=E.sub.sK.sub.i.sup.+λ+σ.sub.n.sup.2 (14)
and
μ.sub.i.sup.−[e.sub.i.sup.−]=E.sub.sK.sub.i.sup.−λ+σ.sub.n.sup.2 (15)
respectively, where
[0104] Proof: Since (11) is a weighted summation of independent complex Gaussian random variables with zero mean and unit variance (i.e., channel coefficients), r.sub.l.sub.
[0105] To calculate (17), we need to calculate the expected value of y=r.sup.−α.sup.
[0106] Hence, the distribution of y can obtained as
[0107] By using (19), the expected value of y can be calculated as (16). The same analysis can be done for μ.sub.i.sup.−.
[0108] Based on Lemma 1, (13) is likely to obtain the correct MV because μ.sub.i.sup.+ and μ.sub.i.sup.− are linear functions of and K.sub.i.sup.+ and K.sub.i.sup.−, respectively. However, the detection performance depends on the parameter λ∈[0, 1] that captures the impacts of power control, path loss, and cell size on e.sub.i.sup.+ and e.sub.i.sup.−. In
C. Convergence in Fading Channel
[0109] We consider several standard assumptions made in the literature for the convergence analysis.sup.[10], [11]:
Assumption 1 (Bounded loss function). F(w)≥F*, ∀w.
Assumption 2 (Smoothness). Let g be the gradient of F(w) evaluated at w. For all w and w′, the expression given by
holds for a non-negative constant vector L=[L.sub.1, . . . , L.sub.q].sup.T.
Assumption 3 (Variance bound). The stochastic gradient estimates {{tilde over (g)}.sub.k=[{tilde over (g)}.sub.k,1, . . . , {tilde over (g)}.sub.k,q].sup.T=∇F.sub.k(w.sup.(n))}, ∀k, are independent and unbiased estimates of g=[g.sub.1, . . . , g.sub.q.sup.T=∇F(w) with a coordinate bounded variance, i.e.,
[{tilde over (g)}.sub.k]=g,∀k (20)
[({tilde over (g)}.sub.k,i−g.sub.i).sup.2]≤σ.sub.i.sup.2/n.sub.b,∀k,i (21)
where is a non-negative constant vector.
Assumption 4 (Unimodal, symmetric gradient noise). For any given w, the elements of the vector {tilde over (g)}.sub.k, ∀k, has a unimodal distribution that is also symmetric around its mean.
[0110] We also assume that the parameters e.sub.i.sup.+ and e.sub.i.sup.− are exponential random variables, where their means are μ.sub.i.sup.+ and μ.sub.i.sup.−, respectively. This assumption holds true when the power control is ideal under IID Rayleigh fading. It is a weak assumption under imperfect power control due to the central limit theorem.
[0111] By extending our theorem in .sup.[015] with the considerations of path loss, power control, and cell size, the convergence rate in the presence of FSK-MV can obtained as follows:
[0112] Theorem 1. For n.sub.b=N/γ and η=1/√{square root over (∥L∥.sub.1n.sub.b)}, the convergence rate of the distributed training by the MV based on FSK in fading channel is
where γ is a positive integer,
and λ∈[0, 1] given in (16) is a parameter that captures the parameters related to the path loss, power control, and cell size.
[0113]
[0114] The proof of Theorem 1 is given in the appendix.
[0115] Based on Theorem 1, we can infer the followings: 1) For a larger SNR (i.e., a larger 1/σ.sub.n.sup.2) and a large number of EDs (i.e., a larger K), the convergence rate with FSK-MV in fading channel improves since a decreases. 2) The power control results in a better convergence rate since A increases with a lower α.sub.eff. 3) Another way of improving the convergence rate is to reduce to cell size, yielding a large λ as illustrated in
D. Comparisons
[0116] Robustness against Time-Varying Fading Channel: As opposed to the approaches in .sup.[007] and .sup.[010], the proposed scheme does not utilize the CSI for TCI at the EDs. Hence, it is compatible with time-varying channels (e.g., mobile networks.sup.[016]) and does not lose gradient information due to TCI. As a trade-off, it quadruples the number of time-frequency resources for AirComp as compared to OBDA in .sup.[010]. As compared to the approaches in .sup.[012] and .sup.[013], the proposed scheme also does not require CSI at the ES or multiple antennas.
[0117] 2) Robustness against Time-Synchronization Errors: As demonstrated in Section IV, the proposed scheme provides immunity against the time-synchronization errors. This is because the timing misalignment among the EDs or the uncertainty on the receiver synchronization within the CP window cause phase rotations in the frequency domain and FSK-MV does not encode information on the amplitude or phase. Also, the proposed scheme does not use any channel-related information at the EDs and the ES. Hence, FSK-MV is more robust against time-synchronization errors as compared to OBDA.
[0118] 3) Robustness against Power-Amplifier Non-linearity: The proposed scheme separates the options for voting over two different resources identified in time and frequency. Hence, it allows one to choose s.sub.k,i.sup.(n) based on specific purposes. In this study, we use random QPSK symbol to reduce PMEPR by decreasing the correlation in the frequency domain.sup.[017]. OBDA is not investigated in terms of PMEPR in the literature. As shown in Section IV, OBDA can suffer from high PMEPR, while the proposed scheme reduces PMEPR with a simple randomization technique. Also, FSK-MV does not require a long transmission power constraint as in introduced for OBDA.sup.[010, Eq. 9 and Eq. 10] since the .sub.2-norm of the OFDM symbols do not change as a function of CSI with FSK-MV.
Numerical Results
[0119] For the numerical results, we consider the learning task of handwritten-digit recognition in a single cell with K=50 EDs for R.sub.min=10 meters and R.sub.max=100 meters. We assume that the path loss exponent is α=4. To demonstrate the impact of the imperfect power control on distributed learning, we choose β∈{2, 4} and set the SNR, i.e., 1/σ.sub.n.sup.2, to be 20 dB at R.sub.ref=10 meters. The link distance between the kth ED and the ES is set to r.sub.k=√{square root over (R.sub.min.sup.2+(k−1)(R.sub.max.sup.2−R.sub.min.sup.2)/(K−1))} based on (18). For the fading channel, we consider ITU Extended Pedestrian A (EPA) with no mobility and regenerate the channels between the ES and the EDs independently for each communication round to capture the long-term channel variations. The subcarrier spacing is set to 15 kHz. We use M=1200 subcarriers (i.e., the signal bandwidth is 18 MHz). In the case of imperfect time synchronization, we assume that the difference between time of arriving ED signals is maximum T.sub.sync=55.6 ns and the synchronization uncertainty at the ES is N.sub.err=3 samples. Otherwise, these parameters are set to 0.
[0120] For the local data at the EDs, we use the MNIST database that contains labeled handwritten-digit images size of 28×28 from digit 0 to digit 9.sup.2. We consider both IID data and non-IID data in the cell. To prepare the data, we first choose |D|=25000 training images from the database, where each digit has distinct 2500 images. For the scenario with the IID data, we assume that each ED has 50 distinct images for each digit. For the scenario with the non-IID data, we assume that the distribution of the images depends on the locations of the EDs to test the FEEL in a more challenging scenario. To this end, we divide the cell into 5 areas with concentric circles and the EDs located in uth area have the data samples with the labels {u−1, u, 1+u, 2+u, 3+u, 4+u} for u∈{1, . . . , 5}. Hence, the availability of the labels gradually changes based on the link distance. The areas between two adjacent concentric circles are identical and the number of EDs in each area is 10. The IID and non-IID data distributions are illustrated in
[0121] For the model, we consider a convolution neural network (CNN) that includes one 5×5 and two 3×3 convolutional layers, where each of them is followed by a batch normalization layer and rectified-linear unit (ReLU) activation follow each of them. All convolutional layers have 20 filters. After the third ReLU, a fully connected layer with 10 units and a softmax layer are utilized. At the input layer, no normalization is applied. Our model, outline in Table I, has q=123090 learnable parameters, which corresponds to S=206 and S=52 OFDM symbols for the FSK-MV and OBDA.sup.[10], respectively. For TCI, the truncation threshold is 0.2 and we assume that CSI is available at the EDs. For the update rule, the learning rate is set to 0.01. The batch size n.sub.b is set to 64. For the test accuracy calculations, we use 10000 test samples available in the MNIST database.
[0122] In
[0123] In
[0124] Although the test accuracy with OBDA with TCI (with ideal synchronization) or FSK-MV (with/without ideal synchronization) reaches to 95%,
Concluding Remarks
[0125] In this study, we propose an effective AirComp scheme for FEEL. The proposed scheme relies on the distributed learning by the MV with the signSGD in fading channel. As compared to the state-of-the-art solutions on AirComp, it uses different subcarriers and/or OFDM symbols to indicate the sign of the local stochastic gradients. Thus, it allows the receiver at the ES to detect the MV with a non-coherent detector and eliminates the need for CSI at the EDs by exploiting the non-coherent energy accumulation on the subcarriers. We also prove the convergence of the distributed learning by taking path loss, power control, and cell size into account. Through simulations, we demonstrate that the proposed method can provide a high-test accuracy in fading channel even when the power control and the time synchronization are imperfect while resulting in an acceptable PMEPR distribution at the expense of a larger number of time and frequency resources. We also provide insights into the scenarios where local data distribution depends on the locations of the EDs and demonstrate the impact of non-IID data on the distributed learning when the power control is not ideal. Our results indicate that adaptive learning methods that consider the bias in the MV due to the non-IID data and/or imperfect power control are required for achieving a higher test accuracy.
APPENDIX PROOF OF THEOREM 1
[0126] Proof: The proof of Theorem 1 relies on a well-known strategy of relating the norm of the gradient of the loss function F(w) to the expected improvement made in a single step as described in [11]. Let g.sup.(n) be the gradient of F(w.sup.(n)) (i.e., the true gradient). By using Assumption 2 and using (13), we can state that
[0127] The main challenge is to obtain an upper bound on the stochasticity-induced error. To address this, assume that sign(g.sub.i.sup.(n))=1. Let Z be a random variable for counting the number of EDs with the correct decision, i.e., sign(g.sub.i.sup.(n))=1. The random variable Z can then be model as the sum of K independent Bernoulli trials, i.e., a binomial variable with the success and failure probabilities given by
P.sub.i[sign({tilde over (g)}.sub.k,i.sup.(n))=sign(g.sub.i.sup.(n))]
q.sub.i[sign({tilde over (g)}.sub.k,i.sup.(n))≠sign(g.sub.i.sup.(n))]
respectively, for all k. This implies that
To calculate [sign(Δ.sub.i.sup.(n))≠1|Z=K.sub.i.sup.+], we use the distribution of Δ.sub.i.sup.(n), which can be obtained by using the properties of exponential random variables as
[0128] Thus, by integrating (23) with respect to Δ.sub.i.sup.(n),
[0129] Hence, by using (24) and the properties of binomial coefficients
[0130] Under Assumption 2 and Assumption 3, by using the derivations in [11], it can be shown that
Hence, an upper bound on the stochasticity-induced error can be obtained as
[0131] Based on Assumption 1,
[0132] By rearranging the terms in (26) and using the expressions for n.sub.b and η, (22) is reached.
[0133] While certain embodiments of the disclosed subject matter have been described using specific terms, such description is for illustrative purposes only, and it is to be understood that changes and variations may be made without departing from the spirit or scope of the subject matter. The patentable scope of the presently disclosed subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they include structural and/or step elements that do not differ from the literal language of the claims, or if they include equivalent structural and/or elements or steps with insubstantial differences from the literal language of the claims.
REFERENCES
[0134] [1] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated learning: A signal processing perspective,” 2021. [Online]. Available: arXiv:2103.17150 [0135] [2] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” 2021. [Online]. Available: arXiv:2104.02151 [0136] [3] M. Goldenbaum, H. Boche, and S. Sta'nczak, “Harnessing interference for analog function computation in wireless sensor networks,” IEEE Trans. Signal Process., vol. 61, no. 20, pp. 4893-4906, October 2013. [0137] [4] W. Liu, X. Zang, Y. Li, and B. Vucetic, “Over-the-air computation systems: Optimization, analysis and scaling laws,” IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5488-5502, August 2020. [0138] [5] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498-3516, October 2007. [0139] [6] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491-506, January 2020. [0140] [7] G. Zhu, Y. Du, D. Gündüz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2120-2135, November 2021. [0141] [8] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. in International Conference on Machine Learning, vol. 80. Proceedings of Machine Learning Research, 10-15 Jul. 2018, pp. 560-569. [0142] [9] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar, “Over-the-air federated learning from heterogeneous data,” 2020. [Online]. Available: arXiv:2009.12787 [0143] [10] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546-3557, February 2020. [0144] [11] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over the air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022-2035, 2020. [0145] [12] M. M. Amiria, T. M. Duman, D. Gündüz, S. R. Kulkarni, and H. Vincent Poor, “Collaborative machine learning at the wireless edge with blind transmitters,” IEEE Trans. Wireless Commun., pp. 1-1, March 2021. [0146] [13] Y. A. Jawhar, L. Audah, M. A. Taher, K. N. Ramli, N. S. M. Shah, M. Musa, and M. S. Ahmed, “A review of partial transmit sequence for PAPR reduction in the OFDM systems,” IEEE Access, vol. 7, pp. 18021-18041, 2019. [0147] [14] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated learning in the sky: Joint power allocation and scheduling with UAV swarms,” in Proc. IEEE International Conference on Communications (ICC), 2020, pp. 1-6. [0148] [001] T. Gafni, N. Shlezinger, K. Cohen, Y. C. Eldar, and H. V. Poor, “Federated learning: A signal processing perspective,” 2021. [Online]. Available: arXiv:2103.17150 [0149] [002] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. Vincent Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” IEEE J. Sel. Areas Commun., pp. 1-26, 2021. [0150] [003] H. Hellstrom, J. M. B. da Silva Jr, V. Fodor, and C. Fischione, “Wireless for machine learning,” 2020. [0151] [004] M. Goldenbaum, H. Boche, and S. Stan'czak, “Harnessing interference for analog function computation in wireless sensor networks,” IEEE Trans. Signal Process., vol. 61, no. 20, pp. 4893-4906, October 2013. [0152] [005] W. Liu, X. Zang, Y. Li, and B. Vucetic, “Over-the-air computation systems: Optimization, analysis and scaling laws,” IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5488-5502, August 2020. [0153] [006] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Trans. Inf. Theory, vol. 53, no. 10, pp. 3498-3516, October 2007. [0154] [007] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491-506, January 2020. [0155] [008] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar, “Over-the-air federated learning from heterogeneous data,” 2020. [Online]. Available: arXiv:2009.12787 [0156] [009] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546-3557, February 2020. [0157] [010] G. Zhu, Y. Du, D. Gündüz, and K. Huang, “One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2120-2135, November 2021. [0158] [011] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. in International Conference on Machine Learning, vol. 80. Proceedings of Machine Learning Research, 10-15 Jul. 2018, pp. 560-569. [0159] [012] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022-2035, 2020. [0160] [013] M. M. Amiria, T. M. Duman, D. Gündüz, S. R. Kulkarni, and H. Vin-cent Poor, “Collaborative machine learning at the wireless edge with blind transmitters,” IEEE Trans. Wireless Commun., pp. 1-1, March 2021. [0161] [014] E. Dahlman, S. Parkvall, and J. Skold, 5G NR: The Next Generation Wireless Access Technology, 1st ed. USA: Academic Press, Inc., 2018. [0162] [015] A. Sahin, B. Everette, and S. Hoque, “Over-the-air computation with DFT-spread OFDM for federated edge learning,” in Proc. IEEE Wireless Communications and Networking Conference (WCNC) (submitted), April 2022, pp. 1-6. [0163] [016] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, “Federated learning in the sky: Joint power allocation and scheduling with UAV swarms,” in Proc. IEEE International Conference on Communications (ICC), 2020, pp. 1-6. [0164] [017] Y. A. Jawhar, L. Audah, M. A. Taher, K. N. Ramli, N. S. M. Shah, M. Musa, and M. S. Ahmed, “A review of partial transmit sequence for PAPR reduction in the OFDM systems,” IEEE Access, vol. 7, pp. 18 021-18 041, 2019.