MULTI-USER SCHEDULING METHOD AND SYSTEM BASED ON REINFORCEMENT LEARNING FOR 5G IOT SYSTEM
20230345451 · 2023-10-26
Assignee
Inventors
- Wendi Wang (Nanjing, CN)
- Hong ZHU (Nanjing, CN)
- Pu WANG (Nanjing, CN)
- Honghua Xu (Nanjing, CN)
- Su Pan (Nanjing, CN)
- Dongxu Zhou (Nanjing, CN)
- Xin QIAN (Nanjing, CN)
- Zibo LI (Nanjing, CN)
- Xuanxuan SHI (Nanjing, CN)
- Junkang WANG (Nanjing, CN)
- Zhongzhong XU (Nanjing, CN)
- Jun ZHAO (Nanjing, CN)
Cpc classification
H04W52/00
ELECTRICITY
Y02D30/70
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
A multi-user scheduling method based on reinforcement learning for 5G IoT, including: calculating the achievable rate of each user in a set according to a communication scenario model; generating an initial scheduled users set according to the achievable rate of each user; according to the achievable rate of each user and the number of times each user is scheduled, evaluating the estimated value of each user's action value under the current scheduling period by using the Q-learning method; determining the upper bound value of the confidence interval of each user's action value; determining the set of scheduled users under the current scheduling period according to the upper bound of the confidence interval of each user's action value; according to the set of scheduled users under the current scheduling period, recalculating the actual achievable rate value of each selected user under the current scheduling period.
Claims
1. A multi-user scheduling method based on reinforcement learning for a 5G IoT (Internet of Things system), comprising: (a) calculating a achievable rate of each user in a set, according to the communication scenario model of the 5G IoT; (b) generating an initial scheduled users set according to the achievable rate of the each user; (c) according to the achievable rate of the each user and a number of times the each user being scheduled, evaluating an estimated value of an action value of the each user under a current scheduling period by utilizing Q-learning method; (d) obtaining an upper bound value of a confidence interval for the action value of the each user; (e) determining a scheduled users set in the current scheduling period according to the upper bound value of the confidence interval of the action value of the each user; (f) calculating one more time the achievable rate of the each selected user under the current scheduling period according to the scheduled users set under the current scheduling period, (g) returning to (c) until the estimated value of the action value of the each user converges to the achievable rate of the each user in the current scheduling period, and then outputting the scheduled users set in (e) performed at last round as a result of a next scheduling period.
2. The method of claim 1, wherein in (a): the achievable rate, representing as R.sub.m, of a user m, is: .sub.m.sup.0, and N.sub.m is a number of antennas of the user m, specially:
.sub.m is a joint matrix of a channel matrix of other users except the user m, and the above equation is the singular value decomposition of
.sub.m.
3. The method of claim 1, wherein (b) further comprising: (b1) defining the scheduled users set Ψ=Ø, and the set of all users in the environment A={M.sub.0}; (b2) for each user x in the set A−Ψ, defining Ψ.sub.x=Ψ+{x}, and calculate the data rate R.sub.Ψ.sub.
4. The method of claim 1, wherein in(c): an estimated value of action value {tilde over (q)}.sub.a.sub.
5. The method of claim 4, wherein the estimated value of action values {tilde over (q)}.sub.a.sub.
6. The method of claim 4, wherein in (d): the upper bound value of the confidence interval Q.sub.a.sub.
7. The method of claim 6, further comprising: wherein in (g), the action value of the each user converges to the achievable rate of the each user, includes:
8. The method of claim 1, wherein (e) comprising: the scheduled users set π.sub.t in the current scheduling period is
9. The method of claim 1, wherein: in (c), evaluating the estimated value of each user's action value under the current scheduling period includes: if the current scheduling period is the first scheduling period, setting the estimated value of each user's action value {tilde over (q)}.sub.a.sub.
10. A multi-user scheduling system based on reinforcement learning for a 5G IoT (Internet of Things system), comprising: a performance computing module and a logic operating module; the performance computing module computes the achievable rate of an each user in a scheduled users set and utilizes the achievable rates of the each user as a historical rate of the each user to provide a function for the multi-user scheduling system to evaluate the each user's action value; the logic operating module generates an initial scheduled users set, evaluates an estimated value of the each user's action value under a current scheduling period, determines an upper bound value of a confidence interval of the each user's action value, and determines the scheduled users set under the current scheduling period.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
DETAILED DESCRIPTION
[0033] The present disclosure will be further described below with reference to the accompanying drawings. The following examples are only aimed to illustrate the technical solutions of the present disclosure more clearly, but the examples may not limit the protection scope of the present disclosure.
[0034] The present disclosure proposes a multi-user scheduling method based on reinforcement learning for the 5G Internet of Things system. Specially, the method estimates the estimated value of the user's action value from the user's historical achievable rate samples by using Q-learning method and selects users according to the Q values (the upper bound value of the confidence interval of the estimated values of the action values). After the users being selected, the base station calculates the achievable rate of each selected user. Compared with the traditional user scheduling algorithm, this method needs to collect the historical achievable rates of each user as samples to evaluate the action value. It is necessary to collect the samples as many as possible, so that the action value can be estimated more accurately, so it takes time to make the method converge. However, the proposed user scheduling method does not need to try varieties of different user combinations before scheduling, which can reduce a lot of unnecessary calculation of the achievable rates, so the computational complexity has been reduced, and the optimal scheduled users set can still be obtained after the method converges, and then to ensure optimal the performance of the system.
[0035] As shown in
[0040] Furthermore, the communication scenario model of the 5G IoT system, in the step (1), the received signal y.sub.m of the user m can be described as:
.sub.m to be the joint matrix of the channel matrices of other users except user m, and perform SVD (Singular Value Decomposition) on it:
.sub.m.sup.0,
.sub.m.sup.0 are composed of left and right singular vectors corresponding to zero singular values of joint matrix
.sub.m,
.sub.m.sup.1,
.sub.m.sup.1 are composed of left and right singular vectors corresponding to non-zero singular values of
.sub.m, and the main diagonal elements of the diagonal matrix
.sub.m are the non-zero singular values of
.sub.m. As
.sub.m.sup.0 exists in the null space of the joint matrix
.sub.m, then:
[0043] Therefore, performing precoding processing on the transmitted signal of user m by utilizing .sub.m.sup.0, the inter-user interference can be eliminated completely.
[0044] In order to ensure that the null space has non-zero solutions, the number of rows of .sub.m should not be greater than the number of columns of
.sub.m, then:
[0045] Equation (4) is the constraint expression of the maximum number M of users those can be scheduled simultaneously, which can be used to solve the maximum number M of users that can be scheduled at the same time. Wherein, T is the number of antennas, and N.sub.i is the number of antennas for each user.
[0046] Set the precoding matrix T.sub.m=.sub.m.sup.0V.sub.m.sup.1(m=1, 2, . . . , M), wherein V.sub.m.sup.1 is obtained by H.sub.m
.sub.m.sup.0 through singular value decomposition:
[0047] Wherein, Σ.sub.m.sup.1 is a diagonal matrix, and its main diagonal elements are the non-zero singular values of H.sub.m.sub.m.sup.0, which are represented by λ.sub.m,n and has N.sub.m non-zero singular values in total.
[0048] Substituting T.sub.m into (1), it can get:
y.sub.m=H.sub.m.sub.m.sup.0V.sub.m.sup.1s.sub.m+n.sub.m=U.sub.m.sup.1Σ.sub.m.sup.1s.sub.m+n.sub.m (6)
[0049] Therefore, each user's channel can be equivalent to multiple parallel channels. The actual achievable rate or achievable rate R.sub.m of user m is:
.sub.m.sup.0, n=1, 2, . . . , N.sub.m. Furthermore, σ.sup.2 is the additive white Gaussian noise power.
[0051] In order to improve the utilization rate of space resources, the system can allow the number of access users M.sub.0 to exceed the maximum number of users M those can communicate simultaneously. In each scheduling period, the system needs to select M users from M.sub.0 users according to the achievable rate. Assuming the scheduled users set is Ψ, and then the system throughput R.sub.Ψ is:
[0052] It should be noted that the system throughput R.sub.Ψ is equal to the total system data rate multiplied by the length of the scheduling period, and then maximizing system throughput is equivalent to maximizing the total data rate of the system in the case of constant time length. In the present disclosure, the time length is regarded as the unit time, so R.sub.Ψ can also represent the system throughput.
[0053] The present disclosure utilizes the learning ability of reinforcement learning for an unknown environment to find the optimal scheduled users set through the iteration of Q value evaluation and scheduled set update so that to achieve the goal of maximizing the total throughput of the system. The proposed reinforcement learning model consists of three main components: Agent, Action, and Reward, which are mapped to the user scheduling of the 5G IoT system:
[0054] Agent:
[0055] The agent is the executive subject of reinforcement learning, which is the base station in the present disclosure. In the environment where the agent is located, a total of M.sub.0 users can be selected for access, from which M users (M is the maximum number of scheduled users) need to be selected for service. The state of the agent is represented by the scheduled users in this period.
[0056] Action:
[0057] Action a.sub.i indicates the action that the base station selects user i to access, and after the action a.sub.i is executed in the scheduling period t, the agent's state transitions: adding user i to the scheduled users set Ψ.sub.t. The same action cannot be repeated in a scheduling period, and M actions need to be executed, Action set A={M.sub.0}. The scheduled users set for period t is π.sub.t, and its modulus |n.sub.t|=M, also π.sub.t⊂M.sub.0.
[0058] Reward:
[0059] After performing the action, the environment may feedback a gain or action gain to the agent. Since a achievable rate of a user may be affected by the joint channel matrix, and the subsequent users' selection will affect the achievable rate of the previous user, therefore, the gain or the action gain, i.e., the user's achievable rate, be calculated after a round of scheduling. Specially, the gain or actual gain r.sub.i(t) obtained by the base station after performing the action a.sub.i in the period t is the achievable rate of the selected user i in this round of scheduling, which is obtained by equation (7). The total gain obtained by the base station after the end of the scheduling period is G.sub.t=Σ.sub.i∈π.sub.
[0060] To generate an initial scheduled users set, the steps are as listed below: [0061] Step (1.1): setting the scheduled users set Ψ=Ø, and the set of all users in the environment A={M.sub.0}. [0062] Step (1.2): for each user x in the set A−Ψ, setting Ψ.sub.x=Ψ+{x}, and calculating the data rate or rate R.sub.Ψ.sub.
[0063] Finding the user m* that satisfies the condition x*=argmax.sub.m∈A-ΨR.sub.Ψ.sub.
i∈π.sub.0,
[0065] Wherein, the total gain, earned by the base station,
[0066] It should be noted that if there are not enough historical achievable rate samples in the initial period, the initial estimate value for each user's action value {tilde over (q)}.sub.a.sub.
[0067] Furthermore, the step (2) specifically includes:
The estimate value for user's action value is defined as the user's expectation of the achievable rate. In Q-learning methods, it is necessary to utilize state transition probabilities to calculate action values. However, it is difficult to obtain state transition probabilities in practice. The method of the present disclosure adopts the weighted average value of the historical achievable rate of the user or the user terminal as the estimation value of the action value. The estimate value of the action value {tilde over (q)}.sub.a.sub.
[0068] Wherein {circumflex over (τ)}.sub.a.sub.
[0069] Equation (9) is further simplified as follows to reduce the storage pressure of the base station:
[0071] The advantage of using equation (10) is that it can use the action value of the previous period to update the action value of the next period without storing all the historical achievable rates generated by the user, thereby improving the computing efficiency.
[0072] Furthermore, the step (3) specifically comprises:
[0073] In the proposed scheduling method, the base station cannot obtain the real value of the action value when selecting a user and can only estimate the action values of all users through the previously achievable rates of the selected users. Selecting M users or user terminals with the largest sum of estimated values of action values, wherein this operation is called “Utilization” because it utilizes the existing information. However, as the action value of each user is estimated from historical data rate samples, the number of samples affects the accuracy of the estimation of the action value. When the sample size is limited, the imprecision of the action value makes it impossible to guarantee that there is no user exists among the other M.sub.0−M users that can achieve higher system throughput. Meanwhile, due to movement or some other reasons, variables in data rates or rates may also cause users outside the currently scheduled users set to have higher action values. Therefore, it is necessary to schedule users who are outside the current scheduled users set and have been scheduled less frequently before, calculate the achievable rate of these users, increase the sample space of these users, and make the estimated value of the action values of these users more accurate. This operation is called “exploration” as it tries to explore more unknown information. When the balance of the “exploration” and “utilization” is reached, that is, after accurately estimating the action value of each user or user terminal, the system will find M users who can truly make the system obtain the highest total throughput. To balance exploration and utilization, the upper confidence bound algorithm (UCB) is adopted, and the Q value of the user or user terminal is defined as the upper bound value of the confidence interval of the action values of the user.
[0074] The upper bound value Q.sub.a.sub.
[0075] The scheduled users set π.sub.t(t≥1) of the t scheduling period is:
[0076] Wherein, the selection of π.sub.t is not related to the user's Q value in the previous period, as a user won't be selected until the Q value in the previous period has been calculated. The rate calculation after the user be selected is used to update the Q value to optimize the selection. The sequence of each scheduling period is: i. calculating the action value and Q value according to the historical achievable rate and selection times; ii. selecting the user according to the Q value; iii. calculating the achievable rate of the selected user.
[0077] In the equation (12), π has been replaced by π.sub.t to avoid misunderstanding. The main goal of this equation (12) is to select π, which has the largest “sum of user Q values”, to be π.sub.t.
[0078] When the scheduling period starts, it is possible to select all the scheduling users in this scheduling period directly by utilizing equation (12), and then to calculate the achievable rates of these users according to equation (13), which may be used to update the action value and optimize the user selection in the next scheduling period. The agent utilizes equation (7) to calculate the actual achievable rate or achievable rate of each user in the scheduled users set π.sub.t. In the scheduling period t, the achievable rate of user i (i.e., the gain generated by user i) is:
[0079] The data rate newly generated may update the estimated values of the action values for all users and the upper bound value of the confidence interval for the action values, and then to be applied for users' selection in the t+1 period.
[0080] The difference between the method of the present disclosure and the common greedy algorithm is that the method selects users firstly and then calculate performance (while greedy algorithm needs to calculate performance firstly and then select users according to performance). Therefore, the method needs to select users and then calculate the use's actual performance to optimize the next user's selection. Through continuous optimization, the users' selection can achieve optimal performance eventually.
[0081] As shown in equations of (11) and (12), the method of the present disclosure traverses all users in the initial stage, avoiding the scenarios those some users have never been selected. After multiple rounds of users scheduling, the number of selections of each user increases continuously, the confidence interval gradually converges, and the user's Q value gets equal to the action value. The base station may select the users with higher action value and maximizes the system throughput. From a long-term perspective, the method can be used to balance exploration and utilization, also it guarantees long-term system performance. Additionally, the Q value of the users, who have not been scheduled for a long time, may grow continuously without an upper bound value, so that these users may be scheduled by the system eventually. Therefore, the method can avoid “extreme unfairness” scenarios.
[0082] Furthermore, the specific steps of step (4) are listed as below:
[0083] From the t scheduling period (t≥1), repeating steps (2) and (3) until for any user i∈{M.sub.0},
the method converges and obtains the optimal scheduled users set π*.
[0084] In conclusion, the difference between the method of the present disclosure and the common greedy algorithm is that the method selects users firstly and then calculate performance (while greedy algorithm needs to calculate performance firstly and then select users according to performance). Therefore, the method needs to select users and then calculate the use's actual performance to optimize the next users' selection. Through continuous optimization, the users' selection can achieve optimal performance eventually.
[0085] Thus, the idea of the method is to directly select all the scheduled users in this scheduling period by using equation (12), and then calculate the achievable rates of these users according to equation (13), which may update the action value and optimize the users' selection in the next scheduling period.
[0086] The method of selecting users based on historical data cannot guarantee that the optimal scheduled set be selected in each period before the method convergences, i.e., π.sub.t≠π*, as there is not enough amount of data to support an accurate user rate model. The accurate data may not be obtained (evaluated by Q-learning) until a certain number of samples can be obtained. In order to obtain accurate data for all users, we utilize a confidence upper bound method (Q.sub.a.sub.
[0087] Therefore, π.sub.t is obtained by comparing Q.sub.a.sub.
[0088] The additional advantage of the method is that the method may select users firstly and then calculate the system performance achieved by these users. Distinguishing from the greedy algorithm, which calculates performance firstly and then selects users, the present disclosure can reduce the scheduling delay significantly (the time from all users initiating scheduling applications to the base station completing users' selection).
[0089] The specific exemplary embodiments of the present disclosure may be described in detail below with reference to the accompanying drawings.
[0090] A division method for community users based on an improved support vector machine (SVM) algorithm, the method comprises the following steps: [0091] (i). according to the characteristics of 5G communication system, establishing a problem model of the communication scenario model between community mobile users to the base station;
[0092]
m=1, 2, . . . , M.sub.0, and N.sub.m≤T,
is the channel state matrix of user m. wherein, h.sub.i,j represents the spatial channel fading coefficient rank(H.sub.m)=N.sub.m from the j transmit antenna to the i receive antenna. Setting s.sub.m represent the transmitted signal vector sent by the base station to user m, and each value contains the information to be transmitted T.sub.m that is the precoding matrix of user m, then the received signal y.sub.m of user m is:
y.sub.m=H.sub.mT.sub.ms.sub.m+n.sub.m (1)
[0093] The precoding is set as T.sub.m=.sub.m.sup.0V.sub.m.sup.1,
.sub.m.sup.0 is composed of the right singular vectors corresponding to the zero singular values of the joint matrix
.sub.m, and V.sub.m.sup.1 is composed of the right singular vectors corresponding to the non-zero singular values of the matrix H.sub.m
.sub.m.sup.0. At the point, the signal interference of other users has been eliminated, and the achievable rate R.sub.m of user m can be obtained:
[0097] Correspondingly, the present disclosure also discloses a multi-user scheduling system based on reinforcement learning for 5G IoT (Internet of Things), comprising: a performance computing module and a logic computing module.
[0098] The performance computing module calculates the achievable rate or actual achievable rate of each user in the scheduled users set and applies these rates as the user's historical rates to provide foundation for the system to evaluate the user's action value.
[0099] The logic operation module generates an initial scheduled users' set, evaluates the estimated value of each user's action value under the current scheduling period, determines the upper bound value of the confidence interval of each user's action value, and determines the scheduled users set under the current scheduling period.
[0100]
[0101] As the estimated value of the action value of each user or user terminal may be inaccurate, the present disclosure proposes step (c) to balance exploration and utilization in order to avoid the method falling into a local optimal situation, and the Q value of a user is defined as the upper bond value of the confidence interval of the user's action value, and applies the Q value as a criterion for user selection. The Q value of a user i is:
[0102] The set of scheduled users in the t scheduling period is:
[0103] For user i∈{M.sub.0}, the achievable rate
[0104] Repeating sept (b) and (c) until the method converges when
for any user i∈{M.sub.0}, and then obtaining the optimal scheduled users set π*.
[0105]
[0106]
[0107]
[0111] Therefore, the computational complexity of one scheduling period is O(MT.sup.3+M.sub.0) when adopting the method of the present disclosure. Table 1 shows the computational complexities of various scheduling algorithms.
TABLE-US-00001 TABLE 1 the computational complexities of scheduling algorithms Computational Algorithms complexities Greedy algorithm (based on user rate) O(M.sub.0T.sup.3M.sup.2) Greedy algorithm (based on F-form) O(M.sub.0T.sup.3M.sup.2) Greedy algorithm (based on chord distance) O(M.sub.0T.sup.3) The method of the present disclosure O(MT.sup.3 + M.sub.0) (based on user rate)
[0112] When the total number of users is massive, the computational complexity of this method is much lower than that of the greedy algorithm. Meanwhile, the computational complexity of the method does not increase too much even the total number of users increases. Therefore, the proposed method can ensure that the system maintains a low computational load even if there are enormous IoT nodes in the 5G network.
[0113] The applicant of the present disclosure has depicted and described the embodiments of the present disclosure in detail with reference to the accompanying drawings, but those skilled in the art should understand that the above embodiments are only exemplary embodiments of the present disclosure, and the detailed description is only to help readers to understand the ideas better rather than to limit the protection scope of the present disclosure. On contrary, any improvement or modification based on the spirit of the present disclosure should fall within the protection scope of the present disclosure.