METHOD AND APPARATUS FOR PROGRAMMABLE AND CUSTOMIZED INTELLIGENCE FOR TRAFFIC STEERING IN 5G NETWORKS USING OPEN RAN ARCHITECTURES
20230319662 · 2023-10-05
Inventors
- Rajarajan Sivaraj (Plano, TX, US)
- Rahul Soundrarajan (Bengaluru, IN)
- Pankaj Kenjale (Plano, TX, US)
- Ankith Gujar (Sunnyvale, CA, US)
- Tarunjeet Singh (New Delhi, IN)
- Wasi Asghar (Bengaluru, IN)
CPC classification
H04W36/13
ELECTRICITY
Abstract
A method of optimizing traffic steering (TS) radio resource management (RRM) decisions for handover of individual user equipment (UE) in Open Radio Access Network (O-RAN) includes: providing an O-RAN-compliant near real time RAN intelligent controller (near-RT RIC) configured to interact with O-RAN nodes; and utilizing an artificial intelligence (AI)-based TS application (xApp) in the near-RT RIC to optimize TS handover control and maximize UE throughput utility. The TS xApp is configured utilizing a virtualized and simulated environment for O-RAN, which virtualized and simulated environment is provided by the ns-O-RAN platform. The optimization problem to be solved is formulated as a Markov Decision Process (MDP), and a solution to the optimization problem is derived by using at least one reinforcement learning (RL) technique.
Claims
1. A method of optimizing traffic steering (TS) radio resource management (RRM) decisions for handover of at least one user equipment (UE) in Open Radio Access Network (O-RAN), comprising: providing an O-RAN-compliant near real time RAN intelligent controller (near-RT RIC) configured to interact with O-RAN nodes; and utilizing an artificial intelligence (AI)-based TS application in the near-RT RIC to optimize TS handover control and maximize UE throughput utility.
2. The method according to claim 1, wherein a data-driven AI-powered TS xApp in the near-RT RIC is utilized to optimize the TS handover control.
3. The method according to claim 2, wherein the TS xApp is configured utilizing a virtualized and simulated environment for O-RAN.
4. The method according to claim 3, wherein the virtualized and simulated environment for O-RAN is provided by ns-O-RAN platform.
5. The method according to claim 4, wherein the optimization problem to be solved is formulated as a Markov Decision Process (MDP).
6. The method according to claim 5, wherein a solution to the optimization problem is derived by using at least one reinforcement learning (RL) technique.
7. The method according to claim 6, wherein the RL technique is utilized to select an optimal target cell for TS handover of the UE.
8. The method according to claim 7, wherein the RL technique is based on at least a Deep Q-Network (DQN) algorithm.
9. The method according to claim 8, wherein the DQN algorithm includes at least one of Conservative Q-learning (CQL) algorithm and Random Ensemble Mixture (REM) algorithm.
10. The method according to claim 9, wherein the RL technique is additionally based on Convolutional Neural Network (CNN) architecture.
11. The method according to claim 10, wherein the at least one of the CQL algorithm and the REM algorithm is used in conjunction with the CNN architecture to model a Q-function and the loss function.
12. The method according to claim 10, wherein the RL technique enables control of multiple UEs using a single RL agent.
13. The method according to claim 4, wherein the Near-RT RIC with a TS xApp is integrated with a simulated environment on ns-3.
14. The method according to claim 6, wherein the Near-RT RIC with a TS xApp is integrated with a simulated environment on ns-3 for data collection and testing of at least one RL-based control policy.
15. The method according to claim 4, wherein the TS xApp in the near-RT RIC is evaluated for Key Performance Indicators (KPIs) including at least one of UE throughput, spectral efficiency, and mobility overhead.
16. The method according to claim 15, wherein the evaluation of the TS xApp for KPIs is performed on a simulated RAN network generated by an ns-O-RAN platform.
17. The method according to claim 16, wherein the ns-O-RAN platform includes a combination of ns-3 5G RAN module and an O-RAN-compliant E2 implementation.
18. The method according to claim 9, wherein an offline Q-learning training is performed using the CQL algorithm, and the trained CQL algorithm is deployed in the TS xApp for at least one of online value iteration, inference derivation and handover control.
19. The method according to claim 10, wherein an offline Q-learning training is performed using the CQL algorithm, and the trained CQL algorithm is deployed in the TS xApp for at least one of online value iteration, inference derivation and handover control.
20. The method of claim 11, wherein an offline Q-learning training is performed using the CQL algorithm, and the trained CQL algorithm is deployed in the TS xApp for at least one of online value iteration, inference derivation and handover control.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE INVENTION
[0042] Before describing the example embodiments, an overview of the relevant technology framework will be presented, i.e., O-RAN cellular architecture, AI/ML in RIC, dual connectivity and traffic steering.
[0043] O-RAN Cellular Architecture is described in connection with
[0044] Also shown in
[0045] The procedures and messages exchanged over the E2 interface are standardized by E2 Application Protocol (E2AP). Using E2AP, the E2 nodes can send reports (e.g., with RAN data or UE context information) to the near-RT RIC 1002. In addition, the near-RT RIC 1002 can send control actions (e.g., containing RRM decisions), policies, and subscriptions to the E2 node. The xApps (e.g., xApp 1 and xApp n in
[0046] The Near-RT RIC 1002 connects to the Non-RT RIC 1004, which is responsible for setting high-level RRM objectives, over the A1 interface. The Non-RT RIC 1004 is deployed in a centralized Service Management and Orchestration (SMO) engine that performs Fault, Configuration, Accounting, Performance and Security (FCAPS) management and infrastructure orchestration (which functionalities are represented by rApp applications rApp 1 and rApp n) for the E2 nodes, O-RU 1001d and Near-RT RIC 1002 through the O1 and O2 interfaces, respectively, as shown in
[0047] AI/ML in RIC will be described in this section. O-RAN Alliance has defined specifications for life cycle management of ML-driven RAN control from RIC. Considered in an example embodiment are ML models trained offline and deployed as xApps for online inference and RRM control in the RIC. In an example embodiment, we consider reinforcement learning (RL), which teaches an agent how to choose an action from its action space, within a particular environment, to maximize rewards over time. The goal of the RL agent is then to compute a policy, which is a mapping between the environment states and actions so as to maximize a long term reward. RL problems are of particular interest to RIC, since these problems are closed-loop in nature. The RL agent autonomously interacts with the environment for the purpose of taking control actions, and these actions influence subsequent inputs to the agent.
[0048] According to an example embodiment, the RL model of interest is Deep Q-Network (DQN), which is a model-free, off-policy, value-based RL. “Model-free” means that the RL algorithm does not model the state transition probability in the environment due to actions, but estimates the reward from state-action samples for the purpose of taking subsequent actions. In off-policy RL algorithms, the target policy, which is the policy that the RL agent is learning to iteratively improve its reward value function, is different from the behavior policy, which is the policy used by the RL agent to generate action towards interacting with the environment. An example embodiment of the RL algorithm uses a Q-value that measures the expected reward for taking a particular action at a given state. DQN can be i) trained offline, and ii) its policy can be continually updated online, and iii) subsequently deployed in the inference host for the purpose of generating optimal actions, as the agent receives live data streams from the environment.
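The off-policy split described above (a greedy target policy learned from experience generated by an exploratory behavior policy) can be illustrated with a toy Q-table; all sizes, names, and values here are illustrative, not from the disclosed system:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
Q = rng.normal(size=(n_states, n_actions))  # hypothetical Q-value estimates

def behavior_policy(state, epsilon=0.1):
    # Exploratory (epsilon-greedy) policy used to generate actions
    # towards interacting with the environment.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def target_policy(state):
    # Greedy policy the agent is learning to iteratively improve.
    return int(np.argmax(Q[state]))

greedy_action = target_policy(0)
```

Because the two policies differ, the agent can keep exploring while still converging its Q-value estimates toward the greedy optimum.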
[0049] In this section, dual connectivity and traffic steering will be discussed. Dual connectivity is a mode of 5G RAN deployment in which the UE is jointly connected to more than one base station (e.g., O-eNB/gNB). One of the base stations is designated as the master node (solely responsible for control plane procedures of a UE), and the other base station is the secondary node (jointly responsible for data transfer for the UE along with the master node). A prevalent 5G deployment in North America (and globally) is E-UTRAN-NR Dual Connectivity (EN-DC) Non-Stand-Alone (NSA) mode 3X, where the LTE O-eNB is the master node, and NR gNB is the secondary node.
[0050] Traffic steering (TS) is a RAN functionality (handled by the RRC layer) for managing connectivity and mobility decisions of UEs in the RAN. More specifically, TS handles the following, on a UE basis: (i) Primary Cell (PCell) selection and handover, (ii) selection and change of master and secondary nodes for dual connectivity, (iii) selection and handover of Primary cell of the Secondary Node (PSCell).
[0051] In the present disclosure, an O-RAN compliant near-RT RIC is implemented, which uses xApps with standard-compliant service models that can be deployed on a real network. In addition, the present disclosure provides a method to test the performance of the xApp combining the real-world RIC with a large scale RAN deployment based on end-to-end, full-stack, 3GPP-based simulations in ns-3. In the following sections, the following are presented for the example embodiments: the system model assumption, the near-RT RIC software architecture, and the ns-O-RAN design.
[0052]
[0053] Also shown in
[0054] The Near-RT RIC software architecture is described in this section. The near-RT RIC in the example embodiment can be implemented with a cloud-native architecture having containerized micro-services that can be deployed on Kubernetes. The architecture diagram for the example near-RT RIC 1002 presented in this disclosure is shown in
[0055] In the example embodiment shown in
[0056] The E2 node 2003 accepts the subscription and starts streaming KPMs and L3 RRC measurements. The raw streamed KPM data is stored by Data Pipeline and KPM job control service 1002d. The ETL, data aggregation (and ingestion) services 1002e can retrieve relevant measurements stored in this data repository, and correlate and aggregate in time series the UE level KPM information and L3 RRC measurements. The TS xApp 1002a can then fetch and process the data to perform inference derivation (e.g., using the algorithm described below in further detail). If a handover needs to be performed, the TS xApp 1002a communicates with the E2 termination service 1002b to send the control action to the RAN.
[0057] O-RAN integration in ns-3 is described in this section. In the example embodiment, ns-O-RAN, an O-RAN integration tool for ns-3 simulations, is used to evaluate the example system and the method according to the present disclosure. ns-O-RAN connects a real-world near-RT RIC with ns-3, enabling large scale (i) collection of RAN KPMs, and (ii) testing of closed-loop control of simulated cellular networks. Thanks to the flexibility of ns-3, such integration eases the design, development, and testing of xApps across different RAN setups with no infrastructure deployment cost. ns-3, which provides realistic modeling capabilities for large-scale wireless scenarios, features a channel model with propagation and fading compliant with 3GPP specifications, and a full-stack 5G model for EN-DC RAN, in addition to the TCP/IP stack, multiple applications, and mobility models.
[0058] ns-O-RAN bridges ns-3 to the real-world, O-RAN-compliant RIC to enable production code (i.e., code that can be used in real-world networks) to be developed and tested against simulated RANs. To do so, we connect the E2 termination 3002 of the real-world near-RT RIC 1002 to a set of E2 endpoints 3003 (net device) in ns-3, which are responsible for handling all the E2 messages to and from the simulated environment. This connection was implemented by extending the E2 simulator, namely e2sim 3004, and incorporating it into a simulator module 3001 for ns-3, which can decode, digest, and provide feedback for all the messages coming from the near-RT RIC 1002, and enables streaming RAN telemetry based on simulation data to the near-RT RIC 1002.
[0059] The design of ns-O-RAN addresses several challenges that would otherwise prevent communications between the simulated and real-world environments. As previously discussed above, the near-RT RIC 1002 expects to interface with a number of disaggregated and distinct endpoints, i.e., multiple O-DUs, O-CU-CPs and/or O-CU-UPs, which are usually identified by different IP addresses and/or ports. Instead, all the ns-3 simulated RAN functions (e.g., net devices 3003) are handled by a single process. e2sim itself was not designed to handle multiple hosts at once, while the E2 protocol specifications, which rely on the Stream Control Transmission Protocol (SCTP) for communication over E2 interface (E2AP), do not pose any limitation in this sense. To address this, we extended the e2sim library to support multiple endpoints at the same time and created independent entities (i.e., C++ objects) in the simulated environment to represent different RAN-side E2 terminations 3002. Each RAN function is bound to just one E2 interface, as depicted in
[0060] Finally, there is also a gap in timing between the real-world near-RT RIC 1002 and the simulator module 3001 for ns-3, which is a discrete-event framework that can execute faster or slower than the wall clock time. This may potentially lead to inconsistencies between the ns-3 simulated environment and the near-RT RIC 1002 which is expecting the real-world timing. To synchronize the two systems, at the beginning of the simulation ns-3 stores the current Unix time in milliseconds and uses it as baseline timestamp. Whenever an E2 message is sent to the near-RT RIC 1002, the simulator module 3001 for ns-3 will sum the simulation time elapsed and the baseline timestamp, ensuring consistency on both sides of the happened-before relationship.
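The timestamp alignment described above can be sketched as follows (class and method names are hypothetical, not from the ns-O-RAN code): record a baseline Unix time in milliseconds at simulation start, then stamp each outgoing E2 message with the baseline plus the elapsed simulation time.

```python
import time

class SimClock:
    """Sketch of aligning discrete-event simulation time with wall-clock time."""

    def __init__(self):
        self.baseline_ms = int(time.time() * 1000)  # wall-clock baseline at sim start
        self.sim_elapsed_ms = 0                     # discrete-event simulation time

    def advance(self, ms):
        # Simulation time may run faster or slower than the wall clock.
        self.sim_elapsed_ms += ms

    def e2_timestamp(self):
        # Timestamp attached to an E2 message sent "now" in simulation time,
        # preserving the happened-before relationship on the RIC side.
        return self.baseline_ms + self.sim_elapsed_ms

clock = SimClock()
clock.advance(100)
assert clock.e2_timestamp() == clock.baseline_ms + 100
```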
[0061] Set forth in this section are the optimization problem for the traffic steering xApp and the method to determine the optimal target cells for handover of UEs. We consider as objective function the weighted cumulative sum of the logarithmic throughput of all the UEs across time, as a function of their instantaneous target PSCell. The optimization goal is to maximize the objective function by optimizing the choice of the target PSCells for all UEs. At the same time, it is desired to avoid frequent handovers for individual UEs, since handovers increase the network overhead and decrease the network performance. Thus, we associate a cost function with every UE-specific handover and model it as an exponential decay function of the linear difference in time since the previous handover for that particular UE. This means that the smaller the difference in time, the higher the cost, and vice versa. This cost function is added as a constraint to ensure that the cost does not exceed a pre-defined cost threshold.
[0062] Let β.sub.u be a weight associated with any UE u∈U. R.sub.u,t is the throughput at any discrete window of time t, which depends on c.sub.u,t, i.e., the PSCell assigned to u during t, and on RAN performance parameters b.sub.1, b.sub.2, . . . b.sub.B. These are available at the near-RT RIC (where the optimization is solved), thanks to E2SM-KPM/RC reports from the E2 nodes during the time window t. C.sup.NR is the universe of all the N′ NR cells. The cost associated with handover for UE u at time t is given by K.sub.u,t, the initial cost is K.sub.0 (where K.sub.0>0), the decay constant is δ (where 0<δ<1), t′.sub.u is the time when the previous handover was executed for u, and X.sub.u,t is a 0/1 decision variable which yields a value 1 if u was subject to handover at time t, and 0 otherwise. W is a predefined cost threshold, which represents a maximum value that cannot be exceeded by the cost function. We consider any time window t for an infinite time horizon ranging from t.sub.0 to ∞. The constrained optimization problem is formulated as follows:
where K.sub.u,t=K.sub.0e.sup.−δ(t−t′.sub.u), K.sub.0>0 and 0<δ<1. Applying a Lagrangian multiplier λ to the constrained optimization problem in Equation (1), the constrained optimization problem becomes the following:
[0063] According to the present disclosure, we use a data-driven approach (specifically, RL) to model and learn R.sub.u,t as a function of {c.sub.u,t, b.sub.1, b.sub.2, . . . b.sub.B}, due to the lack of a deterministic closed-form equation for R.sub.u,t as a function of the parameters, and its relationship with the cost K.sub.u,t and the handover decision variable X.sub.u,t. We consider the infinite time horizon MDP to model the system, where the EN-DC RAN (including the UEs) is the environment, and a single RL agent is deployed in the near-RT RIC containing the TS xApp. The system is modeled as an MDP because the TS xApp in the RIC controls the target PSCell for the UEs' handover, while the resulting state (including the RAN performance parameters and the user throughput) is stochastic. The MDP is defined by the tuple ⟨S, A, P, R, γ, I⟩, each of which will be defined below.
[0064] S is the state space, comprising per-UE E2SM-KPM periodic data and per-UE E2SM-RC periodic/event-driven data. Let C′.sub.u,t⊆C.sup.NR be the set of serving PSCell and neighboring cells for any UE at time t. The state vector for u at time t from the environment ({right arrow over (S)}.sub.u,t) includes the UE identifier for u and the set of parameters b.sub.1, b.sub.2, . . . b.sub.B, which set of parameters includes the following:
[0065] (i) the UE-specific L3 RRC measurements (obtained from the E2 node O-CU-CP), e.g., sinr.sub.u,c,t for any cell c∈C′.sub.u,t for the UE u;
[0066] (ii) PRB.sub.c,t, the cell-specific Physical Resource Block (PRB) utilization for c at time t obtained from the E2 node O-DU;
[0067] (iii) Z.sub.c,t, the cell-specific number of active UEs in the cell c with active Transmission Time Interval (TTI) transmission at t obtained from O-DU;
[0068] (iv) P.sub.c,t, the total number of MAC-layer transport blocks transmitted by cell c across all UEs served by c at time t (obtained from the E2 node O-DU);
[0069] (v) p.sub.c,t.sup.QPSK, p.sub.c,t.sup.16QAM, p.sub.c,t.sup.64QAM, the cell-specific numbers of successfully transmitted transport blocks with QPSK, 16QAM and 64QAM modulation rates from the cell c to all UEs served by c at time t, normalized by P.sub.c,t; and
[0070] (vi) the cost the UE would incur, if handed over to c.sub.u,t at t (i.e., where c.sub.u,t≠c.sub.u,t−1), which cost is represented by:
[0071] Note that the cost k(c.sub.u,t) is zero if there is no handover, i.e., c.sub.u,t=c.sub.u,t−1.
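The exponential-decay handover cost described above can be sketched as follows (the parameter values are illustrative, not from the disclosure):

```python
import math

def handover_cost(t, t_prev, K0=1.0, delta=0.05):
    """Cost K_{u,t} = K0 * exp(-delta * (t - t_prev)) for a handover at t,
    where t_prev is the time of the UE's previous handover."""
    assert K0 > 0 and 0 < delta < 1
    return K0 * math.exp(-delta * (t - t_prev))

# A handover soon after the previous one is more expensive than a later one:
# the smaller the elapsed time, the higher the cost.
assert handover_cost(t=10, t_prev=9) > handover_cost(t=10, t_prev=0)
```

If no handover occurs at t (c.sub.u,t=c.sub.u,t−1), the cost term is simply zero, consistent with paragraph [0071].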
[0072] The above-listed state information items are aggregated across all the serving and neighboring cells of u, i.e., ∀c∈C′.sub.u,t⊆C.sup.NR, along with the cell identifier for c, during the reporting window t to generate a consolidated record for u for t. This aggregated state information for u is fed as an input feature to the RL agent on the TS xApp. This is done for all UEs in U, whose aggregated state information is fed to the same RL agent. If any of the parameters in the state information from the environment for any UE u is missing, the RIC ETL service uses a configurable small window ε to look back into recent history (e.g., tens to hundreds of ms) and fetch those historical parameters for the missing ones.
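The look-back fill performed by the ETL service might be sketched as follows (the record layout and field names are hypothetical, and the window is counted in reporting periods rather than milliseconds for simplicity):

```python
def fill_missing(records, t, field, lookback=5):
    """records: dict mapping time window -> {field: value} for one UE.
    If the field is missing at window t, search up to `lookback` earlier
    windows and return the most recent reported value, else None."""
    for tau in range(t, t - lookback - 1, -1):
        value = records.get(tau, {}).get(field)
        if value is not None:
            return value
    return None  # still missing after the look-back window

# Hypothetical per-UE history: SINR reported at window 7, PRB use at window 9.
history = {7: {"sinr": 12.5}, 9: {"prb_util": 0.4}}
assert fill_missing(history, t=9, field="sinr") == 12.5
```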
[0073] A is the action space, represented by the following expression:
A={HO(c.sub.1), HO(c.sub.2), . . . HO(c.sub.N′), ∅},
where c.sub.1, c.sub.2, . . . c.sub.N′∈C.sup.NR. Here, a.sub.u,t=HO(c), where a.sub.u,t∈A, indicates that the RL agent is recommending a handover action for u to any cell c at t, and a.sub.u,t=∅ indicates that no handover is recommended for u at t.
[0074] P({right arrow over (S)}.sub.u,t+1|{right arrow over (S)}.sub.u,t, a.sub.u,t) is the state transition probability of UE u from state {right arrow over (S)}.sub.u,t at t to {right arrow over (S)}.sub.u,t+1 at t+1 caused by action a.sub.u,t∈A.
[0075] R: S×A→ℝ is the reward function for UE u at t+1, as a result of action a.sub.u,t, given by the following expression (3):
R.sub.u,t+1=β.sub.u.Math.(log R.sub.u,t+1(c.sub.u,t+1)−log R.sub.u,t(c.sub.u,t))−k(c.sub.u,t+1) (3)
[0076] The reward for UE u is the improvement in the logarithmic throughput R.sub.u,t due to the transition from {right arrow over (S)}.sub.u,t to {right arrow over (S)}.sub.u,t+1 caused by action a.sub.u,t taken at t, minus the cost factor. The reward is positive, if the improvement in log throughput is higher than the cost, and negative, otherwise. R.sub.u,t is obtained from O-CU-UP using E2SM-KPM.
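Equation (3) can be illustrated with a short sketch (the throughput and cost values are made up, and β.sub.u is taken as 1):

```python
import math

def reward(thr_next, thr_prev, cost, beta=1.0):
    """Per-UE reward: change in log throughput minus the handover cost k."""
    return beta * (math.log(thr_next) - math.log(thr_prev)) - cost

# Positive when the log-throughput gain exceeds the handover cost...
assert reward(thr_next=20.0, thr_prev=10.0, cost=0.1) > 0
# ...and negative when the cost dominates the gain.
assert reward(thr_next=10.5, thr_prev=10.0, cost=1.0) < 0
```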
[0077] γ∈[0, 1] is the discount factor for future rewards. The value function V.sup.π(s) is the net return, given by the expected cumulative discounted reward from step t onwards due to policy π, which value function is represented as follows:
[0078] I is the initial distribution of the UE states.
[0079] According to the present disclosure, we consider two policies: (i) a target policy π(a|s), to learn the optimal handover action a for any state s={right arrow over (S)}.sub.u,t; and (ii) a behavior policy μ(a|s), to generate the handover actions which result in state transition and a new state data from the environment. In connection with these policies, we utilize Q-learning, a model-free, off-policy, value-based RL approach. We compute the Q function, an action-value function which measures the expected discounted reward upon taking any action a on any given state s based on any policy π. The value returned by the Q-function is referred to as the Q-value, i.e.,
From equations (4) and (5), we derive the following:
[0080] The optimal policy π* is the one that maximizes the expected discounted return, and the optimal Q function Q*(s, a) is the action-value function for π*, given by the Bellman equation as follows:
[0081] According to an example embodiment of the present disclosure, we use the Q-learning algorithm to iteratively update the Q-values for each state-action pair using the Bellman equation (equation (8) shown below), until the Q function converges to Q*. This process is called value iteration, and is used to determine the optimal policy π* that maximizes the Q-function, yielding Q*. Value iteration by the RL agent leverages the exploration-exploitation trade-off to update the target policy π. Value iteration explores the state space of the environment by taking random handover control actions and learning the Q-function for the resulting state-action pair, and exploits its learning to choose the optimal control action maximizing the Q-value, i.e.,
Such value iteration algorithms converge to the optimal action-value function, i.e.,
The Bellman error Δ, which is represented below in equation (9), is the update to the expected return of state s, when we observe the next state s′. Q-learning repeatedly adjusts the Q-function to minimize the Bellman error, shown below:
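The update just described can be sketched in its tabular form (a minimal illustration with made-up sizes, not the patent's CNN-based implementation): the Bellman error is the gap between the bootstrapped target and the current Q-value, and each update moves Q toward that target.

```python
import numpy as np

Q = np.zeros((5, 3))      # illustrative: 5 states, 3 actions
alpha, gamma = 0.5, 0.9   # learning rate and discount factor

def q_update(s, a, r, s_next):
    target = r + gamma * np.max(Q[s_next])  # Bellman target for (s, a)
    bellman_error = target - Q[s, a]        # update to the expected return
    Q[s, a] += alpha * bellman_error        # move Q toward the target
    return bellman_error

err1 = q_update(s=0, a=1, r=1.0, s_next=2)
err2 = q_update(s=0, a=1, r=1.0, s_next=2)
# Repeated updates on the same transition shrink the Bellman error.
assert abs(err2) < abs(err1)
```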
[0082] This approach has practical constraints, and to address this, we use a CNN approximator with weights θ to estimate the Q function Q(s, a; θ), and refer to it as the Q-network. An example embodiment of the CNN architecture according to the present disclosure is shown in
[0083] Deep Q-learning comes from parameterizing Q-values using CNNs. Therefore, instead of learning a table of Q-values, the example method learns the weights θ of the CNN that outputs the Q-value for every given state-action pair. The Q-network is trained by minimizing a sequence of loss functions L.sub.i(θ.sub.i, π) for each iteration i, where the optimal Q-value produced by the CNN approximator is the target for iteration i. The parameters from the previous iteration θ.sub.i−1 are fixed for optimizing the loss function L.sub.i(θ.sub.i). The gradient of the loss function is obtained by differentiating the loss function in Equation (10) with respect to θ, and the loss can be minimized by stochastic gradient descent.
[0084] According to an example embodiment of the present disclosure, an off-policy Q-learning algorithm, called DQN, is used for this purpose. The DQN algorithm leverages an experience replay buffer, where the RL agent's experiences at each step e.sub.t=(s.sub.t, a.sub.t, r.sub.t, s.sub.t+1) are collected using the behavior policy μ and stored in a replay buffer D={e.sub.1, e.sub.2, . . . e.sub.t−1} for the policy iterate π.sub.i. D is pooled over many episodes, composed of samples from policy iterates π.sub.0, π.sub.1, . . . π.sub.i, so as to train the new policy iterate. At each time step of data collection, the transitions are added to a circular replay buffer. To compute the loss L.sub.i(θ.sub.i) and the gradient, we use a mini-batch of transitions sampled from the replay buffer, instead of using the latest transition to compute the loss and its gradient. Using an experience replay has advantages in terms of an off-policy approach, better data efficiency from re-using transitions, and better stability from uncorrelated transitions.
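A minimal circular replay buffer of this kind might look as follows (the sizes and tuple layout are illustrative, not the patent's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Transitions e_t = (s, a, r, s') in a bounded circular buffer."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Train on a random mini-batch rather than the latest transition,
        # which decorrelates updates and re-uses past experience.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=4)
for t in range(6):
    buf.add(t, 0, 0.0, t + 1)
assert len(buf.buffer) == 4      # circular: only the last 4 transitions remain
assert len(buf.sample(2)) == 2   # mini-batch sampling
```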
[0085] To leverage the full potential of the integrated ns-3 simulation environment in ns-O-RAN, and to harness the large datasets generated from the simulator via offline data collection for data-driven RL, an example method according to the present disclosure utilizes offline Q-learning (Q-learning is a type of RL). This enables learning the Convolutional Neural Network (CNN) weights by training the Q-network using the Deep Q-Network (DQN) model from a dataset D collected offline based on any behavior policy μ (potentially unknown, using any handover algorithm), without online interactions with the environment; hence, no additional exploration by the agent is necessary beyond the experiences e.sub.t available in D via μ. The trained model is then deployed online to interact with the environment, and the Q-function is iteratively updated online.
[0086] According to an example embodiment, a robust offline Q-learning variant of the DQN algorithm is utilized, called Random Ensemble Mixture (REM), which enforces optimal Bellman consistency on J random convex combinations of multiple Q-value estimates to approximate the optimal Q-function. This approximator is defined by mixing probabilities on a (J−1) simplex and is trained against its corresponding target to minimize the Bellman error, as represented below in equation (11).
Here, α.sub.j≥0, such that Σ.sub.jα.sub.j=1, i.e., (α.sub.1, . . . α.sub.J) represents a probability distribution over the standard (J−1)-simplex. While REM prevents the effect of outliers and can effectively address imbalances in the offline dataset D, offline Q-learning algorithms suffer from action distribution shift caused by bias towards out-of-distribution actions with over-estimated Q-values. This is because the Q-value iteration in the Bellman equation uses actions from the target policy π being learned, while the Q-function is trained on action-value pairs generated from D using the behavior policy μ. To avoid this problem of over-estimation of Q-values for out-of-distribution actions, an example embodiment of the present disclosure utilizes a conservative variant of offline DQN, called Conservative Q-learning (CQL), that learns a conservative, lower-bound Q-function by (i) minimizing Q-values computed using REM under the target policy distribution π, and (ii) introducing a Q-value maximization term under the behavior policy distribution μ. From Equation (10), the iterative update for training the Q-function using CQL and REM can be represented as follows:
[0087] Here, {circle around (L)}.sub.i(θ.sub.i, π) and {circle around (Q)}.sup.π(s, a; θ.sub.i) are as defined in Equation (11).
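REM's convex mixing of Q-estimates can be sketched as follows (the head count and values are illustrative, and the CQL terms are omitted): a random point on the (J−1)-simplex weights J per-head Q-value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def rem_mixture(q_heads):
    """q_heads: array of shape (J, n_actions) of per-head Q estimates.
    Returns a random convex combination of the J estimates."""
    J = q_heads.shape[0]
    alpha = rng.random(J)
    alpha /= alpha.sum()    # alpha_j >= 0, sum to 1: a point on the simplex
    return alpha @ q_heads  # convex combination, one mixed Q per action

q_heads = np.array([[1.0, 2.0], [3.0, 4.0]])
mixed = rem_mixture(q_heads)
# A convex combination lies between the per-head minimum and maximum.
assert np.all(mixed >= q_heads.min(axis=0)) and np.all(mixed <= q_heads.max(axis=0))
```

Training against many such random mixtures acts like an ensemble, which is what dampens the effect of outliers in the offline dataset.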
[0088] An example sequence of steps for offline Q-learning training method (Algorithm 1) is summarized below.
TABLE-US-00001
Algorithm 1 (offline Q-learning training):
1: Store offline data (generated from ns-3) using any handover algorithm and behavior policy μ into replay buffer D consisting of UE-specific records (∀u∈U)
2: while D not empty and value iteration i
3: Begin training step:
4: Select a batch of 2.sup.x.sup.
[0089] After the offline Q-learning training (Algorithm 1) has been completed, the Q-learning algorithm is deployed in the TS xApp for online value iteration, inference and control method (Algorithm 2), which is summarized below.
TABLE-US-00002
Algorithm 2 (online value iteration, inference and control):
1: while Incoming experience data e.sub.t for any UE u from RAN environment to near-RT RIC, for t∈[t.sub.0, ∞]
2: Append e.sub.t to replay buffer D′⊆D in AI/ML training services, with length |D′|≤|D|
3: Begin inference step:
4: Repeat steps 4 and 5 from Algorithm 1
5: Generate HO control action for u from the TS xApp over E2 to RAN environment, based on {hacek over (Q)}.sub.i.sup.π
6: Set i ← i + 1
7: end while
[0090] Described in the following sections are the simulation scenario, the baseline handover modes considered for the comparison, the metrics of interest, and the results based on a large scale evaluation in different deployment scenarios.
[0091] For the simulation scenario, a dense urban deployment is modeled, with N=1 O-eNB and M=7 gNBs. One of the gNBs is co-located with the O-eNB at the center of the scenario, and the others provide coverage in a hexagonal grid. Each node has an independent E2 termination, with the reporting periodicity set to 100 ms. In an example embodiment, two configurations are studied: (i) low band, with a center frequency of 850 MHz and an inter-site distance between the gNBs of 1700 m; and (ii) C-band, with a center frequency of 3.5 GHz and an inter-site distance of 1000 m. In each configuration, the bandwidth is 10 MHz for the O-eNB and 20 MHz for the gNBs. The channel is modeled as a 3GPP Urban Macro (UMa) channel. The 3GPP NR gNBs use numerology 2. N.sub.UE=|U| dual-connected UEs are randomly dropped in each simulation run with a uniform distribution, and move according to a random walk process with minimum speed S.sub.min=2.0 m/s and maximum speed S.sub.max=4.0 m/s.
[0092] In terms of traffic models according to example embodiments of the present disclosure, it is provided that the users request downlink traffic from a remote server with a mixture of four traffic models, each assigned to 25% of the UEs, i.e., the traffic models include: (i) full buffer Maximum Bit Rate (MBR) traffic, which saturates at R.sub.fb,max=20 Mbit/s, to simulate file transfer or synchronization with cloud services; (ii) bursty traffic with an average data rate of R.sub.b,max=3 Mbit/s, to model video streaming applications; and (iii) two bursty traffic models with an average data rate of 750 Kbit/s and 150 Kbit/s, for web browsing, instant messaging applications, and Guaranteed Bit Rate (GBR) traffic (e.g., phone calls). The two bursty traffic models feature on and off phases with a random exponential duration.
[0093] In terms of baseline handover strategies according to example embodiments of the present disclosure, three baseline handover models are considered (and/or utilized) for training the AI agent and for evaluating its effectiveness. The three models, which represent different strategies used for handovers in cellular networks, include: RAN RRM heuristic; SON1; and SON2. RAN RRM heuristic decides to perform a handover if a target cell has a channel quality metric (e.g., in this case, the SINR) above a threshold (e.g., 3 dB) with respect to the current cell. The SON1 and SON2 algorithms use more advanced heuristics, based on a combination of a threshold and a Time-to-Trigger (TTT). The SON1 algorithm assumes a fixed TTT, i.e., the handover is triggered only if the target cell SINR is above a threshold (e.g., 3 dB) for a fixed amount of time (e.g., 110 ms). The SON2 algorithm uses a dynamic TTT, which is decreased proportionally to the difference between the target and current cell SINR.
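The SON1-style fixed-TTT rule can be sketched as follows (counting the TTT in reporting-period samples, with illustrative threshold and window values, is a simplification of the time-based trigger described above):

```python
def son1_handover(sinr_gap_history, threshold_db=3.0, ttt_samples=3):
    """sinr_gap_history: recent (target - serving) SINR gaps in dB, one
    sample per reporting period, newest last. Trigger a handover only if
    the gap exceeded the threshold for the whole time-to-trigger window."""
    if len(sinr_gap_history) < ttt_samples:
        return False  # not enough history to satisfy the TTT yet
    return all(gap > threshold_db for gap in sinr_gap_history[-ttt_samples:])

assert son1_handover([4.0, 4.5, 5.0]) is True    # sustained gap: trigger
assert son1_handover([4.0, 2.0, 5.0]) is False   # gap dipped: no trigger
```

A SON2-style variant would instead shrink `ttt_samples` as the SINR gap grows, so that strongly better target cells are selected sooner.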
[0094] For the performance evaluation of the TS xApp, we utilize metrics related to throughput, channel quality, spectral efficiency, and mobility overhead. For the throughput, we report the average UE throughput at the Packet Data Convergence Protocol (PDCP) layer, i.e., including both LTE and NR split bearers, as well as the 10th and 95th percentiles (over all the users in a simulation, averaged over multiple independent runs). The channel quality is represented by the SINR. For the spectral efficiency, we analyze the average value for each UE and cell, as well as the 10th percentile, and the percentage of PRBs used for downlink traffic. Finally, we evaluate the UE mobility overhead H.sub.u as the number of handovers per unit time weighted by a throughput factor w.sub.u=R.sub.u/Σ.sub.u′∈U R.sub.u′, where R.sub.u is the average throughput for user u over the same unit time.
[0095] Data collection and agent training are discussed in this section. The data collection is based on a total of more than 2000 simulations for the different configurations, including multiple independent simulation runs for each scenario. Table 1 provides the list of RL hyperparameters (a hyperparameter is a parameter whose value is used to control the learning process) and their values as considered in an example embodiment of the present disclosure.
TABLE 1: RL hyperparameters and their values

DQN Agent (Offline):
  Target update period: 8000
  Batch size: 32
  Number of heads (n heads in FIG. 4): 200
  Number of actions (N in FIG. 4): 7
  Minimum replay history: 20000
  Terminal (episode) length: 1
  Gamma: 0.99
  Replay capacity: 1000000
  Number of iterations: 400
  Training steps: 100000
  Optimizer: AdamOptimizer
  Learning rate: 0.00005

NN (FIG. 4):
  Conv1D layer: filters = 32, kernel size = 8, strides = 8, activation = ReLU
  Flatten layer: 225 neurons
  Dense layer 1: 128 neurons
  Dense layer 2: 32 neurons
  Dense layer 3: 1400 neurons
[0096] In the offline training, the frequency with which the target network is updated is set to 8000 training steps. In an example embodiment, 400 iterations are performed during the offline training, and each iteration has 100,000 training steps, for a total of 40 million training steps. In each training step, a batch of 32 samples (or data points) is selected randomly as input to the Neural Network (NN), e.g., the CNN architecture shown in FIG. 4.
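The offline training schedule can be sketched as below. The weight-update step is a placeholder (the actual DQN loss and optimizer are not reproduced here), and reduced default counts are used so the sketch runs quickly; the full schedule would use 400 iterations of 100,000 steps.

```python
import random

def offline_training(replay, n_iterations=4, steps_per_iteration=1000,
                     batch_size=32, target_update_period=8000):
    """Sketch of the DQN offline schedule: random minibatches of 32 and a
    target-network sync every target_update_period training steps."""
    online_params, target_params = {"step": 0}, {"step": 0}  # stand-in weights
    total_steps, target_updates = 0, 0
    for _ in range(n_iterations):
        for _ in range(steps_per_iteration):
            batch = random.sample(replay, batch_size)  # random minibatch
            online_params["step"] = total_steps        # placeholder "update"
            total_steps += 1
            if total_steps % target_update_period == 0:
                target_params = dict(online_params)    # sync target network
                target_updates += 1
    return total_steps, target_updates
```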
[0097] In the CNN architecture shown in FIG. 4, the flatten layer (225 neurons) involves taking the pooled feature map that is generated (e.g., in a pooling step, which is not explicitly shown, after the convolution layer 4001) and transforming it into a one-dimensional vector. Third layer 4005, fourth layer 4006, and fifth layer 4007 are fully connected layers (where each input is connected to all neurons), with 128 units for the third layer 4005, 32 units for the fourth layer 4006, and 1400 units for the fifth layer 4007. The number of units in the fifth layer 4007 is given by the product of n=200 (the number of heads of the REM) and the number of actions N=7. We use the Adam optimizer with a learning rate of 0.00005.
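A shape walk-through of this architecture is sketched below as a plain NumPy forward pass with random weights. The input length of 1,800 samples is an assumption, chosen because a kernel of 8 with stride 8 then yields the 225 positions referenced for the flatten stage; the 225 steps x 32 filters are flattened before the dense layers, and the 1400 outputs are reshaped into 200 REM heads of 7 actions each.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, stride):
    """Valid 1-D convolution with ReLU: x is (L, C_in), w is (K, C_in, C_out)."""
    k, _, c_out = w.shape
    out_len = (x.shape[0] - k) // stride + 1
    out = np.empty((out_len, c_out))
    for i in range(out_len):
        out[i] = np.tensordot(x[i * stride:i * stride + k], w,
                              axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU activation

def forward(x):
    """Forward pass with fresh random weights; illustrates shapes only."""
    h = conv1d(x, rng.standard_normal((8, 1, 32)), stride=8)   # -> (225, 32)
    h = h.reshape(-1)                                          # flatten
    for units in (128, 32):                                    # hidden dense + ReLU
        h = np.maximum(rng.standard_normal((units, h.size)) @ h, 0.0)
    h = rng.standard_normal((1400, h.size)) @ h                # output dense
    return h.reshape(200, 7)  # n = 200 REM heads x N = 7 actions

q_heads = forward(rng.standard_normal((1800, 1)))
```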
[0098] In this section, we discuss the results obtained after the training and the online testing of the xApp as described above. The RL agent was tested in simulations with the baseline handover (HO) strategies disabled. The experiments were repeated with different numbers of UEs, yielding around 600,000 records for Frequency Range 1 (FR1) 850 MHz and around 300,000 records for FR1 C-band in the online evaluation.
[0099] Moreover, by looking at the percentiles of the user throughput, it can be seen that our RL agent brings consistent improvement not only for the average UE, but also for the worst users (10th percentile user throughput, FIG. 6c), which show a 30% improvement, and the best users (95th percentile user throughput).
[0100] The above-listed improvements in the throughput could, however, eventually come at a major cost in terms of HO management, and thus energy expenditure. The mobility overhead H.sub.u is therefore also evaluated.
[0102] In summary, the present disclosure provides a complete, system-level, O-RAN-compliant framework for the optimization of TS in 3GPP networks. More specifically, the present disclosure provides a method of throughput maximization through the selection of the NR serving cell in an EN-DC setup. A cloud-native near-RT RIC is implemented, which is connected through open O-RAN interfaces to a simulated RAN environment in ns-3. In addition, the present disclosure provides a custom xApp for the near-RT RIC, with data-driven handover control based on REM and CQL. Finally, the performance of the agent on a large-scale deployment in multiple frequency bands is profiled, evaluating its gain over traditional handover heuristics.