Self-powered integrated sensing and communication interactive method of high-speed railway based on hierarchical deep reinforcement learning
20230196119 · 2023-06-22
Assignee
Inventors
- Fengye Hu (Changchun, CN)
- Zhuang Ling (Changchun, CN)
- Tanda Liu (Changchun, CN)
- Hailong Li (Changchun, CN)
- Zhijun Li (Changchun, CN)
- Wuliji Nashun (Changchun, CN)
- Difei Jia (Changchun, CN)
- Long Lv (Changchun, CN)
- Qiang Li (Changchun, CN)
CPC classification
H04B7/22
ELECTRICITY
G06N3/006
PHYSICS
B61L27/70
PERFORMING OPERATIONS; TRANSPORTING
Y02T10/40
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
The present invention provides a self-powered integrated sensing and communication (ISAC) interactive method for high-speed railway based on hierarchical deep reinforcement learning (HDRL), including: constructing an integrated system framework for passive sensing and communication of a high-speed train, in which the passive sensors acquire train status information and the access point (AP) senses this status information from the sensors; during remote communication between the AP and the base station (BS), applying a Gaussian mixture model (GMM) clustering method to obtain reference handover triggering points and complete the communication handover; and proposing an option-based HDRL algorithm to train the high-speed train agent, implementing dynamic autonomous switching between information sensing and remote communication so as to minimize the task completion time and ensure timely charging of the sensors. The present invention integrates passive sensing and remote communication.
Claims
1. A self-powered integrated sensing and communication (ISAC) interactive method of a high-speed train based on hierarchical deep reinforcement learning (HDRL), comprising the following steps: S1. Constructing an integrated system framework for passive sensing and communication of the high-speed train, which includes high-speed train carriages and base stations (BSs) for communicating. The high-speed train carriages comprise an access point (AP) structure and passive sensor structures. The former is mainly used for receiving train status information and communicating remotely with the base stations described before. The functions of the latter include harvesting wireless radio frequency (RF) energy from the AP and sending sensed train status information to the AP. S2. Establishing an information sensing and remote communication integrated model, which consists of an RF energy harvesting model for the passive sensors, an information sensing model for the train AP and a remote communication model between the AP and the BS. The Gaussian mixture model (GMM) clustering method divides the remote communication handover triggering area and obtains reference handover triggering points. Handover of the BS that communicates with the AP occurs at these reference points during the movement of the high-speed train. S3. Formulating the integrated optimization problem: establishing a joint optimization model with task completion time minimization as the objective function, and training the joint optimization model by applying the HDRL algorithm, so as to solve the optimal configuration of the system framework and acquire the optimal interaction policy. The task completion time includes the time duration during which the passive sensors harvest energy, the time duration during which the AP senses information and the time duration during which the AP communicates with the remote BS.
In conclusion, the joint optimization model represents the dynamic autonomous switching process of energy transmission, information sensing and remote communication of high-speed trains.
2. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 1, wherein the establishment of the nonlinear energy harvesting model for the passive sensors includes: based on the process of the AP sending RF signals of given unit power to the passive sensors at a given transmission power, constructing the energy signal model received by the passive sensors; after the passive sensors receive the RF signals, the RF energy is utilized to charge their own circuits, and ultimately the nonlinear energy harvesting model for the passive sensors is established; and the establishment of the information sensing model for the train AP includes: after the completion of data acquisition, using backscatter communication technology, the AP senses the information of the data collected by the passive sensors, the information sensing signal model received by the AP is built, and the transmission rate model of the signal is proposed.
3. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 1, wherein the remote communication model between the AP and the BS comprises: constructing the transmission rate model of the BS when receiving signals from the AP; the transmission rates of the signals received by the BSs are composed of multiple Gaussian distribution hybrid vectors with given parameters, which form a Gaussian mixture model describing the probability distribution of the reference handover triggering points, and the handover range of the AP-BS communication is divided based on the clustering results of the Gaussian mixture model; by fitting the relationship among transmission rate, speed and time, the predicted values and the corresponding distributions of the reference handover triggering points are obtained, so that the current reference handover triggering point prediction serves as prior information for the next update calculation; the location of each communication handover triggering area center is determined by the mean vectors of the Gaussian distribution hybrid vectors that comprise the transmission rate, and the reliability of the prediction is determined by their covariance vectors, which are represented by the shape and size of the communication handover triggering area.
4. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 1, wherein the option-based HDRL algorithm applies a semi-Markov decision process (SMDP) to model the high-speed train sensing and communication scenario, including state sets, action sets, option sets, transition probabilities, total rewards and a reward discount factor; the high-speed train AP, as a single agent, learns a policy over options; the AP selects an option based on its initial state at the beginning of the task and then executes actions according to the policy of the selected option; at the moment the option ends and the total reward for the selected option is received, the next option to be executed is chosen according to the policy based on the current state information, and so on until the end of the task.
5. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 4, wherein the state set of the high-speed train AP includes: the remote communication connection probability, the high-speed train position, the remaining energy of each sensor and the percentage of information sensed by the AP from each sensor; the action set includes three actions: energy transmission from the AP to the sensor, information sensing by the AP and remote communication between the AP and the BS; similarly, the option set contains three options: information sensing, energy transmission and remote communication; and the total reward of an option is divided into three categories: the energy remaining reward, the information sensing reward and the remote communication reward.
6. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 4, wherein the high-speed train AP receives the total reward for each option at the end moment of that option, and the total reward is a function of the initial state of the option and the selected option; more specifically, the energy remaining reward punishes operation with insufficient power during the execution of that option, the information sensing reward punishes the AP for the repeated selection of a passive sensor that has completed acquisition, and the remote communication reward punishes the AP for repeatedly selecting a BS that has completed communication handover.
7. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 4, wherein the option-based HDRL algorithm first inputs the current state information into the option-value neural network, whose output is the option probability; the agent then derives the optimal option via ε-greedy selection, choosing either a random option or the index of the largest value; finally, the agent outputs the corresponding action according to the policy and termination condition of the selected option.
8. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 7, wherein the option-value neural network of the option-based HDRL algorithm has one input layer, five hidden layers and one output layer; the input layer receives state information and option rewards, and the hidden layers comprise five fully connected layers; the Rectified Linear Unit (ReLU) is employed as the activation function for all hidden layers, and the Softmax normalized exponential function is employed for the output layer to obtain the option probability.
9. The self-powered ISAC interactive method of high-speed railway based on the HDRL algorithm according to claim 7, wherein the option-value neural network of the option-based HDRL algorithm is trained by using experience replay with random sampling of experiences, and the update of the option-value neural network parameters is completed by computing the gradient of the loss function.
Description
DESCRIPTION OF DRAWINGS
[0032] To more clearly describe the technical solution in the embodiments of the present invention or in the prior art, the drawings required to be used in the description of the embodiments or the prior art will be simply presented below. Apparently, the drawings in the following description are merely the embodiments of the present invention, and for those ordinary skilled in the art, other drawings can also be obtained according to the provided drawings without contributing creative labor.
DETAILED DESCRIPTION
[0036] The technical solution will be clearly and fully described below in combination with the drawings in the embodiments of the present invention. Apparently, the described embodiments are merely part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments in the present invention, all other embodiments obtained by those ordinary skilled in the art without contributing creative labor will belong to the protection scope of the present invention.
[0037] Embodiments of the present invention disclose a self-powered integrated sensing and communication interactive method of the high-speed train based on hierarchical deep reinforcement learning (HDRL), as shown in the accompanying drawings, comprising the following steps:
[0038] S1: Constructing an integrated system framework for self-powered passive sensing and communication of high-speed trains, which includes high-speed train carriages and base stations (BSs) for communicating. High-speed train carriages comprise an access point (AP) structure and a passive sensor structure. The former is mainly used for receiving train status information and communicating remotely with BS described before. The latter functions include harvesting wireless radio frequency (RF) energy from the AP and sending sensed train status information to the AP.
[0039] S2: Establishing an information sensing and remote communication model, which consists of an RF energy harvesting model for the passive sensors, an information sensing model for the train AP, and a remote communication model between the AP and the BS. The Gaussian mixture model (GMM) clustering method is utilized to divide the remote communication handover triggering area and obtain reference handover triggering points. Handover of the BS that communicates with the AP occurs at these reference points during the movement of high-speed trains.
[0040] S3: Formulating the integrated optimization problem: establishing a joint optimization model with task completion time minimization as the objective function, and training the joint optimization model by applying the HDRL algorithm, so as to solve the optimal configuration of the system framework and acquire the optimal interaction policy. The task completion time includes: the time duration during which the passive sensors harvest energy, the time duration during which the AP senses information and the time duration during which the AP communicates with the remote BS. In conclusion, the joint optimization model represents the dynamic autonomous switching process of energy transmission, information sensing and remote communication.
[0041] In step S1 of this embodiment, an integrated system framework for passive sensing and communication of high-speed trains is constructed, which includes high-speed train carriages and BSs for communicating with the high-speed trains remotely. The high-speed train carriages comprise an AP, for transmitting RF energy, sensing information and communicating with the BS, and battery-free passive sensors. Each sensor first collects RF energy from the AP for sensing train status information, and then the AP senses the train status information through low-power backscatter communication technology. In the process of remote communication between the AP and the BS, the reference handover triggering points are obtained based on the GMM clustering method to complete communication handover during the operation of the high-speed train. The scenario assumed by the present invention has practical reference value.
[0042] In step S2 of this embodiment, the information sensing and remote communication model is established:
[0043] (1) Information Sensing Model
[0044] The AP sends the unit power RF signal to the sensor at transmission power p.sub.m, and the sensor receives the energy signal as:
y.sub.S,m=√{square root over (p.sub.m)}h.sub.me.sub.m+n.sub.s
[0045] where e.sub.m is the energy signal, h.sub.m is the downlink channel gain from the AP to the sensor, n.sub.s is the noise, which follows the circularly symmetric complex Gaussian (CSCG) distribution.
[0046] After the sensor receives the RF signal, the RF energy is used to charge its own circuit and acquire the sensor data. In a high-speed railway system, the nonlinear energy harvesting model can be expressed as:
P.sub.H,m=(Ψ.sub.m−P.sub.maxΩ)/(1−Ω)
[0047] where P.sub.H,m denotes the power of the nonlinear energy harvesting model, Ω=1/(1+exp(ab)) is the auxiliary variable, Ψ.sub.m=P.sub.max/(1+exp(−a(p.sub.m|h.sub.m|.sup.2−b))) is the auxiliary function, a and b are parameters characterized by the circuit, and P.sub.max is the maximum transmission power.
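The saturating behavior of such a nonlinear harvesting model can be illustrated with a short numerical sketch. The logistic form and all parameter values (a, b, p_max) below are illustrative assumptions, not the patent's calibrated circuit values:

```python
import math

def harvested_power(p_in, a=150.0, b=0.014, p_max=0.024):
    """Sigmoidal nonlinear energy-harvesting model (hypothetical parameters).

    p_in  : RF input power reaching the sensor (W)
    a, b  : circuit-dependent shape parameters
    p_max : saturation power of the harvesting circuit (W)
    """
    omega = 1.0 / (1.0 + math.exp(a * b))             # auxiliary variable
    psi = p_max / (1.0 + math.exp(-a * (p_in - b)))   # auxiliary logistic function
    return (psi - p_max * omega) / (1.0 - omega)
```

With these parameters, zero input power harvests zero power, and the harvested power increases monotonically and saturates near p_max as the input grows, matching the nonlinear (non-proportional) behavior the model is meant to capture.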
[0048] After the completion of sensor's data acquisition, the AP applies backscatter communication technology for the information sensing of data collected by the sensor, and the sensing signal received at the AP is denoted as:
y.sub.A,m=√{square root over (α.sub.m)}√{square root over (p.sub.m)}g.sub.mh.sub.me.sub.mx.sub.m+n.sub.A
[0049] where α.sub.m is the backscatter proportion, x.sub.m is the collected data signal at the sensor, and g.sub.m is the uplink channel gain from the sensor to the AP. n.sub.A is the circuit noise, which follows the CSCG distribution, and σ.sub.A.sup.2 refers to its noise power. The n.sub.s term is neglected by comparison, since the noise at the sensor is negligible due to its low power consumption. The transmission rate for the sensing signal received at the AP can be formulated as:
r.sub.A,m=B log.sub.2(1+α.sub.mp.sub.m|g.sub.m|.sup.2|h.sub.m|.sup.2/(p.sub.m|h.sub.m|.sup.2+σ.sub.A.sup.2))
[0050] where B is the channel bandwidth, and p.sub.m|h.sub.m|.sup.2 denotes the interference from other sensors to the AP.
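Both this sensing-rate model and the later remote-communication rate model take the standard Shannon form r = B·log.sub.2(1 + S/(I + N)), as implied by the bandwidth, interference and noise-power terms defined around them. A minimal sketch (all numeric values are illustrative):

```python
import math

def shannon_rate(bandwidth_hz, signal_power, interference_power, noise_power):
    """r = B * log2(1 + S / (I + N)): achievable rate under interference plus noise."""
    sinr = signal_power / (interference_power + noise_power)
    return bandwidth_hz * math.log2(1.0 + sinr)

# With S = 3, I = 0, N = 1, the SINR is 3 and the spectral efficiency is log2(4) = 2.
rate = shannon_rate(1.0, 3.0, 0.0, 1.0)
```

The same helper applies to the AP sensing link (with the backscattered signal power in the numerator) and to the AP-BS link (with the ICI term in the denominator).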
[0051] (2) Remote Communication Model
[0052] During the high-speed train operation, the AP needs to maintain communication with the BS. The communication signal received by the BS can be given by:
y.sub.B,n=√{square root over (p.sub.n)}l.sub.nz.sub.n+n.sub.B
[0053] where z.sub.n is the unit power information signal transmitted by the AP, p.sub.n refers to the transmission power of the AP. In addition, n.sub.B is the BS's noise, which follows the CSCG distribution, σ.sub.B.sup.2 is the noise power. The channel gain l.sub.n between the BS and the AP under the high-speed railway communication scenario can be denoted as:
l.sub.n=ζ exp(−j2πf.sub.cτ.sub.LOS)
[0054] where ζ stands for the large-scale fading coefficient, and f.sub.c is the carrier frequency. τ.sub.LOS=∥D.sub.Tx−D.sub.Rx∥/c refers to the arrival time of the LOS link, where ∥·∥ is the 2-norm, D.sub.Tx and D.sub.Rx represent the real-time positions of the AP and the BS, and c is the speed of light. D.sub.Tx is related to the initial position, running speed and operation time of the high-speed train.
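The LOS channel gain l.sub.n above can be computed directly from the AP and BS positions. A sketch, in which the carrier frequency and the large-scale fading value ζ are placeholder assumptions:

```python
import numpy as np

def los_channel_gain(d_tx, d_rx, zeta=1e-3, f_c=2.4e9, c=3.0e8):
    """l_n = zeta * exp(-j * 2*pi * f_c * tau_LOS), with tau_LOS = ||D_Tx - D_Rx|| / c."""
    # Propagation delay of the LOS link between the (moving) AP and the BS.
    tau_los = np.linalg.norm(np.asarray(d_tx, float) - np.asarray(d_rx, float)) / c
    return zeta * np.exp(-1j * 2.0 * np.pi * f_c * tau_los)

g = los_channel_gain([0.0, 0.0], [300.0, 40.0])
```

As the train moves, D.sub.Tx changes with time, rotating the phase of l.sub.n, while the magnitude stays at the large-scale coefficient ζ.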
[0055] In order to effectively evaluate the communication condition between the AP and the BS, it is necessary to study the transmission rate of the BS when receiving signals from the AP:
r.sub.B,n=B log.sub.2(1+p.sub.n|l.sub.n|.sup.2/(P.sub.ICI|l.sub.n|.sup.2+σ.sub.B.sup.2))
[0056] where P.sub.ICI|l.sub.n|.sup.2 denotes the inter-carrier interference (ICI). Since the ICI power caused by the Doppler shift is not coordinated among different subcarriers, only its average impact is considered and it is treated as part of the white noise.
[0057] To meet the requirements of high-quality wireless communication services during high-speed train operation, the present invention analyzes the communication handover triggering area with a GMM so as to derive the reference handover triggering point in advance. The Gaussian mixture model, which consists of multiple Gaussian components, can be used to describe the probability distribution of the reference handover triggering point. Assuming that the received signal transmission rates of all BSs are composed of K Gaussian distribution hybrid vectors with given parameters, and i represents the index of the location, the Gaussian mixture probability density function is expressed as:
p(r.sub.i|Θ)=Σ.sub.k=1.sup.Kζ.sub.kN(r.sub.i|μ.sub.k,Σ.sub.k)
where N(r.sub.i|μ.sub.k,Σ.sub.k) is the Gaussian density function, Θ={μ.sub.k, Σ.sub.k, ζ.sub.k} collects the position data sequence and model parameters, in which ζ.sub.k is the weight obeying Σ.sub.k=1.sup.Kζ.sub.k=1, and μ.sub.k and Σ.sub.k are the mean and covariance vectors of the Gaussian distribution hybrid vector r.sub.i, respectively.
[0058] Supposing the training signal set obtained by sampling is R={r.sub.1, r.sub.2, . . . , r.sub.i, . . . , r.sub.l}, the log-likelihood function of the training signal is:
ln L(Θ)=Σ.sub.i=1.sup.l ln(Σ.sub.k=1.sup.Kζ.sub.kN(r.sub.i|μ.sub.k,Σ.sub.k))
[0059] For a given training signal set and a given number of communication areas, the parameters Θ={μ.sub.k, Σ.sub.k, ζ.sub.k} are estimated by maximizing the log-likelihood function with an expectation maximization (EM) algorithm. The handover communication range between the AP and the BSs is divided based on the clustering results of the Gaussian mixture model. The present invention sets the position of the train's starting point as the initial value for the update of the GMM algorithm, and calculates the predicted values and their distributions by fitting the relationship among transmission rate, speed and time. In the update process, the train reports a set of data rates, after which a handover triggering point is calculated and its distribution is updated; the result is then used as prior information for the next update calculation. The location of each handover triggering area center is determined by the parameter μ.sub.k, and the reliability of the predicted value is determined by the covariance Σ.sub.k, represented by the shape and size of the handover triggering area.
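The EM estimation of Θ={μ.sub.k, Σ.sub.k, ζ.sub.k} and the resulting triggering-area centers can be sketched in one dimension with synthetic handover samples. The two cluster positions, sample counts and initial values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set R: track positions (km) at which handovers were observed,
# clustered around two hypothetical reference triggering points (1.5 km and 4.0 km).
r = np.concatenate([rng.normal(1.5, 0.10, 200), rng.normal(4.0, 0.15, 200)])

K = 2
mu = np.array([1.0, 5.0])      # initial means
var = np.ones(K)               # initial variances
w = np.full(K, 1.0 / K)        # mixture weights zeta_k (sum to 1)

for _ in range(100):           # EM iterations maximizing the log-likelihood
    # E-step: responsibility gamma[i, k] of component k for sample i
    dens = w * np.exp(-0.5 * (r[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means (triggering-area centers) and variances
    nk = gamma.sum(axis=0)
    w = nk / len(r)
    mu = (gamma * r[:, None]).sum(axis=0) / nk
    var = (gamma * (r[:, None] - mu) ** 2).sum(axis=0) / nk

centers = np.sort(mu)          # mu_k: centers of the handover triggering areas
```

Here each fitted mean μ.sub.k locates a triggering-area center, and the fitted variance plays the role of the covariance Σ.sub.k that expresses the reliability (size and shape) of the area.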
[0060] In step S3 of this embodiment, the time duration τ.sub.m.sup.c during which the passive sensors harvest energy, the time duration τ.sub.m.sup.d during which the AP senses information, and the time duration τ.sub.n.sup.r during which the AP communicates with the remote BS are combined to formulate the integrated optimization problem based on the integrated system framework for passive sensing and communication of the high-speed train. To solve the optimal configuration of the system framework and acquire the optimal interaction policy, the goal is to minimize the total task completion time under multiple constraints on the information sensing rate, the remote transmission rate and the energy consumption:
min Σ.sub.m(τ.sub.m.sup.c+τ.sub.m.sup.d)+Σ.sub.nτ.sub.n.sup.r s.t. C1-C4
[0061] where C1, C2, C3 and C4 are constraints. C1 denotes the AP information sensing rate constraint r.sub.A,m≥r.sub.A,min, ensuring that the AP successfully senses the train state information, where r.sub.A,m is the AP information sensing rate and r.sub.A,min is its lower bound; C2 denotes the AP remote transmission rate constraint r.sub.B,m≥r.sub.B,min, ensuring the remote communication between the AP and the BS, where r.sub.B,m is the AP remote transmission rate and r.sub.B,min is its lower bound; C3 denotes the energy constraint of the AP, guaranteeing that the AP operates properly, where E.sub.n.sup.total is the total energy of the AP, E.sub.T,n is the energy consumed by the AP to charge the sensors, and E.sub.C,n is the energy consumed by remote communication of the AP; C4 denotes the energy consumption constraint of the sensor: the energy E.sub.H,m harvested by the passive sensor should cover the energy consumed by its data acquisition, ensuring stable operation of the sensor.
[0062] In step S3 of this embodiment, the option-based HDRL algorithm applies an SMDP to model the high-speed train sensing and communication scenario, including state sets, action sets, option sets, transition probabilities, total rewards and a reward discount factor. The high-speed train AP, as a single agent, learns a policy over options. The AP selects an option based on its initial state at the beginning of the task and then executes actions according to the policy of the selected option. At the moment the option ends and the total reward for the selected option is received, the agent chooses the next option to be executed according to the policy based on the current state information, until the end of the task.
[0063] It is worth noting that in a conventional Markov decision process (MDP), the system needs to choose actions when the system state changes. However, in option-based HDRL, the state may change several times between two decision epochs, while only the state at the decision epoch is relevant to the system.
[0064] Compared with the conventional MDP, embodiments of the present invention utilize a semi-Markov decision process (SMDP) to model the high-speed train sensing and communication scenario. The SMDP is a six-tuple <S, A, O, P, R, γ>, where S, A and O represent the sets of states, actions and options, respectively, P is the transition probability set, R is the total reward set, and γ is the reward discount factor. As a single agent, the high-speed train AP learns a policy over options. The AP selects an option o.sub.0 based on its initial state s.sub.0 at the beginning of the task and then executes actions according to the policy π of the selected option o.sub.0. At the moment t when the option o.sub.0 ends and the total reward R.sub.t for the selected option is received, the next option o.sub.t is selected according to the policy ω based on the state information s.sub.t, and so on until the end of the task.
[0065] In this embodiment, the state of the high-speed train AP includes four components, i.e., S={S.sub.1, S.sub.2, S.sub.3, S.sub.4}. In more detail, S.sub.1 is the set of remote communication connection probabilities, where B={B.sub.1, . . . , B.sub.n, . . . , B.sub.N}∈ S.sub.1 are the probability vectors and B.sub.n∈ [0,1] refers to the connection probability to the corresponding BS; S.sub.2 is the set of train locations, which is related to the two-dimensional coordinates of the remote communication link; S.sub.3 is the set of the remaining energy of each sensor; and S.sub.4 is the set of information sensing states from the AP to sensor m, i.e., D={D.sub.1, . . . , D.sub.m, . . . , D.sub.M}∈ S.sub.4, where D.sub.m ∈ [0,1] is the data acquisition ratio.
[0066] In this embodiment, the AP action set A contains three basic actions: energy transmission from the AP to the sensor A.sub.c, the AP information sensing A.sub.d and the remote communication between the AP and the BS A.sub.r.
[0067] In this embodiment, the set of options O executed by the high-speed train AP contains three options: information sensing o.sub.d, energy transmission o.sub.c and remote communication o.sub.r, i.e., O={o.sub.r, o.sub.d, o.sub.c}, where o.sub.d={o.sub.1, . . . , o.sub.m, . . . , o.sub.M} indicates that the AP senses information from sensor m, o.sub.c denotes that the AP transmits energy to the sensors, and o.sub.r={o.sub.1, . . . , o.sub.n, . . . , o.sub.N} denotes that the AP communicates with the BS remotely. Each option can be considered as a series of actions, each represented by a three-tuple <I, π, β>, and the set of options that can be selected by the AP in any state lies within the option set O, i.e., I=S. In the present invention, the intra-option policy for selecting actions is set as a known deterministic policy π, and the termination condition for any option is that the system finishes all of its actions.
[0068] Specifically, for the information sensing option o.sub.d, the policy is that the AP collects data from the sensor via backscatter communication until the data acquisition is completed, at which point the current option ends; for the energy transmission option o.sub.c, the policy is that the AP charges the sensor using the RF signal in the form of broadcast communication until the sensor is fully charged, at which point the current option ends; for the remote communication option o.sub.r, the policy is that the AP communicates with the BS remotely until the handover is completed, at which point the current option ends. In the simulation, the intra-option policy for selecting actions does not need to be trained.
[0069] The total reward R.sub.t of each option is obtained by the AP at the end time t of the corresponding option. R.sub.t is a function of the initial state s.sub.t and the selected option o.sub.t. The total reward of an option is divided into the energy remaining reward R.sub.E, the information sensing reward R.sub.D and the remote communication reward R.sub.T. The energy remaining reward R.sub.E punishes the sensor running out of battery, i.e., R.sub.E=φ.sub.E if the sensor exhausts its remaining energy during the option, and R.sub.E=0 otherwise,
[0070] where φ.sub.E is a negative constant and E.sub.r represents the remaining energy. The information sensing reward R.sub.D punishes the AP for the repeated selection of a passive sensor that has completed acquisition, i.e., R.sub.D=φ.sub.D if such a sensor is reselected, and R.sub.D=0 otherwise,
[0071] where φ.sub.D is a negative constant. Analogously, the remote communication reward R.sub.T punishes the AP for repeatedly selecting a BS that has completed communication handover.
[0072] Ultimately, the total reward for an agent after an option is the sum of the above three rewards, i.e., R.sub.t=R.sub.E+R.sub.D+R.sub.T.
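The three penalty terms and their sum R.sub.t = R.sub.E + R.sub.D + R.sub.T can be sketched as follows. The constants φ.sub.E, φ.sub.D, φ.sub.T are placeholder values; the description only requires them to be negative:

```python
PHI_E = -10.0   # energy penalty (hypothetical negative constant)
PHI_D = -5.0    # repeated-sensor penalty (hypothetical negative constant)
PHI_T = -5.0    # repeated-BS penalty (hypothetical negative constant)

def option_reward(remaining_energy, sensor_done, reselect_sensor, bs_done, reselect_bs):
    """Total option reward R_t = R_E + R_D + R_T (sketch of the three penalty terms)."""
    r_e = PHI_E if remaining_energy <= 0 else 0.0               # ran out of power mid-option
    r_d = PHI_D if (sensor_done and reselect_sensor) else 0.0   # re-sensed a finished sensor
    r_t = PHI_T if (bs_done and reselect_bs) else 0.0           # re-selected a handed-over BS
    return r_e + r_d + r_t
```

An option that finishes with energy to spare and no redundant selections therefore incurs no penalty, while each violation subtracts its corresponding constant.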
[0073] In the embodiment, based on the Deep Q-Network (DQN) framework, an option-based HDRL algorithm is used to train the high-speed train agent in order to find the optimal policy that solves the ISAC problem. The high-speed train agent ends the previous option o.sub.t−1 during the interaction with the environment, and receives the corresponding reward R.sub.t−1 and the next state information s.sub.t. The current state information s.sub.t is input into the option-value neural network, which has one input layer, five hidden layers and one output layer. The hidden layers are five fully connected layers; the first fully connected layer contains 1024 neurons and employs the Rectified Linear Unit (ReLU) function as its activation function. The output of the first layer is:
X.sub.1=ReLU(W.sub.1.sup.Ts.sub.t+b.sub.1)
[0074] where W.sub.1 is the weight parameter of the first layer, b.sub.1 is the bias parameter. The input of the second hidden layer is the output of the first hidden layer, the second fully connected layer contains 512 neurons and the ReLU function is employed as an activation function similarly. The output of this layer is:
X.sub.2=ReLU(W.sub.2.sup.TX.sub.1+b.sub.2)
[0075] where W.sub.2 is the weight parameter of the second layer, b.sub.2 is the bias parameter. The input of the third hidden layer is the output of the second hidden layer, the third fully connected layer contains 256 neurons and the ReLU function is employed as an activation function similarly. The output of this layer is:
X.sub.3=ReLU(W.sub.3.sup.TX.sub.2+b.sub.3)
[0076] where W.sub.3 is the weight parameter of the third layer, b.sub.3 is the bias parameter. The input of the fourth hidden layer is the output of the third hidden layer, the fourth fully connected layer contains 128 neurons and the ReLU function is employed as an activation function similarly. The output of this layer is:
X.sub.4=ReLU(W.sub.4.sup.TX.sub.3+b.sub.4)
[0077] where W.sub.4 is the weight parameter of the fourth layer, b.sub.4 is the bias parameter. The input of the fifth hidden layer is the output of the fourth hidden layer, the fifth fully connected layer contains 64 neurons and the ReLU function is employed as an activation function similarly. The output of this layer is:
X.sub.5=ReLU(W.sub.5.sup.TX.sub.4+b.sub.5)
[0078] where W.sub.5 is the weight parameter of the fifth layer and b.sub.5 is the bias parameter. The output layer accepts the output of the fifth layer and uses the softmax activation function to output the |O|-dimensional vector o:
o=softmax(W.sub.6.sup.TX.sub.5+b.sub.6)
[0079] where W.sub.6 and b.sub.6 are the weight parameter and the bias parameter, respectively, and softmax is the normalized exponential function. The output of the option-value neural network Q.sup.option is the option probability.
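A forward pass through the described 1024-512-256-128-64 option-value network, with ReLU hidden layers and a softmax output, can be sketched in NumPy. The state dimension, number of options and random initialization below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, N_OPTIONS = 8, 3                      # hypothetical input/output sizes
SIZES = [STATE_DIM, 1024, 512, 256, 128, 64, N_OPTIONS]

# Randomly initialized (W_l, b_l) for the five hidden layers and the output layer.
params = [(rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(SIZES[:-1], SIZES[1:])]

def option_probabilities(state):
    """Forward pass: X_l = ReLU(W_l^T X_{l-1} + b_l), then softmax over options."""
    x = np.asarray(state, dtype=float)
    for w, b in params[:-1]:
        x = np.maximum(w.T @ x + b, 0.0)         # ReLU hidden layers
    w, b = params[-1]
    z = w.T @ x + b
    z -= z.max()                                 # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

p = option_probabilities(rng.normal(size=STATE_DIM))
```

The softmax output p is the option-probability vector over O, from which the next option is selected.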
[0080] The optimal option is selected by using the ε-greedy algorithm, where ε is a small value between 0 and 1: with probability ε an option is chosen at random, and with probability 1−ε the greedy algorithm is used, i.e., the index of the largest value is selected as the option o.sub.t.
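The ε-greedy selection over the network output can be sketched as follows (the ε value is illustrative):

```python
import numpy as np

def epsilon_greedy(option_values, epsilon=0.1, rng=None):
    """With probability epsilon explore a random option; otherwise pick the argmax."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(option_values)))   # random exploration
    return int(np.argmax(option_values))               # greedy exploitation
```

With ε = 0 the choice is purely greedy; small positive ε keeps occasional exploration of the other options during training.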
[0081] The policy π and termination condition β corresponding to o.sub.t are then selected from the option set, and the agent continues interacting with the environment.
[0082] In the training of the option-based HDRL algorithm, a high-speed train experience replay buffer D={s.sub.t, o.sub.t, R.sub.t, s.sub.t+1} is set, where s.sub.t denotes the current state, o.sub.t represents the option obtained according to the current algorithm, R.sub.t is the total reward and s.sub.t+1 represents the next state to which the system is transferred according to the transition probability P. The option-value neural network Q.sup.option, also known as the evaluation network, is trained by applying experience replay and random sampling of experiences, and a target-value network Q.sup.target is set to approximate the optimal evaluation network Q.sup.option*. The loss function of the evaluation network is expressed as
Loss(θ)=E[(R.sub.t+γ max.sub.o′Q.sup.target(s.sub.t+1,o′)−Q.sup.option(s.sub.t,o.sub.t;θ)).sup.2]
[0083] In the above equation, E denotes the expectation over the experience replay buffer D, and θ represents all parameters in the option-value neural network Q.sup.option, which can be updated by:
θ.sub.new=θ.sub.old−κ∇.sub.θLoss(θ)
[0084] where κ is the learning rate, θ.sub.new and θ.sub.old denote the parameters after and before the update of the option-value network, respectively. The gradient of the loss function ∇.sub.θLoss(θ) can be expressed as
∇.sub.θLoss(θ)=−E[2(R.sub.t+γ max.sub.o′Q.sup.target(s.sub.t+1,o′)−Q.sup.option(s.sub.t,o.sub.t;θ))∇.sub.θQ.sup.option(s.sub.t,o.sub.t;θ)]
[0085] The target-value network is updated by using the parameters of the original target-value network and the current estimated network periodically, with the following updating rule:
θ.sub.new.sup.target=ρθ.sub.old.sup.target+(1−ρ)θ
[0086] where ρ∈[0,1] is the updating rate, and θ.sub.new.sup.target and θ.sub.old.sup.target denote the parameters after and before the update of the target-value network Q.sup.target, respectively.
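The periodic soft update θ.sub.new.sup.target=ρθ.sub.old.sup.target+(1−ρ)θ can be sketched as follows (the ρ value is illustrative):

```python
import numpy as np

def soft_update(theta_target, theta_eval, rho=0.995):
    """Blend target-network parameters toward the evaluation network, layer by layer."""
    return [rho * t + (1.0 - rho) * e for t, e in zip(theta_target, theta_eval)]

# One-layer example: target parameters drift slowly toward the evaluation parameters.
new_target = soft_update([np.array([1.0, -1.0])], [np.array([0.0, 0.0])], rho=0.9)
```

A ρ close to 1 keeps the target network slowly moving, which stabilizes the bootstrapped targets used in the loss above.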
[0088] The above provides a detailed description of a self-powered integrated sensing and communication interactive method of high-speed railway based on HDRL, and specific examples are applied herein to elaborate the principle and implementation of the invention. The above description is only intended to help understand the method of the invention and its core idea; meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the contents of this specification should not be construed as a limitation of the invention.
[0089] The above description of the disclosed embodiments enables those skilled in the art to realize or use the present invention. Many modifications to these embodiments will be apparent to those skilled in the art. The general principle defined herein can be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention will not be limited to these embodiments shown herein, but will conform to the widest scope consistent with the principle and novel features disclosed herein.