METHOD FOR MULTI-TIME SCALE VOLTAGE QUALITY CONTROL BASED ON REINFORCEMENT LEARNING IN A POWER DISTRIBUTION NETWORK

20220405633 · 2022-12-22

    Abstract

    A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided, which relates to the field of power system operation and control. The method includes: constituting an optimization model for multi-time scale reactive voltage control in a power distribution network based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and deciding action values of the controlled devices by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning.

    Claims

    1. A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network, comprising: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating.

2. The method as claimed in claim 1, wherein 1) determining the multi-time scale reactive voltage control object and establishing the constraints for multi-time scale reactive voltage optimization, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network, comprises: 1-1) determining the multi-time scale reactive voltage control object of the controlled power distribution network:

$$O_T=\min_{T_O,T_B}\sum_{\tilde t=0}^{\tilde T-1}\Big[C_O T_{O,\mathrm{loss}}^{(k\tilde t)}+C_B T_{B,\mathrm{loss}}^{(k\tilde t)}+C_P\min_{Q_G,Q_C}\sum_{\tau=0}^{k-1}P_{\mathrm{loss}}^{(k\tilde t+\tau)}\Big]\qquad(0.41)$$

where $\tilde T$ is the number of control cycles of the slow discrete device in one day; $k$ is an integer representing the ratio of the number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; $T=k\tilde T$ is the number of control cycles of the fast continuous device in one day; $\tilde t$ is the index of a control cycle of the slow discrete device; $T_O$ is a gear of an on-load tap changer (OLTC); $T_B$ is a gear of a capacitor station; $Q_G$ is a reactive power output of the distributed generation (DG); $Q_C$ is a reactive power output of a static var compensator (SVC); $C_O,C_B,C_P$ are respectively an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; $P_{\mathrm{loss}}^{(k\tilde t+\tau)}$ is the power distribution network loss at the moment $k\tilde t+\tau$, $\tau$ being an integer, $\tau=0,1,2,\ldots,k-1$; and $T_{O,\mathrm{loss}}^{(k\tilde t)}$ is the gear change adjusted by the OLTC at the moment $k\tilde t$, and $T_{B,\mathrm{loss}}^{(k\tilde t)}$ is the gear change adjusted by the capacitor station at the moment $k\tilde t$, which are respectively calculated by:

$$T_{O,\mathrm{loss}}^{(k\tilde t)}=\sum_{i=1}^{n_{\mathrm{OLTC}}}\Big|T_{O,i}^{(k\tilde t)}-T_{O,i}^{(k\tilde t-k)}\Big|,\quad\tilde t>0,\ i\in[1,n_{\mathrm{OLTC}}]$$

$$T_{B,\mathrm{loss}}^{(k\tilde t)}=\sum_{i=1}^{n_{\mathrm{CB}}}\Big|T_{B,i}^{(k\tilde t)}-T_{B,i}^{(k\tilde t-k)}\Big|,\quad\tilde t>0,\ i\in[1,n_{\mathrm{CB}}]$$

$$T_{O,\mathrm{loss}}^{(0)}=T_{B,\mathrm{loss}}^{(0)}=0\qquad(0.42)$$

where $T_{O,i}^{(k\tilde t)}$ is the gear set value of the $i$-th OLTC device at the moment $k\tilde t$, $n_{\mathrm{OLTC}}$ is the total number of OLTC devices, $T_{B,i}^{(k\tilde t)}$ is the gear set value of the $i$-th capacitor station at the moment $k\tilde t$, and $n_{\mathrm{CB}}$ is the total number of capacitor stations; 1-2) establishing the constraints for multi-time scale reactive voltage optimization in the controlled power distribution network, which include voltage constraints and output constraints:
$$\underline V\le V_i^{(k\tilde t+\tau)}\le\overline V,$$

$$\Big|Q_{Gi}^{(k\tilde t+\tau)}\Big|\le\sqrt{S_{Gi}^2-\big(P_{Gi}^{(k\tilde t+\tau)}\big)^2},$$

$$\underline Q_{Ci}\le Q_{Ci}^{(k\tilde t+\tau)}\le\overline Q_{Ci},$$

$$i\in N,\ \tilde t\in[0,\tilde T),\ \tau\in[0,k)\qquad(0.43)$$

where $N$ is the set of all nodes in the power distribution network; $V_i^{(k\tilde t+\tau)}$ is the voltage magnitude of the node $i$ at the moment $k\tilde t+\tau$; $\underline V,\overline V$ are a lower limit and an upper limit of the node voltage respectively; $Q_{Gi}^{(k\tilde t+\tau)}$ is the DG reactive power output of the node $i$ at the moment $k\tilde t+\tau$; $Q_{Ci}^{(k\tilde t+\tau)}$ is the SVC reactive power output of the node $i$ at the moment $k\tilde t+\tau$; $\underline Q_{Ci},\overline Q_{Ci}$ are a lower limit and an upper limit of the SVC reactive power output of the node $i$; $S_{Gi}$ is the DG installed capacity of the node $i$; $P_{Gi}^{(k\tilde t+\tau)}$ is the DG active power output of the node $i$ at the moment $k\tilde t+\tau$; and adjustment constraints:

$$1\le T_{O,i}^{(k\tilde t)}\le\overline T_{O,i},\quad\tilde t>0,\ i\in[1,n_{\mathrm{OLTC}}]$$

$$1\le T_{B,i}^{(k\tilde t)}\le\overline T_{B,i},\quad\tilde t>0,\ i\in[1,n_{\mathrm{CB}}]\qquad(0.44)$$

where $\overline T_{O,i}$ is the number of gears of the $i$-th OLTC device, and $\overline T_{B,i}$ is the number of gears of the $i$-th capacitor station; 2) constructing the hierarchical interaction training framework based on the two-layer Markov decision process, based on the optimization model established in step 1) and an actual configuration of the power distribution network, comprises: 2-1) corresponding to system measurements of the power distribution network, constructing a state observation $s$ at the moment $t$ shown in the following formula:
$$s=(P,Q,V,T_O,T_B)_t\qquad(0.45)$$

where $P,Q$ are vectors composed of the active power injections and the reactive power injections at the respective nodes of the power distribution network; $V$ is a vector composed of the respective node voltages; $T_O$ is a vector composed of the respective OLTC gears, and $T_B$ is a vector composed of the respective capacitor station gears; $t$ is the discrete time variable of the control process, and $(\cdot)_t$ represents a value measured at the moment $t$; 2-2) corresponding to the multi-time scale reactive voltage optimization object, constructing the feedback variable $r_f$ of the fast continuous device shown in the following formula:

$$r_f=-C_P P_{\mathrm{loss}}(s')-C_V V_{\mathrm{loss}}(s')$$

$$P_{\mathrm{loss}}(s')=\sum_{i\in N}P_i(s'),\quad V_{\mathrm{loss}}(s')=\sum_{i\in N}\Big[\big[V_i(s')-\overline V\big]_+^2+\big[\underline V-V_i(s')\big]_+^2\Big]\qquad(0.46)$$

where $s,a,s'$ are the state observation at the moment $t$, the action of the fast continuous device at the moment $t$ and the state observation at the moment $t+1$ respectively; $P_{\mathrm{loss}}(s')$ is the network loss at the moment $t+1$; $V_{\mathrm{loss}}(s')$ is the voltage deviation rate at the moment $t+1$; $P_i(s')$ is the active power output of the node $i$ at the moment $t+1$; $V_i(s')$ is the voltage magnitude of the node $i$ at the moment $t+1$; $[x]_+=\max(0,x)$; and $C_V$ is a cost coefficient of voltage violation; 2-3) corresponding to the multi-time scale reactive voltage optimization object, constructing the feedback variable $r_s$ of the slow discrete device shown in the following formula:
$$r_s=-C_O T_{O,\mathrm{loss}}(\tilde s,\tilde s')-C_B T_{B,\mathrm{loss}}(\tilde s,\tilde s')-R_f(\{s_\tau,a_\tau\mid\tau\in[0,k)\},s_k)\qquad(0.47)$$

where $\tilde s,\tilde s'$ are the state observation at the moment $k\tilde t$ and the state observation at the moment $k\tilde t+k$ respectively; $T_{O,\mathrm{loss}}(\tilde s,\tilde s')$ is the OLTC adjustment cost generated by actions at the moment $k\tilde t$; $T_{B,\mathrm{loss}}(\tilde s,\tilde s')$ is the capacitor station adjustment cost generated by actions at the moment $k\tilde t$; and $R_f(\{s_\tau,a_\tau\mid\tau\in[0,k)\},s_k)$ is the feedback value of the fast continuous device accumulated between two actions of the slow discrete device, calculated as:

$$R_f(\{s_\tau,a_\tau\mid\tau\in[0,k)\},s_k)=\sum_{\tau=0}^{k-1}r_f(s_\tau,a_\tau,s_{\tau+1})\qquad(0.48)$$

2-4) constructing an action variable $a_t$ of the fast agent and an action variable $\tilde a_t$ of the slow agent at the moment $t$ shown in the following formula:
$$a_t=(Q_G,Q_C)_t$$

$$\tilde a_t=(T_O,T_B)_t\qquad(0.49)$$

where $Q_G,Q_C$ are vectors of the DG reactive power outputs and the SVC reactive power outputs in the power distribution network respectively; 3) setting the slow agent to control the slow discrete device and setting the fast agent to control the fast continuous device comprises: 3-1) the slow agent is a deep neural network including a slow strategy network $\tilde\pi$ and a slow evaluation network $Q_s^{\tilde\pi}$, wherein an input of the slow strategy network $\tilde\pi$ is $\tilde s$, an output is a probability distribution of an action $\tilde a$, and a parameter of the slow strategy network $\tilde\pi$ is denoted as $\theta_s$; an input of the slow evaluation network $Q_s^{\tilde\pi}$ is $\tilde s$, an output is an evaluation value of each action, and a parameter of the slow evaluation network $Q_s^{\tilde\pi}$ is denoted as $\phi_s$; 3-2) the fast agent is a deep neural network including a fast strategy network $\pi$ and a fast evaluation network $Q_f^\pi$, wherein an input of the fast strategy network $\pi$ is $s$, an output is a probability distribution of the action $a$, and a parameter of the fast strategy network $\pi$ is denoted as $\theta_f$; an input of the fast evaluation network $Q_f^\pi$ is $(s,a)$, an output is an evaluation value of actions, and a parameter of the fast evaluation network $Q_f^\pi$ is denoted as $\phi_f$; 4) initializing parameters: 4-1) randomly initializing parameters $\theta_s,\theta_f,\phi_s,\phi_f$ of the neural networks corresponding to the respective agents; 4-2) inputting a maximum entropy parameter $\alpha_s$ of the slow agent and a maximum entropy parameter $\alpha_f$ of the fast agent; 4-3) initializing the discrete time variable as $t=0$, an actual time interval between two steps of the fast agent being $\Delta t$, and an actual time interval between two steps of the slow agent being $k\Delta t$; 4-4) initializing an action probability of the fast continuous device as $p=-1$; 4-5) initializing a cache experience database as $D_l=\varnothing$ and an agent experience database as $D=\varnothing$; 5) executing, by the slow agent and the fast agent, the following control steps at the moment $t$: 5-1) judging if $t\bmod k\ne 0$: if yes, going to step 5-5), and if no, going to step 5-2); 5-2) obtaining, by the slow agent, state information $\tilde s'$ from measurement devices in the power distribution network; 5-3) judging if $D_l\ne\varnothing$: if yes, calculating $r_s$, adding an experience sample to $D$ by updating $D\leftarrow D\cup\{(\tilde s,\tilde a,r_s,\tilde s',D_l)\}$, and going to step 5-4); if no, directly going to step 5-4); 5-4) updating $\tilde s$ to $\tilde s'$; 5-5) generating the action $\tilde a$ of the slow discrete device with the slow strategy network $\tilde\pi$ of the slow agent according to the state information $\tilde s$; 5-6) distributing $\tilde a$ to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment $t$; 5-7) obtaining, by the fast agent, state information $s'$ from measurement devices in the power distribution network; 5-8) judging if $p\ge 0$: if yes, calculating $r_f$, adding an experience sample to $D_l$ by updating $D_l\leftarrow D_l\cup\{(s,a,r_f,s',p)\}$, and going to step 5-9); if no, directly going to step 5-9); 5-9) updating $s$ to $s'$; 5-10) generating the action $a$ of the fast continuous device with the fast strategy network $\pi$ of the fast agent according to the state information $s$ and updating $p=\pi(a\mid s)$; 5-11) distributing $a$ to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment $t$ and going to step 6);
6) judging if $t\bmod k=0$: if yes, going to step 6-1); if no, going to step 7); 6-1) randomly selecting a set of experiences $D^B\subseteq D$ from the agent experience database $D$, wherein the number of samples in the set of experiences is $B$; 6-2) calculating a loss function of the parameter $\phi_s$ with each sample in $D^B$:

$$L(\phi_s)=\mathbb E_{\tilde s,\tilde a,r_s,\tilde s'}\big[(Q_s^{\tilde\pi}(\tilde s,\tilde a)-\omega y_s)^2\big]\qquad(0.50)$$

where

$$y_s=r_s+\gamma\big[Q_s^{\tilde\pi}(\tilde s',\tilde a')-\alpha_s\log\tilde\pi(\tilde a'\mid\tilde s')\big]\qquad(0.51)$$

where $\tilde a'\sim\tilde\pi(\cdot\mid\tilde s')$ and $\gamma$ is a conversion factor;

$$\omega=\prod_{i=0}^{k-1}\frac{\pi(a_i\mid s_i)}{p(a_i\mid s_i)}\qquad(0.52)$$

6-3) updating the parameter $\phi_s$:
$$\phi_s\leftarrow\phi_s-\rho_s\nabla_{\phi_s}L(\phi_s)\qquad(0.53)$$

where $\rho_s$ is a learning step length of the slow discrete device; 6-4) calculating a loss function of the parameter $\theta_s$:

$$L(\theta_s)=-\mathbb E_{\tilde s\sim D^B}\big[Q_s^{\tilde\pi}(\tilde s,\tilde a\sim\tilde\pi_{\theta_s}(\cdot\mid\tilde s))\big]\qquad(0.54)$$

6-5) updating the parameter $\theta_s$:
$$\theta_s\leftarrow\theta_s-\rho_s\nabla_{\theta_s}L(\theta_s)\qquad(0.55)$$

and going to step 7); 7) executing, by the fast agent, the following learning steps at the moment $t$: 7-1) randomly selecting a set of experiences $D^B\subseteq D$ from the agent experience database $D$, wherein the number of samples in the set of experiences is $B$; 7-2) calculating a loss function of the parameter $\phi_f$ with each sample in $D^B$:

$$L(\phi_f)=\mathbb E_{s,a,r_f,s'}\big[(Q_f^\pi(s,a)-y_f)^2\big]\qquad(0.56)$$

where

$$y_f=r_f+\gamma\big[Q_f^\pi(s',a')-\alpha_f\log\pi(a'\mid s')\big]\qquad(0.57)$$

where $a'\sim\pi(\cdot\mid s')$; 7-3) updating the parameter $\phi_f$:
$$\phi_f\leftarrow\phi_f-\rho_f\nabla_{\phi_f}L(\phi_f)\qquad(0.58)$$

where $\rho_f$ is a learning step length of the fast continuous device; 7-4) calculating a loss function of the parameter $\theta_f$:

$$L(\theta_f)=-\mathbb E_{s\sim D^B}\big[Q_f^\pi(s,a\sim\pi_{\theta_f}(\cdot\mid s))\big]\qquad(0.59)$$

7-5) updating the parameter $\theta_f$:
$$\theta_f\leftarrow\theta_f-\rho_f\nabla_{\theta_f}L(\theta_f)\qquad(0.60)$$

8) letting $t=t+1$ and returning to step 5).

    Description

    DETAILED DESCRIPTION

    [0071] A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided in the disclosure. The method includes: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating. The method includes the following steps.

    [0072] 1) according to a reactive voltage control object of a slow discrete device (a device that performs control by adjusting gears in an hour-level action cycle, such as an OLTC or a capacitor station) and a reactive voltage control object of a fast continuous device (a device that performs control by adjusting continuous set values in a minute-level action cycle, such as a distributed generation DG or a static var compensator SVC) in the controlled power distribution network, the multi-time scale reactive voltage control object is determined, and optimization constraints are established for multi-time scale reactive voltage control, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network. The specific steps are as follows.

    [0073] 1-1) the multi-time scale reactive voltage control object of the controlled power distribution network is determined:

    [00010]
$$O_T=\min_{T_O,T_B}\sum_{\tilde t=0}^{\tilde T-1}\Big[C_O T_{O,\mathrm{loss}}^{(k\tilde t)}+C_B T_{B,\mathrm{loss}}^{(k\tilde t)}+C_P\min_{Q_G,Q_C}\sum_{\tau=0}^{k-1}P_{\mathrm{loss}}^{(k\tilde t+\tau)}\Big]\qquad(0.21)$$

    [0074] where $\tilde T$ is the number of control cycles of the slow discrete device in one day; $k$ is an integer representing the ratio of the number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; $T=k\tilde T$ is the number of control cycles of the fast continuous device in one day; $\tilde t$ is the index of a control cycle of the slow discrete device; $T_O$ is a gear of an on-load tap changer (OLTC); $T_B$ is a gear of a capacitor station; $Q_G$ is a reactive power output of the distributed generation (DG); $Q_C$ is a reactive power output of a static var compensator (SVC); $C_O,C_B,C_P$ are respectively an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; $P_{\mathrm{loss}}^{(k\tilde t+\tau)}$ is the power distribution network loss at the moment $k\tilde t+\tau$, $\tau$ being an integer, $\tau=0,1,2,\ldots,k-1$; $T_{O,\mathrm{loss}}^{(k\tilde t)}$ is the gear change adjusted by the OLTC at the moment $k\tilde t$, and $T_{B,\mathrm{loss}}^{(k\tilde t)}$ is the gear change adjusted by the capacitor station at the moment $k\tilde t$, which are respectively calculated by the following formulas:

    [00011]
$$T_{O,\mathrm{loss}}^{(k\tilde t)}=\sum_{i=1}^{n_{\mathrm{OLTC}}}\Big|T_{O,i}^{(k\tilde t)}-T_{O,i}^{(k\tilde t-k)}\Big|,\quad\tilde t>0,\ i\in[1,n_{\mathrm{OLTC}}]$$
$$T_{B,\mathrm{loss}}^{(k\tilde t)}=\sum_{i=1}^{n_{\mathrm{CB}}}\Big|T_{B,i}^{(k\tilde t)}-T_{B,i}^{(k\tilde t-k)}\Big|,\quad\tilde t>0,\ i\in[1,n_{\mathrm{CB}}]$$
$$T_{O,\mathrm{loss}}^{(0)}=T_{B,\mathrm{loss}}^{(0)}=0\qquad(0.22)$$

    [0075] where $T_{O,i}^{(k\tilde t)}$ is the gear set value of the $i$-th OLTC device at the moment $k\tilde t$, $n_{\mathrm{OLTC}}$ is the total number of OLTC devices; $T_{B,i}^{(k\tilde t)}$ is the gear set value of the $i$-th capacitor station at the moment $k\tilde t$, and $n_{\mathrm{CB}}$ is the total number of capacitor stations;
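    As a concrete reading of Eqs. (0.21)-(0.22), the following minimal Python sketch evaluates one slow-cycle term of the objective from recorded gear settings and fast-cycle network losses. All function and variable names are illustrative; the text itself prescribes no implementation.

```python
def gear_change_cost(gears_now, gears_prev):
    # T_{O,loss} or T_{B,loss} of Eq. (0.22): total absolute gear change
    # across all devices between two consecutive slow control cycles.
    return sum(abs(g1 - g0) for g1, g0 in zip(gears_now, gears_prev))

def stage_cost(c_o, c_b, c_p, oltc_now, oltc_prev, cb_now, cb_prev, p_loss_window):
    # One bracketed term of Eq. (0.21): adjustment costs of the slow devices
    # plus the network-loss cost accumulated over the k fast cycles.
    return (c_o * gear_change_cost(oltc_now, oltc_prev)
            + c_b * gear_change_cost(cb_now, cb_prev)
            + c_p * sum(p_loss_window))  # p_loss_window holds P_loss for tau = 0..k-1
```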

    [0076] 1-2) the constraints are established for multi-time scale reactive voltage optimization in the controlled power distribution network:

    [0077] the constraints for reactive voltage optimization are established according to actual conditions of the controlled power distribution network, including voltage constraints and output constraints expressed by:


$$\underline V\le V_i^{(k\tilde t+\tau)}\le\overline V,$$

$$\Big|Q_{Gi}^{(k\tilde t+\tau)}\Big|\le\sqrt{S_{Gi}^2-\big(P_{Gi}^{(k\tilde t+\tau)}\big)^2},$$

$$\underline Q_{Ci}\le Q_{Ci}^{(k\tilde t+\tau)}\le\overline Q_{Ci},$$

$$i\in N,\ \tilde t\in[0,\tilde T),\ \tau\in[0,k)\qquad(0.23)$$

    [0078] where $N$ is the set of all nodes in the power distribution network; $V_i^{(k\tilde t+\tau)}$ is the voltage magnitude of the node $i$ at the moment $k\tilde t+\tau$; $\underline V,\overline V$ are a lower limit and an upper limit of the node voltage respectively (the typical values are respectively 0.9 and 1.1); $Q_{Gi}^{(k\tilde t+\tau)}$ is the DG reactive power output of the node $i$ at the moment $k\tilde t+\tau$; $Q_{Ci}^{(k\tilde t+\tau)}$ is the SVC reactive power output of the node $i$ at the moment $k\tilde t+\tau$; $\underline Q_{Ci},\overline Q_{Ci}$ are a lower limit and an upper limit of the SVC reactive power output of the node $i$; $S_{Gi}$ is the DG installed capacity of the node $i$; $P_{Gi}^{(k\tilde t+\tau)}$ is the DG active power output of the node $i$ at the moment $k\tilde t+\tau$;

    [0079] adjustment constraints are expressed by:


$$1\le T_{O,i}^{(k\tilde t)}\le\overline T_{O,i},\quad\tilde t>0,\ i\in[1,n_{\mathrm{OLTC}}]$$

$$1\le T_{B,i}^{(k\tilde t)}\le\overline T_{B,i},\quad\tilde t>0,\ i\in[1,n_{\mathrm{CB}}]\qquad(0.24)$$

    [0080] where $\overline T_{O,i}$ is the number of gears of the $i$-th OLTC device, and $\overline T_{B,i}$ is the number of gears of the $i$-th capacitor station.
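    A feasibility check corresponding to Eqs. (0.23)-(0.24) follows directly from the definitions above; the sketch below, with illustrative argument names, returns whether a candidate operating point satisfies the voltage and output constraints.

```python
import math

def constraints_satisfied(v, q_g, p_g, s_g, q_c, q_c_min, q_c_max,
                          v_min=0.9, v_max=1.1):
    # Voltage limits: v_min <= V_i <= v_max for every node (Eq. (0.23)).
    if not all(v_min <= vi <= v_max for vi in v):
        return False
    # DG capacity circle: |Q_Gi| <= sqrt(S_Gi^2 - P_Gi^2).
    if not all(abs(qg) <= math.sqrt(max(sg ** 2 - pg ** 2, 0.0))
               for qg, pg, sg in zip(q_g, p_g, s_g)):
        return False
    # SVC output limits: lower <= Q_Ci <= upper.
    return all(lo <= qc <= hi for qc, lo, hi in zip(q_c, q_c_min, q_c_max))
```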

    [0081] 2) in combination with the optimization model established in step 1) and actual configuration of the power distribution network, the hierarchical interaction training framework based on the two-layer Markov decision process is constructed. The specific steps are as follows:

    [0082] 2-1) corresponding to system measurements of the power distribution network, a state observation s at the moment t is constructed in the following formula:


$$s=(P,Q,V,T_O,T_B)_t\qquad(0.25)$$

    [0083] where $P,Q$ are vectors composed of the active power injections and the reactive power injections at the respective nodes of the power distribution network; $V$ is a vector composed of the respective node voltages; $T_O$ is a vector composed of the respective OLTC gears, and $T_B$ is a vector composed of the respective capacitor station gears; $t$ is the discrete time variable of the control process, and $(\cdot)_t$ represents a value measured at the moment $t$;
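    Assembling the observation of Eq. (0.25) is a simple concatenation of the listed measurement vectors; a possible NumPy sketch with illustrative names:

```python
import numpy as np

def build_state(p_inj, q_inj, v_mag, oltc_gears, cb_gears):
    # s_t = (P, Q, V, T_O, T_B)_t flattened into one observation vector.
    return np.concatenate([p_inj, q_inj, v_mag,
                           oltc_gears, cb_gears]).astype(np.float32)
```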

    [0084] 2-2) corresponding to the multi-time scale reactive voltage optimization object, the feedback variable $r_f$ of the fast continuous device is constructed as in the following formula:

    [00012]
$$r_f=-C_P P_{\mathrm{loss}}(s')-C_V V_{\mathrm{loss}}(s')\qquad(0.26)$$
$$P_{\mathrm{loss}}(s')=\sum_{i\in N}P_i(s'),\quad V_{\mathrm{loss}}(s')=\sum_{i\in N}\Big[\big[V_i(s')-\overline V\big]_+^2+\big[\underline V-V_i(s')\big]_+^2\Big]$$

    [0085] where $s,a,s'$ are the state observation at the moment $t$, the action of the fast continuous device at the moment $t$ and the state observation at the moment $t+1$ respectively; $P_{\mathrm{loss}}(s')$ is the network loss at the moment $t+1$; $V_{\mathrm{loss}}(s')$ is the voltage deviation rate at the moment $t+1$; $P_i(s')$ is the active power output of the node $i$ at the moment $t+1$; $V_i(s')$ is the voltage magnitude of the node $i$ at the moment $t+1$; $[x]_+=\max(0,x)$; $C_V$ is a cost coefficient of voltage violation;
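    The fast feedback of Eq. (0.26) can then be computed from the post-action measurement $s'$; a minimal sketch, assuming the total network loss and the node voltages are already extracted from $s'$:

```python
def fast_reward(p_loss, v_mag, c_p, c_v, v_min=0.9, v_max=1.1):
    # r_f = -C_P * P_loss(s') - C_V * V_loss(s'), with [x]_+ = max(0, x)
    # and a squared-hinge penalty on voltage violations (Eq. (0.26)).
    hinge = lambda x: max(0.0, x)
    v_loss = sum(hinge(vi - v_max) ** 2 + hinge(v_min - vi) ** 2 for vi in v_mag)
    return -c_p * p_loss - c_v * v_loss
```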

    [0086] 2-3) corresponding to the multi-time scale reactive voltage optimization object, the feedback variable $r_s$ of the slow discrete device is constructed as in the following formula:


$$r_s=-C_O T_{O,\mathrm{loss}}(\tilde s,\tilde s')-C_B T_{B,\mathrm{loss}}(\tilde s,\tilde s')-R_f(\{s_\tau,a_\tau\mid\tau\in[0,k)\},s_k)\qquad(0.27)$$

    [0087] where $\tilde s,\tilde s'$ are the state observation at the moment $k\tilde t$ and the state observation at the moment $k\tilde t+k$ respectively; $T_{O,\mathrm{loss}}(\tilde s,\tilde s')$ is the OLTC adjustment cost generated by actions at the moment $k\tilde t$; $T_{B,\mathrm{loss}}(\tilde s,\tilde s')$ is the capacitor station adjustment cost generated by actions at the moment $k\tilde t$; $R_f(\{s_\tau,a_\tau\mid\tau\in[0,k)\},s_k)$ is the feedback value of the fast continuous device accumulated between two actions of the slow discrete device, the calculation expression of which is as follows:

    [00013]
$$R_f(\{s_\tau,a_\tau\mid\tau\in[0,k)\},s_k)=\sum_{\tau=0}^{k-1}r_f(s_\tau,a_\tau,s_{\tau+1})\qquad(0.28)$$
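    Eqs. (0.27)-(0.28) combine the slow devices' adjustment costs with the fast feedback cached between two slow actions; a direct transcription with illustrative names, keeping the sign of $R_f$ exactly as printed in Eq. (0.27):

```python
def slow_reward(c_o, c_b, t_o_loss, t_b_loss, cached_fast_rewards):
    # R_f = sum of r_f over the k intermediate fast transitions (Eq. (0.28)).
    r_f_total = sum(cached_fast_rewards)
    # r_s = -C_O * T_O,loss - C_B * T_B,loss - R_f (Eq. (0.27) as printed).
    return -c_o * t_o_loss - c_b * t_b_loss - r_f_total
```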

    [0088] 2-4) corresponding to each adjustable resource, an action variable $a_t$ of the fast agent and an action variable $\tilde a_t$ of the slow agent at the moment $t$ are constructed as in the following formula:


$$a_t=(Q_G,Q_C)_t$$

$$\tilde a_t=(T_O,T_B)_t\qquad(0.29)$$

    [0089] where $Q_G,Q_C$ are vectors of the DG reactive power outputs and the SVC reactive power outputs in the power distribution network respectively;

    [0090] 3) the slow agent is set to control the slow discrete device and the fast agent is set to control the fast continuous device. The specific steps are as follows.

    [0091] 3-1) the slow agent is implemented by a deep neural network including a slow strategy network $\tilde\pi$ and a slow evaluation network $Q_s^{\tilde\pi}$.

    [0092] 3-1-1) the slow strategy network $\tilde\pi$ is a deep neural network whose input is $\tilde s$ and whose output is a probability distribution of an action $\tilde a$; it includes several hidden layers (typically 2), each hidden layer having several neurons (typically 512), with the ReLU activation function, and its network parameter is denoted as $\theta_s$;

    [0093] 3-1-2) the slow evaluation network $Q_s^{\tilde\pi}$ is a deep neural network whose input is $\tilde s$ and whose output is an evaluation value of each action; it includes several hidden layers (typically 2), each hidden layer having several neurons (typically 512), with the ReLU activation function, and its network parameter is denoted as $\phi_s$;

    [0094] 3-2) the fast agent is implemented by a deep neural network including a fast strategy network $\pi$ and a fast evaluation network $Q_f^\pi$.

    [0095] 3-2-1) the fast strategy network $\pi$ is a deep neural network whose input is $s$ and whose output is a probability distribution of the action $a$; it includes several hidden layers (typically 2), each hidden layer having several neurons (typically 512), with the ReLU activation function, and its network parameter is denoted as $\theta_f$;

    [0096] 3-2-2) the fast evaluation network $Q_f^\pi$ is a deep neural network whose input is $(s,a)$ and whose output is an evaluation value of actions; it includes several hidden layers (typically 2), each hidden layer having several neurons (typically 512), with the ReLU activation function, and its network parameter is denoted as $\phi_f$;
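    The four networks of steps 3-1) and 3-2) can be realized, for example, in PyTorch; the sketch below uses the typical sizes given in the text (2 hidden layers of 512 ReLU neurons), while the input and output dimensions are illustrative assumptions.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=512, n_hidden=2):
    # Two ReLU hidden layers of 512 neurons each, the typical sizes above.
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Illustrative dimensions (not specified in the text).
state_dim, n_gear_combinations, action_dim = 64, 27, 8

# Slow agent: pi~ outputs a distribution over discrete gear actions; Q_s scores
# every discrete action from the state s~ alone, matching 3-1-2).
slow_policy = mlp(state_dim, n_gear_combinations)   # logits of pi~(a~|s~)
slow_critic = mlp(state_dim, n_gear_combinations)   # Q_s(s~, .) per action

# Fast agent: pi maps s to continuous set values; Q_f takes (s, a) jointly, 3-2-2).
fast_policy = mlp(state_dim, action_dim)            # e.g. mean of pi(a|s)
fast_critic = mlp(state_dim + action_dim, 1)        # Q_f(s, a)
```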

    [0097] 4) the variables in the relevant control processes are initialized.

    [0098] 4-1) the parameters $\theta_s,\theta_f,\phi_s,\phi_f$ of the neural networks corresponding to the respective agents are randomly initialized;

    [0099] 4-2) a maximum entropy parameter $\alpha_s$ of the slow agent and a maximum entropy parameter $\alpha_f$ of the fast agent are input, which are configured to control the randomness of the slow and fast agents respectively; a typical value of each is 0.01;

    [0100] 4-3) the discrete time variable is initialized as t=0, an actual time interval between two steps of the fast agent is Δt and an actual time interval between two steps of the slow agent is kΔt, which are determined according to the actual measurements of the local controller and the command control speed;

    [0101] 4-4) an action probability of the fast continuous device is initialized as p=−1;

    [0102] 4-5) the experience databases are initialized: the cache experience database as $D_l=\varnothing$ and the agent experience database as $D=\varnothing$;
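    A minimal sketch of the bookkeeping initialized in step 4), assuming plain Python containers; note the nesting used later in step 5-3), where each slow-scale sample carries the cache $D_l$ of fast transitions gathered between two slow actions:

```python
cache_db = []   # D_l: fast-scale tuples (s, a, r_f, s', p)
agent_db = []   # D: slow-scale tuples (s~, a~, r_s, s~', D_l)

t = 0           # discrete time variable; fast interval Delta t, slow interval k * Delta t
p = -1.0        # action probability of the fast device; -1 marks "no action yet"
```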

    [0103] 5) the slow agent and the fast agent execute the following control steps at the moment t:

    [0104] 5-1) it is judged whether t mod k≠0. If yes, step 5-5) is performed and if no, step 5-2) is performed;

    [0105] 5-2) the slow agent obtains state information $\tilde s'$ from measurement devices in the power distribution network;

    [0106] 5-3) it is judged whether $D_l\ne\varnothing$. If yes, $r_s$ is calculated, an experience sample is added to $D$ by updating $D\leftarrow D\cup\{(\tilde s,\tilde a,r_s,\tilde s',D_l)\}$, and step 5-4) is performed; if no, step 5-4) is directly performed;

    [0107] 5-4) let $\tilde s\leftarrow\tilde s'$;

    [0108] 5-5) the action $\tilde a$ of the slow discrete device is generated with the slow strategy network $\tilde\pi$ of the slow agent according to the state information $\tilde s$;

    [0109] 5-6) ã is distributed to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;

    [0110] 5-7) the fast agent obtains state information s′ from measurement devices in the power distribution network;

    [0111] 5-8) it is judged whether $p\ge 0$. If yes, $r_f$ is calculated, an experience sample is added to $D_l$ by updating $D_l\leftarrow D_l\cup\{(s,a,r_f,s',p)\}$, and step 5-9) is performed; if no, step 5-9) is directly performed;

    [0112] 5-9) let $s\leftarrow s'$;

    [0113] 5-10) the action $a$ of the fast continuous device is generated with the fast strategy network $\pi$ of the fast agent according to the state information $s$, and $p=\pi(a\mid s)$ is updated;

    [0114] 5-11) a is distributed to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and step 6) is performed;
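    Steps 5-1) to 5-11) amount to the following per-step control routine. The grid, slow_agent and fast_agent objects are hypothetical stand-ins for the measurement, reward and dispatch machinery described above; clearing the cache after a slow sample is closed is an assumption, since step 5-3) does not state it explicitly.

```python
def control_step(t, k, grid, slow_agent, fast_agent, cache_db, agent_db, ctx):
    # ctx carries s~, a~, s, a and p between calls (steps 5-4), 5-9), 5-10)).
    if t % k == 0:                                        # 5-1) slow agent acts
        s_tilde_next = grid.measure()                     # 5-2)
        if cache_db:                                      # 5-3) D_l not empty
            r_s = slow_agent.reward(ctx['s~'], s_tilde_next, cache_db)
            agent_db.append((ctx['s~'], ctx['a~'], r_s,
                             s_tilde_next, list(cache_db)))
            cache_db.clear()                              # assumption, see lead-in
        ctx['s~'] = s_tilde_next                          # 5-4)
        ctx['a~'] = slow_agent.act(ctx['s~'])             # 5-5)
        grid.dispatch_slow(ctx['a~'])                     # 5-6)
    s_next = grid.measure()                               # 5-7)
    if ctx['p'] >= 0:                                     # 5-8) a fast action exists
        r_f = fast_agent.reward(s_next)
        cache_db.append((ctx['s'], ctx['a'], r_f, s_next, ctx['p']))
    ctx['s'] = s_next                                     # 5-9)
    ctx['a'], ctx['p'] = fast_agent.act(ctx['s'])         # 5-10) p = pi(a|s)
    grid.dispatch_fast(ctx['a'])                          # 5-11)
```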

    [0115] 6) it is judged whether t mod k=0. If yes, step 6-1) is performed; if no, step 7) is performed;

    [0116] 6-1) a set of experiences $D^B\subseteq D$ is randomly selected from the agent experience database $D$, wherein the number of samples in the set of experiences is $B$ (a typical value is 64);

    [0117] 6-2) a loss function of the parameter $\phi_s$ is calculated with each sample in $D^B$:

    [00014]
$$L(\phi_s)=\mathbb E_{\tilde s,\tilde a,r_s,\tilde s'}\big[(Q_s^{\tilde\pi}(\tilde s,\tilde a)-\omega y_s)^2\big]\qquad(0.30)$$

    where the expectation $\mathbb E_{\tilde s,\tilde a,r_s,\tilde s'}$ is taken over $D^B$ and $y_s$ is determined by:

$$y_s=r_s+\gamma\big[Q_s^{\tilde\pi}(\tilde s',\tilde a')-\alpha_s\log\tilde\pi(\tilde a'\mid\tilde s')\big]\qquad(0.31)$$

    [0118] where $\tilde a'\sim\tilde\pi(\cdot\mid\tilde s')$ and $\gamma$ is a conversion factor, a typical value of which is 0.98; and $\omega$ is calculated from the $k$ cached fast transitions as:

    [00016]
$$\omega=\prod_{i=0}^{k-1}\frac{\pi(a_i\mid s_i)}{p(a_i\mid s_i)}\qquad(0.32)$$

    [0119] 6-3) the parameter $\phi_s$ is updated:

$$\phi_s\leftarrow\phi_s-\rho_s\nabla_{\phi_s}L(\phi_s)\qquad(0.33)$$

    where $\rho_s$ is a learning step length of the slow discrete device, a typical value of which is 0.0001;

    [0120] 6-4) a loss function of the parameter $\theta_s$ is calculated:

    [00017]
$$L(\theta_s)=-\mathbb E_{\tilde s\sim D^B}\big[Q_s^{\tilde\pi}(\tilde s,\tilde a\sim\tilde\pi_{\theta_s}(\cdot\mid\tilde s))\big]\qquad(0.34)$$

    [0121] 6-5) the parameter $\theta_s$ is updated:

$$\theta_s\leftarrow\theta_s-\rho_s\nabla_{\theta_s}L(\theta_s)\qquad(0.35)$$

    [0122] and step 7) is then performed;
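    For concreteness, one slow-agent update of steps 6-2) to 6-5) might look as follows in PyTorch, with the networks shaped as in the earlier sketch. Two implementation choices are assumptions: the target of Eq. (0.31) is taken in expectation over the discrete actions rather than with a single sampled $\tilde a'$, and the importance weight $\omega$ of Eq. (0.32) is precomputed per sample from its cached $D_l$.

```python
import torch
import torch.nn.functional as F

def slow_update(slow_policy, slow_critic, batch, gamma=0.98, alpha_s=0.01, rho_s=1e-4):
    # batch: s~ [B, ds], a~ [B] (long), r_s [B], s~' [B, ds], omega [B].
    s, a, r_s, s_next, omega = batch
    with torch.no_grad():
        # Soft target of Eq. (0.31), in expectation over the discrete actions.
        probs_next = F.softmax(slow_policy(s_next), dim=-1)
        q_next = slow_critic(s_next)
        v_next = (probs_next
                  * (q_next - alpha_s * probs_next.clamp_min(1e-8).log())).sum(-1)
        y_s = r_s + gamma * v_next
    q_sa = slow_critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = ((q_sa - omega * y_s) ** 2).mean()                 # Eq. (0.30)
    probs = F.softmax(slow_policy(s), dim=-1)
    actor_loss = -(probs * slow_critic(s).detach()).sum(-1).mean()   # Eq. (0.34)
    # Compute both gradients first, then apply the plain gradient steps.
    slow_critic.zero_grad(); critic_loss.backward()
    slow_policy.zero_grad(); actor_loss.backward()
    with torch.no_grad():                                            # Eqs. (0.33)/(0.35)
        for p in slow_critic.parameters():
            p -= rho_s * p.grad
        for p in slow_policy.parameters():
            p -= rho_s * p.grad
```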

    [0123] 7) the fast agent executes the following learning steps at the moment t:

    [0124] 7-1) a set of experiences $D^B\subseteq D$ is randomly selected from the agent experience database $D$, wherein the number of samples in the set of experiences is $B$ (a typical value is 64);

    [0125] 7-2) a loss function of the parameter $\phi_f$ is calculated with each sample in $D^B$:

    [00018]
$$L(\phi_f)=\mathbb E_{s,a,r_f,s'}\big[(Q_f^\pi(s,a)-y_f)^2\big]\qquad(0.36)$$

    [0126] where the expectation $\mathbb E_{s,a,r_f,s'}$ is taken over $D^B$ and $y_f$ is determined by:

$$y_f=r_f+\gamma\big[Q_f^\pi(s',a')-\alpha_f\log\pi(a'\mid s')\big]\qquad(0.37)$$

    where $a'\sim\pi(\cdot\mid s')$;

    [0127] 7-3) the parameter $\phi_f$ is updated:

$$\phi_f\leftarrow\phi_f-\rho_f\nabla_{\phi_f}L(\phi_f)\qquad(0.38)$$

    [0128] where $\rho_f$ is a learning step length of the fast continuous device, a typical value of which is 0.00001;

    [0129] 7-4) a loss function of the parameter $\theta_f$ is calculated:

    [00020]
$$L(\theta_f)=-\mathbb E_{s\sim D^B}\big[Q_f^\pi(s,a\sim\pi_{\theta_f}(\cdot\mid s))\big]\qquad(0.39)$$

    [0130] 7-5) the parameter $\theta_f$ is updated:

$$\theta_f\leftarrow\theta_f-\rho_f\nabla_{\theta_f}L(\theta_f)\qquad(0.40)$$
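    Analogously, a fast-agent update of steps 7-2) to 7-5) in PyTorch; the unit-variance Gaussian policy and the use of the policy mean in the actor loss are sketch-level assumptions, not prescribed by the text.

```python
import torch

def fast_update(fast_policy, fast_critic, batch, gamma=0.98, alpha_f=0.01, rho_f=1e-5):
    # batch: s [B, ds], a [B, da], r_f [B], s' [B, ds].
    s, a, r_f, s_next = batch
    with torch.no_grad():
        dist = torch.distributions.Normal(fast_policy(s_next), 1.0)  # sketch policy
        a_next = dist.sample()
        log_p = dist.log_prob(a_next).sum(-1)                        # log pi(a'|s')
        q_next = fast_critic(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
        y_f = r_f + gamma * (q_next - alpha_f * log_p)               # Eq. (0.37)
    q = fast_critic(torch.cat([s, a], dim=-1)).squeeze(-1)
    critic_loss = ((q - y_f) ** 2).mean()                            # Eq. (0.36)
    a_pi = fast_policy(s)                       # mean action stands in for a ~ pi
    actor_loss = -fast_critic(torch.cat([s, a_pi], dim=-1)).mean()   # Eq. (0.39)
    # Backpropagate the actor first, then clear its spillover into the critic's
    # gradients before the critic pass; apply both plain gradient steps at the end.
    fast_policy.zero_grad(); actor_loss.backward()
    fast_critic.zero_grad(); critic_loss.backward()
    with torch.no_grad():                                            # Eqs. (0.38)/(0.40)
        for p in fast_critic.parameters():
            p -= rho_f * p.grad
        for p in fast_policy.parameters():
            p -= rho_f * p.grad
```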

    [0131] 8) let t=t+1 and return to step 5), repeating steps 5) to 8). The method is an online learning control method: it runs online continuously, updating the neural networks while performing online control, until the user manually stops it.