Power grid reactive voltage control model training method and system
11689021 · 2023-06-27
Assignee
Inventors
- Wenchuan WU (Beijing, CN)
- Haotian LIU (Beijing, CN)
- Hongbin SUN (Beijing, CN)
- Bin Wang (Beijing, CN)
- Qinglai GUO (Beijing, CN)
- Tian Xia (Beijing, CN)
CPC classification
Y04S10/40
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y04S40/20
Y02E40/30
Y04S10/50
Y02E40/70
H02J2203/20
ELECTRICITY
H02J13/00001
H02J3/001
H02J3/18
Y02E60/00
International classification
Abstract
A power grid reactive voltage control model training method. The method comprises: establishing a power grid simulation model; establishing a reactive voltage optimization model according to a power grid reactive voltage control target; building an interactive training environment based on an Adversarial Markov Decision Process, in combination with the power grid simulation model and the reactive voltage optimization model; training the power grid reactive voltage control model through a joint adversarial training algorithm; and transferring the trained power grid reactive voltage control model to an online system. Compared with traditional methods, a power grid reactive voltage control model trained by the method of the present disclosure is transferable and may be directly used for online power grid reactive voltage control.
Claims
1. A power grid reactive voltage control model training method, comprising: establishing, by a regional power grid control center server, a power grid simulation model based on a regional power grid comprising n+1 nodes and corresponding measuring apparatuses; establishing, by the regional power grid control center server, a reactive voltage optimization model based on data measured by the measuring apparatuses, according to a power grid reactive voltage control target; building, by the regional power grid control center server, an interactive training environment based on an Adversarial Markov Decision Process, in combination with the power grid simulation model and the reactive voltage optimization model, wherein the building further comprises: A1: constructing an Adversarial Markov Decision Process state variable, an expression being as follows:
s=(P,Q,V,t); A2: constructing a feedback variable, an expression being as follows:
ReLU(x)=max(0,x); A3: constructing a reactive voltage control model action variable a.sub.p, an expression being as follows: a.sub.p=(Q.sub.G,Q.sub.C) wherein, Q.sub.G, Q.sub.C are both reactive power output vectors; and A4: constructing an adversarial model action variable a.sub.o, an expression being as follows:
a.sub.o=(G,B) where, G, B are respectively vectors composed of conductance and susceptance of all lines; training, by the regional power grid control center server, the power grid reactive voltage control model through a joint adversarial training algorithm to generate a trained power grid reactive voltage control model, wherein the training further comprises: B1: defining a reinforcement learning target function, an expression being as follows:
ã.sub.p.sup.θ(s,ξ.sub.p)=tanh(μ.sub.θ(s)+σ.sub.θ(s)⊙ξ.sub.p),ξ.sub.p˜N(0,I)
ã.sub.o.sup.ω(s,ξ.sub.o)=tanh(μ.sub.ω(s)+σ.sub.ω(s)⊙ξ.sub.o),ξ.sub.o˜N(0,I) wherein, θ is a reactive voltage control model policy network parameter; ω is an adversarial model policy network parameter; μ.sub.θ(s) and σ.sub.θ(s) are respectively a mean value and a variance function of the reactive voltage control model; μ.sub.ω(s) and σ.sub.ω(s) are respectively a mean value and a variance function of the adversarial model; N(0,I) is a standard Gaussian distribution function; ξ.sub.p, ξ.sub.o are respectively random variables of the reactive voltage control model and the adversarial model; and s is the Adversarial Markov Decision Process state variable; B3: defining a joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o), an expression being as follows:
y(r,s′)=r+γ[Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−α.sub.p log π(ã.sub.p′|s′)−α.sub.o log π(ã.sub.o′|s′)] wherein, ã′.sub.p, ã′.sub.o are respectively estimated action variables of the reactive voltage control model and the adversarial model at time t+1; training the joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o) by using an expression below:
min(Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−y).sup.2,ã.sub.p′˜π.sub.p(·|s′),ã.sub.o′˜π.sub.o(·|s′) wherein, ϕ represents parameters; π.sub.p(·|s′),π.sub.o(·|s′) are respectively a reactive voltage control model policy function and an adversarial model policy function at time t+1; B4: training the reactive voltage control model policy network, an expression being as follows:
D={(s.sub.t,a.sub.t.sup.p,a.sub.t.sup.o,r.sub.t,s.sub.t′)} where, s.sub.t is an Adversarial Markov Decision Process state variable at time t; a.sub.t.sup.p is a reactive voltage control model action variable at time t; a.sub.t.sup.o is an adversarial model action variable at time t; r.sub.t is a feedback variable at time t; and s.sub.t′ is an Adversarial Markov Decision Process state variable at time t+1; and performing reactive voltage control model continuous online learning in the regional power grid controller, which specifically includes steps of: S1: acquiring by the regional power grid controller, through a remote adjusting system, measured data from measuring apparatuses of the regional power grid to form a corresponding state variable s.sub.t=(P,Q,V,t); S2: extracting a set of experiences D.sub.B∈D from the experience library, where B is quantity; S3: updating the reactive voltage control model on D.sub.B; S4: generating an optimal action a.sub.t=tanh(μ.sub.θ(s.sub.t)+σ.sub.θ(s.sub.t)⊙ξ)=(Q.sub.G,Q.sub.C) at time t, by using the reactive voltage control model policy network; S5: issuing the optimal action to controlled devices through the remote adjusting system, wherein the controlled devices include distributed generation devices and static var compensators, and wherein the remote adjusting system receives and executes remote adjusting commands to remotely adjust remote quantity control devices; and S6: t=t+1, returning to step S1.
2. The power grid reactive voltage control model training method according to claim 1, wherein, establishing a power grid simulation model includes: constructing, with respect to the regional power grid of n+1 nodes, an undirected graph as follows:
Π(N,E) wherein, N is a set of the power grid nodes; E is a set of the power grid branches, E=(i, j)∈N×N ; and i, j are both the power grid nodes.
3. The power grid reactive voltage control model training method according to claim 2, wherein, establishing a power grid simulation model further includes: constructing a power flow equation of the power grid as follows:
P.sub.ij=G.sub.ijV.sub.i.sup.2−G.sub.ijV.sub.iV.sub.j cos θ.sub.ij−B.sub.ijV.sub.iV.sub.j sin θ.sub.ij,∀ij∈E
Q.sub.ij=−B.sub.ijV.sub.i.sup.2+B.sub.ijV.sub.iV.sub.j cos θ.sub.ij−G.sub.ijV.sub.iV.sub.j sin θ.sub.ij,∀ij∈E
θ.sub.ij=θ.sub.i−θ.sub.j,∀ij∈E wherein, V.sub.i,θ.sub.i are respectively a voltage amplitude and a phase angle of the power grid node i; V.sub.j,θ.sub.j are respectively a voltage amplitude and a phase angle of the power grid node j; G.sub.ij,B.sub.ij are respectively conductance and susceptance of the power grid branch ij; P.sub.ij,Q.sub.ij are respectively active power and reactive power of the power grid branch ij; and θ.sub.ij is a phase angle difference of the power grid branch ij; with respect to the power grid node j∈N, an expression of power thereof is as follows:
4. The power grid reactive voltage control model training method according to claim 2, wherein, the reactive voltage optimization model is as follows:
5. The power grid reactive voltage control model training method according to claim 4, wherein, an expression of the entropy function H is as follows:
6. A power grid reactive voltage control model training system, comprising: a processor; a memory that stores program instructions executed by the processor; program instructions, implemented on the processor, configured to establish a power grid simulation model based on a regional power grid comprising n+1 nodes and corresponding measuring apparatuses; program instructions, implemented on the processor, configured to establish a reactive voltage optimization model based on data measured by the measuring apparatuses, according to a power grid reactive voltage control target; program instructions, implemented on the processor, configured to build an interactive training environment based on an Adversarial Markov Decision Process, in combination with the power grid simulation model and the reactive voltage optimization model, wherein the program instructions configured to build the interactive training environment further comprise: A1: program instructions to construct an Adversarial Markov Decision Process state variable, an expression being as follows:
s=(P,Q,V,t); A2: program instructions to construct a feedback variable, an expression being as follows:
a.sub.p=(Q.sub.G,Q.sub.C); and A4: program instructions to construct an adversarial model action variable a.sub.o, an expression being as follows:
a.sub.o=(G, B); program instructions, implemented on the processor, configured to train the power grid reactive voltage control model through a joint adversarial training algorithm to generate a trained power grid reactive voltage control model, wherein the program instructions to train further comprise: B1: program instructions to define a reinforcement learning target function, an expression being as follows:
ã.sub.p.sup.θ(s,ξ.sub.p)=tanh(μ.sub.θ(s)+σ.sub.θ(s)⊙ξ.sub.p),ξ.sub.p˜N(0,I)
ã.sub.o.sup.ω(s,ξ.sub.o)=tanh(μ.sub.ω(s)+σ.sub.ω(s)⊙ξ.sub.o),ξ.sub.o˜N(0,I); B3: program instructions to define a joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o), an expression being as follows:
y(r,s′)=r+γ[Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−α.sub.p log π(ã.sub.p′|s′)−α.sub.o log π(ã.sub.o′|s′)]; program instructions to train the joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o) by using an expression below:
min(Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−y).sup.2,ã.sub.p′˜π.sub.p(·|s′),ã.sub.o′˜π.sub.o(·|s′); B4: program instructions to train the reactive voltage control model policy network, an expression being as follows:
D={(s.sub.t,a.sub.t.sup.p,a.sub.t.sup.o,r.sub.t,s.sub.t′)} where, s.sub.t is an Adversarial Markov Decision Process state variable at time t; a.sub.t.sup.p is a reactive voltage control model action variable at time t; a.sub.t.sup.o is an adversarial model action variable at time t; r.sub.t is a feedback variable at time t; and s.sub.t′ is an Adversarial Markov Decision Process state variable at time t+1; and program instructions, implemented on the processor, to perform continuous online learning in the regional power grid controller for the reactive voltage control model, which specifically includes steps of: S1: acquiring by the regional power grid controller, through a remote adjusting system, measured data from measuring apparatuses of the regional power grid to form a corresponding state variable s.sub.t=(P,Q,V,t); S2: extracting a set of experiences D.sub.B∈D from the experience library, where B is quantity; S3: updating the reactive voltage control model on D.sub.B; S4: generating an optimal action a.sub.t=tanh(μ.sub.θ(s.sub.t)+σ.sub.θ(s.sub.t)⊙ξ)=(Q.sub.G,Q.sub.C) at time t, by using the reactive voltage control model policy network; S5: issuing the optimal action to controlled devices through the remote adjusting system, wherein the controlled devices include distributed generation devices and static var compensators, and wherein the remote adjusting system receives and executes remote adjusting commands to remotely adjust remote quantity control devices; and S6: t=t+1, returning to step S1.
7. A power grid reactive voltage control model training method, comprising: establishing, by a regional power grid control center server, a power grid simulation model based on a regional power grid comprising n+1 nodes and corresponding measuring apparatuses; establishing, by the regional power grid control center server, a reactive voltage optimization model based on data measured by the measuring apparatuses, according to a power grid reactive voltage control target; building, by the regional power grid control center server, an interactive training environment based on an Adversarial Markov Decision Process, in combination with the power grid simulation model and the reactive voltage optimization model, wherein the building further comprises: A1: constructing an Adversarial Markov Decision Process state variable, an expression being as follows:
s=(P,Q,V,t); A2: constructing a feedback variable, an expression being as follows:
a.sub.p=(Q.sub.G,Q.sub.C); and A4: constructing an adversarial model action variable a.sub.o, an expression being as follows:
a.sub.o=(G,B); training, by the regional power grid control center server, the power grid reactive voltage control model through a joint adversarial training algorithm to generate a trained power grid reactive voltage control model, wherein the training further comprises: B1: defining a reinforcement learning target function, an expression being as follows:
ã.sub.p.sup.θ(s,ξ.sub.p)=tanh(μ.sub.θ(s)+σ.sub.θ(s)⊙ξ.sub.p),ξ.sub.p˜N(0,I)
ã.sub.o.sup.ω(s,ξ.sub.o)=tanh(μ.sub.ω(s)+σ.sub.ω(s)⊙ξ.sub.o),ξ.sub.o˜N(0,I); B3: defining a joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o), an expression being as follows:
y(r,s′)=r+γ[Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−α.sub.p log π(ã.sub.p′|s′)−α.sub.o log π(ã.sub.o′|s′)]; training the joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o) by using an expression below:
min(Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−y).sup.2,ã.sub.p′˜π.sub.p(·|s′),ã.sub.o′˜π.sub.o(·|s′); B4: training the reactive voltage control model policy network, an expression being as follows:
Q.sub.ϕ.sup.*(s,a.sub.p)=E.sub.ξ.sub.o.sub.˜N[Q.sub.ϕ.sup.*(s,a.sub.p,ã.sub.o.sup.ω(s,ξ.sub.o))] to marginalize the joint adversarial value function network; C3: deploying the reactive voltage control model policy π.sub.p formed by the marginalized joint adversarial value function network and the reactive voltage control model policy network to the online system; C4: initializing the time variable t=0; initializing an experience library D=∅, an expression of the experience library D being as follows:
D={(s.sub.t,a.sub.t.sup.p,a.sub.t.sup.o,r.sub.t,s.sub.t′)}; and performing reactive voltage control model continuous online learning in the regional power grid controller, which specifically includes steps of: S1: acquiring by the regional power grid controller, through a remote adjusting system, measured data from measuring apparatuses of the regional power grid to form a corresponding state variable s.sub.t=(P,Q,V,t); S2: extracting a set of experiences D.sub.B∈D from the experience library, where B is quantity; S3: updating the reactive voltage control model on D.sub.B; S4: generating an optimal action a.sub.t=tanh(μ.sub.θ(s.sub.t)+σ.sub.θ(s.sub.t)⊙ξ)=(Q.sub.G,Q.sub.C) at time t, by using the reactive voltage control model policy network; S5: issuing the optimal action to controlled devices through the remote adjusting system, wherein the controlled devices include distributed generation devices and static var compensators, and wherein the remote adjusting system receives and executes remote adjusting commands to remotely adjust remote quantity control devices; and S6: t=t+1, returning to step S1.
8. The power grid reactive voltage control model training method according to claim 7, wherein, establishing a power grid simulation model includes: constructing, with respect to the regional power grid of n+1 nodes, an undirected graph as follows:
Π(N,E) wherein, N is a set of the power grid nodes; E is a set of the power grid branches, E=(i, j)∈N×N ; and i , j are both the power grid nodes.
9. The power grid reactive voltage control model training method according to claim 8, wherein, establishing a power grid simulation model further includes: constructing a power flow equation of the power grid as follows:
P.sub.ij=G.sub.ijV.sub.i.sup.2−G.sub.ijV.sub.iV.sub.j cos θ.sub.ij−B.sub.ijV.sub.iV.sub.j sin θ.sub.ij, ∀ij∈E
Q.sub.ij=−B.sub.ijV.sub.i.sup.2+B.sub.ijV.sub.iV.sub.j cos θ.sub.ij−G.sub.ijV.sub.iV.sub.j sin θ.sub.ij,∀ij∈E
θ.sub.ij=θ.sub.i−θ.sub.j,∀ij∈E wherein, V.sub.i,θ.sub.i are respectively a voltage amplitude and a phase angle of the power grid node i; V.sub.j,θ.sub.j are respectively a voltage amplitude and a phase angle of the power grid node j; G.sub.ij,B.sub.ij are respectively conductance and susceptance of the power grid branch ij; P.sub.ij,Q.sub.ij are respectively active power and reactive power of the power grid branch ij; and θ.sub.ij is a phase angle difference of the power grid branch ij; with respect to the power grid node j∈N, an expression of power thereof is as follows:
10. The power grid reactive voltage control model training method according to claim 9, wherein, the reactive voltage optimization model is as follows:
11. The power grid reactive voltage control model training method according to claim 10, wherein, in building the interactive training environment based on the Adversarial Markov Decision Process, P,Q are respectively power grid node active power and reactive power injection vectors; V is a power grid node voltage vector; and t is a time variable during training; wherein, C.sub.V is a voltage suppression coefficient; and ReLU is a nonlinear function, which is defined as:
ReLU(x)=max(0,x); wherein, Q.sub.G,Q.sub.C are both reactive power output vectors; and wherein, G, B are respectively vectors composed of conductance and susceptance of all lines.
12. The power grid reactive voltage control model training method according to claim 11, wherein, in training the power grid reactive voltage control model through the joint adversarial training algorithm: γ is a reduction coefficient; α.sub.p,α.sub.o are respectively maximum entropy multipliers of the reactive voltage control model and the adversarial model; π.sub.p is a reactive voltage control model policy; π.sub.o is an adversarial model policy; π.sub.p(·|s.sub.t),π.sub.o(·|s.sub.t) are respectively a reactive voltage control model policy function and an adversarial model policy function, which are defined as action probability distribution in a state s.sub.t, and are fitted by a deep neural network; and H is an entropy function; wherein, θ is a reactive voltage control model policy network parameter; ω is an adversarial model policy network parameter; μ.sub.θ(s) and σ.sub.θ(s) are respectively a mean value and a variance function of the reactive voltage control model; μ.sub.ω(s) and σ.sub.ω(s) are respectively a mean value and a variance function of the adversarial model; N(0,I) is a standard Gaussian distribution function; ξ.sub.p,ξ.sub.o are respectively random variables of the reactive voltage control model and the adversarial model; and s is the Adversarial Markov Decision Process state variable; wherein, s′ is an Adversarial Markov Decision Process state variable at time t+1; a′.sub.p,a′.sub.o are respectively action variables of the reactive voltage control model and the adversarial model at time t+1; and π.sub.p(a′.sub.p|s′),π.sub.o(a′.sub.o|s′) are respectively a reactive voltage control model action probability value and an adversarial model action probability value at time t+1; wherein, ã′.sub.p,ã′.sub.o are respectively estimated action variables of the reactive voltage control model and the adversarial model at time t+1; and wherein, ϕ represents parameters; and π.sub.p(·|s′),π.sub.o(·|s′) are respectively a reactive voltage control model policy function and an adversarial model policy function at time t+1.
13. The power grid reactive voltage control model training method according to claim 12, wherein, an expression of the entropy function H is as follows:
14. The power grid reactive voltage control model training method according to claim 13, wherein, s.sub.t is an Adversarial Markov Decision Process state variable at time t; a.sub.t.sup.p is a reactive voltage control model action variable at time t; a.sub.t.sup.o is an adversarial model action variable at time t; r.sub.t is a feedback variable at time t; and s.sub.t′ is an Adversarial Markov Decision Process state variable at time t+1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) In order to clearly illustrate the technical solutions of the embodiments of the present disclosure or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; it is obvious that the described drawings relate only to some embodiments of the present disclosure; based on these drawings, those ordinarily skilled in the art can acquire other drawings without any inventive work.
DETAILED DESCRIPTION
(6) In order to make the objects, technical details and advantages of the embodiments of the present disclosure apparent, the technical solutions of the embodiments will be described clearly and fully below in connection with the drawings related to the embodiments of the present disclosure. It is obvious that the described embodiments are just a part, but not all, of the embodiments of the present disclosure. Based on the described embodiments herein, those ordinarily skilled in the art can obtain all other embodiments without any inventive work, which shall fall within the scope of the present disclosure.
(7) The present disclosure provides a power grid reactive voltage control model training method; exemplarily, the method comprises the following steps.
(8) Step 1: establishing a power grid simulation model; the power grid simulation model including an undirected graph of a regional power grid based on n+1 nodes, a power flow equation of the power grid, power expressions of respective power grid nodes in the power grid, and power grid parameters; which is specifically as follows:
(9) constructing an undirected graph Π(N,E), with respect to the regional power grid of n+1 nodes; where, N is a set of the power grid nodes, N={0, . . . , n}; E is a set of the power grid branches, E=(i,j)∈N×N; and i, j are both power grid nodes. The power flow equation of the power grid is constructed as follows:
P.sub.ij=G.sub.ijV.sub.i.sup.2−G.sub.ijV.sub.iV.sub.j cos θ.sub.ij−B.sub.ijV.sub.iV.sub.j sin θ.sub.ij,∀ij∈E
Q.sub.ij=−B.sub.ijV.sub.i.sup.2+B.sub.ijV.sub.iV.sub.j cos θ.sub.ij−G.sub.ijV.sub.iV.sub.j sin θ.sub.ij,∀ij∈E
θ.sub.ij=θ.sub.i−θ.sub.j,∀ij∈E (1)
(10) wherein, V.sub.i,θ.sub.i are respectively a voltage amplitude and a phase angle of the power grid node i; V.sub.j,θ.sub.j are respectively a voltage amplitude and a phase angle of the power grid node j; G.sub.ij,B.sub.ij are respectively conductance and susceptance of a power grid branch ij; P.sub.ij,Q.sub.ij are respectively active power and reactive power of the power grid branch ij; and θ.sub.ij is a phase angle difference of the power grid branch ij;
(11) with respect to the power grid node j∈N, an expression of power thereof is as follows:
(12)
(13) Where, P.sub.j,Q.sub.j are respectively active power injection and reactive power injection of the power grid node j; G.sub.sh,i,B.sub.sh,i are respectively ground conductance and susceptance of the power grid node i; P.sub.Dj,Q.sub.Dj are respectively active power load and reactive power load of the power grid node j; P.sub.Gj,Q.sub.Gj are respectively active power output and reactive power output based on a Distributed Generation (DG) of the power grid node j; Q.sub.Cj is reactive power output based on Static Var Compensator (SVC) of the power grid node j; N.sub.IB is a set of power grid nodes coupled to DG in the power grid; N.sub.CD is a set of power grid nodes coupled to static var compensators in the power grid; and K(i) is a set of correspondent nodes of all branches connected with the node i. In general, N.sub.IB∩N.sub.CD=Ø.
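The branch power flows of Expression (1) can be computed directly from the branch parameters and node voltages. The sketch below is an illustration only (not part of the disclosure) and uses just the quantities defined above:

```python
import math

def branch_flows(g_ij, b_ij, v_i, v_j, theta_i, theta_j):
    """Branch power flows per Expression (1):
    P_ij = G_ij V_i^2 - G_ij V_i V_j cos(theta_ij) - B_ij V_i V_j sin(theta_ij)
    Q_ij = -B_ij V_i^2 + B_ij V_i V_j cos(theta_ij) - G_ij V_i V_j sin(theta_ij)
    where theta_ij = theta_i - theta_j is the branch phase angle difference.
    """
    theta_ij = theta_i - theta_j
    p_ij = (g_ij * v_i ** 2
            - g_ij * v_i * v_j * math.cos(theta_ij)
            - b_ij * v_i * v_j * math.sin(theta_ij))
    q_ij = (-b_ij * v_i ** 2
            + b_ij * v_i * v_j * math.cos(theta_ij)
            - g_ij * v_i * v_j * math.sin(theta_ij))
    return p_ij, q_ij
```

A quick sanity check: with equal voltage magnitudes and angles at both ends, both flows vanish, as Expression (1) requires.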
(14) Step 2: establishing a reactive voltage optimization model according to a power grid reactive voltage control target, that is, a control target of minimizing the network loss of the power grid while ensuring that voltages of the respective power grid nodes are within limits, an expression being as follows:
(15)
(16) Where, V.sub.i,
(17) Step 3: building interactive training environment based on Adversarial Markov Decision Process (AMDP), in combination with the power grid simulation model and the reactive voltage optimization model, which specifically includes steps of:
(18) 3.1: constructing an Adversarial Markov Decision Process state variable, with data measured by the power grid system, an expression being as follows:
s=(P,Q,V,t) (4)
(19) Where, P,Q are respectively power grid node active power and reactive power injection vectors; V is a power grid node voltage vector; and t is a time variable during training.
(20) 3.2: constructing a feedback variable, based on the reactive voltage optimization model, an expression being as follows:
(21)
(22) Where, C.sub.V is a voltage suppression coefficient with a typical value of 1,000; and ReLU is a nonlinear function, which is defined as: ReLU(x)=max(0,x).
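As a hedged sketch of the feedback variable of step 3.2: assuming the feedback is the negative of the network loss plus a C.sub.V-weighted ReLU penalty on voltage-limit violations (a form consistent with the control target of step 2, but an assumption here; the voltage limits `v_min`/`v_max` are illustrative values, not from the source), it could be computed as:

```python
def relu(x):
    """ReLU(x) = max(0, x), as defined in step 3.2."""
    return max(0.0, x)

def feedback(network_loss, voltages, v_min=0.95, v_max=1.05, c_v=1000.0):
    """Hypothetical feedback: negative network loss minus a C_V-weighted
    penalty on voltage-limit violations. C_V = 1000 is the typical value
    stated in the disclosure; v_min and v_max are illustrative limits."""
    penalty = sum(relu(v - v_max) + relu(v_min - v) for v in voltages)
    return -(network_loss + c_v * penalty)
```

With all voltages within limits the feedback reduces to the negative network loss; any violation is penalized steeply through C.sub.V.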
(23) 3.3: constructing a reactive voltage control model action variable a.sub.p, with respect to reactive power of a controllable flexible resource, e.g., reactive power of a distributed generation device and a static var compensator, etc., an expression being as follows:
a.sub.p=(Q.sub.G,Q.sub.C) (6)
(24) Where, Q.sub.G,Q.sub.C are both reactive power output vectors;
(25) 3.4: constructing an adversarial model action variable a.sub.o, with respect to parameter uncertainty of the power grid simulation model, i.e., a possible range of parameter errors, for disturbing the reactive voltage control model, an expression being as follows:
a.sub.o=(G,B) (7)
(26) Where, G, B are respectively vectors composed of conductance and susceptance of all lines.
(27) Step 4: training the power grid reactive voltage control model through a joint adversarial training algorithm, specifically comprises steps of:
(28) 4.1: defining a reinforcement learning target function, an expression being as follows:
(29)
(30) Where, γ is a reduction coefficient with a typical value of 0.95; α.sub.p,α.sub.o respectively correspond to maximum entropy multipliers of the reactive voltage control model and the adversarial model, with a typical value of 0.1; π.sub.p(·|s.sub.t),π.sub.o(·|s.sub.t) respectively correspond to a reactive voltage control model policy function and an adversarial model policy function, which are defined as action probability distribution in a state s.sub.t, and are fitted by a deep neural network; and H is an entropy function, an expression being as follows:
(31)
(32) 4.2: converting forms of the reactive voltage control model policy function and the adversarial model policy function, by using reparameterization trick, expressions being respectively as follows:
ã.sub.p.sup.θ(s,ξ.sub.p)=tanh(μ.sub.θ(s)+σ.sub.θ(s)⊙ξ.sub.p),ξ.sub.p˜N(0,I)
ã.sub.o.sup.ω(s,ξ.sub.o)=tanh(μ.sub.ω(s)+σ.sub.ω(s)⊙ξ.sub.o),ξ.sub.o˜N(0,I) (10)
(33) Where, θ is a reactive voltage control model policy network parameter; ω is an adversarial model policy network parameter; μ.sub.θ(s) and σ.sub.θ(s) are respectively a mean value and a variance function of the reactive voltage control model; μ.sub.ω(s) and σ.sub.ω(s) are respectively a mean value and a variance function of the adversarial model; N (0,I) is a standard Gaussian distribution function; ξ.sub.p,ξ.sub.o are respectively random variables of the reactive voltage control model and the adversarial model; and s is the Adversarial Markov Decision Process state variable.
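The reparameterized policies of Expression (10) can be sketched as follows; treating μ and σ as per-dimension outputs of the policy network and using an element-wise product are assumptions of this illustration:

```python
import math
import random

def sample_action(mu, sigma):
    """Reparameterization trick of Expression (10):
    a = tanh(mu(s) + sigma(s) (element-wise) xi), xi ~ N(0, I).
    mu and sigma are per-dimension policy-network outputs for state s."""
    xi = [random.gauss(0.0, 1.0) for _ in mu]
    return [math.tanh(m + s * x) for m, s, x in zip(mu, sigma, xi)]
```

The tanh squashing keeps every action component in (-1, 1), which is then typically rescaled to the device's reactive power limits.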
(34) 4.3: defining and training a joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o); where, ϕ represents parameters; the value function network represents an expected feedback under a corresponding state and action; and a recursive form of Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o) is obtained through a Bellman equation, an expression being as follows:
(35)
(36) Where, s′ is an Adversarial Markov Decision Process state variable at time t+1; a′.sub.p, a′.sub.o are respectively action variables of the reactive voltage control model and the adversarial model at time t+1; and π.sub.p(a′.sub.p|s′),π.sub.o(a′.sub.o|s′) are respectively a reactive voltage control model action probability value and an adversarial model action probability value at time t+1.
(37) From the above, an estimated value of Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o) can be calculated for training, as shown in (12) below:
y(r,s′)=r+γ[Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−α.sub.p log π(ã.sub.p′|s′)−α.sub.o log π(ã.sub.o′|s′)] (12)
(38) Where, ã′.sub.p,ã′.sub.o are respectively estimated action variables of the reactive voltage control model and the adversarial model at time t+1.
(39) When training the joint adversarial value function network Q.sub.ϕ.sup.π(s,a.sub.p,a.sub.o), an expression below can be used:
min(Q.sub.ϕ.sup.π(s′,ã.sub.p′,ã.sub.o′)−y).sup.2,ã.sub.p′˜π.sub.p(·|s′),ã.sub.o′˜π.sub.o(·|s′) (13)
(40) Where, ϕ represents parameters; π.sub.p(·|s′),π.sub.o(·|s′) are respectively a reactive voltage control model policy function and an adversarial model policy function at time t+1.
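Expressions (12) and (13) can be illustrated with a scalar sketch: `td_target` forms the soft Bellman target y(r,s′) and `value_loss` the squared residual to be minimized (the function names are illustrative, not from the source):

```python
def td_target(r, q_next, logp_p_next, logp_o_next,
              gamma=0.95, alpha_p=0.1, alpha_o=0.1):
    """Soft Bellman target y(r, s') of Expression (12); gamma = 0.95 and
    alpha_p = alpha_o = 0.1 are the typical values stated in the disclosure.
    q_next is Q(s', a_p', a_o'); logp_* are the policies' log-probabilities
    of the estimated actions at time t+1."""
    return r + gamma * (q_next - alpha_p * logp_p_next - alpha_o * logp_o_next)

def value_loss(q_pred, y):
    """Squared Bellman residual minimized in Expression (13)."""
    return (q_pred - y) ** 2
```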
(41) 4.4: training the reactive voltage control model policy network, an expression being as follows:
(42)
(43) Step 5: transferring the trained reactive voltage control model to an online system, which specifically comprises steps of:
(44) 5.1: undergoing multiple rounds of training as described above, until a convergence state is reached, to obtain an optimal joint adversarial value function network Q.sub.ϕ.sup.* and a current reactive voltage control model policy π.sub.p, then stopping the training process.
(45) 5.2: using an expression below:
(46) Q.sub.ϕ.sup.*(s,a.sub.p)=E.sub.ξ.sub.o.sub.˜N[Q.sub.ϕ.sup.*(s,a.sub.p,ã.sub.o.sup.ω(s,ξ.sub.o))]
(47) to marginalize the joint adversarial value function network.
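The marginalization of step 5.2 can be approximated by Monte Carlo averaging of the joint value function over sampled adversarial actions; a sketch under that assumption, where `q_fn` and `sample_adv_action` are hypothetical callables standing in for the trained value network and adversarial policy:

```python
def marginalize_q(q_fn, s, a_p, sample_adv_action, n_samples=100):
    """Monte Carlo estimate of the marginalized value function:
    Q*(s, a_p) ~ average over xi_o of Q*(s, a_p, a_o(s, xi_o)),
    averaging the joint value over sampled adversarial actions."""
    total = 0.0
    for _ in range(n_samples):
        total += q_fn(s, a_p, sample_adv_action(s))
    return total / n_samples
```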
(48) 5.3: deploying the reactive voltage control model policy π.sub.p formed by the marginalized joint adversarial value function network and the reactive voltage control model policy network to the online system.
(49) 5.4: initializing the time variable t=0; initializing an experience library D=Ø, the experience library being a set constituted by all historical experiences, and an expression of the experience library D being as follows:
D={(s.sub.t,a.sub.t.sup.p,a.sub.t.sup.o,r.sub.t,s.sub.t′)}
(50) Where, s.sub.t is an Adversarial Markov Decision Process state variable at time t; a.sub.t.sup.p is a reactive voltage control model action variable at time t; a.sub.t.sup.o is an adversarial model action variable at time t; r.sub.t is a feedback variable at time t; and s.sub.t′ is an Adversarial Markov Decision Process state variable at time t+1.
(51) Step 6: reactive voltage control model continuous online learning, which specifically comprises steps of:
(52) 6.1: acquiring measured data from measuring apparatuses of the regional power grid to form a corresponding state variable s.sub.t=(P,Q,V,t);
(53) 6.2: extracting a set of experiences D.sub.B⊆D from the experience library, where the batch size B typically takes a value of 64.
(54) 6.3: updating the reactive voltage control model on D.sub.B, by using Expression (13) and Expression (14).
(55) 6.4: generating an optimal action a.sub.t=tanh(μ.sub.θ(s.sub.t)+σ.sub.θ(s.sub.t)⊙ξ)=(Q.sub.G,Q.sub.C) at time t, by using the reactive voltage control model policy network;
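The squashed-Gaussian sample of step 6.4 can be sketched as follows; the action dimensionality and the split of a.sub.t into (Q.sub.G, Q.sub.C) are placeholders:

```python
import numpy as np

def sample_action(mu, sigma, xi):
    # a_t = tanh( mu_theta(s_t) + sigma_theta(s_t) (elementwise product) xi ),
    # with xi a standard-normal noise draw: the reparameterized Gaussian sample,
    # squashed into (-1, 1) in each dimension by tanh.
    return np.tanh(np.asarray(mu) + np.asarray(sigma) * np.asarray(xi))
```

The tanh squashing keeps every component of the issued action within normalized bounds regardless of the noise magnitude.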
(56) 6.5: issuing the above-described optimal action to controlled devices through a remote adjusting system, wherein the remote adjusting system is configured to receive and execute remote adjusting commands and to remotely adjust remote quantity control devices; and the controlled devices include distributed generation devices and static var compensators.
(57) 6.6: t=t+1, returning to step 6.1.
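Putting steps 6.1 to 6.6 together, one iteration of the continuous online-learning loop might look like the sketch below; every callable (`measure`, `update_model`, `act`, `issue`) is a hypothetical placeholder for the measurement, training, policy, and remote-adjusting interfaces, none of which are specified in code by the disclosure:

```python
import random

def online_step(t, library, measure, update_model, act, issue, batch_size=64):
    # One pass of the continuous online-learning loop (steps 6.1-6.6).
    s_t = measure(t)                              # 6.1: acquire state (P, Q, V, t)
    if len(library) >= batch_size:
        d_b = random.sample(library, batch_size)  # 6.2: draw mini-batch D_B
        update_model(d_b)                         # 6.3: update via Expressions (13)/(14)
    a_t = act(s_t)                                # 6.4: optimal action (Q_G, Q_C)
    issue(a_t)                                    # 6.5: send to DGs / SVCs remotely
    return t + 1                                  # 6.6: t = t + 1, return to 6.1
```

Calling `online_step` repeatedly with the returned time index reproduces the loop of step 6.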
(58) The present disclosure further provides a power grid reactive voltage control model training system that can implement the above-described method, as shown in
(59) Exemplarily,
(60) The regional power grid control center server establishes a reactive voltage control model according to the data measured by the measuring apparatuses of the respective power grid nodes in the power grid system; the model is established through the above-described step 1 to step 5. The reactive voltage control model established by the regional power grid control center server is then deployed to an online system, that is, to the regional power grid controller, where it continues online learning. Specifically, a remote adjusting system is used for communication between the regional power grid and the regional power grid controller; the measuring apparatuses of the respective power grid nodes in the regional power grid transmit their measured data, including active and reactive power injection vectors as well as power grid node voltage vectors of the respective power grid nodes, to the regional power grid controller through the remote adjusting system. The regional power grid controller performs reactive voltage control model online learning according to the measured data; the learning step is the above-described step 6. Through continuous online learning, the reactive voltage control model generates an optimal reactive voltage control policy and issues it to the distributed generation devices and the static var compensators, controlling them to perform corresponding actions.
(61) The present disclosure further provides a computer-readable storage medium; the computer-readable storage medium stores logic instructions therein; and a processor may call the logic instructions in the computer-readable storage medium to execute the method according to the above-described embodiment, as shown in
(62) In addition, the logic instructions in the above-described computer-readable storage medium may be implemented in a form of a software functional unit, and sold or used as an independent product.
(63) The above-described computer-readable storage medium may be configured to store software programs and computer-executable programs, for example, program instructions/modules corresponding to the method according to this embodiment. The processor runs the software programs, instructions and modules stored in the computer-readable storage medium, so as to execute functional applications and data processing, that is, implement the reactive voltage control model training method according to the above-described embodiments.
(64) The computer-readable storage medium may include a program storage region and a data storage region, wherein, the program storage region may store an operating system and an application program required by at least one function; and the data storage region may store data created according to use of a terminal device, etc. In addition, the computer-readable storage medium may include a high-speed random access memory, and may further include a non-volatile memory.
(65) In these embodiments, the error between the power grid simulation model and the real physical system model is treated as a disturbance during training: the Adversarial Markov Decision Process is established to train the adversarial model synchronously, and the reactive voltage control model is disturbed by model errors so that it becomes robust to them, thereby training a transferable deep reinforcement learning model. These embodiments make full use of the internal information of the power grid simulation model, so that the obtained model may be safely and efficiently transferred to online power grid reactive voltage control. This greatly improves the efficiency and safety of the data-driven power grid reactive voltage control method, and is particularly suitable for a regional power grid with a serious model incompleteness problem; it not only saves the high costs of repeatedly maintaining an accurate model, but also avoids the safety problem caused by online learning in data-driven power grid reactive voltage control, making the method suitable for large-scale deployment.
(66) Although the present disclosure is explained in detail with reference to the foregoing embodiments, those of ordinary skill in the art will readily appreciate that many modifications are possible in the foregoing respective embodiments, or equivalent substitutions may be made for part of the technical features; however, these modifications or substitutions are not intended to make the essence of the corresponding technical solutions depart from the spirit and the scope of the technical solutions of the respective embodiments of the present disclosure.