Multi-objective real-time power flow control method using soft actor-critic

11336092 · 2022-05-17

    Abstract

    Systems and methods are disclosed for controlling voltage profiles, line flows and transmission losses of a power grid by forming an autonomous multi-objective control model with one or more neural networks as a Deep Reinforcement Learning (DRL) agent; training the DRL agent to provide data-driven, real-time and autonomous grid control strategies; and coordinating and optimizing power controllers to regulate voltage profiles, line flows and transmission losses in the power grid with a Markov decision process (MDP) operating with reinforcement learning to solve control problems in dynamic and stochastic environments.

    Claims

    1. A method to control voltage profiles, line flows and transmission losses of a power grid, comprising: forming an autonomous multi-objective control model with one or more neural networks as a Deep Reinforcement Learning (DRL) agent using a soft actor critic with multiple control objectives including regulating bus voltages within power zones and minimizing transmission line losses while respecting power flow equations and physical constraints; training the DRL agent to provide data-driven, real-time and autonomous grid control strategies; and coordinating and optimizing power controllers to regulate voltage profiles, line flows and transmission losses in the power grid with a Markov decision process (MDP) operating with reinforcement learning with the soft actor critic and updating a policy network and temperature coefficient to solve control problems in dynamic and stochastic environments.

    2. The method of claim 1, wherein the DRL agent is trained offline by interacting with offline simulations and historical events, which are periodically updated.

    3. The method of claim 1, wherein the DRL agent provides autonomous control actions once abnormal conditions are detected.

    4. The method of claim 1, wherein after an action is taken in the power grid at a current state, the DRL agent receives a reward from the power grid.

    5. The method of claim 1, comprising updating a relationship among action, states and reward in an agent's non-transitory memory.

    6. The method of claim 1, comprising solving a coordinated voltage, line flows and transmission losses control problem.

    7. The method of claim 6, comprising performing a Markov Decision Process (MDP) that represents a discrete time stochastic control process.

    8. The method of claim 6, comprising using a 4-tuple to formulate the MDP:
    (S, A, P_a, R_a), where S is a vector of system states, A is a list of actions to be taken, P_a(s, s′) = Pr(s_t+1 = s′ | s_t = s, a_t = a) represents the transition probability from a current state s_t to a new state s_t+1 after taking an action a at time t, and R_a(s, s′) is the reward received after reaching state s′ from a previous state s, quantifying control performance.

    9. The method of claim 1, wherein the DRL agent comprises two architecture-identical deep neural networks including a target network and an evaluation network.

    10. The method of claim 1, comprising providing a sub-second control with an EMS or PMU data stream from a wide area measurement system (WAMS).

    11. The method of claim 1, wherein the DRL agent self-learns by exploring control options in a high-dimensional space, moving out of local optima.

    12. The method of claim 1, comprising performing voltage control, line flow control and transmission loss control by the DRL agent by considering multiple control objectives and security constraints.

    13. The method of claim 1, wherein a reward is determined based on voltage operation zones with voltage profiles, including a normal zone, a violation zone, and a diverged zone.

    14. The method of claim 1, comprising applying a decaying ε-greedy method for learning, with a decaying probability ε_i of making a random action selection at an i-th iteration, wherein ε_i is updated as ε_i+1 = r_d × ε_i if ε_i > ε_min, and ε_i+1 = ε_min otherwise, where r_d is a constant decay rate.

    15. A method to control voltage profiles, line flows and transmission losses of a power grid, comprising: measuring states of a power grid; determining abnormal conditions and locating affected areas in the power grid; creating representative operating conditions including contingencies for the power grid; conducting power grid simulations in an offline or online environment; training deep-reinforcement-learning-based agents using a soft actor critic with multiple control objectives including regulating bus voltages within power zones and minimizing transmission line losses while respecting power flow equations and physical constraints for autonomously controlling power grid voltage profiles, line flows and transmission losses; and coordinating and optimizing control actions of power controllers in the power grid with the soft actor critic and updating a policy network and temperature coefficient.

    16. The method of claim 15, wherein the measuring states comprises measuring from phasor measurement units or energy management systems.

    17. The method of claim 15, comprising generating data-driven, autonomous control commands for correcting voltage issues and line flow issues considering contingencies in the power grid.

    18. The method of claim 15, comprising presenting expected control outcomes once one or more DRL-based commands are applied to a power grid.

    19. The method of claim 15, comprising providing a sub-second control with a phasor measurement unit (PMU) data stream from a wide area measurement system.

    20. The method of claim 15, comprising providing a platform for data-driven, autonomous control commands for regulating voltages, line flows, and transmission losses in a power network under normal and contingency operating conditions.

    Description

    BRIEF DESCRIPTIONS OF FIGURES

    (1) FIG. 1 shows an exemplary framework of the interaction between SAC agents and power grid simulation environment.

    (2) FIG. 2 shows an exemplary framework for multi-objective autonomous controls for grid operation using deep reinforcement learning, including multiple offline training processes, online training process and real-time utilization process.

    (3) FIG. 3 shows an exemplary SAC agent performance in controlling voltage profiles and line losses in the DRL-based autonomous control method for a power grid.

    (4) FIG. 4 shows an exemplary chart showing reduction in transmission losses.

    (5) FIG. 5 shows an exemplary power grid control system using the above framework.

    DETAILED DESCRIPTION OF THE ILLUSTRATED EMBODIMENTS

    (6) Without losing generality, this embodiment mainly targets deriving real-time corrective operational control decisions for actual system operating conditions at an interval of 5 minutes in a control center. The control objectives include regulating bus voltages within their secure zones and minimizing transmission line losses while respecting power flow equations and physical constraints, e.g., line ratings and generator limits. The mathematical formulation of the control problem is given below:

    (7) Objective:
    minimize Σ_(i,j)∈Ω_L Ploss_ij  (1)
    Subject to:
    Σ_n∈Gi P_n^g − Σ_m∈Di P_m^d − g_i V_i^2 = Σ_j∈Bi P_ij(y), i ∈ B  (2)
    Σ_n∈Gi Q_n^g − Σ_m∈Di Q_m^d − b_i V_i^2 = Σ_j∈Bi Q_ij(y), i ∈ B  (3)
    P_n^min ≤ P_n ≤ P_n^max, n ∈ G  (4)
    Q_n^min ≤ Q_n ≤ Q_n^max, n ∈ G  (5)
    V_i^min ≤ V_i ≤ V_i^max, i ∈ B  (6)
    P_ij^2 + Q_ij^2 ≤ (S_ij^max)^2, (i,j) ∈ Ω_L  (7)
    where Eqs. (2) and (3) represent active and reactive power flow equations, respectively. Eqs. (4) and (5) are active and reactive power output constraints of each generator, respectively. Eqs. (6) and (7) specify bus voltage secure zones and line flow limits of a power system to be controlled, respectively.
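The inequality constraints (4)-(7) amount to simple bound checks once a power flow solution is available. A minimal helper illustrating those checks is sketched below; the array names and shapes are assumptions, and the equality constraints (2)-(3) are taken to be enforced by the AC power flow solver itself:

```python
import numpy as np

def constraints_satisfied(p_g, p_lim, q_g, q_lim, vm, v_lim, p_ij, q_ij, s_max):
    """Check constraints (4)-(7): generator active/reactive power limits,
    bus voltage secure zones, and apparent-power line flow limits."""
    p_ok = np.all((p_lim[:, 0] <= p_g) & (p_g <= p_lim[:, 1]))   # Eq. (4)
    q_ok = np.all((q_lim[:, 0] <= q_g) & (q_g <= q_lim[:, 1]))   # Eq. (5)
    v_ok = np.all((v_lim[0] <= vm) & (vm <= v_lim[1]))           # Eq. (6)
    s_ok = np.all(p_ij ** 2 + q_ij ** 2 <= s_max ** 2)           # Eq. (7)
    return bool(p_ok and q_ok and v_ok and s_ok)
```

A feasible solution passes all four checks; any single out-of-bound generator output, bus voltage, or line loading fails the whole test.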
    A. Overall Flowchart of the Proposed Methodology

    (8) Deriving multi-objective real-time control actions can be formulated as a discrete-time stochastic control process, i.e., an MDP. Among various DRL techniques, the off-policy SAC method is adopted for its fast convergence and robustness; it maximizes the expected reward while exploring as many control actions as possible, improving the chance of finding the global optimum. FIG. 1 shows the interaction between the power grid environment (EMS AC power flow solver) and the SAC agent: the environment receives a control action, outputs the corresponding next system states and calculates the reward, and the SAC agent receives the states and reward before outputting new control actions.

    (9) The main flowchart of the proposed methodology is depicted in FIG. 2. The left side of FIG. 2 shows the offline training process of a SAC agent. Representative system operating snapshots are collected from the EMS for preprocessing. System state variables are extracted from the converged snapshots and fed into the SAC agent training submodule, where neural networks establish a direct mapping between system states and control actions. The controls are then verified by another AC power flow solution to calculate reward values before the SAC agent weights are updated to maximize the long-term expected reward. To ensure long-term effectiveness and robustness of the SAC agent, multiple training processes with different sets of hyperparameters are launched simultaneously, including several offline training processes and one online training process (initialized with the best offline-trained model), as shown on the right side of FIG. 2. The best-performing model from these processes is selected for application in the real-time environment.

    (10) B. Training Effective SAC Agents

    (11) To train effective DRL agents for multi-objective real-time power flow control, one needs to carefully define several key elements, including:

    (12) 1) Episode and Terminating Conditions

    (13) Each episode is defined as a quasi-steady-state operating snapshot, obtained from the EMS system and saved in text files. A training episode terminates when: i) there are no more voltage or thermal violations and the reduction of transmission losses reaches a threshold, e.g., 0.5%; ii) the power flow diverges; or iii) the maximum number of control iterations is reached.
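The three terminating conditions above can be captured in a single predicate. The sketch below is illustrative; the argument names and the 0.5% default threshold follow the text, while the flag-based interface is an assumption:

```python
def episode_done(has_violation, loss_reduction, diverged, iteration, max_iter,
                 loss_threshold=0.005):
    """Return True when a training episode should terminate:
    i) no violations remain and loss reduction meets the threshold (0.5%),
    ii) the power flow diverged, or
    iii) the control iteration budget is exhausted."""
    solved = (not has_violation) and loss_reduction >= loss_threshold
    return solved or diverged or iteration >= max_iter
```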

    (14) 2) State Space

    (15) The state space is formed by bus voltage magnitudes, phase angles, and active and reactive power flows on transmission lines. A batch normalization technique is applied to the different types of variables to maintain consistency and improve model training efficiency.
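Because the four variable types live on very different scales (p.u. voltages versus MW flows), each type is normalized separately before concatenation. The sketch below uses a simple per-snapshot standardization as a stand-in for the learned batch-normalization layer used during training; names and shapes are assumptions:

```python
import numpy as np

def build_state(vm, va, p_line, q_line, eps=1e-8):
    """Assemble the state vector: standardize each variable type
    independently so no single unit dominates, then concatenate."""
    parts = []
    for x in (vm, va, p_line, q_line):
        x = np.asarray(x, dtype=float)
        parts.append((x - x.mean()) / (x.std() + eps))
    return np.concatenate(parts)
```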

    (16) 3) Control Space

    (17) In this work, conventional generators are used to regulate voltage profiles and transmission line losses. A control vector is created to include the voltage set point at each power plant as a continuous value, e.g., within [0.9, 1.1] p.u.
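A continuous policy typically emits a squashed output per plant, which is then mapped onto the set-point range. The [-1, 1] squashing convention below is an assumption (common with SAC's tanh policy); the patent only specifies the continuous [0.9, 1.1] p.u. range:

```python
import numpy as np

def to_setpoints(action, v_lo=0.9, v_hi=1.1):
    """Map a policy output in [-1, 1] (one entry per power plant)
    to voltage set points in [v_lo, v_hi] p.u."""
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return v_lo + (a + 1.0) * 0.5 * (v_hi - v_lo)
```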

    (18) 4) Reward Definition

    (19) The reward value at each control iteration during SAC agent training is computed with the following logic:

    (20) If a voltage or flow violation is detected:

    (21) reward = −dev_overflow/10 − vio_voltage/100
    else if delta_p_loss < 0:
    reward = 50 − delta_p_loss × 1000
    else if delta_p_loss ≥ 0.02:
    reward = −100
    else: reward = −1 − (p_loss − p_loss_pre) × 50
    where dev_overflow = Σ_i^N (Sline(i) − Sline_max(i))^2; N is the total number of lines with thermal violations; Sline is the apparent power of a line; Sline_max is the limit of line apparent power; vio_voltage = Σ_j^M (Vm(j) − Vmin) × (Vm(j) − Vmax); M is the total number of buses with voltage violations;

    (22) delta_p_loss = (p_loss − p_loss_pre)/p_loss_pre;
    p_loss is the present transmission loss value and p_loss_pre is the line loss at the base case. The details of training SAC agents are given in Algorithm I, shown below.
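The piecewise reward logic above translates directly into code. The function and argument names in this sketch are illustrative, not from the patent; the constants and branches follow the text:

```python
import numpy as np

def control_reward(s_line, s_max, vm, v_min, v_max, p_loss, p_loss_pre):
    """Reward at one control iteration per the piecewise logic above."""
    over = s_line > s_max                  # lines with thermal violation
    viol = (vm < v_min) | (vm > v_max)     # buses with voltage violation
    if over.any() or viol.any():
        dev_overflow = ((s_line[over] - s_max[over]) ** 2).sum()
        vio_voltage = ((vm[viol] - v_min) * (vm[viol] - v_max)).sum()
        return -dev_overflow / 10 - vio_voltage / 100
    delta_p_loss = (p_loss - p_loss_pre) / p_loss_pre
    if delta_p_loss < 0:                   # losses reduced: bonus
        return 50 - delta_p_loss * 1000
    if delta_p_loss >= 0.02:               # losses grew too much: penalty
        return -100
    return -1 - (p_loss - p_loss_pre) * 50
```

Note that the voltage penalty term (Vm − Vmin)(Vm − Vmax) is positive exactly when a bus is outside [Vmin, Vmax], so both branches of the violation case contribute negative reward.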

    (23) TABLE-US-00001 Algorithm I: Soft Actor-Critic Training Process for Multi-Objective Power Flow Control
    1. Initialize weights θ_i of the two Q(s, a) networks, ϕ of the policy π(s, a), and ψ of the value function V(s); initialize the target value network weights ψ̄ ← ψ; initialize replay buffer D; set up training environment env
    2. for k = 1, 2, ... (k is the counter of episodes for training)
    3.   for t = 1, 2, ... (t stands for control iteration)
    4.     reset environment s ← env.reset()
    5.     obtain states and sample an action a ~ π(·|s_t)
    6.     apply action a and obtain the next states s_t+1, reward value r and termination signal done
    7.     store tuple <s_t, a_t, r_t, s_t+1, done> in D
    8.     s_t ← s_t+1
    9.     if policy updating conditions are satisfied:
    10.      for a required number of policy updates:
    11.        randomly sample <s_t, a_t, r, s_t+1, done> from D
    12.        update the Q functions Q(s, a): θ_i ← θ_i − λ_Q ∇J_Q(θ_i)
    13.        update the value function V(s): ψ ← ψ − λ_V ∇J_V(ψ)
    14.        update the policy network π(s, a): ϕ ← ϕ − λ_π ∇J_π(ϕ)
    15.        update the target network: ψ̄ ← τψ + (1 − τ)ψ̄
    16.        update the temperature coefficient α
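The two closing updates of Algorithm I (steps 15 and 16) can be sketched on plain Python lists of scalar weights. This is a minimal illustration of the update rules under standard SAC conventions, not the networks themselves; the temperature objective J(α) = E[−α(log π + H̄)] is the usual SAC formulation, assumed here since the patent only names the step:

```python
def soft_update(target_w, source_w, tau):
    """Step 15: Polyak averaging of the target value network,
    psi_bar <- tau * psi + (1 - tau) * psi_bar, weight by weight."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source_w, target_w)]

def update_temperature(alpha, log_probs, target_entropy, lr):
    """Step 16: one gradient step on the temperature alpha using
    J(alpha) = E[-alpha * (log_pi + target_entropy)], whose gradient
    is -mean(log_pi + target_entropy); alpha is kept non-negative."""
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return max(alpha - lr * grad, 0.0)
```

A small τ (e.g., 0.005) makes the target network trail the value network slowly, which stabilizes the bootstrapped targets in steps 12-13.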

    (24) In one implementation, the system uses a 4-tuple to formulate the MDP:
    (S,A,P.sub.a,R.sub.a)

    (25) where S is a vector of system states, A is a list of actions to be taken, P_a(s, s′) = Pr(s_t+1 = s′ | s_t = s, a_t = a) represents the transition probability from a current state s_t to a new state s_t+1 after taking an action a at time t, and R_a(s, s′) is the reward received after reaching state s′ from a previous state s, quantifying control performance. The system includes providing sub-second control with an EMS or PMU data stream from a wide area measurement system (WAMS). The system can apply a decaying ε-greedy method for learning, with a decaying probability ε_i of making a random action selection at the i-th iteration, wherein ε_i is updated as

    (26) ε_i+1 = r_d × ε_i, if ε_i > ε_min; ε_i+1 = ε_min, otherwise

    (27) where r_d is a constant decay rate.

    (28) The proposed SAC-based methodology for multi-objective power flow control was developed and deployed in the control center of SGCC Jiangsu Electric Power Company. To demonstrate its effectiveness, the city-level high-voltage (220 kV+) power network serving the city of Zhangjiagang is used, which consists of 45 substations, 5 power plants (with 12 generators) and around 100 transmission lines. Massive historical operating snapshots (full topology node/breaker models for the Jiangsu province with ~1500 nodes and ~420 generators, at an interval of 5 minutes) were obtained from the EMS system (named the D5000 system), whose AC power flow computational module is used as the grid simulation environment to train SAC agents. The control objectives are to reduce transmission losses by at least 0.5% without violating bus voltage limits ([0.97, 1.07] p.u.) or line flow limits (100% of MVA rating). Voltage set points of the 12 generators in the 5 power plants are adjusted by the SAC agent.

    (29) The performance of training and testing SAC agents using a time series of actual system snapshots is illustrated in FIG. 3 and FIG. 4. From 12/3/2019 to 1/13/2020, 7,249 operating snapshots were collected. Two additional copies of the original snapshots were created and randomly shuffled to create a training set (80%) and a test set (20%). For the first ~150 snapshots, the SAC agent struggles to find effective policies (with negative reward values), but achieves satisfactory performance thereafter. Several training processes are launched simultaneously and updated twice a week to ensure control performance. For real-time application during this period, the developed method provides valid controls for 99.41% of the cases. The average line loss reduction is 3.6412% (compared to the line loss value before control actions). Of the 1,019 snapshots with voltage violations, the SAC agent completely solves 1,014 and effectively mitigates the remaining 5.

    (30) An example test bed is shown in FIG. 5. The test bed models the exemplary power grid and sensor network, where data collected from the energy management system (EMS) or phasor measurement units (PMUs) is transmitted through communication networks to the data server. The data server stores and manages the measured data and provides a data pipeline to the application server. The pre-trained reinforcement learning model runs on the application server. The control commands and expected performance are sent to the user interface and shown to the users. The test bed running the method of FIG. 2 has a framework modeled by the following: forming an autonomous multi-objective control model with one or more neural networks as a Deep Reinforcement Learning (DRL) agent; training the DRL agent to provide data-driven, real-time and autonomous grid control strategies; and coordinating and optimizing power controllers to regulate voltage profiles, line flows and transmission losses in the power grid with a Markov decision process (MDP) operating with reinforcement learning to solve control problems in dynamic and stochastic environments.

    (31) The system supports training effective SAC agents with periodic updating for multi-objective power flow control in a real-time operational environment. The detailed design and flowchart of the proposed methodology are provided for reducing transmission losses without violating voltage and line constraints. Numerical simulations conducted on a real power network in a real-time operational environment demonstrate its effectiveness and robustness.

    (32) Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. As used herein, the term “module” or “component” may refer to software objects or routines that execute on the computing system. The different components, modules, engines, and services described herein may be implemented as objects or processes that execute on the computing system (e.g., as separate threads). While the system and methods described herein may be preferably implemented in software, implementations in hardware or a combination of software and hardware are also possible and contemplated. In this description, a “computing entity” may be any computing system as previously defined herein, or any module or combination of modules running on a computing system. All examples and conditional language recited herein are intended for pedagogical objects to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Although embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.