PV Ramp Rate Control Using Reinforcement Learning Technique Through Integration of Battery Storage System
20170117744 · 2017-04-27
Inventors
CPC classification
Y02E10/56
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y02E10/50
Y02E70/30
International classification
H02J7/00
ELECTRICITY
Abstract
Systems and methods are disclosed for storing photovoltaic (PV) generation by applying reinforcement learning (RL)-based control to battery storages for PV ramp rate control, and by exchanging energy dynamically to limit a ramp rate of the PV power output while maintaining a battery state of charge at a predefined level to minimize the required battery size and extend the battery life cycles.
Claims
1. A process for storing photovoltaic (PV) generation, comprising: applying reinforcement learning (RL)-based control to battery storages for PV ramp rate control; and exchanging energy dynamically to limit a ramp rate of the PV power output while maintaining a battery state of charge at a predefined level to minimize a required battery size and extend battery life cycles.
2. The process of claim 1, comprising adjusting battery operation to different PV profiles without knowing in advance the PV profiles.
3. The process of claim 1, comprising monitoring system operation status at each time instant t: {ΔP.sub.dc(t), E.sub.be,cap(t), P.sub.BE(t)}.
4. The process of claim 1, wherein a controller generates a battery power change control action ΔP.sub.be(t), and wherein a battery operation controller applies reinforcement learning-based optimization approaches.
5. The process of claim 1, wherein, for the RL, Q-learning is used to find an optimal battery operation sequence.
6. The process of claim 1, comprising determining discrete state-action (s.sub.t, a.sub.t) pairs, wherein each pair estimates an expected value of a total reward returned over all successive optimal actions.
7. The process of claim 4, comprising iteratively updating the Q-value for each state-action pair along system operation.
8. The process of claim 5, comprising applying the Q-value to determine battery operation actions.
9. The process of claim 1, comprising determining a reward function R as a function of suppression of PV power ramp rate and a deviation of battery capacity from predefined setting.
10. The process of claim 1, comprising determining a power balance as:
P.sub.dc=P.sub.pv+P.sub.be, where battery power (P.sub.be) is controlled to compensate for fluctuations of PV power generation (P.sub.pv), so that a ramp rate of the total power output (P.sub.dc) to grid can be limited within a desired level.
11. The process of claim 1, wherein a ramp rate of P.sub.dc is limited within a maximum allowable ramp rate (MARR).
12. The process of claim 11, wherein the ramp rate of P.sub.dc comprises:
13. The process of claim 11, wherein a sampling time interval is Δt, comprising determining
14. The process of claim 1, comprising optimizing a battery operation policy by: limiting a ramp rate of integrated DC power (RR.sub.dc) within MARR; and maintaining a battery energy capacity around a reference setting point (E.sub.be,ref) where the battery life can be maximized.
15. The process of claim 1, comprising optimizing multi-objective functions with:
16. The process of claim 1, comprising determining a state space S, an action set A, and reward functions R, the reward R being a function of S and A, wherein the State (S) space includes {(ΔP.sub.dc(t), E.sub.be,cap(t), P.sub.BE(t))}, the Action (A) space includes only one element {ΔP.sub.be(t)}, the battery power change, and a Reward value (R) is determined therefrom.
17. The process of claim 16, wherein the reward value is calculated at each time instant, and the reward value at t is calculated based on the collected information between t-1 and t.
18. The process of claim 1, comprising applying Q-learning to find an optimal battery operation sequence to maximize the total rewards.
19. The process of claim 18, wherein the Q-learning uses temporal differences to estimate the Q value of each state-action pair Q*(s,a), wherein Q*(s,a) is an expected value of taking action a in state s and following the optimal policy thereafter, where the expected value is the cumulative discounted reward with:
20. The process of claim 19, wherein the action-value set Q(s,a) is learned and updated along system operation, comprising determining an optimal action by selecting the action with the highest Q value in each state and updating Q(s,a) as:
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION
[0024] The target system (PV integrated with battery storage system) is shown in the accompanying figure, with the power balance
P.sub.dc=P.sub.pv+P.sub.be(1)
[0026] The desired ramp-rate of P.sub.dc is defined as the maximum allowable ramp rate (MARR). The MARR could be defined in different units, e.g. W/sec, kW/min.
[0027] The ramp rate of P.sub.dc can be described as:
RR.sub.dc(t)=dP.sub.dc(t)/dt (2)
[0028] Assuming the sampling time interval is Δt, Eq. (2) can be written as
RR.sub.dc(t)=(P.sub.dc(t)−P.sub.dc(t−1))/Δt (3)
[0029] So the ramp rate should satisfy
|RR.sub.dc(t)|≤MARR
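As an illustration of the discretized relation in Eq. (3) and the MARR constraint, a minimal Python sketch follows; the function names, sign conventions, and numeric values are illustrative assumptions, not taken from the disclosure:

```python
# Illustrative sketch of the discretized ramp rate and the MARR check.
# Units are assumed, e.g. power in kW and dt in minutes -> ramp rate in kW/min.

def ramp_rate(p_dc_now: float, p_dc_prev: float, dt: float) -> float:
    """RR_dc(t) = (P_dc(t) - P_dc(t-1)) / dt, per Eq. (3)."""
    return (p_dc_now - p_dc_prev) / dt

def within_marr(p_dc_now: float, p_dc_prev: float, dt: float, marr: float) -> bool:
    """Check the constraint |RR_dc(t)| <= MARR."""
    return abs(ramp_rate(p_dc_now, p_dc_prev, dt)) <= marr
```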
[0030] To illustrate the instant ramp rate control method, an illustrative PV power ramp-down and the corresponding compensating battery power are shown in the accompanying figure.
[0031] There are several variables which need to be defined or optimized during the ramping control process: [0032] RR.sub.dc: the targeted ramp rate of integrated DC power; [0033] RR.sub.be,event: the ramp rate of BE power during the ramping event time period (t.sub.1 to t.sub.2); [0034] RR.sub.be,post-event: the ramp rate of BE power during the post-ramping event time period (t.sub.2 to t.sub.3); [0035] RR.sub.be,reco: the ramp rate of BE power during the recovering time period (t.sub.3 to t.sub.4).
[0036] Among those variables, the ramp rate or power change of BE power determines the ramp rate of integrated DC power output.
[0037] The battery operation policy can be optimized considering the following two objectives: [0038] 1) The ramp rate of integrated DC power (RR.sub.dc) is limited within MARR; [0039] 2) The battery energy capacity is maintained around the reference setting point (E.sub.be, ref) where the battery life can be maximized.
[0040] The multi-objective functions are described as:
[0041] Where ω.sub.1, ω.sub.2 are the weight coefficients.
[0042] The following operation constraints need to be satisfied: [0043] The ramp rate of exported DC power is within limit
|RR.sub.dc(t)|≤MARR [0044] The battery energy level is within limits
E.sub.be,min≤E.sub.be(t)≤E.sub.be,max [0045] Battery power is within limits
P.sub.be,min≤P.sub.be(t)≤P.sub.be,max
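A hedged sketch of enforcing these three constraints by clipping a candidate battery power is given below; it assumes a sign convention where positive p_be discharges the battery, and all names and limits are illustrative:

```python
# Clip a candidate battery power so that both the power limits
# (P_be,min <= P_be <= P_be,max) and the energy limits
# (E_be,min <= E_be <= E_be,max) hold after one step of length dt.

def clip_battery_action(p_be_cand, e_be, dt,
                        p_be_min, p_be_max, e_be_min, e_be_max):
    """Return the nearest battery power respecting power and energy limits."""
    p = min(max(p_be_cand, p_be_min), p_be_max)  # power limit
    e_next = e_be - p * dt       # discharging (p > 0) lowers stored energy
    if e_next > e_be_max:        # charging too hard -> back off
        p = (e_be - e_be_max) / dt
    elif e_next < e_be_min:      # discharging too hard -> back off
        p = (e_be - e_be_min) / dt
    return p
```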
[0046] At each time instant t, when the PV power output fluctuates (ΔP.sub.pv(t)), so does the exported DC power P.sub.dc(t) when the battery power output (P.sub.be(t)) is kept the same as at the previous time step (t-1). Based on the above known conditions, the battery power will be adjusted to minimize the objectives in (4) subject to the above constraints. The online management flowchart at each time instant t is shown in the accompanying figure.
[0048] Next, the Reinforcement Learning-based optimization approach is detailed.
[0049] There are three elements in RL techniques: state space S, action set A, and reward function R, where the reward R is a function of S and A. They are defined as follows: [0050] State (S) [0051] The state space includes {(ΔP.sub.dc(t), E.sub.be,cap(t), P.sub.BE(t))}. [0052] Action (A) [0053] The action space includes only one element {ΔP.sub.be(t)}, the battery power change. [0054] Reward value (R) [0055] The reward value is calculated at each time instant; the reward value at t is calculated based on the collected information between t-1 and t.
[0056] The reward function is defined in a similar way as the objectives in (4). R is defined in this way so that the energy drawn from the battery and the ramp rate of the exported DC power are minimized through maximizing the reward value R.
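A minimal reward sketch consistent with the two stated objectives (limit |RR.sub.dc| to MARR, keep E.sub.be near E.sub.be,ref) is shown below; the quadratic penalty form and the weights w1, w2 are assumptions, since the exact reward expression is not reproduced in this extract:

```python
# Reward is the negative weighted sum of two penalties: excess ramp rate
# beyond MARR, and deviation of stored energy from its reference.

def reward(rr_dc, e_be, marr, e_be_ref, w1=1.0, w2=0.5):
    """Larger (less negative) reward when |RR_dc| stays within MARR and
    E_be stays close to E_be,ref."""
    ramp_penalty = max(0.0, abs(rr_dc) - marr) ** 2  # only the excess is penalized
    soc_penalty = (e_be - e_be_ref) ** 2
    return -(w1 * ramp_penalty + w2 * soc_penalty)
```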
[0057] As one of the RL techniques, Q-learning is used to find the optimal battery operation sequence which maximizes the total rewards. Q-learning uses temporal differences to estimate the Q value of each state-action pair, Q*(s,a). Q*(s,a) is the expected value of taking action a in state s and following the optimal policy thereafter, where the expected value is the cumulative discounted reward:
Q*(s,a)=E[R.sub.t+1+γR.sub.t+2+γ.sup.2R.sub.t+3+ . . . |s.sub.t=s, a.sub.t=a]
[0058] Where γ is the discount factor between 0 and 1. γ reflects how much the future rewards are counted into the total value compared with the immediate rewards. One of the advantages of Q-learning is that it does not require a model of the environment.
[0059] The action-value set Q(s,a) is learned and updated along system operation, and the optimal action can be determined by selecting the action with the highest Q value in each state. The update of Q(s,a) is a value-iteration update defined as:
Q(s.sub.t,a.sub.t)←Q(s.sub.t,a.sub.t)+α(s.sub.t,a.sub.t)[R.sub.t+1+γ max.sub.a Q(s.sub.t+1,a)−Q(s.sub.t,a.sub.t)]
[0060] Where R.sub.t+1 is the reward after performing a.sub.t in state s.sub.t; α(s.sub.t, a.sub.t) is the learning rate, which can be a constant value for all state-action pairs or can vary with the state-action pair; and γ is the discount factor between 0 and 1, reflecting how much the future rewards are counted into the total value compared with the immediate rewards.
[0061] At the beginning of the Q-learning, the initial value of Q for all state-action pairs can be set arbitrarily and updated iteratively later. The Q-learning procedure is illustrated in the accompanying flowchart.
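One tabular value-iteration update of Q(s,a), as described above, can be sketched as follows; the dict-based table and the helper name q_update are illustrative:

```python
# One step of tabular Q-learning:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))

def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Update and return Q(s,a).

    q: dict mapping (state, action) -> value; unseen pairs default to 0.0,
    matching the arbitrary initialization noted in the text.
    """
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```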
[0062] There are different policies for the action selection. The choice among these policies aims at the trade-off between the exploitation and exploration phases during system operation. For example, an ε-greedy policy can be chosen for the action selection during the exploration phase, where the action with the highest Q value is selected with probability 1−ε and, the rest of the time, a random action is chosen uniformly.
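The ε-greedy selection just described can be sketched as below; the dict-based Q table mirrors the update sketch above, and all names are illustrative:

```python
import random

# Exploit the highest-Q action with probability 1 - eps;
# otherwise explore by picking an action uniformly at random.

def epsilon_greedy(q, s, actions, eps, rng=random):
    """q: dict mapping (state, action) -> value; unseen pairs default to 0.0."""
    if rng.random() < eps:
        return rng.choice(actions)                          # explore
    return max(actions, key=lambda a: q.get((s, a), 0.0))   # exploit
```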
[0063] Mode definition is discussed next. The state-action pairs (s.sub.t, a.sub.t) are discretely defined. The discrete modes are defined as follows: [0064] ΔP.sub.dc(t): To allow a certain level of slow variation of P.sub.dc, a dead band (|ΔP.sub.dc(t)|≤P.sub.dc,db) is set to allow a small ramp rate of P.sub.dc. Outside the dead band, the mode of ΔP.sub.dc(t) is defined at intervals of P.sub.dc,int. [0065] E.sub.be,cap(t): The mode of battery capacity E.sub.be,cap(t) is defined at intervals of E.sub.be,int. [0066] P.sub.BE(t): The mode of battery output power P.sub.BE(t) is defined at intervals of P.sub.be,int. [0067] ΔP.sub.be(t): The mode of control action ΔP.sub.be(t) is defined at intervals of P.sub.be,int.
[0068] The number of state-action pair modes can be chosen based on the system computation capability and the required control operation rate.
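A hedged sketch of this mode definition follows: a dead band maps small values of ΔP.sub.dc to a single mode, and values outside are binned at a fixed interval. The signed-index convention and all names are assumptions:

```python
# Map continuous quantities to discrete mode indices for the Q table.

def quantize(x, interval, dead_band=0.0):
    """Return a signed integer mode: 0 inside the dead band, then bins
    of width 'interval' counted outward from the dead band edge."""
    if abs(x) <= dead_band:
        return 0
    sign = 1 if x > 0 else -1
    return sign * (1 + int((abs(x) - dead_band) // interval))

def state_mode(dp_dc, e_be, p_be, db, p_int, e_int):
    """Discrete state tuple for (dP_dc(t), E_be,cap(t), P_BE(t))."""
    return (quantize(dp_dc, p_int, db),
            quantize(e_be, e_int),
            quantize(p_be, p_int))
```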
[0074] Different from techniques used in the prior art, such as low-pass-filter-based approaches and power curtailment, the system applies a reinforcement learning-based control approach to battery storages for PV ramp rate control, which is new. The storage operation is decided dynamically to limit the ramp rate of the PV power output, while the battery SoC level is maintained around a predefined level to minimize the required battery size and extend the battery life cycles. This optimization-based approach does not require the PV power profiles to be known in advance, and can adjust the battery operation to different PV generation profiles.
[0076] A first storage device 122 and a second storage device 124 are operatively coupled to system bus 102 by the I/O adapter 120. The storage devices 122 and 124 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. The storage devices 122 and 124 can be the same type of storage device or different types of storage devices. A speaker 132 is operatively coupled to system bus 102 by the sound adapter 130. A transceiver 142 is operatively coupled to system bus 102 by network adapter 140. A display device 162 is operatively coupled to system bus 102 by display adapter 160. A first user input device 152, a second user input device 154, and a third user input device 156 are operatively coupled to system bus 102 by user interface adapter 150. The user input devices 152, 154, and 156 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles. The user input devices 152, 154, and 156 can be the same type of user input device or different types of user input devices. The user input devices 152, 154, and 156 are used to input and output information to and from system 100.
[0077] Of course, the processing system 100 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 100, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 100 are readily contemplated by one of ordinary skill in the art given the teachings of the present principles provided herein.
[0078] It should be understood that embodiments described herein may be entirely hardware, or may include both hardware and software elements which includes, but is not limited to, firmware, resident software, microcode, etc.
[0079] Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
[0080] A data processing system suitable for storing and/or executing program code may include at least one processor, e.g., a hardware processor, coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
[0081] The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention.