PV Ramp Rate Control Using Reinforcement Learning Technique Through Integration of Battery Storage System
20170117744 · 2017-04-27
Inventors
CPC classification
Y02E10/56
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Y02E10/50
Y02E70/30
International classification
H02J7/00
ELECTRICITY
Abstract
Systems and methods are disclosed for storing photovoltaic (PV) generation by applying reinforcement learning (RL)-based control to battery storages for PV ramp rate control, and by exchanging energy dynamically to limit a ramp rate of the PV power output while maintaining a battery state of charge at a predefined level to minimize the required battery size and extend the battery life cycles.
Claims
1. A process for storing photovoltaic (PV) generation, comprising: applying reinforcement learning (RL)-based control to battery storages for PV ramp rate control; and exchanging energy dynamically to limit a ramp rate of the PV power output while maintaining a battery state of charge at a predefined level to minimize a required battery size and extend battery life cycles.
2. The process of claim 1, comprising adjusting battery operation to different PV profiles without knowing in advance the PV profiles.
3. The process of claim 1, comprising monitoring system operation status at each time instant t: {ΔP.sub.dc(t), E.sub.be,cap(t), P.sub.BE(t)}.
4. The process of claim 1, wherein a controller generates a battery power change control action ΔP.sub.be(t), and wherein a battery operation controller applies reinforcement learning-based optimization approaches.
5. The process of claim 1, wherein, for the RL, Q-learning is used to find an optimal battery operation sequence.
6. The process of claim 1, comprising determining discrete state-action (s.sub.t, a.sub.t) pairs, wherein each pair estimates an expected value of a total reward returned over all successive optimal actions.
7. The process of claim 4, comprising iteratively updating the Q-value for each state-action pair along system operation.
8. The process of claim 5, comprising applying the Q-value to determine battery operation actions.
9. The process of claim 1, comprising determining a reward function R as a function of suppression of PV power ramp rate and a deviation of battery capacity from predefined setting.
10. The process of claim 1, comprising determining a power balance as:
P.sub.dc=P.sub.pv+P.sub.be, where battery power (P.sub.be) is controlled to compensate for fluctuations of PV power generation (P.sub.pv), so that a ramp rate of the total power output (P.sub.dc) to grid can be limited within a desired level.
11. The process of claim 1, wherein a ramp rate of P.sub.dc is limited within a maximum allowable ramp rate (MARR).
12. The process of claim 11, wherein the ramp rate of P.sub.dc comprises:
13. The process of claim 11, wherein a sampling time interval is Δt, comprising determining
14. The process of claim 1, comprising optimizing a battery operation policy by: limiting a ramp rate of integrated DC power (RR.sub.dc) within MARR; and maintaining a battery energy capacity around a reference setting point (E.sub.be,ref) where the battery life can be maximized.
15. The process of claim 1, comprising optimizing multi-objective functions with:
16. The process of claim 1, comprising determining a state space S, an action set A, and reward functions R, the reward R being a function of S and A, wherein the State (S) space includes {(ΔP.sub.dc(t), E.sub.be,cap(t), P.sub.BE(t))}, the Action (A) space includes only one element {ΔP.sub.be(t)}, the battery power change, and a Reward value (R) is determined therefrom.
17. The process of claim 16, wherein the reward value is calculated at each time instant, and the reward value at t is calculated based on the collected information between t-1 and t.
18. The process of claim 1, comprising applying Q-learning to find an optimal battery operation sequence to maximize the total rewards.
19. The process of claim 18, wherein the Q-learning uses temporal differences to estimate the Q value of each state-action pair Q*(s,a), wherein Q*(s,a) is an expected value of taking action a in state s and following the optimal policy thereafter, where the expected value is the cumulative discounted reward with:
20. The process of claim 19, wherein the action-value set Q(s,a) is learned and updated along system operation, comprising determining an optimal action by selecting the action with the highest Q value in each state and updating Q(s,a) as:
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION
[0024] The target system (PV integrated with battery storage system) is shown in the accompanying figure, with the power balance
P.sub.dc=P.sub.pv+P.sub.be(1)
[0026] The desired ramp-rate of P.sub.dc is defined as the maximum allowable ramp rate (MARR). The MARR could be defined in different units, e.g. W/sec, kW/min.
[0027] The ramp rate of P.sub.dc can be described as:
RR.sub.dc(t)=dP.sub.dc(t)/dt (2)
[0028] Assuming the sampling time interval is Δt, Eq. (2) can be written as
RR.sub.dc(t)=(P.sub.dc(t)−P.sub.dc(t−1))/Δt (3)
[0029] So the ramp rate should satisfy
|RR.sub.dc(t)|≤MARR
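As an illustration of the discretized relation in Eq. (3) and the MARR constraint, a minimal Python sketch follows; the function names, sign conventions, and numeric values are illustrative assumptions, not taken from the disclosure:

```python
# Illustrative sketch of the discretized ramp rate and the MARR check.
# Units are assumed, e.g. power in kW and dt in minutes -> ramp rate in kW/min.

def ramp_rate(p_dc_now: float, p_dc_prev: float, dt: float) -> float:
    """RR_dc(t) = (P_dc(t) - P_dc(t-1)) / dt, per Eq. (3)."""
    return (p_dc_now - p_dc_prev) / dt

def within_marr(p_dc_now: float, p_dc_prev: float, dt: float, marr: float) -> bool:
    """Check the constraint |RR_dc(t)| <= MARR."""
    return abs(ramp_rate(p_dc_now, p_dc_prev, dt)) <= marr
```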
[0030] To illustrate the instant ramp rate control method, an illustrative PV power ramp-down and the corresponding compensating battery power are shown in the accompanying figure.
[0031] There are several variables which need to be defined or optimized during the ramping control process: [0032] RR.sub.dc: the targeted ramp rate of integrated DC power; [0033] RR.sub.be,event: the ramp rate of BE power during the ramping event time period (t.sub.1 to t.sub.2); [0034] RR.sub.be,post-event: the ramp rate of BE power during the post-ramping event time period (t.sub.2 to t.sub.3); [0035] RR.sub.be,reco: the ramp rate of BE power during the recovering time period (t.sub.3 to t.sub.4).
[0036] Among those variables, the ramp rate or power change of BE power determines the ramp rate of integrated DC power output.
[0037] The battery operation policy can be optimized considering the following two objectives: [0038] 1) The ramp rate of integrated DC power (RR.sub.dc) is limited within MARR; [0039] 2) The battery energy capacity is maintained around the reference setting point (E.sub.be, ref) where the battery life can be maximized.
[0040] The multi-objective functions are described as:
[0041] Where ω.sub.1, ω.sub.2 are the weight coefficients.
[0042] The following operation constraints need to be satisfied: [0043] The ramp rate of exported DC power is within limit
|RR.sub.dc(t)|≤MARR [0044] The battery energy level is within limits
E.sub.be,min≤E.sub.be(t)≤E.sub.be,max [0045] Battery power is within limits
P.sub.be,min≤P.sub.be(t)≤P.sub.be,max
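A hedged sketch of enforcing these three constraints by clipping a candidate battery power is given below; it assumes a sign convention where positive p_be discharges the battery, and all names and limits are illustrative:

```python
# Clip a candidate battery power so that both the power limits
# (P_be,min <= P_be <= P_be,max) and the energy limits
# (E_be,min <= E_be <= E_be,max) hold after one step of length dt.

def clip_battery_action(p_be_cand, e_be, dt,
                        p_be_min, p_be_max, e_be_min, e_be_max):
    """Return the nearest battery power respecting power and energy limits."""
    p = min(max(p_be_cand, p_be_min), p_be_max)  # power limit
    e_next = e_be - p * dt       # discharging (p > 0) lowers stored energy
    if e_next > e_be_max:        # charging too hard -> back off
        p = (e_be - e_be_max) / dt
    elif e_next < e_be_min:      # discharging too hard -> back off
        p = (e_be - e_be_min) / dt
    return p
```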
[0046] At each time instant t, when the PV power output fluctuates (ΔP.sub.pv(t)), so does the exported DC power P.sub.dc(t) when the battery power output (P.sub.be(t)) is kept the same as at the previous time step (t-1). Based on the above known conditions, the battery power will be adjusted to minimize the objectives in (4) subject to the above constraints. The online management flowchart at each time instant t is shown in the accompanying figure.
[0048] Next, the Reinforcement Learning-based optimization approach is detailed.
[0049] There are three elements in RL techniques: state space S, action set A, and reward function R, where the reward R is a function of S and A. They are defined as follows: [0050] State (S) [0051] The state space includes {(ΔP.sub.dc(t), E.sub.be,cap(t), P.sub.BE(t))}. [0052] Action (A) [0053] The action space includes only one element {ΔP.sub.be(t)}, the battery power change. [0054] Reward value (R) [0055] The reward value is calculated at each time instant; the reward value at t is calculated based on the collected information between t-1 and t.
[0056] The reward function is defined in a similar way as the objectives in (4). R is defined in this way so that the energy drawn from the battery and the ramp rate of the exported DC power are minimized through maximizing the reward value R.
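A minimal reward sketch consistent with the two stated objectives (limit |RR.sub.dc| to MARR, keep E.sub.be near E.sub.be,ref) is shown below; the quadratic penalty form and the weights w1, w2 are assumptions, since the exact reward expression is not reproduced in this extract:

```python
# Reward is the negative weighted sum of two penalties: excess ramp rate
# beyond MARR, and deviation of stored energy from its reference.

def reward(rr_dc, e_be, marr, e_be_ref, w1=1.0, w2=0.5):
    """Larger (less negative) reward when |RR_dc| stays within MARR and
    E_be stays close to E_be,ref."""
    ramp_penalty = max(0.0, abs(rr_dc) - marr) ** 2  # only the excess is penalized
    soc_penalty = (e_be - e_be_ref) ** 2
    return -(w1 * ramp_penalty + w2 * soc_penalty)
```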
[0057] As one of the RL techniques, Q-learning is used to find the optimal battery operation sequence which maximizes the total rewards. Q-learning uses temporal differences to estimate the Q value of each state-action pair, Q*(s,a). Q*(s,a) is the expected value of taking action a in state s and following the optimal policy thereafter, where the expected value is the cumulative discounted reward:
Q*(s,a)=E[R.sub.t+1+γR.sub.t+2+γ.sup.2R.sub.t+3+ . . . |s.sub.t=s, a.sub.t=a]
[0058] Where γ is the discount factor between 0 and 1. γ reflects how much the future rewards are counted into the total value compared with the immediate rewards. One of the advantages of Q-learning is that it does not require a model of the environment.
[0059] The action-value set Q(s,a) is learned and updated along system operation, and the optimal action can be determined by selecting the action with the highest Q value in each state. The update of Q(s,a) is a value-iteration update defined as:
Q(s.sub.t,a.sub.t)←Q(s.sub.t,a.sub.t)+α(s.sub.t,a.sub.t)[R.sub.t+1+γ max.sub.a Q(s.sub.t+1,a)−Q(s.sub.t,a.sub.t)]
[0060] Where R.sub.t+1 is the reward after performing a.sub.t in state s.sub.t; α(s.sub.t, a.sub.t) is the learning rate, which can be a constant value for all state-action pairs or can vary with the state-action pair; and γ is the discount factor between 0 and 1, reflecting how much the future rewards are counted into the total value compared with the immediate rewards.
[0061] At the beginning of the Q-learning, the initial value of Q for all state-action pairs can be set arbitrarily and updated iteratively later. The Q-learning procedure is illustrated in the accompanying flowchart.
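One tabular value-iteration update of Q(s,a), as described above, can be sketched as follows; the dict-based table and the helper name q_update are illustrative:

```python
# One step of tabular Q-learning:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))

def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Update and return Q(s,a).

    q: dict mapping (state, action) -> value; unseen pairs default to 0.0,
    matching the arbitrary initialization noted in the text.
    """
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```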
[0062] There are different policies for the action selection. The choice among these policies aims at the trade-off between the exploitation and exploration phases during system operation. For example, an ε-greedy policy can be chosen for the action selection during the exploration phase, where the action with the highest Q value is selected with probability 1−ε and, the rest of the time, a random action is chosen uniformly.
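The ε-greedy selection just described can be sketched as below; the dict-based Q table mirrors the update sketch above, and all names are illustrative:

```python
import random

# Exploit the highest-Q action with probability 1 - eps;
# otherwise explore by picking an action uniformly at random.

def epsilon_greedy(q, s, actions, eps, rng=random):
    """q: dict mapping (state, action) -> value; unseen pairs default to 0.0."""
    if rng.random() < eps:
        return rng.choice(actions)                          # explore
    return max(actions, key=lambda a: q.get((s, a), 0.0))   # exploit
```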
[0063] Mode definition is discussed next. The state-action pairs (s.sub.t, a.sub.t) are discretely defined. The discrete modes are defined as follows: [0064] ΔP.sub.dc(t): To allow a certain level of slow variation of P.sub.dc, a dead band (|ΔP.sub.dc(t)|≤P.sub.dc,db) is set to allow a small ramp rate of P.sub.dc. Outside the dead band, the mode of ΔP.sub.dc(t) is defined at intervals of P.sub.dc,int. [0065] E.sub.be,cap(t): The mode of battery capacity E.sub.be,cap(t) is defined at intervals of E.sub.be,int. [0066] P.sub.BE(t): The mode of battery output power P.sub.BE(t) is defined at intervals of P.sub.be,int. [0067] ΔP.sub.be(t): The mode of control action ΔP.sub.be(t) is defined at intervals of P.sub.be,int.
[0068] The number of state-action pair modes can be chosen based on the system computation capability and the required control operation rate.
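A hedged sketch of this mode definition follows: a dead band maps small values of ΔP.sub.dc to a single mode, and values outside are binned at a fixed interval. The signed-index convention and all names are assumptions:

```python
# Map continuous quantities to discrete mode indices for the Q table.

def quantize(x, interval, dead_band=0.0):
    """Return a signed integer mode: 0 inside the dead band, then bins
    of width 'interval' counted outward from the dead band edge."""
    if abs(x) <= dead_band:
        return 0
    sign = 1 if x > 0 else -1
    return sign * (1 + int((abs(x) - dead_band) // interval))

def state_mode(dp_dc, e_be, p_be, db, p_int, e_int):
    """Discrete state tuple for (dP_dc(t), E_be,cap(t), P_BE(t))."""
    return (quantize(dp_dc, p_int, db),
            quantize(e_be, e_int),
            quantize(p_be, p_int))
```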
[0074] Different from techniques used in the prior art, such as low-pass-filter-based approaches and power curtailment, the system applies a reinforcement learning-based control approach to battery storages for PV ramp rate control, which is new. The storage operation is decided dynamically to limit the ramp rate of the PV power output, while the battery SoC level is maintained around a predefined level to minimize the required battery size and extend the battery life cycles. This optimization-based approach does not require the PV power profiles to be known in advance, and can adjust the battery operation to different PV generation profiles.
[0076] A first storage device 122 and a second storage device 124 are operatively coupled to system bus 102 by the I/O adapter 120. The storage devices 122 and 124 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. The storage devices 122 and 124 can be the same type of storage device or different types of storage devices. A speaker 132 is operatively coupled to system bus 102 by the sound adapter 130. A transceiver 142 is operatively coupled to system bus 102 by network adapter 140. A display device 162 is operatively coupled to system bus 102 by display adapter 160. A first user input device 152, a second user input device 154, and a third user input device 156 are operatively coupled to system bus 102 by user interface adapter 150. The user input devices 152, 154, and 156 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles. The user input devices 152, 154, and 156 can be the same type of user input device or different types of user input devices. The user input devices 152, 154, and 156 are used to input and output information to and from system 100.
[0077] Of course, the processing system 100 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 100, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 100 are readily contemplated by one of ordinary skill in the art given the teachings of the present principles provided herein.
[0078] It should be understood that embodiments described herein may be entirely hardware, or may include both hardware and software elements which includes, but is not limited to, firmware, resident software, microcode, etc.
[0079] Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
[0080] A data processing system suitable for storing and/or executing program code may include at least one processor, e.g., a hardware processor, coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
[0081] The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention.