METHOD AND APPARATUS FOR BASEBALL STRATEGY PLANNING BASED ON REINFORCEMENT LEARNING
20210387070 · 2021-12-16
Assignee
Inventors
CPC classification
G16H20/30 (PHYSICS)
G06Q99/00 (PHYSICS)
G06Q10/06375 (PHYSICS)
A63F2011/0093 (HUMAN NECESSITIES)
A63B71/0605 (HUMAN NECESSITIES)
International classification
A63B71/06 (HUMAN NECESSITIES)
Abstract
A method and an apparatus for baseball strategy planning based on reinforcement learning are provided. The method includes the following steps. Historical data of innings in past games of a team is collected. Multiple game states, multiple offensive and defensive actions, and multiple rewards corresponding to multiple offensive and defensive results are defined based on multiple offensive and defensive processes occurring during the game, and are used to establish a Q table. The Q table is updated according to multiple combinations of the game state, the offensive and defensive action, and the offensive and defensive result recorded in the historical data. According to a current game state, Q values of all offensive and defensive actions executable in the current game state recorded in the updated Q table are sorted, and the offensive and defensive action suitable for being executed in the current game state is recommended according to a sorting result.
Claims
1. A baseball strategy planning method based on reinforcement learning, adapted for an electronic apparatus having a processor, the method comprising: collecting historical data of multiple innings in past games of a team; defining multiple game states, multiple offensive and defensive actions, and multiple rewards corresponding to multiple offensive and defensive results according to multiple offensive and defensive processes occurring during the game, and using the game states, the offensive and defensive actions, and the rewards to establish a Q table; updating the Q table according to multiple combinations of the game state, the offensive and defensive action, and the offensive and defensive result recorded in the historical data; and sorting, according to a current game state, Q values of all offensive and defensive actions executable in the game state recorded in the updated Q table, and recommending the offensive and defensive action suitable for being executed in the game state according to a sorting result.
2. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein the game state comprises a base occupation status, a number of outs, or a strike/ball count.
3. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein the offensive and defensive action comprises multiple pitch types of a pitcher and multiple hitting actions of a hitter, and the hitting actions comprise a bunt, a hit, a sacrifice fly, or no swing.
4. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein the rewards corresponding to the offensive and defensive results comprise negative rewards representing losing a score, a base being advanced, and hitting by a hitter on a defensive side, a zero reward representing not losing a score on the defensive side, and positive rewards representing not being hit by the hitter, and striking out or putting out the hitter on the defensive side.
5. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein the rewards corresponding to the offensive and defensive results comprise positive rewards representing scoring, advancing a base, and hitting a ball on an offensive side, a zero reward representing not scoring on the offensive side, and negative rewards representing a hitter missing a ball, and being stricken out or put out on the offensive side.
6. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein the step of updating the Q table according to the multiple combinations of the game state, the offensive and defensive action, and the offensive and defensive result recorded in the historical data comprises: for each of the game states, searching for an offensive and defensive result and a new game state obtained after executing multiple offensive and defensive actions in the game state recorded in the historical data, and using the offensive and defensive result and the new game state to calculate a reward obtained by executing each of the offensive and defensive actions in the game state; and updating, by using the calculated rewards and Q values of executing multiple offensive and defensive actions in the new game state, a Q value of executing each of the offensive and defensive actions in the game state in the Q table.
7. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein after the step of recommending the offensive and defensive action suitable for being executed in the game state according to the sorting result, the method further comprises: receiving a selection of the recommended offensive and defensive action; calculating a reward obtained by executing the selected offensive and defensive action in the game state according to an offensive and defensive result and a new game state obtained after executing the selected offensive and defensive action; and updating, by using the calculated reward and Q values of executing multiple offensive and defensive actions in the new game state, a Q value of executing the selected offensive and defensive action in the game state in the Q table.
8. The baseball strategy planning method based on reinforcement learning according to claim 1, wherein the Q values of all offensive and defensive actions executable in the game state comprise Q values of executing the offensive and defensive actions by multiple players capable of executing the offensive and defensive actions.
9. A baseball strategy planning apparatus based on reinforcement learning, comprising: a data retrieval device connected to an external device; a storage device storing a computer program; and a processor coupled to the data retrieval device and the storage device and configured to load and execute the computer program to: collect, by the data retrieval device, historical data of multiple innings in past games of a team from the external device; define multiple game states, multiple offensive and defensive actions, and multiple rewards corresponding to multiple offensive and defensive results according to multiple offensive and defensive processes occurring during the game, and use the game states, the offensive and defensive actions, and the rewards to establish a Q table; update the Q table according to multiple combinations of the game state, the offensive and defensive action, and the offensive and defensive result recorded in the historical data; and sort, according to a current game state, Q values of all offensive and defensive actions executable in the game state recorded in the updated Q table, and recommend the offensive and defensive action suitable for being executed in the game state according to a sorting result.
10. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the game state comprises a base occupation status, a number of outs, or a strike/ball count.
11. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the offensive and defensive action comprises multiple pitch types of a pitcher and multiple hitting actions of a hitter, and the hitting actions comprise a bunt, a hit, a sacrifice fly, or no swing.
12. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the rewards corresponding to the offensive and defensive results comprise negative rewards representing losing a score, a base being advanced, and hitting by a hitter on a defensive side, a zero reward representing not losing a score on the defensive side, and positive rewards representing not being hit by the hitter, and striking out or putting out the hitter on the defensive side.
13. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the rewards corresponding to the offensive and defensive results comprise positive rewards representing scoring, advancing a base, and hitting a ball on an offensive side, a zero reward representing not scoring on the offensive side, and negative rewards representing a hitter missing a ball, and being stricken out or put out on the offensive side.
14. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the processor is configured to: for each of the game states, search for an offensive and defensive result and a new game state obtained after executing multiple offensive and defensive actions in the game state recorded in the historical data, and use the offensive and defensive result and the new game state to calculate a reward obtained by executing each of the offensive and defensive actions in the game state; and update, by using the calculated rewards and Q values of executing multiple offensive and defensive actions in the new game state, a Q value of executing each of the offensive and defensive actions in the game state in the Q table.
15. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the processor is further configured to: receive a selection of the recommended offensive and defensive action; calculate a reward obtained by executing the selected offensive and defensive action in the game state according to an offensive and defensive result and a new game state obtained after executing the selected offensive and defensive action; and update, by using the calculated reward and Q values of executing multiple offensive and defensive actions in the new game state, a Q value of executing the selected offensive and defensive action in the game state in the Q table.
16. The baseball strategy planning apparatus based on reinforcement learning according to claim 9, wherein the Q values of all offensive and defensive actions executable in the game state comprise Q values of executing the offensive and defensive actions by multiple players capable of executing the offensive and defensive actions.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016]
[0017]
[0018]
[0019]
DESCRIPTION OF THE EMBODIMENTS
[0020] An embodiment of the disclosure provides a baseball strategy planning method and a baseball strategy planning apparatus based on reinforcement learning (RL), which use a reinforcement learning algorithm to generate offensive and defensive strategies in real-time in baseball innings. The method is divided into two stages. The first stage is offline planning, which collects past game data of the team, and updates a value function pairing the state and the action in the inning through reinforcement learning. The second stage is online learning, which uses the value function established in the first stage to recommend an optimal offensive or defensive strategy in the current state, and then further updates the value function pairing the state and the action in the inning according to the action actually selected.
[0021] Specifically,
[0022] The data retrieval device 12 is, for example, any wired or wireless interface device that may be connected to an external device (not shown) and is configured to collect historical data of multiple innings in past games of the team. In the case of a wired interface, the data retrieval device 12 may be an interface such as a universal serial bus (USB), RS232, a universal asynchronous receiver/transmitter (UART), an inter-integrated circuit (I2C), a serial peripheral interface (SPI), DisplayPort, Thunderbolt, etc., but is not limited thereto. In the case of a wireless interface, the data retrieval device 12 may be a device compatible with communication protocols such as wireless fidelity (Wi-Fi), RFID, Bluetooth, infrared, near-field communication (NFC), device-to-device (D2D), etc., but is not limited thereto. In some embodiments, the data retrieval device 12 may also include a network card compatible with Ethernet or with wireless network standards such as 802.11g, 802.11n, 802.11ac, etc., so that the baseball strategy planning apparatus 10 can be connected to an external device via a network to collect or receive historical information of baseball games.
[0023] The storage device 14 is, for example, any form of a fixed or movable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, a similar device, or a combination of the above devices, and is configured to store a computer program executable by the processor 16. In some embodiments, the storage device 14 also stores, for example, historical information of baseball games collected by the data retrieval device 12 from an external device.
[0024] The processor 16 is, for example, a central processing unit (CPU), or another programmable general-purpose or specific-purpose microprocessor, microcontroller, digital signal processor (DSP), programmable controller, application specific integrated circuit (ASIC), programmable logic device (PLD), another similar device, or a combination of the above devices, and the disclosure is not limited thereto. In this embodiment, the processor 16 may load a computer program from the storage device 14 to execute the baseball strategy planning method based on reinforcement learning of the embodiment of the disclosure.
[0025]
[0026] In step S210, the processor 16 of the baseball strategy planning apparatus 10 collects, through the data retrieval device 12, historical data of multiple innings in past games of a team from an external device. The external device is, for example, a server or a computer which records game data of each team and is not specifically limited herein.
[0027] In step S220, according to multiple offensive and defensive processes occurring during the game, the processor 16 defines multiple game states, multiple offensive and defensive actions, and multiple rewards corresponding to multiple offensive and defensive results, which are used to establish a Q table. Specifically, for example, in the embodiment of the disclosure, the game process is regarded as a Markov decision process (MDP), in which the time interval is defined as the pitching interval of the pitcher, and an episodic setting is adopted to define multiple combinations of a state, an action, and a reward respectively for the defensive and offensive processes, which are recorded in a Q table for learning.
[0028] Taking the Q table in Table 1 as an example, when the team takes an action A.sub.0 in a state S.sub.0, the team may obtain a reward R.sub.1 according to the result and enter a next state S.sub.1. Similarly, when the team takes an action A.sub.1 in the state S.sub.1, the team may obtain a reward R.sub.2 according to the result and enter a next state S.sub.2; when the team takes an action A.sub.2 in the state S.sub.2, the team may obtain a reward R.sub.3 according to the result and enter a next state S.sub.3, and so on. Therefore, a Q table which records the rewards obtained by taking various actions in various states can be established.
TABLE 1

State      Action     Reward
S.sub.0    A.sub.0    R.sub.1
S.sub.1    A.sub.1    R.sub.2
S.sub.2    A.sub.2    R.sub.3
S.sub.3    A.sub.3    R.sub.4
. . .      . . .      . . .
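The episodic record of Table 1 can be sketched in code. The dictionary-based Q table and the `to_transitions` helper below are illustrative assumptions, not part of the claimed apparatus:

```python
from collections import defaultdict

# A minimal Q table: (state, action) keys default to 0.0 before learning.
Q = defaultdict(float)

def to_transitions(rows):
    """Convert Table 1 style rows (S_t, A_t, R_{t+1}) into
    (S_t, A_t, R_{t+1}, S_{t+1}) transitions; the final row of an
    inning has no successor state and is recorded with None."""
    transitions = []
    for t, (s, a, r) in enumerate(rows):
        s_next = rows[t + 1][0] if t + 1 < len(rows) else None
        transitions.append((s, a, r, s_next))
    return transitions
```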
[0029] In some embodiments, the game state includes a base occupation status, a number of outs, a strike/ball count, or other information facilitating analysis of the situation, which is not specifically limited herein. The base occupation status includes, for example, no one on base and the eight permutations/combinations of first base occupied, second base occupied, and third base occupied (i.e., nine possibilities in total), which are respectively defined as values of 0 to 8. The number of outs has, for example, three possibilities including zero outs, one out, and two outs, which are respectively defined as values of 0 to 2. The strike/ball count has, for example, twelve possibilities combining the number of strikes (0 to 2) and the number of balls (0 to 3), which are respectively defined as values of 0 to 11. In an embodiment, the game state may record the above combination in a vector form. For example, when a player is on first base, two players are out, and the count is two strikes and three balls, the game state may be recorded as {1, 2, 11}, and so on. In an embodiment, the game state is represented by, for example, one single value calculated from the above value combination, which is not specifically limited herein.
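The state encoding described above can be illustrated as follows. The `strikes*4 + balls` mapping for the count value and the single-value packing are assumptions; the embodiment does not fix the exact formulas:

```python
def encode_state(bases, outs, strikes, balls):
    """Encode the game state as both the vector form {bases, outs, count}
    and one single value. bases is 0-8, outs is 0-2, and the count value
    0-11 is derived here as strikes*4 + balls (an assumed mapping)."""
    assert 0 <= bases <= 8 and 0 <= outs <= 2
    assert 0 <= strikes <= 2 and 0 <= balls <= 3
    count = strikes * 4 + balls               # 0..11
    vector = (bases, outs, count)             # e.g. {1, 2, 11}
    scalar = (bases * 3 + outs) * 12 + count  # 0..323, one single value
    return vector, scalar
```

Under this assumed mapping, a runner on first base with two outs and a two-strike, three-ball count yields the vector (1, 2, 11), matching the example in the text.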
[0030] In some embodiments, the offensive and defensive actions may be divided depending on the defensive side and the offensive side. For the defensive side, the offensive and defensive actions include multiple pitch types of the pitcher, such as a straight pitch, a curveball, a slider, a forkball, etc. For the offensive side, the offensive and defensive actions include multiple hitting actions of the hitter, such as a bunt, a hit, a sacrifice fly, no swing, etc. The above offensive and defensive actions may be represented by different values. This embodiment does not limit the types of the offensive and defensive actions and their representation methods.
[0031] In some embodiments, the offensive and defensive results may also be divided depending on the defensive side and the offensive side, and according to situations favorable for the defensive side or the offensive side, rewards ranging from negative to positive (including a zero reward) may be respectively given in this embodiment. A positive reward means that the situation is more favorable for the defensive side or the offensive side, a negative reward means that the situation is less favorable, and a zero reward means that the situation is neither favorable nor unfavorable.
[0032] For the defensive side, the rewards corresponding to the offensive and defensive results include negative rewards representing losing a score, a base being advanced, and being hit by the hitter, a zero reward representing not losing a score, and positive rewards representing not being hit by the hitter, and striking out or putting out the hitter. For example, whenever one score is lost, a reward β.sub.1 is given; whenever one base is advanced by the opponent (including a base stolen by the runner), a reward β.sub.2 is given; if the pitcher's ball is hit by the hitter, a reward β.sub.3 is given; if a score is not lost, a reward 0 is given; if the pitcher's ball is not hit by the hitter, a reward β.sub.4 is given; if the hitter is stricken out or put out, a reward β.sub.5 is given, where β.sub.1≤β.sub.2≤β.sub.3≤0≤β.sub.4≤β.sub.5.
[0033] On the other hand, for the offensive side, the rewards corresponding to the offensive and defensive results include positive rewards representing scoring, advancing a base, and hitting a ball, a zero reward representing not scoring, and negative rewards representing the hitter missing a ball, and being stricken out or put out. For example, if the hitter is stricken out or put out, a reward δ.sub.1 is given; if the hitter swings but misses the ball, a reward δ.sub.2 is given; if our side does not score, a reward 0 is given; if the hitter swings and hits the ball, a reward δ.sub.3 is given; whenever our side advances one base (including a base stolen by the runner), a reward δ.sub.4 is given; whenever our side scores one point, a reward δ.sub.5 is given, where δ.sub.1≤δ.sub.2≤0≤δ.sub.3≤δ.sub.4≤δ.sub.5.
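The two reward schemes above can be sketched with concrete constants. The magnitudes below are assumptions for illustration; the embodiment only fixes the orderings β.sub.1≤β.sub.2≤β.sub.3≤0≤β.sub.4≤β.sub.5 and δ.sub.1≤δ.sub.2≤0≤δ.sub.3≤δ.sub.4≤δ.sub.5:

```python
# Illustrative reward constants; only their relative order is specified
# by the embodiment, the magnitudes here are assumptions.
BETA = {                        # defensive side
    "run_lost": -5.0,           # β1
    "base_advanced": -2.0,      # β2
    "ball_hit": -1.0,           # β3
    "no_run_lost": 0.0,         # zero reward
    "ball_missed": 1.0,         # β4
    "out_recorded": 2.0,        # β5
}
DELTA = {                       # offensive side
    "struck_or_put_out": -2.0,  # δ1
    "swing_missed": -1.0,       # δ2
    "no_run": 0.0,              # zero reward
    "ball_hit": 1.0,            # δ3
    "base_advanced": 2.0,       # δ4
    "run_scored": 5.0,          # δ5
}

def defense_reward(result):
    """Look up the defensive-side reward for an offensive and defensive result."""
    return BETA[result]

def offense_reward(result):
    """Look up the offensive-side reward for an offensive and defensive result."""
    return DELTA[result]
```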
[0034] Returning to the flowchart of
[0035]
[0036] In step S231, the processor 16 accesses the storage device 14 to retrieve the historical game data previously collected and stored in the storage device 14.
[0037] In step S232, the processor 16 observes the game state. The processor 16, for example, selects a game state for learning from multiple game states recorded in a previously established Q table.
[0038] In step S233, the processor 16 searches for an offensive and defensive result and a new game state obtained after executing different offensive and defensive actions in the game state as recorded in the historical data. For example, in the state where no one is out and the bases are loaded, after the offensive side executes a bunt, a result of scoring one point and a new game state where one player is out and the second and third bases are occupied are obtained.
[0039] In step S234, the processor 16 calculates a reward corresponding to each offensive and defensive result. For example, for the defensive side, if the offensive and defensive result is losing one score, the obtained reward is β.sub.1; if the offensive and defensive result is no score loss, the obtained reward is 0; if the offensive and defensive result is striking out the hitter, the obtained reward is β.sub.5. In contrast, for the offensive side, if the offensive and defensive result is being stricken out, the obtained reward is δ.sub.1; if the offensive and defensive result is no score, the obtained reward is 0; if the offensive and defensive result is scoring one point, the obtained reward is δ.sub.5.
[0040] In step S235, by using the calculated rewards and the Q values of executing multiple offensive and defensive actions in the new game state, the processor 16 updates the Q value of executing each offensive and defensive action in the game state in the Q table.
[0041] In step S236, the processor 16 updates the game state. Namely, the previously observed or learned game state is updated to the new game state. Afterwards, returning to step S232, the processor 16 re-observes the game state and performs learning by using the historical data.
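Steps S231 to S236 amount to one offline pass of tabular Q-learning over the retrieved historical data. The sketch below assumes each record is a (state, action, reward, next_state) tuple with next_state set to None at the end of an inning; the ALPHA and GAMMA values are illustrative:

```python
ALPHA, GAMMA = 0.1, 0.95  # illustrative learning rate and discount factor

def offline_pass(q, history, actions):
    """One offline-planning pass over recorded transitions (steps S231-S236)."""
    for s, a, r, s_next in history:       # S232/S233: observe state, look up result
        if s_next is None:                # end of inning: no future value
            best_next = 0.0
        else:                             # largest Q value in the new game state
            best_next = max(q.get((s_next, b), 0.0) for b in actions)
        old = q.get((s, a), 0.0)
        # S234/S235: reward plus discounted future value updates Q(s, a)
        q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)
        # S236: iteration advances to the next recorded game state
    return q
```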
[0042] Specifically, for the defensive side, assuming that an action A.sub.t,defense is executed in a game state S.sub.t,defense in round t, the reward corresponding to the execution result is R.sub.t+1,defense, and the corresponding new game state (i.e., the game state in round t+1) is S.sub.t+1,defense, then the Q value Q.sub.defense(S.sub.t,defense, A.sub.t,defense) corresponding to the state S.sub.t,defense and the action A.sub.t,defense in the Q table may be updated by the following formula (1):

Q.sub.defense(S.sub.t,defense, A.sub.t,defense)←Q.sub.defense(S.sub.t,defense, A.sub.t,defense)+α[R.sub.t+1,defense+γ max.sub.a Q.sub.defense(S.sub.t+1,defense, a)−Q.sub.defense(S.sub.t,defense, A.sub.t,defense)]  (1)
[0043] In the formula, α is the learning rate, γ is the discount factor, and Q.sub.defense(S.sub.t+1,defense, a) is the Q value of executing an action a in the new game state S.sub.t+1,defense. Among the multiple actions a executable in the new game state S.sub.t+1,defense, the action having the largest Q value is taken as the optimal action a*, and the Q value of executing the action a* in the new game state S.sub.t+1,defense is fed back, together with the reward R.sub.t+1,defense, to the Q value corresponding to the action A.sub.t,defense in the original game state S.sub.t,defense. In addition, the learning rate α is any number between 0 and 1 and determines how strongly each update changes Q.sub.defense(S.sub.t,defense, A.sub.t,defense), as illustrated in formula (1). The discount factor γ is, for example, any number between 0.9 and 0.99 and determines the proportion at which the Q value of the new game state S.sub.t+1,defense contributes to the fed-back value.
[0044] On the other hand, for the offensive side, assuming that an action A.sub.t,offense is executed in a game state S.sub.t,offense in round t, the reward corresponding to the execution result is R.sub.t+1,offense, and the corresponding new game state (i.e., the game state in round t+1) is S.sub.t+1,offense, then the Q value Q.sub.offense(S.sub.t,offense, A.sub.t,offense) corresponding to the state S.sub.t,offense and the action A.sub.t,offense in the Q table may be updated by the following formula (2):

Q.sub.offense(S.sub.t,offense, A.sub.t,offense)←Q.sub.offense(S.sub.t,offense, A.sub.t,offense)+α[R.sub.t+1,offense+γ max.sub.a Q.sub.offense(S.sub.t+1,offense, a)−Q.sub.offense(S.sub.t,offense, A.sub.t,offense)]  (2)
[0045] In the formula, α is the learning rate, γ is the discount factor, and Q.sub.offense(S.sub.t+1,offense, a) is the Q value of executing an action a in the new game state S.sub.t+1,offense. Among the multiple actions a executable in the new game state S.sub.t+1,offense, the action having the largest Q value is taken as the optimal action a*, and the Q value of executing the action a* in the new game state S.sub.t+1,offense is fed back, together with the reward R.sub.t+1,offense, to the Q value corresponding to the action A.sub.t,offense in the original game state S.sub.t,offense. In addition, the learning rate α is any number between 0 and 1 and determines how strongly each update changes Q.sub.offense(S.sub.t,offense, A.sub.t,offense), as illustrated in formula (2). The discount factor γ is, for example, any number between 0.9 and 0.99 and determines the proportion at which the Q value of the new game state S.sub.t+1,offense contributes to the fed-back value.
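Formulas (1) and (2) share the same tabular Q-learning form and differ only in which side's Q table, reward, and states they operate on. A minimal sketch, with illustrative default values for α and γ:

```python
def q_update(q, s_t, a_t, r_next, s_next, actions, alpha=0.1, gamma=0.95):
    """Feed the reward plus the discounted largest Q value of the new
    game state back into the Q value of the executed action."""
    best_next = max(q.get((s_next, a), 0.0) for a in actions)
    old = q.get((s_t, a_t), 0.0)
    q[(s_t, a_t)] = old + alpha * (r_next + gamma * best_next - old)
    return q[(s_t, a_t)]
```

The same function serves both sides: call it with the defensive Q table and R.sub.t+1,defense, or with the offensive Q table and R.sub.t+1,offense.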
[0046] Based on the offline training of the above steps, the Q table has learned the value function (i.e., Q value) of executing various actions in various states. Therefore, in the actual game, by applying this Q table, it is possible to evaluate the current game state in real-time and recommend the optimal strategy.
[0047] Specifically, returning to the flowchart of
[0048] Taking the defensive side as an example, for the current game state S.sub.t,defense, all actions a executable in this game state may be queried from the Q table to sort the Q values Q.sub.defense(S.sub.t,defense, a) of all actions a for strategy evaluation. The optimal defensive strategy action A.sub.t,defense* may be defined as:

A.sub.t,defense*=argmax.sub.a Q.sub.defense(S.sub.t,defense, a)
[0049] In some embodiments, due to the different pitch types which each pitcher is capable of, the set of actions a in the above formula may be changed according to the capability of the pitcher at the moment; namely, the pitcher's capability may be incorporated into the learning and decision-making. Similarly, for the offensive side, the set of all actions a executable in the current game state may also be changed according to the capability of the hitter at the moment; namely, the capability of the hitter may also be incorporated into the learning and decision-making.
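The sorting and recommendation of step S240, including the capability restriction just described, can be sketched as follows; the function and argument names are illustrative:

```python
def recommend(q, state, executable_actions):
    """Sort the actions executable in the given game state by Q value,
    largest first; the head of the list is the optimal action a*.
    Passing only the current pitcher's pitch types (or the current
    hitter's hitting actions) as executable_actions incorporates the
    player's capability into the decision."""
    return sorted(executable_actions,
                  key=lambda a: q.get((state, a), 0.0), reverse=True)
```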
[0050] Based on the above, from the team's standpoint, the method of this embodiment plans the overall offensive and defensive strategies of the team by using the reinforcement learning method. Different from datafication methods aimed at individual players, the method of this embodiment is more comprehensive and better able to keep pace with the changing situation of the game.
[0051] It is noted that in the actual game, in addition to applying the pre-learned Q table to evaluate the current game state in real-time and recommend the optimal strategy, the embodiment of the disclosure may further perform online learning and update on the trained Q table according to the strategy selected by the team to continuously learn the game experience.
[0052]
[0053] In step S410, the processor 16 observes the current game state. The current game state is, for example, manually input by the coach, or obtained by the processor 16 through automatically reading information such as the inning score, the number of pitches, and the offensive and defensive data of the current game, which is not specifically limited herein.
[0054] In step S420, according to the current game state, the processor 16 sorts the Q values of all offensive and defensive actions executable in this game state recorded in the updated Q table, and recommends offensive and defensive actions suitable for being executed in this game state according to the sorting result. Step S420 is the same as or similar to step S240 in
[0055] Different from the foregoing embodiment, in this embodiment, in step S430, the processor 16 further receives a selection of the recommended offensive and defensive action. In some embodiments, the processor 16 receives the operation of selecting the recommended offensive and defensive action by the team (e.g., the coach) through an input device (not shown) such as a keyboard, a mouse, or a touch pad.
[0056] In step S440, the processor 16 calculates a reward obtained by executing the selected offensive and defensive action in the game state according to an offensive and defensive result and a new game state obtained after executing the selected offensive and defensive action. The processor 16 may also obtain the offensive and defensive result and the new game state through manually inputting or automatically reading information such as the inning score, the number of pitches, and the offensive and defensive data of the current game, which is not specifically limited herein.
[0057] In step S450, by using the calculated reward and the Q values of executing multiple offensive and defensive actions in the new game state, the processor 16 updates the Q value of executing the selected offensive and defensive action in the game state in the Q table.
[0058] Different from the offline planning stage which uses the actions selected in the past games to perform learning, in the online learning stage, the processor 16 directly calculates the reward to update the Q table according to the action currently selected by the team and the offensive and defensive result obtained after executing the action. By continuously updating the Q table, the Q table can continue to learn the game experience for evaluating or recommending strategies which meet the recent status of the team or the current status of the game in a future inning.
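One online-learning round (steps S410 to S450) can be sketched as below. The `observe_result` callback, which returns the reward and new game state after the chosen action is actually executed, is a hypothetical stand-in for the manual input or automatic reading described above, and the top-ranked action stands in for the coach's selection:

```python
def online_step(q, state, executable_actions, observe_result,
                alpha=0.1, gamma=0.95):
    """Recommend, select, observe the result, and update the Q table
    immediately (steps S410-S450)."""
    ranked = sorted(executable_actions,
                    key=lambda a: q.get((state, a), 0.0), reverse=True)
    chosen = ranked[0]                           # S430: selected action
    reward, new_state = observe_result(chosen)   # S440: observed outcome
    best_next = max(q.get((new_state, b), 0.0) for b in executable_actions)
    old = q.get((state, chosen), 0.0)
    # S450: update the Q value of the selected action in the current state
    q[(state, chosen)] = old + alpha * (reward + gamma * best_next - old)
    return chosen, new_state
```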
[0059] In summary, in the baseball strategy planning method and the baseball strategy planning apparatus based on reinforcement learning in the embodiments of the disclosure, a Q table reflecting the pairing of states and actions in an inning is established in advance by using the past game data of the team, so that offensive or defensive strategies suitable for the current state can be recommended in an actual game. In addition, by continuously updating this Q table, it is possible to continue to learn from game experience and recommend strategies more in line with the current state of the game.
[0060] It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure covers modifications and variations provided that they fall within the scope of the following claims and their equivalents.