MAXIMUM ENTROPY REGULARISED MULTI-GOAL REINFORCEMENT LEARNING
20200334565 · 2020-10-22
CPC classification: G06N7/01 (PHYSICS); G06N3/006 (PHYSICS)
Abstract
The present invention relates to a computer-implemented method of training artificial intelligence (AI) systems or rather agents (Maximum Entropy Regularised multi-goal Reinforcement Learning), in particular an AI system/agent for controlling a technical system. By constructing a prioritised sampling distribution q(τ^g) with a higher entropy H_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g, and by sampling the goal state trajectories τ^g with the prioritised sampling distribution q(τ^g), the AI system/agent is trained to achieve unseen goals by learning from diverse achieved goal states uniformly.
Claims
1. A computer-implemented method of training artificial intelligence, AI, systems, comprising the iterative step of: sampling a real goal g^e of a multitude of real goals G^e with a probability p(g^e) and an initial state s_0 with a probability of p(s_0); and for each episode of each epoch of the training the iterative steps of: sampling an action a_t from a single-goal conditioned behaviour policy π that is represented by a Universal Value Function Approximator, UVFA; stepping an environment for a new state s_t+1 with the sampled action a_t; updating a replay buffer R that comprises a distribution p(τ^g) of goal state trajectories τ^g with the current state s_t and the current action a_t, wherein the goal state trajectories τ^g contain pairs of states s_t from a multitude of states S_t and corresponding actions a_t from a multitude of actions A_t; constructing a prioritised sampling distribution q(τ^g) with a higher entropy H_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g in the replay buffer R; sampling the goal state trajectories τ^g with the prioritised sampling distribution q(τ^g) and a current density model φ, q(τ^g|φ); and updating the single-goal conditioned behaviour policy π to a maximum of an expectation E_q of a reward r for the states S_t and the real goals G^e, max_π E_q[r(S_t, G^e)]; and after each episode for each epoch of the training the step of: updating the density model φ; while the computer-implemented method has not converged.
2. The computer-implemented method according to claim 1, wherein the step of updating the single-goal conditioned behaviour policy π is based on a Deep Deterministic Policy Gradient, DDPG, method and/or on a Hindsight Experience Replay, HER, method.
3. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method according to claim 1.
4. A computer-readable medium having stored thereon the computer program according to claim 3.
5. A data processing system comprising means for carrying out the steps of the method according to claim 1.
Description
BRIEF DESCRIPTION
[0041] The present invention and its technical field are subsequently explained in further detail by exemplary embodiments shown in the drawings. The exemplary embodiments merely serve a better understanding of the present invention and are in no case to be construed as limiting the scope of the present invention. Particularly, it is possible to extract aspects of the subject-matter described in the figures and to combine them with other components and findings of the present description or figures, if not explicitly described differently. Equal reference signs refer to the same objects, such that explanations from other figures may be supplementally used.
DETAILED DESCRIPTION
[0050] In
[0051] In
[0052] As depicted in the figures, the computer-implemented method comprises the step of:
[0053] sampling 1 a real goal g^e and an initial state s_0;
as well as, for each episode of each epoch fe2 of the training, the iterative steps of:
[0054] sampling 2 an action a_t;
[0055] stepping 3 an environment;
[0056] updating 4 a replay buffer R;
[0057] constructing 5 a prioritised sampling distribution q(τ^g);
[0058] sampling 6 goal state trajectories τ^g; and
[0059] updating 7 a single-goal conditioned behaviour policy π;
as well as, after each episode for each epoch fe1 of the training, the step of:
[0060] updating 8 a density model φ.
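The overall loop can be made concrete in code. The following Python sketch is purely illustrative: the environment interface, the helper names (policy.sample, policy.update, prioritised_distribution, converged) and the loop granularity are assumptions of this sketch, not part of the claims. The helpers prioritised_distribution, DensityModel and converged are sketched after the corresponding steps below.

```python
import numpy as np

def train(env, policy, density_model, buffer,
          max_epochs=200, episodes_per_epoch=16, horizon=50):
    # Illustrative skeleton of the claimed training loop (steps 1 to 8).
    rng = np.random.default_rng(0)
    for epoch in range(max_epochs):                  # epoch loop fe1
        goal = env.sample_goal()                     # step 1: g^e ~ p(g^e)
        for _ in range(episodes_per_epoch):          # episode loop fe2
            state = env.reset()                      # step 1: s_0 ~ p(s_0) (reset per episode assumed)
            for _ in range(horizon):
                action = policy.sample(state, goal)  # step 2: a_t ~ pi(.|s_t, g^e)
                next_state = env.step(action)        # step 3: step environment to s_t+1
                buffer.add(state, action)            # step 4: store (s_t, a_t) in replay buffer R
                state = next_state
            buffer.end_episode()
            # Density of a trajectory is approximated here by the density of its
            # final achieved goal state; assumes phi was pre-fitted on initial
            # random rollouts (initialisation is not specified in the claims).
            final_goals = [traj[-1][0] for traj in buffer.trajectories]
            log_p = density_model.log_density(final_goals)
            q = prioritised_distribution(log_p)      # step 5: construct q(tau^g)
            idx = rng.choice(len(q), size=min(64, len(q)), p=q)
            batch = [buffer.trajectories[i] for i in idx]  # step 6: trajectories ~ q(tau^g)
            policy.update(batch, goal)               # step 7: towards max_pi E_q[r(S_t, G^e)]
        density_model.fit(buffer.achieved_goals())   # step 8: update density model phi
        if converged(epoch):                         # convergence check 9
            break
```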
[0061] In the step of sampling 1 the real goal g^e, the real goal g^e of a multitude of real goals G^e with a probability p(g^e) and an initial state s_0 with a probability of p(s_0) are sampled. The real goals g^e of the multitude of real goals G^e are environmental goals like a desired position and orientation of an object which has to be manipulated by a robot arm. The initial state s_0 comprises, for example, the initial position and orientation of the robot arm and all its joints.
[0062] In the step of sampling 2 an action a_t, an action a_t from the single-goal conditioned behaviour policy π that is represented by a Universal Value Function Approximator (UVFA) is sampled. The actions a_t lead from the current state s_t to the next state s_t+1. The states s_t comprise two sub-vectors, one achieved goal state s^g (e.g. position and orientation of the object being manipulated) and one context state s^c, i.e. s = (s^g, s^c). An achieved goal g^s can be represented by a state and, thus, the achieved goal states can be written g^s = s^g. The UVFA essentially generalises the value functions (Q-functions) to multiple achieved goal states g^s, where Q-values depend not only on state-action pairs (s_t, a_t) but also on the achieved goal states g^s.
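For illustration, a UVFA can be realised as a single network that takes the state, the action and the goal as one input; the PyTorch sketch below is one possible realisation (the library choice and the layer sizes are assumptions of this sketch, not prescribed by the description).

```python
import torch
import torch.nn as nn

class UVFA(nn.Module):
    # Goal-conditioned Q-network: Q(s_t, a_t, g) as one function approximator.
    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, goal):
        # The state s = (s^g, s^c) and the goal enter as one concatenated
        # vector, so the same network generalises over multiple goals g^s.
        return self.net(torch.cat([state, action, goal], dim=-1))
```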
[0063] In the step of stepping 3 the environment, the environment is stepped for a new state s_t+1 with the sampled action a_t.
[0064] In the step of updating 4 the replay buffer R, the replay buffer R that comprises a distribution p(τ^g) of goal state trajectories τ^g is updated with the current state s_t and the current action a_t. The goal state trajectories τ^g contain pairs of states s_t from a multitude of states S_t and corresponding actions a_t from a multitude of actions A_t.
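A minimal trajectory buffer matching this description might look as follows; the class and method names are illustrative only.

```python
from collections import deque

class TrajectoryReplayBuffer:
    # Replay buffer R storing goal state trajectories tau^g as (s_t, a_t) pairs.
    def __init__(self, capacity=10_000):
        self.trajectories = deque(maxlen=capacity)
        self.current = []

    def add(self, state, action):
        # Step 4: extend the current trajectory with the pair (s_t, a_t).
        self.current.append((state, action))

    def end_episode(self):
        # Close the current trajectory tau^g and start a new one.
        if self.current:
            self.trajectories.append(self.current)
            self.current = []

    def achieved_goals(self):
        # Achieved goal states g^s = s^g; here the whole state is returned,
        # extracting the goal sub-vector s^g is left to the density model.
        return [s for traj in self.trajectories for (s, _) in traj]
```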
[0065] In the step of constructing 5 the prioritised sampling distribution or rather proposal probability density function q(τ^g), the prioritised sampling distribution q(τ^g) is constructed with a higher entropy H_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g in the replay buffer R. Goal state trajectories τ^g with a lower probability are more likely to be chosen under the prioritised sampling distribution q(τ^g). This leads to an approximately uniform selection of goal state trajectories τ^g.
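One simple way to obtain such a higher-entropy distribution (a sketch under the assumption that rarity is measured by the density model φ) is to weight each stored trajectory by the inverse of its estimated density and renormalise; rare goal state trajectories then receive a higher sampling probability.

```python
import numpy as np
from scipy.stats import entropy

def prioritised_distribution(log_density, eps=1e-6):
    # Construct q(tau^g) from per-trajectory log-densities under phi:
    # low-density (rare) trajectories receive proportionally larger weights.
    density = np.exp(np.asarray(log_density))
    weights = 1.0 / (density + eps)
    return weights / weights.sum()

# Illustrative check that q has higher entropy than a skewed p(tau^g):
p = np.array([0.7, 0.2, 0.05, 0.05])   # empirical trajectory distribution p(tau^g)
q = prioritised_distribution(np.log(p))
assert entropy(q) > entropy(p)          # H_q(tau^g) > H_p(tau^g)
```

In step 6 a minibatch of trajectory indices can then be drawn, e.g. with np.random.default_rng().choice(len(q), p=q).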
[0066] In the step of sampling 6 the goal state trajectories τ^g, the goal state trajectories τ^g are sampled with the prioritised sampling distribution q(τ^g) and a current density model φ, q(τ^g|φ).
[0067] In the step of updating 7 the single-goal conditioned behaviour policy π, the single-goal conditioned behaviour policy π is updated towards a maximum of an expectation E_q of a reward r for the states S_t and real goals G^e, max_π E_q[r(S_t, G^e)].
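According to claim 2 this update may be based on DDPG; the sketch below shows one DDPG-style actor update as an example (the function signatures are assumptions of this sketch, not the prescribed interface).

```python
import torch

def update_policy(actor, critic, actor_optim, states, goals):
    # One gradient step towards max_pi E_q[r(S_t, G^e)]: the deterministic
    # actor is pushed uphill on the critic's value Q(s, pi(s, g), g), the
    # standard DDPG surrogate for maximising the expected return.
    actions = actor(states, goals)
    actor_loss = -critic(states, actions, goals).mean()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()
    return actor_loss.item()
```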
[0068] After the previous steps 2 to 7 have been iteratively executed for each episode of the current epoch fe2, in the step of updating 8 the density model φ, the density model φ is updated for each epoch fe1.
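The description leaves the concrete form of the density model φ open; a Gaussian mixture fitted to the achieved goal states, as sketched below with scikit-learn, is one plausible instantiation and not the only possible choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class DensityModel:
    # Density model phi over achieved goal states g^s (Gaussian mixture assumed).
    def __init__(self, n_components=3, seed=0):
        self.gmm = GaussianMixture(n_components=n_components, random_state=seed)

    def fit(self, achieved_goals):
        # Step 8: refit phi on the achieved goal states collected so far.
        self.gmm.fit(np.asarray(achieved_goals))

    def log_density(self, goals):
        # Returns log q(tau^g | phi) per sample, used to construct q(tau^g).
        return self.gmm.score_samples(np.asarray(goals))
```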
[0069] All iterative steps are executed as long as the computer-implemented method has not converged. The method may run until it converges to the optimal policy and/or until a predefined criterion is fulfilled (e.g. a maximum number of epochs). This is checked (y: yes/n: no) 9 after each epoch or before a new epoch is started. For example, a criterion for convergence may be a simple upper limit C of epochs, for example C=200. The upper limit C is preferably between 50 and 200.
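The epoch-limit criterion from paragraph [0069] can be expressed directly; the optional success-rate target below is added only as an illustrative alternative criterion, not one named in the description.

```python
def converged(epoch, C=200, success_rate=None, target=None):
    # Convergence check 9: stop after at most C epochs (preferably 50 to 200),
    # or earlier if an optional mean-success-rate target has been reached.
    if success_rate is not None and target is not None:
        if success_rate >= target:
            return True
    return epoch + 1 >= C
```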
[0070] In
[0071] In each episode of each epoch the agent 10 (AI system) samples an action a_t from the single-goal conditioned behaviour policy π represented as UVFA (step 2).
[0072] Then the environment 20 is stepped with the sampled action a_t from the current state s_t (e.g. current position and orientation of the object being manipulated and of the robot arm used for manipulation) to the next state s_t+1 (step 3).
[0073] Based on the sampled action a_t and the state s_t the replay buffer R is updated (step 4).
[0074] Afterwards the prioritised sampling distribution or rather proposal probability density function q(τ^g) with higher entropy H_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g in the replay buffer R is constructed (step 5).
[0075] With the constructed prioritised sampling distribution q(τ^g) the goal state trajectories τ^g in the replay buffer R are sampled (step 6).
[0076] The sampled goal state trajectories τ^g, the new state s_t+1 (state for the next iteration/episode) and the corresponding action a_t are provided to the agent 10 for gaining new experience. Further, the single-goal conditioned behaviour policy π is updated towards max_π E_q[r(S_t, G^e)] (step 7, not depicted in the figures).
[0077] After each episode of the current epoch the density model φ of the agent 10 is updated (step 8).
[0078] The steps 2 to 8 are iteratively repeated as described above for each epoch of the training. After the method has converged, no further epoch of the training is started (by sampling 1 a new real goal g^e and a new initial state s_0, step 1).
[0079] In
[0080] The performance of the method according to the present invention (MER multi-goal RL method) has been tested on a variety of simulated robotic tasks (i.e. OpenAI Gym: Push, Pick & Place, Slide, Egg, Block and Pen) and compared with state-of-the-art methods as baselines, including DDPG and HER. The most similar method to MER multi-goal RL seems to be Prioritised Experience Replay (PER) (combined with DDPG(+HER)). In the experiments, first the performance improvement of MER multi-goal RL has been compared to DDPG with/without HER and to PER (with DDPG with/without HER). Afterwards, the time-complexity of MER multi-goal RL has been compared to DDPG(+HER) and to PER(+DDPG(+HER)). As will be subsequently described in detail, MER multi-goal RL improves performance over DDPG(+HER) and PER while requiring much less computational time than PER.
[0081] A principal difference between MER multi-goal RL and PER is that PER uses TD-errors, while MER multi-goal RL is based on the entropy.
[0082] To test the performance difference among the methods DDPG, PER+DDPG and MER multi-goal RL (MERmgRL)+DDPG, the experiment has been run in the three robot arm environments of OpenAI Gym. The DDPG has been used as the baseline because the robot arm environment is relatively simple. In the more challenging robot hand environments the DDPG+HER has been used as the baseline and the performance among DDPG+HER, PER+DDPG+HER, and MER multi-goal RL+DDPG+HER has been tested. To combine PER with HER, the TD-error of each transition has been calculated based on the randomly selected achieved goals. Then the transitions with higher TD-errors have been prioritised for replay. The mean success rates have been compared. Each experiment has been carried out with 5 random seeds, and the shaded areas in the corresponding plots indicate the spread across these seeds.
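For context, the PER baseline prioritises transitions by the magnitude of their TD-errors; the following schematic shows that prioritisation (the exponent alpha and the exact TD target are assumptions of this sketch, and it illustrates the baseline, not the claimed method).

```python
import numpy as np

def per_priorities(rewards, q_values, next_q_values, gamma=0.98, alpha=0.6):
    # PER-style priorities from TD-errors:
    # delta_t = r_t + gamma * max_a' Q(s_t+1, a', g) - Q(s_t, a_t, g)
    td_errors = rewards + gamma * next_q_values - q_values
    priorities = np.abs(td_errors) ** alpha
    return priorities / priorities.sum()
```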
TABLE-US-00003
Task:              Push               Pick & Place       Slide
Method             success    time    success    time    success    time
DDPG               99.90%     5.52 h  39.34%     5.61 h  75.67%     5.47 h
PER + DDPG         99.94%    30.66 h  67.19%    25.73 h  66.33%    25.85 h
MERmgRL            99.96%     6.76 h  76.02%     6.92 h  76.77%     6.66 h

Task:              Egg                Block              Pen
Method             success    time    success    time    success    time
DDPG + HER         76.19%     7.33 h  20.32%     8.47 h  27.28%     7.55 h
PER + DDPG + HER   75.46%    79.86 h  18.95%    80.72 h  27.74%    81.17 h
MERmgRL            81.30%    17.00 h  25.00%    19.88 h  31.88%    25.36 h
[0083] From
[0084] In
[0085] To compare the sample-efficiency of the baseline and MER multi-goal RL, the number of training samples needed for a certain mean success rate has been compared; the comparison is shown in the corresponding figure.
[0086] In
[0087] To further understand why maximum entropy in goal space facilitates learning, the TD-errors during training have been examined, in particular the correlation between the complementary predictive density and the TD-errors.
[0088] In
[0089] Here, exemplarily, a computer-readable storage disc 20 like a Compact Disc (CD), Digital Video Disc (DVD), High Definition DVD (HD DVD) or Blu-ray Disc (BD) has stored thereon the computer program according to the second aspect of the present invention, as schematically shown in the figures.
[0090] In
[0091] The data processing system 30 may be a personal computer (PC), a laptop, a tablet, a server, a distributed system (e.g. cloud system) and the like. The data processing system 30 comprises a central processing unit (CPU) 31, a memory having a random access memory (RAM) 32 and a non-volatile memory (MEM, e.g. hard disk) 33, a human interface device (HID, e.g. keyboard, mouse, touchscreen etc.) 34 and an output device (MON, e.g. monitor, printer, speaker, etc.) 35. The CPU 31, RAM 32, HID 34 and MON 35 are communicatively connected via a data bus. The RAM 32 and MEM 33 are communicatively connected via another data bus. The computer program according to the second aspect of the present invention and schematically depicted in the figures may be stored in the MEM 33 and loaded into the RAM 32 for execution by the CPU 31.
[0092] In particular, the CPU 31 and RAM 32 for executing the computer program may comprise several CPUs 31 and several RAMs 32, for example in a computation cluster or a cloud system. The HID 34 and MON 35 for controlling execution of the computer program may be comprised by a different data processing system like a terminal communicatively connected to the data processing system 30 (e.g. cloud system).
[0093] Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations exist. It should be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration in any way. Rather, the foregoing summary and detailed description will provide those skilled in the art with a convenient road map for implementing at least one exemplary embodiment, it being understood that various changes may be made in the function and arrangement of elements described in an exemplary embodiment without departing from the scope as set forth in the appended claims and their legal equivalents. Generally, this application is intended to cover any adaptations or variations of the specific embodiments discussed herein.
[0094] In the foregoing detailed description, various features are grouped together in one or more examples for the purpose of streamlining the disclosure. It is understood that the above description is intended to be illustrative, and not restrictive. It is intended to cover all alternatives, modifications and equivalents as may be included within the scope of the invention. Many other examples will be apparent to one skilled in the art upon reviewing the above specification.
[0095] Specific nomenclature used in the foregoing specification is used to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art in light of the specification provided herein that the specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the present invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. Throughout the specification, the terms "including" and "in which" are used as the plain-English equivalents of the respective terms "comprising" and "wherein". Moreover, the terms "first", "second", "third", etc. are used merely as labels, and are not intended to impose numerical requirements on or to establish a certain ranking of importance of their objects. In the context of the present description and claims the conjunction "or" is to be understood as inclusive ("and/or") and not exclusive ("either . . . or").