MAXIMUM ENTROPY REGULARISED MULTI-GOAL REINFORCEMENT LEARNING

20200334565 · 2020-10-22

    Abstract

    The present invention is related to a computer-implemented method of training artificial intelligence (AI) systems or rather agents (Maximum Entropy Regularised multi-goal Reinforcement Learning), in particular, an AI system/agent for controlling a technical system. By constructing a prioritised sampling distribution q(τ^g) with a higher entropy ℋ_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g and sampling the goal state trajectories τ^g with the prioritised sampling distribution q(τ^g), the AI system/agent is trained to achieve unseen goals by learning from diverse achieved goal states uniformly.

    Claims

    1. A computer-implemented method of training artificial intelligence, AI, systems, comprising the iterative step of: sampling a real goal g^e of a multitude of real goals G^e with a probability p(g^e) and an initial state s_0 with a probability of p(s_0); and for each episode of each epoch of the training the iterative steps of: sampling an action a_t from a single-goal conditioned behaviour policy π that is represented by a Universal Value Function Approximator, UVFA; stepping an environment for a new state s_{t+1} with the sampled action a_t; updating a replay buffer ℛ that comprises a distribution p(τ^g) of goal state trajectories τ^g with the current state s_t and the current action a_t, wherein the goal state trajectories τ^g contain pairs of states s_t from a multitude of states S_t and corresponding actions a_t from a multitude of actions A_t; constructing a prioritised sampling distribution q(τ^g) with a higher entropy ℋ_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g in the replay buffer ℛ; sampling the goal state trajectories τ^g with the prioritised sampling distribution q(τ^g) and a current density model φ, q(τ^g|φ); and updating the single-goal conditioned behaviour policy π to a maximum of an expectation 𝔼_q of a reward r for the states S_t and the real goals G^e, max 𝔼_q[r(S_t, G^e)]; and after each episode for each epoch of the training the step of: updating the density model φ; while the computer-implemented method has not converged.

    2. The computer-implemented method according to claim 1, wherein the step of updating the single-goal conditioned behaviour policy π is based on a Deep Deterministic Policy Gradient, DDPG, method and/or on a Hindsight Experience Replay, HER, method.

    3. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method according to claim 1.

    4. A computer-readable medium having stored thereon the computer program according to claim 3.

    5. A data processing system comprising means for carrying out the steps of the method according to claim 1.

    Description

    BRIEF DESCRIPTION

    [0041] The present invention and its technical field are subsequently explained in further detail by exemplary embodiments shown in the drawings. The exemplary embodiments are only intended to provide a better understanding of the present invention and are in no case to be construed as limiting the scope of the present invention. In particular, it is possible to extract aspects of the subject-matter described in the figures and to combine them with other components and findings of the present description or figures, unless explicitly described otherwise. Equal reference signs refer to the same objects, such that explanations from other figures may be used supplementally.

    [0042] FIG. 1 shows a schematic flow chart of the computer-implemented method/computer program according to the first/second aspect of the present invention.

    [0043] FIG. 2 shows a schematic algorithm of the computer-implemented method/computer program according to the first/second aspect of the present invention.

    [0044] FIG. 3 shows a schematic flow chart of the steps during an episode of the training with computer-implemented method/computer program according to the first/second aspect of the present invention.

    [0045] FIG. 4 shows a schematic diagram of a performance test of the computer-implemented method according to the first aspect of the present invention.

    [0046] FIG. 5 shows a schematic diagram of a sample-efficiency test of the computer-implemented method according to the first aspect of the present invention.

    [0047] FIG. 6 shows a schematic diagram of TD-errors during training with the computer-implemented method according to the first aspect of the present invention.

    [0048] FIG. 7 shows a schematic view of the computer-readable medium according to the third aspect of the present invention.

    [0049] FIG. 8 shows a schematic view of the data processing system according to the fourth aspect of the present invention.

    DETAILED DESCRIPTION

    [0050] In FIG. 1 a flow chart of an embodiment of the computer-implemented method according to the first aspect of the present invention and of the computer program according to the second aspect of the present invention is exemplarily depicted.

    [0051] In FIG. 2 a corresponding algorithm of the embodiment of FIG. 1 is schematically depicted.

    [0052] As depicted in FIGS. 1 and 2 the computer-implemented method of training an artificial intelligence (AI) system (MER multi-goal RL method) and the corresponding computer program comprise the iterative step of: [0053] sampling 1 a real goal g^e; and
    for each episode of each epoch fe2 of the training the iterative steps of: [0054] sampling 2 an action a_t; [0055] stepping 3 an environment; [0056] updating 4 a replay buffer ℛ; [0057] constructing 5 a prioritised sampling distribution q(τ^g); [0058] sampling 6 goal state trajectories τ^g; and [0059] updating 7 a single-goal conditioned behaviour policy π; as well as after each episode for each epoch fe1 of the training the step of: [0060] updating 8 a density model φ. A schematic sketch of this loop is given below.
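    The interplay of steps 1 to 8 can be illustrated by the following minimal Python sketch of one training run. All object and method names (env, policy, replay_buffer, density_model, etc.) are hypothetical placeholders for the components named above and do not correspond to any particular library; the placement of the goal sampling at the start of each episode is one plausible reading of step 1, and the sketch is illustrative only, not the claimed implementation.

```python
# Illustrative sketch of the training loop (steps 1 to 8).
# env, policy, replay_buffer and density_model are hypothetical objects
# standing in for the components described in the text.

def train(env, policy, replay_buffer, density_model,
          n_epochs=200, episodes_per_epoch=50, episode_len=50):
    for epoch in range(n_epochs):                          # outer loop, checked for convergence (limit C)
        for _ in range(episodes_per_epoch):                # fe2: for each episode of the epoch
            goal, state = env.sample_goal_and_initial_state()    # step 1: sample g^e and s_0
            for _t in range(episode_len):
                action = policy.sample_action(state, goal)       # step 2: a_t from pi(. | s_t, g^e)
                next_state = env.step(action)                    # step 3: step the environment
                replay_buffer.add(state, action)                 # step 4: update replay buffer R
                state = next_state
            q = replay_buffer.prioritised_distribution(density_model)  # step 5: construct q(tau^g)
            trajectories = replay_buffer.sample(q)               # step 6: sample goal state trajectories
            policy.update(trajectories)                          # step 7: maximise E_q[r(S_t, G^e)]
        density_model.fit(replay_buffer.achieved_goal_trajectories())  # step 8: update density model phi
```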

    [0061] In the step of sampling 1 the real goal g^e, the real goal g^e of a multitude of real goals G^e with a probability p(g^e) and an initial state s_0 with a probability of p(s_0) are sampled. The real goals g^e of the multitude of real goals G^e are environmental goals like the desired position and orientation of an object which has to be manipulated by a robot arm. The initial state s_0 comprises, for example, the initial position and orientation of the robot arm and all its joints.

    [0062] In the step of sampling 2 an action a_t, an action a_t is sampled from the single-goal conditioned behaviour policy π that is represented by a Universal Value Function Approximator (UVFA). The actions a_t lead from the current state s_t to the next state s_{t+1}. The states s_t comprise two sub-vectors, one achieved goal state s^g (e.g. position and orientation of the object being manipulated) and one context state s^c (s=(s^g, s^c)). An achieved goal g^s can be represented by a state and, thus, the achieved goal states can be written g^s=s^g. The UVFA essentially generalises the value functions (Q-functions) to multiple achieved goal states g^s, where Q-values depend not only on state-action pairs (s_t, a_t) but also on the achieved goal states g^s.
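    Purely as an illustration of how such a UVFA-style critic can take the goal into account in addition to the state-action pair, the following PyTorch sketch concatenates state, action and goal into one input vector. The architecture, the layer sizes and the assumption that s^g and s^c are already flattened into a single state vector are illustrative choices, not the claimed UVFA.

```python
import torch
import torch.nn as nn


class GoalConditionedQ(nn.Module):
    """UVFA-style critic: the Q-value depends on (s_t, a_t) and the goal g.
    Dimensions and architecture are illustrative assumptions only."""

    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, goal):
        # state = (s^g, s^c): achieved goal sub-vector and context sub-vector,
        # assumed here to be concatenated into one flat vector
        return self.net(torch.cat([state, action, goal], dim=-1))
```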

    [0063] In the step of stepping 3 the environment, the environment is stepped for a new state s_{t+1} with the sampled action a_t.

    [0064] In the step of updating 4 the replay buffer ℛ, the replay buffer ℛ that comprises a distribution p(τ^g) of goal state trajectories τ^g is updated with the current state s_t and the current action a_t. The goal state trajectories τ^g contain pairs of states s_t from a multitude of states S_t and corresponding actions a_t from a multitude of actions A_t.
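    A minimal sketch of such a replay buffer of goal state trajectories follows, assuming each episode is stored as a list of (s_t, a_t) pairs from which the achieved goal states s^g can later be extracted; the class and method names, the fixed capacity and the assumption that s^g occupies the first entries of the state vector are hypothetical and only illustrate the bookkeeping described above.

```python
from collections import deque

import numpy as np


class TrajectoryReplayBuffer:
    """Stores goal state trajectories tau^g as sequences of (s_t, a_t) pairs.
    Capacity and layout are illustrative assumptions, not the claimed buffer."""

    def __init__(self, capacity=10_000):
        self.trajectories = deque(maxlen=capacity)
        self.current = []                      # (s_t, a_t) pairs of the running episode

    def add(self, state, action):
        self.current.append((np.asarray(state), np.asarray(action)))

    def end_episode(self):
        if self.current:
            self.trajectories.append(self.current)
            self.current = []

    def achieved_goal_trajectories(self, goal_dim):
        # the achieved goal state s^g is assumed to be the first goal_dim
        # entries of the full state vector s = (s^g, s^c)
        return [np.stack([s[:goal_dim] for s, _ in traj])
                for traj in self.trajectories]
```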

    [0065] In the step of constructing 5 the prioritised sampling distribution, or rather proposal probability density function, q(τ^g), the prioritised sampling distribution q(τ^g) is constructed with a higher entropy ℋ_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g in the replay buffer ℛ. Goal state trajectories τ^g with a lower probability are more likely to be chosen under the prioritised sampling distribution q(τ^g). This leads to a more uniform selection of goal state trajectories τ^g.
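    One simple way to obtain a proposal distribution q(τ^g) with a higher entropy than p(τ^g) is to weight each stored trajectory inversely to its estimated density under the current density model φ and renormalise, so that rarely achieved goal state trajectories are proposed more often. The inverse-density rule and the density_model.score interface below are illustrative assumptions, not the claimed construction; one possible density model φ is sketched further below.

```python
import numpy as np


def prioritised_distribution(density_model, goal_trajectories, eps=1e-6):
    """Construct q(tau^g) with a higher entropy than p(tau^g) by up-weighting
    trajectories that are rare under the density model phi.  The inverse-density
    weighting is an illustrative choice, not the claimed prioritisation."""
    # density_model.score(traj) is assumed to return an estimate of p(tau^g | phi) > 0
    densities = np.array([density_model.score(traj) for traj in goal_trajectories])
    weights = 1.0 / (densities + eps)          # rare trajectories receive larger weights
    return weights / weights.sum()             # normalised sampling probabilities q(tau^g)


def sample_trajectories(goal_trajectories, q, batch_size, rng=None):
    """Draw goal state trajectories according to the prioritised distribution q."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(goal_trajectories), size=batch_size, p=q)
    return [goal_trajectories[i] for i in idx]
```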

    [0066] In the step of sampling 6 the goal state trajectories τ^g, the goal state trajectories τ^g are sampled with the prioritised sampling distribution q(τ^g) and a current density model φ, i.e. q(τ^g|φ).

    [0067] In the step of updating 7 the single-goal conditioned behaviour policy π, the single-goal conditioned behaviour policy π is updated towards a maximum of the expectation 𝔼_q of the reward r for the states S_t and the real goals G^e (max 𝔼_q[r(S_t, G^e)]).

    [0068] After the previous steps 2 to 7 are iteratively executed for each episode of the current epoch fe2, in the step of updating 8 the density model φ, the density model φ is updated for each epoch fe1.
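    As an illustration of what updating the density model φ once per epoch could look like, the following sketch re-fits a kernel density estimate on the achieved goal states collected in the replay buffer and approximates a trajectory density p(τ^g|φ) as the mean density of a trajectory's goal states. scikit-learn's KernelDensity and the chosen bandwidth are stand-ins used here only for illustration; the claims do not prescribe a particular density model.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def update_density_model(goal_trajectories, bandwidth=0.1):
    """Re-fit the density model phi on all achieved goal states g^s = s^g
    gathered in the replay buffer (step 8).  KernelDensity and the bandwidth
    are illustrative assumptions, not the claimed density model."""
    goals = np.concatenate([np.asarray(traj) for traj in goal_trajectories], axis=0)
    return KernelDensity(bandwidth=bandwidth).fit(goals)


def trajectory_density(phi, trajectory):
    """Approximate p(tau^g | phi) as the mean density of the trajectory's goal states."""
    log_p = phi.score_samples(np.asarray(trajectory))   # per-state log densities
    return float(np.exp(log_p).mean())
```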

    [0069] All iterative steps are executed as long as the computer-implemented method has not converged. The method may converge to the optimal policy and/or terminate when a predefined criterion is fulfilled (e.g. a number of epochs). This is checked (y: yes/n: no) 9 after each epoch or before a new epoch is started. For example, a criterion for convergence may be a simple upper limit C on the number of epochs, for example C=200. The upper limit C is preferably between 50 and 200.

    [0070] In FIG. 3 a flow chart of the steps 2 to 6 of each episode of each epoch of the training and the step 8 of each epoch of the training with the computer-implemented method or computer program of FIGS. 1 and 2 is schematically depicted (step 7 is not depicted in FIG. 3).

    [0071] In each episode of each epoch the agent 10 (AI system) samples an action a_t from the single-goal conditioned behaviour policy π represented as a UVFA (step 2).

    [0072] Then the environment 20 is stepped with the sampled action a_t from the current state s_t (e.g. current position and orientation of the object being manipulated and of the robot arm used for manipulation) to the next state s_{t+1} (step 3).

    [0073] Based on the sampled action a_t and the state s_t the replay buffer ℛ is updated (step 4).

    [0074] Afterwards the prioritised sampling distribution, or rather proposal probability density function, q(τ^g) with higher entropy ℋ_q(τ^g) than the distribution p(τ^g) of goal state trajectories τ^g in the replay buffer ℛ is constructed (step 5).

    [0075] With the constructed prioritised sampling distribution q(τ^g) the goal state trajectories τ^g in the replay buffer ℛ are sampled (step 6).

    [0076] The sampled goal state trajectories τ^g, the new state s_{t+1} (state for the next iteration/episode) and the corresponding action a_t are provided to the agent 10 for gaining new experience. Further, the single-goal conditioned behaviour policy π is updated towards max 𝔼_q[r(S_t, G^e)] (step 7, not depicted in FIG. 3).

    [0077] After each episode of the current epoch the density model φ of the agent 10 is updated (step 8).

    [0078] The steps 2 to 8 are iteratively repeated as described above for each epoch of the training. After the method has converged, no further epoch of the training is started (by sampling 1 a new real goal g^e and a new initial state s_0, step 1).

    [0079] In FIG. 4 a diagram of a performance test of the computer-implemented method according to the first aspect of the present invention schematically shown in FIGS. 1 to 3 is schematically depicted. The mean success rate MSR for Push PU, Pick & Place PI, Slide SL, Egg EG, Block BL and Pen PE is plotted over the number of epochs used for training.

    [0080] The performance of the method according to the present invention (MER multi-goal RL method) has been tested on a variety of simulated robotic tasks (i.e. OpenAI Gym: Push, Pick & Place, Slide, Egg, Block and Pen) and compared with state-of-the-art methods as baselines, including DDPG and HER. The method most similar to MER multi-goal RL appears to be Prioritised Experience Replay (PER) (combined with DDPG(+HER)). In the experiments, first the performance improvement of MER multi-goal RL has been compared to DDPG with/without HER and to PER (with DDPG with/without HER). Afterwards, the time-complexity of MER multi-goal RL has been compared to DDPG(+HER) and to PER(+DDPG(+HER)). As will be subsequently described in detail, MER multi-goal RL improves performance over DDPG(+HER) while requiring much less computational time than PER.

    [0081] A principal difference between MER multi-goal RL and PER is that PER uses TD-errors, while MER multi-goal RL is based on the entropy.

    [0082] To test the performance difference among methods including DDPG, PER+DDPG and MER multi-goal RL (MERmgRL)+DDPG, the experiment has been run in the three robot arm environments of OpenAI Gym. DDPG has been used as the baseline because the robot arm environments are relatively simple. In the more challenging robot hand environments DDPG+HER has been used as the baseline and the performance among DDPG+HER, PER+DDPG+HER, and MER multi-goal RL+DDPG+HER has been tested. To combine PER with HER, the TD-error of each transition has been calculated based on the randomly selected achieved goals. Then the transitions with higher TD-errors have been prioritised for replay. The mean success rates have been compared. Each experiment has been carried out with 5 random seeds and the shaded areas in FIG. 4 represent the standard deviation. The learning curve with respect to training epochs is shown in FIG. 4. For all experiments, 19 CPUs have been used and the agent has been trained for 200 epochs. After training, the best-learned policy has been used for evaluation and has been tested in the environment. The testing results are the mean success rates. A comparison of the performances along with the training time is shown in the subsequent table.

    TABLE-US-00003

    Task:              Push                 Pick & Place         Slide
    Method:            success    time      success    time      success    time
    DDPG               99.90%     5.52 h    39.34%     5.61 h    75.67%     5.47 h
    PER + DDPG         99.94%    30.66 h    67.19%    25.73 h    66.33%    25.85 h
    MERmgRL            99.96%     6.76 h    76.02%     6.92 h    76.77%     6.66 h

    Task:              Egg                  Block                Pen
    Method:            success    time      success    time      success    time
    DDPG + HER         76.19%     7.33 h    20.32%     8.47 h    27.28%     7.55 h
    PER + DDPG + HER   75.46%    79.86 h    18.95%    80.72 h    27.74%    81.17 h
    MERmgRL            81.30%    17.00 h    25.00%    19.88 h    31.88%    25.36 h

    [0083] From FIG. 4, it can be seen that MER multi-goal RL (MERmgRL) converges faster in all six tasks than both the baseline and PER. The agent trained with MER multi-goal RL also shows a better performance at the end of the training, as shown in the table. Further, the table shows that the training time of MER multi-goal RL lies in between the baseline and PER. To be more specific, MER multi-goal RL consumes much less computational time than PER does, as no TD-errors have to be computed for prioritisation. For example, in the robot arm environments, on average MER multi-goal RL consumes about 1.2 times the training time of DDPG. In comparison, PER+DDPG consumes about 5 times the training time of DDPG. In this case, MER multi-goal RL is 4 times faster than PER. Compared to PER, MER multi-goal RL is much faster in computational time because it only updates the goal state trajectory density once per epoch. For this reason, MER multi-goal RL is much more efficient than PER in computational time and can easily be combined with any multi-goal RL method, such as DDPG and HER. The table shows that the baseline methods combined with MER multi-goal RL give a better performance in all six tasks. The improvement goes up to 39.34 percentage points compared to the baseline methods. The average improvement over the six tasks is 9.15 percentage points. It can be seen that MER multi-goal RL is a simple yet effective method that improves on state-of-the-art methods.

    [0084] In FIG. 5 a diagram of a sample-efficiency test of the computer-implemented method according to the first aspect of the present invention schematically shown in FIGS. 1 to 3 is schematically depicted. The number of training samples TS for Push PU, Pick & Place PI, Slide SL, Egg EG, Block BL and Pen PE is plotted over the mean success rate MSR.

    [0085] To compare the sample-efficiency of the baseline and MER multi-goal RL, the number of training samples needed for a certain mean success rate has been compared. The comparison is shown in FIG. 5. From FIG. 5 it can be seen that, in the FetchPush-v0 environment, for the same 99% mean success rate the baseline DDPG needs 273,600 samples for training, while MER multi-goal RL only needs 112,100 samples. In this case, MER multi-goal RL is more than twice (2.44 times) as sample-efficient as DDPG. Similarly, in the other five environments, MER multi-goal RL improves sample-efficiency by factors of around one to three. In conclusion, for all six environments, MER multi-goal RL is able to improve sample-efficiency by an average factor of two (1.95) over the baseline's sample-efficiency.
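    The quoted factor for FetchPush-v0 follows directly from the two sample counts given above (a quick sanity check only; the counts for the other five environments are not listed here):

```python
baseline_samples = 273_600   # DDPG, FetchPush-v0, 99% mean success rate
mer_samples = 112_100        # MER multi-goal RL, same success rate

print(round(baseline_samples / mer_samples, 2))   # -> 2.44
```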

    [0086] In FIG. 6 a diagram of TD-errors during training with the computer-implemented method according to the first aspect of the present invention schematically shown in FIGS. 1 to 3 is schematically depicted. The TD-error TDE for Egg EG, Block BL and Pen PE is plotted over the complementary trajectory density CTD.

    [0087] To further understand why maximum entropy in the goal space facilitates learning, the TD-errors during training are examined. The correlation between the complementary predictive density p(τ^g|φ), equation 11, and the TD-errors of the goal state trajectory is investigated. The Pearson correlation coefficients, i.e. Pearson's r, between the density p(τ^g|φ) and the TD-errors of the goal state trajectory are 0.63, 0.76, and 0.73 for the hand manipulation of egg, block, and pen tasks, respectively. The plot of the Pearson correlation is shown in FIG. 6. The value of Pearson's r lies between -1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. It can be seen that the complementary predictive density is correlated with the TD-errors of the trajectory with an average Pearson's r of 0.7. This indicates that the agent learns faster from a more diverse goal distribution. Under-represented goals often have higher TD-errors. Therefore, it is helpful to maximise the goal entropy and prioritise the under-represented goals during training.
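    The reported correlation coefficients can be computed, for example, with SciPy's pearsonr from per-trajectory pairs of complementary predictive density and TD-error. The arrays below are placeholder values for illustration only; the actual experimental data are not reproduced here.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder data: one entry per goal state trajectory, pairing its
# complementary predictive density with its (mean) TD-error.
complementary_density = np.array([0.10, 0.40, 0.35, 0.80, 0.60])
td_errors = np.array([0.05, 0.30, 0.25, 0.70, 0.50])

r, p_value = pearsonr(complementary_density, td_errors)
print(f"Pearson's r = {r:.2f}")   # values close to +1 indicate strong positive correlation
```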

    [0088] In FIG. 7 an embodiment of the computer-readable medium 20 according to the third aspect of the present invention is schematically depicted.

    [0089] Here, exemplarily a computer-readable storage disc 20 like a Compact Disc (CD), Digital Video Disc (DVD), High Definition DVD (HD DVD) or Blu-ray Disc (BD) has stored thereon the computer program according to the second aspect of the present invention and as schematically shown in FIGS. 1 to 3. However, the computer-readable medium may also be a data storage like a magnetic storage/memory (e.g. magnetic-core memory, magnetic tape, magnetic card, magnetic strip, magnetic bubble storage, drum storage, hard disc drive, floppy disc or removable storage), an optical storage/memory (e.g. holographic memory, optical tape, Tesa tape, Laserdisc, Phasewriter (Phasewriter Dual, PD) or Ultra Density Optical (UDO)), a magneto-optical storage/memory (e.g. MiniDisc or Magneto-Optical Disk (MO-Disk)), a volatile semiconductor/solid state memory (e.g. Random Access Memory (RAM), Dynamic RAM (DRAM) or Static RAM (SRAM)), a non-volatile semiconductor/solid state memory (e.g. Read Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically EPROM (EEPROM), Flash-EEPROM (e.g. USB-Stick), Ferroelectric RAM (FRAM), Magnetoresistive RAM (MRAM) or Phase-change RAM).

    [0090] In FIG. 8 an embodiment of the data processing system 30 according to the fourth aspect of the present invention is schematically depicted.

    [0091] The data processing system 30 may be a personal computer (PC), a laptop, a tablet, a server, a distributed system (e.g. cloud system) and the like. The data processing system 30 comprises a central processing unit (CPU) 31, a memory having a random access memory (RAM) 32 and a non-volatile memory (MEM, e.g. hard disk) 33, a human interface device (HID, e.g. keyboard, mouse, touchscreen etc.) 34 and an output device (MON, e.g. monitor, printer, speaker, etc.) 35. The CPU 31, RAM 32, HID 34 and MON 35 are communicatively connected via a data bus. The RAM 32 and MEM 33 are communicatively connected via another data bus. The computer program according to the second aspect of the present invention and schematically depicted in FIGS. 1 to 3 can be loaded into the RAM 32 from the MEM 33 or another computer-readable medium 20. According to the computer program the CPU executes the steps 1 to 8 of the computer-implemented method according to the first aspect of the present invention and schematically depicted in FIGS. 1 to 3. The execution can be initiated and controlled by a user via the HID 34. The status and/or result of the executed computer program may be indicated to the user by the MON 35. The result of the executed computer program may be permanently stored on the non-volatile MEM 33 or another computer-readable medium.

    [0092] In particular, the CPU 31 and RAM 32 for executing the computer program may comprise several CPUs 31 and several RAMs 32, for example in a computation cluster or a cloud system. The HID 34 and MON 35 for controlling execution of the computer program may be comprised by a different data processing system, like a terminal communicatively connected to the data processing system 30 (e.g. cloud system).

    [0093] Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations exist. It should be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration in any way. Rather, the foregoing summary and detailed description will provide those skilled in the art with a convenient road map for implementing at least one exemplary embodiment, it being understood that various changes may be made in the function and arrangement of elements described in an exemplary embodiment without departing from the scope as set forth in the appended claims and their legal equivalents. Generally, this application is intended to cover any adaptations or variations of the specific embodiments discussed herein.

    [0094] In the foregoing detailed description, various features are grouped together in one or more examples for the purpose of streamlining the disclosure. It is understood that the above description is intended to be illustrative, and not restrictive. It is intended to cover all alternatives, modifications and equivalents as may be included within the scope of the invention. Many other examples will be apparent to one skilled in the art upon reviewing the above specification.

    [0095] Specific nomenclature used in the foregoing specification is used to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art in light of the specification provided herein that the specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the present invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. Throughout the specification, the terms including and in which are used as the plain-English equivalents of the respective terms comprising and wherein, respectively. Moreover, the terms first, second, and third, etc., are used merely as labels, and are not intended to impose numerical requirements on or to establish a certain ranking of importance of their objects. In the context of the present description and claims the conjunction or is to be understood as including (and/or) and not exclusive (either . . . or).