Method, device and computer program for producing a strategy for a robot
11628562 · 2023-04-18
Assignee
Inventors
- Frank Hutter (Freiburg, DE)
- Lior Fuks (Saint-Louis, FR)
- Marius Lindauer (Lehrte, DE)
- Noor Awad (Freiburg, DE)
Cpc classification
B25J9/161
PERFORMING OPERATIONS; TRANSPORTING
B25J9/163
PERFORMING OPERATIONS; TRANSPORTING
International classification
Abstract
A method for producing a strategy for a robot. The method includes the following steps: initializing the strategy and an episode length; repeatedly executing a loop including the following steps: producing a plurality of further strategies as a function of the strategy; applying the plurality of further strategies for the episode length; ascertaining, for each further strategy, a cumulative reward obtained when applying that strategy; and updating the strategy as a function of a second plurality of the further strategies that obtained the greatest cumulative rewards. After each execution of the loop, the episode length is increased. A computer program, a device for carrying out the method, and a machine-readable memory element on which the computer program is stored are also described.
Claims
1. A method of training a neural network, comprising: producing parameters of the neural network representing a strategy to control a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the producing including: initializing the strategy and an episode length; repeatedly executing a loop including the steps: producing a plurality of further strategies as a function of the strategy; applying the plurality of the further strategies for a respective at least one episode having the episode length; ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy; updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards; wherein the episode length is increased following each execution of the loop; and storing the parameters representing the strategy in a memory connected to the neural network.
2. The method as recited in claim 1, wherein a time budget is initialized, the loop being executed only for as long as time remains in the time budget.
3. The method as recited in claim 2, wherein the time budget is increased following every execution of the loop.
4. The method as recited in claim 1, wherein the episode length is initially set to a value smaller than an expected number of actions for reaching the specifiable goal.
5. The method as recited in claim 4, wherein the expected number of actions is ascertained by a Monte Carlo simulation.
6. The method as recited in claim 1, wherein the further strategies are sorted in descending order according to the respective cumulative reward and are respectively weighted using a second specifiable value assigned to a respective position in the order.
7. The method as recited in claim 1, wherein a current state of the robot, and/or a current state of an environment of the robot is detected using a sensor, a control variable being provided for an actuator of the robot, as a function of the sensor value using the updated strategy.
8. A method, comprising: producing parameters of a neural network representing a strategy for controlling a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the parameters of the neural network being produced by: initializing the strategy and an episode length; repeatedly executing a loop including the steps: producing a plurality of further strategies as a function of the strategy; applying the plurality of the further strategies for a respective at least one episode having the episode length; ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy; updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards; wherein the episode length is increased following each execution of the loop; storing the parameters representing the strategy in a memory connected to the neural network; and operating the robot using the neural network to activate an actuator of the robot to provide an action corresponding to the produced strategy as a function of a current state of the robot and/or a current state of an environment of the robot sensed by a sensor and provided to the neural network, the produced strategy being implemented by the neural network in that the neural network provides the action corresponding to the produced strategy from a state provided to the neural network.
9. A non-transitory machine-readable memory element on which is stored a computer program, which when executed by a computer causes the computer to perform a method of training a neural network, the method comprising: producing parameters of the neural network representing a strategy to control a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the producing including: initializing the strategy and an episode length; repeatedly executing a loop including the steps: producing a plurality of further strategies as a function of the strategy; applying the plurality of the further strategies for a respective at least one episode having the episode length; ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy; updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards; wherein the episode length is increased following each execution of the loop; and storing the parameters representing the strategy in a memory connected to the neural network.
10. A device, the device comprising: a processing unit configured to execute computer program instructions to control a method of training a neural network, the method including: producing parameters of the neural network representing a strategy to control a robot so that a specifiable goal is reached when the robot performs actions based on the strategy, depending on a respective situation, the producing including: initializing the strategy and an episode length; repeatedly executing a loop including: producing a plurality of further strategies as a function of the strategy; applying the plurality of the further strategies for a respective at least one episode having the episode length; ascertaining, for each of the further strategies, a respective cumulative reward which is obtained when applying the respective further strategy; and updating the strategy as a function of a specifiable number of the further strategies that obtained the greatest respective cumulative rewards; wherein the episode length is increased following each execution of the loop; and storing the parameters representing the strategy in a memory connected to the neural network.
11. The method as recited in claim 8, wherein a time budget is initialized, the loop being executed only for as long as time remains in the time budget.
12. The method as recited in claim 11, wherein the time budget is increased following every execution of the loop.
13. The method as recited in claim 8, wherein the episode length is initially set to a value smaller than an expected number of actions for reaching the specifiable goal.
14. The method as recited in claim 13, wherein the expected number of actions is ascertained by a Monte Carlo simulation.
15. The method as recited in claim 8, wherein the further strategies are sorted in descending order according to the respective cumulative reward and are respectively weighted using a second specifiable value assigned to a respective position in the order.
16. The method as recited in claim 8, wherein the robot includes at least one of: an at least partially autonomous vehicle, a drone, a production tool, or a machine tool.
17. The method as recited in claim 1, wherein the robot includes at least one of: an at least partially autonomous vehicle, a drone, a production tool, or a machine tool.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
(5)
(6) The robot (10) further comprises a processing unit (17) and a machine-readable memory element (18). A computer program may be stored on the memory element (18), comprising commands which, when executed on the processing unit (17), prompt the processing unit (17) to operate the robot (10).
(7) It should be noted that the robot (10) may also be an at least partially autonomous vehicle, a drone or a production/machine tool.
(8)
(9) At the beginning of the pseudocode, it is necessary to specify an initial strategy θ_0, a time budget T, a maximum episode length E, a population variable λ, a parent population variable μ, a mutation step variable σ, and a cumulative reward function F(⋅). The initial strategy θ_0 is preferably a variable comprising the parameters of the neural network. The initial strategy may be initialized randomly.
(10) In lines 1 and 2 of the pseudocode, a first loop is executed over the parent population variable μ in order to ascertain the constants w_j.
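The pseudocode does not specify how the constants w_j are computed; a common choice in evolution strategies is log-rank weighting, sketched here purely as an assumption:

```python
import math

def rank_weights(mu):
    """Log-rank weights for the top-mu strategies, normalized to sum to 1.

    The exact formula is an assumption (a CMA-ES-style choice); the
    description only states that each rank position j is assigned a
    constant w_j used to weight the corresponding strategy.
    """
    raw = [math.log(mu + 0.5) - math.log(j + 1) for j in range(mu)]
    total = sum(raw)
    return [r / total for r in raw]
```

With this choice, higher-ranked strategies receive strictly larger weights, so the best mutations dominate the update.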
(11) Subsequently, the strategy is optimized by a second loop in lines 4 through 11.
(12) The second loop is executed until the time budget T is depleted. In the second loop, the initialized strategy θ_0 is mutated, e.g., by applying random noise. Thereupon, in line 7, the performance of the mutated strategies is evaluated using the cumulative reward function F. The cumulative reward function F may be a cumulative reward over an episode having the episode length E.
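The evaluation in line 7 can be sketched as follows; the environment interface (reset/step) is a hypothetical gym-style stand-in, not part of the description:

```python
def cumulative_reward(policy, env, episode_length):
    """Sum of rewards obtained when 'policy' acts in 'env' for at most
    'episode_length' steps (the function F in the pseudocode).

    'env' is a hypothetical gym-style environment exposing reset() and
    step(action) -> (state, reward, done); 'policy' maps a state to an
    action. Both interfaces are assumptions for illustration.
    """
    state = env.reset()
    total = 0.0
    for _ in range(episode_length):
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total
```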
(13) In line 9, the strategies are then arranged in descending order according to their obtained cumulative reward s_i. In the subsequent line 10, the strategy is updated as a function of the top μ strategies, which are respectively weighted with the constants w_j.
(14) The updated strategy may thereupon be output as the final strategy, or be used in order to execute the second loop anew. The renewed execution of the second loop may be repeated as often as necessary until a specifiable abort criterion is fulfilled. The specifiable abort criterion may be, for example, that a change of the strategy is smaller than a specifiable threshold value.
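The second loop of paragraphs (12) through (14) can be sketched as follows. The Gaussian mutation, the seconds-based time budget, and the concrete weight values are stand-ins chosen for illustration; the description only fixes the loop structure (mutate λ copies, evaluate with F, sort, update from the top μ):

```python
import time

import numpy as np

def evolution_strategy(theta0, F, time_budget, lam, mu, sigma, weights, rng=None):
    """One run of the second loop of the pseudocode.

    Mutates lam copies of the current strategy with Gaussian noise
    (an assumed mutation operator), evaluates each with the cumulative
    reward function F, sorts in descending order of reward, and updates
    the strategy as a weighted combination of the top-mu mutations.
    Runs until the time budget (here: seconds) is depleted.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    start = time.time()
    while time.time() - start < time_budget:
        # Mutate the current strategy with Gaussian noise of step size sigma.
        noise = rng.standard_normal((lam, theta.size))
        candidates = theta + sigma * noise
        # Evaluate the cumulative reward of each mutated strategy (line 7).
        rewards = np.array([F(c) for c in candidates])
        # Sort strategies in descending order of reward (line 9).
        order = np.argsort(rewards)[::-1]
        # Weighted update from the top-mu strategies (line 10).
        top = candidates[order[:mu]]
        theta = np.sum(np.asarray(weights)[:, None] * top, axis=0)
    return theta
```

As a sanity check, maximizing F(θ) = −‖θ‖² from a distant start should drive the strategy toward the origin.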
(15)
(16) For this purpose, an episode scheduler, a time scheduler and a number of iterations N are initially provided.
(17) In line 1 of the second pseudoalgorithm, the strategy θ_0 is initialized by sampling from a normal distribution. Subsequently, a loop over the number of iterations N is executed from line 2 through line 6. First, the maximum episode length E is ascertained by the episode scheduler and, optionally, the maximum time budget T is ascertained by the time scheduler, in each case as a function of the current iteration n. Subsequently, the method ES is carried out using these two ascertained variables E and/or T.
(18) Following each executed loop pass, the episode scheduler may double the episode length E: E(n) = 2^n·E(0). The initial episode length E(0) may be a value smaller than an expected number of steps required for reaching the goal, for example the expected number of steps divided by a specifiable value such as 2. Alternatively, the initial episode length E(0) may be ascertained by a Monte Carlo simulation.
(19) The time scheduler may increase the time budget T incrementally with the increasing number of executed loop passes, for example: T(n) = 2^n·κ. The value κ may correspond to 20 minutes, for example. Alternatively, the time scheduler may keep the time budget T constant for every loop pass; T may then equal 1 hour, for example.
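The outer loop of paragraphs (17) through (19) can be sketched as follows; the ES method itself is abstracted as a callback, and the doubling schedules follow the formulas E(n) = 2^n·E(0) and T(n) = 2^n·κ given above:

```python
def episode_schedule(n, e0):
    """Doubling episode scheduler: E(n) = 2**n * E(0)."""
    return (2 ** n) * e0

def time_schedule(n, kappa):
    """Incremental time scheduler: T(n) = 2**n * kappa."""
    return (2 ** n) * kappa

def scheduled_training(theta0, es_method, num_iterations, e0, kappa):
    """Outer loop of the second pseudoalgorithm.

    For each iteration n, the episode length E and the time budget T are
    obtained from the schedulers and passed to the ES method.
    'es_method(theta, E, T)' is a stand-in for the method of the first
    pseudocode; it is assumed to return the updated strategy.
    """
    theta = theta0
    for n in range(num_iterations):
        e = episode_schedule(n, e0)
        t = time_schedule(n, kappa)
        theta = es_method(theta, e, t)
    return theta
```

Because E doubles per pass, early passes train cheaply on short episodes, and the learned strategy seeds the longer, more expensive passes.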
(20) The advantage of the episode scheduler and/or of the time scheduler is that a strategy is first learned in short episodes and is subsequently used to solve more complex tasks more effectively in longer episodes, since the knowledge of the strategy learned in the short episodes can be reused for solving the longer episodes. The advantage of the time scheduler is that an available total time budget may be efficiently divided into partial times for the individual episode lengths.
(21)