DEVICE AND METHOD FOR DETERMINING SAFE ACTIONS TO BE EXECUTED BY A TECHNICAL SYSTEM

20230281511 · 2023-09-07


    Abstract

    A computer-implemented method for training a machine learning system. The machine learning system is configured to determine a control signal characterizing an action to be executed by a technical system. The method includes obtaining a safe action to be executed by the technical system including: obtaining a state signal; determining, by a parametrized policy module of the machine learning system, a distribution of potentially unsafe actions that could be executed by the technical system; sampling a potentially unsafe action from the distribution; obtaining, by a safety module of the machine learning system, the safe action. The method further includes determining a loss value based on the state signal and the safe action; and training the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters.

    Claims

    1. A computer-implemented method for training a machine learning system, wherein the machine learning system is configured to determine a control signal characterizing an action to be executed by a technical system, wherein the method for training comprises the following steps: obtaining a safe action to be executed by the technical system, including: obtaining a state signal, wherein the state signal characterizes a state of an environment; determining, by a parametrized policy module of the machine learning system, a distribution of potentially unsafe actions that could be executed by the technical system, wherein the policy module determines the distribution based on the obtained state signal; sampling a potentially unsafe action from the distribution; obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action and a set of safe actions with respect to a current environment of the technical system; determining a loss value based on the state signal and the safe action, wherein the loss value characterizes a reward obtained based on the safe action; training the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters.

    2. The method according to claim 1, wherein the obtaining of the safe action by the safety module includes mapping the potentially unsafe action to an action from the set of safe actions when the potentially unsafe action is not in the set of safe actions, wherein the mapping is performed using a piecewise diffeomorphism.

    3. The method according to claim 2, wherein the mapping of the potentially unsafe action to an action from the set of safe actions includes: determining a countable partition of the space of actions; determining, for each set of the countable partition, whether the set is a safe set or an unsafe set, wherein a set is determined as a safe set when the set includes only actions from the set of safe actions and when there exists a trajectory of actions for future states that includes only safe actions, and wherein a set is determined as an unsafe set otherwise; when the potentially unsafe action is in an unsafe set: determining a safe set from the partition based on the distribution of the potentially unsafe actions; mapping the potentially unsafe action to an action from the safe set; providing the action as the safe action; otherwise, when the potentially unsafe action is not in an unsafe set, providing the potentially unsafe action as the safe action.

    4. The method according to claim 3, wherein the determining of the safe set includes determining, for each safe set in the partition, a probability density of a representative action of the safe set of the partition with respect to the distribution of potentially unsafe actions, wherein the safe set including the representative action with the highest probability density value is provided as the determined safe set.

    5. The method according to claim 3, wherein the determining of the safe set includes determining, for each safe set in the partition, a probability density of a representative action of the safe set of the partition with respect to the distribution of potentially unsafe actions, wherein the safe set is sampled based on the determined probability densities and the sampled safe set is provided as the determined safe set.

    6. The method according to claim 3, wherein the safe set is determined by choosing the set from the partition that is deemed safe and has a minimal distance to the potentially unsafe action.

    7. The method according to claim 3, wherein the mapping of the potentially unsafe action to an action from the safe set and the providing of the action as the safe action includes determining a relative position of the potentially unsafe action in the unsafe set and providing the action at the relative position in the safe set as the safe action.

    8. The method according to claim 3, wherein the mapping of the potentially unsafe action to an action from the safe set and the providing of the action as the safe action includes determining an action from the safe set that has a minimal distance to the potentially unsafe action and providing the action as the safe action.

    9. The method according to claim 1, wherein the loss value is determined by a discriminator, and training the machine learning system includes training the policy module and the discriminator according to generative adversarial imitation learning.

    10. A computer-implemented method for determining a control signal for controlling an actuator of a technical system, the method comprising the following steps: training a machine learning system, wherein the machine learning system is configured to determine a control signal characterizing an action to be executed by a technical system, wherein the training includes: obtaining a safe action to be executed by the technical system, including: obtaining a state signal, wherein the state signal characterizes a state of an environment, determining, by a parametrized policy module of the machine learning system, a distribution of potentially unsafe actions that could be executed by the technical system, wherein the policy module determines the distribution based on the obtained state signal, sampling a potentially unsafe action from the distribution, obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action and a set of safe actions with respect to a current environment of the technical system, determining a loss value based on the state signal and the safe action, wherein the loss value characterizes a reward obtained based on the safe action, training the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters; and determining the control signal using the trained machine learning system and based on a state signal of an environment.

    11. A machine learning system configured to determine a control signal characterizing an action to be executed by a technical system, wherein the machine learning system is trained by: obtaining a safe action to be executed by the technical system, including: obtaining a state signal, wherein the state signal characterizes a state of an environment; determining, by a parametrized policy module of the machine learning system, a distribution of potentially unsafe actions that could be executed by the technical system, wherein the policy module determines the distribution based on the obtained state signal; sampling a potentially unsafe action from the distribution; obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action and a set of safe actions with respect to a current environment of the technical system; determining a loss value based on the state signal and the safe action, wherein the loss value characterizes a reward obtained based on the safe action; training the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters.

    12. The method according to claim 1, wherein the policy module is trained according to a reinforcement learning paradigm or an imitation learning paradigm, wherein during inference of the machine learning system, potentially unsafe actions provided by the policy module are mapped, by the safety module of the machine learning system, to safe actions.

    13. A training system configured to train a machine learning system, wherein the machine learning system is configured to determine a control signal characterizing an action to be executed by a technical system, the training system configured to: obtain a safe action to be executed by the technical system, including: obtaining a state signal, wherein the state signal characterizes a state of an environment; determining, by a parametrized policy module of the machine learning system, a distribution of potentially unsafe actions that could be executed by the technical system, wherein the policy module determines the distribution based on the obtained state signal; sampling a potentially unsafe action from the distribution; obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action and a set of safe actions with respect to a current environment of the technical system; determine a loss value based on the state signal and the safe action, wherein the loss value characterizes a reward obtained based on the safe action; train the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters.

    14. A non-transitory machine-readable storage medium on which is stored a computer program for training a machine learning system, wherein the machine learning system is configured to determine a control signal characterizing an action to be executed by a technical system, the computer program, when executed by a processor, causing the processor to perform the following steps: obtaining a safe action to be executed by the technical system, including: obtaining a state signal, wherein the state signal characterizes a state of an environment; determining, by a parametrized policy module of the machine learning system, a distribution of potentially unsafe actions that could be executed by the technical system, wherein the policy module determines the distribution based on the obtained state signal; sampling a potentially unsafe action from the distribution; obtaining, by a safety module of the machine learning system, the safe action, wherein the safe action is obtained based on the sampled potentially unsafe action and a set of safe actions with respect to a current environment of the technical system; determining a loss value based on the state signal and the safe action, wherein the loss value characterizes a reward obtained based on the safe action; training the machine learning system by updating parameters of the policy module according to a gradient of the loss value with respect to the parameters.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0073] FIG. 1 shows a machine learning system, according to an example embodiment of the present invention.

    [0074] FIG. 2 shows a diagram depicting steps of a method for training the machine learning system, according to an example embodiment of the present invention.

    [0075] FIG. 3 exemplarily shows a mapping of a potentially unsafe action to a safe action, according to an example embodiment of the present invention.

    [0076] FIG. 4 shows a control system comprising a machine learning system controlling an actuator in its environment, according to an example embodiment of the present invention.

    [0077] FIG. 5 shows the control system controlling an at least partially autonomous robot, according to an example embodiment of the present invention.

    [0078] FIG. 6 shows the control system controlling a manufacturing machine, according to an example embodiment of the present invention.

    [0079] FIG. 7 shows a training system for training the machine learning system, according to an example embodiment of the present invention.

    DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

    [0080] FIG. 1 shows a machine learning system (60) for determining a safe action (ā), wherein the safe action (ā) is used for controlling a technical system. The machine learning system (60) determines the safe action (ā) based on a state signal (s) provided to the machine learning system (60). The state signal (s) is processed by a parametrized policy module (61) of the machine learning system, wherein the policy module (61) is configured to provide a probability distribution for an action to be performed by the technical system. The policy module (61) may preferably comprise or be a conditional generative model using the state signal (s) as condition. Preferably, the generative model may be a conditional normalizing flow or a conditional Gaussian model, e.g., a conditional Gaussian mixture model.

    [0081] A potentially unsafe action (â) may then be sampled from the policy module (61), wherein the potentially unsafe action (â) is then processed by a safety module (62) of the machine learning system (60). The safety module (62) is configured to map the potentially unsafe action (â) to the safe action (ā) if the safety module (62) deems the potentially unsafe action (â) to actually be unsafe. The safety module (62) determines the safety of the potentially unsafe action (â) based on a provided set of safe actions (Ā) that may be safely executed by the technical system in the environment. If the potentially unsafe action (â) is determined to be unsafe, the safety module (62) performs a mapping by means of a piecewise diffeomorphism from the unsafe action (â) to the safe action (ā). The determined safe action (ā) is then output by the machine learning system (60).
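    The forward pass described above can be sketched as follows. This is only an illustrative sketch: the linear-Gaussian policy, the box-shaped safe set, and the clipping fallback are assumptions introduced here, not the patent's piecewise diffeomorphism.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_sample(state, weights, sigma=0.1):
    """Illustrative conditional Gaussian policy module: the mean of the
    action distribution depends linearly on the state signal."""
    mean = weights @ state
    return mean + sigma * rng.normal(size=mean.shape)

def safety_module(a_hat, safe_low, safe_high):
    """Placeholder safety module: pass the action through if it lies in a
    box-shaped safe set, otherwise map it into the set.  Clipping stands in
    here for the patent's piecewise diffeomorphism."""
    if np.all((a_hat >= safe_low) & (a_hat <= safe_high)):
        return a_hat                                 # already safe
    return np.clip(a_hat, safe_low, safe_high)       # mapped into the safe set

def machine_learning_system(state, weights, safe_low, safe_high):
    a_hat = policy_sample(state, weights)            # potentially unsafe action
    return safety_module(a_hat, safe_low, safe_high) # safe action output
```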

    [0082] FIG. 2 shows a flow chart of a method (100) for training the machine learning system (60). The method starts with a first step (101), wherein in the first step (101) a state signal (s) is determined from the environment of the technical system.

    [0083] In a second step (102), the policy module (61) of the machine learning system (60) then determines the probability distribution for actions from the preferably continuous action space.

    [0084] In a third step (103) a potentially unsafe action (â) is sampled from the probability distribution.

    [0085] In a fourth step (104), the safety module (62) of the machine learning system (60) obtains a safe action (ā) based on the potentially unsafe action (â) by means of the diffeomorphism.

    [0086] Steps one (101) to four (104) may preferably be repeated in order to determine a trajectory of state signals (s) and safe actions (ā). The trajectory may then be used in a fifth step (105) of the method (100) for determining a loss value with respect to the actions. The loss value may preferably characterize a desired goal to be achieved. For example, the loss value may characterize an expected return. Preferably, the loss value is determined according to the framework of generative adversarial imitation learning, i.e., by comparing the determined trajectory to trajectories determined by an expert, wherein the comparison is performed by a discriminator.
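    As a minimal illustration of a discriminator-based loss in the generative adversarial imitation learning framework (the discriminator architecture and the exact surrogate are not specified by the text; the choice below is one common convention, introduced here only as an assumption), the policy is rewarded when the discriminator mistakes its state-action pairs for the expert's:

```python
import numpy as np

def gail_reward(disc_logit):
    """Surrogate reward from a discriminator logit for a state-action pair:
    larger when the pair looks expert-like to the discriminator.
    -log(1 - D) is one common GAIL choice; the small epsilon avoids log(0)."""
    d = 1.0 / (1.0 + np.exp(-disc_logit))  # sigmoid -> probability "expert"
    return -np.log(1.0 - d + 1e-8)
```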

    [0087] In a sixth step (106), parameters of the policy module (61) are then updated. Preferably, this is achieved by means of gradient descent, wherein a gradient of the loss value with respect to parameters of the policy module (61) is determined.

    [0088] Preferably, steps one (101) to six (106) are repeated iteratively until a desired number of iterations is reached and/or until the loss value or a loss value with respect to a validation set is equal to or below a predefined threshold. If one of the described exit criteria is met, the method (100) ends.
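    Steps one to six can be made concrete with a deliberately small toy instance. All details here are illustrative assumptions rather than the patent's method: a 1-D Gaussian policy with a learnable mean, a clipping safety module with safe set [-1, 1], a reward peaking at a = 0.5, and a pathwise (reparameterization) gradient.

```python
import numpy as np

def train_toy_policy(n_iters=500, lr=0.1, sigma=0.2, seed=0):
    """Toy run of steps 101-106: sample a potentially unsafe action from a
    1-D Gaussian policy, map it into the safe set [-1, 1] by clipping,
    score it with a loss (a_bar - 0.5)^2, and update the policy mean with
    a gradient step through the safety mapping."""
    rng = np.random.default_rng(seed)
    mu = -0.8                                     # policy parameter
    for _ in range(n_iters):
        eps = rng.normal()
        a_hat = mu + sigma * eps                  # steps 101-103: sample action
        a_bar = float(np.clip(a_hat, -1.0, 1.0))  # step 104: safety module
        # steps 105-106: gradient of the loss through the clip; the
        # gradient vanishes wherever the clip saturates
        d_abar = 1.0 if -1.0 < a_hat < 1.0 else 0.0
        mu -= lr * 2.0 * (a_bar - 0.5) * d_abar
    return mu
```

    The vanishing gradient at saturated clips in this toy hints at one reason an invertible, almost-everywhere differentiable piecewise diffeomorphism is an attractive safety layer for gradient-based training, compared with a simple projection.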

    [0089] FIG. 3 depicts the fourth step (104) of the method (100) for training in more detail. The action space is partitioned into a partition (M), wherein the partition elements are boxes. The figure depicts an embodiment of a 2-dimensional action space. Preferably, the boxes are chosen to be squares, wherein an edge length of a box may be considered a hyperparameter of the method (100). It should be noted that the partition does not need to span the entire possible action space. For example, it is also possible that prior information allows for partitioning only a subspace of the action space.

    [0090] In general, the shape of a box (e.g., geometric figure, length of sides, number of points in a polygon defining partition elements) may be considered a hyperparameter of the method (100). The partition elements, i.e., the different subsets of the action space, may then be categorized as either safe sets (k) (indicated by shaded squares in the figure) or unsafe sets (u) (indicated by white squares in the figure). Determining whether a partition element (i.e., a subset of the action space) is safe or not may be achieved by means of determining a worst-case safety cost w.sub.t(s.sub.t,a) as described earlier. For example, an action at the center of a box may be used to infer whether all actions in the box are safe and whether there exists a future trajectory of only safe actions.

    [0091] In the embodiment depicted in FIG. 3, the potentially unsafe action (â) is determined to fall into an unsafe region of the action space (i.e., it is part of an unsafe set (u) from the partition (M)). The potentially unsafe action (â) is hence mapped into a safe set (k). The safe set (k) may be determined by selecting the partition element of the partition (M) that is closest to the potentially unsafe action (â) in terms of a distance measure on the action space, e.g., an L.sub.p-norm. Alternatively, it is also possible to determine a density for actions acting as representatives of the respective partition elements, e.g., the action at the center of the respective box. For example, for each partition element that is determined as safe, a density of the respective center action may be determined based on the density determined from the policy module (61), and the partition element with the highest density may be chosen as the safe set (k).
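    The two selection rules described above (nearest safe box, or the safe box whose representative action has the highest policy density) can be sketched as follows; the array layout and function names are illustrative assumptions.

```python
import numpy as np

def select_safe_box(a_hat, boxes, safe_mask, density=None):
    """boxes: array of shape (n, 2, d) holding the (low, high) corners of
    each partition element; safe_mask: boolean array marking the safe sets.
    Returns the index of the chosen safe box: the one whose center has the
    highest density under the policy if `density` is given, otherwise the
    one whose center is nearest to the potentially unsafe action."""
    centers = boxes.mean(axis=1)                  # representative actions
    safe_idx = np.flatnonzero(safe_mask)
    if density is not None:                       # highest-density rule
        scores = np.array([density(centers[i]) for i in safe_idx])
        return int(safe_idx[np.argmax(scores)])
    dists = np.linalg.norm(centers[safe_idx] - a_hat, axis=1)
    return int(safe_idx[np.argmin(dists)])        # nearest-box rule
```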

    [0092] In the embodiment, mapping the potentially unsafe action (â) to the safe action (ā) is then achieved by determining a relative position of the potentially unsafe action (â) in the unsafe set (u) along the horizontal and vertical axes and providing, as the safe action (ā), the action that has the same relative position along the horizontal and vertical axes in the safe set (k).
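    For axis-aligned boxes, this relative-position mapping is an affine and invertible map per box (hence a piecewise diffeomorphism overall). A minimal sketch, with illustrative parameter names:

```python
import numpy as np

def map_relative_position(a_hat, unsafe_low, unsafe_high, safe_low, safe_high):
    """Map an action in the unsafe box to the action at the same relative
    position (per axis) in the safe box.  Each per-box map is affine and
    invertible, so the overall mapping is a piecewise diffeomorphism."""
    rel = (a_hat - unsafe_low) / (unsafe_high - unsafe_low)  # in [0, 1]^d
    return safe_low + rel * (safe_high - safe_low)
```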

    [0093] FIG. 4 shows a control system (40) comprising the machine learning system (60) for determining a control signal (A) for controlling an actuator (10) of a technical system in its environment (20). The actuator (10) interacts with the control system (40). The actuator (10) and its environment (20) will be jointly called actuator system. At preferably evenly spaced points in time, a sensor (30) senses a condition of the actuator system. The sensor (30) may comprise several sensors. Preferably, the sensor (30) is an optical sensor that takes images of the environment (20). An output signal (S) of the sensor (30) (or in case the sensor (30) comprises a plurality of sensors, an output signal (S) for each of the sensors) which encodes the sensed condition is transmitted to the control system (40).

    [0094] Thereby, the control system (40) receives a stream of sensor signals (S). It then computes a series of control signals (A) depending on the stream of sensor signals (S), which are then transmitted to the actuator (10).

    [0095] The control system (40) receives the stream of sensor signals (S) of the sensor (30) in an optional receiving unit (50). The receiving unit (50) transforms the sensor signals (S) into state signals (s). Alternatively, in case of no receiving unit (50), each sensor signal (S) may directly be taken as a state signal (s).

    [0096] The state signal (s) is then passed on to a machine learning system (60).

    [0097] The machine learning system (60) is parametrized by parameters (Φ), which are stored in and provided by a parameter storage (St.sub.1).

    [0098] The machine learning system (60) determines a safe action (ā) from the state signal (s). The safe action (ā) is transmitted to an optional conversion unit (80), which converts the safe action (ā) into the control signals (A). The control signals (A) are then transmitted to the actuator (10) for controlling the actuator (10) accordingly. Alternatively, the safe action (ā) may already characterize a control signal (A) and may be transmitted to the actuator (10) directly.

    [0099] The actuator (10) receives control signals (A), is controlled accordingly, and carries out the safe action (ā) corresponding to the control signal (A). The actuator (10) may comprise a control logic which transforms the control signal (A) into a further control signal, which is then used to control actuator (10).

    [0100] In further embodiments, the control system (40) may comprise the sensor (30). In even further embodiments, the control system (40) alternatively or additionally may comprise an actuator (10).

    [0101] In still further embodiments, it can be envisioned that the control system (40) controls a display (10a) instead of or in addition to the actuator (10).

    [0102] Furthermore, the control system (40) may comprise at least one processor (45) and at least one machine-readable storage medium (46) on which instructions are stored which, if carried out, cause the control system (40) to carry out a method according to an aspect of the present invention.

    [0103] FIG. 5 shows an embodiment in which the control system (40) is used to control an at least partially autonomous robot, e.g., an at least partially autonomous vehicle (100).

    [0104] The sensor (30) may comprise one or more video sensors and/or one or more radar sensors and/or one or more ultrasonic sensors and/or one or more LiDAR sensors. Some or all of these sensors are preferably but not necessarily integrated in the vehicle (100). The state signal (s) derived from the sensor signal (S) may characterize information about the environment of the vehicle, e.g., curvature of the road the vehicle (100) currently travels along and/or information about distance to other traffic participants and/or immobile environment entities such as trees, houses, or traffic cones and/or information about lanes or lane markings of the road. Alternatively, the state signal (s) may characterize an image of the environment.

    [0105] The machine learning system (60) may be configured to determine an action to be executed by the vehicle (100), e.g., a longitudinal and/or lateral acceleration. The action may be chosen by the machine learning system (60) such that the vehicle (100) follows a predefined path while not colliding with other elements of its environment, e.g., road participants. As a fail-safe action, the action determined by the machine learning system (60) may characterize an emergency brake and/or an emergency evasive steering maneuver and/or a lane switch into an emergency lane.

    [0106] The actuator (10), which is preferably integrated in the vehicle (100), may be given by a brake, a propulsion system, an engine, a drivetrain, or a steering of the vehicle (100).

    [0107] Alternatively or additionally, the control signal (A) may also be used to control the display (10a), e.g., for displaying the safe action (ā) determined by the machine learning system (60) and/or for displaying the partition of safe actions.

    [0108] In further embodiments, the at least partially autonomous robot may be given by another mobile robot (not shown), which may, for example, move by flying, swimming, diving, or stepping. The mobile robot may, inter alia, be an at least partially autonomous lawn mower, or an at least partially autonomous cleaning robot. In all of the above embodiments, the control signal (A) may be determined such that propulsion unit and/or steering and/or brake of the mobile robot are controlled such that the mobile robot may avoid collisions with said identified objects.

    [0109] In a further embodiment, the at least partially autonomous robot may be given by a gardening robot (not shown), which uses the sensor (30), preferably an optical sensor, to determine a state of plants in the environment (20). The actuator (10) may control a nozzle for spraying liquids and/or a cutting device, e.g., a blade. Depending on an identified species and/or an identified state of the plants, a control signal (A) may be determined to cause the actuator (10) to spray the plants with a suitable quantity of suitable liquids and/or cut the plants. In the embodiment, the safe action (ā) may characterize a desired nozzle opening.

    [0110] In even further embodiments, the at least partially autonomous robot may be given by a domestic appliance (not shown), e.g., a washing machine, a stove, an oven, a microwave, or a dishwasher. The sensor (30), e.g., an optical sensor, may detect a state of an object which is to undergo processing by the domestic appliance. For example, in the case of the domestic appliance being a washing machine, the sensor (30) may detect a state of the laundry inside the washing machine. The control signal (A) may then be determined depending on a detected material of the laundry.

    [0111] FIG. 6 shows an embodiment in which the control system (40) is used to control a manufacturing machine (11), e.g., a punch cutter, a cutter, a gun drill or a gripper, of a manufacturing system (200), e.g., as part of a production line. The manufacturing machine may comprise a transportation device, e.g., a conveyor belt or an assembly line, which moves a manufactured product (12). The control system (40) controls an actuator (10), which in turn controls the manufacturing machine (11).

    [0112] The sensor (30) may be given by an optical sensor which captures properties of, e.g., a manufactured product (12).

    [0113] The machine learning system (60) may determine a position of the manufactured product (12) with respect to the transportation device. The actuator (10) may then be controlled depending on the determined position of the manufactured product (12) for a subsequent manufacturing step of the manufactured product (12). For example, the actuator (10) may be controlled to cut the manufactured product at a specific location of the manufactured product itself. Alternatively, it may be envisioned that the machine learning system (60) classifies whether the manufactured product is broken or exhibits a defect. The actuator (10) may then be controlled so as to remove the manufactured product from the transportation device.

    [0114] FIG. 7 shows an embodiment of a training system (140) for training the machine learning system (60) of the control system (40) by means of a training data set (T). The training data set (T) comprises a plurality of state signals (x.sub.i) which are used for training the machine learning system (60).

    [0115] For training, a training data unit (150) accesses a computer-implemented database (St.sub.2), the database (St.sub.2) providing the training data set (T). The training data unit (150) determines from the training data set (T) preferably randomly at least one state signal (x.sub.i) and transmits the state signal (x.sub.i) to the machine learning system (60). The machine learning system (60) determines a safe action (y.sub.i) based on the state signal (x.sub.i). The determined safe action (y.sub.i) is transmitted to a modification unit (180).

    [0116] Based on the determined safe action (y.sub.i), the modification unit (180) then determines new parameters (Φ′) for the machine learning system (60). This may be achieved according to conventional reinforcement learning methods such as vanilla policy gradients, trust region policy optimization, proximal policy optimization, deep deterministic policy gradients, or actor-critic methods. Preferably, the new parameters may be determined according to the method of generative adversarial imitation learning.

    [0117] The modification unit (180) determines the new parameters (Φ′) based on a loss value. In the given embodiment, this is done using a gradient-based optimization method, preferably stochastic gradient descent, Adam, or AdamW. In further embodiments, training may also be based on an evolutionary algorithm or a second-order method for training neural networks.

    [0118] In other preferred embodiments, the described training is repeated iteratively for a predefined number of iteration steps or repeated iteratively until the loss value falls below a predefined threshold value. Alternatively or additionally, it is also possible that training is terminated when an average loss value with respect to a test or validation data set falls below a predefined threshold value. In at least one of the iterations the new parameters (Φ′) determined in a previous iteration are used as parameters (Φ) of the machine learning system (60) for a further iteration.

    [0119] Furthermore, the training system (140) may comprise at least one processor (145) and at least one machine-readable storage medium (146) containing instructions which, when executed by the processor (145), cause the training system (140) to execute a training method according to one of the aspects of the present invention.

    [0120] The term “computer” may be understood as covering any devices for the processing of pre-defined calculation rules. These calculation rules can be in the form of software, hardware or a mixture of software and hardware.

    [0121] In general, a plurality can be understood to be indexed, that is, each element of the plurality is assigned a unique index, preferably by assigning consecutive integers to the elements contained in the plurality. Preferably, if a plurality comprises N elements, wherein N is the number of elements in the plurality, the elements are assigned the integers from 1 to N. It may also be understood that elements of the plurality can be accessed by their index.