INTERACTING WITH AN UNSAFE PHYSICAL ENVIRONMENT
20210187734 · 2021-06-24
Assignee
Inventors
- David Reeb (Renningen, DE)
- Jeremy Zieg Kolter (Pittsburgh, PA, US)
- Melrose Roderick (Pittsburgh, PA, US)
- Vaishnavh Nagarajan (Pittsburgh, PA, US)
CPC classification
B25J9/1666
PERFORMING OPERATIONS; TRANSPORTING
B25J9/163
PERFORMING OPERATIONS; TRANSPORTING
International classification
Abstract
A computer-implemented method of configuring a system which interacts with a physical environment. An action of the system in a state of the physical environment results in an updated state of the physical environment according to a transition probability. A safe set of state-action pairs known to be safely performable and an unsafe set of state-action pairs to be avoided are indicated. During an environment interaction, a safe set of state-action pairs is updated by estimating a transition probability for a state-action pair based on an empirical transition probability of a similar other state-action pair, and including the state-action pair in the safe set of state-action pairs only if the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability.
Claims
1. A computer-implemented method of configuring a system which interacts with a physical environment, wherein an action of the system in a state of the physical environment results in an updated state of the physical environment according to a transition probability, the method comprising the following steps: accessing data indicating a safe set of state-action pairs known to be safely performable and data indicating an unsafe set of state-action pairs to be avoided when interacting with the physical environment; while the system interacts with the physical environment, maintaining empirical transition probabilities of state-action pairs resulting in updated states; and iteratively controlling an interaction with the physical environment by, in an iteration: obtaining data indicating a current state of the physical environment; updating the safe set of state-action pairs, including: estimating an estimated transition probability for each state-action pair of the state action pairs resulting in the updated states based on an empirical transition probability of a similar other state-action pair, and including the state-action pair in the safe set of state-action pairs when the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability; selecting an action to be performed in a current state of the physical environment from the safe set of state-action pairs; and providing the action to be performed to the system.
2. The method according to claim 1, wherein the system interacts with the physical environment according to a reward function, the method further comprising, in an iteration: determining a goal-oriented action to be performed in the current state of the physical environment based on the reward function and selecting the action only when the action in the current state of the physical environment is included in the safe set of state-action pairs.
3. The method according to claim 2, further comprising: determining a set of goal state-action pairs reachable by performing goal-oriented actions, and performing the goal-oriented action only when each goal state-action pair of the set of goal state-action pairs is included in the safe set of state-action pairs.
4. The method according to claim 2, further comprising: selecting a similar state-action pair that is similar to a goal-oriented state-action pair not included in the safe set, and estimating transition probabilities for the goal-oriented state-action pair based on empirical transition probabilities of the similar state-action pair.
5. The method according to claim 2, further comprising: selecting a return state-action pair for returning to the set of goal state-action pairs.
6. The method according to claim 1, further comprising: determining an action and raising an alert when the action is not included in the safe set of state-action pairs.
7. The method according to claim 1, further comprising: determining a similarity between the state-action pair and the other state-action pair by comparing only portions of respective states and/or actions relevant for transition probabilities.
8. The method according to claim 1, further comprising, in at least one iteration, estimating a transition probability for a first state-action pair for which no empirical transition probability is available and selecting the action from the first state-action pair to be performed.
9. The method according to claim 1, wherein the estimating of the estimated transition probability for the state-action pair includes: determining similarities between the state-action pair and one or more other state-action pairs for which empirical transition probabilities are available; selecting a most relevant other state-action pair based on at least the similarities; and determining the estimated transition probability for the state-action pair based on the empirical transition probabilities of the selected other state-action pair.
10. The method according to claim 9, further comprising: determining confidence intervals of the empirical transition probabilities of the one or more other state-action pairs, the most relevant other state-action pair being selected based additionally on the determined confidence intervals.
11. The method according to claim 1, wherein the controlling of the interaction with the physical environment is performed in a training phase, the method further comprising: controlling a further interaction with the physical environment in a use phase by repeatedly: obtaining the current state of the physical environment; selecting the action to be performed in the current state of the physical environment from the safe set of state-action pairs determined in the training phase; and providing the selected action to be performed to the system.
12. The method according to claim 1, wherein the data indicating the current state of the physical environment includes sensor data of a computer-controlled device, and the method further comprises determining control data for letting the computer-controlled device effect the selected action in the physical environment.
13. The method according to claim 12, wherein the physical environment includes objects to be avoided by the computer-controlled device, and wherein state-action pairs are defined as sufficiently similar regardless of the objects to be avoided.
14. A configuration system for configuring an interaction system which interacts with a physical environment, wherein an action of the interaction system in a state of the physical environment results in an updated state of the physical environment according to a transition probability, the configuration system comprising: a data interface for accessing data indicating a safe set of state-action pairs known to be safely performable and data indicating an unsafe set of state-action pairs to be avoided when interacting with the physical environment; a processor subsystem configured to, while the interaction system interacts with the physical environment, maintain empirical transition probabilities of state-action pairs resulting in updated states, and to iteratively control an interaction of the interaction system with the physical environment by, in an iteration: obtain, from the interaction system, data indicating a current state of the physical environment; update the safe set of state-action pairs, including: estimating an estimated transition probability for each state-action pair of the state-action pairs resulting in updated states based on an empirical transition probability of a similar other state-action pair, and including the state-action pair in the safe set of state-action pairs when the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability; select an action to be performed in the current state of the physical environment from the safe set of state-action pairs; providing the action to be performed to the interaction system.
15. A non-transitory computer-readable medium on which is stored instructions for configuring a system which interacts with a physical environment, wherein an action of the system in a state of the physical environment results in an updated state of the physical environment according to a transition probability, the instructions, when executed by a processor system, causing the processor system to perform the following steps: accessing data indicating a safe set of state-action pairs known to be safely performable and data indicating an unsafe set of state-action pairs to be avoided when interacting with the physical environment; while the system interacts with the physical environment, maintaining empirical transition probabilities of state-action pairs resulting in updated states; and iteratively controlling an interaction with the physical environment by, in an iteration: obtaining data indicating a current state of the physical environment; updating the safe set of state-action pairs, including: estimating an estimated transition probability for each state-action pair of the state action pairs resulting in the updated states based on an empirical transition probability of a similar other state-action pair, and including the state-action pair in the safe set of state-action pairs when the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability; selecting an action to be performed in a current state of the physical environment from the safe set of state-action pairs; and providing the action to be performed to the system.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0045] These and other aspects of the present invention will be apparent from and elucidated further with reference to the embodiments described by way of example in the following description and with reference to the figures.
[0054] It should be noted that the figures are purely diagrammatic and not drawn to scale. In the figures, elements which correspond to elements already described may have the same reference numerals.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0056] The processor subsystem 140 may be configured to, during operation of the system 100 and using the data interface 120, access data 030, 040, e.g., including data 040 indicating the safe set of state-action pairs.
[0057] Processor subsystem 140 may be configured to, during operation of the system 100 and using the data interface 120, iteratively control an interaction of the interaction system 200 with the physical environment by performing one or more iterations. In an iteration, processor subsystem 140 may obtain data from the interaction system 200 indicating a current state of the physical environment. In the iteration, processor subsystem 140 may further update the safe set of state-action pairs. The updating by processor subsystem 140 may comprise estimating a transition probability for a state-action pair based on an empirical transition probability of a similar other state-action pair. The updating may further comprise including the state-action pair in the safe set of state-action pairs if the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability.
[0058] In the iteration, processor subsystem 140 may further select an action to be performed in the current state of the physical environment from the safe set of state-action pairs, and provide the action to be performed to the interaction system 200.
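By way of illustration only, the following Python sketch outlines how such an iteration loop might be organised. The names used here (step_env, update_safe_set, the representation of safe and unsafe sets as Python sets of (state, action) tuples) are assumptions made for the sketch and are not part of the systems described above.

    import random

    # Minimal sketch of the iterative control described above: update the safe set, select an
    # action that is safe in the current state, effect it, and record the observed transition.
    def control_loop(step_env, initial_state, safe_set, unsafe_set, update_safe_set, n_steps=100):
        """safe_set and unsafe_set are sets of (state, action) pairs; step_env(s, a) returns s'."""
        counts = {}                                                  # empirical transition counts n(s, a, s')
        state = initial_state
        for _ in range(n_steps):
            safe_set = update_safe_set(safe_set, unsafe_set, counts)  # caller-supplied update rule
            candidates = [a for (s, a) in safe_set if s == state]     # actions known to be safe here
            if not candidates:
                break                                                 # no safe action known: stop or raise an alert
            action = random.choice(candidates)                        # e.g., safe random exploration
            next_state = step_env(state, action)                      # effect the action in the environment
            counts[(state, action, next_state)] = counts.get((state, action, next_state), 0) + 1
            state = next_state
        return safe_set, counts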
[0059] The configuration system 100 may comprise a communication interface 160 configured for communication 162 with the interaction system 200. Communication interface 160 may internally communicate with processor subsystem 140 via data communication 142. Communication interface 160 may be arranged for direct communication with the other system 200, e.g., using USB, IEEE 1394, or similar interfaces. Communication interface 160 may also communicate over a computer network, for example, a wireless personal area network, an internet, an intranet, a LAN, a WLAN, etc. For instance, communication interface 160 may comprise a connector, e.g., a wireless connector, an Ethernet connector, a Wi-Fi, 4G or 5G antenna, a ZigBee chip, etc., as appropriate for the computer network. Communication interface 160 may also be an internal communication interface, e.g., a bus, an API, a storage interface, etc. For example, configuration system 100 and interaction system 200 may be combined into a single system, e.g., their processor subsystems and/or data interfaces may be combined, etc.
[0060] Various details and aspects of the operation of the system 100 are further elucidated below.
[0061] In general, the system 100 may be embodied as, or in, a single device or apparatus, such as a workstation, e.g., laptop or desktop-based, or a server. The device or apparatus may comprise one or more microprocessors which execute appropriate software. For example, the processor subsystem may be embodied by a single Central Processing Unit (CPU), but also by a combination or system of such CPUs and/or other types of processing units. The software may have been downloaded and/or stored in a corresponding memory, e.g., a volatile memory such as RAM or a non-volatile memory such as Flash. Alternatively, the functional units of the system, e.g., the data interface and the processor subsystem, may be implemented in the device or apparatus in the form of programmable logic, e.g., as a Field-Programmable Gate Array (FPGA) and/or a Graphics Processing Unit (GPU). In general, each functional unit of the system may be implemented in the form of a circuit. It is noted that the system 100 may also be implemented in a distributed manner, e.g., involving different devices or apparatuses, such as distributed servers, e.g., in the form of cloud computing.
[0063] Processor subsystem 240 may be configured to perform an interaction with the environment 010 by iteratively: obtaining, via a sensor interface 280, sensor data 282 from one or more sensors indicative of a state of the physical environment 010; providing the state of the physical environment to a configuration system 100; obtaining, in return, an action to be performed in the state of the physical environment 010 from the configuration system 100; and providing, via an actuator interface 270, actuator data 272 to one or more actuators causing the actuators to effect the obtained action in the physical environment 010.
[0064] The interaction system 200 may comprise a communication interface 260 configured for communication 262 with a configuration system 100. Communication interface 260 may internally communicate with processor subsystem 240 via data communication 242. Communication interface 260 may be as discussed for communication interface 160 of configuration system 100. In particular, communication interface 260 may be an internal communication interface, e.g., a bus, an API, a storage interface, etc., e.g., interaction system 200 may be combined with configuration system 100 in a single system.
[0065] In an example embodiment, interaction system 200 may be configured by configuration system 100 in a training phase. Interaction system 200 may obtain at least the safe set of state-action pairs 040 from configuration system 100 during the training phase. For example, as illustrated in the figure, interaction system 200 may access safe set 040 via data interface 220, e.g., processor subsystem 240 may obtain safe set 040 from the configuration system via the data interface 220 or store an otherwise obtained safe set 040 using the data interface 220. In a use phase following this training phase, interaction system 200 may interact with the environment 010, e.g., according to a policy, wherein interaction system 200 may iteratively: obtain a current state of the physical environment; select an action to be performed in this current state from safe set 040; and provide actuator data based on the selected action to the one or more actuators as described herein.
[0066] The system 200 may comprise a sensor interface 280 for obtaining, from one or more sensors (not shown), sensor data 282 indicative of a state of the physical environment 010. Sensor interface 280 may internally communicate with processor subsystem 240 via data communication 244. In the following, for explanatory purposes, a single sensor is discussed. The sensor data 282 may represent one or more physical quantities of the environment and/or of the interaction system 200. In some embodiments, the sensor may be arranged in physical environment 010. In other examples, the sensor may be arranged remotely from the environment, for example if the quantities can be measured remotely. For example, a camera-based sensor may be arranged outside of environment 010 but may nevertheless measure quantities associated with the environment, such as a position and/or orientation of a physical entity in the environment. Sensor interface 280 may also access the sensor data from elsewhere, e.g., from a data storage or a network location. Sensor interface 280 may have any suitable form, including but not limited to a low-level communication interface, e.g., based on I2C or SPI data communication, a data storage interface such as a memory interface or a persistent storage interface, or a personal, local or wide area network interface such as a Bluetooth, Zigbee or Wi-Fi interface or an Ethernet or fibre-optic interface. The sensor may be part of system 200.
[0067] The system 200 may comprise an actuator interface 270 for providing, to one or more actuators (not shown), actuator data 272 causing the one or more actuators to effect an action in the environment 010. Actuator interface 270 may internally communicate with processor subsystem 240 via data communication 246. For ease of explanation, below, a single actuator is discussed. For example, the actuator may be an electric, hydraulic, pneumatic, thermal, magnetic and/or mechanical actuator. Specific yet non-limiting examples include electrical motors, electroactive polymers, hydraulic cylinders, piezoelectric actuators, pneumatic actuators, servomechanisms, solenoids, stepper motors, etc. The actuator may be part of system 200.
[0068] Various details and aspects of the operation of the system 200 are further elucidated below.
[0069] In general, the system 200 may be embodied as, or in, a single device or apparatus, such as a workstation, e.g., laptop or desktop-based, or a server. The device or apparatus may comprise one or more microprocessors which execute appropriate software. For example, the processor subsystem may be embodied by a single Central Processing Unit (CPU), but also by a combination or system of such CPUs and/or other types of processing units. The software may have been downloaded and/or stored in a corresponding memory, e.g., a volatile memory such as RAM or a non-volatile memory such as Flash. Alternatively, the functional units of the system, e.g., the data interface and the processor subsystem, may be implemented in the device or apparatus in the form of programmable logic, e.g., as a Field-Programmable Gate Array (FPGA) and/or a Graphics Processing Unit (GPU). In general, each functional unit of the system may be implemented in the form of a circuit. It is noted that the system 200 may also be implemented in a distributed manner, e.g., involving different devices or apparatuses, such as distributed servers, e.g., in the form of cloud computing.
[0071] For example, autonomous device 62 may be trained to perform a particular task, for example, to move around a warehouse to pick up a certain object. For example, autonomous device 62 may be trained to interact according to a reward function configured for the task to be performed.
[0072] As also shown in the figure, an object 411 may be present in the environment. The autonomous device 62 may be able to perform various actions, e.g., actions 421-424 from its current position and actions 425-428 from another position 62′.
[0073] Interestingly, in order to determine whether action 423 is safe in the long run for the autonomous device 62 at the present location, transition probabilities for taking action 423 may be estimated based on empirical transition probabilities of a similar other state-action pair, e.g., taking action 427 by the autonomous device in the position 62′. For example, position 62′ may be similar to position 62, e.g., in terms of surface material on which the device 62 moves, incline of the surface, etcetera. For example, it may be known that taking action 427 from position 62′ is safe, e.g., this state-action pair may be included in an initial safe set of state-action pairs, for example because there are no objects to be avoided within a certain range of that state. Moreover, it may be assumed that taking action 427 in state 62′ has an analogous effect as taking action 423 in state 62, and similarly for actions 421 and 425; actions 422 and 426; and actions 424 and 428. At least, a probability of ending up at a certain distance to the right of position 62 by taking action 423 may be assumed to be similar to a probability of ending up at a certain distance to the right of position 62′ by taking action 427, and similarly for the other actions.
[0074] Accordingly, transition probabilities for a state-action pair 423 may be estimated based on empirical transition probabilities for a similar state-action pair 427. Although the state-action pairs may be similar, whether or not the resulting state-action pairs are unsafe may differ, e.g., state-action pairs close to state 62′ may all be safe. Accordingly, transition probabilities of a state-action pair 423 that is potentially unsafe in the long run may be estimated based on transition probabilities of a state-action pair 427 that is comprised in a safe set of state-action pairs known to be safe in the long run. Moreover, if the state-action pair 423 is itself not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair 423 based on the estimated transition probability, state-action pair 423 may be included in the safe set of state-action pairs.
[0075] Accordingly, it may be made possible to perform action 423, e.g., if it is established from experience in state 62′ that the device can brake sufficiently fast without bumping into object 411. Accordingly, the techniques described herein may allow device 62 to perform additional actions that were not initially known to be safe, and accordingly, device 62 may be better enabled to perform its tasks, e.g., getting to the other side of object 411 or stopping close to object 411 for a warehouse item from storage 411 to be loaded onto a transport vehicle 62. In various embodiments, moreover, if an action 423 is not known to be safe but would be desirable to execute, actions may be selected for improving the estimated transition probabilities for the desirable action 423, e.g., by selecting an action to be performed based on which the transition probabilities may be improved, or even moving towards a state from which such an action can be performed. For example, device 62 may move to position 62′ in order to perform action 427 to refine estimated transition probabilities of action 423. Accordingly, the device may learn more efficiently to interact with the environment 010, e.g., the device may learn to perform a task with higher sample efficiency.
[0077] Shown in the figure are states S1, 511; S2, 512; S3, 513; S4, 514; and S5, 515. As illustrated in detail for state S1, an action of the system in a state of the physical environment may result in an updated state of the physical environment according to a transition probability. For example, in state S1, the interaction system may perform actions a and b. This is represented by two state-action pairs A1a, 521, and A1b, 522, representing the interaction system performing actions a and b from state S1, respectively. Similarly, from state S4, the system may perform the same action a, represented by state-action pair A4a, 523. There may be additional states, actions, and/or state-action pairs that are not shown in this figure for ease of exposition. For example, the number of states may be at most or at least 10, at most or at least 100, or at most or at least 1000. In this example, a discrete state space with a finite number of states is illustrated. For example, such a state space may be obtained by discretizing a continuous state space. There are typically multiple actions, e.g., at most or at least 10, at most or at least 100, or at most or at least 1000. Typically, an interaction system interacting with a physical environment knows the current state of the environment, and knows which actions it performs, but an action can result in one of several possible states.
[0078] For example, as shown in the figure, taking action a from state S1 may result in state S2 with a probability P1a2, 561, and may result in state S3 with probability P1a3, 563. Similarly, taking action b from state S1 may result in state S2 with probability P1b2, 562; and in state S3 with probability P1b3, 564. Taking action a in state S4 may result in state S5 with probability P4a5, 565. Interestingly, the model illustrated in the figure may not be fully known to the interaction system; in particular, the transition probabilities may be unknown and may need to be estimated, as discussed below.
[0079] As also shown in the figure, a subset of state-action pairs may be indicated as being an unsafe set of state-action pairs to be avoided when interacting with the physical environment. For example, shown in the figure are label US1a?, 531, for state-action pair A1a; label US1b?, 532, for state-action pair A1b; and label US4a?, 533, for state-action pair A4a. For example, state-action pair A1b may be labelled as unsafe and state-action pairs A1a and A4a may not be labelled as unsafe, as indicated by the underlining in the figure. The label may denote instantaneous safety, e.g., if a state-action pair is not labelled as unsafe it may still be unsafe in the long run, e.g., by taking the action it may be possible to arrive in another state from which only unsafe actions can be performed.
[0080] For example, the states and actions shown in the figure may form a Markov Decision Process (MDP). Mathematically, an MDP may be represented as a 5-tuple ⟨S, A, R, T, γ⟩ with sets of states S and actions A, a known, deterministic reward function R: S×A → ℝ, an unknown, stochastic dynamics function T: S×A → Δ(S) which maps a state-action pair to a probability distribution over next states, and a discount factor γ. In various embodiments, S and/or A is finite. An interaction system may interact with the physical environment according to the reward function, e.g., by maximizing an expected accumulated reward as is known from reinforcement learning, etc. The reward function may also provide the labelling of state-action pairs as unsafe, e.g., a state-action pair may be labelled as unsafe if its reward is negative. In various embodiments, rewards are assumed to be bounded between −1 and 1. The discount factor γ may be a value between 0 and 1.
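Purely as an illustration of this notation, the following Python snippet represents a toy MDP of this form with plain dictionaries; the state and action names mirror the figure, while the concrete probability and reward values are assumptions made for the example.

    # Toy MDP in the notation above: states S, actions A, reward R(s, a), stochastic dynamics
    # T mapping (s, a) to a distribution over next states, and a discount factor gamma.
    # The numeric values below are illustrative assumptions only.
    S = ["S1", "S2", "S3", "S4", "S5"]
    A = ["a", "b"]
    R = {("S1", "a"): 0.0, ("S1", "b"): -1.0, ("S4", "a"): 0.0}   # negative reward = labelled unsafe
    T = {
        ("S1", "a"): {"S2": 0.7, "S3": 0.3},    # P1a2, P1a3
        ("S1", "b"): {"S2": 0.4, "S3": 0.6},    # P1b2, P1b3
        ("S4", "a"): {"S5": 1.0},               # P4a5
    }
    gamma = 0.9
    unsafe_set = {sa for sa, r in R.items() if r < 0}             # here, only ("S1", "b")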
[0082] As in the previous figure, states S1 to S5 and state-action pairs A1a, A1b, and A4a are shown.
[0083] While interacting with the physical environment, a safe set of state-action pairs may be maintained. This is illustrated in the figure by labels S1a?, 641; S1b?, 642; and S4a?, 643; illustrating whether respective state-action pairs A1a, A1b, and A4a are comprised in the safe set. In various embodiments, it may be guaranteed that the set is safe in the sense that it does not comprise state-action pairs labelled as unsafe and that it is closed, e.g., starting from a state in the set and taking actions from the set, it may be guaranteed, at least with high probability, that the resulting state is also comprised in the set. Here and elsewhere, a state may be considered to be comprised in a set of state-action pairs if the set of state-action pairs includes an action from that state.
[0084] For example, in the case of MDPs, the following mathematical definitions and notation may be adopted:
[0085] Z ⊆ S×A may be defined as closed if for every (s,a) ∈ Z and for every next state s′ for which T(s,a,s′) > 0, there exists a′ such that (s′,a′) ∈ Z.
[0086] Z ⊆ S×A may be defined as a safe set if Z is closed and for all (s,a) ∈ Z, R(s,a) ≥ 0. (s,a) ∈ Z may be referred to as safe state-action pairs. For example, the safe set that is updated when interacting with a physical environment according to various embodiments may be updated while satisfying this notion of safety.
[0087] For any Z ⊆ S×A, it may be said that s ∈ Z if there exists any a ∈ A such that (s,a) ∈ Z.
[0088] A state-action pair (s,a) may be defined as an edge of Z if (s,a) ∉ Z but s ∈ Z.
[0089] P[⋅|π] may be used to denote the probability of an event occurring while following the policy π.
[0090] For a policy π, π ∈ Π(Z) may be used to denote that, for all s ∈ Z, (s,π(s)) ∈ Z.
[0091] A subset of state-action pairs Z ⊆ S×A may be defined as ergodic if Z is closed and, for any s′ ∈ Z, there exists a policy π_s′ ∈ Π(Z) such that for all s ∈ Z, P[∃t, s_t = s′ | π_s′, s_0 = s] = 1. For example, each state in Z may be reachable from each other state through a policy that never exits the subset Z. It may be observed that, when Z is the set of all state-action pairs in the MDP, this definition corresponds to definitions of ergodicity known in the art.
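A minimal sketch of how the closedness and safe-set conditions above could be checked for a finite MDP is given below, assuming the dictionary-based representation of T and R from the previous snippet.

    # Check whether a set Z of (state, action) pairs is closed and whether it is a safe set.
    def is_closed(Z, T):
        """Closed: every possible next state of a pair in Z has at least one action in Z."""
        states_in_Z = {s for (s, _a) in Z}
        for (s, a) in Z:
            for s_next, p in T.get((s, a), {}).items():
                if p > 0 and s_next not in states_in_Z:
                    return False
        return True

    def is_safe_set(Z, T, R):
        """Safe set: closed and containing no state-action pair with negative reward."""
        return is_closed(Z, T) and all(R.get((s, a), 0.0) >= 0 for (s, a) in Z)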
[0092] In various embodiments, an initial safe set Z_0 may be obtained, e.g., an initial safe set satisfying the notion of safety for an MDP described mathematically as above. The initial state s_init from which the controlling of an interaction with the physical environment starts may be comprised in this initial safe set. Preferably, the initial safe set is ergodic so that the interaction system can perform actions inside the safe set without the risk of getting stuck.
[0093] In various embodiments, while interacting with the physical environment, empirical transition probabilities of state-action pairs resulting in an updated state may be maintained. As an illustration, shown in the figure are empirical transition probability EMP1a2, 671, of taking action a in state S1 leading to state S2; empirical transition probability EMP1b2, 672, of taking action b in state S1 leading to state S2; empirical transition probability EMP1a3, 673, of taking action a in state S1 leading to state S3; empirical transition probability EMP1b3, 674, of taking action b in state S1 leading to state S3; and an empirical transition probability EMP4a5, 675, of taking action a in state S4 leading to state S5. For example, quantities n(s_t, a_t), representing a number of times a state-action pair was performed, and n(s_t, a_t, s_{t+1}), representing a number of times this led to updated state s_{t+1}, may be maintained. An empirical transition probability may be determined as n(s_t, a_t, s_{t+1})/n(s_t, a_t). If n(s_t, a_t) = 0, e.g., there is no empirical evidence of taking action a_t from state s_t, an estimated transition probability indicating maximal uncertainty may be returned, e.g., a confidence interval [0,1].
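The bookkeeping described in this paragraph might, for example, be implemented with simple counters, as in the following sketch; the class name and its methods are assumptions made for illustration.

    from collections import defaultdict

    # Maintain n(s, a) and n(s, a, s') and derive empirical transition probabilities from them.
    class EmpiricalTransitions:
        def __init__(self):
            self.n_sa = defaultdict(int)       # n(s_t, a_t): times the pair was performed
            self.n_sas = defaultdict(int)      # n(s_t, a_t, s_{t+1}): times it led to s_{t+1}

        def record(self, s, a, s_next):
            self.n_sa[(s, a)] += 1
            self.n_sas[(s, a, s_next)] += 1

        def probability(self, s, a, s_next):
            """Return a (low, high) interval: [0, 1] if unseen, otherwise a point estimate."""
            if self.n_sa[(s, a)] == 0:
                return (0.0, 1.0)              # no empirical evidence: maximal uncertainty
            p = self.n_sas[(s, a, s_next)] / self.n_sa[(s, a)]
            return (p, p)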
[0094] In some embodiments, empirical transition probabilities of a state-action pair may be updated, and quantities depending on them recomputed, only if the number of times that the state-action pair has been encountered does not exceed a threshold m. The inventors envisaged this measure to help reach the mathematical guarantee of PAC-MDP optimality, but in many practical settings it is not needed. Preferably, threshold m is chosen bigger than 1/τ, where τ is a threshold parameter further discussed below.
[0095] Interestingly, in various embodiments, the safe set of state-action pairs may be updated while interacting with a physical environment, e.g., by including additional state-action pairs for which it is safe to do so, as discussed further below. However, in order to determine that it is safe to include a state-action pair, transition probabilities may be needed for state-action pairs that are not in the safe set, e.g., for which it is not known to be safe to perform the action. Interestingly, in such cases, performing such potentially unsafe actions may be avoided by instead estimating transition probabilities EST*** for state-action pairs based on empirical transition probabilities EMP*** of other state-action pairs. In particular, in at least one iteration, a transition probability may be estimated for a state-action pair for which no empirical transition probabilities are available, e.g., leading to that action being selected to be performed.
[0096] For example, as shown in the figure, an estimated transition probability EST1a2, 681, of state-action pair A1a leading to state S2 may be determined based on, e.g., equal to, an empirical transition probability EMP4a5 of another state-action pair A4a leading to a state S5, if it is determined that state-action pair A4a is sufficiently similar to state-action pair A1a and S5 relates to S4 similarly to how S2 relates to S1. This is illustrated by the dashed arrow from EMP4a5 to EST1a2. It is also possible to determine an estimated transition probability based on, e.g., equal to, an empirical transition probability of the state-action pair itself, as illustrated by the dashed arrow from EMP1a2 to EST1a2; and/or based on multiple empirical transition probabilities, e.g., as an average.
[0097] A detailed example is now given in which a transition probability, e.g., EST1a2, for a state-action pair leading to a certain other state, e.g., A1a leading to S2, is determined by determining similarities between the state-action pair, e.g., A1a, and one or more other state-action pairs, e.g., A4a, for which empirical transition probabilities, e.g., EMP4a5, are available. A most relevant other state-action pair, e.g., A4a, may be selected based on at least the similarities. The estimated transition probability, e.g., EST1a2, for the state-action pair may then be determined based on, e.g., equal to, the empirical transition probabilities, e.g., EMP4a5, of the selected other state-action pair.
[0098] In estimating transition probabilities, various embodiments use a predefined analogous state function α: (S×A×S)×(S×A) → S and a predefined pairwise state-action distance mapping Δ: (S×A)×(S×A) → [0,1]. It may be assumed that, for most or all pairs of state-action pairs ((s,a),(s̃,ã)) ∈ (S×A)×(S×A),
Σ_{s′∈S} |T(s,a,s′) − T(s̃,ã,α(s,a,s′,s̃,ã))| ≤ Δ((s,a),(s̃,ã)).
[0099] For example, for two state-action pairs, a bound on the distance, e.g., the L1 distance, between their dynamics may be assumed, based on a mapping between analogous next states. For example, α may represent an identity mapping between respective next states, e.g., α(s,a,s′,s̃,ã) = s′. Other functions may also be used, allowing a wider set of analogies to be represented. Conceptually, α may indicate a closeness of reactions of the physical environment in response to respective state-action pairs, allowing knowledge to be transferred from previously seen state-action pairs to previously unseen or less seen state-action pairs. For example, the smaller Δ is, e.g., the more similar the state-action pairs, the more knowledge transfer may take place.
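The assumption above can be expressed directly as a check, as in the following sketch; T is the dictionary-based dynamics from the earlier snippet, and alpha and Delta are assumed to be supplied as ordinary Python functions.

    # Verify, for one pair of state-action pairs, that
    # sum over s' of |T(s,a,s') - T(s~,a~, alpha(s,a,s',s~,a~))| <= Delta((s,a), (s~,a~)).
    def analogy_bound_holds(T, alpha, Delta, sa, sa_tilde, states):
        (s, a), (s_t, a_t) = sa, sa_tilde
        total = 0.0
        for s_next in states:
            mapped = alpha(s, a, s_next, s_t, a_t)            # analogous next state
            p = T.get((s, a), {}).get(s_next, 0.0)
            p_tilde = T.get((s_t, a_t), {}).get(mapped, 0.0)
            total += abs(p - p_tilde)
        return total <= Delta(sa, sa_tilde)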
[0100] Accordingly, for respective transition probabilities, estimated transition probabilities T̄: S×A×S → [0,1], e.g., EST1a2, may be maintained. Optionally, also confidence interval widths for the estimated transition probabilities (not shown) may be maintained, e.g., per state-action pair, ε̄_T: S×A → ℝ. Also the confidence intervals may be used to select a most relevant other state-action pair.
[0101] As a detailed example, let T̂ denote the empirical transition probabilities EMP***. Let ε̂_T(s,a) denote an L1 confidence interval width for an empirical transition probability T̂(s,a). As known from "An analysis of model-based Interval Estimation for Markov Decision Processes", if ε̂_T(s,a) is chosen appropriately as a function of n(s,a), where n(s,a) is the number of times state-action pair (s,a) is encountered, the L1 confidence intervals may hold with probability δ_T. Beneficially, using analogies, tighter confidence intervals ε̄_T, centred around an estimate T̄, may be determined, especially for unexperienced state-action pairs, by transferring a confidence interval from a sufficiently similar state-action pair with a relatively small confidence interval to another state-action pair with a larger confidence interval.
[0102] As an example, the following algorithm may be used to determine estimated transition probabilities, e.g., EST1a2, from empirical transition probabilities, e.g., EMP4a5:
    Algorithm: Compute Analogy-based Confidence Intervals
      compute ε̂_T(s, a) using δ_T
      for (s, a) ∈ S × A do
        (s̃, ã) ← argmin{ ε̂_T(s, a), min_{s̃,ã} ε̂_T(s̃, ã) + Δ((s, a), (s̃, ã)) }
        ε̄_T(s, a) := ε̂_T(s̃, ã)
        for s′ ∈ S do
          s̃′ ← α((s, a, s′), (s̃, ã))
          T̄(s, a, s′) := T̂(s̃, ã, s̃′)
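A Python sketch mirroring this pseudocode is given below; eps_hat, T_hat, Delta and alpha are assumed to be provided by the caller (e.g., computed from the empirical counts), and the data structures are illustrative assumptions rather than part of the described system.

    # Analogy-based confidence intervals: for each (s, a), either keep its own empirical interval
    # or adopt the interval of the most relevant similar pair (smallest eps_hat + distance), and
    # copy that pair's empirical transition probabilities through the analogy map alpha.
    def analogy_confidence_intervals(S, A, eps_hat, T_hat, Delta, alpha):
        eps_bar, T_bar = {}, {}
        for s in S:
            for a in A:
                best, best_score = (s, a), eps_hat[(s, a)]
                for s_t in S:
                    for a_t in A:
                        score = eps_hat[(s_t, a_t)] + Delta((s, a), (s_t, a_t))
                        if score < best_score:
                            best, best_score = (s_t, a_t), score
                eps_bar[(s, a)] = eps_hat[best]
                T_bar[(s, a)] = {}
                for s_next in S:
                    mapped = alpha(s, a, s_next, best[0], best[1])
                    T_bar[(s, a)][s_next] = T_hat.get(best, {}).get(mapped, 0.0)
        return T_bar, eps_bar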
[0103] Interestingly, based on estimated transition probabilities, e.g., EST1a2, a state-action pair, e.g., A1a, may be included in the safe set of state-action pairs. Specifically, a state-action pair may be included, e.g., its label S1a? adjusted, only if, or whenever, the state-action pair is not labelled as unsafe, e.g., by labelling US1a?, and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability, e.g., EST1a2.
[0104] Specifically, if estimated transition probabilities EST*** for a state-action pair are sufficiently reliable, e.g., if a confidence interval is sufficiently small, then it may be established with sufficient confidence which other states may be reached as a result of a given action. In other words, the support of the next-state distribution of the state-action pair may be recovered with sufficient confidence. Specifically, the inventors realised that if it is assumed that all non-zero transition probabilities occur with a probability of at least τ, then it may be sufficient for a confidence interval to satisfy ε̄_T(s,a) ≤ τ/2. Here, τ is an irrelevance threshold on unlikely transitions, e.g., all non-zero transition probabilities may be assumed to be at least τ. Based on determining which other states a state-action pair may result in, the state-action pair may be determined to be safe, e.g., if each resulting state-action pair was already in the safe set, or based on other techniques for updating the safe set discussed below. Preferably, threshold parameter τ is set to a small value, e.g., smaller than 1/(|S|·t_max), where |S| is the cardinality of the state set S and t_max is the maximum number of environment interactions to be performed by the interaction system.
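As a small illustration of this support-recovery step, using the assumed data structures from the previous sketches, the following function returns the recovered support only once the confidence interval is tight enough.

    # If eps_bar[(s, a)] <= tau / 2, treat every next state with non-zero estimated probability as
    # part of the support of T(s, a, .); otherwise the support is not yet known with confidence.
    def known_support(sa, T_bar, eps_bar, tau):
        if eps_bar.get(sa, 1.0) > tau / 2:
            return None
        return {s_next for s_next, p in T_bar.get(sa, {}).items() if p > 0}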
[0105] In various embodiments, instead of adding state-action pairs to the safe set one-by-one, the safe set of state-action pairs may be updated based on a candidate set of state-action pairs that are not labelled as unsafe and for which sufficiently accurate estimated transition probabilities are available, e.g., for which the support is known. In order to add candidate state-action pairs to Ẑ_safe while ensuring that Ẑ_safe is closed, a safe return policy to Ẑ_safe may be needed, in the sense that for every state-action pair on the return path, the support of the next-state distribution is known, and all possible resulting states allow a return to Ẑ_safe with sufficiently high probability. Accordingly, a subset of the candidate set of state-action pairs may be determined for which adding the subset to the safe set of state-action pairs results in an ergodic set of state-action pairs not labelled as unsafe.
[0106] Specifically, the candidate set may be pruned in an iterative approach. In this iterative approach, state-action pairs may be eliminated that cannot be reached from the safe set of state-action pairs according to the estimated transition probabilities, or from which the safe set of state-action pairs cannot be reached according to the estimated transition probabilities. Accordingly, ergodicity may be achieved. Then, state-action pairs may be eliminated to ensure closedness, in other words, to avoid reaching a state from which no action from the safe set of state-action pairs or the subset of the candidate set of state-action pairs can be taken. These iterations may be repeated, e.g., until convergence.
[0107] Interestingly, as the inventors were able to show, such an iterative approach may lead to a safe and ergodic set. Moreover, the inventors were also able to show that the approach is complete, in the sense that, for each state-action pair on the edge of Ẑ_safe for which there exists a return policy to Ẑ_safe which passes only through non-unsafe state-action pairs with sufficiently accurate estimated transition probabilities, this edge action and all of the actions in every possible return trajectory to Ẑ_safe may be added. Accordingly, this pruning approach may allow a relatively large number of state-action pairs to be included in the safe set.
[0108] As a detailed example, the following algorithm may be used to update the safe set based on a candidate set of state-action pairs:
    Algorithm: Compute Safe Set
      Z_candidate ← {(s, a) ∈ (S × A)\Ẑ_safe s.t. ε̄_T(s, a) < τ/2, R(s, a) ≥ 0}
      while Z_candidate ≠ Z_closed in the last iteration do
        Z_reachable ← {(s, a) ∈ Z_candidate : s ∈ Ẑ_safe}
        while Z_reachable changed in the last iteration do
          for (s, a) ∈ Z_reachable ∪ Ẑ_safe do
            add {(s′, a′) ∈ Z_candidate s.t. T̄(s, a, s′) > 0} to Z_reachable
        Z_returnable ← Ø
        while Z_returnable changed in the last iteration do
          for (s, a) ∈ Z_reachable do
            if ∃(s′, a′) ∈ Z_returnable ∪ Ẑ_safe s.t. T̄(s, a, s′) > 0 then
              add (s, a) to Z_returnable
        Z_closed ← Z_returnable
        while Z_closed changed in the last iteration do
          for (s, a) ∈ Z_closed do
            if ∃s′ ∈ S s.t. T̄(s, a, s′) > 0 and ∀a′ ∈ A, (s′, a′) ∉ Z_closed ∪ Ẑ_safe then
              remove (s, a) from Z_closed
        Z_candidate ← Z_closed
      Ẑ_safe ← Z_closed ∪ Ẑ_safe
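A Python sketch following the same pruning steps is given below; as before, the data structures (sets of (state, action) tuples, the nested dictionaries T_bar and eps_bar) are assumptions made for illustration rather than part of the described system.

    # Safe-set update: start from candidates with tight confidence intervals and non-negative
    # reward, then iteratively keep only pairs that are reachable from the safe set, can return
    # to it, and never lead to a state without a safe or candidate action.
    def compute_safe_set(S, A, safe_set, T_bar, eps_bar, R, tau):
        candidate = {(s, a) for s in S for a in A
                     if (s, a) not in safe_set
                     and eps_bar.get((s, a), 1.0) < tau / 2
                     and R.get((s, a), -1.0) >= 0}
        while True:
            # 1) pairs reachable from the safe set (directly or via other reachable candidates)
            safe_states = {s for (s, _a) in safe_set}
            reachable = {(s, a) for (s, a) in candidate if s in safe_states}
            changed = True
            while changed:
                changed = False
                for (s, a) in list(reachable | safe_set):
                    for (c_s, c_a) in candidate:
                        if T_bar.get((s, a), {}).get(c_s, 0.0) > 0 and (c_s, c_a) not in reachable:
                            reachable.add((c_s, c_a))
                            changed = True
            # 2) pairs from which the safe set (or other returnable pairs) can be reached
            returnable = set()
            changed = True
            while changed:
                changed = False
                for (s, a) in reachable - returnable:
                    targets = {s2 for (s2, _a2) in returnable | safe_set}
                    if any(T_bar.get((s, a), {}).get(s2, 0.0) > 0 for s2 in targets):
                        returnable.add((s, a))
                        changed = True
            # 3) drop pairs that may lead to a state with no action in the closed or safe set
            closed = set(returnable)
            changed = True
            while changed:
                changed = False
                for (s, a) in list(closed):
                    for s2, p in T_bar.get((s, a), {}).items():
                        if p > 0 and not any((s2, a2) in closed or (s2, a2) in safe_set for a2 in A):
                            closed.discard((s, a))
                            changed = True
                            break
            if closed == candidate:
                break                              # converged: nothing was pruned in this pass
            candidate = closed
        return safe_set | closed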
[0109] Accordingly, by selecting an action to be performed in a current state of the physical environment from the safe set of state-action pairs, and providing the action to be performed to the interaction system interacting with the physical environment, safety of the interaction with the physical environment may be improved, while still enabling the system to perform actions that were not initially known to be safe. For example, in a random exploration setting, the interaction system may be able to randomly explore the environment in a safe way. Or, in a setting with a parametric policy, for example defined by an artificial neural network, an action may be determined according to the parametric policy and performed only if the action from the current state is comprised in the safe set. Generally, an alert may be raised if a determined action is not comprised in the safe set.
[0110] In various embodiments, however, the interaction with the physical environment may take place according to a reward function. For example, a goal-oriented action to be performed in the current state of the physical environment may be determined based on the reward function, for example, to maximize an expected accumulated reward as is conventional. That action may then be performed only if it is comprised in the safe set of state-action pairs. In particular, a set of goal state-action pairs may be determined that are reachable by performing goal-oriented actions. For example, shown in the figure are labels G1a?, 651; G1b?, 652; and G4a?, 653, indicating whether respective state-action pairs A1a, A1b, and A4a are comprised in the set of goal state-action pairs. In such a case, the determined goal-oriented action may be performed only if each of these goal state-action pairs is comprised in the safe set. As discussed elsewhere, this may reduce the likelihood of getting stuck in certain situations.
[0111] Interestingly, in various embodiments, if a goal-oriented action according to the reward function cannot be performed, e.g., if the goal-oriented action is not comprised in the safe set or if a determined set of goal state-action pairs is not comprised fully in the safe set, then the reward function may nonetheless be used as a guide to determine an action to be performed, in several ways. For example, in at least an iteration, a similar state-action pair may be determined that is similar to a goal-oriented state-action pair not comprised in the safe set, and transition probabilities for the goal-oriented state-action pair may be estimated based on empirical transition probabilities of the similar state-action pair, thereby possibly allowing the goal-oriented state-action pair to be added to the safe set. Or, in at least an iteration, a return state-action pair may be selected for returning to the set of goal state-action pairs.
[0112] Several of the above possibilities are explained based on the following detailed pseudocode example of configuring an interaction system:
    Algorithm: Analogous Safe-state Exploration (using parameters m, δ_T, γ_explore, γ_switch)
      Ẑ_safe ← Z_0                                             // initial safe set of state-action pairs
      Ẑ_unsafe ← {(s, a) ∈ S × A : R(s, a) < 0}                // subset of state-action pairs labelled unsafe
      n(s, a), n(s, a, s′) ← 0 for all (s, a, s′) ∈ S × A × S  // empirical transition probabilities
      ...                                                      // not all goal-oriented actions comprised in safe set Ẑ_safe
      choose action a_t := ...
[0113] The above example demonstrates three ways of selecting safe actions based on a reward function, defined by three different policies. The first policy is a goal-oriented policy based on the goal MDP discussed below; the other two are an exploration policy and a switching policy based on the explore MDP and the switch MDP discussed below.
[0114] The above algorithm also demonstrates several subsets of the set of state-action pairs that may be maintained throughout the environment interaction. The first set is the safe set of state-action pairs Ẑ_safe, which is initialized to an initial safe set Z_0 and may be gradually expanded over time.
[0115] Another set is the set of goal state-action pairs reachable by performing goal-oriented actions, e.g., as indicated by the labels G1a?, G1b?, and G4a? discussed above.
[0116] Below, an example algorithm is given by which the set of state-action pairs to be explored, Z_explore, may be determined based on the set of goal-oriented actions:
    Algorithm: Compute Z_explore
      ...
      if ε̄_T(s, a) > τ/2 then
        add {(s̃, ã) ∈ Ẑ_safe s.t. Δ((s, a), (s̃, ã)) < τ/4} to Z_explore
      else
        add {(s′, a′) ∈ S × A : T̄(s, a, s′) > 0}\(Z_return ∪ Ẑ_safe ∪ Ẑ_unsafe) to Z_next^{L+1}
      L ← L + 1
      if Z_explore = Ø then
        add Z_edge to Ẑ_unsafe
      else
        break
[0117] In order to include a state-action pair in the safe set Ẑ_safe, it is noted that accurate transition probabilities may be needed not just for the state-action pair itself but also for state-action pairs on return trajectories to Ẑ_safe. Namely, such transition probabilities may allow it to be determined that the safe set of state-action pairs can be reached with sufficient probability from the state-action pair to be added. Accordingly, in various embodiments, a similar state-action pair to be performed is determined that is similar to a goal-oriented state-action pair, or to a state-action pair on the return path from the goal-oriented state-action pair to the safe set, allowing transition probabilities for these latter state-action pairs to be estimated. This is not needed, however; e.g., Z_edge ← {(s,a) ∈ (Ẑ_safe)^C | s ∈ Ẑ_safe} may be used.
[0118] Practically, as demonstrated in the algorithm, the set of similar state-action pairs may be determined iteratively. In an iteration, the current goal policy may be used to determine the edge state-action pairs Z_edge from which the search for similar state-action pairs is performed.
[0119] In order to determine the set Z_explore, safety may need to be established for unexplored state-action pairs (s,a) ∈ Z_edge. Conceptually, state-action pairs from Ẑ_safe may be explored that are similar to an unknown return policy from (s,a) in order to learn that unknown policy. Interestingly, as shown in the above algorithm, this may be done without exploring Ẑ_safe exhaustively, by performing a breadth-first search from (s,a). The breadth-first search, demonstrated by the while loop in the above pseudocode, may enumerate a superset of trajectories that contains the true return trajectories. For example, it may first enumerate a list of state-action pairs that are a 1-hop distance away and, if any of them have a loose confidence interval, it may add to Z_explore a corresponding similar state-action pair from Ẑ_safe (if any exists). If Z_explore is empty at this point, the algorithm may repeat this process for 2-hop distance, 3-hop distance and so on, for example, until either Z_explore is non-empty or the BFS tree cannot be grown any further. Experiments have shown that this method is more effective than an exhaustive search, which is, however, also possible.
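One possible reading of this breadth-first expansion, using the same assumed data structures as the earlier sketches, is the following.

    # Breadth-first search from an edge pair (s, a): expand hop by hop through possible successors;
    # whenever a pair on the frontier has a loose confidence interval, propose sufficiently similar
    # pairs from the safe set for exploration, and stop as soon as something to explore was found.
    def find_explore_pairs(edge_pair, A, safe_set, unsafe_set, T_bar, eps_bar, Delta, tau, max_hops=10):
        explore = set()
        visited = {edge_pair}
        frontier = {edge_pair}
        for _hop in range(max_hops):
            next_frontier = set()
            for (s, a) in frontier:
                if eps_bar.get((s, a), 1.0) > tau / 2:
                    explore |= {(s2, a2) for (s2, a2) in safe_set
                                if Delta((s, a), (s2, a2)) < tau / 4}
                else:
                    for s2, p in T_bar.get((s, a), {}).items():
                        if p > 0:
                            for a2 in A:
                                pair = (s2, a2)
                                if pair not in visited and pair not in safe_set and pair not in unsafe_set:
                                    next_frontier.add(pair)
                                    visited.add(pair)
            if explore or not next_frontier:
                break
            frontier = next_frontier
        return explore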
[0120] If no similar state-action pair to be performed is determined, in the above algorithm, all of Z_edge may be added to Ẑ_unsafe. In the next iteration, the goal policy may then be determined anew so that these state-action pairs are avoided.
[0121] Mathematical details of example policies for the goal MDP, the explore MDP, and the switch MDP are now provided.
[0122] Given estimated transition probabilities T̄ and confidence interval widths ε̄_T: S×A → ℝ as described herein, T† may be referred to below as a candidate transition if it satisfies the following for all (s,a) ∈ S×A:
[0123] 1. ∥T†(s,a) − T̄(s,a)∥_1 ≤ ε̄_T(s,a).
[0124] 2. if for some s′, T̄(s,a,s′) = 0 and ε̄_T(s,a) < τ, then T†(s,a,s′) = 0.
[0125] 3. if (s,a) ∈ Z_0, then ∀s′ ∉ Z_0, T†(s,a,s′) = 0. CI(T̄) may be used below to denote the space of all candidate transition probabilities.
[0126] Given an MDP model of the physical environment, let M† be an MDP that is the same as M but with an arbitrary reward function R† and discount factor γ†. The optimistic state-action value function may be computed iteratively, e.g., by repeatedly applying a Bellman-style backup in which, for each state-action pair, the transition probabilities are chosen from CI(T̄) so as to maximize the resulting value.
[0127] As t → ∞, the iterates may converge to the optimistic state-action value function.
[0128] Let T̃ ∈ CI(T̄) denote the candidate transition that corresponds to the optimistic transitions that maximize this value function, defining an optimistic MDP ⟨S, A, T̃, R†, γ†⟩.
[0129] Goal MDP.
[0130] M_goal may be defined as an MDP that is the same as M, but without the state-action pairs from Ẑ_unsafe, the set of state-action pairs labelled as unsafe. More concretely, M_goal = ⟨S, A, T, R_goal, γ_goal⟩, where:
[0131] The reward function R_goal may be defined using a quantity corresponding to the state distribution obtained by following the optimistic policy from the initial state in the optimistic MDP, defined for some H > |S|.
[0132] In other words, the goal policy may be directed towards obtaining reward while avoiding the state-action pairs labelled as unsafe.
[0133] Explore MDP.
[0134] M_explore = ⟨S, A, T, R_explore, γ_explore⟩ may be defined as an MDP with the same states, actions, and transition function as M, but with a different reward function, R_explore, which may be defined to reward visiting the state-action pairs selected for exploration, e.g., the state-action pairs in Z_explore.
[0135] Switch MDP.
[0136] M_switch = ⟨S, A, T, R_switch, γ_explore⟩ may be defined to be an MDP with the same states, actions, and transition function as M, but with a different reward function, R_switch, and discount factor. More specifically, R_switch may be defined to reward returning to the set of goal state-action pairs.
[0137] Several illustrative examples of analogous state functions α: (S×A×S)×(S×A) → S and pairwise state-action distance mappings Δ are now provided.
[0138] As a first illustrative example, a grid world domain may be considered with unsafe states, in which the interaction system receives a reward of −1 for any action and the episode terminates. A state s may be described by its coordinates on the 2-dimensional grid: s=(x.sub.s,y.sub.s). The system starts on a 7×7 island of safe states and is surrounded by four 5×5 islands of safe states in all four directions, separated from the centre island by a one-state-thick line of dangerous states. The goal is placed on one of the surrounding islands. The system can take actions up, down, left, or right to move in those directions one step, or can take actions jump up, jump down, jump left, or jump right to move two steps, allowing the system to jump over dangerous states. There is a slipping probability of 60%, which causes the system to fall left or right of the intended target, 30% for either side.
[0139] The initial safe set provided to the system in this example can be the whole centre island and all actions that with probability 1 will keep the system on the centre island. The distance function Δ provided to the system can be Δ((s,a),(s̃,ã)) = 0 if a = ã and s and s̃ are within 5 steps of each other (in the L∞ norm), and Δ((s,a),(s̃,ã)) = 1 otherwise. As analogous state function, α((s,·,s′),(s̃,·)) = s̃′ may be used, where s̃′ has the same offset from s̃ as s′ has from s.
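Written out in Python, the distance function and analogous state function of this grid-world example might look as follows; the encoding of states as (x, y) tuples and of actions as strings is an assumption made for the sketch.

    # Grid-world example: Delta is 0 for equal actions taken within L-infinity distance 5, else 1;
    # alpha maps a next state to the state shifted by the offset between the two source states
    # (an assumed concretisation, analogous to the platformer example described below).
    def Delta(sa, sa_tilde):
        (x, y), a = sa
        (xt, yt), at = sa_tilde
        return 0.0 if a == at and max(abs(x - xt), abs(y - yt)) <= 5 else 1.0

    def alpha(s, a, s_next, s_tilde, a_tilde):
        dx, dy = s_tilde[0] - s[0], s_tilde[1] - s[1]
        return (s_next[0] + dx, s_next[1] + dy)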
[0140] As a second example, a discrete platformer-like domain may be considered. The state space comprises tuples (x, y, ẋ, ẏ) where x, y are the coordinates of the interaction system and ẋ, ẏ are the directional velocities of the system. The actions provided to the system are tuples (ẋ_desired, j) where ẋ_desired is the desired ẋ and ranges from −2 to 2, and j is a boolean variable indicating whether or not the system should jump. While on the ground, at every step ẋ changes by at most 1 in the direction of ẋ_desired and, if j = 1, then ẏ is set to a value ∈ {1,2} (otherwise ẏ remains unchanged). While in the air, however, the system's actions have no effect and gravity decreases ẏ by one at every step. When the system returns to the ground, ẏ is set to 0. There are three types of surfaces in the environment: 1) concrete, 2) ice, and 3) sand. These surfaces change how high the system can jump. On concrete, when the system jumps, ẏ = 2 with probability 1; on ice, ẏ = 2 with probability 0.5 and ẏ = 1 with probability 0.5; and on sand, ẏ = 1 with probability 1.
[0141] The environment is arranged into three islands. The first island has all three surface materials from left to right: sand, ice, then concrete. The next two islands are just concrete, with the last one containing the goal state (where the reward is 1). The regions surrounding these islands are unsafe, meaning they produce rewards of −1 and are terminal. The islands are spaced apart such that the system must be on concrete to make the full jump to the next islands, and vice versa.
[0142] The initial safe set provided to the system in this example may be the whole first island and all actions that with probability 1 will keep the system on the first island. The distance function Δ provided to the system may be Δ((s,a),(s̃,ã)) = 0 if a = ã and s and s̃ are either both in the air or both on the same type of surface, and Δ((s,a),(s̃,ã)) = 1 otherwise. The analogous state function α may be α((s,·,s′),(s̃,·)) = s̃′ where s̃′ has the same y, ẋ, and ẏ values as s′ with the x value shifted by the x difference between s and s̃.
[0143] Several additional concrete embodiments for interacting with a physical environment using safe sets of state-action pairs are envisaged: [0144] The known ε-greedy algorithm for reinforcement learning may be adapted to perform safe exploration by restricting the allowable set of actions to a safe set of actions as described herein; [0145] The R-Max algorithm as described in R. Brafman and M. Tennenholtz, "R-max: A general polynomial time algorithm for near-optimal reinforcement learning", Journal of Machine Learning Research, 3 (Oct.), 2002 (incorporated herein by reference) may also be adapted to perform safe exploration by restricting the allowable set of actions to a safe set of actions as described herein.
[0147] The method 700 may, for example, be performed by configuration system 100. However, this is not a limitation, in that the method 700 may also be performed using another system, apparatus or device.
[0148] The method 700 may comprise, in an operation titled “ACCESSING SAFE SET, UNSAFE SET”, accessing 710 data indicating a safe set of state-action pairs known to be safely performable and data indicating an unsafe set of state-action pairs to be avoided when interacting with the physical environment. The method 700 may further comprise iteratively controlling an interaction with the physical environment.
[0149] In an iteration 720-750, the method 700 may comprise, in an operation titled “OBTAINING CURRENT STATE”, obtaining 720 data indicating a current state of the physical environment. In the iteration 720-750, the method 700 may comprise, in an operation titled “UPDATING SAFE SET”, updating 730 the safe set of state-action pairs.
[0150] To update the safe set, method 700 may comprise, in an operation titled “ESTIMATING TRANSITION PROBABILITY BASED ON OTHER PAIR”, estimating 732 a transition probability for a state-action pair based on an empirical transition probability of a similar other state-action pair. To update the safe set, method 700 may comprise, in an operation titled “INCLUDING PAIR IF SAFE”, including 734 the state-action pair in the safe set of state-action pairs if the state-action pair is not labelled as unsafe and the safe set of state-action pairs can be reached with sufficient probability from the state-action pair based on the estimated transition probability.
[0151] In the iteration 720-750, the method 700 may comprise, in an operation titled “SELECTING ACTION FROM SAFE SET”, selecting 740 an action to be performed in the current state of the physical environment from the safe set of state-action pairs. In the iteration 720-750, the method 700 may comprise, in an operation titled “PROVIDING ACTION”, providing 750 the action to be performed to the system.
[0152] It will be appreciated that, in general, the operations of method 700 may be performed in any suitable order, e.g., consecutively, simultaneously, or a combination thereof, subject to, where applicable, a particular order being necessitated, e.g., by input/output relations.
[0153] The method(s) may be implemented on a computer as a computer-implemented method, as dedicated hardware, or as a combination of both. For example, instructions for the computer, e.g., executable code, may be stored on a computer-readable medium.
[0154] Examples, embodiments or optional features, whether indicated as non-limiting or not, are not to be understood as limiting the present invention.
[0155] It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the present invention. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or stages other than those described. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the description of a device enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are described separately does not indicate that a combination of these measures cannot be used to advantage.