APPROXIMATE VALUE ITERATION WITH COMPLEX RETURNS BY BOUNDING

20180012137 · 2018-01-11

    Abstract

    A control system and method for controlling a system, which employs a data set representing a plurality of states and associated trajectories of an environment of the system; and which iteratively determines an estimate of an optimal control policy for the system. The iterative process performs the substeps, until convergence, of estimating a long term value for operation at a respective state of the environment over a series of predicted future environmental states; using a complex return of the data set to determine a bound to improve the estimated long term value; and producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value. The control system may produce an output signal to control the system directly, or output the optimized control policy. The system preferably is a reinforcement learning system which continually improves.

    Claims

    1. A method for controlling a system, comprising: providing a set of data representing a plurality of states and associated trajectories of an environment of the system; iteratively determining an estimate of an optimal control policy for the system, comprising performing the substeps until convergence: estimating a long term value for operation at a respective state of the environment over a series of predicted future environmental states; using a complex return of the data set to determine a bound to improve the estimated long term value; and producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value; and at least one of: updating an automated controller for controlling the system with the updated estimate of the optimal control policy, wherein the automated controller operates according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system; and controlling the system with the updated estimate of the optimal control policy, according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system.

    2. The method according to claim 1, wherein said using a complex return of the data set as a bound to improve the estimated long term value comprises using a truncated portion of a trajectory which is consistent with the estimate of the optimal control policy, to estimate the complex return, without introducing off-policy bias.

    3. The method according to claim 2, wherein the truncated portion of the trajectory comprises a predetermined number of sequential data.

    4. The method according to claim 2, wherein the truncated portion of the trajectory is truncated dependent on whether a sequential datum is on-policy or off-policy.

    5. The method according to claim 1, wherein an inherent negative bias of the complex return is employed as a lower bound for the estimate of the long term value.

    6. The method according to claim 1, wherein a trajectory comprises an ordered collection of observations, and the long term value is the sum of the discounted values of a reward received for each observation plus the maximum discounted estimated value for operation at the estimated optimal policy.

    7. The method according to claim 1, wherein said iteratively determining comprises:

    Algorithm CFQI(𝒯, M, R^C)
    Input: 𝒯: set of trajectories; M: number of iterations; R^C: complex return function
     1: Q̂_0 ← 0
     2: for m = 1 to M do
     3:   Let X and Y be empty sets.
     4:   for k = 1 to |𝒯| do
     5:     for t = 1 to |T_k| do
     6:       X ← Append(X, (s_t^{T_k}, a_t^{T_k}))
     7:       Y ← Append(Y, R^C(t, T_k, Q̂_{m−1}))
     8:     end for
     9:   end for
    10:   Q̂_m ← Regression(X, Y)
    11: end for
    12: Return Q̂_M

    wherein Q̂ is an estimate of the optimal policy, s is an observed state, a is an action, and t is an index within a trajectory.

    8. The method according to claim 1, wherein the bound to improve the estimated long term value is a bounded return representing the maximum of an unbiased estimator and a complex return function.

    9. The method according to claim 1, wherein said iteratively determining comprises:

    Algorithm BFQI(𝒯, γ, M, R^B)
    Input: 𝒯: set of trajectories; γ: discount factor; M: number of iterations; R^B: bounded return function
     1: Q̂_0 ← 0
     2: for m = 1 to M do
     3:   Let X and Y be empty sets.
     4:   for k = 1 to |𝒯| do
     5:     for all (s_t, a_t, s_{t+1}, r_{t+1}) ∈ T_k do
     6:       X ← Append(X, (s_t, a_t))
     7:       Y ← Append(Y, R_t^B)
     8:     end for
     9:   end for
    10:   Q̂_m ← Regression(X, Y)
    11: end for

    wherein Q̂ is an estimate of the optimal policy, s is an observed state, a is an action, and t is an index within a trajectory.

    10. The method according to claim 1, further comprising predicting an upper bound for the estimated optimal control policy.

    11. The method according to claim 10, wherein the upper bound for a value associated with a respective state is determined based on at least looking backward along a respective trajectory, to provide an estimate of a respective environment of the system at the respective state, as an inflated value of the past environment of the system to achieve the respective environment.

    12. The method according to claim 1, further comprising using the updated estimate of an optimal control policy to control a controlled system.

    13. A control system, comprising: a memory configured to store a set of data representing a plurality of states and associated trajectories of an environment of the system; and at least one automated processor, configured to process the data in the memory, according to an algorithm comprising: iteratively determining an estimate of an optimal control policy for the system, comprising performing the substeps until convergence: estimating a long term value for operation at a current state of the environment over a series of predicted future environmental states; using a complex return of the data set to determine a bound to improve the estimated long term value; and producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value.

    14. The control system according to claim 13, further comprising an automated communication interface, configured to automatically communicate at least one of: the updated estimate of the optimal control policy to a controller configured to automatically control the system, to at least one of change an operating state of the system and change an environment of the system; and a control signal for automatically controlling the system, to at least one of change an operating state of the system and change an environment of the system, dependent on the updated estimate of the optimal control policy.

    15. The control system according to claim 13, wherein the algorithm uses the complex return of the data set as a bound to improve the estimated long term value by truncating the trajectory to a truncated portion which is consistent with the estimate of the optimal control policy, to estimate the complex return, without introducing off-policy bias.

    16. The control system according to claim 13, wherein the truncated portion of the trajectory at least one of: comprises a predetermined number of sequential data; and is truncated dependent on whether a sequential datum is on-policy or off-policy.

    17. The control system according to claim 13, wherein the algorithm: employs an inherent negative bias of the complex return as a lower bound for the estimate of the long term value; and further comprises predicting an upper bound for the estimated long term value.

    18. The control system according to claim 13, wherein the trajectory comprises an ordered collection of observations, and the long term value is the sum of the discounted values of a reward received for each observation plus the maximum discounted estimated value for operation at the estimated optimal policy.

    19. The control system according to claim 13, wherein the bound to improve the estimated long term value is a bounded return representing the maximum of an unbiased estimator and a complex return function.

    20. A computer readable medium storing nontransitory instructions for controlling at least one automated processor, comprising: nontransitory instructions for controlling the at least one automated processor to perform an algorithm comprising: iteratively determining an estimate of an optimal control policy for a system based on a set of data representing a plurality of states and associated trajectories of an environment of the system; comprising performing the substeps until convergence: estimating a long term value for operation at a current state of the environment over a series of predicted future environmental states; using a complex return of the data set to determine a bound to improve the estimated long term value; and producing an updated estimate of an optimal control policy dependent on the improved estimate of the long term value; and nontransitory instructions for controlling the at least one automated processor to at least one of: update an automated controller for controlling the system with the updated estimate of the optimal control policy, wherein the automated controller operates according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system; and control the system with the updated estimate of the optimal control policy, according to the updated estimate of the optimal control policy to automatically alter at least one of a state of the system and the environment of the system.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0318] FIG. 1 shows a contour plot showing the improvement of max(R^(1), θ̂) over R^(1). The hashed curves show boundaries between improvement and no improvement for the cases of (solid) bias(R^(1)) = 0, (dashed) bias(R^(1)) = 0.5, and (dotted) bias(R^(1)) = −0.5.

    [0319] FIG. 2 shows the average MSE of the learned Q functions. The behavior policy is varied from 90% to 50% of optimal simulating conditions of increasing bias. Error bars are standard error.

    [0320] FIG. 3 shows the mean policy performance in the Acrobot domain over 300 iterations.

    [0321] FIG. 4 shows the mean policy performance in the Cart Pole Balancing domain over 300 iterations.

    [0322] FIG. 5 shows the mean policy performance at every 10 iterations in the Acrobot domain using 100 trajectories.

    [0323] FIG. 6 shows the mean policy performance at every 10 iterations in the Cart Pole Balancing domain using 100 trajectories.

    [0324] FIG. 8 shows the mean per-iteration Q̂ difference at every 10 iterations in the Cart Pole Balancing domain using 100 trajectories.

    DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

    EXAMPLE 1

    Complex Return Fitted Q-Iteration

    [0325] To meet the challenges of using trajectory data in complex returns, Complex return Fitted Q-Iteration (CFQI) is provided, a generalization of the FQI framework which allows for any general return-based estimate, enabling the seamless integration of complex returns within AVI. Two distinct methods for utilizing complex returns within the CFQI framework are provided. The first method is similar to the idea of Q(λ) [Watkins 1992] and uses truncated portions of trajectories, consistent with the approximation of Q*, to calculate complex return estimates without introducing off-policy bias. The second method is a more novel approach that makes use of the inherent negative bias of complex returns, due to the value iteration context, as a lower bound for value estimates. Statistical evidence and analysis are provided that show how an estimator with predictable, but unknown, bias can provide a bound on value estimates, producing a more accurate estimator. Additionally, convergence proofs are provided showing that CFQI is guaranteed to converge under the same assumptions as FQI. Finally, an empirical evaluation of the methods on several RL benchmarks is provided that shows how CFQI improves the accuracy of the learned Q* approximation, the quality of the learned policy, and convergence behavior.

    [0326] In CFQI, sample data are most commonly collected in sequences known as trajectories. A trajectory, T, is a sequentially ordered collection of observations, T = [(s_0, a_0, s_1, r_1), (s_1, a_1, s_2, r_2), . . . ]. Trajectories provide an alternative to using just the R_t^(1) return. Given trajectory data, the 1-step return estimate in Equation (4) has been generalized to produce the n-step returns:

    [00010] $R_t^{(n)} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a \in A} \hat{Q}_{m-1}(s_{t+n}, a)$  (5)

    [0327] It should be noted that this definition of the n-step returns differs from the standard on-policy definition because of its use of the max operation. In principle, each of the n-step returns can be used as an approximation of Q*(s_t, a_t). Individually, each estimator has its own distinct bias and variance. However, when combined through averaging, they can produce an estimator with lower variance than any one individual return [Dietterich 2000]. It is this idea that motivated the development of complex returns [Sutton 1998].
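    As an illustrative, non-limiting sketch of the n-step return of Equation (5), the following Python function computes R_t^(n) from a list of (s, a, s′, r′) transitions; the helper name `q_max`, standing in for max_a Q̂_{m−1}(s, a), is hypothetical:

```python
def n_step_return(traj, t, n, q_max, gamma=0.95):
    """n-step return R_t^(n): n discounted observed rewards plus a greedy
    bootstrap from the state reached after n steps (Equation (5)).

    traj: list of (s, a, s_next, r_next) transitions, 0-indexed;
    q_max(s): stand-in for max_a Q_hat_{m-1}(s, a) from the prior iteration.
    """
    # Discounted sum of the observed rewards r_{t+1} .. r_{t+n}.
    ret = sum(gamma ** (i - 1) * traj[t + i - 1][3] for i in range(1, n + 1))
    # Greedy bootstrap from s_{t+n}.
    s_after_n = traj[t + n - 1][2]
    return ret + gamma ** n * q_max(s_after_n)
```

    With n = 1 this reduces to the usual 1-step AVI target r_{t+1} + γ max_a Q̂(s_{t+1}, a).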

    [0328] A complex return is a weighted average, with the weights summing to 1, of the n-step returns. The n-step returns are weighted differently because of their assumed relative variance behaviors. The general assumption behind existing complex return methods is that the variance of the n-step returns increases as n increases. From the on-policy literature there are two competing complex return approaches. The classic complex return is the λ-return, which serves as the basis for the TD(λ) family of algorithms [Sutton 1998]. More recently, the γ-return was introduced based upon different variance assumptions for the n-step returns [Konidaris 2011]. The γ-return is defined as:

    [00011] $R_t^{\gamma} = \sum_{n=1}^{|T|} \frac{\left(\sum_{i=1}^{n} \gamma^{2(i-1)}\right)^{-1}}{\sum_{m=1}^{|T|} \left(\sum_{i=1}^{m} \gamma^{2(i-1)}\right)^{-1}}\, R_t^{(n)}$  (6)
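    As an illustrative, non-limiting sketch, the γ-return of Equation (6) can be computed from a list of already-calculated n-step returns; the function name is hypothetical:

```python
def gamma_return(n_step_returns, gamma=0.95):
    """gamma-return of Equation (6): a weighted average of n-step returns.

    n_step_returns: [R_t^(1), ..., R_t^(L)] for one start index t.
    The weight of R^(n) is proportional to 1 / sum_{i=1}^{n} gamma^(2(i-1)),
    reflecting the assumed growth of n-step variance; weights sum to 1.
    """
    L = len(n_step_returns)
    raw = [1.0 / sum(gamma ** (2 * (i - 1)) for i in range(1, n + 1))
           for n in range(1, L + 1)]
    z = sum(raw)  # normalizer so the weights sum to 1
    return sum(w / z * r for w, r in zip(raw, n_step_returns))
```

    Because the weights sum to 1, equal n-step returns yield that common value unchanged, as required of any normalized complex return.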

    [0329] The difficulty in applying complex returns to FQI is that in an AVI context the trajectories can be sampled off-policy and the target policy is unknown. Off-policy trajectories introduce undesirable bias into the n-step return estimates that cannot be reduced through averaging. If the target policy were known, as in policy iteration, importance sampling could be used to reduce off-policy bias [Precup 2001]. However, the target policy is unknown in the AVI context.

    [0330] Complex Fitted Q-Iteration (CFQI) is thus a generalization, according to the present technology, of the popular FQI framework that enables the use of complex return based value estimates. Algorithm 1 provides the details of the approach.

    Algorithm 1: CFQI(𝒯, M, R^C)
    Input: 𝒯: set of trajectories; M: number of iterations; R^C: complex return function
     1: Q̂_0 ← 0
     2: for m = 1 to M do
     3:   Let X and Y be empty sets.
     4:   for k = 1 to |𝒯| do
     5:     for t = 1 to |T_k| do
     6:       X ← Append(X, (s_t^{T_k}, a_t^{T_k}))
     7:       Y ← Append(Y, R^C(t, T_k, Q̂_{m−1}))
     8:     end for
     9:   end for
    10:   Q̂_m ← Regression(X, Y)
    11: end for
    12: Return Q̂_M

    [0331] The primary distinction between FQI and CFQI lies in the update rule (line 7). FQI is limited to the R^(1) return estimate, while CFQI makes use of any chosen complex return R^C to provide value estimates. A second difference is that CFQI processes trajectories, not unordered samples as in FQI. From a computational standpoint, CFQI has the same computational complexity as FQI: the derivations of the n-step returns and complex returns can be performed efficiently by processing trajectories in reverse order. Therefore, the present technology allows for the general use of complex returns as value estimates and does not break the theoretical convergence guarantees of the original approach.
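    The iteration structure of Algorithm 1 can be sketched in Python as follows; this is an illustrative, non-limiting sketch in which `complex_return` and `regress` are hypothetical pluggable callables (e.g., a γ-return and a normalized kernel averager, respectively):

```python
def cfqi(trajectories, M, complex_return, regress):
    """Sketch of Algorithm 1 (CFQI).

    trajectories: list of lists of (s, a, s_next, r_next) tuples;
    complex_return(t, traj, q_prev) -> scalar regression target;
    regress(X, Y) -> fitted function q(s, a).
    """
    q_prev = lambda s, a: 0.0              # Q_hat_0 <- 0
    for m in range(1, M + 1):
        X, Y = [], []
        for traj in trajectories:
            for t in range(len(traj)):
                s, a, _, _ = traj[t]
                X.append((s, a))
                # Line 7: targets come from the chosen complex return,
                # evaluated with the previous iteration's Q estimate.
                Y.append(complex_return(t, traj, q_prev))
        q_prev = regress(X, Y)             # Q_hat_m
    return q_prev                          # Q_hat_M
```

    Substituting the 1-step return for `complex_return` recovers standard FQI, which is the sense in which CFQI generalizes it.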

    [0332] Theorem 1. CFQI converges w.p.1 if a normalized complex return, computed from fixed-length trajectories, is used to derive value targets for a kernel regression model, as defined by Equation (7) and Equation (8).

    [0333] Proof for Theorem 1

    [0334] In previous work, AVI using R^(1) as the value estimator has been shown to converge as long as the regression model is an averager, such as a normalized kernel method [Gordon 1999, Ormoneit 2002, Ernst 2005]. Specifically, the supervised learning method learns a model, Q̂(s, a), defined by:

    [00012] $\hat{Q}(s,a) = \sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) R_{l,t}^{C}$  (7)

    [0335] where R_{l,t}^C is the estimated value for sample t from T_l. R_{l,t}^C can be R^(1), as in standard AVI, or, as shown, any normalized complex return. Additionally, the kernel, k: (S×A)² → ℝ, must satisfy the following normalization condition:

    [00013] $\sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} \left| k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) \right| = 1, \quad \forall (s,a)$  (8)
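    For illustration only, one kernel satisfying the normalization condition of Equation (8) is a Gaussian similarity divided by its sum over all stored samples; the function and parameter names here are hypothetical:

```python
import math

def normalized_gaussian_kernel(samples, query, bandwidth=1.0):
    """Kernel weights satisfying Equation (8): for any query (s, a), the
    absolute weights over all stored sample pairs sum to exactly 1.

    samples: list of (s, a) pairs flattened to numeric tuples;
    query: one such tuple.
    """
    def dist2(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    # Raw nonnegative Gaussian similarities.
    raw = [math.exp(-dist2(p, query) / (2 * bandwidth ** 2)) for p in samples]
    z = sum(raw)
    # Dividing by the sum enforces the normalization condition.
    return [w / z for w in raw]
```

    Because the weights are nonnegative and sum to 1, the resulting regression model is an averager of the targets, which is the property the convergence proof relies on.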

    [0336] Following the proof from [Ernst 2005], the sequence of M Q-functions can be rewritten as Q̂_m = ĤQ̂_{m−1}, where Ĥ is an operator mapping any function in a Banach space ℬ of functions over S×A to ℬ itself. Ĥ is defined as:

    [00014] $(\hat{H}K)(s,a) = \sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) \sum_{n=1}^{|T_l|-t} w(n) \left[ \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a' \in A} K(s_{t+n}, a') \right]$

    [0337] Next, Ĥ is shown to be a contraction in ℬ. Specifically, ‖ĤK − ĤK̄‖_∞ < ‖K − K̄‖_∞ for any K, K̄ ∈ ℬ:

    [00015]
    $\begin{aligned}
    \|\hat{H}K - \hat{H}\bar{K}\|_\infty &= \max_{(s,a) \in S \times A} \left| \sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) \sum_{n=1}^{|T_l|-t} w(n)\,\gamma^{n} \left[ \max_{a' \in A} K(s_{t+n}, a') - \max_{a' \in A} \bar{K}(s_{t+n}, a') \right] \right| \\
    &\le \max_{(s,a) \in S \times A} \sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) \sum_{n=1}^{|T_l|-t} w(n)\,\gamma^{n} \left| \max_{a' \in A} K(s_{t+n}, a') - \max_{a' \in A} \bar{K}(s_{t+n}, a') \right| \\
    &\le \gamma \max_{(s,a) \in S \times A} \sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) \sum_{n=1}^{|T_l|-t} w(n) \max_{a' \in A} \left| K(s_{t+n}, a') - \bar{K}(s_{t+n}, a') \right| \\
    &\le \gamma \max_{(s,a) \in S \times A} \sum_{T_l \in \mathcal{T}} \sum_{t=1}^{|T_l|} k\!\left((s_t^{T_l}, a_t^{T_l}), (s,a)\right) \max_{t',\, (s_{t'}, a') \in S^{T_l} \times A} \left| K(s_{t'}, a') - \bar{K}(s_{t'}, a') \right| \\
    &\le \gamma \max_{(s,a) \in S \times A} \left| K(s,a) - \bar{K}(s,a) \right| = \gamma \|K - \bar{K}\|_\infty < \|K - \bar{K}\|_\infty
    \end{aligned}$

    [0338] By the Banach fixed-point theorem, the proof is complete.

    EXAMPLE 2

    Bounded Fitted Q-Iteration

    [0339] Using complex returns directly in an off-policy framework can be problematic: the bias of off-policy n-step returns will introduce bias into the final result, possibly eliminating any potential benefit from variance reduction. Below, two methods are provided, one for mitigating this bias and one for utilizing it.

    [0340] One way to handle the off-policy bias of the complex returns is to attempt to avoid it by truncating the trajectories where they appear to go off-policy. This idea is borrowed from the Q(λ) [Watkins 1989] approach. In this approach, the current Q̂ provides an approximation of the optimal policy, π̂*, that can be used to approximate when a trajectory takes an off-policy, sub-optimal action. During the calculation of the complex return estimates, samples in a trajectory after the first off-policy action are not considered. Assuming π̂* converges to a close approximation of π*, a strong assumption, this approach should not introduce off-policy bias and can take advantage of portions of trajectories that follow the optimal policy to reduce variance and overall error.

    [0341] However, the assumption that π̂* is an accurate approximation of π* is a poor one, especially early in the iterative process. Additionally, because π̂* will likely change during the iterative process, the lengths of the trajectories used to calculate the complex returns will change dynamically from one iteration to the next. Changing trajectory lengths violates one of the assumptions made by Theorem 1, and convergence may no longer be guaranteed. This issue is examined further in the empirical analysis below.
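    The truncation rule described above can be sketched as follows; this is an illustrative, non-limiting sketch in which `q_hat` and the function name are hypothetical, and a greedy tie is broken arbitrarily:

```python
def truncate_at_off_policy(traj, q_hat, actions):
    """Keep the prefix of a trajectory that appears on-policy under q_hat.

    traj: list of (s, a, s_next, r_next); actions: finite action set.
    Returns the transitions up to and including the first one whose action
    disagrees with argmax_a q_hat(s, a); later samples are discarded so the
    complex return is computed only over apparently on-policy data.
    """
    prefix = []
    for (s, a, s_next, r_next) in traj:
        prefix.append((s, a, s_next, r_next))
        greedy = max(actions, key=lambda act: q_hat(s, act))
        if a != greedy:
            break   # first apparently off-policy action: stop here
    return prefix
```

    Note that because `q_hat` changes across iterations, the truncation points, and hence the effective trajectory lengths, change as well, which is exactly the behavior that conflicts with the fixed-length assumption of Theorem 1.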

    [0342] The second approach utilizes the off-policy n-step return bias rather than attempting to eliminate it. It exploits the predictability of this bias, enabling the effective use of complex returns as a bound on the value of the R^(1) return. In an AVI context, such as CFQI, this is possible because the target policy is an optimal policy. Because the target policy is optimal, it is a safe assumption that any off-policy bias present in the n-step returns is negative in value. A complex return derived from the biased n-step returns will also be biased negatively, but should have relatively low variance. This insight directly leads to the derivation of a bounded complex return, R^B:


    $R^B = \max\left(R^{(1)}, R^C\right)$  (9)

    [0343] where R^C is some chosen complex return function.
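    Applied to a batch of regression targets, Equation (9) is a simple elementwise maximum; the following is an illustrative, non-limiting sketch with hypothetical names:

```python
def bounded_targets(r1_targets, complex_targets):
    """Elementwise bounded return of Equation (9): R^B = max(R^(1), R^C).

    The complex return is expected to be negatively biased in the AVI
    setting, so it is used only as a lower bound on the unbiased but
    higher-variance one-step target, never as a replacement for it.
    """
    return [max(r1, rc) for r1, rc in zip(r1_targets, complex_targets)]
```

    Where the complex return falls below the one-step return, the one-step target is kept unchanged; the bound only acts when the one-step return dips below the low-variance complex estimate.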

    [0344] Here the use of a bounded complex return R^B is motivated. Consider first a simple example, assuming that θ̂ is a degenerate estimator with variance 0 (that is, a constant) that is always less than the true value θ. In this simplified situation, it is always beneficial to use max(R^(1), θ̂) in place of R^(1) to estimate the value. The reason is obvious: when R^(1) is less than θ̂, it is farther from θ than θ̂ is, in which case the greater observation θ̂ should be used. In the worst case, that is, R^(1) > θ̂ with high probability, there is no harm in using max(R^(1), θ̂), since it coincides with R^(1).

    [0345] This simple example shows that sometimes taking the maximum of two estimators can improve both. The stringent assumption can be relaxed: θ̂ may be assumed to have a positive variance and to be negatively biased (its expectation is less than the value θ), so that with high probability, θ̂ < θ. This is not difficult to achieve, since it is true either when the variance is not too large or when the expectation is small enough compared to θ. Again, in this case max(R^(1), θ̂) would be superior to R^(1).

    [0346] The actual improvement of max(R^(1), θ̂) over R^(1), should it exist, may not be substantial if R^(1) < θ̂ rarely occurs. This could happen when, for example, the expectation of θ̂ is too small. Moreover, max(R^(1), θ̂) may even be worse than R^(1) when, for example, the variance of θ̂ is too large (in which case "with high probability, θ̂ < θ" is no longer true).

    [0347] Therefore, in some cases, max(R^(1), θ̂) improves on R^(1). Improvements are possible when Var(θ̂) is small and/or when 𝔼(θ̂) − θ is small enough to ensure that θ̂ < θ, but not so small that θ̂ < R^(1) always. Precisely when the improvements occur depends on the underlying distributions of R^(1) and θ̂. Here both estimators are assumed to follow normal distributions, so the distributions are fully characterized by their expectations and variances. Although in reality the true estimators are only approximately normal at best, the analysis conducted here is sufficient to convey the main message.

    [0348] To show a concrete example, assume that bias(R^(1)) = 𝔼(R^(1)) − θ = 0 and Var(R^(1)) = 1. The bias and variance of θ̂ are chosen from the range [−1.5, 1] × [0, 2]. For each pair of bias and variance, the mean squared error of the estimators R^(1) and max(R^(1), θ̂) is estimated by Monte Carlo integration, in order to identify the set of bias(θ̂) and std(θ̂) values for which max(R^(1), θ̂) improves on R^(1). In FIG. 1, the isoline at 0 is the boundary where the two estimators are equally good. The domain to the southwest of the 0 isoline is precisely the domain of improvement.
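    The Monte Carlo comparison underlying FIG. 1 can be sketched as follows; this is an illustrative, non-limiting sketch (function and parameter names hypothetical) of one point in the bias/variance grid, with R^(1) ~ N(θ, 1) unbiased as assumed above:

```python
import random

def mse_comparison(bias_theta, std_theta, theta=0.0, n=200_000, seed=0):
    """Monte Carlo estimates of MSE(R^(1)) and MSE(max(R^(1), theta_hat)).

    R^(1) ~ N(theta, 1) is unbiased with variance 1;
    theta_hat ~ N(theta + bias_theta, std_theta^2) is the bounding estimator.
    Returns (mse_r1, mse_max); improvement means mse_max < mse_r1.
    """
    rng = random.Random(seed)
    se_r1 = se_max = 0.0
    for _ in range(n):
        r1 = rng.gauss(theta, 1.0)
        th = rng.gauss(theta + bias_theta, std_theta)
        se_r1 += (r1 - theta) ** 2
        se_max += (max(r1, th) - theta) ** 2
    return se_r1 / n, se_max / n
```

    For a point such as bias(θ̂) = −0.5 and std(θ̂) = 0.2, which lies in the improvement domain of FIG. 1, the second MSE comes out below the first.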

    [0349] FIG. 1 shows a contour plot showing the improvement of max(R^(1), θ̂) over R^(1). The hashed and central curves show boundaries between improvement and no improvement for the cases of (solid) bias(R^(1)) = 0, (dashed) bias(R^(1)) = 0.5, and (dotted) bias(R^(1)) = −0.5.

    [0350] Also shown in FIG. 1 are contours of the log ratio of MSE, log(MSE(R^(1))/MSE(max(R^(1), θ̂))) (note that R^C ≈ θ̂). The greater this measure is, the more improvement max(R^(1), θ̂) provides. Clearly, the greatest improvement occurs in the unrealistic case where θ̂ is unbiased and has variance 0. Overall, a combination of small bias and small variance guarantees an improvement. More precisely, when the bias of θ̂ is negative, the variance of θ̂ can be greater than that of R^(1) (=1 in this case) and the maximum can still provide an improvement; the more negative the bias is, the greater the variance allowed. When the bias is too negative or the variance is too large, the improvement becomes negligible. On the other hand, even if the bias of θ̂ is positive, there is still a chance for the maximum to be a better estimator; however, this requires the more stringent assumption that the variance of θ̂ is much smaller.

    [0351] FIG. 1 also shows the boundaries for the cases where R^(1) is biased. The dashed curve and the dotted curve correspond to bias 0.5 and −0.5, respectively. Compared to the solid curve, it is more likely for a maximum estimator such as max(R^(1), θ̂) to improve on R^(1) when R^(1) is itself negatively biased, and vice versa. This is consistent with the motivation to bound the estimator from below so that it does not negatively deviate from the parameter (recall the simple toy example).

    [0352] Characteristics of θ̂ that make max(R^(1), θ̂) better than R^(1), and hence make the bounding strategy work, can be identified.

    [0353] 1. As a bottom line, the expectation of θ̂ needs to be smaller than a positive value τ that satisfies MSE(max(R^(1), τ)) = MSE(R^(1)), or equivalently,

    [00016] $\tau^2\, \Phi(\tau) + \int_{\tau}^{\infty} t^2\, \varphi(t)\, dt = 1$

    in the current example shown in FIG. 1, where Φ(t) and φ(t) are the distribution function and density function of the standard normal distribution (direct calculation leads to τ ≈ 0.8399).
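    The break-even value τ ≈ 0.8399 can be verified numerically; the following illustrative, non-limiting sketch uses the integration-by-parts identity ∫_τ^∞ t²φ(t)dt = τφ(τ) + 1 − Φ(τ) and simple bisection (function names are hypothetical):

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def mse_of_max(tau):
    """Closed form of MSE(max(R^(1), tau)) for R^(1) ~ N(0, 1), theta = 0:
    tau^2 * Phi(tau) + integral_tau^inf t^2 phi(t) dt, with the tail
    integral equal to tau*phi(tau) + 1 - Phi(tau)."""
    return tau ** 2 * Phi(tau) + tau * phi(tau) + 1 - Phi(tau)

def solve_tau(lo=0.0, hi=2.0, iters=60):
    """Bisection for the break-even tau with MSE(max(R^(1), tau)) = 1;
    mse_of_max is increasing on [0, 2], so bisection applies."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mse_of_max(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

    Running `solve_tau()` reproduces the stated value of approximately 0.8399.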

    [0354] 2. The variance of θ̂ (=R^C) should be small in general, though it can be greater than that of R^(1) when the expectation of θ̂ is sufficiently less than θ.

    [0355] 3. The bias of θ̂ (=R^C) should not be overly negative.

    [0356] The first two criteria ensure a safe bound from below such that R^B is no worse than R^(1). The first criterion makes sure that taking the maximum (bounding from below) is meaningful; otherwise an alternative strategy, namely bounding from above, is needed. The second criterion prevents θ̂ from ruining the mean squared error of max(R^(1), θ̂) through large variance.

    [0357] Moreover, the third criterion ensures that the improvement, when it exists, is substantial: θ̂ is not overly negatively biased, which allows a fair chance that R^(1) < θ̂.

    [0358] It is worth noting that for off-policy n-step returns, the expectation generally decreases as n increases. Hence, bias(R^(n)) < bias(R^(n−1)) < . . . < bias(R^(1)). This means that if one were to use an n-step return, or a weighted average of many n-step returns, as the bounding estimator, it is more likely to fall into the improvement domain, because the smaller bias(θ̂) is, the greater the variance allowed, as can be seen in FIG. 1.

    [0359] Consider the n-step returns in the AVI context and their variance and bias properties. Just as the 1-step return R_t^(1) (see Equation (4)) can be used as an estimator for Q*(s_t, a_t) in value iteration, the n-step returns are defined as follows:

    [00017] $R_t^{(n)} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a \in A} \hat{Q}_{m-1}(s_{t+n}, a)$  (10)

    [0360] All of the n-step returns are approximations of Q*(s_t, a_t). Again, the greedy choice by the max operation makes Equation (10) different from the classic n-step return definition used in the Temporal Difference (TD) family of algorithms.

    [0361] A salient feature of the n-step returns is that their variances increase with n due to the stochastic nature of the Markov Decision Process (MDP). The function approximation variance, which can be a substantial component of the overall variance, is often considered to be roughly the same across different samples. The bias of n-step returns is a more complex issue. Among various types of biases (e.g., off-policy bias, function approximation bias, sampling bias, etc.), the behavior of the off-policy bias is unique. When the target policy is an optimal policy, like in the AVI context, the off-policy bias introduced by a suboptimal trajectory is strictly negative, and its magnitude increases as more suboptimal actions are followed towards the end of the trajectory. The same observation can be made when treating any state in a trajectory as the starting point for the rest of the trajectory. In contrast, other types of biases, if they exist, can be positive or negative, and often share roughly the same magnitude across different samples. Therefore, when combining the effects of various sources of bias, the expectation of n-step returns generally decreases as n increases.

    [0362] Given the above analysis, any of the n-step estimators can potentially fall into the domain of improvement. However, it is difficult to determine whether Conditions 1 and 2 identified earlier are met for each of the individual n-step estimators and which one is the best, without detailed prior knowledge of the bias and variance behaviors along a trajectory. Therefore, a logical choice is to consider a weighted average of a number of n-step returns, the so-called complex return.
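    A weighted average of n-step returns can be sketched as follows. Normalized λ-style weights are used here purely as an illustrative choice; the text does not fix a particular weighting at this point, and any normalized weights define a valid complex return.

```python
def complex_return(n_step_returns, lam=0.9):
    """Weighted average of the n-step returns R^(1)..R^(N), i.e. a
    'complex return'. Weights w(n) proportional to lam^(n-1) are
    normalized so they sum to 1 over the finite trajectory."""
    weights = [lam ** i for i in range(len(n_step_returns))]
    total = sum(weights)
    # normalized weighted average of the available n-step returns
    return sum(w * r for w, r in zip(weights, n_step_returns)) / total
```

    Averaging pools the individually noisy n-step estimators, trading a small amount of bias for a substantial reduction in variance.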

    [0363] There is a risk that the bounding method may not work, i.e., the bounding estimator falls into the “bad” region, for example if the bias of R.sup.(1) is positive.

    [0364] Nevertheless, in the AVI setting, where the off-policy bias of the longer returns is strictly negative, a complex return can effectively be utilized as a lower bound for the R.sup.(1) estimator.
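    The lower-bounding rule itself is a single operation: keep the 1-step return unless the complex return is larger. The helper name is ours.

```python
def bounded_estimate(r1, r_complex):
    """Lower-bounded value target R^B. Because the off-policy bias of
    the longer returns is negative in the AVI setting, taking the max
    with the complex return can only move the estimate toward Q*."""
    return max(r1, r_complex)
```
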

    [0365] This insight produces Bounded Fitted Q-Iteration (BFQI), an algorithm that makes use of the complex return approach in an AVI framework to provide improved value estimates. The details for BFQI are provided by Algorithm 2.

    TABLE-US-00002
    Algorithm 2: BFQI(D, γ, M, R.sup.B)
    Input: D: set of trajectories, γ: discount factor,
           M: number of iterations, R.sup.B: bounding return
     1: {circumflex over (Q)}.sub.0 ← 0
     2: for m = 1 to M do
     3:   Let X and Y be empty sets.
     4:   for k = 1 to |D| do
     5:     for all (s.sub.t, a.sub.t, s.sub.t+1, r.sub.t+1) ∈ T.sub.k do
     6:       X ← Append(X, (s.sub.t, a.sub.t))
     7:       Y ← Append(Y, R.sub.t.sup.B)
     8:     end for
     9:   end for
    10:   {circumflex over (Q)}.sub.m ← Regression(X, Y)
    11: end for
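    The iteration of Algorithm 2 can be sketched in Python as follows. The trajectory layout, feature map, and bounding-return callback signatures are illustrative assumptions, and a linear least-squares fit stands in for the generic Regression step.

```python
import numpy as np

def bfqi(trajectories, gamma, M, bounding_return, features):
    """Sketch of BFQI (Algorithm 2).

    trajectories:    list of trajectories, each a list of (s, a, s_next, r)
    bounding_return: callable (trajectory, t, q, gamma) -> R_t^B
    features:        callable (s, a) -> 1-D numpy feature vector
    Q_hat is represented as a linear model fit by least squares.
    """
    theta = None                                # Q_hat_0 = 0 everywhere

    def q(s, a):
        return 0.0 if theta is None else float(features(s, a) @ theta)

    for m in range(M):                          # for m = 1 to M
        X, Y = [], []
        for traj in trajectories:               # for each trajectory T_k
            for t in range(len(traj)):
                s, a, _s_next, _r = traj[t]
                X.append(features(s, a))        # X <- Append(X, (s_t, a_t))
                Y.append(bounding_return(traj, t, q, gamma))  # Y <- R_t^B
        # Q_hat_m <- Regression(X, Y), here ordinary least squares
        theta, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return q
```

    Plugging in a bounding_return that computes max(R.sup.(1), R.sup.C) recovers BFQI; supplying the plain 1-step return recovers FQI.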

    [0366] Theorem 2 assures that BFQI with the bounding method also converges under the same conditions as FQI.

    [0367] Theorem 2 BFQI converges w.p.1 if the R.sup.(1) return is bounded by a normalized complex return on fixed length trajectories to produce value estimates used in a kernel regression model, as defined by Equation (11) and Equation (12).

    [0368] Proof for Theorem 2

    [0369] Following the proof of Theorem 1, Ĥ is defined as:

    [00018] (\hat{H}K)(s,a) = \sum_{T} \sum_{t=1}^{|T|} k((s_t^T, a_t^T), (s,a)) \cdot \max \Big\{ r_{t+1} + \gamma \max_{a' \in A} K(s_{t+1}, a'),   (11)
    \qquad \sum_{n=1}^{|T|-t} w(n) \Big[ \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a' \in A} K(s_{t+n}, a') \Big] \Big\}   (12)

    [0370] Ĥ is shown to be a contraction in the supremum norm:

    [00019] \|\hat{H}K - \hat{H}\bar{K}\|_{\infty}
    = \max_{(s,a) \in S \times A} \Big| \sum_{T} \sum_{t=1}^{|T|} k((s_t^T, a_t^T), (s,a)) \Big( \max \Big\{ r_{t+1} + \gamma \max_{a' \in A} K(s_{t+1}, a'), \sum_{n=1}^{|T|-t} w(n) \Big[ \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a' \in A} K(s_{t+n}, a') \Big] \Big\}
    \quad - \max \Big\{ r_{t+1} + \gamma \max_{a' \in A} \bar{K}(s_{t+1}, a'), \sum_{n=1}^{|T|-t} w(n) \Big[ \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} \max_{a' \in A} \bar{K}(s_{t+n}, a') \Big] \Big\} \Big) \Big|
    \le \max_{(s,a) \in S \times A} \sum_{T} \sum_{t=1}^{|T|} k((s_t^T, a_t^T), (s,a)) \max \Big\{ \gamma \max_{a' \in A} K(s_{t+1}, a') - \gamma \max_{a' \in A} \bar{K}(s_{t+1}, a'), \sum_{n=1}^{|T|-t} w(n) \gamma^{n} \Big[ \max_{a' \in A} K(s_{t+n}, a') - \max_{a' \in A} \bar{K}(s_{t+n}, a') \Big] \Big\}

    [0371] At this point all that remains is to show that both choices in the second max{ } function are independently less than ∥K−K̄∥.sub.∞. The first choice, γmax.sub.a′∈A K(s.sub.t+1, a′)−γmax.sub.a′∈A K̄(s.sub.t+1, a′), was proven in [Ernst 2005]. Finally, the second choice, Σ.sub.n=1.sup.|T|−t w(n)[γ.sup.n max.sub.a′∈A K(s.sub.t+n, a′)−γ.sup.n max.sub.a′∈A K̄(s.sub.t+n, a′)], is proven by Theorem 1.

    EXAMPLE 3

    Comparative Example

    Trajectory Fitted Q-Iteration

    [0372] Trajectory Fitted Q-Iteration (TFQI) [Wright 2013] is an AVI-based algorithm that makes use of the n-step returns. Instead of using a complex return, as provided in Examples 1 and 2, TFQI uses the n-step return with the highest observed value as the sample Q-value estimate:


    R.sup.Max = max(R.sup.(1), R.sup.(2), . . . , R.sup.(|T|))   (13)
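    In code, the TFQI target is simply the maximum over all available n-step returns; the helper name is ours.

```python
def tfqi_target(n_step_returns):
    """R^Max of Equation (13): the maximum over every n-step return
    available on the remainder of the trajectory. More aggressive than
    bounding with a single complex return, since any one n-step return
    can raise the target, which risks overestimation."""
    return max(n_step_returns)
```
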

    EXAMPLE 4

    Inverse Return Function FQI

    [0373] The next logical question is whether an effective upper bound can be derived as well. One hypothesis is that an upper bound may be derived by looking backward along trajectories instead of forward. The inverse return may be derived starting from the original n-step return equation:

    [00020] R_t^{(n)} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} + \gamma^{n} Q(s_{t+n}, a_{t+n})
    R_t^{(n)} - \sum_{i=1}^{n} \gamma^{i-1} r_{t+i} = \gamma^{n} Q(s_{t+n}, a_{t+n})
    \frac{R_t^{(n)} - \sum_{i=1}^{n} \gamma^{i-1} r_{t+i}}{\gamma^{n}} = Q(s_{t+n}, a_{t+n})

    [0374] Substituting Q(s.sub.t, a.sub.t) for R.sub.t.sup.(n) gives:

    [00021] \frac{Q(s_t, a_t) - \sum_{i=1}^{n} \gamma^{i-1} r_{t+i}}{\gamma^{n}} = Q(s_{t+n}, a_{t+n})

    [0375] Finally, subtracting n time-steps from all time indices leaves the definition of the inverse return, R.sup.(−n):

    [00022] R_t^{(-n)} = \gamma^{-n} Q(s_{t-n}, a_{t-n}) - \sum_{i=1}^{n} \gamma^{i-n-1} r_{t-n+i}   (14)

    [0376] This new inverse return function looks back n steps along a trajectory to provide an estimate of Q(s.sub.t, a.sub.t). Intuitively the equation makes sense. It states that the value of where the agent is, Q(s.sub.t, a.sub.t), is equal to the undiscounted value of where it was n steps ago, γ.sup.−nQ(s.sub.t−n, a.sub.t−n), minus the undiscounted value accumulated by the agent to get to where it is,

    [00023] \sum_{i=1}^{n} \gamma^{i-n-1} r_{t-n+i}.
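    Equation (14) can be evaluated directly from the value estimate n steps back and the rewards observed since then. This sketch uses our own helper name and argument layout; note that it inverts the forward n-step return exactly.

```python
def inverse_return(q_back, rewards, gamma, n):
    """Inverse return of Equation (14):
    R_t^(-n) = gamma^(-n) * Q(s_{t-n}, a_{t-n})
               - sum_{i=1}^{n} gamma^(i-n-1) * r_{t-n+i}

    q_back:  the value estimate Q(s_{t-n}, a_{t-n}) from n steps earlier
    rewards: the n rewards [r_{t-n+1}, ..., r_t] observed since then
    """
    assert len(rewards) == n
    # "un-discount" the rewards accumulated on the way to the current state
    accumulated = sum(gamma ** (i - n - 1) * r
                      for i, r in enumerate(rewards, start=1))
    return gamma ** (-n) * q_back - accumulated
```

    As a consistency check, feeding the forward relation Q(s.sub.t−n, a.sub.t−n)=Σγ.sup.i−1r.sub.t−n+i+γ.sup.nQ(s.sub.t, a.sub.t) into this function recovers Q(s.sub.t, a.sub.t) exactly.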

    [0377] In principle, these inverse returns could be used directly to estimate the value of Q(s.sub.t, a.sub.t), just as the forward returns are. However, cursory examination reveals that they have poor variance properties compared to the forward n-step returns, due to the γ.sup.−n term. The discount factor, γ∈(0,1), in the forward view becomes a multiplicative factor in the backward view that amplifies variance. Further, absorbing states, by their nature, are only found at the end of trajectories. As such, there are no absorbing states for inverse returns. Without absorbing states, there is nothing in an inverse return to provide an absolute grounding value. Still, the inverse returns may provide effective upper bounds for an estimator grounded by value estimates provided by the forward returns. As an upper bound, mirroring the statistical analysis of lower bounds discussed above, the expectation of the inverse returns should have small positive bias with variance comparable to the R.sup.(1) return. Examining Equation (14), it is clear that, given sub-optimal trajectories, the inverse returns should exhibit positive bias. Sub-optimal action choices made by the behavior policy will decrease the expected value of

    [00024] \sum_{i=1}^{n} \gamma^{i-n-1} r_{t-n+i},

    introducing positive bias in the inverse returns.

    [0378] The variance of the inverse returns should be controlled if they are to be used effectively as bounds. Just as complex return methods provide a means for controlling the variance of the forward returns, the same weighted-averaging approach can be applied to the inverse returns.

    EXAMPLE 5

    Empirical Results

    [0379] An empirical evaluation of these approaches is provided on several non-deterministic RL benchmarks. The methods are compared based upon accuracy of the learned value function, quality of the derived policy, and convergence behavior. The following methods were employed, annotated with the return they use to estimate value:

    TABLE-US-00003
    Method           Value Estimator
    FQI              R.sup.(1)
    TFQI             R.sup.Max
    CFQI-C.sub.γ     truncated R.sup.γ
    CFQI-B.sub.γ(l)  R.sup.B(R.sup.(1), R.sup.γ)

    [0380] For the CFQI-B.sub.γ(l) method, l denotes the limit on how many steps down the trajectory are used when computing the R.sup.C return. If l is not listed, the full trajectory is used.

    [0381] For the purpose of evaluating this approach, the present bounding method is compared empirically with implementations of the standard R.sub.t.sup.(1) method (FQI), the newer R.sub.t.sup.Max method (TFQI), and (as just discussed) a naive implementation of the TD.sub.λ return within the FQI framework, R.sub.t.sup.λ. Results are shown for the CFQI-B.sub.γ method, equivalent to BFQI, using the TD.sub.γ return as the bound, R.sub.t.sup.B.sup.γ.

    [0382] In all the experiments, linear regression models with Fourier Basis functions [Konidaris 2011] trained using ridge regression are employed. An exhaustive parameter search was performed, varying the complexity of the model, regularization, number of iterations, and trajectory counts. The results reported are representative of the general observed trends.

    [0383] AVI is known to exhibit divergence behavior when paired with this type of function approximation model [Boyan 1995]. This issue was circumvented for these specific problem domains by bounding the values returned by the models by V.sub.max. V.sub.max is the maximum possible value for any state-action pair and can be calculated a priori as:

    [00025] V_{max} = \frac{R_{max}}{1 - \gamma}   (15)

    [0384] where R.sub.max is the maximum single-step reward in the domain. This change was sufficient to ensure convergence for the methods in all the domains tested. This form of function approximation provided superior results to the kernel-averaging-based methods with which AVI is guaranteed to converge. Without bounding the approximation models in this way, all methods exhibited divergence behavior.
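    The V.sub.max safeguard amounts to clipping the regression outputs. Symmetric clipping to [−V.sub.max, V.sub.max] below is a simplifying assumption; a domain-specific lower bound can be supplied instead.

```python
def clip_to_vmax(values, r_max, gamma, v_min=None):
    """Bound model outputs by V_max = R_max / (1 - gamma), Equation (15),
    to prevent divergence with linear / Fourier-basis function
    approximation. v_min overrides the default lower bound of -V_max."""
    v_max = r_max / (1.0 - gamma)
    lo = -v_max if v_min is None else v_min
    return [min(max(v, lo), v_max) for v in values]
```
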

    [0385] The first set of experiments evaluates the accuracy of each approach in deriving Q* using identical trajectory data sets. For this purpose, a non-deterministic 51-state Markov chain similar to the one presented in [Lagoudakis 2003] was used as the testing environment. This environment is chosen because Q* can be calculated exactly using dynamic programming. The goal in this domain is to traverse the chain, starting from some random state, to one of the terminal states, in as few steps as possible. States 0, 25, and 50 are the terminal states. From any non-terminal state the agent can take an action to move to one of the two neighboring states, at a cost of −1. The discount factor, γ, was set to 0.9, and there is a 20% probability that an action taken will result in no transition. The function approximation model uses a 10th-order Fourier basis trained with no regularization.
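    The chain dynamics just described can be sketched as a single transition function; the helper name and action encoding (0 = left, 1 = right) are ours.

```python
import random

def chain_step(state, action, rng):
    """One transition of the 51-state chain: states 0..50 with terminals
    at 0, 25 and 50; action 0 moves left, action 1 moves right; every
    step costs -1; with probability 0.2 the move fails and the agent
    stays put (the 20% no-transition noise described above)."""
    assert state not in (0, 25, 50), "no transitions from terminal states"
    if rng.random() < 0.2:
        return state, -1.0            # action had no effect
    return state + (1 if action == 1 else -1), -1.0
```

    For example, `chain_step(26, 0, random.Random(0))` steps left into the terminal state 25 with reward −1.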

    [0386] In order to evaluate the methods under varying levels of off-policy bias, multiple training sets of 10,000 trajectories were generated. The sets are generated from behavior policies that follow the optimal policy with probability ranging from 0.9 down to 0.5 (the latter equivalent to a random policy) at each step. For each run, 1000 trajectories were selected at random from a chosen repository to form a training data set. The results are the average of 200 runs. Each approach was evaluated based on the average MSE of the {circumflex over (Q)} functions, compared to the true Q* function, after 50 iterations of learning (sufficient to ensure convergence).

    [0387] For completeness, the LSTD-Q algorithm [Lagoudakis 2003], an alternative batch-mode algorithm, is considered. LSTD-Q performed nearly identically to R.sub.t.sup.(1). This finding is expected, given that LSTD-Q and R.sup.(1) optimize the same objective function. Testing was also conducted using R.sub.t.sup.λ without truncating the returns; as expected, it did not work.

    [0388] FIG. 2 shows the average MSE of the learned Q functions. The behavior policy is varied from 90% to 50% of optimal simulating conditions of increasing bias. Error bars are standard error.

    [0389] FIG. 2 shows the experimental results; the bounding approach R.sub.t.sup.B.sup.γ is able to learn Q* as well as or more accurately than R.sub.t.sup.(1), most significantly as the off-policy bias increases. This demonstrates that the complex return provides a stable and effective bound that reduces overall error. Not surprisingly, the naive R.sub.t.sup.λ method is not competitive. The results show the CFQI-based methods are stable and significantly outperform FQI at most levels of off-policy bias. The only exception is with data from a 90% optimal policy, where CFQI performs comparably to FQI. TFQI, on the other hand, shows unstable results. It provides the greatest improvement over R.sub.t.sup.(1) when the off-policy bias is highest, and is significantly better than all other methods on near-random trajectory data, but it performs poorly compared to the other methods when there is less off-policy bias, demonstrating that the R.sup.Max return is prone to overestimation. Comparing CFQI-C and CFQI-B, the bounding approach performs significantly better than its truncated complex return counterpart.

    [0390] In analyzing this result, it was observed that the values of n-step returns, after convergence, are normally distributed and exhibit the general trend of increasing negative bias as predicted.

    [0391] Investigation was performed to determine whether increases in accuracy in the approximation of Q* translate to improved learned policies. Two experiments were performed on challenging RL benchmarks: the Acrobot (Acro) swing-up [Sutton 1998] and the Cart Pole Balancing (CPB) [Sutton 1998] problems. These two problems represent two different classes of domain: goal-oriented and failure-avoidance, respectively. In the Acro domain, the objective is to derive a policy that enables an under-actuated robot to swing up in as few steps as possible, limited to 1,000. A cost of −1 is given for every non-terminal transition. In the CPB domain, by contrast, the goal is to avoid the failure conditions of dropping the pole or exceeding the bounds of the track, for up to 10,000 steps. Here a positive reward of +1 is given for every non-terminal transition. The discount factor was set to γ=0.9999 for both domains. Like the Markov chain, these domains were made non-deterministic by incorporating a 20% probability that an action results in no action having been taken. Fourier bases of order 2, for Acro, and 2 or 3, for CPB, both trained with the same small regularization penalty, are used to represent {circumflex over (Q)}.

    [0392] Policy performance is measured by the mean aggregate reward obtained by running a given policy over 50 trials, necessary due to the non-determinism. Experiments are run on data sets comprising increasing numbers of trajectories to examine the relative sample efficiency of the methods. NeuroEvolution of Augmenting Topologies (NEAT) [Stanley 2002] was used to generate diverse trajectory sets, comprising over 5,000 trajectories, for both domains, as was done in [Wright 2013]. This form of data violates Least Squares Temporal Difference-Q (LSTD-Q)'s assumptions on the sampling distribution, and LSTD-Q is thus excluded from these experiments. Additionally, for clarity of exposition, results for R.sub.t.sup.λ are also excluded, because it was not found to be competitive during testing. The reported results are an average of 200 runs for each experimental setting after 300 iterations of learning. Error bars are not included in the reported results. Instead, statistical significance is determined by performing a paired t-test; statistical significance is claimed in the following analysis at p<0.005.

    [0393] FIG. 3 shows the mean policy performance in the Acrobot domain over 300 iterations.

    [0394] FIG. 4 shows the mean policy performance in the Cart Pole Balancing domain over 300 iterations.

    [0395] TFQI performs the best, significantly outperforming all other methods with the exception of CFQI-B.sub.γ(10) at 100 trajectories. This observation suggests that there is significant negative bias stemming from the trajectories, the model, or both. TFQI is the most aggressive of the bounding approaches and finds the best policies. The results for CFQI-C.sub.γ are purposely omitted from FIG. 3: it has difficulty converging on all data sets in this domain, confirming the suspicion discussed above, and the resulting policies all averaged an aggregate reward of −700 or less.

    [0396] CFQI-B.sub.γ fails to significantly outperform FQI at 10 and 100 trajectories. This demonstrates how the full γ-return can fail the third criterion by incorporating too much off-policy bias. A solution is provided below that limits the length of the complex return. FIG. 3 also shows results for CFQI-B.sub.γ(l) for various l settings. Setting l=2 performs comparably to the default full-length setting, CFQI-B.sub.γ; these are representative of the two extremes of the parameter's range (at l=1, CFQI-B.sub.γ(l) reduces to FQI). Increasing l to 5 makes the method perform significantly better. No measurable improvement is seen increasing l beyond this value; however, l=10 is used in subsequent experiments because 5 may be too short and 10 did not adversely affect performance. These results show CFQI-B.sub.γ(l) provides effective bounds that enable more effective policies to be learned from less data, demonstrating improved sample efficiency.

    [0397] In sharp contrast to the Acro results, TFQI performs the worst in this domain, as shown in FIG. 4. It fails to find a competent policy at all trajectory counts, confirming that it can be an overly aggressive bound and an unstable approach. All other methods perform comparably with FQI, with the exception of CFQI-B.sub.γ. At higher trajectory counts CFQI-B.sub.γ learns a significantly better policy than all other methods. This observation can be explained by the γ-return's long-tail weighting and the specifics of the CPB domain. In the CPB domain all rewards are positive, with the exception of transitions to failure states. As a result there is little sub-optimal trajectory bias in long trajectories. CFQI-B.sub.γ looks the furthest out and produces the most effective lower bound in this domain.

    [0398] Convergence behavior is an important consideration with any AVI approach because it determines how long it will take before a consistent policy can be extracted or if the approach will succeed at all. The convergence behavior is examined based on policy convergence and convergence of the {circumflex over (Q)} models.

    [0399] FIG. 5 shows the final policy performance in the Cart Pole Balancing domain using trajectory sets of increasing size.

    [0400] FIG. 6 shows the mean per-iteration {circumflex over (Q)} difference at every 10 iterations in the Acrobot domain using 100 trajectories.

    [0401] The results for the Acrobot domain are shown in FIGS. 5 and 6. Both figures are generated from the results of the 100-trajectory-count experiments. FIG. 5 shows the policy performance evaluated at every 10th iteration, and FIG. 6 shows the per-iteration difference in {circumflex over (Q)} models. From FIG. 5 it appears that the policies for the methods shown all converge around the 100th iteration. The result for CFQI-C(γ) is omitted due to its poor performance. The explanation for this is that CFQI-C(γ) fails to converge (as shown in FIG. 6). The lack of convergence is caused by the non-fixed length of truncated trajectories. This finding suggests that the CFQI-C approach is not reliable. TFQI converges the fastest of all approaches, followed by the CFQI-B methods, which all converge significantly faster than FQI.

    [0402] FIG. 7 shows the mean policy performance at every 10 iterations in the Cart Pole Balancing domain using 100 trajectories.

    [0403] FIG. 8 shows the mean per-iteration {circumflex over (Q)} difference at every 10 iterations in the Cart Pole Balancing domain using 100 trajectories.

    [0404] FIGS. 7 and 8 show similar results for the CPB domain. FQI, CFQI-C.sub.γ, and CFQI-B.sub.γ(10) all converge to a similar performing policy after 100 iterations. It is somewhat odd that the CFQI-B.sub.γ runs produce near-optimal policies early in the iterative process before converging to a lesser-performing policy. This result demonstrates how there can be a disconnect between deriving an accurate value function and actual policy performance. TFQI also converges quickly, but to a poor policy. FIG. 8, again, shows that the CFQI-C.sub.γ method fails to converge, but it does manage to derive a stable policy. CFQI-B.sub.γ, meanwhile, converges towards the best-performing policy at a significantly quicker rate than FQI.

    [0405] The sub-optimal off-policy bias discussed herein, unique to the AVI context, can therefore be exploited by using complex returns as bounds on value estimates. A new AVI framework, CFQI, and two new approaches based on this framework, CFQI-C and CFQI-B, have been explained. CFQI converges with fixed-length complex returns and when bounding is used. An empirical evaluation is presented that clearly demonstrates that the bounding approach improves the accuracy of value estimates for AVI significantly, resulting in better policies, faster convergence, and improved sample efficiency.

    [0406] Therefore, it is understood that the present technology provides improved performance in approximate value iteration approaches, achieving higher efficiency by providing updating algorithms that employ off-policy trajectory data with optimized upper and lower bound bias.

    [0407] The embodiments and features of this invention may be used in combination and subcombination, without departing from the spirit and scope of the invention disclosed herein.

    [0408] Thus have been described improvements to reinforcement learning technology, which may be applied to, by way of example, and without limitation, robot control (such as bipedal or quadrupedal walking or running, navigation, grasping, and other control skills); vehicle control (autonomous vehicle control, steering control, airborne vehicle control such as helicopter or plane control, autonomous mobile robot control); machine control; control of wired or wireless communication systems; control of laboratory or industrial equipment; control of real or virtual resources (such as memory management, inventory management and the like); drug discovery (where the controlled action is, say, the definition or DNA sequence of a drug and the states are defined by states of a living entity to which the drug is applied); application to a system in which the state of or output from the system is defined by words (text and/or audio and/or image), such as a system employing natural language; application to a trading system in markets such as stock, bond, foreign exchange, repo, options, futures, commodities, insurance, etc. (although the actions taken may have little effect on such a system, very small effects can be sufficient to achieve useful overall rewards); recommenders, advertising delivery platforms, customer relationship management systems and callcenters, social networks, games and video games, HVAC and environmental control, appliance and motor/motion system control, combustion system control, chemical process control, industrial process control, energy production systems and infrastructure, energy storage systems, energy distribution systems, elevator systems, traffic (physical and logical) control systems, network (physical and logical) management systems, other types of energy consumption systems, queue management systems, and others.
The technology may also be used in toys and games, such as quadcopters and hex-copters, cellphones and other portable electronics, office machines such as copiers, scanners and printers, military devices such as munitions and UAVs, vehicle and machine transmissions, vehicle cruise controls, and the like. Other applications for the technology are described in the cited references incorporated herein.

    [0409] In a physical control system, various types of sensors may be employed, such as position, velocity, acceleration, angle, angular velocity, vibration, impulse, gyroscopic, compass, magnetometer, SQUID, SQIF, pressure, temperature, volume, chemical characteristics, mass, illumination, light intensity, biosensors, micro electromechanical system (MEMS) sensors etc. The sensor inputs may be provided directly, through a preprocessing system, or as a processed output of another system.

    [0410] The technology may be applied to image and video processing. In that case, the trajectory may be over time, but may also encompass other physical dimensions, as well as considering objects represented over their respective dimensions [Lange 2010, Lange 2012].

    [0411] The technology may also be applied in non-physical domains, such as in a semantic space, which may have very high dimensionality. For example, a trajectory in the semantic space may represent a series of words or semantic communications, see [Rennie 1999, Van Otterlo 2005, Cuayáhuitl 2009, Cuayáhuitl 2010]. In that case, the processing of the trajectories reveals at least context, and is responsive to changing context. Even modest improvements in efficiency can be significant in such high-dimensionality domains.

    [0412] The present technology is performed using automated data processors, which may be purpose-built and optimized for the algorithm employed, or general purpose hardware. Without limiting the generality, a written description of such a system is as follows:

    [0413] In various embodiments, the System may comprise a standalone computer system, a distributed computer system, a node in a computer network (i.e., a network of computer systems organized in a topology), a network of Systems, and/or the like. It is to be understood that the System and/or the various System elements (e.g., processor, system bus, memory, input/output devices) may be organized in any number of ways (i.e., using any number and configuration of computer systems, computer networks, nodes, System elements, and/or the like) to facilitate System operation. Furthermore, it is to be understood that the various System computer systems, System computer networks, System nodes, System elements, and/or the like may communicate among each other in any number of ways to facilitate System operation. The term “user” refers generally to people and/or computer systems that interact with the System, the term “server” refers generally to a computer system, a program, and/or a combination thereof that handles requests and/or responds to requests from clients via a computer network; the term “client” refers generally to a computer system, a program, a user, and/or a combination thereof that generates requests and/or handles responses from servers via a computer network; the term “node” refers generally to a server, to a client, and/or to an intermediary computer system, program, and/or a combination thereof that facilitates transmission of and/or handling of requests and/or responses.

    [0414] The System includes a processor that executes program instructions. In various embodiments, the processor may be a general purpose microprocessor (e.g., a central processing unit (CPU)), a dedicated microprocessor (e.g., a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a network processor, and/or the like), an external processor, a plurality of processors (e.g., working in parallel, distributed, and/or the like), a microcontroller (e.g., for an embedded system), and/or the like. The processor may be implemented using integrated circuits (ICs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and/or the like. In various implementations, the processor may comprise one or more cores, may include embedded elements (e.g., a coprocessor such as a math coprocessor, a cryptographic coprocessor, a physics coprocessor, and/or the like, registers, cache memory, software), may be synchronous (e.g., using a clock signal) or asynchronous (e.g., without a central clock), and/or the like. For example, the processor may be an AMD FX processor, an AMD Opteron processor, an AMD Geode LX processor, an Intel Core i7 processor, an Intel Xeon processor, an Intel Atom processor, an ARM Cortex processor, an IBM PowerPC processor, and/or the like.

    [0415] The processor may be connected to system memory, e.g., DDR2, DDR3, DDR4, or DDR5, via a system bus. The system bus may interconnect these and/or other elements of the System via electrical, electronic, optical, wireless, and/or the like communication links (e.g., the system bus may be integrated into a motherboard that interconnects System elements and provides power from a power supply). In various embodiments, the system bus may comprise one or more control buses, address buses, data buses, memory buses, peripheral buses, and/or the like. In various implementations, the system bus may be a parallel bus, a serial bus, a daisy chain design, a hub design, and/or the like. For example, the system bus may comprise a front-side bus, a back-side bus, AMD's HyperTransport, Intel's QuickPath Interconnect, a peripheral component interconnect (PCI) bus, an accelerated graphics port (AGP) bus, a PCI Express bus, a low pin count (LPC) bus, a universal serial bus (USB), and/or the like. The system memory, in various embodiments, may comprise registers, cache memory (e.g., level one, level two, level three), read only memory (ROM) (e.g., BIOS, flash memory), random access memory (RAM) (e.g., static RAM (SRAM), dynamic RAM (DRAM), error-correcting code (ECC) memory), and/or the like. The system memory may be discrete, external, embedded, integrated into a CPU, and/or the like. The processor may access, read from, write to, store in, erase, modify, and/or the like, the system memory in accordance with program instructions executed by the processor. The system memory may facilitate accessing, storing, retrieving, modifying, deleting, and/or the like data by the processor.

    [0416] In various embodiments, input/output devices may be connected to the processor and/or to the system memory, and/or to one another via the system bus. In some embodiments, the input/output devices may include one or more graphics devices. The processor may make use of the one or more graphic devices in accordance with program instructions executed by the processor. In one implementation, a graphics device may be a video card that may obtain (e.g., via a connected video camera), process (e.g., render a frame), output (e.g., via a connected monitor, television, and/or the like), and/or the like graphical (e.g., multimedia, video, image, text) data (e.g., SYSTEM data). A video card may be connected to the system bus via an interface such as PCI, AGP, PCI Express, USB, PC Card, ExpressCard, and/or the like. A video card may use one or more graphics processing units (GPUs), for example, by utilizing AMD's CrossFireX and/or NVIDIA's SLI technologies. A video card may be connected via an interface (e.g., video graphics array (VGA), digital video interface (DVI), Mini-DVI, Micro-DVI, high-definition multimedia interface (HDMI), DisplayPort, Thunderbolt, composite video, S-Video, component video, and/or the like) to one or more displays (e.g., cathode ray tube (CRT), liquid crystal display (LCD), touchscreen, and/or the like) that display graphics. For example, a video card may be an NVidia GRID K2, K1, Tesla K10, K20X, K40, M2075 Quadro K6000, K5200, K2000, AMD FirePro W9000, W8100, AMD Radeon HD 6990, an ATI Mobility Radeon HD 5870, an AMD FirePro V9800P, an AMD Radeon E6760 MXM V3.0 Module, an NVIDIA GeForce GTX 590, an NVIDIA GeForce GTX 580M, an Intel HD Graphics 3000, and/or the like. A graphics device may operate in combination with other graphics devices (e.g., in parallel) to provide improved capabilities, data throughput, color depth, and/or the like.

    [0417] In some embodiments, the input/output devices may include one or more network devices. The processor may make use of the one or more network devices in accordance with program instructions executed by the processor. In one implementation, a network device may be a network card that may obtain (e.g., via a Category 6A Ethernet cable), process, output (e.g., via a wireless antenna), and/or the like network data. A network card may be connected to the system bus via an interface such as PCI, PCI Express, USB, FireWire, PC Card, ExpressCard, and/or the like. A network card may be a wired network card (e.g., 10/100/1000, optical fiber), a wireless network card (e.g., Wi-Fi 802.11a/b/g/n/ac/ad, Bluetooth, Near Field Communication (NFC), TransferJet), a modem (e.g., dialup telephone-based, asymmetric digital subscriber line (ADSL), cable modem, power line modem, wireless modem based on cellular protocols such as high speed packet access (HSPA), evolution-data optimized (EV-DO), global system for mobile communications (GSM), worldwide interoperability for microwave access (WiMax), long term evolution (LTE), and/or the like, satellite modem, FM radio modem, radio-frequency identification (RFID) modem, infrared (IR) modem), and/or the like. For example, a network card may be an Intel EXPI9301CT, an Intel EXPI9402PT, a LINKSYS USB300M, a BUFFALO WLI-UC-G450, a Rosewill RNX-MiniN1, a TRENDnet TEW-623PI, a Rosewill RNX-N180UBE, an ASUS USB-BT211, a MOTOROLA SB6120, a U.S. Robotics USR5686G, a Zoom 5697-00-00F, a TRENDnet TPL-401E2K, a D-Link DHP-W306AV, a StarTech ET91000SC, a Broadcom BCM20791, a Broadcom InConcert BCM4330, a Broadcom BCM4360, an LG VL600, a Qualcomm MDM9600, a Toshiba TC35420 TransferJet device, and/or the like. A network device may be discrete, external, embedded, integrated into a motherboard, and/or the like.
A network device may operate in combination with other network devices (e.g., in parallel) to provide improved data throughput, redundancy, and/or the like. For example, protocols such as link aggregation control protocol (LACP) based on the IEEE 802.3ad-2000 or IEEE 802.1AX-2008 standards may be used. A network device may be used to connect to a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a personal area network, the Internet, an intranet, a Bluetooth network, an NFC network, a Wi-Fi network, a cellular network, and/or the like.

    [0418] In some embodiments, the input/output devices may include one or more peripheral devices. The processor may make use of the one or more peripheral devices in accordance with program instructions executed by the processor. In various implementations, a peripheral device may be a monitor, a touchscreen display, active shutter 3D glasses, head-tracking 3D glasses, a camera, a remote control, an audio line-in, an audio line-out, a microphone, headphones, speakers, a subwoofer, a router, a hub, a switch, a firewall, an antenna, a keyboard, a mouse, a trackpad, a trackball, a digitizing tablet, a stylus, a joystick, a gamepad, a game controller, a force-feedback device, a laser, sensors (e.g., proximity sensor, rangefinder, ambient temperature sensor, ambient light sensor, humidity sensor, an accelerometer, a gyroscope, a motion sensor, an olfaction sensor, a biosensor, a chemical sensor, a magnetometer, a radar, a sonar, a location sensor such as global positioning system (GPS), Galileo, GLONASS, and/or the like), a printer, a fax, a scanner, a copier, a card reader, and/or the like. A peripheral device may be connected to the system bus via an interface such as PCI, PCI Express, USB, FireWire, VGA, DVI, Mini-DVI, Micro-DVI, HDMI, DisplayPort, Thunderbolt, composite video, S-Video, component video, PC Card, ExpressCard, serial port, parallel port, PS/2, TS, TRS, RCA, TOSLINK, network connection (e.g., wired such as Ethernet, optical fiber, and/or the like, wireless such as Wi-Fi, Bluetooth, NFC, cellular, and/or the like), a connector of another input/output device, and/or the like. A peripheral device may be discrete, external, embedded, integrated (e.g., into a processor, into a motherboard), and/or the like. A peripheral device may operate in combination with other peripheral devices (e.g., in parallel) to provide the System with a variety of input, output, and processing capabilities.

    [0419] In some embodiments, the input/output devices may include one or more storage devices. The processor may access, read from, write to, store in, erase, modify, and/or the like a storage device in accordance with program instructions executed by the processor. A storage device may facilitate accessing, storing, retrieving, modifying, deleting, and/or the like data by the processor. In one implementation, the processor may access data from the storage device directly via the system bus. In another implementation, the processor may access data from the storage device by instructing the storage device to transfer the data to the system memory and accessing the data from the system memory. In various embodiments, a storage device may be a hard disk drive (HDD), a solid-state drive (SSD), a floppy drive using diskettes, an optical disk drive (e.g., compact disk (CD-ROM) drive, CD-Recordable (CD-R) drive, CD-Rewriteable (CD-RW) drive, digital versatile disc (DVD-ROM) drive, DVD-R drive, DVD-RW drive, Blu-ray disk (BD) drive) using an optical medium, a magnetic tape drive using a magnetic tape, a memory card (e.g., a USB flash drive, a compact flash (CF) card, a secure digital extended capacity (SDXC) card), a network attached storage (NAS), a direct-attached storage (DAS), a storage area network (SAN), other processor-readable physical mediums, and/or the like. A storage device may be connected to the system bus via an interface such as PCI, PCI Express, USB, FireWire, PC Card, ExpressCard, integrated drive electronics (IDE), serial advanced technology attachment (SATA), external SATA (eSATA), small computer system interface (SCSI), serial attached SCSI (SAS), fibre channel (FC), network connection (e.g., wired such as Ethernet, optical fiber, and/or the like; wireless such as Wi-Fi, Bluetooth, NFC, cellular, and/or the like), and/or the like.
A storage device may be discrete, external, embedded, integrated (e.g., into a motherboard, into another storage device), and/or the like. A storage device may operate in combination with other storage devices to provide improved capacity, data throughput, data redundancy, and/or the like. For example, protocols such as redundant array of independent disks (RAID) (e.g., RAID 0 (striping), RAID 1 (mirroring), RAID 5 (striping with distributed parity), hybrid RAID), just a bunch of drives (JBOD), and/or the like may be used. In another example, virtual and/or physical drives may be pooled to create a storage pool. In yet another example, an SSD cache may be used with an HDD to improve speed. Together and/or separately the system memory and the one or more storage devices may be referred to as memory (i.e., physical memory). System memory contains processor-operable (e.g., accessible) data stores. Such data may be organized using one or more data formats such as a database (e.g., a relational database with database tables, an object-oriented database, a graph database, a hierarchical database), a flat file (e.g., organized into a tabular format), a binary file, a structured file (e.g., an HTML file, an XML file), a text file, and/or the like. Furthermore, data may be organized using one or more data structures such as an array, a queue, a stack, a set, a linked list, a map, a tree, a hash, a record, an object, a directed graph, and/or the like. In various embodiments, data stores may be organized in any number of ways (i.e., using any number and configuration of data formats, data structures, System elements, and/or the like).
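By way of illustration of the striping behavior described above, the following sketch (not part of the claimed system; drive count and block numbering are hypothetical) shows how RAID 0 maps consecutive logical blocks across member drives, which is the source of its improved throughput:

```python
# Illustrative sketch of RAID 0 (striping): logical block b on an array of
# n drives lands on drive (b mod n) at stripe (b div n), so consecutive
# blocks alternate across drives and may be read/written in parallel.

def raid0_location(logical_block: int, num_drives: int) -> tuple:
    """Map a logical block number to (drive index, stripe index) under RAID 0."""
    return (logical_block % num_drives, logical_block // num_drives)

# With a hypothetical 3-drive array, the first six blocks lay out as:
layout = [raid0_location(b, 3) for b in range(6)]
print(layout)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

RAID 1 (mirroring) would instead place every block on all drives, trading capacity for redundancy rather than throughput.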

    [0420] In some embodiments, components may include an operating environment component. The operating environment component may facilitate operation via various subcomponents. In some implementations, the operating environment component may include an operating system subcomponent. The operating system subcomponent may provide an abstraction layer that facilitates the use of, communication among, common services for, interaction with, and security of various System elements, components, data stores, and/or the like. In some embodiments, the operating system subcomponent may facilitate execution of program instructions by the processor by providing process management capabilities. For example, the operating system subcomponent may facilitate the use of multiple processors, the execution of multiple processes, multitasking, and/or the like. In some embodiments, the operating system subcomponent may facilitate the use of memory. For example, the operating system subcomponent may allocate and/or free memory, facilitate memory addressing, provide memory segmentation and/or protection, provide virtual memory capability, facilitate caching, and/or the like. In another example, the operating system subcomponent may include a file system (e.g., File Allocation Table (FAT), New Technology File System (NTFS), Hierarchical File System Plus (HFS+), Universal Disk Format (UDF), Linear Tape File System (LTFS)) to facilitate storage, retrieval, deletion, aggregation, processing, generation, and/or the like of data. In some embodiments, the operating system subcomponent may facilitate operation of and/or processing of data for and/or from input/output devices. For example, the operating system subcomponent may include one or more device drivers, interrupt handlers, file systems, and/or the like that allow interaction with input/output devices.
In some embodiments, the operating system subcomponent may facilitate operation of the System as a node in a computer network by providing support for one or more communications protocols. For example, the operating system subcomponent may include support for the internet protocol suite (i.e., Transmission Control Protocol/Internet Protocol (TCP/IP)) of network protocols such as TCP, IP, User Datagram Protocol (UDP), Mobile IP, and/or the like. In another example, the operating system subcomponent may include support for security protocols (e.g., Wired Equivalent Privacy (WEP), Wi-Fi Protected Access (WPA), WPA2) for wireless computer networks. In yet another example, the operating system subcomponent may include support for virtual private networks (VPNs). In various embodiments, the operating system subcomponent may comprise a single-user operating system, a multi-user operating system, a single-tasking operating system, a cluster or high-performance operating system, a multitasking operating system, a single-processor operating system, a multiprocessor operating system, a distributed operating system, an embedded operating system, a real-time operating system, and/or the like. For example, the operating system subcomponent may comprise an operating system such as UNIX, LINUX, IBM i, Sun Solaris, Microsoft Windows Server, Microsoft DOS, Microsoft Windows 7, 8, 8.1, 10, Apple Mac OS X, Apple iOS, Android, Symbian, Windows Phone 7, Windows Phone 8, Blackberry QNX, and/or the like.

    [0421] In some implementations, the operating environment component may include a database subcomponent. The database subcomponent may facilitate capabilities such as storage, analysis, retrieval, access, modification, deletion, aggregation, generation, and/or the like of data (e.g., the use of data stores 530). The database subcomponent may make use of database languages (e.g., Structured Query Language (SQL), XQuery), stored procedures, triggers, APIs, and/or the like to provide these capabilities. In various embodiments, the database subcomponent may comprise a cloud database, a data warehouse, a distributed database, an embedded database, a parallel database, a real-time database, and/or the like. For example, the database subcomponent may comprise a database such as Microsoft SQL Server, Microsoft Access, MySQL, IBM DB2, Oracle Database, Apache Cassandra database, and/or the like.
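As an illustration (not part of the claimed system), the storage, retrieval, modification, and deletion capabilities described for the database subcomponent can be expressed with SQL against an embedded relational database, here using Python's built-in sqlite3 module; the table and column names are hypothetical:

```python
# Illustrative sketch: exercising database-subcomponent capabilities
# (store, modify, delete, retrieve) via SQL on an embedded database.
import sqlite3

conn = sqlite3.connect(":memory:")  # embedded, in-memory relational database
conn.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, value REAL)")

# Storage: insert rows into a database table.
conn.executemany("INSERT INTO states (id, value) VALUES (?, ?)",
                 [(1, 0.5), (2, 0.9)])

# Modification and deletion.
conn.execute("UPDATE states SET value = ? WHERE id = ?", (0.7, 1))
conn.execute("DELETE FROM states WHERE id = ?", (2,))

# Retrieval.
rows = conn.execute("SELECT id, value FROM states").fetchall()
print(rows)  # [(1, 0.7)]
conn.close()
```

A server-based database (e.g., MySQL or Microsoft SQL Server) would accept the same SQL statements over a network connection rather than an in-process handle.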

    [0422] In some implementations, the operating environment component may include an information handling subcomponent. The information handling subcomponent may provide capabilities to serve, deliver, upload, obtain, present, download, and/or the like a variety of information. The information handling subcomponent may use protocols such as Hypertext Transfer Protocol (HTTP), Hypertext Transfer Protocol Secure (HTTPS), File Transfer Protocol (FTP), Telnet, Secure Shell (SSH), Transport Layer Security (TLS), Secure Sockets Layer (SSL), peer-to-peer (P2P) protocols (e.g., BitTorrent), and/or the like to handle communication of information such as web pages, files, multimedia content (e.g., streaming media), applications, and/or the like.
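The serve/obtain pairing described above can be sketched (purely illustratively; the page content, address, and handler are hypothetical and not drawn from the specification) with Python's standard-library HTTP server and client:

```python
# Illustrative sketch: serving a web page over HTTP and obtaining it,
# as the information handling subcomponent might.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve: deliver a small (hypothetical) web page to any GET request.
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request logging for this sketch

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Obtain: download the page via HTTP.
url = "http://127.0.0.1:%d/" % server.server_port
with urllib.request.urlopen(url) as resp:
    page = resp.read()
server.shutdown()
print(page)
```

HTTPS would add a TLS layer around the same request/response exchange, typically via a certificate-bearing socket wrapper rather than any change to the HTTP messages themselves.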

    [0423] In some implementations, the operating environment component may include a virtualization subcomponent that facilitates virtualization capabilities. In some embodiments, the virtualization subcomponent may provide support for platform virtualization (e.g., via a virtual machine). Platform virtualization types may include full virtualization, partial virtualization, paravirtualization, and/or the like. In some implementations, platform virtualization may be hardware-assisted (e.g., via support from the processor using technologies such as AMD-V, Intel VT-x, and/or the like). In some embodiments, the virtualization subcomponent may provide support for various other virtualized environments such as via operating-system level virtualization, desktop virtualization, workspace virtualization, mobile virtualization, application virtualization, database virtualization, and/or the like. In some embodiments, the virtualization subcomponent may provide support for various virtualized resources such as via memory virtualization, storage virtualization, data virtualization, network virtualization, and/or the like. For example, the virtualization subcomponent may comprise VMware software suite (e.g., VMware Server, VMware Workstation, VMware Player, VMware ESX, VMware ESXi, VMware ThinApp, VMware Infrastructure), Parallels software suite (e.g., Parallels Server, Parallels Workstation, Parallels Desktop, Parallels Mobile, Parallels Virtuozzo Containers), Oracle software suite (e.g., Oracle VM Server for SPARC, Oracle VM Server for x86, Oracle VM VirtualBox, Oracle Solaris 10, Oracle Solaris 11), Informatica Data Services, Wine, and/or the like.