TRAINING ACTION SELECTION NEURAL NETWORKS USING Q-LEARNING COMBINED WITH LOOK AHEAD SEARCH

20220366247 · 2022-11-17

    Abstract

    A reinforcement learning system and method that selects actions to be performed by an agent interacting with an environment. The system uses a combination of reinforcement learning and a look ahead search: Reinforcement learning Q-values are used to guide the look ahead search and the search is used in turn to improve the Q-values. The system learns from a combination of real experience and simulated, model-based experience.

    Claims

    1. A computer implemented method of learning to select actions to be performed by an agent in an environment for performing a task, wherein an action selection neural network is configured to receive data from an input observation characterizing a state of the environment and to process the input observation in accordance with action selection neural network parameters to generate an action selection output for defining a set of action selection scores for selecting the actions to be performed by the agent, wherein the method comprises: receiving an observation characterizing a current state of the environment; performing a look ahead search from the current state of the environment over possible future states of the environment, wherein performing the look ahead search comprises, at each of a series of search steps, determining an action by processing data characterizing a state of the environment at the search step using the action selection neural network to select an action for the search step, and processing the action for the search step using a model to determine data characterizing a state of the environment at a next search step; determining an updated action selection score for the current state of the environment from a result of the look ahead search; selecting an action to be performed by the agent using the updated action selection score; receiving an observation characterizing a next state of the environment; storing in a replay buffer transition data characterizing: the current state of the environment, the action, the next state of the environment, a reward received in response to the agent performing the action, and the updated action selection score; sampling the transition data stored in the replay buffer; and adjusting the action selection neural network parameters using gradients of an objective function determined from the sampled transition data, wherein the objective function comprises a first term dependent on the reward received and a second term dependent upon the updated action selection score.

    2. The method of claim 1 wherein performing the look ahead search comprises searching a state tree having nodes representing states of the environment from a root node representing the current state of the environment, wherein the series of search steps continues to a terminal node of the search, the method further comprising processing data characterizing the terminal search state of the environment to determine the action selection output for the terminal search state of the environment, and determining the result of the look ahead search from the action selection output for the terminal search state of the environment.

    3. The method of claim 2 wherein determining the result of the look ahead search further comprises estimating a reward from the nodes of the search tree between the root node and the terminal node, and adding the estimated reward to an estimated value of the terminal state determined from the action selection output for the terminal search state of the environment.

    4. The method of claim 2 wherein determining the updated action selection score comprises summing a value derived from the result of the look ahead search and an action selection score from the action selection output for the current state of the environment.

    5. The method of claim 2 wherein the action selection output of the action selection neural network comprises the set of action selection scores, and wherein using the action selection neural network to select an action for the search step comprises selecting an action with a maximum action selection score for the search step.

    6. The method of claim 5 wherein the action selection output for the terminal search state of the environment comprises a maximum action selection score for the terminal search state of the environment.

    7. The method of claim 1 wherein the first term of the objective function comprises a temporal difference learning term dependent upon a difference, based on the transition data sampled from the replay buffer, between the output of the action selection neural network for the current state of the environment and a combination of the reward received and an estimated future return from the next state of the environment from the action selection neural network.

    8. The method of claim 1 wherein the second term of the objective function is dependent upon a difference, based on the transition data sampled from the replay buffer, between the action selection output for the current state of the environment and the updated action selection score for the current state of the environment.

    9. The method of claim 1 wherein the second term of the objective function comprises a cross-entropy loss between a set of action selection scores for the current state of the environment from the action selection neural network and a corresponding set of updated action selection scores for the current state of the environment from the look ahead search.

    10. The method of claim 1 wherein the set of action selection scores comprise Q-values for a discrete set of actions, and wherein the action selection output comprises a Q-value output.

    11. The method of claim 10 wherein using the action selection neural network to select an action for the search step comprises selecting an action with a highest Q-value, subject to exploration noise.

    12. (canceled)

    13. (canceled)

    14. (canceled)

    15. A neural network system for learning to select actions to be performed by an agent in an environment for performing a task, the system comprising: an action selection neural network configured to receive data from an input observation characterizing a state of the environment and to process the input observation in accordance with action selection neural network parameters to generate an action selection output for defining a set of action selection scores for selecting the actions to be performed by the agent; a replay buffer to store transition data; and a look ahead search subsystem; wherein the look ahead search subsystem is configured to: receive an observation characterizing a current state of the environment; perform a look ahead search from the current state of the environment over possible future states of the environment, wherein performing the look ahead search comprises, at each of a series of search steps, determining an action by processing data characterizing a state of the environment at the search step using the action selection neural network to select an action for the search step, and processing the action for the search step using a model to determine data characterizing a state of the environment at a next search step; determine an updated action selection score for the current state of the environment from a result of the look ahead search; and wherein the neural network system is configured to: select an action to be performed by the agent using the updated action selection score; receive an observation characterizing a next state of the environment; store in the replay buffer transition data characterizing: the current state of the environment, the action, the next state of the environment, a reward received in response to the agent performing the action, and the updated action selection score; sample the transition data stored in the replay buffer; and adjust the action selection neural network parameters using gradients of an objective function determined from the sampled transition data, wherein the objective function comprises a first term dependent on the reward received and a second term dependent upon the updated action selection score.

    16. (canceled)

    17. The system of claim 15 wherein performing the look ahead search comprises searching a state tree having nodes representing states of the environment from a root node representing the current state of the environment, wherein the series of search steps continues to a terminal node of the search, and wherein the system is further configured to process data characterizing the terminal search state of the environment to determine the action selection output for the terminal search state of the environment, and to determine the result of the look ahead search from the action selection output for the terminal search state of the environment.

    18. The system of claim 17 wherein determining the result of the look ahead search further comprises estimating a reward from the nodes of the search tree between the root node and the terminal node, and adding the estimated reward to an estimated value of the terminal state determined from the action selection output for the terminal search state of the environment.

    19. The system of claim 17 wherein determining the updated action selection score comprises summing a value derived from the result of the look ahead search and an action selection score from the action selection output for the current state of the environment.

    20. The system of claim 17 wherein the action selection output of the action selection neural network comprises the set of action selection scores, and wherein using the action selection neural network to select an action for the search step comprises selecting an action with a maximum action selection score for the search step.

    21. The system of claim 17, wherein the action selection output for the terminal search state of the environment comprises a maximum action selection score for the terminal search state of the environment.

    22. The system of claim 15 wherein the first term of the objective function comprises a temporal difference learning term dependent upon a difference, based on the transition data sampled from the replay buffer, between the output of the action selection neural network for the current state of the environment and a combination of the reward received and an estimated future return from the next state of the environment from the action selection neural network.

    23. The system of claim 15 wherein the second term of the objective function is dependent upon a difference, based on the transition data sampled from the replay buffer, between the action selection output for the current state of the environment and the updated action selection score for the current state of the environment.

    24. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers are operable to cause the one or more computers to perform operations for learning to select actions to be performed by an agent in an environment for performing a task, wherein an action selection neural network is configured to receive data from an input observation characterizing a state of the environment and to process the input observation in accordance with action selection neural network parameters to generate an action selection output for defining a set of action selection scores for selecting the actions to be performed by the agent, wherein the operations comprise: receiving an observation characterizing a current state of the environment; performing a look ahead search from the current state of the environment over possible future states of the environment, wherein performing the look ahead search comprises, at each of a series of search steps, determining an action by processing data characterizing a state of the environment at the search step using the action selection neural network to select an action for the search step, and processing the action for the search step using a model to determine data characterizing a state of the environment at a next search step; determining an updated action selection score for the current state of the environment from a result of the look ahead search; selecting an action to be performed by the agent using the updated action selection score; receiving an observation characterizing a next state of the environment; storing in a replay buffer transition data characterizing: the current state of the environment, the action, the next state of the environment, a reward received in response to the agent performing the action, and the updated action selection score; sampling the transition data stored in the replay buffer; and adjusting the action selection neural network parameters using gradients of an objective function determined from the sampled transition data, wherein the objective function comprises a first term dependent on the reward received and a second term dependent upon the updated action selection score.

    Description

    BRIEF DESCRIPTION OF DRAWINGS

    [0052] FIG. 1 shows a reinforcement learning system that implements a look ahead search.

    [0053] FIGS. 2a and 2b are flow diagrams of an example process for training the reinforcement learning system of FIG. 1.

    [0054] FIGS. 3a and 3b illustrate schematically the operation of an example of the reinforcement learning system of FIG. 1.

    [0055] In the Figures like reference numerals indicate like elements.

    DETAILED DESCRIPTION

    [0056] This specification describes a neural network based reinforcement learning system and method. The system combines model-free reinforcement learning with model-based planning to improve training. The model-based planning uses a look ahead search, such as a Monte Carlo Tree Search (MCTS), to improve the performance of the model-free learning. The look ahead search may employ a learned model or a simulation of an environment in which the system operates.

    [0057] In implementations an action selection neural network defines values, such as state-action values or Q-values, which are used to guide the look ahead search, and results from the search are in turn used to improve an output of the action selection neural network defining the values. Thus the action selection neural network learns from a combination of a value-based objective function, and an “amortized” objective function which encourages the output of the action selection neural network towards values derived from the look ahead search. The system therefore learns from a combination of real experience and simulated, i.e. model-based experience.

    [0058] The approach can be used with any reinforcement learning system with access to a model, and by contrast to typical model-based approaches, is able to learn rapidly with a very small search budget. For example whereas some systems can require thousands of MCTS model evaluations per action during training, some implementations of the system described herein can use one or two orders of magnitude fewer evaluations. This is particularly useful where the simulation is computationally expensive, for example because it is complex or because the action space is large.

    [0059] FIG. 1 shows a reinforcement learning system 100 that may be implemented as computer programs on one or more computers in one or more locations. The reinforcement learning system 100, at each of multiple time steps, t, selects an action, a, to be performed by an agent 102 in an environment 104. At each time step the reinforcement learning system 100 receives and processes data characterizing a state, s, of the environment, referred to herein as an observation, o, for selecting an action.

    [0060] Thus in this specification references to a state, s, of the environment are used as shorthand for references to an observation, o, characterizing the state, s, of the environment. The observation may include an image of the environment and/or other sensor or input data from the environment. Such observations are typically pre-processed e.g. by one or more convolutional neural network layers, and/or by one or more recurrent neural network layers for example where the environment is only partially observed (not shown in FIG. 1).

    [0061] The reinforcement learning system 100 may also receive a reward r as a result of performing the action a. In general the reward is a numerical value and may be based on any event or aspect of the environment. For example, the reward r may indicate whether the agent 102 has accomplished a task (e.g., a manipulation task, or navigating to a target location in the environment) or progress of the agent 102 towards accomplishing a task.

    [0062] In some implementations, the environment is a real-world environment or a simulation of a real-world environment. The agent may comprise a mechanical agent, such as a robot or vehicle, or a simulation of a mechanical agent, or a control system for a mechanical agent. Alternatively, for example, the agent may comprise a software agent to perform a control or routing task in a communications network or computer system, or for an integrated circuit. Other applications of the reinforcement learning system 100 have been described previously.

    [0063] Referring again to FIG. 1, the reinforcement learning system 100 includes an action selection neural network 120. The action selection neural network 120 receives and processes an observation o of a state s and generates an action selection output for defining a set of action selection scores which are used for selecting actions to be performed by the agent. For example the action selection output may determine the action selection scores, or may parameterize a set of distributions representing the action selection scores from which action selection scores may be selected stochastically.

    [0064] The reinforcement learning system 100 also includes a look ahead search subsystem 130. As described later, the action selection scores are used as a prior for a look ahead search performed by the look ahead search subsystem 130. In some implementations the action selection neural network 120 is a Q-value neural network with parameters θ, Q.sub.θ, that receives an action, a, and generates an action-value score Q.sub.θ(s, a).

    [0065] The look ahead search subsystem 130 generates an updated set of action selection scores by performing the look ahead search, from a current state of the environment over possible future states of the environment, using the set of action selection scores from the action selection neural network 120. In some implementations the look ahead search subsystem 130 performs a Monte Carlo Tree Search (MCTS) and the updated set of action selection scores comprises an improved Q-value, Q.sub.MCTS(s, a), for each action a of a set of possible actions.

    [0066] More particularly, at each of a series of search steps the look ahead search subsystem 130 determines an action by processing data characterizing a state of the environment at the search step, i.e. data from an observation o, using the action selection neural network, to thereby progress down a search tree. The action is determined by a search policy, which depends on the action selection scores from the action selection neural network. The action for the search step is processed using a model (simulation) to determine data characterizing a state of the environment at a next search step. Each iteration of the look ahead search is used to progressively refine the action selection scores, and after a number of iterations an updated (improved) set of action selection scores is returned, e.g. Q.sub.MCTS(s, a) for each possible action a in state s, later denoted Q.sub.MCTS(s,⋅).

    [0067] An actor subsystem 110 uses the updated set of action selection scores to select an action. For example the actor subsystem 110 may select an action with a highest action selection score or, for example, the action selection scores may be processed using a soft-max function to generate respective probability values which can be used to select the action. Optionally the actor subsystem 110 may also implement an exploration policy such as an epsilon-greedy exploration policy, effectively adding exploration noise to the action selections e.g. by choosing an action (from those explored during the search) uniformly at random with a probability ε. As previously described, in response the reinforcement learning system 100 receives an observation characterizing a next state of the environment, and optionally a reward.
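
    As a concrete illustration of this action selection step, the following Python sketch (hypothetical helper and data layout, not the patent's implementation) chooses an action from the updated scores Q.sub.MCTS(s,⋅) with an epsilon-greedy policy:

```python
import random

def select_action(q_mcts, epsilon=0.05):
    """Epsilon-greedy selection from updated action selection scores.

    q_mcts: dict mapping each searched action to its updated score
    Q_MCTS(s, a) (a hypothetical data layout, for illustration only).
    """
    actions = list(q_mcts.keys())
    if random.random() < epsilon:
        # Exploration: pick uniformly among the actions explored by the search.
        return random.choice(actions)
    # Exploitation: pick the action with the highest updated score.
    return max(actions, key=lambda a: q_mcts[a])
```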

    [0068] For each step of real experience i.e. as contrasted with the look ahead search, a replay buffer 140 stores transition data comprising the current state of the environment, the action, the next state of the environment, a reward received in response to the agent performing the action, and the updated action selection scores.

    [0069] A training engine 150 is configured to train the reinforcement learning system 100 using samples of the stored transition data. Thus the training engine 150 is configured to adjust the parameters of the action selection neural network by backpropagating using gradients of an objective function determined from the sampled transition data. The objective function comprises a first term which may be any value-based objective function i.e. a term dependent on the reward received, and a second term comprising an “amortized” objective function i.e. a term dependent upon the updated action selection score. The first and second terms of the objective function may be weighted to match their relative scales.

    [0070] In broad terms the training engine 150 adjusts the parameters of the action selection neural network based on both real experience and the look ahead search, and the action selection output of the action selection neural network is then used in turn to guide the look ahead search. The better-guided look ahead search then further improves the action selection output. In this way implementations of the system and method are able to learn faster i.e. from less experience and/or with a smaller simulation search budget, than might otherwise be the case, and can provide improved final performance.

    [0071] FIGS. 2a and 2b are flow diagrams of an example process for training the reinforcement learning system 100. The process of FIG. 2 shows one time step; the process is repeated to control the agent to act at successive time steps.

    [0072] Referring to FIG. 2a, the process receives an observation of a state s of the environment (step 200), and performs a look ahead search from state s using action selection scores from the action selection neural network 120 to determine a set of updated action selection scores (step 202). Details of the look ahead search are described with reference to FIG. 2b.

    [0073] The updated action selection scores are used to select an action a for the agent to perform in the environment (step 204), and an observation of a next state of the environment s′ is received with an optional reward r (step 206). Transition data comprising data e.g. characterizing the experience tuple (s, a, r, s′, Q.sub.MCTS(s,⋅)) is stored in the replay buffer 140 (step 208). Whilst the reinforcement learning system 100 is learning, one or more experience tuples, e.g. a minibatch, are sampled from the replay buffer 140 and used to update the parameters of the action selection neural network 120 (step 210). After training, steps 208 and 210 may be omitted.
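
    One way to realize the time step of FIG. 2a is sketched below in Python, assuming simple interfaces for the environment, the search subsystem, the replay buffer and the learner (all names are illustrative, not the patent's API), and reusing select_action from the sketch above:

```python
def run_time_step(env, state, q_network, search, replay_buffer, learner,
                  epsilon=0.05, learning=True):
    # Step 202: look ahead search from s, guided by Q_theta; returns Q_MCTS(s, .).
    q_mcts = search(state, q_network)
    # Step 204: select an action from the updated scores (e.g. epsilon-greedy).
    action = select_action(q_mcts, epsilon)
    # Step 206: act in the real environment, observe the next state s' and reward r.
    next_state, reward = env.step(action)
    if learning:
        # Step 208: store the experience tuple (s, a, r, s', Q_MCTS(s, .)).
        replay_buffer.add((state, action, reward, next_state, q_mcts))
        # Step 210: sample a minibatch and update the network parameters.
        batch = replay_buffer.sample()
        learner.update(q_network, batch)
    return next_state
```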

    [0074] FIG. 2b shows details of an example process for performing the look ahead search. The search is performed over a search tree in which nodes represent states of the environment and edges represent actions performed by the agent. The search involves K iterations; that is, the search tree is traversed K times from a root node defined by the initial state of the environment, s, when the search was initiated.

    [0075] At the start of the search action selection values e.g. Q-values are initialized for states and actions in the tree (step 250). This may comprise setting an initial Q-value, Q.sub.0(s, a), for each state and action in the search tree. In implementations Q.sub.0 (s, a)=Q.sub.θ(s, a), that is the initial Q-value is defined by the set of action selection scores from the action selection neural network for each state s and possible action a from that state. A state-action visit count N.sub.0 (s, a) may also be initialized e.g. to 1 for each state and action in the search tree (step 250). This would be the situation if each state-action pair had been visited once with each estimated state-action value given by Q.sub.θ(s, a). In some other implementations the state-action visit count may be initialized to an estimate of previous visit counts.
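
    The initialization of step 250 can be pictured with the following sketch, in which each search-tree node stores per-action values Q.sub.k(s, a) and visit counts N.sub.k(s, a) (the node structure and the q_network.q_value method are assumptions for illustration):

```python
class SearchNode:
    """One node of the search tree, holding per-action search statistics."""

    def __init__(self, state, q_network, actions):
        self.state = state
        # Q_0(s, a) = Q_theta(s, a): the network's scores act as a prior.
        self.q = {a: q_network.q_value(state, a) for a in actions}
        # N_0(s, a) = 1, as if each action had been tried once with value Q_theta.
        self.n = {a: 1 for a in actions}
        # action -> child SearchNode, filled in as the search expands the tree.
        self.children = {}
```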

    [0076] A process as defined in steps 252-256, e.g. an MCTS-type process, is then performed for each of the iterations, starting with k=0.

    [0077] Starting from the root node, state s, the look ahead search traverses the search tree by performing a succession of search steps, each identifying an action according to the search policy, until a new, terminal search state s.sub.T is added to the search tree (step 252). The new, terminal search state s.sub.T is typically not, though can be, a terminal state of a task performed by the reinforcement learning system. In some implementations the search policy for iteration k, π.sub.k(s), is given by

    [00001] $\pi_k(s) = \operatorname*{argmax}_a \left( Q_k(s, a) + U_k(s, a) \right)$

    [0078] where U.sub.k (s, a) is an exploration term which may represent an upper confidence bound for the tree (UCT). In implementations

    [00002] $U_k(s, a) = c_{\mathrm{UCT}} \, \frac{\sum_{a} N_k(s, a)}{N_k(s, a)}$

    [0079] where the sum over actions Σ.sub.aN.sub.k(s, a) represents the total number of simulations (backed up returns) for state s following initialization, N.sub.k(s, a) represents the number of simulations in which action a was taken from state s, and c.sub.UCT is a constant that encourages exploration, e.g. √2. For k>0 a value for Q.sub.k(s, a) may be determined as described later.
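
    A sketch of this search policy, using the node structure from the previous sketch and the exploration term as written in [00002] (other UCT-style bonuses are possible; the default c.sub.UCT of √2 follows the text):

```python
import math

def uct_action(node, c_uct=math.sqrt(2)):
    """Pick the action maximizing Q_k(s, a) + U_k(s, a) at a search-tree node."""
    total_visits = sum(node.n.values())  # sum over actions of N_k(s, a)

    def score(a):
        u = c_uct * total_visits / node.n[a]  # exploration term U_k(s, a)
        return node.q[a] + u

    return max(node.n, key=score)
```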

    [0080] As the search tree is traversed an action is selected according to the search policy, and the next state of the environment is determined using the model (simulation), until a new action, a.sub.T−1, that has not previously been explored is selected from state s.sub.T−1, resulting in a reward r.sub.T−1 and a new, terminal search state s.sub.T. The new, terminal search state s.sub.T is added to the search tree and a value of this state is determined as a result of the look ahead search (step 254). More specifically the value of this state, V(s.sub.T), may be determined from the action selection output of the action selection neural network for the terminal search state of the environment. For example where the value of state s.sub.T is provided deterministically it may be determined as a maximum action selection score of the state,

    [00003] $V(s_T) = \max_a Q_\theta(s_T, a)$

    [0081] A “backed up” return value may then be determined for each state-action pair (s.sub.t, a.sub.t) traversed by the search. The return value may be determined from the value of state s.sub.T and from the (time-discounted) estimated rewards from the nodes of the search tree traversed between the root node and the terminal node. For example a return R.sub.i(s.sub.t, a.sub.t) for an ith simulated traverse since initialization may be estimated as

    [00004] $R_i(s_t, a_t) = \gamma^{T-t} V(s_T) + \sum_{j=t}^{T-1} \gamma^{j-t} r_j$
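
    For example, the backed up return can be computed from the rewards collected along the traversed path and the value of the terminal search state, as in this sketch (gamma is the discount factor; names are illustrative):

```python
def backed_up_return(rewards, terminal_value, gamma=0.99):
    """R_i(s_t, a_t) = gamma^(T-t) * V(s_T) + sum_{j=t}^{T-1} gamma^(j-t) * r_j.

    rewards: [r_t, ..., r_{T-1}] along the path from the node at step t to the
    terminal node; terminal_value: V(s_T), e.g. max_a Q_theta(s_T, a).
    """
    horizon = len(rewards)  # T - t
    ret = (gamma ** horizon) * terminal_value
    for j, r in enumerate(rewards):  # j plays the role of j - t
        ret += (gamma ** j) * r
    return ret
```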

    [0082] The process may then increment counts of visited states and actions, N.sub.k(s, a), and update the search action selection scores, e.g. the values of Q.sub.k(s, a), for the visited states and actions (step 256). For example, for k>0 a value for Q.sub.k(s, a) may be determined according to

    [00005] $Q_k(s, a) = \frac{Q_\theta(s, a) + \sum_{i=1}^{N_k(s, a)-1} R_i(s, a)}{N_k(s, a)}$

    [0083] where the sum runs over the backed up returns for the visited states and actions.
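
    In incremental form, the update of [00005] keeps Q.sub.k(s, a) equal to the average of the prior Q.sub.θ(s, a) and the backed up returns accumulated so far, as in this sketch (same hypothetical node structure as above):

```python
def backup(node, action, new_return):
    """Fold one backed up return into Q_k(s, a) and N_k(s, a).

    Maintains Q_k(s, a) = (Q_theta(s, a) + sum_i R_i(s, a)) / N_k(s, a),
    since the initial visit (count 1) carries the prior Q_theta(s, a).
    """
    n = node.n[action]
    # The current Q already averages the prior and previous returns over n values,
    # so a standard running-average update incorporates the new return.
    node.q[action] = (node.q[action] * n + new_return) / (n + 1)
    node.n[action] = n + 1
```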

    [0084] After K iterations the updated search action selection scores, e.g. the value of Q.sub.K(s, a), may be returned as the updated action selection score, e.g. Q.sub.MCTS(s, a), used by the process of FIG. 2a (step 258).

    [0085] Referring again to FIG. 2a, in some implementations an objective function, ℒ, used to update the parameters of the action selection neural network 120 (step 210) may be given by


    $\mathcal{L} = \beta_Q \mathcal{L}_Q + \beta_A \mathcal{L}_A$

    where ℒ.sub.Q is any reinforcement learning value-based loss function, e.g. based on a 1-step or n-step TD (Temporal Difference) target or based on a λ-return, ℒ.sub.A is an “amortized” objective function, and β.sub.Q and β.sub.A are respective weights. Each of ℒ.sub.Q and ℒ.sub.A may be determined from data for a batch, 𝒟, of N experience tuples sampled from the replay buffer 140.

    [0086] For example ℒ.sub.Q may depend on a difference between the output of the action selection neural network for the current state of the environment and an estimate of a return that will be received by the system if the agent performs a given action in response to the observation, e.g. a combination of the reward received and an estimated (time-discounted) future reward, i.e. return, from the next state of the environment from the action selection neural network.
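
    As one plausible instantiation of this first term, a 1-step temporal difference (Q-learning) loss over the sampled batch might look as follows (a sketch only; the q_network and target_network interfaces are assumptions, and the patent equally allows n-step or λ-return targets):

```python
import numpy as np

def q_learning_loss(q_network, target_network, batch, gamma=0.99):
    """Mean squared 1-step TD error over a batch of sampled transitions."""
    errors = []
    for (s, a, r, s_next, _q_mcts) in batch:
        # Bootstrapped target: reward plus the discounted maximum Q at the next state.
        target = r + gamma * max(target_network.q_value(s_next, b)
                                 for b in target_network.actions(s_next))
        errors.append(q_network.q_value(s, a) - target)
    return float(np.mean(np.square(errors)))
```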

    [0087] The “amortized” objective ℒ.sub.A may depend on a difference between the action selection output for the current state of the environment and the updated action selection scores for the current state of the environment. The difference may comprise a squared loss e.g. an L2 norm, or a cross-entropy loss. Using a cross-entropy loss can improve performance, i.e. can help the action selection output to learn good action selection scores faster. A cross-entropy loss compares relative action selection score values rather than absolute values, i.e. it compares score distributions. This allows the updated action selection scores from the search to be used without relying on these too strongly, which can be helpful when search budgets are small; it can also avoid over-reliance on low-value actions which tend to be avoided by the search.

    [0088] In one implementation the action selection scores from the action selection neural network, Q.sub.θ(s,⋅), and the updated action selection scores from the look ahead search, Q.sub.MCTS(s,⋅), are each converted to a respective probability distribution by applying a softmax function, p.sub.MCTS=softmax(Q.sub.MCTS(s,⋅)) and p.sub.θ=softmax(Q.sub.θ(s,⋅)). Then a cross-entropy loss may be determined as

    [00006] $\mathcal{L}_A = -\frac{1}{N} \sum_{\mathcal{D}} \left( p_{\mathrm{MCTS}} \right)^{T} \log p_\theta$

    [0089] where T is an optional temperature parameter which controls the softmax temperature.
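
    A sketch of this amortization loss for a single state using NumPy; applying the temperature inside the softmax is one plausible reading of the optional parameter T, and averaging over the batch is left to the caller:

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = np.asarray(x, dtype=float) / temperature
    z -= z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def amortized_loss(q_theta_scores, q_mcts_scores, temperature=1.0):
    """Cross-entropy between the search-improved and network score distributions.

    q_theta_scores, q_mcts_scores: arrays of Q_theta(s, .) and Q_MCTS(s, .)
    over the same ordered set of actions for one state.
    """
    p_mcts = softmax(q_mcts_scores, temperature)
    p_theta = softmax(q_theta_scores, temperature)
    # -(p_MCTS)^T log p_theta: dot product of the target distribution with the
    # log of the network's distribution (small constant avoids log(0)).
    return -float(np.dot(p_mcts, np.log(p_theta + 1e-12)))
```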

    [0090] FIGS. 3a and 3b summarize the operation of an example reinforcement learning system 100 as described above. Thus in broad terms, the action selection scores from the action selection neural network, Q.sub.θ(s,⋅), act as a prior for the updated Q-values, Q.sub.k, estimated during a Monte Carlo Tree Search. Over K steps of the search an initial set of Q-values, Q.sub.0(s,⋅)=Q.sub.θ(s,⋅), is improved to Q.sub.K, which is returned as an updated set of action selection scores, Q.sub.MCTS(s,⋅). This updated set of action selection scores is then used to select an action a, e.g. with an epsilon-greedy exploration policy, and the resulting experience (s, a, r, s′, Q.sub.MCTS(s,⋅)) is added to the replay buffer. When learning the system uses real experience (rewards) to update Q.sub.θ(s,⋅) by Q-learning, and uses an amortization loss to regress Q.sub.θ(s,⋅) towards the updated set of action selection scores estimated by the search.
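
    Putting the two terms together, the weighted objective of [0085] could be assembled for a sampled minibatch roughly as follows (reusing the hypothetical q_learning_loss and amortized_loss sketches above; β.sub.Q and β.sub.A are hyperparameters):

```python
import numpy as np

def total_loss(q_network, target_network, batch, beta_q=1.0, beta_a=1.0):
    """L = beta_Q * L_Q + beta_A * L_A over one sampled minibatch."""
    l_q = q_learning_loss(q_network, target_network, batch)
    # Amortization term: regress Q_theta(s, .) towards Q_MCTS(s, .) per state.
    per_state = []
    for (s, _a, _r, _s_next, q_mcts) in batch:
        actions = list(q_mcts)  # use the same action order for both score vectors
        per_state.append(amortized_loss([q_network.q_value(s, a) for a in actions],
                                        [q_mcts[a] for a in actions]))
    l_a = float(np.mean(per_state))
    return beta_q * l_q + beta_a * l_a
```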

    [0091] For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

    [0092] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.

    [0093] The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

    [0094] A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

    [0095] As used in this specification, an “engine,” or “software engine,” refers to a software implemented input/output system that provides an output that is different from the input. An engine can be an encoded block of functionality, such as a library, a platform, a software development kit (“SDK”), or an object. Each engine can be implemented on any appropriate type of computing device, e.g., servers, mobile phones, tablet computers, notebook computers, music players, e-book readers, laptop or desktop computers, PDAs, smart phones, or other stationary or portable devices, that includes one or more processors and computer readable media. Additionally, two or more of the engines may be implemented on the same computing device, or on different computing devices.

    [0096] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). For example, the processes and logic flows can be performed by, and apparatus can also be implemented as, a graphics processing unit (GPU).

    [0097] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

    [0098] Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

    [0099] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

    [0100] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

    [0101] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

    [0102] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

    [0103] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

    [0104] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.