Asynchronous robotic control using most recently selected robotic action data

11685045 · 2023-06-27

Abstract

Asynchronous robotic control utilizing a trained critic network. During performance of a robotic task based on a sequence of robotic actions determined utilizing the critic network, a corresponding next robotic action of the sequence is determined while a corresponding previous robotic action of the sequence is still being implemented. Optionally, the next robotic action can be fully determined and/or can begin to be implemented before implementation of the previous robotic action is completed. In determining the next robotic action, most recently selected robotic action data is processed using the critic network, where such data conveys information about the previous robotic action that is still being implemented. Some implementations additionally or alternatively relate to determining when to implement a robotic action that is determined in an asynchronous manner.

Claims

1. A method implemented by one or more processors of a robot during performance of a robotic task, the method comprising: controlling a robot to implement a most recently selected robotic action that was determined based on processing, utilizing a trained neural network model that represents a learned value function, of the most recently selected robotic action and of prior vision data captured by a vision component of the robot, wherein the most recently selected robotic action defines a target next state of the robot in performance of the robotic task; during the controlling of the robot to implement the most recently selected robotic action and prior to the robot achieving the target next state defined by the most recently selected robotic action: identifying current vision data that is captured by the vision component during the controlling of the robot to implement the most recently selected robotic action and prior to the robot achieving the target next state of the robot defined by the most recently selected robotic action; identifying a candidate next robotic action; processing, utilizing the trained neural network model, the current vision data, the candidate next robotic action, and most recently selected robotic action data, wherein the most recently selected robotic action data comprises: the most recently selected robotic action, a difference between the target next state of the robot and a current state of the robot that temporally corresponds to the current vision data, or both the most recently selected robotic action and the difference; generating a value for the candidate next robotic action based on the processing; and selecting the candidate next robotic action based on the value; and controlling the robot to implement the selected candidate next robotic action.

2. The method of claim 1, wherein the most recently selected robotic action data comprises the difference between the target next state of the robot and the current state of the robot that temporally corresponds to the current vision data.

3. The method of claim 2, further comprising: selecting the current vision data based on it being most recently captured and buffered in a vision data buffer; and selecting the current state of the robot, for use in determining the difference, based on a current state timestamp, for the current state, being closest temporally to a vision data timestamp of the current vision data.

4. The method of claim 3, wherein selecting the current state of the robot comprises selecting the current state of the robot in lieu of a more recent state of the robot that is more up to date than the current state, based on the current state of the robot being closer temporally to the vision data timestamp than is the more recent state of the robot.

5. The method of claim 1, wherein controlling the robot to implement the selected candidate next robotic action comprises: determining a particular control cycle at which to begin controlling the robot to implement the selected candidate next robotic action, wherein determining the particular control cycle is based on determining whether a minimum amount of time has passed, an amount of control cycles have passed, or the amount of time and the amount of control cycles have passed.

6. The method of claim 5, wherein the minimum amount of time and/or control cycles are relative to: initiation of generating the value for the candidate next robotic action, beginning controlling the robot to implement the most recently selected robot action, or both initiation of generating the value for the candidate next robotic action and beginning controlling the robot to implement the most recently selected robot action.

7. The method of claim 6, wherein the particular control cycle is not a control cycle that immediately follows selecting the candidate next robotic action.

8. The method of claim 1, wherein controlling the robot to implement the selected candidate next robotic action occurs prior to the robot achieving the target next state.

9. The method of claim 1, wherein controlling the robot to implement the selected candidate next robotic action occurs in a control cycle that immediately follows the robot achieving the target next state.

10. The method of claim 1, further comprising, during the controlling of the robot to implement the most recently determined robotic action and prior to the robot achieving the target next state defined by the most recently determined robotic action: identifying an additional candidate next robotic action; processing, utilizing the trained neural network model, the current vision data, the additional candidate next robotic action, and the most recently selected robotic action data; and generating an additional value for the additional candidate next robotic action based on the processing; wherein selecting the candidate next robotic action is based on comparing the value to the additional value.

11. The method of claim 1, wherein the candidate next robotic action comprises a pose change for a component of the robot.

12. The method of claim 11, wherein the component is an end effector and the pose change defines a translation difference for the end effector and a rotation difference for the end effector.

13. The method of claim 12, wherein the end effector is a gripper and the robotic task is a grasping task.

14. A robot, comprising: a vision sensor viewing an environment; actuators; a trained neural network model stored in one or more non-transitory computer readable media, the trained neural network model representing a learned value function; at least one processor configured to: control one or more of the actuators to implement a most recently selected robotic action that was determined based on processing, utilizing the trained neural network model, of the most recently selected robotic action and of prior vision data captured by a vision component of the robot, wherein the most recently selected robotic action defines a target next state of the robot in performance of the robotic task; during the control of the actuators to implement the most recently selected robotic action and prior to the robot achieving the target next state defined by the most recently selected robotic action: identify current vision data that is captured by the vision component during the controlling of the robot to implement the most recently selected robotic action and prior to the robot achieving the target next state of the robot defined by the most recently selected robotic action; identify a candidate next robotic action; process, utilizing the trained neural network model, the current vision data, the candidate next robotic action, and most recently selected robotic action data, wherein the most recently selected robotic action data comprises: the most recently selected robotic action, a difference between the target next state of the robot and a current state of the robot that temporally corresponds to the current vision data, or both the most recently selected robotic action and the difference; generate a value for the candidate next robotic action based on the processing; and select the candidate next robotic action based on the value; and control the robot to implement the selected candidate next robotic action.

15. The robot of claim 14, wherein the most recently selected robotic action data comprises the difference between the target next state of the robot and the current state of the robot that temporally corresponds to the current vision data.

16. The robot of claim 15, wherein the at least one processor is further configured to: select the current vision data based on it being most recently captured and buffered in a vision data buffer; and select the current state of the robot, for use in determining the difference, based on a current state timestamp, for the current state, being closest temporally to a vision data timestamp of the current vision data.

17. The robot of claim 16, wherein in selecting the current state of the robot one or more of the processors are to select the current state of the robot in lieu of a more recent state of the robot that is more up to date than the current state, based on the current state of the robot being closer temporally to the vision data timestamp than is the more recent state of the robot.

18. The robot of claim 14, wherein in controlling the actuators to implement the selected candidate next robotic action one or more of the processors are to: determine, based on determining whether a minimum amount of time or control cycles have passed, a particular control cycle at which to begin controlling the actuators to implement the selected candidate next robotic action.

19. The robot of claim 14, wherein controlling the actuators to implement the selected candidate next robotic action occurs prior to the robot achieving the target next state.

20. The robot of claim 14, wherein controlling the actuators to implement the selected candidate next robotic action occurs in a control cycle that immediately follows the robot achieving the target next state.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 illustrates an example environment in which implementations disclosed herein can be implemented.

(2) FIG. 2A is a flowchart illustrating an example method of converting stored past episode data into offline data for pushing to an offline buffer.

(3) FIG. 2B is an example of how past episode data can be converted into offline data for pushing to an offline buffer.

(4) FIG. 3 is a flowchart illustrating an example method of performing an online critic-guided task episode, and pushing data from the online critic-guided task episode into an online buffer and optionally an offline buffer.

(5) FIG. 4 is a flowchart illustrating an example method of using data from an online buffer or offline buffer in populating a training buffer with data that can be used to train a critic network.

(6) FIG. 5 is a flowchart illustrating an example method of training a critic network.

(7) FIG. 6 is a flowchart illustrating an example method of performing a robotic task using a trained critic network.

(8) FIG. 7 schematically depicts an example architecture of a robot.

(9) FIG. 8 schematically depicts an example architecture of a computer system.

DETAILED DESCRIPTION

(10) FIG. 1 illustrates robots 180, which include robots 180A, 180B, and optionally other (unillustrated) robots. Robots 180A and 180B are “robot arms” having multiple degrees of freedom to enable traversal of grasping end effectors 182A and 182B along any of a plurality of potential paths to position the grasping end effectors 182A and 182B in desired locations. Robots 180A and 180B each further control the two opposed “claws” of their corresponding grasping end effector 182A, 182B to actuate the claws between at least an open position and a closed position (and/or optionally a plurality of “partially closed” positions).

(11) Example vision components 184A and 184B are also illustrated in FIG. 1. In FIG. 1, vision component 184A is mounted at a fixed pose relative to the base or other stationary reference point of robot 180A. Vision component 184B is also mounted at a fixed pose relative to the base or other stationary reference point of robot 180B. Vision components 184A and 184B each include one or more sensors and can generate vision data related to shape, color, depth, and/or other features of object(s) that are in the line of sight of the sensors. The vision components 184A and 184B may be, for example, monographic cameras, stereographic cameras, and/or 3D laser scanners. A 3D laser scanner includes one or more lasers that emit light and one or more sensors that collect data related to reflections of the emitted light. A 3D laser scanner may be, for example, a time-of-flight 3D laser scanner or a triangulation based 3D laser scanner and may include a position sensitive detector (PSD) or other optical position sensor.

(12) The vision component 184A has a field of view of at least a portion of the workspace of the robot 180A, such as the portion of the workspace that includes example objects 191A. Although resting surface(s) for objects 191A are not illustrated in FIG. 1, those objects may rest on a table, a tray, and/or other surface(s). Objects 191A include a spatula, a stapler, and a pencil. In other implementations more objects, fewer objects, additional objects, and/or alternative objects may be provided during all or portions of grasp episodes (or other task episodes) of robot 180A as described herein. Moreover, in many implementations objects 191A can be replaced (e.g., by a human or by another robot) with a different set of objects periodically to provide diverse training data.

(13) The vision component 184B has a field of view of at least a portion of the workspace of the robot 180B, such as the portion of the workspace that includes example objects 191B. Although resting surface(s) for objects 191B are not illustrated in FIG. 1, they may rest on a table, a tray, and/or other surface(s). Objects 191B include a pencil, a stapler, and glasses. In other implementations more objects, fewer objects, additional objects, and/or alternative objects may be provided during all or portions of grasp episodes (or other task episodes) of robot 180B as described herein. Moreover, in many implementations objects 191B can be replaced (e.g., by a human or by another robot) with a different set of objects periodically to provide diverse training data.

(14) Although particular robots 180A and 180B are illustrated in FIG. 1, additional and/or alternative robots may be utilized, including additional robot arms that are similar to robots 180A and 180B, robots having other robot arm forms, robots having a humanoid form, robots having an animal form, robots that move via one or more wheels (e.g., self-balancing robots), submersible vehicle robots, an unmanned aerial vehicle (“UAV”), and so forth.

(15) Also, although particular grasping end effectors are illustrated in FIG. 1, additional and/or alternative end effectors may be utilized, such as alternative impactive grasping end effectors (e.g., those with grasping “plates”, those with more or fewer “digits”/“claws”), “ingressive” grasping end effectors, “astrictive” grasping end effectors, or “contigutive” grasping end effectors, or non-grasping end effectors. Additionally, although particular mountings of vision sensors 184A and 184B are illustrated in FIG. 1, additional and/or alternative mountings may be utilized. For example, in some implementations, vision sensors may be mounted directly to robots, such as on non-actuable components of the robots or on actuable components of the robots (e.g., on the end effector or on a component close to the end effector). Also, for example, in some implementations, a vision sensor may be mounted on a non-stationary structure that is separate from its associated robot and/or may be mounted in a non-stationary manner on a structure that is separate from its associated robot.

(16) Robots 180A, 180B, and/or other robots may be utilized to perform a large quantity of grasp episodes and data associated with the grasp episodes can be stored in offline episode data database 150 and/or provided for inclusion in online buffer 112 (of replay buffer(s) 110), as described herein. As described herein, robots 180A and 180B can optionally initially perform grasp episodes (or other task episodes) according to a scripted exploration policy, in order to bootstrap data collection. The scripted exploration policy can be randomized, but biased toward reasonable grasps. Data from such scripted episodes can be stored in offline episode data database 150 and utilized in initial training of critic network 152 to bootstrap the initial training.

(17) Robots 180A and 180B can additionally or alternatively perform grasp episodes (or other task episodes) using the critic network 152, with data from such episodes provided for inclusion in online buffer 112 during training and/or provided in offline episode data database 150 (and pulled during training for use in populating offline buffer 114). For example, the robots 180A and 180B can utilize method 300 of FIG. 3 in performing such episodes. The episodes provided for inclusion in online buffer 112 during training will be online episodes. However, the version of the critic network 152 utilized in generating a given episode can still be somewhat lagged relative to the version of the critic network 152 that is trained based on instances from that episode. The episodes stored for inclusion in offline episode data database 150 will be offline episodes, and instances from those episodes will be later pulled and utilized to generate transitions that are stored in offline buffer 114 during training.

(18) The data generated by a robot 180A or 180B during an episode can include state data, robotic actions, and rewards. Each instance of state data for an episode includes at least vision-based data for an instance of the episode, and most recently selected robotic action(s) data that is based on selected robotic action(s) for previous instance(s) of the episode. For example, an instance of state data can include a 2D image when a vision component of a robot is a monographic camera. Each instance of state data can optionally include additional data such as whether a grasping end effector of the robot is open or closed at the instance. More formally, a given state observation can be represented as s ∈ S.

(19) Each of the robotic actions for an episode defines a robotic action that is implemented in the current state to transition to a next state (if any). A robotic action can include a pose change for a component of the robot, such as a pose change, in Cartesian space, for a grasping end effector of the robot. The pose change can be defined by the action as, for example, a translation difference (indicating a desired change in position) and a rotation difference (indicating a desired change in azimuthal angle). The robotic action can further include, for example, a component action command that dictates, for instance, whether a gripper is to be opened, closed, or adjusted to a target state between opened and closed (e.g., partially closed). The robotic action can further include a termination command that dictates whether to terminate performance of the robotic task. The terminal state of an episode will include a positive termination command to dictate termination of performance of the robotic task.

(20) More formally, a given robotic action can be represented as a ∈ A. In some implementations, for a grasping task, A includes a vector in Cartesian space t ∈ R³ indicating the desired change in the gripper position, a change in azimuthal angle encoded via a sine-cosine encoding r ∈ R², binary gripper open and close commands g_open and g_close, and a termination command e that ends the episode, such that a = (t, r, g_open, g_close, e).
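The action representation above can be sketched as a small data structure. The following is an illustrative encoding only; the field names, the flattened vector layout, and the use of a dataclass are assumptions, not part of the specification:

```python
import math
from dataclasses import dataclass

@dataclass
class GraspAction:
    """Illustrative encoding of a = (t, r, g_open, g_close, e)."""
    translation: tuple    # t in R^3: desired change in gripper position
    azimuth_delta: float  # desired change in azimuthal angle, in radians
    g_open: bool          # command to open the gripper
    g_close: bool         # command to close the gripper
    terminate: bool       # e: command to end the episode

    def to_vector(self):
        """Flatten to a numeric vector, with the azimuthal change given
        the sine-cosine encoding r described above."""
        sin_r, cos_r = math.sin(self.azimuth_delta), math.cos(self.azimuth_delta)
        return [*self.translation, sin_r, cos_r,
                float(self.g_open), float(self.g_close), float(self.terminate)]

# Example: move the gripper slightly and command it closed.
action = GraspAction((0.01, 0.0, -0.02), 0.1, False, True, False)
vector = action.to_vector()
```

A vector of this form is what would be processed, together with state data, by the critic network.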

(21) Each of the rewards can be assigned in view of a reward function that can assign a positive reward (e.g., “1”) or no reward (e.g., “0”) at the last time step of an episode of performing a task. The last time step is one at which a termination action occurred, either as a result of an action, determined based on the critic network, indicating termination, or as a result of a maximum number of time steps occurring. Various self-supervision techniques can be utilized to assign the reward, such as those described herein.

(22) Also illustrated in FIG. 1 is the offline episode data database 150, the replay buffer(s) 110, bellman updaters 122A-N, training workers 124A-N, and a critic network 152. It is noted that all components of FIG. 1 are utilized in training the critic network 152. However, once the critic network 152 is trained (e.g., considered optimized according to one or more criteria), the robots 180A and/or 180B can perform a robotic task using the critic network 152 and without other components of FIG. 1 being present.

(23) As mentioned herein, the critic network 152 can be a deep neural network model, such as a deep neural network model that approximates a Q-function, which can be represented as Q_θ(s, a), where θ denotes the learned weights of the neural network model. Implementations of reinforcement learning described herein seek the optimal Q-function by minimizing the Bellman error. This generally corresponds to double Q-learning with a target network, a variant on the standard Bellman error in which the target values are computed using a lagged target network Q_θ̄. The expectation is taken under some data distribution, which in practice is simply the distribution over all previously observed transitions. Once the Q-function is learned, the policy can be recovered according to π(s) = argmax_a Q_θ(s, a).
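The target value that the Bellman error compares against can be illustrated with a minimal sketch. The discount factor value and the use of a plain `max` over pre-computed candidate next-action values are illustrative assumptions:

```python
def bellman_target(reward, done, next_q_values, gamma=0.9):
    """Target for Q(s, a): r + gamma * max over a' of Q_target(s', a').

    next_q_values: values produced by a lagged target network for candidate
    next actions a' (here, just a list of floats). For a terminal transition
    the target is the reward alone.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

Training would then minimize the squared difference between `Q(s, a)` and this target over sampled transitions.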

(24) Q-learning with deep neural network function approximators provides a simple and practical scheme for reinforcement learning with image observations, and is amenable to straightforward parallelization. However, incorporating continuous actions, such as continuous gripper motion in grasping tasks, poses a challenge for this approach. The approach utilized in some implementations described herein is an alternative approach that maintains the generality of non-convex Q-functions while avoiding the need for a second maximizer network. In the approach, a state s and action a are inputs to the critic network, and the max over actions in the Bellman target is evaluated by means of a stochastic optimization algorithm that can handle non-convex and multimodal optimization landscapes.
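One such stochastic optimization algorithm is the cross-entropy method (CEM) referenced elsewhere herein. The following is a minimal, illustrative sketch over a generic value function; the Gaussian proposal distribution, iteration count, sample size, and elite count are assumptions:

```python
import random

def cem_maximize(q_fn, action_dim, iterations=3, samples=64, elites=6):
    """Approximately maximize q_fn over actions via the cross-entropy method.

    q_fn(action) -> value stands in for a forward pass of the critic network
    with the state held fixed; it may be non-convex and multimodal. Each
    iteration samples actions from a Gaussian, keeps the highest-valued
    "elite" samples, and refits the Gaussian to those elites.
    """
    mean = [0.0] * action_dim
    std = [1.0] * action_dim
    best = None
    for _ in range(iterations):
        batch = [[random.gauss(m, s) for m, s in zip(mean, std)]
                 for _ in range(samples)]
        batch.sort(key=q_fn, reverse=True)
        top = batch[:elites]
        if best is None or q_fn(top[0]) > q_fn(best):
            best = top[0]
        # Refit the sampling distribution to the elite samples.
        mean = [sum(a[i] for a in top) / elites for i in range(action_dim)]
        std = [max(1e-3, (sum((a[i] - mean[i]) ** 2 for a in top) / elites) ** 0.5)
               for i in range(action_dim)]
    return best
```

The returned action is the highest-valued sample seen, which serves as the argmax in both target-value computation and action selection.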

(25) Large-scale reinforcement learning that requires generalization over new scenes and objects requires large amounts of diverse data. Such data can be collected by operating robots 180 over a long duration and storing episode data in offline episode data database 150.

(26) To effectively ingest and train on such large and diverse datasets, a distributed, asynchronous implementation can be utilized. A plurality of log readers (not illustrated) operating in parallel can read historical data from offline episode data database 150 to generate transitions that they push to offline buffer 114 of replay buffer(s) 110. In some implementations, log readers can each perform one or more steps of method 200 of FIG. 2A.

(27) Further, online transitions can optionally be pushed, from robots 180, to online buffer 112. The online transitions can also optionally be stored in offline episode data database 150 and later read by log readers, at which point they will be offline transitions.

(28) A plurality of Bellman updaters 122A-N operating in parallel sample transitions from the offline and online buffers 114 and 112. In various implementations, this is a weighted sampling (e.g., a sampling rate for the offline buffer 114 and a separate sampling rate for the online buffer 112) that can vary with the duration of training. For example, early in training the sampling rate for the offline buffer 114 can be relatively large, and can decrease with duration of training (and, as a result, the sampling rate for the online buffer 112 can increase). This can avoid overfitting to the initially scarce on-policy data, and can accommodate the much lower rate of production of on-policy data.
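The varying sampling rate can be sketched as a simple annealing schedule. The linear shape, the starting and ending rates, and the annealing horizon are illustrative assumptions:

```python
def offline_sampling_rate(train_step, start=0.9, end=0.1, anneal_steps=100_000):
    """Fraction of sampled transitions drawn from the offline buffer.

    Early in training most samples are offline; the rate decays linearly so
    that on-policy (online) data is weighted more heavily as training
    progresses. The online rate is the complement (1 - offline rate).
    """
    frac = min(train_step / anneal_steps, 1.0)
    return start + frac * (end - start)
```

A Bellman updater could then draw each sample from the offline buffer with this probability and from the online buffer otherwise.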

(29) The Bellman updaters 122A-N label sampled data with corresponding target values, and store the labeled samples in a train buffer 116, which can operate as a ring buffer. In labeling a given instance of sampled data with a given target value, one of the Bellman updaters 122A-N can carry out the CEM optimization procedure using the current critic network (e.g., with current learned parameters). Note that one consequence of this asynchronous procedure is that the samples in train buffer 116 are labeled with different lagged versions of the current model. In some implementations, Bellman updaters 122A-N can each perform one or more steps of method 400 of FIG. 4.
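A ring buffer of the kind described can be sketched as follows; the API and capacity handling are assumptions:

```python
class RingBuffer:
    """Fixed-capacity train buffer: new labeled samples overwrite the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.next = 0  # index of the slot to overwrite once full

    def push(self, item):
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            self.data[self.next] = item
        self.next = (self.next + 1) % self.capacity

    def sample(self, rng):
        """Return one uniformly sampled item (rng: a random.Random)."""
        return self.data[rng.randrange(len(self.data))]
```

Training workers would pull random labeled transitions from such a buffer, while Bellman updaters keep pushing freshly labeled samples into it.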

(30) A plurality of training workers 124A-N operate in parallel, pull labeled transitions from the train buffer 116 randomly, and use them to update the critic network 152. Each of the training workers 124A-N computes gradients and sends the computed gradients asynchronously to parameter server(s) (not illustrated). In some implementations, training workers 124A-N can each perform one or more steps of method 500 of FIG. 5. The training workers 124A-N, the Bellman updaters 122A-N, and the robots 180 can pull model weights from the parameter server(s) periodically, continuously, or at other regular or non-regular intervals, and can each update their own local version of the critic network 152 utilizing the pulled model weights.

(31) Additional description of implementations of methods that can be implemented by various components of FIG. 1 is provided below with reference to the flowcharts of FIGS. 2-6.

(32) FIG. 2A is a flowchart illustrating an example method 200 of converting stored past episode data into offline data for pushing to an offline buffer (e.g., offline buffer 114 of FIG. 1). For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include one or more components of one or more computer systems, such as one or more processors. Moreover, while operations of method 200 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

(33) At block 202, the system starts log reading. For example, log reading can be initialized at the beginning of reinforcement learning.

(34) At block 204, the system reads data from a past episode. For example, the system can read data from an offline episode data database that stores states, actions, and rewards from past episodes of robotic performance of a task. The past episode can be one performed by a corresponding real physical robot based on a past version of a critic network. The past episode can, in some implementations and/or situations (e.g., at the beginning of reinforcement learning), be one performed based on a scripted exploration policy, based on a demonstrated (e.g., through virtual reality, kinesthetic teaching, etc.) performance of the task, etc.

(35) At block 206, the system determines most recently selected robotic action(s) based on a robotic transition from time A of the past episode to time B of the past episode. For example, as illustrated in FIG. 2B, the most recently selected robotic action(s) can include robotic action(s) that occurred from time A to time B, such as translation and/or rotation of a gripper, opening and/or closing of the gripper, etc.

(36) At block 208, the system determines current state data that includes: (1) vision data from a time between time A and time B; and (2) the most recently selected robotic action data that is based on the most recently selected robotic action(s) determined at block 206. For example, as illustrated in FIG. 2B, an instance of vision data can be selected based on it having a timestamp between time A and time B. In some implementations, the instance of vision data is selected based on it being at least the minimum delay (described herein) before time B. The most recently selected robotic action data can, in some implementations, include a vector representation of the most recently selected robotic action(s) determined at block 206. In some implementations, the most recently selected robotic action data can additionally or alternatively include a vector representation of a difference between a state of component(s) of the robot at time B, and a state of the component(s) at a time corresponding to the instance of vision data (e.g., having the same timestamp as the vision data, or a timestamp that is closest to the vision data). In other words, a vector representation that indicates a difference between the state of the component(s) at (or very near) the time the vision data was captured and the state of the component(s) at time B.
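The timestamp matching and state-difference computation described above can be sketched as follows. The list-of-pairs state log and the plain-list state vectors are illustrative assumptions about the data layout:

```python
def closest_state(states, vision_timestamp):
    """Pick the buffered robot state whose timestamp is nearest the vision
    data's timestamp, even when a more recent state is available.

    states: list of (timestamp, state_vector) pairs.
    """
    return min(states, key=lambda ts_state: abs(ts_state[0] - vision_timestamp))

def state_difference(target_state, current_state):
    """Element-wise difference between the target next state defined by the
    most recently selected action and the state at (or near) the time the
    vision data was captured."""
    return [t - c for t, c in zip(target_state, current_state)]
```

A difference vector of this form conveys how much of the in-progress action remains to be implemented, which is what makes the asynchronous action selection well-posed.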

(37) At block 210, the system determines a currently selected robotic action based on a robotic transition from time B to time C. For example, as illustrated in FIG. 2B, the currently selected robotic action(s) can include robotic action(s) that occurred from time B to time C, such as translation and/or rotation of a gripper, opening and/or closing of the gripper, etc.

(38) At block 212, the system generates offline data that includes: the current state data, the currently selected robotic action, and a reward for the episode. The reward can be determined as described herein, and can optionally be previously determined and stored with the data. For example, as illustrated in FIG. 2B the reward can be based on determining whether an attempted grasp (or other attempted task) was successful, based on analysis of various data after termination of the episode.

(39) At block 214, the system pushes the offline data into an offline buffer. The system then returns to block 204 to read data from another past episode.
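Blocks 206 through 214 can be sketched end to end as the assembly of a single offline transition. The timestamp-keyed log format here is a hypothetical stand-in for the actual episode data layout:

```python
def build_offline_transition(frames, t_a, t_b, reward):
    """Build one (state, action, reward) transition for the offline buffer.

    frames: {timestamp: {'vision': ..., 'action': ...}}, a hypothetical log
    of an episode. The action logged at time A is the one implemented from
    A to B; the action logged at time B is the one implemented from B to C.
    """
    # Vision data from a time between time A and time B (here, the latest
    # frame captured before time B).
    vision_ts = max(ts for ts in frames if t_a <= ts < t_b)
    state = {
        'vision': frames[vision_ts]['vision'],
        'most_recent_action': frames[t_a]['action'],  # action over A -> B
    }
    currently_selected = frames[t_b]['action']        # action over B -> C
    return state, currently_selected, reward
```

The returned tuple corresponds to the offline data pushed at block 214.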

(40) In various implementations, method 200 can be parallelized across a plurality of separate processors and/or threads.

(41) FIG. 3 is a flowchart illustrating an example method 300 of performing an online critic-guided task episode, and pushing data from the online critic-guided task episode into an online buffer and optionally an offline buffer. For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include one or more components of one or more robots, such as one or more processors of one of robots 180A and 180B. Moreover, while operations of method 300 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

(42) At block 302, the system starts an online task episode.

(43) At block 304, the system stores current state data for the online task episode. The current state data includes most recently selected robotic action data as described herein. At an initial iteration of block 304 the most recently selected robotic action data can be a zero vector or other “null” indication as there are no previously selected robotic action(s) at the initial iteration. The current state data can also include, for example, vision data captured by a vision component associated with the robot and/or current state(s) of robotic component(s).

(44) At block 306, the system selects a robotic action by processing current state data using a current critic network. For example, the system can utilize a stochastic optimization technique (e.g., the CEM technique described herein) to sample a plurality of robotic actions using the current critic network, and can select the sampled action with the highest value generated using the current critic network.

(45) At block 307, the system determines whether a minimum amount of delay has been achieved. In some implementations, the minimum amount of delay is relative to initiation of a most recent iteration of block 306 during the online task episode and/or relative to initiation of a most recent iteration of block 308 (described below) during the online task episode. In some implementations, block 307 can optionally be omitted at least in an initial iteration of block 307 during the online task episode.

(46) If, at block 307, the system determines the minimum amount of delay has been achieved, the system proceeds to block 308 and executes the currently selected robotic action. For example, the system can provide commands to one or more actuators of the robot to cause the robot to execute the robotic action. For instance, the system can provide commands to actuator(s) of the robot to cause a gripper to translate and/or rotate as dictated by the robotic action and/or to cause the gripper to close or open as dictated by the robotic action (and if different than the current state of the gripper). In some implementations the robotic action can include a termination command (e.g., that indicates whether the episode should terminate) and, if the termination command indicates the episode should terminate, the robotic action at block 308 can be a termination of the episode.
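The delay check at blocks 307 and 308 can be sketched as follows. This is a minimal illustration; the `DelayGate` class name and the 0.1-second threshold are assumptions for the sketch, not details from the disclosure:

```python
import time

class DelayGate:
    """Gates action execution until a minimum amount of delay has elapsed
    since the most recent action selection or execution (block 307)."""

    def __init__(self, min_delay_s=0.1):
        self.min_delay_s = min_delay_s
        self.last_event = None  # timestamp of the last selection/execution

    def mark(self, now=None):
        """Record the start of an action selection or execution."""
        self.last_event = time.monotonic() if now is None else now

    def ready(self, now=None):
        """True if the minimum delay has been achieved. At the initial
        iteration there is no prior event, so the check is skipped."""
        if self.last_event is None:
            return True
        now = time.monotonic() if now is None else now
        return (now - self.last_event) >= self.min_delay_s

gate = DelayGate(min_delay_s=0.1)
assert gate.ready()              # initial iteration: no prior event
gate.mark(now=0.0)
assert not gate.ready(now=0.05)  # only 50 ms elapsed
assert gate.ready(now=0.15)      # 150 ms elapsed
```

In practice the gate would be marked at the start of each iteration of block 306 and/or block 308, matching the two reference points the paragraph describes.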

(47) At block 310, the system determines a reward based on the system executing the robotic action using the current critic network. In some implementations, when the action is a non-terminal action, the reward can be, for example, a "0" reward, or a small penalty (e.g., −0.05) to encourage faster robotic task completion. In some implementations, when the action is a terminal action, the reward can be a "1" if the robotic task was successful and a "0" if the robotic task was not successful. For example, for a grasping task the reward can be "1" if an object was successfully grasped, and a "0" otherwise.
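The reward scheme can be summarized in a short sketch. The `step_reward` helper and the exact penalty magnitude are illustrative assumptions; the terminal reward follows the grasping example, with "1" for success and "0" otherwise:

```python
def step_reward(terminal, task_successful, step_penalty=0.05):
    """Reward at block 310: non-terminal steps get zero reward or a small
    penalty to encourage faster task completion; the terminal step gets
    1 for success and 0 for failure. The penalty magnitude is illustrative."""
    if not terminal:
        return -step_penalty
    return 1.0 if task_successful else 0.0

assert step_reward(terminal=False, task_successful=False) == -0.05
assert step_reward(terminal=True, task_successful=True) == 1.0
assert step_reward(terminal=True, task_successful=False) == 0.0
```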

(48) The system can utilize various techniques to determine whether a grasp or other robotic task is successful. For example, for a grasp, at termination of an episode the gripper can be moved out of the view of the camera and a first image captured when it is out of the view. Then the gripper can be returned to its prior position and "opened" (if closed at the end of the episode) to thereby drop any grasped object, and a second image captured. The first image and the second image can be compared, using background subtraction and/or other techniques, to determine whether the gripper was grasping an object (e.g., the object would be present in the second image, but not the first)—and an appropriate reward assigned to the last time step.
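The background-subtraction check can be illustrated with a minimal sketch; the `grasp_succeeded` helper, the pixel thresholds, and the toy grayscale frames are all assumptions for illustration:

```python
def grasp_succeeded(first_image, second_image, pixel_delta=10, min_changed=5):
    """Heuristic grasp-success check via background subtraction: compares
    an image taken with the gripper out of view (before dropping any
    grasped object) against one taken after the drop. A dropped object
    appears as a region of changed pixels in the second image.
    Images are nested lists of grayscale intensities; thresholds are
    illustrative, not from the disclosure."""
    changed = 0
    for row_a, row_b in zip(first_image, second_image):
        for a, b in zip(row_a, row_b):
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed >= min_changed

# Toy 4x4 grayscale frames: a dropped "object" appears in the second image.
before = [[0] * 4 for _ in range(4)]
after = [[0] * 4 for _ in range(4)]
for r in range(2):
    for c in range(3):
        after[r][c] = 200

assert grasp_succeeded(before, after)       # object dropped: grasp succeeded
assert not grasp_succeeded(before, before)  # no change: nothing was grasped
```

A production system would typically add morphological filtering or connected-component analysis to reject sensor noise, but the thresholded pixel count captures the core comparison the paragraph describes.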

(49) At block 312, the system pushes the current state data of block 304, the robotic action selected at block 306, and the reward of block 310 to an online buffer to be utilized as online data during reinforcement learning. At block 312, the system can also push the state of block 304, the robotic action selected at block 306, and the reward of block 310 to an offline buffer to be subsequently used as offline data during the reinforcement learning.

(50) At block 314, the system determines whether to terminate the episode. In some implementations and/or situations, the system can terminate the episode if the robotic action at a most recent iteration of block 306 indicated termination. In some additional or alternative implementations and/or situations, the system can terminate the episode if a threshold quantity of iterations of blocks 304-312 have been performed for the episode and/or if other heuristics based termination conditions have been satisfied.

(51) If, at block 314 the system determines not to terminate the episode, then the system returns to block 304. If, at block 314, the system determines to terminate the episode, then the system proceeds to block 302 to start a new online task episode. The system can, at block 316, optionally reset a counter that is used in block 314 to determine if a threshold quantity of iterations of blocks 304-312 have been performed.

(52) In various implementations, method 300 can be parallelized across a plurality of separate real and/or simulated robots.

(53) FIG. 4 is a flowchart illustrating an example method 400 of using data from an online buffer or offline buffer in populating a training buffer with data that can be used to train a critic network. For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include one or more components of one or more computer systems, such as one or more processors that interact with replay buffer(s) 110 (FIG. 1). Moreover, while operations of method 400 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

(54) At block 402, the system starts training buffer population.

(55) At block 404, the system retrieves current state data and a currently selected robotic action. The current state data and the currently selected robotic action can be retrieved from an online buffer or an offline buffer. The online buffer can be one populated according to method 300 of FIG. 3. The offline buffer can be one populated according to the method 200 of FIG. 2. In some implementations, the system determines whether to retrieve from the online buffer or the offline buffer based on respective sampling rates for the two buffers. As described herein, the sampling rates for the two buffers can vary as reinforcement learning progresses.
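Sampling between the two buffers according to their respective sampling rates might look like the following sketch; the `pick_buffer` helper and the 0.2 online rate are illustrative assumptions:

```python
import random

def pick_buffer(online_rate, rng):
    """Chooses which buffer to retrieve from at block 404 based on the
    online buffer's sampling rate (e.g., 0.2 means roughly 20% of
    retrievals come from the online buffer). The rate can be varied as
    reinforcement learning progresses."""
    return "online" if rng.random() < online_rate else "offline"

rng = random.Random(0)
picks = [pick_buffer(0.2, rng) for _ in range(5000)]
frac_online = picks.count("online") / len(picks)
assert 0.15 < frac_online < 0.25  # empirical rate tracks the sampling rate
```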

(56) At block 406, the system determines a target value based on the retrieved information from block 404. In some implementations, the system determines the target value using stochastic optimization techniques as described herein. In some implementations, the stochastic optimization technique is CEM. In some of those implementations, block 406 can include using stochastic optimization to generate values for each of a plurality of actions. The value for each of the actions is determined by processing, using a version of the critic network, the current state data (including the most recently selected robotic action data) along with a corresponding one of the actions. The system can then select the maximum value and determine the target value based on the maximum value. In some implementations, the system determines the target value as a function of the max value and a reward included in the data retrieved at block 404.
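The target-value computation at block 406 can be sketched as a standard temporal-difference target. The discount factor and the zero bootstrap at terminal steps are assumptions for the sketch, as the disclosure states only that the target is a function of the max value and the reward:

```python
GAMMA = 0.9  # illustrative discount factor (an assumption, not from the disclosure)

def target_value(reward, candidate_values, terminal):
    """TD target at block 406: the reward plus the discounted maximum
    value over sampled candidate actions. candidate_values are critic
    outputs for actions sampled via stochastic optimization (e.g., CEM);
    at terminal steps there is nothing to bootstrap from."""
    if terminal or not candidate_values:
        return reward
    return reward + GAMMA * max(candidate_values)

assert target_value(1.0, [0.2, 0.7], terminal=True) == 1.0
assert abs(target_value(0.0, [0.2, 0.7], terminal=False) - 0.63) < 1e-9
```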

(57) At block 408, the system stores, in a training buffer, current state data (including the most recently selected robotic action data), a currently selected robotic action, and the target value determined at block 406. The system then proceeds to block 404 to perform another iteration of blocks 404 and 406.

(58) In various implementations, method 400 can be parallelized across a plurality of separate processors and/or threads. Also, although method 200, 300, and 400 are illustrated in separate figures herein for the sake of clarity, it is understood that in many implementations methods 200, 300, and 400 are performed in parallel during reinforcement learning.

(59) FIG. 5 is a flowchart illustrating an example method 500 of training a critic network. For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include one or more components of one or more computer systems, such as one or more processors of one of training workers 124A-N and/or parameter servers. Moreover, while operations of method 500 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

(60) At block 502, the system starts training the critic network.

(61) At block 504, the system retrieves, from a training buffer, current state data (including the most recently selected robot action data), a currently selected robotic action, and a target value.

(62) At block 506, the system generates a predicted value by processing the current state data and the currently selected robotic action using a current version of the critic network. It is noted that in various implementations the current version of the critic network utilized to generate the predicted value at block 506 will be updated relative to the model utilized to generate the target value that is retrieved at block 504. In other words, the target value that is retrieved at block 504 will be generated based on a lagged version of the critic network.
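One way to realize the lagged version described above is to sync a separate copy of the critic parameters only periodically, so that targets are computed with the older copy while the live copy trains. The `LaggedCritic` class and its update interval are assumptions for the sketch, not details from the disclosure:

```python
class LaggedCritic:
    """Keeps a lagged copy of critic parameters: target values are
    generated with the lagged copy while the live copy is updated each
    training step. The periodic-sync policy and interval are illustrative."""

    def __init__(self, params, update_every=100):
        self.live = dict(params)
        self.lagged = dict(params)
        self.update_every = update_every
        self.steps = 0

    def train_step(self, updated_params):
        """Apply a parameter update; sync the lagged copy periodically."""
        self.live = dict(updated_params)
        self.steps += 1
        if self.steps % self.update_every == 0:
            self.lagged = dict(self.live)

critic = LaggedCritic({"w": 0.0}, update_every=2)
critic.train_step({"w": 1.0})
assert critic.lagged["w"] == 0.0  # target copy still lags the live copy
critic.train_step({"w": 2.0})
assert critic.lagged["w"] == 2.0  # synced after the update interval
```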

(63) At block 508, the system generates a loss value based on the predicted value and the target value. For example, the system can generate a log loss based on the two values.
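A log loss over the predicted and target values can be sketched as binary cross-entropy with clipping for numerical stability; the exact loss form and the clipping constant are assumptions, as the disclosure says only that a log loss can be generated from the two values:

```python
import math

def log_loss(predicted, target, eps=1e-7):
    """Log (cross-entropy) loss between the critic's predicted value and
    the target value (block 508). The prediction is clipped away from 0
    and 1 so the logarithm stays finite."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

assert abs(log_loss(0.5, 1.0) - math.log(2.0)) < 1e-9
assert log_loss(0.99, 1.0) < log_loss(0.5, 1.0)  # better prediction, lower loss
```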

(64) At block 510, the system determines whether additional current state data (including the most recently selected robot action data), a currently selected robotic action, and a target value remain to be retrieved for the batch (where batch techniques are utilized). If the decision at block 510 is yes, then the system performs another iteration of blocks 504, 506, and 508. If the decision is no, then the system proceeds to block 512.

(65) At block 512, the system determines a gradient based on the loss(es) determined at iteration(s) of block 508, and provides the gradient to a parameter server for updating parameters of the critic network based on the gradient. The system then proceeds back to block 504 and performs additional iterations of blocks 504, 506, 508, and 510, and determines an additional gradient at block 512 based on loss(es) determined in the additional iteration(s) of block 508.

(66) In various implementations, method 500 can be parallelized across a plurality of separate processors and/or threads. Also, although methods 200, 300, 400, and 500 are illustrated in separate figures herein for the sake of clarity, it is understood that in many implementations they are performed in parallel during reinforcement learning.

(67) FIG. 6 is a flowchart illustrating an example method 600 of performing a robotic task using a trained critic network. The critic network can be trained, for example, based on methods 200, 300, 400, and 500 of FIGS. 2-5. For convenience, the operations of the flow chart are described with reference to a system that performs the operations. This system may include one or more components of one or more robots, such as one or more processors of one of robots 180A and 180B. Moreover, while operations of method 600 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

(68) At block 602, the system starts performance of a robotic task.

(69) At block 604, the system determines current state data, including most recently selected robotic action data. At an initial iteration of block 604 the most recently selected robotic action data can be a zero vector or other “null” indication as there are no previously selected robotic action(s) at the initial iteration. The current state data can also include, for example, vision data captured by a vision component associated with the robot and/or current state(s) of robotic component(s). As described herein, when the most recently selected robotic action data is a difference between a target state of robotic component(s) (to be achieved based on the most recently selected robotic action) and a current state of the robotic component(s), the current state can be selected based on it corresponding most closely (temporally) to the current vision data. For example, the current state of the robotic component(s) may not be based on the most recent data available in a state buffer but, instead, the data that has a timestamp that is closest to a timestamp of the most recent vision data instance in a vision data buffer (which may populate at a lower frequency than the state buffer).
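The timestamp-matching selection described above can be sketched as follows, with a hypothetical `state_for_vision_frame` helper over a buffer of (timestamp, state) tuples:

```python
def state_for_vision_frame(state_buffer, vision_timestamp):
    """Picks the robot state whose timestamp is closest to the most recent
    vision frame, rather than simply the newest state, since the state
    buffer typically populates at a higher frequency than the vision data
    buffer. state_buffer is a list of (timestamp, state) tuples."""
    return min(state_buffer, key=lambda ts_state: abs(ts_state[0] - vision_timestamp))

# The state buffer holds newer entries than the latest vision frame.
states = [(0.00, "s0"), (0.02, "s1"), (0.04, "s2"), (0.06, "s3")]
vision_ts = 0.025  # timestamp of the most recent vision data instance

ts, state = state_for_vision_frame(states, vision_ts)
assert state == "s1"  # closest temporally, not the newest state "s3"
```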

(70) At block 606, the system selects a robotic action to perform the robotic task. In some implementations, the system selects the robotic action using stochastic optimization techniques as described herein. In some implementations, the stochastic optimization technique is CEM and, in some of those implementations, block 606 may include one or more of the following sub-blocks.

(71) At sub-block 6061, the system selects N actions, where N is an integer number.

(72) At sub-block 6062, the system generates a value for each action by processing each of the N actions and the current state data (including most recently selected robotic action data) using the trained critic network.

(73) At sub-block 6063, the system selects M actions from the N actions based on the generated values, where M is an integer number.

(74) At sub-block 6064, the system samples N new actions from a Gaussian distribution that is fitted to the M actions selected at sub-block 6063.

(75) At sub-block 6065, the system generates a value for each action by processing each of the N actions and the current state data (including most recently selected robotic action data) using the trained critic network.

(76) At sub-block 6066, the system selects the action having the maximum value from among the values generated at sub-block 6065.
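Sub-blocks 6061-6066 can be sketched for a one-dimensional action space as follows. The critic is stood in for by a simple `score_fn`; N, M, the action bounds, and the single fit-and-resample pass are illustrative assumptions (CEM implementations often iterate that pass):

```python
import random

def cem_select_action(score_fn, low, high, n=64, m=6, seed=0):
    """Sketch of the CEM selection in sub-blocks 6061-6066 for a 1-D
    action: sample N actions uniformly, score each with the critic
    (score_fn stands in for processing the action and the current state
    data, including most recently selected robotic action data), keep the
    M highest-scoring actions, fit a Gaussian to them, resample N actions
    from that Gaussian, and return the highest-scoring resampled action."""
    rng = random.Random(seed)
    actions = [rng.uniform(low, high) for _ in range(n)]   # sub-block 6061
    scored = sorted(actions, key=score_fn, reverse=True)   # sub-block 6062
    elites = scored[:m]                                    # sub-block 6063
    mu = sum(elites) / m
    sigma = (sum((a - mu) ** 2 for a in elites) / m) ** 0.5 or 1e-3
    resampled = [rng.gauss(mu, sigma) for _ in range(n)]   # sub-block 6064
    return max(resampled, key=score_fn)                    # sub-blocks 6065-6066

# Toy critic that peaks at action = 0.3; CEM should land near the peak.
best = cem_select_action(lambda a: -(a - 0.3) ** 2, low=-1.0, high=1.0)
assert abs(best - 0.3) < 0.1
```

Real robotic actions are vectors (e.g., pose changes plus gripper and termination commands), so the Gaussian fit would be per-dimension or full-covariance, but the sample-score-refit structure is the same.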

(77) At block 608, the system determines whether a minimum amount of delay has been achieved. In some implementations, the minimum amount of delay is relative to initiation of a most recent iteration of block 606 during the robotic task performance and/or relative to initiation of a most recent iteration of block 610 (described below) during the robotic task performance. In some implementations, block 608 can optionally be omitted at least in an initial iteration of block 608 during the performance of the robotic task.

(78) At block 610, the robot executes the selected robotic action.

(79) At block 612, the system determines whether to terminate performance of the robotic task. In some implementations and/or situations, the system can terminate the performance of the robotic task if the robotic action at a most recent iteration of block 606 indicated termination. In some additional or alternative implementations and/or situations, the system can terminate the episode if a threshold quantity of iterations of blocks 604, 606, 608, and 610 have been performed for the performance and/or if other heuristics based termination conditions have been satisfied.

(80) If the system determines, at block 612, not to terminate, then the system performs another iteration of blocks 604, 606, 608, and 610. If the system determines, at block 612, to terminate, then the system proceeds to block 614 and ends performance of the robotic task.

(81) Various machine learning architectures can be utilized for the critic network. In various implementations any vision data, of current state data, can be processed utilizing a first branch of the critic network to generate a vision data embedding. Further, the most recently selected robotic action data (of the current state data) can be processed utilizing a second branch of the critic network, along with a candidate robotic action to be considered and optionally other current state data (e.g., that indicates whether a gripper is open/closed/between open and closed), to generate an additional embedding. The two embeddings can be concatenated (or otherwise combined) and processed utilizing additional layer(s) of the model to generate a corresponding value.
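The two-branch architecture can be sketched with toy dense layers. The layer sizes, weights, and `critic_value` helper are placeholders for illustration, not details from the disclosure:

```python
def relu_layer(x, weights, biases):
    """One dense layer with ReLU: out[j] = max(0, x . weights[:, j] + biases[j]).
    weights is a list of rows, one per input dimension."""
    return [max(0.0, sum(xi * weights[i][j] for i, xi in enumerate(x)) + biases[j])
            for j in range(len(biases))]

def critic_value(vision_input, action_and_state_input, params):
    """Two-branch critic: a first branch embeds vision data; a second
    branch embeds the candidate robotic action together with most recently
    selected robotic action data (and optionally other state, e.g. gripper
    openness); the embeddings are concatenated and a head emits a scalar value."""
    v = relu_layer(vision_input, params["w_vision"], params["b_vision"])
    a = relu_layer(action_and_state_input, params["w_action"], params["b_action"])
    combined = v + a  # concatenation of the two embeddings
    return relu_layer(combined, params["w_head"], params["b_head"])[0]

# Tiny illustrative parameters: 3-d vision input -> 2-d embedding,
# 2-d action input -> 2-d embedding, 4-d concatenation -> scalar value.
params = {
    "w_vision": [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]], "b_vision": [0.0, 0.0],
    "w_action": [[0.2, 0.0], [0.0, 0.2]],             "b_action": [0.0, 0.0],
    "w_head":   [[1.0], [1.0], [1.0], [1.0]],         "b_head":   [0.0],
}
value = critic_value([1.0, 1.0, 1.0], [0.5, 0.5], params)
assert abs(value - 0.6) < 1e-9
```

In practice the vision branch would be a convolutional stack and the branches would be trained end to end; the sketch only shows the branch-embed-concatenate-score structure.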

(82) FIG. 7 schematically depicts an example architecture of a robot 725. The robot 725 includes a robot control system 760, one or more operational components 740a-740n, and one or more sensors 742a-742m. The sensors 742a-742m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth. While sensors 742a-742m are depicted as being integral with robot 725, this is not meant to be limiting. In some implementations, sensors 742a-742m may be located external to robot 725, e.g., as standalone units.

(83) Operational components 740a-740n may include, for example, one or more end effectors and/or one or more servo motors or other actuators to effectuate movement of one or more components of the robot. For example, the robot 725 may have multiple degrees of freedom and each of the actuators may control actuation of the robot 725 within one or more of the degrees of freedom responsive to the control commands. As used herein, the term actuator encompasses a mechanical or electrical device that creates motion (e.g., a motor), in addition to any driver(s) that may be associated with the actuator and that translate received control commands into one or more signals for driving the actuator. Accordingly, providing a control command to an actuator may comprise providing the control command to a driver that translates the control command into appropriate signals for driving an electrical or mechanical device to create desired motion.

(84) The robot control system 760 may be implemented in one or more processors, such as a CPU, GPU, and/or other controller(s) of the robot 725. In some implementations, the robot 725 may comprise a “brain box” that may include all or aspects of the control system 760. For example, the brain box may provide real time bursts of data to the operational components 740a-740n, with each of the real time bursts comprising a set of one or more control commands that dictate, inter alia, the parameters of motion (if any) for each of one or more of the operational components 740a-740n. In some implementations, the robot control system 760 may perform one or more aspects of methods 300 and/or 600 described herein.

(85) As described herein, in some implementations all or aspects of the control commands generated by control system 760 in performing a robotic task can be based on an action selected based on current state (e.g., based at least on most recently selected robotic action data, and optionally current vision data) and based on utilization of a trained critic network as described herein. Stochastic optimization techniques can be utilized in selecting an action at each time step of controlling the robot. Although control system 760 is illustrated in FIG. 7 as an integral part of the robot 725, in some implementations, all or aspects of the control system 760 may be implemented in a component that is separate from, but in communication with, robot 725. For example, all or aspects of control system 760 may be implemented on one or more computing devices that are in wired and/or wireless communication with the robot 725, such as computing device 810.

(86) FIG. 8 is a block diagram of an example computing device 810 that may optionally be utilized to perform one or more aspects of techniques described herein. For example, in some implementations computing device 810 may be utilized to provide desired object semantic feature(s) for grasping by robot 725 and/or other robots. Computing device 810 typically includes at least one processor 814 which communicates with a number of peripheral devices via bus subsystem 812. These peripheral devices may include a storage subsystem 824, including, for example, a memory subsystem 825 and a file storage subsystem 826, user interface output devices 820, user interface input devices 822, and a network interface subsystem 816. The input and output devices allow user interaction with computing device 810. Network interface subsystem 816 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.

(87) User interface input devices 822 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 810 or onto a communication network.

(88) User interface output devices 820 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 810 to the user or to another machine or computing device.

(89) Storage subsystem 824 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 824 may include the logic to perform selected aspects of the method of FIGS. 2, 3, 4, 5, and/or 6.

(90) These software modules are generally executed by processor 814 alone or in combination with other processors. Memory 825 used in the storage subsystem 824 can include a number of memories including a main random access memory (RAM) 830 for storage of instructions and data during program execution and a read only memory (ROM) 832 in which fixed instructions are stored. A file storage subsystem 826 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 826 in the storage subsystem 824, or in other machines accessible by the processor(s) 814.

(91) Bus subsystem 812 provides a mechanism for letting the various components and subsystems of computing device 810 communicate with each other as intended. Although bus subsystem 812 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.

(92) Computing device 810 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 810 depicted in FIG. 8 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 810 are possible having more or fewer components than the computing device depicted in FIG. 8.

(93) In some implementations, a method implemented by one or more processors of a robot during performance of a robotic task is provided and includes controlling a robot to implement a most recently selected robotic action that was determined based on processing, utilizing a trained neural network model that represents a learned value function, of the robotic action and of prior vision data captured by a vision component of the robot. The most recently selected robotic action defines a target next state of the robot in performance of the robotic task. The method further includes, during the controlling of the robot to implement the most recently selected robotic action and prior to the robot achieving the target next state defined by the most recently selected robotic action: (a) identifying current vision data that is captured by the vision component during the controlling of the robot to implement the most recently selected robotic action and prior to the robot achieving the target next state of the robot defined by the most recently selected robotic action; (b) identifying a candidate next robotic action; (c) processing, utilizing the trained neural network model, the current vision data, the candidate next robotic action, and most recently selected robotic action data; (d) generating a value for the candidate next robotic action based on the processing; and (e) selecting the candidate next robotic action based on the value. The most recently selected robotic action data includes the most recently selected robotic action, and/or a difference between the target next state of the robot and a current state of the robot that temporally corresponds to the current vision data. The method further includes controlling the robot to implement the selected candidate next robotic action.

(94) These and other implementations may include one or more of the following features.

(95) In some implementations, the most recently selected robotic action data includes the difference between the target next state of the robot and the current state of the robot that temporally corresponds to the current vision data. In some of those implementations, the method further includes: selecting the current vision data based on it being most recently captured and buffered in a vision data buffer; and selecting the current state of the robot, for use in determining the difference, based on a current state timestamp, for the current state, being closest temporally to a vision data timestamp of the current vision data. For example, selecting the current state of the robot can include selecting the current state of the robot in lieu of a more recent state of the robot that is more up to date than the current state, based on the current state of the robot being closer temporally to the vision data timestamp than is the more recent state of the robot.

(96) In some implementations, controlling the robot to implement the selected candidate next robotic action includes determining a particular control cycle at which to begin controlling the robot to implement the selected candidate next robotic action. Determining the particular control cycle can be based on determining whether a minimum amount of time and/or control cycles have passed. The minimum amount of time and/or control cycles can optionally be relative to initiation of generating the value for the candidate next robotic action, and/or beginning controlling the robot to implement the most recently selected robot action. Optionally, the particular control cycle is not a control cycle that immediately follows selecting the candidate next robotic action.

(97) In some implementations, controlling the robot to implement the selected candidate next robotic action occurs prior to the robot achieving the target next state.

(98) In some implementations, controlling the robot to implement the selected candidate next robotic action occurs in a control cycle that immediately follows the robot achieving the target next state.

(99) In some implementations, the method further includes, during the controlling of the robot to implement the most recently determined robotic action and prior to the robot achieving the target next state defined by the most recently determined robotic action: identifying an additional candidate next robotic action; processing, utilizing the trained neural network model, the current vision data, the additional candidate next robotic action, and the most recently selected robotic action data; and generating an additional value for the additional candidate next robotic action based on the processing. In those implementations, selecting the candidate next robotic action is based on comparing the value to the additional value.

(100) In some implementations, the candidate next robotic action includes a pose change for a component of the robot. In some of those implementations, the component is an end effector and the pose change defines a translation difference for the end effector and a rotation difference for the end effector. For example, the end effector can be a gripper and the robotic task can be a grasping task.