LEARNING ROBOTIC SKILLS WITH IMITATION AND REINFORCEMENT AT SCALE

Abstract

Utilizing an initial set of offline positive-only robotic demonstration data for pre-training an actor network and a critic network for robotic control, followed by further training of the networks based on online robotic episodes that utilize the network(s). Implementations enable the actor network to be effectively pre-trained, while mitigating occurrences of and/or the extent of forgetting when further trained based on episode data. Implementations additionally or alternatively enable the actor network to be trained to a given degree of effectiveness in fewer training steps. In various implementations, one or more adaptation techniques are utilized in performing the robotic episodes and/or in performing the robotic training. The adaptation techniques can each, individually, result in one or more corresponding advantages and, when used in any combination, the corresponding advantages can accumulate. The adaptation techniques include Positive Sample Filtering, Adaptive Exploration, Using Max Q Values, and Using the Actor in CEM.

Claims

1. A method implemented by one or more processors, the method comprising: pre-training an actor network and a critic network using reinforcement learning and offline robotic demonstration data from demonstrated robotic episodes, wherein the actor network is a first neural network model that represents a policy, wherein the critic network is a second neural network model that represents a Q-function, and wherein pre-training the actor network and the critic network comprises: pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, and pre-training the critic network based on the robotic demonstration data and using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: further training the actor network and the critic network using reinforcement learning and online episode data from robotic episodes each performed based on the actor network and/or the critic network, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and the CEM, wherein the second set includes a given quantity of unsuccessful episode data, that is from unsuccessful episodes of the robotic episodes, and wherein the given quantity is greater than an alternate quantity, of the unsuccessful episode data, that is included in the first set.

2. The method of claim 1, wherein the alternate quantity is zero and wherein the first set includes only successful episode data that is from successful episodes of the robotic episodes.

3. The method of claim 2, wherein the second set includes the successful episode data that is also included in the first set and includes the unsuccessful episode data.

4. The method of claim 1, wherein the alternate quantity, of the unsuccessful episode data, of the first set, is greater than zero, and wherein the unsuccessful episode data of the first set is a subset of the unsuccessful episode data that is included in the second set.

5. The method of claim 4, wherein the ratio of the successful episode data to the unsuccessful episode data, included in the first set, is greater than three to one.

6. The method of claim 5, wherein the ratio of the successful episode data to the unsuccessful episode data, included in the first set, is greater than ten to one.

7. The method of claim 1, further comprising: generating the first set based on data from the robotic episodes; generating the second set based on filtering, from the first set, at least a majority of the unsuccessful episode data.

8. The method of claim 7, wherein generating the first set based on data from the robotic episodes comprises: populating, over time, a replay buffer with the first set; and wherein further training the actor network based on the first set comprises sampling the episode data of the first set from the replay buffer.

9. The method of claim 8, wherein populating, over time, the replay buffer with the first set, comprises: populating the replay buffer with a goal to maintain a particular ratio, of the successful episode data to the unsuccessful episode data, that is in the replay buffer.

10. The method of claim, 1 further comprising: performing the robotic episodes, wherein performing each of the robotic episodes comprises: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for the robotic episode; determining robotic actions to perform, in the robotic episode, according to the selected exploration strategy.

11. The method of claim 1, further comprising: performing the robotic episodes, wherein performing each of the robotic episodes comprises: for each step of multiple steps of the robotic episode: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for the step; determining a robotic action to perform, for the step, according to the selected exploration strategy.

12. The method of claim 11, wherein the first exploration strategy is a CEM policy in which CEM is performed, using the critic network and sampled actions, and results from the CEM are utilized in selecting an action; and wherein the second exploration strategy is a greedy Gaussian policy in which a Gaussian probability distribution, generated using the actor network based on a corresponding state and corresponding to candidate actions, is utilized in selecting an action.

13. The method of claim 11, wherein selecting the selected exploration strategy from at least the first exploration strategy and the second exploration strategy comprises: selecting the first strategy at a first rate and selecting the second strategy at a second rate that is less than the first rate.

14. The method of claim 13, further comprising: adjusting, the first rate and the second rate after performing at least a threshold quantity of the robotic episodes, wherein adjusting the first rate and the second rate comprises making the first rate and the second rate closer to one another.

15. The method of claim 1, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.

16. The method of claim 1, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; using the actor action, as an initial mean for CEM in sampling candidate actions; processing the state data and each of the candidate actions, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.

17. The method of claim 1, further comprising, subsequent to the further training: using the actor network, independent of the critic network, in autonomous control of a robot.

18. A method implemented by one or more processors, the method comprising: pre-training an actor network and a critic network using reinforcement learning and robotic demonstration data from demonstrated robotic episodes, wherein the actor network is a first neural network model that represents a policy, wherein the critic network is a second neural network model that represents a Q-function, and wherein pre-training the actor network and the critic network comprises; pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, and pre-training the critic network based on the robotic demonstration data and using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes based on the actor network and/or the critic network, wherein performing each of the robotic episodes comprises: selecting, from at least a first exploration strategy and a second exploration strategy, a selected exploration strategy for: the robotic episode as a whole, or each of multiple steps of the robotic episode; determining robotic actions to perform, in the robotic episode, according to the selecting; further training the actor network and the critic network using reinforcement learning and online episode data from the robotic episodes, wherein further training the actor network and the critic network comprises: further training the actor network based on a first set of the episode data, and using the advantage-weighted regression training objective, and further training the critic network based on a second set of the episode data, and using Q-learning and CEM.

19. The method of claim 18, wherein the first exploration strategy is a CEM policy in which CEM is performed, using the critic network and sampled actions, and results from the CEM are utilized in selecting an action; and wherein the second exploration strategy is a greedy Gaussian policy in which a Gaussian probability distribution, generated using the actor network based on a corresponding state and corresponding to candidate actions, is utilized in selecting an action.

20. A method implemented by one or more processors, the method comprising: pre-training an actor network and a critic network using reinforcement learning and robotic demonstration data from demonstrated robotic episodes, wherein the actor network is a first neural network model that represents a policy, wherein the critic network is a second neural network model that represents a Q-function, and wherein pre-training the actor network and the critic network comprises; pre-training the actor network using an advantage-weighted regression training objective, the advantage-weighted regression training objective utilizing values generated using the second neural network model, and pre-training the critic network based on the robotic demonstration data and using Q-learning and a cross-entropy method (CEM), subsequent to pre-training the actor network and pre-training the critic network: performing online robotic episodes, wherein performing a given robotic episode; further training the actor network and the critic network using reinforcement learning and episode data from the robotic episodes, wherein further training the critic network based on a second set of the episode data comprises, for an instance of the second set of episode data: processing state data of the instance, using the actor network, to generate actor network output; selecting an actor action based on the actor network output; processing the state data and the actor action, using the critic network, to generate an actor action measure for the actor action; processing the state data and each of multiple candidate actions sampled using CEM, using the critic network, to generate a corresponding candidate action measure for each of the candidate actions; determining, from amongst the actor action measure and the corresponding candidate action measures, a maximum measure; and using the maximum measure in training of the critic network.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0033] FIG. 1 is a flowchart illustrating an example method of pre-training an actor network and a critic network based on offline demonstration data, according to various implementations disclosed herein.

[0034] FIG. 2 is a flowchart illustrating an example method of generating online episode data, according to various implementations disclosed herein.

[0035] FIG. 3 is a flowchart illustrating an example method of further training an actor network and a critic network based on online episode data, according to various implementations disclosed herein.

[0036] FIG. 4 schematically depicts an example architecture of a robot.

[0037] FIG. 5 schematically depicts an example architecture of a computer system.

DETAILED DESCRIPTION

[0038] Implementations disclosed herein relate to particular techniques for utilizing an initial set of offline positive-only robotic demonstration data for pre-training an actor network and a critic network, followed by further training of the networks based on online robotic episodes that utilize the network(s). The online robotic episodes, that utilize the network(s) can include those performed by real physical robot(s) in real environment(s) and/or those performed by robotic simulator(s) in simulated environment(s). The actor network and/or the critic network can be trained to perform one or more robotic task(s), such as those that involve manipulating object(s). For example, the task(s) can include pushing, grasping, or otherwise manipulating one or more objects. As another example, the task can include a more complex task such as loading each of multiple objects into a dishwasher or picking object(s) and placing each of them into an appropriate area (e.g., into one of a recycling bin, a compost bin, and a trash bin).

[0039] Techniques disclosed herein can be utilized in combination with various real and/or simulated robots, such as a telepresence robot, a wheeled robot, mobile forklift robot, a robot arm, an unmanned aerial vehicle (“UAV”), and/or a humanoid robot. The robot(s) can include various sensor component(s) and state data that is utilized in techniques disclosed herein can include sensor data that is generated by those sensor component(s) (e.g., images from a camera and/or other vision data from other vision component(s)) and/or can include state data that is derived from such sensor data (e.g., object bounding box(es) derived from vision data). As a particular example, a robot can include vision component(s) such as, for example, a monographic camera (e.g., generating 2D RGB images), a stereographic camera (e.g., generating 2.5D RGB-D images), and/or a laser scanner (e.g., LIDAR generating a 2.5D depth (D) image or point cloud). A robot can additionally optionally include arm(s) and/or other appendage(s) with end effector(s), such as those that take the form of a gripper. Additional description of some examples of the structure and functionality of various robots is provided herein.

[0040] Robotic simulator(s), when utilized in techniques disclosed herein, can be implemented by one or more computing devices. A robotic simulator is used to simulate an environment that includes corresponding environmental object(s), to simulate a robot operating in the simulated environment, to simulate responses of the simulated robot in response to virtual implementation of various simulated robotic actions, and to simulate interactions between the simulated robot and the simulated environmental objects in response to the simulated robotic actions. Various simulators can be utilized, such as physics engines that simulates collision detection, soft and rigid body dynamics, etc. One non-limiting example of such a simulator is the BULLET physics engine.

[0041] Turning now to the figures, FIG. 1 is a flowchart illustrating an example method 100 of pre-training an actor network and a critic network based on offline demonstration data, according to various implementations disclosed herein. For convenience, the operations of method 100 are described with reference to a system that performs the operations. This system may include one or more components, such as processor(s)) of a robot and/or of computing device(s) (e.g., a cluster of high performance computing devices). Moreover, while operations of method 100 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

[0042] At block 102, pre-training of an actor network and a critic network begins. The actor network can be a first neural network model and the critic network can be a separate neural network model. The actor network can be used to process state data to generate output that indicates an action to be taken in view of the state data. The output can be, for example, a probability distribution over an action space. The action to be taken, based on the output of the actor network, can be the highest probability action, optionally subject to one or more rules-based constraints (e.g., safety and/or kinematic constraint(s)). The critic network can be used to process state data and a candidate action, and generate a measure (e.g., a Q-value) that represents the expected discounted reward for taking the candidate robotic action, in view of the state.

[0043] At block 104, the system identifies one or more instances of offline robotic demonstration data. For example, the system can identify an instance of offline robotic demonstration data in non-batch pre-training techniques. An instance of offline robotic demonstration data can be obtained, for example, from a replay buffer.

[0044] An instance of offline robotic demonstration data can include, for example, an instance of state data, a corresponding robotic action, next state data that is based on the state that results from the corresponding robotic action, and a corresponding reward for the demonstration episode on which the instance is based. In many implementations, the demonstration episodes are all positive demonstrations and, accordingly, the rewards will all be positive rewards, optionally discounted based on discount factor(s) (e.g., a duration of the episode and/or a length of a trajectory of the episode). The demonstration episodes can be, for example, provided by human(s) (e.g., through teleoperation and/or kinesthetic teaching) and/or can be scripted demonstration episodes.

[0045] The state data and next state data can include, for example, environmental state data (e.g., image(s) and/or other vision data captured by vision component(s) of a robot) and/or current robot state data (e.g., that indicates a current state of component(s) of the robot). The robotic action can include a representation of movement of one or more robotic component(s). As one example, the robotic action can indicate, in Cartesian space, a translation and/or rotation of an end effector of a robot. As another example, the robotic action can indicate, in joint space, a target joint configuration of one or more robot joints. As yet another example, the robotic action can indicate, in Cartesian space, a direction of movement of a robot base. Additional and/or alternative robotic action spaces can be defined and utilized.

[0046] At block 106, the system updates the critic network based on the instance(s). For example, the system can update the critic network utilizing Qt-opt techniques and using CEM and/or other stochastic optimization technique(s). In some implementations, in using CEM, CEM is used in selecting candidate action(s) and processing the candidate action(s), along with next state data, using the critic network. This enables utilization, in training, of generated Q-value(s) for the candidate action(s) with the next state data. This enables taking into account the impact that taking the action will have on the next state (e.g., will the next state provide for the ability to take further action(s) that are “good”).

[0047] In some implementations, block 106 includes optional sub-block 106A and/or optional sub-block 106B.

[0048] At sub-block 106A, the system uses the actor in CEM. For example, the system can predict an action using the actor network and based on the instance. That action can be processed, along with current state data of the instance, using the critic network, to generate an actor action measure (e.g., Q-value) for the actor action. Further, the current state data and each of multiple candidate actions sampled using CEM, are also processed using the critic network (i.e., N current state data candidate action pairs), to generate a corresponding candidate action measure (e.g., Q-value) for each of the candidate actions. Instead of always using the maximum candidate action measure (e.g., Q-value) from CEM as the maximum value for training of the critic network (and optionally in the advantage function for training of the actor network) as is typical, the system can compare the actor action measure to the maximum candidate action measure—and use the greater of the two measures as the maximum value for training.

[0049] At sub-block 106B, the system, instead of utilizing an Expected Q-value, a Max Q-Value can be utilized in training of the critic network, and can also be utilized as part of the advantage-weighted regression training objective in training the actor network (e.g., when it's being trained based on the Positive Sample Filtering referenced above).

[0050] At block 108, the system updates the actor network based on the instance(s). In some implementations, the system can update the actor network utilizing an advantage-weighted regression training objective, such as AWAC. In some of those implementations, the advantage-weighted regression training objective utilizes a corresponding Q-value (e.g., a Max Q-value) generated at block 106. For example, as illustrated by optional sub-block 108A of block 108, the training objective can optionally utilize the Max Q-value generated at sub-block 106B.

[0051] At block 110, the system determines if more pre-training should occur. This can be based on whether unprocessed demonstration data remains, whether a threshold duration and/or extent of training has occurred, and/or one or more other criteria.

[0052] If the decision at block 110 is that more pre-training should occur, the system proceeds back to block 104 and identifies new instance(s) of offline robotic demonstration data.

[0053] If the decision at block 110 is that pre-training is complete, the system proceeds to block 112. At block 112, the system proceeds to perform method 200 of FIG. 2 and method 300 of FIG. 3. For example, the system can perform method 200 to generate online episode data and simultaneously perform method 300 to further train the actor network and the critic network based on online episode data that is generated based on method 200. Method 200 and 300 are described in more detail below.

[0054] FIG. 2 is a flowchart illustrating an example method 200 of generating online episode data, according to various implementations disclosed herein. For convenience, the operations of method 200 are described with reference to a system that performs the operations. This system may include one or more components, such as processor(s)) of a robot and/or of computing device(s) (e.g., a cluster of high performance computing devices). Moreover, while operations of method 200 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

[0055] At block 202, generation of episode data begins.

[0056] At block 204, the system optionally selects an exploration strategy for an episode or for a step of the episode. For example, with Adaptive Exploration on an episode-by-episode basis, the system can select an exploration strategy for the episode. On the other hand, with Adaptive Exploration on a step-by-step basis, the system can select an exploration strategy for the upcoming step of the episode. The system can select the exploration strategy from amongst two or more exploration strategies such as a CEM policy, a Gaussian policy, and a greedy Gaussian policy. In selecting amongst the exploration strategies, the system can optionally select from amongst them with a probability, and the probabilities amongst exploration policies can differ. For example, a first exploration policy can be selected with an 80% probability and a second with a 20% probability.

[0057] At block 206, the system processes current state data, using the current action network and/or the current critic network, to select the next action. At an initial iteration of generating episode data, the current action network and the current critic network can be as pre-trained according to method 100. However, as described herein, in various implementations method 300 can be performed simultaneously with method 200. In such implementations, the actor network and the critic network being utilized in method 200 can be periodically updated based on the further training of method 300. Accordingly, the current critic network and the current actor network can evolve (e.g., at least weights thereof updated) over time during performance of method 200.

[0058] Block 206 optionally includes sub-block 206A, in which the system selects the next action based on the exploration strategy, as most recently selected at block 204.

[0059] At block 208, the system executes the next action.

[0060] At block 210, the system determines whether to perform another step in the episode. Whether to perform another step can be based on the most recently selected next action (e.g., was it a termination action), whether a threshold number of steps have been performed, whether the task is complete, and/or one or more other criteria.

[0061] If, at block 210, the system determines to perform another step in the episode, the system proceeds back to block 206 in implementation that don't utilize Adaptive Exploration. In implementations that do utilize Adaptive Exploration, the system proceeds to optional block 212, where the system determines to proceed to block 206 if step-by-step Adaptive Exploration is not being utilized or to instead proceed to block 204 if step-by-step Adaptive Exploration is being utilized.

[0062] If, at block 210, the system determines to perform another step in the episode, the system proceeds to block 214 and determines a reward for the episode. The reward can be determined based on a defined reward function, which will be dependent on the robotic task.

[0063] At block 216, the system stores episode data from the episode. For example, the system can store various instances of transitions during the episode and a reward for the episode. Each transition can include state data, action, and next state data (i.e., next state data from the next state that resulted from the action). In some implementations, block 216 includes sub-block 216A, in which the system populates some or all of the stored episode data in a replay buffer for use in method 300 of FIG. 3. In some of those implementations, whether the system populates the stored episode data in the replay buffer can depend on whether the reward, for the episode data, is positive—indicating a successful episode (i.e., one in which the robotic task was successfully performed). For example, the system can seek to maintain a certain ratio of successful to unsuccessful episode data in the replay buffer, and can determine whether and/or when to populate episode data in dependence on the ratio (e.g., based on what's currently in the replay buffer and based on whether the episode is successful).

[0064] At block 218, the system determines whether to perform more episodes. In some implementations, the system determines whether to perform more episodes based on whether the further training of method 300 is still occurring, whether a threshold quantity of episode data has been generated, whether a threshold duration of episode data generation has occurred, and/or one or more other criteria.

[0065] If, at block 218, the system determines to perform more episodes, the system returns to optional block 204 or, if block 204 is not present, to block 206. It is noted that prior to returning to block 206 the robot (e.g., physical or simulated) and/or the environment (e.g., virtual or simulated) can optionally be reset. For example, when a simulator is being used to perform method 200, the starting pose of the robot can be randomly reset and/or the simulated environment adapted (e.g., with new object(s), new lighting condition(s), and/or new object pose(s)—or even a completely new environment). It is also noted that multiple iterations of method 200 can be performed in parallel. For example, iterations of method 200 can be performed across multiple real physical robots and/or across multiple simulators.

[0066] FIG. 3 is a flowchart illustrating an example method 300 of further training an actor network and a critic network based on online episode data, according to various implementations disclosed herein. For convenience, the operations of method 300 are described with reference to a system that performs the operations. This system may include one or more components, such as processor(s)) of computing device(s) (e.g., a cluster of high performance computing devices). Moreover, while operations of method 300 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted or added.

[0067] At block 302, further training of the actor network and the critic network begins.

[0068] At optional block 304, the system identifies, from a replay buffer, instance(s) of online robotic episode data. The online robotic episode data can be generated based on method 200 of FIG. 2.

[0069] At block 306, the system determines if the instance(s) of episode data are from successful episode(s). If not, the system bypasses updating of the actor network in block 308 (described below). If so, the system does not bypass updating of the actor network in block 308. Accordingly, when optional block 306 is implemented, it can ensure that the actor network is only updated based on episode data from successful episodes.

[0070] At block 308, the system updates the actor network based on the instance(s). Block 308 can share one or more (e.g., all) aspects in common with block 108 of FIG. 1, although the episode data on which block 308 is performed is online episode data. Block 308 includes optional sub-block 308A, which can share one or more (e.g., all) aspects in common with block 108A of FIG. 1.

[0071] At block 310, the system updates the critic network based on the instance(s). Block 310 can share one or more (e.g., all) aspects in common with block 106 of FIG. 1, although the episode data on which block 310 is performed is online episode data. Block 310 includes optional sub-blocks 310A and 310B, which can share one or more (e.g., all) aspects in common with respective of blocks 106A and 106B of FIG. 1.

[0072] At block 312, the system determines if more training should occur. This can be based on whether unprocessed online episode data remains, whether a threshold duration and/or extent of further training has occurred, and/or one or more other criteria.

[0073] If the decision at block 312 is that more training should occur, the system proceeds back to block 304 and identifies new instance(s) of online robotic episode data.

[0074] If the decision at block 312 is that further training is complete, the system proceeds to block 314.

[0075] At block 314, the system can use, or provide for use, at least the actor network in robotic control. In some implementations, the system can use the actor network, independent of the critic network, in robotic control.

[0076] FIG. 4 schematically depicts an example architecture of a robot 420. The robot 420 includes a robot control system 460, one or more operational components 440a-440n, and one or more sensors 442a-442m. The sensors 442a-442m may include, for example, vision sensors, light sensors, pressure sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth. While sensors 442a-m are depicted as being integral with robot 420, this is not meant to be limiting. In some implementations, sensors 442a-m may be located external to robot 420, e.g., as standalone units.

[0077] Operational components 440a-440n may include, for example, one or more end effectors and/or one or more servo motors or other actuators to effectuate movement of one or more components of the robot. For example, the robot 420 may have multiple degrees of freedom and each of the actuators may control actuation of the robot 420 within one or more of the degrees of freedom responsive to the control commands. As used herein, the term actuator encompasses a mechanical or electrical device that creates motion (e.g., a motor), in addition to any driver(s) that may be associated with the actuator and that translate received control commands into one or more signals for driving the actuator. Accordingly, providing a control command to an actuator may comprise providing the control command to a driver that translates the control command into appropriate signals for driving an electrical or mechanical device to create desired motion.

[0078] The robot control system 460 may be implemented in one or more processors, such as a CPU, GPU, and/or other controller(s) of the robot 420. In some implementations, the robot 420 may comprise a “brain box” that may include all or aspects of the control system 460. For example, the brain box may provide real time bursts of data to the operational components 440a-n, with each of the real time bursts comprising a set of one or more control commands that dictate, inter alia, the parameters of motion (if any) for each of one or more of the operational components 440a-n. The control commands can be based on robotic actions determined utilizing a control policy as described herein. For example, the robotic actions can be determined using an actor network trained according to techniques described herein and, optionally, a critic network trained according to techniques described herein.

[0079] FIG. 5 is a block diagram of an example computing device 510 that may optionally be utilized to perform one or more aspects of techniques described herein. For example, a robotic simulator can be implemented on computing device 510 or in a cluster of multiple computing devices 510 (e.g., high-performance server(s) that may lack certain input and/or output component(s)). As another example, a cluster of multiple computing devices can implement one or more aspects of pre-training and/or further training described herein.

[0080] Computing device 510 typically includes at least one processor 514 which communicates with a number of peripheral devices via bus subsystem 512. These peripheral devices may include a storage subsystem 524, including, for example, a memory subsystem 525 and a file storage subsystem 526, user interface output devices 520, user interface input devices 522, and a network interface subsystem 516. The input and output devices allow user interaction with computing device 510. Network interface subsystem 516 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.

[0081] User interface input devices 522 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 510 or onto a communication network.

[0082] User interface output devices 520 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 510 to the user or to another machine or computing device.

[0083] Storage subsystem 524 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 524 may include the logic to perform certain aspects of the method of FIG. 1, the method of FIG. 2, and/or the method of FIG. 3.

[0084] These software modules are generally executed by processor 514 alone or in combination with other processors. Memory 525 used in the storage subsystem 524 can include a number of memories including a main random access memory (RAM) 530 for storage of instructions and data during program execution and a read only memory (ROM) 532 in which fixed instructions are stored. A file storage subsystem 526 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 526 in the storage subsystem 524, or in other machines accessible by the processor(s) 514.

[0085] Bus subsystem 512 provides a mechanism for letting the various components and subsystems of computing device 510 communicate with each other as intended. Although bus subsystem 512 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.

[0086] Computing device 510 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 510 depicted in FIG. 5 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 510 are possible having more or fewer components than the computing device depicted in FIG. 5.

[0087] Other implementations may include a non-transitory computer readable storage medium storing instructions executable by one or more processor(s) (e.g., a central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) to perform a method such as one or more of the methods described herein. Yet other implementations may include a system of one or more computers and/or one or more robots that include one or more processors operable to execute stored instructions to perform a method such as one or more of the methods described herein.

LEARNING ROBOTIC SKILLS WITH IMITATION AND REINFORCEMENT AT SCALE

Inventors

Cpc classification

Classification Explorer

G05B2219/40116

PHYSICS

Classification Explorer

B25J9/161

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

G05B2219/40499

PHYSICS

Classification Explorer

G06N3/092

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

B25J9/163

PERFORMING OPERATIONS; TRANSPORTING

International classification

Classification Explorer

B25J9/16

PERFORMING OPERATIONS; TRANSPORTING

Abstract

Claims

Description