Systems and Methods for Navigating Aerial Vehicles Using Deep Reinforcement Learning
20210124352 · 2021-04-29
Assignee
Inventors
- Salvatore J. Candido (Mountain View, CA)
- Jun Gong (Mountain View, CA, US)
- Marc Gendron-Bellemare (Montreal, CA)
Cpc classification
Y02T50/50
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
B64U2201/10
PERFORMING OPERATIONS; TRANSPORTING
G05D1/0088
PHYSICS
B64U10/30
PERFORMING OPERATIONS; TRANSPORTING
B64C39/024
PERFORMING OPERATIONS; TRANSPORTING
International classification
G05D1/00
PHYSICS
Abstract
The technology relates to navigating aerial vehicles using deep reinforcement learning techniques to generate flight policies. An operational system for controlling flight of an aerial vehicle may include a computing system configured to process an input vector representing a state of the aerial vehicle and output an action, an operation-ready policies server configured to store a trained neural network encoding a learned flight policy, and a controller configured to control the aerial vehicle. The input vector may be processed using the trained neural network encoding the learned flight policy. A method for navigating an aerial vehicle may include selecting a trained neural network encoding a learned flight policy from an operation policies server, generating an input vector comprising a set of characteristics representing a state of the aerial vehicle, selecting an action, by the trained neural network, based on the input vector, converting the action into a set of commands, by a flight computer, the set of commands configured to cause the aerial vehicle to perform the action, and causing, by a controller, the aerial vehicle to perform the action using the set of commands.
Claims
1. An operational system for controlling flight of an aerial vehicle, the system comprising: a computing system configured to process an input vector representing a state of the aerial vehicle and output an action; an operation-ready policies server configured to store a trained neural network encoding a learned flight policy; and a controller configured to control the aerial vehicle, wherein the computing system is configured to process the input vector using the trained neural network encoding the learned flight policy.
2. The system of claim 1, wherein the learned flight policy is configured to determine the action for the aerial vehicle to perform in a given situation.
3. The system of claim 1, wherein the input comprises an input vector characterizing the given situation.
4. The system of claim 1, wherein the neural network encoding the learned flight policy has been trained using a learning system implementing a reinforcement learning algorithm, the learning system having assigned the learned flight policy a score that meets or exceeds an operation-ready or equivalent threshold.
5. The system of claim 1, wherein the trained neural network encoding the learned flight policy is configured to achieve an objective.
6. The system of claim 1, further comprising a flight computer configured to convert the action into a set of commands configured to cause the aerial vehicle to perform the action.
7. The system of claim 6, wherein the controller comprises a logic circuit configured to implement the set of commands.
8. The system of claim 1, wherein the aerial vehicle comprises a lighter than air type vehicle.
9. The system of claim 1, wherein the aerial vehicle comprises a fixed-wing type vehicle.
10. A method for navigating an aerial vehicle, the method comprising: selecting a trained neural network encoding a learned flight policy from an operation policies server; generating an input vector comprising a set of characteristics representing a state of the aerial vehicle; selecting an action, by the trained neural network, based on the input vector; converting the action into a set of commands, by a flight computer, the set of commands configured to cause the aerial vehicle to perform the action; and causing, by a controller, the aerial vehicle to perform the action using the set of commands.
11. The method of claim 10, further comprising determining whether to continue operation of the aerial vehicle.
12. The method of claim 11, further comprising, after determining to continue operation of the aerial vehicle: generating another input vector representing a current state of the aerial vehicle; selecting a next action, by the trained neural network, based on the another input vector; and causing, by the controller, the aerial vehicle to perform the next action.
13. The method of claim 10, wherein the input vector further comprises a set of characteristics representing a state of an environment surrounding the aerial vehicle.
14. The method of claim 13, wherein the environment is a region of the stratosphere, and the aerial vehicle is a high altitude aerial vehicle.
15. The method of claim 10, wherein the trained neural network encoding the learned flight policy is configured to select actions to achieve an objective.
16. The method of claim 15, wherein the objective comprises causing the aerial vehicle to spend an optimal amount of time within a predetermined radius of a target location.
17. The method of claim 15, wherein the objective comprises causing a group of aerial vehicles to provide connection services for a maximum amount of time to a geographical area, wherein the aerial vehicle is one of the group of aerial vehicles.
18. The method of claim 15, wherein the objective comprises causing the aerial vehicle to arrive at a target location at a desired date and time.
19. The method of claim 15, wherein the objective comprises optimizing the aerial vehicle's power consumption while the aerial vehicle navigates to a target location.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027] The figures depict various example embodiments of the present disclosure for purposes of illustration only. One of ordinary skill in the art will readily recognize from the following discussion that other example embodiments based on alternative structures and methods may be implemented without departing from the principles of this disclosure, and which are encompassed within the scope of this disclosure.
DETAILED DESCRIPTION
[0028] The Figures and the following description describe certain embodiments by way of illustration only. One of ordinary skill in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein. Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures.
[0029] The above and other needs are met by the disclosed methods, a non-transitory computer-readable storage medium storing executable code, and systems for navigating aerial vehicles in operation, as well as for generating flight policies for such aerial vehicle navigation using deep reinforcement learning.
[0030] Aspects of the present technology are advantageous for high altitude systems (i.e., systems that are operation capable in the stratosphere, approximately at or above 7 kilometers above the earth's surface in some regions, and at or above 20 kilometers above the earth's surface in other regions, or beyond in the exosphere or cosmic space), such as High Altitude Platforms (HAPs), High Altitude Long Endurance (HALE) aircraft, unmanned aerial vehicles (UAVs), including lighter than air vehicles (e.g., floating stratospheric balloons), propelled lighter than air vehicles (e.g., propelled floating stratospheric balloons), fixed-wing vehicles (e.g., drones, rigid kites), various types of satellites, and other high altitude aerial vehicles. In some examples, high altitude systems are configured to fly above an altitude reserved for commercial airline flights. One way to provide enhanced network access is through a network of aerial vehicles carrying Internet, cellular data, or other network capabilities. To maintain a network, each aerial vehicle in a fleet or network of aerial vehicles may travel to a particular location. In some embodiments, lighter than air aerial vehicles (i.e., propelled or not) may rely on rapidly changing and extreme (i.e., strong, high speed, and volatile) wind conditions to assist in navigation efforts to different locations. Other environmental and non-environmental factors may impact an aerial vehicle's flight plan or policy. In view of this, large scale simulations may be performed to evaluate operational characteristics or capabilities (e.g., power system availability, ambient temperatures, software and hardware versions implemented or accessible onboard, integrity of various components) and life cycle of individual aerial vehicles or fleets. Such simulations may be used to manage the life cycle of an aerial vehicle to manage risk of failures and to optimize availability for service delivery.
[0031] This disclosure is directed to a deep reinforcement learning system (hereinafter “learning system”) for generating optimal flight policies to control aerial vehicles according to a desired goal (i.e., objective), along with methods for training learned flight policies that are improved and optimized for one or more objectives, and for deploying said learned flight policies in an aerial vehicle navigation system. As described in more detail below, the learning system comprises a simulation module (including one or more “Workers” or simulators), one or more replay buffers, a learning module comprising a deep reinforcement learning architecture designed to train flight policies, and one or more servers or repositories that store learned policies from the learning module.
[0032] In a training loop, the simulation module simulates an aerial vehicle's flight through a region of the atmosphere (e.g., stratosphere) according to a given policy (e.g., encoded into a neural network, for determining an action by the aerial vehicle in a given environment and aerial vehicle state). The simulation module generates a frame, represented by one or more feature vectors, for each time step, and feeds the frames of each simulation to one or more replay buffers. The replay buffers store the frames (e.g., in sequential order in a circular buffer or at random), and the learning module requests a set of frames from one or more replay buffers as inputs. An input may comprise a random sample of frames from a circular buffer, a prioritized sample of frames according to optimization criteria, or a set of frames with other characteristics.
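As an illustration of the replay buffer behavior described above, the following is a minimal Python sketch of a fixed-capacity circular buffer that returns a uniform random sample of frames. The names Frame and ReplayBuffer and the field layout are hypothetical and are not taken from this disclosure. Prioritized sampling is omitted for brevity.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Frame:
    """One simulation time step (hypothetical layout, for illustration only)."""
    features: Sequence[float]  # feature vector for vehicle state and environment
    action: int                # index of the action taken in this time step
    reward: float              # reward assigned for this time step


class ReplayBuffer:
    """Fixed-capacity circular buffer with uniform random sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # A deque with maxlen silently drops the oldest frame once full.
        self._frames = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, frame: Frame) -> None:
        self._frames.append(frame)

    def sample(self, batch_size: int) -> List[Frame]:
        # Sampling without replacement breaks the original time ordering.
        k = min(batch_size, len(self._frames))
        return self._rng.sample(list(self._frames), k)

    def __len__(self) -> int:
        return len(self._frames)
```

A learning module in this sketch would call sample(32) or sample(64) to obtain one input set of frames.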
[0033] The training loop continues with the learning module processing the frames according to a deep reinforcement learning architecture to determine, in a given situation (i.e., a given vehicle state in a given environment), which action provides a larger or largest estimated reward. Actions to be taken by an aerial vehicle may include ascending, descending, maintaining altitude, and propelling itself in a direction, among others, and may be manifested as discrete actions such as up, down, or stay (i.e., maintain altitude). In some examples, the learning module may run one or more neural networks that output a value or other representation of an action, or a command associated with an action, and a magnitude associated with said action or command. The deep reinforcement learning architecture may be configured to run one or more variations of reinforcement learning, including value-based methods, distributional methods, and policy-based methods. Some examples of reinforcement learning techniques include, without limitation, Q-learning, double Q-learning, distributional Q-learning, categorical Q-learning, quantile regression Q-learning; policy gradient, actor-critic, soft actor-critic, and trust region policy optimization, among others. The reinforcement learning algorithm in the learning module is characterized by a reward function corresponding to the objective of the flight policy training. The learning module generates learned flight policies (e.g., encoded in neural networks), and scores them according to the reward function. The learned policies may be stored in a policy server, from which the simulation module can pull learned policies to run further simulations.
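To make the value-based case concrete, the following Python sketch shows a tabular Q-learning update over the discrete actions up, down, and stay. A lookup table is used here as a simplified stand-in for the neural network described above, and the function names, the learning rate alpha, and the discount factor gamma are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List

ACTIONS = ("up", "down", "stay")  # discrete actions named in the description


def make_q_table() -> Dict[Hashable, List[float]]:
    # Every unseen state starts with an estimated value of 0.0 per action.
    return defaultdict(lambda: [0.0] * len(ACTIONS))


def greedy_action(q: Dict[Hashable, List[float]], state: Hashable) -> str:
    """Return the action with the largest estimated reward in this state."""
    values = q[state]
    best = max(range(len(ACTIONS)), key=values.__getitem__)
    return ACTIONS[best]


def q_update(q: Dict[Hashable, List[float]], state: Hashable, action: str,
             reward: float, next_state: Hashable,
             alpha: float = 0.1, gamma: float = 0.99) -> None:
    """Standard Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    a = ACTIONS.index(action)
    target = reward + gamma * max(q[next_state])
    q[state][a] += alpha * (target - q[state][a])
```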
[0034] A reward function is defined according to a desired objective. Example objectives include: flying within a predetermined radius of a target location; following a mapped trajectory; flying in a given direction; arriving at a location at a desired date and time; maximizing (or otherwise optimizing) the amount of time an aerial vehicle provides connection services to a given area; conserving energy or minimizing energy consumption during a time period of flight or in achieving any of the aforementioned objectives; and achieving any of the aforementioned objectives in coordination with other aerial vehicles (i.e., in the context of a fleet of aerial vehicles). In some examples, the systems described herein may be tuned with a reward function that optimizes for multiple objectives (e.g., any combination of the example objectives above), and may further account for other factors, such as minimizing wear on a vehicle, avoiding inclement weather, etc. A learned flight policy may be deemed high performing if it scores well according to the reward function, and threshold scores may be defined to determine whether a learned flight policy is high performing, low performing, operation-ready, or otherwise should be kept (i.e., stored for further use in simulations or operation) or discarded.
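One possible shape for such a reward function is sketched below in Python for the objective of flying within a predetermined radius of a target location, optionally combined with an energy term. The exponential decay outside the radius, the default radius and slope, and the power weighting are all assumptions made for illustration; the disclosure does not specify a formula.

```python
import math


def station_keeping_reward(distance_km: float, radius_km: float = 50.0,
                           slope: float = 0.02) -> float:
    """Full reward inside the radius, exponentially decaying reward outside.

    A larger slope makes the reward fall off faster, i.e., scores
    excursions from the target region more strictly.
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    if distance_km <= radius_km:
        return 1.0
    return math.exp(-slope * (distance_km - radius_km))


def combined_reward(distance_km: float, power_w: float,
                    power_weight: float = 0.001,
                    radius_km: float = 50.0, slope: float = 0.02) -> float:
    """Multi-objective example: station keeping minus a power-use penalty."""
    base = station_keeping_reward(distance_km, radius_km, slope)
    return base - power_weight * power_w
```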
[0035] Multiple learning systems may be run in parallel as a meta-learning system, with varied parameters in each learning system. Parameters that may be varied may include: the reward function (e.g., objective being optimized, slope, additional factors, etc.), reward tuning or modification (e.g., period of time or times of day during which full reward is valid or awarded, penalties for various characteristics of simulation or resulting reward), the number of frames that a learning module requests per input, characteristics of said frames for input (e.g., random, prioritized, or other), the depth of the reinforcement learning architecture (e.g., number of layers, hidden or otherwise) in the learning module, number of objectives, types of objectives, number of available actions, types of available actions, length of time into the future that is being predicted, and more.
[0036] A learning system also may store high performing learned flight policies (i.e., store neural networks encoded with high performing flight policies) in a policy repository for use by an operational navigation system to control movement of an aerial vehicle according to one or more desired objectives. These high performing learned flight policies may be used to determine actions for an aerial vehicle in a given situation. In an embodiment described herein, an aerial vehicle system may generate an input vector characterizing a state of the aerial vehicle, which may be provided to a learned flight policy that processes the input vector to output an action optimized for an objective of the learned flight policy. Such input vectors may be generated onboard an aerial vehicle in some examples, and in other examples, may be generated offboard (e.g., in a datacenter, which may or may not be integrated with a ground station or other cloud infrastructure, or the like). The action may be converted to a set of commands configured to cause the aerial vehicle to perform the action. In some examples, the input vector may include more than an aerial vehicle's physical and operational state (e.g., battery levels, location, pose, speed, weight, dimensions, software version, hardware version, among others), but also may include one or more of the following environmental inputs: sensor inputs (e.g., measuring temperature, pressure, humidity, precipitation, etc.), weather forecasts, map information, air traffic information, date and time.
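The construction of an input vector may be sketched as follows in Python. The selected fields, the class names VehicleState and Environment, and the scaling constants are hypothetical; an actual input vector may include many more of the characteristics listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VehicleState:
    """Subset of physical and operational state (hypothetical fields)."""
    battery_fraction: float  # 0.0 (empty) to 1.0 (full)
    latitude_deg: float
    longitude_deg: float
    altitude_m: float
    speed_mps: float


@dataclass
class Environment:
    """Subset of environmental inputs (hypothetical fields)."""
    temperature_k: float
    pressure_pa: float
    wind_east_mps: float
    wind_north_mps: float


def build_input_vector(state: VehicleState,
                       env: Optional[Environment] = None) -> List[float]:
    """Flatten state (and optionally environment) into a scaled vector.

    The divisors are illustrative scale factors only, chosen to keep
    typical values roughly within [-1, 1].
    """
    vector = [
        state.battery_fraction,
        state.latitude_deg / 90.0,
        state.longitude_deg / 180.0,
        state.altitude_m / 20000.0,
        state.speed_mps / 50.0,
    ]
    if env is not None:
        vector += [
            env.temperature_k / 300.0,
            env.pressure_pa / 101325.0,
            env.wind_east_mps / 50.0,
            env.wind_north_mps / 50.0,
        ]
    return vector
```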
[0037] Using the systems and methods described herein, one can customize flight policies across vehicle types and also to particular environments (e.g., Peru vs. Ecuador vs. Kenya for the same vehicle, where useful) to improve performance with minimal human interference. For example, a system can customize flight policies for particular countries, each with its own set of regulations and no-fly restrictions, as well as weather and wind forecasts specific to its region. Thus, these systems and methods for aerial vehicle navigation can generate controllers for many types of aerial vehicles with minimal custom human design. Also, for any given vehicle and environment, one can process more data, and more types of data, to generate improved controllers.
[0038] Example Systems
[0039]
[0040] Connection 104a may structurally, electrically, and communicatively, connect balloon 101a and/or ACS 103a to various components comprising payload 108a. In some examples, connection 104a may provide two-way communication and electrical connections, and even two-way power connections. Connection 104a may include a joint 105a, configured to allow the portion above joint 105a to pivot about one or more axes (e.g., allowing either balloon 101a or payload 108a to tilt and turn). Actuation module 106a may provide a means to actively turn payload 108a for various purposes, such as improved aerodynamics, facing or tilting solar panel(s) 109a advantageously, directing payload 108a and propulsion units (e.g., propellers 107 in
[0041] Payload 108a may include solar panel(s) 109a, avionics chassis 110a, broadband communications unit(s) 111a, and terminal(s) 112a. Solar panel(s) 109a may be configured to capture solar energy to be provided to a battery or other energy storage unit, for example, housed within avionics chassis 110a. Avionics chassis 110a also may house a flight computer (e.g., computing device 301, as described herein), a transponder, along with other control and communications infrastructure (e.g., a controller comprising another computing device and/or logic circuit configured to control aerial vehicle 120a). Communications unit(s) 111a may include hardware to provide wireless network access (e.g., LTE, fixed wireless broadband via 5G, Internet of Things (IoT) network, free space optical network or other broadband networks). Terminal(s) 112a may comprise one or more parabolic reflectors (e.g., dishes) coupled to an antenna and a gimbal or pivot mechanism (e.g., including an actuator comprising a motor). Terminal(s) 112a may be configured to receive or transmit radio waves to beam data long distances (e.g., using the millimeter wave spectrum or higher frequency radio signals). In some examples, terminal(s) 112a may have very high bandwidth capabilities. Terminal(s) 112a also may be configured to have a large range of pivot motion for precise pointing performance. Terminal(s) 112a also may be made of lightweight materials.
[0042] In other examples, payload 108a may include fewer or more components, including propellers 107 as shown in
[0043] Ground station 114 may include one or more server computing devices 115a-n, which in turn may comprise one or more computing devices (e.g., computing device 301 in
[0044]
[0045] As shown in
[0046]
[0047]
[0048] Computing device 301 also may include a memory 302. Memory 302 may comprise a storage system configured to store a database 314 and an application 316. Application 316 may include instructions which, when executed by a processor 304, cause computing device 301 to perform various steps and/or functions, as described herein. Application 316 further includes instructions for generating a user interface 318 (e.g., graphical user interface (GUI)). Database 314 may store various algorithms and/or data, including neural networks (e.g., encoding flight policies, as described herein) and data regarding wind patterns, weather forecasts, past and present locations of aerial vehicles (e.g., aerial vehicles 120a-b, 201a-b, 211a-c), sensor data, map information, air traffic information, among other types of data. Memory 302 may include any non-transitory computer-readable storage medium for storing data and/or software that is executable by processor 304, and/or any other medium which may be used to store information that may be accessed by processor 304 to control the operation of computing device 301.
[0049] Computing device 301 may further include a display 306, a network interface 308, an input device 310, and/or an output module 312. Display 306 may be any display device by means of which computing device 301 may output and/or display data. Network interface 308 may be configured to connect to a network using any of the wired and wireless short range communication protocols described above, as well as a cellular data network, a satellite network, free space optical network and/or the Internet. Input device 310 may be a mouse, keyboard, touch screen, voice interface, and/or any other hand-held controller or device or interface by means of which a user may interact with computing device 301. Output module 312 may be a bus, port, and/or other interface by means of which computing device 301 may connect to and/or output data to other devices and/or peripherals.
[0050] In some examples, computing device 301 may be located remote from an aerial vehicle (e.g., aerial vehicles 120a-b, 201a-b, 211a-c) and may communicate with and/or control the operations of an aerial vehicle, or its control infrastructure as may be housed in avionics chassis 110a-b, via a network. In one embodiment, computing device 301 is a data center or other control facility (e.g., configured to run a distributed computing system as described herein), and may communicate with a controller and/or flight computer housed in avionics chassis 110a-b via a network. As described herein, system 300, and particularly computing device 301, may be used for planning a flight path or course for an aerial vehicle based on wind and weather forecasts to move said aerial vehicle along a desired heading or within a desired radius of a target location. Various configurations of system 300 are envisioned, and various steps and/or functions of the processes described below may be shared among the various devices of system 300, or may be assigned to specific devices.
[0051]
[0052]
[0053] Simulation module 502 may be configured to feed frames of simulations to replay buffers 504, which serve to randomize and store said frames independent of the particular simulations from which they came. A plurality of simulators in simulation module 502 (e.g., comprising several or ten or more simulators) may work to feed simulation frames to a plurality of replay buffers 504 (comprising one or more replay buffers).
[0054] Learning module 506 may be configured to pull sets of frames (e.g., comprising 32 or 64 frames, or any number of frames ranging from a dozen to multiples of fives, tens or dozens of frames) from replay buffers 504 to train learned flight policies. Learning module 506 may comprise a deep reinforcement learning architecture configured to run one or more variations of reinforcement learning, such as Q-learning, double Q-learning, distributional Q-learning, or other policy learning methods. The reinforcement learning algorithm in learning module 506 may be configured to maximize a sum of rewards generated by a reward function corresponding to the objective of the flight policy training. The reward function may be correlated with an objective (i.e., a control objective related to navigation or operation of an aerial vehicle, such as a high altitude aerial vehicle). Examples of objectives may include: following a map (e.g., following the gradient or map of a heuristic function built to indicate how to efficiently cross an ocean); spending the most (or otherwise optimal) amount of time within a radius of a target location (e.g., a latitude-longitude, a city, an island, a group or chain of islands, a group of buoyed structures such as offshore wind farms); following a mapped trajectory; flying in a given direction; arriving at a location at a desired date and time; maximizing (or otherwise optimizing) the amount of time an aerial vehicle provides connection services to a given area; following, or remaining within a given radius of, a terrestrial or nautical vehicle (e.g., a cruise ship, an all-terrain vehicle, etc.); and any of the above in coordination with other aerial vehicles (i.e., in the context of a fleet of aerial vehicles). 
Thresholds may be predetermined to categorize learned flight policies into various levels of performance (e.g., high, medium, or low performing) based on how a learned flight policy scores according to the reward function (i.e., how high a reward said learned flight policy produces). In some examples, the reward score may comprise a value. In other examples, where certain types of reinforcement learning are implemented (e.g., distributional Q-learning), a reward score may comprise a probability distribution or a distribution of rewards. In an example, learning module 506 may be configured to provide medium and high performing learned flight policies to policy server 508 for storage and further use by simulation module 502 to run simulations, and also to provide high performing learned flight policies to operation-ready policies server 510 for use in operational navigation systems, according to methods described below. In another example, learning module 506 may be configured to provide medium and high performing learned flight policies to policy server 508, and to provide a separate category of highest performing learned flight policies to operation-ready policies server 510 for use in operational navigation systems. Learning module 506 may discard (i.e., delete) low-performing learned flight policies.
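The threshold-based categorization and routing described in this paragraph may be sketched as follows in Python for the case where the score is a single value. The threshold values and the use of plain dictionaries to stand in for policy server 508 and operation-ready policies server 510 are assumptions for illustration.

```python
from typing import Dict


def categorize(score: float, medium_threshold: float = 0.5,
               high_threshold: float = 0.8) -> str:
    """Map a scalar reward score to a performance category."""
    if medium_threshold > high_threshold:
        raise ValueError("medium_threshold must not exceed high_threshold")
    if score >= high_threshold:
        return "high"
    if score >= medium_threshold:
        return "medium"
    return "low"


def route_policy(policy_id: str, score: float,
                 policy_server: Dict[str, float],
                 operation_ready_server: Dict[str, float]) -> str:
    """Store or discard a learned flight policy according to its category.

    Medium and high performers go to the policy server for further
    simulation; high performers are also stored for operational use;
    low performers are discarded (stored nowhere).
    """
    category = categorize(score)
    if category == "low":
        return category
    policy_server[policy_id] = score
    if category == "high":
        operation_ready_server[policy_id] = score
    return category
```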
[0055] Turning to
[0056] In some examples, system 500 also may include an operation-ready policies server 510 separate from policy server 508, in which learned flight policies that meet or exceed a certain threshold score defined for an appropriate category of flight policies (e.g., high performing, highest performing, operation-ready, and other appropriate categories) may be stored. The operation-ready flight policies in operation-ready policies server 510 may be provided to aerial vehicles (e.g., aerial vehicles 120a-b, 201a-b, 211a-c) for navigation of said aerial vehicles (i.e., to determine actions to be performed by said aerial vehicles to achieve an objective), according to methods described in more detail below.
[0057]
[0058] In an example, a subset of 30-60 of learning systems 602a-n may be run in parallel with the objective of training optimal flight policies to navigate an aerial vehicle (e.g., aerial vehicles 120a-b, 201a-b, 211a-c) from a starting location to a target location. In this example, the objective amongst this subset of learning systems 602a-n may be the same, but each of learning systems 602a-n may be run with one or more of the following parameter variations: distances between said starting and target locations, the geographical locations of the starting and target locations (i.e., differing hemispheres or regions of the world, thereby invoking very different environmental conditions), starting times, number of frames per input (e.g., 16, 32, 64 or more), the reward function slope (i.e., how steep the slope is, which translates into how strictly the reward may be scored), and number of simulators. In this example, said subset of 30-60 of learning systems 602a-n may further be grouped into subset groups, each subset group running variations on a single parameter or a few related parameters (e.g., a subset group running variations on the reward function, another subset group running variations on geographical locations, another subset group running variations on input vectors, etc.). In another example, another subset of learning systems 602a-n may be training on the same objective with the same parameter variations using a different type of reinforcement learning (e.g., Q-learning in one subset providing value scores, distributional Q-learning in another subset providing probability distribution scores). Evaluation server 512 may evaluate the performance of the learned flight policies being generated by learning systems 602a-n (e.g., retrieving from policy servers 508a-n or directly from learning modules 506a-n).
Evaluation server 512 may determine which learned flight policies should be stored in operation-ready policies server 510, and provide feedback on which stacks are doing well or doing poorly. For example, where certain ones of said subset of 30-60 of learning systems 602a-n, characterized by certain sets of parameters, perform poorly (i.e., produce poor performing learned flight policies), the poor performing learning systems (i.e., sets of parameters) may be discontinued. As certain of learning systems 602a-n are discontinued, new ones with new sets of parameters may be added to particular subsets or subset groups.
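A minimal Python sketch of generating parameter variations for parallel learning systems, and of discontinuing poorly performing ones, is shown below. The parameter names, the example values, and the single-score cutoff are hypothetical; the evaluation criteria actually applied by evaluation server 512 are not specified here.

```python
import itertools
from typing import Dict, List, Sequence


def parameter_grid(variations: Dict[str, Sequence]) -> List[Dict[str, object]]:
    """Return one parameter set per combination of the varied values."""
    names = list(variations)
    combos = itertools.product(*(variations[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def discontinue_poor_performers(scores: Dict[int, float],
                                min_score: float) -> List[int]:
    """Return the indices of learning systems to keep running.

    scores maps a learning-system index to an evaluation score; systems
    scoring below min_score are discontinued (left out of the result).
    """
    return sorted(i for i, score in scores.items() if score >= min_score)
```

For example, varying frames per input over 16, 32, and 64 and the reward slope over two values yields six learning systems, each of which may later be kept or replaced by a new parameter set.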
[0059] In some examples, meta-learning system 600 may be implemented in a distributed computing environment wherein a plurality of copies of a stack of learning systems may be maintained along with a plurality of policy servers, each policy server being dedicated to a learning stack.
[0060] Example Methods
[0061]
[0062] The simulation module may generate a plurality of frames, each frame representing a time step of a simulation at step 704, each frame comprising a feature vector representing a set of features of the aerial vehicle state and environment in said time step, wherein actions taken in each simulation in said time step correspond to a feature vector of a frame, and the resulting time step captured in the frame is rewarded for the behavior (i.e., the actions taken). At step 706, the plurality of frames may be stored in a replay buffer, the replay buffer configured to provide a random or scrambled set of frames (e.g., to disrupt the plurality of frames from their original time sequence order) for input into a learning module. As described herein, the replay buffer may comprise one or more buffers.
[0063] The method may continue, at step 708, with requesting, by a learning module, a set of frames from the replay buffer. The learning module may comprise a deep reinforcement learning architecture, as described above. The learning module may process the set of frames using a reinforcement learning algorithm at step 710, in order to then generate a learned flight policy that is scored according to a reward function defined for that learning module at step 712. The learned flight policy, encoded in a neural network, may be stored in a policy server at step 714.
[0064] Turning to
[0065] A threshold may be predetermined to define whether a learned flight policy is high performing or low performing, and said threshold may comprise one or more thresholds used to define more granular performance categories (e.g., high performing threshold, very high performing threshold, low performing threshold, medium performing threshold, simulation-ready threshold, operation-ready threshold, discard threshold, anomalous threshold, and others, any of which may be a value-based threshold wherein the score comprises a value or a distribution-based threshold wherein the score comprises a probability distribution).
[0066] In an alternative embodiment, the learning module itself may determine whether a threshold is met (i.e., the score for the learned flight policy meets or exceeds the threshold). The learned flight policy, which may be encoded in a neural network, that meets or exceeds said threshold may be stored in the policy server configured to store neural networks, at step 714. Once the learned flight policy is stored in the policy server, it may be provided to a simulation module, or pulled by a simulation module from the policy server, to run further simulations. If the threshold is not met (i.e., the score for the learned flight policy falls below the threshold), the learned flight policy may be discarded. Where the score comprises a value, the value score may be evaluated against a predetermined threshold value. Where the score comprises a probability distribution or distribution of rewards, the score may be evaluated against a predetermined distribution threshold.
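The two kinds of threshold check may be sketched as follows in Python. How a distribution threshold is defined is not specified above, so this sketch assumes one possible criterion: the probability mass at or above a reward level must reach a minimum probability. The function name and the representation of a distribution as (reward, probability) pairs are likewise assumptions.

```python
from typing import Sequence, Tuple, Union

# A distribution score is represented here as (reward, probability) pairs.
Distribution = Sequence[Tuple[float, float]]


def meets_threshold(score: Union[float, Distribution],
                    value_threshold: float,
                    min_probability: float = 0.5) -> bool:
    """Check a value score or a distribution score against a threshold.

    Value score: passes if score >= value_threshold.
    Distribution score (assumed criterion): passes if the probability of
    a reward >= value_threshold is at least min_probability.
    """
    if isinstance(score, (int, float)):
        return score >= value_threshold
    total = sum(p for _, p in score)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mass = sum(p for reward, p in score if reward >= value_threshold)
    return mass >= min_probability
```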
[0067] In some examples, the operation-ready threshold for storing a learned flight policy in the operation-ready policies server may differ from other thresholds, such as a threshold for storing in a normal policies server (e.g., the learned flight policy score may be required to meet or exceed a different and higher threshold). In still other examples, methods 700 and 750 may include evaluating a learned flight policy score against still other thresholds to gate how learned flight policies are grouped and treated.
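One possible way to gate policies against tiered thresholds is sketched below, for illustration only. The function name, the destination labels, and the numeric thresholds are hypothetical. The sketch further assumes that a policy is routed to exactly one destination; the disclosure does not state whether an operation-ready policy is also retained in a normal policies server.

```python
def route_policy(score, normal_threshold, operation_ready_threshold):
    """Return the hypothetical destination for a scored learned flight policy."""
    if operation_ready_threshold < normal_threshold:
        raise ValueError("operation-ready threshold is assumed to be the higher one")
    if score >= operation_ready_threshold:
        return "operation_ready_policies_server"
    if score >= normal_threshold:
        return "policies_server"
    return "discard"


# Usage with placeholder thresholds of 0.5 (normal) and 0.9 (operation-ready).
destinations = [route_policy(s, 0.5, 0.9) for s in (0.95, 0.6, 0.2)]
```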
[0068] Method 700 may be performed by flight policy training system 500 and method 750 may be performed by a flight policy training system 550, either or both of which may be implemented in a distributed computing system such as distributed computing system 400. In some examples, method 700 may be performed as a training loop in each of a plurality of training systems, such as in a meta-learning system 600. Method 750 also may be performed as part of a training loop in a meta-learning system 600.
[0069] In still other embodiments, methods 700 and 750 may be implemented using real world flight data or other sources of flight data, rather than simulations. Therefore, the training methods described herein may be performed using simulated data, historical data, fresh data collected during operation, or a mixture of these, to generate input for a learning module.
[0070]
[0071] An action may be selected by the trained neural network based on the input vector at step 806. The action may be converted into a set of commands, at step 808, the set of commands configured to cause the aerial vehicle to perform the action. For example, an action “descend” or “down” may be converted to a set of commands that includes spinning an ACS fan motor of the aerial vehicle at a predetermined number of watts. A control system, as described herein, may then cause the aerial vehicle to perform the action using the set of commands at step 810. In some examples, a determination may be made whether the aerial vehicle operation should continue at step 812. If yes, method 800 may return to step 804 to generate further input vectors to select further actions. If no, method 800 may end.
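Steps 804 through 812 may be sketched, by way of example only, as the loop below. All names (Command, ACTION_COMMANDS, toy_policy, run_flight_loop) are hypothetical, the wattage is a placeholder rather than a disclosed value, the "stay" action is added only to give the example a second case, and a simple rule stands in for the trained neural network.

```python
from dataclasses import dataclass


@dataclass
class Command:
    actuator: str
    value: float


# Hypothetical action-to-command table (step 808). The wattage is a placeholder.
ACTION_COMMANDS = {
    "descend": [Command("acs_fan_motor_watts", 100.0)],
    "stay": [Command("acs_fan_motor_watts", 0.0)],
}


def toy_policy(input_vector):
    """Stand-in for the trained neural network: descend when above the target."""
    altitude_km, target_km = input_vector
    return "descend" if altitude_km > target_km else "stay"


def run_flight_loop(policy, read_state, controller, should_continue):
    while True:
        input_vector = read_state()         # step 804: generate the input vector
        action = policy(input_vector)       # step 806: select an action
        commands = ACTION_COMMANDS[action]  # step 808: convert action to commands
        controller(commands)                # step 810: perform the action
        if not should_continue():           # step 812: continue operation?
            break


# Usage: two simulated altitude readings against a 17.0 km target.
altitudes = iter([18.5, 17.0])
remaining = [2]
executed = []


def read_state():
    remaining[0] -= 1
    return (next(altitudes), 17.0)


run_flight_loop(toy_policy, read_state, executed.append, lambda: remaining[0] > 0)
```

Passing the policy, state source, controller, and continuation check as callables keeps the loop independent of any particular vehicle or network, which is a convenience of the example and not a feature asserted by the disclosure.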
[0072] It would be recognized by a person of ordinary skill in the art that some or all of the steps of methods 700 and 800, as described above, may be performed in a different order or sequence, repeated, and/or omitted without departing from the scope of the present disclosure.
[0073]
[0074] While specific examples have been provided above, it is understood that the present invention can be applied with a wide variety of inputs, thresholds, ranges, and other factors, depending on the application. For example, the time frames and ranges provided above are illustrative, but one of ordinary skill in the art would understand that these time frames and ranges may be varied or even be dynamic and variable, depending on the implementation.
[0075] As those skilled in the art will understand, a number of variations may be made in the disclosed embodiments, all without departing from the scope of the invention, which is defined solely by the appended claims. It should be noted that although the features and elements are described in particular combinations, each feature or element can be used alone without other features and elements or in various combinations with or without other features and elements. The methods or flow charts provided may be implemented in a computer program, software, or firmware tangibly embodied in a computer-readable storage medium for execution by a general-purpose computer or processor.
[0076] Examples of computer-readable storage mediums include a read only memory (ROM), random-access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks.
[0077] Suitable processors include, by way of example, a general-purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), a state machine, or any combination thereof.