ARTIFICIAL INTELLIGENCE POWERED EMERGENCY PILOT ASSISTANCE SYSTEM
20220292994 · 2022-09-15
Inventors
CPC classification
G05D1/0088
PHYSICS
G06N7/01
PHYSICS
G06N3/006
PHYSICS
B64D2045/0085
PERFORMING OPERATIONS; TRANSPORTING
International classification
G08G5/02
PHYSICS
Abstract
An emergency pilot assistance system may include an artificial neural network configured to calculate reward (Q) values based on state-action vectors associated with an aircraft. The state-action vectors may include state data associated with the aircraft and action data associated with the aircraft. The system may further include a user output device configured to provide an indication of an action to a user, wherein the action corresponds to an agent action that has a highest reward Q value as calculated by the artificial neural network.
Claims
1. An emergency pilot assistance system comprising: an artificial neural network configured to calculate reward (Q) values based on state-action vectors associated with an aircraft, wherein the state-action vectors include state data associated with the aircraft and action data associated with the aircraft; and a user output device configured to provide an indication of an action to a user, wherein the action corresponds to an agent action that has a highest reward Q value as calculated by the artificial neural network.
2. The system of claim 1, wherein the highest reward Q value is associated with landing the aircraft at a predetermined destination or a calculated emergency destination in response to an emergency.
3. The system of claim 1, wherein the state data include data matrices associated with the aircraft, the data matrices indicating a heading value, a position value, a system state value, an environmental condition value, a feedback value, a pilot action value, a system availability value, a roll value, a pitch value, a yaw value, a rate of change of roll value, a rate of change of pitch value, a rate of change of yaw value, a longitude value, a latitude value, a rate of change of position value, a rate of change of velocity value, or any combination thereof.
4. The system of claim 1, wherein the action data corresponds to a change in heading, a change in velocity, a change in roll, a change in pitch, a change in yaw, a change in a rate of change of roll, a change in a rate of change of pitch, a change in a rate of change of yaw, a change in a rate of change of position, a change in a rate of change of velocity, or any combination thereof.
5. The system of claim 4, wherein the agent action is translated into an aircraft surface control action using an inverse aircraft model.
6. The system of claim 1, wherein the agent action is taken from a flight envelope including aircraft flight constraints, wherein the aircraft flight constraints include maps of acceleration and deceleration, rates of climb, rates of drop, velocity thresholds, roll change rate thresholds, pitch change rate thresholds, yaw change rate thresholds, roll thresholds, pitch thresholds, and yaw thresholds.
7. The system of claim 1, wherein the artificial neural network includes a deep Q network.
8. The system of claim 1, wherein the user output device is incorporated into a cockpit of an aircraft, and wherein the indication of the action includes a visual indication, an audio indication, a written indication, or any combination thereof.
9. The system of claim 1, wherein the artificial neural network is implemented at one or more processors, and wherein the one or more processors are further configured to: determine the state data based on one or more aircraft systems; determine availability data associated with one or more aircraft systems; determine a safe landing zone based on the state data and based on the availability data; determine the action data based on the safe landing zone, the availability data, the state data, and stored constraint data; and generate the state-action vectors based on the state data and the action data.
10. The system of claim 1, wherein the artificial neural network is implemented at one or more processors, and wherein the one or more processors are further configured to: determine heading and velocity data associated with the highest reward Q value; and perform one or more inverse dynamics operations to translate the heading and velocity data into the agent action.
11. The system of claim 1, wherein the artificial neural network is implemented at one or more processors, and wherein the one or more processors are further configured to: compare user input to the action and generate a performance rating.
12. The system of claim 1, wherein the user output device is further configured to warn the user when a user input differs from the action.
13. The system of claim 1, wherein the artificial neural network is implemented at one or more processors, and wherein the one or more processors are further configured to: generate updated state-action vectors associated with the aircraft based on updated state data and updated action data; and calculate additional reward Q values based on the updated state-action vectors, wherein the user output device is configured to provide an additional indication of an additional action to the user, wherein the additional action corresponds to an updated agent action that has an updated highest reward Q value as calculated by the artificial neural network.
14. A method for training an artificial neural network for an emergency pilot assistance system, the method comprising: generating training data for a deep Q network by: receiving state data associated with an aircraft and an environment of the aircraft from a simulator while a user is operating the simulator; receiving action data from the simulator associated with actions by the user; generating a set of state-action vectors based on the state data and the action data; and determining a reward Q value associated with the set of state-action vectors; and training a deep Q network based on the training data.
15. The method of claim 14, further comprising: generating additional training data for the deep Q network by: receiving automated state data associated with the aircraft from a memory, the automated state data corresponding to an automated scenario; receiving automated action data from the memory, the automated action data associated with the automated scenario; generating an additional set of state-action vectors based on the automated state data and the automated action data; and determining an additional reward Q value associated with the additional set of state-action vectors; and training the deep Q network based on the additional training data.
16. The method of claim 14, wherein the state data include data matrices associated with the aircraft, the data matrices indicating a heading value, a position value, a system state value, an environmental condition value, a feedback value, a pilot action value, a system availability value, a roll value, a pitch value, a yaw value, a rate of change of roll value, a rate of change of pitch value, a rate of change of yaw value, a longitude value, a latitude value, a rate of change of position value, a rate of change of velocity value, or any combination thereof.
17. The method of claim 14, wherein the action data corresponds to a change in heading, a change in velocity, a change in roll, a change in pitch, a change in yaw, a change in a rate of change of roll, a change in a rate of change of pitch, a change in a rate of change of yaw, a change in a rate of change of position, a change in a rate of change of velocity, or any combination thereof.
18. The method of claim 14, wherein the action data is based on a flight envelope including aircraft flight constraints, wherein the aircraft flight constraints include maps of acceleration and deceleration, rates of climb, rates of drop, velocity thresholds, roll change rate thresholds, pitch change rate thresholds, yaw change rate thresholds, roll thresholds, pitch thresholds, and yaw thresholds.
19. An emergency pilot assistance method comprising: calculating reward (Q) values using a deep Q network, wherein the reward values are based on state-action vectors associated with an aircraft, and wherein the state-action vectors include state data associated with the aircraft and action data associated with the aircraft; and providing an indication of an action to a user at a user output device, wherein the action corresponds to an agent action that has a highest reward Q value as calculated by the deep Q network.
20. The method of claim 19, wherein the highest reward Q value is associated with landing the aircraft at a predetermined destination or a calculated emergency destination in response to an emergency.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] While the disclosure is susceptible to various modifications and alternative forms, specific examples have been shown by way of example in the drawings and will be described in detail herein. However, it should be understood that the disclosure is not intended to be limited to the particular forms disclosed. Rather, the intention is to cover all modifications, equivalents and alternatives falling within the scope of the disclosure.
DETAILED DESCRIPTION
[0024] Described herein is a reinforcement learning based autonomous pilot assistance agent, also referred to herein as an artificial intelligence powered emergency pilot assistance system, which can be trained using an aircraft simulator and can perform the tasks of computing velocities, altitudes, and headings of an aircraft for a flight from a given origin to a destination without human intervention. The pilot assistance agent may be used to assist and guide a pilot during emergency situations. For example, the computed velocities, altitudes, and headings can be translated into control actions that may be performed by the pilot to guide the aircraft to a safe landing zone.
[0025] The systems described herein may rely on a deep Q network to enable model-free deep Q learning for obtaining complete reward-based mappings. The mappings may be used to determine a course of action during an emergency. As a brief overview of deep Q learning, as it is applied herein, during an emergency the system may determine a candidate goal (which, for example, may include determining a safe landing location). The system may also have access to a user policy, which may be based on aircraft flight constraints, a flight envelope, maps of acceleration and deceleration, rate of climb, and rate of drop. The user policy effectively describes the possible actions that may be taken at any given time within the aircraft. Based on these parameters, the system may iteratively map a sequence of possible actions to bring the aircraft to the candidate goal. If the sequence is successful in bringing the aircraft to the candidate goal (i.e., if the sequence will result in the aircraft landing safely at the safe landing location), then a high reward Q value (e.g., 1.0) may be assigned. If the sequence is not successful, then a low reward Q value (e.g., 0.0) may be assigned. Because each sequence may branch at each iteration, the reward Q values may increase or decrease throughout the iterations depending on the likelihood of a safe landing at any given point in the sequence of actions.
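For illustration only, the following is a minimal sketch of the high/low terminal reward assignment described above; the function name, inputs, and the 100 m tolerance are assumptions for illustration, not details taken from the patent.

```python
# Illustrative sketch of the terminal reward assignment: a high reward Q value
# (1.0) when a candidate action sequence ends in a safe landing at the candidate
# goal, a low reward Q value (0.0) otherwise. Inputs summarize the simulated
# outcome of one candidate sequence; the tolerance is an arbitrary placeholder.
def terminal_reward(landed_safely, distance_to_goal_m, position_tolerance_m=100.0):
    """Return 1.0 for a safe landing at the candidate goal, otherwise 0.0."""
    if landed_safely and distance_to_goal_m <= position_tolerance_m:
        return 1.0
    return 0.0
```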
[0026] The system may interact with an aircraft environment and the pilot to select actions in a way that approximately maximizes future reward values. During the system calculations, because future states cannot be perfectly determined, a standard assumption that future rewards are discounted by a set factor per time step may be employed. A future discounted return R_t may be calculated as follows:

R_t = Σ_{t′=t}^{T} γ^(t′−t) r_{t′}

where T is the flight duration, t is the current time step, t′ indexes the time steps from t through T, γ is the discount factor, and r_{t′} is the reward received at time step t′. For the examples described herein, γ was set to 0.99. However, other values are possible.
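A minimal sketch of this calculation, assuming a per-time-step list of rewards and the γ = 0.99 value mentioned above, may look as follows:

```python
# Discounted return R_t = sum over t' from t to T of gamma**(t' - t) * r_{t'},
# computed from a list of rewards indexed by time step.
def discounted_return(rewards, t, gamma=0.99):
    return sum(gamma ** (t_prime - t) * rewards[t_prime]
               for t_prime in range(t, len(rewards)))

# Example: three remaining steps with rewards 0, 0, 1 (a safe landing at the end)
# give R_t = 0 + 0.99 * 0 + 0.99**2 * 1 = 0.9801.
print(discounted_return([0.0, 0.0, 1.0], t=0))
```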
[0027] The desired action-value function Q*(s, a) may be defined as the best expected return achievable by following a policy after observing a sequence, s, and taking an action, a. Q*(s, a) may be derived based on the Bellman equation, which is well known with respect to deep Q learning. For purposes of this disclosure, the relationship may be described as follows: if the optimal value Q*(s′, a′) of the sequence at the next time step is known for all possible actions a′, then an optimizing strategy is to select the action a′ that maximizes the expected value of r + γQ*(s′, a′), where r is the reward received at the current step and γ is the discount factor.
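For reference, this relationship is commonly written in the deep Q-learning literature (not reproduced verbatim from the patent) as:

```latex
% Bellman optimality equation for the action-value function
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```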
[0028] The reinforcement learning algorithm described above may be used to estimate the action-value function by using the Bellman equation as an iterative update. If fully performed, the algorithm would converge to the optimal action-value function. In practice, however, this approach may be impractical, because the action-value function would be estimated separately for each sequence, without any generalization. Thus, the computations would expand exponentially, which would likely involve more processing power than is available. Instead, a function approximator may be used to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a). In the reinforcement learning field this is typically a linear function approximator. By relying on training data received during simulation, a deep Q network may be developed to approximate the optimal actions to achieve the greatest probability of a successful outcome.
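As an illustration of such a function approximator, the following sketch defines a small neural network that maps a state-action vector to a scalar Q value and performs one Bellman-style regression step. The framework (PyTorch), the 32-dimensional input, the layer widths, and the random placeholder data are assumptions for illustration and are not the patent's architecture.

```python
# Minimal deep Q network sketch: Q(s, a; theta) approximated by a small MLP.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_action_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar reward Q value for one state-action vector
        )

    def forward(self, state_action):
        return self.net(state_action).squeeze(-1)

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

# One regression step toward the Bellman target r + gamma * max_a' Q(s', a'),
# using random placeholder tensors in place of real simulator data.
state_action = torch.randn(64, 32)   # batch of state-action vectors
reward = torch.rand(64)              # observed rewards
next_q_max = torch.rand(64)          # max_a' Q(s', a'), e.g., from a target network
target = reward + 0.99 * next_q_max

loss = nn.functional.mse_loss(q_net(state_action), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```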
[0030] During the first phase, training of the artificial neural network may be performed along with training a pilot in a training simulator. The system may learn end-to-end mappings of aircraft flight paths (e.g., velocities, altitudes, and headings) from environmental observation and user input, with the task reward, e.g., a safe landing, serving as a form of supervision. The reward may be calculated based on safely landing the aircraft at a desired location or at a nearby safe landing location. From the perspective of the system being trained, the pilot's actions may be incorporated into a policy that also includes constraints such as a flight envelope, maps of acceleration and deceleration, a rate of climb, a rate of drop, and other policy data for a safe flight. From the pilot's perspective, the system may behave like an adaptive interface that learns a personalized mapping from the pilot's commands, environment, goal space, and flight constraint policy to flight path actions and related parameters.
[0031] Referring to FIG. 1, a system 100 for training an artificial neural network for an emergency pilot assistance system is depicted. The system 100 may include a simulator 110 in which a user 116, such as a pilot, operates an aircraft 114 within a simulated environment 112, and the system 100 may be used to train a deep Q network 140.
[0032] During operation, while the user 116 is performing a training exercise in the simulator 110, state data 120 associated with the aircraft 114 and with the environment 112 of the aircraft 114 may be collected from the simulator 110. The state data 120 may indicate a current state of the aircraft 114 and the environment 112. A portion of the state data 120 may also be based on system availability 122 of the aircraft 114. For example, during an emergency, one or more systems of the aircraft 114 may be inoperable or otherwise unavailable for use. These factors may be taken into account when generating the state data 120. The state data 120 may also be based on aircraft performance operational constraints 124, which may represent the limits of what a particular aircraft may do in a particular scenario being run at the simulator 110.
[0033] Action data 126 may also be collected from the simulator 110. The action data 126 may be derived from actions 115 taken by the user 116 during flight training. The action data 126 may also be based on a flight envelope 131, representing the actions that may be taken with respect to a particular aircraft.
[0034] Based on the state data 120 and the action data 126, training data 130 may be compiled. The training data 130 may include a set of state-action vectors 132 formed by combining the state data 120 and the action data 126 at incremental steps during the simulation. A reward Q value 134 may be determined based on an outcome associated with the set of state-action vectors 132 and based on the discounted return function described herein. The training data 130 may also include the reward Q value 134 and may be used as training data for the deep Q network 140.
[0035] A challenge typically associated with training emergency assistance systems may be adapting standard deep reinforcement learning techniques to leverage continuous input from the actions 115 and to adjust to the feedback that results from those actions. By using human-in-the-loop deep Q-learning, as described herein, with a user 116 actively using the simulator 110, the system 100 may learn an approximate state-action value function that computes expected future return values for an action, given the current environmental observation and the pilot's control input, without computing each possible path in the state-action vectors 132. Rather than finding the highest-value action, the deep Q network 140 may be trained to determine the high-value action closest to the user's input. This approach balances taking optimal actions with preserving the pilot's feedback control loop. It also enables the user 116 to directly modulate the level of assistance through a parameter α∈[0, 1], which may set a threshold for tolerance for suboptimal actions.
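One plausible reading of the closest high-value action selection and the α parameter described above is sketched below; the specific tolerance rule (keep actions whose Q value falls within an α-scaled band of the best Q value) is an assumption, not a rule quoted from the patent.

```python
import numpy as np

def assisted_action(actions, q_values, user_action, alpha=0.5):
    """Pick, among sufficiently high-value actions, the one closest to the pilot's input."""
    actions = np.asarray(actions, dtype=float)     # shape: (num_actions, action_dim)
    q_values = np.asarray(q_values, dtype=float)
    q_best, q_worst = q_values.max(), q_values.min()
    # alpha = 0 keeps only the highest-value action; alpha = 1 keeps every action.
    acceptable = q_values >= q_best - alpha * (q_best - q_worst)
    candidates = actions[acceptable]
    distances = np.linalg.norm(candidates - np.asarray(user_action, dtype=float), axis=-1)
    return candidates[np.argmin(distances)]
```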
[0036] Standard deep reinforcement learning algorithms may require a large number of interactions over a very long period in order to provide sufficient training. Simulator training alone is likely to be insufficient because it may not be feasible to obtain enough data. During a second phase of training, pilot control input may be replaced with automated scenario files having fixed control inputs from various origins to various destinations. The automated scenario files may cover more of the operating conditions of an aircraft during these scenarios. This automated training approach may also be useful for covering extreme emergency conditions, which may be difficult to simulate with a pilot. In some cases, this training will enable the system to determine a safe course of action more reliably than a pilot by learning based on a full spectrum of input from each scenario and learning based on scenarios that have not yet been anticipated by pilots.
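As a purely illustrative example of what an automated scenario file might contain (the JSON layout and field names are assumptions, not a format defined by the patent), each file could pair an origin and a destination with fixed, time-stamped control inputs that are replayed in place of live pilot input:

```python
import json

scenario = {
    "origin": {"lat": 47.45, "lon": -122.31},
    "destination": {"lat": 45.59, "lon": -122.60},
    "system_availability": {"engine_2": False, "hydraulics_b": True},
    "control_inputs": [  # fixed inputs, one entry per time step
        {"t": 0.0, "heading_deg": 180.0, "velocity_mps": 120.0, "pitch_deg": 2.0},
        {"t": 1.0, "heading_deg": 181.5, "velocity_mps": 119.0, "pitch_deg": 1.8},
    ],
}

with open("scenario_001.json", "w") as f:
    json.dump(scenario, f, indent=2)

# During the second training phase, each stored input is replayed in place of the
# pilot's action to generate additional state-action training data.
with open("scenario_001.json") as f:
    for step in json.load(f)["control_inputs"]:
        automated_action = (step["heading_deg"], step["velocity_mps"], step["pitch_deg"])
        # ...apply `automated_action` to the simulated aircraft and record the result...
```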
[0037] The remaining portions of the second phase of training may be the same as described with reference to FIG. 1.
[0038] Referring to FIG. 2, a second training phase is depicted in which pilot control input is replaced with automated scenarios. A memory 210 may store data for an automated scenario 212 associated with the aircraft 114.
[0039] The memory 210 may include memory devices such as random-access memory (RAM), read-only memory (ROM), magnetic disk memory, optical disk memory, flash memory, another type of memory capable of storing data and processor instructions, or the like, or combinations thereof. Further, the memory may be part of a processing device (not shown) such as a computing device.
[0040] During operation, automated state data 220 associated with the aircraft 114 and with the automated scenario 212 may be collected. In some examples, the collection may take the form of multiple automated scenario files. The automated state data 220 may indicate a current state of the aircraft 114 during the automated scenario 212. A portion of the automated state data 220 may also be based on system availability 122 of the aircraft 114 and on aircraft performance operational constraints 124, as described with reference to
[0041] Based on the automated state data 220 and the automated action data 226, additional training data 230 may be compiled. The additional training data 230 may include an additional set of state-action vectors 232 formed by combining the automated state data 220 and the automated action data 226. An additional reward Q value 234 may be determined based on an outcome associated with the additional set of state-action vectors 232 and based on the discounted return function described herein. The additional training data 230 may include the additional reward Q value 234 and may be used to train the deep Q network 140.
[0043] Referring to FIG. 3, an emergency pilot assistance system 300 is depicted.
[0044] The system 300 may include, or otherwise be implemented at, an aircraft 302. The system 300 may also include one or more processors 330, which may be implemented at the aircraft 302 or, in some examples, may be distributed in a decentralized manner. The system 300 may also include an artificial neural network 338. Portions of the system 300 may be implemented at the one or more processors 330. However, for clarity, different functional aspects of the system 300 may be depicted as separate from the processors 330.
[0045] The aircraft 302 may include aircraft systems 304 and a cockpit 308. The aircraft systems 304 may include mechanical systems, electrical systems, sensors, actuators, and the like. At least some of the aircraft systems 304 may be able to determine the existence of an emergency 306. The cockpit 308 may include a user output device 310. The user output device 310 may include visual output systems, audio output systems, text output systems, and the like. The aircraft 302 may include additional systems to perform functions typically associated with aircraft, but which are omitted from FIG. 3 for clarity.
[0046] The one or more processors 330 may include a microcontroller, a central processing unit (CPU), a graphical processing unit (GPU), a digital signal processor (DSP), a peripheral interface controller (PIC), another type of microprocessor, and/or combinations thereof. Further, the one or more processors 330 may be implemented as integrated circuits, complementary metal-oxide-semiconductor (CMOS) circuits, metal-oxide-semiconductor field-effect-transistor (MOSFET) circuits, very-large-scale-integrated (VLSI) circuits, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), combinations of logic gate circuitry, other types of digital or analog electrical design components, or combinations thereof.
[0047] The artificial neural network 338 may include the deep Q network 140 and may be trained as described herein. In particular, the artificial neural network may be trained to perform an approximation function to determine reward Q values associated with states and possible actions associated with the aircraft 302. It should be understood by persons of skill in the art, having the benefit of this disclosure, that the artificial neural network 338 may be a broader network, of which the deep Q network 140 may be a part.
[0048] During operation, an emergency 306 may result from, or be detected by, one or more of the aircraft systems 304. In response to the emergency 306, the one or more processors 330 may determine state data 334 and action data 336 based on the aircraft systems 304. For example, the state data 334 may include a matrix of values for aircraft heading, position, and velocity, a current system state, environmental conditions, feedback, pilot actions, and aircraft system availability, such as current roll, pitch, and yaw, rates of change of roll, pitch, and yaw, longitude and latitude, rate of change of position, velocity, other state parameters associated with the aircraft 302, or combinations thereof. The action data 336 may be based on heading and velocity, such as values of roll, pitch, and yaw, rates of change of roll, pitch, and yaw, rate of change of position, and velocity. State-action vectors 332 may be generated based on the state data 334 and the action data 336.
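A minimal sketch of how such state and action values might be combined into a single state-action vector is shown below; the particular fields and their ordering are illustrative assumptions rather than the patent's data layout.

```python
from dataclasses import dataclass, astuple

@dataclass
class AircraftState:
    heading: float
    velocity: float
    roll: float
    pitch: float
    yaw: float
    latitude: float
    longitude: float
    altitude: float

@dataclass
class CandidateAction:
    d_heading: float
    d_velocity: float
    d_pitch: float

def state_action_vector(state, action):
    """Concatenate state and action values into one input vector for the Q network."""
    return [*astuple(state), *astuple(action)]
```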
[0049] The processors 330 may determine and/or compile availability data 342 related to the aircraft systems 304. For example, during an emergency 306, some systems may not be available. A safe landing zone 344 may be determined based on the state data 334 and the availability data 342. The safe landing zone 344 may be a predetermined destination 346 or, in some cases, an emergency destination 348 determined based on a location of the aircraft 302, the availability data 342, and stored constraint data 358 associated with the aircraft 302. The action data 336 may depend on the safe landing zone 344, the availability data 342, the state data 334, and the stored constraint data 358.
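For illustration, a hedged sketch of one way a safe landing zone could be chosen is shown below: prefer the predetermined destination when it remains reachable under the current system availability, otherwise fall back to the nearest reachable emergency destination. The reachability rule (a single maximum-range figure standing in for availability and constraint data) is an assumption for illustration.

```python
import math

def great_circle_km(a, b):
    """Approximate great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def select_safe_landing_zone(position, predetermined, candidates, max_range_km):
    """`max_range_km` stands in for availability data and stored constraint data."""
    if great_circle_km(position, predetermined) <= max_range_km:
        return predetermined
    reachable = [c for c in candidates if great_circle_km(position, c) <= max_range_km]
    return min(reachable, key=lambda c: great_circle_km(position, c)) if reachable else None
```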
[0050] The artificial neural network 338 may be used to determine headings and velocities data 350 that may be associated with calculated reward Q values 352. The reward Q values 352 may be determined based on the state-action vectors 332 and may be associated with landing the aircraft 302 safely at the safe landing zone 344. For example, the higher the reward Q values 352 are, the more likely a safe landing is to occur. From the headings and velocities data 350, heading and velocity data 354 may be associated with a highest reward Q value 356 as determined by the artificial neural network 338.
[0051] One or more inverse dynamics operations 360 may be performed to translate the heading and velocity data 354 into an agent action 366. Further, in some examples, additional data from the headings and velocities data 350 may be translated into agent actions 362. Each of the agent actions 362 may be associated with reward Q values 364, which may correspond to the reward Q values 352. The agent action 366 may be associated with a highest reward Q value 368 that corresponds to the highest reward Q value 356 of the heading and velocity data 354. An inverse aircraft model 367 may be used to translate the agent action 366 into a surface control action 369 that may be usable as instructions to the user 324 to guide the aircraft 302.
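The sketch below conveys only the general idea of translating a commanded heading and velocity into pilot-facing control guidance; a real inverse dynamics operation or inverse aircraft model would account for the full flight dynamics, and the proportional gains and limits here are arbitrary placeholders.

```python
def to_control_guidance(current_heading_deg, target_heading_deg,
                        current_velocity_mps, target_velocity_mps):
    """Crude proportional mapping from heading/velocity targets to control guidance."""
    # Wrap the heading error into [-180, 180) degrees before scaling.
    heading_error = (target_heading_deg - current_heading_deg + 180.0) % 360.0 - 180.0
    bank_angle_deg = max(-25.0, min(25.0, 0.8 * heading_error))  # bounded bank command
    throttle_delta = 0.02 * (target_velocity_mps - current_velocity_mps)
    return {"bank_angle_deg": bank_angle_deg, "throttle_delta": throttle_delta}
```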
[0052] Within the cockpit 308, the user output device 310 may provide an indication 312 of an action 314 to the user 324. The action 314 may correspond to the agent action 366 and may also be, or may be derived from, the surface control action 369. The indication 312 of the action 314 may include a visual indication 316, an audio indication 318, a written indication 320, or any combination thereof. If the user 324 does not perform the action 314, then the user output device 310 may generate a warning 322. The user 324 may perform actions using user input 326, which may include flight controls and/or other controls associated with aircraft cockpits. In cases where there is no emergency, the system 300 may nevertheless generate a performance rating 370 associated with a flight based on comparing the agent actions 362 generated by the artificial neural network 338 to the user input 326.
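A simple sketch of the comparison, warning, and rating described above follows; the deviation metric, warning threshold, and rating scale are assumptions for illustration.

```python
import numpy as np

def compare_and_rate(agent_actions, user_inputs, warn_threshold=0.25):
    """Flag steps where the pilot deviates from guidance and compute a 0-1 rating."""
    agent = np.asarray(agent_actions, dtype=float)
    user = np.asarray(user_inputs, dtype=float)
    deviations = np.linalg.norm(agent - user, axis=-1)
    warnings = deviations > warn_threshold                      # steps that would trigger a warning
    rating = float(np.clip(1.0 - deviations.mean(), 0.0, 1.0))  # 1.0 means guidance was followed
    return rating, warnings
```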
[0053] It should be noted that the process described with respect to the system 300 is iterative and may be performed continually during a flight and/or during an in-flight emergency. Thus, agent actions may be continually fed to the user output device 310 as the state-action vectors 332 change. Referring to FIG. 4, as the flight progresses, updated state-action vectors may be generated based on updated state data and updated action data, and additional reward Q values may be calculated based on the updated state-action vectors.
[0054] The artificial neural network 338 may be used to generate updated headings and velocities data 450, which may be associated with additional reward Q values 452. The updated heading and velocity data 454 that is associated with a highest additional reward Q value 456 may be determined to safely guide the user 324 to land at the safe landing zone 344. Based on the updated headings and velocities data 450, updated agent actions 462 may be generated and associated with additional reward Q values 464, which may correlate with the additional reward Q values of the updated headings and velocities data 450. An updated agent action 466 may be associated with a highest additional reward Q value 468, which may correlate with the highest additional reward Q value 456 of the updated heading and velocity data 454. The updated agent action 466 may be used to generate an updated surface control action 469.
[0055] The user output device 310 may be configured to provide an additional indication 412 of an additional action 414 to the user 324. The additional indication 412 may include an additional visual indication 416, an additional audio indication 418, an additional written indication 420, or any combination thereof. If the user 324 does not perform the additional action 414, an additional warning 422 may be generated. As before, an updated performance rating 470 may be generated based on comparing the user input 326 to the updated agent actions 462.
[0056] By providing indications of actions that a pilot can take to safely land an aircraft at a safe landing zone, the system 300 may reduce the workload on the pilot in case of an emergency. Further, the system 300 may warn the pilot when the pilot's actions may lead to a catastrophic failure. Also, even in cases where there is no emergency, the system 300 can nevertheless rate a pilot's performance for training purposes. Other advantages may exist.
[0060] Referring to FIG. 8, a method 800 for training an artificial neural network for an emergency pilot assistance system is depicted. The method 800 may include generating training data for a deep Q network, at 802. For example, the training data 130 may be generated by the system 100 based on data collected from the simulator 110.
[0061] Generating the training data may include receiving state data associated with an aircraft and an environment of the aircraft from a simulator while a user is operating the simulator, at 804. For example, the state data 120 may be received from the simulator 110 while the user 116 is operating the simulator 110.
[0062] Generating the training data may further include receiving action data from the simulator associated with actions by the user, at 806. For example, the action data 126 may be received from the simulator 110.
[0063] Generating the training data may also include generating a set of state-action vectors based on the state data and the action data, at 808. For example, the set of state-action vectors 132 may be generated based on the state data 120 and the action data 126.
[0064] Generating the training data may include determining a reward Q value associated with the set of state-action vectors, at 810. For example, the reward Q value 134 may be determined by the system 100 and may be associated with the set of state-action vectors 132.
[0065] The method 800 may further include training a deep Q network based on the training data, at 812. For example, the deep Q network 140 may be trained based on the training data 130.
[0066] The method 800 may also include generating additional training data for the deep Q network, at 814. For example, the additional training data 230 may be generated based on the automated scenario 212, and additional automated scenarios during additional iterations.
[0067] Generating the additional training data may include receiving automated state data associated with the aircraft from a memory, the automated state data corresponding to an automated scenario, at 816. For example, the automated state data 220 may be received from the memory 210.
[0068] Generating the additional training data may further include receiving automated action data from the memory, the automated action data associated with the automated scenario, at 818. For example, the automated action data 226 may be received from the memory 210.
[0069] Generating the additional training data may also include generating an additional set of state-action vectors based on the automated state data and the automated action data, at 820. For example, the additional set of state-action vectors 232 may be generated based on the automated state data 220 and the automated action data 226.
[0070] Generating the additional training data may include determining an additional reward Q value associated with the additional set of state-action vectors, at 822. For example, the additional reward Q value 234 may be generated and may be associated with the additional set of state-action vectors 232.
[0071] The method 800 may include training the deep Q network based on the additional training data, at 824. For example, the deep Q network 140 may be trained based on the additional training data 230.
[0072] Referring to FIG. 9, an emergency pilot assistance method 900 is depicted. The method 900 may include calculating reward (Q) values using a deep Q network, wherein the reward values are based on state-action vectors associated with an aircraft, and wherein the state-action vectors include state data associated with the aircraft and action data associated with the aircraft, at 902. For example, the reward Q values 352 may be calculated using the deep Q network 140 based on the state-action vectors 332.
[0073] The method 900 may further include providing an indication of an action to a user at a user output device, wherein the action corresponds to an agent action that has a highest reward Q value as calculated by the deep Q network, at 904. For example, the indication 312 of the action 314 may be provided to the user 324 at the user output device 310.
[0074] Although various examples have been shown and described, the present disclosure is not so limited and will be understood to include all such modifications and variations as would be apparent to one skilled in the art.