Method for predicting air quality with aid of machine learning models
11488069 · 2022-11-01
Assignee
Inventors
- Li-Yen Kuo (Tainan, TW)
- Chih-Lun Liao (Taichung, TW)
- Chun-Han Tai (Pingtung County, TW)
- Hao-Yu Kao (New Taipei, TW)
Cpc classification
G06N5/01
PHYSICS
G08B21/12
PHYSICS
G06N7/00
PHYSICS
International classification
G06N7/00
PHYSICS
G08B21/12
PHYSICS
Abstract
A method for predicting air quality with the aid of machine learning models includes: (A) providing air pollution data to perform an eXtreme Gradient Boosting (XGBoost) regression algorithm for obtaining an XGBoost prediction value; (B) providing the air pollution data to perform a Long Short-Term Memory (LSTM) algorithm for obtaining an LSTM prediction value; (C) combining the air pollution data, the XGBoost prediction value and the LSTM prediction value to generate air pollution combination data; (D) performing an XGBoost classification algorithm to obtain a suggestion for whether to issue an air pollution alert; and (E) performing the XGBoost regression algorithm on the air pollution combination data to obtain an air pollution prediction value. Because two layers of machine learning models are built, the method can mitigate the overly conservative prediction results that occur when a single model does not have enough data.
Claims
1. A method for predicting air quality with the aid of machine learning models, comprising: (A) providing parameters related to air pollution to perform, by a first layer of the machine learning models, an eXtreme Gradient Boosting (XGBoost) regression algorithm for obtaining an XGBoost prediction value; (B) providing the parameters to perform, by the first layer of the machine learning models, a Long Short-Term Memory (LSTM) algorithm for obtaining an LSTM prediction value; (C) linking, by the machine learning models, a vector corresponding to the parameters, a vector corresponding to the XGBoost prediction value and a vector corresponding to the LSTM prediction value, to generate air pollution combination data; (D) performing, by a second layer of the machine learning models, an XGBoost classification algorithm on the air pollution combination data to obtain an air pollution alert value for determining whether to issue an air pollution alert; and (E) performing, by the second layer of the machine learning models, the XGBoost regression algorithm on the air pollution combination data to obtain an air pollution prediction value.
2. The method of claim 1, wherein the parameters comprise PM2.5 concentration, temperature, humidity, wind speed, a wind direction, and a date.
3. The method of claim 2, wherein the date is expressed with a 2-dimensional coordinate (cos θ, sin θ), where θ=(x/365)*360°, and x indicates a date index within one year.
4. The method of claim 2, wherein the wind direction is expressed with a 2-dimensional coordinate (cos θ, sin θ), wherein θ indicates a wind direction angle.
5. The method of claim 1, wherein when the air pollution alert value is greater than a predetermined value, the air pollution alert is issued.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
DETAILED DESCRIPTION
(6) Embodiments are provided to describe the method of the present invention. Those skilled in the art may understand the advantages and effects of the present invention according to the detailed description, which is provided as follows.
(7) Refer to
(8) Refer to
(9) Step S201: Provide air pollution data, to perform an XGBoost regression algorithm to obtain an XGBoost prediction value.
(10) Step S202: Provide the air pollution data, to perform an LSTM algorithm to obtain an LSTM prediction value.
(11) Step S203: Combine the air pollution data, the XGBoost prediction value and the LSTM prediction value to generate air pollution combination data.
(12) Step S204: Perform an XGBoost classification algorithm to obtain a suggestion for whether to issue an air pollution alert.
(13) Step S205: Perform the XGBoost regression algorithm on the air pollution combination data to obtain a prediction value of the air pollution.
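The flow of steps S201 to S205 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `two_layer_predict`, the callables passed in as models, and the default threshold of 0.5 are all hypothetical. In practice the first-layer and second-layer models would be trained XGBoost and LSTM models, and the LSTM would normally consume a sequence of past intervals rather than a single feature vector.

```python
def two_layer_predict(features, first_regressor, first_lstm,
                      alert_classifier, final_regressor,
                      alert_threshold=0.5):
    """Run the two-layer flow on one feature vector.

    All four model arguments are hypothetical callables standing in for
    trained models; alert_threshold is an assumed predetermined value.
    """
    # S201 and S202: first-layer predictions from the air pollution data.
    xgb_value = first_regressor(features)
    lstm_value = first_lstm(features)

    # S203: link the data with both first-layer predictions.
    combined = list(features) + [xgb_value, lstm_value]

    # S204: second-layer classification decides whether to suggest an alert.
    alert_value = alert_classifier(combined)
    issue_alert = alert_value > alert_threshold

    # S205: second-layer regression yields the air pollution prediction value.
    prediction = final_regressor(combined)
    return combined, issue_alert, prediction
```

Passing the models as callables keeps the sketch independent of any particular library, so the data flow between the two layers is the only thing it asserts.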
(14) The air pollution data comprises multiple sets of parameter data. For example, weather conditions (e.g. sunny or rainy) may influence the amount of pollution in the air. Thus, in an embodiment of the present invention, the temperature and humidity of each time interval are listed in the characteristic vectors for machine learning. Humidity is utilized rather than rainfall because a rainfall characteristic exists on rainy days only (there is no rainfall data when it does not rain), so rainfall cannot distinguish cloudy days from sunny days, whereas humidity can. For the same reason, wind speed and wind direction are also considered. A wind direction characteristic is typically indicated by an angle, where 0° and 360° have the same meaning although their values are far apart, and feeding the raw angle to a mathematical model may therefore cause errors in prediction results. Thus, the wind direction characteristic, which is originally a 1-dimensional characteristic, is mapped to a 2-dimensional space. In coordinate form, both 0° and 360° are (1, 0), and 45° is (cos 45°, sin 45°); thus, the error no longer exists. In addition, since Taiwan is located in a monsoon zone, the time of year is also an important factor. Seasons cycle roughly once per year. December and January are adjacent, but the difference between their numerical values (i.e. twelve and one) is the greatest. In order to solve this problem, the 365 days of one year are mapped to angles within 360° as shown in equation (1), and coordinate characteristics are utilized to indicate date characteristics.
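(15) $$\theta = \frac{x}{365} \times 360^\circ, \qquad \text{date characteristic} = (\cos\theta, \sin\theta) \tag{1}$$ where x indicates a date index within one year.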
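The mapping of wind direction and date onto 2-dimensional coordinates can be illustrated with a short sketch. The function names `encode_angle` and `encode_date` are hypothetical; the date formula follows claim 3, θ = (x/365) × 360°.

```python
import math


def encode_angle(degrees):
    """Map an angle in degrees to the 2-D point (cos θ, sin θ)."""
    theta = math.radians(degrees)
    return (math.cos(theta), math.sin(theta))


def encode_date(day_index, days_per_year=365):
    """Map a date index x within one year to (cos θ, sin θ),
    where θ = (x / 365) * 360 degrees."""
    return encode_angle(day_index / days_per_year * 360.0)


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

With this encoding, `distance(encode_angle(0), encode_angle(360))` is zero up to floating-point error, and the last day of one year lies next to the first day of the following year, which is the property that raw angles and raw date indexes lack.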
(16) There are many air pollution sources. In addition to gas emissions in the local region, there are external pollutants carried in from other regions by, for example, monsoons, and the influence of these external factors varies with wind speed, wind direction and date. Additionally, even though adjacent regions may have similar weather conditions, they may be influenced to different degrees due to topographic factors. As the number of variables affecting air pollution is huge, it is hard to predict air pollution accurately. Thus, machine learning is introduced for prediction of air pollution, in order to improve accuracy. In this embodiment of the present invention, the LSTM algorithm, which has time continuity, is selected. The architecture of the LSTM algorithm keeps and carries forward the states of past tasks, and is therefore suitable for the prediction of air pollution. In addition, the LSTM algorithm can find obvious time-varying characteristics without being influenced by transient noise, which helps the machine learning to find a better solution. Since air pollution changes continuously and slowly, time intervals in this embodiment are denoted as $[t_0, t_1, t_2, t_3, t_4, t_5, \ldots, t_n]$, where n is a positive integer. The duration of a time interval may vary, such as 1 hour, 8 hours, 24 hours, etc. According to an air pollution index of a previous time interval (known air pollution data), the method (or the LSTM algorithm) may predict an air pollution index of a next time interval (predicted air pollution data).
(17) TABLE 1
t_1   t_2   t_3   t_4   t_5   . . .   t_n
K1    P1
      K2    P2
            K3    P3
                  K4    P4
(18) Table 1 is a time parameter input table in an LSTM form. The first row of Table 1 illustrates the time intervals $[t_0, t_1, t_2, t_3, t_4, t_5, \ldots, t_n]$; K1, K2, K3 and K4 are sets of known air pollution data of respective time intervals, and P1, P2, P3 and P4 are sets of predicted air pollution data of respective time intervals.
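The pairing of known data with the data to be predicted can be sketched as below, assuming that each known set belongs to one time interval and the matching predicted set belongs to the interval immediately after it, as the description of predicting the next interval from the previous one suggests. The function name `known_predicted_pairs` is hypothetical.

```python
def known_predicted_pairs(series):
    """Pair the value of each time interval (known data) with the value
    of the following interval (the data to be predicted)."""
    return [(series[i], series[i + 1]) for i in range(len(series) - 1)]


# Example with four consecutive (made-up) air pollution index readings.
pairs = known_predicted_pairs([10, 12, 15, 11])
```

Each pair plays the role of one (K, P) row: the first element is what the model sees, the second is what it is trained to output.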
(19) Refer to
$$f_t = \sigma(U_f \cdot x_t + W_f \cdot h_{t-1} + b_f)$$
$$i_t = \sigma(U_i \cdot x_t + W_i \cdot h_{t-1} + b_i)$$
$$\tilde{c}_t = \tanh(U_c \cdot x_t + W_c \cdot h_{t-1} + b_c)$$
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$$
$$o_t = \sigma(U_o \cdot x_t + W_o \cdot h_{t-1} + b_o)$$
$$h_t = o_t * \tanh(c_t)$$
where $x_t$ is an input at a time t, $h_t$ is a state generated at the time t, $c_t$ is a memory generated at the time t, $*$ denotes element-wise multiplication, and $f_t$, $i_t$, $\tilde{c}_t$ and $o_t$ are internal thinking logics of the LSTM algorithm, computed according to $x_t$ and $h_{t-1}$ (i.e. the input of this moment and the state so far). These four logics correspond to four thinking modes of a human, comprising forgetting ($f_t$), memory ($i_t$), experience ($\tilde{c}_t$) and thought ($o_t$). In addition, W (e.g. $W_f$, $W_i$, $W_c$ and $W_o$), U (e.g. $U_f$, $U_i$, $U_c$ and $U_o$) and b (e.g. $b_f$, $b_i$, $b_c$ and $b_o$) are weight matrices and bias vector parameters which need to be learned during training. The parameter $f_t$ determines whether past experience still needs to be kept, $\tilde{c}_t$ is the experience of this time, $i_t$ determines how much of the experience of this time is memorized, and $o_t$ is a reacting thought regarding the input and the state of this time. This thought ($o_t$) is further combined with the memory ($c_t$) to generate a new state ($h_t$), and finally the state ($h_t$) and the memory ($c_t$) obtained after finishing the tasks of this time are kept for reference at a next time, in order to achieve an effect that is similar to human memory.
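A single LSTM time step can be sketched in plain Python as follows. This is an illustrative sketch rather than the embodiment's code: it assumes the standard LSTM memory update, in which the new memory is the forget gate times the previous memory plus the input gate times the candidate experience, and the names `lstm_step`, `affine` and the `params` dictionary layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(U, x, W, h, b):
    """Return U·x + W·h + b for list-of-lists matrices and list vectors."""
    return [
        sum(u * xi for u, xi in zip(U_row, x))
        + sum(w * hi for w, hi in zip(W_row, h))
        + b_k
        for U_row, W_row, b_k in zip(U, W, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """Compute (h_t, c_t) from the input x_t and the previous state.

    params maps each of "f", "i", "c", "o" to a (U, W, b) triple
    (an assumed layout for this sketch).
    """
    def gate(name, activation):
        U, W, b = params[name]
        return [activation(z) for z in affine(U, x_t, W, h_prev, b)]

    f_t = gate("f", sigmoid)        # forgetting
    i_t = gate("i", sigmoid)        # memory (input gate)
    c_tilde = gate("c", math.tanh)  # experience of this time
    o_t = gate("o", sigmoid)        # thought

    # New memory: keep part of the old memory, add part of the new experience.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # New state: the thought applied to the squashed memory.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Calling `lstm_step` repeatedly over consecutive time intervals, feeding each returned state and memory into the next call, carries information from past intervals forward.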
(20) Refer to
(21) In order to verify the performance of this embodiment, observation data from sixty observation stations of the Environmental Protection Administration (EPA) all over Taiwan is utilized. The observation data from 2014 to 2016 is taken as training data, and the observation data of January 2017 is taken as the data to be predicted.
(22) TABLE 2
                      Actually dangerous   Actually safe
Predicted dangerous   a                    B
Predicted safe        c                    D
(23) The observation data is processed to evaluate the overall learning model (referred to as the model, for brevity) of the present invention as shown in Table 2, which defines the parameters used by the precision rate and the recall rate. To verify the accuracy of the prediction of air pollution, three indexes, namely the precision rate, the recall rate and the F1 score, are provided as follows:
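(24) $$\text{precision rate} = \frac{a}{a + B}, \qquad \text{recall rate} = \frac{a}{a + c}, \qquad \text{F1 score} = \frac{2 \times \text{precision rate} \times \text{recall rate}}{\text{precision rate} + \text{recall rate}}$$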
(25) As shown in Table 2, “a” represents conditions that are actually dangerous and are predicted to be dangerous (i.e., true positive), “B” represents conditions that are actually safe but are predicted to be dangerous (i.e., false positive), “c” represents conditions that are actually dangerous but are predicted to be safe (i.e., false negative), and “D” represents conditions that are actually safe and are predicted to be safe (i.e., true negative). A first prediction index of the air pollution, i.e. the precision rate, indicates the proportion of conditions that are actually dangerous among all conditions that are predicted to be dangerous; in other words, the precision rate indicates the probability that a condition of the air pollution is actually dangerous when the model predicts that the condition is dangerous. In an ideal situation, every prediction is correct and no error occurs, and the precision rate is therefore equal to one. A second prediction index of the air pollution, i.e. the recall rate, indicates the sensitivity of the model to the occurrence of danger. A high recall rate indicates a high probability that the model recognizes danger when the condition is actually dangerous. When the model correctly predicts all dangerous conditions, the recall rate is one. In order to consider both the precision rate and the recall rate, where both of them are expected to be above a certain level, a third prediction index of the air pollution, i.e. the F1 score, is defined. The F1 score combines the precision rate and the recall rate: it drops sharply when either of them falls, and this drop may outweigh an increase in the other. Thus, a high F1 score is obtained only if both the precision rate and the recall rate are high.
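The three indexes can be computed from the counts of Table 2 as in the sketch below. It uses the standard definitions (precision as a/(a+B), recall as a/(a+c), and the F1 score as the harmonic mean of the two), which match the description above. The function name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def precision_recall_f1(a, B, c):
    """Return (precision rate, recall rate, F1 score).

    a: actually dangerous, predicted dangerous (true positive)
    B: actually safe, predicted dangerous (false positive)
    c: actually dangerous, predicted safe (false negative)
    Returning 0.0 for an empty denominator is an assumed convention.
    """
    precision = a / (a + B) if (a + B) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, counts of a = 9, B = 1 and c = 9 give a high precision rate (0.9) but a recall rate of only 0.5, and the F1 score falls to about 0.64, which illustrates why both indexes must be high for the F1 score to be high.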
(26) Refer to
(27) Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.