Method for predicting air quality with aid of machine learning models

Abstract

A method for predicting air quality with the aid of machine learning models includes: (A) providing air pollution data to perform an eXtreme Gradient Boosting (XGBoost) regression algorithm for obtaining an XGBoost prediction value; (B) providing the air pollution data to perform a Long Short-Term Memory (LSTM) algorithm for obtaining an LSTM prediction value; (C) combining the air pollution data, the XGBoost prediction value and the LSTM prediction value to generate air pollution combination data; (D) performing an XGBoost classification algorithm to obtain a suggestion for whether to issue an air pollution alert; and (E) performing the XGBoost regression algorithm on the air pollution combination data to obtain an air pollution prediction value. Building two layers of machine learning models mitigates the overly conservative prediction results that a single model produces when it does not have enough data.

Claims

1. A method for predicting air quality with the aid of machine learning models, comprising: (A) providing parameters related to air pollution to perform, by a first layer of the machine learning models, an eXtreme Gradient Boosting (XGBoost) regression algorithm for obtaining an XGBoost prediction value; (B) providing the parameters to perform, by the first layer of the machine learning models, a Long Short-Term Memory (LSTM) algorithm for obtaining an LSTM prediction value; (C) linking, by the machine learning models, a vector corresponding to the parameters, a vector corresponding to the XGBoost prediction value and a vector corresponding to the LSTM prediction value, to generate air pollution combination data; (D) performing, by a second layer of the machine learning models, an XGBoost classification algorithm on the air pollution combination data to obtain an air pollution alert value for determining whether to issue an air pollution alert; and (E) performing, by the second layer of the machine learning models, the XGBoost regression algorithm on the air pollution combination data to obtain an air pollution prediction value.

2. The method of claim 1, wherein the parameters comprise PM2.5 concentration, temperature, humidity, wind speed, a wind direction, and a date.

3. The method of claim 2, wherein the date is expressed with a 2-dimensional coordinate (cos θ, sin θ), where θ=(x/365)*360°, and x indicates a date index within one year.

4. The method of claim 2, wherein the wind direction is expressed with a 2-dimensional coordinate (cos θ, sin θ), wherein θ indicates a wind direction angle.

5. The method of claim 1, wherein when the air pollution alert value is greater than a predetermined value, the air pollution alert is issued.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a diagram illustrating a partition grid of air pollution according to the present invention.

(2) FIG. 2 is a flowchart illustrating a method for predicting air quality with the aid of machine learning models according to the present invention.

(3) FIG. 3 is a diagram illustrating architecture of an LSTM algorithm according to the present invention.

(4) FIG. 4 is a diagram illustrating an overall learning model according to the present invention.

(5) FIG. 5 is a prediction result of the air quality of southern regions of Taiwan in January 2017 according to an embodiment of the present invention.

DETAILED DESCRIPTION

(6) Embodiments are provided to describe the method of the present invention. Those skilled in the art may understand the advantages and effects of the present invention according to the detailed description, which is provided as follows.

(7) Refer to FIG. 1, which is a diagram illustrating a partition grid of air pollution according to the present invention. As shown in FIG. 1, in addition to air pollution sources within Taiwan itself, pollution sources from other regions may also influence Taiwan via winds, wherein the influence from nearby regions is greater, and the influence from other regions gradually decreases with increasing distance. Thus, the cell sizes of the partition grid shown in FIG. 1 vary according to the influence levels of the different regions. Since the influence level of the pollution sources of the Taiwan region itself is the greatest, the partition grid of Taiwan is the most detailed. The influence levels of China and coastal regions of Korea are the second greatest, and the influence levels of inland regions of Mongolia and China are less, so the partition grid thereof is the least detailed.

(8) Refer to FIG. 2, which is a flowchart illustrating a method for predicting air quality with the aid of machine learning models according to the present invention. As shown in FIG. 2, the method comprises the following steps:

(9) Step S201: Provide air pollution data, to perform an XGBoost regression algorithm to obtain an XGBoost prediction value.

(10) Step S202: Provide the air pollution data, to perform an LSTM algorithm to obtain an LSTM prediction value.

(11) Step S203: Combine the air pollution data, the XGBoost prediction value and the LSTM prediction value to generate air pollution combination data.

(12) Step S204: Perform an XGBoost classification algorithm to obtain a suggestion for whether to issue an air pollution alert.

(13) Step S205: Perform the XGBoost regression algorithm on the air pollution combination data to obtain a prediction value of the air pollution.

(14) The air pollution data comprises multiple sets of parameter data. For example, weather factors (e.g. sunny and rainy) may influence an amount of pollution in the air. Thus, in an embodiment of the present invention, temperature and humidity during each time interval may be listed in characteristic vectors for machine learning. Note that the reason why humidity is utilized rather than rainfall is that a rainfall characteristic exists on rainy days only (there is no data when it does not rain), whereas the difference between cloudy days and sunny days cannot be distinguished without humidity information. For the same reason, wind speed and wind directions should also be considered. It should be noted that a wind direction characteristic is typically indicated by an angle, where 0° and 360° have the same meaning, but the values thereof are quite different, and this kind of mathematical model may therefore cause errors in prediction results. Thus, the wind direction characteristic that is originally a 1-dimensional characteristic is mapped to a 2-dimensional space. In coordinate form, both 0° and 360° are (1, 0), and 45° is (cos 45°, sin 45°); thus, the error no longer exists. In addition, since Taiwan is located in a monsoon zone, the time of year is also an important factor. Seasonal cycles repeat roughly once per year. December and January are adjacent, but the difference between the values of December and January (i.e. twelve and one) is the greatest. In order to solve this problem, the 365 days within one year respectively correspond to angles within 360° as shown in equation (1), and coordinate characteristics are utilized to indicate date characteristics.

(15) $f(x) = \frac{x}{365} \times 360^\circ \quad (1)$
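
The cyclic mapping of the wind direction and the date may be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the function names `encode_angle`, `encode_date` and `distance` are hypothetical, and the date index x is assumed to run from 1 to 365.

```python
import math

def encode_angle(theta_deg):
    """Map an angle in degrees to the point (cos θ, sin θ) on the unit circle."""
    theta = math.radians(theta_deg)
    return (math.cos(theta), math.sin(theta))

def encode_date(day_index, days_per_year=365):
    """Apply equation (1), θ = (x / 365) * 360°, then map θ to (cos θ, sin θ)."""
    return encode_angle(day_index / days_per_year * 360.0)

def distance(p, q):
    """Euclidean distance between two 2-dimensional points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Wind direction: 0° and 360° land on the same point, so the artificial
# numeric gap between them disappears.
p0 = encode_angle(0.0)
p360 = encode_angle(360.0)

# Date: day 365 and day 1 are neighbours on the circle, whereas
# day 1 and day 182 (about half a year later) are almost opposite.
gap_new_year = distance(encode_date(365), encode_date(1))
gap_half_year = distance(encode_date(1), encode_date(182))
```

With this encoding, the distance between two characteristic vectors reflects how close two wind directions or two dates actually are, rather than the difference between their raw numeric values.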

(16) There are many air pollution sources. In addition to gas emissions in the local region, there are external factors carried from other places via, for example, monsoons, and the influence levels of these external factors may differ according to wind speeds, wind directions and dates. Additionally, even though adjacent regions may have similar weather conditions, these regions may have different influence levels due to topographic factors. As the number of variables of air pollution is huge, it is hard to predict air pollution accurately. Thus, machine learning is introduced for prediction of air pollution, in order to improve accuracy. In this embodiment of the present invention, the LSTM algorithm, which has time continuity, is selected. The architecture of the LSTM algorithm keeps and carries forward the states of past tasks, and is therefore capable of being utilized for the prediction of air pollution. In addition, the LSTM algorithm can find obvious time-varying characteristics without being influenced by transient noise, which can help the machine learning to find a better solution. Since air pollution changes continuously and slowly, time intervals in this embodiment are shown as follows: [$t_0$ $t_1$ $t_2$ $t_3$ $t_4$ $t_5$ . . . $t_n$], where n is a positive integer. The duration of each time interval may be different, such as 1 hour, 8 hours, 24 hours, etc. According to an air pollution index of a previous time interval (known air pollution data), the method (or the LSTM algorithm) may predict an air pollution index of a next time interval (predicted air pollution data).

(17) TABLE 1
$t_1$ $t_2$ $t_3$ $t_4$ $t_5$ . . . $t_n$
K1 P1
K2 P2
K3 P3
K4 P4

(18) Table 1 is a time parameter input table in an LSTM form. The first row of Table 1 illustrates the time intervals [$t_0$ $t_1$ $t_2$ $t_3$ $t_4$ $t_5$ . . . $t_n$]; K1, K2, K3 and K4 are sets of known air pollution data of respective time intervals, and P1, P2, P3 and P4 are sets of predicted air pollution data of respective time intervals.
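
The pairing of known data (K) with predicted data (P) may be sketched as a sliding window over a time series. This is only an illustration: the exact alignment of K1 to K4 and P1 to P4 with the time intervals cannot be recovered from Table 1, so the window length, the function name `make_windows` and the sample readings below are all assumptions.

```python
def make_windows(series, window=1):
    """Split a time series into (known, predicted) pairs.

    Each pair holds `window` consecutive known values (K) and the value
    of the time interval that immediately follows them (P).
    """
    pairs = []
    for start in range(len(series) - window):
        known = series[start:start + window]
        predicted = series[start + window]
        pairs.append((known, predicted))
    return pairs

# Hypothetical readings for six consecutive time intervals (made-up values).
readings = [12.0, 15.0, 19.0, 22.0, 18.0, 14.0]

# With a window of two intervals, four (K, P) pairs are produced.
pairs = make_windows(readings, window=2)
```

Each pair then serves as one training example: the known values form the input sequence, and the following value is the target to be predicted.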

(19) Refer to FIG. 3, which is a diagram illustrating the architecture of the LSTM algorithm according to the present invention. As shown in FIG. 3, internal calculations of the LSTM algorithm are shown as follows:
$f_t = \sigma(U_f \cdot x_t + W_f \cdot h_{t-1} + b_f)$
$i_t = \sigma(U_i \cdot x_t + W_i \cdot h_{t-1} + b_i)$
$\tilde{c}_t = \tanh(U_c \cdot x_t + W_c \cdot h_{t-1} + b_c)$
$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$
$o_t = \sigma(U_o \cdot x_t + W_o \cdot h_{t-1} + b_o)$
$h_t = o_t \cdot \tanh(c_t)$
where $x_t$ is an input at a time t, $h_t$ is a state generated at the time t, $c_t$ is a memory generated at the time t, and $f_t$, $i_t$, $\tilde{c}_t$ and $o_t$ are internal thinking logics of the LSTM algorithm, computed according to $x_t$ and $h_{t-1}$ (i.e. the input of this moment and the state so far). These four logics correspond to four thinking modes of a human, comprising forgetting ($f_t$), memory ($i_t$), experience ($\tilde{c}_t$) and thought ($o_t$). In addition, W (e.g. $W_f$, $W_i$, $W_c$ and $W_o$), U (e.g. $U_f$, $U_i$, $U_c$ and $U_o$) and b (e.g. $b_f$, $b_i$, $b_c$ and $b_o$) are weight matrices and bias vector parameters which need to be learned during training. The parameter $f_t$ determines whether past experience still needs to be kept, $\tilde{c}_t$ is the experience of this time, $i_t$ determines how much of the experience of this time is memorized, and $o_t$ is a reacting thought regarding the input and the state of this time. This thought ($o_t$) is further combined with the memory ($c_t$) to generate a new state ($h_t$), and finally the state ($h_t$) and the memory ($c_t$) obtained after finishing the tasks of this time are kept for reference at a next time, in order to achieve an effect that is similar to human memory.
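
One time step of these internal calculations may be sketched in plain Python as follows. This is a minimal sketch, not the implementation of the embodiment: the memory update uses the standard LSTM form $c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$, the helper names (`sigmoid`, `affine`, `lstm_step`) and the dictionary layout of the parameters are hypothetical, and the all-zero toy weights are chosen only so that the result is easy to check by hand.

```python
import math

def sigmoid(z):
    """Logistic function σ(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(U, x, W, h, b):
    """Compute U·x + W·h + b with plain lists (one output per matrix row)."""
    return [
        sum(u * xj for u, xj in zip(u_row, x))
        + sum(w * hj for w, hj in zip(w_row, h))
        + bias
        for u_row, w_row, bias in zip(U, W, b)
    ]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; `p` maps names such as "U_f" to parameters."""
    f_t = [sigmoid(z) for z in affine(p["U_f"], x_t, p["W_f"], h_prev, p["b_f"])]
    i_t = [sigmoid(z) for z in affine(p["U_i"], x_t, p["W_i"], h_prev, p["b_i"])]
    c_tilde = [math.tanh(z) for z in affine(p["U_c"], x_t, p["W_c"], h_prev, p["b_c"])]
    # New memory: keep part of the old memory, add part of the new experience.
    c_t = [f * c + i * e for f, c, i, e in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = [sigmoid(z) for z in affine(p["U_o"], x_t, p["W_o"], h_prev, p["b_o"])]
    # New state: the output gate applied to the squashed memory.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy parameters: 2 inputs, 1 hidden unit, all weights zero (illustrative only).
toy = {}
for gate in "fico":
    toy["U_" + gate] = [[0.0, 0.0]]
    toy["W_" + gate] = [[0.0]]
    toy["b_" + gate] = [0.0]

# With zero weights every gate equals 0.5 and the candidate memory is 0,
# so the new memory is half of the previous memory.
h_t, c_t = lstm_step([0.3, 0.7], [0.0], [1.0], toy)
```

In practice the weights are learned during training, and the step is applied repeatedly along the time intervals, passing the returned state and memory to the next step.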

(20) Refer to FIG. 4, which is a diagram illustrating an overall learning model according to the present invention. In order to prevent a situation where the prediction results are too conservative when there is not enough data (for example, a situation where a precision rate is high and/or a recall rate is low), an XGBoost method (e.g. the XGBoost regression algorithm) is applied in addition to the LSTM algorithm. The XGBoost method is based on conventional adaptive boosting (AdaBoost), and puts together multiple simple weak classifiers, such as decision trees and regression models, so that they compensate for one another's weaknesses. Recent practical experience indicates that AdaBoost may have excellent manifold learning ability without utilizing a kernel function to perform function mapping on data, and that AdaBoost may operate quickly, and is therefore capable of processing large amounts of data. The XGBoost method is obtained by improving AdaBoost, and the XGBoost method may learn more effectively in a process of optimization. The overall learning model of the present invention is shown in FIG. 4. The air pollution data is inputted into the XGBoost regression algorithm and the LSTM algorithm (e.g. LSTM with batch normalization (BN) for regression) for concurrent learning. After learning is finished, two different models are obtained, wherein output results of these two different models may be further combined with the original air pollution data (i.e. the air pollution data), to serve as an input of a next layer. As a result, the overall learning model with two layers can be obtained; this kind of method may be referred to as model stacking.
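
The data flow of the two-layer model may be sketched as follows. This sketch shows only how the first-layer outputs are joined with the original characteristic vector and passed to the second layer; the four placeholder functions are simple arithmetic stand-ins, not XGBoost or LSTM models, and all names, sample values and the alert threshold are assumptions made for illustration.

```python
def stack_features(features, first_layer_models):
    """Append each first-layer prediction to the original feature vector."""
    predictions = [model(features) for model in first_layer_models]
    return list(features) + predictions

def two_layer_predict(features, first_layer_models, regressor, classifier, threshold):
    """Run both layers; return the prediction value and the alert decision."""
    combined = stack_features(features, first_layer_models)
    prediction = regressor(combined)      # second-layer regression
    alert_value = classifier(combined)    # second-layer classification
    return prediction, alert_value > threshold

# Placeholder models (NOT XGBoost or LSTM; stand-ins for trained models).
def first_regressor(features):
    return features[0] * 1.1

def first_sequence_model(features):
    return features[0] * 0.9

def second_regressor(combined):
    # Average of the two first-layer predictions at the end of the vector.
    return (combined[-2] + combined[-1]) / 2.0

def second_classifier(combined):
    # Made-up rule that returns a score between 0 and 1.
    return 0.9 if combined[-1] > 35.0 else 0.1

# Hypothetical sample: PM2.5 concentration, temperature, humidity.
sample = [40.0, 25.0, 70.0]
first_layer = [first_regressor, first_sequence_model]
combined = stack_features(sample, first_layer)
prediction, issue_alert = two_layer_predict(
    sample, first_layer, second_regressor, second_classifier, threshold=0.5
)
```

Because the second layer sees both the original data and the first-layer predictions, it can learn when to trust each first-layer model, which is the purpose of model stacking.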

(21) In order to verify performance of this embodiment, observation data of sixty observation stations of the Environmental Protection Administration (EPA) all over Taiwan is utilized. The observation data during 2014 to 2016 is taken as training data, and the observation data in January 2017 is taken as predicted data.

(22) TABLE 2
                      Actually dangerous    Actually safe
Predicted dangerous   a                     B
Predicted safe        c                     D

(23) The observation data is processed to evaluate the overall learning model (referred to as the model, for brevity) of the present invention as shown in Table 2, which illustrates parameters of the precision rate and the recall rate. To verify the accuracy of the prediction of air pollution, three indexes, namely the precision rate, the recall rate and the F1 score, are provided as follows:

(24) $\text{precision} = \frac{a}{a + B}$
$\text{recall} = \frac{a}{a + c}$
$\text{F1 score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

(25) As shown in Table 2, “a” represents conditions that are actually dangerous and are predicted to be dangerous (i.e., true positive), “B” represents conditions that are actually safe but are predicted to be dangerous (i.e., false positive), “c” represents conditions that are actually dangerous but are predicted to be safe (i.e., false negative), and “D” represents conditions that are actually safe and are predicted to be safe (i.e., true negative). A first prediction index of the air pollution, i.e. the precision rate, indicates the proportion of conditions that are actually dangerous within all conditions that are predicted to be dangerous; in other words, the precision rate indicates the possibility that a condition of the air pollution is actually dangerous when the model predicts the condition is dangerous. In an ideal situation, every prediction is correct and no error occurs, and the precision rate is therefore equal to one. A second prediction index of the air pollution, i.e. the recall rate, indicates the sensitivity of the model regarding occurrence of danger. A high recall rate indicates a high possibility that the model is capable of recognizing danger when the condition is actually dangerous. When the model correctly predicts all dangerous conditions, the recall rate is one. In order to consider both the precision rate and the recall rate, where both of them are expected to be over a certain level, a third prediction index of the air pollution, i.e. the F1 score, is defined. The F1 score is a combination of the precision rate and the recall rate, where the F1 score may be greatly reduced when one or both of the precision rate and the recall rate falls, and the reduction of the F1 score may be greater than an increment of the other of the precision rate and the recall rate. Thus, a better F1 score may be obtained only if both the precision rate and the recall rate are considered (i.e. both are high).
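
The three indexes may be computed from the counts of Table 2 as sketched below. The function name `precision_recall_f1`, the sample counts and the convention of returning zero when a denominator is zero are assumptions made for this illustration.

```python
def precision_recall_f1(a, B, c):
    """Compute the precision rate, the recall rate and the F1 score.

    a: actually dangerous and predicted dangerous (true positive)
    B: actually safe but predicted dangerous (false positive)
    c: actually dangerous but predicted safe (false negative)
    A zero denominator yields 0.0 (a convention assumed here).
    """
    precision = a / (a + B) if (a + B) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Balanced case (made-up counts): both rates are 0.8, so F1 is 0.8 as well.
balanced = precision_recall_f1(a=8, B=2, c=2)

# Conservative case (made-up counts): precision rises to 0.9 but recall
# falls to 0.5, and the F1 score drops to about 0.64.
conservative = precision_recall_f1(a=9, B=1, c=9)
```

The second case illustrates the behaviour described above: a gain of 0.1 in the precision rate does not offset the loss in the recall rate, so the F1 score decreases.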

(26) Refer to FIG. 5, which is a prediction result of the air quality of southern regions of Taiwan (e.g. Situn, Douliou, Chiayi, Zuoying and Siaogang) in January 2017 according to an embodiment of the present invention. As shown in FIG. 5, most precision rates, recall rates and F1 scores for one hour, six hours and twelve hours are greater than 75%, indicating that the accuracy of the prediction in this embodiment is close to 80%. The values of the root-mean-square error (RMSE) represent the error of the prediction results, and should be as small as possible. In an ideal situation where prediction values are exactly the same as actual values, the values of RMSE are zero. In addition, when the suggestion for whether to issue an air pollution alert is “yes” (e.g. PM2.5 concentration is greater than a predetermined value), which is obtained by performing the XGBoost classification algorithm on the air pollution combination data in Step S204, the air pollution alert (indicated by a bell in FIG. 5) may be issued as shown on the left side of FIG. 5, or be issued in other ways to the public.
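
The RMSE may be computed as sketched below. The function name `rmse` and the sample values are assumptions made for illustration and are not data from FIG. 5.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Ideal situation: the predictions equal the actual values, so the RMSE is zero.
perfect = rmse([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])

# Made-up example: errors of 3 and 4 give sqrt((9 + 16) / 2).
imperfect = rmse([0.0, 0.0], [3.0, 4.0])
```

Because the errors are squared before averaging, a few large errors raise the RMSE more than many small ones.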

(27) Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
