METHOD OF ANALYZING INFLUENCE FACTOR FOR PREDICTING CARBON DIOXIDE CONCENTRATION OF ANY SPATIOTEMPORAL POSITION
20230186173 · 2023-06-15
Inventors
Cpc classification
Y02A90/10
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G06N5/01
PHYSICS
International classification
Abstract
The disclosure provides a method of analyzing influence factors for predicting the carbon dioxide concentration at any spatiotemporal position. First, an atmospheric carbon dioxide spatiotemporal distribution simulation method is proposed: a machine learning algorithm is combined with satellite carbon dioxide observation data and the corresponding environmental factors to construct a model that simulates the carbon dioxide concentration distribution at any position of a region. Next, a global sensitivity analysis method is used to quantitatively evaluate the importance of multiple influence factors for the regional carbon dioxide distribution.
Claims
1. A method of analyzing an influence factor for predicting a carbon dioxide concentration of any spatiotemporal position, the method comprising: 1) in combination with regional environmental characteristics, classifying environmental factors affecting regional carbon dioxide distribution into a plurality of factors comprising a ground coverage type factor, a vegetation coverage factor, a climate type factor, a precipitation factor, an atmospheric temperature factor, wind velocity and direction factors, an anthropogenic emission amount factor, and a biomass combustion emission factor; wherein, in 1), the vegetation coverage factor is from the L3 Normalized Difference Vegetation Index of the Moderate-Resolution Imaging Spectroradiometer (MODIS) satellite; the ground coverage type factor is from the annual global land coverage data from the European Space Agency; the climate type factor is from the Köppen climate zoning dataset; the precipitation factor and the atmospheric temperature factor are from the Chinese 1 km-resolution monthly average precipitation and atmospheric temperature data from the National Tibetan Plateau Data Center; the wind velocity and direction factors are from the wind velocity and direction data of the ERA5 dataset; the anthropogenic emission amount factor is from the high resolution global anthropogenic emission dataset ODIAC, and the biomass combustion emission factor is from the biomass combustion emission amount data of the Global Fire Emissions Database GFED4; 2) in combination with OCO-2 satellite carbon dioxide observation data and the environmental factors, using the eXtreme Gradient Boosting tree (XGBoost) machine learning algorithm to construct a Regional Carbon Dioxide Spatiotemporal distribution simulation (RCDS) model and training the simulation model using a training dataset; 3) for the constructed RCDS model, first using a test dataset to verify the model prediction accuracy, then inputting environmental factor data without satellite observation into the trained carbon dioxide spatiotemporal distribution simulation model to obtain a predicted carbon dioxide concentration, and finally obtaining a regional carbon dioxide concentration distribution graph; 4) in combination with the constructed regional carbon dioxide spatiotemporal distribution simulation model and a global sensitivity analysis method, calculating a sensitivity of the carbon dioxide concentration to each input environmental factor parameter; 5) collecting the sensitivities of the regional carbon dioxide concentration to the different environmental factors obtained by the global sensitivity analysis method, and quantitatively analyzing the magnitude of the sensitivity of each parameter to finally determine the degree of influence of each environmental factor on the regional carbon dioxide distribution.
2. The method of claim 1, wherein the machine learning algorithm used in 2) is eXtreme Gradient Boosting tree (XGBoost), which is a tree ensemble model based on gradient boosting; the basic construction idea of the XGBoost model is: first constructing an initial sub-tree to fit the data and obtain a fitting residual, and then constructing subsequent sub-trees based on the fitting residual of the preceding sub-trees until the fitting residual is less than a threshold, the final simulation result being a sum of all sub-tree results; the specific construction steps are as follows: initially constructing a weak base learner to obtain a residual corresponding to an initial sub-tree model; for each subsequent training iteration, based on the existing sub-tree model, adding one weak learner to fit the residual of the previous sub-tree model; through continuous learning, fitting K weak learners to reduce the residual between the model prediction result and the true value until the residual is less than a threshold, at which point training is terminated, the final model prediction result being obtained by weighted summation over the K base learners.
3. The method of claim 1, wherein the specific implementation of performing training using the training dataset in 2) is as follows: first performing preprocessing for the training dataset, comprising data cleaning, data encoding and data transformation, wherein the data cleaning comprises removal of missing values, abnormal values and noise, and the data transformation comprises normalization and dimension reduction; the data encoding is to encode non-numerical features so that they can be input into the model for training, the environmental factors comprising the ground coverage type, the climate type and the wind direction being encoded by using one-hot encoding; and performing normalization processing for the data by the following formula:
$$z_q' = \frac{z_q - \mathrm{mean}(z_q)}{\mathrm{std}(z_q)}$$
wherein $z_q'$ is the normalized value, $\mathrm{mean}(z_q)$ is a mean value of the data of environmental factor $z_q$, and $\mathrm{std}(z_q)$ is a standard deviation of the data of the environmental factor $z_q$.
4. The method of claim 2, wherein the base learner of the XGBoost model is a CART tree, and for a dataset with n samples and m features, $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), the final CART tree prediction value obtained by training is expressed below:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
$$f_k(x_i) = \omega_{q(x_i)}$$
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_k^{(t)}(x_i)$$
a target function is expressed as:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_k^{(t)}(x_i)\right) + \Omega\left(f_k^{(t)}\right)$$
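5. The method of claim 1, wherein the global sensitivity analysis method used in 4) is the Sobol method, which calculates sensitivity by decomposing the total output variance into a sum of the variances of the individual parameters and the variances of the interactions between parameters, and then ranks the sensitivities based on the ratio of each parameter's contribution to the output variance; for each environmental factor, a change range and a probability distribution are calculated and then a corresponding sensitivity index is calculated in combination with the regional carbon dioxide spatiotemporal distribution simulation model; the regional carbon dioxide spatiotemporal distribution simulation model is expressed as: $y = f(x'_1, x'_2, \ldots, x'_p)$, wherein f is the trained XGBoost model, and $x'_1, x'_2, \ldots, x'_p$ are the environmental factors affecting carbon dioxide distribution and are the input parameters of the XGBoost model; the total variance of the XGBoost model is:
$$D = \int f^2(x')\,dx' - f_0^2$$
wherein $f_0$ is the constant term of the XGBoost model, i.e. the mean value of its output over the input space, and a partial variance of the XGBoost model is:
$$D_{\pi_1 \ldots \pi_s} = \int f_{\pi_1 \ldots \pi_s}^2\left(x'_{\pi_1}, \ldots, x'_{\pi_s}\right) dx'_{\pi_1} \cdots dx'_{\pi_s}$$
wherein $1 \le \pi_1 < \ldots < \pi_s \le p$, $s = 1, 2, \ldots, p$, and $f_{\pi_1 \ldots \pi_s}$ is the component of f that depends only on the factors $x'_{\pi_1}, \ldots, x'_{\pi_s}$; the sensitivity is $S_{\pi_1 \ldots \pi_s} = D_{\pi_1 \ldots \pi_s} / D$; and the total sensitivity index $TS_\pi$ of the environmental factor $x'_\pi$ is the sum of all sensitivity indices whose subscripts include $\pi$:
$$TS_\pi = S_\pi + \sum_{j \ne \pi} S_{\pi j} + \cdots + S_{1 2 \ldots p}$$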
Description
DETAILED DESCRIPTION
[0045] To describe the technical solution and technical advantages of the disclosure in more details, the disclosure will be fully described below in combination with specific embodiments and accompanying drawings.
[0046] As shown in
[0047] 1. The specific steps of the regional carbon dioxide simulation modeling method based on a machine learning algorithm are described below.
[0048] At step 1, environmental factor data affecting regional carbon dioxide distribution are collected, including but not limited to ground coverage type, vegetation coverage, climate type, precipitation, atmospheric temperature, wind velocity and direction, anthropogenic emission statistics, and biomass combustion emission amount of a region, and are then matched with the satellite carbon dioxide observation data to obtain the training and verification datasets of a machine learning model.
[0049] The vegetation coverage is represented by Normalized Difference Vegetation Index (NDVI) data, which may be obtained from the L3 vegetation index product of the MODIS satellite; the anthropogenic emission statistics come from the high resolution global anthropogenic emission dataset ODIAC; the biomass combustion data come from the Global Fire Emissions Database GFED4; the atmospheric temperature and precipitation data come from the Chinese 1 km-resolution monthly average precipitation and atmospheric temperature datasets provided by the National Tibetan Plateau Data Center; the ground coverage data come from the annual global land coverage dataset published by the European Space Agency; the climate type data come from the Köppen climate zoning dataset; and the wind velocity and direction data come from the ERA5 dataset.
[0050] At step 2, a machine learning algorithm is selected to construct a carbon dioxide distribution simulation model and the model is trained in combination with environmental factors and the satellite carbon dioxide training dataset.
[0051] The specific steps of performing training are as follows: the training dataset is preprocessed, which comprises data cleaning (removal of missing values, abnormal values, noise and the like), data encoding, and data transformation (normalization, dimension reduction and the like).
[0052] For missing values in the dataset, when only a few values are missing, the affected samples are deleted.
[0053] For abnormal values and noise, the noise is first detected from the statistical characteristics of the data or by a clustering method, and the data are then "smoothed" by a method such as binning, clustering, regression, or a combination of computer checking and manual checking, so as to remove the abnormal values and noise from the data.
[0054] The data encoding mainly encodes the non-numerical features so that they can be input into the model for training. In this experiment, the environmental factors that require encoding are mainly the ground coverage type, the climate type and the wind direction, which are encoded by using one-hot encoding.
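The one-hot encoding step can be illustrated with a minimal sketch. The function name `one_hot_encode` and the wind-direction labels are hypothetical; the disclosure does not list the actual category values or the tooling used.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row holds a single 1 in the slot of its category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical wind-direction labels, used only for illustration.
wind_direction = ["N", "E", "S", "E", "W"]
categories, encoded = one_hot_encode(wind_direction)
for label, row in zip(wind_direction, encoded):
    print(label, row)
```

Each categorical factor becomes one 0/1 column per category, so that no artificial ordering between categories (for example between climate zones) is presented to the tree model.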
[0055] Data preprocessing also requires normalization of the data by the following formula:
$$z_q' = \frac{z_q - \mathrm{mean}(z_q)}{\mathrm{std}(z_q)}$$
[0056] where $z_q'$ is the normalized value, $\mathrm{mean}(z_q)$ is the mean value of the data of environmental factor $z_q$, and $\mathrm{std}(z_q)$ is the standard deviation of the data of the environmental factor $z_q$.
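A minimal sketch of this normalization, which subtracts the mean of a factor and divides by its standard deviation, is given below. The function name `standardize` and the precipitation values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or the sample form is intended.

```python
from statistics import mean, pstdev


def standardize(column):
    """Subtract the mean and divide by the standard deviation of one factor."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]  # a constant factor carries no information
    return [(z - mu) / sigma for z in column]


# Hypothetical monthly precipitation values (mm), for illustration only.
precipitation = [12.0, 30.5, 48.0, 75.5, 20.0]
scaled = standardize(precipitation)
print([round(v, 3) for v in scaled])
```

After this step every factor has zero mean and unit standard deviation, which puts factors with very different units (ppm, mm, degrees) on a comparable scale.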
[0057] Furthermore, the machine learning algorithm used in step 2 is eXtreme Gradient Boosting tree (XGBoost), which is a tree ensemble model based on gradient boosting, wherein the basic construction idea of the model is: first constructing an initial sub-tree to fit the data and obtain a fitting residual, and then constructing subsequent sub-trees based on the residual of the previous model until the model residual is less than a threshold, the final simulation result being a sum of all sub-tree results; the specific construction steps are as follows:
[0058] initially constructing a weak learner to obtain a residual corresponding to an initial model;
[0059] for each subsequent training iteration, based on the existing model, adding one weak learner to fit the residual of the previous model;
[0060] through continuous learning, fitting K weak learners to reduce the residual between the model prediction result and the true value until the residual is less than a threshold, at which point training is terminated, the final model prediction value being obtained by weighted summation over the K base learners.
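The residual-fitting loop described above can be sketched in plain Python. This is a simplified illustration, not XGBoost itself: it uses one-split regression stumps as weak learners, a squared-error residual, and no regularization or second-order terms. The names `fit_stump`, `boost`, `weight` and the sample data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump (the weak learner) to the residuals."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), None, overall, overall)
    for threshold in sorted(set(xs))[:-1]:  # splits that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    if threshold is None:  # constant feature: fall back to the mean residual
        return lambda x: overall
    return lambda x: lmean if x <= threshold else rmean


def rmse(ys, preds):
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5


def boost(xs, ys, max_learners=50, weight=0.5, threshold=0.05):
    """Add weak learners until the residual falls below the threshold or K is reached."""
    learners, preds = [], [0.0] * len(xs)
    while len(learners) < max_learners and rmse(ys, preds) >= threshold:
        residuals = [y - p for y, p in zip(ys, preds)]  # what the current model misses
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + weight * stump(x) for p, x in zip(preds, xs)]  # weighted sum
    return learners, preds


# Hypothetical one-feature data, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2, 5.0, 5.1]
learners, preds = boost(xs, ys)
print(len(learners), round(rmse(ys, preds), 4))
```

Each new stump is trained on the residual left by the learners before it, and the final prediction is the weighted sum of all stump outputs, which mirrors the construction steps listed above.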
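[0061] Furthermore, the base learner of the XGBoost model is a CART tree, and for a dataset with n samples and m features, $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), the final prediction value obtained by training is expressed below:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
[0062] wherein K is the number of base learners, $x_i$ is the i-th sample, $y_i$ is the label (observed value) corresponding to the i-th sample, and $f_k(\cdot)$ is the model of the k-th tree, wherein the k-th tree is described by a leaf-node assignment q and a corresponding weight part $\omega$, i.e.:
$$f_k(x_i) = \omega_{q(x_i)}$$
[0063] wherein $\omega_{q(x_i)}$ is the weight of the leaf node $q(x_i)$ to which the sample $x_i$ is assigned;
[0064] for each iteration, the model fits the previous prediction residual and therefore, when a t-th base learner is generated, the prediction model is expressed as:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_k^{(t)}(x_i)$$
[0065] a target function is expressed as:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_k^{(t)}(x_i)\right) + \Omega\left(f_k^{(t)}\right)$$
[0066] wherein the target function is composed of two parts: in the first part, the function $l(\cdot,\cdot)$ describes the difference between a true value and a fitted value, which is calculated based on Euclidean distance; the second part is a regularization term $\Omega(f_k^{(t)})$, i.e.
$$\Omega\left(f_k^{(t)}\right) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$$
used to limit the complexity of each tree and prevent model overfitting, wherein T is the number of leaf nodes of the CART tree, $\gamma$ and $\lambda$ are hyperparameters that weight the number of leaf nodes and the leaf weights, respectively, in the regularization term, and $\omega_j$ is the weight of the j-th leaf node; to minimize the target function, XGBoost performs a second-order Taylor expansion of the target function, which is approximately expressed as:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_k^{(t)}(x_i) + \frac{1}{2} h_i \left(f_k^{(t)}(x_i)\right)^2 \right] + \Omega\left(f_k^{(t)}\right)$$
[0067] wherein $g_i$ is the first-order derivative, defined as
$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}$$
$h_i$ is the second-order derivative
$$h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}$$
and, after dropping the term $l(y_i, \hat{y}_i^{(t-1)})$, which does not depend on the t-th tree, the following result is obtained by substituting $f_k^{(t)}(x_i) = \omega_{q(x_i)}$ and $\Omega(f_k^{(t)})$ into the target function:
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, \omega_{q(x_i)} + \frac{1}{2} h_i\, \omega_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2$$
[0068] Each iteration minimizes the target function to obtain the optimal leaf nodes $j = 1, \ldots, T$ of the t-th base learner and the optimal weight $\omega_j$ corresponding to each leaf node.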
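As an illustration of how the optimal weight of one leaf follows from the second-order approximation, the sketch below assumes a squared-error loss $l = \frac{1}{2}(y - \hat{y})^2$, for which the first-order derivative is $\hat{y} - y$ and the second-order derivative is 1. For a fixed tree structure the contribution of a single leaf is $G\omega + \frac{1}{2}(H + \lambda)\omega^2$, where G and H (illustrative symbols, not from the original text) are the sums of the first-order and second-order derivatives of the samples in that leaf, and its minimizer is $\omega^* = -G / (H + \lambda)$. The function names and sample values are hypothetical.

```python
def gradients(y_true, y_prev):
    """Derivatives of the assumed loss 0.5 * (y - y_hat)**2 at the previous prediction."""
    g = [p - y for y, p in zip(y_true, y_prev)]  # first-order derivatives
    h = [1.0] * len(y_true)                      # second-order derivatives
    return g, h


def leaf_objective(w, g, h, lam):
    """Contribution of one leaf to the approximated objective (the gamma*T term is constant)."""
    return sum(g) * w + 0.5 * (sum(h) + lam) * w * w


def optimal_leaf_weight(g, h, lam):
    """Closed-form minimizer of the quadratic leaf objective."""
    return -sum(g) / (sum(h) + lam)


# Hypothetical CO2 values (ppm) of three samples that fall into the same leaf.
y_true = [410.2, 411.0, 409.5]
y_prev = [409.0, 409.0, 409.0]
g, h = gradients(y_true, y_prev)
w_star = optimal_leaf_weight(g, h, lam=1.0)
print(round(w_star, 4), round(leaf_objective(w_star, g, h, 1.0), 4))
```

A larger $\lambda$ shrinks the leaf weight toward zero, which is how the regularization term limits the contribution of each tree.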
[0069] The preprocessed training dataset is input into the XGBoost model, the parameters of the XGBoost model are tuned and further optimized, and the iterations are repeated to obtain an optimal carbon dioxide distribution simulation model.
[0070] At step 3, for the constructed carbon dioxide distribution simulation model, a test dataset is first used to verify the model prediction accuracy; environmental factor data for positions without satellite observation are then input into the trained carbon dioxide distribution simulation model to obtain predicted carbon dioxide concentrations; and finally, the regional carbon dioxide concentration spatiotemporal distribution is obtained.
[0071] 2. Based on the trained regional carbon dioxide spatiotemporal distribution simulation model described above and the global sensitivity analysis method, the importance of the influence factors is quantitatively analyzed, comprising the following steps.
[0072] At step 4, in combination with the constructed regional carbon dioxide spatiotemporal distribution simulation model and the global sensitivity analysis method, the sensitivity of the carbon dioxide distribution to each environmental factor is calculated.
[0073] At step 5, the sensitivities of the regional carbon dioxide concentration to the different environmental factors obtained by the global sensitivity analysis method are collected, and the magnitude of the sensitivity of each parameter is quantitatively analyzed to finally determine the degree of influence of each environmental factor on the regional carbon dioxide distribution.
[0074] The global sensitivity analysis method used in step 4 is the Sobol method, which is performed as follows:
[0075] for each environmental factor, a change range and a probability distribution are calculated and then a corresponding sensitivity index is calculated in combination with the regional carbon dioxide spatiotemporal distribution simulation model.
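[0076] The regional carbon dioxide spatiotemporal distribution simulation model is expressed as: $y = f(x'_1, x'_2, \ldots, x'_p)$, wherein f is the trained XGBoost model, $x'_1, x'_2, \ldots, x'_p$ are the environmental factors affecting carbon dioxide distribution and are the input parameters of the XGBoost model, and p is the number of model parameters, i.e. the 9 influence factors in step 1; the total variance of the XGBoost model is:
$$D = \int f^2(x')\,dx' - f_0^2$$
[0077] wherein $f_0$ is the constant term of the model, i.e. the mean value of its output over the input space, and the partial variance of the model is:
$$D_{\pi_1 \ldots \pi_s} = \int f_{\pi_1 \ldots \pi_s}^2\left(x'_{\pi_1}, \ldots, x'_{\pi_s}\right) dx'_{\pi_1} \cdots dx'_{\pi_s}$$
[0078] wherein $1 \le \pi_1 < \ldots < \pi_s \le p$, $s = 1, 2, \ldots, p$, $f_{\pi_1 \ldots \pi_s}$ is the component of f that depends only on the factors $x'_{\pi_1}, \ldots, x'_{\pi_s}$, and the sensitivity $S_{\pi_1 \ldots \pi_s}$ is defined as:
$$S_{\pi_1 \ldots \pi_s} = \frac{D_{\pi_1 \ldots \pi_s}}{D}$$
[0079] wherein $S_{\pi_1}$ (s = 1) is the first-order sensitivity index of the factor $x'_{\pi_1}$, and the indices with $s \ge 2$ describe the interactions among the factors $x'_{\pi_1}, \ldots, x'_{\pi_s}$;
[0080] further, a total sensitivity index of each environmental factor is obtained, and the total sensitivity index $TS_\pi$ of the environmental factor $x'_\pi$ is the sum of all sensitivity indices whose subscripts include $\pi$:
$$TS_\pi = S_\pi + \sum_{j \ne \pi} S_{\pi j} + \cdots + S_{1 2 \ldots p}$$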
[0081] In step 5, the total sensitivity index of each environmental factor obtained by Sobol method is used to evaluate the final sensitivity of the influence factors affecting the regional carbon dioxide distribution, achieving quantitative influence degree analysis.
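The first-order and total sensitivity indices can be estimated numerically by Monte Carlo sampling. The sketch below is an illustration under stated assumptions: the disclosure does not specify which estimator is used, so the commonly used Saltelli (first-order) and Jansen (total) estimators are swapped in; the inputs are assumed to be independent and uniform on [0, 1]; and `toy_model` is a hypothetical stand-in for the trained XGBoost model f.

```python
import random


def sobol_indices(f, p, n=20000, seed=0):
    """Estimate first-order and total Sobol indices of f for p independent U(0,1) inputs."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(p)] for _ in range(n)]
    B = [[rng.random() for _ in range(p)] for _ in range(n)]
    fA = [f(row) for row in A]
    fB = [f(row) for row in B]
    all_f = fA + fB
    f0 = sum(all_f) / len(all_f)                         # constant term (mean output)
    D = sum((v - f0) ** 2 for v in all_f) / len(all_f)   # total variance
    first, total = [], []
    for i in range(p):
        # A with column i taken from B: differs from A only in factor i.
        fAB = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Saltelli estimator of the partial variance of factor i (outputs centered by f0).
        Vi = sum((fb - f0) * (fab - fa) for fa, fb, fab in zip(fA, fB, fAB)) / n
        # Jansen estimator of the total-effect variance of factor i.
        VTi = sum((fa - fab) ** 2 for fa, fab in zip(fA, fAB)) / (2 * n)
        first.append(Vi / D)
        total.append(VTi / D)
    return first, total


def toy_model(x):
    """Hypothetical additive model: analytic indices are 0.2, 0.8 and 0."""
    return x[0] + 2.0 * x[1] + 0.0 * x[2]


first, total = sobol_indices(toy_model, p=3)
print([round(s, 3) for s in first], [round(s, 3) for s in total])
```

For the real model, f would wrap the prediction call of the trained model, and each uniform sample would first be rescaled to the change range and probability distribution determined for the corresponding environmental factor. Because the toy model has no interaction terms, its first-order and total indices coincide; a gap between the two for a real factor indicates interaction effects.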
3. Embodiment
[0082] In this embodiment of the present disclosure, the CO2 concentration distribution in the eastern region of China is simulated by XGBoost modeling using OCO-2 satellite XCO2 observation data of 2016 and the corresponding environmental factors.
TABLE 1 Modeling accuracy
Training samples | Test samples | R2 | RMSE
3153 (70%) | 1351 (30%) | 0.6751 | 1.6362 ppm
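The two accuracy measures reported in Table 1 can be computed as sketched below. It is assumed here that R2 denotes the coefficient of determination and RMSE the root-mean-square error in their usual definitions; the function names and the four sample values are hypothetical, while the sample counts are taken from Table 1.

```python
def rmse(y_true, y_pred):
    """Root-mean-square error, in the same unit as the data (ppm here)."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted concentrations (ppm), for illustration only.
y_true = [400.0, 402.0, 404.0, 406.0]
y_pred = [401.0, 402.0, 403.0, 406.0]
print(round(r2_score(y_true, y_pred), 4), round(rmse(y_true, y_pred), 4))

# Sample counts from Table 1: the split is approximately 70% / 30%.
train_n, test_n = 3153, 1351
train_share = train_n / (train_n + test_n)
print(round(train_share, 4))
```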
[0083] By using the global sensitivity analysis method and the constructed carbon dioxide simulation model, quantitative evaluation is performed for the sensitivities of the influence factors to obtain the results as shown in Table 2.
TABLE 2 First-order sensitivity index and total sensitivity index of each environmental factor estimated using the global sensitivity analysis method
Environmental factor | First-order sensitivity index | Total sensitivity index
Ground coverage type | 0.013060 | 0.015529
Vegetation coverage | 0.300257 | 0.320699
Climate type | 0.006008 | 0.007367
Precipitation | 0.291814 | 0.301615
Atmospheric temperature | 0.262991 | 0.277399
Wind velocity and direction | 0.713833 | 0.727576
Anthropogenic emission amount | 0.000197 | 0.000208
Biomass combustion emission | 0.000915 | 0.001157
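The ranking of the factors implied by Table 2 can be reproduced with a short sketch. The index values are copied from Table 2; normalizing the total sensitivity indices so that they sum to one is an assumption about how the ratios of the influence factors are formed, not a procedure stated in the text.

```python
# Total sensitivity indices copied from Table 2.
total_index = {
    "Ground coverage type": 0.015529,
    "Vegetation coverage": 0.320699,
    "Climate type": 0.007367,
    "Precipitation": 0.301615,
    "Atmospheric temperature": 0.277399,
    "Wind velocity and direction": 0.727576,
    "Anthropogenic emission amount": 0.000208,
    "Biomass combustion emission": 0.001157,
}

# Rank the factors from most to least influential.
ranking = sorted(total_index.items(), key=lambda item: item[1], reverse=True)

# Assumed normalization: share of each factor in the sum of all total indices.
index_sum = sum(total_index.values())
shares = {name: value / index_sum for name, value in total_index.items()}

for name, value in ranking:
    print(f"{name:32s} TS={value:.6f} share={shares[name]:.1%}")
```

Under this reading, wind velocity and direction rank first, followed by vegetation coverage, precipitation and atmospheric temperature, while the two emission factors rank last.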
[0084] To more visually display the sizes of the sensitivities of different environmental factors on the total carbon dioxide distribution, a sector graph of sensitivity indexes is drawn to determine ratios of the influence factors as shown in
[0085] As shown in
[0086] The model accuracy shows that it is feasible to simulate the regional carbon dioxide spatiotemporal distribution by using the model. The method provided by the disclosure can fill the gaps in satellite observation data by simulating the regional carbon dioxide concentration spatiotemporal distribution from the environmental factors. Further, a method of quantitatively evaluating the degree of influence of the environmental factors on the regional carbon dioxide distribution is proposed, so as to determine the magnitude of the influence of each environmental factor on the regional carbon dioxide distribution.
[0087] The specific embodiments described in the disclosure are merely illustrated based on the spirit of the disclosure. Those skilled in the art can make various changes or supplementations or similar replacements to the specific embodiments described herein without departing from the spirit of the disclosure or the scope defined by the appended claims.