Method for Predicting Benchmark Value of Unit Equipment Based on XGBoost Algorithm and System thereof
20230213895 · 2023-07-06
Inventors
- Yongkang Wang (Shanghai, CN)
- Gang Xu (Shanghai, CN)
- Ruijie Chen (Shanghai, CN)
- Chen Wang (Shanghai, CN)
- Qingping Li (Shanghai, CN)
- Bin Wu (Shanghai, CN)
- Yi Gong (Shanghai, CN)
CPC classification
G05B13/042
PHYSICS
Y04S10/50
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
Abstract
The invention relates to a method for predicting benchmark value of unit equipment based on XGBoost algorithm and a system thereof, wherein the method comprises the following steps: the historical operation data of unit equipment is obtained, the data is preprocessed, and a data set containing a plurality of samples is constructed, and each sample includes the benchmark value of a plurality of parameters of the equipment corresponding to a plurality of features; RF out-of-bag estimation is used for feature importance calculation to eliminate the features with low importance; the data is standardized to eliminate the dimensional effects among features; the data set is input to construct an XGBoost model, and Bayesian super parameter optimization is conducted to obtain the prediction model of benchmark values; and the real-time data of equipment operation is input, and the benchmark values of various equipment parameters are predicted by the prediction model of benchmark values. Compared with the prior art, the invention mines the correlation among data based on the XGBoost algorithm to predict a reasonable equipment benchmark value, and has the advantages of high generalization ability, high prediction accuracy and operation speed and great improvement of the automation ability of the unit.
Claims
1. A method for predicting benchmark value of unit equipment based on XGBoost algorithm is characterized by comprising the following steps: S1. The historical operation data of unit equipment is obtained, the data is preprocessed, and a data set containing a plurality of samples is constructed, and each sample includes the benchmark value of a plurality of parameters of the equipment corresponding to a plurality of features; S2. RF out-of-bag estimation is used for feature importance calculation to eliminate the features with low importance; S3. The data is standardized to eliminate the dimensional effects among features; S4. The data set is input to construct an XGBoost model, and Bayesian super parameter optimization is conducted to obtain the prediction model of benchmark values; S5. The real-time data of equipment operation is input, and the benchmark values of various equipment parameters are predicted by the prediction model of benchmark values.
2. The method for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 1 is characterized in that step S1 is as follows: S11. The historical operation data of the equipment is obtained from the plant level supervisory information system SIS of the unit; S12. The data is checked for blank values and outliers, and the data with blank values and outliers are eliminated; S13. Straightened line type data is filtered; S14. Data features are dimensionally reduced by PCA to obtain a data set containing multiple samples, and each sample contains multiple features.
3. The method for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 1 is characterized in that step S2 is as follows: For each feature of the sample, the random forest (RF) out-of-bag estimation is used to rank the importance of the features and select the features. The average accuracy decline rate (MDA) is used as an indicator to calculate the importance of the feature. The formula is as follows: MDA=(1/n)Σ.sub.t=1.sup.n(errOOB′.sub.t−errOOB.sub.t).
4. The method for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 1 is characterized in that in step S3, the data set contains N samples, each sample has L-type features, and the Z-score standardization method is used to standardize each type of features of each sample, as follows: x.sub.nl′=(x.sub.nl−μ.sub.l)/σ.sub.l, wherein μ.sub.l and σ.sub.l are the mean value and standard deviation of the type l features over the N samples.
5. The method for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 1 is characterized in that step S4 comprises the following steps: S41. The data set T containing N samples is input, T={(X.sub.1, Y.sub.1), (X.sub.2, Y.sub.2), . . . , (X.sub.N, Y.sub.N)}, each sample has L-type features X.sub.i=(x.sub.i1, x.sub.i2, . . . , x.sub.iL), corresponding to the benchmark values of M parameters of the equipment, Y.sub.i=(y.sub.i1, y.sub.i2, . . . , y.sub.iM); S42. The objective function of XGBoost model iteration is established: O.sup.(t)=Σ.sub.k=1.sup.K[G.sub.kω.sub.k+(1/2)(H.sub.k+β)ω.sub.k.sup.2]+αΣ.sub.k=1.sup.K|ω.sub.k|; S43. The adjustment range of the XGBoost model super parameters is set, and the Bayesian optimization algorithm is used to optimize the XGBoost super parameters to obtain the optimal combination of super parameters; S44. The optimal combination of super parameters is input into the XGBoost model, and the data set T is used for training according to the objective function O.sup.(t); S45. The optimal combination of the super parameters is recorded if the prediction performance of the XGBoost model obtained through training meets the preset accuracy threshold, so as to obtain the prediction model of benchmark values; otherwise, step S43 is executed to optimize the XGBoost super parameters again.
6. The method for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 5 is characterized in that in step S43, the XGBoost model super parameters include: Learning rate with the parameter adjustment range of [0.1, 0.15]; Maximum depth of the tree with the parameter adjustment range of (5, 30); Penalty term of complexity with the parameter adjustment range of (0, 30); Randomly selected sample proportion with the parameter adjustment range of (0, 1); Random sampling ratio of features with the parameter adjustment range of (0.2, 0.6); L2 norm regular term of weight with the parameter adjustment range of (0, 10); Number of decision trees with the parameter adjustment range of (500, 1000); Minimum leaf node weight sum with the parameter adjustment range of (0, 10).
7. The method for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 5 is characterized in that the prediction performance of the XGBoost model in step S45 includes the average absolute percentage error and the determination coefficient, and the calculation formulas are as follows: e.sub.MAPE=(100%/N)Σ.sub.i=1.sup.N|(Y.sub.i−Ŷ.sub.i)/Y.sub.i|; R.sup.2=1−Σ.sub.i=1.sup.N(Y.sub.i−Ŷ.sub.i).sup.2/Σ.sub.i=1.sup.N(Y.sub.i−Y̅).sup.2.
8. A system for predicting benchmark value of unit equipment based on XGBoost algorithm is characterized by being based on the method for predicting benchmark value of unit equipment based on XGBoost algorithm described in claim 1, and comprises the following: A data set construction module, which obtains the historical operation data of unit equipment, preprocesses the data, and constructs a data set containing a plurality of samples. Each sample includes a plurality of features corresponding to the benchmark values of a plurality of parameters of the equipment; A feature selection module, which uses RF out-of-bag estimation to calculate the feature importance of the data and eliminate the features with low importance; A standardization processing module, which standardizes the features of the samples in the data set to eliminate the dimensional impact among features; A model construction module, which inputs the data set, constructs the XGBoost model, and conducts Bayesian super parameter optimization to obtain the prediction model of benchmark values; A prediction module, which inputs the real-time data of equipment operation, and obtains the benchmark values of each parameter of the equipment through the prediction model of benchmark values.
9. The system for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 8 is characterized in that the feature selection module executes the following steps: For each feature of the sample, the random forest (RF) out-of-bag estimation is used to rank the importance of the features and select the features. The average accuracy decline rate (MDA) is used as an indicator to calculate the importance of the feature. The formula is as follows: MDA=(1/n)Σ.sub.t=1.sup.n(errOOB′.sub.t−errOOB.sub.t).
10. The system for predicting benchmark value of unit equipment based on XGBoost algorithm according to claim 8 is characterized in that the model construction module executes the following steps: Step 1. The data set T containing N samples is input, T={(X.sub.1, Y.sub.1), (X.sub.2, Y.sub.2), (X.sub.3, Y.sub.3), . . . , (X.sub.N, Y.sub.N)}, each sample has L-type features, X.sub.i=(x.sub.i1, x.sub.i2, . . . , x.sub.iL), corresponding to the benchmark values of M parameters of the equipment, Y.sub.i=(y.sub.i1, y.sub.i2, . . . , y.sub.iM); Step 2. The objective function of XGBoost model iteration is established: O.sup.(t)=Σ.sub.k=1.sup.K[G.sub.kω.sub.k+(1/2)(H.sub.k+β)ω.sub.k.sup.2]+αΣ.sub.k=1.sup.K|ω.sub.k|.
Description
FIGURES
DESCRIPTION OF PREFERRED EMBODIMENTS
[0064] The embodiment and specific operation process of the invention are described in detail below in combination with the drawing and specific embodiment. The embodiment is implemented on the premise of the technical solution of the invention, but the protection scope of the invention is not limited to the following embodiment.
[0065] In the drawing, the components with the same structure are represented by the same number, and the components with similar structures or functions are represented by similar numbers. The size and thickness of each component shown in the drawing are arbitrarily given, because the invention does not define the size and thickness of each component. In order to make the diagram clearer, some parts are enlarged appropriately in the drawing.
Embodiment 1
[0066] A method for predicting benchmark value of unit equipment based on XGBoost algorithm, as shown in the accompanying figure, comprises the following steps:
[0067] S1. The historical operation data of unit equipment is obtained, the data is preprocessed, and a data set containing a plurality of samples is constructed, and each sample includes the benchmark value of a plurality of parameters of the equipment corresponding to a plurality of features;
[0068] S2. RF out-of-bag estimation is used for feature importance calculation to eliminate the features with low importance;
[0069] S3. The data is standardized to eliminate the dimensional effects among features;
[0070] S4. The data set is input to construct an XGBoost model, and Bayesian super parameter optimization is conducted to obtain the prediction model of benchmark values;
[0071] S5. The real-time data of equipment operation is input, and the benchmark values of various equipment parameters are predicted by the prediction model of benchmark values.
[0072] The overall technical solution of the invention mainly includes data acquisition and preprocessing, with the following steps: the random forest (RF) out-of-bag estimation is used to rank the importance of the features, the data is standardized, the XGBoost model optimized by Bayesian parameter search is used for modeling, and the model is used for benchmark value prediction. A data interface developed in the Java language is used to collect historical data and to handle data communication between modules. The data comes from the plant-level SIS (supervisory information system) real-time database platform. The XGBoost package (current version 1.4.22), installed separately under Python, is used to implement the algorithm. The functions of each part are as follows:
[0073] Step S1 is as follows:
[0074] S11. The historical operation data of the equipment is obtained from the plant level SIS of the unit;
[0075] S12. The data is checked for vacant values and outliers, and the data with vacant values and outliers are eliminated;
[0076] S13. The straightened line type data is filtered;
[0077] S14. Data features are dimensionally reduced by PCA to obtain a data set containing multiple samples, and each sample contains multiple features.
[0078] Generally, the generator unit has a supervisory information system (SIS), which stores the historical data collected from the distributed control system (DCS) of the unit.
[0079] The applications deployed in power plants usually only read data from the SIS. A real-time database (also known as a temporal or time-series database) is the core technology of the SIS. In this solution, a server needs to be deployed, and the interface program of the SIS real-time database needs to be deployed on that server. The historical data is collected according to the above-mentioned measuring points and stored in an open-source temporal database deployed on the server.
[0080] The operation history data of the equipment shall cover at least one full year to ensure data completeness; excessively old data, however, has limited reference value. Data is filtered by time: based on the set time threshold, original data with a time span of less than one year is not extracted. On this basis, the null data is removed; null data generally occurs due to on-site sensor failure or abnormal data transmission. Further, the straightened line type data is filtered. The straightened line type abnormal data is defined as follows: if the value of the measuring point data within a certain time interval fluctuates only within a set threshold range (the threshold range is set according to the data type), the data in this time interval is straightened line type abnormal data. It shall be noted that such data can arise as follows: in some abnormal situations, such as failure of a field sensor, the transmitted data point is neither null nor an error value; instead, the sensor continuously retransmits the last normal measurement, which appears as a straight line in the trend chart and is one type of straightened line type abnormal data.
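The straightened-line filter described above can be sketched as follows; the window length and fluctuation threshold here are illustrative assumptions, since the description sets the threshold per data type:

```python
import numpy as np

def drop_flatline_windows(values, window=60, threshold=0.01):
    """Flag 'straightened line' data: windows in which the signal's
    peak-to-peak fluctuation stays below a set threshold, which can
    indicate a sensor stuck on its last normal reading.
    Returns a boolean mask of samples to keep."""
    values = np.asarray(values, dtype=float)
    keep = np.ones(len(values), dtype=bool)
    for start in range(0, len(values) - window + 1):
        win = values[start:start + window]
        if np.ptp(win) < threshold:  # peak-to-peak range of the window
            keep[start:start + window] = False
    return keep

# A noisy signal with a stuck-sensor (perfectly flat) segment in the middle
rng = np.random.default_rng(0)
sig = np.concatenate([rng.normal(0, 1, 100), np.full(80, 3.0), rng.normal(0, 1, 100)])
mask = drop_flatline_windows(sig, window=60, threshold=0.01)
print(mask.sum(), "of", len(sig), "samples kept")
```

Only windows lying entirely inside the flat segment fall below the threshold, so the noisy portions of the signal are retained.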
[0081] Then, principal component analysis (PCA) is used to reduce the dimensions of the filtered features. This function is implemented through the PCA class of the sklearn library in Python. The train_test_split function of the sklearn.model_selection module is called to divide the training set and the test set. During principal component analysis, the number of important features to be retained can be adjusted; this can be set according to the type of equipment, experience, etc., as relevant practitioners will understand.
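The PCA reduction and data-set split can be sketched as follows; the array shapes, the number of retained components, and the 80/20 split are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 12))  # 500 samples, 12 raw measuring-point features
Y = rng.normal(size=(500, 3))   # benchmark values of 3 equipment parameters

# Reduce the 12 raw features to a smaller number of principal components;
# the number retained (here 6) is set by equipment type and experience.
pca = PCA(n_components=6)
X_reduced = pca.fit_transform(X)

# Divide training and test sets with train_test_split, as in the description
X_train, X_test, Y_train, Y_test = train_test_split(
    X_reduced, Y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```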
[0082] In addition, at regular intervals, new data is read and added to the database on the server, data preprocessing is repeated, steps S1 to S4 are executed, and the benchmark value prediction model is updated.
[0083] Step S2 is as follows:
[0084] After historical data preprocessing, RF out-of-bag estimation is used to rank the importance of the main measuring points representing equipment operation features, such as unit load, current, etc. RF can be used to select features: in the process of randomly and repeatedly sampling from the original sample set for classifier training, about ⅓ of the sample data is not selected; these samples are called Out of Bag (OOB) data. The error rate of the OOB test is recorded as errOOB. The average error over all base learners is calculated, and the average accuracy decline rate (MDA) is used as the index to calculate the importance of features. The formula is as follows:
MDA=(1/n)Σ.sub.t=1.sup.n(errOOB′.sub.t−errOOB.sub.t)
[0085] Wherein, n is the number of base classifiers constructed by the random forest, errOOB.sub.t is the out-of-bag error of the t.sup.th base classifier, and errOOB′.sub.t is the out-of-bag error of the t.sup.th base classifier after noise is added. The larger the MDA, the higher the importance of the feature.
[0086] RF out-of-bag estimation is based on the random forest algorithm. In a random forest, multiple decision trees, namely base classifiers, are constructed; each decision tree can be understood as making decisions on the features. After noise is randomly added to a feature, if the out-of-bag accuracy drops greatly, that feature has a great impact on the classification results of the samples, that is, the importance of the feature is high. According to this idea, RF out-of-bag estimation can be used to rank the importance of the features of the samples in the data set and select the features with higher importance. The specific number of retained features is customized according to the equipment type and experience.
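The feature-ranking idea above can be sketched as follows. As an assumption for illustration, sklearn's permutation_importance on a fitted random forest stands in for the OOB-based MDA computation: it permutes one feature at a time and measures the resulting drop in score, the same idea applied to held-in data rather than out-of-bag samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
# Synthetic target depending strongly on features 0 and 1 only
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# oob_score=True keeps an out-of-bag estimate available on the fitted forest
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# Permute each feature and measure the score drop (the MDA idea)
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("importance ranking:", ranking)
```

Features whose permutation hurts accuracy the most rank first; low-ranked features would be eliminated.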
[0087] Step S3 is as follows:
[0088] The features remaining after preprocessing and feature selection usually have different dimensions and dimensional units, which affects the results of data analysis, so the data shall be standardized to eliminate the dimensional effects among features. The data set contains N samples, and each sample has L-type features. The Z-score standardization method is used to standardize each type of feature of each sample: the feature data is centered on its mean value and then scaled by its standard deviation. The processed data obeys the standard normal distribution, i.e. x′˜N(0, 1), as follows:
x.sub.nl′=(x.sub.nl−μ.sub.l)/σ.sub.l
[0089] Wherein x.sub.nl is the feature data of the type l features of the n.sup.th sample, x.sub.nl′ is the feature data of the type l features of the n.sup.th sample after standardization, μ.sub.l is the mean value of the feature data of the type l features over the N samples, and σ.sub.l is the standard deviation of the feature data of the type l features over the N samples. Python's numpy library can be used in this step to standardize the data.
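A minimal Z-score sketch in numpy; the two synthetic features with very different scales are an illustrative assumption:

```python
import numpy as np

def z_score(X):
    """Z-score standardization: center each feature on its mean,
    then scale by its standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

rng = np.random.default_rng(3)
# Two features with very different units/scales (e.g. current vs. load)
X = rng.normal(loc=[10.0, 500.0], scale=[2.0, 50.0], size=(1000, 2))
Xs = z_score(X)
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))
```

After standardization both features have zero mean and unit standard deviation, so neither dominates the model purely because of its units.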
[0090] Step S4 is as follows:
[0091] The principle of XGBoost algorithm is as follows:
[0092] The data set D={(x.sub.1, y.sub.1), (x.sub.2, y.sub.2), . . . , (x.sub.i, y.sub.i), . . . , (x.sub.n, y.sub.n)}, (x.sub.i∈R.sup.m, y.sub.i∈R) is given, where x.sub.i is the feature, which can be understood as an m-dimensional vector, and y.sub.i is the label corresponding to x.sub.i. For example, to predict whether a product will be purchased according to age, gender and income, x is (age, gender, income), and y is “Yes” or “No”. In this application, for the equipment in the unit, the data of different measuring points of the equipment, such as current, voltage, vibration, sound, load, etc., are acquired as the features, and the benchmark values of the main parameters of the equipment are taken as the labels. The input of the trained XGBoost model is the current, voltage, vibration, sound, load and other equipment operation data, and the output is the predicted benchmark value of each equipment parameter.
[0093] For the objective function of XGBoost:
O.sup.(t)=Σ.sub.i=1.sup.n l(y.sub.i, ŷ.sub.i.sup.(t))+Σ.sub.k=1.sup.tΩ(f.sub.k)
[0094] Wherein, y.sub.i is the actual value, i.e., the value in the training set; ŷ.sub.i.sup.(t) is the predicted value after the t.sup.th iteration for the i.sup.th sample; and Ω(f.sub.k) is the regularization term. The corresponding formulas of ŷ.sub.i.sup.(t) and Ω(f.sub.k) are as follows:
ŷ.sub.i.sup.(t)=ŷ.sub.i.sup.(t−1)+f.sub.t(x.sub.i)
Ω(f)=αΣ.sub.k=1.sup.K|ω.sub.k|+(β/2)Σ.sub.k=1.sup.Kω.sub.k.sup.2
[0095] Wherein K is the total number of leaf nodes in the decision tree; α and β are respectively the coefficients of L.sub.1 and L.sub.2 regular penalty items; and ω.sub.K is the output value of the k.sup.th leaf node of the decision tree.
[0096] ŷ.sub.i.sup.(t) and Ω(f.sub.k) are substituted into the objective function O.sup.(t), and the second-order Taylor formula is used to expand it, with the result as follows:
O.sup.(t)≈Σ.sub.i=1.sup.n[l(y.sub.i, ŷ.sub.i.sup.(t−1))+g.sub.if.sub.t(x.sub.i)+(1/2)h.sub.if.sub.t.sup.2(x.sub.i)]+Ω(f.sub.t)
Definition
[0097] g.sub.i and h.sub.i are defined as the first-order and second-order derivatives of the loss function with respect to the prediction of the previous iteration, i.e., g.sub.i=∂l(y.sub.i, ŷ.sub.i.sup.(t−1))/∂ŷ.sub.i.sup.(t−1) and h.sub.i=∂.sup.2l(y.sub.i, ŷ.sub.i.sup.(t−1))/∂(ŷ.sub.i.sup.(t−1)).sup.2.
[0098] After removing the constant terms and grouping by leaf node, the objective function obtained is as follows:
O.sup.(t)=Σ.sub.k=1.sup.K[G.sub.kω.sub.k+(1/2)(H.sub.k+β)ω.sub.k.sup.2]+αΣ.sub.k=1.sup.K|ω.sub.k|
[0099] To sum up, step S4 comprises the following steps:
[0100] Step 41. The data set T containing N samples is input,
[0101] T={(X.sub.1, Y.sub.1), (X.sub.2, Y.sub.2), (X.sub.3, Y.sub.3), . . . , (X.sub.N, Y.sub.N)}, each sample has L-type features, X.sub.i=(x.sub.i1, x.sub.i2, . . . , x.sub.iL), corresponding to the benchmark values of M parameters of the equipment, Y.sub.i=(y.sub.i1, y.sub.i2, . . . , y.sub.iM);
[0102] Step 42. The objective function of XGBoost model iteration is established:
O.sup.(t)=Σ.sub.k=1.sup.K[G.sub.kω.sub.k+(1/2)(H.sub.k+β)ω.sub.k.sup.2]+αΣ.sub.k=1.sup.K|ω.sub.k|
[0103] Wherein, G.sub.k=Σ.sub.i∈I.sub.kg.sub.i and H.sub.k=Σ.sub.i∈I.sub.kh.sub.i are the sums of the first-order and second-order gradient statistics of the samples assigned to the k.sup.th leaf node, and I.sub.k is the set of samples falling into the k.sup.th leaf node.
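Under the grouped objective above (and dropping the L1 term for brevity, an assumption on our part), setting the derivative with respect to each leaf output to zero yields the standard closed-form optimum for XGBoost:

```latex
O^{(t)} = \sum_{k=1}^{K}\Big[G_k\,\omega_k + \tfrac{1}{2}(H_k+\beta)\,\omega_k^{2}\Big],
\qquad G_k=\sum_{i\in I_k} g_i,\quad H_k=\sum_{i\in I_k} h_i

\frac{\partial O^{(t)}}{\partial \omega_k} = G_k + (H_k+\beta)\,\omega_k = 0
\;\Longrightarrow\;
\omega_k^{*} = -\frac{G_k}{H_k+\beta},
\qquad
O^{(t)*} = -\frac{1}{2}\sum_{k=1}^{K}\frac{G_k^{2}}{H_k+\beta}
```

The closed-form score O.sup.(t)* is what each candidate tree split is evaluated against during training.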
[0104] S43. The adjustment range of XGBoost model super parameters is set, and Bayesian optimization algorithm is used to optimize XGBoost super parameters to obtain the optimal combination of super parameters;
[0105] The XGBoost model super parameters selected for optimization include:
[0106] Learning rate with the parameter adjustment range of [0.1, 0.15];
[0107] Maximum depth of the tree with the parameter adjustment range of (5, 30);
[0108] Penalty term of complexity with the parameter adjustment range of (0, 30);
[0109] Randomly selected sample proportion with the parameter adjustment range of (0, 1);
[0110] Random sampling ratio of features with the parameter adjustment range of (0.2, 0.6);
[0111] L2 norm regular term of weight with the parameter adjustment range of (0, 10);
[0112] Number of decision trees with the parameter adjustment range of (500, 1000);
[0113] Minimum leaf node weight sum with the parameter adjustment range of (0, 10).
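The mechanics of the search over the ranges above can be sketched as follows. Two assumptions for illustration: a simple random sampler stands in for the Bayesian optimizer, and a synthetic quadratic "validation error" stands in for training and scoring the XGBoost model; in practice the BayesianOptimization class of the bayes_opt package and a real cross-validation score would take their places. The dictionary keys follow the XGBoost parameter names that we take these ranges to correspond to.

```python
import random

# Adjustment ranges from the description above, keyed by the XGBoost
# parameter names we assume they correspond to
PBOUNDS = {
    "learning_rate":    (0.1, 0.15),
    "max_depth":        (5, 30),
    "gamma":            (0, 30),    # penalty term of complexity
    "subsample":        (0, 1),     # randomly selected sample proportion
    "colsample_bytree": (0.2, 0.6), # random sampling ratio of features
    "reg_lambda":       (0, 10),    # L2 norm regular term of weight
    "n_estimators":     (500, 1000),
    "min_child_weight": (0, 10),    # minimum leaf node weight sum
}

def validation_error(params):
    # Synthetic stand-in for the model's validation error surface
    return (params["learning_rate"] - 0.12) ** 2 + (params["subsample"] - 0.8) ** 2

random.seed(0)
best_params, best_err = None, float("inf")
for _ in range(200):  # a Bayesian optimizer would choose these points adaptively
    params = {name: random.uniform(lo, hi) for name, (lo, hi) in PBOUNDS.items()}
    err = validation_error(params)
    if err < best_err:
        best_params, best_err = params, err

print("best error:", round(best_err, 4))
```

The combination with the lowest validation error within the set ranges is recorded as the optimal combination of super parameters.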
[0114] S44. The optimal combination of super parameters is input into the XGBoost model, and the data set T is used for training according to the objective function O.sup.(t);
[0115] S45. The optimal combination of the super parameters is recorded if the prediction performance of the XGBoost model obtained through training meets the preset accuracy threshold, so as to obtain the prediction model of benchmark values. Otherwise, step S43 is executed to optimize the XGBoost super parameters again.
[0116] In step S45, the average absolute percentage error and determination coefficient are used to assess the model performance, and the calculation formulas are as follows:
e.sub.MAPE=(100%/N)Σ.sub.i=1.sup.N|(Y.sub.i−Ŷ.sub.i)/Y.sub.i|
R.sup.2=1−Σ.sub.i=1.sup.N(Y.sub.i−Ŷ.sub.i).sup.2/Σ.sub.i=1.sup.N(Y.sub.i−Y̅).sup.2
[0117] Wherein, e.sub.MAPE is the average absolute percentage error, R.sup.2 is the determination coefficient, Y.sub.i is the benchmark value of the i.sup.th sample in the data set, Ŷ.sub.i is the benchmark value predicted by the XGBoost model according to the feature X.sub.i of the i.sup.th sample, and Y̅ is the average value of the benchmark values of the N samples in the data set.
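The two metrics can be computed directly in numpy; the sample values below are illustrative:

```python
import numpy as np

def mape(y_true, y_pred):
    """Average absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def r2(y_true, y_pred):
    """Determination coefficient R^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])
print(round(mape(y_true, y_pred), 3), round(r2(y_true, y_pred), 4))
```

A lower e.sub.MAPE and an R.sup.2 close to 1 would both be needed to meet the preset accuracy threshold of step S45.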
[0118] Python's Bayesian Optimization library can be used for Bayesian super parameter optimization: a penalty function is designed, and the global optimum of the penalty function over the combinations of super parameters is found as the optimal combination. Relevant practitioners can understand the specific content, which is not repeated here. In the iterative process of optimization and model training, the multi-output prediction problem of XGBoost is solved with the MultiOutputRegressor of the sklearn.multioutput module. Java programming is used to realize sample input and result output between Python and the temporal database. Model training, storage, prediction and scoring are completed by writing Python programs that call the XGBoost algorithm model together with sklearn, the Python machine learning library. After receiving random samples and prediction information, the XGBoost module calls the Python program for training and transmits the prediction results to the Java program to complete the prediction.
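The multi-output wrapping can be sketched as follows. As an assumption for illustration, sklearn's own GradientBoostingRegressor stands in for XGBRegressor (the wrapper usage is identical); the data shapes are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
# Three benchmark values per sample (multi-output prediction)
Y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2], X[:, 3] - X[:, 4]])

# A gradient-boosting regressor predicts one output; MultiOutputRegressor
# fits one copy per output column, yielding all benchmark values at once
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X, Y)
pred = model.predict(X[:5])
print(pred.shape)
```

Each fitted sub-model handles one equipment parameter's benchmark value, so a single predict call returns the full vector of benchmark values for each sample.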
[0119] Parameter adjustment in machine learning is a tedious but crucial task, which greatly affects the performance of the algorithm. Manual parameter adjustment is time-consuming and mainly based on experience and luck. Grid search and random search do not require manpower, but need a long run time. Through Bayesian super parameter optimization, the invention quickly determines the optimal super parameters of XGBoost model, speeding up model construction.
Embodiment 2
[0120] The invention also protects a system for predicting benchmark value of unit equipment based on XGBoost algorithm, which is based on the method for predicting benchmark value of unit equipment based on XGBoost algorithm described in embodiment 1 and comprises the following:
[0121] A data set construction module, which obtains the historical operation data of unit equipment, preprocesses the data, and constructs a data set containing a plurality of samples. Each sample includes a plurality of features corresponding to the benchmark values of a plurality of parameters of the equipment;
[0122] A feature selection module, which uses RF out-of-bag estimation to calculate the feature importance of the data and eliminate the features with low importance;
[0123] A standardization processing module, which standardizes the features of the samples in the data set to eliminate the dimensional impact among features;
[0124] A model construction module, which inputs the data set, constructs the XGBoost model, and conducts Bayesian super parameter optimization to obtain the benchmark value prediction model;
[0125] A prediction module, which inputs the real-time data of equipment operation, and obtains the benchmark values of each parameter of the equipment through the benchmark value prediction model.
[0126] The specific execution of each module is described in embodiment 1, which is not repeated here.
[0127] For the prediction of the benchmark value of unit equipment, in order to overcome the low efficiency and low prediction accuracy of the traditional manual modeling method of power plants, the invention adopts an efficient machine learning algorithm, XGBoost (extreme gradient boosting), with the following steps: the historical operation data of unit equipment is processed to obtain data reflecting healthy working conditions; RF out-of-bag estimation is used to rank the importance of relevant features, such as unit load, current, etc., which are the main measuring points of equipment operation; the data is standardized; the XGBoost model after Bayesian super parameter optimization is trained to obtain the prediction model of benchmark values; and the real-time data is input into the prediction model of benchmark values to get the required prediction of the benchmark value.
[0128] The preferred specific embodiments of the invention are described in detail above. It shall be understood that any ordinary technician in the art can make many modifications and changes according to the concept of the invention without any creative work. Therefore, any technical solution that can be obtained by any person skilled in the art according to the concept of the invention on the basis of the prior art through logical analysis, reasoning or limited experiments shall be within the scope of protection determined by the claims.