METHOD FOR DATA IMPUTATION AND CLASSIFICATION AND SYSTEM FOR DATA IMPUTATION AND CLASSIFICATION

Abstract

A method and a system for data imputation and classification are provided. The system includes a database, a historical sample imputation module and a current sample imputation and classification module. In the method, at first, an imputation calculation is performed on each of classified historical sample groups to obtain a basis matrix and a missing value corresponding to each of the classified historical sample groups. Thereafter, a sample classification stage is performed. In the sample classification stage, an IPP (Iterative Projection Pursuit) algorithm and an equation of nonlinear inequality constraints to calculate weighting vectors corresponding to a current sample. Thereafter, plural candidate samples corresponding to different classes are calculated in accordance with the basis matrix and the weighting vectors, and the sample class of the current sample and a prediction value for a missing value of the current sample are determined accordingly.

Claims

1. A method for data imputation and classification, the method comprising: performing a historical sample processing stage, wherein the historical sample processing stage comprises: providing a plurality of historical samples; classifying the historical samples into a plurality of classes to obtain a plurality of classified historical sample groups, wherein the classified historical sample groups correspond to the classes in a one-to-one manner, and each of the classified historical sample groups comprises a plurality of known historical values and at least one historical missing value; replacing the historical missing value with zero; and performing an imputation calculating step on each of the classified historical sample groups, wherein the imputation calculating step comprises: calculating a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups; calculating a predicted value of the at least one historical missing value of each of the classified historical sample groups in accordance with the base matrix and the weight matrix to which each of the classified historical sample groups corresponds; and performing a sample classification stage to classify a current sample into one of the classes, wherein the current sample comprises a plurality of known values and at least one missing value, and the sample classification stage comprises: calculating a plurality of weight vectors corresponded by the current sample by using an iterative projection pursuit (IPP) algorithm and a nonlinear inequality constraint, wherein the weight vectors correspond to the classes in a one-to-one manner, each of the weight vectors is limited by a weight parameter, and the weight parameter is calculated in accordance with the nonlinear inequality constraint; performing a candidate sample calculating step to calculate a plurality of candidate samples corresponding to the classes in accordance with the base matrix and the weight vector corresponding to the same class, wherein the candidate samples correspond to the classes in a one-to-one manner; calculating a difference between the current sample and each of the candidate samples to obtain a plurality of candidate sample differences; and determining a predicted value of the at least one missing value of the current sample and a class corresponded by the current sample in accordance with the candidate sample differences.

2. The method for data imputation and classification of claim 1, wherein the nonlinear inequality constraint is a quadratic inequality constraint.

3. The method for data imputation and classification of claim 1, wherein the step for calculating the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups is performed by using an alternating least squares (ALS) algorithm and class-dependent data imputation.

4. The method for data imputation and classification of claim 3, wherein the ALS algorithm is a ridge alternating least squares (RALS) algorithm.

5. The method for data imputation and classification of claim 1, wherein the candidate sample calculating step involves multiplying the base matrix by the weight vector to obtain each of the candidate samples.

6. A system for data imputation and classification comprising: a database, configured to store a plurality of classified historical sample groups, wherein the classified historical sample groups correspond to a plurality of classes in a one-to-one manner, and each of the classified historical sample groups comprises a plurality of known historical values and at least one historical missing value; an imputation calculating module for the historical samples, configured to: replace the historical missing value with zero; calculate a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups; and calculate a predicted value of the at least one historical missing value of each of the classified historical sample groups in accordance with the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups; and an imputation and classification module for a current sample, configured to receive the current sample provided from an external device, and configured to: calculate a plurality of weight vectors corresponded by the current sample by using an iterative projection pursuit (IPP) algorithm and a nonlinear inequality constraint, wherein the weight vectors correspond to the classes in a one-to-one manner, each of the weight vectors is limited by a weight parameter, and the weight parameter is calculated in accordance with the nonlinear inequality constraint; perform a candidate sample calculating step to calculate a plurality of candidate samples corresponding to the classes in accordance with the base matrix and the weight vector corresponding to the same class, wherein the candidate samples correspond to the classes in a one-to-one manner; calculate a difference between the current sample and each of the candidate samples to obtain a plurality of candidate sample differences; and determine a predicted value of the at least one missing value of the current sample and a class corresponded by the current sample in accordance with the candidate sample differences.

7. The system for data imputation and classification of claim 6, wherein the nonlinear inequality constraint is a quadratic inequality constraint.

8. The system for data imputation and classification of claim 6, wherein the step for calculating the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups is performed by using an alternating least squares (ALS) algorithm and class-dependent data imputation.

9. The system for data imputation and classification of claim 8, wherein the ALS algorithm is a ridge alternating least squares (RALS) algorithm.

10. The system for data imputation and classification of claim 6, the candidate sample calculating step involves multiplying the base matrix by the weight vector to obtain each of the candidate samples.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The invention can be more fully understood by reading the following detailed description of the embodiment, with reference made to the accompanying drawings as follows.

[0016] FIG. 1 is a schematic diagram showing a functional diagram of a system for data imputation and classification in accordance with embodiments of the present invention.

[0017] FIG. 2 is a flow chart showing the method for data imputation and classification in accordance with embodiments of the present invention.

[0018] FIG. 3 is a schematic diagram showing historical samples in accordance with an embodiment of the present invention.

[0019] FIG. 4 is a schematic diagram showing classified historical sample groups in accordance with an embodiment of the present invention.

[0020] FIG. 5 is a schematic diagram showing a base matrix and a weight matrix in accordance with an embodiment of the present invention.

[0021] FIG. 6 is a schematic diagram showing a predicted sample matrix in accordance with an embodiment of the present invention.

[0022] FIG. 7 is a schematic diagram showing a current sample in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

[0023] The using of first, second, third, etc. in the specification should be understood for identifying units or data described by the same terminology, but are not referred to particular order or sequence.

[0024] Referring to FIG. 1, FIG. 1 is a schematic diagram showing a functional diagram of a system 100 for data imputation and classification in accordance with embodiments of the present invention. The system 100 includes a database 110, an imputation calculating module 120 for historical samples, and an imputation and classification module 130 for the current sample. The database is configured to store historical samples. In an embodiment of the present invention, the historical samples are classified into plural classified historical sample groups, for example classified historical sample groups 112, 114 and 116. The classified historical sample groups 112, 114 and 116 correspond to plural classes in a one-to-one manner, and are stored in a form of subset of the database 110.

[0025] The imputation calculating module 120 is configured to calculate base matrixes and weight matrixes which are corresponded by the classified historical sample groups, thereby imputing missing values of the classified historical sample groups. The imputation calculating module 120 includes plural basis factor generation modules, for example basis factor generation modules 122, 124 and 126. The basis factor generation modules 122, 124 and 126 are configured to receive the classified historical sample groups 112, 114 and 116, and to calculate a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups 112, 114 and 116. Predicted values of the missing values of each of the classified historical sample group can be derived according to the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups.

[0026] The imputation and classification module 130 is configured to receive new data (also referred as to current sample hereinafter) provided from an external device 140, and to impute and classify the current sample to obtain predicted values of missing values of the current sample and to obtain the class corresponded by the current sample. The imputation and classification module 130 includes plural weight factor generation modules (for example, weight factor generation modules 132a, 134a and 136a), plural data reconstruction modules (for example, data reconstruction modules 132b, 134b and 136b) and a determination module 138. The weight factor generation modules 132a, 134a and 136a are configured to generate weight factors of the current sample corresponding to respective classes. The data reconstruction modules 132b, 134b and 136b are configured to generate plural candidate samples of the current sample corresponding to respective classes. The determination module 138 is configured to determine predicted values of the missing values of the current sample and a class corresponded by the current sample in accordance with the candidate samples. In the following embodiments, algorithms used by the imputation calculating module 120 and the imputation and classification module 130 are introduced.

[0027] At first, an M-by-N sample matrix X with missing values is provided, in which M signifies the number of dimensions (also referred to as a number of independent variables), and N denotes the number of observed samples. Thereafter, an objective function for matrix completion is provided. In this embodiment, a ridge alternating least squares (RALS) algorithm is used to obtain the objective function for matrix completion, but embodiments of the present invention are not limited thereto. In other embodiments of the present invention, other alternating least squares algorithms can be used to obtain the objective function.

[0028] When the Ridge ALS algorithm is used to obtain the objective function for matrix completion, a difference between the matrix X and a matrix formed by a matrix U and a matrix V for objective minimization can be expressed as:

min E.sub.rALS(U,V)=min{X custom-character +.sub.U+.sub.V(1)

[0029] Where U and V are respectively M-by-D and D-by-N unknown matrixes; D is the intermediate dimension; custom-character represents the Frobenius norm; .sub.U and .sub.V represent the ridge parameters for U and V, respectively. Ridge parameters are used to regularize and prevent U and V from overfitting. To find U and V, following equations (2) and (3) are used.

V=(U.sup.U+.sub.VI).sup.1U.sup.G(X)(2)

U.sup.=(VV.sup.+.sub.UI).sup.1VG(X).sup.(3)

[0030] Where is the transpose operator, and an element-wise mask G is imposed on X. If an element of X is missing, the missing element is temporally replaced by a value of zero. In addition, the above equations (2)-(3) are presented in matrixes for the purpose of convenience.

[0031] It is assumed that y is an N-by-1 vector containing the class labels (also referred to as categorical variables). The vector y corresponds to samples in the sample matrix X. Further, it is assumed that the number of the classes is L. Therefore, the sample matrix X can be divided into custom-character , and =1, . . . ,L. The size of is M, where N.sub.1+N.sub.2+ . . . +N.sub.L=N. In the embodiments of the present invention, to reflect characteristics of values corresponding to the classes, a class-dependent data imputation algorithm is used to find class-dependent matrix factors and custom-character , and then a refining step is performed. The class-dependent matrix factors and can be expressed as:

custom-character =(+I).sup.1G()(4)

custom-character =(+I).sup.1G().sup.T(5)

[0032] In the above equations, only corresponding custom-character is used to find and . Through the above step, the class-dependent matrix factors corresponding to each of the classes can be found.

[0033] Thereafter, it is assumed that t is the current sample provided by the external device, and t is an M-by-1 matrix with missing values. Regarding the current sample t, it is assumed that a D-by-1 weight vector custom-character can satisfy the following equation:

t custom-character (6)

[0034] The current sample t belongs to a vector space spanned by custom-character , i.e., span(). However, the formation of the weight vector has various possibilities. Therefore, embodiments of the present invention provide a weight factors formation technology for imputation based on quadratic inequality constraints which can limit the possibilities of the formation of the weight vector custom-character . The weight factors formation technology for imputation based on quadratic inequality constraints in the embodiments of the present invention uses the Ridge ALS algorithm with quadratic inequality constraints, but embodiments of the present invention are not limited thereto. In other embodiments, other alternating least squares algorithms with quadratic inequality constraints can be used to limit the weight vector custom-character .

[0035] Regarding the Ridge ALS algorithm with quadratic inequality constraints, the equation thereof can be expressed as:

[00001] $\begin{matrix} \min .Math. .Math. E_{tALS}^{t} () = \min .Math. {.Math. t - +, s . t . .Math. - .Math. .Math. & (7) \end{matrix}$

[0036] Where custom-character is a D-by-1 vector and the centroid of . In addition, is a predefined radius, and >0. Equation (7) can be generalized as:

[00002] $\begin{matrix} \min .Math. .Math. E_{t .Math. .Math. ALS}^{#} () = \min .Math. {+,_{t}^{2} & (8) \end{matrix}$

[0037] Where custom-character is a q-by-D Tikhonov matrix, is a p-by-D weight matrix, and is a p-by-1 shift vector (e.g., ). To solve equation (8), high-order generalized singular value decomposition (GSVD) is used. How the GSVD is used for , , and is introduced below. For simplicity, subscript is omitted herein. It is assumed that after the high-order GSVD is introduced. custom-character , , and can be expressed as:

[00003] $\begin{matrix} S_{U} = Q_{U}^{T} .Math. UR & (9) \\ S_{B} = Q_{B}^{T} .Math. BR & (10) \\ S_{} = Q_{}^{T} .Math. .Math. .Math. R & (11) \end{matrix}$

[0038] Where Q denotes a unitary matrix and R is a nonsingular matrix. In addition, the off-diagonal terms of S are zeros. It is assumed that , , represent diagonal terms for matrixes S.sub.U, S.sub.B, and S.sub., respectively. Then, matrixes S.sub.U, S.sub.B, and S.sub. can be expressed as:

[00004] $\begin{matrix} S_{U} = diag (_{1},_{2}, .Math. .Math.,_{D}) & (12) \\ S_{B} = diag (_{1},_{2}, .Math. .Math.,_{z}) & (13) \\ S_{} = diag (_{1},_{2}, .Math. .Math.,_{g}) & (14) \end{matrix}$

[0039] Meanwhile, z=min{p,D}, qD and DM. Based on equations (9)-(14), equation (7) is simplified as:

[00005] $\begin{matrix} \min .Math. .Math. E_{t .Math. .Math. ALS}^{#} (\tilde{v}) = \min .Math. {.Math. \overline{t} - + .Math. .Math. S_{} .Math.}, s . t . .Math. S_{B} .Math. \overline{v} -^{2} & (15) \end{matrix}$

[0040] Where {tilde over (t)}=Q.sub.U.sup.Tt, {tilde over (b)}=Q.sub.B.sup.Tb and {tilde over (v)}=R.sup.1v. By introducing a Lagrangian multiplier equation (15) can be modified as:

[00006] $\begin{matrix} \min .Math. .Math. .Math. (\tilde{v}) = \min .Math. {.Math. \tilde{t} - + .Math. .Math. S_{} .Math. + (.Math. S_{B} .Math. \tilde{v} - -^{2})} & (16) \end{matrix}$

[0041] By taking the derivative of custom-character ({tilde over (v)}) with respect to {tilde over (v)}, and by zeroing the result, following equation (17) is obtained:

[00007] $\begin{matrix} (S_{U}^{} .Math. S_{U} + .Math. .Math. S_{}^{} .Math. S_{} + .Math. .Math. S_{H}^{T} .Math. S_{H}) .Math. \tilde{v} = S_{U}^{T} .Math. \tilde{t} + .Math. .Math. S_{B}^{T} .Math. \overline{b} & (17) \end{matrix}$

[0042] Equation (17) can be converted into a function of , i.e., {tilde over (v)}(). It is assumed that r is the rank of the matrix B, and three cases of the function {tilde over (v)}() are discussed as follows after rearrangement of equation (17).

[00008] $\begin{matrix} Case .Math. .Math. 1 .Math. : .Math. .Math. when .Math. .Math. z = p q, \\ \tilde{v} () = {\begin{matrix} \frac{_{i} .Math. {\tilde{t}}_{i} + .Math. .Math._{i} .Math. {\tilde{b}}_{i}}{_{i}^{2} +_{i}^{2} +_{l}^{2}} & j = 1, .Math. .Math., z \\ \frac{_{i} .Math. {\overline{t}}_{i}}{_{i}^{2} +_{i}^{2}} & j = z + 1, .Math. .Math., q \\ \frac{{\tilde{t}}_{i}}{_{i}} & j = q + 1, .Math. .Math., D \end{matrix} & (18) \\ Case .Math. .Math. 2 .Math. : .Math. .Math. when .Math. .Math. z = p > q, \\ \tilde{v} () = {\begin{matrix} \frac{_{i} .Math. {\tilde{t}}_{i} + .Math. .Math._{i} .Math. {\tilde{b}}_{i}}{_{i}^{2} +_{i}^{2} +_{l}^{2}} & j = 1, .Math. .Math., q \\ \frac{_{i} .Math. {\overline{t}}_{i} +_{i} .Math. {\tilde{b}}_{i}}{_{i}^{2} +_{i}^{2}} & j = q + 1, .Math. .Math., z \\ \frac{{\tilde{t}}_{i}}{_{i}} & j = z + 1, .Math. .Math., D \end{matrix} & (19) \\ Case .Math. .Math. 3 .Math. : .Math. .Math. when .Math. .Math. z = D, \\ \tilde{v} () = {\begin{matrix} \frac{_{i} .Math. {\tilde{t}}_{i} + .Math. .Math._{i} .Math. {\tilde{b}}_{i}}{_{i}^{2} +_{i}^{2} +_{l}^{2}} & j = 1, .Math. .Math., q \\ \frac{_{i} .Math. {\overline{t}}_{i} +_{i} .Math. {\tilde{b}}_{i}}{_{i}^{2} +_{i}^{2}} & j = q + 1, .Math. .Math., z \end{matrix} & (20) \end{matrix}$

[0043] To minimize equation (16), S.sub.B{tilde over (v)} custom-character .sup.2 should be zero. After substitution of equations (18)-(20) into S.sub.B{tilde over (v)} respectively, a function () is obtained. The function () can be expressed as:

When r>q,

[00009] $\begin{matrix} () = {.Math.}_{j = 1}^{q} .Math. \frac{_{j} .Math._{j} .Math. {\tilde{t}}_{j} -_{j}^{2} .Math. {\tilde{b}}_{j} -_{j}^{2} .Math. {\tilde{b}}_{j}}{_{j}^{2} +_{j}^{2} +_{j}^{2}} + {.Math.}_{j = q + 1}^{r} .Math. \frac{_{j} .Math._{j} .Math. {\tilde{t}}_{j} -_{j}^{2} .Math. {\tilde{b}}_{j}}{_{j}^{2} +_{j}^{2}} + {.Math.}_{j = r + 1}^{p} .Math. {\tilde{b}}_{j} & (21) \end{matrix}$

Otherwise,

[0044] [00010] $\begin{matrix} () = {.Math.}_{j = 1}^{r} .Math. \frac{_{j} .Math._{j} .Math. \overline{t} -_{j}^{2} .Math. {\tilde{b}}_{j}}{_{j}^{2} +_{j}^{2}} + {.Math.}_{j = r + 1}^{} .Math. {\tilde{b}}_{j} & (22) \end{matrix}$

[0045] Thereafter, is calculated. It is assumed that () is equal to .sup.2, then is obtained. Then, {tilde over (v)} is calculated. Plugging the value of into equation (18), (19), or (20), {tilde over (v)} can be obtained. Thereafter, v is calculated. Plugging {tilde over (v)} into following equation (23):

v=R{tilde over (v)}(23)

Thus, v can be obtained.

[0046] Thereafter, how to use the weight vector v to perform the imputation is introduced below.

[0047] In the embodiments of the present invention, an iterative projection pursuit (IPP) algorithm with quadratic inequality constraints is used to perform the imputation. However, embodiments of the present invention are not limited thereto. In other embodiments of the present invention, other IPP algorithms with nonlinear inequality constraints can be used to perform the imputation.

[0048] In calculation of the imputation of this embodiment, at first, the above class custom-character is used to initialize the current sample t to replace missing values in the current sample t with zeros. Thereafter, a first step is performed to calculate in accordance with the above weight factors formation technology for imputation based on quadratic inequality constraints. In the calculation for custom-character , at first, [i] is plugged into equation (21) or (22) to calculate ()[i], and [i] is obtained, in which i represents the i-th iteration. Then, [i] is plugged into equations (18), (19) or (20) to obtain ()[i]. Thereafter, [i] is calculated, in which [i]=R.sub.l()[i].

[0049] Then, a second step is performed to calculate predicted values of the missing values in the current sample t, in which the calculation of the predicted values performs imputation by using the following equations:

custom-character [i]=[i](24)

custom-character [i+1]=t[i](25)

[0050] Where the operator in equation (25) means to replace the missing values of t with the imputed ones of {circumflex over (t)}.

[0051] The above first step and the second step are repeated until a root-mean-square error (RMSE) custom-character converges, in which the RMSE can be expressed as:

[00011] $\begin{matrix} {.Math.}_{} = \sqrt{({.Math.}_{m = 1}^{M} .Math. e_{, m}^{2} [i + 1]) / M} & (26) \end{matrix}$

where

custom-character [i+1]=G(t[i+1])(27)

[0052] Then, the smallest custom-character is selected to determine the class of the current sample t, in which an equation of the selection is expressed as:

[00012] $\begin{matrix} ^{*} = \underset{}{argmin} .Math. {.Math.}_{} & (28) \end{matrix}$

[0053] Where custom-character is the class of the current sample t.

[0054] Hereinafter, an embodiment is introduced for explaining a method 200 for data imputation and classification corresponded by the system 100.

[0055] Referring to FIG. 2, FIG. 2 is a flow chart showing the method 200 for data imputation and classification. The method 200 includes a historical sample processing stage 210 and a sample classification stage 220. The historical sample processing stage 210 is performed by using the above imputation calculating module 120 for historical samples, and the sample classification stage 220 is performed by using the above imputation and classification module 130 for the current sample.

[0056] In the historical sample processing stage 210, at first, step 211 is performed to provide plural historical samples, as shown in FIG. 3. In this embodiment, the historical samples include eight samples corresponding to weather information of seven days. Each of the historical samples includes five values: atmospheric pressure, humidity, temperature, wind force scale and rainfall. In the historical samples, the temperature of Tuesday, the humidity and wind force scale of Saturday and temperature of Sunday are missing values. In addition, the historical samples are classified into good weather and bad weather. However, in other embodiments of the present invention, if the historical samples are not classified, a classification module can be added to the imputation calculating module 120 to perform a classification step 212 on the historical samples.

[0057] Thereafter, step 213 is performed to replace the historical missing values with zeros, as shown in FIG. 4. In FIG. 4, the transpose operation is performed on the historical samples X to obtain historical samples X. The historical samples X can be divided into two classified historical sample groups X.sub.Good and X.sub.Bad in accordance with the classes of the historical samples X, in which the classified historical sample group X.sub.Good corresponds to a class of good weather, and the classified historical sample group X.sub.Bad corresponds to a class of bad weather.

[0058] Then, step 214 is performed to perform imputation based on each of the classified historical sample groups X.sub.Good and X.sub.Bad. In the embodiments of the present invention, step 214 is performed by using the above basis factor generation modules, for example the basis factor generation modules 122, 124 and 126. In step 214, at first, step 214a is performed to calculate a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups X.sub.Good and X.sub.Bad, as shown in FIG. 5. By using the equations (2)-(3), the classified historical sample group X.sub.Good can be decomposed into a base matrix U.sub.Good and a weight matrix V.sub.Good. Similarly, by using the equations (2)-(3), the classified historical sample group X.sub.Bad can be decomposed into a base matrix U.sub.Bad and a weight matrix V.sub.Bad. Thereafter, step 214b is performed to calculate a predicted value of the at least one historical missing value of each of the classified historical sample groups X.sub.Good and X.sub.Bad in accordance with the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups X.sub.Good and X.sub.Bad. As shown in FIG. 6, in step 214b, the base matrix U.sub.Good is multiplied by the weight matrix V.sub.Good to obtain a predicted sample matrix {circumflex over (X)}.sub.Good, and then the missing values of the classified historical sample groups X.sub.Good can be obtained through the predicted sample matrix {circumflex over (X)}.sub.Good. Similarly, the base matrix U.sub.Bad is multiplied by the weight matrix V.sub.Bad to obtain a predicted sample matrix {circumflex over (X)}.sub.Bad, and then the missing values of the classified historical sample groups X.sub.Bad can be obtained through the predicted sample matrix {circumflex over (X)}.sub.Bad. In an embodiment of the present invention, the transpose operation can be performed on the predicted sample matrixes {circumflex over (X)}.sub.Good and {circumflex over (X)}.sub.Bad to enable the numbers of columns and rows thereof to be the same as the numbers of columns and rows of the classified historical sample groups X.sub.Good and X.sub.Bad, and thus the predicted values of the missing values can be obtained by comparison.

[0059] In the sample classification stage 220, at first, step 221 is performed to calculate weight vectors which are corresponded by a current sample t by using the iterative projection pursuit (IPP) algorithm and a nonlinear inequality constraint. The weight vectors correspond to the above classes, for example good weather and bas weather in a one-to-one manner. In this embodiment of the present invention, the above iterative projection pursuit (IPP) algorithm with quadratic inequality constraints is used to calculate the weight vectors which is corresponded by the current sample t. As shown in FIG. 7, the current sample is weather information of one day, in which the miss value of humidity is replaced by zero. A parameter .sub.Good of the current sample t corresponding to good weather is calculated by using equations (21) and (22), and then a weight vector v.sub.Good of the current sample t corresponding to good weather is calculated by using equations (18), (19), (20), (23) and the parameter .sub.Good, in which the parameter .sub.Good is used to limit the content of the weight vector v.sub.Good. Similarly, a parameter .sub.Bad of the current sample t corresponding to bad weather is calculated by using equations (21) and (22), and then a weight vector v.sub.Bad of the current sample t corresponding to bad weather is calculated by using equations (18), (19), (20), (23) and the parameter .sub.Bad, in which the parameter .sub.Bad is used to limit the content of the weight vector v.sub.Bad.

[0060] In step 222, candidate samples corresponding to the classes are calculated in accordance with the base matrix and the weight vector corresponding to the same class. For example, the base matrix U.sub.Good and the weight vector v.sub.Good correspond to the class of good weather, and thus a candidate sample corresponding to good weather can be calculated in accordance with the base matrix U.sub.Good and the weight vector v.sub.Good. In this embodiment, the base matrix U.sub.Good is multiplied by the weight vector v.sub.Good (i.e., U.sub.Goodv.sub.Good) to obtain a candidate sample t.sub.Good corresponding to good weather. Similarly, the base matrix U.sub.Bad and the weight vector v.sub.Bad correspond to the class of bad weather, and thus a candidate sample corresponding to bad weather can be calculated in accordance with the base matrix U.sub.Bad and the weight vector v.sub.Bad. In this embodiment, the base matrix U.sub.Bad is multiplied by the weight vector v.sub.Bad (i.e., U.sub.Badv.sub.Bad) to obtain a candidate sample t.sub.Bad corresponding to bad weather.

[0061] In step 223, a difference between the current sample t and each of the candidate samples is calculated to obtain candidate sample differences. In this embodiment, differences between the known values of the current sample t and corresponding values of the candidate samples t.sub.Good and t.sub.Bad are calculated by using equation (26) to obtain a sample difference of good weather between the current sample t and the candidate sample t.sub.Good, and to obtain a sample difference of bad weather between the current sample t and the candidate sample t.sub.Bad. However, embodiments of the present invention are not limited thereto. In other embodiments of the present invention, other methods can be used to calculate the differences between the current sample and the candidate samples.

[0062] In step 224, a predicted value of the at least one missing value of the current sample t and a class corresponded by the current sample t are determined in accordance with the candidate sample differences. In this embodiment, the candidate sample with the smallest difference is determined as a correct sample in accordance with equation (28), and then the predicted values of the missing values in the current sample t and the class corresponded by the current sample t are determined in accordance with the correct sample. For example, when the candidate sample difference corresponded by the candidate sample t.sub.Good is smaller than the candidate sample difference corresponded by the candidate sample t.sub.Bad, the candidate sample t.sub.Good is determined as the correct sample. Thereafter, predicted values of the missing values in the current sample t can be obtained by comparing the candidate sample t.sub.Good with the current sample t. In addition, since the candidate sample t.sub.Good corresponds to the class of good weather, the current sample t is determined as good weather.

[0063] It can be understood from the above descriptions that the embodiments of the present invention performs imputation on samples with missing values, and differentiated and nonlinear imputation factors are used for samples corresponding to different classes, thereby obtaining imputed values closer to the true values from real statistical distributions. Therefore, the method 200 for data imputation and classification in the embodiments of the present invention is more precise.

[0064] Although the present invention has been described in considerable detail with reference to certain embodiments thereof, other embodiments are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the embodiments contained herein. It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention covers modifications and variations of this invention provided they fall within the scope of the following claims.

METHOD FOR DATA IMPUTATION AND CLASSIFICATION AND SYSTEM FOR DATA IMPUTATION AND CLASSIFICATION

Inventors

Cpc classification

Classification Explorer

G06N20/00

PHYSICS

Classification Explorer

G06N5/04

PHYSICS

Classification Explorer

G06F17/18

PHYSICS

Classification Explorer

G06F18/21326

PHYSICS

International classification

Classification Explorer

G06K9/62

PHYSICS

Classification Explorer

G06K9/68

PHYSICS

Abstract

Claims

Description