METHOD FOR DATA IMPUTATION AND CLASSIFICATION AND SYSTEM FOR DATA IMPUTATION AND CLASSIFICATION
20200193220 ยท 2020-06-18
Inventors
Cpc classification
G06F17/18
PHYSICS
International classification
Abstract
A method and a system for data imputation and classification are provided. The system includes a database, a historical sample imputation module and a current sample imputation and classification module. In the method, at first, an imputation calculation is performed on each of classified historical sample groups to obtain a basis matrix and a missing value corresponding to each of the classified historical sample groups. Thereafter, a sample classification stage is performed. In the sample classification stage, an IPP (Iterative Projection Pursuit) algorithm and an equation of nonlinear inequality constraints to calculate weighting vectors corresponding to a current sample. Thereafter, plural candidate samples corresponding to different classes are calculated in accordance with the basis matrix and the weighting vectors, and the sample class of the current sample and a prediction value for a missing value of the current sample are determined accordingly.
Claims
1. A method for data imputation and classification, the method comprising: performing a historical sample processing stage, wherein the historical sample processing stage comprises: providing a plurality of historical samples; classifying the historical samples into a plurality of classes to obtain a plurality of classified historical sample groups, wherein the classified historical sample groups correspond to the classes in a one-to-one manner, and each of the classified historical sample groups comprises a plurality of known historical values and at least one historical missing value; replacing the historical missing value with zero; and performing an imputation calculating step on each of the classified historical sample groups, wherein the imputation calculating step comprises: calculating a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups; calculating a predicted value of the at least one historical missing value of each of the classified historical sample groups in accordance with the base matrix and the weight matrix to which each of the classified historical sample groups corresponds; and performing a sample classification stage to classify a current sample into one of the classes, wherein the current sample comprises a plurality of known values and at least one missing value, and the sample classification stage comprises: calculating a plurality of weight vectors corresponded by the current sample by using an iterative projection pursuit (IPP) algorithm and a nonlinear inequality constraint, wherein the weight vectors correspond to the classes in a one-to-one manner, each of the weight vectors is limited by a weight parameter, and the weight parameter is calculated in accordance with the nonlinear inequality constraint; performing a candidate sample calculating step to calculate a plurality of candidate samples corresponding to the classes in accordance with the base matrix and the weight vector corresponding to the same class, wherein the candidate samples correspond to the classes in a one-to-one manner; calculating a difference between the current sample and each of the candidate samples to obtain a plurality of candidate sample differences; and determining a predicted value of the at least one missing value of the current sample and a class corresponded by the current sample in accordance with the candidate sample differences.
2. The method for data imputation and classification of claim 1, wherein the nonlinear inequality constraint is a quadratic inequality constraint.
3. The method for data imputation and classification of claim 1, wherein the step for calculating the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups is performed by using an alternating least squares (ALS) algorithm and class-dependent data imputation.
4. The method for data imputation and classification of claim 3, wherein the ALS algorithm is a ridge alternating least squares (RALS) algorithm.
5. The method for data imputation and classification of claim 1, wherein the candidate sample calculating step involves multiplying the base matrix by the weight vector to obtain each of the candidate samples.
6. A system for data imputation and classification comprising: a database, configured to store a plurality of classified historical sample groups, wherein the classified historical sample groups correspond to a plurality of classes in a one-to-one manner, and each of the classified historical sample groups comprises a plurality of known historical values and at least one historical missing value; an imputation calculating module for the historical samples, configured to: replace the historical missing value with zero; calculate a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups; and calculate a predicted value of the at least one historical missing value of each of the classified historical sample groups in accordance with the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups; and an imputation and classification module for a current sample, configured to receive the current sample provided from an external device, and configured to: calculate a plurality of weight vectors corresponded by the current sample by using an iterative projection pursuit (IPP) algorithm and a nonlinear inequality constraint, wherein the weight vectors correspond to the classes in a one-to-one manner, each of the weight vectors is limited by a weight parameter, and the weight parameter is calculated in accordance with the nonlinear inequality constraint; perform a candidate sample calculating step to calculate a plurality of candidate samples corresponding to the classes in accordance with the base matrix and the weight vector corresponding to the same class, wherein the candidate samples correspond to the classes in a one-to-one manner; calculate a difference between the current sample and each of the candidate samples to obtain a plurality of candidate sample differences; and determine a predicted value of the at least one missing value of the current sample and a class corresponded by the current sample in accordance with the candidate sample differences.
7. The system for data imputation and classification of claim 6, wherein the nonlinear inequality constraint is a quadratic inequality constraint.
8. The system for data imputation and classification of claim 6, wherein the step for calculating the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups is performed by using an alternating least squares (ALS) algorithm and class-dependent data imputation.
9. The system for data imputation and classification of claim 8, wherein the ALS algorithm is a ridge alternating least squares (RALS) algorithm.
10. The system for data imputation and classification of claim 6, the candidate sample calculating step involves multiplying the base matrix by the weight vector to obtain each of the candidate samples.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The invention can be more fully understood by reading the following detailed description of the embodiment, with reference made to the accompanying drawings as follows.
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
DETAILED DESCRIPTION
[0023] The using of first, second, third, etc. in the specification should be understood for identifying units or data described by the same terminology, but are not referred to particular order or sequence.
[0024] Referring to
[0025] The imputation calculating module 120 is configured to calculate base matrixes and weight matrixes which are corresponded by the classified historical sample groups, thereby imputing missing values of the classified historical sample groups. The imputation calculating module 120 includes plural basis factor generation modules, for example basis factor generation modules 122, 124 and 126. The basis factor generation modules 122, 124 and 126 are configured to receive the classified historical sample groups 112, 114 and 116, and to calculate a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups 112, 114 and 116. Predicted values of the missing values of each of the classified historical sample group can be derived according to the base matrix and the weight matrix which are corresponded by each of the classified historical sample groups.
[0026] The imputation and classification module 130 is configured to receive new data (also referred as to current sample hereinafter) provided from an external device 140, and to impute and classify the current sample to obtain predicted values of missing values of the current sample and to obtain the class corresponded by the current sample. The imputation and classification module 130 includes plural weight factor generation modules (for example, weight factor generation modules 132a, 134a and 136a), plural data reconstruction modules (for example, data reconstruction modules 132b, 134b and 136b) and a determination module 138. The weight factor generation modules 132a, 134a and 136a are configured to generate weight factors of the current sample corresponding to respective classes. The data reconstruction modules 132b, 134b and 136b are configured to generate plural candidate samples of the current sample corresponding to respective classes. The determination module 138 is configured to determine predicted values of the missing values of the current sample and a class corresponded by the current sample in accordance with the candidate samples. In the following embodiments, algorithms used by the imputation calculating module 120 and the imputation and classification module 130 are introduced.
[0027] At first, an M-by-N sample matrix X with missing values is provided, in which M signifies the number of dimensions (also referred to as a number of independent variables), and N denotes the number of observed samples. Thereafter, an objective function for matrix completion is provided. In this embodiment, a ridge alternating least squares (RALS) algorithm is used to obtain the objective function for matrix completion, but embodiments of the present invention are not limited thereto. In other embodiments of the present invention, other alternating least squares algorithms can be used to obtain the objective function.
[0028] When the Ridge ALS algorithm is used to obtain the objective function for matrix completion, a difference between the matrix X and a matrix formed by a matrix U and a matrix V for objective minimization can be expressed as:
min E.sub.rALS(U,V)=min{X+.sub.U
+.sub.V
(1)
[0029] Where U and V are respectively M-by-D and D-by-N unknown matrixes; D is the intermediate dimension; represents the Frobenius norm; .sub.U and .sub.V represent the ridge parameters for U and V, respectively. Ridge parameters are used to regularize and prevent U and V from overfitting. To find U and V, following equations (2) and (3) are used.
V=(U.sup.U+.sub.VI).sup.1U.sup.G(X)(2)
U.sup.=(VV.sup.+.sub.UI).sup.1VG(X).sup.(3)
[0030] Where is the transpose operator, and an element-wise mask G is imposed on X. If an element of X is missing, the missing element is temporally replaced by a value of zero. In addition, the above equations (2)-(3) are presented in matrixes for the purpose of convenience.
[0031] It is assumed that y is an N-by-1 vector containing the class labels (also referred to as categorical variables). The vector y corresponds to samples in the sample matrix X. Further, it is assumed that the number of the classes is L. Therefore, the sample matrix X can be divided into , and
=1, . . . ,L. The size of
is M
, where N.sub.1+N.sub.2+ . . . +N.sub.L=N. In the embodiments of the present invention, to reflect characteristics of values corresponding to the classes, a class-dependent data imputation algorithm is used to find class-dependent matrix factors
and
, and then a refining step is performed. The class-dependent matrix factors
and
can be expressed as:
=(
+
I).sup.1
G(
)(4)
=(
+
I).sup.1
G(
).sup.T(5)
[0032] In the above equations, only corresponding is used to find
and
. Through the above step, the class-dependent matrix factors corresponding to each of the classes can be found.
[0033] Thereafter, it is assumed that t is the current sample provided by the external device, and t is an M-by-1 matrix with missing values. Regarding the current sample t, it is assumed that a D-by-1 weight vector can satisfy the following equation:
t(6)
[0034] The current sample t belongs to a vector space spanned by , i.e., span(
). However, the formation of the weight vector
has various possibilities. Therefore, embodiments of the present invention provide a weight factors formation technology for imputation based on quadratic inequality constraints which can limit the possibilities of the formation of the weight vector
. The weight factors formation technology for imputation based on quadratic inequality constraints in the embodiments of the present invention uses the Ridge ALS algorithm with quadratic inequality constraints, but embodiments of the present invention are not limited thereto. In other embodiments, other alternating least squares algorithms with quadratic inequality constraints can be used to limit the weight vector
.
[0035] Regarding the Ridge ALS algorithm with quadratic inequality constraints, the equation thereof can be expressed as:
[0036] Where is a D-by-1 vector and the centroid of
. In addition,
is a predefined radius, and
>0. Equation (7) can be generalized as:
[0037] Where is a q-by-D Tikhonov matrix,
is a p-by-D weight matrix, and
is a p-by-1 shift vector (e.g.,
). To solve equation (8), high-order generalized singular value decomposition (GSVD) is used. How the GSVD is used for
,
, and
is introduced below. For simplicity, subscript
is omitted herein. It is assumed that after the high-order GSVD is introduced.
,
, and
can be expressed as:
[0038] Where Q denotes a unitary matrix and R is a nonsingular matrix. In addition, the off-diagonal terms of S are zeros. It is assumed that , , represent diagonal terms for matrixes S.sub.U, S.sub.B, and S.sub., respectively. Then, matrixes S.sub.U, S.sub.B, and S.sub. can be expressed as:
[0039] Meanwhile, z=min{p,D}, qD and DM. Based on equations (9)-(14), equation (7) is simplified as:
[0040] Where {tilde over (t)}=Q.sub.U.sup.Tt, {tilde over (b)}=Q.sub.B.sup.Tb and {tilde over (v)}=R.sup.1v. By introducing a Lagrangian multiplier equation (15) can be modified as:
[0041] By taking the derivative of ({tilde over (v)}) with respect to {tilde over (v)}, and by zeroing the result, following equation (17) is obtained:
[0042] Equation (17) can be converted into a function of , i.e., {tilde over (v)}(). It is assumed that r is the rank of the matrix B, and three cases of the function {tilde over (v)}() are discussed as follows after rearrangement of equation (17).
[0043] To minimize equation (16), S.sub.B{tilde over (v)}.sup.2 should be zero. After substitution of equations (18)-(20) into S.sub.B{tilde over (v)}
respectively, a function () is obtained. The function () can be expressed as:
When r>q,
Otherwise,
[0044]
[0045] Thereafter, is calculated. It is assumed that () is equal to .sup.2, then is obtained. Then, {tilde over (v)} is calculated. Plugging the value of into equation (18), (19), or (20), {tilde over (v)} can be obtained. Thereafter, v is calculated. Plugging {tilde over (v)} into following equation (23):
v=R{tilde over (v)}(23)
Thus, v can be obtained.
[0046] Thereafter, how to use the weight vector v to perform the imputation is introduced below.
[0047] In the embodiments of the present invention, an iterative projection pursuit (IPP) algorithm with quadratic inequality constraints is used to perform the imputation. However, embodiments of the present invention are not limited thereto. In other embodiments of the present invention, other IPP algorithms with nonlinear inequality constraints can be used to perform the imputation.
[0048] In calculation of the imputation of this embodiment, at first, the above class is used to initialize the current sample t to replace missing values in the current sample t with zeros. Thereafter, a first step is performed to calculate
in accordance with the above weight factors formation technology for imputation based on quadratic inequality constraints. In the calculation for
, at first,
[i] is plugged into equation (21) or (22) to calculate (
)[i], and
[i] is obtained, in which i represents the i-th iteration. Then,
[i] is plugged into equations (18), (19) or (20) to obtain
(
)[i]. Thereafter,
[i] is calculated, in which
[i]=R.sub.l
(
)[i].
[0049] Then, a second step is performed to calculate predicted values of the missing values in the current sample t, in which the calculation of the predicted values performs imputation by using the following equations:
[i]=
[i](24)
[i+1]=t
[i](25)
[0050] Where the operator in equation (25) means to replace the missing values of t with the imputed ones of {circumflex over (t)}.
[0051] The above first step and the second step are repeated until a root-mean-square error (RMSE) converges, in which the RMSE
can be expressed as:
where
[i+1]=G(t
[i+1])(27)
[0052] Then, the smallest is selected to determine the class of the current sample t, in which an equation of the selection is expressed as:
[0053] Where is the class of the current sample t.
[0054] Hereinafter, an embodiment is introduced for explaining a method 200 for data imputation and classification corresponded by the system 100.
[0055] Referring to
[0056] In the historical sample processing stage 210, at first, step 211 is performed to provide plural historical samples, as shown in
[0057] Thereafter, step 213 is performed to replace the historical missing values with zeros, as shown in
[0058] Then, step 214 is performed to perform imputation based on each of the classified historical sample groups X.sub.Good and X.sub.Bad. In the embodiments of the present invention, step 214 is performed by using the above basis factor generation modules, for example the basis factor generation modules 122, 124 and 126. In step 214, at first, step 214a is performed to calculate a base matrix and a weight matrix which are corresponded by each of the classified historical sample groups X.sub.Good and X.sub.Bad, as shown in
[0059] In the sample classification stage 220, at first, step 221 is performed to calculate weight vectors which are corresponded by a current sample t by using the iterative projection pursuit (IPP) algorithm and a nonlinear inequality constraint. The weight vectors correspond to the above classes, for example good weather and bas weather in a one-to-one manner. In this embodiment of the present invention, the above iterative projection pursuit (IPP) algorithm with quadratic inequality constraints is used to calculate the weight vectors which is corresponded by the current sample t. As shown in
[0060] In step 222, candidate samples corresponding to the classes are calculated in accordance with the base matrix and the weight vector corresponding to the same class. For example, the base matrix U.sub.Good and the weight vector v.sub.Good correspond to the class of good weather, and thus a candidate sample corresponding to good weather can be calculated in accordance with the base matrix U.sub.Good and the weight vector v.sub.Good. In this embodiment, the base matrix U.sub.Good is multiplied by the weight vector v.sub.Good (i.e., U.sub.Goodv.sub.Good) to obtain a candidate sample t.sub.Good corresponding to good weather. Similarly, the base matrix U.sub.Bad and the weight vector v.sub.Bad correspond to the class of bad weather, and thus a candidate sample corresponding to bad weather can be calculated in accordance with the base matrix U.sub.Bad and the weight vector v.sub.Bad. In this embodiment, the base matrix U.sub.Bad is multiplied by the weight vector v.sub.Bad (i.e., U.sub.Badv.sub.Bad) to obtain a candidate sample t.sub.Bad corresponding to bad weather.
[0061] In step 223, a difference between the current sample t and each of the candidate samples is calculated to obtain candidate sample differences. In this embodiment, differences between the known values of the current sample t and corresponding values of the candidate samples t.sub.Good and t.sub.Bad are calculated by using equation (26) to obtain a sample difference of good weather between the current sample t and the candidate sample t.sub.Good, and to obtain a sample difference of bad weather between the current sample t and the candidate sample t.sub.Bad. However, embodiments of the present invention are not limited thereto. In other embodiments of the present invention, other methods can be used to calculate the differences between the current sample and the candidate samples.
[0062] In step 224, a predicted value of the at least one missing value of the current sample t and a class corresponded by the current sample t are determined in accordance with the candidate sample differences. In this embodiment, the candidate sample with the smallest difference is determined as a correct sample in accordance with equation (28), and then the predicted values of the missing values in the current sample t and the class corresponded by the current sample t are determined in accordance with the correct sample. For example, when the candidate sample difference corresponded by the candidate sample t.sub.Good is smaller than the candidate sample difference corresponded by the candidate sample t.sub.Bad, the candidate sample t.sub.Good is determined as the correct sample. Thereafter, predicted values of the missing values in the current sample t can be obtained by comparing the candidate sample t.sub.Good with the current sample t. In addition, since the candidate sample t.sub.Good corresponds to the class of good weather, the current sample t is determined as good weather.
[0063] It can be understood from the above descriptions that the embodiments of the present invention performs imputation on samples with missing values, and differentiated and nonlinear imputation factors are used for samples corresponding to different classes, thereby obtaining imputed values closer to the true values from real statistical distributions. Therefore, the method 200 for data imputation and classification in the embodiments of the present invention is more precise.
[0064] Although the present invention has been described in considerable detail with reference to certain embodiments thereof, other embodiments are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the embodiments contained herein. It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention covers modifications and variations of this invention provided they fall within the scope of the following claims.