Clustering device, method and program

11520837 · 2022-12-06

Assignee

Inventors

Cpc classification

International classification

Abstract

Clustering can be performed using a self-expression matrix in which noise is suppressed. A self-expression matrix is calculated that minimizes an objective function that is for obtaining, from among matrices included in a predetermined matrix set, a self-expression matrix whose elements are linear weights when data points in a data set are expressed by linear combinations of points, the objective function being represented by a term for obtaining the residual between data points in the data set and data points expressed by linear combinations of points using the self-expression matrix, a first regularization term that is multiplied by a predetermined weight and is for reducing linear weights of the data points that have a large Euclidean norm in the self-expression matrix, and a second regularization term for the self-expression matrix. A similarity matrix defined by the calculated self-expression matrix is then calculated. Then a clustering result is obtained by clustering the data set based on the similarity matrix.

Claims

1. A clustering device comprising a processor configured to execute a method comprising: generating a self-expression matrix that minimizes an objective function that is for obtaining, from among matrices included in a predetermined matrix set, a self-expression matrix whose elements are linear weights when data points in a data set are expressed by linear combinations of points, the objective function being represented by at least: a term for obtaining a residual between data points in the data set and data points expressed by linear combinations of points using the self-expression matrix, a first regularization term that is multiplied by a predetermined weight and is for reducing linear weights of the data points that have a large Euclidean norm in the self-expression matrix, and a second regularization term for the self-expression matrix; generating a similarity matrix defined by the generated self-expression matrix; and generating a clustering result by clustering the data set based on the similarity matrix.

2. The clustering device according to claim 1, wherein the first regularization term is expressed using a result of: obtaining a matrix by calculating a product of a transposed matrix of the data set and a matrix of the data set, obtaining a vector by applying an operator for extracting diagonal elements of a matrix to a vector to the obtained matrix, and applying an operator that returns a diagonal matrix having a vector as diagonal elements to the obtained vector.

3. The clustering device according to claim 1, the processor further configured to execute a method comprising: generating the self-expression matrix that minimizes the objective function of Formula (1) below, from a predetermined matrix set C,
[Formula 1]
Z=argmin.sub.Z∈Ch(X−XZ)+λr(Z)  (1) where X is a data matrix whose elements are data points in the data set, Z is the self-expression matrix, β is the predetermined weight applied to the first regularization term, r(Z) is the second regularization term, λ is a weight applied to the second regularization term, diag(x) is an operator that returns a diagonal matrix having a vector x as diagonal elements, Diag(x) is an operator that extracts diagonal elements of a matrix x to a vector, C is R.sup.N×N or {Z|ZεR.sup.N×N, Z.sub.ii=0}, h(*) is the q-th power of Lp norm, and p and q are positive numbers.

4. A clustering method, the method comprising: generating, by a self-expression matrix generator, a self-expression matrix that minimizes an objective function that is for obtaining, from among matrices included in a predetermined matrix set a self-expression matrix whose elements are linear weights when data points in a data set are expressed by linear combinations of points, the objective function being represented by at least: a term for obtaining a residual between data points in the data set and data points expressed by linear combinations of points using the self-expression matrix, a first regularization term that is multiplied by a predetermined weight and is for reducing linear weights of the data points that have a large Euclidean norm in the self-expression matrix, and a second regularization term for the self-expression matrix; generating, by a similarity generator, a similarity matrix defined by the self-expression matrix that was calculated; and generating, by a cluster generator, a clustering result by clustering the data set based on the similarity matrix.

5. The clustering method according to claim 4, wherein the first regularization term is expressed using a result of obtaining a matrix by calculating a product of a transposed matrix of the data set and a matrix of the data set, obtaining a vector by applying an operator for extracting diagonal elements of a matrix to a vector to the obtained matrix, and applying an operator that returns a diagonal matrix having a vector as diagonal elements to the obtained vector.

6. The clustering method according to claim 4, wherein the self-expression matrix generator generates the self-expression matrix that minimizes the objective function of Formula (3) below, from a predetermined matrix set C,
[Formula 2]
Z*=argmin.sub.Z∈C∥F∥.sup.2.sub.F+λr(Z), . . .  (2) where X is a data matrix whose elements are data points in the data set, Z is the self-expression matrix, β is the predetermined weight applied to the first regularization term, r(Z) is the second regularization term, λ is a weight applied to the second regularization term, diag(x) is an operator that returns a diagonal matrix having a vector x as diagonal elements, Diag(x) is an operator that extracts diagonal elements of a matrix x to a vector, C is R.sup.N×N or {Z|ZεR.sup.N×N, Z.sub.ii=0}, h(*) is the q-th power of Lp norm, and p and q are positive numbers.

7. A computer-readable non-transitory recording medium storing computer-executable program instructions that when executed by a processor causes a computer system to: generate, by a self-expression matrix generator, a self-expression matrix that minimizes an objective function that is for obtaining, from among matrices included in a predetermined matrix set, a self-expression matrix whose elements are linear weights when data points in a data set are expressed by linear combinations of points, the objective function being represented by at least: a term for obtaining a residual between data points in the data set and data points expressed by linear combinations of points using the self-expression matrix, a first regularization term that is multiplied by a predetermined weight and is for reducing linear weights of the data points that have a large Euclidean norm in the self-expression matrix, and a second regularization term for the self-expression matrix; generate, by a similarity generator, a similarity matrix defined by the self-expression matrix that was calculated; and generate, by a cluster generator, a clustering result by clustering the data set based on the similarity matrix.

8. The clustering device according to claim 2, the processor further configured to execute a method comprising: generating the self-expression matrix that minimizes the objective function of Formula (3) below, from a predetermined matrix set C,
[Formula 3]
Z*=argmin.sub.Z∈CE[∥F∥.sup.2.sub.F]+λr(Z), . . .  (3) where X is a data matrix whose elements are data points in the data set, Z is the self-expression matrix, β is the predetermined weight applied to the first regularization term, r(Z) is the second regularization term, λ is a weight applied to the second regularization term, diag(x) is an operator that returns a diagonal matrix having a vector x as diagonal elements, Diag(x) is an operator that extracts diagonal elements of a matrix x to a vector, C is R.sup.N×N or {Z|ZεR.sup.N×N, Z.sub.ii=0}, h(*) is the q-th power of Lp norm, and p and q are positive numbers.

9. The clustering method according to claim 5, wherein the self-expression matrix generator generates the self-expression matrix that minimizes the objective function of Formula (4) below, from a predetermined matrix set C,
[Formula 4]
Z*=argmin.sub.Z∈CE[∥F∥.sup.2.sub.F]+λr(Z), . . .  (4) where X is a data matrix whose elements are data points in the data set, Z is the self-expression matrix, β is the predetermined weight applied to the first regularization term, r(Z) is the second regularization term, λ is a weight applied to the second regularization term, diag(x) is an operator that returns a diagonal matrix having a vector x as diagonal elements, Diag(x) is an operator that extracts diagonal elements of a matrix x to a vector, C is R.sup.N×N or {Z|ZεR.sup.N×N, Z.sub.ii=0}, h(*) is the q-th power of Lp norm, and p and q are positive numbers.

10. The computer-readable non-transitory recording medium of claim 7, wherein the first regularization term is expressed using a result of: obtaining a matrix by calculating a product of a transposed matrix of the data set and a matrix of the data set, obtaining a vector by applying an operator for extracting diagonal elements of a matrix to a vector to the obtained matrix, and applying an operator that returns a diagonal matrix having a vector as diagonal elements to the obtained vector.

11. The computer-readable non-transitory recording medium of claim 7, wherein the self-expression matrix generator generates the self-expression matrix that minimizes the objective function of Formula (5) below, from a predetermined matrix set C,
[Formula 5]
Z*=argmin.sub.Z∈CE[∥F∥.sup.2.sub.F]+λr(Z), . . .  (5) where X is a data matrix whose elements are data points in the data set, Z is the self-expression matrix, β is the predetermined weight applied to the first regularization term, r(Z) is the second regularization term, λ is a weight applied to the second regularization term, diag(x) is an operator that returns a diagonal matrix having a vector x as diagonal elements, Diag(x) is an operator that extracts diagonal elements of a matrix x to a vector, C is R.sup.N×N or {Z|ZεR.sup.N×N, Z.sub.ii=0}, h(*) is the q-th power of Lp norm, and p and q are positive numbers.

12. The clustering device according to claim 1, the processor further configured to execute a method comprising: receiving the data points in the data set, wherein the data points are associated with digitized image data; and providing one or more classes of the digitized image data based on the clustered data points.

13. The clustering device according to claim 1, wherein the generating a cluster result further comprises generating the clustering result of the data points using spectral clustering.

14. The clustering device according to claim 1, wherein the second regularization term includes a Frobenium norm r(Z)=∥Z∥.sub.F.

15. The clustering method according to claim 5, the method further comprising: receiving the data points in the data set, wherein the data points are associated with digitized acoustic data; and providing one or more classes of the digitized acoustic data based on the clustered data points.

16. The clustering method according to claim 5, wherein the cluster generator generates the clustering result of the data points using spectral clustering.

17. The clustering method according to claim 5, wherein the second regularization term includes a Frobenium norm r(Z)=∥Z∥F.

18. The computer-readable non-transitory recording medium of claim 7, the computer-executable program instructions when executed further causing the computer system to: receive the data points in the data set, wherein the data points are associated with digitized voice data; and provide one or more classes of the digitized voice data based on the clustered data points.

19. The computer-readable non-transitory recording medium of claim 7, wherein the cluster generator generates the clustering result of the data points using spectral clustering.

20. The computer-readable non-transitory recording medium of claim 7, wherein the second regularization term includes a Frobenium norm r(Z)=∥Z∥.sub.F.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) FIG. 1 is a diagram showing an example of the clustering of a data set.

(2) FIG. 2 is a conceptual diagram showing an example where the similarity is higher when points belong to the same subspace.

(3) FIG. 3 is a conceptual diagram showing an example of the case of reconstructing a point by a linear combination of two points that belong to the same subspace.

(4) FIG. 4 is a diagram showing an example of the flow of the self-expression approach.

(5) FIG. 5 is a conceptual diagram of the case of using linear combination expression for a data set in a subspace.

(6) FIG. 6 is a block diagram showing a configuration of a clustering device according to an embodiment of the present invention.

(7) FIG. 7 is a flowchart showing a clustering processing routine in the clustering device according to the embodiment of the present invention.

DESCRIPTION OF EMBODIMENTS

(8) An embodiment of the present invention is described in detail below with reference to the drawings.

(9) <Overview of Embodiment of Present Invention>

(10) First, an overview of the embodiment of the present invention will be given.

(11) In order to solve the problems described above, consider a method of randomly sampling data from a data set, and then solving Formula (1) using the sampled data. This will be described based on the example in FIG. 5. In FIG. 5, (c) shows an example of the case where data was randomly sampled from the data in (a) of FIG. 5. If the number of very noisy data points is small compared to the non-noisy data, most of the noisy data can be ignored by performing random sampling, and it is possible to strengthen the tendency expressed by the data whose data points do not deviate much from the actual subspace. In FIG. 5, (d) shows an example that envisions the case where data points are represented by linear combination of points in the same subspace.

(12) First, before describing the specific configuration of the embodiment of the present invention, the background of the idea of the technique of the embodiments of the present invention will be described. Hereinafter, consider the use of h(X−XZ)=∥X−XZ∥.sup.2.sub.F. First, data points are selected according to a Bernoulli distribution with the probability α∈(0,1), and when Formula (1) is redefined using only the selected data, Formula (2) below is obtained.
[Formula 2]
Z*=argmin.sub.Z∈C∥F∥.sup.2.sub.F+λr(Z), where F=XP.sup.T−XP.sup.TPZP.sup.T,P.sup.˜p(P)  (2)

(13) Here, P˜p(P) is a selection matrix that selects data points with the probability α. If Formula (2) is used as it is, the portion of data corresponding to the data not selected by P is not selected. In view of this, consider taking an expected value for the F portion.
[Formula 3]
Z*=argmin.sub.Z∈CE[∥F∥.sup.2.sub.F]+λr(Z), where F=XP.sup.T−XP.sup.TPZP.sup.T,P˜p(P)  (3)

(14) Furthermore, E[∥F∥.sup.2.sub.F] can be transformed as follows.
[Formula 4]
E[∥F∥.sup.2.sub.F]=α∥X−XZ′∥.sup.2.sub.F+(1−α)∥Diag(diag(X.sup.TX).sup.0.5Z′∥.sup.2.sub.F+2(1−α)trace(X.sup.TX)(Z′−I)Diag(diag(Z′)))+(2α−1)(α−1)/α trace(X.sup.TXDiag(diag(Z′))2)  (4)

(15) Here, diag(x) is an operator that returns a diagonal matrix with vector x as diagonal elements, Diag(X) is an operator that extracts only the diagonal elements of the square matrix X to a vector, and trace(X) is a trace operator (i.e., the sum of the diagonal elements). Also, Z′=αZ.

(16) Furthermore, if C={Z|Z∈R.sup.N×N, Z.sub.ii=0} is selected as the matrix set C that is the condition of Z, then Formula (3) can be transformed as follows.
[Formula 5]
Z*=argmin.sub.Z∈Cα∥X−XZ′∥.sup.2.sub.F+(1−α)∥Diag(diag(X.sup.TX)).sup.0.5Z′∥.sup.2.sub.F+λr(Z′/α),  (5)

(17) A new regularization term ∥Diag(diag(X.sup.TX)).sup.0.5Z′∥.sup.2.sub.F can be naturally derived from Formula (5). The elements of Z can be weighted by introducing this term. The configuration of the embodiment of the present invention will be described below based on the derivation by the above-described formula transformation.

(18) <Configuration of Clustering Device According to Embodiment of Present Invention>

(19) The following describes the configuration of a clustering device according to an embodiment of the present invention. As shown in FIG. 6, a clustering device 100 according to this embodiment of the present invention can be configured by a computer including a CPU, a RAM, and a ROM storing a program for executing a clustering processing routine described below and various data. As shown in FIG. 6, as a function configuration, this clustering device 100 includes an input unit 10, an arithmetic operation unit 20, and an output unit 50.

(20) The input unit 10 accepts a data set as a processing target.

(21) The arithmetic operation unit 20 is configured by a self-expression matrix calculation unit 30, a similarity calculation unit 32, and a clustering unit 34.

(22) The self-expression matrix calculation unit 30 calculates, for the accepted data set, a self-expression matrix that minimizes the objective function of Formula (6) below, from the predetermined matrix set C.
[Formula 6]
Z=argmin.sub.Z∈Ch(X−XZ)+β∥Diag(diag(X.sup.TX)).sup.0.5Z∥.sup.2.sub.F+λr(Z)  (6)

(23) The objective function represented by Formula (6) is an objective function for obtaining, from the matrices included in the matrix set C, a self-expression matrix Z whose elements are linear weights when the data points in the data set are expressed by linear combinations of points. The objective function is represented by the following terms. h(X−XZ) is a term for obtaining the residual between data points in the data matrix X whose elements are the data points in the data set, and data points expressed by linear combinations of points using the self-expression matrix Z. Also, ∥Diag(diag(X.sup.TX)).sup.0.5Z∥.sup.2.sub.F is a first regularization term for reducing the linear weights of the data points that have a large Euclidean norm in the self-expression matrix Z. The first regularization term is multiplied by a predetermined weight β. r(Z) is a second regularization term for the self-expression matrix.

(24) Note that λ is a weight for the second regularization term, diag(x) is an operator that returns a diagonal matrix having the vector x as diagonal elements, and Diag(x) is an operator that extracts the diagonal elements of the matrix x to a vector. The first regularization term is expressed using the result of calculating a matrix by calculating the product of a transposed matrix of the data set and a matrix of the data set, calculating a vector by applying the operator Diag(x), which extracts the diagonal elements of a matrix to a vector, to the calculated matrix, and applying the operator diag(x), which returns a diagonal matrix having a vector as diagonal elements, to the calculated vector. C is R.sup.N×N or {Z|Z∈R.sup.N×N, Z.sub.ii=0}, and h(⋅) is the q-th power of the following Lp norm.
[Formula 7]
h(Λ)=(Σ.sub.iΣ.sub.j|Λ.sub.ij|.sup.p).sup.q/p

(25) p and q are a positive number.

(26) Various options are possible for r(Z) of the second regularization term. For example, it is conceivable to use nothing (i.e., r(Z)=0), the Frobenius norm r(Z)=∥Z∥.sub.F, the Frobenius norm squared r(Z)=∥Z∥.sup.2.sub.F, or the L1 norm r(Z)=∥Z∥.sub.1.

(27) The similarity calculation unit 32 calculates the similarity matrix M defined by the self-expression matrix Z calculated by the self-expression matrix calculation unit 30. For example, a method of defining M as M.sub.ij=|Z.sub.ij|+|Z.sub.ji| can be used.

(28) The clustering unit 34 obtains a clustering result by clustering the data set based on the similarity matrix M calculated by the similarity calculation unit 32. As the clustering method, for example, spectral clustering can be used, or a maximum cut method can be used.

(29) <Effects of Clustering Device According to Embodiment of Present Invention>

(30) The following describes effects of the clustering device 100 according to this embodiment of the present invention. When the input unit 10 accepts a data set that is a processing target, the clustering device 100 executes the clustering processing routine shown in FIG. 7.

(31) First, in step S100, for the received data set, a self-expression matrix that minimizes the objective function of Formula (6) is calculated from the predetermined matrix set C. The objective function represented by Formula (6) is an objective function for obtaining, from the matrices included in the matrix set C, a self-expression matrix Z whose elements are linear weights when the data points in the data set are expressed by linear combinations of points. The objective function is represented by the following terms. h(X−XZ) is a term for obtaining the residual between data points in the data matrix X whose elements are the data points in the data set, and data points expressed by linear combinations of points using the self-expression matrix Z. Also, ∥Diag(diag(X.sup.TX)).sup.0.5Z∥.sup.2.sub.F is the first regularization term for reducing the linear weights of the data points that have a large Euclidean norm in the self-expression matrix Z. The first regularization term is multiplied by the predetermined weight β. r(Z) is the second regularization term for the self-expression matrix.

(32) Next, in step S102, the similarity matrix M defined by the self-expression matrix Z calculated in step S100 is calculated.

(33) In step S104, the clustering result of clustering the data set based on the similarity matrix M calculated in step S102 is output to the output unit 50, and then processing is ended.

(34) As described above, according to the clustering device of this embodiment of the present invention, by obtaining a clustering result by clustering a data set based on a similarity matrix as described below, a data set can be clustered with use of a self-expression matrix in which the influence of noise is suppressed. Here, the self-expression matrix is a matrix whose elements are linear weights when the data points in the data set are expressed by linear combinations of points, from among the matrices included in the predetermined matrix set. The similarity matrix is defined by a self-expression matrix calculated by the procedure described below. The self-expression matrix is calculated so as to minimize the objective function of Formula (6) represented by the terms listed below. These terms include a term for obtaining the residual between data points in the data set and data points expressed by linear combinations of points using the self-expression matrix, a first regularization term that is multiplied by the predetermined weight β and is for reducing the linear weights of the data points that have a large Euclidean norm in the self-expression matrix, and a second regularization term for the self-expression matrix.

(35) [Verification of Experimental Examples]

(36) If the regularization term ∥Diag(diag(X.sup.TX)).sup.0.5Z∥.sup.2.sub.F in the proposed method of this embodiment is h(X−XZ)=∥X−XZ∥.sup.2.sub.F, it can be derived by applying random sampling. For this reason, this regularization term can be expected to have robustness with respect to noise. Also, when compared with the regularization term ∥Z∥.sup.2.sub.F used in existing research, the regularization term ∥Diag(diag(X.sup.TX)).sup.0.5Z∥.sup.2.sub.F is the result of weighting Z with Diag(diag(X.sup.TX)).sup.0.5, and as a result, it is possible to also expect an effect of preventing data points that have a large Euclidean norm from exerting an excessively large influence on the self-expression matrix.

(37) In addition, high speed calculation is possible in specific cases. In particular, if “h(X−XZ)=∥X−XZ∥.sup.2.sub.F, r (Z)=0, C=R.sup.N×N”, if “h(X−XZ)=∥X−XZ∥.sup.2.sub.F, r (Z)=0, C={Z|Z∈R.sup.N×N, Z.sub.ii=0}”, if “h(X−XZ)=∥X−XZ∥.sup.2.sub.F, r(Z)=∥Z∥.sup.2.sub.F, C=R.sup.N×N”, or if “h(X−XZ)=∥X−XZ∥.sup.2.sub.F, r(Z)=∥Z∥.sup.2.sub.F, C={Z|Z∈R.sup.N×N, Z.sub.ii=0}”, there is a closed solution in Z, and Z can be calculated at high speed.

(38) The following describes an experimental example that demonstrates that the technique of this embodiment has an experimentally excellent effect.

(39) The effectiveness of this technique can be demonstrated through a clustering experiment using COIL100 or MNIST-RB, which are real-world data sets that are often used in subspace clustering research. COIL100 is a 32×32 image dataset for 100 categories of 72 images each. MNIST-RB is a data set with noise randomly added to the background of MNIST, which is a 28×28 handwritten numeral data set of numerals from 0 to 9. For fairness, half of each of COIL100 and MNIST-RB was used for validation for parameter adjustment and the other half was used for testing. In COIL100, 10 classes were randomly selected from each of the validation and test halves 20 times to obtain 20 evaluation sets. In MNIST-RB, 100 data pieces were randomly selected from each of the validation and test halves 20 times to obtain 20 evaluation sets. Also, the accuracy described in the following formula was used as an evaluation index.
accuracy=(number of correctly clustered points/total number of points)

(40) Table 1 below shows the results when h(X−XZ)=∥X−XZ∥.sup.2.sub.F, C={Z|Z∈R.sup.N×N, Z.sub.ii=0}, r(Z)=0 when calculating the matrix Z. It can be seen that this technique has a higher accuracy than the conventional technique.

(41) TABLE-US-00001 TABLE 1 Accuracy (COIL100) Accuracy (MNIST-RB) Conventional technique 0.933 0.460 Present technique 0.916 0.507

(42) Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

REFERENCE SIGNS LIST

(43) 10 Input unit 20 Arithmetic operation unit 30 Self-expression matrix calculation unit 32 Similarity calculation unit 34 Clustering unit 50 Output unit 100 Clustering device