Point-set kernel clustering
11709917 · 2023-07-25
CPC classification
G06V10/762
PHYSICS
G06F18/2321
PHYSICS
G06V10/46
PHYSICS
G06V10/77
PHYSICS
International classification
G06F18/2321
PHYSICS
G06V10/77
PHYSICS
G06V10/762
PHYSICS
Abstract
A computer-implemented clustering method is disclosed for image segmentation, social network analysis, computational biology, market research, search engines and other applications. At the heart of the method is a point-set kernel that measures the similarity between a data point and a set of data points. The method has a procedure that employs the point-set kernel to expand from a seed point to a cluster, and finally identifies all clusters in the given dataset. Applying the method to image segmentation, it identifies several segments in the image, where points in each segment have high similarity, but points in one segment have low similarity with respect to other segments. The method is both effective and efficient, which enables it to deal with large scale datasets. In contrast, existing clustering methods are either efficient or effective; and even efficient ones have difficulty dealing with large scale datasets without massive parallelization.
Claims
1. A computer-implemented method of segmenting data, comprising steps of: a) receiving one image; and b) converting the image into a dataset of some descriptors, to form a dataset E; and c) converting each point in dataset E, using a feature map of Isolation Kernel, to a point in a dataset D; and d) using a point-set kernel to find the most similar point with respect to the dataset D and then using it as an initial seed for cluster G; and e) growing the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D with respect to the cluster G; and the cluster stops growing when all points in the dataset D, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) growing the next cluster using the remaining points in the dataset D, excluding the points in the cluster G, by restarting from step d), until D is empty or no point can be found which has similarity more than τ.
2. The method according to claim 1, wherein using a feature map of Isolation Kernel to convert each point of dataset E to a point in a dataset D comprises: using a random sample of ψ points from dataset E to produce a Voronoi diagram, where each Voronoi cell isolates one point from the rest of the points in the sample; t Voronoi diagrams are produced from dataset E and each point x in dataset E is converted using the t Voronoi diagrams to produce a feature vector Φ(x) of tψ binary attributes in dataset D: x → Φ(x).
3. The method according to claim 2, wherein using a point-set kernel to find the most similar point with respect to the dataset D comprises producing a kernel mean map Φ̂(D) from a set of points D via averaging their feature-mapped points Φ(x), and measuring the similarity between point x and set D using the point-set kernel K̂(x,D) = ⟨Φ(x), Φ̂(D)⟩.
4. The method according to claim 3, wherein the point-set kernel is configured to be used to describe a cluster in terms of similarity distribution.
5. The method according to claim 4, wherein the point-set kernel is represented as: K̂(x,G) = ⟨Φ(x), Φ̂(G)⟩, where Φ̂(G) = (1/|G|) Σ_{x∈G} Φ(x) and ⟨a,b⟩ denotes a dot product between two vectors a and b; as the point-set kernel is constructed from a dataset D, the point-set kernel equations can be more precisely expressed as: K̂(x,G|D) = ⟨Φ(x|D), Φ̂(G|D)⟩, where G ⊆ D and Φ is the feature map of Isolation Kernel constructed from D.
6. The method according to claim 5, wherein a post-processing is configured to be applied to all clusters produced by point-set kernel clustering to ensure that the following objective is achieved: maximizing the total similarity Σ_{j=1}^{k} Σ_{x∈G^j} K̂(x, G^j) over the assignment of points to the clusters G^j, j = 1, . . . , k.
7. The method according to claim 1, wherein the similarity threshold τ < 1 and the growth rate ϱ ∈ (0,1).
8. Software stored on a non-transitory machine-readable medium comprising instructions for enabling a data processing system to: a) receive one image; and b) convert the image into a dataset of some descriptors to form a dataset E; and c) convert each point in dataset E, using a feature map of Isolation Kernel, to a point in a dataset D; and d) use a point-set kernel to find the most similar point with respect to the dataset D and then use it as an initial seed for cluster G; and e) grow the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D with respect to the cluster G; and the cluster stops growing when all points in dataset D, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) grow the next cluster using the remaining points in dataset D by restarting from step d), until D is empty or no point can be found which has similarity more than τ.
9. The method according to claim 1, wherein the point-set kernel clustering method is configured to be applied to clustering a set of images, and to social network analysis, computational biology, market research or search engine applications, by replacing the one image with a dataset from each of these applications.
10. The method according to claim 9, wherein a data descriptor for each application is used to convert the original dataset, comprising either one data object or a set of data objects, into a set of points in vector representation.
11. The method according to claim 10, wherein the method of segmenting data is applied to clustering a set of images, comprising the steps of: a) receiving a first set of images; and b) converting the set of images into a set of points in vector representation, to form a dataset E1; and c) converting each point of dataset E1, using a feature map of Isolation Kernel, to a point in a dataset D1; and d) using a point-set kernel to find the most similar point with respect to the dataset D1 and then using it as an initial seed for cluster G; and e) growing the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D1 with respect to the cluster G; and the cluster stops growing when all points in dataset D1, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) growing the next cluster using the remaining points in D1, excluding G, by restarting from step d), until D1 is empty or no point can be found which has similarity more than τ.
12. The method according to claim 10, wherein the method of segmenting data is applied in computational biology, comprising the steps of: a) receiving one computational biology data object or a set of computational biology data objects; and b) converting the computational biology data object or the set of computational biology data objects into a set of points in vector representation, to form a dataset E2; and c) converting each point of dataset E2, using a feature map of Isolation Kernel, to a point in a dataset D2; and d) using a point-set kernel to find the most similar point with respect to the dataset D2 and then using it as an initial seed for cluster G; and e) growing the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D2 with respect to the cluster G; and the cluster stops growing when all points in dataset D2, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) growing the next cluster using the remaining points in the dataset D2, excluding the points in the cluster G, by restarting from step d), until D2 is empty or no point can be found which has similarity more than τ.
13. The method according to claim 10, wherein the method of segmenting data is applied in market research, comprising the steps of: a) receiving one market research data object or a set of market research data objects; and b) converting the market research data object or the set of market research data objects into a set of points in vector representation, to form a dataset E3; and c) converting each point of dataset E3, using a feature map of Isolation Kernel, to a point in a dataset D3; and d) using a point-set kernel to find the most similar point with respect to the dataset D3 and then using it as an initial seed for cluster G; and e) growing the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D3 with respect to the cluster G; and the cluster stops growing when all points in dataset D3, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) growing the next cluster using the remaining points in the dataset D3, excluding the points in the cluster G, by restarting from step d), until D3 is empty or no point can be found which has similarity more than τ.
14. The method according to claim 10, wherein the method of segmenting data is applied in a search engine, comprising the steps of: a) receiving one search engine data object or a set of search engine data objects; and b) converting the search engine data object or the set of search engine data objects into a set of points in vector representation, to form a dataset E4; and c) converting each point of dataset E4, using a feature map of Isolation Kernel, to a point in a dataset D4; and d) using a point-set kernel to find the most similar point with respect to the dataset D4 and then using it as an initial seed for cluster G; and e) growing the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D4 with respect to the cluster G; and the cluster stops growing when all points in dataset D4, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) growing the next cluster using the remaining points in the dataset D4, excluding the points in the cluster G, by restarting from step d), until D4 is empty or no point can be found which has similarity more than τ.
15. The method according to claim 10, wherein the method of segmenting data is applied in social network analysis, comprising the steps of: a) receiving one social network analysis data object or a set of social network analysis data objects; and b) converting the social network analysis data object or the set of social network analysis data objects into a set of points in vector representation, to form a dataset E5; and c) converting each point of dataset E5, using a feature map of Isolation Kernel, to a point in a dataset D5; and d) using a point-set kernel to find the most similar point with respect to the dataset D5 and then using it as an initial seed for cluster G; and e) growing the cluster G at a set growth rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D5 with respect to the cluster G; and the cluster stops growing when all points in dataset D5, excluding the points in the cluster G, have similarities with respect to the cluster G less than or equal to τ, where τ is a similarity threshold; and f) growing the next cluster using the remaining points in the dataset D5, excluding the points in the cluster G, by restarting from step d), until D5 is empty or no point can be found which has similarity more than τ.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The present description will be better understood from the following detailed description and the accompanying drawings. The details of the present invention, both as to its structure and operation, can best be understood by referring to the accompanying drawings, in which like reference numbers and designations refer to like elements.
DETAILED DESCRIPTION OF EMBODIMENTS
(16) The point-set kernel measures how similar a data point x, represented as a vector, is to a set of data points G. The point-set kernel is represented as:

K̂(x,G) = ⟨Φ(x), Φ̂(G)⟩

and

(17) Φ̂(G) = (1/|G|) Σ_{x∈G} Φ(x)

where Φ̂ is the kernel mean map of K̂, Φ is the feature map of a point-to-point kernel, and ⟨a,b⟩ denotes a dot product between two vectors a and b.
(18) Computing time: the summation in Φ̂(G) needs to be done only once, as a pre-processing step. Computing K̂(x,G) via the dot product then takes a fixed amount of time, independent of n (the data size of G). Therefore, computing the similarity of x with respect to G for all points x in G, i.e., K̂(x,G) ∀x ∈ G, has a computational cost which is proportional to n only.
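As a sketch of this cost argument (assuming the feature-mapped points are already available as rows of a matrix; the sparse binary features stand in for an Isolation Kernel mapping), the kernel mean map is computed once and every similarity is then a single fixed-cost dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature-mapped set G: one row per point Phi(z),
# e.g. t*psi sparse binary attributes produced by Isolation Kernel.
Phi_G = (rng.random((1000, 200)) < 0.05).astype(float)

# Pre-processing, done once: kernel mean map = (1/|G|) * sum of Phi(z).
phi_hat_G = Phi_G.mean(axis=0)

# K(x, G) = <Phi(x), mean map>: one dot product per query point,
# so scoring all n points of G costs time proportional to n only.
sims = Phi_G @ phi_hat_G
```

By linearity of the dot product, each value equals the average of the point-to-point similarities ⟨Φ(x),Φ(z)⟩ over z ∈ G, but the n² pairwise computation is never performed.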
(19) Note that kernel mean embedding is an approach to convert a point-to-point kernel into a distribution kernel which measures the similarity between two distributions. The point-set kernel can be viewed as a special case of kernel mean embedding. Kernel mean embedding uses the same kernel mean map as used here.
(20) The use of the feature map is necessary to achieve the stated efficiency. The alternative, which employs the point-to-point kernel/distance directly in the computation, has a computational cost proportional to n², which is the root cause of high computational cost in existing density-based algorithms. The point-set kernel formulation assumes that the point-to-point kernel has a finite-dimensional feature map. Commonly used point-to-point kernels (such as Gaussian and Laplacian kernels) have two key limitations: they have a feature map of intractable dimensionality, and their similarity is independent of a given dataset. The first limitation prevents these kernels from being used directly in the formulation.
(21) Therefore, a recently introduced point-to-point kernel called Isolation Kernel, which has an exact finite-dimensional feature map, is used in K̂. Isolation Kernel is a data dependent kernel that is derived directly from data; this data dependence is essential to a good clustering outcome. With a data independent kernel such as the Gaussian kernel, psKC would perform poorly on datasets with clusters of non-globular shape, different data sizes and/or densities, because the similarity measurement would be independent of the data distribution, as shown in the accompanying drawings.
(22) Isolation Kernel has two characteristics, which are antitheses of the two limitations of data independent kernels mentioned above. The first is that it has a finite-dimensional feature map, which enables Isolation Kernel to be used directly in the point-set kernel; the exact finite-dimensional feature map is crucial in achieving runtime proportional to data size. The second is that its similarity adapts to the local density of the data distribution of a given dataset, which means that two points in a sparse region are more similar than two points of equal inter-point distance in a dense region. This characteristic is crucial for the clustering algorithm to obtain good clustering outcomes.
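A minimal sketch of such a finite-dimensional feature map follows claim 2: under each of t Voronoi partitionings built from ψ randomly sampled points, a point is mapped to a one-hot indicator of the cell that isolates it (function name and defaults here are illustrative, not the patented implementation):

```python
import numpy as np

def ik_feature_map(E, t=100, psi=8, seed=0):
    """Map each point of E to a t*psi binary vector: for each of t
    Voronoi partitionings built from psi sampled points, set the bit
    of the cell (nearest sample point) the point falls into."""
    rng = np.random.default_rng(seed)
    E = np.asarray(E, dtype=float)
    n = len(E)
    Phi = np.zeros((n, t * psi))
    for i in range(t):
        centres = E[rng.choice(n, size=psi, replace=False)]
        # a point belongs to the Voronoi cell of its nearest centre
        d2 = ((E[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        Phi[np.arange(n), i * psi + d2.argmin(axis=1)] = 1.0
    return Phi / np.sqrt(t)  # normalise so <Phi(x), Phi(x)> = 1
```

Because points in a sparse region tend to fall into the same large cell more often than two equally distant points in a dense region, the resulting similarity is data dependent in the sense described above.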
(23) As the point-set kernel is constructed from a dataset D, the point-set kernel equations can be more precisely expressed as:

(24) K̂(x,G|D) = ⟨Φ(x|D), Φ̂(G|D)⟩

where G ⊆ D, and Φ is the feature map of Isolation Kernel which is constructed from D. Note that Isolation Kernel has no closed-form expression.
(25) The point-set kernel can be used to describe a cluster in terms of similarity distribution, independent of the clustering process. Given a dataset D having k clusters G^j, j=1, . . . , k, the clusters could be the ground truth or the clustering outcome of an algorithm. The K̂ similarity distribution of all clusters G^j in D is defined as:

(26) {K̂(x, G^j) | x ∈ D}, for each j = 1, . . . , k.
(27) The properties of the point-set kernel for points outside the cluster are described as follows. Given a dataset D and a cluster G ⊂ D, let x, x′ ∈ D\G, let the distance between x and a set G be

(28) l(x,G) = min_{z∈G} ∥x − z∥,

and let ρ(x) denote the density of x. Properties of the point-set kernel derived from D include: (a) Fall-off-the-cliff property: K̂(x,G) decreases sharply as l(x,G) increases; (b) Data dependent property:

(29) K̂(x,G) < K̂(x′,G)

if l(x,G) = l(x′,G) and ρ(argmin_{z∈G} ∥x−z∥) > ρ(argmin_{z∈G} ∥x′−z∥). In other words, the rate of falling-off at x is data dependent: it is proportional to the density at the point in G closest to x, i.e., argmin_{z∈G} ∥x−z∥, and inversely proportional to l(x,G).
(30) These properties enable each cluster to be expanded radially in all directions from a seed over multiple iterations, where each iteration recruits a subset of new members in the immediate neighborhood of the expanding cluster; arbitrary-shaped clusters of different densities and sizes can thereby be discovered by growing one cluster at a time.
(32) The clustering, called point-set kernel clustering or psKC, employs the point-set kernel K̂ to characterize clusters. It identifies all members of each cluster by first locating a seed in the dataset. Then, it expands the cluster's membership in the cluster's local neighborhood, which grows at a set rate (ϱ) incrementally; growth stops when all unassigned points have similarity with respect to the cluster below a threshold (τ). The process repeats for the next cluster using the remaining points in dataset D, yet to be assigned to any cluster found so far, until D is empty or no point can be found which has similarity more than τ. All points remaining after the entire clustering process are noise, as their similarity falls below the set threshold for each of the clusters discovered. The psKC procedure is shown in Algorithm 1.
(33) Algorithm 1: point-set Kernel Clustering (psKC)
Input: D - given dataset, τ - similarity threshold, ϱ - growth rate
Output: G^j, j = 1, . . . , k - k clusters; N - noise set
1  k = 0;
2  while |D| > 1 do
3      x_p = argmax_{x∈D} K̂(x, D);
4      x_q = argmax_{x∈D\{x_p}} K̂(x, {x_p});
5      γ = (1 − ϱ) × K̂(x_q, {x_p});
6      if γ ≤ τ then
7          terminate while-do loop;
8      end
9      k++;
10     G_0^k = {x_p, x_q};
11     for (i = 1; γ > τ; i++) do
12         G_i^k = {x ∈ D | K̂(x, G_{i−1}^k) > γ};
13         γ = (1 − ϱ)γ;
14     end
15     G^k = G_{i−1}^k;
16     D = D \ G^k;
17 end
18 N = D;
19 return G^j, j = 1, . . . , k; N;
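The procedure above can be sketched as follows: a simplified rendering of Algorithm 1, assuming the dataset has already been feature-mapped into rows of a matrix with normalised vectors so that self-similarity is 1 (names and structure are illustrative, not the patented implementation):

```python
import numpy as np

def psKC(Phi, tau, rho):
    """Sketch of Algorithm 1 over a feature-mapped dataset Phi (one
    row per point). tau: similarity threshold; rho: growth rate in
    (0,1). Returns the clusters (index arrays) and the noise set."""
    khat = lambda rows, G: Phi[rows] @ Phi[G].mean(axis=0)  # K(x, G)
    idx = np.arange(len(Phi))                 # unassigned points D
    clusters = []
    while len(idx) > 1:
        p = idx[khat(idx, idx).argmax()]      # seed x_p = argmax K(x, D)
        rest = idx[idx != p]
        q = rest[(Phi[rest] @ Phi[p]).argmax()]  # x_q most similar to x_p
        gamma = (1 - rho) * (Phi[q] @ Phi[p])
        if gamma <= tau:
            break                             # no further cluster can grow
        G = np.array([p, q])
        while gamma > tau:                    # expand at growth rate rho
            G = idx[khat(idx, G) > gamma]     # recruit similar points
            gamma = (1 - rho) * gamma
        clusters.append(G)
        idx = idx[~np.isin(idx, G)]           # D = D \ G
    return clusters, idx                      # leftover points are noise N
```

On two well-separated groups in feature space, the sketch grows one cluster per group and leaves no noise, mirroring lines 2-17 of Algorithm 1.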
(34) The cluster grown from a seed according to psKC can be formally defined as follows: a ϱ-expanded cluster grown from a seed x_p selected from D, using K̂(·,·) with similarity threshold τ < 1 and growth rate ϱ ∈ (0,1), is defined recursively as:

G_i = {x ∈ D | K̂(x, G_{i−1}) > γ_i > τ}

where G_0 = {x_p, x_q}, x_q = argmax_{x∈D\{x_p}} K̂(x, {x_p}), γ_i = (1 − ϱ)γ_{i−1} and γ_0 = (1 − ϱ) K̂(x_q, {x_p}).
(35) Let G^j be the ϱ-expanded cluster j from dataset D. The number of ϱ-expanded clusters in dataset D is discovered automatically by repeating the above cluster growing process to obtain G^k from D \ {G^j, j = 1, . . . , k−1}. After discovering all ϱ-expanded clusters G^j in D, noise is defined as

N = {x ∈ D | ∀j K̂(x, G^j) ≤ τ}.
(36) A post-processing can be applied to all clusters produced by psKC to ensure that the following objective is achieved:

(37) maximizing the total similarity Σ_{j=1}^{k} Σ_{x∈G^j} K̂(x, G^j) over the assignment of points to the clusters.
(38) This post-processing re-examines all points which have the lowest similarity with respect to their cluster G^j to determine whether they could be reassigned to another cluster to maximize the total similarity. This re-examination begins with points in G^j, j = 1, . . . , k, in the order the clusters are produced.
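A simplified single-pass rendering of this re-examination (assuming points are revisited in cluster order and the kernel mean maps are held fixed during the pass; names are illustrative):

```python
import numpy as np

def postprocess(Phi, clusters):
    """Re-examine points cluster by cluster, in the order the clusters
    were produced, and reassign any point whose similarity K(x, G^j)
    is higher for another cluster's kernel mean map."""
    means = np.stack([Phi[G].mean(axis=0) for G in clusters])  # one map per G^j
    labels = np.full(len(Phi), -1)
    for j, G in enumerate(clusters):
        labels[G] = j
    for j, G in enumerate(clusters):
        for x in G:
            # move x to the cluster it is most similar to
            labels[x] = int((means @ Phi[x]).argmax())
    return labels
```

A point sitting near the boundary of its assigned cluster but closer (in point-set similarity) to another cluster's mean map is thereby reassigned, tweaking the clustering toward the maximization objective.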
(40) As shown in the accompanying drawings, the image segmentation method comprises the following steps:
(41) receiving one image; and
(42) Application data descriptor step: converting the image into a dataset of descriptors, e.g., CIELAB, to form a dataset E; and
(43) Conversion using Isolation Kernel step: converting each point in dataset E, using a feature map of Isolation Kernel, to a point in a dataset D; and
(44) Seed finding step: using a point-set kernel to find the most similar point with respect to D and then using it as an initial seed for cluster G; and
(45) Cluster growing step: growing the cluster G incrementally using the point-set kernel by recruiting the most similar points from dataset D with respect to G; the cluster stops growing when all points in D, excluding G, have similarities less than or equal to τ, where τ is a similarity threshold; and
(46) growing the next cluster using the remaining points in dataset D excluding G by restarting from the Seed finding step, until the dataset D is empty or no point can be found which has similarity more than τ.
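For the application data descriptor step, a minimal sketch (assuming the image is already available as an H×W×3 array in the CIELAB colour space; the colour-space conversion itself depends on the image library used and is not shown):

```python
import numpy as np

def image_to_dataset(img_lab):
    """Flatten an H x W x 3 CIELAB image into a dataset E with one
    3-dimensional colour descriptor per pixel."""
    img_lab = np.asarray(img_lab, dtype=float)
    return img_lab.reshape(-1, img_lab.shape[-1])

# For example, a 499 x 324 image yields a dataset E of 161,676 points,
# matching the pixel count quoted for the Forbidden City Gate image.
```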
(49) The use of Isolation Kernel in the point-set kernel enables the similarity between a point and a set to be computed efficiently, without the need to compute point-to-point similarity/distance, which is the root cause of high time complexity in existing algorithms. The finite-dimensional feature map of Isolation Kernel and its use in the point-set kernel enable the algorithm to achieve its full potential: runtime proportional to data size (n), a level that cannot be achieved by existing effective clustering algorithms such as DP, or even by less effective but more efficient algorithms such as scalable kernel k-means; they have runtimes at least proportional to n². The time complexity of psKC is shown in the following table, given that the maximum number of iterations is fixed by the threshold τ and growth rate ϱ, independent of the data size. t and ψ are parameters of Isolation Kernel.
(50) TABLE 1
1. Build Isolation Kernel (IK): O(tψ)
2. Map dataset D of n points using the feature map of IK: O(ntψ)
3. Grow clusters, assuming k clusters are produced and each cluster of n/k points takes b iterations: O(bntψ)
Total time cost: O(bntψ)
(51) In order to compare the clustering performance with existing clustering algorithms, including DP, scalable kernel k-means (which employs a Gaussian kernel) and kernel k-means (which employs an adaptive kernel), five experiments were conducted: one reporting clustering outcomes on artificial datasets, one on clustering outcomes on single images of high or low resolution, one on clustering a set of images into subsets of images, one scaleup test and one stability analysis.
(52) Parameter search ranges. Parameters are searched for each of the algorithms, i.e., DP, scalable kernel k-means, kernel k-means and psKC, and their best clustering outcomes are reported after the search. psKC is implemented in C++. Scalable kernel k-means is implemented in Scala as part of the Spark framework; DBSCAN is implemented in Java as part of the WEKA framework; and DP, DP_ik, k-means and the kNN kernel are implemented in MATLAB.
(53) The parameter search ranges used in the experiments on artificial datasets are:
(54) (1) DP: ε (the bandwidth used for density estimation) is in [0.001 m, 0.002 m, . . . , 0.4 m] where m is the maximum pairwise distance. The number of clusters is set to the true number of clusters.
(55) (2) Kernel k-means: k in kNN kernel is in [0.01n, 0.02n, . . . , 0.99n]; and the number of dimensions used is 100. The number of clusters is set to the true number of clusters.
(56) (3) Scalable kernel k-means: σ in [0.1, 0.25, 0.5, 1, . . . , 16, 24, 32]; k is set to the true number of clusters; s=100 (the target dimension of the PCA step) and c=400 (the sketch size of the Nyström step), except for datasets of fewer than 400 points, where s=20 and c=200.
(57) (4) psKC: ψ in [2, 4, 6, 8, 16, 24, 32], t=100, τ=0.1 and ϱ=0.1.
(58) (5) psKC_g: γ=2^i where i is in [1, 2, 3, . . . , 16], τ=0.1 and σ in [0.1, 0.01, 0.001, . . . , 1×10^−10].
(59) (6) DBSCAN: ε in [0.001 m, 0.002 m, . . . , 0.999 m] and MinPts in [2, 3, . . . , 30], where m is the maximum pairwise distance and MinPts is the density threshold.
(60) (7) DP_ik: for DP, ε is in [0.001 m, 0.002 m, . . . , 0.4 m] where m is the maximum pairwise distance, and the number of clusters is set to the true number of clusters. For Isolation Kernel: ψ in [2, 4, 6, 8, 16, 24, 32] and t=100.
(61) (8) k-means: The number of clusters is set to the true number of clusters.
(62) The experiments ran on a Linux CPU machine: AMD 16-core CPU with each core running at 2.0 GHz and 32 GB RAM. The feature space conversion was executed on a machine having GPU:2×GTX 1080 Ti with each card having 12 GB RAM. Both are commonly used machines.
(63) Clustering outcomes on artificial datasets. In the first experiment, five commonly used benchmark datasets, namely Ring-G, AC, Aggregation, Spiral and S3 are used.
(64) As shown in the accompanying drawings, DP performed well on most of the five datasets but, like the other existing algorithms, failed to identify all four clusters in the Ring-G dataset.
(65) The two versions of kernel k-means are weaker algorithms than DP, as they did poorly on at least three of the five datasets, i.e., Ring-G, Aggregation and Spiral. This is because of the use of k-means, which has fundamental weaknesses in detecting clusters of non-globular shapes. The use of a kernel in both kernel k-means variants transfers these fundamental weaknesses from input space to feature space. The results show that there is no guarantee that they can detect clusters of non-globular shapes.
(66) Point-set kernel clustering is the only algorithm that did well on all five datasets, and it is the only algorithm that successfully identified all four clusters in the Ring-G dataset. This is a direct result of the cluster identification procedure which employs the point-set kernel. Other algorithms failed to correctly identify the four clusters because of their algorithmic designs, which must determine all density peaks/centers before individual points can be assigned to one of the peaks/centers.
(67) Clustering outcomes on one image (each having high or low resolution). In the second experiment, three photographic images of resolutions from low to high: Forbidden City Gate (499×324 pixels), Vincent van Gogh's Starry night over the Rhone 2 (932×687 pixels) and Chinese Painting: Zhao Mengfu's Autumn Colors (2,005×500 pixels), are used to compare the clustering outcome. Color images are represented in the CIELAB color space. All clustering algorithms in the comparison are presented with a dataset with this CIELAB representation when an image is to be segmented.
(68) As shown in the accompanying drawings, kernel k-means split the elongated sky cluster in the Forbidden City Gate image into two clusters.
(69) In contrast, psKC could discover the two clusters, without splitting the sky into two. Note that the Forbidden City Gate image has a total of 161,676 pixels. DP could only process a lower resolution of this image of 60,000 pixels. Like kernel k-means, DP identified this elongated cluster to have two peaks instead of one, which is one weakness of the DP peaks identification procedure.
(72) Clustering a set of images. In the third experiment, psKC is used to cluster a set of images into several subsets of images, instead of segmenting one image into multiple segments.
(73) Meanwhile, digits 4 & 9 are grouped into three clusters. In addition to the two pure clusters, the third cluster consists of both digits in a 1:3 proportion. This is in contrast to the result produced by a kNN-graph based clustering algorithm, RCC (short for Robust Continuous Clustering), where both digits 4 & 9 have been grouped into a single cluster. The digits grouped in the third cluster have different writing styles from those in the two pure clusters of 4 & 9.
(74) Runtime experiment. In the fourth experiment, the MNIST8M dataset, which has 8.1 million data points with 784 dimensions, is used for the scaleup test. The runtime is measured in CPU seconds (including GPU seconds when the GPU is used).
(75) The experimental result shown in the accompanying drawings confirms that psKC has runtime proportional to data size.
(76) Even with 12 CPUs, as shown in the accompanying drawings, scalable kernel k-means has difficulty dealing with large scale datasets without massive parallelization.
(77) In contrast, the algorithmic advantage of psKC, together with the use of the point-set kernel, allows it to run on a standard machine of single-CPU (for clustering) and GPU (for feature mapping in preprocessing). This enables the clustering to be run on a commonly available machine (with both GPU and CPU) to deal with large scale datasets.
(78) In terms of real time: on the dataset with 40 k data points, psKC took 73 seconds, which consists of 58 GPU seconds for feature mapping and 15 CPU seconds for clustering. In contrast, DP took 541 seconds. The gap in runtime widens as data size increases: to complete the run on 8.1 million points, DP is projected to take 379 years. That would be 12 billion seconds, which is six orders of magnitude slower than psKC's 20 thousand seconds (less than 6 hours). The widening gap is apparent in the accompanying drawings.
(79) As it is, there is no opportunity for DP to do feature mapping (where GPU could be utilized). While it is possible for kernel k-means to make use of GPU as in psKC, the main restriction of scalable kernel k-means is PCA which has no efficient parallel implementation, to the best of our knowledge. The clustering procedures of both DP and psKC could potentially be parallelized, but this does not change their time complexities.
(80) Stability test. In the fifth experiment, the stability of the clustering outcomes of psKC and kernel k-means is compared over repeated trials.
(81) On the Spiral dataset, psKC appears to have a larger variance than kernel k-means. This is because kernel k-means produced significantly poorer clusterings overall, with all 10 trials scoring below 0.5 F1 score.
(82) Overall, psKC (using t=100) produces a higher F1 score than kernel k-means on all three datasets, where the median result is shown as the line inside the box.
(83) According to the five experimental results, psKC outclasses DP and the two versions of kernel k-means in terms of both clustering outcomes and runtime efficiency. The psKC algorithm has the following advantages. First, the algorithm is deterministic, given a kernel function and the user-specified parameters. This resolves the instability issue and often leads to better clustering outcomes. The only randomization is due to the Isolation Kernel. The use of the most similar points in D as seeds is much more stable, even with different initializations of Isolation Kernel, compared with random groupings of clusters, which can change wildly from one run to the next. Second, the psKC procedure enables detection of clusters of arbitrary shape, of different sizes and densities. Third, the psKC procedure commits each point to a cluster once it is assigned, and most points which are similar to the cluster never need to be reassigned. This is possible because of the use of a seed to grow a cluster: points which are similar to a cluster grown from the seed will not be similar to another cluster if the points are less similar to the seeds of the other clusters in the first place. The sequential determination of seeds (as opposed to the parallel determination of centers in k-means) makes that possible.
(84) As a result, psKC avoids many unnecessary recomputations in k-means mentioned earlier. In other words, the clustering outcome of psKC is already close to the final maximization objective. The post-processing literally tweaks at the edges by reexamining those lowest similarity points regarding each cluster for possible reassignment to achieve the final maximization of the objective function.
(85) In summary, the two root causes of shortcomings of existing clustering algorithms are: (i) the use of data independent point-to-point distance/kernel (where the kernel has a feature map with intractable dimensionality) to compute the required similarity directly; and (ii) the algorithmic designs that constrict the types of clusters that they can identify. For example, in the case of kernel k-means, even though a kind of point-set kernel is used, it can detect clusters of globular shape only in feature space; and this does not guarantee that non-globular shaped clusters in input space can be detected. These have led to poorer clustering outcomes and the longstanding runtime issue that have prevented them from dealing with large scale datasets.
(86) These root causes are addressed by using a data dependent point-set kernel and a new clustering algorithm which utilizes the point-set kernel to characterize clusters—they encompass many types of clusters which cannot be detected by existing algorithms. As a result, psKC is the only clustering algorithm that is both effective and efficient—a quality which is all but nonexistent in current clustering algorithms. It is also the only kernel-based clustering that has runtime proportional to data size.
(87) The clustering method for data mining of the present invention can be applied to multiple fields; the image segmentation application is taken as an example in the above embodiment. The data mining method can also be applied to applications such as clustering a set of images, social network analysis, computational biology, market research, search engines, etc. When data analysis is performed in the corresponding field, the data descriptor for each application shall be used to convert the original dataset into a set of points in vector representation.
(88) The method is both effective and efficient, which enables it to deal with large scale datasets. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, the method is more effective and runs orders of magnitude faster when applied to datasets of millions of data points on a commonly used computing machine.