METHODS FOR CORRELATED HISTOGRAM CLUSTERING FOR MACHINE LEARNING

Abstract

A methodology for correlated histogram clustering for machine learning which does not require a priori knowledge of cluster numbers, which extends beyond bimodal scenarios to multimodal scenarios, and does not need iterative optimization methods nor require powerful data processing.

Claims

1. A method in a machine learning system for generating correlated histogram clusters, comprising the steps of: 1) generating n histograms for n-dimensional data set, D; 2) selecting a subset (D′) of the histogram data based on a frequency greater than a threshold; 3) generating n histograms for n-dimensional data set D′ with optimal bin size; 4) identifying m histogram peaks (modes) for each dimension; 5) for the i.sup.th peak of the j.sup.th dimension, m.sub.ij, identify an index, p, in the data by finding the value in dimension j of D′ closest to m.sub.ij, and set value C.sub.i of centroid C equal to m.sub.ij; 6) identify the associated data value D′.sub.pk for another one of the dimensions, k, and identify the nearest peak from the histogram of k.sup.th dimension to D′.sub.pk, and assign value C.sub.k of centroid C to that peak; 7) repeat step 6 for every dimension of data k through n, k≠I; 8) save centroid C and repeat steps 5-7 for all histogram peaks of the j.sup.th dimension; and, 9) repeat steps 5-8 for all dimensions j through n.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] For a more complete understanding of the present disclosure, reference is now made to the following detailed description taken in conjunction with the accompanying drawings, in which:

[0036] FIG. 1 illustrates a histogram for a first data set A;

[0037] FIG. 2 illustrates a histogram for a second data set B;

[0038] FIG. 3 illustrates a coarse histogram for data set A;

[0039] FIG. 4 illustrates a coarse histogram for data set B;

[0040] FIG. 5 illustrates a Quantile Respective Density Estimate (QRDE) for data set A;

[0041] FIG. 6 illustrates a QRDE for data set B;

[0042] FIG. 7 illustrates a generalization of how histograms can be indexed for correlation;

[0043] FIG. 8 illustrates a two-dimensional histogram for data sets A and B;

[0044] FIG. 9 illustrates a flowchart of a methodology for correlating histograms;

[0045] FIGS. 10A and 10B illustrate comparisons of approaches to finding clusters in a dataset;

[0046] FIG. 11 illustrates correlated histograms applied to the same dataset as row three of FIGS. 10A and 10B with sensitivity equal to 0.50;

[0047] FIG. 12 illustrates correlated histograms applied to the same dataset as row three of FIGS. 10A and 10B with sensitivity equal to 0.90; and,

[0048] FIG. 13 illustrates correlated histograms applied to the same dataset as row four of FIGS. 10A and 10B.

[0049] Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated and, in the interest of brevity, may not be described after the first instance.

DETAILED DESCRIPTION

[0050] The following detailed description discloses a methodology, for use in or training of a machine learning system, for generating correlated histogram clusters. The methodology does not require a priori knowledge of cluster numbers, which extends beyond bimodal scenarios to multimodal scenarios, and does not need iterative optimization methods nor require powerful data processing. With so much effort spent on the various machine learning techniques of unsupervised learning, a relatively simple yet unobvious approach is to leverage statistics and correlate histogram data.

[0051] FIGS. 1 and 2 illustrate histograms for a first data set A and a second data set B, respectively, representing a multimodal statistical model based on 10,000 pairs of points. By examining FIGS. 1 and 2, one skilled in the art will surmise the histograms most likely correspond to a tri-modal distribution.

Centroids from Coarse Histograms

[0052] Selecting a threshold frequency (number of counts) results in a coarse histogram. From coarse histograms, the optimal number of bins can be determined for both data sets. Selecting the larger value results in both histograms having equal resolution. For each data set, multiple extrema may be identified from the counts; and for each extrema, the midpoint of the edge width determines the centroid.

[0053] For the histograms illustrated in FIGS. 1 and 2, the corresponding coarse histograms are shown in FIGS. 3 and 4, respectively. The optimal number of bins for data set A is 11 while the optimal number of bins for data set B is 4. As instructed, 11 will be selected for equal resolution. Each data set has 3 peaks (FIGS. 3 and 4) with corresponding edges; i.e., data set A (1.40, 4.36, 8.06) and data set B (0.20, 0.42, 1.19). The widths are A(0.74) and B(0.11) resulting in the modes A(1.77, 4.73, 8.43) and B(0.26, 0.48, 1.25).

Centroids from Density Estimates

[0054] In cases where a histogram is not expressive of the underlying modality, a density estimate that is sensitive to modality may be employed. The Harrell-Davis Density Estimator (Harrell, 1982) is one such density estimate, though there are many, that can aid in the identification of peaks. To interpret modes in the density estimates, another method is required—the Lowland Modality Method using Quantile Respective Density Estimates (QRDE) can be used to find modes from a density estimate (Akinshin, 2020). Using this method, modes are defined as the highest peak, M, between two other peaks, P1 and P2, such that the proportion of the bin area between M and P.sub.i and the total rectangular area between the M and P.sub.i is greater than some threshold value, called the sensitivity.

[0055] FIGS. 5 and 6 illustrate QRDEs for data sets A and B, respectively. In this case, the modes for data set A are (1.654, 4.533, 8.280) and modes for data set B are (0.244, 0.441, 1.213). As before, what remains unobvious is the connection between these centroids.

[0056] The embodiment described hereinafter will reference centroids found in FIGS. 1 and 2, using the coarse histograms.

Optimal Bin Counts

[0057] Any method for constructing a histogram will require the choice of a bin count. There are simple rules-of-thumb for obtaining a bin count like taking the square root of N, taking 1+log(N) (e.g., using the Sturges Method as known to those skilled in the art), or taking 2+cube root of the N where N for all of these is the number data points. These methods rely on the number of data points rather than the underlying statistics of the data. One such approach is to minimize the following function of the mean and variance of the frequencies (Shimazaki and Shinomoto, 2007):

[00001] ${score}_{n} = \frac{(2 * μ - S^{2})}{{(\max - \min)}^{2}}$

As is typically done, histograms, as well as density estimates, are sorted as shown in FIGS. 1 and 2. This sorting approach, however, leads one to believe the order of A's modes are the same as those for B's modes; i.e., A(1.77) goes with B(0.26), A(4.73) goes with B(0.48), and A(8.43) goes with B(1.25). But in this case, they don't go together as one may be led to believe. This can be resolved through indexing.

Indexing

[0058] FIG. 7 illustrates a generalization of how histograms can be indexed for correlation. On the left is a histogram for data set X. Each centroid X-Value has a corresponding index. For data set X, the first centroid (peak) is identified by index 2, the second centroid (peak) is identified by index 1, and the third centroid (peak) is identified by index 3. Histogram correlation is managed by identifying the index with the corresponding centroid B-Values. Index 1 identifies the third B-Value centroid (peak), index 2 identifies the second B-Value centroid (peak), and index 3 identifies the first B-Value centroid (peak).

[0059] If the data is indexed, then one simply looks for an index corresponding to a particular centroid (from A's data set) and then uses that same index to locate the other centroid (from B's data set). For example, data set A has one of many indexes that match the value 4.73 (within a few values of the second decimal place), one of which happens to be the index 77. Looking at data set B, index 77 leads one to find a corresponding value of 0.43. Recognizing 0.43 is near 0.48 (within a few values of the second decimal place), one concludes that one of the cluster centroids (A, B) is the pair (4.73, 0.48). It turns out that this just happens to be the same as the second elements in the histogram order for A and B.

[0060] Repeating the methodology, data set A has one of many indexes that match the value 8.43 (within a few values of the second decimal place), one of which happens to be the index 79. Looking at data set B, index 79 leads one to find a corresponding value of 0.26, which just happens to coincide with the centroid. Thus, one concludes that another of the cluster centroids (A, B) is the pair (8.43, 0.26). This is not in the order of the histogram data. The third centroid value of A corresponds to the first centroid values of B.

[0061] The methodology can be repeated for the last pair or deduced by elimination that is it must be (1.77, 1.25). Of course, this is not in the order of the histogram data. The first mode value of A corresponds to the third mode value of B.

[0062] The final set of three correlated centroids are (1.77, 1.25), (4.73, 0.48), and (8.43, 0.26). A visual embodiment of the final result (a two-dimensional histogram) is shown in FIG. 8 with the A-axis along the bottom-left, the B-axis along the bottom-right, and histogram count (frequency) as the vertical axis. Here, clusters are clearly identified. However, in the case of n-dimensions (n>2), a visual embodiment is challenging.

[0063] The foregoing methodology can be extended beyond this tri-modal example embodiment of two data sets to an embodiment of a multi-modal, n-dimensional data set without the need for knowing the cluster number a priori and performed rapidly without having to apply advance algorithmic techniques. The correlated histogram clustering (“CHC”) methodology is illustrated by the flowchart 900 in FIG. 9, which can be summarized by the following steps: [0064] Step 901: Generate n histograms for n-dimensional data set, D; [0065] Step 902: Select a subset of the histogram data based on a frequency greater than some threshold; call this new data set D′; [0066] One approach is to select those points with frequencies whose z-scores are positive [0067] Step 903: Generate n histograms for n-dimensional data set D′ with optimal bin size; [0068] One approach to building the histograms is optimizing the Shimizaki and Shinomoto cost function based on the mean and variance of bin frequencies. [0069] Another approach to building histograms is to build Density Estimates and acquire the modes from that rather than a traditional fixed-bin-size histogram. [0070] Another approach is to use Akinshin's Adaptive Histograms that keep frequency in bins but do not have fixed-bin sizes. [0071] Step 904: Identify m histogram peaks (modes) for each dimension; [0072] The number “m” of extrema centroid peaks may be any integer, unimodal, bimodal, trimodal, etc. [0073] One approach is Lowland Modality using Density Estimates [0074] Another approach is the Harrel-Davis Estimate [0075] Another approach is using the m-value to determine modality [0076] Step 905: For the i.sup.th peak of the j.sup.th dimension, m.sub.ij, identify an index, p, in the data by finding the value in dimension j of D′ closest to m.sub.ij. Let value C.sub.i of centroid C be equal to m.sub.ij; [0077] Step 906: Identify the associated data value D′.sub.pk for another one of the dimensions, k. Identify the nearest peak from the histogram of k.sup.th dimension to D′.sub.pk. Let value C.sub.k of centroid C be that peak; [0078] Step 907: Repeat step 906 for every dimension of data k through n, k≠I; [0079] Step 908: Save centroid C. Repeat steps 905-907 for all histogram peaks of the j.sup.th dimension; and, [0080] Step 909: Repeat steps 905-908 for all dimensions j through n.
The CHC methodology results in a novel approach to clustering that does not require a priori knowledge of cluster number, extends to multimodal scenarios, and does not need iterative optimization methods nor require powerful data processing.

Comparing Correlated Histogram Methodology to Existing Approaches

[0081] As discussed previously, there exist other approaches to finding clusters in a dataset. FIGS. 10A and 10B illustrate comparisons of such approaches to finding clusters in a dataset; the different approaches are identified by name at the top of each column. The algorithm for each approach assigns datapoints to clusters, each cluster represented by different shapes (e.g., circles, triangles, squares and diamonds) and data considered to be noise represented by solid circles in each plot. The runtime of each algorithm for each dataset is also indicated below each plot. The Correlated Histograms Clustering methodology disclosed herein, as applied to some of the same datasets, is illustrated and described hereinafter with respect to FIGS. 11-13.

[0082] FIG. 11 illustrates correlated histograms generated by the disclosed CHC methodology as applied to the same dataset as row three of FIGS. 10A and 10B with sensitivity equal to 0.50. Those skilled in the art will recognize that FIG. 11 illustrates how the disclosed CHC methodology can find modes in a noisy data set without getting lost in the noise. Note that the tight gaussians on the left and right in FIG. 11 have centroids identified while DBSCAN applied to the same dataset, row three of FIG. 10B, fails to discern the noise in the center from the gaussian on the right. If the QRDE is employed along with the Lowland Modality Method, the sensitivity parameter can be adjusted to be more sensitive to centroids.

[0083] FIG. 12 illustrates correlated histograms generated by the disclosed CHC methodology as applied to the same dataset as row three of FIGS. 10A and 10B with sensitivity equal to 0.90. Those skilled in the art will recognize that FIG. 12 illustrates how a centroid was found in the large noisy gaussian when this hyperparameter is set higher. In comparison, it can be seen in row 3 of FIG. 10B that DBSCAN identified this noisy cluster as a part of the tight gaussian on the right, rather than differentiating the two.

[0084] Finally, FIG. 13 illustrates correlated histograms generated by the disclosed CHC methodology as applied to the same dataset as row four of FIGS. 10A and 10B. Correlated histograms utilizes the underlying statistics with respect to the orthogonal dimensions of the data (x, y, . . . ). Thus, in a scenario such as that illustrated in FIG. 13, the CHC methodology doesn't find three centroids cleanly, rather it finds seven centroids. Though the centroids it finds are near the true centroids of each cluster, more advanced techniques could be applied to identify histogram peaks with respect to other functions rather than the orthogonal vectors that the dataset is assumed to be with respect to.

[0085] The foregoing has disclosed a novel methodology for generating correlated histogram clusters which can be used to advantage in machine learning systems and the training thereof. Although the embodiments and the advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope thereof as defined by the claims. For example, many of the features and functions discussed above can be implemented in software, hardware, firmware, or a combination thereof. Also, many of the features, functions, and steps of operating the same may be reordered, omitted, added, etc., and still fall within the scope of the claims and equivalents of the elements thereof.

METHODS FOR CORRELATED HISTOGRAM CLUSTERING FOR MACHINE LEARNING

Assignee

Inventors

Cpc classification

Classification Explorer

G06N7/01

PHYSICS

Classification Explorer

G06N3/088

PHYSICS

Classification Explorer

G06N5/048

PHYSICS

Classification Explorer

G06N20/00

PHYSICS

Classification Explorer

G06N7/023

PHYSICS

International classification

Classification Explorer

G06N7/02

PHYSICS

Abstract

Claims

Description