In multivariate analysis (also called unsupervised learning), a common statistical problem is to cluster data pints. For instance, given a data as follows:
The goal is to cluster the data into several partitions. For instance, like this:
There are many different methods for clustering. Common approaches include the k-mean clustering, spectral clustering, linkage methods…e.t.c.
Here, I want to introduce the mode clustering method which is a simple but not commonly used technique in statistics. The mode clustering is to cluster the data based on the modes (local maximums) of the density function. The idea is simple: given a density function, for each point, we follow the gradient path for the density until we arrive a mode. Back to our example, the mode clustering looks like this:
Each blue path is the gradient path from each data point. We have four clusters corresponding to the four modes and every point is assigned to one cluster/mode.
The algorithm for mode clustering is called the “mean-shift” algorithm. This is a well-known method in computer vision and image segmentation. In fact, the mean shift algorithm is pushing data points according to the gradient path of a kernel density estimator to the density function.
To sum up, the mode clustering is
- find the estimated density function,
- evaluate the gradient of the estimated density,
- shift data points according to the gradient.
Since the density estimator is the kernel density estimator, the statistical properties are well-studied. Thus, we can study the properties for the mode clustering by using theories for the kernel density estimator.
Interestingly, the mode clustering is not well-known to the statistical society. But I believe that in the future, this method should attract more attentions from statisticians since compared with other methods, the mode clustering has a well-established theory.
Reference: Enhanced Mode Clustering. (2014) Yen-Chi Chen, Christopher R. Genovese, Larry Wasserman. http://arxiv.org/abs/1406.1780