Machine Learning, Tutorial
Get the Optimal K in K-Means Clustering
A quick-start guide to finding the optimal number of clusters in K-means clustering
Introduction
Clustering (or cluster analysis) identifies and groups data points that are closely related, according to some measure of distance, into blobs or segments. Clustering is classified as unsupervised learning within the machine learning space, which means no labeled data is required.
It is one of the important machine learning techniques for businesses that utilize data science. A few example use cases:
- In the financial industry, suspicious activities and individuals can be identified using anomaly detection
- In biology, clustering is used to find groups of genes with similar properties or expression patterns
- In marketing science, clustering is used to identify segments of similar customers based on various characteristics. With these segments, a business can develop personalized marketing strategies and programs to tackle each segment accordingly
K-means clustering (not to be confused with the k-nearest neighbors classifier) is popular for cluster analysis in data mining. The method aims to partition the given observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
K-means clustering is categorized as prototype-based clustering because each cluster is represented by a prototype, which can either be the centroid (average) of similar points with continuous features, or the medoid (the most representative or most frequently occurring point) in the case of categorical features.
Choosing the right number of clusters k can be a bit tricky. This article covers the fundamentals of k-means clustering and how to choose the right number of clusters, which should help you kick-start your clustering project.
Fundamentals
There are multiple clustering algorithms, based on alternative strategies for solving the problem. There are two main clustering techniques:
- Hard clustering techniques, where each observation must only belong to a single cluster
- Soft (or fuzzy) clustering techniques, an alternative approach in which each observation is given a membership score that defines how compatible it is with each cluster.
The k-means clustering algorithm will:
1. Randomly pick sample points as the starting cluster centers (or centroids).
2. Iteratively move the centroids to the centers of the samples assigned to them. This iterative approach minimizes the within-cluster sum of squared errors (SSE), which is sometimes called cluster inertia.
3. Repeat step (2) until the centroids move by less than a defined tolerance or the maximum number of iterations is reached.
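As a minimal sketch of these steps in code, assuming scikit-learn is available and using a placeholder feature matrix X, the whole loop is handled by KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder data; replace with your own feature matrix

# n_clusters = k, init="random" picks random samples as starting centroids,
# tol and max_iter control the stopping conditions of the iterative step
km = KMeans(n_clusters=3, init="random", n_init=10, max_iter=300, tol=1e-4, random_state=42)
labels = km.fit_predict(X)   # cluster index assigned to each sample
print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # within-cluster SSE (cluster inertia)
```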
When the centroids move, the algorithm computes the squared Euclidean distance to measure the similarity between the samples and the centroids. Hence, k-means is very good at identifying clusters with a spherical shape (because it uses squared Euclidean distance). That is also a drawback: in reality the data points often are not shaped that way, k-means does not account for variance, and it suffers from the curse of dimensionality in high dimensions.
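For reference, the within-cluster SSE (inertia) that k-means minimizes can be written as

SSE = \sum_{i=1}^{n} \min_{j \in \{1, \dots, k\}} \lVert x^{(i)} - \mu^{(j)} \rVert_2^2

where \mu^{(j)} is the centroid of cluster j and n is the number of samples.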
Another drawback of k-means is that the user needs to specify the number of clusters, k, up front. A poor choice can lead to poor analysis and clustering performance. In this article, I will show several methods for effectively selecting the number of clusters for k-means clustering.
What is my “k”?
There is no definitive answer to the question of what the 'best' number of clusters is. The optimal number of clusters is somewhat subjective (especially in the business world) and depends on the method used to measure similarity and the parameters used during clustering.
There are two broad categories of methods:
- Direct methods optimize a criterion (e.g. the within-cluster sum of squares). The elbow and silhouette methods fall into this category.
- Statistical testing methods compare the evidence against a null hypothesis. The gap statistic is one example.
Here is a sample of data points for the demonstration; let's look at each of these methods in turn.
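The article's original dataset is not reproduced here, so as a stand-in the snippets below assume a small synthetic dataset X generated with scikit-learn's make_blobs (five slightly overlapping blobs); your exact plots and scores will differ from the figures described in the text.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 5 slightly overlapping 2D blobs
# (the article's real dataset is not shown here)
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.8, random_state=42)

plt.scatter(X[:, 0], X[:, 1], s=20)
plt.title("Sample data points")
plt.show()
```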
1. Elbow method
The well-known elbow method identifies the number of clusters based on the assumption that the optimal number of clusters should produce a small inertia, or total intra-cluster variation. However, the inertia keeps decreasing as more clusters are added, so there is a trade-off between the inertia and the number of clusters.
With this concept, we should choose the number of clusters at the point where adding another cluster no longer improves the inertia by much.
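A minimal sketch of the elbow curve, assuming the stand-in X defined earlier and scikit-learn's KMeans (with the article's data the curve bends at k=3; the stand-in data may bend elsewhere):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of k and record the inertia (within-cluster SSE)
k_values = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in k_values]

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster SSE)")
plt.title("Elbow method")
plt.show()
```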
On this data, we can see that the major drop in inertia happens up to 3 clusters, after which the curve flattens out; this bend is what we normally call the elbow.
We can then use this as the number of clusters, as displayed below. Are the clusters similar to your intuition?
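Continuing with the same assumed X, a short sketch of fitting k=3 and plotting the resulting clusters:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the stand-in dataset defined in the earlier snippet
km3 = KMeans(n_clusters=3, n_init=10, random_state=42)
labels3 = km3.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels3, s=20, cmap="viridis")
plt.scatter(km3.cluster_centers_[:, 0], km3.cluster_centers_[:, 1], c="red", marker="x", s=100)
plt.title("K-means with k=3 (elbow method)")
plt.show()
```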
2. Silhouette method
The silhouette score measures how well an observation is clustered by comparing its average distance to points in its own cluster against its average distance to points in the nearest neighboring cluster. The goal is to find the number of clusters that produces a subdivision of the dataset into dense blocks that are well separated from each other. For more information, I would refer to the implementation guide and definition here.
The value ranges between -1 and 1; a value near 0 indicates overlapping clusters, and negative values generally indicate that an observation has been assigned to the wrong cluster.
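A sketch of computing the average silhouette score over a range of k, using scikit-learn's silhouette_score and the same assumed X:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette needs at least 2 clusters, so start at k=2
k_values = range(2, 11)
sil_scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil_scores.append(silhouette_score(X, labels))

plt.plot(k_values, sil_scores, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Average silhouette score")
plt.title("Silhouette method")
plt.show()
```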
We can clearly see how the silhouette score peaks at k=3, which indicates that there are 3 dense, well-separated agglomerates.
Moreover, we can go further with the silhouette plot, which displays a measure of how close each point in one cluster is to points in the neighboring clusters.
A sample visualization is shown below.
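A rough sketch of building such a silhouette plot with scikit-learn's silhouette_samples, again on the assumed X:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

k = 3  # try other values of k and compare the plots
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
sample_sil = silhouette_samples(X, labels)

y_lower = 10
for j in range(k):
    # Sort the silhouette values of the samples belonging to cluster j
    vals = np.sort(sample_sil[labels == j])
    y_upper = y_lower + len(vals)
    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, vals)
    plt.text(-0.05, y_lower + 0.5 * len(vals), str(j))
    y_lower = y_upper + 10  # gap between clusters

plt.axvline(sample_sil.mean(), color="red", linestyle="--")  # average silhouette score
plt.xlabel("Silhouette coefficient")
plt.ylabel("Cluster")
plt.title(f"Silhouette plot for k={k}")
plt.show()
```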
Even though the silhouette score is highest at k=3, looking at the clustered data visually, I would prefer to use k=4 (or even 5) clusters. What do you think?
3. Gap Statistic Method
The gap statistic was developed by Tibshirani, Walther, and Hastie in their 2001 paper. It compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data. The optimal choice of k is the value that maximizes the gap (meaning that the clustering structure is far away from a random uniform distribution of points).
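scikit-learn does not ship a gap statistic, so here is a simple sketch following the paper's recipe: draw a few uniform reference datasets over the data's bounding box and compare the log of the within-cluster dispersion. (The paper's formal rule picks the smallest k whose gap is within one standard error of the next one; taking the argmax below is a simplification.)

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=10, random_state=42):
    """Rough sketch of the Tibshirani et al. (2001) gap statistic."""
    rng = np.random.RandomState(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        # log of within-cluster dispersion on the real data
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_)
        # expected log dispersion under uniform reference datasets
        ref_log_wks = []
        for _ in range(n_refs):
            X_ref = rng.uniform(mins, maxs, size=X.shape)
            ref_log_wks.append(
                np.log(KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_ref).inertia_)
            )
        gaps.append(np.mean(ref_log_wks) - log_wk)
    return np.array(gaps)

gaps = gap_statistic(X)
best_k = int(np.argmax(gaps)) + 1
print("k that maximizes the gap:", best_k)
```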
We can then take k=5 from the gap statistic method, use it in KMeans, and visualize the clustering result.
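Continuing with the same assumed X, a short sketch of fitting k=5 and plotting the result:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the stand-in dataset defined in the earlier snippet
km5 = KMeans(n_clusters=5, n_init=10, random_state=42)
labels5 = km5.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels5, s=20, cmap="viridis")
plt.scatter(km5.cluster_centers_[:, 0], km5.cluster_centers_[:, 1], c="red", marker="x", s=100)
plt.title("K-means with k=5 (gap statistic)")
plt.show()
```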
This is actually the only method that correctly clustered the data into 5 clusters.
These are not the only techniques for finding the optimal number of clusters k. There are other techniques, such as the Calinski-Harabasz index and cluster instability, each with its own pros and cons.
Hopefully, this gives you some intuition about k-means clustering and how to find the optimal k for your clusters.
For the full code for this tutorial, please visit my GitHub repository: