Clustering

Abby
5 min read · Jun 11, 2021

Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a given dataset. When we cluster the observations of a data set, our aim is to partition the data into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Clustering is considered part of unsupervised learning, because the goal is segregation, forming subgroups, rather than prediction as in supervised learning. Like PCA, clustering tries to simplify the data through a small number of summaries, though it is worth mentioning that the two methods work quite differently. The job of clustering is to find homogeneous subgroups: to identify them, we choose features and use them to group the observations. In what follows, for simplicity, we will discuss clustering observations.

Clustering of data is very helpful in many industries: in banking (customer segmentation, anomaly detection), in healthcare (categorizing MRI data), in retail (recommending products based on past purchases), and in grouping news articles, to cite a few examples.

Some examples of clustering models are:

  • K-means
  • Hierarchical Clustering
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

In K-means clustering we seek to partition observations into a prespecified number of non-overlapping clusters such that the within-cluster variation is as small as possible.

How does K-means clustering work?

  1. Choose the number of clusters, k.
  2. Select k points at random as the initial centroids.
  3. Assign each data point to the closest centroid; this forms k clusters.
  4. Compute and place the new centroid of each cluster.
  5. Reassign each data point to its new closest centroid, repeating until the cluster assignments stop changing.
  6. This gives the final clusters (a minimal sketch of these steps follows below).
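To make the recipe concrete, here is a minimal from-scratch sketch of these steps in base R. The toy data and the choice k = 3 are assumptions purely for illustration; in practice you would simply call the built-in kmeans() function, shown further below.

set.seed(42)
x <- matrix(rnorm(200), ncol = 2)      # toy data: 100 points in 2 dimensions
k <- 3                                 # step 1: choose the number of clusters
centroids <- x[sample(nrow(x), k), ]   # step 2: pick k random points as the centroids

repeat {
  # step 3: assign each data point to its closest centroid
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)

  # step 4: recompute each centroid as the mean of the points assigned to it
  # (empty clusters are not handled in this sketch)
  new_centroids <- t(sapply(1:k, function(i)
    colMeans(x[assignment == i, , drop = FALSE])))

  # step 5: stop once the centroids (and hence the assignments) no longer change
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}
assignment   # step 6: the final cluster of each data point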

Because we find a local rather than a global optimum, the results obtained will depend on our initial (random) cluster assignment of each observation. Therefore, it is very important to run the algorithm multiple times from different random initial configurations.

How do we choose the right number of K ?

The best outcome is judged by the total within-cluster sum of squares (WCSS).

Model quality changes as the number of clusters changes.

The ideal plot will have an elbow, a point where the quality measure starts improving more slowly as the number of clusters increases. Such a graph indicates that the quality of the model no longer improves substantially as the model complexity (the number of clusters) grows. In other words, the elbow indicates the number of clusters inherent in the data.
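As a rough sketch of this heuristic in R, we can compute the WCSS for a range of values of k and plot it; the toy data below and the range 1 to 10 are assumptions for illustration.

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)   # toy data; replace with your own numeric data

# total within-cluster sum of squares (WCSS) for k = 1, ..., 10
wcss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# the bend ("elbow") in this curve suggests the number of clusters
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")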

To perform K-means clustering in R we use the kmeans() function from the base stats package. Two important things to know about this function are the nstart argument (used to reduce the algorithm's sensitivity to the random selection of the initial cluster centers, by running several random starts and keeping the best result) and the cluster labels it returns (the labels are arbitrary and can change from one run to another).
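A minimal usage sketch, assuming x is a numeric matrix or data frame of observations:

set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25)  # nstart = 25: try 25 random starts, keep the best
km$cluster       # cluster label of each observation (labels may differ between runs)
km$centers       # final cluster centers
km$tot.withinss  # total within-cluster sum of squares (WCSS)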

K-means clustering has several advantages:

  • Easy to implement, with fast performance.
  • The method is scalable and quite efficient for large datasets.
  • It will always converge (to a local optimum).

There are some challenges too!

  • Need to specify the number of clusters beforehand.
  • The clusters are sensitive to the initial assignment of centroids.
  • The algorithm is very sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

Hierarchical Clustering

In hierarchical clustering we do not need to know in advance how many clusters we want; instead, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for every possible number of clusters.

  1. Make each data point a single-point cluster; this forms N clusters.
  2. Take the two closest data points and merge them into one cluster; this forms N-1 clusters.
  3. Take the two closest clusters and merge them into one cluster; this forms N-2 clusters.
  4. Repeat the previous step until only one cluster remains.

There are various ways to measure the distance between clusters (the linkage):

  • Option 1: closest points (single linkage)
  • Option 2: furthest points (complete linkage)
  • Option 3: average distance (average linkage)
  • Option 4: distance between centroids (centroid linkage)

When these distances are converted to heights, they form the dendrogram.

  • Units in the same cluster are joined by a horizontal line.
  • The leaves at the bottom represent individual units.
  • Dendrograms are useful because they provide a visual representation of the clusters.
  • The optimal number of clusters can be determined by finding the longest vertical line that is not crossed by any (extended) horizontal line and cutting the dendrogram through it (see the R sketch below).
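In R, agglomerative clustering and its dendrogram can be obtained with the base dist() and hclust() functions; the four linkage options above correspond to the method argument. The data x and the cut at k = 3 are assumptions for illustration.

d  <- dist(x)                         # pairwise Euclidean distances between observations
hc <- hclust(d, method = "complete")  # "single", "average" and "centroid" are the other options above
plot(hc)                              # draw the dendrogram
clusters <- cutree(hc, k = 3)         # cut the tree into 3 clusters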

Hierarchical clustering algorithms are of two types:

  • Agglomerative: a bottom-up approach in which we start at the individual leaves and successively merge clusters together (AGNES).
  • Divisive: the opposite approach, in which we start at the root and move down, recursively splitting the clusters (DIANA). A brief R sketch of both follows.
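Both variants are implemented in the cluster package (assuming it is installed); x is again an assumed numeric data matrix.

library(cluster)
ag <- agnes(x, method = "complete")  # agglomerative nesting (AGNES), bottom-up
di <- diana(x)                       # divisive analysis (DIANA), top-down
plot(ag)                             # banner plot and dendrogram
plot(di)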

A third approach looks at density instead of distance, namely DBSCAN clustering. Here a cluster is defined as a maximal set of density-connected points, and the approach discovers clusters of arbitrary shape in spatial data with noise.

  1. Randomly select a point p.
  2. Retrieve all points that are reachable from p with respect to Eps (the maximum radius of the neighborhood) and MinPts (the minimum number of points required within the Eps-neighborhood).
  3. If the number of points in the neighborhood is at least MinPts, then p is a core point.
  4. For each core point p, a cluster is formed.
  5. If p is not a core point, mark it as noise and move to the next point.
  6. Continue until all points have been processed (a minimal R sketch follows below).
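A minimal sketch using the dbscan package (assuming it is installed); the Eps and MinPts values below are illustrative assumptions and would normally be tuned, for example with a k-nearest-neighbour distance plot.

library(dbscan)
db <- dbscan(x, eps = 0.5, minPts = 5)  # Eps = 0.5, MinPts = 5: illustrative choices
db$cluster                              # cluster label per point; 0 marks noise points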
