Have You Tried All of These Types of Clustering Algorithms?

Mariann Beagrie
9 min read · Mar 11, 2022

Clustering is a widely used data-mining technique with many applications, including making recommendations, market analysis, customer segmentation, medical analysis and risk analysis. The aim of clustering algorithms is to put instances into groups based on their similarity. Clustering can be used as part of exploratory analysis, as one stage of a data-mining pipeline alongside other algorithms, or on its own.

There are many different clustering algorithms, and each has its own advantages and disadvantages. The two main categories are hierarchical and partitional. Hierarchical clustering algorithms create clusters by either dividing or combining clusters until they reach the desired or ideal number. Partitional algorithms immediately create the desired number of clusters and then continually adjust them until the algorithm determines it has made the best clusters it can.

Hierarchical

Hierarchical clustering algorithms can be divided into two types: agglomerative and divisive. Agglomerative algorithms start with every data point as its own cluster. Clusters are continually combined based on a linkage measure until the desired number of clusters is reached. Divisive algorithms go in the opposite direction: all of the data starts as one big cluster, and clusters are continually split based on the linkage measure until the desired number of clusters is reached.

Common features:

  • Splitting or joining is based on a distance measure called the linkage measure (see the sketch after this list). Standard linkage measures are:
  • Single: based on the shortest distance from a point to its nearest neighbour in the other cluster. This measure tends to lead to elongated clusters.
  • Average: based on the mean or median distance to all points in the other cluster
  • Complete: based on the longest distance from a point to the farthest point in the other cluster. This measure tends to lead to very compact clusters.
  • Can be used with any data types
  • Sensitive to noise and outliers
  • Can take a very long time to run or use a lot of memory
  • Once a data point is assigned, it is never reassigned – even if there would later be a better cluster for it.
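
To make the effect of the linkage measure concrete, here is a minimal sketch using scikit-learn's AgglomerativeClustering. The blob dataset and the choice of three clusters are assumptions made only for illustration.

```python
# A minimal sketch: agglomerative clustering with different linkage measures.
# The toy dataset and the number of clusters are assumed for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for linkage in ("single", "average", "complete"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    # "single" tends to chain points into elongated clusters,
    # while "complete" tends to produce compact ones.
    print(linkage, "cluster sizes:", [list(labels).count(c) for c in range(3)])
```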

Partitional

With partitional clustering, the data points are all immediately assigned to a cluster. Then the algorithm uses a distance measurement to check if all the points are in the best cluster. Adjustments are made if there is a better cluster for a point. Sometimes moving one point changes the best cluster for other points, so the algorithm continually checks all points until it converges. Convergence means that all points are in the best cluster they can be in, at least given the starting point of that particular run of clustering. There are four categories of partitional clustering: hard/crisp, fuzzy, mixture and square error. In addition, there are six categories of hard/crisp clustering methods: graph, density-based, subspace, model, search and “everything else”.

Common Features:

  • Not guaranteed to find the “global” best pattern. This means the algorithm will find the best groupings from whatever clusters it starts with; still, there could be even better groupings if different clusters had been the starting point.
  • They are often run several times with different starting clusters, and the best set from all runs is chosen (see the sketch after this list).
  • The number of clusters must be decided by the user before the algorithm is run. It can be hard to determine the ideal number of clusters ahead of time.
  • Most partitional algorithms produce circular or spherical (globular) clusters.
  • They do not work well with high-dimensional data.
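
The two practical points above, running the algorithm several times and choosing the number of clusters up front, can be illustrated with k-means. This is only a sketch; the candidate range of cluster counts and the use of a silhouette score are assumptions chosen for the example.

```python
# A sketch of a partitional algorithm (k-means) run with several random
# restarts (n_init) while comparing candidate cluster counts by silhouette score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):  # candidate numbers of clusters (assumed range)
    km = KMeans(n_clusters=k, n_init=10, random_state=0)  # 10 restarts, best kept
    labels = km.fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```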

Types

Hard/Crisp: Each data point is assigned to only one cluster. There are many ways of assigning points to clusters, and these each form a different category of clustering:

  • Graph: Graph-based clustering. A graph is made with all the data points. This means each point is connected to other points by lines called “edges.” Edges that don’t meet the linkage criteria are dropped. This results in many smaller graphs that are not connected to each other. These are then reconnected using agglomerative clustering.
  • Density-based Clustering: Creates clusters based on how many points are together in a given space. Data points close to many other data points are in a high-density space and will become part of a cluster. Points that are in low-density spaces will not become part of a cluster. Density-based clustering can be very efficient. However, the usefulness of the results depends on the density threshold chosen. If too small a threshold is selected, clusters that should be separate will be joined together. If the threshold chosen is too high, some clusters will be lost. Density-based methods do not work well in high-dimensional spaces. If you want to see how they work in action, this video by StatQuest explains it very clearly.
  • Subspace clustering: Finding clusters based on subsets of the features. Basically, not all of the variables provided to the algorithm are used to calculate membership to all clusters. They can find clusters that could not be found by looking at all of the features together. However, they can be very computationally expensive.
  • Model-based Clustering: The algorithm creates a model of the probability distribution of the clusters. It then assigns points to each cluster based on how likely they are to belong to that cluster’s probability distribution.
  • Search: These clustering algorithms determine the ideal number of clusters themselves. They are often inspired by processes in nature. There are two main types: evolutionary and swarm. Both types are more likely to find the global best solution (the best solution possible for that dataset) than other partitioning algorithms. They are also more flexible than most other algorithms and can handle clusters of varying shapes, sizes, and densities. They work well with high-dimensional data, often finding solutions faster than traditional algorithms.
  • Other Hard/Crisp Methods:
  • Time-series: The clustering of time series data. The added dimension of time to data creates new challenges for clustering it. Time-series methods have been developed to deal with these challenges.
  • Streaming Clustering: The clustering of streaming data. Streaming data comes with its own challenges based on its continually changing nature. Unique algorithms have been developed to address some of these challenges.
  • Mode-seeking Clustering: Assigns cluster labels by locating the modes (density peaks) of the data, as mean shift does (see the sketch after this list).
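
For the mode-seeking category, mean shift is a common example: candidate centers are moved toward peaks (modes) of the estimated point density, so the number of clusters is not set by the user. Below is a minimal scikit-learn sketch; the toy data and the bandwidth quantile are assumptions.

```python
# A sketch of mode-seeking clustering with mean shift: cluster centers are
# shifted toward local density peaks until they converge on the modes.
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)

bandwidth = estimate_bandwidth(X, quantile=0.2)  # kernel width (assumed quantile)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)
print("clusters found:", len(ms.cluster_centers_))
```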

Square-Error: Points are assigned to clusters based on the sum of the square error. The algorithm tries to place each point into the cluster that minimizes the square error, which is measured from each point to the center of its cluster. Depending on the algorithm, the center might be a real data point or a prototype created to best represent the features of the cluster.
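
The square-error criterion is simple to state in code. Here is a small NumPy sketch of the sum of squared error (SSE) for a given set of cluster centers; the toy points and centers are assumptions used only to show the calculation.

```python
# A sketch of the square-error criterion: each point contributes the squared
# distance to the center of the cluster it is assigned to.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.5]])  # toy data (assumed)
centers = np.array([[1.2, 1.5], [8.2, 7.8]])                         # two cluster centers

# Assign each point to its nearest center, then sum the squared distances.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assignments = dists.argmin(axis=1)
sse = np.sum((points - centers[assignments]) ** 2)
print("SSE:", sse)
```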

Fuzzy Clustering: Fuzzy clustering assigns each point a probability to show how likely it is to belong to each cluster. The results indicate the probability of each point belonging to a specific cluster. The probabilities are normalized to a range between 0 and 1: 0 means a point has no likelihood of belonging to a cluster, and 1 means it belongs with a probability of 100%. Fuzzy clustering has many of the same features as k-means, but is slightly more computationally expensive.
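
As a rough sketch of what fuzzy memberships look like, the snippet below turns inverse distances to two centers into normalized memberships, so each point's memberships lie between 0 and 1 and sum to 1. This is a simplified illustration, not the full fuzzy c-means update; the toy data, centers and weighting are assumptions.

```python
# A simplified sketch of fuzzy memberships: each point gets a degree of
# membership in every cluster, normalized to sum to 1 across clusters.
import numpy as np

points = np.array([[1.0, 1.0], [5.0, 5.0], [3.0, 3.0]])  # toy data (assumed)
centers = np.array([[1.0, 1.0], [5.0, 5.0]])             # two cluster centers (assumed)

dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
inv = 1.0 / (dists + 1e-9)                           # closer centers get more weight
memberships = inv / inv.sum(axis=1, keepdims=True)   # each row now sums to 1
print(memberships)
```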

Mixture Models: A probability distribution is a function that describes how likely different outcomes are. Mixture models create clusters based on the idea that the data is composed of points that belong to different probability distributions. The algorithm starts with several probability distributions, then calculates the likelihood that each data point belongs to each distribution. It then updates the probability distributions based on the points assigned to them. It iteratively updates points and distributions until it reaches convergence. Mixture models can be used with data of any type of distribution, although they are normally used with data that has a normal (Gaussian) distribution. They are computationally expensive, but they can find clusters of different sizes and any shape. They do not work well with very large datasets, clusters that contain only a few data points, noise or outliers. The EM (Expectation Maximization) algorithm is often used to form the clusters. This video and this article explain how this model works in more detail.
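
Here is a minimal sketch of a Gaussian mixture model fit with EM using scikit-learn, where predict_proba returns the per-cluster membership probabilities described above. The dataset and the number of components are assumptions.

```python
# A sketch of a mixture model: EM fits several Gaussian components, and each
# point gets a probability of belonging to each component.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)                    # EM alternates point responsibilities and parameter updates
probs = gmm.predict_proba(X)  # soft memberships, one column per component
print(probs[:3].round(3))
```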

With so many different types of clustering algorithms, it can be challenging to decide which ones to try. Here are some things to think about when deciding on which ones to use:

  • Does the type of clustering match the goals of the task? Do you need hierarchical clusters, or do you want each point to belong to exactly one cluster? Is it better to have probabilities for membership? Is it essential that all data be clustered, or is it better to have really well-defined clusters?
  • How important is it that the items clustered together be very similar? If it is crucial, it would be better to choose a method that results in globular clusters. If less similarity is okay, or even necessary, for the task, then algorithms that allow for clusters of different shapes and sizes would be better.
  • Do the characteristics of the dataset fit the requirements of the algorithm? If you have a high-density dataset, choosing an algorithm that works well with this kind of data would be better.
  • How do you want the algorithm to handle noise and outliers? Some algorithms exclude these data points. Others use them, but it results in less coherent clusters. There are even algorithms that group these points together.
  • How many items or attributes are in the dataset? If your dataset has many features, you should choose an algorithm that works well in high-dimensional spaces. If your dataset has a lot of items, a less computationally expensive algorithm would make a good choice.
  • Will the results format give you the information you need for your task?

Common Clustering Algorithms

Common hierarchical algorithms

BIRCH

  • Works for data that can be summarized using averages.
  • Handles outliers
  • With the proper parameter selection, it can be much less computationally expensive than many other algorithms
  • Works with large datasets
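
BIRCH is available in scikit-learn. A minimal sketch is below; the threshold and branching factor shown are scikit-learn's defaults and are assumptions for this example rather than tuned values.

```python
# A sketch of BIRCH: it builds a compact summary tree (CF tree) of the data,
# which keeps memory use low on large datasets.
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

X, _ = make_blobs(n_samples=10_000, centers=3, random_state=3)

birch = Birch(n_clusters=3, threshold=0.5, branching_factor=50)
labels = birch.fit_predict(X)
print("points per cluster:", [list(labels).count(c) for c in range(3)])
```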

CURE

  • Handles outliers
  • Can handle clusters of different shapes and sizes
  • Works with numerical datasets
  • Works with large datasets

COBWEB

  • Works well with categorical data
  • Neither agglomerative nor divisive — assigns points to clusters one at a time.

SWIFT

  • Finds rare clusters
  • Works with large datasets

Common partitional algorithms

Graph-based

CHAMELEON

  • The data is represented as a graph. Each data point is a node, and each edge has a weight that indicates how close the points it connects are.
  • The graph is “sparsified” by removing edges below a certain threshold
  • This reduces the number of connections that must be considered
  • Graph partitioning is then applied to the sparsified graph to create initial sub-clusters, which are later merged agglomeratively

OPOSSUM

  • Very quick
  • Clusters are about the same size
  • When many clusters are created, they are usually very pure bits of a larger cluster. These could then be combined in post-processing

Density-based

DBSCAN

  • Labels low-density points as noise and assigns all other points to a cluster
  • Handles clusters of different sizes and shapes
  • Not affected by noise
  • Performs poorly with clusters of differing densities
  • Definition of density must be meaningful for the data
  • Makes no assumptions about the distribution of data
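
A minimal DBSCAN sketch with scikit-learn follows. The eps and min_samples parameters are the density settings discussed above; the values here are assumptions for the toy dataset, not recommendations.

```python
# A sketch of DBSCAN: points in dense regions form clusters, and points that
# are not density-reachable from any cluster are labeled -1 (noise).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=400, noise=0.05, random_state=5)

db = DBSCAN(eps=0.2, min_samples=5)  # eps/min_samples define "dense" (assumed values)
labels = db.fit_predict(X)
print("clusters:", len(set(labels) - {-1}), "noise points:", list(labels).count(-1))
```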

Subspace

CLIQUE

  • Based on the Apriori algorithm, which helps it efficiently search subspaces for clusters
  • Can efficiently summarize which cells are in a cluster
  • Clusters can overlap — this can make it difficult to interpret the results
  • Time complexity can be exponential
  • Not affected by the order the data is processed

DENCLUE

  • A density-based method
  • Uses a kernel density function to estimate the density around each point
  • Computationally very expensive
  • Can handle noise and outliers
  • Can find clusters of different sizes and shapes
  • Does not work well in high-dimensional space
  • Does not work well with clusters of very different densities

Model-based

Self Organizing Maps (SOM)

  • Clusters are ordered. This means that clusters near each other are more closely related than clusters that are farther apart.
  • Easier to interpret results because of the way that the clusters are ordered
  • It can be challenging to choose settings that will lead to good results. Settings that must be selected include the grid type, the number of centroids, the neighborhood function and its parameters.
  • SOM clusters are sometimes composed of more than one “natural” cluster, and sometimes one natural cluster is broken into several SOM clusters
  • Not guaranteed to converge
  • It does not have an objective function, which can make it difficult to compare results
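
If you want to try a SOM in Python, the third-party minisom package is one option. The sketch below follows its documented interface; the package choice, grid size, learning settings and iteration count are all assumptions for illustration.

```python
# A sketch of a self-organizing map using the third-party "minisom" package
# (pip install minisom). Grid size and training settings are assumed values.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
data = rng.random((200, 4))          # toy 4-dimensional data

som = MiniSom(6, 6, 4, sigma=1.0, learning_rate=0.5, random_seed=0)  # 6x6 grid of centroids
som.train_random(data, 1000)         # 1000 training iterations

# Each point maps to the coordinates of its best-matching grid node; nearby
# nodes correspond to more closely related clusters.
print(som.winner(data[0]))
```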

Square-error

K-Means

  • Uses a similarity measure to determine how close data points are to each other.
  • Different measures can be chosen depending on the reason for the clustering
  • Simpler measures are usually chosen to reduce computational complexity
  • Random selection of initial centroids can lead to poor results. There are different strategies to improve the results; one of the more effective ones is called k-means++.
  • Greatly affected by outliers. For this reason, they are usually removed before running k-means.
  • Performs poorly with non-globular clusters and clusters of differing sizes or densities
  • Uses all attributes when looking for clusters
  • Finds clusters that are not well separated
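
A minimal k-means sketch with scikit-learn, using the k-means++ initialization mentioned above and reporting the square error it minimizes. The dataset and the number of clusters are assumptions.

```python
# A sketch of k-means with k-means++ initialization and multiple restarts.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=2)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=2)
labels = km.fit_predict(X)
print("SSE (inertia):", round(km.inertia_, 2))  # the square error k-means minimizes
```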

References

A. E. Ezugwu et al., “A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects,” Engineering Applications of Artificial Intelligence, vol. 110, p. 104743, Apr. 2022, doi: 10.1016/j.engappai.2022.104743.

A. Biarnes, “Gaussian Mixture Models and Expectation-Maximization (A Full Explanation),” Sep. 12, 2020. https://towardsdatascience.com/gaussian-mixture-models-and-expectation-maximization-a-full-explanation-50fa94111ddd

C. Aggarwal, “An Introduction to Cluster Analysis,” in Data Clustering, C. Aggarwal and C. Reddy, Eds. Boca Raton, FL: CRC Press, 2015.

J. Starmer, “Clustering with DBSCAN, Clearly Explained!!!,” StatQuest with Josh Starmer. https://www.youtube.com/watch?v=RDZUdRSDOok

P.-N. Tan, M. Steinbach, A. Karpatne, and V. Kumar, Introduction to Data Mining (2nd Edition), 2nd ed. Pearson, 2018.

V. Lavrenko, “EM algorithm: how it works.” https://www.youtube.com/watch?v=REypj2sy_5U


Mariann Beagrie

I have taught in the US, Germany, South Korea and China. I recently completed a degree in Computer Science. I love traveling, reading and learning.