When It Comes to Clustering Algorithms, Less Is Usually More

Mariann Beagrie
4 min read · Mar 14, 2022
Photo by Brett Jordan on Unsplash

It may seem counterintuitive, but giving a computer more information doesn’t always help it learn. Algorithms learn best when given lots of different examples, but if each example is described by many variables (also called features), performance can suffer. With too many features, clustering uses more memory and takes longer, and on a large dataset the cost can become infeasible. Even when feasibility is not an issue, using too many variables often results in worse clusters.

There are a few ways to reduce the number of variables. The simplest is to look at the variables and choose the ones you think are most relevant to your task. Another is to let the computer help you select the most relevant features, which can be done with filters, wrappers, hybrid approaches or embedded models. You can also use dimensionality reduction techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD). These combine the original features into a smaller set of new values that best represent them, which can significantly shrink the feature space while retaining important information. However, they also make the results much harder to interpret, so we won’t be covering them in this article.

Filter Methods

Filter methods use statistical measures to score the features and then select the ones with the highest scores. Information gain, Fisher’s score, the chi-square score, the Laplacian score and Pearson’s correlation coefficient are some of the measures used. Filter methods can eliminate redundant or irrelevant features from the dataset, either by looking at each feature individually (univariate) or by comparing the features to each other (multivariate). Univariate methods are much more efficient, but they do not address redundant features. Filter methods are the most efficient of the four approaches covered here; however, the clusters produced after using them are often of lower quality than those produced with the other feature selection methods.
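To make this concrete, here is a minimal sketch of a simple filter pipeline on made-up data, using NumPy and scikit-learn: a univariate variance threshold drops near-constant features, then a multivariate Pearson-correlation check drops redundant ones. The dataset, thresholds and library choices are illustrative assumptions, not a prescription from the papers cited below.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import VarianceThreshold

# Toy data: 5 informative features, plus 3 redundant and 2 near-constant ones (made up for illustration).
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X,
               X[:, :3] + rng.normal(0, 0.01, (300, 3)),   # near-duplicates (redundant)
               rng.normal(0, 0.001, (300, 2))])            # near-constant noise (irrelevant)

# Univariate filter: drop features whose variance falls below an arbitrary threshold.
selector = VarianceThreshold(threshold=0.1)
X_filtered = selector.fit_transform(X)

# Multivariate filter: drop one of any pair of features with |Pearson r| > 0.95.
corr = np.corrcoef(X_filtered, rowvar=False)
to_drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if j not in to_drop and abs(corr[i, j]) > 0.95:
            to_drop.add(j)
keep = [i for i in range(X_filtered.shape[1]) if i not in to_drop]
X_reduced = X_filtered[:, keep]

print(X.shape, "->", X_reduced.shape)
```

The variance threshold is the univariate step (each feature judged on its own), and the correlation check is a simple multivariate step that catches redundancy the univariate step misses.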

Wrapper Methods

Wrapper methods typically use a clustering algorithm to choose the best features before clustering the data to find groups. They work by selecting a subset of features, creating clusters from it, and evaluating those clusters, then repeating the process with many different subsets. The search stops when it finds clusters that meet a previously specified criterion. Because it is usually impossible to check every possible subset, this approach does not guarantee that the absolute best selection of features will be found, but it generally gives better results than filter methods. Since each clustering algorithm finds different clusters, the results also depend on the algorithm used inside the wrapper, and different types of clustering algorithms tend to work better with specific wrappers. See the references at the end of this article for more detail on which wrappers suit a particular type of clustering algorithm.
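As a rough illustration, here is a minimal sketch of a greedy forward-search wrapper. It assumes k-means as the clustering algorithm and the silhouette score as the evaluation and stopping criterion; both are arbitrary stand-ins for this example, and the surveys below discuss many other combinations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=0)

def score_subset(X, features, k=4):
    """Cluster on the chosen columns and return the silhouette score."""
    cols = X[:, features]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(cols)
    return silhouette_score(cols, labels)

# Greedy forward search: keep adding whichever feature improves the score most.
selected, remaining = [], list(range(X.shape[1]))
best_score = -1.0
while remaining:
    scores = [(score_subset(X, selected + [f]), f) for f in remaining]
    top_score, top_feature = max(scores)
    if top_score <= best_score:        # stop once no remaining feature helps
        break
    selected.append(top_feature)
    remaining.remove(top_feature)
    best_score = top_score

print("selected features:", selected, "silhouette:", round(best_score, 3))
```

Swapping in a different clustering algorithm or evaluation measure inside `score_subset` will change which features get picked, which is exactly the dependence on the wrapper noted above.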

Hybrid Approaches

Hybrid methods combine filter methods with wrapper methods and get some of the benefits of each. They are more efficient than wrapper methods and generally result in better clusters than filter methods, although they are not used as often with clustering algorithms. In a hybrid approach, a filter is first used to narrow down the feature subsets worth testing, and the wrapper procedure described above is then applied to select the best features. As with wrapper methods, the results will vary depending on the clustering algorithm used to find the best features.
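A brief sketch of the idea follows, reusing the same greedy wrapper loop as the previous example but searching only over a candidate pool pre-screened by a simple variance filter. The filter, the clustering algorithm and the cutoffs are all placeholder choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

# Filter stage: keep only the top half of the features, ranked by variance.
variances = X.var(axis=0)
candidates = list(np.argsort(variances)[::-1][: X.shape[1] // 2])

# Wrapper stage: greedy forward search, but only over the pre-filtered candidates.
def score_subset(features, k=4):
    cols = X[:, features]
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(cols)
    return silhouette_score(cols, labels)

selected, best = [], -1.0
while candidates:
    top_score, top_feature = max((score_subset(selected + [f]), f) for f in candidates)
    if top_score <= best:
        break
    selected.append(top_feature)
    candidates.remove(top_feature)
    best = top_score

print("selected:", selected, "silhouette:", round(best, 3))
```

Because the wrapper only ever evaluates the filtered candidates, far fewer clustering runs are needed than in a pure wrapper search.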

Embedded Models

Embedded models perform feature selection as part of the clustering process itself. There are two main ways this is currently done. In the first, the data is clustered and the cluster assignments are used as labels; a classification-style selection method is then used to pick the best features for those labels. The second uses algorithms with a feature selection method built in. Models that have been proposed include feature-weighting k-means, MCFS (multi-cluster feature selection) and RUFSM (robust unsupervised feature selection via matrix factorization).
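Here is a minimal sketch of the first approach described above: cluster the data, then treat the cluster assignments as labels for a supervised selector. It assumes k-means and scikit-learn's ANOVA-F-based SelectKBest purely as stand-ins; it is not an implementation of MCFS, RUFSM or feature-weighting k-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data with a few uninformative noise columns appended (made up for illustration).
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(size=(300, 5))])

# Step 1: cluster the full data and treat the cluster assignments as labels.
pseudo_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Step 2: run a supervised selector (ANOVA F-score here) against those pseudo-labels.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, pseudo_labels)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
```

In practice this loop can be repeated, re-clustering on the reduced feature set until the selection stabilizes.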

Ensuring that your data does not have redundant or irrelevant attributes will save processing time, reduce the amount of memory needed and produce better results.

References

Alelyani, S., Tang, J., & Liu, H. (2015). Feature Selection for Clustering: A Review. In C. Aggarwal & C. Reddy (Eds.), Data Clustering. CRC Press.

Hancer, E., Xue, B., & Zhang, M. (2020). A survey on feature selection approaches for clustering. Artificial Intelligence Review, 53(6), 4519–4545. https://doi.org/10.1007/s10462-019-09800-w

