How to perform fast and explainable clustering using CLASSIX?
A cluster is a group of homogeneous objects; in other words, objects with similar properties are collected in one cluster, while objects with dissimilar properties are collected in another. Clustering is the process of partitioning objects into groups so that the objects in each group are substantially more similar to each other than to those in the other groups. Various clustering algorithms are in common use, such as k-means and mean shift. In this article, we discuss CLASSIX, a clustering toolkit that is fast and accurate and can also explain how its clusters were formed. The main points covered in this article are listed below.
- What is clustering?
- How does CLASSIX aggregate data?
- Implementation of CLASSIX in Python
Let’s talk about clustering first.
What is clustering?
Clustering is the process of putting things together so that members of the same group (cluster) have more in common with each other than with members of other groups. Clustering examines all of the input data and is commonly used in machine learning (ML).
When machine learning practitioners create a cluster, they look at all the different data points and group them based on characteristics they have in common with other data. The algorithm determines the clustering strategy.
Clustering procedures may involve calculating the average distance between data points in a multidimensional space, dividing the data into intervals, estimating the number of clusters, or identifying dense regions of the data. Ideally, a clustering method also makes explicit how data points are linked and why each data point belongs to its cluster.
How does CLASSIX aggregate data?
Distance-based clustering algorithms, such as k-means, consider the pairwise distance between points to decide whether or not they should be clustered. DBSCAN and other density-based clustering algorithms take a more global approach, assuming that data occurs in continuous high-density areas surrounded by low-density regions.
Many density-based clustering methods have the advantage of being able to handle clusters of any shape without the number of clusters having to be defined in advance. On the other hand, they generally require more parameter tuning.
CLASSIX is a method that shares characteristics of both distance-based and density-based methods. The approach is divided into two steps: aggregation and merging. During the aggregation phase, data points are sorted by their first principal component and then grouped using a greedy aggregation technique.
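The aggregation phase described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function names `first_principal_scores` and `greedy_aggregate` are hypothetical, the first principal direction is estimated by power iteration, and any preprocessing the library may apply to the data is omitted.

```python
import math

def first_principal_scores(points, iters=100):
    """Project centred points onto an estimate of the first principal direction."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centred = [[p[k] - mean[k] for k in range(d)] for p in points]
    # Start from a unit vector; power iteration applies X^T (X v) repeatedly.
    v = [float(k + 1) for k in range(d)]
    v_norm = math.sqrt(sum(c * c for c in v))
    v = [c / v_norm for c in v]
    for _ in range(iters):
        xv = [sum(row[k] * v[k] for k in range(d)) for row in centred]
        w = [sum(centred[i][k] * xv[i] for i in range(n)) for k in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # degenerate data: keep the current unit vector
        v = [c / norm for c in w]
    return [sum(row[k] * v[k] for k in range(d)) for row in centred]

def greedy_aggregate(points, radius):
    """Assign each point to a group; return (group ids, indices of starting points)."""
    scores = first_principal_scores(points)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    group = [-1] * len(points)
    starts = []
    for pos, i in enumerate(order):
        if group[i] != -1:
            continue  # already absorbed by an earlier starting point
        group[i] = len(starts)
        starts.append(i)
        for q in range(pos + 1, len(order)):
            j = order[q]
            # Scores are sorted, so no later point can be within `radius` of point i.
            if scores[j] - scores[i] > radius:
                break
            if group[j] == -1 and math.dist(points[i], points[j]) <= radius:
                group[j] = group[i]
    return group, starts
```

Each unassigned point in sorted order becomes the starting point of a new group and absorbs the unassigned points within `radius` of it. The inner loop stops early because the projection uses a unit vector, so a score gap larger than `radius` already rules out all remaining points.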
Sorting is essential for traversing the data with near-linear complexity, provided that the number of pairwise distance calculations stays small. Although the initial sorting costs O(n log n) operations, it is performed only on scalar values, regardless of the dimensionality of the data points. Therefore, the cost of this initial sorting is almost insignificant compared to computations on full-dimensional data.
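Why sorting on a scalar value keeps the number of distance calculations small can be made explicit with a short bound. The notation is introduced here for illustration only: $v$ is a unit vector (the first principal direction), $\alpha_i = v^{\top} x_i$ is the score of point $x_i$, and $r$ is the radius. By the Cauchy-Schwarz inequality:

```latex
\[
|\alpha_i - \alpha_j| = |v^{\top}(x_i - x_j)|
  \le \|v\|_2 \, \|x_i - x_j\|_2 = \|x_i - x_j\|_2 ,
\]
\[
\text{so}\qquad \alpha_j - \alpha_i > r \;\Longrightarrow\; \|x_i - x_j\|_2 > r .
\]
```

Once the scores are sorted, the search for neighbours of $x_i$ can therefore stop at the first $x_j$ whose score exceeds $\alpha_i + r$, because every later point is guaranteed to lie outside the radius.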
After the aggregation step, the overlapping groups are merged into clusters using either a distance-based or a density-based criterion. Although the density-based merging criterion produces slightly better clusters than the distance-based criterion, the latter is significantly faster. CLASSIX is controlled by only two parameters, so its configuration is simple.
In summary, the radius parameter specifies the grouping tolerance in the aggregation phase, while the minPts parameter determines the smallest allowed cluster size. This is similar to the parameters used in DBSCAN; however, because of the initial sorting of the data points, CLASSIX does not perform spatial range searches for each data point.
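The merging step can be illustrated with a union-find structure over the groups' starting points. This is a sketch under stated assumptions rather than the library's code: the function name `merge_groups` is hypothetical, and the rule "merge two groups when their starting points are within `scale * radius` of each other" with `scale=1.5` is an assumed stand-in for a distance-based criterion.

```python
import math

def merge_groups(start_points, radius, scale=1.5):
    """Return one cluster label per group, merging groups whose starting
    points lie within scale * radius of each other (assumed criterion)."""
    parent = list(range(len(start_points)))

    def find(a):
        # Follow parent links to the root, halving the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(len(start_points)):
        for b in range(a + 1, len(start_points)):
            if math.dist(start_points[a], start_points[b]) <= scale * radius:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[rb] = ra  # union the two components

    # Relabel the roots as consecutive cluster ids 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(a), len(roots)) for a in range(len(start_points))]
```

Merging is transitive: two groups end up in the same cluster if they are connected through a chain of overlapping groups, which is how clusters of arbitrary shape can emerge from small, roughly spherical groups.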
Implementation of CLASSIX in Python
In this section, we will perform clustering on the Iris dataset, dropping the target column to make the problem completely unsupervised. As mentioned earlier, we will use the CLASSIX method to cluster the data. Here I set the radius to 0.35 (found by trial and error), the group merging method to density, and the minimum number of points per cluster to 3.
Now, let’s get started with a quick install, import dependencies, and prepare the dataset.
# install the library (notebook cell)
!pip install ClassixClustering

# imports
import pandas as pd
import matplotlib.pyplot as plt
from classix import CLASSIX

# prepare data: load the Iris CSV and drop the target column
data = pd.read_csv('/content/IRIS.csv')
data.drop(['species'], inplace=True, axis=1)
Now we just have to create the CLASSIX object with the parameters mentioned above and fit it to the data.
# initialize the clustering
clx = CLASSIX(radius=0.35, minPts=3, group_merging='density')

# fit the data
clx.fit(data)
After fitting, the method reports a summary of the clustering results.
As we have set minPts to 3, the algorithm reassigns the points of any cluster with fewer than minPts points to larger clusters. Now let’s check this visually.
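The effect of minPts can be sketched as a post-processing step on a labelled dataset. This is a simplified illustration, not the library's procedure: the function name `reassign_small_clusters` is hypothetical, and the rule used here (each point of a too-small cluster takes the label of its nearest point in a sufficiently large cluster) is an assumption chosen for clarity.

```python
import math
from collections import Counter

def reassign_small_clusters(points, labels, min_pts):
    """Move points of clusters with fewer than min_pts members to the
    cluster of their nearest point in a large-enough cluster."""
    sizes = Counter(labels)
    # Indices of points that already belong to a large-enough cluster.
    large = [i for i, lab in enumerate(labels) if sizes[lab] >= min_pts]
    if not large:
        return list(labels)  # nothing to reassign to
    new_labels = list(labels)
    for i, lab in enumerate(labels):
        if sizes[lab] < min_pts:
            nearest = min(large, key=lambda j: math.dist(points[i], points[j]))
            new_labels[i] = labels[nearest]
    return new_labels
```

With `min_pts=3`, a singleton cluster next to a three-point cluster is absorbed by it, while clusters that already have three or more points keep their labels.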
# visualize the clusters (first vs. third feature, coloured by cluster label)
plt.figure(figsize=(5,5))
plt.scatter(data.values[:,0], data.values[:,2], c=clx.labels_)
plt.show()
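Alongside the plot, it can help to tabulate how many points each cluster received. The helper below is a small sketch: `cluster_sizes` is a hypothetical name, and it assumes only that the labels form a sequence of integers, one per data point, such as the `clx.labels_` array used in the plot above.

```python
from collections import Counter

def cluster_sizes(labels):
    """Return {cluster_label: number_of_points}, largest cluster first."""
    counts = Counter(int(lab) for lab in labels)
    return dict(counts.most_common())

# Example with made-up labels; in the notebook one would pass clx.labels_.
example_sizes = cluster_sizes([0, 0, 1, 2, 2, 2])
```

A cluster that holds only a handful of points in such a table is a hint that the radius or minPts setting may need adjusting.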
Beyond that, the algorithm can give a brief explanation of how it clustered the data through its explain() method.
# explain the clusters
clx.explain()
In this article, we first discussed clustering in general. We then looked at CLASSIX, a fast clustering approach based on sorting data points by their first principal component. Fast aggregation of neighboring data points into groups is a crucial feature of CLASSIX. Due to the simplicity of the aggregation and merging processes, the clustering results can be explained, as we have shown. Further experiments on this dataset can be found in the notebook linked in the references.