Respect all members: no insults, harassment, or hate speech.
Be tolerant of different viewpoints, cultures, and beliefs. If you do not agree with others, simply create a separate note, article, or collection.
Clearly distinguish between personal opinion and fact.
Verify facts before posting, especially when writing about history, science, or statistics.
Promotional content must be published on the “Related Services and Products” page—no more than one paragraph per service. You can also create subpages under the “Related Services and Products” page and publish longer promotional text there.
Do not post materials that infringe on copyright without permission.
Always credit sources when sharing information, quotes, or media.
Be respectful of the work of others when making changes.
Discuss major edits instead of removing others' contributions without reason.
If you notice rule-breaking, notify the community about it in talks.
Do not share personal data of others without their consent.
Cluster analysis refers to a family of algorithms and tasks rather than one specific algorithm. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek: βότρυς 'grape'), typological analysis, and community detection. The subtle differences are often in the use of the results: while in data mining the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest.
Cluster analysis originated in anthropology with Driver and Kroeber in 1932,[1] was introduced to psychology by Joseph Zubin in 1938[2] and Robert Tryon in 1939,[3] and was famously used by Cattell beginning in 1943[4] for trait theory classification in personality psychology.
The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.[5] There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:
Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in biclustering (also known as co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.
Signed graph models: Every path in a signed graph has a sign from the product of the signs on the edges. Under the assumptions of balance theory, edges may change sign and result in a bifurcated graph. The weaker "clusterability axiom" (no cycle has exactly one negative edge) yields results with more than two clusters, or subgraphs with only positive edges.[6]
A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:
Hard clustering: each object belongs to a cluster or not
Soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (for example, a likelihood of belonging to the cluster)
There are also finer distinctions possible, for example:
Strict partitioning clustering: each object belongs to exactly one cluster
Strict partitioning clustering with outliers: objects can also belong to no cluster, in which case they are considered outliers
Overlapping clustering (also: alternative clustering, multi-view clustering): objects may belong to more than one cluster; usually involving hard clusters
Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
Subspace clustering: a form of overlapping clustering in which, within each uniquely defined subspace, clusters are not expected to overlap
As listed above, clustering algorithms can be categorized based on their cluster model. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms.
There is no objectively "correct" clustering algorithm, but as it was noted, "clustering is in the eye of the beholder."[5] In fact, an axiomatic approach to clustering demonstrates that it is impossible for any clustering method to meet three fundamental properties simultaneously: scale invariance (results remain unchanged under proportional scaling of distances), richness (all possible partitions of the data can be achieved), and consistency between distances and the clustering structure.[7] The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm that is designed for one kind of model will generally fail on a data set that contains a radically different kind of model.[5] For example, k-means cannot find non-convex clusters.[5] Most traditional clustering methods assume the clusters exhibit a spherical, elliptical or convex shape.[8]
Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.
Connectivity-based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance) to use. Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances), and UPGMA or WPGMA ("Unweighted or Weighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).
These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the "chaining phenomenon", in particular with single-linkage clustering). In the general case, the complexity is O(n^3) for agglomerative clustering and O(2^(n-1)) for divisive clustering,[9] which makes them too slow for large data sets. For some special cases, optimal efficient methods (of complexity O(n^2)) are known: SLINK[10] for single-linkage and CLINK[11] for complete-linkage clustering.
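The linkage criteria described above can be tried directly with standard tooling. The following is a minimal sketch, assuming SciPy is available and using illustrative toy data; it is not a reference implementation of SLINK or CLINK.

```python
# Sketch: agglomerative clustering with different linkage criteria (SciPy assumed available).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])          # toy data: two Gaussian blobs

Z_single   = linkage(X, method="single")            # minimum of object distances
Z_complete = linkage(X, method="complete")          # maximum of object distances
Z_average  = linkage(X, method="average")           # UPGMA / average linkage

# Cut the complete-linkage dendrogram at a distance threshold to obtain a flat clustering.
labels = fcluster(Z_complete, t=4.0, criterion="distance")
```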
Linkage clustering examples
Single-linkage on Gaussian data. At 35 clusters, the biggest cluster starts fragmenting into smaller parts, while before it was still connected to the second largest due to the single-link effect.
Single-linkage on density-based clusters. 20 clusters extracted, most of which contain single elements, since linkage clustering does not have a notion of "noise".
In centroid-based clustering, each cluster is represented by a central vector, which is not necessarily a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.
The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well-known approximate method is Lloyd's algorithm,[12] often just referred to as "k-means algorithm" (although another algorithm introduced this name). It does however only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means).
Most k-means-type algorithms require the number of clusters – k – to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid; often yielding improperly cut borders of clusters. This happens primarily because the algorithm optimizes cluster centers, not cluster borders. Steps involved in the centroid-based clustering algorithm are:
Choose k distinct points at random; these are the initial centroids to be improved upon.
Suppose a set of observations, (x1, x2, ..., xn). Assign each observation to the centroid to which it has the smallest squared Euclidean distance. This results in k distinct groups, each containing unique observations.
Exit if the new centroids are equal to those of the previous iteration; otherwise, repeat the assignment and update steps, since the centroids have not yet converged (a minimal sketch of this procedure follows below).
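As a concrete illustration of these steps, here is a minimal NumPy sketch of Lloyd's algorithm. The function name and parameters are illustrative, and the sketch omits refinements such as k-means++ initialization or multiple restarts.

```python
# Minimal sketch of Lloyd's algorithm (k-means); illustrative only.
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each observation to the centroid with the smallest squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned observations.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:              # keep the old centroid if a cluster became empty
                new_centroids[j] = members.mean(axis=0)
        # Exit if the centroids did not change; otherwise iterate again.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```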
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the Expectation-maximization algorithm for this model discussed below.
k-means clustering examples
k-means separates data into Voronoi cells, which assumes equal-sized clusters (not adequate here).
k-means cannot represent density-based clusters.
Centroid-based clustering problems such as k-means and k-medoids are special cases of the uncapacitated, metric facility location problem, a canonical problem in the operations research and computational geometry communities. In a basic facility location problem (of which there are numerous variants that model more elaborate settings), the task is to find the best warehouse locations to optimally service a given set of consumers. One may view "warehouses" as cluster centroids and "consumer locations" as the data to be clustered. This makes it possible to apply the well-developed algorithmic solutions from the facility location literature to the presently considered centroid-based clustering problem.
The clustering framework most closely related to statistics is model-based clustering, which is based on distribution models. This approach models the data as arising from a mixture of probability distributions. It has the advantages of providing principled statistical answers to questions such as how many clusters there are, what clustering method or model to use, and how to detect and deal with outliers.
While the theoretical foundation of these methods is excellent, they suffer from overfitting unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult. Standard model-based clustering methods include more parsimonious models based on the eigenvalue decomposition of the covariance matrices, that provide a balance between overfitting and fidelity to the data.
One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm). Here, the data set is usually modeled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. This will converge to a local optimum, so multiple runs may produce different results. In order to obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.
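A minimal sketch of this approach, assuming scikit-learn is available; the toy data and parameter values are illustrative.

```python
# Sketch: Gaussian mixture model clustering fitted by expectation-maximization (scikit-learn assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (200, 2)),
               rng.normal(4, 1.5, (200, 2))])       # toy data: two Gaussian components

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0)  # several restarts, since EM finds local optima
gmm.fit(X)

hard_labels = gmm.predict(X)              # hard clustering: most likely component per object
soft_memberships = gmm.predict_proba(X)   # soft clustering: posterior probability of each component
```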
Distribution-based clustering produces complex models for clusters that can capture correlation and dependence between attributes. However, these algorithms put an extra burden on the user: for many real data sets, there may be no concisely defined mathematical model (e.g. assuming Gaussian distributions is a rather strong assumption on the data).
Gaussian mixture model clustering examples
On Gaussian-distributed data, EM works well, since it uses Gaussians for modelling clusters.
Density-based clusters cannot be modeled using Gaussian distributions.
In density-based clustering,[13] clusters are defined as areas of higher density than the remainder of the data set. Objects in sparse areas – that are required to separate clusters – are usually considered to be noise and border points.
The most popular[14] density-based clustering method is DBSCAN.[15] In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. Another interesting property of DBSCAN is that its complexity is fairly low – it requires a linear number of range queries on the database – and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run, therefore there is no need to run it multiple times. OPTICS[16] is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter ε, and produces a hierarchical result related to that of linkage clustering. DeLi-Clu,[17] Density-Link-Clustering, combines ideas from single-linkage clustering and OPTICS, eliminating the ε parameter entirely and offering performance improvements over OPTICS by using an R-tree index.
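A minimal usage sketch with the DBSCAN implementation in scikit-learn (assumed available); the eps and min_samples values are illustrative and must normally be tuned to the data.

```python
# Sketch: DBSCAN on toy data (scikit-learn assumed available).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),        # dense blob
               rng.normal(3, 0.3, (100, 2)),        # second dense blob
               rng.uniform(-2, 5, (20, 2))])        # scattered noise points

db = DBSCAN(eps=0.5, min_samples=5)   # eps: neighbourhood radius; min_samples: density threshold
labels = db.fit_predict(X)            # label -1 marks points classified as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```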
The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. On data sets with, for example, overlapping Gaussian distributions – a common use case in artificial data – the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as EM clustering that are able to precisely model this kind of data.
Mean-shift is a clustering approach where each object is moved to the densest area in its vicinity, based on kernel density estimation. Eventually, objects converge to local maxima of density. Similar to k-means clustering, these "density attractors" can serve as representatives for the data set, but mean-shift can detect arbitrary-shaped clusters similar to DBSCAN. Due to the expensive iterative procedure and density estimation, mean-shift is usually slower than DBSCAN or k-Means. Besides that, the applicability of the mean-shift algorithm to multidimensional data is hindered by the unsmooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.[17]
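A corresponding sketch with scikit-learn's MeanShift (assumed available); the bandwidth is the kernel parameter controlling the density estimate, and the quantile used to estimate it below is illustrative.

```python
# Sketch: mean-shift clustering with an estimated kernel bandwidth (scikit-learn assumed).
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (150, 2)),
               rng.normal(4, 0.5, (150, 2))])

bandwidth = estimate_bandwidth(X, quantile=0.2)      # heuristic choice of the kernel bandwidth
ms = MeanShift(bandwidth=bandwidth).fit(X)

labels = ms.labels_                                  # cluster assignment per object
attractors = ms.cluster_centers_                     # the "density attractors" found by the procedure
```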
Density-based clustering examples
Density-based clustering with DBSCAN
DBSCAN assumes clusters of similar density, and may have problems separating nearby clusters.
OPTICS is a DBSCAN variant, improving the handling of clusters of differing densities.
The grid-based technique is used for multi-dimensional data sets.[18] In this technique, a grid structure is created, and the comparison is performed on grids (also known as cells). The grid-based technique is fast and has low computational complexity. Two well-known grid-based clustering methods are STING and CLIQUE. The steps involved in a grid-based clustering algorithm are (a minimal sketch follows the list):
Divide data space into a finite number of cells.
Randomly select a cell ‘c’ that has not been traversed before.
Calculate the density of ‘c’.
If the density of ‘c’ is greater than the threshold density:
Mark cell ‘c’ as a new cluster.
Calculate the density of all the neighbors of ‘c’.
If the density of a neighboring cell is greater than the threshold density, add the cell to the cluster and repeat steps 4.2 and 4.3 until there is no neighbor with a density greater than the threshold density.
Repeat steps 2, 3 and 4 until all the cells are traversed.
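A minimal sketch of these steps for two-dimensional data follows. It is a generic illustration of grid-based clustering under the stated assumptions, not an implementation of STING or CLIQUE, and the cell size and density threshold are illustrative parameters.

```python
# Minimal sketch of grid-based clustering for 2-D data (illustrative; not STING or CLIQUE).
import numpy as np
from collections import deque

def grid_cluster(X, cell_size=1.0, density_threshold=5):
    # Step 1: divide the data space into a finite number of cells.
    cells = {}
    for point in X:                                   # X: array of shape (n, 2)
        key = tuple((point // cell_size).astype(int))
        cells.setdefault(key, []).append(point)

    # Steps 2-3: keep the cells whose density exceeds the threshold.
    dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}

    # Step 4: grow clusters by repeatedly absorbing dense neighbouring cells (flood fill).
    labels, cluster_id = {}, 0
    for start in dense:
        if start in labels:
            continue
        cluster_id += 1
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for nb in ((cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)):
                if nb in dense and nb not in labels:
                    labels[nb] = cluster_id
                    queue.append(nb)
    return labels          # maps each dense cell to its cluster id; sparse cells remain unlabeled
```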
In recent years, considerable effort has been put into improving the performance of existing algorithms.[19][20] Among them are CLARANS,[21] and BIRCH.[22] With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as k-means clustering.
Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adapted to subspace clustering (HiSC,[26] hierarchical subspace clustering and DiSH[27]) and correlation clustering (HiCO,[28] hierarchical correlation clustering, 4C[29] using "correlation connectivity" and ERiC[30] exploring hierarchical density-based correlation clusters).
Several different clustering systems based on mutual information have been proposed. One is Marina Meilă's variation of information metric;[31] another provides hierarchical clustering.[32] Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information.[33] Also belief propagation, a recent development in computer science and statistical physics, has led to the creation of new types of clustering algorithms.[34]
Evaluation (or "validation") of clustering results is as difficult as the clustering itself.[35] Popular approaches involve "internal" evaluation, where the clustering is summarized to a single quality score, "external" evaluation, where the clustering is compared to an existing "ground truth" classification, "manual" evaluation by a human expert, and "indirect" evaluation by evaluating the utility of the clustering in its intended application.[36]
Internal evaluation measures suffer from the problem that they represent functions that themselves can be seen as a clustering objective. For example, one could cluster the data set by the silhouette coefficient, except that there is no known efficient algorithm for this. By using such an internal measure for evaluation, one rather compares the similarity of the optimization problems,[36] and not necessarily how useful the clustering is.
External evaluation has similar problems: if we have such "ground truth" labels, then we would not need to cluster; and in practical applications we usually do not have such labels. On the other hand, the labels only reflect one possible partitioning of the data set, which does not imply that there does not exist a different, and maybe even better, clustering.
Neither of these approaches can therefore ultimately judge the actual quality of a clustering, but this needs human evaluation,[36] which is highly subjective. Nevertheless, such statistics can be quite informative in identifying bad clusterings,[37] but one should not dismiss subjective human evaluation.[37]
When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.[38] Additionally, this evaluation is biased towards algorithms that use the same cluster model. For example, k-means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering.
Therefore, the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this shall not imply that one algorithm produces more valid results than another.[5] Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for some kind of models has no chance if the data set contains a radically different set of models, or if the evaluation measures a radically different criterion.[5] For example, k-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters. On a data set with non-convex clusters neither the use of k-means, nor of an evaluation criterion that assumes convexity, is sound.
More than a dozen internal evaluation measures exist, usually based on the intuition that items in the same cluster should be more similar than items in different clusters.[39]: 115–121 For example, the following methods can be used to assess the quality of clustering algorithms based on internal criteria:
The Davies–Bouldin index can be calculated by the following formula:
$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$$
where n is the number of clusters, $c_i$ is the centroid of cluster $i$, $\sigma_i$ is the average distance of all elements in cluster $i$ to centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion.
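In practice this index does not need to be computed by hand; scikit-learn (assumed available here) provides it directly, as in the following sketch with illustrative toy data:

```python
# Sketch: computing the Davies-Bouldin index for a k-means result (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

score = davies_bouldin_score(X, labels)   # lower values indicate more compact, better-separated clusters
```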
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:[40]
$$D = \frac{\min_{1 \le i < j \le n} d(i,j)}{\max_{1 \le k \le n} d'(k)}$$
where d(i,j) represents the distance between clusters i and j, and d'(k) measures the intra-cluster distance of cluster k. The inter-cluster distance d(i,j) between two clusters may be any of a number of distance measures, such as the distance between the centroids of the clusters. Similarly, the intra-cluster distance d'(k) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster k. Since internal criteria seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable.
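There is no single canonical implementation of the Dunn index, since the inter- and intra-cluster distances can be chosen in several ways; the following sketch (assuming SciPy) uses the minimum pairwise separation between clusters and the maximum cluster diameter, matching the definitions above.

```python
# Sketch: Dunn index with pairwise-distance separation and cluster diameter (SciPy assumed).
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # d'(k): maximal distance between any pair of elements within a cluster (the diameter)
    max_intra = max(cdist(c, c).max() for c in clusters)
    # d(i, j): minimal distance between elements of two different clusters
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    return min_inter / max_intra
```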
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered, objects with a low value may be outliers. This index works well with k-means clustering, and is also used to determine the optimal number of clusters.[41]
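A short sketch of using the average silhouette coefficient to compare candidate values of k (scikit-learn assumed; the data and the range of k are illustrative):

```python
# Sketch: choosing k by the average silhouette coefficient (scikit-learn assumed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # higher average silhouette suggests better-defined clusters

best_k = max(scores, key=scores.get)
```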
In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, and these sets are often created by (expert) humans. Thus, the benchmark sets can be thought of as a gold standard for evaluation.[35] These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only on synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters or the classes may contain anomalies.[42] Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result.[42] In the special scenario of constrained clustering, where meta information (such as class labels) is used already in the clustering process, the hold-out of information for evaluation purposes is non-trivial.[43]
A number of measures are adapted from variants used to evaluate classification tasks. In place of counting the number of times a class was correctly assigned to a single data point (known as true positives), such pair counting metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.[35]
As with internal evaluation, several external evaluation measures exist,[39]: 125–129 for example:
Purity is a measure of the extent to which clusters contain a single class.[38] Its calculation can be thought of as follows: for each cluster, count the number of data points from the most common class in said cluster. Now take the sum over all clusters and divide by the total number of data points. Formally, given some set of clusters $M$ and some set of classes $D$, both partitioning $N$ data points, purity can be defined as:
$$\frac{1}{N} \sum_{m \in M} \max_{d \in D} |m \cap d|$$
This measure doesn't penalize having many clusters, and more clusters will make it easier to produce a high purity. A purity score of 1 is always possible by putting each data point in its own cluster. Also, purity doesn't work well for imbalanced data, where even poorly performing clustering algorithms will give a high purity value. For example, if a size 1000 dataset consists of two classes, one containing 999 points and the other containing 1 point, then every possible partition will have a purity of at least 99.9%.
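A direct transcription of this definition into code (a minimal sketch; the function name is illustrative):

```python
# Sketch: purity of a clustering against known class labels.
import numpy as np

def purity(cluster_labels, class_labels):
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                 # count of the most common class in this cluster
    return total / len(class_labels)          # divide by the total number of data points
```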
The Rand index[44] computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. It can be computed using the following formula:
$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives. The instances being counted here are the number of correct pairwise assignments. That is, $TP$ is the number of pairs of points that are clustered together in the predicted partition and in the ground truth partition, $FP$ is the number of pairs of points that are clustered together in the predicted partition but not in the ground truth partition, etc. If the dataset is of size N, then $TP + TN + FP + FN = \binom{N}{2}$.
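A pair-counting sketch of the Rand index as defined above; it is illustrative and quadratic in the number of points, and scikit-learn's adjusted_rand_score offers the chance-corrected variant mentioned below.

```python
# Sketch: Rand index by explicit pair counting.
from itertools import combinations

def rand_index(pred_labels, true_labels):
    tp = tn = fp = fn = 0
    for (p1, t1), (p2, t2) in combinations(list(zip(pred_labels, true_labels)), 2):
        same_pred, same_true = (p1 == p2), (t1 == t2)
        if same_pred and same_true:
            tp += 1        # pair together in both the prediction and the ground truth
        elif not same_pred and not same_true:
            tn += 1        # pair separated in both partitions
        elif same_pred:
            fp += 1        # together in the prediction only
        else:
            fn += 1        # together in the ground truth only
    return (tp + tn) / (tp + tn + fp + fn)
```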
One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern,[citation needed] as does the chance-corrected adjusted Rand index.
The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter $\beta \ge 0$. Let precision and recall (both external evaluation measures in themselves) be defined as follows:
$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$$
where $P$ is the precision rate and $R$ is the recall rate. We can calculate the F-measure by using the following formula:[38]
$$F_{\beta} = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
When $\beta = 0$, $F_0 = P$. In other words, recall has no impact on the F-measure when $\beta = 0$, and increasing $\beta$ allocates an increasing amount of weight to recall in the final F-measure.
Also, $TN$ is not taken into account and can vary from 0 upward without bound.
The Jaccard index is used to quantify the similarity between two datasets. The Jaccard index takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$$
This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets.
Note that $TN$ is not taken into account.
The Fowlkes–Mallows index[45] computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes–Mallows index, the more similar the clusters and the benchmark classifications are. It can be computed using the following formula:
$$FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$$
where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives. The $FM$ index is the geometric mean of the precision $P$ and recall $R$, and is thus also known as the G-measure, while the F-measure is their harmonic mean.[46][47] Moreover, precision and recall are also known as Wallace's indices $B^{I}$ and $B^{II}$.[48] Chance-normalized versions of recall, precision and G-measure correspond to Informedness, Markedness and Matthews Correlation, and relate strongly to Kappa.[49]
The Chi index[50] is an external validation index that measures clustering results by applying the chi-squared statistic. This index rewards partitions in which the labels are as sparse as possible across the clusters, i.e., in which each cluster contains as few different labels as possible. The higher the value of the Chi index, the greater the relationship between the resulting clusters and the labels used.
The mutual information is an information-theoretic measure of how much information is shared between a clustering and a ground-truth classification, and it can detect a non-linear similarity between two clusterings. Adjusted mutual information is the corrected-for-chance variant, which has a reduced bias for varying cluster numbers.[35]
A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster is from the gold standard cluster.
To measure cluster tendency is to measure to what degree clusters exist in the data to be clustered; this may be performed as an initial test before attempting clustering. One way to do this is to compare the data against random data. On average, random data should not have clusters [verification needed].
There are multiple formulations of the Hopkins statistic.[52] A typical one is as follows.[53] Let $X$ be the set of $n$ data points in $d$-dimensional space. Consider a random sample (without replacement) of $m \ll n$ data points with members $x_i$. Also generate a set $Y$ of $m$ uniformly randomly distributed data points. Now define two distance measures: $u_i$ is the distance of $y_i \in Y$ from its nearest neighbor in $X$, and $w_i$ is the distance of $x_i \in X$ from its nearest neighbor in $X$. We then define the Hopkins statistic as:
$$H = \frac{\sum_{i=1}^{m} u_i^{d}}{\sum_{i=1}^{m} u_i^{d} + \sum_{i=1}^{m} w_i^{d}}$$
With this definition, uniform random data should tend to have values near to 0.5, and clustered data should tend to have values nearer to 1.
However, data containing just a single Gaussian will also score close to 1, as this statistic measures deviation from a uniform distribution, not multimodality, which makes it of limited use in practice (since real data is never remotely uniform).
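A sketch of the statistic as defined above, assuming SciPy for nearest-neighbour queries; the sample size and the sampling window (the bounding box of the data) are illustrative choices.

```python
# Sketch: Hopkins statistic for cluster tendency (SciPy assumed available).
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)                     # size of the random sample
    tree = cKDTree(X)

    # u_i: distance of each uniformly generated point from its nearest neighbour in X
    Y = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u, _ = tree.query(Y, k=1)

    # w_i: distance of each sampled data point from its nearest neighbour in X (excluding itself)
    sample = X[rng.choice(n, size=m, replace=False)]
    w2, _ = tree.query(sample, k=2)
    w = w2[:, 1]

    return (u**d).sum() / ((u**d).sum() + (w**d).sum())
```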
Cluster analysis is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments. It is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.
On PET scans, cluster analysis can be used to differentiate between different types of tissue in a three-dimensional image for many different purposes.[58]
Analysis of antimicrobial activity
Cluster analysis can be used to analyse patterns of antibiotic resistance, to classify antimicrobial compounds according to their mechanism of action, and to classify antibiotics according to their antibacterial activity.
IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers, and for use in market segmentation, product positioning, new product development and selecting test markets.
Grouping of shopping items
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products (eBay does not have the concept of a SKU).
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Search result grouping
In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google[citation needed]. There are currently a number of web-based clustering tools such as Clusty. It also may be used to return a more comprehensive set of results in cases where a search term could refer to vastly different things. Each distinct use of the term corresponds to a unique cluster of results, allowing a ranking algorithm to return comprehensive results by picking the top result from each cluster.[59]
Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map.[citation needed] This both makes the map render faster and reduces the amount of visual clutter.
Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence is a way of direct preventative maintenance.
Image segmentation is the process of dividing a digital image into multiple meaningful regions or segments to simplify and/or change the representation of an image, making it easier to analyze. These segments may correspond to different objects, parts of objects, or background areas. The goal is to assign a label to every pixel in the image so that the pixels with similar attributes are grouped together.
This process is used in fields like medical imaging, computer vision, satellite imaging, and in daily applications like face detection and photo editing.
Figure: the aurora borealis, or northern lights, above Bear Lake, Alaska, shown alongside the same image after running k-means clustering with k = 16.
Clustering in image segmentation:
Clustering plays a significant role in image segmentation. It groups pixels into clusters based on similarity without needing labeled data. These clusters then define segments within the image.
Here are the most commonly used clustering algorithms for image segmentation (a k-means sketch follows the list):
K-means Clustering: One of the most popular and straightforward methods. Pixels are treated as data points in a feature space (usually defined by color or intensity) and grouped into k clusters. Each pixel is assigned to the nearest cluster center, and the centers are updated iteratively.
Mean Shift Clustering: A non-parametric method that does not require specifying the number of clusters in advance. It identifies clusters by locating dense areas of data points in the feature space.
Fuzzy C-means: Unlike k-means, which assigns pixels to exactly one cluster, fuzzy c-means allows each pixel to belong to multiple clusters with varying degrees of membership.
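A minimal sketch of the k-means variant for colour-based segmentation, assuming scikit-learn and an RGB image stored as a NumPy array; the value of k is illustrative.

```python
# Sketch: colour-based image segmentation with k-means (scikit-learn assumed; image is an H x W x 3 array).
import numpy as np
from sklearn.cluster import KMeans

def segment_image(image, k=16, seed=0):
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)    # each pixel becomes a point in colour space

    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)

    # Replace every pixel by the centre of its cluster to obtain the segmented image.
    segmented = km.cluster_centers_[km.labels_].reshape(h, w, c)
    return segmented.astype(image.dtype)
```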
Clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.
Recommender systems suggest items, products, or other users to an individual based on their past behavior and current preferences. These systems will occasionally use clustering algorithms to predict a user's unknown preferences by analyzing the preferences and activities of other users within the same cluster. Cluster analysis is not the only approach for recommendation systems, for example there are systems that leverage graph theory. Recommendation algorithms that utilize cluster analysis often fall into one of the three main categories: Collaborative filtering, Content-Based filtering, and a hybrid of the collaborative and content-based.
Collaborative Filtering Recommendation Algorithm
Collaborative filtering works by analyzing large amounts of data on user behavior, preferences, and activities to predict what a user might like based on similarities with others. It detects patterns in how users rate items and groups similar users or items into distinct “neighborhoods.” Recommendations are then generated by leveraging the ratings of content from others within the same neighborhood. The algorithm can focus on either user-based or item-based grouping depending on the context.[60]
Flow diagram that shows a basic and generic approach to recommendation systems and how they utilize clustering
Content-Based Filtering Recommendation Algorithm
Content-based filtering uses item descriptions and a user's preference profile to recommend items with similar characteristics to those the user previously liked. It evaluates the distance between feature vectors of item clusters, or "neighborhoods." The user's past interactions are represented as a weighted feature vector, which is compared to these clusters. Recommendations are generated by identifying the cluster evaluated to be the closest in distance to the user's preferences.[60]
Hybrid Recommendation Algorithms
Hybrid recommendation algorithms combine collaborative and content-based filtering to better meet the requirements of specific use cases. In certain cases this approach leads to more effective recommendations. Common strategies include: (1) running collaborative and content-based filtering separately and combining the results, (2) adding onto one approach with specific features of the other, and (3) integrating both hybrid methods into one model.[60]
Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.
Cluster analysis is used, for example, to identify groups of schools or students with similar properties.
Typologies
From poll data, projects such as those undertaken by the Pew Research Center use cluster analysis to discern typologies of opinions, habits, and demographics that may be useful in politics and marketing.
^Driver and Kroeber (1932). "Quantitative Expression of Cultural Relationships". University of California Publications in American Archaeology and Ethnology. Quantitative Expression of Cultural Relationships. Berkeley, CA: University of California Press: 211–256. Archived from the original on 2020-12-06. Retrieved 2019-02-18.
^Zubin, Joseph (1938). "A technique for measuring like-mindedness". The Journal of Abnormal and Social Psychology. 33 (4): 508–516. doi:10.1037/h0055441. ISSN 0096-851X.
^Tryon, Robert C. (1939). Cluster Analysis: Correlation Profile and Orthometric (factor) Analysis for the Isolation of Unities in Mind and Personality. Edwards Brothers.
^Cattell, R. B. (1943). "The description of personality: Basic traits resolved into clusters". Journal of Abnormal and Social Psychology. 38 (4): 476–506. doi:10.1037/h0054116.
^Estivill-Castro, Vladimir (20 June 2002). "Why so many clustering algorithms – A Position Paper". ACM SIGKDD Explorations Newsletter. 4 (1): 65–75. doi:10.1145/568574.568575. S2CID 7329935.
^Defays, D. (1977). "An efficient algorithm for a complete link method". The Computer Journal. 20 (4). British Computer Society: 364–366. doi:10.1093/comjnl/20.4.364.
^Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. ISBN 1-57735-004-9.
^Ankerst, Mihael; Breunig, Markus M.; Kriegel, Hans-Peter; Sander, Jörg (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". ACM SIGMOD international conference on Management of data. ACM Press. pp. 49–60. CiteSeerX 10.1.1.129.6542.
^Achtert, E.; Böhm, C.; Kröger, P. (2006). "DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking". Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. Vol. 3918. pp. 119–128. CiteSeerX 10.1.1.64.1161. doi:10.1007/11731139_16. ISBN 978-3-540-33206-0.
^Sculley, D. (2010). Web-scale k-means clustering. Proc. 19th WWW.
^Huang, Z. (1998). "Extensions to the k-means algorithm for clustering large data sets with categorical values". Data Mining and Knowledge Discovery. 2 (3): 283–304. doi:10.1023/A:1009769707641. S2CID 11323096.
^R. Ng and J. Han. "Efficient and effective clustering method for spatial data mining". In: Proceedings of the 20th VLDB Conference, pages 144–155, Santiago, Chile, 1994.
^Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246–257, 2004.
^Meilă, Marina (2003). "Comparing Clusterings by the Variation of Information". Learning Theory and Kernel Machines. Lecture Notes in Computer Science. Vol. 2777. pp. 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1.
^Kraskov, Alexander; Stögbauer, Harald; Andrzejak, Ralph G.; Grassberger, Peter (1 December 2003). "Hierarchical Clustering Based on Mutual Information". arXiv:q-bio/0311039.
^Pfitzner, Darius; Leibbrandt, Richard; Powers, David (2009). "Characterization and evaluation of similarity measures for pairs of clusterings". Knowledge and Information Systems. 19 (3). Springer: 361–394. doi:10.1007/s10115-008-0150-6. S2CID 6935380.
^Feldman, Ronen; Sanger, James (2007-01-01). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge Univ. Press. ISBN 978-0521836579. OCLC 915286380.
^Weiss, Sholom M.; Indurkhya, Nitin; Zhang, Tong; Damerau, Fred J. (2005). Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer. ISBN 978-0387954332. OCLC 803401334.
^Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008-07-07). Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0-521-86571-5.
^Dunn, J. (1974). "Well separated clusters and optimal fuzzy partitions". Journal of Cybernetics. 4: 95–104. doi:10.1080/01969727408546059.
^Peter J. Rousseeuw (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
^Pourrajabi, M.; Moulavi, D.; Campello, R. J. G. B.; Zimek, A.; Sander, J.; Goebel, R. (2014). "Model Selection for Semi-Supervised Clustering". Proceedings of the 17th International Conference on Extending Database Technology (EDBT). pp. 331–342. doi:10.5441/002/edbt.2014.31.
^Fowlkes, E. B.; Mallows, C. L. (1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association. 78 (383): 553–569. doi:10.1080/01621459.1983.10478008. JSTOR2288117.
^Powers, David (2003). Recall and Precision versus the Bookmaker. International Conference on Cognitive Science. pp. 529–534.
^Rosenberg, Andrew, and Julia Hirschberg. "V-measure: A conditional entropy-based external cluster evaluation measure." Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL). 2007. pdf
^Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. 18 (2). Annals Botany Co: 213–227. doi:10.1093/oxfordjournals.aob.a083391.
^Hartuv, Erez; Shamir, Ron (2000-12-31). "A clustering algorithm based on graph connectivity". Information Processing Letters. 76 (4): 175–181. doi:10.1016/S0020-0190(00)00142-3. ISSN 0020-0190.
^Remm, Maido; Storm, Christian E. V.; Sonnhammer, Erik L. L. (2001-12-14). "Automatic clustering of orthologs and in-paralogs from pairwise species comparisons". Journal of Molecular Biology. 314 (5): 1041–1052. doi:10.1006/jmbi.2000.5197. ISSN 0022-2836. PMID 11743721.
^Di Marco, Antonio; Navigli, Roberto (2013). "Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction". Computational Linguistics. 39 (3): 709–754. doi:10.1162/COLI_a_00148. S2CID 1775181.
Cluster analysis is an unsupervised data analysis technique that partitions a set of objects into groups, or clusters, such that objects within the same cluster are more similar to each other than to those in other clusters, based on predefined measures of similarity or dissimilarity.[1] This method aims to discover inherent structures or patterns in data without relying on predefined labels or categories, making it a fundamental tool for exploratory analysis in fields like statistics and machine learning.[2] The process typically involves selecting appropriate distance metrics, such as the Euclidean distance for numerical data, to quantify similarities and iteratively forming clusters that maximize intra-cluster cohesion and inter-cluster separation.[3]

Key characteristics of cluster analysis include its nonparametric nature, which allows it to handle diverse data types (numerical, categorical, or mixed) without assuming underlying distributions, and its focus on either hard partitioning (where each object belongs to exactly one cluster) or soft partitioning (allowing probabilistic memberships).[1] Common algorithms fall into several categories: partitional methods like k-means, which divide data into a fixed number of non-overlapping subsets by minimizing variance within clusters; hierarchical methods, such as agglomerative clustering, which builds a tree-like structure by successively merging similar clusters; and density-based methods like DBSCAN, which identify clusters as dense regions separated by sparse areas.[2] Evaluation often relies on internal criteria, such as silhouette scores measuring cluster compactness and separation, or external validation when ground truth labels are available.[1]

Originating in the biological sciences for taxonomic classification in the early 20th century, cluster analysis has evolved significantly with advancements in computational power and data mining, gaining prominence through seminal works in the 1970s and 1980s that formalized algorithms for broader applications.[1] Today, it is widely applied in diverse domains, including genomics for gene expression grouping, marketing for customer segmentation, image processing for object recognition, and climate science for pattern detection in environmental data.[3] Despite its utility, challenges persist, such as sensitivity to outliers, the need to determine the optimal number of clusters, and scalability for large datasets, driving ongoing research into robust and efficient variants.[2]
Fundamentals
Definition and Objectives
Cluster analysis is an unsupervised machine learning technique that partitions a given dataset into subsets, known as clusters, such that data points within the same cluster exhibit greater similarity to each other than to those in other clusters, typically measured using distance or dissimilarity metrics.[4] This process relies on inherent data structures without requiring predefined labels or supervision, enabling the discovery of natural groupings in unlabeled data.[5]

The primary objectives of cluster analysis include facilitating exploratory data analysis to uncover hidden patterns, supporting pattern discovery in complex datasets, aiding anomaly detection by identifying outliers as points distant from cluster centers, and contributing to dimensionality reduction by summarizing data into representative cluster prototypes.[5] These goals emphasize its role in unsupervised learning, where the aim is to reveal intrinsic data organization for subsequent tasks like classification or visualization, rather than predictive modeling.[4]

Effective cluster analysis presupposes appropriate data representation, encompassing numerical attributes (e.g., continuous values like measurements) and categorical attributes (e.g., discrete labels like categories), which may require preprocessing to handle mixed types.[4] Central to this are similarity or distance measures that quantify resemblance between data points; common examples include the Euclidean distance, defined as
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$
where x and y are data points in an n-dimensional space, and the Manhattan distance,
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|,$$
which sums absolute differences along each dimension.[4]

Key concepts in cluster analysis distinguish between hard clustering, where each data point is assigned exclusively to one cluster with no overlap, and soft clustering, where points may belong to multiple clusters with varying degrees of membership, often represented as probabilities.[5] Clusters can exhibit overlap in soft approaches, allowing for ambiguous boundaries in real-world data, while scalability issues arise with large datasets, as many algorithms struggle with high computational complexity for millions of points, necessitating efficient implementations.[4]
Historical Development
The roots of cluster analysis trace back to the early 1930s in anthropology, where Harold E. Driver and Alfred L. Kroeber introduced quantitative methods to assess similarities in cultural traits, particularly kinship terminologies, marking one of the first systematic attempts to group data based on resemblance measures.[6] This approach laid the groundwork for empirical classification by using similarity coefficients to organize qualitative data into hierarchical structures, though computations were manual and limited to small datasets. In the late 1930s, the technique entered psychology through Joseph Zubin's 1938 proposal for measuring "like-mindedness" among individuals via profile correlations, enabling the identification of homogeneous groups in behavioral data.[7] Robert C. Tryon further advanced these ideas in 1939 with his monograph on cluster analysis, applying correlation profiles and orthometric factors to isolate unities in personality traits and mental abilities, thus formalizing clustering as a tool for psychological taxonomy.[8]

The 1960s saw significant milestones in the development of algorithmic frameworks, driven by the need for more robust statistical methods. Joe H. Ward's 1963 minimum variance method introduced an objective function to minimize within-cluster variance in hierarchical clustering, providing a criterion for optimal grouping that became widely adopted for its computational efficiency.[9] Building on this, G. N. Lance and W. T. Williams published their 1967 agglomerative framework, which generalized hierarchical clustering strategies through recursive distance updates, unifying various linkage criteria and facilitating implementation on early computers.[10] Concurrently, James MacQueen formalized the k-means algorithm in 1967, offering a partitioning approach that iteratively assigns data points to clusters by minimizing squared distances to centroids, though its roots extended to earlier work like Lloyd's 1957 vector quantization.[11]

From the 1970s onward, cluster analysis transitioned from manual to computer-aided processes, spurred by advancements in statistical software. The integration of clustering routines into packages like SPSS version 5.0 around 1975 enabled hierarchical and k-means methods for larger datasets, democratizing access for researchers in the social sciences and beyond.[12] This era also witnessed the emergence of fuzzy clustering approaches, such as those developed by Dunn in 1973 and refined by Bezdek, allowing for partial memberships in clusters.[13] Additionally, density-based approaches emerged, exemplified by Martin Ester and colleagues' 1996 DBSCAN algorithm, which identifies clusters of arbitrary shape in spatial data by focusing on density reachability rather than predefined parameters like the number of clusters.[14] Influential figures such as Teuvo Kohonen contributed in the 1980s with self-organizing maps, a neural network-inspired method for topographic clustering that visualizes high-dimensional data on low-dimensional grids.[15]

By the 2000s, cluster analysis underwent a paradigm shift toward probabilistic models, influenced by machine learning's emphasis on uncertainty and scalability. Traditional deterministic methods like k-means gave way to model-based techniques, such as Gaussian mixture models, which assign soft memberships via expectation-maximization and better handle overlapping clusters, as reviewed in works on large-scale data analysis.[16] This evolution reflected broader integration with machine learning, enabling applications in gene expression analysis for biological pattern discovery.
Clustering Methods
Hierarchical Clustering
Hierarchical clustering, also known as connectivity-based clustering, constructs a hierarchy of clusters either by progressively merging smaller clusters into larger ones or by recursively splitting larger clusters into smaller ones.[17] The agglomerative approach, which is the most commonly used, starts with each data point as its own singleton cluster and iteratively merges the closest pair of clusters until all points form a single cluster.[18] In contrast, the divisive approach begins with all data points in one cluster and repeatedly splits the most heterogeneous cluster into two until each point is isolated.[17] The resulting hierarchy is typically visualized using a dendrogram, a tree-like diagram where the height of each merge or split indicates the dissimilarity level at which clusters are combined or divided, allowing users to select a partitioning by cutting the tree at a desired level.[19]

The merging or splitting process in hierarchical clustering relies on linkage criteria that define the distance between clusters. Single linkage, also known as nearest-neighbor linkage, measures the minimum distance between any point in one cluster and any point in the other, which can lead to chaining effects in elongated clusters.[20] Complete linkage, or furthest-neighbor linkage, uses the maximum distance between points in the two clusters, promoting compact, spherical clusters but being sensitive to outliers.[20] Average linkage computes the mean distance between all pairs of points across the clusters, balancing the tendencies of single and complete methods to produce more balanced hierarchies.[20] Ward's method, a variance-minimizing approach, merges the clusters that result in the smallest increase in total within-cluster variance, with the linkage distance given by
$$\Delta = \frac{n_i n_j}{n_i + n_j} \|\mu_i - \mu_j\|^2,$$
where $n_i$ and $n_j$ are the sizes of clusters $i$ and $j$, and $\mu_i$ and $\mu_j$ are their centroids.[21]

Several algorithms implement hierarchical clustering efficiently. The naive agglomerative algorithm updates distances after each merge, achieving O(n^3) time complexity for n points due to repeated pairwise computations.[19] Optimized versions like SLINK for single linkage reduce this to O(n^2) time by maintaining a compact representation of the dendrogram and avoiding redundant distance calculations.[22] Similarly, CLINK achieves O(n^2) time for complete linkage through an incremental approach that extends partial hierarchies.[23] To determine the number of clusters, stopping criteria such as the inconsistency coefficient are used, which measures how a link's height deviates from the average and standard deviation of nearby links in the dendrogram, with values above a threshold indicating natural cuts.[24]

Hierarchical clustering excels at handling non-spherical and varying-density clusters without requiring a predefined number of clusters, providing interpretable hierarchies via dendrograms.[18] However, it is computationally intensive, with even optimized implementations requiring O(n^2) time and space, making it less suitable for very large datasets.[19]

In bioinformatics, hierarchical clustering is widely applied to gene expression data, where average or Ward's linkage with Euclidean distances groups genes or samples based on similar expression profiles, visualized in dendrograms to reveal co-regulated pathways or tissue-specific patterns.[25] For instance, in microarray datasets, it identifies clusters of genes upregulated in cancer samples, aiding in biomarker discovery.[26]
Centroid-Based Clustering
Centroid-based clustering methods partition a dataset into a predefined number of clusters, K, by iteratively assigning data points to the nearest centroid (prototype) and updating the centroids based on the assignments until convergence is achieved. This approach favors compact, spherical clusters by minimizing the distance between points and their assigned centroids, typically using Euclidean distance. The core principle relies on creating Voronoi partitions where each cluster consists of the points closest to its centroid, promoting intra-cluster similarity.[27]
The k-means algorithm, a foundational centroid-based method, formalizes this process through Lloyd's iterative procedure. It begins with the initialization of K centroids, often randomly selected from the data or using improved strategies like k-means++ for better spread. In the assignment step, each point $x$ is allocated to the cluster $C_k$ with the nearest centroid $\mu_k$, forming Voronoi cells. The update step recomputes each centroid as the mean of its assigned points: $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$. This alternation continues until the centroids stabilize or a maximum number of iterations is reached. The objective is to minimize the within-cluster sum of squares (WCSS), $\sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$, which measures total intra-cluster variance.[28]
Variants address specific limitations of standard k-means. The k-medoids algorithm, implemented as Partitioning Around Medoids (PAM), selects actual data points as medoids instead of means, making it more robust to outliers by minimizing total dissimilarity rather than squared Euclidean distance; it iteratively swaps medoids to optimize the objective. Fuzzy c-means extends the approach to soft assignments, allowing partial memberships $u_{ik}$ of point $i$ in cluster $k$ via $u_{ik} = \frac{1}{\sum_{j} (d_{ik}/d_{ij})^{2/(m-1)}}$, where $d_{ik}$ is the distance from point $i$ to centroid $k$ and $m > 1$ controls the degree of fuzziness, enabling overlapping clusters.
Initialization significantly affects the results because k-means is sensitive to the starting centroids and often converges to local optima; multiple runs with different random seeds, or k-means++, which probabilistically selects initial centroids that are farther apart, are recommended to mitigate this. Determining K commonly involves methods such as the elbow technique, which plots WCSS against K and selects the "elbow" point where marginal gains diminish, indicating a balance between model complexity and fit.
Despite its efficiency, centroid-based clustering assumes clusters that are roughly spherical, of similar size, and of similar variance, which fails for elongated, varying-density, or unevenly sized groups; it is also sensitive to outliers, whose influence is amplified by the squared-distance objective, particularly in high dimensions. The time complexity is $O(nkt)$, where $n$ is the number of points, $k$ the number of clusters, and $t$ the number of iterations, which scales poorly for very large datasets without approximations.[27]
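The assignment and update steps above translate almost directly into code. The sketch below is a minimal, from-scratch version of Lloyd's procedure in NumPy with plain random initialization (k-means++ seeding and multiple restarts, as recommended above, are omitted for brevity); the function name and data are illustrative.
```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: alternate nearest-centroid
    assignment and mean updates until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each point joins the Voronoi cell of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    wcss = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centroids, wcss

# Illustrative run on synthetic 2-D data (assumed, not from the text).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(50, 2)) for m in ((0, 0), (4, 0), (2, 3))])
labels, centroids, wcss = lloyd_kmeans(X, k=3)
print("WCSS:", round(float(wcss), 2))
```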
Density-Based Clustering
Density-based clustering algorithms define clusters as dense regions in the data space, separated by areas of lower density, allowing for the discovery of clusters with arbitrary shapes and the identification of noise without assuming a fixed number of clusters. In this framework, a cluster is a maximal set of density-connected points, where two points are density-connected if they belong to a chain of points such that each consecutive pair is directly density-reachable from the other based on local density criteria. Points lying in sparse regions, not belonging to any such dense structure, are treated as noise or outliers.[29]
The foundational algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), was introduced in 1996 by Ester et al. It operates using two key parameters: Eps, which defines the radius of the ε-neighborhood around a point, and MinPts, the minimum number of points required to consider a neighborhood dense. Points are categorized as core points if their ε-neighborhood contains at least MinPts points (including themselves), border points if they lie in the ε-neighborhood of a core point but have fewer than MinPts points in their own neighborhood, and noise points otherwise. Direct density-reachability occurs when a core point p has another point q within its ε-neighborhood; density-reachability extends this transitively through chains of such connections. The algorithm processes the dataset by selecting an arbitrary unvisited point, performing a range query to retrieve all points within its ε-neighborhood (with naive implementation taking O(n) time per query), and expanding the cluster by recursively adding all density-reachable points from core points; unexpandable points are marked as noise, and the process repeats until all points are visited.[29]
To handle datasets with varying densities, where a single Eps value may fail to capture both sparse and dense regions, OPTICS (Ordering Points To Identify the Clustering Structure) was developed in 1999 by Ankerst et al. as an extension of DBSCAN. OPTICS does not produce flat clusters directly but generates an augmented ordering of points based on their core-distance (minimum Eps to qualify as core) and reachability-distance (minimum distance to a density-reachable point under varying Eps), visualized in a reachability plot that reveals hierarchical cluster structures at multiple density thresholds; clusters can then be extracted post hoc by applying a steepness criterion to the plot.[30]
A notable variant, HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), proposed in 2013 by Campello et al., further advances density-based methods by constructing a hierarchy without requiring a fixed Eps parameter. It builds upon mutual reachability distances to form a minimum spanning tree, applies single-linkage clustering to create a dendrogram of density levels, and extracts flat clusters by selecting a stability-based cut in the hierarchy, effectively accommodating clusters of differing densities and providing robust outlier detection through the hierarchy's leaves.
Density-based clustering offers several advantages, including the ability to detect clusters without specifying the number of clusters in advance, robustness to noise through explicit outlier labeling, and flexibility in identifying non-convex, arbitrary-shaped clusters in spatial or high-dimensional data. However, these methods are sensitive to parameter selection—such as Eps and MinPts in DBSCAN—which can significantly affect results if not tuned appropriately via techniques like k-distance plots, and they may underperform on datasets with substantial density variations unless extended by algorithms like OPTICS or HDBSCAN.[31][29]
A practical example of density-based clustering is its application to spatial datasets, such as grouping earthquake hypocenters to delineate seismic fault zones; DBSCAN can identify dense swarms of aftershocks as clusters while classifying isolated low-magnitude events as noise, aiding in hazard assessment without assuming spherical cluster shapes.[32]
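As a hedged illustration of the DBSCAN behavior described above, the snippet below uses scikit-learn's implementation on two interleaved half-moons; the eps and min_samples values are illustrative guesses of the kind one would normally tune with a k-distance plot.
```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that centroid-based methods
# typically split incorrectly (illustrative data, not from the text).
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

# eps corresponds to the ε-neighborhood radius, min_samples to MinPts.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # points labeled -1 are noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```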
Model-Based Clustering
Model-based clustering approaches treat the data as arising from a mixture of underlying probability distributions, where each component represents a cluster, and parameters are estimated via statistical inference to uncover the latent structure. This probabilistic framework allows for soft assignments of data points to clusters based on posterior probabilities, enabling the modeling of uncertainty and overlap between groups.[33] Unlike deterministic methods, it assumes the data are generated from a finite mixture model, typically specified as $p(x) = \sum_{k=1}^{K} \pi_k f(x \mid \theta_k)$, where the $\pi_k$ are mixing proportions with $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k > 0$, and $f(x \mid \theta_k)$ is the density of the k-th component parameterized by $\theta_k$.
A common instantiation uses Gaussian components, yielding the Gaussian mixture model $p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, which assumes clusters form ellipsoidal shapes aligned with the data's covariance structure.[33] Parameter estimation is typically performed using the expectation-maximization (EM) algorithm, which maximizes the likelihood by alternating expectation (E) and maximization (M) steps.[34] In the E-step, posterior probabilities of component membership are calculated as $z_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$ for each data point $i$ and component $k$. The M-step then updates the parameters: mixing proportions as $\pi_k = \frac{1}{n} \sum_{i=1}^{n} z_{ik}$, means as $\mu_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}$, and covariances as $\Sigma_k = \frac{\sum_{i=1}^{n} z_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} z_{ik}}$.[34]
Bayesian extensions address the fixed number of components K by employing nonparametric priors like the Dirichlet process mixture, which allows the number of clusters to be inferred from the data rather than specified in advance.[35] This approach uses Markov chain Monte Carlo methods, such as those proposed by Neal, to sample from the posterior distribution over partitions and parameters.[36] The core assumption is that clusters correspond to draws from underlying probability distributions, accommodating varying shapes and sizes through flexible covariance modeling, though it excels particularly with ellipsoidal forms.
Implementation is facilitated by software like the mclust package in R, which fits Gaussian finite mixture models via EM and supports model selection across various covariance structures.[37] Despite its strengths, model-based clustering incurs high computational cost due to iterative optimization, especially for large datasets or complex covariances. Overfitting is a key limitation, mitigated by criteria such as the Bayesian Information Criterion (BIC), defined as $\mathrm{BIC} = -2 \ln L + p \ln n$, where $L$ is the maximized likelihood, $p$ the number of parameters, and $n$ the sample size; lower BIC values favor parsimonious models.[38] K-means can be viewed as a special case with hard assignments and spherical covariances equal across clusters.
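The EM updates and BIC-based model selection described above can be exercised with scikit-learn's GaussianMixture, as in the rough sketch below; the synthetic two-component data and the candidate range K = 1..5 are assumptions made for illustration.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data drawn from two overlapping Gaussians (assumed, not from the text).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0.0, 0.0), (1.0, 0.5), size=(200, 2)),
               rng.normal((2.5, 1.0), (0.7, 0.7), size=(200, 2))])

# Fit mixtures with K = 1..5 components by EM and keep the one with the lowest BIC,
# mirroring the parsimony criterion discussed above.
models = [GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print("chosen K:", best.n_components)

# Soft assignments: posterior membership probabilities (the E-step quantities z_ik).
print(np.round(best.predict_proba(X[:5]), 3))
```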
Grid-Based and Other Methods
Grid-based clustering methods partition the data space into a finite number of non-overlapping grid cells or hyper-rectangles, allowing clustering operations to be performed efficiently on these discretized structures rather than on individual data points. This approach reduces computational complexity by summarizing statistical properties within each cell, such as count, mean, variance, and distribution type, enabling quick query processing and region-based analysis. A key parameter is the cell resolution or grid size, which determines the granularity of the partitioning and trades off accuracy against efficiency; finer grids capture more detail but increase processing time. These methods are particularly suited for spatial data mining tasks where exact point-level precision is less critical than scalability.[39]
The STING (Statistical Information Grid) algorithm exemplifies this paradigm by employing a hierarchical, quadtree-like grid structure that adaptively refines cells based on data density, starting from a coarse global level and drilling down to finer resolutions only where necessary. Developed in 1997, STING precomputes statistical summaries for each cell, such as min/max values, entropy, and skewness, to support clustering by identifying dense regions through bottom-up aggregation or top-down refinement, making it efficient for large datasets with varying densities. Adaptive grids like those in STING handle non-uniform data distributions by allowing variable cell sizes, avoiding the inefficiencies of uniform partitioning in sparse areas.[39][40]
Subspace clustering extends grid-based techniques to high-dimensional data, where clusters may exist only in lower-dimensional projections, mitigating the curse of dimensionality by searching for dense regions in arbitrary subspaces. The CLIQUE algorithm, introduced in 1998, applies a grid structure across the subspaces generated from the full feature set, identifying dense unit hypercubes and merging them into clusters while producing concise descriptions in disjunctive normal form (DNF). CLIQUE partitions each dimension into intervals based on data percentiles, covers the space with rectangular units, and uses a density threshold to prune sparse areas, enabling scalability to hundreds of dimensions without exhaustive subspace enumeration. This grid-based subspace approach excels on datasets where the relevant features vary across clusters, such as gene expression analysis.[41][42]
Spectral clustering treats data points as vertices in a similarity graph, using the graph Laplacian to embed points into a low-dimensional space for subsequent partitioning, which is effective for non-convex clusters. The unnormalized graph Laplacian is defined as $L = D - A$, where $A$ is the affinity matrix whose entries represent pairwise similarities (e.g., a Gaussian kernel) and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. The algorithm computes the eigenvectors associated with the k smallest eigenvalues of $L$, forms a matrix of these eigenvectors (each row corresponding to a data point), and applies k-means clustering in this embedding space to obtain the final partitions. This method leverages spectral graph theory to minimize the normalized cut, balancing cluster cohesion and separation.[43][44]
Among other methods, self-organizing maps (SOMs) provide a neural network-based alternative that preserves topological relationships through competitive learning on a discrete lattice. Proposed by Kohonen in 1982, SOMs initialize a grid of neurons with weight vectors and iteratively update the winning neuron (the one closest to the input) and its neighbors toward the data point, gradually forming a continuous mapping of the high-dimensional input space onto the lower-dimensional grid. This unsupervised process results in topology-preserving clusters where similar inputs activate nearby neurons, useful for visualization and exploratory analysis.[45]
Grid-based and spectral methods offer advantages in scalability for large-scale and high-dimensional datasets, with time complexities often linear in the number of points or grid cells, outperforming point-based algorithms on massive data. However, quantization in grid-based approaches can lead to loss of precision, as boundary points may be arbitrarily assigned to cells, potentially distorting cluster shapes in low-density regions. Spectral methods, while robust to noise and outliers, incur higher costs for eigenvector computation in very large graphs. Some hybrid approaches integrate grid structures with density estimation to refine boundaries, enhancing accuracy without full point-wise processing. For instance, spectral clustering has been applied to image segmentation by modeling pixels as graph nodes with intensity-based affinities, partitioning the image into coherent regions as demonstrated in the normalized cuts framework.[40][42][44][46]
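The Laplacian-based embedding described above can be sketched in a few lines of NumPy; this is a simplified, unnormalized variant (production implementations typically use a normalized Laplacian and sparse nearest-neighbor graphs), and the ring-shaped data and kernel width are illustrative assumptions.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Two concentric rings: a non-convex structure that k-means alone cannot separate
# (illustrative data, not from the text).
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.04, random_state=0)

# Affinity matrix with a Gaussian (RBF) kernel; sigma is an illustrative choice.
sigma = 0.15
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
A = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(A, 0.0)

# Unnormalized graph Laplacian L = D - A, with D the diagonal degree matrix.
D = np.diag(A.sum(axis=1))
L = D - A

# Embed each point with the eigenvectors of the k smallest eigenvalues of L,
# then run k-means in that spectral embedding.
k = 2
_, eigvecs = np.linalg.eigh(L)       # eigh returns eigenvalues in ascending order
embedding = eigvecs[:, :k]
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print("cluster sizes:", np.bincount(labels))
```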
Evaluation Techniques
Internal Validation Metrics
Internal validation metrics evaluate the quality of a clustering solution solely based on the intrinsic information within the dataset and the resulting partition, without reference to external ground truth labels. These metrics typically emphasize two key properties: compactness, which measures how closely related the objects are within each cluster, and separation, which assesses how distinct the clusters are from one another. By quantifying these aspects, internal metrics enable the selection of the optimal number of clusters k, the comparison of different clustering algorithms on the same data, or the assessment of partition stability, often under assumptions such as the use of Euclidean distance in feature space.[47]
One of the earliest internal metrics is the Dunn index, proposed by Dunn in 1974, which computes the ratio of the smallest inter-cluster distance to the largest intra-cluster diameter across all clusters. A higher value indicates better-defined clusters with strong separation relative to internal dispersion, making it suitable for identifying compact and well-separated structures. However, it can be computationally intensive for large datasets due to the need to evaluate all pairwise distances.
The Davies–Bouldin index, introduced by Davies and Bouldin in 1979, provides a measure of cluster similarity by averaging, for each cluster, the maximum ratio of the sum of within-cluster scatter to the between-cluster separation with respect to other clusters. Formally, it is defined as $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}$, where $s_i$ is the average distance between points in cluster $i$ and its centroid, $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$, and lower values signify superior clustering quality. This index is particularly useful for comparing partitions where compactness and separation trade-offs are critical, though it assumes spherical clusters in Euclidean space.[48]
The silhouette coefficient, developed by Rousseeuw in 1987, offers a more intuitive per-object assessment of clustering validity by measuring how similar an object is to its own cluster compared to neighboring clusters. For each data point $i$, the silhouette value is calculated as $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where $a(i)$ is the average distance from $i$ to other points in its cluster, and $b(i)$ is the smallest average distance from $i$ to points in a different cluster; the global coefficient is the mean over all points, ranging from -1 (poor assignment) to 1 (well-clustered). This metric not only yields a summary score but also enables graphical visualization to detect outliers or suboptimal merges, assuming a distance metric like Euclidean.[49]
Another prominent metric is the Calinski-Harabasz index, also known as the variance ratio criterion, proposed by Caliński and Harabasz in 1974, which evaluates the ratio of between-cluster dispersion to within-cluster dispersion, adjusted for the number of clusters and data points. Higher values indicate favorable partitions with greater inter-cluster variance relative to intra-cluster variance, making it effective for selecting k in methods like k-means under multivariate normal assumptions in Euclidean space. These metrics collectively support unsupervised evaluation but may vary in sensitivity to cluster shape, density, or dimensionality, necessitating careful selection based on data characteristics.[50]
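For a practical, hedged illustration, scikit-learn exposes several of these internal indices directly; the sketch below scores k-means partitions for a range of k on synthetic blobs, where the data and candidate range are assumptions chosen for demonstration.
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic blobs with four true groups (illustrative data, not from the text).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.3f}  "
          f"Calinski-Harabasz={calinski_harabasz_score(X, labels):.1f}")

# Higher silhouette and Calinski-Harabasz values and lower Davies-Bouldin values
# all point toward the better-separated partition, typically k=4 here.
```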
External Validation Metrics
External validation metrics assess the quality of a clustering by comparing the obtained partitions to a known ground truth labeling of the data, typically available in supervised evaluation scenarios where external class information exists. These metrics quantify the agreement between the clustering output and the reference labels, focusing on aspects such as pairwise similarities, set overlaps, or information preservation, and are particularly useful for benchmarking algorithms when ground truth is accessible. Unlike internal metrics that rely solely on data structure, external measures provide an objective evaluation against predefined categories, enabling direct assessment of how well the clustering recovers the underlying structure.
A foundational tool for computing many external metrics is the contingency table, also known as a confusion matrix in this context, which tabulates the joint frequencies of cluster assignments and ground truth labels. For a dataset with n points, the table C is an $m \times k$ matrix where $C_{i,j}$ denotes the number of points assigned to cluster $i$ in the clustering and class $j$ in the ground truth, with $m$ clusters and $k$ classes. This table serves as the basis for deriving agreement statistics, such as true positives (TP: pairs in the same cluster and the same class), true negatives (TN: pairs in different clusters and different classes), false positives (FP: pairs in the same cluster but different classes), and false negatives (FN: pairs in different clusters but the same class). The contingency table allows for straightforward calculation of overlaps and mismatches, facilitating the application of various indices.
The Rand Index (RI), introduced by Rand in 1971, measures the fraction of pairwise agreements between the clustering and the ground truth, treating a pair of points as correctly classified if the two points are either in the same group or in different groups according to both partitions. It is defined as $RI = \frac{TP + TN}{TP + TN + FP + FN}$, yielding a value between 0 (no agreement) and 1 (perfect agreement). However, the basic RI is sensitive to chance agreements and is often adjusted using the Hubert-Arabie formula to produce the Adjusted Rand Index (ARI), which subtracts the expected agreement under random labeling and normalizes the result to the range -1 to 1, with values near 0 indicating random partitions. This adjustment makes ARI a robust choice for comparing clusterings of varying sizes.
The F-measure adapts the precision-recall framework from information retrieval to clustering evaluation, computing the harmonic mean of precision (the fraction of points in a cluster that belong to the dominant ground truth class) and recall (the fraction of points in a ground truth class assigned to the dominant cluster). For each cluster, precision and recall are calculated relative to the best-matching class, and the F-measure is $F = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$, typically averaged across clusters using macro-averaging (equal weight per cluster) or micro-averaging (equal weight per point) to handle imbalances. This metric emphasizes balanced recovery of classes but requires solving an assignment problem to match clusters to classes optimally, as discussed in comparative analyses of extrinsic metrics.
The Jaccard Index, originally a set similarity measure, evaluates clustering by computing the intersection-over-union for pairs of clusters and their corresponding ground truth classes, often extended to average pairwise similarities between the predicted and true partitions. For two sets A and B, it is $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, ranging from 0 (no overlap) to 1 (identical sets); in clustering, it is applied via the contingency table to quantify overlap after optimal matching. This index is particularly intuitive for assessing set-based agreement but can be computationally intensive for large numbers of clusters due to the need for bipartite matching.
Normalized Mutual Information (NMI) provides an information-theoretic measure of the shared information between the clustering and the ground truth, normalized to account for entropy differences. It is calculated as $\mathrm{NMI}(X; Y) = \frac{2\, I(X; Y)}{H(X) + H(Y)}$, where $I(X; Y)$ is the mutual information between the two labelings and $H(X)$, $H(Y)$ are the entropies of the cluster and class distributions, respectively, all derived from the contingency table probabilities. NMI ranges from 0 (independent partitions) to 1 (identical), offering robustness to varying cluster numbers and sizes, as analyzed in studies of information-theoretic clustering comparisons.
The Fowlkes-Mallows Index (FM) focuses on the geometric mean of pairwise precision and recall, emphasizing pairwise concordance without requiring full class assignments. It is given by $FM = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$, the geometric mean of the pairwise precision $\frac{TP}{TP + FP}$ and recall $\frac{TP}{TP + FN}$, and ranges from 0 to 1. Proposed for comparing hierarchical clusterings, FM is less sensitive to cluster size imbalances than arithmetic means and performs well when the ground truth has clear boundaries, as demonstrated in its original formulation for dendrogram evaluation.
Purity measures the extent to which each cluster contains points from a single ground truth class, assigning each cluster to its dominant class and computing the fraction of points matching that class, then averaging over all clusters weighted by size. Formally, for cluster $i$ with dominant class $j$, purity is $\sum_{i} \frac{|C_{i,j}|}{n_i} \times \frac{n_i}{n}$, where $n_i$ is the cluster size and $n$ the total number of points, yielding values from 0 (no purity) to 1 (perfect). This simple metric highlights homogeneity but tends to favor solutions with many small clusters, as it does not penalize over-fragmentation.
External validation metrics share several limitations that can affect their applicability. They inherently require access to ground truth labels, which are often unavailable in real-world unsupervised settings, limiting their use to benchmark datasets. Many, such as RI, FM, and purity, assume non-overlapping hard clusters and perform poorly with fuzzy or probabilistic partitions, where points belong to multiple groups. Additionally, metrics like purity and F-measure can be biased toward balanced or equal-sized clusters, overestimating quality for fragmented solutions and underestimating it for uneven distributions, as highlighted in formal constraint analyses of extrinsic indices.
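As a small worked example of these indices, the snippet below builds a contingency table and computes several of the external measures with scikit-learn on hypothetical label vectors; the labels themselves are made up purely for illustration.
```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, confusion_matrix,
                             fowlkes_mallows_score, normalized_mutual_info_score)

# Hypothetical ground-truth classes and cluster assignments (illustrative only).
truth    = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
clusters = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix(truth, clusters)  # contingency table: rows = classes, columns = clusters
print(cm)

print("ARI:", round(adjusted_rand_score(truth, clusters), 3))
print("NMI:", round(normalized_mutual_info_score(truth, clusters), 3))
print("Fowlkes-Mallows:", round(fowlkes_mallows_score(truth, clusters), 3))

# Purity: assign each cluster (column) to its dominant class and count the matches.
purity = cm.max(axis=0).sum() / cm.sum()
print("purity:", round(float(purity), 3))
```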
Assessing Cluster Tendency
Assessing cluster tendency involves evaluating whether a dataset exhibits inherent structure suitable for clustering, prior to applying clustering algorithms, to avoid analyzing random noise as meaningful groups. This step helps determine whether the data is "clusterable," meaning it contains non-random patterns or groupings that can be captured by clustering methods. Visual and statistical approaches are commonly used to inspect this tendency, often as a preliminary analysis to inform subsequent clustering decisions.[51]
Visual methods provide an intuitive way to inspect potential groupings in the data. Scatter plots of the original features or pairs of variables allow users to visually detect concentrations or separations in low-dimensional data. For higher-dimensional datasets, projecting the data onto principal components using principal component analysis (PCA) can reveal hidden structures by reducing dimensionality while preserving variance, enabling observation of clusters in 2D or 3D plots. These techniques are particularly useful for initial exploratory analysis, though they rely on human interpretation and may miss subtle patterns in noisy data.
The Hopkins statistic is a widely used statistical test for quantifying cluster tendency by comparing the nearest-neighbor distances of randomly generated points against those of actual data points. Introduced in 1954, it yields a value H between 0 and 1, where H close to 0 (e.g., H < 0.3) indicates regularly spaced data (low cluster tendency), H ≈ 0.5 indicates a uniformly random distribution, and H > 0.5 (approaching 1) suggests a clustering tendency.[51] The statistic is calculated as the ratio of the summed nearest-neighbor distances from m uniformly generated points to the data, relative to that sum plus the summed nearest-neighbor distances of m sampled data points, providing a measure of spatial randomness. It is effective for detecting overall clusterability but can be sensitive to outliers.[52]
Visual Assessment of Tendency (VAT), proposed in 2002, is a matrix-based visual method that reorders a pairwise dissimilarity matrix to produce an image revealing block-diagonal structures, where each block corresponds to a potential cluster. The algorithm applies a nearest-neighbor heuristic to permute the rows and columns of the matrix, highlighting natural groupings as dark blocks along the diagonal against a lighter background of inter-cluster dissimilarities. VAT is computationally efficient for moderate-sized datasets and aids in estimating the number of clusters by counting distinct blocks, though it assumes a dissimilarity matrix as input. Extensions like iVAT improve robustness to noise by incorporating path-based distances.
Other tools for assessing cluster tendency include the NbClust R package, which applies up to 30 indices to recommend the optimal number of clusters, incorporating measures like the Hopkins statistic alongside silhouette and Dunn indices for comprehensive evaluation. The elbow method plots the within-cluster sum of squares against the number of clusters k, identifying an "elbow" point where adding more clusters yields diminishing returns, signaling natural groupings. Similarly, the gap statistic compares the log of within-cluster dispersion to that of random reference data, selecting k where the gap is maximized to detect non-random structure. These methods often complement tendency assessment by suggesting a value of k if clustering is viable.
Assessing cluster tendency assumes that datasets may contain noise or outliers, which can mask true structures; robust methods like VAT, or the Hopkins statistic combined with preprocessing, help mitigate this by focusing on dominant patterns. Preprocessing plays a crucial role, such as feature scaling (e.g., standardization to zero mean and unit variance) to ensure distance-based measures treat all dimensions equally, preventing bias from varying scales. Without scaling, tendency tests may falsely indicate randomness in heterogeneous data.
In gene expression data analysis, assessing cluster tendency is essential to identify natural groupings of genes or samples before clustering; for instance, the Hopkins statistic is applied to microarray datasets to confirm that clusterable structure related to biological pathways or disease states is present before clustering is attempted.
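A minimal sketch of the Hopkins statistic, under the convention used above (values approaching 1 suggest clustering), is given below; the sampling fraction, the bounding-box uniform sampling, and the test data are all illustrative assumptions.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, m=None, seed=0):
    """Sketch of the Hopkins statistic: compare nearest-neighbour distances of
    uniformly sampled points (u) against those of sampled data points (w);
    values near 0.5 suggest randomness, values approaching 1 suggest clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)  # common rule-of-thumb sample size (assumed)

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from m uniform points in the data's bounding box to their nearest data point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distances from m sampled data points to their nearest *other* data point.
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())

# Illustrative check on clustered versus uniform data (assumed, not from the text).
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(c, 0.2, size=(100, 2)) for c in ((0, 0), (3, 3))])
uniform = rng.uniform(0, 3, size=(200, 2))
print("clustered H:", round(hopkins_statistic(clustered), 2))
print("uniform   H:", round(hopkins_statistic(uniform), 2))
```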
Applications
In Biology and Medicine
Cluster analysis plays a pivotal role in biology and medicine by enabling the organization of high-dimensional datasets, such as gene expression profiles and medical images, into meaningful groups that reveal underlying patterns in biological processes and disease mechanisms. In biology, hierarchical clustering has been instrumental in analyzing genome-wide expression patterns from DNA microarrays, where genes with similar expression levels across samples are grouped to identify co-regulated functional modules. For instance, Eisen et al. (1998) introduced a hierarchical clustering system that visualizes expression data as color-coded dendrograms, facilitating the discovery of cell cycle-regulated genes in yeast.[53] Similarly, hierarchical methods like unweighted pair group method with arithmetic mean (UPGMA) are widely used in phylogenetic tree construction to infer evolutionary relationships from distance matrices of sequence similarities, assuming a constant rate of evolution.[54] In protein interaction networks, spectral clustering uncovers functional modules by leveraging the eigenvectors of the network's Laplacian matrix, identifying densely connected subgraphs that correspond to protein complexes. Li et al. (2010) demonstrated this approach on yeast PPI data, achieving higher precision in complex detection compared to traditional methods.[55]
In bioinformatics, clustering addresses the redundancy in large sequence datasets, with tools like CD-HIT employing a greedy incremental algorithm to group protein or nucleotide sequences at user-specified identity thresholds, reducing database sizes for faster similarity searches. Li et al. (2006) reported that CD-HIT processes million-sequence databases in hours, outperforming earlier tools in speed and accuracy for non-redundant set construction.[56] For metagenomics, density-based methods such as DBSCAN identify operational taxonomic units (OTUs) in microbial communities by grouping sequences based on neighborhood density, accommodating varying cluster shapes in sparse 16S rRNA data. Bhat and Prabhu (2017) highlighted DBSCAN's utility in handling noise and outliers in uncultured microbial profiles, improving taxonomic resolution over centroid-based alternatives.[57]
In medicine, centroid-based clustering like k-means stratifies patients using electronic health records (EHRs) to uncover disease subtypes based on clinical features such as vital signs and lab results. Chen et al. (2016) applied k-means to EHR data from chronic disease cohorts, identifying prognostic subgroups that informed personalized care coordination.[58] For image analysis, fuzzy c-means (FCM) segmentation partitions MRI scans into tumor, edema, and healthy tissue regions by allowing partial memberships, which is robust to intensity inhomogeneities common in brain imaging. Koley and Sadhu (2019) enhanced FCM with rough sets for precise glioma boundary delineation, achieving Dice scores above 0.85 on benchmark datasets.[59]
Notable case studies illustrate these applications' impact. In cancer subtyping, model-based clustering with Gaussian mixtures analyzes The Cancer Genome Atlas (TCGA) multi-omics data to delineate molecular subtypes, such as luminal and basal-like breast cancers, guiding targeted therapies. Lee et al. (2020) developed the Hydra framework, which automatically detects multimodal distributions in TCGA profiles, revealing subtype-specific driver mutations.[60] During the COVID-19 pandemic, density-based clustering grouped symptoms like fever and dyspnea from patient reports to identify risk profiles, aiding triage. A 2024 analysis using DBSCAN on symptom co-occurrence data from 2020 outbreaks uncovered persistent clusters linked to long COVID severity.[61]
Despite these advances, clustering in biology and medicine faces challenges from high dimensionality and noise in omics data, where the "curse of dimensionality" amplifies irrelevant features and dropout events in single-cell RNA-seq. Duò et al. (2025) benchmarked algorithms on scRNA-seq datasets, showing that noise distorts cluster boundaries unless mitigated by dimensionality reduction, yet over-reduction risks losing biological signals.[62] These issues necessitate robust preprocessing and validation to ensure interpretable results in clinical settings.
In Business and Social Sciences
In business, cluster analysis plays a pivotal role in market segmentation, enabling firms to divide heterogeneous consumer bases into homogeneous groups based on shared characteristics such as demographics, behaviors, or psychographics, thereby facilitating targeted marketing strategies and resource allocation.[63] This approach, rooted in seminal methodological frameworks, allows companies to identify actionable customer segments for product differentiation and personalized advertising.[63] For instance, k-means clustering has been applied to retail data to segment customers by recency, frequency, and monetary value (RFM) metrics, improving campaign effectiveness in e-commerce.[64]
In the financial sector, cluster analysis supports risk management by grouping assets, institutions, or clients according to risk profiles derived from financial indicators like credit scores, transaction histories, and volatility measures.[65] Applications include credit risk evaluation, where mixed data clustering combines quantitative and qualitative features to classify loan applicants into low-, medium-, and high-risk categories, enhancing lending decisions.[66] Additionally, density-based methods like DBSCAN have been used for fraud detection in transaction datasets, isolating anomalous clusters from normal patterns to mitigate financial losses.[65]
In the social sciences, cluster analysis aids in uncovering typologies and structures within complex human data, such as identifying social classes or behavioral patterns in sociological surveys. A foundational application involves hierarchical clustering of relational data to classify social networks, revealing community structures based on interaction similarities, as demonstrated in early work on network partitioning.[67] In psychology, it has been employed to delineate personality types from Big Five trait inventories, with robust data-driven approaches identifying four distinct configurations—average, reserved, self-centered, and role-model—across large datasets, providing insights into individual differences and mental health correlations.[68] Model-based clustering further extends this to qualitative social data, such as interview codings, to reveal motives or attitudes in prevention studies.[69]
In Computer Science and Web Technologies
In computer science and web technologies, cluster analysis serves as a foundational technique for organizing unstructured data, enabling efficient processing in areas such as information retrieval and pattern recognition. Algorithms like k-means and hierarchical clustering are integrated into scalable systems to handle large-scale datasets, supporting tasks from search optimization to user behavior modeling. This application leverages clustering's ability to group similar items without supervision, enhancing algorithmic efficiency in dynamic environments like the web.[70]
Document clustering, a key method in information retrieval, often employs term frequency-inverse document frequency (TF-IDF) vectorization combined with k-means to group text-based content for search engines. For instance, this approach has been applied to the 20 Newsgroups dataset, a standard benchmark simulating news articles, where TF-IDF transforms documents into numerical vectors before k-means partitions them into topical clusters, improving relevance in search results like those in early Google news grouping systems.[71][72] The technique reduces dimensionality and highlights discriminative terms, achieving high purity scores in topical separation for web-scale text corpora.[73]
Density-based clustering, such as DBSCAN, plays a critical role in anomaly detection for cybersecurity, identifying unusual patterns in network traffic that signal intrusions. By defining clusters based on data point density, these methods isolate outliers representing potential threats, like irregular connection spikes in enterprise networks, with applications in intrusion detection systems that process real-time logs to flag deviations from normal behavior.[74] Studies show DBSCAN outperforms partition-based alternatives in handling varying densities of attack vectors, enhancing detection accuracy in high-dimensional network data.[75]
In recommendation systems, spectral clustering enhances collaborative filtering by uncovering latent structures in user-item interactions, grouping similar videos or content for platforms like YouTube. This method constructs affinity graphs from user ratings and applies eigenvector decomposition to partition data, enabling personalized suggestions through clustered user preferences or item similarities, as demonstrated in optimizations that improve recommendation diversity and precision.[76] For video grouping, spectral techniques reduce the computational overhead of matrix factorization in large-scale collaborative models.[77]
Web usage mining utilizes hierarchical clustering to analyze user sessions, representing navigation paths as sequences of page visits to identify behavioral patterns. In this process, sessions are preprocessed into vectors or trees, then agglomerated using linkage criteria like Ward's method to form dendrograms that reveal site structure and user intents, aiding in personalization and load balancing for web servers.[78] This approach excels in capturing sequential dependencies in paths, with evaluations showing improved silhouette scores for session grouping in e-commerce logs.[79]
Case studies in image retrieval post-2015 highlight deep clustering's impact, where convolutional neural networks extract features before clustering to enable content-based search in vast databases. For example, web-scale image collections of 100 million items have been clustered using deep representations and approximate nearest neighbors, achieving sub-hour processing on single machines for retrieval tasks in search engines.[80] In big data contexts, MapReduce implementations of k-means distribute centroid updates across nodes, scaling to terabyte-scale datasets by parallelizing distance computations and iterations, as shown in frameworks that reduce convergence time by factors of 10-20 on Hadoop clusters.[81]
Clustering integrates seamlessly into machine learning pipelines via libraries like scikit-learn, where estimators such as KMeans or DBSCAN are chained with preprocessing steps like standardization in Pipeline objects for end-to-end workflows. This modular design supports deployment in production systems, from feature extraction to model evaluation, ensuring reproducible clustering in applications like web analytics.[70][82]
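As a hedged sketch of such a pipeline, the example below chains TF-IDF vectorization with k-means in a single scikit-learn Pipeline on a subset of the 20 Newsgroups corpus mentioned earlier (substituting TF-IDF for the numeric standardization step named above); the chosen categories and parameters are illustrative, and the dataset is downloaded on first use.
```python
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Three illustrative newsgroups; fetch_20newsgroups downloads the corpus if needed.
categories = ["sci.space", "rec.sport.hockey", "talk.politics.mideast"]
docs = fetch_20newsgroups(subset="train", categories=categories,
                          remove=("headers", "footers", "quotes")).data

# End-to-end pipeline: sparse TF-IDF features feeding a k-means clusterer.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english")),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
labels = pipeline.fit_predict(docs)
print("documents per cluster:", [int((labels == k).sum()) for k in range(3)])
```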
Advanced Topics
Fuzzy and Probabilistic Clustering
Fuzzy clustering extends traditional hard partitioning by allowing data points to belong to multiple clusters with varying degrees of membership, represented by values in the interval [0,1] that sum to 1 across all clusters for each point.[83] This approach is particularly useful for handling overlapping clusters and ambiguous data where strict boundaries are inappropriate.[83]
The seminal fuzzy c-means (FCM) algorithm, originally proposed by Dunn in 1973 and generalized by Bezdek in 1981, minimizes an objective function that incorporates membership degrees raised to a fuzziness parameter $m > 1$: $J = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \lVert x_i - v_k \rVert^2$, where $u_{ik}$ is the membership of data point $x_i$ in cluster $k$, $v_k$ is the cluster prototype, and $\lVert \cdot \rVert^2$ denotes the squared Euclidean distance. The algorithm iteratively updates memberships and prototypes: $u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{2/(m-1)}}$ and $v_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_i}{\sum_{i=1}^{n} u_{ik}^{m}}$. Convergence is achieved when changes in $J$ fall below a threshold.[83]
Probabilistic clustering interprets these soft assignments as posterior probabilities of cluster membership, often modeled via finite mixture distributions such as Gaussian mixtures. The expectation-maximization (EM) algorithm, introduced by Dempster et al. in 1977, iteratively estimates mixture parameters by maximizing the likelihood, yielding probabilistic cluster assignments in the E-step. This framework links fuzzy methods to statistical inference, enabling uncertainty quantification in cluster assignments.[84]
Possibilistic clustering, developed by Krishnapuram and Keller in 1993, relaxes the sum-to-1 constraint on memberships, treating them as degrees of possibility rather than probabilities.[85] This modification enhances robustness to noise and outliers by avoiding forced assignments to all clusters, with an objective function similar to FCM but incorporating a temperature parameter to control typicality.[85]
In applications, fuzzy clustering excels in image processing, such as MRI brain tissue segmentation, where it delineates regions with gradual transitions and handles intensity inhomogeneities effectively.[86] It also supports decision-making under uncertainty by providing graded memberships for risk assessment in ambiguous scenarios.[83]
Key advantages of fuzzy and probabilistic methods include robustness to outliers and noise compared to hard clustering, as memberships dilute the influence of anomalous points.[87] However, they require careful tuning of the fuzziness parameter $m$, where values near 1 yield crisp partitions and larger $m$ increases overlap but may lead to less distinct clusters.[83]
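The membership and prototype updates above are straightforward to prototype; the following NumPy sketch implements them directly, with random initialization and a simple convergence check (the function name, parameter values, and synthetic data are illustrative assumptions).
```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means sketch implementing the membership and prototype
    updates given above; parameter choices are illustrative."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        Um = U ** m
        # Prototype update: weighted mean of the points with weights u_ik^m.
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances of every point to every prototype (small floor avoids division by zero).
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        D = np.fmax(D, 1e-12)
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)).
        ratio = (D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U, V

# Illustrative run on two overlapping blobs (assumed data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, size=(100, 2)),
               rng.normal((1.5, 1.5), 0.5, size=(100, 2))])
U, V = fuzzy_c_means(X, c=2)
print("prototypes:\n", np.round(V, 2))
print("sample memberships:\n", np.round(U[:3], 2))
```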
Recent Developments and Challenges
In recent years, deep clustering has emerged as a significant advancement, integrating deep neural networks with traditional clustering techniques to learn representations and assignments jointly. A seminal approach is Deep Embedded Clustering (DEC), introduced in 2016, which uses an autoencoder to map data into a lower-dimensional space and optimizes cluster assignments via a clustering loss that minimizes distances to cluster centers while repelling points from other clusters.[88] This method has inspired subsequent works, including surveys highlighting its role in improving clustering on complex datasets through end-to-end learning.[89] Building on this, graph neural networks (GNNs) have been adapted for clustering in the 2020s, particularly for graph-structured data; for instance, Deep Modularity Networks (DMoN) employ unsupervised GNN pooling inspired by modularity measures to identify dense node communities, outperforming traditional methods on benchmark graphs.[90]
Scalability remains a key focus for handling large-scale and streaming data. Mini-batch k-means, a variant of k-means that processes data in small random subsets, reduces computational time while approximating the full-batch objective, making it suitable for massive datasets.[70] Similarly, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), originally proposed in 1996, has seen updates for streaming environments, where it builds hierarchical summaries of data points to enable incremental clustering without full data access.[91] These methods address the limitations of classical algorithms in high-volume scenarios, such as real-time analytics.
Subspace and ensemble clustering have advanced to manage multi-view and multimodal data prevalent in modern applications. Multi-view subspace clustering seeks low-dimensional subspaces shared across views, with recent methods like adaptive consensus graph learning constructing unified affinity matrices from multiple data representations to enhance robustness.[92] Ensemble approaches integrate diverse clustering results, particularly for multimodal data in the 2020s, by combining subspace recoveries to mitigate view-specific noise and improve overall partition quality.[93]
Despite these progressions, clustering faces persistent challenges. The curse of dimensionality leads to sparsity and distance metric degradation in high-dimensional spaces, complicating meaningful cluster separation; dimensionality reduction techniques like t-SNE, applied as preprocessing, help by embedding data into lower dimensions while preserving local structures.[94] Interpretability is another hurdle, especially in deep clustering models, where explainable AI (XAI) methods aim to elucidate cluster decisions through inherent model design or post-hoc explanations, such as prototype-based interpretations.[95] Ethical concerns arise from biases in AI clustering, often stemming from unrepresentative training data that perpetuate discriminatory groupings, such as in social or medical applications, necessitating bias audits and diverse datasets.[96]
Looking ahead, quantum clustering prototypes show promise for exponential speedups on complex problems.
Variational quantum algorithms, explored since 2023, use quantum circuits to optimize clustering objectives, enabling multi-cluster detection in noisy intermediate-scale quantum (NISQ) devices; subsequent hybrid quantum-classical approaches, such as quantum-assisted k-means in 2024, and NISQ-compatible improvements in 2025, further enhance practicality for real-world datasets.[97][98][99] Additionally, federated learning frameworks support privacy-preserving clustering by allowing decentralized model training across clients without sharing raw data, as in patient clustering for personalized medicine; recent advances in clustered federated learning as of 2025 address data heterogeneity in healthcare and IoT applications through model-based and hybrid partitioning strategies.[100][101] These directions aim to balance computational efficiency, privacy, and ethical integrity in evolving data landscapes.
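To illustrate the scalability point about mini-batch k-means raised above, the hedged sketch below runs scikit-learn's MiniBatchKMeans on a larger synthetic dataset and also shows the partial_fit entry point suited to streaming chunks; the dataset size, batch size, and k are illustrative assumptions.
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Larger synthetic dataset (illustrative, not a figure from the text).
X, _ = make_blobs(n_samples=200_000, centers=10, n_features=20, random_state=0)

# fit() iterates over random mini-batches rather than the full dataset at each step.
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3, random_state=0).fit(X)
print("inertia (within-cluster sum of squares):", round(float(mbk.inertia_), 1))

# For genuine streams, partial_fit can be called on successive chunks instead of fit().
stream_model = MiniBatchKMeans(n_clusters=10, random_state=0)
for chunk in np.array_split(X, 50):
    stream_model.partial_fit(chunk)
print("streamed centroids:", stream_model.cluster_centers_.shape)
```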