Clustering high-dimensional data
from Wikipedia

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.

Problems

Four problems need to be overcome for clustering in high-dimensional data:[1]

  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. This problem is known as the curse of dimensionality.
  • The concept of distance becomes less precise as the number of dimensions grows, since the distances between any two points in a given dataset converge. In particular, the discrimination between the nearest and the farthest point becomes meaningless: $\lim_{d \to \infty} \frac{\operatorname{dist}_{\max} - \operatorname{dist}_{\min}}{\operatorname{dist}_{\min}} \to 0$ (see the numerical sketch after this list).
  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
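
The distance-concentration effect in the second point above is easy to reproduce numerically. The following Python sketch (an illustration only; the sample sizes and dimensions are arbitrary) draws uniform random points and prints the relative contrast between the farthest and nearest neighbor of a random query, which shrinks as the dimensionality grows:

    import numpy as np

    rng = np.random.default_rng(0)

    def relative_contrast(n_points=500, dim=2):
        """(dist_max - dist_min) / dist_min for a random query point
        against a sample drawn uniformly from the unit hypercube."""
        data = rng.random((n_points, dim))
        query = rng.random(dim)
        dists = np.linalg.norm(data - query, axis=1)
        return (dists.max() - dists.min()) / dists.min()

    for d in (2, 10, 100, 1000):
        print(d, round(relative_contrast(dim=d), 3))
    # The printed contrast drops by orders of magnitude as d grows:
    # the nearest and the farthest point become almost equally far away.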

Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.[2]

Approaches

Approaches towards clustering in axis-parallel or arbitrarily oriented affine subspaces differ in how they interpret the overall goal, which is finding clusters in data with high dimensionality.[1] An altogether different approach is to find clusters based on patterns in the data matrix, often referred to as biclustering, which is a technique frequently utilized in bioinformatics.

Subspace clustering

Example 2D space with subspace clusters

Subspace clustering aims to look for clusters in different combinations of dimensions (i.e., subspaces) and unlike many other clustering approaches does not assume that all of the clusters in a dataset are found in the same set of dimensions.[3] Subspace clustering can take bottom-up or top-down approaches. Bottom-up methods (such as CLIQUE) heuristically identify relevant dimensions by dividing the data space into a grid structure, selecting dense units, and then iteratively linking them if they are adjacent and dense.[3]

The adjacent image shows a mere two-dimensional space in which a number of clusters can be identified. In the one-dimensional subspaces, the clusters $c_a$ (in subspace $\{x\}$) and $c_b$, $c_c$, $c_d$ (in subspace $\{y\}$) can be found. $c_c$ cannot be considered a cluster in a two-dimensional (sub-)space, since it is too sparsely distributed in the $x$ axis. In two dimensions, the two clusters $c_{ab}$ and $c_{ad}$ can be identified.

The problem of subspace clustering is given by the fact that there are $2^d$ possible axis-parallel subspaces of a space with $d$ dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results. For example, the downward-closure property (cf. association rules) can be used to build higher-dimensional candidate subspaces only by combining lower-dimensional ones, since a cluster that exists in a subspace S also appears, as a dense projection, in every lower-dimensional subspace T ⊆ S; this approach is taken by most traditional algorithms such as CLIQUE[4] and SUBCLU.[5] It is also possible to define a subspace using different degrees of relevance for each dimension, an approach taken by iMWK-Means,[6] EBK-Modes[7] and CBK-Modes.[8]
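
The bottom-up use of the downward-closure property can be sketched in a few lines of Python. The toy function below is a simplified illustration of the grid-based idea behind CLIQUE, not a faithful implementation: the grid resolution and density threshold are arbitrary, the data is assumed to be scaled to the unit hypercube, and the final CLIQUE step of merging adjacent dense cells into clusters is omitted. It finds dense grid cells subspace by subspace, generating higher-dimensional candidates only from subspaces that already contain dense cells:

    from itertools import combinations
    import numpy as np

    def dense_units(data, grid=10, density_threshold=0.05):
        """Bottom-up search for dense grid cells in axis-parallel subspaces.

        Returns a dict mapping a tuple of dimensions (the subspace) to the set
        of dense cells found in it, exploiting the downward-closure property:
        a cell can only be dense in a subspace if its projections onto all
        lower-dimensional subspaces are dense as well.
        """
        n, d = data.shape
        min_points = density_threshold * n
        # Discretize every attribute into `grid` equal-width intervals
        # (assumes data scaled to [0, 1]).
        bins = np.clip((data * grid).astype(int), 0, grid - 1)

        def dense_cells(dims):
            cells, counts = np.unique(bins[:, dims], axis=0, return_counts=True)
            return {tuple(c) for c, k in zip(cells, counts) if k >= min_points}

        result = {(j,): dense_cells([j]) for j in range(d)}
        result = {k: v for k, v in result.items() if v}

        current = list(result)
        size = 1
        while current:
            size += 1
            candidates = set()
            for a, b in combinations(current, 2):
                merged = tuple(sorted(set(a) | set(b)))
                if len(merged) == size:
                    candidates.add(merged)
            current = []
            for dims in candidates:
                # Downward closure: skip if any (size-1)-dimensional projection
                # of this candidate subspace has no dense cells.
                if all(sub in result for sub in combinations(dims, size - 1)):
                    found = dense_cells(list(dims))
                    if found:
                        result[dims] = found
                        current.append(dims)
        return result

On data scaled to the unit hypercube, dense_units(X) returns, for each axis-parallel subspace that contains any dense cell, the set of those cells; CLIQUE would additionally connect adjacent dense cells into clusters and report minimal cluster descriptions.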

Projected clustering

Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.

For example, the PreDeCon algorithm checks which attributes seem to support a clustering for each point, and adjusts the distance function so that dimensions with low variance are amplified.[9] In the figure above, the cluster $c_c$ might be found using DBSCAN with a distance function that places less emphasis on the $x$-axis and thus exaggerates the low difference in the $y$-axis sufficiently to group the points into a cluster.
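
The weighting idea can be illustrated with scikit-learn's DBSCAN and a precomputed distance matrix. The sketch below is only a loose rendering of the principle, not PreDeCon's actual definition of subspace preference vectors; the neighborhood size, variance threshold delta, weighting factor kappa, and DBSCAN parameters are placeholders. Attributes with low variance in a point's neighborhood receive a large weight, and the symmetrized weighted distances are handed to a standard density-based algorithm:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    def preference_weighted_dbscan(X, n_neighbors=20, delta=0.1, kappa=100.0,
                                   eps=0.5, min_samples=5):
        """Cluster with a per-point, variance-based attribute weighting (illustrative)."""
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
        _, idx = nn.kneighbors(X)
        # Per-point attribute variances inside the local neighborhood.
        local_var = X[idx].var(axis=1)                  # shape (n_points, n_dims)
        # Attributes with low local variance get a large weight
        # (they "define" the point's preferred subspace).
        weights = np.where(local_var < delta, kappa, 1.0)
        # Symmetrized weighted Euclidean distances.
        n = len(X)
        D = np.zeros((n, n))
        for i in range(n):
            diff2 = (X - X[i]) ** 2
            D[i] = np.sqrt(np.maximum(diff2 @ weights[i],
                                      (diff2 * weights).sum(axis=1)))
        return DBSCAN(eps=eps, min_samples=min_samples,
                      metric="precomputed").fit_predict(D)

The full pairwise matrix makes this quadratic in memory and time, which is acceptable only for small illustrative datasets.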

PROCLUS uses a similar approach with a k-medoid clustering.[10] Initial medoids are guessed, and for each medoid the subspace spanned by the attributes with low variance is determined. Points are assigned to the closest medoid, with the distance computed only in the subspace of that medoid. The algorithm then proceeds as the regular PAM algorithm.
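
A stripped-down sketch of that assignment step follows (illustrative only; the real PROCLUS algorithm samples and iteratively replaces medoids, chooses dimensions via locality analysis with a target average dimensionality, and handles outliers, none of which is shown here, and the number of dimensions per medoid below is a placeholder):

    import numpy as np

    def proclus_like_assignment(X, medoid_idx, dims_per_medoid=2, n_neighbors=20):
        """Assign each point to the closest medoid, measured only in that medoid's subspace."""
        medoids = X[medoid_idx]
        # Full-space distances of every point to every medoid, used to build localities.
        full_d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        subspaces = []
        for m in range(len(medoid_idx)):
            locality = X[np.argsort(full_d[:, m])[:n_neighbors]]
            # Attributes along which the locality is most compact span this medoid's subspace.
            subspaces.append(np.argsort(locality.var(axis=0))[:dims_per_medoid])
        # Distance to each medoid restricted to that medoid's dimensions.
        sub_d = np.column_stack([
            np.linalg.norm(X[:, dims] - medoids[m][dims], axis=1)
            for m, dims in enumerate(subspaces)
        ])
        return sub_d.argmin(axis=1), subspaces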

If the distance function weights attributes differently, but never with 0 (and hence never drops irrelevant attributes), the algorithm is called a "soft"-projected clustering algorithm.

Projection-based clustering

Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space.[11] Typical projection methods such as t-distributed stochastic neighbor embedding (t-SNE)[12] or the neighbor retrieval visualizer (NerV)[13] are used to project data explicitly into two dimensions, disregarding the subspaces of dimension higher than two and preserving only the relevant neighborhoods in the high-dimensional data. In the next step, the Delaunay graph[14] between the projected points is calculated, and each edge between two projected points is weighted with the high-dimensional distance between the corresponding high-dimensional data points. Thereafter, the shortest path between every pair of points is computed using the Dijkstra algorithm.[15] The shortest paths are then used in the clustering process, which involves two choices depending on the structure type in the high-dimensional data.[11] This Boolean choice can be decided by looking at the topographic map of the high-dimensional structures.[16] In a benchmarking of 34 comparable clustering methods, projection-based clustering was the only algorithm that was always able to find the high-dimensional distance- or density-based structure of the dataset.[11] Projection-based clustering is available in the open-source R package "ProjectionBasedClustering" on CRAN.[17]
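
The pipeline described above can be approximated with standard scientific-Python tools. The sketch below is a rough reconstruction of the listed steps, not the ProjectionBasedClustering R package, and the final step (hierarchical clustering on the graph distances, with a fixed number of clusters) is just one possible choice for the clustering decision mentioned in the text:

    import numpy as np
    from scipy.sparse import lil_matrix
    from scipy.sparse.csgraph import dijkstra
    from scipy.spatial import Delaunay
    from scipy.spatial.distance import euclidean
    from sklearn.manifold import TSNE
    from sklearn.cluster import AgglomerativeClustering

    def projection_based_clustering(X, n_clusters=3, random_state=0):
        # 1. Nonlinear projection of the high-dimensional data into 2D.
        proj = TSNE(n_components=2, random_state=random_state).fit_transform(X)
        # 2. Delaunay graph on the projected points, with each edge weighted by
        #    the distance between the corresponding high-dimensional points.
        tri = Delaunay(proj)
        n = len(X)
        graph = lil_matrix((n, n))
        for simplex in tri.simplices:
            for i in simplex:
                for j in simplex:
                    if i < j:
                        graph[i, j] = euclidean(X[i], X[j])
        # 3. Shortest paths between every pair of points (Dijkstra).
        path_d = dijkstra(graph.tocsr(), directed=False)
        # 4. Cluster on the resulting path-distance matrix (one possible choice).
        model = AgglomerativeClustering(n_clusters=n_clusters,
                                        metric="precomputed", linkage="average")
        return model.fit_predict(path_d)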

Bootstrap-based clustering

Bootstrap aggregation (bagging) can be used to create multiple clusterings and aggregate the findings. This is done by taking random subsamples of the data, performing a cluster analysis on each of them, and then aggregating the results of the clusterings to generate a dissimilarity measure which can then be used to explore and cluster the original data.[18][19] Since high-dimensional data are likely to have many non-informative features, weights can be used during the bagging process to increase the impact of the more informative aspects. This produces "ABC dissimilarities" which can then be used to explore and cluster the original data and also to assess which features appear to be more impactful in defining the clusters.[20][21][22]
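
The aggregation step can be sketched as follows. This is a generic ensemble sketch rather than the exact ABC-dissimilarity construction of the cited papers (which also weights features); the subsample fraction, number of repetitions, and base algorithm (k-means) are arbitrary choices:

    import numpy as np
    from sklearn.cluster import KMeans

    def bagged_dissimilarity(X, n_bags=100, sample_frac=0.7, k=5, random_state=0):
        """Dissimilarity from bootstrapped clusterings: the fraction of runs in
        which two co-sampled points were placed in different clusters."""
        rng = np.random.default_rng(random_state)
        n = len(X)
        disagree = np.zeros((n, n))
        together = np.zeros((n, n))
        for _ in range(n_bags):
            idx = rng.choice(n, size=int(sample_frac * n), replace=False)
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=int(rng.integers(2**31 - 1))).fit_predict(X[idx])
            same = labels[:, None] == labels[None, :]
            together[np.ix_(idx, idx)] += 1
            disagree[np.ix_(idx, idx)] += ~same
        with np.errstate(divide="ignore", invalid="ignore"):
            D = np.where(together > 0, disagree / together, 1.0)
        np.fill_diagonal(D, 0.0)
        return D

The resulting matrix can be passed to any dissimilarity-based method, for example hierarchical clustering with a precomputed metric, and inspected to see which points are rarely grouped together.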

Hybrid approaches

Not all algorithms try to either find a unique cluster assignment for each point or find all clusters in all subspaces; many settle for a result in between, in which a number of possibly overlapping, but not necessarily exhaustive, clusters are found. An example is FIRES, which in its basic approach is a subspace clustering algorithm but uses a heuristic too aggressive to credibly produce all subspace clusters.[23] Another hybrid approach is to include a human in the algorithmic loop: human domain expertise can help to reduce an exponential search space through heuristic selection of samples. This can be beneficial in the health domain where, e.g., medical doctors are confronted with high-dimensional descriptions of patient conditions and measurements on the success of certain therapies. An important question in such data is to compare and correlate patient conditions and therapy results along combinations of dimensions. The number of dimensions is often very large, so one needs to map them to a smaller number of relevant dimensions that is more amenable to expert analysis, because irrelevant, redundant, and conflicting dimensions can negatively affect the effectiveness and efficiency of the whole analytic process.[24]

Correlation clustering

Another type of subspace is considered in correlation clustering (data mining).

Software

  • ELKI includes various subspace and correlation clustering algorithms
  • FCPS includes over fifty clustering algorithms[25]

from Grokipedia
Clustering high-dimensional data involves partitioning large datasets characterized by numerous features (such as profiles, text documents, or sensor readings) into groups of similar instances to uncover inherent structures and patterns, often using unsupervised techniques that extend traditional clustering to mitigate the effects of excessive dimensionality. A primary challenge in this domain is the curse of dimensionality, where high-dimensional spaces lead to data sparsity, making distances between points nearly uniform and rendering conventional metrics such as the Euclidean distance ineffective for identifying meaningful nearest neighbors or clusters. This phenomenon, exacerbated by irrelevant or noisy dimensions, causes traditional algorithms such as k-means to perform poorly, as clusters may only manifest in specific subspaces rather than the full feature space.

To overcome these limitations, subspace clustering methods have emerged as a cornerstone approach, seeking clusters within lower-dimensional projections tailored to subsets of features, thereby improving interpretability and accuracy for applications in bioinformatics, web mining, and image analysis. These techniques are broadly categorized into top-down strategies, which start with the full space and iteratively refine subspaces (e.g., PROCLUS and ORCLUS), and bottom-up methods, which build clusters from dense regions in low-dimensional grids (e.g., CLIQUE and ENCLUS). The field originated in the late 1990s with early subspace methods such as CLIQUE (1998) addressing the curse of dimensionality in databases, evolving through the 2000s with further algorithmic advances and continuing to incorporate modern techniques as of the 2020s. Contemporary developments further enhance subspace clustering by incorporating deep learning for automatic feature extraction in ultra-high-dimensional settings, by exploring hybrid models, and by introducing fairness-aware mechanisms to reduce biases in real-time streaming data from IoT devices. Evaluation remains critical, relying on various metrics that assess cluster quality, though standardized benchmarks are still evolving to accommodate diverse high-dimensional scenarios.

Introduction

Overview and definitions

Clustering is the unsupervised process of partitioning a set of objects into subsets, known as clusters, such that objects within the same cluster exhibit higher similarity to one another than to those in other clusters, based on a predefined similarity or distance measure. This technique is widely used for summarization, pattern discovery, and exploratory analysis in fields such as bioinformatics. High-dimensional data arises in scenarios where the number of features or attributes describing each data point significantly exceeds the number of observations, often with dimensions ranging from tens to thousands; for instance, gene-expression datasets may involve thousands of genes as dimensions for relatively few samples. Traditional clustering algorithms, such as k-means, typically assume low-dimensional spaces and rely on the Euclidean distance metric to quantify similarity, which presumes that clusters form compact, spherical groups in feature space. In high-dimensional settings, however, the intrinsic geometry of the data space alters fundamentally, leading to a concentration of distances where most pairs of points appear roughly equidistant, thereby diminishing the discriminatory power of standard metrics and resulting in sparse, meaningless clusters if no adaptations are made. This distance concentration effect, a manifestation of the curse of dimensionality, underscores the need for specialized approaches in high-dimensional clustering. To mitigate these issues, alternative similarity measures are essential prerequisites for effective clustering: cosine similarity, which emphasizes the directional alignment of vectors by normalizing for magnitude, proves particularly useful for sparse or text-based high-dimensional data, while the Mahalanobis distance incorporates feature correlations to provide a more robust metric in spaces with dependent dimensions.
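
Both alternative measures are available in standard libraries; the minimal sketch below uses made-up toy data:

    import numpy as np
    from scipy.spatial.distance import cosine, mahalanobis

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 50))          # toy high-dimensional sample
    a, b = X[0], X[1]

    # Cosine distance: 1 - cos(angle); ignores vector magnitude,
    # which is useful for sparse or text-based data.
    d_cos = cosine(a, b)

    # Mahalanobis distance: accounts for correlations between features via
    # the inverse covariance matrix estimated from the sample.
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    d_mah = mahalanobis(a, b, VI)

    print(round(d_cos, 3), round(d_mah, 3))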

Historical context

The concept of the "curse of dimensionality" was first coined by Richard Bellman in 1961 to describe the exponential increase in complexity associated with high-dimensional problems in dynamic programming. Although it originated in that setting, the phenomenon gained prominence in clustering during the 1990s as datasets in fields such as bioinformatics grew to hundreds of dimensions, rendering traditional distance-based methods ineffective due to sparsity and distance concentration. A seminal 2004 survey by Parsons, Haque, and Liu formalized its application to high-dimensional clustering, emphasizing how it exacerbates challenges in cluster separation.

The late 1990s marked the emergence of specialized techniques to address these issues, with subspace clustering introduced as a foundational approach. In 1998, Agrawal et al. proposed the CLIQUE algorithm, a grid-based method that automatically discovers dense clusters in subspaces of high-dimensional data without predefined dimensionality, enabling scalable mining in applications such as market basket analysis. Building on this, projected clustering gained traction around the turn of the millennium; Aggarwal et al.'s PROCLUS algorithm (1999) pioneered the identification of clusters relevant to subsets of dimensions, allowing different clusters to project onto varying subspaces and improving robustness in noisy, high-dimensional environments.

The explosion of big data after 2010 intensified the need for advanced high-dimensional clustering and integrated it more deeply with large-scale data-processing frameworks. Around 2015, deep learning began influencing clustering, as seen in Xie et al.'s deep embedded clustering model, which combined autoencoders for dimensionality reduction with clustering objectives to handle nonlinear manifolds in high dimensions. Recent advancements from 2023 to 2025 reflect a shift toward AI-driven enhancements, particularly in density-based methods; for instance, a 2025 AAAI paper introduced an enhanced density peak clustering approach using multilayer perceptrons and hierarchical label assignment, achieving superior performance on high-dimensional benchmarks by mitigating local errors.

Challenges

The curse of dimensionality manifests in high-dimensional spaces as an exponential growth in volume with increasing dimension $d$, which causes data points to become increasingly sparse and most pairs of points to appear roughly equidistant, a phenomenon known as distance concentration. This effect degrades the utility of distance-based similarity measures central to many clustering algorithms, as the intrinsic structure of the data becomes obscured by the overwhelming scale of the ambient space. Mathematically, consider points drawn uniformly from the unit hypercube $[0,1]^d$. The expected squared Euclidean distance between two such points is exactly $d/6$, so the expected distance grows like $\sqrt{d/6}$, while the spread of pairwise distances around this value stays comparatively small; consequently, the contrast between the nearest and the farthest neighbor shrinks as $d$ increases.
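
The $d/6$ figure follows from a one-line calculation (a worked derivation of the claim above, in LaTeX):

    % X, Y i.i.d. uniform on [0,1]^d; each coordinate difference X_i - Y_i has
    % mean 0 and variance Var(X_i) + Var(Y_i) = 1/12 + 1/12 = 1/6, hence
    \mathbb{E}\!\left[\lVert X - Y \rVert^{2}\right]
      = \sum_{i=1}^{d} \mathbb{E}\!\left[(X_i - Y_i)^{2}\right]
      = \frac{d}{6},
    \qquad
    \mathbb{E}\!\left[\lVert X - Y \rVert\right] \approx \sqrt{d/6}
    \quad \text{for large } d.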