Clustering high-dimensional data
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce many measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
Problems
Four problems need to be overcome for clustering in high-dimensional data:[1]
- Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality. This problem is known as the curse of dimensionality.
- The concept of distance becomes less precise as the number of dimensions grows, since the distances between pairs of points in a given dataset converge toward a common value. The discrimination of the nearest and farthest point in particular becomes meaningless: $\lim_{d\to\infty}\frac{\text{dist}_{\max}-\text{dist}_{\min}}{\text{dist}_{\min}} \to 0$ (a small simulation after this list illustrates the effect).
- A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
- Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.[2]
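The concentration effect is easy to reproduce numerically. The following minimal Python sketch (synthetic uniform data; the sample sizes and dimensions are arbitrary choices, not taken from the sources above) shows the ratio of the smallest to the largest pairwise distance approaching 1 as the dimensionality grows:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))           # 500 points drawn uniformly from [0, 1]^d
    dist = pdist(X)                    # all pairwise Euclidean distances
    print(d, dist.min() / dist.max())  # ratio approaches 1 as d grows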
Approaches
Approaches towards clustering in axis-parallel or arbitrarily oriented affine subspaces differ in how they interpret the overall goal, which is finding clusters in data with high dimensionality.[1] A different approach altogether is to find clusters based on patterns in the data matrix, often referred to as biclustering, which is a technique frequently utilized in bioinformatics.
Subspace clustering
Subspace clustering aims to look for clusters in different combinations of dimensions (i.e., subspaces) and unlike many other clustering approaches does not assume that all of the clusters in a dataset are found in the same set of dimensions.[3] Subspace clustering can take bottom-up or top-down approaches. Bottom-up methods (such as CLIQUE) heuristically identify relevant dimensions by dividing the data space into a grid structure, selecting dense units, and then iteratively linking them if they are adjacent and dense.[3]
As a simple illustration, consider a two-dimensional space in which a number of clusters can be identified. In the one-dimensional subspaces, one cluster can be found in the subspace spanned by the x axis, and three clusters can be found in the subspace spanned by the y axis. One of the latter cannot be considered a cluster in the two-dimensional (sub-)space, since it is too sparsely distributed along the x axis. In two dimensions, two clusters can be identified.
The problem of subspace clustering is given by the fact that there are $2^d$ possible axis-parallel subspaces of a space with $d$ dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results. For example, the downward-closure property (cf. association rules) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones, since any cluster that appears in a subspace T also appears (as a projection) in every subspace S ⊆ T; this approach is taken by most of the traditional algorithms such as CLIQUE[4] and SUBCLU.[5] It is also possible to define a subspace using different degrees of relevance for each dimension, an approach taken by iMWK-Means,[6] EBK-Modes[7] and CBK-Modes.[8]
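To make the bottom-up, grid-based strategy concrete, the following Python sketch (an illustrative simplification, not the full CLIQUE algorithm: it stops at two-dimensional candidate units and omits merging adjacent units into clusters) identifies dense one-dimensional grid units and, per the downward-closure property, combines only those into two-dimensional candidates:

import numpy as np
from itertools import combinations

def dense_units(X, bins=10, tau=0.05):
    # X is assumed to be scaled to [0, 1] in every dimension.
    n, d = X.shape
    cells = np.clip((X * bins).astype(int), 0, bins - 1)
    # dense 1-D units per dimension
    dense_1d = {j: {c for c in range(bins)
                    if np.sum(cells[:, j] == c) / n > tau} for j in range(d)}
    # downward closure: 2-D candidates are built only from dense 1-D units
    dense_2d = []
    for i, j in combinations(range(d), 2):
        for ci in dense_1d[i]:
            for cj in dense_1d[j]:
                mask = (cells[:, i] == ci) & (cells[:, j] == cj)
                if mask.sum() / n > tau:
                    dense_2d.append(((i, ci), (j, cj)))
    return dense_1d, dense_2d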
Projected clustering
Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.
For example, the PreDeCon algorithm checks which attributes seem to support a clustering for each point, and adjusts the distance function such that dimensions with low variance are amplified.[9] In the two-dimensional example above, the cluster that is too sparse along the x axis might still be found using DBSCAN with a distance function that places less emphasis on the x axis and thus exaggerates the low difference along the y axis sufficiently to group the points into a cluster.
PROCLUS uses a similar approach with a k-medoid clustering.[10] Initial medoids are guessed, and for each medoid the subspace spanned by attributes with low variance is determined. Points are assigned to the closest medoid, considering only the subspace of that medoid in determining the distance. The algorithm then proceeds as the regular PAM algorithm.
If the distance function weights attributes differently, but never with 0 (and hence never drops irrelevant attributes), the algorithm is called a "soft"-projected clustering algorithm.
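A minimal sketch of the idea behind such distance adjustments is shown below; it uses a single global inverse-variance weighting of the attributes before running DBSCAN, which is a crude stand-in for the per-point subspace preferences that PreDeCon computes locally (the weighting scheme and parameters are illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN

def variance_weighted_dbscan(X, eps=0.5, min_samples=5):
    # Emphasize low-variance attributes by rescaling each feature with the
    # inverse of its variance, then cluster with plain Euclidean DBSCAN.
    w = 1.0 / (X.var(axis=0) + 1e-9)
    Xw = X * np.sqrt(w / w.sum())
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Xw)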
Projection-based clustering
Projection-based clustering is based on a nonlinear projection of high-dimensional data into a two-dimensional space.[11] Typical projection methods like t-distributed stochastic neighbor embedding (t-SNE)[12] or the neighbor retrieval visualizer (NerV)[13] are used to project data explicitly into two dimensions, disregarding the subspaces of higher dimension than two and preserving only relevant neighborhoods in the high-dimensional data. In the next step, the Delaunay graph[14] between the projected points is calculated, and each edge between two projected points is weighted with the high-dimensional distance between the corresponding high-dimensional data points. Thereafter the shortest path between every pair of points is computed using the Dijkstra algorithm.[15] The shortest paths are then used in the clustering process, which involves two choices depending on the structure type in the high-dimensional data.[11] This Boolean choice can be decided by looking at the topographic map of high-dimensional structures.[16] In a benchmarking of 34 comparable clustering methods, projection-based clustering was the only algorithm that was always able to find the high-dimensional distance- or density-based structure of the dataset.[11] Projection-based clustering is accessible in the open-source R package "ProjectionBasedClustering" on CRAN.[17]
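The pipeline described above can be sketched with standard Python libraries as follows. This is not the implementation in the "ProjectionBasedClustering" package; it substitutes a plain hierarchical clustering for the structure-dependent final step, and the data and parameters are illustrative assumptions:

import numpy as np
from sklearn.manifold import TSNE
from scipy.spatial import Delaunay
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 20)) for c in (0, 4)])

P = TSNE(n_components=2, random_state=0).fit_transform(X)   # nonlinear 2-D projection

n = len(X)
W = lil_matrix((n, n))
for simplex in Delaunay(P).simplices:                       # Delaunay graph on projected points
    for i in range(3):
        a, b = simplex[i], simplex[(i + 1) % 3]
        W[a, b] = W[b, a] = np.linalg.norm(X[a] - X[b])      # weight edges by high-dim distance

D = shortest_path(W.tocsr(), method="D", directed=False)    # Dijkstra shortest paths

Z = linkage(squareform(D, checks=False), method="average")  # cluster the path distances
labels = fcluster(Z, t=2, criterion="maxclust")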
Bootstrap-based clustering
[edit]Bootstrap aggregation (bagging) can be used to create multiple clusters and aggregate the findings. This is done by taking random subsamples of the data, performing a cluster analysis on each of them and then aggregating the results of the clusterings to generate a dissimilarity measure which can then be used to explore and cluster the original data.[18][19] Since high-dimensional data are likely to have many non-informative features, weights can be used during the bagging process to increase the impact of the more informative aspects. This produces "ABC dissimilarities" which can then be used to explore and cluster the original data and also to assess which features appear to be more impactful in defining the clusters. [20] [21] [22]
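A minimal sketch of this resampling idea is given below: it clusters bootstrap subsamples with k-means and turns the co-assignment frequencies into a dissimilarity matrix. It is an unweighted illustration in the spirit of bagged clustering, not the weighted ABC procedure itself, and the parameters are arbitrary:

import numpy as np
from sklearn.cluster import KMeans

def bagged_dissimilarity(X, n_boot=50, k=3, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))   # times a pair was assigned to the same cluster
    seen = np.zeros((n, n))       # times a pair appeared in the same resample
    for b in range(n_boot):
        idx = np.unique(rng.choice(n, size=n, replace=True))
        labels = KMeans(n_clusters=k, n_init=10, random_state=b).fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]
        seen[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    with np.errstate(invalid="ignore"):
        dissim = 1.0 - together / seen
    dissim[seen == 0] = 1.0       # pairs never resampled together: maximally dissimilar
    np.fill_diagonal(dissim, 0.0)
    return dissim

The resulting matrix can then be fed to any clustering algorithm that accepts precomputed dissimilarities.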
Hybrid approaches
Not all algorithms try to either find a unique cluster assignment for each point or all clusters in all subspaces; many settle for a result in between, in which a number of possibly overlapping, but not necessarily exhaustive, clusters are found. An example is FIRES, which is in its basic approach a subspace clustering algorithm, but uses a heuristic too aggressive to credibly produce all subspace clusters.[23] Another hybrid approach is to include the human in the algorithmic loop: human domain expertise can help to reduce an exponential search space through heuristic selection of samples. This can be beneficial in the health domain where, e.g., medical doctors are confronted with high-dimensional descriptions of patient conditions and measurements on the success of certain therapies. An important question in such data is to compare and correlate patient conditions and therapy results along combinations of dimensions. The number of dimensions is often very large; consequently, one needs to map them to a smaller number of relevant dimensions to make the data more amenable to expert analysis, because irrelevant, redundant, and conflicting dimensions can negatively affect the effectiveness and efficiency of the whole analytic process.[24]
Correlation clustering
Another type of subspace is considered in correlation clustering (data mining).
Software
References
- ^ a b Kriegel, H. P.; Kröger, P.; Zimek, A. (2009). "Clustering high-dimensional data". ACM Transactions on Knowledge Discovery from Data. 3: 1–58. doi:10.1145/1497577.1497578. S2CID 17363900.
- ^ Houle, M. E.; Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? (PDF). Scientific and Statistical Database Management. Lecture Notes in Computer Science. Vol. 6187. p. 482. doi:10.1007/978-3-642-13818-8_34. ISBN 978-3-642-13817-1.
- ^ a b Parsons, Lance; Haque, Ehtesham; Liu, Huan (2004-06-01). "Subspace clustering for high dimensional data: a review". ACM SIGKDD Explorations Newsletter. 6 (1): 90–105. doi:10.1145/1007730.1007731. ISSN 1931-0145.
- ^ Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (2005). "Automatic Subspace Clustering of High Dimensional Data". Data Mining and Knowledge Discovery. 11: 5–33. CiteSeerX 10.1.1.131.5152. doi:10.1007/s10618-005-1396-1. S2CID 9289572.
- ^ Kailing, K.; Kriegel, H. P.; Kröger, P. (2004). Density-Connected Subspace Clustering for High-Dimensional Data. Proceedings of the 2004 SIAM International Conference on Data Mining. pp. 246. doi:10.1137/1.9781611972740.23. ISBN 978-0-89871-568-2.
- ^ De Amorim, R.C.; Mirkin, B. (2012). "Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering". Pattern Recognition. 45 (3): 1061. Bibcode:2012PatRe..45.1061C. doi:10.1016/j.patcog.2011.08.012.
- ^ Carbonera, Joel Luis; Abel, Mara (November 2014). "An Entropy-Based Subspace Clustering Algorithm for Categorical Data". 2014 IEEE 26th International Conference on Tools with Artificial Intelligence. IEEE. pp. 272–277. doi:10.1109/ictai.2014.48. ISBN 9781479965724. S2CID 7208538.
- ^ Carbonera, Joel Luis; Abel, Mara (2015). "CBK-Modes: A Correlation-based Algorithm for Categorical Data Clustering". Proceedings of the 17th International Conference on Enterprise Information Systems. SCITEPRESS - Science and Technology Publications. pp. 603–608. doi:10.5220/0005367106030608. ISBN 9789897580963.
- ^ Böhm, C.; Kailing, K.; Kriegel, H. -P.; Kröger, P. (2004). Density Connected Clustering with Local Subspace Preferences (PDF). Fourth IEEE International Conference on Data Mining (ICDM'04). p. 27. doi:10.1109/ICDM.2004.10087. ISBN 0-7695-2142-8.
- ^ Aggarwal, C. C.; Wolf, J. L.; Yu, P. S.; Procopiuc, C.; Park, J. S. (1999). "Fast algorithms for projected clustering". ACM SIGMOD Record. 28 (2): 61. CiteSeerX 10.1.1.681.7363. doi:10.1145/304181.304188.
- ^ a b c Thrun, M. C., & Ultsch, A.: Using Projection based Clustering to Find Distance and Density based Clusters in High-Dimensional Data, J. Classif., pp. 1-33, doi: 10.1007/s00357-020-09373-2.
- ^ Van der Maaten, L., & Hinton, G.: Visualizing Data using t-SNE, Journal of Machine Learning Research, Vol. 9(11), pp. 2579-2605. 2008.
- ^ Venna, J., Peltonen, J., Nybo, K., Aidos, H., & Kaski, S.: Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research, Vol. 11, pp. 451-490. 2010.
- ^ Delaunay, B.: Sur la sphere vide, Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, Vol. 7(793-800), pp. 1-2. 1934.
- ^ Dijkstra, E. W.: A note on two problems in connexion with graphs, Numerische mathematik, Vol. 1(1), pp. 269-271. 1959.
- ^ Thrun, M. C., & Ultsch, A.: Uncovering High-Dimensional Structures of Projections from Dimensionality Reduction Methods, MethodsX, Vol. 7, pp. 101093, doi: 10.1016/j.mex.2020.101093, 2020.
- ^ "CRAN - Package ProjectionBasedClustering". Archived from the original on 2018-03-17.
- ^ Dudoit, S. and Fridlyand, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19/9, 1090–1099. doi:10.1093/bioinformatics/btg038.
- ^ Strehl, A. & Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research. 3. 583-617. 10.1162/153244303321897735.
- ^ Amaratunga, D., Cabrera, J. & Kovtun, V.. (2008). Microarray learning with ABC. Biostatistics. 9. 128-36. 10.1093/biostatistics/kxm017.
- ^ Amaratunga, D. & Cabrera, J. & Lee, Y.S. (2014). Resampling-based similarity measures for high-dimensional data. Journal of Computational Biology. 22. 10.1089/cmb.2014.0195.
- ^ Cherkas, Y., Amaratunga, D., Raghavan, N., Sasaki, J. and McMillian, M. (2016). ABC gene-ranking for prediction of drug-induced cholestasis in rats, Toxicology Reports, 3: 252–261.
- ^ Kriegel, H.; Kröger, P.; Renz, M.; Wurst, S. (2005). A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data (PDF). Fifth IEEE International Conference on Data Mining (ICDM'05). p. 250. doi:10.1109/ICDM.2005.5. ISBN 0-7695-2278-5.
- ^ Hund, M.; Böhm, D.; Sturm, W.; Sedlmair, M.; Schreck, T.; Keim, D.A.; Majnaric, L.; Holzinger, A. (2016). "Visual analytics for concept exploration in subspaces of patient groups: Making sense of complex datasets with the Doctor-in-the-loop". Brain Informatics. 3 (4): 233–247. doi:10.1007/s40708-016-0043-5. PMC 5106406. PMID 27747817.
- ^ Thrun, M. C., & Stier, Q.: Fundamental Clustering Algorithms Suite, SoftwareX, Vol. 13(C), pp. 100642, doi: 10.1016/j.softx.2020.100642, 2021.
Clustering high-dimensional data
Introduction
Overview and definitions
Clustering is the unsupervised process of partitioning a set of data objects into subsets, known as clusters, such that objects within the same cluster exhibit higher similarity to one another than to those in other clusters, based on a predefined similarity or distance measure.[1] This technique is widely used for data summarization, pattern discovery, and exploratory analysis in fields like bioinformatics and text mining.[1] High-dimensional data arises in scenarios where the number of features or attributes describing each data point significantly exceeds the number of observations, often with dimensions ranging from tens to thousands; for instance, gene expression datasets may involve thousands of genes as dimensions for relatively few samples.[1]

Traditional clustering algorithms, such as k-means and hierarchical clustering, typically assume low-dimensional spaces and rely on the Euclidean distance metric to quantify similarity, which presumes that clusters form compact, spherical groups in feature space.[5] In high-dimensional settings, however, the intrinsic geometry of the data space alters fundamentally, leading to a concentration of distances where most pairs of points appear roughly equidistant, thereby diminishing the discriminatory power of standard metrics and resulting in sparse, meaningless clusters if no adaptations are made.[1] This distance concentration effect, a manifestation of the curse of dimensionality, underscores the need for specialized approaches in high-dimensional clustering.[1]

To mitigate these issues, alternative similarity measures are essential prerequisites for effective clustering; cosine similarity, which emphasizes the directional alignment of vectors by normalizing for magnitude, proves particularly useful for sparse or text-based high-dimensional data, while the Mahalanobis distance incorporates feature correlations to provide a more robust metric in spaces with dependent dimensions.[1]
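Both measures are available in SciPy; the short sketch below (synthetic Gaussian data and a small regularization term added to the covariance are assumptions made for illustration) computes them for a pair of high-dimensional points:

import numpy as np
from scipy.spatial.distance import cosine, mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                    # toy high-dimensional sample
u, v = X[0], X[1]

cos_sim = 1.0 - cosine(u, v)                      # cosine similarity (direction only)
VI = np.linalg.inv(np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
maha = mahalanobis(u, v, VI)                      # accounts for feature correlations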
Historical context
The concept of the "curse of dimensionality" was coined by Richard Bellman in 1961 to describe the exponential increase in complexity associated with high-dimensional problems in dynamic programming. Although originally from control theory, this phenomenon gained prominence in clustering during the 1990s as datasets in fields like bioinformatics and text mining grew to hundreds of dimensions, rendering traditional distance-based methods ineffective due to sparsity and distance concentration.[2] A seminal 2004 survey by Parsons, Haque, and Liu formalized its application to high-dimensional clustering, emphasizing how it exacerbates challenges in density estimation and cluster separation.[2] The late 1990s marked the emergence of specialized techniques to address these issues, with subspace clustering introduced as a foundational approach. In 1998, Agrawal et al. proposed the CLIQUE algorithm, a grid-based method that automatically discovers dense clusters in subspaces of high-dimensional data without predefined dimensionality, enabling scalable mining in applications like market basket analysis.[6] Building on this, projected clustering gained traction in the early 2000s; Aggarwal et al.'s PROCLUS algorithm (1999) pioneered the identification of clusters relevant to subsets of dimensions, allowing different clusters to project onto varying subspaces and improving robustness in noisy, high-dimensional environments.[7] The explosion of big data after 2010, driven by sources like social media and genomics, intensified the need for advanced high-dimensional clustering, integrating it more deeply with machine learning frameworks.[8] Around 2015, deep learning began influencing dimensionality reduction for clustering, as seen in Xie et al.'s deep embedding model, which combined autoencoders for feature learning with clustering objectives to handle nonlinear manifolds in high dimensions.[9] Recent advancements from 2023 to 2025 reflect a shift toward AI-driven enhancements, particularly in density-based methods. For instance, Wang et al.'s 2025 AAAI paper introduced an enhanced density peak clustering approach using multilayer perceptrons for dimensionality reduction and hierarchical label assignment, achieving superior performance on high-dimensional benchmarks by mitigating local density estimation errors.[10]
Challenges
Curse of dimensionality
The curse of dimensionality manifests in high-dimensional spaces as an exponential growth in volume with increasing dimension $d$, which causes data points to become increasingly sparse and most pairs of points to appear roughly equidistant, a phenomenon known as distance concentration.[1] This effect degrades the utility of distance-based similarity measures central to many clustering algorithms, as the intrinsic structure of the data becomes obscured by the overwhelming scale of the ambient space. Mathematically, consider points drawn uniformly from the unit hypercube $[0,1]^d$. The expected squared Euclidean distance between two such points is exactly $d/6$, so the expected distance grows roughly as $\sqrt{d/6}$ for large $d$. Moreover, the relative variance of these distances diminishes as $d$ grows, leading to a concentration where nearly all pairwise distances cluster tightly around this mean value. This is formalized in the behavior of nearest and farthest neighbor distances: for any fixed number of points from a distribution with bounded support, $\lim_{d\to\infty}\frac{\text{dist}_{\max}-\text{dist}_{\min}}{\text{dist}_{\min}} = 0$, meaning the ratio of nearest-to-farthest neighbor distances approaches 1, rendering relative proximities meaningless. In practice, this concentration undermines traditional clustering methods like k-means, which assume meaningful distance separations to identify compact, spherical clusters; in high dimensions, however, clusters tend to appear indistinguishable and hyperspherical due to the uniformity of distances.[1] The increased sparsity further exacerbates the issue, as real-world high-dimensional data often resides on lower-dimensional manifolds embedded within the full space, making it challenging to detect underlying patterns without additional preprocessing.[1] For instance, in text clustering with 1000-dimensional bag-of-words vectors, Euclidean distance fails to capture semantic similarities effectively, prompting the use of cosine similarity, which emphasizes angular separation over magnitude and better preserves directional relationships in sparse, high-dimensional representations. One approach to mitigate these effects involves linear dimensionality reduction techniques to project data into a lower-dimensional space where distances regain discriminatory power.[1]
Feature relevance and noise issues
In high-dimensional data, irrelevant features often constitute a significant portion of the dimensions, diluting the underlying cluster structure and leading to suboptimal clustering performance. For instance, in datasets with thousands of features, many may not contribute meaningfully to group separations, effectively acting as "noise dimensions" that obscure relevant patterns. This issue is particularly pronounced in applications like genomics, where vast numbers of variables can mask true biological signals.[2] Noise amplification further complicates clustering in high dimensions, as perturbations in individual features propagate more readily across the expanded space, increasing the sensitivity of algorithms to outliers and small variations. This effect exacerbates the challenges posed by the curse of dimensionality, where distances become less discriminative, allowing noise to dominate similarity measures. Consequently, clusters may appear artificially diffuse or merged, reducing the reliability of methods like k-means or hierarchical clustering.[11] The hubness phenomenon exemplifies how noise and irrelevance manifest in high-dimensional nearest-neighbor relations: a small number of points, known as hubs, frequently appear in the k-nearest neighbor lists of many others, distorting graph-based clustering structures. This arises from the skewed degree distributions in high-dimensional spaces, making hubs act as artificial attractors that bias cluster assignments. Empirical studies have shown that hubness increases with dimensionality, leading to non-uniform neighbor occurrences that hinder accurate partitioning.[12] To assess feature relevance, metrics such as mutual information are commonly employed to quantify the dependency between features and potential cluster labels, enabling the identification of informative dimensions over noisy ones. Mutual information captures non-linear relationships, outperforming linear correlation in high-dimensional settings by measuring shared information content. A representative example occurs in microarray gene expression data for cancer subtyping, where datasets may include over 20,000 genes, but only a few hundred are relevant to distinguishing subtypes like those in breast cancer. Without addressing irrelevant genes, clustering algorithms tend to produce overfitted or unstable groups, as noise from non-expressive features overwhelms the signal from key biomarkers. Studies on such data have demonstrated that filtering irrelevant features improves subtype discovery accuracy.[13] As a preprocessing strategy, feature ranking methods help mitigate these issues by scoring features based on their independence from noise or irrelevance, prioritizing those with strong associations to the data structure. This approach, applied prior to clustering, reduces dimensionality while preserving cluster-relevant information, though it requires careful thresholding to avoid over-selection. Such techniques have been shown to enhance clustering stability in noisy high-dimensional environments.[14]
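A minimal sketch of mutual-information-based feature ranking is shown below; provisional k-means labels stand in for cluster labels, and the dataset, the number of clusters, and the cutoff of 20 features are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif

X, _ = make_classification(n_samples=500, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

mi = mutual_info_classif(X, labels, random_state=0)   # relevance of each feature
X_reduced = X[:, np.argsort(mi)[::-1][:20]]           # keep the top-ranked features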
Dimensionality Reduction Techniques
Linear methods
Linear methods for dimensionality reduction project high-dimensional data onto a lower-dimensional linear subspace, preserving key structural information to facilitate clustering by alleviating computational burdens and noise effects inherent in high dimensions. These techniques rely on linear transformations, assuming that the principal variations in the data can be captured through orthogonal or independent projections, making them computationally efficient and interpretable for global patterns. They are foundational in preprocessing pipelines for clustering algorithms, particularly when data exhibits linear dependencies or Gaussian-like distributions. Principal Component Analysis (PCA) is a cornerstone linear method, involving an orthogonal transformation of the data to a new coordinate system where the axes (principal components) are ordered by the descending magnitude of explained variance. This projects the data onto the top k eigenvectors of the covariance matrix, effectively decorrelating features and focusing on the directions of maximum variability. PCA formulates the problem as finding the unit vector $w$ that maximizes the projected variance, $\max_{\|w\|=1} w^\top X^\top X w$, where $X$ is the centered data matrix; the solutions are the eigenvectors of the covariance matrix, yielding the successive principal components. In clustering contexts, the top k components are selected to retain a cumulative explained variance above a threshold, such as 95%, thereby reducing dimensionality while maintaining essential structure for subsequent analysis. Singular Value Decomposition (SVD) provides an equivalent framework to PCA for centered data, decomposing the data matrix X as X = UΣVᵀ, where the columns of U and V are left and right singular vectors, and Σ contains singular values. It is particularly advantageous for sparse high-dimensional data, such as document-term matrices in text clustering, as it enables low-rank approximations via truncation to the largest singular values, optimizing the Frobenius norm reconstruction error without requiring explicit covariance computation. Independent Component Analysis (ICA) complements these by seeking a linear transformation to maximize statistical independence among components, rather than mere uncorrelatedness, through measures like negentropy or mutual information minimization. It models the data as a linear mixture of independent non-Gaussian sources, solving for an unmixing matrix W such that the estimated sources Y = WX are independent; this is useful in clustering scenarios with superimposed signals, like blind source separation in neuroimaging data. These methods excel in capturing global linear structures and are robust for Gaussian or near-Gaussian data, offering computational scalability (O(d³) for dense d-dimensional data via eigendecomposition) and noise reduction. However, their linearity limits effectiveness on data with nonlinear manifolds, potentially distorting local clusters. For example, in single-cell RNA sequencing datasets exceeding 10,000 features, PCA routinely reduces dimensionality to 50 components, capturing over 90% of variance with minimal loss and enabling accurate downstream clustering. Linear reductions are frequently integrated with partitional clustering like k-means, enhancing convergence speed and silhouette scores in high-dimensional settings by focusing on variance-dominant subspaces.
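The variance-threshold selection described above maps directly onto scikit-learn, where a fractional n_components keeps just enough components to reach the requested explained variance; the dataset and cluster count below are illustrative choices:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

X, _ = load_digits(return_X_y=True)                  # 64-dimensional digit images
pipe = make_pipeline(PCA(n_components=0.95, svd_solver="full"),
                     KMeans(n_clusters=10, n_init=10, random_state=0))
labels = pipe.fit_predict(X)                         # k-means on the retained components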
Nonlinear methods
Nonlinear dimensionality reduction techniques address the limitations of linear methods by capturing complex, local structures and manifolds in high-dimensional data, which is particularly beneficial for subsequent clustering tasks where global linear approximations may distort cluster boundaries. Unlike linear approaches such as principal component analysis (PCA), which focus on variance preservation across the entire dataset, nonlinear methods emphasize preserving local neighborhoods and pairwise relationships to reveal intricate patterns in high-dimensional spaces. These techniques are especially useful in clustering applications involving non-Euclidean geometries or curved manifolds, enabling more accurate separation of clusters that linear projections might merge. Multidimensional scaling (MDS) is a foundational nonlinear method that embeds high-dimensional data into a lower-dimensional space by minimizing a stress function that measures the discrepancy between pairwise distances in the original and embedded spaces. Classical or isometric MDS achieves an exact embedding for Euclidean distances using eigenvalue decomposition, while non-metric MDS relaxes this to ordinal preservation of distances through monotonic regression, making it suitable for non-Euclidean high-dimensional data in clustering. In clustering contexts, MDS helps visualize and separate clusters by preserving dissimilarities, as demonstrated in genomic datasets where it maintains inter-sample relationships better than linear methods for irregular structures. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a probabilistic nonlinear technique that focuses on local structure preservation by modeling pairwise similarities in high dimensions as Gaussian distributions and in low dimensions as t-distributions, minimizing the Kullback-Leibler (KL) divergence between them. The cost function is given by $C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$, where $P_i$ (with entries $p_{j|i}$) represents the high-dimensional pairwise similarity probabilities for point $i$, and $Q_i$ (with entries $q_{j|i}$) the low-dimensional counterparts. This approach excels in clustering high-dimensional data like images or gene expressions by revealing tight clusters through emphasis on local neighborhoods, though it requires careful parameter tuning for perplexity to balance local and global structure. Uniform Manifold Approximation and Projection (UMAP), introduced in 2018, offers a topology-preserving alternative using graph-based manifold learning to approximate the data's fuzzy topological structure before optimizing embeddings via cross-entropy minimization. UMAP is particularly effective for clustering in high-dimensional settings due to its ability to maintain both local and global data geometry, outperforming t-SNE in speed and scalability for large datasets. Autoencoders, neural network-based models, learn nonlinear representations by compressing data through an encoder and reconstructing it via a decoder, with the latent space serving as reduced features for clustering; recent 2025 advances in multilayer perceptron (MLP)-based autoencoders have improved unsupervised feature learning for nonlinear reductions in tasks like anomaly detection. Despite their strengths, nonlinear methods like t-SNE suffer from non-convex optimization leading to sensitivity to initialization and computational complexity of $O(n^2)$ for pairwise computations, limiting scalability to large high-dimensional datasets. UMAP mitigates this with a graph Laplacian approximation, achieving faster performance while preserving cluster topology.
Recent 2025 studies in bioinformatics, such as comparative analyses of single-cell RNA sequencing data, demonstrate UMAP outperforming PCA in visualizing and clustering high-dimensional transcriptomic profiles by better capturing manifold structures and cell-type separations.
Clustering Approaches
Subspace clustering
Subspace clustering algorithms seek to identify clusters that manifest in lower-dimensional subspaces of the high-dimensional data space, rather than requiring patterns to be evident across all dimensions. This core idea recognizes that in high-dimensional datasets, relevant features for different clusters may vary, and averaging distances or similarities over irrelevant dimensions can dilute cluster structures, leading to poor detection of meaningful groups. By localizing the search to subspaces where data points exhibit high density or affinity, these methods uncover hidden patterns that would be obscured in the full-dimensional analysis.[2] One of the seminal grid-based approaches is CLIQUE, introduced in 1998, which partitions the data space into a hierarchy of non-overlapping hyper-rectangular grid cells across subspaces. It identifies dense units by bottom-up exploration, starting from one-dimensional subspaces and extending to higher dimensions only if sufficient density is maintained, thereby efficiently discovering clusters without exhaustive subspace enumeration. The density of a grid unit $u$ in a subspace is computed as the fraction of the $n$ data points falling inside it, $\mathrm{sel}(u) = |\{x_i \in u\}| / n$, and clusters are formed by merging adjacent dense units ($\mathrm{sel}(u) > \tau$, with $\tau$ as the density threshold). This formulation allows CLIQUE to handle arbitrary subspace clusters while providing interpretable descriptions in disjunctive normal form.[6] A density-based variant is DENCLUE, also from 1998, which employs kernel density estimation to model cluster structures in subspaces. It uses a kernel function (typically Gaussian) to compute a smooth density estimate for each point's influence, enabling the detection of clusters as local maxima in the density landscape, even in the presence of noise. By applying this to selected subspaces, DENCLUE avoids rigid grid assumptions and supports flexible cluster shapes, making it suitable for multimedia data where subspaces reveal domain-specific patterns.[15] More recent spectral methods leverage graph theory to capture subspace affinities. For instance, Sparse Subspace Clustering (SSC), proposed in 2012, assumes data points lie in a union of low-dimensional subspaces and constructs a sparse representation by solving the optimization problem $\min_{C,E} \|C\|_1 + \lambda \|E\|_F^2$ subject to $X = XC + E$ and $\operatorname{diag}(C) = 0$, where $X$ is the data matrix, $C$ is the sparse affinity matrix encoding self-expressiveness (each point as a sparse linear combination of others in its subspace), $E$ accounts for noise, the $\ell_1$ norm promotes sparsity, $\|\cdot\|_F$ is the Frobenius norm, and $\lambda$ balances the terms. Spectral clustering is then applied to the graph induced by the symmetrized affinity $|C| + |C|^\top$ to obtain the final partitioning. This approach excels in scenarios with linear subspaces, such as motion segmentation.[16] Subspace clustering techniques offer the advantage of discovering clusters in arbitrary, overlapping subspaces without predefined dimensionality reduction, enabling nuanced analysis of complex high-dimensional data. However, they are sensitive to parameter choices, such as density thresholds or regularization weights, which can significantly impact results and require careful tuning. Moreover, 2024 reviews emphasize scalability limitations, noting that methods like SSC struggle with computational complexity when dimensions exceed 1000, often due to the cost of affinity matrix construction and eigenvalue decomposition.[17]
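The self-expressiveness step of SSC can be sketched with an off-the-shelf lasso solver, as below. This is a bare-bones illustration rather than the published algorithm: it omits the explicit noise term and the ADMM optimization, and the regularization strength is an arbitrary assumption:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_sketch(X, n_clusters=2, alpha=0.01):
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        # express point i as a sparse combination of the remaining points
        C[i, others] = Lasso(alpha=alpha, max_iter=5000).fit(X[others].T, X[i]).coef_
    W = np.abs(C) + np.abs(C).T                   # symmetrized affinity graph
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(W)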
Projected clustering
Projected clustering addresses the challenges of high-dimensional data by identifying clusters in lower-dimensional projections that are specific to each group of data points, allowing different subsets of features to be relevant for different clusters. Unlike global dimensionality reduction methods that apply a uniform projection across the entire dataset, projected clustering tailors subspaces to individual clusters, enabling the discovery of heterogeneous structures where feature relevance varies across groups. This approach mitigates the curse of dimensionality by focusing on cluster-specific subspaces, often axis-parallel or orthogonal, to reveal meaningful patterns obscured in the full space. Key algorithms in projected clustering include PROCLUS, introduced in 1999, which operates in a k-means-like manner by generating random projections to identify candidate subspaces and iteratively refining clusters based on average squared error within those projections. PROCLUS starts with random sampling of data points to form initial clusters and subspaces, then optimizes by assigning points to the best-fitting projected cluster while minimizing distortion. Building on this, ORCLUS, proposed in 2000, extends the framework through iterative optimization of orthogonal subspaces, using a principal component analysis-inspired approach to allow clusters in arbitrarily oriented directions rather than strictly axis-parallel ones. ORCLUS refines subspaces by diagonalizing covariance matrices for candidate clusters and merging or splitting them based on quality measures like subspace similarity. Mathematically, projected clustering seeks to partition the data into clusters $C_1, \dots, C_k$ while assigning a subspace $S_j$ to each cluster $C_j$, minimizing the total distortion $\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - P_{S_j}(x_i) \rVert^2$, where the $x_i$ are data points and $P_{S_j}$ denotes the orthogonal projection onto subspace $S_j$. Subspaces are often computed using singular value decomposition (SVD) to identify principal directions that capture the variance within candidate clusters, ensuring the projection preserves cluster cohesion. This objective balances cluster assignment and subspace selection, typically solved through heuristic iterations rather than exact optimization due to the combinatorial nature of subspace choice. Projected clustering excels at capturing heterogeneous clusters in high-dimensional spaces, where global methods fail due to irrelevant features dominating distance metrics, and it outperforms traditional subspace clustering on synthetic benchmarks with varying cluster dimensionalities by focusing on optimized per-cluster projections rather than exhaustive subspace enumeration. Recent comparative analyses confirm its superior accuracy and efficiency on high-dimensional synthetic datasets, achieving lower error rates in cluster recovery compared to exhaustive subspace methods. However, it assumes linear subspaces, limiting its ability to detect nonlinear manifolds, and each iteration of subspace refinement and point assignment incurs a computational cost that grows with both the number of points $n$ and the dimensionality $d$.[17][18]
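The distortion objective can be evaluated directly for a given assignment, as in the short sketch below, which anchors each cluster's subspace at its centroid and spans it with the cluster's top principal directions (the fixed subspace dimensionality is an illustrative simplification):

import numpy as np

def projected_distortion(X, labels, subspace_dim=2):
    total = 0.0
    for j in np.unique(labels):
        Xj = X[labels == j]
        mu = Xj.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xj - mu, full_matrices=False)
        B = Vt[:subspace_dim]                        # orthonormal basis of S_j
        residual = (Xj - mu) - (Xj - mu) @ B.T @ B   # component outside S_j
        total += np.sum(residual ** 2)
    return total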
Correlation clustering
Correlation clustering addresses the challenges of high-dimensional data by grouping features or data points based on their interdependencies, such as correlations, rather than Euclidean distances, which often fail in sparse high-dimensional spaces. This approach is particularly suited for datasets where dimensions exhibit interrelated patterns, like co-expressed genes in microarray data. Clusters are defined by subsets of dimensions that show strong pairwise similarities, measured using metrics like the Pearson correlation coefficient or mutual information, enabling the identification of local patterns that global distance-based methods overlook.[19] A key algorithm in this domain is CLICK (CLuster Identification via Connectivity Kernels), introduced by Sharan and Shamir in 2000 for analyzing gene expression data from microarrays. CLICK constructs a similarity graph where nodes represent genes and edges are weighted by the correlation coefficient between their expression vectors across conditions. It identifies dense subgraphs, or kernels, representing tightly correlated groups (cliques), using minimum weight cuts to partition the graph, followed by expansion through singleton adoption and merging of similar kernels. This method partitions the data into clusters without requiring a predefined number of clusters, making it effective for noisy, high-dimensional biological datasets like yeast cell cycle expression profiles.[20] Another seminal method is coupled two-way clustering, proposed by Getz et al. in 2003, which simultaneously clusters both samples and features to uncover consistent co-clustering patterns. By iteratively applying one-way clustering (e.g., based on correlation) to subsets of genes and samples, it extracts submatrices from the expression data where both rows (genes) and columns (conditions) exhibit high internal correlation, forming biclusters. This bidirectional approach reveals relational structures in high-dimensional gene expression matrices, as demonstrated on breast and colon cancer datasets, where it identified biologically meaningful modules of co-regulated genes under specific conditions.[21] In genomics, correlation clustering excels at identifying biclusters, that is, submatrices with high intra-correlation, such as groups of genes co-expressed across subsets of experiments, aiding in pathway discovery and disease subtyping. A 2024 comprehensive survey on biclustering highlights its robustness to noise in high-dimensional omics data, noting that correlation-based methods maintain performance in sparse settings by focusing on dependency structures rather than absolute values. The similarity matrix $S$ is typically defined as $S_{ij} = \rho_{ij}$, where $\rho_{ij}$ denotes the Pearson correlation between features $i$ and $j$; clusters can then be derived as connected components in a graph induced by thresholding $S$, or through hierarchical clustering applied to the corresponding dissimilarities (e.g., $1 - |S_{ij}|$).[22][19] Despite these strengths, correlation clustering assumes primarily linear relationships, limiting its ability to capture nonlinear dependencies, and it scales poorly with very large numbers of dimensions due to the quadratic complexity of computing and analyzing the similarity matrix. These drawbacks can exacerbate issues in ultra-high-dimensional data, where noise accumulation further challenges kernel detection or bicluster stability.[19]
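A minimal sketch of correlation-based feature grouping via the thresholded graph described above follows; the threshold value is an arbitrary assumption, and real correlation-clustering algorithms such as CLICK use considerably more elaborate graph machinery:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def correlation_clusters(X, threshold=0.8):
    S = np.corrcoef(X, rowvar=False)          # feature-by-feature Pearson correlations
    A = (np.abs(S) >= threshold).astype(int)  # thresholded similarity graph
    np.fill_diagonal(A, 0)
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels                             # one cluster label per feature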
Density-based and hybrid methods
Density-based clustering methods identify clusters as dense regions of data points separated by sparser areas, making them suitable for discovering clusters of arbitrary shapes in high-dimensional spaces without assuming the number of clusters in advance. These approaches, such as adaptations of DBSCAN, address the curse of dimensionality by focusing on local densities rather than global distances, though they often require careful parameter tuning to handle noise and varying densities effectively.[23] A prominent adaptation is HDBSCAN, a hierarchical extension of the DBSCAN algorithm that builds a density hierarchy to detect clusters at multiple scales.[23] Developed by Campello et al. in 2013, HDBSCAN constructs a minimum spanning tree on the data and extracts a cluster hierarchy by varying density thresholds, allowing it to identify clusters of varying densities in high-dimensional datasets while automatically labeling noise points.[24] This method outperforms traditional DBSCAN in high dimensions by mitigating sensitivity to the fixed neighborhood radius parameter, enabling robust clustering in spaces where distances become less meaningful.[25] Another key density-based technique is Density Peak Clustering (DPC), introduced by Rodriguez and Laio in 2014, which selects cluster centers as points of high local density that are far from other high-density points. In DPC, the local density of a point $i$ is computed as $\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c)$, where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, $d_{ij}$ is the distance between points $i$ and $j$, and $d_c$ is a cutoff distance. (Note: Gaussian kernel variants exist for smoother density estimation.) Cluster centers are then identified as points with high density $\rho_i$ and a large minimum distance $\delta_i$ to any point of higher density, after which non-center points are assigned based on density attraction. While effective for low-dimensional data, standard DPC struggles in high dimensions due to the concentration of distances, leading to poor density estimation.[10] To overcome these limitations, the Enhanced Density Peak Clustering (EDPC) method, proposed by Wang et al. in 2025, integrates multilayer perceptron (MLP)-based dimensionality reduction as a preprocessing step before applying DPC.[10] EDPC first generates pseudo-labels from initial sub-clusters to guide the MLP in reducing dimensionality to 50 features, preserving local structures and improving cluster separability in high-dimensional spaces.[26] This reduction enhances density estimation accuracy, yielding an average accuracy improvement of approximately 23% over vanilla DPC on high-dimensional benchmarks such as synthetic datasets and real-world high-dimensional data.[26] By leveraging nonlinear dimensionality reduction techniques like MLPs, EDPC mitigates the effects of irrelevant features and noise, though it remains sensitive to hyperparameters such as density thresholds.[26] Hybrid methods combine density-based approaches with other paradigms to enhance stability and performance in high dimensions. Bootstrap-based hybrids, emerging in the early 2000s, use resampling techniques to assess clustering robustness by generating multiple partitions and consensus clustering to select stable structures.[27] For instance, methods like those proposed by Dudoit and Fridlyand in 2002 employ bootstrap resampling to evaluate cluster stability, reducing sensitivity to initialization and noise in high-dimensional settings through ensemble averaging.
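As a concrete illustration of the density peak quantities defined above, the following sketch computes the local density rho_i and the separation delta_i from a plain Euclidean distance matrix (the cutoff d_c must be supplied by the user; no dimensionality reduction is included):

import numpy as np
from scipy.spatial.distance import squareform, pdist

def density_peaks(X, dc):
    D = squareform(pdist(X))
    rho = (D < dc).sum(axis=1) - 1               # neighbors within the cutoff, excluding self
    delta = np.full(len(X), D.max())
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]       # points of higher density
        if higher.size:
            delta[i] = D[i, higher].min()
    return rho, delta                            # centers: large rho and large delta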
More recent hybrids integrate density-based clustering with subspace search, such as the SDENK framework introduced in 2025, which performs unbiased density estimation within relevant subspaces to handle density divergence and metric biases.[28] Surveys from 2024 highlight these subspace-density combinations as effective for identifying clusters in specific feature subsets while maintaining density connectivity.[29] As of November 2025, emerging hybrids incorporate transformer-based embeddings for ultra-high-dimensional multimodal data, such as text and images, enhancing density estimation in transformer-augmented density peak methods for improved pattern discovery in AI-generated datasets.[30] Recent advances include information-theoretic extensions for handling incomplete high-dimensional data. For example, a 2025 three-way density peak clustering approach for incomplete information systems uses reflexive relations to define densities and boundaries, enabling robust partitioning even with missing values by incorporating uncertainty into cluster assignments.[31] These methods leverage Bayesian or entropy-based measures to impute densities probabilistically, improving clustering quality on datasets with up to 30% missing entries compared to traditional imputations.[32] Density-based and hybrid methods excel at capturing arbitrary cluster shapes and handling noise without predefined cluster counts, but they can be computationally intensive in high dimensions and sensitive to parameters like cutoff distances or ensemble sizes.[33] Hybrids mitigate these issues through ensembling and preprocessing, often improving stability metrics by 15-25% on high-dimensional benchmarks.[34]
Evaluation and Validation
Internal evaluation metrics
Internal evaluation metrics assess the quality of clustering results without relying on external ground truth labels, focusing instead on the intrinsic structure of the data such as cohesion within clusters and separation between them. In high-dimensional settings, these metrics are particularly challenged by phenomena like the curse of dimensionality, where distances become less meaningful due to points appearing equidistant, necessitating adaptations like dimension-specific weighting or alternative distance measures.[35] The silhouette score quantifies how well each data point fits into its assigned cluster relative to neighboring clusters, providing an overall measure of clustering validity. For a point $i$, the score is computed as $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where $a(i)$ is the average distance from $i$ to other points in the same cluster (measuring cohesion), and $b(i)$ is the smallest average distance from $i$ to points in any other cluster (measuring separation). Higher average silhouette scores (closer to 1) indicate better-defined clusters, while values near 0 suggest ambiguous assignments and negative values imply misclassifications. This metric, introduced by Rousseeuw in 1987, is widely used for its interpretability via graphical silhouettes but performs poorly in high dimensions due to the equidistance effect, often yielding low scores even for meaningful clusters.[36] The Davies-Bouldin index evaluates clustering by averaging the ratio of within-cluster scatter to between-cluster separation across all pairs of clusters, favoring compact and well-separated groups. It is defined as $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d_{ij}}$, where $k$ is the number of clusters, $\sigma_i$ is the average intra-cluster distance for cluster $i$, and $d_{ij}$ is the inter-cluster distance between the centroids of clusters $i$ and $j$. Lower values (ideally approaching 0) denote superior clustering, as originally proposed by Davies and Bouldin in 1979. In high-dimensional contexts, particularly for subspace clustering, the index can be adapted by weighting dimensions based on their relevance to specific subspaces, enhancing its applicability to localized cluster structures. The Calinski-Harabasz index, also known as the variance ratio criterion, measures the ratio of between-cluster dispersion to within-cluster dispersion, rewarding partitions that are well separated and internally compact. It is calculated as $CH = \frac{B/(k-1)}{W/(n-k)}$, where $B$ is the between-cluster sum of squares, $W$ is the within-cluster sum of squares, $n$ is the number of points, and $k$ is the number of clusters; higher values indicate better clustering quality, as formulated by Caliński and Harabasz in 1974. This metric remains relatively stable in high dimensions compared to distance-based alternatives but assumes spherical clusters, limiting its effectiveness for irregular or subspace-specific formations.[35] For subspace clustering in high dimensions, specialized measures like the subspace separation metric assess validity by evaluating density and separation in projected subspaces rather than the full space. These focus on the energy or variance captured in relevant dimensions, such as through normalized projected energy, which quantifies how well clusters align with low-dimensional subspaces without external labels. Adaptations for high-dimensional data often incorporate cosine similarity or correlation-based distances instead of Euclidean metrics to mitigate the concentration of distances, improving robustness to noise and irrelevant features.
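All three indices are available in scikit-learn and can be computed directly from the data and a label vector; the synthetic blobs below are an illustrative stand-in for real high-dimensional data, and the cosine variant of the silhouette reflects the distance adaptations mentioned above:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=500, n_features=100, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette (cosine):", silhouette_score(X, labels, metric="cosine"))  # higher is better
print("Davies-Bouldin:     ", davies_bouldin_score(X, labels))               # lower is better
print("Calinski-Harabasz:  ", calinski_harabasz_score(X, labels))            # higher is better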
Recent studies as of 2025 recommend ensemble-based metrics, combining multiple internal indices to evaluate clustering stability across perturbations, particularly for high-dimensional datasets where single metrics may falter.[37] Despite their utility, internal evaluation metrics exhibit limitations in high-dimensional clustering, including heightened sensitivity to noise that can inflate within-cluster distances and their inadequacy for handling overlapping clusters, where boundaries are ambiguous. These challenges underscore the need for careful metric selection or hybridization to ensure reliable assessment.[35][36]
External validation techniques
External validation techniques assess the quality of clustering results in high-dimensional data by comparing them against ground truth labels, providing an objective measure of agreement when such labels are available. These methods are particularly valuable in high-dimensional settings, where intrinsic structures may be obscured by the curse of dimensionality, and ground truth from domain expertise or annotations enables rigorous evaluation of subspace or projected clusters. Unlike internal metrics, which rely solely on the data's inherent properties, external techniques require prior knowledge of true partitions to quantify accuracy, robustness to noise, and preservation of relevant features across dimensions. The Adjusted Rand Index (ARI) is a widely used external metric that measures the similarity between two clusterings by considering pairwise agreements, adjusted for chance to account for random clustering expectations. It is defined as $\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}$, where $\mathrm{RI}$ is the raw Rand Index based on concordant and discordant pairs across the predicted and true clusterings, $\mathbb{E}[\mathrm{RI}]$ is its expected value under random assignment, and $\max(\mathrm{RI})$ is the maximum possible RI. This adjustment makes ARI robust to imbalances in cluster sizes, which is crucial for high-dimensional data where subspaces may yield uneven group distributions. ARI ranges from -1 to 1, with values near 1 indicating strong agreement and 0 representing chance-level performance; it has been extensively applied in evaluating subspace clustering algorithms on datasets with varying dimensionalities.[38][39][40] Normalized Mutual Information (NMI) provides another entropy-based external validation measure, quantifying the shared information between predicted clusters and ground truth labels while normalizing for differing cluster granularities. It is computed as $\mathrm{NMI} = \frac{\mathrm{MI}(C_1, C_2)}{\sqrt{H(C_1)\,H(C_2)}}$, where $\mathrm{MI}(C_1, C_2)$ is the mutual information between the two clusterings, and $H(C_1)$, $H(C_2)$ are their respective entropies. This formulation ensures NMI values between 0 (no shared information) and 1 (perfect agreement), making it suitable for high-dimensional scenarios with potential overlapping clusters, as it handles probabilistic assignments effectively. NMI is particularly advantageous in projected clustering, where feature selection may introduce partial overlaps not captured by pair-counting methods like ARI.[41][42][43] Benchmarks for external validation in high-dimensional clustering often employ synthetic datasets generated with controlled parameters, such as dimensionalities from 100 to 1000 and varying noise levels, to test metric sensitivity to the curse of dimensionality. For instance, a 2025 comparative analysis used such synthetic benchmarks to evaluate algorithms like K-means and DBSCAN, reporting ARI and NMI scores that degrade with increasing dimensions and noise but stabilize in subspace projections. Real-world benchmarks draw from high-dimensional repositories like the UCI Machine Learning Repository, including datasets such as Isolet (d=617) and Arrhythmia (d=279), where ground truth labels enable computation of ARI up to 0.85 for effective methods on gene expression subspaces.
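Both scores are computed by scikit-learn from two label vectors, as in this small sketch with hand-made labels (the example labels are illustrative only):

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]   # same partition up to renaming, with one error

print(adjusted_rand_score(true_labels, pred_labels))            # pair-counting agreement
print(normalized_mutual_info_score(true_labels, pred_labels))   # entropy-based agreement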
These benchmarks highlight how external metrics reveal performance drops in full-dimensional spaces, underscoring the need for dimensionality-aware clustering.[44] Validation strategies for high-dimensional clustering incorporate external metrics within frameworks like cross-validation on subspaces, where data is partitioned and clustered in selected feature subsets to assess consistency against ground truth. Stability analysis via resampling, such as bootstrap or subsampling, further evaluates robustness by recomputing ARI or NMI across iterations, emphasizing projected methods' reliability in noisy high-dimensional environments. Recent 2025 studies integrate these quantitative metrics with visual analytics, using techniques like t-SNE projections to qualitatively verify cluster separations alongside ARI/NMI scores on synthetic and UCI benchmarks. When ground truth is unavailable, internal metrics serve as complements for preliminary assessment.[45]
Applications
Bioinformatics and genomics
In bioinformatics and genomics, clustering high-dimensional data plays a pivotal role in analyzing gene expression profiles from microarrays, where datasets often encompass around 20,000 genes across fewer samples, enabling the identification of co-regulated genes involved in biological processes.[46] Traditional methods like hierarchical clustering have been foundational, but biclustering techniques, which simultaneously cluster genes and conditions, have proven particularly effective for uncovering patterns in sparse, high-dimensional microarray data, such as those revealing cancer subtypes through correlation-based similarities.[47] For instance, spectral biclustering approaches have been applied to gene expression profiles to predict tumor outcomes by identifying co-expressed gene sets across subsets of samples.[48] Single-cell RNA sequencing (scRNA-seq) generates even higher-dimensional data, with thousands of genes measured per cell, posing unique challenges due to sparsity and technical noise, yet clustering remains essential for cell type discovery and trajectory inference.[49] Dimensionality reduction via UMAP followed by density-based clustering, such as HDBSCAN, has become a standard pipeline for resolving heterogeneous cell populations in scRNA-seq datasets, improving resolution over traditional k-means by capturing non-linear manifolds.[50] Recent advances, including ensemble methods like scICE, enhance clustering consistency in large-scale scRNA-seq studies by randomizing hyperparameters to mitigate instability.[51] In protein interaction networks, which can be represented as high-dimensional graphs with thousands of nodes and edges, clustering identifies functional modules corresponding to biological pathways, often leveraging graph-based or subspace methods to handle noise and incompleteness.[52] Subspace clustering techniques, such as those in FACETS, decompose these networks into overlapping modules enriched for Gene Ontology terms, revealing multi-faceted functional decompositions that traditional community detection overlooks.[53] Correlation clustering is also employed here to group proteins based on interaction strengths, aiding in pathway discovery.[54] Case studies from The Cancer Genome Atlas (TCGA) exemplify these applications, where clustering multi-omics data uncovers tumor heterogeneity, such as subtype stratification in breast and lung cancers using projective clustering on gene expression matrices.[55] Challenges like batch effects, arising from technical variations across sequencing runs, are addressed via hybrid methods combining normalization with density-based clustering, as seen in integrative analyses that correct for confounders while preserving biological signals.[56] For example, recent TCGA subtyping frameworks incorporate subspace projections to handle high dimensionality, reducing false positives in heterogeneity detection.[57] The impact of these clustering strategies extends to personalized medicine, where improved subtype identification in genomics datasets contributes to better predictive accuracy for treatment response compared to traditional methods. These advances facilitate targeted therapies by pinpointing patient-specific molecular profiles.[58]
Image and signal processing
Image and signal processing
In content-based image retrieval, high-dimensional descriptors such as Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) features are commonly clustered to enable efficient similarity search across image databases. SIFT descriptors typically have 128 dimensions, capturing local features robust to scale and rotation, while HOG descriptors can reach 500 or more dimensions depending on cell configuration and binning, representing gradient orientations for object detection and retrieval. Subspace clustering is applied to these local features to identify relevant projections that mitigate the curse of dimensionality, and compact visual codebooks are typically built by running k-means over pooled descriptor sets, as sketched below.[59]

Hyperspectral imaging produces roughly 200 spectral bands per pixel, yielding high-dimensional feature vectors that challenge traditional clustering through redundancy and noise. Projected clustering methods address this by selecting band subsets tailored to specific subspaces, facilitating segmentation of land cover types such as vegetation, water, and urban areas in remote sensing. For instance, segmentation-based projected clustering using mutual nearest neighbors integrates spatial information with spectral projections to improve classification accuracy in hyperspectral scenes.[60]

In signal processing, electroencephalography (EEG) and magnetic resonance imaging (MRI) time series are transformed into high-dimensional representations via Fourier analysis, converting temporal signals into frequency-domain features for clustering. Density-based approaches, such as adaptive density peaks clustering, are applied to these features to detect anomalies such as epileptic seizures in EEG data by identifying dense regions amid sparse noise. Similarly, for functional MRI connectivity matrices, density-based methods such as OPTICS robustly group static and dynamic patterns, achieving up to 95% cluster purity in distinguishing cognitive impairments from healthy states.[61][62]

Recent work, such as a 2025 automated projection pursuit clustering workflow published in GigaScience, applies projection methods to high-dimensional hyperspectral and signal data, iteratively reducing dimensions to 3-10 for visualization while preserving cluster structure. These techniques handle redundancy through nonlinear dimensionality reduction, which is essential for capturing manifold structure in image and signal datasets; benchmarks report segmentation accuracy improvements of 20-60% over linear baselines, depending on dataset complexity.[63][64]
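The codebook construction step described above can be sketched as follows; the random 128-dimensional vectors stand in for SIFT descriptors that would in practice be extracted with a library such as OpenCV, and the vocabulary size of 256 is an illustrative choice.

# Sketch of building a bag-of-visual-words codebook from 128-d SIFT-like descriptors;
# the random descriptors and codebook size are placeholders for real extracted features.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(10000, 128))   # pooled local descriptors from many images

# Quantize the descriptor space into a visual vocabulary of 256 "words"
codebook = MiniBatchKMeans(n_clusters=256, batch_size=1024, n_init=3,
                           random_state=0).fit(descriptors)

# Represent one image (here, 300 of its descriptors) as a visual-word histogram
image_desc = rng.normal(size=(300, 128))
words = codebook.predict(image_desc)
hist, _ = np.histogram(words, bins=np.arange(257))
print(hist.shape)   # (256,) feature vector usable for similarity search

Each image is thereby reduced to a fixed-length histogram over the visual vocabulary, which can be indexed and compared far more cheaply than the raw descriptor sets.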
Software tools
Open-source libraries
Scikit-learn, a widely used Python library for machine learning, provides built-in support for dimensionality reduction techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), alongside clustering algorithms such as k-means, enabling effective handling of high-dimensional data through pipelines that combine reduction and clustering steps.[65] Subspace clustering can be achieved via custom extensions, such as integrating feature selection or projection methods within scikit-learn's Pipeline class, allowing users to focus on relevant subspaces; in practice this scales to dimensionalities on the order of 10,000, depending on available memory and sample size. The library's modular design supports these workflows without external dependencies, making it suitable for exploratory analysis of high-dimensional datasets.[65]

The HDBSCAN library, implemented in Python, extends density-based clustering to high-dimensional spaces by building a hierarchy of clusters that handles varying densities more robustly than traditional DBSCAN, with adaptations for subspace exploration through parameter tuning or preprocessing. It handles high-dimensional inputs by performing single-linkage clustering on a mutual reachability distance graph, which mitigates some curse-of-dimensionality effects. In 2023, integration with NVIDIA's RAPIDS cuML added GPU acceleration for HDBSCAN, speeding up computation on large high-dimensional datasets by up to 50 times compared with CPU-based implementations.

Other notable open-source libraries include Orange, a visual programming tool in Python that offers widgets for high-dimensional data visualization and clustering, such as multidimensional scaling (MDS) and hierarchical clustering, enabling interactive workflows without extensive coding, and ELKI, a Java-based framework that provides an extensive collection of subspace and projected clustering algorithms, including SUBCLU and PROCLUS, optimized for high-dimensional data mining with support for custom distance functions and indexing structures. These libraries are free, open source, and actively maintained by their communities, ensuring ongoing updates and broad accessibility for researchers and practitioners.[66][67]

For practical use, these libraries often combine dimensionality reduction with clustering on benchmark datasets such as those from the UCI Machine Learning Repository. A simple example in scikit-learn applies PCA followed by DBSCAN to reduce dimensions and identify clusters, using the Iris dataset as a stand-in for a higher-dimensional UCI dataset:

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
import numpy as np
# Load UCI-like dataset (Iris as example; replace with high-d UCI data)
data = load_iris()
X = data.data # Shape: (150, 4); scale up for high-d
# Dimensionality reduction with PCA to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X_reduced)
# Output cluster labels
print(np.unique(clusters, return_counts=True))
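
The reduction and clustering steps above can equivalently be wrapped in a single scikit-learn Pipeline, as mentioned earlier; the following sketch chains standardization, PCA, and k-means, with parameter values chosen for illustration rather than tuned for any particular dataset.

# Sketch of wrapping reduction and clustering in a scikit-learn Pipeline;
# the step choices and parameter values are illustrative, not tuned recommendations.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

pipe = Pipeline([
    ("scale", StandardScaler()),      # put all features on a comparable scale
    ("reduce", PCA(n_components=2)),  # project onto a low-dimensional subspace
    ("cluster", KMeans(n_clusters=3, n_init=10, random_state=0)),
])
labels = pipe.fit_predict(X)  # Pipeline delegates fit_predict to the final step
print(labels[:10])

The final step can be swapped for a density-based estimator such as hdbscan.HDBSCAN when clusters of varying density are expected, since the Pipeline simply delegates fit_predict to its last stage.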
