Diffusion map
Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon[1][2][3][4] which computes a family of embeddings of a data set into Euclidean space (often low-dimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the "diffusion distance" between probability distributions centered at those points. Unlike linear dimensionality reduction methods such as principal component analysis (PCA), diffusion maps belong to the family of nonlinear dimensionality reduction methods, which focus on discovering the underlying manifold from which the data has been sampled. By integrating local similarities at different scales, diffusion maps give a global description of the data set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive.
Definition of diffusion maps
Following [3] and [5], diffusion maps can be defined in four steps.
Connectivity
Diffusion maps exploit the relationship between heat diffusion and random-walk Markov chains. The basic observation is that if we take a random walk on the data, walking to a nearby data point is more likely than walking to another that is far away. Let $(X, \mathcal{A}, \mu)$ be a measure space, where $X$ is the data set and $\mu$ represents the distribution of the points on $X$.
Based on this, the connectivity between two data points, $x$ and $y$, can be defined as the probability of walking from $x$ to $y$ in one step of the random walk. Usually, this probability is specified in terms of a kernel function of the two points: $k(x, y)$. For example, the popular Gaussian kernel:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{\epsilon}\right)$$
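As a concrete illustration, the Gaussian kernel can be evaluated on a small point set with a few lines of NumPy (a minimal sketch; the bandwidth and the sample points are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(X, epsilon):
    """Pairwise Gaussian affinities k(x, y) = exp(-||x - y||^2 / epsilon)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / epsilon)

# Three sample points: two close together, one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
K = gaussian_kernel(X, epsilon=1.0)
# Nearby pairs receive an affinity near 1; distant pairs near 0
```

Here `K[0, 1]` is much larger than `K[0, 2]`, reflecting the random-walk intuition that walking to a nearby point is more likely.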
More generally, the kernel function has the following properties:

$$k(x, y) = k(y, x)$$ ($k$ is symmetric)

$$k(x, y) \geq 0$$ ($k$ is positivity preserving).
The kernel constitutes the prior definition of the local geometry of the data-set. Since a given kernel will capture a specific feature of the data set, its choice should be guided by the application that one has in mind. This is a major difference with methods such as principal component analysis, where correlations between all data points are taken into account at once.
Given $k$, we can then construct a reversible discrete-time Markov chain on $X$ (a process known as the normalized graph Laplacian construction):

$$d(x) = \int_X k(x, y) \, d\mu(y)$$

and define:

$$p(x, y) = \frac{k(x, y)}{d(x)}$$

Although the new normalized kernel does not inherit the symmetry property, it does inherit the positivity-preserving property and gains a conservation property:

$$\int_X p(x, y) \, d\mu(y) = 1$$
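In the discrete setting these two steps amount to dividing each row of the kernel matrix by its row sum; the conservation property then says each row of the normalized kernel sums to one. A short NumPy sketch (the Gaussian kernel and bandwidth are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # 20 sample points in R^3

# Kernel matrix k(x_i, x_j) from a Gaussian kernel
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 1.0)

d = K.sum(axis=1)                                 # d(x) = sum_y k(x, y)
P = K / d[:, None]                                # p(x, y) = k(x, y) / d(x)

row_sums = P.sum(axis=1)                          # conservation: each row sums to 1
# P stays positivity preserving but, unlike K, is no longer symmetric
```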
Diffusion process
From $p(x, y)$ we can construct a transition matrix of a Markov chain ($M$) on $X$. In other words, $M_{i,j} = p(x_i, x_j)$ represents the one-step transition probability from $x_i$ to $x_j$, and $M^t$ gives the $t$-step transition matrix.
We define the diffusion matrix $L$ (it is also a version of the graph Laplacian matrix)

$$L_{i,j} = k(x_i, x_j).$$

We then define the new kernel

$$L^{(\alpha)}_{i,j} = \frac{L_{i,j}}{\left(d(x_i)\, d(x_j)\right)^{\alpha}}$$

or equivalently,

$$L^{(\alpha)} = D^{-\alpha} L D^{-\alpha}$$

where $D$ is a diagonal matrix and $D_{i,i} = \sum_j L_{i,j}$.

We apply the graph Laplacian normalization to this new kernel:

$$M = (D^{(\alpha)})^{-1} L^{(\alpha)},$$

where $D^{(\alpha)}$ is a diagonal matrix and $D^{(\alpha)}_{i,i} = \sum_j L^{(\alpha)}_{i,j}$.
One of the main ideas of the diffusion framework is that running the chain forward in time (taking larger and larger powers of $M$) reveals the geometric structure of $X$ at larger and larger scales (the diffusion process). Specifically, the notion of a cluster in the data set is quantified as a region in which the probability of escaping this region is low (within a certain time $t$). Therefore, $t$ not only serves as a time parameter, but it also has the dual role of a scale parameter.
The eigendecomposition of the matrix $M^t$ yields

$$M^t_{i,j} = \sum_l \lambda_l^t \psi_l(x_i) \phi_l(x_j)$$

where $\{\lambda_l\}$ is the sequence of eigenvalues of $M$ and $\{\psi_l\}$ and $\{\phi_l\}$ are the biorthogonal right and left eigenvectors respectively. Due to the spectrum decay of the eigenvalues, only a few terms are necessary to achieve a given relative accuracy in this sum.
Parameter α and the diffusion operator
The reason to introduce the normalization step involving $\alpha$ is to tune the influence of the data point density on the infinitesimal transition of the diffusion. In some applications, the sampling of the data is generally not related to the geometry of the manifold we are interested in describing. In this case, we can set $\alpha = 1$ and the diffusion operator approximates the Laplace–Beltrami operator. We then recover the Riemannian geometry of the data set regardless of the distribution of the points. To describe the long-term behavior of the point distribution of a system of stochastic differential equations, we can use $\alpha = 0.5$ and the resulting Markov chain approximates the Fokker–Planck diffusion. With $\alpha = 0$, it reduces to the classical graph Laplacian normalization.
Diffusion distance
The diffusion distance at time $t$ between two points can be measured as the similarity of two points in the observation space with the connectivity between them. It is given by

$$D_t(x_i, x_j)^2 = \sum_y \frac{\left(p(y, t \mid x_i) - p(y, t \mid x_j)\right)^2}{\phi_0(y)}$$

where $\phi_0(y)$ is the stationary distribution of the Markov chain, given by the first left eigenvector of $M$. Explicitly:

$$\phi_0(y) = \frac{d(y)}{\sum_{z \in X} d(z)}$$
Intuitively, $D_t(x_i, x_j)$ is small if there is a large number of short paths connecting $x_i$ and $x_j$. There are several interesting features associated with the diffusion distance, based on our previous discussion that $t$ also serves as a scale parameter:
- Points are closer at a given scale (as specified by $t$) if they are highly connected in the graph, therefore emphasizing the concept of a cluster.
- This distance is robust to noise, since the distance between two points depends on all possible paths of length $t$ between the points.
- From a machine learning point of view, the distance takes into account all evidence linking $x_i$ to $x_j$, allowing us to conclude that this distance is appropriate for the design of inference algorithms based on the preponderance of evidence.[3]
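The definition of the diffusion distance can be evaluated directly on a small dataset by running the Markov matrix forward $t$ steps and comparing rows, weighted by the stationary distribution (a NumPy sketch with $\alpha = 0$ and a Gaussian kernel; both are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
d = K.sum(axis=1)
M = K / d[:, None]                     # Markov matrix

t = 3
Mt = np.linalg.matrix_power(M, t)      # t-step transition probabilities
phi0 = d / d.sum()                     # stationary distribution

def diffusion_distance(i, j):
    """D_t(x_i, x_j): compare how mass spreads from x_i and x_j after t steps."""
    diff = Mt[i] - Mt[j]
    return np.sqrt(np.sum(diff ** 2 / phi0))
```

Because every path of length $t$ contributes to the rows of $M^t$, perturbing a single edge changes this distance only slightly, which is the robustness property noted in the list.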
Diffusion process and low-dimensional embedding
The diffusion distance can be calculated using the eigenvectors by

$$D_t(x_i, x_j)^2 = \sum_l \lambda_l^{2t} \left(\psi_l(x_i) - \psi_l(x_j)\right)^2.$$

So the eigenvectors can be used as a new set of coordinates for the data. The diffusion map is defined as:

$$\Psi_t(x) = \left(\lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_k^t \psi_k(x)\right)$$
Because of the spectrum decay, it is sufficient to use only the first k eigenvectors and eigenvalues. Thus we get the diffusion map from the original data to a k-dimensional space which is embedded in the original space.
In [6] it is proved that

$$\|\Psi_t(x_i) - \Psi_t(x_j)\|^2 = \sum_{l=1}^{k} \lambda_l^{2t} \left(\psi_l(x_i) - \psi_l(x_j)\right)^2 \approx D_t(x_i, x_j)^2,$$
so the Euclidean distance in the diffusion coordinates approximates the diffusion distance.
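This relation can be checked numerically: the spectral form of the diffusion distance (using all nontrivial eigenvectors) matches the direct computation from the $t$-step transition rows. A NumPy sketch; diagonalizing the symmetric conjugate $D^{-1/2} K D^{-1/2}$ and normalizing the eigenvectors in $L^2(\phi_0)$ is a standard numerical device, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))

sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
d = K.sum(axis=1)
M = K / d[:, None]                             # Markov matrix

# Symmetric conjugate S = D^{-1/2} K D^{-1/2} shares M's eigenvalues
S = K / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Right eigenvectors of M, normalized in L^2(phi_0)
psi = np.sqrt(d.sum()) * U / np.sqrt(d)[:, None]

t = 3
# Spectral diffusion distance between points 0 and 1
spec = np.sqrt(np.sum(lam ** (2 * t) * (psi[0] - psi[1]) ** 2))

# Direct diffusion distance from the t-step transition rows
Mt = np.linalg.matrix_power(M, t)
phi0 = d / d.sum()
direct = np.sqrt(np.sum((Mt[0] - Mt[1]) ** 2 / phi0))
# spec and direct agree up to floating-point error
```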
Algorithm
The basic algorithmic framework of diffusion maps is as follows:

Step 1. Given the similarity matrix $L$.

Step 2. Normalize the matrix according to parameter $\alpha$: $L^{(\alpha)} = D^{-\alpha} L D^{-\alpha}$.

Step 3. Form the normalized matrix $M = (D^{(\alpha)})^{-1} L^{(\alpha)}$.

Step 4. Compute the $k$ largest eigenvalues of $M^t$ and the corresponding eigenvectors.

Step 5. Use the diffusion map $\Psi_t$ to get the embedding into the $k$-dimensional space.
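The five steps can be collected into a single function (a NumPy sketch; the Gaussian kernel, its bandwidth, and the dense eigendecomposition are implementation choices, not part of the algorithm itself):

```python
import numpy as np

def diffusion_map(X, epsilon=1.0, alpha=0.5, t=1, k=2):
    """Return the k-dimensional diffusion-map embedding of the rows of X."""
    # Step 1: similarity matrix L from a Gaussian kernel
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    L = np.exp(-sq / epsilon)
    # Step 2: alpha-normalization  L^(alpha) = D^-alpha L D^-alpha
    d = L.sum(axis=1)
    L_alpha = L / np.outer(d ** alpha, d ** alpha)
    # Step 3: the Markov matrix M = (D^(alpha))^-1 L^(alpha) is conjugate to the
    # symmetric S below, so we diagonalize S instead (numerically stabler)
    d_alpha = L_alpha.sum(axis=1)
    S = L_alpha / np.sqrt(np.outer(d_alpha, d_alpha))
    # Step 4: eigendecomposition, sorted by decreasing eigenvalue
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    psi = U / np.sqrt(d_alpha)[:, None]        # right eigenvectors of M
    # Step 5: embedding, skipping the trivial constant eigenvector psi_0
    return (lam[1:k + 1] ** t) * psi[:, 1:k + 1]

rng = np.random.default_rng(0)
emb = diffusion_map(rng.normal(size=(40, 3)), epsilon=2.0, k=2)
```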
Application
In the paper,[6] Nadler et al. showed how to design a kernel that reproduces the diffusion induced by a Fokker–Planck equation. They also explained that, when the data approximate a manifold, one can recover the geometry of this manifold by computing an approximation of the Laplace–Beltrami operator. This computation is completely insensitive to the distribution of the points and therefore provides a separation of the statistics and the geometry of the data. Since diffusion maps give a global description of the data set, they can measure the distances between pairs of sample points in the manifold in which the data is embedded. Applications based on diffusion maps include face recognition,[7] spectral clustering, low-dimensional representation of images, image segmentation,[8] 3D model segmentation,[9] speaker verification[10] and identification,[11] sampling on manifolds, anomaly detection,[12][13] image inpainting,[14] revealing the organization of brain resting-state networks,[15] and so on.
Furthermore, the diffusion maps framework has been productively extended to complex networks,[16] revealing a functional organisation of networks which differs from the purely topological or structural one.
References
[edit]- ^ Coifman, R.R.; Lafon, S; Lee, A B; Maggioni, M; Nadler, B; Warner, F; Zucker, S W (2005). "Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps". PNAS. 102 (21): 7426–7431. Bibcode:2005PNAS..102.7426C. doi:10.1073/pnas.0500334102. PMC 1140422. PMID 15899970.
- ^ Coifman, R.R.; Lafon, S; Lee, A B; Maggioni, M; Nadler, B; Warner, F; Zucker, S W (2005). "Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods". PNAS. 102 (21): 7432–7437. Bibcode:2005PNAS..102.7432C. doi:10.1073/pnas.0500896102. PMC 1140426. PMID 15899969.
- ^ a b c Coifman, R.R.; S. Lafon. (2006). "Diffusion maps". Applied and Computational Harmonic Analysis. 21: 5–30. doi:10.1016/j.acha.2006.04.006. S2CID 17160669.
- ^ Lafon, S.S. (2004). Diffusion Maps and Geometric Harmonics (PDF) (PhD). Yale University.
- ^ De la Porte, J.; Herbst, B M; Hereman, W; Van der Walt, S J (2008). "An Introduction to Diffusion Maps". Proceedings of the Nineteenth Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). CiteSeerX 10.1.1.309.674.
- ^ a b Nadler, Boaz; Stéphane Lafon; Ronald R. Coifman; Ioannis G. Kevrekidis (2005). "Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker–Planck Operators" (PDF). Advances in Neural Information Processing Systems. 18. arXiv:math/0506090. Bibcode:2005math......6090N.
- ^ Barkan, Oren; Weill, Jonathan; Wolf, Lior; Aronowitz, Hagai. "Fast high dimensional vector multiplication face recognition" (PDF). Proceedings of the IEEE International Conference on Computer Vision 2013: 1960–1967.
- ^ Zeev, Farbman; Fattal Raanan; Lischinski Dani (2010). "Diffusion maps for edge-aware image editing". ACM Trans. Graph. 29 (6): 145:1–145:10. doi:10.1145/1882261.1866171.
- ^ Oana, Sidi; van Kaick, Oliver; Kleiman, Yanir; Zhang, Hao; Cohen-Or, Daniel (2011). Unsupervised Co-Segmentation of a Set of Shapes via Descriptor-Space Spectral Clustering (PDF). ACM Transactions on Graphics.
- ^ Barkan, Oren; Aronowitz, Hagai (2013). "Diffusion maps for PLDA-based speaker verification" (PDF). 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 7639–7643. doi:10.1109/ICASSP.2013.6639149. ISBN 978-1-4799-0356-6.
- ^ Michalevsky, Yan; Talmon, Ronen; Cohen, Israel (2011). Speaker Identification Using Diffusion Maps (PDF). European signal processing conference 2011.
- ^ Mishne, Gal; Cohen, Israel (2013). "Multiscale Anomaly Detection Using Diffusion Maps". IEEE Journal of Selected Topics in Signal Processing. 7 (1): 111–123. Bibcode:2013ISTSP...7..111M. doi:10.1109/jstsp.2012.2232279. S2CID 1954466.
- ^ Shabat, Gil; Segev, David; Averbuch, Amir (2018-01-07). "Uncovering Unknown Unknowns in Financial Services Big Data by Unsupervised Methodologies: Present and Future trends". KDD 2017 Workshop on Anomaly Detection in Finance. 71: 8–19.
- ^ Gepshtein, Shai; Keller, Yosi (2013). "Image Completion by Diffusion Maps and Spectral Relaxation". IEEE Transactions on Image Processing. 22 (8): 2983–2994. Bibcode:2013ITIP...22.2983G. doi:10.1109/tip.2013.2237916. PMID 23322762. S2CID 14375333.
- ^ Margulies, Daniel S.; Ghosh, Satrajit S.; Goulas, Alexandros; Falkiewicz, Marcel; Huntenburg, Julia M.; Langs, Georg; Bezgin, Gleb; Eickhoff, Simon B.; Castellanos, F. Xavier; Petrides, Michael; Jefferies, Elizabeth; Smallwood, Jonathan (2016). "Situating the default-mode network along a principal gradient of macroscale cortical organization". Proceedings of the National Academy of Sciences. 113 (44): 12574–12579. Bibcode:2016PNAS..11312574M. doi:10.1073/pnas.1608282113. PMC 5098630. PMID 27791099.
- ^ De Domenico, Manlio (2017). "Diffusion geometry unravels the emergence of functional clusters in collective phenomena". Physical Review Letters. 118 (16) 168301. arXiv:1704.07068. Bibcode:2017PhRvL.118p8301D. doi:10.1103/PhysRevLett.118.168301. PMID 28474920. S2CID 2638868.
Definition
Diffusion maps is an unsupervised machine learning algorithm for nonlinear dimensionality reduction, designed to preserve the intrinsic geometry of high-dimensional data sets by approximating the diffusion process on the underlying manifold.[10] First presented by Ronald R. Coifman and Stéphane Lafon in 2005, with a comprehensive framework detailed in 2006, the method constructs a family of embeddings that reveal the manifold's structure through the eigenfunctions of a Markov matrix derived from the data.[2][10] The core idea of diffusion maps is to map data points into a low-dimensional Euclidean space such that the Euclidean distances in this embedding approximate the diffusion distances on the original manifold, thereby capturing the data's intrinsic connectivity rather than extrinsic Euclidean relationships.[10] This approach enables efficient parametrization and analysis of complex geometric structures, particularly useful for manifold learning tasks where traditional linear methods fail.[10] Intuitively, diffusion maps treat data points as locations on a manifold where imaginary particles perform a random walk, or diffusion; the resulting embedding reflects how these particles spread and connect over time, highlighting the manifold's multiscale geometry without assuming prior knowledge of its shape.[10]
Historical Development
Diffusion maps were first presented by Ronald R. Coifman and Stéphane Lafon in 2005 in a Proceedings of the National Academy of Sciences article as a tool for harmonic analysis and data structure definition, with the comprehensive framework detailed in their 2006 paper in Applied and Computational Harmonic Analysis.[2][10] This framework, which also connected diffusion maps to spectral clustering in a contemporaneous NIPS publication, builds upon spectral graph theory and heat kernel methods to approximate the geometry of data manifolds through random walks.[3] The approach emerged in the early 2000s amid growing interest in manifold learning, driven by the limitations of linear methods like principal component analysis (PCA), which struggle to capture nonlinear structures in complex data.[10] Motivations included the need for robust embeddings that preserve local and global data relationships, inspired by random walks on graphs as a means to model diffusion-based distances on manifolds.[11] Key milestones in the development of diffusion maps followed the 2006 foundation, with refinements in the 2010s focusing on robustness to noise and scalability. For instance, Vector Diffusion Maps, proposed by Amit Singer and Hau-tieng Wu in 2012, extended the original method by incorporating vector fields and connection Laplacians to better handle noisy high-dimensional data, such as in cryo-electron microscopy images, achieving superior neighbor identification at low signal-to-noise ratios. Another advancement came in 2015 with adaptations for high-dimensional single-cell RNA sequencing data, where diffusion maps were tailored to manage inherent noise and variability in biological datasets, enabling trajectory inference in pseudotime.[12] These enhancements addressed practical challenges in real-world applications, improving the method's applicability to imperfect data. 
In the 2020s, diffusion maps have seen integrations with deep learning to overcome computational limitations, particularly for out-of-sample extensions and large-scale processing. A notable example is the 2025 work on Deep Diffusion Maps by Sergio García-Heredia, Ángela Fernández, and Carlos M. Alaíz, which reformulates the embedding as an unconstrained optimization problem solved via neural networks, eliminating the need for costly spectral decompositions and enabling efficient geometry-aware processing in neural architectures.[13] This evolution has expanded diffusion maps into AI-driven tasks, including sparse data handling in fields like single-cell analysis and beyond. Open-source Python implementations, such as the pydiffmap library, have further promoted adoption since the early 2020s by providing accessible tools for computing embeddings without proprietary software.[14]
Mathematical Foundations
Graph Representation and Connectivity
In diffusion maps, data points sampled from a low-dimensional manifold embedded in a high-dimensional Euclidean space are modeled as vertices of an undirected weighted graph, where the edge weights capture local similarities between points.[2] For a dataset consisting of $n$ points $\{x_i\}_{i=1}^n \subset \mathbb{R}^d$, the similarity graph is constructed by first computing pairwise distances, typically Euclidean $\|x_i - x_j\|$, and then defining an affinity matrix $K$ with entries given by a positive, symmetric kernel function $K_{ij} = \exp(-\|x_i - x_j\|^2 / \epsilon)$, where $\epsilon > 0$ is a scale parameter controlling the locality of the similarities.[2] The graph can be fully connected, with weights assigned to all pairs via the kernel, or sparsified using $k$-nearest neighbors (k-NN) to connect each point only to its closest neighbors, reducing computational complexity while preserving local structure; the k-NN approach is commonly employed in practical implementations to approximate the manifold geometry efficiently.[2] To handle varying data densities across the manifold, the affinity matrix is normalized by the local degree $d_i = \sum_j K_{ij}$ for each point $x_i$, yielding a row-stochastic matrix that accounts for non-uniform sampling and ensures the graph reflects intrinsic geometric connectivity rather than sampling artifacts.[2] This graph representation encodes the local neighborhood structure of the data without presupposing global topology, providing the foundational substrate for subsequent diffusion-based analysis on the manifold.[2]
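A k-NN sparsification of the affinity matrix can be sketched as follows (NumPy only; `n_neighbors`, the bandwidth, and the OR-symmetrization rule are choices, not fixed by the method):

```python
import numpy as np

def knn_affinity(X, n_neighbors=5, epsilon=1.0):
    """Sparse symmetric Gaussian affinities: keep only k-nearest-neighbor edges."""
    n = X.shape[0]
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-sq / epsilon)
    # Keep, for each point, only its n_neighbors closest points
    mask = np.zeros_like(K, dtype=bool)
    idx = np.argsort(sq, axis=1)[:, :n_neighbors + 1]   # +1 keeps the point itself
    rows = np.repeat(np.arange(n), n_neighbors + 1)
    mask[rows, idx.ravel()] = True
    mask = mask | mask.T                                # symmetrize the edge set
    return np.where(mask, K, 0.0)

rng = np.random.default_rng(3)
W = knn_affinity(rng.normal(size=(50, 2)), n_neighbors=5)
```

The resulting matrix keeps the kernel values on retained edges and zeros elsewhere, so downstream degree normalization proceeds exactly as in the fully connected case.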
Diffusion Process
The diffusion process in diffusion maps interprets the data manifold as the state space of a Markov random walk, where particles originate at data points and propagate according to local transition probabilities derived from the affinity kernel.[10] This random walk models heat diffusion on the graph, with the kernel $K_{ij}$ representing the similarity between points $x_i$ and $x_j$, thereby capturing the underlying geometric structure of the data set.[10] The one-step transition probabilities are encoded in the Markov matrix $P$, constructed by normalizing the affinity matrix row-wise:

$$P_{ij} = \frac{K_{ij}}{d_i}, \qquad d_i = \sum_j K_{ij}.$$

Diffusion Operator
The diffusion operator in diffusion maps is constructed as a normalized graph Laplacian-like matrix that models a Markov random walk on the data manifold, approximating the underlying diffusion process. It is defined through a family of operators parameterized by α ∈ [0, 1], where the matrix P_α = D_α^{-1} Ψ_α, with D_α being the diagonal matrix of row sums of the normalized kernel matrix Ψ_α and Ψ_α(i,j) = ψ_α(x_i, x_j). The normalized kernel is given by

$$\psi_\alpha(x_i, x_j) = \frac{k(x_i, x_j)}{d(x_i)^{\alpha}\, d(x_j)^{\alpha}},$$

where $d(x_i) = \sum_j k(x_i, x_j)$ is the degree of point $x_i$.

Diffusion Distance
The diffusion distance is a central metric in the diffusion maps framework, quantifying the similarity between data points based on the evolution of probability distributions under a diffusion process. Formally, for a finite data set sampled from a manifold, it is defined using the transition matrix $P$ of a Markov random walk on the data points, raised to the power $t$ (representing diffusion steps), and the stationary distribution $\pi$, which is typically the normalized degree vector for graph-based constructions or uniform if assuming equal sampling density. The squared diffusion distance between points $x_i$ and $x_j$ is given by

$$D_t^2(x_i, x_j) = \sum_k \frac{\left(P^t_{ik} - P^t_{jk}\right)^2}{\pi_k}.$$

Low-Dimensional Embedding
The low-dimensional embedding in diffusion maps is constructed by mapping each data point into a lower-dimensional Euclidean space using the spectral decomposition of the diffusion operator. Specifically, for a dataset of $n$ points $\{x_i\}_{i=1}^n$, the embedding at diffusion time $t$ is defined as $\Psi_t(x_i) = (\lambda_1^t \psi_1(x_i), \ldots, \lambda_k^t \psi_k(x_i))$, where $\lambda_1 \geq \lambda_2 \geq \cdots$ are the nontrivial eigenvalues of the transition matrix (with $\lambda_0 = 1$), $\psi_l$ are the corresponding right eigenvectors, and $k$ is chosen to capture the intrinsic dimensionality. The trivial eigenvector $\psi_0$, which is constant and associated with $\lambda_0 = 1$, is typically omitted as it does not contribute to distinguishing points. A key theoretical result ensures that this embedding preserves the diffusion geometry: the Euclidean distance in the embedded space approximates the diffusion distance, i.e., $\|\Psi_t(x_i) - \Psi_t(x_j)\| \approx D_t(x_i, x_j)$, where equality holds exactly when using all eigenvectors and approximately for the truncated embedding up to the desired precision. This is formalized in the proposition that the embedding into $\mathbb{R}^k$ recovers the diffusion distance with error controlled by the truncation level, thereby providing coordinates that reflect the underlying manifold structure as probed by the diffusion process.

The parameter $t$ governs the scale of the embedding, balancing local and global structural information; small values of $t$ emphasize local neighborhoods and fine-scale features, while larger $t$ integrate over longer paths to reveal broader clusters or connected components. For instance, in applications to point cloud data, $t$ can be tuned such that embeddings at increasing $t$ progressively coarsen the representation, merging substructures into global patterns. Compared to classical embeddings like multidimensional scaling, diffusion maps offer advantages in handling non-convex manifolds and noisy data, as the diffusion process inherently smooths perturbations while preserving intrinsic connectivity through path-based distances rather than direct Euclidean metrics.
This robustness arises from the operator's normalization, which weights paths by density and length, enabling effective dimensionality reduction even for datasets with irregularities or outliers.
Algorithm
Construction Steps
The construction of diffusion maps for a finite dataset of $ n $ points $ \{x_i\}_{i=1}^n \subset \mathbb{R}^d $ involves discretizing the diffusion process on a graph where nodes represent data points and edges are weighted by a kernel function. This yields a low-dimensional embedding that preserves the intrinsic geometry of the data manifold. The algorithm proceeds in sequential steps, incorporating parameters such as the kernel scale $ \epsilon > 0 $, normalization factor $ \alpha \in [0,1] $, diffusion time $ t > 0 $, and embedding dimension $ k $.

Step 1: Build the affinity matrix. Start by constructing a symmetric affinity matrix $ K $ using a Gaussian kernel to measure similarity between points:

$$K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\epsilon}\right).$$
Step 2: Normalize for sampling density. Form the α-normalized kernel $ K^{(\alpha)}_{ij} = K_{ij} / (d_i^{\alpha} d_j^{\alpha}) $, where $ d_i = \sum_j K_{ij} $. Compute the row-normalized degrees for the adjusted kernel:

$$d_i^{(\alpha)} = \sum_j K^{(\alpha)}_{ij},$$

and form the transition matrix $ P_{ij} = K^{(\alpha)}_{ij} / d_i^{(\alpha)} $.

Step 3: Eigendecomposition. Perform eigendecomposition on $ P $:

$$P \psi_l = \lambda_l \psi_l, \qquad 1 = \lambda_0 \geq \lambda_1 \geq \cdots,$$

retaining the $ k $ largest nontrivial eigenpairs.

Step 4: Embedding. Form the diffusion map embedding for each point $ x_i $ at time $ t $:

$$\Psi_t(x_i) = \left(\lambda_1^t \psi_1(x_i), \ldots, \lambda_k^t \psi_k(x_i)\right).$$

In pseudocode:
Input: Data points {x_i}_{i=1}^n, parameters ε, α, t, k
1. Compute affinity matrix K_ij = exp(-||x_i - x_j||^2 / ε) for i,j=1 to n
2. Compute degrees d_i = sum_j K_ij
3. Form normalized kernel K_ij^(α) = K_ij / (d_i^α * d_j^α)
4. Compute adjusted degrees d_i^(α) = sum_j K_ij^(α)
5. Form transition matrix P_ij = K_ij^(α) / d_i^(α)
6. Compute eigendecomposition: eigenvalues {λ_l}, right eigenvectors {ψ_l} for l=1 to k (using power iteration or Lanczos if n large)
7. For each i=1 to n, compute embedding Ψ_t(x_i) = (λ_1^t ψ_1(x_i), ..., λ_k^t ψ_k(x_i))
Output: Embeddings {Ψ_t(x_i)}_{i=1}^n
The baseline complexity is $ O(n^2) $ space and time for dense kernel construction and eigendecomposition, limiting scalability to moderate $ n $ (e.g., up to 10^4–10^5). For larger datasets, approximations such as Nyström subsampling for the kernel or randomized SVD reduce costs to near-linear time while preserving accuracy.
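As a sketch of that sparse route, the dense eigendecomposition can be replaced by an iterative solver such as SciPy's ARPACK wrapper applied to the symmetric conjugate of the transition matrix (the hard threshold used to sparsify the kernel here is purely illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))

# Gaussian kernel, hard-thresholded so that most entries become exact zeros
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 1.0)
K[K < 1e-3] = 0.0
d = K.sum(axis=1)

# Symmetric conjugate of the Markov matrix: same eigenvalues, stored sparsely
S = csr_matrix(K / np.sqrt(np.outer(d, d)))

# Largest eigenpairs via ARPACK instead of a full O(n^3) decomposition
lam, U = eigsh(S, k=5, which='LM')
# The top eigenvalue of the row-stochastic chain is 1
```

For genuinely large $ n $, one would avoid materializing the dense kernel at all (e.g., via k-NN graphs or Nyström subsampling) and feed the sparse matrix to the same solver.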