Topological data analysis

from Wikipedia

In applied mathematics, topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. Extraction of information from datasets that are high-dimensional, incomplete and noisy is generally challenging. TDA provides a general framework to analyze such data in a manner that is insensitive to the particular metric chosen and provides dimensionality reduction and robustness to noise. Beyond this, it inherits functoriality, a fundamental concept of modern mathematics, from its topological nature, which allows it to adapt to new mathematical tools.[citation needed]

The initial motivation is to study the shape of data. TDA has combined algebraic topology and other tools from pure mathematics to allow mathematically rigorous study of "shape". The main tool is persistent homology, an adaptation of homology to point cloud data. Persistent homology has been applied to many types of data across many fields. Moreover, its mathematical foundation is also of theoretical importance. The unique features of TDA make it a promising bridge between topology and geometry.[citation needed]

Basic theory

Intuition

TDA is premised on the idea that the shape of data sets contains relevant information. Real high-dimensional data is typically sparse, and tends to have relevant low-dimensional features. One task of TDA is to provide a precise characterization of this fact. For example, the trajectory of a simple predator–prey system governed by the Lotka–Volterra equations[1] forms a closed circle in state space. TDA provides tools to detect and quantify such recurrent motion.[2]

Many algorithms for data analysis, including those used in TDA, require setting various parameters. Without prior domain knowledge, the correct collection of parameters for a data set is difficult to choose. The main insight of persistent homology is to use the information obtained from all parameter values by encoding this huge amount of information into an understandable and easy-to-represent form. With TDA, there is a mathematical interpretation when the information is a homology group. In general, the assumption is that features that persist for a wide range of parameters are "true" features. Features persisting for only a narrow range of parameters are presumed to be noise, although the theoretical justification for this is unclear.[3]

Early history

Precursors to the full concept of persistent homology appeared gradually over time.[4] In 1990, Patrizio Frosini introduced a pseudo-distance between submanifolds, and later the size function, which on 1-dim curves is equivalent to the 0th persistent homology.[5][6] Nearly a decade later, Vanessa Robins studied the images of homomorphisms induced by inclusion.[7] Finally, shortly thereafter, Herbert Edelsbrunner et al. introduced the concept of persistent homology together with an efficient algorithm and its visualization as a persistence diagram.[8] Gunnar Carlsson et al. reformulated the initial definition and gave an equivalent visualization method called persistence barcodes,[9] interpreting persistence in the language of commutative algebra.[10]

In algebraic topology, persistent homology emerged through the work of Sergey Barannikov on Morse theory. The set of critical values of a smooth Morse function was canonically partitioned into "birth–death" pairs, filtered complexes were classified, and their invariants, equivalent to persistence diagrams and persistence barcodes, together with an efficient algorithm for their calculation, were described under the name of canonical forms in 1994 by Barannikov.[11][12]

Concepts

Some widely used concepts are introduced below. Note that some definitions may vary from author to author.

A point cloud is often defined as a finite set of points in some Euclidean space, but may be taken to be any finite metric space.

The Čech complex of a point cloud is the nerve of the cover of balls of a fixed radius around each point in the cloud.

A persistence module $U$ indexed by $\mathbb{R}$ is a vector space $U_t$ for each $t \in \mathbb{R}$, together with a linear map $u_t^s\colon U_s \to U_t$ whenever $s \le t$, such that $u_t^t$ is the identity for all $t$ and $u_t^s = u_t^r \circ u_r^s$ whenever $s \le r \le t$.[13] An equivalent definition is a functor from $\mathbb{R}$, considered as a partially ordered set, to the category of vector spaces.

The persistent homology group of a point cloud is the persistence module defined as $\bigoplus_r H_k(X_r)$, where $X_r$ is the Čech complex of radius $r$ of the point cloud $X$ and $H_k$ is the homology group.

A persistence barcode is a multiset of intervals in $\mathbb{R}$, and a persistence diagram is a multiset of points in $\Delta := \{(u, v) \in \mathbb{R}^2 \mid u \le v\}$.

The Wasserstein distance between two persistence diagrams $X$ and $Y$ is defined as
$$W_p[q](X, Y) := \inf_{\varphi\colon X \to Y} \left[ \sum_{x \in X} \lVert x - \varphi(x) \rVert_q^p \right]^{1/p},$$
where $1 \le p, q \le \infty$ and $\varphi$ ranges over bijections between $X$ and $Y$. Please refer to figure 3.1 in Munch[14] for an illustration.

The bottleneck distance between $X$ and $Y$ is
$$W_\infty[q](X, Y) := \inf_{\varphi\colon X \to Y} \, \sup_{x \in X} \lVert x - \varphi(x) \rVert_q.$$
This is a special case of the Wasserstein distance, letting $p = \infty$.
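As a concrete illustration, the following minimal sketch compares two toy persistence diagrams with the GUDHI Python package (one of the libraries discussed under Computation below); the diagram values are made-up assumptions.

```python
import gudhi

# Two toy persistence diagrams: each point is a (birth, death) pair.
diag1 = [(0.0, 2.0), (1.0, 4.0), (3.0, 3.1)]   # last point lies near the diagonal
diag2 = [(0.0, 2.1), (1.1, 3.9)]

# Bottleneck distance: cost of the best matching between the diagrams,
# with unmatched points projected onto the diagonal.
print(gudhi.bottleneck_distance(diag1, diag2))
```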

Basic property

Structure theorem

The first classification theorem for persistent homology appeared in 1994[11] via Barannikov's canonical forms. The classification theorem interpreting persistence in the language of commutative algebra appeared in 2005:[10] for a finitely generated persistence module $C$ with field $F$ coefficients,
$$H(C; F) \simeq \bigoplus_i x^{t_i} \cdot F[x] \oplus \left( \bigoplus_j x^{r_j} \cdot \big( F[x] / (x^{s_j} \cdot F[x]) \big) \right).$$
Intuitively, the free parts correspond to the homology generators that appear at filtration level $t_i$ and never disappear, while the torsion parts correspond to those that appear at filtration level $r_j$ and last for $s_j$ steps of the filtration (or equivalently, disappear at filtration level $r_j + s_j$).[11]

Persistent homology is visualized through a barcode or persistence diagram. The barcode has its root in abstract mathematics. Namely, the category of finite filtered complexes over a field is semi-simple. Any filtered complex is isomorphic to its canonical form, a direct sum of one- and two-dimensional simple filtered complexes.

Stability

Stability is desirable because it provides robustness against noise. If $X$ is any space which is homeomorphic to a simplicial complex, and $f, g\colon X \to \mathbb{R}$ are continuous tame[15] functions, then the persistence vector spaces $\{H_k(f^{-1}([0, r]))\}$ and $\{H_k(g^{-1}([0, r]))\}$ are finitely presented, and $W_\infty(D(f), D(g)) \le \lVert f - g \rVert_\infty$, where $W_\infty$ refers to the bottleneck distance[16] and $D$ is the map taking a continuous tame function to the persistence diagram of its $k$-th homology.

Workflow

The basic workflow in TDA is:[17]

point cloud → nested complexes → persistence module → barcode or diagram
  1. If $X$ is a point cloud, replace $X$ with a nested family of simplicial complexes $X_r$ (such as the Čech or Vietoris–Rips complex). This process converts the point cloud into a filtration of simplicial complexes. Taking the homology of each complex in this filtration gives a persistence module.
  2. Apply the structure theorem to obtain the persistent Betti numbers, persistence diagram, or equivalently, barcode.
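A minimal sketch of this workflow in Python, assuming the GUDHI package discussed below; the random point cloud and the parameter choices are illustrative assumptions.

```python
import numpy as np
import gudhi

# Step 1: point cloud -> nested family of Vietoris-Rips complexes (a filtration).
points = np.random.default_rng(42).random((50, 2))
rips = gudhi.RipsComplex(points=points, max_edge_length=0.8)
simplex_tree = rips.create_simplex_tree(max_dimension=2)

# Step 2: homology of the filtration -> persistence module -> diagram/barcode.
diagram = simplex_tree.persistence()          # list of (dimension, (birth, death))
for dim, (birth, death) in diagram[:5]:
    print(f"H{dim} feature: born {birth:.3f}, dies {death:.3f}")
```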

[Figure: a usual use of persistence in TDA.[18]]

Computation

The first algorithm over all fields for persistent homology in the algebraic topology setting was described by Barannikov[11] through reduction to canonical form by upper-triangular matrices. The algorithm for persistent homology over $\mathbb{Z}_2$ was given by Edelsbrunner et al.[8] Afra Zomorodian and Carlsson gave the practical algorithm to compute persistent homology over all fields.[10] Edelsbrunner and Harer's book gives general guidance on computational topology.[19]

One issue that arises in computation is the choice of complex. The Čech complex and the Vietoris–Rips complex are most natural at first glance; however, their size grows rapidly with the number of data points. The Vietoris–Rips complex is preferred over the Čech complex because its definition is simpler and the Čech complex requires extra effort to define in a general finite metric space. Efficient ways to lower the computational cost of homology have been studied. For example, the α-complex and witness complex are used to reduce the dimension and size of complexes.[20]

Recently, discrete Morse theory has shown promise for computational homology because it can reduce a given simplicial complex to a much smaller cellular complex which is homotopy equivalent to the original one.[21] This reduction can in fact be performed as the complex is constructed, by using matroid theory, leading to further performance increases.[22] Another recent algorithm saves time by ignoring the homology classes with low persistence.[23]

Various software packages are available, such as javaPlex, Dionysus, Perseus, PHAT, DIPHA, GUDHI, Ripser, and TDAstats. A comparison between these tools is done by Otter et al.[24] Giotto-tda is a Python package dedicated to integrating TDA in the machine learning workflow by means of a scikit-learn API. An R package TDA is capable of calculating recently invented concepts like landscapes and the kernel distance estimator.[25] The Topology ToolKit is specialized for continuous data defined on manifolds of low dimension (1, 2 or 3), as typically found in scientific visualization. Cubicle is optimized for large (gigabyte-scale) grayscale image data in dimension 1, 2 or 3 using cubical complexes and discrete Morse theory. Another R package, TDAstats, uses the Ripser library to calculate persistent homology.[26]
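As a usage sketch of one of these packages, the following assumes the ripser.py Python bindings and a made-up point cloud:

```python
import numpy as np
from ripser import ripser

X = np.random.default_rng(0).random((100, 3))   # toy point cloud in R^3
result = ripser(X, maxdim=1)                    # H0 and H1 persistence
h0, h1 = result['dgms']                         # diagrams as (birth, death) arrays
print(f"{len(h0)} H0 intervals, {len(h1)} H1 intervals")
```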

Visualization

High-dimensional data is impossible to visualize directly. Many methods have been invented to extract a low-dimensional structure from the data set, such as principal component analysis and multidimensional scaling.[27] However, it is important to note that the problem itself is ill-posed, since many different topological features can be found in the same data set. Thus, the study of visualization of high-dimensional spaces is of central importance to TDA, although it does not necessarily involve the use of persistent homology. However, recent attempts have been made to use persistent homology in data visualization.[28]

Carlsson et al. have proposed a general method called MAPPER.[29] It inherits the idea of Jean-Pierre Serre that a covering preserves homotopy.[30] A generalized formulation of MAPPER is as follows:

Let $X$ and $Z$ be topological spaces and let $f\colon X \to Z$ be a continuous map. Let $\mathbb{U} = \{U_\alpha\}_{\alpha \in A}$ be a finite open covering of $Z$. The output of MAPPER is the nerve of the pullback cover $f^{-1}(\mathbb{U})$, where each preimage is split into its connected components.[28] This is a very general concept, of which the Reeb graph[31] and merge trees are special cases.

This is not quite the original definition.[29] Carlsson et al. choose $Z$ to be $\mathbb{R}$ or $\mathbb{R}^2$, and cover it with open sets such that at most two intersect.[3] This restriction means that the output is in the form of a complex network. Because the topology of a finite point cloud is trivial, clustering methods (such as single linkage) are used to produce the analogue of connected sets in the preimage when MAPPER is applied to actual data.
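The sketch below is a toy illustration of this practical MAPPER pipeline, not the reference implementation: the interval-cover arithmetic, the choice of DBSCAN as the local clusterer, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, filter_values, n_intervals=10, overlap=0.5, eps=0.3):
    """Toy MAPPER: cover the filter range with overlapping intervals,
    cluster each preimage, and connect clusters sharing data points."""
    lo, hi = filter_values.min(), filter_values.max()
    # Interval length chosen so n_intervals intervals with the given
    # fractional overlap exactly cover [lo, hi].
    length = (hi - lo) / (n_intervals * (1 - overlap) + overlap)
    step = length * (1 - overlap)
    nodes = []  # each node is the set of data indices in one cluster
    for i in range(n_intervals):
        a = lo + i * step
        idx = np.where((filter_values >= a) & (filter_values <= a + length))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(X[idx])
        nodes.extend(set(idx[labels == lab]) for lab in np.unique(labels))
    # Nerve: an edge whenever two clusters share at least one point.
    edges = {(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges

# Example: points on a noisy circle, filtered by their x-coordinate.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (300, 2))
nodes, edges = mapper_graph(X, X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")  # expect a cycle-like graph
```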

Mathematically speaking, MAPPER is a variation of the Reeb graph.[32] The added flexibility also has disadvantages. One problem is instability, in that some change in the choice of the cover can lead to a major change in the output of the algorithm.[33] Work has been done to overcome this problem.[28]

Three successful applications of MAPPER can be found in Carlsson et al.[34] A comment on the applications in this paper by J. Curry is that "a common feature of interest in applications is the presence of flares or tendrils".[35]

A free implementation of MAPPER written by Daniel Müllner and Aravindakshan Babu is available online. MAPPER also forms the basis of Ayasdi's AI platform.

Multidimensional persistence

Multidimensional persistence is important to TDA. The concept arises in both theory and practice. The first investigation of multidimensional persistence came early in the development of TDA.[36] Carlsson and Zomorodian introduced the theory of multidimensional persistence in [37] and, in collaboration with Singh,[38] introduced the use of tools from symbolic algebra (Gröbner basis methods) to compute MPH modules. Their definition presents multidimensional persistence with $n$ parameters as a graded module over a polynomial ring in $n$ variables. Tools from commutative and homological algebra are applied to the study of multidimensional persistence in the work of Harrington, Otter, Schenck, and Tillman.[39] The first application to appear in the literature is a method for shape comparison, similar to the invention of TDA.[40]

The definition of an $n$-dimensional persistence module in $\mathbb{R}^n$ is[35]

  • a vector space $V_s$ is assigned to each point $s$ in $\mathbb{R}^n$;
  • a map $\rho_s^t\colon V_s \to V_t$ is assigned whenever $s \le t$ (componentwise);
  • the maps satisfy $\rho_r^t = \rho_s^t \circ \rho_r^s$ for all $r \le s \le t$.

There is some controversy over the definition of multidimensional persistence.[35]

One of the advantages of one-dimensional persistence is its representability by a diagram or barcode. However, discrete complete invariants of multidimensional persistence modules do not exist.[41] The main reason is that, by Gabriel's theorem in the theory of quiver representations, the structure of the collection of indecomposables is extremely complicated,[42] although a finitely generated $n$-dimensional persistence module can be uniquely decomposed into a direct sum of indecomposables due to the Krull–Schmidt theorem.[43]

Nonetheless, many results have been established. Carlsson and Zomorodian introduced the rank invariant $\rho_M(u, v)$, defined as the rank of the structure map $x^{u \to v}\colon M_u \to M_v$, in which $M$ is a finitely generated $n$-graded module. In one dimension, it is equivalent to the barcode. In the literature, the rank invariant is often referred to as the persistent Betti numbers (PBNs).[19] In many theoretical works, authors have used a more restricted definition, an analogue of sublevel-set persistence. Specifically, the persistence Betti numbers of a function $f\colon X \to \mathbb{R}$ are given by the function $\beta_f$, taking each pair $(u, v)$ with $u \le v$ to $\beta_f(u, v) = \operatorname{rank}\big(H(X_u) \to H(X_v)\big)$, where $X_u := f^{-1}\big((-\infty, u]\big)$ and the map is induced by inclusion.

Some basic properties include monotonicity and diagonal jump.[44] Persistent Betti numbers will be finite if $X$ is a compact and locally contractible subspace of $\mathbb{R}^n$.[45]

Using a foliation method, the $k$-dimensional PBNs can be decomposed into a family of one-dimensional PBNs by dimensionality reduction.[46] This method has also led to a proof that multidimensional PBNs are stable.[47] The discontinuities of PBNs occur only at points $(u, v)$ where either $u$ or $v$ is a point of discontinuity of the associated one-dimensional PBNs, under the assumption that $f$ is continuous and $X$ is a compact, triangulable topological space.[48]

Persistent space, a generalization of the persistence diagram, is defined as the multiset of all points with multiplicity larger than 0, together with the diagonal.[49] It provides a stable and complete representation of PBNs. Ongoing work by Carlsson et al. seeks a geometric interpretation of persistent homology, which might provide insights on how to combine machine learning theory with topological data analysis.[50]

The first practical algorithm to compute multidimensional persistence was invented very early.[51] Since then, many other algorithms have been proposed, based on concepts such as discrete Morse theory[52] and finite-sample estimation.[53]

Other persistences

The standard paradigm in TDA is often referred to as sublevel persistence. Apart from multidimensional persistence, much work has been done to extend this special case.

Zigzag persistence

The nonzero maps in a persistence module are restricted by the preorder relationship in the indexing category. However, mathematicians have found that unanimity of direction is not essential to many results. "The philosophical point is that the decomposition theory of graph representations is somewhat independent of the orientation of the graph edges."[54] Zigzag persistence is important to the theoretical side; the examples given in Carlsson's review paper to illustrate the importance of functoriality all share some of its features.[3]

Extended persistence and levelset persistence

There have been some attempts to loosen the strict restriction on the function.[55] Refer to the Categorification and cosheaves and Impact on mathematics sections for more information.

It is natural to extend persistent homology to other basic concepts in algebraic topology, such as cohomology and relative homology/cohomology.[56] An interesting application is the computation of circular coordinates for a data set via the first persistent cohomology group.[57]

Circular persistence

Normal persistence homology studies real-valued functions. Circle-valued maps might also be useful: "persistence theory for circle-valued maps promises to play the role for some vector fields as does the standard persistence theory for scalar fields", as commented in Dan Burghelea et al.[58] The main difference is that Jordan cells (very similar in format to the Jordan blocks in linear algebra) are nontrivial for circle-valued functions, whereas they would be zero in the real-valued case; combined with barcodes, they give the invariants of a tame map, under moderate conditions.[58]

Two techniques they use are Morse–Novikov theory[59] and graph representation theory.[60] More recent results can be found in D. Burghelea et al.[61] For example, the tameness requirement can be replaced by the much weaker condition of continuity.

Persistence with torsion

The proof of the structure theorem relies on the base domain being a field, so few attempts have been made on persistence homology with torsion. Frosini defined a pseudometric on this specific module and proved its stability.[62] One of its novelties is that it does not depend on a classification theory to define the metric.[63]

Categorification and cosheaves

One advantage of category theory is its ability to lift concrete results to a higher level, showing relationships between seemingly unconnected objects. Peter Bubenik et al.[64] offer a short introduction to category theory fitted for TDA.

Category theory is the language of modern algebra, and has been widely used in the study of algebraic geometry and topology. It has been noted that "the key observation of [10] is that the persistence diagram produced by [8] depends only on the algebraic structure carried by this diagram".[65] The use of category theory in TDA has proved to be fruitful.[64][65]

Following the notation of Bubenik et al.,[65] the indexing category is any preordered set (not necessarily $\mathbb{R}$ or $\mathbb{N}$), the target category is any category (instead of the commonly used category of vector spaces), and functors from the indexing category to the target category are called generalized persistence modules.

One advantage of using category theory in TDA is a clearer understanding of concepts and the discovery of new relationships between proofs. Take two examples for illustration. The understanding of the correspondence between interleaving and matching is of huge importance, since matching has been the method used from the beginning (modified from Morse theory). A summary of works can be found in Vin de Silva et al.[66] Many theorems can be proved much more easily in a more intuitive setting.[63] Another example is the relationship between the construction of different complexes from point clouds. It has long been noticed that the Čech and Vietoris–Rips complexes are related: the two are nested within one another at comparable scale parameters.[67] The essential relationship between the Čech and Rips complexes can be seen much more clearly in categorical language.[66]

The language of category theory also helps cast results in terms recognizable to the broader mathematical community. Bottleneck distance is widely used in TDA because of the results on stability with respect to the bottleneck distance.[13][16] In fact, the interleaving distance is the terminal object in a poset category of stable metrics on multidimensional persistence modules in a prime field.[63][68]

Sheaves, a central concept in modern algebraic geometry, are intrinsically related to category theory. Roughly speaking, sheaves are the mathematical tool for understanding how local information determines global information. Justin Curry regards level set persistence as the study of fibers of continuous functions. The objects that he studies are very similar to those by MAPPER, but with sheaf theory as the theoretical foundation.[35] Although no breakthrough in the theory of TDA has yet used sheaf theory, it is promising since there are many beautiful theorems in algebraic geometry relating to sheaf theory. For example, a natural theoretical question is whether different filtration methods result in the same output.[69]

Stability

Stability is of central importance to data analysis, since real data carry noise. By use of category theory, Bubenik et al. have distinguished between soft and hard stability theorems, and proved that soft cases are formal.[65] Specifically, the general workflow of TDA is

data → topological persistence module → algebraic persistence module → discrete invariant

The soft stability theorem asserts that the passage from data to the algebraic persistence module is Lipschitz continuous, and the hard stability theorem asserts that the passage from the algebraic persistence module to the discrete invariant is Lipschitz continuous.

Bottleneck distance is widely used in TDA. The isometry theorem asserts that the interleaving distance $d_I$ is equal to the bottleneck distance.[63] Bubenik et al. have abstracted the definition to that between functors $F, G\colon P \to D$ when $P$ is equipped with a sublinear projection or superlinear family, in which case $d_I$ still remains a pseudometric.[65] Given the important role of the interleaving distance,[70] we introduce here its general definition (instead of the one first introduced):[13] let $\Gamma, K\colon P \to P$ be translations, i.e., monotone functions satisfying $x \le \Gamma(x)$ and $x \le K(x)$ for all $x \in P$. A $(\Gamma, K)$-interleaving between $F$ and $G$ consists of natural transformations $\varphi\colon F \Rightarrow G\Gamma$ and $\psi\colon G \Rightarrow FK$ such that the two composites $(\psi\Gamma) \circ \varphi$ and $(\varphi K) \circ \psi$ coincide with the structure maps $F \Rightarrow F(K\Gamma)$ and $G \Rightarrow G(\Gamma K)$ induced by the translations.

The two main results are[65]

  • Let $P$ be a preordered set with a sublinear projection or superlinear family, and let $H\colon D \to E$ be a functor between arbitrary categories $D$ and $E$. Then for any two functors $F, G\colon P \to D$, we have $d_I(HF, HG) \le d_I(F, G)$.
  • Let $X$ be a topological space and let $f, g\colon X \to \mathbb{R}$ (not necessarily continuous) be functions, with $D(f)$ and $D(g)$ the corresponding persistence diagrams. Then $d_B(D(f), D(g)) \le \lVert f - g \rVert_\infty$.

These two results summarize many results on stability of different models of persistence.

For the stability theorem of multidimensional persistence, please refer to the subsection on multidimensional persistence.

Structure theorem

The structure theorem is of central importance to TDA; as commented by G. Carlsson, "what makes homology useful as a discriminator between topological spaces is the fact that there is a classification theorem for finitely generated abelian groups".[3] (see the fundamental theorem of finitely generated abelian groups).

The main argument used in the proof of the original structure theorem is the standard structure theorem for finitely generated modules over a principal ideal domain.[10] However, this argument fails if the indexing set is $\mathbb{R}$.[3]

In general, not every persistence module can be decomposed into intervals.[71] Many attempts have been made at relaxing the restrictions of the original structure theorem.[clarification needed] The case of pointwise finite-dimensional persistence modules indexed by a locally finite subset of $\mathbb{R}$ was solved based on the work of Webb.[72] The most notable result is due to Crawley-Boevey, who solved the case of $\mathbb{R}$: Crawley-Boevey's theorem states that any pointwise finite-dimensional persistence module is a direct sum of interval modules.[73]

To understand the definition of his theorem, some concepts need introducing. An interval in $\mathbb{R}$ is defined as a subset $I \subseteq \mathbb{R}$ having the property that if $r, t \in I$ and if there is an $s$ with $r \le s \le t$, then $s \in I$ as well. An interval module $k_I$ assigns the vector space $k$ to each element of $I$ and the zero vector space to elements outside $I$. All maps $\rho_s^t$ are the zero map, unless $s, t \in I$ with $s \le t$, in which case $\rho_s^t$ is the identity map.[35] Interval modules are indecomposable.[74]

Although the result of Crawley-Boevey is a very powerful theorem, it still doesn't extend to the q-tame case.[71] A persistence module is q-tame if the rank of $u_t^s\colon U_s \to U_t$ is finite for all $s < t$. There are examples of q-tame persistence modules that fail to be pointwise finite.[75] However, a similar structure theorem still holds if the features that exist only at one index value are removed.[74] This holds because the infinite-dimensional parts at each index value do not persist, due to the finite-rank condition.[76] Formally, the observable category $\mathrm{Ob}$ is defined as $\mathrm{Pers}/\mathrm{Eph}$, in which $\mathrm{Eph}$ denotes the full subcategory of $\mathrm{Pers}$ whose objects are the ephemeral modules (those with $u_t^s = 0$ whenever $s < t$).[74]

Note that the extended results listed here do not apply to zigzag persistence, since the analogue of a zigzag persistence module over $\mathbb{R}$ is not immediately obvious.

Statistics

Real data is always finite, and so its study requires us to take stochasticity into account. Statistical analysis gives us the ability to separate true features of the data from artifacts introduced by random noise. Persistent homology has no inherent mechanism to distinguish between low-probability features and high-probability features.

One way to apply statistics to topological data analysis is to study the statistical properties of topological features of point clouds. The study of random simplicial complexes offers some insight into statistical topology. Katharine Turner et al.[77] offer a summary of work in this vein.

A second way is to study probability distributions on the persistence space. The persistence space is $B_\infty = \coprod_n B_n / \sim$, where $B_n$ is the space of all barcodes containing exactly $n$ intervals and the equivalences identify barcodes that differ only by intervals of zero length.[78] This space is fairly complicated; for example, it is not complete under the bottleneck metric. The first attempt to study it was made by Yuriy Mileyko et al.[79] The space of persistence diagrams $D_p$ in their paper consists of diagrams whose points have finite total $p$-th-power distance to the diagonal line $\Delta$ in $\mathbb{R}^2$. A nice property is that $D_p$ is complete and separable in the Wasserstein metric $W_p$. Expectation, variance, and conditional probability can be defined in the Fréchet sense. This allows many statistical tools to be ported to TDA. Works on null hypothesis significance tests,[80] confidence intervals,[81] and robust estimates[82] are notable steps.

A third way is to consider the cohomology of probabilistic spaces or statistical systems directly, called information structures, basically consisting of the triple $(\Omega, \mathcal{X}, P)$ of sample space, random variables, and probability laws.[83][84] Random variables are considered as partitions of the $n$ atomic probabilities (seen as a probability $(n-1)$-simplex) on the lattice of partitions. The random variables, or modules of measurable functions, provide the cochain complexes, while the coboundary is considered as the general homological algebra first discovered by Gerhard Hochschild, with a left action implementing the action of conditioning. The first cocycle condition corresponds to the chain rule of entropy, allowing one to derive, uniquely up to a multiplicative constant, Shannon entropy as the first cohomology class. The consideration of a deformed left action generalizes the framework to Tsallis entropies. The information cohomology is an example of a ringed topos.

Multivariate $k$-mutual informations appear in coboundary expressions, and their vanishing, related to the cocycle condition, gives equivalent conditions for statistical independence.[85] Minima of mutual informations, also called synergy, give rise to interesting independence configurations analogous to homotopy links. Because of its combinatorial complexity, only the simplicial subcase of the cohomology and of the information structure has been investigated on data. Applied to data, these cohomological tools quantify statistical dependences and independences, including Markov chains and conditional independence, in the multivariate case.[86] Notably, mutual informations generalize the correlation coefficient and covariance to nonlinear statistical dependences. These approaches were developed independently and are only indirectly related to persistence methods, but may be roughly understood in the simplicial case using the Hu Kuo Ting theorem, which establishes a one-to-one correspondence between mutual-information functions and finite measurable functions of a set with the intersection operator, allowing the construction of the Čech complex skeleton. Information cohomology offers some direct interpretation and application in terms of neuroscience (neural assembly theory and qualitative cognition[87]), statistical physics, and deep neural networks, for which the structure and learning algorithm are imposed by the complex of random variables and the information chain rule.[88]

Persistence landscapes, introduced by Peter Bubenik, are a different way to represent barcodes, more amenable to statistical analysis.[89] The persistence landscape of a persistent module $M$ is defined as a function $\lambda\colon \mathbb{N} \times \mathbb{R} \to \overline{\mathbb{R}}$, $\lambda(k, t) := \sup\{m \ge 0 \mid \beta^{t-m, t+m} \ge k\}$, where $\overline{\mathbb{R}}$ denotes the extended real line and $\beta^{a, b} = \dim \operatorname{im}\big(M(a \le b)\big)$ is the persistent Betti number. The space of persistence landscapes is very nice: it inherits all the good properties of the barcode representation (stability, easy representation, etc.), but statistical quantities can be readily defined, and some problems in Y. Mileyko et al.'s work, such as the non-uniqueness of expectations,[79] can be overcome. Effective algorithms for computation with persistence landscapes are available.[90] Another approach is to use revised persistence, that is, image, kernel and cokernel persistence.[91]
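A minimal sketch of computing a discretized persistence landscape, assuming GUDHI's representations module; the toy diagram and grid parameters are made up.

```python
import numpy as np
from gudhi.representations import Landscape

# A toy persistence diagram: two features with (birth, death) coordinates.
diagram = np.array([[0.0, 4.0], [1.0, 2.0]])

# Sample the first 2 landscape functions lambda(k, t) on a grid of 50 t-values.
landscape = Landscape(num_landscapes=2, resolution=50)
values = landscape.fit_transform([diagram])   # shape: (1, 2 * 50)
print(values.shape, values.max())
```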

Applications

Classification of applications

More than one way exists to classify the applications of TDA. Perhaps the most natural way is by field. A very incomplete list of successful applications includes[92] data skeletonization,[93] shape study,[94] graph reconstruction,[95][96][97][98][99] image analysis,[100][101] materials science,[102][103] progression analysis of disease,[104][105] sensor networks,[67] signal analysis,[106] the cosmic web,[107] complex networks,[108][109][110][111] fractal geometry,[112] viral evolution,[113] propagation of contagions on networks,[114] bacteria classification using molecular spectroscopy,[115] super-resolution microscopy,[116] hyperspectral imaging in physical chemistry,[117] remote sensing,[118] feature selection,[119] and early warning signs of financial crashes.[120]

Another way is to distinguish between two kinds of technique, following G. Carlsson:[78]

one being the study of homological invariants of data on individual data sets, and the other is the use of homological invariants in the study of databases where the data points themselves have geometric structure.

Impact on mathematics

Topological data analysis and persistent homology have had impacts on Morse theory.[121] Morse theory has played a very important role in the theory of TDA, including on computation. Some work in persistent homology has extended results about Morse functions to tame functions or even to continuous functions[citation needed]. A forgotten result of R. Deheuvels, long before the invention of persistent homology, extends Morse theory to all continuous functions.[122]

One recent result is that the category of Reeb graphs is equivalent to a particular class of cosheaf.[123] This is motivated by theoretical work in TDA, since the Reeb graph is related to Morse theory and MAPPER is derived from it. The proof of this theorem relies on the interleaving distance.

Persistent homology is closely related to spectral sequences.[124][125] In particular, the algorithm bringing a filtered complex to its canonical form[11] permits much faster calculation of spectral sequences than the standard procedure of calculating the groups $E^r_{p,q}$ page by page. Zigzag persistence may turn out to be of theoretical importance to spectral sequences.

DONUT: A Database of TDA Applications

The Database of Original & Non-Theoretical Uses of Topology (DONUT) is a database of scholarly articles featuring practical applications of topological data analysis to various areas of science. DONUT was started in 2017 by Barbara Giunti, Janis Lazovskis, and Bastian Rieck,[126] and as of October 2023 contained 447 articles.[127] DONUT was featured in the November 2023 issue of the Notices of the American Mathematical Society.[128]

Applications to Adversarial ML

The stability of topological features under small perturbations has been applied to make graph neural networks robust against adversaries. Arafat et al.[129] proposed a robustness framework which systematically integrates both local and global topological graph feature representations, the impact of which is controlled by a robust regularized topological loss. Given the attacker's budget, they derived stability guarantees on the node representations, establishing an important connection between topological stability and adversarial ML.

from Grokipedia
Topological data analysis (TDA) is an emerging field in applied mathematics and data science that leverages tools from algebraic topology to extract robust, multiscale, and interpretable features from complex, high-dimensional datasets, focusing on the intrinsic shape and connectivity of data rather than Euclidean metrics. By identifying topological invariants—such as connected components, loops, and voids—that persist across varying scales, TDA distinguishes meaningful structures from noise, providing summaries invariant under continuous deformations like stretching or bending.

The foundations of TDA were laid in the early 2000s through pioneering work on persistent homology, a core computational framework introduced by Edelsbrunner, Letscher, and Zomorodian in 2002, which formalizes the tracking of topological features in filtrations of simplicial complexes. This was advanced by Zomorodian and Carlsson in 2005, who developed efficient algorithms for computing persistent homology over arbitrary fields and dimensions, enabling practical implementation on point cloud data via constructions like Vietoris–Rips or Čech complexes. Gunnar Carlsson's 2009 survey further solidified TDA as a discipline by outlining its philosophical and methodological integration of topology with statistical data analysis, emphasizing its ability to reveal global and local structures in noisy environments.

At its heart, persistent homology represents data as a sequence of nested simplicial complexes parameterized by a scale (e.g., distance threshold), computing homology groups at each level to produce persistence diagrams or barcodes that encode the "birth" and "death" times of topological features. These summaries enable quantitative comparisons via metrics like the Wasserstein distance and facilitate machine learning integration, such as through persistence images or kernels. Software libraries like GUDHI and PHAT have made TDA accessible, supporting computations in dimensions up to several dozen for moderate-sized datasets.

TDA has found diverse applications across domains, including biomolecular modeling for protein-ligand interactions and drug discovery, where it analyzes molecular shapes and binding sites; materials science for characterizing microstructures; and neuroscience for studying brain connectivity patterns. In recent years, topological deep learning has extended TDA by embedding topological features into neural networks, enhancing predictions in tasks like viral evolution forecasting and immune response analysis, as demonstrated in successes at the D3R Grand Challenges. Despite computational challenges for very large datasets, ongoing advances in multiparameter persistence and localized variants continue to broaden its impact.

Introduction

Intuition

Topological data analysis (TDA) addresses the challenge of understanding the intrinsic shape and structure of data, particularly when represented as point clouds—finite sets of points sampled from high-dimensional spaces, such as sensor readings or genomic profiles. Traditional statistical methods often focus on measures like means, variances, or linear correlations, which may overlook global topological features such as connectivity patterns or voids that persist across scales. In contrast, TDA employs tools from algebraic topology to capture these qualitative aspects of data shape, providing robustness to noise and deformations that distort metric properties without altering fundamental structure.

A classic illustration of topological invariance is the equivalence between a coffee mug and a donut: both can be continuously deformed into each other without tearing or gluing, sharing the same number of holes (one), despite differences in size, curvature, or material. This principle underlies TDA's emphasis on features invariant under such deformations, allowing analysis of data shapes that might appear dissimilar metrically but are topologically alike. For instance, in high-dimensional datasets, TDA identifies these invariants to reveal underlying geometries that traditional metrics might miss.

The primary goal of TDA is to detect and quantify topological features like clusters (connected components), loops, and voids in data at multiple scales, ensuring that transient noise does not obscure persistent structures. Persistent homology serves as the main tool for this multi-scale analysis, tracking how these features emerge and disappear as the resolution varies. Consider a noisy 2D point cloud approximating a circle with added random points: TDA can identify the surrounding cluster as a single connected component and detect the central void (a 1-dimensional hole) that persists across scales, without assuming a specific embedding like Euclidean distance. This approach uncovers cycles and groupings robustly, even in perturbed data.

Historical Development

The roots of topological data analysis (TDA) trace back to foundational developments in algebraic topology during the late 19th and early 20th centuries. In 1895, Henri Poincaré introduced simplicial complexes in his seminal paper "Analysis Situs," providing a combinatorial framework for studying the topological properties of spaces through decompositions into simplices, which laid the groundwork for later homology theories. Building on this, Eduard Čech developed Čech homology in the 1930s, particularly in his 1934 work on local connectivity, which formalized homology using open covers of topological spaces and emphasized neighborhood-based invariants essential for analyzing geometric data.

TDA emerged as a distinct field in the early 2000s, adapting these classical tools to handle noisy, high-dimensional data by focusing on robust topological features. A pivotal breakthrough was the introduction of persistent homology by Herbert Edelsbrunner, David Letscher, and Afra Zomorodian in their 2002 paper, which formalized the tracking of topological features in filtrations of simplicial complexes and introduced persistence diagrams as a visual and computational representation to quantify feature lifetimes and distinguish signal from noise. This was complemented by Afra Zomorodian and Gunnar Carlsson in 2005, who developed efficient algorithms for computing persistent homology over arbitrary fields and dimensions, enabling practical implementation on point cloud data.

The 2010s marked a period of rapid growth for TDA, driven by computational advancements and community efforts to make tools accessible. Key workshops and seminars, such as the 2017 Dagstuhl Seminar on Topology, Computation, and Data Analysis, fostered interdisciplinary collaboration among mathematicians, computer scientists, and domain experts, highlighting TDA's potential for real-world applications. Concurrently, open-source libraries proliferated; the GUDHI project, launched in 2014 by INRIA researchers including Clément Maria and Steve Oudot, delivered efficient C++ implementations for persistent homology and simplicial complexes, significantly lowering barriers to adoption and enabling scalable computations on large datasets.

By the 2020s, TDA had integrated with machine learning paradigms, expanding its scope up to 2025. Seminal works like the 2023 review on Topological Deep Learning by Mustafa Hajij et al. outlined frameworks combining persistent homology with neural networks, such as topological layers in graph neural networks, to capture higher-order structures in data for enhanced predictive modeling. Specific integrations, including Topo-Net by Katherine Fitch et al. in 2024, applied topological features to deep learning for retinal image classification, demonstrating improved accuracy in biomedical tasks through invariant shape encodings. In climate modeling, post-2020 applications gained traction; for instance, a 2021 study by Francisco Algaba et al. used persistent homology on MODIS satellite imagery to delineate local climate zones, revealing spatial patterns in urban heat islands with greater robustness than traditional clustering methods. Similarly, a 2023 analysis leveraged TDA to predict extreme weather persistence, integrating topological summaries with climate simulations to forecast heatwave durations under varying greenhouse gas scenarios. As of 2025, advances in multiparameter persistence and topological deep learning continue to address computational challenges for large-scale applications.

Core Concepts

Topological Spaces and Homology

A topological space consists of a set $X$ equipped with a collection $\tau$ of subsets of $X$, known as open sets, that includes the empty set and $X$ itself, and is closed under arbitrary unions and finite intersections of its members. This structure generalizes the notion of "nearness" without relying on a metric, allowing for the study of continuity in abstract settings. A function $f\colon X \to Y$ between topological spaces is continuous if the preimage of every open set in $Y$ is open in $X$. Two spaces are homeomorphic if there exists a continuous bijection with a continuous inverse between them, meaning they share the same topological properties, such as connectedness or the presence of holes.

In topological data analysis, point cloud data—finite sets of points in Euclidean space—are often approximated by simplicial complexes to capture underlying shapes. A simplicial complex $K$ is a finite collection of simplices (nonempty finite subsets of vertices, such as points for dimension 0, line segments for dimension 1, triangles for dimension 2, and tetrahedra for dimension 3) that is closed under taking subsets (faces) and such that the intersection of any two simplices is either empty or a common face. From a point cloud $P \subset \mathbb{R}^d$, the Vietoris–Rips complex $VR_r(P)$ at scale $r > 0$ includes a simplex for every finite subset of $P$ where the diameter (maximum pairwise distance) is at most $r$. Alternatively, the Čech complex $\check{C}_r(P)$ at scale $r$ consists of simplices corresponding to subsets of $P$ whose associated balls of radius $r$ have nonempty common intersection. These constructions discretize the data while approximating the topology of the underlying manifold or shape sampled by $P$.

To quantify topological features algebraically, simplicial homology is computed from the chain complex associated to $K$. The group of $k$-chains $C_k(K)$ is the free abelian group generated by the oriented $k$-simplices in $K$, and the boundary operator $\partial_k\colon C_k(K) \to C_{k-1}(K)$ maps each $k$-simplex to the alternating sum of its $(k-1)$-faces. The $k$-cycles $Z_k(K) = \ker \partial_k$ consist of chains with zero boundary, representing closed $k$-dimensional surfaces, while the $k$-boundaries $B_k(K) = \operatorname{im} \partial_{k+1}$ are cycles that bound $(k+1)$-dimensional volumes. The $k$-th homology group is then $H_k(K) = Z_k(K) / B_k(K)$, which captures $k$-dimensional "holes": for $k = 0$, $H_0$ detects connected components; for $k = 1$, $H_1$ detects independent loops; and for $k = 2$, $H_2$ detects enclosed voids. The Betti numbers $\beta_k = \operatorname{rank} H_k(K)$ (computed over the rationals or a field) serve as topological invariants, counting the number of independent $k$-dimensional holes in $K$.

For instance, a point cloud sampled from a circle, approximated by the Vietoris–Rips complex at a scale $r$ large enough to connect neighboring points but small enough to avoid filling the interior, yields a simplicial complex homeomorphic to a triangulated circle: a cycle of vertices and edges with no 2-simplices. Here, $H_0 \cong \mathbb{Z}$ (one connected component, $\beta_0 = 1$) and $H_1 \cong \mathbb{Z}$ (one independent loop, $\beta_1 = 1$), with all higher homology groups trivial. This computation highlights how homology distinguishes a circle from, say, a filled disk (where $\beta_1 = 0$).
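The circle computation above can be checked mechanically. A minimal sketch with GUDHI's SimplexTree follows (vertex labels are an arbitrary choice; note that GUDHI computes Betti numbers over a field rather than over $\mathbb{Z}$):

```python
import gudhi

# A triangulated circle: three vertices and three edges, no 2-simplex.
st = gudhi.SimplexTree()
for edge in [[0, 1], [1, 2], [0, 2]]:
    st.insert(edge)   # inserting an edge also inserts its vertices

# Persistence must be computed before Betti numbers can be queried;
# persistence_dim_max=True includes the top-dimensional homology.
st.compute_persistence(persistence_dim_max=True)
print(st.betti_numbers())   # [1, 1]: one component (b0), one loop (b1)
```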

Persistent Homology

Persistent homology is a cornerstone of topological data analysis, providing a framework to detect and quantify topological features in data that persist across multiple scales. Unlike classical homology, which computes static invariants of a fixed topological space, persistent homology examines the evolution of these features within a filtered simplicial complex, where simplices are added progressively according to a parameter, often a scale $\epsilon$. This allows for the identification of robust structures amid noise, as features that survive over a range of scales are deemed significant. The method was introduced to formalize topological simplification in growing complexes, enabling the distinction between true geometric features and transient artifacts.

In persistent homology, topological features such as connected components (dimension 0), loops (dimension 1), and voids (higher dimensions) are tracked by their birth and death times within the filtration. A feature born at parameter value $b$ and dying at $d$ forms a persistence pair $(b, d)$, corresponding to the half-open interval $[b, d)$ during which it contributes to the homology group. These pairs arise from the algebraic structure of the persistence module, where the $k$-th persistent homology is a sequence of finite-dimensional vector spaces $H_k(K_{\epsilon_i})$ equipped with linear maps $H_k(K_{\epsilon_i}) \to H_k(K_{\epsilon_{i+1}})$ induced by the inclusions of the filtered complexes $K_{\epsilon_i} \subseteq K_{\epsilon_{i+1}}$. By the structure theorem for persistence modules over a field, this module decomposes uniquely (up to isomorphism) into a direct sum of indecomposable interval modules, each supported on an interval $[b, d)$.

The collection of persistence pairs can be visualized as a persistence diagram, a multiset of points $(b, d)$ in the half-plane above the diagonal line $d = b$, where points on the diagonal represent trivial, instantaneously born-and-dead features. The stability of these diagrams under perturbations is quantified by the Wasserstein distance (or $p$-Wasserstein metric), which bounds the movement of points between diagrams by the $L^\infty$ distance between the underlying functions or point clouds generating the filtrations. An equivalent representation is the barcode, consisting of horizontal line segments (bars) from $b$ to $d$, ordered by homological dimension $k$; longer bars indicate more persistent features.

An illustrative example is the application of the Vietoris–Rips filtration to a point cloud sampled from a noisy circle in the plane, approximating the 1-sphere $S^1$. As the filtration parameter $\epsilon$ increases, small 1-dimensional cycles (loops) form and quickly merge due to noise, yielding short bars in the $H_1$ barcode, while a prominent long bar emerges corresponding to the global 1-cycle encircling the circle's core topology, persisting until $\epsilon$ exceeds the circle's diameter.
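A short sketch of this noisy-circle experiment, assuming GUDHI; sample size, noise level, and the scale cutoff are illustrative choices:

```python
import numpy as np
import gudhi

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 120)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (120, 2))

rips = gudhi.RipsComplex(points=points, max_edge_length=2.5)
st = rips.create_simplex_tree(max_dimension=2)
st.compute_persistence()
bars = st.persistence_intervals_in_dimension(1)   # H1 (birth, death) pairs
longest = max(bars, key=lambda bd: bd[1] - bd[0])
print(f"most persistent loop: born {longest[0]:.2f}, dies {longest[1]:.2f}")
```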

Filtrations and Persistence Modules

In topological data analysis, a filtration provides a multi-scale framework for studying the evolution of topological features in data. Formally, a filtration is a nested sequence of simplicial complexes $\{K_\epsilon\}_{\epsilon \geq 0}$ such that $K_\epsilon \subseteq K_{\epsilon'}$ whenever $\epsilon < \epsilon'$, often constructed from a point cloud in a metric space. A common example is the Vietoris–Rips filtration, where $K_\epsilon$ consists of simplices formed by subsets of points with pairwise distances at most $\epsilon$, allowing the complex to grow as the scale parameter $\epsilon$ increases and capturing connectivity at varying resolutions. This structure enables the analysis of data shapes by embedding them into a sequence of increasingly refined topological spaces.

Persistence modules formalize the algebraic information encoded by filtrations, particularly when applied to homology groups. A persistence module is a functor from a poset (such as $(\mathbb{R}, \leq)$) to the category of vector spaces over a field, assigning to each parameter value a vector space (e.g., a homology group) and to each pair $\epsilon \leq \epsilon'$ a linear map induced by the inclusion $K_\epsilon \hookrightarrow K_{\epsilon'}$, satisfying functorial properties like composition of maps. These modules form exact sequences reflecting the chain complex structure, where the images and kernels of the induced maps track the birth and death of topological features across scales.

A key result in the theory is the decomposition theorem for persistence modules, which states that every pointwise finite-dimensional persistence module decomposes uniquely (up to isomorphism) as a direct sum of indicator modules supported on intervals $I = [b, d)$, where the indicator module is the vector space $\mathbb{F}$ (over a field $\mathbb{F}$) for parameters in $I$ and zero otherwise, with maps being identity or zero accordingly. This interval decomposability allows persistent features to be represented as bars of finite length, quantifying their lifespan in the filtration.

The rank invariant provides a numerical summary of a persistence module in dimension $k$, defined as the function $r(k; U)$ that counts the number of $k$-dimensional features persisting over a parameter interval $U \subseteq \mathbb{R}$, equivalent to the rank of the map between homology groups at the endpoints of $U$. For instance, in a one-dimensional filtration induced by a height function on a line segment with points added sequentially, the persistence module tracks connected components as intervals that merge upon edge addition; the module decomposes into indicator functions on birth–death intervals, with the rank invariant $r(0; [a, b])$ giving the number of components spanning that range. This setup underpins the application of persistence modules to persistent homology, where they capture multi-scale topological invariants of data.
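A tiny hand-built filtration makes the interval decomposition concrete; this sketch (with made-up filtration values) uses GUDHI's SimplexTree:

```python
import gudhi

# Two components born at time 0 that merge when the edge enters at time 1.
st = gudhi.SimplexTree()
st.insert([0], filtration=0.0)
st.insert([1], filtration=0.0)
st.insert([0, 1], filtration=1.0)

st.compute_persistence(persistence_dim_max=True)
# One H0 interval [0, 1) for the component killed by the merge,
# and one essential interval [0, inf) for the surviving component.
print(st.persistence_intervals_in_dimension(0))
```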

Algorithms and Computation

Computing Persistent Homology

The computation of persistent homology relies on representing the filtered simplicial complex via its boundary matrix and reducing it through a series of column operations over a field, typically $\mathbb{Z}/2\mathbb{Z}$, to identify persistence pairs. The boundary matrix $D$ is constructed such that its columns are indexed by the simplices in increasing order of their filtration values, and each column $j$ (corresponding to simplex $\sigma_j$) has 1's in the rows corresponding to the facets of $\partial \sigma_j$. This matrix encodes the linear maps of the chain complex, allowing the persistent homology groups to be derived from its reduced form.

The standard algorithm for reduction, introduced by Zomorodian and Carlsson, proceeds column-by-column from left to right in a process known as "clear and reduce." For each column $j$, the "clear" step eliminates entries above the lowest 1 in the column by adding previous columns as needed, ensuring no non-zero entries above the pivot. The "reduce" step then pivots on the lowest 1 at row $i = \mathrm{low}(j)$, adding column $j$ to any later column $k > j$ whose lowest 1 is also at row $i$, until no such conflicts remain. This yields the reduced boundary matrix $R = DV$ for some invertible $V$. Persistence pairs $(b, d)$ are those where $\mathrm{low}(d) = b < d$ in $R$, with birth at the filtration time of simplex $b$ and death at $d$. Features with $\mathrm{low}(b) = b$ persist infinitely. Zero columns after reduction indicate simplices that do not create new homology classes. For low dimensions, such as $H_0$, the algorithm simplifies using union-find data structures to track connected components during edge additions, akin to Kruskal's algorithm for minimum spanning trees.

The worst-case time complexity of the standard reduction is $O(n^3)$ for $n$ simplices, due to potential quadratic fill-in during Gaussian elimination, though practical performance is often better. Optimizations like the twist algorithm reorder simplices by dimension and skip reductions for certain columns using a "twist" condition on boundary indices, achieving $O(n^2)$ time in many cases, particularly for sparse matrices from Vietoris–Rips complexes. Further enhancements include the dualization trick, computing persistent cohomology instead for faster execution in higher dimensions.

Several open-source libraries implement these algorithms efficiently. The GUDHI library provides a comprehensive C++ framework with Python bindings for constructing filtrations and computing persistent homology via matrix reduction, supporting arbitrary simplicial complexes and optimized for large-scale data. Ripser, a lean C++ library with Python wrappers, specializes in Vietoris–Rips filtrations and incorporates the twist optimization, achieving state-of-the-art speed; for instance, benchmarks on point clouds with 10,000 vertices show Ripser computing $H_1$ persistence in seconds, outperforming earlier tools by orders of magnitude. Both libraries are actively maintained as of 2025 and include benchmarks demonstrating scalability to millions of simplices on modern hardware.

A simple example illustrates the reduction on a filtered simplicial complex with vertices 1, 2, 3 added at filtration 0, edge 4 (1-2) at 1, edge 5 (1-3) at 2.5, edge 6 (2-3) at 3, and triangle 7 (1-2-3) at 3. The initial boundary matrix over $\mathbb{Z}/2\mathbb{Z}$ (rows and columns indexed by 1 to 7) has 1's at positions (1,4), (2,4), (1,5), (3,5), (2,6), (3,6), (4,7), (5,7), (6,7); columns 1–3 and row 7 are zero. After full reduction, the persistence pairs are (2,4) and (3,5) for finite $H_0$ features (births at vertices 2 and 3 at time 0; deaths at edges 4 and 5 at times 1 and 2.5), $(1, \infty)$ for the persistent $H_0$ component (birth at vertex 1 at time 0), and (6,7) for the $H_1$ feature (birth at edge 6 at time 3, death at triangle 7 at time 3), yielding one persistent component and a short-lived loop.
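A compact Python sketch of the standard column reduction over $\mathbb{Z}/2\mathbb{Z}$, storing each column as a set of row indices; this is an illustrative implementation rather than the optimized clear-and-reduce or twist variants, and the 1-based indexing follows the worked example above:

```python
def reduce_boundary(columns):
    """Standard persistence reduction over Z/2.
    columns[j] is the set of rows holding a 1 in column j, in filtration order.
    Returns the reduced columns and the persistence pairs (birth, death)."""
    columns = [set(c) for c in columns]
    owner_of_low = {}                    # lowest row index -> column owning it
    pairs = []
    for j in range(len(columns)):
        # Add earlier columns (mod 2) while our lowest 1 collides with one.
        while columns[j] and max(columns[j]) in owner_of_low:
            columns[j] ^= columns[owner_of_low[max(columns[j])]]
        if columns[j]:                   # pivot survives: simplex j kills its low
            low = max(columns[j])
            owner_of_low[low] = j
            pairs.append((low, j))
    return columns, pairs

# Boundary columns for the example above (index 0 is a dummy; simplices 1..7).
boundary = [[], [], [], [], [1, 2], [1, 3], [2, 3], [4, 5, 6]]
_, pairs = reduce_boundary(boundary)
print(pairs)   # [(2, 4), (3, 5), (6, 7)] -- matching the pairs derived above
```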

Mapper Algorithm

The Mapper algorithm provides a scalable, approximate approach to topological summarization of high-dimensional data by constructing a simplicial complex that captures key structural features, such as clusters and connectivity, in a graph-like representation. Introduced by Singh, Mémoli, and Carlsson in 2007, it approximates the Reeb graph of a data set through partial clustering guided by filter functions, enabling visualization and exploration without computing full homology. This method is particularly useful for datasets where exact topological invariants are computationally prohibitive, offering a heuristic that preserves essential shape information.

The algorithm begins with the selection of a filter function $f\colon X \to \mathbb{R}$, where $X$ is the input data set, often chosen as a density estimator, eccentricity measure, or projection onto a low-dimensional subspace like the first principal components. The range of $f$ is covered by a collection of overlapping intervals $\{I_j\}$, whose number and overlap percentage determine the resolution: smaller intervals yield finer granularity, while overlaps (typically 50–70%) ensure continuity between adjacent bins. For each interval $I_j$, the preimage $f^{-1}(I_j)$ is obtained, and the points within it are partitioned into clusters using a local clustering method, such as single-linkage hierarchical clustering or DBSCAN; each cluster becomes a vertex in the output complex.

The simplicial complex is then formed via the nerve theorem applied to this cover: vertices corresponding to clusters are linked by edges if their point sets intersect (i.e., share data points from overlaps), and higher-dimensional simplices can be included based on multiplicity to represent richer interactions among three or more clusters. This assembly process leverages the overlaps to create a connected structure that reflects the data's global topology, with the overlap parameter playing a key role in maintaining connectivity and avoiding fragmentation.

One of Mapper's primary advantages is its scalability to large data sets, accommodating millions of points—far beyond the typical limits of exact persistent homology computations—due to its reliance on local clustering rather than global matrix reductions, often achieving near-linear time performance in practice with efficient clustering choices. For instance, when applied to the MNIST dataset of handwritten digits, Mapper constructs graphs that reveal topological skeletons, such as branched structures for digit shapes like '8' or loops for '0', aiding in shape-based classification and visualization. Recent variants extend the single-parameter framework to multi-parameter settings, incorporating multiple filter functions simultaneously to capture more nuanced structures; for example, ensemble-based approaches aggregate Mapper outputs over varied parameters for robust topology recovery, as developed in the 2020s for applications requiring stability across dimensions.

Visualization Techniques

Persistence Diagrams

Persistence diagrams provide a geometric representation of the persistent homology of a filtered topological space, capturing the birth and death times of topological features across scales. Each diagram is a multiset of points in the extended half-plane $\{(b, d) \mid b < d \leq \infty\}$, where the coordinate $(b, d)$ indicates a feature—such as a connected component (dimension 0), loop (dimension 1), or void (dimension 2)—that appears at filtration parameter $b$ (birth) and disappears at $d$ (death). Features with infinite persistence, which survive all scales, are represented by points $(b, \infty)$. This representation arises from the algebraic structure of persistence modules, where the diagram encodes the intervals of persistence for each homology class.

To visualize a persistence diagram, the points are plotted in the birth–death plane, with the diagonal line $d = b$ serving as a reference; all points lie strictly above this line since $d > b$, and points on the diagonal (zero persistence) are typically omitted as noise. The diagram often includes the diagonal itself as a projection line for unmatched points in distance computations, and essential features at infinity are depicted as vertical rays extending upward from their birth coordinates. Multiple diagrams, one per homology dimension, can be plotted separately or combined, with points colored or sized by dimension to highlight different feature types. Software libraries facilitate this plotting; for instance, the GUDHI library integrates with Matplotlib for standard 2D scatter plots and supports extensions to interactive visualizations via Plotly for exploring large diagrams.

A key advantage of persistence diagrams is their stability under small perturbations of the input filtration, ensuring that nearby data yield similar diagrams—a property akin to Gromov–Hausdorff stability in metric geometry. The bottleneck distance quantifies this by measuring the smallest $\epsilon$ such that there exists a partial matching between the points of two diagrams where each matched pair and unmatched point (projected to the diagonal) lies within an $\ell_\infty$-ball of radius $\epsilon$:
$$d_B(P, Q) = \inf_{\gamma} \sup_{(b,d) \in P \cup \Delta} \lVert (b, d) - \gamma(b, d) \rVert_{\infty},$$
where $\Delta$ is the diagonal, the supremum includes projections for unmatched points, and the infimum is over all bijections $\gamma\colon P \cup \Delta \to Q \cup \Delta$. This distance bounds the perturbation in the underlying space: if two filtrations arise from metric spaces at Gromov–Hausdorff distance at most $\epsilon$, their diagrams satisfy $d_B(P, Q) \leq 2\epsilon$.

Complementing the bottleneck distance, the $p$-Wasserstein distance provides a more flexible metric for statistical analysis of diagrams, defined as
$$d_{W_p}(P, Q) = \left( \inf_{\gamma} \sum_{(b,d) \in P \cup \Delta} \lVert (b, d) - \gamma(b, d) \rVert_p^p \right)^{1/p},$$
with the infimum taken over the same matchings $\gamma$ as for the bottleneck distance.
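A minimal sketch of evaluating this distance on two made-up diagrams, assuming GUDHI's wasserstein module (which relies on the POT optimal-transport package):

```python
import numpy as np
from gudhi.wasserstein import wasserstein_distance

diag1 = np.array([[0.0, 2.0], [1.0, 4.0]])
diag2 = np.array([[0.0, 2.2], [1.1, 3.8]])

# order=1 gives the 1-Wasserstein distance; internal_p=inf uses the
# l-infinity ground metric, matching the definition above.
print(wasserstein_distance(diag1, diag2, order=1.0, internal_p=np.inf))
```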