Edit distance

from Wikipedia

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.[1]

Types of edit distance

Different types of edit distance allow different sets of string operations. For instance:

Algorithms and the operations allowed

Algorithm                           Insertions  Deletions  Substitutions  Transposition
Levenshtein distance                yes         yes        yes            no
Longest common subsequence (LCS)    yes         yes        no             no
Hamming distance                    no          no         yes            no
Damerau–Levenshtein distance        yes         yes        yes            yes
Jaro distance                       no          no         no             yes

Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.

Formal definition and properties

Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:[2]

Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
Deletion of a single symbol changes uxv to uv (x→ε).
Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x→y).

In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions w_ins(x), w_del(x) and w_sub(x, y) with the operations.[2]

Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv.[3][4] For the task of correcting OCR output, merge and split operations have been used, which replace a single character with a pair of characters, or vice versa.[4]

Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost.[1]: 37  Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.[1] Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.

Example

The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:

  1. kitten → sitten (substitute "s" for "k")
  2. sitten → sittin (substitute "i" for "e")
  3. sittin → sitting (insert "g" at the end)

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:

  1. kitten → itten (delete "k" at 0)
  2. itten → sitten (insert "s" at 0)
  3. sitten → sittn (delete "e" at 4)
  4. sittn → sittin (insert "i" at 4)
  5. sittin → sitting (insert "g" at 6)

for a total cost/distance of 5 operations.
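Both scripts above can be checked mechanically. The following Python sketch (an illustration, not part of the original article; the helper names are ours) computes LCS distance via the standard identity distance = m + n − 2·|LCS|:

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_distance(a: str, b: str) -> int:
    """Edit distance with insertions and deletions only, at unit cost."""
    return len(a) + len(b) - 2 * lcs_length(a, b)

assert lcs_distance("kitten", "sitting") == 5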

Properties

Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met:[1]: 37 

  • Every edit operation has positive cost;
  • for every operation, there is an inverse operation with equal cost.

With these properties, the metric axioms are satisfied as follows:

d(a, b) = 0 if and only if a = b, since each string can be trivially transformed to itself using exactly zero operations.
d(a, b) > 0 when a ≠ b, since this would require at least one operation at non-zero cost.
d(a, b) = d(b, a) by equality of the cost of each operation and its inverse.
Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c).[5]

Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.[1]

Other useful properties of unit-cost edit distances include:

  • LCS distance is bounded above by the sum of lengths of a pair of strings.[1]: 37 
  • LCS distance is an upper bound on Levenshtein distance.
  • For strings of the same length, Hamming distance is an upper bound on Levenshtein distance.[1]

Regardless of cost/weights, the following property holds of all edit distances:

  • When a and b share a common prefix, this prefix has no effect on the distance. Formally, when a = uv and b = uw, then d(a, b) = d(v, w).[4] This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time.
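As an illustration of this property, here is a minimal Python sketch (the names are ours, not from the article) that strips a shared prefix and suffix before any distance computation; the distance between the trimmed remainders equals the original distance:

def trim_common_affixes(a: str, b: str) -> tuple[str, str]:
    """Strip the shared prefix and suffix; d(a, b) equals the
    distance between the trimmed remainders."""
    # Strip the common prefix.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    a, b = a[i:], b[i:]
    # Strip the common suffix of what remains.
    j = 0
    while j < len(a) and j < len(b) and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return a[:len(a) - j], b[:len(b) - j]

print(trim_common_affixes("unhappy", "unhappier"))
# -> ('y', 'ier'); d("unhappy", "unhappier") == d("y", "ier") == 3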

Computation

The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964.[6]

Common algorithm

Using Levenshtein's original operations, the (nonsymmetric) edit distance from a = a_1…a_m to b = b_1…b_n is given by d(m, n), defined by the recurrence[2]

d(i, 0) = Σ_{k=1..i} w_del(a_k)   for 1 ≤ i ≤ m,
d(0, j) = Σ_{k=1..j} w_ins(b_k)   for 1 ≤ j ≤ n,
d(i, j) = d(i−1, j−1)   if a_i = b_j,
d(i, j) = min { d(i−1, j) + w_del(a_i),  d(i, j−1) + w_ins(b_j),  d(i−1, j−1) + w_sub(a_i, b_j) }   if a_i ≠ b_j,

for 1 ≤ i ≤ m and 1 ≤ j ≤ n.

This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.[3]

The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer,[7] although it has a history of multiple invention.[2][3] After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm, starting at d(m, n).
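A minimal Python sketch of this dynamic program under Levenshtein's unit costs, with a backtrace that reads off one minimal edit script as described above (function names are illustrative, not from a particular library):

def wagner_fischer(a: str, b: str):
    """Fill the (m+1) x (n+1) table d, where d[i][j] is the edit
    distance between a[:i] and b[:j], at unit costs."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(1, n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d

def backtrace(a: str, b: str, d) -> list[str]:
    """Read one minimal edit script off the table, starting at d(m, n)."""
    i, j, script = len(a), len(b), []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] != b[j - 1]:
                script.append(f"substitute {b[j - 1]!r} for {a[i - 1]!r}")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            script.append(f"insert {b[j - 1]!r}")
            j -= 1
        else:
            script.append(f"delete {a[i - 1]!r}")
            i -= 1
    return script[::-1]

d = wagner_fischer("kitten", "sitting")
print(d[-1][-1])                          # 3
print(backtrace("kitten", "sitting", d)) # one minimal edit script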

This algorithm has a time complexity of Θ(mn) where m and n are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations.[3] A linear-space solution to this problem is offered by Hirschberg's algorithm.[8]: 634  A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.[9]
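The two-row observation translates directly into code; this sketch (again illustrative, unit costs) keeps Θ(min(m, n)) space at the price of losing the backtrace:

def levenshtein_two_rows(a: str, b: str) -> int:
    """Theta(min(m, n)) space: keep only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a                       # iterate rows over the longer string
    prev = list(range(len(b) + 1))        # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # d(i, 0) = i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

assert levenshtein_two_rows("kitten", "sitting") == 3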

Improved algorithms

Improving on the Wagner–Fischer algorithm described above, Ukkonen describes several variants,[10] one of which takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s·min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s²) or O(s), depending on whether the edit sequence needs to be read off.[3]
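The following Python sketch illustrates the banded idea behind this variant: under unit costs, cells with |i − j| > s cannot lie on a path of cost at most s, so only a diagonal band needs to be filled. For clarity it allocates a full table rather than the O(s)-space band, so it is a simplified illustration rather than Ukkonen's exact algorithm:

INF = float("inf")

def banded_levenshtein(a: str, b: str, s: int) -> float:
    """Return min(s, d(a, b)) by filling only cells with |i - j| <= s."""
    m, n = len(a), len(b)
    if abs(m - n) > s:
        return s                          # the distance is at least |m - n|
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0
    for j in range(1, min(n, s) + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        if i <= s:
            d[i][0] = i
        for j in range(max(1, i - s), min(n, i + s) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return min(d[m][n], s)

print(banded_levenshtein("kitten", "sitting", 5))   # 3
print(banded_levenshtein("kitten", "sitting", 2))   # 2 (= s; the true d is 3)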

Further improvements by Landau, Myers, and Schmidt [1] give an O(s² + max(m,n)) time algorithm.[11]

For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is that of Masek and Paterson,[12] having worst-case runtime of O(nm / log n).

Applications

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

Various algorithms exist that solve related problems beyond computing the distance between a pair of strings.

  • Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
  • Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string p, called the pattern, and a constant k; it then builds a deterministic finite state automaton that finds, in an arbitrary string s, a substring whose edit distance to p is at most k[13] (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance. A simple dynamic-programming formulation of this search problem is sketched after this list.
  • Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.[4]
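The dynamic-programming formulation of approximate matching promised above is, in essence, Sellers' variant of the standard table (noted here as an illustration; this is not the automaton construction of Ukkonen's 1985 algorithm): setting the entire first row to zero lets a match begin at any position of the text, and every column whose bottom entry is at most k marks the end of a substring within distance k of the pattern. A minimal Python sketch with illustrative names:

def approx_match_ends(p: str, t: str, k: int) -> list[int]:
    """End positions j in t (1-based) such that some substring of t
    ending at j is within edit distance k of the pattern p."""
    m = len(p)
    col = list(range(m + 1))        # column for j = 0: D[i][0] = i
    hits = []
    for j in range(1, len(t) + 1):
        prev = col[0]               # D[i-1][j-1], starting from D[0][j-1]
        col[0] = 0                  # D[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                       # text char skipped
                      col[i - 1] + 1,                   # pattern char deleted
                      prev + (p[i - 1] != t[j - 1]))    # match/substitution
            prev, col[i] = col[i], cur
        if col[m] <= k:
            hits.append(j)
    return hits

print(approx_match_ends("sitting", "a sittin here", 1))
# -> [8, 9]: "sittin" (one insertion) and "sittin " (one substitution)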

Language edit distance

A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings. More formally, for any language L and string x over an alphabet Σ, the language edit distance d(L, x) is given by d(L, x) = min_{y ∈ L} d(x, y),[14] where d(x, y) is the string edit distance. When the language L is context-free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance.[15] For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance.[16]

Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.[14][17]

from Grokipedia
Edit distance, also known as Levenshtein distance, is a string metric used to quantify the difference between two sequences by calculating the minimum number of single-character operations—insertions, deletions, or substitutions—required to transform one into the other. This measure provides a numerical indication of similarity, where a distance of zero indicates identical strings and higher values reflect greater divergence; for instance, the edit distance between "cat" and "dog" is 3, involving three substitutions. The concept was introduced by Soviet mathematician Vladimir Levenshtein in 1965 as part of his work on binary error-correcting codes capable of handling deletions, insertions, and reversals in information transmission. Levenshtein's original formulation focused on its utility in coding theory to detect and correct errors in data streams, laying the foundation for its broader adoption in computational fields. Over time, the metric has been generalized beyond binary codes to arbitrary strings over any alphabet, with unit cost typically assigned to each basic operation unless weighted variants are specified.

Computing the edit distance between two strings of lengths n and m is classically solved using a dynamic programming algorithm that constructs a matrix of size (n+1) by (m+1), filling it row by row to track the optimal edit path, achieving O(nm) time and space complexity. This approach, often visualized as an alignment grid, avoids redundant computations and is practical for moderate-length strings in applications like spell-checking. More advanced techniques, such as diagonal-based or hierarchical decompositions, have been developed to approximate or accelerate computation for longer sequences, though exact solutions remain quadratic in the worst case.

Edit distance has wide-ranging applications across computer science and related disciplines, including spell-checking systems that suggest corrections based on low-distance dictionary words, approximate string matching for fuzzy matching in search engines, and bioinformatics, where aligning DNA or protein sequences helps identify evolutionary similarities. In pattern recognition, it supports tasks like handwriting analysis and image sequence comparison by extending the metric to non-text data. Variants such as the Damerau-Levenshtein distance incorporate adjacent transpositions as an additional operation, enhancing relevance for certain error models like keyboard typos.

Overview and History

Basic Concept and Motivation

Edit distance, also known as Levenshtein distance, quantifies the similarity between two strings by measuring the minimum number of single-character operations—insertions, deletions, or substitutions—required to transform one string into the other. This metric provides an intuitive way to assess how closely related two sequences of characters are, with a lower distance indicating greater similarity. For instance, transforming the string "cat" into "cot" requires only one substitution (replacing 'a' with 'o'), resulting in an edit distance of 1.

The concept emerged in the 1960s as a tool for addressing practical challenges in early computing, particularly in text processing. At a time when computers were increasingly used for text entry and document preparation, typing mistakes were common, necessitating methods to identify and fix discrepancies efficiently. Edit distance offered a systematic approach to evaluate potential corrections by modeling these errors as minimal transformations, facilitating applications like automated spelling correction. Vladimir Levenshtein introduced the idea in 1965, originally in the context of developing error-correcting codes for information transmission, but it quickly found relevance in string correction tasks. This foundational work laid the groundwork for its adoption in spell-checking systems, where comparing user input against dictionary entries became essential for improving accuracy in human-computer interaction.

Historical Development

The concept of edit distance emerged in the context of information theory and error-correcting codes, where measuring the minimum number of operations to transform one sequence into another addressed challenges in reliable transmission. Soviet mathematician Vladimir Levenshtein formalized this idea in his 1965 paper, introducing the metric now known as Levenshtein distance to quantify differences between binary sequences in codes capable of correcting deletions, insertions, and reversals. In 1964, Frederick J. Damerau proposed a variant incorporating transpositions as an additional operation, motivated by spelling error correction in early computing systems.

The 1970s saw significant algorithmic advancements, with Robert A. Wagner and Michael J. Fischer developing a dynamic programming approach in 1974 to efficiently compute the string-to-string correction distance, enabling practical applications beyond theoretical coding. In the following decades, researchers extended these foundations, introducing weighted variants and block-edit models to handle more complex error patterns in text processing tasks. These developments addressed limitations in uniform-cost assumptions, allowing costs to vary by operation type for better alignment in diverse data.

Post-2000 milestones integrated edit distance into bioinformatics, building on 1990s tools like BLAST, which leveraged sequence alignment heuristics inspired by edit operations to accelerate database searches for genetic similarities. In recent years, variants have appeared in AI applications, such as transformer models adapting edit distance for efficient DNA sequence alignment in genomic analysis. Notably, while similar sequence comparison ideas existed in linguistics before the 1960s, no standardized metrics akin to edit distance were formalized until Levenshtein's work, marking a gap in pre-1960s quantitative approaches. Today, edit distance variants underpin modern natural language processing for tasks like spell correction and semantic similarity.

Types of Edit Distance

Levenshtein Distance

The Levenshtein distance, also known as the standard edit distance, is a measure of the similarity between two strings defined as the minimum number of single-character edits required to transform one string into the other. Introduced by Vladimir Levenshtein in the context of error-correcting codes, it quantifies differences in sequences by considering basic editing operations on characters. The allowed operations in Levenshtein distance are insertion of a single character, deletion of a single character, and substitution of one character for another, with each operation assigned a cost of 1. If two characters are identical, no substitution is needed, effectively costing 0. This formulation ensures the distance represents the smallest total cost to achieve equality between the strings.

A concrete example illustrates the computation: consider transforming "kitten" into "sitting". One optimal sequence involves substituting the initial 'k' with 's' to get "sitten", then substituting the 'e' with 'i' to get "sittin", and finally inserting 'g' at the end to get "sitting", for a total distance of 3. This can be visualized through an alignment that highlights the operations:

k i t t e n -
|       |
s i t t i n g

Here, the vertical bars indicate substitutions ('k' to 's' and 'e' to 'i'), and the dash represents the insertion of 'g'. Unlike some variants, the Levenshtein distance focuses exclusively on these three core operations without additional transformations. It satisfies metric properties such as the triangle inequality, enabling its use in various distance-based analyses. For applications requiring a bounded similarity score, the normalized Levenshtein distance divides the raw distance by the length of the longer string, yielding a value between 0 (identical) and 1 (completely dissimilar). This normalization facilitates percentage-based interpretations, such as similarity = 1 - (distance / max(length1, length2)).
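As an illustration (the function and parameter names here are ours), this normalization can wrap any unit-cost distance routine:

def normalized_similarity(a: str, b: str, distance) -> float:
    """Map a raw edit distance to a similarity in [0, 1]:
    1.0 for identical strings, lower for more dissimilar ones."""
    if not a and not b:
        return 1.0                  # both empty: identical by convention
    return 1.0 - distance(a, b) / max(len(a), len(b))

# With a unit-cost Levenshtein function: 1 - 3/7 ≈ 0.571 for kitten/sitting.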

Damerau-Levenshtein and Other Variants

The Damerau-Levenshtein distance extends the Levenshtein distance by incorporating transpositions of adjacent characters as an additional edit operation with a cost of 1, alongside insertions, deletions, and substitutions each costing 1. This variant was introduced to better model common typing errors; Damerau observed that over 80% of misspellings in an information-retrieval system stemmed from single instances of these four operations. For example, transforming "ca" to "ac" requires only one transposition, yielding a distance of 1, whereas the standard Levenshtein distance would require two substitutions.

Other variants adapt edit distance for specific constraints or applications. The Hamming distance, applicable only to strings of equal length, counts the minimum number of substitutions needed to match them, ignoring insertions and deletions. It originated in the context of error-correcting codes for binary strings but extends naturally to general symbols. For instance, the Hamming distance between "abc" and "adc" is 1, reflecting the single differing position.

The Jaro distance measures string similarity by considering matching characters within a certain distance window, penalizing transpositions separately, and is particularly suited for tasks like name matching. It computes a similarity score between 0 and 1 based on the fraction of matches, transpositions, and string lengths, with distance defined as 1 minus the similarity. The Jaro-Winkler variant refines this by boosting the score for strings sharing common prefixes, emphasizing the importance of initial characters in identifiers. For example, "JONES" and "JOHNSON" yield a higher Jaro-Winkler similarity due to the shared prefix.

Another variant is the longest common subsequence (LCS) distance, which restricts operations to insertions and deletions only, each costing 1, effectively measuring m + n - 2 × (LCS length), where m and n are the string lengths. This focuses on subsequence preservation without substitutions, which is useful for comparing ordered but non-contiguous elements, as in diff tools or bioinformatics alignments. For strings "ABCBDAB" (length 7) and "BDCAB" (length 5), the LCS length is 4 ("BCAB"), so the distance is 7 + 5 - 2×4 = 4.

Weighted variants assign non-uniform costs to operations for greater flexibility. In particular, affine gap penalties model insertions or deletions as gaps with an opening cost plus a linear extension cost, better capturing biological realities like indels in DNA sequences. This approach, introduced by Gotoh, also allows substitution costs to vary based on character similarity, such as using matrices like BLOSUM for proteins. For example, a gap of length k incurs cost α + (k-1)β, where α > β encourages fewer, longer gaps over many short ones. The variants are summarized below; a sketch of the transposition extension follows the table.
Variant              Operations included                          Typical use cases
Levenshtein          Insertion, deletion, substitution            General string correction, spell-checking
Damerau-Levenshtein  Above plus adjacent transposition            Typing-error detection in text entry
Jaro/Jaro-Winkler    Matches, transpositions, prefix weighting    Name matching in databases
Hamming              Substitution only (equal lengths)            Error detection in fixed-length codes
LCS                  Insertion, deletion only                     Subsequence alignment in sequences
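As promised above, here is a Python sketch of the transposition extension in its common restricted form, known as optimal string alignment (OSA), which never edits a transposed pair again; the unrestricted Damerau-Levenshtein distance requires a more elaborate algorithm, as the last assertion below shows:

def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment):
    insertions, deletions, substitutions, and adjacent transpositions,
    all at unit cost, with no further edits to a transposed pair."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

assert osa_distance("ca", "ac") == 1    # one transposition
assert osa_distance("ca", "abc") == 3   # unrestricted Damerau-Levenshtein gives 2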

Mathematical Formulation

Formal Definition

The edit distance between two strings x and y over a finite alphabet Σ is defined as the minimum total cost of an edit sequence that transforms x into y, where the edit operations are drawn from a specified set, typically including insertion of a symbol from Σ, deletion of a symbol from x, and substitution of a symbol in x with one from Σ, with transposition sometimes included as an optional operation. Each operation o_i in the sequence is associated with a non-negative cost c(o_i), which is commonly set to 1 for uniformity across operations in the unit-cost case (such as the Levenshtein distance), though more general cost functions are permitted as long as the cost of an operation equals the cost of its inverse—for instance, the cost of insertion must match that of deletion to ensure consistency. An edit sequence corresponds to an alignment path, which can be represented as an ordered list of operations or depicted as a path in a grid graph where vertices correspond to pairs of prefixes of x and y and edges represent individual edit steps. Formally, the edit distance is given by

d(x, y) = \min\left\{ \sum_{i=1}^{k} c(o_i) \;\middle|\; o_1, \dots, o_k \text{ is an edit sequence transforming } x \text{ to } y \right\}.
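This cost-function formulation specializes directly to a weighted dynamic program; the following Python sketch assumes the caller supplies non-negative costs with w_sub(c, c) = 0, mirroring the w_ins, w_del, w_sub notation used earlier (names illustrative):

def weighted_edit_distance(x, y, w_ins, w_del, w_sub):
    """d(x, y) with caller-supplied non-negative costs:
    w_ins(c), w_del(c), and w_sub(c1, c2) with w_sub(c, c) == 0."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del(x[i - 1])     # delete prefix of x
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins(y[j - 1])     # insert prefix of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + w_del(x[i - 1]),
                          d[i][j - 1] + w_ins(y[j - 1]),
                          d[i - 1][j - 1] + w_sub(x[i - 1], y[j - 1]))
    return d[m][n]

# Unit costs recover the Levenshtein distance:
print(weighted_edit_distance("kitten", "sitting",
                             w_ins=lambda c: 1,
                             w_del=lambda c: 1,
                             w_sub=lambda c1, c2: 0 if c1 == c2 else 1))  # 3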