FM-index

In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini, who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space.

It can be used to efficiently count the occurrences of a pattern within the compressed text, as well as to locate the position of each occurrence. Both the query time and the required storage space are sublinear in the size of the input data.

The authors later devised improvements to their approach, which they dubbed "FM-Index version 2". A further improvement, the alphabet-friendly FM-index, combines compression boosting and wavelet trees to significantly reduce the space usage for large alphabets.

The FM-index has found use in, among other places, bioinformatics.

Using an index is a common strategy to efficiently search a large body of text. When the text is larger than what reasonably fits within a computer's main memory, there is a need to compress not only the text but also the index. When the FM-index was introduced, there were several suggested solutions that were based on traditional compression methods and tried to solve the compressed matching problem. In contrast, the FM-index is a compressed self-index, which means that it compresses the data and indexes it at the same time.

An FM-index is created by first taking the Burrows–Wheeler transform (BWT) of the input text. For example, the BWT of the string T = "abracadabra$" is "ard$rcaaaabb". The transform can be represented by a matrix M in which each row is a rotation of the text and the rows are sorted lexicographically. The transform is the concatenation of the characters in the last column of M (labeled L).
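The construction described above can be sketched in a few lines of Python. This is a naive illustration, not how a practical index is built: it materializes every rotation, and it assumes the text ends with a unique sentinel "$" that sorts before all other characters (true for ASCII lowercase text). The function name bwt_via_rotations is hypothetical.

```python
def bwt_via_rotations(text: str) -> tuple[list[str], str]:
    """Return the sorted rotation matrix M and its last column L."""
    # Assumption: text ends with a unique sentinel '$' that sorts first.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    matrix = sorted(rotations)  # rows of M in lexicographic order
    last_column = "".join(row[-1] for row in matrix)
    return matrix, last_column


matrix, bwt = bwt_via_rotations("abracadabra$")
for row in matrix:
    # Show the first column F, the middle of the row, and the last column L.
    print(row[0], row[1:-1], row[-1])
print(bwt)  # ard$rcaaaabb
```

Reading the last character of each printed row from top to bottom gives the transform quoted in the text.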

The BWT in itself allows for some compression with, for instance, move-to-front and Huffman encoding, but the transform has even more uses. Because the text ends with a unique sentinel, the rows of the matrix are essentially the sorted suffixes of the text, and the first column F of the matrix shares similarities with the suffix array. How the suffix array relates to the BWT lies at the heart of the FM-index.
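One way to see this relationship is to derive both columns from a suffix array. The sketch below assumes a "$"-terminated text and uses a deliberately naive suffix sort. The helper names suffix_array and bwt_from_suffix_array are illustrative, not taken from any library.

```python
def suffix_array(text: str) -> list[int]:
    # Naive construction by sorting suffix start positions; illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    # L[i] is the character just before suffix sa[i] in the text.
    # For sa[i] == 0, index -1 wraps around to the final sentinel '$'.
    return "".join(text[i - 1] for i in sa)


text = "abracadabra$"
sa = suffix_array(text)
first_column = "".join(text[i] for i in sa)   # F: first character of each suffix
last_column = bwt_from_suffix_array(text, sa)  # L: the BWT
print(sa)
print(first_column)
print(last_column)
```

Under these assumptions, F is simply the characters of the text in sorted order, and L matches the transform obtained from the rotation matrix.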

It is possible to make a last-to-first column mapping LF(i) from an index i to an index j, such that F[j] = L[i], with the help of a table C[c] and a function Occ(c, k).
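The mapping can be illustrated with a small sketch. It assumes the usual definitions, which the passage above does not spell out: C[c] is the number of characters in the text that are lexicographically smaller than c, and Occ(c, k) is the number of occurrences of c in the first k characters of L. With 0-based indices this gives LF(i) = C[L[i]] + Occ(L[i], i). Here Occ is a plain linear scan, whereas a real FM-index would answer it from a compressed rank structure. All function names are hypothetical. The count function is the standard backward search, included only to show how the same two ingredients support pattern counting.

```python
from collections import Counter


def build_c(last_column: str) -> dict[str, int]:
    # C[c]: number of characters strictly smaller than c (assumed definition).
    counts = Counter(last_column)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def occ(last_column: str, ch: str, k: int) -> int:
    # Occ(c, k): occurrences of c in L[0:k]; naive scan for illustration.
    return last_column.count(ch, 0, k)


def lf(last_column: str, c_table: dict[str, int], i: int) -> int:
    # Row j = LF(i) satisfies F[j] == L[i], referring to the same text character.
    ch = last_column[i]
    return c_table[ch] + occ(last_column, ch, i)


def invert_bwt(last_column: str) -> str:
    # Walk the text backwards from row 0, the rotation that starts with '$'.
    c_table = build_c(last_column)
    i, chars = 0, []
    for _ in range(len(last_column) - 1):
        chars.append(last_column[i])
        i = lf(last_column, c_table, i)
    return "".join(reversed(chars)) + "$"


def count(last_column: str, pattern: str) -> int:
    # Backward search: narrow a half-open row range [lo, hi) one character
    # at a time, processing the pattern from its last character to its first.
    c_table = build_c(last_column)
    lo, hi = 0, len(last_column)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(last_column, ch, lo)
        hi = c_table[ch] + occ(last_column, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


L = "ard$rcaaaabb"
print(invert_bwt(L))     # abracadabra$
print(count(L, "abra"))  # 2
```

Note that neither function consults the original text: L together with C and Occ is enough to reconstruct the text and to count occurrences, which is what allows the index to replace the text it describes.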
