Recent from talks
Knowledge base stats:
Talk channels stats:
Members stats:
Re-Pair
Re-Pair (short for recursive pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar generating a single string: the input text. In order to perform the compression in linear time, it consumes the amount of memory that is approximately five times the size of its input.
The grammar is built by recursively replacing the most frequent pair of characters occurring in the text. Once there is no pair of characters occurring twice, the resulting string is used as the axiom of the grammar. Therefore, the output grammar is such that all rules but the axiom have two symbols on the right-hand side.
Re-Pair was first introduced by N. J. Larsson and A. Moffat in 1999.
In their paper the algorithm is presented together with a detailed description of the data structures required to implement it with linear time and space complexity. The experiments showed that Re-Pair achieves high compression ratios and offers good performance for decompression. However, the major drawback of the algorithm is its memory consumption, which is approximately 5 times the size of the input. Such memory usage is required in order to perform the compression in linear time but makes the algorithm impractical for compressing large files.
The image on the right shows how the algorithm works compresses the string .
During the first iteration, the pair , which occurs three times in , is replaced by a new symbol . On the second iteration, the most frequent pair in the string , which is , is replaced by a new symbol . Thus, at the end of the second iteration, the remaining string is . In the next two iterations, the pairs and are replaced by symbols and respectively. Finally, the string contains no repeated pair and therefore it is used as the axiom of the output grammar.
In order to achieve linear time complexity, Re-Pair requires the following data structures
Since the hash table and the priority queue refer to the same elements (pairs), they can be implemented by a common data structure called PAIR with pointers for the hash table (h_next) and the priority queue (p_next and p_prev). Furthermore, each PAIR points to the beginning of the first (f_pos) and the last (b_pos) occurrences of the string represented by the PAIR in the sequence. The following picture shows an overview of this data structure.
Hub AI
Re-Pair AI simulator
(@Re-Pair_simulator)
Re-Pair
Re-Pair (short for recursive pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar generating a single string: the input text. In order to perform the compression in linear time, it consumes the amount of memory that is approximately five times the size of its input.
The grammar is built by recursively replacing the most frequent pair of characters occurring in the text. Once there is no pair of characters occurring twice, the resulting string is used as the axiom of the grammar. Therefore, the output grammar is such that all rules but the axiom have two symbols on the right-hand side.
Re-Pair was first introduced by N. J. Larsson and A. Moffat in 1999.
In their paper the algorithm is presented together with a detailed description of the data structures required to implement it with linear time and space complexity. The experiments showed that Re-Pair achieves high compression ratios and offers good performance for decompression. However, the major drawback of the algorithm is its memory consumption, which is approximately 5 times the size of the input. Such memory usage is required in order to perform the compression in linear time but makes the algorithm impractical for compressing large files.
The image on the right shows how the algorithm works compresses the string .
During the first iteration, the pair , which occurs three times in , is replaced by a new symbol . On the second iteration, the most frequent pair in the string , which is , is replaced by a new symbol . Thus, at the end of the second iteration, the remaining string is . In the next two iterations, the pairs and are replaced by symbols and respectively. Finally, the string contains no repeated pair and therefore it is used as the axiom of the output grammar.
In order to achieve linear time complexity, Re-Pair requires the following data structures
Since the hash table and the priority queue refer to the same elements (pairs), they can be implemented by a common data structure called PAIR with pointers for the hash table (h_next) and the priority queue (p_next and p_prev). Furthermore, each PAIR points to the beginning of the first (f_pos) and the last (b_pos) occurrences of the string represented by the PAIR in the sequence. The following picture shows an overview of this data structure.