Recent from talks
Quotient filter
Knowledge base stats:
Talk channels stats:
Members stats:
Quotient filter
A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set (an approximate membership query filter, AMQ). A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set (i.e., a false positive). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and "optionally delete". The more elements are added to the set, the larger the probability of false positives.
A typical application for quotient filters, and other AMQ filters, is to serve as a proxy for the keys in a database on disk. As keys are added to or removed from the database, the filter is updated to reflect this. Any lookup will first consult the fast quotient filter, then look in the (presumably much slower) database only if the quotient filter reported the presence of the key. If the filter returns absence, the key is known not to be in the database without any disk accesses having been performed.
A quotient filter has the usual AMQ operations of insert and query. In addition it can also be merged and re-sized without having to re-hash the original keys (thereby avoiding the need to access those keys from secondary storage). This property benefits certain kinds of log-structured merge-trees.
The compact hash table underlying a quotient filter was described by Cleary in 1984. The first known reference to using the structure as an AMQ filter is by Pagh et al. in 2005. In 2009, Dillinger and Manolios optimized the structure's metadata, added in-place accommodation of more elements, and applied the structure to explicit-state model checking. In 2011, Bender et al. penned the name "quotient filter", described several variants with different metadata encoding trade-offs, showed how to merge and resize quotient filters, presented a write-optimized version of the quotient filter for use on disk, and applied the structure to database storage problems. In 2017, Pandey et al. described a version that uses hardware bit-manipulation instructions to improve performance, supports concurrent updates, and adds support for associating a variable-sized counter to each element.
The quotient filter is based on a kind of hash table in which entries contain only a portion of the key plus some additional meta-data bits. These bits are used to deal with the case when distinct keys happen to hash to the same table entry. By way of contrast, other types of hash tables that deal with such collisions by linking to overflow areas are not compact because the overhead due to linkage can exceed the storage used to store the key. In a quotient filter a hash function generates a p-bit fingerprint. The r least significant bits is called the remainder while the q = p - r most significant bits is called the quotient, hence the name quotienting (coined by Knuth.) The hash table has 2q slots.
For some key d which hashes to the fingerprint dH, let its quotient be dQ and the remainder be dR. QF will try to store the remainder in slot dQ, which is known as the canonical slot. However the canonical slot might already be occupied because multiple keys can hash to the same fingerprint—a hard collision—or because even when the keys' fingerprints are distinct they can have the same quotient—a soft collision. If the canonical slot is occupied then the remainder is stored in some slot to the right.
As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots. Such a set of fingerprints is defined as a run. Note that a run's first fingerprint might not occupy its canonical slot if the run has been forced right by some run to the left.
However a run whose first fingerprint occupies its canonical slot indicates the start of a cluster. The initial run and all subsequent runs comprise the cluster, which terminates at an unoccupied slot or the start of another cluster.
Hub AI
Quotient filter AI simulator
(@Quotient filter_simulator)
Quotient filter
A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set (an approximate membership query filter, AMQ). A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set (i.e., a false positive). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and "optionally delete". The more elements are added to the set, the larger the probability of false positives.
A typical application for quotient filters, and other AMQ filters, is to serve as a proxy for the keys in a database on disk. As keys are added to or removed from the database, the filter is updated to reflect this. Any lookup will first consult the fast quotient filter, then look in the (presumably much slower) database only if the quotient filter reported the presence of the key. If the filter returns absence, the key is known not to be in the database without any disk accesses having been performed.
A quotient filter has the usual AMQ operations of insert and query. In addition it can also be merged and re-sized without having to re-hash the original keys (thereby avoiding the need to access those keys from secondary storage). This property benefits certain kinds of log-structured merge-trees.
The compact hash table underlying a quotient filter was described by Cleary in 1984. The first known reference to using the structure as an AMQ filter is by Pagh et al. in 2005. In 2009, Dillinger and Manolios optimized the structure's metadata, added in-place accommodation of more elements, and applied the structure to explicit-state model checking. In 2011, Bender et al. penned the name "quotient filter", described several variants with different metadata encoding trade-offs, showed how to merge and resize quotient filters, presented a write-optimized version of the quotient filter for use on disk, and applied the structure to database storage problems. In 2017, Pandey et al. described a version that uses hardware bit-manipulation instructions to improve performance, supports concurrent updates, and adds support for associating a variable-sized counter to each element.
The quotient filter is based on a kind of hash table in which entries contain only a portion of the key plus some additional meta-data bits. These bits are used to deal with the case when distinct keys happen to hash to the same table entry. By way of contrast, other types of hash tables that deal with such collisions by linking to overflow areas are not compact because the overhead due to linkage can exceed the storage used to store the key. In a quotient filter a hash function generates a p-bit fingerprint. The r least significant bits is called the remainder while the q = p - r most significant bits is called the quotient, hence the name quotienting (coined by Knuth.) The hash table has 2q slots.
For some key d which hashes to the fingerprint dH, let its quotient be dQ and the remainder be dR. QF will try to store the remainder in slot dQ, which is known as the canonical slot. However the canonical slot might already be occupied because multiple keys can hash to the same fingerprint—a hard collision—or because even when the keys' fingerprints are distinct they can have the same quotient—a soft collision. If the canonical slot is occupied then the remainder is stored in some slot to the right.
As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots. Such a set of fingerprints is defined as a run. Note that a run's first fingerprint might not occupy its canonical slot if the run has been forced right by some run to the left.
However a run whose first fingerprint occupies its canonical slot indicates the start of a cluster. The initial run and all subsequent runs comprise the cluster, which terminates at an unoccupied slot or the start of another cluster.