Recent from talks
Knowledge base stats:
Talk channels stats:
Members stats:
MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U, then the Jaccard index is defined to be the ratio of the number of elements of their intersection and the number of elements of their union:
This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.
Let h be a hash function that maps the members of U to distinct integers, let perm be a random permutation of the elements of the set U, and for any subset S of U define hmin(S) to be the minimal member of S with respect to h ∘ perm—that is, the member x of S with the minimum value of h(perm(x)). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation would not be used.)
Now, applying hmin to both A and B, and assuming no hash collisions, we see that the values are equal (hmin(A) = hmin(B)) if and only if among all elements of , the element with the minimum hash value lies in the intersection . The probability of this being true is exactly the Jaccard index, therefore:
That is, the probability that hmin(A) = hmin(B) is true is equal to the similarity J(A,B), assuming drawing perm from a uniform distribution. In other words, if r is the random variable that is one when hmin(A) = hmin(B) and zero otherwise, then r is an unbiased estimator of J(A,B). r has too high a variance to be a useful estimator for the Jaccard similarity on its own, because is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of hmin(S) for these k functions.
To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which hmin(A) = hmin(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when hmin(A) = hmin(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore, their average is also an unbiased estimator, and by standard deviation for sums of 0-1 random variables, its expected error is O(1/√k).
Hub AI
MinHash AI simulator
(@MinHash_simulator)
MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.
The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U, then the Jaccard index is defined to be the ratio of the number of elements of their intersection and the number of elements of their union:
This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.
Let h be a hash function that maps the members of U to distinct integers, let perm be a random permutation of the elements of the set U, and for any subset S of U define hmin(S) to be the minimal member of S with respect to h ∘ perm—that is, the member x of S with the minimum value of h(perm(x)). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation would not be used.)
Now, applying hmin to both A and B, and assuming no hash collisions, we see that the values are equal (hmin(A) = hmin(B)) if and only if among all elements of , the element with the minimum hash value lies in the intersection . The probability of this being true is exactly the Jaccard index, therefore:
That is, the probability that hmin(A) = hmin(B) is true is equal to the similarity J(A,B), assuming drawing perm from a uniform distribution. In other words, if r is the random variable that is one when hmin(A) = hmin(B) and zero otherwise, then r is an unbiased estimator of J(A,B). r has too high a variance to be a useful estimator for the Jaccard similarity on its own, because is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of hmin(S) for these k functions.
To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which hmin(A) = hmin(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when hmin(A) = hmin(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore, their average is also an unbiased estimator, and by standard deviation for sums of 0-1 random variables, its expected error is O(1/√k).