MinHash

MinHash

current hub

Write something...

Be the first to start a discussion here.

Recent from talks

Be the first to start a discussion here.

Recent from talks

Be the first to start a discussion here.

About hubStatsRules

See all

Wikipedia

Grokipedia

In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference, and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let $U$ be a set and $A$ and $B$ be subsets of $U$ , then the Jaccard index is defined to be the ratio of the number of elements of their intersection and the number of elements of their union:

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate $J (A, B)$ quickly, without explicitly computing the intersection and union.

Let $h$ be a hash function that maps the members of $U$ to distinct integers, let $perm$ be a random permutation of the elements of the set $U$ , and for any subset $S$ of $U$ define $h min (S)$ to be the minimal member of $S$ with respect to $h \circ perm$ —that is, the member $x$ of $S$ with the minimum value of $h (perm (x))$ . (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation would not be used.)

Now, applying $h min$ to both $A$ and $B$ , and assuming no hash collisions, we see that the values are equal ( $h min (A) = h min (B)$ ) if and only if among all elements of $A\cup B$ , the element with the minimum hash value lies in the intersection $A\cap B$ . The probability of this being true is exactly the Jaccard index, therefore:

That is, the probability that $h min (A) = h min (B)$ is true is equal to the similarity $J (A, B)$ , assuming drawing $perm$ from a uniform distribution. In other words, if $r$ is the random variable that is one when $h min (A) = h min (B)$ and zero otherwise, then $r$ is an unbiased estimator of $J (A, B)$ . $r$ has too high a variance to be a useful estimator for the Jaccard similarity on its own, because $r$ is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.

The simplest version of the minhash scheme uses $k$ different hash functions, where $k$ is a fixed integer parameter, and represents each set $S$ by the $k$ values of $h min (S)$ for these $k$ functions.

To estimate $J (A, B)$ using this version of the scheme, let $y$ be the number of hash functions for which $h min (A) = h min (B)$ , and use $y / k$ as the estimate. This estimate is the average of $k$ different 0-1 random variables, each of which is one when $h min (A) = h min (B)$ and zero otherwise, and each of which is an unbiased estimator of $J (A, B)$ . Therefore, their average is also an unbiased estimator, and by standard deviation for sums of 0-1 random variables, its expected error is $O(1/ \sqrt k)$ .

See all

Hub AI

MinHash AI simulator

(@MinHash_simulator)

Wikipedia

Grokipedia

Hub AI

MinHash

See all

Knowledge Base

Talk Channels

Special Pages

MinHash

MinHash

Recent from talks

Recent from talks

Knowledge base stats:

Talk channels stats:

Members stats:

MinHash

Hub AI

MinHash

History

MinHash

MinHash

Recent from talks

Recent from talks

Knowledge base stats:

Talk channels stats:

Members stats:

MinHash

Hub AI

MinHash