MinHash

from Wikipedia

In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 conference paper,[1] and was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1]

Jaccard similarity and minimum hash values

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U; then the Jaccard index is defined to be the ratio of the number of elements of their intersection to the number of elements of their union:

J(A,B) = |A ∩ B| / |A ∪ B|.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.

Let h be a hash function that maps the members of U to distinct integers, let perm be a random permutation of the elements of the set U, and for any subset S of U define hmin(S) to be the minimal member of S with respect to h ∘ perm; that is, the member x of S with the minimum value of h(perm(x)). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation would not be used.)

Now, applying hmin to both A and B, and assuming no hash collisions, we see that the values are equal (hmin(A) = hmin(B)) if and only if, among all elements of A ∪ B, the element with the minimum hash value lies in the intersection A ∩ B. The probability of this being true is exactly the Jaccard index, therefore:

Pr[hmin(A) = hmin(B)] = J(A,B).

That is, the probability that hmin(A) = hmin(B) is equal to the similarity J(A,B), assuming perm is drawn uniformly at random. In other words, if r is the random variable that is one when hmin(A) = hmin(B) and zero otherwise, then r is an unbiased estimator of J(A,B). On its own, however, r has too high a variance to be a useful estimator for the Jaccard similarity, because r is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
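
This identity can be checked empirically. The short simulation below is only an illustrative sketch (the universe, sets, and number of trials are arbitrary choices): it draws uniformly random permutations and counts how often the minima of A and B coincide.

```python
import random

U = list(range(20))
A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
true_jaccard = len(A & B) / len(A | B)  # |{4, 5}| / |{0, ..., 7}| = 2/8 = 0.25

trials, matches = 100_000, 0
for _ in range(trials):
    perm = U[:]
    random.shuffle(perm)                       # a uniformly random permutation of U
    rank = {x: i for i, x in enumerate(perm)}  # position of each element under perm
    if min(A, key=rank.get) == min(B, key=rank.get):
        matches += 1

print(true_jaccard, matches / trials)  # the empirical frequency approaches J(A, B)
```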

Algorithm

Variant with many hash functions

The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of hmin(S) for these k functions.

To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which hmin(A) = hmin(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when hmin(A) = hmin(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore, their average is also an unbiased estimator, and by the standard deviation of sums of 0-1 random variables, its expected error is O(1/√k).[3]

Therefore, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to 0.05.
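
The following sketch illustrates this variant; the SHA-1-seeded hash family, the example sets, and k = 400 are illustrative assumptions rather than part of the original scheme:

```python
import hashlib

def make_hash(seed):
    """One member of a hash family indexed by seed; SHA-1 is an arbitrary illustrative
    choice standing in for the k independent hash functions the scheme assumes."""
    def h(x):
        digest = hashlib.sha1(f"{seed}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return h

def signature(s, hash_functions):
    """The k values hmin(S), one minimum per hash function."""
    return [min(h(x) for x in s) for h in hash_functions]

def estimate_jaccard(sig_a, sig_b):
    """y/k: the fraction of hash functions whose minima agree on the two sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

k = 400                                   # expected error about 1/sqrt(k) = 0.05
hashes = [make_hash(i) for i in range(k)]
A = {"apple", "banana", "cherry", "date"}
B = {"banana", "cherry", "date", "fig"}
print(estimate_jaccard(signature(A, hashes), signature(B, hashes)))  # true J(A,B) = 3/5
```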

Variant with a single hash function

It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function and using it to select multiple values from each set rather than selecting only a single minimum value per hash function. Let h be a hash function, and let k be a fixed integer. If S is any set of k or more values in the domain of h, define h(k)(S) to be the subset of the k members of S that have the smallest values of h. This subset h(k)(S) is used as a signature for the set S, and the similarity of any two sets is estimated by comparing their signatures.

Specifically, let A and B be any two sets. Then X = h(k)(h(k)(A) ∪ h(k)(B)) = h(k)(A ∪ B) is a set of k elements of A ∪ B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple random sample of A ∪ B. The subset Y = X ∩ h(k)(A) ∩ h(k)(B) is the set of members of X that belong to the intersection A ∩ B. Therefore, |Y|/k is an unbiased estimator of J(A,B). The difference between this estimator and the estimator produced by multiple hash functions is that X always has exactly k members, whereas the multiple hash functions may lead to a smaller number of sampled elements due to the possibility that two different hash functions may have the same minima. However, when k is small relative to the sizes of the sets, this difference is negligible.

By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching the performance of the multiple-hash-function scheme.
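
A minimal sketch of this single-hash (bottom-k) variant, assuming an arbitrary SHA-1-based h and illustrative sets; each signature is the set of the k smallest hash values, and the estimate is |Y|/k as described above:

```python
import hashlib
import heapq

def h(x):
    """The single fixed hash function; SHA-1 is an arbitrary illustrative choice."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def bottom_k(s, k):
    """h(k)(S): the k smallest hash values over the members of S."""
    return set(heapq.nsmallest(k, (h(x) for x in s)))

def estimate_jaccard(sig_a, sig_b, k):
    """|Y|/k, where X = h(k)(A ∪ B) is recovered from the two signatures
    and Y is the part of X lying in both signatures."""
    X = set(heapq.nsmallest(k, sig_a | sig_b))
    Y = X & sig_a & sig_b
    return len(Y) / k

k = 200
A = set(range(0, 1500))
B = set(range(500, 2000))   # |A ∩ B| = 1000, |A ∪ B| = 2000, so J(A, B) = 0.5
print(estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k))
```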

Time analysis

The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in time linear in the size of the set, so when many pairwise similarities need to be estimated this method can lead to substantial savings in running time compared to doing a full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(nk) time. The single-hash variant is generally faster, requiring O(n) time to maintain the queue of minimum hash values assuming n ≫ k.[1]

Incorporating weights

A variety of techniques to introduce weights into the computation of MinHashes have been developed. The simplest extends it to integer weights.[4] Extend the hash function h to accept both a set member and an integer, then generate multiple hashes for each item according to its weight. If item i occurs n times, generate the hashes h(i,1), h(i,2), …, h(i,n). Run the original algorithm on this expanded set of hashes. Doing so yields the weighted Jaccard index as the collision probability.
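
A minimal sketch of this integer-weight expansion (the item names, weights, and SHA-1-based hash are illustrative assumptions; any hash of (member, index) pairs works the same way):

```python
import hashlib

def h(member, copy_index):
    """The hash function extended to accept a set member and an integer; SHA-1 is an
    arbitrary illustrative choice."""
    digest = hashlib.sha1(f"{member}#{copy_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def expand(weighted):
    """Replace an item i of integer weight n by the hashes h(i,1), ..., h(i,n)."""
    return {h(i, j) for i, n in weighted.items() for j in range(1, n + 1)}

def weighted_jaccard(a, b):
    """Sum of per-item minimum weights divided by sum of per-item maximum weights."""
    keys = set(a) | set(b)
    return (sum(min(a.get(x, 0), b.get(x, 0)) for x in keys)
            / sum(max(a.get(x, 0), b.get(x, 0)) for x in keys))

A = {"x": 3, "y": 1}
B = {"x": 1, "y": 2, "z": 1}
EA, EB = expand(A), expand(B)
# The ordinary Jaccard index of the expanded sets equals the weighted Jaccard index of
# A and B, so any unweighted MinHash estimator applied to EA and EB estimates it.
print(weighted_jaccard(A, B), len(EA & EB) / len(EA | EB))  # both 1/3
```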

Further extensions that achieve this collision probability on real weights with better runtime have been developed, one for dense data,[5] and another for sparse data.[6]

Another family of extensions use exponentially distributed hashes. A uniformly random hash between 0 and 1 can be converted to follow an exponential distribution by CDF inversion. This method exploits the many beautiful properties of the minimum of a set of exponential variables.

This yields the probability Jaccard index as its collision probability.[7]

Min-wise independent permutations

In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared. But because there are n! different permutations, it would require Ω(n log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is "min-wise independent", meaning that for any subset of the domain, any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least lcm(1, 2, …, n) ≥ e^(n − o(n)) different permutations, and therefore that it needs Ω(n) bits to specify a single permutation, still infeasibly large.[2]

Practical min-wise independent hash functions

Because of the above impracticality, two variant notions of min-wise independence have been introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[8] Approximate min-wise independence has at most a fixed probability ε of varying from full independence.[9]

In 1999 Piotr Indyk proved[10] that any k-wise independent family of hash functions is also approximately min-wise independent for k large enough. In particular, there are constants c, c′ > 0 such that if k ≥ c log(1/ε), then

Pr[h(x) = min h(X)] = (1 ± ε) / |X|

for all sets X and elements x ∈ X. (Note, here 1 ± ε means the probability is at most a factor 1 + ε too big, and at most a factor 1 − ε too small.)

This guarantee is, among other things, sufficient to give the Jaccard bound required by the MinHash algorithm. That is, if A and B are sets, then

Pr[hmin(A) = hmin(B)] = J(A,B) ± ε.

Since k-wise independent hash functions can be specified using just O(k log n) bits, this approach is much more practical than using completely min-wise independent permutations.

Another practical family of hash functions that give approximate min-wise independence is Tabulation hashing.
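
For concreteness, here is a sketch of one such k-wise independent family, the classic random-polynomial construction; the prime modulus and the choice k = 4 below are illustrative, not prescribed by the sources above:

```python
import random

P = (1 << 61) - 1  # an illustrative Mersenne prime, larger than the universe size

def k_wise_hash(k, rng=random):
    """A random degree-(k-1) polynomial modulo a prime: the classical k-wise
    independent family, describable with only O(k log n) random bits."""
    coeffs = [rng.randrange(P) for _ in range(k)]
    def h(x):
        acc = 0
        for c in coeffs:          # Horner's rule for c_{k-1} x^{k-1} + ... + c_0 (mod P)
            acc = (acc * x + c) % P
        return acc
    return h

h = k_wise_hash(4)                        # approximately min-wise for suitable k
print(min(h(x) for x in {3, 7, 11, 42}))  # the MinHash value of a small set under h
```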

Applications

The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents.[1][2][11] Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.[12]

In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a column per attribute), they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those pairs to determine which of them actually meet the chosen co-occurrence threshold.[13]

The MinHash algorithm has been adapted for bioinformatics, where the problem of comparing genome sequences has a similar theoretical underpinning to that of comparing documents on the web. MinHash-based tools[14][15] allow rapid comparison of whole genome sequencing data with reference genomes (around 3 minutes to compare one genome with the 90,000 reference genomes in RefSeq), and are suitable for species identification and, to a limited degree, microbial sub-typing. There are also applications for metagenomics,[14] and MinHash-derived algorithms have been used for genome alignment and genome assembly.[16] Accurate average nucleotide identity (ANI) values can be generated very efficiently with MinHash-based algorithms.[17]

Other uses

The MinHash scheme may be seen as an instance of locality-sensitive hashing, a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search algorithms.[18] For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.[19]

Evaluation and benchmarks

A large-scale evaluation was conducted by Google in 2006[20] to compare the performance of the MinHash and SimHash[21] algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling[22] and using MinHash and LSH for Google News personalization.[23]

from Grokipedia
MinHash is a probabilistic algorithm for efficiently estimating the Jaccard similarity between large sets, such as those representing documents or data streams, by generating compact signatures through the application of random permutations or hash functions to identify minimum values. It serves as a core component of locality-sensitive hashing (LSH), enabling the rapid detection of similar items in massive datasets without exhaustive pairwise comparisons. Introduced by Andrei Broder and colleagues in 1997, MinHash was originally developed to address the challenge of identifying near-duplicate documents in web-scale corpora, such as the AltaVista search engine's index of over 30 million pages, where it successfully clustered millions of similar items using shingling techniques combined with min-wise independent permutations.

The method relies on the key probabilistic property that, for two sets $A$ and $B$, the probability that the minimum hash value (under a random permutation) is the same for both sets equals their Jaccard similarity $r(A, B) = \frac{|A \cap B|}{|A \cup B|}$, providing an unbiased estimator when multiple independent hash functions are used to form signatures. This estimation is both dimension-independent and scalable, reducing the signature size to a few hundred bytes per set while maintaining accuracy for similarities above a tunable threshold. In practice, MinHash signatures are constructed by applying $k$ independent hash functions to the elements of a set and selecting the minimum value for each, yielding a vector that approximates the set for similarity computations. For LSH applications, these signatures are partitioned into $b$ bands of $r$ rows each ($k = b \times r$), where pairs colliding in at least one band are flagged as candidate duplicates; the probability of detection follows the curve $1 - (1 - s^r)^b$, with $s$ as the similarity, allowing false positives and negatives to be balanced for specific thresholds like 0.5 or 0.8. The algorithm's efficiency stems from approximately min-wise independent or pseudorandom permutations, avoiding the need for true random permutations, and it has been extended to weighted variants for denser data representations.

Beyond document deduplication, MinHash has become foundational in diverse fields, including bioinformatics for genome sketching (e.g., the Mash tool for sequencing data comparison), recommendation systems for user-item similarity, and entity resolution in databases, where it handles sets of arbitrary size with linear-time preprocessing and subquadratic querying. Its influence persists in modern frameworks, underscoring its role in scalable similarity search amid growing data volumes.
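
As a small illustration of that detection curve, the snippet below evaluates $1 - (1 - s^r)^b$ for an illustrative choice of 128 hashes split into $b = 32$ bands of $r = 4$ rows (parameters not taken from any particular system):

```python
def detection_probability(s, b, r):
    """Probability that two sets of Jaccard similarity s collide in at least one of
    b bands of r rows each: 1 - (1 - s**r)**b."""
    return 1 - (1 - s**r) ** b

# 128 hashes split into b = 32 bands of r = 4 rows (illustrative parameters)
for s in (0.2, 0.5, 0.8):
    print(s, round(detection_probability(s, b=32, r=4), 3))
# prints roughly 0.05, 0.87, and 1.0: the curve rises sharply around the threshold
```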

Background Concepts

Jaccard Similarity

The Jaccard similarity between two finite sets $A$ and $B$ is defined as the cardinality of their intersection divided by the cardinality of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

This measure quantifies the degree of overlap between the sets, where a value of 1 indicates complete identity and values approaching 0 signify minimal shared elements. Introduced by botanist Paul Jaccard in 1901 to compare the similarity of plant species distributions across geographic regions, the metric provides a geometric interpretation as the proportion of common elements relative to the total distinct elements in the combined sets. In computational contexts, such as information retrieval and data mining, it serves as a foundational tool for assessing set-based resemblance without assuming vector weights or order.[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)

The Jaccard similarity exhibits key properties: it is symmetric, satisfying $J(A, B) = J(B, A)$; reflexive, with $J(A, A) = 1$; and bounded in the interval $[0, 1]$, where 0 denotes [disjoint sets](/page/Disjoint_sets).[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf) For illustration, consider sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$: the [intersection](/page/Intersection) $\{2, 3\}$ has size 2 and the union $\{1, 2, 3, 4\}$ has size 4, yielding $J(A, B) = 0.5$. In set-based applications, such as comparing binary feature representations of documents or items, Jaccard similarity is favored over vector-oriented measures like [cosine similarity](/page/Cosine_similarity) because it explicitly incorporates the union in the denominator, offering a direct gauge of relative overlap for unweighted, presence-absence [data](/page/Data).[](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)

### Locality-Sensitive Hashing

Locality-sensitive hashing (LSH) is a probabilistic framework for approximate [nearest neighbor search](/page/Nearest_neighbor_search) in high-dimensional spaces, consisting of a family of hash functions designed such that similar items are hashed to the same value (i.e., collide) with higher probability than dissimilar items.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) Formally, a family $\mathcal{H}$ of functions from a [universe](/page/Universe) $\mathcal{U}$ to a set of buckets is $(r, cr, p_1, p_2)$-sensitive with respect to a [distance](/page/Distance) metric $d$ if, for any two points $x, y \in \mathcal{U}$,

$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1 \quad \text{if } d(x, y) \leq r,$$
$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2 \quad \text{if } d(x, y) > cr,$$

where $c > 1$ and $p_1 > p_2$.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) This property enables efficient indexing by bucketing data points and querying only candidate neighbors in the same bucket, achieving sublinear query times while trading exactness for scalability.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) The LSH framework was introduced by Indyk and Motwani in 1998 to mitigate the curse of dimensionality in nearest neighbor problems, where traditional methods like exhaustive search or tree-based indexing fail in high dimensions due to exponential growth in volume and sparsity.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf)

To balance [precision and recall](/page/Precision_and_recall), amplification techniques refine the basic LSH family: the AND-construction concatenates $k$ independent hash functions into a single composite hash, reducing false positives by requiring collisions in all components (increasing specificity); conversely, the OR-construction applies $l$ independent LSH families in parallel, treating collisions in any as a match to boost recall by capturing more true positives.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) These methods allow tuning the tradeoff, with space and query time scaling as $O(n^{1 + \rho})$ and $O(n^\rho)$, respectively, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)} < 1$.[](https://graphics.stanford.edu/courses/cs468-06-fall/Papers/06%20indyk%20motwani%20-%20stoc98.pdf) In the context of Jaccard similarity, which measures set overlap as $|A \cap B| / |A \cup B|$, LSH is particularly valuable because sets are often represented as characteristic vectors in extremely high-dimensional spaces (e.g., one dimension per possible element in the universe), exacerbating the curse of dimensionality through sparsity and the infeasibility of comparing all pairs.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) [MinHash](/page/MinHash) provides an LSH instantiation tailored to this metric, enabling efficient estimation of similar sets without exhaustive computation.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

## Core Algorithm

### Multiple Hash Functions Variant

The multiple hash functions variant of MinHash estimates the Jaccard similarity between sets by generating compact signatures using several independent hash functions, providing an unbiased estimator with controlled variance through the number of functions employed.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf) Sets are represented as characteristic vectors in a universe $U$, where each set $S \subseteq U$ indicates presence or absence of elements. To compute signatures, $k$ independent universal hash functions $h_1, h_2, \dots, h_k : U \to [0, 1)$ (or a large integer range) are selected, approximating random permutations. For a set $S$, the signature $\sigma(S)$ is a $k$-dimensional vector where the $i$-th entry is $\sigma(S)_i = \min_{x \in S} h_i(x)$. This process simulates applying $k$ random permutations to the universe and recording the minimum value (or rank) for each permuted set.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The Jaccard similarity $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$ is estimated as the fraction of matching entries in the signatures: $\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^k \mathbf{1}[\sigma(A)_i = \sigma(B)_i]$, where $\mathbf{1}$ is the indicator function. This estimator is unbiased, with expectation $\mathbb{E}[\hat{J}(A, B)] = J(A, B)$.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) The probability that signatures match at a given position equals the Jaccard similarity. For a fixed hash function $h_i$ approximating a random permutation $\pi_i$, the minima $\min_{x \in A} h_i(x)$ and $\min_{x \in B} h_i(x)$ match if the element of $A \cup B$ with the smallest $h_i$-value (or permuted rank) lies in $A \cap B$.
Since $h_i$ randomizes the order uniformly, each element in the union is equally likely to be the minimum, yielding $\Pr[\sigma(A)_i = \sigma(B)_i] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$. The matches across the $k$ independent functions are identically distributed, enabling concentration bounds like Hoeffding's inequality for accuracy guarantees.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

The following pseudocode computes signatures for a collection of sets given $k$ hash functions:

```
function ComputeSignatures(sets S_1, ..., S_m, hash_functions h_1, ..., h_k):
    signatures = empty m x k matrix
    for j = 1 to m:              // for each set S_j
        for i = 1 to k:
            min_hash = infinity
            for x in S_j:
                min_hash = min(min_hash, h_i(x))
            signatures[j, i] = min_hash
    return signatures
```

This implementation assumes hash functions are pre-chosen and efficient to evaluate; in practice, universal hashing families like linear polynomials over finite fields are used.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

For illustration, consider universe $U = \{1, 2, 3, 4, 5\}$, sets $A = \{1, 3\}$, $B = \{1, 4\}$ with true Jaccard $J(A, B) = 1/3$, and $k = 2$ hash functions. Suppose $h_1(1) = 0.2$, $h_1(3) = 0.7$, $h_1(4) = 0.4$, so $\sigma(A)_1 = 0.2$, $\sigma(B)_1 = 0.2$ (match); and $h_2(1) = 0.5$, $h_2(3) = 0.3$, $h_2(4) = 0.8$, so $\sigma(A)_2 = 0.3$, $\sigma(B)_2 = 0.5$ (no match). The estimate is $\hat{J}(A, B) = 1/2$, compared with the true value $1/3 \approx 0.333$. Different hash realizations yield varying estimates, averaging to $1/3$.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) To achieve an $\varepsilon$-approximation (i.e., $|\hat{J} - J| < \varepsilon$ with high probability, say $1 - \delta$), select $k \approx \frac{1}{\varepsilon^2} \ln \frac{1}{\delta}$; common practical values are $k = 100$ to $500$ for $\varepsilon \approx 0.1$. Larger $k$ reduces variance but increases storage and computation proportionally.[](https://www.eecs.harvard.edu/~michaelm/postscripts/jcss2000.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)

### Single Hash Function Variant

The single hash function variant of MinHash generates signatures by applying a fixed hash function to the minimum elements identified under multiple random permutations of the universe, offering a theoretically grounded approach to estimating set similarities. Unlike the multiple hash functions variant, which relies on independent hash computations for each signature component, this method uses permutations to induce random orderings before hashing the resulting minima.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) Given a set $S \subseteq U$ where $U$ is the universe of elements, and $k$ independent random permutations $\pi_1, \pi_2, \dots, \pi_k$ of $U$, the signature $\sigma(S)$ is a vector of length $k$ whose $i$-th entry is

$$\sigma(S)_i = h\left( \arg\min_{x \in S} \pi_i(x) \right)$$

for $i = 1$ to $k$, where $h$ is a single fixed hash function (e.g., a 64-bit universal hash) applied to the element with the lowest rank under $\pi_i$.
The Jaccard similarity between two sets is estimated as the fraction of positions where their signatures match.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) Each permuted minimum behaves equivalently to a component from a random hash function in the multiple hash variant, as the permutation provides a uniform random ordering, ensuring the probability of matching signatures at any position equals the Jaccard similarity $ J(S, T) = \frac{|S \cap T|}{|S \cup T|} $.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) This approach yields a storage benefit by representing each set with just $ k $ fixed-size hash values in the signature, avoiding the need to retain full hash outputs for every element across multiple independent computations, which can be substantial for dense or large sets. Typical signature sizes range from 256 to 1024 bits (e.g., $ k = 64 $ to 128 with 64-bit hashes), enabling efficient indexing of millions of sets.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) In implementation, explicit permutations over large universes are impractical, so they are simulated via compositions of simpler hash functions; for instance, a base hash can map elements to pseudo-ranks, with $ h $ then applied to the identified minimum, often using rolling hashes like Rabin fingerprints for shingle-based sets.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) For example, consider sets $ A = \{a, b, c\} $ and $ B = \{b, c, d\} $ over universe $ U = \{a, b, c, d\} $, with Jaccard similarity 0.5. Under permutation $ \pi_1: a \to 3, b \to 1, c \to 2, d \to 4 $, the argmin for both $ A $ and $ B $ is $ b $ (lowest rank 1), so $ \sigma_1(A) = \sigma_1(B) = h(b) $. Under $ \pi_2: a \to 2, b \to 4, c \to 1, d \to 3 $, the argmin for both is $ c $ (rank 1), so $ \sigma_2(A) = \sigma_2(B) = h(c) $. 
Matching at both positions estimates similarity 1.0, while other permutations may mismatch to average toward 0.5 over many trials.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) This variant is suited for applications with very large universes (e.g., billions of possible shingles in web documents), where generating and storing signatures via multiple full independent hash passes over all elements would be prohibitively expensive in computation and memory.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) ### Complexity Analysis The time complexity of computing a MinHash signature for a single set of size $n$ (where $n$ is the number of elements in the set) using $k$ independent hash functions is $O(nk)$, as each of the $n$ elements must be hashed under all $k$ functions to determine the minimum value for each.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) This process scales linearly with the set size for sparse representations, making it efficient for large, high-dimensional datasets where elements are drawn from a potentially vast universe.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Pairwise similarity estimation between two signatures then requires $O(k)$ time, simply by comparing the fraction of matching entries in the $k$-dimensional signatures.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) The space complexity for storing a single MinHash signature is $O(k)$, consisting of the $k$ minimum hash values, which remains independent of the universe size in the single-hash function variant where a universal hash family approximates multiple permutations without explicit enumeration of the full universe.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) For a collection of $N$ sets, the total storage is $O(Nk)$, enabling compact sketches that drastically reduce memory needs compared to raw set representations—for instance, compressing sets of up to 200,000 elements into signatures of roughly 1,000 bytes when $k \approx 100$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) In the context of locality-sensitive hashing (LSH) for indexing, MinHash signatures are organized into bands of $r$ rows each, with $b$ bands such that $k = b \cdot r$; this structure yields an average-case query time of $O(1)$ for retrieving candidate similar sets, as the banding concentrates similar items into the same buckets with high probability, limiting the number of comparisons to a constant expected value independent of $N$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Preprocessing to build the LSH index takes $O(Nk)$ time overall, but subsequent queries probe only a fixed number of buckets on average.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) Key trade-offs arise with the choice of $k$: larger $k$ enhances estimation accuracy by reducing variance in the Jaccard similarity approximation, but it proportionally increases both storage and computation costs; Chernoff bounds quantify this, showing that the probability of significant error in the estimate decreases exponentially with $k$, e.g., bounding the deviation from the true similarity by $\delta$ with failure probability at most $2\exp(-k\delta^2/3)$.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf) These asymptotic behaviors position MinHash as suitable for sublinear sketching in streaming models, where fixed $k$ allows processing and storage in $O(k)$ space per update while approximating similarities without full data 
retention.[](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

## Theoretical Foundations

### Min-Wise Independent Permutations

A family of permutations $\mathcal{F} \subseteq S_n$ on the universe $[n] = \{1, \dots, n\}$ is defined as min-wise independent if, for every nonempty subset $X \subseteq [n]$ and every element $x \in X$,

$$\Pr_{\pi \sim \mathcal{F}} \left[ \pi(x) = \min \{ \pi(y) \mid y \in X \} \right] = \frac{1}{|X|}.$$

This condition ensures that, under a randomly selected permutation $\pi$ from $\mathcal{F}$, each element in $X$ has an equal probability of mapping to the smallest value among the images of $X$. The concept was introduced by Broder et al. in the context of estimating document resemblances for web duplicate detection, where random permutations model the relative ordering of document features to capture set overlaps accurately.[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) The formal definition and analysis were further developed in their subsequent work, establishing min-wise independence as a key probabilistic tool for similarity estimation.[](https://dl.acm.org/doi/10.1145/276698.276781)

Min-wise independent permutations are essential for the theoretical soundness of MinHash because they provide an unbiased mechanism for estimating the Jaccard similarity between sets, linking directly to universal hashing principles that minimize collision biases. In MinHash, hash functions are constructed to approximate these permutations, ensuring that the minimum hash value over a set behaves as if drawn from a truly random reordering. This independence property guarantees that the estimator for Jaccard similarity remains asymptotically unbiased, even with finite samples, by preserving the uniform selection of minima across set elements. Without min-wise independence, the collision probabilities could deviate systematically, leading to skewed similarity estimates in applications like large-scale data deduplication.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)

The formal proof that min-wise independence yields exact Jaccard similarity estimation proceeds as follows. For two sets $A$ and $B$, let $U = A \cup B$ and $I = A \cap B$. Under a min-wise independent permutation $\pi$, the element achieving the minimum value $\min \{ \pi(y) \mid y \in U \}$ is uniformly distributed over all elements in $U$. The MinHash values collide, $h(A) = h(B)$, if and only if this minimum occurs in $I$, which happens with probability $|I| / |U|$. Thus, the collision probability equals the Jaccard similarity $J(A, B) = |I| / |U|$. This equivalence holds precisely due to the min-wise property applied to the union set.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902)

While exact min-wise independence requires the property to hold for all set sizes, stronger k-wise independence conditions (where permutations are independent up to any k elements) can ensure exactness for small sets and approximate behavior for larger ones.
In practice, 2-wise (pairwise) independence often suffices for [MinHash](/page/MinHash)'s probabilistic guarantees, as it controls variance in collision estimates without needing full higher-order independence, though approximate k-wise constructions are used to achieve epsilon-close min-wise behavior efficiently.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902) [MinHash](/page/MinHash), grounded in this framework, functions as a locality-sensitive hashing scheme for Jaccard similarity, where the collision probability exactly matches the similarity measure, enabling provable performance in nearest-neighbor searches.[](https://www.sciencedirect.com/science/article/pii/S0022000099916902) ### Practical Constructions In practical implementations of MinHash, approximate min-wise independent hash functions are employed to simulate random permutations efficiently, as exact constructions are computationally prohibitive. The most common approach uses 2-universal (2U) linear hash functions defined as $ h_j(t) = (a_{1,j} + a_{2,j} t \mod p) \mod D $, where $ p $ is a large prime greater than the universe size, $ D $ is typically a power of 2 for bit-range control (e.g., $ 2^b $ for $ b $-bit outputs), and coefficients $ a_{1,j} $, $ a_{2,j} $ are chosen uniformly at random from $ [0, p-1] $.[](https://www.combinatorics.org/ojs/index.php/eljc/article/view/v7i1r26)[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) This family provides pairwise independence, ensuring low collision probabilities while remaining computationally lightweight, with each hash evaluation requiring only modular arithmetic. For improved approximation quality, especially in scenarios with sparse data or small sketch sizes, 4-universal (4U) polynomial hash functions extend the construction to higher-degree polynomials: $ h_j(t) = \left( \sum_{i=1}^4 a_{i,j} t^{i-1} \mod p \right) \mod D $, where four random coefficients are used.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) These offer greater randomness than 2U functions, reducing variance in Jaccard similarity estimates, though at a modest increase in computation per hash (still $ O(1) $ time). Empirical evaluations demonstrate that 4U performs comparably to or slightly better than 2U when using 200 or more hashes ($ k \geq 200 $) and $ b \geq 4 $ bits per hash value, achieving mean squared error (MSE) close to theoretical minima for sparse sets.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) To support 64-bit architectures without overflow issues, concatenated or combined hashes are often used, such as $ (h_1(x) \cdot M + h_2(x)) \mod 2^{64} $, where $ M = 2^{32} $ interleaves two 32-bit hashes into a single 64-bit value, enabling efficient processing on modern hardware.[](http://papers.neurips.cc/paper/7239-practical-hash-functions-for-similarity-estimation-and-dimensionality-reduction.pdf) This method maintains approximate independence while leveraging unsigned 64-bit integer operations for speed. 
Error bounds for these approximations show that the estimated Jaccard similarity converges to the true value with ratio approaching 1 with high probability when $k$ is large (e.g., 200–500), with relative errors under 1% for typical web-scale datasets.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) Such constructions are integrated into libraries like Apache Spark's MinHashLSH module, which employs the 2U linear family with random coefficients modulo a prime for scalable similarity search on distributed datasets.[](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/feature/MinHashLSHModel.html) Despite not achieving exact min-wise independence, these pseudorandom approximations prove empirically effective, often matching full permutation accuracy in practice while reducing preprocessing time by orders of magnitude (e.g., via GPU acceleration for batch computations).[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf) Limitations include potential accuracy degradation for very dense sets or small $D$, necessitating larger $k$ (up to 500) in machine learning applications compared to duplicate detection.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/a13-li.pdf)

## Extensions and Variants

### Weighted MinHash

Weighted MinHash extends the standard [MinHash](/page/MinHash) technique to handle sets where elements have associated nonnegative weights, allowing for more nuanced similarity estimation in scenarios where uniform treatment of elements is insufficient. In the unweighted case, [MinHash](/page/MinHash) assumes all elements contribute equally, but real-world data often involves varying importance, such as frequencies or priorities. The weighted Jaccard similarity, which generalizes the traditional [Jaccard index](/page/Jaccard_index), is defined as

$$\frac{\sum_x \min(w_A(x), w_B(x))}{\sum_x \max(w_A(x), w_B(x))},$$

where $w_A(x)$ and $w_B(x)$ are the weights of element $x$ in sets $A$ and $B$, respectively, and the sums are over all relevant elements. This metric captures the overlap weighted by the minimum contributions relative to the total maximum contributions, providing a better measure for applications like document comparison where term frequencies matter.[](http://www.cohenwang.org/edith/Surveys/minhash.pdf)

The core adaptation involves modifying the hashing process to bias sampling toward higher-weight elements, ensuring inclusion probabilities are proportional to weights. One approach uses thresholding, where a random threshold is applied to normalized weights to select elements probabilistically. A key probabilistic sampling method, consistent weighted sampling (CWS), generates samples $(k, y_k)$ for each element such that the probability is proportional to the weight $w(x)$, often by drawing ranks from a distribution dependent on the weight.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/ConsistentWeightedSampling2.pdf) In practice, the weighted minimum is computed as $\arg\min_x \frac{h(x)}{w(x)}$, where $h(x)$ is a random hash function; this effective rank favors elements with larger $w(x)$, as higher weights reduce the adjusted hash value, increasing the chance of selection. Multiple such hashes are typically computed to form a sketch, with the process repeated for stability. One early variant, proposed by Haveliwala et al.
in 2000, quantizes weights by multiplying by a large constant and applying standard MinHash to the resulting binary expansions.[](https://theory.stanford.edu/~taherh/papers/simsearch00.pdf) Later constructions, such as those using CWS, ensure that the probability of a sketch match between two weighted sets approximates their weighted Jaccard similarity, specifically $\Pr[\text{match}] \approx \sum_x \min(w_A(x), w_B(x)) / \sum_x \max(w_A(x), w_B(x))$. The mechanism adjusts sampling densities to handle weight disparities, making it suitable for scalable estimation without full data access.[](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/ConsistentWeightedSampling2.pdf) A representative example is document similarity, where elements are terms and weights are term frequencies (e.g., via tf-idf scoring); here, common high-frequency terms drive the similarity score, unlike unweighted bag-of-words models that ignore multiplicity. Such weighted sketches enable applications in text analysis beyond simple presence, including near-duplicate detection in large corpora and clustering documents with varying term importance, as seen in web search indexing.[](http://www.cohenwang.org/edith/Surveys/minhash.pdf) ### Advanced Adaptations One prominent adaptation of MinHash involves b-bit compression, where the full hash value (typically 64 bits) is truncated to only the lowest b bits, such as 5 bits, to create signatures that map into 2^b possible buckets.[](https://arxiv.org/abs/0910.3349) This reduction enables significant space savings by storing fewer bits per hash while still approximating Jaccard similarity through the proportion of matching b-bit values across multiple hash functions.[](https://arxiv.org/abs/0910.3349) The approach trades some estimation accuracy for efficiency, as the truncated hashes introduce additional variance in the similarity estimate.[](https://arxiv.org/abs/0910.3349) The probability of collision in b-bit MinHash adjusts based on the chosen b; for b=1, the effective Jaccard similarity threshold for reliable estimation is around 0.5, while larger b values (e.g., b=8) shift this threshold higher, improving precision at the cost of more storage.[](https://arxiv.org/abs/0910.3349) An unbiased estimator corrects for this bias, ensuring the expected value matches the true Jaccard similarity regardless of b.[](https://arxiv.org/abs/0910.3349) For handling ordered data like sequences, Ordinal MinHash (also known as Order MinHash or OMH) incorporates position-aware hashing by storing selected minhash values in ordered sublists, preserving relative order to better approximate edit distance rather than unordered set similarity.[](https://www.biorxiv.org/content/10.1101/534446v1) This variant refines standard MinHash by assigning hashes that account for positional relationships, making it suitable for applications in sequence alignment where order matters.[](https://www.biorxiv.org/content/10.1101/534446v1) In contrast to MinHash, which targets Jaccard similarity for sets, SimHash applies a similar locality-sensitive hashing principle but for cosine similarity on sparse binary vectors, such as document representations, by projecting onto random hyperplanes and signing bits rather than taking minima.[](https://arxiv.org/abs/1407.4416) A post-2010 adaptation extends MinHash to graph similarity by applying it to neighborhood sets around vertices, where sketches capture the k-nearest neighbors based on random rankings to estimate structural similarities like distances 
or centralities efficiently.[](https://www.semanticscholar.org/paper/58225ffe658820dcfbea7423828311179edf01b4) Post-2020 developments include ProbMinHash (2020), which extends [MinHash](/page/MinHash) to estimate the probability Jaccard similarity for continuous probability measures and distributions, and C-MinHash (2022), a variant using circulant permutations to reduce the number of hash functions to one while maintaining accuracy for Jaccard estimation in large-scale applications.[](https://arxiv.org/abs/2008.08916)[](https://proceedings.mlr.press/v162/li22m.html) These adaptations generally achieve compression ratios of 10-20x in space (e.g., b=1 versus 64 bits) but introduce error rates that increase with lower b or sparser neighborhoods, with relative errors often under 5% for Jaccard similarities above 0.5 in controlled settings.[](https://arxiv.org/abs/0910.3349) ## Applications and Uses ### Similarity Search in Sets MinHash serves as the foundation for locality-sensitive hashing (LSH) in efficient similarity search over large collections of sets, enabling approximate nearest neighbor queries based on Jaccard similarity without exhaustive pairwise comparisons. In this paradigm, MinHash signatures—compact representations of sets derived from multiple hash functions—are generated for each set in the database. These signatures are then partitioned into *b* bands, each consisting of *r* consecutive rows, where the total signature length *k* satisfies *k = b \times r*. For a query set, its signature is similarly banded, and candidate pairs are identified as those sharing at least one identical band hash value; this hashing of bands into buckets ensures that sets with high Jaccard similarity are likely to collide in the same bucket with probability approximately $1 - (1 - s^r)^b$, where *s* is the Jaccard similarity. This approach originated in the context of detecting near-duplicate web pages, where documents were represented as sets of shingles and clustered using MinHash sketches to process millions of pages scalably.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) To tune the LSH parameters for a desired Jaccard similarity threshold *t*, the row length *r* is typically set to approximately $(1/t) \log(1/t)$ to ensure that the probability of band collision transitions sharply around *t*, while the number of bands *b* is chosen as *k / r* to balance false positives and negatives, with *k* fixed based on available storage (e.g., 100–256 hashes for practical accuracy). This configuration amplifies the collision probability for similar sets (Jaccard > *t*) to near 1 while keeping it low for dissimilar ones, allowing subquadratic query times even on billion-scale datasets; for instance, an LSH ensemble using MinHash indexed 262 million web domains for [containment](/page/Containment) similarity search, achieving efficient retrieval through multiple hash tables. 
For sparse sets—common in applications like document term vectors or [genome](/page/Genome) k-mers—MinHash integrates seamlessly with [inverted index](/page/Inverted_index) structures, where band hashes serve as keys mapping to lists of set identifiers, enabling fast posting list intersections during candidate generation without materializing full signatures.[](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf)[](http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf) Compared to exact Jaccard similarity search, which requires *O(n^2)* time for *n* sets, MinHash-LSH delivers substantial improvements in efficiency while maintaining high [recall](/page/The_Recall); optimizations in large-scale deployments have achieved over 100× [speedup](/page/Speedup) in end-to-end pipelines for 90% [recall](/page/The_Recall) on seismic datasets represented as binary sets, demonstrating its viability for real-time queries over massive corpora.[](https://www.bailis.org/papers/quakes-vldb2018.pdf) ### Duplicate Detection and Clustering MinHash facilitates duplicate detection by generating compact signatures for data items represented as sets, enabling efficient estimation of Jaccard similarities between pairs to identify near-duplicates when the similarity exceeds a predefined threshold, such as 0.8 for textual content.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) This process begins with preprocessing the data into set representations, followed by signature computation using multiple hash functions to approximate the min-wise independence property, which ensures that the probability of matching hash minima correlates with set overlap.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) For clustering, MinHash signatures support all-pairs similarity estimation, where pairs surpassing the threshold are linked to form clusters using union-find (disjoint-set union) structures, efficiently merging components in linear time relative to the number of candidate pairs.[](https://arxiv.org/pdf/1406.1143) This approach scales to large repositories by first generating candidate pairs via [locality-sensitive hashing](/page/Locality-sensitive_hashing) on signatures, then applying union-find to group duplicates into equivalence classes, as demonstrated in processing millions of [Wikipedia](/page/Wikipedia) sentences to yield over 1 million clusters.[](https://arxiv.org/pdf/1406.1143) In textual duplicate detection, shingling preprocesses documents into sets of k-grams—contiguous sequences of k tokens or characters—to capture near-duplicates arising from minor edits, insertions, or deletions, transforming the problem into set similarity estimation amenable to MinHash.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) For instance, with k=10 for word-level shingling, a document's shingle set allows MinHash to estimate resemblance as the [Jaccard index](/page/Jaccard_index) of these sets, enabling detection of plagiarized or mirrored web pages.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) A seminal application occurred in 1997 with AltaVista's [web crawler](/page/Web_crawler), which processed 30 million documents using MinHash on 10-shingles to generate compact document signatures by selecting minima from 40-bit shingle fingerprints, identifying 5.3 million documents as exact duplicates in 2.1 million clusters and forming 3.6 million clusters from 12.3 million documents in a 150 GB 
corpus.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) This reduced storage redundancy and improved search quality by eliminating syntactic duplicates during indexing.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) For streaming environments with dynamic sets, such as evolving data streams, MinHash signatures support online updates by incrementally adjusting minima upon set insertions or deletions, maintaining approximation quality without full recomputation, as in fully-dynamic models handling subset additions in sublinear time.[](https://arxiv.org/pdf/2407.21614) Hierarchical clustering leverages MinHash-estimated distances—derived from signature mismatches as proxies for Jaccard dissimilarity—to construct dendrograms, where agglomerative methods successively merge clusters based on average or complete linkage, revealing nested group structures in large collections like malware binaries.[](https://dl.acm.org/doi/pdf/10.1145/3704268.3748684) To mitigate false positives from probabilistic estimation, a secondary verification step compares the full content of candidate pairs flagged by MinHash, confirming duplicates via exact Jaccard computation or [edit distance](/page/Edit_distance), which filters noise while preserving high [recall](/page/The_Recall) in the initial screening.[](https://cadmo.ethz.ch/education/lectures/FS18/SDBS/papers/broder.pdf) ### Other Domains In recommendation systems, MinHash has been employed to enhance [collaborative filtering](/page/Collaborative_filtering) by estimating similarities between user-item interaction sets, enabling efficient online collaborative filtering for millions of daily users. For instance, in large-scale news recommendation, MinHash clustering groups users and articles based on overlapping interests represented as sets. In [genomics](/page/Genomics), MinHash serves as a sketching technique for comparing microbial strains and metagenomes through [k-mer](/page/K-mer) representations, particularly post-2010 advancements in high-throughput sequencing. Tools like Mash apply MinHash to estimate pairwise [mutation](/page/Mutation) distances between genomes, supporting rapid [containment](/page/Containment) searches in large databases without full alignment, which is crucial for metagenomic [classification](/page/Classification) and outbreak detection. For network analysis, neighborhood-based MinHash variants compute graph similarities by hashing local node neighborhoods as sets, with works from 2015 onward enabling efficient estimation of [distance](/page/Distance) metrics in massive graphs. This approach approximates neighborhood sizes and structural similarities, aiding in community detection and [link prediction](/page/Link_prediction) without exhaustive pairwise computations.[](https://ieeexplore.ieee.org/document/8257969) In [big data](/page/Big_data) environments, MinHash integrates with frameworks like [Apache Spark](/page/Apache_Spark) for distributed [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH), allowing scalable similarity joins over sparse datasets. 
Implementations in Spark MLlib treat non-zero vector elements as binary sets for Jaccard distance estimation, as demonstrated in healthcare [sentiment analysis](/page/Sentiment_analysis) pipelines processing vast [COVID-19](/page/COVID-19) data volumes.[](https://ieeexplore.ieee.org/document/9799934) Emerging applications in the [2020s](/page/2020s) leverage MinHash for privacy-preserving similarity computations in [federated learning](/page/Federated_learning), where min-wise independent permutations cluster clients based on local data sketches without sharing raw information. This supports clustered federated setups, maintaining [differential privacy](/page/Differential_privacy) while improving model convergence on heterogeneous datasets.[](https://www.diva-portal.org/smash/get/diva2:1712025/FULLTEXT01.pdf) Recent extensions as of 2024 include fully-dynamic MinHash sketches for [streaming data](/page/Streaming_data) with recovery mechanisms and MinCloud for trusted [malware](/page/Malware) detection in [Linux](/page/Linux) environments using transferable sketches.[](https://arxiv.org/pdf/2407.21614)[](https://www.sciencedirect.com/science/article/abs/pii/S2214212624002096) A key limitation of MinHash is its design for sparse, binary set representations under Jaccard similarity; for dense real-valued vectors requiring [cosine similarity](/page/Cosine_similarity), SimHash provides a more suitable alternative.[](http://proceedings.mlr.press/v33/shrivastava14.pdf) ## Performance Evaluation ### Benchmarking Methods Evaluating MinHash implementations typically involves assessing both the accuracy of similarity estimation and the efficiency of processing large-scale data. Key metrics for accuracy include [precision and recall](/page/Precision_and_recall) in candidate pair generation for [locality-sensitive hashing](/page/Locality-sensitive_hashing) (LSH) applications, which measure the proportion of retrieved pairs that are true positives and the fraction of true similar pairs identified, respectively. For direct Jaccard similarity estimation, [mean absolute error](/page/Mean_absolute_error) (MAE) or variance of the estimate relative to the true [Jaccard index](/page/Jaccard_index) is commonly used, with error bounds derived from the number of hash functions applied. Throughput is quantified as the number of pairs processed per second or average query time, often benchmarked on commodity hardware to evaluate [scalability](/page/Scalability).[](http://ekzhu.com/datasketch/lsh.html)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) Datasets for benchmarking span synthetic and real-world collections to test robustness across distributions. Synthetic datasets often consist of uniform random sets or those following power-law distributions, with 1,000 samples over 100,000 features, allowing controlled variation in set cardinality and overlap to isolate estimation bias and variance. Real datasets include shingled web documents from large crawls, such as over 30 million Alta Vista pages totaling 150 GB, used to evaluate clustering at 50% resemblance thresholds; Wikipedia articles for near-duplicate detection; the 20 Newsgroups corpus (average 193 3-shingles per document); and Reddit post sets for pairwise similarity tasks. 
These enable assessment of performance on sparse, high-dimensional data typical of text processing.[](https://arxiv.org/pdf/1811.04633)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf)[](http://ekzhu.com/datasketch/lsh.html) Testing protocols standardize parameter tuning and [performance measurement](/page/Performance_measurement) to ensure [reproducibility](/page/Reproducibility). Implementations vary the number of hash permutations $k$ (or total functions $num_perm$), bands $b$, and rows per band $r$ (with $k = b \times r$), plotting [receiver operating characteristic](/page/Receiver_operating_characteristic) (ROC) curves of precision against [recall](/page/The_Recall) to balance false positives and negatives. Experiments repeat 10 times to compute means and standard deviations, often using open-source tools like the Python datasketch library for [MinHash](/page/MinHash) and LSH indexing, which supports [Redis](/page/Redis) for scalable storage and includes benchmark scripts for query latency. Common pitfalls include sensitivity to [universe](/page/Universe) size, where insufficiently random hash functions lead to biased permutations, and practical hash collisions that inflate [error](/page/Error) in finite implementations, necessitating 40-bit or higher hash values for >99.9% confidence bounds.[](http://ekzhu.com/datasketch/lsh.html)[](https://arxiv.org/pdf/1811.04633)[](https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf) In the [2020s](/page/2020s), benchmarks have increasingly focused on [hardware acceleration](/page/Hardware_acceleration), particularly GPU implementations for MinHash LSH in [dataset](/page/Data_set) deduplication. For instance, the FED framework achieves up to 107.2x speedup over CPU baselines on [datasets](/page/Data_set) like RedPajama-1T (1.2T [tokens](/page/The_Tokens)), processing 30 billion [tokens](/page/The_Tokens) in 111.7 seconds across 16 GPUs (for a subset), with the full [dataset](/page/Data_set) completed in 6 hours, using a Jaccard similarity threshold of 0.8, highlighting gains in MinHash generation (260x faster) and candidate filtering. Such evaluations emphasize end-to-end throughput on real corpora like C4 and RealNews, underscoring MinHash's viability for trillion-token-scale tasks.[](https://arxiv.org/pdf/2501.01046) ### Comparative Analysis MinHash offers a probabilistic [approximation](/page/Approximation) to exact Jaccard similarity computation, which involves directly calculating set intersections and unions for all pairs, resulting in O(N^2 \cdot s) [time complexity](/page/Time_complexity) in the worst case for N sets of [average](/page/Average) [size](/page/Size) s, rendering it infeasible for large-scale datasets. In contrast, MinHash reduces query times to sublinear complexity through fixed-size sketches, enabling efficient similarity estimation at the cost of [approximation error](/page/Approximation_error).[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications) Compared to other [locality-sensitive hashing](/page/Locality-sensitive_hashing) techniques, MinHash is specifically tailored for Jaccard similarity on sets, outperforming methods like [HyperLogLog](/page/HyperLogLog), which focuses solely on [cardinality](/page/Cardinality) estimation without direct pairwise similarity support. 
Similarly, SimHash targets [Hamming distance](/page/Hamming_distance) for document similarity under cosine metrics, making it less suitable for pure set overlap tasks where MinHash excels, as demonstrated in near-duplicate detection benchmarks. MinHash also provides better estimates for Jaccard than Euclidean-based LSH variants, which are optimized for vector distances rather than set [containment](/page/Containment).[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)[](https://dl.acm.org/doi/10.1145/2063576.2063749) Empirical evaluations highlight MinHash's effectiveness; in a 2006 study on 1.6 billion web pages, MinHash/LSH achieved 86% precision for cross-site near-duplicates, while SimHash achieved 90%; a combined approach reached 79% precision with 79% relative recall, with low false positive rates through banding. More recent genomic benchmarks, such as those using Mash, report high correlation (RMSE 0.00274) with exact average nucleotide identity on 54,000 genomes, with false positives controlled via [p-value](/page/P-value) thresholds below 10^{-10}.[](https://www3.cs.stonybrook.edu/~cse692/papers/henzinger_sigir06.pdf)[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) Key trade-offs include MinHash's space requirements, which scale with the number of hash functions needed for accuracy (e.g., 100-1000 for 1-5% error), potentially exceeding alternatives like the Count-Min sketch for streaming environments where frequency estimation suffices over direct similarity. Count-Min offers lower space for approximate counts in dynamic data but lacks MinHash's unbiased Jaccard estimator.[](https://www.researchgate.net/publication/349391865_A_Survey_on_Locality_Sensitive_Hashing_Algorithms_and_their_Applications)[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1809-x) In modern contexts, particularly for sparse sets like k-mers in [genomics](/page/Genomics), MinHash outperforms [deep learning](/page/Deep_learning) embeddings in efficiency, enabling 10x faster processing than alignment-based [exact](/page/Ex'Act) methods or neural approaches on gigabase-scale [data](/page/Data), though embeddings may capture denser semantics at higher computational cost. Post-2015 parallel implementations, including GPU-accelerated MinHash, achieve up to 25x speedups over parallel CPU baselines for sketching large collections.[](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)[](https://www.slideshare.net/slideshow/gpu-acceleration-of-set-similarity-joins/63164204)