Locality-sensitive hashing
In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability.[1] The number of buckets is much smaller than the universe of possible input items.[1] Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).[2][3]
Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion.[4][5]
Definitions
A finite family F of functions h : M → S is defined to be an LSH family[1][6][7] for
- a metric space (M, d),
- a threshold r > 0,
- an approximation factor c > 1,
- and probabilities p_1 > p_2
if it satisfies the following condition. For any two points a, b ∈ M and a hash function h chosen uniformly at random from F:
- If d(a, b) ≤ r, then h(a) = h(b) (i.e., a and b collide) with probability at least p_1,
- If d(a, b) ≥ cr, then h(a) = h(b) with probability at most p_2.
Such a family F is called (r, cr, p_1, p_2)-sensitive.
LSH with respect to a similarity measure
Alternatively,[8] it is possible to define an LSH family on a universe of items U endowed with a similarity function φ : U × U → [0, 1]. In this setting, an LSH scheme is a family of hash functions H coupled with a probability distribution D over H such that a function h ∈ H chosen according to D satisfies Pr[h(a) = h(b)] = φ(a, b) for each a, b ∈ U.
Amplification
Given a (d_1, d_2, p_1, p_2)-sensitive family F, we can construct new families G by either the AND-construction or OR-construction of F.[1]
To create an AND-construction, we define a new family G of hash functions g, where each function g is constructed from k random functions h_1, ..., h_k from F. We then say that for a hash function g ∈ G, g(x) = g(y) if and only if h_i(x) = h_i(y) for all i = 1, 2, ..., k. Since the members of F are independently chosen for any g ∈ G, G is a (d_1, d_2, p_1^k, p_2^k)-sensitive family.
To create an OR-construction, we define a new family G of hash functions g, where each function g is constructed from k random functions h_1, ..., h_k from F. We then say that for a hash function g ∈ G, g(x) = g(y) if and only if h_i(x) = h_i(y) for one or more values of i. Since the members of F are independently chosen for any g ∈ G, G is a (d_1, d_2, 1 − (1 − p_1)^k, 1 − (1 − p_2)^k)-sensitive family.
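To make the two constructions concrete, the following Python sketch (illustrative only; the helper names are choices made here, and the base family used is single-bit sampling for Hamming distance) builds an AND-construction from k base functions and tests an OR-construction over L such groups:

import random

def make_base_hash(d):
    # One function from a simple base family: sample a random coordinate
    # of a d-dimensional binary vector (bit sampling for Hamming distance).
    i = random.randrange(d)
    return lambda x: x[i]

def make_and_hash(d, k):
    # AND-construction: concatenate k base functions; two points collide
    # only if they agree on every component (collision probability p^k).
    hs = [make_base_hash(d) for _ in range(k)]
    return lambda x: tuple(h(x) for h in hs)

def or_collision(x, y, gs):
    # OR-construction over L AND-hashes: report a collision if the points
    # agree under at least one g (collision probability 1 - (1 - p^k)^L).
    return any(g(x) == g(y) for g in gs)

random.seed(0)
d, k, L = 16, 4, 8
gs = [make_and_hash(d, k) for _ in range(L)]
x = [1, 0] * 8
y = list(x)
y[0] ^= 1                        # y differs from x in a single bit
print(or_collision(x, y, gs))    # usually True for such similar vectors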
Applications
LSH has been applied to several problem domains, including:
- Near-duplicate detection[9]
- Hierarchical clustering[10][11]
- Genome-wide association study[12]
- Image similarity identification
- Gene expression similarity identification[citation needed]
- Audio similarity identification
- Nearest neighbor search
- Audio fingerprint[13]
- Digital video fingerprinting[14]
- Shared memory organization in parallel computing[4][5]
- Physical data organization in database management systems[15]
- Training fully connected neural networks[16][17]
- Computer security[18]
- Machine learning[19]
Methods
Bit sampling for Hamming distance
One of the easiest ways to construct an LSH family is by bit sampling.[7] This approach works for the Hamming distance over d-dimensional vectors {0, 1}^d. Here, the family F of hash functions is simply the family of all the projections of points onto one of the d coordinates, i.e., F = {h : {0, 1}^d → {0, 1} | h(x) = x_i for some i ∈ {1, ..., d}}, where x_i is the i-th coordinate of x. A random function h from F simply selects a random bit from the input point. This family has the following parameters: P_1 = 1 − R/d, P_2 = 1 − cR/d. That is, any two vectors x, y with Hamming distance at most R collide under a random h with probability at least P_1. Any x, y with Hamming distance at least cR collide with probability at most P_2.
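As a quick check of these parameters, the following small Python experiment (illustrative only) estimates the collision probability of a random bit-sampling hash for a pair of vectors at a given Hamming distance and compares it with 1 − R/d:

import random

def estimate_collision_probability(x, y, trials=100_000):
    # Empirically estimate Pr[h(x) = h(y)] for bit-sampling LSH,
    # where h(x) = x[i] for a uniformly random coordinate i.
    d = len(x)
    hits = 0
    for _ in range(trials):
        i = random.randrange(d)
        if x[i] == y[i]:
            hits += 1
    return hits / trials

random.seed(0)
d, R = 128, 16
x = [random.randint(0, 1) for _ in range(d)]
y = list(x)
for i in random.sample(range(d), R):       # flip R bits, so the Hamming distance is R
    y[i] ^= 1

print(estimate_collision_probability(x, y))   # close to 1 - R/d
print(1 - R / d)                               # = 0.875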
Min-wise independent permutations
Suppose U is composed of subsets of some ground set of enumerable items S and the similarity function of interest is the Jaccard index J. If π is a permutation on the indices of S, for A ⊆ S let h(A) = min_{a ∈ A} π(a). Each possible choice of π defines a single hash function h mapping input sets to elements of S.
Define the function family H to be the set of all such functions and let D be the uniform distribution. Given two sets A, B ⊆ S, the event that h(A) = h(B) corresponds exactly to the event that the minimizer of π over A ∪ B lies inside A ∩ B. As h was chosen uniformly at random, Pr[h(A) = h(B)] = J(A, B), and (H, D) define an LSH scheme for the Jaccard index.
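A minimal MinHash sketch in Python (illustrative; it simulates random permutations by shuffling the ground set, which is only feasible for a small S, and the function names are choices made here):

import random

def make_minhash_tables(ground_set, num_hashes, seed=0):
    # Sample num_hashes random permutations of the ground set; each is stored
    # as a rank table, so that h_pi(A) is the element of A with smallest rank.
    rng = random.Random(seed)
    tables = []
    for _ in range(num_hashes):
        order = list(ground_set)
        rng.shuffle(order)
        tables.append({item: rank for rank, item in enumerate(order)})
    return tables

def estimated_jaccard(A, B, tables):
    # Fraction of permutations on which min_pi(A) == min_pi(B);
    # an unbiased estimate of the Jaccard index J(A, B).
    agree = sum(min(A, key=t.__getitem__) == min(B, key=t.__getitem__) for t in tables)
    return agree / len(tables)

tables = make_minhash_tables(range(1000), num_hashes=200)
A = set(range(0, 600))
B = set(range(300, 900))
print(estimated_jaccard(A, B, tables))   # close to J(A, B) = 300/900 ≈ 0.33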
Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent": a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen π. It has been established that a min-wise independent family of permutations is at least of size lcm(1, 2, ..., n) ≥ e^(n − o(n)),[20] and that this bound is tight.[21]
Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[22] Approximate min-wise independence differs from the property by at most a fixed ε.[23]
Open source methods
Nilsimsa Hash
Nilsimsa is a locality-sensitive hashing algorithm used in anti-spam efforts.[24] The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. The paper suggests that Nilsimsa satisfies three requirements:
- The digest identifying each message should not vary significantly for changes that can be produced automatically.
- The encoding must be robust against intentional attacks.
- The encoding should support an extremely low risk of false positives.
Testing performed in the paper on a range of file types identified the Nilsimsa hash as having a significantly higher false positive rate when compared to other similarity digest schemes such as TLSH, Ssdeep and Sdhash.[25]
TLSH
TLSH is a locality-sensitive hashing algorithm designed for a range of security and digital forensic applications.[18] The goal of TLSH is to generate hash digests for messages such that low distances between digests indicate that their corresponding messages are likely to be similar.
An implementation of TLSH is available as open-source software.[26]
Random projection
The random projection method of LSH due to Moses Charikar[8] called SimHash (also sometimes called arccos[27]) uses an approximation of the cosine distance between vectors. The technique was used to approximate the NP-complete max-cut problem.[8]
The basic idea of this technique is to choose a random hyperplane (defined by a normal unit vector r) at the outset and use the hyperplane to hash input vectors.
Given an input vector v and a hyperplane defined by r, we let h(v) = sgn(v · r). That is, h(v) = ±1 depending on which side of the hyperplane v lies. This way, each possible choice of a random hyperplane r can be interpreted as a hash function h.
For two vectors u, v with angle θ(u, v) between them, it can be shown that
Pr[h(u) = h(v)] = 1 − θ(u, v)/π.
Since the ratio between θ(u, v)/π and 1 − cos(θ(u, v)) is at least 0.439 when θ(u, v) ∈ [0, π],[8][28] the probability of two vectors being on different sides of the random hyperplane is approximately proportional to the cosine distance between them.
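A minimal sketch of the random-hyperplane hash in Python using NumPy (illustrative only; in practice many such one-bit hashes are concatenated into longer signatures, as described under Amplification):

import numpy as np

def random_hyperplane_hash(dim, rng):
    # Draw a random normal vector r; h(v) = sign(<v, r>) is a one-bit hash.
    r = rng.standard_normal(dim)
    return lambda v: 1 if np.dot(v, r) >= 0 else -1

rng = np.random.default_rng(0)
dim, num_hashes = 64, 2000
hashes = [random_hyperplane_hash(dim, rng) for _ in range(num_hashes)]

u = rng.standard_normal(dim)
v = u + 0.3 * rng.standard_normal(dim)       # a nearby vector (small angle)

collision_rate = np.mean([h(u) == h(v) for h in hashes])
theta = np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(collision_rate, 1 - theta / np.pi)     # the two values should be close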
Stable distributions
The hash function[29] h_{a,b}(v) = ⌊(a · v + b)/r⌋ maps a d-dimensional vector v onto the set of integers. Each hash function in the family is indexed by a choice of random a and b, where a is a d-dimensional vector with entries chosen independently from a stable distribution and b is a real number chosen uniformly from the range [0, r]. For a fixed a, b the hash function h_{a,b} is given by h_{a,b}(v) = ⌊(a · v + b)/r⌋.
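A short illustrative implementation of this family for the Euclidean (2-stable, Gaussian) case in Python with NumPy; the width r used below is an arbitrary example value:

import numpy as np

def make_pstable_hash(dim, r, rng):
    # h_{a,b}(v) = floor((a · v + b) / r) with a ~ N(0, I) (2-stable, i.e. L2 distance)
    # and b drawn uniformly from [0, r].
    a = rng.standard_normal(dim)
    b = rng.uniform(0.0, r)
    return lambda v: int(np.floor((np.dot(a, v) + b) / r))

rng = np.random.default_rng(0)
dim, r = 32, 4.0
h = make_pstable_hash(dim, r, rng)

v = rng.standard_normal(dim)
near = v + 0.1 * rng.standard_normal(dim)     # small Euclidean perturbation
far = v + 10.0 * rng.standard_normal(dim)     # large Euclidean perturbation
print(h(v) == h(near), h(v) == h(far))        # typically True, False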
Other construction methods for hash functions have been proposed to better fit the data.[30] In particular, k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.
Semantic hashing
Semantic hashing is a technique that attempts to map input items to addresses such that closer inputs have higher semantic similarity.[31] The hashcodes are found via training of an artificial neural network or graphical model.[citation needed]
Algorithm for nearest neighbor search
One of the main applications of LSH is to provide a method for efficient approximate nearest neighbor search algorithms. Consider an LSH family F. The algorithm has two main parameters: the width parameter k and the number of hash tables L.
In the first step, we define a new family G of hash functions g, where each function g is obtained by concatenating k functions h_1, ..., h_k from F, i.e., g(p) = [h_1(p), ..., h_k(p)]. In other words, a random hash function g is obtained by concatenating k randomly chosen hash functions from F. The algorithm then constructs L hash tables, each corresponding to a different randomly chosen hash function g.
In the preprocessing step we hash all n d-dimensional points from the data set S into each of the L hash tables. Given that the resulting hash tables have only n non-zero entries, one can reduce the amount of memory used per each hash table to O(n) using standard hash functions.
Given a query point q, the algorithm iterates over the L hash functions g. For each g considered, it retrieves the data points that are hashed into the same bucket as q. The process is stopped as soon as a point within distance cR from q is found.
Given the parameters k and L, the algorithm has the following performance guarantees:
- preprocessing time: O(nLkt), where t is the time to evaluate a function h ∈ F on an input point p;
- space: O(nL), plus the space for storing data points;
- query time: O(L(kt + dnP_2^k));
- the algorithm succeeds in finding a point within distance cR from q (if there exists a point within distance R) with probability at least 1 − (1 − P_1^k)^L;
For a fixed approximation ratio c = 1 + ε and probabilities P_1 and P_2, one can set k = ⌈log n / log(1/P_2)⌉ and L = ⌈P_1^(−k)⌉ = O(n^ρ / P_1), where ρ = log P_1 / log P_2. Then one obtains the following performance guarantees:
- preprocessing time: O(n^(1+ρ) kt / P_1);
- space: O(n^(1+ρ) / P_1), plus the space for storing data points;
- query time: O(n^ρ (kt + d) / P_1);
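As a concrete illustration of this parameter setting (a sketch only, not part of the original analysis), the following Python computes ρ, k, and L for example values of P_1, P_2, and n:

import math

def lsh_parameters(n, p1, p2):
    # Compute rho, the concatenation width k, and the number of tables L
    # for the standard LSH nearest-neighbor construction.
    rho = math.log(p1) / math.log(p2)              # equals log(1/p1) / log(1/p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))
    L = math.ceil(p1 ** (-k))                      # roughly n^rho / p1
    return rho, k, L

# Example: bit sampling with d = 128, R = 16, c = 2,
# so P_1 = 1 - R/d = 0.875 and P_2 = 1 - cR/d = 0.75.
rho, k, L = lsh_parameters(n=1_000_000, p1=0.875, p2=0.75)
print(rho, k, L)   # rho ≈ 0.46, k = 49, L in the hundreds (on the order of n^rho)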
Finding nearest neighbor without fixed dimensionality
To generalize the above algorithm without the radius R being fixed, we can take the algorithm and do a sort of binary search over R. It has been shown[32] that there is a data structure for the approximate nearest neighbor problem with the following performance guarantees:
- space: ;
- query time: ;
- the algorithm succeeds in finding the nearest neighbor with probability at least ;
Improvements
When t is large, it is possible to reduce the hashing time from O(n^ρ). This was shown in[33] and[34], which gave
- query time: ;
- space: ;
It is also sometimes the case that the factor 1/P_1 can be very large. This happens, for example, with Jaccard similarity data, where even the most similar neighbor often has a quite low Jaccard similarity with the query. In[35] it was shown how to reduce the query time to (not including hashing costs) and similarly the space usage.
See also
- Bloom filter – Data structure for approximate set membership
- Curse of dimensionality – Difficulties arising when analyzing data with many aspects ("dimensions")
- Feature hashing – Vectorizing features using a hash function
- Fourier-related transforms
- Geohash – Public domain geocoding invented in 2008
- Multilinear subspace learning – Approach to dimensionality reduction
- Principal component analysis – Method of data analysis
- Random indexing[36]
- Rolling hash – Type of hash function
- Singular value decomposition – Matrix decomposition
- Sparse distributed memory – Mathematical model of memory
- Wavelet compression – Mathematical technique used in data compression and analysis
- Locality of reference – Tendency of a processor to access nearby memory locations in space or time
References
- ^ a b c d Rajaraman, A.; Ullman, J. (2010). "Mining of Massive Datasets, Ch. 3".
- ^ Zhao, Kang; Lu, Hongtao; Mei, Jincheng (2014). Locality Preserving Hashing. AAAI Conference on Artificial Intelligence. Vol. 28. pp. 2874–2880.
- ^ Tsai, Yi-Hsuan; Yang, Ming-Hsuan (October 2014). "Locality preserving hashing". 2014 IEEE International Conference on Image Processing (ICIP). pp. 2988–2992. doi:10.1109/ICIP.2014.7025604. ISBN 978-1-4799-5751-4. ISSN 1522-4880. S2CID 8024458.
- ^ a b Chin, Andrew (1991). Complexity Issues in General Purpose Parallel Computing (DPhil). University of Oxford. pp. 87–95.
- ^ a b Chin, Andrew (1994). "Locality-Preserving Hash Functions for General Purpose Parallel Computation" (PDF). Algorithmica. 12 (2–3): 170–181. doi:10.1007/BF01185209. S2CID 18108051.
- ^ Gionis, A.; Indyk, P.; Motwani, R. (1999). "Similarity Search in High Dimensions via Hashing". Proceedings of the 25th Very Large Database (VLDB) Conference.
- ^ a b Indyk, Piotr.; Motwani, Rajeev. (1998). "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality.". Proceedings of 30th Symposium on Theory of Computing.
- ^ a b c d Charikar, Moses S. (2002). "Similarity Estimation Techniques from Rounding Algorithms". Proceedings of the 34th Annual ACM Symposium on Theory of Computing. pp. 380–388. CiteSeerX 10.1.1.147.4064. doi:10.1145/509907.509965. ISBN 1-58113-495-9.
- ^ Das, Abhinandan S.; et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th international conference on World Wide Web, pp. 271–280, doi:10.1145/1242572.1242610, ISBN 9781595936547, S2CID 207163129.
- ^ Koga, Hisashi; Tetsuo Ishibashi; Toshinori Watanabe (2007), "Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing", Knowledge and Information Systems, 12 (1): 25–53, doi:10.1007/s10115-006-0027-5, S2CID 4613827.
- ^ Cochez, Michael; Mou, Hao (2015), "Twister Tries", Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (PDF), pp. 505–517, doi:10.1145/2723372.2751521, ISBN 9781450327589, S2CID 14414777.
- ^ Brinza, Dumitru; et al. (2010), "RAPID detection of gene–gene interactions in genome-wide association studies", Bioinformatics, 26 (22): 2856–2862, doi:10.1093/bioinformatics/btq529, PMC 3493125, PMID 20871107
- ^ dejavu - Audio fingerprinting and recognition in Python, 2018-12-19
- ^ A Simple Introduction to Locality Sensitive Hashing (LSH), 2025-03-27
- ^ Aluç, Güneş; Özsu, M. Tamer; Daudjee, Khuzaima (2018), "Building self-clustering RDF databases using Tunable-LSH", The VLDB Journal, 28 (2): 173–195, doi:10.1007/s00778-018-0530-9, S2CID 53695535
- ^ Chen, Beidi; Medini, Tharun; Farwell, James; Gobriel, Sameh; Tai, Charlie; Shrivastava, Anshumali (2020-02-29). "SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems". arXiv:1903.03129 [cs.DC].
- ^ Chen, Beidi; Liu, Zichang; Peng, Binghui; Xu, Zhaozhuo; Li, Jonathan Lingjie; Dao, Tri; Song, Zhao; Shrivastava, Anshumali; Re, Christopher (2021), "MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training", International Conference on Learning Representation
- ^ a b Oliver, Jonathan; Cheng, Chun; Chen, Yanggui (2013). "TLSH -- A Locality Sensitive Hash". 2013 Fourth Cybercrime and Trustworthy Computing Workshop. pp. 7–13. doi:10.1109/CTC.2013.9. ISBN 978-1-4799-3076-0.
- ^ Fanaee-T, Hadi (2024), Natural Learning, arXiv:2404.05903
- ^ Broder, A.Z.; Charikar, M.; Frieze, A.M.; Mitzenmacher, M. (1998). "Min-wise independent permutations". Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. pp. 327–336. CiteSeerX 10.1.1.409.9220. doi:10.1145/276698.276781. Retrieved 2007-11-14.
- ^ Takei, Y.; Itoh, T.; Shinozaki, T. "An optimal construction of exactly min-wise independent permutations". Technical Report COMP98-62, IEICE, 1998.
- ^ Matoušek, J.; Stojakovic, M. (2002). "On Restricted Min-Wise Independence of Permutations". Preprint. Retrieved 2007-11-14.
- ^ Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D. (2000). "Low discrepancy sets yield approximate min-wise independent permutation families". Information Processing Letters. 73 (1–2): 29–32. CiteSeerX 10.1.1.20.8264. doi:10.1016/S0020-0190(99)00163-5. Retrieved 2007-11-14.
- ^ Damiani; et al. (2004). "An Open Digest-based Technique for Spam Detection" (PDF). Retrieved 2013-09-01.
- ^ Oliver; et al. (2013). "TLSH - A Locality Sensitive Hash". 4th Cybercrime and Trustworthy Computing Workshop. Retrieved 2015-06-04.
- ^ "TLSH". GitHub. Retrieved 2014-04-10.
- ^ Alexandr Andoni; Indyk, P. (2008). "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions". Communications of the ACM. 51 (1): 117–122. CiteSeerX 10.1.1.226.6905. doi:10.1145/1327452.1327494. S2CID 6468963.
- ^ Goemans, Michel X.; Williamson, David P. (1995). "Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming". Journal of the ACM. 42 (6). Association for Computing Machinery (ACM): 1115–1145. doi:10.1145/227683.227684. ISSN 0004-5411. S2CID 15794408.
- ^ Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. (2004). "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions". Proceedings of the Symposium on Computational Geometry.
- ^ Pauleve, L.; Jegou, H.; Amsaleg, L. (2010). "Locality sensitive hashing: A comparison of hash function types and querying mechanisms". Pattern Recognition Letters. 31 (11): 1348–1358. Bibcode:2010PaReL..31.1348P. doi:10.1016/j.patrec.2010.04.004. S2CID 2666044.
- ^ Salakhutdinov, Ruslan; Hinton, Geoffrey (2008). "Semantic hashing". International Journal of Approximate Reasoning. 50 (7): 969–978. doi:10.1016/j.ijar.2008.11.006.
- ^ Har-Peled, Sariel; Indyk, Piotr; Motwani, Rajeev (2012). "Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality" (PDF). Theory of Computing. 8 (Special Issue in Honor of Rajeev Motwani): 321–350. doi:10.4086/toc.2012.v008a014. Retrieved 23 May 2025.
- ^ Dahlgaard, Søren, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. "Fast similarity sketching." 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2017.
- ^ Christiani, Tobias. "Fast locality-sensitive hashing frameworks for approximate near neighbor search." International Conference on Similarity Search and Applications. Springer, Cham, 2019.
- ^ Ahle, Thomas Dybdahl. "On the Problem of p_1^(-1) in Locality-Sensitive Hashing." International Conference on Similarity Search and Applications. Springer, Cham, 2020.
- ^ Gorman, James, and James R. Curran. "Scaling distributional similarity to large corpora." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
Further reading
- Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 0-12-369446-9
- Indyk, Piotr; Motwani, Rajeev; Raghavan, Prabhakar; Vempala, Santosh (1997). "Locality-preserving hashing in multidimensional spaces". Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. STOC '97. pp. 618–625. CiteSeerX 10.1.1.50.4927. doi:10.1145/258533.258656. ISBN 978-0-89791-888-6. S2CID 15693787.
- Chin, Andrew (1994). "Locality-preserving hash functions for general purpose parallel computation" (PDF). Algorithmica. 12 (2–3): 170–181. doi:10.1007/BF01185209. S2CID 18108051.
External links
- Alex Andoni's LSH homepage
- LSHKIT: A C++ Locality Sensitive Hashing Library
- A Python Locality Sensitive Hashing library that optionally supports persistence via redis
- Caltech Large Scale Image Search Toolbox: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.
- Slash: A C++ LSH library, implementing Spherical LSH by Terasawa, K., Tanaka, Y
- LSHBOX: An Open Source C++ Toolbox of Locality-Sensitive Hashing for Large Scale Image Retrieval, Also Support Python and MATLAB.
- SRS: A C++ Implementation of An In-memory, Space-efficient Approximate Nearest Neighbor Query Processing Algorithm based on p-stable Random Projection
- TLSH open source on Github
- JavaScript port of TLSH (Trend Micro Locality Sensitive Hashing) bundled as node.js module
- Java port of TLSH (Trend Micro Locality Sensitive Hashing) bundled as maven package
Locality-sensitive hashing
Introduction
Overview and Motivation
Locality-sensitive hashing (LSH) is a probabilistic technique designed to solve the approximate nearest neighbor (ANN) search problem in high-dimensional spaces. Given a database of n points in d-dimensional space and a query point q, the goal is to find a point p in the database such that the distance d(p, q) is within a factor of (1 + ε) of the true nearest neighbor's distance, where ε > 0 is a small approximation parameter. This problem arises frequently in applications like image retrieval, recommendation systems, and genomic data analysis, where datasets can involve thousands of dimensions and millions of points. However, the curse of dimensionality, where distances between points become increasingly uniform as d grows, renders traditional spatial partitioning methods, such as kd-trees, ineffective beyond low dimensions (typically d > 10), leading to query times that approach exhaustive linear scans.[8]

Exact k-nearest neighbor (k-NN) search exacerbates these issues, requiring O(dn) time in the worst case for brute-force computation or suboptimal trade-offs in preprocessing and query efficiency with structures like ball trees. For instance, methods achieving sublinear query times often demand exponential space in d, making them impractical for real-world high-dimensional data where n can exceed 10^6. These limitations motivate approximate approaches that sacrifice perfect accuracy for substantial speedups, particularly in scenarios where users tolerate small errors, such as content-based search in multimedia databases.[8]

LSH addresses these challenges by employing families of hash functions that preserve locality: similar items are hashed to the same bucket with high probability, while dissimilar items collide with low probability. During preprocessing, data points are hashed into multiple buckets using concatenated hash functions, creating a compact index. For a query, only points in the same or nearby buckets are examined, enabling sublinear query times, often O(n^ρ) where 0 < ρ < 1 depends on the approximation factor, while providing theoretical guarantees on recall. This hashing-based approximation avoids exhaustive searches, scaling efficiently to large, high-dimensional datasets.[8]

LSH emerged in the late 1990s as a response to growing needs in data mining tasks, with foundational work by Indyk and Motwani introducing the framework for Euclidean spaces in 1998, followed by extensions to other metrics like Jaccard similarity by Gionis, Indyk, and Motwani in 1999. These developments were driven by the explosion of high-dimensional data in fields like text and image processing, where efficient similarity search became essential.[9][10]

Historical Development
Locality-sensitive hashing (LSH) originated in 1998 with the seminal work by Piotr Indyk and Rajeev Motwani, who introduced the framework to address approximate nearest neighbor search in high-dimensional spaces under ℓ_p norms, aiming to mitigate the curse of dimensionality. Their approach defined families of hash functions that preserve locality, enabling efficient indexing with probabilistic guarantees on collision probabilities for nearby points. This foundational paper, presented at the Symposium on Theory of Computing (STOC), laid the groundwork for subsequent developments in dimensionality reduction and similarity search.

Early extensions built on this by adapting LSH to other metrics. For instance, MinHash, originally proposed by Andrei Broder and colleagues in 1997 for estimating Jaccard similarity in web duplicate detection, was adapted into LSH frameworks in 1999 for scalable similarity search in large-scale databases, including near-duplicate document identification. These adaptations marked LSH's transition from theory to practical tools in information retrieval.

Practical implementations gained traction in the mid-2000s, with one of the first notable applications in image search emerging in 2006 through a time-space efficient LSH method that reduced space requirements by a factor of five compared to basic LSH while achieving comparable query times.[11] The integration of LSH with machine learning accelerated in the late 2000s; in 2009, Ruslan Salakhutdinov and Geoffrey Hinton introduced semantic hashing, a deep learning-based variant that mapped documents to binary codes for fast retrieval while capturing semantic similarities.[12] By the mid-2010s, LSH saw widespread adoption in industry, exemplified by Facebook's release of the Faiss library in 2017, which incorporated LSH alongside other indexing methods for efficient similarity search over billions of dense vectors in multimedia applications.[13]

From the 2020s onward, LSH has evolved to handle complex structures like graph embeddings. Recent advances include 2021 work on network hashing for node embeddings, enabling scalable representation learning in graphs via locality-preserving sketches.[14] In 2025, extensions to heterogeneous graphs leveraged LSH to capture higher-order simplicial complexes, improving embedding efficiency for diverse node types in large-scale networks.[15] These developments underscore LSH's ongoing relevance in bridging theoretical efficiency with real-world scalability across domains.

Mathematical Foundations
Locality-Sensitive Hash Functions
Locality-sensitive hashing builds upon the concept of universal hashing, which provides a family of hash functions where the collision probability for distinct inputs is uniformly bounded, typically at most 1/m for range size m, regardless of the input structure or metric. This uniformity ensures average-case efficiency but lacks sensitivity to geometric or metric properties of the data. In contrast, locality-sensitive hashing designs families that are aware of an underlying distance or similarity measure, increasing the likelihood of collisions for nearby points while decreasing it for distant ones.

Formally, a family H of hash functions from a domain X to a universe U is (R, cR, p_1, p_2)-sensitive with respect to a distance function d on X (where c > 1 and p_1 > p_2) if, for any points x, y ∈ X and h drawn at random from H, the collision probability satisfies Pr[h(x) = h(y)] ≥ p_1 when d(x, y) ≤ R, and Pr[h(x) = h(y)] ≤ p_2 when d(x, y) ≥ cR. Here, R represents a query radius defining "close" points, cR acts as a separation threshold with c controlling the gap between close and far regions, and p_1, p_2 denote the respective high and low collision probabilities that enable probabilistic guarantees on hashing behavior. This sensitivity property allows the family to preserve locality in the hashed space, facilitating efficient approximate nearest neighbor searches in high-dimensional settings.

The definition extends naturally to similarity measures, where a higher similarity corresponds to a higher collision probability; for instance, one can define (s_1, s_2, p_1, p_2)-sensitivity such that Pr[h(x) = h(y)] ≥ p_1 if the similarity s(x, y) ≥ s_1, and Pr[h(x) = h(y)] ≤ p_2 if s(x, y) ≤ s_2, with s_1 > s_2 and p_1 > p_2. This generalization applies to kernel-induced similarities or non-metric spaces, maintaining the core idea of metric-aware collisions. In practice, the collision probability for points at distance d(x, y) is often modeled as Pr[h(x) = h(y)] = f(d(x, y)), where f is a monotonically decreasing function tailored to the metric, ensuring the probability diminishes as distance increases. Such base families may have modest gaps between p_1 and p_2, but amplification techniques can enhance this separation for improved performance, as detailed in subsequent sections.

Amplification Techniques
Locality-sensitive hashing (LSH) families often start with modest gaps between collision probabilities for similar and dissimilar points, necessitating amplification techniques to enhance their utility in approximate nearest neighbor search. These methods combine multiple independent hash functions to sharpen the distinction, improving either precision (fewer false positives) or recall (higher chance of detecting true neighbors), at the cost of additional computational resources.[8]

The AND-construction amplifies selectivity by concatenating k independent hash functions from a base (R, cR, p_1, p_2)-sensitive family H, forming a new function g(x) = (h_1(x), ..., h_k(x)) with each h_i drawn from H. This reduces the probability of collision for similar points to p_1^k and for dissimilar points to p_2^k; since p_2 < p_1, bucket sizes shrink and false positives diminish as k increases. Typically, k is chosen to achieve a desired amplification factor, balancing the trade-off between query efficiency and accuracy.[8]

Complementing the AND-construction, the OR-construction boosts recall by employing L independent amplified hash functions g_1, ..., g_L, storing points in L hash tables and querying all corresponding buckets. The probability that a similar point collides in at least one table becomes 1 − (1 − p_1^k)^L, while for dissimilar points it is 1 − (1 − p_2^k)^L; this increases the likelihood of retrieving true neighbors, though it may introduce more candidates requiring post-processing. The parameter L is often set relative to the dataset size n, on the order of n^ρ with ρ = log(1/p_1) / log(1/p_2), to ensure high recall with sublinear query time.[8]

Tuning k and L involves optimizing for the application's precision-recall needs, with the expected number of dissimilar points per bucket approximating n·p_2^k, guiding the selection to keep buckets small (e.g., constant size) while maintaining recall above a threshold like 90%. In practice, these parameters are adjusted iteratively based on empirical validation, as larger k enhances precision but risks missing similar points if over-tuned.[8]

For MinHash, a specific LSH construction for Jaccard similarity, amplification relies on multiple random permutations to generate a signature of s minhash values, estimating similarity as the fraction of matching components across independent hashes. Each permutation π_i yields a minhash min_{a ∈ A} π_i(a) for a set A, and using s such values reduces estimation variance, with the collision probability of each component equaling the Jaccard index; typically, s = 100 to 200 provides good accuracy with low probability of significant errors, for sketches of 300-800 bytes. This multi-permutation approach directly amplifies the base family's selectivity for set resemblance tasks.[16]

These amplification techniques incur trade-offs in space and time: the AND-construction multiplies hash storage by k, while the OR-construction scales space and query time by L, often leading to overall O(nL) space usage. Despite these costs, they enable efficient scaling for high-dimensional data, with query times sublinear in n.[8]

LSH Constructions for Specific Metrics
Hamming Distance Methods
Hamming distance between two binary vectors of length d is defined as the number of positions at which they differ, corresponding to the minimum number of bit flips required to transform one vector into the other.[8]

A fundamental locality-sensitive hashing construction for the Hamming metric employs bit sampling, where a hash function randomly selects coordinates from the d-dimensional binary vector and outputs the substring formed by those bits. For two vectors with Hamming distance r, the collision probability under a single-bit hash is 1 − r/d. To amplify the gap between collision probabilities for near and far points, k bits are sampled independently, yielding (1 − r/d)^k; larger k reduces the probability for distant points more sharply than for close ones.[8]

This basic scheme can be enhanced with limited independence to reduce the amount of randomness required while preserving the desired sensitivity. Specifically, 4-wise independent hashing families, constructed via random polynomials of degree at most 3 evaluated over GF(2), enable exact achievement of the target collision probabilities p_1 and p_2 for an (r, cr, p_1, p_2)-sensitive family without needing fully random bit selections.[17] Such constructions maintain efficiency for high-dimensional spaces by approximating the full random sampling behavior with lower seed length. Amplification techniques like AND and OR constructions, which combine multiple hashes, are applied to further tune the probabilities, as detailed in the general framework.[8]

For sparse binary data, such as document representations where vectors indicate term presence (1 for present, 0 otherwise), bit sampling via random projections onto a small number of coordinates serves as an effective hashing method. This approach focuses on differing bits in sparse regions, enabling efficient similarity estimation under Hamming distance despite high dimensionality.[8]

Jaccard Similarity Methods
Jaccard similarity measures the overlap between two sets A and B as J(A, B) = |A ∩ B| / |A ∪ B|, providing a measure between 0 and 1 that is particularly useful for sparse, variable-length data such as documents represented as sets of shingles or features.[16]

Locality-sensitive hashing for Jaccard similarity relies on MinHash, a family of hash functions derived from random permutations of the universe of elements. For a random permutation π, the hash function h_π(A) = min_{a ∈ A} π(a) ensures that the collision probability Pr[h_π(A) = h_π(B)] = J(A, B), exactly matching the Jaccard similarity.[16][18] This property arises because the minimum element under π is equally likely to be any element of A ∪ B, making MinHash a locality-sensitive scheme where similar sets collide with probability proportional to their similarity.[18]

Exact random permutations require exponential space and time, so approximate constructions use small families of hash functions that simulate min-wise independence. A k-minwise independent family ensures that for any set of size at most k, the minimum behaves as under a true random permutation, up to a small error; these can be generated efficiently using universal hash functions or hash tables to map elements to pseudo-random ranks.[19] For instance, a suitably sized family achieves ε-approximate min-wise independence, enabling practical computation of MinHash values without full permutations.[19]

To amplify the locality sensitivity for nearest neighbor search, multiple independent MinHash functions are combined into signatures. Specifically, r such hashes form an AND-constructed signature (all must match for collision), which reduces false positives, while b such signatures are OR-constructed across bands to preserve recall for similar sets.[18] Jaccard similarity can be estimated from these hashes as the average collision rate over multiple MinHashes: Ĵ(A, B) = (1/m) Σ_{i=1}^m 1[h_{π_i}(A) = h_{π_i}(B)], where m is the number of hashes and 1[·] is the indicator function; this estimator is unbiased and concentrates around the true J(A, B) by the law of large numbers.[16]

Cosine and Angular Distance Methods
Locality-sensitive hashing methods for cosine similarity and angular distance are particularly suited for high-dimensional real-valued vectors, such as those arising in text representations, image features, or embedding spaces, where the goal is to preserve angular relationships between points.[20] Cosine similarity between two vectors u and v is defined as cos θ = (u · v) / (‖u‖ ‖v‖), where u · v denotes the inner product and ‖·‖ the Euclidean norm; this measure is invariant to vector magnitudes and focuses on directional alignment.[20] The corresponding angular distance is θ(u, v) = arccos((u · v) / (‖u‖ ‖v‖)), ranging from 0 (identical direction) to π (opposite directions), providing a metric that emphasizes geometric separation on the unit hypersphere.[20]

A foundational approach for these metrics employs random hyperplane projections, where the hash function is defined as h_r(v) = sign(r · v), with r drawn from a Gaussian distribution (or equivalently, a uniform distribution on the unit sphere).[20] This binary hash indicates on which side of the random hyperplane defined by the normal vector r the input vector lies. For unit-normalized vectors u and v, the collision probability is Pr[h_r(u) = h_r(v)] = 1 − θ(u, v)/π, where θ(u, v) is the angular separation; this probability decreases monotonically with increasing angle, ensuring that nearby points (small θ) collide with higher likelihood than distant ones.[20] The method originates from semidefinite programming rounding techniques and was adapted for LSH to enable efficient similarity estimation.

To handle non-normalized vectors, the input vectors are first projected onto the unit sphere by dividing each by its Euclidean norm, v / ‖v‖, ensuring the hashing operates solely on directions rather than magnitudes.[20] In general, Pr[h_r(u) = h_r(v)] = 1 − arccos(cos(u, v))/π, directly linking the collision rate to the cosine similarity.[20] This projection-based family is simple to implement and scales well to thousands of dimensions, making it ideal for applications like document similarity search.[20]

A prominent variant is SimHash, which concatenates multiple independent random projections (typically 32 to 128 bits) to form a fixed-length binary signature for each vector.[20] The Hamming distance between two SimHash signatures then approximates the angular distance, with the expected number of differing bits proportional to θ(u, v); this allows for fast bitwise operations in indexing and querying.[20] SimHash has been widely adopted for near-duplicate detection in large-scale text corpora, demonstrating empirical effectiveness in reducing query times while maintaining recall.[20]
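Extending the one-bit hyperplane hash shown earlier in the article, the following Python sketch (illustrative; the bit width and helper names are arbitrary choices) packs many hyperplane bits into a SimHash-style signature and recovers an angle estimate from the Hamming distance between two signatures:

import numpy as np

def simhash_signature(v, planes):
    # planes: (num_bits, dim) Gaussian matrix; one sign bit per hyperplane.
    return (planes @ v >= 0).astype(np.uint8)

def estimated_angle(sig_a, sig_b):
    # The expected fraction of differing bits is theta / pi, so rescaling
    # the normalized Hamming distance gives an estimate of the angle theta.
    return np.mean(sig_a != sig_b) * np.pi

rng = np.random.default_rng(0)
dim, num_bits = 64, 256
planes = rng.standard_normal((num_bits, dim))

u = rng.standard_normal(dim)
v = u + 0.5 * rng.standard_normal(dim)
true_angle = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(estimated_angle(simhash_signature(u, planes), simhash_signature(v, planes)))
print(true_angle)   # the two printed values should be close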
L_p Norm Methods
Locality-sensitive hashing (LSH) methods for ℓ_p norms, where 0 < p ≤ 2, leverage p-stable distributions to construct hash functions that preserve distances in Euclidean and more general metric spaces. The ℓ_p distance between two vectors x and y is defined as ‖x − y‖_p = (Σ_i |x_i − y_i|^p)^(1/p). These methods are particularly effective for approximate nearest neighbor search in high-dimensional data under such norms, generalizing earlier techniques like random projections, which apply to unnormalized vectors here unlike their use in angular similarity.[21]

A distribution is p-stable if, for independent identically distributed random variables X_1, ..., X_d drawn from it, the linear combination Σ_i a_i X_i has the same distribution as (Σ_i |a_i|^p)^(1/p) X, where X is drawn from the same distribution, for any real coefficients a_i. Such distributions exist for p ∈ (0, 2]; for p = 2, the normal (Gaussian) distribution is 2-stable, while for p = 1, the Cauchy distribution is 1-stable. In the LSH scheme, a random vector a is drawn with i.i.d. entries from a symmetric p-stable distribution, and an offset b is chosen uniformly from [0, w], where w is the bucket width parameter. The hash function is then h_{a,b}(x) = ⌊(a · x + b)/w⌋, which projects the input onto a random direction and quantizes it into buckets.[21]

The collision probability for two points at distance u = ‖x − y‖_p under this hash function is p(u) = ∫_0^w (1/u) f_p(t/u) (1 − t/w) dt, where f_p is the density of the absolute value of the p-stable distribution, ensuring the probability decreases monotonically as u increases. Parameters are tuned such that points within distance R collide with probability at least p_1, while points at distance cR (with c > 1) collide with probability at most p_2, enabling amplification via concatenation and multiple tables. For the specific case of p = 2 (the Euclidean norm), with Gaussian entries in a, the collision probability has a closed form in terms of the Gaussian density and decays rapidly with distance, supporting efficient nearest neighbor guarantees.[21]
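The collision probability integral above can be evaluated numerically; the following Python sketch (illustrative, for the Gaussian, p = 2 case) compares the formula against a Monte Carlo estimate:

import numpy as np

def collision_probability(u, w, num=10_000):
    # Numerically evaluate p(u) = integral_0^w (1/u) f(t/u) (1 - t/w) dt,
    # where f is the density of |N(0, 1)| (the 2-stable / Euclidean case).
    t = np.linspace(0.0, w, num)
    f = 2.0 * np.exp(-((t / u) ** 2) / 2) / np.sqrt(2 * np.pi)
    integrand = (1.0 / u) * f * (1 - t / w)
    dt = t[1] - t[0]
    return float(np.sum((integrand[:-1] + integrand[1:]) / 2) * dt)   # trapezoid rule

def monte_carlo_estimate(u, w, dim=64, trials=20_000, seed=0):
    # Empirical Pr[h(x) = h(y)] for two points at Euclidean distance u.
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    y = np.zeros(dim)
    y[0] = u                                  # any pair at distance u will do
    hits = 0
    for _ in range(trials):
        a = rng.standard_normal(dim)
        b = rng.uniform(0.0, w)
        hits += np.floor((a @ x + b) / w) == np.floor((a @ y + b) / w)
    return hits / trials

u, w = 2.0, 4.0
print(collision_probability(u, w), monte_carlo_estimate(u, w))   # both near 0.61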
Practical Implementations and Tools
Open-Source Algorithms
One prominent open-source library for locality-sensitive hashing (LSH) is Faiss, developed by Facebook AI Research in 2017, which provides GPU-accelerated implementations for LSH-based similarity search using L2 distance and inner product metrics.[13] Faiss supports inverted file (IVF) indexing combined with LSH techniques, enabling efficient approximate nearest neighbor (ANN) searches on large-scale datasets by partitioning vectors into clusters and using hash tables for candidate selection.[22] This library is written in C++ with Python and other language bindings, making it suitable for high-performance applications in machine learning pipelines.[23]

Another widely used tool is Annoy (Approximate Nearest Neighbors Oh Yeah), released by Spotify in 2013, which employs random projection trees incorporating LSH principles to perform fast ANN searches in high-dimensional spaces.[24] Implemented in C++ with Python bindings, Annoy builds multiple randomized binary trees where each tree uses random hyperplanes for hashing, allowing for efficient querying by aggregating results across trees to balance speed and accuracy. It excels in memory-efficient indexing for static datasets, supporting metrics like Euclidean and angular distance without requiring GPU acceleration.[25] In 2023, Spotify introduced Voyager as a successor to Annoy, offering enhanced performance for nearest-neighbor search while maintaining LSH-based approaches.[24]

For set-based similarity measures such as Jaccard index, the Datasketch library, a Python package with its first stable release in 2017, implements MinHashLSH with support for b-bit MinHash signatures to reduce signature size while preserving estimation accuracy.[26] It includes weighted MinHash variants for handling unequal element importance in sets and enables efficient similarity joins and lookups via hash buckets.[27] Datasketch also integrates with Redis for scalable, persistent indexing in production environments.[28]

LSH implementations from these libraries are often integrated into broader ecosystems for enhanced usability. For instance, earlier versions of scikit-learn (deprecated in 0.19 and removed in 0.21) included LSHForest in the sklearn.neighbors module, providing a tree-based LSH structure for approximate nearest neighbor searches on dense vectors.[29] Similarly, Apache Spark's MLlib offers distributed LSH models like BucketedRandomProjectionLSH and MinHashLSH, allowing parallel processing of LSH indexing and queries across clusters for big data scenarios.[30]

These open-source tools emphasize space efficiency, commonly using 32-bit integer hashes to store signatures compactly, which supports scalability to datasets with billions of points while maintaining low query latencies on commodity hardware.[25] For example, Faiss can index and search over 1 billion vectors in seconds on multi-GPU setups, demonstrating practical viability for real-world deployment.[13]
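As a usage illustration (a minimal sketch assuming the faiss package is installed; the 64-bit code size and the synthetic data are arbitrary examples chosen here, not recommendations from the library):

import numpy as np
import faiss

d, nb, nq = 128, 10_000, 5                 # dimension, database size, query count
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")
xq = rng.standard_normal((nq, d)).astype("float32")

index = faiss.IndexLSH(d, 64)              # 64-bit binary signature per vector
index.add(xb)                              # index the database vectors
distances, ids = index.search(xq, 10)      # approximate 10 nearest neighbors per query
print(ids[0])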
Specialized Hash Functions
Nilsimsa, introduced in 2001, is a 256-bit locality-sensitive hash designed primarily for detecting near-duplicate text in anti-spam applications. It processes input text using overlapping n-grams, typically trigrams, which are rotated and subjected to bit-wise operations before being accumulated into 256 buckets via a hashing function. The final 32-byte digest is derived by selecting bits from these accumulators based on their values, enabling efficient comparison of similar documents through Hamming distance on the digests. This construction tunes collision probabilities to favor inputs with 80-90% Jaccard similarity, making it effective for identifying subtle variations in spam messages.[31]

TLSH, or Trend Micro Locality Sensitive Hash, developed in 2013, targets file similarity detection, particularly in malware analysis and digital forensics. The algorithm partitions the input byte stream into buckets and computes histograms of byte value distributions within each, capturing statistical patterns insensitive to minor modifications. These histograms are quantized and combined into a fixed-length hash, with a normalization step to accommodate varying file sizes by adjusting bucket counts based on length. TLSH supports fuzzy matching via a dedicated distance metric that compares bucketed features, yielding scores from 0 (identical) to a maximum indicating dissimilarity.[32] In 2020, TLSH was updated to version 3, enhancing stability in hash generation for short or irregular byte streams, which improves reliability in malware detection pipelines by reducing variance in distance scores across variants; version 4.12 was released in 2024 with further optimizations.[33][34]

Another specialized variant is SDHash, proposed in 2010 for digital forensics tasks such as known-file filtering and evidence correlation. It adapts MinHash principles by extracting byte chunks and weighting them according to local entropy, prioritizing high-entropy regions likely to contain unique identifiers while downweighting compressible or repetitive data. This entropy-weighted sketching produces a probabilistic digest that estimates similarity under edit distances, with implementations generating variable-length outputs optimized for large-scale indexing.[35]

Algorithms for Nearest Neighbor Search
Core LSH Indexing Algorithm
The core locality-sensitive hashing (LSH) indexing algorithm preprocesses a dataset of n points by constructing L independent hash tables, where each table employs an AND-construction composed of k hash functions drawn from an LSH family H. This setup ensures that similar points are more likely to collide in the same bucket within a table, while the multiple tables reduce the overall false negative rate.[36]

During the building phase, the algorithm first selects, for each table l, k independent hash functions from H. Then, for every point x in the dataset, it computes the composite hash value g_l(x) = (h_1(x), ..., h_k(x)) using the fixed functions for table l. The point x (or its identifier) is then inserted into the bucket indexed by g_l(x) in that table. This process is repeated for all points and tables, resulting in a structure where buckets may contain multiple points due to hash collisions.[36]

Collisions are managed by appending points to lists within each bucket, preserving all candidates that share the same hash value for potential retrieval during queries. The resulting index requires space O(nL), accounting for the storage of n points across L tables.[37]

The following pseudocode outlines the insertion and table creation process:

Preprocess(P, k, L, H):
    For l = 1 to L:
        Select k independent hash functions h_1, ..., h_k from H for table l
        Initialize empty hash table T_l
    For each point x in P:
        For l = 1 to L:
            Compute g_l(x) = (h_1(x), ..., h_k(x))   // using the fixed h_i for table l
            Append x (or its ID) to the list in bucket T_l[g_l(x)]
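A runnable Python sketch of this preprocessing step, together with a simple query routine (illustrative only; it uses the random-hyperplane family described earlier as the base family H, and all class, method, and parameter names are choices made here rather than part of the pseudocode above):

import numpy as np
from collections import defaultdict

class LSHIndex:
    # L hash tables; each table uses a concatenation of k random-hyperplane hashes.

    def __init__(self, dim, k, L, seed=0):
        rng = np.random.default_rng(seed)
        # planes[l] is a (k, dim) matrix: the k hash functions fixed for table l
        self.planes = [rng.standard_normal((k, dim)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, l, x):
        # g_l(x) = (h_1(x), ..., h_k(x)) as a tuple of sign bits
        return tuple((self.planes[l] @ x >= 0).astype(int).tolist())

    def add(self, points):
        # Preprocessing: hash every point into every table
        self.points = np.asarray(points, dtype=float)
        for i, x in enumerate(self.points):
            for l in range(len(self.tables)):
                self.tables[l][self._key(l, x)].append(i)

    def query(self, q, num_neighbors=1):
        # Collect candidates from the buckets q falls into, then rank by true distance
        candidates = set()
        for l in range(len(self.tables)):
            candidates.update(self.tables[l].get(self._key(l, q), []))
        ranked = sorted(candidates, key=lambda i: np.linalg.norm(self.points[i] - q))
        return ranked[:num_neighbors]

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32))
index = LSHIndex(dim=32, k=8, L=10)
index.add(data)
q = data[42] + 0.05 * rng.standard_normal(32)
print(index.query(q, num_neighbors=3))   # point 42 is usually among the results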
