Feature hashing

from Wikipedia

In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix.[1][2] It works by applying a hash function to the features and using their hash values as indices directly (after a modulo operation), rather than looking the indices up in an associative array. In addition to its use for encoding non-numeric values, feature hashing can also be used for dimensionality reduction.[2]

This trick is often attributed to Weinberger et al. (2009),[2] but there exists a much earlier description of this method published by John Moody in 1989.[1]

Motivation

Motivating example

In a typical document classification task, the input to the machine learning algorithm (both during learning and classification) is free text. From this, a bag of words (BOW) representation is constructed: the individual tokens are extracted and counted, and each distinct token in the training set defines a feature (independent variable) of each of the documents in both the training and test sets.

Machine learning algorithms, however, are typically defined in terms of numerical vectors. Therefore, the bags of words for a set of documents are regarded as a term-document matrix where each row is a single document, and each column is a single feature/word; the entry i, j in such a matrix captures the frequency (or weight) of the j-th term of the vocabulary in document i. (An alternative convention swaps the rows and columns of the matrix, but this difference is immaterial.) Typically, these vectors are extremely sparse, in accordance with Zipf's law.

The common approach is to construct, at learning time or prior to that, a dictionary representation of the vocabulary of the training set, and use that to map words to indices. Hash tables and tries are common candidates for dictionary implementation. E.g., the three documents

  • John likes to watch movies.
  • Mary likes movies too.
  • John also likes football.

can be converted, using the dictionary

Term Index
John 1
likes 2
to 3
watch 4
movies 5
Mary 6
too 7
also 8
football 9

to the term-document matrix

 \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}

(Punctuation was removed, as is usual in document classification and clustering.)
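The construction can be made concrete with a short sketch. The following Python snippet (a minimal illustration, using 0-based indices rather than the 1-based indices of the table above) builds such a dictionary and the corresponding term-document matrix for the three example documents.

 # Minimal sketch: build the index dictionary and the term-document matrix
 # for the three example documents (0-based column indices).
 docs = [
     "John likes to watch movies.",
     "Mary likes movies too.",
     "John also likes football.",
 ]

 def tokenize(text):
     # strip punctuation and split on whitespace
     return text.replace(".", "").split()

 vocab = {}                       # term -> column index, grown as new terms appear
 for doc in docs:
     for tok in tokenize(doc):
         vocab.setdefault(tok, len(vocab))

 matrix = [[0] * len(vocab) for _ in docs]
 for row, doc in enumerate(docs):
     for tok in tokenize(doc):
         matrix[row][vocab[tok]] += 1

 print(vocab)    # {'John': 0, 'likes': 1, 'to': 2, ...}
 print(matrix)   # [[1,1,1,1,1,0,0,0,0], [0,1,0,0,1,1,1,0,0], [1,1,0,0,0,0,0,1,0]]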

The problem with this process is that such dictionaries take up a large amount of storage space and grow in size as the training set grows.[3] Conversely, if the vocabulary is kept fixed and not increased with a growing training set, an adversary may try to invent new words or misspellings that are not in the stored vocabulary so as to circumvent a machine-learned filter. To address this challenge, Yahoo! Research attempted to use feature hashing for their spam filters.[4]

Note that the hashing trick isn't limited to text classification and similar tasks at the document level, but can be applied to any problem that involves large (perhaps unbounded) numbers of features.

Mathematical motivation

Mathematically, a token is an element t in a finite (or countably infinite) set T. Suppose we only need to process a finite corpus; then we can put all tokens appearing in the corpus into T, meaning that T is finite. However, suppose we want to process all possible words made of the English letters; then T is countably infinite.

Most neural networks can only operate on real vector inputs, so we must construct a "dictionary" function \phi : T \to \mathbb{R}^n.

When T is finite, of size |T| = m, then we can use one-hot encoding to map it into \mathbb{R}^m. First, arbitrarily enumerate T = \{t_1, t_2, \dots, t_m\}, then define \phi(t_i) = e_i. In other words, we assign a unique index i to each token, then map the token with index i to the unit basis vector e_i.

One-hot encoding is easy to interpret, but it requires one to maintain the arbitrary enumeration of T. Given a token t \in T, to compute \phi(t), we must find out the index i of the token t. Thus, to implement \phi efficiently, we need a fast-to-compute bijection h : T \to \{1, \dots, m\}, and then we have \phi(t) = e_{h(t)}.

In fact, we can relax the requirement slightly: it suffices to have a fast-to-compute injection h : T \to \{1, \dots, n\}, and then use \phi(t) = e_{h(t)}.

In practice, there is no simple way to construct an efficient injection h. However, we do not need a strict injection, only an approximate one. That is, when t \neq t', we should probably have h(t) \neq h(t'), so that probably \phi(t) \neq \phi(t').

At this point, we have just specified that h should be a hashing function. Thus we reach the idea of feature hashing.
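The contrast between the two approaches can be sketched in a few lines of Python. The hash function used here (MD5 via hashlib, reduced modulo m) and the dimension m = 8 are arbitrary illustrative choices, not part of the method's definition.

 import hashlib

 m = 8  # hashed dimension (illustrative)

 def one_hot_via_dictionary(token, enumeration):
     # requires maintaining the enumeration of T explicitly
     v = [0.0] * len(enumeration)
     v[enumeration[token]] = 1.0
     return v

 def one_hot_via_hash(token, m=m):
     # replaces the stored enumeration with a fast-to-compute (approximate)
     # injection h : T -> {0, ..., m-1}
     h = int(hashlib.md5(token.encode()).hexdigest(), 16) % m
     v = [0.0] * m
     v[h] = 1.0
     return v

 enumeration = {"cat": 0, "dog": 1, "fish": 2}
 print(one_hot_via_dictionary("dog", enumeration))
 print(one_hot_via_hash("dog"))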

Algorithms

Feature hashing (Weinberger et al. 2009)

The basic feature hashing algorithm presented in (Weinberger et al. 2009)[2] is defined as follows.

First, one specifies two hash functions: the kernel hash h : T \to \{1, 2, \dots, m\} and the sign hash \zeta : T \to \{-1, +1\}. Next, one defines the feature hashing function

 \phi : T \to \mathbb{R}^m, \qquad \phi(t) = \zeta(t)\, e_{h(t)}.

Finally, extend this feature hashing function to strings of tokens by

 \phi : T^* \to \mathbb{R}^m, \qquad \phi(t_1, \dots, t_k) = \sum_{j=1}^{k} \phi(t_j),

where T^* is the set of all finite strings consisting of tokens in T.

Equivalently,

 \phi(t_1, \dots, t_k) = \sum_{j=1}^{k} \zeta(t_j)\, e_{h(t_j)} = \sum_{i=1}^{m} \Big( \sum_{j : h(t_j) = i} \zeta(t_j) \Big) e_i.
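As a concrete (non-normative) illustration of these definitions, the following Python sketch implements \phi for a string of tokens, deriving the kernel hash h and the sign hash \zeta from two standard-library hash functions; any pair of independent hash functions would do.

 import hashlib

 def kernel_hash(token, m):
     # h : T -> {0, ..., m-1}
     return int(hashlib.md5(token.encode()).hexdigest(), 16) % m

 def sign_hash(token):
     # zeta : T -> {-1, +1}, derived from an independent hash (here: SHA-1)
     return 1 if int(hashlib.sha1(token.encode()).hexdigest(), 16) % 2 == 0 else -1

 def phi(tokens, m):
     # phi(t_1, ..., t_k) = sum_j zeta(t_j) * e_{h(t_j)}
     x = [0.0] * m
     for t in tokens:
         x[kernel_hash(t, m)] += sign_hash(t)
     return x

 print(phi(["John", "likes", "to", "watch", "movies"], m=8))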

Geometric properties

We want to say something about the geometric properties of \phi, but T, by itself, is just a set of tokens; we cannot impose a geometric structure on it except the discrete topology, which is generated by the discrete metric. To make it nicer, we lift it to \mathbb{R}^T (the free real vector space over T), and lift \phi from T to \mathbb{R}^T by linear extension:

 \phi\Big(\sum_{t \in T} x_t\, t\Big) = \sum_{t \in T} x_t\, \zeta(t)\, e_{h(t)}.

There is an infinite sum there, which must be handled at once. There are essentially only two ways to handle infinities. One may impose a metric, then take its completion, to allow well-behaved infinite sums, or one may demand that nothing is actually infinite, only potentially so. Here, we go for the potential-infinity way, by restricting \mathbb{R}^T to contain only vectors with finite support: for x \in \mathbb{R}^T, only finitely many entries of x are nonzero.

Define an inner product on \mathbb{R}^T in the obvious way:

 \langle t, t' \rangle = \begin{cases} 1 & \text{if } t = t', \\ 0 & \text{else,} \end{cases} \qquad \langle x, x' \rangle = \sum_{t, t' \in T} x_t x'_{t'} \langle t, t' \rangle.

As a side note, if T is infinite, then the inner product space \mathbb{R}^T is not complete. Taking its completion would get us to a Hilbert space, which allows well-behaved infinite sums.


Now we have an inner product space, with enough structure to describe the geometry of the feature hashing function \phi : \mathbb{R}^T \to \mathbb{R}^m.

First, we can see why h is called a "kernel hash": it allows us to define a kernel K : T \times T \to \mathbb{R} by

 K(t, t') = \langle e_{h(t)}, e_{h(t')} \rangle.

In the language of the "kernel trick", K is the kernel generated by the "feature map"

 \varphi : T \to \mathbb{R}^m, \qquad \varphi(t) = e_{h(t)}.

Note that this is not the feature map we were using, which is \phi(t) = \zeta(t)\, e_{h(t)}. In fact, we have been using another kernel K_\zeta : T \times T \to \mathbb{R}, defined by

 K_\zeta(t, t') = \langle \zeta(t)\, e_{h(t)}, \zeta(t')\, e_{h(t')} \rangle.

The benefit of augmenting the kernel hash h with the binary hash \zeta is the following theorem, which states that \phi is an isometry "on average".

Theorem (intuitively stated). If the binary hash \zeta is unbiased (meaning that it takes values -1 and +1 with equal probability), then \phi : \mathbb{R}^T \to \mathbb{R}^m is an isometry in expectation:

 \mathbb{E}\big[\langle \phi(x), \phi(x') \rangle\big] = \langle x, x' \rangle.

Proof

By linearity of expectation,

 \mathbb{E}\big[\langle \phi(x), \phi(x') \rangle\big] = \sum_{t, t' \in T} x_t x'_{t'}\, \mathbb{E}[\zeta(t)\zeta(t')]\, \langle e_{h(t)}, e_{h(t')} \rangle.

Now, \mathbb{E}[\zeta(t)\zeta(t')] = 0 for t \neq t', since we assumed \zeta is unbiased. So we continue

 \mathbb{E}\big[\langle \phi(x), \phi(x') \rangle\big] = \sum_{t \in T} x_t x'_t \cdot 1 = \langle x, x' \rangle.


The above statement and proof interpret the binary hash function \zeta not as a deterministic function of type T \to \{-1, +1\}, but as a random binary vector with unbiased entries, meaning that \Pr(\zeta(t) = +1) = \Pr(\zeta(t) = -1) = 1/2 for any t \in T.

This is a good intuitive picture, though not rigorous. For a rigorous statement and proof, see [2].
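The "isometry in expectation" statement can be checked numerically. The following Python simulation (an illustration only, not the rigorous argument of [2]) repeatedly draws random kernel and sign hashes and averages the hashed inner product, which should concentrate around the true inner product ⟨x, y⟩ = 2 for the two example token strings.

 import random

 def simulate(tokens_x, tokens_y, m=4, trials=100_000):
     # average <phi(x), phi(y)> over random draws of h and zeta
     acc = 0.0
     vocab = sorted(set(tokens_x) | set(tokens_y))
     for _ in range(trials):
         h = {t: random.randrange(m) for t in vocab}
         zeta = {t: random.choice((-1, 1)) for t in vocab}
         phi_x = [0.0] * m
         phi_y = [0.0] * m
         for t in tokens_x:
             phi_x[h[t]] += zeta[t]
         for t in tokens_y:
             phi_y[h[t]] += zeta[t]
         acc += sum(a * b for a, b in zip(phi_x, phi_y))
     return acc / trials

 x = ["cat", "dog", "cat"]   # as a bag of words: cat=2, dog=1
 y = ["cat", "fish"]         # cat=1, fish=1, so <x, y> = 2
 print(simulate(x, y))       # approximately 2.0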

Pseudocode implementation

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words), then using the hash values directly as feature indices and updating the resulting vector at those indices. Here, we assume that feature actually means feature vector.

 function hashing_vectorizer(features : array of string, N : integer):
     x := new vector[N]
     for f in features:
         h := hash(f)
         x[h mod N] += 1
     return x

Thus, suppose our feature vector is ["cat","dog","cat"] and the hash function satisfies h(x_f) = 1 if x_f is "cat" and h(x_f) = 2 if x_f is "dog". Let us take the output feature vector dimension (N) to be 4. Then the output x will be [0,2,1,0]. It has been suggested that a second, single-bit output hash function ξ be used to determine the sign of the update value, to counter the effect of hash collisions.[2] If such a hash function is used, the algorithm becomes

 function hashing_vectorizer(features : array of string, N : integer):
     x := new vector[N]
     for f in features:
         h := hash(f)
         idx := h mod N
         if ξ(f) == 1:
             x[idx] += 1
         else:
             x[idx] -= 1
     return x

The above pseudocode actually converts each sample into a vector. An optimized version would instead only generate a stream of (hash, value) pairs and let the learning and prediction algorithms consume such streams; a linear model can then be implemented as a single hash table representing the coefficient vector.
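A minimal Python sketch of this streaming formulation is given below; the hash functions and the dictionary-based weight storage are illustrative choices, not a prescribed implementation.

 import hashlib

 def hashed_pairs(features, N):
     # yield (index, value) pairs instead of materializing the full vector
     for f in features:
         h = int(hashlib.md5(f.encode()).hexdigest(), 16)
         sign = 1 if int(hashlib.sha1(f.encode()).hexdigest(), 16) % 2 == 0 else -1
         yield h % N, float(sign)

 def predict(weights, pairs):
     # linear model stored as a sparse hash table: index -> coefficient
     return sum(weights.get(i, 0.0) * v for i, v in pairs)

 weights = {}   # learned coefficients would be updated in place here
 score = predict(weights, hashed_pairs(["cat", "dog", "cat"], N=2**20))
 print(score)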

Extensions and variations

Learned feature hashing

Feature hashing generally suffers from hash collisions, which means that there exist pairs of different tokens with the same hash: t \neq t' but h(t) = h(t'). A machine learning model trained on feature-hashed words would then have difficulty distinguishing t and t', essentially because the symbol h(t) is polysemic.

If t' is rare, then the performance degradation is small, as the model could always just ignore the rare case and pretend every occurrence of h(t) means t. However, if both are common, then the degradation can be serious.

To handle this, one can train supervised hashing functions that avoid mapping common tokens to the same feature vectors.[5]

Applications and practical performance

Ganchev and Dredze showed that in text classification applications with random hash functions and several tens of thousands of columns in the output vectors, feature hashing need not have an adverse effect on classification performance, even without the signed hash function.[3]

Weinberger et al. (2009) applied their version of feature hashing to multi-task learning, and in particular spam filtering, where the input features are pairs (user, feature), so that a single parameter vector captures per-user spam filters as well as a global filter for several hundred thousand users; they found that the accuracy of the filter went up.[2]

Chen et al. (2015) combined the idea of feature hashing and sparse matrices to construct "virtual matrices": large matrices with small storage requirements. The idea is to treat a matrix as a dictionary, with keys in \mathbb{N} \times \mathbb{N} (the index pairs) and values in \mathbb{R}. Then, as usual with hashed dictionaries, one can use a hash function h : \mathbb{N} \times \mathbb{N} \to \{1, \dots, N\}, and thus represent the matrix as a vector in \mathbb{R}^N, no matter how big the matrix is. With virtual matrices, they constructed HashedNets, which are large neural networks taking only small amounts of storage.[6]
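The virtual-matrix idea can be sketched as follows; the class name, the MD5-based hash, and the omission of the sign hash are illustrative simplifications rather than the construction used in HashedNets itself.

 import hashlib

 class VirtualMatrix:
     """Sketch of a hashed "virtual matrix": entries of an n-by-m matrix are
     shared slots of a length-N parameter vector, addressed by hashing (i, j)."""

     def __init__(self, n, m, N):
         self.n, self.m, self.N = n, m, N
         self.params = [0.0] * N          # the only storage, independent of n*m

     def _slot(self, i, j):
         key = f"{i},{j}".encode()
         return int(hashlib.md5(key).hexdigest(), 16) % self.N

     def __getitem__(self, ij):
         i, j = ij
         return self.params[self._slot(i, j)]

     def __setitem__(self, ij, value):
         i, j = ij
         self.params[self._slot(i, j)] = value

 W = VirtualMatrix(n=10_000, m=10_000, N=2**16)   # 10^8 virtual entries, 65,536 real parameters
 W[123, 456] = 0.5
 print(W[123, 456])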

Implementations

Implementations of the hashing trick are present in, among others:

  • scikit-learn (the FeatureHasher class)
  • Vowpal Wabbit

from Grokipedia
Feature hashing, also known as the hashing trick, is a dimensionality reduction technique in machine learning that maps high-dimensional, sparse feature vectors to a lower-dimensional space using hash functions, enabling efficient vectorization without storing an explicit feature vocabulary. Introduced in 2009 by Weinberger, Dasgupta, Langford, Smola, and Attenberg, feature hashing was developed to address challenges in large-scale multitask learning, such as collaborative filtering for email spam detection across numerous users. The method works by applying a hash function h to map each original feature index j to one of m buckets (where m \ll d, and d is the original dimensionality), and incorporating a random sign function \xi(j) \in \{+1, -1\} to mitigate collision effects; the hashed feature vector \phi(x) is then defined as \phi_i(x) = \sum_{j : h(j) = i} \xi(j) x_j for each dimension i = 1, \dots, m. This preserves the sparsity of the input—typically resulting in vectors with at most one non-zero entry per original feature—and reduces memory usage from O(d) to O(m), while providing an unbiased estimator for inner products: \mathbb{E}[\langle \phi(x), \phi(x') \rangle] = \langle x, x' \rangle. Theoretically, feature hashing guarantees approximate preservation of Euclidean norms with high probability; specifically, for a unit-norm vector x, the probability that |\Vert \phi(x) \Vert^2 - 1| \geq \epsilon is at most 2 \exp(-m \epsilon^2 / 72), ensuring negligible distortion for sufficiently large m. Later analyses have derived tighter asymptotic bounds on the norm distortion, highlighting dependencies on the vector's sparsity and maximum entry relative to its \ell_2-norm, with empirical constants around 0.725 for typical data distributions. These properties make feature hashing particularly advantageous for streaming or evolving datasets, as it avoids the computational overhead of maintaining explicit mappings. Feature hashing finds broad applications in handling high-cardinality categorical data and text features, including tasks like clustering and generation, as well as recommendation systems for efficient featurization in real-time personalization. Implementations are available in libraries such as scikit-learn and Vowpal Wabbit, facilitating its use in scalable pipelines.

Introduction

Definition and overview

Feature hashing, also known as the hashing trick, is a fast technique that maps high-dimensional, sparse feature vectors—such as those from bag-of-words representations—to a lower-dimensional sparse vector using a hash function. This method transforms categorical or textual data into numerical features suitable for machine learning models without requiring the storage of an explicit feature dictionary, making it particularly useful for handling vast numbers of possible features, like words in natural language processing tasks. The core mechanism involves hashing each feature to an index within a fixed-size vector of dimension m, often chosen as a power of 2 for efficient computation, where collisions—multiple features mapping to the same index—are resolved by summing their corresponding values. No vocabulary or explicit mapping is stored, allowing the process to operate in constant time per feature. Formally, for a feature x_i with associated value v_i, the hashed feature vector \phi(x) is constructed such that \phi(x)[h(x_i) \bmod m] \mathrel{+}= \xi(x_i) v_i, where h is a universal hash function, such as the signed 32-bit MurmurHash3, and \xi(x_i) \in \{+1, -1\} is a random sign function that helps provide an unbiased estimator for inner products and mitigate collision biases. This approach offers key benefits, including significant memory efficiency for large vocabularies—such as millions of terms—by avoiding the need to store sparse matrices explicitly, constant-time feature extraction regardless of vocabulary size, and scalability to streaming or dynamically growing datasets where new features may appear unpredictably. These properties enable its widespread use in large-scale applications, such as text classification and recommendation systems, while addressing challenges posed by high-dimensional data.
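In practice this is a one-liner with an off-the-shelf implementation. The snippet below uses scikit-learn's FeatureHasher (assuming scikit-learn is installed); the choice of n_features and the token lists are illustrative.

 from sklearn.feature_extraction import FeatureHasher

 # Hash raw string tokens into a 2**18-dimensional sparse vector.
 # alternate_sign=True applies the random sign xi to reduce collision bias.
 hasher = FeatureHasher(n_features=2**18, input_type="string", alternate_sign=True)
 X = hasher.transform([["buy", "viagra", "now"], ["meeting", "at", "noon"]])
 print(X.shape)   # (2, 262144), stored as a scipy sparse matrix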

Historical origins

The concept of feature hashing was first introduced by John Moody in his 1989 paper on fast learning in multi-resolution hierarchies, where it was described as a method for efficiently handling continuous features in function approximations by mapping them into a fixed-size table via hashing functions. This early formulation focused on reducing memory requirements for locally tuned processing units but received limited attention in the broader machine learning community at the time. The technique gained practical prominence through its integration into the Vowpal Wabbit machine learning library, released in 2007 by John Langford and colleagues, which employed hashing for efficient processing of high-dimensional text data without the need for explicit feature indexing or vocabulary storage. This implementation demonstrated the method's utility in online learning scenarios, enabling scalable handling of sparse, large-scale datasets in resource-constrained environments. A formal theoretical analysis and extension of feature hashing were provided by Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola in their 2009 paper "Feature Hashing for Large Scale Multitask Learning," which introduced probabilistic bounds on collisions and adaptations for multitask settings; the work appeared as a preprint in February 2009 before publication at ICML. This contribution shifted the approach from an empirical tool to a rigorously analyzed technique, emphasizing its efficacy in reducing dimensionality while preserving learning performance. In the following years, feature hashing saw widespread adoption in open-source libraries, notably with the introduction of the FeatureHasher class in scikit-learn version 0.13 in 2013, which facilitated its use in Python-based workflows for text and categorical feature vectorization. More recent developments include integrations in network embedding contexts, such as the node2hash algorithm proposed in 2018 for graph representation learning, which leverages hashing to embed network nodes efficiently. Overall, feature hashing evolved from an overlooked ad-hoc mechanism in early neural architectures to a theoretically grounded staple in scalable machine learning, exemplified by the influential 2009 paper.

Theoretical Background

High-dimensional data challenges

In machine learning, high-dimensional data presents significant challenges due to the curse of dimensionality, where the volume of the feature space grows exponentially with the number of dimensions, leading to increased storage requirements and computational demands during model training. For instance, in tasks involving text corpora, the vocabulary size can easily exceed 10^6 unique terms, resulting in feature vectors that demand substantial memory and processing resources, often making full explicit representations impractical. This exponential growth exacerbates issues in algorithm performance, as the distance between data points becomes less meaningful, complicating tasks like nearest-neighbor searches or clustering. Sparsity further compounds these problems, as high-dimensional datasets, such as those derived from one-hot encoding of categorical variables or bag-of-words representations of text data, typically contain mostly zero values, with only a small fraction of features being non-zero for any given instance. Maintaining explicit dictionaries for such features requires space proportional to the full vocabulary size, O(|V|), which can overwhelm available memory in large-scale settings, while sparse storage formats like CSR (compressed sparse row) mitigate but do not eliminate the overhead of indexing and access. In practice, this sparsity leads to inefficient use of resources, as algorithms must navigate vast empty spaces, slowing down operations like gradient computations or similarity calculations. Scalability bottlenecks arise particularly with traditional feature extraction methods, such as TF-IDF, which involve building and indexing full term-document matrices that scale linearly with the number of documents and vocabulary size, O(n \times |V|), rendering them unsuitable for streaming or online environments where processing must occur in real time or with limited resources. For example, in applications operating on web-scale corpora, the inability to fit the entire index in RAM forces approximations, introducing delays and complexity. Similarly, multitask learning scenarios, where features are shared across domains (e.g., text from multiple languages or topics), amplify these issues, as the combined feature space grows even larger without coordinated reduction strategies. From a statistical perspective, high dimensionality promotes overfitting, where models capture noise rather than underlying patterns due to the abundance of parameters relative to available samples, necessitating dimensionality reduction techniques to maintain generalization. Methods that preserve inner products between original feature vectors are particularly valuable, as they ensure that kernel-based or linear models retain their discriminative power post-reduction, avoiding distortion of the similarity measures critical for downstream tasks. Without such preservation, the effectiveness of downstream models diminishes, further hindering reliable inference in sparse, high-dimensional regimes.

Fundamentals of hashing in machine learning

Hash functions provide a foundational technique in machine learning for mapping data from potentially unbounded input spaces to fixed-size outputs, enabling efficient processing of high-dimensional or sparse features. A hash function is a deterministic function that transforms variable-length inputs, such as strings representing categorical variables or n-grams from text, into fixed-size values, typically 32-bit or 64-bit integers. Key properties include uniformity, where outputs are evenly distributed across the possible range, and low collision probability, meaning distinct inputs rarely map to the same output, which supports reliable indexing and aggregation in vector representations. In machine learning contexts, non-cryptographic hash functions are preferred over cryptographic ones due to their emphasis on computational speed rather than security against adversarial attacks. Examples include the Fowler-Noll-Vo (FNV) algorithm, which uses simple XOR and multiplication operations for rapid hashing, and MurmurHash, a family of functions designed for excellent distribution with minimal overhead, as implemented in libraries like scikit-learn's FeatureHasher. For theoretical guarantees, universal hash families, introduced by Carter and Wegman, ensure that the probability of collision between any two distinct keys is at most 1/m, where m is the output range size, providing provable bounds on performance independent of the input distribution. These properties make such hashes suitable for large-scale learning tasks without requiring extensive preprocessing. Hashing plays a critical role in feature engineering by converting unstructured or categorical data into numerical indices for model input, avoiding the need to store explicit mappings like dictionaries for high-cardinality variables. For instance, categorical features such as user IDs or word n-grams can be hashed directly to vector positions, creating sparse representations that implicitly encode the features while scaling to large datasets. This approach is particularly useful in scenarios with evolving feature sets, as no vocabulary maintenance is required during training. Collisions, where distinct features map to the same index, are inherent in hashing and are handled in feature vectors by accumulating contributions in the shared bin, resulting in an approximate but unbiased representation under uniform hashing assumptions. This summation introduces a trade-off: smaller vector sizes m reduce memory usage but increase collision rates, potentially degrading model accuracy, while larger m improves fidelity at higher computational cost. Empirical analyses show that with m around 2^{18} to 2^{20}, collisions have negligible impact on linear models for typical text or categorical data. Essential prerequisites for feature hashing include modulo reduction to constrain hash outputs to the desired vector dimension m, computed as index = |h(input)| mod m, ensuring all mappings fit within the fixed space. Additionally, signed variants address the bias introduced by unsigned hashing by randomly assigning positive or negative signs to hashed features, which mitigates the variance introduced by collisions and centers the feature distribution around zero, improving convergence in optimization-based learners. Unsigned hashing simply adds absolute values, which can introduce positive bias, whereas signed hashing alternates signs to balance the vector.
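A small Python sketch of these mechanics, contrasting unsigned and signed accumulation, is shown below; the use of MD5/SHA-1 from hashlib stands in for a fast non-cryptographic hash such as MurmurHash and is purely illustrative.

 import hashlib

 def h(feature):
     return int(hashlib.md5(feature.encode()).hexdigest(), 16)

 def xi(feature):
     # second, independent single-bit hash deciding the sign
     return 1 if int(hashlib.sha1(feature.encode()).hexdigest(), 16) % 2 == 0 else -1

 def hash_features(features, m, signed=True):
     x = [0.0] * m
     for f in features:
         idx = h(f) % m            # modulo reduction into the fixed space
         x[idx] += xi(f) if signed else 1.0
     return x

 features = ["user_id=42", "country=DE", "device=mobile"]
 print(hash_features(features, m=16, signed=False))  # unsigned: collisions always add up
 print(hash_features(features, m=16, signed=True))   # signed: colliding terms can cancel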

Motivation

Practical motivating example

In a typical scenario for text classification, such as distinguishing spam from legitimate emails, the bag-of-words representation is commonly used to convert documents into feature vectors. However, with a large corpus, the vocabulary can easily exceed 1 million unique words, making explicit encoding impractical due to excessive memory requirements for storing sparse vectors of that dimensionality. Feature hashing addresses this by mapping words to a fixed, smaller number of dimensions using a hash function, allowing efficient processing without maintaining a dictionary. To ensure unbiased estimates and mitigate collision effects, a random sign function \xi(j) \in \{+1, -1\} is applied to each feature. Consider a simple spam email with the content "buy viagra now." In the bag-of-words approach, each unique word is treated as a feature with a count of 1 (assuming no repetitions). Using a hash function (such as FNV or MurmurHash, as implemented in tools like Vowpal Wabbit), the words are hashed into a vector space of size m = 2^{18} = 262,144 dimensions. For illustration, suppose the hash function maps "buy" to index 5 with sign +1, "viagra" to index 12,345 with sign +1, and "now" to index 5 with sign +1 (resulting in a collision between "buy" and "now"). The hashed feature vector then has a value of (+1)·1 + (+1)·1 = 2 at index 5 and (+1)·1 = 1 at index 12,345, with all other indices zero. (Note: signs are randomly assigned and fixed for each feature across documents, so collisions may partially cancel depending on the signs.) This process is repeated for each document in the dataset. Compared to the explicit representation, where the vector would be 1 million dimensions long with exactly three 1s (one for each word) and the rest zeros, the hashed version reduces the dimensionality dramatically while remaining sparse. Although collisions introduce approximations, the signed method preserves inner products between vectors unbiasedly in expectation, which is sufficient for linear models such as those used in spam classification. This approach enables training models on millions of documents efficiently, without the need to store or update a full vocabulary, as the hash function handles the mapping on the fly. Empirical evaluations in large-scale text tasks show minimal accuracy degradation compared to infeasible explicit encodings. The following table illustrates the transformation for the example document:

 Word     Assumed Original Index (in 1M vocab)   Hashed Index (mod 2^18)   Sign   Contribution to Hashed Vector
 buy      42,567                                  5                         +1     +1 at index 5
 viagra   289,314                                 12,345                    +1     +1 at index 12,345
 now      456,789                                 5                         +1     +1 at index 5

Resulting hashed vector (excerpt): position 5 = 2, position 12,345 = 1 (all others = 0).
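The same arithmetic can be reproduced in a few lines of Python; the hash indices and signs below are the assumed illustrative values from the table, not outputs of a real hash function.

 # Reproduce the example above with the assumed (hypothetical) hash outputs.
 m = 2**18
 hashed_index = {"buy": 5, "viagra": 12_345, "now": 5}   # assumed for illustration
 sign = {"buy": +1, "viagra": +1, "now": +1}

 x = {}                                # sparse representation: index -> value
 for word in "buy viagra now".split():
     i = hashed_index[word] % m
     x[i] = x.get(i, 0) + sign[word] * 1
 print(x)   # {5: 2, 12345: 1}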

Mathematical justification

The primary goal of feature hashing is to map high-dimensional feature vectors x, y \in \mathbb{R}^d (where d is potentially very large) to lower-dimensional hashed representations \phi(x), \phi(y) \in \mathbb{R}^m (with m \ll d) such that the inner product \langle \phi(x), \phi(y) \rangle is an unbiased estimator of \langle x, y \rangle, with concentration around the true value ensuring \Pr(|\langle \phi(x), \phi(y) \rangle - \langle x, y \rangle| > \epsilon \|x\|_2 \|y\|_2) \leq \delta for small \delta, thereby enabling the use of linear models on the hashed features without significant loss in predictive performance. This approximation is crucial because linear models rely on accurate inner products for tasks like regression and classification, and feature hashing preserves these in expectation while reducing computational and storage costs. Under a random pairwise-independent hash function h : \{1, \dots, d\} \to \{1, \dots, m\} and independent random signs \xi : \{1, \dots, d\} \to \{+1, -1\}, the hashed feature is defined as \phi_i(x) = \sum_{j : h(j) = i} \xi(j) x_j. The hashed inner product is then \langle \phi(x), \phi(y) \rangle = \sum_{k=1}^{m} \big( \sum_{i : h(i) = k} \xi(i) x_i \big) \big( \sum_{j : h(j) = k} \xi(j) y_j \big) = \sum_{j, l : h(j) = h(l)} \xi(j) \xi(l) x_j y_l, which equals \langle x, y \rangle plus error terms from collisions where j \neq l. The expectation is unbiased: \mathbb{E}[\langle \phi(x), \phi(y) \rangle] = \langle x, y \rangle, since \mathbb{E}[\xi(j)\xi(l)] = 0 for j \neq l. The variance of the error is bounded as \mathrm{Var}(\langle \phi(x), \phi(y) \rangle - \langle x, y \rangle) \leq \frac{\|x\|_2^2 \|y\|_2^2}{m}, assuming bounded features; this scales as O(\|x\|_2^2 \|y\|_2^2 / m), implying that the standard deviation of the error is O(\|x\|_2 \|y\|_2 / \sqrt{m}).
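The 1/\sqrt{m} behaviour of the error can be observed empirically. The following Python simulation (an illustration under the random-hashing assumptions above, not a proof) estimates the standard deviation of the hashed inner-product error for a few values of m.

 import random

 def hashed_inner_product(x, y, m):
     # x, y: dense lists over d original features; h and xi drawn at random
     d = len(x)
     h = [random.randrange(m) for _ in range(d)]
     xi = [random.choice((-1, 1)) for _ in range(d)]
     phi_x = [0.0] * m
     phi_y = [0.0] * m
     for j in range(d):
         phi_x[h[j]] += xi[j] * x[j]
         phi_y[h[j]] += xi[j] * y[j]
     return sum(a * b for a, b in zip(phi_x, phi_y))

 random.seed(0)
 d = 200
 x = [random.gauss(0, 1) for _ in range(d)]
 y = [random.gauss(0, 1) for _ in range(d)]
 true_ip = sum(a * b for a, b in zip(x, y))

 for m in (64, 256, 1024):
     errs = [hashed_inner_product(x, y, m) - true_ip for _ in range(2000)]
     std = (sum(e * e for e in errs) / len(errs)) ** 0.5
     print(m, round(std, 3))   # standard deviation shrinks roughly like 1/sqrt(m)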