GloVe
GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations of words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.[1] Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely global matrix factorization and local context window methods.
It was developed as an open-source project at Stanford University[2] and launched in 2014. It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec. As of 2022, both approaches are outdated, and Transformer-based models, such as BERT, which add multiple neural-network attention layers on top of a word embedding model similar to word2vec, have come to be regarded as the state of the art in NLP.[3]
Definition
You shall know a word by the company it keeps (Firth, J. R. 1957:11)[4]
The idea of GloVe is to construct, for each word $i$, two vectors $w_i, \tilde{w}_i$, such that the relative positions of the vectors capture part of the statistical regularities of the word $i$. The statistical regularities are defined in terms of co-occurrence probabilities: words that resemble each other in meaning should also resemble each other in co-occurrence probabilities.
Word counting
Let the vocabulary be $V$, the set of all possible words (also known as "tokens"). Punctuation is either ignored or treated as part of the vocabulary, and similarly for capitalization and other typographical details.[1]
If two words occur close to each other, then we say that they occur in the context of each other. For example, if the context length is 3, then we say that in the following sentence
GloVe1, coined2 from3 Global4 Vectors5, is6 a7 model8 for9 distributed10 word11 representation12
the word "model8" is in the context of "word11" but not the context of "representation12".
A word is not in the context of itself, so "model8" is not in the context of the word "model8"; however, if the same word occurs again within the context window, that other occurrence does count.
Let $X_{ij}$ be the number of times that the word $j$ appears in the context of the word $i$ over the entire corpus. For example, if the corpus is just "I don't think that that is a problem." we have $X_{\text{that},\text{that}} = 2$, since the first "that" appears in the second one's context, and vice versa.
Let $X_i$ be the number of words in the context of all instances of word $i$. By counting, we have
$$X_i = \sum_j X_{ij} = 2 \times (\text{context length}) \times \#(\text{occurrences of word } i)$$
(except for words occurring right at the start and end of the corpus).
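To make the counting procedure concrete, the following is a minimal sketch in Python, assuming a toy whitespace-tokenized corpus and a symmetric context window; the helper name cooccurrence_counts is illustrative rather than part of any GloVe release:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, context_length=3):
    """Count X[i][j]: how often word j appears within context_length
    positions of an occurrence of word i (the word itself is excluded)."""
    X = defaultdict(lambda: defaultdict(int))
    for pos, word_i in enumerate(tokens):
        lo = max(0, pos - context_length)
        hi = min(len(tokens), pos + context_length + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:                   # a word is not its own context
                X[word_i][tokens[ctx_pos]] += 1  # a repeated word in the window does count
    return X

tokens = "I don't think that that is a problem".split()
X = cooccurrence_counts(tokens, context_length=3)
print(X["that"]["that"])          # 2: each "that" lies in the other's context
X_that = sum(X["that"].values())  # X_i for "that": all context words over both occurrences
```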
Probabilistic modelling
Let
$$P_{ik} := \frac{X_{ik}}{X_i}$$
be the co-occurrence probability. That is, if one samples a random occurrence of the word $i$ in the entire document, and a random word within its context, that word is $k$ with probability $P_{ik}$. Note that $P_{ik} \neq P_{ki}$ in general. For example, in a typical modern English corpus, $P_{\text{ado},\text{much}}$ is close to one, but $P_{\text{much},\text{ado}}$ is close to zero. This is because the word "ado" is almost only used in the context of the archaic phrase "much ado about", but the word "much" occurs in all kinds of contexts.
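Continuing the sketch above (the helper below is likewise illustrative), the co-occurrence probability is just a row-normalized count, and the asymmetry noted here follows from the different row totals used for normalization:

```python
def cooccurrence_probability(X, word_i, word_k):
    """P_ik = X_ik / X_i: probability that a random context word of a
    random occurrence of word_i is word_k."""
    X_i = sum(X[word_i].values())
    return X[word_i][word_k] / X_i

print(cooccurrence_probability(X, "that", "that"))  # 2 of the 12 context tokens of "that", i.e. about 0.17
```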
For example, in a 6 billion token corpus, we have
| Probability and Ratio | $k = \text{solid}$ | $k = \text{gas}$ | $k = \text{water}$ | $k = \text{fashion}$ |
|---|---|---|---|---|
| $P(k \mid \text{ice})$ | $1.9 \times 10^{-4}$ | $6.6 \times 10^{-5}$ | $3.0 \times 10^{-3}$ | $1.7 \times 10^{-5}$ |
| $P(k \mid \text{steam})$ | $2.2 \times 10^{-5}$ | $7.8 \times 10^{-4}$ | $2.2 \times 10^{-3}$ | $1.8 \times 10^{-5}$ |
| $P(k \mid \text{ice}) / P(k \mid \text{steam})$ | $8.9$ | $8.5 \times 10^{-2}$ | $1.36$ | $0.96$ |
Inspecting the table, we see that the words "ice" and "steam" are indistinguishable along the "water" direction (both co-occur with it often) and the "fashion" direction (both co-occur with it rarely), but are distinguishable along the "solid" direction (which co-occurs more with "ice") and the "gas" direction (which co-occurs more with "steam").
The idea is to learn two vectors $w_i, \tilde{w}_i$ for each word $i$, such that we have a multinomial logistic regression:
$$\ln P_{ik} \approx w_i^\top \tilde{w}_k + b_i + \tilde{b}_k$$
and the terms $b_i, \tilde{b}_k$ are unimportant parameters.
This means that if the words $i, j$ have similar co-occurrence probabilities, that is, $(P_{ik})_{k} \approx (P_{jk})_{k}$, then their vectors should also be similar: $w_i \approx w_j$.
Logistic regression
Naively, the regression can be fit by minimizing the squared loss:
$$L = \sum_{i,k} \left( \ln P_{ik} - w_i^\top \tilde{w}_k - b_i - \tilde{b}_k \right)^2$$
However, this would be noisy for rare co-occurrences. To fix the issue, the squared loss is weighted so that the loss is slowly ramped up as the absolute number of co-occurrences $X_{ik}$ increases:
$$L = \sum_{i,k} f(X_{ik}) \left( \ln P_{ik} - w_i^\top \tilde{w}_k - b_i - \tilde{b}_k \right)^2$$
where
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
and $x_{\max}, \alpha$ are hyperparameters. In the original paper, the authors found that $x_{\max} = 100$ and $\alpha = 3/4$ work well in practice.
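As a rough illustration of this weighted objective, the following NumPy sketch evaluates the loss on a dense toy count matrix; it uses plain matrix operations and illustrative names rather than the optimizer or data structures of the reference implementation:

```python
import numpy as np

def glove_loss(X, W, W_ctx, b, b_ctx, x_max=100.0, alpha=0.75):
    """Weighted squared loss: sum over nonzero X_ik of
    f(X_ik) * (ln P_ik - w_i.w~_k - b_i - b~_k)^2."""
    X = np.asarray(X, dtype=float)
    X_i = X.sum(axis=1, keepdims=True)                          # row totals X_i
    P = np.divide(X, X_i, out=np.zeros_like(X), where=X_i > 0)  # P_ik = X_ik / X_i
    mask = X > 0
    logP = np.where(mask, np.log(np.where(mask, P, 1.0)), 0.0)
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0) * mask   # ramped-up weighting f(X_ik)
    scores = W @ W_ctx.T + b[:, None] + b_ctx[None, :]          # w_i.w~_k + b_i + b~_k
    return float(np.sum(f * (logP - scores) ** 2))

# Toy usage with random data (illustrative only):
rng = np.random.default_rng(0)
V, d = 10, 8
X = rng.integers(0, 5, size=(V, V))                 # toy co-occurrence counts
W, W_ctx = 0.1 * rng.normal(size=(V, d)), 0.1 * rng.normal(size=(V, d))
b, b_ctx = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_ctx, b, b_ctx))
```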
Use
Once the model is trained, we have four trained parameters for each word $i$: $w_i, \tilde{w}_i, b_i, \tilde{b}_i$. The parameters $b_i, \tilde{b}_i$ are irrelevant; only $w_i, \tilde{w}_i$ are relevant.
The authors recommended using $w_i + \tilde{w}_i$ as the final representation vector for word $i$, because empirically it worked better than $w_i$ or $\tilde{w}_i$ alone.
Applications
GloVe can be used to find relations between words, such as synonyms, company-product relations, and zip codes and cities. However, the unsupervised learning algorithm is not effective in identifying homographs, i.e., words with the same spelling and different meanings, because it computes a single vector for each surface form rather than one per sense.[5] The algorithm is also used by the spaCy library to build semantic word embedding features and to compute lists of the most similar words using distance measures such as cosine similarity and Euclidean distance.[6] GloVe was also used as the word representation framework for online and offline systems designed to detect psychological distress in patient interviews.[7]
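As an illustration of such distance measures, the sketch below loads vectors from a standard GloVe text file (one word per line followed by its vector components) and compares two words; the file name glove.6B.100d.txt refers to one of the Stanford downloads, and the helper names are illustrative:

```python
import numpy as np

def load_glove(path, limit=100000):
    """Parse a GloVe .txt file into a dict of word -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f):
            if n >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vecs = load_glove("glove.6B.100d.txt")
print(cosine_similarity(vecs["ice"], vecs["steam"]))  # higher means more similar
print(np.linalg.norm(vecs["ice"] - vecs["steam"]))    # Euclidean distance, lower means closer
```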
See also
References
- ^ a b c Pennington, Jeffrey; Socher, Richard; Manning, Christopher (October 2014). "Glove: Global Vectors for Word Representation". In Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics. pp. 1532–1543. doi:10.3115/v1/D14-1162.
- ^ GloVe: Global Vectors for Word Representation (pdf) Archived 2020-09-03 at the Wayback Machine "We use our insights to construct a new model for word representation which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model."
- ^ Von der Mosel, Julian; Trautsch, Alexander; Herbold, Steffen (2022). "On the validity of pre-trained transformers for natural language processing in the software engineering domain". IEEE Transactions on Software Engineering. 49 (4): 1487–1507. arXiv:2109.04738. doi:10.1109/TSE.2022.3178469. ISSN 1939-3520. S2CID 237485425.
- ^ Firth, J. R. (1957). Studies in Linguistic Analysis (PDF). Wiley-Blackwell.
- ^ Wenig, Phillip (2019). "Creation of Sentence Embeddings Based on Topical Word Representations: An approach towards universal language understanding". Towards Data Science.
- ^ Singh, Mayank; Gupta, P. K.; Tyagi, Vipin; Flusser, Jan; Ören, Tuncer I. (2018). Advances in Computing and Data Sciences: Second International Conference, ICACDS 2018, Dehradun, India, April 20-21, 2018, Revised Selected Papers. Singapore: Springer. p. 171. ISBN 9789811318122.
- ^ Abad, Alberto; Ortega, Alfonso; Teixeira, António; Mateo, Carmen; Hinarejos, Carlos; Perdigão, Fernando; Batista, Fernando; Mamede, Nuno (2016). Advances in Speech and Language Technologies for Iberian Languages: Third International Conference, IberSPEECH 2016, Lisbon, Portugal, November 23-25, 2016, Proceedings. Cham: Springer. p. 165. ISBN 9783319491691.
External links
- GloVe Archived 2016-12-19 at the Wayback Machine
- Deeplearning4j GloVe Archived 2019-02-02 at the Wayback Machine
GloVe
Introduction
Definition and Principles
GloVe, short for Global Vectors for Word Representation, is an unsupervised learning algorithm designed to generate dense vector representations of words that encode semantic relationships derived from global corpus statistics. Unlike purely predictive models, GloVe employs a matrix factorization approach on word co-occurrence data to produce low-dimensional embeddings, typically ranging from 50 to 300 dimensions, which capture both semantic and syntactic regularities in language.[2] The vocabulary size of these embeddings depends on the training corpus, often encompassing hundreds of thousands to millions of words for large-scale datasets like Wikipedia or Common Crawl.[2]

At its core, GloVe integrates the strengths of global statistical methods, such as latent semantic analysis (LSA), with the local context window techniques used in predictive models like word2vec. This hybrid principle allows GloVe to leverage aggregated co-occurrence information across the entire corpus, rather than relying solely on immediate contextual predictions, enabling more efficient learning of word similarities and analogies.[2] By focusing on the logarithm of word co-occurrence probabilities, the model ensures that the resulting vector space reflects meaningful linguistic patterns, such as the distributional hypothesis that words appearing in similar contexts tend to have related meanings.[2]

These word embeddings facilitate arithmetic operations in the vector space, permitting analogies like king − man + woman ≈ queen, which demonstrate how semantic relationships can be linearly manipulated.[2] Such properties arise from the model's ability to position words in a continuous space where geometric distances and directions correspond to linguistic affinities, outperforming earlier methods in tasks requiring relational understanding.[2]
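As a hedged illustration of this analogy arithmetic, pre-trained GloVe vectors can be queried through Gensim's downloader (the model name glove-wiki-gigaword-100 is one of the pre-trained sets mirrored in the gensim-data repository):

```python
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # 100-dimensional pre-trained GloVe vectors
# king - man + woman: the nearest remaining word is expected to be "queen"
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```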
History and Development
GloVe was developed in 2014 by researchers Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford University's Natural Language Processing Group.[4] The project emerged as part of ongoing efforts to improve unsupervised word representation models in natural language processing.[2] The model was formally introduced in a conference paper presented at the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), titled "GloVe: Global Vectors for Word Representation."[4] This publication detailed the approach and demonstrated its advantages over contemporary methods.[2]

The primary motivation behind GloVe's creation was to overcome the limitations of earlier techniques like word2vec, which focused on local context windows and underutilized the global statistical information available in large corpora.[2] By integrating global word-word co-occurrence statistics across the entire corpus, GloVe sought to generate more robust vector representations that better capture semantic and syntactic relationships.[2]

Following its publication, the Stanford team released GloVe as open-source software implemented in C, facilitating efficient training on large datasets.[5] The initial distribution included pre-trained word vectors derived from substantial corpora, such as the combined Wikipedia 2014 and Gigaword 5 dataset (approximately 6 billion tokens with a 400,000-word vocabulary) and larger Common Crawl data (up to 840 billion tokens).[1] These resources were made available via the project's website to support immediate experimentation and application.[1]

In the years after its 2014 debut, GloVe experienced no major revisions to its core algorithm but saw widespread adoption through integrations into prominent NLP libraries, including support for loading pre-trained models in Gensim via its KeyedVectors interface.[6] Adaptations and minor enhancements, such as compatibility updates for evolving Python ecosystems, continued through libraries like spaCy. In 2024, updated pre-trained GloVe models were released, trained on refreshed corpora including Wikipedia, Gigaword, and a subset of Dolma to incorporate recent linguistic and cultural changes; these were documented in a July 2025 report by Riley Carlson, Lucas Bauer, and Christopher D. Manning, which also improved data preprocessing documentation and demonstrated enhanced performance on named entity recognition tasks and vocabulary coverage.[3] This has solidified its role as a foundational tool in word embedding workflows as of 2025.
Methodology
Co-occurrence Matrix Construction
The co-occurrence matrix $X$ in GloVe is a square matrix of size $|V| \times |V|$, where $|V|$ is the vocabulary size, and each entry $X_{ij}$ counts the number of times word $j$ occurs in the context of word $i$ when scanning the entire corpus.[2] This matrix captures global statistical information about word associations by aggregating co-occurrences across all instances of each word pair, rather than relying on local predictive models.[2]

To construct $X$, the algorithm processes the corpus using a symmetric context window centered on each target word $i$. Typically, this window spans 10 words to the left and right, though smaller sizes like 5 words may be used for efficiency in certain implementations.[2] Within the window, co-occurrences are tallied with distance-based weighting: the contribution of a context word at distance $d$ from the center is scaled by $1/d$, giving closer words higher influence while still accounting for broader associations.[2] For example, in a sentence like "the quick brown fox jumps," for the target word "fox," "brown" at distance 1 would contribute more to $X_{\text{fox},\text{brown}}$ than "the" at distance 3.[2]

Large-scale corpora are essential for building a robust $X$, as GloVe draws from massive text sources such as Wikipedia (approximately 1-1.6 billion tokens) or the much larger Common Crawl dataset (around 42 billion tokens).[2] These sources yield a vocabulary of 400,000 or more words, resulting in a dense aggregation of co-occurrence statistics that reflects real-world language patterns.[2] The construction process scans the corpus once, updating matrix entries in a streaming fashion to manage memory, with a total time complexity of $O(|C|)$, where $|C|$ is the corpus size.[2]

Despite its utility, the co-occurrence matrix faces challenges inherent to large vocabularies and sparse data. For rare word pairs, most entries in $X$ remain zero (typically 75-95% sparsity depending on $|V|$), since uncommon combinations do not appear frequently enough across the corpus.[2] This sparsity, combined with the computational demands of processing billions of tokens, requires efficient storage and indexing, often focusing only on non-zero elements during subsequent model training.[2] Such properties make $X$ a foundational yet challenging input for deriving word embeddings.[2]
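A simplified sketch of this construction, assuming pre-tokenized input and the 1/d distance weighting described above (the function name is illustrative; the reference implementation streams over the corpus in C):

```python
from collections import defaultdict

def build_weighted_cooccurrence(tokens, window=10):
    """Accumulate X[(i, j)] += 1/d for every context word j at distance d
    from target word i, within a symmetric window."""
    X = defaultdict(float)
    for pos, word_i in enumerate(tokens):
        for d in range(1, window + 1):
            for ctx_pos in (pos - d, pos + d):
                if 0 <= ctx_pos < len(tokens):
                    X[(word_i, tokens[ctx_pos])] += 1.0 / d  # closer context words weigh more
    return X

X = build_weighted_cooccurrence("the quick brown fox jumps".split(), window=3)
print(X[("fox", "brown")])  # 1.0   (distance 1)
print(X[("fox", "the")])    # 0.33… (distance 3)
```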
Mathematical Formulation
GloVe employs a log-bilinear model to derive word vector representations from global word-word co-occurrence statistics. Each word $i$ is associated with a word vector $w_i \in \mathbb{R}^d$ and a bias term $b_i$, while each context word $j$ has a corresponding context vector $\tilde{w}_j \in \mathbb{R}^d$ and bias $\tilde{b}_j$, where $d$ is the embedding dimensionality.[2] The core equation of the model predicts the logarithm of the word-context co-occurrence counts as follows:
$$w_i^\top \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$$
This formulation directly models the logarithm of co-occurrence counts, capturing the strength of association between words and their contexts through the inner product of their respective vectors plus additive biases.[2]

The model is motivated by a probabilistic interpretation that emphasizes ratios of co-occurrence probabilities, which encode semantic relations. Specifically, the ratio $P_{ik} / P_{jk}$ (where $P_{ik} = X_{ik} / X_i$) is modeled such that $F\big((w_i - w_j)^\top \tilde{w}_k\big) = P_{ik} / P_{jk}$, or equivalently, choosing $F = \exp$, $w_i^\top \tilde{w}_k = \log P_{ik}$. Taking the logarithm and incorporating biases yields the central equation above, framing GloVe as a log-bilinear regression over these ratios to ensure global statistical consistency.[2]

To obtain a symmetric word embedding that treats words and contexts equivalently, the final vector for word $i$ is the element-wise sum of its word and context vectors: $w_i + \tilde{w}_i$. This combined representation leverages the dual role of words as both foci and contexts in the co-occurrence data, enhancing the capture of semantic and syntactic regularities.[2]
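The argument can be written out as a short derivation sketch (following the published reasoning, with $F = \exp$ and the $\log X_i$ term absorbed into the bias $b_i$):

```latex
% Ratio requirement: how words i and j behave relative to a probe word k
F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
% Choosing F = exp lets the requirement factor over i and j:
\exp\big(w_i^\top \tilde{w}_k\big) = P_{ik} = \frac{X_{ik}}{X_i}
\quad\Longrightarrow\quad
w_i^\top \tilde{w}_k = \log X_{ik} - \log X_i
% Absorbing log X_i into a bias b_i and adding \tilde{b}_k for symmetry:
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
```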
Training and Optimization
The training of GloVe embeddings involves minimizing a weighted least-squares objective function that captures global co-occurrence statistics from the corpus. The loss function is defined as
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $w_i$ and $\tilde{w}_j$ are the word and context vectors for words $i$ and $j$, $b_i$ and $\tilde{b}_j$ are scalar bias terms, $X_{ij}$ is the co-occurrence count, and $f$ is a weighting function that diminishes the influence of rare co-occurrences.[2] The weighting function is given by
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
with typical values $x_{\max} = 100$ and $\alpha = 0.75$; this design prevents low-count pairs from dominating the optimization while preserving information from frequent co-occurrences.[2]

Optimization proceeds via AdaGrad, an adaptive gradient method suitable for sparse data, applied stochastically by sampling nonzero entries from the co-occurrence matrix. Parameters are initialized uniformly at random from the interval $(-0.5/d, 0.5/d)$, where $d$ is the embedding dimensionality, to ensure stable initial gradients. An initial learning rate of 0.05 is commonly used, with the process iterating over the data multiple times.[2][5]

Key hyperparameters include the embedding dimensionality, typically ranging from 50 to 300 to balance expressiveness and efficiency, and the number of training epochs, often 10 to 50 depending on corpus size and dimensionality (e.g., 50 epochs for dimensions under 300). Larger corpora improve representation quality by providing richer statistics, though diminishing returns occur beyond billions of tokens; convergence is monitored via loss reduction, halting when improvements plateau.[2]

Computationally, training scales with the number of nonzero co-occurrences rather than the corpus length, achieving time complexity $O(|X|)$ per epoch, where the number of nonzero entries $|X|$ grows sublinearly with corpus size (approximately $O(|C|^{0.8})$ for typical settings). Parallelization is achieved by distributing updates over nonzero matrix entries across threads or machines, enabling efficient training on large-scale data; for a 6-billion-token corpus, co-occurrence matrix construction takes about 85 minutes on a single thread, while a full training iteration for 300-dimensional vectors requires around 14 minutes on 32 cores.[2]
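A minimal sketch of this optimization loop in NumPy, assuming a dictionary of nonzero co-occurrence counts such as the one built in the earlier construction sketch; it follows the AdaGrad-style per-parameter scaling described above but is not the reference C implementation, and the function name train_glove is illustrative:

```python
import numpy as np

def train_glove(cooc, vocab, dim=50, epochs=25, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """cooc: dict mapping (word_i, word_j) -> co-occurrence count X_ij (nonzero entries only)."""
    rng = np.random.default_rng(seed)
    idx = {w: r for r, w in enumerate(vocab)}
    V = len(vocab)
    # Small random initialization, scaled by the embedding dimensionality.
    W = (rng.random((V, dim)) - 0.5) / dim
    W_ctx = (rng.random((V, dim)) - 0.5) / dim
    b, b_ctx = np.zeros(V), np.zeros(V)
    # AdaGrad accumulators of squared gradients, started at 1 to avoid division by zero.
    gW, gW_ctx = np.ones((V, dim)), np.ones((V, dim))
    gb, gb_ctx = np.ones(V), np.ones(V)
    pairs = [(idx[wi], idx[wj], x) for (wi, wj), x in cooc.items()]
    for _ in range(epochs):
        for k in rng.permutation(len(pairs)):                     # stochastic pass over nonzeros
            i, j, x = pairs[k]
            f = (x / x_max) ** alpha if x < x_max else 1.0        # weighting f(X_ij)
            diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x)  # model score minus log X_ij
            g = f * diff
            gWi, gWj = g * W_ctx[j], g * W[i]
            W[i] -= lr * gWi / np.sqrt(gW[i])
            W_ctx[j] -= lr * gWj / np.sqrt(gW_ctx[j])
            b[i] -= lr * g / np.sqrt(gb[i])
            b_ctx[j] -= lr * g / np.sqrt(gb_ctx[j])
            gW[i] += gWi ** 2
            gW_ctx[j] += gWj ** 2
            gb[i] += g ** 2
            gb_ctx[j] += g ** 2
    return W + W_ctx  # final embeddings: sum of word and context vectors, one row per vocabulary word
```

Returning the sum of the word and context matrices mirrors the common practice of using $w_i + \tilde{w}_i$ as the final representation.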
Implementation
Software Tools and Pre-trained Models
The original implementation of GloVe, released by Stanford NLP in 2014, is a C-based software package hosted on GitHub under the repository stanfordnlp/GloVe. It enables users to train custom models on their own corpora by preprocessing text into co-occurrence files and optimizing the GloVe objective, while also providing evaluation scripts for tasks like word analogies and similarity using Python or Octave.[5][1]

GloVe has been integrated into several popular libraries for broader accessibility. In Python, Gensim supports loading GloVe vectors through its KeyedVectors class, typically after converting the format with the built-in glove2word2vec utility, facilitating similarity computations and downstream NLP workflows.[7] spaCy's en_core_web_lg model embeds 300-dimensional GloVe vectors trained on 840 billion tokens from Common Crawl, allowing direct vector access for over 685,000 English terms within its processing pipeline. For R users, the text2vec package offers a native GloVe implementation, including functions to fit models on term-co-occurrence matrices and compute embeddings efficiently for text vectorization.[8]

Official pre-trained GloVe models, developed by Stanford, are freely downloadable from the project website. These include the original 2014 variants such as 50-, 100-, 200-, and 300-dimensional vectors trained on 6 billion tokens from the 2014 Wikipedia dump combined with Gigaword 5 (400,000-word vocabulary), as well as 300-dimensional vectors from 840 billion tokens of Common Crawl data (2.2 million-word vocabulary, cased). Updated 2024 models are also available, comprising 50-, 100-, 200-, and 300-dimensional vectors trained on 11.9 billion tokens from the 2024 Wikipedia dump and Gigaword 5 (1.29 million-word vocabulary), and 300-dimensional vectors trained on a 220 billion token subset of the Dolma corpus (1.2 million-word vocabulary). These resources are also mirrored on platforms like Kaggle for easier integration into machine learning pipelines.[1][3]

Community extensions have built upon the original codebase to address performance needs, including GPU-accelerated versions implemented in PyTorch that parallelize the training process for large-scale corpora. For example, the pytorch-glove repository provides a differentiable implementation compatible with modern deep learning frameworks.[9]
Usage Guidelines and Best Practices
When integrating GloVe embeddings into natural language processing projects, loading pre-trained vectors is straightforward using libraries like Gensim in Python. For instance, pre-trained GloVe models can be loaded directly from the Gensim data repository with the following code:

```python
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")  # Loads 100-dimensional vectors trained on Wikipedia and Gigaword
```
Once loaded, the vectors can be queried for pairwise word similarity:

```python
similarity = word_vectors.similarity('computer', 'laptop')
print(similarity)  # Outputs a value between -1 and 1, where higher indicates greater similarity
```
