Word2vec
from Wikipedia
word2vec
Original author: Google AI
Initial release: July 29, 2013
Repository: https://code.google.com/archive/p/word2vec/
License: Apache-2.0

Word2vec is a technique in natural language processing for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov, Kai Chen, Greg Corrado, Ilya Sutskever and Jeff Dean at Google, and published in 2013.[1][2]

Word2vec represents a word as a high-dimensional vector of numbers which capture relationships between words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words, so for example the vectors for "walk" and "ran" are nearby, as are those for "but" and "however", and "Berlin" and "Germany".

Approach


Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a mapping of the set of words to a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a vector in the space.

Word2vec can use either of two model architectures to produce these distributed representations of words: continuous bag of words (CBOW) or continuously sliding skip-gram. In both architectures, word2vec considers both individual words and a sliding context window as it iterates over the corpus.

The CBOW can be viewed as a 'fill in the blank' task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window. Words which are semantically similar should influence these probabilities in similar ways, because semantically similar words should be used in similar contexts. The order of context words does not influence prediction (bag of words assumption).

In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.[1][2] The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note,[3] CBOW is faster while skip-gram does a better job for infrequent words.
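As a concrete illustration, the sketch below trains both architectures on a toy corpus using the Gensim library (an assumption about the toolkit; any Word2vec implementation exposes an equivalent switch between CBOW and skip-gram).

```python
# Minimal sketch: training CBOW and skip-gram models with Gensim (assumed toolkit).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 selects CBOW, sg=1 selects the skip-gram architecture.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each unique word in the corpus is mapped to a dense vector in the embedding space.
print(cbow.wv["cat"].shape)                 # (50,)
print(skipgram.wv.similarity("cat", "dog")) # cosine similarity between two word vectors
```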

After the model is trained, the learned word embeddings are positioned in the vector space such that words that share common contexts in the corpus — that is, words that are semantically and syntactically similar — are located close to one another in the space.[1] More dissimilar words are located farther from one another in the space.[1]

Mathematical details


This section is based on expositions.[4][5]

A corpus is a sequence of words. Both CBOW and skip-gram are methods to learn one vector per word appearing in the corpus.

Let $V$ ("vocabulary") be the set of all words appearing in the corpus $C$. Our goal is to learn one vector $v_w \in \mathbb{R}^n$ for each word $w \in V$.

The idea of skip-gram is that the vector of a word should be close to the vector of each of its neighbors. The idea of CBOW is that the vector-sum of a word's neighbors should be close to the vector of the word.

Continuous bag-of-words (CBOW)

[Figure: Continuous bag-of-words (CBOW) model, illustrated as a neural network]

The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word $w_i$ in the corpus, the summed one-hot encodings of its neighboring words are used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of the word $w_i$ itself. The objective of training is to maximize $\prod_i \Pr(w_i \mid w_{i+j} : j \in N)$, the product over the corpus of the probability assigned to each word given its neighbors.

For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words, the set of relative indexes of neighbor words will be $N = \{-2, -1, +1, +2\}$, and the objective is to maximize $\prod_i \Pr(w_i \mid w_{i+j} : j \in N)$.

In standard bag-of-words, a word's context is represented by a word-count (aka a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood $N = \{-2, -1, +1, +2\}$.

In continuous bag-of-words, the histogram is multiplied by a matrix $M$ to obtain a continuous representation of the word's context. The matrix $M$ is also called a dictionary. Its columns are the word vectors: it has $|V|$ columns, where $|V|$ is the size of the dictionary. Let $n$ be the length of each word vector. We have $M \in \mathbb{R}^{n \times |V|}$.

For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with $M$, we obtain $2 v_{\text{the}} + v_{\text{cat}} + v_{\text{on}}$, the sum of the context words' vectors.

This is then multiplied with another matrix $M'$ of shape $|V| \times n$. Each row of $M'$ is a word vector $v'_w$. This results in a vector of length $|V|$, one entry per dictionary entry. Then, apply the softmax to obtain a probability distribution over the dictionary.

This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss.

In full formula, the cross-entropy loss is $L = -\sum_i \ln \frac{\exp(v'_{w_i} \cdot \bar{v}_i)}{\sum_{w \in V} \exp(v'_w \cdot \bar{v}_i)}$, where the outer summation is over the words in the corpus, and the quantity $\bar{v}_i = \sum_{j \in N} v_{w_{i+j}}$ is the sum of a word's neighbors' vectors.
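A minimal NumPy sketch of this linear-linear-softmax computation, using the "sat" example above; the toy vocabulary, random matrices, and vector length are assumptions for illustration only.

```python
# Sketch of the CBOW forward pass (linear-linear-softmax) and its cross-entropy loss.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, n = len(vocab), 8                # dictionary size |V| and word-vector length n
idx = {w: i for i, w in enumerate(vocab)}

M = rng.normal(size=(n, V))         # "dictionary": columns are word vectors
M2 = rng.normal(size=(V, n))        # output matrix M': rows are word vectors

# Word histogram of the neighbors of "sat" in "the cat sat on the mat"
hist = np.zeros(V)
for w, count in {"the": 2, "cat": 1, "on": 1}.items():
    hist[idx[w]] = count

context = M @ hist                  # continuous representation of the context
scores = M2 @ context               # one score per dictionary entry
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # softmax over the dictionary

loss = -np.log(probs[idx["sat"]])   # cross-entropy for the true center word
print(float(loss))
```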

Once such a system is trained, we have two trained matrices $M$ and $M'$. Either the column vectors of $M$ or the row vectors of $M'$ can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of $M$ or the "sat"-th row of $M'$. It is also possible to simply define $M' = M^\top$, in which case there would no longer be a choice.

Skip-gram

[Figure: Skip-gram model]

The idea of skip-gram is to represent each word with a vector, such that it is possible to predict a word's neighbors using the vector of the word.

The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word $w_i$ in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of $w_i$. The objective of training is to maximize $\prod_i \prod_{j \in N} \Pr(w_{i+j} \mid w_i)$.

In full formula, the loss function is $L = -\sum_i \sum_{j \in N} \ln \frac{\exp(v'_{w_{i+j}} \cdot v_{w_i})}{\sum_{w \in V} \exp(v'_w \cdot v_{w_i})}$. As with CBOW, once such a system is trained, we have two trained matrices $M$ and $M'$. Either the column vectors of $M$ or the row vectors of $M'$ can serve as the dictionary. It is also possible to simply define $M' = M^\top$, in which case there would no longer be a choice.

Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training.

History


During the 1980s, there were some early attempts at using neural networks to represent words and concepts as vectors.[6][7][8]

In 2010, Tomáš Mikolov (then at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden layer to language modelling.[9]

Word2vec was created, patented,[10] and published in 2013 by a team of researchers led by Mikolov at Google, across two papers.[1][2] The original paper was rejected by reviewers of the ICLR 2013 conference. It also took months for the code to be approved for open-sourcing.[11] Other researchers helped analyse and explain the algorithm.[4]

Embedding vectors created using the Word2vec algorithm have some advantages compared to earlier algorithms[1] such as those using n-grams and latent semantic analysis. GloVe was developed by a team at Stanford specifically as a competitor, and the original paper noted multiple improvements of GloVe over word2vec.[12] Mikolov argued that the comparison was unfair as GloVe was trained on more data, and that the fastText project showed that word2vec is superior when trained on the same data.[13][11]

As of 2022, the straight Word2vec approach was described as "dated". Deep contextual models, such as ELMo (based on bidirectional LSTMs) and the transformer-based BERT, which add multiple neural-network layers on top of a word embedding layer similar to Word2vec's, have come to be regarded as the state of the art in natural language processing.[14]

Parameterization


Results of word2vec training can be sensitive to parametrization. The following are some important parameters in word2vec training.

Training algorithm


A Word2vec model can be trained with hierarchical softmax and/or negative sampling. To approximate the conditional log-likelihood a model seeks to maximize, the hierarchical softmax method uses a Huffman tree to reduce calculation. The negative sampling method, on the other hand, approaches the maximization problem by minimizing the log-likelihood of sampled negative instances. According to the authors, hierarchical softmax works better for infrequent words while negative sampling works better for frequent words and better with low dimensional vectors.[3] As training epochs increase, hierarchical softmax stops being useful.[15]
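To illustrate the Huffman tree underlying hierarchical softmax, the sketch below builds Huffman codes over a handful of hypothetical word counts; frequent words receive short codes, so computing their probabilities requires visiting fewer tree nodes.

```python
# Sketch of building a Huffman code over (hypothetical) word frequencies, the kind
# of binary tree that hierarchical softmax walks instead of a full softmax.
import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()                       # unique counter avoids comparing unlike nodes
    heap = [(f, next(tiebreak), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, code=""):
        if isinstance(node, tuple):          # internal node: recurse into both children
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:                                # leaf: a word and its path from the root
            codes[node] = code
    walk(heap[0][2])
    return codes

print(huffman_codes({"the": 120, "cat": 15, "sat": 9, "mat": 4}))
# Frequent words get short codes, so their updates touch few tree nodes.
```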

Sub-sampling


High-frequency and low-frequency words often provide little information. Words with a frequency above a certain threshold, or below a certain threshold, may be subsampled or removed to speed up training.[16]

Dimensionality


Quality of word embedding increases with higher dimensionality, but after reaching some point the marginal gain diminishes.[1] Typically, the dimensionality of the vectors is set to be between 100 and 1,000.

Context window


The size of the context window determines how many words before and after a given word are included as context words of the given word. According to the authors' note, the recommended value is 10 for skip-gram and 5 for CBOW.[3]
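The sketch below ties these parameters together, assuming the Gensim toolkit; the parameter names follow Gensim's API and the corpus path is hypothetical.

```python
# Sketch of setting the parameters discussed above (assumed Gensim API;
# the values mirror the recommendations in the text).
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="corpus.txt",  # hypothetical path: one whitespace-tokenized sentence per line
    sg=1,                      # skip-gram (sg=0 would select CBOW)
    vector_size=300,           # dimensionality, typically 100 to 1,000
    window=10,                 # context window: 10 recommended for skip-gram, 5 for CBOW
    sample=1e-5,               # sub-sampling threshold for high-frequency words
    negative=5,                # negative sampling (hs=1 would use hierarchical softmax)
)
```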

Extensions


There are a variety of extensions to word2vec.

doc2vec


doc2vec generates distributed representations of variable-length pieces of text, such as sentences, paragraphs, or entire documents.[17][18] doc2vec has been implemented in the C, Python and Java/Scala tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.

doc2vec estimates the distributed representations of documents much like how word2vec estimates representations of words: doc2vec utilizes either of two model architectures, both of which are analogous to the architectures used in word2vec. The first, Distributed Memory Model of Paragraph Vectors (PV-DM), is identical to CBOW except that it also provides a unique document identifier as a piece of additional context. The second architecture, Distributed Bag of Words version of Paragraph Vector (PV-DBOW), is identical to the skip-gram model except that it attempts to predict the window of surrounding context words from the paragraph identifier instead of the current word.[17]
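A minimal sketch of training and inference, assuming Gensim's Doc2Vec implementation (one of the Python tools referred to above); the toy documents are illustrative.

```python
# Sketch using Gensim's Doc2Vec: dm=1 selects PV-DM, dm=0 would select PV-DBOW.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
    TaggedDocument(words=["the", "dog", "lay", "on", "the", "rug"], tags=["doc1"]),
]

pv_dm = Doc2Vec(docs, dm=1, vector_size=50, window=2, min_count=1, epochs=40)

# Inference of an embedding for a new, unseen document.
new_vec = pv_dm.infer_vector(["a", "cat", "on", "a", "rug"])
print(new_vec.shape)                           # (50,)
print(pv_dm.dv.most_similar([new_vec], topn=1))  # nearest training document
```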

doc2vec also has the ability to capture the semantic 'meanings' for additional pieces of  'context' around words; doc2vec can estimate the semantic embeddings for speakers or speaker attributes, groups, and periods of time. For example, doc2vec has been used to estimate the political positions of political parties in various Congresses and Parliaments in the U.S. and U.K.,[19] respectively, and various governmental institutions.[20]

top2vec


Another extension of word2vec is top2vec, which leverages both document and word embeddings to estimate distributed representations of topics.[21][22] top2vec takes document embeddings learned from a doc2vec model and reduces them into a lower dimension (typically using UMAP). The space of documents is then scanned using HDBSCAN,[23] and clusters of similar documents are found. Next, the centroid of documents identified in a cluster is considered to be that cluster's topic vector. Finally, top2vec searches the semantic space for word embeddings located near to the topic vector to ascertain the 'meaning' of the topic.[21] The word with embeddings most similar to the topic vector might be assigned as the topic's title, whereas far away word embeddings may be considered unrelated.

As opposed to other topic models such as LDA, top2vec provides canonical 'distance' metrics between two topics, or between a topic and another embedding (word, document, or otherwise). Together with results from HDBSCAN, users can generate topic hierarchies, or groups of related topics and subtopics.

Furthermore, a user can use the results of top2vec to infer the topics of out-of-sample documents. After inferring the embedding for a new document, one must only search the space of topics for the closest topic vector.
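The pipeline described above can be sketched roughly as follows, assuming document vectors from a trained doc2vec model and the umap-learn and hdbscan packages; this is an illustration of the idea, not the top2vec library's actual API.

```python
# Rough sketch of the top2vec pipeline: reduce, cluster, take centroids as topics.
import numpy as np
import umap
import hdbscan

doc_vectors = np.load("doc_vectors.npy")   # hypothetical (num_docs, dims) array of doc embeddings

# 1. Reduce the document embeddings to a lower dimension.
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(doc_vectors)

# 2. Find dense clusters of similar documents (-1 marks noise/outliers).
labels = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean").fit_predict(reduced)

# 3. Each cluster's centroid in the original space is taken as that cluster's topic vector.
topic_vectors = np.array([
    doc_vectors[labels == k].mean(axis=0) for k in set(labels) if k != -1
])

# 4. Words whose embeddings lie nearest a topic vector describe the topic,
#    e.g. word_vectors.similar_by_vector(topic_vectors[0]) with Gensim KeyedVectors.
```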

BioVectors


An extension of word vectors for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications has been proposed by Asgari and Mofrad.[24] Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics. The results suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.[24] A similar variant, dna2vec, has shown that there is correlation between Needleman–Wunsch similarity score and cosine similarity of dna2vec word vectors.[25]
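A hedged sketch of the general idea: split sequences into overlapping n-grams (k-mers) and train an ordinary word2vec model on them; the sequences and the choice of k are illustrative.

```python
# Sketch of the BioVec/dna2vec idea: treat k-mers of a sequence as "words".
from gensim.models import Word2Vec

def kmers(sequence, k=3):
    """Return overlapping k-mers, e.g. ATGGC -> ATG, TGG, GGC."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

dna = ["ATGGCGTACGTT", "ATGGCTTACGTA", "GGCGTACGTTAA"]   # illustrative sequences
corpus = [kmers(seq, k=3) for seq in dna]                # each sequence becomes a "sentence"

model = Word2Vec(corpus, vector_size=32, window=5, min_count=1, sg=1)
print(model.wv.most_similar("ATG", topn=3))              # k-mers with similar contexts
```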

Radiology and intelligent word embeddings (IWE)


An extension of word vectors for creating a dense vector representation of unstructured radiology reports has been proposed by Banerjee et al.[26] One of the biggest challenges with Word2vec is how to handle unknown or out-of-vocabulary (OOV) words and morphologically similar words. If the Word2vec model has not encountered a particular word before, it will be forced to use a random vector, which is generally far from its ideal representation. This can particularly be an issue in domains like medicine where synonyms and related words can be used depending on the preferred style of the radiologist, and words may have been used infrequently in a large corpus.
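The out-of-vocabulary problem can be illustrated with a small sketch (assuming Gensim; the random-vector fallback shown is a naive illustration, not part of IWE).

```python
# Illustration of the OOV problem: a vocabulary lookup fails for an unseen term,
# and a naive fallback is a random vector far from any ideal representation.
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec([["pleural", "effusion", "noted"]], vector_size=50, min_count=1)

term = "hydropneumothorax"                  # hypothetical report term never seen in training
if term in model.wv.key_to_index:
    vec = model.wv[term]
else:
    vec = np.random.default_rng(0).normal(size=model.vector_size)  # arbitrary fallback
```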

IWE combines Word2vec with a semantic dictionary mapping technique to tackle the major challenges of information extraction from clinical texts, which include the ambiguity of free-text narrative style, lexical variations, use of ungrammatical and telegraphic phrases, arbitrary ordering of words, and frequent appearance of abbreviations and acronyms. Of particular interest, the IWE model (trained on one institutional dataset) successfully translated to a different institutional dataset, which demonstrates the good generalizability of the approach across institutions.

Analysis


The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. However, they note that this explanation is "very hand-wavy" and argue that a more formal explanation would be preferable.[4]

Levy et al. (2015)[27] show that much of the superior performance of word2vec or similar embeddings in downstream tasks is not a result of the models per se, but of the choice of specific hyperparameters. Transferring these hyperparameters to more 'traditional' approaches yields similar performances in downstream tasks. Arora et al. (2016)[28] explain word2vec and related algorithms as performing inference for a simple generative model of text, which involves a random walk generation process based upon a log-linear topic model. They use this to explain some properties of word embeddings, including their use to solve analogies.

Preservation of semantic and syntactic relationships

[Figure: Visual illustration of word embeddings]

The word embedding approach is able to capture multiple different degrees of similarity between words. Mikolov et al. (2013)[29] found that semantic and syntactic patterns can be reproduced using vector arithmetic. Patterns such as "Man is to Woman as Brother is to Sister" can be generated through algebraic operations on the vector representations of these words such that the vector representation of "Brother" - "Man" + "Woman" produces a result which is closest to the vector representation of "Sister" in the model. Such relationships can be generated for a range of semantic relations (such as Country–Capital) as well as syntactic relations (e.g. present tense–past tense).
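For illustration, this arithmetic can be reproduced with pretrained vectors, assuming Gensim's downloader and the publicly distributed Google News vectors.

```python
# Sketch of analogy arithmetic with pretrained vectors (assumed dataset name).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained Google News vectors (large download)

# vector("king") - vector("man") + vector("woman") is closest to vector("queen"),
# the same pattern as Brother - Man + Woman ~ Sister described above.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]
```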

This facet of word2vec has been exploited in a variety of other contexts. For example, word2vec has been used to map a vector space of words in one language to a vector space constructed from another language. Relationships between translated words in both spaces can be used to assist with machine translation of new words.[30]

Assessing the quality of a model


Mikolov et al. (2013)[1] developed an approach to assessing the quality of a word2vec model which draws on the semantic and syntactic patterns discussed above. They developed a set of 8,869 semantic relations and 10,675 syntactic relations which they use as a benchmark to test the accuracy of a model. When assessing the quality of a vector model, a user may draw on this accuracy test which is implemented in word2vec,[31] or develop their own test set which is meaningful to the corpora which make up the model. This approach offers a more challenging test than simply arguing that the words most similar to a given test word are intuitively plausible.[1]
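A sketch of running such an analogy benchmark, assuming Gensim, whose bundled copy of the questions-words test set mirrors this benchmark.

```python
# Sketch of evaluating a vector model on the analogy test set shipped with Gensim.
from gensim.test.utils import datapath
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"overall analogy accuracy: {score:.2%}")   # per-section results are in `sections`
```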

Parameters and model quality


The use of different model parameters and different corpus sizes can greatly affect the quality of a word2vec model. Accuracy can be improved in a number of ways, including the choice of model architecture (CBOW or Skip-Gram), increasing the training data set, increasing the number of vector dimensions, and increasing the window size of words considered by the algorithm. Each of these improvements comes with the cost of increased computational complexity and therefore increased model generation time.[1]

In models using large corpora and a high number of dimensions, the skip-gram model yields the highest overall accuracy, and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases. However, CBOW is less computationally expensive and yields similar accuracy results.[1]

Overall, accuracy increases with the number of words used and the number of dimensions. Mikolov et al.[1] report that doubling the amount of training data results in an increase in computational complexity equivalent to doubling the number of vector dimensions.

Altszyler and coauthors (2017) studied Word2vec performance in two semantic tests for different corpus sizes.[32] They found that Word2vec has a steep learning curve, outperforming another word-embedding technique, latent semantic analysis (LSA), when it is trained with medium to large corpus sizes (more than 10 million words). However, with a small training corpus, LSA showed better performance. Additionally, they show that the best parameter setting depends on the task and the training corpus. Nevertheless, for skip-gram models trained on medium-sized corpora with 50 dimensions, a window size of 15 and 10 negative samples seems to be a good parameter setting.

from Grokipedia
Word2vec is a family of shallow neural network architectures designed to learn continuous vector representations, or embeddings, of words from large-scale text corpora, capturing both syntactic and semantic relationships between words in a way that enables arithmetic operations on vectors to reflect linguistic analogies, such as "king" - "man" + "woman" ≈ "queen". Developed by researchers at Google, including Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, it was introduced in early 2013 as an efficient method for producing high-quality word vectors from billions of words of training text.

The core of word2vec comprises two primary models: the continuous bag-of-words (CBOW) model, which predicts a target word based on its surrounding context words to learn embeddings that emphasize frequent patterns, and the skip-gram model, which conversely predicts context words from a given target word, performing particularly well on rare words and smaller datasets. To address the computational challenges of training on massive vocabularies, the models incorporate optimizations like hierarchical softmax, which approximates the full softmax in the output layer, and negative sampling, which focuses on distinguishing real context-target pairs from randomly sampled noise words during gradient updates. These techniques allow word2vec to scale to datasets with over 100 billion words, producing embeddings typically in 300-dimensional spaces that outperform prior methods on benchmarks such as word similarity tasks from datasets like WordSim-353 and word-analogy questions.

Since its release, Word2vec has profoundly influenced natural language processing by providing foundational static word embeddings that boost performance in downstream applications, while inspiring later methods, from static embeddings such as GloVe and fastText to contextual models like BERT. Its open-source implementation and demonstrated efficacy on real-world corpora have led to widespread adoption, with the original publications garnering over 50,000 citations and continuing relevance in resource-constrained environments despite the rise of transformer-based models.

Introduction

Definition and Purpose

Word2vec is a technique that employs a two-layer neural network to produce dense vector representations, known as embeddings, of words derived from large-scale unstructured text corpora. These embeddings capture the distributional properties of words based on their co-occurrence patterns in context. The core purpose of Word2vec is to map words into a continuous vector space where semantically and syntactically similar words are positioned nearby, thereby enabling the encoding of meaningful linguistic relationships without relying on labeled data for word meanings. This approach facilitates downstream applications in natural language processing by providing a foundational representation that improves model performance on those tasks. For instance, in Word2vec embeddings, the vector arithmetic operation vec("king") - vec("man") + vec("woman") yields a result closely approximating vec("queen"), illustrating how the model implicitly learns analogies and relational semantics from raw text. Word2vec includes two primary architectures, the continuous bag-of-words (CBOW) and skip-gram models, to achieve these representations efficiently.

Historical Development

Word2vec was developed in 2013 by Tomas Mikolov and colleagues at Google as an efficient approach to learning distributed representations of words from large-scale text corpora. This work built upon earlier neural language models, particularly the foundational neural probabilistic language model introduced by Yoshua Bengio and co-authors in 2003, which demonstrated the potential of neural networks for capturing semantic relationships in words but was computationally intensive for large datasets. Mikolov's team addressed these limitations by proposing simpler architectures that enabled faster training while maintaining high-quality embeddings.

The key milestones in Word2vec's development were marked by two seminal publications. The first, "Efficient Estimation of Word Representations in Vector Space," presented at the ICLR 2013 workshop, introduced the core models and training techniques for computing continuous vector representations. This was followed by "Distributed Representations of Words and Phrases and their Compositionality," published at NIPS 2013, which extended the framework to handle phrases and improved compositional semantics, significantly advancing the practicality of word embeddings. In 2023, the second paper received the NeurIPS Test of Time Award, recognizing its lasting influence on the field. These papers quickly garnered widespread attention due to their empirical success on semantic tasks, such as word analogy solving, and their scalability to billions of words. Shortly after publication, the team released an open-source implementation in C on Google Code in July 2013, now archived, with the code preserved and available via mirrors on platforms such as GitHub. This release transitioned Word2vec from internal Google use to a foundational tool in natural language processing, influencing subsequent models such as GloVe, which leveraged global co-occurrence statistics inspired by Word2vec's predictive approach, and BERT, which built on static embeddings as a precursor to contextual representations.

Post-2013, community-driven improvements enhanced Word2vec's accessibility and performance. By 2014, the Python library Gensim integrated Word2vec with optimized interfaces for topic modeling and similarity tasks, enabling easier experimentation on diverse corpora. Further advancements included GPU acceleration in deep learning frameworks starting around 2016, which allowed training on massive datasets with multi-GPU clusters, achieving speedups of up to 7.5 times without accuracy loss and addressing gaps in the original CPU-based code. These developments solidified Word2vec's enduring impact on word embedding techniques.

Model Architectures

Continuous Bag-of-Words (CBOW)

The Continuous Bag-of-Words (CBOW) architecture in word2vec predicts a target word based on the surrounding context words, treating the context as an unordered bag to efficiently learn word embeddings. In this model, the input consists of one-hot encoded vectors for the context words selected from a symmetric window around the target position, which are then averaged to produce a single input vector projected onto the hidden layer. The output layer applies a softmax to generate a probability distribution over the entire vocabulary, selecting the target word as the predicted output.

The primary training goal of CBOW is to maximize the conditional probability of observing the target word given its context, thereby capturing semantic relationships through the learned embeddings, where similar contexts lead to proximate word vectors. This objective smooths contextual variations in the training data and excels at representing frequent words, as the averaging process emphasizes common patterns over noise. CBOW demonstrates strengths in computational efficiency and training speed, achieved by averaging multiple context vectors into one, which lowers the complexity compared to processing each context word separately. For instance, in the sentence "the cat sat on the mat," CBOW would use the context set {"the", "cat", "on", "the", "mat"} to predict the target word "sat," averaging their representations to inform the prediction. Unlike the Skip-gram architecture, which reverses the prediction direction to forecast context from the target and better handles rare words, CBOW's context-to-target approach enables quicker convergence and greater suitability for smaller datasets.

Skip-gram

The Skip-gram architecture in Word2vec predicts surrounding context words given a target word, reversing the directionality of the continuous bag-of-words (CBOW) model to emphasize target-to-context prediction. The input consists of a one-hot encoded vector representing the target word from the vocabulary. This vector is projected through a hidden layer, where the weight matrix serves as the embedding lookup, yielding a dense vector representation of the target word. The output layer then computes unnormalized scores for every word in the vocabulary by taking the dot product of the target embedding with each candidate context embedding, followed by an independent softmax normalization for each context position to yield probability distributions over possible context words.

The training objective for Skip-gram is to maximize the average log-probability of observing the actual context words given the target word, aggregated across all positions within a predefined context window size $c$ (typically 2 to 5). For a sentence with words $w_1, w_2, \dots, w_T$, this involves, for each target $w_t$, predicting the context words $w_{t+j}$ for $-c \leq j \leq c$ and $j \neq 0$. This setup generates multiple prediction tasks per target word occurrence, which proves advantageous for infrequent words: rare terms appear less often overall, but when they do serve as targets, they trigger several context predictions, amplifying the training signal for their embeddings compared to models that underweight them.

Consider the sentence "the cat sat on the mat" with a context window of 2. Selecting "cat" as the target, Skip-gram would train to predict {"the", "sat", "on"} as context words, treating each prediction independently. If "mat" (a potentially rarer term) is the target, it would predict {"on", "the"}, ensuring dedicated optimization for its embedding. This multiplicity of outputs per target contrasts with CBOW's single prediction, allowing Skip-gram to derive richer representations from limited occurrences of uncommon words. Skip-gram's strengths lie in producing higher-quality embeddings for rare and infrequent terms, as the model directly optimizes the target word's representation against diverse contexts, capturing nuanced semantic relationships that CBOW might average out. However, this comes at the cost of increased computational demands, making it slower to train than CBOW, particularly with large vocabularies, due to the need for multiple softmax operations per training example. In comparison to CBOW, which offers greater efficiency for frequent words and smaller datasets, Skip-gram prioritizes representational accuracy for less common vocabulary elements.

Mathematical Foundations

Objective Functions

The objective functions in Word2vec are designed to learn word embeddings by maximizing the likelihood of correctly predicting words within a local context window, based on the distributional hypothesis that words with similar meanings appear in similar contexts. The general form of the objective is a log-likelihood maximization over the training corpus, expressed as the sum of log probabilities for target-context word pairs, $\sum \log P(w_{\text{target}} \mid \text{context})$, where the summation occurs over all such pairs derived from the corpus. This formulation encourages the model to assign high probability to observed word co-occurrences while assigning low probability to unobserved ones, thereby capturing semantic and syntactic relationships in the embedding space.

For the Continuous Bag-of-Words (CBOW) architecture, the objective focuses on predicting the target word $w_c$ given its surrounding context words $w_{c-k}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+k}$ within a window of size $k$. The conditional probability is modeled as

$$P(w_c \mid w_{c-k}, \dots, w_{c-1}, w_{c+1}, \dots, w_{c+k}) = \frac{\exp(\mathbf{v}_{w_c}^\top \cdot \overline{\mathbf{v}}_{\text{context}})}{\sum_{w' \in V} \exp(\mathbf{v}_{w'}^\top \cdot \overline{\mathbf{v}}_{\text{context}})},$$

where $\mathbf{v}_w$ denotes the embedding vector for word $w$, $\overline{\mathbf{v}}_{\text{context}}$ is the average of the context word embeddings, and $V$ is the vocabulary. The softmax function normalizes the scores over the vocabulary to produce a probability distribution, and the overall CBOW objective sums the log of this probability across all target-context instances in the corpus.

In the Skip-gram architecture, the objective reverses the prediction task by estimating the probability of each context word given the target word $w_t$, treating the context as a product of independent conditional probabilities for each surrounding word $w_{t+j}$ where $-c \leq j \leq c$ and $j \neq 0$. Specifically,

$$P(\text{context} \mid w_t) = \prod_{j=-c,\ j \neq 0}^{c} \frac{\exp(\mathbf{v}_{w_{t+j}}^\top \cdot \mathbf{v}_{w_t})}{\sum_{w' \in V} \exp(\mathbf{v}_{w'}^\top \cdot \mathbf{v}_{w_t})}.$$

The Skip-gram objective then maximizes $\sum_{t=1}^{T} \sum_{-c \leq j \leq c,\ j \neq 0} \log P(w_{t+j} \mid w_t)$, where $P$ is the softmax probability given above. In practice, this is approximated using techniques such as hierarchical softmax or negative sampling, as described below. This approach performs particularly well on rare words and smaller datasets, while capturing finer-grained relationships compared to CBOW.

Negative Sampling Approximation

The computation of the full softmax function in Word2vec's objective functions, which normalizes probabilities over the entire vocabulary of size $V$ (often millions of words), incurs an $O(V)$ time complexity per parameter update, rendering it computationally prohibitive for training on large corpora. To address this, negative sampling provides an efficient approximation by modeling the softmax as a binary classification task between the true context-target pair (positive sample) and artificially generated noise words (negative samples). Specifically, it approximates the conditional probability $P(w \mid c)$ for a target word $w$ and context $c$ using the sigmoid function on their embedding dot product for the positive pair, combined with terms that push away $K$ negative samples drawn from a noise distribution $P_n(w)$.

The noise distribution is defined as $P_n(w) = \frac{f(w)^{3/4}}{Z}$, where $f(w)$ is the unigram frequency of word $w$ and $Z$ is the normalization constant; raising the frequency to the $3/4$ power adjusts the sampling to favor moderately frequent words, improving representation quality over uniform or pure unigram sampling. For the Skip-gram model, the negative sampling objective for a target-context pair becomes

$$\log \sigma(\mathbf{v}_w^\top \mathbf{v}_c) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{w_i}^\top \mathbf{v}_w) \right],$$

where $\mathbf{v}_w$ and $\mathbf{v}_c$ are the target and context embeddings, respectively, and $\sigma(x) = (1 + e^{-x})^{-1}$. During each update, only the embeddings of the target, the context word, and the $K$ negative words are adjusted, avoiding the full softmax over the vocabulary. This approach reduces the per-update complexity from $O(V)$ to $O(K)$, with typical values of $K$ ranging from 5 to 20, yielding substantial speedups (up to 100 times faster than full softmax) while producing comparable or better embeddings, particularly for frequent words. As an alternative approximation mentioned in the original work, hierarchical softmax employs a binary tree over the vocabulary to compute probabilities via a path of length $O(\log V)$, offering logarithmic efficiency without sampling.
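Below is a NumPy sketch of a single negative-sampling objective term, with noise words drawn from the unigram distribution raised to the 3/4 power; the vocabulary size, dimensions, and random data are assumptions.

```python
# Sketch of one negative-sampling objective term for an observed target/context pair.
import numpy as np

rng = np.random.default_rng(0)
V, d, K = 1000, 100, 5                      # vocabulary size, embedding dimension, negatives
W_in = rng.normal(scale=0.1, size=(V, d))   # target ("input") embeddings
W_out = rng.normal(scale=0.1, size=(V, d))  # context ("output") embeddings

counts = rng.integers(1, 1000, size=V)      # unigram counts f(w), here synthetic
P_n = counts ** 0.75
P_n = P_n / P_n.sum()                       # noise distribution P_n(w) ~ f(w)^(3/4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target, context = 3, 17                     # indices of an observed (w, c) pair
negatives = rng.choice(V, size=K, p=P_n)    # K noise words sampled from P_n

# Objective term: log sigma(v_c . v_w) + sum_i log sigma(-v_{w_i} . v_w)
pos = np.log(sigmoid(W_out[context] @ W_in[target]))
neg = np.log(sigmoid(-(W_out[negatives] @ W_in[target]))).sum()
print(pos + neg)                            # only these K+2 rows would receive gradients
```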

Training Process

Optimization Techniques

The primary optimization technique in Word2Vec training is stochastic gradient descent (SGD), which minimizes the model's loss function through backpropagation across its shallow structure. This approach computes gradients for the input and output embedding matrices based on context-target word pairs, updating parameters incrementally to capture semantic relationships. In the original formulation, SGD employs an initial learning rate of 0.025, which decays linearly over the course of training to stabilize convergence. Modern reimplementations, such as those in the Gensim library, retain this SGD foundation but incorporate adaptive decay schedules to handle varying corpus sizes efficiently. Some contemporary frameworks, such as PyTorch-based versions, substitute SGD with adaptive optimizers like Adam for per-parameter learning rates, often yielding faster training on smaller datasets while preserving embedding quality.

The training loop processes the corpus sequentially, generating positive word pairs from the skip-gram or CBOW architecture and performing updates after each pair or mini-batch, enabling scalable handling of large-scale text data. Convergence is generally achieved after 1 to 5 epochs on corpora exceeding billions of words, with progress tracked via decreasing loss or proxy metrics like word-analogy accuracy. As an alternative to full softmax computation, hierarchical softmax structures the output as a binary Huffman tree, where non-leaf nodes represent probability decisions and words occupy leaves, reducing per-update complexity from $O(V)$ to $O(\log V)$. This method proves particularly beneficial for large vocabularies, accelerating training without substantial accuracy loss.
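A sketch of the linearly decaying learning-rate schedule described above; the decay floor chosen here is an assumption.

```python
# Sketch of a linear learning-rate decay starting at 0.025 and falling toward a small floor.
def linear_lr(words_processed, total_words, start=0.025, floor=0.0001):
    progress = min(words_processed / total_words, 1.0)
    return max(start * (1.0 - progress), floor)

total = 1_000_000_000   # illustrative corpus size in words
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{frac:.0%} of corpus -> lr = {linear_lr(frac * total, total):.5f}")
```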

Data Preparation Methods

Data preparation for Word2vec involves several preprocessing steps to transform raw text into a format suitable for training, ensuring efficiency and quality in learning word representations. Initial tokenization typically splits the text into words using whitespace and punctuation as delimiters, followed by lowercasing to normalize case. Rare words appearing fewer than 5 times are removed to reduce noise and computational overhead, resulting in a vocabulary size ranging from 100,000 to 1 million words depending on the corpus scale. To balance the influence of frequent and rare words during training, subsampling is applied to high-frequency words: each occurrence of a word $w$ is discarded with probability $P(w) = 1 - \sqrt{\frac{10^{-5}}{f(w)}}$, where $f(w)$ is the word's frequency in the corpus.
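A sketch of this sub-sampling rule; the word frequencies used are illustrative.

```python
# Sketch of sub-sampling: each occurrence of a word is dropped with probability
# 1 - sqrt(t / f(w)), so very frequent words are thinned out while rare words are kept.
import random

t = 1e-5
freqs = {"the": 0.05, "cat": 0.0004, "embedding": 0.00001}   # illustrative f(w) values

def keep(word):
    discard_prob = max(0.0, 1.0 - (t / freqs[word]) ** 0.5)
    return random.random() >= discard_prob

sample = [w for w in ["the", "cat", "the", "embedding", "the"] if keep(w)]
print(sample)   # most occurrences of "the" are dropped; "embedding" is always kept
```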