Bigram
from Wikipedia

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.

The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, and speech recognition.

Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).

Applications


Bigrams, along with other n-grams, are used in most successful language models for speech recognition.[1]

Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.

Bigram frequency is one approach to statistical language identification.

Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram,[2] or words containing a string of repeated bigrams, such as logogogue.[3]

Bigram frequency in the English language


The frequency of the most common letter bigrams in a large English corpus is:[4]

th 3.56%       of 1.17%       io 0.83%
he 3.07%       ed 1.17%       le 0.83%
in 2.43%       is 1.13%       ve 0.83%
er 2.05%       it 1.12%       co 0.79%
an 1.99%       al 1.09%       me 0.79%
re 1.85%       ar 1.07%       de 0.76%
on 1.76%       st 1.05%       hi 0.76%
at 1.49%       to 1.05%       ri 0.73%
en 1.45%       nt 1.04%       ro 0.73%
nd 1.35%       ng 0.95%       ic 0.70%
ti 1.34%       se 0.93%       ne 0.69%
es 1.34%       ha 0.93%       ea 0.69%
or 1.28%       as 0.87%       ra 0.69%
te 1.20%       ou 0.87%       ce 0.65%

from Grokipedia
A bigram, or 2-gram, is a contiguous sequence of two adjacent elements from a larger sequence of discrete symbols, such as characters, syllables, or words, serving as a fundamental unit in statistical analysis of sequential data. In natural language processing (NLP) and computational linguistics, bigrams are most commonly used to model the conditional probability of an element given its immediate predecessor, enabling simple probabilistic predictions in language models.

The origins of bigrams trace back to Andrey Markov's pioneering 1913 work on Markov chains, where he analyzed letter transitions—effectively bigrams—in a 20,000-character excerpt from Alexander Pushkin's verse novel Eugene Onegin to demonstrate statistical dependencies in text. This approach was later adapted by Claude Shannon in his seminal 1948 paper on information theory, which employed word bigrams to estimate the entropy and predictability of English prose, laying groundwork for quantitative models of communication. Bigrams gained renewed prominence in the 1970s through applications in speech recognition, notably by Fred Jelinek's team at IBM and James Baker at Carnegie Mellon University, who trained large-scale n-gram models on corpora to improve automatic transcription accuracy.

In modern usage, bigrams form the building blocks of n-gram language models, where probabilities are estimated from frequency counts in training corpora, such as P(word_i | word_{i-1}) = count(word_{i-1}, word_i) / count(word_{i-1}). These models power diverse applications, including speech recognition and machine translation systems, grammar and spelling correction, and early training for text generation, though they are often augmented with smoothing techniques such as Laplace or Kneser-Ney smoothing to handle unseen bigrams. Model performance is typically measured by perplexity, which quantifies how well the bigrams predict held-out text, with lower values indicating better approximation of true language distributions. Despite limitations in capturing long-range dependencies, bigrams remain computationally efficient and integral to hybrid systems combining statistical and neural methods.

Fundamentals

Definition

A bigram is a contiguous sequence of two adjacent elements drawn from a larger sequence, such as symbols, letters, or words in text or speech. This concept captures pairwise dependencies within the sequence, serving as a fundamental unit for analyzing patterns in linguistic data. For instance, in the word "the," the bigrams are "th" and "he"; in the sentence "the cat sat," a word-level bigram would be "the cat." The term "bigram" combines the prefix "bi-" (meaning two) with "gram" (referring to a written symbol or unit), and it specifically denotes the n=2 case of the more general n-gram framework. The concept of such pairwise sequences was systematically explored by Andrey Markov in 1913 using letter transitions in text, and later applied by Claude Shannon in his 1951 work on estimating the entropy of printed English, where he used digram frequencies to approximate language redundancy. The broader term "n-gram" was introduced in the late 1950s, building on Shannon's foundations. Bigrams have since become essential in natural language processing for tasks like language modeling.

Mathematical Representation

In statistical language modeling, a bigram is formally represented through counts derived from a corpus. The unigram count $C(w_i)$ denotes the number of occurrences of a single word $w_i$ in the corpus, while the bigram count $C(w_i, w_{i+1})$ represents the frequency of the consecutive word pair $w_i$ followed by $w_{i+1}$. The probability of a bigram is derived from Markov-chain principles, approximating the likelihood of the next word given the previous one via maximum likelihood estimation. Specifically,

$$P(w_{i+1} \mid w_i) = \frac{C(w_i, w_{i+1})}{C(w_i)},$$

which estimates the conditional probability as the relative frequency of the bigram divided by the unigram count of the preceding word. This formulation stems from the chain rule of probability, where the joint probability of a sequence is factored as

$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n-1} P(w_{i+1} \mid w_i),$$

and each term is computed from empirical counts to maximize the likelihood of the observed data.

Raw maximum likelihood estimates often fail due to sparse data, resulting in zero probabilities for unseen bigrams, which can lead to invalid models. To address this, smoothing techniques redistribute probability mass; a simple and widely used method is Laplace (add-one) smoothing, which adds a pseudocount of 1 to all bigram counts. The smoothed bigram probability is then given by

$$P(w_{i+1} \mid w_i) = \frac{C(w_i, w_{i+1}) + 1}{C(w_i) + V},$$

where $V$ is the size of the vocabulary. This ensures that every possible bigram receives a small positive probability, preventing zero estimates while preserving the relative frequencies of observed events.

Bigrams also play a central role in measuring the uncertainty of language models through entropy. The entropy $H$ of a bigram model quantifies the average information content per word and is computed as the expected conditional entropy:

$$H = -\sum_{w_i} P(w_i) \sum_{w_{i+1}} P(w_{i+1} \mid w_i) \log_2 P(w_{i+1} \mid w_i).$$

This expression averages the negative log-probabilities over the unigram distribution $P(w_i)$ and the conditional bigram probabilities, providing a theoretical measure of predictability. Lower entropy values indicate more predictable sequences, with bigram models typically yielding higher entropy than higher-order n-grams due to their limited context.
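To make these estimates concrete, the following minimal Python sketch (the function names train_bigram_counts, mle_prob, and laplace_prob are illustrative, not from any standard library) computes the raw and add-one smoothed bigram probabilities from a toy corpus:

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Count unigrams and adjacent bigram pairs in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def mle_prob(prev, nxt, unigrams, bigrams):
    """Maximum-likelihood estimate P(nxt | prev); zero for unseen pairs."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, nxt)] / unigrams[prev]

def laplace_prob(prev, nxt, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate (C(prev, nxt) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + vocab_size)

corpus = "the cat sat on the mat the cat slept".split()
uni, bi = train_bigram_counts(corpus)
V = len(uni)
print(mle_prob("the", "cat", uni, bi))         # 2/3: "the cat" occurs twice, "the" three times
print(laplace_prob("the", "dog", uni, bi, V))  # unseen bigram still gets a small positive probability
```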

Types

Word Bigrams

Word bigrams consist of pairs of consecutive words extracted from tokenized text, serving as a fundamental unit for modeling local dependencies in natural language processing. The formation process begins with tokenization, where raw text is split into words using whitespace and punctuation as delimiters; punctuation marks are typically removed or treated as separate tokens to avoid fragmenting meaningful units, while the text is often converted to lowercase to normalize for case and reduce vocabulary size. For instance, in the sentence "The quick brown fox jumps.", tokenization yields the sequence ["the", "quick", "brown", "fox", "jumps"], excluding the period as non-word content. From this tokenized sequence, bigrams are generated by sliding a window of size two across the tokens, producing pairs such as ("the", "quick"), ("quick", "brown"), ("brown", "fox"), and ("fox", "jumps"). To account for sentence boundaries, special tokens such as <s> (start-of-sentence) and </s> (end-of-sentence) are prepended and appended, respectively, resulting in additional bigrams: (<s>, "the") and ("jumps", </s>). This step-by-step extraction ensures that bigrams capture the full contextual flow, including sentence-initial and terminal patterns, as is standard in n-gram language modeling.

A key property of word bigrams is their ability to model collocations—conventional word pairings that occur more frequently than expected by chance, such as "New York" treated as a fixed unit rather than independent words. These structures highlight syntactic patterns, including part-of-speech transitions like adjective-noun sequences (e.g., "quick brown"), which aid in representing grammatical dependencies without full syntactic parsing. Challenges in word bigram formation include handling out-of-vocabulary (OOV) words, which are terms absent from the predefined corpus vocabulary; these are commonly replaced with a special <UNK> token to maintain model continuity and prevent zero probabilities in downstream applications. Additionally, variations in tokenization across languages or domains can introduce inconsistencies, such as ambiguous handling of contractions (e.g., "can't" as one token or as "can" and "'t"), necessitating domain-specific preprocessing rules.
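A minimal illustration of this extraction pipeline, assuming a naive regular-expression tokenizer and the <s>/</s> boundary markers described above (the helper name word_bigrams is hypothetical):

```python
import re

def word_bigrams(sentence, sos="<s>", eos="</s>"):
    """Lowercase the sentence, keep only word tokens, pad with boundary
    markers, and slide a window of size two over the result."""
    tokens = re.findall(r"[a-z']+", sentence.lower())  # naive tokenizer, for illustration only
    padded = [sos] + tokens + [eos]
    return list(zip(padded, padded[1:]))

print(word_bigrams("The quick brown fox jumps."))
# [('<s>', 'the'), ('the', 'quick'), ('quick', 'brown'),
#  ('brown', 'fox'), ('fox', 'jumps'), ('jumps', '</s>')]
```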

Character Bigrams

Character bigrams consist of pairs of consecutive characters extracted from untokenized text, typically after normalization such as converting to lowercase and removing or ignoring spaces and punctuation to focus on alphabetic sequences. For instance, the text "The quick" might yield bigrams like ("t","h"), ("h","e"), and ("q","u"), preserving the sequential order without word boundaries. This formation allows for straightforward computation over raw strings, enabling analysis of subword structures. These bigrams capture digraphs, which are common letter pairs in a language, such as "th" (occurring in about 1.52% of English digraph positions), "he" (1.28%), and "in" (0.94%), reflecting orthographic patterns tied to phonology and spelling conventions. In English, digraphs like "th" often represent specific sounds (/θ/ or /ð/), aiding in the study of how letter combinations encode phonological information and influence spelling regularity. Their frequencies reveal linguistic redundancies, such as repeated pairs indicating morphological or phonetic constraints.

Compared to word bigrams, character bigrams offer advantages in handling out-of-vocabulary (OOV) words and languages lacking explicit spaces, like Chinese, where they avoid the need for segmentation by treating continuous character streams as fixed-length units. This makes them robust for proper nouns, neologisms, or agglutinative scripts, as no dictionary or tokenizer is required, enhancing their applicability in diverse NLP tasks. Consider the word "hello": after normalization, it produces the bigrams ("h","e"), ("e","l"), ("l","l"), and ("l","o"). The repeated ("l","l") highlights redundancy in English orthography, where such pairs underscore letter frequencies and potential for predictive modeling in text generation or correction.
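The same sliding-window idea applies directly to character streams. The short sketch below (the helper name char_bigram_freqs is hypothetical) counts character bigrams and their relative frequencies after lowercasing and dropping non-letters:

```python
from collections import Counter

def char_bigram_freqs(text):
    """Normalize to lowercase letters only, count adjacent character pairs,
    and return relative frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(zip(letters, letters[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

print(char_bigram_freqs("hello"))
# {('h', 'e'): 0.25, ('e', 'l'): 0.25, ('l', 'l'): 0.25, ('l', 'o'): 0.25}
```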

Other Variants

Phonetic bigrams, also known as phone bigrams, consist of pairs of adjacent phonetic units or phones, such as the sequence /k/ followed by /æ/ in the pronunciation of "cat." These bigrams model transitions between sounds in spoken language and are employed in speech recognition systems to capture phonetic dependencies and improve decoding accuracy.

Byte bigrams refer to consecutive pairs of bytes in binary data, utilized in computing for tasks such as file type identification and data compression. In file analysis, byte bigram frequencies help classify fragmented or unknown files by generating feature vectors from their binary content, distinguishing formats like executables from documents. For compression, byte pair encoding (BPE) iteratively identifies and replaces the most frequent byte bigrams with a single unused byte value, reducing redundancy in text or binary streams, as originally proposed for general-purpose data compression.

Skip bigrams extend the bigram concept to non-adjacent pairs, such as (w_i, w_{i+2}), where one or more intervening elements are skipped, allowing capture of longer-range dependencies, in contrast with standard contiguous bigrams that require immediate adjacency. This variant is applied in natural language processing to enhance event prediction and statistical modeling by incorporating gapped sequences.

In bioinformatics, domain-specific bigrams include DNA bigrams, which are pairs of adjacent nucleotides like "AT" or "CG" in genomic sequences, used to analyze compositional patterns, classify DNA texts over the four-letter alphabet {A, C, G, T}, and extract features for predicting protein-DNA interactions. These bigrams facilitate motif detection and sequence similarity assessments in large-scale genomic studies.
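As a rough illustration of the skip-bigram variant, the following sketch (the function name skip_bigrams and the max_gap parameter are assumptions for this example) enumerates gapped pairs alongside ordinary contiguous ones:

```python
def skip_bigrams(tokens, max_gap=1):
    """Pairs (w_i, w_j) with at most `max_gap` intervening tokens;
    max_gap=0 reproduces ordinary contiguous bigrams."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

print(skip_bigrams("the quick brown fox".split(), max_gap=1))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'brown'),
#  ('quick', 'fox'), ('brown', 'fox')]
```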

Applications

Natural Language Processing

In natural language processing (NLP), bigrams serve as a foundational tool for language modeling, where they estimate the probability of a word given its immediate predecessor to predict the next word in a sequence. This approach, known as a bigram model, approximates the joint probability of word sequences by assuming Markovian dependence, enabling tasks like next-word prediction. A seminal exposition of bigram models highlights their use in calculating conditional probabilities $P(w_i \mid w_{i-1})$ from training corpora, which underpins probabilistic text prediction. The effectiveness of bigram models is often evaluated using perplexity, a metric that quantifies predictive uncertainty as $PP = 2^{H}$, where $H$ represents the cross-entropy of the model with respect to the test data. Lower perplexity indicates better modeling of local patterns, with bigrams typically achieving reasonable scores on short-range dependencies in English text, though performance degrades on longer sequences.

Bigrams find practical applications in spell-checking, where context-aware systems leverage bigram probabilities to rank correction candidates by their likelihood in the surrounding text. In machine translation, bigram models contribute to fluency modeling and alignment by scoring candidate word-sequence probabilities. Additionally, bigrams support text generation by sampling subsequent words based on prior ones, producing coherent short phrases in early probabilistic systems. Despite their utility, bigram models exhibit limitations in capturing long-range dependencies, as they ignore context beyond the immediate prior word, often resulting in sparse probability estimates and degraded accuracy for complex sentences; this has motivated extensions to higher-order n-grams like trigrams for improved modeling without delving into their specifics here.

A notable case study involves bigram-based sentiment analysis, where phrase-level bigrams detect polarity shifts from negation or intensifiers. For example, the bigram "not good" conveys negative sentiment despite the positive valence of "good," whereas "very good" amplifies positivity, allowing models to classify reviews more accurately than unigram approaches by incorporating local word interactions.
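A hedged sketch of the perplexity evaluation described above, assuming add-one smoothing and whitespace-tokenized text rather than any particular toolkit (the function name bigram_perplexity is illustrative):

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of a Laplace-smoothed bigram model on held-out tokens:
    PP = 2 ** (average negative log2 probability per bigram)."""
    uni = Counter(train_tokens)
    bi = Counter(zip(train_tokens, train_tokens[1:]))
    vocab = len(set(train_tokens) | set(test_tokens))
    log_prob, n = 0.0, 0
    for prev, nxt in zip(test_tokens, test_tokens[1:]):
        p = (bi[(prev, nxt)] + 1) / (uni[prev] + vocab)  # add-one smoothing
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)

train = "the cat sat on the mat and the cat slept on the mat".split()
test = "the cat sat on the mat".split()
print(round(bigram_perplexity(train, test), 2))  # lower values mean the model predicts the test text better
```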

Cryptanalysis

Bigrams play a crucial role in frequency analysis for cryptanalyzing monoalphabetic substitution ciphers, where the distribution of letter pairs in the ciphertext is compared to expected plaintext frequencies to map substitutions. For instance, the high frequency of the bigram "th" in English (occurring approximately 3.6% of the time) allows cryptanalysts to identify likely mappings for frequently appearing pairs in the encrypted text, facilitating partial or full recovery of the plaintext. This method extends single-letter frequency analysis by providing additional structural clues, making it more effective against ciphers that preserve bigram statistics despite letter scrambling.

Historical applications of bigram-based techniques trace back to early Islamic cryptology, where Arab scholars such as Al-Kindi in the 9th century pioneered frequency analysis for individual letters in his treatise A Manuscript on Deciphering Cryptographic Messages, laying the groundwork for later adaptations to bigrams that exploited pairwise letter dependencies in languages like Arabic. By the 19th century, these ideas evolved into more systematic approaches, such as the Kasiski examination, published by Friedrich Kasiski in 1863, which detects the key length in polyalphabetic ciphers like the Vigenère by measuring distances between repeated bigrams in the ciphertext—for example, multiple occurrences of a bigram like "QU" separated by intervals that are factors of the key length reveal its periodicity. During World War II, bigram analysis contributed to Allied efforts in breaking Enigma variants, particularly naval versions where German operators used bigram substitution tables to encode message keys and settings, concealing them from direct frequency attacks; cryptanalysts at Bletchley Park exploited recovered bigram tables and known-plaintext cribs to streamline decryption processes.
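As a toy illustration of bigram frequency attacks, the sketch below tallies the most common letter pairs in a short ciphertext (here an Atbash encipherment of an English sentence, chosen purely for demonstration) so they can be matched against the common English bigrams listed earlier in this article; the names ENGLISH_TOP and top_ciphertext_bigrams are assumptions of this example:

```python
from collections import Counter

# Reference values from the letter-bigram table above (percent of bigram positions).
ENGLISH_TOP = {"th": 3.56, "he": 3.07, "in": 2.43, "er": 2.05, "an": 1.99}

def top_ciphertext_bigrams(ciphertext, k=5):
    """Rank the most frequent letter pairs in a ciphertext so they can be
    matched against common English bigrams such as 'th' and 'he'."""
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    counts = Counter(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return [(bg, 100 * n / total) for bg, n in counts.most_common(k)]

# Toy ciphertext: an Atbash encipherment of "the quick brown fox jumps over the lazy dog".
ciphertext = "GSV JFRXP YILDM ULC QFNKH LEVI GSV OZAB WLT"
for bigram, pct in top_ciphertext_bigrams(ciphertext):
    print(f"{bigram}: {pct:.1f}%  (candidate match for a frequent English pair)")
```

On such a short text the counts are noisy, but the repeated "gs" and "sv" pairs (from the two occurrences of "GSV") already suggest a mapping to "th" and "he".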

Text Compression

Bigrams play a key role in lossless text compression by capturing dependencies between consecutive symbols, enabling probabilistic models that exploit redundancy to encode text more efficiently than treating symbols independently. In arithmetic coding paired with bigram models, the probability of each subsequent character is estimated based on the preceding one, allowing the encoder to represent the entire message as a single fractional number within a [0,1) interval; narrower intervals for high-probability bigrams reduce the average bits required per character, approaching the Shannon entropy limit.

A prominent application is Prediction by Partial Matching (PPM), an adaptive compression scheme that incorporates bigram (order-1) contexts alongside higher orders, using partial matching to predict symbols when exact contexts are unseen; this method dynamically updates probability estimates and employs arithmetic coding for entropy encoding, forming the basis for variants in compression software such as 7-Zip's PPMd filter. In PPM, bigram probabilities guide the selection of escape mechanisms to lower-order models, balancing adaptation and compression effectiveness. For instance, in compressing the sequence "the the," a bigram model recognizes the high probability of "the" following "the " (due to common repetitions in English), assigning it a short code interval in arithmetic coding and thus requiring fewer bits than encoding each word independently.

Bigram models paired with arithmetic coding typically achieve around 1.5 bits per character for English text, a notable improvement over unigram Huffman coding's approximately 4.7 bits per character, as demonstrated in trained PPM variants on literary corpora. This performance underscores bigrams' utility in nearing English's estimated entropy of 1-1.3 bits per character without higher-order extensions.
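The sketch below is not a full arithmetic coder; it only estimates the ideal code length, the negative log2 probability of a message under a Laplace-smoothed character-bigram model, which an arithmetic coder driven by the same model would closely approach (the function name bigram_code_length and the training string are assumptions of this example):

```python
import math
from collections import Counter

def bigram_code_length(train_text, message):
    """Ideal code length in bits, -log2 P(message), under a Laplace-smoothed
    character-bigram model; an arithmetic coder using the same model would
    come within a few bits of this total."""
    uni = Counter(train_text)
    bi = Counter(zip(train_text, train_text[1:]))
    vocab = len(set(train_text) | set(message))
    bits = 0.0
    for prev, nxt in zip(message, message[1:]):
        p = (bi[(prev, nxt)] + 1) / (uni[prev] + vocab)
        bits += -math.log2(p)
    return bits

train = "the theory of the thing is that the theme repeats the same letters"
message = "the the"
print(f"{bigram_code_length(train, message):.1f} bits for {len(message)} characters")
```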

Statistical Properties

Frequency Estimation

Frequency estimation for bigrams involves deriving probabilities from observed counts in a corpus, addressing challenges such as data sparsity, where many possible bigrams do not appear and are assigned zero probability, and high variance in estimates for rare events. The maximum likelihood estimator (MLE) provides a baseline approach by computing the raw relative frequency of a bigram given its context. For a bigram $w_i w_{i+1}$, the MLE probability is given by

$$P_{\text{MLE}}(w_{i+1} \mid w_i) = \frac{C(w_i w_{i+1})}{C(w_i)},$$

where $C(\cdot)$ denotes the count in the training corpus. This method maximizes the likelihood of the training data but suffers from high variance for bigrams with low counts and assigns zero probability to unseen bigrams, causing issues in applications requiring smooth probability distributions.

To handle sparsity, particularly for unseen bigrams, the Good-Turing estimator adjusts counts based on the frequency of frequencies observed in the corpus. It estimates the probability mass for unseen events as $N_1 / N$, where $N_1$ is the number of bigrams seen exactly once and $N$ is the total number of bigrams. For seen bigrams occurring $r$ times, the smoothed count is

$$r^* = (r + 1) \frac{N_{r+1}}{N_r},$$

with $N_r$ being the number of bigrams occurring $r$ times; this redistributes probability from higher-frequency bigrams to lower-frequency or unseen ones, improving estimates for rare events without assuming a parametric form. An enhanced version incorporates predictors such as expected independent frequencies to refine adjustments for different bins of unseen bigrams.

Backoff and interpolation methods mitigate sparsity by blending bigram estimates with lower-order models, such as unigrams, when bigram counts are insufficient. A prominent interpolation technique is Jelinek-Mercer smoothing, which linearly combines the bigram MLE with the unigram probability:

$$P(w_{i+1} \mid w_i) = \lambda P_{\text{MLE}}(w_{i+1} \mid w_i) + (1 - \lambda) P(w_{i+1}),$$

where $\lambda$ (between 0 and 1) controls the interpolation weight and is typically tuned on held-out data to maximize likelihood on a validation set. Backoff methods, in contrast, use the full bigram probability if the count exceeds a threshold, otherwise falling back to the lower-order model with a discount factor. These approaches ensure non-zero probabilities for all bigrams while preserving information from observed data.

Evaluation of bigram frequency estimators focuses on their predictive performance on unseen data, commonly using log-likelihood (or perplexity) on a held-out test set, which measures how well the model assigns high probability to actual sequences. Perplexity, the exponential of the average negative log-likelihood per word, is often reported, with lower values indicating better estimation; for example, bigram models typically achieve perplexities around 100-200 on news corpora, compared to over 900 for unigrams. Goodness-of-fit can also be assessed via chi-squared tests, comparing observed bigram frequencies in test data against those expected under the estimated model to detect systematic deviations.
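A small sketch of Jelinek-Mercer interpolation as defined above, with an illustrative interpolation weight lam that would normally be tuned on held-out data (the function names jelinek_mercer and prob are assumptions of this example):

```python
from collections import Counter

def jelinek_mercer(tokens, lam=0.7):
    """Build a conditional probability function that linearly interpolates
    the bigram MLE with the unigram distribution:
    P(next | prev) = lam * P_MLE(next | prev) + (1 - lam) * P(next)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = sum(uni.values())

    def prob(prev, nxt):
        p_unigram = uni[nxt] / total
        p_bigram = bi[(prev, nxt)] / uni[prev] if uni[prev] else 0.0
        return lam * p_bigram + (1 - lam) * p_unigram

    return prob

p = jelinek_mercer("the cat sat on the mat the cat slept".split(), lam=0.7)
print(p("the", "cat"))    # strong bigram evidence, dominated by the MLE term
print(p("the", "slept"))  # unseen bigram, falls back to the unigram term
```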

Bigram Frequencies in English

In English text, character bigrams display characteristic frequency distributions derived from large corpora, revealing patterns shaped by phonetic, morphological, and orthographic factors. Analysis of the Google Books Ngrams English corpus (approximately 744 billion words spanning the 1800s to 2000s), as compiled by Peter Norvig, identifies "th" as the most prevalent bigram at 3.56%, reflecting its role in common function words and affixes like "the" and "-th". The second most frequent is "he" at 3.07%, often appearing in pronouns and articles. These frequencies underscore the dominance of a small set of bigrams in everyday use. The following table summarizes the top 10 character bigrams from Norvig's analysis, with relative frequencies indicating their proportional occurrence among all possible pairs (normalized across alphabetic characters, excluding spaces and punctuation):
Rank   Bigram   Frequency (%)
1      th       3.56
2      he       3.07
3      in       2.43
4      er       2.05
5      an       1.99
6      re       1.85
7      on       1.76
8      at       1.49
9      en       1.45
10     nd       1.35
Word bigrams, sequences of adjacent words, exhibit even more skewed distributions due to grammatical constraints. In modern English sources like the Google Books N-grams dataset (covering material up to 2019 and billions of words), common pairs such as "of the" and "in the" dominate, with "of the" accounting for roughly 0.35% of all bigram occurrences, highlighting prepositional and determiner patterns. Other frequent examples include "to the" (0.28%) and "and the" (0.25%), which together represent a substantial share of bigrams in narrative and expository prose. These percentages vary slightly by subcorpus but establish the scale of functional word pairings.

Bigram frequencies in English vary by genre, influenced by stylistic and structural differences. In fiction, bigrams show intermediate lexical gravity—a measure of associational strength—clustering closer to conversational language due to dialogue and narrative flow, whereas news texts exhibit lower gravity per sentence, reflecting longer, more varied structures with less formulaic pairing. For instance, fiction favors cohesive bigrams like "said he" more than news, which prioritizes informational pairs such as "the United". Academic prose displays the lowest overall gravity, emphasizing complex syntax over repetitive sequences.

Over time, bigram frequencies have evolved, with shifts observable in informal texts via diachronic corpora. Google Books N-grams reveal gradual changes, such as a slight decline in traditional bigrams like "th" in informal registers (e.g., digital communication), attributed to phonetic simplifications and blending, though formal texts maintain stability. These variations underscore language's adaptability to cultural and technological contexts.

Bigram ranks in English follow a power-law distribution akin to Zipf's law, where frequency decreases inversely with rank. Specifically, bigram probabilities adhere to $p(r) \propto r^{-\alpha}$, with $\alpha \approx 0.835$ for English co-occurrences in large corpora, indicating a "second-order" distribution flatter than that of unigrams ($\alpha \approx 1$). This relation, where rank $\times$ frequency approximates a constant for high-frequency bigrams, highlights the Zipfian organization of linguistic units.
