Bigram
A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.
The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, and speech recognition.
Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).
Applications
Bigrams, along with other n-grams, are used in most successful language models for speech recognition.[1]
Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.
Bigram frequency is one approach to statistical language identification.
Some activities in logology or recreational linguistics involve bigrams. These include attempts to find English words beginning with every possible bigram,[2] or words containing a string of repeated bigrams, such as logogogue.[3]
Bigram frequency in the English language
The frequency of the most common letter bigrams in a large English corpus is:[4]

| Bigram | Frequency | Bigram | Frequency | Bigram | Frequency |
|---|---|---|---|---|---|
| th | 3.56% | of | 1.17% | io | 0.83% |
| he | 3.07% | ed | 1.17% | le | 0.83% |
| in | 2.43% | is | 1.13% | ve | 0.83% |
| er | 2.05% | it | 1.12% | co | 0.79% |
| an | 1.99% | al | 1.09% | me | 0.79% |
| re | 1.85% | ar | 1.07% | de | 0.76% |
| on | 1.76% | st | 1.05% | hi | 0.76% |
| at | 1.49% | to | 1.05% | ri | 0.73% |
| en | 1.45% | nt | 1.04% | ro | 0.73% |
| nd | 1.35% | ng | 0.95% | ic | 0.70% |
| ti | 1.34% | se | 0.93% | ne | 0.69% |
| es | 1.34% | ha | 0.93% | ea | 0.69% |
| or | 1.28% | as | 0.87% | ra | 0.69% |
| te | 1.20% | ou | 0.87% | ce | 0.65% |
References
- ^ Collins, Michael John (1996-06-24). "A new statistical parser based on bigram lexical dependencies". Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. pp. 184–191. arXiv:cmp-lg/9605012. doi:10.3115/981863.981888. S2CID 12615602. Retrieved 2018-10-09.
- ^ Cohen, Philip M. (1975). "Initial Bigrams". Word Ways. 8 (2). Retrieved 11 September 2016.
- ^ Corbin, Kyle (1989). "Double, Triple, and Quadruple Bigrams". Word Ways. 22 (3). Retrieved 11 September 2016.
- ^ "English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU". norvig.com. Retrieved 2019-10-28.
Bigram
Fundamentals
Definition
A bigram is a contiguous sequence of two adjacent elements drawn from a larger sequence, such as symbols, letters, or words in text or speech.[1] This concept captures pairwise dependencies within the sequence, serving as a fundamental unit for analyzing patterns in linguistic data. For instance, in the word "the," the bigrams are "th" and "he"; in the sentence "the cat sat," a word-level bigram would be "the cat."[1] The term "bigram" originates from the prefix "bi-" (meaning two) and "gram" (referring to a written symbol or unit), and it specifically denotes the n=2 case of the more general n-gram framework.[1]

The concept of such pairwise sequences was systematically explored by Andrey Markov in 1913 using letter transitions in text, and later applied by Claude Shannon in his 1951 work on estimating the entropy of printed English, where he used digram frequencies to approximate language redundancy.[4][2] The broader term "n-gram" was introduced in the late 1950s within information theory, building on Shannon's foundations. Bigrams have since become essential in natural language processing for tasks like language modeling.[1]

Mathematical Representation
In statistical natural language processing, a bigram is formally represented through counts derived from a corpus. The unigram count $C(w_{i-1})$ denotes the number of occurrences of a single word in the corpus, while the bigram count $C(w_{i-1}, w_i)$ represents the frequency of the consecutive pair in which $w_{i-1}$ is followed by $w_i$.[5][1] The probability of a bigram is derived from conditional probability principles, approximating the likelihood of the next word given the previous one via maximum likelihood estimation. Specifically, $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$, which estimates the conditional probability as the relative frequency of the bigram divided by the unigram count of the preceding word.[5][1] This formulation stems from the chain rule of probability, where the joint probability of a sequence is factored as $P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$ under the bigram (first-order Markov) assumption, and each term is computed from empirical counts to maximize the likelihood of the observed data.[5]

Raw maximum likelihood estimates often fail due to sparse data, resulting in zero probabilities for unseen bigrams, which can lead to invalid models.[1] To address this, smoothing techniques redistribute probability mass; a simple and widely used method is Laplace (add-one) smoothing, which adds a pseudocount of 1 to every possible bigram count. The smoothed bigram probability is then given by $P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$, where $V$ is the size of the vocabulary.[5][1] This ensures that every possible bigram receives a small positive probability, preventing zero estimates while preserving the relative frequencies of observed events.

Bigrams also play a central role in measuring the uncertainty of language models through entropy. The entropy of a bigram model quantifies the average information content per word and is computed as the expected conditional entropy: $H(w_i \mid w_{i-1}) = -\sum_{w_{i-1}} P(w_{i-1}) \sum_{w_i} P(w_i \mid w_{i-1}) \log_2 P(w_i \mid w_{i-1})$. This expression averages the negative log-probabilities over the unigram distribution and the conditional bigram probabilities, providing a theoretical measure of predictability.[5][1] Lower entropy values indicate more predictable sequences, with bigram models typically yielding higher entropy than higher-order n-grams due to their limited context.[1]
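To make these definitions concrete, here is a minimal Python sketch (not taken from the cited sources) that collects unigram and bigram counts from a toy token list and computes the Laplace-smoothed conditional probability defined above; the function names and the toy corpus are illustrative.

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def laplace_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate: (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()          # toy corpus
unigrams, bigrams = train_bigram_counts(tokens)
V = len(unigrams)                                  # vocabulary size
print(laplace_bigram_prob("the", "cat", unigrams, bigrams, V))  # observed pair
print(laplace_bigram_prob("the", "dog", unigrams, bigrams, V))  # unseen pair, still > 0
```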
Types
Word Bigrams
Word bigrams consist of pairs of consecutive words extracted from tokenized text, serving as a fundamental unit for modeling local dependencies in natural language. The formation process begins with tokenization, where raw text is split into words using whitespace and punctuation as delimiters; punctuation marks are typically removed or treated as separate tokens to avoid fragmenting meaningful units, while the text is often converted to lowercase to normalize for case insensitivity and reduce vocabulary size.[6]

For instance, in the sentence "The quick brown fox jumps.", tokenization yields the sequence ["the", "quick", "brown", "fox", "jumps"], excluding the period as non-word content.[1] From this tokenized sequence, bigrams are generated by sliding a window of size two across the tokens, producing pairs such as ("the", "quick"), ("quick", "brown"), ("brown", "fox"), and ("fox", "jumps"). To account for sentence boundaries, special tokens such as <s> (sentence start) and </s> (sentence end) are commonly added before pairing, so that the first and last words of a sentence also appear in bigrams with an explicit boundary context.
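A small illustrative sketch of this extraction step, under the simplifying assumption that tokens are lowercase alphabetic strings and that <s>/</s> serve as boundary markers (the regex and helper name are not from the source):

```python
import re

def word_bigrams(sentence, bos="<s>", eos="</s>"):
    """Lowercase, strip punctuation, tokenize on word characters, then pad with
    sentence-boundary markers before pairing adjacent tokens."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    padded = [bos] + tokens + [eos]
    return list(zip(padded, padded[1:]))

print(word_bigrams("The quick brown fox jumps."))
# [('<s>', 'the'), ('the', 'quick'), ('quick', 'brown'),
#  ('brown', 'fox'), ('fox', 'jumps'), ('jumps', '</s>')]
```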
Character Bigrams
Character bigrams consist of pairs of consecutive characters extracted from untokenized text, typically after normalization such as converting to lowercase and removing or ignoring spaces and punctuation to focus on alphabetic sequences.[8] For instance, the text "The quick" might yield bigrams like ("t","h"), ("h","e"), and ("q","u"), preserving the sequential order without word boundaries. This formation allows for straightforward computation over raw strings, enabling analysis of subword structures.[9]

These bigrams capture digraphs, which are common letter pairs in a language, such as "th" (occurring in about 1.52% of English digraph positions), "he" (1.28%), and "in" (0.94%), reflecting orthographic patterns tied to phonetics and spelling conventions.[8] In English, digraphs like "th" often represent specific sounds (/θ/ or /ð/), aiding in the study of how letter combinations encode phonological information and influence spelling regularity. Their frequencies reveal linguistic redundancies, such as repeated pairs indicating morphological or phonetic constraints.[10]

Compared to word bigrams, character bigrams offer advantages in handling out-of-vocabulary (OOV) words and languages lacking explicit spaces, like Chinese, where they avoid the need for segmentation by treating continuous character streams as fixed-length units.[11] This makes them robust for proper nouns, neologisms, or agglutinative scripts, as no dictionary or tokenizer is required, enhancing applicability in diverse NLP tasks.[12]

Consider the word "hello": after normalization, it produces the bigrams ("h","e"), ("e","l"), ("l","l"), and ("l","o"). The repeated ("l","l") highlights redundancy in English spelling, where such pairs underscore letter frequencies and potential for predictive modeling in text generation or error correction.[9]
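The character-level case can be sketched in a few lines; here normalization is assumed to mean keeping only lowercase letters, so pairs may span original word boundaries once spaces are removed (the helper name is illustrative, not from the source):

```python
def char_bigrams(text):
    """Normalize to lowercase letters only, then pair adjacent characters."""
    letters = [c for c in text.lower() if c.isalpha()]
    return list(zip(letters, letters[1:]))

print(char_bigrams("hello"))
# [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```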
Other Variants
Phonetic bigrams, also known as phone bigrams, consist of pairs of adjacent phonetic units or phones, such as the sequence /k/ followed by /æ/ in the pronunciation of "cat." These bigrams model transitions between sounds in spoken language and are employed in speech recognition systems to capture phonetic dependencies and improve decoding accuracy.[13][14]

Byte bigrams refer to consecutive pairs of bytes in binary data, utilized in computing for tasks such as file type identification and data compression. In file analysis, byte bigram frequencies help classify fragmented or unknown files by generating feature vectors from their binary content, distinguishing formats like executables from documents.[15] For compression, byte pair encoding (BPE) iteratively identifies and replaces the most frequent byte bigrams with a single byte symbol, reducing redundancy in text or binary streams as originally proposed for general-purpose data compression.[16]

Skip bigrams extend the bigram concept to non-adjacent pairs, such as $(w_i, w_{i+2})$, where one or more intervening elements are skipped, allowing capture of longer-range dependencies while contrasting with standard contiguous bigrams that require immediate adjacency. This variant is applied in natural language processing to enhance event prediction and statistical modeling by incorporating gapped sequences.[17]

In bioinformatics, domain-specific bigrams include DNA bigrams, which are pairs of adjacent nucleotides like "AT" or "CG" in genomic sequences, used to analyze patterns, classify DNA texts over the four-letter alphabet {A, C, G, T}, and extract features for predicting protein-DNA interactions. These bigrams facilitate motif detection and sequence similarity assessments in large-scale genomic studies.[18][19]
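As a rough illustration of the skip-bigram variant only (the other variants are not sketched here), the hypothetical helper below emits pairs that allow up to `max_skip` intervening tokens; setting `max_skip=0` recovers ordinary contiguous bigrams.

```python
def skip_bigrams(tokens, max_skip=1):
    """Return pairs (tokens[i], tokens[j]) with at most max_skip tokens
    skipped between them; max_skip=0 gives contiguous bigrams."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

print(skip_bigrams("the cat sat on the mat".split(), max_skip=1))
# includes ('the', 'cat') as well as the gapped pair ('the', 'sat')
```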
Applications
Natural Language Processing
In natural language processing (NLP), bigrams serve as a foundational tool for language modeling, where they estimate the probability of a word given its immediate predecessor to predict the next word in a sequence. This approach, known as a bigram model, approximates the joint probability of word sequences by assuming Markovian dependence, enabling tasks like next-word prediction. A seminal exposition of bigram models highlights their use in calculating conditional probabilities from training corpora, which underpins probabilistic text prediction.[1]

The effectiveness of bigram language models is often evaluated using perplexity, a metric that quantifies predictive uncertainty as $\mathrm{PP} = 2^{H}$, where $H$ represents the cross-entropy of the model with respect to the test data. Lower perplexity indicates better modeling of language patterns, with bigrams typically achieving reasonable scores on short-range dependencies in English text, though performance degrades on longer sequences.[1]

Bigrams find practical applications in spell-checking, where context-aware systems leverage bigram probabilities to rank correction candidates by their likelihood in surrounding text. In machine translation, bigram models contribute to language modeling and alignment by scoring co-occurrence probabilities. Additionally, bigrams support text generation by sampling subsequent words based on prior ones, producing coherent short phrases in early probabilistic systems.[1]

Despite their utility, bigram models exhibit limitations in capturing long-range dependencies, as they ignore context beyond the immediate prior word, often resulting in sparse probabilities and poor generalization for complex sentences; this has motivated extensions to higher-order n-grams like trigrams for improved modeling.[1]

A notable case study involves bigram-based sentiment analysis, where phrase-level bigrams detect polarity shifts from negation or intensifiers. For example, the bigram "not good" conveys negative sentiment despite the positive valence of "good," whereas "very good" amplifies positivity, allowing models to classify reviews more accurately than unigram approaches by incorporating local word interactions.[20]
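A self-contained sketch of the perplexity computation described above, assuming an add-one-smoothed bigram model and tiny illustrative training and test strings (none of this is from the cited source):

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity = 2 ** H, where H is the average negative log2 probability
    per word under an add-one-smoothed bigram model trained on train_tokens."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    V = len(unigrams)
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = 0.0
    for prev, cur in pairs:
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        log_prob += math.log2(p)
    H = -log_prob / len(pairs)          # cross-entropy in bits per word
    return 2 ** H

train = "the cat sat on the mat the cat ran".split()   # toy corpus
test = "the cat sat on the mat".split()
print(bigram_perplexity(train, test))
```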
Cryptanalysis
Bigrams play a crucial role in frequency analysis for cryptanalyzing monoalphabetic substitution ciphers, where the distribution of letter pairs in the ciphertext is compared to expected plaintext frequencies to map substitutions. For instance, the high frequency of the bigram "th" in English (occurring approximately 3.6% of the time) allows cryptanalysts to identify likely mappings for frequently appearing pairs in the encrypted text, facilitating partial or full recovery of the plaintext. This method extends single-letter frequency analysis by providing additional structural clues, making it more effective against ciphers that preserve bigram statistics despite letter scrambling.[21]

Historical applications of bigram-based techniques trace back to early Islamic cryptology, where Arab scholars like Al-Kindi in the 9th century pioneered frequency analysis for individual letters in his treatise A Manuscript on Deciphering Cryptographic Messages, laying the groundwork for later adaptations to bigrams that exploited pairwise letter dependencies in languages like Arabic. By the 19th century, these ideas evolved into more systematic approaches, such as the Kasiski examination published by Friedrich Kasiski in 1863, which detects the key length in polyalphabetic ciphers like the Vigenère by measuring distances between repeated bigrams in the ciphertext; for example, multiple occurrences of a bigram like "QU" separated by intervals that are multiples of the key length reveal its periodicity.[22][23]

During World War II, bigram analysis contributed to Allied efforts in breaking Enigma machine variants, particularly naval versions where German operators used bigram substitution tables to encode message keys and settings, concealing them from direct frequency attacks; cryptanalysts at Bletchley Park exploited recovered bigram tables and known-plaintext cribs to streamline decryption processes.[24]
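The Kasiski-style step of collecting distances between repeated ciphertext bigrams can be sketched as follows; the ciphertext string is a contrived toy example, and the greatest common divisor of the gaps is only a heuristic hint at the key length:

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def repeated_bigram_distances(ciphertext):
    """Collect gaps between repeated bigrams; shared factors of these gaps
    are candidate key lengths in a Kasiski-style examination."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - 1):
        positions[ciphertext[i:i + 2]].append(i)
    distances = []
    for locs in positions.values():
        distances.extend(b - a for a, b in zip(locs, locs[1:]))
    return distances

# Toy ciphertext with a fragment deliberately repeated 12 positions apart.
dists = repeated_bigram_distances("LXFOPVEFRNHRLXFOPV")
print(dists, reduce(gcd, dists))   # gaps of 12 suggest a key length dividing 12
```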
Text Compression
Bigrams play a key role in lossless text compression by capturing dependencies between consecutive symbols, enabling probabilistic models that exploit redundancy to encode text more efficiently than independent symbol treatments. In arithmetic coding paired with bigram models, the probability of each subsequent character is estimated based on the preceding one, allowing the encoder to represent the entire message as a single fractional number within a [0,1) interval; narrower intervals for high-probability bigrams reduce the average bits required per character, approaching the Shannon entropy limit.[25][26]

A prominent application is Prediction by Partial Matching (PPM), an adaptive algorithm that incorporates bigram (order-1) contexts alongside higher orders, using partial string matching to predict symbols when exact contexts are unseen; this method dynamically updates probability estimates and employs arithmetic coding for entropy encoding, forming the basis for variants in compression software such as 7-Zip's PPMd filter.[26] In PPM, bigram probabilities guide the selection of escape mechanisms to lower-order models, balancing adaptation and compression effectiveness.[25]

For instance, in compressing the sequence "the the," a bigram model recognizes the high probability of "the" following "the " (due to common repetitions in English), assigning it a short code interval in arithmetic coding and thus requiring fewer bits than encoding each word independently.[25]

Bigram models with arithmetic coding typically achieve around 1.5 bits per character for English text, a notable improvement over unigram Huffman coding's approximately 4.7 bits per character, as demonstrated in trained PPM variants on literary corpora. This performance underscores bigrams' utility in nearing English's estimated entropy of 1-1.3 bits per character without higher-order extensions.
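As a back-of-the-envelope illustration (not a real compressor), the sketch below measures the average bits per character that an ideal arithmetic coder would spend under an add-one-smoothed order-1 character model; because the model is trained and evaluated on the same short string, the figure is optimistic compared to a practical scheme such as PPM.

```python
import math
from collections import Counter

def bigram_bits_per_char(text):
    """Average ideal code length (bits/char) under an add-one-smoothed
    order-1 (bigram) character model estimated from `text` itself."""
    chars = list(text)
    unigrams = Counter(chars)
    bigrams = Counter(zip(chars, chars[1:]))
    V = len(unigrams)
    total_bits = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        total_bits += -math.log2(p)
    return total_bits / (len(chars) - 1)

sample = "the theory of the thing the thin man thought through"
print(f"{bigram_bits_per_char(sample):.2f} bits/char")
```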
Statistical Properties
Frequency Estimation
Frequency estimation for bigrams involves deriving probabilities from observed data in a corpus, addressing challenges such as data sparsity where many possible bigrams do not appear, leading to zero-probability assignments, and high variance in estimates for rare events.[1] The maximum likelihood estimation (MLE) provides a baseline approach by computing the raw relative frequency of a bigram given its context. For a bigram $(w_{i-1}, w_i)$, the MLE probability is given by $P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$, where $C(\cdot)$ denotes the count in the training corpus. This method maximizes the likelihood of the training data but suffers from high variance for bigrams with low counts and assigns zero probability to unseen bigrams, causing issues in applications requiring smooth probability distributions.[1][27]

To handle sparsity, particularly for unseen bigrams, the Good-Turing estimator adjusts counts based on the frequency of frequencies observed in the corpus. It estimates the probability mass for unseen events as $P_0 = \frac{N_1}{N}$, where $N_1$ is the number of bigrams seen exactly once and $N$ is the total number of observed bigrams. For seen bigrams occurring $c$ times, the smoothed count is $c^{*} = (c+1)\frac{N_{c+1}}{N_c}$, with $N_c$ being the number of distinct bigrams occurring $c$ times; this redistributes probability from higher-frequency bigrams to lower or unseen ones, improving estimates for rare events without assuming a parametric form. An enhanced version incorporates predictors like expected independent frequencies to refine adjustments for different bins of unseen bigrams.[28][27]

Backoff and interpolation methods mitigate sparsity by blending bigram estimates with lower-order models, such as unigrams, when bigram counts are insufficient. A prominent interpolation technique is Jelinek-Mercer smoothing, which linearly combines the bigram MLE with the unigram probability: $P_{\text{JM}}(w_i \mid w_{i-1}) = \lambda\, P_{\text{MLE}}(w_i \mid w_{i-1}) + (1 - \lambda)\, P(w_i)$, where $\lambda$ (between 0 and 1) controls the interpolation weight and is typically tuned on held-out data to maximize likelihood on a validation set. Backoff methods, in contrast, use the full bigram probability if the count exceeds a threshold, otherwise falling back to the lower-order model with a discount factor. These approaches ensure non-zero probabilities for all bigrams while preserving information from observed data.[27]

Evaluation of bigram frequency estimators focuses on their predictive performance on unseen data, commonly using log-likelihood (or cross-entropy) on a held-out test set, which measures how well the model assigns high probability to actual sequences. This is often reported as perplexity, the exponential of the average negative log-likelihood per word, where lower values indicate better estimation; for example, bigram models typically achieve perplexities around 100-200 on news corpora, compared to over 900 for unigrams. Goodness-of-fit can also be assessed via chi-squared tests, comparing observed bigram frequencies in test data against those expected under the estimated model to detect systematic deviations.[1][27]
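A minimal sketch of Jelinek-Mercer interpolation as defined above, with a fixed illustrative lambda rather than one tuned on held-out data (the identifiers and toy corpus are assumptions, not from the cited sources):

```python
from collections import Counter

def jelinek_mercer_prob(w_prev, w, unigrams, bigrams, total_tokens, lam=0.7):
    """Interpolated estimate: lam * P_MLE(w | w_prev) + (1 - lam) * P(w).
    In practice lam is tuned on held-out data."""
    p_bigram = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_unigram = unigrams[w] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

tokens = "the cat sat on the mat the cat ran".split()   # toy corpus
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(jelinek_mercer_prob("the", "cat", unigrams, bigrams, len(tokens)))  # seen bigram
print(jelinek_mercer_prob("the", "ran", unigrams, bigrams, len(tokens)))  # unseen bigram, backed off to unigram mass
```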
Bigram Frequencies in English
In English text, character bigrams display characteristic frequency distributions derived from large corpora, revealing patterns shaped by phonetic, morphological, and syntactic structures. Analysis of the Google Books English corpus (approximately 744 billion words from the 1800s to 2000s), as compiled by Peter Norvig, identifies "th" as the most prevalent bigram at 3.56%, reflecting its role in common function words and affixes like "the" and "-th". The second most frequent is "he" at 3.07%, often appearing in pronouns and articles. These frequencies underscore the dominance of a small set of bigrams in everyday language use.[29]

The following table summarizes the top 10 character bigrams from Norvig's analysis, with relative frequencies indicating their proportional occurrence among all possible pairs (normalized across alphabetic characters, excluding spaces and punctuation):

| Rank | Bigram | Frequency (%) |
|---|---|---|
| 1 | th | 3.56 |
| 2 | he | 3.07 |
| 3 | in | 2.43 |
| 4 | er | 2.05 |
| 5 | an | 1.99 |
| 6 | re | 1.85 |
| 7 | on | 1.76 |
| 8 | at | 1.49 |
| 9 | en | 1.45 |
| 10 | nd | 1.35 |
