Word list
A word list is a list of words in a lexicon, generally sorted by frequency of occurrence (either by graded levels or as a ranked list). A word list is compiled by lexical frequency analysis within a given text corpus, and is used in corpus linguistics to investigate the genealogy and evolution of languages and texts. A word which appears only once in the corpus is called a hapax legomenon. In pedagogy, word lists are used in curriculum design for vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort" (Nation 1997), but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". Word counting is a thousand years old, and gigantic analyses were still done by hand in the mid-20th century; electronic processing of large corpora, such as movie subtitles (the SUBTLEX megastudy), has since accelerated the research field.
In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.
| Type | Occurrences | Rank |
|---|---|---|
| the | 3,789,654 | 1st |
| he | 2,098,762 | 2nd |
| [...] | [...] | [...] |
| king | 57,897 | 1,356th |
| boy | 56,975 | 1,357th |
| [...] | [...] | [...] |
| stringyfy | 5 | 34,589th |
| [...] | [...] | [...] |
| transducionalify | 1 | 123,567th |
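Such a ranked list can be derived in a few lines. The sketch below uses deliberately naive lowercase-and-split tokenization on an invented toy corpus; real studies define the word unit far more carefully:

```python
from collections import Counter

def frequency_list(text):
    """Return (type, occurrences, rank) tuples, most frequent first.

    Tokenization here is a naive lowercase-and-split, purely for
    illustration.
    """
    counts = Counter(text.lower().split())
    return [(word, n, rank)
            for rank, (word, n) in enumerate(counts.most_common(), start=1)]

corpus = "the king saw the boy and the boy saw the king and the king smiled"
for word, n, rank in frequency_list(corpus)[:3]:
    print(word, n, rank)
```

The rank is simply the position in the frequency-sorted list, as in the table above.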
Methodology
Factors
Nation (1997) noted the enormous help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:
- corpus representativeness
- word frequency and range
- treatment of word families
- treatment of idioms and fixed expressions
- range of information
- various other criteria
Corpora
Traditional written corpus
Most currently available studies are based on written text corpora, which are more readily available and easier to process.
SUBTLEX movement
However, New et al. 2007 proposed to tap into the large number of subtitles available online to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach and supported a move toward analysis of speech and of film subtitles available online. The initial research saw a handful of follow-up studies,[1] providing valuable frequency count analyses for various languages. In-depth SUBTLEX studies[2] over cleaned-up open subtitles were produced for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. 2010), Vietnamese (Pham, Bolger & Baayen 2011), German (Brysbaert et al. 2011), Brazilian Portuguese (Tang 2012), European Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013), Polish (Mandera et al. 2014), Catalan (2019[3]), and Welsh (van Heuven et al. 2024[4]). SUBTLEX-IT (2015) provides raw data only.[5]
Lexical unit
In any case, the basic "word" unit has to be defined. For Latin scripts, words are usually one or several characters separated either by spaces or punctuation. But exceptions can arise: English "can't" and French "aujourd'hui" include punctuation, while French "château d'eau" designates a concept different from the simple addition of its components while including a space. It may also be preferable to group the words of a word family under the representation of its base word. Thus possible, impossible and possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, all these words are summed up under the base word form *possib*, allowing the ranking of a concept's and a form's occurrences. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a specified chain of several characters can be interpreted either as a phrase of single-character words or as a multi-character word.
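Grouping surface forms under a base word can be sketched as follows. The family map here is hypothetical and hand-built for the *possib* example above; real lists derive such maps from morphological analysis:

```python
from collections import Counter

# Hypothetical word-family map: each surface form is assigned a base
# form. Real lists derive this from morphological analysis rather than
# a hand-written table.
FAMILY = {
    "possible": "possib", "impossible": "possib", "possibility": "possib",
    "run": "run", "runs": "run", "running": "run", "ran": "run",
}

def family_counts(tokens):
    """Sum token counts under each family's base form; tokens not in
    the map fall back to themselves."""
    return Counter(FAMILY.get(t, t) for t in tokens)

tokens = ["possible", "impossible", "running", "ran", "possibility"]
print(family_counts(tokens))  # possib counted 3 times, run 2 times
```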
Statistics
Zipf's law appears to hold for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
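A rough way to see Zipf's law is to check that rank times frequency stays within the same order of magnitude down the list. The counts below are invented to illustrate the pattern, not taken from a real corpus:

```python
# Under Zipf's law, frequency is roughly proportional to 1/rank, so the
# product rank * frequency should remain of the same order of magnitude
# across the list. These counts are invented for illustration.
freqs = {1: 3_789_654, 2: 2_098_762, 10: 400_000, 100: 38_000, 1000: 3_900}

products = [rank * freq for rank, freq in freqs.items()]
print(products)  # all products fall in the same order of magnitude
```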
German linguists define the Häufigkeitsklasse (frequency class) of an item in the list using the base-2 logarithm of the ratio between its frequency and the frequency of the most frequent item:

N = ⌊0.5 − log₂(f(item) / f(most frequent item))⌋

where ⌊·⌋ is the floor function. The most common item belongs to frequency class 0 (zero), and any item approximately half as frequent belongs in class 1. In the example list above, a misspelled word such as outragious with 76 occurrences has a ratio of 76/3,789,654 and belongs in class 16.
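A minimal sketch of the frequency-class computation, using the counts from the example table above:

```python
import math

def frequency_class(freq, max_freq):
    """Häufigkeitsklasse: 0 for the most frequent item; each halving of
    frequency raises the class by about 1 (0.5 offset, then floor)."""
    return math.floor(0.5 - math.log2(freq / max_freq))

print(frequency_class(3_789_654, 3_789_654))  # class 0 (most frequent item)
print(frequency_class(76, 3_789_654))         # class 16
```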
Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
Pedagogy
Those lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).
Effects of word frequency
Word frequency is known to have various effects (Brysbaert et al. 2011; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner is exposed to the word more often (Laufer 1997). Lexical access is also positively influenced by high word frequency, a phenomenon called the word frequency effect (Segui et al. 1982). The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.
Languages
Below is a review of available resources.
English
Word counting is an ancient field,[6] with known discussions dating back to Hellenistic times. In 1944, Edward Thorndike, Irving Lorge and colleagues[7] hand-counted 18,000,000 running words to provide the first large-scale English language frequency list, before modern computers made such projects far easier (Nation 1997). All 20th-century works suffer from their age. In particular, words relating to technology, such as "blog" (ranked #7665 by frequency[8] in the Corpus of Contemporary American English in 2014,[9] and first attested in 1999[10][11][12]), do not appear in any of these three lists.
- The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)
- The Teacher's Word Book contains 30,000 lemmas, or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and language changes since, have reduced its applicability (Nation 1997).
- The General Service List (West, 1953)
- The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rate of occurrence (%) for the different meanings and parts of speech of each headword is provided. Various criteria, other than frequency and range, were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise (Nation 1997). This list was updated in 2013 by Charles Browne, Brent Culligan and Joseph Phillips as the New General Service List.
- The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)
- A corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials and in its tagging of each word's frequency at each school grade and in each subject area (Nation 1997).
- The Brown (Francis and Kučera, 1982), LOB and related corpora
- These corpora contain 1 million words each from written sources representing different dialects of English. They are used to produce frequency lists (Nation 1997).
French
Traditional datasets
A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules.[13] It is claimed that 70 grammatical words constitute 50% of communicative sentences,[14][15] while 3,680 words provide about 95–98% coverage.[16] A list of 3,000 frequent words is available.[17]
The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, prepared by the lexicologist Étienne Brunet.[18] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[19]
More recently, the project Lexique3 provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under an open CC BY-SA 4.0 license.[20]
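Working with such a database typically means reading a delimited table and aggregating frequencies per lemma. The sketch below runs on an invented two-row excerpt; the column names are hypothetical, not Lexique3's actual field names:

```python
import csv
import io

# Invented excerpt in the spirit of a lexical database: one row per word
# form, tab-separated. Column names are hypothetical for this sketch.
data = (
    "word\tlemma\tfreq_per_million\n"
    "chateau\tchateau\t21.5\n"
    "chateaux\tchateau\t4.1\n"
)

by_lemma = {}
for row in csv.DictReader(io.StringIO(data), delimiter="\t"):
    by_lemma[row["lemma"]] = (
        by_lemma.get(row["lemma"], 0.0) + float(row["freq_per_million"])
    )

print(by_lemma)  # per-lemma frequency, summed over its word forms
```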
Subtlex
Lexique3 is an ongoing study from which the SUBTLEX movement cited above originated. New et al. 2007 produced a completely new count based on online film subtitles.
Spanish
There have been several studies of Spanish word frequency (Cuetos et al. 2011).[21]
Chinese
Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency (Allanic 2003). American sinologist John DeFrancis mentioned its importance for learning and teaching Chinese as a foreign language in Why Johnny Can't Read Chinese (DeFrancis 1966). As frequency toolkits, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high- and medium-frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words, are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 produced a rich study of Chinese word and character frequencies.
Other
Wiktionary contains frequency lists in more languages.[22] Lists of the most frequently used words in different languages, based on Wikipedia or combined corpora, are also available.[23]
See also
- Corpus linguistics
- Letter frequency
- Most common words in English
- Long tail
- Google Ngram Viewer – shows changes in word/phrase frequency (and relative frequency) over time
Notes
- ^ "Crr » Subtitle Word Frequencies".
- ^ Brysbaert, Marc. "Subtitle Word Frequencies". crr.ugent.be. Archived from the original on 2023-04-15. Retrieved 2025-10-11.
- ^ Boada, Roger; Guasch, Marc; Haro, Juan; Demestre, Josep; Ferré, Pilar (1 February 2020). "SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan". Behavior Research Methods. 52 (1): 360–375. doi:10.3758/s13428-019-01233-1. ISSN 1554-3528. PMID 30895456. S2CID 84843788.
- ^ van Heuven, Walter JB; Payne, Joshua S; Jones, Manon W (May 2024). "SUBTLEX-CY: A new word frequency database for Welsh". Quarterly Journal of Experimental Psychology. 77 (5): 1052–1067. doi:10.1177/17470218231190315. ISSN 1747-0218. PMC 11032624. PMID 37649366.
- ^ Amenta, Simona; Mandera, Paweł; Keuleers, Emmanuel; Brysbaert, Marc; Crepaldi, Davide (7 January 2022). "SUBTLEX-IT".
- ^ Bontrager, Terry (1 April 1991). "The Development of Word Frequency Lists Prior to the 1944 Thorndike-Lorge List". Reading Psychology. 12 (2): 91–116. doi:10.1080/0270271910120201. ISSN 0270-2711.
- ^ The teacher's word book of 30,000 words.
- ^ "Words and phrases: Frequency, genres, collocates, concordances, synonyms, and WordNet".
- ^ "Corpus of Contemporary American English (COCA)".
- ^ "It's the links, stupid". The Economist. 20 April 2006. Retrieved 2008-06-05.
- ^ Merholz, Peter (1999). "Peterme.com". Internet Archive. Archived from the original on 1999-10-13. Retrieved 2008-06-05.
- ^ Kottke, Jason (26 August 2003). "kottke.org". Retrieved 2008-06-05.
- ^ "Le français fondamental". Archived from the original on 2010-07-04.
- ^ Ouzoulias, André (2004), Comprendre et aider les enfants en difficulté scolaire: Le Vocabulaire fondamental, 70 mots essentiels (PDF), Retz - Citing V.A.C Henmon (dead link, no Internet Archive copy, 10 August 2023)
- ^ Liste des "70 mots essentiels" recensés par V.A.C. Henmon
- ^ "Generalities".
- ^ "PDF 3000 French words".
- ^ "Maitrise de la langue à l'école: Vocabulaire". Ministère de l'éducation nationale.
- ^ Baudot, J. (1992), Fréquences d'utilisation des mots en français écrit contemporain, Presses de L'Université, ISBN 978-2-7606-1563-2
- ^ "Lexique".
- ^ "Spanish word frequency lists". Vocabularywiki.pbworks.com.
- ^ Wiktionary:Frequency lists, 21 July 2024
- ^ Most frequently used words in different languages, ezglot
References
Theoretical concepts
- Nation, P. (1997), "Vocabulary size, text coverage, and word lists", in Schmitt; McCarthy (eds.), Vocabulary: Description, Acquisition and Pedagogy, Cambridge: Cambridge University Press, pp. 6–19, ISBN 978-0-521-58551-4
- Laufer, B. (1997), "What's in a word that makes it hard or easy? Some intralexical factors that affect the learning of words.", Vocabulary: Description, Acquisition and Pedagogy, Cambridge: Cambridge University Press, pp. 140–155, ISBN 9780521585514
- Nation, P. (2006), "Language Education - Vocabulary", Encyclopedia of Language & Linguistics, Oxford: 494–499, doi:10.1016/B0-08-044854-2/00678-7, ISBN 9780080448541.
- Brysbaert, Marc; Buchmeier, Matthias; Conrad, Markus; Jacobs, Arthur M.; Bölte, Jens; Böhl, Andrea (2011). "The word frequency effect: a review of recent developments and implications for the choice of frequency estimates in German". Experimental Psychology. 58 (5): 412–424. doi:10.1027/1618-3169/a000123. PMID 21768069. database
- Rudell, A.P. (1993), "Frequency of word usage and perceived word difficulty: Ratings of Kucera and Francis words", Most, vol. 25, pp. 455–463
- Segui, J.; Mehler, Jacques; Frauenfelder, Uli; Morton, John (1982), "The word frequency effect and lexical access", Neuropsychologia, 20 (6): 615–627, doi:10.1016/0028-3932(82)90061-6, PMID 7162585, S2CID 39694258
- Meier, Helmut (1967), Deutsche Sprachstatistik, Hildesheim: Olms (frequency list of German words)
- DeFrancis, John (1966), Why Johnny can't read Chinese
- Allanic, Bernard (2003), The corpus of characters and their pedagogical aspect in ancient and contemporary China (fr: Les corpus de caractères et leur dimension pédagogique dans la Chine ancienne et contemporaine) (These de doctorat), Paris: INALCO
Written texts-based databases
- Da, Jun (1998), Jun Da: Chinese text computing, retrieved 2010-08-21.
- Taiwan Ministry of Education (1997), 八十六年常用語詞調查報告書, retrieved 2010-08-21.
- New, Boris; Pallier, Christophe, Manuel de Lexique 3 (in French) (3.01 ed.).
- Gimenes, Manuel; New, Boris (2016), "Worldlex: Twitter and blog word frequencies for 66 languages", Behavior Research Methods, 48 (3): 963–972, doi:10.3758/s13428-015-0621-0, ISSN 1554-3528, PMID 26170053.
SUBTLEX movement
- New, B.; Brysbaert, M.; Veronis, J.; Pallier, C. (2007). "SUBTLEX-FR: The use of film subtitles to estimate word frequencies" (PDF). Applied Psycholinguistics. 28 (4): 661. doi:10.1017/s014271640707035x. hdl:1854/LU-599589. S2CID 145366468. Archived from the original (PDF) on 2016-10-24.
- Brysbaert, Marc; New, Boris (2009), "Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English" (PDF), Behavior Research Methods, 41 (4): 977–990, doi:10.3758/brm.41.4.977, PMID 19897807, S2CID 4792474
- Keuleers, E.; Brysbaert, M.; New, B. (2010), "SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles", Behavior Research Methods, 42 (3): 643–650, doi:10.3758/brm.42.3.643, PMID 20805586
- Cai, Q.; Brysbaert, M. (2010), "SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles", PLOS ONE, 5 (6): 8, Bibcode:2010PLoSO...510729C, doi:10.1371/journal.pone.0010729, PMC 2880003, PMID 20532192
- Cuetos, F.; Glez-nosti, Maria; Barbón, Analía; Brysbaert, Marc (2011), "SUBTLEX-ESP : Spanish word frequencies based on film subtitles" (PDF), Psicológica, 32: 133–143
- Dimitropoulou, M.; Duñabeitia, Jon Andoni; Avilés, Alberto; Corral, José; Carreiras, Manuel (2010), "SUBTLEX-GR: Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek", Frontiers in Psychology, 1 (December): 12, doi:10.3389/fpsyg.2010.00218, PMC 3153823, PMID 21833273
- Pham, H.; Bolger, P.; Baayen, R.H. (2011), "SUBTLEX-VIE : A Measure for Vietnamese Word and Character Frequencies on Film Subtitles", ACOL
- Brysbaert, Marc; Buchmeier, Matthias; Conrad, Markus; Jacobs, Arthur M.; Bölte, Jens; Böhl, Andrea (2011), "The Word Frequency Effect", Experimental Psychology, 58 (5): 412–424, doi:10.1027/1618-3169/a000123, ISSN 1618-3169, archived from the original on 2023-02-08
- Brysbaert, M.; New, Boris; Keuleers, E. (2012), "SUBTLEX-US : Adding Part of Speech Information to the SUBTLEXus Word Frequencies" (PDF), Behavior Research Methods: 1–22 (databases)
- Mandera, P.; Keuleers, E.; Wodniecka, Z.; Brysbaert, M. (2014). "Subtlex-pl: subtitle-based word frequency estimates for Polish" (PDF). Behav Res Methods. 47 (2): 471–483. doi:10.3758/s13428-014-0489-4. PMID 24942246. S2CID 2334688.
- Tang, K. (2012), "A 61 million word corpus of Brazilian Portuguese film subtitles as a resource for linguistic research", UCL Work Pap Linguist (24): 208–214
- Avdyli, Rrezarta; Cuetos, Fernando (June 2013), "SUBTLEX- AL: Albanian word frequencies based on film subtitles", ILIRIA International Review, 3 (1): 285–292, doi:10.21113/iir.v3i1.112 (inactive 12 July 2025), ISSN 2365-8592
- Soares, Ana Paula; Machado, João; Costa, Ana; Iriarte, Álvaro; Simões, Alberto; de Almeida, José João; Comesaña, Montserrat; Perea, Manuel (April 2015), "On the advantages of word frequency and contextual diversity measures extracted from subtitles: The case of Portuguese", The Quarterly Journal of Experimental Psychology, 68 (4): 680–696, doi:10.1080/17470218.2014.964271, PMID 25263599, S2CID 5376519
Overview
Definition and Scope
A word list is a curated collection of lexical items from a language, often derived from linguistic data and ranked by frequency of occurrence in a corpus, serving as a tool for analyzing vocabulary distribution and usage patterns in corpus linguistics. These lists typically present words in descending order of frequency, highlighting the most common lexical items first, and are essential for identifying core vocabulary that accounts for the majority of text in natural language. For instance, the most frequent words often include function words such as articles, prepositions, and pronouns, which dominate everyday discourse.[7]

Word lists vary in their unit of counting, with key distinctions between headword lists, lemma-based lists, and word family lists. A headword represents the base form of a word, such as "run," without grouping variants. Lemma-based lists expand this to include inflected forms sharing the same base, like "run," "runs," "running," and "ran," treating them as a single entry to reflect morphological relationships. In contrast, word family lists encompass not only inflections but also derived forms, such as "runner," "running," and "unrunnable," capturing broader semantic and derivational connections within the vocabulary.[8][9]

The scope of word lists is generally limited to common nouns, verbs, adjectives, and other content words in natural language, excluding proper nouns (such as names of people, places, or brands) unless they hold contextual relevance in specialized corpora. This focus ensures the lists prioritize generalizable vocabulary over unique identifiers. Basic word lists, often comprising the top 1,000 most frequent items, cover essential everyday terms sufficient for rudimentary communication, while comprehensive lists extending to 10,000 words incorporate advanced vocabulary for broader proficiency, such as in academic or professional settings.
Systematic frequency-based word lists emerged in the early 20th century with large-scale manual counts.[10][11]
Historical Evolution
The development of word lists began in the early 20th century with manual efforts to identify high-frequency vocabulary for educational purposes. In 1921, Edward Thorndike published The Teacher's Word Book, a list of 10,000 words derived from analyses of children's reading materials, including school texts and juvenile literature, to aid in curriculum design and literacy instruction.[12] This was expanded in 1932 with A Teacher's Word Book of the Twenty Thousand Words Found Most Frequently and Widely in General Reading for Children and Young People, which incorporated additional sources to rank words by frequency in youth-oriented content.[13] By 1944, Thorndike collaborated with Irving Lorge on The Teacher's Word Book of 30,000 Words, updating the earlier lists by integrating data from over 4.5 million words across diverse adult materials such as newspapers, magazines, and literature, thereby broadening applicability beyond child-focused education.[14]

Post-World War II advancements emphasized practical lists for language teaching, particularly in English as a foreign language (EFL) and other tongues. Michael West's General Service List (GSL), released in 1953, compiled 2,000 word families selected for their utility in EFL contexts, drawing from graded readers and general texts to prioritize coverage of everyday communication.[15] Concurrently, in France during the 1950s, the Français Fondamental project produced basic vocabulary lists ranging from 1,500 to 3,200 words, organized around 16 centers of interest like family and work, to standardize teaching for immigrants and non-native speakers through corpus-based frequency analysis of spoken and written French.[16]

The digital era marked a shift toward corpus linguistics in the late 20th century, enabling larger-scale and more precise frequency counts.
The Brown Corpus, a 1-million-word collection of American English texts compiled in 1961 and made digitally available in 1964, facilitated the rise of computational methods for word list construction and influenced subsequent projects with balanced, genre-diverse data. This culminated in the 2013 New General Service List (NGSL) by Charles Browne, Brent Culligan, and Joseph Phillips, which updated West's GSL using a 273-million-word corpus of contemporary English, refining the core vocabulary to 2,801 lemmas for better EFL relevance.[17]

A notable innovation occurred in 2009 with the introduction of SUBTLEX by Marc Brysbaert and Boris New, a frequency measure derived from 51 million words in American English movie and TV subtitles, offering superior representation of spoken language patterns over traditional written corpora.[18] This subtitle-based approach has since expanded, exemplified by the 2024 adaptation of SUBTLEX-CY for Welsh, which analyzes a 32-million-word corpus of television subtitles to provide psycholinguistically validated frequencies for this low-resource Celtic language, underscoring the method's versatility in supporting underrepresented tongues.[19]

Methodology
Key Factors in Construction
The construction of word lists hinges on ensuring representativeness, which requires balancing a diverse range of genres such as fiction, news, and academic texts to prevent skews toward specific linguistic features or registers. This diversity mirrors the target language's natural variation, allowing the list to capture a broad spectrum of usage patterns without overemphasizing one sub-domain. Corpus size plays a critical role in reliability, with a minimum of 1 million words often deemed sufficient for stable frequency estimates of high-frequency vocabulary, though larger corpora (16–30 million words) enhance precision for norms. Smaller corpora risk instability in rankings, particularly for mid- and low-frequency items.

Decisions on word family inclusion address morphological relatedness, treating derivatives like "run," "running," and "ran" as a single unit based on affixation levels that account for productivity and transparency. Bauer and Nation's framework outlines seven progressive levels, starting from the headword and extending to complex derivations, enabling compact lists that reflect learner needs while avoiding over-inflation of unique forms. This approach prioritizes semantic and derivational connections, but requires careful calibration to exclude transparent compounds that may dilute family coherence.

Normalization techniques mitigate sublanguage biases, where specialized texts like technical documents disproportionately elevate jargon frequencies.[20] Stratified sampling and weighting adjust for these imbalances by proportionally representing genres, ensuring the list approximates general language use rather than niche varieties.[20] Such methods preserve overall frequency integrity while countering distortions from uneven source distributions.[20]

Key challenges include handling polysemy, where a single form's multiple senses complicate frequency attribution, often requiring sense-disambiguated corpora to allocate counts accurately.
Idioms pose similar issues, as their multi-word nature and non-compositional meanings evade standard tokenization, potentially underrepresenting phrasal units in lemma-based lists.[21] Neologisms, such as "COVID-19," further challenge static lists built from pre-2020 corpora, necessitating periodic updates to incorporate emergent terms without retrospective bias.[22]

Dispersion metrics like Juilland's D quantify evenness of word distribution across texts, with values approaching 1 indicating broad coverage and thus greater reliability for generalizability. This measure, normalized by corpus structure, helps filter words concentrated in few documents, enhancing the list's robustness beyond raw frequency.

Corpus Sources
Traditional written corpora have formed the foundation for early word list construction, providing balanced samples of edited prose across various genres. The Brown Corpus, compiled in 1961, consists of approximately 1 million words drawn from 500 samples of American English texts published that year, including fiction, news, and scientific writing, making it the first major computer-readable corpus for linguistic research.[23] Similarly, the British National Corpus (BNC), developed in the 1990s, encompasses 100 million words of contemporary British English, with 90% from written sources like books and newspapers and 10% from spoken transcripts, offering a synchronic snapshot of language use.[24] These corpora, while pioneering in representing formal written language, have notable limitations, such as the absence of internet slang, social media expressions, and evolving colloquialisms that emerged after their compilation periods.[25]

To address gaps in capturing everyday spoken language, subtitle and spoken corpora have gained prominence since 2009, prioritizing natural dialogue over polished text. The SUBTLEX family, for instance, derives frequencies from film and television subtitles; SUBTLEX-US, based on American English, includes 51 million words from over 8,000 movies and series, providing measures like words per million and contextual diversity (percentage of films featuring a word).[26] This approach offers advantages in reflecting colloquial frequency, as subtitle-based norms better predict lexical decision times and reading behaviors compared to traditional written corpora like the Brown or BNC, which underrepresent informal speech patterns.[27]

Modern digital corpora have expanded scale and diversity by incorporating web-based and historical data, enabling broader frequency analyses.
The Corpus of Contemporary American English (COCA), spanning 1990 to 2019, contains over 1 billion words across genres such as spoken transcripts, fiction, magazines, newspapers, academic texts, and web content including blogs, thereby capturing evolving usage in digital contexts.[28] Complementing this, the Google Books Ngram corpus draws from trillions of words in scanned books across languages, covering the period from 1800 to 2019 (with extensions to 2022 in recent updates), allowing diachronic tracking of word frequencies while excluding low-quality scans for reliability.[29]

Post-2010, there has been a notable shift toward multimodal corpora that integrate text with audio transcripts, video, and other modalities to enhance relevance for second language (L2) learners by simulating real-world input.[30] These resources, such as those combining spoken audio with aligned textual representations, better support vocabulary acquisition in naturalistic settings compared to text-only sources.[31] Dedicated corpora for AI-generated text remain in early development.[32]

Lexical Unit Definitions
In the construction of word lists, a fundamental distinction exists between lemmas and word forms as lexical units. A lemma represents the base or citation form of a word, encompassing its inflected variants that share the same core meaning, such as "be" including "am," "is," "are," and "been." This approach groups related forms to reflect semantic unity and is commonly used in frequency-based vocabulary lists to avoid inflating counts with morphological variations. In contrast, word forms refer to the surface-level realizations of words as they appear in texts, treating each inflection or spelling variant separately for precise token analysis, such as counting "runs" and "running" independently. This differentiation affects how vocabulary size is estimated and prioritized in lists, with lemmas promoting efficiency in pedagogical applications while word forms provide granular data on actual usage patterns.[33]

Word families extend the lemma concept by incorporating hierarchically related derivatives and compounds, allowing for a more comprehensive representation of vocabulary knowledge. According to Bauer and Nation's framework, which outlines seven progressive levels, inclusion begins at Level 1, treating each inflected form as separate, and progresses through Level 2 (inflections with the same base), Levels 3–6 (various derivational affixes based on frequency, regularity, and productivity), to Level 7 (classical roots and affixes). This scale balances inclusivity with learnability, though practical word lists often limit to Level 6 to focus on more transparent forms, integrating less predictable derivatives only if they occur frequently in corpora. For instance, the word family for "decide" at higher levels might include "decision," "indecisive," and "undecided," reflecting shared morphological and semantic roots.
Such hierarchical structuring is widely adopted in corpus-derived lists to estimate coverage and guide instruction.[34]

Multi-word units, such as collocations and lexical bundles, are treated as single lexical entries in pedagogical word lists to account for their formulaic nature and frequent co-occurrence beyond chance. Phrases like "point of view" or "in order to" are included holistically rather than as isolated words, recognizing their role as conventionalized units that learners acquire as wholes for fluency. These units are identified through corpus analysis focusing on mutual information and range, with lists like the Academic Collocation List compiling thousands of such sequences tailored to specific registers. By delineating multi-word units distinctly, word lists enhance coverage of idiomatic expressions, which constitute a significant portion of natural language use.[35]

The token-type distinction underpins the delineation of lexical units by differentiating occurrences from unique forms, essential for assessing diversity in word lists. Tokens represent every instance of a word in a corpus, including repetitions, while types denote distinct forms, such as unique lemmas or word families. This leads to the type-token ratio (TTR), a measure of lexical variation calculated as

TTR = number of types / number of tokens

where higher values indicate greater diversity. In word list construction, TTR helps evaluate corpus representativeness, guiding decisions on unit granularity to ensure lists reflect both frequency and richness without redundancy.[36]

Challenges in defining lexical units arise with proper nouns and inflections, particularly in diverse language structures. Proper nouns like "London" are often excluded from core frequency lists or segregated into separate categories to focus on general vocabulary, unless analyses specifically track capitalized forms for domain-specific coverage, as seen in the BNC/COCA lists where they comprise nearly half of unlisted types.
In agglutinative languages such as Turkish or Finnish, extensive inflectional suffixes create long, context-dependent forms, complicating lemmatization and risking fragmentation of units; for example, a single root might yield dozens of surface variants, necessitating advanced morphological parsing to group them accurately without under- or over-counting types. These issues highlight the need for language-specific rules in unit delineation to maintain list utility.[37][38]
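The type-token ratio can be computed directly from a token list; a minimal sketch on a five-token toy sample:

```python
def type_token_ratio(tokens):
    """TTR = number of distinct types / number of tokens;
    higher values indicate a more lexically diverse sample."""
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the king saw the king".split()))  # 3 types / 5 tokens = 0.6
```

Note that TTR falls as sample size grows, which is why comparisons are only meaningful between samples of similar length.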
