Word list
from Wikipedia

A word list is a list of words in a lexicon, generally sorted by frequency of occurrence (either by graded levels or as a ranked list). A word list is compiled by lexical frequency analysis of a given text corpus and is used in corpus linguistics to investigate the genealogies and evolution of languages and texts. A word that appears only once in the corpus is called a hapax legomenon. In pedagogy, word lists are used in curriculum design for vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort" (Nation 1997), but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, and gigantic analyses were still done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated the research field.

In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

Table 1: Example lexical frequency analysis

Type               Occurrences    Rank
the                  3,789,654    1st
he                   2,098,762    2nd
[...]
king                    57,897    1,356th
boy                     56,975    1,357th
[...]
stringyfy                    5    34,589th
[...]
transducionalify             1    123,567th
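
As an illustration of how such a ranked list can be derived, the following Python sketch counts word types in a plain-text corpus and assigns ranks by descending frequency. The file name and the regex-based tokeniser are illustrative assumptions, not part of any particular study.

```python
# Minimal frequency-list sketch: count word types in a text corpus and rank
# them by number of occurrences.
import re
from collections import Counter

def frequency_list(text):
    # Lowercase and split on runs of letters/apostrophes (a crude "word"
    # definition; see the "Lexical unit" section for what this glosses over).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending frequency; rank is the 1-based position in the list.
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
        for rank, word, freq in frequency_list(f.read())[:10]:
            print(f"{rank:>4}  {word:<15} {freq}")
```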

Methodology


Factors


Nation (1997) noted the considerable help provided by computing capabilities, which make corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:

  • corpus representativeness
  • word frequency and range
  • treatment of word families
  • treatment of idioms and fixed expressions
  • range of information
  • various other criteria

Corpora


Traditional written corpus

[Figure: Frequency of personal pronouns in Serbo-Croatian]

Most currently available studies are based on written text corpora, which are more easily obtained and easier to process.

SUBTLEX movement


However, New et al. 2007 proposed to tap into the large number of subtitles available online in order to analyse large quantities of speech. Brysbaert & New 2009 made an extensive critical evaluation of the traditional textual-analysis approach and supported a move toward speech analysis and the analysis of film subtitles available online. The initial research was followed by a number of studies,[1] providing valuable frequency counts for various languages. In-depth SUBTLEX studies[2] over cleaned-up open subtitles were produced for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. 2010), Vietnamese (Pham, Bolger & Baayen 2011), German (Brysbaert et al. 2011), Brazilian Portuguese (Tang 2012), European Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013), Polish (Mandera et al. 2014), Catalan (2019[3]) and Welsh (Van Veuhen et al. 2024[4]). SUBTLEX-IT (2015) provides raw data only.[5]

Lexical unit


In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated by spaces or punctuation. But exceptions arise: English "can't" and French "aujourd'hui" contain punctuation marks, while French "château d'eau" designates a concept different from the simple addition of its components even though it contains a space. It may also be preferable to group the words of a word family under the representation of their base word. Thus possible, impossible and possibility belong to the same word family, represented by the base word *possib*. For statistical purposes, all these words are summed up under the base form *possib*, allowing a concept and its forms to be ranked by occurrence. Moreover, other languages present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given string of several characters can be interpreted either as a phrase of single-character words or as a multi-character word.
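
The following Python sketch illustrates the tokenisation and word-family issues described above; the regex tokeniser and the tiny *possib* family mapping are purely illustrative assumptions.

```python
# Sketch of the "lexical unit" problem: a naive tokeniser treats "can't" as two
# units and "château d'eau" as three, while a word-family count groups
# "possible", "impossible" and "possibility" under one base form.
import re
from collections import Counter

def naive_tokens(text):
    # \w+ splits on apostrophes and spaces alike, so multi-part words break up.
    return re.findall(r"\w+", text.lower())

# Illustrative (incomplete) mapping of surface forms to a word-family base.
FAMILY_BASES = {"possible": "possib", "impossible": "possib", "possibility": "possib"}

def family_counts(tokens):
    # Count under the family base where one is known, otherwise the form itself.
    return Counter(FAMILY_BASES.get(t, t) for t in tokens)

print(naive_tokens("can't"))          # ['can', 't'] - the apostrophe splits the word
print(naive_tokens("château d'eau"))  # ['château', 'd', 'eau'] - one concept, three tokens
print(family_counts(["possible", "impossible", "possibility"]))  # Counter({'possib': 3})
```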

Statistics


It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.

German linguists define the Häufigkeitsklasse (frequency class) N of an item in the list using the base-2 logarithm of the ratio between its frequency and the frequency of the most frequent item:

N = \left\lfloor 0.5 - \log_2 \frac{f_{\text{item}}}{f_{\text{max}}} \right\rfloor

where \lfloor \cdot \rfloor is the floor function. The most common item belongs to frequency class 0 (zero), and any item that is approximately half as frequent belongs to class 1. For example, a word occurring 76 times in the corpus above has a ratio of 76/3,789,654 and belongs to class 16.
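
A minimal Python sketch of this frequency-class computation, reusing the example counts from the table above:

```python
# Frequency class (Häufigkeitsklasse) as defined above: the rounded base-2
# logarithm of how many times rarer a word is than the most frequent word.
import math

def frequency_class(freq, max_freq):
    # floor(0.5 - log2(freq / max_freq)) == log2 ratio rounded to nearest integer
    return math.floor(0.5 - math.log2(freq / max_freq))

print(frequency_class(3_789_654, 3_789_654))  # 0  -> the most frequent word ("the")
print(frequency_class(76, 3_789_654))         # 16 -> a word occurring 76 times
```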

Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.

Pedagogy


These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).

Effects of word frequency


Word frequency is known to have various effects (Brysbaert et al. 2011; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner receives more exposures (Laufer 1997). Lexical access is positively influenced by high word frequency, a phenomenon called the word frequency effect (Segui et al.). The effect of word frequency is related to the effect of age of acquisition, the age at which a word was learned.

Languages


Below is a review of available resources.

English


Word counting is an ancient practice,[6] with known discussions dating back to Hellenistic times. In 1944, Edward Thorndike, Irving Lorge and colleagues[7] hand-counted 18,000,000 running words to provide the first large-scale English-language frequency list, before modern computers made such projects far easier (Nation 1997). These 20th-century works all suffer from their age. In particular, words relating to technology, such as "blog" (in 2014 ranked #7665 in frequency[8] in the Corpus of Contemporary American English,[9] and first attested in 1999[10][11][12]), do not appear in any of these three lists.

The Teachers Word Book of 30,000 words (Thorndike and Lorge, 1944)
The Teacher's Word Book contains 30,000 lemmas, or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and the language change since, have reduced its applicability (Nation 1997).
The General Service List (West, 1953)
The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rate of occurrence (%) of the different meanings and parts of speech of each headword is provided. Various criteria other than frequency and range were carefully applied to the corpus. Thus, despite its age, some errors, and a corpus of entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise (Nation 1997). This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.
The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)
A corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials and in its tagging of each word's frequency, in each school grade and in each subject area (Nation 1997).
The Brown (Francis and Kučera, 1982), LOB and related corpora
These each contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists (Nation 1997).

French


Traditional datasets


A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list of 1,500 high-frequency words, completed by a later F.F.2 list of 1,700 mid-frequency words, together with the most used syntax rules.[13] It is claimed that 70 grammatical words constitute 50% of the words in communicative sentences,[14][15] while 3,680 words provide about 95–98% coverage.[16] A list of 3,000 frequent words is available.[17]

The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[18] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[19]

More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open CC BY-SA 4.0 license.[20]

Subtlex


Lexique3 is an ongoing study from which the SUBTLEX movement cited above originated. New et al. 2007 produced a completely new count based on online film subtitles.

Spanish


There have been several studies of Spanish word frequency (Cuetos et al. 2011).[21]

Chinese


Chinese corpora have long been studied from the perspective of frequency lists. The traditional way of learning Chinese vocabulary is based on character frequency (Allanic 2003). The American sinologist John DeFrancis mentioned its importance for learning and teaching Chinese as a foreign language in Why Johnny Can't Read Chinese (DeFrancis 1966). As frequency toolkits, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high- and medium-frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words, are two other lists of common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 produced a rich study of Chinese word and character frequencies.

Other


Wiktionary contains frequency lists in more languages.[22]

Lists of the most frequently used words in different languages, based on Wikipedia or combined corpora, are also available.[23]

See also


Notes


References

from Grokipedia
A word list is a curated collection of lexical items from a language, typically organized alphabetically, by frequency of occurrence, or thematically, and compiled for specific analytical, educational, or practical purposes such as instruction, comparison, or software applications. In historical linguistics, word lists have long been instrumental for tasks like historical-comparative analysis and fieldwork; for instance, the Swadesh list, developed by Morris Swadesh in the mid-20th century, comprises 100 to 207 basic items intended to remain stable across languages for estimating divergence times through glottochronology. Similarly, in language pedagogy, word lists prioritize high-utility terms to optimize learning efficiency, as seen in the General Service List (GSL), a 1953 compilation by Michael West of approximately 2,000 English word families representing the most frequent general vocabulary needed for everyday comprehension. Complementing this, Averil Coxhead's Academic Word List (AWL), published in 2000, identifies 570 word families prevalent in university-level texts across disciplines, excluding those in the GSL, to support advanced academic reading and writing. Beyond education and linguistics, word lists play a key role in computational contexts, where they form the basis for tools like spell-check dictionaries and related algorithms; standard word list files, such as those bundled with Unix-like operating systems (e.g., /usr/share/dict/words), contain thousands of entries in multiple languages to enable functions such as text validation. These applications underscore the versatility of word lists, which enhance targeted language mastery and technological efficiency by focusing on essential or contextually relevant terms rather than exhaustive dictionaries.

Overview

Definition and Scope

A word list is a curated collection of lexical items from a language, often derived from linguistic data and ranked by frequency of occurrence in a corpus, serving as a tool for analyzing distribution and usage patterns in corpus linguistics. These lists typically present words in descending order of frequency, highlighting the most common lexical items first, and are essential for identifying the core vocabulary that accounts for the majority of running words in everyday texts. For instance, the most frequent words often include function words such as articles, prepositions, and pronouns, which dominate everyday discourse.

Word lists vary in their unit of counting, with key distinctions between headword lists, lemma-based lists, and word family lists. A headword represents the base form of a word, such as "run," without grouping variants. Lemma-based lists expand this to include inflected forms sharing the same base, like "run," "runs," "running," and "ran," treating them as a single entry to reflect morphological relationships. In contrast, word family lists encompass not only inflections but also derived forms, such as "runner," "running," and "unrunnable," capturing broader semantic and derivational connections within the vocabulary.

The scope of word lists is generally limited to common nouns, verbs, adjectives, and other content words in a language, excluding proper nouns (such as names of people, places, or brands) unless they hold contextual significance in specialized corpora. This focus ensures the lists prioritize generalizable vocabulary over unique identifiers. Basic word lists, often comprising the top 1,000 most frequent items, cover essential everyday terms sufficient for rudimentary communication, while comprehensive lists extending to many thousands of words incorporate advanced vocabulary for broader proficiency, such as in academic or professional settings. Systematic frequency-based word lists emerged in the early 20th century with large-scale manual counts.

Historical Evolution

The development of word lists began in the early 20th century with manual efforts to identify high-frequency vocabulary for educational purposes. In 1921, Edward Thorndike published The Teacher's Word Book, a list of 10,000 words derived from analyses of children's reading materials, including school texts and juvenile literature, to aid in curriculum design and instruction. This was expanded in 1932 with A Teacher's Word Book of the Twenty Thousand Words Found Most Frequently and Widely in General Reading for Children and Young People, which incorporated additional sources to rank words by frequency in youth-oriented content. By 1944, Thorndike collaborated with Irving Lorge on The Teacher's Word Book of 30,000 Words, updating the earlier lists by integrating data from over 4.5 million words across diverse adult materials such as newspapers, magazines, and books, thereby broadening applicability beyond child-focused education.

Post-World War II advancements emphasized practical lists for language teaching, particularly in English as a foreign language (EFL) and other tongues. Michael West's General Service List (GSL), released in 1953, compiled 2,000 word families selected for their utility in EFL contexts, drawing from graded readers and general texts to prioritize coverage of everyday communication. Concurrently, in France during the 1950s, the Français Fondamental project produced basic vocabulary lists ranging from 1,500 to 3,200 words, organized around 16 centers of interest like family and work, to standardize teaching for immigrants and non-native speakers through corpus-based analysis of spoken and written French.

The digital era marked a shift toward computerized corpora in the late 20th century, enabling larger-scale and more precise frequency counts. The Brown Corpus, a 1-million-word collection of 1961 texts, was created in 1961 and made digitally available in 1964, facilitating the rise of computational methods for word list construction and influencing subsequent projects with balanced, genre-diverse data. This culminated in the 2013 New General Service List (NGSL) by Charles Browne, Brent Culligan, and Joseph Phillips, which updated West's GSL using a 273-million-word corpus of contemporary English, refining the core vocabulary to 2,801 lemmas for better EFL relevance. A notable shift occurred in 2009 with the introduction of SUBTLEX by Marc Brysbaert and Boris New, a frequency measure derived from 51 million words in movie and TV subtitles, offering superior representation of spoken language patterns over traditional written corpora. This subtitle-based approach has since expanded, exemplified by the 2024 adaptation SUBTLEX-CY for Welsh, which analyzes a 32-million-word corpus of television subtitles to provide psycholinguistically validated frequencies for this low-resource Celtic language, underscoring the method's versatility in supporting underrepresented tongues.

Methodology

Key Factors in Construction

The construction of word lists hinges on ensuring representativeness, which requires balancing a diverse range of genres, such as fiction, news, and academic texts, to prevent skews toward specific linguistic features or registers. This diversity mirrors the target language's natural variation, allowing the list to capture a broad spectrum of usage patterns without overemphasizing one sub-domain. Corpus size plays a critical role in reliability, with a minimum of 1 million words often deemed sufficient for stable frequency estimates of high-frequency vocabulary, though larger corpora (16-30 million words) enhance precision for psycholinguistic norms. Smaller corpora risk instability in rankings, particularly for mid- and low-frequency items.

Decisions on word family inclusion address morphological relatedness, treating derivatives like "run," "running," and "ran" as a single unit based on affixation levels that account for regularity and transparency. Bauer and Nation's framework outlines seven progressive levels, starting from the headword and extending to complex derivations, enabling compact lists that reflect learner needs while avoiding over-inflation of unique forms. This approach prioritizes semantic and derivational connections, but requires careful calibration to exclude transparent compounds that may dilute family coherence. Normalization techniques mitigate sublanguage biases, where specialized texts like technical documents disproportionately elevate jargon frequencies. Sampling and weighting adjust for these imbalances by proportionally representing genres, ensuring the list approximates general language use rather than niche varieties. Such methods preserve overall frequency integrity while countering distortions from uneven source distributions.

Key challenges include handling polysemy, where a single form's multiple senses complicate frequency attribution, often requiring sense-disambiguated corpora to allocate counts accurately. Idioms pose similar issues, as their multi-word nature and non-compositional meanings evade standard tokenization, potentially underrepresenting phrasal units in lemma-based lists. Neologisms further challenge static lists built from pre-2020 corpora, necessitating periodic updates to incorporate emergent terms without retrospective bias. Dispersion metrics like Juilland's D quantify the evenness of a word's distribution across texts, with values approaching 1 indicating broad coverage and thus greater reliability for generalizability. This measure, normalized by corpus structure, helps filter out words concentrated in a few documents, enhancing the list's robustness beyond raw frequency.
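
As a rough illustration of such a dispersion measure, the sketch below computes a Juilland-style D over equal-sized corpus parts, using the common formulation D = 1 - V / sqrt(n - 1), where V is the coefficient of variation of the word's sub-frequencies; the four-part split and sample counts are assumptions for demonstration only.

```python
# Juilland-style dispersion: split the corpus into n equal parts, take the
# word's frequency in each part, and penalise uneven distributions.
from statistics import mean, pstdev

def juilland_d(subfreqs):
    n = len(subfreqs)
    mu = mean(subfreqs)
    if mu == 0 or n < 2:
        return 0.0
    v = pstdev(subfreqs) / mu          # coefficient of variation of sub-frequencies
    return 1 - v / (n - 1) ** 0.5      # 1 = perfectly even, 0 = highly concentrated

print(juilland_d([10, 10, 10, 10]))    # 1.0  -> evenly dispersed word
print(juilland_d([40, 0, 0, 0]))       # 0.0  -> word concentrated in one text
```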

Corpus Sources

Traditional written corpora have formed the foundation for early word list construction, providing balanced samples of edited prose across various genres. The Brown Corpus, compiled in 1961, consists of approximately 1 million words drawn from 500 samples of texts published that year, including press, fiction, and academic prose, making it the first major computer-readable corpus for linguistic research. Similarly, the British National Corpus (BNC), developed in the 1990s, encompasses 100 million words of contemporary British English, with 90% from written sources like books and newspapers and 10% from spoken transcripts, offering a synchronic snapshot of language use. These corpora, while pioneering in representing formal written language, have notable limitations, such as the absence of informal, conversational expressions and evolving colloquialisms that emerged after their compilation periods.

To address gaps in capturing everyday spoken language, subtitle and spoken corpora have gained prominence since the late 2000s, prioritizing natural dialogue over polished text. The SUBTLEX family, for instance, derives frequencies from film and television subtitles; SUBTLEX-US, based on American English subtitles, includes 51 million words from over 8,000 movies and series, providing measures like words per million and contextual diversity (the percentage of films featuring a word). This approach offers advantages in reflecting colloquial frequency, as subtitle-based norms better predict lexical decision times and reading behaviors compared to traditional written corpora like the Brown or BNC, which underrepresent informal speech patterns.

Modern digital corpora have expanded scale and diversity by incorporating web-based and historical data, enabling broader frequency analyses. The Corpus of Contemporary American English (COCA), spanning 1990 to 2019, contains over 1 billion words across genres such as spoken transcripts, fiction, magazines, newspapers, academic texts, and web content including blogs, thereby capturing evolving usage in digital contexts. Complementing this, the Google Books Ngram corpus draws from trillions of words in scanned books across languages, covering the period from 1800 to 2019 (with extensions to 2022 in recent updates), allowing diachronic tracking of word frequencies while excluding low-quality scans for reliability. Post-2010, there has been a notable shift toward multimodal corpora that integrate text with audio transcripts, video, and other modalities to enhance relevance for second language (L2) learners by simulating real-world input. These resources, such as those combining spoken audio with aligned textual representations, better support vocabulary acquisition in naturalistic settings compared to text-only sources. Dedicated corpora for AI-generated text remain in early development.

Lexical Unit Definitions

In the construction of word lists, a fundamental distinction exists between lemmas and word forms as lexical units. A lemma represents the base or citation form of a word, encompassing its inflected variants that share the same core meaning, such as "be" including "am," "is," "are," and "been." This approach groups related forms to reflect semantic unity and is commonly used in frequency-based lists to avoid inflating counts with morphological variations. In contrast, word forms refer to the surface-level realizations of words as they appear in texts, treating each inflectional or spelling variant separately for precise token analysis, such as counting "runs" and "running" independently. This differentiation affects how vocabulary size is estimated and prioritized in lists, with lemmas promoting efficiency in pedagogical applications while word forms provide granular data on actual usage patterns.

Word families extend the lemma concept by incorporating hierarchically related derivatives and compounds, allowing for a more comprehensive representation of vocabulary knowledge. According to Bauer and Nation's framework, which outlines seven progressive levels, inclusion begins at Level 1, treating each inflected form as separate, and progresses through Level 2 (inflections with the same base), Levels 3-6 (various derivational affixes based on frequency, regularity, and productivity), to Level 7 (classical roots and affixes). This scale balances inclusivity with learnability, though practical word lists often limit themselves to Level 6 to focus on more transparent forms, integrating less predictable derivatives only if they occur frequently in corpora. For instance, the word family for "decide" at higher levels might include "decision," "indecisive," and "undecided," reflecting shared morphological and semantic relationships. Such hierarchical structuring is widely adopted in corpus-derived lists to estimate coverage and guide instruction.

Multi-word units, such as collocations and lexical bundles, are treated as single lexical entries in pedagogical word lists to account for their formulaic nature and frequent co-occurrence beyond chance. Phrases like "point of view" or "in order to" are included holistically rather than as isolated words, recognizing their role as conventionalized units that learners acquire as wholes for fluency. These units are identified through corpus analysis focusing on frequency and range, with lists like the Academic Collocation List compiling thousands of such sequences tailored to specific registers. By delineating multi-word units distinctly, word lists enhance coverage of idiomatic expressions, which constitute a significant portion of natural language use.

The token-type distinction underpins the delineation of lexical units by differentiating occurrences from unique forms, which is essential for assessing lexical diversity in word lists. Tokens represent every instance of a word in a corpus, including repetitions, while types denote distinct forms, such as unique lemmas or word families. This leads to the type-token ratio (TTR), a measure of lexical variation calculated as

TTR = \frac{\text{types}}{\text{tokens}}

where higher values indicate greater diversity. In word list construction, TTR helps evaluate corpus representativeness, guiding decisions on the counting unit to ensure lists reflect both coverage and richness without redundancy.
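
A minimal sketch of the TTR computation, with a simplified whitespace tokeniser assumed for illustration:

```python
# Type-token ratio (TTR) from the formula above: distinct word forms (types)
# divided by total running words (tokens). Tokenisation is deliberately crude.
def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the cat sat on the mat and the dog sat too".split()
print(len(set(sample)), len(sample), round(type_token_ratio(sample), 2))
# 8 types / 11 tokens -> TTR ~= 0.73
```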
Challenges in defining lexical units arise with proper nouns and inflections, particularly in languages with diverse morphological structures. Proper nouns like "London" are often excluded from core frequency lists or segregated into separate categories to focus on general vocabulary, unless analyses specifically track capitalized forms for domain-specific coverage, as seen in the BNC/COCA lists where they comprise nearly half of unlisted types. In agglutinative languages such as Turkish or Finnish, extensive inflectional suffixes create long, context-dependent forms, complicating lemmatization and risking fragmentation of units; for example, a single root might yield dozens of surface variants, necessitating advanced morphological parsing to group them accurately without under- or over-counting types. These issues highlight the need for language-specific rules in unit delineation to maintain list utility.

Frequency Calculation Methods

Frequency calculation in word lists begins with raw frequency, which simply counts the occurrences of a lexical unit within a corpus. For instance, if a word appears N times in a corpus of total size S, its raw frequency is N. This measure provides an unnormalized tally but is sensitive to corpus size variations, limiting direct comparisons across datasets. To address this, relative frequency normalizes counts against the total number of words, often expressed per million words for comparability. The formula is f = \frac{\text{count}}{\text{total words}} \times 1{,}000{,}000, yielding a standardized rate that facilitates analysis across diverse corpora. This approach is standard in corpus linguistics for scaling frequencies proportionally.

Zipf's law offers a predictive model for word frequency distributions, stating that the frequency f of a word is approximately inversely proportional to its rank r in the frequency list, given by f \approx \frac{c}{r}, where c is a constant. Validation typically involves plotting frequency against rank on a log-log scale, where a linear relationship confirms adherence to the law, as observed in many natural language corpora. This principle, first formalized in 1935, underpins much of modern frequency analysis by highlighting the skewed nature of word usage.

Advanced metrics extend beyond basic counts to account for contextual and distributional properties. Mutual information quantifies the association strength in collocations by measuring how much the co-occurrence probability of two words deviates from their independent probabilities, favoring rare but strongly linked pairs over high-frequency but weakly associated ones. For dispersion, adjustments mitigate biases from uneven distribution across texts; a common transformation is the logarithmic adjustment \log(f + 1), which compresses the frequency skew and stabilizes variance for low-frequency items. The SUBTLEX database exemplifies this by employing \log_{10}(f + 1) to derive Zipf-scaled frequencies, better handling the long-tail distribution in subtitle corpora.

Computational tools streamline these calculations. AntConc generates raw and relative frequency lists from loaded corpora, supporting keyword extraction and basic statistical overviews. Other corpus platforms provide advanced querying for frequency lists, including part-of-speech filtering and collocation metrics. Post-2020 developments integrate neural embeddings, such as BERT models, to compute semantic frequencies that weight word occurrences by contextual similarity, enhancing traditional counts with contextual information for more nuanced rankings in word lists.
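
The following sketch illustrates the per-million normalisation and the log-log Zipf check described above, reusing the example counts from the table in the first section; it is illustrative only.

```python
# Convert raw counts to occurrences per million words, then produce
# (log rank, log frequency) pairs: an approximately straight line on a
# log-log plot suggests Zipfian behaviour.
import math
from collections import Counter

def per_million(counts):
    total = sum(counts.values())
    return {w: c / total * 1_000_000 for w, c in counts.items()}

def zipf_points(counts):
    ranked = counts.most_common()
    return [(math.log10(rank), math.log10(freq))
            for rank, (_, freq) in enumerate(ranked, start=1)]

counts = Counter({"the": 3_789_654, "he": 2_098_762, "king": 57_897, "boy": 56_975})
print({w: round(f, 1) for w, f in per_million(counts).items()})
print(zipf_points(counts)[:2])  # first two (log rank, log frequency) pairs for plotting
```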

Applications and Effects

Pedagogical Integration

Word lists play a central role in vocabulary prioritization for instruction, enabling educators to focus on high-frequency vocabulary that maximizes text coverage with minimal effort. According to Nation's research, knowledge of the top 2,000 to 3,000 word families typically provides 80-90% coverage of everyday written and spoken texts, allowing learners to achieve functional comprehension early in their studies. This approach ensures that instructional time is allocated efficiently, prioritizing words that appear most often in authentic materials rather than rare or specialized terms.

Vocabulary acquisition is often structured in tiers using word lists tailored to learner proficiency. For beginners, high-frequency lists emphasize the most common 1,000-2,000 words to build a foundational lexicon essential for basic communication. At advanced levels, specialized lists such as the Academic Word List (AWL), developed by Coxhead in 2000, target 570 word families prevalent in scholarly texts, enhancing learners' ability to engage with academic discourse. Teaching methods incorporating word lists frequently employ spaced-repetition systems to reinforce retention, scheduling reviews at increasing intervals based on learner performance to optimize long-term memory formation. These lists are also integrated with frameworks like the Common European Framework of Reference for Languages (CEFR), where approximately 500 words align with A1-level basic user proficiency, guiding curriculum design and assessment.

In digital applications, word lists inform frequency-based progression, as seen in language-learning platforms that sequence lessons to introduce high-utility vocabulary first for rapid skill-building. Effectiveness is evaluated through coverage tests, which measure how well a learner's vocabulary spans sample texts, confirming alignment with instructional goals. However, post-2020 developments in AI, such as personalized learning systems that dynamically adjust word exposure using updated frequency data, remain underexplored in pedagogical literature despite their potential to enhance customization.
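
As a simple illustration of a coverage test, the sketch below computes the share of running words in a text that appear in a given word list; the tiny list and the sample sentence are invented for demonstration.

```python
# Coverage test: what proportion of a text's tokens is covered by a word list?
def coverage(text_tokens, word_list):
    known = set(word_list)
    covered = sum(1 for t in text_tokens if t in known)
    return covered / len(text_tokens) if text_tokens else 0.0

high_freq = ["the", "a", "of", "to", "and", "is", "in", "it"]  # toy high-frequency list
tokens = "the aim of the test is to estimate coverage in a short text".split()
print(f"{coverage(tokens, high_freq):.0%}")  # share of tokens found in the list
```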

Psycholinguistic Impacts

High-frequency words are recognized more rapidly during lexical access than low-frequency words, as demonstrated in priming studies where repeated exposure to frequent items accelerates subsequent identification. This effect arises from the organization of the mental lexicon, where frequent words occupy more accessible positions, reducing search time in models of lexical retrieval. Eye-tracking research further supports this, showing that gaze durations on high-frequency words are shorter by approximately 20-50 milliseconds during natural reading, reflecting faster orthographic and phonological processing.

Word frequency also influences memory retrieval, with high-frequency words exhibiting fewer tip-of-the-tongue (TOT) states, in which a known word temporarily evades recall. Studies indicate that TOT incidents are significantly rarer for words in the top 1,000 most frequent, as their stronger semantic-phonological connections facilitate easier access from long-term memory. In contrast, low-frequency words, comprising the bulk of the lexicon beyond basic lists, are more prone to TOT states due to weaker representational strength, impacting fluent production in everyday speech.

The Zipfian distribution underlying word frequency lists promotes incremental vocabulary acquisition by prioritizing high-frequency items that learners encounter repeatedly in input. This skewed pattern allows initial mastery of a small set of common words, enabling contextual inference for rarer terms. Low-frequency words demand more exposures to achieve comparable retention, as their sparse occurrence hinders consolidation in memory. Such dynamics underscore how frequency-based lists align with natural learning trajectories, reducing cognitive load during early stages.

An interaction between word frequency and phonological neighborhood density modulates processing efficiency, where high-frequency words in dense neighborhoods (surrounded by many phonologically similar competitors) exhibit slower recognition times. This inhibitory effect, observed in production tasks, arises from increased competition among activated lexical candidates, delaying selection despite the word's inherent frequency advantage. Recent neuroimaging evidence from fMRI studies confirms this at the neural level, revealing reduced activation in left frontal language regions for high-frequency words during reading, indicative of more efficient articulatory planning and semantic integration with fewer neural resources.

Language-Specific Examples

English-Language Lists

One of the earliest and most influential English-language word lists is the General Service List (GSL), compiled by Michael West in 1953, which includes 2,000 headwords selected for their high frequency in everyday English texts and covers approximately 80% of words in general written materials. This list was derived from a corpus of about 2.5 million words, primarily from British and American sources, emphasizing words useful for non-native learners. However, the GSL has faced criticism for relying on a dated corpus that predates significant linguistic shifts, such as technological advancements, leading to underrepresentation of modern vocabulary.

To address these limitations, the New General Service List (NGSL), developed by Charles Browne, Brent Culligan, and Joseph Phillips in 2013, updates the GSL with 2,801 lemmas drawn from the 273-million-word Cambridge English Corpus (CEC), excluding overlap with the Academic Word List to focus on general high-frequency vocabulary. The NGSL achieves over 90% coverage of common texts, providing a more current foundation for language learning by incorporating internet-era terms.

For specialized contexts, the Academic Word List (AWL), created by Averil Coxhead in 2000, identifies 570 word families prevalent in university-level texts across disciplines, excluding general high-frequency words to target academic-specific vocabulary essential for higher education. This list was built from a 3.5-million-word corpus of written academic English, highlighting terms like "analyze" that appear frequently in scholarly writing but rarely in everyday conversation.

In the domain of readability for early education, Edward Fry's 1967 list compiles the top 300 high-frequency words suitable for grades 1-9, aiding in the assessment of text accessibility for young readers. The NGSL serves as a modern counterpart, extending coverage to contemporary terms absent from earlier lists like Fry's. Recent global events, such as the COVID-19 pandemic, underscore the need for post-2020 refreshes to English word lists, as pandemic-related terms saw dramatic frequency increases in 2020, potentially altering coverage priorities in learner corpora.

European-Language Lists

Frequency-based word lists for European languages other than English have been developed to address the unique morphological and syntactic features of these tongues, such as rich inflectional systems and grammatical gender, which complicate lexical frequency estimation compared to analytic languages like English. These lists often draw from diverse corpora, including written texts, spoken dialogues, and subtitles, to capture both formal and colloquial usage. Traditional corpora, such as national reference collections, provide the foundational data for many of these efforts.

In French, the Français Fondamental project, initiated in 1948 and completed by 1964 under the direction of Paul Rivenc, produced a core vocabulary list ranging from 800 to 3,000 words, along with basic grammatical structures, aimed at teaching French as a second language to illiterate adults in colonial contexts and beyond. This list emphasized high-utility terms derived from everyday speech and simple texts, influencing subsequent pedagogical materials. A more comprehensive modern resource is Lexique3, a lexical database covering approximately 140,000 unique word forms, with frequency measures updated in 2016 using a corpus of film and television subtitles totaling over 50 million words, enabling precise psycholinguistic analyses of word recognition and processing.

For Spanish, Mark Davies's A Frequency Dictionary of Spanish (2006) compiles the 5,000 most frequent words from a 20-million-word corpus spanning contemporary written and spoken sources, including newspapers, fiction, and conversations, providing part-of-speech information and example sentences to support language learners. Complementing this, the SUBTLEX-ESP database (2011) offers word frequencies derived from the subtitles of 1,627 Spanish films and television programs, encompassing 41.5 million words, which better approximates informal, spoken language exposure than traditional written corpora.

German frequency lists, such as those integrated into learner dictionary series, rely on DeReKo (the German Reference Corpus), a vast collection of over 61.4 billion words as of 2025, drawn from newspapers, books, and other texts from the mid-20th century onward, to rank lemmas and word forms by occurrence. Analysis of this corpus shows that the top 4,000 words account for about 95% of tokens in typical German texts, highlighting the efficiency of focusing on high-frequency items for comprehension and instruction.

Cross-linguistic extensions of subtitle-based frequency measures, like the SUBTLEX family, have been adapted for European languages; for instance, SUBTLEX-FR (2012) provides French word frequencies from film subtitles, facilitating comparative studies across Romance and Germanic tongues. However, compiling these lists in gendered languages presents challenges, particularly with nouns, where separate masculine and feminine forms (e.g., French le chat vs. la chatte) must be aggregated at the lemma level or ranked individually, potentially skewing rankings due to morphological variation and agreement rules that inflate the frequencies of inflected variants. A notable recent advancement is the 2022 EU-funded Romance-Croatian Parallel Corpus, which includes aligned texts in five Romance languages (French, Italian, Portuguese, Romanian, and Spanish) alongside Croatian, totaling millions of words, to update frequency profiles and support cross-linguistic comparison while addressing gaps in outdated monolingual lists for these high-resource languages.

Asian-Language Lists

Word lists for Asian languages often address unique linguistic features such as logographic scripts in Chinese and Japanese, the tonal system of Mandarin, and the syllabic structure of Korean Hangul. These lists prioritize data from large corpora to account for compound words, character combinations, and context-dependent usage, differing from alphabetic languages by emphasizing character-level statistics alongside word forms. For instance, in logographic systems, frequency calculations may distinguish between individual characters and multi-character words to better reflect reading and comprehension patterns.

In Chinese, the Hanyu Shuiping Kaoshi (HSK) syllabus, developed in the 1980s by China's national language-testing authorities, structures vocabulary across six levels with a total of 8,840 words and characters, enabling learners to progress from basic greetings to advanced discourse. This list draws from contemporary written and spoken corpora, incorporating both simplified characters and common compounds, with coverage increasing cumulatively: level 1 requires 150 items, while level 6 encompasses all prior levels plus 2,500 additional entries for professional and academic contexts.

Japanese word lists grapple with the complexity of kanji compounds, where frequency is derived from mixed-script corpora including hiragana, katakana, and kanji. This approach highlights how compound formation affects word boundaries, with top entries covering approximately 90% of typical texts through prioritized kanji-kanji pairings. For Korean, the National Institute of the Korean Language's frequency list, released in the 2000s, identifies the top 5,000 words from the Sejong Corpus, an approximately 11-million-word collection of balanced written and spoken data, handling Hangul's syllabic nature by treating morphemes and particles as integral to word units. This list facilitates learner progression by including honorifics and agglutinative forms, with the initial 1,000 items alone accounting for over 70% of common occurrences.

A distinctive aspect of Chinese word lists is the distinction between character and word frequency, as logographic writing allows characters to function independently or in compounds. According to the Ministry of Education's 1986 guidelines, the 3,500 most common characters cover 99% of usage in modern texts, enabling efficient literacy instruction without exhaustive memorization of all 50,000+ characters in existence. This character-centric metric contrasts with word-based lists in alphabetic languages, influencing pedagogical tools to prioritize high-coverage hanzi like 的 (de, possessive particle) and 是 (shì, to be).

Despite these advancements, post-2020 developments in Asian word lists remain limited, with few incorporating social media corpora for Mandarin to capture evolving slang and neologisms such as 打工人 (dǎ gōng rén, "wage slave"). Similarly, the integration of emojis into frequency frameworks is nascent, overlooking their role as visual lexemes in digital communication, such as emoji-modified compounds on popular messaging and microblogging platforms.

Emerging and Low-Resource Languages

Word lists for emerging and low-resource languages address critical gaps in linguistic documentation, particularly for indigenous, African, and endangered varieties where traditional corpora are scarce. These efforts often rely on targeted collections from oral traditions, limited texts, and modern digital sources to prioritize the high-frequency vocabulary essential for revitalization and basic communication. Such lists not only support documentation and teaching but also enable computational applications in understudied tongues.

In indigenous languages, organizations like SIL International have developed extensive corpora and word lists to document and analyze vocabulary from diverse communities worldwide. For instance, SIL's resources include elicitation-based word lists and texts for languages spoken across many regions, facilitating frequency analysis where full corpora are unavailable. A notable example is work on Navajo (Diné), where benchmark references from the 1980s, such as Young and Morgan's grammatical analyses, underpin vocabulary compilations of around 1,000 core terms derived from spoken and educational materials.

African languages have seen advancements through corpus-driven frequency lists, filling voids in data for Bantu and other families. The Helsinki Corpus of Swahili 2.0, compiled in the 2000s and expanded to 25 million words, yields top-1,000 word lists based on annotated texts from newspapers, books, and interviews, highlighting everyday usage patterns. For Zulu (isiZulu), computational extraction methods applied to parallel corpora in studies from the early 2020s enable semi-automatic term and collocation identification, drawing from web-mined and aligned texts to rank common lexical items.

Recent expansions target Celtic and North Germanic low-resource languages using subtitle and gigaword corpora for more naturalistic data. The SUBTLEX-CY database for Welsh, released in 2023 from a 32-million-word corpus of subtitles, provides detailed word frequencies that outperform earlier written-based lists in predicting lexical processing. Similarly, the Icelandic Gigaword Corpus, developed in the 2010s with versions reaching 1.3 billion words by 2022, supports customizable lists drawn from parliamentary speeches, news, and other sources, aiding in the analysis of a language with limited external influences.

Crowdsourced platforms have emerged as vital tools for endangered languages, enabling community-driven vocabulary building. Language-learning apps host user-generated courses for Hawaiian ('Ōlelo Hawaiʻi), including lists of 2,000-3,000 high-frequency terms compiled from preserved documents and revitalization projects, which emphasize practical words for daily use and cultural preservation. Despite these initiatives, significant gaps persist in AI-assisted word lists for over 7,000 low-resource languages, where training data shortages limit model development and exacerbate digital divides. Post-2020 calls from initiatives like the Lacuna Fund urge the creation of open-access global corpora to democratize resources, emphasizing collaborative data curation for equitable NLP advancements in underrepresented tongues; as of 2025, the fund continues to support new dataset releases for African and indigenous languages.

References
