Lexicostatistics
from Wikipedia

Lexicostatistics is a method of comparative linguistics that involves comparing the percentage of lexical cognates between languages to determine their relationship. Lexicostatistics is related to the comparative method but does not reconstruct a proto-language. It is to be distinguished from glottochronology, which attempts to use lexicostatistical methods to estimate the length of time since two or more languages diverged from a common earlier proto-language. This is merely one application of lexicostatistics, however; other applications of it may not share the assumption of a constant rate of change for basic lexical items.

The term "lexicostatistics" is somewhat misleading in that mathematical calculations are used but not statistics proper. Features of a language other than the lexicon may be used, though this is unusual. Whereas the comparative method uses shared innovations to identify sub-groups, lexicostatistics does not identify innovations: it is a distance-based method, whereas the comparative method considers language characters directly. Lexicostatistics is a simple and fast technique relative to the comparative method but has limitations (discussed below). It can be validated by cross-checking the trees produced by both methods.

History

Lexicostatistics was developed by Morris Swadesh in a series of articles in the 1950s, based on earlier ideas.[1][2][3] The concept's first known use was by Dumont d'Urville in 1834 who compared various "Oceanic" languages and proposed a method for calculating a coefficient of relationship. Hymes (1960) and Embleton (1986) both review the history of lexicostatistics.[4][5]

Method

Create word list

The aim is to generate a list of universally used meanings (hand, mouth, sky, I). Words are then collected for these meaning slots for each language being considered. Swadesh reduced a larger set of meanings down to 200 originally. He later found that it was necessary to reduce it further but that he could include some meanings that were not in his original list, giving his later 100-item list. The Swadesh list in Wiktionary gives the total 207 meanings in a number of languages. Alternative lists that apply more rigorous criteria have been generated, e.g. the Dolgopolsky list and the Leipzig–Jakarta list, as well as lists with a more specific scope; for example, Dyen, Kruskal and Black have 200 meanings for 84 Indo-European languages in digital form.[6]

Determine cognacies

A trained and experienced linguist is needed to make cognacy decisions, and the decisions may need to be refined as the state of knowledge increases. Lexicostatistics does not, however, rely on all the decisions being correct. For each pair of words (in different languages) in the list, the cognacy of a form may be positive, negative or indeterminate. Sometimes a language has multiple words for one meaning, e.g. small and little for "not big".

Calculate lexicostatistic percentages

The lexicostatistic percentage for a language pair is the proportion of meanings that are cognate, relative to the total number of meanings that could be compared (i.e. excluding indeterminate cases). This value is entered into an N×N table of distances, where N is the number of languages being compared; when completed, the table is half-filled in triangular form. The higher the percentage of cognates, the more closely the languages are related.
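The computation can be sketched as follows, using hypothetical cognacy judgements (True = cognate, False = not cognate, None = indeterminate) for a toy 10-meaning list; real studies use 100- or 200-item lists.

```python
# Sketch of the lexicostatistic percentage table (hypothetical cognacy data).

def cognate_percentage(judgements):
    """Percentage of cognate meanings, ignoring indeterminate (None) slots."""
    decided = [j for j in judgements if j is not None]
    if not decided:
        return 0.0
    return 100.0 * sum(decided) / len(decided)

langs = ["A", "B", "C"]
# Toy judgements for a 10-meaning list, one entry per meaning slot.
judgements = {
    ("A", "B"): [True, True, False, True, None, True, False, True, True, False],
    ("A", "C"): [True, False, False, None, None, False, False, True, False, False],
    ("B", "C"): [True, False, False, True, None, False, False, True, False, False],
}

# Half-filled triangular table of distances (here stored as percentages).
table = {pair: cognate_percentage(js) for pair, js in judgements.items()}
for (a, b), pct in table.items():
    print(f"{a}-{b}: {pct:.0f}%")
```

Note that indeterminate slots are dropped from the denominator, matching the "relative to the total without indeterminacy" rule above.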

Create family tree

Creation of the language tree is based solely on the table found above. Various sub-grouping methods can be used but that adopted by Dyen, Kruskal and Black was:

  • all lists are placed in a pool
  • the two closest members are removed and form a nucleus which is placed in the pool
  • this step is repeated
  • under certain conditions a nucleus becomes a group
  • this is repeated until the pool only contains one group.

Lexical percentages then have to be calculated for nuclei and groups as well as for individual lists.
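The pooling procedure above resembles agglomerative clustering. A minimal sketch, using average linkage as a stand-in for the nucleus and group percentage calculations (Dyen, Kruskal and Black's actual grouping conditions are not modelled here):

```python
# Minimal agglomerative-clustering sketch of the pooling procedure.
# Average linkage is an illustrative stand-in, not the original method.

def cluster(items, similarity):
    """Repeatedly merge the two most similar pool members into a nucleus."""
    pool = [(name,) for name in items]
    sim = {frozenset(p): s for p, s in similarity.items()}

    def link(a, b):
        # Average pairwise similarity between the members of two nuclei.
        pairs = [(x, y) for x in a for y in b]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(pool) > 1:
        best = max(((a, b) for i, a in enumerate(pool) for b in pool[i + 1:]),
                   key=lambda ab: link(*ab))
        a, b = best
        pool.remove(a)
        pool.remove(b)
        pool.append(a + b)  # the new nucleus re-enters the pool
        print(f"merge {a} + {b}")
    return pool[0]

tree = cluster(["A", "B", "C"],
               {("A", "B"): 67, ("A", "C"): 25, ("B", "C"): 33})
```

With these toy percentages, A and B merge first (67%), and the resulting nucleus then joins C, mirroring the pool-and-nucleus steps listed above.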

Applications

A leading exponent of the application of lexicostatistics has been Isidore Dyen.[7][8][9][10] He used it to classify Austronesian languages[11] as well as Indo-European ones.[6] A major study of the latter was reported by Dyen, Kruskal and Black (1992).[6] Studies have also been carried out on Amerindian and African languages.

Pama-Nyungan

The problem of internal branching within the Pama-Nyungan language family has been a long-standing issue for Australianist linguistics, and the general consensus held that internal connections between the 25+ subgroups of Pama-Nyungan were either impossible to reconstruct or that the subgroups were not in fact genetically related at all.[12] In 2012, Claire Bowern and Quentin Atkinson published the results of their application of computational phylogenetic methods to 194 doculects representing all major subgroups and isolates of Pama-Nyungan.[13] Their model "recovered" many of the branches and divisions that had previously been proposed and accepted by other Australianists, while also providing some insight into the more problematic branches, such as Paman (which is complicated by a lack of data) and Ngumpin-Yapa (where the genetic picture is obscured by very high rates of borrowing between languages). Their dataset is the largest of its kind for a hunter-gatherer language family, and the second largest overall after Austronesian (Greenhill et al. 2008). They conclude that Pama-Nyungan languages are not in fact exceptions to lexicostatistical methods, which have successfully been applied to other language families of the world.

Criticisms

Hoijer (1956) and others have shown that there were difficulties in finding equivalents for the meaning items, and many have found it necessary to modify Swadesh's lists.[14] Gudschinsky (1956) questioned whether it was possible to obtain a universal list.[15]

Factors such as borrowing, tradition and taboo can skew the results, as with other methods. Lexicostatistics has sometimes been applied using lexical similarity rather than cognacy to find resemblances; this is then equivalent to mass comparison.

The choice of meaning slots is subjective, as is the choice of synonyms.

Improved methods

Some of the modern computational statistical hypothesis testing methods can be regarded as improvements of lexicostatistics in that they use similar word lists and distance measures.[citation needed]

from Grokipedia
Lexicostatistics is a quantitative method in historical linguistics that assesses the genetic relatedness between languages by comparing the proportion of shared cognates (words with a common origin) in their basic vocabularies, using a standardized list of core, culturally neutral terms to minimize borrowing and semantic shift. This approach treats lexical similarity as a proxy for phylogenetic relatedness, enabling the classification of languages into families without relying on exact proto-language reconstruction. Developed primarily by the American linguist Morris Swadesh in the early 1950s, lexicostatistics emerged as a tool to systematically classify languages, particularly those with limited written records, such as many Indigenous American tongues. Swadesh proposed lists of 100 and 200 basic words (e.g., body parts, numerals, pronouns) selected for their high retention rates over time and resistance to replacement. The method gained traction through applications to language families like Indo-European and Austronesian, where it facilitated large-scale comparisons via computational tools. It is closely linked to glottochronology, the extension of lexicostatistics for dating language splits based on a constant rate of lexical replacement, assuming an average 14% loss of core vocabulary per millennium. Lexicostatistics has faced critiques for oversimplifying linguistic evolution, including its assumptions about uniform change rates and the difficulty of cognate identification. Despite these, it remains influential in modern computational historical linguistics, informing databases like the Global Lexicostatistical Database and hybrid models integrating qualitative comparative evidence.

Introduction

Definition and Principles

Lexicostatistics is a statistical method employed in historical linguistics to assess the genetic relatedness of languages by quantifying the proportion of shared basic vocabulary items, specifically cognates, between them. This approach treats lexical similarities as indicators of common ancestry, providing a numerical measure of divergence without reconstructing proto-languages. At its core, lexicostatistics operates on the principle that certain basic vocabulary items exhibit stable retention rates over time, resisting replacement or borrowing at a relatively constant pace across languages. These items are drawn from Swadesh lists, standardized compilations of 100 or 200 universal concepts, such as body parts (hand, eye), numerals (one, two), and pronouns (I, you), chosen for their cultural neutrality and low susceptibility to borrowing. The method assumes an average retention rate of approximately 86% per millennium for this core lexicon, with variations noted in empirical studies (e.g., ~80.5%). Cognates, defined as words in related languages that descend from a shared ancestral form and exhibit systematic sound correspondences, form the basis for these comparisons, while non-cognate matches due to borrowing or chance are excluded. Lexicostatistical similarity is calculated as the percentage of matching cognates in the selected word lists, serving as a proxy for relatedness. Established thresholds interpret these percentages using the standard 86% retention rate: above 80% typically indicates dialects or closely related languages (0–10 centuries of separation), 36–80% suggests membership in the same language family (10–35 centuries), and 12–36% points to distant genetic stocks (35–50 centuries). In practice, the procedure involves selecting a standardized word list, gathering equivalents from the languages under comparison, identifying cognates through expert judgment or comparative analysis, and computing the cognate percentage to gauge similarity. This high-level process emphasizes quantitative objectivity in classifying linguistic relationships within the broader framework of historical linguistics.
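Under the constant-rate assumption of glottochronology (an application of lexicostatistics, not the method itself), the cognate percentage maps to a separation time as t = ln(c) / (2 ln(r)), where c is the shared-cognate proportion and r ≈ 0.86 the per-millennium retention rate. A minimal sketch of that mapping:

```python
import math

# Glottochronological time-depth sketch (an *application* of lexicostatistics).
# t = ln(c) / (2 * ln(r)): c = shared-cognate proportion, r = assumed
# per-millennium retention rate of core vocabulary (~0.86).

RETENTION = 0.86

def separation_millennia(cognate_pct, r=RETENTION):
    """Estimated time since divergence, in millennia, for a cognate percentage."""
    c = cognate_pct / 100.0
    return math.log(c) / (2 * math.log(r))

# Two languages each retain ~86% of the core list after one millennium,
# so they share ~0.86**2 ~= 74% cognates.
print(round(separation_millennia(74.0), 2))  # ~1.0 millennium
```

The factor of 2 reflects that both languages lose vocabulary independently after the split; the thresholds quoted above (80%, 36%, 12%) correspond to roughly 10, 35 and 50 centuries under this formula.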

Relation to Historical Linguistics

Historical linguistics seeks to reconstruct proto-languages, classify language families, and trace the divergences among related languages through systematic analysis of linguistic changes over time. This field primarily employs the comparative method, which identifies regular sound correspondences in cognate words to infer ancestral forms and relationships. The reconstruction of proto-languages, such as Proto-Indo-European, serves as a tool for achieving the core aim of historical classification, enabling linguists to map the evolution and genetic affiliations of languages. Lexicostatistics plays a quantitative complementary role to the comparative method by providing rapid metrics of relatedness across large datasets, based on the proportion of shared cognates in basic vocabulary lists. Unlike the comparative method's focus on detailed sound correspondences for reconstruction, lexicostatistics emphasizes percentage-based assessments to infer phylogenetic affinities, offering an efficient alternative for initial family classifications where full comparative analysis is resource-intensive. It integrates with traditional etymology by validating or refining classifications derived from etymological studies, such as distinguishing inherited forms from borrowings through recurrent patterns. Effective application of lexicostatistics requires foundational knowledge of established language families, the distinction between lexical borrowing and genetic inheritance, and the rationale for prioritizing vocabulary over morphology or syntax due to the former's greater stability. Basic vocabulary, like that in Swadesh lists, is selected for its resistance to replacement, retaining about 86% similarity after 1,000 years, compared to the more variable nature of grammatical structures influenced by contact or internal drift. Borrowing must be accounted for by excluding or adjusting loanwords that mimic cognates, ensuring percentages reflect true inheritance rather than external influences.
In contrast to non-quantitative methods like traditional etymology, which reconstruct individual word histories through detailed sound laws, lexicostatistics prioritizes aggregate percentages for broad similarity measures, bypassing in-depth form-by-form analysis. This approach differs by treating lexical data statistically to gauge relatedness, rather than aiming for precise proto-forms, making it suited for hypothesis generation in understudied families. However, its scope is limited to classification and divergence estimation, as it does not support the deep phonological or morphological reconstruction required for language revival.

Historical Development

Origins and Early Influences

The conceptual foundations of lexicostatistics emerged from 19th-century efforts in comparative philology, where scholars sought quantitative measures for language relatedness through vocabulary comparisons. In 1834, the French explorer and linguist Jules Sébastien César Dumont d'Urville proposed an early coefficient of relationship by comparing basic vocabulary across Oceanic languages during his voyages, marking one of the first attempts to numerically assess lexical similarities for classification purposes. This approach built on the Romantic-era emphasis on systematic word comparisons, exemplified by Jacob Grimm's formulation of sound correspondences in Deutsche Grammatik in 1822, which indirectly highlighted the stability of core vocabulary elements across related tongues. August Schleicher's development of the Stammbaum (family tree) model in the 1850s further influenced these ideas, drawing analogies from pre-Darwinian biology to visualize language divergence as branching evolutions based on shared lexical forms. In the early 20th century, anthropological linguistics, particularly studies of Native American languages, reinforced the notion of vocabulary stability as a tool for tracing historical and cultural connections. Edward Sapir, in his 1921 monograph Language, observed that certain basic vocabulary items exhibit greater persistence over time compared to others, attributing this to their fundamental role in everyday communication and cultural continuity, based on his fieldwork with Indigenous groups. Similarly, Hermann Hirt's 1900 analysis of ablaut and lexical patterns in Der indogermanische Ablaut explored vocabulary distributions to subgroup languages, prefiguring quantitative assessments of divergence through word retention. These works in anthropological contexts underscored how stable lexical cores could serve as proxies for deeper historical relationships, influencing later methodological refinements.
By the 1940s, Morris Swadesh's fieldwork among Native American languages led to initial explorations of lexical dating, where he noted consistent rates of vocabulary replacement that could estimate divergence times. In a 1948 proposal presented at the Viking Fund Summer Conference, Swadesh outlined a vocabulary-based dating method relying on retention rates of core terms, drawing from his observations of lexical stability in endangered languages. This intellectual lineage intersected with post-World War II demands for rapid language classification amid decolonization and global migration studies, where anthropologists and administrators required efficient tools to map linguistic diversity in newly independent regions. Additionally, analogies to statistical methods in biology, such as early phylogenetic classifications, provided a framework for treating language evolution as a measurable process akin to species divergence.

Key Figures and Evolution

Morris Swadesh is recognized as the founder of lexicostatistics, having formalized the approach in his seminal 1952 paper, "Lexico-Statistic Dating of Prehistoric Ethnic Contacts," published in the Proceedings of the American Philosophical Society. In this work, Swadesh introduced a standardized list of 200 basic vocabulary items intended to reflect stable elements of language less prone to borrowing or rapid change, enabling quantitative comparisons to infer historical relationships among languages, particularly North American Indigenous ones. Key collaborators advanced Swadesh's framework in the early 1950s. Robert B. Lees provided a critical refinement in his 1953 article, "The Basis of Glottochronology," in Language, where he evaluated the assumptions underlying the method's application to dating linguistic divergences and suggested adjustments for more robust statistical handling. Similarly, Isidore Dyen applied lexicostatistical techniques to the Austronesian language family in his 1965 study, "A Lexicostatistical Classification of the Austronesian Languages," demonstrating its utility for subgrouping large families through percentage calculations. The field evolved through refinements and debates in the mid-20th century. In 1955, Swadesh proposed a reduced 100-item list in "Towards Greater Accuracy in Lexicostatistic Dating," aiming to enhance reliability by focusing on the most stable concepts amid growing scrutiny of retention rates. The 1960s saw intense discussions, notably at Wenner-Gren Foundation symposia documented in Current Anthropology, where critiques like those by Knut Bergsland and Hans Vogt in 1962 questioned the constant-rate assumption central to dating (glottochronology), prompting a conceptual shift toward using lexicostatistics solely for relative classification rather than absolute chronologies. By the 1970s, standardization efforts, such as Marvin L. Bender's 1971 lexicostatistical classification of Ethiopian languages, emphasized methodological consistency for broader applications in African linguistics.
These developments marked key milestones, including 1950s international conferences sponsored by the Wenner-Gren Foundation that fostered interdisciplinary dialogue on quantitative methods. By 1970, the focus had solidified on relative classification over absolute dating, reflecting widespread acceptance of the method's limitations in temporal estimation. Swadesh's ideas drew brief early influence from Edward Sapir's concepts of linguistic drift and vocabulary stability, though the method's formalization postdated Sapir's era.

Methodology

Word List Creation

In lexicostatistics, the creation of word lists begins with the selection of basic or core vocabulary items designed to reflect stable elements of a language's lexicon that are least susceptible to borrowing or replacement due to cultural contact. These items typically include everyday concepts tied to universal human experiences, such as body parts, natural phenomena, and simple actions, which exhibit high retention rates over time and thus provide a reliable basis for comparing genetic relationships between languages. The most widely adopted standard lists are the 100-item list initially developed by Morris Swadesh in 1955, with a final version published posthumously in 1971, and the 200-item list published in 1952. These lists were compiled to standardize comparisons across diverse languages, with the 100-item version serving as a core subset emphasizing greater stability. Criteria for inclusion prioritize universality (concepts present in all human societies), stability (resistance to semantic or lexical replacement), and elicitability (ease of translation and verification from speakers). For instance, words like "hand" or "water" meet these standards due to their non-cultural specificity and low borrowability, estimated at around 10% for the 100-item list. The process of creating these lists involves elicitation directly from native speakers to obtain equivalents for each meaning, often using bilingual assistants or pictographic aids to ensure accuracy. Researchers select the most frequent or prototypical form for each slot, addressing ambiguity by specifying primary senses through contextual definitions; for example, prioritizing the sense of "all" as a quantifier over items (e.g., "all the trees") rather than singular totals. This step minimizes ambiguity and ensures comparability, though challenges arise from semantic shifts where a word's meaning has evolved differently across languages.
Variations in list creation adapt to specific research contexts, such as using shorter lists (e.g., 40–60 items) for isolates or under-documented languages where full elicitation is impractical, while maintaining the core criteria to preserve methodological consistency. Challenges like regional absences (e.g., no direct term for "snow" in tropical languages) are handled by allowing substitutions with culturally equivalent concepts, though this requires careful justification to avoid skewing comparisons. Swadesh lists are organized into conceptual categories to facilitate systematic elicitation and analysis. Key categories include pronouns (e.g., I, you, we), body parts (e.g., hand, eye, nose, ear), and nature terms (e.g., water, sun, moon, star, fire). These examples illustrate the focus on concrete, high-frequency items that elicit consistent responses across linguistic fieldwork.
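The elicitation step above can be sketched as a simple data structure; the meaning slots and the Latin forms here are hypothetical illustrations, not a prescribed format.

```python
# Sketch of a word-list data structure for elicitation (hypothetical data).
# Each meaning slot holds the elicited primary form, or None if no
# equivalent could be obtained (e.g. "snow" in some tropical languages).

SWADESH_SAMPLE = ["I", "you", "hand", "eye", "water", "sun", "snow"]

elicited = {
    "I": "ego", "you": "tu", "hand": "manus",
    "eye": "oculus", "water": "aqua", "sun": "sol", "snow": None,
}

# Only slots filled for a language enter later comparisons; gaps are
# excluded from the denominator when percentages are computed.
comparable = [m for m in SWADESH_SAMPLE if elicited.get(m) is not None]
print(len(comparable))  # 6
```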

Cognate Determination

Cognate determination in lexicostatistics involves identifying words across languages that share a common ancestral form, forming the essential analytical step after compiling standardized word lists. Key criteria include phonetic similarity based on regular sound correspondences (a foundational aspect of the comparative method, requiring patterns observed in at least two instances) and semantic equivalence, where words must correspond to the same basic concept in the vocabulary list. Procedures for cognate identification traditionally rely on manual coding, in which linguists assign yes/no judgments to word pairs, often drawing on etymological dictionaries to verify shared roots. Handling false cognates, superficial resemblances arising from chance similarity or independent development, involves applying thresholds for recurrent correspondences rather than relying on isolated matches, thereby minimizing errors from coincidental look-alikes. Significant challenges arise in detecting borrowings, such as loanwords from cultural contact that can comprise up to 10% of basic vocabulary and mimic inherited forms; accounting for dialectal variation, which introduces inconsistencies in word forms; and the critical dependence on expert judgment by historical linguists to navigate these issues through consensus and reference to established etymological resources. Initial approaches were entirely manual, but subsequent advancements incorporated computational tools, such as sound similarity algorithms exemplified by the Levenshtein distance, which quantifies differences in word sequences to aid in distinguishing potential cognates from non-cognates. A representative example appears in the Romance languages, where equivalents for "mother" trace to Latin māter and illustrate regular sound shifts: French mère (with loss of the intervocalic consonant), Spanish and Italian madre (retaining the intervocalic d), and Portuguese mãe (with further nasalization and simplification), confirming their cognate status through consistent phonological patterns.
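A Levenshtein-distance screen of the kind mentioned above can be sketched as follows; a low normalized edit distance only flags a word pair for expert review and does not by itself establish cognacy.

```python
# Sketch of a Levenshtein-distance screen for potential cognates.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Edit distance scaled by the longer word's length (0 = identical)."""
    return levenshtein(a, b) / max(len(a), len(b))

# Romance equivalents of "mother": low distances suggest possible cognacy.
for w in ["mere", "madre", "mae"]:
    print(w, round(normalized_distance("madre", w), 2))
```

In practice such scores would be combined with checks for recurrent sound correspondences, since chance look-alikes and loanwords can also yield low edit distances.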

Percentage Calculation

The core of lexicostatistics lies in quantifying lexical similarity through the calculation of cognate percentages between languages. The standard formula for the lexicostatistical percentage is the number of shared cognates divided by the total number of comparable word pairs in the standardized list, multiplied by 100:
Percentage = (number of cognates / total comparable words) × 100
This yields a similarity score, typically based on lists of 100 or 200 basic vocabulary items, where the denominator accounts for any missing or unelicitable forms to ensure comparability.
For example, in a hypothetical pairwise comparison of two languages using a 100-word list, if 45 words are identified as cognates, the similarity percentage is 45%. This computation forms the basis for assessing degrees of relatedness, with higher percentages indicating closer historical ties. These percentages are interpreted using established thresholds to classify linguistic relationships, originally proposed by Swadesh. Scores above 81% suggest dialects of the same language or very close relatives; 36% to 81% indicate membership in the same family branch; around 36% or lower but above 12% denote distant family relations; and below 12% typically signify unrelated languages or membership in separate stocks. In cases of asymmetry, such as when one language's list has more missing items than the other's, percentages are often symmetrized by averaging the directed scores or by restricting the comparison to fully overlapping items, minimizing bias from incomplete data. Adjustments for list incompleteness involve excluding non-comparable entries from the denominator, provided the number of such gaps remains small (under 20–30%), to preserve reliability. Statistical considerations emphasize the impact of sample size and potential sampling error in these calculations. For a 100-word list, margins of error are approximately ±10–15% at 95% confidence, calculated as ±1.96 √(p̂(1 − p̂)/n), where p̂ is the observed cognate proportion and n is the number of compared items.
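The percentage and its margin of error can be computed together in a short sketch; this uses the simple normal approximation given above, and real studies may adjust differently.

```python
import math

# Sketch of the lexicostatistical percentage with a binomial margin of error
# (normal approximation: +/- z * sqrt(p(1-p)/n)).

def percentage_with_ci(cognates, comparable, z=1.96):
    """Cognate percentage and its ~95% margin of error, both in percent."""
    p = cognates / comparable
    margin = z * math.sqrt(p * (1 - p) / comparable)
    return 100 * p, 100 * margin

pct, margin = percentage_with_ci(45, 100)
print(f"{pct:.0f}% ± {margin:.1f}%")  # 45% ± 9.8%
```

The roughly ±10% margin for a 100-word list illustrates why short lists can blur the family-branch thresholds quoted above.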