Contingency table
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation",[1] part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.
A crucial problem of multivariate statistics is finding the (direct-)dependence structure underlying the variables contained in high-dimensional contingency tables. If some of the conditional independences are revealed, then even the storage of the data can be done in a smarter way (see Lauritzen (2002)). To do this, one can use information-theoretic concepts, which draw their information solely from the probability distribution; that distribution can be obtained easily from the contingency table via the relative frequencies.
A pivot table is a way to create contingency tables using spreadsheet software.
Example
Suppose there are two variables, sex (male or female) and handedness (right- or left-handed). Further suppose that 100 individuals are randomly sampled from a very large population as part of a study of sex differences in handedness. A contingency table can be created to display the numbers of individuals who are male right-handed and left-handed, female right-handed and left-handed. Such a contingency table is shown below.
| Sex \ Handedness | Right-handed | Left-handed | Total |
|---|---|---|---|
| Male | 43 | 9 | 52 |
| Female | 44 | 4 | 48 |
| Total | 87 | 13 | 100 |
The numbers of males, females, and right- and left-handed individuals are called marginal totals. The grand total (the total number of individuals represented in the contingency table) is the number in the bottom right corner.
The table allows users to see at a glance that the proportion of men who are right-handed is about the same as the proportion of women who are right-handed although the proportions are not identical. The strength of the association can be measured by the odds ratio, and the population odds ratio estimated by the sample odds ratio. The significance of the difference between the two proportions can be assessed with a variety of statistical tests including Pearson's chi-squared test, the G-test, Fisher's exact test, Boschloo's test, and Barnard's test, provided the entries in the table represent individuals randomly sampled from the population about which conclusions are to be drawn. If the proportions of individuals in the different columns vary significantly between rows (or vice versa), it is said that there is a contingency between the two variables. In other words, the two variables are not independent. If there is no contingency, it is said that the two variables are independent.
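As a concrete illustration, Pearson's chi-squared statistic for the table above can be computed directly from the observed counts and the expected counts implied by the marginal totals. A minimal Python sketch (variable names are illustrative):

```python
# Pearson's chi-squared statistic for the 2 x 2 handedness table above.
# Rows: male, female; columns: right-handed, left-handed.
observed = [[43, 9], [44, 4]]

row_totals = [sum(row) for row in observed]        # [52, 48]
col_totals = [sum(col) for col in zip(*observed)]  # [87, 13]
grand_total = sum(row_totals)                      # 100

# Under independence, the expected count is
# E_ij = (row total_i * column total_j) / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - expected) ** 2 / expected

print(round(chi2, 4))  # 1.7774
```

With SciPy available, `scipy.stats.chi2_contingency(observed, correction=False)` should return the same statistic together with a p-value and the expected-count table.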
The example above is the simplest kind of contingency table, a table in which each variable has only two levels; this is called a 2 × 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher order contingency tables are difficult to represent visually. The relation between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, although such a practice is rare. For more on the use of a contingency table for the relation between two ordinal variables, see Goodman and Kruskal's gamma.
Standard contents of a contingency table
- Multiple columns (historically, they were designed to use up all the white space of a printed page). Where each row refers to a specific sub-group in the population (in this case men or women), the columns are sometimes referred to as banner points or cuts (and the rows are sometimes referred to as stubs).
- Significance tests. Typically, either column comparisons, which test for differences between columns and display these results using letters, or cell comparisons, which use color or arrows to identify a cell in a table that stands out in some way.
- Nets or netts, which are sub-totals.
- One or more of: percentages, row percentages, column percentages, indexes or averages.
- Unweighted sample sizes (counts).
Measures of association
The degree of association between the two variables can be assessed by a number of coefficients. The following subsections describe a few of them. For a more complete discussion of their uses, see the main articles linked under each subsection heading.
Odds ratio
The simplest measure of association for a 2 × 2 contingency table is the odds ratio. Given two events, A and B, the odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A and the odds of B in the absence of A. Two events are independent if and only if the odds ratio is 1; if the odds ratio is greater than 1, the events are positively associated; if the odds ratio is less than 1, the events are negatively associated.
The odds ratio has a simple expression in terms of probabilities; given the joint probability distribution

| | B = 1 | B = 0 |
|---|---|---|
| A = 1 | p11 | p12 |
| A = 0 | p21 | p22 |

the odds ratio is OR = (p11 × p22) / (p12 × p21).
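For the handedness example earlier in the article, the sample odds ratio can be computed in a few lines. An illustrative sketch, with counts taken from that table:

```python
# Sample odds ratio for the handedness table:
# rows male/female, columns right-/left-handed.
n11, n12 = 43, 9    # male: right-handed, left-handed
n21, n22 = 44, 4    # female: right-handed, left-handed

odds_male = n11 / n12     # odds of right-handedness among males
odds_female = n21 / n22   # odds of right-handedness among females
odds_ratio = odds_male / odds_female

# Equivalent cross-product form: (n11 * n22) / (n12 * n21).
assert abs(odds_ratio - (n11 * n22) / (n12 * n21)) < 1e-12
print(round(odds_ratio, 3))  # 0.434
```

An odds ratio below 1 here indicates that the odds of right-handedness are lower among the sampled males than among the sampled females.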
Phi coefficient
A simple measure, applicable only to the case of 2 × 2 contingency tables, is the phi coefficient (φ) defined by

φ = ±√(χ2/N),

where χ2 is computed as in Pearson's chi-squared test, and N is the grand total of observations. φ varies from 0 (corresponding to no association between the variables) to 1 or −1 (complete association or complete inverse association), provided it is based on frequency data represented in 2 × 2 tables. Its sign equals the sign of the product of the main-diagonal elements of the table minus the product of the off-diagonal elements. φ takes on the minimum value −1.0 or the maximum value of +1.0 if and only if every marginal proportion is equal to 0.5 (and two diagonal cells are empty).[2]
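For the handedness table, φ can be computed via the equivalent cross-product form, whose sign follows the diagonal-product rule. A small illustrative sketch:

```python
import math

# Phi coefficient for the 2 x 2 handedness table, using the
# cross-product form: phi = (a*d - b*c) / sqrt(r1 * r2 * c1 * c2),
# where a, b, c, d are the cell counts read row by row.
a, b = 43, 9
c, d = 44, 4
r1, r2 = a + b, c + d   # row totals: 52, 48
c1, c2 = a + c, b + d   # column totals: 87, 13

phi = (a * d - b * c) / math.sqrt(r1 * r2 * c1 * c2)
# For the same table, phi**2 equals chi2 / N.
print(round(phi, 4))  # -0.1333
```

The negative sign reflects that the main-diagonal product (43 × 4) is smaller than the off-diagonal product (9 × 44).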
Cramér's V and the contingency coefficient C
Two alternatives are the contingency coefficient C and Cramér's V.

The formulae for the C and V coefficients are:

- C = √(χ2 / (N + χ2)), and
- V = √(χ2 / (N(k − 1))),

where N is the grand total of observations and k is the number of rows or the number of columns, whichever is less.
C suffers from the disadvantage that it does not reach a maximum of 1.0; notably, the highest it can reach in a 2 × 2 table is 0.707. It can reach values closer to 1.0 in contingency tables with more categories; for example, it can reach a maximum of 0.870 in a 4 × 4 table. It should, therefore, not be used to compare associations in different tables if they have different numbers of categories.[3]
C can be adjusted so it reaches a maximum of 1.0 when there is complete association in a table of any number of rows and columns by dividing C by √((k − 1)/k), where k is the number of rows or columns, when the table is square[citation needed], or by the fourth root of ((r − 1)/r) × ((c − 1)/c), where r is the number of rows and c is the number of columns.[4]
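Both coefficients follow directly from the chi-squared statistic; for the handedness table earlier in the article, a minimal sketch:

```python
import math

# Contingency coefficient C and Cramer's V for the handedness table.
# k is the smaller of the number of rows and the number of columns.
observed = [[43, 9], [44, 4]]
N = sum(sum(row) for row in observed)

row_t = [sum(r) for r in observed]
col_t = [sum(c) for c in zip(*observed)]
chi2 = sum((observed[i][j] - row_t[i] * col_t[j] / N) ** 2
           / (row_t[i] * col_t[j] / N)
           for i in range(2) for j in range(2))

k = min(len(observed), len(observed[0]))   # 2 for a 2 x 2 table
C = math.sqrt(chi2 / (N + chi2))
V = math.sqrt(chi2 / (N * (k - 1)))
print(round(C, 3), round(V, 3))  # 0.132 0.133
```

For a 2 × 2 table, k − 1 = 1, so V reduces to √(χ2/N), the magnitude of the phi coefficient.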
Tetrachoric correlation coefficient
Another choice is the tetrachoric correlation coefficient, but it is applicable only to 2 × 2 tables. Polychoric correlation is an extension of the tetrachoric correlation to tables involving variables with more than two levels.
Tetrachoric correlation assumes that the variable underlying each dichotomous measure is normally distributed.[5] The coefficient provides "a convenient measure of [the Pearson product-moment] correlation when graduated measurements have been reduced to two categories."[6]
The tetrachoric correlation coefficient should not be confused with the Pearson correlation coefficient computed by assigning, say, values 0.0 and 1.0 to represent the two levels of each variable (which is mathematically equivalent to the φ coefficient).
Lambda coefficient
The lambda coefficient is a measure of the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0.0 (no association) to 1.0 (the maximum possible association).
Asymmetric lambda measures the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions.
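For the handedness table earlier in the article, asymmetric lambda (predicting handedness from sex) can be computed as the proportional reduction in prediction error. An illustrative sketch:

```python
# Goodman and Kruskal's lambda for the handedness table, predicting
# handedness (columns) from sex (rows).  Lambda is the proportional
# reduction in prediction error when the row category is known.
observed = [[43, 9], [44, 4]]

col_totals = [sum(c) for c in zip(*observed)]
N = sum(col_totals)

# Without knowing the row, always guess the modal column ("right-handed").
errors_without = N - max(col_totals)                       # 100 - 87 = 13
# Knowing the row, guess each row's modal column.
errors_with = sum(sum(row) - max(row) for row in observed)  # 9 + 4 = 13
lam = (errors_without - errors_with) / errors_without
print(lam)  # 0.0
```

Here lambda is 0 because "right-handed" is the best guess for both sexes; a lambda of 0 does not imply independence, only that knowing the row category never changes the modal prediction.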
Uncertainty coefficient
The uncertainty coefficient, or Theil's U, is another measure for variables at the nominal level. Its values range from 0.0 (no association) to 1.0 (complete association, in which one variable is fully predictable from the other).
The uncertainty coefficient is also a conditional and asymmetrical measure of association, which can be expressed as

U(X|Y) = (H(X) − H(X|Y)) / H(X),

where H(X) is the entropy of X and H(X|Y) is the conditional entropy of X given Y.
This asymmetrical property can lead to insights not as evident in symmetrical measures of association.[7]
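The asymmetry is easy to see numerically: computing U in both directions for the handedness table gives two different values. A sketch using natural-log entropies:

```python
import math

# Theil's uncertainty coefficient for the handedness table;
# X = sex (rows), Y = handedness (columns).
observed = [[43, 9], [44, 4]]
N = sum(sum(r) for r in observed)

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability cells."""
    return -sum(p * math.log(p) for p in probs if p > 0)

p = [[n / N for n in row] for row in observed]   # joint distribution
px = [sum(row) for row in p]                      # marginal of X
py = [sum(col) for col in zip(*p)]                # marginal of Y

h_x = entropy(px)
h_y = entropy(py)
h_xy = entropy([pij for row in p for pij in row])
mutual_info = h_x + h_y - h_xy                    # I(X;Y) = H(X) - H(X|Y)

u_x_given_y = mutual_info / h_x
u_y_given_x = mutual_info / h_y
# The two values differ because H(X) and H(Y) differ.
print(u_x_given_y, u_y_given_x)
```

Both values lie in [0, 1]; they share the numerator I(X;Y) but are normalized by different marginal entropies, which is what makes the measure asymmetric.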
Others
Gamma, Tau-b and Tau-c are used when the categories or levels of both variables have a natural order.
- Gamma test: No adjustment for either table size or ties.
- Kendall's tau: Adjustment for ties.
See also
- Confusion matrix
- Pivot table, in spreadsheet software, cross-tabulates sampling data with counts (contingency table) and/or sums.
- TPL Tables is a tool for generating and printing crosstabs.
- The iterative proportional fitting procedure essentially manipulates contingency tables to match altered joint distributions or marginal sums.
- Multivariate statistics of special multivariate discrete probability distributions; some procedures developed in that context can be applied to contingency tables.
- OLAP cube, a modern multidimensional computing form of contingency tables
- Panel data, multidimensional data over time
References
- ^ Karl Pearson, F.R.S. (1904). Mathematical contributions to the theory of evolution. Dulau and Co.
- ^ Ferguson, G. A. (1966). Statistical analysis in psychology and education. New York: McGraw–Hill.
- ^ Smith, S. C., & Albaum, G. S. (2004) Fundamentals of marketing research. Sage: Thousand Oaks, CA. p. 631
- ^ Blaikie, N. (2003) Analyzing Quantitative Data. Sage: Thousand Oaks, CA. p. 100
- ^ Ferguson.[full citation needed]
- ^ Ferguson, 1966, p. 244
- ^ "The Search for Categorical Correlation". 26 December 2019.
Further reading
- Andersen, Erling B. (1980). Discrete Statistical Models with Social Science Applications. North Holland.
- Bishop, Y. M. M.; Fienberg, S. E.; Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press. ISBN 978-0-262-02113-5. MR 0381130.
- Christensen, Ronald (1997). Log-linear models and logistic regression. Springer Texts in Statistics (Second ed.). New York: Springer-Verlag. pp. xvi+483. ISBN 0-387-98247-7. MR 1633357.
- Lauritzen, Steffen L. (1979). Lectures on Contingency Tables (Aalborg University) (PDF) (4th edition (first electronic edition), 2002 ed.).
- Gokhale, D. V.; Kullback, Solomon (1978). The Information in Contingency Tables. Marcel Dekker. ISBN 0-824-76698-9.
External links
- On-line analysis of contingency tables: calculator with examples
- Interactive cross tabulation, chi-squared independent test, and tutorial
- Fisher and chi-squared calculator of 2 × 2 contingency table
- More Correlation Coefficients
- Nominal Association: Phi, Contingency Coefficient, Tschuprow's T, Cramer's V, Lambda, Uncertainty Coefficient, March 24, 2008, G. David Garson, North Carolina State University
- CustomInsight.com Cross Tabulation
- The POWERMUTT Project: IV. DISPLAYING CATEGORICAL DATA
- StATS: Steves Attempt to Teach Statistics Odds ratio versus relative risk (January 9, 2001)
- Epi Info Community Health Assessment Tutorial Lesson 5 Analysis: Creating Statistics
Fundamentals
Definition and Purpose
A contingency table, also known as a cross-tabulation or two-way frequency table, is a matrix that presents the multivariate frequency distribution of two or more categorical variables, with rows and columns representing the categories of each variable and cell entries showing the observed frequencies.[6][1][7] These tables provide a structured way to summarize joint occurrences of categories across variables, enabling researchers to visualize how observations are distributed across combinations without requiring numerical or continuous data.[8]

The primary purpose of contingency tables is to explore potential associations between categorical variables, test hypotheses regarding their independence, and facilitate the computation of conditional probabilities from the data.[9][10] For instance, they are widely applied in epidemiology to assess relationships between risk factors and outcomes, such as exposure status and disease incidence.[11] In the social sciences, they help analyze patterns in survey responses or demographic data, while in market research, they reveal dependencies between consumer preferences and demographics to inform segmentation strategies.[12][13] Unlike parametric models, contingency tables allow for model-free visualization of dependencies, making them versatile for initial exploratory analysis across disciplines.[14]

Contingency tables are typically structured as r × c matrices, where r denotes the number of row categories and c the number of column categories, and they often incorporate fixed marginal totals for conducting conditional analyses that focus on distributions within subsets of the data.[15][6] This empirical focus on observed frequencies distinguishes them from logical constructs like truth tables, which enumerate all possible outcomes rather than summarizing real-world data counts.[1] A common application involves using such tables to detect deviations from independence, as in the chi-squared test.[10]

Historical Development
The roots of contingency tables trace back to early 19th-century efforts in vital statistics and demography, where scholars employed multi-way tables to explore associations between variables such as age, sex, and criminality, emphasizing marginal distributions over independence testing.[16] Precursors included Pierre-Simon Laplace and Siméon Denis Poisson's probabilistic analyses of 2×2 tables for comparing proportions in the early 1800s, and Félix Gavarret's 1840 application of these methods to medical data like sex ratios in births.[16] By the late 19th century, figures such as Charles Sanders Peirce (1884) developed measures of association for 2×2 tables, applying them to predictive problems like tornado forecasting, while Francis Galton (1892) used expected frequencies in 3×3 fingerprint classification tables to assess independence.[16]

The modern statistical foundation of contingency tables was established in 1900 by Karl Pearson, who introduced the chi-squared test as a measure of goodness-of-fit and independence, applicable to tables assessing associations between categorical variables.[17] Pearson's seminal paper, "On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can Be Reasonably Supposed to Have Been Caused by Random Sampling," provided a criterion for evaluating whether observed frequencies deviated significantly from expectations under random sampling, marking the shift toward rigorous hypothesis testing in contingency analysis.
Collaborating closely, George Udny Yule extended these ideas in the same year, developing association measures like Yule's coefficient of colligation for 2×2 tables, which quantified dependence in binary outcomes such as disease incidence by exposure.[18] Pearson later coined the term "contingency table" in his 1904 work, "On the Theory of Contingency and Its Relation to Association and Normal Correlation," formalizing the framework for multivariate categorical data.

In the 1920s and 1930s, Ronald A. Fisher advanced the methodology for small-sample contingency tables, critiquing the chi-squared approximation's limitations and developing exact inference procedures. In his 1922 paper, "On the Application of the χ² Method to Association and Contingency Tables," Fisher outlined applications of chi-squared to multi-way tables while highlighting issues with low expected frequencies. By 1934, Fisher formalized Fisher's exact test in Statistical Methods for Research Workers, using the hypergeometric distribution to compute precise p-values for 2×2 tables under fixed margins, illustrated famously by the "lady tasting tea" experiment, which addressed small-sample independence without relying on asymptotic approximations. These contributions emphasized exact conditional inference, influencing tests for sparse data.

The mid-20th century saw expansion to multi-way contingency tables, facilitated by computational advancements that enabled handling larger, more complex datasets beyond manual calculations. Early theoretical work, such as Maurice S. Bartlett's 1935 exploration of interactions in multi-dimensional tables, paved the way, but practical implementation accelerated post-World War II with electronic computers allowing iterative estimation for higher-order analyses. By the 1970s, Leo A.
Goodman integrated log-linear models into contingency table analysis, treating cell frequencies as Poisson or multinomial outcomes and using iterative proportional fitting to model interactions hierarchically. Goodman's series of papers, starting with "The Multivariate Analysis of Qualitative Data: Interactions among Multiple Classifications" in 1970, provided stepwise procedures and direct estimation for building models that captured main effects and higher-order associations in multi-way tables.[19] This approach, further developed by Stephen E. Fienberg and others, revolutionized the field by enabling sophisticated inference on complex categorical structures.[20]

Construction and Interpretation
Standard Format
A contingency table is typically arranged as a rectangular array that cross-classifies observations from two categorical variables, with rows representing the levels of one variable (say, with r categories) and columns representing the levels of the other (with c categories).[10] The cells of this table contain the observed frequencies, denoted n_ij, which count the number of observations falling into the i-th row category and j-th column category.[21] This layout facilitates the visualization of the joint distribution of the two variables.[1]

Standard notation employs double subscripts for cell entries, such as n_ij or x_ij for observed counts, where an uppercase symbol is sometimes used to distinguish observed counts from expected values in statistical analyses.[10] Certain cells may contain structural zeros, indicating combinations of categories that are impossible or precluded by design, rendering those probabilities inherently zero and the table incomplete.[22]

While the two-way table serves as the standard format, extensions to multi-way tables incorporate additional dimensions, such as a three-dimensional array for three variables, though these are analyzed by slicing or modeling the higher-order interactions.[23] Contingency tables are generally symmetric in the sense that the variables are interchangeable, but asymmetric variants exist where the ordering of rows and columns matters, as in confusion matrices that distinguish predicted from actual outcomes in classification tasks.[24]

In interpretation, the row totals (margins) summarize the distribution across columns for each row category, and column totals do likewise for rows; these margins enable the computation of conditional proportions, often expressed as percentages within rows or columns to assess relative frequencies.[10]

Illustrative Example
Consider a hypothetical survey of 200 individuals assessing the relationship between gender and preference for a particular product (yes or no response). The data can be organized into a 2 × 2 contingency table, with gender as the row variable (male, female) and preference as the column variable (yes, no). The observed frequencies in each cell represent the count of respondents falling into each category combination.[1] The resulting table is as follows:

| Gender | Yes | No |
|---|---|---|
| Male | 30 | 70 |
| Female | 40 | 60 |
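
Reading this table by rows gives the conditional proportions within each gender. A small illustrative computation:

```python
# Row-conditional proportions for the hypothetical gender/preference
# table: within each gender, the share answering "yes" or "no".
observed = {"Male": [30, 70], "Female": [40, 60]}

for gender, counts in observed.items():
    total = sum(counts)
    yes_share, no_share = (c / total for c in counts)
    print(gender, yes_share, no_share)
# Male: 0.3 yes vs 0.7 no; Female: 0.4 yes vs 0.6 no
```

Comparing these row proportions (30% versus 40% answering "yes") is the first step in judging whether gender and preference appear associated before any formal test.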
