Log-linear analysis
Log-linear analysis is a technique used in statistics to examine the relationship between more than two categorical variables. The technique is used for both hypothesis testing and model building. In both these uses, models are tested to find the most parsimonious (i.e., least complex) model that best accounts for the variance in the observed frequencies. (A Pearson's chi-square test could be used instead of log-linear analysis, but that technique only allows for two of the variables to be compared at a time.[1])
Fitting criterion
Log-linear analysis uses a likelihood ratio statistic that has an approximate chi-square distribution when the sample size is large:[2]

$G^2 = 2 \sum_{ij} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)$

where

- $\ln$ = natural logarithm;
- $O_{ij}$ = observed frequency in cell $ij$ (i = row and j = column);
- $E_{ij}$ = expected frequency in cell $ij$;
- $G^2$ = the deviance for the model.[3]
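As a brief illustration, the statistic can be computed directly in R for a hypothetical 2 × 2 table; the counts below are invented, and the expected frequencies are taken from the independence model:

# hypothetical observed 2 x 2 table
O <- matrix(c(30, 10, 20, 40), nrow = 2)
# expected frequencies under independence: (row total x column total) / grand total
E <- outer(rowSums(O), colSums(O)) / sum(O)
# likelihood ratio (deviance) statistic: G^2 = 2 * sum(O * ln(O / E))
G2 <- 2 * sum(O * log(O / E))
df <- (nrow(O) - 1) * (ncol(O) - 1)
pchisq(G2, df, lower.tail = FALSE)  # approximate p-value from the chi-square distribution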
Assumptions
There are three assumptions in log-linear analysis:[2]
1. The observations are independent and random;
2. Observed frequencies are normally distributed about expected frequencies over repeated samples. This is a good approximation if both (a) the expected frequencies are greater than or equal to 5 for 80% or more of the categories and (b) all expected frequencies are greater than 1. Violations of this assumption result in a large reduction in power. Suggested solutions to this violation are: delete a variable, combine levels of one variable (e.g., put males and females together), or collect more data.
3. The logarithm of the expected value of the response variable is a linear combination of the explanatory variables. This assumption is so fundamental that it is rarely mentioned, but like most linearity assumptions, it is rarely exact and often simply made to obtain a tractable model.
Additionally, the data should be categorical. Continuous data can first be converted to categorical data, though with some loss of information. If the data include both continuous and categorical variables, logistic regression is usually the better choice. (Any data that can be analysed with log-linear analysis can also be analysed with logistic regression; the technique chosen depends on the research questions.)
Variables
In log-linear analysis there is no clear distinction between independent and dependent variables; all variables are treated alike. Often, however, the theoretical background of the variables will lead them to be interpreted as either independent or dependent variables.[1]
Models
The goal of log-linear analysis is to determine which model components need to be retained in order to best account for the data. Model components are the main effects and interactions in the model. For example, if we examine the relationship between three variables (A, B, and C), there are seven model components in the saturated model: the three main effects (A, B, C), the three two-way interactions (AB, AC, BC), and the one three-way interaction (ABC).
The log-linear models can be thought of as lying on a continuum with the two extremes being the simplest model and the saturated model. The simplest model is the model where all the expected frequencies are equal, which is true when the variables are not related. The saturated model is the model that includes all the model components. This model will always explain the data the best, but it is the least parsimonious as everything is included. In this model, observed frequencies equal expected frequencies, so in the likelihood ratio chi-square statistic the ratio $O_{ij}/E_{ij} = 1$ and $\ln(O_{ij}/E_{ij}) = 0$. This results in the likelihood ratio chi-square statistic being equal to 0, which is the best model fit.[2] Other possible models are the conditional equiprobability model and the mutual dependence model.[1]
Each log-linear model can be represented as a log-linear equation. For example, with the three variables (A, B, C) the saturated model has the following log-linear equation:[1]

$\ln F_{ijk} = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$

where

- $F_{ijk}$ = expected frequency in cell $ijk$;
- $\lambda$ = the relative weight of each variable.
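A short R sketch of the two extremes of this continuum, using a hypothetical 2 × 2 × 2 table (invented counts) and the loglm function from the MASS package:

library(MASS)
# hypothetical 2 x 2 x 2 table of counts for variables A, B, C
tab <- array(c(12, 18, 25, 45, 22, 28, 15, 35), dim = c(2, 2, 2),
             dimnames = list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2")))
# mutual independence model: main effects only
fit_indep <- loglm(~ A + B + C, data = tab)
# saturated model: all seven components, G^2 = 0 by construction
fit_sat <- loglm(~ A * B * C, data = tab)
fit_indep  # printing shows the likelihood ratio and Pearson chi-square statistics
fit_sat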
Hierarchical model
Log-linear analysis models can be hierarchical or nonhierarchical. Hierarchical models are the most common. These models contain all the lower-order interactions and main effects of the interaction to be examined.[1]
Graphical model
A log-linear model is graphical if, whenever the model contains all two-factor terms generated by a higher-order interaction, the model also contains the higher-order interaction.[4] As a direct consequence, graphical models are hierarchical. Moreover, being completely determined by its two-factor terms, a graphical model can be represented by an undirected graph, where the vertices represent the variables and the edges represent the two-factor terms included in the model.
Decomposable model
A log-linear model is decomposable if it is graphical and if the corresponding graph is chordal.
Model fit
The model fits well when the residuals (i.e., observed minus expected frequencies) are close to 0, that is, the closer the observed frequencies are to the expected frequencies, the better the model fit. If the likelihood ratio chi-square statistic is non-significant, the model fits well (i.e., calculated expected frequencies are close to observed frequencies). If the likelihood ratio chi-square statistic is significant, the model does not fit well (i.e., calculated expected frequencies are not close to observed frequencies).
Backward elimination is used to determine which of the model components need to be retained in order to best account for the data. Log-linear analysis starts with the saturated model, and the highest-order interactions are removed until the model no longer accurately fits the data. Specifically, at each stage, after the removal of the highest-order interaction, the likelihood ratio chi-square statistic is computed to measure how well the model fits the data. Interactions are no longer removed when the likelihood ratio chi-square statistic becomes significant.[2]
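A sketch of one backward-elimination step in R, reusing the hypothetical table tab defined above: the saturated model is fitted first, then the three-way interaction is dropped and the fit of the reduced model is inspected.

library(MASS)
# step 1 of backward elimination: drop the highest-order (three-way) interaction
fit_sat <- loglm(~ A * B * C, data = tab)
fit_two <- loglm(~ (A + B + C)^2, data = tab)  # all main effects and two-way interactions
fit_two  # a non-significant G^2 suggests the three-way term can be removed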
Comparing models
When two models are nested, they can also be compared using a chi-square difference test. The chi-square difference test is computed by subtracting the likelihood ratio chi-square statistics for the two models being compared. This value is then compared to the chi-square critical value at their difference in degrees of freedom. If the chi-square difference is smaller than the chi-square critical value, the reduced model does not fit significantly worse than the fuller model and, being more parsimonious, is preferred. If the chi-square difference is larger than the critical value, the removed terms contribute significantly to the fit and the less parsimonious model is preferred.[1]
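Continuing the sketch above, the chi-square difference test for the two nested models can be computed directly from their likelihood ratio statistics and degrees of freedom (loglm stores these as lrt and df):

# difference in likelihood ratio chi-square statistics between the nested models
delta_G2 <- fit_two$lrt - fit_sat$lrt   # fit_sat$lrt is 0 for the saturated model
delta_df <- fit_two$df - fit_sat$df
pchisq(delta_G2, delta_df, lower.tail = FALSE)  # significant => keep the less parsimonious model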
Follow-up tests
Once the model of best fit is determined, the highest-order interaction is examined by conducting chi-square analyses at different levels of one of the variables. To conduct chi-square analyses, one needs to break the model down into a 2 × 2 or 2 × 1 contingency table.[2]
For example, if one is examining the relationship among four variables, and the model of best fit contained one of the three-way interactions, one would examine its simple two-way interactions at different levels of the third variable.
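A minimal R sketch of such a follow-up, again using the hypothetical three-way table tab: the A × B association is tested separately within each level of C.

# chi-square test of the A x B association at each level of C
sapply(dimnames(tab)$C, function(k) chisq.test(tab[, , k], correct = FALSE)$p.value)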
Effect sizes
To compare effect sizes of the interactions between the variables, odds ratios are used; a brief computational example follows the list below. Odds ratios are preferred over chi-square statistics for two main reasons:[1]
1. Odds ratios are independent of the sample size;
2. Odds ratios are not affected by unequal marginal distributions.
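A short R sketch of an odds ratio and its approximate confidence interval for a hypothetical 2 × 2 subtable (the interval is formed on the log-odds scale and then exponentiated):

# hypothetical 2 x 2 table
m <- matrix(c(30, 10, 20, 40), nrow = 2)
or <- (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])    # cross-product ratio
se_log_or <- sqrt(sum(1 / m))                       # standard error of the log odds ratio
c(or, exp(log(or) + c(-1.96, 1.96) * se_log_or))    # odds ratio with approximate 95% CI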
Software
For datasets with a few variables – general log-linear models
- R with the loglm function of the MASS package
- IBM SPSS Statistics with the GENLOG procedure
For datasets with hundreds of variables – decomposable models
References
[edit]- ^ a b c d e f g Howell, D. C. (2009). Statistical methods for psychology (7th ed.). Belmot, CA: Cengage Learning. pp. 630–655.
- ^ a b c d e Field, A. (2005). Discovering statistics using SPSS (2nd ed.). Thousand Oaks, CA: SAGE Publications. pp. 695–718. ISBN 9780761944515.
- ^ Agresti, Alan (2007). An Introduction to Categorical Data Analysis (2nd ed.). Hoboken, NJ: Wiley Inter-Science. p. 212. doi:10.1002/0470114754. ISBN 978-0-471-22618-5.
- ^ Christensen, R. (1997). Log-Linear Models and Logistic Regression (2nd ed.). Springer.
- ^ Petitjean, F.; Webb, G.I.; Nicholson, A.E. (2013). Scaling log-linear analysis to high-dimensional data (PDF). International Conference on Data Mining. Dallas, TX, USA: IEEE. pp. 597–606.
Further reading
- Log-linear Models
- Simkiss, D.; Ebrahim, G. J.; Waterston, A. J. R. (Eds.) "Chapter 14: Analysing categorical data: Log-linear analysis". Journal of Tropical Pediatrics, online only area, “Research methods II: Multivariate analysis” (pp. 144–153). Retrieved May 2012 from http://www.oxfordjournals.org/tropej/online/ma_chap14.pdf
- Pugh, M. D. (1983). "Contributory fault and rape convictions: Log-linear models for blaming the victim". Social Psychology Quarterly, 46, 233–242. JSTOR 3033794
- Tabachnick, B. G., & Fidell, L. S. (2007). Using Multivariate Statistics (5th ed.). New York, NY: Allyn and Bacon.[page needed]
Log-linear analysis
Fundamentals
Definition and Purpose
Log-linear analysis is a statistical technique used to model the relationships among multiple categorical variables through the analysis of multi-dimensional contingency tables. It expresses the logarithm of the expected cell frequencies as a linear combination of parameters that capture main effects and interactions among the variables. Formally, the model is given by

$\log \mu = \beta_0 + \sum_i \beta_i x_i + \sum_{i<j} \beta_{ij} x_i x_j + \cdots$

where $\mu$ denotes the expected frequency in a cell, $\beta_0$ is the intercept, the $\beta_i$ represent main effects, the $\beta_{ij}$ capture two-way interactions, and higher-order terms account for more complex associations, with the $x_i$ as indicator variables for the categories of the variables.

The primary purpose of log-linear analysis is to identify and test for associations, partial independence, and higher-order interactions in cross-classified categorical data, especially when traditional methods like analysis of variance (ANOVA) or linear regression are inappropriate due to the discrete nature of the variables. It enables researchers to assess whether variables are independent or exhibit dependencies that cannot be explained by lower-order effects alone, providing insights into the structure of multivariate frequency distributions.

Unlike logistic regression, which models the conditional distribution of one categorical response variable given the others, log-linear analysis models the joint distribution of all variables simultaneously, treating none as distinctly dependent or independent. This approach was developed in the 1960s, with key formalizations such as Birch's 1963 work on multiplicative models, building on foundational work in multivariate categorical analysis.[2]

Key Assumptions
Log-linear analysis relies on several foundational statistical assumptions to ensure the validity of inferences drawn from categorical data in contingency tables. Under the Poisson sampling framework, a primary assumption is the independence of cell counts, meaning that the counts in different cells are independent.[3] This independence ensures that the variability in one cell does not influence another, allowing the model to accurately capture associations among variables without confounding from correlated errors. Under multinomial sampling, cell counts are dependent due to the fixed total sample size, but the model accounts for this.

The sampling scheme underlying the data must also align with the model's structure, typically following either a Poisson distribution for individual cell counts (where the total sample size is not fixed) or a multinomial distribution when margins are fixed, such as in product-multinomial sampling.[4] These distributions lead naturally to the log-linear parameterization because the logarithm of the expected cell frequencies is modeled as a linear function of the parameters, facilitating the analysis of cell counts as responses.[4] Under the Poisson assumption, cell counts are independent Poisson random variables, while the multinomial variant conditions on fixed totals to model proportions.[3]

Another key assumption is the absence of structural zeros in the contingency table, where all cells are presumed to have positive probability unless theoretically impossible due to the study's design.[5] Structural zeros represent combinations of categories that cannot occur (e.g., impossible events), and their presence without explicit modeling can bias parameter estimates; thus, standard log-linear models assume such cells are either nonexistent or handled by excluding them from the table structure.[6]

For reliable inference, particularly when using likelihood ratio tests or Pearson chi-square statistics to assess model fit, the data must satisfy a sufficient sample size requirement: expected frequencies should be at least 5 in the majority (typically 80%) of cells.[6] This condition supports the asymptotic chi-square approximation of the deviance and ensures that parameter estimates are stable and tests have appropriate Type I error rates; smaller expected values may necessitate exact methods or collapsed tables.[6]

Finally, the model incorporates an assumption of homogeneity of variance implicit in the Poisson sampling framework, where the variance of each cell count equals its expected value (Var(Y) = E(Y) = μ).[7] This equidispersion property is crucial for the maximum likelihood estimation and standard error calculations in log-linear models; violations, such as overdispersion, may require extensions like negative binomial variants.[4]
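A small R sketch of checking the sample-size condition, assuming a hypothetical 2 × 2 × 2 table of invented counts: expected frequencies are obtained from the mutual independence model via loglin and then compared with the usual thresholds.

# hypothetical 2 x 2 x 2 table of counts
tab <- array(c(12, 18, 25, 45, 22, 28, 15, 35), dim = c(2, 2, 2),
             dimnames = list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2")))
# expected frequencies under mutual independence, fitted by iterative proportional fitting
E <- loglin(tab, margin = list(1, 2, 3), fit = TRUE, print = FALSE)$fit
mean(E >= 5) >= 0.80   # expected counts of at least 5 in 80% or more of the cells
all(E > 1)             # no expected count of 1 or less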
Types of Variables
Log-linear analysis primarily involves categorical variables, which are divided into nominal and ordinal types based on whether they possess an inherent order. Nominal variables represent unordered categories, such as gender (male/female) or religious affiliation (Protestant/Catholic/other), where the levels have no natural ranking and are treated as distinct groups without implying superiority or progression. These variables are typically encoded using dummy indicator variables, each corresponding to a category except a reference level, to facilitate modeling in the log-linear framework. Ordinal variables, however, feature ordered categories, for example, education levels (elementary/secondary/college) or satisfaction ratings (low/medium/high), where the sequence reflects increasing or decreasing intensity; they can also be represented by indicators but may incorporate scoring to leverage the order for more efficient analysis.[8]

The foundational data structure for log-linear analysis is the contingency table, a multi-way array that cross-classifies observations according to the levels of two or more categorical variables, with each cell containing the observed frequency count of occurrences for that combination. For example, a 2×2×3 contingency table might tabulate counts across two binary nominal variables and one ordinal variable with three levels, enabling the examination of joint distributions. The margins of these tables (row totals, column totals, or higher-dimensional sums) represent univariate marginal distributions or partial associations between subsets of variables, providing summaries of the data's structure before modeling interactions.[8][9]

In contingency tables for log-linear models, the treatment of margins as fixed or random depends on the underlying sampling scheme, which influences the distributional assumptions. Under Poisson sampling, all margins are random, with cell counts modeled as independent Poisson random variables whose means equal their variances, suitable for independent count data without fixed totals. In multinomial sampling, the overall sample size (a one-dimensional margin) is fixed, rendering other margins random conditional on this total, which aligns with scenarios where observations are categorized into mutually exclusive cells summing to a known n. This distinction ensures that the model accounts for the dependencies introduced by fixed margins, such as in prospective studies where row totals are controlled.[8][10]

Interaction terms in log-linear analysis describe the associations among categorical variables through main effects, two-way interactions, and higher-order terms, forming the core of how dependencies are modeled in contingency tables. Main effects capture univariate marginal distributions for individual variables, reflecting their standalone influences on cell frequencies. Two-way interaction terms model bivariate associations between pairs of variables, such as the joint distribution of gender and education level, indicating whether categories co-occur more or less than expected under independence. Higher-order interactions, like three-way terms, address conditional dependencies among three or more variables (for instance, how the association between two variables varies across levels of a third), allowing for the representation of complex, multifaceted relationships in multi-way tables.[8][11]

Model Specification and Fitting
Fitting Criteria
In log-linear analysis, parameters are estimated using maximum likelihood estimation (MLE), which selects values that maximize the likelihood of the observed cell frequencies under the assumption that these frequencies follow a Poisson distribution with expected values $\mu_i = \exp(x_i^\top \beta)$, where $x_i$ is the design vector for cell $i$.[12][13] The Poisson likelihood function for the parameters $\beta$ given the observed counts $y_i$ is

$L(\beta) = \prod_i \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!},$

where the product is over all cells in the contingency table, and the factorial terms are constants with respect to $\beta$.[13][12]

Maximizing this likelihood, or equivalently its logarithm, requires solving nonlinear equations, typically through iterative numerical procedures such as the Newton-Raphson method or iteratively reweighted least squares (IRLS), which converge to the MLE under standard conditions for generalized linear models. For large samples, the MLE possesses desirable asymptotic properties: it is consistent (converging in probability to the true $\beta$), asymptotically unbiased, and asymptotically normally distributed with covariance matrix given by the inverse of the observed Fisher information.[14]

In practice, contingency tables often contain empty cells (where $y_i = 0$), which can complicate estimation if they lead to non-convergence or infinite parameter estimates; common strategies include collapsing table dimensions to combine cells or, if theoretically justified, adding small constants (such as 0.5) to all cells prior to fitting, though the latter risks biasing results and is generally discouraged without strong rationale.[14][15]
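A brief R sketch of this estimation, assuming hypothetical counts stored in long format: glm fits the Poisson log-linear model by IRLS, which maximizes the same likelihood.

# hypothetical 2 x 2 x 2 counts in long format (one row per cell)
d <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2"))
d$count <- c(12, 18, 25, 45, 22, 28, 15, 35)
# Poisson log-linear model with all two-way interactions, fitted by IRLS
fit <- glm(count ~ (A + B + C)^2, family = poisson(link = "log"), data = d)
logLik(fit)    # maximized Poisson log-likelihood
summary(fit)   # parameter estimates with asymptotic standard errors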
Hierarchical Principles
In log-linear analysis, the hierarchical principle governs model specification by requiring that any k-way interaction term included in the model must be accompanied by all lower-order interactions (up to (k-1)-way) among the same variables, as well as their constituent main effects. For example, including a three-way interaction term ABC necessitates the two-way terms AB, AC, and BC, along with the main effects A, B, and C. This constraint, rooted in the principle of marginality, ensures that higher-order effects are interpreted relative to the associations captured by lower-order terms, preventing the estimation of isolated partial interactions that lack substantive meaning.

The hierarchical structure embodies a Markov-like assumption, wherein the conditional independence of variables given the lower-order terms is implicitly modeled; higher-order interactions thus represent deviations from independence that are conditional on the specified margins. This property facilitates the decomposition of complex associations into interpretable components, aligning with the iterative nature of model building in categorical data analysis.

Hierarchical models are compactly specified using bracket notation, where terms denote the highest-order interactions, and lower-order terms are automatically included. For instance, the notation [AB][C] specifies a model with the two-way interaction between variables A and B, the main effect of C, and, by hierarchy, the main effects of A and B, corresponding to the equation

$\log \mu_{ijk} = \lambda + \lambda^A_i + \lambda^B_j + \lambda^{AB}_{ij} + \lambda^C_k.$

Such notation streamlines the description of models ranging from complete independence ([A][B][C]) to partial associations.

Adopting the hierarchical principle offers key advantages, including reduced overfitting by limiting the parameter space to parsimonious, nested structures and enhanced interpretability through systematic progression from main effects to higher interactions. This approach supports forward or backward selection strategies, where models are compared hierarchically to identify significant associations without redundant terms.

A practical illustration occurs in analyzing a 2×2×2 contingency table, such as one cross-classifying gender (A), treatment (B), and outcome (C). The hierarchical model positing no three-way interaction but allowing all two-way interactions is expressed as

$\log \mu_{ijk} = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk}.$

This formulation tests for pairwise associations (e.g., treatment-outcome conditional on gender) while assuming uniformity across the third variable's levels, providing a baseline for assessing more complex dependencies.
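A minimal R sketch of these hierarchical specifications, reusing the hypothetical table tab from the earlier example; loglm's formula notation plays the role of the bracket notation, and the lower-order terms are included through the formula expansion.

library(MASS)
# [AB][C]: A-B association, with C jointly independent of A and B
fit_ab_c <- loglm(~ A * B + C, data = tab)
# [AB][AC][BC]: all two-way interactions, no three-way term (homogeneous association)
fit_hom <- loglm(~ (A + B + C)^2, data = tab)
fit_hom  # G^2 tests the fit of the no-three-way-interaction model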
Model Types
General Log-linear Models
General log-linear models provide a framework for analyzing associations among multiple categorical variables in contingency tables by modeling the logarithm of the expected cell frequencies as an additive function of parameters representing main effects and interactions. These models treat all variables symmetrically, without distinguishing between response and explanatory variables, and are fitted using maximum likelihood estimation, often via iterative methods like iterative proportional fitting. The approach is particularly suited for moderate-sized tables with a few variables, enabling tests of hypotheses about independence and conditional associations.[16]

A foundational example is the mutual independence model, which assumes no associations among the variables. For a three-way contingency table with variables A, B, and C, this model is specified as

$\log \mu_{ijk} = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k,$

where $\mu_{ijk}$ denotes the expected frequency in cell $(i, j, k)$, $\lambda$ is the overall mean log-frequency, and the remaining terms capture the main effects. Under this model, the expected frequencies factorize as the product of the one-way marginal distributions, implying that the joint distribution is the product of the marginals. This model serves as a baseline for assessing higher-order dependencies.[16]

Partial association models extend the mutual independence framework by incorporating selected interactions while assuming conditional independencies for others. For instance, the model for three variables that includes main effects for A, B, and C, along with two-way interactions AB and AC, but omits the BC interaction, yields

$\log \mu_{ijk} = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik}.$

This specification tests the conditional independence of B and C given A, where the absence of the BC term implies that any association between B and C is explained by their mutual relations with A. Such models allow for flexible exploration of partial dependencies in multi-way tables.[16]

The saturated model represents the most complex case, incorporating all possible main effects and interactions up to the highest order. For an $I \times J \times K$ table, it includes terms through the three-way ABC interaction, resulting in a parameter count equal to the number of cells ($IJK$), yielding zero degrees of freedom and a perfect fit to the observed data (fitted frequencies equal the observed counts in every cell). While useful as a reference for model comparison, it provides no parsimony or insight into underlying structures.

Interpretation of parameters in general log-linear models focuses on their role in describing log-expected margins and associations. The main effect terms quantify deviations in the log-expected frequencies for specific levels of a variable from the grand mean, averaged over the other variables; a main effect can thus be read as a log ratio of marginal means. Interaction parameters capture multiplicative effects; in two-way cases, the terms correspond to log-odds ratios measuring the strength of association between two variables, conditional on others in higher dimensions. These interpretations facilitate understanding of how variables jointly influence cell frequencies.[16]

Despite their flexibility, general log-linear models face significant limitations with increasing dimensionality. The number of potential parameters expands exponentially with the number of variables, exacerbating the curse of dimensionality: tables become increasingly sparse, estimation becomes computationally intensive due to the need for iterative algorithms over large parameter spaces, and models risk overfitting without strong prior constraints on interactions. These challenges restrict practical application to tables with no more than four or five variables unless specialized structures are imposed.

Graphical Models
Graphical log-linear models represent conditional independencies and interactions among categorical variables in multidimensional contingency tables using undirected graphs. These models facilitate the specification of log-linear models by visualizing the structure of associations, where the absence of certain interactions is explicitly encoded through the graph's topology.[17]

In a graphical log-linear model, nodes correspond to the categorical variables (or factors), and undirected edges between nodes denote direct associations, typically interpreted as two-factor interactions. The presence of an edge indicates that the variables are directly related, while the absence suggests potential conditional independence given other variables. Higher-order interactions are implied only if necessary to maintain the graph's structure.[18][19]

The graph encodes conditional independencies via graph separation: two sets of variables are conditionally independent given a third set if the third set separates them in the graph, meaning that every path between the two sets passes through the separating nodes. This separation implies that no higher-order interactions beyond those captured by the graph are required in the model. For example, if variables B and C are separated by A, then B is conditionally independent of C given A (B ⊥ C | A), precluding a three-way interaction term [ABC].[18]

The log-linear model is generated by including terms that correspond to the maximal cliques (complete subgraphs where every pair of nodes is connected by an edge) in the graph. Each maximal clique generates an interaction term of the order equal to the number of variables in it, and the full set of such terms defines the model. Marginal terms for smaller cliques are automatically included to ensure the model is hierarchical.[17]

Consider three binary variables A, B, and C forming a graph with edges A–B and A–C but no edge B–C. The maximal cliques are {A, B} and {A, C}, leading to the model [AB][AC]. This specifies the log-expected cell frequencies as

$\log \mu_{ijk} = \lambda + \lambda^A_i + \lambda^B_j + \lambda^C_k + \lambda^{AB}_{ij} + \lambda^{AC}_{ik},$

implying B ⊥ C | A.[17]

Graphical log-linear models offer advantages as a visual aid for discerning complex interdependencies among multiple variables, making it easier to hypothesize and test structures in high-dimensional data. They also connect to broader probabilistic modeling paradigms, such as Bayesian networks, by sharing foundational Markov properties that link graphical separation to statistical independence. These models can be fitted using maximum likelihood estimation, often via iterative proportional fitting algorithms.[17][19]

Decomposable Models
Decomposable log-linear models form a special subclass of graphical log-linear models, characterized by an underlying dependence graph that is chordal (meaning it contains no induced cycles of length greater than three) and admits a simplicial ordering of its cliques, allowing recursive decomposition into independent components separated by complete cliques.[20] This structure ensures that the model factorizes exactly over the maximal cliques and separators, facilitating efficient computation without approximation.[21]

A key advantage of decomposability is the availability of closed-form maximum likelihood estimates (MLEs), derived directly from the observed marginal tables corresponding to the cliques, with adjustments for the separators to account for overlaps; this eliminates the need for iterative optimization, making estimation computationally tractable even in higher dimensions.[20] The iterative proportional fitting (IPF) algorithm provides an alternative fitting method for these models, operating by cyclically scaling the current estimate to match the observed margins for each clique until convergence, which occurs in a finite number of steps for decomposable cases.[20]

The expected cell frequencies in a decomposable model satisfy the factorization

$\mu(x) = \frac{\prod_{C \in \mathcal{C}} \mu_C(x_C)}{\prod_{S \in \mathcal{S}} \mu_S(x_S)^{\nu(S)}},$

where $\mathcal{C}$ denotes the cliques, $\mathcal{S}$ the separators, and $\nu(S)$ the respective multiplicities reflecting how often each separator appears in the decomposition.[20] These models are particularly suited to sparse, high-dimensional contingency tables, such as those arising in genetics for analyzing multi-locus associations or in social surveys for multi-way interactions among categorical variables, where the chordal structure exploits sparsity to enable exact inference.[21]
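A compact R sketch for the decomposable model [AB][AC] (cliques {A, B} and {A, C}, separator {A}), reusing the hypothetical table tab: the IPF fit from loglin coincides with the closed-form clique/separator estimate.

# fit [AB][AC] by iterative proportional fitting
ipf <- loglin(tab, margin = list(c(1, 2), c(1, 3)), fit = TRUE, print = FALSE)
# closed-form MLE: product of clique margins divided by the separator margin
n_ab <- margin.table(tab, c(1, 2))
n_ac <- margin.table(tab, c(1, 3))
n_a  <- margin.table(tab, 1)
closed <- array(0, dim = dim(tab))
for (i in 1:2) for (j in 1:2) for (k in 1:2)
  closed[i, j, k] <- n_ab[i, j] * n_ac[i, k] / n_a[i]
max(abs(ipf$fit - closed))   # essentially zero: IPF reproduces the closed form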
Model Evaluation
Assessing Model Fit
Assessing the fit of a log-linear model to contingency table data primarily involves goodness-of-fit tests that measure discrepancies between observed cell counts $n_i$ and model-expected counts $\hat\mu_i$, where the latter are derived from maximum likelihood estimation under the Poisson assumption. The Pearson chi-square statistic,

$X^2 = \sum_i \frac{(n_i - \hat\mu_i)^2}{\hat\mu_i},$

quantifies this discrepancy and asymptotically follows a chi-squared distribution when expected counts are sufficiently large (typically at least 5 per cell). Similarly, the likelihood ratio statistic,

$G^2 = 2 \sum_i n_i \log\!\left(\frac{n_i}{\hat\mu_i}\right),$

provides an alternative measure that is often preferred for its additive property, allowing straightforward decomposition for nested models, and it also approximates a chi-squared distribution under the null hypothesis of adequate fit. Both statistics test the model against the saturated model, with small values (or large p-values) indicating good reproduction of the observed data.

The degrees of freedom for these chi-squared tests are calculated as the total number of cells in the table minus the number of free parameters estimated in the model; in Poisson log-linear models, this subtracts the parameters defining the log-expected values, including the overall mean level. For instance, in a two-way table with $I$ rows and $J$ columns under a model of independence, the degrees of freedom equal $(I-1)(J-1)$. These degrees of freedom ensure the test accounts for model complexity, providing a benchmark for significance.

When overall fit statistics suggest inadequacy, residual analysis helps pinpoint poorly fitted cells. Pearson residuals, $r_i = (n_i - \hat\mu_i)/\sqrt{\hat\mu_i}$ (or their standardized versions, which also adjust for leverage), are commonly used; values with absolute magnitude greater than 2 or 3 flag cells where the model deviates substantially from observations, as these residuals are approximately standard normal under good fit. For count data, especially in sparse tables, Freeman-Tukey residuals, such as $\sqrt{n_i} + \sqrt{n_i + 1} - \sqrt{4\hat\mu_i + 1}$, offer improved properties, including better normality and reduced sensitivity to small expected values.[22]

In cases of small samples, where asymptotic chi-squared approximations may fail due to low expected counts, simulation-based methods provide robust alternatives. Monte Carlo tests simulate contingency tables under the fitted model (e.g., by generating Poisson variates with means $\hat\mu_i$) and compute empirical distributions of $X^2$ or $G^2$ to derive exact p-values, enhancing reliability for complex or sparse data structures.
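A hedged R sketch of such a Monte Carlo check, reusing the hypothetical table tab: tables are simulated from the fitted independence model, the model is refitted to each simulated table, and the observed G² is compared with the simulated distribution.

fit0 <- loglin(tab, margin = list(1, 2, 3), fit = TRUE, print = FALSE)
G2_obs <- fit0$lrt
mu <- fit0$fit
set.seed(1)
G2_sim <- replicate(2000, {
  sim <- array(rpois(length(mu), mu), dim = dim(tab))       # simulate a table from the fitted model
  loglin(sim, margin = list(1, 2, 3), print = FALSE)$lrt    # refit and record G^2
})
mean(G2_sim >= G2_obs)   # Monte Carlo p-value for the independence model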
Comparing Multiple Models
In log-linear analysis, comparing multiple models is essential for selecting the most parsimonious representation of associations among categorical variables in contingency tables, particularly when models adhere to hierarchical principles. Nested models, where one is a special case of another (e.g., a model omitting higher-order interactions), are compared using likelihood ratio tests based on the deviance statistic $G^2$. The difference $\Delta G^2 = G^2_{\text{reduced}} - G^2_{\text{full}}$ follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters under the null hypothesis that the simpler model adequately fits the data; similarly, differences in the Pearson chi-squared statistic can be used.

Stepwise selection procedures facilitate model comparison by iteratively building or pruning models. Backward selection begins with the saturated model (which fits the data perfectly) and removes terms whose omission does not significantly worsen fit, as determined by p-values from the nested likelihood ratio test (typically at $\alpha = 0.05$). Forward selection starts from the independence model and adds terms that significantly improve fit. These methods ensure hierarchical consistency while balancing model complexity and explanatory power.

For broader model selection, information criteria penalize complexity to favor parsimonious models. The Akaike information criterion is calculated as $\mathrm{AIC} = -2\log L + 2k$, where $k$ is the number of parameters, providing an estimate of relative predictive accuracy; lower values indicate better models. The Bayesian information criterion, given by $\mathrm{BIC} = -2\log L + k\log n$ with sample size $n$, imposes a stronger penalty for additional parameters in larger datasets, approximating the Bayes factor for model choice. Both are applicable to log-linear models fitted via maximum likelihood under the Poisson assumption.

When models are non-nested (neither is a special case of the other), Vuong's likelihood ratio test assesses which is closer to the true data-generating process by standardizing the average difference in log-likelihoods between models, yielding a test statistic asymptotically distributed as standard normal under the null of equal fit. This test accounts for the Kullback-Leibler divergence and is particularly useful for comparing log-linear models with differing interaction structures.

For instance, in analyzing a three-way contingency table of variables A, B, and C, one might first fit a two-way interaction model (including AB, AC, BC terms) and then compare it to a model adding the ABC term using the likelihood ratio test: if $\Delta G^2$ exceeds the critical chi-squared value at $\alpha = 0.05$, the three-way interaction significantly improves fit, justifying its inclusion.
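A short R sketch of criterion-based comparison, reusing the long-format data d from the earlier glm example: AIC, BIC, and a likelihood ratio test are obtained for two nested Poisson log-linear models.

fit_two <- glm(count ~ (A + B + C)^2, family = poisson, data = d)  # all two-way interactions
fit_sat <- glm(count ~ A * B * C, family = poisson, data = d)      # saturated model
AIC(fit_two, fit_sat)
BIC(fit_two, fit_sat)
anova(fit_two, fit_sat, test = "Chisq")   # likelihood ratio test for the nested pair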
Interpretation and Analysis
Follow-up Tests
After an initial log-linear model is fitted to contingency table data using maximum likelihood estimation, follow-up tests enable targeted examination of hypotheses about specific parameters or interactions, such as whether a particular association is present conditional on other variables. These procedures build on parameter estimates from the fitting process and are essential for dissecting complex multi-way associations in categorical data analysis.[23]

Wald tests provide a direct method for assessing the significance of individual parameters, such as log-odds ratios corresponding to interactions. The test statistic is constructed as $z = \hat\lambda / \mathrm{SE}(\hat\lambda)$, where $\hat\lambda$ is the maximum likelihood estimate of the parameter and $\mathrm{SE}(\hat\lambda)$ its standard error, asymptotically following a standard normal distribution under the null hypothesis that the parameter equals zero. This approach is standard in the generalized linear models framework underlying log-linear analysis and is computationally straightforward once estimates are obtained.[23]

Score tests, alternatively known as Lagrange multiplier tests, offer an efficient alternative for hypothesis testing without refitting the full alternative model. They rely on the score statistic, the first derivative of the log-likelihood evaluated at the null parameter values, scaled by the Fisher information: approximately $U(\theta_0)^2 / I(\theta_0)$, where $U$ is the score function and $I$ the Fisher information. In log-linear contexts, score tests are particularly valuable for verifying conditional independence or trends in contingency tables when the null model suffices for computation.[23]

Partial association tests investigate the relationship between two variables while accounting for others, often in multi-way tables. These can involve collapsing the table over adjusting variables to form a partial 2×2 table and applying a chi-squared test, or fitting nested log-linear models and comparing them via likelihood ratio statistics to isolate the interaction term of interest. For a three-way table, the test for partial AB association controlling for C assesses whether the conditional odds ratio between A and B varies by levels of C; equivalently, the overall chi-squared can be partitioned into components, with the partial association component following a chi-squared distribution with (I-1)(J-1) degrees of freedom. This partitioning method facilitates hierarchical decomposition of associations.[24][23]

When multiple follow-up tests are performed, such as simultaneously evaluating several interaction terms, the Bonferroni correction adjusts the significance level to maintain control over the family-wise error rate. The adjusted level is $\alpha/m$, where $m$ is the number of tests; for instance, with five tests at nominal $\alpha = 0.05$, each test uses $\alpha = 0.01$. This procedure, though conservative, is routinely recommended in log-linear analyses involving multiple parameter hypotheses to prevent spurious significant findings.[23]

An illustrative example is testing homogeneity of odds ratios in a three-way contingency table, such as a 2×2×K design where the goal is to verify whether the odds ratio between two binary variables remains constant across K strata of a third. This is tested by fitting the homogeneous association log-linear model, which omits the three-way interaction term ($\lambda^{ABC}_{ijk} = 0$), and comparing its deviance to that of the saturated model via a likelihood ratio test; non-significance indicates constant partial odds ratios.
Such tests are common in stratified analyses, for example when assessing whether a treatment effect is uniform across patient subgroups.[23]
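A minimal R sketch of Wald follow-up tests with a Bonferroni adjustment, reusing the Poisson glm fit_two from the sketch above (the interaction parameters are picked out by the colon in their names):

coefs <- summary(fit_two)$coefficients           # estimates, standard errors, Wald z, p-values
p_int <- coefs[grepl(":", rownames(coefs)), "Pr(>|z|)"]
alpha_adj <- 0.05 / length(p_int)                # Bonferroni-adjusted significance level
p_int < alpha_adj                                # which interactions remain significant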
Effect Sizes and Measures
In log-linear models for categorical data, effect sizes provide measures of the magnitude and practical importance of associations between variables, distinct from tests of statistical significance. These metrics help quantify how strongly variables interact in contingency tables, aiding interpretation in fields such as social sciences and epidemiology. Common effect sizes derive from model parameters or summary statistics of the table, focusing on the strength of dependence rather than just model fit.[23]

For two-way contingency tables, the log-odds ratio serves as a primary effect size for measuring association strength. In a saturated log-linear model, the two-way interaction parameter $\lambda^{XY}_{ij}$ represents a log-odds ratio, and the corresponding odds ratio $\exp(\lambda)$ indicates the change in odds of one category given the other, relative to a baseline. Values near 1 suggest weak or no association, while deviations (e.g., above 2 or below 0.5) imply stronger effects. Confidence intervals for the odds ratio are typically constructed using the delta method, which approximates the variance of the transformed parameter from the asymptotic normality of $\hat\lambda$. This approach ensures interpretable intervals on the odds ratio scale, facilitating assessment of uncertainty in the effect magnitude.[23]

In higher-order tables, partial associations address conditional dependencies while accounting for other variables, with collapsibility measures evaluating whether marginal effects align with conditional ones. For instance, under models of conditional independence (e.g., no three-way interaction), the partial odds ratio between two variables equals the marginal odds ratio, a property known as collapsibility; violations indicate non-collapsibility, where averaging partial log-odds ratios over strata of a third variable yields the overall effect size. Average partial associations, computed as the arithmetic mean of conditional log-odds ratios across levels of conditioning variables, quantify the typical strength of a two-way interaction in multi-way tables, useful for summarizing complex structures without assuming decomposability.[25]

Other effect sizes include uncertainty coefficients and analogs to $R^2$, which capture asymmetric dependence and predictive power. The uncertainty coefficient, which measures the proportional reduction in uncertainty (entropy) of one variable given the other, is defined as $U(Y \mid X) = \frac{H(Y) - H(Y \mid X)}{H(Y)}$, with values from 0 (independence) to 1 (perfect prediction); it is asymmetric, yielding distinct row- and column-based coefficients. For predictive accuracy, Goodman-Kruskal lambda ($\lambda$) or gamma ($\gamma$) serve as $R^2$-like measures: $\lambda$ quantifies the proportional reduction in prediction error for nominal data (ranging from 0 to 1), while $\gamma$ measures ordinal association (ranging from -1 to 1); absolute values above roughly 0.3 are often taken to indicate moderate association.[23]

As an example, consider a three-way contingency table analyzing the association between treatment (T), response (R), and sex (S) in a medical study. A significant three-way interaction term in the log-linear model implies that the two-way odds ratio between T and R varies by levels of S; the effect size can be interpreted as the range or average of these conditional odds ratios (e.g., the odds ratio among males versus among females), highlighting how the treatment effect's strength differs across subgroups and guiding targeted inferences.[23]

Implementation
Software for Small Datasets
Software tools for fitting log-linear models to small contingency tables, typically with up to 5-6 dimensions, are available in several statistical packages, enabling hierarchical model specification, assessment of fit via chi-square statistics, and basic tests of effects such as associations and interactions.[26] These tools treat cell frequencies as Poisson-distributed counts and use iterative maximum likelihood methods to estimate parameters, supporting analyses of multi-way tables where the log-expected frequencies are modeled as linear combinations of main effects and interactions.[27]

In R, the loglm() function from the MASS package fits basic hierarchical log-linear models using iterative proportional fitting (IPF), a method equivalent to maximum likelihood for complete tables.[28] It allows formula-based specification similar to linear models, producing deviance-based chi-square goodness-of-fit tests and likelihood ratio statistics for comparing nested models.[29] For more flexible extensions, the gnm package supports generalized nonlinear models (GNMs), which encompass log-linear models with multiplicative terms for complex interactions, also providing chi-square tests and effect estimates.[30] An example for a 2x2x2 table of factors A, B, and C uses:
library(MASS)
# hypothetical 2 x 2 x 2 table of counts
data <- array(c(10, 20, 30, 40, 50, 60, 70, 80), dim = c(2, 2, 2))
dimnames(data) <- list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2"))
# hierarchical model with all main effects and the A-B interaction
fit <- loglm(~ A + B + C + A:B, data = data)
summary(fit)  # likelihood ratio and Pearson chi-square goodness-of-fit tests
In SAS, an equivalent model can be fitted with the GENMOD procedure as a Poisson regression with a log link; the sketch below assumes the counts are stored in a dataset named table with a variable count:

proc genmod data=table;
class A B C;
model count = A B C A*B / dist=poisson link=log type3;
run;
In SPSS, the GENLOG procedure specifies a comparable design:

GENLOG A B C
/DESIGN = A B C A*B
/CRITERIA = DELTA=0 ZEROS=SUPPRESS.
In Stata, the poisson command, or equivalently glm with family(poisson) and link(log), fits log-linear models to count data in contingency tables using maximum likelihood, ideal for small dimensions with robust standard errors for inference.[33] Both support factor variables for hierarchical terms, yielding deviance and Pearson chi-square statistics for overall fit, alongside z-tests for effects.[34] For a 2x2x2 table reshaped to long format with variables a, b, c, and count, the syntax is:
glm count i.a i.b i.c i.a#i.b, family(poisson) link(log) vce(robust)
