Relative likelihood

from Wikipedia
In statistics, when selecting a statistical model for given data, the relative likelihood compares the relative plausibilities of different candidate models or of different values of a parameter of a single model.

Relative likelihood of parameter values


Assume that we are given some data x for which we have a statistical model with parameter θ. Suppose that the maximum likelihood estimate for θ is $ \hat{\theta} $. Relative plausibilities of other θ values may be found by comparing the likelihoods of those other values with the likelihood of $ \hat{\theta} $. The relative likelihood of θ is defined to be[1][2][3][4][5]

R(\theta) = \frac{L(\theta \mid x)}{L(\hat{\theta} \mid x)},

where $ L $ denotes the likelihood function. Thus, the relative likelihood is the likelihood ratio with fixed denominator $ L(\hat{\theta} \mid x) $.

The function $ \theta \mapsto R(\theta) $ is the relative likelihood function.
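The definition above can be sketched numerically. The following is a minimal illustration, assuming a hypothetical binomial model (7 successes observed in 10 trials); the function names are ours, not from the sources:

```python
import math

def binom_loglik(p, k, n):
    # Log-likelihood of success probability p given k successes in n trials
    # (the constant binomial coefficient cancels in the ratio, so we omit it).
    return k * math.log(p) + (n - k) * math.log(1 - p)

def relative_likelihood(p, k, n):
    # R(p) = L(p) / L(p_hat), with the MLE p_hat = k / n.
    p_hat = k / n
    return math.exp(binom_loglik(p, k, n) - binom_loglik(p_hat, k, n))

# 7 successes in 10 trials: p = 0.7 is the MLE, so R = 1 there.
print(relative_likelihood(0.7, 7, 10))  # 1.0
print(relative_likelihood(0.5, 7, 10))  # about 0.44
```

Working on the log scale and exponentiating at the end keeps the computation stable even when the raw likelihoods underflow.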

Likelihood region


A likelihood region is the set of all values of θ whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a p% likelihood region for θ is defined to be[1][3][6]

\left\{ \theta : R(\theta) \geq \frac{p}{100} \right\}.

If θ is a single real parameter, a p% likelihood region will usually comprise an interval of real values. If the region does comprise an interval, then it is called a likelihood interval.[1][3][7]

Likelihood intervals, and more generally likelihood regions, are used for interval estimation within likelihood-based statistics ("likelihoodist" statistics): They are similar to confidence intervals in frequentist statistics and credible intervals in Bayesian statistics. Likelihood intervals are interpreted directly in terms of relative likelihood, not in terms of coverage probability (frequentism) or posterior probability (Bayesianism).

Given a model, likelihood intervals can be compared to confidence intervals. If θ is a single real parameter, then under certain conditions, a 14.65% likelihood interval (about 1:7 likelihood) for θ will be the same as a 95% confidence interval (19/20 coverage probability).[1][6] In a slightly different formulation suited to the use of log-likelihoods (see Wilks' theorem), the test statistic is twice the difference in log-likelihoods, and its probability distribution is approximately a chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom between the two models. Consequently, the $ e^{-2} $ likelihood interval is the same as the 0.954 confidence interval, assuming the difference in degrees of freedom is 1.[6][7]
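This correspondence can be checked numerically. The sketch below, using an assumed binomial sample (30 successes in 100 trials; the dataset is hypothetical), finds the 14.65% likelihood interval by grid search and compares it with the Wald 95% confidence interval:

```python
import math

def rel_lik(p, k, n):
    # Relative likelihood of p for k successes in n trials.
    ll = lambda q: k * math.log(q) + (n - k) * math.log(1 - q)
    return math.exp(ll(p) - ll(k / n))

k, n = 30, 100
# Scan a grid for the 14.65% likelihood interval (threshold exp(-1.92)).
threshold = math.exp(-1.92)
grid = [i / 10000 for i in range(1, 10000)]
inside = [p for p in grid if rel_lik(p, k, n) >= threshold]
lik_lo, lik_hi = min(inside), max(inside)

# Wald 95% confidence interval for comparison.
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
wald_lo, wald_hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(lik_lo, lik_hi)    # roughly 0.216 and 0.394
print(wald_lo, wald_hi)  # roughly 0.210 and 0.390
```

The two intervals nearly coincide here; the likelihood interval is slightly asymmetric about the MLE because it follows the shape of the likelihood rather than a normal approximation.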

Relative likelihood of models


The definition of relative likelihood can be generalized to compare different statistical models. This generalization is based on AIC (Akaike information criterion), or sometimes AICc (Akaike Information Criterion with correction).

Suppose that for some given data we have two statistical models, M1 and M2. Also suppose that AIC(M1) ≤ AIC(M2). Then the relative likelihood of M2 with respect to M1 is defined as follows.[8]

\exp\left( \frac{\mathrm{AIC}(M_1) - \mathrm{AIC}(M_2)}{2} \right)

To see that this is a generalization of the earlier definition, suppose that we have some model M with a (possibly multivariate) parameter θ. Then for any θ, set M2 = M(θ), and also set M1 = M($ \hat{\theta} $). The general definition now gives the same result as the earlier definition.
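As a small sketch of the model-comparison formula, with hypothetical AIC values (the function name and numbers are ours for illustration):

```python
import math

def aic_relative_likelihood(aic_best, aic_other):
    # exp((AIC_min - AIC_k) / 2): the relative likelihood of the other
    # model with respect to the best (lowest-AIC) model.
    return math.exp((aic_best - aic_other) / 2)

# Hypothetical AIC values for two fitted models.
print(aic_relative_likelihood(100.0, 104.0))  # exp(-2), about 0.135
print(aic_relative_likelihood(100.0, 100.0))  # 1.0: equally supported
```

A value of about 0.135 would be read as: the second model is roughly 0.135 times as probable as the first to minimize the estimated information loss.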

from Grokipedia
In statistics, relative likelihood refers to the ratio of the likelihood of a specific parameter value or hypothesis to the maximum possible likelihood within a given model, serving as a measure of evidential support or plausibility for that value relative to the best-supported alternative. This concept is formalized through the likelihood function $ L(\theta \mid x) = f(x \mid \theta) $, where $ f $ is the probability density or mass function and $ \theta $ is the parameter, and enables direct comparisons without requiring prior probabilities or frequentist error rates. The relative likelihood $ R(\theta) = L(\theta \mid x) / L(\hat{\theta} \mid x) $, with $ 0 \leq R(\theta) \leq 1 $, quantifies how much less plausible a parameter value is compared to the maximum likelihood estimate $ \hat{\theta} $.[1] Developed as part of the likelihood paradigm, relative likelihood builds on R.A. Fisher's introduction of the likelihood function in the 1920s and was advanced by A.W.F. Edwards in his 1972 monograph Likelihood, which argued for its use as the foundation of inductive inference over probability-based approaches. Later proponents, including Richard Royall in Statistical Evidence: A Likelihood Paradigm (1997), emphasized its role in quantifying statistical evidence via the Law of Likelihood, which states that data support one hypothesis over another to the extent of their likelihood ratio. Key applications include model selection, where relative likelihoods assess competing models' fit; parameter inference, via likelihood intervals (e.g., sets where $ R(\theta) \geq 1/8 $ or $ 0.15 $, giving approximate 95% confidence regions); and hypothesis testing, avoiding p-values by focusing on direct evidential comparisons. This approach is particularly valuable in fields like ecology, physics, and machine learning for robust uncertainty quantification without Bayesian priors.

Fundamentals

Likelihood Function

The likelihood function, denoted $ L(\theta \mid x) $, represents the joint probability density function (or probability mass function in discrete cases) of the observed data $ x $, expressed as a function of the unknown parameter $ \theta $. Unlike a probability distribution over $ \theta $, it is not normalized such that its integral (or sum) over $ \theta $ equals 1; instead, it measures the plausibility of different $ \theta $ values given the fixed data $ x $. This distinction emphasizes that the likelihood treats the data as given and varies the parameters, reversing the roles in the conditional probability $ f(x \mid \theta) $.[2][3] For a sample of $ n $ independent and identically distributed observations $ x = (x_1, \dots, x_n) $, the likelihood function takes the product form
L(\theta \mid x) = \prod_{i=1}^n f(x_i \mid \theta),
where $ f(\cdot \mid \theta) $ is the probability density or mass function of each observation under parameter $ \theta $. This formulation arises directly from the joint distribution under independence, facilitating computation and maximization.[2][4] Key properties of the likelihood function include its invariance under reparameterization: if $ \phi = g(\theta) $ for a one-to-one transformation $ g $, then the likelihood in terms of $ \phi $ is simply $ L(\phi \mid x) = L(g^{-1}(\phi) \mid x) $. Unlike a probability density over the parameter, no Jacobian factor is involved, so the relative ordering of parameter values is preserved exactly. Because absolute likelihood values depend on arbitrary scaling and the data's support, inference typically relies on relative likelihoods rather than absolutes, comparing $ L(\theta \mid x) $ across $ \theta $ to assess plausibility. The function plays a central role in maximum likelihood estimation (MLE), where the estimator $ \hat{\theta} $ maximizes $ L(\theta \mid x) $ (or equivalently its logarithm, for convenience), providing a method to select the most data-compatible parameter.[3][5] The concept was introduced by Ronald A. Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he developed it as a foundational tool for frequentist inference, shifting focus from inverse probabilities to data-driven parameter assessment.[5] A simple example is the binomial likelihood for modeling $ k $ successes in $ n $ independent trials, such as coin flips, with success probability $ p $:
L(p \mid k) = \binom{n}{k} p^k (1-p)^{n-k}.
Here, the likelihood peaks at $ p = k/n $, illustrating how it quantifies support for different $ p $ values based on the observed proportion.[4]
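A quick grid search confirms the peak of this binomial likelihood at $ p = k/n $; the example data (7 successes in 10 trials) are hypothetical:

```python
import math

def binom_lik(p, k, n):
    # Binomial likelihood, including the constant binomial coefficient.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 7, 10
grid = [i / 1000 for i in range(1, 1000)]
# The grid point with the highest likelihood is the (approximate) MLE.
p_best = max(grid, key=lambda p: binom_lik(p, k, n))
print(p_best)  # 0.7, i.e. k / n
```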

Relative Likelihood Definition

In statistics, the relative likelihood provides a normalized measure of the plausibility of a parameter value given observed data, by comparing it to the most plausible value within the model. Formally, for a parameter $ \theta $ and data $ x $, the relative likelihood is defined as
R(\theta \mid x) = \frac{L(\theta \mid x)}{L(\hat{\theta} \mid x)},
where $ L(\theta \mid x) $ is the likelihood function and $ \hat{\theta} $ is the maximum likelihood estimate (MLE) that maximizes $ L $ over the parameter space. This ratio satisfies $ 0 \leq R(\theta \mid x) \leq 1 $, with $ R(\hat{\theta} \mid x) = 1 $, emphasizing the relative support for $ \theta $ without dependence on absolute likelihood scales.[1] The log-relative likelihood, $ l(\theta \mid x) = \log R(\theta \mid x) = \log L(\theta \mid x) - \log L(\hat{\theta} \mid x) $, is often preferred for numerical stability and analysis, as it transforms the multiplicative scale to an additive one. This form facilitates approximations, such as the second-order Taylor expansion around $ \hat{\theta} $, which yields a quadratic form resembling a normal approximation for large samples. Computationally, it simplifies evaluations in optimization and inference procedures.[1] Relative likelihood values near 1 indicate high plausibility for the parameter; for instance, the region where $ R(\theta \mid x) \geq 0.15 $ approximates a 95% support interval, corresponding asymptotically to a confidence region spanning roughly $ \pm 1.96 $ standard errors around the MLE for scalar parameters. Under standard regularity conditions, the statistic $ -2 \log R(\theta \mid x) $ follows an approximate $ \chi^2 $ distribution with degrees of freedom equal to the dimension of $ \theta $ when the true parameter is at $ \theta $, supporting likelihood-based tests and intervals.[6] Unlike a general likelihood ratio, which compares the likelihoods of two distinct hypotheses, $ L(\theta_1 \mid x) / L(\theta_2 \mid x) $, relative likelihood is inherently tied to the model's maximum, offering a unified scale for assessing evidential support within a single framework. This distinction underscores its role in profiling parameter plausibility rather than direct hypothesis contrast.

Parameter Values

Relative Likelihood for Parameters

In statistical inference for parameters within a single model, the relative likelihood $ R(\theta \mid x) = L(\theta \mid x) / L(\hat{\theta} \mid x) $ serves as a direct measure of the evidential support for a specific parameter value $ \theta $ relative to the maximum likelihood estimate (MLE) $ \hat{\theta} $, where $ R(\hat{\theta} \mid x) = 1 $ by definition. This ratio quantifies the degree to which the observed data $ x $ are less plausible under $ \theta $ than under $ \hat{\theta} $, providing a scale-invariant assessment of parameter plausibility that avoids assumptions of normality or other asymptotic approximations. Unlike standard errors, which depend on the curvature of the log-likelihood at the MLE, relative likelihood offers a global view of the likelihood surface, enabling visualization of uncertainty through plots of $ R(\theta \mid x) $ against $ \theta $. This approach aligns with the likelihood principle, concentrating all inferential information in the likelihood function itself.[7][1] For multiparameter models, where interest lies in a subset of parameters $ \theta_j $ (parameters of interest) amid nuisance parameters $ \nu $, the profile relative likelihood addresses the issue by maximizing the likelihood over $ \nu $ for fixed $ \theta_j $. Formally, $ R(\theta_j \mid x) = \left[ \sup_{\nu} L(\theta_j, \nu \mid x) \right] / L(\hat{\theta}, \hat{\nu} \mid x) $, which concentrates the likelihood function to focus inference on $ \theta_j $ while marginalizing the impact of $ \nu $. This profiling technique preserves the shape of the likelihood for $ \theta_j $ and is essential for practical applications, such as in generalized linear models, where nuisance parameters like dispersion must be accounted for without distorting the assessment of key effects.
Adjustments to the profile likelihood, such as the Cox-Reid correction, further refine it by subtracting half the log-determinant of the observed information matrix for $ \nu $, improving accuracy in small samples.[8] Likelihood intervals based on relative likelihood provide an approximation to confidence intervals by delineating the range of $ \theta $ values deemed sufficiently plausible. The set $ \{ \theta : R(\theta \mid x) \geq 0.15 \} $ (or where the log-likelihood drops by at most 1.92 units from its maximum) roughly corresponds to a 95% likelihood interval, a threshold that asymptotically aligns with the 95% quantile of a $ \chi^2_1 $ distribution under the Wilks theorem, though it differs from Wald intervals by not relying on local curvature or normality. These intervals are typically more conservative and data-dependent than frequentist confidence intervals, emphasizing evidential support over long-run coverage properties, and they perform well even in non-normal settings.[7][1][8] Evaluating relative likelihood often involves numerical computation, such as gridding over plausible $ \theta $ values or using optimization algorithms like Newton-Raphson to locate the MLE and profile maxima. Software implementations in R or Python facilitate this via built-in maximizers, but challenges arise with multimodal likelihood surfaces, where local optima may mislead inference and require global optimization techniques or multiple starting points to ensure reliable profiling. The expectation-maximization (EM) algorithm proves useful for models with latent variables, iteratively handling incomplete data to converge on the likelihood.[8][1] A concrete example occurs with $ n $ independent observations from a normal distribution $ N(\mu, \sigma^2) $ where $ \sigma^2 $ is known, yielding sample mean $ \bar{x} $. Here, the MLE is $ \hat{\mu} = \bar{x} $, and the relative likelihood simplifies to
R(\mu \mid x) = \exp\left[ -\frac{n (\mu - \bar{x})^2}{2 \sigma^2} \right],
demonstrating a symmetric parabolic decline from 1 at $ \mu = \bar{x} $, with the rate of drop-off governed by sample size $ n $ and precision $ 1/\sigma^2 $. This form underscores the quadratic nature of the log-likelihood near the MLE, making it straightforward to compute intervals like $ \bar{x} \pm \sqrt{2 \sigma^2 / n} $ for $ R(\mu \mid x) \geq 1/e $.[9][1]
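The closed form for the known-variance normal case can be verified directly; the sample values below ($ n = 25 $, $ \sigma^2 = 4 $, $ \bar{x} = 10 $) are hypothetical:

```python
import math

def rel_lik_mu(mu, xbar, n, sigma2):
    # R(mu) = exp(-n (mu - xbar)^2 / (2 sigma^2)) for known variance.
    return math.exp(-n * (mu - xbar) ** 2 / (2 * sigma2))

n, sigma2, xbar = 25, 4.0, 10.0
half_width = math.sqrt(2 * sigma2 / n)  # endpoints of the 1/e interval
for mu in (xbar - half_width, xbar, xbar + half_width):
    print(round(rel_lik_mu(mu, xbar, n, sigma2), 4))
# the endpoints give exactly 1/e (about 0.3679); the MLE gives 1.0
```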

Likelihood Regions

Likelihood regions offer a geometric perspective on the uncertainty associated with parameter estimates by delineating sets of plausible values based on the relative likelihood function. Formally, a likelihood region is defined as the set $ \{\theta : R(\theta \mid x) \geq c\} $, where $ R(\theta \mid x) $ is the relative likelihood and $ c $ is a threshold value that establishes a contour of plausibility, such as $ c = 0.15 $ or $ c = 1/8 $. For $ c \approx 0.15 $, the region asymptotically approximates a 95% confidence region under large-sample conditions, leveraging the quadratic behavior of the log relative likelihood $ \log R(\theta \mid x) $, since $ -2 \log c \approx 3.84 = \chi^2_1(0.95) $. In one dimension, the likelihood region manifests as an interval surrounding the MLE, bounded by points where the relative likelihood equals $ c $. For multidimensional parameters, the region assumes more intricate shapes, potentially ellipsoidal under the quadratic approximation of the log relative likelihood, or irregular otherwise, and is typically constructed through contour plotting or numerical optimization to trace the level set $ R(\theta \mid x) = c $. These visualizations highlight joint parameter plausibility, revealing correlations and trade-offs not evident in marginal intervals.[10][11] Unlike certain confidence regions derived from asymptotic normality (e.g., Wald intervals), likelihood regions maintain invariance under reparameterization, ensuring the set of plausible values transforms coherently with any valid change of variables. This property stems directly from the likelihood function's role in the definition. For models from exponential families, such as the normal or Poisson distributions, likelihood regions align exactly with confidence regions obtained via the likelihood ratio test statistic under the null hypothesis of the boundary values.
Parameter values within the likelihood region are interpreted as compatible with the observed data at the specified plausibility level $ c $, providing a direct measure of evidential support without reliance on prior distributions or long-run frequencies. These regions facilitate hypothesis testing by assessing whether a particular $ \theta_0 $ or submanifold lies inside the contour; inclusion implies the value is not strongly contradicted by the data. As an illustrative example, consider estimating the rate parameter $ \lambda $ of a Poisson distribution from data with sample mean $ \bar{x} $. The likelihood region $ \{\lambda : R(\lambda \mid x) \geq 0.15\} $ forms an interval around $ \bar{x} $ that asymptotically approximates a 95% confidence interval. For an observed total count of 17 events (so $ \bar{x} = 17 $ with $ n = 1 $), this region is approximately $ [10.15, 26.41] $, close to exact confidence intervals for the Poisson mean, demonstrating the practical alignment in low-dimensional cases.[11]
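The Poisson likelihood region above can be reproduced by a grid search; we use the threshold $ e^{-1.92} \approx 0.1466 $ (the value the "0.15" cutoff rounds from), and the log relative likelihood $ \log R(\lambda) = T \log(\lambda/\hat{\lambda}) - n(\lambda - \hat{\lambda}) $ for total count $ T $ and MLE $ \hat{\lambda} = T/n $:

```python
import math

def log_rel_lik(lam, total, n):
    # Log relative likelihood for a Poisson rate: log L(lam) - log L(lam_hat),
    # where log L(lam) = total * log(lam) - n * lam + const, lam_hat = total / n.
    lam_hat = total / n
    return total * math.log(lam / lam_hat) - n * (lam - lam_hat)

total, n = 17, 1
threshold = math.exp(-1.92)  # about 0.1466
grid = [i / 100 for i in range(100, 4000)]  # lambda from 1.00 to 39.99
inside = [lam for lam in grid
          if math.exp(log_rel_lik(lam, total, n)) >= threshold]
print(min(inside), max(inside))  # roughly 10.15 and 26.40
```

The resulting interval is visibly asymmetric about $ \hat{\lambda} = 17 $, reflecting the skew of the Poisson likelihood at moderate counts.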

Model Comparison

Relative Likelihood for Models

In the context of model comparison, the relative likelihood assesses the plausibility of different statistical models in fitting the same observed data by comparing their maximized likelihood values. For two competing models $ M_k $ and $ M_1 $, the relative likelihood is defined as $ R(M_k \mid x) = \frac{L_{\max}(M_k \mid x)}{L_{\max}(M_1 \mid x)} $, where $ L_{\max}(M \mid x) $ is the maximum likelihood achieved by optimizing the parameters of model $ M $ given the data $ x $. This ratio directly measures how much more (or less) likely the data are under $ M_k $ compared to $ M_1 $ at their respective best-fitting parameter values. A value of $ R(M_k \mid x) < 1 $ implies that $ M_k $ fits the data worse than $ M_1 $, while $ R(M_k \mid x) > 1 $ suggests a better fit. For nested models, where the parameter space of one model (e.g., the reduced model $ M_1 $) is a subset of the other (e.g., the full model $ M_k $), the relative likelihood is commonly analyzed via the likelihood ratio statistic $ \Lambda = 2 \log \left( \frac{L_{\max}(M_k \mid x)}{L_{\max}(M_1 \mid x)} \right) $. Under the null hypothesis that the reduced model suffices, $ \Lambda $ asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference in the number of free parameters between the models. Equivalently, $ -2 \log R(M_1 \mid x) \approx \chi^2(\Delta df) $, providing a basis for testing whether the additional complexity in $ M_k $ significantly improves the fit. This approach stems from the large-sample properties of maximum likelihood estimation.[12] In the case of non-nested models, whose parameter spaces are incomparable, the direct relative likelihood ratio can still be computed, but interpreting it requires caution due to potential differences in model dimensions that may confound comparisons. 
To address this, Vuong's test extends the framework by evaluating the standardized differences in log-likelihoods across observations, testing the null hypothesis that both models are equally distant from the true data-generating process against alternatives where one is closer. This test uses the relative likelihood concept to derive a z-statistic for model selection.[13] Raw relative likelihood ratios have notable limitations, as they do not penalize for model complexity and thus tend to favor overparameterized models that achieve higher maximized likelihoods by fitting noise in the data. For instance, when comparing a linear regression model ($ y = \beta_0 + \beta_1 x + \epsilon $) to a quadratic extension ($ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon $) on the same dataset, the relative likelihood $ R(\text{quadratic} \mid x) = \frac{L_{\max}(\text{quadratic} \mid x)}{L_{\max}(\text{linear} \mid x)} $ will always be at least 1, since the linear model is nested within the quadratic. A likelihood ratio test can then assess whether this increase is statistically significant, indicating meaningful evidence for the quadratic term.[12]
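The nested linear-vs-quadratic comparison can be sketched as follows. This is a minimal illustration on simulated data with a truly linear mean (the dataset, seed, and Gaussian-error assumption are ours); for Gaussian regression at the MLE $ \hat{\sigma}^2 = \mathrm{RSS}/n $, the maximized log-likelihood depends only on the residual sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)  # true relationship is linear

def max_loglik(resid):
    # Gaussian log-likelihood maximized over sigma^2 (sigma^2_hat = RSS / n).
    rss = np.sum(resid ** 2)
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

# Least-squares fits of degree 1 (linear) and degree 2 (quadratic).
resid_lin = y - np.polyval(np.polyfit(x, y, 1), x)
resid_quad = y - np.polyval(np.polyfit(x, y, 2), x)

# Likelihood ratio statistic: twice the log of the relative likelihood.
lam = 2 * (max_loglik(resid_quad) - max_loglik(resid_lin))
print(lam)  # always >= 0; compare with the chi-square(1) cutoff 3.84
```

Because the models are nested, `lam` is never negative; a value below 3.84 means the quadratic term's improvement is consistent with noise at the 5% level.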

Relation to Selection Criteria

In model selection, relative likelihood serves as the basis for penalized criteria that adjust for model complexity to prevent overfitting while evaluating predictive quality. These criteria extend the raw relative likelihood $ R(M_k | x) $ by incorporating penalties proportional to the number of parameters, enabling systematic comparison among competing models. The Akaike Information Criterion (AIC), derived by Akaike in 1973, is formulated as
\text{AIC} = -2 \log L_{\max} + 2p,
where $ L_{\max} $ is the maximum likelihood of the model and $ p $ is the number of estimated parameters. The relative quality of model $ k $ compared to the best-fitting model is then $ \exp[(\text{AIC}_{\min} - \text{AIC}_k)/2] $, which approximates the expected relative likelihood $ R(M_k \mid x) $ adjusted for parameter count $ p $.[14] This approximation arises from information-theoretic principles, estimating the model's out-of-sample predictive accuracy. As an alternative, the Bayesian Information Criterion (BIC), introduced by Schwarz in 1978, is given by
\text{BIC} = -2 \log L_{\max} + p \log n,
with $ n $ denoting the sample size. BIC imposes a harsher penalty on complexity for larger $ n $, yielding a relative measure approximating $ R(M_k | x) \times n^{-(p_k - p_1)/2} $, where $ p_1 $ is the parameter count of the reference model.[15] This makes BIC particularly suitable for large datasets, favoring simpler models under asymptotic Bayesian assumptions. Other criteria build directly on relative likelihood concepts. In generalized linear models (GLMs), deviance is defined as $ D = -2 \log R $, measuring the discrepancy between the fitted model and a saturated model that perfectly fits the data; lower deviance indicates better fit relative to the saturated likelihood.[16] Cross-validation provides an empirical analog to relative likelihood by partitioning data and computing average predictive likelihoods across folds, offering a non-parametric assessment of model performance.[17] Raw relative likelihood suffices for straightforward model comparisons without complexity concerns, but penalized criteria like AIC and BIC are essential for predictive applications to mitigate overfitting by trading off fit against parsimony. A practical illustration appears in ARIMA time series modeling, where AIC-based relative likelihoods guide selection among orders $ (p, d, q) $; for instance, in forecasting datasets, the model with the lowest AIC—such as an ARIMA(1,1,1) over higher-order alternatives—yields the highest relative likelihood, balancing in-sample fit with forecasting reliability.[18]
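The AIC-based relative likelihoods and their normalized form (often called Akaike weights) can be computed in a few lines; the three AIC values below are hypothetical:

```python
import math

def akaike_relative_likelihoods(aics):
    # exp((AIC_min - AIC_k) / 2) for each model; the best model gets 1.
    best = min(aics)
    return [math.exp((best - a) / 2) for a in aics]

# Hypothetical AIC values for three candidate models.
aics = [210.3, 208.1, 214.9]
rels = akaike_relative_likelihoods(aics)
weights = [r / sum(rels) for r in rels]  # normalized Akaike weights
print([round(r, 3) for r in rels])      # second model is best (rel = 1)
```

The weights sum to 1 and can be read as the relative support each model receives from the data under the AIC approximation.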