Scoring algorithm
Scoring algorithm, also known as Fisher's scoring,[1] is a form of Newton's method used in statistics to solve maximum likelihood equations numerically, named after Ronald Fisher.
Sketch of derivation
Let $Y_1, \ldots, Y_n$ be random variables, independent and identically distributed with twice differentiable p.d.f. $f(y; \theta)$, and we wish to calculate the maximum likelihood estimator (M.L.E.) $\theta^*$ of $\theta$. First, suppose we have a starting point for our algorithm $\theta_0$, and consider a Taylor expansion of the score function, $V(\theta)$, about $\theta_0$:

$$V(\theta) \approx V(\theta_0) - \mathcal{J}(\theta_0)(\theta - \theta_0),$$

where

$$\mathcal{J}(\theta_0) = -\sum_{i=1}^n \left. \nabla \nabla^{\top} \right|_{\theta = \theta_0} \log f(Y_i; \theta)$$

is the observed information matrix at $\theta_0$. Now, setting $\theta = \theta^*$, using that $V(\theta^*) = 0$ and rearranging gives us:

$$\theta^* \approx \theta_0 + \mathcal{J}^{-1}(\theta_0)\, V(\theta_0).$$

We therefore use the algorithm

$$\theta_{m+1} = \theta_m + \mathcal{J}^{-1}(\theta_m)\, V(\theta_m),$$

and under certain regularity conditions, it can be shown that $\theta_m \to \theta^*$.
Fisher scoring
In practice, $\mathcal{J}(\theta)$ is usually replaced by $\mathcal{I}(\theta) = \mathbb{E}[\mathcal{J}(\theta)]$, the Fisher information, thus giving us the Fisher scoring algorithm:

$$\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m).$$

Under some regularity conditions, if $\theta_m$ is a consistent estimator, then $\theta_{m+1}$ (the correction after a single step) is 'optimal' in the sense that its error distribution is asymptotically identical to that of the true maximum likelihood estimate.[2]
See also
- Score (statistics)
- Fisher information
References
- ^ Longford, Nicholas T. (1987). "A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects". Biometrika. 74 (4): 817–827. doi:10.1093/biomet/74.4.817.
- ^ Li, Bing; Babu, G. Jogesh (2019). "Bayesian Inference". Springer Texts in Statistics. New York, NY: Springer New York. Theorem 9.4. doi:10.1007/978-1-4939-9761-9_6. ISBN 978-1-4939-9759-6. S2CID 239322258. Retrieved 2023-01-03.
Further reading
- Jennrich, R. I. & Sampson, P. F. (1976). "Newton-Raphson and Related Algorithms for Maximum Likelihood Variance Component Estimation". Technometrics. 18 (1): 11–17. doi:10.1080/00401706.1976.10489395. JSTOR 1267911.
Scoring algorithm
Background
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a fundamental method in statistical inference for estimating the parameters of a probabilistic model by selecting the values that make the observed data most probable under that model. Given a sample of independent observations $x_1, \ldots, x_n$ from a distribution parameterized by $\theta$, MLE seeks to maximize the likelihood function $L(\theta) = \prod_{i=1}^n f(x_i; \theta)$, where $f$ is the probability density or mass function. Equivalently, maximization of the log-likelihood $\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)$ is often performed, as it transforms the product into a sum, simplifying optimization while preserving the location of maxima.[5]

The likelihood principle underlies MLE, asserting that all evidential content in the data regarding the unknown parameters is encapsulated within the likelihood function, thereby directing parametric inference toward comparisons of relative plausibility across parameter values rather than absolute probabilities. This approach emphasizes the data's role in updating beliefs about $\theta$ solely through how well different parameter configurations explain the observations, forming a cornerstone of frequentist parametric statistics.[5]

MLE was developed by Ronald A. Fisher in the early 1920s, with its formal introduction in his seminal 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he established it as a general method for point estimation in parametric models.[5]

A classic example arises in estimating the mean $\mu$ and variance $\sigma^2$ of a normal distribution based on an i.i.d. sample $x_1, \ldots, x_n$. The log-likelihood function is

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$

Maximizing produces the estimators $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$, which coincide with the familiar sample mean and (biased) sample variance.[6]
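As a brief added check (not part of the cited sources; the synthetic data, sample size, and function names are assumptions), the closed-form normal MLEs can be compared against a direct numerical maximization of the log-likelihood using NumPy and SciPy:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)      # synthetic i.i.d. sample

# Closed-form MLEs: sample mean and biased sample variance.
mu_hat = x.mean()
sigma2_hat = x.var()                               # ddof=0, i.e. the biased MLE

# Numerical check: minimize the negative log-likelihood directly.
def neg_log_lik(params):
    mu, log_sigma = params                         # optimize log(sigma) so that sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(mu_hat, sigma2_hat)
print(res.x[0], np.exp(res.x[1]) ** 2)             # agrees with the closed-form MLEs

The numerical optimum coincides with the closed-form estimators up to optimizer tolerance, illustrating that maximizing the log-likelihood and the likelihood give the same solution.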
Score Function and Fisher Information

In statistical inference, the score function, often denoted as $V(\theta)$ or $U(\theta)$, is defined as the gradient of the log-likelihood function with respect to the parameter vector $\theta$, given by

$$V(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \nabla_\theta\, \ell(\theta).$$

This vector measures the sensitivity of the log-likelihood to changes in $\theta$. Under standard regularity conditions, the expected value of the score function is zero when evaluated at the true parameter value, i.e., $\mathbb{E}[V(\theta)] = 0$, which implies that the maximum likelihood estimator (MLE) occurs where the score equals zero.[7]

The observed information matrix, denoted $\mathcal{J}(\theta)$, is the negative of the Hessian matrix of the log-likelihood, expressed as

$$\mathcal{J}(\theta) = -\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta^{\top}}.$$

This matrix captures the local curvature of the log-likelihood surface at a specific $\theta$, providing a data-dependent measure of precision for parameter estimates.[8]

The Fisher information matrix is the expected value of the observed information matrix, defined as

$$\mathcal{I}(\theta) = \mathbb{E}[\mathcal{J}(\theta)] = -\mathbb{E}\!\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta^{\top}}\right].$$

It quantifies the amount of information that the observed data carry about the unknown parameter $\theta$, serving as a fundamental measure of estimator efficiency in parametric models. Additionally, the variance of the score function equals the Fisher information matrix, $\operatorname{Var}[V(\theta)] = \mathcal{I}(\theta)$, highlighting its role in bounding the precision of estimators.[7][8]
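To make these definitions concrete (an added illustration; the Poisson model, data, and names are assumptions), consider an i.i.d. Poisson($\lambda$) sample, for which the score is $\sum_i (y_i - \lambda)/\lambda$ and the Fisher information is $n/\lambda$:

import numpy as np

rng = np.random.default_rng(1)
lam_true = 3.0
y = rng.poisson(lam_true, size=1000)        # synthetic i.i.d. Poisson sample

def score(lam, y):
    """Score V(lambda), the derivative of the Poisson log-likelihood."""
    return np.sum(y - lam) / lam

def fisher_info(lam, n):
    """Expected information I(lambda) = n / lambda for the Poisson model."""
    return n / lam

lam_hat = y.mean()                           # the MLE solves score(lam) = 0
print(score(lam_hat, y))                     # ~ 0 at the MLE
print(1.0 / fisher_info(lam_hat, len(y)))    # approximate variance of the MLE

The inverse information printed on the last line is the usual large-sample approximation to the variance of $\hat{\lambda}$, reflecting the role of $\mathcal{I}(\theta)$ in bounding estimator precision.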
Derivation of the Scoring Algorithm

Taylor Expansion Approach
The Taylor expansion approach provides a foundational derivation for iteratively solving the score equation $V(\theta) = 0$, where $V(\theta)$ denotes the score function, defined as the gradient of the log-likelihood $\ell(\theta)$ with respect to the parameter $\theta$. The observed information matrix $\mathcal{J}(\theta)$, which is the negative Hessian of the log-likelihood, is the negative of the Jacobian of the score function.[9]

To approximate the root, consider a first-order Taylor series expansion of $V(\theta)$ around an initial parameter estimate $\theta_0$:

$$V(\theta) \approx V(\theta_0) - \mathcal{J}(\theta_0)(\theta - \theta_0),$$

since $\partial V(\theta)/\partial \theta^{\top} = -\mathcal{J}(\theta)$.[9] Setting this approximation equal to zero yields the updated estimate:

$$\theta_1 = \theta_0 + \mathcal{J}^{-1}(\theta_0)\, V(\theta_0).$$

This update represents a one-step approximation to the solution of the score equation, effectively linearizing the nonlinear score function near $\theta_0$.[9] The approach is directly analogous to the Newton-Raphson method applied to finding roots of the score function $V(\theta)$, where the inverse observed information acts as the step-direction matrix.[10] However, if the initial guess $\theta_0$ is sufficiently far from the true maximum likelihood estimate, this single-step update may not converge to the root, necessitating multiple iterations by repeating the expansion around successive estimates.[9]
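As a small worked illustration (added here; the exponential model is an assumption chosen for simplicity, not taken from the cited sources), consider an i.i.d. sample $y_1, \ldots, y_n$ from an Exponential($\theta$) density $f(y; \theta) = \theta e^{-\theta y}$. The score and observed information are $V(\theta) = n/\theta - \sum_i y_i$ and $\mathcal{J}(\theta) = n/\theta^2$, so a single step from $\theta_0$ gives

$$\theta_1 = \theta_0 + \mathcal{J}^{-1}(\theta_0)\, V(\theta_0)
          = \theta_0 + \frac{\theta_0^2}{n}\left(\frac{n}{\theta_0} - \sum_{i=1}^n y_i\right)
          = 2\theta_0 - \frac{\theta_0^2}{n}\sum_{i=1}^n y_i.$$

In this model the observed and expected information coincide ($\mathcal{J}(\theta) = \mathcal{I}(\theta) = n/\theta^2$), so Newton-Raphson and Fisher scoring produce the same iterates, and repeated steps converge to the MLE $\hat{\theta} = 1/\bar{y}$.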
Iterative Update Rule

The iterative update rule formalizes the scoring algorithm as a sequence of approximations to the maximum likelihood estimator, repeatedly applying a first-order Taylor expansion of the score function around the current parameter estimate. In Fisher's scoring method, the observed information matrix $\mathcal{J}(\theta)$ is approximated by its expected value under the model, the Fisher information matrix $\mathcal{I}(\theta)$, which is typically positive definite and simplifies computations in many models. This replacement enhances numerical stability compared to using the observed information directly.[11]

The general iterative formula for the update in the scoring algorithm, using the Fisher information matrix $\mathcal{I}(\theta)$ and the score vector $V(\theta) = \partial \ell(\theta)/\partial \theta$, is given by

$$\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m),$$

where $\ell(\theta)$ denotes the log-likelihood function and $m$ indexes the iteration. This update leverages the expected information to approximate the inverse Hessian, providing a direction of ascent toward the maximum likelihood estimate.[11]

The algorithm proceeds in the following steps (a pseudocode outline and an illustrative implementation follow the list):
- Initialize the parameter vector $\theta_0$ with a suitable starting value, often based on method-of-moments estimates or prior knowledge.
- At iteration $m$, compute the score vector $V(\theta_m)$ and the Fisher information matrix $\mathcal{I}(\theta_m)$ evaluated at the current estimate $\theta_m$.
- Update the parameter vector using the formula $\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m)$.
- Check for convergence, such as when the norm of the score vector satisfies $\lVert V(\theta_{m+1}) \rVert < \epsilon$ for a small tolerance $\epsilon$, or when the change in $\theta$ is sufficiently small; otherwise, increment $m$ and return to step 2.
Initialize θ_0
Set m = 0
Set tolerance ε > 0
Compute V(θ_0) = ∂l(θ_0)/∂θ
While ||V(θ_m)|| ≥ ε:
    Compute I(θ_m) = -E[∂²l(θ_m)/∂θ∂θ^T]
    If I(θ_m) is not positive definite:
        Apply regularization or stop with error
    Set θ_{m+1} = θ_m + I(θ_m)^{-1} V(θ_m)
    Set m = m + 1
    Compute V(θ_m) = ∂l(θ_m)/∂θ
Return θ_m as the estimate
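The pseudocode can be made concrete with a short, self-contained sketch (an added illustration; the Gamma-shape model, starting value, and function names are assumptions, not part of the original text). It applies the generic update $\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m)$ to estimating the shape $\alpha$ of a Gamma($\alpha$, scale $=1$) sample, for which the score is $\sum_i \log y_i - n\psi(\alpha)$ and the Fisher information is $n\psi'(\alpha)$, with $\psi$ the digamma function:

import numpy as np
from scipy.special import digamma, polygamma

def fisher_scoring_gamma_shape(y, alpha0=1.0, tol=1e-8, max_iter=100):
    """Fisher scoring for the shape of a Gamma(alpha, scale=1) sample.

    Score:        V(alpha) = sum(log y_i) - n * digamma(alpha)
    Fisher info:  I(alpha) = n * trigamma(alpha)
    Update:       alpha <- alpha + V(alpha) / I(alpha)
    """
    n = len(y)
    sum_log_y = np.sum(np.log(y))
    alpha = alpha0
    for _ in range(max_iter):
        score = sum_log_y - n * digamma(alpha)
        if abs(score) < tol:                  # convergence check on the score
            break
        info = n * polygamma(1, alpha)        # trigamma(alpha)
        alpha += score / info                 # scoring update
    return alpha

rng = np.random.default_rng(2)
y = rng.gamma(shape=2.5, scale=1.0, size=2000)
print(fisher_scoring_gamma_shape(y))          # close to the true shape 2.5

In this particular model the observed and expected information coincide, so the same loop also performs Newton-Raphson; for models where they differ, only the line computing the information would change.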
Fisher Scoring Method
Formulation
The Fisher scoring method is an iterative procedure for computing maximum likelihood estimates that modifies the Newton-Raphson update by substituting the observed information matrix $\mathcal{J}(\theta)$ with its expectation, the Fisher information matrix $\mathcal{I}(\theta)$. The core update rule is given by

$$\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m),$$

where $V(\theta)$ denotes the score function, defined as the gradient of the log-likelihood with respect to $\theta$, i.e., $V(\theta) = \nabla_\theta\, \ell(\theta)$. This formulation leverages the property that the expected value of the score is zero at the true parameter value, ensuring the fixed point of the iteration corresponds to the maximum likelihood estimate.[12]

The reliance on the expected information rather than the observed information arises from its role in averaging the second derivatives of the log-likelihood over the data-generating distribution, which yields a positive semi-definite matrix that promotes numerical stability and avoids issues with negative eigenvalues sometimes encountered in the observed Hessian. Moreover, $\mathcal{I}(\theta)$ is often analytically tractable or constant across iterations in structured models, reducing computational demands compared to recalculating second derivatives for each sample realization.[12]

Ronald Fisher first presented the method of scoring in 1912 as a practical tool for iterative maximum likelihood estimation, with further refinements detailed in publications through 1922, particularly suited to biological and experimental data where exact solutions are infeasible.[13]

For distributions in the exponential family, parameterized in canonical form as $f(x; \theta) = h(x)\exp\{\theta^{\top} T(x) - A(\theta)\}$, the Fisher information admits a simple expression: $\mathcal{I}(\theta) = \nabla^2 A(\theta)$, which equals the variance of the sufficient statistic $T(X)$. This facilitates efficient implementation in common cases like the normal, Poisson, and binomial distributions.[12]
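As an added illustration of this expression (the Poisson case is chosen here for concreteness), a single Poisson observation written in canonical form has natural parameter $\theta = \log\lambda$, sufficient statistic $T(x) = x$, and log-partition function $A(\theta) = e^{\theta}$, so

$$\mathcal{I}(\theta) = A''(\theta) = e^{\theta} = \lambda = \operatorname{Var}_\theta(X),$$

and for an i.i.d. sample of size $n$ the information in the canonical parameterization is simply $n\lambda$.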
Asymptotic Properties

Under standard regularity conditions, the maximum likelihood estimator (MLE) $\hat{\theta}$ obtained through the Fisher scoring method is asymptotically normal, satisfying

$$\sqrt{n}\,(\hat{\theta} - \theta_0) \;\xrightarrow{d}\; N\!\left(0,\ \mathcal{I}(\theta_0)^{-1}\right)$$

as the sample size $n \to \infty$, where $\mathcal{I}(\theta_0)$ denotes the Fisher information matrix for a single observation evaluated at the true parameter $\theta_0$.[14] This result holds because the Fisher scoring algorithm converges to the MLE under these conditions, inheriting its large-sample distribution.[14]

A key advantage of the Fisher scoring method is that even a single iteration, starting from a $\sqrt{n}$-consistent initial estimator, produces an estimator that is asymptotically equivalent to the full MLE and thus achieves the asymptotic efficiency bound given by the inverse Fisher information.[14] This one-step property arises from the method's use of the expected Hessian approximation, which aligns closely with the true curvature of the log-likelihood in large samples.[14][15]

The required regularity conditions for these asymptotic properties include the twice differentiability of the log-likelihood function with respect to $\theta$, the positive definiteness of the Fisher information matrix $\mathcal{I}(\theta)$, and the identifiability of the parameter $\theta$.[14] These ensure the existence, uniqueness, and consistency of the MLE, upon which the scoring method relies.[14] In the asymptotic limit, the efficiency of the estimator from Fisher scoring matches that of the full MLE, attaining the Cramér-Rao lower bound scaled by the sample size.[14]
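To illustrate the one-step property (an added sketch; the Cauchy location model, sample size, and variable names are assumptions, not from the cited sources), the code below starts from the sample median, which is $\sqrt{n}$-consistent, and applies a single Fisher scoring step; for the unit-scale Cauchy location model the per-observation Fisher information is $1/2$:

import numpy as np

rng = np.random.default_rng(3)
theta_true = 1.0
y = theta_true + rng.standard_cauchy(5000)     # Cauchy(location=theta_true, scale=1)

def score(theta, y):
    """Score of the Cauchy location model with unit scale."""
    r = y - theta
    return np.sum(2.0 * r / (1.0 + r ** 2))

n = len(y)
theta_tilde = np.median(y)                      # sqrt(n)-consistent starting estimator
info = n / 2.0                                  # Fisher information n/2 for unit-scale Cauchy
theta_one_step = theta_tilde + score(theta_tilde, y) / info

print(theta_tilde, theta_one_step)              # the one-step estimate is asymptotically efficient

Repeating this experiment over many simulated samples would show the one-step estimator's sampling variance approaching $2/n$, the inverse of the total Fisher information, as the theory predicts.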
Applications

In Parametric Statistics
In parametric statistics, the scoring algorithm, also known as Fisher scoring, serves as an iterative method for obtaining maximum likelihood estimates (MLEs) in parametric models where closed-form solutions are unavailable or computationally challenging. It is particularly applied to distributions such as the normal, Poisson, and binomial, enabling parameter estimation by approximating the likelihood surface through the score function and its expected Hessian, the Fisher information matrix. This approach leverages the first-order condition for MLE, setting the score to zero, and iteratively refines estimates using the information matrix as a step-size adjuster, promoting stable convergence in finite samples.[16]

A representative example is the estimation of coefficients in logistic regression, which models binary outcomes under a binomial distribution with a logit link, assuming independent Bernoulli trials (aggregated to binomial counts when $n_i > 1$). The score vector for the parameter vector $\beta$ is $X^{\top}(y - \mu)$, where $X$ is the design matrix, $y$ the response vector, and $\mu$ the fitted probabilities derived from the linear predictor $\eta = X\beta$ via the inverse logit function $\mu = 1/(1 + e^{-\eta})$. The Fisher information matrix is then $X^{\top} W X$, with $W$ a diagonal matrix of weights $\mu_i(1 - \mu_i)$, providing the expected negative second derivative of the log-likelihood. This setup allows iterative updates $\beta^{(t+1)} = \beta^{(t)} + (X^{\top} W X)^{-1} X^{\top}(y - \mu)$, converging to the MLE after several steps; in simulations with synthetic binary data, 5-10 iterations typically suffice for precision within $10^{-6}$.[17]

The scoring algorithm plays a key role in iterative refinement for parametric MLE when direct solutions elude analytical derivation, such as in multivariate extensions or non-standard parameterizations of the aforementioned distributions; for instance, in estimating location-scale parameters for the normal distribution with censored observations, it iteratively adjusts initial guesses toward the likelihood maximum using the score and information. In the Poisson case, while the simple mean parameter has a closed-form MLE, the method extends to overdispersed or zero-inflated variants, refining estimates via successive approximations of the score $\sum_i (y_i - \lambda)/\lambda$ and information $n/\lambda$. For the binomial, beyond logistic setups, it handles proportion parameters in clustered data without explicit solutions. As in the general iterative update rule, each step relies on inverting the information matrix to guide the parameter update.[16][18]

Historically, the scoring algorithm emerged in early computational statistics as a practical tool before modern software, pioneered by Ronald A. Fisher in his 1925 paper on statistical estimation, where it was proposed for the numerical solution of likelihood equations in complex parametric problems such as those in biometry and genetics. Prior to widespread access to electronic computers in the mid-20th century, statisticians relied on manual or mechanical iterations of scoring for MLE in journals and applied research, as seen in Fisher's applications to normal theory inference and contingency tables; its adoption grew through the 1930s-1950s in texts on numerical analysis, filling gaps left by less stable methods, until packages such as those in S (a precursor to R) automated it by the 1970s.[19]
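Tying back to the logistic-regression example above, the following is a minimal sketch (added for illustration; the synthetic data, sample size, and variable names are assumptions) of Fisher scoring for the logit model, using the score $X^{\top}(y - \mu)$ and information $X^{\top} W X$ described in the text:

import numpy as np

rng = np.random.default_rng(4)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design with intercept
beta_true = np.array([-0.5, 1.0, 2.0])
mu_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, mu_true)                                     # synthetic binary responses

beta = np.zeros(p)                        # starting value
for _ in range(25):
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))       # inverse logit
    score = X.T @ (y - mu)                # score vector X^T (y - mu)
    W = mu * (1.0 - mu)                   # diagonal of the weight matrix
    info = X.T @ (X * W[:, None])         # Fisher information X^T W X
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:       # convergence on the step size
        break

print(beta)                               # close to beta_true for large n

With a well-conditioned design, the loop typically reaches the tolerance within the handful of iterations mentioned above.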
In Generalized Linear Models

In generalized linear models (GLMs), the scoring algorithm, specifically Fisher scoring, is adapted to estimate the parameters $\beta$ by iteratively solving the score equations while accounting for the model's link function $g(\cdot)$ and variance function $V(\cdot)$ from the exponential family distribution. This adaptation integrates seamlessly with the iteratively reweighted least squares (IRLS) procedure, where each iteration reweights observations based on the current estimates of the mean $\mu$ and applies weights that incorporate both the link derivatives and inverse variances to stabilize the updates and ensure convergence to the maximum likelihood estimator. The IRLS formulation arises naturally from the Newton-Raphson method but uses the expected Hessian (Fisher information) instead of the observed one, making it particularly efficient for GLMs as it leverages the structure of the exponential family.[20]

The score function for $\beta$ in a GLM is given by

$$U(\beta) = X^{\top} D V^{-1} (y - \mu),$$

where $X$ is the design matrix, $D = \operatorname{diag}(\partial\mu_i/\partial\eta_i)$ is the diagonal matrix of derivatives of the mean with respect to the linear predictor $\eta = X\beta$, $V = \operatorname{diag}(V(\mu_i))$ is the diagonal matrix with entries from the variance function, $y$ is the response vector, and $\mu = g^{-1}(\eta)$. The expected Fisher information matrix is then

$$\mathcal{I}(\beta) = X^{\top} D V^{-1} D X = X^{\top} W X, \qquad W = \operatorname{diag}\!\left(\frac{(\partial\mu_i/\partial\eta_i)^2}{V(\mu_i)}\right).$$

In the IRLS implementation, these are recast into a weighted least squares problem by defining working weights $w_i = (\partial\mu_i/\partial\eta_i)^2 / V(\mu_i)$ and a working response $z_i = \eta_i + (y_i - \mu_i)\,\partial\eta_i/\partial\mu_i$, leading to the update $\beta^{(t+1)} = (X^{\top} W X)^{-1} X^{\top} W z$, where $W = \operatorname{diag}(w_i)$. This equivalence holds because the scoring step aligns precisely with the IRLS solution under the GLM structure. For canonical links, where $\eta = \theta$ (the natural parameter), the formulas simplify further, as $\partial\mu_i/\partial\eta_i = V(\mu_i)$ and the score becomes $X^{\top}(y - \mu)$.[11]

A representative example is the binomial GLM for modeling proportions, such as success rates in clinical trials, where the response $y_i$ follows a Binomial($n_i, \pi_i$) distribution with mean $\mu_i = n_i \pi_i$ and variance $n_i \pi_i(1 - \pi_i)$. Using the canonical logit link $\eta_i = \log\{\pi_i/(1 - \pi_i)\}$, the derivative $\partial\mu_i/\partial\eta_i = n_i \pi_i(1 - \pi_i)$ equals the variance function, so the score simplifies to $X^{\top}(y - \mu)$ and the information to $X^{\top} W X$ with $W = \operatorname{diag}(n_i \pi_i(1 - \pi_i))$. If a non-canonical link like the probit is used instead, the computations become more involved, as $\partial\mu_i/\partial\eta_i = n_i \phi(\eta_i)$ (where $\phi$ is the standard normal density), requiring explicit inclusion of this derivative in the weights and altering the IRLS working response to reflect the differing sensitivity of $\mu$ to changes in $\eta$. This demonstrates how the choice of link function directly influences the weighting and thus the numerical stability and interpretation of the scoring updates in binomial models.[20]

In modern statistical software, the scoring algorithm for GLMs is routinely implemented via IRLS, as seen in R's glm() function, which reports the number of "Fisher Scoring iterations" required for convergence and defaults to this method for families like binomial, Poisson, and Gaussian. This implementation ensures efficient handling of large datasets and various link functions, with built-in safeguards like step-halving for non-convergence.
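As a complementary sketch of the IRLS working-response form (an added illustration; the Poisson log-link model, data, and names are assumptions rather than material from the cited sources), each iteration below solves a weighted least squares problem with weights $w_i$ and working response $z_i$:

import numpy as np

rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept + one covariate
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))                   # Poisson responses, log link

beta = np.zeros(2)
for it in range(25):
    eta = X @ beta
    mu = np.exp(eta)                     # inverse log link
    # Canonical log link: d mu / d eta = mu and V(mu) = mu,
    # so the working weights are w_i = (d mu/d eta)^2 / V(mu) = mu_i.
    w = mu
    z = eta + (y - mu) / mu              # working response z = eta + (y - mu) * d eta / d mu
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)       # weighted least squares step
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print(beta, it + 1)                      # estimate and number of scoring iterations

The iteration count printed here plays the same role as the "Fisher Scoring iterations" reported by R's glm(), and the weighted least squares step is exactly the scoring update written in its IRLS form.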

