Scoring algorithm
Scoring algorithm, also known as Fisher's scoring,[1] is a form of Newton's method used in statistics to solve maximum likelihood equations numerically, named after Ronald Fisher.
Sketch of derivation
Let $Y_1, \ldots, Y_n$ be random variables, independent and identically distributed with twice differentiable p.d.f. $f(y; \theta)$, and we wish to calculate the maximum likelihood estimator (M.L.E.) $\theta^*$ of $\theta$. First, suppose we have a starting point for our algorithm $\theta_0$, and consider a Taylor expansion of the score function, $V(\theta)$, about $\theta_0$:

$$V(\theta) \approx V(\theta_0) - \mathcal{J}(\theta_0)(\theta - \theta_0),$$

where

$$\mathcal{J}(\theta_0) = -\sum_{i=1}^n \left. \nabla \nabla^{\top} \right|_{\theta = \theta_0} \log f(Y_i; \theta)$$

is the observed information matrix at $\theta_0$. Now, setting $\theta = \theta^*$, using that $V(\theta^*) = 0$ and rearranging gives us:

$$\theta^* \approx \theta_0 + \mathcal{J}^{-1}(\theta_0)\, V(\theta_0).$$

We therefore use the algorithm

$$\theta_{m+1} = \theta_m + \mathcal{J}^{-1}(\theta_m)\, V(\theta_m),$$

and under certain regularity conditions, it can be shown that $\theta_m \to \theta^*$.
Fisher scoring
In practice, $\mathcal{J}(\theta)$ is usually replaced by $\mathcal{I}(\theta) = \mathbb{E}[\mathcal{J}(\theta)]$, the Fisher information, thus giving us the Fisher scoring algorithm:

$$\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m).$$

Under some regularity conditions, if $\theta_m$ is a consistent estimator, then $\theta_{m+1}$ (the correction after a single step) is 'optimal' in the sense that its error distribution is asymptotically identical to that of the true maximum likelihood estimate.[2]
See also
- Score (statistics)
- Fisher information
References
- ^ Longford, Nicholas T. (1987). "A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects". Biometrika. 74 (4): 817–827. doi:10.1093/biomet/74.4.817.
- ^ Li, Bing; Babu, G. Jogesh (2019). "Bayesian Inference". Springer Texts in Statistics. New York, NY: Springer New York. Theorem 9.4. doi:10.1007/978-1-4939-9761-9_6. ISBN 978-1-4939-9759-6. S2CID 239322258. Retrieved 2023-01-03.
Further reading
- Jennrich, R. I. & Sampson, P. F. (1976). "Newton-Raphson and Related Algorithms for Maximum Likelihood Variance Component Estimation". Technometrics. 18 (1): 11–17. doi:10.1080/00401706.1976.10489395. JSTOR 1267911.
Scoring algorithm
Background
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a fundamental method in statistical inference for estimating the parameters of a probabilistic model by selecting the values that make the observed data most probable under that model. Given a sample of independent observations $x_1, \ldots, x_n$ from a distribution parameterized by $\theta$, MLE seeks to maximize the likelihood function $L(\theta) = \prod_{i=1}^n f(x_i; \theta)$, where $f$ is the probability density or mass function. Equivalently, maximization of the log-likelihood $\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)$ is often performed, as it transforms the product into a sum, simplifying optimization while preserving the location of maxima.[5]

The likelihood principle underlies MLE, asserting that all evidential content in the data regarding the unknown parameters is encapsulated within the likelihood function, thereby directing parametric inference toward comparisons of relative plausibility across parameter values rather than absolute probabilities. This approach emphasizes the data's role in updating beliefs about $\theta$ solely through how well different parameter configurations explain the observations, forming a cornerstone of frequentist parametric statistics.[5]

MLE was developed by Ronald A. Fisher in the early 1920s, with its formal introduction in his seminal 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he established it as a general method for point estimation in parametric models.[5]

A classic example arises in estimating the mean $\mu$ and variance $\sigma^2$ of a normal distribution based on an i.i.d. sample $x_1, \ldots, x_n$. The log-likelihood function is

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$

Maximizing produces the estimators $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$, which coincide with the familiar sample mean and (biased) sample variance.[6]
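As a brief added check (not part of the cited sources; the synthetic data, sample size, and function names are assumptions), the closed-form normal MLEs can be compared against a direct numerical maximization of the log-likelihood using NumPy and SciPy:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)      # synthetic i.i.d. sample

# Closed-form MLEs: sample mean and biased sample variance.
mu_hat = x.mean()
sigma2_hat = x.var()                               # ddof=0, i.e. the biased MLE

# Numerical check: minimize the negative log-likelihood directly.
def neg_log_lik(params):
    mu, log_sigma = params                         # optimize log(sigma) so that sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(mu_hat, sigma2_hat)
print(res.x[0], np.exp(res.x[1]) ** 2)             # agrees with the closed-form MLEs

The numerical optimum coincides with the closed-form estimators up to optimizer tolerance, illustrating that maximizing the log-likelihood and the likelihood give the same solution.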
Score Function and Fisher Information

In statistical inference, the score function, often denoted as $V(\theta)$ or $U(\theta)$, is defined as the gradient of the log-likelihood function with respect to the parameter vector $\theta$, given by

$$V(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \nabla_\theta\, \ell(\theta).$$

This vector measures the sensitivity of the log-likelihood to changes in $\theta$. Under standard regularity conditions, the expected value of the score function is zero when evaluated at the true parameter value, i.e., $\mathbb{E}[V(\theta)] = 0$, which implies that the maximum likelihood estimator (MLE) occurs where the score equals zero.[7]

The observed information matrix, denoted $\mathcal{J}(\theta)$, is the negative of the Hessian matrix of the log-likelihood, expressed as

$$\mathcal{J}(\theta) = -\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta^{\top}}.$$

This matrix captures the local curvature of the log-likelihood surface at a specific $\theta$, providing a data-dependent measure of precision for parameter estimates.[8]

The Fisher information matrix is the expected value of the observed information matrix, defined as

$$\mathcal{I}(\theta) = \mathbb{E}[\mathcal{J}(\theta)] = -\mathbb{E}\!\left[\frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta^{\top}}\right].$$

It quantifies the amount of information that the observed data carry about the unknown parameter $\theta$, serving as a fundamental measure of estimator efficiency in parametric models. Additionally, the variance of the score function equals the Fisher information matrix, $\operatorname{Var}[V(\theta)] = \mathcal{I}(\theta)$, highlighting its role in bounding the precision of estimators.[7][8]
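To make these definitions concrete (an added illustration; the Poisson model, data, and names are assumptions), consider an i.i.d. Poisson($\lambda$) sample, for which the score is $\sum_i (y_i - \lambda)/\lambda$ and the Fisher information is $n/\lambda$:

import numpy as np

rng = np.random.default_rng(1)
lam_true = 3.0
y = rng.poisson(lam_true, size=1000)        # synthetic i.i.d. Poisson sample

def score(lam, y):
    """Score V(lambda), the derivative of the Poisson log-likelihood."""
    return np.sum(y - lam) / lam

def fisher_info(lam, n):
    """Expected information I(lambda) = n / lambda for the Poisson model."""
    return n / lam

lam_hat = y.mean()                           # the MLE solves score(lam) = 0
print(score(lam_hat, y))                     # ~ 0 at the MLE
print(1.0 / fisher_info(lam_hat, len(y)))    # approximate variance of the MLE

The inverse information printed on the last line is the usual large-sample approximation to the variance of $\hat{\lambda}$, reflecting the role of $\mathcal{I}(\theta)$ in bounding estimator precision.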
Derivation of the Scoring Algorithm

Taylor Expansion Approach
The Taylor expansion approach provides a foundational derivation for iteratively solving the score equation $V(\theta) = 0$, where $V(\theta)$ denotes the score function, defined as the gradient of the log-likelihood $\ell(\theta)$ with respect to the parameter $\theta$. The observed information matrix $\mathcal{J}(\theta)$, which is the negative Hessian of the log-likelihood, is the negative of the Jacobian of the score function.[9]

To approximate the root, consider a first-order Taylor series expansion of $V(\theta)$ around an initial parameter estimate $\theta_0$:

$$V(\theta) \approx V(\theta_0) - \mathcal{J}(\theta_0)(\theta - \theta_0),$$

since $\partial V(\theta)/\partial \theta^{\top} = -\mathcal{J}(\theta)$.[9] Setting this approximation equal to zero yields the updated estimate:

$$\theta_1 = \theta_0 + \mathcal{J}^{-1}(\theta_0)\, V(\theta_0).$$

This update represents a one-step approximation to the solution of the score equation, effectively linearizing the nonlinear score function near $\theta_0$.[9] The approach is directly analogous to the Newton-Raphson method applied to finding roots of the score function $V(\theta)$, where the inverse observed information acts as the step-direction matrix.[10] However, if the initial guess $\theta_0$ is sufficiently far from the true maximum likelihood estimate, this single-step update may not converge to the root, necessitating multiple iterations by repeating the expansion around successive estimates.[9]
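As a small worked illustration (added here; the exponential model is an assumption chosen for simplicity, not taken from the cited sources), consider an i.i.d. sample $y_1, \ldots, y_n$ from an Exponential($\theta$) density $f(y; \theta) = \theta e^{-\theta y}$. The score and observed information are $V(\theta) = n/\theta - \sum_i y_i$ and $\mathcal{J}(\theta) = n/\theta^2$, so a single step from $\theta_0$ gives

$$\theta_1 = \theta_0 + \mathcal{J}^{-1}(\theta_0)\, V(\theta_0)
          = \theta_0 + \frac{\theta_0^2}{n}\left(\frac{n}{\theta_0} - \sum_{i=1}^n y_i\right)
          = 2\theta_0 - \frac{\theta_0^2}{n}\sum_{i=1}^n y_i.$$

In this model the observed and expected information coincide ($\mathcal{J}(\theta) = \mathcal{I}(\theta) = n/\theta^2$), so Newton-Raphson and Fisher scoring produce the same iterates, and repeated steps converge to the MLE $\hat{\theta} = 1/\bar{y}$.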
Iterative Update Rule

The iterative update rule formalizes the scoring algorithm as a sequence of approximations to the maximum likelihood estimator, repeatedly applying a first-order Taylor expansion of the score function around the current parameter estimate. In Fisher's scoring method, the observed information matrix $\mathcal{J}(\theta)$ is approximated by its expected value under the model, the Fisher information matrix $\mathcal{I}(\theta)$, which is typically positive definite and simplifies computations in many models. This replacement enhances numerical stability compared to using the observed information directly.[11]

The general iterative formula for the update in the scoring algorithm, using the Fisher information matrix $\mathcal{I}(\theta)$ and the score vector $V(\theta) = \partial \ell(\theta)/\partial \theta$, is given by

$$\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m),$$

where $\ell(\theta)$ denotes the log-likelihood function and $m$ indexes the iteration. This update leverages the expected information to approximate the inverse Hessian, providing a direction of ascent toward the maximum likelihood estimate.[11]

The algorithm proceeds in the following steps (a pseudocode outline and an illustrative implementation follow the list):
- Initialize the parameter vector $\theta_0$ with a suitable starting value, often based on method-of-moments estimates or prior knowledge.
- At iteration $m$, compute the score vector $V(\theta_m)$ and the Fisher information matrix $\mathcal{I}(\theta_m)$ evaluated at the current estimate $\theta_m$.
- Update the parameter vector using the formula $\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m)$.
- Check for convergence, such as when the norm of the score vector satisfies $\lVert V(\theta_{m+1}) \rVert < \epsilon$ for a small tolerance $\epsilon$, or when the change in $\theta$ is sufficiently small; otherwise, increment $m$ and return to step 2.
Initialize θ_0
Set m = 0
Set tolerance ε > 0
Compute V(θ_0) = ∂l(θ_0)/∂θ
While ||V(θ_m)|| ≥ ε:
    Compute I(θ_m) = -E[∂²l(θ_m)/∂θ∂θ^T]
    If I(θ_m) is not positive definite:
        Apply regularization or stop with error
    Set θ_{m+1} = θ_m + I(θ_m)^{-1} V(θ_m)
    Set m = m + 1
    Compute V(θ_m) = ∂l(θ_m)/∂θ
Return θ_m as the estimate
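The pseudocode can be made concrete with a short, self-contained sketch (an added illustration; the Gamma-shape model, starting value, and function names are assumptions, not part of the original text). It applies the generic update $\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m)$ to estimating the shape $\alpha$ of a Gamma($\alpha$, scale $=1$) sample, for which the score is $\sum_i \log y_i - n\psi(\alpha)$ and the Fisher information is $n\psi'(\alpha)$, with $\psi$ the digamma function:

import numpy as np
from scipy.special import digamma, polygamma

def fisher_scoring_gamma_shape(y, alpha0=1.0, tol=1e-8, max_iter=100):
    """Fisher scoring for the shape of a Gamma(alpha, scale=1) sample.

    Score:        V(alpha) = sum(log y_i) - n * digamma(alpha)
    Fisher info:  I(alpha) = n * trigamma(alpha)
    Update:       alpha <- alpha + V(alpha) / I(alpha)
    """
    n = len(y)
    sum_log_y = np.sum(np.log(y))
    alpha = alpha0
    for _ in range(max_iter):
        score = sum_log_y - n * digamma(alpha)
        if abs(score) < tol:                  # convergence check on the score
            break
        info = n * polygamma(1, alpha)        # trigamma(alpha)
        alpha += score / info                 # scoring update
    return alpha

rng = np.random.default_rng(2)
y = rng.gamma(shape=2.5, scale=1.0, size=2000)
print(fisher_scoring_gamma_shape(y))          # close to the true shape 2.5

In this particular model the observed and expected information coincide, so the same loop also performs Newton-Raphson; for models where they differ, only the line computing the information would change.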
Fisher Scoring Method
Formulation
The Fisher scoring method is an iterative procedure for computing maximum likelihood estimates that modifies the Newton-Raphson update by substituting the observed information matrix $\mathcal{J}(\theta)$ with its expectation, the Fisher information matrix $\mathcal{I}(\theta)$. The core update rule is given by

$$\theta_{m+1} = \theta_m + \mathcal{I}^{-1}(\theta_m)\, V(\theta_m),$$

where $V(\theta)$ denotes the score function, defined as the gradient of the log-likelihood with respect to $\theta$, i.e., $V(\theta) = \nabla_\theta\, \ell(\theta)$. This formulation leverages the property that the expected value of the score is zero at the true parameter value, ensuring the fixed point of the iteration corresponds to the maximum likelihood estimate.[12]

The reliance on the expected information rather than the observed information arises from its role in averaging the second derivatives of the log-likelihood over the data-generating distribution, which yields a positive semi-definite matrix that promotes numerical stability and avoids issues with negative eigenvalues sometimes encountered in the observed Hessian. Moreover, $\mathcal{I}(\theta)$ is often analytically tractable or constant across iterations in structured models, reducing computational demands compared to recalculating second derivatives for each sample realization.[12]

Ronald Fisher first presented the method of scoring in 1912 as a practical tool for iterative maximum likelihood estimation, with further refinements detailed in publications through 1922, particularly suited to biological and experimental data where exact solutions are infeasible.[13]

For distributions in the exponential family, parameterized in canonical form as $f(x; \theta) = h(x)\exp\{\theta^{\top} T(x) - A(\theta)\}$, the Fisher information admits a simple expression: $\mathcal{I}(\theta) = \nabla^2 A(\theta)$, which equals the variance of the sufficient statistic $T(X)$. This facilitates efficient implementation in common cases like the normal, Poisson, and binomial distributions.[12]
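As an added illustration of this expression (the Poisson case is chosen here for concreteness), a single Poisson observation written in canonical form has natural parameter $\theta = \log\lambda$, sufficient statistic $T(x) = x$, and log-partition function $A(\theta) = e^{\theta}$, so

$$\mathcal{I}(\theta) = A''(\theta) = e^{\theta} = \lambda = \operatorname{Var}_\theta(X),$$

and for an i.i.d. sample of size $n$ the information in the canonical parameterization is simply $n\lambda$.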
Asymptotic Properties

Under standard regularity conditions, the maximum likelihood estimator (MLE) $\hat{\theta}$ obtained through the Fisher scoring method is asymptotically normal, satisfying

$$\sqrt{n}\,(\hat{\theta} - \theta_0) \;\xrightarrow{d}\; N\!\left(0,\ \mathcal{I}(\theta_0)^{-1}\right)$$

as the sample size $n \to \infty$, where $\mathcal{I}(\theta_0)$ denotes the Fisher information matrix for a single observation evaluated at the true parameter $\theta_0$.[14] This result holds because the Fisher scoring algorithm converges to the MLE under these conditions, inheriting its large-sample distribution.[14]

A key advantage of the Fisher scoring method is that even a single iteration, starting from a $\sqrt{n}$-consistent initial estimator, produces an estimator that is asymptotically equivalent to the full MLE and thus achieves the asymptotic efficiency bound given by the inverse Fisher information.[14] This one-step property arises from the method's use of the expected Hessian approximation, which aligns closely with the true curvature of the log-likelihood in large samples.[14][15]

The required regularity conditions for these asymptotic properties include the twice differentiability of the log-likelihood function with respect to $\theta$, the positive definiteness of the Fisher information matrix $\mathcal{I}(\theta)$, and the identifiability of the parameter $\theta$.[14] These ensure the existence, uniqueness, and consistency of the MLE, upon which the scoring method relies.[14] In the asymptotic limit, the efficiency of the estimator from Fisher scoring matches that of the full MLE, attaining the Cramér-Rao lower bound scaled by the sample size.[14]
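To illustrate the one-step property (an added sketch; the Cauchy location model, sample size, and variable names are assumptions, not from the cited sources), the code below starts from the sample median, which is $\sqrt{n}$-consistent, and applies a single Fisher scoring step; for the unit-scale Cauchy location model the per-observation Fisher information is $1/2$:

import numpy as np

rng = np.random.default_rng(3)
theta_true = 1.0
y = theta_true + rng.standard_cauchy(5000)     # Cauchy(location=theta_true, scale=1)

def score(theta, y):
    """Score of the Cauchy location model with unit scale."""
    r = y - theta
    return np.sum(2.0 * r / (1.0 + r ** 2))

n = len(y)
theta_tilde = np.median(y)                      # sqrt(n)-consistent starting estimator
info = n / 2.0                                  # Fisher information n/2 for unit-scale Cauchy
theta_one_step = theta_tilde + score(theta_tilde, y) / info

print(theta_tilde, theta_one_step)              # the one-step estimate is asymptotically efficient

Repeating this experiment over many simulated samples would show the one-step estimator's sampling variance approaching $2/n$, the inverse of the total Fisher information, as the theory predicts.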
Applications

In Parametric Statistics
In parametric statistics, the scoring algorithm, also known as Fisher scoring, serves as an iterative method for obtaining maximum likelihood estimates (MLEs) in parametric models where closed-form solutions are unavailable or computationally challenging. It is particularly applied to distributions such as the normal, Poisson, and binomial, enabling parameter estimation by approximating the likelihood surface through the score function and its expected Hessian, the Fisher information matrix. This approach leverages the first-order condition for MLE, setting the score to zero, and iteratively refines estimates using the information matrix as a step-size adjuster, promoting stable convergence in finite samples.[16]

A representative example is the estimation of coefficients in logistic regression, which models binary outcomes under a binomial distribution with a logit link, assuming independent Bernoulli trials (aggregated to binomial counts when $n_i > 1$). The score vector for the parameter vector $\beta$ is $X^{\top}(y - \mu)$, where $X$ is the design matrix, $y$ the response vector, and $\mu$ the fitted probabilities derived from the linear predictor $\eta = X\beta$ via the inverse logit function $\mu = 1/(1 + e^{-\eta})$. The Fisher information matrix is then $X^{\top} W X$, with $W$ a diagonal matrix of weights $\mu_i(1 - \mu_i)$, providing the expected negative second derivative of the log-likelihood. This setup allows iterative updates $\beta^{(t+1)} = \beta^{(t)} + (X^{\top} W X)^{-1} X^{\top}(y - \mu)$, converging to the MLE after several steps; in simulations with synthetic binary data, 5-10 iterations typically suffice for precision within $10^{-6}$.[17]

The scoring algorithm plays a key role in iterative refinement for parametric MLE when direct solutions elude analytical derivation, such as in multivariate extensions or non-standard parameterizations of the aforementioned distributions; for instance, in estimating location-scale parameters for the normal distribution with censored observations, it iteratively adjusts initial guesses toward the likelihood maximum using the score and information. In the Poisson case, while the simple mean parameter has a closed-form MLE, the method extends to overdispersed or zero-inflated variants, refining estimates via successive approximations of the score $\sum_i (y_i - \lambda)/\lambda$ and information $n/\lambda$. For the binomial, beyond logistic setups, it handles proportion parameters in clustered data without explicit solutions. As in the general iterative update rule, each step relies on inverting the information matrix to guide the parameter update.[16][18]

Historically, the scoring algorithm emerged in early computational statistics as a practical tool before modern software, pioneered by Ronald A. Fisher in his 1925 paper on statistical estimation, where it was proposed for the numerical solution of likelihood equations in complex parametric problems such as those in biometry and genetics. Prior to widespread access to electronic computers in the mid-20th century, statisticians relied on manual or mechanical iterations of scoring for MLE in journals and applied research, as seen in Fisher's applications to normal theory inference and contingency tables; its adoption grew through the 1930s-1950s in texts on numerical analysis, filling gaps left by less stable methods, until packages such as those in S (a precursor to R) automated it by the 1970s.[19]
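Tying back to the logistic-regression example above, the following is a minimal sketch (added for illustration; the synthetic data, sample size, and variable names are assumptions) of Fisher scoring for the logit model, using the score $X^{\top}(y - \mu)$ and information $X^{\top} W X$ described in the text:

import numpy as np

rng = np.random.default_rng(4)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design with intercept
beta_true = np.array([-0.5, 1.0, 2.0])
mu_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, mu_true)                                     # synthetic binary responses

beta = np.zeros(p)                        # starting value
for _ in range(25):
    eta = X @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))       # inverse logit
    score = X.T @ (y - mu)                # score vector X^T (y - mu)
    W = mu * (1.0 - mu)                   # diagonal of the weight matrix
    info = X.T @ (X * W[:, None])         # Fisher information X^T W X
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:       # convergence on the step size
        break

print(beta)                               # close to beta_true for large n

With a well-conditioned design, the loop typically reaches the tolerance within the handful of iterations mentioned above.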
In Generalized Linear Models

In generalized linear models (GLMs), the scoring algorithm, specifically Fisher scoring, is adapted to estimate the parameters $\beta$ by iteratively solving the score equations while accounting for the model's link function $g(\cdot)$ and variance function $V(\cdot)$ from the exponential family distribution. This adaptation integrates seamlessly with the iteratively reweighted least squares (IRLS) procedure, where each iteration reweights observations based on the current estimates of the mean $\mu$ and applies weights that incorporate both the link derivatives and inverse variances to stabilize the updates and ensure convergence to the maximum likelihood estimator. The IRLS formulation arises naturally from the Newton-Raphson method but uses the expected Hessian (Fisher information) instead of the observed one, making it particularly efficient for GLMs as it leverages the structure of the exponential family.[20]

The score function for $\beta$ in a GLM is given by

$$U(\beta) = X^{\top} D V^{-1} (y - \mu),$$

where $X$ is the design matrix, $D = \operatorname{diag}(\partial\mu_i/\partial\eta_i)$ is the diagonal matrix of derivatives of the mean with respect to the linear predictor $\eta = X\beta$, $V = \operatorname{diag}(V(\mu_i))$ is the diagonal matrix with entries from the variance function, $y$ is the response vector, and $\mu = g^{-1}(\eta)$. The expected Fisher information matrix is then

$$\mathcal{I}(\beta) = X^{\top} D V^{-1} D X = X^{\top} W X, \qquad W = \operatorname{diag}\!\left(\frac{(\partial\mu_i/\partial\eta_i)^2}{V(\mu_i)}\right).$$

In the IRLS implementation, these are recast into a weighted least squares problem by defining working weights $w_i = (\partial\mu_i/\partial\eta_i)^2 / V(\mu_i)$ and a working response $z_i = \eta_i + (y_i - \mu_i)\,\partial\eta_i/\partial\mu_i$, leading to the update $\beta^{(t+1)} = (X^{\top} W X)^{-1} X^{\top} W z$, where $W = \operatorname{diag}(w_i)$. This equivalence holds because the scoring step aligns precisely with the IRLS solution under the GLM structure. For canonical links, where $\eta = \theta$ (the natural parameter), the formulas simplify further, as $\partial\mu_i/\partial\eta_i = V(\mu_i)$ and the score becomes $X^{\top}(y - \mu)$.[11]

A representative example is the binomial GLM for modeling proportions, such as success rates in clinical trials, where the response $y_i$ follows a Binomial($n_i, \pi_i$) distribution with mean $\mu_i = n_i \pi_i$ and variance $n_i \pi_i(1 - \pi_i)$. Using the canonical logit link $\eta_i = \log\{\pi_i/(1 - \pi_i)\}$, the derivative $\partial\mu_i/\partial\eta_i = n_i \pi_i(1 - \pi_i)$ equals the variance function, so the score simplifies to $X^{\top}(y - \mu)$ and the information to $X^{\top} W X$ with $W = \operatorname{diag}(n_i \pi_i(1 - \pi_i))$. If a non-canonical link like the probit is used instead, the computations become more involved, as $\partial\mu_i/\partial\eta_i = n_i \phi(\eta_i)$ (where $\phi$ is the standard normal density), requiring explicit inclusion of this derivative in the weights and altering the IRLS working response to reflect the differing sensitivity of $\mu$ to changes in $\eta$. This demonstrates how the choice of link function directly influences the weighting and thus the numerical stability and interpretation of the scoring updates in binomial models.[20]

In modern statistical software, the scoring algorithm for GLMs is routinely implemented via IRLS, as seen in R's glm() function, which reports the number of "Fisher Scoring iterations" required for convergence and defaults to this method for families like binomial, Poisson, and Gaussian. This implementation ensures efficient handling of large datasets and various link functions, with built-in safeguards like step-halving for non-convergence.
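As a complementary sketch of the IRLS working-response form (an added illustration; the Poisson log-link model, data, and names are assumptions rather than material from the cited sources), each iteration below solves a weighted least squares problem with weights $w_i$ and working response $z_i$:

import numpy as np

rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept + one covariate
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))                   # Poisson responses, log link

beta = np.zeros(2)
for it in range(25):
    eta = X @ beta
    mu = np.exp(eta)                     # inverse log link
    # Canonical log link: d mu / d eta = mu and V(mu) = mu,
    # so the working weights are w_i = (d mu/d eta)^2 / V(mu) = mu_i.
    w = mu
    z = eta + (y - mu) / mu              # working response z = eta + (y - mu) * d eta / d mu
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)       # weighted least squares step
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print(beta, it + 1)                      # estimate and number of scoring iterations

The iteration count printed here plays the same role as the "Fisher Scoring iterations" reported by R's glm(), and the weighted least squares step is exactly the scoring update written in its IRLS form.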

