Berndt–Hall–Hall–Hausman algorithm
The Berndt–Hall–Hall–Hausman (BHHH) algorithm is a numerical optimization algorithm similar to the Newton–Raphson algorithm, but it replaces the observed negative Hessian matrix with the outer product of the gradient. This approximation is based on the information matrix equality and therefore only valid while maximizing a likelihood function.[1] The BHHH algorithm is named after the four originators: Ernst R. Berndt, Bronwyn Hall, Robert Hall, and Jerry Hausman.[2]
Usage
If a nonlinear model is fitted to the data, one often needs to estimate coefficients through optimization. A number of optimization algorithms have the following general structure. Suppose that the function to be optimized is Q(β). Then the algorithms are iterative, defining a sequence of approximations, β_k, given by

\beta_{k+1} = \beta_k + \lambda_k A_k \frac{\partial Q}{\partial \beta}(\beta_k),

where β_k is the parameter estimate at step k, and λ_k is a parameter (called step size) which partly determines the particular algorithm. For the BHHH algorithm λ_k is determined by calculations within a given iterative step, involving a line-search until a point β_{k+1} is found satisfying certain criteria. In addition, for the BHHH algorithm, Q has the form

Q = \sum_{i=1}^{N} \ln \ell_i(\beta),

and A_k is calculated using

A_k = \left[ \sum_{i=1}^{N} \frac{\partial \ln \ell_i}{\partial \beta}(\beta_k) \, \frac{\partial \ln \ell_i}{\partial \beta}(\beta_k)' \right]^{-1}.

In other cases, e.g. Newton–Raphson, A_k can have other forms. The BHHH algorithm has the advantage that, if certain conditions apply, convergence of the iterative procedure is guaranteed.[citation needed]
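A minimal sketch of a single update step in R, assuming a user-supplied helper grad_i(beta) that returns the N × p matrix of per-observation derivatives ∂ln ℓ_i/∂β (the helper name and structure are illustrative, not taken from the cited sources):

```r
# One BHHH update step: beta_new = beta + lambda * A %*% score, where
# A is the inverse of the summed outer product of per-observation gradients.
bhhh_step <- function(beta, grad_i, lambda = 1) {
  G <- grad_i(beta)          # N x p matrix: row i holds d ln l_i / d beta
  score <- colSums(G)        # total gradient dQ/dbeta at beta
  A <- solve(crossprod(G))   # [ sum_i g_i g_i' ]^{-1}
  beta + lambda * drop(A %*% score)
}
```

In practice the step size lambda would be chosen by a line search, as described above.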
References
- ^ Henningsen, A.; Toomet, O. (2011). "maxLik: A package for maximum likelihood estimation in R". Computational Statistics. 26 (3): 443–458 [p. 450]. doi:10.1007/s00180-010-0217-1.
- ^ Berndt, E.; Hall, B.; Hall, R.; Hausman, J. (1974). "Estimation and Inference in Nonlinear Structural Models" (PDF). Annals of Economic and Social Measurement. 3 (4): 653–665.
Further reading
- V. Martin, S. Hurn, and D. Harris, Econometric Modelling with Time Series, Chapter 3 'Numerical Estimation Methods'. Cambridge University Press, 2015.
- Amemiya, Takeshi (1985). Advanced Econometrics. Cambridge: Harvard University Press. pp. 137–138. ISBN 0-674-00560-0.
- Gill, P.; Murray, W.; Wright, M. (1981). Practical Optimization. London: Harcourt Brace.
- Gourieroux, Christian; Monfort, Alain (1995). "Gradient Methods and ML Estimation". Statistics and Econometric Models. New York: Cambridge University Press. pp. 452–458. ISBN 0-521-40551-3.
- Harvey, A. C. (1990). The Econometric Analysis of Time Series (Second ed.). Cambridge: MIT Press. pp. 137–138. ISBN 0-262-08189-X.
Berndt–Hall–Hall–Hausman algorithm
Background
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a fundamental statistical method used to estimate the parameters of a probabilistic model by maximizing the likelihood function L(θ), which represents the probability of observing the given data under the model. Equivalently, it maximizes the log-likelihood ℓ(θ) = Σ_{i=1}^{N} ln f(y_i; θ), where f(y_i; θ) is the probability density (or mass) function for each observation y_i, assuming a sample of N independent observations.[3] This approach, pioneered by Ronald Fisher, selects the parameter values that make the observed data most probable, providing estimates that are consistent, asymptotically normal, and efficient under standard regularity conditions.[4]

Key assumptions underlying MLE include the independence of observations, such that the joint likelihood factors into the product of individual densities, and correct specification of the underlying model, meaning the chosen f(y_i; θ) accurately reflects the data-generating process.[3] Violations, such as dependence or misspecification, can lead to inconsistent estimates, though MLE remains robust in many cases when the model is reasonably well specified.[4]

In econometrics, MLE plays a central role in estimating parameters of nonlinear structural models, where θ captures underlying economic relationships, such as supply and demand systems or simultaneous equations, that do not admit closed-form solutions under ordinary least squares.[1] These models often involve latent variables or non-normal errors, making MLE preferable for its use of full distributional information to achieve efficiency and enable inference via likelihood-based tests. For complex log-likelihood functions in such settings, numerical optimization techniques are typically required to locate the maximum.[3]

A representative example is the simple binary choice model, where an outcome y_i ∈ {0, 1} follows a Bernoulli distribution with success probability p, and the log-likelihood is ℓ(p) = Σ_{i=1}^{N} [y_i ln p + (1 − y_i) ln(1 − p)]. The MLE p̂ = ȳ matches the sample mean, illustrating how MLE recovers intuitive estimators even in discrete settings.[4] In more general binary models, such as probit or logit, p depends nonlinearly on covariates, underscoring MLE's applicability to structural econometric applications.[1]
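The Bernoulli example can be worked out explicitly; the following short derivation (standard textbook material, added here for clarity rather than drawn from the cited sources) shows why the MLE reduces to the sample mean:

```latex
% Log-likelihood of N Bernoulli observations with success probability p
\ell(p) = \sum_{i=1}^{N}\bigl[y_i \ln p + (1-y_i)\ln(1-p)\bigr]
% First-order condition
\frac{d\ell}{dp} = \frac{\sum_i y_i}{p} - \frac{N-\sum_i y_i}{1-p} = 0
% Solving for p gives the sample mean
\hat{p} = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}
```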
Optimization Challenges in Nonlinear Models
In nonlinear econometric models, the log-likelihood function, typically expressed as ℓ(θ) = Σ_{i=1}^{N} ℓ_i(θ), where ℓ_i(θ) represents the contribution from the i-th observation, deviates from the quadratic form of linear models, resulting in non-convexity and the potential for multiple local maxima.[1] This non-quadratic structure arises because individual likelihood contributions are often highly nonlinear in the parameters θ, leading to objective functions that may exhibit saddle points or flat regions, which complicate the identification of the global maximum.[1] Such multimodality increases the risk of convergence to suboptimal solutions, particularly in models involving discrete choices, limited dependent variables, or structural relationships in economics.[1]

Computing the exact Hessian matrix for these models imposes significant demands, as it requires second derivatives that scale poorly with model size and dimensionality, often rendering analytical evaluation infeasible for large datasets or complex specifications.[1] First-order optimization methods, such as steepest ascent, can mitigate some Hessian-related costs but frequently exhibit slow convergence due to the irregular curvature of the likelihood surface in nonlinear settings.[1] Prior to 1974, practitioners relied on grid searches or rudimentary approximation techniques, which lacked convergence guarantees and were computationally intensive even for modest models, often necessitating manual intervention or restrictive assumptions to ensure stability.[1]

In simultaneous equation models, endogeneity (where explanatory variables are correlated with error terms) necessitates maximum likelihood estimation to achieve efficiency, yet this amplifies optimization complexity by entangling parameter identification across equations and introducing further nonlinearity through joint dependencies.[1] The resulting likelihood functions are particularly prone to ill-conditioning and sensitivity to initial conditions, exacerbating the challenges of multimodality and computational expense in empirical applications such as demand systems or macroeconometric simulations.[1]
History
Development Context
In the 1970s, the field of econometrics underwent a notable shift toward the adoption of nonlinear models to better represent intricate economic phenomena, including production functions and demand systems. These models accounted for the inherent nonlinearities in parameters and variables that linear approximations could not capture, driving the need for robust maximum likelihood estimation (MLE) techniques to derive parameter estimates. This evolution reflected a growing emphasis on theoretical accuracy in modeling economic behavior, where traditional linear frameworks proved inadequate for simultaneous relationships in economic systems.[1]

Existing optimization methods, such as the Newton-Raphson algorithm, posed significant challenges for these nonlinear estimations, as they demanded computationally expensive evaluations of second and even third derivatives, often leading to unreliable convergence in practical applications. The computational burden was particularly prohibitive given the limited processing power available at the time, prompting the search for alternatives that utilized only first derivatives, such as gradients, while providing guarantees of convergence to a local maximum. This gap highlighted the limitations of prior approaches in handling the complexity of nonlinear MLE without excessive resource demands.[1]

The motivation for improved methods was especially strong in structural econometric models characterized by simultaneity, where variables influenced each other across equations, as in systems of supply and demand. In such contexts, full information maximum likelihood (FIML) estimation was favored over limited information techniques, like two-stage least squares, because it exploited all available a priori restrictions system-wide to achieve consistent and asymptotically efficient parameter estimates. This preference underscored the theoretical superiority of FIML for capturing interdependencies, though its implementation required overcoming substantial numerical hurdles.[5]

These developments were bolstered by contemporaneous advances in computer technology, which expanded the feasibility of estimating nonlinear models on a larger scale and paved the way for innovations in computational econometrics. The era's improving hardware and software capabilities enabled econometricians to move beyond simple linear regressions toward more sophisticated analyses, fostering a wave of methodological progress that addressed the demands of increasingly data-rich and theoretically complex economic inquiries.[6]
Publication and Authors
The Berndt–Hall–Hall–Hausman algorithm was developed by four economists: Ernst R. Berndt, an econometrician and Louis E. Seley Professor Emeritus of Applied Economics at the MIT Sloan School of Management;[7] Bronwyn H. Hall, a public finance expert and Professor Emerita of Economics at the University of California, Berkeley;[8] Robert E. Hall, a macroeconomist and Robert and Carole McNeil Senior Fellow at Stanford University's Hoover Institution;[9] and Jerry A. Hausman, a labor economist and John and Jennie S. MacDonald Professor of Economics at MIT.[10] At the time of publication, Berndt was at the University of British Columbia, B. H. Hall at Harvard University, and both R. E. Hall and Hausman at MIT.[1] The two Halls, Bronwyn and Robert, are distinct individuals with no direct relation.

The algorithm was introduced in the seminal paper "Estimation and Inference in Nonlinear Structural Models" by Berndt, B. H. Hall, R. E. Hall, and Hausman, published in the Annals of Economic and Social Measurement, Volume 3, Number 4 (October 1974), pages 653–665.[11] In the paper, the authors describe the algorithm as requiring less computation than prior methods, such as Newton's method, by eliminating the need for third derivatives and full Hessian matrix evaluations in each iteration.[1] They further claim that it guarantees convergence to a local maximum under standard regularity conditions for the likelihood function, offering improvements over earlier approaches like those of Eisenpress and Greenstadt (1966) and Chow (1973).[1] The work has had substantial impact, influencing the implementation of maximum likelihood routines in econometric software packages, such as Stata's technique(bhhh) option.[12][13]
Mathematical Formulation
Objective and Gradient
The Berndt–Hall–Hall–Hausman (BHHH) algorithm addresses the maximization of the log-likelihood function for parameter estimation in nonlinear statistical models. The objective function is defined as the log-likelihood ℓ(θ) = Σ_{i=1}^{N} ln f(y_i | x_i; θ), where ln f(y_i | x_i; θ) represents the log-density for the i-th observation and f denotes the density of the outcome y_i conditional on the covariates x_i and the parameters θ.[14]

The first-order condition for maximization is given by the gradient vector, or score function, g(θ) = ∂ℓ(θ)/∂θ = Σ_{i=1}^{N} ∂ ln f(y_i | x_i; θ)/∂θ. At the maximum likelihood estimator θ̂, the score satisfies g(θ̂) = 0, and under standard regularity conditions E[∂ ln f(y_i | x_i; θ_0)/∂θ] = 0 at the true parameter value θ_0, ensuring that the parameters align with the data's probabilistic structure under the model assumptions.[14]

A key property underpinning the algorithm is the information matrix equality, which states that under standard regularity conditions, the expected negative second derivative of the log-likelihood equals the expected outer product of the score:

-E\left[\frac{\partial^2 \ln f(y_i \mid x_i; \theta_0)}{\partial \theta \, \partial \theta'}\right] = E\left[\frac{\partial \ln f(y_i \mid x_i; \theta_0)}{\partial \theta}\,\frac{\partial \ln f(y_i \mid x_i; \theta_0)}{\partial \theta'}\right] = I(\theta_0),

where I(θ_0) is the Fisher information matrix. This equality connects the variance of the score to the curvature of the objective, facilitating efficient numerical optimization in high-dimensional parameter spaces.[14]

In practice, the gradient indicates the direction of ascent toward the likelihood maximum, with the algorithm exploiting the additive structure of individual contributions to compute updates, particularly advantageous for models where evaluating the full Hessian is computationally intensive.[14]
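A brief derivation of the information matrix equality, added here for completeness (standard textbook material, not quoted from the cited sources): differentiating the identity ∫ f(y | x; θ) dy = 1 twice with respect to θ yields both the zero-mean property of the score and the equality itself.

```latex
% Differentiate \int f(y \mid x;\theta)\,dy = 1 once:
\int \frac{\partial \ln f}{\partial \theta}\, f \, dy = 0
\quad\Longrightarrow\quad
E\!\left[\frac{\partial \ln f}{\partial \theta}\right] = 0
% Differentiate again and rearrange:
\int \left( \frac{\partial^2 \ln f}{\partial \theta \,\partial \theta'}
          + \frac{\partial \ln f}{\partial \theta}\frac{\partial \ln f}{\partial \theta'} \right) f \, dy = 0
\quad\Longrightarrow\quad
-E\!\left[\frac{\partial^2 \ln f}{\partial \theta \,\partial \theta'}\right]
 = E\!\left[\frac{\partial \ln f}{\partial \theta}\frac{\partial \ln f}{\partial \theta'}\right]
```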
Hessian Approximation
In maximum likelihood estimation, the Hessian matrix H(θ) = Σ_{i=1}^{N} ∂² ln f(y_i | x_i; θ)/∂θ∂θ' provides second-order information essential for Newton-type optimization algorithms, but its direct computation requires evaluating costly second derivatives for each observation i.[14] The core innovation of the Berndt–Hall–Hall–Hausman (BHHH) algorithm lies in approximating the negative inverse Hessian using the outer product of gradients (OPG), derived from the information matrix equality. Specifically, the approximation leverages the fact that, under regularity conditions, the expected value of the outer product of individual score vectors equals the negative expected Hessian. This substitution avoids explicit second derivatives while preserving asymptotic equivalence to the full Newton method near the true parameter value.[14][15]

At iteration k, the BHHH approximation takes the form

A_k = \left[ \sum_{i=1}^{N} \frac{\partial \ln f(y_i \mid x_i; \theta_k)}{\partial \theta}\,\frac{\partial \ln f(y_i \mid x_i; \theta_k)}{\partial \theta'} \right]^{-1},

where θ_k denotes the parameter vector at step k. This matrix is positive definite provided the individual score vectors span the parameter space, ensuring ascent directions in the optimization landscape, and its validity stems from asymptotic equality under the information matrix equality, which holds as the sample size increases and near the maximum likelihood estimator.[14]
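The substitution can be illustrated numerically in R. The following sketch uses a hypothetical one-parameter Poisson regression (not from the cited sources) to show that the summed outer product of scores is close to the exact negative Hessian at the true parameter, as the information matrix equality implies.

```r
# Hypothetical Poisson model y_i ~ Poisson(exp(x_i * b)) with true b0 = 0.5.
set.seed(1)
n  <- 5000
b0 <- 0.5
x  <- rnorm(n)
y  <- rpois(n, exp(x * b0))

g <- (y - exp(x * b0)) * x          # per-observation score of ln f(y_i; b) at b0
opg      <- sum(g^2)                # sum_i g_i g_i' (a scalar here, since p = 1)
neg_hess <- sum(exp(x * b0) * x^2)  # exact -d^2 ln L / db^2 at b0
c(OPG = opg, negHessian = neg_hess) # the two agree closely for large n
```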
Algorithm Procedure
Initialization
The initialization phase of the Berndt–Hall–Hall–Hausman (BHHH) algorithm is crucial for ensuring reliable convergence in maximum likelihood estimation of nonlinear models, as the choice of initial parameter vector θ_0 directly influences the algorithm's trajectory toward the optimum of the log-likelihood function. Poor initial values can cause the iterations to diverge or converge to a local maximum instead of the global one, potentially yielding biased or inefficient estimates.[1] Selecting appropriate starting points mitigates these risks and enhances computational efficiency, particularly in complex econometric models where the likelihood surface may be multimodal.[1]

Common strategies for obtaining θ_0 leverage simpler estimation techniques to approximate the nonlinear model. For instance, ordinary least squares (OLS) estimates from a linearized version of the model provide a sensible starting point, as they offer consistent but potentially inefficient approximations.[16] The method of moments, which matches sample moments to theoretical ones, or a grid search over the parameter space can also generate feasible initial values when OLS is unsuitable. The initial θ_0 must satisfy basic feasibility criteria to enable the algorithm's first iteration: the log-likelihood ℓ(θ_0) should be finite, ensuring the objective is well-defined, and the score vector (gradient) must be computable without numerical issues such as division by zero or undefined functions.[1]
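As an illustration of the linearization strategy, the R sketch below (a hypothetical example, not from the cited sources) obtains starting values for an exponential regression y_i = exp(θ_1 + θ_2 x_i) + ε_i from an OLS fit of ln y on x, which is feasible here because the simulated outcomes are positive.

```r
# Hypothetical example: starting values for y = exp(theta1 + theta2 * x) + e,
# taken from OLS on the log-linearized model ln(y) ~ x (requires y > 0).
set.seed(2)
n <- 200
x <- runif(n)
y <- exp(1 + 2 * x) + rnorm(n, sd = 0.5)

theta0 <- coef(lm(log(y) ~ x))  # rough but finite and feasible starting point
theta0
```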
Iterative Updates and Convergence
The iterative process of the Berndt–Hall–Hall–Hausman (BHHH) algorithm begins with an initial parameter vector θ_0 and proceeds through successive updates to maximize the log-likelihood function ℓ(θ). At each iteration k, the score vector (gradient) g_k = Σ_{i=1}^{N} g_i(θ_k) and the outer product of gradients (OPG) approximation A_k = [Σ_{i=1}^{N} g_i(θ_k) g_i(θ_k)']^{-1} are computed, where g_i(θ) = ∂ ln f(y_i | x_i; θ)/∂θ denotes the contribution to the score from the i-th observation.[1][17]

The parameter update follows the rule

\theta_{k+1} = \theta_k + \lambda_k A_k g_k,

where λ_k is the step size, typically initialized to 1 to approximate the scoring method for ascent. This step leverages the OPG as an estimate of the information matrix, ensuring the direction points toward increasing ℓ(θ). If the update results in a decrease in ℓ(θ_{k+1}) compared to ℓ(θ_k), the step size is adjusted downward using a line-search procedure until ascent is ensured or a minimum step size is reached.[1][17]

The iteration continues until convergence criteria are satisfied, such as a sufficiently small gradient norm, a negligible change in the parameters between iterations, or a maximum number of iterations (e.g., 100–300). Additionally, the algorithm verifies the positive definiteness of the OPG matrix at each step to confirm a valid ascent direction; if it is not positive definite, the iteration may restart or terminate. These criteria balance computational efficiency with reliable convergence to a local maximum.[1][17]
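The full procedure can be sketched in a few lines of R. This is an illustrative implementation under the assumptions stated in the comments (user-supplied functions returning observation-wise log-likelihood contributions and scores), not the authors' original code.

```r
# Sketch of the BHHH iteration with a backtracking line search.
# loglik_i(theta): length-N vector of per-observation log-likelihood contributions.
# score_i(theta):  N x p matrix of per-observation scores.
bhhh <- function(theta0, loglik_i, score_i,
                 tol = 1e-6, max_iter = 200, min_step = 1e-8) {
  theta <- theta0
  for (k in seq_len(max_iter)) {
    G <- score_i(theta)                       # observation-wise scores
    g <- colSums(G)                           # total gradient
    if (sqrt(sum(g^2)) < tol) break           # small gradient norm: converged
    A <- solve(crossprod(G))                  # OPG approximation [sum g_i g_i']^{-1}
    direction <- drop(A %*% g)                # candidate ascent direction

    # Backtracking line search: halve the step until the log-likelihood increases.
    lambda <- 1
    ll_old <- sum(loglik_i(theta))
    improved <- FALSE
    while (lambda >= min_step) {
      theta_new <- theta + lambda * direction
      ll_new <- sum(loglik_i(theta_new))
      if (is.finite(ll_new) && ll_new > ll_old) { improved <- TRUE; break }
      lambda <- lambda / 2
    }
    if (!improved) break                      # no ascent step found; stop
    if (sqrt(sum((theta_new - theta)^2)) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}
```

For instance, with the hypothetical Poisson model used earlier, loglik_i would return dpois(y, exp(x * theta), log = TRUE) and score_i would return cbind((y - exp(x * theta)) * x).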
Properties
Advantages
The Berndt–Hall–Hall–Hausman (BHHH) algorithm provides notable computational efficiency for maximum likelihood estimation by relying solely on first-order derivatives to approximate the Hessian via the outer product of gradients (OPG), avoiding the more demanding second- and third-order derivative evaluations required in Newton-type methods.[1][18] This approach computes the OPG matrix in O(N p²) time, where N denotes the sample size and p the number of parameters, rendering it particularly practical for problems with moderate dimensions.[19]

Under standard regularity conditions (such as the log-likelihood being twice continuously differentiable, the existence of a unique maximum, and the information matrix being positive definite), the BHHH algorithm guarantees global convergence to a stationary point from any feasible starting value where the log-likelihood is defined, owing to its structure as a gradient method with an appropriate line search for the step size λ_k.[1] The positive definiteness of the OPG approximation further ensures that the search direction remains an ascent direction when λ_k is selected via backtracking or similar techniques, promoting reliable progress toward the optimum.[1][19][20]

The algorithm's suitability shines in large-sample settings, where the OPG consistently estimates the information matrix (the negative expected Hessian) due to the asymptotic equality of the observed and expected information matrices, yielding robust and efficient parameter estimates without excessive computational overhead.[1][18]
Limitations
The Berndt–Hall–Hall–Hausman (BHHH) algorithm's approximation of the Hessian matrix using the outer product of gradients (OPG) can be unreliable when the information matrix equality does not hold closely, such as in small samples or under model misspecification, leading to slower convergence and requiring more iterations to approach the maximum.[20] This approximation performs poorly far from the likelihood maximum, where the OPG matrix deviates significantly from the true negative Hessian, often resulting in small parameter updates that increase the log-likelihood only marginally and prolong the optimization process.[18]

The algorithm is sensitive to the choice of step size in its updates, typically requiring a damping factor or line search to ensure monotonic improvement in the objective; without such adjustments, iterations may oscillate or diverge, particularly on nonquadratic likelihood surfaces.[18] The BHHH procedure relies on first-order derivatives (scores) and assumes the structure of a log-likelihood function, making it strictly suited for maximum likelihood estimation and less effective for objectives involving penalties, robust losses, or non-likelihood criteria that violate the information equality.[18]

In high-dimensional problems with many parameters p, the need to form and invert the OPG matrix at each iteration incurs a computational cost of O(p³) for the inversion alone, which becomes prohibitive when p is large, despite the overall efficiency relative to exact Hessian computation.[20] The algorithm's dependence on accurate gradient evaluations further exacerbates issues in scenarios where numerical differentiation is unstable or costly.[18]
Comparisons
Newton–Raphson Method
The Newton–Raphson method is a second-order optimization algorithm employed to locate the maximum of a twice-differentiable objective function, such as the log-likelihood ℓ(θ) in maximum likelihood estimation (MLE). It generates a sequence of parameter estimates through iterative updates given by

\theta_{k+1} = \theta_k - H(\theta_k)^{-1} g(\theta_k),

where g(θ_k) denotes the gradient vector (first derivatives) and H(θ_k) represents the Hessian matrix (second derivatives) of the objective function evaluated at the current iterate θ_k.[21][19]

A key strength of the method lies in its rapid convergence properties: under suitable conditions, such as the negative Hessian being positive definite and the Hessian being Lipschitz continuous near the optimum, it achieves quadratic convergence, whereby the error decreases quadratically from one iteration to the next, often requiring fewer iterations than first-order alternatives for smooth, well-behaved problems.[22][19]

Despite these advantages, the method's drawbacks stem from the intensive computation of the exact Hessian, which involves evaluating second partial derivatives across observations at each iteration, yielding a cost on the order of N p² operations plus the expense of inverting a p × p matrix for p parameters.[19][23] Additionally, the negative Hessian may fail to be positive definite, particularly far from the solution, which can cause divergence or instability and requires remedial steps such as line searches or regularization.[19]

By comparison, the Berndt–Hall–Hall–Hausman (BHHH) algorithm addresses these issues by approximating the negative Hessian via the outer product of gradients, maintaining the O(N p²) scaling but eliminating the need for second derivatives, thereby reducing the analytical burden and guaranteeing positive semi-definiteness, while accepting somewhat slower convergence in exchange for practicality in large-scale MLE settings.[19]
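The relationship between the two updates can be summarized side by side (notation as above; this juxtaposition is added for clarity and is not quoted from the cited sources):

```latex
% Newton–Raphson: uses the exact Hessian of the log-likelihood
\theta_{k+1} = \theta_k - \left[\sum_{i=1}^{N}
  \frac{\partial^2 \ln f_i(\theta_k)}{\partial\theta\,\partial\theta'}\right]^{-1}
  \sum_{i=1}^{N} \frac{\partial \ln f_i(\theta_k)}{\partial\theta}
% BHHH: replaces the negative Hessian by the outer product of the scores
\theta_{k+1} = \theta_k + \lambda_k \left[\sum_{i=1}^{N}
  \frac{\partial \ln f_i(\theta_k)}{\partial\theta}
  \frac{\partial \ln f_i(\theta_k)}{\partial\theta'}\right]^{-1}
  \sum_{i=1}^{N} \frac{\partial \ln f_i(\theta_k)}{\partial\theta}
```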
Gauss–Newton Method
The Gauss–Newton method is an iterative algorithm designed for solving nonlinear least squares problems, where the objective is to minimize the sum of squared residuals S(β) = Σ_{i=1}^{N} r_i(β)², with r(β) denoting the vector of residuals and β the parameter vector.[24] In this approach, the Hessian matrix of the objective function is approximated by J(β_k)'J(β_k), where J(β_k) is the Jacobian matrix of the residuals evaluated at the current iterate β_k. This approximation neglects the second-order derivatives of the residuals, simplifying computation while leveraging the structure of least squares objectives. The parameter update is then given by

\beta_{k+1} = \beta_k - \left[ J(\beta_k)' J(\beta_k) \right]^{-1} J(\beta_k)' r(\beta_k),

which corresponds to a single step of linear least squares on the linearized residual model r(β) ≈ r(β_k) + J(β_k)(β − β_k).[24] This formulation derives from applying Newton's method to the least squares objective and approximating the full Hessian, making it particularly efficient for problems where residuals represent model errors, such as in regression or curve fitting.[25]

In the context of maximum likelihood estimation (MLE), the Gauss–Newton method applies directly when the log-likelihood function takes a least squares form, such as under Gaussian errors, where maximizing the likelihood is equivalent to minimizing the sum of squared residuals, transforming MLE into nonlinear least squares.[26] For more general likelihoods, including those from exponential family distributions, the method can be extended by reparameterizing the score contributions, leading to an outer-product-of-gradients (OPG) approximation of the information matrix that mirrors the structure of the Berndt–Hall–Hall–Hausman (BHHH) algorithm. Specifically, under the information matrix equality, the expected outer product of the score vector equals the negative expected Hessian, allowing Gauss–Newton-style iterations to align with OPG-based updates for MLE in these cases.[27]

A key strength of the Gauss–Newton method is its avoidance of explicit second derivatives, relying solely on first-order information via the Jacobian, which reduces computational cost compared to full Newton methods.[24] Near the solution, where the linearization is accurate and the residuals are small, it converges rapidly, approaching quadratic convergence in the small-residual case and often outperforming gradient-based methods in speed for well-behaved least squares problems.[25] However, far from the minimum, the approximation can fail if the neglected second-order terms are significant, leading to poor step directions or divergence, much like issues observed in the BHHH algorithm for MLE.[20] This sensitivity to initial conditions and nonlinearity underscores the need for safeguards, such as damping or line searches, in practical implementations.[24]

The BHHH algorithm serves as the direct MLE analog of Gauss–Newton, approximating the information matrix with the outer product of individual score vectors rather than a Jacobian of residuals. In nonlinear least squares problems interpretable as MLE, the two approximations agree up to scale in expectation, but BHHH generalizes to arbitrary likelihoods by exploiting the score's zero mean and variance properties.[1] This distinction highlights Gauss–Newton's specialization to squared-error objectives, while BHHH extends the approximation logic to broader statistical estimation contexts.[27]
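A sketch of this connection under Gaussian errors with known variance σ², evaluated at the true parameter (added for clarity; the scaling argument is standard but not spelled out in the cited sources): the per-observation score is proportional to the residual times its gradient, so the OPG used by BHHH estimates, up to the factor 1/σ², the same matrix J'J that Gauss–Newton builds from the Jacobian.

```latex
% Gaussian log-likelihood contribution for residual r_i(\beta) with known \sigma^2:
\ln f_i(\beta) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{r_i(\beta)^2}{2\sigma^2},
\qquad
s_i(\beta) = \frac{\partial \ln f_i}{\partial\beta}
           = -\frac{r_i(\beta)}{\sigma^2}\,\frac{\partial r_i(\beta)}{\partial\beta}
% BHHH outer product versus the Gauss–Newton matrix J'J:
\sum_{i=1}^{N} s_i s_i'
  = \frac{1}{\sigma^4}\sum_{i=1}^{N} r_i^2\,
    \frac{\partial r_i}{\partial\beta}\frac{\partial r_i}{\partial\beta'},
\qquad
E\!\left[\sum_i s_i s_i'\right] = \frac{1}{\sigma^2}\, J'J
\quad\text{since } E[r_i^2] = \sigma^2
\text{ and } \partial r_i/\partial\beta \text{ is fixed given the covariates.}
```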
Applications
Econometric Models
The Berndt–Hall–Hall–Hausman (BHHH) algorithm plays a key role in estimating parameters for simultaneous equation models through full information maximum likelihood (FIML), especially in systems addressing endogeneity, such as supply and demand frameworks. By leveraging the outer product of gradients to approximate the information matrix, the algorithm efficiently optimizes the nonlinear likelihood function, providing consistent and asymptotically efficient estimates while managing the computational challenges of correlated equations. This approach is particularly valuable in structural econometric models where ordinary least squares fails due to simultaneity bias, as demonstrated in early applications to nonlinear systems.[1][28]

In models with limited dependent variables, the BHHH algorithm supports maximum likelihood estimation for binary outcomes in probit and logit specifications, where the cumulative distribution function introduces nonlinearity. It extends effectively to multinomial probit models, optimizing complex likelihoods that account for choice correlations and alternative-specific effects, such as in discrete choice analysis of consumer behavior or labor market decisions. The algorithm's reliance on first derivatives alone makes it suitable for these settings, avoiding the need for second derivatives that can be cumbersome in high-dimensional choice sets.[29][30]

For time series models, the BHHH algorithm facilitates parameter estimation in ARMA processes with nonlinear components and GARCH models capturing conditional heteroskedasticity in financial returns. In GARCH specifications, it maximizes the quasi-likelihood by iteratively updating parameters based on score vectors, accommodating volatility clustering without requiring full Hessian computations, which enhances numerical stability in volatile data environments. This application has been instrumental in forecasting asset returns and risk management, where precise volatility estimates are critical.[31][32]

A notable example of the BHHH algorithm's application appears in the estimation of production function parameters within nonlinear structural models using panel data, as illustrated in the seminal 1974 work. Here, the algorithm optimizes the likelihood for systems incorporating firm-level heterogeneity and structural relationships, such as input-output dynamics, yielding reliable inferences on productivity and efficiency despite the models' complexity. This structural approach highlights the algorithm's utility in bridging theoretical econometrics with empirical panel analysis.[1]
Software Implementations
The Berndt–Hall–Hall–Hausman (BHHH) algorithm is implemented in several major statistical software packages, particularly those focused on econometrics and maximum likelihood estimation (MLE). In Stata, the algorithm is available through the ml command via the technique(bhhh) option, which allows users to specify BHHH as the optimization method for likelihood maximization.[33] This implementation supports hybrid approaches, such as performing an initial set of iterations with BHHH before switching to the Newton–Raphson method, for example using technique(bhhh 10 nr 1000) to run 10 BHHH iterations followed by 1,000 Newton–Raphson steps. Stata also provides controls for step size adjustments and maximum iterations to enhance convergence reliability in complex models.[12]
In R, the BHHH algorithm is supported in the maxLik package, which provides the maxBHHH function as a dedicated optimizer for MLE problems, approximating the Hessian via the outer product of gradients.[34] This function integrates with R's broader optimization ecosystem and allows customization of iteration limits and step sizes, often combined with other methods like BFGS for improved performance in econometric applications.[35] While not part of the base optim function, maxBHHH serves as a reliable variant for likelihood-based estimation in packages like maxLik.[36]
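A brief usage sketch with maxLik, following the pattern described in Henningsen and Toomet's documentation (the data and starting values here are illustrative): for BHHH the log-likelihood function should return a vector of observation-wise contributions so that the gradients can be formed observation by observation.

```r
# Illustrative: ML estimation of a normal mean and standard deviation via BHHH.
library(maxLik)

set.seed(3)
x <- rnorm(100, mean = 2, sd = 1.5)

# Return per-observation log-likelihood contributions (required for BHHH).
loglik <- function(param) {
  mu <- param[1]; sigma <- param[2]
  if (sigma <= 0) return(rep(NA_real_, length(x)))  # signal infeasible values
  dnorm(x, mean = mu, sd = sigma, log = TRUE)
}

fit <- maxBHHH(loglik, start = c(mu = 1, sigma = 1))
summary(fit)
```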
For Python users, the BHHH algorithm lacks built-in support in core libraries like statsmodels or scipy.optimize, which primarily offer quasi-Newton methods such as BFGS. However, open-source implementations are available, including a pure-Python version on GitHub that handles unconstrained optimization and is derived from a MATLAB routine by Fedor Iskhakov.[37] This repository provides a lightweight, customizable code base suitable for integration into custom MLE workflows, with options for iteration controls and gradient-based updates. Similar open-source MATLAB implementations exist for academic and research purposes, often shared via repositories for nonlinear optimization tasks.[38]
The BHHH algorithm is a common option in econometric software for MLE due to its robustness in handling misspecified starting values and noisy gradients, though Newton–Raphson remains the default in tools like Stata and gretl.[39][40] These implementations facilitate its use in fields like econometric modeling, where reliable convergence is essential for parameter estimation.
