Logistic regression
from Wikipedia
Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See § Example for worked details.

In statistics, a logistic model (or logit model) is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression[1] (or logit regression) estimates the parameters of a logistic model (the coefficients in the linear or non linear combinations). In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;[2] the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see § Applications), and the logistic model has been the most commonly used model for binary regression since about 1970.[3] Binary variables can be generalized to categorical variables when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to multinomial logistic regression. If the multiple categories are ordered, one can use the ordinal logistic regression (for example the proportional odds ordinal logistic model[4]). See § Extensions for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform statistical classification (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a binary classifier.

Analogous linear models for binary variables with a different sigmoid function instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the probit model; see § Alternatives. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio. More abstractly, the log-odds (logit) is the natural parameter of the Bernoulli distribution, and in this sense the logistic function is the "simplest" way to convert a real number to a probability.

The parameters of a logistic regression are most commonly estimated by maximum-likelihood estimation (MLE). This does not have a closed-form expression, unlike linear least squares; see § Model fitting. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares (OLS) plays for scalar responses: it is a simple, well-analyzed baseline model; see § Comparison with linear regression for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson,[5] beginning in Berkson (1944), where he coined "logit"; see § History.

Applications

General

Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression.[6] Many other medical scales used to assess severity of a patient have been developed using logistic regression.[7][8][9][10] Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).[11][12] Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or for any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc.[13] The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product.[14][15] It is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.[16] In economics, it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing. Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes, among others.[17][18][19] These models help in the development of reliable disaster management plans and safer design for the built environment.

Supervised machine learning

Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks, such as identifying whether an email is spam or not and diagnosing diseases by assessing the presence or absence of specific conditions based on patient test results. This approach utilizes the logistic (or sigmoid) function to transform a linear combination of input features into a probability value ranging between 0 and 1. This probability indicates the likelihood that a given input corresponds to one of two predefined categories. The essential mechanism of logistic regression is grounded in the logistic function's ability to model the probability of binary outcomes accurately. With its distinctive S-shaped curve, the logistic function effectively maps any real-valued number to a value within the 0 to 1 interval. This feature renders it particularly suitable for binary classification tasks, such as sorting emails into "spam" or "not spam". By calculating the probability that the dependent variable will be categorized into a specific group, logistic regression provides a probabilistic framework that supports informed decision-making.[20]

Example

Problem

As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:

A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?

The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem was changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used.

The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

Hours (xk) 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass (yk) 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

We wish to fit a logistic function to the data consisting of the hours studied ($x_k$) and the outcome of the test ($y_k = 1$ for pass, 0 for fail). The data points are indexed by the subscript $k$, which runs from $k = 1$ to $k = K = 20$. The x variable is called the "explanatory variable", and the y variable is called the "categorical variable" consisting of two categories: "pass" or "fail" corresponding to the categorical values 1 and 0 respectively.

Model

Graph of a logistic regression curve fitted to the (xm,ym) data. The curve shows the probability of passing an exam versus hours studying.

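The logistic function is of the form:

$$p(x) = \frac{1}{1 + e^{-(x - \mu)/s}}$$

where μ is a location parameter (the midpoint of the curve, where $p(\mu) = 1/2$) and s is a scale parameter. This expression may be rewritten as:

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

where $\beta_0 = -\mu/s$ is known as the intercept (it is the vertical intercept or y-intercept of the line $y = \beta_0 + \beta_1 x$), and $\beta_1 = 1/s$ (inverse scale parameter or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, $\mu = -\beta_0/\beta_1$ and $s = 1/\beta_1$.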

Note that this model is actually an oversimplification, since it assumes everybody will pass if they learn long enough (limit = 1).

Fit

The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given $x_k$ and $y_k$, write $p_k = p(x_k)$. The $p_k$ are the probabilities that the corresponding $y_k$ will equal one and $1 - p_k$ are the probabilities that they will be zero (see Bernoulli distribution). We wish to find the values of $\beta_0$ and $\beta_1$ which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points ($y_k$), the squared error loss, is taken as a measure of the goodness of fit, and the best fit is obtained when that function is minimized.

The log loss for the k-th point, $\ell_k$, is:

$$\ell_k = \begin{cases} -\ln p_k & \text{if } y_k = 1, \\ -\ln(1 - p_k) & \text{if } y_k = 0. \end{cases}$$

The log loss can be interpreted as the "surprisal" of the actual outcome $y_k$ relative to the prediction $p_k$, and is a measure of information content. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when $p_k = 1$ and $y_k = 1$, or $p_k = 0$ and $y_k = 0$), and approaches infinity as the prediction gets worse (i.e., when $y_k = 1$ and $p_k \to 0$, or $y_k = 0$ and $p_k \to 1$), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any points, since $y_k$ is either 0 or 1, but $0 < p_k < 1$.

These can be combined into a single expression:

$$\ell_k = -y_k \ln p_k - (1 - y_k) \ln(1 - p_k)$$

This expression is more formally known as the cross-entropy of the predicted distribution $(p_k, 1 - p_k)$ from the actual distribution $(y_k, 1 - y_k)$, as probability distributions on the two-element space of (pass, fail).

The sum of these, the total loss, is the overall negative log-likelihood $-\ell$, and the best fit is obtained for those choices of $\beta_0$ and $\beta_1$ for which $-\ell$ is minimized.

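Alternatively, instead of minimizing the loss, one can maximize its negative, the (positive) log-likelihood:

$$\ell = \sum_{k : y_k = 1} \ln(p_k) + \sum_{k : y_k = 0} \ln(1 - p_k)$$

or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:

$$L = \prod_{k : y_k = 1} p_k \prod_{k : y_k = 0} (1 - p_k)$$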

This method is known as maximum likelihood estimation.
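The loss described above can be evaluated directly. The following is a minimal sketch in plain Python, using the 20 data points from the table; the function names are illustrative, and the trial coefficients −4.1 and 1.5 are the rounded values reported later in this example.

```python
import math

# Hours studied and pass (1) / fail (0) outcomes from the table above.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def logistic(t):
    """Map log-odds t to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))


def total_log_loss(beta0, beta1, xs, ys):
    """Sum of the per-point log losses, i.e. the negative log-likelihood."""
    loss = 0.0
    for x, y in zip(xs, ys):
        p = logistic(beta0 + beta1 * x)
        loss += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return loss


# With both coefficients zero every p_k is 1/2, so each point costs ln 2.
null_loss = total_log_loss(0.0, 0.0, hours, passed)
# Rounded fitted coefficients give a clearly smaller loss.
trial_loss = total_log_loss(-4.1, 1.5, hours, passed)
# The likelihood L is the exponential of the log-likelihood.
likelihood = math.exp(-trial_loss)
```

Minimizing `total_log_loss` over the two coefficients is the same problem as maximizing the likelihood.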

Parameter estimation

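Since $\ell$ is nonlinear in $\beta_0$ and $\beta_1$, determining their optimum values will require numerical methods. One method of maximizing $\ell$ is to require the derivatives of $\ell$ with respect to $\beta_0$ and $\beta_1$ to be zero:

$$0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^{K} (y_k - p_k)$$

$$0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^{K} (y_k - p_k)\, x_k$$

and the maximization procedure can be accomplished by solving the above two equations for $\beta_0$ and $\beta_1$, which, again, will generally require the use of numerical methods.

The values of $\beta_0$ and $\beta_1$ which maximize $\ell$ and $L$ using the above data are found to be:

$$\beta_0 \approx -4.1, \qquad \beta_1 \approx 1.5$$

which yields a value for μ and s of:

$$\mu = -\beta_0/\beta_1 \approx 2.7, \qquad s = 1/\beta_1 \approx 0.67$$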
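One possible numerical method is Newton–Raphson iteration on the two derivative equations. The sketch below is an illustration under that assumption, not necessarily the procedure used to obtain the values quoted in this section; the function name and iteration count are arbitrary choices.

```python
import math

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def fit_logistic_1d(xs, ys, iterations=25):
    """Newton-Raphson for an intercept plus one explanatory variable."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # entries of the negative Hessian
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 linear system H * delta = g.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


beta0, beta1 = fit_logistic_1d(hours, passed)
mu = -beta0 / beta1   # midpoint of the fitted curve
s = 1.0 / beta1       # scale parameter
print(f"beta0={beta0:.2f} beta1={beta1:.2f} mu={mu:.2f} s={s:.2f}")
```

The gradient terms `g0` and `g1` are exactly the two sums set to zero above, so the iteration stops moving once both equations are satisfied.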

Predictions

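The $\beta_0$ and $\beta_1$ coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.

For example, for a student who studies 2 hours, entering the value $x = 2$ into the equation gives the estimated probability of passing the exam of 0.25:

$$t = \beta_0 + 2\beta_1 \approx -4.1 + 2 \cdot 1.5 = -1.1$$

$$p = \frac{1}{1 + e^{-t}} \approx 0.25$$

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:

$$t = \beta_0 + 4\beta_1 \approx -4.1 + 4 \cdot 1.5 = 1.9$$

$$p = \frac{1}{1 + e^{-t}} \approx 0.87$$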

This table shows the estimated probability of passing the exam for several values of hours studying.

| Hours of study (x) | Log-odds (t) | Odds (e^t) | Probability of passing (p) |
|---|---|---|---|
| 1 | −2.57 | 0.076 ≈ 1:13.1 | 0.07 |
| 2 | −1.07 | 0.34 ≈ 1:2.91 | 0.26 |
| μ ≈ 2.7 | 0 | 1 | 1/2 = 0.50 |
| 3 | 0.44 | 1.55 | 0.61 |
| 4 | 1.94 | 6.96 | 0.87 |
| 5 | 3.45 | 31.4 | 0.97 |
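The predictions can be reproduced with a few lines of Python. This sketch uses the rounded coefficients −4.1 and 1.5 as defaults; the table appears to be based on unrounded coefficients, so some entries may differ slightly from what the code prints. The function name is illustrative.

```python
import math


def pass_probability(hours, beta0=-4.1, beta1=1.5):
    """Estimated probability of passing, from the rounded coefficients."""
    t = beta0 + beta1 * hours           # log-odds
    return 1.0 / (1.0 + math.exp(-t))   # logistic function


rows = []
for x in (1, 2, 3, 4, 5):
    t = -4.1 + 1.5 * x
    rows.append((x, t, math.exp(t), pass_probability(x)))

for x, t, odds, p in rows:
    print(f"{x} h: log-odds {t:+.2f}, odds {odds:.3g}, probability {p:.2f}")
```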

Model evaluation

The logistic regression analysis gives the following output.

| | Coefficient | Std. error | z-value | p-value (Wald) |
|---|---|---|---|---|
| Intercept (β0) | −4.1 | 1.8 | −2.3 | 0.021 |
| Hours (β1) | 1.5 | 0.6 | 2.4 | 0.017 |

By the Wald test, the output indicates that hours studying is significantly associated with the probability of passing the exam ($p = 0.017$). Rather than the Wald method, the recommended method[21] to calculate the p-value for logistic regression is the likelihood-ratio test (LRT), which for these data gives $p \approx 0.00064$ (see § Deviance and likelihood ratio tests below).
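Both tests can be sketched with the Python standard library. The code below assumes a Newton–Raphson fit and takes standard errors from the inverse of the observed information matrix. Both p-values are computed with `math.erfc`, using the normal approximation for the Wald test and a chi-squared distribution with one degree of freedom for the LRT. Variable names are illustrative.

```python
import math

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def log_likelihood(b0, b1):
    total = 0.0
    for x, y in zip(hours, passed):
        p = logistic(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Newton-Raphson fit; h00, h01, h11 form the information matrix.
b0 = b1 = 0.0
for _ in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(hours, passed):
        p = logistic(b0 + b1 * x)
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Wald test: standard errors are square roots of the diagonal of the
# inverse information matrix (evaluated at the converged estimate).
se0 = math.sqrt(h11 / det)
se1 = math.sqrt(h00 / det)
z1 = b1 / se1
p_wald = math.erfc(abs(z1) / math.sqrt(2.0))   # two-sided normal p-value

# Likelihood-ratio test against the intercept-only model.
p_bar = sum(passed) / len(passed)
ll_null = log_likelihood(math.log(p_bar / (1.0 - p_bar)), 0.0)
deviance_drop = 2.0 * (log_likelihood(b0, b1) - ll_null)
p_lrt = math.erfc(math.sqrt(deviance_drop / 2.0))  # chi-squared, 1 d.o.f.

print(f"se(beta1)={se1:.2f} z={z1:.2f} Wald p={p_wald:.3f} LRT p={p_lrt:.5f}")
```

The identity used for the LRT holds because a chi-squared variable with one degree of freedom is the square of a standard normal variable.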

Generalizations

This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.

Background

Figure 1. The standard logistic function $\sigma(t)$; $\sigma(t) \in (0, 1)$ for all $t$.

Definition of the logistic function

An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input $t$, and outputs a value between zero and one.[2] For the logit, this is interpreted as taking input log-odds and having output probability. The standard logistic function $\sigma : \mathbb{R} \to (0, 1)$ is defined as follows:

$$\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}$$

A graph of the logistic function on the t-interval (−6,6) is shown in Figure 1.

Let us assume that $t$ is a linear function of a single explanatory variable $x$ (the case where $t$ is a linear combination of multiple explanatory variables is treated similarly). We can then express $t$ as follows:

$$t = \beta_0 + \beta_1 x$$

And the general logistic function $p : \mathbb{R} \to (0, 1)$ can now be written as:

$$p(x) = \sigma(t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

In the logistic model, $p(x)$ is interpreted as the probability of the dependent variable $Y$ equaling a success/case rather than a failure/non-case. It is clear that the response variables $Y_i$ are not identically distributed: $P(Y_i = 1 \mid X)$ differs from one data point $X_i$ to another, though they are independent given design matrix $X$ and shared parameters $\beta$.[11]

Definition of the inverse of the logistic function

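We can now define the logit (log odds) function $g$ as the inverse $g = \sigma^{-1}$ of the standard logistic function. It is easy to see that it satisfies:

$$g(p(x)) = \sigma^{-1}(p(x)) = \operatorname{logit} p(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x$$

and equivalently, after exponentiating both sides we have the odds:

$$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}$$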

Interpretation of these terms

In the above equations, the terms are as follows:

  • $g$ is the logit function. The equation for $g(p(x))$ illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression.
  • $\ln$ denotes the natural logarithm.
  • $p(x)$ is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for $p(x)$ illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability $p(x)$ ranges between 0 and 1.
  • $\beta_0$ is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
  • $\beta_1 x$ is the regression coefficient multiplied by some value of the predictor.
  • base $e$ denotes the exponential function.
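The relationship between the logistic function, the logit and the odds can be checked numerically. This is a minimal sketch; the coefficient values are borrowed from the worked example purely for illustration.

```python
import math


def sigma(t):
    """Standard logistic function: log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Inverse of sigma: probability -> log-odds."""
    return math.log(p / (1.0 - p))


beta0, beta1 = -4.1, 1.5   # illustrative coefficients
x = 3.0
t = beta0 + beta1 * x      # linear regression expression
p = sigma(t)               # probability of a case
odds = p / (1.0 - p)       # equals exp(t) up to rounding
```

Applying `logit` to `p` recovers `t`, and `odds` agrees with `math.exp(t)`, which is the content of the two displayed equations.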

Definition of the odds

The odds of the dependent variable equaling a case (given some linear combination of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression and the logit is easily converted back into the odds.[2]

So we define odds of the dependent variable equaling a case (given some linear combination $x$ of the predictors) as follows:

$$\text{odds} = e^{\beta_0 + \beta_1 x}$$

The odds ratio

For a continuous independent variable the odds ratio can be defined as:

$$\mathrm{OR} = \frac{\operatorname{odds}(x + 1)}{\operatorname{odds}(x)} = \frac{\left(\frac{p(x+1)}{1 - p(x+1)}\right)}{\left(\frac{p(x)}{1 - p(x)}\right)} = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$

[Figure: a template for writing out an odds ratio, illustrated with the test score example from the "Example" section.] In simple terms, if we hypothetically get an odds ratio of 2 to 1, we can say: "For every one-unit increase in hours studied, the odds of passing (group 1) or failing (group 0) are (expectedly) 2 to 1" (Denis, 2019).

This exponential relationship provides an interpretation for $\beta_1$: the odds multiply by $e^{\beta_1}$ for every 1-unit increase in x.[22]

For a binary independent variable the odds ratio is defined as $\frac{ad}{bc}$ where a, b, c and d are cells in a 2×2 contingency table.[23]
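Both definitions are easy to verify numerically. In this sketch the coefficients are borrowed from the worked example and the 2×2 cell counts are hypothetical, chosen only for illustration.

```python
import math


def odds(x, beta0, beta1):
    """Odds of a case at predictor value x."""
    return math.exp(beta0 + beta1 * x)


beta0, beta1 = -4.1, 1.5   # illustrative coefficients

# The ratio odds(x + 1) / odds(x) does not depend on x: it is exp(beta1).
ratios = [odds(x + 1, beta0, beta1) / odds(x, beta0, beta1)
          for x in (0.0, 2.0, 5.0)]


def odds_ratio_2x2(a, b, c, d):
    """Odds ratio ad/bc from the cells of a 2x2 contingency table."""
    return (a * d) / (b * c)


# Hypothetical counts: a=20, b=10 in the first row, c=5, d=15 in the second.
example_or = odds_ratio_2x2(20, 10, 5, 15)
```

Every entry of `ratios` equals $e^{1.5} \approx 4.48$ up to rounding, so each extra unit of x multiplies the odds by the same factor.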

Multiple explanatory variables

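If there are multiple explanatory variables, the above expression $\beta_0 + \beta_1 x$ can be revised to $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m = \beta_0 + \sum_{i=1}^{m} \beta_i x_i$. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters $\beta_i$ for all $i = 0, 1, 2, \dots, m$ are all estimated.

Again, the more traditional equations are:

$$\log_b \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$$

and

$$p = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m)}}$$

where usually $b = e$.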

Definition

A dataset contains N points. Each point i consists of a set of m input variables x1,i ... xm,i (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable Yi (also known as a dependent variable, response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable.

As in linear regression, the outcome variables Yi are assumed to depend on the explanatory variables x1,i ... xm,i.

Explanatory variables

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables and discrete variables.

(Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".)

Outcome variables

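Formally, the outcomes Yi are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability pi that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

$$\begin{aligned}
Y_i \mid x_{1,i}, \ldots, x_{m,i} &\sim \operatorname{Bernoulli}(p_i) \\
\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] &= p_i \\
\Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) &= \begin{cases} p_i & \text{if } y = 1 \\ 1 - p_i & \text{if } y = 0 \end{cases} \\
\Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) &= p_i^{y} (1 - p_i)^{(1 - y)}
\end{aligned}$$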

The meanings of these four lines are:

  1. The first line expresses the probability distribution of each Yi: conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter pi, the probability of the outcome of 1 for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success pi is not observed, only the outcome of an individual Bernoulli trial using that probability.
  2. The second line expresses the fact that the expected value of each Yi is equal to the probability of success pi, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success pi, then take the average of all the 1 and 0 outcomes, then the result would be close to pi. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
  3. The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
  4. The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that Yi can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either pi or 1 − pi, as in the previous line.
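The equivalence of the case-by-case and compact forms of the probability mass function can be confirmed with a short sketch; the function names are illustrative.

```python
def bernoulli_pmf_cases(y, p):
    """Probability mass function written as two separate cases."""
    return p if y == 1 else 1.0 - p


def bernoulli_pmf_compact(y, p):
    """Same function in the single-expression form p^y (1 - p)^(1 - y)."""
    return p ** y * (1.0 - p) ** (1 - y)


def bernoulli_mean(p):
    """Expected value: 0 * Pr(Y = 0) + 1 * Pr(Y = 1), which equals p."""
    return 0 * bernoulli_pmf_cases(0, p) + 1 * bernoulli_pmf_cases(1, p)


checks = [(y, p, bernoulli_pmf_cases(y, p), bernoulli_pmf_compact(y, p))
          for y in (0, 1) for p in (0.1, 0.5, 0.9)]
```

In each tuple of `checks` the last two entries agree, because one of the two exponents is always 0 and the other is always 1.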
Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability pi using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function $f(i)$ for a particular data point i is written as:

$$f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$$

where $\beta_0, \ldots, \beta_m$ are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:

  • The regression coefficients β0, β1, ..., βm are grouped into a single vector β of size m + 1.
  • For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed value of 1, corresponding to the intercept coefficient β0.
  • The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single vector Xi of size m + 1.

This makes it possible to write the linear predictor function as follows:

$$f(i) = \boldsymbol{\beta} \cdot \mathbf{X}_i$$

using the notation for a dot product between two vectors.
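The compact form can be sketched as follows: a constant 1 is prepended to the explanatory variables so that the intercept is handled by the same dot product. The coefficient and data values are hypothetical.

```python
def linear_predictor(beta, x):
    """Return beta . X_i, where X_i is x with a leading pseudo-variable 1."""
    X = [1.0] + list(x)
    if len(beta) != len(X):
        raise ValueError("beta must have one more entry than x")
    return sum(b * v for b, v in zip(beta, X))


beta = [-1.0, 0.5, 2.0]   # hypothetical beta_0, beta_1, beta_2
x_i = [4.0, 0.25]         # hypothetical x_{1,i}, x_{2,i}
f_i = linear_predictor(beta, x_i)   # -1.0 + 0.5 * 4.0 + 2.0 * 0.25
```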

This is an example of an SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).

Many explanatory variables, two categories

The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables x1, x2,... and any number of categorical values $y = 0, 1, 2, \dots$.

To begin with, we may consider a logistic model with M explanatory variables, x1, x2 ... xM and, as in the example above, two categorical values (y = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that $y = 1$. This linear relationship may be extended to the case of M explanatory variables:

$$t = \log_b \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M$$

where t is the log-odds and $\beta_i$ are parameters of the model. An additional generalization has been introduced in which the base of the model (b) is not restricted to Euler's number e. In most applications, the base $b$ of the logarithm is usually taken to be e. However, in some cases it can be easier to communicate results by working in base 2 or base 10.

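For a more compact notation, we will specify the explanatory variables and the β coefficients as $(M + 1)$-dimensional vectors:

$$\boldsymbol{x} = \{x_0, x_1, x_2, \dots, x_M\}$$

$$\boldsymbol{\beta} = \{\beta_0, \beta_1, \beta_2, \dots, \beta_M\}$$

with an added explanatory variable x0 = 1. The logit may now be written as:

$$t = \sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot \boldsymbol{x}$$

Solving for the probability p that $y = 1$ yields:

$$p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1 + b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{1}{1 + b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}} = S_b(t),$$

where $S_b$ is the sigmoid function with base $b$. The above formula shows that once the $\beta_m$ are fixed, we can easily compute either the log-odds that $y = 1$ for a given observation, or the probability that $y = 1$ for a given observation. The main use-case of a logistic model is to be given an observation $\boldsymbol{x}$, and estimate the probability $p(\boldsymbol{x})$ that $y = 1$. The optimum beta coefficients may again be found by maximizing the log-likelihood. For K measurements, defining $\boldsymbol{x}_k$ as the explanatory vector of the k-th measurement, and $y_k$ as the categorical outcome of that measurement, the log likelihood may be written in a form very similar to the simple case above:

$$\ell = \sum_{k=1}^{K} y_k \log_b(p(\boldsymbol{x}_k)) + \sum_{k=1}^{K} (1 - y_k) \log_b(1 - p(\boldsymbol{x}_k))$$

As in the simple example above, finding the optimum β parameters will require numerical methods. One useful technique is to equate the derivatives of the log likelihood with respect to each of the β parameters to zero yielding a set of equations which will hold at the maximum of the log likelihood:

$$\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^{K} y_k x_{mk} - \sum_{k=1}^{K} p(\boldsymbol{x}_k) x_{mk}$$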

where xmk is the value of the xm explanatory variable from the k-th measurement.

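Consider an example with $M = 2$ explanatory variables, $b = 10$, and coefficients $\beta_0 = -3$, $\beta_1 = 1$, and $\beta_2 = 2$ which have been determined by the above method. To be concrete, the model is:

$$t = \log_{10} \frac{p}{1 - p} = -3 + x_1 + 2 x_2$$

$$p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1 + b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},$$

where p is the probability of the event that $y = 1$. This can be interpreted as follows:

  • $\beta_0 = -3$ is the y-intercept. It is the log-odds of the event that $y = 1$, when the predictors $x_1 = x_2 = 0$. By exponentiating, we can see that when $x_1 = x_2 = 0$ the odds of the event that $y = 1$ are 1-to-1000, or $10^{-3}$. Similarly, the probability of the event that $y = 1$ when $x_1 = x_2 = 0$ can be computed as $1/(1000 + 1) = 1/1001$.
  • $\beta_1 = 1$ means that increasing $x_1$ by 1 increases the log-odds by $1$. So if $x_1$ increases by 1, the odds that $y = 1$ increase by a factor of $10^1$. The probability of $y = 1$ has also increased, but it has not increased by as much as the odds have increased.
  • $\beta_2 = 2$ means that increasing $x_2$ by 1 increases the log-odds by $2$. So if $x_2$ increases by 1, the odds that $y = 1$ increase by a factor of $10^2$. Note how the effect of $x_2$ on the log-odds is twice as great as the effect of $x_1$, but the effect on the odds is 10 times greater. But the effect on the probability of $y = 1$ is not as much as 10 times greater; it is only the effect on the odds that is 10 times greater.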
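The interpretation above can be checked with a small sketch of a base-10 model with intercept −3 and slopes 1 and 2, which are the values implied by the "1-to-1000" and "10 times greater" statements. Function names are illustrative.

```python
def probability(x1, x2, beta=(-3.0, 1.0, 2.0), base=10.0):
    """Probability that y = 1 under a two-variable logistic model."""
    t = beta[0] + beta[1] * x1 + beta[2] * x2   # log-odds in the given base
    return 1.0 / (1.0 + base ** (-t))


def odds(x1, x2):
    """Odds of y = 1, i.e. p / (1 - p)."""
    p = probability(x1, x2)
    return p / (1.0 - p)


p_origin = probability(0.0, 0.0)             # 1/1001
factor_x1 = odds(1.0, 0.0) / odds(0.0, 0.0)  # about 10
factor_x2 = odds(0.0, 1.0) / odds(0.0, 0.0)  # about 100
# Probabilities grow by less than the odds do.
prob_ratio_x1 = probability(1.0, 0.0) / p_origin
```

Here `prob_ratio_x1` comes out slightly below 10, illustrating the remark that the probability increases by less than the odds.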

Multinomial logistic regression: Many explanatory variables and many categories

In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: The probability that the outcome was in category 1 was given by $p(\boldsymbol{x})$ and the probability that the outcome was in category 0 was given by $1 - p(\boldsymbol{x})$. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.

In general, if we have $M + 1$ explanatory variables (including x0) and $N + 1$ categories, we will need $N + 1$ separate probabilities, one for each category, indexed by n, which describe the probability that the categorical outcome y will be in category y=n, conditional on the vector of covariates x. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base e, these probabilities are:

$$p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n \cdot \boldsymbol{x}}}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}} \quad \text{for } n = 1, 2, \dots, N$$

$$p_0(\boldsymbol{x}) = 1 - \sum_{n=1}^{N} p_n(\boldsymbol{x}) = \frac{1}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}}$$

Each of the probabilities except $p_0(\boldsymbol{x})$ will have their own set of regression coefficients $\boldsymbol{\beta}_n$. It can be seen that, as required, the sum of the $p_n(\boldsymbol{x})$ over all categories n is 1. The selection of $p_0(\boldsymbol{x})$ to be defined in terms of the other probabilities is artificial. Any of the probabilities could have been selected to be so defined. This special value of n is termed the "pivot index", and the log-odds (tn) are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:

$$t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}$$

Note also that for the simple case of $N = 1$, the two-category case is recovered, with $p(\boldsymbol{x}) = p_1(\boldsymbol{x})$ and $p_0(\boldsymbol{x}) = 1 - p_1(\boldsymbol{x})$.

The log-likelihood that a particular set of K measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by k, let the k-th set of measured explanatory variables be denoted by $\boldsymbol{x}_k$ and their categorical outcomes be denoted by $y_k$ which can be equal to any integer in [0,N]. The log-likelihood is then:

$$\ell = \sum_{k=1}^{K} \sum_{n=0}^{N} \Delta(n, y_k) \ln(p_n(\boldsymbol{x}_k))$$

where $\Delta(n, y_k)$ is an indicator function which equals 1 if yk = n and zero otherwise. In the case of two categories, this indicator function was defined as yk when n = 1 and 1-yk when n = 0. This was convenient, but not necessary.[24] Again, the optimum beta coefficients may be found by maximizing the log-likelihood function generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients:

$$\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^{K} \left( \Delta(n, y_k)\, x_{mk} - p_n(\boldsymbol{x}_k)\, x_{mk} \right)$$

where $\beta_{nm}$ is the m-th coefficient of the $\boldsymbol{\beta}_n$ vector and $x_{mk}$ is the m-th explanatory variable of the k-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories.
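The pivot formulation of the category probabilities can be sketched as below. The coefficient vectors are hypothetical and category 0 is taken as the pivot, as in the text; the function name is illustrative.

```python
import math


def multinomial_probabilities(betas, x):
    """Return [p_0, p_1, ..., p_N] with category 0 as the pivot.

    betas: N coefficient vectors, one per non-pivot category.
    x: explanatory vector including the constant x0 = 1.
    """
    scores = [math.exp(sum(b * v for b, v in zip(beta_n, x)))
              for beta_n in betas]
    denom = 1.0 + sum(scores)
    return [1.0 / denom] + [s / denom for s in scores]


# Hypothetical model: three categories (N = 2), one explanatory variable.
betas = [[0.5, -1.0], [-0.2, 0.3]]
x = [1.0, 2.0]
probs = multinomial_probabilities(betas, x)

# With N = 1 the binary logistic model is recovered.
binary = multinomial_probabilities([[-4.1, 1.5]], [1.0, 2.0])
```

The log of `probs[n] / probs[0]` equals the dot product of the n-th coefficient vector with `x`, which is the pivot log-odds relation given above.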

Interpretations

There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.

As a generalized linear model

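The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

$$\operatorname{logit}(\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$$

Written using the more compact notation described above, this is:

$$\operatorname{logit}(\operatorname{E}[Y_i \mid \mathbf{X}_i]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol{\beta} \cdot \mathbf{X}_i$$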

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over $(-\infty, +\infty)$, thereby matching the potential range of the linear prediction function on the right side of the equation.

Both the probabilities pi and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.[25]
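The effect of a squared (Gaussian-prior) regularizer can be illustrated on the exam data from the Example section. This sketch maximizes the penalized log-likelihood with Newton's method. The penalty weight `lam = 1.0` is an arbitrary choice, and penalizing the intercept along with the slope is a simplification made here for brevity.

```python
import math

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def fit(lam, iterations=50):
    """Maximize log-likelihood - (lam / 2) * (b0^2 + b1^2) by Newton steps."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Gradient and negative Hessian, starting from the penalty terms.
        g0, g1 = -lam * b0, -lam * b1
        h00, h01, h11 = lam, 0.0, lam
        for x, y in zip(hours, passed):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


mle = fit(0.0)         # plain maximum likelihood
map_estimate = fit(1.0)  # MAP estimate under a zero-mean Gaussian prior
print("MLE:", mle, "MAP:", map_estimate)
```

The regularized coefficients have a smaller overall magnitude than the maximum-likelihood ones, which is the shrinkage toward zero that the prior is meant to produce.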

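The interpretation of the βj parameter estimates is as the additive effect on the log of the odds for a unit change in the j-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, $e^{\beta}$ is the estimate of the odds of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

$$\operatorname{E}[Y_i \mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) = \frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{X}_i}}$$

The formula can also be written as a probability distribution (specifically, using a probability mass function):

$$\Pr(Y_i = y \mid \mathbf{X}_i) = p_i^{y} (1 - p_i)^{1 - y} = \left(\frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{y} \left(1 - \frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{1 - y} = \frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i \cdot y}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}$$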

As a latent-variable model

The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.

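Imagine that, for each trial i, there is a continuous latent variable $Y_i^*$ (i.e. an unobserved random variable) that is distributed as follows:

$$Y_i^* = \boldsymbol{\beta} \cdot \mathbf{X}_i + \varepsilon_i$$

where

$$\varepsilon_i \sim \operatorname{Logistic}(0, 1)$$

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.

Then $Y_i$ can be viewed as an indicator for whether this latent variable is positive:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0, \text{ i.e. } -\varepsilon_i < \boldsymbol{\beta} \cdot \mathbf{X}_i, \\ 0 & \text{otherwise.} \end{cases}$$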

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Yi* regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Yi* will be smaller by a factor of s than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same Yi choice.

(This predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

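It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

$$\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)$$

Then, using the symmetry of the logistic distribution about zero in the fourth step:

$$\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\ &= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\ &= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) \\ &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\ &= p_i \end{aligned}$$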

This formulation—which is standard in discrete choice models—makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
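The equivalence between the latent-variable formulation and the logistic function can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation; the value of the linear predictor (`linear_predictor`) and the sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(t):
    """Logistic function, i.e. the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical value of the linear predictor beta . X_i for one trial.
linear_predictor = 0.8

# Latent variable Y* = beta . X + eps, with eps ~ standard logistic.
eps = rng.logistic(loc=0.0, scale=1.0, size=200_000)
y_star = linear_predictor + eps

# Y = 1 exactly when the latent variable is positive.
y = (y_star > 0).astype(int)

simulated = y.mean()
analytic = logistic(linear_predictor)
print(f"simulated Pr(Y=1) = {simulated:.4f}, logistic(beta.X) = {analytic:.4f}")
```

With a sample this large, the simulated frequency and the value of the logistic function should agree to roughly two decimal places.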

Two-way latent-variable model

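Yet another formulation uses two separate latent variables:

$$\begin{aligned} Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \\ Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \end{aligned}$$

where

$$\varepsilon_0 \sim \operatorname{EV}_1(0, 1), \qquad \varepsilon_1 \sim \operatorname{EV}_1(0, 1)$$

and $\operatorname{EV}_1(0,1)$ is a standard type-1 extreme value distribution, i.e.

$$\Pr(\varepsilon_0 = x) = \Pr(\varepsilon_1 = x) = e^{-x} e^{-e^{-x}}$$

Then

$$Y_i = \begin{cases} 1 & \text{if } Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 & \text{otherwise.} \end{cases}$$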

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.

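It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

$$\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0, \qquad \varepsilon = \varepsilon_1 - \varepsilon_0$$

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values, and this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables follows a logistic distribution, i.e. $\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0, 1)$. We can demonstrate the equivalence as follows:

$$\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{1\ast} > Y_i^{0\ast} \mid \mathbf{X}_i) \\ &= \Pr(Y_i^{1\ast} - Y_i^{0\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) \\ &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) \\ &= \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) \\ &= \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) \\ &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = p_i \end{aligned}$$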
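The claim that the difference of two type-1 extreme value variables is logistic can be illustrated by simulation. This is a rough sketch rather than a proof; the sample size and the grid of evaluation points are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two independent standard type-1 extreme value (Gumbel) error variables.
eps0 = rng.gumbel(loc=0.0, scale=1.0, size=n)
eps1 = rng.gumbel(loc=0.0, scale=1.0, size=n)
diff = eps1 - eps0

# Compare the empirical CDF of the difference with the logistic function,
# which is the CDF of the standard logistic distribution.
grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
empirical_cdf = np.array([(diff < g).mean() for g in grid])
logistic_cdf = 1.0 / (1.0 + np.exp(-grid))

max_gap = np.abs(empirical_cdf - logistic_cdf).max()
print("largest gap between empirical and logistic CDF:", max_gap)
```

The largest gap shrinks as `n` grows, which is consistent with the two-way latent-variable model reducing to the single latent-variable model.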

Example

As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice. In accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility, or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice.

A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility since they are likely to own companies, which will have a harder time doing business in such an environment and probably lose money.

These intuitions can be expressed as follows:

Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables

| Income level | Center-right | Center-left | Secessionist |
|---|---|---|---|
| High-income | strong + | strong − | strong − |
| Middle-income | moderate + | weak + | none |
| Low-income | none | strong + | none |

This clearly shows that

  1. Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.
  2. Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that polynomial regression on income is effectively done.

As a "log-linear" model

Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

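Here, instead of writing the logit of the probabilities $p_i$ as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

$$\begin{aligned} \ln \Pr(Y_i = 0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\ \ln \Pr(Y_i = 1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \end{aligned}$$

Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the logarithm of the associated probability as a linear predictor, with an extra term $-\ln Z$ at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

$$\begin{aligned} \Pr(Y_i = 0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\ \Pr(Y_i = 1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \end{aligned}$$

In this form it is clear that the purpose of $Z$ is to ensure that the resulting distribution over $Y_i$ is in fact a probability distribution, i.e. it sums to 1. This means that $Z$ is simply the sum of all un-normalized probabilities, and by dividing each probability by $Z$, the probabilities become "normalized". That is:

$$Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}$$

and the resulting equations are

$$\begin{aligned} \Pr(Y_i = 0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\ \Pr(Y_i = 1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \end{aligned}$$

Or generally:

$$\Pr(Y_i = c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}$$

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit. This general formulation is exactly the softmax function, as in

$$\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots)$$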

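To prove that this is equivalent to the previous model, we start by recognizing that the above model is overspecified, in that $\Pr(Y_i = 0)$ and $\Pr(Y_i = 1)$ cannot be independently specified: rather, $\Pr(Y_i = 0) + \Pr(Y_i = 1) = 1$, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of $\boldsymbol\beta_0$ and $\boldsymbol\beta_1$ will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector $\mathbf{C}$ to both of them will produce the same probabilities:

$$\begin{aligned} \Pr(Y_i = 1) &= \frac{e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}} \\ &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \left(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}\right)} \\ &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \end{aligned}$$

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set $\boldsymbol\beta_0 = \mathbf{0}$. Then,

$$e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1$$

and so

$$\Pr(Y_i = 1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i$$

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where $\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0$ will produce equivalent results.)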
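The non-identifiability argument can be illustrated numerically. The sketch below uses made-up coefficient vectors and a hypothetical helper `outcome_probabilities`; it checks that adding the same constant vector to both coefficient sets leaves the probabilities unchanged, and that the two-outcome case agrees with the logistic function applied to the coefficient difference.

```python
import numpy as np

def outcome_probabilities(betas, x):
    """Pr(Y = c) = exp(beta_c . x) / sum_h exp(beta_h . x) for each outcome c."""
    scores = betas @ x
    scores = scores - scores.max()  # stabilises exp() without changing the ratios
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical values, chosen only for illustration.
x = np.array([1.0, 2.5, -0.7])       # explanatory variables (leading 1 = intercept)
beta0 = np.array([0.3, -0.2, 0.5])   # coefficients for outcome 0
beta1 = np.array([-0.4, 0.6, 0.1])   # coefficients for outcome 1
shift = np.array([5.0, -3.0, 2.0])   # arbitrary constant vector C

p_original = outcome_probabilities(np.stack([beta0, beta1]), x)
p_shifted = outcome_probabilities(np.stack([beta0 + shift, beta1 + shift]), x)

# Setting beta0 = 0 and beta = beta1 - beta0 gives the usual logistic form.
p_logistic = 1.0 / (1.0 + np.exp(-(beta1 - beta0) @ x))

print(p_original, p_shifted, p_logistic)
```

The same helper works unchanged for more than two outcomes, since `betas` may have any number of rows.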

Most treatments of the multinomial logit model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.

As a single-layer perceptron

The model has an equivalent formulation

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}$$

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of $p_i$ with respect to $X = (x_1, \dots, x_k)$ is computed from the general form:

$$y = \frac{1}{1 + e^{-f(X)}}$$

where $f(X)$ is an analytic function in $X$. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

$$\frac{\mathrm{d}y}{\mathrm{d}X} = y(1 - y)\frac{\mathrm{d}f}{\mathrm{d}X}$$
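The derivative property can be verified with a finite-difference check. The sketch below takes the simplest case, in which $f$ is the identity (so its derivative is 1); the step size `h` and the evaluation grid are arbitrary assumptions.

```python
import numpy as np

def sigmoid(t):
    """Logistic activation y = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4.0, 4.0, 9)
h = 1e-6

# Central finite difference versus the closed form y * (1 - y).
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)
closed_form = sigmoid(t) * (1.0 - sigmoid(t))

max_error = np.abs(numeric - closed_form).max()
print("largest discrepancy:", max_error)
```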

In terms of binomial data

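A closely related model assumes that each $i$ is associated not with a single Bernoulli trial but with $n_i$ independent identically distributed trials, where the observation $Y_i$ is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:

$$Y_i \sim \operatorname{Bin}(n_i, p_i), \text{ for } i = 1, \dots, n$$

An example of this distribution is the fraction of seeds ($p_i$) that germinate after $n_i$ are planted.

In terms of expected values, this model is expressed as follows:

$$p_i = \operatorname{E}\left[\left.\frac{Y_i}{n_i} \,\right|\, \mathbf{X}_i\right],$$

so that

$$\operatorname{logit}\left(\operatorname{E}\left[\left.\frac{Y_i}{n_i} \,\right|\, \mathbf{X}_i\right]\right) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i,$$

Or equivalently:

$$\Pr(Y_i = y \mid \mathbf{X}_i) = \binom{n_i}{y} p_i^{y} (1 - p_i)^{n_i - y} = \binom{n_i}{y} \left(\frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1 - \frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i - y}$$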

This model can be fit using the same sorts of methods as the above more basic model.

Model fitting

Maximum likelihood estimation (MLE)

The regression coefficients are usually estimated using maximum likelihood estimation.[26][27] Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function so an iterative process must be used instead; for example Newton's method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.[26]

In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.

  • Having a large ratio of variables to cases results in an overly conservative Wald statistic (discussed below) and can lead to non-convergence. Regularized logistic regression is specifically intended to be used in this situation.
  • Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases.[26] To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic [26] used to assess whether multicollinearity is unacceptably high.
  • Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors. The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is an undefined value so that the final solution to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or add a constant to all cells.[26]
  • Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion – all cases are accurately classified and the likelihood is maximized only with infinite coefficients. In such instances, one should re-examine the data, as there may be some kind of error.[2]
  • One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions of a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).[28]

Iteratively reweighted least squares (IRLS)

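Binary logistic regression ($y = 0$ or $y = 1$) can, for example, be calculated using iteratively reweighted least squares (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli distributed process using Newton's method. If the problem is written in vector matrix form, with parameters $\mathbf{w}^T = [\beta_0, \beta_1, \beta_2, \ldots]$, explanatory variables $\mathbf{x}(i) = [1, x_1(i), x_2(i), \ldots]^T$ and expected value of the Bernoulli distribution $\mu(i) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}(i)}}$, the parameters $\mathbf{w}$ can be found using the following iterative algorithm:

$$\mathbf{w}_{k+1} = \left(\mathbf{X}^T \mathbf{S}_k \mathbf{X}\right)^{-1} \mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \boldsymbol\mu_k\right)$$

where $\mathbf{S} = \operatorname{diag}(\mu(i)(1 - \mu(i)))$ is a diagonal weighting matrix, $\boldsymbol\mu = [\mu(1), \mu(2), \ldots]$ the vector of expected values,

$$\mathbf{X} = \begin{bmatrix} 1 & x_1(1) & x_2(1) & \ldots \\ 1 & x_1(2) & x_2(2) & \ldots \\ \vdots & \vdots & \vdots & \end{bmatrix}$$

the regressor matrix and $\mathbf{y} = [y(1), y(2), \ldots]^T$ the vector of response variables. More details can be found in the literature.[29]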
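A minimal NumPy sketch of the IRLS iteration is given below. The function name `irls_logistic`, the stopping tolerance, and the iteration cap are assumptions made for illustration, and the hours/pass data are assumed to match the exam example used elsewhere in the article. No safeguards (step halving, separation checks) are included.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit binary logistic regression by IRLS (Newton's method).

    X is the regressor matrix with a leading column of ones;
    y holds the 0/1 responses.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ w))   # expected values mu(i)
        s = mu * (1.0 - mu)                 # diagonal of the weighting matrix S
        # Solve (X^T S X) w_new = X^T (S X w + y - mu).
        lhs = X.T @ (s[:, None] * X)
        rhs = X.T @ (s * (X @ w) + y - mu)
        w_new = np.linalg.solve(lhs, rhs)
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w

# Hours studied and pass (1) / fail (0); assumed to be the article's exam data.
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(hours), hours])
w_hat = irls_logistic(X, passed)

# At the maximum-likelihood solution the score X^T (y - mu) vanishes.
score = X.T @ (passed - 1.0 / (1.0 + np.exp(-X @ w_hat)))
print("coefficients:", w_hat, "score:", score)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually preferable numerically.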

Bayesian

Comparison of logistic function with a scaled inverse probit function (i.e. the CDF of the normal distribution), comparing $\sigma(x)$ vs. $\Phi\left(\sqrt{\tfrac{\pi}{8}}\,x\right)$, which makes the slopes the same at the origin. This shows the heavier tails of the logistic distribution.

In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, for example in the form of Gaussian distributions. There is no conjugate prior of the likelihood function in logistic regression. When Bayesian inference was performed analytically, this made the posterior distribution difficult to calculate except in very low dimensions. Now, though, automatic software such as OpenBUGS, JAGS, PyMC, Stan or Turing.jl allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as variational Bayesian methods and expectation propagation.

"Rule of ten"

The widely used "one in ten rule" states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where event denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use $k$ explanatory variables for an event (e.g. myocardial infarction) expected to occur in a proportion $p$ of participants in the study will require a total of $10k/p$ participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning.[30] According to some authors[31] the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".[32]

Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.[33] Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.[13]
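The arithmetic behind these sample-size figures can be sketched as follows. The helper names and the example inputs (4 variables, 25% event proportion) are hypothetical. The second function assumes that the figure of 96 comes from the usual normal-approximation margin of error for a single proportion at its worst case of 0.5; that reading is an assumption, not something stated in the text.

```python
import math

def participants_needed(k, p, events_per_variable=10):
    """Total sample size so that the expected number of events is EPV * k."""
    return math.ceil(events_per_variable * k / p)

def n_for_intercept_margin(margin=0.1, z=1.96, p=0.5):
    """Observations needed to estimate one proportion p to within +/- margin."""
    return z**2 * p * (1.0 - p) / margin**2

# Hypothetical study: 4 explanatory variables, event expected in 25% of participants.
n_total = participants_needed(k=4, p=0.25)

# Worst case p = 0.5, 0.95 confidence level (z = 1.96), margin of error 0.1.
n_intercept = n_for_intercept_margin()

print(n_total, round(n_intercept, 2))
```

Raising `events_per_variable` to 20 gives the stricter criterion mentioned above for predictive discrimination.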

Error and significance of fit

Deviance and likelihood ratio test ─ a simple case

In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "overfitting" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting.

In short, for logistic regression, a statistic known as the deviance is defined which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is chi-squared distributed, which allows a chi-squared test to be implemented in order to determine the significance of the explanatory variables.

Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of $K$ data points $(x_k, y_k)$ are fitted to a proposed model function of the form $y = b_0 + b_1 x$. The fit is obtained by choosing the $b$ parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point:

$$\varepsilon^2 = \sum_{k=1}^{K} (b_0 + b_1 x_k - y_k)^2$$

The minimum value which constitutes the fit will be denoted by $\hat\varepsilon^2$.

The idea of a null model may be introduced, in which it is assumed that the $x$ variable is of no use in predicting the $y_k$ outcomes: the data points are fitted to a null model function of the form $y = b_0$ with a squared error term:

$$\varepsilon^2 = \sum_{k=1}^{K} (b_0 - y_k)^2$$

The fitting process consists of choosing a value of $b_0$ which minimizes $\varepsilon^2$ for the fit to the null model; the minimum is denoted by $\hat\varepsilon_\varphi^2$, where the $\varphi$ subscript denotes the null model. It is seen that the null model is optimized by $b_0 = \overline{y}$, where $\overline{y}$ is the mean of the $y_k$ values, and the optimized $\varepsilon_\varphi^2$ is:

$$\hat\varepsilon_\varphi^2 = \sum_{k=1}^{K} (\overline{y} - y_k)^2$$

which is proportional to the square of the (uncorrected) sample standard deviation of the $y_k$ data points.

We can imagine a case where the $y_k$ data points are randomly assigned to the various $x_k$, and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the $y_k$ outcomes. Because the null model is a special case of the proposed model, the optimized error of any of these fits will never be greater than the optimum error of the null model. It can further be shown that the difference between these minimum errors will follow a chi-squared distribution, with degrees of freedom equal to those of the proposed model minus those of the null model which, in this case, will be $2 - 1 = 1$. Using the chi-squared test, we may then estimate how many of these permuted sets of $y_k$ will yield a minimum error less than or equal to the minimum error using the original $y_k$, and so we can estimate how significant an improvement is given by the inclusion of the $x$ variable in the proposed model.

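For logistic regression, the measure of goodness-of-fit is the likelihood function $L$, or its logarithm, the log-likelihood $\ell$. The likelihood function $L$ is analogous to the $\varepsilon^2$ in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by $\hat\ell$.

In the case of simple binary logistic regression, the set of $K$ data points are fitted in a probabilistic sense to a function of the form:

$$p(x) = \frac{1}{1 + e^{-t}}$$

where $p(x)$ is the probability that $y = 1$. The log-odds are given by:

$$t = \beta_0 + \beta_1 x$$

and the log-likelihood is:

$$\ell = \sum_{k=1}^{K} \left( y_k \ln(p(x_k)) + (1 - y_k) \ln(1 - p(x_k)) \right)$$

For the null model, the probability that $y = 1$ is given by:

$$p_\varphi(x) = \frac{1}{1 + e^{-t_\varphi}}$$

The log-odds for the null model are given by:

$$t_\varphi = \beta_0$$

and the log-likelihood is:

$$\ell_\varphi = \sum_{k=1}^{K} \left( y_k \ln(p_\varphi) + (1 - y_k) \ln(1 - p_\varphi) \right)$$

Since we have $p_\varphi = \overline{y}$ at the maximum of $L$, the maximum log-likelihood for the null model is

$$\hat\ell_\varphi = K \left( \overline{y} \ln(\overline{y}) + (1 - \overline{y}) \ln(1 - \overline{y}) \right)$$

The optimum $\beta_0$ is:

$$\beta_0 = \ln\left(\frac{\overline{y}}{1 - \overline{y}}\right)$$

where $\overline{y}$ is again the mean of the $y_k$ values. Again, we can conceptually consider the fit of the proposed model to every permutation of the $y_k$ and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model:

$$\hat\ell \geq \hat\ell_\varphi$$

Also, as an analog to the error of the linear regression case, we may define the deviance of a logistic regression fit as:

$$D = \ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat\ell - \hat\ell_\varphi)$$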

which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points ($K$) increases, becoming exactly chi-square distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so have an estimate of how significantly the model is improved by including the $x_k$ data points in the proposed model.

For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is $\hat\ell_\varphi = -13.8629\ldots$ The maximum value of the log-likelihood for the simple model is $\hat\ell = -8.02988\ldots$ so that the deviance is $D = 2(\hat\ell - \hat\ell_\varphi) = 11.6661\ldots$

Using the chi-squared test of significance, the integral of the chi-squared distribution with one degree of freedom from 11.6661... to infinity is equal to 0.00063649...

This effectively means that about 6 out of 10,000 fits to random $y_k$ can be expected to have a better fit (a larger value of $D$ as defined here) than the given $y_k$, and so we can conclude that the inclusion of the $x$ variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the null hypothesis with about 99.94% confidence.
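The deviance calculation can be sketched in a few lines of Python. The data below are assumed to be the hours/pass values from the article's exam example. The fit uses a plain Newton iteration with a fixed number of steps, and the one-degree-of-freedom chi-squared tail is computed through the identity $\Pr(\chi^2_1 > D) = \operatorname{erfc}(\sqrt{D/2})$.

```python
import math
import numpy as np

# Hours studied and pass (1) / fail (0); assumed to be the article's exam data.
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)

def log_likelihood(p, y):
    """Bernoulli log-likelihood for fitted probabilities p and outcomes y."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Null model: the fitted probability is the mean of the outcomes.
y_bar = passed.mean()
ll_null = log_likelihood(np.full_like(passed, y_bar), passed)

# Proposed model: maximise the log-likelihood with Newton's method.
X = np.column_stack([np.ones_like(hours), hours])
beta = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    gradient = X.T @ (passed - p)
    hessian = X.T @ ((p * (1 - p))[:, None] * X)
    beta = beta + np.linalg.solve(hessian, gradient)
ll_fit = log_likelihood(1.0 / (1.0 + np.exp(-X @ beta)), passed)

deviance = 2.0 * (ll_fit - ll_null)

# Upper tail of the chi-squared distribution with one degree of freedom.
p_value = math.erfc(math.sqrt(deviance / 2.0))

print(f"null: {ll_null:.4f}  fitted: {ll_fit:.5f}  D: {deviance:.4f}  p: {p_value:.3g}")
```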

Goodness of fit summary

Goodness of fit in linear regression models is generally measured using $R^2$. Since this has no direct analog in logistic regression, various methods[34]: ch.21  including the following can be used instead.

Deviance and likelihood ratio tests

In linear regression analysis, one is concerned with partitioning variance via the sum of squares calculations – variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of a sum of squares calculations.[35] Deviance is analogous to the sum of squares calculations in linear regression[2] and is a measure of the lack of fit to the data in a logistic regression model.[35] When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model.[2] This computation gives the likelihood-ratio test:[2]

$$D = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}$$

In the above equation, D represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. D can be shown to follow an approximate chi-squared distribution.[2] Smaller values indicate better fit as the fitted model deviates less from the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus, good model fit. Conversely, a significant chi-square value indicates that a significant amount of the variance is unexplained.

When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm.

Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model.[35] In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a chi-square distribution with degrees of freedom[2] equal to the difference in the number of parameters estimated.

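Let

$$\begin{aligned} D_{\text{null}} &= -2 \ln \frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \\ D_{\text{fitted}} &= -2 \ln \frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}} \end{aligned}$$

Then the difference of both is:

$$D_{\text{null}} - D_{\text{fitted}} = -2 \ln \frac{\text{likelihood of null model}}{\text{likelihood of fitted model}}$$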

If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improve the model's fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.[35]

Pseudo-R-squared

In linear regression the squared multiple correlation, $R^2$, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors.[35] In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures each with limitations.[35][36]

Four of the most commonly used indices and one less commonly used one are examined on this page:

  • Likelihood ratio $R^2_{\text{L}}$
  • Cox and Snell $R^2_{\text{CS}}$
  • Nagelkerke $R^2_{\text{N}}$
  • McFadden $R^2_{\text{McF}}$
  • Tjur $R^2_{\text{T}}$
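Three of these indices can be computed directly from the null and fitted log-likelihoods. The sketch below uses the commonly quoted textbook definitions, which are an assumption here since this section does not state the formulas. The helper name `pseudo_r_squared` and the input values (20 observations, log-likelihoods of −13.8629 and −8.02988) are illustrative. The likelihood ratio and Tjur indices are not covered.

```python
import math

def pseudo_r_squared(ll_null, ll_fit, n):
    """McFadden, Cox-Snell and Nagelkerke indices from log-likelihoods.

    ll_null: maximised log-likelihood of the intercept-only model
    ll_fit:  maximised log-likelihood of the fitted model
    n:       number of observations
    """
    mcfadden = 1.0 - ll_fit / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_fit) / n)
    # Nagelkerke rescales Cox-Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}

# Illustrative inputs (assumed values, see the lead-in).
r2 = pseudo_r_squared(ll_null=-13.8629, ll_fit=-8.02988, n=20)
for name, value in r2.items():
    print(f"{name}: {value:.3f}")
```

The three values generally differ for the same fit, which is one reason no single index is regarded as the agreed analog of $R^2$.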

Hosmer–Lemeshow test

The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a $\chi^2$ distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population. This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and relatively low power.[37]

Coefficient significance

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.[35] In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient – the odds ratio (see definition). In linear regression, the significance of a regression coefficient is assessed by computing a t test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic.

Likelihood ratio test

The likelihood-ratio test discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model.[2][26][35] In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-square distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (cf. chi-square using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case. To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor.[35] There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures. The fear is that they may not preserve nominal statistical properties and may become misleading.[38]

Wald statistic

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient, $W_j = \beta_j^2 / SE_{\beta_j}^2$, and is asymptotically distributed as a chi-square distribution.[26]

Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger, increasing the probability of Type-II error. The Wald statistic also tends to be biased when data are sparse.[35]
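A sketch of how Wald statistics might be computed is shown below. It uses simulated data with made-up true coefficients (`true_beta`), a plain Newton fit, and standard errors taken from the inverse of the observed information matrix $X^T S X$ at the fitted coefficients. The p-values use the one-degree-of-freedom chi-squared tail $\operatorname{erfc}(\sqrt{W/2})$.

```python
import math
import numpy as np

# Simulated data; the true coefficients are hypothetical.
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.2])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

# Newton iterations for the maximum-likelihood coefficients.
beta = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    information = X.T @ ((p * (1.0 - p))[:, None] * X)
    beta = beta + np.linalg.solve(information, X.T @ (y - p))

# Standard errors from the inverse information matrix at the final estimate.
p = 1.0 / (1.0 + np.exp(-X @ beta))
information = X.T @ ((p * (1.0 - p))[:, None] * X)
standard_errors = np.sqrt(np.diag(np.linalg.inv(information)))

# Wald statistic: squared coefficient over squared standard error.
wald = (beta / standard_errors) ** 2
p_values = [math.erfc(math.sqrt(w / 2.0)) for w in wald]

print("beta:", beta, "SE:", standard_errors, "Wald:", wald, "p:", p_values)
```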

Case-control sampling

Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000 and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also called retrospective sampling, and the resulting data are said to be unbalanced. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.[39]

Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data and the model is correct in the general population, the $\beta_j$ parameters are all correct except for $\beta_0$. We can correct $\beta_0$ if we know the true prevalence as follows:[39]

$$\hat\beta_0^{\ast} = \hat\beta_0 + \log\frac{\pi}{1 - \pi} - \log\frac{\tilde\pi}{1 - \tilde\pi}$$

where $\pi$ is the true prevalence and $\tilde\pi$ is the prevalence in the sample.
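The intercept correction is a one-line computation. In the sketch below, the fitted intercept (−0.3) is a made-up value. The true prevalence is the 1-in-10,000 figure from the example above, and the sample prevalence of 1/6 corresponds to five controls per case.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def corrected_intercept(beta0_sample, true_prevalence, sample_prevalence):
    """Shift the intercept fitted on case-control data to the population scale."""
    return beta0_sample + logit(true_prevalence) - logit(sample_prevalence)

beta0_star = corrected_intercept(
    beta0_sample=-0.3,         # hypothetical intercept from the unbalanced sample
    true_prevalence=1e-4,      # 1 person in 10,000
    sample_prevalence=1 / 6,   # one case for every five controls
)
print(f"corrected intercept: {beta0_star:.4f}")
```

Because cases are heavily over-represented in the sample, the corrected intercept is much lower than the fitted one; the slope coefficients need no adjustment.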

Discussion

Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Unlike ordinary linear regression, however, logistic regression is used for predicting dependent variables that take membership in one of a limited number of categories (treating the dependent variable in the binomial case as the outcome of a Bernoulli trial) rather than a continuous outcome. Given this difference, the assumptions of linear regression are violated. In particular, the residuals cannot be normally distributed. In addition, linear regression may make nonsensical predictions for a binary dependent variable. What is needed is a way to convert a binary variable into a continuous one that can take on any real value (negative or positive). To do that, binomial logistic regression first calculates the odds of the event happening for different levels of each independent variable, and then takes its logarithm to create a continuous criterion as a transformed version of the dependent variable. The logarithm of the odds is the logit of the probability, the logit is defined as follows:

Although the dependent variable in logistic regression is Bernoulli, the logit is on an unrestricted scale.[2] The logit function is the link function in this kind of generalized linear model, i.e.

$$ \operatorname{logit} \operatorname{E}(Y) = \beta_0 + \beta_1 x $$

Y is the Bernoulli-distributed response variable and x is the predictor variable; the β values are the linear parameters.

The logit of the probability of success is then fitted to the predictors. The predicted value of the logit is converted back into predicted odds, via the inverse of the natural logarithm – the exponential function. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.
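The chain from fitted logit to odds, probability, and a yes-or-no prediction can be sketched as follows. The coefficients and the odds cutoff of 1 (equivalent to a probability cutoff of 0.5) are illustrative assumptions, not fitted values.

```python
import math

def predict(beta0, beta1, x, odds_cutoff=1.0):
    """Return (logit, odds, probability, label) for one predictor value x."""
    log_odds = beta0 + beta1 * x             # fitted logit (linear in x)
    odds = math.exp(log_odds)                # exponential undoes the natural log
    prob = odds / (1.0 + odds)               # odds converted to a probability
    label = 1 if odds > odds_cutoff else 0   # 'success' when odds exceed the cutoff
    return log_odds, odds, prob, label

# Hypothetical coefficients, for illustration only.
for x in (2.0, 4.0):
    print(x, predict(-4.0, 1.5, x))
```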

Machine learning and cross-entropy loss function

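In machine learning applications where logistic regression is used for binary classification, the MLE minimises the cross-entropy loss function.

Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable $ Y $ being 0 or 1 given experimental data.[40]

Consider a generalized linear model function parameterized by $ \theta $,

$$ h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} = \Pr(Y = 1 \mid X; \theta) $$

Therefore,

$$ \Pr(Y = 0 \mid X; \theta) = 1 - h_\theta(X) $$

and since $ Y \in \{0, 1\} $, we see that $ \Pr(y \mid X; \theta) $ is given by $ \Pr(y \mid X; \theta) = h_\theta(X)^{y} (1 - h_\theta(X))^{1 - y} $. We now calculate the likelihood function assuming that all the observations in the sample are independently Bernoulli distributed,

$$ L(\theta \mid y; x) = \prod_i \Pr(y_i \mid x_i; \theta) = \prod_i h_\theta(x_i)^{y_i} (1 - h_\theta(x_i))^{1 - y_i} $$

Typically, the log likelihood is maximized,

$$ N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^{N} \log \Pr(y_i \mid x_i; \theta) $$

which is maximized using optimization techniques such as gradient descent (applied to the negative log likelihood).

Assuming the $ (x, y) $ pairs are drawn uniformly from the underlying distribution, then in the limit of large N,

$$ \begin{aligned} \lim_{N \to \infty} N^{-1} \sum_{i=1}^{N} \log \Pr(y_i \mid x_i; \theta) &= \sum_{x} \sum_{y} \Pr(X = x, Y = y) \log \Pr(Y = y \mid X = x; \theta) \\ &= \sum_{x} \sum_{y} \Pr(X = x, Y = y) \left( -\log \frac{\Pr(Y = y \mid X = x)}{\Pr(Y = y \mid X = x; \theta)} + \log \Pr(Y = y \mid X = x) \right) \\ &= -D_{\text{KL}}(Y \parallel Y_\theta) - H(Y \mid X) \end{aligned} $$

where $ H(Y \mid X) $ is the conditional entropy and $ D_{\text{KL}} $ is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively, this amounts to searching for the model that makes the fewest assumptions in its parameters.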
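To make the link between maximum likelihood and cross-entropy concrete, the sketch below models Pr(Y = 1 | x; θ) as the logistic function of θᵀx and computes the cross-entropy loss as the negative mean Bernoulli log-likelihood. The tiny dataset and the parameter values are hypothetical.

```python
import math

def h(theta, x):
    """Model probability Pr(Y = 1 | x; theta) for a feature vector x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_likelihood(theta, X, y):
    """Average Bernoulli log-likelihood over the sample."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(theta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)

def cross_entropy(theta, X, y):
    """Cross-entropy loss: the negative of the mean log-likelihood."""
    return -mean_log_likelihood(theta, X, y)

# Hypothetical data: first column is a constant 1 for the intercept.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 1, 1]
print(cross_entropy([0.0, 0.0], X, y))   # uninformative model: every p is 0.5
print(cross_entropy([-2.0, 1.5], X, y))  # a better-fitting parameter choice
```

Any parameter vector that raises the mean log-likelihood lowers the cross-entropy by the same amount, so the two optimization problems share their solution.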

Comparison with linear regression

Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.

Alternatives

A common alternative to the logistic model (logit model) is the probit model, as the related names suggest. From the perspective of generalized linear models, these differ in the choice of link function: the logistic model uses the logit function (inverse logistic function), while the probit model uses the probit function (the inverse of the standard normal cumulative distribution function, which can be written in terms of the inverse error function). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors.[41] Other sigmoid functions or error distributions can be used instead.

Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis.[42] If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.[43]

The assumption of linear predictor effects can easily be relaxed using techniques such as spline functions.[13]

History

A detailed history of the logistic regression is given in Cramer (2002). The logistic function was developed as a model of population growth and named "logistic" by Pierre François Verhulst in the 1830s and 1840s, under the guidance of Adolphe Quetelet; see Logistic function § History for details.[44] In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data.[45][46] In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.[47][48]

The logistic function was independently developed in chemistry as a model of autocatalysis (Wilhelm Ostwald, 1883).[49] An autocatalytic reaction is one in which one of the products is itself a catalyst for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained.

The logistic function was independently rediscovered as a model of population growth in 1920 by Raymond Pearl and Lowell Reed, published as Pearl & Reed (1920), which led to its use in modern statistics. They were initially unaware of Verhulst's work and presumably learned about it from L. Gustave du Pasquier, but they gave him little credit and did not adopt his terminology.[50] Verhulst's priority was acknowledged and the term "logistic" revived by Udny Yule in 1925 and has been followed since.[51] Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results.[52]

In the 1930s, the probit model was developed and systematized by Chester Ittner Bliss, who coined the term "probit" in Bliss (1934), and by John Gaddum in Gaddum (1933), and the model fit by maximum likelihood estimation by Ronald A. Fisher in Fisher (1935), as an addendum to Bliss's work. The probit model was principally used in bioassay, and had been preceded by earlier work dating to 1860; see Probit model § History. The probit model influenced the subsequent development of the logit model and these models competed with each other.[53]

The logistic model was likely first used as an alternative to the probit model in bioassay by Edwin Bidwell Wilson and his student Jane Worcester in Wilson & Worcester (1943).[54] However, the development of the logistic model as a general alternative to the probit model was principally due to the work of Joseph Berkson over many decades, beginning in Berkson (1944), where he coined "logit", by analogy with "probit", and continuing through Berkson (1951) and following years.[55] The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the probit",[56] particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields.[3]

Various refinements occurred during that time, notably by David Cox, as in Cox (1958).[4]

The multinomial logit model was introduced independently in Cox (1966) and Theil (1969), which greatly increased the scope of application and the popularity of the logit model.[57] In 1973 Daniel McFadden linked the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit followed from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences;[58] this gave a theoretical foundation for the logistic regression.[57]

Extensions

There are large numbers of extensions:

from Grokipedia
Logistic regression is a statistical method that models the relationship between a binary dependent variable and one or more independent variables by estimating the probability of the binary outcome through the logistic function, which maps a linear predictor to values between 0 and 1.[1][2] Developed by statistician David R. Cox in 1958, the approach treats the logarithm of the odds of the event occurring as a linear function of the predictors, enabling the analysis of binary sequences and qualitative responses.[3] Parameters are typically estimated via maximum likelihood, which seeks to maximize the probability of the observed data under the model, providing a basis for inference about odds ratios and predictive probabilities.[4] Widely applied in medical research to predict disease risk from covariates, in economics for binary choice models, and in machine learning as a baseline classifier, logistic regression provides interpretable coefficients quantifying the impact of predictors on log-odds while accommodating extensions like multinomial variants for categorical outcomes beyond binary.[5][6][7]

Mathematical Foundations

Logistic Function

The logistic function, denoted as $ \sigma(z) = \frac{1}{1 + e^{-z}} $, maps any real-valued input $ z $ to a value between 0 and 1, producing an S-shaped curve characteristic of sigmoid functions.[8] This form arises as the solution to the logistic differential equation $ \frac{dy}{dx} = y(1 - y) $ for the standardized case where the carrying capacity $ K = 1 $ and growth rate $ r = 1 $, obtained by separation of variables and integration yielding $ y(x) = \frac{1}{1 + Ce^{-x}} $ with $ C = 1 $ for the canonical form passing through (0, 0.5). Alternatively, it serves as the cumulative distribution function (CDF) of the standard logistic distribution, whose probability density function $ f(z) = \frac{e^{-z}}{(1 + e^{-z})^2} $ integrates to $ \sigma(z) $; this distribution is symmetric about zero and has heavier tails than the normal distribution.[9]

Key properties include strict monotonicity, as the derivative $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $ is always positive for finite $ z $, ensuring a one-to-one mapping from $ \mathbb{R} $ to (0,1).[10] The function is bounded asymptotically: $ \lim_{z \to -\infty} \sigma(z) = 0 $ and $ \lim_{z \to \infty} \sigma(z) = 1 $, with symmetry around the inflection point at $ z = 0 $ where $ \sigma(0) = 0.5 $ and the second derivative changes sign, reflecting maximum growth rate.[11]

The inverse transformation, known as the logit function, is $ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) $ for $ p \in (0,1) $, which linearizes the probability scale by converting bounded probabilities to the unbounded real line, facilitating the modeling of linear relationships in predictors.[8] This inverse exploits the logistic function's bijection, allowing probabilities to be expressed as $ \sigma(\beta_0 + \beta_1 x) $ in parameterized forms while preserving interpretability through the logit scale.[9]
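The properties listed above can be checked numerically with a short sketch. The branch on the sign of `z` is one common way to avoid overflow and is an implementation choice made here, not something prescribed by the text.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)              # safe: z < 0, so exp(z) < 1
    return ez / (1.0 + ez)

def logit(p):
    """Inverse of the logistic function for p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid_derivative(z):
    """Closed-form derivative sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))                   # midpoint of the curve
print(logit(sigmoid(1.3)))            # round trip recovers the input up to rounding
print(sigmoid(-2.0) + sigmoid(2.0))   # symmetry: sigma(-z) + sigma(z) = 1
```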

Odds and Odds Ratio

In statistics, the odds of an event is defined as the ratio of the probability of the event occurring to the probability of it not occurring.[12] For a binary outcome where the probability of success is $ p $, the odds $ o $ is given by $ o = \frac{p}{1-p} $.[13] This transforms the probability scale, which is bounded between 0 and 1, into an unbounded scale ranging from 0 to infinity, facilitating multiplicative interpretations.[14]

The odds ratio (OR) quantifies the association between an exposure or predictor and a binary outcome by comparing the odds of the outcome across two groups or conditions.[15] Specifically, for two groups, the OR is the ratio of the odds in the exposed group to the odds in the unexposed group: $ OR = \frac{o_1}{o_0} = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} $, where $ p_1 $ and $ p_0 $ are the success probabilities in each group.[14] An OR of 1 indicates no association, while values greater than 1 suggest higher odds in the first group and values less than 1 suggest lower odds.[15]

In the context of logistic regression, which models the log-odds as a linear function of predictors, the exponentiated regression coefficient $ e^{\beta_j} $ represents the odds ratio associated with a one-unit increase in the $ j $-th predictor, holding other predictors constant.[16] This multiplicative effect on the odds underscores the model's emphasis on relative changes rather than absolute probabilities, which vary nonlinearly.[17]

For illustration in simple two-group comparisons, consider a case-control study on Helicobacter pylori infection and peptic ulcer disease, where the exposure is infection status.[18] As quoted here, 17 of 65 cases with peptic ulcers were infected (odds = 17/48 ≈ 0.354), while 156 of 212 controls were infected (odds = 156/56 ≈ 2.786). Taken at face value, these counts give an odds ratio of (17 × 56) / (48 × 156) ≈ 0.127, which does not agree with the reported OR ≈ 3.71, a value indicating higher odds of exposure among cases;[18] the quoted counts therefore appear inconsistent with the reported ratio. Such ratios approximate relative risks when outcomes are rare, aiding causal inference in observational data.[15]
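A minimal sketch of the two-group odds ratio and of the coefficient-to-odds-ratio conversion follows. The 2×2 counts are hypothetical and are not taken from the study cited above; the function names are likewise illustrative.

```python
import math

def odds(p):
    """Odds corresponding to a probability p in (0, 1)."""
    return p / (1.0 - p)

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: events and non-events in the first (e.g. exposed) group
    c, d: events and non-events in the second (e.g. unexposed) group
    """
    return (a / b) / (c / d)

def odds_ratio_from_coefficient(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)

# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome.
print(odds_ratio(30, 70, 15, 85))
print(odds_ratio_from_coefficient(math.log(2.0)))  # a coefficient of ln 2 doubles the odds
```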

Logit Transformation

The logit transformation, defined as $ \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) $, where $ p $ is a probability between 0 and 1, converts bounded probabilities to an unbounded real-valued scale ranging from $ -\infty $ to $ \infty $.[19][20] This function represents the natural logarithm of the odds, where odds are given by $ \frac{p}{1-p} $, providing a monotonic mapping that preserves the order of probabilities while enabling linear modeling.[20]

In the context of logistic regression as a generalized linear model (GLM), the logit serves as the link function $ g(\mu) = \operatorname{logit}(\mu) $, relating the expected probability $ \mu $ to the linear predictor $ \eta = \mathbf{x}^\top \boldsymbol{\beta} $ via $ \ln\left(\frac{\mu}{1-\mu}\right) = \eta $.[21] This formulation ensures that inverting the link, which yields $ \mu = \frac{1}{1 + e^{-\eta}} $, always produces values strictly between 0 and 1, preventing the boundary violations that occur with non-saturating links like the identity function, which can yield invalid probabilities outside $ [0,1] $.[22][23]

The logit's status as the canonical link for the binomial distribution arises from its alignment with the exponential family form of the binomial probability mass function, where the natural parameter $ \theta $ equals $ \operatorname{logit}(\pi) $ for success probability $ \pi $. In this representation, the link directly corresponds to $ \theta = g(\mu) $, facilitating desirable statistical properties such as simplified score equations and a concave log-likelihood in GLM theory.[24] This canonical choice also yields interpretable coefficients as log-odds ratios, quantifying multiplicative changes in odds per unit change in predictors.[23][25]

Model Specification

Binary Logistic Regression

Binary logistic regression, introduced by statistician David R. Cox in 1958, is a generalized linear model used to estimate the probability of a binary outcome variable Y equaling 1 as a function of predictor variables.[26] In its simplest form with a single predictor X, the model assumes a linear relationship in the logit scale, where the logit is the natural logarithm of the odds: $ \log\left(\frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)}\right) = \beta_0 + \beta_1 X $.[27] This yields the probability equation $ P(Y=1 \mid X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} $, with $ \beta_0 $ as the intercept and $ \beta_1 $ as the slope coefficient measuring the change in log-odds per unit increase in X.[6]

The model posits that Y follows a Bernoulli distribution for individual observations, with success probability $ P(Y=1 \mid X) $, or a binomial distribution for grouped data sharing the same X values and trial size n > 1.[28] A core assumption is the independence of observations, ensuring that the probability for one does not influence others.[29] The response is bounded between 0 and 1, with non-normal errors and heteroscedasticity inherent to the variance $ P(Y=1 \mid X)(1 - P(Y=1 \mid X)) $, which the logistic link addresses without assuming constant variance.[6] This formulation generalizes to multiple predictors while maintaining the logit linearity and distributional premises.

Multivariate Binary Model

In the multivariate binary logistic regression model, the logit of the success probability is expressed as a linear combination of multiple predictor variables: $ \operatorname{logit}(P(Y=1 \mid \mathbf{X})) = \beta_0 + \sum_{j=1}^p \beta_j X_j $, where $ \mathbf{X} = (X_1, \dots, X_p) $ denotes the vector of predictors, $ \beta_0 $ is the intercept, and $ \beta_j $ are the coefficients measuring the change in log-odds associated with a unit increase in $ X_j $, holding other predictors constant.[30][31] This formulation allows the model to accommodate continuous, binary, or other scaled predictors simultaneously, extending the univariate case to capture joint effects without assuming independence among predictors.[6]

Categorical predictors are incorporated by transforming them into dummy variables, where a k-category variable is represented by k-1 binary indicators to avoid perfect collinearity with the intercept; for instance, a three-level factor might use two dummies, each coded as 1 for the corresponding category and 0 otherwise, with the reference category omitted.[32] This encoding ensures the linear predictor remains additive while enabling category-specific coefficient estimates.

Interactions between predictors can be included by adding product terms, such as $ \beta_{jk} X_j X_k $, to account for non-additive effects, though their inclusion requires empirical justification to prevent overfitting, often guided by domain knowledge or exploratory analysis.[32]

The model does not inherently assume causal relationships between predictors and the outcome, as it primarily estimates conditional associations; however, including multiple predictors empirically controls for confounding by adjusting the odds ratios for observed covariates, yielding estimates less biased by omitted variables under the assumption of no unmeasured confounders.[33][34] This adjustment is particularly valuable in observational data, where multivariate specification helps isolate predictor-outcome links amid correlated variables, though causal inference demands additional validation beyond model fitting.[33]
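The dummy-variable encoding described above can be sketched in a few lines. The helper `dummy_code` and the three-level factor are hypothetical; statistical packages provide their own encoders with more options.

```python
def dummy_code(values, reference):
    """Encode a categorical variable as k-1 indicator columns, omitting `reference`."""
    levels = sorted(set(values) - {reference})   # the k-1 non-reference categories
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical three-level factor with "low" as the reference category.
levels, rows = dummy_code(["low", "high", "medium", "low"], reference="low")
print(levels)
for row in rows:
    print(row)
```

Observations in the reference category get all zeros, so their log-odds are carried by the intercept alone and each dummy coefficient is a contrast against that category.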

Polychotomous Extensions

Logistic regression extends to polychotomous outcomes (those with more than two unordered or ordered categories) through specialized models that generalize the binary logit framework while preserving the interpretation of log-odds ratios as changes in relative probabilities.[35][36] These extensions maintain the core principle of modeling categorical probabilities via a linear predictor but adapt the link function to ensure probabilities sum to unity across categories.[37]

For nominal (unordered) polychotomous outcomes with $ J $ categories, multinomial logistic regression employs a baseline category approach, where the log-odds of each non-baseline category $ j = 1, \dots, J-1 $ relative to the reference category $ J $ are modeled as $ \log\left(\frac{P(Y=j \mid \mathbf{X})}{P(Y=J \mid \mathbf{X})}\right) = \mathbf{X} \boldsymbol{\beta}_j $.[37] The category-specific probabilities are then derived via the softmax function: $ P(Y=j \mid \mathbf{X}) = \frac{\exp(\mathbf{X} \boldsymbol{\beta}_j)}{1 + \sum_{k=1}^{J-1} \exp(\mathbf{X} \boldsymbol{\beta}_k)} $ for $ j = 1, \dots, J-1 $, and $ P(Y=J \mid \mathbf{X}) = \frac{1}{1 + \sum_{k=1}^{J-1} \exp(\mathbf{X} \boldsymbol{\beta}_k)} $.[38] This formulation treats the problem as $ J-1 $ coupled binary logits, allowing category-specific coefficients $ \boldsymbol{\beta}_j $ that capture distinct effects without imposing ordering.[35]

In contrast, for ordinal outcomes where categories possess a natural order (e.g., low, medium, high), the proportional odds model, also known as cumulative logit ordinal regression, models cumulative probabilities across thresholds.[39] Specifically, for $ J $ ordered categories, the log-odds of being in category $ m $ or lower versus higher is $ \log\left(\frac{P(Y \leq m \mid \mathbf{X})}{P(Y > m \mid \mathbf{X})}\right) = \theta_m - \mathbf{X} \boldsymbol{\beta} $ for $ m = 1, \dots, J-1 $, where $ \boldsymbol{\beta} $ is shared across thresholds and $ \theta_m $ are category-specific intercepts.[40] This enforces the proportional odds assumption: the effect of predictors on odds ratios remains constant across cumulative splits, reflecting parallel cumulative logits in the linear predictor scale.[41]

Multinomial models offer greater flexibility for nominal data by estimating separate coefficients per category pair, avoiding restrictive assumptions about ordering, but require more parameters ($ (J-1)(p+1) $ including intercepts, for $ p $ predictors), increasing variance in small samples.[35] Ordinal models achieve parsimony with fewer parameters ($ p + J - 1 $) by leveraging order, yielding more stable estimates when proportionality holds, though violating this assumption (testable via score or likelihood ratio tests) can bias results, in which case a multinomial model is the safer choice.[39][41] Empirical choice depends on data structure: nominal outcomes favor multinomial for unbiased category distinctions, while ordered data benefit from ordinal efficiency if causal effects align with monotonicity in cumulative risks.[40][38]
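Both probability maps can be sketched directly from the formulas: the baseline-category form with category J as reference, and the cumulative logit form with a shared coefficient vector. The function names and the coefficient values are hypothetical, and the thresholds passed to `ordinal_probs` are assumed to be increasing.

```python
import math

def multinomial_probs(x, betas):
    """Baseline-category logit: `betas` holds J-1 coefficient vectors; category J is the reference."""
    scores = [math.exp(sum(b * xi for b, xi in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

def ordinal_probs(x, thetas, beta):
    """Proportional odds: P(Y <= m) = sigmoid(theta_m - x.beta), with increasing thetas."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    cumulative = [1.0 / (1.0 + math.exp(-(t - eta))) for t in thetas] + [1.0]
    # Category probabilities are successive differences of the cumulative ones.
    return [cumulative[0]] + [cumulative[m] - cumulative[m - 1]
                              for m in range(1, len(cumulative))]

# Hypothetical coefficients for J = 3 categories and two predictors.
print(multinomial_probs([1.0, 0.5], [[0.2, -1.0], [-0.3, 0.8]]))
print(ordinal_probs([1.0, 0.5], [-1.0, 1.0], [0.4, 0.6]))
```

The multinomial sketch uses two coefficient vectors for three categories, while the ordinal sketch uses one vector plus two thresholds, which is the parameter saving discussed above.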

Parameter Estimation

Maximum Likelihood via Gradient Descent

Maximum likelihood estimation (MLE) for logistic regression parameters involves maximizing the log-likelihood function under the assumption of independent Bernoulli-distributed outcomes. For a dataset of $ n $ observations with binary responses $ y_i \in \{0,1\} $ and predictors $ \mathbf{x}_i $, the log-likelihood is $ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] $, where $ p_i = \frac{1}{1 + \exp(-\boldsymbol{\beta}^T \mathbf{x}_i)} $.[42] This optimization lacks a closed-form solution due to the nonlinearity of the logistic function, necessitating iterative numerical methods such as gradient-based approaches. The first derivative, or score function, provides the gradient: $ \frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n (y_i - p_i) \mathbf{x}_i $. Gradient ascent updates parameters as $ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \alpha \frac{\partial \ell}{\partial \boldsymbol{\beta}} \big|_{\boldsymbol{\beta}^{(t)}} $, with learning rate $ \alpha > 0 $.[43][44]

For faster convergence, second-order methods like Newton-Raphson incorporate the Hessian matrix, approximating the log-likelihood quadratically. The update is $ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left( -\mathbf{H}^{(t)} \right)^{-1} \frac{\partial \ell}{\partial \boldsymbol{\beta}} \big|_{\boldsymbol{\beta}^{(t)}} $, where $ \mathbf{H} = -\sum_{i=1}^n p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T = -\mathbf{X}^T \mathbf{W} \mathbf{X} $ and $ \mathbf{W} $ is diagonal with entries $ p_i (1 - p_i) $. This method, equivalent to one step of iteratively reweighted least squares per iteration, typically converges in fewer steps than first-order gradient descent but requires inverting the Hessian, which scales cubically with the number of parameters.[45][46]

In large-scale settings, stochastic gradient descent (SGD) variants use mini-batches to approximate the full gradient, reducing computational cost per update: $ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \alpha \sum_{i \in B} (y_i - p_i) \mathbf{x}_i / |B| $, where $ B $ is the batch. Convergence is monitored by criteria such as the gradient norm falling below a threshold (e.g., $ 10^{-6} $) or parameter changes stabilizing, often after 10-100 iterations depending on initialization and data scale. Numerical stability challenges arise from potential overflow in the sigmoid for extreme linear predictors; mitigation includes clipping inputs or using numerically stable log-sigmoid computations.[47][48]
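A minimal gradient ascent sketch for a single predictor plus intercept follows. The dataset (hours studied against pass/fail), the learning rate, and the iteration count are illustrative assumptions; the gradient is averaged over the sample so that a fixed step size stays stable.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(beta, xs, ys):
    """Bernoulli log-likelihood for beta = (intercept, slope)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta[0] + beta[1] * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_gradient_ascent(xs, ys, lr=0.1, n_iter=20000):
    """Plain gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            r = y - sigmoid(b0 + b1 * x)   # residual y_i - p_i
            g0 += r                        # score for the intercept
            g1 += r * x                    # score for the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Illustrative data: hours studied and whether the exam was passed.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

beta_hat = fit_gradient_ascent(hours, passed)
print(beta_hat, log_likelihood(beta_hat, hours, passed))
```

The many iterations needed here reflect the strong correlation between intercept and slope when the predictor is not centered, which is the kind of slow first-order convergence that motivates Newton-type updates.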

Iteratively Reweighted Least Squares

Iteratively reweighted least squares (IRLS), also known as iteratively weighted least squares, computes maximum likelihood estimates for parameters in generalized linear models such as logistic regression by successively approximating the nonlinear model with a weighted linear regression.[49] The method was introduced by Nelder and Wedderburn in 1972 within the framework of generalized linear models for exponential family distributions, where logistic regression corresponds to the Bernoulli distribution with a canonical logit link function.[50] In binary logistic regression, the response $ y_i $ follows a Bernoulli distribution with success probability $ p_i = \frac{1}{1 + \exp(- \mathbf{x}_i^T \boldsymbol{\beta})} $, and the goal is to maximize the log-likelihood $ \ell(\boldsymbol{\beta}) = \sum_i [y_i \log p_i + (1 - y_i) \log (1 - p_i)] $. IRLS implements Newton-Raphson iteration via Fisher scoring, using the expected information matrix as the Hessian approximation, which yields weighted least squares updates.[49] Each iteration linearizes the logit transformation around current parameter estimates, accounting for the heteroscedasticity inherent in the variance $ \mathrm{Var}(y_i) = p_i (1 - p_i) $. The algorithm initializes $ \boldsymbol{\beta}^{(0)} = \mathbf{0} $ (yielding initial $ p_i^{(0)} = 0.5 $) and proceeds iteratively:
  • Compute the linear predictor $ \eta_i^{(j)} = \mathbf{x}_i^T \boldsymbol{\beta}^{(j-1)} $ and fitted probabilities $ p_i^{(j)} = \frac{\exp(\eta_i^{(j)})}{1 + \exp(\eta_i^{(j)})} $.
  • Form weights $ w_i^{(j)} = p_i^{(j)} (1 - p_i^{(j)}) $.
  • Construct the working response $ z_i^{(j)} = \eta_i^{(j)} + \frac{y_i - p_i^{(j)}}{w_i^{(j)}} $.
  • Update $ \boldsymbol{\beta}^{(j)} = (\mathbf{X}^T \mathbf{W}^{(j)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(j)} \mathbf{z}^{(j)} $, where $ \mathbf{W}^{(j)} = \mathrm{diag}(w_i^{(j)}) $.
Convergence is assessed by small changes in $ \boldsymbol{\beta} $ or $ \ell(\boldsymbol{\beta}) $, typically within 10–20 iterations for moderate datasets.[51]

The weights $ w_i = p_i (1 - p_i) $ are the reciprocals of the working response's approximate variance $ 1 / [p_i (1 - p_i)] $, so each update is an inverse-variance weighted least squares fit. This weighting stabilizes the least squares approximation, drawing an analogy to ordinary least squares under homoscedastic Gaussian errors but adapted for the nonlinear link and non-constant variance.[49] For logistic regression as a GLM, the procedure exploits the exponential family structure, ensuring the score equations align with weighted residuals. Computationally, IRLS offers advantages for datasets up to thousands of observations, as each step requires only a linear system solve via $ \mathbf{X}^T \mathbf{W} \mathbf{X} $, which is efficient and leverages optimized linear algebra routines, though it may underperform stochastic gradient methods for very large-scale problems.[51]
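The IRLS steps listed above can be sketched for the two-parameter case (intercept and one slope), where the weighted normal equations form a 2×2 system that can be solved in closed form. The dataset, the tolerance, and the function name `irls` are illustrative assumptions; a general implementation would use a linear algebra library for the solve.

```python
import math

def irls(xs, ys, n_iter=25, tol=1e-10):
    """IRLS for logit(p) = b0 + b1 * x, starting from b0 = b1 = 0."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the entries of X^T W X and X^T W z.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x                    # linear predictor
            p = 1.0 / (1.0 + math.exp(-eta))     # fitted probability
            w = p * (1.0 - p)                    # weight
            z = eta + (y - p) / w                # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted least squares system by Cramer's rule.
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Illustrative data: hours studied and whether the exam was passed.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

b0, b1 = irls(hours, passed)
print(b0, b1)
```

This sketch has no safeguards such as step halving, so it can fail on perfectly separated data, where the weights collapse toward zero and the estimates diverge.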

Regularized Estimation

Regularization addresses overfitting in logistic regression by penalizing large coefficients in the estimation objective, which is particularly beneficial when the number of predictors exceeds the sample size (p ≫ n), as occurs in genomics and big data contexts. The penalized negative log-likelihood takes the form $ \hat{\beta} = \arg\min_{\beta} \left[ -\sum_{i=1}^n \log p(y_i \mid x_i, \beta) + \lambda \sum_{j=1}^p |\beta_j|^q \right] $, where $ \lambda \geq 0 $ tunes the penalty strength and q determines the norm.[52] This approach stabilizes variance at the cost of introducing bias, with $ \lambda $ selected via k-fold cross-validation to minimize out-of-sample deviance.[53]

L2 regularization (ridge, q=2) imposes a penalty $ \lambda \sum_j \beta_j^2 $, which shrinks all coefficients toward zero proportionally but rarely sets them exactly to zero, aiding multicollinearity handling without variable elimination.[54] In contrast, L1 regularization (lasso, q=1) uses $ \lambda \sum_j |\beta_j| $, driving many coefficients to precisely zero and thereby inducing sparsity for inherent feature selection, which enhances interpretability in sparse high-dimensional settings.[55] Elastic net extends this by linearly combining L1 and L2 penalties ($ \lambda (\alpha \sum_j |\beta_j| + (1-\alpha) \sum_j \beta_j^2) $, with $ \alpha \in [0,1] $), mitigating lasso's limitations in highly correlated predictors while retaining sparsity.[56]

Cross-validation for $ \lambda $ involves evaluating a grid of values on held-out folds, selecting the one minimizing average prediction error, often using one-standard-error rules for parsimony in implementations like glmnet.[53] This tuning bridges regularization to model selection, as sparsity patterns vary with $ \lambda $, allowing pathwise analysis of coefficient inclusion.

Post-2020 applications in genomics demonstrate lasso's efficacy for sparse signal recovery; for example, GFLASSO-LR applies generalized fused lasso to logistic regression on microarray data, achieving dimension reduction and accurate gene set identification by enforcing grouped and fused sparsity.[57] In large-scale healthcare predictions, empirical comparisons across thousands of variables show L1 and elastic net yielding superior discriminative accuracy over pure L2 or unpenalized models, with robustness to validation splits.[56] These methods' computational efficiency via coordinate descent suits big data, where unregularized estimation fails due to non-convergence.[58]
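How an L1 penalty produces exact zeros can be illustrated with proximal gradient descent, where each gradient step on the mean negative log-likelihood is followed by soft-thresholding. This is a swapped-in technique chosen for brevity, not the coordinate descent used by packages such as glmnet; the dataset, step size, and penalty values are illustrative assumptions, and the intercept is left unpenalized.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_logistic(xs, ys, lam, lr=0.1, n_iter=5000):
    """Proximal gradient descent on mean NLL + lam * |b1| (intercept unpenalized)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n          # gradient of the mean NLL w.r.t. intercept
            g1 += (p - y) * x / n      # gradient of the mean NLL w.r.t. slope
        b0 -= lr * g0
        b1 = soft_threshold(b1 - lr * g1, lr * lam)
    return b0, b1

# Illustrative data: hours studied and whether the exam was passed.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

for lam in (0.0, 0.05, 5.0):
    print(lam, fit_lasso_logistic(hours, passed, lam))
```

With a large enough penalty the slope is held at exactly zero rather than merely made small, which is the sparsity property that distinguishes the L1 penalty from the L2 penalty.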

Bayesian Approaches

In Bayesian logistic regression, the regression coefficients $ \beta $ are assigned prior distributions, typically independent normal distributions centered at zero with weakly informative variances to reflect limited prior knowledge while preventing extreme values. For instance, a standard deviation of 2.5 on the standardized scale has been recommended for logistic coefficients to stabilize estimates without strong assumptions.[59] The intercept may receive a broader prior, such as normal with mean zero and standard deviation 10, to accommodate varying baseline probabilities.[60] These priors encode beliefs about parameter plausibility before observing data, enabling the incorporation of substantive knowledge, such as effect sizes from prior studies.[61]

The posterior distribution $ p(\beta \mid y) $ combines the logistic likelihood (Bernoulli for individual binary outcomes, $ y_i \sim \text{Bernoulli}(p_i) $ where $ \operatorname{logit}(p_i) = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta} $) with the prior via Bayes' theorem. Lacking conjugacy for the logistic form, exact posteriors are intractable, necessitating approximation methods like Markov chain Monte Carlo (MCMC) algorithms, including Metropolis-Hastings or Hamiltonian Monte Carlo via tools such as Stan, or variational inference for faster but approximate inference.[62][63] MCMC samples from the joint posterior, yielding marginal distributions for each $ \beta_j $ and enabling credible intervals that represent the probability content directly, contrasting with frequentist intervals derived from sampling distributions under fixed parameters.[64]

This framework excels in small samples by leveraging priors for regularization, reducing overfitting compared to maximum likelihood estimates that can diverge with sparse data.[65] It also supports hierarchical extensions, such as random effects models where coefficients vary across groups with hyperpriors, facilitating partial pooling and borrowing strength from related units. For validation, posterior predictive checks generate replicate datasets $ \tilde{y} $ from the posterior predictive distribution $ p(\tilde{y} \mid y) $, assessing fit by comparing statistics like discrepancy measures between observed and simulated data.[66] Such checks reveal model inadequacies, such as unmodeled heterogeneity, more coherently than asymptotic diagnostics.[67]
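A random-walk Metropolis sampler is about the simplest MCMC scheme for this posterior and can be sketched with the standard library alone. The prior standard deviations follow the values mentioned above (10 for the intercept, 2.5 for the slope); the dataset, proposal step, chain length, and seed are illustrative assumptions, and the predictor is only centered rather than fully standardized. Production work would normally use a tool such as Stan.

```python
import math
import random

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def log_posterior(b0, b1, xs, ys, sd_intercept=10.0, sd_slope=2.5):
    """Unnormalized log posterior: Bernoulli log-likelihood plus normal log priors."""
    ll = sum(y * (b0 + b1 * x) - softplus(b0 + b1 * x) for x, y in zip(xs, ys))
    lp = -0.5 * (b0 / sd_intercept) ** 2 - 0.5 * (b1 / sd_slope) ** 2
    return ll + lp

def metropolis(xs, ys, n_samples=4000, burn_in=1000, step=0.4, seed=0):
    """Random-walk Metropolis over (intercept, slope)."""
    rng = random.Random(seed)
    b0, b1 = 0.0, 0.0
    current = log_posterior(b0, b1, xs, ys)
    draws = []
    for i in range(n_samples + burn_in):
        c0 = b0 + rng.gauss(0.0, step)
        c1 = b1 + rng.gauss(0.0, step)
        proposed = log_posterior(c0, c1, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            b0, b1, current = c0, c1, proposed
        if i >= burn_in:
            draws.append((b0, b1))
    return draws

# Illustrative data: hours studied (centered below) and whether the exam was passed.
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
mean_hours = sum(hours) / len(hours)
centered = [h - mean_hours for h in hours]

draws = metropolis(centered, passed)
slopes = sorted(b1 for _, b1 in draws)
slope_mean = sum(slopes) / len(slopes)
lower = slopes[int(0.025 * len(slopes))]   # approximate 95% credible interval
upper = slopes[int(0.975 * len(slopes))]
print(slope_mean, (lower, upper))
```

The interval printed at the end is read directly as a range containing the slope with roughly 95% posterior probability, given the model and priors assumed here.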

Inference and Evaluation

Likelihood-Based Tests

The likelihood ratio test (LRT) evaluates hypotheses concerning subsets of parameters in logistic regression by comparing the fit of nested models, where the reduced model imposes restrictions such as setting certain coefficients to zero. The test statistic is $ -2(\ell_R - \ell_F) $, with $ \ell_R $ and $ \ell_F $ denoting the maximized log-likelihoods of the reduced and full models, respectively; under the null hypothesis, this statistic asymptotically follows a $ \chi^2 $ distribution with degrees of freedom equal to the difference in the number of free parameters between the models.[68][69] In the framework of generalized linear models, which encompasses logistic regression, the deviance $ D = -2(\ell - \ell_S) $, where $ \ell_S $ is the log-likelihood of the saturated model that fits the data perfectly (for ungrouped binary outcomes $ \ell_S = 0 $, so $ D = -2\ell $), provides a convenient computational form, such that the difference $ D_R - D_F $ equals the LRT statistic for nested models.[70] This equivalence holds because the saturated model's log-likelihood is the same constant in both deviances and cancels out in the comparison.[71] The test is applied to assess variable inclusion by fitting a full model with candidate predictors and a reduced model excluding them, rejecting the null if the deviance difference exceeds the critical $ \chi^2 $ value.[72]

The $ \chi^2 $ approximation relies on large-sample asymptotics, assuming the models are correctly specified and the information matrix is positive definite, which derives from the consistency and normality of maximum likelihood estimators under regularity conditions.[73][74] For instance, in binary logistic regression with $ n $ observations, reliable inference typically requires $ n \gg p $ (where $ p $ is the parameter count) and sufficient events per predictor level to avoid sparse data issues.[75] Empirical applications reveal caveats, particularly in small samples or with rare outcomes, where the deviance difference can exhibit upward bias, inflating type I error rates beyond the nominal level due to poor accuracy of the asymptotic approximation or
separation (perfect prediction in subsets).[76][77] In such cases, the LRT may overestimate significance, prompting alternatives like Firth's penalized likelihood, which reduces bias in coefficient estimates and derived tests by incorporating a Jeffreys prior adjustment.[78]

Relatedly, McFadden's pseudo-$ R^2 = 1 - \ell_M / \ell_0 $ (comparing the model to the null intercept-only fit) offers a likelihood-based summary for model adequacy but lacks a direct probabilistic interpretation and can misleadingly remain low (e.g., below 0.2) even for predictive models, as it penalizes deviation from the null without accounting for overfitting or sample specifics.[79][80] Thus, while useful for ranking nested models, it should not supplant formal LRT p-values without corroboration from simulation-based validation in finite samples.[81]
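
The test statistic and McFadden's pseudo-$ R^2 $ described above can be computed from maximized log-likelihoods alone. The sketch below uses only the standard library; the log-likelihood values are hypothetical, and `chi2_sf` is a helper written for this example (it evaluates the chi-squared upper tail for integer degrees of freedom through the standard recurrence on the incomplete gamma function), not a library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    # closed-form base cases: df = 1 (via erfc) and df = 2 (exponential tail)
    if df % 2 == 1:
        q, nu = math.erfc(math.sqrt(x / 2.0)), 1
    else:
        q, nu = math.exp(-x / 2.0), 2
    # recurrence: Q(nu + 2) = Q(nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        q += (x / 2.0) ** (nu / 2.0) * math.exp(-x / 2.0) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

def likelihood_ratio_test(ll_reduced, ll_full, df):
    """Return the LRT statistic -2(l_R - l_F) and its asymptotic chi-squared p-value."""
    stat = -2.0 * (ll_reduced - ll_full)
    return stat, chi2_sf(stat, df)

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2 = 1 - l_M / l_0."""
    return 1.0 - ll_model / ll_null

# Hypothetical fits: the full model adds two predictors to the reduced model
stat, p_value = likelihood_ratio_test(ll_reduced=-68.2, ll_full=-64.1, df=2)
r2 = mcfadden_r2(ll_model=-64.1, ll_null=-69.3)
```

In these made-up numbers the statistic is 8.2 on 2 degrees of freedom, which exceeds the 5% critical value of about 5.99, while the pseudo-$ R^2 $ stays below 0.1, illustrating how the two summaries can give different impressions of the same fit.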

Goodness-of-Fit Measures

Goodness-of-fit measures in logistic regression assess the discrepancy between observed binary outcomes and those predicted by the model, helping to determine if the logistic form adequately describes the data-generating process. These tests are particularly useful for detecting global misspecification, such as omitted nonlinearities or interactions, though they rely on asymptotic approximations and can be sensitive to sample size. Common approaches include chi-squared-based statistics and deviance measures, often applied after grouping observations to stabilize variance under the binomial assumption.[82][83]

The Hosmer-Lemeshow test sorts observations by predicted probability and splits them into 10 (or sometimes fewer) groups of roughly equal size, then computes a Pearson chi-squared statistic from the observed versus expected event counts in each group: $ \chi^2_{HL} = \sum_{i=1}^g \frac{(O_i - E_i)^2}{E_i (1 - E_i / n_i)} $, where $ O_i $ is the observed number of successes, $ E_i = n_i \hat{p}_i $ the expected number of successes (with $ \hat{p}_i $ the mean predicted probability in group $ i $), $ n_i $ the group size, and $ g $ the number of groups, with $ g - 2 $ degrees of freedom. Under the null of adequate fit, this approximately follows a chi-squared distribution; a p-value above 0.05 is typically read as no evidence of lack of fit, though the test lacks power in small samples (n < 400) and rejects valid models in large samples due to trivial discrepancies amplified by grouping artifacts.[84][85][86]

The Pearson chi-squared statistic extends this idea more generally, aggregating across user-defined or risk-set groups: $ \chi^2_P = \sum_k \frac{(y_k - \hat{\mu}_k)^2}{\hat{\mu}_k (1 - \hat{\mu}_k / n_k)} $, adjusted for binomial variance, and compared to a chi-squared with degrees of freedom equal to the number of groups minus parameters minus 1.
It flags poor fit if the statistic per degree of freedom deviates significantly from 1, but like the Hosmer-Lemeshow test, it performs poorly with sparse data or when events are rare, as expected counts below 5 inflate Type I errors.[82][83][87]

Deviance-based measures provide an alternative, with the residual deviance $ D = -2 \ln(L_m / L_s) $ quantifying twice the difference in log-likelihood between the saturated model $ L_s $ and the fitted model $ L_m $; it is asymptotically chi-squared under the null for grouped data with adequate group sizes. Deviance residuals $ r_{D,k} = \operatorname{sign}(y_k - \hat{p}_k) \sqrt{-2 [y_k \ln(\hat{p}_k / y_k) + (1 - y_k) \ln((1 - \hat{p}_k)/(1 - y_k)) ]} $ (taking $ 0 \ln 0 = 0 $ when $ y_k = 0 $ or 1) highlight local discrepancies; values exceeding 2 to 3 in absolute magnitude suggest outliers or influential points, and Q-Q plots against chi-squared quantiles or versus linear predictors aid diagnosis of systematic patterns like overdispersion.[88][89][90]

In heterogeneous populations, such as those with unmodeled subgroup effects or varying baseline risks, standard tests like Hosmer-Lemeshow often fail to detect misspecification, yielding non-significant results despite biased predictions, as grouping averages mask local failures; simulations confirm this insensitivity unless heterogeneity exceeds 20-30% variance share.[86][91][83]
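
The grouping statistic and the deviance residual described above can be sketched in a few lines. The version below assumes the standard Hosmer-Lemeshow form, with the binomial variance $ E_i (1 - E_i / n_i) $ in the denominator and $ g - 2 $ degrees of freedom, and assumes predicted probabilities strictly between 0 and 1; the function names and the toy inputs are illustrative only.

```python
import math

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow statistic over g groups ordered by predicted probability.

    Returns (statistic, degrees_of_freedom). Assumes 0 < p < 1 for every observation.
    """
    pairs = sorted(zip(p, y), key=lambda t: t[0])  # stable sort by predicted probability
    n = len(pairs)
    stat = 0.0
    for i in range(g):
        group = pairs[i * n // g:(i + 1) * n // g]  # near-equal group sizes
        n_i = len(group)
        if n_i == 0:
            continue
        observed = sum(yi for _, yi in group)  # O_i
        expected = sum(pi for pi, _ in group)  # E_i = n_i * mean predicted probability
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / n_i))
    return stat, g - 2

def deviance_residual(y_k, p_k):
    """Signed square root of the observation's contribution to the deviance (y_k in {0, 1})."""
    loglik = math.log(p_k) if y_k == 1 else math.log(1.0 - p_k)
    return math.copysign(math.sqrt(-2.0 * loglik), y_k - p_k)

# Toy check: two groups of two observations, every prediction equal to 0.5
balanced, _ = hosmer_lemeshow([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5], g=2)
lopsided, _ = hosmer_lemeshow([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5], g=2)
```

In the toy check the first ordering gives each group one event against one expected, so the statistic is zero, while the second concentrates both events in one group and is penalized accordingly.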

Predictive Accuracy and Calibration

Predictive accuracy of logistic regression models is assessed via out-of-sample metrics to gauge generalization beyond training data, emphasizing discrimination (the ability to rank cases by outcome likelihood) and calibration (the alignment of predicted probabilities with observed frequencies).[92] These evaluations require techniques like k-fold cross-validation, which partitions data into subsets for repeated training and testing, yielding less optimistic estimates than in-sample evaluation, which overstates real-world performance when the model overfits.[93]

Discrimination is quantified by the area under the receiver operating characteristic curve (AUC-ROC), representing the probability that a positive instance receives a higher predicted score than a negative one, with values above 0.8 often indicating strong separation in balanced datasets.[94] For imbalanced classes, the area under the precision-recall curve (AUC-PR) supplements AUC-ROC by prioritizing positive class precision and recall, as ROC can mask poor minority-class performance.[95]

Calibration evaluates whether a model's predicted probabilities are reliable estimates of event occurrence; for instance, among cases assigned a 0.3 probability of the positive outcome, approximately 30% should exhibit it empirically.[96] The Brier score measures this via the mean squared error between predictions and binary outcomes, ranging from 0 (perfect) to 1 (worst) for binary tasks, with 0.25 being the score of an uninformative constant prediction of 0.5; it penalizes both miscalibration and overconfident predictions while rewarding sharpness in well-calibrated models.[95] Cross-validated Brier scores ensure out-of-sample reliability, as training-set calibration often deteriorates externally due to optimism bias.[97] Calibration plots, or reliability diagrams, stratify predictions into deciles or bins and graph observed event rates against average predicted probabilities; ideal alignment follows the 45-degree line, with deviations signaling over- or under-prediction that could mislead
decision-making.[98] If miscalibration persists post-validation—potentially from regularization or sparse data—Platt scaling applies a monotonic transformation by fitting a supplementary logistic regression on held-out scores as inputs and outcomes as targets, recalibrating probabilities while preserving rank order and thus discrimination.[99] This method, validated on cross-validation folds, enhances probabilistic interpretability without retraining the primary model, though its efficacy assumes sufficient validation data to avoid further bias.[100]
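
The discrimination and calibration summaries discussed in this section reduce to short computations once held-out predictions are available. The following sketch is illustrative: the function names are made up for this example, the AUC uses the pairwise (rank) definition given above with ties counted as one half, and the calibration table uses equal-width bins rather than deciles.

```python
def auc_roc(y, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y, p):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

def calibration_table(y, p, n_bins=10):
    """Rows of (mean predicted probability, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        idx = min(int(pi * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((pi, yi))
    return [
        (sum(pi for pi, _ in b) / len(b), sum(yi for _, yi in b) / len(b), len(b))
        for b in bins if b
    ]

# Hypothetical held-out labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(y_true, y_prob)        # 3 of 4 positive-negative pairs ranked correctly
brier = brier_score(y_true, y_prob)
table = calibration_table(y_true, y_prob, n_bins=2)
```

The pairwise AUC is quadratic in the sample size, which is fine for illustration; rank-based implementations are preferable for large datasets.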

Model Selection Criteria

In logistic regression, model selection criteria aim to balance explanatory power against model complexity to avoid overfitting, emphasizing parsimonious models that generalize well to new data. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are prominent penalization methods derived from asymptotic approximations to expected predictive error. AIC is computed as $ \text{AIC} = -2 \ell + 2p $, where $ \ell $ is the maximized log-likelihood and $ p $ is the number of parameters (including the intercept); it estimates the relative expected Kullback-Leibler divergence between the true model and the fitted model. BIC, defined as $ \text{BIC} = -2 \ell + p \ln n $ with $ n $ denoting sample size, imposes a harsher penalty on additional parameters whenever $ \ln n > 2 $ (that is, $ n \geq 8 $), approximating the marginal likelihood under a Bayesian framework with a unit information prior. These criteria favor models minimizing the respective scores, with lower values indicating better trade-offs between fit and complexity.

Empirical evaluations in logistic regression contexts, such as variable selection for binary outcomes in medical and social sciences, have shown BIC outperforming AIC in predictive validity, particularly in finite samples where AIC's lighter penalty can lead to overly complex models prone to poor out-of-sample performance. For instance, simulations with moderate sample sizes (n ≈ 500–1000) demonstrate BIC's superior recovery of true sparse models, reducing false positives in predictor inclusion by up to 20–30% compared to AIC. This is consistent with BIC's sample-size-dependent penalty, which discourages extraneous variables that inflate apparent fit and so favors models that reflect the underlying data-generating process rather than noise.
Cross-validation alternatives, like k-fold methods, complement these by directly assessing predictive accuracy but require larger datasets to stabilize estimates, whereas AIC and BIC leverage the full likelihood without partitioning.

Stepwise selection procedures, which iteratively add or remove predictors based on significance tests or information criteria thresholds, offer automation but carry substantial risks of data dredging: they capitalize on chance correlations in the training data, yielding unstable and non-reproducible models. Studies across logistic applications, including epidemiological risk modeling, report stepwise methods inflating Type I errors by 10–50% and producing over-optimistic coefficient estimates, as they implicitly perform multiple tests without adjustment. Because stepwise procedures do not account for the many selection paths they explore, they often select predictors whose association with the outcome is spurious, undermining causal inference; manual forward or backward selection informed by domain knowledge, or regularization techniques like LASSO (addressed elsewhere), mitigate these pitfalls by enforcing sparsity a priori. Thus, while computationally convenient, stepwise approaches demand cautious interpretation, with empirical evidence favoring criteria-guided all-subsets enumeration or Bayesian variable selection for reliable results in high-dimensional settings.
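
The two criteria defined in this section differ only in their penalty term, which is enough for them to disagree. The sketch below applies the formulas $ \text{AIC} = -2\ell + 2p $ and $ \text{BIC} = -2\ell + p \ln n $ to two hypothetical candidate models; the log-likelihoods, parameter counts, and sample size are invented for illustration.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion: -2*loglik + 2*p."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2*loglik + p*ln(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Hypothetical candidates: (maximized log-likelihood, number of parameters)
candidates = {"small": (-100.0, 3), "large": (-97.5, 5)}
n_obs = 200

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_obs))
```

Here the larger model improves the log-likelihood by 2.5 at the cost of two parameters: AIC (205 versus 206) narrowly prefers it, while BIC, charging $ \ln 200 \approx 5.3 $ per parameter, prefers the smaller model.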

Interpretations

Generalized Linear Model Framework

Logistic regression constitutes a specific instance of the generalized linear model (GLM) framework, which accommodates response variables distributed according to exponential family distributions by linking the mean of the response to a linear predictor through a monotonic link function.[101] In this setup, the GLM comprises three components: a random component specifying the distribution of the response $ Y $ (binomial for logistic regression), a systematic component as the linear predictor $ \eta = \mathbf{X}\boldsymbol{\beta} $, and a link function $ g $ such that $ g(\mu) = \eta $, where $ \mu = E(Y) $.[102] For logistic regression modeling binary outcomes, the response follows a Bernoulli distribution (a special case of binomial with trials $ n=1 $), yielding mean $ \mu $ and variance function $ V(\mu) = \mu(1 - \mu) $.[103] The canonical link for the binomial family, which aligns the natural parameter of the exponential family with the linear predictor, is the logit function: $ g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) $, inverting to $ \mu = \frac{1}{1 + e^{-\eta}} $.[104] This formulation unifies logistic regression with other GLMs, such as Poisson regression (variance $ V(\mu) = \mu $, canonical log link) or Gaussian linear regression (constant variance, identity link), enabling shared theoretical properties like asymptotic normality of estimators under regularity conditions.[101]

The GLM framework offers advantages in modularity for estimation and inference: maximum likelihood proceeds via iteratively reweighted least squares, leveraging the link and variance functions without requiring response transformation to normality, and hypothesis testing employs standardized tools like score, Wald, or likelihood ratio tests derived from the exponential family structure.[102] For cases of overdispersion, where empirical variance exceeds the binomial $ V(\mu) = \mu(1 - \mu) $ due to unobserved heterogeneity or clustering, quasi-likelihood
extends the approach by retaining the logit link for the mean while scaling the variance to $ \phi V(\mu) $ with estimated dispersion $ \phi > 1 $, leaving the point estimates unchanged and inflating the model-based standard errors by a factor of $ \sqrt{\phi} $; robust standard errors via a sandwich covariance adjustment are a related alternative.[105] This empirical adaptation maintains the GLM's estimating equations without invoking a full alternative distribution, though it sacrifices some efficiency relative to correctly specified likelihoods.[105]
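
The iteratively reweighted least squares procedure mentioned above can be written directly in terms of the GLM components: the inverse link gives $ \mu $, the variance function gives the weights, and each step solves a weighted least-squares problem for a working response. The following is a minimal sketch, assuming NumPy, non-separable data, and a design matrix that already contains an intercept column; the function name and toy data are illustrative.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Maximum likelihood for logistic regression via IRLS (Newton's method for the logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # inverse logit link
        w = mu * (1.0 - mu)                 # variance function V(mu), used as weights
        z = eta + (y - mu) / w              # working response
        XtW = X.T * w                       # X^T W with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Toy data (hypothetical): one predictor, outcomes overlap so the MLE is finite
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta_hat = fit_logistic_irls(X, y)
fitted = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
score = X.T @ (y - fitted)  # likelihood equations: should be numerically zero at the MLE
```

With perfectly separated data the weights shrink toward zero and the iteration diverges, which is the computational face of the separation problem.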

Latent Variable Representation

One interpretation of the binary logistic regression model posits an underlying latent continuous variable $ \eta = \mathbf{x}^\top \boldsymbol{\beta} + \epsilon $, where $ \mathbf{x} $ denotes the vector of predictors, $ \boldsymbol{\beta} $ the coefficients, and $ \epsilon $ follows a standard logistic distribution with mean 0 and variance $ \pi^2/3 $.[106][107] The observed binary outcome $ Y $ is then defined as $ Y = 1 $ if $ \eta > 0 $ and $ Y = 0 $ otherwise, representing a threshold-crossing process where the latent propensity exceeds a normalized cutoff of zero.[108] This formulation yields the probability $ P(Y=1 \mid \mathbf{x}) = F(\mathbf{x}^\top \boldsymbol{\beta}) $, with $ F $ as the cumulative distribution function of the logistic distribution, simplifying to the canonical logistic form $ \frac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}} $.[106][107]

This latent variable framework contrasts with the probit model, which assumes $ \epsilon \sim \mathcal{N}(0,1) $ instead, producing $ P(Y=1 \mid \mathbf{x}) = \Phi(\mathbf{x}^\top \boldsymbol{\beta}) $ where $ \Phi $ is the standard normal CDF.[109] The logistic error distribution features heavier tails than the normal, so predicted probabilities approach 0 or 1 more slowly as the absolute value of the linear predictor grows; this makes the logit model somewhat more tolerant of outlying observations, though tail predictions from either model should be validated empirically.[110] Empirical choice between logistic and probit often hinges on goodness-of-fit diagnostics rather than theoretical purity, as both approximate latent thresholds but differ in error tail behavior and computational tractability.[111]

In econometric applications, this representation supports causal modeling of discrete choices via random utility maximization, where the latent $ \eta $ captures an agent's unobservable utility net of systematic components $ \mathbf{x}^\top \boldsymbol{\beta} $, with logistic errors arising from differences in
Gumbel-distributed idiosyncratic shocks across alternatives.[112] For binary decisions, such as market entry or policy adoption, the model infers underlying propensities from observed thresholds, facilitating counterfactual analysis of how shifts in observables alter choice probabilities under stable error structures.[113] This approach emphasizes causal mechanisms over mere correlation, grounding predictions in substantive processes like utility trade-offs rather than ad hoc functional forms.[112]
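
The threshold-crossing construction described in this section can be checked by simulation: drawing standard logistic noise, adding it to a fixed linear predictor, and counting how often the latent variable exceeds zero should reproduce the logistic function. The sketch below uses only the standard library; the linear predictor value of 0.8, the number of draws, and the function names are arbitrary choices for illustration.

```python
import math
import random

def simulate_latent_choice(linear_predictor, n_draws=200_000, seed=0):
    """Fraction of draws with latent variable (linear predictor + logistic noise) above 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # clamp away from 0 and 1 so the inverse CDF below is always finite
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        eps = math.log(u / (1.0 - u))  # standard logistic draw via inverse CDF
        if linear_predictor + eps > 0:
            hits += 1
    return hits / n_draws

def logistic_cdf(z):
    """Standard logistic CDF, F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

empirical = simulate_latent_choice(0.8)
theoretical = logistic_cdf(0.8)
```

Replacing the logistic draw with `rng.gauss(0.0, 1.0)` would give the probit analogue, whose exceedance frequency matches the standard normal CDF instead.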

Log-Linear and Discriminatory Views

Log-linear models for categorical data, particularly in the analysis of contingency tables, parameterize the logarithm of expected cell probabilities or frequencies as a linear combination of main effects and interactions among all variables, treating predictors and outcomes symmetrically.[114] This approach models the joint distribution, facilitating the examination of associations and independence structures across the full table, such as in multi-way classifications where cell counts follow a Poisson or multinomial distribution.[115]

In the discriminatory view, logistic regression directly models the conditional probability of the outcome given predictors, via the logit link: $ \log\left(\frac{P(Y=1 \mid X)}{1-P(Y=1 \mid X)}\right) = \beta_0 + \beta^T X $. This focuses exclusively on the decision boundary separating outcome classes, bypassing estimation of the predictors' marginal distribution $ P(X) $, which log-linear models incorporate implicitly through the joint. For prediction and classification tasks, the discriminatory approach is often preferred, as generative methods like log-linear models require accurate specification of $ P(X) $, whose misspecification propagates errors into the derived conditional $ P(Y \mid X) $.[116] Theoretical analyses show that discriminative classifiers, including logistic regression, attain asymptotic classification error no higher than that of comparable generative models, such as naive Bayes (a simplified log-linear analog), and strictly lower when the generative assumptions are violated.[116] The generative model can nevertheless approach its own asymptotic error with fewer training samples (on the order of $ \log d $ for naive Bayes versus $ d $ for logistic regression with $ d $-dimensional $ X $), so the discriminative advantage typically emerges as the sample size grows.[116]

In non-representative sampling schemes, such as case-control designs where outcome prevalence is artificially balanced, log-linear models under prospective assumptions yield biased joint estimates unless retrospectively
adjusted, whereas logistic regression produces consistent odds ratio coefficients $ \exp(\beta) $, with only the intercept offset by sampling fractions.[117] This invariance of odds ratios to outcome sampling proportions preserves the discriminatory model's validity for relative risk approximation in retrospective data, a property absent in unadjusted log-linear fits.[117]
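
The statement that outcome-dependent sampling shifts only the intercept follows from a short application of Bayes' theorem. In the sketch below, $ S $ is an indicator that a subject is included in the sample, $ \pi_1 = P(S=1 \mid Y=1) $ and $ \pi_0 = P(S=1 \mid Y=0) $ are the sampling fractions for cases and controls, and $ p(X) = P(Y=1 \mid X) $ is the population model; these symbols are introduced here for illustration, and the derivation assumes that selection depends on $ Y $ only, not on $ X $.

```latex
\begin{aligned}
P(Y=1 \mid X, S=1)
  &= \frac{\pi_1\, p(X)}{\pi_1\, p(X) + \pi_0\,\bigl(1 - p(X)\bigr)}, \\[4pt]
\log\frac{P(Y=1 \mid X, S=1)}{P(Y=0 \mid X, S=1)}
  &= \log\frac{\pi_1}{\pi_0} + \log\frac{p(X)}{1 - p(X)}
   = \Bigl(\beta_0 + \log\frac{\pi_1}{\pi_0}\Bigr) + \beta^T X .
\end{aligned}
```

The slope vector $ \beta $, and hence every odds ratio $ \exp(\beta_j) $, is unchanged; only the intercept absorbs the log ratio of sampling fractions, so absolute risks cannot be recovered from case-control data without knowing $ \pi_1 / \pi_0 $.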

Neural Network Analogy

Logistic regression functions as a single-layer neural network, equivalent to a perceptron where the activation is the sigmoid function applied to a linear combination of input features plus a bias term.[118] The model computes $ p(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) $, with $ \sigma(z) = \frac{1}{1 + e^{-z}} $ providing a smooth, differentiable approximation to the step function used in early perceptrons.[118] This structure outputs a probability between 0 and 1, enabling probabilistic interpretation over binary decisions. For classification, a threshold such as 0.5 is applied to the sigmoid output to determine the class label, paralleling the binary output of a perceptron but with probabilistic calibration.[119]

Training proceeds by minimizing the binary cross-entropy loss, $ -\sum [y \log p + (1-y) \log (1-p)] $, via gradient descent, where parameter updates follow $ \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L $, with the gradient derived from the sigmoid's derivative $ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $.[120] This optimization mirrors the weight adjustment in single-layer networks, establishing logistic regression as a foundational case of neural network training.

The analogy reveals logistic regression's constraint to linear separability in the input space, as the decision boundary remains hyperplanar despite the sigmoid's non-linearity in the output.[121] Complex datasets with non-linear boundaries require feature engineering, such as polynomial expansions, to fit effectively; otherwise, performance falters, underscoring the need for multi-layer architectures in deep learning to compose non-linear transformations through hidden layers.[122]
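
The training loop described above can be sketched as a single sigmoid neuron fitted by batch gradient descent. The example below is illustrative: it uses the mean rather than the sum of the cross-entropy terms (which only rescales the learning rate), the learning rate and step count are arbitrary, and the OR-gate data is a toy problem chosen because it is linearly separable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Single neuron: sigmoid of the weighted sum of inputs plus bias."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cross_entropy(w, b, X, y):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def train(X, y, lr=0.5, n_steps=2000):
    """Batch gradient descent starting from zero weights."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(n_steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # dL/dz for cross-entropy with a sigmoid output
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(y) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(y)
    return w, b

# Toy problem: logical OR, which a single hyperplane can separate
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
initial_loss = cross_entropy([0.0, 0.0], 0.0, X, y)
w, b = train(X, y)
final_loss = cross_entropy(w, b, X, y)
labels = [1 if predict_proba(w, b, xi) >= 0.5 else 0 for xi in X]
```

The error term `p - y` is what remains after the sigmoid derivative $ \sigma(z)(1 - \sigma(z)) $ cancels against the derivative of the cross-entropy. Swapping the labels to XOR (`[0, 1, 1, 0]`) would leave the same loop unable to classify all four points, which is the linear-boundary limitation noted above.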

Applications

Classical Statistical Uses

Logistic regression originated in classical statistics through David Cox's 1958 formulation for the analysis of binary sequences, particularly in quantal bioassays where it models the probability of a positive response (e.g., toxicity or efficacy) as a logistic function of dosage or stimulus intensity.[26] This approach facilitated maximum likelihood estimation of parameters and hypothesis tests, such as likelihood ratio tests comparing nested models to assess the significance of regressors in predicting binary outcomes under exponential family assumptions.[123] In biostatistics, it became a cornerstone for testing associations in dose-response experiments, emphasizing inference on coefficients via Wald statistics or score tests rather than out-of-sample prediction.[124] In epidemiological applications, such as case-control studies, logistic regression estimates odds ratios as exp(β), where β represents the log-odds change per unit predictor, providing a basis for hypothesis tests on exposure-disease links after covariate adjustment.[14] For rare events, these odds ratios approximate relative risks, enabling tests of the null hypothesis β=0 against alternatives of association, as validated in designs sampling cases and controls separately.[125] This framework supports inference in retrospective studies, with profile likelihood confidence intervals for parameters quantifying uncertainty in estimated effects.[126] Econometrics adopted logistic regression for binary choice models, analyzing decisions like market entry or unemployment duration, where utility maximization implies a latent logistic error, yielding testable predictions on parameters via maximum likelihood.[127] Hypothesis testing here focuses on economic parameters, such as marginal effects at means, using delta method standard errors to evaluate theories of choice under constraints.[128] Despite these inferential strengths, classical uses in observational biostatistical and econometric data have drawn 
empirical critiques for overreliance on regression adjustment without rigorous causal identification, as unmeasured confounders violate the conditional independence assumption, leading to biased coefficient tests that conflate association with causation.[129] Studies show that even extensive covariate control fails to eliminate selection biases in non-experimental settings, underscoring the need for causal realism through methods like randomization or instrumental variables to validate hypothesis tests beyond mere correlation.[130] This limitation arises because logistic models parameterize conditional probabilities without inherently addressing counterfactuals required for causal claims.[131]
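
For the simplest case-control setting discussed in this section, a single binary exposure, the fitted logistic coefficient equals the log of the cross-product ratio of the 2×2 table, so the odds ratio $ \exp(\beta) $ and its Wald interval can be computed by hand. The counts below are hypothetical, and the function name is made up for this example.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed cases,   b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    log_or = math.log((a * d) / (b * c))       # equals the logistic coefficient beta
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log odds ratio
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical study: 20 of 100 exposed and 10 of 100 unexposed subjects are cases
or_hat, ci_low, ci_high = odds_ratio_with_ci(a=20, b=80, c=10, d=90)
```

With these counts the odds ratio is 2.25, but the interval is wide enough to include 1, a reminder that a sizeable point estimate does not by itself establish an association, let alone a causal effect.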

Machine Learning Contexts

In machine learning pipelines, logistic regression serves as a fundamental baseline classifier for binary and multiclass prediction tasks due to its simplicity, interpretability, and computational efficiency.[132] Practitioners routinely train it first to establish performance benchmarks before deploying more intricate models, as it requires minimal hyperparameter tuning and scales well to moderate dataset sizes.[133]

The model's training objective centers on minimizing the binary cross-entropy loss, equivalent to the negative log-likelihood of the observed data under the Bernoulli distribution assumption for binary outcomes.[134][135] This loss penalizes confident incorrect predictions more severely than near-correct ones, promoting calibrated probability outputs; for a single observation, it is defined as $ -\ln p_k $ if the true label $ y_k = 1 $ or $ -\ln(1 - p_k) $ if $ y_k = 0 $, where $ p_k $ is the predicted probability. In high-dimensional settings, such as datasets with thousands of features, L1 or L2 regularization is incorporated into the loss to mitigate overfitting by shrinking coefficients toward zero, with the L1 penalty additionally producing sparse solutions, and improving generalization.[136][137]

The logistic loss also underlies ensemble methods, particularly gradient boosting frameworks like XGBoost or LightGBM, whose binary-classification objectives apply additive updates via functional gradients of the same cross-entropy loss for sequential error correction across iterations, usually with shallow trees rather than logistic regression itself as the weak learners.[138][139] Post-2020 empirical evaluations on tabular datasets reveal that L2-regularized logistic regression achieves discriminative performance comparable to complex models on approximately 55% of tasks, underscoring its robustness without the overhead of deep architectures.[140] This holds especially for structured data where feature interactions are linear or low-order, though regularization strength must be tuned via cross-validation to balance bias and variance.[141]
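
The per-observation loss and the regularization penalties described above combine into a single objective. The sketch below writes the loss in terms of the softplus function for numerical stability and leaves the intercept unpenalized, which is a common convention but an assumption here; the function names and the parameter `lam` (the regularization strength) are illustrative.

```python
import math

def softplus(t):
    """Numerically stable ln(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def penalized_log_loss(w, b, X, y, lam=0.0, penalty="l2"):
    """Summed binary cross-entropy plus an L1 or L2 penalty on the weights (not the bias)."""
    nll = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # -ln p_k = softplus(-z) when y_k = 1; -ln(1 - p_k) = softplus(z) when y_k = 0
        nll += softplus(-z) if yi == 1 else softplus(z)
    if penalty == "l1":
        reg = sum(abs(wj) for wj in w)
    else:
        reg = sum(wj * wj for wj in w)
    return nll + lam * reg

# Hypothetical two-feature dataset with two observations
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1, 0]
base = penalized_log_loss([1.0, -2.0], 0.0, X, y)
with_l1 = penalized_log_loss([1.0, -2.0], 0.0, X, y, lam=0.5, penalty="l1")
with_l2 = penalized_log_loss([1.0, -2.0], 0.0, X, y, lam=0.5, penalty="l2")
```

For the weights `[1.0, -2.0]` the L1 penalty adds `0.5 * 3` and the L2 penalty adds `0.5 * 5`, showing how the squared penalty weighs large coefficients more heavily.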

Domain-Specific Examples

In medicine, logistic regression has been applied to predict the 10-year risk of coronary heart disease using data from the Framingham Heart Study, a prospective cohort initiated in 1948 involving over 5,000 residents of Framingham, Massachusetts.[142] The model incorporates predictors such as age, systolic blood pressure, total cholesterol, and smoking status, yielding coefficients that quantify relative risks; for instance, a one-unit increase in log-transformed cholesterol corresponds to higher odds of disease onset.[143] This approach demonstrated empirical success in identifying modifiable risk factors, with the model's predictions aligning well against observed incidence rates in validation cohorts, though it underperforms in extreme tails due to the rarity of events.[144] In epidemiology, logistic regression is commonly used in cross-sectional studies to analyze associations between environmental exposures and binary health outcomes, estimating odds ratios with stepwise adjustment for confounders such as demographics (e.g., age, sex), behaviors (e.g., smoking), and metabolic factors (e.g., BMI, blood pressure).[145] In social sciences, logistic regression models voter turnout and party preference using survey data like the American National Election Studies (ANES), where binary outcomes such as voting for a Democratic versus Republican candidate are regressed on demographics, ideology, and economic perceptions.[146] For example, analyses of 2016 U.S. 
election data predicted vote intention with coefficients highlighting income and education as significant predictors, achieving modest out-of-sample accuracy around 70-80% in controlled settings.[147] However, empirical failures arise from endogeneity, as unmeasured confounders like social influence or strategic voting introduce bias; studies show models overfit historical data but falter during shifts like unexpected campaign events, with prediction errors exceeding 10% in volatile elections.[148] In finance, logistic regression underpins credit scoring systems to classify loan applicants as low or high default risk, drawing on variables like credit history length, debt-to-income ratio, and payment timeliness from datasets of millions of accounts.[149] A 2022 study on consumer loans from a financial institution reported a model with an area under the ROC curve of 0.75-0.85, enabling scorecards that comply with regulatory demands for interpretability under frameworks like the Basel Accords, where odds ratios directly inform lending thresholds.[150] Successes include reduced default rates by 20-30% in scored portfolios compared to rule-based systems, yet failures manifest in economic downturns, such as the 2008 crisis, where correlated shocks violated independence assumptions, leading to systematic underestimation of risks across cohorts.[151]

Limitations and Criticisms

Core Assumptions and Empirical Violations

Logistic regression posits a linear relationship between the predictors and the logit of the outcome probability. This assumption cannot be checked directly on the raw data, because the logit of an individual binary response (0 or 1) is undefined and only the unobserved probability has a logit.[152] Linearity in the logit implies that the log-odds are a linear combination of the explanatory variables, but empirical checks such as Box-Tidwell tests or augmented models often reveal deviations in real datasets, leading to biased coefficient estimates in simulation studies where nonlinear effects are present.[153]

The model further assumes no perfect multicollinearity among predictors, meaning independent variables should not be linearly dependent; even imperfect but high correlations inflate variance and render coefficients unstable.[154] In practice, multicollinearity frequently occurs in datasets with highly correlated features, such as economic indicators or genomic variables, resulting in imprecise effect estimates and wide confidence intervals, as demonstrated in analyses where variance inflation factors exceed 10.[155]

A critical requirement is sufficient sample size, quantified as at least 10 events per variable (EPV), where "events" refer to the minority class outcomes in binary settings; Monte Carlo simulations by Peduzzi et al.
(1996) showed that EPV below 10 yields upward bias in regression coefficients (up to 100% in extreme cases) and severely underestimated standard errors, compromising model stability and inference validity across varied prevalence rates and effect sizes.[156] Empirical applications in small or imbalanced datasets, common in rare-event studies like medical diagnostics, routinely violate this, with simulations confirming frequent instability even at EPV of 5 to 10 when predictors are continuous or interactions are omitted.[157]

Independence of observations is another foundational assumption, violated by spatial or temporal dependence, such as clustered geographic data where residuals exhibit autocorrelation; this leads to inefficient estimates and invalid hypothesis tests, as the usual model-based standard errors fail to account for the correlation structure.[158] Overdispersion, where outcome variance exceeds the binomial expectation due to unobserved heterogeneity or zero-inflation, is prevalent in fields like ecology and epidemiology, causing standard errors to be too narrow and inflating type I error rates in simulations with outlier-induced dispersion.[159] Remedies such as robust (sandwich) standard errors adjust for heteroskedasticity and mild dependence without altering point estimates, providing consistent inference under misspecification, though they do not address coefficient bias from model form errors.[160]
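
Two of the checks named in this section, events per variable and the variance inflation factor, are simple enough to compute by hand before fitting. The sketch below counts events as the rarer outcome class, as the text defines them, and computes the variance inflation factor only for the two-predictor case, where it reduces to $ 1 / (1 - r^2) $ with $ r $ the Pearson correlation; the function names and data are illustrative.

```python
def events_per_variable(y, n_predictors):
    """Count of the rarer outcome class divided by the number of candidate predictors."""
    n_pos = sum(1 for yi in y if yi == 1)
    events = min(n_pos, len(y) - n_pos)
    return events / n_predictors

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r_squared = sxy * sxy / (sxx * syy)
    return 1.0 / (1.0 - r_squared)

# Hypothetical study: 30 events among 100 subjects, 5 candidate predictors
epv = events_per_variable([1] * 30 + [0] * 70, n_predictors=5)

# Two nearly collinear predictors
vif = vif_two_predictors([1, 2, 3, 4, 5], [1, 2, 3, 4, 6])
```

Here the EPV of 6 falls below the guideline of 10, and the variance inflation factor of 37 is well above the threshold of 10 mentioned in the text, so both checks would flag this hypothetical design.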

Performance Shortcomings

Logistic regression's inherently linear decision boundary in the logit space restricts its capacity to capture non-linear interactions among features without extensive preprocessing or polynomial expansions, resulting in diminished predictive accuracy on datasets with complex manifolds, such as those in computer vision or textual analysis. Post-2020 benchmarks on diverse tabular and image datasets confirm that tree-based ensembles like random forests and gradient boosting machines consistently surpass logistic regression in metrics including accuracy, AUC-ROC, and F1-score, particularly when non-linear patterns predominate.[161][162] In high-dimensional settings exceeding hundreds of features, unregularized logistic regression suffers from the curse of dimensionality, since every feature adds a parameter that must be estimated from limited data, yielding poorer generalization compared to methods that implicitly select and combine features via splitting, as evidenced by controlled experiments on synthetic and real-world high-dimensional benchmarks.[163]

While logistic regression is more robust than squared-error methods, because its negative log-likelihood loss grows only linearly (rather than quadratically) in the margin of a badly misclassified point, it remains vulnerable to influential outliers, especially discordant ones where observed outcomes starkly contradict predicted probabilities, which can inflate variance in maximum likelihood estimates and distort coefficient magnitudes.
Simulations on synthetic datasets illustrate that such outliers shift the fitted hyperplane, reducing overall classification performance by up to 10-15% in AUC under leverage conditions.[164][165]

On structured, low-to-moderate dimensional data like clinical records, logistic regression holds parity with advanced learners in predictive tasks, achieving comparable discrimination and calibration without substantial gains from non-linear alternatives. However, extrapolation beyond the training feature range induces calibration decay, as the linear logit approximation fails to mirror true probabilities, often producing overconfident outputs approaching 0 or 1 despite underlying uncertainty, a flaw amplified in scenarios with skewed distributions or rare events.[166][167] Empirical reliability diagrams from out-of-distribution tests post-2020 underscore this, showing expected calibration error rising by factors of 2-5 outside observed covariate supports.[168]

Misapplication Risks

A frequent misapplication of logistic regression arises from insufficient events per variable (EPV): with fewer than 10 outcome events (counting the rarer of success or failure) per predictor, estimates become unstable, coefficients can be badly biased or diverge toward infinity, and confidence intervals become unreliable. This "rule of 10" guideline, derived from simulations, is intended to support reliable maximum likelihood estimation; violations, common in rare-event data, inflate Type I error rates and reduce model validity, as demonstrated in studies showing parameter instability below 5-10 EPV.[78][169]

Dichotomizing continuous predictors, such as categorizing age or dosage into binary thresholds for interpretability, systematically reduces statistical power and introduces bias by discarding information about the variable's full range. The loss can be comparable to randomly excluding half the data, and the practice masks nonlinear relationships and creates artificial cutoffs that misrepresent associations; empirical evaluations confirm power losses of 20-40% or more depending on the distribution.[170][171]

Overinterpreting odds ratios or coefficients as establishing causation, absent randomized experiments or robust causal identification strategies like instrumental variables, conflates correlation with causality in observational data. Logistic models estimate conditional associations under the fitted link function; unless confounders are controlled and endogeneity is addressed (conditions rarely met in non-experimental settings), causal readings lead to spurious claims, because regression predictors include both causal factors and mere correlates without distinction.[172][173]

P-hacking practices, such as iteratively adding or removing variables or transforming predictors until p-values dip below 0.05, exacerbate these issues by capitalizing on sampling variability in logistic fits, yielding models that fit noise rather than signal and fail to replicate. Similarly, neglecting sampling biases, such as selection into the dataset based on the outcome, distorts logit estimates toward the biased subsample proportions, producing invalid predictions outside the observed distribution unless corrected via weighting or inverse probability methods.[174][175]
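
The EPV guideline described above reduces to a one-line calculation. The following is a minimal sketch, assuming outcomes coded 0/1; the function name `events_per_variable` and the example dataset are hypothetical and chosen only for illustration.

```python
def events_per_variable(outcomes, n_predictors):
    """Return EPV: the count of the rarer outcome class divided by the number of predictors."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    n_events = sum(1 for y in outcomes if y == 1)
    n_non_events = len(outcomes) - n_events
    # The "event" count is taken from the rarer class, whichever label it carries.
    return min(n_events, n_non_events) / n_predictors


# Hypothetical rare-event dataset: 40 events among 1,000 observations, 8 predictors.
outcomes = [1] * 40 + [0] * 960
epv = events_per_variable(outcomes, n_predictors=8)
print(f"EPV = {epv:.1f}")  # 40 / 8 = 5.0, below the rule-of-10 guideline
```

In practice the predictor count would plausibly include every estimated coefficient (for example each dummy variable of a categorical predictor), which is an assumption here rather than something the guideline above spells out.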

Comparisons and Alternatives

Versus Linear Regression

Logistic regression differs from linear regression primarily in its handling of outcome variables: it models the probability of binary events via a logit link function that constrains predictions to the [0,1] interval, while linear regression targets unbounded continuous responses via a direct linear mapping.[176][6] Linear regression applied to binary data—termed the linear probability model (LPM)—can produce predictions exceeding 1 or falling below 0, rendering them invalid as probabilities, especially at extreme predictor values.[177] The logistic sigmoid function, $ p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} $, avoids this by asymptotically approaching 0 and 1, providing a more appropriate functional form for probabilistic interpretations.[178] A key variance distinction arises from their error structures: linear regression assumes homoscedasticity, where residual variance remains constant across fitted values, an assumption often violated when adapting it to binary outcomes due to the inherent Bernoulli variance $ p(1-p) $.[179] Logistic regression, by contrast, incorporates this heteroscedasticity explicitly through the binomial likelihood, where variance peaks at $ p = 0.5 $ and diminishes toward extremes, without imposing a constant-variance requirement on observed residuals.[180][181] Selection between the two for binary outcomes depends on context: LPM via linear regression offers straightforward coefficient interpretations as marginal probability changes and unbiased average treatment effects in causal settings, but suffers bound violations and inefficient standard errors without corrections.[177][182] Logistic regression is preferable for bounded probability estimates or when outcomes involve moderate probabilities away from 0 or 1, as linear models linearize relationships poorly in such cases; for rare events near boundaries, logistic captures saturation effects more accurately.[183]
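
The bound violation of the linear probability model can be seen on a toy example. This is a minimal sketch: the data are invented, the least-squares fit is computed by hand, and the logistic coefficients `beta0` and `beta1` are assumed values rather than fitted estimates.

```python
import math

# Hypothetical toy data: a scalar predictor x and a binary outcome y.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Linear probability model: ordinary least squares fit of y on x.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar


def lpm_predict(value):
    """Linear prediction; nothing keeps it inside [0, 1]."""
    return intercept + slope * value


def logistic_predict(value, beta0=-4.0, beta1=0.9):
    """Sigmoid of a linear predictor; beta0 and beta1 are assumed, not fitted."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * value)))


def bernoulli_variance(p):
    """Variance of a binary outcome with success probability p."""
    return p * (1.0 - p)


for value in (-5, 4.5, 15):
    print(value, round(lpm_predict(value), 3), round(logistic_predict(value), 3))
```

At the extreme predictor values the linear fit leaves the unit interval while the sigmoid stays strictly inside it, and `bernoulli_variance` peaks at 0.25 when `p` is 0.5, which is the heteroscedasticity the binomial likelihood builds in.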

Versus Tree-Based and Ensemble Methods

Logistic regression assumes that the log-odds of the outcome are linear in the predictors, so capturing non-linearities or higher-order effects requires manual feature engineering such as polynomial terms or interaction products. Decision trees and ensemble methods like random forests or gradient boosting machines instead partition the feature space, accommodating non-linear relationships and complex interactions without such preprocessing.[184][185]

Empirical comparisons since 2020 frequently demonstrate superior predictive performance for tree-based ensembles over logistic regression in classification tasks with moderate to large datasets, as measured by metrics like AUC-ROC and accuracy; for instance, a 2024 study on credit risk prediction reported random forest achieving an AUC of 94.78% compared to logistic regression's lower value, attributing the gap to ensembles' robustness to feature interactions and noise.[186] Similarly, in a 2024 analysis of heart failure mortality prediction, decision trees outperformed logistic regression on misclassification rate and lift curves.[187] However, logistic regression exhibits lower variance and overfitting risk in smaller samples due to its parametric nature, making it preferable when data is limited and stability matters more than marginal accuracy gains.[188]

In terms of interpretability, logistic regression's coefficients provide direct, quantifiable estimates of effect sizes, such as odds ratios for unit changes in predictors, facilitating straightforward inference. Single decision trees offer rule-based transparency through split paths, but ensembles aggregate hundreds or thousands of trees into opaque predictions that obscure individual feature contributions.[189] For causal inference applications, logistic regression supports explicit control for confounders via adjusted coefficients under linearity and no-omitted-variable assumptions, enabling verifiable claims about average treatment effects, whereas tree-based methods prioritize predictive accuracy over causal estimands and require specialized extensions like causal forests to mitigate selection bias, often at the cost of reduced transparency.[190][191] This interpretability edge favors logistic regression in regulated domains demanding auditable reasoning, despite ensembles' handling of heterogeneity.[192]
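
The need for hand-built interaction terms can be illustrated on the XOR pattern, which no linear boundary in the raw features separates. The sketch below is an illustration under stated assumptions, not a reference implementation: it uses plain full-batch gradient descent, and the helper names, learning rate, and iteration count are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, n_iter=5000):
    """Full-batch gradient descent on the mean log-loss.

    Each row is expected to start with a constant 1.0 for the intercept.
    """
    weights = [0.0] * len(rows[0])
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, row))) - label
            for j, v in enumerate(row):
                grad[j] += error * v / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def accuracy(weights, rows, labels):
    preds = [
        1 if sigmoid(sum(w * v for w, v in zip(weights, row))) > 0.5 else 0
        for row in rows
    ]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


# XOR: the label is 1 exactly when the two binary features differ.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

plain = [[1.0, a, b] for a, b in points]                    # intercept, x1, x2
with_interaction = [[1.0, a, b, a * b] for a, b in points]  # plus the product x1 * x2

acc_plain = accuracy(fit_logistic(plain, labels), plain, labels)
acc_interaction = accuracy(
    fit_logistic(with_interaction, labels), with_interaction, labels
)
print(acc_plain, acc_interaction)
```

With only the raw features the fit cannot classify all four points, whereas adding the product term makes the classes linearly separable on the logit scale. A tree would find the same structure through two successive splits without the engineered feature.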

Versus Neural Networks and Deep Learning

Logistic regression is equivalent to a single-layer perceptron with a sigmoid output: it applies a linear combination of inputs followed by a sigmoid activation to produce probabilistic outputs for binary classification, whereas deep neural networks employ multiple hidden layers to learn hierarchical, non-linear feature representations.[193] This structural simplicity limits logistic regression's capacity to automatically extract complex patterns from raw data without feature engineering, in contrast to deep learning's ability to model intricate dependencies through layers trained by backpropagation. However, in domains requiring explicit modeling of linear or mildly non-linear relationships, such as tabular data for credit scoring or medical diagnostics, logistic regression's parameter estimates are directly interpretable as log-odds ratios for each predictor, a transparency that deep models lose within distributed weights.[194]

Empirical benchmarks on tabular datasets consistently demonstrate that logistic regression serves as a robust baseline, often matching or exceeding deep learning performance when data lacks the spatial hierarchies suited to convolutional or recurrent architectures.[195] For instance, analyses of diverse tabular classification tasks reveal that state-of-the-art deep learning approaches frequently fail to appreciably outperform simple logistic regression, particularly on datasets with fewer than 10,000 samples, where overfitting in deep models degrades generalization.[196] Tree-based ensembles may edge out both in accuracy, but logistic regression's adequacy suggests that deep learning is often unnecessary for structured data, where traditional methods carry inductive biases suited to tabular sparsity and noise.[197]

Computationally, logistic regression scales linearly in the sample size $ n $: a full-batch gradient descent iteration costs $ O(nd) $ operations for $ d $ features, while an iteratively reweighted least squares iteration costs $ O(nd^2 + d^3) $ but typically converges in a handful of iterations.[198] Deep learning, by contrast, incurs a per-sample cost proportional to the total number of weights, which grows with depth and roughly quadratically with layer width, demanding extensive GPU resources and training times that are often orders of magnitude beyond logistic regression for equivalent tasks.[199] This efficiency gap favors logistic regression in resource-constrained environments or when rapid prototyping is essential, avoiding the data-hungry pretraining regimes of deep models that yield diminishing returns on non-image inputs.[200]

Historical Development

Origins in Biometrics

The logistic function, central to logistic regression, originated in early modeling of bounded growth processes with a characteristic sigmoid shape. In 1838, Belgian mathematician Pierre-François Verhulst introduced the logistic differential equation $ \frac{dP}{dt} = rP \left(1 - \frac{P}{K}\right) $ to describe population dynamics limited by carrying capacity $ K $, where the solution yields an S-shaped curve asymptotically approaching the upper bound.[201] This form, later termed "logistic" by Verhulst in 1845, provided a mathematical basis for cumulative distribution functions that transition smoothly from near-zero to near-one probabilities.[202] In biometrics, particularly toxicology and pharmacology, the sigmoid curve proved apt for quantal response assays, where binary outcomes—such as survival or mortality—are observed across graded doses administered to groups of organisms. These assays, common since the early 20th century, aimed to estimate metrics like the median lethal dose (LD50) by fitting response proportions against log-dose, revealing tolerance distributions akin to the logistic cumulative distribution function $ p(x) = \frac{1}{1 + e^{-(x - \mu)/s}} $. Empirical data from such experiments consistently showed sigmoidal patterns, reflecting variability in individual sensitivities rather than deterministic thresholds.[203] C.I. Bliss advanced quantal bioassay analysis in 1934, developing the probit method for transforming sigmoid dose-response curves to linear scales, though the logistic function offered a comparable, algebraically simpler alternative without requiring normal distribution assumptions. Bliss's approach involved maximum likelihood estimation precursors, weighting observations by binomial variance to fit curves to grouped data from toxicity tests on insects and rodents. 
These early techniques emphasized graphical probit-log dose plots and iterative adjustments, laying groundwork for probabilistic modeling of binary events in biological contexts before formal logistic regression frameworks emerged.[204][205]
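
Verhulst's differential equation has the closed-form solution $ P(t) = \frac{K}{1 + \frac{K - P_0}{P_0} e^{-rt}} $ for an initial population $ P_0 $. The short check below is a sketch with hypothetical parameter values; it compares a finite-difference derivative of that solution against the right-hand side $ rP(1 - P/K) $.

```python
import math


def logistic_growth(t, p0, r, k):
    """Closed-form solution of dP/dt = r * P * (1 - P / K) with P(0) = p0."""
    return k / (1.0 + ((k - p0) / p0) * math.exp(-r * t))


def growth_rate(p, r, k):
    """Right-hand side of the Verhulst equation."""
    return r * p * (1.0 - p / k)


# Hypothetical values: initial population 10, growth rate 0.5, carrying capacity 1000.
p0, r, k = 10.0, 0.5, 1000.0
h = 1e-5  # step for the central finite difference

max_gap = 0.0
for t in (0.0, 5.0, 10.0, 20.0):
    numeric = (logistic_growth(t + h, p0, r, k) - logistic_growth(t - h, p0, r, k)) / (2 * h)
    exact = growth_rate(logistic_growth(t, p0, r, k), r, k)
    max_gap = max(max_gap, abs(numeric - exact))

print(f"largest derivative mismatch: {max_gap:.2e}")
```

Setting $ K = 1 $ turns the same curve into the logistic cumulative distribution function quoted above, with scale $ s = 1/r $ and location $ \mu = \ln\left((1 - P_0)/P_0\right)/r $.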

Mid-20th Century Advancements

In 1958, statistician David R. Cox published "The Regression Analysis of Binary Sequences" in the Journal of the Royal Statistical Society: Series B, formalizing logistic regression as a method to model the probability of binary outcomes as a function of explanatory variables via the logistic (sigmoid) function.[206][26] Cox's approach specified the log-odds of success as a linear combination of predictors, $ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k $, which bounded predictions to (0,1) and addressed limitations of linear probability models prone to out-of-range forecasts.[123] This established the canonical form still used today, emphasizing maximum likelihood estimation for parameter inference despite the absence of closed-form solutions.[207]

Computational challenges in obtaining maximum likelihood estimates (MLEs) for logistic parameters persisted into the 1960s due to the need for iterative numerical methods. In 1967, Strother H. Walker and David B. Duncan advanced practical estimation in their Biometrika paper "Estimation of the Probability of an Event as a Function of Several Independent Variables," proposing an iterative reweighting scheme for multiple logistic models applied to dichotomous data.[208][209] Their algorithm, akin to successive approximations, updated weights based on current probability estimates to converge on MLEs, enabling reliable fitting even with multiple predictors and paving the way for software implementation in fields like biomedicine and economics.[210]

The 1970s saw logistic regression integrated into broader statistical frameworks, enhancing its applicability. In 1972, John A. Nelder and Robert W. M. Wedderburn introduced generalized linear models (GLMs) in the Journal of the Royal Statistical Society: Series A, positioning logistic regression as a GLM variant with binomial variance and logit link.[211][212] This framework unified estimation across distributions via iteratively reweighted least squares (IRLS), a generalization of Walker and Duncan's method that treated the logit as the canonical link for binary data.[50] The GLM approach facilitated extensions to overdispersion and diagnostics, contributing to logistic regression's adoption in statistical packages like GLIM (1970s) and its routine use in epidemiology, social sciences, and machine learning by the 1980s.[213]
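
The iterative reweighting idea can be sketched for a model with one predictor, where each weighted least squares step reduces to a 2x2 linear system. This is a minimal illustration under assumptions: the data are invented, the function name `irls_logistic` is hypothetical, and no safeguards for separation or vanishing weights are included.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def irls_logistic(x, y, max_iter=25, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            p = sigmoid(eta)
            w = p * (1.0 - p)        # binomial variance, used as the weight
            z = eta + (yi - p) / w   # working (adjusted) response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 weighted normal equations for the new coefficients.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical data with overlapping classes, so a finite MLE exists.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = irls_logistic(x, y)

# At the MLE the score equations sum(y - p) = 0 and sum((y - p) * x) = 0 hold.
score_b0 = sum(yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y))
score_b1 = sum((yi - sigmoid(b0 + b1 * xi)) * xi for xi, yi in zip(x, y))
print(b0, b1, score_b0, score_b1)
```

Each pass recomputes the weights from the current fitted probabilities, which is the reweighting step the text describes; for the logit link the procedure coincides with Newton's method on the log-likelihood.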

Computational Era Contributions

The integration of logistic regression into computational frameworks accelerated its adoption as a practical tool for statistical analysis starting in the 1990s, driven by advancements in iterative algorithms and accessible software. Iteratively reweighted least squares (IRLS), a method for maximizing the likelihood via weighted linear regressions, had been implemented in commercial packages like SAS's PROC LOGISTIC and SPSS by the 1980s, allowing users to fit models on personal computers without deriving solutions manually.[214] This shifted emphasis from theoretical derivation to empirical application in fields like epidemiology and social sciences, where datasets grew larger with digital data collection. Open-source languages further democratized access. In R, released in its initial versions around 1993–1995, the glm function with family=binomial enabled straightforward logistic regression fitting from the outset, integrated into the base stats package and fostering reproducible workflows among researchers. Python implementations followed, with scikit-learn's LogisticRegression class introduced in version 0.13 (2012), supporting optimizations like stochastic gradient descent for larger datasets. These tools emphasized empirical validation through cross-validation and diagnostics, reducing reliance on proprietary systems. A pivotal 2000s contribution was regularization to handle overfitting and high-dimensional data. 
Friedman, Hastie, and Tibshirani developed coordinate descent algorithms for computing regularization paths in generalized linear models, including penalized logistic regression with L1 (lasso) and elastic net penalties, as detailed in their 2010 Journal of Statistical Software paper.[215] Implemented in the glmnet R package (version 1.0 released 2010), this approach efficiently solves for entire penalty sequences, enabling variable selection and improved predictive performance on sparse data—evident in benchmarks where regularized models outperformed unpenalized ones by reducing variance without substantial bias increase.[216] This marked a transition to scalable, computationally efficient variants suited to machine learning pipelines.

Modern Extensions

High-Dimensional and Sparse Data

In high-dimensional settings where the number of predictors $ p $ greatly exceeds the number of observations $ n $ ($ p \gg n $), logistic regression faces challenges such as overfitting and multicollinearity, particularly when the underlying model is sparse, with many zero or near-zero coefficients. Lasso regularization, which applies an L1 penalty to the logistic loss function, induces sparsity by shrinking irrelevant coefficients to zero, enabling simultaneous variable selection and estimation. This approach has been shown to recover sparse representations effectively in high-dimensional logistic models, as demonstrated in theoretical analyses confirming consistent selection under irrepresentable conditions.[217]

Post-2010 applications have adopted elastic net penalties, which combine L1 and L2 regularization to handle the correlated predictors common in genomics, where datasets involve thousands of gene expressions. For instance, elastic net logistic regression improved classifier performance for immune cell types and T cell subsets from high-dimensional genomic profiles, outperforming Lasso alone by selecting grouped variables. In genome-wide association studies, elastic net demonstrated superior predictive accuracy over Lasso for quantitative traits, with applications in cancer classification using gene expression data. A 2025 study applied elastic net to high-dimensional brain cancer gene data, identifying key genes across tumor types with reduced false selections compared to unregularized models.[218][219][220]

To mitigate the instability of Lasso selections, which can lead to high false positive rates in noisy high-dimensional data, stability selection repeatedly refits the model on random subsamples of the data and aggregates the selections across iterations, controlling the expected number of false positives below a user-specified bound such as 1.
This method, applied to penalized logistic regression, enhances reliability in sparse settings by prioritizing consistently selected variables, with empirical evidence showing low false discovery rates (≤0.02) in simulations and real datasets. Robust variants, such as adaptive Lasso with density power divergence, further reduce sensitivity to outliers in high-dimensional logistic models.[221][222][223] Recent reviews from 2020-2025 affirm the empirical utility of regularized logistic regression in healthcare prediction tasks, such as risk modeling from high-dimensional biomarker data, where it balances interpretability and performance against more complex methods. These techniques have validated predictions in outcomes like disease readmission and obesity risk, with elastic net variants showing improved accuracy in sparse, correlated feature spaces typical of electronic health records.[224][225][226]
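
How an L1 penalty sets an irrelevant coefficient to exactly zero can be sketched with proximal gradient descent and soft-thresholding. Note the swap: this is a simpler stand-in for the coordinate descent used by packages such as glmnet, and the data, penalty value, and function names are hypothetical. The model omits the intercept to keep the example short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clipping small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_logistic(rows, labels, lam, lr=0.5, n_iter=2000):
    """Proximal gradient descent on mean log-loss + lam * ||w||_1 (no intercept)."""
    weights = [0.0] * len(rows[0])
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, row))) - label
            for j, v in enumerate(row):
                grad[j] += error * v / len(rows)
        # Gradient step on the smooth loss, then soft-threshold for the L1 term.
        weights = [soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grad)]
    return weights


# Hypothetical balanced design: x1 drives the outcome, x2 is unrelated to it.
rows, labels = [], []
for x1 in (1.0, -1.0):
    for x2 in (1.0, -1.0):
        agree = 1 if x1 > 0 else 0
        rows += [[x1, x2]] * 4               # four observations per (x1, x2) cell
        labels += [agree] * 3 + [1 - agree]  # three follow the sign of x1, one does not

weights = lasso_logistic(rows, labels, lam=0.05)
print(weights)
```

On this design the coefficient of `x2` is returned as exactly zero, while the coefficient of `x1` is shrunk from its unpenalized value of about 1.10 (the log-odds of 0.75) to about 0.85 (the log-odds of 0.70), showing selection and shrinkage acting together.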

Causal Inference Integrations

Logistic regression is commonly employed to estimate propensity scores, defined as the probability of receiving treatment given observed covariates, $ P(T=1 \mid X) $, where $ T $ indicates treatment assignment and $ X $ represents covariates.[227] This logit-linear parameterization facilitates maximum likelihood estimation of treatment probabilities, enabling methods like inverse probability weighting (IPW) to adjust for confounding in observational data.[228] Under IPW, treated units receive weights of $ 1 / \hat{P}(T=1 \mid X) $ and control units $ 1 / (1 - \hat{P}(T=1 \mid X)) $, creating a pseudo-population where treatment assignment is independent of covariates, thus yielding marginal treatment effect estimates such as the average treatment effect (ATE).[229] These weights derive from the fitted logistic model and assume correct specification of the propensity score functional form, with stabilized variants incorporating marginal treatment probabilities to mitigate extreme weights.[227]

Doubly robust estimators extend this framework by integrating the propensity score-based IPW with an outcome regression model, such as another logistic regression for binary outcomes, to produce consistent ATE estimates if at least one of the two models is correctly specified.[230] The augmented IPW (AIPW) estimator adds an inverse-probability-weighted average of the outcome model's residuals to the outcome model's predictions, reducing bias from propensity misspecification provided the outcome model captures $ E(Y \mid T, X) $ accurately.[231] This approach leverages logistic regression for both treatment and outcome probabilities, enhancing efficiency over IPW alone when models align with data-generating processes, though it requires overlap in covariate distributions, also called positivity (propensity scores strictly between 0 and 1 across $ X $).[232]

Despite these integrations, logistic regression-based causal methods in observational studies face inherent limitations absent randomization, primarily the untestable assumption of ignorability, meaning no unmeasured confounding affecting both treatment and outcome.[233] Empirical evaluations reveal sensitivity to model misspecification, where omitted interactions or nonlinearities in the logit lead to biased propensity estimates and attenuated effects.[234] Collider bias arises when conditioning on post-treatment variables or selection criteria induces spurious associations by opening non-causal paths, as demonstrated in simulations where stratifying on outcomes distorts treatment-outcome links despite balanced covariates.[235] Unmeasured confounders, unverifiable without auxiliary data like instrumental variables, persistently undermine claims, with quantitative sensitivity analyses (e.g., E-values) showing that even modest unmeasured biases, such as a confounder raising outcome risk by a factor of 2, can fully explain observed effects.[236] Thus, without experimental validation, these methods support exploratory inference but not definitive causality.[237]
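
The weighting scheme described above can be made concrete on a small confounded dataset. This sketch assumes the propensity scores have already been estimated (for instance by a fitted logistic model); the records and the function name `ipw_ate` are hypothetical, and the estimator shown normalizes the weights within each arm (the Hajek form), which is one common variant rather than the only one.

```python
def ipw_ate(records):
    """Normalized (Hajek) IPW estimate of the ATE.

    Each record is (treated, outcome, propensity), with propensity = P(T=1 | X).
    """
    sum_w1 = sum_w1y = sum_w0 = sum_w0y = 0.0
    for treated, outcome, propensity in records:
        if treated:
            weight = 1.0 / propensity          # treated units: 1 / e(X)
            sum_w1 += weight
            sum_w1y += weight * outcome
        else:
            weight = 1.0 / (1.0 - propensity)  # control units: 1 / (1 - e(X))
            sum_w0 += weight
            sum_w0y += weight * outcome
    return sum_w1y / sum_w1 - sum_w0y / sum_w0


# Hypothetical data with one binary confounder X.
# Stratum X=0 has propensity 0.25; stratum X=1 has propensity 0.75.
records = [
    (1, 1, 0.25), (0, 0, 0.25), (0, 0, 0.25), (0, 1, 0.25),  # X = 0
    (1, 1, 0.75), (1, 1, 0.75), (1, 0, 0.75), (0, 0, 0.75),  # X = 1
]

treated_outcomes = [y for t, y, _ in records if t]
control_outcomes = [y for t, y, _ in records if not t]
naive = sum(treated_outcomes) / len(treated_outcomes) - sum(control_outcomes) / len(control_outcomes)
ate = ipw_ate(records)
print(f"naive difference = {naive:.3f}, IPW estimate = {ate:.3f}")
```

The unweighted difference in means is 0.5, while the weighted estimate is about 0.667, which matches the within-stratum effect in both strata of this toy dataset. The gap is the part of the naive comparison attributable to the confounder.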

Scalable Implementations in Big Data

Scalable implementations of logistic regression for big data rely on iterative first-order optimization techniques, such as stochastic gradient descent (SGD), which compute approximate gradients using mini-batches of data rather than the full dataset, thereby reducing memory requirements and enabling parallel processing across distributed systems.[238] This approach converges to solutions comparable to batch methods while handling datasets exceeding single-machine capacity, as demonstrated in frameworks optimized for cluster environments.[239]

Apache Spark's MLlib library supports distributed training of logistic regression models through mini-batch SGD or limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) solvers, where data is partitioned into resilient distributed datasets (RDDs) for fault-tolerant computation across nodes.[240] Available since 2014, MLlib's implementation has been enhanced with adaptive SGD variants to accelerate convergence on terabyte-scale data, with elastic net regularization available to stabilize the fit.[241] Similarly, TensorFlow facilitates SGD-based logistic regression training in distributed settings, leveraging its graph execution model to parallelize gradient computations over multiple GPUs or clusters, as utilized in production-scale binary classification tasks.[242]

Post-2020 developments have integrated federated learning paradigms into logistic regression to address privacy constraints in decentralized big data environments, where raw data remains local to devices or institutions, and only aggregated parameter updates (e.g., via secure averaging) are shared centrally.[243] For example, robust federated logistic regression frameworks for financial datasets, proposed in 2025, incorporate differential privacy noise to mitigate inference attacks while achieving accuracy within 2-5% of centralized baselines on distributed samples exceeding millions of records.[244] These variants, building on workshops like FL-ICML'20, enable scalable training without data centralization, as applied in healthcare analytics where site-specific silos prevent full dataset pooling.[245] Despite approximations inherent in subsampling and federation, the resulting models retain coefficient interpretability, allowing odds ratio assessments as in single-machine fits, thus balancing efficiency with interpretability in high-volume regimes.[246]
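
The mini-batch update at the core of these systems fits in a few lines on a single machine. The sketch below is illustrative only: it does not use Spark or TensorFlow, the data are synthetic, and the learning rate, batch size, epoch count, and seed are arbitrary assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable form for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_log_loss(w, b, xs, ys):
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def minibatch_sgd(xs, ys, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Mini-batch SGD for a one-feature logistic model; each step sees only one batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = sigmoid(w * xs[i] + b) - ys[i]
                grad_w += error * xs[i] / len(batch)
                grad_b += error / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic data: the label follows the sign of x, with a few labels flipped.
xs = [(i - 50) / 10 for i in range(100)]
ys = []
for i, x in enumerate(xs):
    label = 1 if x > 0 else 0
    if i % 17 == 0:  # flip some labels so the classes overlap
        label = 1 - label
    ys.append(label)

loss_before = mean_log_loss(0.0, 0.0, xs, ys)
w, b = minibatch_sgd(xs, ys)
loss_after = mean_log_loss(w, b, xs, ys)
print(f"loss before = {loss_before:.3f}, loss after = {loss_after:.3f}, w = {w:.3f}")
```

Because each update touches only `batch_size` rows, the same loop can in principle be distributed by computing batch gradients on separate workers and averaging them, which is the pattern the distributed and federated variants above build on.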

References
