Batch normalization

from Wikipedia

In artificial neural networks, batch normalization (also known as batch norm) is a normalization technique used to make training faster and more stable by adjusting the inputs to each layer—re-centering them around zero and re-scaling them to a standard size. It was introduced by Sergey Ioffe and Christian Szegedy in 2015.[1]

Experts still debate why batch normalization works so well. It was initially thought to tackle internal covariate shift, a problem in which parameter initialization and changes in the distribution of each layer's inputs force the layers to keep adapting during training, slowing learning.[1] However, newer research suggests it doesn't fix this shift but instead smooths the objective function—a mathematical guide the network follows to improve—enhancing performance.[2] In very deep networks, batch normalization can initially cause a severe gradient explosion—where updates to the network grow uncontrollably large—but this is managed with shortcuts called skip connections in residual networks.[3] Another theory is that batch normalization decouples the length and direction of the weight vectors, handling the two separately and thereby speeding up training.[4]

Internal covariate shift

Each layer in a neural network has inputs that follow a specific distribution, which shifts during training due to two main factors: the random starting values of the network’s settings (parameter initialization) and the natural variation in the input data. This shifting pattern affecting the inputs to the network’s inner layers is called internal covariate shift. While a strict definition isn’t fully agreed upon, experiments show that it involves changes in the means and variances of these inputs during training.

Batch normalization was first developed to address internal covariate shift.[1] During training, as the parameters of preceding layers adjust, the distribution of inputs to the current layer changes accordingly, such that the current layer needs to constantly readjust to new distributions. This issue is particularly severe in deep networks, because small changes in shallower hidden layers are amplified as they propagate through the network, resulting in significant shift in deeper hidden layers. Batch normalization was proposed to reduce these unwanted shifts, speeding up training and producing more reliable models.

Beyond possibly tackling internal covariate shift, batch normalization offers several additional advantages. It allows the network to use a higher learning rate—a setting that controls how quickly the network learns—without causing problems like vanishing or exploding gradients, where updates become too small or too large. It also appears to have a regularizing effect: it improves the network's ability to generalize to new data and reduces the need for dropout, a technique used to prevent overfitting (when a model learns the training data too well and fails on new data). Additionally, networks using batch normalization are less sensitive to the choice of initialization or learning rate, making them more robust and adaptable.

Procedures

Transformation

In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but because this step is used jointly with stochastic optimization methods, it is impractical to use the global information. Thus, normalization is restricted to each mini-batch in the training process.

Let us use $B$ to denote a mini-batch of size $m$ of the entire training set. The empirical mean and variance of $B$ could thus be denoted as

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{and} \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2.$$

For a layer of the network with $d$-dimensional input, $x = \left(x^{(1)}, \dots, x^{(d)}\right)$, each dimension of its input is then normalized (i.e. re-centered and re-scaled) separately,

$$\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}},$$

where $k \in [1, d]$ and $i \in [1, m]$; $\mu_B^{(k)}$ and $\sigma_B^{(k)}$ are the per-dimension mean and standard deviation, respectively.

$\epsilon$ is added in the denominator for numerical stability and is an arbitrarily small positive constant. The resulting normalized activation $\hat{x}^{(k)}$ has zero mean and unit variance, if $\epsilon$ is not taken into account. To restore the representation power of the network, a transformation step then follows as

$$y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},$$

where the parameters $\gamma^{(k)}$ and $\beta^{(k)}$ are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform $\mathrm{BN}_{\gamma^{(k)}, \beta^{(k)}}: x_{1 \dots m}^{(k)} \to y_{1 \dots m}^{(k)}$ called the Batch Normalizing transform. The output of the BN transform $y^{(k)} = \mathrm{BN}_{\gamma^{(k)}, \beta^{(k)}}\left(x^{(k)}\right)$ is then passed to other network layers, while the normalized output $\hat{x}_i^{(k)}$ remains internal to the current layer.
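The transform can be written compactly in code. The sketch below is an illustrative NumPy implementation written for this article (the function name, the cache layout, and the default value of $\epsilon$ are choices made here, not prescribed by the original paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing transform for a mini-batch x of shape (m, d).

    gamma, beta: learnable per-dimension parameters of shape (d,).
    Returns the output y and a cache of intermediates for the backward pass.
    """
    mu = x.mean(axis=0)                      # per-dimension empirical mean mu_B
    var = x.var(axis=0)                      # per-dimension empirical variance sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # re-centered and re-scaled activations
    y = gamma * x_hat + beta                 # restore representation power
    cache = (x, x_hat, mu, var, gamma, eps)
    return y, cache

# Example: normalize a random mini-batch of 32 four-dimensional inputs.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 4))
y, _ = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))  # approximately 0 and 1 per dimension
```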

Backpropagation

The described BN transform is a differentiable operation, and the gradient of the loss $\ell$ with respect to the different parameters can be computed directly with the chain rule.

Specifically, $\frac{\partial \ell}{\partial y_i^{(k)}}$ depends on the choice of activation function, and the gradient against other parameters could be expressed as a function of $\frac{\partial \ell}{\partial y_i^{(k)}}$:

$$\frac{\partial \ell}{\partial \hat{x}_i^{(k)}} = \frac{\partial \ell}{\partial y_i^{(k)}}\, \gamma^{(k)},$$

$$\frac{\partial \ell}{\partial \left(\sigma_B^{(k)}\right)^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i^{(k)}} \left(x_i^{(k)} - \mu_B^{(k)}\right)\left(-\frac{1}{2}\right)\left(\left(\sigma_B^{(k)}\right)^2 + \epsilon\right)^{-3/2},$$

$$\frac{\partial \ell}{\partial \mu_B^{(k)}} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i^{(k)}} \frac{-1}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}} + \frac{\partial \ell}{\partial \left(\sigma_B^{(k)}\right)^2} \frac{\sum_{i=1}^{m} -2\left(x_i^{(k)} - \mu_B^{(k)}\right)}{m},$$

$$\frac{\partial \ell}{\partial x_i^{(k)}} = \frac{\partial \ell}{\partial \hat{x}_i^{(k)}} \frac{1}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}} + \frac{\partial \ell}{\partial \left(\sigma_B^{(k)}\right)^2} \frac{2\left(x_i^{(k)} - \mu_B^{(k)}\right)}{m} + \frac{\partial \ell}{\partial \mu_B^{(k)}} \frac{1}{m},$$

$$\frac{\partial \ell}{\partial \gamma^{(k)}} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i^{(k)}}\, \hat{x}_i^{(k)}, \quad \text{and} \quad \frac{\partial \ell}{\partial \beta^{(k)}} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i^{(k)}}.$$
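These expressions translate directly into code. The following sketch is an illustrative NumPy routine that pairs with the forward-pass sketch above (the names are this article's, not the paper's); it returns the gradients with respect to the input and the learnable parameters:

```python
import numpy as np

def batch_norm_backward(dy, cache):
    """Backward pass of the BN transform.

    dy: upstream gradient dL/dy of shape (m, d).
    cache: intermediates saved by the forward pass.
    Returns gradients with respect to the input x and the parameters gamma, beta.
    """
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dx_hat = dy * gamma                                           # dL/dx_hat
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * std_inv**3, axis=0)  # dL/d(sigma^2)
    dmu = np.sum(dx_hat * -std_inv, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m   # dL/dx
    dgamma = np.sum(dy * x_hat, axis=0)                           # dL/dgamma
    dbeta = np.sum(dy, axis=0)                                    # dL/dbeta
    return dx, dgamma, dbeta
```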

Inference

During the training stage, the normalization steps depend on the mini-batches to ensure efficient and reliable training. However, in the inference stage, this dependence is no longer useful. Instead, the normalization step in this stage is computed with the population statistics, such that the output depends on the input in a deterministic manner. The population mean, $E\left[x^{(k)}\right]$, and variance, $\operatorname{Var}\left[x^{(k)}\right]$, are computed as:

$$E\left[x^{(k)}\right] = E_B\left[\mu_B^{(k)}\right] \quad \text{and} \quad \operatorname{Var}\left[x^{(k)}\right] = \frac{m}{m-1}\, E_B\left[\left(\sigma_B^{(k)}\right)^2\right].$$

The population statistics are thus a complete representation of the mini-batches.

The BN transform in the inference step thus becomes

$$y^{(k)} = \mathrm{BN}_{\gamma^{(k)}, \beta^{(k)}}^{\text{inf}}\left(x^{(k)}\right) = \gamma^{(k)}\, \frac{x^{(k)} - E\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right] + \epsilon}} + \beta^{(k)},$$

where $y^{(k)}$ is passed on to future layers instead of $\hat{x}^{(k)}$. Since the parameters are fixed in this transformation, the batch normalization procedure essentially applies a linear transform to the activation.
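Because the statistics are frozen at inference time, the whole BN step collapses into a per-dimension affine map, as the following sketch illustrates (again an illustrative NumPy routine written for this article; the population statistics are assumed to have been estimated during training):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Deterministic BN transform using fixed population statistics.

    Because pop_mean and pop_var are constants at this point, the whole
    operation reduces to a per-dimension linear (affine) transform of x.
    """
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - scale * pop_mean
    return scale * x + shift
```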

Theory

Although batch normalization has become popular due to its strong empirical performance, the working mechanism of the method is not yet well-understood. The explanation made in the original paper[1] was that batch norm works by reducing internal covariate shift, but this has been challenged by more recent work. One experiment[5] trained a VGG-16 network[6] under 3 different training regimes: standard (no batch norm), batch norm, and batch norm with noise added to each layer during training. In the third model, the noise has non-zero mean and non-unit variance, i.e. it explicitly introduces covariate shift. Despite this, it showed similar accuracy to the second model, and both performed better than the first, suggesting that covariate shift is not the reason that batch norm improves performance.

Using batch normalization causes the items in a batch to no longer be iid, which can lead to difficulties in training due to lower quality gradient estimation.[7]

Smoothness

One alternative explanation[5] is that the improvement with batch normalization is instead due to producing a smoother parameter space and smoother gradients, as formalized by a smaller Lipschitz constant.

Consider two identical networks, one of which contains batch normalization layers and the other of which does not; the behaviors of these two networks are then compared. Denote the loss functions as $\hat{L}$ and $L$, respectively. Let the input to both networks be $x$, and the output be $y$, for which $y = Wx$, where $W$ is the layer weights. For the second network, $y$ additionally goes through a batch normalization layer. Denote the normalized activation as $\hat{y}$, which has zero mean and unit variance. Let the transformed activation be $z = \gamma \hat{y} + \beta$, and suppose $\gamma$ and $\beta$ are constants. Finally, denote the standard deviation of the activations over a mini-batch as $\sigma_j$.

First, it can be shown that the gradient magnitude of a batch normalized network, $\left\|\nabla_{y_j} \hat{L}\right\|$, is bounded, with the bound expressed as

$$\left\|\nabla_{y_j} \hat{L}\right\|^2 \le \frac{\gamma^2}{\sigma_j^2}\left(\left\|\nabla_{y_j} L\right\|^2 - \frac{1}{m}\left\langle \mathbf{1}, \nabla_{y_j} L \right\rangle^2 - \frac{1}{m}\left\langle \nabla_{y_j} L, \hat{y}_j \right\rangle^2\right).$$

Since the gradient magnitude represents the Lipschitzness of the loss, this relationship indicates that a batch normalized network could achieve greater Lipschitzness (a smaller bound on the gradient) comparatively. Notice that the bound gets tighter when the gradient $\nabla_{y_j} L$ correlates with the activation $\hat{y}_j$, which is a common phenomenon. The scaling by $\frac{\gamma^2}{\sigma_j^2}$ is also significant, since the variance is often large.

Secondly, the quadratic form of the loss Hessian with respect to activation in the gradient direction can be bounded as well: it is at most the corresponding quadratic form of the unnormalized network, scaled by $\frac{\gamma^2}{\sigma_j^2}$, minus a second term proportional to the inner product $\left\langle \nabla_{y_j} L, \hat{y}_j \right\rangle$ and the squared gradient magnitude.

The scaling by $\frac{\gamma^2}{\sigma_j^2}$ indicates that the loss Hessian is resilient to the mini-batch variance, whereas the second term suggests that the landscape becomes smoother when the Hessian and the inner product are non-negative. If the loss is locally convex, then the Hessian is positive semi-definite, while the inner product is positive if $\hat{y}_j$ is in the direction towards the minimum of the loss. It could thus be concluded from this inequality that the gradient generally becomes more predictive with the batch normalization layer.

It then follows that the bounds on the loss with respect to the normalized activation translate into a bound on the loss with respect to the network weights: the worst-case magnitude of the weight gradient of the batch-normalized network is bounded by the corresponding quantity of the unnormalized network, again scaled by $\frac{\gamma^2}{\sigma_j^2}$ and reduced by additional non-negative terms involving the mean gradient and the inner product $\left\langle \nabla_{y_j} L, \hat{y}_j \right\rangle$, so the batch-normalized network is also smoother in weight space.

In addition to the smoother landscape, it is further shown that batch normalization could result in a better initialization with the following inequality:

$$\left\|W_0 - \hat{W}^*\right\|^2 \le \left\|W_0 - W^*\right\|^2 - \frac{1}{\left\|W^*\right\|^2}\left(\left\|W^*\right\|^2 - \left\langle W^*, W_0 \right\rangle\right)^2,$$

where $W^*$ and $\hat{W}^*$ are the local optimal weights for the two networks, respectively.

Some scholars argue that the above analysis cannot fully capture the performance of batch normalization, because the proof only concerns the largest eigenvalue, or equivalently, one direction in the landscape at all points. It is suggested that the complete eigenspectrum needs to be taken into account to make a conclusive analysis.[8][5]

Measure

Since it is hypothesized that batch normalization layers could reduce internal covariate shift, an experiment[citation needed] is set up to measure quantitatively how much covariate shift is reduced. First, the notion of internal covariate shift needs to be defined mathematically. Specifically, to quantify the adjustment that a layer's parameters make in response to updates in previous layers, the correlation between the gradients of the loss before and after all previous layers are updated is measured, since gradients could capture the shifts from the first-order training method. If the shift introduced by the changes in previous layers is small, then the correlation between the gradients would be close to 1.

The correlation between the gradients is computed for four models: a standard VGG network,[6] a VGG network with batch normalization layers, a 25-layer deep linear network (DLN) trained with full-batch gradient descent, and a DLN network with batch normalization layers. Interestingly, it is shown that the standard VGG and DLN models both have higher correlations of gradients than their batch-normalized counterparts, indicating that the additional batch normalization layers are not reducing internal covariate shift.
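As a toy illustration of this kind of measurement (a simplified sketch written for this article on a two-layer linear network, not the cited experimental setup), one can compare a layer's gradient before and after the preceding layer is updated and use cosine similarity as the correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))            # mini-batch of inputs
t = rng.normal(size=(64, 1))             # regression targets
W1 = rng.normal(size=(10, 16)) * 0.1     # first-layer weights
W2 = rng.normal(size=(16, 1)) * 0.1      # second-layer weights

def grad_W2(W1, W2):
    """Gradient of the loss 0.5 * ||x W1 W2 - t||^2 / m with respect to W2."""
    h = x @ W1
    err = h @ W2 - t
    return h.T @ err / x.shape[0]

def grad_W1(W1, W2):
    h = x @ W1
    err = h @ W2 - t
    return x.T @ (err @ W2.T) / x.shape[0]

g_before = grad_W2(W1, W2)                 # second-layer gradient with current W1
W1_updated = W1 - 0.1 * grad_W1(W1, W2)    # update only the preceding layer
g_after = grad_W2(W1_updated, W2)          # second-layer gradient after that update

cos = np.sum(g_before * g_after) / (np.linalg.norm(g_before) * np.linalg.norm(g_after))
print(f"gradient correlation (cosine similarity): {cos:.3f}")  # close to 1 means little shift
```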

Vanishing/exploding gradients

Even though batch norm was originally introduced to alleviate gradient vanishing or explosion problems, a deep batch norm network in fact suffers from gradient explosion at initialization time, no matter what it uses for nonlinearity. Thus, the optimization landscape is very far from smooth for a randomly initialized, deep batch norm network. More precisely, if the network has $L$ layers, then the gradient of the first layer weights has norm greater than $c \lambda^{L}$ for some $c > 0$ and $\lambda > 1$ depending only on the nonlinearity. For any fixed nonlinearity, $\lambda$ decreases as the batch size increases. For example, for ReLU, $\lambda$ decreases to $\frac{\pi}{\pi - 1} \approx 1.467$ as the batch size tends to infinity. Practically, this means deep batch norm networks are untrainable. This is only relieved by skip connections in the fashion of residual networks.[9]

This gradient explosion on the surface contradicts the smoothness property explained in the previous section, but in fact they are consistent. The previous section studies the effect of inserting a single batch norm in a network, while the gradient explosion depends on stacking batch norms typical of modern deep neural networks.

Decoupling

Another possible reason for the success of batch normalization is that it decouples the length and direction of the weight vectors and thus facilitates better training.

By interpreting batch norm as a reparametrization of weight space, it can be shown that the length and the direction of the weights are separated and can thus be trained separately. For a particular neural network unit with input $x$ and weight vector $w$, denote its output as $f(w) = \mathbb{E}_x\left[\phi\left(x^T w\right)\right]$, where $\phi$ is the activation function, and denote $S = \mathbb{E}\left[x x^T\right]$. Assume that $\mathbb{E}[x] = 0$, and that the spectrum of the matrix $S$ is bounded as $0 < \lambda_{\min}(S) \le \lambda_{\max}(S) < \infty$, such that $S$ is symmetric positive definite. Adding batch normalization to this unit thus results in

$$f_{BN}(w, \gamma, \beta) = \mathbb{E}_x\left[\phi\left(\mathrm{BN}\left(x^T w\right)\right)\right] = \mathbb{E}_x\left[\phi\left(\gamma\, \frac{x^T w - \mathbb{E}_x\left[x^T w\right]}{\sqrt{\operatorname{var}_x\left[x^T w\right]}} + \beta\right)\right],$$

by definition.

The variance term can be simplified such that $\operatorname{var}_x\left[x^T w\right] = w^T S w$. Assume that $x$ has zero mean and $\beta$ can be omitted, then it follows that

$$f_{BN}(w, \gamma) = \mathbb{E}_x\left[\phi\left(\gamma\, \frac{x^T w}{\|w\|_S}\right)\right],$$

where $\|w\|_S = \sqrt{w^T S w}$ is the induced norm of $S$.

Hence, it could be concluded that $f_{BN}(w, \gamma) = f(\tilde{w})$, where $\tilde{w} = \gamma \frac{w}{\|w\|_S}$, and $\gamma$ and $\frac{w}{\|w\|_S}$ account for its length and direction separately. This property could then be used to prove the faster convergence of problems with batch normalization.
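This decoupling is easy to verify numerically. The sketch below (illustrative code written for this article, with an arbitrary choice of $S$, activation, and sample size) estimates $f_{BN}$ by Monte Carlo and shows that rescaling $w$ leaves the value unchanged, so only the direction of $w$ and the separate length parameter $\gamma$ matter:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.normal(size=(d, d))
S = A @ A.T + np.eye(d)                     # a symmetric positive definite second-moment matrix
x = rng.multivariate_normal(np.zeros(d), S, size=200_000)  # zero-mean Gaussian inputs

def f_bn(w, gamma, phi=lambda z: np.maximum(z, 0.0)):
    """Monte Carlo estimate of E_x[ phi( gamma * x^T w / ||w||_S ) ]."""
    norm_S = np.sqrt(w @ S @ w)             # induced norm ||w||_S
    return phi(gamma * (x @ w) / norm_S).mean()

w = rng.normal(size=d)
print(f_bn(w, gamma=1.7))                   # the value depends only on gamma and w / ||w||_S,
print(f_bn(3.0 * w, gamma=1.7))             # so rescaling w leaves it unchanged
```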

Linear convergence

Least-square problem

With the reparametrization interpretation, it could then be proved that applying batch normalization to the ordinary least squares problem achieves a linear convergence rate in gradient descent, which is faster than the regular gradient descent with only sub-linear convergence.

Denote the objective of minimizing an ordinary least squares problem as

$$\min_{\tilde{w} \in \mathbb{R}^d} f_{OLS}(\tilde{w}) = \min_{\tilde{w} \in \mathbb{R}^d}\left(\tilde{w}^T S \tilde{w} - 2 u^T \tilde{w}\right),$$

where $S = \mathbb{E}\left[x x^T\right]$ and $u = \mathbb{E}[y x]$.

Since $\tilde{w} = \gamma \frac{w}{\|w\|_S}$, the objective thus becomes

$$\min_{\gamma \in \mathbb{R},\, w \in \mathbb{R}^d \setminus \{0\}} f_{OLS}(\gamma, w) = \min_{\gamma,\, w \neq 0}\left(\gamma^2 - 2\gamma\, \frac{u^T w}{\|w\|_S}\right),$$

where $0$ is excluded to avoid $0$ in the denominator.

Since the objective is convex with respect to $\gamma$, its optimal value could be calculated by setting the partial derivative of the objective against $\gamma$ to $0$. The objective could be further simplified to be

$$\min_{w \in \mathbb{R}^d \setminus \{0\}} \rho(w) = \min_{w \neq 0}\left(-\frac{\left(u^T w\right)^2}{w^T S w}\right).$$

Note that this objective is a form of the generalized Rayleigh quotient

$$\tilde{\rho}(w) = \frac{w^T B w}{w^T A w},$$

where $B \in \mathbb{R}^{d \times d}$ is a symmetric matrix and $A \in \mathbb{R}^{d \times d}$ is a symmetric positive definite matrix.

It is proven that gradient descent on the generalized Rayleigh quotient converges at a linear rate, with a contraction factor determined by the largest eigenvalue, the second largest eigenvalue, and the smallest eigenvalue of $B$.[10]

In our case, $B = u u^T$ is a rank one matrix, and the convergence result can be simplified accordingly. Specifically, gradient descent steps on $\rho$ with a suitably chosen step size, starting from a point $w_0$ that is not orthogonal to $u$, drive $\rho\left(w_t\right)$ to the optimal value $\rho\left(w^*\right)$ at a linear rate.
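Under the reparametrized objective above, this behavior is easy to observe numerically. The sketch below (illustrative code written for this article; the data, fixed step size, and renormalization are choices made here and do not reproduce the step-size rule of the cited analysis) runs plain gradient descent on $\rho(w) = -\frac{(u^T w)^2}{w^T S w}$ and prints the optimality gap:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.normal(size=(d, d))
S = A @ A.T + np.eye(d)                  # symmetric positive definite second-moment matrix
u = rng.normal(size=d)

def rho(w):
    """Reparametrized least-squares objective rho(w) = -(u^T w)^2 / (w^T S w)."""
    return -(u @ w) ** 2 / (w @ S @ w)

def grad_rho(w):
    q = w @ S @ w
    return -2 * (u @ w) / q * u + 2 * (u @ w) ** 2 / q ** 2 * (S @ w)

rho_star = -(u @ np.linalg.solve(S, u))  # optimal value, attained in the direction S^{-1} u
w = rng.normal(size=d)
for t in range(8001):
    w = w - 0.01 * grad_rho(w)
    w = w / np.linalg.norm(w)            # rho is scale-invariant, so renormalizing is harmless
    if t % 2000 == 0:
        print(t, rho(w) - rho_star)      # the optimality gap shrinks toward zero
```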

Learning halfspace problem

The problem of learning halfspaces refers to the training of the Perceptron, which is the simplest form of neural network. The optimization problem in this case is

$$\min_{\tilde{w}} f(\tilde{w}) = \min_{\tilde{w}} \mathbb{E}_{y,x}\left[\phi\left(z^T \tilde{w}\right)\right],$$

where $z = -y x$ and $\phi$ is an arbitrary loss function.

Suppose that $\phi$ is infinitely differentiable and has a bounded derivative. Assume that the objective function $f$ is smooth and that a bounded optimal solution exists. Also assume $z$ is a multivariate normal random variable with mean $\mu$ and covariance matrix $\Sigma$. With the Gaussian assumption, it can be shown that all critical points lie on the same line, for any choice of loss function $\phi$. Specifically, the gradient of $f$ could be represented as

$$\nabla_{\tilde{w}} f(\tilde{w}) = c_1(\tilde{w})\, \mu + c_2(\tilde{w})\, \Sigma \tilde{w},$$

where $c_1(\tilde{w}) = \mathbb{E}_z\left[\phi^{(1)}\left(z^T \tilde{w}\right)\right]$, $c_2(\tilde{w}) = \mathbb{E}_z\left[\phi^{(2)}\left(z^T \tilde{w}\right)\right]$, and $\phi^{(i)}$ is the $i$-th derivative of $\phi$.

By setting the gradient to $0$, it thus follows that the bounded critical points $\tilde{w}^*$ can be expressed as $\tilde{w}^* = g_*\, \Sigma^{-1} \mu$, where the scalar $g_*$ depends on $c_1$ and $c_2$. Combining this global property with length-direction decoupling, it could thus be proved that this optimization problem converges linearly.

First, a variation of gradient descent with batch normalization, Gradient Descent in Normalized Parameterization (GDNP), is designed for the objective function $\min_{w \neq 0,\, \gamma} f(w, \gamma)$, such that the direction and length of the weights are updated separately. GDNP uses a stopping criterion for the length optimization together with a step size that adapts to the current direction. At each step, while the stopping criterion is not yet met, the direction $w_t$ is updated by a gradient step; the length $\gamma_t$ is then updated with the classical bisection algorithm, run for a fixed number of inner iterations. Denoting the total number of direction iterations as $T_d$, the final output of GDNP is the weight vector assembled from the final direction and length.

The GDNP algorithm thus slightly modifies the batch normalization step for the ease of mathematical analysis.

It can be shown that in GDNP, the partial derivative of $f$ with respect to the length component converges to zero at a linear rate, with constants that depend on the two starting points of the bisection algorithm on the left and on the right. Further, for each iteration, the norm of the gradient of $f$ with respect to the direction $w$ converges linearly. Combining these two inequalities, a bound could thus be obtained for the gradient with respect to $\tilde{w}$, such that the algorithm is guaranteed to converge linearly.

Although the proof stands on the assumption of Gaussian input, it is also shown in experiments that GDNP could accelerate optimization without this constraint.

Neural networks

Consider a multilayer perceptron (MLP) with one hidden layer and $N$ hidden units with mapping from input $x$ to a scalar output described as

$$F_x\left(\tilde{W}, \Theta\right) = \sum_{i=1}^{N} \theta_i\, \phi\left(x^T \tilde{w}_i\right),$$

where $\tilde{w}_i$ and $\theta_i$ are the input and output weights of unit $i$ correspondingly, and $\phi$ is the activation function and is assumed to be a tanh function.

The input and output weights could then be optimized with

$$\min_{\tilde{W}, \Theta} f\left(\tilde{W}, \Theta\right) = \mathbb{E}_{y,x}\left[\ell\left(-y\, F_x\left(\tilde{W}, \Theta\right)\right)\right],$$

where $\ell$ is a loss function, $\tilde{W} = \left(\tilde{w}_1, \dots, \tilde{w}_N\right)$, and $\Theta = \left(\theta_1, \dots, \theta_N\right)$.

Consider $\Theta$ fixed and optimize only $\tilde{W}$; it can be shown that the critical points of $f$ for a particular hidden unit $i$, denoted $\tilde{w}_i^*$, all align along one line depending on incoming information into the hidden layer: each $\tilde{w}_i^*$ is a scalar multiple $c_i$ of a single fixed vector determined by the distribution of the incoming data, for $i = 1, \dots, N$.

This result could be proved by setting the gradient of $f$ to zero and solving the system of equations.

Apply the GDNP algorithm to this optimization problem by alternating optimization over the different hidden units. Specifically, for each hidden unit, run GDNP to find its optimal direction and length. With the same choice of stopping criterion and stepsize, it follows that the gradient with respect to each unit's weights converges to zero at a linear rate.

Since the parameters of each hidden unit converge linearly, the whole optimization problem has a linear rate of convergence.[8]

from Grokipedia
Batch normalization is a technique in deep learning that normalizes the inputs to each layer of a neural network by subtracting the batch mean and dividing by the batch standard deviation, then applying learnable scale and shift parameters to restore representational power. This process addresses internal covariate shift—the change in the distribution of layer inputs during training—enabling faster convergence, higher learning rates, and more robust optimization without requiring meticulous weight initialization.[1] Proposed by Sergey Ioffe and Christian Szegedy in 2015, and presented at the 2015 International Conference on Machine Learning (ICML), batch normalization was introduced as a method to accelerate the training of deep neural networks, particularly those with many layers where traditional training is hindered by shifting distributions that necessitate low learning rates and careful parameter tuning. The paper received the ICML 2025 Test of Time Award.[1][2][3] The approach integrates directly into the network architecture, typically applied immediately before the nonlinearity in fully connected or convolutional layers, transforming inputs $x = Wu$ (where $W$ are weights and $u$ the previous layer's outputs) into normalized versions.[1]

The core mechanism operates on mini-batches during training: for a mini-batch $B = \{x_1, x_2, \dots, x_m\}$, it computes the empirical mean $\mu_B = \frac{1}{m} \sum_{i=1}^m x_i$ and variance $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2$; each input is then normalized as $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (with $\epsilon$ a small constant for numerical stability), followed by scaling and shifting $y_i = \gamma \hat{x}_i + \beta$, where $\gamma$ and $\beta$ are learned parameters initialized to 1 and 0, respectively.[1] During inference, population statistics (running averages of means and variances from training) replace batch statistics to ensure deterministic outputs.[1]

Among its key benefits, batch normalization not only speeds up training—achieving comparable accuracy to prior methods with up to 14 times fewer steps on ImageNet classification—but also serves as a form of regularization, sometimes obviating the need for dropout while improving generalization.[1] It has become a standard component in convolutional neural networks (CNNs) and other deep architectures, significantly contributing to advances in computer vision and beyond, though it performs best with sufficiently large mini-batches and can introduce challenges in scenarios like recurrent networks or small-batch training.[1]

History and Introduction

Development and Original Proposal

Batch normalization was introduced by Sergey Ioffe and Christian Szegedy, researchers at Google, in their seminal 2015 paper titled "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," published as an arXiv preprint and later presented at the International Conference on Machine Learning (ICML).[1] The technique emerged as a response to the growing difficulties in training increasingly deep neural networks following breakthroughs like AlexNet in 2012, where extending network depth beyond a few layers often led to training instability.[1] At the time, deep network training was hindered by issues such as vanishing gradients and activation saturation, which were exacerbated in deeper architectures and necessitated careful parameter initialization, small learning rates, and activation functions like ReLU to mitigate slow convergence.[1] Ioffe and Szegedy proposed batch normalization to address these challenges by standardizing the inputs to each layer during training, using the mean and variance computed from mini-batch statistics, thereby stabilizing the learning process across layers.[1] This approach was motivated by the problem of internal covariate shift, where the distribution of layer inputs shifts as parameters are updated, complicating optimization.[1] In initial experiments, the authors applied batch normalization to an image classification model on the ImageNet dataset, demonstrating that it achieved the same accuracy as a baseline state-of-the-art model in 14 times fewer training steps while surpassing it with a top-5 validation error of 4.9%.[1] These results highlighted batch normalization's ability to enable much higher learning rates and reduce the sensitivity to initialization, marking a significant advancement in accelerating deep network training.[1]

Impact and Recognition

Following its introduction in 2015, batch normalization was rapidly integrated into major deep learning frameworks, with TensorFlow incorporating it as a core layer by late 2015 through its contrib module and later as a standard Keras layer.[4] PyTorch, released in early 2017, included batch normalization as a built-in module from its initial versions, facilitating seamless adoption in convolutional neural networks (CNNs) and recurrent architectures. This quick integration transformed batch normalization into a default component in model design, extending its use beyond vision tasks to natural language processing and generative models. The technique's key impacts include enabling the stable training of much deeper networks, such as the Residual Networks (ResNets) introduced in 2015, which achieved unprecedented depths of over 100 layers on ImageNet without gradient vanishing issues. It also diminished the need for meticulous weight initialization schemes like Xavier or He, allowing practitioners to employ higher learning rates and simpler setups, thereby streamlining experimentation. These advancements played a pivotal role in democratizing deep learning by lowering barriers to effective model training for researchers and engineers without specialized tuning expertise. By 2020, batch normalization had become ubiquitous, appearing in over 90% of top-performing ImageNet classification models on leaderboards, and it maintained strong prevalence in computer vision benchmarks through the mid-2020s.[5] A notable milestone was its inclusion in major architectures like Inception-v2 in 2015, where it improved accuracy and training speed on large-scale datasets. In recognition of its lasting influence, the original 2015 paper by Sergey Ioffe and Christian Szegedy received the ICML Test of Time Award in 2025, underscoring its enduring relevance a decade after publication as a foundational enabler of modern deep learning pipelines.[6]

Motivation

Internal Covariate Shift

Internal covariate shift refers to the change in the distributions of internal activations within a deep neural network during training, primarily caused by updates to the parameters of preceding layers. This phenomenon alters the inputs to subsequent layers, leading to shifting activation distributions that must be continually adapted to by the network.[1] The consequences of internal covariate shift include the need for lower learning rates to maintain training stability, more careful parameter initialization to avoid initial instability, and overall slower convergence, particularly in deeper networks. These issues are exacerbated by the amplification of small changes through multiple layers, often resulting in the saturation of nonlinear activation functions such as the sigmoid, which in turn causes vanishing gradients and hinders effective training.[1] Evidence for internal covariate shift is illustrated in the original proposal through plots of activation distributions in a sigmoid-activated network trained on ImageNet without normalization; these show significant shifts in the mean and variance of activations over training epochs, indicating ongoing distributional changes. In contrast, networks employing batch normalization maintain stable distributions throughout training.[1] This concept extends the broader machine learning notion of covariate shift—where the input distribution changes between training and test phases—to the internal representations of hidden layers, rather than solely the input features, making it a network-specific challenge akin to repeated domain shifts within the model itself.[1] Batch normalization mitigates internal covariate shift by normalizing the inputs to each layer using statistics computed from each mini-batch, thereby fixing the mean and variance of the activations and reducing the distributional changes caused by parameter updates.[1]

Modern Interpretations of Benefits

Subsequent research has challenged the original hypothesis that batch normalization primarily mitigates internal covariate shift, proposing instead that its benefits stem from smoothing the optimization landscape of the loss function. However, the role of internal covariate shift remains debated, with some studies, such as Rauf et al. (2020), arguing that reducing ICS is essential and sufficient for batch normalization's performance gains, countering earlier challenges.[7][8] In a seminal 2018 study, Santurkar et al. demonstrated through controlled experiments that internal covariate shift persists even with batch normalization applied, yet training converges faster and more reliably compared to networks without it.[7] Specifically, they showed that networks without batch normalization, but with artificially stabilized layer distributions to minimize covariate shift, do not exhibit the same acceleration in training or stability gains.[7] This empirical evidence from ablation studies underscores that the core advantage lies elsewhere, shifting focus to how batch normalization alters the geometry of the loss surface for more effective optimization.[7] One key mechanism identified is the reduction in the Lipschitz constant of the loss function, which bounds the norms of gradients across the parameter space and prevents explosive behavior during optimization.[7] By enforcing this smoothness, batch normalization enables the use of substantially larger learning rates without risking instability, as the constrained gradient magnitudes maintain consistent update steps.[7] Complementing this, Bjorck et al. argued that batch normalization acts as a form of preconditioning, transforming the curvature of the loss landscape to make it more isotropic in the parameter space, thereby stabilizing gradient descent and promoting convergence akin to optimizing a more convex-like function.[9] More recent work, as of 2025, suggests that batch normalization also improves the clustering characteristics of hidden representations, enhancing generalization without relying on sparsity.[10] These interpretations collectively explain the empirical observation that batch-normalized networks train more efficiently across diverse architectures and tasks. An additional perspective highlights the regularizing effect arising from the stochasticity in mini-batch statistics, which injects beneficial noise during training to enhance generalization. This noise, inherent to estimating means and variances from finite mini-batches rather than the full dataset, mirrors techniques like dropout by introducing variability that discourages overfitting, though it requires no extra inference-time computation. While not the primary driver of optimization speed, this aspect contributes to the overall robustness of batch normalization in improving model performance on unseen data.[1]

Procedures

Forward Pass Normalization

Batch normalization is applied during the forward pass immediately after the linear transformation in a layer, such as a fully connected or convolutional operation (e.g., $x = Wu + b$), but before the nonlinearity (e.g., ReLU). For a mini-batch of $m$ activations $B = \{x_1, x_2, \dots, x_m\}$, the batch mean $\mu_B$ and variance $\sigma_B^2$ are first computed as follows:

$$\mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2.$$

These statistics normalize the input distribution to have zero mean and unit variance. The normalization step then transforms each input $x_i$ to $\hat{x}_i$:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$

where $\epsilon$ is a small positive constant added to the denominator for numerical stability, preventing division by zero when the batch variance is small. To retain representational power, the normalized values are subsequently scaled and shifted using learnable parameters $\gamma$ and $\beta$, one pair per feature:

$$y_i = \gamma \hat{x}_i + \beta.$$

This allows the layer to recover the original distribution if needed, while benefiting from the stabilized inputs. In convolutional layers, normalization occurs independently for each channel (feature map), with statistics aggregated across the spatial dimensions (height $p$ and width $q$) and the batch, resulting in an effective mini-batch size of $m' = m \cdot p \cdot q$; a single $\gamma$ and $\beta$ are applied per channel.
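A per-channel version for convolutional feature maps can be sketched as follows (an illustrative NumPy routine written for this article, using the NCHW layout; names and shapes are this sketch's own):

```python
import numpy as np

def batch_norm_conv_forward(x, gamma, beta, eps=1e-5):
    """Per-channel batch normalization for a feature map x of shape (N, C, H, W).

    Statistics are taken over the batch and both spatial axes, so each channel has
    an effective mini-batch of size N * H * W; gamma and beta have shape (C,).
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)        # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.default_rng(3).normal(size=(8, 16, 14, 14))   # N=8, C=16, H=W=14
y = batch_norm_conv_forward(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=(0, 2, 3))[:3], y.var(axis=(0, 2, 3))[:3])  # ~0 and ~1 per channel
```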

Backpropagation Through Normalization

During training, backpropagation through the batch normalization (BN) layer computes gradients with respect to the layer's inputs and learnable parameters using the chain rule, accounting for the dependencies introduced by the mini-batch mean $\mu_B$ and variance $\sigma_B^2$.[1] The gradient with respect to the normalized input $\hat{x}_i$ is first obtained as $\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma$, where $L$ is the loss, $y_i$ is the scaled and shifted output, and $\gamma$ is the scale parameter; this propagates the upstream gradient $\frac{\partial L}{\partial y_i}$ through the affine transformation.[1] To compute the gradient with respect to the original input $x_i$, the dependencies on $\mu_B$ and $\sigma_B^2$ must be resolved via the chain rule:

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m},$$

where $m$ is the mini-batch size, and the auxiliary gradients are

$$\frac{\partial L}{\partial \sigma_B^2} = \sum_{j=1}^m \frac{\partial L}{\partial \hat{x}_j} \cdot \frac{(x_j - \mu_B) \cdot (-1/2)}{(\sigma_B^2 + \epsilon)^{3/2}}, \quad \frac{\partial L}{\partial \mu_B} = \sum_{j=1}^m \frac{\partial L}{\partial \hat{x}_j} \cdot \left( -\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \sum_{j=1}^m \frac{-2 (x_j - \mu_B)}{m}.$$

These terms correct for the batch-wide statistics, ensuring the input gradient reflects the normalization's effect on the entire mini-batch; the small constant $\epsilon > 0$ (typically $10^{-5}$) is included in the denominator to prevent division by zero and maintain numerical stability during differentiation. Note that the second term in $\frac{\partial L}{\partial \mu_B}$ evaluates to zero since $\sum_j (x_j - \mu_B) = 0$.[1] The gradients for the learnable parameters $\gamma$ and $\beta$ are simpler, as they depend only on the normalized inputs and upstream gradients without batch statistic corrections:

$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^m \frac{\partial L}{\partial y_i} \cdot \hat{x}_i, \quad \frac{\partial L}{\partial \beta} = \sum_{i=1}^m \frac{\partial L}{\partial y_i}.$$

These are essentially unnormalized sums (or averages when divided by $m$) that enable efficient updates to the affine parameters via stochastic gradient descent.[1] Computationally, propagating gradients through BN incurs an additional cost of $O(m \cdot d)$ per layer, where $d$ is the feature dimension, due to the summations over the mini-batch; however, this is offset by the layer's role in enabling larger learning rates and more stable parameter updates during training.[1]
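A quick way to validate these formulas is a finite-difference check. The sketch below (an illustrative NumPy check written for this article, which assumes a scalar test loss $L = \sum_i \frac{\partial L}{\partial y_i} \cdot y_i$ with a fixed upstream gradient) compares the analytic input gradient with a numerical derivative:

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def bn_backward_dx(x, dy, gamma, eps=1e-5):
    """Analytic dL/dx assembled from the chain-rule expressions above."""
    m = x.shape[0]
    mu, var = x.mean(axis=0), x.var(axis=0)
    std_inv = 1.0 / np.sqrt(var + eps)
    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * std_inv**3, axis=0)
    dmu = np.sum(-dx_hat * std_inv, axis=0)        # second term drops out: sum(x - mu) = 0
    return dx_hat * std_inv + dvar * 2.0 * (x - mu) / m + dmu / m

rng = np.random.default_rng(4)
x = rng.normal(size=(16, 3))
gamma, beta = rng.normal(size=3), rng.normal(size=3)
dy = rng.normal(size=(16, 3))                      # fixed upstream gradient dL/dy

analytic = bn_backward_dx(x, dy, gamma)
h = 1e-5
xp, xm = x.copy(), x.copy()
xp[2, 1] += h                                      # perturb one input entry up and down
xm[2, 1] -= h
numeric = (np.sum(dy * bn_forward(xp, gamma, beta)) -
           np.sum(dy * bn_forward(xm, gamma, beta))) / (2 * h)
print(analytic[2, 1], numeric)                     # the two values should agree closely
```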

Inference Phase

During the inference phase of models employing batch normalization, the normalization process adapts to operate without mini-batches, relying instead on fixed population statistics for the mean and variance to ensure deterministic outputs. Specifically, the mini-batch mean $\mu_B$ and variance $\sigma_B^2$ used in training are replaced by estimates of the overall population mean $\mu$ and variance $\sigma^2$, approximated as $\mu \approx \frac{1}{T} \sum_{t=1}^T \mu_B^t$ and $\sigma^2 \approx \frac{m}{m-1} \frac{1}{T} \sum_{t=1}^T \sigma_B^{2,t}$, where $T$ is the number of training mini-batches and $m$ is the mini-batch size, providing an unbiased estimate of the population variance (approximating by neglecting the typically small variance of the batch means).[1] These population statistics are estimated during training by maintaining running averages over the mini-batches, often implemented as exponential moving averages to prioritize recent batches and reduce computational overhead. The update rule for these running statistics typically follows an exponential moving average scheme: the running mean is updated as $\mu_{\text{running}} \leftarrow \text{momentum} \times \mu_{\text{running}} + (1 - \text{momentum}) \times \mu_B$, with a similar form for the running variance, where the momentum parameter (commonly set to 0.9 or 0.99) controls the decay rate and balances responsiveness to new data with stability. In inference, the normalization step then applies the standard batch normalization formula but with these fixed running statistics:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}},$$

followed by scaling and shifting $y = \gamma \hat{x} + \beta$, where $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the learned parameters. This substitution yields consistent, batch-independent predictions, as the output depends solely on the input and the fixed parameters rather than stochastic mini-batch variations.[1]

A key challenge in the inference phase arises when processing small or single-sample inputs, such as in online deployment or real-time applications, where computing fresh mini-batch statistics is infeasible or unreliable due to high variance in small samples. Solutions include relying on the precomputed running averages, which approximate the full dataset distribution without requiring batch aggregation, though this can introduce minor mismatches if training batches were atypically sized. Alternatively, one can compute exact population statistics by passing the entire training dataset through the model post-training, albeit at significant computational cost, or adopt batch-independent normalization techniques like layer normalization, which normalizes across features within each individual sample rather than across a batch.

Unlike the training phase, where mini-batch statistics introduce stochastic noise that acts as a form of regularization to prevent overfitting, inference with fixed population statistics produces smoother, more deterministic predictions without this variability. This eliminates the "noise" from batch sampling, potentially leading to slightly less robust generalizations in scenarios sensitive to distributional shifts, though it enhances reproducibility and efficiency in deployment.[1]
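The running-statistics bookkeeping can be sketched as follows (illustrative code written for this article; the class name, the momentum value, and the absence of bias correction are choices made here):

```python
import numpy as np

class RunningBatchNorm:
    """Minimal BN state: exponential moving averages of batch statistics."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Exponential moving average update of the population estimates.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var   # fixed statistics at inference
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = RunningBatchNorm(4)
rng = np.random.default_rng(5)
for _ in range(100):                                     # training loop feeds mini-batches
    bn.forward(rng.normal(loc=2.0, scale=3.0, size=(32, 4)), training=True)
out = bn.forward(rng.normal(loc=2.0, scale=3.0, size=(1, 4)), training=False)
print(bn.running_mean, out)                              # single samples work at inference
```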

Theoretical Foundations

Loss Landscape Smoothness

Batch normalization enhances the smoothness of the loss landscape in neural network optimization by reducing the Lipschitz constant of the gradient, ensuring that $\|\nabla \mathcal{L}(\theta) - \nabla \mathcal{L}(\theta')\| \leq \beta \|\theta - \theta'\|$ for some $\beta > 0$, where $\mathcal{L}$ is the loss function and $\theta$ are the parameters. This property, known as $\beta$-smoothness, bounds how rapidly the gradient changes with respect to parameter perturbations, leading to more predictable and stable optimization dynamics. The mechanism underlying this smoothness arises from batch normalization's reparametrization of the network, which constrains the magnitudes of activations and weights, thereby limiting the sensitivity of gradients to small changes in parameters and reducing overall gradient variability during training.[7]

Mathematically, consider a layer with output $y = f(Wx)$, where $W$ is the weight matrix and $x$ is the input. Batch normalization, inserted after the linear transformation, normalizes these pre-activation values and introduces learnable scale and shift parameters $\gamma$ and $\beta$, which effectively rescale the Hessian matrix associated with the layer's loss contribution. This rescaling reduces the condition number of the Hessian, as shown by bounding the second-order terms in the gradient computation: specifically, the normalized gradient satisfies $\|\nabla_{y_j} \hat{\mathcal{L}}\|_2^2 \leq \frac{\gamma^2}{\sigma_j^2} \left( \|\nabla_{y_j} \mathcal{L}\|_2^2 - \frac{1}{m} \langle \mathbf{1}, \nabla_{y_j} \mathcal{L} \rangle^2 - \frac{1}{m} \langle \nabla_{y_j} \mathcal{L}, \hat{y}_j \rangle^2 \right)$, where $\hat{\mathcal{L}}$ is the loss after normalization, $\sigma_j^2$ is the variance, and $m$ is the batch size; a similar bound applies to the Hessian quadratic form, promoting a more well-conditioned optimization problem that facilitates efficient convergence.[7]

Empirically, Santurkar et al. (2018) demonstrate this effect through visualizations of loss contours on deep linear networks, where batch normalization transforms elongated, ill-conditioned surfaces into rounder, more isotropic ones, reducing loss variation by up to two orders of magnitude early in training. This smoothing enables the use of substantially higher learning rates—up to 10 times larger—without divergence, accelerating convergence while maintaining stability across various architectures like VGG networks.[7]

Covariate Shift Quantification

To quantify the extent of internal covariate shift, researchers have employed metrics that measure changes in the distributions of layer activations over the course of training. A common approach is the Kullback-Leibler (KL) divergence between the activation distributions at the initial epoch ($p_1$) and a later epoch $k$ ($p_k$), given by

$$D_\text{KL}(p_1 \Vert p_k) = \int p_1(x) \log \frac{p_1(x)}{p_k(x)} \, dx,$$
which quantifies how much the distribution has shifted. This metric highlights the degree to which parameter updates alter input distributions to subsequent layers, a core aspect of internal covariate shift. In the seminal batch normalization paper, experiments on a multi-layer network trained on MNIST demonstrated that without normalization, activation distributions undergo substantial shifts during training, as visualized in histograms of layer inputs that deviate significantly from their initial Gaussian-like form. Batch normalization stabilizes these distributions by enforcing zero mean and unit variance per mini-batch, effectively reducing the observed shift and allowing for higher learning rates without divergence. Results indicated that this stabilization is particularly pronounced in early layers, where shifts are most disruptive to gradient flow.[1] Subsequent critiques, notably a 2018 NeurIPS study, challenged the centrality of shift reduction by showing that internal covariate shift persists even with batch normalization. Using a measure based on the L2 difference in mean and standard deviation of layer inputs across training iterations, the study found that batch-normalized networks exhibit comparable or sometimes greater shift magnitudes than non-normalized ones, yet train substantially faster—reaching 83% accuracy on CIFAR-10 with a VGG-like ReLU network versus 80% without. This lack of correlation between measured shift and training speed suggests that internal covariate shift mitigation is not the primary mechanism behind batch normalization's effectiveness.[7] While batch normalization mitigates internal covariate shift to some degree, empirical evidence indicates it is not the sole reason for its advantages, with complementary interpretations like loss landscape smoothness providing additional explanatory power.[7]
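A histogram-based version of this measurement can be sketched as follows (illustrative code written for this article; the bin count and smoothing constant are arbitrary choices, and real experiments would use recorded layer activations rather than synthetic draws):

```python
import numpy as np

def kl_divergence(acts_epoch1, acts_epochk, bins=50, smooth=1e-8):
    """Approximate D_KL(p_1 || p_k) between two sets of layer activations."""
    lo = min(acts_epoch1.min(), acts_epochk.min())
    hi = max(acts_epoch1.max(), acts_epochk.max())
    p1, _ = np.histogram(acts_epoch1, bins=bins, range=(lo, hi), density=True)
    pk, _ = np.histogram(acts_epochk, bins=bins, range=(lo, hi), density=True)
    p1, pk = p1 + smooth, pk + smooth                # smooth to avoid log(0)
    p1, pk = p1 / p1.sum(), pk / pk.sum()            # renormalize to proper distributions
    return np.sum(p1 * np.log(p1 / pk))

rng = np.random.default_rng(6)
early = rng.normal(0.0, 1.0, size=10_000)            # activations recorded at the first epoch
late = rng.normal(0.8, 1.6, size=10_000)             # shifted activations at a later epoch
print(kl_divergence(early, late))                    # larger value = larger distribution shift
```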

Gradient Flow Stabilization

In deep neural networks, gradients propagate backward through the chain rule, where the gradient at each layer is the product of the incoming gradient and the layer's Jacobian. This multiplicative process often results in vanishing gradients, scaling exponentially as $\exp(-c \cdot \text{depth})$ for some constant $c > 0$, or exploding gradients when the product grows uncontrollably, hindering effective training of deep architectures.[11] Batch normalization addresses this instability by normalizing the inputs to each layer, constraining activations to have zero mean and unit variance during the forward pass. This keeps activation magnitudes in a stable range, reducing the risk of scale-induced amplification or attenuation in subsequent gradient computations. Additionally, the learnable scaling parameter $\gamma$ decouples the learning of activation scales from the transformation itself, allowing the network to adaptively control gradient magnitudes without relying on weight initialization alone.[11]

Mathematically, the effective Jacobian of a batch-normalized layer tends to have singular values clustered around 1, promoting balanced gradient flow across depths, in contrast to unnormalized layers where singular values can drift far from unity and cause instability.[11] The backward pass through normalization introduces a scaling factor, approximating the layer-wise gradient as $\frac{\partial L}{\partial x} \approx \left( \frac{\partial L}{\partial y} \right) \cdot \left( \frac{\partial y}{\partial x} \right) / \sigma$, where $\sigma$ is the standard deviation of the layer inputs; this division by $\sigma$ normalizes the gradient magnitude, counteracting the cumulative effects of prior layers.[11] Empirical evidence from deep networks demonstrates this stabilization: in 90-layer models trained on permutation-invariant MNIST, batch normalization helps prevent exponential decay of gradient norms compared to no normalization, though norms may increase with depth for certain activations; combining it with techniques like backward gradient normalization achieves flat profiles across layers.[12]

The synergy between batch normalization and skip connections in residual networks (ResNets) enhances this effect further. Skip connections enable identity mappings, allowing gradients to flow directly through added shortcut paths, while batch normalization ensures these propagated signals remain well-conditioned; together, they enable stable training of networks exceeding 100 layers deep without gradient pathologies.[13][14]

Parameter Decoupling

In standard neural networks without normalization, the weights in each layer entangle the magnitude (scale, denoted as $\|W\|$) and direction (shape, denoted as $W / \|W\|$) of the linear transformation, which can lead to ill-conditioned optimization problems where small changes in direction require large adjustments in magnitude to maintain output scale.[15] Batch normalization addresses this by applying normalization after the linear transformation, allowing the network to primarily learn the direction through the weights $W$, while the learnable scale parameter $\gamma$ absorbs and controls the overall magnitude; as a result, the effective output scale is governed by $\gamma$ rather than by $\|W\|$, decoupling these aspects.[15] For a linear layer, this can be expressed mathematically as $\mathrm{BN}(Wx) \approx \gamma \left( \frac{Wx}{\|Wx\|} \right) + \beta$, where $\beta$ is the learnable shift parameter, effectively transferring the influence of $\|W\|$ to $\gamma$ and isolating directional learning in $W$.[15] This decoupling reduces the network's sensitivity to weight initialization, as initial magnitude variations are normalized away, enabling more robust starting points without fine-tuned scaling.[15] It also permits aggressive weight updates and larger learning rates during training, as directional adjustments no longer risk explosive scale changes that could destabilize gradients.[15] Theoretically, batch normalization improves the conditioning of the Fisher information matrix by reducing its maximum eigenvalue, which flattens the loss landscape and facilitates smoother optimization.[16]

Convergence Guarantees

Least-Squares Optimization

In linear models, batch normalization is analyzed in the context of solving the ordinary least-squares problem, which seeks to minimize the objective $\min_W \|XW - Y\|_F^2$, where $X \in \mathbb{R}^{n \times d}$ is the input data matrix, $Y \in \mathbb{R}^{n \times m}$ is the target matrix, and $W \in \mathbb{R}^{d \times m}$ are the weights, with the Frobenius norm measuring the squared error across all outputs. This setup captures the essence of linear regression, where gradient descent iteratively updates $W$ to reduce the loss. Without batch normalization, convergence of gradient descent can be slow when features in $X$ exhibit varying scales or heterogeneous variances, leading to an ill-conditioned Gram matrix $X^\top X$ with a large condition number $\kappa = \lambda_{\max}/\lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues. This ill-conditioning results in a suboptimal convergence rate for gradient descent, often requiring careful tuning of the learning rate and many more iterations to reach a small error. Batch normalization, applied before the linear transformation, standardizes each feature across the batch to zero mean and unit variance, effectively preprocessing $X$ to mitigate scale differences and reduce the condition number $\kappa$ of the effective Gram matrix.[17] Consequently, gradient descent with batch normalization achieves linear convergence at a rate $\mathcal{O}\left(1 - \frac{1}{\kappa}\right)$, where the reduced $\kappa$ accelerates the process compared to the unnormalized case.[17] The proof relies on analyzing the dynamics of gradient descent through the normalized Gram matrix $H^*$, which batch normalization constructs via over-parameterization and standardization, making its eigenvalues more balanced and closer to unity under feature normalization.[17] A key theorem establishes that this yields a speedup factor proportional to the variance reduction across features, with the contraction rate improving linearly as $\kappa$ decreases (Theorem 3.4 in Cai et al., 2019).[17] Experiments on synthetic datasets, where features are generated with heterogeneous variances (e.g., exponentially increasing scales), demonstrate that batch normalization reduces the number of epochs to convergence by 2-5 times relative to standard gradient descent, while maintaining robustness to larger learning rates.[17] This linear case provides foundational insight, with extensions to related problems like halfspace learning showing similar benefits in classification settings.[17]
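The conditioning effect is straightforward to reproduce on synthetic data. The sketch below (illustrative code written for this article, with arbitrary feature scales) compares the condition number of the Gram matrix before and after per-feature standardization, and converts it into a rough count of gradient-descent steps needed to halve the error via the $1 - \frac{1}{\kappa}$ contraction discussed above:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 500, 10
scales = 10.0 ** np.linspace(0, 3, d)          # heterogeneous feature scales
X_raw = rng.normal(size=(n, d)) * scales
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # per-feature standardization

def condition_number(X):
    """Condition number of the Gram matrix X^T X / n."""
    eig = np.linalg.eigvalsh(X.T @ X / len(X))
    return eig.max() / eig.min()

for name, X in [("raw", X_raw), ("standardized", X_std)]:
    kappa = condition_number(X)
    # Gradient descent on least squares contracts the error by roughly (1 - 1/kappa) per step.
    steps_to_halve = np.log(0.5) / np.log(1 - 1 / kappa)
    print(f"{name:>12}: kappa = {kappa:.3e}, ~{steps_to_halve:.0f} steps to halve the error")
```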

Halfspace Learning

In the context of halfspace learning, batch normalization (BN) is analyzed for linear classifiers such as logistic regression or support vector machines (SVMs) trained on separable data, where the objective is to minimize losses like the logistic loss or hinge loss to separate the data into halfspaces.[15] This setup typically involves Gaussian-distributed inputs, and the optimization problem is formulated as minimizing the expected loss $f(\tilde{w}) = \mathbb{E}_{y,x} [\phi(-y x^T \tilde{w})]$, where $\tilde{w}$ represents the normalized weight vector and $\phi$ is the loss function.[15] BN normalizes the inputs to the linear layer, which decouples the optimization into length and direction components of the weights, thereby improving margin maximization during training.[15] This normalization effect leads to faster convergence, achieving an iteration complexity of $O(\log(1/\epsilon))$ for strongly convex losses to reach an $\epsilon$-accurate solution in stochastic gradient descent (SGD).[15] A key theoretical result from Kohler et al. (2018) demonstrates that BN provides a linear speedup in SGD by effectively bounding the step sizes, exploiting the decoupled structure to stabilize and accelerate the optimization process.[15] Post-BN, the effective loss landscape becomes $\mu$-strongly convex with an $L$-Lipschitz continuous gradient, enabling linear convergence at the rate $1 - \mu/L$ per iteration.[15]
Specifically, under assumptions of Gaussian inputs and separable data, the gradient norm satisfies

$$\|\nabla_{\tilde{w}} f(\tilde{w}_{T_d})\| \leq (1 - \mu/L)^{2T_d} \Phi^2 (\rho(w_0) - \rho^*) + \text{error term},$$

where $T_d$ is the number of iterations for the direction update, $\Phi$ bounds the initial margin, and $\rho$ measures the margin.[15] This result holds for the direction optimization phase after length decoupling via BN.[15] Without BN, high-variance features in the input distribution can disproportionately slow down margin growth, as the optimization struggles with ill-conditioned landscapes.[15] In contrast, BN equalizes feature variances through normalization, promoting balanced updates and mitigating these slowdowns to achieve the accelerated rates.[15] This analysis uses least-squares objectives as a proxy for quadratic approximations of the losses in some derivations.[15]

Deep Neural Network Training

In overparameterized deep neural networks, batch normalization (BN) plays a role in the neural tangent kernel (NTK) regime, where wide networks behave like kernel methods during training with stochastic gradient descent (SGD). This regime assumes infinite width, transforming the nonlinear dynamics into a linear system governed by the NTK, whose eigenvalues determine the optimization landscape's conditioning. BN can affect these eigenvalues by altering the network's dynamical regimes, such as promoting chaotic behavior that influences trainability, as analyzed in studies of normalization effects on the NTK spectrum.[18][19] Theoretical analyses in the NTK regime suggest that BN contributes to improved optimization landscapes in certain settings, with extensions from linear models indicating potential benefits in stabilizing gradient flow and reducing spectral bias in specific architectures.[17][20] These effects are primarily derived under the infinite-width assumption, with finite-width networks approximating the behavior but potentially deviating due to feature learning beyond lazy training. Empirical evidence on benchmarks like CIFAR-10 supports faster training with BN, where networks reach 90% test accuracy in under 100 epochs, compared to unnormalized models requiring more iterations for similar performance.[21]