Softmax function

from Wikipedia

The softmax function, also known as softargmax[1]: 184  or normalized exponential function,[2]: 198  converts a tuple of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

Definition

The softmax function takes as input a tuple z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

Formally, the standard (unit) softmax function $\sigma \colon \mathbb{R}^K \to (0,1)^K$, where $K \geq 1$, takes a tuple $\mathbf{z} = (z_1, \dots, z_K) \in \mathbb{R}^K$ and computes each component of vector $\sigma(\mathbf{z}) \in (0,1)^K$ with

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}.$$

In words, the softmax applies the standard exponential function to each element $z_i$ of the input tuple (consisting of $K$ real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector $\sigma(\mathbf{z})$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of $(1, 2, 8)$ is approximately $(0.001, 0.002, 0.997)$, which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8).

In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing $b = e^{\beta}$ or $b = e^{-\beta}$[a] (for real β)[b] yields the expressions:[c]

$$\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^K e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^K e^{-\beta z_j}}.$$

A value proportional to the reciprocal of β is sometimes referred to as the temperature: $\beta = 1/(kT)$, where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.

In some fields, the base is fixed, corresponding to a fixed scale,[d] while in others the parameter β (or T) is varied.

Interpretations

Smooth arg max

The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a tuple's largest element. The name "softmax" may be misleading: softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning.[3][4] This section uses the term "softargmax" for clarity.

Formally, instead of considering the arg max as a function with categorical output (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg): $\arg\max(z_1, \dots, z_K) = (y_1, \dots, y_K) = (0, \dots, 0, 1, 0, \dots, 0)$, where the output coordinate $y_i = 1$ if and only if $i$ is the arg max of $(z_1, \dots, z_K)$, meaning $z_i$ is the unique maximum value of $(z_1, \dots, z_K)$. For example, in this encoding $\arg\max(1, 5, 10) = (0, 0, 1)$, since the third argument is the maximum.

This can be generalized to multiple arg max values (multiple equal $z_i$ being the maximum) by dividing the 1 between all max args; formally 1/k where k is the number of arguments assuming the maximum. For example, $\arg\max(1, 5, 5) = (0, 1/2, 1/2)$, since the second and third arguments are both the maximum. In case all arguments are equal, this is simply $\arg\max(z, \dots, z) = (1/K, \dots, 1/K)$. Points z with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points.

With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as $\beta \to \infty$, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z, $\sigma_\beta(\mathbf{z}) \to \arg\max(\mathbf{z})$ as $\beta \to \infty$. However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, $\sigma_\beta(1, 1.0001) \to (0, 1)$, but $\sigma_\beta(1, 0.9999) \to (1, 0)$, and $\sigma_\beta(1, 1) = (1/2, 1/2)$ for all β: the closer the points are to the singular set, the slower they converge. However, softargmax does converge compactly on the non-singular set.
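
As an illustration of this pointwise but non-uniform convergence, a minimal NumPy sketch can evaluate the β-scaled softargmax for a well-separated input and for a near-tie such as (1, 1.0001); the inputs and β values below are chosen only for illustration:

import numpy as np

def softargmax(z, beta=1.0):
    """beta-scaled softmax, shifted by the maximum for numerical stability."""
    x = beta * np.asarray(z, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

for beta in (1.0, 10.0, 100.0, 1e6):
    # (1, 5, 10): converges quickly to the one-hot arg max (0, 0, 1).
    # (1, 1.0001): converges to (0, 1) only once beta is much larger than 1/0.0001.
    print(beta, softargmax([1.0, 5.0, 10.0], beta), softargmax([1.0, 1.0001], beta))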

Conversely, as $\beta \to -\infty$, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".

It is also the case that, for any fixed β, if one input is much larger than the others relative to the temperature, $T = 1/\beta$, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

$$\sigma_1(0, 10) = \left(\frac{1}{1 + e^{10}}, \frac{e^{10}}{1 + e^{10}}\right) \approx (0.00005, 0.99995).$$

However, if the difference is small relative to the temperature, the value is not close to the arg max. For example, a difference of 10 is small relative to a temperature of 100:

$$\sigma_{1/100}(0, 10) = \left(\frac{1}{1 + e^{1/10}}, \frac{e^{1/10}}{1 + e^{1/10}}\right) \approx (0.475, 0.525).$$

As $\beta \to \infty$, temperature goes to zero, $T = 1/\beta \to 0$, so eventually all differences become large (relative to a shrinking temperature), which gives another interpretation for the limit behavior.

Statistical mechanics

In statistical mechanics, the softargmax function is known as the Boltzmann distribution (or Gibbs distribution):[5]: 7  the index set indexes the microstates of the system; the inputs $z_i$ are the energies of those states; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).

Applications

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[2]: 206–209 [6] multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.[7] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample tuple x and a weighting vector w is:

$$P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^\mathsf{T} \mathbf{w}_j}}{\sum_{k=1}^K e^{\mathbf{x}^\mathsf{T} \mathbf{w}_k}}.$$

This can be seen as the composition of K linear functions $\mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_1, \dots, \mathbf{x} \mapsto \mathbf{x}^\mathsf{T}\mathbf{w}_K$ and the softmax function (where $\mathbf{x}^\mathsf{T}\mathbf{w}$ denotes the inner product of $\mathbf{x}$ and $\mathbf{w}$). The operation is equivalent to applying a linear operator defined by $\mathbf{w}$ to tuples $\mathbf{x}$, thus transforming the original, possibly highly dimensional, input to vectors in a K-dimensional space $\mathbb{R}^K$.
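
As a concrete sketch of this composition (with hypothetical toy weights, since no particular model is specified here), the predicted class probabilities can be computed by applying the K linear functions and then the softmax:

import numpy as np

def class_probabilities(x, W):
    """Multinomial logistic regression probabilities: softmax over K linear scores x^T w_j.

    x: feature vector of shape (d,); W: weight matrix of shape (K, d) whose rows are the w_j.
    (Illustrative names; a bias term could be folded into x as a constant feature.)
    """
    z = W @ x                      # K linear functions of the input
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()             # P(y = j | x) for j = 1..K

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # toy 4-dimensional sample
W = rng.normal(size=(3, 4))        # toy weights for K = 3 classes
print(class_probabilities(x, W))   # three probabilities summing to 1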

Neural networks

The standard softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Since the function maps a tuple and a specific index $i$ to a real value, the derivative needs to take the index into account:

$$\frac{\partial}{\partial z_k} \sigma(\mathbf{z})_i = \sigma(\mathbf{z})_i \left(\delta_{ik} - \sigma(\mathbf{z})_k\right).$$

This expression is symmetrical in the indexes $i, k$ and thus may also be expressed as

$$\frac{\partial}{\partial z_k} \sigma(\mathbf{z})_i = \sigma(\mathbf{z})_k \left(\delta_{ik} - \sigma(\mathbf{z})_i\right).$$

Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
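
A short NumPy sketch (illustrative only) builds the full Jacobian from this Kronecker-delta expression; its rows sum to zero because the outputs are constrained to sum to 1:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Jacobian entries d sigma_i / d z_k = sigma_i * (delta_ik - sigma_k)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

J = softmax_jacobian(np.array([1.0, 2.0, 3.0]))
print(J)                 # symmetric matrix
print(J.sum(axis=1))     # ~[0, 0, 0], reflecting the sum-to-1 constraint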

To ensure stable numerical computations, subtracting the maximum value from the input tuple is common. This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.
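
A minimal sketch of this max-subtraction trick, comparing a naive implementation with the shifted one on deliberately large inputs:

import numpy as np

def softmax_naive(z):
    e = np.exp(z)                      # may overflow for large inputs
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))          # exponents are <= 0, so each term is <= 1
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(z))    # overflow: [nan, nan, nan] (with a RuntimeWarning)
print(softmax_stable(z))   # well-defined: approximately [0.090, 0.245, 0.665]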

If the function is scaled with the parameter β, then these expressions must be multiplied by β.

See multinomial logit for a probability model which uses the softmax activation function.

Reinforcement learning

In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities. The function commonly used is:[8]

$$P_t(a) = \frac{\exp\!\left(q_t(a)/\tau\right)}{\sum_{i=1}^{n} \exp\!\left(q_t(i)/\tau\right)},$$

where the action value $q_t(a)$ corresponds to the expected reward of following action a and $\tau$ is called a temperature parameter (in allusion to statistical mechanics). For high temperatures ($\tau \to \infty$), all actions have nearly the same probability, and the lower the temperature, the more expected rewards affect the probability. For a low temperature ($\tau \to 0^{+}$), the probability of the action with the highest expected reward tends to 1.
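
A small sketch of this action-selection rule (with made-up action values and arbitrary temperatures) shows the shift from near-uniform to near-greedy behaviour as τ decreases:

import numpy as np

def softmax_action_probs(q_values, tau=1.0):
    """Convert estimated action values q_t(a) into selection probabilities."""
    z = np.asarray(q_values, dtype=float) / tau
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
q = [1.0, 2.0, 5.0]                      # toy expected rewards for three actions
for tau in (100.0, 1.0, 0.1):            # high tau: near-uniform; low tau: near-greedy
    p = softmax_action_probs(q, tau)
    print(tau, p, rng.choice(len(q), p=p))   # sampled action index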

Computational complexity and remedies

In neural network applications, the number K of possible outcomes is often large, e.g. in the case of neural language models that predict the most likely outcome out of a vocabulary which might contain millions of possible words.[9] This can make the calculations for the softmax layer (i.e. the matrix multiplications to determine the logits $z_i$, followed by the application of the softmax function itself) computationally expensive.[9][10] Moreover, the gradient descent backpropagation method for training such a neural network involves calculating the softmax for every training example, and the number of training examples can also become large. The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.[9][10]

Approaches that reorganize the softmax layer for more efficient calculation include the hierarchical softmax and the differentiated softmax.[9] The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables.[10][11] The desired probability (softmax value) of a leaf (outcome) can then be calculated as the product of the probabilities of all nodes on the path from the root to that leaf.[10] Ideally, when the tree is balanced, this would reduce the computational complexity from $O(K)$ to $O(\log_2 K)$.[11] In practice, results depend on choosing a good strategy for clustering the outcomes into classes.[10][11] A Huffman tree was used for this in Google's word2vec models (introduced in 2013) to achieve scalability.[9]

A second class of remedies is based on approximating the softmax (during training) with modified loss functions that avoid the calculation of the full normalization factor.[9] These include methods that restrict the normalization sum to a sample of outcomes (e.g. Importance Sampling, Target Sampling).[9][10]

Numerical algorithms

The standard softmax is numerically unstable because of large exponentiations. The safe softmax method instead calculates

$$\sigma(\mathbf{z})_i = \frac{e^{z_i - m}}{\sum_{j=1}^K e^{z_j - m}}, \qquad m = \max_k z_k,$$

where $m$ is the largest input involved. Subtracting it guarantees that each exponentiation results in a value of at most 1.

The attention mechanism in Transformers takes three arguments: a "query vector" $\mathbf{q}$, a list of "key vectors" $\mathbf{k}_1, \dots, \mathbf{k}_N$, and a list of "value vectors" $\mathbf{v}_1, \dots, \mathbf{v}_N$, and outputs a softmax-weighted sum over the value vectors:

$$\mathbf{o} = \sum_{i=1}^N \frac{e^{\mathbf{q}^\mathsf{T} \mathbf{k}_i}}{\sum_{j=1}^N e^{\mathbf{q}^\mathsf{T} \mathbf{k}_j}} \, \mathbf{v}_i.$$

The standard softmax method involves several loops over the inputs, which would be bottlenecked by memory bandwidth. The FlashAttention method is a communication-avoiding algorithm that fuses these operations into a single loop, increasing the arithmetic intensity. It is an online algorithm that computes the following quantities:[12][13]

$$x_i = \mathbf{q}^\mathsf{T} \mathbf{k}_i, \quad m_i = \max(m_{i-1}, x_i), \quad d_i = d_{i-1} e^{m_{i-1} - m_i} + e^{x_i - m_i}, \quad \mathbf{o}_i = \mathbf{o}_{i-1} e^{m_{i-1} - m_i} + e^{x_i - m_i} \mathbf{v}_i,$$

and returns $\mathbf{o}_N / d_N$. In practice, FlashAttention operates over multiple queries and keys per loop iteration, in a similar way as blocked matrix multiplication. If backpropagation is needed, then the output vectors and the intermediate arrays are cached, and during the backward pass, attention matrices are rematerialized from these, making it a form of gradient checkpointing.
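
The recurrence can be sketched in a few lines of NumPy; the loop below is an illustrative single-query reimplementation of the online softmax idea, not the actual FlashAttention GPU kernel, which processes blocks of queries and keys at a time:

import numpy as np

def online_softmax_attention(q, keys, vals):
    """Single-pass softmax-weighted sum over value vectors (online softmax recurrence)."""
    m = -np.inf                      # running maximum of the scores
    d = 0.0                          # running (rescaled) normalizer
    o = np.zeros(vals.shape[1])      # running (rescaled) weighted sum of values
    for k_i, v_i in zip(keys, vals):
        x = q @ k_i                  # score for this key
        m_new = max(m, x)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        d = d * scale + np.exp(x - m_new)
        o = o * scale + np.exp(x - m_new) * v_i
        m = m_new
    return o / d

rng = np.random.default_rng(0)
q, keys, vals = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
scores = keys @ q
w = np.exp(scores - scores.max()); w /= w.sum()                  # reference two-pass softmax
print(np.allclose(online_softmax_attention(q, keys, vals), w @ vals))   # True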

Mathematical properties

Geometrically the softmax function maps the Euclidean space $\mathbb{R}^K$ to the interior of the standard $(K-1)$-simplex, cutting the dimension by one (the range is a $(K-1)$-dimensional simplex in $K$-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the image lies on a hyperplane.

Along the main diagonal $(x, x, \dots, x)$, softmax is just the uniform distribution on outputs, $(1/K, \dots, 1/K)$: equal scores yield equal probabilities.

More generally, softmax is invariant under translation by the same value in each coordinate: adding $\mathbf{c} = (c, \dots, c)$ to the inputs $\mathbf{z}$ yields $\sigma(\mathbf{z} + \mathbf{c}) = \sigma(\mathbf{z})$, because it multiplies each exponent by the same factor, $e^{c}$ (because $e^{z_i + c} = e^{z_i} \cdot e^{c}$), so the ratios do not change:

$$\sigma(\mathbf{z} + \mathbf{c})_i = \frac{e^{z_i + c}}{\sum_{j=1}^K e^{z_j + c}} = \frac{e^{z_i} \cdot e^{c}}{\sum_{j=1}^K e^{z_j} \cdot e^{c}} = \sigma(\mathbf{z})_i.$$

Geometrically, softmax is constant along diagonals: this is the dimension that is eliminated, and corresponds to the softmax output being independent of a translation in the input scores (a choice of 0 score). One can normalize input scores by assuming that the sum is zero (subtract the average: $\mathbf{z} \mapsto \mathbf{z} - \bar{z}\mathbf{1}$, where $\bar{z} = \tfrac{1}{K}\sum_j z_j$), and then the softmax takes the hyperplane of points that sum to zero, $\sum_i z_i = 0$, to the open simplex of positive values that sum to 1, $\sum_i \sigma(\mathbf{z})_i = 1$, analogously to how the exponential takes 0 to 1, $e^0 = 1$, and is positive.

By contrast, softmax is not invariant under scaling. For instance, $\sigma\bigl((0, 1)\bigr) = \left(\tfrac{1}{1+e}, \tfrac{e}{1+e}\right)$ but $\sigma\bigl((0, 2)\bigr) = \left(\tfrac{1}{1+e^{2}}, \tfrac{e^{2}}{1+e^{2}}\right)$.
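
Both behaviours can be checked numerically with a short sketch (the input values are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.0, 1.0])
print(softmax(z), softmax(z + 5.0))   # identical: translation invariance
print(softmax(2 * z))                 # different: no invariance under scaling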

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say $z_2 = 0$), so $e^{z_2} = 1$, and the other variable can vary, denote it $z_1 = x$, so $\sigma(x, 0)_1 = \frac{e^{x}}{e^{x} + 1} = \frac{1}{1 + e^{-x}}$, the standard logistic function, and $\sigma(x, 0)_2 = \frac{1}{1 + e^{x}}$, its complement (meaning they add up to 1). The 1-dimensional input could alternatively be expressed as the line $(x/2, -x/2)$, with outputs $\frac{1}{1 + e^{-x}}$ and $\frac{1}{1 + e^{x}}$.

Gradients

The softmax function is also the gradient of the LogSumExp function:

$$\frac{\partial}{\partial z_i} \operatorname{LSE}(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} = \sigma(\mathbf{z})_i, \quad i = 1, \dots, K,$$

where the LogSumExp function is defined as $\operatorname{LSE}(z_1, \dots, z_K) = \ln\!\left(e^{z_1} + \cdots + e^{z_K}\right)$.

The gradient of softmax is thus $\frac{\partial \sigma(\mathbf{z})_i}{\partial z_j} = \sigma(\mathbf{z})_i \left(\delta_{ij} - \sigma(\mathbf{z})_j\right)$, i.e. the Jacobian $\operatorname{diag}(\sigma(\mathbf{z})) - \sigma(\mathbf{z})\,\sigma(\mathbf{z})^\mathsf{T}$.
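
A quick numerical sketch (using central finite differences on an arbitrary input) confirms that the gradient of LogSumExp reproduces the softmax output:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([0.5, -1.2, 3.0])
eps = 1e-6
grad_fd = np.array([(logsumexp(z + eps * np.eye(3)[i]) - logsumexp(z - eps * np.eye(3)[i]))
                    / (2 * eps) for i in range(3)])
print(np.allclose(grad_fd, softmax(z), atol=1e-6))   # True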

History

The softmax function was used in statistical mechanics as the Boltzmann distribution in the foundational paper Boltzmann (1868),[14] formalized and popularized in the influential textbook Gibbs (1902).[15]

The use of the softmax in decision theory is credited to R. Duncan Luce,[16]: 1  who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.[citation needed]

In machine learning, the term "softmax" is credited to John S. Bridle in two 1989 conference papers, Bridle (1990a):[16]: 1  and Bridle (1990b):[3]

We are concerned with feed-forward non-linear networks (multi-layer perceptrons, or MLPs) with multiple outputs. We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs. We look for appropriate output non-linearities and for appropriate criteria for adaptation of the parameters of the network (e.g. weights). We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.[17]: 227 

For any input, the outputs must all be positive and they must sum to unity. ...

Given a set of unconstrained values, $V_j(x)$, we can ensure both conditions by using a Normalised Exponential transformation: $Q_j(x) = e^{V_j(x)} \big/ \sum_k e^{V_k(x)}$. This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer. It preserves the rank order of its input values, and is a differentiable generalisation of the 'winner-take-all' operation of picking the maximum value. For this reason we like to refer to it as softmax.[18]: 213 

Example

With an input of (1, 2, 3, 4, 1, 2, 3), the softmax is approximately (0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175). The output has most of its weight where the "4" was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. But note: a change of temperature changes the output. When the temperature is multiplied by 10, the inputs are effectively (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3) and the softmax is approximately (0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153). This shows that high temperatures de-emphasize the maximum value.

Computation of this example using Python code:

>>> import numpy as np
>>> z = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0])
>>> beta = 1.0
>>> np.exp(beta * z) / np.sum(np.exp(beta * z)) 
array([0.02364054, 0.06426166, 0.1746813, 0.474833, 0.02364054,
       0.06426166, 0.1746813])
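
The same session can be extended to reproduce the higher-temperature variant described above (continuing with z and numpy already defined), here by setting the inverse temperature beta to 0.1, equivalent to multiplying the temperature by 10, and rounding for readability:

>>> beta = 0.1
>>> np.round(np.exp(beta * z) / np.sum(np.exp(beta * z)), 3)
array([0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153])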

Alternatives

The softmax function generates probability predictions densely distributed over its support. Other functions like sparsemax or α-entmax can be used when sparse probability predictions are desired.[19] Also, the Gumbel-softmax reparametrization trick can be used when sampling from a discrete distribution needs to be mimicked in a differentiable manner.

from Grokipedia
The softmax function, also known as softargmax, is a normalized exponential function that transforms a finite-dimensional vector of real numbers, often called logits or scores, into a probability distribution over the same number of categories, ensuring that the outputs are non-negative and sum to one. Mathematically, for a vector $\mathbf{z} = (z_1, \dots, z_K)$ with $K$ elements, the softmax is defined as $\hat{y}_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}$ for each $i = 1, \dots, K$, where $\exp(\cdot)$ denotes the exponential function. This formulation preserves the relative ordering of the input values while producing interpretable probabilities suitable for multi-class decision-making.

The term "softmax" was coined by John S. Bridle in 1989, introduced in the context of training stochastic model recognition algorithms as neural networks to achieve maximum mutual information estimation of parameters. In this original work, Bridle proposed the function as a generalization of the logistic sigmoid for multi-output scenarios, enabling networks to output conditional probabilities directly and facilitating discrimination-based learning via a relative entropy loss. Although later authors suggested "softargmax" as a more descriptive alternative to emphasize its relation to the argmax operation, "softmax" became the standard nomenclature in the machine learning literature.

Key properties of the softmax function include its shift invariance (adding a constant to all inputs does not change the output probabilities) and its reduction to the logistic (sigmoid) function for binary cases, where $p = \frac{1}{1 + \exp(-(z_1 - z_2))}$. It arises naturally as the maximum-entropy distribution subject to expected score constraints and as a choice model under Gumbel-distributed noise added to scores, balancing exploration of options with exploitation of the highest scores via a temperature-like parameter $\alpha$ that controls sharpness (as $\alpha \to \infty$, it approaches a hard argmax). These attributes make it differentiable and computationally efficient, with the output odds between categories depending solely on score differences: $\frac{p_i}{p_j} = \exp(\alpha (s_i - s_j))$.

In modern applications, the softmax function serves as the canonical activation for the output layer in neural networks for multi-class classification, converting raw linear predictions into probabilities that can be optimized using a cross-entropy loss. It is integral to softmax regression, a generalization of binary logistic regression, where model parameters are learned to maximize the likelihood of correct class assignments. Beyond classification, softmax appears in attention mechanisms (e.g., scaled dot-product attention in transformers), in reinforcement learning for policy parameterization, and in probabilistic modeling of categorical data, underscoring its versatility in converting unconstrained scores to interpretable distributions.

Definition

Mathematical Definition

The softmax function, first termed and applied in the context of probabilistic interpretation of outputs by Bridle (1989), is mathematically defined for a finite-dimensional input vector $\mathbf{z} = (z_1, \dots, z_K)^\top \in \mathbb{R}^K$ (with $K \geq 1$) as the output vector $\boldsymbol{\sigma}(\mathbf{z})$ whose $i$-th component is given by

$$\sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}, \quad i = 1, \dots, K.$$

The exponential function $\exp(\cdot)$ plays a crucial role in this formulation by mapping each real-valued input $z_i$ to a strictly positive value $\exp(z_i) > 0$, thereby ensuring all components of the output vector are positive before normalization. In machine learning literature, the input vector is conventionally denoted as $\mathbf{z}$ to represent the pre-activation logits (unbounded real values produced by a linear layer), while $\mathbf{x}$ typically denotes the original feature inputs to the model; this distinction highlights the softmax's role as an output activation applied to logits.

For the scalar case where $K = 1$, the definition simplifies trivially to $\sigma(z_1) = \frac{\exp(z_1)}{\exp(z_1)} = 1$, yielding a constant output. When $K = 2$, the softmax reduces to the binary logistic (sigmoid) function up to a shift, as $\sigma(z_1, z_2)_1 = \frac{1}{1 + \exp(z_2 - z_1)} = \frac{1}{1 + \exp(-(z_1 - z_2))}$, analogous to the standard sigmoid applied to the logit difference.

Basic Interpretations

The softmax function serves as a normalized exponential transformation that converts a vector of unbounded real-valued inputs, often called logits or scores, into a discrete probability distribution over multiple categories. By applying the exponential function to each input and dividing by the sum of exponentials across all inputs, it ensures that the outputs are strictly positive and sum to exactly one, thereby mapping the inputs onto the probability simplex. This normalization makes the softmax particularly useful for interpreting raw model outputs as probabilities in multi-class settings, where the relative magnitudes of the inputs determine the likelihood assigned to each class.

The outputs of the softmax function directly parameterize a categorical distribution, where each component represents the probability of a specific category in a multinomial setting. This connection arises because the softmax enforces the constraints of a valid probability distribution (non-negativity and normalization), allowing it to model the probabilities of mutually exclusive and exhaustive outcomes. In statistical terms, if the inputs are the natural logarithms of the unnormalized probabilities, the softmax recovers the normalized form, aligning with the log-linear parameterization used in such models.

The use of exponentials in the softmax provides an intuitive amplification of differences among the input values, transforming subtle variations in scores into more pronounced probabilistic preferences. For instance, a larger input value leads to an exponentially higher output probability compared to smaller ones, which promotes decisive distributions where the highest-scoring category receives the majority of the probability mass, while still allowing for some uncertainty in closer cases. This non-linear scaling ensures that the function is sensitive to relative differences rather than absolute values, enhancing its effectiveness in representing confidence levels across categories.

A generalized variant of the softmax introduces a temperature parameter $\tau > 0$ to modulate the sharpness of the resulting distribution, defined as

$$\sigma(\mathbf{z}; \tau)_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}.$$

When $\tau = 1$, it recovers the standard softmax; lower values of $\tau$ sharpen the distribution toward the maximum input (approaching a Dirac delta), while higher values flatten it toward uniformity, providing flexibility in controlling the balance between exploration and exploitation in probabilistic outputs.

Advanced Interpretations

Smooth Approximation to Argmax

The argmax operation, denoted $\arg\max_i z_i$, selects the index $i$ corresponding to the maximum value in a vector $\mathbf{z} \in \mathbb{R}^K$, producing a one-hot encoded vector where the entry at the maximum position is 1 and all others are 0. However, this operation is non-differentiable, which poses challenges for gradient-based optimization in neural networks, as it cannot be directly incorporated into differentiable computational graphs.

The softmax function addresses this limitation by serving as a smooth, differentiable approximation to argmax, often referred to as "softargmax." Defined with a temperature parameter $\tau > 0$, the softmax $\sigma(\mathbf{z}; \tau)_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^K \exp(z_j / \tau)}$ maps the input vector $\mathbf{z}$ to a probability distribution over the $K$ categories, where the probabilities concentrate more sharply on the largest entries as $\tau$ decreases. In the limit of vanishing temperature, the softmax output converges pointwise to the one-hot vector aligned with the argmax: $\lim_{\tau \to 0^+} \sigma(\mathbf{z}; \tau)_i = 1$ if $i = \arg\max_j z_j$ (assuming no ties in $\mathbf{z}$), and 0 otherwise.

This smoothing property enables the use of gradient-based methods to approximate discrete decision-making processes that would otherwise rely on non-differentiable argmax operations. For instance, in techniques like straight-through estimators, the forward pass may employ a hard argmax for discrete selection, while the backward pass approximates gradients through a low-temperature softmax to propagate signals effectively during training.

Relation to Boltzmann Distribution

In statistical mechanics, the Boltzmann distribution describes the probability $P_i$ of a system occupying a particular state $i$ with energy $E_i$ at thermal equilibrium temperature $T$, given by

$$P_i = \frac{\exp(-E_i / kT)}{\sum_j \exp(-E_j / kT)},$$

where $k$ is Boltzmann's constant and the sum in the denominator runs over all possible states $j$. This distribution was first formulated by Ludwig Boltzmann in 1868 as part of his foundational work on the statistical mechanics of gases, deriving the equilibrium probabilities through combinatorial arguments for particle distributions.

The softmax function bears a direct mathematical resemblance to the Boltzmann distribution, arising from the mapping $z_i = -E_i / kT$, which transforms the energies into logits scaled by the inverse temperature $1/kT$; thus, softmax outputs precisely model the equilibrium probabilities in the canonical ensemble of statistical mechanics. Consequently, the softmax inherits key concepts from the Boltzmann framework, including the partition function (the normalizing denominator $\sum_j \exp(-E_j / kT)$) that ensures probabilities sum to unity, and the interpretation of inputs as energy-based scores for probabilistic state selection.

Properties

Key Mathematical Properties

The softmax function $\sigma \colon \mathbb{R}^K \to (0,1)^K$, defined componentwise as $\sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}$, exhibits several fundamental mathematical properties that ensure it maps onto the interior of the probability simplex.

A primary property is normalization, whereby the outputs sum to unity: $\sum_{i=1}^K \sigma(\mathbf{z})_i = 1$ for all $\mathbf{z} \in \mathbb{R}^K$. This follows directly from the definitional structure, as the exponential terms in the numerators sum to the common denominator. Complementing this is non-negativity, with $\sigma(\mathbf{z})_i > 0$ for all $i$ and $\mathbf{z}$, since exponentials are strictly positive and the denominator is a positive sum. These traits position softmax outputs as valid probability distributions over $K$ categories.

The function is also strictly monotonic in each component: if $z_i > z_j$, then $\sigma(\mathbf{z})_i > \sigma(\mathbf{z})_j$. This order-preserving behavior arises because increasing $z_i$ relative to $z_j$ amplifies the corresponding exponential term more than others, without altering the total sum due to normalization. Additionally, softmax is invariant to translation by a constant vector: $\sigma(\mathbf{z} + c \mathbf{1}) = \sigma(\mathbf{z})$ for any $c \in \mathbb{R}$, where $\mathbf{1}$ is the all-ones vector. This holds because adding $c$ to each input multiplies both numerator and denominator by $\exp(c)$, which cancels out.

Finally, the softmax function is unique as the mapping from $\mathbb{R}^K$ to the interior of the simplex that satisfies normalization, non-negativity, monotonicity, and translation invariance. This uniqueness stems from its characterization as the maximum-entropy distribution subject to moment constraints on the expected inputs, derived via Lagrange multipliers. To see this, maximize the entropy $H(p) = -\sum_{i=1}^K p_i \log p_i$ over $p \in (0,1)^K$ with $\sum_i p_i = 1$ and $\sum_i p_i z_i = \mu$ (for fixed mean $\mu$). The Lagrangian is $L(p, \lambda, \beta) = H(p) + \lambda \left(\sum_i p_i z_i - \mu\right) + \beta \left(\sum_i p_i - 1\right)$. Setting partial derivatives to zero yields $\frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda z_i + \beta = 0$, so $p_i = \exp(\lambda z_i + \beta - 1)$. Applying the normalization constraint normalizes the exponentials, recovering the softmax form; strict convexity of the negative entropy ensures this solution is unique.

Gradient Computations

The gradients of the softmax function play a central role in backpropagation algorithms for training neural networks, enabling the efficient computation of how perturbations in the input logits $\mathbf{z} \in \mathbb{R}^K$ propagate to changes in the output probabilities $\boldsymbol{\sigma}(\mathbf{z}) \in \Delta^{K-1}$, where $\Delta^{K-1}$ denotes the $(K-1)$-dimensional probability simplex.

Consider the component-wise definition $\sigma_i(\mathbf{z}) = \frac{\exp(z_i)}{s}$, where $s = \sum_{k=1}^K \exp(z_k)$. To derive the partial derivatives, apply the quotient rule and the chain rule. For $i \neq j$,

$$\frac{\partial \sigma_i}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \frac{\exp(z_i)}{s} \right) = -\frac{\exp(z_i)}{s^2} \cdot \frac{\partial s}{\partial z_j} = -\frac{\exp(z_i) \exp(z_j)}{s^2} = -\sigma_i \sigma_j,$$

since $\frac{\partial s}{\partial z_j} = \exp(z_j)$. For the case $i = j$,

$$\frac{\partial \sigma_i}{\partial z_i} = \frac{\exp(z_i) \cdot s - \exp(z_i) \cdot \exp(z_i)}{s^2} = \frac{\exp(z_i) \left(s - \exp(z_i)\right)}{s^2} = \sigma_i \left(1 - \sigma_i \right),$$

as the numerator derivative includes both the direct term from $\exp(z_i)$ and the indirect term through $s$. Combining these yields the general component-wise form of the Jacobian entries:

$$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j),$$

where $\delta_{ij}$ is the Kronecker delta ($\delta_{ij} = 1$ if $i = j$, else 0). In matrix notation, the full Jacobian $J(\mathbf{z}) \in \mathbb{R}^{K \times K}$ is

$$J(\mathbf{z}) = \operatorname{diag}(\boldsymbol{\sigma}(\mathbf{z})) - \boldsymbol{\sigma}(\mathbf{z}) \boldsymbol{\sigma}(\mathbf{z})^\top,$$

which is symmetric and positive semidefinite with rank at most $K-1$.

This structure admits a clear interpretation: the diagonal elements $\sigma_i (1 - \sigma_i) \geq 0$ capture self-reinforcement, where an increase in $z_i$ boosts $\sigma_i$ proportionally to its current value, while the off-diagonal elements $-\sigma_i \sigma_j < 0$ (for $i \neq j$) encode inter-class competition, as an increase in $z_j$ diminishes $\sigma_i$ to maintain the normalization $\sum_i \sigma_i = 1$. Consequently, each row (and column) of $J$ sums to zero, preserving the simplex constraint under infinitesimal changes.

In practice, forming the explicit $K \times K$ Jacobian requires $O(K^2)$ space and time, which is prohibitive for large $K$. However, during backpropagation, only the Jacobian-vector product $J \mathbf{v}$ is typically needed for a downstream gradient vector $\mathbf{v} \in \mathbb{R}^K$, and this can be evaluated in $O(K)$ time via

$$J \mathbf{v} = \operatorname{diag}(\boldsymbol{\sigma}) \mathbf{v} - \boldsymbol{\sigma} (\boldsymbol{\sigma}^\top \mathbf{v}),$$

avoiding materialization of the full matrix and enabling scalable computation in deep learning frameworks.
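
A minimal NumPy sketch (not tied to any particular framework) confirms that this O(K) Jacobian-vector product matches multiplication by the explicit Jacobian:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jvp(z, v):
    """Jacobian-vector product J v = diag(sigma) v - sigma (sigma^T v) in O(K)."""
    s = softmax(z)
    return s * v - s * (s @ v)

rng = np.random.default_rng(0)
z, v = rng.normal(size=5), rng.normal(size=5)
s = softmax(z)
J = np.diag(s) - np.outer(s, s)          # explicit O(K^2) Jacobian, for comparison only
print(np.allclose(J @ v, softmax_jvp(z, v)))   # True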

Numerical Considerations

Complexity and Challenges

The computation of the softmax function for an input vector $\mathbf{z} \in \mathbb{R}^K$ involves exponentiating each of the $K$ elements, computing their sum, and performing element-wise division for normalization, yielding a time complexity of $O(K)$. This linear dependence on the dimension $K$ poses challenges in high-dimensional settings, such as natural language processing where $K$ corresponds to vocabulary sizes often exceeding 50,000, leading to substantial per-instance costs during inference and training. The space complexity is likewise $O(K)$, required for storing the input vector, intermediate exponentials, and output probabilities, though computing in log-space via the log-sum-exp trick can avoid temporary storage of large exponential values, modestly reducing peak memory usage.

A primary numerical challenge stems from the exponential operation in the softmax formula, $\sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}$, which is susceptible to overflow when any $z_i$ is large and positive, causing $\exp(z_i)$ to approach infinity and rendering the denominator undefined. Conversely, when all $z_i$ are large and negative, underflow occurs as $\exp(z_i)$ rounds to zero for all terms, resulting in loss of precision and a denominator near zero. These instabilities can propagate to produce NaN values in the probabilities or degenerate distributions where one probability approaches 1 and others 0, thereby distorting gradients during backpropagation as outlined in the gradient computations section.

Stable Numerical Methods

Computing the softmax function directly can lead to numerical overflow when input values are large, as the exponential terms grow rapidly. A standard technique to mitigate this is the subtract-max trick, which shifts all inputs by their maximum value before exponentiation. This ensures that all exponents are less than or equal to zero, bounding the terms and preventing overflow while preserving the original probabilities. The adjusted computation is given by

$$\sigma(\mathbf{z})_i = \frac{\exp(z_i - m)}{\sum_j \exp(z_j - m)},$$

where $m = \max_k z_k$. This method is equivalent to the standard softmax because the shift factor $e^{-m}$ cancels out in the ratio.

For applications requiring the logarithm of softmax probabilities, such as in cross-entropy loss computations, the log-sum-exp (LSE) trick provides numerical stability. The log-softmax for each component is

$$\log \sigma_i = z_i - \log \sum_j \exp(z_j).$$

Direct evaluation of the sum can still cause underflow for large negative inputs, so a stabilized LSE incorporates the subtract-max: $\log \sum_j \exp(z_j) = m + \log \sum_j \exp(z_j - m)$, where $m = \max_k z_k$. This formulation avoids both overflow in the exponentials and underflow in the summation, enabling accurate computation even for extreme input ranges. Stable implementations of logsumexp are essential in probabilistic modeling and optimization.

In high-dimensional settings, such as attention mechanisms in transformers where the vocabulary size $K$ or sequence length is very large (e.g., thousands), full softmax computation becomes computationally prohibitive due to $O(K)$ or quadratic scaling. To address this, approximations like sparsemax replace the dense softmax with a sparse variant that thresholds small probabilities to zero, producing a sparse probability distribution while maintaining differentiability. Sparsemax is particularly useful in multi-label classification and attention, as it focuses computation on the most relevant elements. Additionally, sampling-based methods, such as those in sparse transformers, approximate the softmax by evaluating only a subset of keys or using low-rank approximations, reducing memory and time complexity to near-linear in sequence length. These techniques preserve much of the expressive power of full softmax for large-scale applications.

Major numerical libraries incorporate these stability measures into their softmax implementations. For instance, SciPy's scipy.special.softmax applies the subtract-max trick internally to handle a wide range of input scales reliably. Similarly, PyTorch's torch.nn.functional.softmax uses dimension-specific stable computation, subtracting the maximum along the specified axis to ensure robustness in deep learning workflows. These built-in functions allow practitioners to compute softmax without manual intervention for stability.
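
A short sketch of the stabilized log-softmax, checked against scipy.special.softmax (mentioned above) on inputs with an extreme range:

import numpy as np
from scipy.special import softmax as scipy_softmax

def log_softmax(z):
    """Stabilized log-softmax: z_i - (m + log sum_j exp(z_j - m)), with m = max_k z_k."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())
    return z - lse

z = np.array([1000.0, -1000.0, 0.0])                           # extreme range of inputs
print(log_softmax(z))                                          # finite values, no overflow
print(np.allclose(np.exp(log_softmax(z)), scipy_softmax(z)))   # True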

Applications

In Neural Networks

In neural networks, the softmax function serves as a key activation in the output layer for multi-class classification tasks, transforming a vector of raw scores, or logits, into a probability distribution over multiple classes that sums to one. This normalization enables the network to produce interpretable outputs representing the likelihood of each class, facilitating decision-making in applications such as image recognition and natural language processing.

The softmax output is typically paired with the cross-entropy loss during training, which measures the divergence between the predicted probability distribution $\sigma(\mathbf{z})$ and the true one-hot encoded target $\mathbf{y}$. The loss is defined as $-\sum_{i=1}^K y_i \log \sigma(\mathbf{z})_i$, where $K$ is the number of classes, and this combination yields computationally efficient gradients for backpropagation, specifically $\frac{\partial L}{\partial z_j} = \sigma(\mathbf{z})_j - y_j$ for the $j$-th logit. This simplification arises because the derivatives of the softmax and the negative log-likelihood cancel in a manner that avoids explicit Jacobian computations, accelerating optimization in multi-class settings.

The adoption of softmax in neural networks gained prominence in the late 1980s and 1990s, as researchers sought probabilistic interpretations for feedforward classifiers amid the resurgence of connectionist models. John Bridle's work introduced the term "softmax" and advocated its use for modeling conditional probabilities in classification networks, bridging statistical pattern recognition with neural architectures. This era's emphasis on probabilistic outputs helped establish softmax as a standard for supervised learning paradigms.

A notable variant involves scaling the logits by a temperature parameter $T > 0$ before applying softmax, yielding $\sigma(\mathbf{z}/T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$, which controls the distribution's sharpness. When $T > 1$, the output softens, distributing probability more evenly across classes to aid in model calibration or knowledge distillation from larger networks to smaller ones. In knowledge distillation, the softened teacher probabilities guide the student via a distillation loss, improving generalization while compressing model size, as demonstrated in seminal work on transferring knowledge across neural networks. For calibration, post-hoc temperature scaling adjusts overconfident predictions in trained models, enhancing reliability without retraining.
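
A brief sketch of this softmax-plus-cross-entropy pairing (with arbitrary logits and a one-hot target) verifies the simplified gradient $\sigma(\mathbf{z}) - \mathbf{y}$ against finite differences:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_and_grad(z, y):
    """Loss -sum_i y_i log sigma(z)_i and its gradient sigma(z) - y with respect to the logits."""
    s = softmax(z)
    return -np.sum(y * np.log(s)), s - y

z = np.array([2.0, 1.0, -1.0])
y = np.array([0.0, 1.0, 0.0])            # one-hot target for the second class
loss, grad = cross_entropy_and_grad(z, y)
print(loss, grad)                         # grad sums to 0 across classes

eps = 1e-6                                # central finite-difference check
fd = np.array([(cross_entropy_and_grad(z + eps * np.eye(3)[i], y)[0]
                - cross_entropy_and_grad(z - eps * np.eye(3)[i], y)[0]) / (2 * eps)
               for i in range(3)])
print(np.allclose(fd, grad, atol=1e-6))   # True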

In Reinforcement Learning

In reinforcement learning, the softmax function plays a central role in parameterizing stochastic policies for discrete action spaces, enabling agents to select actions probabilistically based on estimated action values. Specifically, the policy $\pi(a|s)$ is defined as

$$\pi(a|s) = \frac{\exp\!\left(Q(s,a)/\tau\right)}{\sum_{b} \exp\!\left(Q(s,b)/\tau\right)},$$

where $Q(s,a)$ denotes the action-value function for state $s$ and action $a$, and $\tau > 0$ is a temperature parameter that scales the logits before applying the softmax. This formulation ensures that the policy outputs a valid probability distribution over actions, with higher $Q$-values receiving proportionally greater probability mass.

The temperature parameter $\tau$ governs the balance between exploration and exploitation in the policy. A high $\tau$ flattens the distribution, promoting exploration by assigning more uniform probabilities to actions and encouraging the agent to try suboptimal options to discover better long-term rewards. Conversely, a low $\tau$ sharpens the distribution toward the action with the maximum $Q$-value, favoring exploitation to maximize immediate expected returns. This adjustability allows softmax policies to adapt dynamically during training, often starting with a higher $\tau$ for broad search and annealing to lower values for refinement.

Softmax policies are integral to several policy gradient algorithms, particularly actor-critic methods for discrete actions. In REINFORCE, a foundational policy gradient algorithm, the softmax parameterization facilitates direct optimization of the policy parameters via stochastic gradient ascent on the expected return, using complete episode trajectories to estimate gradients. Similarly, in Proximal Policy Optimization (PPO), a widely adopted on-policy method, the policy network outputs logits that are passed through softmax to yield action probabilities, enabling clipped surrogate objectives for stable updates over multiple epochs while handling environments with discrete action spaces.

The primary advantages of softmax parameterization in these algorithms stem from its differentiability, which permits efficient gradient-based optimization of the expected reward without requiring value function approximations for updates. This smoothness supports convergence guarantees under certain conditions and allows seamless integration with neural network actors, making it suitable for high-dimensional state spaces.

In Modern Architectures

In modern architectures, the softmax function plays a pivotal role in the self-attention mechanisms of transformer models, where it normalizes the similarities between query and key vectors to produce attention weights. Specifically, in scaled dot-product attention, the attention weights $\alpha_{ij}$ are computed as $\alpha_{ij} = \sigma\!\left(\frac{Q_i K_j^\top}{\sqrt{d_k}}\right)$