LogSumExp
from Wikipedia

The LogSumExp (LSE) (also called RealSoftMax[1] or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.[2] It is defined as the logarithm of the sum of the exponentials of the arguments:

$$\mathrm{LSE}(x_1, \dots, x_n) = \log\left(\exp(x_1) + \cdots + \exp(x_n)\right).$$

Properties


The LogSumExp function domain is $\mathbb{R}^n$, the real coordinate space, and its codomain is $\mathbb{R}$, the real line. It is an approximation to the maximum $\max_i x_i$ with the following bounds:

$$\max\{x_1, \dots, x_n\} \leq \mathrm{LSE}(x_1, \dots, x_n) \leq \max\{x_1, \dots, x_n\} + \log n.$$

The first inequality is strict unless $n = 1$. The second inequality is strict unless all arguments are equal. (Proof: Let $m = \max_i x_i$. Then $e^m \leq \sum_{i=1}^n e^{x_i} \leq n e^m$. Applying the logarithm to the inequality gives the result.)

In addition, we can scale the function to make the bounds tighter. Consider the function $\frac{1}{t}\mathrm{LSE}(tx_1, \dots, tx_n)$ for $t > 0$. Then

$$\max\{x_1, \dots, x_n\} \leq \frac{1}{t}\mathrm{LSE}(tx_1, \dots, tx_n) \leq \max\{x_1, \dots, x_n\} + \frac{\log n}{t}.$$

(Proof: Replace each $x_i$ with $t x_i$ for some $t > 0$ in the inequalities above, to give

$$\max\{tx_1, \dots, tx_n\} \leq \mathrm{LSE}(tx_1, \dots, tx_n) \leq \max\{tx_1, \dots, tx_n\} + \log n,$$

and, since $t > 0$, $\max\{tx_1, \dots, tx_n\} = t\max\{x_1, \dots, x_n\}$; finally, dividing by $t$ gives the result.)
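A minimal numerical sketch of this tightening, in Python with only the standard library (the helper names lse and lse_scaled are ours, not from the cited sources):

```python
import math

def lse(xs):
    """Plain log-sum-exp: log(sum_i exp(x_i))."""
    return math.log(sum(math.exp(x) for x in xs))

def lse_scaled(xs, t):
    """Scaled variant (1/t) * LSE(t*x), which approaches max(xs) as t grows."""
    return lse([t * x for x in xs]) / t

xs = [1.0, 2.0, 3.0]
for t in (1, 10, 100):
    # The gap to max(xs) is bounded above by log(n)/t, so it shrinks as t grows.
    print(t, lse_scaled(xs, t), max(xs) + math.log(len(xs)) / t)
```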

Also, if we multiply by a negative number instead, we of course find a comparison to the $\min$ function:

$$\min\{x_1, \dots, x_n\} - \frac{\log n}{t} \leq \frac{1}{-t}\mathrm{LSE}(-tx_1, \dots, -tx_n) \leq \min\{x_1, \dots, x_n\}.$$

The LogSumExp function is convex, and is strictly increasing everywhere in its domain.[3] It is not strictly convex, since it is affine (linear plus a constant) on the diagonal and parallel lines:[4]

$$\mathrm{LSE}(x_1 + c, \dots, x_n + c) = \mathrm{LSE}(x_1, \dots, x_n) + c.$$

Other than along this direction, it is strictly convex (the Hessian has rank $n - 1$), so for example restricting to a hyperplane that is transverse to the diagonal results in a strictly convex function. See "A strictly convex log-sum-exp type function", below.

Writing $\mathbf{x} = (x_1, \dots, x_n)$, the partial derivatives are

$$\frac{\partial}{\partial x_i}\mathrm{LSE}(\mathbf{x}) = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)},$$

which means the gradient of LogSumExp is the softmax function.
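This identity is easy to check numerically; the following sketch (plain Python, helper names are ours) compares a central finite-difference estimate of the gradient with the softmax:

```python
import math

def lse(xs):
    """LogSumExp as defined above (naive form; fine for small inputs)."""
    return math.log(sum(math.exp(x) for x in xs))

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [0.5, -1.2, 3.0]
h = 1e-6
for i in range(len(xs)):
    xp, xm = xs[:], xs[:]
    xp[i] += h
    xm[i] -= h
    fd = (lse(xp) - lse(xm)) / (2 * h)                # central finite difference
    print(round(fd, 6), round(softmax(xs)[i], 6))     # the two columns should agree
```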

The convex conjugate of LogSumExp is the negative entropy.
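Concretely, for $\mathbf{y}$ in the probability simplex the conjugate value $\sup_{\mathbf{x}}\left(\mathbf{y}^\top\mathbf{x} - \mathrm{LSE}(\mathbf{x})\right)$ equals $\sum_i y_i \log y_i$, with the supremum attained at $x_i = \log y_i$ (up to an additive constant). A small Python spot check with illustrative values, not part of the cited sources:

```python
import math

y = [0.2, 0.5, 0.3]                      # illustrative point in the probability simplex
x = [math.log(yi) for yi in y]           # maximizer of y.x - LSE(x), up to a constant shift

lse = math.log(sum(math.exp(xi) for xi in x))           # = log(sum y_i) = 0
value = sum(yi * xi for yi, xi in zip(y, x)) - lse       # y.x - LSE(x)
neg_entropy = sum(yi * math.log(yi) for yi in y)         # sum_i y_i log y_i

print(value, neg_entropy)                # both ≈ -1.0297
```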

log-sum-exp trick for log-domain calculations


The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.[5]

Similar to multiplication operations in linear scale becoming simple additions in log scale, an addition operation in linear scale becomes the LSE in log scale:

$$\mathrm{LSE}(\log x_1, \dots, \log x_n) = \log(x_1 + \cdots + x_n).$$
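For instance, a short Python sketch (values are illustrative) showing that adding two quantities stored as logs amounts to an LSE of the logs:

```python
import math

a, b = 0.002, 0.0005                     # linear-domain values (illustrative)
log_a, log_b = math.log(a), math.log(b)

direct = math.log(a + b)                                 # addition in linear scale
via_lse = math.log(math.exp(log_a) + math.exp(log_b))    # LSE of the log-scale values

print(direct, via_lse)                   # both ≈ log(0.0025) ≈ -5.9915
```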

A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating point numbers.[6]

Unfortunately, the use of LSE directly in this case can again cause overflow/underflow problems. Therefore, the following equivalent must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient):

$$\mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left(\sum_{i=1}^n \exp(x_i - x^*)\right),$$

where $x^* = \max\{x_1, \dots, x_n\}$.

Many math libraries, such as IT++, provide a default routine for LSE and use this formula internally.
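A minimal sketch of such a routine in Python; in practice one would typically call an existing implementation such as scipy.special.logsumexp rather than roll one's own:

```python
import math

def logsumexp(xs):
    """Numerically stable LSE: x* + log(sum_i exp(x_i - x*))."""
    x_star = max(xs)
    if math.isinf(x_star):               # all inputs -inf (or a +inf present)
        return x_star
    return x_star + math.log(sum(math.exp(x - x_star) for x in xs))

# exp(1000) overflows double precision, but the shifted form stays finite.
print(logsumexp([1000.0, 1000.0]))       # 1000 + log(2) ≈ 1000.693
```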

A strictly convex log-sum-exp type function


LSE is convex but not strictly convex. We can define a strictly convex log-sum-exp type function[7] by adding an extra argument set to zero:

$$\mathrm{LSE}_0^+(x_1, \dots, x_n) = \mathrm{LSE}(0, x_1, \dots, x_n).$$

This function is a proper Bregman generator (strictly convex and differentiable). It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
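In the single-argument case this construction reduces to the softplus $\log(1 + e^x)$, the cumulant (log-partition) function of the binomial/logistic case; a brief sketch (the helper name lse0 is ours):

```python
import math

def lse0(xs):
    """Strictly convex variant: LSE0+(x) = LSE(0, x_1, ..., x_n)."""
    zs = [0.0] + list(xs)
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

# With a single argument this is softplus, log(1 + e^x).
x = 2.0
print(lse0([x]), math.log1p(math.exp(x)))
```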

In tropical analysis, this is the sum in the log semiring.

from Grokipedia
The LogSumExp function, also known as log-sum-exp, computes the natural logarithm of the sum of the exponentials of the elements in an input vector, mathematically defined as $\log \sum_{i=1}^n \exp(x_i)$ for a vector $\mathbf{x} = (x_1, \dots, x_n)$, or more generally $\log \sum_{i=1}^n b_i \exp(x_i)$ when weighted by nonnegative factors $b_i$. This operation arises frequently in probabilistic modeling and optimization, where it serves as a stable alternative to direct summation in the exponential domain, particularly when dealing with log-probabilities to maintain numerical precision. A key challenge in evaluating LogSumExp directly is the risk of numerical overflow when the $x_i$ values are large (e.g., exceeding $\log r_{\max}$, where $r_{\max}$ is the largest representable floating-point number) or underflow when they are small, leading to catastrophic loss of accuracy in floating-point arithmetic such as IEEE 754 single (fp32) or double (fp64) precision. To address this, the log-sum-exp trick employs a shifted formulation: select $a = \max_i x_i$, then compute $a + \log \sum_{i=1}^n \exp(x_i - a)$, which bounds the exponentials between 0 and 1, preventing overflow while minimizing underflow errors. Error analysis shows that this shifted version achieves relative error bounds on the order of machine epsilon $\epsilon$, typically outperforming naive implementations, with forward error estimates like $\left(|y + n - x_{\min}|/|y|\right)\epsilon$, where $y$ is the result and $x_{\min} = \min_i x_i$.

In machine learning and statistics, LogSumExp is integral to algorithms involving softmax normalization, such as naive Bayes classifiers, where it computes partition functions or marginal likelihoods in log-space. It also appears in variational inference, expectation-maximization, and log-likelihood evaluations during model training, with implementations available in libraries such as SciPy (via scipy.special.logsumexp), ensuring compatibility across array backends including JAX, PyTorch, and CuPy. The function's convexity and its relation to the negative entropy as its convex conjugate further underscore its role in convex optimization problems.

Definition and Properties

Definition

The log-sum-exp function, commonly denoted as LSE or logsumexp, is a fundamental operation in machine learning and statistics that arises in contexts requiring the aggregation of exponential terms on a logarithmic scale. It takes a vector of real numbers as input and produces a scalar output by computing the natural logarithm of the sum of the exponentials of those inputs. Formally, for a vector $\mathbf{x} = (x_1, \dots, x_n)^\top \in \mathbb{R}^n$, the log-sum-exp function is defined as

$$\mathrm{LSE}(\mathbf{x}) = \log\left(\sum_{i=1}^n e^{x_i}\right),$$

where $\log$ denotes the natural logarithm, the domain is $\mathbb{R}^n$, and the codomain is $\mathbb{R}$. This function provides a smooth, differentiable approximation to the maximum function $\max_i x_i$, particularly effective when the input components exhibit large differences, in which case $\mathrm{LSE}(\mathbf{x})$ closely aligns with the largest $x_i$.

Basic properties

The log-sum-exp function, defined as $\mathrm{LSE}(x_1, \dots, x_n) = \log\left(\sum_{i=1}^n e^{x_i}\right)$, satisfies fundamental bounding inequalities that highlight its close relationship to the maximum function. Specifically,

$$\max_i x_i \leq \mathrm{LSE}(x_1, \dots, x_n) \leq \max_i x_i + \log n.$$

The lower bound follows from the fact that the sum of exponentials is at least the largest exponential term, so $\sum_{i=1}^n e^{x_i} \geq e^{\max_i x_i}$, and applying the logarithm preserves the inequality. The upper bound arises because each exponential is at most $e^{\max_i x_i}$, yielding $\sum_{i=1}^n e^{x_i} \leq n e^{\max_i x_i}$, again with the logarithm maintaining the relation. These bounds position the log-sum-exp function as a smooth, differentiable approximation to the maximum, with the additive $\log n$ term quantifying the maximum deviation for finite $n$.

The function exhibits monotonicity in each of its arguments: it is increasing with respect to any individual $x_j$, meaning that if $x_j$ increases while the other components remain fixed, then $\mathrm{LSE}(x_1, \dots, x_n)$ does not decrease (and strictly increases unless all other $x_i$ are $-\infty$). This property stems directly from the monotonicity of the exponential and logarithm functions: raising $x_j$ enlarges the corresponding exponential term in the sum, thereby enlarging the overall sum and its logarithm. Such behavior ensures that the log-sum-exp function preserves orderings in its inputs, making it suitable for applications requiring consistent scaling with input magnitudes.

A notable special case occurs when all inputs are equal, i.e., $x_1 = \dots = x_n = x$. In this scenario, $\mathrm{LSE}(x, \dots, x) = \log(n e^x) = x + \log n$, which achieves the upper bound of the general inequality. This equality illustrates how the function scales additively with the common input value plus a logarithmic factor dependent on the dimension $n$, emphasizing its sensitivity to the number of terms when inputs are balanced.
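These inequalities and the equal-input case are easy to confirm numerically; a brief sketch in Python (illustrative values only):

```python
import math

def lse(xs):
    m = max(xs)  # shift by the max for stability
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [0.3, -2.0, 1.7, 1.1]
n = len(xs)
assert max(xs) <= lse(xs) <= max(xs) + math.log(n)   # the bounding inequalities

# Equal inputs attain the upper bound: LSE(x, ..., x) = x + log n.
x = 0.7
print(lse([x] * n), x + math.log(n))                 # both ≈ 2.0863
```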

Convexity and derivatives

The log-sum-exp function, defined as $f(\mathbf{x}) = \log \sum_{i=1}^n e^{x_i}$ for $\mathbf{x} \in \mathbb{R}^n$, is convex. This convexity follows from the positive semi-definiteness of its Hessian matrix. Let $\mathbf{z}$ be the vector with components $z_i = e^{x_i} / \sum_{j=1}^n e^{x_j}$, the softmax of $\mathbf{x}$. The Hessian is then $\nabla^2 f(\mathbf{x}) = \operatorname{diag}(\mathbf{z}) - \mathbf{z}\mathbf{z}^\top$, which is positive semi-definite because it represents the (scaled) covariance matrix of a multinomial distribution and has eigenvalues in $[0, 1]$. An alternative proof establishes convexity via the epigraph: the set $\{(\mathbf{x}, t) \in \mathbb{R}^n \times \mathbb{R} \mid t \geq f(\mathbf{x})\}$ is convex, as it is equivalent to $\sum_{i=1}^n e^{x_i - t} \leq 1$, and the left-hand side is jointly convex in $(\mathbf{x}, t)$ as a sum of convex exponential functions.

Although convex, the log-sum-exp function is not strictly convex. It exhibits affine behavior along the direction of the all-ones vector $\mathbf{1}$, since $f(\mathbf{x} + c\mathbf{1}) = f(\mathbf{x}) + c$ for any scalar $c$, meaning the function is linear (hence not strictly convex) along such translates. This non-strictness arises because the exponentials scale uniformly under uniform shifts of the inputs.

The first-order partial derivatives of the log-sum-exp function are given by $\frac{\partial f}{\partial x_k}(\mathbf{x}) = \frac{e^{x_k}}{\sum_{j=1}^n e^{x_j}}$ for each $k = 1, \dots, n$. Thus, the gradient $\nabla f(\mathbf{x})$ is precisely the softmax function evaluated at $\mathbf{x}$. The second-order properties, as encoded in the Hessian $\nabla^2 f(\mathbf{x}) = \operatorname{diag}(\mathbf{z}) - \mathbf{z}\mathbf{z}^\top$, further confirm convexity, with the zero eigenvalue corresponding to the direction $\mathbf{1}$ along which the function is affine.

The convex conjugate (Fenchel-Legendre transform) of the log-sum-exp function is the negative entropy function over the probability simplex. Specifically, $f^*(\mathbf{y}) = \sup_{\mathbf{x} \in \mathbb{R}^n}\left(\mathbf{y}^\top \mathbf{x} - f(\mathbf{x})\right) = \sum_{i=1}^n y_i \log y_i$ for $\mathbf{y}$ in the standard simplex $\Delta^{n-1} = \{\mathbf{y} \in \mathbb{R}^n_+ \mid \sum_i y_i = 1\}$, and $+\infty$ otherwise. This duality highlights the log-sum-exp function's role in optimization, where the conjugate's domain restriction to probability vectors underscores its ties to information-theoretic measures.
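A quick numerical check of these second-order facts (a sketch using NumPy, assumed available; the values are illustrative):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 0.0])
z = np.exp(x - x.max())
z /= z.sum()                          # softmax of x

H = np.diag(z) - np.outer(z, z)       # Hessian of log-sum-exp at x

print(np.linalg.eigvalsh(H))          # eigenvalues are >= 0 (up to rounding): PSD
print(H @ np.ones_like(x))            # ≈ 0: the all-ones vector spans the null space
```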

Numerical Computation

The log-sum-exp trick

The log-sum-exp trick provides a reformulation of the log-sum-exp function that enhances numerical stability during computation. For inputs $x_1, \dots, x_n \in \mathbb{R}$, define $x^* = \max_i x_i$. The trick expresses the function as

$$\mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left(\sum_{i=1}^n \exp(x_i - x^*)\right).$$

This identity derives from the additive property of the logarithm and the scaling behavior of the exponential function. Starting from the original definition,

$$\log\left(\sum_{i=1}^n \exp(x_i)\right) = \log\left(\exp(x^*) \sum_{i=1}^n \exp(x_i - x^*)\right) = x^* + \log\left(\sum_{i=1}^n \exp(x_i - x^*)\right),$$

since $\log(\exp(a) \cdot b) = a + \log b$ for $b > 0$. The reformulation mitigates overflow risks inherent in direct evaluation of $\sum_i \exp(x_i)$ when some $x_i$ are large and positive, as subtracting $x^*$ ensures $x_i - x^* \leq 0$ for all $i$, bounding each $\exp(x_i - x^*) \leq 1$ and keeping intermediate values within floating-point representable ranges.

For illustration, consider $\mathbf{x} = [10, 20, 5]$, where $x^* = 20$. The trick yields

$$\mathrm{LSE}(\mathbf{x}) = 20 + \log\left(\exp(-10) + \exp(0) + \exp(-15)\right) = 20 + \log\left(\exp(-10) + 1 + \exp(-15)\right),$$

which computes exponentials of non-positive arguments, avoiding the need to evaluate $\exp(20)$ directly and thus preserving numerical stability.
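The same computation in Python, contrasted with inputs large enough that the naive form would overflow (a sketch; numbers are illustrative):

```python
import math

def logsumexp(xs):
    x_star = max(xs)
    return x_star + math.log(sum(math.exp(x - x_star) for x in xs))

# The worked example from the text.
print(logsumexp([10.0, 20.0, 5.0]))       # ≈ 20.0000454

# For much larger inputs the naive form overflows (exp(800) is not
# representable in double precision), while the shifted form is unaffected.
print(logsumexp([800.0, 810.0, 795.0]))   # ≈ 810.0000454
```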

Implementation and stability

The stable implementation of the log-sum-exp function relies on subtracting the maximum value from all inputs before exponentiation, to prevent overflow and underflow in floating-point arithmetic. The algorithm proceeds as follows: first, identify the maximum value $m = \max_i x_i$; then, compute the sum of the exponentials of the shifted inputs, $s = \sum_i \exp(x_i - m)$; finally, return $m + \log s$. For greater precision, especially when multiple inputs equal $m$ or when non-maximal terms contribute, compute $s$ as $k$ (the multiplicity of $m$) plus the sum of $\exp(x_i - m)$ over the $x_i < m$, then return $m + \log(k + \mathrm{small\_sum})$, using $\log k + \operatorname{log1p}(\mathrm{small\_sum}/k)$ if $\mathrm{small\_sum}$ is small relative to $k$, to minimize floating-point precision loss.

In floating-point arithmetic, direct computation of $\log \sum_i \exp(x_i)$ risks overflow when the $x_i$ are large positive values, as the exponentials can exceed the representable range, or underflow when all $x_i$ are large negative, leading to a sum of zero and $\log 0 = -\infty$. The shifting approach mitigates overflow by ensuring all $\exp(x_i - m) \leq 1$, while in underflow scenarios (where individual $\exp(x_i - m)$ for non-maximal terms may round to zero due to their negligible contribution) the sum remains dominated by the term(s) equal to 1, yielding an accurate approximation of the maximum value itself.

Error analysis shows that this shifted algorithm achieves forward stability, with the relative error satisfying

$$\left|\frac{y - \hat{y}}{y}\right| \leq \left(\frac{|y + n - x_{\min}|}{|y|}\right) u + O(u^2),$$

where $y$ is the exact result, $\hat{y}$ the computed result, $x_{\min} = \min_i x_i$, and $u$ the unit roundoff.
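A sketch of the refined variant in Python (the helper name and structure are ours, following the description above):

```python
import math

def logsumexp_precise(xs):
    """Shifted LSE that separates out the terms equal to the maximum."""
    m = max(xs)
    k = sum(1 for x in xs if x == m)                     # multiplicity of the maximum
    small_sum = sum(math.exp(x - m) for x in xs if x < m)
    # log(k + small_sum) evaluated as log k + log1p(small_sum / k)
    # to limit precision loss when small_sum is tiny relative to k.
    return m + math.log(k) + math.log1p(small_sum / k)

print(logsumexp_precise([0.0, 0.0, -40.0]))              # ≈ log 2 ≈ 0.693147
```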