LogSumExp
The LogSumExp (LSE) (also called RealSoftMax[1] or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.[2] It is defined as the logarithm of the sum of the exponentials of the arguments:

$$\mathrm{LSE}(x_1, \dots, x_n) = \log\left(e^{x_1} + \cdots + e^{x_n}\right).$$
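As an illustration of this definition (not taken from the cited sources), the following is a minimal NumPy sketch; the name logsumexp_naive is hypothetical, and the direct exponentiation is deliberately naive, since the numerically stable reformulation is discussed later in the article.

```python
import numpy as np

def logsumexp_naive(x):
    """Direct translation of LSE(x_1, ..., x_n) = log(exp(x_1) + ... + exp(x_n)).

    Numerically naive: exp() can overflow or underflow for large |x_i|;
    a stable variant (the log-sum-exp trick) appears later in the article.
    """
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)))

# Example: LSE(0, 1, 2) = log(e^0 + e^1 + e^2) ≈ 2.4076
print(logsumexp_naive([0.0, 1.0, 2.0]))
```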
Properties
The LogSumExp function domain is $\mathbb{R}^n$, the real coordinate space, and its codomain is $\mathbb{R}$, the real line. It is an approximation to the maximum $\max_i x_i$ with the following bounds:

$$\max\{x_1, \dots, x_n\} \le \mathrm{LSE}(x_1, \dots, x_n) \le \max\{x_1, \dots, x_n\} + \log(n).$$

The first inequality is strict unless $n = 1$. The second inequality is strict unless all arguments are equal. (Proof: Let $m = \max_i x_i$. Then $e^m \le \sum_{i=1}^n e^{x_i} \le n e^m$. Applying the logarithm to the inequality gives the result.)
In addition, we can scale the function to make the bounds tighter. Consider the function $\frac{1}{t}\mathrm{LSE}(tx_1, \dots, tx_n)$ for $t > 0$. Then

$$\max\{x_1, \dots, x_n\} \le \frac{1}{t}\mathrm{LSE}(tx_1, \dots, tx_n) \le \max\{x_1, \dots, x_n\} + \frac{\log(n)}{t}.$$

(Proof: Replace each $x_i$ with $tx_i$ for some $t > 0$ in the inequalities above, to give

$$\max\{tx_1, \dots, tx_n\} \le \mathrm{LSE}(tx_1, \dots, tx_n) \le \max\{tx_1, \dots, tx_n\} + \log(n)$$

and, since $t > 0$, $\max\{tx_1, \dots, tx_n\} = t\max\{x_1, \dots, x_n\}$; finally, dividing by $t$ gives the result.)
Also, if we multiply by a negative number instead, we of course find a comparison to the $\min$ function:

$$\min\{x_1, \dots, x_n\} - \frac{\log(n)}{t} \le -\frac{1}{t}\mathrm{LSE}(-tx_1, \dots, -tx_n) \le \min\{x_1, \dots, x_n\}.$$
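A small numerical sketch of the scaled bounds above, assuming a helper lse that already uses the max-shift evaluation discussed later in the article; the names and values are illustrative only.

```python
import numpy as np

def lse(x):
    # Stable evaluation of log(sum(exp(x))) via the standard shift by the maximum.
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1.0, 2.0, 3.0])
for t in (1.0, 10.0, 100.0):
    approx_max = lse(t * x) / t        # tends to max(x) = 3 as t grows
    approx_min = -lse(-t * x) / t      # tends to min(x) = 1 as t grows
    print(t, approx_max, approx_min)
# The bounds max(x) <= lse(t*x)/t <= max(x) + log(n)/t tighten as t increases.
```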
The LogSumExp function is convex, and is strictly increasing everywhere in its domain.[3] It is not strictly convex, since it is affine (linear plus a constant) on the diagonal and parallel lines:[4]

$$\mathrm{LSE}(x_1 + c, \dots, x_n + c) = \mathrm{LSE}(x_1, \dots, x_n) + c.$$
Other than this direction, it is strictly convex (the Hessian has rank $n - 1$), so for example restricting to a hyperplane that is transverse to the diagonal results in a strictly convex function. See $\mathrm{LSE}_0^+$, below.
Writing $\mathbf{x} = (x_1, \dots, x_n)$, the partial derivatives are:

$$\frac{\partial}{\partial x_i}\mathrm{LSE}(\mathbf{x}) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}},$$

which means the gradient of LogSumExp is the softmax function.
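As a sketch of this gradient identity (not from the cited sources), the analytic softmax can be checked against central finite differences; the helper names are illustrative.

```python
import numpy as np

def lse(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    # Gradient of LSE: softmax(x)_i = exp(x_i) / sum_j exp(x_j), computed with the shift.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([0.5, -1.0, 2.0])
analytic = softmax(x)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
numeric = np.array([
    (lse(x + eps * e) - lse(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```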
The convex conjugate of LogSumExp is the negative entropy.
log-sum-exp trick for log-domain calculations
The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.[5]
Similar to multiplication operations in linear-scale becoming simple additions in log-scale, an addition operation in linear-scale becomes the LSE in log-scale:

$$\mathrm{LSE}(\log x_1, \dots, \log x_n) = \log(x_1 + \cdots + x_n).$$
A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating point numbers.[6]
Unfortunately, the use of LSE directly in this case can again cause overflow/underflow problems. Therefore, the following equivalent must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient):

$$\mathrm{LSE}(x_1, \dots, x_n) = x^* + \log\left(\sum_{i=1}^n e^{x_i - x^*}\right),$$

where $x^* = \max\{x_1, \dots, x_n\}$.
Many math libraries such as IT++ provide a default routine for LSE and use this formula internally.
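A hedged sketch of the reformulation above as it is typically used for adding probabilities stored as log-probabilities; log_sum_exp is an illustrative name, not the IT++ routine or any particular library function.

```python
import numpy as np

def log_sum_exp(log_vals):
    """Compute log(sum_i exp(log_vals[i])) with the max-shift reformulation.

    Useful for adding probabilities stored as log-probabilities:
    log(p_1 + ... + p_n) = LSE(log p_1, ..., log p_n).
    """
    log_vals = np.asarray(log_vals, dtype=float)
    m = np.max(log_vals)
    if not np.isfinite(m):          # all inputs -inf (zero probabilities), or inf/nan present
        return m
    return m + np.log(np.sum(np.exp(log_vals - m)))

# Adding three tiny probabilities whose direct representation underflows to 0.0:
log_p = np.array([-1000.0, -1001.0, -1002.0])   # p_i = e^-1000, e^-1001, e^-1002
print(log_sum_exp(log_p))   # ≈ -999.5924, i.e. log(e^-1000 + e^-1001 + e^-1002)
print(np.exp(log_p).sum())  # 0.0, the linear-domain sum underflows
```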
A strictly convex log-sum-exp type function
LSE is convex but not strictly convex. We can define a strictly convex log-sum-exp type function[7] by adding an extra argument set to zero:

$$\mathrm{LSE}_0^+(x_1, \dots, x_n) = \mathrm{LSE}(0, x_1, \dots, x_n) = \log\left(1 + e^{x_1} + \cdots + e^{x_n}\right).$$
This function is a proper Bregman generator (strictly convex and differentiable). It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
In tropical analysis, this is the sum in the log semiring.
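A minimal sketch of the strictly convex variant defined above, assuming the function name lse0 is purely illustrative; for a single argument it reduces to the softplus function.

```python
import numpy as np

def lse0(x):
    """Strictly convex variant: LSE0+(x) = LSE(0, x_1, ..., x_n) = log(1 + sum_i exp(x_i)).

    Computed stably by prepending the extra zero argument and applying the usual max shift.
    """
    x = np.concatenate(([0.0], np.asarray(x, dtype=float)))
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# For n = 1 this reduces to the softplus function log(1 + e^x):
print(lse0([2.0]), np.log1p(np.exp(2.0)))

# Unlike plain LSE, lse0(x + c*1) != lse0(x) + c in general,
# reflecting strict convexity on R^n (no affine diagonal direction):
print(lse0([1.0, 2.0]) + 5.0, lse0([6.0, 7.0]))  # the two values differ
```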
See also
References
1. Zhang, Aston; Lipton, Zack; Li, Mu; Smola, Alex. "Dive into Deep Learning, Chapter 3 Exercises". www.d2l.ai. Retrieved 27 June 2020.
2. Nielsen, Frank; Sun, Ke (2016). "Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities". Entropy. 18 (12): 442. arXiv:1606.05850. Bibcode:2016Entrp..18..442N. doi:10.3390/e18120442. S2CID 17259055.
3. El Ghaoui, Laurent (2017). Optimization Models and Applications.
4. "convex analysis - About the strictly convexity of log-sum-exp function - Mathematics Stack Exchange". stackexchange.com.
5. McElreath, Richard. Statistical Rethinking. OCLC 1107423386.
6. "Practical issues: Numeric stability". CS231n Convolutional Neural Networks for Visual Recognition.
7. Nielsen, Frank; Hadjeres, Gaetan (2018). "Monte Carlo Information Geometry: The dually flat case". arXiv:1803.07225 [cs.LG].
LogSumExp
Numerical libraries provide stable implementations of the function (e.g., scipy.special.logsumexp), ensuring compatibility across array backends including JAX, PyTorch, and CuPy.[1] The function's convexity[2] and relation to the negative entropy as its convex conjugate[5] further underscore its role in convex optimization problems.
Definition and Properties
Definition
The log-sum-exp function, commonly denoted as LSE or logsumexp, is a fundamental operation in mathematics and computational science that arises in contexts requiring the aggregation of exponential terms on a logarithmic scale. It takes a vector of real numbers as input and produces a scalar output by computing the natural logarithm of the sum of the exponentials of those inputs.[6][7] Formally, for a vector $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the log-sum-exp function is defined as

$$\mathrm{LSE}(x) = \log\left(\sum_{i=1}^n e^{x_i}\right),$$

where $\log$ denotes the natural logarithm and the domain is $\mathbb{R}^n$ with codomain $\mathbb{R}$.[6][7][8] This function provides a smooth, differentiable approximation to the maximum function $\max_i x_i$, particularly effective when the input components exhibit large differences, in which case $\mathrm{LSE}(x)$ closely aligns with the largest $x_i$.[6][7]

Basic properties
The log-sum-exp function, defined as $\mathrm{LSE}(x) = \log\sum_{i=1}^n e^{x_i}$, satisfies fundamental bounding inequalities that highlight its close relationship to the maximum function. Specifically,

$$\max_i x_i \le \mathrm{LSE}(x) \le \max_i x_i + \log n.$$

The lower bound follows from the fact that the sum of exponentials is at least the largest exponential term, so $\sum_{i=1}^n e^{x_i} \ge e^{\max_i x_i}$, and applying the logarithm preserves the inequality. The upper bound arises because each exponential is at most $e^{\max_i x_i}$, yielding $\sum_{i=1}^n e^{x_i} \le n\, e^{\max_i x_i}$, again with the logarithm maintaining the relation. These bounds position the log-sum-exp function as a smooth, differentiable approximation to the maximum, with the additive term $\log n$ quantifying the maximum deviation for finite $n$.[9]

The function exhibits monotonicity in each of its arguments: it is increasing with respect to any individual $x_i$, meaning that if $x_i$ increases while the other components remain fixed, then $\mathrm{LSE}(x)$ strictly increases. This property stems directly from the monotonicity of the exponential and logarithm functions, as raising $x_i$ enlarges the corresponding exponential term in the sum, thereby enlarging the overall sum and its logarithm. Such behavior ensures that the log-sum-exp function preserves orderings in its inputs, making it suitable for applications requiring consistent scaling with input magnitudes.[7]

A notable special case occurs when all inputs are equal, i.e., $x_1 = \cdots = x_n = c$. In this scenario, $\mathrm{LSE}(c, \dots, c) = c + \log n$, which achieves the upper bound from the general inequality. This equality illustrates how the function scales additively with the common input value plus a logarithmic factor dependent on the dimension $n$, emphasizing its sensitivity to the number of terms when inputs are balanced.
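A quick numerical check of the bounds, monotonicity, and equal-input case described above; the helper lse and the sample values are illustrative assumptions, not part of the cited sources.

```python
import numpy as np

def lse(x):
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1.0, 4.0, 2.5])
n = len(x)

# Bounds: max(x) <= LSE(x) <= max(x) + log(n).
print(x.max() <= lse(x) <= x.max() + np.log(n))           # True

# Monotonicity: increasing one coordinate strictly increases the value.
y = x.copy()
y[0] += 0.5
print(lse(y) > lse(x))                                     # True

# Equal inputs attain the upper bound: LSE(c, ..., c) = c + log(n).
c = 3.0
print(np.isclose(lse(np.full(n, c)), c + np.log(n)))       # True

# With widely separated inputs, LSE is close to the maximum.
print(lse([100.0, 0.0, -50.0]))                            # ≈ 100.0
```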
Convexity and derivatives

The log-sum-exp function, defined as $\mathrm{LSE}(x) = \log\sum_{i=1}^n e^{x_i}$ for $x \in \mathbb{R}^n$, is convex. This convexity follows from the positive semi-definiteness of its Hessian matrix. Let $s$ be the vector with components $s_i = e^{x_i}/\sum_{j=1}^n e^{x_j}$, the softmax of $x$. The Hessian is then

$$\nabla^2 \mathrm{LSE}(x) = \operatorname{diag}(s) - s s^{\top},$$

which is positive semi-definite because it represents the (scaled) covariance matrix of a multinomial distribution and has eigenvalues in [0, 1]. An alternative proof establishes convexity via the epigraph: the set $\{(x, t) : \mathrm{LSE}(x) \le t\}$ is convex, as it is equivalent to $\sum_{i=1}^n e^{x_i - t} \le 1$, and the left-hand side is jointly convex in $(x, t)$ as a sum of convex exponential functions.[10][11]

Although convex, the log-sum-exp function is not strictly convex. It exhibits affine behavior along the direction of the all-ones vector $\mathbf{1}$, since $\mathrm{LSE}(x + c\mathbf{1}) = \mathrm{LSE}(x) + c$ for any scalar $c$, meaning the function is linear (hence not strictly convex) on such translates. This non-strictness arises because the exponentials scale uniformly under uniform shifts in the inputs.[12]

The first-order partial derivatives of the log-sum-exp function are given by

$$\frac{\partial \mathrm{LSE}(x)}{\partial x_i} = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}} = s_i$$

for each $i$. Thus, the gradient $\nabla \mathrm{LSE}(x)$ is precisely the softmax function evaluated at $x$. The second-order properties, as encoded in the Hessian $\operatorname{diag}(s) - s s^{\top}$, further confirm convexity, with the zero eigenvalue corresponding to the direction $\mathbf{1}$ along which the function is affine.[10]

The convex conjugate (Fenchel-Legendre transform) of the log-sum-exp function is the negative entropy function over the probability simplex. Specifically,

$$\mathrm{LSE}^*(p) = \sum_{i=1}^n p_i \log p_i$$

for $p$ in the standard simplex $\{p \in \mathbb{R}^n : p_i \ge 0,\ \sum_{i=1}^n p_i = 1\}$, and $\mathrm{LSE}^*(p) = +\infty$ otherwise. This duality highlights the log-sum-exp function's role in optimization, where the conjugate's domain restriction to probabilities underscores its ties to information-theoretic measures.[13]
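A small sketch (under the assumptions above, with illustrative names and values) verifying the Hessian formula, its null direction along the all-ones vector, and the Fenchel equality between LSE and the negative entropy at the softmax point.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([0.2, -1.3, 0.7, 2.0])
s = softmax(x)

# Hessian of LSE at x: diag(s) - s s^T.
H = np.diag(s) - np.outer(s, s)

eigvals = np.linalg.eigvalsh(H)
print(np.all(eigvals >= -1e-12))               # positive semi-definite
print(np.allclose(H @ np.ones(len(x)), 0.0))   # all-ones vector spans the null space

# Conjugate value at p = softmax(x) equals the negative entropy sum_i p_i log p_i,
# consistent with Fenchel duality: LSE(x) + LSE*(s) = <x, s> at the optimum.
m = np.max(x)
lse_x = m + np.log(np.sum(np.exp(x - m)))
neg_entropy = np.sum(s * np.log(s))
print(np.isclose(lse_x + neg_entropy, x @ s))  # True
```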
Numerical Computation

The log-sum-exp trick
The log-sum-exp trick provides a reformulation of the log-sum-exp function that enhances numerical stability during computation. For inputs $x_1, \dots, x_n$, define $m = \max_i x_i$. The trick expresses the function as[14]

$$\mathrm{LSE}(x_1, \dots, x_n) = m + \log\sum_{i=1}^n e^{x_i - m}.$$

This identity derives from the additive property of the logarithm and the scaling behavior of the exponential function. Starting from the original definition,

$$\log\sum_{i=1}^n e^{x_i} = \log\left(e^m \sum_{i=1}^n e^{x_i - m}\right) = m + \log\sum_{i=1}^n e^{x_i - m},$$

since $e^{x_i} = e^m e^{x_i - m}$ for each $i$.[14] The reformulation mitigates overflow risks inherent in direct evaluation of $\sum_i e^{x_i}$ when some $x_i$ are large and positive, as subtracting $m$ ensures $x_i - m \le 0$ for all $i$, bounding each $e^{x_i - m} \le 1$ and keeping intermediate values within floating-point representable ranges.[14] For illustration, consider $x = (1000, 1000, 1000)$, where $e^{1000}$ overflows double-precision arithmetic. The trick yields

$$\mathrm{LSE}(x) = 1000 + \log(e^0 + e^0 + e^0) = 1000 + \log 3,$$

which computes exponentials of non-positive arguments, avoiding the need to evaluate $e^{1000}$ directly and thus preserving dynamic range.[14]
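The illustration above can be reproduced directly; the values are the illustrative ones from the text, and the comparison of naive versus shifted evaluation is a sketch rather than a library routine.

```python
import numpy as np

x = np.array([1000.0, 1000.0, 1000.0])

# Naive evaluation: exp(1000) overflows to inf in double precision.
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))
print(naive)                      # inf

# Max-shift reformulation: exponentials of non-positive arguments only.
m = np.max(x)
stable = m + np.log(np.sum(np.exp(x - m)))
print(stable, 1000 + np.log(3))   # both ≈ 1001.0986
```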
Implementation and stability

The stable implementation of the log-sum-exp function relies on subtracting the maximum value from all inputs before exponentiation to prevent overflow and underflow in floating-point arithmetic. The algorithm proceeds as follows: first, identify the maximum value $m = \max_i x_i$; then, compute the sum of the exponentials of the shifted inputs, $s = \sum_{i=1}^n e^{x_i - m}$; finally, return $m + \log s$. For greater precision, especially when multiple inputs equal $m$ or when the non-maximal terms contribute little, write $s = k + \sigma$, where $k$ is the multiplicity of $m$ and $\sigma = \sum_{x_i \ne m} e^{x_i - m}$, and evaluate the logarithm with log1p (for example, $m + \mathrm{log1p}(\sigma)$ when $k = 1$) when $\sigma$ is small relative to $k$, to minimize floating-point precision loss.[2]

In floating-point arithmetic, direct computation of $\log\sum_i e^{x_i}$ risks overflow when the $x_i$ are large positive values, as the exponentials can exceed the representable range, or underflow when all $x_i$ are large negative, leading to a sum of zero and a spurious result of $-\infty$. The shifting approach mitigates overflow by ensuring all $x_i - m \le 0$, while for underflow scenarios, where individual $e^{x_i - m}$ for non-maximal terms may round to zero due to their negligible contribution, the sum remains dominated by the term(s) equal to 1, yielding an accurate approximation of the maximum value itself.[2]

Error analysis shows that this shifted algorithm achieves forward stability: the relative error of the computed result is bounded by a modest multiple of the unit roundoff $u$ (approximately $1.1 \times 10^{-16}$ for IEEE 754 double precision), with the bound depending on the input dimension $n$ and the magnitude of the exact log-sum-exp value $y$. The condition number of the problem, which measures the function's sensitivity to input perturbations, remains a property of the mathematical function itself. However, the shift improves the conditioning of the computation by bounding intermediate values to [0, 1], reducing forward errors from potentially infinite (due to overflow in the naive approach) to near machine epsilon levels, even for ill-conditioned inputs.[2]

Implementations in numerical libraries incorporate this stability mechanism. For instance, NumPy's logaddexp function computes the pairwise case $\log(e^a + e^b)$ in a numerically stable manner by applying the shift internally.[15] Similarly, SciPy's special.logsumexp provides the same stabilization for array inputs along specified axes. In TensorFlow, tf.math.reduce_logsumexp performs the operation with built-in numerical stabilization across tensor dimensions.[1][16]
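A brief usage sketch of the library routines just mentioned; scipy.special.logsumexp and numpy.logaddexp are standard calls, while the TensorFlow line is shown commented out since it requires that package to be installed, and the input values are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

x = np.array([[1000.0, -1000.0, 3.0],
              [   2.0,     2.0, 2.0]])

# Stable reduction along a chosen axis; the max shift is applied internally.
print(logsumexp(x, axis=1))           # ≈ [1000.0, 2.0 + log(3)]

# Pairwise log-domain addition in NumPy itself:
print(np.logaddexp(1000.0, -1000.0))  # ≈ 1000.0, no overflow

# TensorFlow offers the same reduction:
# import tensorflow as tf
# print(tf.math.reduce_logsumexp(tf.constant(x), axis=1))
```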
Special handling is required for edge cases to maintain accuracy. When $n = 1$, the function simplifies to $\mathrm{LSE}(x_1) = x_1$, as the shift yields $x_1 - m = 0$ and the result $m + \log(e^0) = m$. If all $x_i$ are equal, the result is $m + \log n$, computed stably since all shifted exponentials equal 1. For highly disparate values, where some $x_i \ll m$, the underflow of those terms to zero has negligible impact, with the output approximating $m$ to within machine precision.[2]
