Log probability
In probability theory and computer science, a log probability is simply a logarithm of a probability.[1] The use of log probabilities means representing probabilities on a logarithmic scale (−∞, 0], instead of the standard unit interval [0, 1].
Since the probabilities of independent events multiply, and logarithms convert multiplication to addition, log probabilities of independent events add. Log probabilities are thus practical for computations, and have an intuitive interpretation in terms of information theory: the negative expected value of the log probabilities is the information entropy of an event. Similarly, likelihoods are often transformed to the log scale, and the corresponding log-likelihood can be interpreted as the degree to which an event supports a statistical model. The log probability is widely used in implementations of computations with probability, and is studied as a concept in its own right in some applications of information theory, such as natural language processing.
Motivation
Representing probabilities in this way has several practical advantages:
- Speed. Since multiplication is more expensive than addition, taking the product of a high number of probabilities is often faster if they are represented in log form. (The conversion to log form is expensive, but is only incurred once.) Multiplication arises from calculating the probability that multiple independent events occur: the probability that all independent events of interest occur is the product of all these events' probabilities.
- Accuracy. The use of log probabilities improves numerical stability when the probabilities are very small, because of the way in which computers approximate real numbers[1] (see the sketch after this list).
- Simplicity. Many probability distributions have an exponential form. Taking the log of these distributions eliminates the exponential function, unwrapping the exponent. For example, the log probability of the normal distribution's probability density function is −(x − μ)²/(2σ²) − log(σ√(2π)) instead of (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). Log probabilities make some mathematical manipulations easier to perform.
- Optimization. Since most common probability distributions—notably the exponential family—are only logarithmically concave,[2][3] and concavity of the objective function plays a key role in the maximization of a function such as probability, optimizers work better with log probabilities.
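To illustrate the speed and accuracy points above, here is a minimal NumPy sketch (the array of probabilities is invented for illustration): the direct product of many small probabilities underflows to zero, while the equivalent sum of logs stays finite.

```python
import numpy as np

# 1,000 hypothetical independent event probabilities, each fairly small.
probs = np.full(1000, 0.05)

direct = np.prod(probs)            # underflows: 0.05**1000 is far below the
print(direct)                      # smallest positive double, so this prints 0.0

log_joint = np.sum(np.log(probs))  # the same quantity in log space stays finite
print(log_joint)                   # about -2995.7, i.e. 1000 * log(0.05)
```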
Representation issues
The logarithm function is not defined for zero, so log probabilities can only represent non-zero probabilities. Since the logarithm of a number in the interval (0, 1) is negative, often the negative log probabilities are used. In that case the log probabilities in the following formulas would be inverted.
Any base can be selected for the logarithm.
Basic manipulations
In this section we write the probabilities x and y in logarithmic space as x′ and y′ for short:
x′ = log(x), y′ = log(y).
The product of probabilities x · y corresponds to addition in logarithmic space:
log(x · y) = log(x) + log(y) = x′ + y′.
The sum of probabilities x + y is a bit more involved to compute in logarithmic space, requiring the computation of one exponent and one logarithm:
log(x + y) = log(x) + log(1 + exp(log(y) − log(x))) = x′ + log(1 + exp(y′ − x′)).
However, in many applications a multiplication of probabilities (giving the probability of all independent events occurring) is used more often than their addition (giving the probability of at least one of mutually exclusive events occurring). Additionally, the cost of computing the addition can be avoided in some situations by simply using the highest probability as an approximation. Since probabilities are non-negative this gives a lower bound. This approximation is used in reverse to get a continuous approximation of the max function.
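As a rough illustration of the approximation just described (the two probabilities are arbitrary), NumPy's logaddexp gives the exact log of the sum, and the larger log probability is a lower bound on it within log(2):

```python
import numpy as np

x, y = 0.2, 0.05               # two arbitrary mutually exclusive event probabilities
xp, yp = np.log(x), np.log(y)  # their log probabilities

exact = np.logaddexp(xp, yp)   # log(x + y), computed stably in log space
approx = max(xp, yp)           # cheap approximation: keep only the larger term

print(np.isclose(exact, np.log(x + y)))  # True
print(approx, exact)                     # approx <= exact <= approx + log(2)
```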
Addition in log space
The formula above is more accurate than log(exp(x′) + exp(y′)), provided one takes advantage of the asymmetry in the addition formula: x′ should be the larger (least negative) of the two operands. This also produces the correct behavior if one of the operands is floating-point negative infinity, which corresponds to a probability of zero.
- −∞ + log(1 + exp(y′ − (−∞))) = −∞ + ∞: this quantity is indeterminate, and will result in NaN.
- x′ + log(1 + exp(−∞ − x′)) = x′ + 0 = x′: this is the desired answer.
The above formula alone will incorrectly produce an indeterminate result in the case where both arguments are −∞. This should be checked for separately to return −∞.
For numerical reasons, one should use a function that computes log(1 + x) (log1p) directly.
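A minimal sketch of this addition in log space using Python's math.log1p, with an explicit check for the case where both operands are negative infinity; the function name log_add is just for illustration:

```python
import math

def log_add(x_log: float, y_log: float) -> float:
    """Return log(x + y) given x' = log(x) and y' = log(y)."""
    # Both inputs are -inf: both probabilities are zero, so the sum is zero.
    if x_log == float("-inf") and y_log == float("-inf"):
        return float("-inf")
    # Make x_log the larger (least negative) operand, as recommended above.
    if x_log < y_log:
        x_log, y_log = y_log, x_log
    # x' + log(1 + exp(y' - x')), with log1p for accuracy when exp(...) is tiny.
    return x_log + math.log1p(math.exp(y_log - x_log))

# Example: P = 0.5 and P = 0.25 should sum to 0.75.
print(math.exp(log_add(math.log(0.5), math.log(0.25))))  # ~0.75
```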
References
[edit]- ^ a b Piech, Chris. "Probability for Computer scientists - Log probabilities". Retrieved 20 July 2023.
- ^ Kass, Robert E.; Vos, Paul W. (1997). Geometrical Foundations of Asymptotic Inference. New York: John Wiley & Sons. p. 14. ISBN 0-471-82668-5.
- ^ Papadopoulos, Alecos (September 25, 2013). "Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)?". Stack Exchange.
Log probability
Definition and Fundamentals
Formal Definition
In probability theory and statistics, the log probability of an event is defined as the logarithm of its associated probability p, where 0 < p ≤ 1. This transformation is expressed as log(p), with the natural logarithm serving as the standard choice in statistical analysis and computational contexts due to its mathematical properties and prevalence in likelihood functions.[1][5] Although base-10 logarithms can be used in some applications, the natural base e (approximately 2.718) is preferred for its alignment with exponential functions and information-theoretic measures.[5] Common notation for log probability includes log P(A), where A denotes the event or random variable, or the shorthand log p to explicitly signify the logarithmic scale.[1] The function is undefined at p = 0, as the logarithm approaches negative infinity, and for 0 < p ≤ 1, the values range from −∞ to 0, reflecting the non-positive nature of log probabilities on this interval.[1] For example, consider a fair coin flip where the probability of heads is p = 0.5. The log probability is then log(0.5) ≈ −0.693, illustrating how the transformation yields a negative value that scales with the rarity of the event.
Relation to Natural Logarithm
In probability and statistics, the natural logarithm (base e) is the conventional base for log probabilities due to its favorable properties in calculus, particularly when dealing with probability densities and maximum likelihood estimation. The derivative of the natural logarithm of a probability p with respect to a parameter is simply (1/p) times the derivative of p, which streamlines the computation of score functions and gradients without extraneous constants. This simplification is especially useful in deriving estimators for distributions involving exponentials, as seen in exponential families. In contrast, using a logarithm with base b ≠ e introduces a scaling factor of 1/ln(b) in the derivative, complicating analytical work. The change-of-base formula further underscores this preference: for any base b, log_b(p) = ln(p) / ln(b), meaning computations in the natural base eliminate the need for such normalization constants in theoretical derivations. While base-2 logarithms are standard in information theory to quantify entropy in bits—as introduced by Claude Shannon in 1948—the natural base aligns more naturally with the exponential forms prevalent in probabilistic modeling and avoids unit-specific scaling.

The practice of using natural logarithms for log probabilities emerged in early 20th-century statistics, particularly with Ronald A. Fisher's 1922 formalization of maximum likelihood estimation, where log-likelihoods proved essential for managing products of joint probabilities from independent observations. This approach addressed numerical underflow issues in likelihood computations and facilitated asymptotic theory. Alfred Rényi extended related ideas in the 1960s through his axiomatic development of generalized entropy measures, which rely on logarithmic transformations of probabilities to capture uncertainty in a unified framework.[6]

In modern computational implementations, libraries like NumPy default to the natural logarithm via the np.log function for log probability operations, ensuring compatibility with statistical algorithms and preventing underflow in high-dimensional products. This convention promotes numerical stability, as log probabilities transform multiplications into additions, a property preserved regardless of base but optimized with the natural log for derivative-based optimizations.[7]
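A quick NumPy check of the change-of-base relation above, with an arbitrary probability value:

```python
import numpy as np

p = 0.1
# Base-2 log via change of base from the natural log, compared with np.log2 directly.
print(np.log(p) / np.log(2), np.log2(p))   # both about -3.3219
```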
Key Properties
Monotonicity and Inequalities
The logarithm function is strictly increasing on the positive real numbers, which implies that for probabilities 0 < p ≤ 1 and 0 < q ≤ 1, log(p) ≤ log(q) if and only if p ≤ q.[8] This monotonicity preserves the ordering of probabilities when working in log space, ensuring that relative comparisons remain unchanged under the transformation.[9]

The natural logarithm is a concave function for x > 0. By Jensen's inequality applied to this concavity, for a random variable X taking values in (0, 1] with expectation E[X], it holds that E[log X] ≤ log E[X], with equality if and only if X is almost surely constant.[10] This inequality underscores the subadditivity of the log transform in expectation, which is fundamental in analyses of probabilistic averages and information measures.

A useful bound arising from monotonicity is that for probabilities p₁ and p₂, log(p₁ + p₂) ≤ log(2) + max(log p₁, log p₂). Without loss of generality, assume p₁ ≤ p₂; then p₁ + p₂ ≤ 2p₂, so log(p₁ + p₂) ≤ log(2) + log(p₂), as the logarithm is increasing. This provides an upper bound on the log of a sum of probabilities, limiting overflow in computations involving small values.

In Bayesian updating, the monotonicity of the logarithm ensures that the posterior log-odds increase with the strength of supporting evidence, as the update adds the log-likelihood to the prior log-odds.[11] For instance, stronger evidence corresponds to a higher likelihood ratio, which monotonically boosts the posterior belief in the hypothesis.
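A small numerical illustration of Jensen's inequality for the logarithm, using samples of an arbitrary random variable on (0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.01, 1.0, size=100_000)  # samples of a random variable X in (0, 1]

print(np.mean(np.log(x)))  # E[log X], the smaller quantity
print(np.log(np.mean(x)))  # log E[X], the larger quantity, as Jensen's inequality predicts
```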
Log of Products and Sums
In probability theory, the product rule for independent events transforms under the logarithm into a simple addition. For two independent events A and B, the joint probability is P(A ∩ B) = P(A) · P(B), so the log probability becomes log P(A ∩ B) = log P(A) + log P(B). This property, derived from the logarithm's additive nature over multiplication, facilitates the handling of multiplicative probability structures by converting them to summations, which are computationally more stable and interpretable in many analyses.

This additive property extends naturally to the chain rule of probability, which decomposes joint distributions into products of conditionals. For a sequence of random variables X₁, …, Xₙ, the joint probability is P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | X₁, …, Xᵢ₋₁), and taking the logarithm yields log P(X₁, …, Xₙ) = Σᵢ log P(Xᵢ | X₁, …, Xᵢ₋₁). This representation is fundamental in probabilistic modeling, as it allows the log-joint probability to be expressed as a sum of local log-conditional probabilities, enabling efficient inference in graphical models and sequential processes.

In contrast, the sum rule for probabilities presents a challenge in log space. For mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B), so log P(A ∪ B) = log(P(A) + P(B)), which does not simplify to log P(A) + log P(B). This lack of direct additivity requires careful handling, often involving normalization to ensure the arguments sum appropriately before applying the logarithm.

A key application of the product-to-sum transformation arises in maximum likelihood estimation (MLE), where the likelihood for independent and identically distributed (i.i.d.) observations x₁, …, xₙ is L(θ) = ∏ᵢ p(xᵢ | θ). The log-likelihood then simplifies to log L(θ) = Σᵢ log p(xᵢ | θ), turning the optimization problem into minimizing the negative sum of log probabilities, which is more numerically robust and aligns with gradient-based methods.[12]
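As a sketch of the MLE point above, here is a toy negative log-likelihood for i.i.d. Gaussian observations with known unit variance; the data and grid search are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=500)   # illustrative i.i.d. sample

def negative_log_likelihood(mu: float, x: np.ndarray) -> float:
    # Sum of log densities of N(mu, 1): the product of densities becomes a sum of logs.
    log_pdf = -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return -np.sum(log_pdf)

# Evaluate on a grid; the minimum sits near the sample mean, the Gaussian MLE.
grid = np.linspace(2.0, 4.0, 201)
best = grid[np.argmin([negative_log_likelihood(m, data) for m in grid])]
print(best, data.mean())   # both close to 3.0
```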
Operations in Log Space
Logarithmic Addition
Logarithmic addition refers to the computation of the logarithm of the sum of probabilities, which is essential when working in log space to represent unions of mutually exclusive events or marginal probabilities. For two mutually exclusive events A and B, the probability of their union is P(A) + P(B), so the log probability is given by log(P(A) + P(B)) = log(exp(log P(A)) + exp(log P(B))).[13] This form, known as the log-sum-exp function, arises naturally in probabilistic models where direct summation in probability space can lead to numerical issues, but it requires careful implementation in log space to maintain stability.[14]

For a general sum over multiple log probabilities a₁, …, aₙ, the expression becomes log Σᵢ exp(aᵢ). To prevent overflow or underflow during exponentiation—especially when the values vary widely—a normalization technique factors out the maximum value m = maxᵢ aᵢ, yielding log Σᵢ exp(aᵢ) = m + log Σᵢ exp(aᵢ − m).[13] This log-sum-exp trick ensures numerical stability by keeping all terms in the inner sum between 0 and 1 after subtraction, avoiding extreme exponents.[14]

In the pairwise case, the formula simplifies further: for terms a and b where a ≥ b, log(exp(a) + exp(b)) = a + log(1 + exp(b − a)). This avoids computing the potentially large exp(a) directly and leverages the fact that exp(b − a) ≤ 1, making it computationally efficient and stable.[13]

A practical application of logarithmic addition occurs in the forward algorithm for hidden Markov models (HMMs), where state probabilities at time t are computed by summing over transitions from previous states in log space: log αₜ(j) = log bⱼ(oₜ) + log Σᵢ exp(log αₜ₋₁(i) + log aᵢⱼ). This approach prevents underflow in long sequences, enabling reliable likelihood computation.[15]
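A sketch of a log-space forward recursion for a small HMM, with invented transition, emission, and initial probabilities, using SciPy's logsumexp in place of the sum over previous states:

```python
import numpy as np
from scipy.special import logsumexp

# Illustrative 2-state HMM parameters (all values made up for the sketch).
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.4, 0.6]]))   # log a_ij
log_emit = np.log(np.array([[0.9, 0.1],
                            [0.2, 0.8]]))    # log b_j(o) for observations 0/1
log_init = np.log(np.array([0.5, 0.5]))

obs = [0, 1, 1, 0]

# Forward recursion entirely in log space: additions replace products,
# and logsumexp replaces the sum over previous states.
log_alpha = log_init + log_emit[:, obs[0]]
for o in obs[1:]:
    log_alpha = logsumexp(log_alpha[:, None] + log_trans, axis=0) + log_emit[:, o]

print(logsumexp(log_alpha))   # log-likelihood of the observation sequence
```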
Handling Divisions and Ratios
In log space, divisions and ratios of probabilities are computed via subtraction, leveraging the property that the logarithm of a quotient equals the difference of the logarithms. Specifically, for events A and B with P(B) > 0, log(P(A)/P(B)) = log P(A) − log P(B).[16] This operation simplifies the calculation of probability ratios, such as odds ratios, which compare the likelihood of an event occurring versus not occurring, avoiding direct division of potentially small probabilities that could introduce numerical issues.[17]

For conditional probabilities, the logarithm follows directly from the definition P(A | B) = P(A ∩ B)/P(B), yielding log P(A | B) = log P(A ∩ B) − log P(B). Here, the joint probability is often obtained in log space from the sum of log probabilities under independence assumptions, such as log P(A ∩ B) = log P(A) + log P(B) if A and B are independent.[16] This subtraction enables efficient computation of conditionals in probabilistic models without exponentiation.[17]

The log-odds, defined for a binary event as log(p/(1 − p)) where p is the probability of the event, transforms probabilities into an additive scale useful for binary outcomes.[17] Additions in log-odds space correspond to multiplications in probability space, facilitating updates in models with binary decisions. In Bayes' theorem, this manifests as log(P(H | E)/P(¬H | E)) = log(P(H)/P(¬H)) + log(P(E | H)/P(E | ¬H)), where the posterior log-odds equal the prior log-odds plus the log-likelihood ratio.[17] This form streamlines Bayesian inference by converting multiplicative updates to additions.[16]

In logistic regression, model parameters directly represent changes in log-odds; for inputs x and weights w, the log-odds of the positive class is wᵀx, such that P(y = 1 | x) = σ(wᵀx), where σ is the sigmoid function.[17] Each feature's coefficient indicates the log-odds shift per unit change, enabling interpretable probabilistic predictions for binary classification tasks.[16]
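A small sketch of the log-odds form of Bayes' theorem described above; the prior and likelihood values are made up, and the result is checked against Bayes' theorem applied directly in probability space:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

prior = 0.2             # P(H): prior probability of the hypothesis
p_e_given_h = 0.9       # P(E | H)
p_e_given_not_h = 0.3   # P(E | not H)

# Posterior log-odds = prior log-odds + log-likelihood ratio.
posterior_log_odds = logit(prior) + math.log(p_e_given_h / p_e_given_not_h)
posterior = sigmoid(posterior_log_odds)

# Check against Bayes' theorem applied directly in probability space.
direct = prior * p_e_given_h / (prior * p_e_given_h + (1 - prior) * p_e_given_not_h)
print(posterior, direct)   # both ~0.4286
```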
Practical Applications
In Probabilistic Modeling
In probabilistic modeling, log probabilities play a central role in statistical inference, particularly through the log-likelihood function, which was introduced by Ronald A. Fisher in 1922 to facilitate parameter estimation. The log-likelihood for a set of independent observations x₁, …, xₙ given parameters θ is defined as ℓ(θ) = Σᵢ log p(xᵢ | θ), transforming the product of probabilities into a sum that is easier to maximize. This formulation underpins maximum likelihood estimation, where θ̂ = argmax over θ of ℓ(θ), enabling efficient optimization in large datasets by leveraging the additivity of logarithms.[18]

A key application arises in exponential family distributions, where the log probability takes a linear form in the natural parameter space, simplifying inference and computation. Specifically, for a random variable x with natural parameters θ, the log probability is given by log p(x | θ) = θᵀT(x) − A(θ) + log h(x), where T(x) is the sufficient statistic, A(θ) is the log of the normalizing constant (the log-partition function), and h(x) is a base measure; this structure highlights the linearity in log space, which facilitates conjugate priors and moment calculations in Bayesian settings. This representation was formalized in the foundational work on sufficient statistics by Pitman in 1936, establishing exponential families as a cornerstone for tractable probabilistic models.[19]

In graphical models, log probabilities are essential for belief propagation algorithms, such as the junction tree method, which represents joint distributions via cliques and separators to perform exact inference. The junction tree algorithm, developed by Lauritzen and Spiegelhalter in 1988, can be implemented using log potentials—logarithms of the clique and separator functions—to mitigate numerical underflow during message passing, ensuring stable computation of marginal posteriors in multiply connected Bayesian networks. This approach transforms multiplicative updates into additive ones, preserving the probabilistic structure while enhancing computational reliability.

Variational inference further relies on log probabilities to approximate intractable posteriors through the evidence lower bound (ELBO), a tractable objective that lower-bounds the marginal log-likelihood. The ELBO is expressed as ELBO(q) = E_q[log p(x, z)] − E_q[log q(z)], where q(z) is a variational distribution over latent variables z, and maximizing it yields an approximation to the true posterior p(z | x). This framework, introduced by Jordan et al. in 1999 for graphical models, enables scalable inference by optimizing in log space, avoiding direct computation of normalizing constants.

For joint distributions, the log probability of the joint can be expressed as the sum of marginal and conditional log probabilities, aligning with the product rules from earlier sections.[20]
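As an illustration of the exponential-family form given earlier in this section, the Bernoulli distribution can be written with natural parameter θ = log(p/(1 − p)), sufficient statistic T(x) = x, log-partition A(θ) = log(1 + exp(θ)), and h(x) = 1; the sketch below checks this against the direct log probability:

```python
import numpy as np

p = 0.3
theta = np.log(p / (1 - p))   # natural parameter
A = np.log1p(np.exp(theta))   # log-partition function A(theta)

for x in (0, 1):
    log_prob_expfam = theta * x - A              # theta^T T(x) - A(theta), with log h(x) = 0
    log_prob_direct = x * np.log(p) + (1 - x) * np.log(1 - p)
    print(x, log_prob_expfam, log_prob_direct)   # the two columns agree
```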
In Machine Learning Algorithms
The use of log probabilities in machine learning algorithms has surged since the early 2010s, coinciding with the rise of deep learning, where they enable stable computation of probabilities in high-dimensional spaces. For instance, AlexNet, a seminal convolutional neural network, maximized the average log-probability across training cases using multinomial logistic regression as its objective, which helped mitigate numerical instability during training on large datasets like ImageNet. This approach became foundational for subsequent models, allowing efficient handling of softmax outputs without underflow in probability values close to zero.

In gradient-based optimization, log probabilities simplify the computation of derivatives for maximum likelihood estimation. The gradient of the log-likelihood with respect to model parameters θ is given by the score function ∇_θ log p(x | θ) = ∇_θ p(x | θ) / p(x | θ), which avoids direct manipulation of small probability values and reduces variance in stochastic gradient estimates.[21] This log-derivative trick is particularly valuable in scenarios where probabilities are tiny, as it transforms products into sums and stabilizes training dynamics.

During backpropagation in neural networks, log probabilities are integral to the cross-entropy loss, formulated as L = −Σₖ yₖ log ŷₖ, where y is the one-hot target and ŷ the predicted softmax probabilities; this pairing with log-softmax outputs ensures numerical robustness and efficient gradient flow through layers.

In reinforcement learning, policy gradient methods leverage log probabilities to update policies via advantage-weighted importance. The REINFORCE algorithm computes updates proportional to the gradient ∇_θ log π_θ(a | s) · A(s, a), where π_θ(a | s) is the policy probability and A(s, a) the advantage, enabling direct optimization of expected rewards without explicit value functions.[22]

A prominent application appears in large language models like GPT-3, where training maximizes the sum of log probabilities for next-token prediction, Σₜ log P(xₜ | x₁, …, xₜ₋₁; θ), using log-softmax over a vast vocabulary to handle autoregressive generation efficiently. This formulation supports scalable training on massive corpora, yielding models capable of coherent sequence generation.
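A minimal NumPy sketch of the log-softmax/cross-entropy pairing mentioned above; the logits and target index are arbitrary:

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating so exp() never overflows.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([2.0, -1.0, 0.5])   # arbitrary class scores
target = 0                            # index of the true class (one-hot at position 0)

# Cross-entropy with a one-hot target reduces to the negative log-probability
# of the true class, computed here directly in log space.
loss = -log_softmax(logits)[target]
print(loss)   # ~0.241
```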
Numerical Implementation
Avoiding Overflow and Underflow
In probabilistic computations, direct evaluation of joint probabilities often involves products of individual probabilities, each typically less than 1, such as ∏ᵢ pᵢ for n events. For moderate to large n, these products can rapidly fall below the smallest representable positive floating-point value, leading to underflow where the result is indistinguishable from zero in floating-point arithmetic, thus causing loss of precision or erroneous computations.[16] Working in log space mitigates this by transforming the product into a sum of logarithms, Σᵢ log pᵢ, where each log pᵢ remains finite and negative, preserving relative magnitudes without underflow.[16]

Although exponentiation of log probabilities, exp(log p), could theoretically overflow for large positive arguments, probabilities satisfy p ≤ 1 and thus log p ≤ 0, making overflow impossible; instead, severe underflow in the exponentiated result is the dominant concern when log p is a large negative number.[23] The primary strategy to address these issues is to perform all intermediate calculations entirely in log space, only exponentiating at the final step if a probability scale is required, which maintains numerical stability throughout the process.[16]

A representative example occurs in the naive Bayes classifier, where the posterior probability for a class involves multiplying class-conditional probabilities across a long feature vector; direct computation underflows for high-dimensional data, but summing the logs of these probabilities avoids this while enabling reliable argmax decisions.[24] Under the IEEE 754 floating-point standard, log(0) is defined to yield negative infinity, which appropriately represents impossible events without introducing NaNs, though implementations must ensure non-negative arguments to prevent undefined behavior such as taking the logarithm of a negative number.
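A toy naive Bayes scoring sketch following the strategy just described; the class priors, feature likelihoods, and feature vector are all invented for illustration:

```python
import numpy as np

# Two classes, 2,000 binary features; likelihoods are invented for illustration.
rng = np.random.default_rng(2)
log_prior = np.log(np.array([0.6, 0.4]))
feature_likelihood = rng.uniform(0.01, 0.99, size=(2, 2000))   # P(feature=1 | class)

x = rng.integers(0, 2, size=2000)                              # observed binary feature vector

# Per-class log posterior (up to a constant): log prior plus a sum of log likelihoods.
log_lik = np.where(x == 1, np.log(feature_likelihood), np.log(1 - feature_likelihood))
log_posterior = log_prior + log_lik.sum(axis=1)

print(log_posterior)             # finite (large negative) scores, no underflow
print(np.argmax(log_posterior))  # reliable argmax decision

# The same computation in probability space underflows to exactly zero:
prob_lik = np.where(x == 1, feature_likelihood, 1 - feature_likelihood)
print(np.prod(prob_lik, axis=1))  # [0. 0.]
```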
Log-Sum-Exp Computation
The log-sum-exp (LSE) function is defined as LSE(x₁, …, xₙ) = log Σᵢ exp(xᵢ), where x₁, …, xₙ are real numbers representing log-probabilities or log-likelihoods.[14] Direct computation of this expression risks numerical overflow when the xᵢ are large and positive, or severe underflow when they are large and negative, leading to inaccurate results in floating-point arithmetic.[14] To ensure stability, the function is reformulated by shifting all terms by their maximum value: let m = maxᵢ xᵢ, then LSE(x₁, …, xₙ) = m + log Σᵢ exp(xᵢ − m). This adjustment prevents overflow in the exponentials, as exp(xᵢ − m) ≤ 1 for all i, while the final logarithm handles the scaling accurately.[14]

In vectorized implementations, such as the logsumexp function in SciPy, the LSE is computed efficiently over arrays or along specified axes, incorporating the max-shift internally to maintain numerical stability for high-dimensional inputs common in scientific computing.[25] This allows seamless handling of vector or matrix arguments without explicit looping, optimizing performance on modern hardware while bounding errors comparably to the scalar case.[25][14]
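A minimal implementation of the shifted formula, checked against SciPy's scipy.special.logsumexp on inputs that would underflow if exponentiated directly:

```python
import numpy as np
from scipy.special import logsumexp

def lse(x: np.ndarray) -> float:
    m = np.max(x)                              # shift by the maximum term
    return m + np.log(np.sum(np.exp(x - m)))   # every exponent is now <= 0

x = np.array([-1000.0, -1001.0, -1002.0])      # naive np.log(np.sum(np.exp(x))) gives -inf
print(lse(x), logsumexp(x))                    # both ~ -999.59
```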
For sequential or incremental addition of terms, an iterative variant maintains a running LSE by incorporating each new term xᵢ via pairwise application: starting with r₁ = x₁, update rᵢ = LSE(rᵢ₋₁, xᵢ) using the stable shift at each step.[26] This approach is particularly useful in streaming computations or online algorithms where terms arrive one at a time, preserving stability without recomputing the full sum.[26]
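A sketch of the running update, reusing NumPy's pairwise np.logaddexp for each stable two-term step; the stream of terms is arbitrary:

```python
import numpy as np

def running_lse(stream):
    """Accumulate log(sum(exp(x))) over a stream of terms, one at a time."""
    total = float("-inf")              # log of an empty sum (zero probability mass)
    for x in stream:
        total = np.logaddexp(total, x)  # stable pairwise LSE update
    return total

terms = [-3.0, -1.0, -2.0, -0.5]
print(running_lse(terms))                        # incremental result
print(np.log(np.sum(np.exp(np.array(terms)))))   # matches the batch computation
```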
Regarding numerical error, the naive direct evaluation can lose essentially all precision when terms far from the maximum underflow to zero.[14] In contrast, the shifted LSE keeps the error small, on the order of the unit roundoff u (machine epsilon divided by 2) scaled by the condition number of the summation, achieving accuracy close to machine precision for well-conditioned inputs.[14]
A practical example arises in the expectation-maximization (EM) algorithm for mixture models, where the E-step computes log posterior responsibilities as normalized log-likelihoods: for data point xᵢ and components k = 1, …, K, the responsibility is obtained in log space via log γᵢₖ = log πₖ + log p(xᵢ | θₖ) − LSE over j of (log πⱼ + log p(xᵢ | θⱼ)), using LSE to stably evaluate the normalizing log-partition function.[27]
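A sketch of that E-step for a two-component one-dimensional Gaussian mixture with invented parameters, using scipy.special.logsumexp for the normalizer:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Invented mixture parameters and data for illustration.
log_weights = np.log(np.array([0.3, 0.7]))
means, stds = np.array([-2.0, 3.0]), np.array([1.0, 1.5])
x = np.array([-1.5, 0.0, 2.5])

# Unnormalized log responsibilities: log pi_k + log N(x_i | mu_k, sigma_k).
log_unnorm = log_weights[None, :] + norm.logpdf(x[:, None], means[None, :], stds[None, :])

# Normalize in log space with logsumexp over components.
log_resp = log_unnorm - logsumexp(log_unnorm, axis=1, keepdims=True)

print(np.exp(log_resp))              # responsibilities; each row sums to 1
print(np.exp(log_resp).sum(axis=1))  # [1. 1. 1.]
```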