Importance sampling
from Wikipedia

Importance sampling is a Monte Carlo method for evaluating properties of a particular distribution, while only having samples generated from a different distribution than the distribution of interest. Its introduction in statistics is generally attributed to a paper by Teun Kloek and Herman K. van Dijk in 1978,[1] but its precursors can be found in statistical physics as early as 1949.[2][3] Importance sampling is also related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.

Basic theory


Let $X\colon \Omega \to \mathbb{R}$ be a random variable in some probability space $(\Omega, \mathcal{F}, P)$. We wish to estimate the expected value of $X$ under $P$, denoted $\mathbf{E}[X;P]$. If we have statistically independent random samples $x_1, \ldots, x_n$, generated according to $P$, then an empirical estimate of $\mathbf{E}[X;P]$ is just

$$\widehat{\mathbf{E}}_n[X;P] = \frac{1}{n} \sum_{i=1}^n x_i, \qquad x_i \sim P,$$

and the precision of this estimate depends on the variance of $X$:

$$\operatorname{var}\!\big[\widehat{\mathbf{E}}_n; P\big] = \frac{\operatorname{var}[X;P]}{n}.$$

The basic idea of importance sampling is to sample from a different distribution to lower the variance of the estimation of $\mathbf{E}[X;P]$, or when sampling directly from $P$ is difficult.

This is accomplished by first choosing a random variable $L \geq 0$ such that $\mathbf{E}[L;P] = 1$ and that $P$-almost everywhere $L(\omega) \neq 0$. With the variable $L$ we define a probability $P^{(L)}$ that satisfies

$$\mathbf{E}[X;P] = \mathbf{E}\!\left[\frac{X}{L}; P^{(L)}\right].$$

The variable $X/L$ will thus be sampled under $P^{(L)}$ to estimate $\mathbf{E}[X;P]$ as above, and this estimation is improved when

$$\operatorname{var}\!\left[\frac{X}{L}; P^{(L)}\right] < \operatorname{var}[X; P].$$

When $X$ is of constant sign over $\Omega$, the best variable $L$ would clearly be $L^* = \frac{X}{\mathbf{E}[X;P]} \geq 0$, so that $X/L^*$ is the searched constant $\mathbf{E}[X;P]$ and a single sample under $P^{(L^*)}$ suffices to give its value. Unfortunately we cannot take that choice, because $\mathbf{E}[X;P]$ is precisely the value we are looking for! However, this theoretical best case $L^*$ gives us an insight into what importance sampling does: for all $a \in \mathbb{R}$, the density of $P^{(L^*)}$ at $X = a$ can be written as

$$P^{(L^*)}(X \in [a; a+da]) = \frac{1}{\mathbf{E}[X;P]}\, a\, P(X \in [a; a+da]).$$

To the right, $a\, P(X \in [a; a+da])$ is one of the infinitesimal elements that sum up to $\mathbf{E}[X;P]$:

$$\mathbf{E}[X;P] = \int_{-\infty}^{+\infty} a\, P(X \in [a; a+da]);$$

therefore, a good probability change $P^{(L)}$ in importance sampling will redistribute the law of $X$ so that its samples' frequencies are sorted directly according to their contributions to $\mathbf{E}[X;P]$, as opposed to their probabilities under $P$. Hence the name "importance sampling."

Importance sampling is often used as a Monte Carlo integrator. When $P$ is the uniform distribution over a domain $\Omega \subseteq \mathbb{R}$, the expectation $\mathbf{E}[X;P]$ is, up to the constant factor $|\Omega|$, the integral of the real function $X$ over $\Omega$.
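To make the integrator interpretation concrete, the following Python sketch (an illustrative toy example with an arbitrarily chosen integrand and proposal, not taken from the article) estimates $\int_0^1 x^4\,dx = 0.2$ by plain Monte Carlo under the uniform distribution and by importance sampling from the proposal density $q(x) = 2x$, whose shape follows the integrand more closely.

```python
# Illustrative sketch: importance sampling as a Monte Carlo integrator.
# Target quantity: the integral of h(x) = x^4 over [0, 1] (true value 0.2),
# i.e. the expectation of h under the uniform distribution P on [0, 1].
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Plain Monte Carlo under the uniform distribution.
u = rng.uniform(size=n)
plain = u**4

# Importance sampling: draw x from the proposal q(x) = 2x on [0, 1] via the
# inverse CDF (the CDF is x^2, so x = sqrt(U)), and weight by p(x)/q(x) = 1/(2x).
x = np.sqrt(rng.uniform(size=n))
weighted = x**4 / (2.0 * x)

print("plain MC:", plain.mean(), "std. error:", plain.std(ddof=1) / np.sqrt(n))
print("importance:", weighted.mean(), "std. error:", weighted.std(ddof=1) / np.sqrt(n))
```

Because the weighted integrand $x^4/(2x)$ varies less than $x^4$ itself, the importance sampling estimate shows a smaller standard error for the same number of samples.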

Application to probabilistic inference


Such methods are frequently used to estimate posterior densities or expectations in state and/or parameter estimation problems in probabilistic models that are too hard to treat analytically. Examples include Bayesian networks and importance weighted variational autoencoders.[4]

Application to simulation


Importance sampling is a variance reduction technique that can be used in the Monte Carlo method. The idea behind importance sampling is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. If these "important" values are emphasized by sampling more frequently, then the estimator variance can be reduced. Hence, the basic methodology in importance sampling is to choose a distribution which "encourages" the important values. This use of "biased" distributions will result in a biased estimator if it is applied directly in the simulation. However, the simulation outputs are weighted to correct for the use of the biased distribution, and this ensures that the new importance sampling estimator is unbiased. The weight is given by the likelihood ratio, that is, the Radon–Nikodym derivative of the true underlying distribution with respect to the biased simulation distribution.

The fundamental issue in implementing importance sampling simulation is the choice of the biased distribution which encourages the important regions of the input variables. Choosing or designing a good biased distribution is the "art" of importance sampling. The rewards for a good distribution can be huge run-time savings; the penalty for a bad distribution can be longer run times than for a general Monte Carlo simulation without importance sampling.

Consider $X$ to be the sample and $\frac{f(X)}{g(X)}$ to be the likelihood ratio, where $f$ is the probability density (mass) function of the desired distribution and $g$ is the probability density (mass) function of the biased/proposal/sample distribution. Then the problem can be characterized by choosing the sample distribution $g$ that minimizes the variance of the scaled sample:

$$g^* = \min_{g} \operatorname{var}_g\!\left(X \frac{f(X)}{g(X)}\right).$$

It can be shown that the following distribution minimizes the above variance:[5]

$$g^*(X) = \frac{|X|\, f(X)}{\int |x|\, f(x)\, dx}.$$

Notice that when $X \geq 0$, this variance becomes 0.
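The zero-variance property can be checked numerically. The following sketch assumes the target is an Exponential(1) density $f$ and the quantity of interest is $\mathbf{E}_f[X] = 1$; the optimal biasing density $g^*(x) \propto x f(x)$ is then the Gamma(2, 1) density, and every weighted sample equals the answer exactly.

```python
# Illustrative sketch: the optimal (zero-variance) biasing density for a
# nonnegative sample. Target f: Exponential(1); quantity: E_f[X] = 1.
# The optimal proposal g*(x) = x f(x) / E_f[X] is the Gamma(2, 1) density,
# so every scaled sample x * f(x)/g*(x) equals the answer exactly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000

x = rng.gamma(shape=2.0, scale=1.0, size=n)         # samples from g* = Gamma(2, 1)
w = stats.expon.pdf(x) / stats.gamma.pdf(x, a=2.0)  # likelihood ratio f / g*
scaled = x * w                                      # identically 1

print(scaled.mean(), scaled.var())                  # ~1.0 and ~0.0
```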

Mathematical approach


Consider estimating by simulation the probability $p_t$ of an event $\{X \geq t\}$, where $X$ is a random variable with cumulative distribution function $F$ and probability density function $f(x) = F'(x)$, where prime denotes derivative. A $K$-length independent and identically distributed (i.i.d.) sequence $X_i$ is generated from the distribution $F$, and the number $k_t$ of random variables that lie above the threshold $t$ is counted. The random variable $k_t$ is characterized by the binomial distribution

$$P(k_t = k) = \binom{K}{k} p_t^k (1 - p_t)^{K - k}, \qquad k = 0, 1, \ldots, K.$$

One can show that $\mathbf{E}[k_t/K] = p_t$ and $\operatorname{var}[k_t/K] = p_t(1 - p_t)/K$, so in the limit $K \to \infty$ we are able to obtain $p_t$. Note that the variance is low if $p_t \approx 1$. Importance sampling is concerned with the determination and use of an alternate density function $f_*$ (for $X$), usually referred to as a biasing density, for the simulation experiment. This density allows the event $\{X \geq t\}$ to occur more frequently, so the sequence length $K$ gets smaller for a given estimator variance. Alternatively, for a given $K$, use of the biasing density results in a variance smaller than that of the conventional Monte Carlo estimate. From the definition of $p_t$, we can introduce $f_*$ as below:

$$p_t = \mathbf{E}[1(X \geq t)] = \int 1(x \geq t)\, \frac{f(x)}{f_*(x)}\, f_*(x)\, dx = \mathbf{E}_*[1(X \geq t)\, W(X)],$$

where

$$W(\cdot) \equiv \frac{f(\cdot)}{f_*(\cdot)}$$

is a likelihood ratio and is referred to as the weighting function. The last equality in the above equation motivates the estimator

$$\hat{p}_t = \frac{1}{K} \sum_{i=1}^K 1(X_i \geq t)\, W(X_i), \qquad X_i \sim f_*.$$

This is the importance sampling estimator of $p_t$ and is unbiased. That is, the estimation procedure is to generate i.i.d. samples from $f_*$, and for each sample which exceeds $t$, the estimate is incremented by the weight $W$ evaluated at the sample value. The results are averaged over the $K$ trials. The variance of the importance sampling estimator is easily shown to be

$$\operatorname{var}_*\hat{p}_t = \frac{1}{K}\left(\mathbf{E}_*[1(X \geq t)\, W^2(X)] - p_t^2\right).$$

The importance sampling problem then focuses on finding a biasing density $f_*$ such that the variance of the importance sampling estimator is less than the variance of the conventional Monte Carlo estimate. A biasing density function that minimizes the variance, and under certain conditions reduces it to zero, is called an optimal biasing density function.
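A minimal sketch of this estimator, assuming a standard normal $f$, a threshold $t = 4$, and a translated normal biasing density $f_* = \mathcal{N}(t, 1)$ (all illustrative choices):

```python
# Illustrative sketch of the estimator above: p_t = P(X >= t) for X ~ N(0, 1),
# with the biasing density f_* chosen as a translated normal N(t, 1) and the
# weighting function W = f / f_*. All concrete choices here are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K, t = 100_000, 4.0
truth = stats.norm.sf(t)                            # exact tail probability

x = rng.normal(loc=t, scale=1.0, size=K)            # i.i.d. samples from f_*
w = stats.norm.pdf(x) / stats.norm.pdf(x, loc=t)    # W(x) = f(x) / f_*(x)
terms = (x >= t) * w                                # 1(X >= t) * W(X)

p_hat = terms.mean()                                # unbiased IS estimate of p_t
var_is = terms.var(ddof=1) / K                      # variance of the IS estimator
var_mc = truth * (1.0 - truth) / K                  # binomial variance of plain MC
print(truth, p_hat, var_is, var_mc)
```

For this setup the empirical variance of the weighted terms is orders of magnitude below the binomial variance $p_t(1-p_t)/K$ of the conventional estimate.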

Conventional biasing methods


Although there are many kinds of biasing methods, the following two methods are most widely used in the applications of importance sampling.

Scaling


Shifting probability mass into the event region by positive scaling of the random variable with a number greater than unity has the effect of increasing the variance (and the mean) of the density function. This results in a heavier tail of the density, leading to an increase in the event probability. Scaling is probably one of the earliest biasing methods known and has been extensively used in practice. It is simple to implement and usually provides conservative simulation gains compared to other methods.

In importance sampling by scaling, the simulation density is chosen as the density function of the scaled random variable $aX$, where usually $a > 1$ for tail probability estimation. By transformation,

$$f_*(x) = \frac{1}{a}\, f\!\left(\frac{x}{a}\right),$$

and the weighting function is

$$W(x) = \frac{f(x)}{f_*(x)} = \frac{a\, f(x)}{f(x/a)}.$$

While scaling shifts probability mass into the desired event region, it also pushes mass into the complementary region, which is undesirable. If $X$ is a sum of $n$ random variables, the spreading of mass takes place in an $n$-dimensional space. The consequence is a decreasing importance sampling gain for increasing $n$, called the dimensionality effect. A modern version of importance sampling by scaling is so-called sigma-scaled sampling (SSS), which runs multiple Monte Carlo (MC) analyses with different scaling factors. In contrast to many other high-yield estimation methods (like worst-case distances, WCD), SSS does not suffer much from the dimensionality problem, and addressing multiple MC outputs causes no degradation in efficiency. On the other hand, like WCD, SSS is only designed for Gaussian statistical variables, and unlike WCD, the SSS method is not designed to provide accurate statistical corners. Another SSS disadvantage is that MC runs with large scale factors may become difficult, e.g. due to model and simulator convergence problems. In addition, SSS involves a strong bias-variance trade-off: using large scale factors yields quite stable results, but the larger the scale factors, the larger the bias error. If the advantages of SSS do not matter much in the application of interest, then other methods are often more efficient.
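A minimal sketch of scaling-based biasing, assuming a standard normal target and an illustrative scale factor $a = 4$ for estimating the tail probability $P(X \geq 4)$:

```python
# Illustrative sketch of biasing by scaling: simulate from the density of a*X,
# f_*(x) = (1/a) f(x/a), and weight by W(x) = f(x)/f_*(x) = a f(x)/f(x/a).
# Target: P(X >= t) for X ~ N(0, 1); the scale factor a = 4 is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
K, t, a = 100_000, 4.0, 4.0

x = a * rng.standard_normal(K)                       # samples from the scaled density f_*
w = a * stats.norm.pdf(x) / stats.norm.pdf(x / a)    # weighting function W(x)
p_hat = np.mean((x >= t) * w)

print(p_hat, stats.norm.sf(t))                       # IS estimate vs. exact value
```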

Translation


Another simple and effective biasing technique employs translation of the density function (and hence random variable) to place much of its probability mass in the rare event region. Translation does not suffer from a dimensionality effect and has been successfully used in several applications relating to simulation of digital communication systems. It often provides better simulation gains than scaling. In biasing by translation, the simulation density is given by

$$f_*(x) = f(x - c),$$

where $c$ is the amount of shift and is to be chosen to minimize the variance of the importance sampling estimator.
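A sketch of translation-based biasing, where the shift $c$ is selected by a small pilot grid search that minimizes the empirical second moment of the weighted indicator (equivalent to minimizing the variance, since the estimator is unbiased); the target, threshold, and grid are illustrative assumptions:

```python
# Illustrative sketch of biasing by translation with a data-driven shift:
# f_*(x) = f(x - c), with c chosen from a small pilot grid by minimizing the
# empirical second moment of 1(X >= t) * W(X) (the mean is fixed, so this
# minimizes the variance). Target, threshold, and grid are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
t, n_pilot = 4.0, 5_000

def second_moment(c, n):
    """Empirical second moment of the weighted indicator under f_*(x) = f(x - c)."""
    x = rng.normal(loc=c, size=n)
    w = stats.norm.pdf(x) / stats.norm.pdf(x, loc=c)
    return np.mean(((x >= t) * w) ** 2)

grid = np.linspace(2.0, 6.0, 9)
c_best = grid[np.argmin([second_moment(c, n_pilot) for c in grid])]

# Final estimate with the selected shift.
x = rng.normal(loc=c_best, size=100_000)
w = stats.norm.pdf(x) / stats.norm.pdf(x, loc=c_best)
print(c_best, np.mean((x >= t) * w), stats.norm.sf(t))
```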

Effects of system complexity


The fundamental problem with importance sampling is that designing good biased distributions becomes more complicated as the system complexity increases. Complex systems are systems with long memory, since complex processing of a small number of inputs is much easier to handle. This dimensionality or memory can cause problems in three ways: long memory (severe intersymbol interference), unknown memory (Viterbi decoders), and possibly infinite memory (adaptive equalizers).

In principle, the importance sampling ideas remain the same in these situations, but the design becomes much harder. A successful approach to combat this problem is essentially breaking down a simulation into several smaller, more sharply defined subproblems. Then importance sampling strategies are used to target each of the simpler subproblems. Examples of techniques to break the simulation down are conditioning and error-event simulation (EES) and regenerative simulation.

Evaluation of importance sampling


In order to identify successful importance sampling techniques, it is useful to be able to quantify the run-time savings due to the use of the importance sampling approach. The performance measure commonly used is $\sigma^2_{MC} / \sigma^2_{IS}$, and this can be interpreted as the speed-up factor by which the importance sampling estimator achieves the same precision as the MC estimator. This has to be computed empirically, since the estimator variances are not likely to be analytically available when the mean itself is intractable. Other useful concepts in quantifying an importance sampling estimator are the variance bounds and the notion of asymptotic efficiency. One related measure is the so-called effective sample size (ESS).[6]

Variance cost function


Variance is not the only possible cost function for a simulation, and other cost functions, such as the mean absolute deviation, are used in various statistical applications. Nevertheless, the variance is the primary cost function addressed in the literature, probably due to the use of variances in confidence intervals and in the performance measure $\sigma^2_{MC} / \sigma^2_{IS}$.

An associated issue is the fact that the ratio $\sigma^2_{MC} / \sigma^2_{IS}$ overestimates the run-time savings due to importance sampling, since it does not include the extra computing time required to compute the weight function. Hence, some people evaluate the net run-time improvement by various means. Perhaps a more serious overhead to importance sampling is the time taken to devise and program the technique and analytically derive the desired weight function.

Multiple and adaptive importance sampling


When different proposal distributions, $q_n(x)$, $n = 1, \ldots, N$, are jointly used for drawing the samples $x_1, \ldots, x_N$, different proper weighting functions can be employed (e.g., see [7][8][9][10]). In an adaptive setting, the proposal distributions, $q_{n,t}(x)$, $n = 1, \ldots, N$, $t = 1, \ldots, T$, are updated at each iteration $t$ of the adaptive importance sampling algorithm. Hence, since a population of proposal densities is used, several suitable combinations of sampling and weighting schemes can be employed.[11][12][13][14][15][16][17]
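One widely used proper weighting scheme is the deterministic-mixture (balance-heuristic) weight, in which a sample drawn from proposal $q_j$ receives the weight $p(x) / \big(\tfrac{1}{J}\sum_k q_k(x)\big)$, where $J$ is the number of proposals. The sketch below assumes a standard normal target and two illustrative Gaussian proposals:

```python
# Illustrative sketch of multiple importance sampling with deterministic-mixture
# ("balance heuristic") weights: a sample drawn from proposal q_j is weighted by
# p(x) / ((1/J) * sum_k q_k(x)). Target and proposals below are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
p = stats.norm(0.0, 1.0)                              # target density
proposals = [stats.norm(-2.0, 1.5), stats.norm(2.0, 1.5)]
n_per = 50_000

samples, weights = [], []
for q in proposals:
    x = q.rvs(size=n_per, random_state=rng)
    mixture = np.mean([qk.pdf(x) for qk in proposals], axis=0)  # (1/J) sum_k q_k(x)
    samples.append(x)
    weights.append(p.pdf(x) / mixture)

x = np.concatenate(samples)
w = np.concatenate(weights)

# Self-normalized estimate of E_p[X^2]; the true value is 1 for the chosen target.
print(np.sum(w * x**2) / np.sum(w))
```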

from Grokipedia
Importance sampling is a Monte Carlo technique for approximating the expectation of a function under a target distribution $p(x)$ by drawing samples from an alternative proposal distribution $q(x)$ that is easier to sample from, and then reweighting those samples using the importance weights $w(x) = p(x)/q(x)$. This technique allows estimation of integrals or expectations $\mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx$ as the weighted average $\hat{\mu}_f = \frac{1}{m} \sum_{j=1}^m w(x^{(j)})\, f(x^{(j)})$, where $x^{(j)} \sim q(x)$, thereby focusing computational effort on regions of high importance under the target distribution. The method is particularly valuable for variance reduction in simulations, as the variance of the estimator is minimized when $q(x)$ is chosen proportional to $|f(x)\, p(x)|$, potentially reducing it to zero in the optimal case.

Developed in the mid-20th century amid the rise of computational simulations, importance sampling emerged as a key strategy in Monte Carlo methods, which originated from work by Stanislaw Ulam and John von Neumann during the Manhattan Project at Los Alamos. An early influential application appeared in the 1955 paper by Marshall N. Rosenbluth and Arianna W. Rosenbluth, who used it to compute the average extension of molecular chains via Monte Carlo sampling on the MANIAC electronic computer, introducing weighted sampling to bias toward non-overlapping configurations in polymer simulations. Concurrently, John M. Hammersley and Keith W. Morton formalized aspects of the approach in their 1954 work "Poor Man's Monte Carlo," which emphasized efficient low-cost sampling techniques for lattice problems and integral estimation without requiring expensive uniform sampling.

In practice, importance sampling underpins numerous algorithms in statistics, physics, and machine learning, including sequential Monte Carlo methods like particle filters for state-space models and rare-event probability estimation, where events occur with low probability under the target but are more likely under the proposal. Its effectiveness hinges on the quality of the proposal distribution $q(x)$, which must have heavier tails than $p(x)$ to avoid infinite variance and support efficient sampling; poor choices can lead to high variance or degeneracy, prompting extensions like adaptive importance sampling. Today, it remains essential in Bayesian inference for posterior approximation and in high-dimensional integration problems, where direct sampling from complex targets is infeasible.

Fundamentals

Definition and Motivation

Monte Carlo integration is a fundamental technique for estimating expectations of the form $\mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx$, where $X \sim p(x)$, by generating independent samples from $p$ and averaging the function values. This method converges to the true expectation by the law of large numbers, but its efficiency is limited by the variance of the estimator, which is $\text{Var}_p(f(X))/n$ for $n$ samples. High variance arises particularly in cases involving rare events, where the probability mass is concentrated in low-probability tails, or peaked distributions, where most samples fall in regions contributing negligibly to the integral, leading to slow convergence and requiring excessively large $n$ for reliable estimates. Importance sampling addresses this inefficiency by reformulating the expectation using an alternative proposal distribution $q(x)$, from which samples are drawn more frequently in regions where $|f(x)\, p(x)|$ is large, thereby reducing the estimator's variance without introducing bias. The key insight is to overweight "important" contributions to the integral, allowing fewer samples to achieve the same precision as naive sampling from $p$. This technique is especially valuable in high-dimensional or complex settings where direct sampling from $p$ is impractical or inefficient.

The origins of importance sampling trace back to the 1940s and 1950s, developed amid early simulations in nuclear physics to model particle transport and attenuation problems at Los Alamos. Seminal work by Herman Kahn and others introduced variance reduction methods, with Kahn and Marshall's 1953 paper formalizing importance sampling as a strategy to minimize sample sizes in such computations by biasing samples toward physically relevant regions.

A simple numerical example illustrates the variance reduction: consider estimating $\mathbb{E}[g(X)]$ where $X \sim \mathcal{N}(0,1)$ and $g(x) = 10 \exp(-5(x-3)^4)$, a sharply peaked function centered far from the mean of the sampling distribution, mimicking the challenges posed by bimodal or multimodal densities. Using standard Monte Carlo with 10,000 samples from $\mathcal{N}(0,1)$ yields an estimate of approximately 0.0858 with standard deviation 0.0079. In contrast, importance sampling with a shifted proposal $Y \sim \mathcal{N}(3,1)$ and weights $g(y)\, f_X(y)/f_Y(y)$ produces an estimate of 0.0910 with standard deviation just 0.0012, demonstrating an over sixfold reduction in standard deviation by concentrating samples near the peak.
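A sketch reproducing this comparison (exact figures depend on the random seed and the particular draw):

```python
# Sketch reproducing the comparison described above; exact figures depend on the
# random seed and draw. Target: E[g(X)], X ~ N(0, 1), g(x) = 10 exp(-5 (x - 3)^4).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 10_000

def g(x):
    return 10.0 * np.exp(-5.0 * (x - 3.0) ** 4)

# Standard Monte Carlo: nearly all samples land where g is negligible.
x = rng.standard_normal(n)
plain = g(x)

# Importance sampling from the shifted proposal Y ~ N(3, 1), weighted by f_X / f_Y.
y = rng.normal(3.0, 1.0, size=n)
weighted = g(y) * stats.norm.pdf(y) / stats.norm.pdf(y, loc=3.0)

print(plain.mean(), plain.std(ddof=1) / np.sqrt(n))
print(weighted.mean(), weighted.std(ddof=1) / np.sqrt(n))
```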

Mathematical Formulation

Importance sampling provides a method to estimate expectations with respect to a target probability density $p(x)$ by drawing samples from an alternative proposal density $q(x)$. Consider the goal of approximating the expectation $\mu = \mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx$, where $f$ is an integrable function and $p$ is a probability density on a space $\mathcal{X}$. The standard Monte Carlo estimator for $\mu$ is given by $\hat{\mu}_{\text{MC}} = \frac{1}{N} \sum_{i=1}^N f(X_i)$, where the samples $X_i$ are independent and identically distributed as $X_i \sim p(x)$.

To derive the importance sampling estimator, apply a change of measure, which relies on the Radon-Nikodym derivative between the densities. Assuming $p$ is absolutely continuous with respect to $q$ (i.e., $q(x) = 0$ implies $p(x) = 0$), the expectation can be rewritten as
$$\mu = \int_{\mathcal{X}} f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\left[f(X)\, w(X)\right],$$
where the weight function is defined as $w(x) = p(x)/q(x)$. This identity holds provided that $q(x) > 0$ whenever $p(x) f(x) \neq 0$, ensuring the support condition for the weight to be well-defined and the estimator to be unbiased. The corresponding importance sampling estimator is then $\hat{\mu}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^N f(X_i)\, w(X_i)$, with samples $X_i \stackrel{\text{iid}}{\sim} q(x)$. This estimator is unbiased for $\mu$ under the support condition, and its consistency follows from the law of large numbers applied to the expectation under $q$.

In many applications, such as Bayesian inference, the target density $p(x)$ may be available only up to an unknown normalizing constant, so $p(x) = \gamma\, \pi(x)$ where $\gamma = 1/\int \pi(x)\, dx$ and $\pi(x)$ is the unnormalized density. In this case, the weights become $w(x) = \pi(x)/q(x)$, and the self-normalized importance sampling estimator for $\mu$ is
$$\tilde{\mu}_{\text{IS}} = \frac{\sum_{i=1}^N w(X_i)\, f(X_i)}{\sum_{i=1}^N w(X_i)},$$
with $X_i \stackrel{\text{iid}}{\sim} q(x)$. This estimator is consistent but biased, converging to $\mu$ as $N \to \infty$ under suitable moment conditions on $q$.
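A short sketch of the self-normalized estimator, assuming an unnormalized standard normal target $\pi(x) = \exp(-x^2/2)$ and a heavier-tailed Student-t proposal (both illustrative choices):

```python
# Illustrative sketch of self-normalized importance sampling for a target known
# only up to a constant: pi(x) = exp(-x^2 / 2) (unnormalized standard normal),
# with a heavier-tailed Student-t proposal. Both choices are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000

def pi(x):
    return np.exp(-0.5 * x**2)        # unnormalized target density

q = stats.t(df=3)                     # proposal with heavier tails than the target
x = q.rvs(size=n, random_state=rng)
w = pi(x) / q.pdf(x)                  # unnormalized weights pi / q

estimate = np.sum(w * x**2) / np.sum(w)   # self-normalized estimate of E[X^2]
print(estimate)                            # close to 1 for the standard normal target
```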

Theoretical Properties

Estimator Bias and Variance

The importance sampling estimator for the expectation $\mu = \mathbb{E}_p[f(X)]$ is defined as $\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^N f(X_i)\, w(X_i)$, where $X_i \overset{\text{i.i.d.}}{\sim} q$ and $w(x) = p(x)/q(x)$ denotes the importance weight, assuming $p$ and $q$ are probability densities. This estimator is unbiased provided that the support condition holds, meaning $q(x) > 0$ whenever $p(x) > 0$ (or equivalently, $p \ll q$). To see this, consider the expectation of a single term: $\mathbb{E}_q[f(X)\, w(X)] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \int f(x)\, p(x)\, dx = \mu$. By linearity of expectation, $\mathbb{E}_q[\hat{\mu}_N] = \mu$.

The variance of the estimator follows from the variance of the individual terms, scaled by $1/N$:
$$\text{Var}_q(\hat{\mu}_N) = \frac{1}{N}\left(\mathbb{E}_q[(f(X)\, w(X))^2] - \mu^2\right).$$
Substituting the weight, this simplifies to
$$\text{Var}_q(f(X)\, w(X)) = \mathbb{E}_p\!\left[f(X)^2\, \frac{p(X)}{q(X)}\right] - \mu^2.$$
This expression highlights the role of the mismatch between $p$ and $q$ in inflating variance, as the term $\frac{p(x)}{q(x)}$ amplifies regions where $q$ under-samples relative to $p$. For the special case $f \equiv 1$ (estimating the normalizing constant), the relative variance relates directly to the $\chi^2$-divergence: $\frac{\text{Var}_q(w(X))}{(\mathbb{E}_q[w(X)])^2} = \chi^2(p \,\|\, q)$, providing a measure of the mismatch between target and proposal. A practical diagnostic for estimator quality is the effective sample size (ESS), defined as $\text{ESS} = \frac{N}{1 + \text{cv}_w^2}$, where $\text{cv}_w = \sqrt{\frac{\text{Var}_q(w(X))}{(\mathbb{E}_q[w(X)])^2}}$ is the coefficient of variation of the importance weights.
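A sketch of the ESS diagnostic under an assumed setup (standard normal target, Student-t proposal; both illustrative choices):

```python
# Illustrative sketch of the effective sample size diagnostic ESS = N / (1 + cv_w^2).
# Weights come from an assumed setup: standard normal target, Student-t(3) proposal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
N = 10_000

q = stats.t(df=3)
x = q.rvs(size=N, random_state=rng)
w = stats.norm.pdf(x) / q.pdf(x)            # importance weights p / q

cv2 = w.var(ddof=1) / w.mean() ** 2         # squared coefficient of variation of weights
ess = N / (1.0 + cv2)
print(ess)                                  # noticeably below N when weights are uneven
```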