Cross-entropy method
The cross-entropy (CE) method is a Monte Carlo method for importance sampling and optimization. It is applicable to both combinatorial and continuous problems, with either a static or noisy objective.
The method approximates the optimal importance sampling estimator by repeating two phases:[1]
- Draw a sample from a probability distribution.
- Minimize the cross-entropy between this distribution and a target distribution to produce a better sample in the next iteration.
Reuven Rubinstein developed the method in the context of rare-event simulation, where tiny probabilities must be estimated, for example in network reliability analysis, queueing models, or performance analysis of telecommunication systems. The method has also been applied to the traveling salesman, quadratic assignment, DNA sequence alignment, max-cut and buffer allocation problems.
Estimation via importance sampling
Consider the general problem of estimating the quantity

ℓ = E_u[H(X)] = ∫ H(x) f(x; u) dx,

where H is some performance function and f(x; u) is a member of some parametric family of distributions. Using importance sampling this quantity can be estimated as

ℓ̂ = (1/N) ∑_{i=1}^{N} H(X_i) f(X_i; u) / g(X_i),

where X_1, …, X_N is a random sample from g. For positive H, the theoretically optimal importance sampling density (PDF) is given by

g*(x) = H(x) f(x; u) / ℓ.

This, however, depends on the unknown ℓ. The CE method aims to approximate the optimal PDF by adaptively selecting members of the parametric family {f(·; v)} that are closest (in the Kullback–Leibler sense) to the optimal PDF g*.
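The estimator above can be illustrated with a minimal sketch: estimating the tail probability ℓ = P(X > 3) for X ~ N(0, 1) by sampling from a proposal shifted toward the rare region. The proposal g = N(3, 1), the sample size, and the seed are illustrative choices, not prescribed by the method.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling_estimate(n=100_000, seed=0):
    """Estimate l = P(X > 3) for X ~ N(0, 1) by drawing from g = N(3, 1)
    and reweighting each sample by the likelihood ratio f/g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(3.0, 1.0)      # sample from the proposal g
        if x > 3.0:                  # performance function H(x) = 1{x > 3}
            total += normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 3.0, 1.0)
    return total / n
```

Because the proposal concentrates samples where H contributes, the weighted estimate has a far smaller relative error than crude Monte Carlo with the same n, which would rely on only a handful of hits.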
Generic CE algorithm
[edit]- Choose initial parameter vector ; set t = 1.
- Generate a random sample from
- Solve for , where
- If convergence is reached then stop; otherwise, increase t by 1 and reiterate from step 2.
In several cases, the solution to step 3 can be found analytically. Situations in which this occurs are
- When f belongs to the natural exponential family
- When f is discrete with finite support
- When H(x) = 1{x ∈ A} and f(x; u) = f(x; v^(t−1)): then v^(t) corresponds to the maximum likelihood estimator based on those X_k ∈ A.
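The discrete finite-support case can be illustrated with an independent-Bernoulli family, for which the step-3 update has the closed form "per-bit frequency among the elite samples". The sketch below maximizes the number of ones in a bit string; the toy objective and all parameter values are illustrative, not taken from the source.

```python
import random

def ce_onemax(n_bits=20, n_samples=200, n_elite=20, iters=50, seed=0):
    """CE with an independent-Bernoulli family, maximizing the number of ones.
    For this discrete finite-support family the step-3 update is analytic:
    each probability p_j becomes the fraction of elite samples with bit j set."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                       # initial parameter vector
    for _ in range(iters):
        pop = [[1 if rng.random() < pj else 0 for pj in p]
               for _ in range(n_samples)]
        pop.sort(key=sum, reverse=True)      # performance = number of ones
        elites = pop[:n_elite]
        # Analytic CE update: per-bit elite frequency (the Bernoulli MLE).
        p = [sum(x[j] for x in elites) / n_elite for j in range(n_bits)]
    return p
```

After a few dozen iterations the parameter vector concentrates near the all-ones string, which is how CE tackles combinatorial problems such as max-cut with a Bernoulli parameterization.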
Continuous optimization—example
The same CE algorithm can be used for optimization, rather than estimation. Suppose the problem is to maximize some function S(x), for example, S(x) = exp(−(x − 2)²) + 0.8 exp(−(x + 2)²). To apply CE, one first considers the associated stochastic problem of estimating ℓ(γ) = P_θ(S(X) ≥ γ) for a given level γ and parametric family {f(·; θ)}, for example the 1-dimensional Gaussian distribution, parameterized by its mean μ_t and variance σ_t² (so θ = (μ, σ²) here). Hence, for a given γ, the goal is to find θ so that the Kullback–Leibler divergence D(1{S(x) ≥ γ} ‖ f_θ) is minimized. This is done by solving the sample version (stochastic counterpart) of the KL divergence minimization problem, as in step 3 above.

It turns out that the parameters minimizing the stochastic counterpart for this choice of target distribution and parametric family are the sample mean and sample variance of the elite samples, i.e. those samples whose objective function value is ≥ γ. The worst of the elite samples then serves as the level parameter for the next iteration. This yields the following randomized algorithm, which coincides with the so-called Estimation of Multivariate Normal Algorithm (EMNA), an estimation of distribution algorithm.
Pseudocode
// Initialize parameters
μ := −6
σ2 := 100
t := 0
maxits := 100
N := 100
Ne := 10
ε := 0.001   // convergence tolerance on the variance
// While maxits not exceeded and not converged
while t < maxits and σ2 > ε do
    // Obtain N samples from current sampling distribution
    X := SampleGaussian(μ, σ2, N)
    // Evaluate objective function at sampled points
    S := exp(−(X − 2) ^ 2) + 0.8 exp(−(X + 2) ^ 2)
    // Sort X by objective function values in descending order
    X := sort(X, S)
    // Update parameters of sampling distribution via elite samples
    μ := mean(X(1:Ne))
    σ2 := var(X(1:Ne))
    t := t + 1
// Return mean of final sampling distribution as solution
return μ
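The pseudocode above translates directly into Python; this minimal sketch keeps the same constants and assumes ε = 10⁻³ as the convergence tolerance (the pseudocode leaves ε unspecified).

```python
import math
import random

def cross_entropy_maximize(mu=-6.0, sigma2=100.0, maxits=100, n=100,
                           n_elite=10, eps=1e-3, seed=0):
    """1-D CE maximization of S(x) = exp(-(x-2)^2) + 0.8*exp(-(x+2)^2),
    following the pseudocode above (EMNA-style mean/variance updates)."""
    rng = random.Random(seed)
    S = lambda x: math.exp(-(x - 2) ** 2) + 0.8 * math.exp(-(x + 2) ** 2)
    t = 0
    while t < maxits and sigma2 > eps:
        # Draw N samples from the current Gaussian sampling distribution.
        xs = [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]
        xs.sort(key=S, reverse=True)          # best samples first
        elites = xs[:n_elite]
        mu = sum(elites) / n_elite            # sample mean of the elites
        sigma2 = sum((x - mu) ** 2 for x in elites) / n_elite  # biased variance
        t += 1
    return mu
```

Because the algorithm is randomized, it typically converges to the global maximizer near x = 2 but can occasionally settle on the local mode near x = −2.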
Journal papers
[edit]- De Boer, P.-T., Kroese, D.P., Mannor, S. and Rubinstein, R.Y. (2005). A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134 (1), 19–67.[1]
- Rubinstein, R.Y. (1997). Optimization of Computer Simulation Models with Rare Events, European Journal of Operational Research, 99, 89–112.
References
[edit]- ^ Rubinstein, R.Y. and Kroese, D.P. (2004), The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer-Verlag, New York ISBN 978-0-387-21240-1.
Cross-entropy method
Fundamentals
Importance Sampling Basics
Rare event simulation refers to the task of estimating small probabilities, such as ℓ = P(X ∈ A), where A denotes a rare event with probability ℓ ≪ 1, using Monte Carlo methods.[5] In direct or crude Monte Carlo estimation, one generates N independent samples under the reference probability measure and approximates ℓ by the empirical proportion of samples falling into A, yielding an estimator with variance ℓ(1 − ℓ)/N.[5] This results in a high relative error of approximately 1/√(Nℓ), which grows large as ℓ becomes small, necessitating an impractically large N to achieve reasonable precision and rendering the approach computationally inefficient for rare events.[5]

Importance sampling addresses these challenges by shifting the sampling distribution to a tilted measure Q that places higher probability on the rare event region, thereby increasing the likelihood of observing relevant samples while correcting for the bias through importance weights.[6] Mathematically, to estimate the expectation ℓ = E_P[H(X)] for a non-negative function H (often an indicator 1{X ∈ A} for rare events), one draws samples X_i ~ Q for i = 1, …, N and computes the unbiased estimator

ℓ̂ = (1/N) ∑_{i=1}^{N} H(X_i) W(X_i),

where the weight function W is the Radon–Nikodym derivative dP/dQ.[7] This estimator has expectation ℓ under Q, but its variance depends on the choice of Q; a well-chosen Q reduces variance by concentrating samples where H contributes most to the integral.[7] The optimal importance sampling distribution that minimizes the variance of ℓ̂ to zero (in the limit) is given by

q*(x) = H(x) p(x) / ℓ,

which proportionally weights the reference density p by the integrand H.[8] However, computing q* requires knowledge of ℓ, the very quantity being estimated, so practical methods approximate it using parametric families or iterative procedures.[8]

The technique originated in the 1950s within statistics and physics, particularly for estimating rare events in neutron transport simulations, as pioneered in the work of Kahn and Harris (1951) on particle transmission via random sampling.[9] This foundational approach emphasized biasing sampling distributions to improve efficiency in Monte Carlo computations for low-probability outcomes.[10]

Cross-Entropy in Optimization Context
In the context of the cross-entropy method (CEM), cross-entropy serves as a fundamental information-theoretic measure for aligning probability distributions to facilitate optimization, particularly in rare event simulation and stochastic approximation tasks. The cross-entropy between two probability density functions g and h, denoted H(g, h), is defined as

H(g, h) = −E_g[log h(X)] = −∫ g(x) log h(x) dx,

which quantifies the expected number of extra bits required to code samples from g using a code optimized for h. This relates directly to the Kullback-Leibler (KL) divergence D_KL(g ‖ h), an asymmetric measure of distributional discrepancy, via the identity

H(g, h) = H(g) + D_KL(g ‖ h),

where H(g) is the entropy of g, a constant with respect to h. Minimizing D_KL(g ‖ h) over h thus equates to minimizing the cross-entropy H(g, h), as the entropy term does not affect the optimization.[4][1]

Within CEM, the method leverages this framework to approximate the optimal importance sampling distribution for estimating rare events or optimizing objective functions under a reference measure f(·; u). Specifically, for rare event probabilities defined by a performance function S (e.g., the event {S(x) ≥ γ} for some high threshold γ), the goal is to minimize the KL divergence D_KL(g* ‖ f(·; v)) over a parameterized family of distributions {f(·; v)}, where g* is the tilted distribution concentrating mass on the rare event set {S(x) ≥ γ}, given by

g*(x) = 1{S(x) ≥ γ} f(x; u) / ℓ.

This leads to the optimization program

v* = argmax_v E_u[1{S(X) ≥ γ} log f(X; v)],

which seeks to tilt f toward regions of high performance under S, thereby reducing variance in Monte Carlo estimates. The optimal distribution would ideally place all mass on the rare event, but practical parameterization (e.g., via exponential families) approximates this by solving the cross-entropy minimization.[4][1]

The derivation of the CEM objective from this cross-entropy minimization proceeds via importance sampling approximations. Since g* is unknown, samples X_1, …, X_N are drawn from an initial f(·; v_0), and the elite set {X_i : S(X_i) ≥ γ̂} is selected, where γ̂ estimates the threshold γ. The KL minimization then reduces to maximizing the empirical log-likelihood term,

v_{t+1} = argmax_v (1/N) ∑_{i : S(X_i) ≥ γ̂} log f(X_i; v),

which fits f(·; v) to the elite samples by maximizing their average log-likelihood. This step transforms the rare event estimation or optimization problem into a sequence of distribution-fitting tasks akin to supervised learning, iteratively refining f(·; v) toward the optimal tilting.[4][1]

Core Algorithm
Generic CE Procedure
The cross-entropy method (CEM) is an iterative, adaptive algorithm designed for estimating rare event probabilities and solving optimization problems by progressively refining a parametric reference distribution to focus on regions of interest.[11] As a model-based optimization technique, CEM evolves the distribution parameters over iterations to concentrate sampling efforts on high-reward or rare-event-prone areas, thereby reducing variance in estimates and improving efficiency in simulation-based computations.[12] This approach is particularly effective for problems where direct sampling from the original distribution is inefficient due to the rarity of target events or the complexity of the objective landscape.[11]

The generic procedure begins with the initialization of reference distribution parameters, denoted v_0, which define an initial parametric family (e.g., Gaussian or Bernoulli distributions) that approximates the target behavior.[11] In each iteration t, N samples X_1, …, X_N are generated from the current distribution f(·; v_t).[11] These samples are then evaluated using a performance function S, such as an indicator for rare events (equal to 1 if the event occurs and 0 otherwise) or a reward metric for optimization.[12] The elite samples (typically the top-performing fraction) are selected based on a threshold γ_t determined by the elite fraction ρ, which is commonly set between 0.01 and 0.1 to balance bias introduction against variance reduction in the estimates.[11] The parameters are then updated to v_{t+1} by minimizing the cross-entropy between the empirical distribution induced by the elite samples and f(·; v), a process equivalent to minimizing the Kullback-Leibler divergence from the elite empirical measure to f(·; v) (or maximizing the likelihood of the elite samples under f(·; v)).[11] This update shifts the reference distribution toward the elites, enhancing the likelihood of sampling desirable outcomes in subsequent iterations.[12] The iteration continues, generating new samples, selecting elites, and updating parameters until a convergence criterion is met, such as stabilization of the elite threshold or a fixed number of iterations.[11]

For rare event estimation, the final probability is computed using the converged distribution and importance sampling weights derived from the likelihood ratios with respect to the original distribution.[11] Empirical guidelines recommend a per-iteration sample size N large enough to ensure reliable elite selection and parameter estimates without excessive computational cost.[11] Under mild regularity conditions, such as the existence of finite moments and identifiability of the optimal parameters, the CEM exhibits asymptotic consistency, meaning the estimated rare event probability converges to the true value as N and the number of iterations increase.[11] The elite fraction ρ plays a crucial role in this convergence by controlling the trade-off: smaller values increase adaptability but risk overfitting to noise, while larger values promote stability at the expense of slower progress toward rare regions.[12]

Parameter Update Mechanism
In the cross-entropy method (CEM), the parameter update mechanism involves adjusting the parameters of the importance sampling distribution at each iteration to minimize the cross-entropy distance to an optimal reference distribution, which is approximated using elite samples selected from the generated samples based on their performance. The update rule is formulated as

v_{t+1} = argmax_v (1/N_e) ∑_{X ∈ E_t} log f(X; v),

where N_e denotes the number of elite samples and E_t the elite set, and this optimization maximizes the average log-likelihood of the elite samples under the parameterized distribution f(·; v). This approach was introduced in Rubinstein's formulation, which highlighted its equivalence to maximum likelihood estimation (MLE) on the elite samples, thereby enabling the method's applicability to a wide range of parametric families beyond simple cases.

For common parametric families belonging to the natural exponential family, such as the multivariate normal distribution N(μ, Σ), the update admits explicit closed-form solutions that leverage sample moments of the elite set. Specifically, the mean parameter is updated as the sample mean of the elites,

μ_{t+1} = (1/N_e) ∑_{X ∈ E_t} X,

while the covariance matrix is updated as the sample covariance (or a biased variant for improved numerical stability),

Σ_{t+1} = (1/N_e) ∑_{X ∈ E_t} (X − μ_{t+1})(X − μ_{t+1})^T.

These updates ensure that the sampling distribution progressively concentrates around high-performing regions of the search space.

In cases where the parameterized family lacks closed-form MLE solutions, such as non-parametric densities or complex distributions, the optimization of the objective

max_v ∑_{X ∈ E_t} log f(X; v)

is performed using stochastic approximation techniques or gradient-based methods like stochastic gradient ascent on the log-likelihood. This numerical approach maintains the method's adaptability while approximating the argmax through iterative refinements.

Applications and Examples
Continuous Optimization Case
In the continuous optimization case, the cross-entropy method (CEM) is adapted by treating the objective function S directly as the performance measure, without the indicator function used in rare-event simulation. Instead of focusing on rare events, elite samples are selected as the top ⌈αN⌉ samples ranked by their S values, where N is the sample size and α is the elite fraction, typically 0.1. This shifts the sampling distribution toward regions of higher performance through iterative updates via maximum likelihood estimation on a parametric family, such as multivariate Gaussians.[13][14]

A representative example involves maximizing multimodal benchmark functions such as the Ackley or Rosenbrock function in ℝ^d, each negated so that the usual minimization benchmark becomes a maximization problem. The negated Ackley function, defined as

S(x) = 20 exp(−0.2 √((1/d) ∑_{i=1}^{d} x_i²)) + exp((1/d) ∑_{i=1}^{d} cos(2π x_i)) − 20 − e,

features multiple local maxima surrounding a global maximum at the origin, while the negated Rosenbrock function,

S(x) = −∑_{i=1}^{d−1} [100 (x_{i+1} − x_i²)² + (1 − x_i)²],

forms a narrow parabolic valley leading to the global maximum at (1, …, 1). Initialization uses samples from N(0, I), with typical parameters N = 1000, α = 0.1, and 50-100 iterations to concentrate the distribution near promising regions.[14]

The procedure follows these steps:

Initialize: Set μ ← 0, Σ ← I (d × d identity matrix), max_iter ← 100, N ← 1000, α ← 0.1
For k = 1 to max_iter:
    Generate X_1, ..., X_N ∼ N(μ, Σ)
    Evaluate S(X_j) for j = 1 to N
    Sort indices such that S(X_{(1)}) ≥ ... ≥ S(X_{(N)})
    Select elites: E = {X_{(1)}, ..., X_{(⌈αN⌉)}}
    Update μ ← (1/|E|) ∑_{X ∈ E} X (sample mean of elites)
    Update Σ ← (1/|E|) ∑_{X ∈ E} (X - μ)(X - μ)^T (biased sample covariance of elites)
    If convergence criterion met (e.g., small change in μ or Σ), stop
Return μ as approximate maximizer
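The procedure above can be sketched in runnable form. The sketch below simplifies the update to an axis-aligned Gaussian (diagonal covariance rather than the full matrix Σ) and applies it to the negated 2-D Rosenbrock function; the population size, elite fraction, and the small variance floor are illustrative choices.

```python
import math
import random

def neg_rosenbrock(x, y):
    """Negated 2-D Rosenbrock: global maximum 0 at (1, 1)."""
    return -(100.0 * (y - x * x) ** 2 + (1.0 - x) ** 2)

def cem_rosenbrock(n=1000, alpha=0.1, iters=100, seed=0):
    """CE maximization with an axis-aligned Gaussian, a diagonal-covariance
    simplification of the full-covariance pseudocode above."""
    rng = random.Random(seed)
    mu = [0.0, 0.0]
    var = [1.0, 1.0]
    n_elite = max(1, int(alpha * n))
    for _ in range(iters):
        # Sample the population from the current Gaussian.
        pts = [(rng.gauss(mu[0], math.sqrt(var[0])),
                rng.gauss(mu[1], math.sqrt(var[1]))) for _ in range(n)]
        pts.sort(key=lambda p: neg_rosenbrock(*p), reverse=True)
        elites = pts[:n_elite]
        # Refit mean and (biased) per-coordinate variance to the elites;
        # the tiny floor keeps the variance from collapsing to exactly zero.
        for i in range(2):
            mu[i] = sum(p[i] for p in elites) / n_elite
            var[i] = sum((p[i] - mu[i]) ** 2 for p in elites) / n_elite + 1e-6
    return mu
```

With a diagonal covariance the search can stall partway along the Rosenbrock valley before reaching (1, 1); fitting the full covariance, or injecting extra variance between iterations (smoothing), is the usual remedy.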