
Top-p sampling


Top-p sampling, also known as nucleus sampling, is a stochastic decoding strategy for generating sequences from autoregressive probabilistic models. It was originally proposed by Ari Holtzman and his colleagues in 2019 for natural language generation to address the issue of repetitive and nonsensical text generated by other common decoding methods like beam search. The technique has since been applied in other scientific fields, such as protein engineering and geophysics.

In top-p sampling, a probability threshold p is set, and the next item in a sequence is sampled only from the smallest possible set of high-probability candidates whose cumulative probability meets or exceeds p. This method adapts the size of the candidate pool based on the model's certainty, making it more flexible than top-k sampling, which samples from a fixed number of candidates. Due to its effectiveness, top-p sampling is a widely used technique in many large language model applications.

At each step of the text generation process, a language model calculates a probability distribution over its entire vocabulary for the next token. While simply picking the token with the highest probability (greedy search) or a limited set of high-probability sequences (beam search) is possible, these deterministic methods often produce text that is dull, repetitive, or nonsensical. Top-p sampling introduces randomness to avoid these issues while maintaining quality.

The core idea is to sample at each step from a smaller, more credible set of tokens called the nucleus. The nucleus contains the fewest top-ranked tokens whose combined (cumulative) probability meets or exceeds the threshold p. By sampling only from this dynamically sized group, the model can adapt to different situations. When the model is confident about the next token (e.g., one token has a very high probability), the nucleus will be small. When the model is uncertain (the probabilities are more evenly distributed), the nucleus will be larger, allowing for more diversity.
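This adaptivity can be illustrated with two toy distributions (the probability values below are made up for illustration):

```python
import numpy as np

def nucleus_size(probs, p):
    """Size of the smallest set of tokens whose cumulative probability reaches p."""
    sorted_probs = np.sort(np.asarray(probs, dtype=float))[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p)) + 1

# Confident model: one token dominates, so the nucleus is tiny.
print(nucleus_size([0.90, 0.04, 0.03, 0.02, 0.01], p=0.9))  # 1

# Uncertain model: a near-uniform distribution needs many tokens.
print(nucleus_size([0.22, 0.21, 0.20, 0.19, 0.18], p=0.9))  # 5
```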

The process at each step is as follows:

1. Sort the vocabulary's tokens in descending order of probability.
2. Select the smallest set of top-ranked tokens, the nucleus, whose cumulative probability is at least p.
3. Renormalize the probabilities of the tokens in the nucleus so they sum to 1.
4. Sample the next token from this renormalized distribution.

Formally, the nucleus $V^{(p)}$ is defined as the smallest set of tokens satisfying:

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \ge p$$

In this formula, $P(x \mid x_{1:i-1})$ represents the probability of a token $x$ given the preceding tokens $x_{1:i-1}$.
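One way to implement this procedure is sketched below in Python with NumPy (the function and parameter names are illustrative, not taken from any particular library):

```python
import numpy as np

def top_p_sample(probs, p, rng=None):
    """Sample a token index from `probs` via top-p (nucleus) sampling."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)

    # Rank tokens from most to least probable.
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]

    # Find the smallest prefix whose cumulative probability reaches p (the nucleus).
    cumulative = np.cumsum(sorted_probs)
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, probs.size)

    # Renormalize the nucleus so its probabilities sum to 1, then sample from it.
    nucleus = sorted_probs[:cutoff] / cumulative[cutoff - 1]
    return int(order[rng.choice(cutoff, p=nucleus)])
```

With a sharply peaked distribution the cutoff collapses to one or two tokens, while a flat distribution keeps most of the vocabulary, which is the adaptive behavior described above.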

Imagine at a certain step, a language model has a vocabulary of five words: `[the, a, cat, dog, eats]` and produces the following probabilities (illustrative values):

- `the`: 0.45
- `a`: 0.30
- `cat`: 0.15
- `dog`: 0.07
- `eats`: 0.03

If we set p = 0.90, the nucleus is built by accumulating probability from the top: `the` (0.45), then `a` (0.75), then `cat` (0.90). At `cat` the cumulative probability reaches 0.90, so the nucleus is `{the, a, cat}`, and `dog` and `eats` are discarded. The three remaining probabilities are renormalized (divided by 0.90), and the next word is sampled from that distribution.
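A quick check of this example in plain Python, using illustrative probability values for the five-word vocabulary (assumed here, since the exact numbers depend on the model):

```python
# Illustrative next-word probabilities for the five-word vocabulary.
probs = {"the": 0.45, "a": 0.30, "cat": 0.15, "dog": 0.07, "eats": 0.03}
p = 0.90

# Accumulate probability down the sorted list until the threshold is reached.
nucleus, total = [], 0.0
for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    nucleus.append(word)
    total += prob
    if total >= p:
        break

print(nucleus)  # ['the', 'a', 'cat']
```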
