Scoring rule
from Wikipedia
Visualization of the expected score under various predictions from some common scoring functions. Dashed black line: forecaster's true belief, red: linear, orange: spherical, purple: quadratic, green: log.

In decision theory, a scoring rule[1] provides evaluation metrics for probabilistic predictions or forecasts. While "regular" loss functions (such as mean squared error) assign a goodness-of-fit score to a predicted value and an observed value, scoring rules assign such a score to a predicted probability distribution and an observed value. On the other hand, a scoring function[2] provides a summary measure for the evaluation of point predictions, i.e. one predicts a property or functional T(F), like the expectation or the median.

The average logarithmic score of 10 points i.i.d. sampled from a standard normal distribution (blue histogram), evaluated on a variety of distributions (red line). Although not necessarily true for individual samples, on average, a proper scoring rule will give the lowest score if the predicted distribution matches the data distribution.
A calibration curve allows one to judge how well model predictions are calibrated, by comparing the predicted quantiles to the observed quantiles. Blue is the best calibrated model; see calibration (statistics).

Scoring rules answer the question "how good is a predicted probability distribution compared to an observation?" Scoring rules that are (strictly) proper are proven to have the lowest expected score if the predicted distribution equals the underlying distribution of the target variable. Although the score of an individual observation may still be lower for an "incorrect" prediction, predicting the correct distribution minimizes the score in expectation.

Scoring rules and scoring functions are often used as "cost functions" or "loss functions" of probabilistic forecasting models. They are evaluated as the empirical mean over a given sample, the "score". Scores of different predictions or models can then be compared to conclude which model is best. For example, consider a model that predicts (based on an input x) a mean \mu(x) and standard deviation \sigma(x). Together, those variables define a Gaussian distribution \mathcal{N}(\mu(x), \sigma^2(x)), in essence predicting the target variable as a probability distribution. A common interpretation of probabilistic models is that they aim to quantify their own predictive uncertainty. In this example, an observed target variable y is then compared to the predicted distribution F = \mathcal{N}(\mu(x), \sigma^2(x)) and assigned a score S(F, y). When training on a scoring rule, it should "teach" a probabilistic model to predict when its uncertainty is low and when its uncertainty is high, and it should result in calibrated predictions while minimizing the predictive uncertainty.
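
As a minimal illustration of this setup (not part of the original article, and using hypothetical numbers), the following Python sketch scores a Gaussian prediction with a negatively oriented logarithmic score, i.e. the negative log density of the predicted distribution at the observation; lower is better.

import math

def gaussian_log_score(mu, sigma, y):
    """Negative log density of N(mu, sigma^2) at y; smaller values are better."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# A sharp, accurate prediction scores better than a vague or a wrong one.
print(gaussian_log_score(mu=2.0, sigma=0.5, y=2.1))  # sharp and accurate: ~0.25
print(gaussian_log_score(mu=2.0, sigma=3.0, y=2.1))  # accurate but underconfident: ~2.02
print(gaussian_log_score(mu=2.0, sigma=0.5, y=4.0))  # sharp but wrong: ~8.23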

Although the example given concerns the probabilistic forecasting of a real valued target variable, a variety of different scoring rules have been designed with different target variables in mind. Scoring rules exist for binary and categorical probabilistic classification, as well as for univariate and multivariate probabilistic regression.

Definitions


Consider a sample space \Omega, a \sigma-algebra \mathcal{A} of subsets of \Omega, and a convex class \mathcal{F} of probability measures on (\Omega, \mathcal{A}). A function defined on \Omega and taking values in the extended real line \overline{\mathbb{R}} = [-\infty, \infty] is \mathcal{F}-quasi-integrable if it is measurable with respect to \mathcal{A} and quasi-integrable with respect to all F \in \mathcal{F}.

Probabilistic forecast


A probabilistic forecast is any probability measure F \in \mathcal{F}, i.e. a distribution over potential future observations.

Scoring rule


A scoring rule is any extended real-valued function S : \mathcal{F} \times \Omega \to \overline{\mathbb{R}} such that S(F, \cdot) is \mathcal{F}-quasi-integrable for all F \in \mathcal{F}. Here S(F, y) represents the loss or penalty when the forecast F is issued and the observation y materializes.

Point forecast


A point forecast is a functional T, i.e. a potentially set-valued mapping F \mapsto T(F) \subseteq \Omega.

Scoring function


A scoring function is any real-valued function S(x, y), where S(x, y) represents the loss or penalty when the point forecast x is issued and the observation y materializes.

Orientation


Scoring rules and scoring functions are negatively (positively) oriented if smaller (larger) values indicate a better forecast. Here we adhere to the negative orientation, hence the association with "loss".

Expected score


We write S(F, G) = \mathbb{E}_{Y \sim G}[S(F, Y)] for the expected score of the predicted distribution F when observations Y are sampled from the distribution G.

Sample average score


Many probabilistic forecasting models are trained via the sample average score, in which a set of predicted distributions F_1, \dots, F_n is evaluated against a set of observations y_1, \dots, y_n, giving the average \frac{1}{n} \sum_{i=1}^{n} S(F_i, y_i).

Propriety and consistency


Strictly proper scoring rules and strictly consistent scoring functions encourage honest forecasts by maximization of the expected reward: if a forecaster is given a reward of -S(F, y) when y realizes (e.g. y = rain), then the highest expected reward (lowest expected score) is obtained by reporting the true probability distribution.[1]

Proper scoring rules


A scoring rule S is proper relative to \mathcal{F} if (assuming negative orientation) its expected score is minimized when the forecasted distribution matches the distribution of the observation:

S(G, G) \leq S(F, G) \quad \text{for all } F, G \in \mathcal{F}.

It is strictly proper if the above inequality holds with equality if and only if F = G.

Consistent scoring functions


A scoring function S is consistent for the functional T relative to the class \mathcal{F} if

\mathbb{E}_{Y \sim F}[S(t, Y)] \leq \mathbb{E}_{Y \sim F}[S(x, Y)]

for all F \in \mathcal{F}, all t \in T(F) and all point forecasts x.

It is strictly consistent if it is consistent and equality in the above inequality implies that x \in T(F).

Example application of scoring rules

The logarithmic rule

An example of probabilistic forecasting is in meteorology where a weather forecaster may give the probability of rain on the next day. One could note the number of times that a 25% probability was quoted, over a long period, and compare this with the actual proportion of times that rain fell. If the actual percentage was substantially different from the stated probability we say that the forecaster is poorly calibrated. A poorly calibrated forecaster might be encouraged to do better by a bonus system. A bonus system designed around a proper scoring rule will incentivize the forecaster to report probabilities equal to his personal beliefs.[3]

In addition to the simple case of a binary decision, such as assigning probabilities to 'rain' or 'no rain', scoring rules may be used for multiple classes, such as 'rain', 'snow', or 'clear', or continuous responses like the amount of rain per day.

The image shows an example of a scoring rule, the logarithmic scoring rule, as a function of the probability reported for the event that actually occurred. One way to use this rule would be as a cost based on the probability that a forecaster or algorithm assigns, then checking to see which event actually occurs.

Beyond serving as evaluation metrics, scoring rules can also be used directly as loss functions to construct estimators.[4]

Examples of proper scoring rules


There are an infinite number of scoring rules, including entire parameterized families of strictly proper scoring rules. The ones shown below are simply popular examples.

Categorical variables


For a categorical response variable with m mutually exclusive events, y \in \{1, \dots, m\}, a probabilistic forecaster or algorithm will return a probability vector \mathbf{p} = (p_1, \dots, p_m) with a probability for each of the m outcomes.

Logarithmic score

Expected value of the logarithmic rule. When Event 1 is expected to occur with probability 0.8, the blue line is described by the function 0.8\ln(x) + 0.2\ln(1 - x).

The logarithmic scoring rule

L(\mathbf{p}, i) = \ln(p_i)

is a local strictly proper scoring rule, where p_i is the probability assigned to the event i that actually occurred. This is also the negative of surprisal, which is commonly used as a scoring criterion in Bayesian inference; the goal is to minimize expected surprise. This scoring rule has strong foundations in information theory.

Here, the score is calculated as the logarithm of the probability estimate for the actual outcome. That is, a prediction of 80% that correctly proved true would receive a score of ln(0.8) = −0.22. This same prediction also assigns 20% likelihood to the opposite case, and so if the prediction proves false, it would receive a score based on the 20%: ln(0.2) = −1.6. The goal of a forecaster is to maximize the score, and −0.22 is indeed larger than −1.6.

If one treats the truth or falsity of the prediction as a variable x with value 1 or 0 respectively, and the expressed probability as p, then one can write the logarithmic scoring rule as x ln(p) + (1 − x) ln(1 − p). Note that any logarithmic base may be used, since strictly proper scoring rules remain strictly proper under linear transformation. That is:

L_b(\mathbf{p}, i) = \log_b(p_i) is strictly proper for all bases b > 1.
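
The following short Python sketch (illustrative only, not from the original article) reproduces the binary logarithmic score just described, including the values −0.22 and −1.6 quoted above.

import math

def binary_log_score(x, p):
    """x*ln(p) + (1 - x)*ln(1 - p); x is 1 if the event occurred, p its forecast probability."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)

print(round(binary_log_score(1, 0.8), 2))  # -0.22: the 80% prediction proved true
print(round(binary_log_score(0, 0.8), 2))  # -1.61: the 80% prediction proved false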

Brier/Quadratic score


The quadratic scoring rule is a strictly proper scoring rule

Q(\mathbf{p}, i) = 2 p_i - \sum_{j=1}^{m} p_j^2,

where p_i is the probability assigned to the correct answer and m is the number of classes.

The Brier score, originally proposed by Glenn W. Brier in 1950,[5] can be obtained by an affine transform from the quadratic scoring rule:

B(\mathbf{p}, i) = \sum_{j=1}^{m} (y_j - p_j)^2,

where y_j = 1 when the j'th event is correct and y_j = 0 otherwise, and m is the number of classes.

An important difference between these two rules is that a forecaster should strive to maximize the quadratic score Q yet minimize the Brier score B. This is due to a negative sign in the affine transformation between them.
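
A small Python sketch (hypothetical forecast values, assuming the formulas reconstructed above) shows the quadratic score Q and the Brier score B on the same categorical forecast and checks the affine relation B = 1 − Q, which explains the opposite optimization senses.

import numpy as np

def quadratic_score(p, i):
    """Q(p, i) = 2*p_i - sum_j p_j^2 (to be maximized)."""
    return 2 * p[i] - np.sum(p ** 2)

def brier_score(p, i):
    """B(p, i) = sum_j (y_j - p_j)^2 with y the one-hot outcome (to be minimized)."""
    y = np.zeros_like(p)
    y[i] = 1.0
    return float(np.sum((y - p) ** 2))

p = np.array([0.7, 0.2, 0.1])
print(quadratic_score(p, 0), brier_score(p, 0))  # correct class likely: high Q, low B
print(quadratic_score(p, 2), brier_score(p, 2))  # correct class unlikely: low Q, high B
print(np.isclose(brier_score(p, 0), 1 - quadratic_score(p, 0)))  # affine relation B = 1 - Q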

Spherical score


The spherical scoring rule is also a strictly proper scoring rule:

S(\mathbf{p}, i) = \frac{p_i}{\sqrt{\sum_{j=1}^{m} p_j^2}},

where p_i is the probability assigned to the correct class.

Ranked Probability Score


The ranked probability score[6] (RPS) is a strictly proper scoring rule that can be expressed as:

\mathrm{RPS}(\mathbf{p}, i) = \sum_{k=1}^{m} \left( \sum_{j=1}^{k} (p_j - y_j) \right)^2

where y_j = 1 when the j'th event is correct and y_j = 0 otherwise, and m is the number of classes. Unlike the other scoring rules above, the ranked probability score takes the distance between classes into account, i.e. classes 1 and 2 are considered closer than classes 1 and 3. The score assigns better scores to probabilistic forecasts with high probabilities assigned to classes close to the correct class. For example, when considering the probabilistic forecasts \mathbf{p}_A = (0.5, 0.5, 0) and \mathbf{p}_B = (0.5, 0, 0.5) with the first class being correct, we find that \mathrm{RPS}(\mathbf{p}_A, 1) = 0.25, while \mathrm{RPS}(\mathbf{p}_B, 1) = 0.5, despite both probabilistic forecasts assigning identical probability to the correct class.
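
The following Python sketch (not from the original article) computes the ranked probability score from cumulative probabilities and reproduces the example above, where the first of the three classes is the correct one.

import numpy as np

def ranked_probability_score(p, i):
    """Sum of squared differences between cumulative forecast and cumulative outcome."""
    y = np.zeros_like(p)
    y[i] = 1.0
    return float(np.sum((np.cumsum(p) - np.cumsum(y)) ** 2))

print(ranked_probability_score(np.array([0.5, 0.5, 0.0]), 0))  # 0.25: mass close to the correct class
print(ranked_probability_score(np.array([0.5, 0.0, 0.5]), 0))  # 0.5: mass far from the correct class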

Comparison of categorical strictly proper scoring rules


Shown below on the left is a graphical comparison of the Logarithmic, Quadratic, and Spherical scoring rules for a binary classification problem. The x-axis indicates the reported probability for the event that actually occurred.

It is important to note that each of the scores has a different magnitude and location. The magnitude differences are not relevant, however, as scores remain proper under affine transformation. Therefore, to compare different scores it is necessary to move them to a common scale. A reasonable choice of normalization is shown in the picture, where all scores intersect the points (0.5, 0) and (1, 1). This ensures that they yield 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution. All normalized scores below also yield 1 when the true class is assigned a probability of 1.

Score of a binary classification for the true class showing logarithmic (blue), spherical (green), and quadratic (red)
Normalized score of a binary classification for the true class showing logarithmic (blue), spherical (green), and quadratic (red)

Univariate continuous variables


The scoring rules listed below aim to evaluate probabilistic predictions when the predicted distributions are univariate continuous probability distributions, i.e. the predicted distributions are defined over a univariate target variable y and have a probability density function f.

Logarithmic score for continuous variables


The logarithmic score for continuous variables is a local strictly proper scoring rule. It is defined as

\mathrm{LogS}(F, y) = -\log f(y),

where f denotes the probability density function of the predicted distribution F. The logarithmic score for continuous variables has strong ties to maximum likelihood estimation. However, in many applications, the continuous ranked probability score is often preferred over the logarithmic score, as the logarithmic score can be heavily influenced by slight deviations in the tail densities of forecasted distributions.[7]

Continuous ranked probability score

Illustration of the continuous ranked probability score (CRPS). Given a sample y and a predicted cumulative distribution F, the CRPS is given by computing the difference between the curves at each point x of the support, squaring it and integrating it over the whole support.

The continuous ranked probability score (CRPS)[8] is a strictly proper scoring rule much used in meteorology. It is defined as

\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - H(x - y) \right)^2 \, dx,

where F is the cumulative distribution function of the forecasted distribution, H is the Heaviside step function and y is the observation. For distributions with finite first moment, the continuous ranked probability score can be written as:[1]

\mathrm{CRPS}(F, y) = \mathbb{E}_{X \sim F} |X - y| - \frac{1}{2} \mathbb{E}_{X, X' \sim F} |X - X'|,

where X and X' are independent random variables sampled from the distribution F. This is the energy form of the CRPS and opens the door to estimating the CRPS via Monte Carlo sampling (by approximating the expectation values).
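
As a rough illustration of the Monte Carlo route just mentioned (a sketch with a hypothetical Gaussian forecast, not from the original article), the energy form can be estimated from two independent sets of samples drawn from the forecast distribution:

import numpy as np

rng = np.random.default_rng(0)

def crps_monte_carlo(x, x_prime, y):
    """Estimate CRPS(F, y) = E|X - y| - 0.5*E|X - X'| from samples of F."""
    return float(np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(x - x_prime)))

x = rng.normal(loc=0.0, scale=1.0, size=10_000)        # samples from the forecast F
x_prime = rng.normal(loc=0.0, scale=1.0, size=10_000)  # second, independent sample set
print(crps_monte_carlo(x, x_prime, y=0.0))  # ~0.23: observation is typical under F
print(crps_monte_carlo(x, x_prime, y=3.0))  # much larger: observation lies in the tail of F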

Furthermore, when the cumulative distribution function F is continuous, the continuous ranked probability score can also be written as[9]

\mathrm{CRPS}(F, y) = \mathbb{E}_{X \sim F} |X - y| + \mathbb{E}_{X \sim F}[X] - 2\, \mathbb{E}_{X \sim F}[X F(X)].

The continuous ranked probability score can be seen both as a continuous extension of the ranked probability score and as a generalization of quantile regression. The continuous ranked probability score over the empirical distribution of an ordered set of n points (i.e. every point has probability 1/n of occurring) is equal to twice the mean quantile loss applied to those points with evenly spread quantile levels.[10]

For many popular families of distributions, closed-form expressions for the continuous ranked probability score have been derived. The continuous ranked probability score has been used as a loss function for artificial neural networks, in which weather forecasts are postprocessed to a Gaussian probability distribution.[11][12]

CRPS was also adapted to survival analysis to cover censored events.[13]

The CRPS is also known as the Cramér–von Mises distance and can be seen as an improvement on the Wasserstein distance (often used in machine learning); furthermore, the Cramér distance has performed better in ordinal regression than the KL divergence or the Wasserstein metric.[14]

While CRPS is widely used for evaluating probabilistic forecasts, it has critical theoretical limitations. It has been shown that CRPS can produce systematically misleading evaluations by favoring probabilistic forecasts whose medians are close to the observed outcome, regardless of the actual probability assigned to that region, potentially resulting in higher scores for forecasts that allocate negligible (or even zero) probability mass to the true outcome. Furthermore, CRPS is not invariant under smooth transformations of the forecast variable, and its ranking of forecast systems may reverse under such transformations, raising concerns about its consistency for evaluation purposes.[15]

Interval score


The interval score measures the calibration and sharpness of an interval prediction [l, u] at nominal coverage 1 - \alpha:

S_\alpha^{\mathrm{int}}(l, u; y) = (u - l) + \frac{2}{\alpha}(l - y)\,\mathbb{1}\{y < l\} + \frac{2}{\alpha}(y - u)\,\mathbb{1}\{y > u\}

"The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which de- pends on α, if the observation misses the interval"[16]

Multivariate continuous variables


The scoring rules listed below aim to evaluate probabilistic predictions when the predicted distributions are multivariate continuous probability distributions, i.e. the predicted distributions are defined over a multivariate target variable \mathbf{y} \in \mathbb{R}^d and have a probability density function f.

Multivariate logarithmic score


The multivariate logarithmic score is similar to the univariate logarithmic score:

\mathrm{LogS}(F, \mathbf{y}) = -\log f(\mathbf{y}),

where f denotes the probability density function of the predicted multivariate distribution F. It is a local, strictly proper scoring rule.

Hyvärinen scoring rule


The Hyvärinen scoring function (of a density p) is defined by[17]

S(p; \mathbf{y}) = 2\, \Delta \log p(\mathbf{y}) + \left\| \nabla \log p(\mathbf{y}) \right\|_2^2,

where \Delta denotes the Laplacian (the trace of the Hessian) and \nabla denotes the gradient, both taken with respect to \mathbf{y}. This scoring rule can be used to computationally simplify parameter inference and address Bayesian model comparison with arbitrarily-vague priors.[17][18] It was also used to introduce new information-theoretic quantities beyond the existing information theory.[19]

Energy score


The energy score is a multivariate extension of the continuous ranked probability score:[1]

\mathrm{ES}(F, \mathbf{y}) = \mathbb{E} \left\| \mathbf{X} - \mathbf{y} \right\| - \frac{1}{2} \mathbb{E} \left\| \mathbf{X} - \mathbf{X}' \right\|

Here, \|\cdot\| denotes the d-dimensional Euclidean distance and \mathbf{X}, \mathbf{X}' are independently sampled random variables from the probability distribution F. The energy score is strictly proper for distributions for which \mathbb{E}\|\mathbf{X}\| is finite. It has been suggested that the energy score is somewhat ineffective when evaluating the intervariable dependency structure of the forecasted multivariate distribution.[20] The energy score is equal to twice the energy distance between the predicted distribution and the empirical distribution of the observation.
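
A Monte Carlo sketch of the energy score for a two-dimensional forecast (hypothetical Gaussian forecast, not part of the original article), analogous to the CRPS estimate shown earlier:

import numpy as np

rng = np.random.default_rng(1)

def energy_score(x, x_prime, y):
    """Estimate ES(F, y) = E||X - y|| - 0.5*E||X - X'|| from sample arrays of shape (n, d)."""
    term1 = np.mean(np.linalg.norm(x - y, axis=1))
    term2 = np.mean(np.linalg.norm(x - x_prime, axis=1))
    return float(term1 - 0.5 * term2)

x = rng.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2), size=5_000)
x_prime = rng.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2), size=5_000)
print(energy_score(x, x_prime, np.array([0.0, 0.0])))  # typical observation: small score
print(energy_score(x, x_prime, np.array([4.0, 4.0])))  # improbable observation: large score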

Variogram score


The variogram score of order p is given by:[21]

\mathrm{VS}_p(F, \mathbf{y}) = \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} \left( |y_i - y_j|^p - \mathbb{E} |X_i - X_j|^p \right)^2

Here, w_{ij} are weights, often set to 1, and the order p can be arbitrarily chosen, but p = 0.5 or p = 1 are often used. X_i here denotes the i'th marginal random variable of \mathbf{X} \sim F. The variogram score is proper for distributions for which the 2p'th moment is finite for all components, but it is never strictly proper. Compared to the energy score, the variogram score is claimed to be more discriminative with respect to the predicted correlation structure.

Conditional continuous ranked probability score


The conditional continuous ranked probability score (Conditional CRPS or CCRPS) is a family of (strictly) proper scoring rules. Conditional CRPS evaluates a forecasted multivariate distribution by evaluation of CRPS over a prescribed set of univariate conditional probability distributions of the predicted multivariate distribution:[22]

\mathrm{CCRPS}_{\mathcal{T}}(F, \mathbf{y}) = \sum_{(i, C_i) \in \mathcal{T}} \mathrm{CRPS}\big(F_{i \mid C_i}, y_i\big)

Here, X_i is the i'th marginal variable of \mathbf{X} \sim F, \mathcal{T} is a set of tuples (i, C_i) that defines a conditional specification (with i \in \{1, \dots, d\} and C_i \subseteq \{1, \dots, d\} \setminus \{i\}), and F_{i \mid C_i} denotes the conditional probability distribution of X_i given that all variables X_j for j \in C_i are equal to their respective observations. In the case that F_{i \mid C_i} is ill-defined (i.e. its conditional event has zero likelihood), CRPS scores over this distribution are defined as infinite. Conditional CRPS is strictly proper for distributions with finite first moment if the chain rule is included in the conditional specification, meaning that there exists a permutation \sigma of (1, \dots, d) such that for all i: (\sigma(i), \{\sigma(1), \dots, \sigma(i-1)\}) \in \mathcal{T}.

Interpretation of proper scoring rules


All proper scoring rules are equal to weighted sums (integral with a non-negative weighting functional) of the losses in a set of simple two-alternative decision problems that use the probabilistic prediction, each such decision problem having a particular combination of associated cost parameters for false positive and false negative decisions. A strictly proper scoring rule corresponds to having a nonzero weighting for all possible decision thresholds. Any given proper scoring rule is equal to the expected losses with respect to a particular probability distribution over the decision thresholds; thus the choice of a scoring rule corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed, with for example the quadratic loss (or Brier) scoring rule corresponding to a uniform probability of the decision threshold being anywhere between zero and one. The classification accuracy score (percent classified correctly), a single-threshold scoring rule which is zero or one depending on whether the predicted probability is on the appropriate side of 0.5, is a proper scoring rule but not a strictly proper scoring rule because it is optimized (in expectation) not only by predicting the true probability but by predicting any probability on the same side of 0.5 as the true probability.[23][24][25][26][27][28]

Characteristics


Affine transformation


A strictly proper scoring rule, whether binary or multiclass, after an affine transformation remains a strictly proper scoring rule.[3] That is, if S(\mathbf{p}, i) is a strictly proper scoring rule then a + b\,S(\mathbf{p}, i) with b \neq 0 is also a strictly proper scoring rule, though if b < 0 then the optimization sense of the scoring rule switches between maximization and minimization.

Locality


A proper scoring rule is said to be local if its estimate for the probability of a specific event depends only on the probability of that event. This statement is vague in most descriptions, but in most cases it can be understood to mean that the optimal solution of the scoring problem "at a specific event" is invariant to all changes in the observation distribution that leave the probability of that event unchanged. All binary scores are local, because the probability assigned to the event that did not occur is fully determined by the probability of the event that did, so there is no remaining degree of freedom to vary over.

Affine functions of the logarithmic scoring rule are the only strictly proper local scoring rules on a finite set that is not binary.

Decomposition


The expectation value of a proper scoring rule S can be decomposed into the sum of three components, called uncertainty, reliability, and resolution,[29][30] which characterize different attributes of probabilistic forecasts:

\mathbb{E}[S] = \mathrm{UNC} + \mathrm{REL} - \mathrm{RES}

If a score is proper and negatively oriented (such as the Brier Score), all three terms are positive definite. The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency. The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies.

The equations for the individual components depend on the particular scoring rule. For the Brier score, they are given by

\mathrm{UNC} = \bar{x}(1 - \bar{x}), \qquad \mathrm{REL} = \mathbb{E}_f\!\left[ (f - \bar{x}(f))^2 \right], \qquad \mathrm{RES} = \mathbb{E}_f\!\left[ (\bar{x}(f) - \bar{x})^2 \right],

where \bar{x} is the average probability of occurrence of the binary event x, and \bar{x}(f) is the conditional event probability given the forecast f, i.e. \bar{x}(f) = P(x = 1 \mid f).
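
The following Python sketch (hypothetical toy data, grouping by unique forecast values) computes the three Brier-score components as reconstructed above and checks the identity Brier = UNC + REL − RES.

import numpy as np

def brier_decomposition(f, x):
    """f: forecast probabilities, x: binary outcomes (0/1). Returns (UNC, REL, RES)."""
    x_bar = x.mean()
    unc = x_bar * (1 - x_bar)
    rel = res = 0.0
    for prob in np.unique(f):
        mask = f == prob
        weight = mask.mean()          # fraction of cases with this forecast value
        cond_freq = x[mask].mean()    # observed event frequency given this forecast
        rel += weight * (prob - cond_freq) ** 2
        res += weight * (cond_freq - x_bar) ** 2
    return unc, rel, res

f = np.array([0.1, 0.1, 0.8, 0.8, 0.8, 0.5])
x = np.array([0, 0, 1, 1, 0, 1])
unc, rel, res = brier_decomposition(f, x)
print(np.isclose(np.mean((f - x) ** 2), unc + rel - res))  # True: Brier = UNC + REL - RES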

from Grokipedia
A scoring rule is a statistical measure that evaluates the quality of a probabilistic forecast by assigning a numerical score based on the predicted distribution and the actual observed outcome, with the goal of incentivizing accurate and honest reporting of probabilities. These rules are particularly valuable in fields requiring reliable predictions, such as meteorology, economics, and machine learning, where they quantify forecast performance and promote alignment between reported beliefs and expected scores. A key subclass consists of proper scoring rules, which are designed such that the expected score is maximized (or minimized, depending on the convention) precisely when the forecaster reports their true subjective probabilities, thereby eliciting truthful forecasts without strategic manipulation. Strictly proper scoring rules further ensure that this optimum is unique, achieved only by the true distribution, enhancing their robustness for verification purposes. The concept originated in the context of weather forecasting, with Glenn W. Brier introducing the first prominent example—the Brier score—in 1950 as a quadratic measure of the mean squared difference between predicted probabilities and binary outcomes (0 or 1), which remains widely used for its simplicity and decomposability into calibration, refinement, and uncertainty components. Other notable examples include the logarithmic score, which rewards higher probabilities for correct outcomes via the log-likelihood and relates to concepts like Kullback-Leibler divergence, and the spherical score, a pseudospherical rule suitable for categorical forecasts. Scoring rules have broad applications beyond meteorology, including model evaluation in machine learning, forecasting in economics, and belief elicitation in decision analysis, where they facilitate model comparison, calibration checks, and incentive-compatible mechanisms like prediction markets. Their theoretical foundation draws from decision theory and information theory, with proper rules often derived from Bregman divergences or entropy measures, ensuring mathematical consistency and interpretability. Ongoing research extends these to multivariate, spatial, and functional forecasts, addressing challenges like elicitation in high dimensions and robustness to outliers.

Definitions and Fundamentals

Probabilistic Forecast

A probabilistic forecast provides a predictive probability distribution over possible future outcomes or events, rather than a single predicted value. This distribution quantifies the forecaster's uncertainty by assigning probabilities to each potential result, enabling a more complete representation of what is known or believed about the future. In contrast to point forecasts, which deliver a single deterministic value such as an expected mean, probabilistic forecasts emphasize the full spectrum of possibilities to better capture inherent uncertainties in complex systems. This approach is particularly valuable in fields where decisions depend on assessing risks and ranges of outcomes, allowing users to weigh alternatives based on likelihoods. The origins of probabilistic forecasting trace back to early applications in meteorology during the early 20th century, with initial explicit treatments appearing in works like W. E. Cooke's 1906 analysis of weather predictions. These ideas were further developed in decision theory and statistics around the mid-20th century, notably through von Neumann and Morgenstern's 1944 formulation of expected utility theory, which formalized decision-making under probabilistic uncertainty. Early meteorological uses, such as Anders Ångström's 1920s explorations of forecast value, including his 1927 and 1933 papers on probability forecasting, highlighted practical needs for probability-based predictions in weather services. For discrete outcomes, a probabilistic forecast takes the form of a probability mass function, where probabilities sum to one across all possibilities. In continuous cases, it is represented by a probability density function. A classic example is a weather forecaster stating a 70% chance of rain tomorrow, implying a distribution over rainy and non-rainy scenarios that informs decisions like event planning. Scoring rules serve to evaluate the accuracy and calibration of such forecasts against observed outcomes.

Scoring Rule

A scoring rule is a function S(y, p) that evaluates the quality of a probabilistic forecast p given an observed outcome y, assigning a numerical score where higher values indicate greater accuracy in the forecast. These rules are applied in contexts ranging from weather forecasting to machine learning to quantify how well predicted probability distributions align with realized events. In mathematical terms, a scoring rule generally takes the form S(y, p) = g(p) - h(y, p), where g depends only on the forecast and h incorporates both the outcome and the forecast, allowing for a structured assessment of deviation from truth. This structure facilitates comparisons across different methods by normalizing the evaluation process. Scoring rules serve a critical purpose in elicitation settings, where they incentivize forecasters to report their true probabilistic beliefs rather than biased estimates, thereby promoting reliable information gathering in decision-making processes. The development of scoring rules traces back to the 1950s, pioneered by Glenn W. Brier in his work on verifying probability forecasts for weather events, building on foundations in statistical decision theory from earlier contributions.

Point Forecast

A point forecast provides a single predicted value for a future outcome, representing a deterministic estimate without any associated probabilities. For instance, in weather forecasting, a point forecast might predict a temperature of exactly 25°C for a given location and time, serving as a concise summary of the expected observation. Such forecasts are often derived from models that optimize for a specific functional, like the mean or median, to minimize error under a chosen loss function. In the framework of scoring rules, a point forecast is equivalent to a degenerate probabilistic forecast, where the predictive distribution places 100% probability mass on the single predicted value, akin to a Dirac delta distribution concentrated at that point. This equivalence allows proper scoring rules designed for probabilistic forecasts—such as the continuous ranked probability score (CRPS)—to be applied directly, reducing to simpler error measures like the absolute error for point predictions. Despite their simplicity, point forecasts have notable limitations, as they do not quantify uncertainty, potentially leading to overconfident predictions and poorer decision-making in scenarios with high variability or incomplete information. This omission can mislead decision-makers by implying false certainty, particularly in complex domains where outcomes are inherently uncertain. Point forecasts remain prevalent in deterministic modeling approaches, such as early operational forecasting systems or basic regression models, where computational efficiency prioritizes a single output over distributional details. Scoring rules adapt to these use cases by treating the point estimate as an implicit degenerate distribution, enabling consistent evaluation alongside more expressive probabilistic forecasts.

Scoring Function

A scoring function constitutes the foundational mathematical construct for assessing the quality of probabilistic forecasts by comparing a predicted distribution to a realized outcome. Denoted typically as S(o, F), where o is the observed outcome and F is the forecasted distribution, it maps these inputs to a real number that quantifies their correspondence. This structure underpins the evaluation process in forecast verification, serving as the atomic unit from which broader assessment mechanisms are built. For discrete outcomes over a finite sample space, the scoring function takes the basic form S(y, p), where y represents the realized outcome and p is the vector of forecasted probabilities assigned to each possible event. In the continuous setting, involving probability densities f, the function S(y, f(y)) evaluates the forecast at the outcome y, with the integral form \int S(y, f(y)) f(y) \, dy capturing its expectation under the forecasted density; this provides the theoretical basis for averaging scores across potential realizations. These algebraic expressions ensure the function's applicability across diverse probabilistic domains without presupposing specific distributional assumptions. Scoring functions possess general properties that enhance their utility in forecast evaluation, including monotonicity with respect to forecast accuracy—scores improve (non-decreasing in positive orientation or non-increasing in negative) as the forecasted distribution aligns more closely with the true underlying probabilities. They are not inherently required to satisfy stricter conditions like propriety, allowing flexibility in design while prioritizing sensitivity to discrepancies between forecast and outcome. Orientation influences whether higher numerical values indicate better performance (positive) or worse (negative), a choice that standardizes interpretation in applications. In contrast to fully specified scoring rules, which involve the operational deployment of these functions—such as aggregation over samples or integration into decision frameworks—scoring functions remain the elemental components focused solely on the pairwise comparison of outcome and forecast. This distinction underscores their role as versatile building blocks rather than complete verification protocols.

Orientation of Scoring Rules

Scoring rules are classified by their orientation, which determines whether higher or lower numerical values indicate superior forecast performance. Positively oriented scoring rules reward accurate probabilistic forecasts with higher scores, such as the logarithmic scoring rule, where the score is the logarithm of the predicted probability assigned to the observed outcome, and forecasters aim to maximize the score. In contrast, negatively oriented scoring rules function as penalties, assigning lower values (often closer to zero or more negative) to better forecasts; the Brier score, defined as the mean squared difference between predicted probabilities and binary outcomes, exemplifies this by penalizing deviations and is minimized for optimal performance. The orientation of a scoring rule can be inverted without altering its relative evaluation of forecasts: multiplying the rule by −1 flips it from positive to negative (or vice versa), preserving the ordering of forecast quality since the transformation is monotonic. This equivalence ensures that core properties like propriety remain intact under sign reversal. The choice of orientation carries practical implications for applications. Positively oriented rules align with elicitation contexts, where forecasters maximize expected scores to reveal true beliefs, a framework rooted in decision theory. Negatively oriented rules predominate in optimization, treating scores as loss functions to minimize during model training, facilitating integration with gradient-based algorithms. This convention of distinguishing orientations emerged as standard in the literature since the 1970s, though usage varies by discipline—positive in statistics and related fields, negative in computational fields.

Expected Score

The expected score serves as the theoretical performance metric for a scoring rule, quantifying the long-run average score a forecaster would receive if repeatedly issuing the same probabilistic forecast under a true underlying distribution. For a forecast distribution p and true distribution q over a discrete sample space, the expected score is defined as \text{ES}(p, q) = \sum_y q(y) \, S(y, p), where S(y, p) is the score assigned to the forecast p upon observing outcome y. This expectation represents the mean score over outcomes drawn from q. In the continuous case, the expected score takes the form \text{ES}(p, q) = \int q(y) \, S(y, p(y)) \, dy, where the integration is over the sample space, and p(y) denotes the forecasted density or distribution function evaluated at y. This formulation extends the discrete case to handle outcomes from continuous distributions, maintaining the focus on average performance relative to the true distribution q. The expected score plays a central role in evaluating forecast quality, as it measures the anticipated long-run performance of a scoring rule. For proper scoring rules, the expected score is maximized when the forecast p equals the true distribution q, incentivizing forecasters to report their true beliefs to achieve the highest possible average score. Strict propriety further strengthens this by ensuring that the maximum is unique, attained only when p = q, which promotes unambiguous truthful reporting in elicitation settings.

Sample Average Score

The sample average score, also referred to as the empirical score, quantifies the performance of probabilistic forecasts by averaging the values of a scoring rule over a sample of forecast-observation pairs. For n pairs (y_i, p_i), where y_i denotes the observed outcome and p_i the forecast distribution at occasion i, it is defined as \bar{S}_n = \frac{1}{n} \sum_{i=1}^n S(y_i, p_i), with S representing the scoring rule. This measure was introduced in early forecast verification studies within meteorology, notably in Brier's 1950 analysis of probabilistic weather predictions, where it served as a metric to assess forecast accuracy across multiple events. Assuming the pairs (y_i, p_i) are independent and identically distributed, the sample average score is an unbiased estimator of the expected score, meaning its expectation equals the true value. Furthermore, under standard regularity conditions such as finite variance, it converges to the expected score as the sample size n approaches infinity, by the law of large numbers; this consistency property ensures that long-term empirical performance reliably reflects theoretical quality. In practice, the sample average score facilitates forecaster evaluation and model comparison using finite datasets, where competing methods are ranked by their aggregated scores—typically favoring those with minimal values for negatively oriented rules—to identify superior predictive systems without awaiting infinite data. This data-driven approach bridges theoretical expected scores to operational decision-making, though finite-sample variability necessitates careful sample size considerations for robust comparisons.
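
A small Python sketch (hypothetical forecasts and outcomes) of the sample average score, here with the negatively oriented binary Brier score so that the lower average identifies the better forecaster:

import numpy as np

def sample_average_brier(probs, outcomes):
    """Mean of the binary Brier score over all forecast-observation pairs."""
    return float(np.mean((probs - outcomes) ** 2))

outcomes = np.array([1, 0, 1, 1, 0])
forecaster_a = np.array([0.9, 0.2, 0.8, 0.7, 0.1])  # sharp and well calibrated
forecaster_b = np.array([0.6, 0.5, 0.6, 0.5, 0.4])  # hedged and uninformative
print(sample_average_brier(forecaster_a, outcomes))  # ~0.038 (better)
print(sample_average_brier(forecaster_b, outcomes))  # ~0.196 (worse)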

Properties and Theoretical Foundations

Proper Scoring Rules

A proper scoring rule is a mechanism for evaluating probabilistic forecasts such that the expected score is maximized when the forecaster reports their true distribution. Formally, for a scoring rule S(\mathbf{q}, x) that assigns a score based on reported probabilities \mathbf{q} and observed outcome x, and true distribution P, the rule is proper if the expected score satisfies \mathbb{E}_{P}[S(P, X)] \geq \mathbb{E}_{P}[S(Q, X)] for all probability distributions Q. This property ensures that, under the true distribution, no other report yields a higher expected score than truthful reporting. Strict propriety strengthens this condition by guaranteeing a unique maximum at the true distribution, meaning the inequality is strict unless Q = P. This uniqueness aligns the forecaster's optimal report precisely with their true beliefs, eliminating any ambiguity in the reporting equilibrium. The theoretical foundations of proper scoring rules trace back to work on subjective probability in the mid-20th century, with Leonard Savage's 1971 work providing a seminal characterization linking them to expected utility maximization and subjective probability elicitation. Savage demonstrated that proper scoring rules correspond to convex functions on probability simplices, enabling their use as elicitation devices to infer personal probabilities without strategic distortion, building on earlier ideas from de Finetti's coherence axioms. In practice, proper scoring rules facilitate truthful elicitation of beliefs in uncertain environments, preventing strategic misrepresentation by making dishonesty suboptimal in expectation. This has profound implications for applications like forecast verification and incentive-compatible mechanisms in economics and statistics.

Strictly Proper Scoring Rules

A strictly proper scoring rule is a refinement of a proper scoring rule, where the expected score under the true distribution p is uniquely maximized by reporting p itself. Formally, for a scoring rule S, it is strictly proper if \mathbb{E}_{Y \sim p} [S(p, Y)] > \mathbb{E}_{Y \sim p} [S(q, Y)] for all probability distributions q \neq p. This strict inequality ensures that any deviation from the true forecast p results in a strictly lower expected score, incentivizing precise calibration and sharpness in probabilistic forecasts. In contrast to merely proper scoring rules, which allow multiple forecasts to achieve the maximum expected score, strictly proper rules eliminate such ambiguity, making them particularly valuable for eliciting truthful and unique probabilistic predictions.

Examples of proper but not strictly proper scoring rules include the zero-one score, which assigns a score of 1 if the forecast's mode matches the outcome and 0 otherwise; this is proper because the expected score is maximized by any forecast assigning positive probability only to the true outcome, but it is not strictly proper due to non-uniqueness among such forecasts. Similarly, the energy score with parameter \beta = 2 is proper for distributions with finite second moments but not strictly proper, as multiple forecasts can yield the same expected score. Such non-strict rules are rare in practice, as they often fail to distinguish between equally calibrated but differently sharp forecasts.

The strict propriety of a scoring rule follows from the strict convexity of the negative expected score function with respect to the forecast distribution. Specifically, the negative expected score -\mathbb{E}_{Y \sim p} [S(q, Y)] is a strictly convex function of q, implying a unique minimum at q = p. This convexity links strictly proper scoring rules to Bregman divergences, where the divergence generated by a strictly convex function \phi defines the negative expected score as D_\phi(q, p) = \mathbb{E}_{Y \sim p} [\phi'(p)(Y) - \phi'(q)(Y)] + \phi(q) - \phi(p), ensuring the strict inequality for q \neq p. A brief proof sketch involves the subgradient inequality applied to the strictly convex \phi: for q \neq p, the convexity yields \phi(q) > \mathbb{E}_{Y \sim p} [\phi'(p)(Y)], leading directly to the strict maximization of the expected score at q = p.

Most commonly used scoring rules in forecasting and statistics are strictly proper, including the logarithmic score and the Brier (quadratic) score. For instance, the logarithmic score, defined as S(p, y) = \log p(y), is strictly proper for both discrete and continuous distributions, uniquely rewarding the true probabilities. Likewise, the Brier score for categorical outcomes, S(p, y) = - \sum_k (p_k - \mathbb{I}\{y = k\})^2, is strictly proper, providing a quadratic penalty that penalizes deviations from the true distribution. These properties make strictly proper rules the standard for applications requiring robust evaluation of probabilistic forecasts.

Consistent Scoring Functions

In statistics, a consistent scoring function provides a framework for evaluating point forecasts of specific distributional functionals, ensuring that the scoring rule incentivizes accurate estimation of those functionals. Formally, consider a space \mathcal{X} of possible outcomes, a space \mathcal{Y} of forecasts, and the set \mathcal{P}(\mathcal{X}) of probability distributions over \mathcal{X}. A functional T: \mathcal{P}(\mathcal{X}) \to \mathcal{Y} maps distributions to point estimates, such as the mean or a quantile. A scoring function S: \mathcal{Y} \times \mathcal{X} \to \mathbb{R} is consistent for T if, for any distribution F \in \mathcal{P}(\mathcal{X}) and random outcome Y \sim F, \mathbb{E}[S(T(F), Y)] = \inf_{y \in \mathcal{Y}} \mathbb{E}[S(y, Y)], with the infimum achieved uniquely at y = T(F) for strict consistency. This property implies that the expected score is minimized precisely when the forecast equals the true functional value, promoting truthful reporting.

In practice, with an observed sample X_1, \dots, X_n \stackrel{\text{iid}}{\sim} F, the sample average score \bar{S}_n(y) = n^{-1} \sum_{i=1}^n S(y, X_i) serves as an estimator. Under suitable regularity conditions, such as convexity of S and continuity of T, the minimizer \hat{y}_n = \arg\min_y \bar{S}_n(y) converges almost surely to T(F) as n \to \infty, making consistent scoring functions a basis for asymptotically valid point estimation. This convergence underpins their utility in empirical forecast verification, where the sample average approximates the population minimization. Proper scoring rules, which evaluate full probabilistic forecasts, represent a special case of consistent scoring functions where the functional T is the identity mapping, i.e., T(F) = F, directly eliciting the true distribution.

Classic examples illustrate this framework. The squared error score S(y, x) = (y - x)^2 is strictly consistent for the mean functional T(F) = \mathbb{E}_F[X], as its expected value \mathbb{E}[(y - X)^2] = (y - \mu)^2 + \mathrm{Var}(X) (with \mu = \mathbb{E}[X]) is uniquely minimized at y = \mu. Similarly, the mean absolute error S(y, x) = |y - x| is strictly consistent for the median functional T(F) = F^{-1}(1/2), since the expected absolute deviation is minimized at the median for any distribution. These examples extend to other functionals like quantiles and expectiles, where consistent scores can be constructed via integral representations. The general theory of consistent scoring functions was formalized in the statistics literature during the 2010s, building on earlier work in forecast verification to address challenges in verifying point forecasts for complex functionals in fields like economics and meteorology. Seminal contributions, including characterizations via Choquet integrals and forecast rankings, have established their role in ensuring robust evaluation beyond simple location parameters.
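
The following Python sketch (illustrative, with a hypothetical skewed sample) shows the consistency property empirically: minimizing the average squared error over a grid of candidate point forecasts recovers the sample mean, while minimizing the average absolute error recovers the sample median.

import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=20_000)  # skewed, so mean and median differ

candidates = np.linspace(0.0, 6.0, 601)
mean_scores = [np.mean((y - sample) ** 2) for y in candidates]     # squared error
median_scores = [np.mean(np.abs(y - sample)) for y in candidates]  # absolute error

print(candidates[np.argmin(mean_scores)], sample.mean())        # both close to 2.0
print(candidates[np.argmin(median_scores)], np.median(sample))  # both close to 2*ln(2) ~ 1.39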

Applications in Practice

Weather and Climate Forecasting

Scoring rules have played a pivotal role in the verification of probabilistic forecasts in meteorology since the mid-20th century, particularly for assessing the accuracy of predictions in weather and climate contexts. The Brier score, originally proposed in 1950, was developed specifically for evaluating probabilistic forecasts of binary events, such as the occurrence or non-occurrence of precipitation, providing a quadratic measure of forecast accuracy that penalizes deviations from observed outcomes. For continuous predictands like temperature or precipitation amount, the Continuous Ranked Probability Score (CRPS) emerged in the 1970s as a key metric, generalizing the mean absolute error to probabilistic distributions and enabling fair comparisons between deterministic and ensemble forecasts. In operational settings, such as those at the European Centre for Medium-Range Weather Forecasts (ECMWF), scoring rules are routinely applied to evaluate ensemble prediction systems for variables including surface temperature and precipitation. The Discrete Ranked Probability Score (RPS) is used for categorical forecasts, measuring the cumulative differences between predicted and observed cumulative probabilities across ordered categories, while the CRPS assesses continuous forecasts by integrating the squared differences between predictive and empirical cumulative distribution functions derived from ensembles. These metrics are computed over verification periods using reanalysis datasets like ERA5, allowing forecasters to quantify performance across lead times from short-range to medium-range predictions. The primary benefits of these scoring rules in forecast verification lie in their ability to simultaneously evaluate calibration—the statistical reliability of predicted probabilities—and sharpness—the concentration of predictive distributions around expected outcomes—thus incentivizing forecasts that are both accurate and informative. By rewarding proper probabilistic reporting, they facilitate the construction of skill scores, such as the Ranked Probability Skill Score (RPSS), which normalize raw scores against climatological benchmarks to highlight relative improvements in forecast quality. By the 2020s, scoring rules had become embedded in the verification processes for large-scale intercomparisons, including Phase 6 of the Coupled Model Intercomparison Project (CMIP6), where the Brier Skill Score is applied to assess the performance of global climate models in simulating seasonal patterns against observational data. This integration supports model selection and bias correction in projections of future climate variability, enhancing the reliability of assessments for adaptation planning in sectors like agriculture and water resource management.

Economic and Decision Theory

In economic and decision theory, proper scoring rules play a crucial role in eliciting truthful subjective probabilities from individuals in settings such as prediction markets and surveys, ensuring that reported beliefs maximize the expected reward under risk neutrality. These rules incentivize honest reporting by assigning scores that peak when the forecaster's report matches their true beliefs, thereby facilitating the aggregation of dispersed information for collective decision-making. Applications of scoring rules in economics include prediction markets, where mechanisms like the logarithmic market scoring rule (LMSR) enable continuous trading and probability updates without requiring matched counterparties, promoting efficient information revelation. For instance, the Iowa Electronic Markets, a long-running real-money platform for forecasting elections and economic events, leverages market-based incentives akin to scoring rules to aggregate participant beliefs into accurate consensus probabilities. In cost-benefit analysis, scoring rules aid in quantifying uncertainty by eliciting probabilistic assessments of outcomes, allowing decision-makers to compute expected net benefits under alternative scenarios. A key theoretical link exists between scoring rules and utility maximization: for a risk-neutral agent, the expected score from a proper rule aligns precisely with the expected utility derived from acting on the true distribution, making truthful reporting the optimal strategy in Bayesian decision frameworks. This connection underpins their use in econometrics for modeling subjective expectations. The prominence of scoring rules in economic applications grew in recent decades within econometrics, particularly through Charles Manski's work on measuring subjective probabilities via incentivized elicitation methods that address biases in survey responses. Manski emphasized their potential to reveal full subjective probability distributions rather than point estimates, enhancing econometric analysis of decision-making under uncertainty.

Machine Learning and Calibration

In machine learning, proper scoring rules serve as loss functions for training probabilistic classifiers and regressors, incentivizing models to output calibrated probability distributions. The logarithmic scoring rule, equivalent to the cross-entropy loss, is widely used in neural networks for multi-class classification, where it minimizes the negative log-likelihood of the true class under the predicted probability distribution. This loss function encourages the model to assign high probabilities to correct classes and low probabilities to incorrect ones, facilitating gradient-based optimization in deep learning frameworks. Similarly, the Brier score, or quadratic scoring rule, functions as a mean squared error loss for multi-class settings by penalizing deviations between predicted probabilities and one-hot encoded true labels across all classes. Both rules are strictly proper, ensuring that the model's expected loss is minimized only when its predictions match the true conditional distribution, which supports reliable probabilistic outputs in supervised learning tasks.

Scoring rules are instrumental in evaluating and improving model calibration, a critical aspect of trustworthy artificial intelligence where predicted probabilities should reflect true empirical frequencies. The expected calibration error (ECE) quantifies miscalibration by binning predictions based on confidence and measuring the difference between average predicted probabilities and observed accuracies within each bin, often using sample averages of proper scoring rules like the Brier or logarithmic score to estimate reliability. Poor calibration, detectable through high ECE values, can lead to overconfident or underconfident predictions, undermining applications in high-stakes domains such as medical diagnosis or autonomous systems; thus, post-hoc recalibration techniques, informed by these scores, adjust model outputs to align confidence with accuracy. In trustworthy AI, proper scoring rules provide a unified framework for assessing not just accuracy but also the sharpness and resolution of probabilistic forecasts, ensuring models are both precise and reliable.

Recent developments in the 2020s have extended scoring rules to uncertainty quantification in large language models (LLMs), where they evaluate the decomposition of total uncertainty into aleatoric (data-inherent) and epistemic (model-induced) components. For instance, frameworks using proper scores like the Brier and logarithmic rules have improved calibration in LLMs for clinical text tasks, reducing Brier scores by up to 74% and enhancing expected calibration error through Bayesian-inspired approximations of posterior distributions. Methods leveraging proper scoring rules also guide selective prediction and out-of-distribution detection by tailoring uncertainty estimates to task-specific losses for better performance in real-world deployment. A key theoretical link exists between proper scoring rules and Bayesian methods, where the expected logarithmic score under the true distribution equals the negative Shannon entropy minus the Kullback-Leibler (KL) divergence to the predicted distribution, implying that proper rules minimize KL divergence when predictions align with the true posterior. This connection underpins their use in variational inference and Bayesian neural networks, promoting forecasts that are information-theoretically optimal.
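
A minimal Python sketch of the binned expected calibration error described above (hypothetical confidences and correctness indicators; equal-width bins and bin-size weighting are a common choice, not the only one):

import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """conf: predicted probability of the predicted class; correct: 0/1 indicator."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # bin weight times the gap between mean confidence and observed accuracy
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return float(ece)

conf = np.array([0.95, 0.9, 0.85, 0.7, 0.65, 0.55])
correct = np.array([1, 1, 0, 1, 0, 1])
print(expected_calibration_error(conf, correct, n_bins=5))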

Examples of Proper Scoring Rules

Logarithmic Score for Categorical and Continuous Variables

The logarithmic scoring rule, also known as the log score or ignorance score, evaluates probabilistic forecasts by assigning a score based on the logarithm of the predicted probability assigned to the observed outcome. For categorical variables with a finite set of outcomes, the score is defined as S(y, p) = \log p(y), where y is the realized outcome and p(y) is the forecaster's assigned probability to that outcome, assuming a positive orientation where higher scores indicate better performance. This formulation originates from early work on admissible probability measurement procedures, where it was identified as the unique local strictly proper scoring rule. For continuous variables, the logarithmic score extends naturally to probability densities, given by S(y, f) = \log f(y), where f denotes the predicted probability density function and y is the observed value. In both discrete and continuous cases, the score is undefined if the predicted probability or density at the outcome is zero, which underscores its strict requirement for positive support across possible outcomes.

The logarithmic score is strictly proper, meaning that the expected score is uniquely maximized when the forecaster reports their true subjective probabilities, incentivizing honesty without strategic misrepresentation. It elicits the geometric mean in aggregation contexts, such as pooling multiple forecasts, where the optimal combined probability corresponds to the (normalized) geometric mean of individual predictions under this rule. Additionally, the score is particularly sensitive to low-probability assignments; assigning a very small probability to the realized outcome results in a large negative score, heavily penalizing such underestimation.

The derivation of the logarithmic score's propriety follows from information theory, where the expected score under true probabilities q for a forecast p is \mathbb{E}_q[S(y, p)] = \sum_y q(y) \log p(y) in the discrete case (or the analogous integral for continuous variables), which simplifies to the negative Shannon entropy minus the Kullback-Leibler divergence: \mathbb{E}_q[S(y, p)] = -H(q) - D_{KL}(q \| p). This expression is maximized uniquely when p = q, as the KL divergence is zero only for matching distributions, thereby linking the score's optimization to the minimization of the Kullback-Leibler divergence.

Brier and Quadratic Scores

The Brier score is a quadratic scoring rule originally developed for evaluating probabilistic forecasts in meteorology. Introduced by Glenn W. Brier in 1950, it was proposed as a method to verify weather predictions expressed in terms of probabilities, such as the likelihood of precipitation, by measuring the squared difference between forecasted probabilities and observed binary outcomes. This score has since become a standard metric in forecast verification across various fields, rewarding forecasts that align closely with observed events while penalizing deviations.

For a categorical prediction with K possible outcomes, let \mathbf{p} = (p_1, \dots, p_K) denote the forecasted probability vector, where \sum_{i=1}^K p_i = 1 and p_i \geq 0, and let y be the realized outcome. The indicator vector \mathbf{o} has o_i = I(y = i), which equals 1 if outcome i occurs and 0 otherwise. The score is then given by BS(\mathbf{p}, y) = -\sum_{i=1}^K (p_i - I(y = i))^2 = -\|\mathbf{p} - \mathbf{o}\|^2, where \|\cdot\|^2 is the squared Euclidean norm. This negative orientation treats the score as a reward, with higher (less negative) values indicating superior forecast accuracy; equivalently, the positive version \|\mathbf{p} - \mathbf{o}\|^2 functions as a loss, minimized for accurate predictions. For a sequence of n independent forecasts, the overall score is the average \frac{1}{n} \sum_{t=1}^n BS(\mathbf{p}_t, y_t).

The Brier score possesses key theoretical properties that underpin its utility. It is a strictly proper scoring rule: for any true distribution \mathbf{q}, the expected score \mathbb{E}[BS(\mathbf{p}, y) \mid \mathbf{q}] is uniquely maximized when \mathbf{p} = \mathbf{q}, incentivizing forecasters to report their true beliefs rather than hedging or biasing probabilities. Additionally, it is decomposable, permitting the expected score to be partitioned into terms capturing calibration (the reliability of predicted probabilities relative to observed frequencies) and refinement (the sharpness or variability of the forecasts, reflecting their informativeness). Specifically, the expected Brier score can be expressed as \mathbb{E}[BS] = U - R + C, where U is the uncertainty inherent in the observations (a fixed climatological term), R is the resolution or refinement (higher for sharper forecasts), and C is the calibration error (lower for well-calibrated forecasts); this decomposition, originally due to Murphy, highlights trade-offs in forecast quality.

As a quadratic rule, the Brier score generalizes naturally to point forecasts. When the forecast \mathbf{p} is degenerate—assigning probability 1 to a single predicted outcome \hat{y} and 0 elsewhere—the score simplifies to BS(\mathbf{p}, y) = -(\hat{y} - y)^2, the negative squared error between the point prediction and the actual outcome. This connection positions the Brier score as an extension of the classical quadratic loss function to probabilistic settings, bridging deterministic and uncertain forecasting paradigms.

Spherical Score

The spherical score is a strictly proper scoring rule designed for evaluating probabilistic forecasts over categorical outcomes, where the forecast is represented by a probability vector \mathbf{p} = (p_1, \dots, p_m) with \sum_{i=1}^m p_i = 1 and p_i \geq 0, and the outcome is indicated by a vector \mathbf{o} such that o_y = 1 and o_i = 0 for i \neq y. It can be expressed in vector form as the inner product of the forecast and outcome vectors, normalized by the Euclidean norm of the forecast vector: S(\mathbf{p}, \mathbf{o}) = \frac{\mathbf{p} \cdot \mathbf{o}}{\| \mathbf{p} \|_2} = \frac{p_y}{\sqrt{\sum_{i=1}^m p_i^2}}.
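
A short Python sketch of the spherical score as defined above (illustrative forecast values): the realized class's probability is divided by the Euclidean norm of the whole forecast vector.

import numpy as np

def spherical_score(p, y):
    """Spherical score, positively oriented: higher is better."""
    return float(p[y] / np.linalg.norm(p))

p = np.array([0.7, 0.2, 0.1])
print(spherical_score(p, 0))  # realized class was given high probability: ~0.95
print(spherical_score(p, 2))  # realized class was given low probability: ~0.14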