Bayes error rate
from Wikipedia

In statistical classification, Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error.[1][2]

A number of approaches to the estimation of the Bayes error rate exist. One method seeks to obtain analytical bounds which are inherently dependent on distribution parameters, and hence difficult to estimate. Another approach focuses on class densities, while yet another method combines and compares various classifiers.[2]

The Bayes error rate finds important use in the study of pattern recognition and machine learning techniques.[3]

Definition


Mohri, Rostamizadeh and Talwalkar define it as follows.

Given a distribution $\mathcal{D}$ over $X \times Y$, the Bayes error $R^*$ is defined as the infimum of the errors achieved by measurable functions $h : X \to Y$:

$$R^* = \inf_{\substack{h \\ h\ \text{measurable}}} R(h).$$

A hypothesis $h$ with $R(h) = R^*$ is called a Bayes hypothesis or Bayes classifier.

Error determination


In terms of machine learning and pattern classification, the labels of a set of random observations can be divided into 2 or more classes. Each observation is called an instance and the class it belongs to is the label. The Bayes error rate of the data distribution is the probability an instance is misclassified by a classifier that knows the true class probabilities given the predictors.

For a multiclass classifier, the expected prediction error may be calculated as follows:[3]

$$EPE = \mathbb{E}_x\!\left[\sum_{k=1}^{K} L\bigl(C_k, \hat{C}(x)\bigr)\, P(C_k \mid x)\right],$$

where $x$ is the instance, $\mathbb{E}[\cdot]$ the expectation value, $C_k$ is a class into which an instance is classified, $P(C_k \mid x)$ is the conditional probability of label $k$ for instance $x$, and $L(\cdot,\cdot)$ is the 0–1 loss function:

$$L(a, b) = 1 - \delta_{a,b} = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{if } a \neq b, \end{cases}$$

where $\delta_{a,b}$ is the Kronecker delta.

When the learner knows the conditional probability, then one solution is:

$$\hat{C}_B(x) = \arg\max_{C_k} P(C_k \mid x).$$

This solution is known as the Bayes classifier.

The corresponding expected prediction error is called the Bayes error rate:

$$BE = \mathbb{E}_x\!\left[\sum_{C_k \neq \hat{C}_B(x)} P(C_k \mid x)\right] = \mathbb{E}_x\!\left[1 - P\bigl(\hat{C}_B(x) \mid x\bigr)\right],$$

where the sum can be omitted in the last step by considering the complementary event. By the definition of the Bayes classifier, it maximizes $P(\hat{C}_B(x) \mid x)$ and therefore minimizes the Bayes error $BE$.
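
The calculation above can be made concrete with a small numerical sketch. The following Python snippet uses made-up numbers for a hypothetical problem with three discrete feature values and three classes; it evaluates the expected prediction error under 0–1 loss for the Bayes classifier (whose error is the Bayes error rate) and for a naive classifier, under the assumed posteriors.

```python
import numpy as np

# Hypothetical toy problem: 3 feature values, 3 classes (illustrative numbers only).
# p_x[i] is the marginal probability of feature value x_i; post[i, k] = P(C_k | x_i).
p_x = np.array([0.5, 0.3, 0.2])
post = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.1, 0.8],
])

# Bayes classifier: pick the class with the largest posterior at each feature value.
bayes_pred = post.argmax(axis=1)

def expected_prediction_error(pred):
    """EPE under 0-1 loss: sum over x of p(x) * (1 - P(pred(x) | x))."""
    return float(np.sum(p_x * (1.0 - post[np.arange(len(p_x)), pred])))

bayes_error = expected_prediction_error(bayes_pred)              # the Bayes error rate
naive_error = expected_prediction_error(np.zeros(3, dtype=int))  # always predict class 0

print(f"Bayes error rate:      {bayes_error:.3f}")  # 0.5*0.3 + 0.3*0.6 + 0.2*0.2 = 0.37
print(f"Always-class-0 error:  {naive_error:.3f}")  # 0.5*0.3 + 0.3*0.7 + 0.2*0.9 = 0.54
```

Any other assignment of classes to feature values yields an error at least as large as the Bayes error, in line with the minimality discussed below.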

The Bayes error is non-zero if the classification labels are not deterministic, i.e., there is a non-zero probability of a given instance belonging to more than one class.[4] In a regression context with squared error, the Bayes error is equal to the noise variance.[3]

Proof of Minimality


A proof that the Bayes error rate is indeed the minimum possible, and that the Bayes classifier is therefore optimal, can be found on the Wikipedia page Bayes classifier.


Plug-in Rules for Binary Classifiers


A plug-in rule uses an estimate $\tilde{\eta}(x)$ of the posterior probability $\eta(x) = P(Y = 1 \mid X = x)$ to form the classification rule $\tilde{C}(x) = \mathbb{1}\{\tilde{\eta}(x) > 1/2\}$. Given such an estimate, the excess Bayes error rate of the associated classifier is bounded above by

$$R(\tilde{C}) - R^* \leq 2\, \mathbb{E}_X\bigl[\,|\tilde{\eta}(X) - \eta(X)|\,\bigr].$$

To see this, note that the pointwise excess Bayes error is equal to $0$ where the plug-in classifier and the Bayes classifier agree, and equal to $|2\eta(x) - 1| = 2\,|\eta(x) - 1/2|$ where they disagree. To form the bound, notice that when the classifiers disagree, $\tilde{\eta}(x)$ lies on the other side of $1/2$ from $\eta(x)$, so $|\tilde{\eta}(x) - \eta(x)| \geq |\eta(x) - 1/2|$.
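
As a quick numerical illustration of this bound, the following sketch sets up a hypothetical one-dimensional problem in which the true posterior and a deliberately miscalibrated plug-in estimate are both logistic functions (chosen only for illustration), then estimates both sides of the inequality by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):          # true posterior P(Y = 1 | x), assumed logistic for this sketch
    return 1.0 / (1.0 + np.exp(-x))

def eta_tilde(x):    # a perturbed plug-in estimate of the posterior (illustrative)
    return 1.0 / (1.0 + np.exp(-(0.8 * x + 0.3)))

x = rng.normal(size=1_000_000)   # samples from the marginal distribution of X

# Pointwise excess error: 0 where the plug-in rule agrees with the Bayes rule,
# |2*eta(x) - 1| where the two rules disagree.
disagree = (eta(x) > 0.5) != (eta_tilde(x) > 0.5)
excess = np.mean(np.where(disagree, np.abs(2.0 * eta(x) - 1.0), 0.0))

bound = 2.0 * np.mean(np.abs(eta_tilde(x) - eta(x)))   # 2 * E|eta_tilde(X) - eta(X)|

print(f"excess Bayes error ~ {excess:.4f}, upper bound ~ {bound:.4f}")
assert excess <= bound   # the inequality holds pointwise, hence also for the averages
```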

from Grokipedia
The Bayes error rate is the lowest achievable error rate for any classifier in a given supervised classification problem, serving as a theoretical lower bound on the misclassification probability under the 0–1 loss function. It is defined as $1 - \mathbb{E}_{\mathbf{X}}\left[\max_{1 \leq j \leq c} P(Y = j \mid \mathbf{X})\right]$, where the expectation is taken over the marginal distribution of the feature vector $\mathbf{X}$, $Y$ is the class label, and $c$ is the number of classes; this quantifies the irreducible error arising from inherent overlaps in the class-conditional distributions. The rate is achieved by the Bayes optimal classifier, which assigns an input $\mathbf{x}$ to the class $\hat{j} = \arg\max_j P(Y = j \mid \mathbf{X} = \mathbf{x})$, equivalent to selecting the class with the highest posterior probability based on the true joint distribution $f_{\mathbf{X},Y}$.

In machine learning, the Bayes error rate acts as a fundamental benchmark for evaluating classifier performance, indicating how closely a practical model can approach the optimum without overfitting or underfitting. It depends solely on the underlying data distribution and feature quality, rather than on algorithmic choices, making it zero only when the class-conditional densities have no overlap and positive otherwise due to probabilistic overlap between classes. Since the true distribution is unknown in practice, estimating the Bayes error rate is challenging but crucial for assessing whether poor performance stems from inadequate features, limited data, or suboptimal algorithms; common estimation techniques include ensemble-based methods that average posterior probabilities from diverse classifiers or use information-theoretic measures to approximate the bound.

Recent advances have explored training neural networks to approach the Bayes error rate, such as through specialized losses like the Bayes Optimal Learning Threshold (BOLT), which have demonstrated strong results on benchmarks including MNIST (99.29% accuracy) and a second benchmark (93.29% accuracy) by directly optimizing toward the posterior maximum rather than conventional surrogate objectives. These developments highlight the rate's role in pushing the limits of achievable accuracy, particularly in high-dimensional settings where traditional estimators may fail.

Fundamentals

Definition

The Bayes error rate is defined as the lowest possible error rate achievable by any classifier in a given classification problem, representing the expected probability of misclassification under the optimal decision rule when the true underlying probability distributions are fully known. This optimal classifier assigns each observation to the class that maximizes the posterior probability $P(Y = c \mid X = x)$, where $Y$ is the class label and $X$ is the feature vector. Formally, the Bayes error rate $R^*$ is given by

$$R^* = \mathbb{E}\left[1 - \max_c P(Y = c \mid X)\right] = \int \left(1 - \max_c P(Y = c \mid x)\right) dP_X(x),$$

where the expectation is taken over the distribution of the features $X$, and the maximum is over all possible class labels $c$. This quantity captures the inherent uncertainty in the data due to overlapping class-conditional distributions, making it the irreducible minimum error even with perfect model specification. Intuitively, the Bayes error rate quantifies the fundamental limit imposed by probabilistic overlap between the classes; for instance, if the class distributions are completely separable, $R^* = 0$, but any overlap introduces unavoidable misclassifications. The concept emerged within statistical decision theory during the mid-20th century, extending the foundational Bayes' theorem originally formulated by Thomas Bayes and published posthumously in 1763.
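
As a small worked illustration with made-up numbers, suppose the feature takes one of two equally likely values in a two-class problem; the Bayes error is then the average of the pointwise errors $1 - \max_c P(Y = c \mid x)$:

$$
\begin{aligned}
&P(X = x_1) = P(X = x_2) = \tfrac{1}{2}, \qquad P(Y = 1 \mid x_1) = 0.9, \qquad P(Y = 1 \mid x_2) = 0.6,\\
&R^* = \tfrac{1}{2}\,(1 - 0.9) + \tfrac{1}{2}\,(1 - 0.6) = 0.05 + 0.20 = 0.25.
\end{aligned}
$$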

Bayesian Decision Theory Context

Bayesian decision theory establishes a probabilistic framework for optimal decision-making under uncertainty, particularly in classification tasks where the goal is to assign an observation to one of several possible classes. The theory formalizes the minimization of expected risk, defined as the average loss incurred over the joint distribution of observations and true classes. For classification problems, the 0–1 loss function is commonly employed, assigning a loss of 1 for misclassification and 0 for correct assignment, thereby reducing the objective to minimizing the probability of error.

At the core of Bayesian decision theory lies the posterior probability $P(Y = c \mid X = x)$, which quantifies the probability that the true class label $Y$ is $c$ given the feature vector $X = x$. These posteriors encapsulate all available information about class membership and serve as the basis for rational decision-making, allowing the incorporation of both data-driven evidence and prior beliefs to update the probability of each class. The Bayes decision rule operationalizes this by assigning the observation $X = x$ to the class $c^* = \arg\max_c P(Y = c \mid X = x)$, selecting the class with the maximum posterior probability to minimize the expected 0–1 loss. This rule derives directly from Bayes' theorem, which computes the posterior as

$$P(Y = c \mid X = x) = \frac{p(x \mid Y = c)\, P(Y = c)}{p(x)},$$

where $p(x \mid Y = c)$ denotes the class-conditional likelihood (the density of $x$ given class $c$), $P(Y = c)$ is the prior probability of class $c$, and $p(x) = \sum_c p(x \mid Y = c)\, P(Y = c)$ is the marginal density of $x$. This connection highlights the theory's reliance on priors to reflect prior knowledge or base rates, combined with likelihoods to weigh the evidence from the observation.

In contrast to frequentist approaches, which emphasize parameter estimation from data alone and treat probabilities as long-run frequencies without priors, Bayesian decision theory explicitly models uncertainty through subjective or objective priors, enabling a coherent update to posteriors that integrates all information sources. The Bayes error rate corresponds to the expected 0–1 loss under this optimal decision rule, serving as a theoretical benchmark for classifier performance.
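
The decision rule can be sketched in a few lines of code. The example below uses hypothetical priors, means, and variances (chosen only for illustration) for a two-class problem with univariate Gaussian class-conditional densities; it computes the posteriors via Bayes' theorem and assigns each point to the class with the larger posterior.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem with known priors and Gaussian class-conditional densities.
priors = np.array([0.6, 0.4])                     # P(Y = 0), P(Y = 1)
means, sigmas = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def posteriors(x):
    """Posterior P(Y = c | x) via Bayes' theorem: likelihood * prior / evidence."""
    likelihoods = norm.pdf(x, loc=means, scale=sigmas)   # p(x | Y = c) for c = 0, 1
    joint = likelihoods * priors                          # p(x | Y = c) * P(Y = c)
    return joint / joint.sum()                            # divide by the marginal p(x)

def bayes_decision(x):
    """Bayes decision rule: pick the class with the largest posterior probability."""
    return int(np.argmax(posteriors(x)))

for x in (0.0, 1.0, 1.2, 3.0):
    p = posteriors(x)
    print(f"x = {x:4.1f}  P(Y=0|x) = {p[0]:.3f}  P(Y=1|x) = {p[1]:.3f}  -> class {bayes_decision(x)}")
```

With unequal priors, the decision boundary is not midway between the two means but shifts toward the less probable class, as the likelihood-ratio threshold discussed below makes explicit.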

Computation

General Formula

The Bayes error rate $R^*$ represents the minimum achievable error rate for classifying observations from a joint distribution $P(X, Y)$, where $X$ is the feature vector and $Y$ is the class label taking values in $\{1, 2, \dots, K\}$ in the multi-class setting. It is derived as the expectation of the conditional misclassification probability under the optimal decision rule, which assigns $x$ to $\arg\max_c P(Y = c \mid X = x)$. This conditional error is $1 - \max_c P(Y = c \mid X = x)$, so

$$R^* = \mathbb{E}\left[1 - \max_c P(Y = c \mid X)\right].$$

The expectation is taken with respect to the marginal distribution of $X$, yielding the integral form

$$R^* = \int_{\mathcal{X}} \left(1 - \max_c P(Y = c \mid x)\right) p(x)\, dx,$$

where $p(x)$ denotes the marginal density of $X$ and $\mathcal{X}$ is the feature space. This expression arises from integrating the pointwise error over all possible feature values, weighted by their density.

In the multi-class extension for $K > 2$, the formula captures the probability of not selecting the true class, as the Bayes classifier maximizes the posterior for each $x$, leading to misclassification precisely when the true class does not have the highest posterior. The posterior probabilities $P(Y = c \mid x)$ follow from Bayesian decision theory via Bayes' theorem applied to the known joint distribution. The derivation relies on the assumption that the full joint distribution $P(X, Y)$ is known, enabling exact computation of the class-conditional densities $p(x \mid Y = c)$ and priors $P(Y = c)$; additionally, observations are assumed to be independent and identically distributed from this distribution.

As an illustrative example, consider a setting where each class follows a Gaussian mixture model with known means, covariances, and mixing weights; the posteriors are then expressed in terms of mixtures of Gaussian densities, and $R^*$ is obtained by integrating $1 - \max_c P(Y = c \mid x)$ over the decision regions defined by the posterior maxima, often requiring the feature space to be partitioned into Voronoi-like cells for evaluation.
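
When the joint distribution is known, the general formula can also be approximated by Monte Carlo: sample $X$ from its marginal (here by sampling $Y$ from the priors and then $X \mid Y$), compute the true posteriors, and average $1 - \max_c P(Y = c \mid X)$. The sketch below assumes a hypothetical three-class problem with two-dimensional Gaussian class-conditional densities and illustrative parameter values.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(42)

# Hypothetical three-class problem with known priors and 2-D Gaussian class conditionals.
priors = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
cov = np.eye(2)                                   # shared identity covariance

n = 500_000
labels = rng.choice(3, size=n, p=priors)           # sample Y from the priors
x = rng.multivariate_normal(np.zeros(2), cov, size=n) + means[labels]  # sample X | Y

# Posterior P(Y = c | x) is proportional to prior_c * N(x; mean_c, cov); normalize over c.
densities = np.stack([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means], axis=1)
joint = densities * priors
post = joint / joint.sum(axis=1, keepdims=True)

bayes_error = np.mean(1.0 - post.max(axis=1))      # Monte Carlo estimate of R*
print(f"Estimated Bayes error rate: {bayes_error:.4f}")
```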

Binary Classification Case

In the binary classification case, the Bayes error rate simplifies from the general multi-class form to a direct integral over the feature space. Specifically, it is given by

$$R^* = \int \min\left( \pi_0\, p_0(\mathbf{x}),\ \pi_1\, p_1(\mathbf{x}) \right) d\mathbf{x},$$

where $\pi_0$ and $\pi_1 = 1 - \pi_0$ are the prior probabilities of the two classes, and $p_0(\mathbf{x})$ and $p_1(\mathbf{x})$ are the corresponding class-conditional probability density functions. This expression represents the expected minimum conditional error probability, integrated over the marginal density of $\mathbf{x}$.

The optimal Bayes classifier assigns a feature vector $\mathbf{x}$ to class 1 if the posterior probability $P(Y = 1 \mid \mathbf{x}) > 0.5$, and to class 0 otherwise; the decision boundary occurs where $P(Y = 1 \mid \mathbf{x}) = 0.5$. Since the posterior incorporates the priors via Bayes' theorem, $P(Y = 1 \mid \mathbf{x}) = \frac{\pi_1 p_1(\mathbf{x})}{\pi_0 p_0(\mathbf{x}) + \pi_1 p_1(\mathbf{x})}$, unequal priors shift the boundary relative to the equal-prior case by altering the likelihood-ratio threshold to $\frac{p_1(\mathbf{x})}{p_0(\mathbf{x})} > \frac{\pi_0}{\pi_1}$. In contrast to multi-class settings, binary classification involves only a single decision boundary (or surface in higher dimensions), reducing the complexity of determining the regions where one class dominates.

The Bayes error rate quantifies the inherent overlap between the two class-conditional distributions, weighted by the priors. In symmetric scenarios with equal priors ($\pi_0 = \pi_1 = 0.5$), $R^*$ measures the degree of separability; minimal overlap yields low error, while substantial overlap increases it. For instance, when the class-conditional distributions are univariate Gaussians with equal variance $\sigma^2$ but different means $\mu_0 < \mu_1$, the decision boundary is at $\frac{\mu_0 + \mu_1}{2}$, and the error admits a closed-form expression using the Gaussian $Q$-function (the complementary cumulative distribution function of the standard normal):

$$R^* = Q\!\left( \frac{\mu_1 - \mu_0}{2\sigma} \right),$$

which can equivalently be written in terms of the complementary error function as $R^* = \frac{1}{2} \operatorname{erfc}\!\left( \frac{\mu_1 - \mu_0}{2\sqrt{2}\,\sigma} \right)$.
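
The closed form is easy to check numerically; the sketch below (with illustrative values for $\mu_0$, $\mu_1$, and $\sigma$) evaluates both the $Q$-function and the erfc forms and shows they agree.

```python
import numpy as np
from scipy.special import erfc
from scipy.stats import norm

# Illustrative parameters for two equal-prior, equal-variance univariate Gaussian classes.
mu0, mu1, sigma = 0.0, 2.0, 1.0

# Q-function form: R* = Q((mu1 - mu0) / (2 * sigma)), with Q(t) = 1 - Phi(t).
r_q = norm.sf((mu1 - mu0) / (2.0 * sigma))        # norm.sf is the Gaussian tail, i.e. Q

# Equivalent erfc form: R* = 0.5 * erfc((mu1 - mu0) / (2 * sqrt(2) * sigma)).
r_erfc = 0.5 * erfc((mu1 - mu0) / (2.0 * np.sqrt(2.0) * sigma))

print(f"Q-function form: {r_q:.6f}")
print(f"erfc form:       {r_erfc:.6f}")           # both give Q(1), about 0.158655
```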