Probabilistic classification
from Wikipedia

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right[1] or when combining classifiers into ensembles.

Types of classification

Formally, an "ordinary" classifier is some rule, or function, that assigns to a sample x a class label ŷ:

ŷ = f(x)

The samples come from some set X (e.g., the set of all documents, or the set of all images), while the class labels form a finite set Y defined prior to training.

Probabilistic classifiers generalize this notion of classifiers: instead of functions, they are conditional distributions Pr(Y | X), meaning that for a given x ∈ X, they assign probabilities to all y ∈ Y (and these probabilities sum to one). "Hard" classification can then be done using the optimal decision rule[2]: 39–40

ŷ = arg max_y Pr(Y = y | X = x)

or, in English, the predicted class is that which has the highest probability.
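
As a minimal illustration, the rule just takes an argmax over the predicted class probabilities; the class names and numbers below are hypothetical.

```python
# Minimal sketch of the "hard" decision rule: pick the class whose
# predicted conditional probability Pr(Y = y | X = x) is largest.
# The class labels and probabilities are made up for illustration.
probs = {"cat": 0.2, "dog": 0.7, "bird": 0.1}  # must sum to one

y_hat = max(probs, key=probs.get)  # argmax over Pr(Y = y | X = x)
print(y_hat)  # -> "dog"
```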

Binary probabilistic classifiers are also called binary regression models in statistics. In econometrics, probabilistic classification in general is called discrete choice.

Some classification models, such as naive Bayes, logistic regression and multilayer perceptrons (when trained under an appropriate loss function) are naturally probabilistic. Other models such as support vector machines are not, but methods exist to turn them into probabilistic classifiers.

Generative and conditional training

Some models, such as logistic regression, are conditionally trained: they optimize the conditional probability Pr(Y | X) directly on a training set (see empirical risk minimization). Other classifiers, such as naive Bayes, are trained generatively: at training time, the class-conditional distribution Pr(X | Y) and the class prior Pr(Y) are found, and the conditional distribution Pr(Y | X) is derived using Bayes' rule.[2]: 43
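
A short sketch of the contrast, assuming scikit-learn is available: GaussianNB is trained generatively (it estimates the class prior and class-conditional densities and applies Bayes' rule), while LogisticRegression optimizes the conditional probability directly; both expose the resulting conditional distribution via predict_proba. The toy data is made up.

```python
# Sketch contrasting generative and conditional (discriminative) training,
# assuming scikit-learn is installed; the toy data is synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models Pr(X|Y) and Pr(Y)
from sklearn.linear_model import LogisticRegression   # conditional: models Pr(Y|X) directly

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for model in (GaussianNB(), LogisticRegression()):
    model.fit(X, y)
    # Both yield a conditional distribution over classes for each input.
    print(type(model).__name__, model.predict_proba(X[:2]).round(3))
```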

Probability calibration

Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers, decision trees and boosting methods, produce distorted class probability distributions.[3] In the case of decision trees, where Pr(y|x) is the proportion of training samples with label y in the leaf where x ends up, these distortions come about because learning algorithms such as C4.5 or CART explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high bias) while using few samples to estimate the relevant proportion (high variance).[4]

Figure: An example calibration plot.

Calibration can be assessed using a calibration plot (also called a reliability diagram).[3][5] A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly calibrated classifier for which the predicted probabilities or scores cannot be used as probabilities. In this case one can use a method to turn these scores into properly calibrated class membership probabilities.
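
A sketch of such a plot, assuming scikit-learn and matplotlib are available; the scores and labels are synthetic stand-ins for real classifier output.

```python
# Sketch of a calibration (reliability) diagram on synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_score = rng.uniform(0, 1, 1000)                             # predicted probabilities (or rescaled scores)
y_true = (rng.uniform(0, 1, 1000) < y_score**2).astype(int)   # deliberately miscalibrated labels

frac_pos, mean_pred = calibration_curve(y_true, y_score, n_bins=10)
plt.plot(mean_pred, frac_pos, "o-", label="classifier")
plt.plot([0, 1], [0, 1], "--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```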

For the binary case, a common approach is to apply Platt scaling, which learns a logistic regression model on the scores.[6] An alternative method using isotonic regression[7] is generally superior to Platt's method when sufficient training data is available.[3]
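
A hedged sketch of both approaches using scikit-learn's CalibratedClassifierCV, where method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression; the data and base classifier are illustrative.

```python
# Sketch of post-hoc calibration of SVM scores, assuming scikit-learn.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

base = LinearSVC()  # outputs signed distances to the hyperplane, not probabilities
for method in ("sigmoid", "isotonic"):   # Platt scaling vs. isotonic regression
    calibrated = CalibratedClassifierCV(base, method=method, cv=5)
    calibrated.fit(X, y)
    print(method, calibrated.predict_proba(X[:2]).round(3))
```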

In the multiclass case, one can use a reduction to binary tasks, followed by univariate calibration with an algorithm as described above and further application of the pairwise coupling algorithm by Hastie and Tibshirani.[8]

Evaluating probabilistic classification

Commonly used evaluation metrics that compare the predicted probability to observed outcomes include log loss, Brier score, and a variety of calibration errors. The first of these, log loss, is also used as a loss function in the training of logistic models.
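
A minimal sketch of computing the first two metrics with scikit-learn on hypothetical binary predictions:

```python
# Sketch of log loss and Brier score on hypothetical binary predictions,
# assuming scikit-learn is available.
from sklearn.metrics import log_loss, brier_score_loss

y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.8, 0.6, 0.3, 0.9]  # predicted Pr(Y = 1 | X = x)

print("log loss   :", log_loss(y_true, y_prob))          # lower is better
print("Brier score:", brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities
```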

Calibration error metrics aim to quantify the extent to which a probabilistic classifier's outputs are well-calibrated. As Philip Dawid put it, "a forecaster is well-calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent".[9] A foundational metric in this domain is the Expected Calibration Error (ECE).[10] More recent work proposes variants of ECE that address limitations arising when classifier scores concentrate on a narrow subset of the [0,1] interval, including the Adaptive Calibration Error (ACE)[11] and Test-based Calibration Error (TCE).[12]
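
A simplified, equal-width-bin version of ECE can be sketched as follows; this is an illustration of the idea, not a reference implementation of ECE, ACE, or TCE.

```python
# Minimal sketch of Expected Calibration Error (ECE) with equal-width bins:
# a weighted average of |accuracy - mean confidence| over probability bins.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

print(expected_calibration_error([0, 1, 1, 0, 1], [0.1, 0.8, 0.6, 0.3, 0.9]))
```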

A method used to assign scores to pairs of predicted probabilities and actual discrete outcomes, so that different predictive methods can be compared, is called a scoring rule.

Software implementations

  • MoRPE[13] is a trainable probabilistic classifier that uses isotonic regression for probability calibration. It solves the multiclass case by reduction to binary tasks. It is a type of kernel machine that uses an inhomogeneous polynomial kernel.

References

from Grokipedia
Probabilistic classification is a form of classification that estimates the probability of each possible class label given an input instance's features, providing a distribution over outcomes rather than a deterministic assignment. This approach leverages statistical models to quantify uncertainty, enabling applications in domains requiring calibrated confidence scores, such as fraud detection and medical prognosis.

Probabilistic classifiers are broadly divided into generative and discriminative models based on their modeling strategy. Generative models, exemplified by Naive Bayes, learn the joint distribution $P(X, Y)$ over features $X$ and labels $Y$, then apply Bayes' rule to derive the posterior $P(Y \mid X) = \frac{P(X, Y)}{P(X)}$. This involves estimating class priors $P(Y)$ and class-conditional densities $P(X \mid Y)$, often under assumptions like feature independence to reduce model complexity. In contrast, discriminative models, such as logistic regression, directly parameterize $P(Y \mid X)$ without modeling $P(X)$, focusing on the decision boundary between classes. Logistic regression, for instance, employs a logistic (sigmoid) function to map linear combinations of features to probabilities between 0 and 1, optimized via maximum likelihood estimation.

Key advantages of probabilistic classification include its ability to handle noisy or incomplete data through probabilistic inference and to provide interpretable uncertainty measures, which are crucial for risk-sensitive decisions. Generative approaches excel in low-data regimes or when generating synthetic samples is beneficial, while discriminative methods typically achieve higher accuracy with abundant training data due to their focus on boundary estimation. Despite simplifying assumptions like feature independence in Naive Bayes, these models demonstrate robust empirical performance across tasks, including text categorization and image recognition.

Overview and Fundamentals

Definition and Core Concepts

In machine learning, classification tasks involve assigning input data points to discrete categories or classes based on observed features, typically using a set of labeled examples to learn a mapping from inputs to outputs. This paradigm assumes that the model generalizes from known input-output pairs to predict labels for unseen data. Probabilistic classification extends this framework by having models output a probability distribution over possible class labels for a given input, rather than a single hard prediction, thereby quantifying uncertainty in the predictions. This approach, rooted in probability theory, allows for more nuanced decision-making, such as selecting classes based on risk thresholds or combining predictions with prior knowledge.

At its core, probabilistic classification relies on estimating posterior probabilities, denoted $P(y \mid x)$, which represent the likelihood of each class $y$ given the input features $x$, often derived via Bayes' theorem: $P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$. This enables the application of Bayesian decision theory, a foundational principle that minimizes expected loss by choosing actions (e.g., class assignments) that optimize a loss function over the posterior distribution. The paradigm naturally handles both binary (two classes) and multi-class problems (more than two classes) by extending the distribution to multiple outcomes, providing a unified way to model uncertainty across scenarios.

Predictive outcomes in AI, particularly in the context of probabilistic classification, refer to the forecasts or classifications generated by models operating in uncertain environments, such as those involving noise, incomplete data, or randomness. These outcomes are inherently probabilistic, outputting likelihoods (e.g., a 70% chance of rain) or distributions over possible results rather than deterministic answers, thereby capturing patterns in complex and noisy datasets. Characteristics include the ability to handle variability through probability estimates and, in some cases, controlled randomness, as seen in generative models like large language models that produce varied text outputs. This approach is crucial for applications requiring adaptability, such as weather and stock predictions based on historical patterns, probabilistic neural networks, recommendation systems with confidence scores, and tasks in natural language processing and image recognition.

The historical foundations trace back to the 18th century with Thomas Bayes' formulation of Bayes' theorem in his 1763 essay, which provided the probabilistic basis for updating beliefs based on evidence and laid the groundwork for classifiers like Naive Bayes, a simple yet effective probabilistic method assuming feature independence.
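
A small numeric sketch of the Bayes'-theorem update above; the priors and likelihoods are hypothetical values for a two-class problem.

```python
# Worked sketch of the posterior P(y|x) from Bayes' theorem; the priors and
# likelihoods are hypothetical numbers for a two-class problem.
priors = {"spam": 0.4, "ham": 0.6}            # P(y)
likelihoods = {"spam": 0.05, "ham": 0.001}    # P(x | y) for one observed feature vector x

evidence = sum(likelihoods[c] * priors[c] for c in priors)               # P(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}  # P(y | x)
print(posteriors)  # probabilities over classes, summing to one
```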

Probabilistic vs. Deterministic Classification

Deterministic classification methods produce hard label outputs, assigning a single class to each input instance based on a decision rule such as the argmax over score functions or the sign of a margin. These approaches are common in models like support vector machines (SVMs), which separate classes via a maximum-margin hyperplane and classify new points deterministically, and decision trees, which traverse branches to reach a leaf node representing a specific class without quantifying uncertainty. In contrast, probabilistic classification outputs a probability distribution over possible classes for each input, representing the posterior probability of class membership given the features, as defined in core concepts.

The primary differences lie in how these methods handle uncertainty: deterministic classifiers offer no inherent confidence measures, relying solely on the final class assignment, whereas probabilistic classifiers provide calibrated probability estimates that enable confidence scoring, risk-sensitive thresholding, and enhanced interpretability in ambiguous scenarios. For instance, probabilistic outputs allow decision-makers to adjust classification thresholds based on domain-specific costs, such as prioritizing recall over precision, which is particularly valuable when outcomes vary in severity. This probabilistic framing also facilitates integration with Bayesian decision theory for optimal actions under uncertainty, unlike the binary nature of deterministic predictions.

Probabilistic classification offers advantages on imbalanced datasets by enabling cost-sensitive adjustments to probability thresholds, mitigating the bias toward majority classes that plagues deterministic hard-label approaches. In high-stakes applications like medical diagnosis, these methods improve decision-making by quantifying uncertainty, allowing clinicians to weigh treatment risks against probabilistic outcomes rather than relying on categorical rulings, as highlighted in early analyses of diagnostic reasoning. However, probabilistic models often incur greater computational overhead due to the need to estimate full distributions, for example through softmax normalization or explicit calibration, compared to the simpler optimization in deterministic counterparts. A practical example is spam detection, where a probabilistic classifier might output a 0.8 probability of an email being spam, enabling nuanced actions like flagging for review instead of automatic deletion, whereas a deterministic classifier would output only "spam" or "not spam" without confidence nuance.
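
A brief sketch of the spam-detection example, with hypothetical thresholds chosen to reflect different action costs:

```python
# Sketch of risk-sensitive thresholding on a probabilistic spam score;
# the probability and thresholds are hypothetical.
def route_email(p_spam, delete_above=0.95, flag_above=0.5):
    # A deterministic classifier would only say "spam"/"not spam";
    # a calibrated probability lets us pick actions by expected cost.
    if p_spam >= delete_above:
        return "delete"
    if p_spam >= flag_above:
        return "flag for review"
    return "deliver"

print(route_email(0.8))  # -> "flag for review"
```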

Model Types and Approaches

Generative Models

Generative models in probabilistic classification estimate the joint probability distribution $P(\mathbf{x}, y)$ over input features $\mathbf{x}$ and class labels $y$, enabling inference of the posterior class probabilities $P(y \mid \mathbf{x})$ required for classification. This is achieved through Bayes' rule, which states:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y) \, P(y)}{P(\mathbf{x})}$$

Here, $P(\mathbf{x} \mid y)$ represents the class-conditional likelihood, $P(y)$ is the prior class probability, and $P(\mathbf{x})$ is the evidence or marginal likelihood, computed as $P(\mathbf{x}) = \sum_y P(\mathbf{x} \mid y) P(y)$. The derivation follows directly from the definition of conditional probability: $P(y \mid \mathbf{x}) = P(\mathbf{x}, y) / P(\mathbf{x})$, where the joint $P(\mathbf{x}, y) = P(\mathbf{x} \mid y) P(y)$. This framework allows generative models to not only classify but also generate synthetic data samples from the learned distribution.

A key example is the Gaussian Naive Bayes classifier, which incorporates the "naive" assumption of conditional independence among features given the class: $P(\mathbf{x} \mid y) = \prod_{i=1}^d P(x_i \mid y)$. For continuous features, each $P(x_i \mid y)$ is modeled as a univariate Gaussian distribution with class-specific mean $\mu_{yi}$ and variance $\sigma_{yi}^2$:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_{yi}} \exp\!\left( -\frac{(x_i - \mu_{yi})^2}{2\sigma_{yi}^2} \right)$$
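
A minimal from-scratch sketch of a Gaussian Naive Bayes classifier implementing these formulas, assuming NumPy; it is an illustration rather than production code.

```python
# Minimal from-scratch Gaussian Naive Bayes sketch implementing the formulas
# above; assumes NumPy and is meant as an illustration only.
import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}              # P(y)
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}        # mu_{yi}
        self.vars_ = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes_}   # sigma^2_{yi}
        return self

    def predict_proba(self, X):
        # Unnormalized log P(x, y) = log P(y) + sum_i log P(x_i | y) under the
        # naive independence assumption with Gaussian class-conditional densities.
        log_joint = np.column_stack([
            np.log(self.priors_[c])
            - 0.5 * np.sum(np.log(2 * np.pi * self.vars_[c])
                           + (X - self.means_[c]) ** 2 / self.vars_[c], axis=1)
            for c in self.classes_
        ])
        log_joint -= log_joint.max(axis=1, keepdims=True)   # for numerical stability
        probs = np.exp(log_joint)
        return probs / probs.sum(axis=1, keepdims=True)     # normalize by the evidence P(x)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(GaussianNaiveBayes().fit(X, y).predict_proba(X[:3]).round(3))
```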