One-class classification
from Wikipedia

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, is an approach to the training of binary classifiers in which only examples of one of the two classes are used[1].

Examples include the monitoring of helicopter gearboxes,[2][3][4] motor failure prediction,[5] or assessing the operational status of a nuclear plant as 'normal':[6] In such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focussed on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach.[7]

In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm.

Overview


The term one-class classification (OCC) was coined by Moya & Hush (1996),[8] and many applications can be found in the scientific literature, for example outlier detection, anomaly detection, and novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling of the non-target classes is not strictly required.[9]

Introduction

Figure: the hypersphere containing the target data, with center c and radius r. Objects on the boundary are support vectors, and two objects lie outside the boundary with slack greater than 0.

SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r and center c) containing all the data points.[10] This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form:

$$\min_{r,\, c} \; r^2 \quad \text{subject to} \quad \|\Phi(x_i) - c\|^2 \le r^2 \;\; \forall i = 1, 2, \ldots, n$$

However, the above formulation is highly restrictive and sensitive to the presence of outliers. Therefore, a flexible formulation that allows for the presence of outliers is given below:

$$\min_{r,\, c,\, \zeta} \; r^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad \|\Phi(x_i) - c\|^2 \le r^2 + \zeta_i, \;\; \zeta_i \ge 0 \;\; \forall i = 1, 2, \ldots, n$$

From the Karush–Kuhn–Tucker (KKT) conditions for optimality, we get

$$c = \sum_{i=1}^{n} \alpha_i \Phi(x_i),$$

where the $\alpha_i$'s are the solution to the following optimization problem:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i \kappa(x_i, x_i) - \sum_{i,j=1}^{n} \alpha_i \alpha_j \kappa(x_i, x_j)$$

subject to

$$\sum_{i=1}^{n} \alpha_i = 1 \quad \text{and} \quad 0 \le \alpha_i \le C \;\; \forall i = 1, 2, \ldots, n.$$

The introduction of the kernel function $\kappa(\cdot,\cdot)$ provides additional flexibility to the One-class SVM (OSVM) algorithm.[11]
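In practice, this kind of kernelized one-class boundary is available in standard libraries. Below is a minimal sketch, assuming scikit-learn is installed, that fits an RBF-kernel one-class SVM (scikit-learn's OneClassSVM, closely related to SVDD with an RBF kernel) on target-class samples only; the synthetic data and parameter values are illustrative.

```python
# Minimal one-class SVM sketch: train on target-class data only,
# then label test points as +1 (target) or -1 (outlier).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_target = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # target class only
X_test = np.array([[0.1, -0.2],    # near the target cloud
                   [6.0, 6.0]])    # far away, expected outlier

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)  # nu bounds the outlier fraction
clf.fit(X_target)

print(clf.predict(X_test))            # +1 = target class, -1 = outlier
print(clf.decision_function(X_test))  # signed distance to the learned boundary
```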

PU (Positive Unlabeled) learning


A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only positive and unlabeled sample points.[12]

In PU learning, two sets of examples are assumed to be available for training: the positive set $P$ and a mixed (unlabeled) set $U$, which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semi-supervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text,[13][14][15] time series,[16] bioinformatics tasks,[17][18] and remote sensing data.[19]

Approaches


Several approaches have been proposed to solve one-class classification (OCC). These approaches can be divided into three main categories: density estimation, boundary methods, and reconstruction methods.[6]

Density estimation methods


Density estimation methods rely on estimating the density of the data points and setting a threshold on that density. They typically assume a parametric distribution, such as a Gaussian or a Poisson distribution, after which discordancy tests can be used to test new objects. These methods are robust to scale variance.

The Gaussian model[20] is one of the simplest methods for creating one-class classifiers. Due to the central limit theorem (CLT),[21] these methods work best when a large number of samples is present and when they are perturbed by small independent error values. The probability distribution for a d-dimensional object $z$ is given by:

$$p_{\mathcal{N}}(z;\, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (z - \mu)^{\mathsf{T}} \Sigma^{-1} (z - \mu) \right),$$

where $\mu$ is the mean and $\Sigma$ is the covariance matrix. Computing the inverse of the covariance matrix ($\Sigma^{-1}$) is the costliest operation, and in cases where the data is not scaled properly or has singular directions, the pseudo-inverse $\Sigma^{+}$ is used to approximate the inverse.[22]
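A minimal NumPy sketch of this Gaussian model follows, using numpy.linalg.pinv for the pseudo-inverse; the percentile-based acceptance threshold and the synthetic data are illustrative choices, not part of the cited method.

```python
# Fit a Gaussian one-class model on target data and threshold its log-density.
import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    sigma_inv = np.linalg.pinv(sigma)   # pseudo-inverse handles singular directions
    return mu, sigma, sigma_inv

def log_density(X, mu, sigma, sigma_inv):
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)                    # assumes non-singular sigma here
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X_target = rng.normal(size=(1000, 3))
mu, sigma, sigma_inv = fit_gaussian(X_target)

# Accept a point if its log-density exceeds the 5th percentile of training densities.
threshold = np.percentile(log_density(X_target, mu, sigma, sigma_inv), 5)
X_test = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
print(log_density(X_test, mu, sigma, sigma_inv) >= threshold)  # expect [ True False]
```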

Boundary methods


Boundary methods focus on setting a boundary around a small set of points, called target points, and attempt to minimize the volume enclosed by that boundary. Because boundary methods rely on distances, they are not robust to scale variance. The k-centers method, NN-d, and SVDD are some of the key examples.

K-centers

In the k-centers algorithm,[23] small balls with equal radius are placed so as to minimize the maximum of all minimum distances between training objects and the centers. Formally, the following error is minimized:

$$\varepsilon_{k\text{-centers}} = \max_{i} \left( \min_{k} \|x_i - \mu_k\|^2 \right)$$

The algorithm uses a forward search method with random initialization, where the radius is determined by the maximum distance that any given ball should capture. After the centers are determined, the distance for any given test object $z$ can be calculated as

$$d_{k\text{-centers}}(z) = \min_{k} \|z - \mu_k\|^2.$$
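The following is a minimal NumPy sketch of a k-centers style description under simplifying assumptions (a greedy farthest-point forward search and a hard radius threshold); it illustrates the formulas above rather than reproducing the cited algorithm exactly.

```python
# Greedy k-centers one-class description: radius = worst-case nearest-center
# distance on the training data; test points beyond that radius are outliers.
import numpy as np

def fit_k_centers(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # random initialization
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])          # farthest point becomes the next center
    centers = np.array(centers)
    d_min = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
    radius = d_min.max()                         # epsilon = max_i min_k ||x_i - mu_k||
    return centers, radius

def distance_to_centers(Z, centers):
    return np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)

rng = np.random.default_rng(1)
X_target = rng.normal(size=(300, 2))
centers, radius = fit_k_centers(X_target, k=5)

Z = np.array([[0.0, 0.0], [8.0, 8.0]])
print(distance_to_centers(Z, centers) <= radius)   # expect [ True False]
```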

Reconstruction methods


Reconstruction methods use prior knowledge and a generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are k-means clustering, learning vector quantization, and self-organizing maps.

Applications


Document classification


The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples; however, studies have shown that there are many valid reasons for using only positive examples. When the SVM algorithm is modified to use only positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful is in trying to identify a web browser's sites of interest based only on the user's browsing history.

Biomedical studies


One-class classification can be particularly useful in biomedical studies, where data from other classes can often be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal).[24] The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters.[25] To apply typicality to one-class classification for biomedical studies, each new observation is compared to the target class and identified as an outlier or a member of the target class.[24]

Unsupervised Concept Drift Detection


One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data shares characteristics with the initial data. A concept refers to the fixed probability distribution from which data is drawn. In unsupervised concept drift detection, the goal is to detect whether the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important: unseen data is classified as typical or outlier depending on its characteristics, i.e., whether it comes from the initial concept or not. Unsupervised drift detection, however, monitors the flow of data and signals a drift if there is a significant amount of change or anomalies. Unsupervised concept drift detection can be identified as the continuous form of one-class classification.[26] One-class classifiers are used for detecting concept drifts.[27]

from Grokipedia
One-class classification (OCC), also known as unary classification, is a paradigm in which a model is trained exclusively on examples from a single target class to distinguish those instances from outliers, anomalies, or data from other unseen classes at inference time. This approach addresses scenarios where negative or counterexamples are unavailable, scarce, or prohibitively expensive to obtain, making it a specialized setting distinct from traditional binary or multi-class classification. The core objective is to learn a boundary or representation that encapsulates the target class's characteristics, enabling the rejection of non-conforming inputs without prior knowledge of alternative classes.

OCC has roots in early statistical methods for outlier detection and concept learning. Foundational work includes the Support Vector Data Description (SVDD) introduced by Tax and Duin in 1999, which models the target class as a hypersphere in feature space, and the One-Class Support Vector Machine (OC-SVM) proposed by Schölkopf et al. in 1999, which adapts SVM principles to estimate support for the target distribution alone. The term "one-class classification" was formalized by Moya et al. in 1993, building on pattern recognition techniques for scenarios like novelty detection. Over time, OCC has evolved to incorporate density estimation methods such as Gaussian Mixture Models (GMM) and clustering-based approaches like the Local Outlier Factor (LOF), alongside isolation techniques such as Isolation Forest for efficient anomaly flagging. More recent deep learning integrations, including Deep SVDD and autoencoder-based reconstruction models, leverage neural networks to learn hierarchical features tailored to the single class, often outperforming classical methods on high-dimensional data like images.

The technique finds broad applications in domains requiring robust anomaly detection, such as intrusion detection in cybersecurity, where normal network traffic defines the target class; fault detection in industrial systems; and medical diagnostics, including pneumonia screening from chest X-rays or fMRI analysis for brain anomalies, all benefiting from the absence of labeled abnormal samples during training. In biometrics, OCC supports anti-spoofing measures and active authentication by modeling legitimate user patterns. Other uses span remote sensing for land cover classification with limited samples, food authentication to identify adulterated products, and out-of-distribution (OOD) detection in autonomous systems to flag novel environmental conditions. These applications underscore OCC's value for real-world imbalanced datasets, where acquiring diverse negative examples is impractical.

Despite its strengths, OCC faces challenges including the precise setting of decision thresholds to balance false positives and negatives, vulnerability to adversarial perturbations that exploit the lack of negative training data, and difficulties in generalizing to complex, high-dimensional distributions without overfitting to the target class. Recent advances mitigate these through generative models such as GANs (e.g., OCGAN) that synthesize pseudo-negative samples and self-supervised techniques that enhance feature robustness, as evidenced by benchmarks on which deep OCC methods achieve superior area under the ROC curve (AUC) scores. Reviews as of 2024 continue to explore advances in deep and hybrid methods, and ongoing research emphasizes hybrid approaches combining classical and neural methods to improve scalability and interpretability in practical deployments.

Introduction

Definition and Problem Statement

One-class classification is a paradigm in which a model is trained exclusively on data from a single target class to distinguish instances belonging to that class from outliers or instances from unknown classes. This approach is particularly useful in scenarios where negative examples (i.e., data from non-target classes) are unavailable, difficult to obtain, or poorly representative, allowing the model to learn a boundary or representation of the target class based solely on positive instances.

Formally, given a dataset $D = \{ \mathbf{x}_i \}_{i=1}^n$ where all $\mathbf{x}_i$ belong to the target class, the objective is to learn a decision function $f: \mathcal{X} \to \{-1, 1\}$ such that $f(\mathbf{x}) \approx 1$ for $\mathbf{x}$ in the target class and $f(\mathbf{x}) \approx -1$ for outliers from unknown classes. The training process minimizes an empirical risk functional over the target data only, without relying on labeled counterexamples, often incorporating regularization to ensure generalization to unseen anomalies. A key assumption underlying this formulation is that the target class data adequately represents the normal or expected behavior, while non-target data is absent during training, enabling the model to detect deviations based on the learned target distribution or boundary. This paradigm is commonly applied in real-world problems such as fraud detection, where training data consists only of legitimate transactions and the model must identify anomalous activities without prior examples of fraudulent ones.
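To make the abstract setup concrete, here is a minimal sketch of a generic one-class model: a score computed from target data only, turned into a decision function f(x) in {-1, +1} by thresholding. The distance-to-mean score and the 95th-percentile threshold are illustrative placeholders for any of the concrete methods described later.

```python
# Generic one-class skeleton: fit on target data, threshold a score at inference.
import numpy as np

class SimpleOneClassModel:
    def fit(self, X_target):
        self.center_ = X_target.mean(axis=0)
        scores = np.linalg.norm(X_target - self.center_, axis=1)
        self.threshold_ = np.percentile(scores, 95)   # tolerate ~5% of target data
        return self

    def decision_function(self, X):
        # Positive inside the accepted region, negative outside.
        return self.threshold_ - np.linalg.norm(X - self.center_, axis=1)

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)

rng = np.random.default_rng(0)
model = SimpleOneClassModel().fit(rng.normal(size=(500, 2)))
print(model.predict(np.array([[0.2, -0.1], [7.0, 7.0]])))   # expect [ 1 -1]
```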

Historical Context and Motivation

The term "one-class classification" was formalized by Moya and Hush in 1996. One-class classification emerged in the mid- to late as a response to the limitations of traditional in scenarios where only examples from one class (typically the "normal" or target class) are available, making it particularly suited for anomaly or novelty detection tasks. A foundational contribution was the Support Vector Data Description (SVDD) method introduced by and Duin in 1999, which adapts principles to enclose normal data points within a hypersphere, thereby defining a boundary for the target class without requiring counterexamples. This approach addressed the challenge of modeling data distributions in high-dimensional spaces using kernel functions, laying the groundwork for subsequent one-class techniques. The formalization of one-class support vector machines (OC-SVM) by Schölkopf et al. in further advanced the field, proposing a hyperplane-based method that separates the target from the origin in a transformed feature to estimate the support of the distribution. These early developments were motivated by real-world applications where negative class examples are scarce or difficult to obtain, such as detecting rare network intrusions in cybersecurity, where vast amounts of normal traffic exist but anomalous patterns are underrepresented or unlabeled. Unlike , which assumes balanced labeled datasets from both classes, one-class methods enable learning solely from positive instances, reducing dependency on costly or biased labeling processes. In the , the paradigm relied heavily on statistical and kernel-based methods like SVDD and OC-SVM, which excelled in low-to-moderate dimensional data but struggled with complex, high-dimensional structures. Post-2010, the integration of marked a significant evolution, with techniques such as Deep SVDD adapting architectures to learn compact representations of normal data for in and data. This shift was driven by the need to handle the "curse of dimensionality" in modern datasets, improving scalability and performance in domains like fraud detection and fault monitoring.

Connection to Anomaly Detection

One-class classification represents a supervised subset of anomaly detection, wherein models are trained solely on labeled instances of the target class (typically representing normal or expected behavior) to delineate boundaries that flag deviations as anomalies. This approach is particularly suited to scenarios where anomalous data is rare, unlabeled, or costly to obtain, allowing the classifier to characterize the target distribution without requiring negative examples during training. In essence, it operationalizes anomaly detection by learning a compact description of normality, such that test points outside this description are deemed outliers.

A primary distinction from fully unsupervised anomaly detection lies in the level of supervision: unsupervised methods, such as clustering or statistical proximity-based techniques, operate without any labels and infer anomalies from the overall data structure, often assuming a uniform representation of normality and abnormality. One-class classification, by contrast, explicitly utilizes positive labels to focus the learning process on the target class distribution, enabling more precise boundary estimation and reducing sensitivity to the unknown characteristics of anomalies. This semi-supervised nature enhances robustness in imbalanced settings, where normal data dominates but provides the necessary guidance for effective detection.

Historically, one-class classification emerged from the foundational outlier detection literature, with Douglas M. Hawkins providing a seminal definition in 1980: an outlier is "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." This concept evolved into formalized one-class methods through semi-supervised paradigms in the late 1990s and early 2000s, as researchers addressed the limitations of pure outlier detection by incorporating target-class supervision. In practice, one-class classification integrates with broader anomaly detection frameworks by adapting unsupervised primitives, such as isolation forests for random partitioning or autoencoders for reconstruction, under the constraint of target-label guidance, ensuring the model prioritizes the learned normal region over global patterns. For example, density estimation serves as a bridging technique, where one-class variants approximate the probability density of the labeled target class to score anomalies based on low-density regions.

Positive-Unlabeled (PU) Learning

Positive-unlabeled (PU) learning is a semi-supervised framework related to one-class classification that addresses scenarios where only a subset of positive examples from the target class is labeled, while the remaining data consists of unlabeled examples drawn from the full population, which includes both positives and negatives. The objective is to train a binary classifier capable of identifying positive instances without relying on explicitly labeled negative examples, making it suitable for applications where negative labeling is costly or infeasible. This setup contrasts with traditional supervised learning by treating the unlabeled data as a mixture requiring careful handling to avoid bias toward the positive class. Unlike strict one-class classification, which uses only positive examples, PU learning leverages the unlabeled data to infer negative characteristics.

Central to PU learning are several core assumptions that enable reliable inference. The labeled positives must be a random sample from the true positive distribution, formalized under the Selected Completely At Random (SCAR) assumption, where the probability of labeling a positive example is a constant $c$ independent of its features, ensuring $P(s=1 \mid X, Y=1) = c$ and $P(s=1 \mid X, Y=0) = 0$. Additionally, the unlabeled data must contain both positive and negative instances, with the class prior $\pi = P(Y=1)$ being unknown but estimable, and the labeled positives are assumed to be reliable without noise. These assumptions prevent systematic bias and allow the unlabeled set to serve as a proxy for the complete distribution. Violations, such as instance-dependent labeling, can lead to degraded performance unless addressed by specialized variants.

A typical workflow in PU learning follows a two-step process to construct an effective classifier. In the first step, the class prior $\pi$ is estimated from the unlabeled data, often by training an initial classifier to predict the labeling probability $g(X) = P(s=1 \mid X)$ using the labeled positives and unlabeled examples, then computing $\hat{\pi} = \frac{1}{m} \sum_{X \in U} \frac{g(X)}{c}$, where $m$ is the size of the unlabeled set and $c$ is derived from the positives' predicted labels. The second step involves training a binary classifier on the positives and pseudo-negatives generated from the unlabeled set, weighted by the estimated prior to mimic a balanced dataset, or directly optimizing an adjusted loss. This approach enables the use of standard algorithms while mitigating the absence of negatives.

The theoretical foundation relies on rewriting the true classification risk to derive an unbiased estimator amenable to optimization. The expected risk under a loss function $\ell$ is

$$R(f) = \pi R_p(f) + (1 - \pi) R_n(f),$$

where $R_p(f) = \mathbb{E}_{(X,Y) \sim P} [\ell(f(X), 1)]$ is the positive-class risk (with $Y=1$) and $R_n(f) = \mathbb{E}_{(X,Y) \sim N} [\ell(f(X), 0)]$ is the negative-class risk. A standard unbiased estimator for PU learning is

$$\hat{R}(f) = \frac{\pi}{c} \hat{R}_P(f, +1) + \hat{R}_U(f, -1) - \frac{\pi}{c} \hat{R}_P(f, -1),$$

where $\hat{R}_P(f, +1)$ is the empirical risk on positive examples assigned positive labels, $\hat{R}_U(f, -1)$ on unlabeled examples assigned negative labels, and $\hat{R}_P(f, -1)$ on positive examples assigned negative labels. This formulation provides a principled way to train without explicit negatives.
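As a concrete illustration of the two-step workflow under the SCAR assumption, here is a minimal sketch in the style of Elkan and Noto's approach, using scikit-learn's logistic regression; the synthetic data, the labeling fraction, the in-sample estimate of $c$, and the 0.5 decision threshold are illustrative assumptions rather than a reference implementation.

```python
# Two-step PU learning sketch under SCAR: estimate c = P(s=1 | y=1),
# then rescale the "labeled vs. unlabeled" classifier to approximate P(y=1 | x).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic population: positives near +2, negatives near -2 (hypothetical).
X_pos = rng.normal(+2.0, 1.0, size=(500, 2))
X_neg = rng.normal(-2.0, 1.0, size=(1500, 2))

# Only a random 30% of positives are labeled (s = 1); the rest are unlabeled.
labeled_mask = rng.random(len(X_pos)) < 0.3
X_labeled = X_pos[labeled_mask]                          # labeled positives
X_unlabeled = np.vstack([X_pos[~labeled_mask], X_neg])   # mixed/unlabeled set U

# Step 1: train a classifier g(x) ~ P(s=1 | x) on "labeled vs. unlabeled".
X_all = np.vstack([X_labeled, X_unlabeled])
s_all = np.concatenate([np.ones(len(X_labeled)), np.zeros(len(X_unlabeled))])
g = LogisticRegression(max_iter=1000).fit(X_all, s_all)

# Estimate the labeling frequency c as the mean of g over labeled positives.
c_hat = g.predict_proba(X_labeled)[:, 1].mean()

# Step 2: convert to P(y=1 | x) = P(s=1 | x) / c and threshold at 0.5.
def predict_positive(X_test):
    p_y = g.predict_proba(X_test)[:, 1] / c_hat
    return np.where(p_y >= 0.5, 1, -1)

print(predict_positive(np.array([[2.0, 2.0], [-2.0, -2.0]])))  # expect [ 1 -1]
```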
Variants of PU learning extend the framework to handle specific challenges, such as sparsely labeled settings where positives are rare or imbalanced. For instance, methods like streaming PU (SPU) adapt the approach to dynamic data environments with limited labels, while instance-dependent PU relaxes the SCAR assumption to allow non-constant labeling probabilities, improving robustness in real-world scenarios with biased labeling. These adaptations maintain the core risk estimation while incorporating additional priors or constraints for rare positive classes.

Approaches

Density Estimation Methods

Density estimation methods in one-class classification focus on modeling the probability density function of the target class using only positive training examples. The core principle involves estimating the likelihood $p(\mathbf{x} \mid \text{target})$ from the available data and classifying a new point $\mathbf{x}$ as belonging to the target class if its estimated density exceeds a predefined threshold, otherwise labeling it as an outlier. This approach assumes that target class instances are densely clustered in the feature space, while outliers lie in low-density regions. Such methods are particularly suited for scenarios where the target distribution is well represented in the training data, enabling probabilistic discrimination without requiring counterexamples.

Key techniques include kernel density estimation (KDE), which provides a non-parametric way to approximate the target density by placing a kernel function at each training point. A common formulation uses Gaussian kernels in the Parzen window estimator, given by

$$\hat{p}(\mathbf{x}) = \frac{1}{n h^d} \sum_{i=1}^n K\!\left( \frac{\mathbf{x} - \mathbf{x}_i}{h} \right),$$

where $n$ is the number of training samples, $d$ is the data dimensionality, $K(\cdot)$ is the kernel function (e.g., a standard Gaussian), and $h$ is the bandwidth parameter controlling smoothness. The Parzen window method, a foundational form of KDE, applies a sliding window to compute local densities, making it effective for capturing irregular shapes in the target distribution.

Specific examples encompass mixture of Gaussians (MoG) models, which fit the target data as a weighted sum of Gaussian components to handle multi-modal distributions, and histogram-based estimation for discrete or low-dimensional data, where the feature space is binned and frequencies are normalized to densities. Another important method is the Local Outlier Factor (LOF), which computes the local density deviation of a point relative to its k-nearest neighbors, identifying outliers as points with significantly lower local density than their neighbors; LOF is particularly effective for clusters of varying density.

These methods offer advantages such as the ability to handle multi-modal and complex distributions through flexible kernel choices or mixture components, while providing interpretable probabilistic outputs for confidence scoring. For instance, MoG models have been applied to outlier detection and KDE variants to EEG analysis. Parameter selection is crucial; the bandwidth $h$ in KDE or the number of components in MoG is typically tuned via cross-validation on the target data to balance bias and variance, ensuring the model generalizes without overfitting to noise.
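A minimal sketch of one-class scoring via kernel density estimation follows, using scikit-learn's KernelDensity; the bandwidth and the percentile-based threshold are illustrative choices that would normally be tuned by cross-validation on target data.

```python
# KDE one-class scoring: fit a Gaussian-kernel density on target data and
# reject test points whose log-density falls below a training-set percentile.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_target = rng.normal(size=(1000, 2))          # target-class training data only

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_target)
log_density_train = kde.score_samples(X_target)   # log p-hat(x) for each training sample
threshold = np.percentile(log_density_train, 5)   # reject the lowest-density 5%

X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
is_target = kde.score_samples(X_test) >= threshold
print(is_target)                                   # expect [ True False]
```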

Boundary Methods

Boundary methods in one-class classification focus on learning a closed boundary that encloses the target class data, effectively separating it from potential outliers by defining a decision region around the known examples. The core principle involves constructing a geometric enclosure, such as a hypersphere or ellipsoid, that minimizes the volume of the region containing the target data while allowing a small fraction of points to lie outside as errors or outliers. This approach assumes that the target data forms a compact cluster in the feature space, and any point falling outside the boundary is classified as an outlier. Isolation-based methods such as Isolation Forest also fit here by using random partitioning to isolate anomalies, scoring points by the average path length in isolation trees; shorter paths indicate anomalies because they are easier to isolate.

A prominent technique is the Support Vector Data Description (SVDD), which models the target data as lying within a hypersphere of minimal radius in a high-dimensional feature space. Introduced by Tax and Duin, SVDD solves an optimization problem to find the smallest hypersphere that encloses most of the target points, using slack variables to tolerate a controlled fraction of outliers. The primal formulation minimizes the objective function $R^2 + \frac{1}{\nu n} \sum_{i=1}^n \xi_i$, subject to $\|\phi(x_i) - a\|^2 \leq R^2 + \xi_i$ for all $i = 1, \dots, n$ and $\xi_i \geq 0$, where $R$ is the radius of the hypersphere, $a$ is its center, $\phi$ is a feature map into a (possibly kernel-induced) feature space, $n$ is the number of target samples, and $\nu \in (0,1]$ is a hyperparameter that trades off the volume of the description against the errors by controlling the fraction of outliers and support vectors. The dual form of SVDD is derived using Lagrange multipliers, leading to a quadratic program that maximizes $\sum_{i=1}^n \alpha_i K(x_i, x_i) - \sum_{i,j=1}^n \alpha_i \alpha_j K(x_i, x_j)$, subject to $0 \leq \alpha_i \leq \frac{1}{\nu n}$ and $\sum_{i=1}^n \alpha_i = 1$, where $K$ is a kernel function enabling the handling of nonlinear boundaries without explicitly computing $\phi$.

Another key method is the one-class support vector machine (SVM), which constructs a hyperplane that separates the target data from the origin in the feature space, maximizing the margin while rejecting a fraction of the data as outliers. Developed by Schölkopf et al., this approach treats the origin as a representative of the outlier class and learns a decision function $f(x) = \operatorname{sgn}(w \cdot \phi(x) - \rho)$, where points with $f(x) > 0$ are accepted as target class. The optimization minimizes $\frac{1}{2} \|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^n \xi_i - \rho$, subject to $w \cdot \phi(x_i) \geq \rho - \xi_i$ and $\xi_i \geq 0$, again using $\nu$ to balance the trade-off between margin maximization and the outlier fraction. Like SVDD, the one-class SVM employs the kernel trick in its dual formulation to capture nonlinear structures.

Other boundary methods include the Minimum Volume Covering Ellipsoid (MVCE), which fits the smallest-volume ellipsoid enclosing the target data, offering a more flexible shape than a hypersphere for elongated distributions. This technique, explored in the context of one-class data description, solves a semidefinite program to minimize the ellipsoid's volume while covering most points, with applications in robust one-class modeling.
Nearest neighbor-based boundaries, such as Nearest Neighbor Data Description (NNDD), define the enclosure using distances to the k-nearest neighbors of target points, classifying a test point as an outlier if its distance to the nearest target neighbor exceeds a threshold derived from the neighbor distances in the training data. This nonparametric approach, proposed by Tax and Duin, is computationally efficient for low-dimensional data and avoids assumptions about data sphericity. The hyperparameter $\nu$ is central to both SVDD and the one-class SVM, serving as an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors, allowing users to tune the strictness of the boundary.
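The isolation-based boundaries mentioned above are also available off the shelf; the following minimal sketch uses scikit-learn's IsolationForest, with the contamination value an illustrative guess at the expected outlier fraction (a kernelized OneClassSVM example appears earlier in the article).

```python
# Isolation Forest boundary sketch: fit on target data, flag test anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_target = rng.normal(size=(1000, 2))

iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=0)
iso.fit(X_target)

X_test = np.array([[0.1, 0.0], [8.0, -8.0]])
print(iso.predict(X_test))         # +1 = inlier, -1 = anomaly
print(iso.score_samples(X_test))   # lower scores (shorter average paths) = more anomalous
```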

Reconstruction Methods

Reconstruction methods in one-class classification involve training generative models on target class data to learn representations that enable accurate reconstruction of normal instances, while outliers exhibit high reconstruction errors. The core principle is to minimize a reconstruction loss over the target distribution, thereby encoding the essential features of the target class in a compact latent representation; during inference, instances with reconstruction errors exceeding a learned threshold are classified as outliers. This approach leverages the model's inability to reconstruct unseen or anomalous patterns effectively, providing an anomaly score based on the error magnitude.

Standard autoencoders form a foundational technique in this category, consisting of an encoder that maps an input $\mathbf{x}$ to a low-dimensional latent representation $\mathbf{z}$ via a bottleneck layer, followed by a decoder that reconstructs $\hat{\mathbf{x}}$ from $\mathbf{z}$. Training minimizes the mean squared error (MSE) loss $L = \|\mathbf{x} - \hat{\mathbf{x}}\|^2$ solely on target data, often using multilayer perceptrons or convolutional architectures for high-dimensional inputs like images. The reconstruction error serves as the anomaly score, with a threshold typically set at the mean plus a multiple of the standard deviation of the errors on validation target data; this method has demonstrated effectiveness in detecting anomalies in telemetry by capturing nonlinear data manifolds.

Variational autoencoders (VAEs) extend this framework by imposing a probabilistic structure on the latent space, modeling $\mathbf{z}$ as drawn from a prior distribution (usually a standard Gaussian) via an approximate posterior $q(\mathbf{z}|\mathbf{x})$. The objective optimizes the evidence lower bound (ELBO), balancing reconstruction accuracy and latent regularization:

$$\mathcal{L} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}|\mathbf{z})] - D_{KL}\big(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big),$$

where the first term encourages faithful reconstruction and the KL divergence enforces alignment with the latent prior. For one-class tasks, the negative log-likelihood or ELBO value acts as the anomaly score, enabling detection of outliers through probabilistic deviation; this extension improves upon standard autoencoders by providing likelihood-based estimates in high-dimensional spaces, as shown in early applications to reconstruction probability-based anomaly scoring.

Generative adversarial network (GAN)-based models, such as OCGAN, adapt reconstruction principles by incorporating adversarial training to constrain the latent space to target class representations. OCGAN employs a denoising autoencoder backbone with dual discriminators, one for the latent space (ensuring uniformity of target encodings) and one for the visual space (ensuring realism of reconstructions), while a classifier distinguishes target from generated out-of-class samples via gradient-based latent exploration. This yields strong novelty detection on benchmark datasets, outperforming baselines by leveraging GAN stability to avoid mode collapse in one-class settings.

Deep variants like Deep Support Vector Data Description (Deep SVDD) integrate reconstruction elements with boundary constraints, training an encoder to map data into a hypersphere of minimal volume centered at a predefined point (e.g., derived from an initial pass of a pre-trained network), optionally in a reconstruction mode that minimizes both hypersphere compactness and MSE. This hybrid formulation combines generative reconstruction with a one-class objective, enhancing robustness to outliers during training, as evidenced by improved performance on anomaly detection tasks compared with pure reconstruction methods; such approaches draw on boundary methods for added compactness without direct enclosure optimization.
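A minimal PyTorch sketch of reconstruction-error scoring with a small autoencoder trained only on target-class data is shown below; the architecture, the mean-plus-three-standard-deviations threshold, and the synthetic data are illustrative assumptions rather than a reference implementation of any cited model.

```python
# Autoencoder reconstruction-error scoring for one-class classification.
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(1000, 20)                                   # stand-in target data
X_test = torch.cat([torch.randn(5, 20), torch.randn(5, 20) + 4.0])  # last 5 shifted (outliers)

autoencoder = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(),   # encoder with an 8-dimensional bottleneck
    nn.Linear(8, 20),              # decoder back to the input dimension
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):           # train on target data only
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X_train), X_train)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    train_err = ((autoencoder(X_train) - X_train) ** 2).mean(dim=1)
    # Threshold: mean + 3 standard deviations of the training reconstruction error.
    threshold = train_err.mean() + 3 * train_err.std()
    test_err = ((autoencoder(X_test) - X_test) ** 2).mean(dim=1)
    labels = (test_err <= threshold).int() * 2 - 1   # +1 = target, -1 = outlier
print(labels)
```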

Evaluation and Challenges

Performance Metrics

In one-class classification (OCC), performance evaluation is complicated by the absence or scarcity of negative examples during training, leading to severe class imbalance where the positive (target) class dominates. Standard metrics must be adapted, with the Area Under the Precision-Recall Curve (AUPRC) often preferred over the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) because the latter tends to produce overly optimistic results on highly skewed datasets. This preference arises because AUC-ROC relies on the false positive rate (FPR = FP / (FP + TN)), which becomes insensitive to changes in false positives as the negative class (true negatives, TN) grows large or is absent, whereas AUPRC focuses directly on positive-class performance through precision and recall. In OCC contexts such as anomaly detection, AUPRC better captures the trade-off relevant to identifying rare outliers as the positive class.

The precision-recall (PR) curve plots precision against recall at varying decision thresholds, providing a metric tailored to imbalanced scenarios. Precision at a given recall level $r$ is defined as

$$P(r) = \frac{\text{TP}(r)}{\text{TP}(r) + \text{FP}(r)},$$

where TP($r$) and FP($r$) are the numbers of true and false positives at the threshold achieving recall $r = \text{TP}(r) / (\text{TP} + \text{FN})$, and FN is the number of false negatives. The AUPRC is then the area under this curve:

$$\text{AUPRC} = \int_0^1 P(r) \, dr.$$

This formulation emphasizes the model's ability to rank positives highly while minimizing false positives among retrieved instances, making it suitable for OCC where negatives are underrepresented.

Other adapted metrics include the F1-score, computed with anomalies treated as the positive class to balance precision and recall in imbalanced settings:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

This harmonic mean is particularly useful for OCC tasks involving sparse outliers. The Matthews Correlation Coefficient (MCC) also serves as a robust measure for binary outcomes in OCC, providing a balanced assessment across all confusion matrix quadrants:

$$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}.$$
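These metrics can be computed directly from anomaly scores and ground-truth labels; the following minimal sketch uses scikit-learn's metric functions on synthetic scores, with anomalies treated as the positive class and a top-5% score threshold chosen purely for illustration.

```python
# Compute AUPRC, AUC-ROC, F1, and MCC from anomaly scores and true labels.
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])  # 5% anomalies
scores = np.concatenate([rng.normal(0.0, 1.0, 950),    # normal points: lower scores
                         rng.normal(2.5, 1.0, 50)])    # anomalies: higher scores

auprc = average_precision_score(y_true, scores)   # area under the PR curve
auroc = roc_auc_score(y_true, scores)             # for comparison; often optimistic here
y_pred = (scores > np.percentile(scores, 95)).astype(int)   # threshold at the top 5%

print(f"AUPRC={auprc:.3f}  AUC-ROC={auroc:.3f}")
print(f"F1={f1_score(y_true, y_pred):.3f}  MCC={matthews_corrcoef(y_true, y_pred):.3f}")
```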