Nearest centroid classifier
from Wikipedia

In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tf*idf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback.[1]

An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of tumors.[2]
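For a quick practical check, scikit-learn ships a NearestCentroid estimator whose shrink_threshold option corresponds to the shrunken-centroid extension mentioned above. The sketch below assumes scikit-learn and its bundled iris dataset are available; it is illustrative rather than canonical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain nearest centroid: one Euclidean centroid per class.
clf = NearestCentroid()
clf.fit(X_train, y_train)
print("class centroids:\n", clf.centroids_)
print("test accuracy:", clf.score(X_test, y_test))

# shrink_threshold enables the shrunken-centroid variant used for
# high-dimensional data such as gene expression profiles.
shrunken = NearestCentroid(shrink_threshold=0.2)
shrunken.fit(X_train, y_train)
print("shrunken-centroid accuracy:", shrunken.score(X_test, y_test))
```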

Algorithm


Training


Given training samples $(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)$ with class labels $y_i \in \mathbf{Y}$, compute the per-class centroids $\vec{\mu}_l = \frac{1}{|C_l|} \sum_{i \in C_l} \vec{x}_i$, where $C_l$ is the set of indices of samples belonging to class $l \in \mathbf{Y}$.

Prediction


The class assigned to an observation $\vec{x}$ is $\hat{y} = \arg\min_{l \in \mathbf{Y}} \|\vec{\mu}_l - \vec{x}\|$.
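The two formulas above map directly onto a few lines of array code. The following NumPy sketch (function and variable names are illustrative, not from the source) computes the per-class centroids and the arg-min prediction for a single observation.

```python
import numpy as np

def fit(X, y):
    """Per-class centroids: the mean of the training samples in each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(x, classes, centroids):
    """Return the label whose centroid is closest (Euclidean) to observation x."""
    return classes[np.argmin(np.linalg.norm(centroids - x, axis=1))]
```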

from Grokipedia
The nearest centroid classifier is a simple supervised algorithm for multi-class classification that represents each class by the mean, or centroid, of its training samples and assigns an unlabeled test sample to the class whose centroid is closest to it, typically under the Euclidean distance metric. The classifier operates in two main phases: during training, it computes the centroid $\bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i$ for each class $k$, where $n_k$ is the number of samples in class $k$ and $x_i$ are their feature vectors; for prediction, it classifies a new sample $x$ by finding the class $k$ that minimizes the Euclidean distance $\|x - \bar{x}_k\|_2$. The decision boundary between any two classes is linear, derived from the difference in squared distances, and the method assumes features are on comparable scales, often requiring feature scaling for optimal performance. It is computationally efficient, with training and prediction both achieving linear $O(pn)$ complexity, where $p$ is the number of features and $n$ the number of samples, making it suitable for large datasets.

Historically, the approach traces its roots to the Rocchio algorithm, introduced in 1971 for relevance feedback in information retrieval, which updates query vectors as weighted averages of relevant and non-relevant documents, essentially a centroid-based method adapted for binary relevance classification. In modern machine learning, it is recognized as a baseline prototype classifier, particularly effective in scenarios with well-separated, spherical class distributions, and serves as a foundation for extensions like the nearest shrunken centroids method for high-dimensional data such as gene expression analysis.

Key advantages include its interpretability, as centroids provide a clear summary of class prototypes, and its robustness in low-dimensional spaces with balanced classes; however, it can underperform with outliers, imbalanced data, or non-spherical clusters, where alternatives like k-nearest neighbors or support vector machines may excel. Applications span text categorization using TF-IDF representations, where it is equivalently termed the Rocchio classifier, as well as image recognition and bioinformatics, often as a fast initial model before more complex techniques. Variants, such as those incorporating shrinkage toward the global centroid or alternative distance metrics, address limitations in high dimensions or noisy features.
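To make the TF-IDF/Rocchio use concrete, here is a minimal sketch with scikit-learn, assuming it is installed; the corpus, labels, and pipeline choices are made-up placeholders rather than anything prescribed by the source.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, purely illustrative.
docs = ["cheap flights to rome", "hotel deals in paris",
        "python list comprehension", "numpy array broadcasting"]
labels = ["travel", "travel", "programming", "programming"]

# TF-IDF weighting followed by nearest centroid assignment is the
# Rocchio-style text classifier described above.
rocchio = make_pipeline(TfidfVectorizer(), NearestCentroid())
rocchio.fit(docs, labels)
print(rocchio.predict(["best flights and hotels", "broadcasting rules in numpy"]))
```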

Overview

Definition and intuition

The nearest centroid classifier is a simple prototype-based method for multi-class classification, in which each class is represented by a single prototype known as the centroid, and an input data point is assigned to the class whose centroid is closest according to a distance metric. This approach treats the centroid as the "center of mass" of the class's training samples, providing an intuitive summary that captures the average location of the class in feature space. It operates as a linear-time algorithm, requiring only a single pass over the training data to compute centroids, which makes it computationally efficient and suitable as a baseline for more complex classifiers.

The core intuition behind the nearest centroid classifier is that data points from the same class tend to cluster around a central representative point, exerting a "pull" on new samples toward their respective class centers during prediction. For a test sample, the classifier measures the distance, typically Euclidean, to each class centroid and assigns the sample to the nearest one, effectively partitioning the feature space into regions dominated by each centroid. This mimics the natural grouping behavior seen in clustering tasks, where points are attracted to their nearest center, and it assumes well-separated classes with roughly spherical distributions for optimal performance.

A high-level pseudocode overview of the workflow is as follows:

Training Phase:
  • For each class $k$ in the set of classes:
    • Compute the centroid $\mathbf{c}_k$ as the arithmetic mean of all training samples assigned to class $k$.
Prediction Phase:
  • For a test sample $\mathbf{x}$:
    • Compute the distance $d_k$ from $\mathbf{x}$ to each centroid $\mathbf{c}_k$.
    • Assign $\mathbf{x}$ to the class $k^*$ where $d_{k^*} = \min_k d_k$.
To illustrate, consider a 2D toy dataset with two classes: Class A consists of points (1,2), (2,1), (2,3), and (3,2), yielding a centroid at (2,2); Class B consists of points (5,4), (6,5), (4,5), and (5,6), with a centroid at (5,5). A test point at (3,3) would be closer to Class A's centroid (Euclidean distance ≈1.41) than to Class B's (≈2.83), so it is classified as A. The decision boundary between the classes is the perpendicular bisector of the line segment joining the centroids, passing through their midpoint (3.5,3.5) and dividing the plane into regions where points are nearer to one centroid or the other.
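The pseudocode and the toy example can be reproduced in a few lines; the NumPy sketch below (illustrative only) recovers the centroids (2, 2) and (5, 5) and classifies the test point (3, 3) as Class A.

```python
import numpy as np

class_a = np.array([[1, 2], [2, 1], [2, 3], [3, 2]])
class_b = np.array([[5, 4], [6, 5], [4, 5], [5, 6]])

# Training phase: one centroid (arithmetic mean) per class.
centroids = {"A": class_a.mean(axis=0), "B": class_b.mean(axis=0)}  # (2, 2) and (5, 5)

# Prediction phase: assign the test point to the class with the nearest centroid.
test = np.array([3, 3])
distances = {label: np.linalg.norm(test - c) for label, c in centroids.items()}
print(distances)                          # A: ~1.41, B: ~2.83
print(min(distances, key=distances.get))  # 'A'
```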

Historical development

The concept of using arithmetic means as representative points for groups of observations emerged in 19th-century statistics, providing an early foundation for centroid-based classification approaches. Karl Pearson's 1901 paper on lines and planes of closest fit to systems of points in space introduced methods for representing multivariate data around central tendencies, such as means, which later influenced class prototyping in pattern recognition.

The nearest centroid classifier was formalized in the mid-20th century amid the growth of statistical pattern recognition, particularly for applications like radar signal processing. In the 1960s, researchers developed fixed-centroid methods in which class prototypes were predefined means, as explored in work on signal classification; for instance, Keinosuke Fukunaga's contributions to nonparametric and parametric techniques in pattern recognition during this period highlighted the use of centroids for efficient decision boundaries in high-noise environments. George S. Sebestyen's 1962 book further advanced the minimum-distance framework, treating class means as prototypes for assigning patterns via Euclidean proximity and establishing it as a core technique in early computational pattern recognition systems.

By the 1970s, the method gained prominence in the literature as a simple, interpretable baseline, notably through its adaptation in the Rocchio algorithm for relevance feedback in information retrieval. Richard O. Duda and Peter E. Hart's influential 1973 textbook Pattern Classification and Scene Analysis presented the nearest centroid (or nearest mean) classifier as an accessible parametric approach, contrasting it with more complex discriminants while emphasizing its utility in low-dimensional settings with Gaussian assumptions.

In the 2000s, the nearest centroid classifier experienced a resurgence for high-dimensional data challenges, such as text categorization and gene expression analysis, where its linear-time complexity proved advantageous over memory-intensive alternatives. Han and Karypis's 2000 analysis demonstrated its effectiveness in centroid-based document classification, achieving competitive accuracy on benchmark corpora by modeling classes via term-frequency vectors. Similarly, Tibshirani et al.'s 2002 nearest shrunken centroids method adapted it for sparse, high-dimensional data, shrinking centroids toward the global mean to mitigate overfitting and improve class prediction in gene expression studies, influencing its adoption in bioinformatics. This era also saw approximations linking it to probabilistic models like Naive Bayes for scalable text classification.

Algorithm

Training procedure

The training procedure of the nearest centroid classifier begins with a labeled dataset comprising $n$ samples, each represented as a $d$-dimensional feature vector, spanning $K$ distinct classes. The procedure consists of three primary steps. First, the dataset is partitioned by class label to group all samples associated with each class. Second, for each class $k$, the centroid $\mu_k$ is calculated as the arithmetic mean of the feature vectors of the samples in that class. Third, the collection of $K$ centroids is stored to form the complete model, which encapsulates the class prototypes without further optimization. In handling edge cases, an empty class, one lacking any samples, prevents centroid computation and results in the exclusion of that class from the model. For imbalanced datasets, where class sample sizes differ markedly, centroids are computed directly from the available samples as unweighted means, preserving the standard approach without reweighting. The overall time complexity is $O(n \cdot d)$, arising from a linear traversal of the dataset to accumulate per-class, per-dimension sums and counts before averaging.
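A sketch of this training procedure, including the empty-class edge case, might look as follows in NumPy; the function name, the explicit class_labels argument, and the dictionary model are illustrative assumptions rather than part of the source description.

```python
import numpy as np

def train_nearest_centroid(X, y, class_labels):
    """Group the samples by class label and average each group.

    Classes in `class_labels` with no samples are excluded from the model;
    imbalanced classes are handled with plain unweighted means.
    """
    model = {}
    for k in class_labels:
        members = X[y == k]
        if members.shape[0] == 0:        # empty class: no centroid can be computed
            continue
        model[k] = members.mean(axis=0)  # unweighted arithmetic mean of the class
    return model
```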

Prediction procedure

The prediction procedure of the nearest centroid classifier assigns an unlabeled test sample $\mathbf{x} \in \mathbb{R}^d$ to the class $k^*$ whose precomputed centroid $\boldsymbol{\mu}_{k^*}$ minimizes the distance to $\mathbf{x}$, using the stored centroids $\{\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K\}$ for the $K$ classes. The process begins by computing the distance from $\mathbf{x}$ to each centroid $\boldsymbol{\mu}_k$, typically the Euclidean distance $\|\mathbf{x} - \boldsymbol{\mu}_k\|_2 = \sqrt{\sum_{i=1}^d (x_i - \mu_{k,i})^2}$.
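Because the square root is monotonic, the arg-min can equivalently be taken over squared distances without changing the result. A vectorized NumPy sketch of the prediction step follows; the names are illustrative, and `model` is assumed to be a label-to-centroid mapping like the one built in the training sketch above.

```python
import numpy as np

def predict_nearest_centroid(X, model):
    """Assign each row of X to the class of its nearest stored centroid."""
    labels = np.array(list(model.keys()))
    centroids = np.stack(list(model.values()))         # shape (K, d)
    diffs = X[:, None, :] - centroids[None, :, :]      # shape (n, K, d)
    sq_dists = np.einsum("nkd,nkd->nk", diffs, diffs)  # squared Euclidean distances
    return labels[np.argmin(sq_dists, axis=1)]         # sqrt omitted: argmin unchanged
```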