Ward's method
In statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special case of the objective function approach originally presented by Joe H. Ward, Jr.[1] Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function. This objective function could be "any function that reflects the investigator's purpose." Many of the standard clustering procedures are contained in this very general class. To illustrate the procedure, Ward used the example where the objective function is the error sum of squares, and this example is known as Ward's method or, more precisely, Ward's minimum variance method.
The nearest-neighbor chain algorithm can be used to find the same clustering defined by Ward's method, in time proportional to the size of the input distance matrix and space linear in the number of points being clustered.
The minimum variance criterion
Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clusters that leads to the minimum increase in total within-cluster variance after merging. This increase is a weighted squared distance between cluster centers. At the initial step, all clusters are singletons (clusters containing a single point). To apply a recursive algorithm under this objective function, the initial distance between individual objects must be (proportional to) squared Euclidean distance.
The initial cluster distances in Ward's minimum variance method are therefore defined to be the squared Euclidean distance between points:

    d_ij = d({x_i}, {x_j}) = ||x_i - x_j||²
Note: In software that implements Ward's method, it is important to check whether the function arguments should specify Euclidean distances or squared Euclidean distances.
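As a concrete illustration of this caveat (a minimal sketch not tied to any particular library; the helper name is ours), the following computes both the squared and plain Euclidean distance between two points. Merging the two singleton clusters raises the total within-cluster variance by exactly half the squared distance:

```python
import math

def sq_euclidean(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

x, y = (1.0, 2.0), (4.0, 6.0)
d2 = sq_euclidean(x, y)   # 25.0 -- initial cluster distance in Ward's method
d = math.sqrt(d2)         # 5.0  -- plain Euclidean distance; some software expects this
delta_ess = d2 / 2        # 12.5 -- ESS increase when the two singletons are merged
```

Passing `d` where a routine expects `d2` (or vice versa) silently changes the dendrogram heights, which is why checking the expected argument convention matters.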
Lance–Williams algorithms
Ward's minimum variance method can be defined and implemented recursively by a Lance–Williams algorithm. The Lance–Williams algorithms are an infinite family of agglomerative hierarchical clustering algorithms which are represented by a recursive formula for updating cluster distances at each step (each time a pair of clusters is merged). At each step, it is necessary to optimize the objective function (find the optimal pair of clusters to merge). The recursive formula simplifies finding the optimal pair.
Suppose that clusters C_i and C_j are next to be merged. At this point all of the current pairwise cluster distances are known. The recursive formula gives the updated cluster distances following the pending merge of clusters C_i and C_j. Let
- d_ij, d_ik, and d_jk be the pairwise distances between clusters C_i, C_j, and C_k, respectively,
- d_(ij)k be the distance between the new cluster C_i ∪ C_j and C_k.
An algorithm belongs to the Lance–Williams family if the updated cluster distance d_(ij)k can be computed recursively by

    d_(ij)k = α_i d_ik + α_j d_jk + β d_ij + γ |d_ik - d_jk|,

where α_i, α_j, β, and γ are parameters, which may depend on cluster sizes, that together with the cluster distance function d_ij determine the clustering algorithm. Several standard clustering algorithms such as single linkage, complete linkage, and the group average method have a recursive formula of the above type. A table of parameters for standard methods is given by several authors.[2][3][4]
Ward's minimum variance method can be implemented by the Lance–Williams formula. For disjoint clusters C_i, C_j, and C_k with sizes n_i, n_j, and n_k respectively:

    d(C_i ∪ C_j, C_k) = ((n_i + n_k) d(C_i, C_k) + (n_j + n_k) d(C_j, C_k) - n_k d(C_i, C_j)) / (n_i + n_j + n_k).

Hence Ward's method can be implemented as a Lance–Williams algorithm with

    α_i = (n_i + n_k) / (n_i + n_j + n_k),  α_j = (n_j + n_k) / (n_i + n_j + n_k),  β = -n_k / (n_i + n_j + n_k),  γ = 0.
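This update rule can be checked numerically. The sketch below (helper names are ours) computes the merge cost directly from cluster sizes and centroids, then confirms that the Lance–Williams recursion with Ward's parameters reproduces it for three singleton clusters on a line:

```python
def merge_cost(a, b):
    """ESS increase when merging clusters a and b, each given as (size, centroid)."""
    (na, ca), (nb, cb) = a, b
    return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))

def lw_ward(d_ik, d_jk, d_ij, ni, nj, nk):
    """Lance-Williams distance update with Ward's parameters."""
    s = ni + nj + nk
    return ((ni + nk) * d_ik + (nj + nk) * d_jk - nk * d_ij) / s

# three singleton clusters at 0, 2, and 5 on the real line
i, j, k = (1, (0.0,)), (1, (2.0,)), (1, (5.0,))
d_ij, d_ik, d_jk = merge_cost(i, j), merge_cost(i, k), merge_cost(j, k)

merged = (2, (1.0,))  # sizes add; the centroid is the weighted mean
d_updated = lw_ward(d_ik, d_jk, d_ij, 1, 1, 1)
# the recursive update equals the directly computed cost of merging (i ∪ j) with k
assert abs(d_updated - merge_cost(merged, k)) < 1e-12
```

Because the recursion needs only previously computed distances, no centroids or raw points have to be revisited after the initial step.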
Variations
The popularity of Ward's method has led to variations of it. For instance, Wardp introduces cluster-specific feature weights, following the intuitive idea that features can have different degrees of relevance in different clusters.[5]
References
- ^ Ward, J. H., Jr. (1963), "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, 58, 236–244.
- ^ Cormack, R. M. (1971), "A Review of Classification", Journal of the Royal Statistical Society, Series A, 134(3), 321-367.
- ^ Gordon, A. D. (1999), Classification, 2nd Edition, Chapman and Hall, Boca Raton.
- ^ Milligan, G. W. (1979), "Ultrametric Hierarchical Clustering Algorithms", Psychometrika, 44(3), 343–346.
- ^ R.C. de Amorim (2015). "Feature Relevance in Ward's Hierarchical Clustering Using the Lp Norm" (PDF). Journal of Classification. 32 (1): 46–62. doi:10.1007/s00357-015-9167-1. S2CID 18099326.
Further reading
- Everitt, B. S., Landau, S. and Leese, M. (2001), Cluster Analysis, 4th Edition, Oxford University Press, Inc., New York; Arnold, London. ISBN 0340761199
- Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
- Jain, A. K. and Dubes, R. C. (1988), Algorithms for Clustering Data, New Jersey: Prentice–Hall.
- Kaufman, L. and Rousseeuw, P. J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley.
Ward's method is implemented in many statistical packages, including R's hclust and agnes functions as well as SAS and SPSS, due to its effectiveness in exploratory data analysis, particularly when combined with dimensionality reduction techniques like principal component analysis (PCA) for visualizing cluster hierarchies.[2] Despite its popularity—evidenced by thousands of citations in fields like ecology, psychology, and bioinformatics—critiques note potential distortions in dendrogram heights across implementations and recommend validation against alternative criteria for robust results.[4]
Overview
Definition and Purpose
Ward's method is an agglomerative hierarchical clustering algorithm that constructs a hierarchy of clusters by successively merging pairs of clusters in a manner that minimizes the increase in total within-cluster variance.[5] This approach treats cluster analysis as an analysis of variance problem, where the goal is to form groups of multivariate observations that maximize the between-cluster variance relative to the total variance.[1] The primary purpose of Ward's method is to partition quantitative data into homogeneous subgroups, assuming that clusters are roughly spherical and compact in the feature space.[1] It is particularly suited for datasets where the underlying structure can be captured by minimizing the error sum of squares (ESS), a measure of within-cluster dispersion, thereby producing well-separated and internally cohesive clusters without relying on predefined distance metrics beyond the initial setup. In operation, the algorithm initializes each data point as a singleton cluster and proceeds iteratively: at each step, it identifies and merges the two clusters whose union results in the smallest possible increase in the overall ESS, continuing until all points are combined into a single cluster or a desired number of groups is reached.[5] The initial dissimilarity between individual observations is defined using the squared Euclidean distance,

    d(x_i, x_j) = ||x_i - x_j||²,

which aligns with the variance-based objective and facilitates efficient computation for Euclidean spaces.
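The iterative procedure just described can be sketched in a few lines of plain Python. This is a naive O(n³) illustration of the greedy merge loop, not an optimized implementation, and the function names are ours:

```python
from itertools import combinations

def ess(cluster):
    """Error sum of squares: squared deviations from the cluster mean."""
    n, dim = len(cluster), len(cluster[0])
    mean = [sum(p[k] for p in cluster) / n for k in range(dim)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in cluster)

def ward_clusters(points, num_clusters):
    """Repeatedly merge the pair of clusters whose union raises total ESS the least."""
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > num_clusters:
        _, i, j = min(
            (ess(clusters[i] + clusters[j]) - ess(clusters[i]) - ess(clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

groups = ward_clusters([(0, 0), (0, 1), (5, 0), (5, 1)], 2)
```

On the four sample points, the two tight vertical pairs merge first because each merge costs only 0.5 in ESS, while any cross-pair merge would cost over 12.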
Historical Development
Ward's method, a hierarchical clustering technique, was introduced by Joe H. Ward, Jr., in his seminal 1963 paper titled "Hierarchical Grouping to Optimize an Objective Function," published in the Journal of the American Statistical Association.[5] This work proposed a procedure for forming hierarchical groups by minimizing an objective function based on within-cluster variance, marking a key advancement in agglomerative clustering approaches.[5] The method emerged in the early 1960s as part of the burgeoning field of multivariate analysis techniques, coinciding with increasing interest in computational statistics for handling complex datasets.[6] During this period, cluster analysis transitioned from theoretical frameworks to practical algorithms implementable on early computers, with Ward's contribution emphasizing the error sum of squares as a robust criterion for grouping.[6] By the 1970s, the method had seen widespread adoption in the social sciences and ecology, attributed to its effective variance-minimizing properties that supported interpretable cluster formations in empirical studies.[7] Subsequent developments addressed implementation challenges and clarified conceptual aspects. In 2011, Fionn Murtagh and Pierre Legendre published a paper on arXiv that examined the clustering criterion and agglomerative algorithms, resolving common misconceptions about the method's original formulation and its relation to least-squares principles.[7] This was followed in 2014 by their article in the Journal of Classification, which analyzed various algorithms purporting to implement Ward's criterion, confirming its strict hierarchical properties and distinguishing valid implementations from approximations.[2]
Mathematical Foundation
Minimum Variance Criterion
Ward's minimum variance criterion, introduced in the seminal work on hierarchical clustering, defines the core decision rule for merging clusters by selecting the pair that results in the smallest increase in the total within-cluster sum of squared errors (SSE). This approach treats clustering as an optimization problem akin to analysis of variance, where the objective is to minimize the pooled within-group variance at each agglomeration step.[1] The rationale behind this criterion lies in its emphasis on producing compact and roughly spherical clusters, as it penalizes merges that substantially inflate the overall data variance by favoring unions of clusters with similar means and comparable sizes. By prioritizing low-variance groupings, the method assumes that observations within clusters should exhibit high similarity relative to their centroids, which is particularly suitable for quantitative data assumed to follow elliptical distributions.[1] This variance-focused strategy helps mitigate the formation of elongated or irregularly shaped clusters that might arise from purely distance-based rules. In the general process, the algorithm begins with n singleton clusters (one per observation) and iteratively merges them into n-1 clusters, then n-2, and so on, until a single cluster encompasses all data; at each iteration, all possible pairwise merges are evaluated, and the one yielding the minimal ΔSSE (change in SSE) is chosen. 
The error sum of squares serves as the key metric for this evaluation, quantifying the total squared deviation from cluster means.[1] Conceptually, this criterion differs from other agglomerative linkages, such as single linkage (which merges based on the minimum inter-point distance and can produce chaining effects) or complete linkage (which uses the maximum distance and tends to yield balanced but conservative clusters), by explicitly accounting for cluster sizes and centroids in the variance computation rather than relying solely on pairwise distances.[1] As a result, Ward's method often generates more equitable and compact partitions, though it requires squared Euclidean distances for consistency with the variance objective.
Error Sum of Squares Formulation
The error sum of squares (ESS), also referred to as the within-cluster sum of squares, provides the mathematical foundation for quantifying dispersion in Ward's hierarchical clustering method. It measures the total squared Euclidean distance from each data point to its cluster centroid across all clusters in the current partitioning. For a dataset partitioned into clusters C_1, …, C_K, the total ESS is defined as

    ESS = Σ_{k=1}^{K} Σ_{x ∈ C_k} ||x - m_k||²,

where m_k denotes the centroid (arithmetic mean) of cluster C_k.[3] At each agglomeration step, Ward's method selects the pair of clusters C_i and C_j that minimizes the increase in total ESS upon merging. This increase, denoted ΔESS(i, j), for clusters of sizes n_i and n_j with centroids m_i and m_j, is given by

    ΔESS(i, j) = (n_i n_j / (n_i + n_j)) ||m_i - m_j||².

The post-merge ESS for the combined cluster is thus ESS(C_i ∪ C_j) = ESS(C_i) + ESS(C_j) + ΔESS(i, j), ensuring the objective function only increases monotonically.[4] The formula for ΔESS derives from the variance decomposition theorem, which partitions the total sum of squares into within-cluster and between-cluster components: TSS = WSS + BSS, where WSS is the ESS and BSS is the between-cluster sum of squares. Merging clusters alters WSS by an amount equal to the weighted squared distance between centroids, as the new centroid minimizes the ESS for the union. This follows from expanding Σ_{x ∈ C_i ∪ C_j} ||x - m_{i∪j}||², using the property that Σ_{x ∈ C_i} (x - m_i) = 0 and analogously for C_j.[3] Notable properties of ΔESS(i, j) include its non-negativity (it is the product of a positive size factor and a squared norm), with ΔESS(i, j) = 0 only if m_i = m_j (the centroids coincide). The size factor n_i n_j / (n_i + n_j) highlights sensitivity to cluster sizes: for a fixed total n_i + n_j it reaches its maximum of (n_i + n_j)/4 (that is, n_i/2 when n_i = n_j) at equal sizes and approaches the size of the smaller cluster as imbalance grows, thereby penalizing merges of similarly sized, distant clusters more heavily than those involving a small outlier cluster.[4]
Algorithms
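The decomposition ESS(C_i ∪ C_j) = ESS(C_i) + ESS(C_j) + ΔESS(i, j) can be verified numerically. The short check below uses plain Python with illustrative data of our choosing:

```python
def centroid(cluster):
    """Coordinate-wise mean of a cluster of points."""
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]

def ess(cluster):
    """Error sum of squares about the cluster centroid."""
    m = centroid(cluster)
    return sum(sum((x - c) ** 2 for x, c in zip(p, m)) for p in cluster)

ci = [(0, 0), (2, 0)]           # cluster C_i, size 2
cj = [(5, 1), (7, 1), (6, 4)]   # cluster C_j, size 3
ni, nj = len(ci), len(cj)
mi, mj = centroid(ci), centroid(cj)

# ΔESS(i, j): weighted squared distance between the two centroids
delta = ni * nj / (ni + nj) * sum((a - b) ** 2 for a, b in zip(mi, mj))
# ESS of the union equals the parts plus the merge cost
gap = ess(ci + cj) - (ess(ci) + ess(cj) + delta)
```

Here ESS(C_i) = 2, ESS(C_j) = 8, and ΔESS = 34.8, which sum exactly to the union's ESS of 44.8.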
Lance-Williams Framework
The Lance-Williams framework provides a general recursive formula for updating inter-cluster distances in agglomerative hierarchical clustering after merging two clusters. This formula, introduced in 1967, allows various linkage methods to be expressed uniformly, facilitating efficient computation without recalculating all pairwise distances from scratch.[8] In its general form, the updated distance between the merged cluster (i ∪ j) and another cluster k is given by

    d_(i∪j)k = α_i d_ik + α_j d_jk + β d_ij + γ |d_ik - d_jk|,

where the term with γ vanishes (γ = 0) for methods assuming symmetric distances, such as those based on Euclidean metrics.[3] The coefficients α_i, α_j, β, and γ are method-specific and depend on cluster properties like the sizes n_i, n_j, and n_k. This structure enables O(n²) time complexity for the entire clustering process by updating a distance matrix incrementally.[3] Ward's method fits naturally into this framework by specifying coefficients that align with its minimum variance objective, which minimizes the increase in within-cluster error sum of squares (SSE). For Ward's method, assuming squared Euclidean distances as the dissimilarity measure, the parameters are

    α_i = (n_i + n_k) / (n_i + n_j + n_k),  α_j = (n_j + n_k) / (n_i + n_j + n_k),  β = -n_k / (n_i + n_j + n_k),  γ = 0,

where n_i, n_j, and n_k denote the sizes (number of original points) of clusters i, j, and k, respectively.[3] These weights ensure that the updated distance d_(i∪j)k reflects the weighted average of distances adjusted for the variance contribution of the merge, preserving the SSE minimization at each step.[9] The equivalence of these parameters to Ward's criterion is derived by considering the centroids of the clusters. The increase in total SSE upon merging i and j is

    ΔSSE(i, j) = (n_i n_j / (n_i + n_j)) ||c_i - c_j||²,

where c_i and c_j are the centroids.
To update distances to k, the formula is obtained by expressing the centroid of the merged cluster as

    c_(i∪j) = (n_i c_i + n_j c_j) / (n_i + n_j)

and deriving ||c_(i∪j) - c_k||² in terms of the original distances, yielding the Lance-Williams form above after algebraic expansion and normalization by cluster sizes. This derivation confirms that using these coefficients in the recursive update exactly reproduces the SSE-based merging criterion, enabling efficient implementation.[3] The Lance-Williams framework was proposed by Lance and Williams in 1967, building on earlier work including Ward's 1963 method, and was applied to Ward's approach in subsequent formalizations to unify agglomerative strategies.[8][9]
Computational Implementation
The standard implementation of Ward's method follows the general agglomerative hierarchical clustering paradigm, beginning with each data point as its own cluster and iteratively merging the pair of clusters that results in the smallest increase in total within-cluster variance, as measured by the error sum of squares (ESS). This naive approach requires maintaining and updating a full n × n distance matrix at each step, leading to a time complexity of O(n³) for n data points, which becomes impractical for datasets beyond a few thousand points.[4] To achieve greater efficiency, implementations leverage the Lance-Williams framework, which enables recursive updates to cluster distances without recomputing the entire matrix after each merge, reducing the overall complexity to O(n²) while using O(n²) space. This optimization is standard in modern libraries and preserves the exact Ward's criterion.[10] Further computational savings can be obtained using the nearest-neighbor chain algorithm, which identifies candidate merges by constructing chains in the nearest-neighbor graph of clusters, avoiding exhaustive scans of the distance matrix and reducing practical runtime, especially for sparse or structured data, while maintaining the O(n²) worst-case bound. This technique, originally proposed for general agglomerative methods, applies directly to Ward's linkage and has been adapted in parallel frameworks for distributed computing.[11] For large datasets where even O(n²) methods are prohibitive, practical implementations often incorporate scalability techniques such as subsampling or hybrid approaches: for instance, performing an initial k-means clustering to reduce the data to a manageable number of centroids (e.g., k << n), then applying Ward's method on those centroids to build the hierarchy, followed by assignment of original points to the resulting structure. 
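The nearest-neighbor chain idea mentioned above can be sketched in pure Python. This is a simplified illustration under our own naming and bookkeeping; a production implementation would cache distances rather than recompute them inside the search loop:

```python
def ward_cost(a, b):
    """ESS increase for merging clusters a and b, each given as (size, centroid)."""
    (na, ca), (nb, cb) = a, b
    return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))

def nn_chain_ward(points):
    """Agglomerate with the nearest-neighbor chain; returns the merge sequence."""
    clusters = {i: (1, tuple(p)) for i, p in enumerate(points)}
    merges, chain = [], []
    next_id = len(points)
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))  # start a chain anywhere
        top = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # nearest neighbor of the chain's tip, preferring the previous element on ties
        nn = min((c for c in clusters if c != top),
                 key=lambda c: (ward_cost(clusters[top], clusters[c]), c != prev))
        if nn == prev:
            # reciprocal nearest neighbors: merge them and shorten the chain
            chain.pop(); chain.pop()
            (na, ca), (nb, cb) = clusters.pop(top), clusters.pop(prev)
            n = na + nb
            cent = tuple((na * x + nb * y) / n for x, y in zip(ca, cb))
            merges.append((prev, top, next_id, n))
            clusters[next_id] = (n, cent)
            next_id += 1
        else:
            chain.append(nn)
    return merges

links = nn_chain_ward([(0, 0), (0, 1), (4, 0), (4, 1), (10, 0)])
```

Because Ward's distance is reducible, the remainder of the chain stays valid after each merge, which is what avoids rescanning the full distance matrix.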
Dendrograms, generated from the linkage matrix, provide a visual summary of the hierarchy without storing the full tree, aiding interpretation in high-dimensional settings. Ward's method is widely supported in statistical software, including the hclust function in R's base stats package, which implements the optimized Lance-Williams version of Ward's linkage via the "ward.D2" method for ESS minimization. In Python, the scipy.cluster.hierarchy module provides ward for condensed distance matrices, while scikit-learn's AgglomerativeClustering extends this with connectivity constraints (e.g., for graph-structured data) to handle sparse large-scale inputs more efficiently.
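A typical SciPy invocation looks like the following sketch, which clusters two synthetic, well-separated Gaussian blobs and then cuts the resulting tree into two flat clusters (the data here is illustrative, not from the source):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated Gaussian blobs of 20 points each
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

Z = linkage(X, method="ward")                    # (n - 1) x 4 linkage matrix
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```

Passing raw observations to linkage lets SciPy compute the Euclidean distances itself, sidestepping the Euclidean-versus-squared-Euclidean pitfall noted earlier; the linkage matrix Z can be fed directly to scipy.cluster.hierarchy.dendrogram for plotting.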
