Principal component analysis
Introduction
Overview
Principal component analysis (PCA) is a statistical procedure that employs an orthogonal linear transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables known as principal components.[2] This transformation re-expresses the data in a new coordinate system where the greatest variance in the dataset lies along the first coordinate (first principal component), the second greatest variance along the second coordinate, and so on.[8] The primary purpose of PCA is dimensionality reduction, enabling the summarization of complex, high-dimensional datasets by retaining the maximum amount of variance possible with a reduced number of components, thus simplifying analysis while preserving essential information.[2] The basic workflow of PCA begins with centering the data by subtracting the mean from each variable to ensure the dataset has zero mean, which facilitates the analysis of variance.[8] Next, the covariance matrix of the centered data is computed to capture the relationships between variables. From this matrix, the eigenvectors and corresponding eigenvalues are extracted, where the eigenvectors represent the directions of the principal components and the eigenvalues indicate the amount of variance explained by each. Finally, the original data is projected onto these principal components, typically selecting the top components that account for a substantial portion of the total variance.[2] As an unsupervised learning technique, PCA operates without requiring labeled data or predefined outcomes, making it particularly valuable for exploratory data analysis in fields such as genomics, finance, and image processing.[8] By focusing on variance maximization, it uncovers underlying structures and patterns in the data, aiding in noise reduction and visualization without assuming any specific distributional form.[2]
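This workflow can be illustrated with a brief sketch in Python/NumPy; the synthetic data, variable names, and printed quantity below are illustrative assumptions rather than output of any particular library.

import numpy as np

# Illustrative synthetic data: two correlated variables (e.g., height and weight).
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 0.9 * height + rng.normal(0, 5, size=500)
X = np.column_stack([height, weight])

# 1. Center the data by subtracting each variable's mean.
X_centered = X - X.mean(axis=0)
# 2. Covariance matrix of the centered data (variables in columns).
cov = np.cov(X_centered, rowvar=False)
# 3. Eigenvectors and eigenvalues (eigh suits symmetric matrices; ascending order).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the first principal component.
scores = X_centered @ eigvecs[:, :1]
print("proportion of variance explained by PC1:", eigvals[0] / eigvals.sum())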
History
Principal component analysis (PCA) was first introduced by Karl Pearson in 1901, who described it as a method for finding the "principal axes of a normal ellipsoid" through the geometric optimization of hyperplanes that minimize perpendicular distances to systems of points in space.[9] This foundational work laid the groundwork for PCA as a technique for dimensionality reduction in multivariate data.[10] This built on earlier ideas, including principal axes of ellipsoids by Francis Galton and the singular value decomposition by Eugenio Beltrami (1873) and Camille Jordan (1874).[3] In 1933, Harold Hotelling formalized PCA algebraically, framing it in terms of maximizing variance through eigenvalues and eigenvectors of the covariance matrix, and explicitly linking it to multiple correlation criteria. Hotelling's contributions distinguished PCA from earlier geometric approaches and positioned it as a tool for analyzing complexes of statistical variables.[11] The 1950s and 1960s marked significant developments in PCA, driven by advances in statistical theory and the advent of electronic computers that enabled practical computation for larger datasets. Key advancements included asymptotic sampling distributions for PCA coefficients by Theodore W. Anderson in 1963 and the use and interpretation of principal components by C. R. Rao in 1964.[12] John C. Gower's 1966 work further explored geometric and statistical connections, while practical applications proliferated in fields like demography, econometrics, and meteorology, such as Moser and Scott's 1961 reduction of 57 demographic variables to four principal components.[12] Computational methods improved with the use of singular value decomposition (SVD) for efficient computation, with its application to low-rank matrix approximation relevant to PCA established by Eckart and Young in 1936, and further computational advancements in the mid-20th century, facilitating efficient eigenvalue computations.[12][13] Links to factor analysis strengthened during this era, with PCA often viewed as a special case for extracting factors without assuming an underlying model, as discussed by Hotelling in 1957 and others; tools like the scree graph by Raymond B. Cattell in 1966 and Kaiser's eigenvalue rule in 1960 emerged for component selection.[12] Since the 2000s, PCA has been increasingly integrated into machine learning workflows for dimensionality reduction and feature extraction, as highlighted in surveys of reduction techniques.[14] In the 2020s, PCA continues to serve as a preprocessing step in deep learning pipelines, such as reducing input dimensions before training neural networks to mitigate the curse of dimensionality and enhance computational efficiency, particularly in applications like image analysis and genomic data processing.[15] Standard references include Pearson's 1901 paper, Hotelling's 1933 article, and Ian T. Jolliffe's comprehensive books from 1986 and 2002, which synthesize theoretical and applied aspects.[10][12]
Conceptual Foundations
Intuition
Principal component analysis (PCA) addresses the challenge of analyzing high-dimensional datasets where variables often exhibit correlations, resulting in redundancy that complicates interpretation and computation. In such scenarios, multiple variables may convey overlapping information—for instance, in biological data where traits like height and weight are interrelated—leading to inefficient representations that obscure underlying patterns. PCA mitigates this by transforming the data into a new set of uncorrelated variables, called principal components, which capture the essential structure while discarding noise and redundancy, thereby simplifying analysis without substantial loss of information.[8] At its core, variance serves as a measure of how spread out the data points are along a particular direction, reflecting the amount of information or signal present. The first principal component is designed to align with the direction of maximum variance, thereby capturing the largest portion of the data's variability and providing the most informative summary. Subsequent components then capture progressively less variance in orthogonal directions, ensuring that the retained dimensions prioritize the most significant aspects of the data distribution. This hierarchical approach allows PCA to reduce dimensionality effectively, as the principal components ordered by decreasing variance enable selective retention of the top few for approximation.[8] To build intuition, consider PCA as rotating the coordinate axes of a 2D dataset to align them with the directions of greatest spread, minimizing the loss of information when projecting onto fewer axes. Imagine data points forming an elliptical cloud tilted away from the standard x and y axes; by rotating the axes to match the cloud's long and short axes, projections onto the primary axis preserve the bulk of the spread, while the secondary axis handles the remaining variation. This rotation decorrelates the variables, transforming redundant, slanted data into independent components.[16] A simple 2D example illustrates this with correlated variables like height and weight in a population, where taller individuals tend to weigh more, creating a diagonal scatter plot with high positive correlation. Without transformation, both dimensions redundantly describe body size; PCA identifies a principal component along the best-fit line of this diagonal (the direction of maximum variance), effectively summarizing the data with a single variable that captures most of the spread. The second component, perpendicular to the first, accounts for deviations from this line (e.g., variations in body shape), but often explains far less variance, allowing reduction to one dimension with minimal information loss. This decorrelation reveals independent factors, such as overall size versus proportionality, enhancing interpretability.[8]
Geometric Interpretation
In principal component analysis (PCA), data points are conceptualized as a cloud of points distributed in a multidimensional space, where each dimension corresponds to a variable. The principal components emerge as orthogonal directions—specifically, the eigenvectors of the data's covariance matrix—that align with the axes of maximum variance within this cloud. The first principal component captures the direction along which the data exhibits the greatest spread, while subsequent components account for progressively less variance in perpendicular directions. This geometric view transforms the high-dimensional data into a lower-dimensional representation by projecting the points onto these principal axes, preserving as much of the original variability as possible.[17][18] A key visual analogy for PCA involves fitting an ellipsoid to the data cloud, where the ellipsoid's shape and orientation reflect the covariance structure of the dataset. The principal components align with the ellipsoid's axes, with semi-axis lengths proportional to the square roots of the corresponding eigenvalues, indicating the extent of variance along each direction. For instance, in a two-dimensional case, the data scatter forms an elliptical cloud centered at the origin (after mean subtraction), and the major and minor axes of the ellipse directly correspond to the first and second principal components, respectively. This fitting process minimizes the perpendicular distances from the data points to the hyperplane defined by the principal components, providing an intuitive sense of how PCA approximates the data's geometry.[19][18] To assess the utility of individual components, a scree plot is often employed, plotting the eigenvalues in descending order to visualize the proportion of total variance explained by each principal component. The "elbow" in this plot helps identify the number of components that capture a substantial portion of the data's variability without including noise-dominated directions. This graphical tool underscores the hierarchical nature of variance reduction in the geometric framework of PCA.[17] A critical aspect of this geometric interpretation is the necessity of centering the data by subtracting the mean from each variable, which shifts the cloud's centroid to the origin. Without centering, the principal components might misleadingly align with the direction of the mean vector rather than true variance directions, distorting the analysis; centering ensures that the components focus solely on the spread around the central tendency. In contrast, applying PCA to raw, uncentered data can lead to components that primarily describe location rather than shape, undermining the method's effectiveness for dimensionality reduction.[19][18]
Mathematical Formulation
Definition
Principal component analysis (PCA) is a statistical procedure applied to a dataset X, an n × p matrix where n denotes the number of samples (observations) and p the number of variables (features), with the data assumed to be centered by subtracting the column means to remove the overall mean and emphasize deviations.[20] This centering ensures that the analysis focuses on the covariance structure rather than absolute levels.[21] The core objective of PCA is to derive a set of uncorrelated linear combinations of the original variables, known as principal components, arranged in decreasing order of their variances to maximize the captured variation in the dataset.[20] The projections of the centered data onto these principal components yield the scores, which represent the transformed coordinates of each observation, while the coefficients of the linear combinations are the loadings, quantifying the weight or contribution of each original variable to a given component.[20] PCA relies on the assumption of linearity in the relationships between variables, meaning it seeks only linear transformations and may not capture nonlinear dependencies.[21] It is also sensitive to outliers, as these can inflate variances and skew the components toward atypical data points.[20] Furthermore, PCA is highly sensitive to variable scaling; if features are on disparate measurement scales, standardization (e.g., to unit variance) is typically required to prevent larger-scale variables from dominating the results.[22][23] The covariance matrix underpins this process by encapsulating the pairwise variances and covariances among variables.[20]
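The centering and optional standardization described here can be sketched as follows in Python/NumPy; the function name preprocess and the standardize flag are illustrative assumptions, not part of any standard API.

import numpy as np

def preprocess(X, standardize=False):
    # Center: subtract each column's mean so the data has zero mean.
    Xc = X - X.mean(axis=0)
    if standardize:
        # Scale to unit variance so variables measured on large scales do not
        # dominate the components (equivalent to correlation-based PCA).
        Xc = Xc / Xc.std(axis=0, ddof=1)
    return Xc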
Principal Components
In principal component analysis, the first principal component is extracted as the direction in the data space that maximizes the variance, corresponding to the eigenvector of the sample covariance matrix associated with its largest eigenvalue. This eigenvector provides the weights for a linear combination of the original variables that captures the greatest amount of variability in the dataset.[24] Subsequent principal components are then derived iteratively, each orthogonal to all preceding components and maximizing the remaining unexplained variance; they correspond to the eigenvectors associated with the next largest eigenvalues, ensuring orthogonality and successive variance maximization.[25] The covariance matrix, central to this extraction, summarizes the second-order statistics of the centered data.[24] The k-th principal component scores are computed by projecting the centered data matrix X_c onto the corresponding eigenvector v_k, yielding t_k = X_c v_k, where t_k is a vector of scores for the k-th component across all observations.[25] Each principal component is unique only up to a sign flip, as eigenvectors can point in either direction; a common convention is to choose the orientation in which the variable with the largest absolute loading has a positive sign, for improved interpretability.
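A minimal sketch of the score computation and sign convention just described, assuming the eigenvectors are stored as columns of a NumPy array V; the names here are illustrative.

import numpy as np

def component_scores(X_centered, V, k):
    # k-th eigenvector (0-based column index): the loadings for component k.
    v_k = V[:, k].copy()
    # Sign convention: flip so the largest-magnitude loading is positive.
    if v_k[np.argmax(np.abs(v_k))] < 0:
        v_k = -v_k
    # Scores: projection of the centered data onto the component direction.
    return X_centered @ v_k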
Covariance Structure
In principal component analysis (PCA), the covariance matrix serves as the foundational structure for identifying the principal directions of variance in a dataset. For a centered data matrix X_c of size n × p, where n is the number of observations and p is the number of variables, the sample covariance matrix is defined as Σ = (1/(n − 1)) X_c^T X_c.[26] This matrix is symmetric and positive semi-definite, with its diagonal elements representing the variances of each variable and the off-diagonal elements capturing the pairwise covariances between variables.[26] The off-diagonal entries of Σ quantify the linear relationships or correlations between variables, indicating the degree of redundancy or dependence in the data; positive values suggest variables tend to increase together, while negative values indicate opposing movements.[26] PCA leverages this structure by seeking an orthogonal transformation that diagonalizes Σ, thereby producing a new set of uncorrelated components aligned with the directions of maximum variance.[27] The eigendecomposition of the covariance matrix takes the form Σ = V Λ V^T, where V is an orthogonal matrix whose columns are the eigenvectors (the principal directions), and Λ is a diagonal matrix containing the corresponding eigenvalues (the variances along those directions), ordered such that λ_1 ≥ λ_2 ≥ … ≥ λ_p ≥ 0.[26] This decomposition reveals the inherent covariance structure, transforming the original correlated variables into uncorrelated principal components.[27] When variables in the dataset have differing scales or units, the covariance matrix can be sensitive to these differences, potentially biasing the results toward high-variance variables.[27] In such cases, PCA is often performed on the correlation matrix instead, which is the covariance matrix of the standardized variables (each with zero mean and unit variance), ensuring scale invariance and equal weighting across variables.[27] The correlation matrix thus provides a normalized view of the covariance structure, particularly useful for datasets where linear changes in measurement units are plausible.[27]
Dimensionality Reduction
Variance Maximization
Principal component analysis seeks to identify linear combinations of variables, known as principal components, that capture the maximum possible variance in the data. The first principal component is defined as the direction w in the variable space that maximizes the variance of the projected data points, given by Var(Xw) = w^T Σ w, where X is the data matrix (centered to have zero mean) and Σ is the covariance matrix. This maximization is subject to the unit norm constraint ‖w‖ = 1 (or equivalently w^T w = 1) to ensure the direction is scaled appropriately and to avoid trivial solutions of infinite variance. This formulation, originally introduced by Hotelling, corresponds to maximizing the Rayleigh quotient R(w) = (w^T Σ w) / (w^T w), whose maximum value is the largest eigenvalue λ_1 of Σ, with the corresponding eigenvector providing the direction w_1.[28] To derive this formally, the optimization problem for the first component employs the method of Lagrange multipliers. Define the Lagrangian as L(w, λ) = w^T Σ w − λ(w^T w − 1). Taking the gradient with respect to w and setting it to zero yields 2Σw − 2λw = 0, which simplifies to the eigenvalue equation Σw = λw. The solution w_1 is the eigenvector corresponding to the largest eigenvalue λ_1, and the maximized variance is w_1^T Σ w_1 = λ_1. This constrained optimization ensures that the principal component direction aligns with the axis of greatest data spread.[12] Subsequent principal components are obtained by sequential maximization of the residual variance, subject to both the unit norm constraint and orthogonality to previous components. For the second component w_2, the objective is to maximize w_2^T Σ w_2 subject to w_2^T w_2 = 1 and w_2^T w_1 = 0. This introduces an additional Lagrange multiplier φ for the orthogonality constraint, leading to a modified Lagrangian L(w_2, λ, φ) = w_2^T Σ w_2 − λ(w_2^T w_2 − 1) − φ w_2^T w_1, whose stationarity condition again results in the eigenvalue equation Σw_2 = λ_2 w_2, where λ_2 is the second-largest eigenvalue and w_2 is the corresponding eigenvector orthogonal to w_1. This process continues for higher components, yielding an orthogonal set of eigenvectors ordered by decreasing eigenvalues.[12] The importance of each principal component is quantified by the explained variance ratio, defined as λ_k / (λ_1 + λ_2 + … + λ_p) for the k-th component, which represents the proportion of the total variance captured by that component relative to the trace of Σ. The cumulative explained variance for the first k components is then (λ_1 + … + λ_k) / (λ_1 + … + λ_p), aiding in decisions about dimensionality reduction by indicating how much information is retained. These ratios are directly derived from the eigenvalues obtained in the optimization.[12]
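The explained variance ratios follow directly from the sorted eigenvalues; a minimal NumPy sketch, assuming eigvals is already sorted in descending order.

import numpy as np

def explained_variance(eigvals):
    # Per-component ratio: lambda_k divided by the trace of the covariance matrix.
    ratios = eigvals / eigvals.sum()
    # Cumulative explained variance for the first k components, for each k.
    cumulative = np.cumsum(ratios)
    return ratios, cumulative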
Data Projection
Once the principal components have been identified, dimensionality reduction in PCA proceeds by projecting the original data onto the subspace defined by the first k components, where k < p and p is the original number of variables. For a centered data matrix X of dimensions n × p (with n observations), the reduced data matrix Y of dimensions n × k is computed as Y = X V_k, where V_k is the p × k matrix whose columns are the eigenvectors corresponding to the k largest eigenvalues of the covariance matrix of X. This linear transformation preserves the maximum possible variance in the lower-dimensional space while discarding directions of minimal variation.[29] Selecting the value of k is crucial for balancing dimensionality reduction with information retention. A standard method is to retain enough components to explain a predefined proportion of the total variance, such as 90% or 95%, computed as the cumulative sum of the eigenvalues divided by their total sum; for instance, if the first few eigenvalues account for 95% of the trace of the covariance matrix, those components are kept.[29] Another approach is the scree plot, which visualizes the eigenvalues in descending order against component index and identifies the "elbow" where the plot flattens, indicating diminishing returns in explained variance; this heuristic, originally proposed for factor analysis, is widely applied in PCA to avoid over-retention of noise.[30] Cross-validation offers a more data-driven alternative, evaluating candidate values of k by assessing reconstruction accuracy on validation sets to minimize generalization error.[29] The effectiveness of this projection is quantified by the reconstruction error, which measures the loss of information due to dimensionality reduction. When projecting onto the top k components, the minimal squared Frobenius norm of the error between the original X and its rank-k approximation is proportional to the sum of the discarded eigenvalues λ_{k+1} + … + λ_p, where λ_1 ≥ … ≥ λ_p are the eigenvalues of the covariance matrix Σ = (1/(n − 1)) X^T X in descending order; retaining more components reduces this error by shrinking the discarded sum, but at the cost of higher dimensionality.[29] Although PCA assumes continuous variables, qualitative or categorical variables can be handled by first encoding them into dummy (binary indicator) variables, allowing projection as with numerical data. However, this encoding introduces challenges, such as artificial multicollinearity among dummies and distortion of distances in the variable space, which can undermine the interpretability and optimality of the components; for purely categorical data, specialized extensions like multiple correspondence analysis are often preferable to mitigate these limitations.[31]
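A sketch of both ideas, choosing k by a cumulative-variance threshold and measuring the reconstruction error directly; the 0.95 threshold and the function names are illustrative assumptions.

import numpy as np

def choose_k(eigvals, threshold=0.95):
    # Smallest k whose cumulative explained variance reaches the threshold.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold)) + 1

def reconstruction_error(X_centered, V, k):
    # Project onto the top-k eigenvectors, then map back to the original space.
    V_k = V[:, :k]
    X_hat = (X_centered @ V_k) @ V_k.T
    # Squared Frobenius norm of the residual.
    return np.linalg.norm(X_centered - X_hat, ord="fro") ** 2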
Computation Techniques
Covariance Method
The covariance method is a classical batch algorithm for computing principal components, relying on the eigendecomposition of the data's covariance matrix to identify directions of maximum variance. This approach, formalized in early statistical literature, processes the entire dataset at once and is particularly effective when the number of features p is not excessively large relative to the number of samples n. It begins by centering the data to ensure the mean is zero, which is essential for the covariance to capture true variability rather than shifts due to location.[32] The procedure unfolds in the following steps. First, center the dataset by subtracting the mean of each feature across all samples from the corresponding feature values; this yields a mean-centered matrix X_c. Second, compute the sample covariance matrix Σ = (1/(n − 1)) X_c^T X_c, which quantifies the pairwise variances and covariances among features. Third, perform eigendecomposition on Σ to obtain Σ = V Λ V^T, where V is the matrix of eigenvectors (principal directions) and Λ is the diagonal matrix of eigenvalues (variances along those directions). Fourth, sort the eigenvalues in descending order to rank the principal components by explained variance. Finally, project the centered data onto the top k eigenvectors to obtain the reduced representation Y = X_c V_k, where V_k contains the first k columns of V.[32]
Algorithm: Covariance Method for PCA
Input: Data matrix X ∈ ℝ^{n × p}, number of components k
1. Compute mean vector μ = (1/n) X^T 1 (1 is vector of ones)
2. Center: X_c = X - 1 μ^T
3. Compute covariance: Σ = (1/(n-1)) X_c^T X_c
4. Eigendecompose: [V, Λ] = eig(Σ) // V orthogonal, Λ diagonal
5. Sort indices: [Λ_sorted, idx] = sort(diag(Λ), 'descend')
6. V_sorted = V(:, idx)
7. If k < p: Project Y = X_c V_sorted(:, 1:k)
Output: Principal components V_sorted, scores Y (if projected)
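The pseudocode above translates almost line for line into Python/NumPy; the following is a sketch under the same assumptions, not a reference implementation.

import numpy as np

def pca_covariance_method(X, k):
    n = X.shape[0]
    mu = X.mean(axis=0)                          # step 1: mean vector
    X_c = X - mu                                 # step 2: center
    Sigma = (X_c.T @ X_c) / (n - 1)              # step 3: covariance matrix
    eigvals, V = np.linalg.eigh(Sigma)           # step 4: eigendecomposition (ascending)
    idx = np.argsort(eigvals)[::-1]              # step 5: sort descending
    eigvals, V_sorted = eigvals[idx], V[:, idx]  # step 6: reorder eigenvectors
    Y = X_c @ V_sorted[:, :k]                    # step 7: project onto top k components
    return V_sorted, eigvals, Y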
Eigenvalue Decomposition
The covariance matrix Σ in principal component analysis is symmetric and positive semi-definite, which guarantees that all its eigenvalues are real and non-negative, and that it admits an orthogonal basis of eigenvectors.[34] This spectral theorem for symmetric matrices ensures the eigenvectors corresponding to distinct eigenvalues are orthogonal, allowing the decomposition Σ = V Λ V^T, where V is an orthogonal matrix whose columns are the eigenvectors, and Λ is a diagonal matrix of the eigenvalues.[35] To compute the eigendecomposition, the eigenvalues are found by solving the characteristic equation det(Σ − λI) = 0, which yields the roots λ_1, λ_2, …, λ_p, where p is the dimension of the data. For each eigenvalue λ_k, the corresponding eigenvector v_k satisfies the equation (Σ − λ_k I) v_k = 0, with v_k normalized such that v_k^T v_k = 1 and v_j^T v_k = 0 for j ≠ k.[34] Practical algorithms for this decomposition include the QR algorithm, which iteratively factors the matrix as A_t = Q_t R_t into an orthogonal Q_t and upper triangular R_t (via QR decomposition), then updates the matrix as A_{t+1} = R_t Q_t, converging to a triangular form revealing the eigenvalues on the diagonal; this method offers rapid convergence and numerical stability for symmetric matrices.[36] For approximating the dominant (largest) eigenvalue and eigenvector, power iteration starts with a random unit vector b_0 and repeatedly applies b_{t+1} = Σ b_t followed by normalization, converging linearly at a rate determined by the ratio λ_2/λ_1 of the second-largest to the largest eigenvalue.[37] Numerical considerations often involve deflation techniques to compute eigenvalues sequentially without full matrix operations each time; for instance, after finding the dominant eigenpair (λ_1, v_1), Hotelling's deflation updates the matrix to Σ − λ_1 v_1 v_1^T, which sets the first eigenvalue to zero while preserving the others, allowing iteration on the reduced problem with improved efficiency for high dimensions.[38] These approaches ensure stability, as the symmetry of Σ prevents complex arithmetic and maintains orthogonality under finite-precision computations.
Singular Value Decomposition
Singular value decomposition (SVD) provides a robust and direct method for computing principal component analysis (PCA) by factorizing the centered data matrix rather than its covariance matrix, making it particularly suitable for non-square or high-dimensional datasets. Consider a centered data matrix X ∈ ℝ^{n × p}, where n is the number of observations and p is the number of variables. The SVD decomposes X as X = U S V^T, where U contains the left singular vectors (orthonormal), V contains the right singular vectors (orthonormal loadings), S is a diagonal matrix with non-negative singular values s_1 ≥ s_2 ≥ … ≥ 0 (the number of nonzero values equals the rank of X), and the scores (projections of the data onto the principal components) are given by the columns of U S. This factorization directly yields the principal component directions as the columns of V, with the singular values indicating the amount of variance explained by each component.[39][40] The connection between SVD and PCA arises through the covariance structure: the eigenvalues of the sample covariance matrix Σ = (1/(n − 1)) X^T X satisfy λ_k = s_k^2 / (n − 1), since X^T X = V S^T S V^T, confirming that the right singular vectors are the eigenvectors of X^T X (and thus of Σ, up to scaling). This relation allows PCA to be performed entirely via SVD without explicitly forming or eigendecomposing the covariance matrix, which is advantageous when n ≫ p (tall matrices) or vice versa, as it avoids potential numerical issues from inverting or scaling large matrices. The proportion of variance explained by the k-th component is then s_k^2 / (s_1^2 + … + s_p^2).[41][39] SVD offers several advantages over eigendecomposition of the covariance matrix for PCA, including greater numerical stability for rank-deficient or ill-conditioned data, as it operates directly on the original matrix and handles rectangular shapes without modification. It is especially effective for "thin" or "tall" datasets where n and p differ greatly, preventing the need to compute a potentially singular covariance matrix when p > n. The computational complexity of full SVD is O(min(n p^2, n^2 p)), which is efficient for moderate-sized problems and scales better than the O(p^3) cost of eigendecomposition when p is large but n is small.[41][40][42] For computing the SVD in PCA, the Golub-Reinsch algorithm is a standard approach, involving an initial bidiagonalization step using Householder transformations to reduce X to a bidiagonal form, followed by iterative QR decomposition to obtain the singular values and vectors; this method is reliable for dense matrices and forms the basis for implementations in libraries like LAPACK. Alternatively, divide-and-conquer algorithms, which recursively split the matrix into smaller subproblems, offer faster performance for full SVD on large dense matrices, with complexity comparable to or better than the Golub-Reinsch approach in practice, while maintaining high accuracy. These techniques ensure that PCA via SVD remains feasible for datasets up to moderate sizes, such as thousands of observations and variables.[43]
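The SVD route can be sketched in a few lines of NumPy; this is an illustrative sketch rather than a canonical implementation.

import numpy as np

def pca_via_svd(X, k):
    # Center the data; the SVD of the centered matrix gives the principal axes.
    X_c = X - X.mean(axis=0)
    n = X_c.shape[0]
    # Thin SVD: U is n x r, s has length r, Vt is r x p, with r = min(n, p).
    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    eigvals = s**2 / (n - 1)       # eigenvalues of the covariance matrix
    scores = U[:, :k] * s[:k]      # equivalently X_c @ Vt[:k].T
    loadings = Vt[:k].T            # principal directions as columns
    return scores, loadings, eigvals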
Advanced Computation
Iterative Algorithms
Iterative algorithms for principal component analysis provide efficient ways to approximate the dominant eigenvectors of the covariance matrix without computing a full eigendecomposition, making them particularly useful for high-dimensional data where exact methods like singular value decomposition may be computationally prohibitive. These methods rely on repeated matrix-vector multiplications to iteratively refine estimates of the principal components, leveraging the fact that the leading eigenvector corresponds to the direction of maximum variance. The power method, a foundational iterative technique, is widely used to extract the first principal component and can be extended to subsequent ones through deflation. The power method begins with an initial random unit vector w_0 and iteratively updates it via the relation w_{t+1} = Σ w_t / ‖Σ w_t‖, where Σ is the covariance matrix of the centered data. This process amplifies the component of w_t aligned with the dominant eigenvector, causing convergence to that eigenvector under the assumption that the largest eigenvalue λ_1 strictly exceeds the second-largest eigenvalue λ_2. The method traces its application to principal components back to early formulations by Hotelling, who described iterative procedures for solving the associated eigenvalue problem. Modern expositions emphasize its simplicity and low per-iteration cost, typically involving only a single matrix-vector product followed by normalization. To compute subsequent principal components, deflation is applied after identifying the first one. This involves subtracting the projection of the data onto the found eigenvector from the original data (or equivalently, modifying the covariance matrix by removing the contribution of that component), thereby isolating the subspace orthogonal to it and allowing the power method to converge to the next dominant direction. For the k-th component, the deflated covariance is updated as Σ_k = Σ_{k−1} − λ_{k−1} v_{k−1} v_{k−1}^T, with Σ_1 = Σ, ensuring orthogonality among the extracted components. This sequential deflation enables extraction of the top k components with k separate invocations of the power method, avoiding the need for full matrix diagonalization. The convergence rate of the power method is geometric, with the angle θ_t between w_t and the true dominant eigenvector satisfying tan θ_t ≤ (λ_2/λ_1)^t tan θ_0, leading to an error reduction factor of λ_2/λ_1 per iteration. Thus, the number of iterations required to achieve a desired accuracy ε is on the order of log(1/ε) / log(λ_1/λ_2), making it efficient when the eigenvalue gap is large but slower otherwise. This approach is especially suitable for scenarios needing only the top few principal components, as full deflation for all components can accumulate numerical errors, though variants like orthogonal iteration mitigate this for multiple simultaneous vectors.
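A sketch of the power method with Hotelling-style deflation under the assumptions above; the fixed iteration cap, tolerance, and random seed are illustrative choices.

import numpy as np

def power_iteration_pca(Sigma, k, iters=1000, tol=1e-10):
    # Extract the top-k eigenpairs of a symmetric covariance matrix Sigma.
    p = Sigma.shape[0]
    rng = np.random.default_rng(0)
    eigvals, eigvecs = [], []
    S = Sigma.copy()
    for _ in range(k):
        w = rng.normal(size=p)
        w /= np.linalg.norm(w)
        for _ in range(iters):
            w_new = S @ w
            w_new /= np.linalg.norm(w_new)   # normalize each iteration
            if np.linalg.norm(w_new - w) < tol:
                w = w_new
                break
            w = w_new
        lam = w @ S @ w                      # Rayleigh quotient estimate of the eigenvalue
        eigvals.append(lam)
        eigvecs.append(w)
        S = S - lam * np.outer(w, w)         # Hotelling deflation removes the found component
    return np.array(eigvals), np.column_stack(eigvecs)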
Online Estimation
Online estimation in principal component analysis (PCA) refers to algorithms that incrementally update principal components as new data observations arrive in a streaming fashion, avoiding the need to store or recompute over the entire dataset. This approach is particularly suited for scenarios where data arrives sequentially and storage or batch processing is impractical, such as in real-time systems. These methods typically rely on stochastic approximations that adjust weight vectors to maximize variance projections on-the-fly.[44] A foundational technique for extracting the leading principal component online is Oja's rule, which updates a weight vector w using a single-pass Hebbian-inspired learning mechanism. The update is given by w ← w + η (x^T w) x, followed by normalization w ← w / ‖w‖, where η is a small learning rate and x is the incoming (centered) data vector. This rule converges in expectation to the dominant eigenvector of the covariance matrix under suitable conditions on η.[45] Theoretical analyses confirm that Oja's rule achieves near-optimal error rates for streaming PCA, with the error for the top component decaying on the order of 1/t after t updates.[46] For extracting multiple principal components, extensions such as Oja's subspace algorithm generalize the single-component rule to a set of orthonormal weight vectors, approximating the dominant subspace through sequential deflation or joint updates. Alternatively, stochastic gradient descent variants directly optimize the trace of the projected variance, updating a subspace matrix W to capture the top k components incrementally. These methods maintain orthogonality and converge to the principal subspace with rates depending on the eigengap between eigenvalues. In real-time applications, online PCA via Oja's rule and its generalizations enables dimensionality reduction for streaming sensor data in monitoring systems and adaptive control in robotics, where components are updated per observation to track evolving patterns. For non-stationary data streams, a forgetting factor can be incorporated into covariance updates, exponentially downweighting past observations to emphasize recent trends and adapt to concept drift.[47][48][44] Post-2015 advancements in stochastic PCA have addressed big data challenges by improving convergence guarantees and scalability; for instance, variance-reduced stochastic gradient methods achieve faster rates than plain Oja's rule for high-dimensional streams, while online difference-of-convex algorithms handle nonconvex extensions efficiently.
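A minimal sketch of Oja's rule as described above, assuming centered observations arrive one at a time; the decaying learning-rate schedule and random initialization are illustrative choices.

import numpy as np

def oja_first_component(stream, p, eta0=0.1):
    # stream: iterable of centered observation vectors of length p.
    rng = np.random.default_rng(0)
    w = rng.normal(size=p)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream, start=1):
        eta = eta0 / t                 # decaying learning rate
        w = w + eta * (x @ w) * x      # Hebbian-style update toward the dominant eigenvector
        w /= np.linalg.norm(w)         # explicit renormalization
    return w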
NIPALS Method
The Non-linear Iterative Partial Least Squares (NIPALS) algorithm provides an iterative approach to computing principal components, originally developed as part of partial least squares methods but adaptable to pure principal component analysis (PCA) by omitting the response matrix.[49] In PCA applications, NIPALS sequentially extracts components from a centered data matrix X by alternating between estimating score vectors (projections) and loading vectors (directions of maximum variance), making it suitable for scenarios where full eigendecomposition is computationally intensive.[50] NIPALS was introduced by Herman Wold in 1966 for estimating principal components and related models through iterative least squares, with early applications in chemometrics for handling ill-conditioned covariance matrices.[49] Wold's method, detailed in his chapter on multivariate analysis, emphasized its flexibility for sequential computation without requiring the full spectral decomposition upfront. For pure PCA, the algorithm proceeds as follows, assuming a column-centered matrix X of dimensions n × p (a brief code sketch follows the list):
- Initialize the component index k = 1 and set X_1 = X. Select an initial score vector t (e.g., a non-zero column of X_1).[50]
- Compute the loading vector: v = X_k^T t / (t^T t). Normalize it to unit length: v ← v / ‖v‖.
- Update the score vector: t = X_k v / (v^T v). (Note: the denominator v^T v = 1 after normalization.)[50]
- Repeat steps 2–3 until the change in t is below a convergence threshold (typically converges in fewer than 200 iterations).[51]
- Deflate the residual matrix: X_{k+1} = X_k − t v^T, increment k, and repeat for the next component until the desired number of components is obtained or residual variance is negligible.[50]
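The sketch below follows the notation of the list above (t for scores, v for loadings); the tolerance, iteration cap, function name, and choice of initial column are illustrative assumptions.

import numpy as np

def nipals_pca(X, n_components, max_iter=200, tol=1e-8):
    # X is assumed column-centered; returns score and loading matrices.
    X_k = X.copy()
    scores, loadings = [], []
    for _ in range(n_components):
        # Initial score vector: the column of X_k with the largest norm.
        t = X_k[:, int(np.argmax((X_k**2).sum(axis=0)))].copy()
        for _ in range(max_iter):
            v = X_k.T @ t / (t @ t)     # loading vector
            v /= np.linalg.norm(v)      # normalize to unit length
            t_new = X_k @ v             # updated score vector (v^T v = 1)
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X_k = X_k - np.outer(t, v)      # deflate the residual matrix
        scores.append(t)
        loadings.append(v)
    return np.column_stack(scores), np.column_stack(loadings)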