Feature scaling
Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
Motivation
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.[1]
It's also important to apply feature scaling if regularization is used as part of the loss function (so that coefficients are penalized appropriately).
Empirically, feature scaling can improve the convergence speed of stochastic gradient descent. In support vector machines,[2] it can reduce the time to find support vectors. Feature scaling is also often used in applications involving distances and similarities between data points, such as clustering and similarity search. As an example, the K-means clustering algorithm is sensitive to feature scales.
Methods
Rescaling (min-max normalization)
Also known as min-max scaling or min-max normalization, rescaling is the simplest method and consists of rescaling the range of features to [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max of [0, 1] is given as:[3]

x' = (x − min(x)) / (max(x) − min(x))

where x is an original value and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
To rescale a range between an arbitrary set of values [a, b], the formula becomes:

x' = a + (x − min(x))(b − a) / (max(x) − min(x))

where min(x) and max(x) are the minimum and maximum values of the feature.
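As a concrete illustration, both formulas can be applied with a few lines of NumPy; the weights below are hypothetical values chosen to span the [160, 200] range from the example:

```python
import numpy as np

# Hypothetical student weights in pounds, spanning [160, 200]
weights = np.array([160.0, 170.0, 180.0, 200.0])

# Min-max scaling to [0, 1]: subtract the minimum, divide by the range
scaled_01 = (weights - weights.min()) / (weights.max() - weights.min())
print(scaled_01)  # 0.0, 0.25, 0.5, 1.0

# General form for an arbitrary target range [a, b]
a, b = -1.0, 1.0
scaled_ab = a + (weights - weights.min()) * (b - a) / (weights.max() - weights.min())
print(scaled_ab)  # -1.0, -0.5, 0.0, 1.0
```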
Mean normalization

x' = (x − mean(x)) / (max(x) − min(x))

where x is an original value, x' is the normalized value, and mean(x) is the mean of that feature vector. There is another form of mean normalization that divides by the standard deviation, which is also called standardization.
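A minimal NumPy sketch of mean normalization, using an assumed toy feature column:

```python
import numpy as np

# Toy feature column (values assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean normalization: center on the mean, then scale by the range
x_norm = (x - x.mean()) / (x.max() - x.min())
print(x_norm)  # -0.5, -0.25, 0.0, 0.25, 0.5
```

Note how the result is centered on zero and bounded within [−0.5, 0.5].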
Standardization (Z-score Normalization)
In machine learning, we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks).[4][5] The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.
The standardized value is given by

x' = (x − x̄) / σ

where x is the original feature vector, x̄ is the mean of that feature vector, and σ is its standard deviation.
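The three steps described above (determine the mean and standard deviation, subtract the mean, divide by the standard deviation) can be sketched directly in NumPy on an assumed toy feature:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standardize: subtract the mean, divide by the (population) standard deviation
z = (x - x.mean()) / x.std()
print(z)         # -1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0
print(z.mean())  # 0.0
print(z.std())   # 1.0
```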
Robust Scaling
Robust scaling, also known as standardization using median and interquartile range (IQR), is designed to be robust to outliers. It scales features using the median and IQR as reference points instead of the mean and standard deviation:

x' = (x − Q2(x)) / (Q3(x) − Q1(x))

where Q1(x), Q2(x), and Q3(x) are the three quartiles (25th, 50th, and 75th percentiles) of the feature.
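A short NumPy sketch (with an assumed toy feature containing one outlier) shows how the quartile-based formula keeps the bulk of the data on a modest scale:

```python
import numpy as np

# Toy feature with one outlier (values assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Quartiles: Q1 (25th), Q2 (median), Q3 (75th percentile)
q1, q2, q3 = np.percentile(x, [25, 50, 75])

# Robust scaling: center on the median, scale by the IQR
x_robust = (x - q2) / (q3 - q1)
print(x_robust)  # -1.0, -0.5, 0.0, 0.5, 48.5
```

The inlier values land in [−1, 0.5] regardless of the outlier, which would have compressed them into a tiny fraction of the range under min-max rescaling.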
Unit vector normalization
Unit vector normalization regards each individual data point as a vector and divides each by its vector norm to obtain

x' = x / ‖x‖.

Any vector norm can be used, but the most common ones are the L1 norm and the L2 norm.
For example, if x = (v1, v2, v3), then its Lp-normalized version is

x' = (v1, v2, v3) / ‖x‖p, where ‖x‖p = (|v1|^p + |v2|^p + |v3|^p)^(1/p).
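A small NumPy sketch of L1 and L2 normalization for an assumed two-component vector:

```python
import numpy as np

x = np.array([3.0, 4.0])

# L2 (Euclidean) normalization: divide by the L2 norm (here 5.0)
x_l2 = x / np.linalg.norm(x)
print(x_l2)  # 0.6, 0.8

# L1 normalization: divide by the sum of absolute values (here 7.0)
x_l1 = x / np.abs(x).sum()
print(x_l1)
```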
See also
- Normalization (machine learning)
- Normalization (statistics)
- Standard score
- fMLLR, Feature space Maximum Likelihood Linear Regression
References
- ^ Ioffe, Sergey; Christian Szegedy (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". arXiv:1502.03167 [cs.LG].
- ^ Juszczak, P.; D. M. J. Tax; R. P. W. Duin (2002). "Feature scaling in support vector data descriptions". Proc. 8th Annu. Conf. Adv. School Comput. Imaging: 25–30. CiteSeerX 10.1.1.100.2524.
- ^ "Min Max normalization". ml-concepts.com. Archived from the original on 2023-04-05. Retrieved 2022-12-14.
- ^ Grus, Joel (2015). Data Science from Scratch. Sebastopol, CA: O'Reilly. pp. 99, 100. ISBN 978-1-491-90142-7.
- ^ Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN 978-0-387-84884-6.
Further reading
- Han, Jiawei; Kamber, Micheline; Pei, Jian (2011). "Data Transformation and Data Discretization". Data Mining: Concepts and Techniques. Elsevier. pp. 111–118. ISBN 9780123814807.
External links
- Lecture by Andrew Ng on feature scaling Archived 2017-03-14 at the Wayback Machine
Introduction
Definition and Scope
Feature scaling is the process of transforming the values of numerical features in a dataset to a common scale, ensuring that each feature contributes proportionally to machine learning models without one dominating due to differing ranges or variances.[1] In this context, features refer to the independent variables or attributes that represent the input data points, such as measurements in a tabular dataset.[1] This adjustment is essential for algorithms sensitive to feature magnitudes, like gradient descent-based methods or distance metrics, though its primary goal is to standardize inputs for fair comparison across variables.[3]

The scope of feature scaling is centered in machine learning, statistics, and data analysis, where it forms a foundational component of the preprocessing pipeline to prepare raw data for modeling.[1] It targets only numerical features, such as continuous or discrete quantitative variables, and does not apply to categorical data, which instead undergoes techniques like encoding to numerical representations.[4] As part of broader feature engineering, scaling addresses scale disparities but excludes aspects like feature creation, selection, or dimensionality reduction, focusing narrowly on rescaling to enhance algorithmic efficiency and interpretability.[3]

Feature scaling represents a specific subset of normalization techniques, emphasizing adjustments to the range or distribution of features across a dataset, while normalization encompasses a wider array of methods that alter data values, including instance-level scaling to unit norms for similarity computations.[1] This distinction highlights scaling's role in collective feature equalization, distinct from broader normalization practices that may involve probabilistic or distributional transformations in statistical contexts.[5]

Historical Context
The development of feature scaling techniques originated in early 20th-century statistics, closely linked to Karl Pearson's pioneering work on correlation and the standardization of variables around 1900. Pearson introduced the correlation coefficient in his 1895 paper, where standardization via subtraction of means and division by standard deviations enabled the comparison of variables on comparable scales, addressing issues in regression and inheritance analysis. His 1901 contribution further advanced these ideas by applying standardized scores in least-squares fitting for multivariate data, establishing normalization as a core practice for handling disparate measurement units in statistical modeling.

In the mid-20th century, feature scaling became integral to multivariate statistical methods, exemplified by Harold Hotelling's introduction of principal component analysis (PCA) in 1933. Hotelling's framework explicitly required scaling variables to equal variance (often through z-score standardization) to prevent features with larger natural scales from disproportionately influencing the principal components, thereby ensuring equitable contribution in dimensionality reduction and data summarization. This adoption in PCA and related techniques, such as canonical correlation, marked a shift toward systematic preprocessing in complex datasets, influencing fields like psychometrics and econometrics.

The integration of feature scaling into machine learning accelerated in the 1980s and 1990s alongside the resurgence of neural networks and distance-based algorithms, where unscaled features could distort gradient descent or metric computations.
Early neural network implementations, such as those using backpropagation popularized in 1986, implicitly relied on scaling to stabilize training, while algorithms like k-nearest neighbors (formalized in the 1950s but widely applied in ML contexts by the 1990s) and support vector machines (introduced in 1995) explicitly benefited from normalization to equalize feature influences in distance or margin calculations. Post-2000, its prominence grew with big data processing, facilitated by frameworks like scikit-learn, launched in 2007 as an extension of SciPy for scalable ML preprocessing.[6]

A pivotal milestone came in Christopher Bishop's 2006 textbook Pattern Recognition and Machine Learning, which systematically discussed feature scaling as essential preprocessing for gradient-based optimizers and probabilistic models, highlighting its role in improving convergence rates and model generalization across diverse datasets. This synthesis bridged statistical foundations with emerging ML paradigms, solidifying scaling's status as a standard practice in contemporary applications.

Motivation
Effects of Unscaled Features
In distance-based algorithms such as k-nearest neighbors (k-NN) and support vector machines (SVM), unscaled features lead to dominance by those with larger ranges, skewing computations and decision boundaries. For instance, in k-NN, features like proline (ranging 0–1,000) overshadow others like hue (1–10) in distance metrics, resulting in inaccurate neighbor selection and reduced model performance.[2] Similarly, in SVM, unscaled data requires much higher regularization parameters to compensate for magnitude imbalances, often yielding lower accuracy.[2]

Optimization algorithms relying on gradient descent, including linear regression and neural networks, suffer from elongated loss surfaces when features are unscaled, causing slower convergence and inefficient parameter updates. This occurs because gradients for high-magnitude features produce larger steps, while low-magnitude ones yield small updates, leading to uneven progress along the parameter space and potentially trapping the optimizer in suboptimal regions.[7] For example, in logistic regression applied to the wine dataset after PCA, unscaled features result in drastically lower accuracy (35.19%) and higher log-loss (0.957) compared to scaled versions (96.30% accuracy, 0.0825 log-loss).[2] Multilayer perceptrons exhibit similar sensitivity, with unscaled inputs degrading predictive performance across classification tasks.

A hypothetical dataset with height in centimeters (e.g., 150–200) and weight in kilograms (e.g., 50–100) illustrates this bias in SVM: the hyperplane would tilt excessively toward the weight feature due to its comparable range, misclassifying points where height differences are critical.[2] In k-NN, such disparities would prioritize weight in distance calculations, ignoring height's influence and leading to erroneous clustering.
Tree-based models like decision trees and random forests experience minimal direct impact from unscaled features, as splits are based on thresholds rather than distances or gradients. However, in ensembles such as random forests, scale differences can bias variable importance measures, with larger-scale features appearing more influential due to broader split ranges in the randomForest implementation.[8]

Unscaled features also distort statistical analyses like principal component analysis (PCA) by inflating the variance of high-magnitude variables, causing incorrect identification of principal components. For example, in the wine dataset, unscaled proline dominates the first principal component, overshadowing other features and leading to misrepresented data structure, whereas scaling ensures balanced contributions.[2] This skew can propagate errors in downstream tasks like dimensionality reduction.

Benefits of Scaling
Feature scaling significantly accelerates convergence in gradient-based optimization methods, such as stochastic gradient descent used in neural networks and logistic regression, by normalizing feature variances to create more isotropic loss landscapes, thereby enabling more uniform step sizes during updates. This addresses issues arising from unscaled features, where disparate scales lead to elongated loss surfaces that prolong training. Empirical evaluations across multiple datasets demonstrate that scaling reduces the number of iterations required for convergence, with multilayer perceptron models showing decreased training times using variance-stabilizing transformations.

By equalizing the ranges of features, scaling ensures fair contribution from all variables in model training, preventing features with larger magnitudes from disproportionately influencing outcomes and reducing bias in coefficient estimates, particularly in linear models like logistic regression. For instance, in logistic regression, unscaled features can skew coefficient interpretations toward high-magnitude inputs, but scaling allows coefficients to reflect true relative impacts without scale-induced distortions. This balanced influence enhances the reliability of model predictions in algorithms sensitive to feature magnitudes.

Scaling promotes enhanced generalization in distance-based models, such as support vector machines (SVM) and k-means clustering, by mitigating overfitting in high-dimensional spaces where unscaled features distort distance metrics. In k-means clustering on multi-unit feature sets, it improves accuracy from 0.56 to 0.96, precision from 0.49 to 0.96, recall from 0.61 to 0.95, and F-score from 0.54 to 0.95 compared to raw data.[9] These gains stem from more equitable proximity calculations, leading to robust cluster formations and decision boundaries that better handle unseen data.
From a computational standpoint, feature scaling yields efficiency improvements by lowering overall iteration counts in optimization routines, with studies reporting reduced training durations across models; for example, SVM training times decrease notably post-scaling. While specific speedup factors vary by dataset and model, these reductions highlight scaling's role in practical deployments.

Finally, scaled features facilitate greater interpretability by enabling direct comparisons of variable importance across diverse domains, such as medical imaging versus financial metrics, where raw scales might otherwise obscure relative contributions. In gradient-based models, this normalization preserves the semantic meaning of features while standardizing their influence, allowing practitioners to assess impacts consistently without scale artifacts confounding analyses.

Scaling Methods
Min-Max Normalization
Min-max normalization, also known as min-max scaling, is a linear transformation technique used to rescale features to a fixed range, typically [0, 1], by mapping the minimum value of the dataset to 0 and the maximum to 1.[10] The core formula for this transformation is

x' = (x − min(x)) / (max(x) − min(x))

| Original Feature | Example Value | Min-Max Scaled to [0, 1] |
|---|---|---|
| Age (years) | 30 | 0.167 |
| Salary (dollars) | 50,000 | 0.375 |
MinMaxScaler from the scikit-learn library's sklearn.preprocessing module implements min-max normalization. It scales features to a specified range (default [0, 1]) by subtracting the minimum value and dividing by the range. MinMaxScaler is recommended for algorithms requiring bounded ranges, such as neural networks. The scaler should be fitted on the training data only and then used to transform test data to avoid data leakage.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
X = np.array([[1., 1., 2.], [2., 0., 0.], [0., 1., 1.]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output: [[0.5 1.  1. ]
#          [1.  0.  0. ]
#          [0.  1.  0.5]]
Standardization
Standardization, also known as Z-score normalization, is a feature scaling technique that transforms data to have a mean of zero and a standard deviation of one, facilitating comparisons across features with different units or scales.[2] The transformation is applied using the formula

x' = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the feature. The sklearn.preprocessing.StandardScaler class implements standardization by removing the mean and scaling features to unit variance. It is particularly recommended for algorithms that assume zero mean and unit variance, such as SVM and logistic regression. The scaler must be fitted on the training data only using fit or fit_transform, then used to transform the test data with transform to avoid data leakage and ensure unbiased evaluation.
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1., 1., 2.], [2., 0., 0.], [0., 1., 1.]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output approx: [[ 0.     0.707  1.225]
#                 [ 1.225 -1.414 -1.225]
#                 [-1.225  0.707  0.   ]]
Among its advantages, standardization is robust to unbounded ranges, avoiding artificial bounds that might clip extreme values, and it effectively controls variance for stable model training.[2] However, it assumes absence of heavy tails, as outliers inflate the standard deviation, potentially compressing the bulk of the data and reducing sensitivity to typical variations.[17] For example, in clustering tasks like K-means applied to environmental data, standardizing temperature (e.g., in Celsius, mean 20, std 5) and humidity (e.g., percentage, mean 60, std 15) ensures both features influence cluster formation equally without one dominating due to larger numerical range.[5]
Mean Normalization
Mean normalization is a feature scaling technique that centers features around zero by subtracting their mean and then scales them using the feature's range, producing values typically in the approximate range of [-0.5, 0.5]. This method combines data centering with bounded scaling to address issues in algorithms sensitive to feature magnitudes, such as those relying on distance metrics or iterative optimization. The transformation is given by the formula

x' = (x − mean(x)) / (max(x) − min(x))

where mean(x) is the feature's average and max(x) − min(x) is its range.

Robust Scaling
Robust scaling is a preprocessing technique in machine learning that transforms features using statistics robust to outliers, primarily the median and interquartile range (IQR), making it suitable for datasets with skewed distributions or anomalous values. Unlike mean-based methods, it focuses on the central tendency and dispersion of the middle 50% of the data, reducing the impact of extreme observations on the scaling process. This approach originates from principles in robust statistics, where estimators are designed to maintain reliability despite deviations from assumed models, as formalized through influence functions that quantify an estimator's sensitivity to perturbations.[21] The core formula for robust scaling is:

x' = (x − Q2(x)) / (Q3(x) − Q1(x))

where Q2(x) is the median and Q3(x) − Q1(x) is the interquartile range of the feature.

Unit Vector Normalization
Unit vector normalization, also known as L2 normalization, is a feature scaling technique that transforms each feature vector to have a Euclidean norm of 1, thereby preserving the direction of the vector while removing magnitude information. This method is particularly suited for scenarios where the relative orientations between vectors are more informative than their absolute lengths. The transformation is defined by the formula

x' = x / ‖x‖₂

where ‖x‖₂ is the Euclidean (L2) norm of the vector x.

Practical Considerations
Selecting a Method
Selecting an appropriate feature scaling method involves evaluating the characteristics of the dataset, the requirements of the machine learning algorithm, and domain-specific considerations to optimize model performance and interpretability.[1] For datasets exhibiting a Gaussian or approximately normal distribution, standardization (Z-score normalization) is typically preferred as it centers the data around zero with unit variance, preserving the shape of the distribution while mitigating the influence of varying scales.[1] In contrast, datasets with significant outliers benefit from robust scaling, which uses median and interquartile range to reduce the impact of extreme values, ensuring more stable transformations compared to methods sensitive to minima and maxima.[1]

The choice also hinges on the algorithm: distance-based methods like k-nearest neighbors (KNN), support vector machines (SVM), and k-means clustering often perform better with min-max normalization or standardization to equalize feature contributions in distance calculations, while algorithms that assume or perform better with zero mean and unit variance, such as SVM and logistic regression, particularly benefit from standardization using StandardScaler. MinMaxScaler is preferred for features with bounded ranges or in neural networks. Gradient descent-based optimizers in linear regression or neural networks converge faster with appropriate scaling, often standardization due to its zero-mean data or min-max normalization to bound inputs.[1][7][29]

In practice, these methods are commonly implemented in libraries such as scikit-learn. MinMaxScaler scales features to a specified range (default [0, 1]) by subtracting the min and dividing by the range:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
X = np.array([[1., 1., 2.], [2., 0., 0.], [0., 1., 1.]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output:
# array([[0.5       , 1.        , 1.        ],
#        [1.        , 0.        , 0.        ],
#        [0.        , 1.        , 0.5       ]])
StandardScaler standardizes features by removing the mean and scaling to unit variance:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output approx:
# array([[ 0.   ,  0.707,  1.225],
#        [ 1.225, -1.414, -1.225],
#        [-1.225,  0.707,  0.   ]])
To prevent data leakage, scalers must be fitted on the training data only to compute statistics, then used to transform both training and test data.[1]
Domain-specific factors further guide selection. In image processing, min-max normalization to the [0,1] range is standard for pixel values, facilitating consistent input to convolutional neural networks and preserving the bounded nature of image intensities, as commonly applied in datasets like MNIST.[30]
Integration into the machine learning pipeline is crucial for maintaining data integrity. Feature scaling should occur after imputation of missing values to avoid distorting statistics used in the transformation, but before categorical encoding to ensure numerical features are on comparable scales prior to one-hot or ordinal transformations.[1] For model interpretation, inverse transformations allow reverting scaled features to their original units, enabling analysis of predictions in domain-relevant terms, such as through the inverse_transform method in libraries like scikit-learn.[1]
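As a sketch of that round trip (toy values assumed), scikit-learn scalers expose inverse_transform to map scaled features back to their original units:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Revert to original units for interpretation
X_restored = scaler.inverse_transform(X_scaled)
print(X_restored.ravel())  # approximately [1. 2. 3.]
```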
Empirical validation remains essential, as no universal method suits all scenarios; cross-validation can compare scaling techniques by evaluating downstream metrics like classification accuracy or optimization convergence speed, revealing context-specific improvements.[29] For instance, studies show that the choice of scaler can lead to substantial performance variations, with up to 0.5 differences in F1-scores for SVM on certain datasets, underscoring the need for testing.[29]
In complex datasets, hybrid approaches combine methods for enhanced robustness.[31] Recent trends in the 2020s have seen the rise of automated scaling in libraries like PyCaret, which intelligently selects and applies transformations within automated machine learning workflows, streamlining selection for practitioners while adapting to data characteristics.[32]