Permutation test
A permutation test (also called a re-randomization test or shuffle test) is an exact statistical hypothesis test. A permutation test involves two or more samples. The (possibly counterfactual) null hypothesis is that all samples come from the same distribution. Under the null hypothesis, the distribution of the test statistic is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data. Permutation tests are, therefore, a form of resampling.
Permutation tests can be understood as surrogate data testing where the surrogate data under the null hypothesis are obtained through permutations of the original data.[1]
In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s.
Permutation tests should not be confused with randomized tests.[2]
Method
To illustrate the basic idea of a permutation test, suppose we collect random variables X_A and X_B for each individual from two groups A and B whose sample means are x̄_A and x̄_B, and that we want to know whether X_A and X_B come from the same distribution. Let n_A and n_B be the sample size collected from each group. The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject, at some significance level, the null hypothesis H_0 that the data drawn from A is from the same distribution as the data drawn from B.
The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the observed value of the test statistic, T_obs.
Next, the observations of groups A and B are pooled, and the difference in sample means is calculated and recorded for every possible way of dividing the pooled values into two groups of size n_A and n_B (i.e., for every permutation of the group labels A and B). The set of these calculated differences is the exact distribution of possible differences (for this sample) under the null hypothesis that group labels are exchangeable (i.e., are randomly assigned).
The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than T_obs. The two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than |T_obs|. Many implementations of permutation tests require that the observed data itself be counted as one of the permutations so that the permutation p-value will never be zero.[3]
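The full enumeration described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name is chosen here for clarity, and the two-sided counting rule includes the observed labelling among the enumerated splits, so the p-value is never zero:

```python
from itertools import combinations

def exact_permutation_test(a, b):
    """Exact two-sided permutation test for a difference in means.

    Enumerates every way of splitting the pooled data into groups of
    size len(a) and len(b); feasible only for small samples.
    """
    pooled = a + b
    n, n_a = len(pooled), len(a)
    t_obs = sum(a) / n_a - sum(b) / len(b)

    count = total = 0
    for idx in combinations(range(n), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        t = sum(group_a) / n_a - sum(group_b) / len(group_b)
        # Two-sided: count splits at least as extreme as the observed one.
        # The observed labelling is itself enumerated, so p can never be 0.
        if abs(t) >= abs(t_obs) - 1e-12:
            count += 1
        total += 1
    return count / total
```

For two groups of three observations each there are only 20 splits, so the exact distribution is enumerated directly; the small tolerance guards against floating-point ties.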
Alternatively, if the only purpose of the test is to reject or not reject the null hypothesis, one could sort the recorded differences, and then observe if T_obs is contained within the middle (1 − α) × 100% of them, for some significance level α. If it is not, we reject the hypothesis of identical probability curves at the significance level α.
To exploit variance reduction with paired samples, a paired permutation test must be applied; see paired difference test. This is equivalent to performing an ordinary unpaired permutation test while restricting the set of valid permutations to those which respect the paired structure of the data, by forbidding both halves of any pair from being placed in the same partition. In the specific but common case where the test statistic is the mean, this is also equivalent to computing the within-pair differences and iterating over all of their sign-reversals instead of the usual partitioning approach.
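The sign-reversal formulation of the paired test can be sketched as follows (again a minimal illustration with an invented function name; flipping the sign of a within-pair difference corresponds to swapping the two labels inside that pair):

```python
from itertools import product

def paired_permutation_test(x, y):
    """Exact paired permutation test on the mean within-pair difference.

    Iterating over all 2**n sign patterns of the paired differences
    enumerates exactly those permutations that respect the pairing.
    """
    diffs = [xi - yi for xi, yi in zip(x, y)]
    n = len(diffs)
    t_obs = sum(diffs) / n

    count = total = 0
    for signs in product((1, -1), repeat=n):
        t = sum(s * d for s, d in zip(signs, diffs)) / n
        # Two-sided count; the all-(+1) pattern is the observed data.
        if abs(t) >= abs(t_obs) - 1e-12:
            count += 1
        total += 1
    return count / total
```

With n pairs there are 2^n sign patterns rather than the much larger number of unrestricted partitions, which is the variance-reduction benefit of respecting the pairing.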
Relation to parametric tests
Permutation tests are a subset of non-parametric statistics. Assuming that our experimental data come from two treatment groups, the method simply generates the distribution of mean differences under the assumption that the two groups are not distinct in terms of the measured variable. From this, one then uses the observed statistic (T_obs above) to see to what extent this statistic is special, i.e., the likelihood of observing the magnitude of such a value (or larger) if the treatment labels had simply been randomized after treatment.
In contrast to permutation tests, the distributions underlying many popular "classical" statistical tests, such as the t-test, F-test, z-test, and χ² test, are obtained from theoretical probability distributions. Fisher's exact test is an example of a commonly used permutation test for evaluating the association between two dichotomous variables. When sample sizes are very large, the Pearson chi-square test will give accurate results. For small samples, the chi-square reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher's exact test becomes more appropriate.
Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square). All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner to construct a permutation t-test, a permutation test of association, a permutation version of Aly's test for comparing variances and so on.
The major drawbacks to permutation tests are that they
- Can be computationally intensive and may require "custom" code for difficult-to-calculate statistics. This must be rewritten for every case.
- Are primarily used to provide a p-value. The inversion of the test to get confidence regions/intervals requires even more computation.
Advantages
Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses.
Permutation tests can be used for analyzing unbalanced designs[4] and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001) [citation needed]. They can also be used to analyze qualitative data that has been quantitized (i.e., turned into numbers). Permutation tests may be ideal for analyzing quantitized data that do not satisfy statistical assumptions underlying traditional parametric tests (e.g., t-tests, ANOVA),[5] see PERMANOVA.
Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes.
Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations made the application of permutation test methods practical for a wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based "exact" confidence intervals.
Limitations
An important assumption behind a permutation test is that the observations are exchangeable under the null hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation t-test) require equal variance under the normality assumption. In this respect, the classic permutation t-test shares the same weakness as the classical Student's t-test (the Behrens–Fisher problem). This can be addressed in the same way the classic t-test has been extended to handle unequal variances: by employing the Welch statistic with Satterthwaite adjustment to the degrees of freedom.[6] A third alternative in this situation is to use a bootstrap-based test. Statistician Phillip Good explains the difference between permutation tests and bootstrap tests the following way: "Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap entails less-stringent assumptions."[7] Bootstrap tests are not exact. In some cases, a permutation test based on a properly studentized statistic can be asymptotically exact even when the exchangeability assumption is violated.[8] Bootstrap-based tests can test with the null hypothesis and, therefore, are suited for performing equivalence testing.
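A permutation test built on the Welch statistic, as mentioned above, can be sketched as follows. This is a hypothetical stdlib-only implementation, not a reference one; the resample count, the seed, and the use of random shuffles (rather than full enumeration) are arbitrary choices for illustration:

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic: a location difference studentized by
    per-group variances, robust to unequal spread."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

def permutation_welch_test(a, b, n_resamples=9999, seed=0):
    """Two-sided Monte Carlo permutation test based on the Welch statistic."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    t_obs = welch_t(a, b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= abs(t_obs):
            hits += 1
    # Count the observed arrangement so the p-value is never zero.
    return (hits + 1) / (n_resamples + 1)
```

Studentizing each permuted split by its own group variances is what makes this variant less sensitive to unequal spread than a raw difference in means.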
Monte Carlo testing
An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data to allow complete enumeration in a convenient manner. This is done by generating the reference distribution by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible replicates. The realization that this could be applied to any permutation test on any dataset was an important breakthrough in the area of applied statistics. The earliest known references to this approach are Eden and Yates (1933) and Dwass (1957).[9][10] This type of permutation test is known under various names: approximate permutation test, Monte Carlo permutation test or random permutation test.[11]
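A Monte Carlo permutation test for the difference in means can be sketched as follows (a minimal illustration with invented names; the "+1" in numerator and denominator counts the observed arrangement, following the convention that the permutation p-value is never zero):

```python
import random

def monte_carlo_permutation_test(a, b, n_resamples=10_000, seed=42):
    """Approximate the permutation distribution of the difference in
    means by randomly sampling label rearrangements."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a = len(a)
    t_obs = sum(a) / n_a - sum(b) / len(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # one random relabelling of the pooled data
        t = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / len(b)
        if abs(t) >= abs(t_obs):
            hits += 1
    # Include the observed arrangement in the count.
    return (hits + 1) / (n_resamples + 1)
```

Each shuffle draws one relabelling (with replacement from the space of permutations), so the returned value is a random estimate of the exact permutation p-value.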
After N random permutations, it is possible to obtain a confidence interval for the p-value based on the binomial distribution; see binomial proportion confidence interval. For example, if after N = 10,000 random permutations the p-value is estimated to be p̂ = 0.05, then a 99% confidence interval for the true p (the one that would result from trying all possible permutations) is approximately p̂ ± 2.58 √(p̂(1 − p̂)/N) = (0.044, 0.056).
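The normal-approximation interval above is a one-line computation; the sketch below assumes the estimate p̂ = 0.05 from N = 10,000 permutations (the helper name is invented for illustration):

```python
def p_value_ci(p_hat, n_resamples, z=2.576):
    """Normal-approximation confidence interval for a Monte Carlo
    p-value estimate (z = 2.576 gives ~99% coverage)."""
    half_width = z * (p_hat * (1 - p_hat) / n_resamples) ** 0.5
    return p_hat - half_width, p_hat + half_width

lo, hi = p_value_ci(0.05, 10_000)  # roughly (0.044, 0.056)
```

For small estimated p-values an exact binomial (e.g. Clopper–Pearson) interval is preferable to this normal approximation.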
On the other hand, the purpose of estimating the p-value is most often to decide whether p ≤ α, where α is the threshold at which the null hypothesis will be rejected (typically α = 0.05). In the example above, the confidence interval only tells us that there is roughly a 50% chance that the p-value is smaller than 0.05, i.e. it is completely unclear whether the null hypothesis should be rejected at the level α = 0.05.
If it is only important to know whether p ≤ α for a given α, it is logical to continue simulating until the statement can be established to be true or false with a very low probability of error. Given a bound ε on the admissible probability of error (the probability of finding that p ≤ α when in fact p > α, or vice versa), the question of how many permutations to generate can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far, in order to guarantee that the conclusion (which is either p ≤ α or p > α) is correct with probability at least as large as 1 − ε. (ε will typically be chosen to be extremely small, e.g. 1/1000.) Stopping rules to achieve this have been developed[12] which can be incorporated with minimal additional computational cost. In fact, depending on the true underlying p-value it will often be found that the number of simulations required is remarkably small (e.g. as low as 5 and often not larger than 100) before a decision can be reached with virtual certainty.
Literature
Original references:
- Fisher, R.A. (1935) The Design of Experiments, New York: Hafner
- Pitman, E. J. G. (1937) "Significance tests which may be applied to samples from any population", Royal Statistical Society Supplement, 4: 119–130 and 225–232 (parts I and II). JSTOR 2984124, JSTOR 2983647
- Pitman, E. J. G. (1938). "Significance tests which may be applied to samples from any population. Part III. The analysis of variance test". Biometrika. 29 (3–4): 322–335. doi:10.1093/biomet/29.3-4.322.
Modern references:
- Collingridge, D.S. (2013). "A Primer on Quantitized Data Analysis and Permutation Testing". Journal of Mixed Methods Research. 7 (1): 79–95. doi:10.1177/1558689812454457. S2CID 124618343.
- Edgington, E. S., & Onghena, P. (2007) Randomization tests, 4th ed. New York: Chapman and Hall/CRC ISBN 9780367577711
- Good, Phillip I. (2005) Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd ed., Springer ISBN 0-387-98898-X
- Good, P (2002). "Extensions of the concept of exchangeability and their applications". Journal of Modern Applied Statistical Methods. 1 (2): 243–247. doi:10.22237/jmasm/1036110240.
- Lunneborg, Cliff. (1999) Data Analysis by Resampling, Duxbury Press. ISBN 0-534-22110-6.
- Pesarin, F. (2001). Multivariate Permutation Tests : With Applications in Biostatistics, John Wiley & Sons. ISBN 978-0471496700
- Welch, W. J. (1990). "Construction of permutation tests". Journal of the American Statistical Association. 85 (411): 693–698. doi:10.1080/01621459.1990.10474929.
Computational methods:
- Mehta, C. R.; Patel, N. R. (1983). "A network algorithm for performing Fisher's exact test in r x c contingency tables". Journal of the American Statistical Association. 78 (382): 427–434. doi:10.1080/01621459.1983.10477989.
- Mehta, C. R.; Patel, N. R.; Senchaudhuri, P. (1988). "Importance sampling for estimating exact probabilities in permutational inference". Journal of the American Statistical Association. 83 (404): 999–1005. doi:10.1080/01621459.1988.10478691.
- Gill, P. M. W. (2007). "Efficient calculation of p-values in linear-statistic permutation significance tests" (PDF). Journal of Statistical Computation and Simulation. 77 (1): 55–61. CiteSeerX 10.1.1.708.1957. doi:10.1080/10629360500108053. S2CID 1813706.
Current research on permutation tests
- Good, P.I. (2012) Practitioner's Guide to Resampling Methods.
- Good, P.I. (2005) Permutation, Parametric, and Bootstrap Tests of Hypotheses
- Hesterberg, T. C., D. S. Moore, S. Monaghan, A. Clipson, and R. Epstein (2005): Bootstrap Methods and Permutation Tests, software.
- Moore, D. S., G. McCabe, W. Duckworth, and S. Sclove (2003): Bootstrap Methods and Permutation Tests
- Simon, J. L. (1997): Resampling: The New Statistics.
- Yu, Chong Ho (2003): Resampling methods: concepts, applications, and justification. Practical Assessment, Research & Evaluation, 8(19). (statistical bootstrapping)
- Resampling: A Marriage of Computers and Statistics (ERIC Digests)
- Pesarin, F., Salmaso, L. (2010). Permutation Tests for Complex Data: Theory, Applications and Software. Wiley. https://books.google.com/books?id=9PWVTOanxPUC&hl=de
References
- ^ Moore, Jason H. "Bootstrapping, permutation testing and the method of surrogate data." Physics in Medicine & Biology 44.6 (1999): L11.
- ^ Onghena, Patrick (2017-10-30), Berger, Vance W. (ed.), "Randomization Tests or Permutation Tests? A Historical and Terminological Clarification", Randomization, Masking, and Allocation Concealment (1 ed.), Boca Raton, FL: Chapman and Hall/CRC, pp. 209–228, doi:10.1201/9781315305110-14, ISBN 978-1-315-30511-0, retrieved 2021-10-08
- ^ Phipson, Belinda; Smyth, Gordon K (2010). "Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn". Statistical Applications in Genetics and Molecular Biology. 9 (1): Article 39. arXiv:1603.05766. doi:10.2202/1544-6115.1585. PMID 21044043. S2CID 10735784.
- ^ "Invited Articles" (PDF). Journal of Modern Applied Statistical Methods. 1 (2): 202–522. Fall 2011. Archived from the original (PDF) on May 5, 2003.
- ^ Collingridge, Dave S. (11 September 2012). "A Primer on Quantitized Data Analysis and Permutation Testing". Journal of Mixed Methods Research. 7 (1): 81–97. doi:10.1177/1558689812454457. S2CID 124618343.
- ^ Janssen, Arnold (1997). "Studentized Permutation Tests for Non-I.i.d. Hypotheses and the Generalized Behrens-Fisher Problem". Statistics & Probability Letters. 36 (1): 9–21. doi:10.1016/s0167-7152(97)00043-6.
- ^ Good, Phillip I. (2005). Resampling Methods: A Practical Guide to Data Analysis (3rd ed.). Birkhäuser. ISBN 978-0817643867.
- ^ Chung, EY; Romano, JP (2013). "Exact and asymptotically robust permutation tests". The Annals of Statistics. 41 (2): 487–507. arXiv:1304.5939. doi:10.1214/13-AOS1090.
- ^ Eden, T; Yates, F (1933). "On the validity of Fisher's z test when applied to an actual example of non-normal data. (With five text-figures.)". The Journal of Agricultural Science. 23 (1): 6–17. doi:10.1017/S0021859600052862. S2CID 84802682. Retrieved 3 June 2021.
- ^ Dwass, Meyer (1957). "Modified Randomization Tests for Nonparametric Hypotheses". Annals of Mathematical Statistics. 28 (1): 181–187. doi:10.1214/aoms/1177707045. JSTOR 2237031.
- ^ Thomas E. Nichols, Andrew P. Holmes (2001). "Nonparametric Permutation Tests For Functional Neuroimaging: A Primer with Examples" (PDF). Human Brain Mapping. 15 (1): 1–25. doi:10.1002/hbm.1058. hdl:2027.42/35194. PMC 6871862. PMID 11747097.
- ^ Gandy, Axel (2009). "Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk". Journal of the American Statistical Association. 104 (488): 1504–1511. arXiv:math/0612488. doi:10.1198/jasa.2009.tm08368. S2CID 15935787.
Fundamentals
Definition and basic principles
A permutation test is an exact statistical hypothesis test that evaluates whether observed data support a null hypothesis of exchangeability by constructing the empirical null distribution from all possible rearrangements (permutations) of the data.[4][5] This approach treats the pooled observations as fixed, generating the reference distribution conditionally on the observed data, which serves as a sufficient statistic under the null.[5] As a non-parametric method, it makes no assumptions about the underlying data distribution, such as normality, distinguishing it from parametric tests that rely on specific distributional forms.[6][4] The basic principle underlying permutation tests is the assumption that, under the null hypothesis, the observations are exchangeable—meaning any permutation of their labels or assignments yields the same joint distribution.[4][6] This allows for randomly reassigning group labels or pairings while keeping the data values fixed, simulating outcomes in a world where no systematic differences exist between groups.[5] The test then assesses the extremity of the observed test statistic relative to this permutation-generated distribution, providing a p-value that reflects the probability of obtaining results at least as extreme under the null.[6] This exchangeability condition is weaker than independence and identical distribution (IID), enabling robust inference even when stricter assumptions fail.[4] Permutation tests are applicable to comparing two or more samples, including in regression and multivariate settings, offering flexibility for small or complex datasets where parametric methods may be inappropriate.[5][6] For instance, in a two-sample test, the method permutes the group labels between samples to mimic null-world scenarios, intuitively checking if the observed difference could arise by chance alone.[4] For large datasets where exhaustive permutations are computationally infeasible, Monte Carlo approximations can sample from 
the permutation space to estimate the distribution.[6]

Historical development
The permutation test originated in the context of randomized agricultural experiments during the 1920s, inspired by Ronald A. Fisher's famous "lady tasting tea" experiment, which demonstrated the use of exact randomization to test sensory discrimination claims under controlled conditions.[7] This thought experiment, conceived around 1925 and later detailed in Fisher's 1935 book The Design of Experiments, emphasized randomization as a foundation for valid inference without distributional assumptions, particularly for analyzing variance in experimental designs. Fisher's work tied permutation methods directly to the randomization inherent in experimental setups, such as those at Rothamsted Experimental Station, where treatments were assigned to plots to ensure the null distribution of test statistics could be derived from all possible rearrangements. In the early 1930s, the method was independently developed and applied to small-sample exact tests. Eden and Yates introduced permutation resampling in 1933 to validate Fisher's z-test on non-normal agricultural data, computing exact probabilities by enumerating all possible arrangements of wheat height measurements across blocks. Fisher formalized the approach in 1935 for general randomized experiments, while E.J.G. Pitman extended it through seminal papers in 1937 and 1938, developing distribution-free significance tests for differences in means, correlations, and analysis of variance applicable to samples from any population. Pitman's contributions, including exact tests for variance ratios, solidified permutation methods as robust alternatives for small datasets where parametric assumptions failed. Post-World War II, permutation tests gained prominence within non-parametric statistics as a response to the limitations of Gaussian-based methods, with key formalizations appearing in the 1940s. Maurice Kendall and B. 
Babington Smith's The Advanced Theory of Statistics (1943) integrated permutation principles into broader statistical theory, alongside developments like the Wilcoxon rank-sum test (1945) and Mann-Whitney U test (1947), which relied on permutation distributions for exact inference. By the mid-20th century, these methods had expanded beyond agricultural randomization to general hypothesis testing across fields like psychology and biology, emphasizing their exactness for finite samples.[7] The 1980s and 1990s saw a resurgence driven by increased computational power, enabling permutation tests for larger datasets and complex designs previously infeasible by hand. This era featured algorithmic improvements, such as network methods for exact computations, and the popularization of Monte Carlo approximations.[7] Phillip Good's 1994 book Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses synthesized these advances, providing accessible implementations and demonstrating their utility in diverse applications from clinical trials to environmental science.

Procedure
Step-by-step exact method
The exact permutation test provides a precise method for hypothesis testing by exhaustively generating the entire null distribution of the test statistic, assuming the data are exchangeable under the null hypothesis of no group differences. This approach is computationally feasible only for small to moderate sample sizes, where the total number of distinct permutations remains manageable, typically up to around 10^6. For example, with two groups of 5 observations each, the number of possible permutations is \binom{10}{5} = 252, allowing full enumeration on standard hardware. The procedure follows these steps:

- Formulate the null hypothesis of exchangeability, which posits that the observations from different groups (or conditions) are interchangeable, implying no systematic differences between them.
- Compute the observed test statistic $ T_{\text{obs}} $ from the original data, such as the difference in group means.
- Generate all possible permutations of the data labels or pooled observations, respecting the group sizes; for two groups of sizes $ n_1 $ and $ n_2 $, this yields $ \binom{n_1 + n_2}{n_1} $ unique arrangements under the null.
- For each permutation, recalculate the test statistic $ T_i $.
- Determine the p-value as the proportion of permuted statistics at least as extreme as $ T_{\text{obs}} $, including the observed case itself; for a two-sided test, this is given by

$$ p = \frac{\#\{\, i : |T_i| \ge |T_{\text{obs}}| \,\}}{\binom{n_1 + n_2}{n_1}} $$
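The steps above can be condensed into a short Python sketch (an illustrative stdlib-only example, with an invented function name, not a reference implementation):

```python
from itertools import combinations
from math import comb

def exact_two_sided_p(a, b):
    """Exact method above: enumerate all C(n1+n2, n1) relabellings and
    count those with |T_i| >= |T_obs|, the observed split included."""
    pooled = a + b
    n, n1 = len(pooled), len(a)
    t_obs = sum(a) / n1 - sum(b) / len(b)  # step 2: observed statistic
    extreme = 0
    for idx in combinations(range(n), n1):  # step 3: all relabellings
        chosen = set(idx)
        ga = [pooled[i] for i in chosen]
        gb = [pooled[i] for i in range(n) if i not in chosen]
        t = sum(ga) / n1 - sum(gb) / len(gb)  # step 4: recompute T_i
        if abs(t) >= abs(t_obs) - 1e-12:      # step 5: two-sided count
            extreme += 1
    return extreme / comb(n, n1)
```

For two groups of 5 this loops over exactly \binom{10}{5} = 252 splits, matching the count given above.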
Choice of test statistic
The test statistic in a permutation test quantifies the discrepancy between the observed data and the null hypothesis of exchangeability, serving as a measure of effect size or difference relevant to the hypothesis under investigation. It must be defined such that it can be consistently computed for the original dataset and for each permuted version of the data, enabling the generation of an empirical reference distribution under the null. This flexibility allows the test statistic to be tailored to the specific research question, prioritizing sensitivity to anticipated alternatives while maintaining computational tractability. A classic example is the two-sample test for difference in means, where the test statistic is the unpooled t-statistic:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}} $$

Variations
Monte Carlo approximation
When the total number of possible permutations under the null hypothesis is exceedingly large, rendering exact enumeration computationally infeasible, the Monte Carlo approximation provides a practical alternative by drawing a large random sample of permutations to estimate the null distribution of the test statistic.[8] Typically, samples of 10,000 or more permutations are used, with selection performed either with replacement (for simplicity when the permutation space is vast) or without replacement (to maintain exactness in smaller feasible cases).[9] This approach, which gained prominence in the 1980s alongside advances in computing power, allows permutation tests to scale to larger datasets while preserving their non-parametric validity.[8] The procedure adapts the exact permutation test by replacing complete enumeration with Monte Carlo sampling: after computing the observed test statistic, a random subset of permutations is generated, the test statistic is recalculated for each, and the approximate p-value is obtained as the proportion of these values that are as extreme as or more extreme than the observed statistic.[10] To assess the reliability of this estimate, standard error calculations can provide confidence intervals for the p-value, aiding interpretation in cases where precision matters.[9] The approximation's accuracy improves with larger sample size B; for instance, with B = 10,000, the standard error is approximately $\sqrt{p(1-p)/B}$, where p is the true (unknown) p-value, yielding errors on the order of 0.005 for typical p around 0.05.[10] The variance of the Monte Carlo p-value estimator is approximated by

$$ \operatorname{Var}(\hat{p}) \approx \frac{p(1-p)}{B} $$

Multivariate extensions
In multivariate extensions of permutation tests, the core principle involves permuting entire observation vectors for each unit rather than individual components, thereby preserving the within-unit correlations across multiple dimensions. This approach is particularly suitable for scenarios such as multivariate analysis of variance (MANOVA), where hypotheses concern differences in multiple endpoints or response variables simultaneously, ensuring that the test maintains the joint distribution structure under the null hypothesis of no group differences. A prominent example is the permutational multivariate analysis of variance (PERMANOVA), which extends univariate ANOVA to multivariate data by operating on distance or dissimilarity matrices, such as Euclidean distances for continuous variables. The test statistic is typically a pseudo-F ratio, analogous to the classical F-statistic, defined as:

$$ F = \frac{SS_A / (a - 1)}{SS_W / (N - a)} $$

where $SS_A$ and $SS_W$ are the among-group and within-group sums of squared dissimilarities, $a$ is the number of groups, and $N$ is the total number of observations.

Theoretical foundations
Assumptions and null distribution
The null hypothesis in a permutation test posits that the observations are exchangeable, meaning that under the null, the joint distribution of the data remains invariant to any permutation of the observations, implying no systematic differences between groups or conditions.[13] This exchangeability holds when the observations can be regarded as independent and identically distributed (i.i.d.) from the same underlying distribution, such that permuting labels or assignments does not alter the probability of observing the data.[14] For instance, in a two-sample test, the null assumes the samples arise from the same population, with any apparent differences attributable to random variation rather than true effects.[15] The primary assumptions of permutation tests include random sampling from the population or, in experimental designs, randomization in the assignment of treatments to units, ensuring that the observed data's marginal distribution is preserved under permutation.[16] Unlike parametric tests, no specific distributional form (e.g., normality) is required beyond exchangeability, making the test conditional on the observed data without modeling the data-generating process explicitly.[17] However, the assumptions demand independence of observations; violations such as dependence (e.g., in clustered or time-series data) or heterogeneous variances can invalidate exchangeability, leading to incorrect inference.[18] Permutation tests are thus robust to the shape of the underlying distribution but sensitive to structural dependencies that prevent permutations from mimicking the null world adequately.[19] Under the null hypothesis, the null distribution of the test statistic is discrete and uniform over all possible permutations of the data, with each permutation equally likely.[20] For a dataset of size $n$, the total number of distinct permutations is $n!$ (or $\frac{n!}{n_1! \cdots n_k!}$ for grouped designs, where $n_1, \ldots, n_k$ are the group sizes), and the probability of any specific permutation is $1/n!$.[13] This
uniformity arises because exchangeability ensures every relabeling of the data is probabilistically equivalent under the null, generating an exact reference distribution from which the p-value is computed as the proportion of permutations yielding a test statistic at least as extreme as the observed one. In contrast to parametric tests, which often rely on asymptotic normality for large samples, the permutation null distribution is exact and finite, avoiding approximations even for small datasets.[16]

Relation to parametric and randomization tests
Permutation tests function as non-parametric alternatives to parametric procedures such as the Student's t-test for comparing means or analysis of variance (ANOVA) for group differences. Parametric tests derive their sampling distributions under specific assumptions, including normality of errors and homogeneity of variances, which enable exact or asymptotic control of the Type I error rate and potentially higher statistical power when these conditions hold. In contrast, permutation tests rely on the exchangeability of observations under the null hypothesis to generate an exact reference distribution by rearranging data labels, thereby controlling the Type I error rate precisely for finite samples without invoking normality or other distributional forms. When the underlying data satisfy parametric assumptions, such as normality, permutation tests can closely mimic the behavior of their parametric counterparts; for instance, the permutation distribution of the t-statistic in a two-sample test with equal sample sizes closely approximates the Student t distribution, leading to nearly identical p-values. Overall, the power of permutation tests is often comparable to that of parametric tests under ideal conditions but offers greater robustness when those assumptions are violated, though parametric methods may exhibit superior power in large samples when normality prevails. Randomization tests represent a specific subset of permutation tests tailored to designed experiments, where the randomness arises from the deliberate random assignment of treatments to units, as in Ronald Fisher's foundational framework for exact inference in agricultural trials. Permutation tests extend this approach more broadly to observational or non-experimental data, assuming exchangeability rather than controlled randomization, which allows their application beyond strictly designed settings like Fisher's exact test for contingency tables.
In randomized experiments, the permutation distribution under exchangeability aligns precisely with the randomization distribution, a connection clarified by Eugene S. Edgington in the 1980s to unify the two under shared principles of resampling-based inference.[21]

Properties
Advantages
Permutation tests provide exact control of the Type I error rate for finite sample sizes under the randomization model, unlike parametric tests such as the t-test, which control it exactly only under specific distributional assumptions like normality and may otherwise rely on asymptotic approximations. This exactness ensures that the probability of falsely rejecting the null hypothesis is precisely the nominal significance level, making permutation tests particularly reliable in experimental settings where randomization is the basis for inference.[22] A key advantage of permutation tests is their flexibility, as they require no assumptions about the underlying data distribution and can be applied to virtually any test statistic, including complex, user-defined, or non-standard ones that capture specific aspects of the data. This allows researchers to tailor the test to the problem at hand without being constrained by predefined parametric forms.[22] Permutation tests demonstrate robustness to violations of normality, effectively handling skewed, heavy-tailed, or ordinal data, and performing well even with small sample sizes where parametric methods often fail due to unmet assumptions. In such scenarios, they maintain validity and reliability without needing data transformations.[22] For non-normal data, permutation tests often exhibit superior power compared to parametric alternatives like the t-test; for instance, Monte Carlo simulations indicate higher detection rates for group differences under uniform or moderately skewed distributions.[23] Additionally, the empirical null distribution generated directly from the permuted data enhances interpretability, as it provides a tangible, data-driven reference for understanding the variability and extremity of the observed test statistic under the null hypothesis.[20]

Limitations
Permutation tests, while robust and distribution-free, suffer from significant computational challenges, particularly when performing exact tests. The exact permutation distribution requires evaluating the test statistic over all possible rearrangements of the data under the null hypothesis, which for a two-sample test with total sample size n = n_1 + n_2 involves \binom{n}{n_1} distinct rearrangements, where n_1 is the size of the first sample. This number grows exponentially with n, rendering exact computation infeasible for moderate to large sample sizes; exact tests are typically feasible only for very small datasets.[20] For larger n, such as balanced samples of 25 each (n = 50), the number of rearrangements \binom{50}{25} exceeds 10^14, making exhaustive enumeration practically impossible without specialized algorithms. To circumvent this, Monte Carlo approximations sample a subset of permutations, but this introduces variability and approximation error in the resulting p-value, with precision depending on the number of resamples used.[24][25][26]

A core limitation stems from the reliance on the exchangeability assumption under the null hypothesis, which posits that the joint distribution of the observations is unchanged under any permutation. This holds for independent and identically distributed (i.i.d.) data but fails for dependent structures, such as time series, spatial data, or clustered observations, where permuting units disrupts inherent dependencies. In these cases, standard unrestricted permutations yield invalid null distributions, necessitating restricted or design-based randomization schemes that further complicate implementation and increase computational demands.
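Both the combinatorial explosion and the Monte Carlo trade-off are easy to quantify. The snippet below (a minimal sketch; the reference p-value of 0.05 is arbitrary) counts the rearrangements for the sample sizes mentioned above and estimates the binomial standard error of a Monte Carlo p-value:

```python
import math

# Number of distinct two-sample assignments: C(n, n1).
print(math.comb(8, 4))    # small design: 70, easily enumerated
print(math.comb(50, 25))  # balanced 25 + 25: ~1.26e14, infeasible exactly

def mc_pvalue_se(p, num_resamples):
    """Binomial standard error of a Monte Carlo p-value estimate."""
    return math.sqrt(p * (1 - p) / num_resamples)

# Precision near p = 0.05 for increasing numbers of random permutations.
for b in (1_000, 10_000, 100_000):
    print(b, round(mc_pvalue_se(0.05, b), 5))
```

Roughly, quadrupling the number of resamples halves the Monte Carlo error, which is why p-values near a decision threshold call for large resample counts.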
For example, in clustered data, exchangeability may not apply if variances differ across clusters, even under a null of equal means, violating the test's validity.[27][28] Permutation tests also exhibit lower statistical power than parametric tests when the data actually meet parametric assumptions such as normality, because they do not leverage distributional information to concentrate the test. Under normality, parametric tests like the t-test achieve higher power by exploiting the known shape of the sampling distribution, whereas permutation tests treat all permutations equally, leading to a more diffuse null distribution. Additionally, in multiple testing scenarios, applying permutations to each hypothesis independently escalates computational costs without inherent multiplicity adjustments, often requiring joint permutation strategies that amplify the burden. The resulting exact p-values are discrete multiples of 1/M (where M is the total number of permutations), leading to ties and reduced resolution; because the observed arrangement is itself one of the permutations, the smallest attainable p-value is 1/M, so very small p-values cannot be reported when M is small. Furthermore, deriving confidence intervals from permutation tests is less intuitive and efficient than using bootstrap methods, which are better suited for interval estimation due to their resampling flexibility.[29][30][1]
Applications
Two-sample and ANOVA examples
Permutation tests are commonly applied to compare means between two independent samples under the null hypothesis that the samples come from the same distribution. Consider a two-sample test using corn yield data from an agricultural experiment with eight plots divided into weed-free and weedy conditions.[https://www.math.csi.cuny.edu/~verzani/tmp/sdc/permutation-test.html] The weed-free group yields were 166.7, 172.2, 165.0, and 176.9 bushels per acre, with a mean of 170.2. The weedy group yields were 162.8, 142.4, 162.8, and 162.4 bushels per acre, with a mean of 157.6. The observed test statistic is the difference in group means: 170.2 - 157.6 = 12.6. To compute the p-value exactly, pool all eight observations and generate all possible ways to reassign four to the weed-free group, yielding \binom{8}{4} = 70 permutations, each equally likely under the null. For each permutation, recalculate the difference in means. The p-value is the proportion of these differences that are at least as extreme as 12.6 (one-sided test for higher yield in weed-free), which is 1/70 ≈ 0.014.[https://www.math.csi.cuny.edu/~verzani/tmp/sdc/permutation-test.html] Since 0.014 < 0.05, reject the null hypothesis, concluding evidence that weeding increases yields. For larger samples where exact enumeration is infeasible, Monte Carlo approximation uses random permutations. 
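The exact enumeration for this corn example takes only a few lines; the sketch below (plain Python, yields as given above) walks all \binom{8}{4} = 70 assignments and counts differences at least as large as the observed 12.6:

```python
from itertools import combinations

weed_free = [166.7, 172.2, 165.0, 176.9]  # mean 170.2
weedy = [162.8, 142.4, 162.8, 162.4]      # mean 157.6
pooled = weed_free + weedy
observed = sum(weed_free) / 4 - sum(weedy) / 4  # 12.6

count = total = 0
for idx in combinations(range(8), 4):
    group1 = [pooled[i] for i in idx]
    group2 = [pooled[i] for i in range(8) if i not in idx]
    diff = sum(group1) / 4 - sum(group2) / 4
    total += 1
    if diff >= observed - 1e-9:  # tolerance for floating-point ties
        count += 1
print(count, total, count / total)  # 1 of 70, p = 1/70 ≈ 0.014
```

Here the observed assignment happens to be the single most extreme of the 70, so the one-sided p-value attains its minimum possible value of 1/70.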
In a study of movie ratings by control (n=50, mean=65) and treated (n=50, mean=70) groups, the observed t-statistic from a two-sample t-test was approximately 2.82, with a parametric p-value of 0.00578.[https://library.virginia.edu/data/articles/testing-significance-permutation-based-methods] Performing 1000 random permutations of the group labels and recomputing the t-statistic each time yields a p-value of 0.005, the proportion of permuted statistics at least as extreme as the observed one, confirming significance and illustrating consistency with the parametric result.[https://library.virginia.edu/data/articles/testing-significance-permutation-based-methods]

For one-way ANOVA, permutation tests assess equality of means across k > 2 groups by permuting group labels and recomputing the F-statistic. In a study of ethical perceptions of the Milgram obedience experiment among 37 high school teachers divided into actual-experiment (n=13, mean=3.31), complied (n=13, mean=3.85), and refused (n=11, mean=5.55) groups on a 1-9 scale, the observed F-statistic was 3.49.[https://people.hsc.edu/faculty-staff/blins/classes/spring16/math222/examples/Milgram/index.html] With 10,000 random permutations of the labels, the pseudo-F values form the null distribution, and the p-value is the proportion exceeding 3.49, yielding 0.040.[https://people.hsc.edu/faculty-staff/blins/classes/spring16/math222/examples/Milgram/index.html] This rejects the null at α=0.05, indicating differences in ethical ratings across groups, similar to the parametric ANOVA p-value of 0.042.

The two-sample permutation procedure can be written as a short Python function (the original pseudocode, made runnable):

```python
import random

def permutation_test(group1, group2, num_perms=1000, seed=None):
    """One-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = sum(group1) / len(group1) - sum(group2) / len(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(num_perms):
        rng.shuffle(pooled)  # random relabeling of the pooled data
        perm_stat = (sum(pooled[:n1]) / n1
                     - sum(pooled[n1:]) / (len(pooled) - n1))
        if perm_stat >= observed:  # at least as extreme, one-sided
            count += 1
    return count / num_perms
```
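The same skeleton accepts any user-defined statistic, which is where permutation tests earn their flexibility. A sketch (illustrative function and data of my own; the difference in medians is chosen arbitrarily as the custom statistic):

```python
import random
from statistics import median

def perm_test_custom(x, y, stat, num_perms=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for an arbitrary statistic."""
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    pooled = list(x) + list(y)
    n = len(x)
    extreme = 0
    for _ in range(num_perms):
        rng.shuffle(pooled)
        if abs(stat(pooled[:n], pooled[n:])) >= observed:
            extreme += 1
    # Add-one correction: the observed labeling counts as one permutation,
    # so the reported p-value can never be exactly zero.
    return (extreme + 1) / (num_perms + 1)

median_diff = lambda a, b: median(a) - median(b)
x = [5.1, 4.8, 6.0, 5.5, 4.9]
y = [7.2, 6.9, 8.1, 7.5, 7.0]
print(perm_test_custom(x, y, median_diff))
```

No sampling distribution for the median difference is needed; the permutations supply it.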
For ANOVA, replace the statistic with F and permute labels across all groups. These examples all reach rejection (corn, ratings, ethics); the same procedure simply fails to reject when the observed statistic is unexceptional among the permuted values. Together they illustrate the test's flexibility for univariate group comparisons without normality assumptions.[https://www.math.csi.cuny.edu/~verzani/tmp/sdc/permutation-test.html][https://library.virginia.edu/data/articles/testing-significance-permutation-based-methods][https://people.hsc.edu/faculty-staff/blins/classes/spring16/math222/examples/Milgram/index.html]
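Such a permutation F-test can be sketched as follows (an illustrative implementation with plain-list inputs, not code or data from the cited studies):

```python
import random

def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def perm_anova(groups, num_perms=10_000, seed=0):
    """Permutation p-value: shuffle the pooled data, re-split into the
    original group sizes, and compare each pseudo-F with the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(num_perms):
        rng.shuffle(pooled)
        resplit, start = [], 0
        for s in sizes:
            resplit.append(pooled[start:start + s])
            start += s
        if f_statistic(resplit) >= observed:
            exceed += 1
    # Add-one correction counts the observed labeling itself.
    return (exceed + 1) / (num_perms + 1)
```

Because only the group labels are shuffled, the procedure extends unchanged to unbalanced designs; the `sizes` list preserves the original group sizes on every re-split.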