Friedman test

from Wikipedia

The Friedman test is a non-parametric statistical test developed by Milton Friedman.[1][2][3] Similar to the parametric repeated measures ANOVA, it is used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together, and then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test.

Classic examples of use are:

  • n wine judges each rate k different wines. Are any of the k wines ranked consistently higher or lower than the others?
  • n welders each use k welding torches, and the ensuing welds were rated on quality. Do any of the k torches produce consistently better or worse welds?

The Friedman test is used for one-way repeated measures analysis of variance by ranks. In its use of ranks it is similar to the Kruskal–Wallis one-way analysis of variance by ranks.

The Friedman test is widely supported by many statistical software packages.

Method

  1. Given data {x_ij}, that is, a matrix with n rows (the blocks), k columns (the treatments) and a single observation at the intersection of each block and treatment, calculate the ranks within each block. If there are tied values, assign to each tied value the average of the ranks that would have been assigned without ties. Replace the data with a new matrix {r_ij}, where the entry r_ij is the rank of x_ij within block i.
  2. Find the column mean ranks \bar{r}_j = \frac{1}{n} \sum_{i=1}^n r_{ij}.
  3. The test statistic is given by Q = \frac{12n}{k(k+1)} \sum_{j=1}^k \left( \bar{r}_j - \frac{k+1}{2} \right)^2. Note that the value of Q does need to be adjusted for tied values in the data.[4]
  4. Finally, when n or k is large (i.e. n > 15 or k > 4), the probability distribution of Q can be approximated by that of a chi-squared distribution with k - 1 degrees of freedom. In this case the p-value is given by P(\chi^2_{k-1} \ge Q). If n or k is small, the approximation to chi-squared becomes poor and the p-value should be obtained from tables of Q specially prepared for the Friedman test. If the p-value is significant, appropriate post-hoc multiple comparison tests would be performed.
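As an illustration, the four steps above can be sketched in Python on a small hypothetical data matrix; `scipy.stats.rankdata` handles the within-block ranking (including average ranks for ties):

```python
import numpy as np
from scipy.stats import rankdata, chi2

# Hypothetical data: n = 4 blocks (rows) x k = 3 treatments (columns).
x = np.array([
    [7.0, 9.0, 8.0],
    [6.0, 5.0, 7.0],
    [9.0, 7.0, 6.0],
    [5.0, 8.0, 9.0],
])
n, k = x.shape

# Step 1: rank within each block (average ranks would be used for ties).
r = np.apply_along_axis(rankdata, 1, x)

# Step 2: column mean ranks.
rbar = r.mean(axis=0)

# Step 3: test statistic (no ties here, so no adjustment needed).
Q = 12.0 * n / (k * (k + 1)) * np.sum((rbar - (k + 1) / 2.0) ** 2)

# Step 4: chi-squared approximation with k - 1 degrees of freedom.
p = chi2.sf(Q, k - 1)
print(f"Q = {Q:.3f}, p = {p:.3f}")
```

For data this small the chi-squared approximation is rough, so in practice the p-value would be taken from exact tables, as step 4 notes.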
Related tests
  • When using this kind of design for a binary response, one instead uses the Cochran's Q test.
  • The Sign test (with a two-sided alternative) is equivalent to a Friedman test on two groups.
  • Kendall's W is a normalization of the Friedman statistic between 0 and 1.
  • The Wilcoxon signed-rank test is a nonparametric test of nonindependent data from only two groups.
  • The Skillings–Mack test is a general Friedman-type statistic that can be used in almost any block design with an arbitrary missing-data structure.
  • The Wittkowski test is a general Friedman-type statistic similar to the Skillings–Mack test. When the data contain no missing values, it gives the same result as the Friedman test, but when the data contain missing values it is both more precise and more sensitive than the Skillings–Mack test.[5]

Post hoc analysis


Post-hoc tests were proposed by Schaich and Hamerle (1984)[6] as well as Conover (1971, 1980)[7] in order to decide which groups are significantly different from each other, based upon the mean rank differences of the groups. These procedures are detailed in Bortz, Lienert and Boehnke (2000, p. 275).[8] Eisinga, Heskes, Pelzer and Te Grotenhuis (2017)[9] provide an exact test for pairwise comparison of Friedman rank sums, implemented in R. The Eisinga et al. exact test offers a substantial improvement over available approximate tests, especially if the number of groups (k) is large and the number of blocks (n) is small.

Not all statistical packages support post-hoc analysis for Friedman's test, but user-contributed code exists that provides these facilities (for example in SPSS,[10] and in R.[11]). The R package titled PMCMRplus contains numerous non-parametric methods for post-hoc analysis after Friedman,[12] including support for the Nemenyi test.

from Grokipedia
The Friedman test is a non-parametric statistical test developed by the economist and statistician Milton Friedman in 1937 to assess differences in treatments across multiple matched blocks or repeated measures without relying on the assumption of normally distributed data. It functions as a rank-based analog to the parametric repeated measures analysis of variance (ANOVA), particularly suitable for randomized block designs where observations are dependent within blocks, such as in longitudinal studies or matched subject experiments. By converting raw data into ranks within each block, the test computes a statistic based on the variance of rank sums, which under the null hypothesis of no treatment effects approximately follows a chi-squared distribution with k - 1 degrees of freedom, where k is the number of treatments. The procedure begins by ranking the observations within each block from 1 (lowest) to k (highest), assigning average ranks in case of ties, and then summing these ranks across all n blocks for each treatment to obtain rank totals R_j. The test statistic F_r (or Q) is calculated as
F_r = \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3 n (k+1),
which simplifies the assessment of whether the rank sums deviate significantly from their expected values under the null hypothesis. Assumptions include ordinal or higher-level data that can be ranked, independent blocks, and no systematic interactions beyond the treatments of interest; the test performs well with small sample sizes (n ≥ 5 recommended) but may require exact distribution tables for very small k. Unlike parametric ANOVA, it is robust to outliers and non-normal distributions but has lower power when normality holds.
In practice, the Friedman test is widely applied across many fields to analyze repeated measures data, such as comparing the relief provided by multiple drugs to the same patients or evaluating performance under varying conditions in matched subjects. If the test indicates significant differences (p < 0.05), post-hoc pairwise comparisons using procedures like the Wilcoxon signed-rank test with Bonferroni correction can identify the specific treatment pairs driving the effect. Its enduring relevance stems from Friedman's original emphasis on avoiding normality assumptions in variance analysis, making it a foundational tool in non-parametric statistics.

Introduction

Definition and purpose

The Friedman test is a rank-based, non-parametric statistical procedure designed to detect differences in treatments across multiple test attempts or blocks in experimental designs. Introduced as an alternative to parametric methods that assume normality, it applies ranks to the data within each block to avoid reliance on distributional assumptions, making it suitable for analyzing correlated or matched observations. Its primary purpose is to compare three or more related samples, particularly in scenarios where the data violate the normality requirements of parametric tests such as repeated measures ANOVA. The test evaluates whether there are significant overall differences among the groups or conditions, testing the null hypothesis that all population distributions are identical against the alternative that at least one tends to produce larger (or smaller) observations. The Friedman test is appropriately applied in situations involving ordinal data, small sample sizes, or non-normal distributions in within-subjects designs, such as ranking preferences across multiple options or assessing treatment effects in matched blocks. In practice, it ranks the observations within each block and derives a test statistic to quantify the consistency of these ranks across blocks, providing evidence of treatment effects without assuming equal intervals or Gaussian distributions.

Historical development

The Friedman test originated from the work of economist and statistician Milton Friedman during his early career in the 1930s, as he explored alternatives to parametric methods that relied on normality assumptions. Developed in 1937 while Friedman was a research assistant at the National Bureau of Economic Research, the test addressed the need for robust analysis in experimental designs involving multiple related samples. Friedman first described the procedure in his paper "The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance," published in the Journal of the American Statistical Association. In this work, he proposed replacing raw observations with ranks within blocks to perform a two-way analysis of variance, thereby extending rank-based techniques to handle repeated measures or matched designs without assuming underlying distributions. The method was directly inspired by Harold Hotelling and Margaret Pabst's 1936 paper on rank correlation coefficients, which Friedman encountered during his studies under Hotelling at Columbia University; it built on R. A. Fisher's foundational two-way ANOVA framework by adapting it for non-parametric use. The test gained prominence in the mid-20th century alongside the broader expansion of non-parametric statistics, particularly in the 1940s and 1950s, as researchers in fields like psychology, biology, and social sciences increasingly favored distribution-free methods for handling ordinal or non-normal data in experimental settings. This period saw a surge in rank-based procedures, with Friedman's approach becoming a standard tool for analyzing blocked designs, as evidenced by its inclusion in seminal texts on non-parametric methods.

Assumptions and data requirements

Non-parametric assumptions

The Friedman test, as a non-parametric alternative to repeated-measures ANOVA, does not require the assumption of normality in the underlying data distributions, making it suitable for ordinal or non-normal continuous data where parametric assumptions fail. Instead, its core assumption is that the observations within each block are identically distributed except for possible location shifts attributable to treatment effects, allowing the test to focus on differences in central tendency across treatments while controlling for block variability. Regarding independence, the test is designed for related samples, where observations within each block are dependent due to matching or repeated measures on the same subjects, but the blocks themselves must be independent to ensure the validity of the overall analysis. This structure accounts for intra-block correlations without assuming independence within blocks, distinguishing it from tests for independent samples. The data are typically ordinal or continuous, transformed into ranks within each block for analysis; ties are accommodated by assigning average ranks to tied values, preserving the ordinal nature of the measurements. Despite its robustness to outliers and non-normal distributions, the Friedman test has limitations, including the assumption of similar distributional shapes across treatments within blocks apart from location differences. If these assumptions are violated—such as through differing variances or shapes in the distributions—the test may fail to adequately control the Type I error rate, potentially leading to inflated false positive rates, though it generally maintains nominal levels when assumptions hold. In contrast to parametric tests, which impose stricter normality and homoscedasticity requirements, the Friedman test's relaxed assumptions enhance its applicability in diverse empirical settings.

Data structure and prerequisites

The Friedman test requires data arranged in a blocked design, featuring n blocks (such as subjects or matched groups) and k treatments (such as conditions or time points), with exactly one observation per treatment within each block. This configuration yields a rectangular n × k data matrix, where rows represent blocks and columns represent treatments, ensuring a complete, unreplicated block structure. The test mandates at least three treatments (k ≥ 3) to detect differences among multiple groups, while the number of blocks should be at least five for adequate power, though typically n ≥ 10 is recommended to ensure reliable p-values via the chi-square approximation; smaller n can be analyzed using exact permutation methods. When ties occur within a block (identical observations across treatments), they are handled by assigning average ranks to the tied values, and statistical software adjusts the test statistic accordingly to maintain test validity. Key prerequisites include that observations within blocks must be related, such as repeated measures on the same subjects over time or carefully matched pairs/groups to control for inter-block variability. Incomplete data poses challenges, as the test assumes fully observed blocks; missing values necessitate either listwise deletion (reducing n) or cautious imputation, though the latter risks biasing ranks and should be avoided when possible, with extensions like the Skillings–Mack test considered for substantial missingness. For illustration, consider a dataset evaluating three treatments (A, B, C) on ten subjects, structured as follows:
Subject   Treatment A   Treatment B   Treatment C
1         5.2           6.1           4.8
2         4.9           5.5           5.0
...       ...           ...           ...
10        6.0           7.2           5.9
This table exemplifies the required format, with measurable values (ordinal or continuous) entered directly for subsequent ranking within rows.
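Given data in this row-per-subject format, the test can be run directly with SciPy's `friedmanchisquare`, which takes one sequence per treatment. The values below are hypothetical, chosen only to illustrate the call:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical measurements: rows = subjects (blocks),
# columns = treatments A, B, C.
data = np.array([
    [5.2, 6.1, 4.8],
    [4.9, 5.5, 5.0],
    [5.8, 6.4, 5.1],
    [6.0, 7.2, 5.9],
])

# SciPy expects one sequence per treatment (i.e. per column).
stat, p = friedmanchisquare(data[:, 0], data[:, 1], data[:, 2])
print(f"Q = {stat:.3f}, p = {p:.4f}")
```

Ranking within rows is performed internally, so the raw ordinal or continuous measurements are passed as-is.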

Test procedure

Step-by-step method

The Friedman test begins with organizing the data into a structured format suitable for analysis. The dataset consists of n blocks (also called subjects or rows), each containing k observations corresponding to k treatments or conditions (columns). This arrangement ensures that observations within each block are related, such as repeated measures on the same subjects. To perform the test manually, follow these sequential steps:
  1. Organize the data: Arrange the observations into an n × k table, where rows represent blocks and columns represent treatments. Ensure that the data meet the basic prerequisites, such as ordinal or continuous measurements without requiring normality.
  2. Rank observations within each block: For each row independently, assign ranks to the k observations from 1 (lowest value) to k (highest value). If ties occur within a block, assign the average of the tied ranks to each tied observation; for example, two values tied for ranks 3 and 4 both receive rank 3.5. This ranking process is performed separately for every block to account for subject-specific variability.
  3. Sum the ranks for each treatment: Calculate the total rank sum R_j for each treatment j (where j = 1 to k) by adding the ranks assigned to that treatment across all n blocks. These sums, R_1, R_2, ..., R_k, represent the aggregated ranking for each treatment.
  4. Verify block totals (optional but recommended): For each block, confirm that the sum of ranks equals k(k+1)/2. This sum is exact even with ties due to the use of average ranks. This step ensures the ranking process is accurate and complete.
  5. Prepare for test statistic calculation: Use the rank sums R_j as inputs for computing the overall test statistic, with the detailed formula provided in the subsequent mathematical formulation.
Manual computation of the Friedman test is practical for small sample sizes, such as n ≤ 20 and k ≤ 10, as the ranking and summation steps are straightforward by hand or with basic spreadsheets. For larger datasets, statistical software is advisable to handle ties and verifications efficiently.

Mathematical formulation

The Friedman test statistic is based on the sums of ranks assigned to each of the k treatments across n blocks (or subjects). The rank sum for treatment j is defined as R_j = \sum_{i=1}^n r_{ij}, where r_{ij} is the rank of the i-th block's observation for treatment j, with ranks ranging from 1 to k within each block (using average ranks for any ties). Under the null hypothesis that there are no differences among the treatments, the test statistic Q measures the variability in these rank sums and is given by
Q = \frac{12}{n k (k+1)} \sum_{j=1}^k \left( R_j - \frac{n(k+1)}{2} \right)^2,
where \frac{n(k+1)}{2} is the expected rank sum for each treatment. This is mathematically equivalent to
Q = \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3 n (k+1).
The derivation of Q stems from the variance of the rank sums under the null hypothesis: each R_j then has expected value \frac{n(k+1)}{2} and variance \frac{n (k^2 - 1)}{12}, leading to a standardized measure of dispersion that scales to the given form; this normalization ensures the statistic approximates a chi-squared distribution when treatment effects are absent. For large n, Q follows approximately a \chi^2 distribution with k - 1 degrees of freedom, allowing critical values to be obtained from standard chi-squared tables for significance testing. When ties occur within blocks, average ranks are assigned to the tied values and the test statistic is adjusted to account for the reduced variability. The adjusted statistic divides the uncorrected Q by a correction factor C:
Q_{adj} = \frac{1}{C} \left( \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3 n (k+1) \right),
where
C = 1 - \frac{\sum_i (t_i^3 - t_i)}{n k (k+1)(k-1)},
with the sum taken over all sets of tied observations and t_i denoting the number of observations tied in the i-th set.
Because C ≤ 1, this adjustment inflates the uncorrected statistic, compensating for the reduction in rank dispersion caused by ties. For small samples, the chi-squared approximation may be inaccurate, so the exact distribution of Q is used instead, typically via precomputed critical value tables or permutation-based methods that enumerate all possible rank assignments under the null hypothesis.
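A minimal sketch of the tie-corrected statistic, using a small hypothetical data matrix; the helper `friedman_q` is illustrative rather than a library function, and its result can be checked against `scipy.stats.friedmanchisquare`, which applies the same tie correction:

```python
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_q(x):
    """Friedman Q from an (n blocks x k treatments) array, with tie correction."""
    n, k = x.shape
    r = np.apply_along_axis(rankdata, 1, x)   # average ranks within each block
    R = r.sum(axis=0)                         # rank sum per treatment
    q = 12.0 / (n * k * (k + 1)) * np.sum(R ** 2) - 3.0 * n * (k + 1)
    # Tie correction: for each block, t_i is the size of each group of
    # tied observations; C = 1 - sum(t^3 - t) / (n k (k+1)(k-1)).
    ties = 0.0
    for row in x:
        _, counts = np.unique(row, return_counts=True)
        ties += np.sum(counts ** 3 - counts)
    c = 1.0 - ties / (n * k * (k + 1) * (k - 1))
    return q / c

# Hypothetical data with one within-block tie (first row).
x = np.array([
    [1.0, 2.0, 2.0],
    [3.0, 1.0, 2.0],
    [2.0, 3.0, 1.0],
    [1.0, 3.0, 2.0],
])
q = friedman_q(x)
p = chi2.sf(q, x.shape[1] - 1)
print(f"Q_adj = {q:.4f}, p = {p:.4f}")
```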

Interpretation and results

Test statistic and significance

The null hypothesis of the Friedman test posits that the probability distributions of the treatments are identical across blocks, which is commonly interpreted as the treatments having no differential effect, or equivalently, equal medians assuming identical distribution shapes. To determine statistical significance, the test statistic Q (detailed in the Mathematical formulation section) is compared to the critical value from the chi-square distribution with k - 1 degrees of freedom, where k is the number of treatments, at a chosen significance level such as α = 0.05. Alternatively, statistical software computes the exact p-value by permutation or complete enumeration for small samples, or via the asymptotic chi-square approximation for larger ones. If Q exceeds the critical value, or if the p-value is less than α, the null hypothesis is rejected, providing evidence that at least one treatment differs from the others in its distribution. Standard reporting conventions include stating the value of Q, the degrees of freedom k - 1, and the associated p-value, along with the sample size n (number of blocks) to contextualize the approximation's reliability. The chi-square approximation is generally valid when n > 10, though some guidelines recommend n > 15 or k > 4 for better accuracy; for smaller samples, exact methods are preferred to avoid inflated Type I error rates. Regarding power, the Friedman test is effective at detecting location shifts (differences in medians or central tendencies), but generally has lower power than parametric alternatives such as repeated-measures ANOVA when normality holds; it is not designed to detect differences in variance or distributional shape.
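The decision rule can be sketched with SciPy's chi-squared distribution; the statistic value below is hypothetical:

```python
from scipy.stats import chi2

k = 4          # number of treatments
alpha = 0.05
q = 9.2        # hypothetical Friedman statistic

crit = chi2.ppf(1 - alpha, k - 1)   # critical value at alpha, df = k - 1
p = chi2.sf(q, k - 1)               # upper-tail p-value
reject = q > crit                   # equivalent to p < alpha
print(f"critical value = {crit:.3f}, p = {p:.4f}, reject H0: {reject}")
```

Comparing Q against the critical value and comparing the p-value against α are equivalent decisions, as the code makes explicit.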

Effect size measures

The Friedman test detects differences in treatments across multiple matched blocks, but assessing the magnitude of these differences requires effect size measures to evaluate practical significance. The primary metric for the Friedman test is Kendall's coefficient of concordance, denoted W, which quantifies the degree of agreement in rankings across treatments. Introduced by Kendall and Babington Smith, W normalizes the Friedman statistic to range from 0, indicating no agreement or effect, to 1, representing perfect concordance in ranks. W is calculated as
W = \frac{Q}{n(k-1)},
where Q is the Friedman test statistic derived from the sum of squared rank totals, n is the number of blocks (subjects), and k is the number of treatments. This measure is computed directly from the rank sums assigned to each treatment, providing a straightforward way to report the proportion of variance in ranks attributable to treatment differences, which aids in interpreting the practical importance of results beyond statistical significance. Guidelines for interpreting W, adapted from Cohen's conventions for related statistics, classify values of approximately 0.1 as small effects, 0.3 as medium effects, and 0.5 as large effects, though these thresholds should be contextualized by the study's domain. Alternative effect size measures include rank-based analogs to eta-squared, which estimate the percentage of total rank variance explained by the treatment factor, and rank biserial correlations adapted for multi-group comparisons, though these are less commonly applied to the overall test. Despite its utility, W assumes the absence of tied ranks; when ties occur, adjustments such as those proposed by Gwet are recommended to correct for bias. Additionally, for small samples, the underlying chi-square approximation for significance testing may inflate Type I errors, necessitating adjustments or exact methods that also affect W's reliability.
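Since W is a simple rescaling of Q, it can be computed alongside the test. The ratings below are hypothetical and perfectly concordant (every block orders the treatments identically), so W comes out at its maximum of 1:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical ratings: n = 5 blocks x k = 3 treatments,
# with every block ranking the treatments in the same order.
x = np.array([
    [1.0, 2.5, 4.1],
    [0.8, 2.2, 3.9],
    [1.2, 2.7, 4.0],
    [0.9, 2.4, 4.3],
    [1.1, 2.6, 3.8],
])
n, k = x.shape

Q, p = friedmanchisquare(x[:, 0], x[:, 1], x[:, 2])
W = Q / (n * (k - 1))   # Kendall's W: 0 = no agreement, 1 = perfect concordance
print(f"Q = {Q:.3f}, W = {W:.3f}")
```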

Parametric alternatives

The primary parametric alternative to the Friedman test is repeated measures analysis of variance (ANOVA), which is suitable for comparing means across multiple related samples or conditions when the data meet parametric assumptions. Repeated measures ANOVA evaluates differences in means directly, assuming the data are continuous and normally distributed within each group, along with homogeneity of variances (sphericity for within-subjects factors). In contrast, the Friedman test employs ranks rather than raw values, providing robustness against violations of normality and unequal variances, making it preferable for ordinal data or non-normal distributions. The Friedman test was specifically developed by the statistician Milton Friedman in 1937 as a non-parametric method to circumvent the normality assumption inherent in parametric ANOVA procedures. Key differences lie in their statistical foundations: repeated measures ANOVA relies on the F-statistic to test for mean differences, while the Friedman test uses a chi-square distributed statistic (Q) based on rank sums. Under normality, the ANOVA F-statistic relates asymptotically to the Friedman Q statistic, reflecting their near-equivalence in large samples. However, the asymptotic relative efficiency of the Friedman test relative to ANOVA under normality is approximately 0.955k/(k+1), indicating slightly lower power for the non-parametric approach when parametric assumptions hold. Researchers should prefer repeated measures ANOVA when dealing with large sample sizes, continuous normally distributed data, and verified sphericity, as it offers higher statistical power to detect true differences in such scenarios. Conversely, the Friedman test is more appropriate for smaller samples, skewed distributions, or ranked data, where its robustness prevents the inflated Type I error rates associated with parametric violations.
This choice balances the power gains of parametric methods against the reliability of non-parametric alternatives on real-world data that often deviate from ideal assumptions.

Other non-parametric alternatives

The Wilcoxon signed-rank test serves as a foundational non-parametric procedure for comparing two related samples, functioning as a direct precursor to the Friedman test when the number of treatments or conditions is limited to k = 2. Developed by Frank Wilcoxon in 1945, it assesses differences in paired observations by ranking the absolute differences and accounting for their signs, providing a robust alternative to the paired t-test under non-normality. In scenarios with only two repeated measures per block, the Wilcoxon signed-rank test is preferred over the Friedman test due to its higher power and simpler computation, as the Friedman test reduces to the less efficient sign test in this case. For designs involving independent samples rather than repeated measures, the Kruskal-Wallis test acts as the between-subjects analog to the Friedman test, extending the Mann-Whitney U test to k > 2 groups. Introduced by William Kruskal and W. Allen Wallis in 1952, it ranks all observations across groups and tests for differences in distribution medians without assuming block-wise dependencies. Researchers should opt for the Kruskal-Wallis test when blocks (subjects) are unrelated, as it avoids the within-block ranking central to the Friedman test and better suits completely randomized designs. Extensions of the Friedman framework address specialized data types. For binary or dichotomous repeated measures, Cochran's Q test provides a direct adaptation, evaluating consistency in proportions across k conditions while maintaining the block structure. Proposed by William G. Cochran in 1950, it simplifies the Friedman ranks to 0-1 assignments per cell, offering a non-parametric chi-square-like test for matched binary data. When ordered alternatives are hypothesized, such as a monotonic trend in treatment effects, the Page test enhances the approach by weighting ranks according to their expected order. Edward B. Page formalized this in 1963, computing a weighted sum of rank sums to detect ordered differences with greater sensitivity than the omnibus test. The Friedman test's specificity to repeated measures designs imposes limitations relative to these alternatives; for instance, it requires paired blocks and may underperform without them, whereas the Kruskal-Wallis test accommodates independence but loses power from unexploited pairings. Similarly, while Cochran's Q and the Page test inherit the block-wise ranking, they are constrained to binary or ordered contexts, respectively, and cannot handle general continuous outcomes as flexibly as the Friedman test.

Post-hoc analysis

Multiple comparisons overview

Following a significant result from the Friedman test, which indicates overall differences among the treatments or related samples, post-hoc multiple comparisons are employed to pinpoint which specific pairs of treatments differ from one another. This step is essential because the omnibus test only detects the presence of at least one difference but does not identify the location or nature of those differences. Such analyses are performed only when the test's p-value is less than the chosen significance level (e.g., α = 0.05); if the overall test is not significant, no further pairwise investigations are warranted, to avoid unnecessary error inflation. A primary challenge in multiple comparisons arises from the increased risk of Type I errors (the false identification of differences) due to conducting numerous pairwise tests simultaneously. To mitigate this, adjustments to the significance level are required, such as the Bonferroni correction, which divides the overall α by the number of comparisons to control the family-wise error rate. These safeguards ensure that the probability of at least one false positive across all tests remains at the desired level, preserving the integrity of the analysis in the repeated measures or blocked designs typical of the Friedman test. General strategies for post-hoc analysis after the Friedman test rely on rank-based procedures that extend the nonparametric framework of the original test. Common approaches include pairwise comparisons using adapted versions of tests like Dunn's procedure, which operates on the ranks to compare treatments while incorporating multiplicity adjustments. For comprehensive all-pairs evaluations, methods such as the Nemenyi test or Conover's test are frequently applied, providing a structured way to assess differences based on rank sums or mean ranks across all treatment combinations. These techniques maintain the robustness of nonparametric analysis, making them suitable for ordinal or non-normal data in repeated measures settings.
However, some statistical literature has criticized rank-based post-hoc tests like Nemenyi and Conover for relying on assumptions, such as exchangeability of ranks, that may not hold in all applications, potentially leading to invalid p-values.
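A sketch of Bonferroni-adjusted pairwise Wilcoxon signed-rank comparisons, using hypothetical data constructed so that every block orders the treatments the same way:

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical repeated measures: 8 blocks (rows) x 3 treatments (columns).
idx = np.arange(8)
x = np.column_stack([5.0 + 0.10 * idx, 6.0 + 0.15 * idx, 7.0 + 0.17 * idx])

k = x.shape[1]
pairs = list(combinations(range(k), 2))
alpha = 0.05 / len(pairs)   # Bonferroni-adjusted per-comparison level

results = {}
for a, b in pairs:
    stat, p = wilcoxon(x[:, a], x[:, b])
    results[(a, b)] = p
    print(f"treatments {a} vs {b}: p = {p:.4f}, significant: {p < alpha}")
```

With k = 3 treatments there are C(3, 2) = 3 comparisons, so each is tested at 0.05 / 3 ≈ 0.0167, illustrating the conservativeness noted above.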

Specific post-hoc procedures

When the Friedman test indicates significant differences among the treatments, post-hoc procedures are employed to identify which specific pairs differ. One common approach is the pairwise Wilcoxon signed-rank test adjusted for multiple comparisons using the Bonferroni correction. For each pair of treatments, the signed-rank test is applied to the differences in observations across blocks, ranking the absolute differences and assigning signs based on direction. The significance level α is then divided by the number of pairwise comparisons, C(k, 2), where k is the number of treatments, to control the family-wise error rate. This method is suitable for targeted pairwise investigations but can be conservative with many comparisons. The Nemenyi test provides a distribution-free multiple comparison procedure based on the ranks from the Friedman analysis. It compares all pairs simultaneously by computing the statistic
q = \frac{|\bar{R}_i - \bar{R}_j|}{\sqrt{\frac{k(k+1)}{6N}}},
where \bar{R}_i and \bar{R}_j are the mean ranks of treatments i and j, k is the number of treatments, and N is the number of blocks; the result is compared against critical values derived from the studentized range distribution.
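A sketch of the Nemenyi-style comparison on hypothetical data; the critical value `q_alpha = 2.343` for k = 3 at α = 0.05 is taken from published studentized-range tables (already divided by √2) and is an assumption of this example:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical scores: N = 6 blocks (rows) x k = 3 treatments (columns).
x = np.array([
    [1.0, 2.0, 3.0],
    [1.2, 2.1, 3.4],
    [0.9, 1.8, 2.9],
    [1.1, 2.3, 3.1],
    [1.3, 2.2, 3.3],
    [1.0, 1.9, 3.0],
])
N, k = x.shape

# Mean rank of each treatment across blocks.
mean_ranks = np.apply_along_axis(rankdata, 1, x).mean(axis=0)

# Standard error term from the Nemenyi statistic's denominator.
se = np.sqrt(k * (k + 1) / (6.0 * N))

# Assumed tabled critical value for k = 3 treatments at alpha = 0.05.
q_alpha = 2.343
cd = q_alpha * se   # critical difference in mean ranks

for i in range(k):
    for j in range(i + 1, k):
        diff = abs(mean_ranks[i] - mean_ranks[j])
        print(f"{i} vs {j}: |mean rank diff| = {diff:.3f}, significant: {diff > cd}")
```

Pairs whose mean-rank difference exceeds the critical difference cd are declared significantly different; here only the extreme pair (treatments 0 and 2) clears the threshold.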