Box plot
In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles.[1]
In addition to the box on a box plot, there can be lines (which are called whiskers) extending from the box indicating variability outside the upper and lower quartiles; for this reason the plot is also called the box-and-whisker plot and the box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset[2] may be plotted as individual points beyond the whiskers on the box plot. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution[3] (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length).
The spacings in each subsection of the box-plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box-plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.
History
The range-bar method was first introduced by Mary Eleanor Spear in her book "Charting Statistics" in 1952[4] and again in her book "Practical Charting Techniques" in 1969.[5] The box-and-whisker plot was first introduced in 1970 by John Tukey, who later published on the subject in his book "Exploratory Data Analysis" in 1977.[6]
Elements

A boxplot is a standardized way of displaying the dataset based on the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.
- Minimum (Q0 or 0th percentile): the lowest data point in the data set excluding any outliers
- Maximum (Q4 or 100th percentile): the highest data point in the data set excluding any outliers
- Median (Q2 or 50th percentile): the middle value in the data set
- First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), it is the median of the lower half of the dataset.
- Third quartile (Q3 or 75th percentile): also known as the upper quartile qn(0.75), it is the median of the upper half of the dataset.[7]
Besides the five values above, another important quantity used in constructing a box plot is the interquartile range (IQR):
- Interquartile range (IQR): the distance between the upper and lower quartiles
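The five-number summary and the IQR described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `five_number_summary` is hypothetical, and it assumes the "median of each half" rule for the quartiles, leaving the middle value out of both halves when the number of observations is odd. Other conventions exist and give slightly different quartiles. It also returns the raw extremes, without excluding outliers.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value left out of both halves when n is odd.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]        # smallest n // 2 values
    upper = data[n - half:]    # largest n // 2 values
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary([7, 1, 3, 9, 5, 11, 13, 15])
q1, q3 = summary[1], summary[3]
iqr = q3 - q1  # interquartile range: distance between upper and lower quartiles
print(summary, iqr)
```

Sorting first keeps the function independent of the input order.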
A box-plot usually includes two parts, a box and a set of whiskers.
Box
The box is drawn from Q1 to Q3 with a horizontal line drawn inside it to denote the median. Some box plots include an additional character to represent the mean of the data.[8][9]
Whiskers
The whiskers must end at an observed data point, but can be defined in various ways. Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption of the box plot. In the most straightforward method, the boundary of the lower whisker is the minimum value of the data set, and the boundary of the upper whisker is the maximum value of the data set.
Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile (Q3), a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile (Q1) and a whisker is drawn down to the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as outliers.[10] The outliers can be plotted on the box-plot as a dot, a small circle, a star, etc. (see example below).
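The 1.5 IQR convention can be illustrated with a short sketch. The function name `tukey_whiskers` and the sample values are made up for illustration, and the quartiles are passed in because the article does not fix a single quartile rule.

```python
def tukey_whiskers(values, q1, q3, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) under the k * IQR rule.

    Whiskers end at observed data points: the smallest and largest values
    that still lie within k * IQR of the box edges.
    """
    data = sorted(values)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    if not inside:
        raise ValueError("no observations fall between the fences")
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return inside[0], inside[-1], outliers

# Illustrative data; Q1 = 10.5 and Q3 = 14.5 are the medians of each half.
sample = [2, 10, 11, 12, 13, 14, 15, 30]
print(tukey_whiskers(sample, q1=10.5, q3=14.5))
```

Here the fences sit at 4.5 and 20.5, so the whiskers stop at 10 and 15 while 2 and 30 are reported as outliers. The two whiskers have different lengths (0.5 below the box and 0.5 above it would be a coincidence in general) because each must end on an observed value.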

There are other representations in which the whiskers can stand for several other things, such as:
- One standard deviation above and below the mean of the data set
- The 9th percentile and the 91st percentile of the data set
- The 2nd percentile and the 98th percentile of the data set
Rarely, a box plot is drawn without whiskers. This can be appropriate for sensitive information, where whiskers (and outliers) would disclose actual observed values.[11]
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.
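The claim about equal spacing can be checked numerically for a standard normal distribution. The sketch below uses only the Python standard library and assumes the seven marks are the 2nd, 9th, 25th, 50th, 75th, 91st and 98th percentiles named above.

```python
from statistics import NormalDist

# Percentiles used for the seven-number summary.
percentiles = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

# Positions of the seven marks for a standard normal distribution.
marks = [NormalDist().inv_cdf(p) for p in percentiles]

# Distances between neighbouring marks, in units of the standard deviation.
gaps = [upper - lower for lower, upper in zip(marks, marks[1:])]

for p, mark in zip(percentiles, marks):
    print(f"{p:4.0%}  {mark:+.3f}")
print([round(gap, 3) for gap in gaps])
```

The gaps all come out at roughly two thirds of a standard deviation, so the spacing is approximately rather than exactly equal.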
Variations
Since the mathematician John W. Tukey first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed, and the two most commonly found variations are the variable-width box plots and the notched box plots.
Variable-width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.[12]
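As a small illustration of the square-root convention, the helper below (a hypothetical name, `box_widths`) scales the widths so that the largest group gets a chosen maximum width. The 0.8 default is an arbitrary drawing choice, not part of the convention.

```python
from math import sqrt

def box_widths(group_sizes, max_width=0.8):
    """Box widths proportional to the square root of each group's size."""
    roots = [sqrt(n) for n in group_sizes]
    largest = max(roots)
    return [max_width * root / largest for root in roots]

widths = box_widths([25, 100, 400])
print(widths)  # the 400-point group is only 4 times wider than the 25-point group
```

A list like this could be passed to a plotting routine that accepts per-box widths, for example the `widths` argument of Matplotlib's `boxplot`.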
Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide of the significance of the difference of medians; if the notches of two boxes do not overlap, this will provide evidence of a statistically significant difference between the medians.[12] The height of the notches is proportional to the interquartile range (IQR) of the sample and is inversely proportional to the square root of the size of the sample. However, there is an uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).[12] The width of the notch is arbitrarily chosen to be visually pleasing, and should be consistent amongst all box plots being displayed on the same page.
One convention for obtaining the boundaries of these notches is to use a distance of $\pm 1.58\,\mathrm{IQR}/\sqrt{n}$ around the median.[13]
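A minimal sketch of the notch calculation follows. It assumes the convention median ± 1.58 · IQR / √n, which is the one documented for R's box plot statistics; the function name `notch_limits` is hypothetical. The inputs 70, 66, 75 and n = 24 are the summary values of the temperature example used later in this article.

```python
from math import sqrt

def notch_limits(median, q1, q3, n, c=1.58):
    """Lower and upper notch boundaries: median -/+ c * IQR / sqrt(n)."""
    half_height = c * (q3 - q1) / sqrt(n)
    return median - half_height, median + half_height

lower_notch, upper_notch = notch_limits(median=70, q1=66, q3=75, n=24)
print(round(lower_notch, 2), round(upper_notch, 2))
```

Because the half-height shrinks with √n, larger samples give narrower notches, and two groups whose notches do not overlap differ visibly in their medians.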
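Adjusted box plots are intended to describe skewed distributions, and they rely on the medcouple statistic of skewness.[14] For a medcouple value of MC, the lengths of the upper and lower whiskers on the box plot are respectively defined to be:
$1.5\,\mathrm{IQR}\cdot e^{3\,\mathrm{MC}}$ and $1.5\,\mathrm{IQR}\cdot e^{-4\,\mathrm{MC}}$ if $\mathrm{MC} \ge 0$,
$1.5\,\mathrm{IQR}\cdot e^{4\,\mathrm{MC}}$ and $1.5\,\mathrm{IQR}\cdot e^{-3\,\mathrm{MC}}$ if $\mathrm{MC} \le 0$.
For a symmetrical data distribution, the medcouple will be zero, and this reduces the adjusted box plot to Tukey's box plot with equal whisker lengths of $1.5\,\mathrm{IQR}$ for both whiskers.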
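The whisker limits of the adjusted box plot can be sketched as follows. The exponents 3 and −4 (swapped to 4 and −3 for negative skew) are taken from Hubert and Vandervieren's proposal[14]. The function name `adjusted_fences` is hypothetical, and the medcouple itself is assumed to be computed elsewhere and passed in as `mc`.

```python
from math import exp

def adjusted_fences(q1, q3, mc, k=1.5):
    """Lower and upper whisker limits of the adjusted box plot.

    `mc` is the medcouple (between -1 and 1); it is an input here,
    not computed by this sketch.
    """
    iqr = q3 - q1
    if mc >= 0:
        lower = q1 - k * exp(-4 * mc) * iqr
        upper = q3 + k * exp(3 * mc) * iqr
    else:
        lower = q1 - k * exp(-3 * mc) * iqr
        upper = q3 + k * exp(4 * mc) * iqr
    return lower, upper

print(adjusted_fences(66, 75, mc=0.0))   # symmetric case: Tukey's fences
print(adjusted_fences(66, 75, mc=0.2))   # right skew: longer upper whisker
```

With a medcouple of zero both exponentials equal one, so the limits fall back to Q1 − 1.5 IQR and Q3 + 1.5 IQR.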
Other kinds of box plots, such as violin plots and bean plots, can show the difference between unimodal and multimodal distributions, which cannot be observed from the classical box plot.[6]
Examples
Example without outliers
A series of hourly temperatures were measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.
A box plot of the data set can be generated by first calculating five relevant values of this data set: minimum, maximum, median (Q2), first quartile (Q1), and third quartile (Q3).
The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature is 57°F.
The maximum is the largest number of the data set. In this case, the maximum recorded day temperature is 81°F.
The median is the "middle" number of the ordered data set: at most half of the elements are below it and at most half are above it. With 24 values, it is the mean of the 12th and 13th values, which are both 70°F, so the median of this ordered data set is 70°F.
The first quartile value (Q1 or 25th percentile) is the number that marks one quarter of the ordered data set: roughly 25% of the elements lie below it and roughly 75% lie above it. It can be determined as the median of the lower half of the ordered data, that is, the "middle" number of the 12 values from the minimum up to the median. For the hourly temperatures, the "middle" number found between 57°F and 70°F is 66°F.
The third quartile value (Q3 or 75th percentile) is the number that marks three quarters of the ordered data set: roughly 75% of the elements lie below it and roughly 25% lie above it. It can be determined as the median of the upper half of the ordered data, that is, the "middle" number of the 12 values from the median up to the maximum. For the hourly temperatures, the "middle" number between 70°F and 81°F is 75°F.
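The interquartile range, or IQR, can be calculated by subtracting the first quartile value (Q1) from the third quartile value (Q3):
$\mathrm{IQR} = Q_3 - Q_1 = 75\,^\circ\mathrm{F} - 66\,^\circ\mathrm{F} = 9\,^\circ\mathrm{F}$
Hence,
$1.5\,\mathrm{IQR} = 1.5 \times 9\,^\circ\mathrm{F} = 13.5\,^\circ\mathrm{F}$
1.5 IQR above the third quartile is:
$Q_3 + 1.5\,\mathrm{IQR} = 75\,^\circ\mathrm{F} + 13.5\,^\circ\mathrm{F} = 88.5\,^\circ\mathrm{F}$
1.5 IQR below the first quartile is:
$Q_1 - 1.5\,\mathrm{IQR} = 66\,^\circ\mathrm{F} - 13.5\,^\circ\mathrm{F} = 52.5\,^\circ\mathrm{F}$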
The upper whisker boundary of the box-plot is the largest data value that is within 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5°F and the maximum is 81°F. Therefore, the upper whisker is drawn at the value of the maximum, which is 81°F.
Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5°F and the minimum is 57°F. Therefore, the lower whisker is drawn at the value of the minimum, which is 57°F.
Example with outliers
Above is an example without outliers. Here is a follow-up example for generating a box plot with outliers:
The ordered set for the recorded temperatures is (°F): 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.
In this example, only the first and the last values have changed. The median, third quartile, and first quartile remain the same.
In this case, the maximum value in this data set is 89°F, and 1.5 IQR above the third quartile is 88.5°F. The maximum is greater than the third quartile plus 1.5 IQR, so the maximum is an outlier. Therefore, the upper whisker is drawn at the greatest value smaller than 1.5 IQR above the third quartile, which is 79°F.
Similarly, the minimum value in this data set is 52°F, and 1.5 IQR below the first quartile is 52.5°F. The minimum is smaller than the first quartile minus 1.5 IQR, so the minimum is also an outlier. Therefore, the lower whisker is drawn at the smallest value greater than 1.5 IQR below the first quartile, which is 57°F.
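Both temperature examples can be reproduced with a short script. This is a sketch under the same assumptions as the worked text: quartiles are the medians of the lower and upper halves, and the whiskers follow the 1.5 IQR rule. The function name `box_plot_stats` is hypothetical.

```python
from statistics import median

def box_plot_stats(values, k=1.5):
    """Quartiles, fences, whisker ends and outliers under the k * IQR rule."""
    data = sorted(values)
    half = len(data) // 2
    q1, q2, q3 = median(data[:half]), median(data), median(data[-half:])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "quartiles": (q1, q2, q3),
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

# Hourly temperatures (°F) from the example without outliers.
no_outliers = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
               70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]
# Same series with the first and last values replaced, as in the second example.
with_outliers = [52] + no_outliers[1:-1] + [89]

first = box_plot_stats(no_outliers)
second = box_plot_stats(with_outliers)
print(first)
print(second)
```

The quartiles and fences are identical for both series; only the whisker ends and the outlier list change.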
In the case of large datasets
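An additional example for obtaining a box plot from a data set containing a large number of data points is:
General equation to compute empirical quantiles
$q_n(p) = x_{(k)} + \alpha\,(x_{(k+1)} - x_{(k)})$, where $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$
- Here $x_{(k)}$ stands for the general ordering of the data points (i.e. if $i < k$, then $x_{(i)} \le x_{(k)}$)
Using the above example that has 24 data points (n = 24), one can calculate the median, first and third quartile either mathematically or visually.
Median
$q_n(0.5) = x_{(12)} + (0.5 \cdot 25 - 12)\,(x_{(13)} - x_{(12)}) = 70 + 0.5\,(70 - 70) = 70\,^\circ\mathrm{F}$
First quartile
$q_n(0.25) = x_{(6)} + (0.25 \cdot 25 - 6)\,(x_{(7)} - x_{(6)}) = 66 + 0.25\,(66 - 66) = 66\,^\circ\mathrm{F}$
Third quartile
$q_n(0.75) = x_{(18)} + (0.75 \cdot 25 - 18)\,(x_{(19)} - x_{(18)}) = 75 + 0.75\,(75 - 75) = 75\,^\circ\mathrm{F}$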
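The empirical quantile calculation can also be sketched in code. The sketch assumes the rule k = ⌊p(n + 1)⌋ with interpolation weight α = p(n + 1) − k between the k-th and (k + 1)-th ordered values. The function name `empirical_quantile` and the clamping at the two ends of the data are illustrative choices, not part of the article.

```python
from math import floor

def empirical_quantile(values, p):
    """Quantile q_n(p) = x_(k) + alpha * (x_(k+1) - x_(k)), k = floor(p * (n + 1))."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)
    k = floor(position)
    alpha = position - k
    if k < 1:          # assumption: clamp to the smallest value
        return data[0]
    if k >= n:         # assumption: clamp to the largest value
        return data[-1]
    # Order statistics are 1-based in the formula, lists are 0-based in Python.
    return data[k - 1] + alpha * (data[k] - data[k - 1])

temperatures = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
                70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]
quartiles = [empirical_quantile(temperatures, p) for p in (0.25, 0.5, 0.75)]
print(quartiles)
```

For the 24 temperatures the interpolation term vanishes each time, because the two neighbouring ordered values are equal.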
Visualization

Although box plots may seem more primitive than histograms or kernel density estimates, they do have a number of advantages. First, the box plot enables statisticians to do a quick graphical examination of one or more data sets. Box plots also take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data in parallel. Lastly, the overall structure of histograms and kernel density estimates can be strongly influenced by the choice of the number and width of bins and by the choice of bandwidth, respectively.
Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0, σ²) distribution and observe their characteristics directly.
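The comparison with a normal distribution can be made concrete with a short calculation. The sketch below uses the standard normal (σ = 1) and the 1.5 IQR whisker rule; the positions scale linearly with σ for any N(0, σ²).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

q3 = std_normal.inv_cdf(0.75)   # upper edge of the box
q1 = -q3                        # lower edge, by symmetry
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr    # furthest point an upper whisker can reach
# Probability mass beyond both fences, i.e. expected share of flagged points.
outside = 2 * (1 - std_normal.cdf(upper_fence))

print(f"box edges: ±{q3:.4f}σ, fences: ±{upper_fence:.3f}σ, outside: {outside:.2%}")
```

Under these assumptions the box spans about ±0.67σ, the fences lie near ±2.7σ, and well under 1% of normally distributed observations fall outside them.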
References
- Dutoit, S. H. C. (2012). Graphical exploratory data analysis. Springer. ISBN 978-1-4612-9371-2. OCLC 1019645745.
- Grubbs, Frank E. (February 1969). "Procedures for Detecting Outlying Observations in Samples". Technometrics. 11 (1): 1–21. doi:10.1080/00401706.1969.10490657. ISSN 0040-1706.
- Boddy, Richard (2009). Statistical Methods in Practice: for Scientists and Technologists. John Wiley & Sons. ISBN 978-0-470-74664-6. OCLC 940679163.
- Spear, Mary Eleanor (1952). Charting Statistics. McGraw Hill. p. 166.
- Spear, Mary Eleanor (1969). Practical charting techniques. New York: McGraw-Hill. ISBN 0070600104. OCLC 924909765.
- Wickham, Hadley; Stryjewski, Lisa. "40 years of boxplots" (PDF). Retrieved December 24, 2020.
- Holmes, Alexander; Illowsky, Barbara; Dean, Susan (31 March 2015). "Introductory Business Statistics". OpenStax. Archived from the original on 27 July 2020. Retrieved 29 April 2020.
- Frigge, Michael; Hoaglin, David C.; Iglewicz, Boris (February 1989). "Some Implementations of the Boxplot". The American Statistician. 43 (1): 50–54. doi:10.2307/2685173. JSTOR 2685173.
- Marmolejo-Ramos, F.; Tian, S. (2010). "The shifting boxplot. A boxplot based on essential summary statistics around the mean". International Journal of Psychological Research. 3 (1): 37–46. doi:10.21500/20112084.823. hdl:10819/6492.
- Dekking, F. M. (2005). A Modern Introduction to Probability and Statistics. Springer. pp. 234–238. ISBN 1-85233-896-2.
- Derrick, Ben; Green, Elizabeth; Ritchie, Felix; White, Paul (September 2022). "The Risk of Disclosure When Reporting Commonly Used Univariate Statistics". Privacy in Statistical Databases. Lecture Notes in Computer Science. Vol. 13463. pp. 119–129. doi:10.1007/978-3-031-13945-1_9. ISBN 978-3-031-13944-4.
- McGill, Robert; Tukey, John W.; Larsen, Wayne A. (February 1978). "Variations of Box Plots". The American Statistician. 32 (1): 12–16. doi:10.2307/2683468. JSTOR 2683468.
- "R: Box Plot Statistics". R manual. Retrieved 26 June 2011.
- Hubert, M.; Vandervieren, E. (2008). "An adjusted boxplot for skewed distribution". Computational Statistics and Data Analysis. 52 (12): 5186–5201. CiteSeerX 10.1.1.90.9812. doi:10.1016/j.csda.2007.11.008.
Further reading
- Tukey, John W. (1977). Exploratory Data Analysis. Addison-Wesley. ISBN 9780201076165.
- Benjamini, Y. (1988). "Opening the Box of a Boxplot". The American Statistician. 42 (4): 257–262. doi:10.2307/2685133. JSTOR 2685133.
- Rousseeuw, P. J.; Ruts, I.; Tukey, J. W. (1999). "The Bagplot: A Bivariate Boxplot". The American Statistician. 53 (4): 382–387. doi:10.2307/2686061. JSTOR 2686061.
External links
- Beeswarm Boxplot - superimposing a frequency-jittered stripchart on top of a box plot
Box plot
Introduction
Definition
A box plot is a standardized graphical method for displaying the distribution of a numerical dataset based on its five-number summary, which includes the minimum value, the first quartile (Q1, or 25th percentile), the median (Q2, or 50th percentile), the third quartile (Q3, or 75th percentile), and the maximum value.[5] This summary captures essential aspects of the data without requiring the full dataset to be shown.[6] Visually, the box plot represents the spread of the data through the interquartile range (the distance between Q1 and Q3, forming the "box"), central tendency via the median line within the box, and potential skewness by the relative lengths of the box and adjacent whiskers, which extend from the quartiles to the minimum and maximum (or to non-outlier extremes).[6]
As a non-parametric tool, it makes no assumptions about the underlying distribution of the data, such as normality, allowing it to effectively summarize distributions for exploratory analysis across diverse datasets.[7] Box plots thus provide a concise way to summarize data distributions, highlighting variability and location without parametric constraints.[5] The method is also known as a box-and-whisker plot, a term introduced by statistician John Tukey in his seminal 1977 work Exploratory Data Analysis.[6]
Purpose and Advantages
Box plots are primarily employed to visualize the distribution and spread of numerical data by summarizing key statistical measures, including the median, quartiles, and range, which provide insights into central tendency, variability, and overall data structure.[8] This approach enables analysts to quickly grasp the middle 50% of the data via the interquartile range (IQR) and assess the full extent through whiskers extending to non-outlier extremes.[9] They are especially effective for identifying outliers (data points lying beyond 1.5 times the IQR from the quartiles), flagging potential anomalies without distorting the core summary.[9] Furthermore, box plots aid in detecting skewness by revealing asymmetries, such as a median offset toward one quartile or unequal whisker lengths, which indicate non-normal distributions.[10] A key application is facilitating comparisons across multiple datasets or groups, where side-by-side box plots highlight differences in medians, spreads, and shapes, supporting decisions in fields like medicine and quality control.[8]
One major advantage of box plots is their robustness to outliers and extreme values, as they emphasize order-based statistics like the median and quartiles, which remain stable even when a few points skew the mean or standard deviation.[11] This contrasts with mean-based summaries, making box plots reliable for real-world datasets prone to contamination.[12] They are also accessible to non-statisticians, conveying complex distributional insights through an intuitive, minimalist design that avoids overwhelming detail while highlighting essentials like symmetry and spread.[8] For large datasets, box plots offer efficiency by condensing thousands of observations into a single graphic, enabling rapid pattern recognition without computational intensity or visual clutter.[4]
In comparison to histograms, which depict the full frequency distribution and multimodal shapes, box plots provide a compact summary focused on quantiles rather than binning the entire dataset, proving superior for inter-group comparisons where detailed density is secondary.[13] Unlike stem-and-leaf plots, which retain and display individual data values for granular inspection, box plots sacrifice this detail for greater compactness and scalability, ideal when the goal is overview rather than exhaustive enumeration.[14]
History
Origins
The box plot, also known as the box-and-whisker plot, was invented by American statistician John W. Tukey in the early 1970s as a key component of exploratory data analysis (EDA). Tukey first presented the schematic plot, an early form of the box plot, in the preliminary edition of his book in 1970. He further introduced the concept in his 1972 paper "Some Graphical and Semigraphical Displays," published in Statistical Papers in Honor of George W. Snedecor, where he presented it alongside other semigraphical techniques like the stem-and-leaf diagram to facilitate initial data examination.[4]
The tool gained prominence through Tukey's 1977 book Exploratory Data Analysis, which provided a comprehensive framework for its use in summarizing data distributions in a simple, visual manner. In this seminal work, Tukey detailed the box plot's structure to highlight central tendency, spread, and potential outliers without requiring complex computations.
Within the EDA paradigm, the box plot exemplified Tukey's emphasis on graphical and numerical methods that prioritize direct interaction with data over reliance on parametric statistical assumptions or formal hypothesis testing. This approach encouraged analysts to "let the data speak" through intuitive visualizations, fostering discovery of patterns and anomalies prior to confirmatory analysis.
Development and Adoption
Following John W. Tukey's invention of the box plot in 1970 as a tool for exploratory data analysis, the method underwent significant refinements in the late 1970s and 1980s to improve its robustness and utility for comparing distributions. In 1978, Robert McGill, Tukey, and Wayne A. Larsen proposed variations including variable-width box plots, which scale the box width proportional to sample size, and notched box plots, which incorporate confidence intervals around the median to facilitate visual comparisons between groups. These enhancements addressed limitations in the original design, such as handling unequal sample sizes and assessing median differences more reliably.[4]
By the 1980s and into the 1990s, further modifications focused on adapting box plots for diverse data characteristics, including alternative definitions for outliers and fences to better accommodate skewed distributions. A 1989 survey by Michael Frigge, David C. Hoaglin, and Boris Iglewicz examined implementations across statistical software, revealing inconsistencies in how elements like whiskers and outliers were calculated but underscoring the plot's growing standardization for robust summary statistics.[15] These developments emphasized the box plot's flexibility, making it suitable for exploratory analysis in non-normal data scenarios common in applied research.[4]
The box plot's adoption accelerated in the 1980s across disciplines requiring concise reporting of summary statistics, particularly in medicine, where it became a standard for visualizing patient outcomes, treatment effects, and biomarker distributions in clinical studies.[16] In engineering, it facilitated quality control and process variability assessments, while in social sciences, it supported comparisons of survey data and behavioral metrics.[4] This widespread integration stemmed from its ability to reveal central tendency, spread, and asymmetry without assuming normality, proving valuable for interdisciplinary data interpretation.[16]
The method's influence extended to statistical standards and tools, with box plots incorporated into major software packages like SPSS and Minitab by the late 1980s, enabling routine use in academic and professional workflows.[15] Journals in statistics and applied fields began recommending box plots for graphical abstracts, promoting their role in enhancing readability and comparability of results over traditional tables. Into the 21st century, the box plot continued to evolve with integrations in open-source libraries like R and Python, supporting advanced applications in data science and machine learning.[4]
Elements and Construction
Core Components
The core components of a standard box plot consist of a rectangular box, an internal line representing the median, extending whiskers, and individual points denoting outliers. These elements collectively summarize the distribution of a dataset by highlighting its central tendency, spread, and potential anomalies without assuming normality.[5]
The central box spans from the first quartile (Q1, the 25th percentile) to the third quartile (Q3, the 75th percentile), encapsulating the interquartile range (IQR), which measures the middle 50% of the data. This box visually represents the variability within the core of the distribution, with its length indicating the spread of the central data points; a longer box suggests greater dispersion in the middle half of the values.[5]
A horizontal line within the box marks the median (Q2, the 50th percentile), dividing the data into two equal halves and providing a robust measure of central tendency that is less affected by extreme values than the mean. The position of this line relative to the box edges reveals skewness: if it is closer to Q1 or Q3, the distribution leans toward the lower or upper end, respectively.[5]
Whiskers extend from the box edges to the smallest and largest data points that fall within 1.5 times the IQR below Q1 and above Q3, respectively, defining the main body of the data excluding extremes. These lines, often capped with short horizontal ticks or symbols, illustrate the range of the bulk of the observations and help identify the extent of non-outlying variation.[5]
Data points lying beyond the whiskers (specifically, those more than 1.5 IQR away from Q1 or Q3) are plotted as individual symbols, such as circles or asterisks, to denote potential outliers. These points flag unusual observations that may warrant further investigation, with the convention distinguishing mild outliers (1.5 to 3.0 IQR away) from extreme ones (beyond 3.0 IQR) through varying symbol sizes or shapes.[5]
Step-by-Step Construction
To construct a box plot, first sort the dataset in ascending order to facilitate the identification of key statistical measures.[7]
Next, compute the five-number summary, which consists of the minimum value (the smallest observation), the first quartile (Q1, the 25th percentile), the median (the 50th percentile), the third quartile (Q3, the 75th percentile), and the maximum value (the largest observation).[3][7] The median is calculated as the middle value when the number of observations (n) is odd; for even n, it is the average of the two central values after sorting.[17] In cases of ties (duplicate values), the sorted positions are used without adjustment, preserving the order of equal observations.[17] Q1 and Q3, following Tukey's hinge method, are determined as the medians of the lower and upper halves of the sorted data, respectively; for even counts in these halves, the average of the two middle values is taken, while odd counts use the single middle value.[17][3]
Once the five-number summary is obtained, calculate the interquartile range (IQR) as Q3 minus Q1.[3][7] The inner fences are then defined as Q1 minus 1.5 times the IQR for the lower fence and Q3 plus 1.5 times the IQR for the upper fence; these delineate the range for potential outliers.[3] The adjacent values, which form the whisker ends, are the largest observation not exceeding the upper inner fence and the smallest not falling below the lower inner fence.[3][7]
To plot the box plot, draw a rectangular box extending from Q1 to Q3, with a horizontal line inside the box at the median position to represent the core components.[3] Extend vertical whiskers from the box edges to the adjacent values on each end.[7] Finally, mark any observations beyond the inner fences as individual points (outliers) outside the whiskers.[3][7]
Mathematical Foundations
Quantile Calculations
In box plots, quartiles are key summary statistics that partition an ordered dataset into four equal parts: the first quartile (Q1) at the 25th percentile, the median (Q2) at the 50th percentile, and the third quartile (Q3) at the 75th percentile.[18] These values define the interquartile range (IQR = Q3 − Q1), which captures the middle 50% of the data and forms the box's boundaries.[18]
Several methods exist for computing these quartiles from a sample of size $n$, differing in how they handle the positioning and interpolation of order statistics. The inclusive method includes the median in both the lower and upper halves of the dataset when splitting for Q1 and Q3 calculations, particularly when $n$ is odd. In contrast, the exclusive method excludes the median from these halves to avoid overlap, ensuring the lower half contains the smallest $\lfloor n/2 \rfloor$ observations and the upper half the largest $\lfloor n/2 \rfloor$. Tukey's hinges, original to box plot construction, treat Q1 and Q3 as the medians of the respective halves, including the overall median in both halves when $n$ is odd.[19]
The general formula for the empirical $p$-quantile (where $0 < p < 1$) in many implementations is given by a position $h$ (with $h = np$ under the Type 4 definition), followed by linear interpolation between the adjacent order statistics $x_{(j)}$ and $x_{(j+1)}$ if the position is not an integer: $Q(p) = x_{(j)} + g\,(x_{(j+1)} - x_{(j)})$, where $j = \lfloor h \rfloor$ and $g = h - j$.[18] For quartiles, set $p = 0.25$ for Q1, $p = 0.5$ for the median, and $p = 0.75$ for Q3; this approach, known as Type 4 in standard classifications, provides continuity and unbiasedness for symmetric distributions.[18]
For small datasets, Tukey's hinges apply specific adjustments to ensure robustness; the lower hinge is the median of the lower half of the data (the first $\lceil n/2 \rceil$ observations), and the upper hinge is the median of the upper half. For instance, with ordered data 1, 2, 3, 4, 5, the lower half yields a hinge at 2 and the upper half a hinge at 4, while the median is 3.[19] Larger datasets typically use the general interpolation formula without modification, as edge effects diminish.[18]
Outlier Detection
In box plots, outlier detection primarily relies on the interquartile range (IQR), defined as the difference between the third quartile (Q3) and the first quartile (Q1), to identify data points that deviate significantly from the central bulk of the distribution. The standard method, introduced by John Tukey, flags as outliers any values falling below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$. These thresholds, known as the inner fences, lie 1.5 times the IQR beyond the quartiles and are designed to capture mild outliers while remaining robust to moderate skewness in non-normal distributions.[12][20]
Tukey further distinguished between mild and extreme outliers using outer fences at $Q_1 - 3 \times \mathrm{IQR}$ and $Q_3 + 3 \times \mathrm{IQR}$, with points between the inner and outer fences classified as "outside" values and those beyond the outer fences as "far out" values. This hierarchical approach allows for nuanced identification, where mild outliers may warrant investigation for potential errors, while extreme ones highlight rare but possibly valid extremes. The multipliers of 1.5 and 3 were refined by Tukey through empirical experience to balance sensitivity and specificity in exploratory data analysis.[12][20]
Alternative methods complement the Tukey approach for outlier detection, particularly in datasets sensitive to the choice of IQR multiplier. The modified Z-score, proposed by Iglewicz and Hoaglin, uses the median absolute deviation (MAD) as a robust scale measure: for a data point $x_i$, it is calculated as $M_i = 0.6745\,(x_i - \tilde{x}) / \mathrm{MAD}$, where $\tilde{x}$ is the sample median, with values exceeding 3.5 in absolute magnitude flagged as potential outliers. This method is especially effective for heavy-tailed distributions, as it avoids reliance on means and standard deviations, which can be distorted by outliers themselves. Other IQR-based robust measures adjust the Tukey fences dynamically for skewness or sample size, enhancing adaptability without assuming normality.[21]
The identification of outliers via box plots sparks debate on their interpretation: they may represent measurement errors requiring correction or genuine extremes revealing important variability in the data-generating process. The non-parametric nature of the box plot method, relying on order statistics rather than distributional assumptions, supports exploratory investigation without prematurely dismissing these points as anomalies, aligning with Tukey's philosophy of data analysis as detective work.[21][12]
Variations
Notched Box Plots
Notched box plots extend the standard box plot by incorporating inward notches on each side of the box to provide a visual representation of the variability around the median. These notches were introduced by McGill, Tukey, and Larsen in their 1978 paper on variations of box plots.[22] The notch boundaries are calculated as the median $\pm\, 1.58 \times \mathrm{IQR}/\sqrt{n}$, where IQR is the interquartile range and $n$ is the sample size; this formula approximates a 95% confidence interval for the median under assumptions of normality and roughly equal sample sizes across groups.[22]
The primary purpose of the notches is to facilitate informal comparisons of medians between multiple groups displayed side by side. If the notches of two box plots do not overlap, it indicates strong evidence of a significant difference between the medians at the $\alpha = 0.05$ level, serving as a quick visual hypothesis test without requiring formal statistical computation.[22] This approach leverages the asymptotic normality of the median estimator to infer differences efficiently.
In practice, the notch extends $1.58 \times \mathrm{IQR}/\sqrt{n}$ on either side of the median, but the notches are clipped to the inner fences (or hinges) if the calculated extent would extend beyond them, preventing distortion of the plot.[22] Compared to side-by-side standard box plots, notched versions offer a clear advantage in multiple comparisons by embedding inferential information directly into the visualization, reducing the need for separate confidence interval plots or post-hoc tests while maintaining the core summary of the data distribution.[22]
Adjusted and Modified Box Plots
Adjusted and modified box plots address limitations of the standard box plot when dealing with skewed distributions or datasets requiring more detailed tail information, by incorporating asymmetry in whisker construction or extending quantile summaries beyond quartiles. In standard box plots, the fences sit at the same multiple of the interquartile range (IQR) on both sides of the box, which can misrepresent tails in positively or negatively skewed data, leading to excessive outlier flagging on the longer tail.[23]
The adjusted box plot, proposed by Hubert and Vandervieren, modifies whisker lengths asymmetrically using the medcouple, a robust, sign-preserving measure of skewness that ranges from -1 to 1. For positively skewed data (medcouple > 0), the upper whisker is extended further by applying an adjustment factor greater than 1.5 to the IQR, while the lower whisker uses a factor less than 1.5, better capturing the elongated upper tail without flagging valid points as outliers. Conversely, for negative skew, the lower tail is extended. This approach reduces false outliers in skewed distributions like exponential or lognormal data, where traditional symmetric fences underrepresent the longer tail.[23][24]
Variations in quantile definitions, as detailed by Hyndman and Fan, also influence modified box plots by affecting the positions of the box edges and whiskers, particularly in small samples where different interpolation methods yield different summaries. Nine common quantile types are compared; Hyndman and Fan recommend type 8, which gives approximately median-unbiased estimates regardless of the distribution, while type 7 is the default in software such as R, so the same data can produce slightly different boxes across tools.[18]
Letter-value plots extend the box plot by displaying multiple nested boxes representing successive quantiles, or "letter values," starting from the median and halving the data at each step (e.g., fourths, eighths, sixteenths) until fewer than 15 observations remain per tail.
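The halving rule behind letter values can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `letter_values`, the `levels` argument (used here in place of a data-driven stopping rule), and the toy data are all assumptions made for the example. Depths follow Tukey's convention, with the median at depth (n + 1) / 2 and each later depth at (floor(previous depth) + 1) / 2.

```python
from math import floor

def letter_values(data, levels=4):
    """Return [(label, depth, lower, upper), ...] for successive letter values.

    `levels` is an illustrative stand-in for a real stopping rule.
    """
    xs = sorted(data)
    n = len(xs)

    def at_depth(d):
        # Value d positions in from each end (1-based). A half-integer depth
        # averages the two neighbouring order statistics.
        lo_i, hi_i = floor(d), floor(d + 0.5)  # equal when d is an integer
        lower = (xs[lo_i - 1] + xs[hi_i - 1]) / 2
        upper = (xs[n - lo_i] + xs[n - hi_i]) / 2
        return lower, upper

    labels = ["M", "F", "E", "D", "C", "B", "A"]  # median, fourths, eighths, ...
    out = []
    depth = (n + 1) / 2
    for label in labels[:levels]:
        out.append((label, depth, *at_depth(depth)))
        if depth <= 1:  # the extremes have been reached
            break
        depth = (floor(depth) + 1) / 2
    return out

# Toy data (hypothetical): the integers 1..16
for row in letter_values(range(1, 17)):
    print(row)
```

Each row pairs a lower and an upper letter value at the same depth, which is what a letter-value plot draws as one nested box.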
Originally conceptualized by Tukey for exploratory analysis of large datasets, letter-value plots reveal detailed tail behavior and symmetry not visible in standard quartiles, making them suitable for skewed distributions where inner quantiles highlight central tendencies and outer ones emphasize extreme tails.[25]
For data that are not independent and identically distributed (non-IID), such as clustered or heterogeneous samples, traditional box plots can obscure variability; alternatives like raincloud plots combine a density estimate (half-violin), summary box, and jittered raw data points to visualize full distributions and individual observations without assuming IID conditions. Proposed by Allen et al., raincloud plots are particularly useful for skewed data in experimental contexts, like neuroscience or psychology, where variance stabilization adjustments (e.g., via transformations) may precede plotting to normalize tails. These are employed when standard box plots underrepresent multimodal or heavy-tailed structures in non-IID settings.[26]
Adjusted and modified variants are recommended for positively skewed distributions, such as income or response times, where the standard IQR-based method compresses the longer tail, potentially masking important distributional features.[23]
Interpretation
Reading a Single Box Plot
A box plot provides a compact visual summary of a dataset's distribution, allowing readers to assess key statistical features without examining the raw data. Developed by John Tukey as part of exploratory data analysis, it emphasizes the median, quartiles, and potential outliers to reveal central tendency, variability, and shape.[27][7]
To assess central tendency, locate the line within the box (horizontal in a vertically oriented plot), which represents the median, the value that divides the dataset into two equal halves, with 50% of observations above and 50% below. If the median aligns with the center of the box, the middle half of the data is symmetric around this point; otherwise, its offset indicates asymmetry in the data.[27][7]
The spread of the data is measured by the box's length, which spans the interquartile range (IQR) from the first quartile (Q1, the 25th percentile) to the third quartile (Q3, the 75th percentile), capturing the middle 50% of the observations and highlighting the typical variability excluding extremes. Whiskers extend from the box edges to the smallest and largest observations lying within 1.5 times the IQR of the box; when no observation lies beyond that distance, they reach the data's minimum and maximum. The whiskers thus illustrate the overall range while limiting the influence of outliers.[27][7]
Skewness, or the lack of symmetry in the distribution, can be detected by examining the median's position relative to the box center and the relative lengths of the whiskers. A median closer to Q1 with a longer upper whisker suggests positive (right) skewness, where the tail extends toward higher values; conversely, a median nearer to Q3 with a longer lower whisker indicates negative (left) skewness.[27][7]
Outliers are identified as individual points plotted beyond the whiskers, specifically any data values falling more than 1.5 IQRs below Q1 or above Q3, following Tukey's method to flag potential anomalies for further investigation.
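The reading steps above (median, box, whiskers, skewness cue, outliers) can be reproduced numerically. The sketch below is illustrative only: `box_summary` and the sample values are hypothetical, quartiles come from Python's `statistics.quantiles` with `method="inclusive"` (one of several conventions, so other tools may give slightly different boxes), and the skewness hint is a simple heuristic based on the two cues described in the text.

```python
from statistics import quantiles

def box_summary(data, k=1.5):
    """Summarize a sample the way a Tukey-style box plot displays it."""
    xs = sorted(data)
    q1, med, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr

    # Whiskers stop at the most extreme observations inside the fences.
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    whisker_lo, whisker_hi = inside[0], inside[-1]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]

    # Heuristic skewness cue: report a direction only when the median offset
    # and the whisker lengths agree.
    upper_half, lower_half = q3 - med, med - q1
    upper_whisker, lower_whisker = whisker_hi - q3, q1 - whisker_lo
    if upper_half > lower_half and upper_whisker > lower_whisker:
        skew = "right"
    elif upper_half < lower_half and upper_whisker < lower_whisker:
        skew = "left"
    else:
        skew = "unclear"

    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr,
            "fences": (lo_fence, hi_fence),
            "whiskers": (whisker_lo, whisker_hi),
            "outliers": outliers, "skew_hint": skew}

# Hypothetical response times in seconds
print(box_summary([12, 14, 15, 15, 16, 17, 18, 19, 21, 24, 40]))
```

For this sample the fences fall at 7.5 and 27.5, so the value 40 is plotted as an individual point and the upper whisker stops at 24.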
The number and positioning of the outlier points reveal the extent and direction of unusual deviations in the dataset.[27][7]
Comparing Multiple Distributions
Box plots facilitate the comparison of multiple distributions by displaying them side by side along a shared axis, enabling simultaneous assessment of central tendencies, variabilities, and shapes across groups. In the usual vertical layout, the groups are arranged along the horizontal axis and the response variable is plotted on the vertical axis, allowing viewers to discern differences in medians (as horizontal lines within boxes), interquartile ranges (as box lengths), and overall spreads (via whiskers extending to adjacent values). For instance, in analyzing energy output from different machines, side-by-side box plots can reveal that one machine consistently outperforms others in both median output and consistency of results.[5]
When interpreting overlaps between these plots, a key visual cue is the positioning of the boxes and whiskers. If the interquartile ranges (IQRs) of two adjacent box plots do not overlap, the groups are likely to differ significantly in their central locations, providing a rough indication of distinct distributions. Subtler differences may be suggested by partial overlap in the whiskers, which extend to the most extreme non-outlier values (typically up to 1.5 times the IQR from the quartiles), hinting at potential variations in tails without implying equivalence. These overlap assessments serve as exploratory tools to guide further statistical testing, such as t-tests or ANOVA, rather than definitive proofs of significance.[10]
For datasets involving more than two groups, effective visualization involves ordering the box plots by ascending or descending medians to reveal patterns or trends across categories.
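The two cues just described, ordering groups by median and checking whether the boxes overlap, are easy to compute before drawing anything. The following is a rough sketch under stated assumptions: the helper names and the machine-output numbers are hypothetical, quartiles use `statistics.quantiles` with `method="inclusive"`, and boxes that merely touch are treated as overlapping. As with the visual check, the result is exploratory and not a substitute for a formal test.

```python
from itertools import combinations
from statistics import quantiles

def quartile_box(values):
    """Return (Q1, median, Q3) for one group."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return q1, med, q3

def compare_groups(groups):
    """Order group names by median and list pairs whose boxes do not overlap."""
    boxes = {name: quartile_box(vals) for name, vals in groups.items()}
    order = sorted(boxes, key=lambda name: boxes[name][1])  # ascending median
    separated = [
        (a, b)
        for a, b in combinations(order, 2)
        # Boxes are separated when one Q3 lies strictly below the other Q1.
        if boxes[a][2] < boxes[b][0] or boxes[b][2] < boxes[a][0]
    ]
    return order, separated

# Hypothetical energy output for three machines
machines = {
    "C": [20, 21, 22, 23, 24],
    "A": [10, 11, 12, 13, 14],
    "B": [12, 13, 14, 15, 16],
}
order, separated = compare_groups(machines)
print(order)      # plotting order by median
print(separated)  # pairs worth a closer look with a formal test
```

Here machines A and B have boxes that touch at 13, so only the pairs involving C are reported as separated.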
To incorporate results from multiple comparison procedures like Tukey's honestly significant difference (HSD) test, compact letter displays can be overlaid on or above the plots; groups that share a letter (e.g., "a" and "ab") do not differ significantly at the chosen alpha level, while groups with no letter in common are statistically distinguishable. Because Tukey's HSD compares group means rather than medians, the letters describe differences in means even though the boxes display medians. This lettering system, derived from all-pairwise comparisons, enhances interpretability without cluttering the display.[28]
Comparisons can be complicated by unequal sample sizes across groups, as larger samples yield more reliable quartile and median estimates, potentially exaggerating apparent differences or stability relative to smaller samples. In such cases, adjusting box widths proportional to the square root of sample sizes helps normalize visual perceptions of precision, though it does not fully mitigate the need for formal statistical adjustments in inference.[29] Box plots with fewer than 20 observations per group may also produce unstable summaries, underscoring the importance of verifying assumptions through complementary analyses.[30]
Examples
Dataset Without Outliers
To illustrate the construction and interpretation of a box plot for a clean, roughly symmetric dataset without outliers, consider the exam scores of 20 students: 65, 69, 72, 74, 75, 75, 78, 80, 81, 82, 83, 84, 86, 88, 88, 88, 90, 92, 94, 95. The sorted dataset yields the following five-number summary: minimum = 65, first quartile (Q1) = 75 (median of the lower half, averaging the 5th and 6th values: (75 + 75)/2), median = 82.5 (averaging the 10th and 11th values: (82 + 83)/2), third quartile (Q3) = 88 (median of the upper half, averaging the 15th and 16th values: (88 + 88)/2), and maximum = 95. The interquartile range (IQR) is Q3 - Q1 = 88 - 75 = 13. No outliers are present: the lower fence is Q1 - 1.5 × IQR = 75 - 19.5 = 55.5 (below the minimum of 65) and the upper fence is Q3 + 1.5 × IQR = 88 + 19.5 = 107.5 (above the maximum of 95), so the whiskers extend fully to the minimum and maximum values. To construct the plot, draw a box from Q1 (75) to Q3 (88) with a line at the median (82.5), and attach whiskers from the box ends to 65 and 95, respectively.
This box plot reveals a roughly symmetric distribution: the median lies near the center of the box (7.5 units above Q1 and 5.5 units below Q3), and the whiskers are of comparable length (the lower whisker spans 10 units from 65 to 75; the upper spans 7 units from 88 to 95), indicating a tight, balanced spread without extremes. The compact IQR of 13 suggests low variability in the middle 50% of scores, centered around 82.5, typical of consistent performance across the class. Visually, the balanced box and similar whisker lengths emphasize the absence of strong skewness or anomalies in this dataset.
Dataset With Outliers
To illustrate the effect of outliers in a box plot, consider a hypothetical dataset of annual household incomes (in thousands of dollars) for 20 households, where most values cluster between 30 and 60 but two extreme values reach 200 and 250. The sorted incomes are 25, 30, 32, 35, 35, 38, 40, 42, 45, 48, 50, 52, 55, 55, 58, 60, 65, 70, 200, and 250. The first quartile (Q1) is 36.5 (averaging the 5th and 6th values: (35 + 38)/2), the median is 49 (averaging the 10th and 11th values: (48 + 50)/2), the third quartile (Q3) is 59 (averaging the 15th and 16th values: (58 + 60)/2), and the interquartile range (IQR) is 59 - 36.5 = 22.5. Outliers are identified using the standard criterion of values falling more than 1.5 times the IQR beyond the quartiles, which gives a lower fence at 36.5 - 33.75 = 2.75 and an upper fence at 59 + 33.75 = 92.75. The two incomes above 92.75 (200 and 250) are therefore flagged as outliers, while the upper whisker extends to the largest non-outlier value, 70, and the lower whisker extends to the minimum value, 25, which lies above the lower fence. This method, introduced by John Tukey, highlights potential anomalies without removing them from the visualization.
In the resulting box plot, the box spans from 36.5 to 59 with the median at 49, the lower whisker reaches down to 25, and the upper whisker extends to 70 (whisker lengths of 11.5 and 11, respectively), with the outlier points plotted individually beyond the upper whisker. The box and whiskers are nearly balanced, so the right skew of the distribution shows up mainly through the two outliers, which inflate the overall range to 225 while the IQR remains at 22.5, unaffected by the extremes and providing a stable measure of central spread.
Applications and Limitations
Common Uses
Box plots are widely employed in exploratory data analysis (EDA) within statistics to summarize the distribution of data, highlighting measures of central tendency, spread, and potential outliers, which aids in initial data understanding before more formal analyses.[5] They facilitate the visualization of skewness, symmetry, and variability, allowing statisticians to assess data characteristics efficiently without assuming a specific distributional form.[31] In preparation for hypothesis testing, box plots serve as a preliminary tool to check assumptions such as normality by revealing deviations in the data's shape, such as asymmetry or heavy tails, which might influence the choice of parametric or non-parametric tests.[27]
In the medical field, box plots are commonly used to compare treatment effects across patient groups, such as visualizing response variables like blood pressure reductions or survival times between control and intervention cohorts, enabling quick identification of median outcomes and interquartile ranges for efficacy assessment.[32] For instance, they illustrate the distribution of clinical outcomes in randomized trials, helping researchers detect variability in treatment responses and outliers representing atypical patient reactions.[33]
In environmental science, box plots summarize pollutant concentration levels across monitoring sites or time periods, such as displaying daily PM2.5 or NO2 measurements to compare spatial heterogeneity and identify high-variability locations for regulatory action.[34] This application supports the evaluation of air quality trends, with the box's quartiles indicating typical exposure ranges and whiskers extending to extreme events like pollution spikes.[35]
Within finance, box plots depict the distributions of asset return data, such as daily stock yields or portfolio volatilities, to compare performance across securities or market conditions, revealing medians for average returns and interquartile ranges for risk assessment.[36] They are particularly useful for highlighting asymmetry in return distributions, which informs investment strategies by showing potential downside risks through lower whiskers.[37]
In business contexts, box plots support quality control processes by monitoring manufacturing metrics, like product dimensions or defect rates, across production batches to detect shifts in process stability and variability.[38] They are also applied in A/B testing for digital products, where side-by-side box plots compare user engagement metrics, such as conversion rates between variants, to evaluate which design yields a more consistent and higher median performance.[39]
In modern genomics research, box plots provide concise summaries of gene expression levels across samples or conditions, such as comparing transcript abundances in treated versus untreated cell lines to identify differentially expressed genes through distributional overlaps or shifts.[40] For RNA-seq data, they visualize the spread of normalized expression values, aiding in quality checks and preliminary comparisons before advanced differential analysis.[41] This usage has become standard in high-throughput studies, where multiple box plots side-by-side facilitate the interpretation of expression variability across experimental groups.[32]
Limitations and Alternatives
Box plots have several limitations that can affect their utility in data analysis. One key drawback is their inability to reveal multimodality in distributions; multiple distinct distributions, such as unimodal versus bimodal ones, can produce identical box plot signatures if they share the same quartiles, thereby masking important structural features of the data.[42] Additionally, box plots do not indicate sample size, which is crucial for assessing the reliability of the summary statistics; without this information, interpretations may overlook variability due to small or uneven group sizes, particularly in comparative analyses.[43] The choice of quartile calculation method also introduces sensitivity, as different approaches (such as Tukey's hinges versus standard percentiles) can yield varying box widths and whisker lengths, especially in discrete or small datasets, leading to inconsistent representations across software implementations.[44] For small sample sizes (typically fewer than 10–20 observations), box plots become unreliable, as quartile estimates and outlier detection may not accurately reflect the underlying distribution, potentially misleading users about spread and central tendency.[30][7] Beyond these issues, box plots provide only a coarse summary and fail to convey precise data density or the full shape of the distribution, limiting their insight into aspects like gaps, tails, or precise quantile behaviors.[5]
When these limitations are problematic, alternatives better suited to specific needs include histograms or violin plots, which visualize the full probability density and reveal multimodality or shape details that box plots obscure.[45] For small datasets, dot plots (or strip plots) preserve individual data points, avoiding the summarization pitfalls of box plots while facilitating direct observation of values and outliers.[46] Empirical cumulative distribution functions (ECDFs) offer a precise, non-parametric view of quantiles and cumulative probabilities, providing a complementary or superior option for exact distributional comparisons without relying on quartile approximations.[42]
Visualization Tools
Software Implementation
Box plots can be generated using a variety of software tools and programming languages, each offering built-in functions or interfaces for creating these visualizations from raw data or summary statistics.[47] Popular options include statistical programming environments like R and Python, as well as spreadsheet and statistical software such as Microsoft Excel and IBM SPSS Statistics. These implementations typically compute the necessary quartiles, medians, and outlier thresholds automatically from input data.[48]
In the R programming language, the base graphics package provides the boxplot() function, a generic method that accepts vectors, matrices, or formulas to produce simple box plots. For example, boxplot(x) plots a single vector, while boxplot(formula, data) groups data by factors for comparative displays.[47] For enhanced customization, the ggplot2 package uses geom_boxplot() within a ggplot() call, mapping variables via aesthetics like aes(x = group, y = value) to create layered, publication-ready plots with options for themes, colors, and facets.[49]
Python libraries offer similar capabilities through the matplotlib and seaborn packages. The matplotlib.pyplot.boxplot() function draws box plots from arrays or lists, supporting parameters for whisker lengths, notch displays, and outlier markers, as in plt.boxplot(data).[48] Seaborn's seaborn.boxplot() integrates seamlessly with pandas DataFrames, enabling grouped visualizations via sns.boxplot(data=df, x='group', y='value') and automatic styling for better readability across multiple distributions.[50]
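As a concrete illustration of the matplotlib parameters mentioned above, the sketch below draws side-by-side notched box plots and, separately, computes the notch interval by hand using the median ± 1.58 × IQR/√n formula from the notched box plot section. The helper names, the output file name, and the choice of `statistics.quantiles(..., method="inclusive")` are assumptions for this example; matplotlib computes its own quartiles and notch extents internally, so its drawn notches may differ slightly from the hand calculation. The scores are those of the "Dataset Without Outliers" example.

```python
from math import sqrt
from statistics import quantiles

def notch_limits(values, c=1.58):
    """Hand-computed notch interval: median +/- c * IQR / sqrt(n)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    half_width = c * (q3 - q1) / sqrt(len(values))
    return med - half_width, med + half_width

def draw_notched_boxplot(groups, path="boxplot.png"):
    """Draw one notched box per group with matplotlib and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen so no display is required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.boxplot(
        list(groups.values()),
        notch=True,                  # draw notches around each median
        whis=1.5,                    # whisker reach in multiples of the IQR
        flierprops={"marker": "x"},  # marker style for outlier points
    )
    ax.set_xticklabels(list(groups))
    ax.set_ylabel("value")
    fig.savefig(path)
    plt.close(fig)

exam_scores = [65, 69, 72, 74, 75, 75, 78, 80, 81, 82,
               83, 84, 86, 88, 88, 88, 90, 92, 94, 95]
print(notch_limits(exam_scores))  # roughly (77.9, 87.1)
```

Calling `draw_notched_boxplot({"exam": exam_scores})` writes the figure to `boxplot.png`; the plotting import is kept inside the function so the numeric helper can be used without matplotlib installed.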
In Microsoft Excel, users can create box and whisker charts directly from the Insert tab under the Statistics chart group, selecting data ranges that automatically calculate quartiles and handle outliers. The charts are vertical by default, but horizontal orientation can be achieved by transposing the data or using workarounds,[51] with options to add data labels to elements like outliers or the mean.[52] Similarly, IBM SPSS Statistics employs the Chart Builder dialog, where selecting the Boxplot icon allows specification of variables for simple or clustered plots, including controls for axis labels and exclusion of cases, making it suitable for exploratory data analysis in social sciences.
When dealing with large datasets, direct input to these functions may lead to performance issues due to memory constraints during quartile computations. To mitigate this, downsampling techniques, such as random subsampling to a representative size (e.g., 10,000 points), or precomputing summary statistics (median, quartiles, and extremes) and supplying them as input can be used; for instance, matplotlib's Axes.bxp() draws box plots from a list of dictionaries of precalculated statistics, while R's bxp() draws from the summary list returned by boxplot(..., plot = FALSE).[48]
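A minimal sketch of the precomputed-statistics route, assuming matplotlib's `Axes.bxp()`, which draws from a list of dictionaries with the keys `med`, `q1`, `q3`, `whislo`, `whishi`, and optionally `fliers` and `label`. The helper names and output path are hypothetical. Quartiles are computed as medians of the lower and upper halves, matching the worked examples in this article, and the data are the incomes from the "Dataset With Outliers" example.

```python
from statistics import median

def tukey_stats(values, label, k=1.5):
    """Build one statistics dictionary in the layout Axes.bxp() expects."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half])        # median of the lower half
    q3 = median(xs[n - half:])    # median of the upper half
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return {
        "label": label,
        "med": median(xs),
        "q1": q1,
        "q3": q3,
        "whislo": inside[0],      # lowest observation inside the fences
        "whishi": inside[-1],     # highest observation inside the fences
        "fliers": [x for x in xs if x < lo_fence or x > hi_fence],
    }

def draw_from_stats(stats, path="summary_boxplot.png"):
    """Render precomputed statistics without passing the raw data to matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen so no display is required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bxp(stats, showfliers=True)
    fig.savefig(path)
    plt.close(fig)

incomes = [25, 30, 32, 35, 35, 38, 40, 42, 45, 48,
           50, 52, 55, 55, 58, 60, 65, 70, 200, 250]
stats = [tukey_stats(incomes, "income")]
print(stats[0])
# draw_from_stats(stats)  # uncomment to write summary_boxplot.png
```

Because only a handful of numbers per group reach the plotting call, the summaries can be computed once (for example in a database or a streaming job) and reused, and the quartile convention stays under the analyst's control instead of the plotting library's.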