Box plot
from Wikipedia
Box plot of data from the Michelson experiment

In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles.[1]

In addition to the box on a box plot, there can be lines (called whiskers) extending from the box to indicate variability outside the upper and lower quartiles; for this reason the plot is also called the box-and-whisker plot or the box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset[2] may be plotted as individual points beyond the whiskers on the box-plot. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution[3] (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length).

The spacings in each subsection of the box-plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box-plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.

History


The range-bar method was first introduced by Mary Eleanor Spear in her book "Charting Statistics" in 1952[4] and again in her book "Practical Charting Techniques" in 1969.[5] The box-and-whisker plot was first introduced in 1970 by John Tukey, who later published on the subject in his book "Exploratory Data Analysis" in 1977.[6]

Elements

Box-plot with whiskers from minimum to maximum
The same box-plot with whiskers drawn within the 1.5 IQR value

A boxplot is a standardized way of displaying the dataset based on the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

  • Minimum (Q0 or 0th percentile): the lowest data point in the data set excluding any outliers
  • Maximum (Q4 or 100th percentile): the highest data point in the data set excluding any outliers
  • Median (Q2 or 50th percentile): the middle value in the data set
  • First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), it is the median of the lower half of the dataset.
  • Third quartile (Q3 or 75th percentile): also known as the upper quartile qn(0.75), it is the median of the upper half of the dataset.[7]

In addition to the minimum and maximum values used to construct a box-plot, another important element that can also be employed to obtain a box-plot is the interquartile range (IQR), defined as:

IQR = Q3 − Q1 = qn(0.75) − qn(0.25)

A box-plot usually includes two parts, a box and a set of whiskers.

Box


The box is drawn from Q1 to Q3 with a horizontal line drawn inside it to denote the median. Some box plots include an additional character to represent the mean of the data.[8][9]

Whiskers


The whiskers must end at an observed data point, but can be defined in various ways. In the most straightforward method, the boundary of the lower whisker is the minimum value of the data set, and the boundary of the upper whisker is the maximum value of the data set. Because of this variability, it is appropriate to describe the convention that is being used for the whiskers and outliers in the caption of the box-plot.

Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile (Q3), a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile (Q1) and a whisker is drawn down to the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as outliers.[10] The outliers can be plotted on the box-plot as a dot, a small circle, a star, etc. (see example below).

A box plot representing data

There are other representations in which the whiskers can stand for several other things, such as:

  • One standard deviation above and below the mean of the data set
  • The 9th percentile and the 91st percentile of the data set
  • The 2nd percentile and the 98th percentile of the data set

Rarely, a box-plot is drawn without whiskers. This can be appropriate for sensitive information, where whiskers (and outliers) would disclose actual observed values.[11]

The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.
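The spacing claim can be checked numerically. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the variable names are illustrative, and the result shows that the spacing is approximately, not exactly, equal.

```python
from statistics import NormalDist

# Percentiles that make up the seven-number summary described above.
levels = [0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98]

# Positions of the seven marks for a standard normal distribution.
marks = [NormalDist().inv_cdf(p) for p in levels]

# Gaps between neighbouring marks; for normal data they are nearly equal.
gaps = [b - a for a, b in zip(marks, marks[1:])]

for p, m in zip(levels, marks):
    print(f"{p:4.0%}  {m:+.3f}")
print("gaps:", [round(g, 3) for g in gaps])
```

Each gap comes out close to two thirds of a standard deviation, which is why these particular percentiles were chosen.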

Variations

Four box plots, with and without notches and variable width

Since the mathematician John W. Tukey first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed, and the two most commonly found variations are the variable-width box plots and the notched box plots.

Variable-width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.[12]
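As a small illustration of the square-root convention, the sketch below scales widths so that the largest group receives a chosen maximum width. The function name `box_widths` and the parameter `max_width` are hypothetical, introduced only for this example.

```python
import math

def box_widths(group_sizes, max_width=0.8):
    """Return box widths proportional to sqrt(group size).

    The largest group is drawn at max_width (an arbitrary plotting choice).
    """
    roots = [math.sqrt(n) for n in group_sizes]
    largest = max(roots)
    return [max_width * r / largest for r in roots]

# Groups of 25, 100 and 400 observations: sizes differ by factors of 4,
# widths differ by factors of 2.
print(box_widths([25, 100, 400]))
```

Using the square root keeps very large groups from visually dominating the display while still ordering the boxes by group size.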

Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide of the significance of the difference of medians; if the notches of two boxes do not overlap, this will provide evidence of a statistically significant difference between the medians.[12] The height of the notches is proportional to the interquartile range (IQR) of the sample and is inversely proportional to the square root of the size of the sample. However, there is an uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).[12] The width of the notch is arbitrarily chosen to be visually pleasing, and should be consistent amongst all box plots being displayed on the same page.

One convention for obtaining the boundaries of these notches is to use a distance of ±1.58 × IQR / √n around the median, where n is the sample size.[13]

Adjusted box plots are intended to describe skew distributions, and they rely on the medcouple statistic of skewness.[14] For a medcouple value of MC, the lengths of the upper and lower whiskers on the box-plot are respectively defined to be:

  • 1.5 IQR × e^(3 MC) and 1.5 IQR × e^(−4 MC), if MC ≥ 0
  • 1.5 IQR × e^(4 MC) and 1.5 IQR × e^(−3 MC), if MC ≤ 0

For a symmetrical data distribution, the medcouple will be zero, and this reduces the adjusted box-plot to Tukey's box-plot with equal whisker lengths of 1.5 IQR for both whiskers.
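The whisker adjustment can be sketched as a short function. This assumes the exponential factors e^(3 MC) and e^(−4 MC) (and e^(4 MC), e^(−3 MC) for negative skew) from Hubert and Vandervieren's formulation, and it takes the medcouple as an input computed elsewhere; the function name is illustrative.

```python
import math

def adjusted_whisker_lengths(iqr, mc):
    """Return (upper, lower) maximum whisker lengths for a medcouple mc.

    Assumes the exponential adjustment of the adjusted box plot;
    mc is expected to lie between -1 and 1.
    """
    if mc >= 0:
        upper = 1.5 * iqr * math.exp(3 * mc)
        lower = 1.5 * iqr * math.exp(-4 * mc)
    else:
        upper = 1.5 * iqr * math.exp(4 * mc)
        lower = 1.5 * iqr * math.exp(-3 * mc)
    return upper, lower

print(adjusted_whisker_lengths(9, 0.0))   # symmetric case: both equal 1.5 * IQR
print(adjusted_whisker_lengths(9, 0.2))   # right skew: longer upper whisker
```

As with Tukey's rule, these lengths are upper limits: each whisker still ends at the most extreme observation that falls within the limit.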

Other kinds of box plots, such as the violin plots and the bean plots can show the difference between single-modal and multimodal distributions, which cannot be observed from the original classical box-plot.[6]

Examples


Example without outliers

A boxplot with no outliers

A series of hourly temperatures was measured throughout the day in degrees Fahrenheit. The recorded values are listed in order as follows (°F): 57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81.

A box plot of the data set can be generated by first calculating five relevant values of this data set: minimum, maximum, median (Q2), first quartile (Q1), and third quartile (Q3).

The minimum is the smallest number of the data set. In this case, the minimum recorded day temperature is 57°F.

The maximum is the largest number of the data set. In this case, the maximum recorded day temperature is 81°F.

The median is the "middle" number of the ordered data set: about half of the values lie at or below it and about half lie at or above it. With 24 values, it is the average of the 12th and 13th values, which are both 70°F, so the median of this ordered data set is 70°F.

The first quartile value (Q1 or 25th percentile) is the number that marks one quarter of the ordered data set: roughly 25% of the values are less than the first quartile and roughly 75% are greater than it. It can be determined as the median of the lower half of the data, that is, the "middle" number between the minimum and the median. For the hourly temperatures, the lower half consists of the 12 values from 57°F to 70°F, and its "middle" number is 66°F.

The third quartile value (Q3 or 75th percentile) is the number that marks three quarters of the ordered data set: roughly 75% of the values are less than the third quartile and roughly 25% are greater than it. It can be determined as the median of the upper half of the data, that is, the "middle" number between the median and the maximum. For the hourly temperatures, the upper half consists of the 12 values from 70°F to 81°F, and its "middle" number is 75°F.

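The interquartile range, or IQR, can be calculated by subtracting the first quartile value (Q1) from the third quartile value (Q3):

IQR = Q3 − Q1 = 75°F − 66°F = 9°F

Hence,

1.5 IQR = 1.5 × 9°F = 13.5°F

1.5 IQR above the third quartile is:

Q3 + 1.5 IQR = 75°F + 13.5°F = 88.5°F

1.5 IQR below the first quartile is:

Q1 − 1.5 IQR = 66°F − 13.5°F = 52.5°F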

The upper whisker boundary of the box-plot is the largest data value that is within 1.5 IQR above the third quartile. Here, 1.5 IQR above the third quartile is 88.5°F and the maximum is 81°F. Therefore, the upper whisker is drawn at the value of the maximum, which is 81°F.

Similarly, the lower whisker boundary of the box plot is the smallest data value that is within 1.5 IQR below the first quartile. Here, 1.5 IQR below the first quartile is 52.5°F and the minimum is 57°F. Therefore, the lower whisker is drawn at the value of the minimum, which is 57°F.
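The calculation in this example can be reproduced with a few lines of code. The sketch below uses Python's standard `statistics.median` and the median-of-halves definition of the quartiles; because the data set has an even number of values, the two halves are simply the first and last 12 values.

```python
from statistics import median

temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

data = sorted(temps)
half = len(data) // 2            # 24 values, so each half holds 12

q2 = median(data)                # median of the whole data set
q1 = median(data[:half])         # median of the lower half
q3 = median(data[half:])         # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence)
print(lower_whisker, upper_whisker)
```

For an odd number of values, a convention is needed for whether the median belongs to the halves; this sketch does not address that case.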

Example with outliers

A box plot with outliers

The example above has no outliers. The following example shows how to generate a box-plot for data with outliers:

The ordered set for the recorded temperatures is (°F): 52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89.

In this example, only the first and the last number are changed. The median, third quartile, and first quartile remain the same.

In this case, the maximum value in this data set is 89°F, and 1.5 IQR above the third quartile is 88.5°F. The maximum is greater than 1.5 IQR plus the third quartile, so the maximum is an outlier. Therefore, the upper whisker is drawn at the greatest value smaller than 1.5 IQR above the third quartile, which is 79°F.

Similarly, the minimum value in this data set is 52°F, and 1.5 IQR below the first quartile is 52.5°F. The minimum is smaller than the first quartile minus 1.5 IQR, so the minimum is also an outlier. Therefore, the lower whisker is drawn at the smallest value greater than 1.5 IQR below the first quartile, which is 57°F.
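The separation into whisker ends and outliers can be sketched as follows, taking the quartiles Q1 = 66°F and Q3 = 75°F from the example as given.

```python
temps = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 89]

q1, q3 = 66, 75                  # quartiles from the worked example
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr     # 1.5 IQR below the first quartile
upper_fence = q3 + 1.5 * iqr     # 1.5 IQR above the third quartile

# Observations inside the fences determine where the whiskers end.
inside = [x for x in temps if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

# Everything else is plotted as an individual outlier point.
outliers = [x for x in temps if x < lower_fence or x > upper_fence]

print(whiskers, outliers)
```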

In the case of large datasets


For a data set containing a large number of data points, the quartiles are usually obtained from a general formula for empirical quantiles rather than by inspection.

General equation to compute empirical quantiles

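qn(p) = x(k) + α (x(k+1) − x(k)), with k = ⌊p(n + 1)⌋ and α = p(n + 1) − k

Here x(k) stands for the k-th value in the general ordering of the data points (i.e. if i < k, then x(i) ≤ x(k)).

Using the above example that has 24 data points (n = 24), one can calculate the median, first and third quartile either mathematically or visually.

Median

qn(0.5) = x(12) + (0.5 × 25 − 12) × (x(13) − x(12)) = 70 + 0.5 × (70 − 70) = 70°F

First quartile

qn(0.25) = x(6) + (0.25 × 25 − 6) × (x(7) − x(6)) = 66 + 0.25 × (66 − 66) = 66°F

Third quartile

qn(0.75) = x(18) + (0.75 × 25 − 18) × (x(19) − x(18)) = 75 + 0.75 × (75 − 75) = 75°F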
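A minimal implementation of such an empirical quantile is sketched below. It assumes the common interpolation rule with position p(n + 1), that is, qn(p) = x(k) + α (x(k+1) − x(k)) with k = ⌊p(n + 1)⌋ and α = p(n + 1) − k. The clamping at the ends of the data is an assumption of this sketch, not part of the formula.

```python
import math

def empirical_quantile(values, p):
    """Empirical quantile by linear interpolation at position p * (n + 1)."""
    x = sorted(values)
    n = len(x)
    pos = p * (n + 1)
    k = math.floor(pos)          # 1-based index of the lower neighbour
    alpha = pos - k              # fractional part used for interpolation
    # Assumption: positions that fall outside the data are clamped to the
    # smallest or largest observation.
    if k < 1:
        return float(x[0])
    if k >= n:
        return float(x[-1])
    return x[k - 1] + alpha * (x[k] - x[k - 1])   # Python lists are 0-based

temps = [57, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70,
         70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81]

print(empirical_quantile(temps, 0.25),
      empirical_quantile(temps, 0.50),
      empirical_quantile(temps, 0.75))
```

For the 24 temperatures this reproduces the quartiles of the worked example, because the neighbouring order statistics at each position happen to be equal.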

Visualization

Box-plot and a probability density function (pdf) of a normal N(0, σ²) population
Box-plots displaying the skewness of the data set

Although box plots may seem more primitive than histograms or kernel density estimates, they do have a number of advantages. First, the box plot enables statisticians to do a quick graphical examination of one or more data sets. Box-plots also take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data in parallel. Lastly, the overall structure of a histogram can be strongly influenced by the choice of the number and width of bins, and that of a kernel density estimate by the choice of bandwidth.

Although looking at a statistical distribution is more common than looking at a box plot, it can be useful to compare the box plot against the probability density function (theoretical histogram) for a normal N(0, σ²) distribution and observe their characteristics directly.
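The comparison with a normal distribution can also be made numerically. The sketch below, which uses Python's standard `statistics.NormalDist` and takes σ = 1, locates the quartiles and the 1.5 IQR whisker limits of a normal population and estimates the share of the population that would be drawn as outliers.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)    # first quartile, in units of sigma
q3 = std_normal.inv_cdf(0.75)    # third quartile, in units of sigma
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass beyond the two whisker limits.
share_outside = std_normal.cdf(lower_fence) + (1.0 - std_normal.cdf(upper_fence))

print(f"quartiles at {q1:+.3f} and {q3:+.3f} sigma")
print(f"whisker limits at {lower_fence:+.3f} and {upper_fence:+.3f} sigma")
print(f"share beyond the limits: {share_outside:.2%}")
```

Under these assumptions the box covers roughly ±0.67σ, the whisker limits sit near ±2.7σ, and well under 1% of a normal population falls beyond them.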

from Grokipedia
A box plot, also known as a box-and-whisker plot, is a graphical method for summarizing the distribution of a dataset using robust statistical measures, particularly the five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It depicts the central tendency, spread, skewness, and potential outliers of numerical data, which makes it a key tool in exploratory data analysis for comparing multiple groups or distributions efficiently.

The box plot was developed by American statistician John W. Tukey as part of his framework for exploratory data analysis, first appearing in schematic form in 1970 and formalized in his influential 1977 book Exploratory Data Analysis. Tukey's design emphasized simple, intuitive visualizations that reveal data patterns without assuming normality, drawing on earlier range charts but adding quartile-based boxes to highlight the interquartile range (IQR) and the extremes. Since its introduction, the box plot has become a standard in statistics, implemented in software such as R, Python's Matplotlib, and Excel, and extended in variations such as notched box plots (for inference on medians) and violin plots (for density).

In construction, the box spans from Q1 to Q3, with a line marking the median inside; whiskers extend to the farthest points within 1.5 times the IQR from the quartiles, while points beyond this range are plotted as outliers to flag potential anomalies. This 1.5 IQR rule, known as Tukey's fences, balances sensitivity to variability against robustness to extremes, allowing quick assessment of symmetry (if the median centers the box) or asymmetry (if it is offset). Box plots require no parametric assumptions, but they may obscure multimodality, and they hide sample size differences unless modified, for example with box widths proportional to group sizes.

Introduction

Definition

A box plot is a standardized graphical method for displaying the distribution of a numerical dataset based on its five-number summary, which includes the minimum value, the first quartile (Q1, or 25th percentile), the median (Q2, or 50th percentile), the third quartile (Q3, or 75th percentile), and the maximum value. This summary captures essential aspects of the data without requiring the full dataset to be shown.

Visually, the box plot represents the spread of the data through the interquartile range (the distance between Q1 and Q3, forming the "box"), central tendency via the median line within the box, and potential skewness by the relative lengths of the box and adjacent whiskers, which extend from the quartiles to the minimum and maximum (or to the non-outlier extremes). As a non-parametric tool, it makes no assumptions about the underlying distribution of the data, such as normality, which lets it summarize distributions for exploratory analysis across diverse datasets. The method is also known as a box-and-whisker plot, a term introduced by statistician John W. Tukey in his seminal 1977 work Exploratory Data Analysis.

Purpose and Advantages

Box plots are primarily employed to visualize the distribution and spread of numerical data by summarizing key statistical measures, including the median, quartiles, and range, which provide insight into central tendency, variability, and overall data structure. This lets analysts quickly grasp the middle 50% of the data via the interquartile range (IQR) and assess the full extent through whiskers extending to the non-outlier extremes. Box plots are especially effective for identifying outliers (data points lying more than 1.5 times the IQR beyond the quartiles), flagging potential anomalies without distorting the core summary. They also aid in detecting skewness by revealing asymmetries, such as a median offset toward one quartile or unequal whisker lengths. A key application is comparing multiple datasets or groups: side-by-side box plots highlight differences in medians, spreads, and shapes.

One major advantage of box plots is their robustness to outliers and extreme values: they rely on order-based statistics such as the median and quartiles, which remain stable even when a few points distort the mean or standard deviation. This makes box plots more reliable than mean-based summaries for real-world datasets that contain such points. They are also accessible to non-statisticians, conveying distributional information through an intuitive, minimalist design that highlights essentials such as center and spread without overwhelming detail. For large datasets, box plots are efficient: they condense thousands of observations into a single graphic with little computation and no visual clutter.

In comparison to histograms, which depict the full frequency distribution and can show multimodal shapes, box plots provide a compact summary based on quantiles rather than binning the entire dataset, which makes them better suited to inter-group comparisons where the detailed shape is secondary. Unlike stem-and-leaf plots, which retain and display individual data values for granular inspection, box plots sacrifice this detail for a more compact display, ideal when the goal is an overview rather than exhaustive enumeration.

History

Origins

The box plot, also known as the box-and-whisker plot, was invented by American statistician John W. Tukey in the early 1970s as a key component of exploratory data analysis (EDA). Tukey first presented the schematic plot, an early form of the box plot, in the preliminary edition of his book Exploratory Data Analysis in 1970. He developed the concept further in his 1972 paper "Some Graphical and Semigraphical Displays," published in Statistical Papers in Honor of George W. Snedecor, where he presented it alongside other semigraphical techniques such as the stem-and-leaf diagram to facilitate initial data examination.

The tool gained prominence through Tukey's 1977 book Exploratory Data Analysis, which provided a comprehensive framework for its use in summarizing data distributions in a simple, visual manner. In this work, Tukey detailed the box plot's structure to highlight central tendency, spread, and potential outliers without requiring complex computations. Within the EDA framework, the box plot exemplified Tukey's emphasis on graphical and numerical methods that prioritize direct interaction with data over reliance on parametric statistical assumptions or formal testing. This approach encouraged analysts to "let the data speak" through intuitive visualizations, fostering discovery of patterns and anomalies prior to confirmatory analysis.

Development and Adoption

Following John W. Tukey's invention of the box plot in 1970 as a tool for exploratory data analysis, the method underwent significant refinements in the late 1970s and 1980s to improve its robustness and its utility for comparing distributions. In 1978, Robert McGill, Tukey, and Wayne A. Larsen proposed variations including variable-width box plots, which scale the box width with sample size, and notched box plots, which incorporate confidence intervals around the median to facilitate visual comparisons between groups. These enhancements addressed limitations of the original design, such as handling unequal sample sizes and assessing differences between medians more reliably.

During the 1980s and into the 1990s, further modifications focused on adapting box plots to diverse data characteristics, including alternative definitions of outliers and fences to better accommodate skewed distributions. A 1989 survey by Michael Frigge, David C. Hoaglin, and Boris Iglewicz examined implementations across statistical software, revealing inconsistencies in how elements such as quartiles and outliers were calculated while underscoring the plot's growing standardization. These developments emphasized the box plot's flexibility, making it suitable for exploratory analysis of the non-normal data common in applied research.

Adoption of the box plot then accelerated across disciplines that require concise reporting of distributions, particularly in clinical research, where it became a standard for visualizing patient outcomes, treatment effects, and biomarker distributions. In engineering, it supported assessments of process variability, while in the social sciences it supported comparisons of survey data and behavioral metrics. This widespread integration stemmed from its ability to reveal center, spread, and outliers without assuming normality, which proved valuable for interdisciplinary data interpretation.

The method's influence extended to statistical standards and tools, with box plots incorporated into major statistical software packages by the late 1980s, enabling routine use in academic and professional workflows. Journals in statistics and applied fields began recommending box plots for graphical abstracts, promoting their role in improving the readability and comparability of results over traditional tables. More recently, the box plot has continued to evolve through open-source tools in R and Python.

Elements and Construction

Core Components

The core components of a standard box plot are a rectangular box, an internal line representing the median, extending whiskers, and individual points denoting outliers. These elements collectively summarize the distribution of a dataset by highlighting its central tendency, spread, and potential anomalies without assuming normality.

The central box spans from the first quartile (Q1, the 25th percentile) to the third quartile (Q3, the 75th percentile), encapsulating the interquartile range (IQR), which measures the middle 50% of the data. The length of the box indicates the spread of the central data points; a longer box means greater dispersion in the middle half of the values.

A line within the box marks the median (Q2, the 50th percentile), dividing the data into two equal halves and providing a robust measure of central tendency that is less affected by extreme values than the mean. The position of this line relative to the box edges reveals skewness: a median closer to Q1 means the lower part of the central data is more tightly packed (suggesting right skew), and a median closer to Q3 suggests the reverse.

Whiskers extend from the box edges to the smallest and largest data points that fall within 1.5 times the IQR below Q1 and above Q3, respectively, defining the main body of the data excluding extremes. These lines, often capped with short ticks or symbols, illustrate the range of the bulk of the observations and show the extent of non-outlying variation.

Data points lying beyond the whiskers, specifically those more than 1.5 IQR away from Q1 or Q3, are plotted as individual symbols, such as circles or asterisks, to denote potential outliers. These points flag unusual observations that may warrant further investigation; a common convention distinguishes mild outliers (1.5 to 3.0 IQR away) from extreme ones (beyond 3.0 IQR) through different symbol sizes or shapes.
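The distinction between points inside the whisker range, mild outliers, and extreme outliers can be sketched as a small classifier. The function name and the returned labels are illustrative choices for this example, not a standard API.

```python
def classify(value, q1, q3):
    """Label a value relative to the 1.5 IQR and 3.0 IQR limits."""
    iqr = q3 - q1
    if q1 - 1.5 * iqr <= value <= q3 + 1.5 * iqr:
        return "inside"              # covered by the box or the whiskers
    if q1 - 3.0 * iqr <= value <= q3 + 3.0 * iqr:
        return "mild outlier"        # between 1.5 and 3.0 IQR away
    return "extreme outlier"         # more than 3.0 IQR away

# With Q1 = 66 and Q3 = 75, the limits are 52.5 / 88.5 and 39 / 102.
for v in (70, 89, 110):
    print(v, classify(v, 66, 75))
```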

Step-by-Step Construction

To construct a box plot, first sort the data in ascending order to make the key statistics easy to identify. Next, compute the five-number summary, which consists of the minimum value (the smallest observation), the first quartile (Q1, the 25th percentile), the median (the 50th percentile), the third quartile (Q3, the 75th percentile), and the maximum value (the largest observation).

The median is the middle value when the number of observations (n) is odd; for even n, it is the average of the two central values after sorting. Ties (duplicate values) need no special treatment: each observation keeps its own position in the sorted order. Q1 and Q3, following Tukey's hinge method, are the medians of the lower and upper halves of the sorted data, respectively; when a half has an even count, the average of its two middle values is taken, and when it has an odd count, the single middle value is used.

Once the five-number summary is obtained, calculate the interquartile range (IQR) as Q3 minus Q1. The inner fences are then Q1 minus 1.5 times the IQR (lower fence) and Q3 plus 1.5 times the IQR (upper fence); observations beyond them are treated as potential outliers. The adjacent values, which form the whisker ends, are the largest observation not exceeding the upper inner fence and the smallest observation not falling below the lower inner fence.

To draw the box plot in vertical orientation, draw a rectangular box extending from Q1 to Q3, with a horizontal line inside the box at the median. Extend vertical whiskers from the box edges to the adjacent values on each end. Finally, mark any observations beyond the inner fences as individual points (outliers) outside the whiskers.
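The construction steps can be collected into one function. This is a minimal sketch that computes the numbers needed for drawing, not the drawing itself; it uses Tukey's hinges (the median is included in both halves when n is odd), and the function name and dictionary keys are illustrative.

```python
from statistics import median

def box_plot_stats(values):
    """Five-number summary, whisker ends and outliers, using Tukey's hinges."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2              # odd n: the median belongs to both halves
    q1 = median(data[:half])         # lower hinge
    q2 = median(data)
    q3 = median(data[n - half:])     # upper hinge
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "min": data[0], "q1": q1, "median": q2, "q3": q3, "max": data[-1],
        "whiskers": (inside[0], inside[-1]),     # the adjacent values
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

stats = box_plot_stats([1, 2, 3, 4, 5, 100])
print(stats)
```

In this small example the value 100 lies far above the upper fence, so the upper whisker stops at 5 and 100 is reported separately as an outlier.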

Mathematical Foundations

Quantile Calculations

In box plots, quartiles are key summary statistics that partition an ordered dataset into four equal parts: the first quartile (Q1) at the 25th percentile, the median (Q2) at the 50th percentile, and the third quartile (Q3) at the 75th percentile. These values define the interquartile range ($\mathrm{IQR} = Q_3 - Q_1$), which captures the middle 50% of the data and forms the box's boundaries.

Several methods exist for computing these quartiles from a sample of size $n$, differing in how they handle the positioning and interpolation of order statistics. The inclusive method includes the median in both the lower and upper halves of the data when splitting for the Q1 and Q3 calculations, which matters when $n$ is odd. In contrast, the exclusive method excludes the median from these halves to avoid overlap, so that the lower half contains the smallest $(n-1)/2$ observations and the upper half the largest $(n-1)/2$. Tukey's hinges, original to box plot construction, treat Q1 and Q3 as the medians of the respective halves, including the overall median in both halves when $n$ is odd.

The general formula for the empirical $p$-quantile (where $0 < p < 1$) in many implementations starts from the position $g(p) = (n + 1)p$ and interpolates linearly between the adjacent order statistics $x_{(j)}$ and $x_{(j+1)}$ if the position is not an integer:

$$\hat{Q}(p) = (1 - \gamma)\, x_{(j)} + \gamma\, x_{(j+1)},$$

where $j = \lfloor g(p) \rfloor$ and $\gamma = g(p) - j$. For quartiles, set $p = 0.25$ for Q1, $p = 0.5$ for the median, and $p = 0.75$ for Q3. This rule corresponds to Type 6 in the Hyndman and Fan classification of sample quantiles, and the resulting estimate varies continuously with $p$.

For small datasets ($n < 10$), Tukey's hinges are computed directly from the halves of the data: the lower hinge is the median of the lower half (the first $\lfloor (n+1)/2 \rfloor$ observations), and the upper hinge is the median of the upper half. For instance, with $n = 5$ and ordered data $\{1, 2, 3, 4, 5\}$, the lower half $\{1, 2, 3\}$ yields a hinge at 2 and the upper half $\{3, 4, 5\}$ a hinge at 4, while the median is 3; with $n = 4$, the lower half $\{x_{(1)}, x_{(2)}\}$ yields a hinge at $(x_{(1)} + x_{(2)})/2$. Larger datasets ($n \geq 10$) typically use the general interpolation formula without modification, as edge effects diminish.
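The difference between the inclusive and exclusive splitting rules is easiest to see on a small odd-sized sample. The sketch below implements both rules with a single hypothetical helper; the flag `include_median` selects the inclusive rule (Tukey's hinges) or the exclusive one.

```python
from statistics import median

def quartiles_by_halves(values, include_median):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    For odd n, include_median decides whether the overall median is kept in
    both halves (inclusive, Tukey's hinges) or dropped from both (exclusive).
    For even n the two rules coincide. Requires at least two values.
    """
    data = sorted(values)
    n = len(data)
    if n % 2 == 1 and include_median:
        lower, upper = data[: n // 2 + 1], data[n // 2 :]
    else:
        lower, upper = data[: n // 2], data[(n + 1) // 2 :]
    return median(lower), median(upper)

sample = [1, 2, 3, 4, 5]
print(quartiles_by_halves(sample, include_median=True))
print(quartiles_by_halves(sample, include_median=False))
```

For this sample the inclusive rule gives the hinges 2 and 4, whereas the exclusive rule gives 1.5 and 4.5, so the box drawn from the same five values differs in length depending on the convention.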

Outlier Detection

In box plots, outlier detection primarily relies on the interquartile range (IQR), defined as the difference between the third quartile (Q3) and the first quartile (Q1), to identify data points that deviate significantly from the central bulk of the distribution. The standard method, introduced by Tukey, flags as outliers any values falling below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$. These thresholds, known as the inner fences, are designed to capture mild outliers while remaining reasonably robust to moderate skewness in non-normal distributions.

Tukey further distinguished between mild and extreme outliers using outer fences at $Q_1 - 3 \times \mathrm{IQR}$ and $Q_3 + 3 \times \mathrm{IQR}$, with points between the inner and outer fences classified as "outside" values and those beyond the outer fences as "far out" values. This hierarchical approach allows for nuanced identification: mild outliers may warrant investigation for potential errors, while extreme ones highlight rare but possibly valid extremes. The multipliers of 1.5 and 3 were refined by Tukey through empirical experience to balance sensitivity and specificity in exploratory data analysis.

Alternative methods complement the Tukey approach for outlier detection, particularly in datasets sensitive to the choice of IQR multiplier. The modified Z-score, proposed by Iglewicz and Hoaglin, uses the median absolute deviation (MAD) as a robust scale measure: for a data point $x_i$, it is calculated as $0.6745 \times (x_i - \mathrm{median}) / \mathrm{MAD}$, with values exceeding 3.5 in absolute magnitude flagged as potential outliers. This method is especially effective for heavy-tailed distributions, as it avoids reliance on means and standard deviations, which can be distorted by outliers themselves. Other IQR-based robust measures adjust the Tukey fences for skewness or sample size, enhancing adaptability without assuming normality.

Whether a flagged point is a measurement error requiring correction or a genuine extreme that reveals important variability in the data-generating process remains a matter of interpretation. The non-parametric nature of the box plot method, relying on order statistics rather than distributional assumptions, supports exploratory investigation without prematurely dismissing these points as anomalies, in line with Tukey's philosophy of data analysis as detective work.
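The modified Z-score rule can be sketched in a few lines with Python's standard `statistics.median`. The function name is illustrative, and raising an error when the MAD is zero is a choice made for this sketch; the rule itself does not prescribe how to handle that case.

```python
from statistics import median

def modified_z_scores(values):
    """Modified Z-scores: 0.6745 * (x - median) / MAD for each value."""
    data = list(values)
    med = median(data)
    mad = median([abs(x - med) for x in data])   # median absolute deviation
    if mad == 0:
        # Assumption of this sketch: refuse to divide by a zero MAD.
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in data]

sample = [10, 11, 12, 13, 14, 50]
scores = modified_z_scores(sample)
flagged = [x for x, z in zip(sample, scores) if abs(z) > 3.5]
print(flagged)
```

Because both the center and the scale are medians, the single large value in this sample barely affects the scores of the other five observations.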

Variations

Notched Box Plots

Notched box plots extend the standard box plot by incorporating inward notches on each side of the box to provide a visual representation of the variability around the median. These notches were introduced by McGill, Tukey, and Larsen in their 1978 paper on variations of box plots. The notch boundaries are calculated as $\text{median} \pm 1.58 \times \frac{\mathrm{IQR}}{\sqrt{n}}$, where $\mathrm{IQR}$ is the interquartile range and $n$ is the sample size; this formula approximates a 95% confidence interval for the median under assumptions of normality and roughly equal sample sizes across groups.

The primary purpose of the notches is to facilitate informal comparisons of medians between multiple groups displayed side by side. If the notches of two box plots do not overlap, this is taken as evidence of a difference between the medians at roughly the $\alpha = 0.05$ level, serving as a quick visual check that requires no formal statistical computation. This approach relies on the asymptotic normality of the sample median.

In practice, each notch extends $1.58 \times \frac{\mathrm{IQR}}{\sqrt{n}}$ on either side of the median, but the notches are clipped to the inner fences (or hinges) if the calculated extent would reach beyond them, preventing distortion of the plot. Compared to side-by-side standard box plots, notched versions embed inferential information directly into the visualization, reducing the need for separate confidence interval plots or post-hoc tests while keeping the core summary of the data distribution.
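The notch calculation and the overlap check can be sketched as follows, assuming the 1.58 × IQR / √n half-width discussed above. The function names and the example medians, IQRs, and sample sizes are illustrative.

```python
import math

def notch_bounds(med, iqr, n):
    """Return (lower, upper) notch limits: median -/+ 1.58 * IQR / sqrt(n)."""
    half_width = 1.58 * iqr / math.sqrt(n)
    return med - half_width, med + half_width

def notches_overlap(a, b):
    """True if two (lower, upper) notch intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

group_a = notch_bounds(med=70, iqr=9, n=24)
group_b = notch_bounds(med=78, iqr=9, n=24)
group_c = notch_bounds(med=72, iqr=9, n=24)

print(group_a, group_b, group_c)
print(notches_overlap(group_a, group_b), notches_overlap(group_a, group_c))
```

Non-overlapping notches are only an informal indication; a formal test is still needed when the conclusion matters.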

Adjusted and Modified Box Plots

Adjusted and modified box plots address limitations of the standard box plot when dealing with skewed distributions or datasets requiring more detailed tail information, either by making whisker construction asymmetric or by extending the quantile summary beyond the quartiles. In standard box plots, the fences lie the same multiple of the interquartile range (IQR) beyond each quartile, which can misrepresent the tails of positively or negatively skewed data and flag too many points as outliers on the longer tail.

The adjusted box plot, proposed by Hubert and Vandervieren, modifies whisker lengths asymmetrically using the medcouple, a robust, sign-preserving measure of skewness that ranges from -1 to 1. For positively skewed data (medcouple > 0), the upper fence is placed further out by applying a factor greater than 1.5 to the IQR, while the lower fence uses a factor less than 1.5, so that the elongated upper tail is captured without flagging valid points as outliers. Conversely, for negative skew, the lower fence is extended. This approach reduces false outliers in skewed distributions such as exponential or lognormal data, where symmetric fences underrepresent the longer tail.

Variations in quantile definitions, as detailed by Hyndman and Fan, also influence box plots by affecting the positions of the box edges and whiskers, particularly in small samples where different methods yield noticeably different summaries. Hyndman and Fan compare nine sample quantile definitions; type 7 is the default in R, and type 8 is the definition they recommend.

Letter-value plots extend the box plot by displaying multiple nested boxes representing successive quantiles, or "letter values," starting from the median and halving the remaining tail area at each step (e.g., fourths, eighths, sixteenths) until fewer than 15 observations remain per tail. Letter values were originally conceptualized by Tukey for exploratory analysis. For large datasets, these plots reveal detailed tail behavior and symmetry not visible from the quartiles alone, making them suitable for skewed distributions where inner quantiles describe the center and outer ones describe the extreme tails.

For non-independent and identically distributed (non-IID) data, such as clustered or heterogeneous samples, traditional box plots can obscure variability. Alternatives like raincloud plots combine a density estimate (half-violin), a summary box, and jittered raw data points to visualize full distributions and individual observations without assuming IID conditions. Proposed by Allen et al., raincloud plots are particularly useful for skewed data in experimental contexts, where variance-stabilizing transformations may be applied before plotting. They are employed when standard box plots underrepresent multimodal or heavy-tailed structure.

Adjusted and modified variants are recommended for positively skewed distributions, such as response times, where the standard IQR-based rule flags much of the longer tail as outliers, potentially masking important distributional features.
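The asymmetric fences of the adjusted box plot can be sketched as follows. The text above does not give the adjustment factors; the exponential factors used here (exponents of -4 and 3 times the medcouple, swapped for negative skew) follow the Hubert and Vandervieren proposal as it is usually stated and should be checked against the paper. The medcouple is computed with a naive quadratic-time loop, quartiles use the median-of-halves convention, and all names and sample values are illustrative assumptions.

```python
import math
import statistics


def medcouple(values):
    """Naive O(n^2) medcouple: the median of the kernel
    h(xi, xj) = ((xi - m) - (m - xj)) / (xi - xj) over all xi >= m >= xj."""
    x = sorted(values)
    m = statistics.median(x)
    upper = [v for v in x if v >= m]
    lower = [v for v in x if v <= m]
    k = x.count(m)  # observations tied with the median
    kernel = []
    for i, xi in enumerate(upper):
        for j, xj in enumerate(lower):
            if xi == xj:
                # Both values equal the median: use the sign kernel for ties.
                s = i + (j - (len(lower) - k)) + 1
                kernel.append((s > k) - (s < k))
            else:
                kernel.append(((xi - m) - (m - xj)) / (xi - xj))
    return statistics.median(kernel)


def adjusted_fences(values):
    """Return (lower_fence, upper_fence) with skewness-adjusted factors."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    iqr = q3 - q1
    mc = medcouple(data)
    if mc >= 0:
        low_factor, high_factor = math.exp(-4 * mc), math.exp(3 * mc)
    else:
        low_factor, high_factor = math.exp(-3 * mc), math.exp(4 * mc)
    return q1 - 1.5 * low_factor * iqr, q3 + 1.5 * high_factor * iqr


# Hypothetical right-skewed sample.
skewed = [1, 2, 2, 3, 3, 4, 5, 7, 10, 15, 25]
print(medcouple(skewed))        # positive for right-skewed data
print(adjusted_fences(skewed))  # upper fence lies well beyond Q3 + 1.5 * IQR
```

For this sample the standard upper fence is 10 + 1.5 × 8 = 22, which would flag 25 as an outlier, whereas the adjusted upper fence lies above 25. For production use, a tested implementation of the medcouple is preferable to this quadratic sketch.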

Interpretation

Reading a Single Box Plot

A box plot provides a compact visual summary of a dataset's distribution, allowing readers to assess key statistical features without examining the raw data. Developed by John Tukey as part of exploratory data analysis, it emphasizes the median, quartiles, and potential outliers to reveal central tendency, variability, and shape.

To assess central tendency, locate the line within the box (horizontal in a vertically oriented plot), which represents the median: the value that divides the data into two equal halves, with 50% of observations above and 50% below. If the median sits at the center of the box, the middle half of the distribution is roughly symmetric around this point; otherwise, its offset indicates asymmetry in the data.

The spread of the data is measured by the box's length, which spans the interquartile range (IQR) from the first quartile (Q1, the 25th percentile) to the third quartile (Q3, the 75th percentile), capturing the middle 50% of the observations and showing the typical variability excluding extremes. Whiskers extend from the box edges to the smallest and largest values within 1.5 times the IQR of the quartiles, which are the data's minimum and maximum when no values lie beyond that distance, thus illustrating the overall range while limiting the influence of outliers.

Skewness, or the lack of symmetry in the distribution, can be detected by examining the median's position relative to the box center and the relative lengths of the whiskers. A median closer to Q1 with a longer upper whisker suggests positive (right) skew, where the tail extends toward higher values; conversely, a median nearer to Q3 with a longer lower whisker indicates negative (left) skew.

Outliers are identified as individual points plotted beyond the whiskers, specifically any data values falling more than 1.5 IQRs below Q1 or above Q3, following Tukey's method of flagging potential anomalies for further investigation. The number and positioning of these points reveal the extent and direction of unusual deviations in the data.
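The visual rule about the median's position inside the box can be made numerical with Bowley's quartile skewness, (Q3 + Q1 - 2·median) / (Q3 - Q1), which is 0 when the median is centered and approaches +1 or -1 as it moves toward Q1 or Q3. The sketch below is illustrative only; the function names, the tolerance of 0.1, and the input values are assumptions, not part of the method described above.

```python
def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness, a value between -1 and 1."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("quartile skewness is undefined when Q1 equals Q3")
    return (q3 + q1 - 2 * median) / iqr


def describe_skew(q1, median, q3, tolerance=0.1):
    """Label the box shape; the tolerance is an arbitrary illustrative cut-off."""
    s = quartile_skewness(q1, median, q3)
    if s > tolerance:
        return "right-skewed"
    if s < -tolerance:
        return "left-skewed"
    return "roughly symmetric"


# Hypothetical quartiles read off three box plots.
print(describe_skew(10, 12, 20))  # right-skewed (median near Q1)
print(describe_skew(10, 15, 20))  # roughly symmetric
print(describe_skew(10, 18, 20))  # left-skewed (median near Q3)
```

This measure looks only at the box, so whisker lengths and outliers should still be inspected separately.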

Comparing Multiple Distributions

Box plots facilitate the comparison of multiple distributions by displaying them side by side along a shared axis, enabling simultaneous assessment of central tendencies, variabilities, and shapes across groups. In the usual configuration the groups of a categorical variable are arranged along the horizontal axis and the response variable is plotted vertically, allowing viewers to compare medians (the lines within the boxes), interquartile ranges (the box lengths), and overall spreads (the whiskers extending to adjacent values). For instance, in analyzing energy output from different machines, side-by-side plots can reveal that one machine outperforms the others in both median output and consistency of results.

When interpreting overlaps between these plots, a key visual cue is the positioning of the boxes and whiskers. If the interquartile ranges (IQRs) of two adjacent box plots do not overlap, the groups are likely to differ in their central locations, providing a rough indication of distinct distributions. Subtler differences may be suggested by partial overlap in the whiskers, which extend to the most extreme non-outlier values (typically up to 1.5 times the IQR from the quartiles), hinting at differences in the tails without implying equivalence. These overlap assessments are exploratory tools that guide further statistical testing, such as t-tests or ANOVA, rather than definitive proofs of significance.

For datasets involving more than two groups, ordering the box plots by ascending or descending median helps reveal patterns or trends across categories. To incorporate results from multiple comparison procedures such as Tukey's honestly significant difference (HSD) test, compact letter displays can be placed on or above the plots; groups sharing a letter (e.g., "a" and "ab") are not significantly different at the chosen alpha level, while groups with no letter in common are statistically distinguishable. This lettering system, derived from all pairwise comparisons, enhances interpretability without cluttering the display.

Comparisons can be complicated by unequal sample sizes across groups: larger samples yield more reliable quartile and median estimates, so apparent differences involving small groups may reflect sampling noise. In such cases, making box widths proportional to the square root of the sample sizes helps convey the relative precision of each summary, though it does not remove the need for formal statistical adjustments in inference. Box plots with fewer than 20 observations per group may also produce unstable summaries, underscoring the importance of complementary analyses.
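Two of the layout choices described above, ordering groups by median and scaling box widths by the square root of the sample size, can be prepared before plotting. The sketch below is a minimal illustration with hypothetical group names, data, and a hypothetical maximum width of 0.8 axis units; it computes the layout only and does not draw anything.

```python
import math
import statistics

# Hypothetical groups of unequal size, used only for illustration.
groups = {
    "machine A": [48, 50, 51, 52, 53, 55],
    "machine B": [60, 61, 62, 63, 63, 64, 65, 66, 67, 68, 70, 72],
    "machine C": [40, 42, 45, 47],
}


def order_by_median(samples):
    """Return group names sorted by ascending median."""
    return sorted(samples, key=lambda name: statistics.median(samples[name]))


def relative_widths(samples, max_width=0.8):
    """Box widths proportional to sqrt(n); the largest group gets max_width."""
    largest = max(len(values) for values in samples.values())
    return {
        name: max_width * math.sqrt(len(values) / largest)
        for name, values in samples.items()
    }


order = order_by_median(groups)
widths = relative_widths(groups)
print(order)   # ['machine C', 'machine A', 'machine B']
for name in order:
    print(name, len(groups[name]), round(widths[name], 3))
```

The resulting order and widths could then be passed to a plotting call, for example through the `positions` and `widths` arguments of matplotlib's `boxplot`, or replaced entirely by `varwidth = TRUE` in R's `boxplot()`.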

Examples

Dataset Without Outliers

To illustrate the construction and interpretation of a box plot for a clean, roughly symmetric dataset without outliers, consider the exam scores of 20 students: 65, 69, 72, 74, 75, 75, 78, 80, 81, 82, 83, 84, 86, 88, 88, 88, 90, 92, 94, 95. The sorted data yield the following five-number summary: minimum = 65, first quartile (Q1) = 75 (median of the lower half, averaging the 5th and 6th values: (75 + 75)/2), median = 82.5 (averaging the 10th and 11th values: (82 + 83)/2), third quartile (Q3) = 88 (median of the upper half, averaging the 15th and 16th values: (88 + 88)/2), and maximum = 95. The interquartile range (IQR) is Q3 - Q1 = 88 - 75 = 13.

No outliers are present: the lower fence is Q1 - 1.5 × IQR = 75 - 19.5 = 55.5 (below the minimum of 65) and the upper fence is Q3 + 1.5 × IQR = 88 + 19.5 = 107.5 (above the maximum of 95), so the whiskers extend fully to the minimum and maximum values. To construct the plot, draw a box from Q1 (75) to Q3 (88) with a line at the median (82.5), and attach whiskers from the box ends to 65 and 95, respectively.

This box plot shows a roughly symmetric distribution: the median lies near the center of the box, and the whiskers are of comparable length (the lower whisker spans 10 units from 65 to 75; the upper spans 7 units from 88 to 95), indicating a tight, balanced spread without extremes. The compact IQR of 13 suggests low variability in the middle 50% of scores, centered around 82.5, typical of consistent performance across the class. Visually, the balanced box and similar whisker lengths emphasize the absence of strong skewness or anomalies in this dataset.
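The arithmetic in this example can be reproduced with a short script. It is a sketch that hard-codes the median-of-halves quartile convention used in the example; the function name is hypothetical, and other quartile definitions would give slightly different values.

```python
import statistics

scores = [65, 69, 72, 74, 75, 75, 78, 80, 81, 82,
          83, 84, 86, 88, 88, 88, 90, 92, 94, 95]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) with quartiles as medians of halves."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    return data[0], q1, statistics.median(data), q3, data[-1]


minimum, q1, median, q3, maximum = five_number_summary(scores)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(minimum, q1, median, q3, maximum)  # 65 75.0 82.5 88.0 95
print(iqr)                               # 13.0
print(lower_fence, upper_fence)          # 55.5 107.5

# Every score lies inside the fences, so the whiskers reach the extremes.
print(lower_fence <= minimum and maximum <= upper_fence)  # True
```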

Dataset With Outliers

To illustrate the effect of outliers in a box plot, consider a hypothetical dataset of annual household incomes (in thousands of dollars) for 20 households, where most values cluster between 30 and 60 but two extreme values reach 200 or more. The sorted incomes are 25, 30, 32, 35, 35, 38, 40, 42, 45, 48, 50, 52, 55, 55, 58, 60, 65, 70, 200, and 250. The first quartile (Q1) is 36.5 (averaging the 5th and 6th values: (35 + 38)/2), the median is 49 (averaging the 10th and 11th values: (48 + 50)/2), the third quartile (Q3) is 59 (averaging the 15th and 16th values: (58 + 60)/2), and the interquartile range (IQR) is 22.5.

Outliers are identified using the standard criterion of values falling more than 1.5 times the IQR beyond the quartiles, which gives a lower fence at 36.5 - 33.75 = 2.75 and an upper fence at 59 + 33.75 = 92.75. The two incomes above 92.75 (200 and 250) are therefore flagged as outliers, while the whiskers extend to the largest non-outlier value of 70 on the upper end and to the minimum value of 25 on the lower end, since 25 lies above the lower fence. This method, introduced by Tukey, highlights potential anomalies without removing them from the visualization.

In the resulting box plot, the box spans from 36.5 to 59 with the median at 49, the lower whisker reaches down to 25, and the upper whisker extends to 70, with the two outlier points plotted individually beyond it. The plot reveals a right-skewed distribution in which the outliers inflate the overall range to 225 while the IQR remains at 22.5, unaffected by the extremes and providing a stable measure of central spread.
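The fence and whisker rule applied in this example can be sketched as a small function. The function name and the keys of the returned dictionary are hypothetical, and quartiles again follow the median-of-halves convention used in the example.

```python
import statistics

incomes = [25, 30, 32, 35, 35, 38, 40, 42, 45, 48,
           50, 52, 55, 55, 58, 60, 65, 70, 200, 250]


def tukey_box_stats(values, k=1.5):
    """Quartiles, fences, whisker ends, and outliers for one sample."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": statistics.median(data),
        "q3": q3,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


stats = tukey_box_stats(incomes)
print(stats["q1"], stats["median"], stats["q3"])  # 36.5 49.0 59.0
print(stats["fences"])                            # (2.75, 92.75)
print(stats["whiskers"])                          # (25, 70)
print(stats["outliers"])                          # [200, 250]
```

Raising `k` from 1.5 to 3 would widen the fences to the "far out" threshold sometimes used for extreme outliers; with these data both 200 and 250 would still be flagged.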

Applications and Limitations

Common Uses

Box plots are widely employed in exploratory data analysis (EDA) within statistics to summarize the distribution of numerical data, highlighting measures of central tendency, spread, and potential outliers, which aids initial understanding before more formal analyses. They facilitate the visualization of location, skewness, and variability, allowing statisticians to assess data characteristics efficiently without assuming a specific distributional form. In preparation for hypothesis testing, box plots serve as a rough preliminary check of assumptions such as normality by revealing deviations in the data's shape, such as skewness or heavy tails, which might influence the choice between parametric and non-parametric tests.

In the medical field, box plots are commonly used to compare treatment effects across patient groups, for example by visualizing response variables such as reductions in a clinical measurement or survival times in control and intervention cohorts, enabling quick comparison of median outcomes and interquartile ranges. They can illustrate the distribution of clinical outcomes in randomized trials, helping researchers detect variability in treatment responses and outliers representing atypical patient reactions.

In environmental science, box plots summarize pollutant concentration levels across monitoring sites or time periods, such as daily PM2.5 or NO2 measurements, to compare spatial heterogeneity and identify high-variability locations for regulatory action. This application supports the evaluation of air quality trends, with the box's quartiles indicating typical exposure ranges and the whiskers and outlier points extending to extreme events such as pollution spikes.

Within finance, box plots depict the distributions of asset return data, such as daily yields or portfolio volatilities, to compare performance across securities or market conditions, with medians indicating typical returns and interquartile ranges indicating variability. They are particularly useful for highlighting asymmetry in return distributions, which can inform strategy by showing potential downside risk through the lower whiskers and low outliers.

In business contexts, box plots support quality control processes by monitoring metrics such as product dimensions or defect rates across production batches to detect shifts in process stability and variability. They are also applied in A/B testing for digital products, where side-by-side box plots compare user engagement metrics, such as conversion rates between variants, to evaluate which design yields higher and more consistent performance.

In modern genomics research, box plots provide concise summaries of gene expression levels across samples or conditions, such as comparing transcript abundances in treated versus untreated cell lines to look for distributional overlaps or shifts. For expression data, they visualize the spread of normalized expression values, aiding quality checks and preliminary comparisons before differential analysis. This usage has become standard in high-throughput studies, where multiple side-by-side box plots facilitate the interpretation of expression variability across experimental groups.

Limitations and Alternatives

Box plots have several limitations that can affect their utility in data analysis. One key drawback is their inability to reveal multimodality: distinct distributions, such as unimodal and bimodal ones, can produce identical box plots if they share the same five-number summary, thereby masking important structural features of the data. Additionally, standard box plots do not indicate sample size, which is crucial for assessing the reliability of the summary statistics; without this information, interpretations may overlook variability due to small or uneven group sizes, particularly in comparative analyses.

The choice of quartile calculation method also introduces sensitivity, as different approaches, such as Tukey's hinges versus standard percentiles, can yield different box and whisker lengths, especially in discrete or small datasets, leading to inconsistent representations across software implementations. For small sample sizes (typically fewer than 10–20 observations), box plots become unreliable, as quartile estimates and outlier detection may not accurately reflect the underlying distribution, potentially misleading users about spread and shape. Beyond these issues, box plots provide only a coarse summary and do not convey the density or the full shape of the distribution, limiting their insight into features such as gaps or tail behavior.

When these limitations are problematic, alternatives better suited to specific needs include histograms or violin plots, which visualize the estimated density and reveal multimodality or shape details that box plots obscure. For small datasets, dot plots (or strip plots) preserve individual data points, avoiding the summarization pitfalls of box plots while allowing direct observation of values and outliers. Empirical cumulative distribution functions (ECDFs) offer a precise, non-parametric view of quantiles and cumulative probabilities, providing a complementary or superior option for exact distributional comparisons without relying on quartile approximations.
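The first limitation can be demonstrated with two small, hand-constructed samples that share a five-number summary but differ in shape. The data and function names are hypothetical, chosen only so that the summaries coincide under the median-of-halves quartile convention; neither sample has values beyond the 1.5 × IQR fences, so their box plots would be drawn identically.

```python
import statistics

# One sample is spread fairly evenly; the other has two clusters with a gap.
evenly_spread = [0, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9, 10]
two_clusters = [0, 1, 2, 2, 2, 2, 3, 3, 7, 7, 8, 8, 8, 8, 9, 10]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) with quartiles as medians of halves."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    return data[0], q1, statistics.median(data), q3, data[-1]


def count_between(values, low, high):
    """Number of observations in the closed interval [low, high]."""
    return sum(low <= v <= high for v in values)


print(five_number_summary(evenly_spread))  # (0, 2.0, 5.0, 8.0, 10)
print(five_number_summary(two_clusters))   # (0, 2.0, 5.0, 8.0, 10)

# The middle of the range tells the two samples apart.
print(count_between(evenly_spread, 4, 6))  # 4
print(count_between(two_clusters, 4, 6))   # 0
```

A histogram, strip plot, or ECDF of the same data would show the empty region between 3 and 7 in the second sample immediately.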

Visualization Tools

Software Implementation

Box plots can be generated using a variety of software tools and programming languages, each offering built-in functions or interfaces for creating these visualizations from raw or summarized data. Popular options include statistical programming environments like R and Python, as well as spreadsheet and statistical software such as Microsoft Excel and IBM SPSS Statistics. These implementations typically compute the necessary quartiles, medians, and outlier thresholds automatically from input data.

In the R programming language, the base graphics package provides the boxplot() function, a generic function that accepts vectors, matrices, or formulas to produce simple box plots. For example, boxplot(x) plots a single vector, while boxplot(formula, data) groups data by factors for comparative displays. For enhanced customization, the ggplot2 package uses geom_boxplot() within a ggplot() call, mapping variables via aesthetics like aes(x = group, y = value) to create layered, publication-ready plots with options for themes, colors, and facets.

Python libraries offer similar capabilities through the matplotlib and seaborn packages. The matplotlib.pyplot.boxplot() function draws box plots from arrays or lists, supporting parameters for whisker lengths, notch displays, and outlier markers, as in plt.boxplot(data). Seaborn's seaborn.boxplot() works directly with pandas DataFrames, enabling grouped visualizations via sns.boxplot(data=df, x='group', y='value') and automatic styling for better readability across multiple distributions.

In Microsoft Excel, users can create box and whisker charts from the Insert tab under the statistical chart group, selecting data ranges from which quartiles and outliers are calculated automatically. The charts are vertical by default; a horizontal orientation requires workarounds. Options are available to add data labels to elements such as outliers or the mean. Similarly, IBM SPSS Statistics employs the Chart Builder dialog, where selecting the Boxplot icon allows specification of variables for simple or clustered plots, including controls for axis labels and exclusion of cases, making it suitable for use in the social sciences.

When dealing with large datasets, passing all observations directly to these functions may lead to performance issues due to memory constraints during quartile computations. To mitigate this, downsampling techniques, such as random subsampling to a representative size (e.g., 10,000 points), or precomputing summary statistics (quartiles, medians, and extremes) and supplying them as input can be used. For instance, matplotlib's Axes.bxp() accepts a list of dictionaries of precalculated statistics, while R's bxp() draws from the summaries returned by boxplot(..., plot = FALSE).
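The precomputation approach can be sketched as follows. The summary is computed once with the standard library and stored under the keys that matplotlib's `Axes.bxp()` reads (`med`, `q1`, `q3`, `whislo`, `whishi`, `fliers`, `label`). The helper names, the simulated data, and the choice of `statistics.quantiles(..., method="inclusive")` are assumptions made for illustration; the inclusive method is intended to match the linear-interpolation quartiles matplotlib computes by default, but that correspondence should be verified before relying on it.

```python
import random
import statistics


def summarize_for_bxp(values, label, whis=1.5):
    """Reduce a large sample to the statistics a box plot needs."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    fliers = [v for v in data if v < lo_fence or v > hi_fence]
    return {
        "label": label,
        "med": med,
        "q1": q1,
        "q3": q3,
        "whislo": inside[0],   # lowest value inside the fences
        "whishi": inside[-1],  # highest value inside the fences
        "fliers": fliers,
    }


def draw_from_summaries(summaries):
    """Draw boxes from precomputed summaries (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily: plotting is optional

    fig, ax = plt.subplots()
    ax.bxp(summaries, showfliers=True)
    return fig


# Simulated data standing in for a large dataset.
rng = random.Random(0)
large_sample = [rng.gauss(50, 10) for _ in range(100_000)]

summary = summarize_for_bxp(large_sample, label="simulated")
print({k: v for k, v in summary.items() if k != "fliers"})
print(len(summary["fliers"]), "points beyond the whiskers")
```

Calling `draw_from_summaries([summary])` would then render the box without passing the raw observations to the plotting layer. If the outlier list is itself large, it can be subsampled or dropped before drawing.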

Best Practices for Display

When displaying box plots, orientation should be chosen based on the number of categories and the length of their labels to enhance readability. Vertical orientations are suitable for time-based groupings or when category labels are short, allowing easy vertical scanning of values. In contrast, horizontal orientations are preferable for datasets with many categories or lengthy labels, as they prevent label overlap and improve legibility without rotating text.

Scaling elements of box plots requires consistency to facilitate accurate comparisons across distributions. Maintain a uniform y-axis range when plotting multiple box plots side by side to avoid distorting perceived differences in spread or location. To indicate varying sample sizes, adjust box widths in proportion to the square root of the number of data points, which visually reflects the precision of each summary without overwhelming the plot. Logarithmic scales should be reserved for highly skewed data and clearly labeled to prevent misinterpretation.

Annotations play a crucial role in clarifying box plot components for viewers. Label key statistics such as the median, quartiles, and outliers directly on or near the plot, and include sample sizes (e.g., "n=50") adjacent to each box to provide context on reliability, especially for small datasets. If the audience may lack familiarity with box plot conventions, incorporate a brief explanatory note or legend, such as stating that whiskers extend to the most extreme values within 1.5 times the interquartile range of the quartiles.

To ensure accessibility, employ color palettes that are friendly to color-blind viewers, such as high-contrast grays or blues with sufficient differentiation, and opt for lighter fill colors in boxes to reduce visual clutter. Arrange categories in a logical order, such as by increasing value, to reveal trends without relying on color alone. In dense displays with many overlapping elements, such as numerous outliers, use jittering or transparency to mitigate overplotting and maintain clarity for all users.

For modern digital presentations, interactive box plots enhance engagement and detail exploration. Interactive plotting tools allow users to hover over elements for precise values of medians, quartiles, or individual points, and to toggle the visibility of outliers or underlying points to avoid static clutter. Such features are particularly useful in web-based reports, where users can adjust views without altering the underlying plot.

To prevent misleading interpretations, explicitly define whisker extents in the plot caption or notes, as variations (e.g., whiskers drawn to the full range versus to 1.5 IQR) imply different distributional properties. Avoid assuming that whiskers fully represent the tails, as they may obscure multimodality or gaps in the data; supplement with violin plots if the full distribution shape is critical. Notches around the median can indicate its uncertainty but should only be used when sample sizes support reliable interval estimates, to prevent false inferences about group differences.
