Descriptive statistics
from Wikipedia

A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information,[1] while descriptive statistics (in the mass noun sense) is the process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.[2] This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory and are frequently nonparametric statistics.[3] Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented.[4] For example, in papers reporting on human subjects, typically a table is included giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, the proportion of subjects with related co-morbidities, etc.

Some measures that are commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[5]
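
As a minimal sketch of these common summary measures, the following Python snippet computes them for a small invented dataset, using the standard-library statistics module plus hand-rolled moment formulas for skewness and kurtosis (which the standard library does not provide):

```python
# Common descriptive summary measures for a small, invented dataset.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # central tendency: arithmetic mean
median = statistics.median(data)      # central tendency: middle value
mode = statistics.mode(data)          # central tendency: most frequent value
stdev = statistics.stdev(data)        # variability: sample standard deviation
variance = statistics.variance(data)  # variability: sample variance
low, high = min(data), max(data)      # variability: minimum and maximum

# Conventional moment-based (population) formulas for shape measures.
n = len(data)
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)
skewness = sum((x - mu) ** 3 for x in data) / (n * sigma ** 3)
kurtosis = sum((x - mu) ** 4 for x in data) / (n * sigma ** 4)  # 3.0 for a normal curve

print(mean, median, mode, stdev, (low, high), skewness, kurtosis)
```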

Use in statistical analysis

Descriptive statistics provide simple summaries about the sample and about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. These summaries may either form the basis of the initial description of the data as part of a more extensive statistical analysis, or they may be sufficient in and of themselves for a particular investigation.

For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. This number is the number of shots made divided by the number of shots taken. For example, a player who shoots 33% is making approximately one shot in every three. The percentage summarizes or describes multiple discrete events. Consider also the grade point average. This single number describes the general performance of a student across the range of their course experiences.[6]

The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation of populations and of economic data was the first way the topic of statistics appeared. More recently, a collection of summarisation techniques has been formulated under the heading of exploratory data analysis: an example of such a technique is the box plot.

In the business world, descriptive statistics provides a useful summary of many types of data. For example, investors and brokers may use a historical account of return behaviour by performing empirical and analytical analyses on their investments in order to make better investing decisions in the future.

Univariate analysis

Univariate analysis involves describing the distribution of a single variable, including its central tendency (including the mean, median, and mode) and dispersion (including the range and quartiles of the data-set, and measures of spread such as the variance and standard deviation). The shape of the distribution may also be described via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in graphical or tabular format, including histograms and stem-and-leaf displays.
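
As a rough sketch of the two graphical displays just mentioned, the following Python snippet builds a text histogram and a stem-and-leaf display for an invented dataset:

```python
# Two simple univariate displays: a text histogram and a stem-and-leaf plot.
from collections import Counter

data = [12, 15, 21, 24, 24, 26, 31, 33, 33, 38, 41, 45]

# Text histogram: group values into bins of width 10, one bar per bin.
bins = Counter((x // 10) * 10 for x in data)
for start in sorted(bins):
    print(f"{start:2d}-{start + 9:2d} | {'*' * bins[start]}")

# Stem-and-leaf display: tens digit is the stem, units digit the leaf.
stems = {}
for x in sorted(data):
    stems.setdefault(x // 10, []).append(x % 10)
for stem, leaves in sorted(stems.items()):
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```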

Bivariate and multivariate analysis

When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship between pairs of variables. In this case, descriptive statistics include cross-tabulations, graphical representation via scatterplots, and quantitative measures of dependence.

The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not merely a descriptive analysis of each variable separately: it describes the relationship between the two variables.[7] Quantitative measures of dependence include correlation (such as Pearson's r when both variables are continuous, or Spearman's rho if one or both are not) and covariance (which reflects the scale the variables are measured on). The slope, in regression analysis, also reflects the relationship between variables. The unstandardised slope indicates the unit change in the criterion variable for a one unit change in the predictor. The standardised slope indicates this change in standardised (z-score) units. Highly skewed data are often transformed by taking logarithms. The use of logarithms makes distributions more symmetrical and closer in appearance to the normal distribution, making them easier to interpret intuitively.[8]: 47 
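
The following is a minimal, self-contained Python sketch of these dependence measures (covariance, Pearson's r, Spearman's rho, and the unstandardised and standardised regression slopes); the data values are invented, and the simple ranking helper assumes no ties:

```python
# Bivariate descriptive measures for two invented variables.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = statistics.fmean(x), statistics.fmean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Sample covariance: scale-dependent measure of joint variability.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Pearson's r: covariance rescaled to the interval [-1, 1].
pearson_r = cov / (sx * sy)

# Spearman's rho: Pearson's r computed on the ranks (no-ties version).
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(x), ranks(y)
mrx, mry = statistics.fmean(rx), statistics.fmean(ry)
spearman_rho = (
    sum((a - mrx) * (b - mry) for a, b in zip(rx, ry))
    / ((n - 1) * statistics.stdev(rx) * statistics.stdev(ry))
)

# Regression slopes: unstandardised (units of y per unit of x) and
# standardised (z-score units; equals Pearson's r in simple regression).
slope = cov / statistics.variance(x)
std_slope = slope * sx / sy

print(cov, pearson_r, spearman_rho, slope, std_slope)
```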

from Grokipedia
Descriptive statistics is a branch of statistics focused on summarizing and organizing the characteristics of a data set, typically derived from a sample or an entire population, to provide a clear and concise overview of its main features without drawing inferences about broader groups. This approach serves as the foundational step in quantitative analysis, enabling researchers to describe data through numerical measures, tables, and graphical representations that highlight patterns, trends, and variability.

Key components of descriptive statistics include measures of central tendency, which identify typical or average values in the data, such as the mean (the arithmetic average, calculated as the sum of values divided by the number of observations), the median (the middle value when the data are ordered), and the mode (the most frequently occurring value). These measures help condense complex datasets into simple summaries; for instance, the mean is sensitive to extreme values, while the median is more robust to outliers. Additionally, measures of dispersion or variability quantify the spread of the data, including the range (the difference between the maximum and minimum values), the interquartile range (the spread of the middle 50% of the data), the variance (the average of squared deviations from the mean), and the standard deviation (the square root of the variance, indicating average deviation from the mean).

Descriptive statistics can be categorized by the number of variables analyzed: univariate (focusing on one variable, such as its frequency distribution), bivariate (examining relationships between two variables), and multivariate (involving multiple variables to reveal complex patterns). Graphical methods complement these numerical summaries, including histograms for continuous data distributions, bar charts for categorical frequencies, box plots to display medians and quartiles alongside outliers, and scatterplots for bivariate relationships. Such visualizations preserve the integrity of the original data while facilitating intuitive understanding of its shape, symmetry, and potential anomalies.

In contrast to inferential statistics, which use sample data to test hypotheses and make predictions about populations, descriptive statistics remain confined to the observed data, emphasizing summarization and presentation over generalization. This distinction underscores its role in fields like healthcare and the social sciences, where it aids decision-making by providing essential summaries, such as patient ages or sales variability, that inform further analysis or policy. Tools like spreadsheets (e.g., Excel) or dedicated statistical software commonly facilitate these computations, making descriptive statistics accessible for initial data scrutiny.

Fundamentals

Definition and purpose

Descriptive statistics is the branch of statistics that involves the analysis, summarization, and presentation of data sets to describe their features through numerical measures, tables, or graphs. This approach organizes raw data into a more comprehensible form, highlighting key characteristics such as the overall structure and composition of the data without attempting to draw conclusions beyond the data itself. The primary purpose of descriptive statistics is to condense large or complex data sets into concise summaries that reveal patterns, trends, and essential features, facilitating easier interpretation and serving as a foundational step for subsequent analyses. By focusing on the observed data, it enables researchers, analysts, and decision-makers to gain initial insights into variables such as their distributions or relationships, aiding a wide range of fields without inferring broader generalizations. The development of descriptive statistics emerged in the 18th and 19th centuries, building on early ideas of averaging and graphical representation, with key advancements by Francis Galton in examining human traits through correlation and regression, and by Karl Pearson in introducing tools like the histogram for data visualization. For instance, a simple data set of heights from 10 individuals could be summarized by grouping values into height ranges and noting the count in each, transforming detailed individual measurements into a straightforward overview of the group's composition. In contrast to inferential statistics, which extend findings to larger populations, descriptive methods remain confined to the data at hand.
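
As a quick sketch of the grouping example just described, the following Python snippet condenses ten invented height measurements (in cm) into a frequency table over 5 cm ranges:

```python
# Group invented height measurements into 5 cm ranges and count each range.
from collections import Counter

heights = [158, 162, 164, 165, 168, 170, 171, 174, 178, 181]

counts = Counter((h // 5) * 5 for h in heights)
for start in sorted(counts):
    print(f"{start}-{start + 4} cm: {counts[start]}")
```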

Distinction from inferential statistics

Descriptive statistics focus on summarizing and organizing the observed data from a specific sample, providing exact descriptions such as the sample mean or frequency distributions without extending beyond the dataset itself. In contrast, inferential statistics use that sample data to make probabilistic estimates about a broader population, incorporating tools like confidence intervals to quantify uncertainty in those estimates. For example, reporting the exact mean height of 100 surveyed individuals represents descriptive statistics, while using that average to infer the mean height of an entire community with a confidence interval exemplifies the inferential approach. The scope of descriptive statistics is inherently limited to exploration and confirmation within the collected data, emphasizing patterns and trends observable directly from the sample. Inferential statistics, however, broaden this scope through hypothesis testing and generalization, enabling conclusions about population characteristics or relationships that were not directly measured. An overlap occurs when a descriptive measure, such as a sample mean, transitions into an inferential context by serving as the basis for population estimation, highlighting how the two branches can complement each other in analysis. Despite their utility, descriptive statistics cannot prove causation or support generalizations outside the sampled data, as they lack the probabilistic framework required for such extensions. Inferential methods are thus essential to address these limitations, providing the rigor needed to draw reliable inferences from samples to populations.

Measures of Central Tendency

Arithmetic mean

The arithmetic mean, often simply called the mean, is a fundamental measure of central tendency in descriptive statistics, defined as the sum of all values divided by the number of observations. It provides a single value that summarizes the "center" or typical value of a data set, assuming equal importance for each observation. The formula for the arithmetic mean $\bar{x}$ of a data set with $n$ values $x_1, x_2, \dots, x_n$ is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

This expression arises from the basic arithmetic operation of averaging, where the total sum is evenly distributed across the observations. To calculate the arithmetic mean, first sum all the values in the dataset and then divide by the total number of values. For example, consider a small dataset of test scores: 80, 90, and 100. The sum is 270, and with three scores, the mean is $270 / 3 = 90$. This process can extend to larger datasets or grouped data using frequencies, such as $\bar{x} = \frac{\sum (f_i \cdot x_i)}{n}$, where $f_i$ represents the frequency of each value $x_i$ and $n = \sum f_i$.

Key properties of the arithmetic mean include its sensitivity to extreme values (outliers), which can disproportionately influence the result, especially in small samples; its additivity, meaning the mean of a sum equals the sum of the means for independent groups; and its utility in weighted averages, where observations are assigned different importance via weights $w_i$ in the formula $\bar{x} = \frac{\sum (w_i \cdot x_i)}{\sum w_i}$. Additionally, the sum of deviations from the mean equals zero, making it a balanced point for further statistical analysis.

The mean offers advantages such as incorporating every data point for a comprehensive representation, resisting random fluctuations across repeated samples, and serving as a foundation for other statistical measures like the standard deviation. However, its disadvantages include vulnerability to skewing by outliers, rendering it less suitable for highly skewed distributions or non-numeric data, where alternatives like the median may provide a more robust central measure.
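
As a minimal sketch of the plain, frequency-weighted, and weighted means defined above, implemented directly from the formulas (the frequencies and weights below are invented for illustration):

```python
# Plain, frequency-weighted, and weighted arithmetic means.
def mean(values):
    return sum(values) / len(values)

def grouped_mean(values, freqs):
    # mean = sum(f_i * x_i) / n, with n = sum(f_i)
    return sum(f * x for f, x in zip(freqs, values)) / sum(freqs)

def weighted_mean(values, weights):
    # mean = sum(w_i * x_i) / sum(w_i)
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

scores = [80, 90, 100]
print(mean(scores))                                   # 90.0, as in the worked example
print(grouped_mean([80, 90, 100], [2, 5, 3]))         # 91.0
print(weighted_mean([80, 90, 100], [0.2, 0.5, 0.3]))  # 91.0

# The deviations from the mean sum to zero (up to floating-point error).
m = mean(scores)
print(sum(x - m for x in scores))  # 0.0
```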

Median and mode

The median is a measure of central tendency that represents the middle value in a data set when the observations are arranged in ascending order. For an odd number of observations $n$, it is the value at position $(n+1)/2$; for an even number, it is the mean of the values at positions $n/2$ and $n/2 + 1$. For example, in the income data set {10,000, 20,000, 50,000, 1,000,000}, already sorted, the median is the mean of 20,000 and 50,000, which is 35,000. Unlike the mean, the median is robust to outliers because it depends only on the order of the data rather than their magnitudes. It is particularly useful for skewed distributions, where extreme values might distort the mean, providing a better representation of the typical value.

The mode is the value or values that occur most frequently in a data set, serving as another measure of central tendency. A distribution is unimodal if it has one mode, bimodal if it has two, and multimodal if it has more than two. For example, in the color preferences {red, red, blue}, the mode is red, as it appears twice while blue appears once. The mode is especially valuable for nominal or categorical data, where arithmetic operations are not applicable, making it the only central tendency measure suitable for such variables. It is commonly used to summarize the most common category in frequency distributions, such as preferred product types in surveys.
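
A minimal sketch of both definitions in Python, implemented directly from the rules above (the mode helper returns all tied modes, so it also handles bimodal and multimodal cases):

```python
# Median (odd/even rules) and mode, straight from the definitions.
from collections import Counter

def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                    # position (n+1)/2, 1-indexed
    return (s[n // 2 - 1] + s[n // 2]) / 2  # mean of the two middle values

def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

incomes = [10_000, 20_000, 50_000, 1_000_000]
print(median(incomes))  # 35000.0; unaffected by the size of the outlier

colors = ["red", "red", "blue"]
print(modes(colors))    # ['red']; works on categorical data too
```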

Measures of Variability

Range and interquartile range

The range is a basic measure of statistical dispersion that quantifies the spread of data by calculating the difference between the maximum and minimum values in a dataset. It is defined by the formula $R = \max(x_i) - \min(x_i)$, where the $x_i$ are the data points. For example, in a set of daily temperatures recorded as 20°C, 22°C, 25°C, 27°C, and 30°C, the range is 30°C - 20°C = 10°C, indicating the full extent of temperature variation observed. While straightforward to compute and interpret, the range is highly sensitive to outliers, as a single extreme value can dramatically alter the maximum or minimum and thus the overall measure.

The interquartile range (IQR) provides a more robust measure of spread by focusing on the middle 50% of the data, specifically the difference between the third quartile (Q3) and the first quartile (Q1). Quartiles divide an ordered data set into four equal parts: Q1 marks the 25th percentile (lower quartile), Q2 the median (50th percentile), and Q3 the 75th percentile (upper quartile). The IQR is calculated using the formula $\text{IQR} = Q_3 - Q_1$.

To compute the IQR, first arrange the data in ascending order to form an ordered list. Next, locate the median (Q2) by finding the middle value (or the mean of the two middle values if the data set has an even number of observations), which splits the data into lower and upper halves. Then, determine Q1 as the median of the lower half (excluding Q2 if $n$ is odd) and Q3 as the median of the upper half, again averaging if necessary for even-sized halves. Finally, subtract Q1 from Q3 to obtain the IQR; for instance, in the ordered data set {1, 2, 3, 4, 5, 6, 7}, Q1 = 2, Q3 = 6, and IQR = 4.

Unlike the range, the IQR is resistant to the influence of outliers because it excludes the lowest 25% and highest 25% of the data, emphasizing central variability instead. This robustness makes the IQR particularly useful in datasets with potential extremes, though it provides less information about the full data spread compared to measures like the standard deviation.
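
A minimal Python sketch of this procedure, using the median-exclusive method for the quartiles as described above (the median is left out of both halves when $n$ is odd); note that other quartile conventions exist and can give slightly different values:

```python
# IQR via the median-exclusive quartile method described in the text.
def median(s):
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def iqr(values):
    s = sorted(values)
    n = len(s)
    lower = s[: n // 2]        # values below the median
    upper = s[(n + 1) // 2 :]  # values above the median
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

print(iqr([1, 2, 3, 4, 5, 6, 7]))  # (2, 6, 4), matching the worked example
print(max([1, 2, 3, 4, 5, 6, 7]) - min([1, 2, 3, 4, 5, 6, 7]))  # range: 6
```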

Variance and standard deviation

Variance measures the average squared deviation of data points from the mean, providing a quantitative assessment of dispersion. For a population of size $N$ with mean $\mu$, the population variance $\sigma^2$ is calculated as

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

where the $x_i$ are the individual values. This formula averages the squared differences to emphasize larger deviations and yield a non-negative value. When working with a sample of size $n$ drawn from a larger population, the sample variance $s^2$ uses the formula

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $\bar{x}$ is the sample mean. The denominator $n-1$ instead of $n$ applies Bessel's correction, which adjusts for the bias introduced by estimating the mean from the sample, ensuring $s^2$ is an unbiased estimator of $\sigma^2$. Without this correction, the sample variance would systematically underestimate the true variance, because the sample mean minimizes the squared deviations within the sample. The standard deviation is the square root of the variance, returning the measure to the original scale of the data: the population standard deviation is $\sigma = \sqrt{\sigma^2}$ and the sample standard deviation is $s = \sqrt{s^2}$.
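
A minimal sketch of both formulas in Python, on an invented dataset; note the $n-1$ denominator (Bessel's correction) in the sample version, the same distinction Python's statistics module draws between pvariance/pstdev and variance/stdev:

```python
# Population vs. sample variance and standard deviation.
import math

def population_variance(values):
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    xbar = sum(values) / len(values)
    return sum((x - xbar) ** 2 for x in values) / (len(values) - 1)  # Bessel's correction

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))             # 4.0
print(sample_variance(data))                 # ~4.571, slightly larger
print(math.sqrt(population_variance(data)))  # population standard deviation: 2.0
print(math.sqrt(sample_variance(data)))      # sample standard deviation
```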