Stratified sampling
from Wikipedia

In statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations.

Stratified sampling example

In statistical surveys, when subpopulations within an overall population vary, it could be advantageous to sample each subpopulation (stratum) independently.

Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should define a partition of the population. That is, they should be collectively exhaustive and mutually exclusive: every element in the population must be assigned to one and only one stratum. Sampling is then done within each stratum, for example by simple random sampling. The objective is to improve the precision of the sample by reducing sampling error. Stratification can produce a weighted mean that has less variability than the arithmetic mean of a simple random sample of the population.

In computational statistics, stratified sampling is a method of variance reduction when Monte Carlo methods are used to estimate population statistics from a known population.[1]

Strategies

  1. Proportionate allocation uses the same sampling fraction in each stratum, so that each stratum's sample size is proportional to its share of the total population. For instance, if the population consists of n individuals, m of whom are male and f female (where m + f = n), then the relative sizes of the two samples (x1 = m/n males, x2 = f/n females) should reflect this proportion.
  2. Optimum allocation (or disproportionate allocation) – The sample size of each stratum is proportionate to both the stratum's share of the population (as above) and the standard deviation of the variable within the stratum. Larger samples are taken in the strata with the greatest variability to generate the least possible overall sampling variance. Neyman allocation is a strategy of this type.

A real-world example of using stratified sampling would be a political survey. If the respondents needed to reflect the diversity of the population, the researcher would specifically seek to include participants from various minority groups, defined for example by race or religion, in proportion to their share of the total population as described above. A stratified survey could thus claim to be more representative of the population than a survey using simple random sampling or systematic sampling. Both the mean and the variance can be corrected for disproportionate sampling costs using stratified sample sizes.

Example

Assume that we need to estimate the average number of votes for each candidate in an election. Assume that a country has 3 towns: Town A has 1 million factory workers, Town B has 2 million office workers and Town C has 3 million retirees. We can choose to get a random sample of size 60 over the entire population but there is some chance that the resulting random sample is poorly balanced across these towns and hence is biased, causing a significant error in estimation (when the outcome of interest has a different distribution, in terms of the parameter of interest, between the towns). Instead, if we choose to take a random sample of 10, 20 and 30 from Town A, B and C respectively, then we can produce a smaller error in estimation for the same total sample size. This method is generally used when a population is not a homogeneous group.
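
The effect described in this example can be checked with a small simulation. The sketch below is illustrative only: the support rates for a single candidate in each town (0.8, 0.5 and 0.2) are assumed values that do not appear in the text, sampling is done with replacement for simplicity, and the number of repetitions is arbitrary.

```python
import random
import statistics

# Town sizes in millions, taken from the example above.
SIZE = {"A": 1, "B": 2, "C": 3}
# ASSUMPTION: hypothetical share of voters backing one candidate in each town.
SUPPORT = {"A": 0.8, "B": 0.5, "C": 0.2}
# Fixed per-town sample sizes from the example (10, 20 and 30).
ALLOC = {"A": 10, "B": 20, "C": 30}

TOTAL = sum(SIZE.values())
TOWNS = list(SIZE)
WEIGHTS = [SIZE[t] / TOTAL for t in TOWNS]

rng = random.Random(0)  # fixed seed so the run is repeatable


def srs_estimate(n=60):
    """Simple random sample: each respondent's town is left to chance."""
    votes = 0
    for _ in range(n):
        town = rng.choices(TOWNS, weights=WEIGHTS)[0]
        votes += rng.random() < SUPPORT[town]
    return votes / n


def stratified_estimate():
    """Fixed sample per town, combined using population weights."""
    estimate = 0.0
    for town, n_h in ALLOC.items():
        mean_h = sum(rng.random() < SUPPORT[town] for _ in range(n_h)) / n_h
        estimate += SIZE[town] / TOTAL * mean_h
    return estimate


REPS = 2000
var_srs = statistics.pvariance([srs_estimate() for _ in range(REPS)])
var_strat = statistics.pvariance([stratified_estimate() for _ in range(REPS)])
print(f"variance of SRS estimate:        {var_srs:.5f}")
print(f"variance of stratified estimate: {var_strat:.5f}")
```

Under these assumed rates the stratified estimate should show the smaller variance, because the town-to-town differences no longer contribute to the sampling error. If the towns had identical support rates, the two designs would perform about the same.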

Advantages

The reasons to use stratified sampling rather than simple random sampling include:[2]

  1. If measurements within strata have a lower standard deviation (as compared to the overall standard deviation in the population), stratification gives a smaller error in estimation.
  2. For many applications, measurements become more manageable and/or cheaper when the population is grouped into strata.
  3. When it is desirable to have estimates of the population parameters for groups within the population, stratified sampling ensures that we have enough samples from the strata of interest.

If the population density varies greatly within a region, stratified sampling can help ensure that estimates are made with comparable accuracy in different parts of the region, and that comparisons of sub-regions are made with comparable statistical power. For example, in Ontario a survey taken throughout the province might use a larger sampling fraction in the less populated north, since the disparity in population between north and south is so great that a sampling fraction based on the provincial sample as a whole might result in the collection of only a handful of observations from the north.

Disadvantages

It would be a misapplication of the technique to make subgroups' sample sizes proportional to the amount of data available from the subgroups, rather than scaling sample sizes to subgroup sizes (or to their variances, if known to vary significantly—e.g. using an F test). Data representing each subgroup are taken to be of equal importance if suspected variation among them warrants stratified sampling. If subgroup variances differ significantly and the data needs to be stratified by variance, it is not possible to simultaneously make each subgroup sample size proportional to subgroup size within the total population. For an efficient way to partition sampling resources among groups that vary in their means, variance and costs, see "optimum allocation". The problem of stratified sampling in the case of unknown class priors (ratio of subpopulations in the entire population) can have a deleterious effect on the performance of any analysis on the dataset, e.g. classification.[3] In that regard, minimax sampling ratio can be used to make the dataset robust with respect to uncertainty in the underlying data generating process.[3]

Combining sub-strata to ensure adequate numbers can lead to Simpson's paradox, where trends that exist in different groups of data disappear or even reverse when the groups are combined.

Mean and standard error

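The mean and variance of stratified random sampling are given by:[2]

$$\bar{x} = \frac{1}{N} \sum_{h=1}^{L} N_h \bar{x}_h$$

$$s_{\bar{x}}^2 = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(\frac{N_h - n_h}{N_h - 1}\right) \frac{s_h^2}{n_h}$$

where

$L$ = number of strata
$N$ = the sum of all stratum sizes
$N_h$ = size of stratum $h$
$\bar{x}_h$ = sample mean of stratum $h$
$n_h$ = number of observations in stratum $h$
$s_h$ = sample standard deviation of stratum $h$

Note that the term $(N_h - n_h)/(N_h - 1)$, which equals $1 - (n_h - 1)/(N_h - 1)$, is a finite population correction and $N_h$ must be expressed in "sample units". Forgoing the finite population correction gives:

$$s_{\bar{x}}^2 = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h} = \sum_{h=1}^{L} w_h^2 \frac{s_h^2}{n_h}$$

where $w_h = N_h/N$ is the population weight of stratum $h$.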
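
As a rough illustration, the estimators in this section can be computed as follows. This is a minimal sketch: it assumes the finite population correction takes the form (N_h - n_h)/(N_h - 1), and the function name and the two-stratum input data are hypothetical.

```python
def stratified_mean_and_variance(strata, use_fpc=True):
    """Return the stratified mean and the variance of that mean.

    strata: list of (N_h, n_h, xbar_h, s_h) tuples, one per stratum, giving
    stratum size, sample size, sample mean and sample standard deviation.
    """
    N = sum(N_h for N_h, _, _, _ in strata)
    # Weighted mean: each stratum mean counts in proportion to N_h / N.
    mean = sum(N_h * xbar_h for N_h, _, xbar_h, _ in strata) / N
    variance = 0.0
    for N_h, n_h, _, s_h in strata:
        # ASSUMPTION: finite population correction of the form (N_h - n_h)/(N_h - 1).
        fpc = (N_h - n_h) / (N_h - 1) if use_fpc else 1.0
        variance += (N_h / N) ** 2 * fpc * s_h ** 2 / n_h
    return mean, variance


# Hypothetical summary statistics for two strata.
strata = [
    (1000, 50, 10.0, 2.0),
    (3000, 150, 20.0, 4.0),
]
mean, var_fpc = stratified_mean_and_variance(strata)
_, var_no_fpc = stratified_mean_and_variance(strata, use_fpc=False)
print(mean)                      # 17.5
print(var_fpc ** 0.5)            # standard error with the correction
print(var_no_fpc ** 0.5)         # standard error without it
```

The correction only matters when the sampling fraction in a stratum is appreciable. Here each stratum is sampled at 5%, so the two standard errors differ by only a few percent.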

Sample size allocation

For the proportional allocation strategy, the size of the sample in each stratum is taken in proportion to the size of the stratum. Suppose that in a company there are the following staff:[4]

  • male, full-time: 90
  • male, part-time: 18
  • female, full-time: 9
  • female, part-time: 63
  • total: 180

and we are asked to take a sample of 40 staff, stratified according to the above categories.

The first step is to calculate the percentage of each group of the total.

  • % male, full-time = 90 ÷ 180 = 50%
  • % male, part-time = 18 ÷ 180 = 10%
  • % female, full-time = 9 ÷ 180 = 5%
  • % female, part-time = 63 ÷ 180 = 35%

This tells us that of our sample of 40,

  • 50% (20 individuals) should be male, full-time.
  • 10% (4 individuals) should be male, part-time.
  • 5% (2 individuals) should be female, full-time.
  • 35% (14 individuals) should be female, part-time.

Another easy way without having to calculate the percentage is to multiply each group size by the sample size and divide by the total population size (size of entire staff):

  • male, full-time = 90 × (40 ÷ 180) = 20
  • male, part-time = 18 × (40 ÷ 180) = 4
  • female, full-time = 9 × (40 ÷ 180) = 2
  • female, part-time = 63 × (40 ÷ 180) = 14
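
The same calculation can be written as a short function. This is a minimal sketch; the function name is hypothetical, and it returns unrounded values, so a real allocation may need a rounding rule when the results are not whole numbers.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    return {
        name: size * sample_size / total
        for name, size in stratum_sizes.items()
    }


# Staff counts from the example above.
staff = {
    "male, full-time": 90,
    "male, part-time": 18,
    "female, full-time": 9,
    "female, part-time": 63,
}
allocation = proportional_allocation(staff, 40)
for group, n_h in allocation.items():
    print(f"{group}: {n_h:g}")
# male, full-time: 20
# male, part-time: 4
# female, full-time: 2
# female, part-time: 14
```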

from Grokipedia
Stratified sampling is a probability sampling method in statistics used to select a representative sample from a population by first dividing the population into distinct, non-overlapping subgroups, or strata, based on shared characteristics such as age, gender, income, or location, and then randomly sampling from each stratum in proportion to its size or according to a specified allocation. This approach ensures that all relevant subgroups are adequately represented in the sample, thereby reducing sampling error and improving the precision of estimates compared to simple random sampling, particularly when the population varies widely between subgroups but little within them.

The process begins with identifying key stratifying variables that capture important heterogeneity in the population, followed by partitioning the population into mutually exclusive strata. Samples are then drawn independently from each stratum using random selection techniques, with sample sizes determined either proportionately (reflecting the stratum's share of the total population) or disproportionately (to oversample underrepresented groups or for greater precision in specific strata), with appropriate weighting to maintain unbiased estimates. Common allocation strategies include proportional allocation, which yields a self-weighting sample, and optimal allocation, such as Neyman allocation, which minimizes variance by considering both stratum sizes and within-stratum variability. For instance, in a survey, the population might be stratified by geographic region and age group to ensure balanced representation across diverse demographics.

Stratified sampling offers several advantages, including enhanced statistical efficiency through reduced sampling variance, guaranteed inclusion of minority subgroups, and the ability to provide separate estimates for each stratum, which is valuable for subgroup analysis.
However, it requires prior knowledge of the population's composition to define effective strata, can be more complex and costly to implement than simpler methods, and may not effectively reduce sampling error if strata are poorly chosen or if the stratifying variable does not correlate well with the study outcomes. Compared to cluster sampling, it typically yields more precise results for heterogeneous populations but demands more upfront planning.

The method was formalized in its modern form by Jerzy Neyman in 1934, who demonstrated the theoretical foundations for optimal sample allocation to minimize estimation errors in stratified designs, building on earlier ideas in representative sampling from the early 20th century. Today, stratified sampling is widely applied in fields like survey research, quality control, and simulations, where accurate representation across diverse population segments is critical.

Fundamentals

Definition

Stratified sampling is a probability sampling technique in which the population of interest is divided into distinct, non-overlapping subgroups known as strata, based on one or more stratification variables, and then independent random samples are drawn from each stratum. This approach ensures that the sample reflects the population's diversity by capturing representation from each subgroup proportionally or according to a specified allocation. Unlike simple random sampling, which selects units directly from the entire population without regard to internal structure, stratified sampling leverages prior information about the population to improve representativeness and precision.

The key components of stratified sampling include the total population size $N$, which represents the aggregate number of units; the number of strata $K$, denoting the subgroups formed; the size $N_h$ of each stratum $h$ (where $h = 1, 2, \dots, K$); and the sampling fraction within each stratum, typically denoted $n_h / N_h$, where $n_h$ is the sample size drawn from stratum $h$. A fundamental prerequisite is that the strata must be mutually exclusive, meaning no unit belongs to more than one stratum, and collectively exhaustive, ensuring all population units are included in some stratum. This partitioning allows for targeted sampling that accounts for heterogeneity across groups while maintaining the randomness essential to probability-based inference.

The relationship among these components is expressed mathematically as the total population size equaling the sum of the stratum sizes:

$$N = \sum_{h=1}^{K} N_h$$

This formula underscores the exhaustive coverage of the population by the strata, forming the basis for subsequent sampling and estimation procedures.
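
The partition requirement can be made concrete with a small sketch. Assigning strata through a single function that returns one label per unit keeps the strata mutually exclusive by construction, and rejecting units with no label enforces exhaustiveness. The function names, the age bands and the list of ages below are all hypothetical.

```python
from collections import Counter


def stratum_sizes(units, stratum_of):
    """Assign every unit to exactly one stratum and return the sizes N_h."""
    sizes = Counter()
    for unit in units:
        label = stratum_of(unit)  # one label per unit: mutually exclusive
        if label is None:
            # A unit without a stratum would break collective exhaustiveness.
            raise ValueError(f"unit {unit!r} falls in no stratum")
        sizes[label] += 1
    return dict(sizes)


def age_band(age):
    """Hypothetical stratification variable: three age bands."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle"
    return "senior"


ages = [19, 34, 52, 67, 23, 45, 71, 38]  # hypothetical population
sizes = stratum_sizes(ages, age_band)
print(sizes)                              # {'young': 2, 'middle': 4, 'senior': 2}
print(sum(sizes.values()) == len(ages))   # True: N equals the sum of the N_h
```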

Comparison to simple random sampling

Simple random sampling (SRS) treats the entire population as a single undifferentiated group, selecting one random sample in which each unit has an equal probability of inclusion without regard for subgroups. In heterogeneous populations this can leave rare or small subgroups underrepresented. In contrast, stratified sampling divides the population into mutually exclusive and exhaustive strata based on relevant characteristics and then samples independently from each stratum, ensuring representation across all subgroups and thereby reducing sampling error in diverse populations. This approach controls variability by grouping similar units within strata while capturing the differences between strata, resulting in greater precision and lower variance for estimates compared to SRS at the same sample size. Stratified sampling was developed in the early 20th century, notably through Jerzy Neyman's 1934 work on representative methods, to address biases in agricultural experiments and census surveys where populations varied by region, soil type, or demographics.

Design and Implementation

Stratum formation

Stratum formation is the initial step in stratified sampling, where the target population is partitioned into mutually exclusive and collectively exhaustive subgroups known as strata. Effective strata are designed to enhance the precision of estimates by ensuring homogeneity within each stratum (low variability in the key variable of interest among units) and heterogeneity between strata, which captures significant differences across groups. This approach reduces the overall sampling variance compared to simple random sampling, as units within a stratum are more similar, allowing for more efficient representation of the population.

To form strata, researchers typically rely on auxiliary variables that are correlated with the study variable and readily available for the entire population, such as demographic factors (e.g., age or income levels) or spatial attributes (e.g., geographic region). These variables enable the division of the population into non-overlapping categories that cover all units without omission or duplication; for instance, a population might be stratified by income brackets (low, medium, high) to reflect varying economic behaviors. The choice of auxiliary variables is critical, as they must be measurable from the sampling frame and relevant to the research objectives to avoid introducing bias.

Forming strata presents several challenges, including the substantial cost associated with acquiring a comprehensive sampling frame that includes the necessary auxiliary information for all units. Additionally, arbitrary or poorly defined boundaries can lead to misclassification errors, where units are incorrectly assigned, potentially undermining the homogeneity goal and increasing variance. Obtaining accurate frame data often requires administrative records or censuses, which may not always be up-to-date or complete, further complicating the process.
A practical guideline for the number of strata $K$ is to select a modest number that provides sufficient detail without risking empty or overly small strata, which could inflate variance; for small-scale surveys, $K$ between 4 and 6 is often recommended to balance gains in precision against the added complexity. Cochran noted that beyond approximately six strata, additional divisions yield diminishing returns in efficiency for many populations.

Sampling strategies

In stratified sampling, the core procedure involves independently drawing samples from each predefined stratum using probability-based methods, typically simple random sampling, so that every stratum is represented in the sample. Once the population has been divided into homogeneous strata, a sampling frame (a complete list of units) is obtained for each stratum $h$, from which $n_h$ units are selected randomly, either with or without replacement. This independent selection within strata allows for tailored sampling efforts that account for variability across groups, as originally outlined in the foundational framework for probability-based stratified designs.

A common strategy is proportional allocation, where the sample size for each stratum $h$ is determined such that $\frac{n_h}{n} = \frac{N_h}{N}$, with $n$ as the total sample size, $N_h$ as the stratum size, and $N$ as the total population size; this ensures the sample mirrors the population's stratum proportions, which avoids over- or under-representing strata when stratum sizes differ significantly. Variations on the basic random selection include equal allocation, in which $n_h = \frac{n}{K}$ for all $K$ strata regardless of their population sizes. Equal allocation is particularly useful when the goal is to compare strata directly or when variability is similar across groups, though it may oversample small strata.

Another variation employs systematic sampling within each stratum, suitable for ordered lists such as geographic sequences: after a random starting point, every $k$-th unit is selected, where $k = \frac{N_h}{n_h}$. This offers efficiency over simple random sampling when the frame lacks inherent randomness, while maintaining approximate randomness if the ordering avoids periodicity.

In practice, after random selection, non-response is addressed by adjusting sampling weights within each stratum, often by inflating the weights of respondents by the inverse of the stratum-specific response rate to compensate for missing units and preserve representativeness.
For instance, if the response rate in stratum $h$ is $r_h$, the adjusted weight for responding units becomes $w_h = \frac{1}{r_h} \times \frac{N_h}{n_h}$, which yields unbiased estimates when non-response is assumed ignorable within strata.
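
The weight adjustment can be sketched in a few lines. The function name and the stratum figures (1,000 units, 50 sampled, 40 responding) are hypothetical and serve only to show the arithmetic.

```python
def nonresponse_adjusted_weight(N_h, n_h, respondents_h):
    """Weight for each responding unit in stratum h: (1 / r_h) * (N_h / n_h)."""
    base_weight = N_h / n_h               # design weight before non-response
    response_rate = respondents_h / n_h   # r_h, the stratum response rate
    return base_weight / response_rate


# Hypothetical stratum: 1,000 units, 50 sampled, 40 responded.
w_h = nonresponse_adjusted_weight(N_h=1000, n_h=50, respondents_h=40)
print(round(w_h, 6))        # 25.0
print(round(40 * w_h, 6))   # 1000.0: respondents' weights sum back to N_h
```

Because $r_h$ is the number of respondents divided by $n_h$, the adjusted weight simplifies to $N_h$ divided by the number of respondents, which is why the respondents' weights add up to the stratum size.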

Sample size allocation

In stratified sampling, determining the appropriate sample size for each stratum, denoted $n_h$, is crucial for achieving efficient estimation while meeting overall survey objectives. Allocation methods balance the total sample size $n$ across strata to minimize variance or incorporate practical constraints, assuming the population is divided into $H$ strata with sizes $N_h$ and total size $N = \sum N_h$.

In practice, the total sample size and allocation are often determined to achieve a desired level of precision for key estimates, such as population means or proportions, at specified confidence levels. For estimating a proportion within a stratum (assuming a large or infinite population), the required sample size per stratum is calculated using the formula

$$n_h = \frac{Z^2 p (1-p)}{e^2},$$

where $Z$ is the z-score corresponding to the desired confidence level (e.g., 1.96 for 95% confidence, 1.645 for 90%, 2.576 for 99%), $p$ is the expected proportion (commonly set to 0.5 for a conservative estimate that maximizes variance), and $e$ is the desired margin of error (e.g., 0.05 for ±5%). The total sample size is then the sum of $n_h$ across all strata. Adjustments may be applied for finite populations using the correction $n_h' = n_h / (1 + n_h / N_h)$ or for design effects (e.g., clustering) by inflating the calculated sizes.

Sample sizes are frequently set to ensure reliable estimates not only overall but also at important domain levels (e.g., national, urban/rural). To support reliable subgroup or domain-specific estimates, a minimum effective sample size per stratum or domain is often targeted, typically in the range of 100–400 observations. Changing the confidence level requires recalculating with a different $Z$ value; the required sample size scales approximately with $Z^2$.
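
A minimal sketch of this calculation follows, using the z-scores quoted above. The function name is hypothetical, the stratum size of 2,000 is an assumed value, and rounding up to the next whole unit is a common convention rather than something the formula itself prescribes.

```python
from math import ceil


def sample_size_for_proportion(z, p=0.5, e=0.05, N_h=None):
    """Required n_h for estimating a proportion to within margin e.

    If a stratum size N_h is given, the finite population adjustment
    n_h / (1 + n_h / N_h) is applied before rounding up.
    """
    n = z ** 2 * p * (1 - p) / e ** 2
    if N_h is not None:
        n = n / (1 + n / N_h)
    return ceil(n)


n_95 = sample_size_for_proportion(1.96)
n_99 = sample_size_for_proportion(2.576)
n_95_finite = sample_size_for_proportion(1.96, N_h=2000)  # assumed stratum size
print(n_95)         # 385
print(n_99)         # 664
print(n_95_finite)  # 323
```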
For example, achieving the same margin of error at 99% confidence instead of 95% requires roughly $(2.576 / 1.96)^2 \approx 1.73$ times as many observations, that is, a sample about 73% larger.

Once the required sample sizes (overall or per stratum/domain) are determined based on precision goals, allocation methods are applied to distribute the sample across strata, particularly when the total sample size is constrained by budget or logistics. Proportional allocation assigns sample sizes in proportion to the stratum's share of the population, given by the formula

$$n_h = n \cdot \frac{N_h}{N}.$$

This method is efficient when variability is similar across strata and ensures the sample mirrors the population structure, which simplifies weighting and reduces bias in overall estimates.

Disproportionate allocation deviates from proportionality to improve precision for specific subgroups, such as by oversampling small or rare strata that may have higher variability. For instance, the sample size can be set proportional to the product of stratum size and standard deviation, $n_h \propto N_h S_h$, where $S_h$ is the within-stratum standard deviation; this allocates more resources to heterogeneous strata to enhance subgroup estimates without inflating overall variance excessively.

Neyman allocation provides an optimal disproportionate strategy for a fixed total sample size $n$, minimizing the variance of the population mean estimator when costs are equal across strata. The formula is

$$n_h = n \cdot \frac{N_h S_h}{\sum_{k=1}^{H} N_k S_k},$$

which prioritizes larger, more variable strata but requires prior knowledge or estimates of $S_h$ from pilot studies or historical data. This approach, introduced by Neyman in 1934, can substantially reduce variance compared to proportional allocation in populations with unequal stratum variances.
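
The difference between the two allocations can be illustrated numerically. In the sketch below the stratum sizes and standard deviations are hypothetical, the function names are illustrative, and the variance is computed without a finite population correction.

```python
def proportional_allocation(n, sizes):
    """n_h = n * N_h / N."""
    N = sum(sizes)
    return [n * N_h / N for N_h in sizes]


def neyman_allocation(n, sizes, sds):
    """n_h = n * N_h * S_h / sum_k(N_k * S_k)."""
    products = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]


def variance_of_mean(sizes, sds, alloc):
    """Variance of the stratified mean, ignoring finite population corrections."""
    N = sum(sizes)
    return sum(
        (N_h / N) ** 2 * S_h ** 2 / n_h
        for N_h, S_h, n_h in zip(sizes, sds, alloc)
    )


# Hypothetical strata: the smallest stratum is the most variable.
sizes = [5000, 3000, 2000]
sds = [2.0, 5.0, 12.5]

prop = proportional_allocation(100, sizes)
neyman = neyman_allocation(100, sizes, sds)
var_prop = variance_of_mean(sizes, sds, prop)
var_neyman = variance_of_mean(sizes, sds, neyman)
print(prop, round(var_prop, 4))      # [50.0, 30.0, 20.0] 0.4075
print(neyman, round(var_neyman, 4))  # [20.0, 30.0, 50.0] 0.25
```

With these assumed inputs, Neyman allocation shifts sample from the large, homogeneous stratum to the small, variable one and lowers the variance of the estimated mean for the same total sample size.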
Practical considerations often modify these allocations, such as budget constraints that limit the total $n$ or per-unit costs $c_h$ that vary across strata, leading to adjusted formulas like $n_h \propto N_h S_h / \sqrt{c_h}$.
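
A cost-adjusted allocation of this kind might be sketched as follows, assuming the total sample size $n$ is fixed and only its split across strata is being chosen. The function name, stratum sizes, standard deviations and per-unit costs are all hypothetical.

```python
from math import sqrt


def cost_adjusted_allocation(n, sizes, sds, costs):
    """Split n so that n_h is proportional to N_h * S_h / sqrt(c_h)."""
    scores = [
        N_h * S_h / sqrt(c_h)
        for N_h, S_h, c_h in zip(sizes, sds, costs)
    ]
    total = sum(scores)
    return [n * s / total for s in scores]


# Hypothetical strata; the most variable stratum is also the costliest to survey.
sizes = [5000, 3000, 2000]
sds = [2.0, 5.0, 12.5]
costs = [1.0, 4.0, 25.0]

alloc = cost_adjusted_allocation(90, sizes, sds, costs)
print(alloc)  # [40.0, 30.0, 20.0]
```

Compared with an allocation based on $N_h S_h$ alone, the expensive third stratum receives a smaller share here, since each unit sampled there uses more of the budget.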