Simple random sample
In statistics, a simple random sample (or SRS) is a subset of individuals (a sample) chosen randomly from a larger set (a population), with each individual having the same probability of being chosen. In SRS, each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals.[1] Simple random sampling is a basic type of sampling and can be a component of other, more complex sampling methods.[2]
Introduction
The principle of simple random sampling is that every set with the same number of items has the same probability of being chosen. For example, suppose N college students want to get a ticket for a basketball game, but there are only X < N tickets for them, so they decide to have a fair way to see who gets to go. Then, everybody is given a number in the range from 0 to N-1, and random numbers are generated, either electronically or from a table of random numbers. Numbers outside the range from 0 to N-1 are ignored, as are any numbers previously selected. The first X numbers would identify the lucky ticket winners.
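This lottery can be sketched in a few lines of Python; the values of N and X below are illustrative, and the standard-library random.sample call performs the without-replacement draw:

    import random

    N = 500   # students who want a ticket (illustrative value)
    X = 100   # tickets available (illustrative value)

    # Give each student a number 0..N-1 and draw X distinct winning numbers;
    # random.sample makes every subset of X students equally likely.
    winners = random.sample(range(N), X)
    print(sorted(winners))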
In small populations and often in large ones, such sampling is typically done "without replacement", i.e., one deliberately avoids choosing any member of the population more than once. Although simple random sampling can be conducted with replacement instead, this is less common and would normally be described more fully as simple random sampling with replacement. Sampling done without replacement is no longer independent, but still satisfies exchangeability, hence most results of mathematical statistics still hold. Further, for a small sample from a large population, sampling without replacement is approximately the same as sampling with replacement, since the probability of choosing the same individual twice is low. Survey methodology textbooks generally consider simple random sampling without replacement as the benchmark to compute the relative efficiency of other sampling approaches.[3]
An unbiased random selection of individuals is important so that if many samples were drawn, the average sample would accurately represent the population. However, this does not guarantee that a particular sample is a perfect representation of the population. Simple random sampling merely allows one to draw externally valid conclusions about the entire population based on the sample. The concept can be extended when the population is a geographic area.[4] In this case, area sampling frames are relevant.
Conceptually, simple random sampling is the simplest of the probability sampling techniques. It requires a complete sampling frame, which may not be available or feasible to construct for large populations. Even if a complete frame is available, more efficient approaches may be possible if other useful information is available about the units in the population.
Advantages are that it is free of classification error, and it requires minimum previous knowledge of the population other than the frame. Its simplicity also makes it relatively easy to interpret data collected in this manner. For these reasons, simple random sampling best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items, or where the cost of sampling is small enough to make efficiency less important than simplicity. If these conditions do not hold, stratified sampling or cluster sampling may be a better choice.
Relationship between simple random sample and other methods
Equal probability sampling (epsem)
A sampling method for which each individual unit has the same chance of being selected is called equal probability sampling (epsem for short).
Using a simple random sample will always lead to an epsem sample, but not all epsem samples are SRSs. For example, if a teacher has a class arranged in 5 rows of 6 columns and wants to take a random sample of 5 students, she might pick one of the 6 columns at random. This is an epsem sample, but not all subsets of 5 pupils are equally likely, because only subsets that form a single column are eligible for selection. There are also ways of constructing multistage samples that are not SRS but whose final sample is still epsem.[5] For example, systematic random sampling produces a sample in which each individual unit has the same probability of inclusion, but different sets of units have different probabilities of being selected.
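A small enumeration makes the distinction concrete; the 5-by-6 class layout below mirrors the teacher example above and the numbers are illustrative:

    from math import comb

    # 5 rows x 6 columns of pupils; the teacher samples one whole column of 5 pupils.
    rows, cols = 5, 6
    columns = [[r * cols + c for r in range(rows)] for c in range(cols)]

    # Every pupil sits in exactly one column, so P(selected) = 1/6 for each pupil (epsem),
    # yet only 6 of the comb(30, 5) = 142506 possible 5-pupil subsets can ever occur.
    print(1 / cols, len(columns), comb(rows * cols, rows))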
Samples that are epsem are self-weighting, meaning that the inverse of the selection probability is the same for every sampled unit.
Distinction between a systematic random sample and a simple random sample
Consider a school with 1000 students, and suppose that a researcher wants to select 100 of them for further study. All their names might be put in a bucket and then 100 names might be pulled out. Not only does each person have an equal chance of being selected, but we can also easily calculate the probability (P) of a given person being chosen, since we know the sample size (n) and the population (N):
1. In the case that any given person can only be selected once (i.e., after selection a person is removed from the selection pool): P = n/N = 100/1000 = 10%
2. In the case that any selected person is returned to the selection pool (i.e., can be picked more than once): P = 1 − (1 − 1/N)^n = 1 − (999/1000)^100 ≈ 9.5%
This means that every student in the school has in any case approximately a 1 in 10 chance of being selected using this method. Further, any combination of 100 students has the same probability of selection.
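A quick numerical check of the two cases above, in plain Python with the values from the example:

    N, n = 1000, 100
    p_without = n / N                 # selected at most once: n/N
    p_with = 1 - (1 - 1 / N) ** n     # chance of being picked at least once, with replacement
    print(p_without)                  # 0.1
    print(round(p_with, 4))           # ~0.0952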
If a systematic pattern is introduced into random sampling, it is referred to as "systematic (random) sampling". An example would be if the students in the school had numbers attached to their names ranging from 0001 to 1000, and we chose a random starting point, e.g. 0533, and then picked every 10th name thereafter to give us our sample of 100 (starting over with 0003 after reaching 0993). In this sense, this technique is similar to cluster sampling, since the choice of the first unit will determine the remainder. This is no longer simple random sampling, because some combinations of 100 students have a larger selection probability than others – for instance, {3, 13, 23, ..., 993} has a 1/10 chance of selection, while {1, 2, 3, ..., 100} cannot be selected under this method.
Sampling a dichotomous population
If the members of the population come in two kinds, say "red" and "black", the number of red elements in a sample of given size will vary by sample and hence is a random variable whose distribution can be studied. That distribution depends on the numbers of red and black elements in the full population. For a simple random sample with replacement, the distribution is a binomial distribution. For a simple random sample without replacement, one obtains a hypergeometric distribution.[6]
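The two distributions can be compared directly; the population and sample sizes below are illustrative:

    from math import comb

    # Number of red elements in a sample of size n from a population of N elements,
    # K of which are red (illustrative values).
    N, K, n = 50, 20, 10

    def hypergeom_pmf(k):   # simple random sample without replacement
        return comb(K, k) * comb(N - K, n - k) / comb(N, n)

    def binomial_pmf(k):    # simple random sample with replacement
        p = K / N
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    for k in range(3, 7):
        print(k, round(hypergeom_pmf(k), 4), round(binomial_pmf(k), 4))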
Algorithms
Several efficient algorithms for simple random sampling have been developed.[7][8] A naive algorithm is the draw-by-draw algorithm: at each step, one of the remaining items is removed from the set with equal probability and placed in the sample, continuing until the sample reaches the desired size n. The drawback of this method is that it requires random access to the set.
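A minimal sketch of the draw-by-draw idea in Python (the function name is illustrative):

    import random

    def draw_by_draw(population, n):
        # Naive SRS without replacement: repeatedly pick one of the remaining
        # items uniformly at random; needs random access to the whole set.
        pool = list(population)
        sample = []
        for _ in range(n):
            i = random.randrange(len(pool))
            sample.append(pool.pop(i))
        return sample

    print(draw_by_draw(range(1000), 5))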
The selection-rejection algorithm developed by Fan et al. in 1962[9] requires a single pass over the data; however, it is a sequential algorithm and requires knowledge of the total count of items N, which is not available in streaming scenarios.
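A sketch of the selection-rejection idea, assuming the total count N is known up front (an illustration of the principle, not the authors' published code):

    import random

    def selection_rejection(population, n):
        N = len(population)
        sample = []
        for i, item in enumerate(population):
            still_needed = n - len(sample)
            still_available = N - i
            # Accept the current item with probability (still needed)/(still available),
            # which yields every subset of size n with equal probability.
            if random.random() < still_needed / still_available:
                sample.append(item)
                if len(sample) == n:
                    break
        return sample

    print(selection_rejection(list(range(1000)), 5))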
A very simple random sort algorithm was proposed by Sunter in 1977.[10] The algorithm simply assigns a random number drawn from a uniform distribution as a key to each item, sorts all items by this key, and selects the n smallest items.
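The random-sort idea in a few lines of Python (an illustrative sketch):

    import random

    def random_sort_sample(population, n):
        # Attach a uniform random key to each item, sort by key, keep the n smallest.
        keyed = [(random.random(), item) for item in population]
        keyed.sort()
        return [item for _, item in keyed[:n]]

    print(random_sort_sample(range(1000), 5))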
J. Vitter in 1985[11] proposed reservoir sampling algorithms, which are widely used. These algorithms do not require advance knowledge of the population size and use constant space.
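A sketch of the basic reservoir method (often called Algorithm R); the function name and stream are illustrative:

    import random

    def reservoir_sample(stream, n):
        reservoir = []
        for i, item in enumerate(stream):
            if i < n:
                reservoir.append(item)        # fill the reservoir with the first n items
            else:
                j = random.randrange(i + 1)   # uniform on 0..i
                if j < n:
                    reservoir[j] = item       # keep the new item with probability n/(i+1)
        return reservoir

    print(reservoir_sample(iter(range(10**6)), 5))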
Random sampling can also be accelerated by sampling from the distribution of gaps between samples[12] and skipping over the gaps.
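For the reservoir setting, the gap-skipping idea is often presented as "Algorithm L"; the sketch below illustrates that formulation under the assumption that the stream never yields None (it is not Vitter's exact sequential method):

    import math
    import random
    from itertools import islice

    def reservoir_sample_with_gaps(stream, n):
        it = iter(stream)
        reservoir = list(islice(it, n))                 # fill the reservoir first
        if len(reservoir) < n:
            return reservoir
        w = math.exp(math.log(random.random()) / n)
        while True:
            # Draw the size of the gap to the next accepted item, then skip over it.
            gap = math.floor(math.log(random.random()) / math.log(1.0 - w))
            nxt = next(islice(it, gap, gap + 1), None)  # consume `gap` items, take the next one
            if nxt is None:                             # stream ended inside the gap
                return reservoir
            reservoir[random.randrange(n)] = nxt
            w *= math.exp(math.log(random.random()) / n)

    print(reservoir_sample_with_gaps(range(1, 10**6), 5))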
References
[edit]- ^ Yates, Daniel S.; David S. Moore; Daren S. Starnes (2008). The Practice of Statistics, 3rd Ed. Freeman. ISBN 978-0-7167-7309-2.
- ^ Thompson, Steven K. (2012). Sampling. Wiley series in probability and statistics (3rd ed.). Hoboken, N.J: John Wiley & Sons. ISBN 978-1-118-16293-4.
- ^ Cochran, William Gemmell (1977). Sampling techniques. Wiley series in probability and mathematical statistics (3d ed.). New York: Wiley. ISBN 978-0-471-16240-7.
- ^ Cressie, Noel A. C. (2015). Statistics for spatial data (Revised ed.). Hoboken, NJ: John Wiley & Sons, Inc. ISBN 978-1-119-11517-5.
- ^ Peters, Tim J., and Jenny I. Eachus. "Achieving equal probability of selection under various random sampling strategies." Paediatric and perinatal epidemiology 9.2 (1995): 219-224.
- ^ Ash, Robert B. (2008). Basic probability theory. Mineola, N.Y: Dover Publications. ISBN 978-0-486-46628-6. OCLC 190785258.
- ^ Tille, Yves; Tillé, Yves (2006-01-01). Sampling Algorithms - Springer. Springer Series in Statistics. doi:10.1007/0-387-34240-0. ISBN 978-0-387-30814-2.
- ^ Meng, Xiangrui (2013). "Scalable Simple Random Sampling and Stratified Sampling" (PDF). Proceedings of the 30th International Conference on Machine Learning (ICML-13): 531–539.
- ^ Fan, C. T.; Muller, Mervin E.; Rezucha, Ivan (1962-06-01). "Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers". Journal of the American Statistical Association. 57 (298): 387–402. doi:10.1080/01621459.1962.10480667. ISSN 0162-1459.
- ^ Sunter, A. B. (1977-01-01). "List Sequential Sampling with Equal or Unequal Probabilities without Replacement". Applied Statistics. 26 (3): 261–268. doi:10.2307/2346966. JSTOR 2346966.
- ^ Vitter, Jeffrey S. (1985-03-01). "Random Sampling with a Reservoir". ACM Trans. Math. Softw. 11 (1): 37–57. CiteSeerX 10.1.1.138.784. doi:10.1145/3147.3165. ISSN 0098-3500.
- ^ Vitter, Jeffrey S. (1984-07-01). "Faster methods for random sampling". Communications of the ACM. 27 (7): 703–718. CiteSeerX 10.1.1.329.6400. doi:10.1145/358105.893. ISSN 0001-0782.
External links
Media related to Random sampling at Wikimedia Commons
Simple random sample
Fundamentals
Definition
A simple random sample (SRS) is a subset of individuals selected from a larger population such that every possible sample of a given size has an equal probability of being chosen.[5] This method ensures that each member of the population has an equal chance of inclusion in the sample, promoting representativeness and minimizing bias in the selection process.[6] In this framework, simple random sampling is applied to finite populations where the total number of units, denoted as N, is known. The sample size, typically denoted as n, is fixed in advance, and randomness is introduced through mechanisms like random number generation or physical randomization to achieve the equal probability condition.[7] This random selection underpins the validity of subsequent statistical analyses by allowing inferences to be generalized from the sample to the population.[8]

The concept of simple random sampling emerged in the early 20th century within the context of probability theory and experimental design, with Ronald A. Fisher playing a pivotal role in its formalization. In his 1925 book Statistical Methods for Research Workers, Fisher emphasized randomization as essential for valid statistical inference, laying the groundwork for modern sampling techniques.[9] Simple random sampling serves as a foundational tool in statistical inference, where the primary goal is to use the sample to estimate unknown population parameters, such as the mean or proportion, and to quantify the uncertainty in those estimates.[6]

Key Properties
A simple random sample ensures unbiasedness because every unit in the population has an equal probability of being selected, resulting in estimators like the sample mean and sample proportion having expected values equal to the corresponding population parameters.[10] This equal selection probability eliminates subjective biases in the sampling process and supports reliable inference about population characteristics.[11] The method's randomness fosters representativeness, as the sample tends to reflect the population's diversity and distributional properties on average, thereby minimizing systematic errors that could arise from non-random selection.[10] In without-replacement simple random sampling, which is commonly used, the observations are not strictly independent since each draw alters the probabilities for remaining units, though the design maintains exchangeability among selected units.[11][12]

Key advantages of simple random sampling include its theoretical simplicity, which facilitates equal treatment of all population units and straightforward statistical analysis, as well as its ability to quantify sampling error precisely.[10] Disadvantages encompass the need for a complete population listing, which may be impractical for large or dispersed groups, and reduced efficiency when data display clustering, where other designs like stratified sampling perform better.[10]

These properties remain robust under finite population correction (FPC) when the sample constitutes a substantial portion of the population, such as more than 5%, by adjusting variance estimates downward to account for decreased sampling variability without replacement.[10] The FPC multiplier, typically (N − n)/(N − 1) or, approximately, 1 − n/N, where n is the sample size and N is the population size, refines precision for such scenarios while preserving unbiasedness.[11]

Mathematical Foundations
Selection Mechanism
The selection mechanism of a simple random sample involves probabilistic procedures to ensure every population unit has an equal chance of inclusion, typically implemented through draws from a defined population of size N. Two primary models govern this process: sampling with replacement and sampling without replacement. These models differ in their treatment of previously selected units and the resulting probability structures, with the choice depending on whether duplicates are permissible in the sample.

In the with-replacement model, each draw is independent, and a unit selected in one draw is returned to the population before the next, allowing for possible duplicates in the sample. The probability of selecting any specific unit on a given draw is 1/N, and for a specific ordered sample of size n, the probability is 1/N^n.[13] This model follows a multinomial distribution for the counts of each unit in the sample.[14]

In contrast, the without-replacement model prohibits duplicates by removing selected units from consideration for subsequent draws, resulting in a sample of n distinct units. The probability of selecting any specific unordered sample of size n is 1/C(N, n), where C(N, n) denotes the binomial coefficient representing the total number of possible combinations of n units from N.[15] This ensures uniformity over all possible subsets, akin to a hypergeometric selection process but without regard to categories.[1]

To simulate these selections computationally, random numbers are generated from a uniform distribution on [0, 1), which are then mapped to population units via inverse transform or direct indexing.[16] This prerequisite relies on pseudo-random number generators to approximate true randomness in practice. A complete and accessible sampling frame—a list encompassing all population units—is essential for both models, as it defines the universe from which draws occur and ensures the probabilities are well-defined.[2] For large populations where N is much greater than n, the without-replacement model approximates the with-replacement model, simplifying computations while maintaining similar probabilistic properties; this approximation is particularly useful in post-2000 computational statistics texts addressing big data scenarios.[16][17]
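The two selection probabilities can be computed directly; the values of N and n below are illustrative:

    from math import comb

    N, n = 20, 5
    p_ordered_with_replacement = (1 / N) ** n          # one specific ordered sample, with replacement
    p_unordered_without_replacement = 1 / comb(N, n)   # one specific subset, without replacement
    print(p_ordered_with_replacement, p_unordered_without_replacement)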
Estimators and Variance

In simple random sampling without replacement from a finite population of size N, the sample mean ȳ serves as the unbiased estimator of the population mean Ȳ, satisfying E(ȳ) = Ȳ.[14] This unbiasedness holds because each unit in the population has an equal probability of inclusion in the sample, ensuring the expected value of the estimator aligns with the true parameter.[14] The variance of the sample mean under this sampling scheme is given by

Var(ȳ) = (1 − n/N) · S²/n,

where S² = Σ(y_i − Ȳ)²/(N − 1) is the population variance defined with the denominator N − 1 for unbiased estimation purposes, and the factor 1 − n/N is the finite population correction (FPC) that accounts for the reduced variability when sampling a substantial portion of the population without replacement.[15] An unbiased estimator of this variance, using the sample variance s² = Σ(y_i − ȳ)²/(n − 1), is (1 − n/N) · s²/n.[1]

For estimating a population proportion p in dichotomous populations—where each unit is classified as a success (1) or failure (0)—the sample proportion p̂ (or equivalently, the number of successes divided by n) is unbiased with E(p̂) = p.[14] Its variance is

Var(p̂) = (p(1 − p)/n) · (N − n)/(N − 1),

which, with p̂ substituted for p, simplifies to the estimated form incorporating the FPC.[1] This structure mirrors the sample mean case, as the proportion is a special instance of the mean for binary data.

The standard error of the sample mean, √((1 − n/N) · S²/n), estimated with s² in place of S², quantifies the precision of the estimate and forms the basis for constructing confidence intervals, typically of the form ȳ ± z · SE(ȳ) for large samples under approximate normality via the central limit theorem.[18] Similar standard errors apply to p̂, enabling inference about proportions.

In complex scenarios where analytical variance formulas are intractable—such as non-normal distributions or nonlinear statistics—bootstrap methods provide a resampling-based alternative for variance estimation; introduced by Efron in 1979, these involve repeatedly drawing samples with replacement from the observed data to approximate the sampling distribution empirically, proving especially useful in modern computational settings.[19]
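A minimal sketch of these estimates on synthetic data; the population, sample size, and the 1.96 normal quantile are illustrative choices:

    import math
    import random

    random.seed(1)
    population = [random.gauss(50, 10) for _ in range(10_000)]   # synthetic finite population
    N, n = len(population), 400
    sample = random.sample(population, n)                        # SRS without replacement

    y_bar = sum(sample) / n                                      # unbiased estimate of the mean
    s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)         # sample variance
    se = math.sqrt((1 - n / N) * s2 / n)                         # FPC-adjusted standard error
    print(f"mean ~ {y_bar:.2f}, 95% CI ~ ({y_bar - 1.96 * se:.2f}, {y_bar + 1.96 * se:.2f})")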
Comparisons with Other Methods

Equal Probability Sampling
Equal probability sampling (EPS), also referred to as the equal probability of selection method (EPSEM), is a sampling framework in which every unit in the population has an identical probability of inclusion in the sample, denoted as π_i = n/N for all units i, where n is the sample size and N is the population size.[20] This design ensures that the selection process treats all population elements uniformly, facilitating straightforward probability calculations for inference.[21]

Simple random sampling represents a core special case of EPS, characterized by direct random selection from the full population without incorporating stratification, clustering, or other structural modifications, thereby maintaining equal inclusion probabilities through mechanisms like lottery draws or random number generation.[21] Within the broader EPS framework, this approach avoids complexities introduced by multi-stage or layered designs while preserving the equal probability property.[2]

The EPS framework offers key benefits for design-based inference, as the constant inclusion probabilities streamline estimation procedures; notably, the Horvitz-Thompson estimator, which in general form weights observations by the inverse of π_i, simplifies to the population size multiplied by the unweighted sample mean under EPS, reducing computational demands and aligning variance calculations with those of simple random sampling.[22] Historically, EPS concepts were formalized in survey methodology during the 1950s by W. Edwards Deming, who emphasized simplifications through equal probabilities and replication to enhance practical application in large-scale surveys.[23] Post-2010 developments have integrated EPS into complex survey designs, such as the National Children's Study, where equal probability selection supports representativeness across diverse health outcomes in multi-stage frameworks.[24]

A primary limitation of EPS is its assumption against utilizing auxiliary information for optimizing selection probabilities, which can lead to lower efficiency in heterogeneous populations compared to alternatives like probability proportional to size (PPS) sampling that leverage unit characteristics to vary inclusion chances and reduce variance.[25]
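A small sketch of the Horvitz-Thompson simplification under equal inclusion probabilities (synthetic data, illustrative names):

    import random

    random.seed(2)
    population = [random.expovariate(0.1) for _ in range(5_000)]
    N, n = len(population), 250
    sample = random.sample(population, n)

    pi = n / N                                   # equal inclusion probability for every unit
    ht_total = sum(y / pi for y in sample)       # Horvitz-Thompson estimate of the population total
    print(round(ht_total, 1), round(N * sum(sample) / n, 1))   # equal up to floating-point rounding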
Systematic Sampling

Systematic sampling is a probability sampling method where elements are selected from a population list at regular intervals, known as the sampling interval k, which is typically calculated as k = N/n, with N being the population size and n the desired sample size. To ensure randomness, a random starting point is chosen between 1 and k, after which every k-th element is selected until the sample reaches size n.[26] This approach maintains equal inclusion probabilities of n/N for each unit. However, if the population list has periodicity that aligns with k, it can lead to higher variance due to correlated selections.[1]

In contrast to simple random sampling, which allows every possible combination of units to have an equal probability of selection, systematic sampling restricts the sample to specific linear subsets imposed by the ordered list and fixed interval.[27] This ordering can lead to lower variance than simple random sampling if the population exhibits random variation without trends, as it spreads the sample evenly across the list; however, it increases variance if hidden periodic patterns exist, potentially clustering similar units together.[28]

The efficiency of systematic sampling is often assessed through its approximate variance for the sample mean, given by

Var(ȳ_sys) ≈ (S²/n) · [1 + (n − 1)ρ],

where S² is the population variance and ρ is the average intraclass correlation coefficient among elements separated by multiples of k.[29] This formula resembles the simple random sampling variance but adjusts for ordering effects via ρ; when ρ < 0 (indicating dispersion), systematic sampling reduces variance and is preferred for cost savings in accessing large, ordered lists like directories or databases.[1]

Systematic sampling is particularly advantageous in scenarios requiring simplicity and uniformity, such as quality control audits, where full randomization is logistically challenging.[26] Simple random sampling is superior to systematic sampling when populations contain hidden periodicity, as the unrestricted selection avoids alignment with intervals that could bias estimates, ensuring robustness in unordered or trend-heavy datasets like financial time series.[30]
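A sketch of 1-in-k systematic selection, assuming for simplicity that N is an exact multiple of n (function name illustrative):

    import random

    def systematic_sample(population, n):
        N = len(population)
        k = N // n                       # sampling interval
        start = random.randrange(k)      # random start in 0..k-1
        return [population[start + i * k] for i in range(n)]

    print(systematic_sample(list(range(1000)), 10))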
Special Cases and Applications

Dichotomous Populations
In a dichotomous population of size N, there exists a proportion p = K/N of elements classified as "successes" (e.g., individuals with a binary trait such as yes/no responses or defective/non-defective items), where K denotes the total number of successes in the population. Simple random sampling without replacement from this population involves drawing a sample of size n, resulting in X observed successes within the sample. This setup models scenarios with a fixed number of units in two mutually exclusive categories in the population, enabling inference about the unknown population proportion p.[31]

The number of successes X in the sample follows a hypergeometric distribution, which accounts for the dependencies introduced by sampling without replacement from a finite population. The probability mass function is given by

P(X = x) = C(K, x) · C(N − K, n − x) / C(N, n)

for max(0, n − (N − K)) ≤ x ≤ min(n, K), where C(a, b) denotes the binomial coefficient. This distribution arises directly from the uniform selection mechanism of simple random sampling, ensuring each subset of size n is equally likely.[31]

The unbiased estimator for the population proportion is p̂ = X/n, which provides a consistent estimate of p as n increases relative to N. Under the hypergeometric distribution, the exact variance of this estimator is

Var(p̂) = (p(1 − p)/n) · (N − n)/(N − 1),

reflecting the reduction in variability due to the finite population correction factor (N − n)/(N − 1). For N large relative to n, this approximates the binomial variance p(1 − p)/n, facilitating normal approximations for confidence intervals when n is sufficiently large (e.g., when np and n(1 − p) are both at least about 10).[32][33]

This framework finds application in polling, where simple random samples estimate binary voter preferences (e.g., support for a candidate), allowing construction of confidence intervals to predict election outcomes with quantified uncertainty. In quality control, it is used to assess the proportion of defective items in a production batch by sampling without replacement, aiding decisions on acceptability thresholds.[34][35]

Bayesian extensions incorporate prior knowledge via beta-binomial models, where a beta prior distribution on p (conjugate to the binomial likelihood) updates to a posterior beta distribution after observing the sample, particularly useful when approximating the hypergeometric with a binomial for large populations. This approach is increasingly applied in machine learning contexts, such as A/B testing for binary conversion rates, enabling posterior predictive checks and credible intervals that integrate uncertainty from small samples.[36]
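The exact and approximate variances can be compared numerically; the population values below are illustrative:

    from math import sqrt

    N, K, n = 5000, 1500, 200                          # population size, successes, sample size
    p = K / N
    var_exact = p * (1 - p) / n * (N - n) / (N - 1)    # hypergeometric variance of the sample proportion
    var_approx = p * (1 - p) / n                       # with-replacement (binomial) approximation
    print(round(sqrt(var_exact), 5), round(sqrt(var_approx), 5))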
Real-World Examples

In public opinion polling, simple random sampling has been instrumental in avoiding selection biases that plagued earlier methods. The 1936 Literary Digest poll, which surveyed over 10 million individuals selected from telephone directories and automobile registration lists, inaccurately predicted a landslide victory for Republican candidate Alfred Landon over incumbent Franklin D. Roosevelt due to its non-representative sample favoring wealthier, urban Republicans.[37] In contrast, George Gallup's American Institute of Public Opinion employed a more scientific quota sampling approach informed by random principles to achieve representativeness across demographics, correctly forecasting Roosevelt's victory with 61% of the vote.[38] This episode underscored simple random sampling's role in producing unbiased estimates of population opinions, influencing modern polling standards.[39]

In clinical trials, simple random assignment to treatment and control groups ensures unbiased estimation of intervention effects by equalizing known and unknown confounders across groups. The U.S. Food and Drug Administration's 1962 Kefauver-Harris Amendments mandated adequate and well-controlled studies for drug approval, establishing randomization as a core requirement to minimize bias and support causal inferences.[40] For instance, in evaluating new therapies for conditions like cancer or cardiovascular disease, researchers randomly allocate participants to arms, allowing valid comparisons of outcomes such as survival rates or symptom reduction.[41] This practice has become standard in Phase III trials, enabling regulators to approve treatments based on reliable evidence of efficacy and safety.[41]

Environmental monitoring frequently applies simple random sampling to assess contamination in natural populations, providing unbiased estimates for policy decisions. The U.S. Environmental Protection Agency's National Study of Chemical Residues in Lake Fish Tissue, conducted from 2000 to 2003, selected lakes and sampling sites using probability-based designs incorporating random selection to represent the nation's approximately 147,000 lakes and reservoirs.[42] Within selected lakes, fish were randomly captured and composited to measure contaminants like mercury and PCBs, with the study estimating that mercury concentrations exceeded human health screening values in 48.8% of lakes.[43] These findings informed advisories on fish consumption and guided remediation efforts, demonstrating how random sampling yields nationally generalizable contamination profiles; results have informed subsequent assessments such as the 2017 National Lakes Assessment.[44]

In the 2020s, simple random subsampling has gained prominence in big data contexts, particularly for training artificial intelligence models on massive datasets. For example, the 2020 Big Transfer (BiT) framework for visual representation learning used balanced random subsamples of the ImageNet dataset—containing over 1.2 million images across 1,000 classes—to efficiently train models while maintaining performance comparable to full-dataset training.[45] This approach reduces computational costs in resource-intensive tasks like image classification, allowing researchers to iterate quickly without sacrificing the representativeness needed for robust model generalization.[45] Such subsampling techniques have been widely adopted in machine learning pipelines to handle datasets exceeding terabytes in size.[46]

Despite its strengths, simple random sampling in practice faces challenges like non-response bias, where certain subgroups decline participation, skewing results. Mitigation strategies include post-stratification weighting to adjust for underrepresented groups and follow-up incentives to boost response rates, as implemented in large-scale surveys to restore balance.[47] For instance, in opinion polls, weighting by demographics like age and education has proven effective in correcting biases from low-response subsets.[48]

Overall, simple random sampling enables generalizability by ensuring each population unit has an equal selection chance, allowing inferences to extend reliably from samples to broader contexts across fields like polling, medicine, ecology, and AI.[49] This property underpins its enduring value in empirical research, fostering trustworthy conclusions that inform decisions at scale.[45]

Implementation
Algorithms
A simple random sample without replacement can be generated using the Fisher-Yates shuffle algorithm, which randomly permutes the entire population and selects the first n elements from the permuted list. This approach ensures each subset of size n from the population of size N is equally likely, with an expected time complexity of O(N). The modern version of the algorithm, as described by Knuth, iterates from the last index to the first, swapping each element with a randomly chosen element from the unshuffled portion of the array. The pseudocode for the Fisher-Yates shuffle is as follows:

procedure FisherYatesShuffle(array A of size N)
for i from N-1 downto 1 do
j ← random integer such that 0 ≤ j ≤ i
exchange A[j] and A[i]
end for
return the first n elements of A
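A direct Python transcription of the pseudocode above, returning an SRS of size n (illustrative):

    import random

    def fisher_yates_sample(population, n):
        a = list(population)
        for i in range(len(a) - 1, 0, -1):
            j = random.randint(0, i)     # uniform on 0..i inclusive
            a[i], a[j] = a[j], a[i]      # swap, as in the pseudocode
        return a[:n]                     # first n elements of the shuffled array

    print(fisher_yates_sample(range(100), 10))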
In practice, R's built-in sample() function supports both with- and without-replacement sampling from vectors.[50] Similarly, Python's random.sample() in the standard library generates without-replacement samples from sequences using an underlying pseudorandom number generator.[51]
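For instance, in Python (output varies from run to run):

    import random

    print(random.sample(range(1000), 10))      # simple random sample without replacement
    print(random.choices(range(1000), k=10))   # with replacement (duplicates possible)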
For applications requiring high security or true randomness, such as cryptographic sampling in 2025, quantum random number generators (QRNGs) can replace pseudorandom sources to drive these algorithms.[53] NIST's SP 800-90 series, updated in September 2025, endorses QRNG constructions based on quantum nonlocality for verifiable randomness in random bit generation.[54] The CURBy beacon, launched by NIST in June 2025, provides a public service for such quantum-entropy sources, ensuring unpredictability against classical adversaries.[53]
