Population proportion
In statistics, a population proportion, generally denoted by P or the Greek letter π,[1] is a parameter that describes a percentage value associated with a population. A census can be conducted to determine the actual value of a population parameter, but often a census is not practical due to its costs and time consumption. For example, the 2010 United States census showed that 83.7% of the American population was identified as not being Hispanic or Latino; the value of .837 is a population proportion. In general, the population proportion and other population parameters are unknown.
A population proportion is usually estimated through an unbiased sample statistic obtained from an observational study or experiment, resulting in a sample proportion, generally denoted by p̂ and, in some textbooks, by π̂.[2][3] For example, the National Technological Literacy Conference conducted a national survey of 2,000 adults to determine the percentage of adults who are economically illiterate; the study showed that 1,440 out of the 2,000 adults sampled did not understand what a gross domestic product is.[4] The value of 72% (or 1,440/2,000) is a sample proportion.
Mathematical definition
A proportion is mathematically defined as the ratio of the quantity of elements (a countable quantity) in a subset to the size of a set:

P = X / N,

where X is the count of successes in the population, and N is the size of the population.
This mathematical definition can be generalized to provide the definition for the sample proportion:

p̂ = x / n,

where x is the count of successes in the sample, and n is the size of the sample obtained from the population.[5][2]
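The two definitions can be checked numerically. The sketch below builds a toy population of 1,000 units (the 83.7% figure mirrors the census example above; everything else is illustrative) and compares P with the proportion from a simple random sample:

```python
import random

# Toy numeric check of P = X/N and p_hat = x/n.
# Population of N = 1000 units; 1 marks a "success".
population = [1] * 837 + [0] * 163
P = sum(population) / len(population)      # population proportion, X/N
print(P)                                   # 0.837

random.seed(42)
sample = random.sample(population, 100)    # simple random sample, n = 100
p_hat = sum(sample) / len(sample)          # sample proportion, x/n
print(p_hat)                               # an estimate near P
```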
Estimation
One of the main focuses of study in inferential statistics is determining the "true" value of a parameter. Generally the actual value for a parameter will never be found, unless a census is conducted on the population of study. However, there are statistical methods that can be used to get a reasonable estimation for a parameter. These methods include confidence intervals and hypothesis testing.
Estimating the value of a population proportion can be of great implication in the areas of agriculture, business, economics, education, engineering, environmental studies, medicine, law, political science, psychology, and sociology.
A population proportion can be estimated through the use of a confidence interval known as a one-sample proportion in the Z-interval, whose formula is given below:

p̂ ± z* √( p̂(1 − p̂) / n ),

where p̂ is the sample proportion, n is the sample size, and z* is the upper (1 − C)/2 critical value of the standard normal distribution for a level of confidence C.[6]
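As a minimal sketch, the interval formula can be transcribed directly into code, with z* defaulting to 1.96 for 95% confidence (the inputs 272 and 400 match the worked election example later in the article):

```python
import math

def proportion_z_interval(x, n, z_star=1.96):
    """One-sample proportion Z-interval: p_hat +/- z* * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lo, hi = proportion_z_interval(272, 400)   # 272 successes out of 400
print(round(lo, 5), round(hi, 5))          # 0.63429 0.72571
```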
Proof
To derive the formula for the one-sample proportion in the Z-interval, a sampling distribution of sample proportions needs to be taken into consideration. The mean of the sampling distribution of sample proportions is usually denoted as μ_p̂ = P, and its standard deviation is denoted as:[2]

σ_p̂ = √( P(1 − P) / n ).

Since the value of P is unknown, an unbiased statistic p̂ will be used for P. The mean and standard deviation are rewritten respectively as:

μ_p̂ = p̂ and σ_p̂ = √( p̂(1 − p̂) / n ).

Invoking the central limit theorem, the sampling distribution of sample proportions is approximately normal, provided that the sample is reasonably large and unskewed.

Suppose the following probability is calculated:

P( −z* < (p̂ − P) / √( p̂(1 − p̂)/n ) < z* ) = C,

where −z* and +z* are the standard critical values.

The inequality

−z* < (p̂ − P) / √( p̂(1 − p̂)/n ) < z*

can be algebraically re-written as follows:

p̂ − z* √( p̂(1 − p̂)/n ) < P < p̂ + z* √( p̂(1 − p̂)/n ).

From the algebraic work done above, it is evident with a level of certainty C that P could fall between the values of:

p̂ ± z* √( p̂(1 − p̂)/n ).
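The derivation's claim, that intervals built this way capture P with probability close to C, can be checked by simulation. The sketch below (an illustration, not part of the original derivation) draws many samples from a population with known P = 0.3 and counts how often the 95% interval covers it:

```python
import math
import random

def z_interval(x, n, z_star=1.96):
    p_hat = x / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Draw many samples of size n from a population with known P and count
# how often the 95% interval captures P; coverage should sit near 0.95.
random.seed(0)
P_true, n, trials = 0.3, 200, 2000
hits = 0
for _ in range(trials):
    x = sum(random.random() < P_true for _ in range(n))
    lo, hi = z_interval(x, n)
    hits += lo <= P_true <= hi
print(hits / trials)  # close to 0.95
```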
Conditions for inference
In general, the formula used for estimating a population proportion requires substitutions of known numerical values. However, these numerical values cannot be "blindly" substituted into the formula, because statistical inference requires that the estimation of an unknown parameter be justifiable. For a parameter's estimation to be justifiable, there are three conditions that need to be verified:
- The data's individual observations have to be obtained from a simple random sample of the population of interest.
- The data's individual observations have to display normality. This can be assumed mathematically with the following definition:
- Let n be the sample size of a given random sample and let p̂ be its sample proportion. If n p̂ ≥ 10 and n(1 − p̂) ≥ 10, then the data's individual observations display normality.
- The data's individual observations have to be independent of each other. This can be assumed mathematically with the following definition:
- Let N be the size of the population of interest and let n be the sample size of a simple random sample of the population. If N ≥ 10n, then the data's individual observations are independent of each other.
The conditions for SRS, normality, and independence are sometimes referred to as the conditions for the inference tool box in most statistics textbooks.[citation needed]
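A small helper can encode the normality and independence checks above; the function name and thresholds simply follow this section's rules of thumb, and the SRS condition, being a matter of study design, is assumed:

```python
def inference_conditions_ok(n, x, N):
    """Check the normality and independence conditions for inference.
    Normality: n*p_hat >= 10 and n*(1 - p_hat) >= 10.
    Independence: N >= 10*n. The SRS condition is assumed by design."""
    p_hat = x / n
    normality = n * p_hat >= 10 and n * (1 - p_hat) >= 10
    independence = N >= 10 * n
    return normality and independence

print(inference_conditions_ok(400, 272, 4000))  # True: both conditions hold
print(inference_conditions_ok(400, 272, 1000))  # False: population under 10n
```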
Example
[edit]Suppose a presidential election is taking place in a democracy. A random sample of 400 eligible voters in the democracy's voter population shows that 272 voters support candidate B. A political scientist wants to determine what percentage of the voter population support candidate B.
To answer the political scientist's question, a one-sample proportion in the Z-interval with a confidence level of 95% can be constructed in order to determine the population proportion of eligible voters in this democracy that support candidate B.
Solution
It is known from the random sample that p̂ = 272/400 = 0.68 with sample size n = 400. Before a confidence interval is constructed, the conditions for inference will be verified.
- Since a random sample of 400 voters was obtained from the voting population, the condition for a simple random sample has been met.
- Let n = 400 and p̂ = 0.68; it will be checked whether n p̂ ≥ 10 and n(1 − p̂) ≥ 10.
- n p̂ = 400 × 0.68 = 272 ≥ 10 and n(1 − p̂) = 400 × 0.32 = 128 ≥ 10
- The condition for normality has been met.
- Let N be the size of the voter population in this democracy, and let n = 400. If N ≥ 10n = 4,000, then there is independence.
- The population size for this democracy's voters can be assumed to be at least 4,000. Hence, the condition for independence has been met.
With the conditions for inference verified, it is permissible to construct a confidence interval.
Let C = 0.95 and α = 1 − C = 1 − 0.95 = 0.05.

To solve for the critical value z* = z_{α/2} = z_{0.025}, the expression 1 − α/2 = 1 − 0.025 = 0.975 is used.

By examining a standard normal bell curve, the value for z_{0.025} can be determined by identifying which standard score gives the standard normal curve an upper tail area of 0.0250, or an area of 1 − 0.0250 = 0.9750. The value for z_{0.025} can also be found through a table of standard normal probabilities.

From a table of standard normal probabilities, the value of z that gives an area of 0.9750 is 1.96. Hence, the value for z_{0.025} is 1.96.

The values for p̂ = 0.68, n = 400, and z_{0.025} = 1.96 can now be substituted into the formula for the one-sample proportion in the Z-interval:

0.68 ± 1.96 √( 0.68(1 − 0.68) / 400 ) = 0.68 ± 0.04571, which gives the interval (0.63429, 0.72571).
Based on the conditions of inference and the formula for the one-sample proportion in the Z-interval, it can be concluded with a 95% confidence level that the percentage of the voter population in this democracy supporting candidate B is between 63.429% and 72.571%.
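The arithmetic of this solution can be verified in a few lines:

```python
import math

# Reproduce the solution: 272 of 400 sampled voters support candidate B.
n, x, z_star = 400, 272, 1.96
p_hat = x / n                                   # 0.68
margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - margin, p_hat + margin
print(f"({lo:.5f}, {hi:.5f})")                  # (0.63429, 0.72571)
```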
Value of the parameter in the confidence interval range
A commonly asked question in inferential statistics is whether the parameter is included within a confidence interval. The only way to answer this question is for a census to be conducted. Referring to the example given above, the probability that the population proportion is in the range of the confidence interval is either 1 or 0. That is, the parameter is included in the interval range or it is not. The main purpose of a confidence interval is to better illustrate what the ideal value for a parameter could possibly be.
Common errors and misinterpretations from estimation
A very common error that arises from the construction of a confidence interval is the belief that the level of confidence, such as C = 95%, means a 95% chance. This is incorrect. The level of confidence is based on a measure of certainty, not probability. Hence, the values of C fall strictly between 0 and 1.
Estimation of P using ranked set sampling
A more precise estimate of P can be obtained by choosing ranked set sampling instead of simple random sampling.[7][8]
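As an illustrative sketch only (the cited papers use concomitant-based designs whose details are not reproduced here), the basic ranked set sampling mechanics can be simulated: for each of k positions, draw a set of k units, rank them by a cheap concomitant variable, and measure only the i-th ranked unit of the i-th set:

```python
import random

def rss_proportion(population, k=3, cycles=40, rng=None):
    """Ranked set sampling estimate of a proportion: in each cycle, for
    i = 1..k, draw a set of k units, rank them by the concomitant value,
    and measure the binary indicator of the i-th ranked unit only."""
    rng = rng or random.Random(1)
    measured = []
    for _ in range(cycles):
        for i in range(k):
            judgement_set = rng.sample(population, k)
            judgement_set.sort(key=lambda unit: unit[1])  # rank by concomitant
            measured.append(judgement_set[i][0])          # measure one unit
    return sum(measured) / len(measured)

# Toy population: binary indicator plus a noisy concomitant correlated with it.
random.seed(0)
values = [random.random() for _ in range(5000)]
pop = [(1 if v > 0.6 else 0, v + random.gauss(0, 0.1)) for v in values]
est = rss_proportion(pop)
print(round(est, 3))  # near the true proportion of about 0.4
```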
References
[edit]- ^ Introduction to Statistical Investigations. Wiley. 18 August 2014. ISBN 978-1-118-95667-0.
- ^ a b c Weisstein, Eric W. "Sample Proportion". mathworld.wolfram.com. Retrieved 2020-08-22.
- ^ "6.3: The Sample Proportion". Statistics LibreTexts. 2014-04-16. Retrieved 2020-08-22.
- ^ Ott, R. Lyman (1993). An Introduction to Statistical Methods and Data Analysis. Duxbury Press. ISBN 0-534-93150-2.
- ^ Weisstein, Eric (1998). CRC Concise Encyclopedia of Mathematics. Chapman & Hall/CRC. Bibcode:1998ccem.book.....W.
- ^ Hinders, Duane (2008). Annotated Teacher's Edition The Practice of Statistics. W.H. Freeman. ISBN 978-0-7167-7703-8.
- ^ Abbasi, Azhar Mehmood; Yousaf Shad, Muhammad (2021-05-15). "Estimation of population proportion using concomitant based ranked set sampling". Communications in Statistics – Theory and Methods. 51 (9): 2689–2709. doi:10.1080/03610926.2021.1916529. ISSN 0361-0926. S2CID 236554602.
- ^ Abbasi, Azhar Mehmood; Shad, Muhammad Yousaf (2021-05-15). "Estimation of population proportion using concomitant based ranked set sampling". Communications in Statistics – Theory and Methods. 51 (9): 2689–2709. doi:10.1080/03610926.2021.1916529. ISSN 0361-0926. S2CID 236554602.
Fundamentals
Mathematical Definition
In statistics, the population proportion, denoted by the lowercase letter p, is defined as the ratio of the number of elements in a finite population that possess a specific characteristic of interest, often termed "successes", to the total size of the population.[3] Mathematically, this is expressed as p = X/N, where X represents the number of successes and N is the total population size, with X being a non-negative integer such that 0 ≤ X ≤ N.[4][5] This parameter is interpreted as a fixed but typically unknown value between 0 and 1, inclusive, that quantifies the true fraction of the population exhibiting the characteristic; for instance, it might represent the exact proportion of voters in a city favoring a particular policy.[4] To distinguish it from the sample proportion, which is an estimate derived from a subset of the population and denoted by p̂, the population proportion remains a descriptive measure of the entire finite population without reference to sampling variability.[5] In finite populations, p is an exact value computable if the full population is enumerated, inherently bounded by 0 ≤ p ≤ 1; for infinite populations, it generalizes to a limiting probability, though the core concept retains the interpretive range of [0, 1].[3] This definition aligns with underlying probability models, such as the binomial distribution, where p serves as the success probability parameter for independent trials.[4]
Relation to Probability Models
The population proportion p, defined as the ratio of successes X to total units N in a finite population, relates to probability models used in sampling. For sampling with replacement or from infinite populations, the number of successes X in a sample of size n follows a binomial distribution Bin(n, p), where p is the success probability parameter analogous to the population proportion.[6] For finite populations sampled without replacement, the hypergeometric distribution provides the exact model, where the number of successes in a sample depends on the fixed number of successes K in the population, with p = K/N entering as the population proportion parameter. However, when the population size is large relative to the sample size (typically n/N ≤ 0.05), the binomial distribution approximates the hypergeometric well, justifying its use for modeling sample proportions in many practical scenarios.[7] A key implication of these models is the asymptotic normality of the sample proportion p̂ for large sample sizes, as per the Central Limit Theorem, which approximates the sampling distribution of p̂ as normal with mean p and variance p(1 − p)/n, providing a foundation for inferential statistics without relying on exact distributional forms.[8]
Estimation Basics
Point Estimation
In point estimation for a population proportion p, the sample proportion p̂ = X/n serves as the primary estimator, where X is the number of successes in a random sample of size n from a Bernoulli process with success probability p.[9] This estimator provides a single-value approximation of the unknown population proportion based on observed data. The sample proportion is unbiased for p, meaning its expected value equals the true parameter: E[p̂] = p. This property follows from the linearity of expectation applied to the underlying binomial model, where the sample successes follow a Bin(n, p) distribution, so E[X] = np and thus E[p̂] = p.[10] As a method-of-moments estimator, p̂ is obtained by equating the first sample moment (the sample mean of indicator variables for successes) to the corresponding population moment p, yielding the same simple proportion formula.[9] This approach ensures the estimator aligns the empirical mean with the theoretical expectation under the Bernoulli model. The sample proportion is consistent, converging in probability to the true p as the sample size n → ∞, by the law of large numbers applied to the sequence of independent Bernoulli trials.[11] This asymptotic property guarantees that larger samples yield estimates arbitrarily close to the population value with high probability. The standard error of p̂ quantifies its sampling variability and is given by SE(p̂) = √( p(1 − p)/n ), which measures the typical deviation of p̂ from p across repeated samples. In practice, since p is unknown, it is estimated by substituting p̂ to obtain the estimated standard error √( p̂(1 − p̂)/n ).
Properties of the Estimator
The sample proportion p̂ = X/n, where X is the number of successes in n trials, serves as the maximum likelihood estimator (MLE) for the population proportion p.[12] As an MLE under the binomial likelihood, p̂ achieves asymptotic efficiency by attaining the Cramér-Rao lower bound, meaning its asymptotic variance equals the reciprocal of the Fisher information, p(1 − p)/n.[12] This property holds under standard regularity conditions, establishing p̂ as the optimal unbiased estimator in the large-sample limit.[12] Since p̂ is unbiased, with E[p̂] = p for any n, its mean squared error (MSE) simplifies to its variance: MSE(p̂) = p(1 − p)/n.[13] This MSE quantifies the average squared deviation of p̂ from p; it shrinks as p approaches 0 or 1 and reaches a maximum value of 1/(4n) at p = 1/2.[13] By the central limit theorem, the asymptotic distribution is given by √n (p̂ − p) → N(0, p(1 − p)) in distribution, providing a normal approximation for large n.[12] For nonlinear functions g(p̂), the delta method extends this to √n (g(p̂) − g(p)) → N(0, [g′(p)]² p(1 − p)), facilitating inference on transformed proportions.[12] In small samples, p̂ exhibits zero bias but elevated variance, especially for p near 0 or 1, where the binomial distribution is skewed and the normal approximation falters.[12] Adjustments like the Wilson score estimator, which centers the estimate at (X + z²/2)/(n + z²), approximately (X + 2)/(n + 4) for 95% confidence (z ≈ 1.96), offer finite-sample improvements by stabilizing estimates and reducing coverage errors in intervals derived from p̂.[14] Relative to estimating a population mean from a [0, 1]-bounded variable, p̂ shares the same maximum variance bound of 1/(4n), highlighting its efficiency parity in comparable settings.[12]
Interval Estimation
Confidence Interval Derivation
The standard confidence interval for a population proportion is derived using the normal approximation to the sampling distribution of the sample proportion p̂ = X/n, where X is the number of successes in n independent Bernoulli trials. Under the central limit theorem, for large n, p̂ is asymptotically normally distributed with mean p and variance p(1 − p)/n, so (p̂ − p) / √( p(1 − p)/n ) is approximately standard normal. To construct an approximate confidence interval, this pivotal quantity is transformed into an interval for p by replacing the unknown p in the denominator with p̂, yielding the Wald interval:

p̂ ± z_{α/2} √( p̂(1 − p̂)/n ),

where z_{α/2} is the upper α/2 quantile of the standard normal distribution. This approximation inverts the large-sample Wald test, evaluating the standard error at the maximum likelihood estimate p̂. The coverage probability of this interval is approximately 1 − α for large n, but it is exact only in the asymptotic limit; finite-sample performance can deviate substantially, particularly when p is near 0 or 1, where the interval may undercover (with actual coverage below 0.93 in many cases) or produce degenerate bounds like [0, 0] when X = 0. While the Wald interval is simple, alternatives such as the Wilson score interval, which inverts the score test to better center the interval and improve coverage, and the Agresti-Coull interval, an adjusted Wald method adding pseudocounts for continuity correction, often perform more reliably across a wider range of n and p. For variance stabilization, transformations like the arcsine square root arcsin(√p̂) or the logit ln( p̂/(1 − p̂) ) can be applied to p̂ before constructing normal-approximation intervals on the transformed scale, then back-transformed to obtain bounds for p with more uniform variance.
Conditions for Inference
For valid inference on a population proportion using normal-based methods, such as confidence intervals, several key conditions must be satisfied to ensure the approximation's accuracy. The data must consist of binary outcomes, where each observation is a success or failure, as the proportion estimator relies on the binomial distribution model. This dichotomous nature precludes direct application to continuous or multi-category data without adaptation, such as grouping categories into binary form. A primary requirement is the normality condition for the sampling distribution of the sample proportion p̂. A common rule of thumb is that the expected numbers of successes and failures should each be at least 5, i.e., np ≥ 5 and n(1 − p) ≥ 5, where n is the sample size and p is the population proportion. Some sources recommend a stricter criterion of np ≥ 10 and n(1 − p) ≥ 10 to improve the approximation, particularly when p is close to 0 or 1. These conditions help ensure that the binomial distribution is sufficiently symmetric and that the central limit theorem applies effectively.[15][16][17] Independence among observations is another fundamental assumption, typically achieved through simple random sampling (SRS) from the population. Under SRS, each unit has an equal probability of selection, and observations are independent, avoiding issues like clustering or temporal dependence that could bias the variance estimate. If sampling without replacement, the sample size should generally be less than 10% of the population size to maintain approximate independence; otherwise, a finite population correction (FPC) is necessary.[16][18] For finite populations, additional considerations apply when the sample size n is not negligible relative to the population size N. If n/N > 0.05, the standard variance of p̂ must be adjusted by the FPC factor (N − n)/(N − 1) to account for the reduced variability in sampling without replacement.
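The FPC adjustment can be wrapped in a small helper; the function name is illustrative, and the correction is applied to the standard error as the square root of the variance factor:

```python
import math

def se_with_fpc(p_hat, n, N=None):
    """Estimated standard error of p_hat; applies the finite population
    correction factor sqrt((N - n)/(N - 1)) when N is supplied."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Sampling n = 300 from N = 1000 (n/N = 0.3 > 0.05): the FPC shrinks the SE.
print(round(se_with_fpc(0.5, 300), 4))          # 0.0289 without correction
print(round(se_with_fpc(0.5, 300, N=1000), 4))  # smaller with the FPC
```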
For very small populations where N is limited and the normal approximation fails, exact inference based on the hypergeometric distribution is preferred, as it models the exact probability of observing x successes in a sample of size n from a population with K total successes.[19][3] To diagnose violations of these conditions, statistical tests can be employed. The chi-square goodness-of-fit test assesses whether the observed frequencies align with the expected binomial distribution under the null hypothesis. For small samples or when normality conditions are unmet, exact tests such as the binomial exact test or Fisher's exact test provide reliable alternatives by computing p-values directly from the discrete distribution without approximation.[20][21]
Practical Applications
Example Computation
Consider a hypothetical survey in which 60 out of 100 randomly selected voters indicate a preference for candidate A. The point estimate of the population proportion is calculated as p̂ = 60/100 = 0.60.[22] To construct a 95% confidence interval, apply the standard formula p̂ ± z √( p̂(1 − p̂)/n ), where z = 1.96 corresponds to the 95% confidence level from the standard normal distribution.[22] The standard error is √( 0.60 × 0.40 / 100 ) ≈ 0.049, so the margin of error is 1.96 × 0.049 ≈ 0.096. Thus, the confidence interval is approximately 0.60 ± 0.10, or (0.50, 0.70).[22] This interval means that we are 95% confident the true population proportion of voters preferring candidate A lies between 0.50 and 0.70.[23] The normal approximation underlying this interval is valid because np̂ = 60 ≥ 10 and n(1 − p̂) = 40 ≥ 10.[22] To illustrate the effect of sample size on precision, suppose the same proportion is observed in a smaller sample of n = 20, with 12 successes, yielding p̂ = 0.60. The standard error becomes √( 0.60 × 0.40 / 20 ) ≈ 0.110, and the margin of error is 1.96 × 0.110 ≈ 0.215, resulting in a wider 95% confidence interval of approximately (0.39, 0.81).[22]
Sample Size Determination
Sample size determination is essential in surveys and studies aiming to estimate a population proportion p with a specified level of precision, typically defined by the margin of error in a confidence interval. The required sample size n ensures that the estimate is sufficiently close to p with high probability, balancing cost and accuracy. For large populations, the formula derives from the standard error of the proportion under the normal approximation. The margin of error E for a confidence interval around p̂ is given by

E = z √( p(1 − p) / n ),

where z is the critical value from the standard normal distribution (e.g., 1.96 for 95% confidence). Solving for n yields

n = z² p(1 − p) / E².

This formula requires a prior estimate of p; if unknown, a conservative approach uses p = 0.5 to maximize the variance p(1 − p), resulting in

n = z² / (4E²).

For instance, at 95% confidence (z = 1.96) and E = 0.05, this gives n = 1.96² / (4 × 0.05²) ≈ 385.[24][25] For finite populations of size N, the formula is adjusted using the finite population correction factor to account for reduced variability when sampling without replacement:

n_adj = n₀ / ( 1 + (n₀ − 1)/N ),

where n₀ is the initial sample size from the infinite-population formula. This adjustment is particularly relevant when n₀/N > 0.05.[3] While the primary focus is estimation precision, sample size for hypothesis testing on proportions incorporates power 1 − β against an alternative proportion p₁, under the null H₀: p = p₀:

n = [ z_{1−α/2} √( p₀(1 − p₀) ) + z_{1−β} √( p₁(1 − p₁) ) ]² / (p₁ − p₀)².

This ensures adequate power to detect meaningful differences, though estimation-focused designs prioritize margin of error over power.[26] Software tools facilitate these calculations; in R, the samplingbook package's sample.size.prop function computes n for proportions, including finite corrections, yielding n = 385 for the example above. If a pilot study provides an initial p̂, substitute it into the formula to refine n, improving efficiency over the conservative estimate.[27]
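The large-population and finite-population formulas can be combined into one sketch (the function name is illustrative; it mirrors what R's sample.size.prop computes):

```python
import math

def sample_size_for_proportion(E, z=1.96, p=0.5, N=None):
    """Required n for margin of error E at critical value z; p = 0.5 is the
    conservative default. If a finite population size N is supplied,
    apply the adjustment n0 / (1 + (n0 - 1)/N)."""
    n0 = z**2 * p * (1 - p) / E**2
    if N is not None:
        n0 = n0 / (1 + (n0 - 1) / N)
    return math.ceil(n0)

print(sample_size_for_proportion(0.05))          # 385
print(sample_size_for_proportion(0.05, N=2000))  # smaller after the correction
```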
