Generalizability theory
Generalizability theory, or G theory, is a statistical framework for conceptualizing, investigating, and designing reliable observations. It is used to determine the reliability (i.e., reproducibility) of measurements under specific conditions. It is particularly useful for assessing the reliability of performance assessments. It was originally introduced by Lee Cronbach, N. Rajaratnam, and Goldine Gleser in 1963.
Overview
In G theory, sources of variation are referred to as facets. Facets are similar to the "factors" used in analysis of variance, and may include persons, raters, items/forms, time, and settings among other possibilities. These facets are potential sources of error and the purpose of generalizability theory is to quantify the amount of error caused by each facet and interaction of facets. The usefulness of data gained from a G study is crucially dependent on the design of the study. Therefore, the researcher must carefully consider the ways in which he/she hopes to generalize any specific results. Is it important to generalize from one setting to a larger number of settings? From one rater to a larger number of raters? From one set of items to a larger set of items? The answers to these questions will vary from one researcher to the next, and will drive the design of a G study in different ways.
In addition to deciding which facets the researcher generally wishes to examine, it is necessary to determine which facet will serve as the object of measurement (i.e., the systematic source of variance) for the purpose of analysis. The remaining facets of interest are then considered to be sources of measurement error. In most cases, the object of measurement will be the person to whom a number/score is assigned. In other cases it may be a group of performers such as a team or classroom. Ideally, nearly all of the measured variance will be attributed to the object of measurement (e.g. individual differences), with only a negligible amount of variance attributed to the remaining facets (e.g., rater, time, setting).
The results from a G study can also be used to inform a decision, or D, study. In a D study, we can ask the hypothetical question of "what would happen if different aspects of this study were altered?" For example, a soft drink company might be interested in assessing the quality of a new product through use of a consumer rating scale. By employing a D study, it would be possible to estimate how the consistency of quality ratings would change if consumers were asked 10 questions instead of 2, or if 1,000 consumers rated the soft drink instead of 100. By employing simulated D studies, it is therefore possible to examine how the generalizability coefficients (similar to reliability coefficients in classical test theory) would change under different circumstances, and consequently determine the ideal conditions under which our measurements would be the most reliable.
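A minimal sketch of such a simulated D study is shown below, assuming a crossed products × consumers × questions design with the soft drinks as the objects of measurement; the variance components are invented for illustration, and the function simply recomputes a generalizability coefficient for different numbers of consumers and questions.

```python
# Sketch: a simulated D study for the soft-drink example, assuming products are
# the objects of measurement in a crossed products x consumers x questions design.
# The G-study variance components below are invented for illustration.

components = {
    "product": 0.40,            # systematic differences between products
    "prod_x_consumer": 0.30,    # product-by-consumer interaction
    "prod_x_question": 0.10,    # product-by-question interaction
    "residual": 0.60,           # three-way interaction confounded with error
}

def g_coefficient(var, n_consumers, n_questions):
    """Generalizability coefficient for relative comparisons among products."""
    relative_error = (var["prod_x_consumer"] / n_consumers
                      + var["prod_x_question"] / n_questions
                      + var["residual"] / (n_consumers * n_questions))
    return var["product"] / (var["product"] + relative_error)

for n_c, n_q in [(100, 2), (100, 10), (1000, 2), (1000, 10)]:
    print(f"{n_c:4d} consumers, {n_q:2d} questions -> "
          f"g = {g_coefficient(components, n_c, n_q):.3f}")
```

Under these assumed components, adding questions improves the coefficient more than adding consumers, which is exactly the kind of trade-off a D study is meant to reveal.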
Comparison with classical test theory
The focus of classical test theory (CTT) is on determining the error of measurement. Perhaps the most famous model of CTT is the equation X = T + e, where X is the observed score, T is the true score, and e is the error involved in measurement. Although e could represent many different types of error, such as rater or instrument error, CTT only allows us to estimate one type of error at a time. Essentially it throws all sources of error into one error term. This may be suitable in the context of highly controlled laboratory conditions, but variance is a part of everyday life. In field research, for example, it is unrealistic to expect that the conditions of measurement will remain constant. Generalizability theory acknowledges and allows for variability in assessment conditions that may affect measurements. The advantage of G theory lies in the fact that researchers can estimate what proportion of the total variance in the results is due to the individual factors that often vary in assessment, such as setting, time, items, and raters.
Another important difference between CTT and G theory is that the latter approach takes into account how the consistency of outcomes may change if a measure is used to make absolute versus relative decisions. An example of an absolute, or criterion-referenced, decision would be when an individual's test score is compared to a cut-off score to determine eligibility or diagnosis (e.g., a child's score on an achievement test is used to determine eligibility for a gifted program). In contrast, an example of a relative, or norm-referenced, decision would be when the individual's test score is used to either (a) determine relative standing as compared to his/her peers (e.g., a child's score on a reading subtest is used to determine which reading group he/she is placed in), or (b) make intra-individual comparisons (e.g., comparing previous versus current performance within the same individual). The type of decision that the researcher is interested in will determine which formula should be used to calculate the generalizability coefficient (similar to a reliability coefficient in CTT).
References
- Brennan, R. L. (2001). Generalizability Theory. New York: Springer-Verlag. ISBN 978-0-387-95282-6.
- Chiu, C.W.C. (2001). Scoring performance assessments based on judgements: generalizability theory. New York: Kluwer. ISBN 978-0-7923-7499-2.
- Crocker, L.; Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Harcourt Brace. ISBN 978-0-495-39591-1.
- Cronbach, L.J.; Gleser, G.C.; Nanda, H.; Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles (PDF). New York: John Wiley. ISBN 0-471-18850-6.
- Cronbach, L.J.; Rajaratnam, N.; Gleser, G.C. (1963). "Theory of generalizability: A liberalization of reliability theory". The British Journal of Statistical Psychology. 16: 137–163. doi:10.1111/j.2044-8317.1963.tb00206.x.
- Shrout, P. E.; Fleiss, J. L. (1979). "Intraclass correlations: Uses in assessing rater reliability". Psychological Bulletin. 86 (2): 420–428. doi:10.1037/0033-2909.86.2.420.
- Shavelson, R.J.; Webb, N.M. (1991). Generalizability Theory: A Primer. Thousand Oaks, CA: Sage. ISBN 978-0803937451.
Generalizability theory
Introduction
Definition and Scope
Generalizability theory (G theory) is a statistical framework developed to estimate the reliability of behavioral measurements by partitioning the variance of observed scores into multiple components, including both systematic effects associated with the objects of measurement (such as persons) and various unsystematic sources of error arising from conditions of measurement.[5] This approach extends classical test theory by recognizing that reliability is not determined by a single source of error but by multiple facets that can interact in complex ways.[5] The scope of G theory primarily encompasses applications in the behavioral sciences, including education, psychology, and social sciences, where it is used to evaluate the dependability of scores derived from tests, ratings, observations, or performance assessments influenced by varying conditions such as different raters, test items, tasks, or occasions.[6] In educational settings, for instance, it assesses the reliability of student achievement scores across multiple forms of assessment, while in psychological research, it examines the consistency of behavioral ratings under diverse observational contexts.[7]

The key purpose of G theory is to enable the generalization of measurement results from specific, observed conditions to a broader universe of generalization, which defines the set of allowable conditions over which inferences are intended to hold, thereby providing a more comprehensive evaluation of reliability than traditional single-facet estimates like test-retest or inter-rater reliability.[5] Unlike classical approaches that isolate one error source at a time, G theory simultaneously accounts for multiple facets to yield estimates that better reflect real-world measurement variability.[5]

At its foundation, the observed score in G theory can be expressed as $X = \mu + \nu + e$, where $X$ is the observed score, $\mu$ is the mean of the universe score (the expected value over the universe of generalization), $\nu$ represents systematic effects (such as person-related variance), and $e$ denotes random error components.[5] This model underscores the theory's emphasis on decomposing variance to understand and optimize generalizability.[5]

Historical Development
Generalizability theory emerged as an extension of earlier psychometric frameworks, building on the foundations of analysis of variance (ANOVA) introduced by Ronald A. Fisher in the 1920s for agricultural experiments and later adapted to measurement reliability in psychology. Fisher's ANOVA techniques, detailed in his 1925 work Statistical Methods for Research Workers, provided the statistical machinery for partitioning variance sources, which mid-20th-century psychometrics extended to reliability estimation, such as Cyril J. Hoyt's 1941 application of ANOVA to test reliability by treating persons and items as variance factors. These precursors addressed single sources of error in classical test theory but lacked a unified approach to multiple error facets.

The theory's formal origins trace to 1963, when Lee J. Cronbach, Nageswari Rajaratnam, and Goldine C. Gleser published "Theory of Generalizability: A Liberalization of Reliability Theory" in the British Journal of Statistical Psychology, introducing a framework to generalize beyond fixed conditions by considering multiple sources of variation.[8] This article critiqued the limitations of classical test theory's assumption of a single error term, proposing instead a more flexible model for behavioral measurements that incorporated ANOVA-based variance decomposition across conditions like raters or tasks. Building on this, Cronbach collaborated with Gleser, Harinder Nanda, and Rajaratnam to formalize the approach in their 1972 book The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, which synthesized the theory into a comprehensive system for estimating dependability in multifaceted assessments.

By the 1980s, generalizability theory had gained institutional recognition, with its methods referenced in the American Psychological Association's Standards for Educational and Psychological Testing (1985 edition) as a tool for evaluating the contributions of multiple variance sources to reliability. This adoption marked its evolution from a novel psychometric innovation to a standard in reliability analysis, particularly in educational testing, where early applications focused on generalizing scores across items, occasions, and raters to inform decisions like student evaluation. The theory's shift from classical test theory's singular error focus to multifaceted error analysis enabled more nuanced interpretations of measurement consistency, addressing real-world complexities in behavioral data.

Theoretical Foundations
Variance Components Model
The variance components model forms the statistical foundation of generalizability theory, employing analysis of variance (ANOVA) techniques to decompose the total observed score variance into distinct components attributable to the objects of measurement and various facets, along with their interactions. This approach extends classical test theory by partitioning error variance into multiple sources, such as systematic biases from specific facets (e.g., items or raters) and random interactions, enabling a more nuanced understanding of measurement reliability. In a fully crossed design involving persons (p, the objects of measurement), items (i), and raters (r), the total variance of observed scores, denoted $\sigma^2(X_{pir})$, is expressed as:

$\sigma^2(X_{pir}) = \sigma^2(p) + \sigma^2(i) + \sigma^2(r) + \sigma^2(pi) + \sigma^2(pr) + \sigma^2(ir) + \sigma^2(pir) + \sigma^2(e)$

Here, $\sigma^2(p)$ represents the variance of the universe scores (true scores generalized over facets), while the remaining terms capture error variances: main effects of facets like $\sigma^2(i)$ and $\sigma^2(r)$ contribute to absolute error, interactions such as $\sigma^2(pi)$ and $\sigma^2(pr)$ to relative error, the triple interaction $\sigma^2(pir)$ to residual relative error, and $\sigma^2(e)$ to any unexplained residual variance.

The model assumes normality of error terms for valid ANOVA application, independence among effects unless interactions are explicitly modeled, and typically treats facets as random effects drawn from an infinite universe, though fixed effects may be specified for particular conditions. These assumptions ensure that variance components reflect generalizable sources rather than sample-specific artifacts.

To estimate these components, expected mean squares (EMS) derived from the ANOVA table are equated to observed mean squares and solved algebraically. For instance, in the p × i × r design, the EMS for the p × i interaction term is $\sigma^2(pir) + \sigma^2(e) + n_r\,\sigma^2(pi)$, where $n_r$ is the number of raters; solving for $\sigma^2(pi)$ involves subtracting the observed mean square for the p × i × r residual from the observed mean square for p × i and dividing by $n_r$. This method yields unbiased estimates under the random effects model, providing the building blocks for subsequent reliability analyses.
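As a concrete sketch of this algebra, the following code solves the EMS equations of the fully crossed p × i × r random-effects design for its variance components, given observed mean squares; the mean squares and sample sizes here are hypothetical placeholders for values that would come from a real G-study ANOVA table.

```python
# Sketch: solving the EMS equations of a fully crossed p x i x r random-effects
# design for its variance components. Mean squares and sample sizes below are
# hypothetical; in practice they come from the ANOVA table of a G-study.

def variance_components(ms, n_p, n_i, n_r):
    """Equate observed mean squares to their expectations and solve.

    `ms` maps effect names ('p', 'i', 'r', 'pi', 'pr', 'ir', 'pir')
    to observed mean squares; 'pir' is the residual (pir,e) term.
    """
    comps = {}
    comps["pir,e"] = ms["pir"]                       # EMS(pir) = var(pir,e)
    comps["pi"] = (ms["pi"] - ms["pir"]) / n_r       # EMS(pi)  = var(pir,e) + n_r*var(pi)
    comps["pr"] = (ms["pr"] - ms["pir"]) / n_i       # EMS(pr)  = var(pir,e) + n_i*var(pr)
    comps["ir"] = (ms["ir"] - ms["pir"]) / n_p       # EMS(ir)  = var(pir,e) + n_p*var(ir)
    comps["p"] = (ms["p"] - ms["pi"] - ms["pr"] + ms["pir"]) / (n_i * n_r)
    comps["i"] = (ms["i"] - ms["pi"] - ms["ir"] + ms["pir"]) / (n_p * n_r)
    comps["r"] = (ms["r"] - ms["pr"] - ms["ir"] + ms["pir"]) / (n_p * n_i)
    # Negative solutions, which can arise from sampling error, are truncated to zero.
    return {k: max(v, 0.0) for k, v in comps.items()}

if __name__ == "__main__":
    ms = {"p": 10.2, "i": 4.1, "r": 1.5, "pi": 1.1, "pr": 0.7, "ir": 0.5, "pir": 0.4}
    print(variance_components(ms, n_p=50, n_i=20, n_r=3))
```

Setting negative estimates to zero, as the final line does, is the usual convention in applied work.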
Universe of Generalization and Facets

In generalizability theory, the universe of generalization refers to the broader population of all possible measurement conditions—such as tasks, raters, or occasions—over which inferences from observed scores are intended to extend. This universe is a subset of the universe of admissible observations, defined specifically for decision-making purposes in a decision study, and it determines the scope of reliable generalizations by identifying which facets are treated as random. For instance, in assessing writing proficiency, the universe might encompass all possible essay prompts and qualified raters, allowing scores from a sample to inform judgments across this domain.[3]

The universe of generalization supports two primary types of decisions: relative, which are norm-referenced and emphasize comparisons of individuals' relative positions (e.g., ranking students' performance), and absolute, which are criterion-referenced and focus on performance against a fixed standard (e.g., pass/fail thresholds). In relative decisions, the universe typically includes facets that influence rank orders without fixed criteria, whereas absolute decisions incorporate all variation affecting true score levels to ensure accuracy against benchmarks. This distinction guides how the universe is specified to align with the intended use of the scores.[4]

Facets represent the systematic sources of variation or measurable conditions in the observation process, with persons (p) serving as the object of measurement and other facets—such as items (i), occasions (o), or raters (r)—acting as conditions under which persons are observed. Facets are categorized as random or fixed: random facets are drawn from a large, potentially infinite population (e.g., raters sampled from educators nationwide), enabling generalization across all levels and treating variation as error; fixed facets involve specific, exhaustive levels (e.g., a predefined set of math problems), restricting inferences to those conditions and excluding their variation from error terms. Examples include generalizing over random items to reduce content sampling bias or treating specific occasions as fixed for targeted skill assessments.[3][9]

The arrangement of facets in a study design influences the estimation of generalizability, with facets either crossed—combining all levels across facets (e.g., p × i × r, where every person responds to every item under every rater)—or nested, where one facet is subordinate to another (e.g., raters nested within items, r:i, with raters evaluating only assigned items). Crossed designs capture interactions comprehensively but require more resources, while nested designs reflect practical hierarchies, such as multiple observers per site; a toy example of the two layouts is sketched at the end of this section. Generalizing over random facets, like raters, minimizes idiosyncratic biases and enhances the robustness of inferences across the universe.[3]

Central to this framework is the distinction between the universe score, which is the expected value of a person's score averaged over the entire universe of generalization (representing true proficiency), and the observed score, which is the specific measurement obtained under sampled conditions and includes random error. This separation highlights the theory's aim to bridge observed data to stable universe-level inferences, with facets defining the boundaries of that bridge. As facets contribute to variance components in the underlying model, their specification directly shapes the reliability of these generalizations.[3]
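Returning to the crossed versus nested distinction, the toy records below illustrate the two layouts with hypothetical persons, items, and raters: in the crossed p × i × r design every rater scores every person on every item, while in the nested r:i design each rater is attached to a single item.

```python
# Sketch: crossed (p x i x r) versus nested (r:i) observation layouts,
# using hypothetical persons, items, and raters.
from itertools import product

persons = ["p1", "p2"]
items = ["i1", "i2"]
raters = ["r1", "r2", "r3", "r4"]

# Crossed design: every rater observes every person on every item.
crossed = [{"person": p, "item": i, "rater": r}
           for p, i, r in product(persons, items, raters)]

# Nested design (r:i): raters are assigned to a single item each,
# so rater effects cannot be separated from the items they score.
rater_assignment = {"r1": "i1", "r2": "i1", "r3": "i2", "r4": "i2"}
nested = [{"person": p, "item": i, "rater": r}
          for p in persons
          for r, i in rater_assignment.items()]

print(len(crossed), "crossed observations;", len(nested), "nested observations")
```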
Core Procedures

Generalizability Study (G-Study)
The generalizability study (G-study) serves as the empirical foundation of generalizability theory, aimed at collecting data from a defined measurement design to estimate the variance components associated with the universe of generalization. By partitioning observed score variance into components attributable to objects of measurement (such as persons) and various facets (such as items or raters), the G-study quantifies the relative contributions of systematic and error sources, providing insights into the dependability of scores across potential replications of the measurement procedure.[3] This process applies the variance components model empirically, without delving into theoretical derivations, to yield estimates that inform subsequent design optimizations.[10]

Conducting a G-study begins with specifying the measurement design, such as a fully crossed p × i × r design where p represents persons (objects), i represents items, and r represents raters, ensuring all combinations are observed to facilitate variance estimation. A representative sample is then selected, for instance, 50 persons, 20 items, and 3 raters, followed by administering the measurements to generate the observation data matrix. Analysis proceeds via analysis of variance (ANOVA), where the expected mean squares (EMS) from the ANOVA table are used to derive unbiased estimates of the variance components, such as σ²(p) for person variance or σ²(p × i) for person-by-item interaction. For example, in an educational assessment context, this might reveal variance components attributable to persons, items, interactions, and residual error in varying proportions.[3][9]

Design considerations are crucial for the validity of G-study estimates; balanced designs, with equal numbers of observations across cells, simplify ANOVA computations and assume no missing data, while unbalanced designs—common in real-world scenarios due to incomplete responses—require specialized methods to adjust EMS formulas and prevent biased estimates. Missing data can be handled by deleting cases to achieve balance, though this reduces statistical power, or by employing iterative estimation techniques that account for unequal cell sizes. Software tools facilitate these analyses: GENOVA supports balanced univariate designs through FORTRAN-based ANOVA, producing variance component tables directly, whereas urGENOVA extends this to unbalanced and multivariate cases, using restricted maximum likelihood or other estimators for robust σ² outputs like σ²(p × i|r) in nested rater-item designs. These empirical variance estimates form the core output, serving as inputs for evaluating different measurement configurations without altering the original data collection.[11][12][13]
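The sketch below illustrates the computational core of such an analysis for a balanced, fully crossed p × i × r design: starting from a (simulated) persons × items × raters score array, it computes the ANOVA sums of squares, degrees of freedom, and mean squares that are then equated to their expectations to recover the variance components. The data are randomly generated placeholders; dedicated programs such as GENOVA automate these steps and handle the unbalanced cases discussed above.

```python
# Sketch: ANOVA mean squares for a fully crossed p x i x r G-study design.
# Scores are simulated; a real analysis would load the observed data matrix.
import numpy as np

rng = np.random.default_rng(0)
n_p, n_i, n_r = 50, 20, 3                      # persons, items, raters (hypothetical)
scores = rng.normal(size=(n_p, n_i, n_r))      # placeholder for observed scores

grand = scores.mean()
m_p = scores.mean(axis=(1, 2))                 # person means
m_i = scores.mean(axis=(0, 2))                 # item means
m_r = scores.mean(axis=(0, 1))                 # rater means
m_pi = scores.mean(axis=2)                     # person-by-item cell means
m_pr = scores.mean(axis=1)                     # person-by-rater cell means
m_ir = scores.mean(axis=0)                     # item-by-rater cell means

ss = {
    "p":  n_i * n_r * np.sum((m_p - grand) ** 2),
    "i":  n_p * n_r * np.sum((m_i - grand) ** 2),
    "r":  n_p * n_i * np.sum((m_r - grand) ** 2),
    "pi": n_r * np.sum((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2),
    "pr": n_i * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2),
    "ir": n_p * np.sum((m_ir - m_i[:, None] - m_r[None, :] + grand) ** 2),
}
# Residual: p x i x r interaction confounded with error (one observation per cell).
ss["pir"] = np.sum(
    (scores
     - m_pi[:, :, None] - m_pr[:, None, :] - m_ir[None, :, :]
     + m_p[:, None, None] + m_i[None, :, None] + m_r[None, None, :]
     - grand) ** 2
)

df = {
    "p": n_p - 1, "i": n_i - 1, "r": n_r - 1,
    "pi": (n_p - 1) * (n_i - 1), "pr": (n_p - 1) * (n_r - 1), "ir": (n_i - 1) * (n_r - 1),
    "pir": (n_p - 1) * (n_i - 1) * (n_r - 1),
}
ms = {k: ss[k] / df[k] for k in ss}            # mean squares, fed into the EMS equations
for effect, value in ms.items():
    print(f"MS({effect}) = {value:.3f}")
```

Feeding these mean squares into the EMS equations from the previous section yields the estimated variance components that a D-study then reuses.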
Decision Study (D-Study)

The decision study (D-study) employs variance components from the generalizability study to project generalizability under alternative measurement designs, enabling the optimization of practical procedures such as selecting the number of items versus raters to achieve targeted reliability in real-world applications.[14] This approach simulates how changes in facet sample sizes affect error variances, facilitating decisions that enhance dependability while considering resource constraints.[3]

In the D-study procedure, G-study variance components are adjusted by dividing interacting variances by the intended sample sizes of random facets to estimate error terms; for instance, in a persons by items (p × i) design, the relative error variance is expressed as $\sigma^2(\delta) = \sigma^2(pi)/n_i$, where $n_i$ denotes the number of items, allowing computation of the minimum $n_i$ required to attain a specified generalizability coefficient $E\rho^2$.[14] Fixed facets, such as specific tasks not intended to generalize, are excluded from error variance calculations to reflect narrower universes of generalization.[5]

D-studies differentiate between relative types, which support ranking individuals and exclude main effect variances from error, and absolute types, which evaluate domain-referenced performance and incorporate those variances for broader generalizability.[3] For example, relative D-studies prioritize interaction terms like person-by-item variance, while absolute ones add item main effect variance to ensure scores generalize across the full domain.[14]

Interpreting D-study outcomes highlights trade-offs in design choices; increasing the number of raters diminishes the contribution of person-by-rater interaction variance (nested within occasions in such designs), thereby improving reliability, but incurs higher logistical costs compared to expanding items, which more efficiently reduces relevant error components.[1] These projections guide applied settings, such as educational testing, in balancing precision against practicality.[5]
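As an illustration of such projections, the sketch below takes hypothetical G-study variance components for a crossed p × i × r random design and computes the relative and absolute error variances, together with the generalizability coefficient for relative decisions and the corresponding coefficient for absolute decisions (conventionally written Φ), across alternative numbers of items and raters.

```python
# Sketch: D-study projections for a crossed p x i x r random design.
# The G-study variance components below are hypothetical.

g_components = {
    "p": 0.50, "i": 0.10, "r": 0.05,
    "pi": 0.20, "pr": 0.08, "ir": 0.02, "pir,e": 0.30,
}

def d_study(var, n_i, n_r):
    """Project error variances and coefficients for n_i items and n_r raters."""
    rel_error = var["pi"] / n_i + var["pr"] / n_r + var["pir,e"] / (n_i * n_r)
    abs_error = rel_error + var["i"] / n_i + var["r"] / n_r + var["ir"] / (n_i * n_r)
    e_rho2 = var["p"] / (var["p"] + rel_error)      # generalizability (relative) coefficient
    phi = var["p"] / (var["p"] + abs_error)         # dependability (absolute) coefficient
    return e_rho2, phi

for n_i, n_r in [(5, 1), (10, 2), (20, 3)]:
    e_rho2, phi = d_study(g_components, n_i, n_r)
    print(f"n_i={n_i:2d}, n_r={n_r}:  E-rho^2={e_rho2:.3f}  Phi={phi:.3f}")
```

Comparing rows of this output makes the items-versus-raters trade-off explicit under the assumed components.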
Estimation Methods

Generalizability Coefficient
The generalizability coefficient, denoted $E\rho^2$ or simply $\rho^2$, quantifies the reliability of relative scores by measuring the proportion of observed score variance attributable to universe score variance, which reflects the consistency of relative standings (e.g., rankings) of persons or objects across conditions defined by the facets of generalization.[3][9] It is analogous to Cronbach's $\alpha$ in classical test theory but extends to multiple sources of error by incorporating variance components from various facets, such as items or raters.[15] This coefficient is derived from the expected squared correlation between observed scores and corresponding universe scores over randomly parallel measurements drawn from the universe of generalization, serving as an intraclass correlation adjusted via the Spearman-Brown prophecy formula to account for design facets.[9] The universe score represents the expected observed score over all possible conditions in the universe, and the derivation partitions total observed score variance into systematic (universe score) and relative error components using analysis of variance principles. For relative decisions, main effects of random facets (e.g., items) are excluded from the error term, as they introduce systematic bias equally across persons and do not affect relative rankings.[3]

The formula for the generalizability coefficient is

$E\rho^2 = \dfrac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)}$

where $\sigma^2(p)$ is the variance component due to persons (or the primary objects of measurement), and $\sigma^2(\delta)$ is the relative error variance arising from interactions of persons with random facets.[15][3] Estimation relies on variance components obtained from a generalizability study (G-study), which are then projected to a decision study (D-study) design by dividing interaction variances by the number of levels in each facet. For a persons × items (p × i) design, the relative error variance is estimated as $\sigma^2(\delta) = \sigma^2(pi)/n_i$, where $\sigma^2(pi)$ is the person-item interaction variance and $n_i$ is the number of items in the D-study; these components are typically derived from G-study ANOVA mean squares. The item main effect is excluded here, as it does not contribute to relative error in this context.[15][9]

The coefficient ranges from 0 (no generalizability) to 1 (perfect consistency), with values above 0.80 generally indicating strong generalizability suitable for relative decisions, such as norm-referenced evaluations where the focus is on differentiating individuals rather than absolute levels.[3][15] Its magnitude is influenced by the relative sizes of variance components and the number of facet levels; for example, larger interactions increase $\sigma^2(\delta)$, lowering $E\rho^2$, while more items reduce the error contribution from $\sigma^2(pi)$, raising $E\rho^2$.[9]

As an illustrative example, consider a hypothetical G-study for an educational test in a p × i design, yielding the variance components shown in the table below. For a D-study using 10 items ($n_i = 10$), the relative error variance is $\sigma^2(\delta) = 0.40/10 = 0.04$. Substituting into the formula gives $E\rho^2 = 0.25/(0.25 + 0.04) \approx 0.86$, suggesting strong generalizability for relative score comparisons.[3][15]

| Variance Component | Estimated Value |
|---|---|
| $\sigma^2(p)$ | 0.25 |
| $\sigma^2(i)$ | 0.50 |
| $\sigma^2(pi)$ | 0.40 |
