Psychometrics
Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities.[1] Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, personality factors (e.g., introversion), mental disorders, and educational achievement.[2] The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.[2]
Practitioners are described as psychometricians, although not all who engage in psychometric research go by this title. Most psychometricians are psychologists with advanced graduate training in psychometrics and measurement theory. According to the Dictionary of Psychology, a psychometrician "is an individual with a theoretical knowledge of measurement techniques who is qualified to develop, evaluate, and improve psychological tests."[3] In addition to traditional academic institutions, psychometricians also work for organizations such as Pearson and the Educational Testing Service, or as independent consultants. Some psychometric researchers focus on the construction and validation of assessment instruments, including surveys, scales, and open- or closed-ended questionnaires. Others focus on research relating to measurement theory (e.g., item response theory, intraclass correlation) or specialize as learning and development professionals.
Etymology
The word psychometry derives from the Greek ψυχή, psukhē, "spirit, soul", and μέτρον, metron, "measure". The American academic Joseph Rodes Buchanan is credited with first coining the word "psychometry" in 1842, but in connection with his investigation of the paranormal rather than the rational quantification of psychological criteria.
Historical foundation
Rational psychological testing developed from two streams of thought: the first, from Darwin, Galton, and Cattell, concerned with the measurement of individual differences; the second, from Herbart, Weber, Fechner, and Wundt, concerned with psychophysical measurement. The second stream of research is what led to the development of experimental psychology and standardized testing.[4]
Victorian stream
Charles Darwin was the inspiration behind Francis Galton, a scientist who advanced the development of psychometrics. In 1859, Darwin published his book On the Origin of Species, which described the role of natural selection in the emergence, over time, of different populations of species of plants and animals. The book showed how individual members of a species differ among themselves and how they possess characteristics that are more or less adaptive to their environment. Those with more adaptive characteristics are more likely to survive to procreate and give rise to another generation; those with less adaptive characteristics are less likely. These ideas stimulated Galton's interest in the study of human beings, how they differ from one another, and how to measure those differences.
Galton wrote a book entitled Hereditary Genius, first published in 1869. The book described different characteristics that people possess and how those characteristics make some more "fit" than others. Today these differences, such as sensory and motor functioning (reaction time, visual acuity, and physical strength), are important domains of scientific psychology. Much of the early theoretical and applied work in psychometrics was undertaken in an attempt to measure intelligence. Galton, often referred to as "the father of psychometrics," devised and included mental tests among his anthropometric measures. James McKeen Cattell, a pioneer in the field of psychometrics, went on to extend Galton's work. Cattell coined the term mental test, and is responsible for research and knowledge that ultimately led to the development of modern tests.[4]
German stream
The origin of psychometrics also has connections to the related field of psychophysics. Around the same time that Darwin, Galton, and Cattell were making their discoveries, Herbart was also interested in "unlocking the mysteries of human consciousness" through the scientific method.[4] Herbart was responsible for creating mathematical models of the mind, which were influential in educational practices for years to come.
E.H. Weber built upon Herbart's work and tried to prove the existence of a psychological threshold, saying that a minimum stimulus was necessary to activate a sensory system. After Weber, G.T. Fechner expanded upon the knowledge he gleaned from Herbart and Weber, to devise the law that the strength of a sensation grows as the logarithm of the stimulus intensity. A follower of Weber and Fechner, Wilhelm Wundt is credited with founding the science of psychology. It is Wundt's influence that paved the way for others to develop psychological testing.[4]
20th century
In 1936, the psychometrician L. L. Thurstone, founder and first president of the Psychometric Society, developed and applied a theoretical approach to measurement referred to as the law of comparative judgment, an approach that has close connections to the psychophysical theory of Ernst Heinrich Weber and Gustav Fechner. In addition, Spearman and Thurstone both made important contributions to the theory and application of factor analysis, a statistical method developed and used extensively in psychometrics.[5] In the late 1950s, Leopold Szondi made a historical and epistemological assessment of the impact of statistical thinking on psychology during the previous few decades: "in the last decades, the specifically psychological thinking has been almost completely suppressed and removed, and replaced by a statistical thinking. Precisely here we see the cancer of testology and testomania of today."[6]
More recently, psychometric theory has been applied in the measurement of personality, attitudes and beliefs, and academic achievement. These latent constructs cannot truly be measured, and much of the research and science in this discipline has been developed in an attempt to measure these constructs as close to the true score as possible.
Figures who made significant contributions to psychometrics include Karl Pearson, Henry F. Kaiser, Carl Brigham, L. L. Thurstone, E. L. Thorndike, Georg Rasch, Eugene Galanter, Johnson O'Connor, Frederic M. Lord, Ledyard R Tucker, Louis Guttman, and Jane Loevinger.
Definition of measurement in the social sciences
The definition of measurement in the social sciences has a long history. A currently widespread definition, proposed by Stanley Smith Stevens, is that measurement is "the assignment of numerals to objects or events according to some rule." This definition was introduced in a 1946 Science article in which Stevens proposed four levels of measurement.[7] Although widely adopted, this definition differs in important respects from the more classical definition of measurement adopted in the physical sciences, namely that scientific measurement entails "the estimation or discovery of the ratio of some magnitude of a quantitative attribute to a unit of the same attribute" (p. 358).[8]
Indeed, Stevens's definition of measurement was put forward in response to the British Ferguson Committee, whose chair, A. Ferguson, was a physicist. The committee was appointed in 1932 by the British Association for the Advancement of Science to investigate the possibility of quantitatively estimating sensory events. Although its chair and other members were physicists, the committee also included several psychologists. The committee's report highlighted the importance of the definition of measurement. While Stevens's response was to propose a new definition, which has had considerable influence in the field, this was by no means the only response to the report. Another, notably different, response was to accept the classical definition, as reflected in the following statement:
- Measurement in psychology and physics are in no sense different. Physicists can measure when they can find the operations by which they may meet the necessary criteria; psychologists have to do the same. They need not worry about the mysterious differences between the meaning of measurement in the two sciences (Reese, 1943, p. 49).[9]
These divergent responses are reflected in alternative approaches to measurement. For example, methods based on covariance matrices are typically employed on the premise that numbers, such as raw scores derived from assessments, are measurements. Such approaches implicitly entail Stevens's definition of measurement, which requires only that numbers are assigned according to some rule. The main research task, then, is generally considered to be the discovery of associations between scores, and of factors posited to underlie such associations.[10]
On the other hand, when measurement models such as the Rasch model are employed, numbers are not assigned based on a rule. Instead, in keeping with Reese's statement above, specific criteria for measurement are stated, and the goal is to construct procedures or operations that provide data that meet the relevant criteria. Measurements are estimated based on the models, and tests are conducted to ascertain whether the relevant criteria have been met.[citation needed]
Instruments and procedures
The first psychometric instruments were designed to measure intelligence.[11] One early approach to measuring intelligence was the test developed in France by Alfred Binet and Theodore Simon, known as the Binet-Simon test. The French test was adapted for use in the U.S. by Lewis Terman of Stanford University and named the Stanford-Binet IQ test.
Another major focus in psychometrics has been on personality testing. There has been a range of theoretical approaches to conceptualizing and measuring personality, though there is no widely agreed upon theory. Some of the better-known instruments include the Minnesota Multiphasic Personality Inventory, the Five-Factor Model (or "Big 5") and tools such as Personality and Preference Inventory and the Myers–Briggs Type Indicator. Attitudes have also been studied extensively using psychometric approaches.[citation needed][12] An alternative method involves the application of unfolding measurement models, the most general being the Hyperbolic Cosine Model (Andrich & Luo, 1993).[13]
Theoretical approaches
Psychometricians have developed a number of different measurement theories. These include classical test theory (CTT) and item response theory (IRT).[14][15] An approach that is mathematically similar to IRT but distinct in its origins and features is the Rasch model for measurement. The development of the Rasch model, and the broader class of models to which it belongs, was explicitly founded on requirements of measurement in the physical sciences.[16]
Psychometricians have also developed methods for working with large matrices of correlations and covariances. Techniques in this general tradition include factor analysis,[17] a method of determining the underlying dimensions of data. One of the main challenges faced by users of factor analysis is a lack of consensus on appropriate procedures for determining the number of latent factors.[18] A common procedure is to stop factoring when eigenvalues drop below one, on the grounds that such factors account for less variance than a single standardized variable. The absence of agreed cut-off points is a concern for other multivariate methods as well.[19]
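As an illustration of the eigenvalue-greater-than-one rule, the following sketch (Python with NumPy; the correlation matrix is hypothetical and not taken from any source cited here) counts how many factors would be retained:

```python
import numpy as np

# Hypothetical correlation matrix for six test items (illustrative values only).
R = np.array([
    [1.00, 0.60, 0.55, 0.10, 0.12, 0.08],
    [0.60, 1.00, 0.58, 0.09, 0.11, 0.10],
    [0.55, 0.58, 1.00, 0.07, 0.10, 0.09],
    [0.10, 0.09, 0.07, 1.00, 0.52, 0.48],
    [0.12, 0.11, 0.10, 0.52, 1.00, 0.50],
    [0.08, 0.10, 0.09, 0.48, 0.50, 1.00],
])

# Each standardized variable contributes a variance of 1, so a factor with an
# eigenvalue below 1 explains less variance than a single original variable.
eigenvalues = np.linalg.eigvalsh(R)[::-1]   # sorted, largest first
n_factors = int(np.sum(eigenvalues > 1.0))  # eigenvalue-greater-than-one rule

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Factors retained:", n_factors)
```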
Multidimensional scaling[20] is a method for finding a simple representation for data with a large number of latent dimensions. Cluster analysis is an approach to finding objects that are like each other. Factor analysis, multidimensional scaling, and cluster analysis are all multivariate descriptive methods used to distill simpler structures from large amounts of data.
More recently, structural equation modeling[21] and path analysis represent more sophisticated approaches to working with large covariance matrices. These methods allow statistically sophisticated models to be fitted to data and tested to determine if they are adequate fits. Because at a granular level psychometric research is concerned with the extent and nature of multidimensionality in each of the items of interest, a relatively new procedure known as bi-factor analysis[22][23][24] can be helpful. Bi-factor analysis can decompose "an item's systematic variance in terms of, ideally, two sources, a general factor and one source of additional systematic variance."[25]
Key concepts
Key concepts in classical test theory are reliability and validity. A reliable measure is one that measures a construct consistently across time, individuals, and situations. A valid measure is one that measures what it is intended to measure. Reliability is necessary, but not sufficient, for validity.
Both reliability and validity can be assessed statistically. Consistency over repeated measures of the same test can be assessed with the Pearson correlation coefficient, and is often called test-retest reliability.[26] Similarly, the equivalence of different versions of the same measure can be indexed by a Pearson correlation, and is called equivalent forms reliability or a similar term.[26]
Internal consistency, which addresses the homogeneity of a single test form, may be assessed by correlating performance on two halves of a test, which is termed split-half reliability; the value of this Pearson product-moment correlation coefficient for two half-tests is adjusted with the Spearman–Brown prediction formula to correspond to the correlation between two full-length tests.[26] Perhaps the most commonly used index of reliability is Cronbach's α, which is equivalent to the mean of all possible split-half coefficients. Other approaches include the intra-class correlation, which is the ratio of variance of measurements of a given target to the variance of all targets.
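To make these indices concrete, here is a small sketch (Python with NumPy; the simulated item data are hypothetical) computing split-half reliability with the Spearman–Brown correction and Cronbach's alpha for a persons-by-items score matrix:

```python
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Split-half reliability with the Spearman-Brown correction.

    `scores` is a persons-by-items matrix; halves are formed from
    odd- and even-numbered items.
    """
    odd = scores[:, 0::2].sum(axis=1)
    even = scores[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)  # Spearman-Brown prophecy formula

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha from item variances and total-score variance."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 200 examinees answering 10 dichotomous items.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
items = (ability + rng.normal(scale=1.2, size=(200, 10)) > 0).astype(float)

print("Split-half (Spearman-Brown):", round(split_half_reliability(items), 3))
print("Cronbach's alpha:           ", round(cronbach_alpha(items), 3))
```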
There are a number of different forms of validity. Criterion-related validity refers to the extent to which a test or scale predicts a sample of behavior, i.e., the criterion, that is "external to the measuring instrument itself."[27] That external sample of behavior can be many things including another test; college grade point average as when the high school SAT is used to predict performance in college; and even behavior that occurred in the past, for example, when a test of current psychological symptoms is used to predict the occurrence of past victimization (which would accurately represent postdiction). When the criterion measure is collected at the same time as the measure being validated the goal is to establish concurrent validity; when the criterion is collected later the goal is to establish predictive validity. A measure has construct validity if it is related to measures of other constructs as required by theory. Content validity is a demonstration that the items of a test do an adequate job of covering the domain being measured. In a personnel selection example, test content is based on a defined statement or set of statements of knowledge, skill, ability, or other characteristics obtained from a job analysis.
Item response theory models the relationship between latent traits and responses to test items. Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-taker on a given latent trait as well as the standard error of measurement of that location. For example, a university student's knowledge of history can be deduced from his or her score on a university test and then be compared reliably with a high school student's knowledge deduced from a less difficult test. Scores derived by classical test theory do not have this characteristic, and actual ability (rather than ability relative to other test-takers) must be assessed by comparing scores to those of a "norm group" randomly selected from the population. In fact, all measures derived from classical test theory are dependent on the sample tested, while, in principle, those derived from item response theory are not.
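As a rough illustration of how IRT yields a trait location and its standard error, the sketch below (Python with NumPy; item difficulties and responses are hypothetical) estimates a test-taker's location under a simple Rasch-type model by maximum likelihood over a grid, with the standard error taken from the test information at that location:

```python
import numpy as np

def rasch_prob(theta: float, b: np.ndarray) -> np.ndarray:
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def estimate_theta(responses: np.ndarray, b: np.ndarray):
    """Maximum-likelihood trait estimate and its standard error via grid search."""
    grid = np.linspace(-4, 4, 801)
    log_lik = [
        np.sum(responses * np.log(rasch_prob(t, b))
               + (1 - responses) * np.log(1 - rasch_prob(t, b)))
        for t in grid
    ]
    theta_hat = grid[int(np.argmax(log_lik))]
    # Rasch test information is the sum of p(1 - p); SE = 1 / sqrt(information).
    p = rasch_prob(theta_hat, b)
    se = 1.0 / np.sqrt(np.sum(p * (1 - p)))
    return theta_hat, se

# Hypothetical item difficulties and one response pattern (1 = correct).
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 1.5])
responses = np.array([1, 1, 1, 0, 1, 0])
theta, se = estimate_theta(responses, b)
print(f"theta = {theta:.2f}, SE = {se:.2f}")
```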
Standards of quality
The considerations of validity and reliability typically are viewed as essential elements for determining the quality of any test. However, professional and practitioner associations frequently have placed these concerns within broader contexts when developing standards and making overall judgments about the quality of any test as a whole within a given context. One concern in many applied research settings is whether the metric of a given psychological inventory is meaningful or arbitrary.[28]
Testing standards
In 2014, the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) published a revision of the Standards for Educational and Psychological Testing,[29] which describes standards for test development, evaluation, and use. The Standards cover essential topics in testing including validity, reliability/errors of measurement, and fairness in testing. The book also establishes standards related to testing operations—including test design and development, scores, scales, norms, score linking, cut scores, test administration, scoring, reporting, score interpretation, test documentation, and rights and responsibilities of test takers and test users. Finally, the Standards cover topics related to testing applications, including psychological testing and assessment, workplace testing and credentialing, educational testing and assessment, and testing in program evaluation and public policy.
Evaluation standards
In the field of evaluation, and in particular educational evaluation, the Joint Committee on Standards for Educational Evaluation[30] has published three sets of standards for evaluations. The Personnel Evaluation Standards[31] was published in 1988, The Program Evaluation Standards (2nd edition)[32] was published in 1994, and The Student Evaluation Standards[33] was published in 2003.
Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing, and improving the identified form of evaluation.[34] Each of the standards has been placed in one of four fundamental categories to promote educational evaluations that are proper, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic. For example, the student accuracy standards help ensure that student evaluations will provide sound, accurate, and credible information about student learning and performance.
Controversy and criticism
Because psychometrics is based on latent psychological processes measured through correlations, there has been controversy about some psychometric measures.[35][page needed] Critics, including practitioners in the physical sciences, have argued that such definition and quantification is difficult, and that such measurements are often misused by laymen, such as when personality tests are used in employment procedures. The Standards for Educational and Psychological Testing gives the following statement on test validity: "validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests".[36] Simply put, a test is not valid unless it is used and interpreted in the way it is intended.[37]
Two types of tools used to measure personality traits are objective tests and projective measures. Examples of such tests are the Big Five Inventory (BFI), Minnesota Multiphasic Personality Inventory (MMPI-2), Rorschach Inkblot test, Neurotic Personality Questionnaire KON-2006,[38] and Eysenck Personality Questionnaire. Some of these tests are helpful because they have adequate reliability and validity, two factors that make tests consistent and accurate reflections of the underlying construct. The Myers–Briggs Type Indicator (MBTI), however, has questionable validity and has been the subject of much criticism. Psychometric specialist Robert Hogan wrote of the measure: "Most personality psychologists regard the MBTI as little more than an elaborate Chinese fortune cookie."[39]
Lee Cronbach noted in American Psychologist (1957) that, "correlational psychology, though fully as old as experimentation, was slower to mature. It qualifies equally as a discipline, however, because it asks a distinctive type of question and has technical methods of examining whether the question has been properly put and the data properly interpreted." He would go on to say, "The correlation method, for its part, can study what man has not learned to control or can never hope to control ... A true federation of the disciplines is required. Kept independent, they can give only wrong answers or no answers at all regarding certain important problems."[40]
Non-human: animals and machines
Psychometrics addresses human abilities, attitudes, traits, and educational evolution. Notably, the study of the behavior, mental processes, and abilities of non-human animals is usually addressed by comparative psychology, or, with a continuum between non-human animals and humans, by evolutionary psychology. Nonetheless, there are some advocates for a more gradual transition between the approach taken for humans and the approach taken for (non-human) animals.[41][42][43][44]
The evaluation of abilities, traits and learning evolution of machines has been mostly unrelated to the case of humans and non-human animals, with specific approaches in the area of artificial intelligence. A more integrated approach, under the name of universal psychometrics, has also been proposed.[45][46]
See also
- Cattell–Horn–Carroll theory
- Classical test theory
- Computational psychometrics
- Concept inventory
- Cronbach's alpha
- Educational assessment
- Educational psychology
- Factor analysis
- Flynn effect
- Item response theory
- Likert scale
- List of international databases on individual student achievement tests
- List of psychometric software
- List of schools for psychometrics
- Omega
- Operationalisation
- Quantitative psychology
- Psychometric Society
- Psychological testing
- Rasch model
- Scale (social sciences)
- School psychology
- Standardized test
References
[edit]- ^ "Glossary1". 22 July 2017. Archived from the original on 2017-07-22. Retrieved 28 June 2022.
- ^ a b Tabachnick, B.G.; Fidell, L.S. (2001). Using Multivariate Analysis. Boston: Allyn and Bacon. ISBN 978-0-321-05677-1.[page needed]
- ^ American Psychological Association. (2015). Dictionary of Psychology. Author. doi:10.1037/14646-000 [1]
- ^ a b c d Kaplan, Robert M.; Saccuzzo, Dennis P. (2012-05-01). Psychological Testing: Principles, Applications, and Issues (8th ed.). Cengage Learning. ISBN 978-1-133-49201-6.
- ^ Nunnally, Jum C.; Bernstein, Ira H. (1994). Psychometric Theory (3rd ed.). New York, NY: McGraw-Hill Humanities/Social Sciences/Languages. ISBN 978-0-07-047849-7.
- ^ Leopold Szondi (1960) Das zweite Buch: Lehrbuch der Experimentellen Triebdiagnostik. Huber, Bern und Stuttgart, 2nd edition. Ch.27, From the Spanish translation, B)II Las condiciones estadisticas, p.396. Quotation:
el pensamiento psicologico especifico, en las ultima decadas, fue suprimido y eliminado casi totalmente, siendo sustituido por un pensamiento estadistico. Precisamente aqui vemos el cáncer de la testología y testomania de hoy.
- ^ Stevens, S. S. (7 June 1946). "On the Theory of Scales of Measurement". Science. 103 (2684): 677–680. Bibcode:1946Sci...103..677S. doi:10.1126/science.103.2684.677. PMID 17750512. S2CID 4667599.
- ^ Michell, Joel (August 1997). "Quantitative science and the definition of measurement in psychology". British Journal of Psychology. 88 (3): 355–383. doi:10.1111/j.2044-8295.1997.tb02641.x.
- ^ Reese, Thomas Whelan (1943). "The application of the theory of physical measurement to the measurement of psychological magnitudes, with three experimental examples". Psychological Monographs. 55 (3): i–89. doi:10.1037/h0093539. ISSN 0096-9753.
- ^ "Psychometrics". Assessmentpsychology.com. Retrieved 28 June 2022.
- ^ Stern, Theodore A.; Fava, Maurizio; Wilens, Timothy E.; Rosenbaum, Jerrold F. (2016). Massachusetts General Hospital comprehensive clinical psychiatry (Second ed.). London. p. 73. ISBN 978-0323295079. Retrieved 31 October 2021.
- ^ Longe, Jacqueline L., ed. (2022). The Gale Encyclopedia of Psychology. Vol. 2 (4th ed.). Farmington Hills, Michigan: Gale. p. 1000. ISBN 9780028683867.
- ^ Andrich, David; Luo, Guanzhong (1993-09-01). "A Hyperbolic Cosine Latent Trait Model For Unfolding Dichotomous Single-Stimulus Responses". Applied Psychological Measurement. 17 (3): 253–276. doi:10.1177/014662169301700307. ISSN 0146-6216.
- ^ Embretson, Susan E.; Reise, Steven Paul (2000). Item Response Theory for Psychologists. L. Erlbaum Associates. ISBN 978-0-8058-2818-4.
- ^ Hambleton, R.K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff.
- ^ Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen, Danish Institute for Educational Research, expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
- ^ Thompson, B.R. (2004). Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. American Psychological Association.
- ^ Zwick, William R.; Velicer, Wayne F. (1986). "Comparison of five rules for determining the number of components to retain". Psychological Bulletin. 99 (3): 432–442. doi:10.1037/0033-2909.99.3.432.
- ^ Singh, Manoj Kumar (2021-09-11). Introduction to Social Psychology. K.K. Publications.
- ^ Davison, M.L. (1992). Multidimensional Scaling. Krieger.
- ^ Kaplan, D. (2008). Structural Equation Modeling: Foundations and Extensions, 2nd ed. Sage.
- ^ DeMars, Christine E. (2013-10-01). "A Tutorial on Interpreting Bifactor Model Scores". International Journal of Testing. 13 (4): 354–378. doi:10.1080/15305058.2013.799067. ISSN 1530-5058.
- ^ Reise, Steven P. (2012-09-01). "The Rediscovery of Bifactor Measurement Models". Multivariate Behavioral Research. 47 (5): 667–696. doi:10.1080/00273171.2012.715555. ISSN 0027-3171. PMC 3773879. PMID 24049214.
- ^ Rodriguez, Anthony; Reise, Steven P.; Haviland, Mark G. (June 2016). "Evaluating bifactor models: Calculating and interpreting statistical indices". Psychological Methods. 21 (2): 137–150. doi:10.1037/met0000045. ISSN 1939-1463. PMID 26523435.
- ^ Schonfeld, Irvin Sam; Verkuilen, Jay; Bianchi, Renzo (August 2019). "An exploratory structural equation modeling bi-factor analytic approach to uncovering what burnout, depression, and anxiety scales measure". Psychological Assessment. 31 (8): 1073–1079. doi:10.1037/pas0000721. ISSN 1939-134X. PMID 30958024.
- ^ a b c "Home – Educational Research Basics by Del Siegle". www.gifted.uconn.edu. 17 February 2015.
- ^ Nunnally, J.C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
- ^ Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology. Archived 2006-05-10 at the Wayback Machine American Psychologist, 61(1), 27–41.
- ^ "The Standards for Educational and Psychological Testing". apa.org.
- ^ "Joint Committee on Standards for Educational Evaluation". Archived from the original on 15 October 2009. Retrieved 28 June 2022.
- ^ Joint Committee on Standards for Educational Evaluation. (1988). The Personnel Evaluation Standards: How to Assess Systems for Evaluating Educators. Archived 2005-12-12 at the Wayback Machine Newbury Park, CA: Sage Publications.
- ^ Joint Committee on Standards for Educational Evaluation. (1994). The Program Evaluation Standards, 2nd Edition. Archived 2006-02-22 at the Wayback Machine Newbury Park, CA: Sage Publications.
- ^ Committee on Standards for Educational Evaluation. (2003). The Student Evaluation Standards: How to Improve Evaluations of Students. Archived 2006-05-24 at the Wayback Machine Newbury Park, CA: Corwin Press.
- ^ Cabrera-Nguyen, E. (2010). "Author guidelines for reporting scale development and validation results in the Journal of the Society for Social Work and Research". Academia.edu. 1 (2): 99–103.
- ^ Tabachnick, B.G.; Fidell, L.S. (2001). Using Multivariate Analysis. Boston: Allyn and Bacon. ISBN 978-0-321-05677-1.
- ^ American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999) Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- ^ Bandalos, Deborah L. (2018). Measurement theory and applications for the social sciences. New York. p. 261. ISBN 978-1-4625-3215-5. OCLC 1015955756.
- ^ Aleksandrowicz JW, Klasa K, Sobański JA, Stolarska D (2009). "KON-2006 Neurotic Personality Questionnaire" (PDF). Archives of Psychiatry and Psychotherapy. 1: 21–22.
- ^ Hogan, Robert (2007). Personality and the fate of organizations. Mahwah, NJ: Lawrence Erlbaum Associates. p. 28. ISBN 978-0-8058-4142-8. OCLC 65400436.
- ^ Cronbach, L. J. (1957). "The two disciplines of scientific psychology". American Psychologist. 12 (11): 671–684. doi:10.1037/h0043943 – via EBSCO.
- ^ Humphreys, L.G. (1987). "Psychometrics considerations in the evaluation of intraspecies differences in intelligence". Behav Brain Sci. 10 (4): 668–669. doi:10.1017/s0140525x0005514x.
- ^ Eysenck, H.J. (1987). "The several meanings of intelligence". Behav Brain Sci. 10 (4): 663. doi:10.1017/s0140525x00055060.
- ^ Locurto, C. & Scanlon, C (1987). "Individual differences and spatial learning factor in two strains of mice". Behav Brain Sci. 112: 344–352.
- ^ King, James E & Figueredo, Aurelio Jose (1997). "The five-factor model plus dominance in chimpanzee personality". Journal of Research in Personality. 31 (2): 257–271. doi:10.1006/jrpe.1997.2179.
- ^ J. Hernández-Orallo; D.L. Dowe; M.V. Hernández-Lloreda (2013). "Universal Psychometrics: Measuring Cognitive Abilities in the Machine Kingdom" (PDF). Cognitive Systems Research. 27: 50–74. doi:10.1016/j.cogsys.2013.06.001. hdl:10251/50244. S2CID 26440282.
- ^ Hernández-Orallo, José (2017). The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge: Cambridge University Press. ISBN 978-1-107-15301-1.
Bibliography
- Andrich, D. & Luo, G. (1993). "A hyperbolic cosine model for unfolding dichotomous single-stimulus responses" (PDF). Applied Psychological Measurement. 17 (3): 253–276. CiteSeerX 10.1.1.1003.8107. doi:10.1177/014662169301700307. S2CID 120745971.
- Michell, J. (1999). Measurement in Psychology. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511490040
- Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
- Reese, T.W. (1943). The application of the theory of physical measurement to the measurement of psychological magnitudes, with three experimental examples. Psychological Monographs, 55, 1–89. doi:10.1037/h0061367
- Stevens, S. S. (1946). "On the theory of scales of measurement". Science. 103 (2684): 677–80. Bibcode:1946Sci...103..677S. doi:10.1126/science.103.2684.677. PMID 17750512.
- Thurstone, L.L. (1927). "A law of comparative judgement". Psychological Review. 34 (4): 278–286. doi:10.1037/h0070288.
- Thurstone, L.L. (1929). The Measurement of Psychological Value. In T.V. Smith and W.K. Wright (Eds.), Essays in Philosophy by Seventeen Doctors of Philosophy of the University of Chicago. Chicago: Open Court.
- Thurstone, L.L. (1959). The Measurement of Values. Chicago: The University of Chicago Press.
- S.F. Blinkhorn (1997). "Past imperfect, future conditional: fifty years of test theory". British Journal of Mathematical and Statistical Psychology. 50 (2): 175–185. doi:10.1111/j.2044-8317.1997.tb01139.x.
- Sanford, David (18 November 2017). "Cambridge just told me Big Data doesn't work yet". LinkedIn.
Further reading
- Robert F. DeVellis (2016). Scale Development: Theory and Applications. SAGE Publications. ISBN 978-1-5063-4158-3.
- Borsboom, Denny (2005). Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge: Cambridge University Press. ISBN 978-0-521-84463-5.
- Leslie A. Miller; Robert L. Lovler (2015). Foundations of Psychological Testing: A Practical Approach. SAGE Publications. ISBN 978-1-4833-6927-3.
- Roderick P. McDonald (2013). Test Theory: A Unified Treatment. Psychology Press. ISBN 978-1-135-67530-1.
- Paul Kline (2000). The Handbook of Psychological Testing. Psychology Press. ISBN 978-0-415-21158-1.
- Rush AJ Jr; First MB; Blacker D (2008). Handbook of Psychiatric Measures. American Psychiatric Publishing. ISBN 978-1-58562-218-4. OCLC 85885343.
- Ann C Silverlake (2016). Comprehending Test Manuals: A Guide and Workbook. Taylor & Francis. ISBN 978-1-351-97086-0.
Historical Development
19th-Century Antecedents
In the early 19th century, Ernst Heinrich Weber conducted experiments demonstrating that the just-noticeable difference (JND)—the minimal change in a stimulus detectable by an observer—is a constant proportion of the stimulus's magnitude, as observed in weight-lifting tasks where heavier base weights required larger absolute increments for detection.[11] This principle, later termed Weber's law, provided an empirical foundation for quantifying perceptual thresholds and scaling subjective sensations against physical intensities.[12] Gustav Theodor Fechner built upon Weber's findings in Elemente der Psychophysik (1860), formalizing psychophysics as a discipline to measure the relationship between physical stimuli and psychological sensations through methods like the method of limits and constant stimuli, positing a logarithmic law where equal perceptual increments correspond to multiplicative stimulus changes.[13] These innovations introduced rigorous experimental protocols and mathematical models for sensory measurement, establishing precedents for treating psychological phenomena as quantifiable constructs amenable to scientific analysis.[14]
Francis Galton advanced these quantitative traditions by applying them to human mental variation and heredity, motivated by Charles Darwin's On the Origin of Species (1859). In Hereditary Genius (1869), Galton analyzed biographical data on eminent individuals, concluding that intellectual ability follows a normal distribution and clusters familially due to genetic transmission rather than solely environmental factors, thus emphasizing stable individual differences in cognitive faculties.[15] To gather empirical data, he opened an anthropometric laboratory at the International Health Exhibition in South Kensington in 1884, followed by a permanent site in 1885, where over 9,000 participants underwent measurements of physical traits (e.g., height, arm span, lung capacity) alongside sensory and reaction-time tests intended as indices of nervous system efficiency and innate mental prowess.[16] Galton's statistical contributions further bridged measurement to variation analysis: he introduced regression toward the mean in studies of height inheritance during the 1880s and formalized correlation in his 1888 Royal Society paper "Co-relations and Their Measurement," using anthropometric data to quantify interdependent deviations among traits, thereby enabling the statistical modeling of human diversity essential to psychometric inference.[17]
Early 20th-Century Foundations
In the early 20th century, psychometrics shifted from 19th-century sensory discrimination measures, such as those pioneered by Francis Galton focusing on reaction times and perceptual acuity, to evaluations of complex cognitive abilities like reasoning and judgment, reflecting a recognition that higher mental functions better captured individual differences in intelligence.[18] This practical turn began with the 1905 Binet-Simon scale, developed by Alfred Binet and Théodore Simon to identify French schoolchildren needing remedial education amid expanding compulsory schooling laws. The scale comprised 30 tasks escalating in difficulty, normed by age groups from 3 to 13 years, yielding a mental age score representing the highest age level of tasks a child could reliably complete—thus comparing performance against chronological age peers rather than absolute metrics.[19][20] Subsequent revisions, including the 1908 version, refined this approach by incorporating mental levels for subnormal performers, establishing intelligence as a developmental benchmark amenable to quantification.[21]
Concurrently, Charles Spearman advanced theoretical foundations through factor analysis in his 1904 paper, analyzing correlations across sensory, memory, and reasoning tasks from schoolchildren and adults, which revealed a consistent "positive manifold" of intercorrelations averaging around 0.5 to 0.7. Spearman attributed this pervasive hierarchy to a single general factor, g, representing core intellectual energy, with residual specific factors explaining task-unique variance—a parsimonious model contrasting multifaceted views and enabling latent trait extraction from observed scores.[22][23]
World War I imperatives for efficient recruit classification propelled these innovations into mass application, as the U.S. Army, led by Robert Yerkes, devised the Army Alpha (verbal, for literates) and Beta (pictorial, for non-readers) group tests in 1917. Administered to roughly 1.7 million inductees by 1919 across 40 verbal and 7 performance subtests, they classified personnel into ability grades correlating with training completion rates (e.g., higher scores linked to officer suitability and lower desertion), validating scalability for over 100,000 daily administrations while exposing limitations like cultural bias in verbal items.[24][25] These efforts underscored psychometrics' utility for causal prediction in real-world selection, bridging individual diagnostics to societal demands.
Mid-20th-Century Advances
In the mid-20th century, psychometric theory advanced through hierarchical models of intelligence that integrated Louis L. Thurstone's earlier identification of multiple primary mental abilities—such as verbal comprehension, spatial visualization, and numerical facility—with Charles Spearman's concept of a general intelligence factor (g). Thurstone's 1938 framework, which emphasized separable abilities over a dominant unitary g, influenced dominant psychometric approaches throughout the 1940s and 1950s, prompting refinements like Philip E. Vernon's 1950 hierarchical model positing g at the top level above verbal-educational and practical-mechanical group factors.[26][27] These developments resolved tensions between unitary and multifactor views by empirically demonstrating a general factor explaining substantial variance (often 40-50%) atop specific abilities, supported by factor analyses of large test batteries.[28]
Reliability assessment and standardization practices also matured, enabling broader application. Lee J. Cronbach's 1951 introduction of coefficient alpha provided a widely adopted index of internal consistency for test items, quantifying how well items measure a unidimensional construct under classical test theory assumptions, with values above 0.7 typically deemed acceptable for research instruments.[29] Concurrently, norming procedures for comprehensive batteries improved; David Wechsler's Wechsler Adult Intelligence Scale (WAIS), published in 1955, established age-graded norms based on U.S. samples exceeding 2,000 adults, facilitating clinical and educational comparisons by yielding a full-scale IQ with a mean of 100 and standard deviation of 15.[30] These tools supported post-World War II institutionalization, as psychometrics permeated educational tracking and personnel selection in schools, military aptitude testing (e.g., extensions of Army General Classification Test protocols), and civil service exams.[31]
Empirical links to behavioral genetics strengthened psychometric validity, with twin and adoption studies from the 1920s to 1960s estimating intelligence heritability at 0.5 to 0.8. Early work, such as the 1937 study by Freeman, Holzinger, and Newman on 19 monozygotic and 86 dizygotic twin pairs, derived heritability around 0.87 from IQ correlation differences (MZ r ≈ 0.91, DZ r ≈ 0.63), attributing variance primarily to genetic factors after controlling for shared environments.[32] Later analyses, including Cyril Burt's 1958 syntheses of British twin data yielding h² ≈ 0.77, reinforced this range despite methodological debates, paving the way for causal interpretations of test scores as reflecting heritable traits modulated by environment.[32] These estimates, derived from resemblance patterns in separated twins and adoptees, underscored psychometrics' foundation in measurable, biologically influenced constructs, influencing selection policies amid growing test use in postwar economies.[32]
Late 20th- and 21st-Century Innovations
Computerized adaptive testing (CAT), which tailors item selection to an examinee's ability level in real time to maximize measurement precision with fewer items, gained practical momentum in the late 20th century following advances in item response theory (IRT). Early conceptual work on IRT-based CAT began in the 1970s, but computational feasibility emerged with affordable personal computers in the 1980s, enabling simulations and prototypes.[33] By the 1990s, CAT was implemented in large-scale assessments, such as the Graduate Record Examination (GRE) in 1994 and the Armed Services Vocational Aptitude Battery, reducing test administration time by up to 50% while preserving reliability equivalent to fixed-form tests.[34] These innovations leveraged large item banks calibrated via IRT parameters, allowing dynamic adjustment of difficulty to minimize standard error of measurement around the examinee's theta estimate.[35]
In the 21st century, psychometrics integrated with molecular genetics through genome-wide association studies (GWAS), yielding polygenic scores (PGS) that predict cognitive abilities including intelligence. Large-scale GWAS since the 2010s identified thousands of genetic variants associated with educational attainment and cognitive performance, enabling PGS that explain 7-12% of variance in general cognitive ability and correlate around 0.25-0.33 with IQ in independent validation samples.[36][37] These scores, derived from effect sizes of single nucleotide polymorphisms, demonstrate within-family predictive validity, mitigating population stratification biases and supporting heritability estimates from twin studies. Neuroimaging modalities, such as functional MRI (fMRI) developed in the early 1990s, further converged with psychometrics by mapping brain activity patterns to latent traits measured by tests, enhancing construct validation through multivariate analyses like psychometric similarity metrics.[38] Such multimodal approaches underscore causal links between genetic predispositions, neural substrates, and observable psychometric variance.
Post-2020 developments harnessed artificial intelligence and big data for scalable, context-aware assessments. Large language models (LLMs) have been repurposed for personality profiling by administering standard inventories like the Big Five, yielding embeddings that predict traits with reliability comparable to human raters and enabling dynamic, text-based diagnostics.[39][40] Machine learning algorithms process vast datasets from wearable sensors and digital footprints to refine item response models, adapting to individual differences in real-world behaviors. Virtual reality (VR) and gamified platforms, introduced in psychometric contexts around the 2010s, improve ecological validity by simulating naturalistic tasks—such as executive function challenges in immersive environments—correlating strongly with traditional cognitive batteries while reducing abstractness biases in lab settings.[41] These tools address limitations of static tests by incorporating behavioral telemetry, though ongoing validation emphasizes the need for diverse samples to counter algorithmic overfitting.[42]
Conceptual and Definitional Foundations
Psychological Attributes as Measurable Constructs
Psychometrics constitutes the discipline dedicated to the measurement of latent psychological attributes, such as cognitive abilities and personality traits, which are inferred from patterns of observable behavior and performance on standardized tasks.[7] These attributes are not directly perceptible but manifest through proxies like response accuracy, speed, or consistency, enabling quantification via statistical models that distinguish systematic variance from error.[43] This approach rests on the premise that psychological constructs possess sufficient causal potency to influence repeated behavioral outcomes predictably, allowing for empirical validation independent of subjective interpretation.
A central construct in psychometrics is the general factor of intelligence, or g, identified by Charles Spearman in 1904 as the dominant source of covariance among diverse cognitive tasks.[44] g represents a hierarchically superordinate ability that subsumes narrower factors, such as verbal or spatial skills, and correlates with biological markers including neural efficiency observed in brain imaging studies, where higher-g individuals exhibit reduced metabolic activation during cognitive demands.[45] Meta-analytic evidence establishes g as causally efficacious, accounting for 20-50% of variance in real-world outcomes like occupational attainment and longevity, with corrected correlations reaching approximately 0.5 for job performance.[46][44]
Personality attributes, modeled via frameworks like the Big Five (openness, conscientiousness, extraversion, agreeableness, neuroticism), exemplify stable latent traits with test-retest correlations exceeding 0.6 over multi-year intervals, indicating enduring individual differences amid minor fluctuations.[47] These traits predict behavioral consistencies, such as conscientiousness correlating with academic persistence, and are distinguished from ephemeral states by their resistance to short-term perturbation, as evidenced in longitudinal cohorts spanning decades.[48] The measurability of such constructs hinges on their hierarchical structure and replicable links to observables, underscoring psychometrics' commitment to falsifiable, data-driven inference over introspective or categorical assertions.
Challenges in Social Science Measurement
Psychological attributes, such as intelligence or personality traits, pose measurement challenges in social sciences due to their latent nature, requiring inference from observable behaviors rather than direct quantification as in physical sciences, where entities like mass or length yield repeatable, additive readings under controlled conditions.[49] Unlike physical measurement, psychometric scales derive from correlational patterns across items, raising questions about whether scores represent true intervals or merely ordinal rankings, as interactions among cognitive faculties can render total scores non-additive—for instance, synergistic effects in problem-solving may not sum linearly across subtests.[50] These definitional hurdles are mitigated through deviation-based scoring, such as IQ as standard deviations from a normative mean, which approximates ratio scaling while acknowledging population variability, and ipsative approaches that normalize within individuals to highlight relative strengths without assuming cross-person comparability.[51]
Causal realism underpins psychometric constructs as stable dispositions that exert influence on behavior, distinct from ephemeral states or purely interpretive frameworks, with empirical support from longitudinal data demonstrating predictive power beyond contemporaneous correlations.[52] For example, general intelligence (g) measured in childhood forecasts educational attainment with correlations exceeding 0.5, as evidenced in prospective studies tracking thousands of participants over decades, where early IQ scores independently predict years of schooling and academic performance after controlling for socioeconomic factors.[53][54] This stability—with test-retest correlations for IQ often above 0.7 across intervals of years—affirms traits as causal entities rather than constructs devoid of ontological status, countering constructivist views that prioritize subjective meaning over observable outcomes.[55]
Empirical rigor in psychometrics thus favors convergent evidence from diverse indicators—behavioral, physiological, and genetic—over relativistic interpretations, as predictive utility in real-world criteria, such as occupational success or health outcomes, validates measurement despite scaling approximations.[56] Challenges persist in equating scale units precisely, yet multi-method triangulation, including neuroimaging correlates of g (e.g., brain efficiency metrics aligning with psychometric variance), bolsters causal claims against skepticism that deems such efforts illusory.[57] This approach privileges data-driven falsification, rejecting undue deference to interpretive paradigms that undervalue quantitative prediction in favor of narrative coherence.
Theoretical Frameworks
Classical Test Theory
Classical test theory (CTT) models an observed score X on a psychological test as the sum of a true score T, reflecting the examinee's underlying attribute, and a random error component E, such that X = T + E.[58] The true score T represents the expected value of X over repeated administrations under identical conditions, while E is assumed to have a mean of zero and zero covariance with T, ensuring that errors do not systematically bias the measurement and that aggregation across items or trials reduces error variance for greater stability.[59] This additive decomposition, rooted in early correlational work by Charles Spearman around 1904 and formalized by Harold Gulliksen in his 1950 monograph Theory of Mental Tests, emphasizes total score reliability over item-level analysis, making it suitable for norming instruments on large samples where empirical correlations suffice for practical inference.[58]
Reliability in CTT is defined as the ratio of true score variance to observed score variance, σ²_T / σ²_X, indicating the proportion of score variation attributable to true differences rather than error.[60] Estimates derive from parallel forms reliability, correlating scores from two theoretically equivalent test versions administered separately to capture consistency across administrations, or split-half reliability, where a single test is divided into comparable halves (e.g., odd-even items), with the resulting correlation adjusted via the Spearman-Brown formula to predict full-test reliability: r_full = 2r_half / (1 + r_half).[61] These methods assume parallel tests with equal means, variances, and error structures, enabling error variance partitioning even without multiple forms.[62]
CTT extends to item aggregation under tau-equivalence, assuming items share equal true score loadings and error variances, which justifies internal consistency estimators like Cronbach's alpha, α = (k / (k − 1)) × (1 − Σσ²_i / σ²_X), where k is the number of items and σ²_i the variance of item i; the congeneric model relaxes this to permit varying item reliabilities and loadings while retaining the additive true-score-plus-error structure for composite scores.[63][64] Applied in early intelligence batteries, such as Lewis Terman's 1916 Stanford revision of the Binet-Simon scale, CTT facilitated norm development through split-half and alternate-form correlations on thousands of U.S. children, yielding age-based standards with reported reliabilities often exceeding 0.90 for full scales.[65] Despite its simplicity and efficacy for aggregate norming, CTT's parameters proved sample-dependent, with item difficulties and discriminations varying across groups, as empirical data from the 1960s—such as in adaptive testing pilots and cross-validation studies—highlighted non-parallelism and ignored item-error interactions, limiting precision for heterogeneous populations and spurring item response theory alternatives.[66][65]
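The X = T + E decomposition can be illustrated with a short simulation (Python with NumPy; the variances and sample size are hypothetical, not drawn from the cited sources): under the parallel-forms assumptions, the correlation between two forms approximates the theoretical reliability σ²_T / σ²_X.

```python
import numpy as np

# Simulate the classical model X = T + E for two parallel forms:
# the same true scores T, independent errors with equal variance.
rng = np.random.default_rng(42)
n = 5000
true_scores = rng.normal(loc=50, scale=10, size=n)  # var(T) = 100
error_a = rng.normal(scale=5, size=n)                # var(E) = 25
error_b = rng.normal(scale=5, size=n)

form_a = true_scores + error_a
form_b = true_scores + error_b

# Theoretical reliability: var(T) / var(X) = 100 / 125 = 0.80.
theoretical = 100 / (100 + 25)
parallel_forms_r = np.corrcoef(form_a, form_b)[0, 1]

print(f"Theoretical reliability:    {theoretical:.2f}")
print(f"Parallel-forms correlation: {parallel_forms_r:.2f}")  # approximately 0.80
```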
Item Response Theory
Item response theory (IRT) posits that the probability of a correct response to a test item depends on the examinee's position on an underlying latent trait continuum, such as cognitive ability, and the item's specific characteristics, modeled via probabilistic functions rather than aggregate scores. This framework calibrates items independently of the tested population, yielding invariant parameter estimates that hold across diverse groups when model assumptions are met.[67] Fundamental to IRT is the item characteristic curve (ICC), which graphically represents the logistic or normal ogive probability of success as a function of trait level, enabling separation of person ability from item properties.[68]
The one-parameter logistic (1PL) model, synonymous with the Rasch model, was formalized by Georg Rasch in his 1960 monograph Probabilistic Models for Some Intelligence and Attainment Tests, incorporating only an item difficulty parameter b while fixing discrimination at unity, thus prioritizing specific objectivity where raw scores suffice as sufficient statistics for trait estimation.[69] The two-parameter logistic (2PL) model extends this by adding an item discrimination parameter a, allowing steeper or shallower ICC slopes to reflect varying item sensitivity to trait differences, as developed in Frederic Lord's 1968 work on statistical theories of mental test scores.[70] Three-parameter logistic (3PL) models further include a lower asymptote c for guessing in dichotomous multiple-choice items, fitting scenarios with nonzero baseline success probabilities, though this increases estimation complexity and requires larger samples for stability.[71]
IRT's parameter invariance supports computerized adaptive testing (CAT), where items are dynamically selected to target the examinee's estimated trait level, maximizing measurement precision with fewer items—typically 50-70% shorter than fixed forms—while minimizing floor and ceiling effects.[72] In high-stakes contexts, such as the Graduate Record Examination (GRE), Educational Testing Service implemented IRT-based CAT in 1993, achieving test exposure reductions of over 50% with score reliabilities correlating above 0.90 to conventional administrations, thus enhancing efficiency without compromising validity.[73] IRT also facilitates equating across non-parallel test forms via common-item linking, ensuring score comparability, and differential item functioning (DIF) analysis, which statistically tests for trait-irrelevant group differences in item performance, promoting fairness in diverse populations.[74] These capabilities yield empirical advantages in precision and generalizability, particularly for large-scale assessments where classical methods falter under heterogeneous conditions.[75]
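The 1PL, 2PL, and 3PL models differ only in which item parameters are free, as the following minimal sketch of the response-probability function shows (Python with NumPy; the parameter values are hypothetical):

```python
import numpy as np

def irt_prob(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    With a = 1 and c = 0 this reduces to the Rasch/1PL model;
    with c = 0 only, it is the 2PL model.
    """
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)  # trait levels from -3 to +3
print("1PL, b = 0.0:            ", np.round(irt_prob(theta), 2))
print("2PL, a = 1.7, b = 0.5:   ", np.round(irt_prob(theta, a=1.7, b=0.5), 2))
print("3PL, c = 0.2 (guessing): ", np.round(irt_prob(theta, a=1.7, b=0.5, c=0.2), 2))
```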
Factor Analytic and Structural Equation Models
Factor analysis in psychometrics reduces the dimensionality of observed variables, such as cognitive test scores, by identifying latent factors that account for their intercorrelations, thereby uncovering underlying trait structures. Exploratory factor analysis (EFA), advanced by Louis L. Thurstone in the 1930s through works like Primary Mental Abilities (1938), employs techniques such as centroid methods or principal components to derive factors empirically from data covariance matrices, without imposing a priori constraints on factor patterns.[76][27] Thurstone's application to intelligence tests revealed multiple primary abilities, challenging Spearman's single-factor view while highlighting emergent higher-order common variance.[27]
Confirmatory factor analysis (CFA) and structural equation modeling (SEM), formalized by Karl G. Jöreskog in 1969, shift to theory-driven validation by specifying hypothesized factor structures and estimating parameters like loadings via maximum likelihood, with model fit assessed through indices such as the Comparative Fit Index (CFI; values >0.95 denote adequate fit) and Root Mean Square Error of Approximation (RMSEA; <0.06 preferred).[77][78] These methods test hierarchical intelligence models, where broad factors (e.g., verbal, perceptual) load on a second-order general factor (g), explaining residual correlations beyond first-order specifics.[77] Bifactor models extend this by allowing direct loadings from all indicators onto g and orthogonal group factors, partitioning common variance more precisely; in cognitive batteries, g typically captures 40-60% of total variance, with specific factors accounting for 20-30% orthogonal to g, as evidenced in reanalyses of large datasets like the Woodcock-Johnson.[79][80] This structure empirically validates g's dominance, with bifactor fit often superior to higher-order alternatives due to reduced parameter constraints, though equivalence in explained variance holds under certain rotations.[81]
SEM integrates with behavioral genetics by modeling twin covariances to estimate paths from latent genetic factors to g, yielding heritabilities of 50-80% for general intelligence.[82] Polygenic scores (PGS) from GWAS, aggregating thousands of variants, load primarily on g (explaining ~58% of genetic variance across cognitive traits) rather than specifics, with SEM confirming causal genetic precedence over environmental confounds in longitudinal and adoption designs.[82][83] This supports g as a biologically grounded construct, where PGS predict g-loaded outcomes independently of test-specific residuals.[84]
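A simple way to see the bifactor variance partition is to square standardized loadings under orthogonal factors; the sketch below (Python with NumPy; the loadings are hypothetical, chosen only to mirror the proportions described above) computes the shares of total variance attributable to g, to group factors, and to uniqueness:

```python
import numpy as np

# Hypothetical standardized bifactor loadings for six subtests: each loads on
# the general factor g and on one of two orthogonal group factors.
g_loadings = np.array([0.70, 0.65, 0.60, 0.72, 0.68, 0.62])
group_loadings = np.array([0.50, 0.48, 0.45, 0.52, 0.47, 0.49])

# With orthogonal factors, each subtest's common variance is the sum of its
# squared loadings; the remainder is unique (specific plus error) variance.
var_g = np.sum(g_loadings ** 2)
var_group = np.sum(group_loadings ** 2)
total = len(g_loadings)  # total standardized variance (1 per subtest)

print(f"Share of total variance due to g:             {var_g / total:.0%}")
print(f"Share due to group factors (orthogonal to g): {var_group / total:.0%}")
print(f"Unique/error variance:                        {1 - (var_g + var_group) / total:.0%}")
```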
Methods and Instruments
Test Construction Procedures
Test construction in psychometrics involves a systematic, iterative process grounded in empirical data collection to develop scales that reliably measure targeted psychological constructs. Initial item generation draws from content domain sampling, in which subject matter experts delineate the construct's theoretical boundaries and produce a broad pool of potential items (often three to five times the final test length) to ensure comprehensive coverage without redundancy.[85] This step emphasizes logical representation of the domain, informed by job analysis, literature reviews, or critical incidents, to align items with the intended inferences.[86] Pilot testing follows, typically on a convenience sample of 100-200 respondents, to gather preliminary data for item analysis. Key metrics include item difficulty (p-values between 0.30 and 0.70 for optimal discrimination) and corrected item-total correlations, with values above 0.30 signaling acceptable item contribution to scale homogeneity; items below 0.20-0.30 are typically revised or eliminated because they fail to covary sufficiently with the total score.[87][88] Revisions incorporate qualitative feedback, such as think-aloud protocols, alongside quantitative refinement to improve clarity and reduce ambiguity.[89] Subsequent administration to validation samples, often stratified by key demographics such as age, sex, education, and ethnicity to mirror population distributions (e.g., U.S. Census proportions for broad-ability tests), enables norming through percentile ranks or standardized scores with means of 100 and standard deviations of 15.[86][90] This ensures generalizability, as deviations from representativeness, such as oversampling urban or educated subgroups, can inflate norms and misrepresent population standings.[91] Modern procedures leverage digital tools for efficiency, including crowdsourced platforms like Amazon Mechanical Turk for diverse, rapid piloting of item banks, which accelerate data accrual while requiring safeguards against non-serious responses.[92] Machine learning techniques, such as anomaly detection models, further refine datasets by flagging inconsistent or inattentive patterns (e.g., straight-lining or rapid completion times), improving response validity before final scaling.[93] These approaches, validated against traditional methods, enhance scalability without compromising empirical rigor.[94]
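The item statistics named above are simple to compute. The sketch below applies them to simulated pilot responses (the data, the 0.30-0.70 difficulty band, and the 0.30 item-total threshold are illustrative assumptions) to show how items would be flagged for review.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pilot data: 150 respondents by 10 dichotomous items, generated
# from a simple one-parameter logistic model purely to illustrate the statistics.
ability = rng.standard_normal(150)
difficulty = np.linspace(-1.5, 1.5, 10)
prob = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((150, 10)) < prob).astype(int)

# Item difficulty: proportion of correct answers (roughly 0.30-0.70 is often targeted).
p_values = responses.mean(axis=0)

# Corrected item-total correlation: each item against the total score with that
# item removed; values below about 0.30 commonly prompt revision or deletion.
totals = responses.sum(axis=1)
corrected_r = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

for j, (p, r) in enumerate(zip(p_values, corrected_r)):
    flag = "" if 0.30 <= p <= 0.70 and r >= 0.30 else "  <- review"
    print(f"item {j}: difficulty = {p:.2f}, corrected item-total r = {r:.2f}{flag}")
```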
Types of Psychometric Assessments
Cognitive ability assessments measure general and specific intellectual capacities, often through standardized batteries that yield scores reflecting the general intelligence factor (g). The Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV), normed and released in 2008, comprises 10 core subtests assessing verbal comprehension, perceptual reasoning, working memory, and processing speed, with many subtests, such as Arithmetic, Vocabulary, and Figure Weights, exhibiting g-loadings above 0.7.[95][96] Meta-analyses confirm that cognitive ability measures like those from the WAIS predict job performance with a corrected validity coefficient of approximately 0.51 across complex roles, outperforming other single predictors in personnel selection.[97] Personality assessments evaluate stable traits influencing behavior, affect, and interpersonal dynamics, typically via self-report inventories targeting the Big Five model (Neuroticism, Extraversion, Openness to Experience, Agreeableness, Conscientiousness). The NEO Personality Inventory–Revised (NEO-PI-R) operationalizes these dimensions through 240 items, with facet-level scoring for nuanced profiling; twin studies estimate broad heritability at 40–60% for the traits, indicating substantial genetic influence alongside environmental factors.[98][99] Empirically, these traits forecast life outcomes beyond cognitive measures, such as elevated divorce risk linked to high Neuroticism (odds ratio ≈ 1.5–2.0) and low Conscientiousness, per meta-analytic syntheses of longitudinal data.[100] Aptitude and achievement assessments gauge learned knowledge, specific skills, or potential for targeted performance, often in educational or vocational contexts. Tests like the SAT and ACT, administered to millions annually for admissions, have incorporated multidimensional scoring since redesigns in the mid-2010s, yielding subscores in domains such as evidence-based reading/writing, math, and science reasoning alongside composite totals.[101] The 2016 SAT revision, for instance, emphasized skills like data interpretation and essay analysis, enabling domain-specific validity evidence for college success, with section correlations to GPA ranging from 0.3 to 0.5; these evolutions reflect adaptations to broader competency models while maintaining predictive utility for academic trajectories.[102]
Standards of Psychometric Quality
Reliability Evaluation
Reliability evaluation in psychometrics quantifies the consistency of test scores, distinguishing true score variance from error variance under classical test theory, where the observed score equals the true score plus error.[103] High reliability ensures stable inferences about underlying constructs, with coefficients estimating the ratio of true variance to total variance; values above 0.80 indicate strong internal consistency for multi-item scales, as measured by Cronbach's alpha, which assesses item intercorrelations assuming unidimensionality and tau-equivalence.[104] For stable traits like intelligence, test-retest correlations exceeding 0.70 over intervals of weeks to months demonstrate temporal stability, while inter-rater reliability, often assessed via intraclass correlation coefficients (ICC), evaluates agreement among observers, targeting ICC > 0.75 for subjective ratings.[105] Generalizability theory extends classical approaches by partitioning variance across multiple facets, such as items, raters, and occasions, yielding a generalizability coefficient (G) that generalizes findings to broader universes of conditions, superior to single-facet estimates for complex assessments.[106] This framework uses analysis of variance to estimate error from interactions, enabling decision studies to optimize design for maximal reliability given resource constraints.[107] Sources of measurement error include transient respondent states (e.g., fatigue or motivation fluctuations) and situational variability, which standardized administration protocols (uniform instructions, timing, and testing environments) minimize to boost coefficients; for instance, controlled conditions in cognitive testing yield reliabilities far exceeding ad hoc self-reports.[103] In practice, intelligence tests average reliability coefficients of 0.90 or higher across full scales, with subtests often at 0.88-0.93, outperforming many social science indicators where alphas hover below 0.70 due to greater subjectivity.[108][109]
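For internal consistency, Cronbach's alpha can be computed directly from a respondents-by-items score matrix; the sketch below does so on simulated Likert-type responses (the sample size, item count, and data-generating model are assumptions made only for illustration).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Simulated data: 200 respondents answering 8 items that all tap one common
# trait, so the items intercorrelate and alpha comes out high.
rng = np.random.default_rng(2)
trait = rng.standard_normal(200)
items = trait[:, None] + rng.standard_normal((200, 8))

print(f"alpha = {cronbach_alpha(items):.2f}")
```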
Validity Assessment
Validity assessment in psychometrics determines the extent to which test scores correspond to theoretically expected patterns and real-world outcomes, emphasizing empirical evidence from predictive power and nomological networks rather than subjective face validity. Criterion-related validity, particularly predictive validity, is demonstrated through correlations between test scores and external criteria such as job performance or academic achievement. Meta-analyses of general mental ability (GMA) tests report validity coefficients of approximately 0.51 for predicting job performance across diverse occupations after corrections for measurement error and range restriction, explaining roughly 26% of variance in outcomes. Similar patterns hold for academic success, where cognitive tests forecast grade point averages with validities around 0.40-0.50 in large-scale studies, outperforming non-ability predictors in longitudinal designs. Construct validity is established via convergent and divergent associations within nomological nets, often analyzed using multitrait-multimethod (MTMM) matrices to disentangle trait-method variance. In MTMM frameworks, measures of the same construct (e.g., intelligence across verbal, spatial, and numerical tasks) exhibit higher correlations than measures of different constructs using identical methods, supporting the coherence of underlying factors like the general intelligence factor (g). For g, convergent validity appears in robust negative correlations with elementary cognitive tasks, such as choice reaction times (r ≈ -0.40 to -0.50), reflecting neural processing efficiency, while divergent validity is evident in negligible associations with extraneous variables like self-reported test effort or motivation in neutral testing contexts. These patterns align with biological and experimental indicators, including brain imaging correlates of processing speed, reinforcing g's theoretical independence from motivational artifacts.[110] Incremental validity highlights psychometrics' added predictive utility beyond alternative methods. Cognitive ability tests contribute substantial unique variance in personnel selection, with effect sizes (Cohen's d) exceeding 1.0 when combined with subjective assessments like unstructured interviews (which alone yield validities of ~0.38). Structured combinations of GMA tests and interviews achieve corrected validities up to 0.63, demonstrating psychometrics' foundational role in enhancing overall criterion prediction over relying solely on non-test-based evaluations. This incremental benefit persists even after recent adjustments for methodological artifacts like range restriction, underscoring the empirical robustness of psychometric scores in applied settings.
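Incremental validity is usually reported as the gain in explained criterion variance when a test is added to an existing predictor set. The sketch below illustrates the hierarchical-regression logic on simulated data; the sample size and effect sizes are assumptions for demonstration, not estimates from any published validation study.

```python
import numpy as np

rng = np.random.default_rng(3)

def r_squared(predictors: np.ndarray, criterion: np.ndarray) -> float:
    """R^2 from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(criterion)), predictors])
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    residuals = criterion - X @ beta
    return 1.0 - residuals.var() / criterion.var()

# Simulated selection scenario: performance depends strongly on GMA and more
# weakly on interview ratings, which themselves partly reflect ability.
n = 2000
gma = rng.standard_normal(n)
interview = 0.3 * gma + rng.standard_normal(n)
performance = 0.5 * gma + 0.15 * interview + rng.standard_normal(n)

r2_interview = r_squared(interview[:, None], performance)
r2_combined = r_squared(np.column_stack([interview, gma]), performance)
print(f"interview alone: R^2 = {r2_interview:.3f}")
print(f"interview + GMA: R^2 = {r2_combined:.3f} (increment = {r2_combined - r2_interview:.3f})")
```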
Fairness, Bias, and Equivalence
Differential item functioning (DIF) assesses whether test items yield different probabilities of correct response across groups matched on overall ability, potentially indicating bias unrelated to the construct. Common methods include the Mantel-Haenszel procedure for dichotomous items, which computes a common odds ratio stratified by ability levels to detect uniform DIF, and logistic regression, which models the item response as a function of ability, group membership, and their interaction to identify both uniform and non-uniform DIF.[111][112] In highly g-loaded intelligence tests, such as Raven's Progressive Matrices, empirical DIF analyses across diverse groups, including sex and ethnic comparisons, consistently show minimal to negligible bias, with few items exhibiting significant DIF after ability matching.[111][113] This pattern holds particularly for abstract, nonverbal items, supporting the claim that observed group differences primarily reflect true ability disparities rather than artifactual cultural loading.[114] Predictive bias evaluates whether tests forecast outcomes differentially across groups, examined via regression slope equality and intercept differences. Meta-analytic evidence indicates that general cognitive ability measures exhibit comparable predictive validities for job performance, educational attainment, and income across racial and gender subgroups, with correlations typically ranging from 0.5 to 0.6 without systematic slope attenuation.[115] For instance, validity coefficients for cognitive tests predicting supervisory ratings of performance do not differ significantly between Black and White employees, countering claims of subgroup underprediction; in some datasets, correlations are even stronger for minority groups.[116] While mean score differences persist, the absence of slope disparities implies equal utility in forecasting individual outcomes, aligning with causal models where g drives real-world success independently of demographic artifacts.[117] Cross-cultural equivalence in psychometric instruments requires testing measurement invariance (MI) using structural equation modeling (SEM), progressing from configural (factor structure equality) to metric (factor loading invariance), scalar (intercept invariance), and strict (residual variance invariance) levels. Strict MI constraints ensure comparable latent trait measurement across adaptations, allowing valid mean comparisons; violations signal nonequivalence, often due to linguistic or contextual factors rather than core construct differences.[118] For general intelligence factors, SEM analyses support at least metric (weak factorial) invariance across diverse populations, enabling cross-national g comparisons, though scalar invariance is rarer in verbal subtests and more robust in fluid measures like Raven's.[119] These tests underscore that while adaptations demand rigorous validation, g-loaded assessments demonstrate sufficient equivalence to attribute score variances to substantive cognitive differences over methodological artifacts.[120]
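The logistic-regression DIF procedure compares nested models for a single item; the sketch below runs it with statsmodels on simulated data in which a uniform group effect has been injected deliberately (the sample, the total-score ability proxy, and the effect size are assumptions chosen so the test has something to detect).

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)

# Simulated responses to one dichotomous item for two groups matched on an
# ability proxy; the -0.4 group effect below is an injected uniform DIF signal.
n = 3000
ability = rng.standard_normal(n)
group = rng.integers(0, 2, n)                       # 0 = reference, 1 = focal
logit = 1.1 * ability - 0.4 * group
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

def fit(columns):
    X = sm.add_constant(np.column_stack(columns))
    return sm.Logit(y, X).fit(disp=0)

m_ability = fit([ability])                             # ability only
m_uniform = fit([ability, group])                      # + group main effect (uniform DIF)
m_nonuniform = fit([ability, group, ability * group])  # + interaction (non-uniform DIF)

lr_uniform = 2 * (m_uniform.llf - m_ability.llf)
lr_nonuniform = 2 * (m_nonuniform.llf - m_uniform.llf)
print(f"uniform DIF:     LR chi2(1) = {lr_uniform:.1f}, p = {stats.chi2.sf(lr_uniform, 1):.4f}")
print(f"non-uniform DIF: LR chi2(1) = {lr_nonuniform:.1f}, p = {stats.chi2.sf(lr_nonuniform, 1):.4f}")
```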
Applications and Empirical Utility
Human Assessment Domains
Psychometric assessments are applied across human domains to evaluate cognitive, personality, and behavioral traits, informing decisions that demonstrably enhance outcomes through causal mechanisms such as resource allocation matched to individual capacities. In education, clinical practice, and industrial-organizational contexts, these tools facilitate targeted interventions, with empirical evidence linking their use to improved efficiency and reduced errors in high-stakes selections.[121]
Educational Applications
Cognitive ability tests, including IQ measures, guide student placement in ability-grouped or streamed programs, enabling instruction tailored to aptitude levels and thereby causally boosting achievement by optimizing learning pace and content complexity. A comprehensive meta-analysis synthesizing over 100 years of research on ability grouping and acceleration found overall positive effects on academic outcomes, with effect sizes averaging 0.12 to 0.29 standard deviations (SD) across grouping types, and larger gains (up to 0.5 SD in select acceleration interventions) for high-ability students through enriched curricula.[122] These practices correlate with policy-driven improvements, such as reduced frustration in mismatched classrooms and heightened motivation, yielding sustained gains in standardized test scores when implemented with psychometric rigor.[123]
Clinical Applications
In clinical settings, personality inventories like the Minnesota Multiphasic Personality Inventory (MMPI-2-RF) detect psychopathology by assessing symptom patterns against normative data, aiding differential diagnosis while necessitating base rate considerations to curb false positives. Validity studies confirm moderate diagnostic accuracy for disorders such as depression and schizophrenia, with scale elevations predicting clinical status better than chance, though low prevalence rates (e.g., <5% for specific pathologies) inflate false positive risks; up to 37% of non-clinical adults score at clinical thresholds (T ≥ 65) on at least one basic scale.[124] Causal utility emerges in treatment planning, where psychometric screening refines referrals, reducing unnecessary interventions and improving prognostic accuracy by integrating empirical profiles with epidemiological priors.[125]
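The base-rate caveat follows directly from Bayes' rule: when a disorder is rare, even a screen with good sensitivity and specificity yields mostly false positives. The sketch below uses assumed operating characteristics (not figures from the MMPI literature) to make the arithmetic explicit.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability that a positive screen reflects a true case, via Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Illustrative operating characteristics; only the base rate varies below.
for prevalence in (0.02, 0.10, 0.30):
    ppv = positive_predictive_value(sensitivity=0.80, specificity=0.85, prevalence=prevalence)
    print(f"prevalence {prevalence:.0%}: positive predictive value = {ppv:.0%}")
```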
Industrial-Organizational Applications
General mental ability (GMA) tests predict job performance across roles, with meta-analytic validities of approximately 0.51 in the classic syntheses, rising to about 0.65 under updated corrections for range restriction and unreliability, enabling hiring that causally elevates workforce productivity via cognitive-job fit.[126] Schmidt and Hunter's syntheses of decades of data (1980s–2000s) show that GMA-based selection yields high return on investment, including 20–50% reductions in turnover costs and performance variance explained of up to 25%, outperforming alternative single predictors such as unstructured interviews (r ≈ 0.38) and years of job experience (r ≈ 0.18) in utility models estimating societal economic gains in the billions.[46] These impacts stem from causal chains where validated assessments minimize mismatch penalties, fostering organizational efficiency without adverse effects on diverse hires when bias is controlled.[127]
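The return-on-investment claim can be made concrete with the Brogden-Cronbach-Gleser utility model, which converts a validity coefficient into an expected monetary gain. The sketch below assembles that estimate from entirely illustrative inputs; the headcount, tenure, SDy, selection ratio, and cost figures are assumptions, not values from any published utility study.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def utility_gain(n_selected: int, tenure_years: float, validity: float, sd_y: float,
                 selection_ratio: float, cost_per_applicant: float) -> float:
    """Brogden-Cronbach-Gleser estimate of the dollar gain from test-based selection."""
    # Mean standardized predictor score of those hired, assuming top-down
    # selection from a normally distributed applicant pool.
    z_cut = NormalDist().inv_cdf(1.0 - selection_ratio)
    mean_z_selected = exp(-z_cut**2 / 2.0) / sqrt(2.0 * pi) / selection_ratio
    n_applicants = n_selected / selection_ratio
    gain = n_selected * tenure_years * validity * sd_y * mean_z_selected
    return gain - n_applicants * cost_per_applicant

gain = utility_gain(n_selected=100, tenure_years=5, validity=0.51, sd_y=12_000,
                    selection_ratio=0.2, cost_per_applicant=50)
print(f"estimated gain over random selection: ${gain:,.0f}")
```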
