Repeatability
from Wikipedia

Repeatability or test–retest reliability[1] is the closeness of the agreement between the results of successive measurements of the same measure, when carried out under the same conditions of measurement.[2] In other words, the measurements are taken by a single person or instrument on the same item, under the same conditions, and in a short period of time. A less-than-perfect test–retest reliability causes test–retest variability. Such variability can be caused by, for example, intra-individual variability and inter-observer variability. A measurement may be said to be repeatable when this variation is smaller than a predetermined acceptance criterion.

Test–retest variability is practically used, for example, in medical monitoring of conditions. In these situations, there is often a predetermined "critical difference", and for differences in monitored values that are smaller than this critical difference, the possibility of variability as a sole cause of the difference may be considered in addition to, for example, changes in diseases or treatments.[3]
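
A common formulation of such a critical difference, sometimes called the reference change value, combines analytical (test–retest) variation with within-subject biological variation. The sketch below is illustrative only, with hypothetical CV inputs; the formula RCV = √2 · z · √(CV_A² + CV_I²) is one widely used variant, not necessarily the exact one intended by the source above.

```python
import math

def reference_change_value(cv_analytical: float, cv_within_subject: float,
                           z: float = 1.96) -> float:
    """Critical difference (reference change value), in percent.

    RCV = sqrt(2) * z * sqrt(CV_A^2 + CV_I^2), where CV_A is the analytical
    (test-retest) coefficient of variation and CV_I is the within-subject
    biological variation, both expressed in percent.
    """
    return math.sqrt(2) * z * math.hypot(cv_analytical, cv_within_subject)

# Illustrative values: 3% analytical CV, 5% within-subject biological CV
print(f"RCV = {reference_change_value(3.0, 5.0):.1f}%")  # -> RCV = 16.2%
```

A monitored change smaller than this value could plausibly be measurement variability alone, while a larger change suggests a real shift in the patient's condition.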

Conditions

The following conditions need to be fulfilled in the establishment of repeatability:[2][4]

  • the same experimental tools
  • the same observer
  • the same measuring instrument, used under the same conditions
  • the same location
  • repetition over a short period of time
  • the same objectives

Repeatability methods were developed by Bland and Altman (1986).[5]

If the correlation between separate administrations of the test is high (e.g. 0.7 or higher, a threshold comparable to those used for Cronbach's alpha internal-consistency coefficients[6]), then the test has good test–retest reliability.

The repeatability coefficient is a precision measure which represents the value below which the absolute difference between two repeated test results may be expected to lie with a probability of 95%.[citation needed]
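
Following the Bland–Altman approach mentioned above, the repeatability coefficient can be estimated from duplicate measurements: the within-subject standard deviation s_w is derived from the pair differences, and the coefficient is 1.96 · √2 · s_w, roughly 2.77 · s_w. A minimal sketch with made-up data:

```python
import math

def repeatability_coefficient(pairs):
    """Repeatability coefficient from duplicate measurements (Bland-Altman style).

    With one pair of measurements per subject, the within-subject variance is
    estimated as mean(d^2) / 2 for pair differences d, and the coefficient is
    1.96 * sqrt(2) * s_w (approximately 2.77 * s_w).
    """
    n = len(pairs)
    s_w2 = sum((a - b) ** 2 for a, b in pairs) / (2 * n)  # within-subject variance
    return 1.96 * math.sqrt(2 * s_w2)

# Hypothetical duplicate readings on five items
pairs = [(10.1, 10.3), (9.8, 9.9), (10.0, 10.2), (10.4, 10.3), (9.9, 10.0)]
print(f"RC = {repeatability_coefficient(pairs):.2f}")  # -> RC = 0.29
```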

The standard deviation measured under repeatability conditions is a component of measurement precision, which in turn contributes to overall accuracy.[citation needed]

Attribute agreement analysis for defect databases

An attribute agreement analysis is designed to simultaneously evaluate the impact of repeatability and reproducibility on accuracy. It allows the analyst to examine the responses from multiple reviewers as they look at several scenarios multiple times. It produces statistics that evaluate the ability of the appraisers to agree with themselves (repeatability), with each other (reproducibility), and with a known master or correct value (overall accuracy), consistently for each characteristic, as sketched below.[7]
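
As a rough illustration of the repeatability (appraiser-versus-self) component, this sketch computes, for each appraiser, the share of scenarios on which they gave the same verdict in every trial. The appraiser names, scenarios, and verdicts are hypothetical; real attribute agreement studies also report confidence intervals and kappa statistics.

```python
# Hypothetical ratings: ratings[appraiser][scenario] -> verdicts across repeated trials
ratings = {
    "A": {1: ["defect", "defect"], 2: ["ok", "ok"], 3: ["defect", "ok"]},
    "B": {1: ["defect", "defect"], 2: ["ok", "ok"], 3: ["ok", "ok"]},
}

def repeatability_score(by_scenario):
    """Share of scenarios where every repeated verdict was identical."""
    flags = [len(set(trials)) == 1 for trials in by_scenario.values()]
    return sum(flags) / len(flags)

for appraiser, by_scenario in ratings.items():
    print(appraiser, f"{repeatability_score(by_scenario):.0%}")  # A 67%, B 100%
```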

Psychological testing

Because the same test is administered twice and every test is parallel with itself, differences between scores on the test and scores on the retest should be due solely to measurement error. This sort of argument is quite probably true for many physical measurements. However, this argument is often inappropriate for psychological measurement, because it is often impossible to consider the second administration of a test a parallel measure to the first.[8]

The second administration of a psychological test might yield systematically different scores than the first administration due to the following reasons:[8]

  1. The attribute that is being measured may change between the first test and the retest. For example, a reading test that is administered in September to a third grade class may yield different results when retaken in June. One would expect some change in children's reading ability over that span of time, so a low test–retest correlation might reflect real changes in the attribute itself.
  2. The experience of taking the test itself can change a person's true score. For example, completing an anxiety inventory could serve to increase a person's level of anxiety.
  3. A carryover effect may occur, particularly if the interval between test and retest is short: when retested, people may remember their original answers, which can affect their responses on the second administration.

from Grokipedia
Repeatability is the closeness of agreement between the results of successive measurements of the same measurand carried out under the same conditions of measurement, including the same procedure, observer, measuring instrument, location, and a short interval of time. These conditions, known as repeatability conditions, ensure that variations in results are minimized to assess the inherent precision of a measurement system. In scientific experiments and metrology, repeatability serves as a core component of precision, enabling researchers to verify that outcomes are consistent and not due to random artifacts or chance. It is quantified through statistical measures such as the standard deviation of repeated results, which helps estimate and validate the reliability of instruments or methods.

For example, in laboratory testing under standards like ISO 5725, repeatability evaluates how consistently a measurement method produces results on identical items by the same operator in the same laboratory using the same equipment over a short period. This assessment is essential for applications in fields ranging from physics and engineering to biomedical research, where it underpins the trustworthiness of results and facilitates comparisons between measurement techniques.

Repeatability is often distinguished from related concepts like reproducibility, which involves obtaining consistent results under changed conditions such as different operators, equipment, or locations, and replicability, which tests findings with new data or independent studies. While high repeatability confirms precision within a single setup, achieving broader reproducibility and replicability strengthens the overall validity of scientific claims and combats issues like the replication crisis observed in disciplines such as psychology. By prioritizing repeatability, scientists ensure that foundational experimental rigor supports cumulative knowledge and innovation.

Fundamental Concepts

Definition

Repeatability is the degree to which the results of a measurement, experiment, or process remain consistent when repeated multiple times under the same conditions, including the use of the same operator, instrument, procedure, and location over a short period. In metrology, it is specifically defined as precision under a set of repeatability conditions, where these conditions encompass the same measurement procedure, operator, measuring system, operating conditions, and replicate measurements on the same or similar objects. This concept emphasizes the closeness of agreement between independent test results obtained under stipulated conditions of measurement, with all potential sources of variation (such as environmental factors, time intervals, and procedural details) held constant to isolate inherent process variability. The key attributes include minimizing extraneous influences to ensure that any observed differences arise solely from random measurement errors rather than systematic changes.

The notion of repeatability emerged in the 19th century as part of the broader standardization of scientific measurement, driven by efforts to establish uniform measurement systems amid industrial expansion, culminating in the 1875 Metre Convention that founded the International Bureau of Weights and Measures (BIPM). It was further formalized in the 20th century by the International Organization for Standardization (ISO), notably in ISO 5725-1, first published in 1994 and revised in 2023, which provides general principles and definitions for accuracy, trueness, and precision in measurement methods.

A basic example is repeating a titration in a controlled environment, where successive trials using identical reagents, temperature, and stirring method yield the same values, illustrating the process's repeatability. In contrast to reproducibility, which evaluates consistency under varied conditions like different operators or locations, repeatability strictly maintains identical setups to assess intrinsic reliability.

Repeatability is distinguished from related concepts in measurement and scientific practice by its focus on consistency under identical conditions, whereas reproducibility involves achieving similar outcomes when conditions are varied, such as by different operators or equipment. Replicability, in contrast, refers to the ability of independent researchers to obtain comparable results through new experiments or data, often emphasizing verification beyond the original study. Reliability encompasses a broader assessment of a method's overall stability and consistency across repeated uses, time periods, and varying conditions, serving as an umbrella term that includes aspects of both repeatability and reproducibility.

The International Organization for Standardization (ISO) provides precise definitions in ISO 5725, where repeatability is defined as the closeness of agreement between successive measurements of the same quantity under the same conditions (known as within-run variation), and reproducibility as the closeness of agreement between measurements under changed conditions, such as different laboratories or time periods (between-run or between-laboratory variation). These distinctions highlight repeatability as a measure of precision in a controlled, unchanging environment, while reproducibility tests robustness against external variables. A common misconception arises in media and public discourse on scientific integrity, where repeatability is frequently conflated with reproducibility during discussions of crises like the replication crisis in psychology, leading to overstated concerns about basic experimental consistency when the issue often pertains to broader inter-study validation. The following table summarizes these distinctions for clarity:
| Term | Conditions | Scope | Example |
|---|---|---|---|
| Repeatability | Identical (same operator, equipment, short time interval) | Within a single setup or run | Multiple readings from the same instrument under unchanged ambient conditions. |
| Reproducibility | Varied (e.g., different labs, operators, or equipment) | Between setups or runs | Consistent results obtained across independent laboratories using similar protocols. |
| Replicability | Independent (new data, methods by other researchers) | External verification | Separate research teams confirming a statistical effect with fresh participant samples. |
| Reliability | Overall (across time, conditions, and repetitions) | Broad measurement stability | A diagnostic tool providing consistent outcomes for the same patient over multiple sessions. |

Measurement and Assessment

Statistical Methods

Statistical methods for evaluating repeatability focus on quantifying the variation in repeated measurements obtained under identical conditions, enabling researchers and practitioners to assess the precision of measurement processes. These techniques partition sources of variability and provide metrics to determine whether a measurement system meets acceptable standards for reliability. Key approaches include descriptive statistics, variance component analysis, and graphical monitoring tools, often applied in quality control and experimental design.

The standard deviation of repeated measurements is a fundamental metric for repeatability, capturing the typical spread of results from successive trials of the same item or process. It is calculated as the square root of the variance of the dataset, where lower values indicate higher repeatability. Complementing this, the coefficient of variation (CV) normalizes the standard deviation relative to the mean, expressed as

$$\text{CV} = \frac{\sigma}{\mu} \times 100,$$

where $\sigma$ is the standard deviation and $\mu$ is the mean; this percentage-based measure facilitates comparisons across datasets with different units or scales, particularly in analytical and industrial settings.
Standardized formulas further refine these assessments. According to ISO 5725-2, the repeatability limit $r$ is the value below which the absolute difference between two repeated test results is expected to lie with 95% probability, given by

$$r = 2.8 \times \sigma_w,$$

where $\sigma_w$ is the within-laboratory standard deviation derived from multiple replicates; this assumes a normal distribution and is used to establish precision limits in interlaboratory studies. In manufacturing contexts, the gauge repeatability and reproducibility (GR&R) percentage evaluates measurement adequacy as

$$\text{GR\&R}\% = \frac{6 \times \sigma_{\text{GRR}}}{\text{tolerance}} \times 100,$$

where $\sigma_{\text{GRR}}$ is the standard deviation from the gage R&R study (combining repeatability and reproducibility variation) and tolerance is the specified limit; values below 10% indicate an acceptable measurement system, while 10–30% suggests a marginal system requiring improvement.
Analysis of variance (ANOVA) is a core technique for dissecting repeatability by partitioning total variation into components attributable to operators, parts, or equipment, using a random-effects model to estimate variance contributions. This method, often implemented in crossed or nested designs, tests for significant differences and quantifies repeatability as the residual error variance. Control charts, such as Shewhart charts, monitor repeatability over time by plotting measurement means or ranges against control limits (typically $\pm 3\sigma$), signaling deviations when points exceed bounds or exhibit non-random patterns, thus aiding ongoing stability assessment.

Software tools like R and Minitab facilitate these computations through built-in functions for variance analysis and metric calculation; R's irr package, for example, computes inter-rater and repeatability indices. As a simple illustration, ten repeated weight measurements of 100.2, 99.8, 100.1, 100.0, 99.9, 100.3, 100.1, 99.7, 100.0, and 100.2 g have a mean of 100.03 g and a sample standard deviation of about 0.19 g, giving a CV of approximately 0.19% and indicating high repeatability for a precision balance.
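
A minimal sketch of these basic metrics in Python (rather than R), using the weighing data quoted above. Treating this single run's sample standard deviation as the within-laboratory $\sigma_w$ is an assumption for illustration; a real ISO 5725 study would estimate it from a designed interlaboratory experiment.

```python
import statistics

# Ten repeated weighings of the same item on a precision balance (grams)
trials = [100.2, 99.8, 100.1, 100.0, 99.9, 100.3, 100.1, 99.7, 100.0, 100.2]

mean = statistics.mean(trials)   # 100.03 g
sd = statistics.stdev(trials)    # sample standard deviation, ~0.19 g
cv = sd / mean * 100             # coefficient of variation, ~0.19 %
r_limit = 2.8 * sd               # ISO 5725-style repeatability limit, ~0.53 g

print(f"mean={mean:.2f} g  sd={sd:.2f} g  CV={cv:.2f}%  r={r_limit:.2f} g")
```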

Attribute Agreement Analysis

Attribute Agreement Analysis (AAA) is a statistical method within measurement systems analysis (MSA) designed to evaluate the consistency and accuracy of subjective classifications in categorical data, such as assigning defect types by multiple appraisers. It focuses on repeatability by quantifying agreement beyond chance, helping identify sources of variation in human judgment during inspections. A key component of AAA is the Cohen's kappa statistic, which measures inter-rater agreement for categorical assignments while adjusting for expected agreement by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed proportion of agreement and $p_e$ is the expected proportion under chance. AAA typically includes two main appraisal types: appraiser-versus-standard, which assesses accuracy against a reference classification, and appraiser-versus-appraiser, which evaluates consistency among raters.
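
A minimal self-contained sketch of Cohen's kappa for two appraisers; the verdict labels and data are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters classifying the same items."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n       # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical defect verdicts from two appraisers on ten parts
a = ["def", "ok", "ok", "def", "ok", "ok", "def", "ok", "def", "ok"]
b = ["def", "ok", "def", "def", "ok", "ok", "ok", "ok", "def", "ok"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # p_o = 0.80, p_e = 0.52 -> kappa = 0.58
```
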
In defect databases, AAA is applied in manufacturing quality control, including inspection processes, to measure repeatability in categorizing defects from visual inspections of parts, ensuring reliable data entry for root cause analysis. For instance, in a binary defect classification (defective/non-defective) across 100 samples rated by three appraisers, percent agreement is calculated as the ratio of matching classifications to total ratings, while kappa values assess chance-adjusted consistency; interpretations often deem percent agreement above 80% and kappa above 0.75 as indicating acceptable repeatability. AAA was developed in the 1990s primarily for the automotive and electronics industries to standardize gage studies for attribute data, and it was formally integrated into the Automotive Industry Action Group's (AIAG) Measurement Systems Analysis Reference Manual, third edition, published in 2002.

Applications in Specific Fields

Scientific Experiments

Repeatability forms a cornerstone of the scientific method, serving as a critical mechanism for validating hypotheses by ensuring that experimental results can be consistently reproduced across multiple trials under the same conditions. This allows researchers to distinguish reliable findings from anomalies or errors, thereby building confidence in the underlying scientific claims. For example, in physics, repeated measurements using simple setups have been essential for refining estimates of the speed of light, with modern interferometric techniques demonstrating high consistency across trials to achieve precise values.

Effective experimental design incorporates key principles to enhance repeatability, including randomization to distribute potential biases evenly across treatments, blinding to eliminate observer expectations from influencing outcomes, and standardization of materials, procedures, and environmental conditions to isolate the effects of manipulated variables. In fields like cell biology, protocols typically mandate a minimum of three to five replicates per experimental condition to capture variability and confirm result stability, providing a statistical basis for assessing consistency without excessive resource demands.

A seminal historical case illustrating repeatability is Louis Pasteur's swan-neck flask experiments conducted in the 1860s, in which he boiled nutrient broth in flasks with elongated, curved necks to trap airborne contaminants while allowing air access. By repeatedly observing that the broth remained sterile until the necks were broken, allowing microbial entry, Pasteur demonstrated the absence of spontaneous generation, with the consistent outcomes across trials providing robust evidence that refuted prevailing theories.

In contemporary scientific practice, repeatability is verified through rigorous peer review, where evaluators scrutinize the specificity and feasibility of protocols to determine if experiments can be independently repeated with similar results. Complementing this, open science initiatives emerging prominently in the 2010s, such as those led by the Center for Open Science, promote the sharing of raw datasets, code, and methods via public repositories to enable external replication and verification. The reproducibility crisis that gained prominence in biomedicine during the 2010s, characterized by failure rates of approximately 50% in replicating published studies, has intensified focus on repeatability, prompting leading journals to mandate detailed repeatability statements, explicit reporting of replicate numbers, and comprehensive methods descriptions in submissions to bolster experimental reliability.

Psychological Testing

In psychological testing, repeatability is primarily evaluated through test-retest reliability, which assesses the consistency of scores obtained when the same instrument is administered to the same participants on two separate occasions, typically separated by an interval of 1-2 weeks to reduce memory effects while capturing trait stability. When assessing test-retest reliability for psychometric scales, key aspects to verify include the sample size used, the time interval between administrations, and evidence of temporal stability. This approach is fundamental in psychometrics, as it helps determine whether a measure yields stable results over short periods, distinguishing true variance in psychological constructs from random error.

For interval or ratio data, such as continuous scores from cognitive tests, Pearson's product-moment correlation coefficient (r) is commonly used to quantify test-retest reliability, with values above 0.7 indicating acceptable stability. For ordinal scales or when assessing agreement beyond mere correlation, the intraclass correlation coefficient (ICC) is preferred, as it accounts for both correlation and systematic differences; an ICC greater than 0.7 is generally considered acceptable for repeatability in psychological assessments. In practice, these metrics reveal high repeatability for intelligence measures like the Wechsler Adult Intelligence Scale (WAIS), where full-scale IQ test-retest reliabilities range from 0.88 to 0.93 over intervals of several weeks, reflecting the relative stability of cognitive abilities. Personality inventories, such as those measuring the Big Five traits, show moderate repeatability, with test-retest correlations around 0.80 to 0.90 for short intervals, attributable to the enduring nature of traits despite minor fluctuations.

Challenges in achieving repeatability in psychological testing stem from inherent human variability, including factors like mood, fatigue, and environmental influences, which can introduce error variance and lower reliability coefficients. Ethical constraints further complicate exact replication, as repeated testing must balance scientific needs with participant well-being, such as obtaining informed consent and avoiding undue burden or deception in behavioral studies. These issues are particularly pronounced in assessments involving vulnerable populations, where stringent ethical guidelines limit the frequency and intensity of retesting.

The historical foundations of repeatability in psychological measurement trace back to early 20th-century psychometrics, pioneered by Charles Spearman in his 1904 work, which introduced correlation-based methods to evaluate test consistency and laid the groundwork for classical test theory. This framework emphasized the importance of reliability coefficients in validating measures of general intelligence, influencing subsequent developments in standardized testing protocols.
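
The two metrics discussed above can be computed from a two-session score matrix as in the sketch below. The scores are hypothetical, and the ICC shown is the ICC(3,1) variant (two-way mixed effects, consistency, single measurement), one of several ICC forms used in practice.

```python
import numpy as np

# Hypothetical scores: rows = participants, columns = two test sessions
scores = np.array([
    [102, 105], [98, 100], [110, 108], [95, 97],
    [120, 118], [101, 104], [99, 96], [115, 117],
], dtype=float)
n, k = scores.shape

# Pearson r between the two sessions (interval data)
r = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

# ICC(3,1) from a two-way ANOVA decomposition
grand = scores.mean()
ss_total = ((scores - grand) ** 2).sum()
ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # between-subject
ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # between-session
ms_rows = ss_rows / (n - 1)
ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
icc31 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

print(f"Pearson r = {r:.2f}, ICC(3,1) = {icc31:.2f}")
```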

Quality Control and Defect Databases

In quality control systems, repeatability ensures that inspection processes consistently identify defects, minimizing variations that could lead to false positives or negatives in defect databases. This consistency is critical for maintaining accurate records of production issues, as variability in human or automated assessments can propagate errors into databases, affecting downstream analyses and corrective actions. For instance, gage repeatability and reproducibility (GR&R) studies quantify the variability introduced by the measurement system itself, helping manufacturers isolate and reduce sources of inconsistency in defect detection.

Attribute agreement analysis serves as a key tool for evaluating the consistency of defect classifications in databases, particularly in attribute-based inspections where subjective judgments are involved. In the automotive sector, this analysis is integrated into Production Part Approval Process (PPAP) requirements, where measurement systems analysis (MSA) for attribute data assesses appraiser agreement to ensure reliable defect logging before production approval. Standards like ISO 5725 define repeatability conditions, such as the same procedure, operator, and equipment, to achieve consistent outcomes, supporting the overall integrity of quality management systems.

In semiconductor manufacturing, repeatability in wafer defect classification is essential across shifts to maintain uniform identification of surface anomalies, preventing discrepancies that could compromise yield rates. Manual classifications often vary due to operator subjectivity, but automated systems using vision-based machine learning achieve higher consistency by standardizing defect pattern recognition on wafer maps. Defect databases in these environments facilitate tracking through structured queries that analyze classification agreement over time, enabling manufacturers to monitor and refine repeatability metrics from logged inspection data, as sketched below.

Poor repeatability in quality checks has significant economic consequences, as evidenced by the 2010s Takata airbag recalls affecting millions of vehicles, where Takata's inadequate quality-control records and inconsistent defect detection in inflator production contributed to widespread safety issues and massive financial liabilities exceeding billions of dollars in costs. These incidents underscored how lapses in repeatable inspections at the supplier level can escalate to product recalls, damaging brand reputation and incurring regulatory penalties. Post-2020 advancements in automation have substantially enhanced repeatability in defect detection, with deep learning models achieving average precision improvements of up to 78.6% in multi-category classifications, reducing reliance on variable human inputs. These systems integrate convolutional neural networks for real-time analysis, enabling scalable, consistent defect identification in high-volume production lines.
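
As an illustration of querying a defect database for agreement over time, the sketch below uses a hypothetical pandas log schema (week, inspector, verdict, master) to track each inspector's weekly agreement with the known master classification; real systems would query a production database rather than an in-memory frame.

```python
import pandas as pd

# Hypothetical inspection log: one row per classification event
log = pd.DataFrame({
    "week":      [1, 1, 1, 2, 2, 2],
    "inspector": ["A", "B", "A", "A", "B", "B"],
    "verdict":   ["scratch", "scratch", "void", "void", "scratch", "void"],
    "master":    ["scratch", "scratch", "scratch", "void", "scratch", "void"],
})

# Weekly share of verdicts matching the master classification, per inspector,
# used to flag drifting repeatability in the logged data
accuracy = (log.assign(match=log["verdict"] == log["master"])
               .groupby(["week", "inspector"])["match"].mean())
print(accuracy)
```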

Challenges and Improvements

Factors Affecting Repeatability

Environmental factors, such as temperature fluctuations, humidity levels, and mechanical vibrations, can significantly alter experimental outcomes and undermine repeatability by introducing uncontrolled variations in conditions. For instance, in laboratory equipment calibration, even minor changes in ambient temperature can cause thermal expansion or contraction in instruments, leading to inconsistent readings across repeated trials.

Procedural issues further compromise repeatability through inconsistencies in protocol execution, equipment degradation over time, and the use of uncalibrated instruments, which introduce systematic errors into repeated measurements. Time-dependent degradation of samples, such as the loss of activity in reagents during multiple trials, exemplifies how procedural timing can lead to varying results even under nominally identical conditions.

Human elements, including operator bias, fatigue, and inconsistencies in training, play a critical role in reducing repeatability, particularly in tasks requiring subjective judgment or manual intervention. Differences in operator technique, experience, or judgment can result in measurable variations when the same procedure is repeated by the same or different individuals. The impact of these factors can be quantified using metrics like the percentage of study variation in gage repeatability and reproducibility (GR&R) analyses, where values exceeding 10% of the process tolerance or total study variation often indicate poor repeatability and the need for system improvements; a minimal sketch of such a calculation follows below.

A notable historical case illustrating temperature's effect on repeatability is the 1986 Space Shuttle Challenger disaster, where O-ring seal tests demonstrated non-repeatable performance and erosion at temperatures below 53°F (12°C), contributing to the failure during launch in cold conditions. Broader influences, such as variability in raw materials, can affect industrial repeatability by introducing inconsistencies in material properties that propagate through repeated processes. Fluctuations in supplier quality or composition, for example, lead to variable outcomes in tests across production runs.
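
A minimal sketch of the repeatability side of such a GR&R calculation: pooled within-cell variation from a small crossed study is converted to a percent-of-tolerance figure. The data, tolerance, and layout are hypothetical, and a full GR&R study would also estimate reproducibility (operator effects) via ANOVA.

```python
import numpy as np

# Hypothetical crossed gage study: axis 0 = parts, axis 1 = operators,
# axis 2 = repeated trials by the same operator on the same part
cells = np.array([
    [[10.1, 10.2, 10.1], [10.3, 10.2, 10.3]],
    [[9.8, 9.9, 9.8], [9.9, 9.9, 10.0]],
    [[10.5, 10.4, 10.5], [10.6, 10.5, 10.6]],
])

# Repeatability (equipment variation): pooled within-cell standard deviation
sigma_repeat = np.sqrt(cells.var(axis=2, ddof=1).mean())

tolerance = 1.0  # hypothetical specification width
pt_percent = 6 * sigma_repeat / tolerance * 100  # % of tolerance consumed

print(f"sigma_repeat = {sigma_repeat:.3f}, precision-to-tolerance = {pt_percent:.1f}%")
```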

Strategies for Enhancing Repeatability

Standardizing protocols through detailed standard operating procedures (SOPs) is a foundational strategy for enhancing repeatability in experimental workflows. SOPs provide step-by-step instructions that minimize procedural variations, ensuring consistent execution across personnel and sessions. By incorporating checklists for critical steps, such as material preparation and data recording, SOPs facilitate adherence and reduce oversight errors, thereby improving the consistency of results. In laboratory settings, automation complements SOPs by mitigating human error; for instance, robotic pipetting systems enable precise liquid handling in high-throughput assays.

Regular calibration and maintenance of equipment according to established guidelines further bolsters repeatability by preserving measurement accuracy over time. The National Institute of Standards and Technology (NIST) recommends periodic verification of instruments against reference standards to detect and correct drifts, ensuring that uncertainty components due to equipment variability remain within specified limits. Traceable calibration to NIST standards in analytical labs helps establish reliable baselines for quantitative analysis. This practice, including the use of certified reference materials, aligns with international standards like ISO/IEC 17025, promoting long-term instrument stability.

Incorporating statistical controls into experimental design, such as replicates and power analysis, allows researchers to quantify and mitigate inherent variability, thereby enhancing the precision of outcomes. Replicates, meaning multiple runs under identical conditions, help estimate within-experiment variability, with biological replicates preferred over technical ones to capture true process fluctuations. Power analysis prior to experimentation determines the minimum sample size needed to detect effects of interest, for instance requiring approximately 100 subjects to achieve 80% power for a 5 µm change in thickness with a standard deviation of approximately 10 µm; a sketch of such a calculation appears below. Complementing these, electronic laboratory notebook (ELN) systems automate data logging with timestamps and digital signatures, supporting traceable records that facilitate verification and reduce transcription errors in repeatable workflows.

Training programs and auditing mechanisms ensure operator proficiency, directly addressing human factors that undermine repeatability. Good Clinical Laboratory Practice (GCLP) training, which emphasizes standardized techniques and error recognition, has improved assay proficiency in resource-limited settings by standardizing operator performance across labs. Operator certification programs, often aligned with ISO standards, require demonstrated competence through practical assessments, leading to reduced variability in measurements like blood pressure readings when proper cuff selection and positioning are enforced. Auditing via inter-laboratory comparisons establishes performance baselines by evaluating repeatability uncertainty (u_repeat) against shared artifacts, with normalized error metrics (|En|) helping identify labs needing recalibration to align within 1 standard deviation of the group mean.

Emerging trends in the 2020s leverage AI-driven predictive modeling to forecast and adjust for variability, advancing repeatability in complex assays. Machine learning approaches in high-throughput screening, validated against diverse datasets, support improvements in screening reproducibility.
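
The sample-size reasoning above can be checked with a standard power calculation. The sketch below assumes a two-sided, two-sample t-test and the statsmodels library; a 5 µm change against a 10 µm standard deviation gives a standardized effect size d = 0.5, yielding roughly 64 subjects per group. The text's round figure of about 100 subjects may reflect different design assumptions (e.g., a one-sample design, unequal allocation, or anticipated dropout).

```python
from statsmodels.stats.power import TTestIndPower

# Detecting a 5 um shift with SD ~10 um -> standardized effect size d = 0.5
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group ~ {n_per_group:.0f}")  # ~64 per group, ~128 total
```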
