Replication crisis
from Wikipedia

Ioannidis (2005): "Why Most Published Research Findings Are False".[1]

The replication crisis, also known as the reproducibility or replicability crisis, is the growing number of published scientific results that other researchers have been unable to reproduce. Because the reproducibility of empirical results is a cornerstone of the scientific method,[2] such failures undermine the credibility of theories that build on them and can call into question substantial parts of scientific knowledge.

The replication crisis is frequently discussed in relation to psychology and medicine, wherein considerable efforts have been undertaken to reinvestigate the results of classic studies to determine whether they are reliable, and if they turn out not to be, the reasons for the failure.[3][4] Data strongly indicate that other natural and social sciences are also affected.[5]

The phrase "replication crisis" was coined in the early 2010s as part of a growing awareness of the problem.[6] Considerations of causes and remedies have given rise to a new scientific discipline known as metascience,[7] which uses methods of empirical research to examine empirical research practice.[8]

Considerations about reproducibility can be placed into two categories. Reproducibility in a narrow sense refers to reexamining and validating the analysis of a given set of data. The second category, replication, involves repeating an existing experiment or study with new, independent data to verify the original conclusions.

Background

Replication

Replication has been called "the cornerstone of science".[9][10] Environmental health scientist Stefan Schmidt began a 2009 review with this description of replication:

Replication is one of the central issues in any empirical science. To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception. A replication experiment to demonstrate that the same findings can be obtained in any other place by any other researcher is conceived as an operationalization of objectivity. It is the proof that the experiment reflects knowledge that can be separated from the specific circumstances (such as time, place, or persons) under which it was gained.[11]

But there is limited consensus on how to define replication and potentially related concepts.[12][13][11] A number of types of replication have been identified:

  1. Direct or exact replication, where an experimental procedure is repeated as closely as possible.[11][14]
  2. Systematic replication, where an experimental procedure is largely repeated, with some intentional changes.[14]
  3. Conceptual replication, where a finding or hypothesis is tested using a different procedure.[11][14] Conceptual replication allows testing for generalizability and veracity of a result or hypothesis.[14]

Reproducibility can also be distinguished from replication, as referring to reproducing the same results using the same data set. Reproducibility of this type is why many researchers make their data available to others for testing.[15]

The replication crisis does not necessarily mean these fields are unscientific.[16][17][18] Rather, it is part of the scientific process, in which old ideas or those that cannot withstand careful scrutiny are pruned,[19][20] although this pruning is not always effective.[21][22]

A hypothesis is generally considered to be supported when the results match the predicted pattern and that pattern is found to be statistically significant. Results are considered significant when a pattern at least as extreme as the one observed would occur, assuming the null hypothesis is true, with a probability below a conventionally chosen value (the significance level). This answers the question of how unlikely the results would be if no difference existed at the level of the statistical population. Equivalently, if the test statistic exceeds the critical value that corresponds to the chosen significance level, the results are considered statistically significant.[23] The outcome is commonly reported as p < 0.05, where p (typically referred to as the "p-value") is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. With this threshold, about 5% of tests of true null hypotheses produce false positives (an incorrect hypothesis being erroneously found correct), assuming the studies meet all of the statistical assumptions. Some fields use stricter thresholds, such as p < 0.01 (1% chance of a false positive) or p < 0.001 (0.1% chance of a false positive). But a smaller chance of a false positive often requires greater sample sizes or a greater chance of a false negative (a correct hypothesis being erroneously found incorrect). Although p-value testing is the most commonly used method, it is not the only method.

Statistics

Certain terms commonly used in discussions of the replication crisis have technically precise meanings, which are presented here.[1]

In the most common case, null hypothesis testing, there are two hypotheses, a null hypothesis $H_0$ and an alternative hypothesis $H_1$. The null hypothesis is typically of the form "X and Y are statistically independent". For example, the null hypothesis might be "taking drug X does not change 1-year recovery rate from disease Y", and the alternative hypothesis is that it does change.

As testing for full statistical independence is difficult, the full null hypothesis is often reduced to a simplified null hypothesis "the effect size is 0", where "effect size" is a real number that is 0 if the full null hypothesis is true, and the larger the effect size is, the more the null hypothesis is false.[24] For example, if X is binary, then the effect size might be defined as the change in the expectation of Y upon a change of X:

$$\text{effect size} = \mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0]$$

Note that the effect size as defined above might be zero even if X and Y are not independent, such as when $Y \sim \mathcal{N}(0, 1 + X)$, where X changes the variance of Y but not its mean. Since different definitions of "effect size" capture different ways for X and Y to be dependent, there are many different definitions of effect size.

In practice, effect sizes cannot be directly observed, but must be measured by statistical estimators. For example, the above definition of effect size is often measured by Cohen's d estimator. The same effect size might have multiple estimators, as they have tradeoffs between efficiency, bias, variance, etc. This further increases the number of possible statistical quantities that can be computed on a single dataset. When an estimator for an effect size is used for statistical testing, it is called a test statistic.
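As an illustration of estimating an effect size from data, the following is a minimal sketch of Cohen's d with a pooled standard deviation. Note that Cohen's d divides the difference in means by the pooled standard deviation, so it estimates a standardized version of that difference. The function name and the sample numbers are hypothetical and chosen only for the example.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Estimate a standardized mean difference (Cohen's d) using a pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return mean_diff / pooled_sd


# Hypothetical measurements for a treated and a control group.
treated = [5.1, 6.0, 5.5, 6.3, 5.8]
control = [4.9, 5.2, 5.0, 5.6, 5.1]
d = cohens_d(treated, control)
print(f"Estimated Cohen's d: {d:.2f}")
```

Other estimators of the same underlying effect size (for example, ones with a small-sample bias correction) would give slightly different numbers on the same data, which is the multiplicity the paragraph above describes.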

Illustration of the 4 possible outcomes of a null hypothesis test: false negative, true negative, false positive, true positive. In this illustration, the hypothesis test is a one-sided threshold test.

A null hypothesis test is a decision procedure which takes in some data, and outputs either $H_0$ or $H_1$. If it outputs $H_1$, it is usually stated as "there is a statistically significant effect" or "the null hypothesis is rejected".

Often, the statistical test is a (one-sided) threshold test, which is structured as follows:

  1. Gather data $D$.
  2. Compute a test statistic $t[D]$ for the data.
  3. Compare the test statistic against a critical value/threshold $t_{\text{threshold}}$. If $t[D] > t_{\text{threshold}}$, then output $H_1$; otherwise, output $H_0$.

A two-sided threshold test is similar, but with two thresholds, such that it outputs $H_1$ if either $t[D] < t^{-}_{\text{threshold}}$ or $t[D] > t^{+}_{\text{threshold}}$.
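The threshold procedure can be sketched in a few lines. This is an illustrative example only: it assumes a z-test for a mean with a known standard deviation, and the function names, the default significance level of 0.05, and the string labels "H0" and "H1" are choices made for the sketch.

```python
import math
from statistics import NormalDist, mean


def z_statistic(data, null_mean=0.0, sigma=1.0):
    """Test statistic: standardized distance of the sample mean from the null mean."""
    return (mean(data) - null_mean) / (sigma / math.sqrt(len(data)))


def one_sided_test(data, alpha=0.05):
    """Output 'H1' if the statistic exceeds the upper critical value, else 'H0'."""
    threshold = NormalDist().inv_cdf(1 - alpha)
    return "H1" if z_statistic(data) > threshold else "H0"


def two_sided_test(data, alpha=0.05):
    """Output 'H1' if the statistic falls beyond either of two symmetric thresholds."""
    upper = NormalDist().inv_cdf(1 - alpha / 2)
    t = z_statistic(data)
    return "H1" if (t < -upper or t > upper) else "H0"


# Hypothetical data: 16 observations with mean 0.45, giving a statistic of 1.8.
sample = [0.45] * 16
print(one_sided_test(sample), two_sided_test(sample))
```

With the same data and the same significance level, the one-sided test can reject while the two-sided test does not, because the two-sided test splits its false positive budget across both tails.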

There are 4 possible outcomes of a null hypothesis test: false negative, true negative, false positive, true positive. A false negative means that $H_1$ is true, but the test outcome is $H_0$; a true negative means that $H_0$ is true, and the test outcome is $H_0$, etc.

                     Probability to reject $H_0$    Probability to not reject $H_0$
If $H_0$ is true     α                              1-α
If $H_1$ is true     1-β (power)                    β

Interaction between sample size, effect size, and statistical power. Distributions of sample means under the null (θ=0) and alternative hypotheses are shown. The shaded red area represents significance (α), held constant at 0.05, while the shaded green area represents statistical power (1-β). As the sample size increases, the distributions narrow, leading to clearer separation between the hypotheses and higher power. Similarly, a larger effect size increases the distance between the distributions, resulting in greater power.

Significance level, false positive rate, or the alpha level, is the probability of finding the alternative to be true when the null hypothesis is true:

$$\alpha := \Pr(\text{find } H_1 \mid H_0)$$

For example, when the test is a one-sided threshold test, then $\alpha = \Pr_{D \sim H_0}(t[D] > t_{\text{threshold}})$, where $D \sim H_0$ means "the data is sampled from $H_0$".

Statistical power, true positive rate, is the probability of finding the alternative to be true when the alternative hypothesis is true:

$$1 - \beta := \Pr(\text{find } H_1 \mid H_1)$$

where $\beta$ is also called the false negative rate. For example, when the test is a one-sided threshold test, then $1 - \beta = \Pr_{D \sim H_1}(t[D] > t_{\text{threshold}})$.
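To make the relationship between significance level, sample size, effect size, and power concrete, here is a small sketch for a one-sided z-test. It assumes normally distributed data with known unit variance and an effect size expressed in standard deviation units; the function name and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist


def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sided z-test with known unit variance.

    effect_size is the true mean under the alternative, in SD units.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - alpha)  # critical value under the null
    shift = effect_size * math.sqrt(n)         # mean of the statistic under the alternative
    return 1 - std_normal.cdf(threshold - shift)


for n in (10, 25, 100):
    print(n, round(power_one_sided_z(0.5, n), 3))
```

When the effect size is zero, the function returns the significance level itself, since in that case the alternative and null distributions coincide. Power rises with either a larger sample or a larger effect.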

Given a statistical test and a data set $D$, the corresponding p-value is the probability that the test statistic is at least as extreme, conditional on $H_0$. For example, for a one-sided threshold test, $p[D] := \Pr_{D' \sim H_0}(t[D'] > t[D])$. If the null hypothesis is true, then the p-value is distributed uniformly on $[0, 1]$. Otherwise, it is typically peaked at $p = 0$ and roughly exponential, though the precise shape of the p-value distribution depends on what the alternative hypothesis is.[25][26]

Since the p-value is distributed uniformly on $[0, 1]$ conditional on the null hypothesis, one may construct a statistical test with any significance level $\alpha$ by simply computing the p-value, then output $H_1$ if $p[D] < \alpha$. This is usually stated as "the null hypothesis is rejected at significance level $\alpha$", or "$H_1$ ($p < \alpha$)", such as "smoking is correlated with cancer (p < 0.001)".
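The claim that rejecting whenever the p-value is below α yields a false positive rate of α can be checked by simulation. The sketch below assumes a one-sided z-test on normally distributed data with known unit variance; the sample size, number of trials, seed, and function names are arbitrary choices for the illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def one_sided_p_value(sample, sigma=1.0):
    """p-value of a one-sided z-test of H0: mean = 0 against H1: mean > 0."""
    z = (sum(sample) / len(sample)) / (sigma / math.sqrt(len(sample)))
    return 1.0 - STD_NORMAL.cdf(z)


def rejection_rate(true_mean, n=20, alpha=0.05, trials=4000, seed=0):
    """Fraction of simulated studies whose p-value falls below alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if one_sided_p_value(sample) < alpha:
            rejections += 1
    return rejections / trials


null_rate = rejection_rate(true_mean=0.0)  # estimates the false positive rate
alt_rate = rejection_rate(true_mean=0.5)   # estimates the power
print(f"rejections under H0: {null_rate:.3f}, under H1: {alt_rate:.3f}")
```

Under the null hypothesis the rejection rate should land near α, while under the alternative it estimates the power of the test for that particular effect size and sample size.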

History

The beginning of the replication crisis can be traced to a number of events in the early 2010s. Philosopher of science and social epistemologist Felipe Romero identified four events that can be considered precursors to the ongoing crisis:[27]

  • Controversies around social priming research: In the early 2010s, the well-known "elderly-walking" study[28] by social psychologist John Bargh and colleagues failed to replicate in two direct replications.[29] This experiment was part of a series of three studies that had been widely cited throughout the years, was regularly taught in university courses, and had inspired a large number of conceptual replications. Failures to replicate the study led to much controversy and a heated debate involving the original authors.[30] Notably, many of the conceptual replications of the original studies also failed to replicate in subsequent direct replications.[31][32][33][34]
  • Controversies around experiments on extrasensory perception: Social psychologist Daryl Bem conducted a series of experiments supposedly providing evidence for the controversial phenomenon of extrasensory perception.[35] Bem was heavily criticized for his study's methodology, and upon reanalysis of the data, no evidence was found for the existence of extrasensory perception.[36] The experiment also failed to replicate in subsequent direct replications.[37] According to Romero, what the community found particularly upsetting was that many of the flawed procedures and statistical tools used in Bem's studies were part of common research practice in psychology.
  • Amgen and Bayer reports on lack of replicability in biomedical research: Scientists from biotech companies Amgen and Bayer Healthcare reported alarmingly low replication rates (11–20%) of landmark findings in preclinical oncological research.[38]
  • Publication of studies on p-hacking and questionable research practices: Since the late 2000s, a number of studies in metascience showed how commonly adopted practices in many scientific fields, such as exploiting the flexibility of the process of data collection and reporting, could greatly increase the probability of false positive results.[39][40][41] These studies suggested how a significant proportion of published literature in several scientific fields could be nonreplicable research.

This series of events generated a great deal of skepticism about the validity of existing research in light of widespread methodological flaws and failures to replicate findings. This led prominent scholars to declare a "crisis of confidence" in psychology and other fields,[42] and the ensuing situation came to be known as the "replication crisis".

Although the beginning of the replication crisis can be traced to the early 2010s, some authors point out that concerns about replicability and research practices in the social sciences had been expressed much earlier. Romero notes that authors voiced concerns about the lack of direct replications in psychological research in the late 1960s and early 1970s.[43][44] He also writes that certain studies in the 1990s were already reporting that journal editors and reviewers are generally biased against publishing replication studies.[45][46]

In the social sciences, the blog Data Colada (whose three authors coined the term "p-hacking" in a 2014 paper) has been credited with contributing to the start of the replication crisis.[47][48][49]

University of Virginia professor and cognitive psychologist Barbara A. Spellman has written that many criticisms of research practices and concerns about replicability of research are not new.[50] She reports that between the late 1950s and the 1990s, scholars were already expressing concerns about a possible crisis of replication,[51] a suspiciously high rate of positive findings,[52] questionable research practices,[53] the effects of publication bias,[54] issues with statistical power,[55][56] and bad standards of reporting.[51]

Spellman also identifies reasons that the reiteration of these criticisms and concerns in recent years led to a full-blown crisis and challenges to the status quo. First, technological improvements facilitated conducting and disseminating replication studies, and analyzing large swaths of literature for systemic problems. Second, the research community's increasing size and diversity made the work of established members more easily scrutinized by other community members unfamiliar with them. According to Spellman, these factors, coupled with increasingly limited resources and misaligned incentives for doing scientific work, led to a crisis in psychology and other fields.[50]

According to Andrew Gelman,[57] the works of Paul Meehl, Jacob Cohen, and Tversky and Kahneman in the 1960s and 1970s were early warnings of the replication crisis. In discussing the origins of the problem, Kahneman himself noted historical precedents in replication failures of subliminal perception and dissonance reduction research.[58]

It had been repeatedly pointed out since 1962[55] that most psychological studies have low power (true positive rate), but low power persisted for 50 years, indicating a structural and persistent problem in psychological research.[59][60]

Prevalence

In psychology

Several factors have combined to put psychology at the center of the conversation.[61][62] Some areas of psychology once considered solid, such as social priming and ego depletion,[63] have come under increased scrutiny due to failed replications.[64] Much of the focus has been on social psychology,[65] although other areas of psychology such as clinical psychology,[66][67][68] developmental psychology,[69][70][71] and educational research have also been implicated.[72][73][74][75][76]

In August 2015, the first open empirical study of reproducibility in psychology was published, called The Reproducibility Project: Psychology. Coordinated by psychologist Brian Nosek, researchers redid 100 studies in psychological science from three high-ranking psychology journals (Journal of Personality and Social Psychology, Journal of Experimental Psychology: Learning, Memory, and Cognition, and Psychological Science). 97 of the original studies had significant effects, but of those 97, only 36% of the replications yielded significant findings (p value below 0.05).[12] The mean effect size in the replications was approximately half the magnitude of the effects reported in the original studies. The same paper examined the reproducibility rates and effect sizes by journal and discipline. Study replication rates were 23% for the Journal of Personality and Social Psychology, 48% for Journal of Experimental Psychology: Learning, Memory, and Cognition, and 38% for Psychological Science. Studies in the field of cognitive psychology had a higher replication rate (50%) than studies in the field of social psychology (25%).[77]

Of the 64% of non-replications, only 25% disproved the original result (at statistical significance). The other 49% were inconclusive, neither supporting nor contradicting the original result. This is because many replications were underpowered, with a sample 2.5 times smaller than the original.[78]

A study published in 2018 in Nature Human Behaviour replicated 21 social and behavioral science papers from Nature and Science, finding that only about 62% could successfully reproduce original results.[79][80]

Similarly, in a study conducted under the auspices of the Center for Open Science, a team of 186 researchers from 60 different laboratories (representing 36 different nationalities from six different continents) conducted replications of 28 classic and contemporary findings in psychology.[81][82] The study's focus was not only whether the original papers' findings replicated but also the extent to which findings varied as a function of variations in samples and contexts. Overall, 50% of the 28 findings failed to replicate despite massive sample sizes. But if a finding replicated, then it replicated in most samples. If a finding was not replicated, then it failed to replicate with little variation across samples and contexts. This evidence is inconsistent with a proposed explanation that failures to replicate in psychology are likely due to changes in the sample between the original and replication study.[82]

Results of a 2022 study suggest that many earlier brain–phenotype studies, known as brain-wide association studies (BWAS), produced invalid conclusions, because the effect sizes involved are small and replicating such studies requires samples of thousands of individuals.[83][84]

In medicine

Results from The Reproducibility Project: Cancer Biology suggest most studies of the cancer research sector may not be replicable. Of the 193 experiments designed, 87 were initiated and 50 were completed.

Of 49 medical studies from 1990 to 2003 with more than 1000 citations, 92% found that the studied therapies were effective. Of these studies, 16% were contradicted by subsequent studies, 16% had found stronger effects than did subsequent studies, 44% were replicated, and 24% remained largely unchallenged.[85] A 2011 analysis by researchers with pharmaceutical company Bayer found that, at most, a quarter of Bayer's in-house findings replicated the original results.[86] But the analysis of Bayer's results found that the results that did replicate could often be successfully used for clinical applications.[87]

In a 2012 paper, C. Glenn Begley, a biotech consultant working at Amgen, and Lee Ellis, a medical researcher at the University of Texas, found that only 11% of 53 pre-clinical cancer studies had replications that could confirm conclusions from the original studies.[38] In late 2021, The Reproducibility Project: Cancer Biology examined 53 top papers about cancer published between 2010 and 2012 and showed that among studies that provided sufficient information to be redone, the effect sizes were 85% smaller on average than the original findings.[88][89] A survey of cancer researchers found that half of them had been unable to reproduce a published result.[90] Another report estimated that almost half of randomized controlled trials contained flawed data (based on the analysis of anonymized individual participant data (IPD) from more than 150 trials).[91]

In other disciplines

In nutrition science

In nutrition science, studies can be found linking most food ingredients to cancer risk. Specifically, out of a random sample of 50 ingredients from a cookbook, 80% had articles reporting on their cancer risk. In meta-analyses, the reported associations were less often statistically significant.[92]

In economics

Economics has lagged behind other social sciences and psychology in its attempts to assess replication rates and increase the number of studies that attempt replication.[13] A 2016 study in the journal Science replicated 18 experimental studies published in two leading economics journals, The American Economic Review and the Quarterly Journal of Economics, between 2011 and 2014. It found that about 39% failed to reproduce the original results.[93][94][95] About 20% of studies published in The American Economic Review are contradicted by other studies despite relying on the same or similar data sets.[96] A study of empirical findings in the Strategic Management Journal found that about 30% of 27 retested articles showed statistically insignificant results for previously significant findings, whereas about 4% showed statistically significant results for previously insignificant findings.[97]

In water resource management

A 2019 study in Scientific Data estimated with 95% confidence that of 1,989 articles on water resources and management published in 2017, study results might be reproduced for only 0.6% to 6.8%, largely because the articles did not provide sufficient information to allow for replication.[98]

Across fields

A 2016 survey by Nature of 1,576 researchers who took a brief online questionnaire on reproducibility found that more than 70% of researchers have tried and failed to reproduce another scientist's experiment results (including 87% of chemists, 77% of biologists, 69% of physicists and engineers, 67% of medical researchers, 64% of earth and environmental scientists, and 62% of all others), and more than half have failed to reproduce their own experiments. But fewer than 20% had been contacted by another researcher unable to reproduce their work. The survey found that fewer than 31% of researchers believe that failure to reproduce results means that the original result is probably wrong, although 52% agree that a significant replication crisis exists. Most researchers said they still trust the published literature.[5][99] Fanelli (2010)[100] found that 91.5% of psychiatry/psychology studies confirmed the effects they were looking for, and concluded that the odds of a positive result were around five times higher than in fields such as astronomy or geosciences. Fanelli argued that this is because researchers in "softer" sciences have fewer constraints to their conscious and unconscious biases.

Early analysis of result-blind peer review, which is less affected by publication bias, has estimated that 61% of result-blind studies in biomedicine and psychology have led to null results, in contrast to an estimated 5% to 20% in earlier research.[101]

In 2021, a study conducted by University of California, San Diego found that papers that cannot be replicated are more likely to be cited.[102] Nonreplicable publications are often cited more even after a replication study is published.[103]

Causes

There are many proposed causes for the replication crisis.

Historical and sociological causes

The replication crisis may be triggered by the "generation of new data and scientific publications at an unprecedented rate" that leads to "desperation to publish or perish" and failure to adhere to good scientific practice.[104]

Predictions of an impending crisis in the quality-control mechanism of science can be traced back several decades. Derek de Solla Price—considered the father of scientometrics, the quantitative study of science—predicted in 1963 that science could reach "senility" as a result of its own exponential growth.[105] Some present-day literature seems to vindicate this "overflow" prophecy, lamenting the decay in both attention and quality.[106][107]

Historian Philip Mirowski argues that the decline of scientific quality can be connected to its commodification, especially spurred by major corporations' profit-driven decision to outsource their research to universities and contract research organizations.[108]

Social systems theory, as expounded in the work of German sociologist Niklas Luhmann, inspires a similar diagnosis. This theory holds that each system, such as economy, science, religion, and media, communicates using its own code: true and false for science, profit and loss for the economy, news and no-news for the media, and so on.[109][110] According to some sociologists, science's mediatization,[111] commodification,[108] and politicization,[111][112] as a result of the structural coupling among systems, have led to a confusion of the original system codes.

Problems with the publication system in science

Publication bias

A major cause of low reproducibility is the publication bias stemming from the fact that statistically non-significant results and seemingly unoriginal replications are rarely published. Only a very small proportion of academic journals in psychology and neurosciences explicitly welcomed submissions of replication studies in their aim and scope or instructions to authors.[113][114] This does not encourage reporting on, or even attempts to perform, replication studies. Among 1,576 researchers Nature surveyed in 2016, only a minority had ever attempted to publish a replication, and several respondents who had published failed replications noted that editors and reviewers demanded that they play down comparisons with the original studies.[5][99] An analysis of 4,270 empirical studies in 18 business journals from 1970 to 1991 reported that less than 10% of accounting, economics, and finance articles and 5% of management and marketing articles were replication studies.[93][115] Publication bias is augmented by the pressure to publish and the author's own confirmation bias,[a] and is an inherent hazard in the field, requiring a certain degree of skepticism on the part of readers.[41]

Publication bias leads to what psychologist Robert Rosenthal calls the "file drawer effect". The file drawer effect is the idea that as a consequence of the publication bias, a significant number of negative results[b] are not published. According to philosopher of science Felipe Romero, this tends to produce "misleading literature and biased meta-analytic studies",[27] and when publication bias is considered along with the fact that a majority of tested hypotheses might be false a priori, it is plausible that a considerable proportion of research findings might be false positives, as shown by metascientist John Ioannidis.[1] In turn, a high proportion of false positives in the published literature can explain why many findings are nonreproducible.[27]
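The argument that a low prior probability of true hypotheses leads to many false positives among significant findings can be illustrated with Bayes' rule. The sketch below is a simplified calculation that ignores bias and multiple competing teams; the function name and the example values of prior, power, and significance level are assumptions chosen for illustration, not figures from the cited work.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a statistically significant finding reflects a true effect.

    prior: fraction of tested hypotheses that are actually true
    power: probability of a significant result when the hypothesis is true
    alpha: probability of a significant result when the hypothesis is false
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical scenario: 1 in 10 tested hypotheses is true.
print(round(positive_predictive_value(prior=0.1, power=0.8), 2))  # well-powered studies
print(round(positive_predictive_value(prior=0.1, power=0.2), 2))  # underpowered studies
```

In this simplified setting, lowering either the prior or the power reduces the share of significant results that are true, even though the significance level stays fixed.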

Another publication bias is that studies that do not reject the null hypothesis are scrutinized asymmetrically. For example, they are likely to be rejected as being difficult to interpret or having a Type II error. Studies that do reject the null hypothesis are not likely to be rejected for those reasons.[117]

In popular media, there is another element of publication bias: the desire to make research accessible to the public leads to oversimplification and exaggeration of findings, creating unrealistic expectations and amplifying the impact of non-replications. In contrast, null results and failures to replicate tend to go unreported. This explanation may apply to power posing's replication crisis.[118]

Mathematical errors

Even high-impact journals have a significant fraction of mathematical errors in their use of statistics. For example, 11% of statistical results published in Nature and BMJ in 2001 were "incongruent", meaning that the reported p-value is mathematically different from what it should be if it were correctly calculated from the reported test statistic. These errors likely resulted from typesetting, rounding, and transcription mistakes.[119]
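A congruence check of this kind can be sketched as follows. The example assumes a two-sided test whose statistic is a z-score, and it uses an arbitrary tolerance of 0.005 to allow for rounding in the reported value; the function names and the tolerance are illustrative assumptions and do not describe the procedure of the cited study.

```python
from statistics import NormalDist


def two_sided_p_from_z(z):
    """Recompute the two-sided p-value implied by a reported z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_congruent(reported_p, z, tolerance=0.005):
    """Check whether a reported p-value matches the reported test statistic."""
    return abs(reported_p - two_sided_p_from_z(z)) <= tolerance


# Hypothetical reported results: (reported p-value, reported z statistic).
for reported_p, z in [(0.05, 1.96), (0.01, 1.96)]:
    print(reported_p, z, is_congruent(reported_p, z))
```

The same idea extends to t, F, or chi-squared statistics, provided the corresponding distribution function and the reported degrees of freedom are available.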

Among 157 neuroscience papers published in five top-ranking journals that attempt to show that two experimental effects are different, 78 erroneously tested instead for whether one effect is significant while the other is not, and 79 correctly tested for whether their difference is significantly different from 0.[120]
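The difference between the two analyses can be shown with a small numerical example. The sketch assumes two independent effect estimates with known standard errors and normal sampling distributions; the specific numbers and function names are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for the hypothesis that the true effect is zero."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def p_for_difference(est_a, se_a, est_b, se_b):
    """Two-sided p-value for the difference between two independent estimates."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # valid only if estimates are independent
    return two_sided_p(est_a - est_b, se_diff)


# Hypothetical effects: A = 25 (SE 10) and B = 10 (SE 10).
p_a = two_sided_p(25, 10)
p_b = two_sided_p(10, 10)
p_diff = p_for_difference(25, 10, 10, 10)
print(f"A: p={p_a:.3f}  B: p={p_b:.3f}  A-B: p={p_diff:.3f}")
```

Here effect A is significant at the 0.05 level and effect B is not, yet the direct test of their difference is far from significant, so concluding that A and B differ would be unwarranted.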

"Publish or perish" culture

The consequences for replicability of the publication bias are exacerbated by academia's "publish or perish" culture. As explained by metascientist Daniele Fanelli, "publish or perish" culture is a sociological aspect of academia whereby scientists work in an environment with very high pressure to have their work published in recognized journals. This is the consequence of the academic work environment being hypercompetitive and of bibliometric parameters (e.g., number of publications) being increasingly used to evaluate scientific careers.[121] According to Fanelli, this pushes scientists to employ a number of strategies aimed at making results "publishable". In the context of publication bias, this can mean adopting behaviors aimed at making results positive or statistically significant, often at the expense of their validity.[121]

According to Center for Open Science founder Brian Nosek and his colleagues, "publish or perish" culture created a situation whereby the goals and values of single scientists (e.g., publishability) are not aligned with the general goals of science (e.g., pursuing scientific truth). This is detrimental to the validity of published findings.[122]

Philosopher Brian D. Earp and psychologist Jim A. C. Everett argue that, although replication is in the best interests of academics and researchers as a group, features of academic psychological culture discourage replication by individual researchers. They argue that performing replications can be time-consuming, and take away resources from projects that reflect the researcher's original thinking. They are harder to publish, largely because they are unoriginal, and even when they can be published they are unlikely to be viewed as major contributions to the field. Replications "bring less recognition and reward, including grant money, to their authors".[123]

In his 1971 book Scientific Knowledge and Its Social Problems, philosopher and historian of science Jerome R. Ravetz predicted that science—in its progression from "little" science composed of isolated communities of researchers to "big" science or "techno-science"—would suffer major problems in its internal system of quality control. He recognized that the incentive structure for modern scientists could become dysfunctional, creating perverse incentives to publish any findings, however dubious. According to Ravetz, quality in science is maintained only when there is a community of scholars, linked by a set of shared norms and standards, who are willing and able to hold each other accountable.

Standards of reporting

Certain publishing practices also make it difficult to conduct replications and to monitor the severity of the reproducibility crisis, because articles often come with insufficient descriptions for other scholars to reproduce the study. The Reproducibility Project: Cancer Biology showed that of 193 experiments from 53 top papers about cancer published between 2010 and 2012, only 50 experiments from 23 papers had authors who provided enough information for researchers to redo the studies, sometimes with modifications. None of the 193 experiments examined had its experimental protocols fully described, and replicating 70% of experiments required asking for key reagents.[88][89] The aforementioned study of empirical findings in the Strategic Management Journal found that 70% of 88 articles could not be replicated due to a lack of sufficient information for data or procedures.[93][97] In water resources and management, most of 1,989 articles published in 2017 were not replicable because of a lack of available information shared online.[98] In studies of event-related potentials, only two-thirds of the information needed to replicate a study was reported in a sample of 150 studies, highlighting that there are substantial gaps in reporting.[124]

Procedural bias

By the Duhem-Quine thesis, scientific results are interpreted by both a substantive theory and a theory of instruments. For example, astronomical observations depend both on the theory of astronomical objects and the theory of telescopes. A large amount of non-replicable research might accumulate if there is a bias of the following kind: faced with a null result, a scientist prefers to treat the data as saying the instrument is insufficient; faced with a non-null result, a scientist prefers to accept the instrument as good, and treat the data as saying something about the substantive theory.[125]

Cultural evolution

Smaldino and McElreath[60] proposed a simple model for the cultural evolution of scientific practice. Each lab randomly decides to produce novel research or replication research, at different fixed levels of false positive rate, true positive rate, replication rate, and productivity (its "traits"). A lab might use more "effort", making the ROC curve more convex but decreasing productivity. A lab accumulates a score over its lifetime that increases with publications and decreases when another lab fails to replicate its results. At regular intervals, a random lab "dies" and another "reproduces" a child lab with traits similar to its parent's. Labs with higher scores are more likely to reproduce. Under certain parameter settings, the population of labs converges to maximum productivity even at the price of very high false positive rates.

Questionable research practices

Questionable research practices are deliberate behaviors that exploit researcher degrees of freedom (researcher DF)—choices in study design, data analysis, or reporting—to inflate false positive rates and undermine reproducibility.[126][127][41] Examples of questionable research practices include data dredging,[127][128][40][c] selective reporting of only statistically significant findings,[126][127][128][40][d] HARKing (hypothesizing after results are known),[127][128][40][e] PARKing (pre-registering after results are known), and conducting inappropriate power analyses.[130]

Genesis

Researchers' degrees of freedom occur at many stages: hypothesis formulation, design of experiments, data collection and analysis, and reporting of research.[127] Analyses of identical datasets by different teams, even absent incentives for significant findings, often yield divergent results in disciplines such as psychology, linguistics, and ecology.[131][132][133] This is because research design and data analysis entail numerous decisions that are not sufficiently constrained by a field's best practices and statistical methodologies. As a result, researcher DF can lead to situations where some failed replication attempts use a different, yet plausible, research design or statistical analysis; such studies do not necessarily undermine previous findings.[134] Multiverse analysis, a method that makes inferences based on all plausible data-processing pipelines, provides a solution to the problem of analytical flexibility.[135] Sensitivity analysis explores modelling specifications to create a comprehensive view of how different analytical choices influence outcomes.[136] Collaborative approaches can be used to compensate for questionable research practices. In multianalyst approaches, different analysts conduct different analyses to address questions.[137][138][139] This collaborative validation fosters intellectual honesty and exposes questionable research practices, leading to more reliable and robust scientific conclusions.[140]

In medicine

Irreproducible medical studies commonly share these characteristics: investigators not being blinded to the experimental versus the control arms; failure to repeat experiments; lack of positive and negative controls; failing to report all the data; inappropriate use of statistical tests; and use of reagents that were not appropriately validated.[141]

In AI research

In machine learning research, a range of questionable practices have emerged due to intense pressure to achieve state-of-the-art benchmark results. Common questionable evaluation practices include "benchmark overfitting" by repeatedly tuning hyperparameters on held-out test sets, selectively reporting the best of multiple random seeds or experimental runs, and "metric hacking" through unreported post hoc decisions, such as the choice of tokenization or evaluation scripts, that inflate scores.[142] Researchers also frequently cherry-pick tasks or datasets where their proposed model outperforms baselines, omit negative or lower-performing results, and obscure compute budgets to present energy-intensive methods as efficient. Such practices can give a misleading impression of model progress and hinder fair comparisons across methods.

Moreover, many machine learning studies suffer from irreproducible research practices that impede replication and auditing. Key issues include incomplete disclosure of training details (such as dataset preprocessing, exact model hyperparameters, random seeds, and hardware configurations) and failure to release the code or data used in experiments. Researchers often omit versions of dependencies or rely on proprietary datasets, preventing others from reproducing results. Even for public benchmarks, subtle "data leakage" through inadvertent inclusion of test data in training splits remains widespread. Together, these questionable practices undermine the validity and credibility of reported findings in machine learning and pose substantial barriers to cumulative scientific progress.[142]

In October 2024, Communications of the ACM published a peer-reviewed critique[143] of a 2021 Nature paper[144] by researchers from Google. The critique described "a smorgasbord of questionable practices in ML, including irreproducible research practices, multiple variants of cherry-picking, misreporting, and likely data contamination (leakage)". The criticized paper claimed progress in fast chip design using reinforcement learning, but supported the claim with experiments on proprietary training and test datasets whose statistical details were undisclosed, preventing independent reproduction outside the company.[145] Specific execution times for both the proposed approach and prior techniques on individual test cases were not disclosed in the paper, but later studies found that Google's method ran orders of magnitude more slowly than the Cadence Design Systems tools available at the same time. Part of the evidence reviewed in the critique is summarized below, using the terminology introduced above.[142]

Cherry-picking (benchmark hacking)
Results with unclear statistical significance were reported on only five chip blocks out of 20 mentioned in the paper. Additionally, the authors refused to publish results on public benchmarks requested by Nature reviewer #3.
Cherry-picking (baseline hacking)
The critique pointed out that the Nature paper chose to compare to a weaker variant of simulated annealing and to a "human baseline" involving unspecified participants. Results in subsequent publications indicated that the Google method consistently underperformed Cadence tools and established algorithms such as simulated annealing.
Cherry-picking (metric hacking)
The Nature paper provides chip-quality metrics for its own method but omits these metrics for simulated annealing. Instead, it evaluates simulated annealing using a proxy function that does not accurately represent the actual chip-quality metrics.[143]
Misreporting
The criticized results were produced with the undisclosed usage of data from a commercial software tool.[146] Nature published an appendix that admitted to this usage, years after the original paper[144] and after multiple rounds of criticism.[147][148][149] Researchers reported additional discrepancies between the description in the Nature paper and what was actually used to produce results and published as source code.[146]
Data leakage (contamination)
Researchers outside Google were unable to replicate the original findings. A subsequent Google publication[150] noted that pre-training on "diverse TPU" did not improve results, whereas pre-training on "previous netlist versions" produced some quality gains. Although the Nature paper did not disclose this, using the same or similar data for both pre-training and testing (data leakage) could represent a significant methodological flaw.[143]

After the method was rebranded as AlphaChip, multiple researchers voiced skepticism about the Google approach. The questionable research practices and independent researchers' inability to reproduce AlphaChip's performance fueled skepticism about AI-driven chip design approaches and underscored the need for transparent, reproducible benchmarks and rigorous evaluation standards to ensure genuine progress in the field of electronic design automation.[151][152]

Prevalence

According to IU professor Ernest O'Boyle and psychologist Martin Götz, around 50% of researchers surveyed across various studies admitted engaging in HARKing.[153] In a survey of 2,000 psychologists by behavioral scientist Leslie K. John and colleagues, around 94% of psychologists admitted having employed at least one questionable research practice. More specifically, 63% admitted failing to report all of a study's dependent measures, 28% failing to report all of a study's conditions, and 46% selectively reporting studies that produced the desired pattern of results. In addition, 56% admitted having collected more data after having inspected already collected data, and 16% having stopped data collection because the desired result was already visible.[40] According to biotechnology researcher J. Leslie Glick's estimate in 1992, 10% to 20% of research and development studies involved either questionable research practices or outright fraud.[154] The methodology used to estimate the prevalence of questionable research practices has been contested, and more recent studies suggest lower prevalence rates on average.[155]

Fraud

Questionable research practices are considered a separate category from more explicit violations of scientific integrity, such as data falsification.[126][127] A 2009 meta-analysis found that 2% of scientists across fields admitted falsifying studies at least once and 14% admitted knowing someone who did. Such misconduct was, according to one study, reported more frequently by medical researchers than by others.[156] Prominent examples (see also List of scientific misconduct incidents) include scientific fraud by social psychologist Diederik Stapel,[157][14] cognitive psychologist Marc Hauser, and social psychologist Lawrence Sanna.[14] In 2018, some researchers viewed scientific fraud as uncommon.[14] A 2025 Northwestern University study found that "the publication of fraudulent science is outpacing the growth rate of legitimate scientific publications". The study also discovered broad networks of organized scientific fraudsters.[158][159]

In March 2024, Harvard Business School's investigative committee, after reviewing a nearly 1,300-page report unsealed during Professor Francesca Gino's $25 million lawsuit against Harvard and the Data Colada bloggers, found that Gino "committed research misconduct intentionally, knowingly, or recklessly" by falsifying data in four published studies.[160] The report documented that Gino had altered participant responses—including changing 104 moral-impurity ratings (flipping low values to high in one experimental condition and vice versa in another) and manipulating four networking-intentions items, for a total of 168 modified observations—to make the data conform to her hypotheses.[161] She also engaged in selective reporting by publishing only the "Posted" dataset while omitting the original Qualtrics archives and misrepresented the provenance of her data by attributing discrepancies to alleged third-party tampering rather than to deliberate changes by her research team.[162] In June 2023 Gino was placed on unpaid administrative leave, and in May 2025 Harvard revoked her tenure—the first such action in roughly 80 years—citing egregious violations of academic integrity.[163]

Statistical issues

Low statistical power

According to Deakin University professor Tom Stanley and colleagues, one plausible reason studies fail to replicate is low statistical power. Low power undermines replication in three ways. First, a replication study with low power is unlikely to succeed since, by definition, it has a low probability of detecting a true effect. Second, if the original study has low power, it will yield biased effect size estimates. When conducting a priori power analysis for the replication study, this will result in underestimation of the required sample size. Third, if the original study has low power, the post-study odds of a statistically significant finding reflecting a true effect are quite low. It is therefore likely that a replication attempt of the original study would fail.[15]

Mathematically, the probability of replicating a previous publication that rejected a null hypothesis $H_0$ in favor of an alternative $H_1$ is

$$(\text{significance}) \cdot \Pr(H_0 \mid \text{publication}) + (\text{power}) \cdot \Pr(H_1 \mid \text{publication}) \le (\text{power}),$$

assuming significance is less than power. Thus, low power implies low probability of replication, regardless of how the previous publication was designed, and regardless of which hypothesis is really true.[78]

Stanley and colleagues estimated the average statistical power of psychological literature by analyzing data from 200 meta-analyses. They found that on average, psychology studies have between 33.1% and 36.4% statistical power. These values are quite low compared to the 80% considered adequate statistical power for an experiment. Across the 200 meta-analyses, the median proportion of studies with adequate statistical power was between 7.7% and 9.1%, implying that a positive result would replicate with probability less than 10%, regardless of whether the positive result was a true positive or a false positive.[15]

The statistical power of neuroscience studies is quite low. The estimated statistical power of fMRI research is between .08 and .31,[164] and that of studies of event-related potentials was estimated as .72‒.98 for large effect sizes, .35‒.73 for medium effects, and .10‒.18 for small effects.[124]

In a study published in Nature, psychologist Katherine Button and colleagues conducted a similar analysis of 49 meta-analyses in neuroscience, estimating a median statistical power of 21%.[165] Meta-scientist John Ioannidis and colleagues computed an estimate of average power for empirical economic research, finding a median power of 18% based on literature drawing upon 6,700 studies.[166] In light of these results, it is plausible that a major reason for widespread failures to replicate in several scientific fields might be very low statistical power on average.

The same statistical test with the same significance level will have lower statistical power if the effect size is small under the alternative hypothesis. Complex inheritable traits are typically correlated with a large number of genes, each of small effect size, so high power requires a large sample size. In particular, many results from the candidate gene literature suffered from small effect sizes and small sample sizes and would not replicate. More data from genome-wide association studies (GWAS) come close to solving this problem.[167][168] As a numeric example, most genes associated with schizophrenia risk have low effect size (genotypic relative risk, GRR). A statistical study with 1000 cases and 1000 controls has 0.03% power for a gene with GRR = 1.15, which is already large for schizophrenia. In contrast, the largest GWAS to date has ~100% power for it.[169]
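
The dependence of power on effect size and sample size can be sketched numerically. The snippet below is a minimal illustration, not the calculation behind the schizophrenia figures above: it assumes a two-sided, two-sample z-test on a standardized mean difference (Cohen's d) and uses a normal approximation. The function name `two_sample_power` and the chosen effect sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is a standardized mean difference (Cohen's d). The normal
    approximation is an assumption made to keep the sketch dependency-free.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z-statistic when the alternative is true.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Same test, same significance level: power falls as the effect shrinks.
for d in (0.8, 0.5, 0.2, 0.05):
    print(f"d={d:<4} n=64 per group -> power {two_sample_power(d, 64):.2f}")
```

Under these assumptions, a medium effect (d = 0.5) reaches roughly the conventional 80% power with 64 participants per group, while much smaller effects need far larger samples to reach the same level.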

Positive effect size bias

Even when the study replicates, the replication typically has a smaller effect size. Underpowered studies have a large effect size bias.[170]

Distribution of statistically significant estimates of the regression factor in a linear model in the presence of added error. When the sample size is small, adding noise overestimates the regression factor about 50% of the time. When the sample size is large, it consistently underestimates it. The figure appeared in the cited source.[171]

In studies that statistically estimate a regression factor, such as the slope $k$ in $Y = kX + b$, when the dataset is large, noise tends to cause the regression factor to be underestimated, but when the dataset is small, noise tends to cause the regression factor to be overestimated.[171]
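
The interaction between added noise and a significance filter can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not a reproduction of the cited figure: the true slope (0.15), the noise level, the sample sizes, and the use of a 1.96 cutoff (a normal approximation to the t-test) are all illustrative choices, and the function names are hypothetical.

```python
import random
from math import sqrt


def significant_slopes(n, true_k=0.15, noise_sd=0.5, n_sims=200, seed=0):
    """Return the slope estimates that reach |slope / SE| > 1.96.

    Data follow Y = true_k * X + e, but X is observed with added noise.
    The 1.96 cutoff is a normal approximation to the t-test (an assumption).
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_sims):
        x_true = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_k * x + rng.gauss(0, 1) for x in x_true]
        x_obs = [x + rng.gauss(0, noise_sd) for x in x_true]  # added measurement error
        mean_x, mean_y = sum(x_obs) / n, sum(y) / n
        sxx = sum((x - mean_x) ** 2 for x in x_obs)
        sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(x_obs, y))
        slope = sxy / sxx  # ordinary least squares estimate
        rss = sum((v - mean_y - slope * (x - mean_x)) ** 2 for x, v in zip(x_obs, y))
        std_err = sqrt(rss / (n - 2) / sxx)
        if abs(slope / std_err) > 1.96:
            kept.append(slope)
    return kept


def share_above_truth(n, true_k=0.15):
    """Fraction of significant estimates that exceed the true slope."""
    kept = significant_slopes(n, true_k=true_k)
    return sum(s > true_k for s in kept) / len(kept)


for n in (50, 1000):
    print(f"n={n}: share of significant estimates above the true slope = {share_above_truth(n):.2f}")
```

In this setup, small samples only reach significance when the estimate happens to be large, so most published (significant) slopes overshoot the truth. With large samples the attenuation caused by the added noise dominates and most estimates fall below the true slope.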

Problems of meta-analysis

Meta-analyses have their own methodological problems and disputes, which can lead researchers whose theory is challenged by a meta-analysis to reject the meta-analytic method.[117]

Rosenthal proposed the "fail-safe number" (FSN)[54] to avoid the publication bias against null results. It is defined as follows: Suppose the null hypothesis is true; how many publications would be required to make the current result indistinguishable from the null hypothesis?

Rosenthal's point is that certain effect sizes are large enough, such that even if there is a total publication bias against null results (the "file drawer problem"), the number of unpublished null results would be impossibly large to swamp out the effect size. Thus, the effect size must be statistically significant even after accounting for unpublished null results.

One objection to the FSN is that it is calculated as if unpublished results are unbiased samples from the null hypothesis. But if the file drawer problem is true, then unpublished results would have effect sizes concentrated around 0. Thus fewer unpublished null results would be necessary to swamp out the effect size, and so the FSN is an overestimate.[117]
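
As a rough sketch of how a fail-safe number is computed, the snippet below uses the commonly cited form of Rosenthal's calculation, which combines study Z-scores by Stouffer's method and asks how many additional studies averaging Z = 0 would bring the combined result down to the one-tailed critical value. The formula is stated here from general knowledge rather than from the text above, and the function name and example Z-scores are hypothetical.

```python
from statistics import NormalDist


def fail_safe_n(z_scores, alpha=0.05):
    """Number of unpublished studies averaging Z = 0 needed to make the
    combined (Stouffer) Z drop to the one-tailed critical value.

    Solves sum(Z) / sqrt(k + X) = z_crit for X.
    """
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return sum(z_scores) ** 2 / z_crit ** 2 - k


# Ten hypothetical published studies, each with Z = 2.0.
print(round(fail_safe_n([2.0] * 10), 1))
```

Because the calculation assumes the hidden studies average exactly Z = 0, it is subject to the objection described above.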

Another problem with meta-analysis is that bad studies are "infectious" in the sense that one bad study might cause the entire meta-analysis to overestimate statistical significance.[78]

P-hacking

Various statistical methods can be applied to make the p-value appear smaller than it really is. This need not be malicious, as moderately flexible data analysis, routine in research, can increase the false-positive rate to above 60%.[41]

For example, if one collects some data, applies several different significance tests to it, and publishes only the one that happens to have a p-value less than 0.05, then the total p-value for "at least one significance test reaches p < 0.05" can be much larger than 0.05, because even if the null hypothesis were true, the probability that one out of many significance tests is extreme is not itself extreme.

Typically, a statistical study has multiple steps, with several choices at each step, such as during data collection, outlier rejection, choice of test statistic, choice of one-tailed or two-tailed test, etc. These choices in the "garden of forking paths" multiply, creating many "researcher degrees of freedom". The effect is similar to the file-drawer problem, as the paths not taken are not published.[172]

Consider a simple illustration. Suppose the null hypothesis is true, and we have 20 possible significance tests to apply to the dataset. Also suppose the outcomes of the significance tests are independent. By definition of "significance", each test has probability 0.05 of passing at significance level 0.05. The probability that at least 1 out of 20 is significant is, by assumption of independence, $1 - (1 - 0.05)^{20} \approx 0.64$.[173]
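
The illustration above can be checked with a few lines of Python. The function name is hypothetical, and the calculation relies on the same independence assumption as the text.

```python
def prob_at_least_one_significant(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests is significant
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(f"{n_tests:>3} tests -> {prob_at_least_one_significant(n_tests):.3f}")
```

With 20 independent tests the probability is about 0.64, far above the nominal 0.05 of any single test.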

Another possibility is the multiple comparisons problem. In 2009, it was twice noted that fMRI studies had a suspicious number of positive results with large effect sizes, more than would be expected since the studies have low power (one example[174] had only 13 subjects). One of these critiques pointed out that over half of the studies tested for correlation between a phenomenon and individual fMRI voxels, and reported only on voxels exceeding chosen thresholds.[175]

The figure shows the change in p-values computed from a t-test as the sample size increases, and how early stopping can allow for p-hacking even when the null hypothesis is exactly true. Data are drawn from two identical normal distributions. For each sample size $n$, starting at 5, a t-test is performed on the first $n$ samples from each distribution, and the resulting p-value is plotted. The red dashed line indicates the commonly used significance level of 0.05. If the data collection or analysis were to stop at a point where the p-value happened to fall below the significance level, a spurious statistically significant difference could be reported.

Optional stopping is a practice where one collects data until some stopping criterion is reached. Though a valid procedure, it is easily misused. The problem is that the p-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the p-value is supposed to be the total probability of all events at least as rare as what is observed. With optional stopping, there are even rarer events that are difficult to account for, i.e. not triggering the optional stopping rule and collecting even more data before stopping. Neglecting these events leads to a p-value that is too low. In fact, if the null hypothesis is true, any significance level can be reached if one is allowed to keep collecting data and stop when the desired p-value (calculated as if one has always been planning to collect exactly this much data) is obtained.[176] For a concrete example of testing for a fair coin, see p-value#optional stopping.
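
The inflation caused by optional stopping can be sketched with a simulation in which the null hypothesis is exactly true. This is a simplified illustration under assumptions of its own: it uses a one-sample z-test with known variance rather than a t-test, checks the result after every observation from 10 to 100, and all names and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(peek, n_min=10, n_max=100, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated null experiments declared significant.

    peek=False: test once, at n_max observations (fixed design).
    peek=True:  test after every observation from n_min on and stop at the
                first naive p < alpha (optional stopping).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0, 1)  # null is true: mean 0, known sd 1
            looking = (n >= n_min) if peek else (n == n_max)
            if looking and abs(total / sqrt(n)) > z_crit:
                hits += 1
                break
    return hits / n_sims


print("fixed sample size :", false_positive_rate(peek=False))
print("optional stopping :", false_positive_rate(peek=True))
```

The fixed design stays near the nominal 5% rate, while stopping at the first nominally significant look yields a much higher rate of spurious findings, even though no single step involves fabricating data.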

More succinctly, the proper calculation of p-value requires accounting for counterfactuals, that is, what the experimenter could have done in reaction to data that might have been. Accounting for what might have been is hard even for honest researchers.[176] One benefit of preregistration is to account for all counterfactuals, allowing the p-value to be calculated correctly.[177]

The problem of early stopping is not just limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.[173]

Such practices are widespread in psychology. In a 2012 survey, 56% of psychologists admitted to early stopping, 46% to only reporting analyses that "worked", and 38% to post hoc exclusion, that is, removing some data after analysis was already performed on the data before reanalyzing the remaining data (often on the premise of "outlier removal").[40]

Statistical heterogeneity

As also reported by Stanley and colleagues, a further reason studies might fail to replicate is high heterogeneity of the to-be-replicated effects. In meta-analysis, "heterogeneity" refers to the variance in research findings that results from there being no single true effect size. Instead, findings in such cases are better seen as a distribution of true effects.[15] Statistical heterogeneity is calculated using the I-squared statistic,[178] defined as "the proportion (or percentage) of observed variation among reported effect sizes that cannot be explained by the calculated standard errors associated with these reported effect sizes".[15] This variation can be due to differences in experimental methods, populations, cohorts, and statistical methods between replication studies. Heterogeneity poses a challenge to studies attempting to replicate previously found effect sizes. When heterogeneity is high, subsequent replications have a high probability of finding an effect size radically different than that of the original study.[f]
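
For readers who want to see how such a statistic is obtained, the following sketch computes I-squared from Cochran's Q in the commonly used Higgins and Thompson form, I² = max(0, (Q − df) / Q). The formula is supplied from general knowledge rather than from the passage above, and the function name and example numbers are hypothetical.

```python
def i_squared(effects, std_errors):
    """I-squared: share of observed variation in effect sizes beyond what
    the reported standard errors would explain (Q-based formula)."""
    weights = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))  # Cochran's Q
    df = len(effects) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0


# Three hypothetical studies with equal precision but very different effects.
print(i_squared([0.1, 0.5, 0.9], [0.1, 0.1, 0.1]))
```

In this made-up example almost all of the observed variation exceeds what sampling error alone would produce, so I-squared is close to 1.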

Importantly, significant levels of heterogeneity are also found in direct/exact replications of a study. Stanley and colleagues discuss this while reporting a study by quantitative behavioral scientist Richard Klein and colleagues, where the authors attempted to replicate 15 psychological effects across 36 different sites in Europe and the U.S. In the study, Klein and colleagues found significant amounts of heterogeneity in 8 out of 16 effects (I-squared = 23% to 91%). Importantly, while the replication sites intentionally differed on a variety of characteristics, such differences could account for very little heterogeneity. According to Stanley and colleagues, this suggested that heterogeneity could have been a genuine characteristic of the phenomena being investigated. For instance, phenomena might be influenced by so-called "hidden moderators" – relevant factors that were previously not understood to be important in the production of a certain effect.

In their analysis of 200 meta-analyses of psychological effects, Stanley and colleagues found a median percent of heterogeneity of I-squared = 74%. According to the authors, this level of heterogeneity can be considered "huge". It is three times larger than the random sampling variance of effect sizes measured in their study. If considered alongside sampling error, heterogeneity yields a standard deviation from one study to the next even larger than the median effect size of the 200 meta-analyses they investigated.[g] The authors conclude that if replication is defined by a subsequent study finding a sufficiently similar effect size to the original, replication success is not likely even if replications have very large sample sizes. Importantly, this occurs even if replications are direct or exact since heterogeneity nonetheless remains relatively high in these cases.

Others

Within economics, the replication crisis may also be exacerbated because econometric results are fragile:[179] using different but plausible estimation procedures or data preprocessing techniques can lead to conflicting results.[180][181][182]

Context sensitivity

New York University professor Jay Van Bavel and colleagues argue that a further reason findings are difficult to replicate is the sensitivity to context of certain psychological effects. On this view, failures to replicate might be explained by contextual differences between the original experiment and the replication, often called "hidden moderators".[183] Van Bavel and colleagues tested the influence of context sensitivity by reanalyzing the data of the widely cited Reproducibility Project carried out by the Open Science Collaboration.[12] They re-coded effects according to their sensitivity to contextual factors and then tested the relationship between context sensitivity and replication success in various regression models.

Context sensitivity was found to negatively correlate with replication success, such that higher ratings of context sensitivity were associated with lower probabilities of replicating an effect.[h] Importantly, context sensitivity significantly correlated with replication success even when adjusting for other factors considered important for reproducing results (e.g., effect size and sample size of original, statistical power of the replication, methodological similarity between original and replication).[i] In light of the results, the authors concluded that attempting a replication in a different time, place or with a different sample can significantly alter an experiment's results. Context sensitivity thus may be a reason certain effects fail to replicate in psychology.[183]

Bayesian explanation

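In the framework of Bayesian probability, by Bayes' theorem, rejecting the null hypothesis at significance level 5% does not mean that the posterior probability for the alternative hypothesis is 95%, and the posterior probability is also different from the probability of replication.[184][176] Consider a simplified case where there are only two hypotheses. Let the prior probability of the null hypothesis be $\Pr(H_0)$, and that of the alternative $\Pr(H_1) = 1 - \Pr(H_0)$. For a given statistical study, let its false positive rate (significance level) be $\Pr(\text{find } H_1 \mid H_0)$, and its true positive rate (power) be $\Pr(\text{find } H_1 \mid H_1)$. For illustrative purposes, let the significance level be 0.05 and the power be 0.45 (underpowered).

Now, by Bayes' theorem, conditional on the statistical study finding $H_1$ to be true, the posterior probability of $H_1$ actually being true is not $1 - \Pr(\text{find } H_1 \mid H_0) = 0.95$, but

$$\Pr(H_1 \mid \text{find } H_1) = \frac{\Pr(\text{find } H_1 \mid H_1)\,\Pr(H_1)}{\Pr(\text{find } H_1 \mid H_0)\,\Pr(H_0) + \Pr(\text{find } H_1 \mid H_1)\,\Pr(H_1)},$$

and the probability of replicating the statistical study is

$$\Pr(\text{replication} \mid \text{find } H_1) = \Pr(\text{find } H_1 \mid H_1)\,\Pr(H_1 \mid \text{find } H_1) + \Pr(\text{find } H_1 \mid H_0)\,\Pr(H_0 \mid \text{find } H_1),$$

which is also different from $\Pr(H_1 \mid \text{find } H_1)$. In particular, for a fixed level of significance, the probability of replication increases with power and with the prior probability for $H_1$. If the prior probability for $H_1$ is small, then one would require a high power for replication.

For example, if the prior probability of the null hypothesis is $\Pr(H_0) = 0.9$, and the study found a positive result, then the posterior probability for $H_1$ is $\Pr(H_1 \mid \text{find } H_1) = 0.50$, and the replication probability is $\Pr(\text{replication} \mid \text{find } H_1) = 0.25$.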
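
The two-hypothesis calculation can be sketched in a few lines of Python. The snippet uses the significance level 0.05 and power 0.45 from the text and takes a prior of 0.9 for the null hypothesis as its illustrative input. The function and variable names are hypothetical.

```python
def posterior_and_replication(prior_h0: float, alpha: float, power: float):
    """Return (posterior probability of H1 after a positive finding,
    probability that an identical study finds a positive result again)."""
    prior_h1 = 1 - prior_h0
    # Total probability of a positive finding under both hypotheses.
    evidence = alpha * prior_h0 + power * prior_h1
    post_h1 = power * prior_h1 / evidence  # Bayes' theorem
    post_h0 = 1 - post_h1
    # A replication succeeds with probability `power` if H1 is true
    # and with probability `alpha` if H0 is true.
    replication = power * post_h1 + alpha * post_h0
    return post_h1, replication


post, rep = posterior_and_replication(prior_h0=0.9, alpha=0.05, power=0.45)
print(f"posterior for H1: {post:.2f}, replication probability: {rep:.2f}")
```

With these inputs the posterior probability of the alternative is 0.50 and the replication probability is 0.25, both far from the 0.95 that a naive reading of the significance level might suggest.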

Problem with null hypothesis testing

Some argue that null hypothesis testing is itself inappropriate, especially in "soft sciences" like social psychology.[185][186]

As repeatedly observed by statisticians,[187] in complex systems, such as social psychology, "the null hypothesis is always false", or "everything is correlated". If so, then if the null hypothesis is not rejected, that does not show that the null hypothesis is true, but merely that it was a false negative, typically due to low power.[188] Low power is especially prevalent in subject areas where effect sizes are small and data is expensive to acquire, such as social psychology.[185][189]

Furthermore, when the null hypothesis is rejected, it might not be evidence for the substantive alternative hypothesis. In soft sciences, many hypotheses can predict a correlation between two variables. Thus, evidence against the null hypothesis "there is no correlation" is no evidence for one of the many alternative hypotheses that equally well predict "there is a correlation". Fisher developed NHST for agronomy, where rejecting the null hypothesis is usually good proof of the alternative hypothesis, since there are not many alternatives. Rejecting the hypothesis "fertilizer does not help" is evidence for "fertilizer helps". But in psychology, there are many alternative hypotheses for every null hypothesis.[189][190]

In particular, when statistical studies on extrasensory perception reject the null hypothesis at extremely low p-value (as in the case of Daryl Bem), it does not imply the alternative hypothesis "ESP exists". Far more likely is that there was a small (non-ESP) signal in the experiment setup that has been measured precisely.[191]

Paul Meehl noted that statistical hypothesis testing is used differently in "soft" psychology (personality, social, etc.) from physics. In physics, a theory makes a quantitative prediction and is tested by checking whether the prediction falls within the statistically measured interval. In soft psychology, a theory makes a directional prediction and is tested by checking whether the null hypothesis is rejected in the right direction. Consequently, improved experimental technique makes theories more likely to be falsified in physics but less likely to be falsified in soft psychology, as the null hypothesis is always false since any two variables are correlated by a "crud factor" of about 0.30. The net effect is an accumulation of theories that remain unfalsified, but with no empirical evidence for preferring one over the others.[23][190]

Base rate fallacy

According to philosopher Alexander Bird, a possible reason for the low rates of replicability in certain scientific fields is that a majority of tested hypotheses are false a priori.[192] On this view, low rates of replicability could be consistent with quality science. Relatedly, the expectation that most findings should replicate would be misguided and, according to Bird, a form of base rate fallacy. Bird's argument works as follows. Assuming an ideal situation of a test of significance, whereby the probability of incorrectly rejecting the null hypothesis is 5% (i.e. Type I error) and the probability of correctly rejecting the null hypothesis is 80% (i.e. Power), in a context where a high proportion of tested hypotheses are false, it is conceivable that the number of false positives would be high compared to those of true positives.[192] For example, in a situation where only 10% of tested hypotheses are actually true, one can calculate that as many as 36% of results will be false positives.[j]

The claim that the falsity of most tested hypotheses can explain low rates of replicability is even more relevant when considering that the average power for statistical tests in certain fields might be much lower than 80%. For example, the proportion of false positives increases to a value between 55.2% and 57.6% when calculated with the estimates of an average power between 34.1% and 36.4% for psychology studies, as provided by Stanley and colleagues in their analysis of 200 meta-analyses in the field.[15] A high proportion of false positives would then result in many research findings being non-replicable.

Bird notes that the claim that a majority of tested hypotheses are false a priori in certain scientific fields might be plausible given factors such as the complexity of the phenomena under investigation, the fact that theories are seldom undisputed, the "inferential distance" between theories and hypotheses, and the ease with which hypotheses can be generated. In this respect, the fields Bird takes as examples are clinical medicine, genetic and molecular epidemiology, and social psychology. This situation is radically different in fields where theories have outstanding empirical basis and hypotheses can be easily derived from theories (e.g., experimental physics).[192]

Consequences


When effects are wrongly stated as relevant in the literature, failure to detect this by replication will lead to the canonization of such false facts.[193]

A 2021 study found that papers in leading general interest, psychology and economics journals with findings that could not be replicated tend to be cited more over time than reproducible research papers, likely because these results are surprising or interesting. The trend is not affected by publication of failed reproductions, after which only 12% of papers that cite the original research will mention the failed replication.[194][195] Further, experts are able to predict which studies will be replicable, leading the authors of the 2021 study, Marta Serra-Garcia and Uri Gneezy, to conclude that experts apply lower standards to interesting results when deciding whether to publish them.[195]

Public awareness and perceptions


Concerns have been expressed within the scientific community that the general public may consider science less credible due to failed replications.[196] Research supporting this concern is sparse, but a nationally representative survey in Germany showed that more than 75% of Germans have not heard of replication failures in science.[197] The study also found that most Germans have positive perceptions of replication efforts: only 18% think that non-replicability shows that science cannot be trusted, while 65% think that replication research shows that science applies quality control, and 80% agree that errors and corrections are part of science.[197]

Response in academia


With the replication crisis of psychology earning attention, Princeton University psychologist Susan Fiske drew controversy for speaking against critics of psychology for what she called bullying and undermining the science.[198][199][200][201] She called these unidentified "adversaries" names such as "methodological terrorist" and "self-appointed data police", saying that criticism of psychology should be expressed only in private or by contacting the journals.[198] Columbia University statistician and political scientist Andrew Gelman responded to Fiske, saying that she had found herself willing to tolerate the "dead paradigm" of faulty statistics and had refused to retract publications even when errors were pointed out.[198] He added that her tenure as editor had been abysmal and that a number of published papers she edited were found to be based on extremely weak statistics; one of Fiske's own published papers had a major statistical error and "impossible" conclusions.[198]

Credibility revolution


Some researchers in psychology indicate that the replication crisis is a foundation for a "credibility revolution", in which changes to the standards by which psychological science is evaluated may include emphasizing transparency and openness, preregistering research projects, and replicating research with higher standards for evidence to improve the strength of scientific claims.[202] Such changes may diminish the productivity of individual researchers, but this effect could be avoided by data sharing and greater collaboration.[202] A credibility revolution could be good for the research environment.[203]

Remedies


Focus on the replication crisis has led to renewed efforts in psychology to retest important findings.[41][204] A 2013 special edition of the journal Social Psychology focused on replication studies.[13]

Standardization of statistical and experimental methods, together with requirements for transparency about them, has been proposed.[205] Careful documentation of the experimental set-up is considered crucial for the replicability of experiments, yet many variables, such as animals' diets in animal studies, may go undocumented and unstandardized.[206]

A 2016 article by John Ioannidis elaborated on "Why Most Clinical Research Is Not Useful".[207] Ioannidis describes what he views as some of the problems and calls for reform, setting out criteria that medical research should meet to be useful again; one example he gives is the need for medicine to be patient-centered (e.g. in the form of the Patient-Centered Outcomes Research Institute) rather than, as in current practice, mainly serving "the needs of physicians, investigators, or sponsors".

Reform in scientific publishing


Metascience


Metascience is the use of scientific methodology to study science itself. It seeks to increase the quality of scientific research while reducing waste. It is also known as "research on research" and "the science of science", as it uses research methods to study how research is done and where improvements can be made. Metascience is concerned with all fields of research and has been called "a bird's eye view of science."[208] In Ioannidis's words, "Science is the best thing that has happened to human beings ... but we can do it better."[209]

Meta-research continues to be conducted to identify the roots of the crisis and to address them. Methods of addressing the crisis include pre-registration of scientific studies and clinical trials as well as the founding of organizations such as CONSORT and the EQUATOR Network that issue guidelines for methodology and reporting. Efforts continue to reform the system of academic incentives, improve the peer review process, reduce the misuse of statistics, combat bias in scientific literature, and increase the overall quality and efficiency of the scientific process.

Presentation of methodology


Some authors have argued that the insufficient communication of experimental methods is a major contributor to the reproducibility crisis and that better reporting of experimental design and statistical analyses would improve the situation. These authors tend to plead for both a broad cultural change in the scientific community of how statistics are considered and a more coercive push from scientific journals and funding bodies.[210] But concerns have been raised about the potential for standards for transparency and replication to be misapplied to qualitative as well as quantitative studies.[211]

Business and management journals that have introduced editorial policies on data accessibility, replication, and transparency include the Strategic Management Journal, the Journal of International Business Studies, and the Management and Organization Review.[93]

Result-blind peer review


In response to concerns in psychology about publication bias and data dredging, more than 140 psychology journals have adopted result-blind peer review. In this approach, studies are accepted before they are conducted, on the basis of the methodological rigor of their experimental designs and the theoretical justification for their statistical analysis techniques, rather than after completion on the basis of their findings.[212] Early analysis of this procedure has estimated that 61% of result-blind studies have led to null results, in contrast to an estimated 5% to 20% in earlier research.[101] In addition, large-scale collaborations between researchers working in multiple labs in different countries that regularly make their data openly available for different researchers to assess have become much more common in psychology.[213]

Pre-registration of studies


Scientific publishing has begun using pre-registration reports to address the replication crisis.[214][215] The registered report format requires authors to submit a description of the study methods and analyses prior to data collection. Once the method and analysis plan is vetted through peer-review, publication of the findings is provisionally guaranteed, based on whether the authors follow the proposed protocol. One goal of registered reports is to circumvent the publication bias toward significant findings that can lead to implementation of questionable research practices. Another is to encourage publication of studies with rigorous methods.

The journal Psychological Science has encouraged the preregistration of studies and the reporting of effect sizes and confidence intervals.[216] The editor in chief also noted that the editorial staff will be asking for replication of studies with surprising findings from examinations using small sample sizes before allowing the manuscripts to be published.

Metadata and digital tools for tracking replications


It has been suggested that "a simple way to check how often studies have been repeated, and whether or not the original findings are confirmed" is needed.[194] Categorizations and ratings of reproducibility at the study or results level, as well as addition of links to and rating of third-party confirmations, could be conducted by the peer-reviewers, the scientific journal, or by readers in combination with novel digital platforms or tools.

Statistical reform


Requiring smaller p-values


Many publications require a p-value of p < 0.05 to claim statistical significance. The paper "Redefine statistical significance",[217] signed by a large number of scientists and mathematicians, proposes that in "fields where the threshold for defining statistical significance for new discoveries is p < 0.05, we propose a change to p < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields." Their rationale is that "a leading cause of non-reproducibility (is that the) statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating 'statistically significant' findings with p < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems."[217]

This call was subsequently criticised by another large group, who argued that "redefining" the threshold would not fix current problems, would lead to some new ones, and that in the end, all thresholds needed to be justified case-by-case instead of following general conventions.[218] A 2022 followup study examined these competing recommendations' practical impact. Despite high citation rates of both proposals, researchers found limited implementation of either the p < 0.005 threshold or the case-by-case justification approach in practice. This revealed what the authors called a "vicious cycle", in which scientists reject recommendations because they are not standard practice, while the recommendations fail to become standard practice because few scientists adopt them.[219]

Addressing misinterpretation of p-values


Although statisticians are unanimous that the use of "p < 0.05" as a standard for significance provides weaker evidence than is generally appreciated, there is a lack of unanimity about what should be done about it. Some have advocated that Bayesian methods should replace p-values. This has not happened on a wide scale, partly because it is complicated and partly because many users distrust the specification of prior distributions in the absence of hard data. A simplified version of the Bayesian argument, based on testing a point null hypothesis, was suggested by pharmacologist David Colquhoun.[220][221] The logical problems of inductive inference were discussed in "The Problem with p-values" (2016).[222]

The hazards of reliance on p-values arise partly because even an observation of p = 0.001 is not necessarily strong evidence against the null hypothesis.[221] Although the likelihood ratio in favor of the alternative hypothesis over the null is then close to 100, if the hypothesis was implausible, with a prior probability of a real effect of 0.1, even the observation of p = 0.001 would carry a false positive risk of 8 percent, which still fails to reach the 5 percent level.
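
The 8 percent figure follows from Bayes' rule in odds form. The sketch below is a minimal illustration under two assumptions taken from the paragraph above: a likelihood ratio of exactly 100 (the text says "close to 100") and a prior probability of 0.1 for a real effect. The function name `false_positive_risk` is introduced here for illustration and is not from the cited work.

```python
def false_positive_risk(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability that the null is true after a 'positive' result.

    prior:            prior probability that a real effect exists
    likelihood_ratio: evidence for the alternative relative to the null
    """
    prior_odds = prior / (1.0 - prior)               # odds that the effect is real
    posterior_odds = prior_odds * likelihood_ratio   # Bayes' rule in odds form
    return 1.0 / (1.0 + posterior_odds)              # probability the effect is not real


# Prior of 0.1 and a likelihood ratio of 100, as in the text.
print(round(false_positive_risk(prior=0.1, likelihood_ratio=100.0), 3))  # 0.083
```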

It was recommended that the terms "significant" and "non-significant" should not be used.[221] p-values and confidence intervals should still be specified, but they should be accompanied by an indication of the false-positive risk. It was suggested that the best way to do this is to calculate the prior probability that would be necessary to believe in order to achieve a false positive risk of a certain level, such as 5%. The calculations can be done with various computer software.[221][223] This reverse Bayesian approach, which physicist Robert Matthews suggested in 2001,[224] is one way to avoid the problem that the prior probability is rarely known.
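
The reverse Bayesian calculation can be sketched by inverting the odds form of Bayes' rule: fix the acceptable false positive risk, then solve for the prior. This is an illustrative sketch rather than the procedure implemented in the software cited above; the function name `required_prior` and the likelihood ratio of 100 are assumptions made for the example.

```python
def required_prior(target_fpr: float, likelihood_ratio: float) -> float:
    """Prior probability of a real effect needed to reach a target false positive risk."""
    # Posterior odds (real effect : no effect) implied by the target risk.
    required_posterior_odds = (1.0 - target_fpr) / target_fpr
    # Divide out the evidence to recover the prior odds.
    required_prior_odds = required_posterior_odds / likelihood_ratio
    # Convert odds back to a probability.
    return required_prior_odds / (1.0 + required_prior_odds)


# With a likelihood ratio of 100, a 5% false positive risk needs a prior of roughly 0.16.
print(required_prior(target_fpr=0.05, likelihood_ratio=100.0))
```

A reader can then judge whether a prior of that size is believable for the hypothesis in question, without having to commit to a specific prior in advance.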

Encouraging larger sample sizes


To improve the quality of replications, larger sample sizes than those used in the original study are often needed.[225] Larger sample sizes are needed because estimates of effect sizes in published work are often exaggerated due to publication bias and large sampling variability associated with small sample sizes in an original study.[226][227][228] Further, using significance thresholds usually leads to inflated effects, because particularly with small sample sizes, only the largest effects will become significant.[186]
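
The inflation of significant effects in small samples can be illustrated with a short simulation. The sketch below makes simplifying assumptions that are not in the source: outcomes are standardized with a known standard deviation of 1, each study is a two-sided z-test at the 1.96 cutoff, and the true effect is 0.2. The function name `mean_significant_estimate` is hypothetical.

```python
import math
import random
import statistics


def mean_significant_estimate(true_effect: float, n: int,
                              n_studies: int = 20000, seed: int = 0) -> float:
    """Average effect estimate among simulated studies that reach significance."""
    rng = random.Random(seed)
    standard_error = 1.0 / math.sqrt(n)  # known SD of 1 is an assumption
    significant = []
    for _ in range(n_studies):
        # The mean of n unit-variance normal draws is itself normal with this SE.
        estimate = rng.gauss(true_effect, standard_error)
        if abs(estimate) / standard_error > 1.96:  # two-sided test at the 5% level
            significant.append(estimate)
    return statistics.fmean(significant)


# Same true effect, different sample sizes.
print(mean_significant_estimate(true_effect=0.2, n=20))   # well above 0.2
print(mean_significant_estimate(true_effect=0.2, n=500))  # close to 0.2
```

With 20 observations per study, an estimate must exceed about 0.44 to be significant, so every significant study overstates a true effect of 0.2; with 500 observations nearly all studies are significant and the average is close to the truth.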

Cross-validation


One common statistical problem is overfitting, that is, when researchers fit a regression model over a large number of variables but a small number of data points. For example, a typical fMRI study of emotion, personality, and social cognition has fewer than 100 subjects, but each subject has 10,000 voxels. The study would fit a sparse linear regression model that uses the voxels to predict a variable of interest, such as self-reported stress. But the study would then report on the p-value of the model on the same data it was fitted to. The standard approach in statistics, where data is split into a training and a validation set, is resisted because test subjects are expensive to acquire.[175][229]

One possible solution is cross-validation, which allows model validation while also allowing the whole dataset to be used for model-fitting.[230]
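
A minimal sketch of k-fold cross-validation follows. For brevity it fits a one-predictor least-squares line rather than the sparse regression over thousands of voxels described above, and the helper names `fit_line` and `k_fold_mse` are hypothetical. The point is the structure: every observation is used for fitting in some folds and for evaluation in exactly one fold, and the model is never scored on data it was fitted to.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_mse(xs, ys, k=5, seed=0):
    """Mean squared prediction error on held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds covering all points
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the training folds.
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the held-out fold.
        squared_errors.extend(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out
        )
    return sum(squared_errors) / len(squared_errors)


# Synthetic data: a linear trend plus noise.
rng = random.Random(1)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
print(k_fold_mse(xs, ys, k=5))
```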

Replication efforts


Funding


In July 2016, the Netherlands Organisation for Scientific Research made €3 million available for replication studies. The funding is for replication based on reanalysis of existing data and replication by collecting and analysing new data. Funding is available in the areas of social sciences, health research and healthcare innovation.[231]

In 2013, the Laura and John Arnold Foundation funded the launch of The Center for Open Science with a $5.25 million grant. By 2017, it had provided an additional $10 million in funding.[232] It also funded the launch of the Meta-Research Innovation Center at Stanford, run by Ioannidis and medical scientist Steven Goodman, to study ways to improve scientific research.[232] It also provided funding for the AllTrials initiative led in part by medical scientist Ben Goldacre.[232]

Emphasis in post-secondary education


Based on coursework in experimental methods at MIT, Stanford, and the University of Washington, it has been suggested that methods courses in psychology and other fields should emphasize replication attempts rather than original studies.[233][234][235] Such an approach would help students learn scientific methodology and provide numerous independent replications of meaningful scientific findings that would test the replicability of scientific findings. Some have recommended that graduate students should be required to publish a high-quality replication attempt on a topic related to their doctoral research prior to graduation.[236]

Replication database


There has been a concern that the number of replication attempts has been growing,[237][238][239] which may lead to research waste.[240] This has created a need to track replication attempts systematically, and several databases have been created in response (e.g.[241][242]). These efforts have produced a replication database that includes psychology and speech-language therapy, among other disciplines, to promote theory-driven research and optimize the use of academic and institutional resources, while promoting trust in science.[243]

Final year thesis

Some institutions require undergraduate students to submit a final year thesis that consists of an original piece of research. Daniel Quintana, a psychologist at the University of Oslo in Norway, has recommended that students should be encouraged to perform replication studies in thesis projects, as well as being taught about open science.[244]

Semi-automated
"The overall process of testing the reproducibility and robustness of the cancer biology literature by robot. First, text mining is used to extract statements about the effect of drugs on gene expression in breast cancer. Then two different teams semi-automatically tested these statements using two different protocols, and two different cell lines (MCF7 and MDA-MB-231) using the laboratory automation system Eve."

Researchers demonstrated a way of semi-automated testing for reproducibility: statements about experimental results were extracted from gene expression cancer research papers (which, as of 2022, were non-semantic) and subsequently reproduced via the robot scientist "Eve".[245][246] Problems with this approach include that it may not be feasible for many areas of research and that sufficient experimental data may not get extracted from some or many papers even if available.

Involving original authors


Psychologist Daniel Kahneman argued that, in psychology, the original authors should be involved in the replication effort because the published methods are often too vague.[247][248] Others, such as psychologist Andrew Wilson, disagree, arguing that the original authors should write down the methods in detail.[247] A 2012 investigation of replication rates in psychology found that replication studies succeeded more often when their authors overlapped with the authors of the original study[249] (91.7% successful replications with author overlap compared to 64.6% without).

Big team science


The replication crisis has led to the formation and development of various large-scale and collaborative communities to pool their resources to address a single question across cultures, countries and disciplines.[250] The focus is on replication, to ensure that the effect generalizes beyond a specific culture and investigate whether the effect is replicable and genuine.[251] This allows interdisciplinary internal reviews, multiple perspectives, uniform protocols across labs, and recruiting larger and more diverse samples.[251] Researchers can collaborate by coordinating data collection or fund data collection by researchers who may not have access to the funds, allowing larger sample sizes and increasing the robustness of the conclusions.

Broader changes to scientific approach


Emphasize triangulation, not just replication


Psychologist Marcus R. Munafò and epidemiologist George Davey Smith argue, in a piece published by Nature, that research should emphasize triangulation, not just replication, to protect against flawed ideas. They claim that,

replication alone will get us only so far (and) might actually make matters worse ... [Triangulation] is the strategic use of multiple approaches to address one question. Each approach has its own unrelated assumptions, strengths and weaknesses. Results that agree across different methodologies are less likely to be artefacts. ... Maybe one reason replication has captured so much interest is the often-repeated idea that falsification is at the heart of the scientific enterprise. This idea was popularized by Karl Popper's 1950s maxim that theories can never be proved, only falsified. Yet an overemphasis on repeating experiments could provide an unfounded sense of certainty about findings that rely on a single approach. ... philosophers of science have moved on since Popper. Better descriptions of how scientists actually work include what epistemologist Peter Lipton called in 1991 "inference to the best explanation".[252]

Complex systems paradigm


The dominant scientific and statistical model of causation is the linear model.[253] The linear model assumes that mental variables are stable properties which are independent of each other. In other words, these variables are not expected to influence each other. Instead, the model assumes that the variables will have an independent, linear effect on observable outcomes.[253]

Social scientists Sebastian Wallot and Damian Kelty-Stephen argue that the linear model is not always appropriate.[253] An alternative is the complex system model which assumes that mental variables are interdependent. These variables are not assumed to be stable, rather they will interact and adapt to each specific context.[253] They argue that the complex system model is often more appropriate in psychology, and that the use of the linear model when the complex system model is more appropriate will result in failed replications.[253]

...psychology may be hoping for replications in the very measurements and under the very conditions where a growing body of psychological evidence explicitly discourages predicting replication. Failures to replicate may be plainly baked into the potentially incomplete, but broadly sweeping failure of human behavior to conform to the standard of independen[ce] ...[253]

Replication should seek to revise theories


Replication is fundamental to scientific progress because it confirms original findings. However, replication alone is not sufficient to resolve the replication crisis. Replication efforts should seek not just to support or question the original findings, but also to replace them with revised, stronger theories with greater explanatory power. This approach therefore involves pruning existing theories, comparing all the alternative theories, and making replication efforts more generative and engaged in theory-building.[254][255] It is also important to assess the extent to which results generalise across geographical, historical and social contexts, both for several scientific fields and for practitioners and policy makers who rely on analyses to guide important strategic decisions. Reproducible and replicable findings were the best predictor of generalisability beyond historical and geographical contexts, indicating that, for the social sciences, results from a certain time period and place can meaningfully inform what is universally present in individuals.[256]

Open science

Six coloured hexagons with text on them are arranged around the words "Tenets of Open Science". Starting at the top right and moving clockwise, the text on the hexagons says: Reproducibility of results; Scientific integrity; Citizen science; Promotion of collaborative work; Ease of access to knowledge for all; and Stimulation of innovation. Underneath the hexagons, there is a large exclamation point, and text saying "Plus: better citation rates for open access articles and research data".
Tenets of open science

Open data, open source software and open source hardware are all critical to enabling reproducibility in the sense of validation of the original data analysis. The use of proprietary software, the lack of the publication of analysis software and the lack of open data prevent the replication of studies. Unless software used in research is open source, reproducing results with different software and hardware configurations is impossible.[257] CERN has both Open Data and CERN Analysis Preservation projects for storing data, all relevant information, and all software and tools needed to preserve an analysis at the large experiments of the LHC. Aside from all software and data, preserved analysis assets include metadata that enable understanding of the analysis workflow, related software, systematic uncertainties, statistics procedures and meaningful ways to search for the analysis, as well as references to publications and to backup material.[258] CERN software is open source and available for use outside of particle physics and there is some guidance provided to other fields on the broad approaches and strategies used for open science in contemporary particle physics.[259]

Online repositories where data, protocols, and findings can be stored and evaluated by the public seek to improve the integrity and reproducibility of research. Examples of such repositories include the Open Science Framework, Registry of Research Data Repositories, and Psychfiledrawer.org. Sites like Open Science Framework offer badges for using open science practices in an effort to incentivize scientists. However, there have been concerns that those who are most likely to provide their data and code for analyses are the researchers that are likely the most sophisticated.[260] Ioannidis suggested that "the paradox may arise that the most meticulous and sophisticated and method-savvy and careful researchers may become more susceptible to criticism and reputation attacks by reanalyzers who hunt for errors, no matter how negligible these errors are".[260]

from Grokipedia
The replication crisis, also referred to as the reproducibility crisis, is an ongoing methodological challenge in various scientific disciplines where a substantial number of published research findings cannot be independently replicated by other researchers or even by the original authors, thereby eroding confidence in the reliability of scientific knowledge. This phenomenon highlights systemic issues in research practices that lead to inflated rates of false positives and non-reproducible results across fields.

The crisis first drew significant attention in psychology through the 2015 Reproducibility Project: Psychology, a collaborative effort led by the Open Science Collaboration that attempted to replicate 100 experiments from three leading psychology journals published in 2008. Of these, 97% of the original studies reported statistically significant results (p < 0.05), but only 36% of the replication attempts achieved significance, with replication effect sizes roughly half as large as those in the originals (correlation coefficient (r) declining from 0.403 to 0.197). This stark discrepancy underscored the prevalence of non-replicable findings and prompted widespread scrutiny of psychological research.

Beyond psychology, the replication crisis affects diverse areas including biomedicine, economics, ecology, and social sciences, as evidenced by failed replications in cancer biology (where only 11% of 53 high-profile studies replicated) and economics (with replication rates around 61% for high-quality studies). A 2016 survey of more than 1,500 scientists across disciplines found that over 70% had tried and failed to reproduce another researcher's experiments, while more than 50% had failed to reproduce their own work; similar concerns persist, with a 2024 survey finding 72% of biomedical researchers agreeing the field faces a reproducibility crisis. These issues have implications for public trust in science, policy decisions, and resource allocation in research funding.

Contributing factors to the crisis include publication bias, where journals preferentially publish novel, positive, or statistically significant results, leading to a literature skewed toward false discoveries; questionable research practices such as p-hacking (manipulating data analysis to achieve significance) and HARKing (hypothesizing after results are known); and underpowered studies with small sample sizes, which miss true effects and leave a larger share of significant findings as false positives. Additionally, the "publish or perish" incentive structure in academia pressures researchers to prioritize quantity over rigor, exacerbating these problems. In response, the scientific community has initiated reforms like preregistration of studies, open data sharing, and increased emphasis on replication attempts to enhance transparency and reliability.

Background

Definition and Importance of Replication

Replication in scientific research refers to the process of independently conducting a study to verify the results of a prior investigation, typically by employing the same or closely analogous methods, materials, and analytical procedures. This practice ensures that findings are not artifacts of chance, error, or unique circumstances, thereby providing diagnostic evidence about the validity of previous claims. There are two primary types of replication: direct replication, which involves an exact repetition of the original study's protocol, including the same experimental conditions, participant recruitment, and data analysis steps, to confirm the reliability of specific results; and conceptual replication, which tests the underlying hypothesis or theory using alternative methods, populations, or measures to assess the generalizability of the core idea. Direct replication focuses on reproducibility under identical conditions, while conceptual replication emphasizes robustness across variations, both contributing to the accumulation of reliable knowledge.

The importance of replication lies in its role as a cornerstone of the scientific method, enabling the distinction between genuine effects and statistical noise or false positives, thus building cumulative and trustworthy scientific knowledge. As philosopher of science Karl Popper emphasized in his principle of falsifiability, scientific theories must be testable and potentially refutable through repeatable experiments; without replicability, isolated findings hold no significance for advancing knowledge. Popper, who lived from 1902 to 1994, argued that reproducibility underpins the objectivity of scientific evidence, allowing independent verification to falsify or corroborate hypotheses.

For instance, in a simple laboratory experiment measuring human reaction times to visual stimuli, replication might involve a second researcher recruiting a new group of participants, using the same timing software and stimulus presentation, and applying identical statistical tests to the data; successful replication would confirm the original average reaction time as a reliable benchmark, illustrating how this process validates basic empirical claims.

Historical Origins

The concept of replication emerged as a cornerstone of scientific practice during the early modern period, particularly through the experimental culture fostered by the Royal Society in 17th-century England. Robert Boyle's air-pump experiments in the 1660s exemplified this, as he emphasized detailed reporting and encouraged independent repetitions by witnesses to verify findings on air pressure and vacuums, thereby establishing empirical reliability amid debates with critics like Thomas Hobbes. This approach transformed replication from ad hoc verification into a normative expectation, promoting trust in experimental claims through communal scrutiny rather than solitary authority.

By the 19th and early 20th centuries, replication gained formal prominence in physics, where precise measurements demanded repeated trials to resolve discrepancies and build consensus. The Michelson-Morley experiment of 1887, testing the luminiferous ether, was extensively replicated by Dayton Miller and others in subsequent decades, with improved interferometers confirming the null result and paving the way for Einstein's relativity theory. Concurrently, the professionalization of science led to journals like Nature (founded 1869) and Philosophical Transactions implicitly requiring verifiable evidence, as editors prioritized reproducible demonstrations to distinguish rigorous work from speculation.

Following World War II, the expansion of quantitative methods in social sciences and psychology assumed replication as an inherent safeguard, yet systematic checks remained rare amid rapid institutional growth. Fields like experimental psychology grew through federal funding and applied testing, where studies on behavior and cognition were presumed replicable due to controlled lab settings, but this optimism overlooked contextual variations.

Key debates in the 1960s and 1970s, notably Lee Cronbach's critiques, highlighted tensions in reliability, distinguishing internal validity (controlled replicability within studies) from external validity (generalizability across settings). In his 1975 address, Cronbach argued for bridging experimental and correlational approaches to enhance psychological findings' robustness, influencing methodological reforms.

Sociologically, replication norms solidified during science's professionalization from the 19th century onward, as universities and academies standardized training to curb fraud and bias, embedding verification in peer review and tenure criteria to legitimize disciplines amid industrialization and specialization.

Early Indicators and Statistics

Early signs of the replication crisis emerged through quantitative analyses in the late 20th and early 21st centuries, revealing systemic issues in research reliability across disciplines. One foundational indicator was the low statistical power of psychological studies, which increases the risk of false negatives and undermines replicability. In a 1962 review by Jacob Cohen, the average statistical power to detect medium-sized effects in abnormal-social psychological research was approximately 0.46, implying a high likelihood of missing true effects. Subsequent pre-2010 surveys confirmed persistently low power; for instance, Peter Sedlmeier and Gerd Gigerenzer's 1989 analysis of psychological studies from 1960 to 1987 found an average power of 0.37, corresponding to a 63% chance of a false negative for medium effects.

In medicine, early tools for evaluating research quality further highlighted reproducibility concerns. The development of the AMSTAR (A MeaSurement Tool to Assess systematic Reviews) instrument in 2007 provided a standardized 11-item checklist to appraise the methodological rigor of systematic reviews, often revealing deficiencies that compromise reproducibility. Initial applications of AMSTAR to non-Cochrane systematic reviews in fields like oncology and cardiology demonstrated low overall quality scores, with many reviews failing to adequately address publication bias, study heterogeneity, or conflicts of interest, all factors that erode confidence in replicated findings.

Global statistics from early meta-analyses also pointed to replication failures in applied fields. John P. A. Ioannidis's seminal 2005 paper, "Why Most Published Research Findings Are False," modeled the probability of false positives using factors like power, bias, and pre-study odds, estimating that 50-90% of findings in fields with small effects and low power, such as nutrition and epidemiology, could be false.
This was echoed in early meta-analyses; for instance, 1990s reviews of nutrition studies on dietary factors and disease risk often showed inconsistent results across trials, with replication success rates below 50% for associations like antioxidant supplements and cancer prevention. In cancer research, retrospective analyses from 1999-2010 documented stark inconsistencies: C. Glenn Begley and Lee M. Ellis reported in 2012 that only 11% (6 out of 53) of influential preclinical studies from that period could be replicated by an independent team at Amgen, attributing discrepancies to selective reporting and experimental variability. These indicators collectively signaled the need for broader scrutiny of scientific practices.
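The model in Ioannidis's 2005 paper can be made concrete with a short calculation. The sketch below is a minimal Python rendering of the paper's positive predictive value (PPV) formula as it is usually stated; the function and parameter names are illustrative choices, not taken from the paper, and the example inputs are two scenarios of the kind the paper tabulates.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05, bias=0.0):
    """Probability that a statistically significant claim is actually true.

    prior_odds: R, pre-study odds that a tested relationship is real.
    power:      1 - beta, chance of detecting a relationship that is real.
    alpha:      Type I error rate of the test.
    bias:       u, share of analyses that would not have been positive
                but end up reported as positive anyway.
    """
    beta = 1.0 - power
    # Real relationships reported as positive (detected, or pushed over by bias).
    true_positives = (power + bias * beta) * prior_odds
    # Null relationships reported as positive (chance, or pushed over by bias).
    false_positives = alpha + bias * (1.0 - alpha)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Well-powered trial, even pre-study odds, little bias.
    print(round(positive_predictive_value(1.0, 0.80, bias=0.10), 2))  # 0.85
    # Underpowered study, 1:5 pre-study odds, more bias.
    print(round(positive_predictive_value(0.2, 0.20, bias=0.20), 2))  # 0.23
```

Under these assumptions, lowering the power, the pre-study odds, or both quickly pushes the PPV below one half, which is the sense in which most findings in such a field would be false.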

Prevalence

In Psychology

The replication crisis in psychology became starkly evident through the Reproducibility Project: Psychology, conducted by the Open Science Collaboration from 2011 to 2015, which attempted to replicate 100 experimental and correlational studies originally published in three prominent psychology journals in 2008. Of these, only 36% produced statistically significant results in the same direction as the originals, with replication effect sizes averaging roughly half those reported in the initial studies (mean original effect size d = 0.403; replication d = 0.197). This project highlighted systemic issues in psychological research reproducibility, prompting widespread scrutiny and reform efforts within the field.

Subfields within psychology showed varying degrees of vulnerability to replication failures, with social psychology particularly affected. For instance, priming effects, such as those influencing behavior through subtle cues, replicated successfully in only 17% of cases across 94 studies, with 94% exhibiting smaller effect sizes than the originals. Similarly, the ego depletion hypothesis, which posits that self-control is a limited resource that depletes with use, has faced severe challenges, succeeding in just 4 out of 36 major multi-site replication attempts by 2022, a success rate of roughly 11%. In contrast, cognitive psychology demonstrated somewhat higher replicability, with memory studies and related experiments achieving around 48% success in the Reproducibility Project, compared to 23% in social psychology, though inconsistencies persist in areas like false memory paradigms.

Surveys have underscored the role of questionable research practices (QRPs) in contributing to non-replication.
A 2012 study by John et al., surveying over 2,000 psychologists, found that more than 50% admitted to practices such as failing to report all dependent measures (63%) or deciding whether to collect more data after checking whether the results were significant (56%), which inflate the likelihood of false positives and hinder reproducibility.

Recent analyses indicate ongoing challenges, with effect sizes in psychological journals post-2015 remaining approximately halved compared to pre-crisis levels, reflecting a conservative shift in reporting but persistent overestimation in originals. A 2023 meta-analysis of replications across psychology subfields estimated overall failure rates between 40% and 60%, varying by domain, with social psychology at the lower end of replicability. These findings emphasize the need for continued vigilance in psychological research validation.

In Medicine and Biomedical Sciences

The replication crisis in medicine and biomedical sciences manifests prominently in both preclinical and clinical research, where failures to reproduce findings undermine the reliability of evidence used for drug development, treatment decisions, and public health policies. High-profile initiatives have highlighted systemic issues, with preclinical studies in particular showing low reproducibility rates that contribute to wasted resources and delayed therapeutic advances. For instance, pharmaceutical companies have reported significant challenges in validating published results, leading to reevaluations of research pipelines and calls for improved standards. A landmark effort, the Reproducibility Project: Cancer Biology, launched in 2013 by the Center for Open Science in collaboration with Science Exchange, aimed to replicate key experiments from 50 high-profile cancer biology papers published between 2010 and 2012. By 2021, the project had completed replications for 23 papers, finding that only 46% of the 97 experiments showed statistically significant effects in the expected direction, compared to 87% in the originals; moreover, replicable effect sizes were on average 85% smaller than those initially reported. This outcome underscores the fragility of preclinical cancer research, where positive findings were only half as likely to replicate successfully (40%) as null results (80%), suggesting inflated original effects due to methodological or selective reporting issues. Preclinical failures extend beyond cancer, as evidenced by a 2012 internal analysis at Bayer HealthCare, which attempted to reproduce 67 landmark publications in oncology, women's health, and cardiovascular disease. The team could validate only 25% of the studies to a level sufficient for further drug development, attributing discrepancies to incomplete experimental details, biological variability, and potential biases in original reporting. 
Similarly, Amgen scientists reported in 2012 replicating just 6 out of 53 influential cancer studies (11%), reinforcing industry-wide concerns about the translatability of basic research to clinical applications. These corporate audits, though not exhaustive, illustrate how non-reproducible preclinical data can lead to billions in downstream costs; irreproducible preclinical research is estimated to cost $28 billion annually in the U.S. alone.

In clinical trials, reproducibility challenges arise in confirming drug efficacy and safety, often resulting from selective publication, underpowered designs, and inconsistent protocols across studies. A 2009 analysis estimated that up to 85% of health research funding, including clinical trials, is wasted due to avoidable issues like poor question formulation, non-generalizable designs, and inaccessible data, which hinder independent verification and meta-analyses. For example, many phase III trials for expensive therapeutics fail to replicate prior efficacy signals observed in smaller studies, contributing to regulatory scrutiny and retracted approvals; a 2023 assessment of highly cited clinical research from 2004–2018 found replication rates as low as 40–50% for key claims in top journals. These issues exacerbate the translation gap, where promising preclinical results seldom advance to approved treatments, with only about 5–10% of cancer drugs succeeding in clinical phases.

A 2024 international survey of over 1,900 biomedical researchers revealed widespread acknowledgment of a reproducibility crisis, with 72% agreeing that biomedicine faces severe replicability problems and only 5% estimating that more than 80% of studies are reproducible; respondents attributed low rates (<30% in many estimates) to cultural pressures and methodological flaws across fields like cell biology.

In subareas such as neuroscience, functional MRI (fMRI) studies have been particularly affected. A 2009 analysis by Vul et al. demonstrated that over 50 high-profile papers reported implausibly high correlations (often >0.8) between brain activation and individual trait measures, likely inflated by non-independent selection of peak voxels, suggesting up to 70% false positive rates when accounting for low statistical power and flexible analyses. These findings prompted methodological reforms, including preregistration and whole-brain corrections, but they highlight ongoing vulnerabilities that parallel broader biomedical concerns.
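The non-independent selection problem raised by Vul et al. can be illustrated with a small simulation. The sketch below is not their analysis: it assumes purely random "voxels" with no true relationship to a trait, picks the voxel with the highest sample correlation, and compares that correlation with what the same kind of voxel shows in fresh data. All names and sample sizes are hypothetical.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def circular_vs_independent(n_subjects=16, n_voxels=2000, n_fresh=50, seed=0):
    rng = random.Random(seed)
    noise = lambda: [rng.gauss(0, 1) for _ in range(n_subjects)]
    trait = noise()
    voxels = [noise() for _ in range(n_voxels)]  # true correlation is zero

    # Circular analysis: select the peak voxel and report its correlation
    # using the same data that were used to select it.
    circular_r = max(pearson(v, trait) for v in voxels)

    # Independent analysis: the selected voxel is still pure noise, so fresh
    # samples of voxel and trait show what an unbiased estimate looks like.
    fresh = [pearson(noise(), noise()) for _ in range(n_fresh)]
    independent_r = sum(fresh) / n_fresh
    return circular_r, independent_r


if __name__ == "__main__":
    circular_r, independent_r = circular_vs_independent()
    print(f"selected and tested on same data: r = {circular_r:.2f}")
    print(f"average in independent data:      r = {independent_r:.2f}")
```

Even though every voxel is noise, the selected correlation is large, while the average in independent data stays near zero. This is why estimating effect size on data held out from the selection step is the usual remedy.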

In Economics and Other Social Sciences

The replication crisis has notably affected economics, where empirical studies often rely on complex econometric models and large datasets, making verification challenging. A comprehensive assessment of empirical papers published in the American Economic Review's centenary volume revealed that only 29% had been formally replicated, highlighting a low baseline rate for top-tier work in the field. Similarly, a large-scale replication of laboratory experiments in economics found success rates ranging from 61% to 78%, depending on the indicator used; this still indicates substantial variability and failure in about 22-39% of cases. These findings underscore how the field's emphasis on novel techniques, like difference-in-differences and regression discontinuity designs, can obscure reproducibility problems when practices are not standardized.

Key analyses have pinpointed p-hacking as a contributing factor in economics, where researchers may selectively report results to achieve statistical significance. In a study of over 57,000 hypothesis tests from top economics journals spanning 1963 to 2018, Brodeur et al. identified clear patterns of p-value bunching just below the 0.05 threshold, particularly in causal analyses using instrumental variables and regression discontinuity, suggesting that p-hacking influences 20-50% of the distribution of significant results, depending on the method. This practice not only inflates false positives but also complicates replication, as original datasets and code are often unavailable or inadequately documented.

In other social sciences, replication challenges arise from reliance on survey data, network analyses, and observational studies. Efforts to computationally reproduce findings from prominent articles have highlighted issues with data and code availability, pointing to undocumented data cleaning steps and software dependencies.
Political science faces analogous problems, with large-scale replication projects in the 2020s yielding an overall success rate of approximately 50%, particularly in observational studies where model specifications vary across contexts. For instance, replications of studies of campaign interventions have shown inconsistent effects on turnout, often failing due to heterogeneous samples and unmodeled interactions in multi-level data.

Illustrative examples highlight these issues. The seminal 1994 study by Card and Krueger on the employment effects of a minimum wage increase in New Jersey found no evidence of job losses, a counterintuitive result; subsequent replications using payroll records and alternative estimators have yielded mixed outcomes, with some confirming no significant impact and others finding modest employment reductions of 1-4%, illustrating sensitivities to data sources and measurement error. Similarly, research on inequality, such as work linking the Gini coefficient to generosity outcomes, has encountered inconsistencies; replications have provided mixed evidence, with effect sizes varying widely across cultures and failing to replicate in 40-60% of attempts due to factors like regional differences.

Broader trends in the 2020s, drawn from surveys of social scientists, indicate that non-replication rates hover around 50-60% for survey-based research across the social sciences, driven by issues like low statistical power in heterogeneous populations and selective reporting of subgroup analyses. These patterns emphasize shared vulnerabilities in social sciences, where human-centric data amplify variability compared to controlled experimental settings.
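The p-value bunching pattern described in this section can be checked with a simple caliper-style comparison. The sketch below is a simplified illustration, not the procedure used in the cited study: it counts reported p-values in narrow windows just below and just above the threshold and asks, with a one-sided binomial test, whether the excess just below is larger than chance would explain. The function name, window width, and example data are assumptions.

```python
from math import comb


def caliper_test(p_values, threshold=0.05, width=0.005):
    """Compare counts of p-values just below vs. just above a threshold.

    If results were reported without regard to significance, a p-value in
    the narrow band around the threshold should fall on either side with
    roughly equal probability. Returns (below, above, one_sided_p).
    """
    below = sum(threshold - width <= p < threshold for p in p_values)
    above = sum(threshold <= p < threshold + width for p in p_values)
    n = below + above
    # P(X >= below) for X ~ Binomial(n, 0.5): chance of this much bunching.
    tail = sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n
    return below, above, tail


if __name__ == "__main__":
    # Hypothetical reported p-values with an excess just under 0.05.
    reported = [0.046] * 30 + [0.052] * 10 + [0.20, 0.01, 0.33]
    below, above, p = caliper_test(reported)
    print(below, above, f"{p:.4f}")
```

A small one-sided p-value flags bunching but does not by itself show p-hacking; publication bias or rounding conventions can produce similar asymmetries, so results of this kind are usually read alongside other evidence.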

In Natural and Emerging Fields

In nutrition science, replication issues have been particularly pronounced in studies examining dietary effects, such as the purported links between saturated fat intake and cardiovascular disease. Initial observational and intervention studies from the mid-20th century suggested strong associations, but subsequent meta-analyses have failed to consistently replicate these findings, revealing inconsistencies due to methodological variations, confounding factors like overall diet quality, and selective reporting. For instance, a comprehensive review highlighted that many early claims about saturated fats increasing heart disease risk do not hold under rigorous re-examination, with effect sizes often diminishing or reversing in larger, better-controlled datasets. This has implications for public health guidelines, where non-replicable evidence has influenced long-standing recommendations on fat consumption, prompting calls for preregistration and transparent data sharing to bolster reliability. Representative examples include conflicting results from cohort studies on low-fat diets, where initial protective effects against heart disease were not upheld in replication attempts across diverse populations.

In water resource management, models assessing climate impacts have exhibited significant replication failures, particularly in cross-site applications, with approximately 50% of projections failing to align when transferred to new geographic or temporal contexts. These discrepancies arise from model sensitivities to local conditions such as soil hydrology, which are often not fully parameterized in the original simulations. For example, hydrologic models developed for basin-specific climate scenarios in one region frequently underperform when replicated in European or Asian watersheds, highlighting the limitations of generalizability in environmental modeling.
Physics and chemistry, while generally more robust due to standardized experimental protocols, are not immune to replication challenges, though failures are rarer and often high-profile. In 2023, claims surrounding cold fusion-like processes, including low-energy nuclear reactions in palladium setups, resurfaced but remained non-replicated despite initial excitement, echoing historical debacles from the 1980s. Similarly, in materials science, complex syntheses like those of 2D materials (e.g., graphene derivatives) prove especially difficult to duplicate due to subtle variations in fabrication conditions. A 2022 analysis of moiré materials synthesis emphasized that precise replication requires exact control over atomic layering, which is often inadequately documented, leading to inconsistent electronic properties in follow-up studies.

Emerging fields like machine learning and artificial intelligence have amplified the replication crisis through opaque practices, such as undisclosed training data and hyperparameters, resulting in non-replicable models. Reports from 2024 and 2025 highlight that up to 70% of image recognition benchmarks fail independent replication, often due to data leakage (where test sets inadvertently overlap with training data) or unshared proprietary datasets in large-scale models. For instance, convolutional neural networks achieving state-of-the-art accuracy on standard benchmark datasets frequently underperform in replication efforts because of non-deterministic elements like random initialization and hardware-specific optimizations. These issues extend beyond performance metrics to broader scientific applications, where ML-driven predictions in fields like climate modeling inherit the same reproducibility pitfalls.

Cross-trends in 2025 analyses point to the replication crisis extending deeply into computational fields, including simulations in natural sciences, where algorithmic choices and software environments exacerbate non-replicability.
Lovrić's examination emphasizes that p-hacking and insufficient validation in computational workflows contribute to this expansion, urging standardized pipelines to mitigate risks across physics, environmental modeling, and AI-integrated research. This convergence underscores a systemic need for open-source code and benchmark protocols to restore confidence in computational outputs.
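Two of the computational pitfalls named in this section, data leakage between training and test sets and unseeded randomness, lend themselves to simple checks. The sketch below uses only the Python standard library; the helper names are hypothetical, and the leakage check catches exact duplicates only, so near-duplicates (resized or re-encoded images, for example) would need a fuzzier comparison.

```python
import hashlib
import random


def fingerprint(example: bytes) -> str:
    """Content hash used to compare examples across dataset splits."""
    return hashlib.sha256(example).hexdigest()


def leaked_examples(train, test):
    """Return test examples whose exact content also appears in train."""
    train_hashes = {fingerprint(x) for x in train}
    return [x for x in test if fingerprint(x) in train_hashes]


def seeded_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed so the same split can be regenerated."""
    rng = random.Random(seed)  # local generator; global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = [f"example-{i}".encode() for i in range(10)]
    train, test = seeded_split(data, seed=42)
    print(len(train), len(test), leaked_examples(train, test))
    # A duplicate that slipped into the training set is reported as leakage.
    print(leaked_examples(train + [test[0]], test))
```

Recording the seed and the split alongside published results lets an independent team regenerate the same partition. It does not remove hardware-level nondeterminism, which has to be handled within the numerical framework in use.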

Causes

Systemic and Cultural Factors

The expansion of science funding following World War II, particularly in the United States through federal agencies, increasingly tied grants to the production of novel findings, which diminished institutional support and incentives for replication studies. This shift prioritized groundbreaking discoveries over verification, as funding panels favored high-risk, high-reward projects that promised new knowledge, leaving replication efforts under-resourced and undervalued.

Sociological analyses trace the replication crisis to the erosion of Robert K. Merton's framework of scientific norms, known as CUDOS: communalism (sharing knowledge freely), universalism (impartial evaluation), disinterestedness (objectivity over personal gain), and organized skepticism (rigorous scrutiny). Competitive academic environments have undermined these norms, fostering a culture that elevates novelty and rapid publication over verification and communal critique. In this context, organized skepticism has weakened, as pressures for productivity discourage the time-intensive work of replicating prior results, leading to a performative culture where impact metrics overshadow collective reliability.

The "publish or perish" culture has intensified over recent decades, with tenure and promotions increasingly linked to publication volume rather than depth or rigor. Surveys of faculty indicate that around 68% perceive greater pressure to publish than in previous years, exacerbating the de-emphasis on replication in favor of prolific output. This systemic pressure has normalized a focus on quantity, where career advancement depends on accumulating papers in high-impact journals, often at the expense of robust verification processes.

Globally, replication norms vary, with European systems generally placing stricter emphasis on verification due to more balanced funding structures, in contrast to the highly competitive, novelty-driven model of the United States, which amplifies replication challenges.
In Europe, funding bodies often integrate replication considerations into grant evaluations more explicitly than their U.S. counterparts, reflecting cultural differences in prioritizing cumulative knowledge over individual breakthroughs.

A contributing cultural factor is the base rate fallacy prevalent in scientific practice, where researchers and evaluators overlook the low prior probabilities of novel effects being true, leading to overconfidence in initial positive findings without adequate replication. This bias in scientific culture amplifies the crisis by fostering expectations of high success rates for discoveries that statistically are unlikely to hold, independent of methodological rigor. Publication bias emerges as a symptom of these broader systemic issues, where null or replicated results are less likely to be disseminated.

Publication and Incentive Structures

One major flaw in the scientific publication system is publication bias, which favors the reporting of positive or novel results over null or negative findings. This bias leads to the "file drawer problem," where studies yielding non-significant results are often left unpublished, distorting the literature and hindering efforts to replicate or verify claims.

Standards of reporting in published papers frequently lack the details necessary for replication, with analyses revealing minimal adoption of transparency practices. For instance, a review of empirical articles in high-impact journals found that fewer than half provided publicly available data (40%), materials (20%), or code (3%), indicating insufficient methodological descriptions to enable independent reproduction.

The "publish or perish" culture exacerbates these issues by prioritizing publication quantity and prestige over rigorous, replicable work. Career advancement metrics, such as journal impact factors and the h-index, reward novel findings in high-profile outlets, often at the expense of confirmatory or replication studies, fostering a system where researchers face pressure to produce eye-catching results to secure jobs, promotions, and tenure.

Journal practices further discourage replication, as such studies are rarely published. Historically, only about 1.6% of publications explicitly involved replications, reflecting editorial preferences for original, groundbreaking research over verification efforts.

Funding incentives compound these structural problems by emphasizing innovation over confirmation. For example, National Institutes of Health (NIH) grant criteria have historically prioritized "transformative" research that promises paradigm shifts, while systematic replication or confirmatory studies receive little dedicated support, limiting resources for verification.

Questionable Research Practices

Questionable research practices (QRPs) refer to a range of design, analytic, and reporting choices that researchers make to enhance the chances of obtaining statistically significant results and achieving publication, without crossing into outright fabrication or falsification of data. These practices are often subtle and flexible, allowing researchers to "listen to the data" in ways that capitalize on chance findings while presenting them as confirmatory, as demonstrated in simulations showing how such flexibility can dramatically inflate error rates. Unlike fraud, QRPs occupy a gray area where researchers may rationalize them as standard procedure to navigate competitive pressures.

Surveys indicate widespread use of QRPs across fields, particularly in psychology, where self-admission rates for selective reporting of analyses that "work" range from 50% to over 70%. In some other fields, witnessed rates for QRPs, including conditional reporting of results, are around 40%, based on international surveys of researchers from 2010 to 2020. A 2012 survey using truth-telling incentives found that 94% of psychological researchers admitted to engaging in at least one QRP over their career, with specific practices including failing to report all dependent measures (63%) and selectively reporting studies that yielded significant results (46%).

Common examples include HARKing (hypothesizing after the results are known), where researchers formulate or adjust hypotheses post-analysis and present them as pre-planned, which obscures the exploratory nature of the work and biases interpretation toward confirmation. Another is optional stopping, in which data collection continues or stops based on interim statistical significance checks, effectively inflating the chance of false positives without adjusting for multiple testing. These practices are enabled by publication bias, where non-significant results are less likely to be published, further incentivizing flexibility.
Simulations illustrate the severe impact of QRPs on scientific validity, demonstrating that combining several of them can raise the false-positive rate from the nominal 5% to over 50%, as researchers exploit analytic flexibility to report only favorable outcomes. For instance, combining practices like optional stopping with selective outcome reporting can push the likelihood of publishing false positives to 60% or higher in low-power studies. Such inflation undermines replicability, as the reported effects often stem from noise rather than true phenomena, contributing directly to the replication crisis.
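The effect of optional stopping alone can be reproduced with a small simulation. The sketch below is a minimal illustration under simplifying assumptions, not a reconstruction of any published simulation: data are drawn from a standard normal distribution with a true mean of zero, a z-test with known unit variance is used, and the "researcher" checks for significance after every 10 observations up to 50. All names and settings are hypothetical.

```python
import math
import random


def two_sided_p(sample):
    """p-value of a z-test for mean zero, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(peek, n_sims=2000, start_n=10, step=10, max_n=50, seed=1):
    """Share of null experiments declared significant at the 0.05 level.

    peek=False: test once, at the planned sample size max_n.
    peek=True:  test after every batch and stop as soon as p < 0.05.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0, 1) for _ in range(start_n)]
        while True:
            significant = two_sided_p(sample) < 0.05
            if len(sample) >= max_n or (peek and significant):
                break
            sample.extend(rng.gauss(0, 1) for _ in range(step))
        hits += significant
    return hits / n_sims


if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(peek=False))
    print("optional stopping:", false_positive_rate(peek=True))
```

With a fixed sample size the rate stays close to the nominal 0.05, while five interim looks push it noticeably higher even though no real effect exists. This isolates a single practice, so the inflation is smaller than the combined figures quoted above, which stack several kinds of flexibility.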

Statistical and Methodological Issues

One major statistical issue contributing to the replication crisis is low statistical power in experimental designs. Statistical power is defined as $1 - \beta$, where $\beta$ represents the Type II error rate, or the probability of failing to detect a true effect. In the social sciences, typical power levels range from 0.3 to 0.5, meaning that even true effects have a 50-70% chance of going undetected in a single study, leading to high non-replication rates for genuine findings. For instance, one assessment of neuroscience studies found a median power of 0.21, exacerbating the risk of missing real effects and inflating apparent ones in published results.

P-hacking, the selective reporting or analysis of data to achieve statistical significance, further undermines replicability by inflating Type I error rates through practices like conducting multiple tests without correction. A common example is multiple testing, where the family-wise error rate increases without adjustments; the Bonferroni correction addresses this by dividing the significance level $\alpha$ by the number of tests $k$, yielding an adjusted threshold of $\alpha / k$. Simulations demonstrate that such undisclosed flexibility can produce up to 60% false positives even for null effects, directly contributing to non-replicable claims in the literature.

Underpowered studies also introduce positive effect size bias, where detected effects are systematically overestimated due to the winner's curse: significant results arise disproportionately from samples whose observed effects are larger than the true effect. In low-power literatures, this bias has led to effect size overestimates of up to three times the true value. Similar patterns appear in the social sciences, where small samples amplify sampling error, resulting in inflated estimates that fail to replicate at more realistic scales.

Null hypothesis significance testing (NHST) exacerbates these problems through widespread misinterpretation of p-values, which are often treated as measures of effect importance or practical significance rather than as evidence against the null.
A p-value below 0.05 indicates only that data at least as extreme as those observed would be unlikely under the null hypothesis; it gives neither the probability that the null is true nor the size of any alternative effect. Yet a 2018 survey found that 99% of psychology researchers and students misinterpreted at least one aspect of p-values. This dichotomous focus on significance thresholds discourages nuanced reporting and promotes cherry-picking, with alternatives like Bayesian methods offering posterior probabilities for hypotheses but seeing limited adoption due to computational demands.

Context sensitivity in effect sizes, where results vary across populations or settings, poses another methodological challenge, often overlooked in generalized claims. For example, psychological effects calibrated on WEIRD (Western, Educated, Industrialized, Rich, Democratic) samples, which make up about 96% of publications despite representing only 12% of the global population, frequently diminish or reverse in diverse groups, reducing cross-study replicability.

In meta-analyses attempting to aggregate findings, publication bias distorts pooled estimates by favoring positive results. It can be detected via Egger's test, which regresses standardized effect sizes against their precision to identify funnel-plot asymmetry indicating missing small or null studies. This bias can substantially inflate overall effect sizes in affected fields.

Finally, statistical heterogeneity across studies, quantified by the $I^2$ statistic as the percentage of total variation due to between-study differences rather than chance, often signals underlying methodological inconsistencies; values exceeding 50% suggest substantial issues, such as unaccounted moderators, complicating reliable synthesis and replication.
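Three of the quantities discussed in this section (statistical power, the adjusted threshold obtained by dividing $\alpha$ by the number of tests, commonly called the Bonferroni correction, and the $I^2$ heterogeneity statistic) are simple enough to compute directly. The sketch below is illustrative only: the power function assumes a two-sided, two-sample comparison and uses a normal approximation rather than the exact t-distribution, $I^2$ is computed from Cochran's Q in the usual way, and all function names are hypothetical.

```python
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) to detect a standardized effect d
    with two equal groups, using a normal approximation."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected z-statistic under d
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


def bonferroni_threshold(alpha, k):
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha across k tests."""
    return alpha / k


def i_squared(q, n_studies):
    """I^2 (in percent) from Cochran's Q for n_studies studies."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


if __name__ == "__main__":
    # A medium effect (d = 0.5) with 20 vs. 64 participants per group.
    print(f"{approx_power(0.5, 20):.2f}", f"{approx_power(0.5, 64):.2f}")
    print(bonferroni_threshold(0.05, 10))
    print(i_squared(q=20.0, n_studies=11))
```

Under these assumptions a medium effect studied with 20 participants per group has power of roughly one in three, within the 0.3 to 0.5 range quoted above, whereas about 64 per group is needed to reach the conventional 0.8.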

Consequences

Effects on Scientific Knowledge

The replication crisis has led to substantial wasted resources in scientific research, particularly in biomedical fields where irreproducible preclinical studies consume billions annually. A 2015 analysis estimated that approximately $28 billion per year is spent in the United States on basic biomedical research that cannot be successfully repeated, representing about half of total preclinical spending. This financial burden diverts funding from promising avenues, exacerbating inefficiencies across scientific endeavors.

The crisis undermines cumulative scientific knowledge by allowing theories to be constructed on foundations of false positives, resulting in the eventual collapse of entire research paradigms. For instance, the social priming paradigm in social psychology, which posited that subtle environmental cues could unconsciously influence complex behaviors, largely disintegrated following a series of failed replications in the 2010s, prompting widespread reevaluation of related findings. Such breakdowns highlight how non-replicable findings propagate errors, slowing the refinement of theoretical models and hindering genuine progress in understanding phenomena.

In psychology, the replication crisis has driven a "credibility revolution" since the 2010s, leading to the revision or rejection of a significant share of established results. Large-scale replication projects have shown that only about 36-50% of key psychological effects from prominent journals hold up under rigorous retesting, necessitating updates to textbooks and curricula that previously presented these as settled. This erosion extends to broader scientific domains, where irreproducible preclinical results contribute to high failure rates in drug development pipelines; for example, only 11% of landmark cancer papers could be reproduced by one pharmaceutical company, delaying therapeutic innovations and increasing costs for viable treatments.
Long-term analyses reveal the enduring impact on the literature, with a substantial portion of highly cited papers from the 2000-2010 period now viewed as questionable due to replication challenges. Studies indicate that non-replicable findings are often cited more frequently than replicable ones, by about 153 additional citations on average in one analysis, perpetuating flawed knowledge and complicating efforts to build reliable cumulative knowledge. This pattern underscores how the crisis not only wastes immediate resources but also distorts the historical record, requiring ongoing efforts to reassess and correct the scientific canon.

Impact on Public Trust and Awareness

The replication crisis has heightened public awareness of reproducibility issues in science. Surveys conducted in 2025 indicate that 18% of laypeople have heard of recent failures to replicate studies, with awareness rising to 29% among those exposed to discussions of methodological flaws. This increased visibility has been amplified post-COVID-19, as widespread public debate about scientific findings, including vaccine efficacy, has drawn attention to broader concerns over research reliability. For instance, high-profile failures in psychological priming experiments during the 2010s, such as attempts to replicate social priming effects that garnered significant media attention, have contributed to this growing public familiarity with replication challenges.

Erosion of trust in science has been a notable consequence, with polls showing a marked decline linked to perceptions of irreproducibility. A November 2024 survey found that 76% of Americans reported a great deal (34%) or a fair amount (42%) of confidence in scientists, down from 87% a few years earlier. This downturn, which accelerated during the COVID-19 pandemic, has been attributed in part to the replication crisis, as revelations of non-reproducible results have fueled doubts about the reliability of scientific claims. The crisis's exposure of systemic issues has thus intertwined with other trust-eroding events, deepening skepticism toward expert consensus.

Media coverage of the replication crisis has further shaped public perceptions, spotlighting scandals and prompting official acknowledgments. In the 2010s, extensive reporting on failed replications of priming studies in social psychology, which had previously achieved viral status in outlets like TED Talks and major news publications, highlighted the fragility of celebrated findings and sparked widespread debate.
By 2025, this culminated in statements addressing the "replication crisis" as a threat to public confidence, including an on "Restoring Gold Standard Science" that emphasized to rebuild trust in federally funded . These developments have had tangible societal consequences, including indirect contributions to through perceived scientific unreliability. The crisis has amplified uncertainties in biomedical research, where replication failures foster a general that exacerbates hesitancy by portraying as prone to error or . On a positive note, however, the heightened has empowered the public to demand more rigorous, transparent , fostering greater of claims and ultimately strengthening societal expectations for evidence-based knowledge.

Institutional and Academic Responses

The credibility revolution in psychology during the 2010s represented a pivotal shift toward prioritizing replicability and transparency in research practices, prompted by large-scale replication failures that highlighted systemic issues in the field. A key component of this movement was the Open Science Framework (OSF), founded in 2013 by Brian Nosek and Jeffrey Spies at the Center for Open Science. The OSF provides free tools for preregistration, data sharing, and collaborative project management to foster reproducibility. It has since supported major initiatives, such as the Reproducibility Project: Psychology, which attempted to replicate 100 studies from top journals and found that only 36% of the replications showed statistically significant effects in the same direction.

Journals responded by revising policies to encourage reproducible research. In April 2013, Nature journals implemented updated reporting standards requiring authors to provide detailed methods, statistical analyses, and data information to enhance transparency and facilitate independent verification. PLOS followed in March 2014 with a mandatory data availability policy, compelling authors to include statements on how the underlying data could be accessed for replication, reanalysis, or validation of findings.

Funding agencies introduced measures to enforce rigor. The U.S. National Institutes of Health (NIH) announced plans in 2014 to bolster reproducibility, issuing guidelines for preclinical reporting and requiring grant applicants from 2015 onward to address the strength of prior studies, the authentication of key resources, and potential biases in experimental design. The European Research Council (ERC) similarly stresses data management and retention in its grant requirements, recommending that funded researchers maintain accessible data files to enable replication and verification.

Academic training adapted to these concerns, with U.S. graduate programs in the 2020s increasingly integrating open science and reproducibility practices into curricula; a survey of APA-accredited doctoral programs found that over 70% offered training on topics like preregistration and data sharing to equip students against reproducibility challenges.

Conferences played a role in dissemination, as seen in the Association for Psychological Science (APS) 2015 annual convention, which included dedicated sessions on replication strategies and open practices to guide researchers in implementing reforms. These institutional efforts often reference preregistration as a core tool for mitigating selective reporting.

Remedies

Reforms in Publishing Practices

To address the replication crisis, several reforms in publishing practices have emerged to mitigate publication bias and questionable research practices by emphasizing methodological rigor over results. These include preregistration of studies, result-blind peer review, mandates for data and code sharing, dedicated journals for metascience, and databases tracking retractions. Such changes aim to shift incentives toward transparent, verifiable research processes.

Preregistration involves researchers publicly committing to their hypotheses, methods, and analysis plans before data collection, typically via platforms like the Open Science Framework (OSF), which launched preregistration capabilities in 2013. This practice distinguishes confirmatory analyses from exploratory ones, reducing the flexibility that enables p-hacking and other questionable research practices (QRPs) by locking in decisions before outcomes are observed. Empirical evidence shows preregistration improves the credibility of findings by preserving accurate calibration of evidence and minimizing post-hoc adjustments. For instance, the Center for Open Science's initiatives have demonstrated that preregistered studies exhibit higher evidential value and lower rates of bias than non-preregistered ones.

Result-blind peer review, proposed as a key reform in 2013, evaluates manuscripts based solely on research questions, methods, and proposed analyses without knowledge of results, thereby countering biases favoring positive or significant outcomes. Journals such as Perspectives on Psychological Science adopted related formats like Registered Replication Reports starting in 2013, where peer review occurs in stages: initial approval of methods before data collection, followed by review of results. This approach had been implemented in over 200 journals across disciplines by the 2020s, leading to higher-quality research and reduced selective reporting. Studies of these formats indicate they enhance replicability by prioritizing scientific merit over novelty.

Open science mandates have further transformed publishing by requiring data, code, and materials to be shared alongside publications, facilitating independent verification. The Transparency and Openness Promotion (TOP) Guidelines, developed in 2015 and widely adopted by 2016, provide a modular framework for journals to enforce levels of transparency across citation, data, code, research design, and analysis. In psychology, numerous high-impact journals have integrated TOP standards, promoting compliance through editorial policies and checklists. Adoption has grown steadily, with surveys indicating that by the late 2010s a significant portion of published articles included data availability statements, though full sharing remains variable due to barriers like privacy concerns. These guidelines directly target publication bias by making non-significant or null results verifiable and reusable.

Dedicated metascience journals have emerged to prioritize replication studies and methodological critiques, providing outlets for research that might otherwise face publication hurdles. Meta-Psychology, launched in 2020, exemplifies this by focusing exclusively on the methods, theories, and practices of psychological science, including empirical replications and analyses of replicability factors. Such venues encourage rigorous evaluation of the research ecosystem, with articles often employing Bayesian or meta-analytic approaches to assess replication success rates across fields.

Metadata tools like the Retraction Watch Database, established in 2010, track retractions, expressions of concern, and related issues in the scholarly literature, promoting accountability in publishing. By cataloging over 50,000 retractions and corrections by the mid-2020s, it enables researchers and journals to monitor patterns of misconduct and reliability, informing policy reforms such as enhanced post-publication review. The database has facilitated meta-analyses revealing spikes in retractions linked to the replication crisis, underscoring the need for proactive standards.

Enhancements in Statistical Methods

In response to the replication crisis, researchers have proposed several enhancements to statistical methods to reduce false positives and improve the reliability of findings. One prominent reform involves tightening the threshold for statistical significance. In 2017, a group of 72 researchers advocated redefining the default threshold for claims of new discoveries from the conventional 0.05 to 0.005, arguing that this change would substantially reduce the false positive rate while maintaining acceptable statistical power. Under the proposal, results with p < 0.005 would count as statistically significant, while those with 0.005 ≤ p < 0.05 would be described as "suggestive evidence," encouraging replication before novel results are accepted as definitive.

To address the widespread issue of underpowered studies, which often fail to detect true effects reliably, recommendations emphasize increasing sample sizes to achieve higher statistical power, typically targeting 90% power (1 - β = 0.9) rather than the common 50-60%. For a two-sided test of a single mean or of paired differences, the required sample size n can be calculated using the formula:

$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2 \cdot \sigma^2}{\delta^2}$$

where $Z_{1-\alpha/2}$ is the critical value for the desired significance level α (e.g., 1.96 for α = 0.05), $Z_{1-\beta}$ is the critical value for power (e.g., 1.28 for 90% power), σ is the standard deviation, and δ is the minimum detectable effect size. When two independent groups of equal size are compared, the required sample size per group is twice this value. Achieving 90% power often requires sample sizes approximately three times larger than those in typical underpowered studies, depending on the effect size and variability.

Misinterpretation of p-values has exacerbated the crisis, leading to overreliance on dichotomous significance testing. The American Statistical Association's 2016 statement urged a shift toward emphasizing estimation via confidence intervals, which provide a range of plausible effect sizes rather than a binary outcome. This approach promotes better understanding of uncertainty and effect magnitude, with education efforts highlighting that a p-value does not measure the probability that the studied hypothesis is true or the size of an effect. Confidence intervals, for instance, allow researchers to assess practical significance alongside statistical evidence, fostering more nuanced interpretations.

In fields like machine learning, where model overfitting can undermine replicability, cross-validation techniques have been adopted to evaluate model robustness. K-fold cross-validation, a standard method, partitions the dataset into k subsets (folds), training the model on k-1 folds and validating on the held-out fold, and repeats this process k times to compute an average performance metric. This resampling approach reduces variance in performance estimates and helps ensure models generalize beyond the training data, with k often set to 5 or 10 to balance bias against computation. Its application has become routine in predictive modeling to guard against spurious results that fail replication.

Bayesian statistical methods offer an alternative to null hypothesis significance testing (NHST) by incorporating prior knowledge and providing direct probability statements about parameters via posterior distributions. Instead of p-values, Bayesian approaches use credible intervals to quantify uncertainty in effect estimates, allowing for more flexible hypothesis testing through Bayes factors or model comparison. John Kruschke's 2014 work exemplifies this transition, demonstrating how Bayesian estimation with priors and posteriors yields richer inferences than NHST, particularly for small samples or complex models. This method has gained traction in psychology and the social sciences for its ability to accumulate evidence across studies without rigid thresholds.

Improvements in meta-analysis techniques aim to correct for publication bias, a key driver of non-replicable findings. The trim-and-fill method, introduced by Duval and Tweedie, addresses funnel plot asymmetry by iteratively "trimming" studies with overly large effects (presumed biased) and "filling" in hypothetical missing studies on the opposite side to estimate an unbiased overall effect. This nonparametric approach has been widely implemented in software like Comprehensive Meta-Analysis, providing adjusted effect sizes that better reflect the true literature. While not without limitations, such as sensitivity to the number of studies, it enhances the credibility of meta-analytic syntheses in fields prone to selective reporting.
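
The sample-size formula given earlier in this section can be sketched in a few lines of Python. This is an illustrative calculation under the normal approximation, not a substitute for dedicated power-analysis software; the function name `required_sample_size` and the `two_groups` flag are hypothetical names introduced here, and the doubling for two independent equal-sized groups is the standard textbook adjustment rather than something taken from a specific study.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(delta, sigma, alpha=0.05, power=0.90, two_groups=False):
    """Normal-approximation sample size for a two-sided test.

    delta: minimum detectable effect (same units as sigma)
    sigma: standard deviation of the outcome
    two_groups: if True, return the per-group n for two independent,
                equal-sized groups (twice the one-sample value).
    """
    if delta == 0 or sigma <= 0:
        raise ValueError("delta must be non-zero and sigma must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 1.28 for 90% power

    n = (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    if two_groups:
        n *= 2  # variance of a difference of two independent means
    return ceil(n)  # round up so the target power is not undershot


if __name__ == "__main__":
    # Detect half a standard deviation with 90% power at alpha = 0.05.
    print(required_sample_size(delta=0.5, sigma=1.0))                   # 43
    print(required_sample_size(delta=0.5, sigma=1.0, two_groups=True))  # 85
```

Because the required n scales with 1/δ², halving the detectable effect size quadruples the sample size, which is one reason studies of small effects are so often underpowered.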

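The k-fold cross-validation procedure described in this section can be illustrated with a minimal, dependency-free sketch. The helper names `k_fold_splits` and `cross_validated_mse` are hypothetical, and the "model" is deliberately trivial (it always predicts the mean of the training fold) so that the partitioning logic stays in focus; real analyses would typically rely on an established library instead.

```python
import random
from statistics import mean


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_mse(y, k=5, seed=0):
    """Average held-out mean squared error of a toy mean-only predictor."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(y), k, seed):
        prediction = mean(y[j] for j in train_idx)  # "fit" on k-1 folds
        fold_errors.append(mean((y[j] - prediction) ** 2 for j in test_idx))
    return mean(fold_errors)  # average over the k held-out folds


if __name__ == "__main__":
    data = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2, 2.3, 1.7, 2.0, 2.1]
    print(cross_validated_mse(data, k=5))
```

Every observation is held out exactly once, so the averaged error reflects performance on data the model did not see during fitting.
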
Replication Initiatives and Funding

Organized efforts to address the replication crisis have included large-scale collaborative projects aimed at systematically replicating key findings across multiple laboratories. The Many Labs series, initiated in the 2010s, exemplifies such initiatives; Many Labs 2, published in 2018, involved 125 samples from 36 countries and territories replicating 28 classic and contemporary psychological effects, achieving a replication success rate of approximately 50% based on statistical significance. Similarly, in economics and other social sciences, the Institute for Replication (I4R), established in the early 2020s, conducts reproductions and replications of influential studies to enhance credibility, including meta-analyses such as a 2024 study of 110 papers that found 85% computational reproducibility.

Funding from philanthropic and governmental sources has been crucial to sustaining these replication efforts. A philanthropic funder provided over $1.5 million to the Center for Open Science between 2011 and 2020 to support initiatives in open science, including projects like Many Labs that aligned scientific practices with values of openness and integrity. In 2023, the U.S. National Science Foundation (NSF) allocated more than $1.8 million across 10 awards to advance open science infrastructure, encompassing replication and reproducibility programs that encourage high-powered designs and transparency in the social and behavioral sciences.

Databases have emerged to catalog and track replication attempts, facilitating meta-analytic insights into replicability trends. The Reproducibility Project: Psychology, whose results were published in 2015, established an open database on the Open Science Framework containing replications of 100 studies from top journals, revealing an overall replication rate of 36% and enabling ongoing queries into factors like sample size and effect magnitude. Complementing this, Curate Science, a web-based platform introduced in 2015 and expanded in the 2020s, allows researchers, journals, and institutions to tag and evaluate the transparency and credibility of published studies, promoting community-driven assessments of reproducibility.

Guidelines have been developed to involve original authors in replication processes, enhancing the fidelity of attempts. The ARRIVE guidelines, updated in 2018 for reporting animal experiments, recommend that original teams provide detailed protocols, data, and materials to support independent replications, thereby reducing barriers to verification in biomedical fields.

Educational initiatives in post-secondary institutions have increasingly emphasized replication design to train future researchers. In the 2020s, universities have integrated courses and modules on the replication crisis into undergraduate and graduate curricula, such as workshops teaching preregistration and multi-lab coordination to foster robust study designs.

Big team science approaches have further advanced replication in specialized domains. The ManyBabies project, ongoing since 2017, unites over 100 laboratories worldwide to replicate and extend infant cognition studies using standardized protocols, quantifying variability in effects like infants' preference for prosocial agents and achieving high generalizability through diverse samples.

Broader Cultural and Policy Shifts

The replication crisis has prompted a shift toward methodological triangulation, which emphasizes integrating evidence from diverse approaches (such as observational data, experiments, and genetic studies) rather than relying solely on direct replication to validate findings. This strategy, advocated by Munafò and Davey Smith, helps mitigate biases inherent in single methods and builds more robust conclusions by cross-validating results across independent lines of inquiry.

In parallel, the crisis has encouraged viewing scientific progress through the lens of complex adaptive systems, where knowledge evolves dynamically as an interconnected network of theories and practices that self-correct over time. Failed replications serve as signals for revising or refining theories, fostering adaptability rather than treating non-replications as mere failures, as explored in recent analyses linking the crisis to systemic resilience in science.

The open science movement has accelerated these changes, with the Transparency and Openness Promotion (TOP) Guidelines undergoing significant updates in 2024 to incorporate verification practices and study types that enhance reproducibility across disciplines. Complementing this, the FAIR principles, which ensure that data and materials are Findable, Accessible, Interoperable, and Reusable, have become foundational for sharing resources, promoting collaborative validation beyond isolated labs.

On the policy front, the White House Office of Science and Technology Policy (OSTP) issued a 2025 memorandum establishing "Gold Standard Science" requirements, mandating standards for federally funded research to ensure transparency and rigor in grant allocations; as of November 2025, the NSF has integrated these into grant review processes, requiring replication plans for high-risk projects.

Recent developments extend these principles to artificial intelligence, exemplified by NeurIPS 2025's updated reproducibility checklists that require detailed reporting of computational environments and data handling to address unique challenges in AI validation. Meta-analyses in 2025 indicate tangible progress, with psychological studies showing stronger evidential support through larger sample sizes (up to 100% increases in some subfields) and fewer questionable p-values, reflecting improved reliability post-crisis.

References
