A draft of this manuscript was posted in December 2014 as a PDF file on http://www.r-index.org. I have received several emails about the draft. This revised manuscript does not include a comparison of different bias tests. The main aim is to provide an introduction to the R-Index and to correct some misconceptions about the R-Index that have become apparent over the past year.

Please cite this post as: Schimmack, U. (2016). The Replicability-Index: Quantifying Statistical Research Integrity. https://wordpress.com/post/replication-index.wordpress.com/920

Author’s Note. I would like to thank Gregory Francis, Julia McNeil, Amy Muise, Michelle Martel, Elizabeth Page-Gould, Geoffrey MacDonald, Brent Donnellan, David Funder, Michael Inzlicht, and the Social-Personality Research Interest Group at the University of Toronto for valuable discussions, suggestions, and encouragement.

Abstract

Researchers are competing for positions, grant money, and status. In this competition, researchers can gain an unfair advantage by using questionable research practices (QRPs) that inflate effect sizes and increase the chances of obtaining stunning and statistically significant results. To ensure fair competition that benefits the greater good, it is necessary to detect and discourage the use of QRPs. To this end, I introduce a doping test for science: the replicability index (R-Index). The R-Index is a quantitative measure of research integrity that can be used to evaluate the statistical replicability of a set of studies (e.g., journals, individual researchers’ publications). A comparison of the R-Index for the Journal of Abnormal and Social Psychology in 1960 and the Attitudes and Social Cognition section of the Journal of Personality and Social Psychology in 2011 shows an increase in the use of QRPs. Like doping tests in sports, the availability of a scientific doping test should deter researchers from engaging in practices that advance their careers at the expense of everybody else. Demonstrating replicability should become an important criterion of research excellence that can be used by funding agencies and other stakeholders to allocate resources to research that advances science.

Keywords: Power, Publication Bias, Significance, Credibility, Sample Size, Questionable Research Methods, Replicability, Statistical Methods

**INTRODUCTION**

It has been known for decades that published results are likely to be biased in favor of authors’ theoretical inclinations (Sterling, 1959). The strongest scientific evidence for publication bias stems from a comparison of the rate of significant results in psychological journals and the statistical power of published studies. Statistical power is the long-run probability of obtaining a significant result when the null-hypothesis is false (Cohen, 1988). The typical statistical power of psychological studies has been estimated to be around 60% (Sterling, Rosenbaum, & Weinkam, 1995). However, the rate of significant results in psychological journals is over 90% (Sterling, 1959; Sterling et al., 1995). The discrepancy between these estimates of power reveals that published studies are biased and that some findings may be simply false positive results, whereas other studies report inflated effect size estimates.

It has been overlooked that estimates of statistical power are also inflated by the use of questionable research methods. Thus, the commonly reported estimate that typical power in psychological studies is 60% is an inflated estimate of true power (Schimmack, 2012). If the actual power is less than 50%, a typical study in psychology is more likely to fail (produce a false negative result) than to succeed (reject a false null-hypothesis). Conducting such low-powered studies is extremely wasteful. Moreover, few researchers have the resources to discard 50% of their empirical output. As a result, the incentive for the use of questionable research practices that inflate effect sizes is strong.

Not surprisingly, the use of questionable research practices is common (John et al., 2012). More than 50% of anonymous respondents reported selective reporting of dependent variables, dropping experimental conditions, or not reporting studies that did not support theoretical predictions. The widespread use of QRPs undermines the fundamental assumption of science that scientific theories have been subjected to rigorous empirical tests. In violation of this assumption, QRPs allow researchers to find empirical support for hypotheses even when these hypotheses are false.

The most dramatic example was Bem’s (2011) infamous evidence of time-reversed causality (e.g., studying after a test can improve test performance). Although Bem reported nine successful studies, subsequent studies failed to replicate this finding and raised concerns about the integrity of Bem’s studies (Schimmack, 2012). One possible source of false positive results is that a desirable outcome occurs by chance and a researcher mistakes this fluke finding for evidence that a prediction was true. However, a fluke finding is unlikely to repeat itself in a series of studies. Statistically, it is highly improbable that Bem’s results are simple type-I errors because the chance of obtaining 9 type-I errors in 10 studies, each with a probability of .05, is less than 1 out of 53 billion (1 / 53,610,771,049). This probability is much smaller than the probability of winning the lottery (1 / 14 million). It is also unlikely that Bem simply failed to report studies with non-significant results. A type-I error rate of 5% implies that a significant result will occur, on average, once in every 20 studies, so he would have needed 180 studies (9 x 20) to obtain 9 significant results. With sample sizes of about 100 participants in reported studies, this would imply that Bem tested 18,000 participants. It is therefore reasonable to conclude that Bem used questionable research methods to produce his implausible and improbable results.
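The two back-of-the-envelope calculations in this paragraph can be reproduced with a few lines of Python. This is only a sketch: the helper name `binom_tail` is made up for illustration, and it assumes that "9 out of 10" means 9 or more significant results in 10 independent studies.

```python
from math import comb


def binom_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


alpha = 0.05  # type-I error rate per study

# Scenario 1: all significant results are chance findings.
p_flukes = binom_tail(9, 10, alpha)
print(f"P(9 or more type-I errors in 10 studies) = {p_flukes:.2e}")
print(f"roughly 1 in {1 / p_flukes:,.0f}")

# Scenario 2: a file drawer of unreported studies with no true effect.
studies_needed = round(9 / alpha)       # one significant result per 20 studies
participants = studies_needed * 100     # about 100 participants per study
print(f"{studies_needed} studies, {participants:,} participants")
```

The exact tail probability comes out at roughly 1 in 53.6 billion, in line with the figure quoted above; small differences in the last digits depend on how the probability is computed and rounded.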

Although the publication of Bem’s article in a flagship journal of psychology was a major embarrassment for psychologists, it provided an opportunity to highlight fundamental problems in the way psychologists produced and published empirical results. There have been many valuable suggestions and initiatives to increase the integrity of psychological science (e.g., Asendorpf et al., 2012). In this manuscript, I propose another solution to the problem of QRPs: I suggest that scientific organizations ban the use of questionable research practices, just like sports organizations ban the use of performance enhancing substances. At present, scientific organizations only ban and punish outright manipulation of original data. However, excessive use of QRPs can produce fake results without fake data. As the ultimate products of an empirical science are the results of statistical analyses, it does not matter whether fake results were obtained with fake data or with questionable statistical analyses. The use of QRPs therefore violates the code of ethics in science that a researcher should base conclusions on an objective and unbiased analysis of empirical data. Dropping studies or dependent variables that do not support a hypothesis violates this code of scientific integrity.

Unfortunately, the world of professional sports also shows that doping bans are ineffective unless they are enforced by regular doping tests. Thus, a ban of questionable research practices needs to be accompanied by objective tests that can reveal the use of questionable research practices. The main purpose of this article is to introduce a statistical test that reveals the use of questionable research practices and can therefore be used to enforce a ban of such practices. This quantitative index of research integrity can be used by readers, editors, and funding agencies to ensure that only rigorous empirical studies are published or funded.

**The Replicability-index**

The R-Index is based on power theory (Cohen, 1988). Statistical power is defined as the long-run probability of obtaining statistically significant results in a series of studies (see Schimmack, 2016, for more details). A study with 50% power is expected to produce 50 significant results and 50 non-significant results in 100 attempts. In the short run, the actual number of significant results can underestimate or overestimate the true power of a study, but in an unbiased set of studies, the long-run percentage of significant results provides an unbiased estimate of average power (see Schimmack, 2016, for details on meta-analysis of power). Importantly, in smaller sets of studies underestimation is as likely as overestimation. However, Sterling (1959) was the first to observe that scientific journals report more significant results than the actual power of studies justifies. In other words, a simple count of significant results provides an inflated estimate of power.

A simple count of the percentage of significant results in journals would suggest that psychological studies have over 90% statistical power to reject the null-hypothesis. However, studies of the typical power in psychology based on sample sizes and a moderate effect size suggest that the typical power of statistical tests in psychology is around 60% (Sedlmeier & Gigerenzer, 1989; see also Schimmack, 2016).

The discrepancy between these estimates of power reveals a systematic bias because these estimates should converge in the long run. Discrepancies between the two estimates of power can be tested for significance. Schimmack (2012) developed the incredibility index to examine whether a set of studies reported too many significant results. For example, the binomial probability that 10 studies with 60% power produce at least 90% significant results (9 or 10 significant results) is p = .046. The incredibility index uses 1 – p, so that higher numbers show that the result is incredible because there should have been more non-significant results. In this example, the incredibility index is 1 – .046 = .954. This result suggests that the reported results were selected to provide stronger evidence for a hypothesis than the full set of results would have provided; in other words, questionable research practices were used to produce the reported results.
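The example can be checked with a short Python sketch. The function name `incredibility_index` and its signature are illustrative assumptions, not code from Schimmack (2012); the sketch treats the studies as independent with a common power and uses the upper tail of the binomial distribution.

```python
from math import comb


def incredibility_index(n_studies: int, n_significant: int, power: float) -> float:
    """1 - P(at least n_significant significant results | n_studies, power)."""
    p_value = sum(
        comb(n_studies, i) * power**i * (1 - power) ** (n_studies - i)
        for i in range(n_significant, n_studies + 1)
    )
    return 1 - p_value


ic = incredibility_index(n_studies=10, n_significant=9, power=0.60)
print(f"p = {1 - ic:.3f}, incredibility index = {ic:.3f}")
```

With 10 studies, 9 significant results, and 60% power, the tail probability is about .046 and the index is about .954, matching the values in the text.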

Some critics have argued that the incredibility index is flawed because it relies on observed effect sizes to estimate power. These power estimates are called observed power or post-hoc power, and statisticians have warned against the computation of observed power (Hoenig & Heisey, 2001). However, this objection is flawed because Hoenig and Heisey (2001) only examined the usefulness of computing observed power for a single statistical test (Schimmack, 2016). The problem of observed power estimates for a single statistical test is that the confidence interval around the estimate is so large that it often covers the full range of possible estimates from 0 (or more accurately, the alpha criterion of significance) to 1 (Schimmack, 2015). This estimate is not fundamentally flawed, but it is uninformative. However, in a meta-analysis of power estimates, sampling error decreases, the confidence interval around the power estimate shrinks, and the power estimate becomes more accurate and useful. Thus, a meta-analysis of studies can be used to estimate power and to compare the success rate (percentage of significant results) to the power estimate.

The incredibility index computed a power estimate for each study and then averaged these power estimates to obtain an estimate of average observed power. A binomial probability test was then used to compute the probability of obtaining so few non-significant results in the reported set of studies.

The R-Index builds on the incredibility index. One problem of the incredibility index is that probabilities provide no information about effect sizes. An incredibility index of 99% can be obtained with 10 studies that produced 10 significant results with an average observed power of 60% or with 100 studies that produced 100% significant results with average observed power of 95%. Evidently, average observed power of 95% is very high, and the fact that one would expect only 95 significant results while 100 significant results were reported suggests only a small bias. In contrast, the discrepancy between 60% observed power and 100% reported results is large. The fact that the same incredibility index can be obtained for different amounts of bias is nothing special. Probabilities are always a function of the magnitude of an effect (discrepancy) and the amount of sampling error, which is inversely related to sample size. For this reason, it is important to complement information about probabilities with effect size measures. For the incredibility index, the effect size is the difference between the success rate and the observed power estimate. In this example, the effect sizes are 100 – 60 = 40 vs. 100 – 95 = 5. This effect size is called the inflation rate, because it is expected that the success rate exceeds observed power.
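A small Python sketch makes the comparison concrete. The two scenarios are the ones described above; the dictionary labels and variable names are illustrative assumptions. Because every study in both scenarios is significant, the probability of the reported outcome is simply the power raised to the number of studies.

```python
# Each scenario: (number of studies, average observed power); all results significant.
scenarios = {
    "10 studies, 60% power": (10, 0.60),
    "100 studies, 95% power": (100, 0.95),
}

results = {}
for label, (n_studies, power) in scenarios.items():
    p_all_significant = power**n_studies   # probability of 100% success
    incredibility = 1 - p_all_significant  # incredibility index
    inflation = 1.0 - power                # success rate minus observed power
    results[label] = (incredibility, inflation)
    print(f"{label}: incredibility = {incredibility:.3f}, inflation = {inflation:.0%}")
```

Both scenarios yield an incredibility index above 99%, yet the inflation rates differ by a factor of eight, which is why the probability alone is not a sufficient summary.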

In large sets of studies (e.g., an entire volume of a journal), the incredibility index is useless because it will merely reveal the well-known presence of publication bias and QRPs, and its p-value is influenced by the number of tests in a journal. A journal with more articles and statistical tests would have a higher incredibility index (a lower p-value) even if the studies, on average, have more power and are less biased. The inflation rate provides a better measure of the integrity of reported results in a journal.

Another problem of the incredibility index is that observed power is not normally or symmetrically distributed. As a result, the average observed power estimate is a biased estimate of the average true power (Yuan & Maxwell, 2005; Schimmack, 2015). For example, when the true power is close to the upper limit of 100%, average observed power underestimates true power because underestimates can be much larger than overestimates. To overcome this problem, the R-Index uses the median to estimate true power. The median is unbiased because in each study it is equally likely that the observed effect size underestimates or overestimates the true effect size. Thus, it is equally likely that a power estimate underestimates or overestimates true power. While the amount of underestimation and overestimation is not symmetrically distributed, the direction of bias is known to be equally distributed on both sides of true power. Simulations confirm that the median provides an unbiased estimate of true power even when power is high.
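The following Python sketch illustrates the argument with a simple simulation. It is not the simulation referred to in the text. It assumes two-sided z-tests with alpha = .05, an unbiased set of studies, and a hypothetical noncentrality value of 3.24 (chosen so that true power is about 90%); all function and variable names are illustrative.

```python
import random
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # critical value for two-sided alpha = .05


def power_from_z(z: float) -> float:
    """Power of a two-sided z-test whose noncentrality parameter equals z."""
    return STD_NORMAL.cdf(z - Z_CRIT) + STD_NORMAL.cdf(-z - Z_CRIT)


def simulate_observed_power(noncentrality: float, n_studies: int, seed: int = 1) -> list[float]:
    """Draw one observed z-score per study and convert it to observed power."""
    rng = random.Random(seed)
    return [power_from_z(rng.gauss(noncentrality, 1.0)) for _ in range(n_studies)]


true_ncp = 3.24  # hypothetical value; corresponds to roughly 90% power
true_power = power_from_z(true_ncp)
observed = simulate_observed_power(true_ncp, n_studies=20_000)

print(f"true power            = {true_power:.3f}")
print(f"mean observed power   = {mean(observed):.3f}")
print(f"median observed power = {median(observed):.3f}")
```

Under these assumptions the median of the observed power values lands close to the true power, whereas the mean falls clearly below it, because observed power cannot exceed 1 but can fall far below the true value.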

Thus, the formula for the inflation in a set of studies is

Inflation = Percentage of Significant Results – Median (Estimated Power)

Median observed power is an unbiased estimate of power in an unbiased set of studies. However, if the set of studies is affected by publication bias, median observed power is inflated. A comparison with the success rate can still detect publication bias because the success rate increases faster than median observed power. For example, if true power is 50%, but only significant results are reported (100% success rate), median observed power increases to 75% (Schimmack, 2015).
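The 75% figure can be derived analytically under a simplifying assumption. The Python sketch below models each study as a z-test, ignores significant results in the wrong direction (a one-tail approximation), and assumes that only significant results are reported; the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def median_observed_power_after_selection(true_power: float, alpha: float = 0.05) -> float:
    """Median observed power when only significant results are reported."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Noncentrality that yields the requested true power (one-tail approximation).
    ncp = z_crit + STD_NORMAL.inv_cdf(true_power)
    # Significant results are the top `true_power` share of all z-scores, so
    # their median sits at the (1 - true_power / 2) quantile of the full distribution.
    z_median = ncp + STD_NORMAL.inv_cdf(1 - true_power / 2)
    return STD_NORMAL.cdf(z_median - z_crit)


for true_power in (0.30, 0.50, 0.94):
    mop = median_observed_power_after_selection(true_power)
    inflation = 1.0 - mop  # success rate is 100% after selection
    print(f"true power {true_power:.0%}: median observed power {mop:.1%}, inflation {inflation:.1%}")
```

For a true power of 50% the function returns 75%, so the success rate of 100% exceeds median observed power by 25 percentage points even though median observed power is itself inflated.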

The amount of inflation is inversely related to the actual power of a set of studies. When the set of studies includes only significant results (100% success rate), inflation is necessarily greater than 0 because power is never 100%. However, median observed power of 95% implies only a small amount of inflation (5%) and the actual power is close to the median observed power (94%). In contrast, median observed power of 70% implies a large amount of bias, and true power is only 30%. As a result, the true power of a set of studies increases with the median observed power and decreases with the amount of inflation. The R-Index combines these two indicators of power by subtracting the inflation rate from median observed power.

R-Index = Median Observed Power – Inflation

As Inflation = Success Rate – Median Observed Power, the R-Index can also be expressed as a function of Success Rate and Median Observed Power

R-Index = Median Observed Power – (Success Rate – Median Observed Power)

or

R-Index = 2 * Median Observed Power – Success Rate

The R-Index can range from 0 to 100%. A value of 0 is obtained when median observed power is 50% and the success rate is 100%. However, this event should not occur with real data because significant results have a minimum observed power of 50%. To obtain a median of 50% observed power, all studies would have to have 50% observed power, but sampling error should produce variation in observed power estimates. A fixed value or restricted variance is another indicator of bias (Schimmack, 2015). A more realistic lower limit for the R-Index is a value of 22%. This value is obtained when the null-hypothesis is true (the population effect size is zero) and only significant results are reported (success rate = 100%). In this case, median observed power is 61%, the inflation rate is 39%, and the R-Index is 61 – 39 = 22. The maximum of 100% would be obtained if studies have practically 100% power and the success rate is 100%.
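The formula and the lower limit of 22% can be reproduced with the following Python sketch. It assumes two-sided z-tests with alpha = .05 and a true null-hypothesis; the function names are illustrative and not taken from any published R-Index software.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def r_index(median_observed_power: float, success_rate: float) -> float:
    """R-Index = median observed power - inflation = 2 * MOP - success rate."""
    return 2 * median_observed_power - success_rate


def null_median_observed_power(alpha: float = 0.05) -> float:
    """Median observed power of significant results when the null is true."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Under the null, P(|z| > z_crit) = alpha. The median of the significant
    # |z| values is the point m with P(|z| > m) = alpha / 2.
    z_median = STD_NORMAL.inv_cdf(1 - alpha / 4)
    return STD_NORMAL.cdf(z_median - z_crit) + STD_NORMAL.cdf(-z_median - z_crit)


null_mop = null_median_observed_power()
floor = r_index(null_mop, success_rate=1.0)
print(f"median observed power under the null = {null_mop:.2f}")
print(f"R-Index lower limit                  = {floor:.2f}")
```

The same `r_index` function returns 0 for a median observed power of 50% with a 100% success rate, and 1 when both values are 100%, which are the two theoretical end points described above.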

It is important to note that the R-Index is not an estimate of power. It is monotonically related to power, but an R-Index of 22% does not imply that a set of studies has 22% power. As noted earlier, an R-Index of 22% is obtained when the null-hypothesis is true which produces only 5% significant results if the significance criterion is 5%. When power is less than 50%, the R-Index is conservative and the Index values are higher than true power. When power is more than 50%, the R-Index values are lower than true power. However, for comparisons of journals, authors, etc., rankings with the R-Index will reflect the ranking in terms of true power. Moreover, an R-Index below 50% implies that true power is less than 50%, which can be considered inadequate power for most research questions.

**Example 1: Bem’s Feeling the Future**

The first example uses Bem’s (2011) article to demonstrate the usefulness of computing an R-Index.

N | d | Obs.Pow | Success |
---|---|---|---|
100 | 0.25 | 0.79 | 1 |
150 | 0.20 | 0.78 | 1 |
100 | 0.26 | 0.82 | 1 |
100 | 0.23 | 0.73 | 1 |
100 | 0.22 | 0.70 | 1 |
150 | 0.15 | 0.57 | 1 |
150 | 0.14 | 0.52 | 1 |
200 | 0.09 | 0.35 | 0 |
100 | 0.19 | 0.59 | 1 |
50 | 0.42 | 0.88 | 1 |

The median observed power is 71%. The success rate is 90%. Accordingly, the inflation rate is 90 – 71 = 19%. The R-Index is 71 – 19 = 52. An R-Index of 52 is higher than the 22% that is expected from a set of studies without a real effect in which only significant results are reported. However, it is not clear how questionable research practices influence the R-Index. Thus, the R-Index should not be used to infer from values greater than 22% that an effect is present. The R-Index does suggest that Bem’s studies did not have 80% power as he suggested in the planning of his studies. It also suggests that the nominal median effect size of d = .21 is inflated and that future studies should expect a lower effect size. These predictions were confirmed in a set of replication studies (Galak et al., 2013). In short, an R-Index of about 50% raises concerns about the robustness of empirical results and shows that impressive success rates of 90% or more do not necessarily provide strong evidence for the existence of an effect.
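The calculation for this example can be reproduced from the table with a few lines of Python. The lists below copy the observed power and success columns; the variable names are illustrative. Because the sketch keeps the unrounded median (71.5%), it yields an R-Index of about 53 rather than the 52 obtained in the text after rounding the median to 71%.

```python
from statistics import median

# Observed power and success (1 = significant) for the ten tests in the table.
observed_power = [0.79, 0.78, 0.82, 0.73, 0.70, 0.57, 0.52, 0.35, 0.59, 0.88]
success = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]

median_power = median(observed_power)
success_rate = sum(success) / len(success)
inflation = success_rate - median_power
r_index = median_power - inflation  # equivalently 2 * median_power - success_rate

print(f"median observed power = {median_power:.3f}")
print(f"success rate          = {success_rate:.2f}")
print(f"inflation             = {inflation:.3f}")
print(f"R-Index               = {r_index:.2f}")
```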

**Example 2: The Many Labs Project**

In the wake of the replicability crisis, the Open-Science Foundation has started to examine the replicability of psychological research with replication studies. These replication studies reproduce the original studies as closely as possible. The first results emerged from the Many Labs project. In this project, an international team of researchers replicated 13 psychological studies in several laboratories. The main finding of this project was that 10 of the 13 studies were successfully replicated in several labs. The success rate is 77%. I computed the R-Index for the original studies. One study provided insufficient information to compute observed power, leaving 12 studies to be analyzed. The success rate for the original studies was 100% (one study had a marginally significant effect, p < .10, two-tailed). Median observed power was 86%. The inflation rate is 100 – 86 = 14, and the R-Index is 86 – 14 = 72. An R-Index of 72 suggests that these studies have a high probability of replicating. Of course, a higher R-Index would be even better.

It is important to note that success in the Many Labs project was defined as a significant result in a meta-analysis across all labs with over 3,000 participants. The success rate would be lower if replication success were defined as a significant result in an exact replication study with the same statistical power (sample size) as the original study. Nevertheless, many of the results were replicated even with smaller sample sizes because the original studies examined large effects, had large samples, or both.

**Conclusion**

It has been widely recognized that questionable research practices are threatening the foundations of science. This manuscript introduces the R-Index as a statistical tool to assess the replicability of published results. Results are replicable if the original studies had sufficient power to produce significant results. A study with 80% power is likely to produce a significant result in 80% of all attempts without the need for questionable research practices. In contrast, a study with 20% power can only produce significant results with the help of inflated effect sizes. In 20% of all attempts, luck alone will be sufficient to inflate effect sizes. In all other cases, researchers have to hide failed attempts in file drawers or use questionable statistical practices to inflate effect sizes. The R-Index reveals the presence of questionable research practices when observed power is lower than the rate of significant results. The R-Index has two components. It increases with observed power because studies with high power are more likely to replicate. The second component is the discrepancy between the percentage of significant results and observed power. The greater the discrepancy, the more questionable research practices have contributed to success and the more observed power overestimates true power.

**References**

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425. doi: 10.1037/a0021524

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2013). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology.

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. American Statistician, 55(1), 19-24. doi: 10.1198/000313001300339897

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532. doi: 10.1177/0956797611430953

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. doi: 10.1037/a0029487

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309-316.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice-versa. American Statistician, 49(1), 108-112.

Thanks for posting this statement, Uli. My comment: an important part of an applied method is specifying input. The examples here are from research that had clear critical tests (Bem) or outright specified the critical test as part of replication (ML). Theoretically, researchers principally face motivation to p-hack and selectively publish based on the critical test; the role of auxiliary tests in this is harder to quantify in a standard way (e.g., a manipulation check might also need to be significant to be published, while a DV reflecting an alternate explanation might actually need to be decisively non-significant). I’d therefore think that sampling only critical tests is an essential part of the method, or at the very least, the method of sampling all tests should be validated as a proxy.


The results of a study based on a sample of results should only be generalized to the population that was used to obtain a sample.

If the focus is on a particular hypothesis, only results relevant to this hypothesis should be included in the sample (traditional meta-analysis).

If the focus is on all focal hypothesis tests in a journal, only focal hypothesis tests should be sampled.

If the focus is on all significance tests, the sample should represent all significance tests.

I think you are arguing that nobody cares about all significance tests and that journals should not be evaluated based on the power of these tests. As a result, an index based on all tests is useless and only focal hypothesis tests should be used.

My counterargument is that it is not clear why researchers report statistical results of non-focal hypothesis tests if these are not relevant or if these are underpowered. Moreover, it is relevant when researchers claim that gender was or was not a moderator based on a statistical test of gender.

So, we can agree that different results could be obtained if rankings were based on focal hypothesis tests.

I know you are working on a compilation of focal hypothesis tests. How many journals are you covering and how many articles per journal and year are you using?

I am looking forward to a comparison of your analyses of focal tests and my results based on all statistical tests.
