“Only when the tide goes out do you discover who has been swimming naked.” Warren Buffett (Value Investor).
Francis, Tanzman, and Matthews (2014) examined the credibility of psychological articles published in the prestigious journal Science. They focused on articles that contained four or more studies because (a) the statistical test that they used has insufficient power for smaller sets of studies and (b) the authors assumed that it is only meaningful to focus on studies that are published within a single article.
They found 26 articles published between 2006 and 2012. Eight articles could not be analyzed with their method.
The remaining 18 articles had a 100% success rate. That is, they never reported that a statistical hypothesis test failed to produce a significant result. Francis et al. computed the probability of this outcome for each article. When the probability was less than 10%, they made the recommendation to be skeptical about the validity of the theoretical claims.
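As a rough sketch of this kind of calculation, the probability that every study in an article produces a significant result can be obtained by multiplying the power estimates of the individual studies, assuming the studies are independent. The function name and the power values below are hypothetical and are not taken from Francis et al.

```python
from math import prod


def prob_all_significant(powers):
    """Probability that every independent study yields a significant result."""
    return prod(powers)


# Hypothetical article: four studies, each with an estimated power of 50%.
hypothetical_powers = [0.5, 0.5, 0.5, 0.5]
p_all = prob_all_significant(hypothetical_powers)

print(f"P(all significant) = {p_all:.4f}")
print("Below the 10% criterion:", p_all < 0.10)
```

With these made-up values, the probability of four significant results in four attempts is 0.5 × 0.5 × 0.5 × 0.5 = 0.0625, which falls below the 10% criterion.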
Evidence of bias, however, does not necessarily invalidate the theoretical claims. For example, a researcher may conduct five studies with 80% power. As expected, one of the five studies produces a non-significant result. It is rational to assume that this finding is a type-II error, as type-II errors should occur in 1 out of 5 studies. The researcher decides not to report the non-significant result. In this case, there is bias, and the average effect size across the four significant studies is slightly inflated, but the empirical results do support the theoretical claims.
If, however, the null-hypothesis is true and a researcher conducts many statistical tests but reports only the significant results, demonstrating an excess of significant results would also reveal that the reported results provide no empirical support for the theoretical claims in the article.
The problem with Francis et al.’s approach is that it does not clearly distinguish between these two scenarios.
The R-Index addresses this problem. It provides quantitative information about the replicability of a set of studies. Like Francis et al., the R-Index is based on the observed power of individual statistical tests (see Schimmack, 2012, for details), but the next steps are different. Francis et al. multiply observed power estimates. This approach is only meaningful for sets of studies that reported only significant results. The R-Index can be computed for studies that reported significant and non-significant results. Here are the steps:
Compute median observed power for all theoretically important statistical tests from a single study; then compute the median of these medians. This median estimates the median true power of a set of studies.
Compute the rate of significant results for the same set of statistical tests; then average the rates across the same set of studies. This average estimates the reported success rate for a set of studies.
Median observed power and average success rate are both estimates of true power or replicability of a set of studies. Without bias, these two estimates should converge as the number of studies increases.
If the success rate is higher than median observed power, it suggests that the reported results provide an inflated picture of the true effect size and replicability of a phenomenon.
The R-Index uses the difference between success rate and median observed power to correct the inflated estimate of replicability by subtracting the inflation rate (success rate – median observed power) from the median observed power.
R-Index = Median Observed Power – (Success rate – Median Observed Power)
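The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's own implementation: the function name, the input layout (one list of tests per study), and the example values are all assumptions, and the observed power of each test is taken as given.

```python
from statistics import mean, median


def r_index(observed_power, significant):
    """Return (median power, success rate, inflation, R-Index).

    observed_power: one list per study, holding the observed power of each
        theoretically important test in that study (values between 0 and 1).
    significant: one list per study, holding True/False for the same tests.
    """
    # Step 1: median observed power within each study, then across studies.
    median_power = median([median(study) for study in observed_power])
    # Step 2: rate of significant results within each study, then averaged.
    success_rate = mean([mean([1.0 if s else 0.0 for s in study])
                         for study in significant])
    # Step 3: inflation is the gap between the two estimates.
    inflation = success_rate - median_power
    # Step 4: subtract the inflation from median observed power.
    return median_power, success_rate, inflation, median_power - inflation


# Hypothetical article with three studies in which every test was significant.
power = [[0.55, 0.65], [0.70], [0.60, 0.62, 0.80]]
sig = [[True, True], [True], [True, True, True]]
mp, sr, infl, ri = r_index(power, sig)
print(f"median power {mp:.2f}, success rate {sr:.2f}, "
      f"inflation {infl:.2f}, R-Index {ri:.2f}")
```

With these made-up inputs, median observed power is 0.62 and the success rate is 1.00, so the inflation is 0.38 and the R-Index is 0.62 - 0.38 = 0.24.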
The R-Index is a quantitative index: higher values suggest a higher probability that an exact replication study will be successful, and the index avoids simple dichotomous decisions. Nevertheless, it can be useful to provide some broad categories that distinguish different levels of replicability.
An R-Index of more than 80% is consistent with true power of 80%, even when some results are omitted. I chose 80% as a boundary because Jacob Cohen advised researchers that they should plan studies with 80% power. Many undergraduates learn this basic fact about power and falsely assume that researchers are following a rule that is mentioned in introductory statistics.
An R-Index between 50% and 80% suggests that the reported results support an empirical phenomenon, but that power was less than ideal. Most important, this also implies that these studies make it difficult to distinguish non-significant results and type-II errors. For example, two tests with 50% power are likely to produce one significant result and one non-significant result. Researchers are tempted to interpret the significant one and to ignore the non-significant one. However, in a replication study the opposite pattern is just as likely to occur.
An R-Index between 20% and 50% raises doubts about the empirical support for the conclusions. The reason is that an R-Index of 22% can be obtained when the null-hypothesis is true and all non-significant results are omitted. In this case, median observed power is inflated from 5% to 61%. With a 100% success rate, the inflation rate is 39%, and the R-Index is 22% (61% – 39% = 22%).
An R-Index below 20% suggests that researchers used questionable research methods (importantly, these methods are questionable but widely accepted in many research communities and not considered to be ethical misconduct) to obtain results that are statistically significant (e.g., systematically deleting outliers until p < .05).
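The 22% benchmark for the null-hypothesis scenario described above can be checked with a short calculation. The sketch below assumes two-sided z-tests at alpha = .05 and approximates observed power as Φ(|z| – 1.96), ignoring the negligible opposite tail. Both are assumptions made for illustration, and the variable names are hypothetical.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution
alpha = 0.05
z_crit = norm.inv_cdf(1 - alpha / 2)  # two-sided critical value, about 1.96

# Under the null hypothesis, |z| exceeds z_crit with probability alpha.
# The median of the significant |z| values is the point that |z| exceeds
# with probability alpha / 2, i.e. an upper-tail probability of alpha / 4.
z_median_significant = norm.inv_cdf(1 - alpha / 4)  # about 2.24

# Observed power implied by that median z-value (assumed approximation).
observed_power = norm.cdf(z_median_significant - z_crit)

success_rate = 1.0  # all non-significant results were omitted
inflation = success_rate - observed_power
r_index = observed_power - inflation

print(f"median observed power: {observed_power:.0%}")
print(f"inflation: {inflation:.0%}")
print(f"R-Index: {r_index:.0%}")
```

Under these assumptions the calculation gives a median observed power of about 61%, an inflation of about 39%, and an R-Index of about 22%, matching the figures in the text.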
Table 1 lists Francis et al.’s results and the R-Index. Studies are arranged in order of the R-Index. Only 1 study is in the exemplary category with an R-Index greater than 80%.
4 studies have an R-Index between 50% and 80%.
8 studies have an R-Index in the range between 20% and 50%.
5 studies have an R-Index below 20%.
There are good reasons why researchers should not conduct studies with less than 50% power. However, 13 of the 18 studies have an R-Index below 50%, which suggests that the true power in these studies was less than 50%.
The R-Index provides an alternative approach to Francis’s Test of Excess Significance (TES) to examine the credibility of a set of published studies. Whereas Francis concluded that 15 out of 18 articles show bias that invalidates the theoretical claims of the original article, the R-Index provides quantitative information about the replicability of reported results.
The R-Index does not provide a simple answer about the validity of published findings, but in many cases the R-Index raises concerns about the strength of the empirical evidence and reveals that editorial decisions failed to take replicability into account.
The R-Index provides a simple tool for editors and reviewers to increase the credibility of published results and to increase the replicability of published findings. Editors and reviewers can compute, or ask authors who submit manuscripts to compute, the R-Index and use this information in their editorial decision. There is no clear criterion value, but a higher R-Index is better and moderate R-values should be justified by other criteria (e.g., uniqueness of sample).
The R-Index can be used to examine whether editors continue to accept articles with low replicability or are committed to the publication of empirical results that are credible and replicable.