Category Archives: r-index

The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)

tide_naked

“Only when the tide goes out do you discover who has been swimming naked.”  Warren Buffet (Value Investor).

 

 

Francis, Tanzman, and Matthews (2014) examined the credibility of psychological articles published in the prestigious journal Science. They focused on articles that contained four or more articles because (a) the statistical test that they has insufficient power for smaller sets of studies and (b) the authors assume that it is only meaningful to focus on studies that are published within a single article.

They found 26 articles published between 2006 and 2012. Eight articles could not be analyzed with their method.

The remaining 18 articles had a 100% success rate. That is, they never reported that a statistical hypothesis test failed to produce a significant result. Francis et al. computed the probability of this outcome for each article. When the probability was less than 10%, they made the recommendation to be skeptical about the validity of the theoretical claims.

For example, a researcher may conduct five studies with 80% power. As expected, one of the five studies produced a non-significant result. It is rational to assume that this finding is a type-II error as the Type-II error should occur in 1 out of 5 studies. The scientist decides not to include the non-significant result. In this case, there is bias, the average effect size across the four significant studies is slightly inflated, but the empirical results do support empirical claims.

If, however, the null-hypothesis is true and a researcher conducts many statistical tests and reports only significant results, demonstrating excessive significant results would also reveal that the reported results provide no empirical support for the theoretical claims in this article.

The problem with Francis et al.’s approach is that it does not clearly distinguish between these two scenarios.

The R-Index addresses this problem. It provides quantitative information about the replicability of a set of studies. Like Francis et al., the R-Index is based on the observed power of individual statistical tests (see Schimmack, 2012, for details), but the next steps are different. Francis et al. multiply observed power estimates. This approach is only meaningful for sets of studies that reported only significant results. The R-Index can be computed for studies that reported significant and non-significant results. Here are the steps:

Compute median observed power for all theoretically important statistical tests from a single study; then compute the median of these medians. This median estimates the median true power of a set of studies.

Compute the rate of significant results for the same set of statistical tests; then average the rates across the same set of studies. This average estimates the reported success rate for a set of studies.

Median observed power and average success rate are both estimates of true power or replicability of a set of studies. Without bias, these two estimates should converge as the number of studies increase.

If the success rate is higher than median observed power, it suggests that the reported results provide an inflated picture of the true effect size and replicability of a phenomenon.

The R-Index uses the difference between success rate and median observed power to correct the inflated estimate of replicability by subtracting the inflation rate (success rate – median observed power) from the median observed power.

R-Index = Median Observed Power – (Success rate – Median Observed Power)

The R-Index is a quantitative index, where higher values suggest a higher probability that an exact replication study will be successful and it avoids simple dichotomous decisions. Nevertheless, it can be useful to provide some broad categories that distinguish different levels of replicability.

An R-Index of more than 80% is consistent with true power of 80%, even when some results are omitted. I chose 80% as a boundary because Jacob Cohen advised researchers that they should plan studies with 80% power. Many undergraduates learn this basic fact about power and falsely assume that researchers are following a rule that is mentioned in introductory statistics.

An R-Index between 50% and 80% suggests that the reported results support an empirical phenomenon, but that power was less than ideal. Most important, this also implies that these studies make it difficult to distinguish non-significant results and type-II errors. For example, two tests with 50% power are likely to produce one significant result and one non-significant result. Researches are tempted to interpret the significant one and to ignore the non-significant one. However, in a replication study the opposite pattern is just as likely to occur.

An R-Index between25% and 50% raises doubts about the empirical support for the conclusions. The reason is that an R-Index of 22% can be obtained when the null-hypothesis is true and all non-significant results are omitted. In this case, observed power is inflated from 5% to 61%. With a 100% success rate, the inflation rate is 39%, and the R-Index is 22% (61% – 39% = 22%).

An R-Index below 20% suggest that researchers used questionable research methods (importantly, these method are questionable but widely accepted in many research communities and not considered to be ethical misconduct) to obtain results that are statistically significant (e.g., systematically deleting outliers until p < .05).

Table 1 list Francis et al.’s results and the R-Index. Studies are arranged in order of the R-Index.  Only 1 study is in the exemplary category with an R-Index greater than 80%.
4 studies have an R-Index between 50% and 80%.
8 studies have an R-Index in the range between 20% and 50%.
5 studies have an R-Index below 20%.

There are good reasons why researchers should not conduct studies with less than 50% power.  However, 13 of the 18 studies have an R-Index below 50%, which suggests that the true power in these studies was less than 50%.

FrancisScienceTable

Conclusion

The R-Index provides an alternative approach to Francis’s TES to examine the credibility of a set of published studies. Whereas Francis concluded that 15 out of 18 articles show bias that invalidates the theoretical claims of the original article, the R-Index provides quantitative information about the replicability of reported results.

The R-Index does not provide a simple answer about the validity of published findings, but in many cases the R-Index raises concerns about the strength of the empirical evidence and reveals that editorial decisions failed to take replicability into account.

The R-Index provides a simple tool for editors and reviewers to increase the credibility of published results and to increase the replicability of published findings. Editors and reviewers can compute, or ask authors who submit manuscripts to compute, the R-Index and use this information in their editorial decision. There is no clear criterion value, but a higher R-Index is better and moderate R-values should be justified by other criteria (e.g., uniqueness of sample).

The R-Index can be used to examine whether editors continue to accept articles with low replicability or are committed to the publication of empirical results that are credible and replicable.

Advertisements

Do it yourself: R-Index Spreadsheet and Manual is now available.

Science is self-correcting, but it often takes too long.

A spreadsheet to compute the R-Index and a manual that shows how to use the spreadsheet is now available on the www.r-index.org website. Researchers from all fields of science that use statistics are welcome to use the R-Index to examine the statistical integrity of published research findings. A high R-Index suggests that a set of studies reported results that are likely to replicate in an EXACT replication study with high statistical power. A low R-Index suggests that published results may be biased and that published results may not replicate. Researchers can share the results of their R-Index analyses by submitting the completed spreadsheets to www.r-index.org and the results will be posted anonymously. Results and spreadsheets will be openly accessible.

Dr. Schnall’s R-Index

In several blog posts, Dr. Schnall made some critical comments about attempts to replicate her work and these blogs created a heated debate about replication studies. Heated debates are typically a reflection of insufficient information. Is the Earth flat? This question created heated debates hundreds of years ago. In the age of space travel it is no longer debated. In this blog, I presented some statistical information that sheds light on the debate about the replicability of Dr. Schnall’s research.

The Original Study

Dr. Schnall and colleagues conducted a study with 40 participants. A comparison of two groups on a dependent variable showed a significant difference, F(1,38) = 3.63. In these days, Psychological Science asked researchers to report P-Rep instead of p-values. P-rep was 90%. The interpretation of P-rep was that there is a 90% chance to find an effect with the SAME SIGN in an exact replication study with the same sample size. The conventional p-value for F(1,38) = 3.63 is p = .06, a finding that commonly is interpreted as marginally significant. The standardized effect size is d = .60, which is considered a moderate effect size. The 95% confidence interval is -.01 to 1.47.

The wide confidence interval makes it difficult to know the true effect size. A post-hoc power analysis, assuming the true effect size is d = .60 suggests that an exact replication study has a 46% chance to produce a significant results (p < .05, two-tailed). However, if the true effect size is lower, actual power is lower. For example, if the true effect size is small (d = .2), a study with N = 40 has only 9% power (that is a 9% chance) to produce a significant result.

The First Replication Study

Drs. Johnson, Cheung, and Donnellan conducted a replication study with 209 participants. Assuming the effect size in the original study is the true effect size, this replication study has 99% power. However, assuming the true effect size is only d = .2, the study has only 31% power to produce a significant result. The study produce a non-significant result, F(1, 206) = .004, p = .95. The effect size was d = .01 (in the same direction). Due to the larger sample, the confidence interval is smaller and ranges from -.26 to .28. The confidence interval includes d = 2. Thus, both studies are consistent with the hypothesis that the effect exists and that the effect size is small, d = .2.

The Second Replication Study

Dr. Huang conducted another replication study with N = 214 participants (Huang, 2004, Study 1). Based on the previous two studies, the true effect might be expected to be somewhere between -.01 and .28, which includes a small effect size of d = .20. A study with N = 214 participants has 31% power to produce a significant result. Not surprisingly, the study produce a non-significant result, t(212) = 1.22, p = .23. At the same time, the effect size fell within the confidence interval set by the previous two studies, d = .17.

A Third Replication Study

Dr. Hung conducted a replication study with N = 440 participants (Study 2). Maintaining the plausible effect size of d = .2 as the best estimate of the true effect size, the study has 55% power to produce a significant result, which means it is nearly as likely to produce a non-significant result as it is to produce a significant result, if the effect size is small (d = .2). The study failed to produce a significant result, t(438) = .042, p = 68. The effect size was d = .04 with a confidence interval ranging from -.14 to .23. Again, this confidence interval includes a small effect size of d = .2.

A Fourth Replication Study

Dr. Hung published a replication study in the supplementary materials to the article. The study again failed to demonstrate a main effect, t(434) = 0.42, p = .38. The effect size is d = .08 with a confidence interval of -.11 to .27. Again, the confidence interval is consistent with a small true effect size of d = .2. However, the study with 436 participants had only a 55% chance to produce a significant result.

If Dr. Huang had combined the two samples to conduct a more powerful study, a study with 878 participants would have 80% power to detect a small effect size of d = .2. However, the combined effect size of d = .06 for the combined samples is still not significant, t(876) = .89. The confidence interval ranges from -.07 to .19. It no longer includes d = .20, but the results are still consistent with a positive, yet small effect in the range between 0 and .20.

Conclusion

In sum, nobody has been able to replicate Schnall’s finding that a simple priming manipulation with cleanliness related words has a moderate to strong effect (d = .6) on moral judgments of hypothetical scenarios. However, all replication studies show a trend in the same direction. This suggests that the effect exists, but that the effect size is much smaller than in the original study; somewhere between 0 and .2 rather than .6.

Now there are three possible explanations for the much larger effect size in Schnall’s original study.

1. The replication studies were not exact replications and the true effect size in Schnall’s version of the experiment is stronger than in the other studies.

2. The true effect size is the same in all studies, but Dr. Schnall was lucky to observe an effect size that was three times as large as the true effect size and large enough to produce a marginally significant result.

3. It is possible that Dr. Schnall did not disclose all of the information about her original study. For example, she may have conducted additional studies that produced smaller and non-significant results and did not report these results. Importantly, this practice is common and legal and in an anonymous survey many researchers admitted using practices that produce inflated effect sizes in published studies. However, it is extremely rare for researchers to admit that these common practices explain one of their own findings and Dr. Schnall has attributed the discrepancy in effect sizes to problems with replication studies.

Dr. Schnall’s Replicability Index

Based on Dr. Schnall’s original study it is impossible to say which of these explanations accounts for her results. However, additional evidence makes it possible to test the third hypothesis that Dr. Schnall knows more than she was reporting in her article. The reason is that luck does not repeat itself. If Dr. Schnall was just lucky, other studies by her should have failed because Lady Luck is only on your side half the time. If, however, disconfirming evidence is systematically excluded from a manuscript, the rate of successful studies is higher than the observed statistical power in published studies (Schimmack, 2012).

To test this hypothesis, I downloaded Dr. Schnall’s 10 most cited articles (in Web of Science, July, 2014). These 10 articles contained 23 independent studies. For each study, I computed the median observed power of statistical tests that tested a theoretically important hypothesis. I also calculated the success rate for each study. The average success rate was 91% (ranging from 45% to 100%, median = 100%). The median observed power was 61%. The inflation rate is 30% (91%-61%). Importantly, observed power is an inflated estimate of replicability when the success rate is inflated. I created the replicability index (R-index) to take this inflation into account. The R-Index subtracts the inflation rate from observed median power.

Dr. Schnall’s R-Index is 31% (61% – 30%).

What does an R-Index of 31% mean? Here are some comparisons that can help to interpret the Index.

Imagine the null-hypothesis is always true, and a researcher publishes only type-I errors. In this case, observed power is 61% and the success rate is 100%. The R-Index is 22%.

Dr. Baumeister admitted that his publications select studies that report the most favorable results. His R-Index is 49%.

The Open Science Framework conducted replication studies of psychological studies published in 2008. A set of 25 completed studies in November 2014 had an R-Index of 43%. The actual rate of successful replications was 28%.

Given this comparison standards, it is hardly surprising that one of Dr. Schnall’s study did not replicate even when the sample size and power of replication studies were considerably higher.

Conclusion

Dr. Schnall’s R-Index suggests that the omission of failed studies provides the most parsimonious explanation for the discrepancy between Dr. Schnall’s original effect size and effect sizes in the replication studies.

Importantly, the selective reporting of favorable results was and still is an accepted practice in psychology. It is a statistical fact that these practices reduce the replicability of published results. So why do failed replication studies that are entirely predictable create so much heated debate? Why does Dr. Schnall fear that her reputation is tarnished when a replication study reveals that her effect sizes were inflated? The reason is that psychologists are collectively motivated to exaggerate the importance and robustness of empirical results. Replication studies break with the code to maintain an image that psychology is a successful science that produces stunning novel insights. Nobody was supposed to test whether published findings are actually true.

However, Bem (2011) let the cat out of the bag and there is no turning back. Many researchers have recognized that the public is losing trust in science. To regain trust, science has to be transparent and empirical findings have to be replicable. The R-Index can be used to show that researchers reported all the evidence and that significant results are based on true effect sizes rather than gambling with sampling error.

In this new world of transparency, researchers still need to publish significant results. Fortunately, there is a simple and honest way to do so that was proposed by Jacob Cohen over 50 years ago. Conduct a power analysis and invest resources only in studies that have high statistical power. If your expertise led you to make a correct prediction, the force of the true effect size will be with you and you do not have to rely on Lady Luck or witchcraft to get a significant result.

P.S. I nearly forgot to comment on Dr. Huang’s moderator effects. Dr. Huang claims that the effect of the cleanliness manipulation depends on how much effort participants exert on the priming task.

First, as noted above, no moderator hypothesis is needed because all studies are consistent with a true effect size in the range between 0 and .2.

Second, Dr. Huang found significant interaction effects in two studies. In Study 2, the effect was F(1,438) = 6.05, p = .014, observed power = 69%. In Study 2a, the effect was F(1,434) = 7.53, p = .006, observed power = 78%. The R-Index for these two studies is 74% – 26% = 48%.   I am waiting for an open science replication with 95% power before I believe in the moderator effect.

Third, even if the moderator effect exists, it doesn’t explain Dr. Schnall’s main effect of d = .6.

The Replicability-Index (R-Index): Quantifying Research Integrity

ANNIVERSARY POST.  Slightly edited version of first R-Index Blog on December 1, 2014.

In a now infamous article, Bem (2011) produced 9 (out of 10) statistically significant results that appeared to show time-reversed causality.  Not surprisingly, subsequent studies failed to replicate this finding.  Although Bem never admitted it, it is likely that he used questionable research practices to produce his results. That is, he did not just run 10 studies and found 9 significant results. He may have dropped failed studies, deleted outliers, etc.  It is well-known among scientists (but not lay people) that researchers routinely use these questionable practices to produce results that advance their careers.  Think, doping for scientists.

I have developed a statistical index that tracks whether published results were obtained by conducting a series of studies with a good chance of producing a positive result (high statistical power) or whether researchers used questionable research practices.  The R-Index is a function of the observed power in a set of studies. More power means that results are likely to replicate in a replication attempt.  The second component of the R-index is the discrepancy between observed power and the rate of significant results. 100 studies with 80% power should produce, on average, 80% significant results. If observed power is 80% and the success rate is 100%, questionable research practices were used to obtain more significant results than the data justify.  In this case, the actual power is less than 80% because questionable research practices inflate observed power. The R-index subtracts the discrepancy (in this case 20% too many significant results) from observed power to adjust for the inflation.  For example, if observed power is 80% and success rate is 100%, the discrepancy is 20% and the R-index is 60%.

In a paper, I show that the R-index predicts success in empirical replication studies.

The R-index also sheds light on the recent controversy about failed replications in psychology (repligate) between replicators and “replihaters.”   Replicators sometimes imply that failed replications are to be expected because original studies used small samples with surprisingly large effects, possibly due to the use of questionable research practices. Replihaters counter that replicators are incompetent researchers who are motivated to produce failed studies.  The R-Index makes it possible to evaluate these claims objectively and scientifically.  It shows that the rampant use of questionable research practices in original studies makes it extremely likely that replication studies will fail.  Replihaters should take note that questionable research practices can be detected and that many failed replications are predicted by low statistical power in original articles.