“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)
The R-Index can be used to evaluate the replicability of a set of statistical results. It can be applied to journals, to articles on a specific topic (as in a meta-analysis), and to individual researchers. Just as the H-Index has become a popular metric of research excellence, the R-Index of individual researchers can be used to evaluate the replicability of their findings.
I chose Roy Baumeister as an example for several reasons. First, the R-Index is based on my earlier work on the Incredibility-Index (Schimmack, 2012). In that article, I demonstrated how power analysis can be used to reveal that researchers used questionable research practices to produce statistically significant results. I illustrated this approach with two articles. One article reported 10 experiments that appeared to demonstrate time-reversed causality. Independent replication studies failed to replicate this incredible finding, a failure that the Incredibility-Index predicted. The second article was a study on glucose consumption and willpower with Roy Baumeister as the senior author. The Incredibility-Index showed that the statistical results reported in this article were even less credible than the time-travel studies in Bem’s (2011) article.
Not surprisingly, Roy Baumeister was a reviewer of the incredibility article. During the review process, Roy Baumeister explained why his article reported more significant results than one would expect on the basis of the statistical power of these studies.
“My paper with Gailliot et al. (2007) is used as an illustration here. Of course, I am quite familiar with the process and history of that one. We initially submitted it with more studies, some of which had weaker results. The editor said to delete those. He wanted the paper shorter so as not to use up a lot of journal space with mediocre results. It worked: the resulting paper is shorter and stronger. Does that count as magic? The studies deleted at the editor’s request are not the only story. I am pretty sure there were other studies that did not work. Let us suppose that our hypotheses were correct and that our research was impeccable. Then several of our studies would have failed, simply given the realities of low power and random fluctuations. Is anyone surprised that those studies were not included in the draft we submitted for publication? If we had included them, certainly the editor and reviewers would have criticized them and formed a more negative impression of the paper. Let us suppose that they still thought the work deserved publication (after all, as I said, we are assuming here that the research was impeccable and the hypotheses correct). Do you think the editor would have wanted to include those studies in the published version?”
To my knowledge this is one of the few frank acknowledgements that questionable research practices (i.e., excluding evidence that does not support an author’s theory) contributed to the picture-perfect results in a published article. It is therefore instructive to examine the R-Index of a researcher who openly acknowledged that the reported results are a biased selection of the empirical evidence.
A tricky issue in any statistical analysis is the sampling of studies. In this case it would be possible to conduct the analysis on the full set of articles published by Roy Baumeister. However, for my analysis I selected a sample. To keep the sample unbiased, I chose a sampling strategy that makes a priori sense and does not rely on random sampling, which I could have manipulated because I control the random generator. My strategy was to focus on the Top 10 most cited original research articles.
To evaluate the R-Index, it is instructive to keep the following scenarios in mind.
- The null-hypothesis is true and a researcher uses questionable research practices to obtain just-significant results (p = .049999). The observed power for this set of studies is 50%, but all statistical results are significant, a 100% success rate. The success rate is therefore inflated by 50%. The R-Index is observed power minus the inflation rate, which yields 0% (50% – 50%).
- The null-hypothesis is true and a researcher drops non-significant results and/or uses questionable research methods that capitalize on chance. In this case, p-values above .05 are not reported and p-values below .05 have a uniform distribution with a median of .025. A p-value of .025 corresponds to 61% observed power. With 100% significant results, the inflation rate is 39%, and the R-Index is 22% (61%-39%).
- The null-hypothesis is false and a researcher conducts studies with 30% power. The non-significant studies are not published. In this case, the observed power of the published studies is 70%. With a 100% success rate, the inflation rate is 30%, and the R-Index is 40% (70% – 30%).
- The null-hypothesis is false and a researcher conducts studies with 50% power. The non-significant studies are not published. In this case, the observed power of the published studies is 75%. With a 100% success rate, the inflation rate is 25%, and the R-Index is 50% (75% – 25%).
- The null-hypothesis is false and researchers conduct studies with 80% power, as recommended by Cohen. The non-significant results are not published (20% missing). In this case, observed power is 90% with 100% significant results. With 10% inflation rate, the R-Index is 80% (90% – 10%).
- A sample of psychological studies published in 2008 produced an R-Index of 43% (Observed Power = 72%, Success Rate = 100%, Inflation Rate = 28%). Exact replications of these studies produced a success rate of 28%.
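The arithmetic behind these scenarios can be sketched in a few lines of Python. This is only a minimal illustration, not the full R-Index procedure: it assumes each result comes from a two-sided z-test, recovers observed power from the reported p-value, and computes the R-Index as observed power minus the inflation rate (success rate minus observed power).

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, recovered from its p-value."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # test statistic implied by p
    z_crit = nd.inv_cdf(1 - alpha / 2)    # significance threshold (~1.96)
    return 1 - nd.cdf(z_crit - z_obs)     # chance a replication reaches z_crit

def r_index(obs_power, success_rate):
    """R-Index = observed power minus inflation (success rate - observed power)."""
    inflation = success_rate - obs_power
    return obs_power - inflation

# Scenario 1: just-significant result, p ~ .05 -> observed power ~ 50%
print(round(observed_power(0.049999), 2))   # ~0.50
# Scenario 2: median p = .025 -> observed power ~ 61%, R-Index 22%
print(round(observed_power(0.025), 2))      # ~0.61
print(round(r_index(0.61, 1.00), 2))        # 0.22
```

Note that with a 100% success rate the R-Index simplifies to twice the observed power minus one, which is why just-significant results (50% observed power) produce an R-Index of exactly zero.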
Roy Baumeister’s Top-10 articles contained 40 studies. Each study reported multiple statistical tests. I computed the median observed power of statistical tests that tested a theoretically relevant hypothesis. I also recorded whether the test was considered supportive of the theoretical hypothesis (typically, p < .05). The median observed power in this set of 40 studies was 69%. The success rate was 89%. The inflation rate is 20% and the R-Index is 49% (69% – 20%).
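The aggregation step described above can be sketched the same way. The p-values below are hypothetical stand-ins (the individual test results from the 40 studies are not listed here); the procedure is to convert each focal p-value to observed power, take the median, compute the success rate as the share of significant results, and subtract the inflation.

```python
from statistics import NormalDist, median

def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, recovered from its p-value."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_crit - z_obs)

# Hypothetical focal p-values, one per study (not the actual data)
p_values = [0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.045, 0.049, 0.08, 0.20]

med_power = median(observed_power(p) for p in p_values)
success_rate = sum(p < 0.05 for p in p_values) / len(p_values)
inflation = success_rate - med_power
r = med_power - inflation
print(f"median observed power = {med_power:.2f}, "
      f"success rate = {success_rate:.2f}, R-Index = {r:.2f}")
```

In this made-up set, 8 of 10 tests are significant (80% success rate) while the median observed power is only about 56%, so the R-Index lands near 32%, in the same range as the empirical values discussed here.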
Roy Baumeister’s R-Index of 49% is consistent with his statement that his articles do not contain all of the studies that tested a theoretical prediction. Studies that tested theoretical predictions and failed to support them are missing. An R-Index of 49% is also consistent with Roy Baumeister’s claim that his practices reflect the common practices in the field. Other sets of studies in social psychology produce similar indices (e.g., replicability project of psychological studies, R-Index = 43%; success rate in empirical replication studies 28%).
In conclusion, Roy Baumeister acknowledged the use of questionable research practices (i.e., excluding evidence that does not support a theoretical hypothesis), and his R-Index is 49%. A representative set of psychological studies published in 2008 produced an R-Index of 43%. This suggests that the use of questionable research practices in psychology is widespread and that the R-Index can detect the use of these practices. A set of studies that were subjected to empirical replication attempts produced an R-Index of 38%, and only 28% of replication attempts were successful (72% failed).
The R-Index makes it possible to quantify and compare the use of questionable research practices and I hope it will encourage researchers to conduct fewer and more powerful studies. I also hope that a quantitative index makes it possible to make replicability an evaluation criterion for scientists.
So what could Roy Baumeister have done differently? He published 9 studies that supported his hypothesis and excluded several more studies because they were underpowered. I suggest running fewer studies with higher power, so that all studies can produce significant results, assuming the null-hypothesis is really false.