Bayesians like to blame p-values and frequentist statistics for the replication crisis in psychology (see, e.g., Wagenmakers et al., 2011). An alternative view is that the replication crisis is caused by selective reporting of significant results, with non-significant results going unreported (Schimmack, 2012). This bias would influence frequentist and Bayesian statistics alike, and switching from p-values to Bayes-Factors would not solve the replication crisis. It is difficult to evaluate these competing claims because Bayesian statistics are still used relatively infrequently in research articles. For example, a search for the term "Bayes Factor" retrieved only six articles in Psychological Science in the years from 1990 to 2015.
One article made a reference to the use of Bayesian statistics in modeling. Three articles used Bayes-Factors to test the null-hypothesis. These articles will be examined in a different post, but they are not relevant for the problem of replicating results that appeared to demonstrate effects by rejecting the null-hypothesis. Only two articles used Bayes-Factors to test whether a predicted effect is present.
One article reported Bayes-Factors to claim support for predicted effects in 6 studies (Savani & Rattan, 2012). The results are summarized in Table 1.
MA = meta-analysis, OP = observed power, BF1 = Bayes-Factor reported in article based on half-normal with SD = .5, BF2 = default Bayes-Factor with Cauchy(0,1)
All 6 studies reported a statistically significant result, p < .05 (two-tailed). Five of the six studies reported a Bayes-Factor and all Bayes-Factors supported the alternative hypothesis. Bayes-Factors in the article were based on a half-normal distribution with a standard deviation of d = .5 (BF1 in Table 1). The Bayes-Factors show that the data are much more consistent with this alternative hypothesis than with the null-hypothesis. I also computed the Bayes-Factor for a Cauchy distribution centered at 0 with a scaling parameter of r = 1 (BF2; Wagenmakers et al., 2011). This alternative hypothesis assumes that there is a 50% probability that the absolute value of the standardized effect size is greater than d = 1. Because this alternative hypothesis places so much weight on large effects, it favors the null-hypothesis when the data show small to moderate effect sizes. Even this Bayes-Factor consistently favors the alternative hypothesis, but the odds are less impressive. This result shows that Bayes-Factors have to be interpreted in the context of the specified alternative hypothesis. The last row shows the results of a meta-analysis. The results of the six studies were combined using Stouffer’s formula, sum(z) / sqrt(k), where k is the number of studies (Stouffer et al., 1949). To compute the Bayes-Factor, the z-score was converted into a t-value with total N – 2 degrees of freedom. The meta-analysis shows strong support for an effect, z = 5.92, and the Bayes-Factor in favor of the hypothesis is greater than 1 million to 1.
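The dependence of a Bayes-Factor on the specified alternative hypothesis can be illustrated with a minimal sketch. It is not the calculation used in the article or in this post: it assumes a normal approximation to the likelihood of the observed effect size instead of the exact t-based test, and the observed effect (d = 0.40, SE = 0.18) is a made-up example, not a row of Table 1. All function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def half_normal_prior(sd):
    """Prior on d: half-normal with mode 0 and scale sd (positive effects only)."""
    return lambda d: 2 * normal_pdf(d, 0.0, sd) if d >= 0 else 0.0


def cauchy_prior(scale):
    """Prior on d: Cauchy centered at 0 with scaling parameter `scale`."""
    return lambda d: 1.0 / (math.pi * scale * (1 + (d / scale) ** 2))


def bayes_factor_10(d_obs, se, prior, steps=20000):
    """Marginal likelihood under the prior divided by the likelihood at d = 0.

    Assumption: the likelihood of the observed effect size is approximated
    by a normal distribution with standard error `se`.
    """
    lo, hi = d_obs - 10 * se, d_obs + 10 * se  # likelihood is ~0 outside this range
    width = (hi - lo) / steps
    marginal = 0.0
    for i in range(steps):
        d = lo + (i + 0.5) * width  # midpoint rule
        marginal += normal_pdf(d_obs, d, se) * prior(d) * width
    return marginal / normal_pdf(d_obs, 0.0, se)


# Hypothetical study (NOT taken from Table 1): d = 0.40 with SE = 0.18.
d_obs, se = 0.40, 0.18
bf_half_normal = bayes_factor_10(d_obs, se, half_normal_prior(0.5))
bf_cauchy = bayes_factor_10(d_obs, se, cauchy_prior(1.0))

# Share of the Cauchy(0, 1) prior mass that lies on |d| > 1.
cauchy_tail = 1 - 2 * math.atan(1.0) / math.pi

print(f"BF10, half-normal (SD = .5): {bf_half_normal:.2f}")
print(f"BF10, Cauchy(0, 1):          {bf_cauchy:.2f}")
print(f"P(|d| > 1) under Cauchy(0, 1): {cauchy_tail:.2f}")
```

Under these assumptions the same data give a Bayes-Factor of roughly 6 with the half-normal prior and roughly 1.4 with the Cauchy prior. The gap arises because the Cauchy prior spreads half of its mass over absolute effect sizes larger than 1.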
Thus, frequentist and Bayesian statistics produce converging results. However, both statistical methods assume that the reported statistics are unbiased. If researchers only present significant results or use questionable research practices that violate statistical assumptions, effect sizes are inflated, which biases p-values and Bayes-Factors alike. It is therefore necessary to test whether the reported results are biased. A bias analysis with the Test of Insufficient Variance (TIVA; Schimmack, 2015a) shows that the data are biased. TIVA compares the observed variance in z-scores against the expected variance of z-scores due to random sampling error, which is 1. The observed variance is only Var(z) = 0.04. A chi-square test shows that a discrepancy this large between the observed and expected variance would occur rarely by chance alone, p = .001. Thus, neither p-values nor Bayes-Factors provide a credible test of the hypothesis because the reported results are not credible.
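Stouffer's combined z-score and the TIVA computation can be sketched as follows. This is a minimal illustration, not the original analysis script. It assumes that TIVA multiplies the sample variance of the z-scores by k – 1 and evaluates the left tail of a chi-square distribution with k – 1 degrees of freedom. The function names are hypothetical, and the chi-square distribution function is a small series implementation so that only the standard library is needed.

```python
import math
from statistics import variance


def stouffer_z(z_scores):
    """Combine k z-scores with Stouffer's formula: sum(z) / sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def chi2_cdf(x, df):
    """Left-tail probability of a chi-square distribution with df degrees of freedom.

    Uses the series expansion of the regularized lower incomplete gamma
    function, which works well for the small statistics TIVA produces.
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def tiva(z_scores):
    """Return (observed variance, left-tail p-value) for a set of z-scores.

    Assumption: under unbiased reporting the z-scores have variance 1,
    so (k - 1) * Var(z) follows a chi-square distribution with k - 1 df.
    """
    k = len(z_scores)
    var_z = variance(z_scores)  # sample variance, denominator k - 1
    return var_z, chi2_cdf((k - 1) * var_z, k - 1)


# Values reported in the text: k = 6 studies, Var(z) = 0.04.
p_reported = chi2_cdf((6 - 1) * 0.04, 6 - 1)
print(f"TIVA p-value for Var(z) = 0.04, k = 6: {p_reported:.4f}")

# Hypothetical z-scores (NOT the values from Table 1), for illustration only.
example_z = [2.1, 2.3, 2.4, 2.5, 2.6, 2.7]
var_z, p = tiva(example_z)
print(f"Stouffer z = {stouffer_z(example_z):.2f}, Var(z) = {var_z:.3f}, p = {p:.4f}")
```

Applied to the reported values (k = 6, Var(z) = 0.04), the left-tail probability comes out at about .0009, in line with the reported p = .001.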
The second article, by Kibbe and Leslie (2011), reported the results of a single study that compared infants’ looking times in three experimental conditions. The authors first reported the results of a traditional Analysis of Variance that showed a significant effect, F(2, 31) = 3.54, p = .041. They also reported p-values for post-hoc tests that compared the critical experimental condition with the control condition, p = .021. They then reported the results of a Bayesian contrast analysis that compared the critical experimental condition with the other two conditions. They reported a Bayes-Factor of 7.4 in favor of a difference between means. The article does not specify the alternative hypothesis that was tested, and the website link in the article does not provide readily available information about the prior distribution of the test. In any case, the Bayesian results are consistent with the ANOVA results. As there is only one study, it is impossible to conduct a formal bias test, but studies with p-values close to .05 often do not replicate.
In conclusion, Bayesian statistics are still rarely used to test research hypotheses. Only two articles in the journal Psychological Science have done so. One article reported six studies and reported high Bayes-Factors in five studies to support theoretical predictions. A bias analysis showed that the results in this article are biased and violate basic assumptions of sampling. This example suggests that Bayesian statistics do not solve the credibility problem in psychology. Bayes-Factors can be gamed just like p-values. In fact, it is even easier to game Bayes-Factors by specifying prior distributions that closely match the observed data in order to report Bayes-Factors that impress reviewers, editors, and readers with limited understanding of Bayesian statistics. To avoid these problems, Bayesians need to agree on a principled approach to specifying prior distributions. Moreover, Bayesian statistics are only credible if researchers report all relevant results. Thus, Bayesian statistics need to be accompanied by information about the credibility of the data.
Kibbe, M., & Leslie, A. (2011). What Do Infants Remember When They Forget? Location and Identity in 6-Month-Olds’ Memory for Objects. Psychological Science, 22, 1500-1505.
Savani, K., & Rattan, A. (2012). A choice mind-set increases the acceptance and maintenance of wealth inequality. Psychological Science, 23, 796-804.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2015a). The test of insufficient variance (TIVA). Retrieved from https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/
Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). Adjustment During Army Life. Princeton, NJ: Princeton University Press.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., & Van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi. [Commentary on Bem (2011)]. Journal of Personality and Social Psychology, 100, 426–432. doi: 10.1037/a0022790