The crisis of confidence in psychological science started with Bem’s (2011) article in the Journal of Personality and Social Psychology (JPSP). The article made the incredible claim that extraverts can foresee future random events (e.g., the location of an erotic picture) above chance.
Rather than demonstrating some superhuman abilities, the article revealed major problems in the way psychologists conduct research and report their results.
Wagenmakers and colleagues were given the opportunity to express their concerns in a commentary that was published along with the original article, which is highly unusual (Wagenmakers et al., 2011).
Wagenmakers used this opportunity to attribute the problems in psychological science to the use of p-values. The claim that the replication crisis in psychology follows from the use of p-values has been repeated several times, most recently in a special issue that promotes Bayes Factors as an alternative statistical approach.
“the edifice of NHST appears to show subtle signs of decay. This is arguably due to the recent trials and tribulations collectively known as the ‘crisis of confidence’ in psychological research, and indeed, in empirical research more generally (e.g., Begley & Ellis, 2012; Button et al., 2013; Ioannidis, 2005; John, Loewenstein, & Prelec, 2012; Nosek & Bar-Anan, 2012; Nosek, Spies, & Motyl, 2012; Pashler & Wagenmakers, 2012; Simmons, Nelson, & Simonsohn, 2011). This crisis of confidence has stimulated a methodological reorientation away from the current practice of p value NHST” (Wagenmakers et al., 2018, Psychonomic Bulletin & Review).
In short, Bem used NHST and p-values, Bem’s claims are false, therefore NHST and p-values are false.
However, it does not follow from Bem’s use of p-values that NHST is flawed or caused the replication crisis in experimental social psychology, just as it does not follow from the fact that Bem is a man and that his claims were false that all claims by men are false.
The key problem with Bem’s article is that he used questionable, and some would argue fraudulent, research practices to produce incredible p-values (Francis, 2012; Schimmack, 2012). For example, he combined several smaller studies with promising trends into a single dataset to report a p-value less than .05 (Schimmack, 2018). This highly problematic practice violates the assumption that the observations in a dataset are drawn from a representative sample. It is not clear how any statistical method could produce valid results when its basic assumptions are violated.
So, we have two competing accounts of the replication crisis in psychology. Wagenmakers argues that even proper use of NHST produces questionable results that are difficult to replicate. In contrast, I argue that proper use of NHST produces credible p-values that can be replicated and only questionable research practices and abuse of NHST produce incredible p-values that cannot be replicated.
Who is right?
The answer is simple. Wagenmakers et al. (2011) engaged in a questionable research practice to demonstrate the superiority of Bayes-Factors when they examined Bem’s results with Bayesian statistics. They analyzed each study individually to show that each study alone produced fairly weak evidence for extraverts’ miraculous extrasensory abilities. However, they did not report the results of a meta-analysis of all studies.
The weak evidence in each single study is not important because JPSP would not have accepted Bem’s manuscript for publication if he had presented a significant result in only a single study. In 2011, social psychologists were well aware that a single p-value less than .05 provides only suggestive evidence and does not warrant publication in a top journal (Kerr, 1998). Most articles in JPSP report four or more studies. Bem reported 9 studies. Thus, the crucial statistical question is how strong the combined evidence of all 9 studies is. This question is best addressed by means of a meta-analysis of the evidence. Wagenmakers et al. (2011) are well aware of this fact, but avoided reporting the results of a Bayesian meta-analysis.
“In this article, we have assessed the evidential impact of Bem’s (2011) experiments in isolation. It is certainly possible to combine the information across experiments, for instance by means of a meta-analysis (Storm, Tressoldi, & Di Risio, 2010; Utts, 1991). We are ambivalent about the merits of meta-analyses in the context of psi: One may obtain a significant result by combining the data from many experiments, but this may simply reflect the fact that some proportion of these experiments suffer from experimenter bias and excess exploration” (Wagenmakers et al., 2011).
I believe the real reason why they did not report the results of a Bayesian meta-analysis is that it would have shown that p-values and Bayes-Factors lead to the same inference that Bem’s data are inconsistent with the null-hypothesis. After all, Bayes-Factors and p-values are mere transformations of a test-statistic onto a different scale. Holding sample size constant, p-values and Bayes-Factors in favor of the null-hypothesis decrease as the test statistic (e.g., a t-value) increases. This is shown below with Bem’s data.
Bayesian Meta-Analysis of Bem
Bem reported a mean effect size of d = .22 based on 9 studies with a total of 1170 participants. A better measure of effect size is the weighted average, which is slightly smaller, d = .197. The effect size can be tested against an expected value of 0 (no ESP) with a one-sample t-test with a standard error of 1 / sqrt(1170) = 0.029. The t-value, computed with the unrounded standard error, is .197/.029 = 6.73. The corresponding z-score is 6.66 (cf. Bem, Utts, & Johnson, 2011).
The p-value for t(1169) = 6.73 is 2.65e-11 or 0.00000000003.
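The arithmetic behind these numbers can be retraced with a minimal Python sketch. It assumes, as the text does, a standardized effect size with a standard deviation of 1, so that the standard error is 1 / sqrt(N). Because the Python standard library has no t-distribution, the tail probability uses a normal approximation, which is an assumption of this sketch and gives a somewhat smaller value than the exact t(1169) p-value reported above. The variable names are illustrative.

```python
import math

# Values taken from the text
d = 0.197   # weighted average effect size
n = 1170    # total number of participants

# Standard error of a standardized effect size (assumes SD = 1)
se = 1 / math.sqrt(n)

# One-sample t-value against an expected value of 0 (no ESP)
t = d / se

# Two-sided tail probability under a NORMAL approximation (assumption:
# with 1169 degrees of freedom the t-distribution is close to normal)
p_approx = math.erfc(t / math.sqrt(2))

print(f"se = {se:.4f}, t = {t:.2f}, approximate two-sided p = {p_approx:.2e}")
```

With a statistics package that provides the t-distribution, the exact two-sided p-value for t(1169) could be computed in place of the normal approximation.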
I used Rouder’s online app to compute the default Bayes-Factor.
Expressed as a BF in favor of the null-hypothesis, which is more comparable to a p-value that expresses evidence against the null-hypothesis, we obtain a BF with eight zeros after the decimal point, BF01 = 1/139075597 = 7.190334e-09 or 0.000000007.
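For readers who want to see how a t-value is turned into a default Bayes-Factor, here is a rough sketch of the JZS (Cauchy prior) Bayes-Factor for a one-sample t-test, following the integral formula in Rouder et al. (2009). The text does not state which settings the online app used, so the prior scale r = 0.707, the function name jzs_bf10, and the integration limits are assumptions of this sketch, and it is not guaranteed to reproduce the app's figure digit for digit.

```python
import math

def jzs_bf10(t, n, r=0.707, lo=-10.0, hi=30.0, steps=4000):
    """Approximate JZS Bayes-Factor (alternative over null) for a one-sample t-test.

    r is the Cauchy prior scale (assumed here, not taken from the text).
    The integral over g is evaluated on u = log(g) with Simpson's rule.
    """
    v = n - 1  # degrees of freedom

    # Marginal likelihood under the null (up to a constant shared with the alternative)
    null_lik = (1 + t**2 / v) ** (-(v + 1) / 2)

    def integrand(u):
        g = math.exp(u)
        a = 1 + n * r**2 * g
        # Inverse-gamma(1/2, 1/2) density on g, which yields a Cauchy prior on delta
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        # The trailing factor g is the Jacobian of the substitution g = exp(u)
        return a ** -0.5 * (1 + t**2 / (a * v)) ** (-(v + 1) / 2) * prior * g

    # Composite Simpson's rule (steps must be even)
    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    alt_lik = total * h / 3

    return alt_lik / null_lik

bf10 = jzs_bf10(6.73, 1170)
bf01 = 1 / bf10
print(f"BF10 = {bf10:.3e}, BF01 = {bf01:.3e}")
```

Calling the function with a smaller t-value and the same N, for example jzs_bf10(3.0, 1170), returns a much smaller BF10, which illustrates the point that, holding sample size constant, the Bayes-Factor is a monotonic transformation of the test statistic.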
Given the data, it is reasonable to reject the null-hypothesis using p-values or Bayes-Factors. Thus, the problem is the high t-value and not the transformation of the t-value into a p-value.
The problem with the t-value is clear when we consider that particle physicists (a.k.a. real scientists) use z-values greater than 5 to rule out chance findings. Thus, Bem’s evidence meets the same strict criterion that was used to celebrate the discovery of the Higgs boson in physics (cf. Schimmack, 2012).
The problem with Bem’s article is not that he used p-values. He could also have used Bayesian statistics to support his incredible claims. The problem is that Bem engaged in highly questionable research practices and was not transparent in reporting these practices. Holding p-values accountable for his behavior would be like holding cars responsible for drunk drivers.
Wagenmakers’ railing against p-values is akin to Don Quixote’s railing against windmills. It is not uncommon for a group of scientists to vehemently push an agenda. In fact, the incentive structure in science seems to promote self-promoters. However, it is disappointing that a peer-reviewed journal uncritically accepted his questionable claim that p-values caused the replication crisis. There is ample evidence that questionable research practices are being used to produce too many significant results (John et al., 2012; Schimmack, 2012). Disregarding this evidence to make false, self-serving attributions is just as questionable as other questionable research practices that impede scientific progress.
The biggest danger with Wagenmakers and colleagues’ agenda is that it distracts from the key problems that need to be fixed. Curbing the use of questionable research practices and increasing the statistical power of studies to produce strong evidence (i.e., high t-values) is paramount to improving psychological science. However, there is little evidence that psychologists have changed their practices since 2011, with the exception of some social psychologists (Schimmack, 2017).
Thus, it is important to realize that Wagenmakers’ attribution of the replication crisis to the use of NHST is a fundamental attribution error in meta-psychology that is rooted in a motivated bias to find some useful application for Bayes-Factors. Contrary to Wagenmakers et al.’s claim that “Psychologists need to change the way they analyze their data,” they actually need to change the way they obtain their data. With good data, the differences between p-values and Bayes-Factors are of minor importance.