I learned about Bayes’ theorem in the 1990s and used Bayes’ famous formula in my first JPSP article (Schimmack & Reisenzein, 1997). When Wagenmakers et al. (2011) published their criticism of Bem (2011), I did not know about Bayesian statistics. I have since learned more about Bayesian statistics, and I am aware that there are many different approaches to using priors in statistical inference. This post is about a single Bayesian statistical approach, namely Bayesian Null-Hypothesis Testing (BNHT), which has been attributed to Jeffreys, was introduced into psychology by Rouder, Speckman, and Sun (2009), and was used by Wagenmakers et al. (2011) to suggest that Bem’s evidence for ESP was obtained by using flawed p-values, whereas Bayes-Factors showed no evidence for ESP, although they did not show evidence for the absence of ESP, either. Since then, I have learned more about Bayes-Factors, in part from reading blog posts by Jeff Rouder, including R-Code to run my own simulation studies, and from discussions with Jeff Rouder on social media. I am not an expert on Bayesian modeling, but I understand the basic logic underlying Bayes-Factors.
Rouder et al.’s (2009) article has been cited over 800 times and was cited over 200 times in 2016 and 2017. An influential article like this cannot be ignored. Like all other inferential statistical methods, JBFs (Jeffreys’ Bayes-Factors or Jeff’s Bayes-Factors) examine statistical properties of data (effect size, sampling error) in relation to sampling distributions of test statistics. Rouder et al. (2009) focused on t-distributions that are used for the comparison of means with t-tests. Although most research articles in psychology continue to use traditional significance testing, the use of Bayes-Factors is on the rise. It is therefore important to critically examine how Bayes-Factors are being used and whether inferences based on Bayes-Factors are valid.
Inferences about Sampling Error as Causes of Observed Effects.
The main objective of inferential statistics in psychological research is to rule out the possibility that an observed effect is merely a statistical fluke. If the evidence obtained in a study is strong enough given some specified criterion value, researchers are allowed to reject the hypothesis that an observed effect was merely produced by chance (a false positive effect) and interpret the result as being caused by some effect. Although Bayes-Factors could be reported without drawing conclusions (just like t-values or p-values could be reported without drawing inferences), most empirical articles that use Bayes-Factors use them to draw inferences about effects. Thus, the aim of this blog post is to examine whether empirical researchers use JBFs correctly.
Two Types of Errors
Inferential statistics are error prone. The outcome of empirical studies is not deterministic and results from samples may not generalize to populations. There are two possible errors that can occur, the so-called type-I errors and type-II errors. Type-I errors are false positive results. A false positive result occurs when there is no real effect in the population, but the results of a study led to the rejection of the null-hypothesis that sampling error alone caused the observed mean differences. The second error is the false inference that sampling error alone caused an observed difference in a sample, while a test of the entire population would show that there is an actual effect. This is called a false negative result. The main problem in assessing type-II errors (false negatives) is that the probability of a type-II error depends on the magnitude of the effect. Large effects can be easily observed even in small samples and the risk of a type-II error is small. However, as effect sizes become smaller and approach zero, it becomes harder and harder to distinguish the effect from pure sampling error. Once effect sizes become really small (say 0.0000001 percent of a standard deviation), it is practically impossible to distinguish results of a study with a tiny real effect from results of a study with no effect at all.
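The dependence of the type-II error rate on effect size can be made concrete with a small calculation. The sketch below is only an illustration under simplifying assumptions: it uses a normal approximation to a two-sided, two-sample test, and the function name, the sample size of 50 per group, and the alpha of .05 are choices made for this example, not values taken from any study discussed here.

```python
from statistics import NormalDist

def type2_error(d, n_per_group, alpha=0.05):
    """Approximate type-II error rate for a true standardized effect size d.

    Assumption: two-sided, two-sample test with equal group sizes,
    using a normal approximation instead of the exact t-distribution.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)          # critical z-value
    expected_z = d * (n_per_group / 2) ** 0.5  # expected test statistic
    # Power: probability that the test statistic lands in either rejection region
    power = (1 - z.cdf(crit - expected_z)) + z.cdf(-crit - expected_z)
    return 1 - power

# Hypothetical effect sizes, from large to exactly zero
for d in (0.8, 0.5, 0.2, 0.05, 0.0):
    print(f"d = {d:4.2f}  type-II error = {type2_error(d, 50):.3f}")
```

With a fixed sample size, the type-II error rate rises steadily as the true effect shrinks. At d = 0 the "error rate" is simply 1 minus alpha, which is why a tiny real effect and no effect at all become practically indistinguishable.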
For reasons that are irrelevant here, psychologists have ignored type-II errors. A type-II error can only be made when researchers conclude that an effect is absent. However, empirical psychologists were trained to treat non-significant results as inconclusive rather than drawing the inference that an effect is absent and risking a type-II error. This led to the belief that p-values cannot be used to test the hypothesis that an effect is absent. This was not much of a problem because most of the time psychologists made predictions that an effect should occur (reward should increase behavior; learning should improve recall, etc.). However, it became a problem when Bem (2011) claimed to demonstrate that subliminal priming can influence behavior even if the prime is presented AFTER the behavior occurred. Wagenmakers et al. (2011) and others found this hypothesis implausible and the evidence for it unbelievable. However, demonstrating that the evidence is flawed is not fully satisfying; it would be even better to demonstrate that this implausible effect does not exist. Traditional statistical methods that focus on rejecting the null-hypothesis did not allow this. Wagenmakers et al. (2011) suggested that Bayes-Factors solve this problem because they can be used to test the plausible hypothesis that time-reversed priming does not exist (the effect size is truly zero). Many subsequent articles have used JBFs for exactly this purpose; that is, to provide empirical evidence in support of the null-hypothesis that an observed mean difference is entirely due to sampling error. Like all inductive inferences, inferences in favor of H0 can be false. While psychologists have traditionally ignored type-II errors because they did not make inferences in favor of H0, the rise of inferences in favor of H0 by means of JBFs makes it necessary to examine the validity of these inferences.
The main problem of using JBFs to provide evidence for the null-hypothesis is that Bayes-Factors are ratios of the likelihoods of the data under two hypotheses. The data can be more or less compatible with each of the two hypotheses. For example, if the likelihood of the data is .2 under H0 and .1 under H1, the ratio of the two likelihoods is .2/.1 = 2 in favor of H0. The greater the ratio in favor of H0, the more likely it is that an observed mean difference is purely sampling error. As JBFs are ratios of two likelihoods, they depend on the specification of H1. For t-tests with continuous variables, H1 is specified as a weighted distribution of effect sizes. Although H1 covers all possible effect sizes, it is possible to create an infinite number of alternative hypotheses (H1.1, H1.2, H1.3….H1.∞) by weighting these effect sizes differently. The Bayes-Factor changes as a function of the way H1 is specified. Thus, while one specific H1 may produce a JBF of 1000000:1 in favor of H0, another one may produce a JBF of 1:1. It is therefore a logical fallacy to infer from a specific JBF for one particular H1 that H0 is supported, true, or that there is evidence for the absence of an effect. The logically correct inference is that, with extremely high probability, the alternative hypothesis is false, but that does not justify the inverse inference that H0 is true because H0 and H1 do not specify the full event space of all possible hypotheses that could be tested. It is easy to overlook this because every H1 covers the full range of effect sizes, but these effect sizes can be used to create an infinite number of alternative hypotheses.
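A short numerical sketch shows how the same data can yield very different Bayes-Factors depending on how H1 weights effect sizes. To keep the arithmetic transparent, this example makes two loud assumptions that are not part of the original analyses: the observed effect size is treated as normally distributed with a known standard error, and H1 is a normal prior centered on zero with standard deviation tau (Rouder et al.'s default uses a Cauchy prior, not a normal one). The function name and the data values are hypothetical.

```python
from statistics import NormalDist

def bf01_normal_prior(d_obs, se, tau):
    """Bayes-Factor for H0 (effect = 0) over H1 (effect ~ Normal(0, tau^2)).

    Assumption: the observed effect d_obs is normally distributed around
    the true effect with known standard error se. Under H1 the marginal
    distribution of d_obs is then Normal(0, se^2 + tau^2).
    """
    likelihood_h0 = NormalDist(0, se).pdf(d_obs)
    likelihood_h1 = NormalDist(0, (se**2 + tau**2) ** 0.5).pdf(d_obs)
    return likelihood_h0 / likelihood_h1

# Hypothetical data: observed d = .10 with a standard error of .10
d_obs, se = 0.10, 0.10

# Same data, four different specifications of H1
for tau in (0.05, 0.2, 1.0, 10.0):
    print(f"tau = {tau:5.2f}  BF01 = {bf01_normal_prior(d_obs, se, tau):.2f}")
```

With these made-up numbers, the Bayes-Factor in favor of H0 moves from roughly 1:1 when H1 expects small effects to well above 10:1 when H1 spreads its weight over very large effects, even though the data never change.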
To make it simple, let’s forget about sampling distributions and likelihoods. Using JBFs to claim that the data support the null-hypothesis in some absolute sense, is like a guessing game where you can pick any number you want, I guess that you picked 7 (because people like the number 7), you say it was not 7, and I now infer that you must have picked 0, as if 7 and 0 were the only options. If you think this is silly, you are right, and it is equally silly to infer from rejecting one out of an infinite number of possible H1s that H0 must be true.
So, a correct use of JBFs would be to state conclusions in terms of the H1 that was specified to compute the JBF. For example, in Wagenmakers et al.’s analyses of Bem’s data, the authors specified H1 as a hypothesis that allocated 25% probability to effect sizes of d less than -1 (the opposite of the predicted effect) and 25% probability to a d greater than 1 (a very strong effect similar to gender differences in height). Even if the JBF had strongly favored H0, which it did not, it would not justify the inference that time-reversed priming does not exist. It would merely justify the inference that the effect size is unlikely to be greater than 1, one way or the other. However, if Wagenmakers et al. (2011) had presented their results correctly, nobody would have bothered to take notice of such a trivial inference. It was only the incorrect presentation of JBFs as a test of the null-hypothesis that led to the false belief that JBFs can provide evidence for the absence of an effect (e.g., the true effect size in Bem’s studies is zero). In fact, Wagenmakers played the game where H1 guessed that the effect size is 1, H1 was wrong, leading to the conclusion that the effect size must be 0. This is an invalid inference because there are still an infinite number of plausible effect sizes between 0 and 1.
There is nothing inherently wrong in calculating likelihood ratios and using them to test competing predictions. However, the use of Bayes-Factors as a way to provide evidence for the absence of an effect is misguided because it is logically impossible to provide evidence for one specific effect size out of an infinite set of possible effect sizes. It doesn’t matter whether the specific effect size is 0 or any other value. A likelihood ratio can only compare two hypotheses out of an infinite set of hypotheses. If one hypothesis is rejected, it does not justify inferring that the other hypothesis is true. This is the reason why we can falsify H0: when we reject H0 we do not infer that one specific effect size is true; we merely infer that it is not 0, leaving open the possibility that it is any one of the other infinite number of effect sizes. We cannot reverse this because we cannot test the hypothesis that the effect is zero against a single hypothesis that covers all other effect sizes. JBFs can only test H0 against all other effect sizes by assigning weights to them. As there is an infinite number of ways to weight effect sizes, there is an infinite set of alternative hypotheses. Thus, we can reject the hypothesis that sampling error alone produced an effect, but practically we can never demonstrate that sampling error alone caused the outcome of a study.
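The 25% figures can be checked directly. The sketch below assumes that H1 was a standard Cauchy distribution over effect sizes with a scale of 1, which is the specification that the 25%/25% split implies; the helper function and the cut-off of .2 for "small" effects are choices made for this illustration.

```python
import math

def cauchy_cdf(x, scale=1.0):
    """Cumulative distribution function of a Cauchy prior centered on zero."""
    return 0.5 + math.atan(x / scale) / math.pi

# Prior probability that H1 assigns to different ranges of effect sizes
below_minus_1 = cauchy_cdf(-1.0)                 # d < -1
above_plus_1 = 1 - cauchy_cdf(1.0)               # d > 1
small_effects = cauchy_cdf(0.2) - cauchy_cdf(-0.2)  # -.2 < d < .2

print(f"P(d < -1)      = {below_minus_1:.3f}")
print(f"P(d > 1)       = {above_plus_1:.3f}")
print(f"P(-.2 < d < .2) = {small_effects:.3f}")
```

Under this assumption, half of the prior weight sits on effect sizes larger than 1 in absolute value, while only about an eighth sits on the small effects between -.2 and .2 that are most plausible for a phenomenon like time-reversed priming.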
To demonstrate that an effect does not exist it is necessary to specify a region of effect sizes around zero. The smaller the region, the more resources are needed to provide evidence that the effect size is at best very small. One negative consequence of the JBF approach has been that small samples were used to claim support for the point null-hypothesis, with a high probability that this conclusion was a false negative result. Researchers should always report the 95% confidence interval around the observed effect size. If this interval includes effect sizes of .2 standard deviations, the inference in favor of a null-result is questionable because many effect sizes in psychology are small. Confidence intervals (or Bayesian credible intervals with plausible priors) are more useful for claims about the absence of an effect than misleading statistics that pretend to provide strong evidence in favor of a false null-hypothesis.
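The confidence-interval check can be sketched in a few lines. This is an approximation under stated assumptions: it uses a common large-sample formula for the standard error of a standardized mean difference and a normal critical value, and the function names and sample sizes are hypothetical. Only the .2 threshold comes from the argument above.

```python
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval for a standardized mean difference d.

    Assumption: large-sample standard error
    sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
    combined with a normal critical value.
    """
    se = ((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

def rules_out_small_effects(d, n1, n2, smallest=0.2):
    """True only if the whole interval lies strictly between -smallest and +smallest."""
    lower, upper = d_confidence_interval(d, n1, n2)
    return -smallest < lower and upper < smallest

# Hypothetical studies that both observed d = 0
for n in (20, 1000):
    lower, upper = d_confidence_interval(0.0, n, n)
    verdict = rules_out_small_effects(0.0, n, n)
    print(f"n = {n:4d} per group  95% CI = [{lower:.2f}, {upper:.2f}]  rules out |d| >= .2: {verdict}")
```

An observed effect of exactly zero in a study with 20 participants per group still leaves effect sizes well beyond .2 inside the interval, whereas the same observed effect with 1,000 per group excludes them.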