
How Can We Interpret Inferences with Bayesian Hypothesis Tests?

SUMMARY

In this blog post I show how the results of a Bayesian Hypothesis Test can be translated into an equivalent frequentist statistical test that follows Neyman and Pearson's approach to hypothesis testing, in which hypotheses are specified as ranges of effect sizes (critical regions) and observed effect sizes are used to make inferences about population effect sizes with known long-run error rates.

INTRODUCTION

The blog post also explains why it is misleading to interpret Bayes Factors that favor the null-hypothesis (d = 0) over an alternative hypothesis (e.g., Jeffreys' prior) as evidence for the absence of an effect. This conclusion is only warranted with infinite sample sizes. With finite sample sizes, especially the small sample sizes that are typical in psychology, Bayes Factors in favor of H0 can only be interpreted as evidence that the population effect size is close to zero, not as evidence that the population effect size is exactly zero. How close to zero the effect sizes consistent with H0 are depends on the sample size and on the criterion value that is used to treat the results of a study as sufficient evidence for H0.

One problem with Bayes Factors is that, like p-values, they are a continuous measure (a ratio of likelihoods, just as p-values are probabilities), and the observed value by itself is not sufficient to justify an inference or interpretation of the data. This is why psychologists moved from Fisher's approach to Neyman and Pearson's approach, which compares an observed p-value to a criterion value specified by convention or pre-registration. For p-values this criterion is alpha. If p < alpha, we reject H0:d = 0 in favor of H1 and conclude that there was a (positive or negative) effect.

Most researchers interpret Bayes Factors relative to some criterion value (e.g., BF > 3, BF > 5, or BF > 10). These criterion values are just as arbitrary as the .05 criterion for p-values, and the only justification for them that I have seen is that Jeffreys, who invented Bayes Factors, said so. There is nothing wrong with a conventional criterion value, even if Bayesians think there is something wrong with p < .05 and then use BF > 3 in just the same way, but it is important to understand the implications of using a particular criterion value for an inference. In NHST the criterion value has a clear meaning: in the long run, the rate of false inferences (deciding in favor of H1 when H1 is false) will not be higher than the criterion value. With alpha = .05 as a conventional criterion, a research community decided that it is acceptable to have a maximum error rate of 5%. Unlike p-values, criterion values for Bayes Factors provide no information about error rates. The best way to understand what a Bayes Factor of 3 means is to assume that H0 and H1 are equally probable before we conduct a study; a Bayes Factor of 3 in favor of H0 then makes it 3 times more likely that H0 is true than that H1 is true. If we were gambling on results and the truth were known, we would increase our winning odds from 50:50 to 75:25. With a Bayes Factor of 5, the winning odds increase to about 83:17.
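To make the betting analogy concrete, here is a minimal sketch (my own illustration, not part of any Bayesian software) of how a Bayes Factor in favor of H0 maps onto posterior odds and a posterior probability when H0 and H1 are treated as equally likely a priori:

# Posterior probability of H0 implied by a Bayes Factor in favor of H0
# when H0 and H1 are treated as equally likely a priori (prior odds = 1).
bf.to.posterior = function(bf01, prior.odds = 1) {
  post.odds = prior.odds * bf01   # posterior odds of H0 over H1
  post.odds / (post.odds + 1)     # posterior probability of H0
}
bf.to.posterior(3)   # 0.75, i.e., winning odds of 75:25
bf.to.posterior(5)   # about 0.83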

HYPOTHESIS TESTING VERSUS EFFECT SIZE ESTIMATION

p-values and Bayes Factors also share another shortcoming: they provide information about the probability of the data given one hypothesis (or a pair of hypotheses), but they do not describe the data themselves. We all know that we should not report results as "X influenced Y, p < .05". The reason is that this statement provides no information about the effect size. The effect size could be tiny, d = 0.02, small, d = .20, or large, d = .80. Thus, it is now required to provide some information about raw or standardized effect sizes and ideally also about the amount of raw or standardized sampling error. For example, standardized effect sizes could be reported as the standardized mean difference with its sampling error (d = .3, se = .15) or with a confidence interval (d = .3, 95% CI = 0 to .6). This is important information about the actual data, but it does not by itself provide a hypothesis test. Thus, if the results of a study are used to test hypotheses, information about effect sizes and sampling error has to be evaluated against specified criterion values that determine which hypothesis is consistent with an observed effect size.
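As a minimal sketch of this kind of reporting (assuming two independent groups of equal size and the common large-sample approximation for the standard error of d; the group means, SD, and sample size below are made up for illustration):

# Standardized mean difference, its sampling error, and a 95% confidence
# interval for a two-group design (large-sample approximation of the SE of d).
report.d = function(m1, m2, sd.pooled, n.per.group) {
  d  = (m1 - m2) / sd.pooled
  se = sqrt(2 / n.per.group + d^2 / (4 * n.per.group))
  ci = d + c(-1.96, 1.96) * se
  round(c(d = d, se = se, ci.low = ci[1], ci.high = ci[2]), 2)
}
report.d(m1 = 10.6, m2 = 10.0, sd.pooled = 2, n.per.group = 90)
# roughly d = .3, se = .15, 95% CI from about 0 to .6, as in the example above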

RELATING HYPOTHESIS TESTS TO EFFECT SIZE ESTIMATION

In NHST, it is easy to see how p-values are related to effect size estimation. A confidence interval around the observed effect size is constructed by multiplying the amount of sampling error by a factor that is determined by alpha. The 95% confidence interval covers all values around the observed effect size except the most extreme 5% of values in the tails of the sampling distribution. It follows that any significance test that compares the observed effect size against a value outside the confidence interval will produce a p-value below the error criterion.
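A small sketch of this correspondence, using the normal approximation and the illustrative values d = .3 and se = .15 from the example above:

# Two-sided p-values for tests of the observed effect size against other values:
# values outside the 95% confidence interval are rejected at alpha = .05.
obs.d = 0.30
se    = 0.15
obs.d + c(-1.96, 1.96) * se                      # 95% CI, about 0.01 to 0.59
p.against = function(d0) 2 * pnorm(-abs(obs.d - d0) / se)
p.against(0.0)   # about .046: 0 falls just outside the interval
p.against(0.6)   # about .046: 0.6 falls just outside the interval
p.against(0.3)   # 1: the observed value itself is never rejected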

It is not so straightforward to see how Bayes Factors relate to effect size estimates. Rouder et al. (2016) discuss a scenario in which the 95% credibility interval around the most likely effect size of d = .165 ranges from .055 to .275 and excludes zero. Thus, an evaluation of the null-hypothesis, d = 0, in terms of a 95% credibility interval would lead to the rejection of the point-zero hypothesis. We cannot conclude from this evidence that an effect is absent; the most reasonable inference is that the population effect size is likely to be small, d ~ .2. In this scenario, Rouder et al. obtained a Bayes Factor of 1. This Bayes Factor does not support H0, but it also does not provide support for H1. How is it possible that two Bayesian methods seem to produce contradictory results? One method rejects H0:d = 0 and the other method shows no more support for H1 than for H0:d = 0.

Rouder et al. acknowledge the divergence without explaining it: "Here we have a divergence. By using posterior credible intervals, we might reject the null, but by using Bayes' rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point" (p. 536). Moreover, they suggest that Bayes Factors give the correct answer and that the rejection of d = 0 by means of credibility intervals is unwarranted: "Updating with Bayes' rule directly is the correct approach because it describes appropriate conditioning of belief about the null point on all the information in the data" (p. 536).

The problem with this interpretation of the discrepancy is that Rouder et al. misinterpret the meaning of a Bayes Factor as if it could be directly interpreted as a test of the null-hypothesis, d = 0. However, in more thoughtful articles the same authors recognize that (a) Bayes Factors only provide relative information about H0 in comparison to a specific alternative hypothesis H1, (b) the specification of H1 influences Bayes Factors, (c) alternative hypotheses that give a high a priori probability to large effect sizes favor H0 when the observed effect size is small, and (d) it is always possible to specify an alternative hypothesis (H1) that will not favor H0 by limiting H1 to small effect sizes. For example, even with a small observed effect size of d = .165, it is possible to provide strong support for H1 and reject H0 if H1 is specified as Cauchy(0,0.1) and the sample size is sufficiently large to distinguish H0 from H1.

[Figure 1: Bayes Factors for an observed effect size of d = .165 as a function of sample size (N) and the Cauchy scaling factor (r) used to specify H1]
Figure 1 shows how Bayes Factors vary as a function of the specification of H1 and as a function of sample size for the same observed effect size of d = .165. It is possible to get a Bayes Factor greater than 3 in favor of H0 with a wide Cauchy(0,1) prior and a small sample size of N = 100, and a Bayes Factor greater than 3 in favor of H1 with a scaling factor of .4 or smaller and a sample size of N = 250. In short, it is not possible to interpret Bayes Factors that favor H0 as evidence for the absence of an effect. The Bayes Factor only tells us that the observed effect size is more consistent with H0 than with H1, and it is difficult to interpret this result because H1 is not a clearly specified alternative effect size; the effect sizes that favor H0 over H1 change not only with the specification of the range of effect sizes but also with sample size. This property is not a design flaw of Bayes Factors. They were designed to provide more and more stringent tests of H0:d = 0 that would eventually support H1 if the sample size is sufficiently large and H0:d = 0 is false. However, if H0 is false and H1 includes many large effect sizes (an ultrawide prior), Bayes Factors will first favor H0, and data collection may stop before the Bayes Factor switches and provides the correct result that the population effect size is not zero. This behavior of Bayes Factors was illustrated by Rouder et al. (2009) with a simulation of a population effect size of d = .02.
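The pattern in Figure 1 can be sketched with the ttest.tstat function from the BayesFactor package (the same function used in the R code at the end of this page). This is only an approximation of the figure because it assumes a one-sample design in which the observed d = .165 translates into t = d*sqrt(N), and ttest.tstat returns the natural log of BF(H1/H0):

# Bayes Factors for a fixed observed effect size of d = .165 as a function of
# sample size (N) and the Cauchy scaling factor (r) used to specify H1.
library(BayesFactor)
d = 0.165
for (r in c(1, 0.4, 0.1)) {
  for (N in c(100, 250, 1000)) {
    t = d * sqrt(N)                                            # one-sample t-value
    logbf10 = ttest.tstat(t = t, n1 = N, rscale = r)[["bf"]]   # log of BF(H1/H0)
    cat(sprintf("r = %.1f, N = %4d, BF(H1/H0) = %7.2f, BF(H0/H1) = %7.2f\n",
                r, N, exp(logbf10), 1/exp(logbf10)))
  }
}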

 

[Figure 2: Bayes Factor as a function of sample size for a simulated population effect size of d = .02 (Rouder et al., 2009)]
Here we see that the Bayes Factor favors H0 until sample sizes are above N = 5,000 and only provides the correct information that the point null-hypothesis is false with N = 20,000 or more. To avoid confusion in the interpretation of Bayes Factors and to provide a better understanding of the actual regions of effect sizes that are consistent with H0 and H1, I developed simple R code that translates the results of a Bayesian Hypothesis Test into a Neyman-Pearson hypothesis test.
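The trajectory in this figure can be roughly sketched by feeding the expected t-value for each sample size into the same Bayes Factor calculation (a deterministic approximation of my own; the simulated curve in the figure and the exact crossover point also depend on sampling noise and on the prior scale):

# Rough sketch of the Bayes Factor trajectory for a true effect of d = .02:
# with a wide prior, BF(H0/H1) first grows and only reverses at very large N.
# Expected t-values (t = d*sqrt(N)) are used instead of simulated data.
library(BayesFactor)
d = 0.02
for (N in c(100, 1000, 5000, 20000, 100000)) {
  logbf10 = ttest.tstat(t = d * sqrt(N), n1 = N, rscale = 1)[["bf"]]
  cat(sprintf("N = %6d, BF(H0/H1) = %8.2f\n", N, 1/exp(logbf10)))
}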

TRANSLATING RESULTS FROM A BAYESIAN HYPOTHESIS TEST INTO RESULTS FROM A NEYMAN PEARSON HYPOTHESIS TEST

A typical analysis with Bayes Factors creates three regions of observed effect sizes. One region is defined by BF > BF.crit in favor of H1 over H0. A second region is defined by inconclusive Bayes Factors that do not reach the criterion in either direction (1/BF.crit < BF(H1/H0) < BF.crit). The third region is defined by observed effect sizes between 0 and the largest effect size that still meets the criterion BF > BF.crit in favor of H0.
The width and location of these regions depend on the specification of H1 (a wider or narrower distribution of effect sizes under the assumption that an effect is present), the sample size, and the long-run error rate, where an error is defined as a BF > BF.crit that supports H0 when H1 is true and vice versa.
I examined the properties of BF for two scenarios. In one scenario researchers specify H1 as a Cauchy(0,.4). The value of .4 was chosen because .4 is a reasonable estimate of the median effect size in psychological research. I chose a criterion value of BF.crit = 5 to maintain a relatively low error rate.
I used a one sample t-test with n = 25, 100, 200, 500, and 1,000. The same amount of sampling error would be obtained in a two-sample design with 4x the sample size (N = 100, 400, 800, 2,000, and 4,000).
     bf.crit    N       bf0 ci.low border ci.high     alpha
[1,]       5   25  2.974385     NA     NA   0.557        NA
[2,]       5  100  5.296013  0.035 0.1535   0.272 0.1194271
[3,]       5  200  7.299299  0.063 0.1300   0.197 0.1722607
[4,]       5  500 11.346805  0.057 0.0930   0.129 0.2106060
[5,]       5 1000 15.951191  0.048 0.0715   0.095 0.2287873
We see that the typical sample size in cognitive psychology with a within-subject design (n = 25) will never produce a result in favor of H0, and that it requires an observed effect size of d = .56 to produce a result in favor of H1. This criterion is somewhat higher than the criterion effect size for p < .05 (two-tailed), which is d = .41, and approximately the same as the effect size needed with alpha = .01, d = .56.
With N = 100, it is possible to obtain evidence for H0. If the observed effect size is exactly 0, BF = 5.296, and the maximum observed effect size that still produces evidence in favor of H0 is d = .035. The minimum observed effect size needed to support H1 is d = .272. We can think about these two criterion values as the limits of a confidence interval around the effect size in the middle (d = .1535). The width of this interval implies that, in the long run, we would make about 11% errors in favor of H0 and 11% errors in favor of H1 if the population effect size were d = .1535. If we treat d = .1535 as the boundary for an interval null-hypothesis, H0:abs(d) < .1535, we do not make a mistake when the population effect size is less than .1535. So, we can interpret a BF > 5 in favor of H0 as evidence for H0:abs(d) < .15, with an 11% error rate. The probability of supporting H0 when the population effect size is a statistically small d = .2 would be less than 11%. In short, we can interpret BF > 5 in favor of H0 as evidence for abs(d) < .15 and BF > 5 in favor of H1 as evidence for abs(d) > .15, with approximate error rates of 10% and a region of inconclusive evidence for observed effect sizes between d = .035 and d = .272.
The results for N = 200, 500, and 1,000 can be interpreted in the same way. An increase in sample size has the following effects: (a) the boundary effect size d.b that separates H0:|d| <= d.b and H1:|d| > d.b shrinks; in the limit it reaches zero and only d = 0 supports H0:|d| <= 0. With N = 1,000, the boundary value is d.b = .048 and an observed effect size of d = .0715 provides sufficient evidence for H1. However, the table also shows that (b) the error rate increases. In larger samples a BF of 5 in one direction or the other occurs more easily by chance, and the long-run error rate has roughly doubled. Of course, researchers could maintain a fixed error rate by adjusting the BF criterion value, but Bayesian hypothesis tests are not designed to maintain a fixed error rate. If this were a researcher's goal, they could simply specify alpha and use NHST to test H0:|d| < d.crit vs. H1:|d| > d.crit.
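The boundary values in the table can be approximated with a simple search. This is my own sketch using the BayesFactor package and a one-sample design, not the exact code that generated the table:

# For a given sample size and Cauchy prior, find the observed effect sizes at
# which BF(H0/H1) = BF.crit (upper limit of the H0 region) and
# BF(H1/H0) = BF.crit (lower limit of the H1 region).
library(BayesFactor)
bf10 = function(d, N, r) exp(ttest.tstat(t = d*sqrt(N), n1 = N, rscale = r)[["bf"]])
bf.crit = 5; N = 100; r = 0.4
d.H0 = uniroot(function(d) bf10(d, N, r) - 1/bf.crit, c(0.001, 1))$root
d.H1 = uniroot(function(d) bf10(d, N, r) - bf.crit,   c(0.001, 1))$root
round(c(max.d.for.H0 = d.H0, min.d.for.H1 = d.H1), 3)
# should come out near the ci.low and ci.high values for N = 100 in the table above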
In practice, many researchers use a wider prior and a lower criterion value. For example, E.J. Wagenmakers prefers the original Jeffreys prior with a scaling factor of 1 and a criterion value of 3 as noteworthy (but not definitive) evidence.
The next table translates inferences with a Cauchy(0,1) and BF.crit = 3 into effect size regions.
     bf.crit    N       bf0 ci.low border ci.high     alpha
[1,]       3   25  6.500319  0.256 0.3925   0.529 0.2507289
[2,]       3  100 12.656083  0.171 0.2240   0.277 0.2986493
[3,]       3  200 17.812296  0.134 0.1680   0.202 0.3155818
[4,]       3  500 28.080784  0.094 0.1140   0.134 0.3274574
[5,]       3 1000 39.672827  0.071 0.0850   0.099 0.3290325

The main effect of using Cauchy(0,1) to specify H1 is that the border value that distinguishes H0 and H1 is higher. The main effect of using BF.crit = 3 as a criterion value is that it is easier to provide evidence for H0 or H1 at the expense of having a higher error rate.

It is now possible to provide evidence for H0 with a small sample of n = 25 in a one-sample t-test. However, when we translate this finding into ranges of effect sizes, we see that the boundary between H0 and H1 is d = .39; any observed effect size below d = .256 yields a BF in favor of H0. It would therefore be misleading to interpret a BF of 3 in a sample of n = 25 as evidence for the point null-hypothesis d = 0. It only shows that the observed effect size is more consistent with an effect size of 0 than with the effect sizes specified by H1, which places a lot of weight on large effect sizes. As sample sizes increase, the meaning of BF > 3 in favor of H0 changes. With N = 1,000, any observed effect size larger than d = .071 no longer provides evidence for H0. In the limit, with an infinite sample size, only d = 0 would provide evidence for H0 and we could infer that H0 is true. However, BF > 3 in finite samples does not justify this inference.

The translation of BF results into hypotheses about effect size regions makes it clear why BF results in small samples often seem to diverge from hypothesis tests with confidence intervals or credibility intervals. In small samples, Bayes Factors are sensitive to the specification of H1, and even if it is unlikely that the population effect size is 0 (0 is outside the confidence or credibility interval), the BF may show support for H0 because the observed effect size is below the criterion value that is needed to support H1. This inconsistency does not mean that different statistical procedures lead to different inferences. It only means that BF > 3 in favor of H0 RELATIVE TO H1 cannot be interpreted as a test of the hypothesis d = 0. It can only be interpreted as evidence for H0 relative to H1, and the specification of H1 influences which effect sizes provide support for H0.

CONCLUSION

Sir Arthur Eddington (cited by Cacioppo & Berntson, 1994) described a hypothetical scientist who sought to determine the size of the various fish in the sea. The scientist began by weaving a 2-in. mesh net and setting sail across the seas, repeatedly sampling catches and carefully measuring, recording, and analyzing the results of each catch. After extensive sampling, the scientist concluded that there were no fish smaller than 2 in. in the sea.

The moral of this story is that a scientist's method influences their results. Scientists who use p-values to search for significant results in small samples will rarely discover small effects and may start to believe that most effects are large. Similarly, scientists who use Bayes Factors with wide priors may delude themselves that they are searching for both small and large effects and falsely believe that effects are either absent or large. In both cases, scientists make the same mistake. A small sample is like a net with large holes that can only (reliably) capture big fish. This is fine if the goal is to capture only big fish, but it is a problem when the goal is to find out whether a pond contains any fish at all. A net with big holes may never lead to the discovery of a fish in the pond, while there are plenty of small fish in the pond.

Researchers therefore have to be careful when they interpret a Bayes Factor, and they should not interpret a Bayes Factor in favor of H0 as evidence for the absence of an effect. This fallacy is just as problematic as the fallacy of interpreting a p-value above alpha (p > .05) as evidence for the absence of an effect. Most researchers are aware that non-significant results do not justify the inference that the population effect size is zero. It may be news to some that a Bayes Factor in favor of H0 suffers from the same problem. A Bayes Factor in favor of H0 is better considered a finding that rejects the specific alternative hypothesis that was pitted against d = 0. Falsification of this specific H1 does not justify the inference that H0:d = 0 is true. Another model that was not tested could still fit the data better than H0.


Bayes Ratios: A Principled Approach to Bayesian Hypothesis Testing

 

This post is a stub that will be expanded and eventually be turned into a manuscript for publication.

 

I have written a few posts before that are critical of Bayesian Hypothesis Testing with Bayes Factors (Rouder et al., 2009; Wagenmakers et al., 2010, 2011).

The main problem with this approach is that it typically compares a single effect size (typically 0) with an alternative hypothesis that is a composite of all other effect sizes. The alternative is usually specified as a weighted composite, with a Cauchy distribution providing the weights for the different effect sizes. This leads to a comparison of H0:d = 0 vs. H1:d ~ Cauchy(0, r), with r being a scaling factor that specifies the median absolute effect size under the alternative hypothesis.

It is well recognized by critics and proponents of this test that the comparison of H0 and H1 favors H0 more and more as the scaling factor is increased.  This makes the test sensitive to the specification of H1.

Another problem is that Bayesian hypothesis testing either uses arbitrary cutoff values (e.g., BF > 3) to interpret the results of a study or asks readers to specify their own prior odds of H0 and H1. I have started to criticize this approach because the use of subjective prior odds in combination with an objective specification of the alternative hypothesis can lead to false conclusions. If I compare H0:d = 0 with H1:d = .2, I am comparing two hypotheses that each consist of a single value. If I am very uncertain about the results of a study, I can assign an equal prior probability to both effect sizes, and the prior odds of H0/H1 are .5/.5 = 1. In this case, the Bayes Factor can be directly interpreted as the posterior odds of H0 and H1 given the data.

Bayes Ratio (H0/H1) = Prior Odds (H0/H1) * Bayes Factor (H0/H1)

However, if I increase the range of possible effect sizes for H1 because I am uncertain about the actual effect size, the a priori probability of H1 increases, just like my odds of winning increase when I spread my bet across several possible outcomes (lottery numbers, horses in the Kentucky Derby, or numbers in a roulette game). Betting on effect sizes is no different, and the prior odds in favor of H1 increase the more effect sizes I consider plausible.

I therefore propose to use the prior distribution of effect sizes to specify my uncertainty about what could happen in a study. If I think the null-hypothesis is most likely, I can give it more weight than other effect sizes (e.g., with a Cauchy or normal distribution centered at 0). I can then use this distribution to compute (a) the prior odds of H0 and H1 and (b) the conditional probabilities of the observed test statistic (e.g., a t-value) given H0 and H1.

Instead of interpreting Bayes Factors directly, which is not Bayesian and confuses conditional probabilities of data given hypotheses with conditional probabilities of hypotheses given data, Bayes Factors are multiplied with the prior odds to get Bayes Ratios, which many Bayesians consider to be the answer to the question researchers actually want answered: how much should I believe H0 or H1 after I collected data and computed a test statistic like a t-value?

This approach is more principled and Bayesian than the use of Bayes Factors with arbitrary cut-off values that are easily misinterpreted as evidence for H0 or H1.

One reason why this approach may not have been used before is that H0 is often specified as a point value (d = 0), and the a priori probability of a single point under a continuous prior distribution is 0. Thus, the prior odds (H0/H1) are zero and the Bayes Ratio is also zero. This problem can be avoided by restricting H1 to a reasonably small range of effect sizes and by specifying the null-hypothesis as a small range of effect sizes around zero. As a result, it becomes possible to obtain non-zero prior odds for H0 and to obtain interpretable Bayes Ratios.
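Here is a minimal sketch of this idea that computes the prior odds directly from the Cauchy distribution function. The R code at the end of this post does the same thing with a discretized distribution, so the numbers differ slightly; the limit of |d| < 14 follows that code, and the Bayes Factor value is just a placeholder for illustration:

# Prior odds of an interval null-hypothesis H0: |d| < .1 against
# H1: .1 < |d| < 14 under a Cauchy(0,1) prior, combined with a Bayes Factor.
prior.H0 = pcauchy(.1) - pcauchy(-.1)                # prior mass inside the H0 region
prior.H1 = (pcauchy(14) - pcauchy(-14)) - prior.H0   # prior mass of H1 within +/-14
prior.odds.H0 = prior.H0 / prior.H1                  # roughly .07 to .93
BF01 = 3                                             # illustrative Bayes Factor for H0
Bayes.Ratio01 = prior.odds.H0 * BF01                 # well below 1: H1 remains more likely
Bayes.Ratio01 / (Bayes.Ratio01 + 1)                  # posterior probability of H0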

The inferences based on Bayes Ratios are not only more principled than those based on Bayes Factors, they are also more in line with the inferences one would draw on the basis of other methods that can be used to test H0 and H1, such as confidence intervals or Bayesian credibility intervals.

For example, imagine a researcher who wants to provide evidence for the null-hypothesis that there are no gender differences in intelligence. The researcher decides a priori that differences of less than 1.5 IQ points (0.1 standard deviations) are small enough to count as support for the null-hypothesis. He collects data from 50 men and 50 women and finds a mean difference of 3 IQ points in one or the other direction (conveniently, it does not matter in which direction).

With a standardized mean difference of d = 3/15 = .2 and a sampling error of SE = 2/sqrt(100) = .2, the t-value is t = .2/.2 = 1. A t-value of 1 is not statistically significant. Thus, it is clear that the data do not provide evidence against H0 that there are no gender differences in intelligence. However, do the data provide sufficient positive evidence for the null-hypothesis? p-values are not designed to answer this question. The 95% CI around the observed standardized effect size is -.19 to .59. This confidence interval is wide. It includes 0, but it also includes d = .2 (a small effect size) and d = .5 (a moderate effect size), which would translate into a difference of 7.5 IQ points. Based on this finding, it would be questionable to interpret the data as support for the null-hypothesis.
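For reference, the arithmetic of this example in a few lines of R (all values are taken from the description above):

# Gender-difference example: observed difference of 3 IQ points, SD = 15,
# N = 100 (50 men, 50 women).
d  = 3/15                   # standardized mean difference, d = .2
se = 2/sqrt(100)            # sampling error of d for two groups, se = .2
t  = d/se                   # t = 1, not statistically significant
ci = d + c(-1.96, 1.96)*se  # 95% CI from about -.19 to .59
round(c(d = d, se = se, t = t, ci.low = ci[1], ci.high = ci[2]), 2)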

With a default specification of the alternative hypothesis as a Cauchy distribution with a scaling factor of 1, the Bayes Factor (H0/H1) favors H0 over H1 by 4.95:1. The most appropriate interpretation of this finding is that the prior odds, whatever they are, should be updated by a factor of about 5:1 in favor of H0. However, following Jeffreys, many users who compute Bayes Factors interpret them directly with reference to Jeffreys' criterion values, and a value greater than 3 can be and has been used to suggest that the data provide support for the null-hypothesis.

This interpretation ignores that the a priori distribution of effect sizes allocates only a small probability (p = .07) to H0 and a much larger probability to H1 (p = .93). When the Bayes Factor is combined with the prior odds (H0/H1) of .07/.93 = .075, the resulting Bayes Ratio shows that support for H0 increased but that it is still more likely that H1 is true than that H0 is true, .075 * 4.95 = .37. This conclusion is consistent with the finding that the 95% CI overlaps with the region of effect sizes for H0 (d = -.1 to .1).

We can increase the prior odds of H0 by restricting the range of effect sizes that are plausible under H1. For example, we can limit effect sizes to |d| < 1, or we can set the scaling parameter of the Cauchy distribution to .5 so that 50% of the distribution falls into the range between d = -.5 and d = .5.

The t-value and 95%CI remain unchanged because they do not require a specification of H1.  By cutting the range of effect sizes for H1 roughly in half (from scaling parameter 1 to .5), the Bayes-Factor in favor of H0 is also cut roughly in half and is no longer above the criterion value of 3, BF (H0/H1) = 2.88.

The change of the alternative hypothesis has the opposite effect on the prior odds. The prior probability of H0 nearly doubles (p = .13) and the prior odds are now .13/.87 = .15. The resulting Bayes Ratio in favor of H0, .15 * 2.88 = .43, remains similar to the Bayes Ratio with the wider Cauchy distribution and is actually a bit stronger than the Bayes Ratio obtained with the wider specification of effect sizes (.37). However, both Bayes Ratios lead to the same conclusion, which is also consistent with the observed effect size, d = .2, and the confidence interval around it, d = -.19 to d = .59: given the small sample size, the observed effect size provides insufficient information to draw any firm conclusions about H0 or H1. More data are required to decide empirically which hypothesis is more likely to be true.

The example used an arbitrary observed effect size of d = .2. Evidently, effect sizes much larger than this would lead to the rejection of H0 with p-values, confidence intervals, Bayes Factors, or Bayes Ratios. A more interesting question is what the results would look like if the observed effect size provided maximum support for the null-hypothesis, that is, an observed effect size of 0, which also produces a t-value of 0. With the default Cauchy(0,1) prior, the Bayes Factor in favor of H0 is 9.42, which is close to the next criterion value of BF > 10 that is sometimes used to stop data collection because the results are considered decisive. However, the Bayes Ratio is still slightly in favor of H1, BR(H1/H0) = 1.42. The 95% CI ranges from -.39 to .39 and overlaps with the criterion range of effect sizes from -.1 to .1. Thus, the Bayes Ratio shows that even an observed effect size of 0 in a sample of N = 100 provides insufficient evidence to infer that the null-hypothesis is true.

When we increase the sample size to N = 2,000, the 95% CI around d = 0 ranges from -.09 to .09. This finding means that the data support the null-hypothesis (as specified by the region from -.1 to .1) and that we would make a mistake in no more than 5% of inferences that use this approach (not just those that provide evidence for H0, but all tests that use this approach). The Bayes Factor also favors H0, with a massive BF(H0/H1) = 711.27. The Bayes Ratio also favors H0, with a value of 53.35. As a Bayes Ratio is the ratio of two complementary probabilities, p(H0) + p(H1) = 1, we can compute the probability of H0 being true with the formula BR(H0/H1) / (BR(H0/H1) + 1), which yields a probability of 98%. We see how the Bayes Ratio is consistent with the information provided by the confidence interval. The long-run error rate for inferring H0 from the data is less than 5%, and the probability of H1 being true given the data is 1 - .98 = .02.
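The conversion used in the last step is simple enough to show directly (a small sketch; the two input values are the Bayes Ratios reported above):

# Because p(H0/D) and p(H1/D) are complementary, a Bayes Ratio translates into
# a posterior probability as BR(H0/H1) / (BR(H0/H1) + 1).
br.to.prob = function(br) br / (br + 1)
br.to.prob(53.35)   # about .98: the N = 2,000 example
br.to.prob(0.37)    # about .27: the N = 100 example with the wide Cauchy prior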

Conclusion

Bayesian Hypothesis Testing has received increased interest among empirical psychologists, especially in situations where researchers aim to demonstrate the lack of an effect. Increasingly, researchers use Bayes Factors with criterion values to claim that their data provide evidence for the null-hypothesis. This is wrong for two reasons.

First, it is impossible to test a hypothesis that is specified as one effect size out of an infinite number of alternative effect sizes. Researchers appear to be confused when they use Bayes Factors in favor of H0 to suggest that all other effect sizes are implausible. This is not the case, because Bayes Factors do not compare H0 to each other effect size separately; they compare H0 to a composite hypothesis of all other effect sizes, and Bayes Factors depend on the way the composite is created. Falsification of one composite does not ensure that the null-hypothesis is true (the only viable hypothesis still standing) because other composites can still fit the data better than H0.

Second, the use of Bayes Factors with criterion values also suffers from the problem that it ignores the a priori odds of H0 and H1. A fully Bayesian inference requires taking the prior odds into account and computing posterior odds, or Bayes Ratios. The problem for the point null-hypothesis (d = 0) is that the prior odds of H0 over H1 are 0. The reason is that the prior distribution of effect sizes adds up to 1 (the true effect size has to be somewhere), leaving zero probability for the single point d = 0. It is possible to compute Bayes Factors for d = 0 because Bayes Factors use densities. For the computation of Bayes Factors the distinction between densities and probabilities is not important, but for the computation of prior odds the distinction is important. A single effect size has a density on the Cauchy distribution, but it has zero probability.

The fundamental inferential problem of Bayes Factors that compare H0:d = 0 against a composite alternative can be avoided by specifying H0 as a small region around d = 0. It is then possible to compute prior odds based on the area under the prior distribution for H0 and the area under the prior distribution for H1. It is also possible to compute Bayes Factors for H0 and H1 when H0 and H1 are specified as complementary regions of effect sizes. The two ratios can be multiplied to obtain a Bayes Ratio, and Bayes Ratios can in turn be converted into the probability of H0 given the data and the probability of H1 given the data. The results of this test are consistent with other approaches to testing regional null-hypotheses, and they are robust to misspecifications of the alternative hypothesis that allocate too much weight to large effect sizes. Thus, I recommend Bayes Ratios for principled Bayesian hypothesis testing.

 

*************************************************************************

R-Code for the analyses reported in this post.

*************************************************************************

#######################
### set input
#######################

### What is the total sample size?
N = 2000

### How many groups?  One sample or two sample?
gr = 2

### what is the observed effect size
obs.es = 0

### Set the range for H0, H1 is defined as all other effect sizes outside this range
H0.range = c(-.1,.1)  #c(-.2,.2) # 0 for classic point null

### What is the limit for the maximum effect size? (d = 14 corresponds to r = .99)
limit = 14

### What is the mode of the a priori distribution of effect sizes?
mode = 0

### What is the variability (SD for normal, scaling parameter for Cauchy) of the a priori distribution of effect sizes?
var = 1

### What is the shape of the a priori distribution of effect sizes
shape = "Cauchy"  # Uniform, Normal, Cauchy; Uniform needs limit

### End of Input
### R computes Likelihood ratios and Weighted Mean Likelihood Ratio (Bayes Factor)
prec = 100 #set precision, 100 is sufficient for 2 decimal
df = N-gr
se = gr/sqrt(N)
pop.es = mode
if (var > 0) pop.es = seq(-limit*prec,limit*prec)/prec
weights = 1
if (var > 0 & shape == "Cauchy") weights = dcauchy(pop.es,mode,var)
if (var > 0 & shape == "Normal") weights = dnorm(pop.es,mode,var)
if (var > 0 & shape == "Uniform") weights = dunif(pop.es,-limit,limit)
H0.mat = cbind(0,1)
H1.mat = cbind(mode,1)
if (var > 0) H0.mat = cbind(pop.es,weights)[pop.es >= H0.range[1] & pop.es <= H0.range[2],]
if (var > 0) H1.mat = cbind(pop.es,weights)[pop.es < H0.range[1] | pop.es > H0.range[2],]
H0.mat = matrix(H0.mat,,2)
H1.mat = matrix(H1.mat,,2)
H0 = sum(dt(obs.es/se,df,H0.mat[,1]/se)*H0.mat[,2])/sum(H0.mat[,2])
H1 = sum(dt(obs.es/se,df,H1.mat[,1]/se)*H1.mat[,2])/sum(H1.mat[,2])
BF10 = H1/H0
BF01 = H0/H1
Pr.H0 = sum(H0.mat[,2]) / sum(weights)
Pr.H1 = sum(H1.mat[,2]) / sum(weights)
PriorOdds = Pr.H1/Pr.H0
Bayes.Ratio10 = PriorOdds*BF10
Bayes.Ratio01 = 1/Bayes.Ratio10
### R creates output file
text = c()
text[1] = paste0('The observed t-value with d = ',obs.es,' and N = ',N,' is t(',df,') = ',round(obs.es/se,2))
text[2] = paste0('The 95% confidence interval is ',round(obs.es-1.96*se,2),' to ',round(obs.es+1.96*se,2))
text[3] = paste0('Weighted Mean Density(H0:d >= ',H0.range[1],' & <= ',H0.range[2],') = ',round(H0,5))
text[4] = paste0('Weighted Mean Density(H1:d <= ',H0.range[1],' | >= ',H0.range[2],') = ',round(H1,5))
text[5] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H0/H1: ',round(BF01,2))
text[6] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H1/H0: ',round(BF10,2))
text[7] = paste0('The a priori likelihood ratio of H1/H0 is ',round(Pr.H1,2),'/',round(Pr.H0,2),' = ',round(PriorOdds,2))
text[8] = paste0('The Bayes Ratio(H1/H0) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio10,2))
text[9] = paste0('The Bayes Ratio(H0/H1) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio01,2))
### print output
text


How Does Uncertainty about Population Effect Sizes Influence the Probability that the Null-Hypothesis is True?

There are many statistical approaches that are often divided into three schools of thought: (a) Fisherian, (b) Neyman-Pearsonian, and (c) Bayesian. This post is about Bayesian statistics. Within Bayesian statistics, further distinctions can be made. One distinction is between Bayesian parameter estimation (credibility intervals) and Bayesian hypothesis testing. This post is about Bayesian hypothesis testing. One goal of Bayesian hypothesis testing is to provide evidence for the null-hypothesis. It is often argued that Bayesian Null-Hypothesis Testing (BNHT) is superior to the widely used method of null-hypothesis testing with p-values. This post is about the ability of BNHT to test the null-hypothesis.

The crucial idea of BNHT is that it is possible to contrast the null-hypothesis (H0) with an alternative hypothesis (H1) and to compute the relative probability that the data support one hypothesis versus the other: p(H0/D) / p(H1/D). If this ratio is large enough (e.g., p(H0/D) / p(H1/D) > criterion), it can be stated that the data support the null-hypothesis more than the alternative hypothesis.

To compute the ratio of the two conditional probabilities, researchers need to quantify two ratios. One ratio is the ratio of the prior probabilities that H0 or H1 are true: p(H0)/p(H1). This ratio does not have a common name; I call it the probability ratio (PR). The other ratio is the ratio of the conditional probabilities of the data given H0 and H1. This ratio is often called a Bayes Factor (BF): BF = p(D/H0)/p(D/H1).

To make claims about H0 and H1 based on some observed test statistic,  the Probability Ratio has to be multiplied with the Bayes Factor.

p(H0/D) / p(H1/D) = [p(H0) x p(D/H0)] / [p(H1) x p(D/H1)] = PR * BF
The main reason for calling this approach Bayesian is that Bayesian statisticians are willing and required to specify a priori probabilities of hypotheses before any data are collected. In the formula above, p(H0) and p(H1) are the a priori probabilities that the population effect size is 0, p(H0), or that it is some other value, p(H1). However, in practice BNHT is often used without specifying these a priori probabilities.

"Table 1 provides critical t values needed for JZS Bayes factor values of 1/10, 1/3, 3, and 10 as a function of sample size. This table is analogous in form to conventional t-value tables for given p value criteria. For instance, suppose a researcher observes a t value of 3.3 for 100 observations. This t value favors the alternative and corresponds to a JZS Bayes factor less than 1/10 because it exceeds the critical value of 3.2 reported in the table. Likewise, suppose a researcher observes a t value of 0.5. The corresponding JZS Bayes factor is greater than 10 because the t value is smaller than 0.69, the corresponding critical value in Table 1. Because the Bayes factor is directly interpretable as an odds ratio, it may be reported without reference to cutoffs such as 3 or 1/10. Readers may decide the meaning of odds ratios for themselves" (Rouder et al., 2009).

The use of arbitrary cutoff values (3 or 10) for Bayes Factors is not a complete Bayesian statistical analysis because it does not provide information about the hypotheses given the data. Bayes Factors alone only provide information about the ratio of the conditional probabilities of the data given two alternative hypotheses, and the two ratios are not equivalent:

p(H0/D) / p(H1/D) ≠ p(D/H0) / p(D/H1)

In practice, users of BNHT are often unaware of, or ignore, the need to think about the base rates of H0 and H1 when they interpret Bayes Factors. The main point of this post is to demonstrate that Bayes Factors that compare the null-hypothesis of a single effect size against an alternative hypothesis that combines many effect sizes (all effect sizes that are not zero) can be deceptive, because the ratio p(H0)/p(H1) decreases as the number of effect sizes under H1 increases. In the limit, the a priori probability of the null-hypothesis being true is zero, which implies that no data can provide evidence for it: any Bayes Factor that is multiplied with zero is zero, and it remains reasonable to believe in the alternative hypothesis no matter how strongly a Bayes Factor favors the null-hypothesis.

The following urn experiment explains the logic of my argument, points out a similar problem in the famous Monty Hall problem, and provides R code to run simulations with different assumptions about the number and distribution of effect sizes and their implications for the probability ratio of H0 and H1 and for the Bayes Factors that are needed to provide evidence for the null-hypothesis.

An Urn Experiment of Population Effect Sizes

The classic example in statistics is the urn experiment. An urn is filled with balls of different colors. If the urn is filled with 100 balls and only one ball is red, and you get one chance to draw a ball from the urn without peeking, the probability of drawing the red ball is 1 out of 100, or 1%.

To think straight about statistics and probabilities it is helpful, unless you are some math genius who can really think in 10 dimensions, to remind yourself that even complicated probability problems are essentially urn experiments. The question is only what the urn experiment would look like.

In this post, I examine the urn experiment that corresponds to the Bayesian statistician's problem of specifying probabilities of effect sizes in experiments without any information that would help to guess which effect size is most likely.

To translate the Bayesian problem of the prior into an urn experiment, we first have to turn effect sizes into balls. The problem is that effect sizes are typically continuous, but an urn can only be filled with discrete objects. The solution to this problem is to cut the continuous range of effect sizes into discrete units. The number of units depends on the desired precision. For example, effect sizes can be measured in standardized units with one decimal (d = 0, d = .1, d = .2, etc.), with two decimals (d = .00, d = .01, d = .02, etc.), or with 10 decimals. The more precise the measurement, the more discrete events are created. Instead of using colors, we can use balls with numbers printed on them, as you may have seen in lottery draws. In psychology, theories and empirical studies often are not very precise, and it would hardly be meaningful to distinguish between an effect size of d = .213 and an effect size of d = .214. Even two decimals are rarely needed, and the typical sampling error in psychological studies of d = .20 would make it impossible to distinguish between d = .33 and d = .38 empirically. So, it makes sense to translate the continuous range of effect sizes into balls with one-digit numbers, d = .0, d = .1, d = .2, and so on.

The second problem is that effect sizes can be positive or negative.  This is not really a problem because some balls can have negative numbers printed on them.  However, the example can be generalized from the one-sided scenario with only positive effect sizes to a two-sided scenario that also includes negative effects. To keep things simple, I use only positive effect sizes in this example.

The third problem is that some effect size measures are unlimited. However, in practice it is unreasonable to expect very large effect sizes and it is possible to limit the range of possible effect sizes at a maximum value.  The limit could be d = 10, d = 5, or d = 2.  For this example, I use a limit of d = 2.

It is now possible to translate the continuous measure of standardized effect sizes into 21 discrete events and to fill the urn with 21 balls that have the numbers 0, 0.1, 0.2, …, 2.0 printed on them.

The main point of Bayesian inference is to draw conclusions about the probability that a particular hypothesis is true given the results of an empirical study.  For example, how probable is it that the null-hypothesis is true when I observe an effect size of d = .2?  However, a study only provides information about the data given a specific hypothesis. How probable is it to observe an effect size of d = .2, if the null-hypothesis were true?  To answer the first question, it is necessary to specify the probability that the hypothesis is true independent of any data; that is, how probable is it that the null-hypothesis is true?

P(pop.es = 0 / obs.es = .2) = P(pop.es = 0) * P(obs.es = .2 / pop.es = 0) / P(obs.es = .2)

This looks scary, and for this post you do not need to understand the complete formula. It is just a mathematical way of saying that the probability that the population effect size (pop.es) is zero when the observed effect size (obs.es) is d = .2 equals the unconditional probability that the population effect size is zero, multiplied by the conditional probability of observing an effect size of d = .2 when the population effect size is 0, divided by the unconditional probability of observing an effect size of d = .2.

I only show this formula to highlight the fact that the main goal of Bayesian inference is to estimate the probability of a hypothesis (in this case, pop.es = 0) given some observed data (in this case, obs.es = .20) and that researchers need to specify the unconditional probability of the hypothesis (pop.es = 0) to do so.

We can now return to the urn experiment and ask how likely it is that a particular hypothesis is true. For example, how likely is it that the null-hypothesis is true? That is, how likely is it that we end up with a ball that has the number 0.0 printed on it when we conduct a study with an unknown population effect size? The answer is: it depends. It depends on the way our urn was filled. We of course do not know how often the null-hypothesis is true, but we can fill the urn in a way that expresses maximum uncertainty about the probability that the null-hypothesis is true. Maximum uncertainty means that all possible events are equally likely (Bayesian statisticians use a so-called uniform prior when the range of possible outcomes is fixed). So, we can fill the urn with one ball for each of the 21 effect sizes (0.0, 0.1, 0.2, …, 2.0). Now it is fairly easy to determine the a priori probability that the null-hypothesis is true. There are 21 balls and we draw one ball from the urn. Thus, the a priori probability of the null-hypothesis being true is 1/21 = .047.

As noted before, if the range of events increases because we specify a wider range of effect sizes (say effect sizes up to 10), the a priori probability of drawing the ball with 0.0 printed on it decreases. If we specify effect sizes with more precision (e.g., two digits), the probability of drawing the ball that has 0.00 printed on it decreases further.  With effect sizes ranging from 0 to 10 and being specified with two digits, there are 1001 balls in the urn and the probability of drawing the ball with 0.00 printed on it is 0.001.  Thus, even if the data would provide strong support for the null-hypothesis, the proper inference has to take into account that a priori it is very unlikely that a randomly drawn study had an effect size of 0.00.

As effect sizes are continuous and theoretically can range from minus infinity to plus infinity, there is an infinite number of effect sizes, and the probability of drawing a ball with 0 printed on it from an urn filled with an infinite number of balls is zero (1/infinity). This would suggest that it is meaningless to test the hypothesis that the null-hypothesis is true because we already know the answer: the probability is zero. As any number that is multiplied by 0 is zero, the probability that the population effect size is zero remains zero even if the observed data are maximally consistent with an effect size of zero. Of course, the same is true for any other point hypothesis; the probability that the effect size is exactly d = .2 is also 0. The implication is simply that it is not possible to empirically test point hypotheses when the range of effect sizes is cut into an infinite number of pieces, because the a priori probability that the effect size has a specific value is always 0. This problem can be solved by limiting the number of balls in the urn so that we avoid drawing from an infinitely large urn with an infinite number of balls.

Bayesians solve the infinity problem by using mathematical functions. A commonly used function was proposed by Jeffreys, who suggested specifying uncertainty about effect sizes with a Cauchy distribution with a scaling parameter of 1. Figure 1 shows the distribution.

[Figure 1: Jeffreys' prior, a Cauchy distribution with a scaling parameter of 1, over population effect sizes from -10 to 10]

The figure is cut off at effect sizes smaller than -10 and larger than 10, and it assumes that effect sizes are measured with two decimals. With two decimals as the unit, the density values can be read as percentages, and across the full range of effect sizes they sum to 100. The probabilities for effect sizes in the range between -10 and 10 cover only 93.66% of the full distribution; the remaining 6.34% are in the tails below -10 and above 10. As you can see, the distribution is not uniform. It gives the highest probability to an effect size of 0. The probability density for an effect size of 0 is 0.32, which translates into a probability of 0.32% with two decimals as the unit for effect sizes. By eliminating the extreme effect sizes beyond -10 and 10, the probability of the null-hypothesis increases slightly from 0.32% to 0.32/93.66*100 = 0.34%. With two decimals, there are 2001 effect sizes (-10, -9.99, …, -0.01, 0, 0.01, …, 9.99, 10). A uniform prior would put the probability of a single effect size at 1/2001 = 0.05%. This shows that Jeffreys' prior gives a higher probability to the null-hypothesis, but it does the same for other small effect sizes close to zero: the probability density for an effect size of d = 0.01 (0.31827) is only slightly smaller than the density for the null-hypothesis (0.3183).

If we translate Jeffreys' prior for effect sizes with two decimals into an urn experiment and fill the urn proportionally to the distribution in Figure 1 with 10,000 balls, 34 balls would have the number 0.00 printed on them. When we draw one ball from the urn, the probability of drawing one of the 34 balls with 0.00 is 34/10,000 = 0.0034, or 0.34%.
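This urn construction is easy to check numerically (a small sketch of my own; the grid from -10 to 10 in steps of .01 mirrors the two-decimal precision assumed above):

# Discretize effect sizes from -10 to 10 in steps of .01, weight them by the
# Cauchy(0,1) density, and compute the probability of the ball labeled 0.00.
es      = seq(-10, 10, by = .01)          # 2001 discrete effect sizes
weights = dcauchy(es, 0, 1)
sum(weights) * .01                        # about .94 of the total Cauchy mass
p.H0 = dcauchy(0, 0, 1) / sum(weights)    # about .0034
round(10000 * p.H0)                       # about 34 of 10,000 balls carry 0.00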

Bayesian statisticians do not use probability densities to specify the probability that the population effect size is zero, possibly because probability densities do not translate into probabilities until a unit of measurement is specified. However, by treating effect sizes as a continuous variable, the number of balls in the lottery is infinite and the probability of drawing a ball with 0.0000000000000000 printed on it is practically zero. A reasonable alternative is to specify a reasonable unit for effect sizes. As noted earlier, for many psychological applications a reasonable unit is a single digit (d = 0, d = .1, d = .2, etc.). This implies that effect sizes between d = -.05 and d = .05 are essentially treated as 0.

Given Jeffreys' distribution, the rational specification of the a priori odds that the effect size is 0 versus some other value between -10 and 10 is:

P(pop.es = 0) / P(pop.es ≠ 0) = 0.32 / (9.37 - 0.32) ≈ 1/28
To draw statistical inferences, Bayesian null-hypothesis tests use the Bayes Factor. Without going into details here, a Bayes Factor provides the complementary ratio of the conditional probabilities of the data given the null-hypothesis and the alternative hypothesis. It is not uncommon to use a Bayes Factor of 3 or greater as support for one of the two hypotheses. However, if we take the prior odds of these hypotheses into account, a Bayes Factor of 3 does not justify a belief in the null-hypothesis; it is not strong enough to overcome the low prior probability of the null-hypothesis given the large uncertainty about effect sizes. A Bayes Factor of 3 would change the prior odds of 1/28 into posterior odds of 3/28 = .11. Thus, it is still unlikely that the effect size is zero. A Bayes Factor of 28 in favor of H0 would be needed to make it equally likely that the null-hypothesis is true and that it is not true, and to assert that the null-hypothesis is true with a probability of 90%, the Bayes Factor would have to be about 255 (255/28 ≈ 9 = .90/.10).

It is possible to further decrease the number of balls in the lottery. For example, it is possible to set the unit to 1. This gives only 21 effect sizes (-10, -9, -8, …, -1, 0, 1, …, 8, 9, 10). The probability density of .32 now translates into a probability of roughly .32, versus a probability of .68 for all other effect sizes. After adjusting for the range restriction, this translates into prior odds of 1.95 to 1 in favor of the alternative. Thus, a Bayes Factor of 3 would now favor the null-hypothesis, and it would only require a Bayes Factor of about 18 to obtain a probability of .90 that H0 is true (18/1.95 ≈ 9 = .90/.10). However, it is important to realize that the null-hypothesis with a unit of 1 covers effect sizes in the range from -.5 to .5. This wide range covers effect sizes that are typical for psychology and are commonly called small or moderate effects. As a result, this is not a practical solution because the test no longer tests the hypothesis that there is no effect.

In conclusion, Jeffreys proposed a rational approach to specifying the probability of population effect sizes without any data and without prior information about effect sizes. He proposed a prior distribution of population effect sizes that covers a wide range of effect sizes. The cost of working with this prior distribution under maximum uncertainty is that a wide range of effect sizes is considered plausible. This means that there are many possible events and the probability of any single event is small. Jeffreys' prior makes it possible to quantify this probability as a function of the density of an effect size and the precision with which effect sizes are measured (number of digits). This probability should be used to evaluate Bayes Factors. Contrary to existing norms, Bayes Factors of 3 or 10 cannot be used to claim that the data favor the null-hypothesis over the alternative hypothesis, because this interpretation of Bayes Factors ignores that, without further information, it is more likely that the null-hypothesis is false than that it is correct. It seems unreasonable to assign equal probabilities to two events when one event is akin to drawing a single red ball from an urn and the other event is drawing any of the remaining balls. As the number of balls in the urn increases, these probabilities become more and more unequal. Any claim that the null-hypothesis is equally or more probable than other effect sizes would have to be motivated by prior information, which would invalidate the use of Jeffreys' distribution of effect sizes, a distribution that was developed for scenarios where prior information is not available.

Postscript or Part II

One of the most famous urn experiments in probability theory is the Monty Hall problem.

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

I am happy to admit that I got this problem wrong. I was not alone. In a public newspaper column, Vos Savant responded that it would be advantageous to switch because the probability of winning after switching is 2/3, whereas sticking to your guns and staying with the initial choice has only a 1/3 chance of winning.

This column received 10,000 responses with 1,000 responses by readers with a Ph.D. who argued that the chances are 50:50.  This example shows that probability theory is hard even when you are formally trained in math or statistics.  The problem is to match the actual problem to the appropriate urn experiment. Once the correct urn experiment has been chosen, it is easy to compute the probability.

Here is how I solved the Monty Hall problem for myself. I increased the number of doors from 3 to 1,000. Again, I have a choice to pick one door. My chance of picking the correct door at random is now 1/1000, or 0.1%. Everybody can see that it is very unlikely that I picked the correct door by chance (if 1,000 doors do not help, try 1,000,000 doors). Let's assume I picked a door with a goat, which has a probability of 999/1000, or 99.9%. Now the game show host will open 998 other doors with goats, and the only door that he does not open is the door with the car. Should I switch? If intuition is not sufficient for you, try the math. There is a 99.9% probability of picking a door with a goat, and if this happens, the probability that the other remaining door has the car is 1. There is a 1/1000 = 0.1% probability that I picked the door with the car, and in that case switching loses. So, you have a 0.1% chance of winning if you stay and a 99.9% chance of winning if you switch.

The situation is the same when you have three doors.  There is a 2/3 chance that you randomly pick a door with a goat. Now the gameshow host opens the only other door with a goat and the other door must have the car.  If you picked the door with the car, the game show host will open one of the two doors with a goat and the other door still has a goat behind it.  So, you have a 2/3 chance of winning if you switch and a 1/3 chance of winning when you stay.
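For readers who prefer simulation to argument, a quick sketch in R (my own illustration):

# Simulate the Monty Hall game: switching wins about 2/3 of the time.
set.seed(1)
n.games = 100000
car  = sample(1:3, n.games, replace = TRUE)   # door hiding the car
pick = sample(1:3, n.games, replace = TRUE)   # contestant's first pick
stay.wins   = mean(pick == car)               # about 1/3
switch.wins = 1 - stay.wins                   # switching wins whenever the first pick was a goat
c(stay = stay.wins, switch = switch.wins)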

What does all of this have to do with Bayesian statistics? There is a similarity between the Monty Hall problem and Bayesian statistics. If we considered only two effect sizes, say d = 0 and d = .2, we would assign an equal probability to either one being the correct effect size without looking at any data and without prior information. The odds of the null-hypothesis being true versus the alternative hypothesis being true would be 50:50. However, there are many other effect sizes that are not being considered. In Bayesian hypothesis testing these non-null effect sizes are combined into a single alternative hypothesis that the effect size is not 0 (e.g., d = .1, d = .2, d = .3, etc.). If we limit our range of effect sizes to effect sizes between -10 and 10 and specify effect sizes with one-digit precision, we end up with 201 effect sizes; one effect size is 0 and the other effect sizes are not zero. The goal is to find the actual population effect size by collecting data and by conducting a Bayesian hypothesis test. If you find the correct population effect size, you win a Nobel Prize; if you are wrong, you get ridiculed by your colleagues. Bayesian null-hypothesis tests proceed like a Monty Hall game show by picking one effect size at random. Typically, this effect size is 0. They could have picked any other effect size at random, but Bayes Factors are typically used to test the null-hypothesis. After collecting some data, the data provide information that increases the probability of some effect sizes and decreases the probability of other effect sizes. Imagine an illuminated display of the 201 effect sizes on which the game show host turns some effect sizes green and others red. Even Bayesians would abandon their preferred randomly chosen effect size of 0 if it turned red. However, let's consider a scenario where 0 and 20 other effect sizes (e.g., 0.1, 0.2, 0.3, etc.) are still green. Now the game show host gives you a choice. You can either stay with 0 or you can pick all 20 other effect sizes that are flashing green. You are allowed to pick all 20 because they are combined in a single alternative hypothesis that the effect size is not zero; it does not matter what the effect size is, it only matters that it is not zero. Bayesians who simply look at the Bayes Factor (what the data say) and accept the null-hypothesis ignore that the null-hypothesis is only one out of several effect sizes that are compatible with the data, and they ignore that a priori it is unlikely that they picked the correct effect size when they pitted a single effect size against all others.

Why would Bayesians do such a crazy thing, when it is clear that you have a much better chance of winning if you can bet on 20 out of 21 effect sizes rather than 1 out of 21 and the winning odds for switching are 20:1?

Maybe they suffer from a problem similar to that of the many people who vehemently argued that the correct answer to the Monty Hall problem is 50:50. The reason for this argument is simply that there are two doors left; it seems not to matter how we got there, because we are now facing a final decision between two choices. The same illusion may occur when we express Bayes-Factors as odds for two hypotheses and ignore the asymmetry between them: one hypothesis consists of a single effect size, whereas the other consists of all other effect sizes.

They may forget that in the beginning they picked zero at random from a large set of possible effect sizes and that it is very unlikely that this randomly chosen effect size is the correct one. This part of the problem is fully ignored when researchers compute Bayes-Factors and interpret them directly. This is not even Bayesian, because Bayes' theorem explicitly requires specifying the prior probability of the randomly chosen null-hypothesis to draw valid inferences. This is actually the main point of Bayes' theorem: even when the data favor the null-hypothesis, we have to consider the a priori probability that the null-hypothesis is true (i.e., the base rate of the null-hypothesis). Without a value for p(H0) there is no Bayesian inference. One solution is simply to assume that p(H0) and p(H1) are equally likely. In this case, a Bayes-Factor that favors the randomly chosen effect size would mean it is rational to stay with it. However, the 50:50 ratio does not make sense, because it is a priori more likely that one of the effect sizes covered by the alternative hypothesis is the right one. Therefore, it is better to switch and reject the null-hypothesis. In this sense, Bayesians who interpret Bayes-Factors without taking the base rate of H0 into account are not Bayesian, and they are likely to end up being losers in the game of science, because they will often conclude in favor of an effect size simply because they randomly picked it from a wide range of effect sizes.
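To make the role of the base rate concrete, here is a small worked example in R (it uses the hypothetical 21-effect-size grid from the Monty Hall analogy above and a Bayes-Factor of 3 as an illustration): the posterior odds are simply the Bayes-Factor multiplied by the prior odds.

# prior odds of H0 over H1 when H0 is 1 of 21 equally likely effect sizes
prior.odds = 1/20
# observed Bayes-Factor in favor of H0
BF01 = 3
# posterior odds and posterior probability of H0
post.odds = BF01 * prior.odds
post.odds / (1 + post.odds)   # p(H0|data) = .13, so H1 remains far more probable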

################################################################
# R-Code to compute the ratio of p(H0)/p(H1) and the BF required to change
# p(H0|D)/p(H1|D) to a ratio of 9:1 (90% probability that H0 is true).
################################################################

# set the scaling factor
scale = 1
# set the number of units / precision
precision = 5
# set upper limit of effect sizes
high = 3
# get lower limit
low = -high
# create effect sizes
x = seq(low,high,1/precision)
# compute number of effect sizes
N.es = length(x)
# get densities for each effect size
y = dcauchy(x,0,scale)
# draw pretty picture
curve(dcauchy(x,0,scale),low,high,xlab="Effect Size",main="Jeffrey's Prior Distribution of Population Effect Sizes")
segments(0,0,0,dcauchy(0,0,scale),col="red",lty=3)
# get the density for effect size of 0 (lazy way)
H0 = max(y) / sum(y)
# get the density of all other effect sizes
H1 = 1-H0
text(0,H0,paste0("Density = ",H0),pos=4)
# compute a priori ratio of H1 over H0
PR = H1/H0
# set belief strength for H0
PH0 = .90
# get Bayes-Factor in favor of H0
BF = -(PH0*PR)/(PH0-1)
BF

library(BayesFactor)

# find the per-group sample size needed to produce this Bayes-Factor
# with an observed effect size of 0 in a between-subject design
N = 0
try = 0
while (try < BF) {
N = N + 50
try = 1/exp(ttest.tstat(t=0, n1=N, n2=N, rscale = scale)[['bf']])
}
try
N

dec = 3
res = paste0("If standardized mean differences (Cohen's d) are measured in intervals of d = ",1/precision," and are limited to effect sizes between ",low," and ",high)
res = paste0(res," there are ",N.es," effect sizes. With a uniform prior, the chance of picking the correct effect size ")
res = paste0(res,"at random is p = 1/",N.es," = ",round(1/N.es,dec),". With the Cauchy(x,0,1) distribution, the probability of H0 is ")
res = paste0(res,round(H0,dec)," and the probability of H1 is ",round(H1,dec),". To obtain a probability of .90 in favor of H0, the data have to produce a Bayes Factor of ")
res = paste0(res,round(BF,dec), " in favor of H0. It is then possible to accept the null-hypothesis that the effect size is ")
res = paste0(res,"0 +/- ",round(.5/precision,dec),". ",N*2," participants are needed in a between subject design with an observed effect size of 0 to produce this Bayes Factor.")
print(res)

Wagenmakers' Default Prior is Inconsistent with the Observed Results in Psychological Research

Bayesian statistics is like all other statistics. A bunch of numbers are entered into a formula and the end result is another number.  The meaning of the number depends on the meaning of the numbers that enter the formula and the formulas that are used to transform them.

The input for a Bayesian inference is no different than the input for other statistical tests.  The input is information about an observed effect size and sampling error. The observed effect size is a function of the unknown population effect size and the unknown bias introduced by sampling error in a particular study.

Based on this information, frequentists compute p-values and some Bayesians compute a Bayes-Factor. The Bayes-Factor expresses how compatible an observed test statistic (e.g., a t-value) is with each of two hypotheses. Typically, the observed t-value is compared to a distribution of t-values under the assumption that H0 is true (the population effect size is 0 and t-values are expected to follow a t-distribution centered over 0) and to an alternative hypothesis. The alternative hypothesis assumes that the effect size is in a range from -infinity to infinity, which of course is true. To make this a workable alternative hypothesis, H1 assigns weights to these effect sizes. Effect sizes with bigger weights are assumed to be more likely than effect sizes with smaller weights. A weight of 0 would mean that, a priori, these effect sizes cannot occur.

As Bayes-Factors depend on the weights attached to effect sizes, it is also important to realize that the support for H0 depends on the probability that the prior distribution was a reasonable distribution of probable effect sizes. It is always possible to get a Bayes-Factor that supports H0 with an unreasonable prior.  For example, an alternative hypothesis that assumes that an effect size is at least two standard deviations away from 0 will not be favored by data with an effect size of d = .5, and the BF will correctly favor H0 over this improbable alternative hypothesis.  This finding would not imply that the null-hypothesis is true. It only shows that the null-hypothesis is more compatible with the observed result than the alternative hypothesis. Thus, it is always necessary to specify and consider the nature of the alternative hypothesis to interpret Bayes-Factors.

Although the a priori probabilities of  H0 and H1 are both unknown, it is possible to test the plausibility of priors against actual data.  The reason is that observed effect sizes provide information about the plausible range of effect sizes. If most observed effect sizes are less than 1 standard deviation, it is not possible that most population effect sizes are greater than 1 standard deviation.  The reason is that sampling error is random and will lead to overestimation and underestimation of population effect sizes. Thus, if there were many population effect sizes greater than 1, one would also see many observed effect sizes greater than 1.

To my knowledge, proponents of Bayes-Factors have not attempted to validate their priors against actual data. This is especially problematic when priors are presented as defaults that require no further justification for a specification of H1.

In this post, I focus on Wagenmakers' prior because Wagenmakers has been a prominent advocate of Bayes-Factors as an alternative approach to conventional null-hypothesis-significance testing. Wagenmakers' prior is a Cauchy distribution with a scaling factor of 1. This scaling factor implies a 50% probability that effect sizes are larger than 1 standard deviation. This prior was used to argue that Bem's (2011) evidence for PSI was weak. It has also been used in many other articles to suggest that the data favor the null-hypothesis. These articles fail to point out that the interpretation of Bayes-Factors in favor of H0 is only valid for Wagenmakers' prior. A different prior could have produced different conclusions. Thus, it is necessary to examine whether Wagenmakers' prior is a plausible prior for psychological science.

Wagenmakers’ Prior and Replicability

A prior distribution of effect sizes makes assumptions about population effect sizes. In combination with information about sample size, it is possible to compute non-centrality parameters, which are equivalent to the population effect size divided by sampling error. For each non-centrality parameter it is possible to estimate power as the area under the curve of the non-central t-distribution on the right side of the criterion value that corresponds to alpha, typically .05 (two-tailed). The assumed typical power is simply the weighted average of the power values for the individual non-centrality parameters.

Replicability is not identical to power for a set of studies with heterogeneous non-centrality parameters because studies with higher power are more likely to become significant. Thus, the set of studies that achieved significance has higher average power than the original set of studies.

Aside from power, the distribution of observed test statistics is also informative. Unlike power, which is bounded at 1, the distribution of test statistics is unlimited. Thus, unreasonable assumptions about the distribution of effect sizes are visible in a distribution of test statistics that does not match the distributions of test statistics in actual studies. One problem is that test statistics are not directly comparable across different sample sizes or statistical tests because non-central distributions vary as a function of degrees of freedom and the test being used (e.g., chi-square vs. t-test). To solve this problem, it is possible to convert all test statistics into z-scores so that they are on a common metric. In a heterogeneous set of studies, the sign of the effect provides no useful information because signs only have to be consistent in tests of the same population effect size. As a result, it is necessary to use absolute z-scores. These absolute z-scores can be interpreted as the strength of evidence against the null-hypothesis.
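For illustration, a t-value can be converted into an absolute z-score through its cumulative probability (the same conversion is used in the R-code at the end of this post); the t-value and degrees of freedom below are hypothetical:

# convert a t-value into the absolute z-score with the same cumulative probability
t = 2.5; df = 78
z = qnorm(pt(abs(t), df))
z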

I used a sample size of N = 80 and assumed a between subject design. In this case, sampling error is defined as 2/sqrt(80) = .224.  A sample size of N = 80 is the median sample size in Psychological Science. It is also the total sample size that would be obtained in a 2 x 2 ANOVA with n = 20 per cell.  Power and replicability estimates would increase for within-subject designs and for studies with larger N. Between subject designs with smaller N would yield lower estimates.

I simulated effect sizes in the range from 0 to 4 standard deviations.  Effect sizes of 4 or larger are extremely rare. Excluding these extreme values means that power estimates underestimate power slightly, but the effect is negligible because Wagenmakers’ prior assigns low probabilities (weights) to these effect sizes.

For each possible effect size in the range from 0 to 4 (using a resolution of d = .001)  I computed the non-centrality parameter as d/se.  With N = 80, these non-centrality parameters define a non-central t-distribution with 78 degrees of freedom.

I computed the implied power to achieve a significant result with alpha = .05 (two-tailed) with the formula

power = pt(ncp,N-2,qt(1-.025,N-2))

The formula returns the area under the curve on the right side of the criterion value that corresponds to a two-tailed test with p = .05.

The mean of these power values is the average power of studies if all effect sizes were equally likely.  The value is 89%. This implies that in the long run, a random sample of studies drawn from this population of effect sizes is expected to produce 89% significant results.

However, Wagenmakers' prior assumes that smaller effect sizes are more likely than larger effect sizes. Thus, it is necessary to compute the weighted average of power using Wagenmakers' prior distribution as weights. The weights were obtained using the density of a Cauchy distribution with a scaling factor of 1 for each effect size.

wagenmakers.weights = dcauchy(es,0,1)

The weighted average power was computed as the sum of the weighted power estimates divided by the sum of weights.  The weighted average power is 69%.  This estimate implies that Wagenmakers’ prior assumes that 69% of statistical tests produce a significant result, when the null-hypothesis is false.

Replicability is always higher than power because the subset of studies that produce a significant result has higher average power than the full set of studies. Replicability for a set of studies with heterogeneous power is the sum of the squared power values of the individual studies divided by the sum of the power values.

Replicability = sum(power^2) / sum(power)

The unweighted estimate of replicability is 96%. To obtain the replicability for Wagenmakers' prior, the same weighting scheme as for power can be used for replicability.

Wagenmakers.Replicability = sum(weights * power^2) / sum(weights*power)

The formula shows that Wagenmakers' prior implies a replicability of 89%. We see that the weighting scheme has relatively little effect on the estimate of replicability because many of the studies with small effect sizes are expected to produce a non-significant result, whereas the large effect sizes often have power close to 1, which implies that they will be significant in the original study and in the replication study.
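The following R sketch puts these steps together. It is not the original analysis script and it uses the standard two-tailed non-central t power formula, so it should reproduce values close to, but not necessarily identical with, the percentages reported above.

# grid of population effect sizes (Cohen's d) and Wagenmakers' prior as weights
es = seq(0, 4, .001)
N = 80                        # total N in a between-subject design
se = 2 / sqrt(N)              # sampling error of d
ncp = es / se                 # non-centrality parameters
crit = qt(1 - .025, N - 2)    # critical t-value for alpha = .05, two-tailed
power = 1 - pt(crit, N - 2, ncp) + pt(-crit, N - 2, ncp)
weights = dcauchy(es, 0, 1)
mean(power)                                    # unweighted average power (~.89)
sum(weights * power) / sum(weights)            # weighted average power (~.69)
sum(power^2) / sum(power)                      # unweighted replicability (~.96)
sum(weights * power^2) / sum(weights * power)  # weighted replicability (~.89)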

The success rate of replication studies is difficult to estimate. Cohen estimated that typical studies in psychology have 50% power to detect a medium effect size, d = .5. This would imply that the actual success rate would be lower because in an unknown percentage of studies the null-hypothesis is true. However, replicability would be higher because studies with higher power are more likely to be significant. Given this uncertainty, I used a scenario with 50% replicability. That is, an unbiased sample of studies taken from psychological journals would produce 50% successful replications in exact replication studies of the original studies. The following computations show the implications of a 50% success rate in replication studies for the proportion of hypothesis tests in which the null-hypothesis is true, p(H0).

The percentage of true null-hypothesis is a function of the success rate in replication study, weighted average power, and weighted replicability.

p(H0) = (weighted.average.power * (weighted.replicability - success.rate)) / (success.rate*.05 - success.rate*weighted.average.power - .05^2 + weighted.average.power*weighted.replicability)

To produce a success rate of 50% in replication studies with Wagenmakers’ prior when H1 is true (89% replicability), the percentage of true null-hypothesis has to be 92%.

The high percentage of true null-hypotheses (92%) also has implications for the implied false-positive rate (i.e., the percentage of significant results for which the null-hypothesis is true).

False Positive Rate = (p(H0) * .05) / (p(H0) * .05 + (1 - p(H0)) * weighted.average.power)

For every 100 studies, there are 92 true null-hypotheses that produce 92 * .05 = 4.6 false positive results. For the remaining 8 studies with a true effect, there are 8 * .69 = 5.5 true discoveries. The false positive rate is 4.6 / (4.6 + 5.5) = 46%. This means that Wagenmakers' prior, combined with a success rate of 50% in replication studies, implies that nearly half of all significant results are false positives that would not replicate in future replication studies.
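As a worked check, the two formulas can be evaluated directly by plugging in the values stated above:

# values from the text
weighted.average.power = .69
weighted.replicability = .89
success.rate = .50
alpha = .05
# implied percentage of true null-hypotheses
p.H0 = (weighted.average.power * (weighted.replicability - success.rate)) /
       (success.rate * alpha - success.rate * weighted.average.power -
        alpha^2 + weighted.average.power * weighted.replicability)
p.H0   # ~.92
# implied false positive rate among significant results
FPR = (p.H0 * alpha) / (p.H0 * alpha + (1 - p.H0) * weighted.average.power)
FPR    # ~.46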

Aside from these analytically derived predictions about power and replicability, Wagenmakers’ prior also makes predictions about the distribution of observed evidence in individual studies. As observed scores are influenced by sampling error, I used simulations to illustrate the effect of Wagenmakers’ prior on observed test statistics.

For the simulation I converted the non-central t-values into non-central z-scores and simulated sampling error with a standard normal distribution. The simulation included 92% true null-hypotheses and 8% true H1 based on Wagenmakers' prior. As published results suffer from publication bias, I simulated publication bias by selecting only observed absolute z-scores greater than 1.96, which corresponds to the p < .05 (two-tailed) significance criterion. The simulated data were submitted to a powergraph analysis that estimates power and replicability based on the distribution of absolute z-scores.
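A minimal sketch of this simulation is shown below (the number of simulated studies, the seed, and the capping of effect sizes at d = 4 are choices made for illustration, and the powergraph estimation itself is not reproduced here):

# simulate observed absolute z-scores under Wagenmakers' prior with 92% true nulls
set.seed(1)
k = 100000
se = 2 / sqrt(80)
is.H1 = runif(k) < .08                        # 8% true effects, 92% true nulls
d = ifelse(is.H1, abs(rcauchy(k, 0, 1)), 0)   # effect sizes drawn from Cauchy(0,1)
d[d > 4] = 4                                  # cap extreme effect sizes (the original analysis excluded them)
z.obs = abs(rnorm(k, d / se, 1))              # non-central z plus standard normal sampling error
z.sig = z.obs[z.obs > 1.96]                   # publication bias: keep significant results only
hist(z.sig[z.sig < 10], breaks = 50, freq = FALSE, main = "Simulated |z| scores")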

Figure 1 shows the results. First, the estimation method slightly underestimated the actual replicability of 50% by 2 percentage points. Despite this slight estimation error, the figure accurately illustrates the implications of Wagenmakers' prior for observed distributions of absolute z-scores. The density function shows a steep decrease in the range of z-scores between 2 and 3 and a gentle slope for z-scores from 4 to 10 (values greater than 10 are not shown).

Powergraphs provide some information about the composition of the total density by dividing the total density into densities for power less than 20%, 20-50%, 50-85%, and more than 85%. The red line (power < 20%) mostly determines the shape of the total density function for z-scores from 2 to 2.5, and most of the remaining density is due to studies with more than 85% power, starting with z-scores around 4. Studies with power in the range between 20% and 85% contribute very little to the total density. Thus, the plot correctly reveals that Wagenmakers' prior assumes that the roughly 50% average replicability is mostly due to studies with very low power (< 20%) and studies with very high power (> 85%).
Figure 1. Powergraph for Wagenmakers' Prior (N = 80)

Validation Study 1: Michèle Nuijten's Statcheck Data

There are a number of datasets that can be used to evaluate Wagenmakers' prior. The first dataset is based on an automatic extraction of test statistics from psychological journals. I used Michèle Nuijten's dataset to ensure that I did not cherry-pick data and to allow other researchers to reproduce the results.

The main problem with automatically extracted test statistics is that the dataset does not distinguish between  theoretically important test statistics and other statistics, such as significance tests of manipulation checks.  It is also not possible to distinguish between between-subject and within-subject designs.  As a result, replicability estimates for this dataset will be higher than the simulation based on a between-subject design.

Figure 2. Powergraph for Michèle Nuijten's StatCheck Data

Figure 2 shows all of the data, but only significant z-scores (z > 1.96) are used to estimate replicability and power. The most striking difference between Figure 1 and Figure 2 is the shape of the total density on the right side of the significance criterion.  In Figure 2 the slope is shallower. The difference is visible in the decomposition of the total density into densities for different power bands.  In Figure 1 most of the total density was accounted for by studies with less than 20% power and studies with more than 85% power.  In Figure 2, studies with power in the range between 20% and 85% account for the majority of studies with z-scores greater than 2.5 up to z-scores of 4.5.

The difference between Figure 1 and Figure 2 has direct implications for the interpretation of Bayes-Factors with t-values that correspond to z-scores in the range of just significant results. Given Wagenmakers’ prior, z-scores in this range mostly represent false-positive results. However, the real dataset suggests that some of these z-scores are the result of underpowered studies and publication bias. That is, in these studies the null-hypothesis is false, but the significant result will not replicate because these studies have low power.

Validation Study 2: Open Science Collaboration Articles (Original Results)

The second dataset is based on the Open Science Collaboration (OSC) replication project. The project aimed to replicate studies published in three major psychology journals in the year 2008. The final number of articles that were selected for replication was 99. The project replicated one study per article, but articles often contained multiple studies. I computed absolute z-scores for theoretically important tests from all studies of these 99 articles. This analysis produced 294 test statistics that could be converted into absolute z-scores.

Figure 3. Powergraph for OSC Replication Project Articles (all studies)
Figure 3 shows clear evidence of publication bias.  No sampling distribution can produce the steep increase in tests around the critical value for significance. This selection is not an artifact of my extraction, but an actual feature of published results in psychological journals (Sterling, 1959).

Given the small number of studies, the figure also contains bootstrapped 95% confidence intervals.  The 95% CI for the power estimate shows that the sample is too small to estimate power for all studies, including studies in the proverbial file drawer, based on the subset of studies that were published. However, the replicability estimate of 49% has a reasonably tight confidence interval ranging from 45% to 66%.

The shape of the density distribution in Figure 3 differs from the distribution in Figure 2 in two ways. Initially the slope is steeper in Figure 3, and there is less density in the tail with high z-scores. Both aspects contribute to the lower estimate of replicability in Figure 3, suggesting that replicability of focal hypothesis tests is lower than replicability of all statistical tests.

Comparing Figure 3 and Figure 1 shows again that the powergraph based on Wagenmakers’ prior differs from the powergraph for real data. In this case, the discrepancy is even more notable because focal hypothesis tests rarely produce large z-scores (z > 6).

Validation Study 3: Open Science Collaboration Articles (Replication Results)

At present, the only data that are somewhat representative of psychological research (at least of social and cognitive psychology) and that do not suffer from publication bias are the results of the replication studies of the OSC replication project. Out of 97 significant results in original studies, 36 studies (37%) produced a significant result in the replication study. After eliminating some replication studies (e.g., because the sample of the replication study was considerably smaller), 88 studies remained.

Figure 4. Powergraph for OSC Replication Results (k = 88)

Figure 4 shows the powergraph for the 88 studies. As there is no publication bias, estimates of power and replicability are based on non-significant and significant results. Although the sample size is smaller, the estimate of power has a reasonably narrow confidence interval because the estimate includes non-significant results. Estimated power is only 31%. The 95% confidence interval includes the actual success rate of 40%, which shows that there is no evidence of publication bias.

A visual comparison of Figure 1 and Figure 4 shows again that real data diverge from the predicted pattern by Wagenmakers’ prior.  Real data show a greater contribution of power in the range between 20% and 85% to the total density, and large z-scores (z > 6) are relatively rare in real data.

Conclusion

Statisticians have noted that it is good practice to examine the assumptions underlying statistical tests. This blog post critically examines the assumptions underlying the use of Bayes-Factors with Wagenmakers' prior. The main finding is that Wagenmakers' prior makes unreasonable assumptions about power, replicability, and the distribution of observed test statistics with or without publication bias. The main problem with Wagenmakers' prior is that it predicts too many statistical results with strong evidence against the null-hypothesis (z > 5; cf. the 5-sigma rule in physics). To achieve reasonable predictions for success rates without publication bias (~50%), Wagenmakers' prior has to assume that over 90% of statistical tests conducted in psychology test false hypotheses (i.e., predict an effect when H0 is true) and that the false-positive rate is close to 50%.

Implications

Bayesian statisticians have pointed out for a long time that the choice of a prior influences Bayes-Factors (Kass, 1993, p. 554).  It is therefore useful to carefully examine priors to assess the effect of priors on Bayesian inferences. Unreasonable priors will lead to unreasonable inferences.  This is also true for Wagenmakers’ prior.

The problem of using Bayes-Factors with Wagenmakers' prior to test the null-hypothesis is apparent in a realistic scenario that assumes a moderate population effect size of d = .5 and a sample size of N = 80 in a between-subject design. This study has a non-centrality parameter of 2.24 and 60% power to produce a significant result with p < .05, two-tailed. I used R to simulate 10,000 test statistics using the non-central t-distribution and then computed Bayes-Factors with Wagenmakers' prior.
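A minimal sketch of this simulation is shown below. It relies on ttest.tstat() from the BayesFactor package, which the earlier code in this post also uses and which returns the log of the Bayes-Factor in favor of H1. The seed is arbitrary, so the proportions will differ slightly from the percentages reported below.

library(BayesFactor)
set.seed(1)
n = 40                              # per group, N = 80 between-subject
ncp = .5 * sqrt(n * n / (n + n))    # non-centrality parameter for d = .5 (~2.24)
t.sim = rt(10000, df = 2 * n - 2, ncp = ncp)
log.bf10 = sapply(t.sim, function(t) ttest.tstat(t = t, n1 = n, n2 = n, rscale = 1)[['bf']])
bf10 = exp(log.bf10)
mean(bf10 < 1/3)   # share of results "supporting" H0 (BF01 > 3)
mean(bf10 > 3)     # share supporting H1
mean(bf10 > 10)    # share strongly supporting H1
hist(log.bf10, breaks = 50, main = "log(BF10), d = .5, N = 80")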

Figure 5 shows a histogram of log(BF). The log is used because BFs are ratios and have very skewed distributions. The histogram shows that the BF never favors the null-hypothesis by a factor of 10 (1/10 in the histogram). The reason is that even with Wagenmakers' prior, a sample size of N = 80 is too small to provide strong support for the null-hypothesis. However, 21% of the observed test statistics produce a Bayes-Factor of less than 1/3, which is sometimes used as sufficient evidence to claim that the data support the null-hypothesis. This means that the test has a 21% error rate of providing evidence for the null-hypothesis when the null-hypothesis is false. A 21% error rate is about four times larger than the 5% error rate in null-hypothesis significance testing. It is not clear why researchers should replace a statistical method with a 5% error rate for false discoveries of an effect with a method that has a roughly 20% error rate for false discoveries of null effects.

Another 48% of the results produce Bayes-Factors that are considered inconclusive. This leaves 31% of results that favor H1 with a Bayes-Factor greater than 3, and only 17% of results produce a Bayes-Factor greater than 10.   This implies that even with the low standard of a BF > 3, the test has only 31% power to provide evidence for an effect that is present.

These results are not wrong because they correctly express the support that the observed data provide for H0 and H1. The problem only occurs when the specification of H1 is ignored. Given Wagenmakers' prior, it is much more likely that a t-value of 1 stems from the sampling distribution of H0 than from the sampling distribution of H1. However, studies with 50% power when an effect is present are also much more likely to produce t-values of 1 than t-values of 6 or larger. Thus, a different prior that is more consistent with the actual power of studies in psychology would produce different Bayes-Factors and reduce the percentage of false discoveries of null effects. Researchers who think Wagenmakers' prior is not a realistic prior for their research domain should therefore use a more suitable prior.

Figure 5. Histogram of log(BF) for the 10,000 simulated test statistics (d = .5, N = 80)

Counterarguments

Wagenmakers has ignored previous criticisms of his prior. It is therefore not clear what counterarguments he would make. Below, I raise some potential counterarguments that might be used to defend the use of Wagenmakers' prior.

One counterargument could be that the prior is not very important because the influence of priors on Bayes-Factors decreases as sample sizes increase.  However, this argument ignores the fact that Bayes-Factors are often used to draw inferences from small samples. In addition, Kass (1993) pointed out that “a simple asymptotic analysis shows that even in large samples Bayes factors remain sensitive to the choice of prior” (p. 555).

Another counterargument could be that a bias in favor of H0 is desirable because it keeps the rate of false-positives low. The problem with this argument is that Bayesian statistics does not provide information about false-positive rates.  Moreover, the cost for reducing false-positives is an increase in the rate of false negatives; that is, either inconclusive results or false evidence for H0 when an effect is actually present.  Finally, the choice of the correct prior will minimize the overall amount of errors.  Thus, it should be desirable for researchers interested in Bayesian statistics to find the most appropriate priors in order to minimize the rate of false inferences.

A third counterargument could be that Wagenmakers’ prior expresses a state of maximum uncertainty, which can be considered a reasonable default when no data are available.  If one considers each study as a unique study, a default prior of maximum uncertainty would be a reasonable starting point.  In contrast, it may be questionable to treat a new study as a randomly drawn study from a sample of studies with different population effect sizes.  However, Wagenmakers’ prior does not express a state of maximum uncertainty and makes assumptions about the probability of observing very large effect sizes.  It does so without any justification for this expectation.  It therefore seems more reasonable to construct priors that are consistent with past studies and to evaluate priors against actual results of studies.

A fourth counterargument is that Bayes-Factors are superior because they can provide evidence for the null-hypothesis and the alternative hypothesis.  However, this is not correct. Bayes-Factors only provide relative support for the null-hypothesis relative to a specific alternative hypothesis.  Researchers who are interested in testing the null-hypothesis can do so using parameter estimation with confidence or credibility intervals. If the interval falls within a specified region around zero, it is possible to affirm the null-hypothesis with a specified level of certainty that is determined by the precision of the study to estimate the population effect size.  Thus, it is not necessary to use Bayes-Factors to test the null-hypothesis.

In conclusion, Bayesian statistics and other statistics are not right or wrong. They combine assumptions and data to draw inferences.  Untrustworthy data and wrong assumptions can lead to false conclusions.  It is therefore important to test the integrity of data (e.g., presence of publication bias) and to examine assumptions.  The uncritical use of Bayes-Factors with default assumptions is not good scientific practice and can lead to false conclusions just like the uncritical use of p-values can lead to false conclusions.

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias. The most prominent tests are funnel plots (Light & Pillemer, 1984) and Egger regression (Egger et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes. As a result, observed effect sizes in a representative set of studies should also be independent of sample size. However, publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result. The main problem with these bias tests is that other factors may produce heterogeneity in population effect sizes, which can also produce variation in observed effect sizes, and this variation in population effect sizes may be related to sample sizes. In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes. A power analysis would suggest that researchers use larger samples to study smaller effects and smaller samples to study large effects. This makes it problematic to draw strong inferences from negative correlations between effect sizes and sample sizes about the presence of publication bias.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation. The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies. If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50 studies. Publication bias will lead to an inflation of the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results. Sterling et al. (1995) found that several journals reported over 90% significant results. Based on some conservative estimates of power, they concluded that this high success rate can only be explained by publication bias. Sterling et al. (1995), however, did not develop a method that would make it possible to estimate power.

Ioannidis and Trikalinos (2007) proposed the first test for publication bias based on power analysis. They call it "An exploratory test for an excess of significant results" (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis for examining publication bias. The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated. ETESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalinos (2007). The test works well "If it can be safely assumed that the effect is the same in all studies on the same question" (p. 246). In other words, the test may not work well when effect sizes are heterogeneous. Again, the authors are careful to point out this limitation of ETESR: "In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised" (p. 246).

The authors repeat this limitation at the end of the article: "Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting." (p. 252). Given these limitations, it would be desirable to develop a test that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibility Index (Schimmack, 2012). The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces a non-significant result as the number of studies increases. For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip. Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again. Probability theory shows that this outcome becomes very unlikely even after just a few coin tosses, as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.125%, and so on. Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to raise suspicion that the coin is not fair, especially if it always falls on the side that benefits the person who is throwing it. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers' hypotheses at least 90% of the time. Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05). It is therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.

To avoid the requirement of a fixed effect size, the Incredibility Index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2005). However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined. The original incredibility indices used the mean to estimate average power, but Yuan and Maxwell (2005) demonstrated that the mean of observed power is a biased estimate of average (true) power. In further developments of my method, I changed the method and am now using median observed power (Schimmack, 2016). The median of observed power is an unbiased estimator of power (Schimmack, 2015).
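A minimal sketch of this logic is shown below (the z-scores are hypothetical and this is not the published R-Index code):

# observed |z|-scores of 5 studies, all significant (z > 1.96)
z = c(2.1, 2.3, 2.0, 2.5, 2.2)
# observed power of each study (two-tailed alpha = .05, ignoring the opposite tail)
obs.power = 1 - pnorm(1.96, mean = z)
mop = median(obs.power)                    # median observed power
k = length(z)
# probability of obtaining k significant results in k studies given median observed power
pbinom(k - 1, k, mop, lower.tail = FALSE)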

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629-634. doi:10.1136/bmj.315.7109.629

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245-253.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566.

Schimmack, U. (2016). A revised introduction to the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108-112.

Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30(2), 141-167.

R-Code for (Simplified) Powergraphs with StatCheck Dataset

First you need to download the datafile from
https://github.com/chartgerink/2016statcheck_data/blob/master/statcheck_dataset.csv

Right click on <Raw> and save file.

When you are done, provide Path where R can find the file.

# Provide path to the folder that contains the downloaded file
GetPath = "<path>"

# give file name
fn = "statcheck_dataset.csv"

# read datafile
d = read.csv(paste0(GetPath,fn))

# get t-values
t = d$Value
t[d$Statistic != "t"] = 0
summary(t)

#convert t-values into absolute z-scores
z.val.t = qnorm(pt(abs(t),d$df2,log.p=TRUE),log.p=TRUE)
z.val.t[z.val.t > 20] = 20
z.val.t[is.na(z.val.t)] = 0
summary(z.val.t)
hist(z.val.t[z.val.t < 6 & z.val.t > 0],breaks=30)
abline(v=1.96,col="red",lwd=2)
abline(v=1.65,col="red",lty=3)

#get F-values
F = d$Value
F[d$Statistic != "F"] = 0
# recode extreme F-values; the resulting z-scores are capped at 20 below anyway
F[F > 20] = 400
summary(F)

#convert F-values into absolute z-scores
z.val.F = qnorm(pf(abs(F),d$df1,d$df2,log.p=TRUE),log.p=TRUE)
z.val.F[z.val.F > 20] = 20
z.val.F[z.val.F < 0] = 0
z.val.F[is.na(z.val.F)] = 0
summary(z.val.F)
hist(z.val.F[z.val.F < 6 & z.val.F > 0],breaks=30)
abline(v=1.96,col="red",lwd=2)
abline(v=1.65,col="red",lty=3)

#get z-scores and convert into absolute z-scores
z.val.z = abs(d$Value)
z.val.z[d$Statistic != "Z"] = 0
z.val.z[z.val.z > 20] = 20
summary(z.val.z)
hist(z.val.z[z.val.z < 6 & z.val.z > 0],breaks=30)
abline(v=1.96,col="red",lwd=2)
abline(v=1.65,col="red",lty=3)

#check results
summary(cbind(z.val.t,z.val.F,z.val.z))

#get z-values for t,F, and z-tests
z.val = z.val.t + z.val.F + z.val.z

#check median absolute z-score by test statistic
tapply(z.val,d$Statistic,median)

##### save as r data file and reuse

### run analysis for specific author

# provide an author name as it appears in the authors column of the data file
author = "Stapel"
author.found = grepl(pattern = author,d$authors,ignore.case=TRUE)
table(author.found)

# give title for graphic
Name = paste0("StatCheck ",author)

# select z-scores of the selected author
z.val.sel = z.val[author.found]

#set limit of y-axis of graph
ylim = 1

#create histogram
hist(z.val.sel[z.val.sel < 6 & z.val.sel > 0],xlim=c(0,6),ylab="Density",xlab="|z| scores",breaks=30,ylim=c(0,ylim),freq=FALSE,main=Name)
#add line for significance
abline(v=1.96,col="red",lwd=2)
#add line for marginal significance
abline(v=1.65,col="red",lty=3)

#compute median observed power (the median absolute z-score of results with 2 < z < 4)
mop = median(z.val[z.val > 2 & z.val < 4])
# add bias to move normal distribution of fitted function to match observed distribution
bias = 0
# change variance of normal distribution to model heterogeneity (1 = equal power for all studies)
hetero = 1

### add fitted model curve to the plot
par(new=TRUE)
curve(dnorm(x,mop-bias,hetero),0,6,col="red",ylim=c(0,ylim),xlab="",ylab="")

### when satisfied with fit compute power
power = length(z.val.sel[z.val.sel > 1.96 & z.val.sel < 4]) * pnorm(mop - bias,1.96) + length(z.val.sel[z.val.sel > 4])
power = power / length(z.val.sel[z.val.sel > 1.96])

### add power estimate to the figure
text(3,.8,pos=4,paste0("Power: ",round(power,2)))

 

 

Subjective Priors: Putting Bayes into Bayes-Factors

A Post-Publication Review of "The Interplay between Subjectivity, Statistical Practice, and Psychological Science" by Jeffrey N. Rouder, Richard D. Morey, and Eric-Jan Wagenmakers

Credibility Crisis

Rouder, Morey, and Wagenmakers (RMW) start their article with the claim that psychology is facing a crisis of confidence. Since Bem (2011) published an incredible article that provided evidence for time-reversed causality, psychologists have realized that the empirical support for theoretical claims in scientific publications is not as strong as it appears to be. Take Bem's article as an example. Bem presented 9 significant results in 10 statistical tests (p < .05, one-tailed) to provide support for his incredible claim. If extra-sensory perception does not exist, each of these tests had only a 5% probability of producing a false positive result, and the probability of obtaining 9 false positive results in 10 studies is less than one in a billion. One billion is more than the number of studies that have ever been conducted in psychology, so it is very unlikely that such a rare event would occur by chance. Nevertheless, subsequent studies failed to replicate this finding even though they had much larger sample sizes and therewith a much larger chance to replicate the original results.
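The binomial probability behind this claim is easy to verify:

# probability of 9 or more false positives in 10 independent tests with alpha = .05 (one-tailed)
pbinom(8, 10, .05, lower.tail = FALSE)   # ~ 1.9e-11, i.e., less than 1 in a billion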

RMW point out that the key problem in psychological science is that researchers use questionable research practices that increase the chances of reporting a type-I error.  They fail to mention that Francis (2012) and Schimmack (2012) provide direct evidence for the use of questionable research practices in Bem’s article.  Thus, the key problem in psychology and other sciences is that researchers are allowed to publish results that support their predictions while hiding evidence that does not support their claims.  Sterling (1959) pointed out that this selective reporting of significant results invalidates the usefulness of p-values to control the type-I error rate in a field of research.  Once researchers report only significant results, the true false positive rate could be 100%.  Thus, the fundamental problem underlying the crisis of confidence is selective reporting of significant results.  Nobody has openly challenged this claim, but many articles fail to mention this key problem. As a result, they offer solutions to the crisis of confidence that are based on a false diagnosis of the problem.

RMW suggest that the use of p-values is a fundamental problem that has contributed to the crisis of confidence and they offer Bayesian statistics as a solution to the problem.

In their own words, “the target of this critique is the practice of performing significance tests and reporting associated p-values”

This statement makes clear that the authors do not recognize selective reporting of p-values smaller than .05 as the problem, but rather question the usefulness of computing p-values in general.  In this way, they conflate an old and unresolved controversy amongst statisticians with the credibility crisis in psychology.

RMW point out that statisticians have been fighting over the right way to conduct inferential statistics for decades without any resolution. One argument against p-values is that a significance criterion leads to a dichotomous decision when data can only strengthen or weaken the probability that a hypothesis is true or false. That is, a p-value of .04 does not suddenly prove that a hypothesis is true; it is just more likely that the hypothesis is true than if a study had produced a p-value of .20. This point has been made in a classic article by Rozeboom that is cited by RMW.

“The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true.” (p. 422–423)

This argument ignores that decisions have to be made. Researchers have to decide whether they want to conduct follow-up studies, editors have to decide whether the evidence is sufficient to accept a manuscript for publication, and textbook writers have to decide whether they want to include an article in a textbook.  A type-I error probability of 5% has evolved as a norm for giving a researcher the benefit of the doubt that the hypothesis is true.  If this criterion were applied rigorously, no more than 5% of published results would be type-I errors.  Moreover, replication studies would quickly weed out false-positives because the chance of repeated type-I errors decreases quickly to zero if failed replication studies are reported.

Even if we agree that there is a problem with a decision criterion, it is not clear what a Bayesian science would look like. Would newspaper articles report that a new study increased the evidence for the effect of exercise on health from 3:1 to 10:1 odds of being true? Would articles report Bayes-Factors without drawing inferences about whether an effect exists? It seems to defeat the purpose of an inferential statistical approach if the outcome of the inference process is not a conclusion that leads to a change in beliefs, and even though beliefs can be true or false to varying degrees, holding a belief is ultimately a black-or-white matter (I either believe that Berlin is the capital of Germany or I do not).

In fact, a review of articles that reported Bayes-Factors shows that most of these articles use Bayes-Factors to draw conclusions about hypotheses.  Currently, Bayes-Factors are mostly used to claim support for the absence of an effect when the Bayes-Factor favors the point-null hypothesis over an alternative hypothesis that predicted an effect.  This conclusion is typically made when the Bayes-Factor favors the null-hypothesis over the alternative hypothesis by a ratio of 3:1 or more.  RMW may disagree with this use of Bayes Factors, but this is how their statistical approach is currently being used. In essence, BF > 3 is used like p < .05.  It is easy to see how this change to Bayesian statistics does not solve the credibility crisis if only studies that produced Bayes-Factors greater than 3 are reported.  The only new problem is that authors may publish results that suggest effects do not exist at all, but that this conclusion is a classic type-II error (there is an effect, but the study had insufficient power to show the effect).

For example, Shanks et al. (2016) reported 30 statistical tests of the null-hypothesis; all tests favored the null-hypothesis over the alternative hypothesis, and 29 tests exceeded the criterion of BF > 3. Based on these results, Shanks et al. (2016) conclude that "as indicated by the Bayes factor analyses, their results strongly support the null hypothesis of no effect."

RMW may argue that the current use of Bayes-Factors is improper and that better training in the use of Bayesian methods will solve this problem.  It is therefore interesting to examine RMW’s vision of proper use of Bayesian statistics.

RMW state that the key difference between conventional statistics with p-values and Bayesian statistics is subjectivity.

“Subjectivity is the key to principled measures of evidence for theory from data.”

“A fully Bayesian approach centers subjectivity as essential for principled analysis”

“The subjectivist perspective provides a principled approach to inference that is transparent, honest, and productive.”

Importantly, they also characterize their own approach as consistent with the call for subjectivity in inferential statistics.

“The Bayesian-subjective approach advocated here has been 250 years in the making”

However, it is not clear where RMW’s approach allows researchers to specify their subjective beliefs.  RMW have developed or used an approach that is often characterized as objective Bayesian.   “A major goal of statistics (indeed science) is to find a completely coherent objective Bayesian methodology for learning from data. This is exemplified by the attitudes of Jeffreys (1961)”  (Berger, 2006). That is, rather than developing models based on a theoretical understanding of a research question, a generic model is used to test a point null-hypothesis (d =0) against a vague alternative hypothesis that there is an effect (d ≠ 0).  In this way, the test is similar to the traditional comparison of the null-hypothesis and the alternative hypothesis in conventional statistics.  The only difference is that Bayesian statistics aims to quantify the relative support for these two hypothesis.  This would be easy if the alternative hypothesis were specified as a competing point prediction with a specified effect size.  For example, are the data more consistent with an effect size of d = 0 or d = .5?  Specifying a fixed value would make this comparison subjective if theories are not sufficiently specified to make such precise predictions.  Thus, the subjective beliefs of a researcher are needed to pick a fixed effect size that is being compared to the null-hypothesis.  However, RMW advocate a Bayesian approach that specifies the alternative hypothesis as a distribution that covers all possible effect sizes.  Clearly no subjectivity is needed to state an alternative hypothesis that the effect size can be anywhere between -∞ and +∞.

There is an infinite number of alternative hypotheses that can be constructed by assigning different weights to effect sizes over the infinite range of effect sizes. These distributions can take on any form, although some may be more plausible than others. Creating a plausible alternative hypothesis could involve subjective choices. However, RMW advocate the use of a Cauchy distribution, and their online tools and R-code only allow researchers to specify alternative hypotheses as a Cauchy distribution. Moreover, the Cauchy distribution is centered over zero, which implies that the most likely value for the alternative hypothesis is that there is no effect. This violates the idea of subjectivity, because any researcher who tests a hypothesis against the null-hypothesis will assign the lowest probability to a zero value. For example, if I think that the effect of exercise on weight loss is d = .5, I am saying that the most likely outcome if my hypothesis is correct is an effect size of d = .5, not an effect size of d = 0. There is nothing inherently wrong with specifying the alternative hypothesis as a Cauchy distribution centered over 0, but it does seem wrong to present this specification as a subjective approach to hypothesis testing.

For example, let’s assume Bem (2011) wanted to use Bayesian statistic to test the hypothesis that individuals can foresee random future events.  A Cauchy distribution centered over 0 means that he thinks a null-result is the most likely outcome of the study, but this distribution does not represent his prior expectations.  Based on a meta-analysis and his experience as a researcher, he expected a small effect size of d = .2.  Thus, a subjective prior would be centered around a small effect size and Bem clearly did not expect a zero effect size or negative effect sizes (i.e., people can predict future events with less accuracy than random guessing).  RMW ignore that other Bayesian statisticians allow for priors that are not centered over 0 and they do not compare their approach to these alternative specifications of prior distributions.

RMW's approach to Bayesian statistics leaves one opportunity for subjectivity in specifying the prior distribution. This parameter is the scaling parameter of the Cauchy distribution. The scaling parameter divides the density (the area under the curve) so that 50% of the distribution is in the tails. RMW initially used a scaling parameter of 1 as a default setting. This scaling parameter implies that the prior distribution allocates a 50% probability to effect sizes in the range from -1 to 1 and a 50% probability to effect sizes outside this range. Using the same default setting makes the approach fully objective or non-subjective because the same a priori distribution is used independent of subjective beliefs relevant to a particular research question. Rouder and Morey later changed the default setting to a scaling parameter of .707, whereas Wagenmakers continues to use a scaling parameter of 1.
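This property of the Cauchy prior is easy to verify:

# with a scaling parameter of 1, half of the prior mass falls on |d| <= 1
pcauchy(1, 0, 1) - pcauchy(-1, 0, 1)             # = 0.5
# with the later default of .707, half of the mass falls on |d| <= .707
pcauchy(.707, 0, .707) - pcauchy(-.707, 0, .707) # = 0.5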

RMW suggest that the use of a default scaling parameter is not the most optimal use of their approach. “We do not recommend a single default model, but a collection of models that may be tuned by a single parameter, the scale of the distribution on effect size.”  Jeff Rouder also provided R-code that allows researchers to specify their own prior distributions and compute Bayes-Factors for a given observed effect size and sample size (the posted R-script is limited to within-subject/one-sample t-tests).

However, RMW do not provide practical guidelines how researchers should translate their subjective beliefs into a model with the corresponding scaling factor.  RMW give only very minimal recommendations. They suggest that a scaling factor greater than 1 is implausible because it would give too much weight to large effect sizes.  Remember, that even a scaling factor of 1 implies that there is a 50% chance that the absolute effect size is greater than 1 standard deviation.  They also suggest that setting the scaling factor to values smaller than .2 “makes the a priori distribution unnecessarily narrow because it does not give enough credence to effect sizes normally observed in well-executed behavioral-psychological experiments.”  This still leaves a range of scaling factors ranging from .2 to 1.  RMW do not provide further guidelines how researchers should set the scaling parameter. Instead they suggest that the default setting of .707 “is perfectly reasonable in most contexts.”  They do not explain why a value of .707 is perfectly reasonable and in what context this value is not reasonable.  Thus, they do not provide help in setting subjectively meaningful parameters, but rather imply that the default model can be used without thinking about the actual research question.

In my opinion, the use of a default setting is unsatisfactory because the choice of the scaling factor influences the Bayes-Factor.  As noted by RMW, “there certainly is an effect of scale. The largest effect occurs for t = 0.” When the t-value is 0, the data provide maximal support for the null-hypothesis.  In RMW’s example, changing the scaling parameter from .2 to 1 increases the odds in favor of the null-hypothesis from 5:1 to 10:1.  For gamblers who put real money on the line, this is a notable difference: to win $100 by betting against the null-hypothesis, I would have to risk $20 at odds of 5:1 but only $10 at odds of 10:1, and that $10 difference can buy two lattes at Starbucks or a hot dog at a ballgame.
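The influence of the scaling parameter on the Bayes-Factor is easy to demonstrate in R. The snippet below is only a sketch: the sample size of n = 50 per group is a hypothetical choice (it is not RMW’s example), and it assumes the BayesFactor package’s ttest.tstat() function, which returns the log of the Bayes-Factor in favor of the alternative.

library(BayesFactor)

# Bayes-Factor in favor of H0 for t = 0 under different Cauchy scales
# (two-sample design with a hypothetical n = 50 per group)
for (r in c(.2, .5, .707, 1)) {
  logBF10 = ttest.tstat(t = 0, n1 = 50, n2 = 50, rscale = r)$bf
  cat("scale =", r, " BF01 =", round(1 / exp(logBF10), 1), "\n")
}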

WHAT IS A REASONABLE PRIOR IN BETWEEN-SUBJECT DESIGNS?

Given the lack of guidance from RMW, I would like to make my own suggestion based on Cohen’s work on standardized effect sizes and the wealth of information about typical standardized effect sizes from meta-analyses.

One possibility is to create a prior distribution that matches the typical effect sizes observed in psychological research.  Cohen provided helpful guidelines for researchers conducting power analyses.  He suggested that a moderate effect size is a difference of half a standard deviation (d = .5); for other metrics, like correlation coefficients, a moderate effect size is r = .3.  Other useful information comes from Richard, Bond, and Stokes-Zoota’s meta-analysis of 100 years of social-psychological research, which produced a median effect size of r = .21 (d ~ .4).  The recent replication of 100 studies in social and cognitive psychology also yielded a median effect size of r = .2 (OSC, Science, 2015).
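For readers who want to check the translation between the two metrics, a correlation can be converted into a standardized mean difference (assuming equal group sizes) with the standard formula d = 2r / sqrt(1 - r^2); the short R snippet below is just this arithmetic.

# Convert a (point-biserial) correlation into Cohen's d, assuming equal groups
r_to_d = function(r) 2 * r / sqrt(1 - r^2)
r_to_d(.21)   # ~ 0.43, i.e., roughly d = .4
r_to_d(.30)   # ~ 0.63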

 

[Figure: distribution of effect sizes in Richard, Bond, and Stokes-Zoota’s meta-analysis (original image: RichardBondStokes-ZootaResults.png)]

I suggest that researchers can translate their subjective expectations into prior distributions by thinking about the distribution of effect sizes in terms of Cohen’s criteria for small, moderate, and large effect sizes.  That is, what proportion of effect sizes does a researcher expect to be smaller than .2, between .2 and .5, between .5 and .8, and larger than .8?

A distribution that matches these expectations can be found by using either a Cauchy or a normal distribution and adjusting the parameters for the center (location) and variability (scale), as in the R code below.

 

# Probability that the effect size falls into the bins > .8, .5-.8, .2-.5,
# and 0-.2 under a prior with the given center (location) and width (scale),
# renormalized to positive effect sizes (d > 0).
# The prior is mirrored (negative center, negative cutoffs), so p[1] is the
# mass above .8 and p[4] is the mass between 0 and .2.

center = .5
width = .5

# Normal prior
p = c()
p[1] = pnorm(-.8, -center, width)
p[2] = pnorm(-.5, -center, width) - pnorm(-.8, -center, width)
p[3] = pnorm(-.2, -center, width) - pnorm(-.5, -center, width)
p[4] = pnorm(0, -center, width) - pnorm(-.2, -center, width)
p = p / sum(p)   # condition on d > 0
p

# Cauchy prior with the same location and scale
p = c()
p[1] = pcauchy(-.8, -center, width)
p[2] = pcauchy(-.5, -center, width) - pcauchy(-.8, -center, width)
p[3] = pcauchy(-.2, -center, width) - pcauchy(-.5, -center, width)
p[4] = pcauchy(0, -center, width) - pcauchy(-.2, -center, width)
p = p / sum(p)   # condition on d > 0
p
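Running this script as given reproduces the Norm(.5,.5) and Cauchy(.5,.5) columns of Table 2 below, with the bins listed from largest to smallest effect size: approximately .33, .27, .27, .14 for the normal prior and .44, .23, .23, .10 for the Cauchy prior.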

The problem with a Cauchy distribution centered over 0 is that it is impossible to specify the expectation that most effect sizes will fall into the small-to-large range (Table 1).   To create this scenario, RMW use a gamma distribution, but a gamma distribution has a steep decline toward zero because its density has to approach 0 for an effect size of 0.  A normal distribution centered over a moderate effect size does not have this unrealistic property.  RMW also provide no subjective justification for the choice of a Cauchy distribution, which is not surprising because it originated in Jeffreys’s work, which tried to create a fully objective Bayesian approach.

 

Table 1. Probability (in %) of effect sizes in each range for Cauchy priors centered at 0, conditional on a positive effect size.

Cohen’s d   Cauchy(0,1)   Cauchy(0,.707)   Cauchy(0,.4)   Cauchy(0,.2)
.0 – .2          13             18               30             50
.2 – .5          17             22               28             26
.5 – .8          13             15               13              9
> .8             57             46               30             16
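As a check, the columns of Table 1 can be reproduced with a few lines of R; this is the same pcauchy binning as above, collapsed into a loop over the four scale values with the center fixed at 0.

# Reproduce Table 1: percentage of prior mass in each effect-size bin
# (0-.2, .2-.5, .5-.8, > .8), conditional on d > 0, for Cauchy(0, scale) priors
for (s in c(1, .707, .4, .2)) {
  p = diff(pcauchy(c(0, .2, .5, .8, Inf), 0, s))
  print(round(100 * p / sum(p)))
}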

 

To obtain an a priori distribution with higher probabilities for moderate effect sizes, it is necessary to shift the center of the distribution from 0 to a moderate effect size.  This can be done with a normal distribution or a Cauchy distribution.  However, Table 2 shows that the Cauchy distribution still gives too much weight to large effect sizes.  A normal distribution centered at d = .5 with a standard deviation of .5 also gives too much weight to large effect sizes.

 

Table 2. Probability (in %) of effect sizes in each range for normal and Cauchy priors centered at moderate effect sizes, conditional on a positive effect size.

Cohen’s d   Norm(.5,.5)   Norm(.4,.4)   Cauchy(.5,.5)   Cauchy(.4,.4)
.0 – .2          14             18              10              14
.2 – .5          27             34              23              30
.5 – .8          27             29              23              23
> .8             33             19              44              33

 

In my opinion, a normal distribution with a mean of .4 and a standard deviation of .4 produces a prior distribution that matches the meta-analytic distribution of effect sizes reasonably well.  This prior assigns a probability of 63% to effect sizes between .2 and .8 and roughly equal probabilities to smaller and larger effect sizes.
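The 63% figure can be verified with the same binning approach used for the tables; the snippet below computes the conditional bin probabilities for the Norm(.4, .4) prior.

# Bin probabilities for a normal prior with mean .4 and SD .4, given d > 0
p = diff(pnorm(c(0, .2, .5, .8, Inf), .4, .4))
round(100 * p / sum(p))            # ~ 18, 34, 29, 19 (cf. Table 2)
round(100 * sum(p[2:3]) / sum(p))  # ~ 63% between .2 and .8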

The prior distribution is an integral part of Bayesian inference.  Unlike p-values, Bayes-Factors can only be interpreted conditional on the prior distribution.  A Bayes-Factor that favors the alternative over the point null-hypothesis can be used to bet on the presence of an effect, but it does not provide information about the size of the effect.  A Bayes-Factor that favors the null-hypothesis over an alternative only means that it is better to bet on the null-hypothesis than on a specific weighted composite of effect sizes (e.g., a distributed bet with $1 on d = .2, $5 on d = .5, and $10 on d = .8).  That distributed bet may be a bad bet, yet placing all $16 on d = .2 might still be better than betting on d = 0.  To determine the odds for other bets, other priors would have to be tested.  Therefore, it is crucial that researchers who report Bayes-Factors as scientific evidence specify how they chose their prior distribution.  A Bayes-Factor in favor of the null-hypothesis with a Cauchy(0,10) prior, which places a 50% probability on effect sizes larger than 10 standard deviations (e.g., an increase in IQ by 150 points), does not tell us that there is no effect; it only tells us that the researcher chose a bad prior.  Researchers can use the R-code provided by Jeff Rouder or the R-code and online app provided by Dienes to compute Bayes-Factors for non-centered, normal priors.
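For readers who want to see how a Bayes-Factor with a non-centered normal prior can be computed, the sketch below uses numerical integration of the non-central t likelihood for a one-sample design.  It is not Rouder’s or Dienes’s code, and the observed t-value and sample size are hypothetical; it only illustrates the idea of replacing the default Cauchy prior with a Normal(.4, .4) prior on the effect size.

# Sketch: BF10 for a one-sample t-test with a Normal(prior.mean, prior.sd)
# prior on d, computed by integrating the non-central t likelihood over d.
bf10_normal_prior = function(t.obs, n, prior.mean = .4, prior.sd = .4) {
  df = n - 1
  marg.H1 = integrate(function(d) dt(t.obs, df, ncp = d * sqrt(n)) *
                                  dnorm(d, prior.mean, prior.sd),
                      lower = -Inf, upper = Inf)$value
  marg.H0 = dt(t.obs, df)   # likelihood of the data under d = 0
  marg.H1 / marg.H0
}

bf10_normal_prior(t.obs = 2.5, n = 30)   # hypothetical example values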

CONCLUSIONS

In conclusion, RMW start their article with the credibility crisis in psychology, discuss Bayesian statistics as a solution, and suggest Jeffreys’s objective Bayesian approach as an alternative to traditional significance testing.  In my opinion, replacing significance testing with an objective Bayesian approach creates new, unsolved problems and fails to address the root cause of the credibility crisis in psychology, namely the common practice of reporting only those results that support a hypothesis that predicts an effect.  Therefore, I suggest that psychologists need to focus on open science and encourage full disclosure of data and research methods to fix the credibility problem.  Whether data are reported with p-values or Bayes-Factors, confidence intervals or credibility intervals, is less important.  All articles should report basic statistics like means, unstandardized regression coefficients, standard deviations, and sampling error.  With this information, researchers can compute p-values, confidence intervals, or Bayes-Factors and draw their own conclusions from the data; it is even better if the actual data are made available.  Good data will often survive bad statistical analysis (robustness), but good statistics cannot solve the problem of bad data.