All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

How Does Uncertainty about Population Effect Sizes Influence the Probability that the Null-Hypothesis is True?

There are many statistical approaches that are often divided into three schools of thought: (a) Fisherian, (b) Neyman-Pearsonian, and (c) Bayesian.  This post is about Bayesian statistics.  Within Bayesian statistics, there are further distinctions that can be made. One distinction is between Bayesian parameter estimation (credibility intervals) and Bayesian hypothesis testing.  This post is about Bayesian hypothesis testing.  One goal of Bayesian hypothesis testing is to provide evidence for the null-hypothesis.  It is often argued that Bayesian Null-Hypothesis Testing (BNHT) is superior to the widely used method of Null-Hypothesis Testing with p-values.  This post is about the ability of BNHT to test the null-hypothesis.

The crucial idea of BNHT is that it is possible to contrast the null-hypothesis (H0) with an alternative hypothesis (H1) and to compute the relative likelihood that the data support one hypothesis versus the other:  p(H0/D) / p(H1/D).  If this ratio is large enough (e.g., p(H0/D) / p(H1/D) > criterion), it can be stated that the data support the null-hypothesis more than the alternative hypothesis.

To compute the ratio of the two conditional probabilities, researchers need to quantify two ratios.  One ratio is the prior ratio of the probabilities that H0 or H1 are true: p(H0)/p(H1). This ratio does not have a common name. I call it the probability ratio (PR).  The other ratio is the ratio of the conditional probabilities of the data given H0 and H1. This ratio is often called a Bayes Factor (BF): BF = p(D/H0)/p(D/H1).

To make claims about H0 and H1 based on some observed test statistic, the Probability Ratio has to be multiplied by the Bayes Factor.

p(H0/D) / p(H1/D)  =  [p(H0) x p(D/H0)] / [p(H1) x p(D/H1)]  =  PR * BF
The main reason for calling this approach Bayesian is that Bayesian statisticians are willing and required to specify a priori probabilities of hypotheses before any data are collected.  In the formula above, p(H0) and p(H1) are the a priori probabilities that a population effect size is 0, p(H0), or that it is some other value, p(H1).  However, in practice BNHT is often used without specifying these a priori probabilities.

“Table 1 provides critical t values needed for JZS Bayes factor values of 1/10, 1/3, 3, and 10 as a function of sample size. This table is analogous in form to conventional t-value tables for given p value criteria. For instance, suppose a researcher observes a t value of 3.3 for 100 observations. This t value favors the alternative and corresponds to a JZS Bayes factor less than 1/10 because it exceeds the critical value of 3.2 reported in the table. Likewise,
suppose a researcher observes a t value of 0.5. The corresponding JZS Bayes factor is greater than 10 because the t value is smaller than 0.69, the corresponding critical value in Table 1. Because the Bayes factor is directly interpretable as an odds ratio, it may be reported without reference to cutoffs such as 3 or 1/10. Readers may decide the meaning of odds ratios for themselves” (Rouder et al., 2009).

The use of arbitrary cutoff values (3 or 10) for Bayes Factors is not a complete Bayesian statistical analysis because it does not provide information about the hypotheses given the data. Bayes Factors alone only provide information about the ratio of the conditional probabilities of the data given the two hypotheses, and the two ratios are not equivalent.

p(H0/D) / p(H1/D)  ≠  p(D/H0) / p(D/H1)

In practice, users of BNHT are often unaware of or ignore the need to think about the base rates of H0 and H1 when they interpret Bayes Factors.  The main point of this post is to demonstrate that Bayes Factors that compare the null-hypothesis of a single effect size against an alternative hypothesis that combines many effect sizes (all effect sizes that are not zero) can be deceptive because the ratio p(H0)/p(H1) decreases as the number of effect sizes increases.  In the limit, the a priori probability that the null-hypothesis is true is zero, which implies that no data can provide evidence for it: any Bayes-Factor that is multiplied by zero is zero, so it remains reasonable to believe in the alternative hypothesis no matter how strongly a Bayes Factor favors the null-hypothesis.
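A minimal sketch in R makes this point concrete; the values of k correspond to the discretizations used later in this post (21, 201, and 2001 candidate effect sizes), and the variable names are mine.

# prior odds of H0 under a uniform prior shrink as the number of candidate effect sizes grows
k = c(21, 201, 2001)                 # number of candidate effect sizes
prior.odds.H0 = (1/k) / ((k-1)/k)    # p(H0)/p(H1) under a uniform prior
prior.odds.H0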

The following urn experiment explains the logic of my argument, points out a similar problem in the famous Monty Hall problem, and provides R-code to run simulations with different assumptions about the number and distribution of effect sizes and their implications for the probability ratio of H0 and H1 and for the Bayes Factors that are needed to provide evidence for the null-hypothesis.

An Urn Experiment of Population Effect Sizes

A classic example in statistics is the urn experiment.  An urn is filled with balls of different colors. If the urn is filled with 100 balls and only one ball is red, and you get one chance to draw a ball from the urn without peeking, the probability of drawing the red ball is 1 out of 100, or 1%.

To think straight about statistics and probabilities it is helpful, unless you are some math genius who can really think in 10 dimensions, to remind yourself that even complicated probability problems are essentially urn experiments. The question is only what the urn experiment would look like.

In this post, I examine the urn experiment that corresponds to the problem Bayesian statisticians face when they have to specify probabilities of effect sizes in experiments without any information that would be helpful to guess which effect size is most likely.

To translate the Bayesian problem of the prior into an urn experiment, we first have to turn effect sizes into balls.  The problem is that effect sizes are typically continuous, but an urn can only be filled with discrete objects.  The solution to this problem is to cut the continuous range of effect sizes into discrete units.  The number of units depends on the desired precision.  For example, effect sizes can be measured in standardized units with one decimal, d = 0, d = .1, d = .2, etc., with two decimals, d = .00, d = .01, d = .02, etc., or with 10 decimals.  The more precise the measurement, the more discrete events are created.  Instead of using colors, we can use balls with numbers printed on them, as you may have seen in lottery draws.   In psychology, theories and empirical studies often are not very precise, and it would hardly be meaningful to distinguish between an effect size of d = .213 and an effect size of d = .214.  Even two decimals are rarely needed, and the typical sampling error in psychological studies of d = .20 would make it impossible to distinguish between d = .33 and d = .38 empirically.  So, it makes sense to translate the continuous range of effect sizes into balls with one-digit numbers, d = .0, d = .1, d = .2, and so on.

The second problem is that effect sizes can be positive or negative.  This is not really a problem because some balls can have negative numbers printed on them.  However, the example can be generalized from the one-sided scenario with only positive effect sizes to a two-sided scenario that also includes negative effects. To keep things simple, I use only positive effect sizes in this example.

The third problem is that some effect size measures are unlimited. However, in practice it is unreasonable to expect very large effect sizes and it is possible to limit the range of possible effect sizes at a maximum value.  The limit could be d = 10, d = 5, or d = 2.  For this example, I use a limit of d = 2.

It is now possible to translate the continuous measure of standardized effect sizes into 21 discrete events and to fill the urn with 21 balls that have the numbers 0, 0.1, 0.2, …, 2.0 printed on them.

The main point of Bayesian inference is to draw conclusions about the probability that a particular hypothesis is true given the results of an empirical study.  For example, how probable is it that the null-hypothesis is true when I observe an effect size of d = .2?  However, a study only provides information about the data given a specific hypothesis. How probable is it to observe an effect size of d = .2, if the null-hypothesis were true?  To answer the first question, it is necessary to specify the probability that the hypothesis is true independent of any data; that is, how probable is it that the null-hypothesis is true?

P(pop.es = 0 / obs.es = .2) = P(pop.es = 0) * P(obs.es = .2 / pop.es = 0) / P(obs.es = .2)

This looks scary, and for this post you do not need to understand the complete formula, but it is just a mathematical way of saying that the probability that a population effect size (pop.es) is zero when the observed effect size (obs.es) is d = .2 equals the unconditional probability that the population effect size is zero multiplied by the conditional probability of observing an effect size of d = .2 when the population effect size is 0, divided by the unconditional probability of observing an effect size of d = .2.

I only show this formula to highlight the fact that the main goal of Bayesian inference is to estimate the probability of a hypothesis (in this case, pop.es = 0) given some observed data (in this case, obs.es = .20) and that researchers need to specify the unconditional probability of the hypothesis (pop.es = 0) to do so.

We can now return to the urn experiment and ask the question how likely it is that a particular hypothesis is true. For example, how likely is it that the null-hypothesis is true?  That is, how likely is it that we end up with a ball that has the number 0.0 printed on it when we conduct a study with an unknown population effect size? The answer is: it depends.  It depends on the way our urn was filled.  We of course do not know how often the null-hypothesis is true, but we can fill the urn in a way that expresses maximum uncertainty about the probability that the null-hypothesis is true.  Maximum uncertainty means that all possible events are equally likely (Bayesian statisticians actually use a so-called uniform prior when the range of possible outcomes is fixed).  So, we can fill the urn with one ball for each of the 21 effect sizes (0.0, 0.1, 0.2, …, 2.0).   Now it is fairly easy to determine the a priori probability that the null-hypothesis is true.  There are 21 balls and you are drawing one ball from the urn.  Thus, the a priori probability of the null-hypothesis being true is 1/21 = .047.
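Readers who want to check this in R can use a minimal sketch of the uniform urn (the variable names are mine).

# uniform urn: 21 one-decimal effect sizes from 0 to 2
es = seq(0, 2, by = 0.1)
length(es)          # 21 balls
1 / length(es)      # a priori probability of the null-hypothesis, 1/21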

As noted before, if the range of events increases because we specify a wider range of effect sizes (say effect sizes up to 10), the a priori probability of drawing the ball with 0.0 printed on it decreases. If we specify effect sizes with more precision (e.g., two digits), the probability of drawing the ball that has 0.00 printed on it decreases further.  With effect sizes ranging from 0 to 10 and being specified with two digits, there are 1001 balls in the urn and the probability of drawing the ball with 0.00 printed on it is 0.001.  Thus, even if the data would provide strong support for the null-hypothesis, the proper inference has to take into account that a priori it is very unlikely that a randomly drawn study had an effect size of 0.00.

As effect sizes are continuous and theoretically can range from -infinity to +infinity, there is an infinite number of effect sizes, and the probability of drawing a ball with 0 printed on it from an infinitely large urn that is filled with an infinite number of balls is zero (1/infinity).  This would suggest that it is meaningless to test whether the null-hypothesis is true or not because we already know the answer to the question; the probability is zero. As any number that is multiplied by 0 is zero, the probability that the population effect size is zero remains zero, even if the data are perfectly consistent with an effect size of zero.   Of course, this is also true for any other hypothesis about effect sizes greater than zero. The probability that the effect size is exactly d = .2 is also 0.  The implication is simply that it is not possible to empirically test hypotheses when the range of effect sizes is cut into an infinite number of pieces because the a priori probability that the effect size has a specific size is always 0.   This problem can be solved by limiting the number of balls in the urn so that we avoid the problem of drawing from an infinitely large urn with an infinite number of balls.

Bayesians solve the infinity problem by using mathematical functions.  A commonly used function was proposed by Jeffreys, who suggested specifying uncertainty about effect sizes with a Cauchy distribution with a scaling parameter of 1.  Figure 1 shows the distribution.

Figure 1. Jeffreys’ prior: a Cauchy distribution of population effect sizes with a scaling parameter of 1.

The figure is cut off at effect sizes smaller than -10 and larger than 10, and it assumes that effect sizes are measured with two decimals.  With two decimals, the densities can be interpreted as percentages and sum to 100 over the full range.  The sum of the probabilities for effect sizes in the range between -10 and 10 covers only 93.66% of the full distribution. The remaining 6.34% are in the tails below -10 and above 10. As you can see, the distribution is not uniform. It actually gives the highest probability to an effect size of 0. The probability density for an effect size of 0 is 0.32 and translates into a probability of 0.32% with two decimals as units for the effect size.   By eliminating the extreme effect sizes beyond -10 and 10, the probability of the null-hypothesis increases slightly from 0.32% to 0.32/93.66*100 = 0.34%. With two decimals, there are 2001 effect sizes (-10, -9.99, …, -0.01, 0, 0.01, …, 9.99, 10). A uniform prior would put the probability of a single effect size at 1/2001 = 0.05%.  This shows that Jeffreys’ prior gives a higher probability to the null-hypothesis, but it also does so for other small effect sizes close to zero.  The probability density for an effect size of d = .01 (.31827) is only slightly smaller than the density for the null-hypothesis (.3183).

If we translate Jeffrey’s prior for effect sizes with two digits into an urn experiment, and we filled the urn proportionally to the distribution in Figure 1 with 10,000 balls, 34 balls would have the number 0.00 printed on them.  When we draw one ball from the urn, the probability of drawing one of the 34 balls with 0.00, is 34/10000 = 0.0034 or 0.34%.
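The same numbers can be obtained in R with a few lines; this is only a sketch of the calculation described above, and the variable names are mine.

# Jeffreys' prior discretized into two-decimal units between -10 and 10
es = seq(-10, 10, by = 0.01)             # 2001 effect sizes
dens = dcauchy(es, 0, 1)                 # prior densities
dcauchy(0, 0, 1) * 0.01                  # ~0.0032: 0.32% before the range restriction
dcauchy(0, 0, 1) / sum(dens)             # ~0.0034: about 34 balls out of 10,000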

Bayesian statisticians do not use probability densities to specify the probability that the population effect size is zero, possibly because probability densities do not directly translate into probabilities without specifying a unit.  However, by treating effect sizes as a continuous variable, the number of balls in the lottery is infinite and the probability of drawing a ball with 0.0000000000000000 printed on it is practically zero.  A reasonable alternative is to specify a reasonable unit for effect sizes.  As noted earlier, for many psychological applications, a reasonable unit is a single decimal (d = 0, d = .1, d = .2, etc.).  This implies that effect sizes between d = -.05 and d = .05 are essentially treated as 0.

Given Jeffreys’ distribution and effect sizes measured in one-decimal units, the rational specification of the a priori probabilities that the effect size is 0 or some other value between -10 and 10 is

P(pop.es = 0) / P(pop.es ≠ 0)  =  0.32 / (9.37 - 0.32)  ≈  1/28
To draw statistical inferences, Bayesian Null-Hypothesis Testing uses the Bayes-Factor.  Without going into details here, a Bayes-Factor provides the complementary ratio of the conditional probabilities of the data based on the null-hypothesis or the alternative hypothesis.  It is not uncommon to use a Bayes-Factor of 3 or greater as support for one of the two hypotheses.  However, if we take the prior probabilities of these hypotheses into account, a Bayes-Factor of 3 does not justify a belief in the null-hypothesis, because it is not sufficiently strong to overcome the low probability that the null-hypothesis is true given the large uncertainty about effect sizes. A Bayes-Factor of 3 would change the prior odds of 1/28 into posterior odds of 3/28 = .11.  Thus, it is still unlikely that the effect size is zero.  A Bayes-Factor of 28 in favor of H0 would be needed to make it equally likely that the null-hypothesis is true and that it is not true, and to assert that the null-hypothesis is true with a probability of 90%, the Bayes-Factor would have to be 255; 255/28 ≈ 9 = .90/.10.
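These numbers follow directly from multiplying the prior odds by the Bayes-Factor; the following sketch uses the rounded values from the ratio above (variable names are mine).

# prior odds of H0 under Jeffreys' prior with one-decimal units (from the ratio above)
PR = 0.32 / (9.37 - 0.32)    # about 1/28
3 * PR                       # posterior odds with a BF of 3 in favor of H0, about .11
9 / PR                       # BF needed for 9:1 posterior odds in favor of H0, about 255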

It is possible to further decrease the number of balls in the lottery. For example, it is possible to set the unit to 1. This gives only 21 effect sizes (-10, -9, -8, …, -1, 0, 1, …, 8, 9, 10).  The probability density of .32 now translates into a .32 probability, versus a .68 probability for all other effect sizes. After adjusting for the range restriction, this translates into a ratio of 1.95 to 1 in favor of the alternative.  Thus, a Bayes-Factor of 3 would favor the null-hypothesis, and it would only require a Bayes-Factor of 18 to obtain a probability of .90 that H0 is true, 18/1.95 ≈ 9 = .90/.10.   However, it is important to realize that with a unit of 1 the null-hypothesis covers effect sizes in the range from -.5 to .5.   This wide range covers effect sizes that are typical for psychology and are commonly called small or moderate effects.  As a result, this is not a practical solution because the test no longer really tests the hypothesis that there is no effect.

In conclusion, Jeffreys proposed a rational approach to specify the probability of population effect sizes without any data and without prior information about effect sizes.  He proposed a prior distribution of population effect sizes that covers a wide range of effect sizes.  The cost of working with this prior distribution of effect sizes under maximum uncertainty is that a wide range of effect sizes are considered to be plausible. This means that there are many possible events and the probability of any single event is small.  Jeffreys’ prior makes it possible to quantify this probability as a function of the density of an effect size and the precision of measurement of effect sizes (number of decimals).  This probability should be used to evaluate Bayes-Factors.  Contrary to existing norms, Bayes-Factors of 3 or 10 cannot be used to claim that the data favor the null-hypothesis over the alternative hypothesis because this interpretation of Bayes-Factors ignores that without further information it is more likely that the null-hypothesis is false than that it is correct.   It seems unreasonable to assign equal probabilities to two events, where one event is akin to drawing a single red ball from an urn while the other event is to draw all but that red ball from the urn.  As the number of balls in the urn increases, these probabilities become more and more unequal.  Any claim that the null-hypothesis is equally or more probable than other effects would have to be motivated by prior information, which would invalidate the use of Jeffreys’ distribution of effect sizes that was developed for a scenario where prior information is not available.

Postscript or Part II

One of the most famous urn experiments in probability theory is the Monty Hall problem.

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

I am happy to admit that I got this problem wrong.  I was not alone.  In a public newspaper column, Vos Savant responded that it would be advantageous to switch because the probability of winning after switching is 2/3, whereas sticking to your guns and staying with the initial choice has only a 1/3 chance of winning.

This column received 10,000 responses with 1,000 responses by readers with a Ph.D. who argued that the chances are 50:50.  This example shows that probability theory is hard even when you are formally trained in math or statistics.  The problem is to match the actual problem to the appropriate urn experiment. Once the correct urn experiment has been chosen, it is easy to compute the probability.

Here is how I solved the Monty Hall problem for myself.  I increased the number of doors from 3 to 1,000.  Again, I have a choice to pick one door.  My chance of picking the correct door at random is now 1/1000 or 0.001.  Everybody can realize that it is very unlikely that I picked the correct door by chance. If 1,000 doors do not help, try 1,000,000 doors.  Let’s assume I picked a door with a goat, which has a probability of 999/1000 or 99.9%.  Now the gameshow host will open 998 other doors with goats, and the only door that he does not open is the door with the car.  Should I switch?  If intuition is not sufficient for you, try the math.  There is a 99.9% probability that I picked a door with a goat, and in this case the probability that the other door has the car is 1, so switching wins.  There is a 1/1000 = 0.1% probability that I picked the door with the car, and only in this case does staying win.  So, you have a 0.1% chance of winning if you stay and a 99.9% chance of winning if you switch.

The situation is the same when you have three doors.  There is a 2/3 chance that you randomly pick a door with a goat. Now the gameshow host opens the only other door with a goat and the other door must have the car.  If you picked the door with the car, the game show host will open one of the two doors with a goat and the other door still has a goat behind it.  So, you have a 2/3 chance of winning if you switch and a 1/3 chance of winning when you stay.
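A quick simulation confirms these numbers; this sketch assumes the standard rules in which the host always opens a goat door, and the variable names are mine.

# Monty Hall simulation: staying wins if the first pick was the car, switching wins otherwise
set.seed(42)
n = 100000
car  = sample(1:3, n, replace = TRUE)    # door hiding the car
pick = sample(1:3, n, replace = TRUE)    # contestant's first pick
mean(pick == car)                        # staying wins about 1/3 of the time
mean(pick != car)                        # switching wins about 2/3 of the time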

What does all of this have to do with Bayesian statistics?  There is a similarity between the Monty Hall problem and Bayesian statistics.   If we would only consider two effect sizes, say d = 0 and d = .2, we would have an equal probability that either one is the correct effect size without looking at any data and without prior information.  The odds of the null-hypothesis being true versus the alternative hypothesis being true are 50:50.   However, there are many other effect sizes that are not being considered. In Bayesian hypothesis testing these non-null effect sizes are combined in a single alternative hypothesis that the effect size is not 0 (e.g., d = .1, d = .2, d = .3, etc.).  If we limit our range of effect sizes to effect sizes between -10 and 10 and specify effect sizes with one-decimal precision, we end up with 201 effect sizes; one effect size is 0 and the other effect sizes are not zero.  The goal is to find the actual population effect size by collecting data and by conducting a Bayesian hypothesis test. If you find the correct population effect size, you win a Nobel Prize; if you are wrong, you get ridiculed by your colleagues.  Bayesian null-hypothesis tests proceed like a Monty Hall game show by picking one effect size at random. Typically, this effect size is 0.  They could have picked any other effect size at random, but Bayes-Factors are typically used to test the null-hypothesis.  After collecting some data, the data provide information that increases the probability of some effect sizes and further decreases the probability of other effect sizes.  Imagine an illuminated display of the 201 effect sizes where the game show host turns some effect sizes green or red.  Even Bayesians would abandon their preferred randomly chosen effect size of 0 if it turned red.  However, let’s consider a scenario where 0 and 20 other effect sizes (e.g., 0.1, 0.2, 0.3, etc.) are still green.  Now the gameshow host gives you a choice. You can either stay with 0 or you can pick all other 20 effect sizes that are flashing green.  You are allowed to pick all 20 because they are combined in a single alternative hypothesis that the effect size is not zero. It does not matter what the effect size is. It only matters that it is not zero.  Bayesians who simply look at the Bayes Factor (what the data say) and accept the null-hypothesis ignore that the null-hypothesis is only one out of several effect sizes that are compatible with the data, and they ignore that a priori it is unlikely that they picked the correct effect size when they pitted a single effect size against all others.

Why would Bayesians do such a crazy thing, when it is clear that you have a much better chance of winning if you can bet on 20 out of 21 effect sizes rather than 1 out of 21 and the winning odds for switching are 20:1?

Maybe they suffer from a similar problem as many people who vehemently argued that the correct answer to the Monty Hall problem is 50:50. The reason for this argument is simply that there are two doors. It doesn’t matter how we got there. Now that we are facing the final decision, we are left with two choices.  The same illusion may occur when we express Bayes-Factors as odds for two hypotheses and ignore the asymmetry between them: one hypothesis consists of a single effect size and the other hypothesis consists of all other effect sizes.

They may forget that in the beginning they picked zero at random from a large set of possible effect sizes and that it is very unlikely that they picked the correct effect size in the beginning.  This part of the problem is fully ignored when researchers compute Bayes-Factors and directly interpret Bayes-Factors.  This is not even Bayesian because the Bayes theorem explicitly requires specifying the probability of the randomly chosen null-hypothesis to draw valid inferences.  This is actually the main point of the Bayes theorem. Even when the data favor the null-hypothesis, we have to consider the a priori probability that the null-hypothesis is true (i.e., the base rate of the null-hypothesis).   Without a value for p(H0) there is no Bayesian inference.  One solution is simply to assume that p(H0) and p(H1) are equally likely. In this case, a Bayes-Factor that favors the randomly chosen effect size would mean it is rational to stay with it.  However, the 50:50 ratio does not make sense because it is a priori more likely that one of the effect sizes of the alternative hypothesis is the right one.  Therefore, it is better to switch and reject the null-hypothesis.  In this sense, Bayesians who interpret Bayes-Factors without taking the base rate of H0 into account are not Bayesian, and they are likely to end up being losers in the game of science because they will often conclude in favor of an effect size simply because they randomly picked it from a wide range of effect sizes.

################################################################
# R-Code to compute Ratio of p(H0)/p(H1) and BF required to change p(H0/D)/p(H1/D)
# to a ratio of 9:1 (90% probability that H0 is true).
################################################################

# set the scaling factor
scale = 1
# set the number of units / precision
precision = 5
# set upper limit of effect sizes
high = 3
# get lower limit
low = -high
# create effect sizes
x = seq(low,high,1/precision)
# compute number of effect sizes
N.es = length(x)
# get densities for each effect size
y = dcauchy(x,0,scale)
# draw pretty picture
curve(dcauchy(x,0,scale),low,high,xlab='Effect Size',main="Jeffreys' Prior Distribution of Population Effect Sizes")
segments(0,0,0,dcauchy(0,0,scale),col='red',lty=3)
# get the density for effect size of 0 (lazy way)
H0 = max(y) / sum(y)
# get the density of all other effect sizes
H1 = 1-H0
text(0,H0,paste0('Density = ',H0),pos=4)
# compute a priori ratio of H1 over H0
PR = H1/H0
# set belief strength for H0
PH0 = .90
# get Bayes-Factor in favor of H0
BF = -(PH0*PR)/(PH0-1)
BF

library(BayesFactor)

# find the per-group sample size needed to produce a BF of this size in favor of H0
N = 0
try = 0
while (try < BF) {
N = N + 50
try = 1/exp(ttest.tstat(t=0, n1=N, n2=N, rscale = scale)[['bf']])
}
try
N

dec = 3
res = paste0("If standardized mean differences (Cohen's d) are measured in intervals of d = ",1/precision," and are limited to effect sizes between ",low," and ",high)
res = paste0(res," there are ",N.es," effect sizes. With a uniform prior, the chance of picking the correct effect size ")
res = paste0(res,"at random is p = 1/",N.es," = ",round(1/N.es,dec),". With the Cauchy(x,0,1) distribution, the probability of H0 is ")
res = paste0(res,round(H0,dec)," and the probability of H1 is ",round(H1,dec),". To obtain a probability of .90 in favor of H0, the data have to produce a Bayes Factor of ")
res = paste0(res,round(BF,dec), " in favor of H0. It is then possible to accept the null-hypothesis that the effect size is ")
res = paste0(res,"0 +/- ",round(.5/precision,dec),". ",N*2," participants are needed in a between subject design with an observed effect size of 0 to produce this Bayes Factor.")
print(res)


Subjective Bayesian T-Test Code

########################################################

rm(list=ls()) #will remove ALL objects

##############################################################
# Bayes-Factor Calculations for T-tests
##############################################################

#Start of Settings

### Give a title for results output
Results.Title = 'Normal(x,0,.5) N = 100 BS-Design, Obs.ES = 0'

### Criterion for Inference in Favor of H0, BF (H1/H0)
BF.crit.H0 = 1/3

### Criterion for Inference in Favor of H1
#set z.crit.H1 to Infinity to use Bayes-Factor, BF(H1/H0)
BF.crit.H1 = 3
z.crit.H1 = Inf

### Set Number of Groups
gr = 2

### Set Total Sample size
N = 100

### Set observed effect size
### for between-subject designs and one sample designs this is Cohen’s d
### for within-subject designs this is dz
obs.es = 0

### Set the mode of the alternative hypothesis
alt.mode = 0

### Set the variability of the alternative hypothesis
alt.var = .5

### Set the shape of the distribution of population effect sizes
alt.dist = 2  #1 = Cauchy; 2 = Normal

### Set the lower bound of population effect sizes
### Set to zero if there is zero probability to observe effects with the opposite sign
low = -3

### Set the upper bound of population effect sizes
### For example, set to 1, if you think effect sizes greater than 1 SD are unlikely
high = 3

### set the precision of density estimation (bigger takes longer)
precision = 100

### set the graphic resolution (higher resolution takes longer)
graphic.resolution = 20

### set limit for non-central t-values
nct.limit = 100

################################
# End of Settings
################################

# compute degrees of freedom
df = (N - gr)

# get range of population effect sizes
pop.es=seq(low,high,(1/precision))

# compute sampling error
se = gr/sqrt(N)

# limit population effect sizes based on non-central t-values
pop.es = pop.es[pop.es/se >= -nct.limit & pop.es/se <= nct.limit]

# function to get weights for Cauchy or Normal Distributions
# function to get weights for Cauchy or Normal distributions (weights are returned unscaled)
get.weights=function(pop.es,alt.dist,p) {
if (alt.dist == 1) w = dcauchy(pop.es,alt.mode,alt.var)
if (alt.dist == 2) w = dnorm(pop.es,alt.mode,alt.var)
# get the scaling factor to scale weights to 1*precision
#scale = sum(w)/precision
# scale weights
#w = w / scale
return(w)
}

# get weights for population effect sizes
weights = get.weights(pop.es,alt.dist,precision)

#Plot Alternative Hypothesis
Title="Alternative Hypothesis"
ymax=max(max(weights)*1.2,1)
plot(pop.es,weights,type='l',ylim=c(0,ymax),xlab="Population Effect Size",ylab="Density",main=Title,col='blue',lwd=3)
abline(v=0,col='red')

#create observations for plotting of prediction distributions
obs = seq(low,high,1/graphic.resolution)

# Get distribution for observed effect size assuming H1
H1.dist = as.numeric(lapply(obs, function(x) sum(dt(x/se,df,pop.es/se) * weights)/precision))

#Get Distribution for observed effect sizes assuming H0
H0.dist = dt(obs/se,df,0)

#Compute Bayes-Factors for Prediction Distribution of H0 and H1
BFs = H1.dist/H0.dist

#Compute z-scores (strength of evidence against H0)
z = qnorm(pt(obs/se,df,log.p=TRUE),log.p=TRUE)

# Compute H1 error rate rate
BFpos = BFs
BFpos[z < 0] = Inf
if (z.crit.H1 == Inf) z.crit.H1 = abs(z[which(abs(BFpos-BF.crit.H1) == min(abs(BFpos-BF.crit.H1)))])
ncz = qnorm(pt(pop.es/se,df,log.p=TRUE),log.p=TRUE)
weighted.power = sum(pnorm(abs(ncz),z.crit.H1)*weights)/sum(weights)
H1.error = 1-weighted.power

#Compute H0 Error Rate
z.crit.H0 = abs(z[which(abs(BFpos-BF.crit.H0) == min(abs(BFpos-BF.crit.H0)))])
H0.error = (1-pnorm(z.crit.H0))*2

# Get density for observed effect size assuming H0
Density.Obs.H0 = dt(obs.es/se,df,0)  # divide by se so it is on the same scale as the H1 density below

# Get density for observed effect size assuming H1
Density.Obs.H1 = sum(dt(obs.es/se,df,pop.es/se) * weights)/precision

# Compute Bayes-Factor for observed effect size
BF.obs.es = Density.Obs.H1 / Density.Obs.H0

#Compute z-score for observed effect size
obs.z = qnorm(pt(obs.es/se,df,log.p=TRUE),log.p=TRUE)

#Show Results
ymax=max(H0.dist,H1.dist)*1.3
plot(type='l',z,H0.dist,ylim=c(0,ymax),xlab="Strength of Evidence (z-value)",ylab="Density",main=Results.Title,col='black',lwd=2)
par(new=TRUE)
plot(type='l',z,H1.dist,ylim=c(0,ymax),xlab="",ylab="",col='blue',lwd=2)
abline(v=obs.z,lty=2,lwd=2,col='darkgreen')
abline(v=-z.crit.H1,col='blue',lty=3)
abline(v=z.crit.H1,col='blue',lty=3)
abline(v=-z.crit.H0,col='red',lty=3)
abline(v=z.crit.H0,col='red',lty=3)
points(pch=19,c(obs.z,obs.z),c(Density.Obs.H0,Density.Obs.H1))
res = paste0('BF(H1/H0): ',format(round(BF.obs.es,3),nsmall=3))
text(min(z),ymax*.95,pos=4,res)
res = paste0('BF(H0/H1): ',format(round(1/BF.obs.es,3),nsmall=3))
text(min(z),ymax*.90,pos=4,res)
res = paste0('H1 Error Rate: ',format(round(H1.error,3),nsmall=3))
text(min(z),ymax*.80,pos=4,res)
res = paste0('H0 Error Rate: ',format(round(H0.error,3),nsmall=3))
text(min(z),ymax*.75,pos=4,res)

######################################################
### END OF Subjective Bayesian T-Test CODE
######################################################
### Thank you to Jeff Rouder for posting his code that got me started.
### http://jeffrouder.blogspot.ca/2016/01/what-priors-should-i-use-part-i.html

 

Wagenmakers’ Default Prior is Inconsistent with the Observed Results in Psychological Research

Bayesian statistics is like all other statistics. A bunch of numbers are entered into a formula and the end result is another number.  The meaning of the number depends on the meaning of the numbers that enter the formula and the formulas that are used to transform them.

The input for a Bayesian inference is no different than the input for other statistical tests.  The input is information about an observed effect size and sampling error. The observed effect size is a function of the unknown population effect size and the unknown bias introduced by sampling error in a particular study.

Based on this information, frequentists compute p-values and some Bayesians compute a Bayes-Factor. The Bayes Factor expresses how compatible an observed test statistic (e.g., a t-value) is with one of two hypotheses. Typically, the observed t-value is compared to the distribution of t-values under the assumption that H0 is true (the population effect size is 0 and t-values are expected to follow a t-distribution centered over 0) and to the distribution expected under an alternative hypothesis. The alternative hypothesis assumes that the effect size is in a range from -infinity to infinity, which of course is true. To make this a workable alternative hypothesis, H1 assigns weights to these effect sizes. Effect sizes with bigger weights are assumed to be more likely than effect sizes with smaller weights. A weight of 0 would mean a priori that these effects cannot occur.

As Bayes-Factors depend on the weights attached to effect sizes, it is also important to realize that the support for H0 depends on the probability that the prior distribution was a reasonable distribution of probable effect sizes. It is always possible to get a Bayes-Factor that supports H0 with an unreasonable prior.  For example, an alternative hypothesis that assumes that an effect size is at least two standard deviations away from 0 will not be favored by data with an effect size of d = .5, and the BF will correctly favor H0 over this improbable alternative hypothesis.  This finding would not imply that the null-hypothesis is true. It only shows that the null-hypothesis is more compatible with the observed result than the alternative hypothesis. Thus, it is always necessary to specify and consider the nature of the alternative hypothesis to interpret Bayes-Factors.
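To illustrate this point, here is a hedged sketch (the numbers are mine, not from the text): an observed effect of d = .5 in a between-subject study with N = 80 is compared against H0 and against an implausible alternative that, for simplicity, expects an effect of exactly d = 2.

se = 2 / sqrt(80)
t.obs = .5 / se                           # observed t-value for d = .5
like.H0 = dt(t.obs, df = 78, ncp = 0)     # likelihood of the data under H0
like.H1 = dt(t.obs, df = 78, ncp = 2/se)  # likelihood under the implausible H1 (d = 2)
like.H0 / like.H1                         # BF(H0/H1) >> 1, although d = .5 is clearly not 0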

Although the a priori probabilities of  H0 and H1 are both unknown, it is possible to test the plausibility of priors against actual data.  The reason is that observed effect sizes provide information about the plausible range of effect sizes. If most observed effect sizes are less than 1 standard deviation, it is not possible that most population effect sizes are greater than 1 standard deviation.  The reason is that sampling error is random and will lead to overestimation and underestimation of population effect sizes. Thus, if there were many population effect sizes greater than 1, one would also see many observed effect sizes greater than 1.

To my knowledge, proponents of Bayes-Factors have not attempted to validate their priors against actual data. This is especially problematic when priors are presented as defaults that require no further justification for a specification of H1.

In this post, I focus on Wagenmakers’ prior because Wagenmakers has been a prominent advocate of Bayes-Factors as an alternative approach to conventional null-hypothesis-significance testing.  Wagenmakers’ prior is a Cauchy distribution with a scaling factor of 1.  This scaling factor implies a 50% probability that effect sizes are larger than 1 standard deviation.  This prior was used to argue that Bem’s (2011) evidence for PSI was weak. It has also been used in many other articles to suggest that the data favor the null-hypothesis.  These articles fail to point out that the interpretation of Bayes-Factors in favor of H0 is only valid for Wagenmakers’ prior. A different prior could have produced different conclusions.  Thus, it is necessary to examine whether Wagenmakers’ prior is a plausible prior for psychological science.

Wagenmakers’ Prior and Replicability

A prior distribution of effect sizes makes assumptions about population effect sizes. In combination with information about sample size, it is possible to compute non-centrality parameters, which are equivalent to the population effect size divided by sampling error.  For each non-centrality parameter it is possible to estimate power as the area under the curve of the non-central t-distribution on the right side of the criterion value that corresponds to alpha, typically .05 (two-tailed).   The assumed typical power is simply the weighted average of the power values for each non-centrality parameter.

Replicability is not identical to power for a set of studies with heterogeneous non-centrality parameters because studies with higher power are more likely to become significant. Thus, the set of studies that achieved significance has higher average power than the original set of studies.

Aside from power, the distribution of observed test statistics is also informative. Unlike power, which is bounded at 1, the distribution of test statistics is unbounded. Thus, unreasonable assumptions about the distribution of effect sizes are visible in a distribution of test statistics that does not match the distributions of test statistics in actual studies.  One problem is that test statistics are not directly comparable for different sample sizes or statistical tests because non-central distributions vary as a function of degrees of freedom and the test being used (e.g., chi-square vs. t-test).  To solve this problem, it is possible to convert all test statistics into z-scores so that they are on a common metric.  In a heterogeneous set of studies, the sign of the effect provides no useful information because signs only have to be consistent in tests of the same population effect size. As a result, it is necessary to use absolute z-scores. These absolute z-scores can be interpreted as the strength of evidence against the null-hypothesis.
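As a small sketch of this conversion (the numbers are mine), a t-value can be mapped onto the z-score metric with the same transformation that is used in the code earlier in this post.

t.val = 2.5
df = 78
z = qnorm(pt(t.val, df, log.p = TRUE), log.p = TRUE)
abs(z)    # strength of evidence against H0 on a common metric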

I used a sample size of N = 80 and assumed a between subject design. In this case, sampling error is defined as 2/sqrt(80) = .224.  A sample size of N = 80 is the median sample size in Psychological Science. It is also the total sample size that would be obtained in a 2 x 2 ANOVA with n = 20 per cell.  Power and replicability estimates would increase for within-subject designs and for studies with larger N. Between subject designs with smaller N would yield lower estimates.

I simulated effect sizes in the range from 0 to 4 standard deviations.  Effect sizes of 4 or larger are extremely rare. Excluding these extreme values means that power estimates underestimate power slightly, but the effect is negligible because Wagenmakers’ prior assigns low probabilities (weights) to these effect sizes.

For each possible effect size in the range from 0 to 4 (using a resolution of d = .001)  I computed the non-centrality parameter as d/se.  With N = 80, these non-centrality parameters define a non-central t-distribution with 78 degrees of freedom.

I computed the implied power to achieve a significant result with alpha = .05 (two-tailed) with the formula

power = pt(ncp,N-2,qt(1-.025,N-2))

The formula returns the area under the curve on the right side of the criterion value that corresponds to a two-tailed test with p = .05.

The mean of these power values is the average power of studies if all effect sizes were equally likely.  The value is 89%. This implies that in the long run, a random sample of studies drawn from this population of effect sizes is expected to produce 89% significant results.

However, Wagenmakers’ prior assumes that smaller effect sizes are more likely than larger effect sizes. Thus, it is necessary to compute the weighted average of power using Wagenmakes’ prior distribution as weights.  The weights were obtained using the density of a Cauchy distribution with a scaling factor of 1 for each effect size.

wagenmakers.weights = dcauchy(es,0,1)

The weighted average power was computed as the sum of the weighted power estimates divided by the sum of weights.  The weighted average power is 69%.  This estimate implies that Wagenmakers’ prior assumes that 69% of statistical tests produce a significant result, when the null-hypothesis is false.
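The steps described above can be reproduced with a short sketch (variable names are mine; the exact values depend on the grid and rounding).

es = seq(0, 4, by = .001)                 # effect sizes from 0 to 4
se = 2 / sqrt(80)                         # sampling error, between-subject design, N = 80
ncp = es / se                             # non-centrality parameters
power = pt(ncp, 78, qt(1 - .025, 78))     # power formula used above
weights = dcauchy(es, 0, 1)               # Wagenmakers' prior as weights
mean(power)                               # unweighted average power, about .89
sum(weights * power) / sum(weights)       # weighted average power, about .69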

Replicability is always higher than power because the subset of studies that produce a significant result has higher average power than the full set of studies. Replicability for a set of studies with heterogeneous power is the sum of the squared power of the individual studies divided by the sum of power.

Replicability = sum(power^2) / sum(power)

The unweighted estimate of replicability is 96%.   To obtain the replicability for Wagenmakers’ prior, the same weighting scheme as for power can be used for replicability.

Wagenmakers.Replicability = sum(weights * power^2) / sum(weights*power)

The formula shows that Wagenmakers’ prior implies a replicability of 89%.  We see that the weighting scheme has relatively little effect on the estimate of replicability because many of the studies with small effect sizes are expected to produce a non-significant result, whereas the large effect sizes often have power close to 1, which implies that they will be significant in the original study and the replication study.

The success rate of replication studies is difficult to estimate. Cohen estimated that typical studies in psychology have 50% power to detect a medium effect size, d = .5.  This would imply that the actual success rate would be lower because in an unknown percentage of studies the null-hypothesis is true.  However, replicability would be higher because studies with higher power are more likely to be significant.  Given this uncertainty, I used a scenario with 50% replicability.  That is, an unbiased sample of studies taken from psychological journals would produce 50% successful replications in exact replication studies of the original studies.  The following computations show the implications of a 50% success rate in replication studies for the proportion of hypothesis tests where the null-hypothesis is true, p(H0).

The percentage of true null-hypothesis is a function of the success rate in replication study, weighted average power, and weighted replicability.

p(H0) = (weighted.average.power * (weighted.replicability - success.rate)) / (success.rate*.05 - success.rate*weighted.average.power - .05^2 + weighted.average.power*weighted.replicability)

To produce a success rate of 50% in replication studies with Wagenmakers’ prior when H1 is true (89% replicability), the percentage of true null-hypothesis has to be 92%.
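Plugging the rounded values from above into this formula gives approximately this result (a sketch; the exact value depends on rounding).

weighted.average.power = .69
weighted.replicability = .89
success.rate = .50
(weighted.average.power * (weighted.replicability - success.rate)) /
  (success.rate*.05 - success.rate*weighted.average.power - .05^2 +
   weighted.average.power*weighted.replicability)     # about .92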

The high percentage of true null-hypotheses (92%) also has implications for the implied false-positive rate (i.e., the percentage of significant results for which the null-hypothesis is true).

False Positive Rate = (Type.1.Error * .05) / (Type.1.Error * .05 + (1-Type.1.Error) * Weighted.Average.Power)
For every 100 studies, there are 92 true null-hypotheses that produce 92*.05 = 4.6 false-positive results. For the remaining 8 studies with a true effect, there are 8 * .67 = 5.4 true discoveries.  The false positive rate is 4.6 / (4.6 + 5.4) = 46%.  This means Wagenmakers’ prior assumes that a success rate of 50% in replication studies implies that nearly half of the significant results are false positives that would not replicate in future replication studies.

Aside from these analytically derived predictions about power and replicability, Wagenmakers’ prior also makes predictions about the distribution of observed evidence in individual studies. As observed scores are influenced by sampling error, I used simulations to illustrate the effect of Wagenmakers’ prior on observed test statistics.

For the simulation I converted the non-central t-values into non-central z-scores and simulated sampling error with a standard normal distribution.  The simulation included 92% true null-hypotheses and 8% true H1 based on Wagenmaker’s prior.  As published results suffer from publication bias, I simulated publication bias by selecting only observed absolute z-scores greater than 1.96, which corresponds to the p < .05 (two-tailed) significance criterion.  The simulated data were submitted to a powergraph analysis that estimates power and replicability based on the distribution of absolute z-scores.
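The following is only a hedged sketch of such a simulation (the variable names and the exact sampling scheme are mine); it mirrors the steps described above.

set.seed(1)
k = 100000
se = 2 / sqrt(80)
es.grid = seq(0, 4, by = .001)
# 92% true null-hypotheses; the remaining 8% are drawn in proportion to Wagenmakers' prior
es = ifelse(runif(k) < .92, 0,
            sample(es.grid, k, replace = TRUE, prob = dcauchy(es.grid, 0, 1)))
z = abs(es / se + rnorm(k))        # non-central z plus standard normal sampling error
z.sig = z[z > 1.96]                # publication bias: keep only significant results
length(z.sig) / k                  # proportion of simulated studies that get "published"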

Figure 1 shows the results.   First, the estimation method slightly underestimated the actual replicability of 50% by 2 percentage points.  Despite this slight estimation error, the Figure accurately illustrates the implications of Wagenmakers’ prior for observed distributions of absolute z-scores.  The density function shows a steep decrease in the range of z-scores between 2 and 3, and a gentle slope for z-scores greater than 4 to 10 (values greater than 10 are not shown).

Powergraphs provide some information about the composition of the total density by dividing the total density into densities for power less than 20%, 20-50%, 50% to 85% and more than 85%. The red line (power < 20%) mostly determines the shape of the total density function for z-scores from 2 to 2.5, and most the remaining density is due to studies with more than 85% power starting with z-scores around 4.   Studies with power in the range between 20% and 85% contribute very little to the total density. Thus, the plot correctly reveals that Wagenmakers’ prior assumes that the roughly 50% average replicability is mostly due to studies with very low power (< 20%) and studies with very high power (> 85%).
Powergraph for Wagenmakers' Prior (N = 80)

Validation Study 1: Michèle Nuijten’s Statcheck Data

There are a number of datasets that can be used to evaluate Wagenmakers’ prior. The first dataset is based on an automatic extraction of test statistics from psychological journals. I used Michèle Nuijten’s dataset to ensure that I did not cherry-pick data and to allow other researchers to reproduce the results.

The main problem with automatically extracted test statistics is that the dataset does not distinguish between  theoretically important test statistics and other statistics, such as significance tests of manipulation checks.  It is also not possible to distinguish between between-subject and within-subject designs.  As a result, replicability estimates for this dataset will be higher than the simulation based on a between-subject design.

Powergraph for Michele Nuijten's StatCheck Data

 

Figure 2 shows all of the data, but only significant z-scores (z > 1.96) are used to estimate replicability and power. The most striking difference between Figure 1 and Figure 2 is the shape of the total density on the right side of the significance criterion.  In Figure 2 the slope is shallower. The difference is visible in the decomposition of the total density into densities for different power bands.  In Figure 1 most of the total density was accounted for by studies with less than 20% power and studies with more than 85% power.  In Figure 2, studies with power in the range between 20% and 85% account for the majority of studies with z-scores greater than 2.5 up to z-scores of 4.5.

The difference between Figure 1 and Figure 2 has direct implications for the interpretation of Bayes-Factors with t-values that correspond to z-scores in the range of just significant results. Given Wagenmakers’ prior, z-scores in this range mostly represent false-positive results. However, the real dataset suggests that some of these z-scores are the result of underpowered studies and publication bias. That is, in these studies the null-hypothesis is false, but the significant result will not replicate because these studies have low power.

Validation Study 2:  Open Science Collaboration Articles (Original Results)

The second dataset is based on the Open Science Collaboration (OSC) replication project.  The project aimed to replicate studies published in three major psychology journals in the year 2008.  The final number of articles that were selected for replication was 99. The project replicated one study per article, but articles often contained multiple studies.  I computed absolute z-scores for theoretically important tests from all studies of these 99 articles.  This analysis produced 294 test statistics that could be converted into absolute z-scores.

Powergraph for OSC Rep.Project Articles (all studies)
Figure 3 shows clear evidence of publication bias.  No sampling distribution can produce the steep increase in tests around the critical value for significance. This selection is not an artifact of my extraction, but an actual feature of published results in psychological journals (Sterling, 1959).

Given the small number of studies, the figure also contains bootstrapped 95% confidence intervals.  The 95% CI for the power estimate shows that the sample is too small to estimate power for all studies, including studies in the proverbial file drawer, based on the subset of studies that were published. However, the replicability estimate of 49% has a reasonably tight confidence interval ranging from 45% to 66%.

The shape of the density distribution in Figure 3 differs from the distribution in Figure 2 in two ways. Initially the slope is steeper in Figure 3, and there is less density in the tail with high z-scores.  Both aspects contribute to the lower estimate of replicability in Figure 3, suggesting that replicability of focal hypothesis tests is lower than replicability for all statistical tests.

Comparing Figure 3 and Figure 1 shows again that the powergraph based on Wagenmakers’ prior differs from the powergraph for real data. In this case, the discrepancy is even more notable because focal hypothesis tests rarely produce large z-scores (z > 6).

Validation Study 3:  Open Science Collaboration Articles (Replication Results)

At present, the only data that are somewhat representative of psychological research (at least of social and cognitive psychology) and that do not suffer from publication bias are the results from the replication studies of the OSC replication project.  Out of 97 significant results in the original studies, 36 studies (37%) produced a significant result in the replication study.  After eliminating some replication studies (e.g., because the sample of the replication study was considerably smaller), 88 studies remained.

Powergraph for OSC Replication Results (k = 88)

Figure 4 shows the powergraph for the 88 studies. As there is no publication bias, estimates of power and replicability are based on non-significant and significant results.  Although the sample size is smaller, the estimate of power has a reasonably narrow confidence interval because the estimate includes non-significant results. Estimated power is only 31%. The 95% confidence interval includes the actual success rate of 40%, which shows that there is no evidence of publication bias.

A visual comparison of Figure 1 and Figure 4 shows again that real data diverge from the predicted pattern by Wagenmakers’ prior.  Real data show a greater contribution of power in the range between 20% and 85% to the total density, and large z-scores (z > 6) are relatively rare in real data.

Conclusion

Statisticians have noted that it is good practice to examine the assumptions underlying statistical tests. This blog post critically examines the assumptions underlying the use of Bayes-Factors with Wagenmakers’ prior.  The main finding is that Wagenmakers’ prior makes unreasonable assumptions about power, replicability, and the distribution of observed test statistics with or without publication bias. The main problem with Wagenmakers’ prior is that it predicts too many statistical results with strong evidence against the null-hypothesis (z > 5, or the 5 sigma rule in physics).  To achieve reasonable predictions for success rates without publication bias (~50%), Wagenmakers’ prior has to assume that over 90% of statistical tests conducted in psychology test a false hypothesis (i.e., predict an effect when H0 is true), and that the false-positive rate is close to 50%.

Implications

Bayesian statisticians have pointed out for a long time that the choice of a prior influences Bayes-Factors (Kass, 1993, p. 554).  It is therefore useful to carefully examine priors to assess the effect of priors on Bayesian inferences. Unreasonable priors will lead to unreasonable inferences.  This is also true for Wagenmakers’ prior.

The problem of using Bayes-Factors with Wagenmakers’ prior to test the null-hypothesis is apparent in a realistic scenario that assumes a moderate population effect size of d = .5 and a sample size of N = 80 in a between subject design. This study has a non-central t of 2.24 and 60% power to produce a significant result with p < .05, two-tailed.   I used R to simulate 10,000 test-statistics using the non-central t-distribution and then computed Bayes-Factors with Wagenmakers’ prior.
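A hedged sketch of this simulation, using the same ttest.tstat() call from the BayesFactor package that appears in the code earlier in this post (the variable names are mine):

library(BayesFactor)
set.seed(123)
ncp = .5 / (2 / sqrt(80))                 # non-central t of about 2.24
t.sim = rt(10000, df = 78, ncp = ncp)     # simulated t-values
log.bf = sapply(t.sim, function(t)
  ttest.tstat(t = t, n1 = 40, n2 = 40, rscale = 1)[['bf']])   # log BF(H1/H0)
mean(exp(log.bf) < 1/3)                   # share of results with a BF favoring H0 by 3:1 or more
mean(exp(log.bf) > 3)                     # share of results with a BF favoring H1 by 3:1 or more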

Figure 5 shows a histogram of log(BF). The log is being used because BFs are ratios and have very skewed distributions.  The histogram shows that the Bayes-Factors never favor the null-hypothesis by a factor of 10 (1/10 in the histogram).  The reason is that even with Wagenmakers’ prior a sample size of N = 80 is too small to provide strong support for the null-hypothesis.  However, 21% of observed test statistics produce a Bayes-Factor less than 1/3, which is sometimes used as sufficient evidence to claim that the data support the null-hypothesis.  This means that the test has a 21% error rate of providing evidence for the null-hypothesis when the null-hypothesis is false.  A 21% error rate is 4 times larger than the 5% error rate in null-hypothesis significance testing. It is not clear why researchers should replace a statistical method with a 5% error rate for false discoveries of an effect with a method that has a roughly 20% error rate for false discoveries of null effects.

Another 48% of the results produce Bayes-Factors that are considered inconclusive. This leaves 31% of results that favor H1 with a Bayes-Factor greater than 3, and only 17% of results produce a Bayes-Factor greater than 10.   This implies that even with the low standard of a BF > 3, the test has only 31% power to provide evidence for an effect that is present.

These results are not wrong because they correctly express the support that the observed data provide for H0 and H1.  The problem only occurs when the specification of H1 is ignored. Given Wagenmakers’ prior, it is much more likely that a t-value of 1 stems from the sampling distribution of H0 than from the sampling distribution of H1.  However, when an effect is present, studies with 50% power are also much more likely to produce t-values of 1 than t-values of 6 or larger.  A different prior that is more consistent with the actual power of studies in psychology would therefore produce different Bayes-Factors and reduce the percentage of false discoveries of null effects.  Researchers who think Wagenmakers’ prior is not a realistic prior for their research domain should use a more suitable prior for their research domain.

Figure 5. Histogram of log(BF) for the simulated test statistics (d = .5, N = 80).

 

Counterarguments

Wagenmakers has ignored previous criticisms of his prior.  It is therefore not clear what counterarguments he would make.  Below, I raise some potential counterarguments that might be used to defend the use of Wagenmakers’ prior.

One counterargument could be that the prior is not very important because the influence of priors on Bayes-Factors decreases as sample sizes increase.  However, this argument ignores the fact that Bayes-Factors are often used to draw inferences from small samples. In addition, Kass (1993) pointed out that “a simple asymptotic analysis shows that even in large samples Bayes factors remain sensitive to the choice of prior” (p. 555).

Another counterargument could be that a bias in favor of H0 is desirable because it keeps the rate of false-positives low. The problem with this argument is that Bayesian statistics does not provide information about false-positive rates.  Moreover, the cost for reducing false-positives is an increase in the rate of false negatives; that is, either inconclusive results or false evidence for H0 when an effect is actually present.  Finally, the choice of the correct prior will minimize the overall amount of errors.  Thus, it should be desirable for researchers interested in Bayesian statistics to find the most appropriate priors in order to minimize the rate of false inferences.

A third counterargument could be that Wagenmakers’ prior expresses a state of maximum uncertainty, which can be considered a reasonable default when no data are available.  If one considers each study as a unique study, a default prior of maximum uncertainty would be a reasonable starting point.  In contrast, it may be questionable to treat a new study as a randomly drawn study from a sample of studies with different population effect sizes.  However, Wagenmakers’ prior does not express a state of maximum uncertainty and makes assumptions about the probability of observing very large effect sizes.  It does so without any justification for this expectation.  It therefore seems more reasonable to construct priors that are consistent with past studies and to evaluate priors against actual results of studies.

A fourth counterargument is that Bayes-Factors are superior because they can provide evidence for the null-hypothesis and the alternative hypothesis.  However, this is not correct. Bayes-Factors only provide relative support for the null-hypothesis relative to a specific alternative hypothesis.  Researchers who are interested in testing the null-hypothesis can do so using parameter estimation with confidence or credibility intervals. If the interval falls within a specified region around zero, it is possible to affirm the null-hypothesis with a specified level of certainty that is determined by the precision of the study to estimate the population effect size.  Thus, it is not necessary to use Bayes-Factors to test the null-hypothesis.

In conclusion, Bayesian statistics and other statistics are not right or wrong. They combine assumptions and data to draw inferences.  Untrustworthy data and wrong assumptions can lead to false conclusions.  It is therefore important to test the integrity of data (e.g., presence of publication bias) and to examine assumptions.  The uncritical use of Bayes-Factors with default assumptions is not good scientific practice and can lead to false conclusions just like the uncritical use of p-values can lead to false conclusions.

A comparison of The Test of Excessive Significance and the Incredibility Index


It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias.  The most prominent tests are funnel plots (Light & Pillemer, 1984) and Egger regression (Egger et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes.  As a result, observed effect sizes in a representative set of studies should also be independent of sample size.  However, publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result.  The main problem with these bias tests is that other factors can produce heterogeneity in population effect sizes, and if this variation in population effect sizes is related to sample sizes, it will also produce a correlation between observed effect sizes and sample sizes.  In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes.  A power analysis would suggest that researchers use larger samples to study smaller effects and smaller samples to study large effects.  This makes it problematic to draw strong inferences from negative correlations between effect sizes and sample sizes about the presence of publication bias.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation.  The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies.  If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50 studies.  Publication bias will lead to an inflation of the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results.  Sterling et al. (1995) found that several journals reported over 90% significant results. Based on some conservative estimates of power, they concluded that this high success rate can only be explained by publication bias.  Sterling et al. (1995), however, did not develop a method that would make it possible to estimate power.

Ioannidis and Trikalinos (2007) proposed the first test for publication bias based on power analysis.  They call it “An exploratory test for an excess of significant results” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis for examining publication bias.  The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated.  ETESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalinos (2007).  The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous.  Again, the authors are careful to point out this limitation of ETESR: “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article: “Caution is warranted when there is genuine between-study heterogeneity. Tests of publication bias generally yield spurious results in this setting” (p. 252).  Given these limitations, it would be desirable to develop a test that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibility Index (Schimmack, 2012).  The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces at least one non-significant result as the number of studies increases.  For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip.  Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again.  Probability theory shows that this outcome becomes very unlikely even after just a few coin tosses, as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.125%, and so on.  Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to raise suspicion that the coin is not fair, especially if it always falls on the side that benefits the person who is throwing the coin. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypotheses at least 90% of the time.  Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05).  It is therefore very easy to show that publication bias contributes to the incredible success rates in journals, but it is also possible to do so for smaller sets of studies.
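The arithmetic behind the coin-flip analogy is easy to verify in R. The sketch below assumes the 50% power baseline from the analogy; the last line shows one way to read the 90% claim, namely that observing at least 7 significant results in 8 studies (a success rate close to 90%) is already improbable when true power is only 50%.

# cumulative probability that k studies with 50% power all produce significant results
k = 1:8
round(0.5^k, 5)                                      # .5, .25, .125, .0625, .03125, ...

# probability of at least 7 significant results in 8 studies when power is 50%
pbinom(6, size = 8, prob = 0.5, lower.tail = FALSE)  # ~ .035, p < .05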

To avoid the requirement of a fixed effect size, the Incredibility Index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that the observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2005).  However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined.  The original Incredibility Index used the mean of observed power to estimate average power, but Yuan and Maxwell (2005) demonstrated that the mean of observed power is a biased estimate of average (true) power.  In further developments of the method, I therefore switched to the median of observed power (Schimmack, 2016), which is an unbiased estimator of average power (Schimmack, 2015).

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634. doi:10.1136/bmj.315.7109.629

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2016). A revised introduction to the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.

R-Code for (Simplified) Powergraphs with StatCheck Dataset

First you need to download the datafile from
https://github.com/chartgerink/2016statcheck_data/blob/master/statcheck_dataset.csv

Right click on <Raw> and save file.

When you are done, provide the path where R can find the file.

# Provide path (replace <path> with the folder that contains the downloaded file)
GetPath = "<path>"

# give file name
fn = "statcheck_dataset.csv"

# read datafile
d = read.csv(paste0(GetPath,fn))

# get t-values
t = d$Value
t[d$Statistic != "t"] = 0
summary(t)

#convert t-values into absolute z-scores
z.val.t = qnorm(pt(abs(t),d$df2,log.p=TRUE),log.p=TRUE)
z.val.t[z.val.t > 20] = 20
z.val.t[is.na(z.val.t)] = 0
summary(z.val.t)
hist(z.val.t[z.val.t < 6 & z.val.t > 0],breaks=30)
abline(v=1.96,col="red",lwd=2)
abline(v=1.65,col="red",lty=3)

#get F-values
F = d$Value
F[d$Statistic != "F"] = 0
F[F > 400] = 400 # cap extreme F-values (converted z-scores are capped at 20 below)
summary(F)

#convert F-values into absolute z-scores
z.val.F = qnorm(pf(abs(F),d$df1,d$df2,log.p=TRUE),log.p=TRUE)
z.val.F[z.val.F > 20] = 20
z.val.F[z.val.F < 0] = 0
z.val.F[is.na(z.val.F)] = 0
summary(z.val.F)
hist(z.val.F[z.val.F < 6 & z.val.F > 0],breaks=30)
abline(v=1.96,col="red",lwd=2)
abline(v=1.65,col="red",lty=3)

#get z-scores and convert into absolute z-scores
z.val.z = abs(d$Value)
z.val.z[d$Statistic != "Z"] = 0
z.val.z[z.val.z > 20] = 20
summary(z.val.z)
hist(z.val.z[z.val.z < 6 & z.val.z > 0],breaks=30)
abline(v=1.96,col="red",lwd=2)
abline(v=1.65,col="red",lty=3)

#check results
summary(cbind(z.val.t,z.val.F,z.val.z))

#get z-values for t,F, and z-tests
z.val = z.val.t + z.val.F + z.val.z

#check median absolute z-score by test statistic
tapply(z.val,d$Statistic,median)

##### save as r data file and reuse

### run analysis for specific author

# provide an author name as it appears in the authors column of the data file
author = "Stapel"
author.found = grepl(pattern = author,d$authors,ignore.case=TRUE)
table(author.found)

# give title for graphic
Name = paste0("StatCheck ",author)

# select z-scores of author
z.val.sel = z.val[author.found]

#set limit of y-axis of graph
ylim = 1

#create histogram
hist(z.val.sel[z.val.sel < 6 & z.val.sel > 0],xlim=c(0,6),ylab="Density",xlab="|z| scores",breaks=30,ylim=c(0,ylim),freq=FALSE,main=Name)
#add line for significance
abline(v=1.96,col="red",lwd=2)
#add line for marginal significance
abline(v=1.65,col="red",lty=3)

#compute median observed power from the selected author's z-scores between 2 and 4
mop = median(z.val.sel[z.val.sel > 2 & z.val.sel < 4])
# add bias to move normal distribution of fitted function to match observed distribution
bias = 0
# change variance of normal distribution to model heterogeneity (1 = equal power for all studies)
hetero = 1

### add fitted model curve to the plot
par(new=TRUE)
curve(dnorm(x,mop-bias,hetero),0,6,col="red",ylim=c(0,ylim),xlab="",ylab="")

### when satisfied with fit compute power
power = length(z.val.sel[z.val.sel > 1.96 & z.val.sel < 4]) * pnorm(mop - bias,1.96) + length(z.val.sel[z.val.sel > 4])
power = power / length(z.val.sel[z.val.sel > 1.96])

### add power estimate to the figure
text(3,.8,pos=4,paste0("Power: ",round(power,2)))

 

 

Replicability Report No. 2: Do Mating Primes Have Replicable Effects on Behavior?

In 2000, APA declared the following decade the decade of behavior.  The current decade may be considered the decade of replicability or rather the lack thereof.  The replicability crisis started with the publication of Bem’s (2011) infamous “Feeling the future” article.  In response, psychologists have started the painful process of self-examination.

Preregistered replication reports and systematic studies of reproducibility have demonstrated that many published findings are difficult to replicate, and when they can be replicated, actual effect sizes are about 50% smaller than the reported effect sizes in the original articles (OSC, Science, 2015).

To examine which studies in psychology produced replicable results, I created ReplicabilityReports.  Replicability reports use statistical tools that can detect publication bias and questionable research practices to examine the replicability of research findings in a particular research area.  The first replicability report examined the large literature of ego-depletion studies and found that only about a dozen studies may have produced replicable results.

This replicability report focuses on a smaller literature that used mating primes (images of potential romantic partners / imagining a romantic scenario) to test evolutionary theories of human behavior.  Most studies use the typical priming design, where participants are randomly assigned to one or more mating prime conditions or a control condition. After the priming manipulation the effect of activating mating-related motives and thoughts on a variety of measures is examined.  Typically, an interaction with gender is predicted with the hypothesis that mating primes have stronger effects on male participants. Priming manipulations vary from subliminal presentations to instructions to think about romantic scenarios for several minutes; sometimes with the help of visual stimuli.  Dependent variables range from attitudes towards risk-taking to purchasing decisions.

Shanks et al. (2015) conducted a meta-analysis of a subset of mating priming studies that focus on consumption and risk-taking.  A funnel plot showed clear evidence of bias in the published literature.  The authors also conducted several replication studies. The replication studies failed to produce any significant results. Although this outcome might be due to low power to detect small effects, a meta-analysis of all replication studies also produced no evidence for reliable priming effects (average d = .00, 95% CI [-.12, .11]).

This replicability report aims to replicate and extend Shanks et al.’s findings in three ways.  First, I expanded the database by including all articles that mentioned the term “mating primes” in a full-text search of social psychology journals.  This expanded the set of articles from 15 to 36 and the set of studies from 42 to 92. Second, I used a novel and superior bias test.  Shanks et al. used funnel plots and Egger’s regression of effect sizes on sampling error to examine bias. The problem with this approach is that heterogeneity in effect sizes can produce a negative correlation between effect sizes and sample sizes.  Power-based bias tests do not suffer from this problem (Schimmack, 2014).  A set of studies with an average power of 60% cannot produce more than 60% significant results in the long run (Sterling et al., 1995).  Thus, the discrepancy between observed power and the reported success rate provides clear evidence of selection bias. Powergraphs also make it possible to estimate the actual power of studies after correcting for publication bias and questionable research practices.  Finally, replicability reports use bias tests that can be applied to small sets of studies.  This makes it possible to find studies with replicable results even if most studies have low replicability.

DESCRIPTIVE STATISTICS

The dataset consists of 36 articles and 92 studies. The median sample size of a study was N = 103 and the total number of participants was N = 11,570. The success rate including marginally significant results, z > 1.65, was 100%.  The success rate excluding marginally significant results, z > 1.96, was 90%.  Median observed power for all 92 studies was 66%.  This discrepancy shows that the published results are biased towards significance.  When bias is present, median observed power overestimates actual power.  To correct for this bias, the R-Index subtracts the inflation rate from median observed power.  The R-Index is 66 – 34 = 32.  An R-Index below 50% implies that most studies will not replicate a significant result in an exact replication study with the same sample size and power as the original studies.  The R-Index for the 15 studies included in Shanks et al. was 34% and the R-Index for the additional studies was 36%.  This shows that convergent results were obtained for two independent samples based on different sampling procedures and that Shanks et al.’s limited sample was representative of the wider literature.
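The R-Index computation described in this paragraph can be expressed in a few lines of R. The sketch below is a simplified version that uses a single significance criterion for both observed power and the success rate, and the z-scores in the example are hypothetical, not the actual data set.

# Sketch: R-Index from a vector of absolute z-scores of focal tests
r.index = function(z, crit = 1.96) {
  obs.power = pnorm(z, mean = crit)   # observed power: P(z > crit) given the observed z as non-centrality
  mop = median(obs.power)             # median observed power
  success = mean(z > crit)            # observed success rate
  inflation = success - mop           # inflation of the success rate
  mop - inflation                     # R-Index
}

# hypothetical example
r.index(c(2.1, 2.3, 1.8, 2.6, 2.0, 3.4, 2.2, 1.7))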

POWERGRAPH

For each study, a focal hypothesis test was identified and the result of the statistical test was converted into an absolute z-score.  These absolute z-scores can vary as a function of random sampling error or differences in power and should follow a mixture of normal distributions.  Powergraphs find the best mixture model that minimizes the discrepancy between observed and predicted z-scores.

Powergraph for Romance Priming (Focal Tests)

 

The histogram of z-scores shows clear evidence of selection bias. The steep cliff on the left side of the criterion for significance (z = 1.96) shows a lack of non-significant results.  The few non-significant results are all in the range of marginal significance and were reported as evidence for an effect.

The histogram also shows evidence of the use of questionable research practices. Selection bias alone would produce a cliff to the left of the significance criterion, but leave a smooth mixture-normal distribution on the right side of the criterion. However, the graph also shows a second cliff around z = 2.8.  This second cliff can be explained by questionable research practices that inflate effect sizes to produce significant results.  These questionable research practices are much more likely to produce z-scores in the range between 2 and 3 than z-scores greater than 3.

The large amount of z-scores in the range between 1.96 and 2.8 makes it impossible to distinguish between real effects with modest power and questionable effects with much lower power that will not replicate.  To obtain a robust estimate of power, power was estimated only for z-scores greater than 2.8 (k = 17).  The power estimate is 73%. This power estimate suggests that some studies may have reported real effects that can be replicated.

The grey curve shows the predicted distribution for a set of studies with 73% power.  As can be seen, there are too many observed z-scores in the range between 1.96 and 2.8 and too few z-scores in the range between 0 and 1.96 compared to the predicted distribution based on z-scores greater than 2.8.

The powergraph analysis confirms and extends Shanks et al.’s (2016) findings. First, the analysis provides strong evidence that selection bias and questionable research practices contribute to the high success rate in the mating-prime literature.  Second, the analysis suggests that a small portion of studies may actually have reported true effects that can be replicated.

REPLICABILITY OF INDIVIDUAL ARTICLES

The replicability of results published in individual articles was examined with the Test of Insufficient Variance (TIVA) and the Replicability-Index.  TIVA tests bias by comparing the variance of observed z-scores against the variance that is expected based on sampling error.  As the sampling error of z-scores is 1, observed z-scores should have a variance of at least 1. If there is heterogeneity, the variance can be even greater, but it cannot be systematically smaller than 1.  TIVA uses the chi-square test for variances to compute the probability that a variance less than 1 occurred simply by chance.  A p-value less than .10 is used to flag an article as questionable.
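TIVA itself requires only the chi-square test for variances. The sketch below assumes a vector of z-scores from independent focal tests; the z-scores in the example are hypothetical.

# Sketch: Test of Insufficient Variance (TIVA)
tiva = function(z) {
  k = length(z)
  v = var(z)                           # observed variance of the z-scores
  p = pchisq((k - 1) * v, df = k - 1)  # left-tail p-value: is a variance below 1 plausible by chance?
  c(var = v, p = p)
}

# hypothetical example
tiva(c(2.0, 2.1, 1.9, 2.2))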

The Replicability-Index (R-Index) used observed power to test bias. Z-scores are converted into a measure of observed power and median observed power is used as an estimate of power.  The success rate (percentage of significant results) should match observed power.  The difference between success rate and median power shows an inflated success rate.  The R-Index subtracts inflation from median observed power.  A value of 50% is used as the minimum criterion for replicability.

Articles that pass both tests are examined in more detail to identify studies with high replicability.  Only three articles passed this test.

1. Greitemeyer, Kastenmüller, and Fischer (2013) [R-Index = .80]

The article with the highest R-Index reported 4 studies.  The high R-Index for this article is due to Studies 2 to 4.  Studies 3 and 4 used a 2 x 3 between-subject design with gender and three priming conditions. Both studies produced strong evidence for an interaction effect, Study 3: F(2,111) = 12.31, z = 4.33; Study 4: F(2,94) = 7.46, z = 3.30.  The pattern of the interaction is very similar in the two studies.  For women, the means are very similar and not significantly different from each other.  For men, the two mating prime conditions are very similar and significantly different from the control condition.  The standardized effect sizes for the difference between the combined mating prime conditions and the control condition are large, Study 3: t(110) = 6.09, p < .001, z = 5.64, d = 1.63; Study 4: t(94) = 5.12, d = 1.30.

Taken at face value, these results are highly replicable, but there are some concerns about the reported results. The means in conditions that are not predicted to differ from each other are very similar.  I tested the probability of this event occurring using TIVA and compared the means of the two mating prime conditions for men and women in the two studies.  The four z-scores were z = 0.53, 0.08, 0.09, and -0.40.  The variance should be 1, but the observed variance is only Var(z) = 0.14.  The probability of this reduction in variance occurring by chance is p = .056.  Thus, even though the overall R-Index for this article is high and the reported effect sizes are very large, it is likely that an actual replication study will produce weaker effects and may not replicate the original findings.

Study 2 also produced strong evidence for a priming x gender interaction, F(1,81) = 11.23, z = 3.23.  In contrast to Studies 3 and 4, this interaction was a cross-over interaction with opposite effects of primes for males and females.  However, there is some concern about the reliability of this interaction because the post-hoc tests for males and females were both just significant, males: t(40) = 2.61, d = .82; females: t(41) = 2.10, d = .63.  As these post-hoc tests are essentially two independent studies, it is possible to use TIVA to test whether these results are too similar, Var(z) = 0.11, p = .25.  The R-Index for this set of studies is low, R-Index = .24 (MOP = .62).  Thus, a replication study may replicate the interaction effect, but the chance of replicating significant results for males or females separately is lower.

Importantly, Shanks et al. (2016) conducted two close replications of Greitemeyer’s studies with risky driving, gambling, and sexual risk taking as dependent variables.  Study 5 compared the effects of short-term mate primes on risky driving.  Although the sample size was small, the large effect size in the original study implies that this study had high power to replicate the effect, but it did not, t(77) = -0.85, p = .40, z = -.85.  The negative sign indicates that the pattern of means was reversed, but not significantly so.  Study 6 failed to replicate the interaction effect for sexual risk taking reported by Greitemeyer et al., F(1, 93) = 1.15, p = .29.  The means for male participants were in the opposite direction, showing a decrease in risk taking after mating priming.  The study also failed to replicate the significant decrease in risk taking for female participants.  Study 6 also produced non-significant results for gambling and substance risk taking.  These failed replication studies raise further concerns about the replicability of the original results with extremely large effect sizes.

2. Jon K. Maner, Matthew T. Gailliot, D. Aaron Rouby, and Saul L. Miller (JPSP, 2007) [R-Index = .62]

This article passed TIVA only due to the low power of TIVA for a set of three studies, TIVA: Var(z) = 0.15, p = .14.  In Study 1, male and female participants were randomly assigned to a sexual-arousal priming condition or a happiness control condition. Participants also completed a measure of socio-sexual orientation (i.e., interest in casual and risky sex) and were classified into groups of unrestricted and restricted participants. The dependent variable was performance on a dot-probe task.  In a dot-probe task, participants have to respond to a dot that appears in the location of one of two stimuli that compete for visual attention.  In theory, participants are faster to respond to the dot if it appears in the location of a stimulus that attracts more attention.  Stimuli were pictures of very attractive or less attractive members of the same or opposite sex.  The time between the presentation of the pictures and the dot was also manipulated.  The authors reported that they predicted a three-way interaction between priming condition, target picture, and stimulus-onset time.  The authors did not predict an interaction with gender.  The ANOVA showed a significant three-way interaction, F(1,111) = 10.40, p = .002, z = 3.15.  A follow-up two-way ANOVA showed an interaction between priming condition and target for unrestricted participants, F(1,111) = 7.69, p = .006, z = 2.72.

Study 2 replicated Study 1 with a sentence-unscrambling task, which is used as a subtler priming manipulation.  The study closely replicated the results of Study 1. The three-way interaction was significant, F(1,153) = 9.11, and the follow-up two-way interaction for unrestricted participants was also significant, F(1,153) = 8.22, z = 2.75.

Study 3 changed the primes to jealousy or anxiety/frustration.  Jealousy is a mating related negative emotion and was predicted to influence participants like mating primes.  In this study, participants were classified into groups with high or low sexual vigilance based on a jealousy scale.  The predicted three-way interaction was significant, F(1,153) = 5.74, p = .018, z = 2.37.  The follow-up two-way interaction only for participants high in sexual vigilance was also significant, F(1,153) = 8.13, p = .005, z = 2.81.

A positive feature of this set of studies is that the manipulation of targets within subjects reduces within-cell variability and increases the power to produce significant results.  However, a problem is that the authors also report analyses for specific targets and do not mention that they used reaction times to other targets as a covariate. These analyses have low power due to the high variability of reaction times across participants.  Surprisingly, each of these analyses still produced the predicted significant result.

Study 1: “Planned analyses clarified the specific pattern of hypothesized effects. Multiple regression evaluated the hypothesis that priming would interact with participants’ sociosexual orientation to increase attentional adhesion to attractive opposite-sex targets. Attention to those targets was regressed on experimental condition, SOI, participant sex, and their centered interactions (nonsignificant interactions were dropped). Results confirmed the hypothesized interaction between priming condition and SOI, beta = .19, p < .05 (see Figure 1).”
I used r = .19 and N = 113 and obtained t(111) = 2.04, p = .043, z = 2.02.

Study 2: “Planned analyses clarified the specific pattern of hypothesized effects. Regression evaluated the hypothesis that the mate-search prime would interact with sociosexual orientation to increase attentional adhesion to attractive opposite-sex targets. Attention to these targets was regressed on experimental condition, SOI score, participant sex, and their centered interactions (nonsignificant interactions were dropped). As in Study 1, results revealed the predicted interaction between priming condition and sociosexual orientation, beta = .15, p = .04, one-tailed (see Figure 2)”
I used r = .15 and N = 155 and obtained t(153) = 1.88, p = .06 (two-tailed!), z = 1.86.

Study 3: “We also observed a significant main effect of intrasexual vigilance, beta = .25, p < .001, partial r = .26, and, more important, the hypothesized two-way interaction between priming condition and level of intrasexual vigilance, beta = .15, p < .05, partial r = .16 (see Figure 3).”
I used r = .16 and N = 155 and obtained t(153) = 2.00, p = .047, z = 1.99.

The problem is that the results of these three independent analyses are too similar, z = 2.02, 1.86, 1.99; Var(z) = 0.007, p = .007.
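The conversions and the TIVA result reported above can be reproduced with a short R sketch based on the reported coefficients and sample sizes.

# convert the reported coefficients and sample sizes into t-values, p-values, and absolute z-scores
r = c(.19, .15, .16)
N = c(113, 155, 155)
t = r * sqrt(N - 2) / sqrt(1 - r^2)
p = 2 * pt(abs(t), df = N - 2, lower.tail = FALSE)
z = qnorm(p / 2, lower.tail = FALSE)
round(cbind(t, p, z), 2)                              # t ~ 2.04, 1.88, 2.00; z ~ 2.02, 1.86, 1.99

# TIVA for the three z-scores
pchisq((length(z) - 1) * var(z), df = length(z) - 1)  # ~ .007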

In conclusion, there are some concerns about the replicability of these results, and even if the results replicate, they do not provide support for the hypothesis that mating primes have a hard-wired effect on males. Only one of the three studies produced a significant two-way interaction between priming and target (F-value not reported), and none of the three studies produced a significant three-way interaction between priming, target, and gender.  Thus, the results are inconsistent with other studies that found either main effects of mating primes or mating prime by gender interactions.

3. Bram Van den Bergh and Siegfried Dewitte (Proc. R. Soc. B, 2006) [R-Index = .58]

This article reports three studies that examined the influence of mating primes on behavior in the ultimatum game.

Study 1 had a small sample size of 40 male participants who were randomly assigned to seeing pictures of non-nude female models or landscapes.  The study produced a significant main effect, F(1,40) = 4.75, p = .035, z = 2.11, and a significant interaction with finger digit ratio, F(1,40) = 4.70, p = .036, z = 2.10.  I used the main effect for analysis because it is theoretically more important than the interaction effect, but the results are so similar that it does not matter which effect is used.

Study 2 used rating of women’s t-shirts or bras as manipulation. The study produced strong evidence that mating primes (rating bras) lead to lower minimum acceptance rates in the ultimatum game than the control condition (rating t-shirts), F(1,33) = 8.88, p = .005, z = 2.78.  Once more the study also produced a significant interaction with finger digit ratio, F(1,33) = 8.76, p = .006, z = 2.77.

Study 3 had three experimental conditions, namely non-sexual pictures of older and young women, and pictures of young non-nude female models.  The study produced a significant effect of condition, F(2,87) = 5.49, p = .006, z = 2.77.  Once more the interaction with finger-digit ratio was also significant, F(2,87) = 5.42.

This article barely passed the test of insufficient variance in the primary analysis that uses one focal test per study, Var(z) = 0.15, p = .14.  However, the main effect and the interaction effects are statistically independent and it is possible to increase the power of TIVA by using the z-scores for the three main effects and the three interactions.  This test produces significant evidence for bias, Var(z) = 0.12, p = .01.

In conclusion, it is unlikely that the results reported in this article will replicate.

CONCLUSION

The replicability crisis in psychology has created doubt about the credibility of published results.  Numerous famous priming studies have failed to replicate in large replication studies.  Shanks et al. (2016) reported problems with the specific literature of romantic and mating priming.  This replicability report provided further evidence that the mating prime literature is not credible.  Using an expanded set of 92 studies, analysis with powergraphs, the test of insufficient variance, and the replicability index showed that many significant results were obtained with the help of questionable research practices that inflate observed effect sizes and provide misleading evidence about the strength and replicability of published results.  Only three articles passed the test with TIVA and R-Index and detailed examination of these studies also showed statistical problems with the evidence in these articles.  Thus, this replicability analysis of 36 articles failed to identify a single credible article.  The lack of credible evidence is consistent with Shanks et al.’s failure to produce significant results in 15 independent replication studies.

Of course, these results do not imply that evolutionary theory is wrong or that sexual stimuli have no influence on human behavior.  For example, in my own research I have demonstrated that sexually arousing opposite-sex pictures capture men’s and women’s attention (Schimmack, 2005).  However, these responses occurred in response to specific stimuli and not as carry-over effects of a priming manipulation. Thus, the problem with mating prime studies is probably that priming effects are weak and may have no notable influence on unrelated behaviors like consumer behavior or risk taking in investments.  Given the replication problems with other priming studies, it seems necessary to revisit the theoretical assumptions underlying this paradigm.  For example, Shanks et al. (2016) pointed out that behavioral priming effects are theoretically implausible because these predictions contradict well-established theories that behavior is guided by the cognitive appraisal of the situation at hand rather than unconscious residual information from previous situations. This makes evolutionary sense because behavior has to respond to the adaptive problem at hand to ensure survival and reproduction.

I recommend that textbook writers, journalists, and aspiring social psychologists treat claims about human behavior based on mating priming studies with a healthy dose of skepticism.  The results reported in these articles may reveal more about the motives of researchers than their participants.

Subjective Priors: Putting Bayes into Bayes-Factors

A Post-Publication Review of “The Interplay between Subjectivity, Statistical Practice, and Psychological Science” by Jeffrey N. Rouder, Richard D. Morey, and Eric-Jan Wagenmakers

Credibility Crisis

Rouder, Morey, and Wagenmakers (RMW) start their article with the claim that psychology is facing a crisis of confidence.  Since Bem (2011) published an incredible article that provided evidence for time-reversed causality, psychologists have realized that the empirical support for theoretical claims in scientific publications is not as strong as it appears to be.  Take Bem’s article as an example. Bem presented 9 significant results in 10 statistical tests to support his incredible claim, where each test had a 5% probability (p < .05, one-tailed) of producing a false positive result if extra-sensory perception does not exist.  The probability of obtaining 9 false positive results in 10 studies is less than one in a billion; the inverse of this probability is larger than the number of studies that have ever been conducted in psychology.  It is very unlikely that such a rare event would occur by chance.  Nevertheless, subsequent studies failed to replicate this finding even though these studies had much larger sample sizes and therefore a much larger chance to replicate the original results.
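The size of this probability can be checked with one line of R, assuming that each of the ten tests has a 5% chance of producing a false positive result if the null-hypothesis is true.

# probability of 9 or more false positive results in 10 independent tests with alpha = .05
pbinom(8, size = 10, prob = .05, lower.tail = FALSE)  # ~ 1.9e-11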

RMW point out that the key problem in psychological science is that researchers use questionable research practices that increase the chances of reporting a type-I error.  They fail to mention that Francis (2012) and Schimmack (2012) provide direct evidence for the use of questionable research practices in Bem’s article.  Thus, the key problem in psychology and other sciences is that researchers are allowed to publish results that support their predictions while hiding evidence that does not support their claims.  Sterling (1959) pointed out that this selective reporting of significant results invalidates the usefulness of p-values to control the type-I error rate in a field of research.  Once researchers report only significant results, the true false positive rate could be 100%.  Thus, the fundamental problem underlying the crisis of confidence is selective reporting of significant results.  Nobody has openly challenged this claim, but many articles fail to mention this key problem. As a result, they offer solutions to the crisis of confidence that are based on a false diagnosis of the problem.

RMW suggest that the use of p-values is a fundamental problem that has contributed to the crisis of confidence and they offer Bayesian statistics as a solution to the problem.

In their own words, “the target of this critique is the practice of performing significance tests and reporting associated p-values”

This statement makes clear that the authors do not recognize selective reporting of p-values smaller than .05 as the problem, but rather question the usefulness of computing p-values in general.  In this way, they conflate an old and unresolved controversy amongst statisticians with the credibility crisis in psychology.

RMW point out that statisticians have been fighting over the right way to conduct inferential statistics for decades without any resolution.  One argument against p-values is that a significance criterion leads to a dichotomous decision when data can only strengthen or weaken the probability that a hypothesis is true or false.  That is, a p-value of .04 does not suddenly prove that a hypothesis is true.  It is just more likely that the hypothesis is true than if a study had produced a p-value of .20.  This point was made in a classic article by Rozeboom that is cited by RMW.

“The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true.” (p. 422–423)

This argument ignores that decisions have to be made. Researchers have to decide whether they want to conduct follow-up studies, editors have to decide whether the evidence is sufficient to accept a manuscript for publication, and textbook writers have to decide whether they want to include an article in a textbook.  A type-I error probability of 5% has evolved as a norm for giving a researcher the benefit of the doubt that the hypothesis is true.  If this criterion were applied rigorously, no more than 5% of published results would be type-I errors.  Moreover, replication studies would quickly weed out false-positives because the chance of repeated type-I errors decreases quickly to zero if failed replication studies are reported.

Even if we agree that there is a problem with a decision criterion, it is not clear what a Bayesian science would look like.  Would newspaper articles report that a new study increased the evidence for the effect of exercise on health from a 3:1 to a 10:1 ratio of being true? Would articles report Bayes-Factors without inferences about whether an effect exists?  It seems to defeat the purpose of an inferential statistical approach if the outcome of the inference process is not a conclusion that leads to a change in beliefs. Even though confidence in beliefs can vary in degree, the beliefs themselves are a black-or-white matter (I either believe that Berlin is the capital of Germany or I do not).

In fact, a review of articles that reported Bayes-Factors shows that most of these articles use Bayes-Factors to draw conclusions about hypotheses.  Currently, Bayes-Factors are mostly used to claim support for the absence of an effect when the Bayes-Factor favors the point null-hypothesis over an alternative hypothesis that predicted an effect.  This conclusion is typically made when the Bayes-Factor favors the null-hypothesis over the alternative hypothesis by a ratio of 3:1 or more.  RMW may disagree with this use of Bayes-Factors, but this is how their statistical approach is currently being used. In essence, BF > 3 is used like p < .05.  It is easy to see how this change to Bayesian statistics does not solve the credibility crisis if only studies that produced Bayes-Factors greater than 3 are reported.  The only new problem is that authors may publish results that suggest effects do not exist at all, when this conclusion is actually a classic type-II error (there is an effect, but the study had insufficient power to show it).

For example, Shanks et al. (2016) reported 30 statistical tests of the null-hypothesis; all tests favored the null-hypothesis over the alternative hypothesis, and 29 tests exceeded the criterion of BF > 3.  Based on these results, Shanks et al. (2016) concluded that “as indicated by the Bayes factor analyses, their results strongly support the null hypothesis of no effect.”

RMW may argue that the current use of Bayes-Factors is improper and that better training in the use of Bayesian methods will solve this problem.  It is therefore interesting to examine RMW’s vision of proper use of Bayesian statistics.

RMW state that the key difference between conventional statistics with p-values and Bayesian statistics is subjectivity.

“Subjectivity is the key to principled measures of evidence for theory from data.”

“A fully Bayesian approach centers subjectivity as essential for principled analysis”

“The subjectivist perspective provides a principled approach to inference that is transparent, honest, and productive.”

Importantly, they also characterize their own approach as consistent with the call for subjectivity in inferential statistics.

“The Bayesian-subjective approach advocated here has been 250 years in the making”

However, it is not clear where RMW’s approach allows researchers to specify their subjective beliefs.  RMW have developed or used an approach that is often characterized as objective Bayesian.   “A major goal of statistics (indeed science) is to find a completely coherent objective Bayesian methodology for learning from data. This is exemplified by the attitudes of Jeffreys (1961)”  (Berger, 2006). That is, rather than developing models based on a theoretical understanding of a research question, a generic model is used to test a point null-hypothesis (d =0) against a vague alternative hypothesis that there is an effect (d ≠ 0).  In this way, the test is similar to the traditional comparison of the null-hypothesis and the alternative hypothesis in conventional statistics.  The only difference is that Bayesian statistics aims to quantify the relative support for these two hypothesis.  This would be easy if the alternative hypothesis were specified as a competing point prediction with a specified effect size.  For example, are the data more consistent with an effect size of d = 0 or d = .5?  Specifying a fixed value would make this comparison subjective if theories are not sufficiently specified to make such precise predictions.  Thus, the subjective beliefs of a researcher are needed to pick a fixed effect size that is being compared to the null-hypothesis.  However, RMW advocate a Bayesian approach that specifies the alternative hypothesis as a distribution that covers all possible effect sizes.  Clearly no subjectivity is needed to state an alternative hypothesis that the effect size can be anywhere between -∞ and +∞.

There is an infinite number of alternative hypotheses that can be constructed by assigning different weights to effect sizes across the infinite range of effect sizes.  These distributions can take on any form, although some may be more plausible than others.  Creating a plausible alternative hypothesis could involve subjective choices.  However, RMW advocate the use of a Cauchy distribution, and their online tools and R-code only allow researchers to specify alternative hypotheses as a Cauchy distribution. Moreover, the Cauchy distribution is centered over zero, which implies that the most likely value for the alternative hypothesis is that there is no effect.  This violates the idea of subjectivity, because any researcher who tests a hypothesis against the null-hypothesis will assign the lowest probability to a zero value.  For example, if I think that the effect of exercise on weight loss is d = .5, I am saying that the most likely outcome if my hypothesis is correct is an effect size of d = .5, not an effect size of d = 0.  There is nothing inherently wrong with specifying the alternative hypothesis as a Cauchy distribution centered over 0, but it does seem wrong to present this specification as a subjective approach to hypothesis testing.

For example, let’s assume Bem (2011) wanted to use Bayesian statistics to test the hypothesis that individuals can foresee random future events.  A Cauchy distribution centered over 0 implies that a null result is the most likely outcome of the study, but this distribution does not represent his prior expectations.  Based on a meta-analysis and his experience as a researcher, he expected a small effect size of d = .2.  Thus, a subjective prior would be centered around a small effect size, and Bem clearly did not expect a zero effect size or negative effect sizes (i.e., people predicting future events with less accuracy than random guessing).  RMW ignore that other Bayesian statisticians allow for priors that are not centered over 0, and they do not compare their approach to these alternative specifications of prior distributions.

RMW’s approach to Bayesian statistics leaves one opportunity for subjectivity in specifying the prior distribution: the scaling parameter of the Cauchy distribution. The scaling parameter divides the density (the area under the curve) so that 50% of the distribution falls within one scale unit of zero and 50% falls in the tails.  RMW initially used a scaling parameter of 1 as a default setting.  This scaling parameter implies that the prior distribution allocates a 50% probability to effect sizes in the range from -1 to 1 and a 50% probability to larger effect sizes.  Using the same default setting makes the approach fully objective or non-subjective because the same a priori distribution is used independent of subjective beliefs relevant to a particular research question.  Rouder and Morey later changed the default setting to a scaling parameter of .707, whereas Wagenmakers continues to use a scaling parameter of 1.
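The interpretation of the scaling parameter is easy to verify in R: for a Cauchy distribution, half of the probability mass always falls within one scale unit of the center.

# 50% of the prior mass of a Cauchy(0, 1) falls between d = -1 and d = 1
pcauchy(1, location = 0, scale = 1) - pcauchy(-1, location = 0, scale = 1)  # 0.5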

RMW suggest that the use of a default scaling parameter is not the most optimal use of their approach. “We do not recommend a single default model, but a collection of models that may be tuned by a single parameter, the scale of the distribution on effect size.”  Jeff Rouder also provided R-code that allows researchers to specify their own prior distributions and compute Bayes-Factors for a given observed effect size and sample size (the posted R-script is limited to within-subject/one-sample t-tests).

However, RMW do not provide practical guidelines for how researchers should translate their subjective beliefs into a model with the corresponding scaling factor.  RMW give only very minimal recommendations. They suggest that a scaling factor greater than 1 is implausible because it would give too much weight to large effect sizes.  Remember that even a scaling factor of 1 implies that there is a 50% chance that the absolute effect size is greater than 1 standard deviation.  They also suggest that setting the scaling factor to values smaller than .2 “makes the a priori distribution unnecessarily narrow because it does not give enough credence to effect sizes normally observed in well-executed behavioral-psychological experiments.”  This still leaves a wide range of scaling factors from .2 to 1.  RMW do not provide further guidelines for how researchers should set the scaling parameter. Instead, they suggest that the default setting of .707 “is perfectly reasonable in most contexts.”  They do not explain why a value of .707 is perfectly reasonable or in what contexts this value is not reasonable.  Thus, they do not help researchers set subjectively meaningful parameters, but rather imply that the default model can be used without thinking about the actual research question.

In my opinion, the use of a default setting is unsatisfactory because the choice of the scaling factor has an influence on the Bayes-Factor.  As noted by RMW, “there certainly is an effect of scale. The largest effect occurs for t = 0.” When the t-value is 0, the data provide maximal support for the null-hypothesis.  In RMW’s example, changing the scaling parameter from .2 to 1 increases the odds of the null-hypothesis being true from 5:1 to 10:1.  For gamblers who put real money on the line, the difference between odds of 5:1 and 10:1 is notable.  To win $100 at odds of 10:1 rather than 5:1, I have to risk losing $10 rather than $20, and that difference of $10 can buy two lattes at Starbucks or a hot dog at a ballgame.
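The direction of this effect can be checked with the BayesFactor package. The sample size in the sketch below (n = 50 per group) is an arbitrary assumption for illustration, so the resulting odds will not exactly match the 5:1 and 10:1 values in RMW’s example.

# BF01 (odds in favor of H0) for t = 0 under two scaling parameters (n = 50 per group is assumed)
library(BayesFactor)
bf10.narrow = as.numeric(ttest.tstat(t = 0, n1 = 50, n2 = 50, rscale = 0.2, simple = TRUE))
bf10.wide = as.numeric(ttest.tstat(t = 0, n1 = 50, n2 = 50, rscale = 1, simple = TRUE))
c(BF01.narrow = 1 / bf10.narrow, BF01.wide = 1 / bf10.wide)  # the wider prior favors H0 more strongly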

What is a reasonable prior in between-subject designs?

Given the lack of guidance from RMW, I would like to make my own suggestion based on Cohen’s work on standardized effect sizes and the wealth of information about typical standardized effect sizes from meta-analyses.

One possibility is to create a prior distribution that matches the typical effect sizes observed in psychological research.  Cohen provides some helpful guidelines that researchers can use to conduct power analyses.  He suggested that a moderate effect size is a difference of half a standard deviation (d = .5).  For other metrics, like correlation coefficients, a moderate effect size is r = .3.  Other useful information comes from Richard, Bond, and Stokes-Zoota’s meta-analysis of 100 years of social psychological research, which produced a median effect size of r = .21 (d ~ .4).  The recent replication of 100 studies in social and cognitive psychology also yielded a median effect size of r = .2 (OSC, Science, 2015).

 

Figure: Effect size distribution in Richard, Bond, and Stokes-Zoota’s meta-analysis.

I suggest that researchers can translate their own subjective expectations into prior distributions by considering the distribution of effect sizes with the help of Cohen’s criteria for small, moderate, and large effect sizes.  That is, how many effect sizes does a researcher expect to be less than .2, between .2 and .5, between .5 and .8, and larger than .8?

A distribution that produces this expectation can be found using either a Cauchy or a Normal distribution and by changing the parameters for the mean and variability.

 

# location (center) and spread (width) of the candidate prior distribution
center = .5
width = .5

# probability of effect sizes > .8, .5-.8, .2-.5, and 0-.2 under a normal prior
# (computed on the mirrored negative scale and normalized to positive effect sizes)
p = c()
p[1] = pnorm(-.8,-center,width)
p[2] = pnorm(-.5,-center,width) - pnorm(-.8,-center,width)
p[3] = pnorm(-.2,-center,width) - pnorm(-.5,-center,width)
p[4] = pnorm(0,-center,width) - pnorm(-.2,-center,width)
p = p / sum(p)
p

# the same bins under a Cauchy prior with the same location and scale
p = c()
p[1] = pcauchy(-.8,-center,width)
p[2] = pcauchy(-.5,-center,width) - pcauchy(-.8,-center,width)
p[3] = pcauchy(-.2,-center,width) - pcauchy(-.5,-center,width)
p[4] = pcauchy(0,-center,width) - pcauchy(-.2,-center,width)
p = p / sum(p)
p
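Setting center = 0 and width = 1 in the same code reproduces (after rounding) the Cauchy(0,1) column of Table 1 below, with p[1] corresponding to the largest effect sizes and p[4] to the smallest.

# reproduce the Cauchy(0,1) column of Table 1 (p[1] = share > .8, p[4] = share between 0 and .2)
center = 0
width = 1
p = c(pcauchy(-.8, -center, width),
pcauchy(-.5, -center, width) - pcauchy(-.8, -center, width),
pcauchy(-.2, -center, width) - pcauchy(-.5, -center, width),
pcauchy(0, -center, width) - pcauchy(-.2, -center, width))
round(100 * p / sum(p))  # 57 13 17 13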

The problem with a Cauchy distribution centered over 0 is that it is impossible to specify the expectation that most effect sizes will fall into the small-to-large range (Table 1).  To create this scenario, RMW use a gamma distribution, but a gamma distribution has a steep decline because it has to asymptote to 0 for an effect size of 0.  A normal distribution centered over a moderate effect size does not have this unrealistic property.  RMW also provide no subjective reason for the choice of a Cauchy distribution, which is not surprising because it originated in Jeffreys’ work that tried to create a fully objective Bayesian approach.

 

Table 1. Percentage of prior probability (positive effect sizes) in each range of Cohen’s d

Cohen’s d    Cauchy(0,1)   Cauchy(0,.707)   Cauchy(0,.4)   Cauchy(0,.2)
.0 – .2          13              18               30             50
.2 – .5          17              22               28             26
.5 – .8          13              15               13             09
> .8             57              46               30             16

 

To obtain an a priori distribution with higher probabilities for moderate effect sizes, it is necessary to shift the center of the distribution from 0 to a moderate effect size.  This can be done with a normal distribution or a Cauchy distribution.  However, Table 2 shows that the Cauchy distribution gives too much weight to large effect sizes. A normal distribution centered at d = .5 with a standard deviation of .5 also gives too much weight to large effect sizes.

 

Table 2. Percentage of prior probability (positive effect sizes) in each range of Cohen’s d

Cohen’s d    Norm(.5,.5)   Norm(.4,.4)   Cauchy(.5,.5)   Cauchy(.4,.4)
.0 – .2          14             18             10              14
.2 – .5          27             34             23              30
.5 – .8          27             29             23              23
> .8             33             19             44              33

 

In my opinion, a normal distribution with a mean of .4 and a standard deviation of .4 produces a reasonable prior distribution of effect sizes that matches the meta-analytic distribution of effect sizes reasonably well.  This prior distribution assigns a probability of 63% to effect sizes in the range between .2 and .8 and about equal probabilities to smaller and larger effect sizes.

The prior distribution is an important, integral part of Bayesian inference.  Unlike p-values, Bayes-Factors can only be interpreted conditional on the prior distribution. A Bayes-Factor that favors the alternative over the point null-hypothesis can be used to bet on the presence of an effect, but it does not provide information about the size of the effect.  A Bayes-Factor that favors the null-hypothesis over an alternative only means that it is better to bet on the null-hypothesis than to bet on a specific weighted composite of effect sizes (e.g., a distributed bet with $1 on d = .2, $5 on d = .5, and $10 on d = .8). This bet may be a bad bet, but it may still be better to bet $16 on d = .2 than to bet on d = .0.  To determine the odds for other bets, other priors would have to be tested.  Therefore, it is crucial that researchers who report Bayes-Factors as scientific evidence specify how they chose the prior distribution.  A Bayes-Factor in favor of the null-hypothesis with a Cauchy(0,10) prior that places a 50% probability on effect sizes greater than 10 standard deviations (e.g., an increase in IQ by 150 points) does not tell us that there is no effect; it only tells us that the researcher chose a bad prior.  Researchers can use the R-code provided by Jeff Rouder or the R-code and online app provided by Dienes to compute Bayes-Factors for non-centered, normal priors.
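For readers who want to try a non-centered normal prior directly, the following sketch uses the standard normal-approximation formula (the marginal likelihood of the observed effect size under each hypothesis), in the spirit of Dienes’ calculator. The observed effect size, sample size, and prior settings are hypothetical placeholders, not values from any particular study.

# Sketch: Bayes-Factor for H1: delta ~ Normal(.4, .4) versus H0: delta = 0,
# using a normal approximation to the likelihood of the observed effect size d.obs
bf.normal.prior = function(d.obs, se, prior.mean = .4, prior.sd = .4) {
  m0 = dnorm(d.obs, mean = 0, sd = se)                               # likelihood under H0
  m1 = dnorm(d.obs, mean = prior.mean, sd = sqrt(se^2 + prior.sd^2)) # marginal likelihood under H1
  m1 / m0                                                            # BF10
}

# hypothetical example: observed d = .30 in a between-subject study with 50 participants per group
se = sqrt(1/50 + 1/50)   # approximate standard error of d
bf.normal.prior(.30, se)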

CONCLUSIONS

In conclusion, RMW start their article with the credibility crisis in psychology, discuss Bayesian statistics as a solution, and suggest Jeffreys’ objective Bayesian approach as an alternative to traditional significance testing.  In my opinion, replacing significance testing with an objective Bayesian approach creates new problems that have not been solved and fails to address the root cause of the credibility crisis in psychology, namely the common practice of reporting only results that support a hypothesis that predicts an effect.  Therefore, I suggest that psychologists need to focus on open science and encourage full disclosure of data and research methods to fix the credibility problem.  Whether data are reported with p-values or Bayes-Factors, confidence intervals or credibility intervals is less important.  All articles should report basic statistics like means, unstandardized regression coefficients, standard deviations, and sampling error.  With this information, researchers can compute p-values, confidence intervals, or Bayes-Factors and draw their own conclusions from the data. It is even better if the actual data are made available.  Good data will often survive bad statistical analysis (robustness), but good statistics cannot solve the problem of bad data.