Category Archives: Power

How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press). 

Forthcoming article: 
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)

Brief Introduction

Since JPSP published incredible evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned.  There is talk of a crisis of confidence, a replication crisis, or a credibility crisis.  However, hard data on the credibility of empirical findings published in social psychology journals are scarce.

There have been two approaches to examine the credibility of social psychology.  One approach relies on replication studies.  Authors attempt to replicate original studies as closely as possible.  The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015), which replicated one study from each of 100 articles; 54 articles were classified as social psychology.   Of the original studies that reported a significant result, only about a quarter produced a significant result in the replication studies.  This estimate of replicability suggests that researchers conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate. One concern about the OSC results is that replicating original studies can be difficult.  For example, a bilingual study in California may not produce the same results as a bilingual study in Canada.  It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of original studies.

A second approach is to estimate replicability of published results using statistical methods.  The advantage of this approach is that replicability estimates are predictions for exact replication studies of the original studies because the original studies provide the data for the replicability estimates.   This is the approach used by Motyl et al.

The authors sampled 30% of articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) from four major social psychology journals (JPSP, PSPB, JESP, and PS).  For each study, coders identified one focal hypothesis and recorded the statistical result.  The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs.  Only 19 statistics were z-tests.   The authors applied various statistical tests to the data that test for the presence of publication bias or whether the studies have evidential value (i.e., reject the null-hypothesis that all published results are false positives).  For the purpose of estimating replicability, the most important statistic is the R-Index.

The R-Index has two components.  First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly).  Second, it computes the percentage of studies with a significant result.  In an unbiased set of studies, median observed power and the percentage of significant results should match.  Publication bias and questionable research practices will produce more significant results than predicted by median observed power.  The discrepancy is called the inflation rate.  The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present.  The R-Index is not a replicability estimate.  That is, an R-Index of 30 does not mean that 30% of studies will produce a significant result.  However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80.  An exception is an R-Index of 50, which is equivalent to a replicability estimate of 50%.  If the R-Index is below 50, one would expect more replication failures than successes.
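The R-Index computation can be illustrated with a minimal R sketch; the p-values below are hypothetical examples, and observed power is computed with the normal approximation used throughout this blog.

p = c(.001, .01, .02, .03, .04, .049)          # hypothetical focal-test p-values (two-tailed)
z = qnorm(1 - p/2)                             # convert p-values to absolute z-scores
obs.power = pnorm(z, qnorm(.975))              # observed power of each study
success.rate = mean(p < .05)                   # percentage of significant results (here 100%)
inflation = success.rate - median(obs.power)   # inflation rate
r.index = median(obs.power) - inflation        # R-Index
r.index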

Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”

I think this interpretation of the R-Index results is too harsh.  I consider an R-Index below 50 an F (fail).  An R-Index in the 50s is a D, and an R-Index in the 60s is a C.  An R-Index greater than 80 is considered an A.  So, clearly there is a replication crisis, but social psychology is not rotten to the core.

The R-Index is a simple tool, but it is not designed to estimate replicability.  Jerry Brunner and I developed a method that can estimate replicability, called z-curve.  All test-statistics are converted into absolute z-scores and a kernel density distribution is fitted to the histogram of z-scores.  Then a mixture model of normal distributions is fitted to the density distribution and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate;  that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly.   Motyl et al. did not have access to this statistical tool.  They kindly shared their data and I was able to estimate replicability with z-curve.  For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163).   The Figure shows two results.  The left figure uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right figure uses only z-scores greater than 2.4.  The reason is that just-significant results may be compromised by questionable research methods that may bias estimates.
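The first step of z-curve, converting test statistics into absolute z-scores, can be sketched in a few lines of R (the t- and F-values below are hypothetical; the kernel density estimation and mixture-model fitting are not shown).

t.val = 2.50; t.df = 38                            # hypothetical t-test result
F.val = 6.20; F.df1 = 1; F.df2 = 76                # hypothetical F-test result
p.t = 2*pt(abs(t.val), t.df, lower.tail = FALSE)   # two-tailed p-value of the t-test
p.F = pf(F.val, F.df1, F.df2, lower.tail = FALSE)  # p-value of the F-test
qnorm(1 - c(p.t, p.F)/2)                           # absolute z-scores used by z-curve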

[Figure: z-curve estimates for Motyl et al.'s data, using all z-scores greater than 2 (left) and only z-scores greater than 2.4 (right)]

The key finding is the replicability estimate.  Both estimates are similar (48% vs. 49%).  Even with over 1,000 observations there is uncertainty in these estimates, and the 95% CI ranges from 45% to 54% when all significant results are used.   Based on this finding, it is predicted that about half of these results would produce a significant result again in a replication study.

However, it is important to note that there is considerable heterogeneity in replicability across studies.  As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate.  This is shown with average power estimates for bands of z-scores at the bottom of the figure.   In the left figure,  z-scores between 2 and 2.5 (~ .01 < p < .05) have only a replicability of 31%, and even z-scores between 2.5 and 3 have a replicability below 50%.  It requires z-scores greater than 4 to reach a replicability of 80% or more.   Similar results are obtained for actual replication studies in the OSC reproducibility project.  Thus, researchers should take the strength of evidence of a particular study into account.  Studies with p-values in the .01 to .05 range are unlikely to replicate without boosting sample sizes.  Studies with p-values less than .001 are likely to replicate even with the same sample size.

Independent Replication Study 

Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project.  For this purpose, I coded all studies in the OSC reproducibility project.  The actual replication project often picked one study from articles with multiple studies.  54 social psychology articles reported 173 studies.   The focal hypothesis test of each study was used to compute absolute z-scores that were analyzed with z-curve.

[Figure: z-curve estimates for the hand-coded OSC reproducibility project studies (social psychology)]

The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%).  The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap.  Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.

Automated Extraction Estimates

Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases. Often studies report more than one hypothesis test, and it is not trivial to pick one of the tests for further analysis.  An alternative approach is to automatically extract all test statistics from articles.  This also makes it possible to base estimates on a much larger sample of test results.  The downside of automated extraction is that articles also report statistical analyses of trivial or non-critical tests (e.g., manipulation checks).  The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability.  I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings).   Further analyses replicated the finding from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals.  The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging trend in 2016.  This estimate is about 10 percentage points above the estimates based on hand-coded focal hypothesis tests in the two datasets above.  The discrepancy may be due to the inclusion of trivial and less original statistical tests in the automated analysis.  However, a 10 percentage-point difference is not dramatic.  Neither 50% nor 60% replicability justifies claims that social psychology is rotten to the core, nor do these estimates meet the expectation that researchers should plan studies with 80% power to detect a predicted effect.

[Figure: Automated replicability estimates for cognitive versus social psychology journals, 2010-2016]

Moderator Analyses

Motyl et al. (in press) did extensive coding of the studies.  This makes it possible to examine potential moderators (predictors) of higher or lower replicability.  As noted earlier, the strength of evidence is an important predictor.  Studies with higher z-scores (smaller p-values) are, on average, more replicable.  The strength of evidence is a direct function of statistical power.  Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.

It is well known that larger samples have less sampling error.  Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3).  I also examined the R-Index for different ranges of sample sizes.  The R-Index was the lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52) and further for sample sizes between 200 and 1,000 (R-Index = 69).  Interestingly, the R-Index for small samples with N < 40 was 70.  This is explained by the fact that research designs also influence replicability and that small samples often use more powerful within-subject designs.

A moderator analysis with design as moderator confirms this.  The R-Index for between-subject designs is the lowest (R-Index = 48), followed by mixed designs (R-Index = 61) and then within-subject designs (R-Index = 75).  This pattern is also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.

Another possibility is that articles with more studies contain smaller and less replicable studies.  However, the number of studies in an article was not a notable moderator:  1 study, R-Index = 53; 2 studies, R-Index = 51; 3 studies, R-Index = 60; 4 studies, R-Index = 52; 5 studies, R-Index = 53.

Conclusion 

Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals.  Their article complements results from the OSC reproducibility project that used actual replications, but a much smaller number of studies.  The two approaches produce different results.  Actual replication studies produced only 25% successful replications.  Statistical estimates of replicability are around 50%.   Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting the differences.  However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly.  This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result.  Some non-random, but also not reproducible, factor may have helped to produce a significant result in this study.  Statistical models assume that we can redo a study exactly and may therefore overestimate the success rate for actual replication studies.  Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly.  This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.

 

 

Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman's Science article "Measurement error and the replication crisis." Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article "Measurement error and the replication crisis" Loken and Gelman (LG) "caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger" (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, observed effect sizes must exceed the population effect size to reach significance. Thus, all significant effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5% [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).
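The 62.5% figure follows from a simple normal approximation, as the following sketch shows.

z.crit = qnorm(.975)                       # two-tailed significance criterion
power = .80
ncp = z.crit + qnorm(power)                # non-centrality that yields 80% power
# probability that a significant result overestimates the population effect size
pnorm(ncp, ncp, lower.tail = FALSE) / pnorm(z.crit, ncp, lower.tail = FALSE)   # .5 / .8 = .625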

Although we agree with Loken and Gelman's general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote "In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance" (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than a perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always attenuates observed effect sizes, even when effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between the two measures. Random error also increases the sampling error. As the non-central t-value is the ratio of the effect size over sampling error, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

[Figure 1: Median observed power after selection for significance as a function of true power]

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Loken and Gelman's simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50-3050 to 25-1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

[Figure 2: Observed effect sizes after selection for significance as a function of sample size and reliability]

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585. doi: 10.1126/science.aal3618

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N - 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 - 2) - sqrt(N - 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 750   # extended beyond the largest sample size so that the legend drawn at x = 600-730 stays visible

y.min = 0.10

y.max = 0.45

ylab = "Median Observed Effect Size After Selection for Significance"

title = "Effect of Selection for Significance on Observed Effect Size"

### colors for the five reliability levels (defined here so that col[i] in the loop below exists)
col = c("black","blue","red","darkgreen","purple")

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab=ylab,lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0("Rel = ",format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people understand the concepts of true power and observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, compute the predicted median observed power for each value, and then pick the true power value with the smallest discrepancy between the observed median power and the predicted inflated power estimate.  This approach is essentially the same as the approach used by p-curve and p-uniform, which only differ in the criterion that is being minimized.

Here is the R code for the conversion of true.power into the predicted observed power after selection for significance.

z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
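Given this prediction, true power can be recovered from an observed median power of significant results with a simple grid search over the vectors defined above (the observed value of .75 is hypothetical).

observed.median.power = .75                                    # hypothetical median observed power of significant results
true.power[which.min(abs(obs.pow - observed.median.power))]    # returns approximately .50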

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

[Figure: Inflated observed power after selection for significance as a function of true power]

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
p = .02   # example: a reported two-tailed p-value (hypothetical)
Obs.power = pnorm(qnorm(1-p/2),z.crit)

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.

P.S.  It is possible to prove the formula that transforms true power into median observed power.  Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
med.z.sig = median(z.sim[z.sim > z.crit])
obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

Subjective Bayesian T-Test Code

########################################################

rm(list=ls()) #will remove ALL objects

##############################################################
### Bayes-Factor Calculations for T-tests
##############################################################

#Start of Settings

### Give a title for results output
Results.Title = 'Normal(x,0,.5) N = 100 BS-Design, Obs.ES = 0'

### Criterion for Inference in Favor of H0, BF (H1/H0)
BF.crit.H0 = 1/3

### Criterion for Inference in Favor of H1
#set z.crit.H1 to Infinity to use Bayes-Factor, BF(H1/H0)
BF.crit.H1 = 3
z.crit.H1 = Inf

### Set Number of Groups
gr = 2

### Set Total Sample size
N = 100

### Set observed effect size
### for between-subject designs and one sample designs this is Cohen’s d
### for within-subject designs this is dz
obs.es = 0

### Set the mode of the alternative hypothesis
alt.mode = 0

### Set the variability of the alternative hypothesis
alt.var = .5

### Set the shape of the distribution of population effect sizes
alt.dist = 2  #1 = Cauchy; 2 = Normal

### Set the lower bound of population effect sizes
### Set to zero if there is zero probability to observe effects with the opposite sign
low = -3

### Set the upper bound of population effect sizes
### For example, set to 1, if you think effect sizes greater than 1 SD are unlikely
high = 3

### set the precision of density estimation (bigger takes longer)
precision = 100

### set the graphic resolution (higher resolution takes longer)
graphic.resolution = 20

### set limit for non-central t-values
nct.limit = 100

################################
# End of Settings
################################

# compute degrees of freedom
df = (N - gr)

# get range of population effect sizes
pop.es=seq(low,high,(1/precision))

# compute sampling error
se = gr/sqrt(N)

# limit population effect sizes based on non-central t-values
pop.es = pop.es[pop.es/se >= -nct.limit & pop.es/se <= nct.limit]

# function to get weights for Cauchy or Normal Distributions
get.weights=function(pop.es,alt.dist,p) {
if (alt.dist == 1) w = dcauchy(pop.es,alt.mode,alt.var)
if (alt.dist == 2) w = dnorm(pop.es,alt.mode,alt.var)
return(w)
}

# get weights for population effect sizes
weights = get.weights(pop.es,alt.dist,precision)

#Plot Alternative Hypothesis
Title="Alternative Hypothesis"
ymax=max(max(weights)*1.2,1)
plot(pop.es,weights,type='l',ylim=c(0,ymax),xlab="Population Effect Size",ylab="Density",main=Title,col='blue',lwd=3)
abline(v=0,col='red')

#create observations for plotting of prediction distributions
obs = seq(low,high,1/graphic.resolution)

# Get distribution for observed effect size assuming H1
H1.dist = as.numeric(lapply(obs, function(x) sum(dt(x/se,df,pop.es/se) * weights)/precision))

#Get Distribution for observed effect sizes assuming H0
H0.dist = dt(obs/se,df,0)

#Compute Bayes-Factors for Prediction Distribution of H0 and H1
BFs = H1.dist/H0.dist

#Compute z-scores (strength of evidence against H0)
z = qnorm(pt(obs/se,df,log.p=TRUE),log.p=TRUE)

# Compute H1 error rate rate
BFpos = BFs
BFpos[z < 0] = Inf
if (z.crit.H1 == Inf) z.crit.H1 = abs(z[which(abs(BFpos-BF.crit.H1) == min(abs(BFpos-BF.crit.H1)))])
ncz = qnorm(pt(pop.es/se,df,log.p=TRUE),log.p=TRUE)
weighted.power = sum(pnorm(abs(ncz),z.crit.H1)*weights)/sum(weights)
H1.error = 1-weighted.power

#Compute H0 Error Rate
z.crit.H0 = abs(z[which(abs(BFpos-BF.crit.H0) == min(abs(BFpos-BF.crit.H0)))])
H0.error = (1-pnorm(z.crit.H0))*2

# Get density for observed effect size assuming H0
Density.Obs.H0 = dt(obs.es/se,df,0)   # divided by se for consistency with H0.dist and the H1 density below

# Get density for observed effect size assuming H1
Density.Obs.H1 = sum(dt(obs.es/se,df,pop.es/se) * weights)/precision

# Compute Bayes-Factor for observed effect size
BF.obs.es = Density.Obs.H1 / Density.Obs.H0

#Compute z-score for observed effect size
obs.z = qnorm(pt(obs.es/se,df,log.p=TRUE),log.p=TRUE)

#Show Results
ymax=max(H0.dist,H1.dist)*1.3
plot(type='l',z,H0.dist,ylim=c(0,ymax),xlab="Strength of Evidence (z-value)",ylab="Density",main=Results.Title,col='black',lwd=2)
par(new=TRUE)
plot(type='l',z,H1.dist,ylim=c(0,ymax),xlab="",ylab="",col='blue',lwd=2)
abline(v=obs.z,lty=2,lwd=2,col='darkgreen')
abline(v=-z.crit.H1,col='blue',lty=3)
abline(v=z.crit.H1,col='blue',lty=3)
abline(v=-z.crit.H0,col='red',lty=3)
abline(v=z.crit.H0,col='red',lty=3)
points(pch=19,c(obs.z,obs.z),c(Density.Obs.H0,Density.Obs.H1))
res = paste0('BF(H1/H0): ',format(round(BF.obs.es,3),nsmall=3))
text(min(z),ymax*.95,pos=4,res)
res = paste0('BF(H0/H1): ',format(round(1/BF.obs.es,3),nsmall=3))
text(min(z),ymax*.90,pos=4,res)
res = paste0('H1 Error Rate: ',format(round(H1.error,3),nsmall=3))
text(min(z),ymax*.80,pos=4,res)
res = paste0('H0 Error Rate: ',format(round(H0.error,3),nsmall=3))
text(min(z),ymax*.75,pos=4,res)

######################################################
### END OF Subjective Bayesian T-Test CODE
######################################################
### Thank you to Jeff Rouder for posting his code that got me started.
### http://jeffrouder.blogspot.ca/2016/01/what-priors-should-i-use-part-i.html

 

Wagenmakers' Default Prior is Inconsistent with the Observed Results in Psychological Research

Bayesian statistics is like all other statistics. A bunch of numbers are entered into a formula and the end result is another number.  The meaning of the number depends on the meaning of the numbers that enter the formula and the formulas that are used to transform them.

The input for a Bayesian inference is no different than the input for other statistical tests.  The input is information about an observed effect size and sampling error. The observed effect size is a function of the unknown population effect size and the unknown bias introduced by sampling error in a particular study.

Based on this information, frequentists compute p-values and some Bayesians compute a Bayes-Factor. The Bayes-Factor expresses how compatible an observed test statistic (e.g., a t-value) is with one of two hypotheses. Typically, the observed t-value is compared to a distribution of t-values under the assumption that H0 is true (the population effect size is 0 and t-values are expected to follow a t-distribution centered over 0) and a distribution under an alternative hypothesis. The alternative hypothesis assumes that the effect size is in a range from -infinity to infinity, which of course is true. To make this a workable alternative hypothesis, H1 assigns weights to these effect sizes. Effect sizes with bigger weights are assumed to be more likely than effect sizes with smaller weights. A weight of 0 would mean a priori that these effects cannot occur.

As Bayes-Factors depend on the weights attached to effect sizes, it is also important to realize that the support for H0 depends on the probability that the prior distribution was a reasonable distribution of probable effect sizes. It is always possible to get a Bayes-Factor that supports H0 with an unreasonable prior.  For example, an alternative hypothesis that assumes that an effect size is at least two standard deviations away from 0 will not be favored by data with an effect size of d = .5, and the BF will correctly favor H0 over this improbable alternative hypothesis.  This finding would not imply that the null-hypothesis is true. It only shows that the null-hypothesis is more compatible with the observed result than the alternative hypothesis. Thus, it is always necessary to specify and consider the nature of the alternative hypothesis to interpret Bayes-Factors.

Although the a priori probabilities of  H0 and H1 are both unknown, it is possible to test the plausibility of priors against actual data.  The reason is that observed effect sizes provide information about the plausible range of effect sizes. If most observed effect sizes are less than 1 standard deviation, it is not possible that most population effect sizes are greater than 1 standard deviation.  The reason is that sampling error is random and will lead to overestimation and underestimation of population effect sizes. Thus, if there were many population effect sizes greater than 1, one would also see many observed effect sizes greater than 1.

To my knowledge, proponents of Bayes-Factors have not attempted to validate their priors against actual data. This is especially problematic when priors are presented as defaults that require no further justification for a specification of H1.

In this post, I focus on Wagenmakers' prior because Wagenmakers has been a prominent advocate of Bayes-Factors as an alternative approach to conventional null-hypothesis-significance testing.  Wagenmakers' prior is a Cauchy distribution with a scaling factor of 1.  This scaling factor implies a 50% probability that effect sizes are larger than 1 standard deviation.  This prior was used to argue that Bem's (2011) evidence for PSI was weak. It has also been used in many other articles to suggest that the data favor the null-hypothesis.  These articles fail to point out that the interpretation of Bayes-Factors in favor of H0 is only valid for Wagenmakers' prior. A different prior could have produced different conclusions.  Thus, it is necessary to examine whether Wagenmakers' prior is a plausible prior for psychological science.

Wagenmakers’ Prior and Replicability

A prior distribution of effect sizes makes assumptions about population effect sizes. In combination with information about sample size, it is possible to compute non-centrality parameters, which are equivalent to the population effect size divided by sampling error.  For each non-centrality parameter it is possible to estimate power as the area under the curve of the non-central t-distribution on the right side of the criterion value that corresponds to alpha, typically .05 (two-tailed).   The assumed typical power is simply the weighted average of the power values across non-centrality parameters.

Replicability is not identical to power for a set of studies with heterogeneous non-centrality parameters because studies with higher power are more likely to become significant. Thus, the set of studies that achieved significance has higher average power than the original set of studies.

Aside from power, the distribution of observed test statistics is also informative. Unlike power, which is bounded at 1, the distribution of test statistics is unbounded. Thus, unreasonable assumptions about the distribution of effect sizes are visible in a distribution of test statistics that does not match distributions of test statistics in actual studies.  One problem is that test statistics are not directly comparable for different sample sizes or statistical tests because non-central distributions vary as a function of degrees of freedom and the test being used (e.g., chi-square vs. t-test).  To solve this problem, it is possible to convert all test statistics into z-scores so that they are on a common metric.  In a heterogeneous set of studies, the sign of the effect provides no useful information because signs only have to be consistent in tests of the same population effect size. As a result, it is necessary to use absolute z-scores. These absolute z-scores can be interpreted as the strength of evidence against the null-hypothesis.

I used a sample size of N = 80 and assumed a between subject design. In this case, sampling error is defined as 2/sqrt(80) = .224.  A sample size of N = 80 is the median sample size in Psychological Science. It is also the total sample size that would be obtained in a 2 x 2 ANOVA with n = 20 per cell.  Power and replicability estimates would increase for within-subject designs and for studies with larger N. Between subject designs with smaller N would yield lower estimates.

I simulated effect sizes in the range from 0 to 4 standard deviations.  Effect sizes of 4 or larger are extremely rare. Excluding these extreme values means that power estimates underestimate power slightly, but the effect is negligible because Wagenmakers’ prior assigns low probabilities (weights) to these effect sizes.

For each possible effect size in the range from 0 to 4 (using a resolution of d = .001)  I computed the non-centrality parameter as d/se.  With N = 80, these non-centrality parameters define a non-central t-distribution with 78 degrees of freedom.

I computed the implied power to achieve a significant result with alpha = .05 (two-tailed) with the formula

power = pt(ncp,N-2,qt(1-.025,N-2))

The formula returns the area under the curve on the right side of the criterion value that corresponds to a two-tailed test with p = .05.

The mean of these power values is the average power of studies if all effect sizes were equally likely.  The value is 89%. This implies that in the long run, a random sample of studies drawn from this population of effect sizes is expected to produce 89% significant results.

However, Wagenmakers' prior assumes that smaller effect sizes are more likely than larger effect sizes. Thus, it is necessary to compute the weighted average of power using Wagenmakers' prior distribution as weights.  The weights were obtained using the density of a Cauchy distribution with a scaling factor of 1 for each effect size.

wagenmakers.weights = dcauchy(es,0,1)

The weighted average power was computed as the sum of the weighted power estimates divided by the sum of weights.  The weighted average power is 69%.  This estimate implies that Wagenmakers’ prior assumes that 69% of statistical tests produce a significant result, when the null-hypothesis is false.

Replicability is always higher than power because the subset of studies that produce a significant result has higher average power than the full set of studies. Replicability for a set of studies with heterogeneous power is the sum of the squared power of individual studies divided by the sum of power.

Replicability = sum(power^2) / sum(power)

The unweighted estimate of replicability is 96%.   To obtain the replicability for Wagenmakers' prior, the same weighting scheme as for power can be used for replicability.

Wagenmakers.Replicability = sum(weights * power^2) / sum(weights*power)

The formula shows that Wagenmakers' prior implies a replicability of 89%.  We see that the weighting scheme has relatively little effect on the estimate of replicability because many of the studies with small effect sizes are expected to produce a non-significant result, whereas the large effect sizes often have power close to 1, which implies that they will be significant in the original study and the replication study.
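Putting these steps together, the following sketch reproduces the three estimates reported above, using the power approximation from the formula given earlier.

N = 80; se = 2/sqrt(N)                          # between-subject design, n = 40 per group
es = seq(0, 4, .001)                            # range of population effect sizes
ncp = es/se                                     # non-centrality parameters
power = pt(ncp, N-2, qt(1-.025, N-2))           # power for each effect size
weights = dcauchy(es, 0, 1)                     # Wagenmakers' prior as weights
mean(power)                                     # unweighted average power (~ .89)
sum(weights*power)/sum(weights)                 # weighted average power (~ .69)
sum(weights*power^2)/sum(weights*power)         # weighted replicability (~ .89)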

The success rate of replication studies is difficult to estimate. Cohen estimated that typical studies in psychology have 50% power to detect a medium effect size, d = .5.  This would imply that the actual success rate would be lower because in an unknown percentage of studies the null-hypothesis is true.  However, replicability would be higher because studies with higher power are more likely to be significant.  Given this uncertainty, I used a scenario with 50% replicability.  That is, an unbiased sample of studies taken from psychological journals would produce 50% successful replications in exact replication studies of the original studies.  The following computations show the implications of a 50% success rate in replication studies for the proportion of hypothesis tests where the null-hypothesis is true, p(H0).

The percentage of true null-hypotheses is a function of the success rate in replication studies, weighted average power, and weighted replicability.

p(H0) = (weighted.average.power * (weighted.replicability - success.rate)) / (success.rate*.05 - success.rate*weighted.average.power - .05^2 + weighted.average.power*weighted.replicability)

To produce a success rate of 50% in replication studies with Wagenmakers' prior when H1 is true (89% replicability), the percentage of true null-hypotheses has to be 92%.

The high percentage of true null-hypotheses (92%) also has implications for the implied false-positive rate (i.e., the percentage of significant results for which the null-hypothesis is true).

False Positive Rate = (p(H0) * .05) / (p(H0) * .05 + (1 - p(H0)) * weighted.average.power)

For every 100 studies, there are 92 true null-hypotheses that produce 92*.05 = 4.6 false positive results. For the remaining 8 studies with a true effect, there are 8 * .67 = 5.4 true discoveries.  The false positive rate is 4.6 / (4.6 + 5.4) = 46%.  This means Wagenmakers' prior assumes that a success rate of 50% in replication studies implies that nearly 50% of studies that replicate successfully are false-positive results that would not replicate in future replication studies.
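Plugging the values derived above into these formulas reproduces the reported estimates (a sketch).

weighted.average.power = .69
weighted.replicability = .89
success.rate = .50                              # assumed success rate of unbiased replication studies
p.H0 = (weighted.average.power*(weighted.replicability - success.rate)) /
(success.rate*.05 - success.rate*weighted.average.power - .05^2 + weighted.average.power*weighted.replicability)
p.H0                                            # ~ .92
(p.H0*.05) / (p.H0*.05 + (1-p.H0)*weighted.average.power)   # false positive rate, ~ .46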

Aside from these analytically derived predictions about power and replicability, Wagenmakers’ prior also makes predictions about the distribution of observed evidence in individual studies. As observed scores are influenced by sampling error, I used simulations to illustrate the effect of Wagenmakers’ prior on observed test statistics.

For the simulation I converted the non-central t-values into non-central z-scores and simulated sampling error with a standard normal distribution.  The simulation included 92% true null-hypotheses and 8% true H1 based on Wagenmaker’s prior.  As published results suffer from publication bias, I simulated publication bias by selecting only observed absolute z-scores greater than 1.96, which corresponds to the p < .05 (two-tailed) significance criterion.  The simulated data were submitted to a powergraph analysis that estimates power and replicability based on the distribution of absolute z-scores.

Figure 1 shows the results.   First, the estimation method slightly underestimated the actual replicability of 50% by 2 percentage points.  Despite this slight estimation error, the Figure accurately illustrates the implications of Wagenmakers’ prior for observed distributions of absolute z-scores.  The density function shows a steep decrease in the range of z-scores between 2 and 3, and a gentle slope for z-scores greater than 4 to 10 (values greater than 10 are not shown).

Powergraphs provide some information about the composition of the total density by dividing the total density into densities for power less than 20%, 20-50%, 50-85%, and more than 85%. The red line (power < 20%) mostly determines the shape of the total density function for z-scores from 2 to 2.5, and most of the remaining density is due to studies with more than 85% power, starting with z-scores around 4.   Studies with power in the range between 20% and 85% contribute very little to the total density. Thus, the plot correctly reveals that Wagenmakers' prior assumes that the roughly 50% average replicability is mostly due to studies with very low power (< 20%) and studies with very high power (> 85%).
[Figure 1: Powergraph for Wagenmakers' prior (N = 80)]

Validation Study 1: Michèle Nuijten's Statcheck Data

There are a number of datasets that can be used to evaluate Wagenmakers' prior. The first dataset is based on an automatic extraction of test statistics from psychological journals. I used Michèle Nuijten's dataset to ensure that I did not cherry-pick data and to allow other researchers to reproduce the results.

The main problem with automatically extracted test statistics is that the dataset does not distinguish between  theoretically important test statistics and other statistics, such as significance tests of manipulation checks.  It is also not possible to distinguish between between-subject and within-subject designs.  As a result, replicability estimates for this dataset will be higher than the simulation based on a between-subject design.

[Figure 2: Powergraph for Michèle Nuijten's statcheck data]

 

Figure 2 shows all of the data, but only significant z-scores (z > 1.96) are used to estimate replicability and power. The most striking difference between Figure 1 and Figure 2 is the shape of the total density on the right side of the significance criterion.  In Figure 2 the slope is shallower. The difference is visible in the decomposition of the total density into densities for different power bands.  In Figure 1 most of the total density was accounted for by studies with less than 20% power and studies with more than 85% power.  In Figure 2, studies with power in the range between 20% and 85% account for the majority of studies with z-scores greater than 2.5 up to z-scores of 4.5.

The difference between Figure 1 and Figure 2 has direct implications for the interpretation of Bayes-Factors with t-values that correspond to z-scores in the range of just significant results. Given Wagenmakers’ prior, z-scores in this range mostly represent false-positive results. However, the real dataset suggests that some of these z-scores are the result of underpowered studies and publication bias. That is, in these studies the null-hypothesis is false, but the significant result will not replicate because these studies have low power.

Validation Study 2:  Open Science Collaboration Articles (Original Results)

The second dataset is based on the Open Science Collaboration (OSC) replication project.  The project aimed to replicate studies published in three major psychology journals in the year 2008.  The final number of articles that were selected for replication was 99. The project replicated one study per article, but articles often contained multiple studies.  I computed absolute z-scores for theoretically important tests from all studies of these 99 articles.  This analysis produced 294 test statistics that could be converted into absolute z-scores.

[Figure 3: Powergraph for OSC reproducibility project articles (all studies)]
Figure 3 shows clear evidence of publication bias.  No sampling distribution can produce the steep increase in tests around the critical value for significance. This selection is not an artifact of my extraction, but an actual feature of published results in psychological journals (Sterling, 1959).

Given the small number of studies, the figure also contains bootstrapped 95% confidence intervals.  The 95% CI for the power estimate shows that the sample is too small to estimate power for all studies, including studies in the proverbial file drawer, based on the subset of studies that were published. However, the replicability estimate of 49% has a reasonably tight confidence interval ranging from 45% to 66%.

The shape of the density distribution in Figure 3 differs from the distribution in Figure 2 in two ways. Initially the slope is steeper in Figure 3, and there is less density in the tail with high z-scores.  Both aspects contribute to the lower estimate of replicability in Figure 3, suggesting that replicability of focal hypothesis tests is lower than replicability for all statistical tests.

Comparing Figure 3 and Figure 1 shows again that the powergraph based on Wagenmakers’ prior differs from the powergraph for real data. In this case, the discrepancy is even more notable because focal hypothesis tests rarely produce large z-scores (z > 6).

Validation Study 3:  Open Science Collaboration Articles (Replication Results)

At present, the only data that are somewhat representative of psychological research (at least of social and cognitive psychology) and that do not suffer from publication bias are the results from the replication studies of the OSC replication project.  Out of 97 original studies with a significant result, 36 (37%) produced a significant result in the replication study.  After eliminating some replication studies (e.g., the sample of the replication study was considerably smaller), 88 studies remained.

[Figure 4: Powergraph for OSC replication results (k = 88)]

Figure 4 shows the powergraph for the 88 studies. As there is no publication bias, estimates of power and replicability are based on non-significant and significant results.  Although the sample size is smaller, the estimate of power has a reasonably narrow confidence interval because the estimate includes non-significant results. Estimated power is only 31%. The 95% confidence interval includes the actual success rate of 40%, which shows that there is no evidence of publication bias.

A visual comparison of Figure 1 and Figure 4 shows again that real data diverge from the predicted pattern by Wagenmakers’ prior.  Real data show a greater contribution of power in the range between 20% and 85% to the total density, and large z-scores (z > 6) are relatively rare in real data.

Conclusion

Statisticians have noted that it is good practice to examine the assumptions underlying statistical tests. This blog post critically examines the assumptions underlying the use of Bayes-Factors with Wagenmakers' prior.  The main finding is that Wagenmakers' prior makes unreasonable assumptions about power, replicability, and the distribution of observed test-statistics with or without publication bias. The main problem with Wagenmakers' prior is that it predicts too many statistical results with strong evidence against the null-hypothesis (z > 5, or the 5 sigma rule in physics).  To achieve reasonable predictions for success rates without publication bias (~50%), Wagenmakers' prior has to assume that over 90% of statistical tests conducted in psychology test a false hypothesis (i.e., predict an effect when H0 is true), and that the false-positive rate is close to 50%.

Implications

Bayesian statisticians have pointed out for a long time that the choice of a prior influences Bayes-Factors (Kass, 1993, p. 554).  It is therefore useful to carefully examine priors to assess the effect of priors on Bayesian inferences. Unreasonable priors will lead to unreasonable inferences.  This is also true for Wagenmakers’ prior.

The problem of using Bayes-Factors with Wagenmakers’ prior to test the null-hypothesis is apparent in a realistic scenario that assumes a moderate population effect size of d = .5 and a sample size of N = 80 in a between subject design. This study has a non-central t of 2.24 and 60% power to produce a significant result with p < .05, two-tailed.   I used R to simulate 10,000 test-statistics using the non-central t-distribution and then computed Bayes-Factors with Wagenmakers’ prior.
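A minimal sketch of this simulation uses the same numerical-integration approach as the subjective Bayesian t-test code above to compute the default Bayes-Factor; results may differ slightly from a closed-form implementation, and the run takes a few seconds.

set.seed(1)                                     # for reproducibility
N = 80; df = N - 2; se = 2/sqrt(N)              # between-subject design, n = 40 per group
d = .5                                          # assumed population effect size
t.obs = rt(10000, df, ncp = d/se)               # simulated t-values
bf10 = sapply(t.obs, function(t) {
marg.H1 = integrate(function(delta) dt(t, df, delta/se) * dcauchy(delta, 0, 1), -Inf, Inf)$value
marg.H1 / dt(t, df)                             # marginal likelihood under H1 / likelihood under H0
})
mean(bf10 < 1/3)                                # share of results interpreted as support for H0 (~ .21)
mean(bf10 > 3)                                  # share of results interpreted as support for H1 (~ .31)
hist(log(bf10), main = "log(BF10), d = .5, N = 80, Cauchy(0,1) prior")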

Figure 5 shows a histogram of log(BF). The log is being used because BF are ratios and have very skewed distributions.  The histogram shows that BF never favor the null-hypothesis with a BF of 10 in favor of H0 (1/10 in the histogram).  The reason is that even with Wagenmakers’ prior a sample size of N = 80 is too small to provide strong support for the null-hypothesis.  However, 21% of observed test statistics produce a Bayes-Factor less than 1/3, which is sometimes used as sufficient evidence to claim that the data support the null-hypothesis.  This means that the test has a 21% error rate to provide evidence for the null-hypothesis when the null-hypothesis is false.  A 21% error rate is 4 times larger than the 5% error rate in null-hypothesis significance testing. It is not clear why researchers should replace a statistical method with a 5% error rate for a false discovery of an effect with a 20% error rate of false discoveries of null effects.

Another 48% of the results produce Bayes-Factors that are considered inconclusive. This leaves 31% of results that favor H1 with a Bayes-Factor greater than 3, and only 17% of results produce a Bayes-Factor greater than 10.   This implies that even with the low standard of a BF > 3, the test has only 31% power to provide evidence for an effect that is present.

These results are not wrong because they correctly express the support that the observed data provide for H0 and H1.  The problem only occurs when the specification of H1 is ignored. Given Wagenmakers' prior, it is much more likely that a t-value of 1 stems from the sampling distribution of H0 than from the sampling distribution of H1.  However, studies with 50% power when an effect is present are also much more likely to produce t-values of 1 than t-values of 6 or larger.   Thus, a different prior that is more consistent with the actual power of studies in psychology would produce different Bayes-Factors and reduce the percentage of false discoveries of null effects.  Researchers who think Wagenmakers' prior is not a realistic prior for their research domain should therefore use a more suitable prior for their research domain.

[Figure 5: Histogram of log Bayes-Factors for the simulated studies (d = .5, N = 80)]

 

Counterarguments

Wagenmakers has ignored previous criticisms of his prior.  It is therefore not clear what counterarguments he would make.  Below, I raise some potential counterarguments that might be used to defend the use of Wagenmakers' prior.

One counterargument could be that the prior is not very important because the influence of priors on Bayes-Factors decreases as sample sizes increase.  However, this argument ignores the fact that Bayes-Factors are often used to draw inferences from small samples. In addition, Kass (1993) pointed out that “a simple asymptotic analysis shows that even in large samples Bayes factors remain sensitive to the choice of prior” (p. 555).

Another counterargument could be that a bias in favor of H0 is desirable because it keeps the rate of false-positives low. The problem with this argument is that Bayesian statistics does not provide information about false-positive rates.  Moreover, the cost for reducing false-positives is an increase in the rate of false negatives; that is, either inconclusive results or false evidence for H0 when an effect is actually present.  Finally, the choice of the correct prior will minimize the overall amount of errors.  Thus, it should be desirable for researchers interested in Bayesian statistics to find the most appropriate priors in order to minimize the rate of false inferences.

A third counterargument could be that Wagenmakers’ prior expresses a state of maximum uncertainty, which can be considered a reasonable default when no data are available.  If one considers each study as a unique study, a default prior of maximum uncertainty would be a reasonable starting point.  In contrast, it may be questionable to treat a new study as a randomly drawn study from a sample of studies with different population effect sizes.  However, Wagenmakers’ prior does not express a state of maximum uncertainty and makes assumptions about the probability of observing very large effect sizes.  It does so without any justification for this expectation.  It therefore seems more reasonable to construct priors that are consistent with past studies and to evaluate priors against actual results of studies.

A fourth counterargument is that Bayes-Factors are superior because they can provide evidence for the null-hypothesis as well as the alternative hypothesis.  However, this is not correct. Bayes-Factors only quantify support for the null-hypothesis relative to a specific alternative hypothesis.  Researchers who are interested in testing the null-hypothesis can do so using parameter estimation with confidence or credibility intervals. If the interval falls within a specified region around zero, it is possible to affirm the null-hypothesis with a level of certainty that is determined by the precision with which the study estimates the population effect size.  Thus, it is not necessary to use Bayes-Factors to test the null-hypothesis.
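
As a concrete illustration of the interval approach, the following sketch checks whether an approximate confidence interval for a standardized mean difference falls inside an equivalence region around zero.  The region of d between -0.2 and 0.2, the normal-approximation standard error for d, and the simulated data are illustrative assumptions; in practice the region has to be justified substantively.

```python
import numpy as np
from scipy import stats

def equivalence_check(x, y, bounds=(-0.2, 0.2), conf=0.95):
    """Check whether the confidence interval for the standardized mean
    difference (Cohen's d) falls entirely inside an equivalence region.
    The default bounds of (-0.2, 0.2) are illustrative, not a recommendation."""
    n1, n2 = len(x), len(y)
    sp = np.sqrt(((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(x) - np.mean(y)) / sp
    # large-sample approximation to the standard error of d
    se = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    lo, hi = d - z * se, d + z * se
    return d, (lo, hi), bounds[0] < lo and hi < bounds[1]

rng = np.random.default_rng(1)
x, y = rng.normal(0, 1, 400), rng.normal(0, 1, 400)   # two groups, no true effect
d, ci, within = equivalence_check(x, y)
print(f"d = {d:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}], within region: {within}")
```

With a sufficiently large sample, a true null effect produces an interval that is narrow enough to fall inside the region, which is what affirming the null-hypothesis means here.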

In conclusion, Bayesian statistics and other statistics are not right or wrong. They combine assumptions and data to draw inferences.  Untrustworthy data and wrong assumptions can lead to false conclusions.  It is therefore important to test the integrity of data (e.g., presence of publication bias) and to examine assumptions.  The uncritical use of Bayes-Factors with default assumptions is not good scientific practice and can lead to false conclusions just like the uncritical use of p-values can lead to false conclusions.

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias.  The most prominent tests are funnel plots (Light & Pillemer, 1984) and Egger regression (Egger et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes.  As a result, observed effect sizes in a representative set of studies should also be independent of sample size.  However, publication bias introduces a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result.  The main problem with these bias tests is that population effect sizes may be heterogeneous, and this heterogeneity may itself be related to sample sizes.  In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes: a power analysis would lead researchers to use larger samples to study smaller effects and smaller samples to study larger effects.  This makes it problematic to draw strong inferences about the presence of publication bias from negative correlations between effect sizes and sample sizes.
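
For readers unfamiliar with the regression approach, here is a minimal sketch of an Egger-style test on simulated meta-analytic data: the standardized effect (effect divided by its standard error) is regressed on precision (one divided by the standard error), and an intercept that deviates from zero indicates small-study asymmetry.  The simulated effect sizes and the use of statsmodels are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a meta-analytic data set: observed effects scatter around d = 0.3
rng = np.random.default_rng(2)
n_studies = 50
se = rng.uniform(0.05, 0.4, n_studies)        # study standard errors
effects = rng.normal(0.3, se)                 # observed effect sizes

# Egger-style regression: standardized effect on precision
z = effects / se
precision = 1.0 / se
fit = sm.OLS(z, sm.add_constant(precision)).fit()
print("intercept =", round(fit.params[0], 3), "  p =", round(fit.pvalues[0], 3))
```

The limitation discussed above applies directly: if population effect sizes vary with sample size, for example because researchers power their studies, the intercept can deviate from zero without any publication bias.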

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation.  The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies.  If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50.  Publication bias leads to an inflation of the percentage of significant results: if only significant results are published, the percentage of significant results in journals will be 100%, even if the studies had only 50% power.  Sterling et al. (1995) found that several journals reported over 90% significant results. Based on conservative estimates of power, they concluded that this high success rate can only be explained by publication bias.  Sterling et al. (1995), however, did not develop a method to estimate power.

Ioannidis and Trikalinos (2007) proposed the first test for publication bias based on power analysis.  They call it “an exploratory test for an excess of significant findings” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis for examining publication bias.  The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated.  ETESR uses a meta-analysis of effect sizes for this purpose.
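
The logic of such a test can be sketched as follows (this is only one way to formalize it, not the exact procedure of Ioannidis and Trikalinos, 2007): pool the effect sizes with a fixed-effect meta-analysis, compute each study’s power for the pooled effect, sum the powers to obtain the expected number of significant results, and compare that expectation to the observed number.  The two-group design, the example data, and the binomial comparison are simplifying assumptions.

```python
import numpy as np
from scipy import stats

def excess_significance_check(d_obs, n1, n2, alpha=0.05):
    """Compare the observed number of significant two-sample t-tests to
    the number expected if every study measured the pooled effect size."""
    d_obs, n1, n2 = map(np.asarray, (d_obs, n1, n2))
    var_d = (n1 + n2) / (n1 * n2) + d_obs ** 2 / (2 * (n1 + n2))
    w = 1.0 / var_d
    d_pooled = np.sum(w * d_obs) / np.sum(w)          # fixed-effect estimate

    df = n1 + n2 - 2
    ncp = d_pooled * np.sqrt(n1 * n2 / (n1 + n2))
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    power = 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

    t_obs = d_obs * np.sqrt(n1 * n2 / (n1 + n2))
    observed = int(np.sum(np.abs(t_obs) > t_crit))
    expected = float(np.sum(power))
    # binomial check of whether the observed success count is surprisingly high
    p = stats.binomtest(observed, len(d_obs), expected / len(d_obs),
                        alternative="greater").pvalue
    return d_pooled, expected, observed, p

# Illustrative example: ten studies with n = 40 per group
d_obs = [0.55, 0.48, 0.62, 0.51, 0.45, 0.58, 0.49, 0.53, 0.60, 0.47]
n = np.full(10, 40)
print(excess_significance_check(d_obs, n, n))
```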

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalinos (2007).  The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous.  Again, the authors are careful to point out this limitation of ETESR: “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article: “Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting.” (p. 252).   Given these limitations, it would be desirable to develop a test that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibility Index (Schimmack, 2012).  The name of the test is based on the observation that it becomes increasingly likely that a set of studies contains at least one non-significant result as the number of studies increases.  For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip.  Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again.  Probability theory shows that this outcome becomes very unlikely after just a few coin tosses, as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.125%, and so on.  Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to raise suspicion that the coin is not fair, especially if it always falls on the side that benefits the person who is throwing it. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypotheses at least 90% of the time.  Eight studies are sufficient to show that even a success rate of 90% is improbable (with 50% power, the probability of at least 7 significant results in 8 studies is only .035, p < .05).  It is therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.
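
The arithmetic is simple enough to verify directly.  The sketch below computes the cumulative probability of an unbroken run of significant results and the probability of at least 7 significant results in 8 studies (a success rate of about 90%), under the assumption of 50% power.

```python
from scipy import stats

power = 0.5   # assumed power of a single study (Cohen, 1962)

# Probability that k studies in a row are all significant
for k in range(1, 6):
    print(f"{k} significant results in a row: {power ** k:.4f}")

# Probability of at least 7 significant results in 8 studies (~90% success)
print("P(at least 7 of 8 significant):",
      round(1 - stats.binom.cdf(6, 8, power), 4))
```

Both the run of five and the 7-out-of-8 case fall below the conventional .05 threshold, which is the sense in which such success rates are incredible.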

To avoid the assumption of a fixed effect size, the Incredibility Index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2005).  However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined.  The original Incredibility Index used the mean of observed power to estimate average power, but Yuan and Maxwell (2005) demonstrated that the mean of observed power is a biased estimate of average (true) power.  In further developments of the method, I switched to the median of observed power (Schimmack, 2016).  The median of observed power is an unbiased estimator of power (Schimmack, 2015).
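
A minimal sketch of this computation: observed power is the power a two-sided t-test would have if the observed t-value were the true noncentrality parameter; the median of these values is then compared with the percentage of significant results.  The t-values and degrees of freedom below are illustrative, not data from the articles discussed here.

```python
import numpy as np
from scipy import stats

def observed_power(t, df, alpha=0.05):
    """Power of a two-sided t-test if the observed t-value were the
    true noncentrality parameter."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return 1 - stats.nct.cdf(t_crit, df, abs(t)) + stats.nct.cdf(-t_crit, df, abs(t))

# Illustrative set of focal test statistics (t-values and degrees of freedom)
t_values = np.array([2.10, 2.45, 1.98, 2.80, 2.20, 2.05, 3.10, 2.33])
dfs      = np.array([38,   58,   78,   48,   38,   98,   58,   78])

powers = observed_power(t_values, np.abs(t_values) * 0 + dfs)  # elementwise observed power
median_power = np.median(powers)
success_rate = np.mean(2 * stats.t.sf(np.abs(t_values), dfs) < 0.05)

print("median observed power:", round(median_power, 2))
print("percentage significant:", round(100 * success_rate, 1), "%")
```

If the percentage of significant results is much higher than the median observed power, the published record contains more significant results than the studies’ power can explain.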

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634. doi:10.1136/bmj.315.7109.629

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2016). A revised introduction to the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.