# Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis.” Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes. We think it is helpful to recognize the key role of statistical power in significance testing. If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all significant effect sizes in these studies are inflated. Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability that a significant result overestimates the population effect size is 62.5%. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that significant observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).
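The 62.5% figure can be checked with a short calculation. The sketch below assumes a normal (z-test) approximation and ignores the negligible chance of a significant result in the wrong direction; the variable names are illustrative and are not part of the Supplement.

```r
# Probability that a significant result overestimates the population effect,
# under an assumed z-test approximation (illustrative variable names).
crit  <- qnorm(.975)            # two-sided criterion for alpha = .05
power <- .80                    # true power
ncp   <- qnorm(power) + crit    # expected z-score that yields 80% power

# P(significant): the observed z-score exceeds the criterion
p.sig  <- pnorm(crit, mean = ncp, lower.tail = FALSE)
# P(overestimate): the observed z-score exceeds its expected value (always .50);
# every such result is also significant because ncp > crit
p.over <- pnorm(ncp, mean = ncp, lower.tail = FALSE)

p.over / p.sig                  # 0.625
```

With less than 50% power, `ncp` falls below `crit`, so every significant result lies above the expected value and the ratio is 1.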

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584). We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than a perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always attenuates observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated. We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the ratio of these two parameters (effect size divided by sampling error), it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.
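The chain from reliability to power can be sketched in a few lines. The example assumes that both measures have the same reliability, so that the observable population correlation is the true correlation multiplied by sqrt(rel.x * rel.y) = rel. It also uses the Fisher z approximation for power rather than the non-central t computation in the Supplement, and the sample size of 100 is an arbitrary choice for illustration.

```r
# Attenuation of the population correlation and of power by unreliability.
# Assumptions: equal reliability for X and Y, Fisher z approximation, N = 100.
true.r <- .15
rel    <- c(1, .8, .6, .4, .2)          # reliability of each measure
n      <- 100

obs.r <- true.r * sqrt(rel * rel)       # classical attenuation formula
ncp.z <- atanh(obs.r) * sqrt(n - 3)     # expected z-score of the test
power <- pnorm(ncp.z - qnorm(.975))     # power of the two-sided test (upper tail)

round(data.frame(rel, obs.r, power), 3)
```

Both the observable correlation and power fall monotonically as reliability decreases, which is the starting point for the argument about selection for significance.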

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power. It is straightforward to transform inflated median observed power into median observed effect sizes. We applied this approach to Loken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from the original range of 50 to 3050 to a range of 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.
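The relationship between true power and median observed power can be written as a small function. The sketch below restates the z-score formula from the Supplement; the function name `inflated_power` and its arguments are illustrative choices, not names used in the original code.

```r
# Median observed power after selection for significance (z-test approximation).
# The function name is illustrative; the formula follows the Supplement.
inflated_power <- function(true.power, alpha = .05) {
  crit  <- qnorm(1 - alpha / 2)
  ncp   <- qnorm(true.power, mean = crit)   # expected z-score implied by true power
  # Median of the significant z-scores: the quantile that splits the upper
  # `true.power` share of the sampling distribution in half
  med.z <- qnorm(1 - true.power / 2, mean = ncp)
  pnorm(med.z, mean = crit)                 # observed power of that z-score
}

true.power <- c(.10, .30, .50, .80, .95)
round(inflated_power(true.power), 3)
```

For a true power of .50 the function returns .75, and the returned values rise with true power while always exceeding it, which is the monotonic relationship the argument relies on.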

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis. Consistent with classical test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result. The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance. Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).
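Because the function is monotonic, deflation amounts to inverting it numerically. The following is a minimal sketch under the same z-test approximation as the Supplement; `inflated_power` restates the Supplement's formula, and `deflate_power` is a hypothetical helper that is not part of the original code.

```r
# Inflation function (z-test approximation, as in the Supplement)
inflated_power <- function(true.power, crit = qnorm(.975)) {
  ncp <- qnorm(true.power, mean = crit)
  pnorm(qnorm(1 - true.power / 2, mean = ncp), mean = crit)
}

# Hypothetical helper: find the true power whose inflated value matches the
# median observed power of a set of significant results.
deflate_power <- function(obs.power) {
  uniroot(function(p) inflated_power(p) - obs.power,
          interval = c(.025, .999), tol = 1e-10)$root
}

deflate_power(.75)   # approximately .50
```

The search interval starts at a true power of .025, where median observed power is already about .61 in this approximation, so lower observed values cannot be deflated with this sketch.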

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N - 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) - sqrt(N - 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

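ylab = "Inflated Observed Effect Size"

title = "Effect of Selection for Significance on Observed Effect Size"

### line colors, one per reliability level (an assumed choice; col is not defined in the original listing)

col = rainbow(length(rel))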

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab="Median Observed Effect Size After Selection for Significance",lwd=3,main=title)

### legend entries placed inside the plotting region (the original x-positions of 600 to 730 lie beyond x.max)

segments(x0 = .70*x.max,y0 = y.max-.05-i*.02, x1 = .78*x.max,col=col[i], lwd=5)

text(.90*x.max,y.max-.05-i*.02,paste0("Rel = ",format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

# Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes-Factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size. We present an alternative Bayesian multi-level approach that avoids the pitfalls of Bayesian Evidence Synthesis, and show that the original set of studies produced at best weak evidence for an effect of social norms on the reuse of towels.

# Replicability Report No. 1: Is Ego-Depletion a Replicable Effect?

Abstract

It has been a common practice in social psychology to publish only significant results.  As a result, success rates in the published literature do not provide empirical evidence for the existence of a phenomenon.  A recent meta-analysis suggested that ego-depletion is a much weaker effect than the published literature suggests and a registered replication study failed to find any evidence for it.  This article presents the results of a replicability analysis of the ego-depletion literature.  Out of 165 articles with 429 studies (total N  = 33,927),  128 (78%) showed evidence of bias and low replicability (Replicability-Index < 50%).  Closer inspection of the top 10 articles with the strongest evidence against the null-hypothesis revealed some questionable statistical analyses, and only a few articles presented replicable results.  The results of this meta-analysis show that most published findings are not replicable and that the existing literature provides no credible evidence for ego-depletion.  The discussion focuses on the need for a change in research practices and suggests a new direction for research on ego-depletion that can produce conclusive results.

INTRODUCTION

In 1998, Roy F. Baumeister and colleagues published a groundbreaking article titled “Ego Depletion: Is the Active Self a Limited Resource?” The article stimulated research on the newly minted construct of ego-depletion. At present, more than 150 articles and over 400 studies with more than 30,000 participants have contributed to the literature on ego-depletion. In 2010, a meta-analysis of nearly 100 articles, 200 studies, and 10,000 participants concluded that ego-depletion is a real phenomenon with a moderate to strong effect size of six tenths of a standard deviation (Hagger et al., 2010).

In 2011, Roy F. Baumeister and John Tierney published a popular book on ego-depletion titled “Will-Power,” and Roy F. Baumeister came to be known as the leading expert on self-regulation and will-power (The Atlantic, 2012).

Everything looked as if ego-depletion research had a bright future, but five years later the future of ego-depletion research looks gloomy and even prominent ego-depletion researchers wonder whether ego-depletion even exists (Slate, “Everything is Crumbling”, 2016).

An influential psychological theory, borne out in hundreds of experiments, may have just been debunked. How can so many scientists have been so wrong?

What Happened?

It has been known for 60 years that scientific journals tend to publish only successful studies (Sterling, 1959). That is, when Roy F. Baumeister reported his first ego-depletion study and found that resisting the temptation to eat chocolate cookies led to a decrease in persistence on a difficult task by 17 minutes, the results were published as a groundbreaking discovery. However, when studies do not produce the predicted outcome, they are not published. This bias is known as publication bias. Every researcher knows about publication bias, but the practice is so widespread that it is not considered a serious problem. Surely, researchers would not conduct more failed studies than successful studies and only report the successful ones. Yes, omitting a few studies with weaker effects leads to an inflation of the effect size, but the successful studies still show the general trend.

The publication of one controversial article in the same journal that published the first ego-depletion article challenged this indifferent attitude towards publication bias. In a shocking article, Bem (2011) presented 9 successful studies demonstrating that extraverted students at Cornell University were seemingly able to foresee random events in the future. In Study 1, they seemed to be able to predict where a computer would present an erotic picture even before the computer randomly determined the location of the picture. Although the article presented 9 successful studies and 1 marginally successful study, researchers were not convinced that extrasensory perception is a real phenomenon. Rather, they wondered how credible the evidence in other articles is if it is possible to get 9 significant results for a phenomenon that few researchers believed to be real. As Sterling (1959) pointed out, a 100% success rate does not provide evidence for a phenomenon if only successful studies are reported. In this case, the success rate is by definition 100% no matter whether an effect is real or not.

In the same year, Simmons et al. (2011) showed how researchers can increase the chances to get significant results without a real effect by using a number of statistical practices that seem harmless, but in combination can increase the chance of a false discovery by more than 1000% (from 5% to 60%).  The use of these questionable research practices has been compared to the use of doping in sports (John et al., 2012).  Researchers who use QRPs are able to produce many successful studies, but the results of these studies cannot be replicated when other researchers replicate the reported studies without QRPs.  Skeptics wondered whether many discoveries in psychology are as incredible as Bem’s discovery of extrasensory perception; groundbreaking, spectacular, and false.  Is ego-depletion a real effect or is it an artificial product of publication bias and questionable research practices?

Does Ego-Depletion Depend on Blood Glucose?

The core assumption of ego-depletion theory is that working on an effortful task requires energy and that performance decreases as energy levels decrease.  If this theory is correct, it should be possible to find a physiological correlate of this energy.  Ten years after the inception of ego-depletion theory, Baumeister and colleagues claimed to have found the biological basis of ego-depletion in an article called “Self-control relies on glucose as a limited energy source.”  (Gailliot et al., 2007).  The article had a huge impact on ego-depletion researchers and it became a common practice to measure blood-glucose levels.

Unfortunately, Baumeister and colleagues had not consulted with physiological psychologists when they developed the idea that brain processes depend on blood-glucose levels.  To maintain vital functions, the human body ensures that the brain is relatively independent of peripheral processes.  A large literature in physiological psychology suggested that inhibiting the impulse to eat delicious chocolate cookies would not lead to a measurable drop in blood glucose levels (Kurzban, 2011).

Let’s look at the numbers. A well-known statistic is that the brain, while only 2% of body weight, consumes 20% of the body’s energy. That sounds like the brain consumes a lot of calories, but if we assume a 2,400 calorie/day diet – only to make the division really easy – that’s 100 calories per hour on average, 20 of which, then, are being used by the brain. Every three minutes, then, the brain – which includes memory systems, the visual system, working memory, then emotion systems, and so on – consumes one (1) calorie. One. Yes, the brain is a greedy organ, but it’s important to keep its greediness in perspective.

But, maybe experts on physiology were just wrong and Baumeister and colleagues made another groundbreaking discovery.  After all, they presented 9 successful studies that appeared to support the glucose theory of will-power, but 9 successful studies alone provide no evidence because it is not clear how these successful studies were produced.

To answer this question, Schimmack (2012) developed a statistical test that provides information about the credibility of a set of successful studies. Experimental researchers try to hold many factors that can influence the results constant (all studies are done in the same laboratory, glucose is measured the same way, etc.).  However, there are always factors that the experimenter cannot control. These random factors make it difficult to predict the exact outcome of a study even if everything goes well and the theory is right.  To minimize the influence of these random factors, researchers need large samples, but social psychologists often use small samples where random factors can have a large influence on results.  As a result, conducting a study is a gamble and some studies will fail even if the theory is correct.  Moreover, the probability of failure increases with the number of attempts.  You may get away with playing Russian roulette once, but you cannot play forever.  Thus, eventually failed studies are expected and a 100% success rate is a sign that failed studies were simply not reported.  Schimmack (2012) was able to use the reported statistics in Gailliot et al. (2007) to demonstrate that it was very likely that the 100% success rate was only achieved by hiding failed studies or with the help of questionable research practices.
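The logic of the gamble can be made concrete with simple arithmetic. The sketch below assumes independent studies with identical power; the power values are hypothetical and are not estimates for any published article.

```r
# Probability that all k independent studies produce a significant result.
# The power values are hypothetical illustrations.
k     <- 9
power <- c(.30, .50, .80)

p.all       <- power^k           # chance of a 100% success rate
exp.failure <- k * (1 - power)   # expected number of failed studies

data.frame(power, p.all = signif(p.all, 3), exp.failure)
```

Even with 80% power, nine successes in nine attempts is the less likely outcome, and with 50% power it would occur in about 1 out of 500 sets of studies.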

Baumeister was a reviewer of Schimmack’s manuscript and confirmed the finding that a success rate of 9 out of 9 studies was not credible.

“My paper with Gailliot et al. (2007) is used as an illustration here. Of course, I am quite familiar with the process and history of that one. We initially submitted it with more studies, some of which had weaker results. The editor said to delete those. He wanted the paper shorter so as not to use up a lot of journal space with mediocre results. It worked: the resulting paper is shorter and stronger. Does that count as magic? The studies deleted at the editor’s request are not the only story. I am pretty sure there were other studies that did not work. Let us suppose that our hypotheses were correct and that our research was impeccable. Then several of our studies would have failed, simply given the realities of low power and random fluctuations. Is anyone surprised that those studies were not included in the draft we submitted for publication? If we had included them, certainly the editor and reviewers would have criticized them and formed a more negative impression of the paper. Let us suppose that they still thought the work deserved publication (after all, as I said, we are assuming here that the research was impeccable and the hypotheses correct). Do you think the editor would have wanted to include those studies in the published version?”

To summarize, Baumeister defends the practice of hiding failed studies with the argument that this practice is acceptable if the theory is correct.  But we do not know whether the theory is correct without looking at unbiased evidence.  Thus, his line of reasoning does not justify the practice of selectively reporting successful results, which provides biased evidence for the theory.  If we could know whether a theory is correct without data, we would not need empirical tests of the theory.  In conclusion, Baumeister’s response shows a fundamental misunderstanding of the role of empirical data in science.  Empirical results are not mere illustrations of what could happen if a theory were correct. Empirical data are supposed to provide objective evidence that a theory needs to explain.

Since my article was published, there have been several failures to replicate Gailliot et al.’s findings and recent theoretical articles on ego-depletion no longer assume that blood glucose is the source of ego-depletion.

“Upon closer inspection notable limitations have emerged. Chief among these is the failure to replicate evidence that cognitive exertion actually lowers blood glucose levels.” (Inzlicht, Schmeichel, & Macrae, 2014, p 18).

Thus, the 9 successful studies that were selected by Gailliot et al. (2007) did not illustrate an empirical fact. They created false evidence for a physiological correlate of ego-depletion that could not be replicated. Precious research resources were wasted on a line of research that could have been avoided by consulting with experts on human physiology and by honestly examining the successful and failed studies that led to the Gailliot et al. (2007) article.

Even Baumeister agrees that the original evidence was false and that glucose is not the biological correlate of ego-depletion.

“In retrospect, even the initial evidence might have gotten a boost in significance from a fortuitous control condition. Hence at present it seems unlikely that ego depletion’s effects are caused by a shortage of glucose in the bloodstream” (Baumeister, 2014, p 315).

Baumeister fails to mention that the initial evidence also got a boost from selection bias.

In sum, the glucose theory of ego-depletion was based on selective reporting of studies that provided misleading support for the theory and the theory lacks credible empirical support.  The failure of the glucose theory raises questions about the basic ego-depletion effect.  If researchers in this field used selective reporting and questionable research practices, the evidence for the basic effect is also likely to be biased and the effect may be difficult to replicate.

If 200 studies show ego-depletion effects, it must be real?

Psychologists have not ignored publication bias altogether. The main solution to the problem is to conduct meta-analyses. A meta-analysis combines information from several small studies to examine whether an effect is real. The problem for meta-analysis is that publication bias also influences the results of a meta-analysis. If only successful studies are published, a meta-analysis of published studies will show evidence for an effect no matter whether the effect actually exists or not. For example, the top journal for meta-analysis, Psychological Bulletin, has published meta-analyses that provide evidence for extrasensory perception (Bem & Honorton, 1994).

To address this problem, meta-analysts have developed a number of statistical tools to detect publication bias. The most prominent method is Egger’s regression of effect size estimates on sampling error. A positive correlation can reveal publication bias because studies with larger sampling errors (small samples) require larger effect sizes to achieve statistical significance. To produce these large effect sizes when the actual effect does not exist or is smaller, researchers need to hide more studies or use more questionable research practices. As a result, these results are particularly difficult to replicate.
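How selection for significance produces this correlation can be illustrated with a small simulation. The sketch assumes a true effect of zero, two-group studies with hypothetical per-group sample sizes between 10 and 100, the approximation SE = sqrt(2/n) for a standardized mean difference, and publication of positive significant results only. The weighted regression of effect size on sampling error follows the description above; it is not the exact procedure of any particular meta-analysis.

```r
# Simulated publication bias: regression of published effect sizes on sampling error.
# All settings are hypothetical illustrations.
set.seed(1)
k  <- 5000                                   # studies conducted
n  <- sample(10:100, k, replace = TRUE)      # per-group sample sizes
se <- sqrt(2 / n)                            # approximate sampling error of d
d  <- rnorm(k, mean = 0, sd = se)            # observed effects; the true effect is zero

published <- d / se > qnorm(.975)            # only positive significant results appear

fit   <- lm(d[published] ~ se[published], weights = 1 / se[published]^2)
slope <- unname(coef(fit)[2])

slope                                        # positive: larger sampling error, larger d
mean(d[published])                           # published mean is well above zero
```

Although no effect exists in the simulated population, the published studies show a positive average effect, and the positive slope reveals that the smallest studies report the largest effects.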

Although the use of these statistical methods is state of the art, the original ego-depletion meta-analysis that showed moderate to large effects did not examine the presence of publication bias (Hagger et al., 2010). This omission was corrected in a meta-analysis by Carter and McCullough (2014).

“Upon reading Hagger et al. (2010), we realized that their efforts to estimate and account for the possible influence of publication bias and other small-study effects had been less than ideal, given the methods available at the time of its publication” (Carter & McCullough, 2014).

The authors then used Egger regression to examine publication bias. Moreover, they used a new method that was not available at the time of Hagger et al.’s (2010) meta-analysis to estimate the effect size of ego-depletion after correcting for the inflation caused by publication bias.

Not surprisingly, the regression analysis showed clear evidence of publication bias.  More stunning were the results of the effect size estimate after correcting for publication bias.  The bias-corrected effect size estimate was d = .25 with a 95% confidence interval ranging from d = .18 to d = .32.   Thus, even the upper limit of the confidence interval is about 50% less than the effect size estimate in the original meta-analysis without correction for publication bias.   This suggests that publication bias inflated the effect size estimate by 100% or more.  Interestingly, a similar result was obtained in the reproducibility project, where a team of psychologists replicated 100 original studies and found that published effect sizes were over 100% larger than effect sizes in the replication project (OSC, 2015).

An effect size of d = .2 is considered small.  This does not mean that the effect has no practical importance, but it raises questions about the replicability of ego-depletion results.  To obtain replicable results, researchers should plan studies so that they have an 80% chance to get significant results despite the unpredictable influence of random error.  For small effects, this implies that studies require large samples.  For the standard ego-depletion paradigm with an experimental group and a control group and an effect size of d = .2, a sample size of 788 participants is needed to achieve 80% power. However, the largest sample size in an ego-depletion study was only 501 participants.  A sample size of 388 participants is needed to achieve significance without an inflated effect size (50% power) and most studies fall short of this requirement in sample size.  Thus, most published ego-depletion results are unlikely to replicate and future ego-depletion studies are likely to produce non-significant results.
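These sample sizes can be reproduced with base R's `power.t.test`. The sketch assumes a two-sided independent-groups t-test with equal group sizes and alpha = .05; the variable names are illustrative.

```r
# Total sample size needed to detect d = .20 in a two-group design
# (two-sided t-test, alpha = .05, equal group sizes assumed).
d <- .20

n80 <- power.t.test(delta = d, sd = 1, sig.level = .05, power = .80)$n
n50 <- power.t.test(delta = d, sd = 1, sig.level = .05, power = .50)$n

total80 <- 2 * ceiling(n80)   # per-group n rounded up, then doubled
total50 <- 2 * ceiling(n50)

c(total80 = total80, total50 = total50)
```

The 80% power requirement comes out at 788 participants in total, and the 50% power requirement is roughly half of that.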

In conclusion, even 100 studies with 100% successful results do not provide convincing evidence that ego-depletion exists, nor do they show which experimental procedures can be used to replicate the basic effect.

Replicability without Publication Bias

In response to concerns about replicability, the Association for Psychological Science created a new format for publications. A team of researchers can propose a replication project. The research proposal is peer-reviewed like a grant application. When the project is approved, researchers conduct the studies and publish the results independent of the outcome of the project. If it is successful, the results confirm that earlier findings that were reported with publication bias are replicable, although probably with a smaller effect size. If the studies fail, the results suggest that the effect may not exist or that the effect size is very small.

In the fall of 2014 Hagger and Chatzisarantis announced a replication project of an ego-depletion study.

“The third RRR will do so using the paradigm developed and published by Sripada, Kessler, and Jonides (2014), which is similar to that used in the original depletion experiments (Baumeister et al., 1998; Muraven et al., 1998), using only computerized versions of tasks to minimize variability across laboratories. By using preregistered replications across multiple laboratories, this RRR will allow for a precise, objective estimate of the size of the ego depletion effect.”

In the end, 23 laboratories participated and the combined sample size of all studies was N = 2141.  This sample size affords an 80% probability to obtain a significant result (p < .05, two-tailed) with an effect size of d = .12, which is below the lower limit of the confidence interval of the bias-corrected meta-analysis.  Nevertheless, the study failed to produce a statistically significant result, d = .04 with a 95%CI ranging from d = -.07 to d = .14.  Thus, the results are inconsistent with a small effect size of d = .20 and suggest that ego-depletion may not even exist at all.
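The sensitivity of the combined sample can be checked in the same way. The sketch treats the pooled data as a single two-group comparison with equal group sizes, which simplifies the multi-lab design; the variable names are illustrative.

```r
# Smallest effect size detectable with 80% power for N = 2141
# (simplifying assumption: one two-group t-test with equal group sizes).
total.n <- 2141

d80 <- power.t.test(n = total.n / 2, sd = 1, sig.level = .05, power = .80)$delta
round(d80, 2)

# Power of the same design if the true effect were d = .20
pow20 <- power.t.test(n = total.n / 2, delta = .20, sd = 1, sig.level = .05)$power
round(pow20, 3)
```

Under these assumptions the detectable effect size is about d = .12, and a true effect of d = .20 would have produced a significant result almost every time.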

Ego-depletion researchers have responded to this result differently.  Michael Inzlicht, winner of a theoretical innovation prize for his work on ego-depletion, wrote:

“The results of a massive replication effort, involving 24 labs (or 23, depending on how you count) and over 2,000 participants, indicates that short bouts of effortful control had no discernable effects on low-level inhibitory control. This seems to contradict two decades of research on the concept of ego depletion and the resource model of self-control. Like I said: science is brutal.”

In contrast, Roy F. Baumeister questioned the outcome of this research project that provided the most comprehensive and scientific test of ego-depletion.  In a response with co-author Kathleen D. Vohs titled “A misguided effort with elusive implications,” Baumeister tries to explain why ego depletion is a real effect, despite the lack of unbiased evidence for it.

The first line of defense is to question the validity of the paradigm that was used for the replication project. The only problem is that this paradigm seemed reasonable to the editors who approved the project, researchers who participated in the project and who expected a positive result, and to Baumeister himself when he was consulted during the planning of the replication project.  In his response, Baumeister reverses his opinion about the paradigm.

“In retrospect, the decision to use new, mostly untested procedures for a large replication project was foolish.”

He further claims that he proposed several well-tested procedures, but that these procedures were rejected by the replication team for technical reasons.

“Baumeister nominated several procedures that have been used in successful studies of ego depletion for years. But none of Baumeister’s suggestions were allowable due to the RRR restrictions that it must be done with only computerized tasks that were culturally and linguistically neutral.”

Baumeister and Vohs then claim that the manipulation did not lead to ego-depletion and that it is not surprising that an unsuccessful manipulation does not produce an effect.

“Signs indicate the RRR was plagued by manipulation failure — and therefore did not test ego depletion.”

They then assure readers that ego-depletion is real because they have demonstrated the effect repeatedly using various experimental tasks.

“For two decades we have conducted studies of ego depletion carefully and honestly, following the field’s best practices, and we find the effect over and over (as have many others in fields as far-ranging as finance to health to sports, both in the lab and large-scale field studies). There is too much evidence to dismiss based on the RRR, which after all is ultimately a single study — especially if the manipulation failed to create ego depletion.”

This last statement is, however, misleading if not outright deceptive.  As noted earlier, Baumeister admitted to the practice of not publishing disconfirming evidence.  He and I disagree whether the selective publication of successful studies is honest or dishonest.  He wrote:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

So, when Baumeister and Vohs assure readers that they conducted ego-depletion research carefully and honestly, they are not saying that they reported all studies that they conducted in their labs.  The successful studies published in articles are not representative of the studies conducted in their labs.

In a response to Baumeister and Vohs, the lead authors of the replication project pointed out that ego-depletion does not exist unless proponents of ego-depletion theory can specify experimental procedures that reliably produce the predicted effect.

The onus is on researchers to develop a clear set of paradigms that reliably evoke depletion in large samples with high power (Hagger & Chatzisarantis, 2016)

In an open email letter, I asked Baumeister and Vohs to name paradigms that could replicate a published ego-depletion effect.  They were not able or willing to name a single paradigm. Roy Baumeister’s response was “In view of your reputation as untrustworthy, dishonest, and otherwise obnoxious, i prefer not to cooperate or collaborate with you.”

I did not request to collaborate with him.  I merely asked which paradigm would be able to produce ego-depletion effects in an open and transparent replication study, given his criticism of the most rigorous replication study that he initially approved.

If an expert who invented a theory and published numerous successful studies cannot name a paradigm that will work, it suggests that he does not know which studies may work because for each published successful study there are unpublished, unsuccessful studies that used the same procedure, and it is not obvious which study would actually replicate in an honest and transparent replication project.

A New Meta-Analysis of Ego-Depletion Studies:  Are there replicable effects?

Since I published the incredibility index (Schimmack, 2012) and demonstrated bias in research on glucose and ego-depletion, I have developed new and more powerful ways to reveal selection bias and questionable research practices.  I applied these methods to the large literature on ego-depletion to examine whether there are some credible ego-depletion effects and a paradigm that produces replicable effects.

The first method uses powergraphs (Schimmack, 2015) to examine selection bias and the replicability of a set of studies. To create a powergraph, original research results are converted into absolute z-scores.  A z-score shows how much evidence a study result provides against the null-hypothesis that there is no effect.  Unlike effect size measures, z-scores also contain information about the sample size (sampling error).   I therefore distinguish between meta-analysis of effect sizes and meta-analysis of evidence.  Effect size meta-analysis aims to determine the typical, average size of an effect.  Meta-analyses of evidence examine how strong the evidence for an effect (i.e., against the null-hypothesis of no effect) is.

The distribution of absolute z-scores provides important information about selection bias, questionable research practices, and replicability.  Selection bias is revealed if the distribution of z-scores shows a steep drop on the left side of the criterion for statistical significance (this is analogous to the empty space below the line for significance in a funnel plot). Questionable research practices are revealed if z-scores cluster in the area just above the significance criterion.  Replicability is estimated by fitting a weighted composite of several non-central distributions that simulate studies with different non-centrality parameters and sampling error.

A literature search retrieved 165 articles that reported 429 studies.  For each study, the most important statistical test was converted first into a two-tailed p-value and then into a z-score.  A single test statistic was used to ensure that all z-scores are statistically independent.
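The second conversion step described above (two-tailed p-value to absolute z-score) can be sketched in a few lines. This is an illustrative sketch using only the Python standard library, not the code used for the meta-analysis; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def p_to_abs_z(p_two_tailed):
    """Absolute z-score whose two-tailed p-value equals p_two_tailed."""
    return STD_NORMAL.inv_cdf(1.0 - p_two_tailed / 2.0)

def abs_z_to_p(z):
    """Two-tailed p-value that corresponds to an absolute z-score."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

# Conventional cut-offs and their z-score equivalents.
for p in (0.10, 0.05, 0.01):
    print(f"p = {p:.2f} (two-tailed) -> |z| = {p_to_abs_z(p):.2f}")
```

Working with absolute z-scores puts results from t-tests, F-tests, and other statistics on a common scale, which is what makes it possible to plot them in a single distribution.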

The results show clear evidence of selection bias (Figure 1).  Although there are some results below the significance criterion (z = 1.96, p < .05, two-tailed), most of these results are above z = 1.65, which corresponds to p < .10 (two-tailed) or p < .05 (one-tailed).  These results are typically reported as marginally significant and used as evidence for an effect.   There are hardly any results that fail to confirm a prediction based on ego-depletion theory.  Using z = 1.65 as criterion, the success rate is 96%, which is typical of the success rates reported in psychological journals (Sterling, 1959; Sterling et al., 1995; OSC, 2015).  The steep cliff in the powergraph shows that this success rate is due to selection bias because random error would have produced a more gradual decline with many more non-significant results.

The next observation is the tall bar just above the significance criterion with z-scores between 2 and 2.2.   This result is most likely due to questionable research practices that lead to just significant results such as optional stopping or selective dropping of outliers.

Another steep drop is observed at z-scores of 2.6.  This drop is likely due to the use of further questionable research practices such as dropping of experimental conditions, use of multiple dependent variables, or simply running multiple studies and selecting only significant results.

A rather large proportion of z-scores are in the questionable range from z = 1.96 to 2.60.  These results are unlikely to replicate. Although some studies may have reported honest results, there are too many questionable results and it is impossible to say which results are trustworthy and which results are not.  It is like getting information from a group of people where 60% are liars and 40% tell the truth.  Even though 40% are telling the truth, the information is useless without knowing who is telling the truth and who is lying.

The best bet to find replicable ego-depletion results is to focus on the largest z-scores as replicability increases with the strength of evidence (OSC, 2015). The power estimation method uses the distribution of z-scores greater than 2.6 to estimate the average power of these studies.  The estimated power is 47% with a 95% confidence interval ranging from 32% to 63%.  This result suggests that some ego-depletion studies have produced replicable results.  In the next section, I examine which studies this may be.

In sum, a state-of-the-art meta-analysis of evidence for an effect in the ego-depletion literature shows clear evidence for selection bias and the use of questionable research practices.  Many published results are essentially useless because the evidence is not credible.  However, the results also show that some studies produced replicable effects, which is consistent with Carter and McCullough’s finding that the average effect size is likely to be above zero.

What Ego-Depletion Studies Are Most Likely to Replicate?

Powergraphs are useful for large sets of heterogeneous studies.  However, they are not useful to examine the replicability of a single study or small sets of studies, such as a set of studies in a multiple-study article.  For this purpose, I developed two additional tools that detect bias in published results.

The Test of Insufficient Variance (TIVA) requires a minimum of two independent studies.  As z-scores follow a normal distribution (the normal distribution of random error), the variance of z-scores should be 1.  However, if non-significant results are omitted from reported results, the variance shrinks.  TIVA uses the standard comparison of variances to compute the probability that an observed variance of z-scores is an unbiased sample drawn from a normal distribution.  TIVA has been shown to reveal selection bias in Bem’s (2011) article and it is a more powerful test than the incredibility index (Schimmack, 2012).
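A minimal sketch of TIVA follows. It assumes that the "standard comparison of variances" is the left-tailed chi-square test of an observed variance against an expected variance of 1; that reading, the function names, and the series-based chi-square CDF are assumptions made for illustration, not the author's implementation.

```python
import math
from statistics import variance

def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom.

    Uses the power series of the regularized lower incomplete gamma function,
    which converges quickly for the small values that matter for TIVA.
    """
    if x <= 0.0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total and n < 10_000:
        n += 1
        term *= y / (a + n)
        total += term
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))

def tiva(z_scores):
    """Return (variance of z-scores, left-tailed p-value) for k >= 2 studies."""
    k = len(z_scores)
    var_z = variance(z_scores)        # sample variance with k - 1 in the denominator
    statistic = var_z * (k - 1)       # chi-square(k - 1) if the true variance is 1
    return var_z, chi2_cdf(statistic, k - 1)

# Hypothetical z-scores that all sit just above the significance criterion.
var_z, p = tiva([2.0, 2.1, 2.2])
print(f"Var(z) = {var_z:.2f}, p = {p:.3f}")
```

Under this reading, a variance of 0.15 across six studies gives `chi2_cdf(0.15 * 5, 5)`, which is about .02 and agrees with the TIVA result reported for Baumeister et al. (2005) further down.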

The R-Index is based on the Incredibility Index in that it compares the success rate (percentage of significant results) with the observed statistical power of a test. However, the R-Index does not test the probability of the success rate.  Rather, it uses the observed power to predict replicability of an exact replication study.  The R-Index has two components. The first component is the median observed power of a set of studies.  In the limit, median observed power approaches the average power of an unbiased set of exact replication studies.  However, when selection bias is present, median observed power is biased and provides an inflated estimate of true power.  The second component is the inflation rate: the R-Index measures the extent of selection bias by means of the difference between success rate and median observed power.  If median observed power is 75% and the success rate is 100%, the inflation rate is 25% (100 – 75 = 25).  The inflation rate is subtracted from median observed power to correct for the inflation.  The resulting replication index is not directly an estimate of power, except for the special case when power is 50% and the success rate is 100%.  When power is 50% and the success rate is 100%, median observed power increases to 75%.  In this case, the inflation correction of 25% returns the actual power of 50%.

I emphasize this special case because 50% power is also the critical point at which a rational bet would change from betting against replication (Replicability < 50%) to betting on a successful replication (Replicability > 50%).  Thus, an R-Index of 50% suggests that a study or a set of studies produced a replicable result.  With success rates close to 100%, this criterion implies that median observed power is 75%, which corresponds to a z-score of 2.63.  Incidentally, a z-score of 2.6 also separated questionable results from more credible results in the powergraph analysis above.

It may seem problematic to use the R-Index even for a single study because observed power of a single study is strongly influenced by random factors and observed power is by definition above 50% for a significant result. However, the R-Index provides a correction for selection bias and a significant result implies a 100% success rate.  Of course, it could also be an honestly reported result, but if the study was published in a field with evidence of selection bias, the R-Index provides a reasonable correction for publication bias.  To achieve an R-Index above 50%, observed power has to be greater than 75%.

This criterion has been validated with social psychology studies in the reproducibility project, where the R-Index predicted replication success with over 90% accuracy. This criterion also correctly predicted that the ego-depletion replication project would produce fewer than 50% successful replications, which it did, because the R-Index for the original study was far below 50% (F(1,90) = 4.64, p = .034, z = 2.12, OP = .56, R-Index = .12).  If this information had been available during the planning of the RRR, researchers might have opted for a paradigm with a higher chance of a successful replication.
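The R-Index arithmetic can be written out as a short sketch. It assumes a two-tailed test with alpha = .05 and defines observed power as the power of a study whose true effect equals the observed z-score; the function names are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # 1.96 for alpha = .05, two-tailed

def observed_power(z):
    """Power of a two-tailed test if the true effect equals the observed z-score."""
    z = abs(z)
    return (1.0 - STD_NORMAL.cdf(Z_CRIT - z)) + STD_NORMAL.cdf(-Z_CRIT - z)

def r_index(z_scores):
    """Median observed power minus the inflation (success rate minus median power)."""
    median_power = median(observed_power(z) for z in z_scores)
    success_rate = sum(abs(z) > Z_CRIT for z in z_scores) / len(z_scores)
    inflation = success_rate - median_power
    return median_power - inflation

# Single significant study with z = 2.12, as in the original study above.
print(f"observed power = {observed_power(2.12):.2f}, R-Index = {r_index([2.12]):.2f}")
# A z-score of 2.63 sits at the 75% observed-power threshold.
print(f"observed power at z = 2.63: {observed_power(2.63):.2f}")
```

The printed R-Index can differ from a hand calculation in the second decimal, because the hand calculation rounds observed power to two decimals before subtracting the inflation.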

To identify paradigms with higher replicability, I computed the R-Index and TIVA (for articles with more than one study) for all 165 articles in the meta-analysis.  For TIVA I used p < .10 as criterion for bias and for the R-Index I used .50 as the criterion.   37 articles (22%) passed this test.  This implies that 128 (78%) showed signs of statistical bias and/or low replicability.  Below I discuss the Top 10 articles with the highest R-Index to identify paradigms that may produce a reliable ego-depletion effect.

1. Robert D. Dvorak and Jeffrey S. Simons (PSPB, 2009) [ID = 142, R-Index > .99]

This article reported a single study with an unusually large sample size for ego-depletion studies. 180 participants were randomly assigned to a standard ego-depletion manipulation. In the control condition, participants watched an amusing video.  In the depletion condition, participants watched the same video, but they were instructed to suppress all feelings and expressions.  The dependent variable was persistence on a set of solvable and unsolvable anagrams.  The t-value in this study suggests strong evidence for an ego-depletion effect, t(178) = 5.91.  The large sample size contributes to this, but the effect size is also large, d = .88.

Interestingly, this study is an exact replication of Study 3 in the seminal ego-depletion article by Baumeister et al. (1998), which obtained a significant effect with just 30 participants and a strong effect size of d = .77, t(28) = 2.12.

The same effect was also reported in a study with 132 smokers (Heckman, Ditre, & Brandon, 2012). Smokers who were not allowed to smoke persisted longer on a figure tracing task when they could watch an emotional video normally than when they had to suppress emotional responses, t(64) = 3.15, d = .78.  The depletion effect was weaker when smokers were allowed to smoke between the video and the figure tracing task. The interaction effect was significant, F(1, 128) = 7.18.

In sum, a set of studies suggests that emotion suppression influences persistence on a subsequent task.  The existing evidence suggests that this is a rather strong effect that can be replicated across laboratories.
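The effect sizes quoted for these studies can be checked against the reported t-values. The sketch below assumes two independent groups of equal size, so that each cell holds half of the participants in the comparison; the cell sizes are inferred from the degrees of freedom and are not stated in this form in the articles.

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d implied by an independent-groups t-value and the two cell sizes."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)

# (label, t-value, assumed participants per cell)
studies = [
    ("Dvorak & Simons (2009), t(178) = 5.91", 5.91, 90),
    ("Baumeister et al. (1998) Study 3, t(28) = 2.12", 2.12, 15),
    ("Heckman et al. (2012), t(64) = 3.15", 3.15, 33),
]
for label, t, n in studies:
    print(f"{label}: d = {d_from_t(t, n, n):.2f}")
```

The conversion makes explicit why similar effect sizes produce very different amounts of evidence: with d held near .8, the t-value grows with the square root of the cell size.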

2. Megan Oaten, Kipling D. Williams, Andrew Jones, & Lisa Zadro (J Soc Clinical Psy, 2008) [ID = 118, R-Index > .99]

This article reports two studies that manipulated social exclusion (ostracism) under the assumption that social exclusion is ego-depleting. The dependent variable was consumption of an unhealthy food in Study 1 and drinking a healthy, but unpleasant drink in Study 2.  Both studies showed extremely strong effects of ego-depletion (Study 1: d = 2.69, t(71) = 11.48;  Study 2: d = 1.48, t(72) = 6.37).

One concern about these unusually strong effects is the transformation of the dependent variable.  The authors report that they first ranked the data and then assigned z-scores corresponding to the estimated cumulative proportion.  This is an unusual procedure and it is difficult to say whether this procedure inadvertently inflated the effect size of ego-depletion.

Interestingly, one other article used social exclusion as an ego-depletion manipulation (Baumeister et al., 2005).  This article reported six studies and TIVA showed evidence of selection bias, Var(z) = 0.15, p = .02.  Thus, the reported effect sizes in this article are likely to be inflated.  The first two studies used consumption of an unpleasant tasting drink and eating cookies, respectively, as dependent variables. The reported effect sizes were weaker than in the article by Oaten et al. (d = 1.00, d = .90).

In conclusion, there is some evidence that participants avoid displeasure and seek pleasure after social rejection. A replication study with a sufficient sample size may replicate this result with a weaker effect size.  However, even if this effect exists it is not clear that the effect is mediated by ego-depletion.

3. Kathleen D. Vohs & Ronald J. Faber (Journal of Consumer Research) [ID = 29, R-Index > .99]

This article examined the effect of several ego-depletion manipulations on purchasing behavior.  Study 1 found a weaker effect, t(33) = 2.83,  than Studies 2 and 3, t(63) = 5.26, t(33) = 5.52, respectively.  One possible explanation is that the latter studies used actual purchasing behavior.  Study 2 used the White Bear paradigm and Study 3 used amplification of emotion expressions as ego-depletion manipulations.  Although statistically robust, purchasing behavior does not seem to be the best indicator of ego-depletion.  Thus, replication efforts may focus on other dependent variables that measure ego-depletion more directly.

4. Kathleen D. Vohs, Roy F. Baumeister, & Brandon J. Schmeichel (JESP, 2012/2013) [ID = 49, R-Index = .96]

This article was first published in 2012, but the results for Study 1 were misreported and a corrected version was published in 2013.  The article presents two studies with a 2 x 3 between-subject design. Study 1 had n = 13 participants per cell and Study 2 had n = 35 participants per cell.  Both studies showed an interaction between ego-depletion manipulations and manipulations of self-control beliefs. The dependent variables in both studies were the Cognitive Estimation Test and a delay of gratification task.  Results were similar for both dependent measures. I focus on the CET because it provides a more direct test of ego-depletion; that is, the draining of resources.

In the condition with limited-will-power beliefs of Study 1, the standard ego-depletion effect that compares depleted participants to a control condition was a decrease of about 6 points from about 30 to 24 points (no exact means or standard deviations, or t-values for this contrast are provided).  The unlimited will-power condition shows a smaller decrease of 2 points (31 vs. 29).  Study 2 replicates this pattern. In the limited-will-power condition, CET scores decreased again by 6 points from 32 to 26 and in the unlimited-will-power condition CET scores decreased by about 2 points from about 31 to 29 points.  This interaction effect would again suggest that the standard depletion effect can be reduced by manipulating participants’ beliefs.

One interesting aspect of the study was the demonstration that ego-depletion effects increase with the number of ego-depleting tasks.  Performance on the CET decreased further when participants completed 4 vs. 2 or 3 vs. 1 depleting tasks.  Thus, given the uncertainty about the existence of ego-depletion, it would make sense to start with a strong manipulation that compares a control condition with a condition with multiple ego-depleting tasks.

One concern about this article is the use of the CET as a measure of ego-depletion.  The task was used in only one other study by Schmeichel, Vohs, and Baumeister (2003) with a small sample of N = 37 participants.  The authors reported a just significant effect on the CET, t(35) = 2.18.  However, Vohs et al. (2013) increased the number of items from 8 to 20, which makes the measure more reliable and sensitive to experimental manipulations.

Another limitation of this study is that there was no control condition without manipulation of beliefs. It is possible that the depletion effect in this study was amplified by the limited-will-power manipulation. Thus, a simple replication of this study would not provide clear evidence for ego-depletion.  However, it would be interesting to do a replication study that examines the effect of ego-depletion on the CET without manipulation of beliefs.

In sum, this study could provide the basis for a successful demonstration of ego-depletion by comparing effects on the CET for a control condition versus a condition with multiple ego-depletion tasks.

5. Veronika Job, Carol S. Dweck, and Gregory M. Walton (Psy Science, 2010) [ID = 191, R-Index = .94]

The article by Job et al. (2010) is noteworthy for several reasons.  First, the article presented three close replications of the same effect with high t-values, ts = 3.88, 8.47, 2.62.  Based on these results, one would expect that other researchers can replicate the results.  Second, the effect is an interaction between a depletion manipulation and a subtle manipulation of theories about the effect of working on an effortful task.  Hidden among other questionnaires, participants received either items that suggested depletion (“After a strenuous mental activity your energy is depleted and you must rest to get it refueled again”) or items that suggested energy is unlimited (“Your mental stamina fuels itself; even after strenuous mental exertion you can continue doing more of it”). The pattern of the interaction effect showed that only participants who received the depletion items showed the depletion effect.  Participants who received the unlimited energy items showed no significant difference in Stroop performance.  Taken at face value, this finding would challenge depletion theory, which assumes that depletion is an involuntary response to exerting effort.

However, the study also raises questions because the authors used an unconventional statistical method to analyze their data.  Data were analyzed with a multi-level model that modeled errors as a function of factors that vary within participants over time and factors that vary between participants, including the experimental manipulations.  In an email exchange, the lead author confirmed that the model did not include random factors for between-subject variance.  A statistician assured the lead author that this was acceptable.  However, a simple computation of the standard deviation around mean accuracy levels would show that this variance is not zero.  Thus, the model artificially inflated the evidence for an effect by treating between-subject variance as within-subject variance. In a between-subject analysis, the small differences in error rates (about 5 percentage points) are unlikely to be significant.

In sum, it is doubtful that a replication study would replicate the interaction between depletion manipulations and the implicit theory manipulation reported in Job et al. (2010) in an appropriate between-subject analysis.  Even if this result would replicate, it would not support the theory that ego-depletion is a limited resource that is depleted after a short effortful task because the effect can be undone with a simple manipulation of beliefs in unlimited energy.

6. Roland Imhoff, Alexander F. Schmidt, & Friederike Gerstenberg (Journal of Personality, 2014) [ID = 146, R-Index = .90]

Study 1 reports results of a standard ego-depletion paradigm with a relatively large sample (N = 123).  The ego-depletion manipulation was a Stroop task with 180 trials.  The dependent variable was consumption of chocolates (M&M).  The study reported a large effect, d = .72, and strong evidence for an ego-depletion effect, t(127) = 4.07.  The strong evidence is in part justified by the large sample size, but the standardized effect size seems a bit large for a difference of 2g in consumption, whereas the standard deviation of consumption appears a bit small (3g).  A similar study with M&M consumption as dependent variable found a 2g difference in the opposite direction with a much larger standard deviation of 16g and no significant effect, t(48) = -0.44.

The second study produced results in line with other ego-depletion studies and did not contribute to the high R-Index of the article, t(101) = 2.59. The third study was a correlational study that examined correlates of a trait measure of ego-depletion.  Even if this correlation is replicable, it does not support the fundamental assumption of ego-depletion theory of situational effects of effort on subsequent effort.  In sum, it is unlikely that Study 1 is replicable, and the strong results are probably due to misreported standard deviations.

7. Hugo J.E.M. Alberts, Carolien Martijn, & Nanne K. de Vries (JESP, 2011) [ID = 56, R-Index = .86]

This article reports the results of a single study that crossed an ego-depletion manipulation with a self-awareness priming manipulation (2 x 2 with n = 20 per cell).  The dependent variable was persistence in a hand-grip task.  Like many other handgrip studies, this study assessed handgrip persistence before and after the manipulation, which increases the statistical power to detect depletion effects.

The study found weak evidence for an ego-depletion effect, but relatively strong evidence for an interaction effect, F(1,71) = 13.00.  The conditions without priming showed a weak ego depletion effect (6s difference, d = .25).  The strong interaction effect was due to the priming conditions, where depleted participants showed an increase in persistence by 10s and participants in the control condition showed a decrease in performance by 15s.  Even if this is a replicable finding, it does not support the ego-depletion effect.  The weak evidence for ego depletion with the handgrip task is consistent with a meta-analysis of handgrip studies (Schimmack, 2015).

In short, although this study produced an R-Index above .50, closer inspection of the results shows no strong evidence for ego-depletion.

8. James M. Tyler (Human Communications Research, 2008) [ID = 131, R-Index = .82]

Even though the statistical results suggest that these results are highly replicable, the small sample sizes and very large effect sizes raise some concerns about replicability.  The large effects cannot be attributed to the ego-depletion tasks or measures, which have been used in many other studies that produced much weaker effects. Thus, the only theoretical explanation for these large effect sizes would be that ego depletion has particularly strong effects on social processes.  Even if these effects could be replicated, it is not clear that ego-depletion is the mediating mechanism.  In particular, the complex manipulation in the first two studies allows for multiple causal pathways.  It may also be difficult to recreate this manipulation and a failure to replicate the results could be attributed to problems with reproducibility.  Thus, a replication of this study is unlikely to advance understanding of ego-depletion without first establishing that ego-depletion exists.

9. Brandon J. Schmeichel, Heath A. Demaree, Jennifer L. Robinson, & Jie Pu (Social Cognition, 2006) [ID = 52, R-Index = .80]

This article reported one study with an emotion regulation task. Participants in the depletion condition were instructed to exaggerate emotional responses to a disgusting film clip.  The study used two tasks to measure ego-depletion.  One task required generation of words; the other task required generation of figures.  The article reports strong evidence in an ANOVA with both dependent variables, F(1,46) = 11.99.  Separate analyses of the means show a stronger effect for the figural task, d = .98, than for the verbal task, d = .50.

The main concern with this study is that the fluency measures were never used in any other study.  If a replication study fails, one could argue that the task is not a valid measure of ego-depletion.  However, the study shows the advantage of using multiple measures to increase statistical power (Schimmack, 2012).

10. Mark Muraven, Marylene Gagne, and Heather Rosman (JESP, 2008) [ID = 15, R-Index = .78]

Study 1 reports the results of a 2 x 2 design with N = 30 participants (~ 7.5 participants per condition).  It crossed an ego-depletion manipulation (resist eating chocolate cookies vs. radishes) with a self-affirmation manipulation.  The dependent variable was the number of errors in a vigilance task (respond to a 4 after a 6).  The results section shows some inconsistencies.  The 2 x 2 ANOVA shows strong evidence for an interaction, F(1,28) = 10.60, but the planned contrast that matches the pattern of means shows a just significant effect, F(1,28) = 5.18.  Neither of these statistics is consistent with the reported means and standard deviations, where the depleted, not affirmed group has more than twice as many errors (M = 12.25, SD = 1.63) as the depleted group with affirmation (M = 5.40, SD = 1.34). These results would imply a standardized effect size of d = 4.59.
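The implied effect size can be reproduced from the reported means and standard deviations. The sketch assumes equal cell sizes, so that the pooled standard deviation is the root mean square of the two standard deviations; the function name is hypothetical.

```python
import math

def cohens_d(mean_1, sd_1, mean_2, sd_2):
    """Standardized mean difference using the pooled SD of two equal-sized groups."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2.0)
    return (mean_1 - mean_2) / pooled_sd

# Depleted, not affirmed vs. depleted with affirmation (Study 1).
d = cohens_d(12.25, 1.63, 5.40, 1.34)
print(f"d = {d:.2f}")
```

An effect of this size with roughly 7 to 8 participants per cell would imply a far larger test statistic than the reported F(1,28) = 5.18, which is the inconsistency noted above.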

Study 2 did not manipulate ego-depletion and reported a more reasonable, but also less impressive result for the self-affirmation manipulation, F(2,63) = 4.67.

Study 3 crossed an ego-depletion manipulation with a pressure manipulation.  The ego-depletion task was a computerized ego-depletion task where participants in the depletion condition had to type a paragraph without copying the letter E or spaces. This is more difficult than just copying a paragraph.  The pressure manipulation consisted of constant reminders to avoid making errors and to be as fast as possible.  The sample size was N = 96 (n = 24 per cell).  The dependent variable was the vigilance task from Study 1.  The evidence for a depletion effect was strong, F(1, 92) = 10.72 (z = 3.17).  However, the effect was qualified by the pressure manipulation, F(1,92) = 6.72.  There was a strong depletion effect in the pressure condition, d = .78, t(46) = 2.63, but there was no evidence for a depletion effect in the no-pressure condition, d = -.23, t(46) = 0.78.

The standard deviations in Study 3, which used the same dependent variable, were considerably wider than the standard deviations in Study 1, which explains the larger standardized effect sizes in Study 1.  With the standard deviations of Study 3, Study 1 would not have

DISCUSSION AND FUTURE DIRECTIONS

The original ego-depletion article published in 1998 has spawned a large literature with over 150 articles, more than 400 studies, and a total number of over 30,000 participants. There have been numerous theoretical articles and meta-analyses of this literature.  Unfortunately, the empirical results reported in this literature are not credible because there is strong evidence that reported results are biased.  The bias makes it difficult to predict which effects are replicable. The main conclusion that can be drawn from this shaky mountain of evidence is that ego-depletion researchers have to change the way they conduct and report their findings.

Importantly, this conclusion is in stark disagreement with Baumeister’s recommendations.  In a forthcoming article, he suggests that “the field has done very well with the methods and standards it has developed over recent decades,” (p. 2), and he proposes that “we should continue with business as usual” (p. 1).

Baumeister then explicitly defends the practice of selectively publishing studies that produced significant results without reporting failures to demonstrate the effect in conceptually similar studies.

Critics of the practice of running a series of small studies seem to think researchers are simply conducting multiple tests of the same hypothesis, and so they argue that it would be better to conduct one large test. Perhaps they have a point: One big study could be arguably better than a series of small ones. But they also miss the crucial point that the series of small studies is typically designed to elaborate the idea in different directions, such as by identifying boundary conditions, mediators, moderators, and extensions. The typical Study 4 is not simply another test of the same hypothesis as in Studies 1–3. Rather, each one is different. And yes, I suspect the published report may leave out a few other studies that failed. Again, though, those studies’ purpose was not primarily to provide yet another test of the same hypothesis. Instead, they sought to test another variation, such as a different manipulation, or a different possible boundary condition, or a different mediator. Indeed, often the idea that motivated Study 1 has changed so much by the time Study 5 is run that it is scarcely recognizable. (p. 2)

Baumeister overlooks that a program of research that tests novel hypotheses with new experimental procedures in small samples is most likely to produce non-significant results.  When these results are not reported, the significant results that remain do not show that the studies successfully demonstrated an effect or elucidated moderating factors. The result of this program of research is a complicated pattern of results that is shaped by random error, selection bias, and weak true effects that are difficult to replicate (Figure 1).

Baumeister makes the logical mistake of assuming that the type-I error rate is reset when a study is not a direct replication and that the type-I error rate only increases for exact replications. For example, it is obvious that we should not believe that eating green jelly beans decreases the risk of cancer if 1 out of 20 studies with green jelly beans produced a significant result.  With a 5% error rate, we would expect one significant result in 20 attempts by chance alone.  Importantly, this does not change if green jelly beans showed an effect, but red, orange, purple, blue, … jelly beans did not show an effect.  With each study, the risk of a false positive result increases, and if 1 out of 20 studies produced a significant result, the success rate is not higher than one would expect by chance alone.  It is therefore important to report all results; reporting only the one green jelly bean study with a significant result distorts the scientific evidence.
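The jelly bean arithmetic can be made explicit with a short sketch. It assumes 20 independent tests of true null hypotheses at alpha = .05; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha=0.05):
    """Probability of one or more false positives across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

n = 20
print(f"expected false positives in {n} tests: {expected_false_positives(n):.1f}")
print(f"probability of at least one: {prob_at_least_one(n):.2f}")
```

The calculation is the same whether the 20 tests are exact replications or 20 different colors, which is why conceptual variation across studies does not reset the error rate.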

Baumeister overlooks the multiple comparison problem when he claims that “a series of small studies can build and refine a hypothesis much more thoroughly than a single large study.”

As the meta-analysis shows, a series of over 400 small studies with selection bias tells us very little about ego-depletion, and it remains unclear under which conditions the effect can be reliably demonstrated.  To his credit, Baumeister is humble enough to acknowledge that his sanguine view of social psychological research is biased.

In my humble and biased view, social psychology has actually done quite well. (p. 2)

Baumeister remembers fondly the days when he learned how to conduct social psychological experiments.  “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants.”  A simple power analysis with these sample sizes shows that a study with n = 10 per cell (N = 20) has a sensitivity to detect effect sizes of d = 1.32 with 80% probability.  Even the biased effect size estimate for ego-depletion studies was only half of this effect size.  Thus, a sample size of n = 10 is ridiculously low.  What about a sample size of n = 20?   It still requires an effect size of d = .91 to have an 80% chance to produce a significant result.  Perhaps Roy Baumeister thinks that it is sufficient to aim for a 50% success rate and to drop the other 50%.  An effect size of d = .64 gives researchers a 50% chance to get a significant result with N = 40.  But the meta-analysis shows that the bias-corrected effect size is less than this.  So, even n = 20 is not sufficient to demonstrate ego-depletion effects.  Does this mean the effects are too flimsy to study?
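These sensitivity figures can be checked numerically. The sketch below is one possible way to do so, assuming a two-sided independent-groups t-test at alpha = .05. The function name `power_two_sample` is hypothetical, and the critical t-values (2.101 for df = 18, 2.024 for df = 38) are hard-coded from standard tables. It integrates the rejection probability over the chi-square distribution of the pooled variance estimate and ignores the negligible chance of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist


def power_two_sample(d, n_per_group, t_crit, steps=4000):
    """Approximate power of a two-sided independent-groups t-test.

    Uses T = (Z + ncp) / sqrt(V / df), with Z standard normal and V chi-square,
    and integrates P(T > t_crit | V = v) over the chi-square density of V.
    """
    df = 2 * n_per_group - 2
    ncp = d * math.sqrt(n_per_group / 2)  # noncentrality parameter
    phi = NormalDist().cdf
    log_norm = (df / 2) * math.log(2) + math.lgamma(df / 2)
    upper = df + 12 * math.sqrt(2 * df)  # covers essentially all chi-square mass
    h = upper / steps
    total = 0.0
    for i in range(1, steps):  # trapezoid rule; both end points contribute ~0
        v = i * h
        density = math.exp((df / 2 - 1) * math.log(v) - v / 2 - log_norm)
        total += (1 - phi(t_crit * math.sqrt(v / df) - ncp)) * density
    return total * h


# (effect size d, n per cell, critical t for df = 2n - 2)
for d, n, t_crit in [(1.32, 10, 2.101), (0.91, 20, 2.024), (0.64, 20, 2.024)]:
    power = power_two_sample(d, n, t_crit)
    print(f"d = {d:.2f}, n = {n} per cell: power = {power:.2f}")
```

Dedicated power software or a noncentral t implementation from a statistics library would normally be used instead; the hand-rolled integration only keeps the example free of external dependencies.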

Inadvertently, Baumeister seems to dismiss ego-depletion effects as irrelevant if large sample sizes are required to demonstrate them.

Large samples increase statistical power. Therefore, if social psychology changes to insist on large samples, many weak effects will be significant that would have failed with the traditional and smaller samples. Some of these will be important effects that only became apparent with larger samples because of the constraints on experiments. Other findings will however make a host of weak effects significant, so more minor and trivial effects will enter into the body of knowledge.

If ego-depletion effects are not really strong, but only inflated by selection bias, and the real effects are much weaker, they may be minor and trivial effects that have little practical significance for the understanding of self-control in real life.

Baumeister then comes to the most controversial claim of his article that has produced a vehement response on social media.  He claims that a special skill called flair is needed to produce significant results with small samples.

Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure.

The need for flair also explains why some researchers fail to replicate original studies by researchers with flair.

But in that process, we have created a career niche for bad experimenters. This is an underappreciated fact about the current push for publishing failed replications. I submit that some experimenters are incompetent. In the past their careers would have stalled and failed. But today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work and thereby publishing a series of papers that will achieve little beyond undermining our field’s ability to claim that it has accomplished anything.

Baumeister even noticed individual differences in flair among his graduate and post-doctoral students.  The measure of flair was whether students were able to present significant results to him.

Having mentored several dozen budding researchers as graduate students and postdocs, I have seen ample evidence that people’s ability to achieve success in social psychology varies. My laboratory has been working on self-regulation and ego depletion for a couple decades. Most of my advisees have been able to produce such effects, though not always on the first try. A few of them have not been able to replicate the basic effect after several tries. These failures are not evenly distributed across the group. Rather, some people simply seem to lack whatever skills and talents are needed. Their failures do not mean that the theory is wrong.

The first author of the glucose paper was a victim of a doctoral advisor who believed that one could demonstrate a correlation between blood glucose levels and behavior with samples of 20 or fewer participants.  He found a way to produce these results, in a manner that left statistical evidence of bias, but this effort was wasted on a false theory and a program of research that could not produce evidence for or against the theory because sample sizes were too small to show the effect even if the theory were correct.  Furthermore, it is not clear how many graduate students left Baumeister’s lab thinking that they were failures because they lacked research skills, when in fact they had only applied the scientific method correctly.

As this article went to press, we were notified that this experiment had been independently replicated by Timothy J. Howe, of Cole Junior High School in East Greenwich, Rhode Island, for his science fair project. His results conformed almost exactly to ours, with the exception that mean persistence in the chocolate condition was slightly (but not significantly) higher than in the control condition. These converging results strengthen confidence in the present findings.

If ego-depletion effects can be replicated in a school project, it undermines the idea that successful results require special skills.  Moreover, the meta-analysis shows that flair is little more than selective publishing of significant results, a conclusion that is confirmed by Baumeister’s response to my bias analyses: “you may think that not reporting the less successful studies is wrong, but that is how the field works” (Roy Baumeister, personal email communication).

In conclusion, future researchers interested in self-regulation have a choice. They can believe in ego-depletion and ignore the statistical evidence of selection bias, failed replications, and admissions of suppressed evidence, and conduct further studies with existing paradigms and sample sizes and see what they get.  Alternatively, they may go to the other extreme and dismiss the entire literature.

“If all the field’s prior work is misleading, underpowered, or even fraudulent, there is no need to pay attention to it.” (Baumeister, p. 4).

This meta-analysis offers a third possibility by trying to find replicable results that can provide the basis for planning future studies that provide better tests of ego-depletion theory.  I do not suggest directly replicating any past study.  Rather, I think future research should aim for a strong demonstration of ego-depletion.  To achieve this goal, future studies should maximize statistical power in four ways.

First, use a strong experimental manipulation by comparing a control condition with a combination of multiple ego-depletion paradigms to maximize the standardized effect size.

Second, the study should use multiple, reliable, and valid measures of ego-depletion to minimize the influence of random and systematic measurement error in the dependent variable.

Third, the study should use a within-subject design or at least a pre-post design to control for individual differences in performance on the ego-depletion tasks to further reduce error variance.

Fourth, the study should have a sufficient sample size to make a non-significant result theoretically important.  I suggest planning for a standard error of .10 standard deviations.  As a result, any effect size greater than d = .20 will be significant, and a non-significant result is consistent with the null-hypothesis that the effect size is less than d = .20.
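The arithmetic behind the fourth recommendation can be sketched as follows. The conversion from standard error to sample size is an assumption that is not stated in the text: it uses the common approximation SE(d) ≈ 2 / sqrt(N) for two independent groups of equal size. A within-subject design, as recommended in the third point, would reach the same standard error with fewer participants.

```python
from statistics import NormalDist

target_se = 0.10  # planned standard error of d, taken from the text
z_crit = NormalDist().inv_cdf(0.975)  # two-sided alpha = .05

# Smallest observed effect size that reaches significance at this precision.
min_significant_d = z_crit * target_se

# Assumption: between-subject design with equal groups, SE(d) ~ 2 / sqrt(N).
total_n = round((2 / target_se) ** 2)

print(f"smallest significant d: {min_significant_d:.3f}")
print(f"total N under the between-subject approximation: {total_n}")
```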

The next replicability report will show which path ego-depletion researchers have taken.  Even if they follow Baumeister’s suggestion to continue with business as usual, they can no longer claim that they were unaware of the consequences of going down this path.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with random error variance, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion.

Featured Blog of the Month (January, 2019):
Why Ioannidis’s Claim “Most published research findings are false” is false

TOP TEN BLOGS

1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”   The results suggest that many of the cited findings are difficult to replicate.

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

# The Abuse of Hoenig and Heisey: A Justification of Power Calculations with Observed Effect Sizes

In 2001, Hoenig and Heisey wrote an influential article, titled “The Abuse of Power: The Persuasive Fallacy of Power Calculations For Data Analysis.”  The article has been cited over 500 times and it is commonly cited as a reference to claim that it is a fallacy to use observed effect sizes to compute statistical power.

In this post, I provide a brief summary of Hoenig and Heisey’s argument. The summary shows that Hoenig and Heisey were concerned with the practice of assessing the statistical power of a single test based on the observed effect size for this effect. I agree that it is often not informative to do so (unless the result is power = .999). However, the article is often cited to suggest that the use of observed effect sizes in power calculations is fundamentally flawed. I show that this statement is false.

The abstract of the article makes it clear that Hoenig and Heisey focused on the estimation of power for a single statistical test. “There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result” (page 1). The abstract informs readers that this practice is fundamentally flawed. “This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic” (p. 1).

Given that method articles can be difficult to read, it is possible that the misinterpretation of Hoenig and Heisey is the result of relying on the term “fundamentally flawed” in the abstract. However, some passages in the article are also ambiguous. In the Introduction Hoenig and Heisey write “we describe the flaws in trying to use power calculations for data-analytic purposes” (p. 1). It is not clear what purposes are left for power calculations if they cannot be used for data-analytic purposes. Later on, they write more forcefully “A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed.” (p. 2). So readers cannot be blamed entirely if they believed that calculations of observed power are fundamentally flawed. This conclusion is often implied in Hoenig and Heisey’s writing, which is influenced by their broader dislike of hypothesis testing  in general.

The main valid argument that Hoenig and Heisey make is that power analysis is based on the unknown population effect size and that effect sizes in a particular sample are contaminated with sampling error.  As p-values and power estimates depend on the observed effect size, they are also influenced by random sampling error.

In a special case, when true power is 50%, the p-value matches the significance criterion. If sampling error leads to an underestimation of the true effect size, the p-value will be non-significant and the power estimate will be less than 50%. When sampling error inflates the observed effect size, p-values will be significant and power will be above 50%.

It is therefore impossible to find scenarios where observed power is high (80%) and a result is not significant, p > .05, or where observed power is low (20%) and a result is significant, p < .05.  As a result, it is not possible to use observed power to decide whether a non-significant result was obtained because power was low or because power was high but the effect does not exist.

In fact, a simple mathematical formula can be used to transform p-values into observed power and vice versa (I actually got the idea of using p-values to estimate power from Hoenig and Heisey’s article).  Given this perfect dependence between the two statistics, observed power cannot add additional information to the interpretation of a p-value.
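The post does not spell out the formula, so the following is only a sketch of one common version based on z-scores. It assumes a two-sided test and treats the observed z-score as if it were the true noncentrality; the function name `observed_power` is hypothetical.

```python
from statistics import NormalDist

_norm = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Convert a two-sided p-value into observed power (z-approximation)."""
    z_obs = _norm.inv_cdf(1 - p_value / 2)   # observed absolute z-score
    z_crit = _norm.inv_cdf(1 - alpha / 2)    # 1.96 for alpha = .05
    # Probability of exceeding the criterion in either direction
    # if the observed z-score were the true noncentrality.
    return (1 - _norm.cdf(z_crit - z_obs)) + _norm.cdf(-z_crit - z_obs)


for p in (0.50, 0.20, 0.05, 0.01, 0.001):
    print(f"p = {p:<5} -> observed power = {observed_power(p):.2f}")
```

Because the mapping is strictly monotonic, a result with p = .05 always corresponds to observed power of about 50%, non-significant results fall below 50%, and significant results fall above it.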

This central argument is valid and it does mean that it is inappropriate to use the observed effect size of a statistical test to draw inferences about the statistical power of a significance test for the same effect (N = 1). Similarly, one would not rely on a single data point to draw inferences about the mean of a population.

However, it is common practice to aggregate original data points or to aggregate effect sizes of multiple studies to obtain more precise estimates of the mean in a population or the mean effect size, respectively. Thus, the interesting question is whether Hoenig and Heisey’s (2001) article contains any arguments that would undermine the aggregation of power estimates to obtain an estimate of the typical power for a set of studies. The answer is no. Hoenig and Heisey do not consider a meta-analysis of observed power in their discussion and their discussion of observed power does not contain arguments that would undermine the validity of a meta-analysis of post-hoc power estimates.

A meta-analysis of observed power can be extremely useful to check whether researchers’ a priori power analyses provide reasonable estimates of the actual power of their studies.

Assume that researchers in a particular field have to demonstrate that their studies have 80% power to produce significant results when an important effect is present because conducting studies with less power would be a waste of resources (although some granting agencies require power analyses, these power analyses are rarely taken seriously, so I consider this a hypothetical example).

Assume that researchers comply and submit a priori power analyses with effect sizes that are considered to be sufficiently meaningful. For example, an effect of half a standard deviation (Cohen’s d = .50) might look reasonably large to be meaningful. Researchers submit their grant applications with an a priori power analysis that produces 80% power with an effect size of d = .50. Based on the power analysis, researchers request funding for 128 participants per study. A researcher plans four studies and needs \$50 for each participant. The total budget is \$25,600.

When the research project is completed, all four studies produced non-significant results. The observed standardized effect sizes were 0, .20, .25, and .15. Is it really impossible to estimate the realized power in these studies based on the observed effect sizes? No. It is common practice to conduct a meta-analysis of observed effect sizes to get a better estimate of the (average) population effect size. In this example, the average effect size across the four studies is d = .15. It is also possible to show that the average effect size in these four studies is significantly different from the effect size that was used for the a priori power calculation (M1 = .15, M2 = .50, Mdiff = .35, SE = 1/sqrt(512) = .044, t = .35 / .044 = 7.92, p < 1e-13). Using the more realistic effect size estimate that is based on actual empirical data rather than wishful thinking, the post-hoc power analysis yields a power estimate of 13%. The probability of obtaining non-significant results in all four studies is 57%. Thus, it is not surprising that the studies produced non-significant results.  In this example, a post-hoc power analysis with observed effect sizes provides valuable information about the planning of future studies in this line of research. Either effect sizes of this magnitude are not important enough and research should be abandoned or effect sizes of this magnitude still have important practical implications and future studies should be planned on the basis of a priori power analysis with more realistic effect sizes.
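The post-hoc calculation in this example can be sketched as follows. Two assumptions are not stated in the text: each study is treated as a two-group comparison with 64 participants per group (128 per study), and power is computed with a normal approximation rather than the t-distribution, so the results may differ from the reported 13% and 57% by about a percentage point. The function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist, mean

norm = NormalDist()
z_crit = norm.inv_cdf(0.975)  # two-sided alpha = .05


def approx_power(d, n_per_group):
    """Normal-approximation power of a two-sided two-group comparison."""
    ncp = d * math.sqrt(n_per_group / 2)
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


observed_d = [0.0, 0.20, 0.25, 0.15]  # effect sizes from the example
mean_d = mean(observed_d)

planned_power = approx_power(0.50, 64)     # a priori assumption, d = .50
realized_power = approx_power(mean_d, 64)  # post hoc, mean observed d
p_all_nonsignificant = (1 - realized_power) ** len(observed_d)

print(f"mean observed d:              {mean_d:.2f}")
print(f"power assumed a priori:       {planned_power:.2f}")
print(f"power at the mean observed d: {realized_power:.2f}")
print(f"P(all four non-significant):  {p_all_nonsignificant:.2f}")
```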

Another valuable application of observed power analysis is the detection of publication bias and questionable research practices (Ioannidis and Trikalinos, 2007; Schimmack, 2012) and the estimation of the replicability of statistical results published in scientific journals (Schimmack, 2015).

In conclusion, the article by Hoenig and Heisey is often used as a reference to argue that observed effect sizes should not be used for power analysis.  This post clarifies that this practice is not meaningful for a single statistical test, but that it can be done for larger samples of studies.

# “Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerd Gigerenzer

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen’s method to a variety of different journals. The table below shows that power estimates vary by journal assuming that the effect size was medium according to Cohen’s criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen’s method. I also included the results of Sedlmeier and Gigerenzer’s power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Abnormal and Social Psychology was split into the Journal of Abnormal Psychology and the Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

| Journal Title | Year | Power (%) |
| --- | --- | --- |
| Journal of Marketing Research | 1981 | 89 |
| American Sociological Review | 1974 | 84 |
| Journalism Quarterly, The Journal of Broadcasting | 1976 | 76 |
| American Journal of Educational Psychology | 1972 | 72 |
| Journal of Research in Teaching | 1972 | 71 |
| Journal of Applied Psychology | 1976 | 67 |
| Journal of Communication | 1973 | 56 |
| The Research Quarterly | 1972 | 52 |
| Journal of Abnormal Psychology | 1984 | 50 |
| Journal of Abnormal and Social Psychology | 1962 | 48 |
| American Speech and Hearing Research & Journal of Communication Disorders | 1975 | 44 |
| Counselor Education and Supervision | 1973 | 37 |

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of  37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%).  The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.
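The summary statistics for the table can be recomputed directly from its twelve entries. The snippet below is only a convenience check; the abbreviated dictionary keys are shorthand introduced here, not labels used in the article.

```python
from statistics import mean, median

# Power estimates (%) for a medium effect size, copied from the table above.
power_by_journal = {
    "J Marketing Research 1981": 89,
    "Am Sociological Review 1974": 84,
    "Journalism Quarterly / J Broadcasting 1976": 76,
    "Am J Educational Psychology 1972": 72,
    "J Research in Teaching 1972": 71,
    "J Applied Psychology 1976": 67,
    "J Communication 1973": 56,
    "Research Quarterly 1972": 52,
    "J Abnormal Psychology 1984": 50,
    "J Abnormal and Social Psychology 1962": 48,
    "Speech and Hearing / Communication Disorders 1975": 44,
    "Counselor Education and Supervision 1973": 37,
}

mean_power = mean(power_by_journal.values())
median_power = median(power_by_journal.values())

print(f"mean power:   {mean_power:.1f}%")
print(f"median power: {median_power:.1f}%")
```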

The average for the three core psychology journals (JASP & JAbnPsy,  JAP, AJEduPsy) is 67% (median = 63%), which is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, a multi-factorial design produces a rapidly growing number of statistical comparisons. With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With 5 conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions may not be significant if this test is part of a series of tests.
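To make the point about multiple comparisons concrete, the sketch below counts the pairwise comparisons for a given number of conditions and shows how power drops when the significance criterion is adjusted. The Bonferroni adjustment and the noncentrality value of 2.8 (roughly 80% power at alpha = .05) are illustrative assumptions, not details taken from the article, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

norm = NormalDist()


def n_pairwise(k_conditions):
    """Number of pairwise comparisons among k conditions: k * (k - 1) / 2."""
    return math.comb(k_conditions, 2)


def approx_power(ncp, alpha):
    """Normal-approximation power of a two-sided test (upper tail only)."""
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - ncp)


ncp = 2.8  # hypothetical noncentrality, about 80% power without correction
for k in (2, 3, 5):
    m = n_pairwise(k)
    adjusted_alpha = 0.05 / m  # Bonferroni adjustment (assumed here)
    power = approx_power(ncp, adjusted_alpha)
    print(f"{k} conditions: {m:2d} comparisons, "
          f"alpha = {adjusted_alpha:.4f}, power = {power:.2f}")
```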

Sedlmeier and Gigerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen’s results. However, many articles presented results using a more stringent criterion of significance. If the criterion used by the authors had been used for the power analysis, power would have decreased further. About 50% of all articles used an adjusted criterion value, and if the adjusted criterion value was used, power was only 37%.

Sedlmeier and Gigerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Gigerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long-run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to conduct a more powerful study that examines gender differences with more female and male participants. However, this is not what researchers do. Rather, multiple study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely unlikely in multiple studies where studies have only 50% power to get a significant result in a single attempt. For each additional attempt, the probability to obtain only significant results decreases exponentially (1 Study, 50%, 2 Studies 25%, 3 Studies 12.5%, 4 Studies 6.25%).
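The total power of a multiple-study article follows from the product rule for independent events. The short sketch below assumes that all studies are independent and have the same power; the function name `total_power` is hypothetical.

```python
def total_power(power_per_study, n_studies):
    """Probability that every one of n independent studies is significant."""
    return power_per_study ** n_studies


for k in range(1, 6):
    print(f"{k} studies at 50% power: "
          f"P(all significant) = {total_power(0.5, k):.2%}")
```

Even with 80% power per study, the same rule implies that five significant results in a row are expected only about a third of the time (0.8 ** 5 ≈ 0.33).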

The fact that researchers only publish studies that worked is well-known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager who only publishes the return rate of funds that gained money and excludes funds with losses. Would you trust this manager to take care of your retirement? It is also akin to a gambler who only remembers winnings. Would you marry a gambler who believes that gambling is ok because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present 5 studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in this field, of this researcher, or in the journal that published the results.

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power-ranking of scientific journals and the first temporal analyses of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased power to ensure that their studies could actually provide significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are welcome.

# Meta-Analysis of Observed Power: Comparison of Estimation Methods

Meta-Analysis of Observed Power

Citation: Dr. R (2015). Meta-analysis of observed power. R-Index Bulletin, Vol(1), A2.

In a previous blog post, I presented an introduction to the concept of observed power. Observed power is an estimate of the true power on the basis of observed effect size, sampling error, and significance criterion of a study. Yuan and Maxwell (2005) concluded that observed power is a useless construct when it is applied to a single study, mainly because sampling error in a single study is too large to obtain useful estimates of true power. However, sampling error decreases as the number of studies increases and observed power in a set of studies can provide useful information about the true power in a set of studies.
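As a minimal sketch of this definition, the following Python snippet computes observed power from a single z-score. It assumes a normal approximation (z-scores rather than exact non-central t distributions) and the one-tailed criterion of .025 used later in this post; the function names `z_from_p` and `observed_power` are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def z_from_p(p_one_tailed):
    """Convert a one-tailed p-value (0 < p < 1) into a z-score."""
    return STD_NORMAL.inv_cdf(1 - p_one_tailed)


def observed_power(z, alpha_one_tailed=0.025):
    """Power implied by treating the observed z-score as the true non-centrality."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_tailed)  # about 1.96
    return STD_NORMAL.cdf(z - z_crit)


# A result exactly at the significance threshold has 50% observed power.
print(round(observed_power(z_from_p(0.025)), 3))  # 0.5
```

Because sampling error moves the observed z-score by about one standard unit in either direction, a single study with 50% observed power is compatible with a wide range of true power values, which is the point made by Yuan and Maxwell (2005).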

This blog post introduces various methods that can be used to estimate power on the basis of a set of studies (meta-analysis). I then present simulation studies that compare the various estimation methods in terms of their ability to estimate true power under a variety of conditions. In this blog post, I examine only unbiased sets of studies. That is, the sample of studies in a meta-analysis is a representative sample from the population of studies with specific characteristics. The first simulation assumes that samples are drawn from a population of studies with fixed effect size and fixed sampling error. As a result, all studies have the same true power (homogeneous). The second simulation assumes that all studies have a fixed effect size, but that sampling error varies across studies. As power is a function of effect size and sampling error, this simulation models heterogeneity in true power. The next simulations assume heterogeneity in population effect sizes. One simulation uses a normal distribution of effect sizes. Importantly, a normal distribution has no influence on the mean because effect sizes are symmetrically distributed around the mean effect size. The next simulations use skewed normal distributions. This simulation provides a realistic scenario for meta-analysis of heterogeneous sets of studies such as a meta-analysis of articles in a specific journal or articles on different topics published by the same author.

Observed Power Estimation Method 1: The Percentage of Significant Results

The simplest method to determine observed power is to compute the percentage of significant results. As power is defined as the long-run percentage of significant results, the percentage of significant results in a set of studies is an unbiased estimate of the long-run percentage. The main limitation of this method is that the dichotomous measure (significant versus non-significant) is likely to be imprecise when the number of studies is small. For example, two studies can only show observed power values of 0%, 50%, or 100%, even if true power were 75%. However, the percentage of significant results plays an important role in bias tests that examine whether a set of studies is representative. When researchers hide non-significant results or use questionable research methods to produce significant results, the percentage of significant results will be higher than the percentage of significant results that could have been obtained on the basis of the actual power to produce significant results.
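The imprecision of this method can be illustrated with a short Python sketch. The p-values below are made up for illustration, and the function names are hypothetical; the standard error is the usual binomial formula for a proportion.

```python
from math import sqrt


def success_rate(p_values_one_tailed, alpha_one_tailed=0.025):
    """Proportion of studies with a significant result."""
    hits = sum(p < alpha_one_tailed for p in p_values_one_tailed)
    return hits / len(p_values_one_tailed)


def success_rate_se(true_power, k):
    """Binomial standard error of the success rate in k studies."""
    return sqrt(true_power * (1 - true_power) / k)


# Ten made-up one-tailed p-values (illustration only).
p_values = [0.001, 0.020, 0.030, 0.200, 0.004, 0.011, 0.060, 0.024, 0.400, 0.002]
print(success_rate(p_values))                # 0.6
print(round(success_rate_se(0.5, k=10), 3))  # 0.158
```

With 10 studies and 50% true power, the standard error is about 16 percentage points, so the success rate alone gives only a rough estimate of power in small sets of studies.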

Observed Power Estimation Method 2: The Median

Schimmack (2012) proposed to average observed power of individual studies to estimate observed power. Yuan and Maxwell (2005) demonstrated that the average of observed power is a biased estimator of true power. It overestimates power when power is less than 50% and it underestimates true power when power is above 50%. Although the bias is not large (no more than 10 percentage points), Yuan and Maxwell (2005) proposed a method that produces an unbiased estimate of power in a meta-analysis of studies with the same true power (exact replication studies). Unlike the average that is sensitive to skewed distributions, the median provides an unbiased estimate of true power because sampling error is equally likely (50:50 probability) to inflate or deflate the observed power estimate. To avoid the bias of averaging observed power, Schimmack (2014) used median observed power to estimate the replicability of a set of studies.
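A minimal Python sketch of the median method, under the same normal approximation (one-tailed criterion of .025). The z-scores are made up and the function names are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()


def observed_power(z, alpha_one_tailed=0.025):
    """Observed power of one study, given its z-score."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_tailed)
    return STD_NORMAL.cdf(z - z_crit)


def median_observed_power(z_scores, alpha_one_tailed=0.025):
    """Median of the observed power values of a set of studies."""
    return median([observed_power(z, alpha_one_tailed) for z in z_scores])


z_scores = [0.5, 1.2, 2.0, 2.4, 3.1]  # made-up values for illustration
print(round(median_observed_power(z_scores), 3))  # 0.516
```

Because observed power is a monotonic function of the z-score, the median of the observed power values equals the observed power of the median z-score. This is why the skew of the power distribution does not affect the median.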

Observed Power Estimation Method 3: P-Curve’s KS Test

Another method is implemented in Simonsohn’s (2014) pcurve. Pcurve was developed to obtain an unbiased estimate of a population effect size from a biased sample of studies. To achieve this goal, it is necessary to determine the power of studies because bias is a function of power. The pcurve estimation uses an iterative approach that tries out different values of true power. For each potential value of true power, it computes the location (quantile) of observed test statistics relative to a potential non-centrality parameter. The best fitting non-centrality parameter is located in the middle of the observed test statistics. Once a non-central distribution has been found, it is possible to assign each observed test-value a cumulative percentile of the non-central distribution. For the actual non-centrality parameter, these percentiles have a uniform distribution. To find the best fitting non-centrality parameter from a set of possible parameters, pcurve tests whether the distribution of observed percentiles follows a uniform distribution using the Kolmogorov-Smirnov test. The non-centrality parameter with the smallest test statistic is then used to estimate true power.
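The search described above can be sketched in Python as a simple grid search. This is a simplified illustration and not the pcurve code itself: it uses z-scores with a normal approximation instead of non-central t or F distributions, it does not condition on significance (the sets of studies examined in this post are unbiased), and the grid step and upper limit are arbitrary assumptions. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ks_distance_to_uniform(values):
    """Kolmogorov-Smirnov D statistic of `values` against Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def ks_power_estimate(z_scores, alpha_one_tailed=0.025, step=0.01, max_ncp=6.0):
    """Pick the non-centrality value whose percentiles look most uniform."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_tailed)
    best_ncp, best_d = 0.0, float("inf")
    for i in range(int(round(max_ncp / step)) + 1):
        ncp = i * step
        # Percentile of each observed z-score under the candidate distribution.
        percentiles = [STD_NORMAL.cdf(z - ncp) for z in z_scores]
        d = ks_distance_to_uniform(percentiles)
        if d < best_d:
            best_ncp, best_d = ncp, d
    return STD_NORMAL.cdf(best_ncp - z_crit)


z_scores = [0.4, 1.1, 1.5, 1.9, 2.0, 2.2, 2.6, 2.9, 3.3, 3.8]  # made-up values
print(round(ks_power_estimate(z_scores), 2))
```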

Observed Power Estimation Method 4: P-Uniform

van Assen, van Aert, and Wicherts (2014) developed another method to estimate observed power. Their method is based on the use of the gamma distribution. Like the pcurve method, this method relies on the fact that observed test-statistics should follow a uniform distribution when a potential non-centrality parameter matches the true non-centrality parameter. P-uniform transforms the probabilities given a potential non-centrality parameter with a negative log-function (-log[x]). These values are summed. When probabilities form a uniform distribution, the sum of the log-transformed probabilities matches the number of studies. Thus, the value with the smallest absolute discrepancy between the sum of negative log-transformed percentages and the number of studies provides the estimate of observed power.
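The matching rule described above (the sum of the negative log-transformed percentiles should equal the number of studies) can be sketched as follows. This Python snippet is an illustration under simplifying assumptions, not the p-uniform implementation: it uses z-scores with a normal approximation, does not condition on significance, and uses an arbitrary search grid. The function name is hypothetical.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()


def neglog_sum_power_estimate(z_scores, alpha_one_tailed=0.025,
                              step=0.01, max_ncp=6.0):
    """Pick the non-centrality value where sum(-log(percentile)) is closest to k."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_tailed)
    k = len(z_scores)
    best_ncp, best_gap = 0.0, float("inf")
    for i in range(int(round(max_ncp / step)) + 1):
        ncp = i * step
        # Guard against log(0) for z-scores far below the candidate value.
        total = sum(-log(max(STD_NORMAL.cdf(z - ncp), 1e-300)) for z in z_scores)
        gap = abs(total - k)
        if gap < best_gap:
            best_ncp, best_gap = ncp, gap
    return STD_NORMAL.cdf(best_ncp - z_crit)


z_scores = [0.4, 1.1, 1.5, 1.9, 2.0, 2.2, 2.6, 2.9, 3.3, 3.8]  # made-up values
print(round(neglog_sum_power_estimate(z_scores), 2))
```

The rule works because the negative log of a uniform variable has an expected value of 1, so k uniform percentiles are expected to sum to k. The negative log transformation gives large weight to very small percentiles, so a few unusually low test statistics can pull the estimate down.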

Observed Power Estimation Method 5: Averaging Standard Normal Non-Centrality Parameter

In addition to these existing methods, I introduce two novel estimation methods. The first new method converts observed test statistics into one-sided p-values. These p-values are then transformed into z-scores. This approach has a long tradition in meta-analysis that was developed by Stouffer et al. (1949). It was popularized by Rosenthal during the early days of meta-analysis (Rosenthal, 1979). Transformation of probabilities into z-scores makes it easy to aggregate probabilities because z-scores follow a symmetrical distribution. The average of these z-scores can be used as an estimate of the actual non-centrality parameter. The average z-score can then be used to estimate true power. This approach avoids the problem that arises when power estimates are averaged, namely that observed power has a skewed distribution. Thus, it should provide an unbiased estimate of true power when power is homogeneous across studies.
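A minimal Python sketch of this method, assuming one-tailed p-values strictly between 0 and 1 and a one-tailed criterion of .025. The cap on extreme z-scores at 5 mirrors the one described in the simulation section; the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mean_z_power(p_values_one_tailed, alpha_one_tailed=0.025, z_cap=5.0):
    """Estimate power from the average z-score of a set of studies."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_tailed)
    # Convert p-values to z-scores and cap extreme values.
    z_scores = [min(STD_NORMAL.inv_cdf(1 - p), z_cap) for p in p_values_one_tailed]
    mean_z = sum(z_scores) / len(z_scores)  # estimate of the non-centrality
    return STD_NORMAL.cdf(mean_z - z_crit)


# Three studies that are all exactly at the threshold imply 50% power.
print(round(mean_z_power([0.025, 0.025, 0.025]), 3))  # 0.5
```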

Observed Power Estimation Method 6: Yuan-Maxwell Correction of Average Observed Power

Yuan and Maxwell (2005) demonstrated that a simple average of observed power is systematically biased. However, a simple average avoids the problems of transforming the data and can produce tighter estimates than the median method. Therefore I explored whether it is possible to apply a correction to the simple average. The correction is based on Yuan and Maxwell’s (2005) mathematically derived formula for systematic bias. After averaging observed power, Yuan and Maxwell’s formula for bias is used to correct the estimate for systematic bias. The only problem with this approach is that bias is a function of true power. However, as observed power becomes an increasingly good estimator of true power in the long run, the bias correction will also become increasingly better at correcting the right amount of bias.

The Yuan-Maxwell correction approach is particularly promising for meta-analysis of heterogeneous sets of studies such as sets of diverse studies in a journal. The main advantage of this method is that averaging of power makes no assumptions about the distribution of power across different studies (Schimmack, 2012). The main limitation of averaging power was the systematic bias, but Yuan and Maxwell’s formula makes it possible to reduce this systematic bias, while maintaining the advantage of having a method that can be applied to heterogeneous sets of studies.
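The logic of the correction can be sketched in Python. The bias expression used here is an assumption for illustration and may differ from the formula in Yuan and Maxwell (2005): under a normal approximation with a one-tailed test, if the observed z-score is normally distributed around the true non-centrality with unit variance, the expected observed power equals Φ(Φ⁻¹(power) / √2). The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_bias(true_power):
    """Expected observed power minus true power (normal approximation)."""
    shrunk = STD_NORMAL.inv_cdf(true_power) / sqrt(2)
    return STD_NORMAL.cdf(shrunk) - true_power


def corrected_mean_power(observed_powers):
    """Average observed power, then subtract the bias evaluated at that average."""
    avg = sum(observed_powers) / len(observed_powers)
    avg = min(max(avg, 1e-6), 1 - 1e-6)  # keep inv_cdf defined
    # The bias depends on true power, which is unknown, so the observed
    # average is plugged in as a stand-in for true power.
    return avg - expected_bias(avg)


for power in (0.2, 0.5, 0.8):
    print(power, round(expected_bias(power), 3))
```

Under this approximation the bias is positive below 50% power, zero at 50%, and negative above 50%, and it stays below 10 percentage points, which matches the pattern described above.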

RESULTS

Homogeneous Effect Sizes and Sample Sizes

The first simulation used 100 effect sizes ranging from .01 to 1.00 and 50 sample sizes ranging from 11 to 60 participants per condition (Ns = 22 to 120), yielding 5,000 different populations of studies. The true power of these studies was determined on the basis of the effect size, sample size, and the criterion p < .025 (one-tailed), which is equivalent to .05 (two-tailed). Sample sizes were chosen so that average power across the 5,000 studies was 50%. The simulation drew 10 random samples from each of the 5,000 populations of studies. Each sample of a study simulated a between-subject design with the given population effect size and sample size. The results were stored as one-tailed p-values. For the meta-analysis, p-values were converted into z-scores. To avoid biases due to extreme outliers, z-scores greater than 5 were set to 5 (observed power = .999).
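The design of this simulation can be sketched in Python. This is an approximation and not the original simulation code: it draws z-scores directly from a normal distribution centered on the non-centrality parameter d·√(n/2) instead of simulating raw data and running t-tests. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.025)  # one-tailed criterion of .025


def true_power(d, n_per_group):
    """Approximate power of a two-group comparison (normal approximation)."""
    ncp = d * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(ncp - Z_CRIT)


def simulate_z_scores(d, n_per_group, k=10, z_cap=5.0, rng=None):
    """Draw k observed z-scores for one population of studies, capped at z_cap."""
    rng = rng or random.Random()
    ncp = d * sqrt(n_per_group / 2)
    return [min(rng.gauss(ncp, 1.0), z_cap) for _ in range(k)]


effect_sizes = [i / 100 for i in range(1, 101)]  # .01 to 1.00
sample_sizes = range(11, 61)                     # 11 to 60 per condition
grid = [(d, n) for d in effect_sizes for n in sample_sizes]
print(len(grid))  # 5000

z_scores = simulate_z_scores(0.5, 30, rng=random.Random(1))
```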

The six estimation methods were then used to compute observed power on the basis of samples of 10 studies. The following figures show observed power as a function of true power. The green lines show the 95% confidence interval for different levels of true power. The figure also includes red dashed lines for a value of 50% power. Studies with more than 50% observed power would be significant. Studies with less than 50% observed power would be non-significant. The figures also include a blue line for 80% true power. Cohen (1988) recommended that researchers should aim for a minimum of 80% power. It is instructive to see how accurately the estimation methods evaluate whether a set of studies met this criterion.

The histogram shows the distribution of true power across the 5,000 populations of studies.

The histogram shows that the simulation covers the full range of power. It also shows that high-powered studies are overrepresented because moderate to large effect sizes can achieve high power for a wide range of sample sizes. The distribution is not important for the evaluation of different estimation methods and benefits all estimation methods equally because observed power is a good estimator of true power when true power is close to the maximum (Yuan & Maxwell, 2005).

The next figure shows scatterplots of observed power as a function of true power. Values above the diagonal indicate that observed power overestimates true power. Values below the diagonal show that observed power underestimates true power.

Visual inspection of the plots suggests that all methods provide unbiased estimates of true power. Another observation is that the count of significant results provides the least accurate estimates of true power. The reason is simply that aggregation of dichotomous variables requires a large number of observations to approximate true power. The third observation is that visual inspection provides little information about the relative accuracy of the other methods. Finally, the plots show how accurate observed power estimates are in meta-analysis of 10 studies. When true power is 50%, estimates very rarely exceed 80%. Similarly, when true power is above 80%, observed power is never below 50%. Thus, observed power can be used to examine whether a set of studies met Cohen’s recommended guidelines to conduct studies with a minimum of 80% power. If observed power is 50%, it is nearly certain that the studies did not have the recommended 80% power.

To examine the relative accuracy of different estimation methods quantitatively, I computed bias scores (observed power – true power). As bias can overestimate and underestimate true power, the standard deviation of these bias scores can be used to quantify the precision of various estimation methods. In addition, I present the mean to examine whether a method has large sample accuracy (i.e. the bias approaches zero as the number of simulations increases). I also present the percentage of studies with no more than 20 percentage points of bias. Although a bias of 20 percentage points may seem large, it is not important to estimate power with very high precision. When observed power is below 50%, it suggests that a set of studies was underpowered even if the observed power estimate is an underestimation.
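These three summary statistics can be computed with a few lines of Python. The estimates below are made up for illustration, and the function name is hypothetical.

```python
from statistics import mean, stdev


def summarize_bias(estimates, true_powers, tolerance=0.20):
    """Mean bias, SD of bias, and percentage of estimates within the tolerance."""
    bias = [e - t for e, t in zip(estimates, true_powers)]
    within = sum(abs(b) <= tolerance for b in bias) / len(bias)
    return {
        "mean_bias": mean(bias),      # systematic over- or underestimation
        "sd_bias": stdev(bias),       # precision of the method
        "pct_within": 100 * within,   # estimates within 20 percentage points
    }


# Made-up power estimates for three sets of studies with 50% true power.
summary = summarize_bias([0.55, 0.45, 0.90], [0.50, 0.50, 0.50])
print(summary)
```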

The quantitative analysis also shows no meaningful differences among the estimation methods. The more interesting question is how these methods perform under more challenging conditions when the set of studies are no longer exact replication studies with fixed power.

Homogeneous Effect Size, Heterogeneous Sample Sizes

The next simulation simulated variation in sample sizes. For each population of studies, sample sizes were varied by multiplying a particular sample size by factors of 1 to 5.5 (1.0, 1.5, 2.0, …, 5.5). Thus, a base sample size of 40 created a range of sample sizes from 40 to 220. A base sample size of 100 created a range of sample sizes from 100 to 550. As variation in sample sizes increases the average sample size, the range of effect sizes was limited to a range from .004 to .4 and effect sizes were increased in steps of d = .004. The histogram shows the distribution of power in the 5,000 populations of studies.
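A short Python sketch shows how the multiplication factors produce heterogeneity in true power within one population of studies. It assumes that the sample size refers to participants per condition and uses a normal approximation for power; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.025)
FACTORS = [1.0 + 0.5 * i for i in range(10)]  # 1.0, 1.5, ..., 5.5


def sample_sizes(base_n):
    """Sample sizes of the 10 studies for a given base sample size."""
    return [base_n * f for f in FACTORS]


def true_powers(d, base_n):
    """True power of each study, assuming n is the size per condition."""
    return [STD_NORMAL.cdf(d * sqrt(n / 2) - Z_CRIT) for n in sample_sizes(base_n)]


print(sample_sizes(40)[0], sample_sizes(40)[-1])  # 40.0 220.0
powers = true_powers(0.2, 40)
print(round(min(powers), 2), round(max(powers), 2), round(mean(powers), 2))
```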

The simulation covers the full range of true power, although studies with low and very high power are overrepresented.

The results are visually not distinguishable from those in the previous simulation.

The quantitative comparison of the estimation methods also shows very similar results.

In sum, all methods perform well even when true power varies as a function of variation in sample sizes. This conclusion may not generalize to more extreme simulations of variation in sample sizes, but more extreme variations in sample sizes would further increase the average power of a set of studies because the average sample size would increase as well. Thus, variation in effect sizes poses a more realistic challenge for the different estimation methods.

Heterogeneous, Normally Distributed Effect Sizes

The next simulation used a random normal distribution of true effect sizes. Effect sizes were simulated to have a reasonable but large variation. Starting effect sizes ranged from .208 to 1.000 and increased in increments of .008. Sample sizes ranged from 10 to 60 and increased in increments of 2 to create 5,000 populations of studies. For each population of studies, effect sizes were sampled randomly from a normal distribution with a standard deviation of SD = .2. Extreme effect sizes below d = -.05 were set to -.05 and extreme effect sizes above d = 1.20 were set to 1.20. The first histogram of effect sizes shows the 50,000 population effect sizes. The histogram on the right shows the distribution of true power for the 5,000 sets of 10 studies.
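The sampling of heterogeneous effect sizes can be sketched as follows. This Python snippet is an illustration rather than the original simulation code; the function name and the seed are arbitrary, while the standard deviation and the limits are taken from the description above.

```python
import random


def draw_effect_sizes(mean_d, k=10, sd=0.2, low=-0.05, high=1.20, rng=None):
    """Draw k population effect sizes and set extreme values to the limits."""
    rng = rng or random.Random()
    return [min(max(rng.gauss(mean_d, sd), low), high) for _ in range(k)]


# One set of 10 population effect sizes around a starting effect size of .208.
effect_sizes = draw_effect_sizes(0.208, rng=random.Random(42))
print([round(d, 2) for d in effect_sizes])
```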

The plots of observed and true power show that the estimation methods continue to perform rather well even when population effect sizes are heterogeneous and normally distributed.

The quantitative comparison suggests that puniform has some problems with heterogeneity. More detailed studies are needed to examine whether this is a persistent problem for puniform, but given the good performance of the other methods it seems easier to use these methods.

Heterogeneous, Skewed Normal Effect Sizes

The next simulation puts the estimation methods to a stronger challenge by introducing skewed distributions of population effect sizes. For example, a set of studies may contain mostly small to moderate effect sizes, but a few studies examined large effect sizes. To simulate skewed effect size distributions, I used the rsnorm function of the fGarch package. The function creates a random distribution with a specified mean, standard deviation, and skew. I set the mean to d = .2, the standard deviation to SD = .2, and skew to 2. The histograms show the distribution of effect sizes and the distribution of true power for the 5,000 sets of studies (k = 10).

This time the results show differences among the estimation methods in their ability to deal with skewed heterogeneity. The percentage of significant results is unbiased, but is imprecise due to the problem of averaging dichotomous variables. The other methods show systematic deviations from the 95% confidence interval around the true parameter. Visual inspection suggests that the Yuan-Maxwell correction method has the best fit.

This impression is confirmed in quantitative analyses of bias. The quantitative comparison confirms major problems with the puniform estimation method. It also shows that the median, p-curve, and the average z-score method have the same slight positive bias. Only the Yuan-Maxwell corrected average power shows little systematic bias.

To examine biases in more detail, the following graphs plot bias as a function of true power. These plots can reveal that a method may have little average bias, but has different types of bias for different levels of power. The results show little evidence of systematic bias for the Yuan-Maxwell corrected average of power.

The following analyses examined bias separately for simulations with less than or more than 50% true power. The results confirm that all methods except the Yuan-Maxwell correction underestimate power when true power is below 50%. In contrast, most estimation methods overestimate true power when true power is above 50%. The exception is p-uniform, which still underestimated true power. More research needs to be done to understand the strange performance of p-uniform in this simulation. However, even if p-uniform could perform better, it is likely to be biased with skewed distributions of effect sizes because it assumes a fixed population effect size.

Conclusion

This investigation introduced and compared different methods to estimate true power for a set of studies. All estimation methods performed well when a set of studies had the same true power (exact replication studies), when effect sizes were homogeneous and sample sizes varied, and when effect sizes were normally distributed and sample sizes were fixed. However, most estimation methods were systematically biased when the distribution of effect sizes was skewed. In this situation, most methods run into problems because the percentage of significant results is a function of the power of individual studies rather than the average power.
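A small worked example in Python illustrates why skew matters. The numbers are made up: nine studies with d = .2 and one study with d = 1.0, each with 50 participants per condition, and power is computed with a normal approximation. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.025)


def true_power(d, n_per_group):
    """Approximate power of a two-group comparison (normal approximation)."""
    return STD_NORMAL.cdf(d * sqrt(n_per_group / 2) - Z_CRIT)


# Skewed set: nine small effects and one large effect (made-up values).
effect_sizes = [0.2] * 9 + [1.0]
powers = [true_power(d, 50) for d in effect_sizes]

mean_power = mean(powers)                             # expected success rate
power_at_mean_d = true_power(mean(effect_sizes), 50)  # power of the "average study"
median_power = median(powers)

print(round(mean_power, 2), round(power_at_mean_d, 2), round(median_power, 2))
# 0.25 0.29 0.17
```

The expected percentage of significant results equals the mean of the individual power values (about 25%). A method that locates a single typical study, whether through the average effect size or the median, targets a different quantity and therefore misses the mean when the distribution is skewed.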

The results of these analyses suggest that the R-Index (Schimmack, 2014) can be improved by simply averaging power and then applying the Yuan-Maxwell correction. However, it is important to realize that the median method tends to overestimate power when power is greater than 50%. This makes it even more difficult for the R-Index to produce an estimate of low power when power is actually high. The next step in the investigation of observed power is to examine how different methods perform in unrepresentative (biased) sets of studies. In this case, the percentage of significant results is highly misleading. For example, Sterling et al. (1995) found success rates of 95%, which would suggest that studies had 95% power. However, publication bias and questionable research practices create a bias in the sample of studies that are being published in journals. The question is whether other observed power estimates can reveal bias and can produce accurate estimates of the true power in a set of studies.