All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Who is Your Daddy? Priming women with a disengaged father increases their willingness to have sex without a condom


In a five-study article, Danielle J. DelPriore and Sarah E. Hill from Texas Christian University examined the influence of a disengaged father on daughters’ sexual attitudes and behaviors.

It is difficult to study the determinants of sexual behavior in humans because it is neither practical nor ethical to randomly assign daughters to engaged and distant fathers to see how this influences daughters’ sexual attitudes and behaviors.

Experimental social psychologists believe that they have found a solution to this problem.  Rather than exposing individuals to the actual experiences in the real world, it is possible to expose individuals to stimuli or stories related to these events.  These studies are called priming studies.  The assumption is that priming individuals has the same effect as experiencing these events.  For example, a daughter with a loving and caring father will respond like a daughter with a distant father if she is randomly assigned to a condition with a parental disengagement prime.

This article reports five priming studies that examined how thinking about a distant father influences daughters’ sexual attitudes.

Study 1 (N = 75 female students)

Participants in the paternal disengagement condition read the following instructions:

Take a few seconds to think back to a time when your biological father was absent for an important life event when you really needed him . . .. Describe in detail how your father’s lack of support—or his physical or psychological absence—made you feel.

Participants in the paternal engagement condition were asked to describe a time their father was physically or psychologically present for an important event.

The dependent variable was a word-stem completion task with word stems that could be completed with either sexual or neutral words (e.g., s_x as sex or six; _aked as naked or baked).

Participants primed with a disengaged father completed more word-stems in a sexual manner (M = 4.51, SD = 2.06) than participants primed with an engaged father (M = 3.63, SD = 1.50), F(1,73) = 4.51, p = .037, d = .49.

Study 2 (N = 52 female students)

Study 2 used the same priming manipulation as Study 1, but measured sexual permissiveness with the Sociosexual Orientation Inventory (SOI; Simpson & Gangestad, 1991).  Example items are “sex without love is OK,” and “I can imagine myself being comfortable and enjoying casual sex with different partners.”

Participants who thought about a disengaged father had higher sexual permissiveness scores (M = 2.57, SD = 1.88) than those who thought about an engaged father (M = 1.86, SD = 0.94), F(1,62) = 3.91, p = .052, d = .48.

Study 3 (N = 82 female students)

Study 3 changed the control condition from an engaged father to a disengaged or disappointing friend.  It is not clear why this condition was not included as a third condition in Study 2 but was instead run as a separate experiment. The study showed that participants who thought about a disengaged dad scored higher on the sexual permissiveness scale (M = 2.90, SD = 2.25) than participants who thought about a disappointing friend (M = 2.09, SD = 1.19), F(1,80) = 4.24, p = .043, d = .45.

Study 4 (N = 62 female students)

Study 4 used maternal disengagement as the control condition. Again, it is not clear why the researchers did not run one study with four conditions (disengaged father, engaged father, disappointing friend, disengaged mother).

Participants who thought about a disengaged dad had higher scores on the sexual permissiveness scale (M = 2.85, SD = 1.84) than participants who thought about a disengaged mother (M = 1.87, SD = 1.16), F(1, 60) = 6.03, p = .017, d = .64.

Study 5 (N = 85 female students & 92 male students)

Study 5 could have gone in many directions, but it included women and men as participants and used disappointing friends as the control condition (why not use engaged and disengaged mothers/fathers in a 2 x 2 design to see how gender influences parent-child relationships?).  Even more disappointing, the only reported (!) dependent variable was attitudes towards condoms. Why was the sexual attitude measure dropped from Study 5?

The results combined male and female participants who thought about a disengaged dad or a disengaged friend.  Participants reported more negative attitudes towards condoms after thinking about a disengaged dad (M ~ 3.4, based on Figure) than participants who thought about a disengaged friend (M ~ 2.9, based on Figure), F(1,172) = 5.10, p = .025, d = 0.33.  The interaction with gender was not significant, p = .58, but the effect of the manipulation on attitudes towards condoms was marginally significant in an analysis limited to women (M = 3.07, SD = 1.30 vs. M = 2.51, SD = 1.35), F(1, 172) = 3.76, p = .054, d = 0.42.  Although the interaction was not significant, the authors conclude in the general discussion section that “the effects of primed paternal disengagement on sexual risk were also found to be stronger for women than for men (Experiment 5)” (p. 242).
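The reported women-only effect size can be checked from the condition means and standard deviations alone.  The following minimal Python sketch (my own reconstruction, not code from the article; it assumes roughly equal group sizes) computes Cohen's d from the numbers above.

```python
# Hedged check of the women-only effect size reported in Study 5.
import numpy as np

m_dad, sd_dad = 3.07, 1.30        # disengaged-father condition
m_friend, sd_friend = 2.51, 1.35  # disengaged-friend condition

pooled_sd = np.sqrt((sd_dad**2 + sd_friend**2) / 2)  # equal-n approximation
d = (m_dad - m_friend) / pooled_sd

print(round(d, 2))  # ~0.42, matching the reported d = 0.42
```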

CONCLUSION

Based on this set of five studies, the authors conclude that “the results of the current research provide the first experimental support for PIT [Parental Investment Theory] by demonstrating a causal relationship between paternal disengagement cues and changes in women’s sexual decision making” (p. 242).

They then propose that “insight gained from this research may help inform interventions aimed at reducing some of the personal and financial costs associated with father absence, including teen pregnancy and STI risk” (p. 242).

Well, assuming these results were credible, they might also be exploited by men interested in having sex without condoms: bringing up a time when their own father was distant and disengaged may prime a date to think about a similar time in her life and make her more willing to engage in unprotected sex.  Of course, women who are aware of this priming effect may not fall for such a cheap trick.

However, before researchers or lay people get too excited about these experimental findings, it is important to examine whether they are even credible findings.  Five successful studies may seem like strong evidence for the robustness of this effect, but unfortunately the reported studies cannot be taken at face value because scientific journals report only successful studies and it is not clear how many failed studies or analyses remain unreported.

To examine the credibility and replicability of these reported findings, I ran statistical tests on the reported results.  These tests suggest that the results are not credible and are unlikely to replicate in independent attempts to reproduce these studies.

N statistic p z OP
75 F(1,73)=4.51 0.037 2.08 0.55
64 F(1,62)=3.91 0.052 1.94 0.49
82 F(1,80)=4.24 0.043 2.03 0.53
62 F(1,60)=6.03 0.017 2.39 0.67
177 F(1,172)=5.10 0.025 2.24 0.61

OP = observed power

The Test of Insufficient Variance (TIVA) shows that the variance of the z-scores is much less than random sampling error would produce, var(z) = 0.03 (expected 1.00), p < .01.   The median observed power is only 55% when the success rate is 100%, showing that the success rate is inflated. The Replicability Index is 55 – (100 – 55) = 10.  This value is below the value that is expected if only significant studies are selected from a set of studies without a real effect (22). A Replicability Index of 10 suggests that other researchers will not be able to replicate the significant results reported in this article.
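The statistics in the table can be recomputed from the reported F-tests alone.  The following Python sketch is my own hedged reconstruction of the R-Index calculation (it assumes scipy and numpy, and uses the standard normal approximation for observed power); it is not the author's original script.

```python
# Convert the five reported F-tests to p-values, z-scores, and observed power,
# then compute the R-Index as median observed power minus the inflation.
from scipy import stats
import numpy as np

f_values = [4.51, 3.91, 4.24, 6.03, 5.10]   # Studies 1-5
df_error = [73, 62, 80, 60, 172]

p_values = [stats.f.sf(f, 1, df) for f, df in zip(f_values, df_error)]
z_scores = [stats.norm.isf(p / 2) for p in p_values]

# Observed power: probability of |z| > 1.96 if the observed z were the true effect.
observed_power = [stats.norm.sf(1.96 - z) for z in z_scores]

median_op = np.median(observed_power)   # ~0.55
success_rate = 1.0                      # all five reported tests were significant
r_index = median_op - (success_rate - median_op)   # ~0.10

print(np.round(z_scores, 2), np.round(observed_power, 2), round(r_index, 2))
```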

In conclusion, this article does not contain credible evidence about the causes of male or female sexuality, and if you did grow up without a father or with a disengaged father, it does not mean that this necessarily influenced your sexual attitudes, preferences, and behaviors.  Answers to these important questions are more likely to come from studies of real family relationships than from priming studies that assume real-world experiences can be simulated in the lab.

Reference

DelPriore, D. J., & Hill, S. E. (2013). The effects of paternal disengagement on women’s sexual decision making: An experimental approach. Journal of Personality and Social Psychology, 105, 234-246. doi:10.1037/a0032784

 


Bayes-Factors Do Not Solve the Credibility Problem in Psychology

Bayesians like to blame p-values and frequentist statistics for the replication crisis in psychology (see, e.g., Wagenmakers et al., 2011).  An alternative view is that the replication crisis is caused by selective reporting of non-significant results (Schimmack, 2012). This bias would influence Frequentist and Bayesian statistics alike and switching from p-values to Bayes-Factors would not solve the replication crisis.  It is difficult to evaluate these competing claims because Bayesian statistics are still used relatively infrequently in research articles.  For example, a search for the term Bayes Factor retrieved only six articles in Psychological Science in the years from 1990 to 2015.

One article made a reference to the use of Bayesian statistics in modeling.  Three articles used Bayes-Factors to test the null-hypothesis. These articles will be examined in a different post, but they are not relevant for the problem of replicating results that appeared to demonstrate effects by rejecting the null-hypothesis.  Only two articles used Bayes-Factors to test whether a predicted effect is present.

Example 1

One article reported Bayes-Factors to claim support for predicted effects in 6 studies (Savani & Rattan, 2012).   The results are summarized in Table 1.

Study Design N Statistic p z OP BF1 BF2
1 BS 48 t(42)=2.29 0.027 2.21 0.60 12.76 2.05
2 BS 46 t(40)=2.57 0.014 2.46 0.69 28.03 2.85
3 BS 67 t(65)=2.25 0.028 2.20 0.59 9.55 1.61
4 BS 61 t(57)=2.85 0.006 2.74 0.78 39.48 6.44
5 BS 146 F(1,140)=6.68 0.011 2.55 0.72 NA 2.95
6 BS 50 t(47)=2.43 0.019 2.35 0.65 16.98 2.66
MA BS 418 t(416)=6.05 0.000 5.92 1.00 NA 1,232,427

MA = meta-analysis, OP = observed power, BF1 = Bayes-Factor reported in article based on half-normal with SD = .5,  BF2 = default Bayes-Factor with Cauchy(0,1)

All 6 studies reported a statistically significant result, p < .05 (two-tailed).  Five of the six studies reported a Bayes-Factor and all Bayes-Factors supported the alternative hypothesis.  Bayes-Factors in the article were based on a half-normal centered at d = .5.  The Bayes-Factors show that the data are much more consistent with this alternative hypothesis than with the null-hypothesis.  I also computed the Bayes-Factor for a Cauchy distribution centered at 0 with a scaling parameter of r = 1 (Wagenmakers et al., 2011).  This alternative hypothesis assumes that there is a 50% probability that the standardized effect size is greater than d = 1.  This extreme alternative hypothesis favors the null-hypothesis when the data show small to moderate effect sizes.  Even this Bayes-Factor consistently favors the alternative hypothesis, but the odds are less impressive.  This result shows that Bayes-Factors have to be interpreted in the context of the specified alternative hypothesis.  The last row shows the results of a meta-analysis. The results of the six studies were combined using Stouffer’s formula sum(z) / sqrt(k). To compute the Bayes-Factor the z-score was converted into a t-value with total N – 2 degrees of freedom. The meta-analysis shows strong support for an effect, z = 5.92, and the Bayes-Factor in favor of the hypothesis is greater than 1 million to 1.
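The Stouffer combination in the last row can be reproduced directly from the study-level z-scores and sample sizes in Table 1.  The following minimal Python sketch (my own, assumed reconstruction of the meta-analytic step described above, not the article's code) combines the z-scores and converts the pooled z into a t-value with N – 2 degrees of freedom.

```python
# Stouffer meta-analysis of the six studies: sum(z) / sqrt(k), then treat the
# pooled z as a t-value with total N - 2 degrees of freedom for the Bayes-Factor.
import numpy as np
from scipy import stats

z_scores = np.array([2.21, 2.46, 2.20, 2.74, 2.55, 2.35])  # Studies 1-6
total_n = 48 + 46 + 67 + 61 + 146 + 50                      # 418 participants

stouffer_z = z_scores.sum() / np.sqrt(len(z_scores))        # ~5.92
p_meta = stats.norm.sf(stouffer_z)                          # one-tailed p

t_meta, df_meta = stouffer_z, total_n - 2
print(round(stouffer_z, 2), p_meta, df_meta)
```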

Thus, frequentist and Bayesian statistics produce converging results. However, both statistical methods assume that the reported statistics are unbiased.  If researchers only present significant results or use questionable research practices that violate statistical assumptions, effect sizes are inflated, which biases p-values and Bayes-Factors alike.  It is therefore necessary to test whether the reported results are biased.  A bias analysis with the Test of Insufficient Variance (TIVA) shows that the data are biased.  TIVA compares the observed variance in z-scores against the expected variance of z-scores due to random sampling error, which is 1.  The observed variance is only Var(z) = 0.04.  A chi-square test shows that the discrepancy between the observed and expected variance would occur rarely by chance alone, p = .001.   Thus, neither p-values nor Bayes-Factors provide a credible test of the hypothesis because the reported results are not credible.
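The TIVA result can be verified from the same six z-scores.  The sketch below is a hedged illustration of how such a variance test can be computed (the chi-square scaling (k − 1)·var(z) is the standard test of a variance against 1 and is my assumption about the implementation, following the description in the paragraph above).

```python
# Test of Insufficient Variance: under unbiased reporting var(z) should be ~1;
# (k - 1) * var(z) is compared against a chi-square distribution with k - 1 df.
import numpy as np
from scipy import stats

z_scores = np.array([2.21, 2.46, 2.20, 2.74, 2.55, 2.35])
k = len(z_scores)

var_z = z_scores.var(ddof=1)                  # ~0.04, far below the expected 1.00
chi2_stat = (k - 1) * var_z
p_tiva = stats.chi2.cdf(chi2_stat, df=k - 1)  # left-tail test for too little variance

print(round(var_z, 2), round(p_tiva, 3))      # ~0.04, ~0.001
```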

Example 2

Kibbe and Leslie (2011) reported the results of a single study that compared infants’ looking times in three experimental conditions.  The authors first reported the results of a traditional Analysis of Variance that showed a significant effect, F(2, 31) = 3.54, p = .041.  They also reported p-values for post-hoc tests that compared the critical experimental condition with the control condition, p = .021.  They then reported the results of a Bayesian contrast analysis that compared the critical experimental condition with the other two conditions.  They report a Bayes-Factor of 7.4 in favor of a difference between means.  The article does not specify the alternative hypothesis that was tested and the website link in the article does not provide readily available information about the prior distribution of the test. In any case, the Bayesian results are consistent with the ANOVA results. As there is only one study, it is impossible to conduct a formal bias test, but studies with p-values close to .05 often do not replicate.

Conclusion

In conclusion, Bayesian statistics are still rarely used to test research hypotheses. Only two articles in the journal Psychological Science have done so.  One article reported six studies and reported high Bayes-Factors in five studies to support theoretical predictions. A bias analysis showed that the results in this article are biased and violate basic assumptions of sampling.  This example suggests that Bayesian statistics do not solve the credibility problem in psychology.  Bayes-Factors can be gamed just like p-values.  In fact, it is even easier to game Bayes-Factors by specifying prior distributions that closely match the observed data in order to report Bayes-Factors that impress reviewers, editors, and readers with limited understanding of Bayesian statistics.  To avoid this problem, Bayesians need to agree on a principled approach to how researchers should specify prior distributions. Moreover, Bayesian statistics are only credible if researchers report all relevant results. Thus, Bayesian statistics need to be accompanied by information about the credibility of the data.

References

Kibbe, M., & Leslie, A. (2011). What Do Infants Remember When They Forget? Location and Identity in 6-Month-Olds’ Memory for Objects. Psychological Science, 22, 1500-1505.

Savani, K., & Rattan, A. (2012). A choice mind-set increases the acceptance and maintenance of wealth inequality. Psychological Science, 23, 796-804.

Schimmack, U. (2012).  The ironic effect of significant results on the credibility of multiple-study articles.  Psychological Methods, 17, 551–566.

Schimmack, U. (2015a).  The test of insufficient variance (TIVA).  Retrieved from https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). Adjustment during army life. Princeton, NJ: Princeton University Press.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., & Van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi [Commentary on Bem (2011)]. Journal of Personality and Social Psychology, 100, 426–432. doi:10.1037/a0022790

The Repression of Selective Publishing: 7 Case Studies of Prominent Social Psychologists

Unofficial contribution to the special issue on the replication crisis in the Psychologische Rundschau

In the fall of 2015, Christoph Klauer contacted me and asked whether I wanted to write a contribution for a special issue of the Psychologische Rundschau on the replication crisis in psychology.  Together with Moritz Heene, I had taken part in a discussion in the DGPs discussion forum, and I was willing to contribute an article.  The contribution was due at the end of March 2016, and Moritz and I submitted our manuscript one week late.  We knew that the article would provoke strong reactions, because we used several personal case examples to show how many social psychologists try to repress the replication crisis.  We were prepared for harsh criticism from reviewers.  But it never came to that.  In a very understanding and even sympathetic email, Christoph Klauer explained why our contribution did not fit into the planned special issue.

Thank you very much for the interesting and readable manuscript. I read it with pleasure and can agree with most of the points and arguments. I believe this whole debate will be good for psychology (and hopefully also for social psychology), even if some colleagues are still struggling with it. In my impression, awareness of the harmfulness of some formerly widespread habits and insight into the importance of replications have clearly increased among very many colleagues over the last two to three years. Unfortunately, for formal reasons the manuscript does not fit well into the planned special issue.  (Christoph Klauer, email, April 14, 2016).

Because we put considerable effort into the article and it is difficult to publish something in German in other journals, we decided to publish our contribution unofficially, that is, without peer review by colleagues. We are open to and grateful for comments and criticism after the fact.  We hope that our contribution leads to further discussion of the replication crisis, especially in social psychology. We believe that our contribution has a simple and clear message: the time of embellished results is over. It is time for psychologists to report their laboratory findings openly and honestly, because embellished results slow down or prevent scientific progress.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

How Credible Is Social Psychology?

Ulrich Schimmack1

Moritz Heene2
1 University of Toronto, Mississauga, Canada

2 Learning Sciences Research Methodologies, Department of Psychology, Ludwig-Maximilians-Universität München

Abstract

A large replication project of 100 studies showed that only 25% of social psychological studies and 50% of cognitive psychological studies could be replicated.  This finding is consistent with evidence that statistical power is often low and that journals only report significant results.  The problem has been known for 60 years and explains the results of the replication project.  We show here how prominent social psychologists have responded to this finding.  Their comments distract from the main problem of publication bias and try to put a positive spin on the result.  We rebut these arguments and call on psychologists to report their research findings openly and honestly.

Keywords: replication crisis, replicability, power

How Credible Is Social Psychology?

In 2011, the credibility of social psychology was called into question by two events.  First, it emerged that the social psychologist Diederik Stapel had fabricated data on a massive scale.  By now, more than 50 of his articles have been retracted (Retraction Watch, 2015).  Then the Journal of Personality and Social Psychology published an article that supposedly showed that extraverted people have extrasensory abilities and that test results can be improved by studying after the test (Bem, 2011).  Soon afterwards, researchers pointed out statistical problems with the reported results, and replication studies failed to reproduce them (Francis, 2012; Galak, LeBoeuf, Nelson, & Simmons, 2012; Schimmack, 2012).  In this case the data were not fabricated; rather, Bem most likely collected and analyzed his data the way many social psychologists have learned to do. This raised the question of how credible other findings in social psychology are (Pashler & Wagenmakers, 2012).

When some researchers failed to replicate the “elderly priming” effect, the Nobel laureate Daniel Kahneman predicted a crisis (Yong, 2012).  In 2015 this crisis arrived.  Under the leadership of Brian Nosek, hundreds of psychologists attempted to replicate 100 findings that had been published in 2008 in three renowned journals (Journal of Experimental Psychology: Learning, Memory, and Cognition; Journal of Personality and Social Psychology; and Psychological Science) (Open Science Collaboration, 2015).  Whereas 97% of the original studies reported a significant result, the success rate in the replication studies was markedly lower at 35%.  There was, however, also a difference between the disciplines: the replication rate for cognitive psychology was 50%, whereas that for social psychology was only 25%.  Since we focus on social psychology in this article, the question is how the 25% replication rate should be interpreted.

Selective Publication of Significant Results

More than 50 years ago, Sterling (1959) already pointed out that the success rate in psychological journals is implausibly high and hypothesized that publication bias is responsible.  Three decades later, Sterling and colleagues showed that the success rate was still above 90% (Sterling et al., 1995).  Their article also made clear that this success rate is inconsistent with estimates of statistical power in psychology.  In the best case, psychologists always have the correct alternative hypothesis (the null hypothesis is always false).  If that is the case, the success rate in a series of studies is determined by statistical power.  This follows from the definition of statistical power as the relative frequency of studies in which the sample effect size leads to a statistically significant result.  If the studies differ in power, the success rate is a function of average power.  Cohen (1962) estimated that social psychological studies have about 50% power to obtain a significant result with an alpha level of 5%.  Sedlmeier and Gigerenzer (1989) replicated this estimate 25 years later, and there is no indication that typical power has increased since then (Schimmack, 2015c).  If the true probability of success is 50% and the reported success rate of published studies is close to 100%, it is clear that publication bias contributes to the high success rate in social psychology.
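To illustrate how power translates into an expected success rate, the following Python sketch computes the power of a typical two-sample design.  The numbers (d = 0.5, 32 participants per cell) are illustrative assumptions and not taken from Cohen (1962); they simply show a configuration that yields roughly 50% power.

```python
# Power of a two-sample t-test with a medium effect and small cells:
# roughly half of all honestly reported studies would fail to reach p < .05.
from scipy import stats
import numpy as np

d, n_per_cell, alpha = 0.5, 32, 0.05
df = 2 * n_per_cell - 2
ncp = d * np.sqrt(n_per_cell / 2)            # noncentrality parameter
t_crit = stats.t.isf(alpha / 2, df)          # two-tailed critical value

power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
print(round(power, 2))                       # ~0.50
```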

Publication bias exists when significant results are published and non-significant results remain unpublished.  The term publication bias, however, does not explain how the selection of significant results comes about.  When Sterling wrote his first article on this in 1959, it was common for an article to report a single study.  In that case it is possible that several researchers run a similar study, but only those researchers who got lucky and observed a significant result submit their findings for publication.  Social psychologists were aware of this problem.  It therefore became customary for an article to report several studies.  Bem (2011), for example, reported 10 studies, 9 of which had a significant result (with alpha = 5%, one-tailed).  It is extremely unlikely that luck repeats itself several times. Luck alone therefore cannot explain the high success rate in Bem’s article and in other multiple-study articles (Schimmack, 2012).  To obtain 6 or more successes when the probability of success is only 50%, researchers have to give luck a helping hand.  There is a range of data-collection and data-analysis practices that artificially increase the probability of success (John, Loewenstein, & Prelec, 2012).  What these questionable practices have in common is that more results are produced than are reported.  Either entire studies are not reported, or only the analyses that produced a significant result are reported.  Some social psychologists have openly admitted that they used these questionable practices in their research (e.g., Inzlicht, 2015).

There is thus a simple explanation for the large discrepancy between the reported success rate in social psychology journals and the low replication rate in the reproducibility project: social psychologists run considerably more statistical tests than are reported in the journals, but only the tests that confirm a hypothesis are reported.  One does not need to be a philosopher of science to see that publication bias is a problem, but at least US American social psychologists have talked themselves into believing that the selection of significant results is not a problem. Bem (2010, p. 5) wrote “Let’s err on the side of discovery,” and this chapter was used in many methods courses to teach graduate students research methods.

Are There Other Explanations?

Ironically, the public reaction of some social psychologists to the results of the replication project can be explained well by psychological theories of repression (see Figure 1).  The word “publication bias” hardly appears in statements by social psychologists, such as the official statement of the German Psychological Society (DGPs).  The uncomfortable truth that the credibility of many findings is in question apparently seems too threatening to confront openly.  Yet confronting it is necessary so that the next generation of social psychologists does not repeat the mistakes of their advisors.  In a series of case studies we point out the flaws in the arguments of social psychologists who apparently do not want to acknowledge selection bias.

 


Figure 1.   Non-significant results are repressed.

Case Study 1: The 25% Success Rate Cannot Be Interpreted (Alison Ledgerwood)

Alison Ledgerwood is a prominent US American social psychologist who has published articles on the credibility of social psychology (Ledgerwood & Sherman, 2012). She also wrote a blog post about the results of the replication project and claims that the replication rate of 36% cannot be interpreted (“36, it turned out, was the answer. It’s just not quite clear what the question was”). Her main argument is that it is not clear how many successful replications one could have expected.  Certainly not 100%.  Perhaps it is more realistic to expect only 25% successful replications for social psychology.  In that case the actual success rate matches the expected success rate perfectly; a one-hundred-percent success. But why should we expect a success rate of 25%? Why not 10%? Then the actual success rate would even be 150% higher than the expected one. That would be even better.  It is, after all, old wisdom that low expectations increase happiness.  For the well-being of social psychologists it therefore makes sense to lower expectations. However, this low expectation is incompatible with the nearly perfect success rate in the journals. Alison Ledgerwood ignores the discrepancy between the public success rate and the implied true success rate in social psychology labs.

Ledgerwood further claims that the replication studies had low power and that one therefore could not expect a high success rate.  She overlooks, however, that many replication studies had larger samples than the original studies, which means that the power of the original studies was on average lower than the power of the replication studies.  It therefore remains unclear how the original studies, with less power, could achieve a much higher success rate.

Case Study 2:  Failed Replications Are Normal (Lisa Feldman Barrett)

In a commentary in the New York Times, Lisa Feldman Barrett wrote that it is normal for a replication study not to replicate an original finding.  The results of the replication project therefore merely show that social psychology checks the credibility of its findings and corrects errors. This argument ignores the fact that selective publication of significant results increases the error rate. Whereas results are reported as if the probability of a false effect is at most 5% (i.e., one expects 5% significant results if the null hypothesis is always true and all results are reported), the true error probability is much higher.  Statistics courses teach that researchers should plan studies so that, if the alternative hypothesis is true, a significant effect is observed with a probability of 80%.  Running studies with 25% power and then reporting only the results that became significant with the help of chance/sampling error is not scientific. The 25% replication rate is therefore not a sign that everything is fine in social psychology.  The colleagues in classic cognitive psychology (not in neuropsychology) at least manage 50%.  Even 50% is not particularly good.  The renowned psychologists Kahneman and Tversky (1971) described a power of 50% as “ridiculously low.” The authors go even further and question the scientific seriousness of researchers who knowingly run studies with less than 50% power (“We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis,” p. 110).

Case Study 3: The True Success Rate Is 68%  (Stroebe & Hewstone)

In a commentary for the Times Higher Education, Stroebe and Hewstone (2015) claim that the 25% success rate is not particularly informative.  At the same time they point out that it is possible to conduct a meta-analysis of the original studies and the replication studies.  This analysis was already carried out in the original Science publication (OSC, 2015) and yields an estimated replication rate of 68%.  Stroebe and Hewstone find this remarkable and interpret it as the better estimate of the replicability of social psychological findings (“In other words, two-thirds of the results could be replicated when evaluated with a simple meta-analysis that was based on both original and replication studies”).  However, it is not possible to combine the success rates of the original studies and the replication studies in this way to estimate the replicability of original studies in social psychology, because the selection bias in the original studies is not corrected; the effect sizes therefore remain inflated, which leads to an overestimation of replicability.  The replication studies have no selection bias because they were conducted to examine the replicability of original studies in psychology.  The replication rate of the replication studies can therefore be interpreted directly as an estimate of replicability.  For social psychology that rate is 25%, not 68%.

Case Study 4:  The Results of the Replication Project Do Not Show That Social Psychology Is Untrustworthy (Official Statement of the DGPs)

Apparently in response to critical articles in the media, the board members of the DGPs felt compelled to publish an official statement.  This statement was criticized by some members of the DGPs, which led to a public, moderated discussion.  The official statement claims that the replication rate of 36% (for cognitive and social psychology) is no reason to question the credibility of psychological research.

“When parts of the media coverage put the number ‘36%’ front and center and use it as evidence of the poor replicability of psychological effects, this does not mean that the results reported in the original studies are wrong or untrustworthy. This is also emphasized by the authors of the SCIENCE article.”  (DGPs, 2015)

It is indeed important to distinguish between two interpretations of a replication study with a non-significant result.  One interpretation is that the replication study shows that an effect does not exist.  Another interpretation is that the replication study shows that the original study provides no, or not enough, evidence for an effect, even if that effect exists.  It is possible that the media and the public interpreted the 36% success rate as meaning that 64% of the original studies provided false evidence for an effect that does not exist.  This interpretation is wrong, because it is impossible to show that an effect does not exist. It is only possible to show that it is very unlikely that an effect of a certain size exists.  For example, the sampling error for the comparison of two means with 1,600 participants is .05 standard deviations (Cohen’s d = .05). If the dependent variable is standardized, the 95% confidence interval around 0 ranges from -.10 to +.10. If the difference between the means falls within this interval, one can conclude that there is at most a weak effect. Because the samples in the replication studies were often too small to rule out weak effects, the results say nothing about the number of false findings in the original studies.  This does not mean, however, that the results of the original studies are credible or trustworthy.  Since many findings were not replicated, it remains unclear whether these effects exist.  The question of how often the original studies correctly predicted the direction of a true mean difference must therefore be distinguished from the replication rate, and the replication rate is 25%, even if further studies with larger samples might achieve a higher success rate.
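A small sketch can verify the numbers in the previous paragraph.  It assumes two equal groups of 800 (1,600 participants in total) and a true effect near zero; under these assumptions the standard error of a standardized mean difference is about .05.

```python
# Standard error and 95% CI of Cohen's d with two groups of 800 participants.
import numpy as np

n1 = n2 = 800                                   # 1,600 participants in total
se_d = np.sqrt(1 / n1 + 1 / n2)                 # SE of d when the true d is ~0
ci_half_width = 1.96 * se_d

print(round(se_d, 3), round(ci_half_width, 2))  # ~0.05, ~0.10
```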

Case Study 5:  The Reduction in the Success Rate Is a Normal Statistical Phenomenon (Klaus Fiedler)

In the DGPs discussion forum, Klaus Fiedler offered a further explanation for the low replication rate.  (The entire discussion can be retrieved at https://dl.dropboxusercontent.com/u/3670118/DGPs_Diskussionsforum.pdf.)

Klaus Fiedler referred in particular to a figure in the Science article that plots the effect sizes of the replication studies as a function of the effect sizes in the original studies.  The figure shows that the effect sizes of the replication studies are on average lower than the effect sizes in the original studies. The article reports that the average effect size decreased from r = .40 (d = 1.10) to r = .20 (d = .42).  The published effect sizes therefore overestimate the true effect sizes by more than 100%.

Klaus Fiedler claims that this is not an empirical phenomenon but merely reflects the well-known statistical phenomenon of regression to the mean (“On a-priori-grounds, to the extent that the reliability of the original results is less than perfect, it can be expected that replication studies regress toward weaker effect sizes. This is very common knowledge”).

Klaus Fiedler further claims that effect sizes can shrink even when there is no selection bias.

The only necessary and sufficient condition for regression (to the mean or toward less pronounced values) is a correlation less than zero [Fiedler probably meant to write less than one]. This was nicely explained and proven by Furby (1973). We all “learned” that lesson in the first semester, but regression remains a counter-intuitive thing.

We were surprised to read that regression to the mean can occur even without selection. This would mean that we would only have to measure body weight with imprecise instruments and the average weight would then be lower at the second measurement.  If that were so, we could use the “regression diet” to lose a few kilos.  Unfortunately, this is just wishful thinking, much as Klaus Fiedler wishes that the 25% replication rate were not a problem for social psychology. We looked up Klaus Fiedler’s source and found that Furby’s example and proof explicitly presuppose selection (Furby, 1973, p. 173): “Now let us choose a certain aggression level at Time 1 (any level other than the mean)” (emphasis added by the authors).  Furby (1973) thus shows exactly the opposite of what Mr. Fiedler cited as support for explaining the results solely by regression to the mean.

For the sake of completeness, we refute here once more the claim that a correlation of less than 1 between the effect sizes of the original studies and the replication studies is by itself sufficient to explain the results of the reproducibility project. Let us first stick to the definition of regression to the mean given, for example, by Shepard and Finison (but see also Maraun, 2011, for a comprehensive treatment). The amount of regression to the mean is given by $(1 - r)(M - \mu)$, with r: the correlation between the first and second measurement, µ: the mean effect size in the population, and M: the mean in the selected group, here the average effect size of the original studies. See Shepard and Finison (1983, p. 308): “The term in square brackets, the product of two factors, is the estimated reduction in BP [blood pressure] due to regression.” Is a correlation of less than 1 between the observed effect sizes of the original studies and those from the reproducibility project a necessary and sufficient condition, as Mr. Fiedler wrote? Propositional logic gives us the following definitions:

Necessary:

~p -> ~q

where “~” denotes negation.

Applied to the mathematical definition of regression to the mean above, this means:

If r is not less than 1, regression to the mean does not occur. This statement is true, as can be seen from the formula above.

Sufficient:

p -> q

If r is less than 1, regression to the mean occurs. This statement is false, as can again be seen from the formula above. On this point we also wrote in the DGPs forum: “If, for example, r = .80 (i.e., less than one, as Mr. Fiedler presupposed) and the mean of the selected group equals the population mean, i.e., M = µ, for example M = µ = .40, then no regression effect occurs, because (1 – .80)*(.40 – .40) = .20*0 = 0. Consequently, the condition r < 1 is a necessary but not a sufficient condition for regression to the mean. Only if r < 1 and M is not equal to µ does this effect occur.”

Fiedler’s regression argument is therefore in perfect agreement with our explanation of the low success rate in the replication project.  The high success rate in the original studies rests on a selection of significant results that became significant with the help of sampling error.  Without the help of chance, the effect sizes regress to their true mean and the success rate drops. What is astonishing and worrying is only how strong the selection effect is and how low the true success rate in social psychology is.
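The point can be illustrated with a small simulation.  The sketch below uses illustrative parameter values (a true effect of d = 0.2, 30 participants per cell, 100,000 simulated studies); these are assumptions, not values fitted to the OSC data.  Without selection, the average effect size does not shrink from the first to the second wave; with selection for significance, the "published" mean is inflated and the replications regress back toward the true effect.

```python
# Regression toward the true mean occurs only when the originals are selected.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_per_cell, n_studies = 0.2, 30, 100_000
se = np.sqrt(2 / n_per_cell)                      # sampling error of d, two groups of 30

original = rng.normal(true_d, se, n_studies)      # observed effect sizes, first wave
replication = rng.normal(true_d, se, n_studies)   # independent second wave

# Without selection (M = mu), the average effect size does not shrink.
print(round(original.mean(), 2), round(replication.mean(), 2))

# With selection for significance, the published mean is inflated and the
# replications regress back toward the true effect of 0.2.
published = original / se > stats.norm.isf(0.025)
print(round(original[published].mean(), 2), round(replication[published].mean(), 2))
```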

Case Study 6:  The Authors of the Replication Project Were Incompetent (Gilbert)

Recently, Daniel Gilbert and colleagues published a critique of the OSF Science article (Gilbert, King, Pettigrew, & Wilson, 2016).  In the Harvard Gazette, Gilbert claims that the replication project made serious mistakes and that the negative implications for the credibility of social psychology are totally unwarranted (“the OSC made some serious mistakes that make its pessimistic conclusion completely unwarranted”) (Reuell, March 2016).  Gilbert and colleagues present their own calculations and claim that the results are compatible with a true success rate of 100% (“When this error is taken into account, the number of failures in their data is no greater than one would expect if all 100 of the original findings had been true.”).

Gilbert et al. (2016) advance three arguments to call the results of the replication project into question:

The first argument is that the authors analyzed the data incorrectly.  This argument is not convincing, for two reasons.  First, Gilbert et al. avoid mentioning the 25% success rate.  This result requires no deep knowledge of statistical methods.  Counting alone suffices, and the Science article reports the correct success rate of 25% for social psychology.  To distract from this clear result, Gilbert et al. focus their critique on a comparison of the effect sizes in the original studies and the replication studies.  This comparison, however, is not very informative, because the confidence intervals of the original studies are very wide.  If a study reports an effect size of .8 standard deviations and the finding is just barely significant (at alpha = 5%), the 95% confidence interval ranges from a little above zero to 1.6 standard deviations.  Even if the replication study showed an effect of zero, this result would not be significantly different from the result of the original study, because the effect size of the replication study also has sampling error and its confidence interval overlaps with that of the original study. By this method, even a true null result counts as a successful replication of a strong original effect.  This makes no sense, whereas it is entirely reasonable to question an original finding when a replication study cannot reproduce it.  In any case, the comparison of confidence intervals does not change the fact that the success rate shrank from close to 100% to 25%.
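The width of such a confidence interval follows directly from the reported numbers.  The short sketch below uses the illustrative values from the paragraph above (d = .8, a result that is only just significant) and recovers the roughly 0 to 1.6 interval.

```python
# Confidence interval of a "just barely significant" original effect of d = .8.
d_original = 0.8
z_observed = 1.96                      # just significant at alpha = .05, two-tailed
se = d_original / z_observed           # ~0.41

ci_lower = d_original - 1.96 * se      # ~0.0
ci_upper = d_original + 1.96 * se      # ~1.6
print(round(se, 2), round(ci_lower, 2), round(ci_upper, 2))
```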

The second argument is that the replication studies had too little power to test the replicability of the original studies.  As mentioned above, the power of the replication studies was on average higher than the power of the original studies, so the replication studies had a better chance of replicating the original results than the original studies themselves.  The low replication rate of 25% therefore cannot be attributed to insufficient power in the replication studies.  Instead, the high success rate in the original studies can be explained by selection bias.  Gilbert et al., however, avoid mentioning selection bias and explaining how the original studies achieved their significant results.

The third argument carries somewhat more weight.  Gilbert et al. questioned the quality of the replication studies.  First, the OSF authors claimed to have worked closely with the authors of the original studies when planning the replication studies and stated that the original authors approved the replication protocol (“The replication protocol articulated the process of … contacting the original authors for study materials, …  obtaining review of the protocol by the original authors, …”, p. 349).  Gilbert et al. found, however, that some studies were not reviewed by the original authors or that the original authors had concerns.  Gilbert et al. also found some examples in which the replication study was conducted in a different language, which raises questions about the equivalence of the studies.  The question therefore arises whether the different success rates can be attributed to a lack of equivalence.  We examine this question more closely in the next case study.  Overall, however, the arguments of Gilbert et al. are weak. Two arguments are simply wrong, and the difficulty of conducting exact replications in psychology does not mean that the low success rate of 25% can simply be ignored.  Many real findings, such as the anchoring effect, replicate well even when the study is conducted in different countries (Klein et al., 2014). Moreover, the need for strict equivalence of experimental conditions does not increase the credibility of social psychological studies. If these results depend strongly on the experimental conditions, it is not clear under which conditions these findings can be replicated.  Since the participants are often students at a university where a social psychologist happens to work, it remains unclear whether these findings can be replicated at other universities or with participants who are not students.  Even if the 25% success rate underestimates the success rate for strict replications, it remains troubling that it is so hard to reproduce original findings.

Case Study 7:  Replication Studies Are Not Interpretable (Strack)

In a highly cited article, Fritz Strack and Wolfgang Stroebe (2014) questioned the value of replication studies.  The article was published before the results of the OSF replication project were known, but the project was known to the authors.  The authors first question whether social psychology has a replication crisis and conclude that there is not enough evidence to speak of a crisis (“We would argue that such a conclusion is premature,” p. 60).  The replication project has now delivered that evidence. However, Strack and Stroebe claim that these results can be ignored because the researchers made the mistake of replicating the original studies as exactly as possible (the exact opposite of Gilbert’s argument that the studies were too different).

Strack and Stroebe argue that social psychology primarily tests general theories.  But if a general theory is only ever tested under the same conditions, it is unclear whether the theory is really valid (“A finding may be eminently reproducible and yet constitute a poor test of a theory,” p. 60).  That is true, but the problem of social psychology is precisely that original results could not be replicated even under conditions as similar as possible.  And if the replication project had changed the experiments, these changes would presumably have been blamed for the low replication rate (see Case Study 6). This criticism of exact replications is therefore highly illogical.

The authors even confirm this themselves when they point out that replications are valuable when a study yields very new and unexpected findings (“Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework,” p. 61).  The authors cite a famous priming study as an example (“It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper,” p. 61).  And indeed, Bargh et al. (1996) reported results of two identical studies with 30 participants.  Both studies show a significant result. This makes it very unlikely that it was a chance finding. Whereas a single study has an error probability of 5% (1/20), the probability for two studies is much smaller, 0.25% (1/400).  If these results are not limited to the special conditions of Bargh’s lab in the years 1990 to 1995, however, further studies should also show the effect.  But when other scientists did not find the effect, this finding was interpreted as a failure of the replication study (“It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996),” p. 62).  It is, however, equally possible that Bargh did not report all the results of his five-year research program and that selection bias contributed to the significant results reported in the original article.  This possibility is not mentioned by Strack and Stroebe (2014), as if selection bias did not exist.

The repression of selection bias leads to further questionable claims. Strack and Stroebe claim, for instance, that a non-significant result in a replication study must be interpreted as an interaction effect (“In the ongoing discussion, ‘failures to replicate’ are typically taken as a threat to the existence of the phenomenon. Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted”).  This claim is simply wrong, and the authors should know this from their own research.  In the classic 2 x 2 design of social psychology, one can only speak of an interaction when the interaction is statistically significant.  If, by contrast, two groups show a significant difference and two other groups show no significant difference, this can also be a chance event: either the significant difference is a Type I error or the non-significant difference is a Type II error.  It is therefore important to show with a significance test that chance is an unlikely explanation for the divergent results.  In the replication project, however, the differences are often not significant.

Strack and Stroebe’s line of argument would mean that sampling error does not exist and that every mean difference is therefore meaningful.  This line of reasoning leads to the absurd conclusion that there is no sampling error and that social psychological results are 100% correct.  That is true when it comes to sample means, but the real question is whether an experimental manipulation is responsible for the difference or whether the difference is pure chance.  It is therefore not possible to regard the results of original studies as unassailable truths with eternal validity.  Especially when selection bias is large, it is possible that many published findings are not replicable.

Concluding Remarks

Many articles have been written about how the credibility of psychological research can be increased.  We want to make only one suggestion, which is both very simple and very hard to implement.  Psychologists simply have to report all results, significant or non-significant, in journals (Schimmack, 2012).  The selective reporting of success stories is incompatible with the goals of science.  Wishful thinking and erring are human, but especially in the social sciences it is important to minimize these human errors.  The crisis of social psychology shows, however, how hard it is to remain objective when one’s own motives come into play.  It is therefore necessary to create clear rules that reduce the influence of these motives on science.  The most important rule is that scientists cannot pick and choose which results they report.  Bem’s (2011) article on extrasensory perception clearly showed how pointless the scientific method is when it is abused.  We therefore welcome all initiatives that make the research and publication process in psychology more open and transparent.

References

Barrett, L. F. (2015, September 1). Psychology is not in crisis. The New York Times. Retrieved from http://www.nytimes.com/2015/09/01/opinion/psychology-is-not-in-crisis.html

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Cohen, J. (1962). Statistical power of abnormal–social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. doi:10.1037/h0045186

DGPs (2015). Replikationen von Studien sichern Qualität in der Wissenschaft und bringen die Forschung voran.  Retrieved from https://www.dgps.de/index.php?id=143&tx_ttnews%5Btt_news%5D=1630&cHash=6734f2c28f16dbab9de4871525b29a06

Francis, G. (2012b). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9

Fiedler, K. (2015). Retrieved from https://dl.dropboxusercontent.com/u/3670118/DGPs_Diskussionsforum.pdf

Furby, L. (1973). Interpreting regression toward mean in developmental research.  Developmental Psychology, 8, 172-179.

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012).  Correcting the Past: Failures to Replicate Psi.  Journal of Personality and Social Psychology, 103, 933-948.  DOI: 10.1037/a0029709

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351(6277), 1037.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Ledgerwood, A. (2016).  36 is the new 42.  Retrieved from http://incurablynuanced.blogspot.ca/2016/02/36-is-new-42.html

Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and problematic? The rise of the short report in psychological science. Perspectives on Psychological Science, 7, 60–66. doi:10.1177/1745691611427304

Maraun, M. D., Gabriel, S., & Martin, J. (2011). The mythologization of regression towards the mean. Theory & Psychology, 21(6), 762-784. doi: 10.1177/0959354310384910

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). doi:10.1126/science.aac4716

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?  Perspectives on Psychological Science, 7, 528-530.  DOI: 10.1177/1745691612465253

Retraction Watch. (2015).  Diederik Stapel now has 58 retractions.  Retrieved from http://retractionwatch.com/2015/12/08/diederik-stapel-now-has-58-retractions/

Reuell, P. (2016). Study that undercut psych research got it wrong. Harvard Gazette. Retrieved from http://news.harvard.edu/gazette/story/2016/03/study-that-undercut-psych-research-got-it-wrong/

Schimmack, U. (2012).  The ironic effect of significant results on the credibility of multiple-study articles.  Psychological Methods, 17, 551–566.

Schimmack, U. (2015a).  The test of insufficient variance (TIVA).  Retrieved from https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Schimmack, U. (2015b).  Introduction to the Replicability Index.  Retrieved from https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index/

Schimmack, U. (2015c). Replicability Report for Psychological Science.  Abgerufen von https://replicationindex.wordpress.com/2015/08/15/replicability-report-for-psychological-science/

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316. doi:10.1037/0033-2909.105.2.309

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers.  Psychological Bulletin, 76, 105-110.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice-versa. American Statistician, 49, 108–112. doi:10.2307/2684823

Stroebe, W., & Hewstone, M. (2015). What have we learned from the Reproducibility Project?  Times of Higher Educationhttps://www.timeshighereducation.com/opinion/reproducibility-project-what-have-we-learned.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71.

Yong, E. (October, 3, 2012). Nobel laureate challenges psychologists to clean up their act: Social-priming research needs “daisy chain” of replication.  Nature.  Abgerufen von http://www.nature.com/news/nobel-laureate-challenges-psychologists-to-clean-up-their-act-1.11535

Replicability Report No. 1: Is Ego-Depletion a Replicable Effect?

Abstract

It has been a common practice in social psychology to publish only significant results.  As a result, success rates in the published literature do not provide empirical evidence for the existence of a phenomenon.  A recent meta-analysis suggested that ego-depletion is a much weaker effect than the published literature suggests and a registered replication study failed to find any evidence for it.  This article presents the results of a replicability analysis of the ego-depletion literature.  Out of 165 articles with 429 studies (total N  = 33,927),  128 (78%) showed evidence of bias and low replicability (Replicability-Index < 50%).  Closer inspection of the top 10 articles with the strongest evidence against the null-hypothesis revealed some questionable statistical analyses, and only a few articles presented replicable results.  The results of this meta-analysis show that most published findings are not replicable and that the existing literature provides no credible evidence for ego-depletion.  The discussion focuses on the need for a change in research practices and suggests a new direction for research on ego-depletion that can produce conclusive results.

INTRODUCTION

In 1998, Roy F. Baumeister and colleagues published a groundbreaking article titled “Ego Depletion: Is the Active Self a Limited Resource?”   The article stimulated research on the newly minted construct of ego-depletion.  At present, more than 150 articles and over 400 studies with more than 30,000 participants have contributed to the literature on ego-depletion.  In 2010, a meta-analysis of nearly 100 articles, 200 studies, and 10,000 participants concluded that ego-depletion is a real phenomenon with a moderate to strong effect size of six tenths of a standard deviation (Hagger et al., 2010).

In 2011, Roy F. Baumeister and John Tierney published a popular book on ego-depletion titled “Willpower,” and Roy F. Baumeister came to be known as the leading expert on self-regulation and will-power (The Atlantic, 2012).

Everything looked as if ego-depletion research had a bright future, but five years later the outlook is gloomy, and even prominent researchers wonder whether ego-depletion exists at all (Slate, “Everything is Crumbling”, 2016).

An influential psychological theory, borne out in hundreds of experiments, may have just been debunked. How can so many scientists have been so wrong?

What Happened?

It has been known for 60 years that scientific journals tend to publish only successful studies (Sterling, 1959).  That is, when Roy F. Baumeister reported his first ego-depletion study and found that resisting the temptation to eat chocolate cookies led to a decrease in persistence on a difficult task by 17 minutes, the results were published as a groundbreaking discovery.  However, when studies do not produce the predicted outcome, they are not published.  This bias is known as publication bias.  Every researcher knows about publication bias, but the practice is so widespread that it is not considered a serious problem.  Surely, the thinking goes, researchers would not conduct more failed studies than successful studies and report only the successful ones.  Yes, omitting a few studies with weaker effects leads to an inflation of the effect size, but the successful studies still show the general trend.

The publication of one controversial article in the same journal that published the first ego-depletion article challenged this indifferent attitude towards publication bias. In a shocking article, Bem (2011) presented 9 successful studies demonstrating that extraverted students at Cornell University were seemingly able to foresee random events in the future. In Study 1, they seemed to be able to predict where a computer would present an erotic picture even before the computer randomly determined the location of the picture.  Although nearly all of the reported results were significant (one only marginally so), researchers were not convinced that extrasensory perception is a real phenomenon.  Rather, they wondered how credible the evidence in other articles is if it is possible to get 9 significant results for a phenomenon that few researchers believed to be real.  As Sterling (1959) pointed out, a 100% success rate does not provide evidence for a phenomenon if only successful studies are reported. In this case, the success rate is by definition 100% no matter whether an effect is real or not.

In the same year, Simmons et al. (2011) showed how researchers can increase the chances of obtaining significant results without a real effect by using a number of statistical practices that seem harmless, but in combination can increase the chance of a false discovery by more than 1000% (from 5% to 60%).  The use of these questionable research practices (QRPs) has been compared to the use of doping in sports (John et al., 2012).  Researchers who use QRPs are able to produce many successful studies, but the results of these studies cannot be reproduced when other researchers replicate the reported studies without QRPs.  Skeptics wondered whether many discoveries in psychology are as incredible as Bem’s discovery of extrasensory perception: groundbreaking, spectacular, and false.  Is ego-depletion a real effect or is it an artificial product of publication bias and questionable research practices?
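To illustrate how seemingly harmless choices add up, here is a minimal simulation of just two questionable practices (a sketch, not the exact combination analyzed by Simmons et al., 2011): testing two correlated dependent variables and peeking at the data, stopping as soon as any test is significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)

def one_null_experiment(n_start=20, n_max=50, step=10, r=0.5):
    """One experiment with no true effect, analyzed with two QRPs:
    (a) two correlated dependent variables, (b) optional stopping."""
    cov = [[1.0, r], [r, 1.0]]
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n_max)  # condition A
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n_max)  # condition B
    for n in range(n_start, n_max + 1, step):                 # peek repeatedly
        for dv in (0, 1):                                     # try both DVs
            if stats.ttest_ind(a[:n, dv], b[:n, dv]).pvalue < .05:
                return True                                   # report a "success"
    return False

rate = np.mean([one_null_experiment() for _ in range(5000)])
print(f"False-positive rate with two QRPs: {rate:.2f}")  # well above the nominal .05
```

Even these two practices alone push the false-positive rate far above 5%; adding covariates, dropped conditions, and similar choices inflates it further.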

Does Ego-Depletion Depend on Blood Glucose?

The core assumption of ego-depletion theory is that working on an effortful task requires energy and that performance decreases as energy levels decrease.  If this theory is correct, it should be possible to find a physiological correlate of this energy.  Ten years after the inception of ego-depletion theory, Baumeister and colleagues claimed to have found the biological basis of ego-depletion in an article titled “Self-control relies on glucose as a limited energy source” (Gailliot et al., 2007).  The article had a huge impact on ego-depletion researchers, and it became common practice to measure blood glucose levels in ego-depletion studies.

Unfortunately, Baumeister and colleagues had not consulted with physiological psychologists when they developed the idea that brain processes depend on blood-glucose levels.  To maintain vital functions, the human body ensures that the brain is relatively independent of peripheral processes.  A large literature in physiological psychology suggested that inhibiting the impulse to eat delicious chocolate cookies would not lead to a measurable drop in blood glucose levels (Kurzban, 2011).

Let’s look at the numbers. A well-known statistic is that the brain, while only 2% of body weight, consumes 20% of the body’s energy. That sounds like the brain consumes a lot of calories, but if we assume a 2,400 calorie/day diet – only to make the division really easy – that’s 100 calories per hour on average, 20 of which, then, are being used by the brain. Every three minutes, then, the brain – which includes memory systems, the visual system, working memory, then emotion systems, and so on – consumes one (1) calorie. One. Yes, the brain is a greedy organ, but it’s important to keep its greediness in perspective.

But, maybe experts on physiology were just wrong and Baumeister and colleagues made another groundbreaking discovery.  After all, they presented 9 successful studies that appeared to support the glucose theory of will-power, but 9 successful studies alone provide no evidence because it is not clear how these successful studies were produced.

To answer this question, Schimmack (2012) developed a statistical test that provides information about the credibility of a set of successful studies. Experimental researchers try to hold many factors that can influence the results constant (all studies are done in the same laboratory, glucose is measured the same way, etc.).  However, there are always factors that the experimenter cannot control. These random factors make it difficult to predict the exact outcome of a study even if everything goes well and the theory is right.  To minimize the influence of these random factors, researchers need large samples, but social psychologists often use small samples where random factors can have a large influence on results.  As a result, conducting a study is a gamble and some studies will fail even if the theory is correct.  Moreover, the probability of failure increases with the number of attempts.  You may get away with playing Russian roulette once, but you cannot play forever.  Thus, eventually failed studies are expected and a 100% success rate is a sign that failed studies were simply not reported.  Schimmack (2012) was able to use the reported statistics in Gailliot et al. (2007) to demonstrate that it was very likely that the 100% success rate was only achieved by hiding failed studies or with the help of questionable research practices.

Baumeister was a reviewer of Schimmack’s manuscript and confirmed the finding that a success rate of 9 out of 9 studies was not credible.

 “My paper with Gailliot et al. (2007) is used as an illustration here. Of course, I am quite familiar with the process and history of that one. We initially submitted it with more studies, some of which had weaker results. The editor said to delete those. He wanted the paper shorter so as not to use up a lot of journal space with mediocre results. It worked: the resulting paper is shorter and stronger. Does that count as magic? The studies deleted at the editor’s request are not the only story. I am pretty sure there were other studies that did not work. Let us suppose that our hypotheses were correct and that our research was impeccable. Then several of our studies would have failed, simply given the realities of low power and random fluctuations. Is anyone surprised that those studies were not included in the draft we submitted for publication? If we had included them, certainly the editor and reviewers would have criticized them and formed a more negative impression of the paper. Let us suppose that they still thought the work deserved publication (after all, as I said, we are assuming here that the research was impeccable and the hypotheses correct). Do you think the editor would have wanted to include those studies in the published version?”

To summarize, Baumeister defends the practice of hiding failed studies with the argument that this practice is acceptable if the theory is correct.  But we do not know whether the theory is correct without looking at unbiased evidence.  Thus, his line of reasoning does not justify the practice of selectively reporting successful results, which provides biased evidence for the theory.  If we could know whether a theory is correct without data, we would not need empirical tests of the theory.  In conclusion, Baumeister’s response shows a fundamental misunderstanding of the role of empirical data in science.  Empirical results are not mere illustrations of what could happen if a theory were correct. Empirical data are supposed to provide objective evidence that a theory needs to explain.

Since my article was published, there have been several failures to replicate Gailliot et al.’s findings, and recent theoretical articles on ego-depletion no longer assume that blood glucose is the source of ego-depletion.

“Upon closer inspection notable limitations have emerged. Chief among these is the failure to replicate evidence that cognitive exertion actually lowers blood glucose levels.” (Inzlicht, Schmeichel, & Macrae, 2014, p 18).

Thus, the 9 successful studies that were selected for publication in Gailliot et al. (2007) did not illustrate an empirical fact; they created false evidence for a physiological correlate of ego-depletion that could not be replicated.  Precious research resources were wasted on a line of research that could have been avoided by consulting with experts on human physiology and by honestly examining the successful and failed studies that led to the Gailliot et al. (2007) article.

Even Baumeister agrees that the original evidence was false and that glucose is not the biological correlate of ego-depletion.

“In retrospect, even the initial evidence might have gotten a boost in significance from a fortuitous control condition. Hence at present it seems unlikely that ego depletion’s effects are caused by a shortage of glucose in the bloodstream” (Baumeister, 2014, p. 315).

Baumeister fails to mention that the initial evidence also got a boost from selection bias.

In sum, the glucose theory of ego-depletion was based on selective reporting of studies that provided misleading support for the theory and the theory lacks credible empirical support.  The failure of the glucose theory raises questions about the basic ego-depletion effect.  If researchers in this field used selective reporting and questionable research practices, the evidence for the basic effect is also likely to be biased and the effect may be difficult to replicate.

If 200 studies show ego-depletion effects, it must be real?

Psychologists have not ignored publication bias altogether.  The main solution to the problem is to conduct meta-analyses.  A meta-analysis combines information from several small studies to examine whether an effect is real.  The problem for meta-analysis is that publication bias also influences the results of a meta-analysis.  If only successful studies are published, a meta-analysis of published studies will show evidence for an effect no matter whether the effect actually exists or not.  For example, the top journal for meta-analysis, Psychological Bulletin, has published meta-analyses that provide evidence for extrasensory perception (Bem & Honorton, 1994).

To address this problem, meta-analysts have developed a number of statistical tools to detect publication bias.  The most prominent method is Egger’s regression of effect size estimates on sampling error.  A positive correlation can reveal publication bias because studies with larger sampling errors (small samples) require larger effect sizes to achieve statistical significance.  To produce these large effect sizes when the actual effect does not exist or is smaller, researchers need to hide more studies or use more questionable research practices.  As a result, these results are particularly difficult to replicate.
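For illustration, here is a minimal sketch of such a regression test; the effect sizes and standard errors below are hypothetical, not taken from the ego-depletion literature.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical study-level summaries: standardized effect sizes and standard errors
d  = np.array([0.55, 0.61, 0.48, 0.72, 0.35, 0.66, 0.58, 0.44])
se = np.array([0.28, 0.31, 0.22, 0.35, 0.15, 0.33, 0.29, 0.18])

# Egger-style small-study test: regress effect size on standard error,
# weighting each study by its precision.  Without small-study bias the
# slope should be close to zero; a positive slope indicates that noisier
# (smaller) studies report systematically larger effects.
X = sm.add_constant(se)
fit = sm.WLS(d, X, weights=1.0 / se**2).fit()
print(fit.params, fit.pvalues)
```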

Although the use of these statistical methods is state of the art, the original ego-depletion meta-analysis that showed moderate to large effects did not examine the presence of publication bias (Hagger et al., 2010). This omission was corrected in a meta-analysis by Carter and McCullough (2014).

Upon reading Hagger et al. (2010), we realized that their efforts to estimate and account for the possible influence of publication bias and other small-study effects had been less than ideal, given the methods available at the time of its publication (Carter & McCullough, 2014).

The authors then used Egger regression to examine publication bias.  Moreover, they used a new method that was not available at the time of Hagger et al.’s (2010) meta-analysis to estimate the effect size of ego-depletion after correcting for the inflation caused by publication bias.

Not surprisingly, the regression analysis showed clear evidence of publication bias.  More stunning were the results of the effect size estimate after correcting for publication bias.  The bias-corrected effect size estimate was d = .25 with a 95% confidence interval ranging from d = .18 to d = .32.   Thus, even the upper limit of the confidence interval is about 50% less than the effect size estimate in the original meta-analysis without correction for publication bias.   This suggests that publication bias inflated the effect size estimate by 100% or more.  Interestingly, a similar result was obtained in the reproducibility project, where a team of psychologists replicated 100 original studies and found that published effect sizes were over 100% larger than effect sizes in the replication project (OSC, 2015).

An effect size of d = .2 is considered small.  This does not mean that the effect has no practical importance, but it raises questions about the replicability of ego-depletion results.  To obtain replicable results, researchers should plan studies so that they have an 80% chance of obtaining significant results despite the unpredictable influence of random error.  For small effects, this implies that studies require large samples.  For the standard ego-depletion paradigm with an experimental group and a control group and an effect size of d = .2, a total sample size of 788 participants is needed to achieve 80% power. However, the largest sample size in an ego-depletion study was only 501 participants.  A total sample size of about 388 participants is needed to achieve significance without an inflated effect size (i.e., 50% power), and most studies fall short even of this requirement.  Thus, most published ego-depletion results are unlikely to replicate, and future ego-depletion studies are likely to produce non-significant results.
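These sample sizes follow from a standard power calculation for a two-group design; the sketch below uses the statsmodels package (exact numbers may differ by a participant or two depending on the approximation).

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Two-group design, d = .20, alpha = .05 (two-tailed)
n80 = power.solve_power(effect_size=0.20, alpha=0.05, power=0.80)  # per group
n50 = power.solve_power(effect_size=0.20, alpha=0.05, power=0.50)  # per group

print(round(2 * n80))  # roughly 788 participants in total for 80% power
print(round(2 * n50))  # roughly 386 in total for 50% power (the text quotes 388)
```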

In conclusion, even 100 studies with 100% successful results do not provide convincing evidence that ego-depletion exists and which experimental procedures can be used to replicate the basic effect.

Replicability without Publication Bias

In response to concerns about replicability, the Association for Psychological Science created a new format for publications.  A team of researchers can propose a replication project.  The research proposal is peer-reviewed like a grant application.  When the project is approved, researchers conduct the studies and publish the results independent of the outcome of the project.  If the project is successful, the results confirm that the earlier findings, which were reported with publication bias, are replicable, although probably with a smaller effect size.  If the studies fail, the results suggest that the effect may not exist or that the effect size is very small.

In the fall of 2014, Hagger and Chatzisarantis announced a registered replication report (RRR) of an ego-depletion study.

The third RRR will do so using the paradigm developed and published by Sripada, Kessler, and Jonides (2014), which is similar to that used in the original depletion experiments (Baumeister et al., 1998; Muraven et al., 1998), using only computerized versions of tasks to minimize variability across laboratories. By using preregistered replications across multiple laboratories, this RRR will allow for a precise, objective estimate of the size of the ego depletion effect.

In the end, 23 laboratories participated and the combined sample size of all studies was N = 2,141.  This sample size affords an 80% probability of obtaining a significant result (p < .05, two-tailed) with an effect size of d = .12, which is below the lower limit of the confidence interval of the bias-corrected meta-analysis.  Nevertheless, the study failed to produce a statistically significant result, d = .04 with a 95% CI ranging from d = -.07 to d = .14.  Thus, the results are inconsistent with even a small effect size of d = .20 and suggest that ego-depletion may not exist at all.

Ego-depletion researchers have responded to this result differently.  Michael Inzlicht, winner of a theoretical innovation prize for his work on ego-depletion, wrote:

The results of a massive replication effort, involving 24 labs (or 23, depending on how you count) and over 2,000 participants, indicates that short bouts of effortful control had no discernable effects on low-level inhibitory control. This seems to contradict two decades of research on the concept of ego depletion and the resource model of self-control. Like I said: science is brutal.

In contrast, Roy F. Baumeister questioned the outcome of this research project that provided the most comprehensive and scientific test of ego-depletion.  In a response with co-author Kathleen D. Vohs titled “A misguided effort with elusive implications,” Baumeister tries to explain why ego depletion is a real effect, despite the lack of unbiased evidence for it.

The first line of defense is to question the validity of the paradigm that was used for the replication project. The only problem is that this paradigm seemed reasonable to the editors who approved the project, researchers who participated in the project and who expected a positive result, and to Baumeister himself when he was consulted during the planning of the replication project.  In his response, Baumeister reverses his opinion about the paradigm.

In retrospect, the decision to use new, mostly untested procedures for a large replication project was foolish.

He further claims that he proposed several well-tested procedures, but that these procedures were rejected by the replication team for technical reasons.

Baumeister nominated several procedures that have been used in successful studies of ego depletion for years. But none of Baumeister’s suggestions were allowable due to the RRR restrictions that it must be done with only computerized tasks that were culturally and linguistically neutral.

Baumeister and Vohs then claim that the manipulation did not lead to ego-depletion and that it is not surprising that an unsuccessful manipulation does not produce an effect.

Signs indicate the RRR was plagued by manipulation failure — and therefore did not test ego depletion.

They then assure readers that ego-depletion is real because they have demonstrated the effect repeatedly using various experimental tasks.

For two decades we have conducted studies of ego depletion carefully and honestly, following the field’s best practices, and we find the effect over and over (as have many others in fields as far-ranging as finance to health to sports, both in the lab and large-scale field studies). There is too much evidence to dismiss based on the RRR, which after all is ultimately a single study — especially if the manipulation failed to create ego depletion.

This last statement is, however, misleading if not outright deceptive.  As noted earlier, Baumeister admitted to the practice of not publishing disconfirming evidence.  He and I disagree about whether the selective publication of successful studies is honest or dishonest.  He wrote:

 “We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

So, when Baumeister and Vohs assure readers that they conducted ego-depletion research carefully and honestly, they are not saying that they reported all studies that they conducted in their labs.  The successful studies published in articles are not representative of the studies conducted in their labs.

In a response to Baumeister and Vohs, the lead authors of the replication project pointed out that ego-depletion does not exist unless proponents of ego-depletion theory can specify experimental procedures that reliably produce the predicted effect.

The onus is on researchers to develop a clear set of paradigms that reliably evoke depletion in large samples with high power (Hagger & Chatzisarantis, 2016)

In an open email letter, I asked Baumeister and Vohs to name paradigms that could replicate a published ego-depletion effect.  They were not able or willing to name a single paradigm. Roy Baumeister’s response was “In view of your reputation as untrustworthy, dishonest, and otherwise obnoxious, i prefer not to cooperate or collaborate with you.”

I did not request to collaborate with him.  I merely asked which paradigm would be able to produce ego-depletion effects in an open and transparent replication study, given his criticism of the most rigorous replication study that he initially approved.

If an expert who invented a theory and published numerous successful studies cannot name a paradigm that will work, it suggests that he does not know which studies would replicate: for each published successful study there are unpublished, unsuccessful studies that used the same procedure, and it is not obvious which published result would hold up in an honest and transparent replication project.

A New Meta-Analysis of Ego-Depletion Studies:  Are there replicable effects?

Since I published the incredibility index (Schimmack, 2012) and demonstrated bias in research on glucose and ego-depletion, I have developed new and more powerful ways to reveal selection bias and questionable research practices.  I applied these methods to the large literature on ego-depletion to examine whether there are some credible ego-depletion effects and a paradigm that produces replicable effects.

The first method uses powergraphs (Schimmack, 2015) to examine selection bias and the replicability of a set of studies. To create a powergraph, original research results are converted into absolute z-scores.  A z-score shows how much evidence a study result provides against the null-hypothesis that there is no effect.  Unlike effect size measures, z-scores also contain information about the sample size (sampling error).   I therefore distinguish between meta-analysis of effect sizes and meta-analysis of evidence.  Effect size meta-analysis aims to determine the typical, average size of an effect.  Meta-analyses of evidence examine how strong the evidence for an effect (i.e., against the null-hypothesis of no effect) is.

The distribution of absolute z-scores provides important information about selection bias, questionable research practices, and replicability.  Selection bias is revealed if the distribution of z-scores shows a steep drop on the left side of the criterion for statistical significance (this is analogous to the empty space below the line for significance in a funnel plot). Questionable research practices are revealed if z-scores cluster in the area just above the significance criterion.  Replicability is estimated by fitting a weighted composite of several non-central distributions that simulate studies with different non-centrality parameters and sampling error.

A literature search retrieved 165 articles that reported 429 studies.  For each study, the most important statistical test was converted first into a two-tailed p-value and then into a z-score.  A single test statistic was used to ensure that all z-scores are statistically independent.
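For illustration, the conversion can be done as follows (the test result below is hypothetical, not one of the coded studies).

```python
from scipy import stats

def focal_z(p_two_tailed):
    """Convert a two-tailed p-value into an absolute z-score."""
    return stats.norm.isf(p_two_tailed / 2)

# Hypothetical focal test: t(40) = 2.50
p = 2 * stats.t.sf(2.50, df=40)
print(round(p, 3), round(focal_z(p), 2))
```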

[Figure 1.  Powergraph for Ego Depletion (Focal Tests)]

The results show clear evidence of selection bias (Figure 1).  Although there are some results below the significance criterion (z = 1.96, p < .05, two-tailed), most of these results are above z = 1.65, which corresponds to p < .10 (two-tailed) or p < .05 (one-tailed).  These results are typically reported as marginally significant and used as evidence for an effect.   There are hardly any results that fail to confirm a prediction based on ego-depletion theory.  Using z = 1.65 as the criterion, the success rate is 96%, which is in line with the success rates typically reported in psychological journals (Sterling, 1959; Sterling et al., 1995; OSC, 2015).  The steep cliff in the powergraph shows that this success rate is due to selection bias, because random error would have produced a more gradual decline with many more non-significant results.

The next observation is the tall bar just above the significance criterion with z-scores between 2 and 2.2.   This result is most likely due to questionable research practices that lead to just significant results such as optional stopping or selective dropping of outliers.

Another steep drop is observed at z-scores of 2.6.  This drop is likely due to the use of further questionable research practices such as dropping of experimental conditions, use of multiple dependent variables, or simply running multiple studies and selecting only significant results.

A rather large proportion of z-scores are in the questionable range from z = 1.96 to 2.60.  These results are unlikely to replicate. Although some studies may have reported honest results, there are too many questionable results and it is impossible to say which results are trustworthy and which results are not.  It is like getting information from a group of people where 60% are liars and 40% tell the truth.  Even though 40% are telling the truth, the information is useless without knowing who is telling the truth and who is lying.

The best bet to find replicable ego-depletion results is to focus on the largest z-scores, as replicability increases with the strength of evidence (OSC, 2015). The power estimation method uses the distribution of z-scores greater than 2.6 to estimate the average power of these studies.  The estimated power is 47% with a 95% confidence interval ranging from 32% to 63%.  This result suggests that some ego-depletion studies have produced replicable results.  In the next section, I examine which studies these may be.

In sum, a state-of-the-art meta-analysis of evidence for an effect in the ego-depletion literature shows clear evidence for selection bias and the use of questionable research practices.  Many published results are essentially useless because the evidence is not credible.  However, the results also show that some studies produced replicable effects, which is consistent with Carter and McCullough’s finding that the average effect size is likely to be above zero.

What Ego-Depletion Studies Are Most Likely to Replicate?

Powergraphs are useful for large sets of heterogeneous studies.  However, they are not useful to examine the replicability of a single study or small sets of studies, such as a set of studies in a multiple-study article.  For this purpose, I developed two additional tools that detect bias in published results.

The Test of Insufficient Variance (TIVA) requires a minimum of two independent studies.  As z-scores follow a normal distribution (the normal distribution of random error), the variance of z-scores should be 1.  However, if non-significant results are omitted from reported results, the variance shrinks.  TIVA uses the standard comparison of variances to compute the probability that an observed variance of z-scores is an unbiased sample drawn from a normal distribution.  TIVA has been shown to reveal selection bias in Bem’s (2011) article and it is a more powerful test than the incredibility index (Schimmack, 2012).
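A minimal sketch of the test as described above (a simplified illustration, assuming the standard chi-square comparison of a sample variance against 1; the z-scores are hypothetical):

```python
import numpy as np
from scipy import stats

def tiva(z_scores):
    """Test of Insufficient Variance: for an unbiased set of studies with
    the same true power, the variance of z-scores should be about 1;
    selection for significance compresses it."""
    z = np.asarray(z_scores, dtype=float)
    k = len(z)
    var = z.var(ddof=1)
    # left-tailed chi-square test of the observed variance against 1
    p = stats.chi2.cdf((k - 1) * var / 1.0, df=k - 1)
    return var, p

# hypothetical set of just-significant z-scores
print(tiva([2.05, 2.10, 1.99, 2.20, 2.31, 2.02]))
```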

The R-Index is based on the Incredibility Index in that it compares the success rate (percentage of significant results) with the observed statistical power of a test. However, the R-Index does not test the probability of the success rate.  Rather, it uses the observed power to predict the replicability of an exact replication study.  The R-Index has two components. The first component is the median observed power of a set of studies.  In the limit, median observed power approaches the average power of an unbiased set of exact replication studies.  However, when selection bias is present, median observed power is biased and provides an inflated estimate of true power.  The R-Index measures the extent of selection bias by means of the difference between the success rate and median observed power.  If median observed power is 75% and the success rate is 100%, the inflation rate is 25% (100 – 75 = 25).  The inflation rate is subtracted from median observed power to correct for the inflation.  The resulting replication index is not directly an estimate of power, except for the special case when power is 50% and the success rate is 100%.  When power is 50% and the success rate is 100%, median observed power increases to 75%.  In this case, the inflation correction of 25% returns the actual power of 50%.

I emphasize this special case because 50% power is also the critical point at which a rational bet would change from betting against replication (Replicability < 50%) to betting on a successful replication (Replicability > 50%).  Thus, an R-Index of 50% suggests that a study or a set of studies produced a replicable result.  With success rates close to 100%, this criterion implies that median observed power is 75%, which corresponds to a z-score of 2.63.  Incidentally, a z-score of 2.6 also separated questionable results from more credible results in the powergraph analysis above.
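For illustration, here is a minimal sketch of the computation described above (a simplified version, not the complete R-Index software), together with a check of the special case:

```python
import numpy as np
from scipy import stats

def r_index(z_scores, alpha=0.05):
    """R-Index: median observed power minus the inflation
    (success rate minus median observed power)."""
    z = np.abs(np.asarray(z_scores, dtype=float))
    z_crit = stats.norm.isf(alpha / 2)           # 1.96 for alpha = .05
    observed_power = stats.norm.sf(z_crit - z)   # per-study observed power
    median_op = np.median(observed_power)
    success_rate = np.mean(z > z_crit)
    inflation = success_rate - median_op
    return median_op - inflation

# Special case from the text: all results significant, median z about 2.63
print(round(stats.norm.sf(1.96 - 2.63), 2))   # observed power ~ .75
print(round(r_index([2.63, 2.63, 2.63]), 2))  # R-Index ~ .50
```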

It may seem problematic to use the R-Index even for a single study because the observed power of a single study is strongly influenced by random factors and observed power is by definition above 50% for a significant result. However, the R-Index provides a correction for selection bias, and a significant result implies a 100% success rate.  Of course, it could also be an honestly reported result, but if the study was published in a field with evidence of selection bias, the R-Index provides a reasonable correction for publication bias.  To achieve an R-Index above 50%, observed power has to be greater than 75%.

This criterion has been validated with social psychology studies in the reproducibility project, where the R-Index predicted replication success with over 90% accuracy. This criterion also correctly predicted that the ego-depletion replication project would produce fewer than 50% successful replications, which it did, because the R-Index for the original study was way below 50% (F(1,90) = 4.64, p = .034, z = 2.12, OP = .56, R-Index = .12).  If this information had been available during the planning of the RRR, researchers might have opted for a paradigm with a higher chance of a successful replication.
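For illustration, these numbers can be recomputed from the reported test statistic (allowing for small rounding differences):

```python
from scipy import stats

# Focal test of the original study: F(1,90) = 4.64
p  = stats.f.sf(4.64, 1, 90)   # ~ .034 (two-tailed)
z  = stats.norm.isf(p / 2)     # ~ 2.12
op = stats.norm.sf(1.96 - z)   # observed power ~ .56
print(round(p, 3), round(z, 2), round(op, 2))
# With a 100% success rate, R-Index = OP - (1 - OP) = 2*OP - 1 ~ .12 (using OP ~ .56)
```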

To identify paradigms with higher replicability, I computed the R-Index and TIVA (for articles with more than one study) for all 165 articles in the meta-analysis.  For TIVA I used p < .10 as criterion for bias and for the R-Index I used .50 as the criterion.   37 articles (22%) passed this test.  This implies that 128 (78%) showed signs of statistical bias and/or low replicability.  Below I discuss the Top 10 articles with the highest R-Index to identify paradigms that may produce a reliable ego-depletion effect.

1. Robert D. Dvorak and Jeffrey S. Simons (PSPB, 2009) [ID = 142, R-Index > .99]

This article reported a single study with an unusually large sample size for ego-depletion studies. 180 participants were randomly assigned to a standard ego-depletion manipulation. In the control condition, participants watched an amusing video.  In the depletion condition, participants watched the same video, but they were instructed to suppress all feelings and expressions.  The dependent variable was persistence on a set of solvable and unsolvable anagrams.  The t-value in this study suggests strong evidence for an ego-depletion effect, t(178) = 5.91.  The large sample size contributes to this, but the effect size is also large, d = .88.

Interestingly, this study is an exact replication of Study 3 in the seminal ego-depletion article by Baumeister et al. (1998), which obtained a significant effect with just 30 participants and a strong effect size of d = .77, t(28) = 2.12.

The same effect was also reported in a study with 132 smokers (Heckman, Ditre, & Brandon, 2012). Smokers who were not allowed to smoke persisted longer on a figure tracing task when they could watch an emotional video normally than when they had to suppress emotional responses, t(64) = 3.15, d = .78.  The depletion effect was weaker when smokers were allowed to smoke between the video and the figure tracing task. The interaction effect was significant, F(1, 128) = 7.18.

In sum, a set of studies suggests that emotion suppression influences persistence on a subsequent task.  The existing evidence suggests that this is a rather strong effect that can be replicated across laboratories.

2. Megan Oaten, Kipling D. Williams, Andrew Jones, & Lisa Zadro (J Soc Clinical Psy, 2008) [ID = 118, R-Index > .99]

This article reports two studies that manipulated social exclusion (ostracism) under the assumption that social exclusion is ego-depleting. The dependent variable was consumption of an unhealthy food in Study 1 and drinking a healthy, but unpleasant drink in Study 2.  Both studies showed extremely strong effects of ego-depletion (Study 1: d = 2.69, t(71) = 11.48;  Study 2: d = 1.48, t(72) = 6.37).

One concern about these unusually strong effects is the transformation of the dependent variable.  The authors report that they first ranked the data and then assigned z-scores corresponding to the estimated cumulative proportion.  This is an unusual procedure and it is difficult to say whether this procedure inadvertently inflated the effect size of ego-depletion.

Interestingly, one other article used social exclusion as an ego-depletion manipulation (Baumeister et al., 2005).  This article reported six studies and TIVA showed evidence of selection bias, Var(z) = 0.15, p = .02.  Thus, the reported effect sizes in this article are likely to be inflated.  The first two studies used consumption of an unpleasant tasting drink and eating cookies, respectively, as dependent variables. The reported effect sizes were weaker than in the article by Oaten et al. (d = 1.00, d = .90).

In conclusion, there is some evidence that participants avoid displeasure and seek pleasure after social rejection. A replication study with a sufficient sample size may replicate this result with a weaker effect size.  However, even if this effect exists it is not clear that the effect is mediated by ego-depletion.

3. Kathleen D. Vohs & Ronald J. Faber (Journal of Consumer Research) [ID = 29, R-Index > .99]

This article examined the effect of several ego-depletion manipulations on purchasing behavior.  Study 1 found a weaker effect, t(33) = 2.83, than Studies 2 and 3, t(63) = 5.26, t(33) = 5.52, respectively.  One possible explanation is that the latter studies used actual purchasing behavior.  Of these, one used the White Bear paradigm and the other used amplification of emotion expressions as the ego-depletion manipulation.  Although statistically robust, purchasing behavior does not seem to be the best indicator of ego-depletion.  Thus, replication efforts may focus on other dependent variables that measure ego-depletion more directly.

4. Kathleen D. Vohs, Roy F. Baumeister, & Brandon J. Schmeichel (JESP, 2012/2013) [ID = 49, R-Index = .96]

This article was first published in 2012, but the results for Study 1 were misreported and a corrected version was published in 2013.  The article presents two studies with a 2 x 3 between-subject design. Study 1 had n = 13 participants per cell and Study 2 had n = 35 participants per cell.  Both studies showed an interaction between ego-depletion manipulations and manipulations of self-control beliefs. The dependent variables in both studies were the Cognitive Estimation Test and a delay of gratification task.  Results were similar for both dependent measures. I focus on the CET because it provides a more direct test of ego-depletion; that is, the draining of resources.

In the limited-will-power condition of Study 1, the standard ego-depletion effect that compares depleted participants to a control condition was a decrease of about 6 points, from about 30 to 24 points (no exact means, standard deviations, or t-values for this contrast are provided).  The unlimited-will-power condition showed a smaller decrease of about 2 points (31 vs. 29).  Study 2 replicates this pattern. In the limited-will-power condition, CET scores decreased again by 6 points from 32 to 26, and in the unlimited-will-power condition CET scores decreased by about 2 points from about 31 to 29 points.  This interaction effect would again suggest that the standard depletion effect can be reduced by manipulating participants’ beliefs.

One interesting aspect of the study was the demonstration that ego-depletion effects increase with the number of ego-depleting tasks.  Performance on the CET decreased further when participants completed 4 rather than 2, or 3 rather than 1, depleting tasks.  Thus, given the uncertainty about the existence of ego-depletion, it would make sense to start with a strong manipulation that compares a control condition with a condition with multiple ego-depleting tasks.

One concern about this article is the use of the CET as a measure of ego-depletion.  The task was used in only one other study by Schmeichel, Vohs, and Baumeister (2003) with a small sample of N = 37 participants.  The authors reported a just significant effect on the CET, t(35) = 2.18.  However, Vohs et al. (2013) increased the number of items from 8 to 20, which makes the measure more reliable and sensitive to experimental manipulations.

Another limitation of this study is that there was no control condition without manipulation of beliefs. It is possible that the depletion effect in this study was amplified by the limited-will-power manipulation. Thus, a simple replication of this study would not provide clear evidence for ego-depletion.  However, it would be interesting to do a replication study that examines the effect of ego-depletion on the CET without manipulation of beliefs.

In sum, this study could provide the basis for a successful demonstration of ego-depletion by comparing effects on the CET for a control condition versus a condition with multiple ego-depletion tasks.

5. Veronika Job, Carol S. Dweck, and Gregory M. Walton (Psy Science, 2010) [ID = 191, R-Index = .94]

The article by Job et al. (2010) is noteworthy for several reasons.  First, the article presented three close replications of the same effect with high t-values, ts = 3.88, 8.47, 2.62.  Based on these results, one would expect that other researchers can replicate the results.  Second, the effect is an interaction between a depletion manipulation and a subtle manipulation of theories about the effect of working on an effortful task.  Hidden among other questionnaires, participants received either items that suggested depletion (“After a strenuous mental activity your energy is depleted and you must rest to get it refueled again”) or items that suggested energy is unlimited (“Your mental stamina fuels itself; even after strenuous mental exertion you can continue doing more of it”). The pattern of the interaction effect showed that only participants who received the depletion items showed the depletion effect.  Participants who received the unlimited-energy items showed no significant difference in Stroop performance.  Taken at face value, this finding would challenge depletion theory, which assumes that depletion is an involuntary response to exerting effort.

However, the study also raises questions because the authors used an unconventional statistical method to analyze their data.  Data were analyzed with a multi-level model that modeled errors as a function of factors that vary within participants over time and factors that vary between participants, including the experimental manipulations.  In an email exchange, the lead author confirmed that the model did not include random factors for between-subject variance.  A statistician assured the lead author that this was acceptable.  However, a simple computation of the standard deviation around mean accuracy levels would show that this variance is not zero.  Thus, the model artificially inflated the evidence for an effect by treating between-subject variance as within-subject variance. In a between-subject analysis, the small differences in error rates (about 5 percentage points) are unlikely to be significant.

In sum, it is doubtful that a replication study would replicate the interaction between the depletion manipulation and the implicit theory manipulation reported in Job et al. (2010) in an appropriate between-subject analysis.  Even if this result would replicate, it would not support the theory that will-power is a limited resource that is depleted after a short effortful task, because the effect can be undone with a simple manipulation of beliefs in unlimited energy.

6. Roland Imhoff, Alexander F. Schmidt, & Friederike Gerstenberg (Journal of Personality, 2014) [ID = 146, R-Index = .90]

Study 1 reports the results of a standard ego-depletion paradigm with a relatively large sample (N = 123).  The ego-depletion manipulation was a Stroop task with 180 trials.  The dependent variable was consumption of chocolates (M&Ms).  The study reported a large effect, d = .72, and strong evidence for an ego-depletion effect, t(127) = 4.07.  The strong evidence is in part justified by the large sample size, but the standardized effect size seems a bit large for a difference of 2g in consumption, whereas the standard deviation of consumption appears a bit small (3g).  A similar study with M&M consumption as the dependent variable found a 2g difference in the opposite direction with a much larger standard deviation of 16g and no significant effect, t(48) = -0.44.

The second study produced results in line with other ego-depletion studies and did not contribute to the high R-Index of the article, t(101) = 2.59. The third study was a correlational study that examined correlates of a trait measure of ego-depletion.  Even if this correlation is replicable, it does not support the fundamental assumption of ego-depletion theory of situational effects of effort on subsequent effort.  In sum, it is unlikely that Study 1 is replicable, and its strong results may be due to misreported standard deviations.

7. Hugo J.E.M. Alberts, Carolien Martijn, & Nanne K. de Vries (JESP, 2011) [ID = 56, R-Index = .86]

This article reports the results of a single study that crossed an ego-depletion manipulation with a self-awareness priming manipulation (2 x 2 with n = 20 per cell).  The dependent variable was persistence in a hand-grip task.  Like many other handgrip studies, this study assessed handgrip persistence before and after the manipulation, which increases the statistical power to detect depletion effects.

The study found weak evidence for an ego-depletion effect, but relatively strong evidence for an interaction effect, F(1,71) = 13.00.  The conditions without priming showed a weak ego depletion effect (6s difference, d = .25).  The strong interaction effect was due to the priming conditions, where depleted participants showed an increase in persistence by 10s and participants in the control condition showed a decrease in performance by 15s.  Even if this is a replicable finding, it does not support the ego-depletion effect.  The weak evidence for ego depletion with the handgrip task is consistent with a meta-analysis of handgrip studies (Schimmack, 2015).

In short, although this study produced an R-Index above .50, closer inspection of the results shows no strong evidence for ego-depletion.

8. James M. Tyler (Human Communications Research, 2008) [ID = 131, R-Index = .82]

This article reports four studies that show depletion effects after sharing intimate information with strangers.  In the depletion condition, participants were asked to answer 10 private questions in a staged video session that suggested several other people were listening.  This manipulation had strong effects on persistence in an anagram task (Study 1, d = 1.6, F(2,45) = 16.73) and the hand-grip task (Study 2: d = 1.35, F(2,40) = 11.09). Study 3 reversed tasks and showed that the crossing-E task influenced identification of complex non-verbal cues, but not simple non-verbal cues, F(1,24) = 13.44. The effect of the depletion manipulation on complex cues was very large, d = 1.93.  Study 4 crossed the social manipulation of depletion from Studies 1 and 2 with the White Bear suppression manipulation and used identification of non-verbal cues as the dependent variable.  The study showed strong evidence for an interaction effect, F(1,52) = 19.41.  The pattern of this interaction is surprising, because the White Bear suppression task showed no significant effect after not sharing intimate details, t(28) = 1.27, d = .46.  In contrast, the crossing-E task had produced a very strong effect in Study 3, d = 1.93.  The interaction was driven by a strong effect of the White Bear manipulation after sharing intimate details, t(28) = 4.62, d = 1.69.

Even though the statistical results suggest that these results are highly replicable, the small sample sizes and very large effect sizes raise some concerns about replicability.  The large effects cannot be attributed to the ego-depletion tasks or measures that have been used in many other studies that produced much weaker effects. Thus, the only theoretical explanation for these large effect sizes would be that ego depletion has particularly strong effects on social processes.  Even if these effects could be replicated, it is not clear that ego-depletion is the mediating mechanism.  Especially the complex manipulation in the first two studies allows for multiple causal pathways.  It may also be difficult to recreate this manipulation, and a failure to replicate the results could be attributed to problems with reproducibility.  Thus, a replication of this study is unlikely to advance understanding of ego-depletion without first establishing that ego-depletion exists.

9. Brandon J. Schmeichel, Heath A. Demaree, Jennifer L. Robinson, & Jie Pu (Social Cognition, 2006) [ID = 52, R-Index = .80]

This article reported one study with an emotion regulation task. Participants in the depletion condition were instructed to exaggerate their emotional responses to a disgusting film clip.  The study used two tasks to measure ego-depletion.  One task required generation of words; the other task required generation of figures.  The article reports strong evidence in an ANOVA with both dependent variables, F(1,46) = 11.99.  Separate analyses of the means show a stronger effect for the figural task, d = .98, than for the verbal task, d = .50.

The main concern with this study is that the fluency measures were never used in any other study.  If a replication study fails, one could argue that the task is not a valid measure of ego-depletion.  However, the study shows the advantage of using multiple measures to increase statistical power (Schimmack, 2012).

10. Mark Muraven, Marylene Gagne, and Heather Rosman (JESP, 2008) [ID = 15, R-Index = .78]

Study 1 reports the results of a 2 x 2 design with N = 30 participants (~ 7.5 participants per condition).  It crossed an ego-depletion manipulation (resist eating chocolate cookies vs. radishes) with a self-affirmation manipulation.  The dependent variable was the number of errors in a vigilance task (respond to a 4 after a 6).  The results section shows some inconsistencies.  The 2 x 2 ANOVA shows strong evidence for an interaction, F(1,28) = 10.60, but the planned contrast that matches the pattern of means shows only a just significant effect, F(1,28) = 5.18.  Neither of these statistics is consistent with the reported means and standard deviations, where the depleted, not-affirmed group made more than twice as many errors (M = 12.25, SD = 1.63) as the depleted group with affirmation (M = 5.40, SD = 1.34). These results would imply a standardized effect size of d = 4.59.
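The implied effect size can be checked directly from the reported means and standard deviations (a quick check, assuming equal cell sizes and a pooled standard deviation):

```python
import math

# Reported cells: depleted/not-affirmed vs. depleted/affirmed
m1, sd1 = 12.25, 1.63
m2, sd2 = 5.40, 1.34

pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)  # pooled SD for equal cell sizes
d = (m1 - m2) / pooled_sd
print(round(d, 2))  # ~ 4.59, the implausibly large value implied by the reported statistics
```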

Study 2 did not manipulate ego-depletion and reported a more reasonable, but also less impressive result for the self-affirmation manipulation, F(2,63) = 4.67.

Study 3 crossed an ego-depletion manipulation with a pressure manipulation.  The ego-depletion manipulation was a computerized task in which participants in the depletion condition had to type a paragraph without copying the letter E or spaces. This is more difficult than simply copying a paragraph.  The pressure manipulation consisted of constant reminders to avoid making errors and to be as fast as possible.  The sample size was N = 96 (n = 24 per cell).  The dependent variable was the vigilance task from Study 1.  The evidence for a depletion effect was strong, F(1, 92) = 10.72 (z = 3.17).  However, the effect was qualified by the pressure manipulation, F(1,92) = 6.72.  There was a strong depletion effect in the pressure condition, d = .78, t(46) = 2.63, but there was no evidence for a depletion effect in the no-pressure condition, d = -.23, t(46) = 0.78.

The standard deviations in Study 3, which used the same dependent variable, were considerably wider than the standard deviations in Study 1, which explains the larger standardized effect sizes in Study 1.  With the standard deviations of Study 3, Study 1 would not have produced a significant result.

DISCUSSION AND FUTURE DIRECTIONS

The original ego-depletion article published in 1998 has spawned a large literature with over 150 articles, more than 400 studies, and a total of over 30,000 participants. There have been numerous theoretical articles and meta-analyses of this literature.  Unfortunately, the empirical results reported in this literature are not credible because there is strong evidence that the reported results are biased.  The bias makes it difficult to predict which effects are replicable. The main conclusion that can be drawn from this shaky mountain of evidence is that ego-depletion researchers have to change the way they conduct their studies and report their findings.

Importantly, this conclusion is in stark disagreement with Baumeister’s recommendations.  In a forthcoming article, he suggests that “the field has done very well with the methods and standards it has developed over recent decades,” (p. 2), and he proposes that “we should continue with business as usual” (p. 1).

Baumeister then explicitly defends the practice of selectively publishing studies that produced significant results without reporting failures to demonstrate the effect in conceptually similar studies.

Critics of the practice of running a series of small studies seem to think researchers are simply conducting multiple tests of the same hypothesis, and so they argue that it would be better to conduct one large test. Perhaps they have a point: One big study could be arguably better than a series of small ones. But they also miss the crucial point that the series of small studies is typically designed to elaborate the idea in different directions, such as by identifying boundary conditions, mediators, moderators, and extensions. The typical Study 4 is not simply another test of the same hypothesis as in Studies 1–3. Rather, each one is different. And yes, I suspect the published report may leave out a few other studies that failed. Again, though, those studies’ purpose was not primarily to provide yet another test of the same hypothesis. Instead, they sought to test another variation, such as a different manipulation, or a different possible boundary condition, or a different mediator. Indeed, often the idea that motivated Study 1 has changed so much by the time Study 5 is run that it is scarcely recognizable. (p. 2)

Baumeister overlooks that a program of research that tests novel hypotheses with new experimental procedures in small samples is most likely to produce non-significant results.  When the non-significant results are not reported, the published significant results do not show that these studies successfully demonstrated an effect or elucidated moderating factors. The result of this program of research is a complicated pattern of results that is shaped by random error, selection bias, and weak true effects that are difficult to replicate (Figure 1).

Baumeister makes the logical mistake of assuming that the type-I error rate is reset when a study is not a direct replication and that the type-I error rate only increases for exact replications. For example, it is obvious that we should not believe that eating green jelly beans decreases the risk of cancer if 1 out of 20 studies with green jelly beans produced a significant result.  With a 5% error rate, we would expect one significant result in 20 attempts by chance alone.  Importantly, this does not change if green jelly beans showed an effect, but red, orange, purple, blue, and other jelly beans did not show an effect.  With each study, the risk of a false positive result increases, and if 1 out of 20 studies produced a significant result, the success rate is not higher than one would expect by chance alone.  It is therefore important to report all results; reporting only the one green-jelly-bean study with a significant result distorts the scientific evidence.
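The arithmetic behind the jelly-bean example is simple: with 20 independent tests of effects that do not exist, at least one significant result is more likely than not.

```python
alpha = 0.05
k = 20  # independent tests of true null hypotheses (20 jelly-bean colors)
p_at_least_one = 1 - (1 - alpha) ** k
print(round(p_at_least_one, 2))  # ~ 0.64: one "discovery" in 20 attempts is expected by chance
```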

Baumeister overlooks this multiple-comparison problem when he claims that “a series of small studies can build and refine a hypothesis much more thoroughly than a single large study.”

As the meta-analysis shows, a series of over 400 small studies with selection bias tells us very little about ego-depletion, and it remains unclear under which conditions the effect can be reliably demonstrated.  To his credit, Baumeister is humble enough to acknowledge that his sanguine view of social psychological research is biased.

In my humble and biased view, social psychology has actually done quite well. (p. 2)

Baumeister remembers fondly the days when he learned how to conduct social psychological experiments.  “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants.”  A simple power analysis shows that a study with n = 10 per cell (N = 20) has 80% power only for effect sizes of d = 1.32 or larger.  Even the biased effect size estimate for ego-depletion studies was only half of this effect size.  Thus, a sample size of n = 10 is ridiculously low.  What about a sample size of n = 20?   It still requires an effect size of d = .91 to have an 80% chance of producing a significant result.  Maybe Roy Baumeister thinks it is sufficient to aim for a 50% success rate and to drop the other 50% of studies.  An effect size of d = .64 gives researchers a 50% chance of getting a significant result with N = 40.  But the meta-analysis shows that the bias-corrected effect size is even smaller than this.  So, even n = 20 is not sufficient to demonstrate ego-depletion effects.  Does this mean the effects are too flimsy to study?
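
These numbers are easy to verify. The following minimal sketch uses base R’s power.t.test for a standard two-sample design with alpha = .05; the d = .20 in the last line is roughly the bias-corrected estimate discussed later in this post and is used here only for illustration.

power.t.test(n = 10, power = .80)    # smallest effect detectable with 80% power: d ~ 1.32
power.t.test(n = 20, power = .80)    # smallest effect detectable with 80% power: d ~ 0.91
power.t.test(n = 20, power = .50)    # smallest effect detectable with 50% power: d ~ 0.64
power.t.test(n = 20, delta = .20)    # power for d = .20: ~ .09, far below 50%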

Inadvertently, Baumeister seems to dismiss ego-depletion effects as irrelevant if large sample sizes were required to demonstrate them.

Large samples increase statistical power. Therefore, if social psychology changes to insist on large samples, many weak effects will be significant that would have failed with the traditional and smaller samples. Some of these will be important effects that only became apparent with larger samples because of the constraints on experiments. Other findings will however make a host of weak effects significant, so more minor and trivial effects will enter into the body of knowledge.

If ego-depletion effects are not really strong, but only inflated by selection bias, and the real effects are much weaker, they may be minor and trivial effects that have little practical significance for the understanding of self-control in real life.

Baumeister then comes to the most controversial claim of his article, a claim that has produced vehement responses on social media.  He claims that a special skill called flair is needed to produce significant results with small samples.

Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure.

The need for flair also explains why some researchers fail to replicate original studies by researchers with flair.

But in that process, we have created a career niche for bad experimenters. This is an underappreciated fact about the current push for publishing failed replications. I submit that some experimenters are incompetent. In the past their careers would have stalled and failed. But today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work and thereby publishing a series of papers that will achieve little beyond undermining our field’s ability to claim that it has accomplished anything.

Baumeister even noticed individual differences in flair among his graduate and post-doctoral students.  The measure of flair was whether students were able to present significant results to him.

Having mentored several dozen budding researchers as graduate students and postdocs, I have seen ample evidence that people’s ability to achieve success in social psychology varies. My laboratory has been working on self-regulation and ego depletion for a couple decades. Most of my advisees have been able to produce such effects, though not always on the first try. A few of them have not been able to replicate the basic effect after several tries. These failures are not evenly distributed across the group. Rather, some people simply seem to lack whatever skills and talents are needed. Their failures do not mean that the theory is wrong.

The first author of the glucose paper was the victim of a doctoral advisor who believed that one could demonstrate a correlation between blood glucose levels and behavior with samples of 20 or fewer participants.  He found a way to produce these results, but the published results show statistical evidence of bias, and the effort was wasted on a false theory and a program of research that could not produce evidence for or against the theory because sample sizes were too small to show the effect even if the theory were correct.  Furthermore, it is not clear how many graduate students left Baumeister’s lab thinking that they were failures because they lacked research skills, when in fact they had simply applied the scientific method correctly.

Baumeister does not elaborate further on what distinguishes researchers with flair from those without flair.  To better understand flair, I examined the seminal ego-depletion study.  In this study, 67 participants were assigned to three conditions (n = 22 per cell).  The study was advertised as a study on taste perception.  Experimenters baked chocolate cookies in a laboratory room, and the room smelled of freshly baked chocolate cookies.  Participants were seated at a table with a bowl of freshly baked cookies and a bowl of red and white radishes.  Participants were instructed to taste either the radishes or the chocolate cookies.  They were then told that they had to wait at least 15 minutes to allow the sensory memory of the food to fade.  During this time, they were asked to work on an unrelated task.  The task was a figure-tracing puzzle with two unsolvable puzzles.  Participants were told that they could take as much time and as many trials as they wanted, that they would not be judged on the number of trials or the time they took, but that they would be judged on whether or not they finished the task.  However, if they wished to stop without finishing, they could ring a bell to notify the experimenter.  The time spent on this task was used as the dependent variable.  The study showed a strong effect of the manipulation.  Participants who had to taste radishes rang the bell 10 minutes earlier than participants who got to taste the chocolate cookies, t(44) = 6.03, d = 1.80, and 12 minutes earlier than participants in a control condition without the tasting part of the experiment, t(44) = 6.88, d = 2.04.   The ego-depletion effect in this study is gigantic.  Thus, flair might be important to create conditions that can produce strong effects, but once a researcher with flair has created such an experiment, others should be able to replicate it.  It doesn’t take flair to bake chocolate cookies, put a plate of radishes on a table, instruct participants how a figure-tracing task works, and tell them to ring a bell when they no longer want to work on the task.  In fact, Baumeister et al. (1998) proudly reported that even a high school student was able to replicate the study in a science project.

As this article went to press, we were notified that this experiment had been independently replicated by Timothy J. Howe, of Cole Junior High School in East Greenwich, Rhode Island, for his science fair project. His results conformed almost exactly to ours, with the exception that mean persistence in the chocolate condition was slightly (but not significantly) higher than in the control condition. These converging results strengthen confidence in the present findings.

If ego-depletion effects can be replicated in a school project, it undermines the idea that successful results require special skills.  Moreover, the meta-analysis shows that flair is little more than selective publishing of significant results, a conclusion that is confirmed by Baumeister’s response to my bias analyses: “you may think that not reporting the less successful studies is wrong, but that is how the field works” (Roy Baumeister, personal email communication).

In conclusion, future researchers interested in self-regulation have a choice. They can believe in ego-depletion, ignore the statistical evidence of selection bias, failed replications, and admissions of suppressed evidence, and conduct further studies with existing paradigms and sample sizes to see what they get.  Alternatively, they may go to the other extreme and dismiss the entire literature.

“If all the field’s prior work is misleading, underpowered, or even fraudulent, there is no need to pay attention to it.” (Baumeister, p. 4).

This meta-analysis offers a third possibility: trying to find replicable results that can provide the basis for planning future studies that provide better tests of ego-depletion theory.  I do not suggest directly replicating any past study.  Rather, I think future research should aim for a strong demonstration of ego-depletion.  To achieve this goal, future studies should maximize statistical power in four ways.

First, use a strong experimental manipulation by comparing a control condition with a combination of multiple ego-depletion paradigms to maximize the standardized effect size.

Second, the study should use multiple, reliable, and valid measures of ego-depletion to minimize the influence of random and systematic measurement error in the dependent variable.

Third, the study should use a within-subject design or at least a pre-post design to control for individual differences in performance on the ego-depletion tasks to further reduce error variance.

Fourth, the study should have a sufficient sample size to make a non-significant result theoretically important.  I suggest planning for a standard error of .10 standard deviations; in a simple two-group between-subject design this requires a total sample size of about N = 400, because the sampling error of d is roughly 2/sqrt(N), while a within-subject design achieves the same precision with fewer participants.  As a result, any effect size greater than d = .20 will be significant, and a non-significant result is consistent with the null-hypothesis that the effect size is less than d = .20 (see the sketch below).
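
The arithmetic behind this recommendation is simple. The sketch below assumes the between-subject case with equal group sizes, where the standard error of d is approximately 2/sqrt(N); the minimum significant effect uses the normal approximation.

se_target <- .10
N_total <- (2 / se_target)^2                        # total N = 400, i.e., n = 200 per group
qnorm(.975) * se_target                             # minimum significant effect size: d ~ .20
power.t.test(n = N_total / 2, delta = .20)$power    # ~ .51; d = .20 sits right at the significance threshold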

The next replicability report will show which path ego-depletion researchers have taken.  Even if they follow Baumeister’s suggestion to continue with business as usual, they can no longer claim that they were unaware of the consequences of going down this path.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

More blogs on replicability.

Open Ego-Depletion Replication Initiative

Dear Drs. Baumeister and Vohs,

Perspectives on Psychological Science published the results of a “A Multi-Lab Pre-Registered Replication of the Ego-Depletion Paradigm Reported in Sripada, Kessler, and Jonides (2014).”   The main finding of this replication project was a failure to demonstrate the ego-depletion effect across multiple labs with a large combined sample size.

You wrote a response to this finding (Baumeister & Vohs, in press).   In your response, you highlight several problems with the replication studies and conclude that the results only show that the specific experimental procedure used for the replication studies failed to demonstrate ego-depletion.

At the same time, you maintain that ego-depletion is a robust phenomenon that has been demonstrated repeatedly for two decades; quote “for two decades we have conducted studies of ego depletion carefully and honestly, following the field’s best practices, and we find the effect over and over.”

It is regrettable that the recent RRR project failed to show any effect for ego-depletion because the researchers used a paradigm that you never approved and never used in your own two decades of successful ego-depletion research.

I would like to conduct my own replication studies using paradigms that have reliably produced ego depletion effects in your laboratories. As a single paradigm may fail for unknown reasons, I would like to ask you kindly to identify three paradigms that based on your own experience have reliably produced the ego-depletion effect in your own laboratory and that can produce the effect in other laboratories.

To plan sample sizes for my replication studies, it is also important that you provide an estimate of the effect size. A meta-analysis by Hagger et al. (2010) suggested that the average ego-depletion effect size is d = .6, but a bias-corrected estimate suggests that the effect size may be as small as d = .2 (Carter & McCullough, 2014). I would hate to end up with a non-significant result simply because my replication studies were underpowered and failed to detect the ego-depletion effect. What effect size would you expect based on your two decades of successful studies?
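
For illustration, here is the kind of power analysis I would run (a sketch in R for a two-sample design with 80% power and alpha = .05):

power.t.test(delta = .6, power = .80)   # requires about 45 participants per group
power.t.test(delta = .2, power = .80)   # requires almost 400 participants per group

As you can see, the answer to my question makes an enormous difference for the resources required.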

Sincerely,
Dr. Ulrich Schimmack

Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136, 495-525. doi: 10.1037/a0019486

Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, 823. doi: 10.3389/fpsyg.2014.00823

Sripada, C., Kessler, D., & Jonides, J. (2014). Methylphenidate blocks effort-induced depletion of regulatory control in healthy volunteers. Psychological Science, 25, 1227-1234. doi: 10.1177/0956797614526415

Estimating Replicability of Psychological Science: 35% or 50%

Examining the Basis of Bakker, van Dijk, and Wicherts’s 35% estimate.

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

BDW’s article starts with the observation that psychological journals publish mostly significant results, but that most studies lack the statistical power to produce so many significant results (Sterling, 1959; Sterling et al., 1995). The heading for the paragraph that makes this claim is “Authors Are Lucky!”

“Sterling (1959) and Sterling, Rosenbaum, and Weinkam (1995) showed that in 97% (in 1958) and 96% (in 1986–1987) of psychological studies involving the use of NHST, H0 was rejected at α = .05. “ (p. 543).

“The abundance of positive outcomes is striking because effect sizes (ESs) in psychology are typically not large enough to be detected by the relatively small samples used in most studies (i.e., studies are often underpowered; Cohen, 1990).” (p. 543).

It is true that power is an important determinant of the rate of significant results that a series of experiments will produce. However, power is defined as the probability of obtaining a significant result when an effect is present; it is not defined when the null-hypothesis is true. As a result, average power sets the maximum rate of significant results that can be expected, and this maximum is reached only when the null-hypothesis is always false (Sterling et al., 1995).

Although it has been demonstrated that publication bias exists and that publication bias contributes to the high success rate in psychology journals, it has been more difficult to estimate the actual rate of significant results that one would expect without publication bias.

BDW provides an estimate and the point of this blog post is to examine their method of obtaining an unbiased estimate of statistical power, which sets an upper limit for the success rate of psychological studies published in psychology journals.

BDW begin with the observation that statistical power is a function of (a) the criterion for statistical significance (alpha), which is typically p < .05 (two-tailed), (b) sampling error, which decreases with increasing sample size, and (c) the population effect size.

The nominal significance level and sample size are known parameters. BDW suggest that the typical sample size in psychology is N = 40.

“According to Marszalek, Barber, Kohlhart, and Holmes (2011), the median total sample size in four representative psychological journals (Journal of Abnormal Psychology, Journal of Applied Psychology, Journal of Experimental Psychology: Human Perception and Performance, and Developmental Psychology) was 40. This finding is corroborated by Wetzels et al. (2011), who found a median cell size of 24 in both between- and within-subjects designs in their large sample of t tests from Psychonomic Bulletin & Review and Journal of Experimental Psychology: Learning, Memory and Cognition.”

The N = 40 estimate has two problems. First, it is not based on a representative sample of studies across all areas of psychology. Sample sizes are often smaller than N = 40 in animal studies, and they are larger in personality psychology (Fraley & Vazire, 2014). Second, the research design also influences sampling error. In a one-sample t-test, N = 40 implies a sampling error of 1/sqrt(40) = .158, and an effect size of d = .33 would be significant (t(39) = .33/.158 = 2.09, p = .043). In contrast, sampling error in a between-subject design with n = 20 per group is 2/sqrt(40) = .316, and an effect size of d = .65 is needed to obtain a significant result, t(38) = .65/.316 = 2.06, p = .047. Thus, power calculations have to take into account what research design was used. N = 40 can be adequate to study moderate effect sizes (d = .5) with a one-sample design, but not with a between-subject design.
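
To illustrate, here is a quick check of these two scenarios in R (a sketch assuming d = .50 and alpha = .05):

power.t.test(n = 40, delta = .50, type = "one.sample")$power   # ~ .87: adequate
power.t.test(n = 20, delta = .50, type = "two.sample")$power   # ~ .34: inadequate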

The major problem for power estimation is that the population effect size is unknown. BDW rely on meta-analyses to obtain an estimate of the typical effect size in psychological research.   There are two problems with this approach. First, meta-analyses often fail to correct for publication bias. As a result, meta-analytic estimates can be inflated. Second, meta-analyses may focus on research questions with small effect sizes because large effects are so obvious that they do not require a meta-analysis to examine whether they are real. With these caveats in mind, meta-analyses are likely to provide some valid information about the typical population effect size in psychology. BDW arrive at an estimate of d = .50, which Cohen considered a medium effect size.

“The average ES found in meta-analyses in psychology is around d = 0.50 (Anderson, Lindsay, & Bushman, 1999; Hall, 1998; Lipsey & Wilson, 1993; Meyer et al., 2001; Richard, Bond, & Stokes-Zoota, 2003; Tett, Meyer, & Roese, 1994).”

Based on a sample size of N = 40 and a typical effect size of d = .50, the authors arrive at an estimate of 35% power; that is, a 35% probability that a psychological study reported with a significant result in a journal actually produced a significant result or would produce a significant result again in an exact replication study (with the same sample size and design as the original study). The problem with this estimate is that BDW assume that all studies use the low-power, between-subject (BS) design.

“The typical power in our field will average around 0.35 in a two independent samples comparison, if we assume an ES of d = 0.50 and a total sample size of 40” (p. 544).

The authors do generalize from the BS scenario to all areas of research.

“This low power in common psychological research raises the possibility of a file drawer (Rosenthal, 1979) containing studies with negative or inconclusive results.” (p. 544).

Unfortunately, the authors ignore important work that contradicts their conclusions. Most important, Cohen (1962) provided the first estimate of statistical power in psychological research. He did not conduct an explicit meta-analysis of psychological research, but he suggested that an effect size of half a standard deviation is a moderate effect size. This standardized effect size was named after him: Cohen’s d. As it turns out, the effect size used by BDW, Cohen’s d = .50, is the same effect size that Cohen used for his power analysis (he also proposed similar criteria for other effect size measures).   Cohen (1962) arrived at a median power estimate of 50% to detect a moderate effect size.   This estimate was replicated by Sedlmeier and Gigerenzer (1989), who also conducted a meta-analysis of power estimates and found that power in some other research areas was higher, with an average of 60% power to detect a moderate effect size.
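
For orientation, 50% power to detect d = .50 in a two-sample design corresponds to roughly 30 participants per group. This is only a rough sketch, since Cohen’s survey covered a variety of designs and tests:

power.t.test(delta = .50, power = .50)    # n ~ 32 per group
power.t.test(n = 32, delta = .50)$power   # ~ .50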

One major factor that contributes to the discrepancy between BDW’s estimate of 35% power and other power estimates in the range from 50 to 60% power is that BDW estimated sample sizes on the basis of journals that use within-subject designs, but conducted the power analysis with a between-subject design. In contrast, Cohen and others used the actual designs of studies to estimate power. This approach is more labor-intensive, but provides more accurate estimates than an approach that assumes that all studies use between-subject designs.

CONCLUSION

In conclusion, the 35% estimate underestimates the typical power in psychological studies. Given that BDW and Cohen made the same assumption about the median population effect size, Cohen’s method is more accurate and estimates based on his method should be used. These estimates are closer to 50% power.

However, even the 50% estimate is just an estimate that requires further validation research. One limitation is that the accuracy of the meta-analytic estimation method is unknown. Another problem is that power assumes that an effect is present, but in some studies the null-hypothesis is true. Thus, even if the typical power of studies were 50%, the actual success rate would be lower.

Unless better estimates become available, it is reasonable to assume that at best 50% of published significant results will replicate in an exact replication study. Given published success rates close to 100%, this means that researchers routinely obtain non-significant results in studies that would have been published if they had produced significant results. This large file-drawer of unreported studies inflates reported effect sizes, increases the risk of false-positive results, and wastes resources.

MY JOURNEY TOWARDS ESTIMATION OF REPLICABILITY OF PSYCHOLOGICAL RESEARCH

BACKGROUND

About 10 years ago, I became disillusioned with psychology; mostly social psychology broadly defined, which is the main area of psychology in which I specialized. Articles published in top journals became longer and longer, with more and more studies, and more and more incredible findings that made no sense to me and that I could not replicate in my own lab.

I also became more familiar with Jacob Cohen’s criticism of psychology and the concept of power. At some point during these dark years, I found a short article in the American Statistician that changed my life (Sterling et al., 1995). The article presented a simple formula and explained that the high success rate in psychology journals (over 90% of reported results confirm authors’ predictions) is incredible, unbelievable, or unreal. Of course, I was aware that publication bias contributed to these phenomenal success rates, but Sterling’s article suggested that there is a way to demonstrate this with statistical methods.

Cohen (1962) estimated that a single study in psychology has only 50% power. This means that a paper with two studies has only a 25% probability of confirming an author’s predictions (.5 × .5 = .25). An article with 4 studies has a probability of doing so of less than 10% (.5^4 ≈ .06). Thus, it was clear that many of these multiple-study articles in top journals had to be produced by means of selective reporting of significant results.

I started doing research with large samples and I started ignoring research based on these made-up results. However, science is a social phenomenon and questionable theories about unconscious emotions and attitudes became popular in psychology. Sometimes being right and being popular are different things. I started trying to educate my colleagues about the importance of power, and I regularly questioned speakers at our colloquium about their small sample sizes. For a while using the word power became a running joke in our colloquium, but research practices did not change.

Then came the year 2011. At the end of 2010, psychology departments all over North America were talking about the Bem paper. An article in press in the top journal JPSP was going to present evidence for extrasensory perception. In 9 out of 10 statistical tests, undergraduate students appeared to have precognition of random future events. I was eager to participate in the discussion group at the University of Toronto to point out that these findings are unbelievable, not because we know ESP does not exist, but because it is practically impossible to get 9 out of 10 significant results without having very high power in each study. Using Sterling’s logic, it was clear to me that Bem’s article was not credible.
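
The logic is easy to quantify. Here is a minimal sketch in R of the Sterling-style argument (not an analysis of Bem’s actual data): assume each of 10 studies has a given level of power and compute the chance of 9 or more significant results.

pbinom(8, size = 10, prob = .50, lower.tail = FALSE)   # with 50% power: ~ .011
pbinom(8, size = 10, prob = .80, lower.tail = FALSE)   # even with 80% power: ~ .38

Only when power is well above 90% does a 9-out-of-10 success rate become a plausible outcome.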

When I made this argument, I was surprised that some participants in the discussion doubted the logic of my argument more than Bem’s results. I decided to use Bem’s article to make my case in a published article. I was not alone. In 2011 and 2012 numerous articles appeared that pointed out problems with the way psychologists (ab)use the scientific method. Although there are many problems, the key problem is publication bias. Once researchers can select which results they report, it is no longer clear how many reported results are false positive results (Sterling et al., 1995).

When I started writing my article, I wanted to develop a test that reveals selective reporting so that this unscientific practice can be detected and punished, just like a doping test for athletes. Many psychologists do not like to use punishment and think carrots are better than sticks. However, athletes do not get medals for not taking doping and tax payers do not get a reward for filing their taxes. If selective reporting of results violates the basic principle of science, scientists should not do it, and they do not deserve to get a reward for doing what they are supposed to be doing.

THE INCREDIBILITY INDEX

In June 2011, I submitted my manuscript to Psychological Methods. After one and a half years and three rounds of reviews my manuscript finally appeared in print (Schimmack, 2012). Meanwhile, Greg Francis had developed a similar method that also used statistical power to reveal bias. Psychologists were not very enthusiastic about the introduction of our doping test.

This is understandable because the use of scientific doping was a widely accepted practice and there was no formal ban on selective reporting of results. Everybody was doing it, so when Greg Francis used the method to target a specific article, the authors felt attacked. Why me? You could have attacked any other article and found the same result.

When Greg Francis did analyze all articles (published in the top journal Psychological Science for a specific time period), he found, indeed, that over 80% showed positive signs of bias. So, selective reporting of results is a widely used practice and it makes no sense to single out a specific article. Most articles are produced with selective reporting of results. When Greg Francis submitted his findings to Psychological Science, the article was rejected. It was not rejected because it was flawed. After all, it merely confirmed what everybody already knew, namely that all researchers report only the results that support their theory.  It was probably rejected because it was undesirable to document this widely used practice scientifically and to show how common selective reporting is.  It was probably more desirable to maintain the illusion that psychology is an exact science with excellent theories that make accurate predictions that are confirmed when they are submitted to an empirical test. In truth, it is unclear how many of these success stories are false and would fail if they were replicated without the help of selective reporting.

ESTIMATION OF REPLICABILITY

After the publication of my 2012 paper, I continued to work on the issue of publication bias. In 2013 I met Jerry Brunner in the statistics department. As a former, disillusioned social psychologist who got a second degree in statistics, he was interested in my ideas. Like many statisticians, he was skeptical (to say the least) about my use of post-hoc power to reveal publication bias.  However, he kept an open mind and we have been working together on statistical methods for the estimation of power. As this topic has been largely neglected by statisticians, we were able to make some new discoveries, and we developed the first method that can estimate power under difficult conditions: when publication bias is present and when power is heterogeneous (varies across studies).

In 2015, I learned programming in R and wrote software to extract statistical results from journal articles (PDFs converted into text files). After downloading all articles from 105 journals for a specific time period (2010-2015) with the help of Andrew, I was able to apply the method to over 1 million statistical tests reported in psychology journals. The beauty of using all articles is that the results do not suffer from selection bias (cherry-picking). Of course, the extraction method misses some tests (e.g., tests reported in figures or tables) and the average across journals depends on the selection of journals. But the result for a single journal is based on all tests that are automatically extracted.
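
To give a flavor of how such automated extraction works, here is a minimal sketch in R. The regular expression below is a simplified illustration that covers only simple t and F statistics, not the actual extraction code, and the example sentence is hypothetical.

extract_tests <- function(text) {
  pattern <- "(t\\(\\s*\\d+\\s*\\)|F\\(\\s*\\d+\\s*,\\s*\\d+\\s*\\))\\s*=\\s*-?\\d+\\.?\\d*"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}
txt <- "The effect was significant, t(44) = 6.03, p < .001, and the interaction was not, F(1, 96) = 1.70, p = .20."
extract_tests(txt)   # returns "t(44) = 6.03" and "F(1, 96) = 1.70"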

It is important to realize the advantage of this method compared to typical situations where researchers rely on samples to estimate population parameters. For example, the OSF-reproducibility project selected three journals and a single statistical test from only 100 articles (Science, 2015). Not surprisingly, the results of the project have been criticized as not being representative of psychology in general or even the subject areas represented in the three journals. Similarly, psychologists routinely collect data from students at their local university, but assume that the results generalize to other populations. It would be easy to dismiss these results as invalid, simply because they are not based on a representative sample. However, most psychologists are willing to accept theories based on these small and unrepresentative samples until somebody demonstrates that the results cannot be replicated in other populations (or still accept the theory because they dismiss the failed replication). None of these sampling problems plague research that obtains data for the total population.

When the data were available and the method had been validated in simulation studies, I started using it to estimate the replicability of results in psychological journals. I also used it for individual researchers and for departments. The estimates were in the range from 40% to 70%. This estimate was broadly in line with estimates obtained using Cohen’s (1962) method, which results in power estimates of 50-60% (Sedlmeier & Gigerenzer, 1989). Estimates in this range were consistent with the well-known fact that reported success rates in journals of over 90% are inflated by publication bias (Sterling et al., 1995). It would also be unreasonable to assume that all reported results are false positives, which would result in an estimate of 5% replicability because false positive results have a 5% probability of being significant again in a replication study. Clearly, psychology has produced some reliable findings that can be replicated every year in simple class-room demonstrations.  Thus, an estimate somewhere in the middle of the extremes between nihilism (nothing in psychology is true) and naive optimism (everything is true) seems reasonable and consistent across estimation methods.

My journal rankings also correctly predicted the ranking of journals in the OSF-reproducibility project, where articles published in JEP:General were most replicable, followed by Psychological Science, and then JPSP. There is even a direct causal link between the actual replication rate and power because cognitive psychologists use more powerful designs, and power determines the replicability in an exact replication study (Sterling et al., 1995).

I was excited to share my results in blogs and in a Facebook discussion group because I believed (and still believe) that these results provide valuable information about the replicability of psychological research; a topic that has been hotly debated since Bem’s (2011) article appeared.

The lack of reliable and valid information fuels this debate because opponents in the debate do not agree about the extent of the crisis. Some people assume that most published results are replicable (Gilbert, Wilson), whereas others suggest that the majority of published results are false (Ioannidis). Surprisingly, this debate rarely mentions Cohen’s seminal estimate of 50%.  I was hoping that my results would provide some much needed objective estimates of the replicability of psychological research based on a comprehensive analysis of published results.

At present, there exist about five different estimates of the replicability of psychological research that range from 20% or less to 95%.

Less than 20%: Button et al. (2013) used meta-analyses in neuroscience, broadly defined, to suggest that power is only about 20%, and their method did not even correct for inflated effect sizes due to publication bias.

About 40%: A project that replicated 100 studies from social and cognitive psychology yielded about 40% successful replications; that is, about 40% reproduced a significant result in the replication study. This estimate may be slightly inflated because the replication studies sometimes used larger samples, which increased the probability of obtaining a significant result, but it may also be attenuated because replication studies were not carried out by the same researchers using the same population.

About 50%: Cohen (1962) and subsequent articles estimated that the typical power in psychology is about 50% to detect a moderate effect size of d = .5, which is slightly higher than the average effect size in meta-analyses of social psychology (e.g., Richard, Bond, & Stokes-Zoota, 2003).

50-80%: The average replicability in my rankings is 70% for journals and 60% for departments. The discrepancy is likely due to the fact that journals that publish more statistical results (e.g., a six-study article in JPSP) have lower replicability. There is variability across journals and departments, but few analyses have produced values below 50% or over 80%. If I had to pick a single number, I would pick 60%, the average for psychology departments. 60% is also the estimate for new, open-access journals that publish thousands of articles a year, in contrast to small quarterly journals that publish fewer than one hundred articles a year.

If we simply use the median of these five estimates, Cohen’s estimate of 50% provides the best estimate that we currently have.  The average estimate for 51 psychology departments is 60%.  The discrepancy may be explained by the fact that Cohen focused on theoretically important tests. In contrast, an automatic extraction of statistical results retrieves all statistical tests that are reported in articles. It is unfortunate that psychologists often report hypothesis tests even when they are meaningless (e.g., positive stimuli were rated as more positive (M = 6.00, SD = 1.00) than negative stimuli (M = 2.00, SD = 1.00), d = 4.00, p < .001). Eventually, it may be possible to develop algorithms that exclude these statistical tests, but while they are included, replicability estimates include the probability of rejecting the null-hypothesis for these obvious hypotheses. Taking this into account, estimates of 60% are likely to overestimate the replicability of theoretically important tests, which may explain the discrepancy between Cohen’s estimate and the results in my rankings.

CONCERNS ABOUT MY RANKINGS

Since I started publishing my rankings, some psychologists have raised concerns about them. In this blog post, I address these concerns.

#1 CONCERN: Post-Hoc Power is not the same as Replicability

Some researchers have argued that only actual replication studies can be used to measure replicability. This argument has two problems. First, actual replication studies do not provide a gold standard for estimating replicability. The reason is that there are many reasons why an actual replication study may fail, and there is no shortage of examples where researchers have questioned the validity of actual replication studies. Thus, even the success rate of actual replication studies is only an estimate of the replicability of original studies.

Second, original studies would already provide an estimate of replicability if no publication bias were present. If a set of original studies produced 60% significant results, an exact replication of these studies is also expected to produce 60% significant results, within the margins of sampling error. The reason is that the success rate of any set of studies is determined by the average power of the studies (Sterling et al., 1995), and the average power of identical sets of studies is the same. The problem with using published success rates as estimates of replicability is that published success rates are inflated by selection bias (selective reporting of results that support a theoretical prediction).
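
The claim that the expected success rate equals average power is easy to check in a small simulation (a sketch in R with 1,000 hypothetical two-group studies, heterogeneous true effects, and no selection):

set.seed(123)
n <- 20                                                                 # participants per group
d <- runif(1000, 0, 1)                                                  # heterogeneous true effect sizes
true_power <- sapply(d, function(es) power.t.test(n = n, delta = es)$power)
pvals <- sapply(d, function(es) t.test(rnorm(n, es), rnorm(n, 0))$p.value)
mean(true_power)      # average true power of the set of studies
mean(pvals < .05)     # observed success rate; matches average power within sampling error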

The main achievement of Brunner and Schimmack’s statistical estimation method was to correct for selection bias so that reported statistical results can be used to estimate replicability. The estimate produced by this method is an estimate of the success rate in an unbiased set of exact replication studies.

#2 CONCERN: Post-Hoc Power does not predict replicability.

In the OSF-project, observed power predicted actual replication success with a correlation of r = .23. This may be interpreted as evidence that post-hoc power is a poor predictor of actual replicability. However, the problem with this argument is that statisticians have warned repeatedly about the use of post-hoc power for a single statistical result (Hoenig & Heisey, 2001). The problem is that the confidence interval around the estimate is so wide that only extremely high power (> 99%) leads to accurate predictions that a study will replicate. For most studies, the confidence interval around the point-estimate is too wide to make accurate predictions.

However, this does not mean that post-hoc power cannot predict replicability for larger sets of studies. The reason is that the precision of the estimate increases as the number of tests increases. So, when my rankings are based on hundreds or thousands of tests published in a journal, the estimates are sufficiently precise to be useful. Moreover, Brunner and Schimmack developed a bootstrap method that estimates 95% confidence intervals; these confidence intervals provide information about the precision of the estimates and can be used to test whether differences in ranks are statistically meaningful.
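
To illustrate the bootstrap idea, here is a toy sketch in R. It is not the actual Brunner and Schimmack estimation method (it does not correct for selection bias), and the z-values are hypothetical: each reported test statistic is converted into an observed-power estimate and the mean is bootstrapped.

observed_power <- function(z, alpha = .05) {
  crit <- qnorm(1 - alpha / 2)
  pnorm(abs(z) - crit) + pnorm(-abs(z) - crit)   # two-sided power, treating z as the noncentrality
}
z <- c(2.1, 2.5, 1.9, 3.2, 2.0, 2.8, 2.3, 4.1)   # hypothetical z-values extracted from one journal
boot_means <- replicate(10000, mean(observed_power(sample(z, replace = TRUE))))
quantile(boot_means, c(.025, .975))              # 95% confidence interval for average observed power

With only eight values the interval is wide; with hundreds or thousands of tests per journal it becomes narrow enough to compare journals.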

#3 CONCERN: INSUFFICIENT EVIDENCE OF VALIDITY

I have used the OSF-reproducibility project (Science, 2015) to validate my rankings of journals. My method correctly predicted that results from JEP:General would be more replicable than those from Psychological Science, which in turn would be more replicable than those from JPSP. My estimate based on the extraction of all test statistics from the three journals was 64%, whereas the actual replication success rate was 36%. The estimate based on all tests overestimates the replicability of theoretically important tests, and the actual success rate underestimates replicability because of problems in conducting exact replication studies. The average of the two estimates is 50%, close to the best estimate of replicability.

It has been suggested that a single comparison is insufficient to validate my method. However, this argument ignores that the OSF-project was the first attempt at replicating a representative set of psychological studies and that the study received a lot of praise for doing so. So, N = 1 is all we have to compare my estimates to estimates based on actual replication studies. When more projects of this nature become available, I will use the evidence to validate my rankings and, if there are discrepancies, use this information to increase the validity of my rankings.

Meanwhile, it is simply false that a single data point is insufficient to validate an estimate. There is only one Earth, so any estimate of global temperature has to be validated with just one data point. We cannot wait for validation of this method on 199 other planets to decide whether estimates of global temperature are valid.

To use an example from psychology, if a psychologist wants to validate a method that presents stimuli subliminally and lets participants guess whether a stimulus was presented or not, the method is valid if participants are correct 50% of the time. If the percentage is 55%, the method is invalid because participants are able to guess above chance.

Also, validity is not an either or construct. Validity comes in degrees. The estimate based on rankings does not perfectly match the OSF-results or Cohen’s method. None of these methods are perfect. However, they converge on the conclusion that the glass is half full and half empty. The consensus across methods is encouraging. Future research has to examine why the methods differ.

In conclusion, the estimates underlying my replicability rankings are broadly consistent with two other methods of estimating replicability: Cohen’s method of estimating post-hoc power for medium effect sizes and the actual replication rate in the OSF-project. The replicability rankings are likely to overestimate the replicability of focal tests by about 10% because they include statistical tests of manipulation checks and covariates that are theoretically less important. This bias may also not be constant across journals, which could affect the rankings to some extent, but it is unknown whether this is actually the case and how much the rankings would be affected. Pointing out that this potential bias could reduce the validity of the rankings does not lead to the conclusion that they are invalid.

#4 RANKINGS HAVE TO PASS PEER-REVIEW

Some researchers have suggested that I should wait to publish my results until this methodology has passed peer-review. In my experience, this would probably take a couple of years. Maybe that would have been an option when I started as a scientist in the late 1980s, when articles were printed, photocopied, and sent by mail if the local library did not have a journal. However, this is 2016: information is shared at lightning speed, and articles are already critiqued on twitter or pubpeer before they are officially published.

I learned my lesson when the Bem (2011) article appeared and it took one-and-a-half years for my article to be published. By that time, numerous articles had been published and Greg Francis had published a critique of Bem using a similar method. I was too slow.

In the meantime, Uri Simonsohn gave two SPSP symposia on pcurve before the actual pcurve article was published in print, and he had a pcurve.com website.  When Uri presented the method the first time (I was not there), it created an angry response by Norbert Schwarz. Nobody cares about Norbert’s response anymore, pcurve is widely accepted, and version 4.0 looks very different from the original version of pcurve. Angry and skeptical responses are to be expected when somebody does something new, important, and disruptive, but this is part of innovation.

Second, I am not the first one to rank journals or departments or individuals. Some researchers get awards suggesting that their work is better than the work of those who do not get awards. Journals with more citations are more prestigious, and departments are ranked in terms of popularity among peers. Who has validated these methods of evaluation and how valid are they? Are they more valid than my replicability rankings?

At least my rankings are based on solid statistical theory and correctly predict that cognitive psychology is more replicable than social psychology. The fact that mostly social psychologists have raised concerns about my method may reveal more about social psychologists than about the validity of my method. Social psychologists also conveniently ignore that the OSF replicability estimate of 36% is an average across areas, that the estimate for social psychology was an abysmal 25%, and that my journal rankings place many social psychology journals at the bottom of the ranking. One would only have to apply social psychological theories about heuristics and biases in cognitive processes to explain social psychologists’ concerns about my rankings.

CONCLUSION

In conclusion, the actual replication rate for a set of exact replication studies is identical to the true average power of the studies. Average power can be estimated on the basis of reported test statistics, and Brunner and Schimmack’s method can produce valid estimates when power is heterogeneous and when selection bias is present. When this method is applied to all statistics in the population (all journals, all articles by an author, etc.), rankings are not affected by selection bias (cherry-picking). When the set of statistics includes all statistical tests, as in an automated extraction of test statistics, the estimate reflects the replicability of a randomly picked statistically significant result from a journal. This may be a manipulation check or a theoretically important test. It is likely that this estimate overestimates the replicability of critically important tests, especially those that are just significant, because selection bias has a stronger impact on results with weak evidence. The estimates are broadly consistent with other estimation methods, and more data from actual replication studies are needed to further validate the rankings. Nevertheless, the rankings provide the first objective estimate of replicability for different journals and departments.

The main result of this first attempt at estimating replicability provides clear evidence that selective reporting undermines the validity of published success rates. Whereas published success rates are over 90%, the actual success rate for studies that end up being published because they produced a desirable result is closer to 50%. The negative consequences of selection bias are well known. Reliable information about actual replicability and selection bias is needed to increase the replicability, credibility, and trustworthiness of psychological research. It is also needed to demonstrate to consumers of psychological research that psychologists are improving the replicability of their research. Whereas rankings will always show differences, all psychologists are responsible for increasing the average. Real improvement would produce an increase in replicability on all three estimation methods (actual replications, Cohen’s method, and Brunner and Schimmack’s method).  It is an interesting empirical question when and how much replicability estimates will increase in the future.  My replicability rankings will play an important role in answering this question.