
Who is Your Daddy? Priming women with a disengaged father increases their willingness to have sex without a condom

Photo credit: https://www.theblot.com/pole-dancing-daddy-fun-acrobatics-7767007

In a five-study article, Danielle J. DelPriore and Sarah E. Hill from Texas Christian University examined the influence of a disengaged father on daughters' sexual attitudes and behaviors.

It is difficult to study the determinants of sexual behavior in humans because it is neither practical nor ethical to randomly assign daughters to engaged and distant fathers to see how this influences daughters’ sexual attitudes and behaviors.

Experimental social psychologists believe that they have found a solution to this problem. Rather than exposing individuals to actual experiences in the real world, it is possible to expose them to stimuli or stories related to these events. These studies are called priming studies. The assumption is that priming individuals has the same effect as actually experiencing these events. For example, a daughter with a loving and caring father is expected to respond like a daughter with a distant father if she is randomly assigned to a condition with a parental-disengagement prime.

This article reports five priming studies that examined how thinking about a distant father influences daughters’ sexual attitudes.

Study 1 (N = 75 female students)

Participants in the paternal disengagement condition read the following instructions:

Take a few seconds to think back to a time when your biological father was absent for an important life event when you really needed him . . .. Describe in detail how your father’s lack of support—or his physical or psychological absence—made you feel.

Participants in the paternal engagement condition were asked to describe a time their father was physically or psychologically present for an important event.

The dependent variable was a word-stem completion task with stems that could be completed as sexual or neutral words (s_x: sex vs. six; _aked: naked vs. baked).

Participants primed with a disengaged father completed more word-stems in a sexual manner (M = 4.51, SD = 2.06) than participants primed with an engaged father (M = 3.63, SD = 1.50), F(1,73) = 4.51, p = .037, d = .49.

Study 2 (N = 64 female students)

Study 2 used the same priming manipulation as Study 1, but measured sexual permissiveness with the Sociosexual Orientation Inventory (SOI; Simpson & Gangestad, 1991).  Example items are “sex without love is OK,” and “I can imagine myself being comfortable and enjoying casual sex with different partners.”

Participants who thought about a disengaged father had higher sexual permissiveness scores (M = 2.57, SD = 1.88) than those who thought about an engaged father (M = 1.86, SD = 0.94), F(1,62) = 3.91, p = .052, d = .48.

Study 3 (N = 82 female students)

Study 3 changed the control condition from an engaged father to a disengaged or disappointing friend. It is not clear why this condition was not included as a third condition in Study 2 but was run as a separate experiment. The study showed that participants who thought about a disengaged dad scored higher on the sexual permissiveness scale (M = 2.90, SD = 2.25) than participants who thought about a disappointing friend (M = 2.09, SD = 1.19), F(1,80) = 4.24, p = .043, d = .45.

Study 4 (N = 62 female students)

Study 4 used maternal disengagement as the control condition. Again, it is not clear why the researchers did not run one study with four conditions (disengaged father, engaged father, disappointing friend, disengaged mother).

Participants who thought about a disengaged dad had higher scores on the sexual permissiveness scale (M = 2.85, SD = 1.84) than participants who thought about a disengaged mother (M = 1.87, SD = 1.16), F(1, 60) = 6.03, p = .017, d = .64.

Study 5 (N = 85 female students & 92 male students)

Study 5 could have gone in many directions, but it included women and men as participants and used disappointing friends as the control condition (why not use engaged and disengaged mothers/fathers in a 2 x 2 design to see how gender influences the effects of parent-child relationships?). Even more disappointing is that the only reported (!) dependent variable was attitudes towards condoms. Why was the sexual permissiveness measure dropped in Study 5?

The results were analyzed for male and female participants who thought about a disengaged dad or a disappointing friend. Participants reported more negative attitudes towards condoms after thinking about a disengaged dad (M ~ 3.4, estimated from the figure) than after thinking about a disappointing friend (M ~ 2.9, estimated from the figure), F(1,172) = 5.10, p = .025, d = 0.33. The interaction with gender was not significant, p = .58, but the effect of the manipulation on attitudes towards condoms was marginally significant in an analysis limited to women (Ms = 3.07 vs. 2.51, SDs = 1.30 and 1.35), F(1,172) = 3.76, p = .054, d = 0.42. Although the interaction was not significant, the authors conclude in the general discussion section that “the effects of primed paternal disengagement on sexual risk were also found to be stronger for women than for men (Experiment 5)” (p. 242).

CONCLUSION

Based on this set of five studies, the authors conclude that “the results of the current research provide the first experimental support for PIT [Parental Investment Theory] by demonstrating a causal relationship between paternal disengagement cues and changes in women’s sexual decision making” (p. 242).

They then propose that “insight gained from this research may help inform interventions aimed at reducing some of the personal and financial costs associated with father absence, including teen pregnancy and STI risk” (p. 242).

Well, assuming these results were credible, they could also be used by men interested in having sex without condoms: bringing up a time when their own father was distant and disengaged may prime a date to think about a similar time in her life and happily engage in unprotected sex. Of course, women who are aware of this priming effect may not fall for such a cheap trick.

However, before researchers or lay people get too excited about these experimental findings, it is important to examine whether they are even credible. Five successful studies may seem like strong evidence for the robustness of this effect, but the reported studies cannot be taken at face value, because scientific journals tend to publish only successful studies and it is unclear how many failed studies or analyses were not reported.

To examine the credibility and replicability of the reported findings, I subjected the reported results to statistical bias tests. These tests suggest that the results are not credible and are unlikely to replicate in independent attempts to reproduce these studies.

| N   | Statistic       | p    | z    | OP  |
|-----|-----------------|------|------|-----|
| 75  | F(1,73) = 4.51  | .037 | 2.08 | .55 |
| 64  | F(1,62) = 3.91  | .052 | 1.94 | .49 |
| 82  | F(1,80) = 4.24  | .043 | 2.03 | .53 |
| 62  | F(1,60) = 6.03  | .017 | 2.39 | .67 |
| 177 | F(1,172) = 5.10 | .025 | 2.24 | .61 |

OP = observed power

The Test of Insufficient Variance (TIVA) shows that the variance of the z-scores is much smaller than random sampling error would produce, var(z) = 0.03 (expected value 1.00), p < .01. The median observed power is only 55%, whereas the success rate is 100%, showing that the success rate is inflated. The Replicability Index is 55 - (100 - 55) = 10. This value is well below the value of 22 that is expected when only significant results are selected from a set of studies without a real effect. A Replicability Index of 10 suggests that other researchers will not be able to replicate the significant results reported in this article.
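For readers who want to check these numbers, the following minimal sketch (my own code, not the original analysis script) reproduces the observed power estimates, the TIVA test, and the Replicability Index from the reported two-tailed p-values:

```python
import numpy as np
from scipy.stats import norm, chi2

# two-tailed p-values of the five focal significance tests reported above
p_values = np.array([0.037, 0.052, 0.043, 0.017, 0.025])

z = norm.isf(p_values / 2)              # convert p-values to z-scores
z_crit = norm.isf(0.025)                # z-value needed for p < .05 (two-tailed)
observed_power = norm.cdf(z - z_crit)   # power implied by each observed z-score

# Test of Insufficient Variance: var(z) should be close to 1 under honest reporting
var_z = np.var(z, ddof=1)
tiva_p = chi2.cdf((len(z) - 1) * var_z, df=len(z) - 1)   # left-tailed chi-square test

# Replicability Index: median observed power minus the inflation of the success rate
success_rate = 1.0                      # all five tests are reported as successes
median_op = np.median(observed_power)
r_index = median_op - (success_rate - median_op)

print(z.round(2), observed_power.round(2))  # close to the z and OP columns in the table
print(round(var_z, 2), round(tiva_p, 3))    # var(z) ~ .03, p ~ .002
print(round(r_index, 2))                    # ~ .10
```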

In conclusion, this article does not contain credible evidence about the causes of male or female sexuality. If you grew up without a father or with a disengaged father, this does not mean that your sexual attitudes, preferences, and behaviors were necessarily influenced by it. Answers to these important questions are more likely to come from studies of real family relationships than from priming studies that assume real-world experiences can be simulated in the laboratory.

Reference

DelPriore, D. J., & Hill, S. E. (2013). The effects of paternal disengagement on women’s sexual decision making: An experimental approach. Journal of Personality and Social Psychology, 105, 234-246. DOI: 10.1037/a0032784

 


Bayes-Factors Do Not Solve the Credibility Problem in Psychology

Bayesians like to blame p-values and frequentist statistics for the replication crisis in psychology (see, e.g., Wagenmakers et al., 2011). An alternative view is that the replication crisis is caused by the selective reporting of significant results (Schimmack, 2012). This bias would affect frequentist and Bayesian statistics alike, and switching from p-values to Bayes-Factors would not solve the replication crisis. It is difficult to evaluate these competing claims because Bayesian statistics are still used relatively infrequently in research articles. For example, a search for the term Bayes-Factor retrieved only six articles in Psychological Science in the years from 1990 to 2015.

One article made a reference to the use of Bayesian statistics in modeling. Three articles used Bayes-Factors to test the null-hypothesis. These articles will be examined in a different post, but they are not relevant to the problem of replicating results that appeared to demonstrate effects by rejecting the null-hypothesis. Only two articles used Bayes-Factors to test whether a predicted effect is present.

Example 1

One article reported Bayes-Factors to claim support for predicted effects in 6 studies (Savani & Rattan, 2012).   The results are summarized in Table 1.

| Study | Design | N   | Statistic       | p    | z    | OP   | BF1   | BF2       |
|-------|--------|-----|-----------------|------|------|------|-------|-----------|
| 1     | BS     | 48  | t(42) = 2.29    | .027 | 2.21 | .60  | 12.76 | 2.05      |
| 2     | BS     | 46  | t(40) = 2.57    | .014 | 2.46 | .69  | 28.03 | 2.85      |
| 3     | BS     | 67  | t(65) = 2.25    | .028 | 2.20 | .59  | 9.55  | 1.61      |
| 4     | BS     | 61  | t(57) = 2.85    | .006 | 2.74 | .78  | 39.48 | 6.44      |
| 5     | BS     | 146 | F(1,140) = 6.68 | .011 | 2.55 | .72  | NA    | 2.95      |
| 6     | BS     | 50  | t(47) = 2.43    | .019 | 2.35 | .65  | 16.98 | 2.66      |
| MA    | BS     | 418 | t(416) = 6.05   | .000 | 5.92 | 1.00 | NA    | 1,232,427 |

MA = meta-analysis, OP = observed power, BF1 = Bayes-Factor reported in article based on half-normal with SD = .5,  BF2 = default Bayes-Factor with Cauchy(0,1)

All 6 studies reported a statistically significant result, p < .05 (two-tailed).  Five of the six studies reported a Bayes-Factor and all Bayes-Factors supported the alternative hypothesis.  Bayes-Factors in the article were based on a half-normal centered at d = .5.  The Bayes-Factors show that the data are much more consistent with this alternative hypothesis than with the null-hypothesis.  I also computed the Bayes-Factor for a Cauchy distribution centered at 0 with a scaling parameter of r = 1 (Wagenmakers et al., 2011).  This alternative hypothesis assumes that there is a 50% probability that the standardized effect size is greater than d = 1.  This extreme alternative hypothesis favors the null-hypothesis when the data show small to moderate effect sizes.  Even this Bayes-Factor consistently favors the alternative hypothesis, but the odds are less impressive.  This result shows that Bayes-Factors have to be interpreted in the context of the specified alternative hypothesis.  The last row shows the results of a meta-analysis. The results of the six studies were combined using Stouffer’s formula sum(z) / sqrt(k). To compute the Bayes-Factor the z-score was converted into a t-value with total N – 2 degrees of freedom. The meta-analysis shows strong support for an effect, z = 5.92, and the Bayes-Factor in favor of the hypothesis is greater than 1 million to 1.
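The meta-analytic z-score in the last row can be verified with a few lines of code; this is a minimal sketch of my own (not the original analysis script), using only the reported two-tailed p-values:

```python
import numpy as np
from scipy.stats import norm

# two-tailed p-values of Studies 1-6 (Savani & Rattan, 2012)
p_values = np.array([0.027, 0.014, 0.028, 0.006, 0.011, 0.019])

z = norm.isf(p_values / 2)            # convert p-values to z-scores
z_meta = z.sum() / np.sqrt(len(z))    # Stouffer's formula: sum(z) / sqrt(k)
p_meta = norm.sf(z_meta)              # one-tailed p-value of the combined z-score

print(round(z_meta, 2), p_meta)       # ~ 5.92, p far below .001
```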

Thus, frequentist and Bayesian statistics produce converging results. However, both statistical methods assume that the reported statistics are unbiased.  If researchers only present significant results or use questionable research practices that violate statistical assumptions, effect sizes are inflated, which biases p-values and Bayes-Factors alike.  It is therefore necessary to test whether the reported results are biased.  A bias analysis with the Test of Insufficient Variance (TIVA) shows that the data are biased.  TIVA compares the observed variance in z-scores against the expected variance of z-scores due to random sampling error, which is 1.  The observed variance is only Var(z) = 0.04.  A chi-square test shows that the discrepancy between the observed and expected variance would occur rarely by chance alone, p = .001.   Thus, neither p-values nor Bayes-Factors provide a credible test of the hypothesis because the reported results are not credible.

Example 2

Kibbe and Leslie (2011) reported the results of a single study that compared infants’ looking times in three experimental conditions.  The authors first reported the results of a traditional Analysis of Variance that showed a significant effect, F(2, 31) = 3.54, p = .041.  They also reported p-values for post-hoc tests that compared the critical experimental condition with the control condition, p = .021.  They then reported the results of a Bayesian contrast analysis that compared the critical experimental condition with the other two conditions.  They report a Bayes-Factor of 7.4 in favor of a difference between means.  The article does not specify the alternative hypothesis that was tested and the website link in the article does not provide readily available information about the prior distribution of the test. In any case, the Bayesian results are consistent with the ANOVA results. As there is only one study, it is impossible to conduct a formal bias test, but studies with p-values close to .05 often do not replicate.

Conclusion

In conclusion, Bayesian statistics are still rarely used to test research hypotheses; only two articles in the journal Psychological Science have done so. One article reported six studies and presented high Bayes-Factors in five of them to support theoretical predictions. A bias analysis showed that the results in this article are biased and violate basic assumptions of random sampling. This example suggests that Bayesian statistics do not solve the credibility problem in psychology. Bayes-Factors can be gamed just like p-values. In fact, it is even easier to game Bayes-Factors by specifying prior distributions that closely match the observed data in order to report Bayes-Factors that impress reviewers, editors, and readers with a limited understanding of Bayesian statistics. To avoid these problems, Bayesians need to agree on a principled approach to specifying prior distributions. Moreover, Bayesian statistics are only credible if researchers report all relevant results. Thus, Bayesian statistics need to be accompanied by information about the credibility of the data.

References

Kibbe, M., & Leslie, A. (2011). What Do Infants Remember When They Forget? Location and Identity in 6-Month-Olds’ Memory for Objects. Psychological Science, 22, 1500-1505.

Savani, K., & Rattan, A. (2012). A choice mind-set increases the acceptance and maintenance of wealth inequality. Psychological Science, 23, 796-804.

Schimmack, U. (2012).  The ironic effect of significant results on the credibility of multiple-study articles.  Psychological Methods, 17, 551–566.

Schimmack, U. (2015a). The test of insufficient variance (TIVA). Retrieved from https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A., & Williams, R. M., Jr. (1949). Adjustment during army life. Princeton, NJ: Princeton University Press.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi [Commentary on Bem (2011)]. Journal of Personality and Social Psychology, 100, 426–432. doi: 10.1037/a0022790

The Repression of Selective Publishing: 7 Case Studies of Prominent Social Psychologists

Unofficial contribution to the special issue of the Psychologische Rundschau on the replication crisis

In the fall of 2015, Christoph Klauer contacted me and asked whether I wanted to write a contribution for a special issue of the Psychologische Rundschau on the replication crisis in psychology. Moritz Heene and I had taken part in a discussion in the DGPs discussion forum, and I was willing to contribute. The contribution was due at the end of March 2016, and Moritz and I submitted it one week late. We knew that the contribution would provoke strong reactions, because we used several personal case studies to show how many social psychologists try to repress the replication crisis. We were prepared for harsh criticism from reviewers. It never came to that. In an exceedingly understanding and even approving email, Christoph Klauer explained why our contribution did not fit into the planned special issue.

Thank you very much for the interesting and readable manuscript. I read it with pleasure and can agree with most of the points and arguments. I believe this whole debate will do psychology (and hopefully social psychology as well) good, even if some colleagues are still struggling with it at the moment. In my impression, awareness of the harmfulness of some formerly widespread habits and insight into the importance of replications have clearly increased among a great many colleagues over the last two to three years. Unfortunately, for formal reasons the manuscript does not fit well into the planned special issue. (Christoph Klauer, email, April 14, 2016)

Since we put considerable effort into the contribution and it is difficult to publish something in German in other journals, we decided to publish our contribution unofficially, that is, without formal peer review by colleagues. We welcome and appreciate comments and criticism after the fact. We hope that our contribution will lead to further discussion of the replication crisis, especially in social psychology. We believe that our contribution has a simple and clear message: the time of embellished results is over. It is time for psychologists to report their laboratory findings openly and honestly, because embellished results slow down or prevent scientific progress.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

How Credible Is Social Psychology?

Ulrich Schimmack1

Moritz Heene2
1 University of Toronto, Mississauga, Canada

2 Learning Sciences Research Methodologies, Department of Psychology, Ludwig-Maximilians-Universität München

Abstract

A large replication project of 100 studies showed that only 25% of social psychological studies and 50% of cognitive psychological studies could be replicated. This finding is consistent with evidence that statistical power is often low and that journals report only significant results. This problem has been known for 60 years and explains the results of the replication project. We show here how prominent social psychologists have responded to this finding. Their comments distract from the core problem of publication bias and try to put a positive spin on the result. We rebut these arguments and call on psychologists to report their research findings openly and honestly.

Keywords: replication crisis, replicability, power

How Credible Is Social Psychology?

In 2011, the credibility of social psychology was called into question by two events. First, it turned out that the social psychologist Diederik Stapel had fabricated data on a massive scale. By now, more than 50 of his articles have been retracted (Retraction Watch, 2015). Then the Journal of Personality and Social Psychology published an article that supposedly showed that extraverted people have extrasensory abilities and that test performance can be improved by studying after the test (Bem, 2011). Soon afterwards, researchers pointed out statistical problems with the reported results, and replication studies failed to reproduce them (Francis, 2012; Galak, LeBoeuf, Nelson, & Simmons, 2012; Schimmack, 2012). In this case the data were not fabricated; rather, Bem most likely collected and analyzed his data the way many social psychologists have learned to do. This raised the question of how credible other findings in social psychology are (Pashler & Wagenmakers, 2012).

When several researchers failed to replicate the “elderly priming” effects, the Nobel laureate Daniel Kahneman predicted a crisis (Yong, 2012). In 2015 this crisis arrived. Under the leadership of Brian Nosek, hundreds of psychologists attempted to replicate 100 findings that had been published in 2008 in three renowned journals (Journal of Experimental Psychology: Learning, Memory, and Cognition; Journal of Personality and Social Psychology; and Psychological Science) (Open Science Collaboration, 2015). Whereas 97% of the original studies reported a significant result, the success rate in the replication studies was markedly lower at 35%. There was, however, also a difference between disciplines: the replication rate was 50% for cognitive psychology but only 25% for social psychology. Since we focus on social psychology in this article, the question is how the 25% replication rate should be interpreted.

Selective publication of significant results

More than 50 years ago, Sterling (1959) pointed out that the success rate in psychology journals is implausibly high and hypothesized that publication bias is responsible for it. Three decades later, Sterling and colleagues showed that the success rate was still above 90% (Sterling et al., 1995). Their article also made clear that this success rate is inconsistent with estimates of statistical power in psychology. In the best case, psychologists always have the correct alternative hypothesis (the null hypothesis is always false). If this is so, the success rate in a series of studies is determined by statistical power. This follows from the definition of statistical power as the relative frequency of studies in which the sample effect size produces a statistically significant result. If the studies differ in power, the success rate is a function of average power. Cohen (1962) estimated that social psychological studies have about 50% power to obtain a significant result with an alpha level of 5%. Sedlmeier and Gigerenzer (1989) replicated this estimate 25 years later, and there is no indication that typical power has increased since then (Schimmack, 2015c). If the true probability of success is 50% and the reported success rate of published studies is close to 100%, it is clear that publication bias contributes to the high success rate in social psychology.
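To make this point concrete, the following minimal simulation sketch (our own illustration; the effect size and sample size are chosen only to yield roughly 50% power) shows that the success rate of honestly reported studies cannot exceed their average power:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_per_group, d, alpha, n_studies = 32, 0.49, 0.05, 100_000  # chosen for ~50% power

# simulate many two-group studies (z-test approximation for simplicity)
m1 = rng.normal(d, 1, size=(n_studies, n_per_group)).mean(axis=1)
m2 = rng.normal(0, 1, size=(n_studies, n_per_group)).mean(axis=1)
se = np.sqrt(2 / n_per_group)
p = 2 * norm.sf(np.abs((m1 - m2) / se))

print(round(np.mean(p < alpha), 2))  # ~ .50: without selection, about half of all studies fail
```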

Publication bias exists when significant results are published while non-significant results remain unpublished. The term publication bias, however, does not explain how the selection of significant results comes about. When Sterling wrote his first article on the topic in 1959, it was common for an article to report a single study. In that case it is possible that several researchers run a similar study, but only those who were lucky and observed a significant result submit their findings for publication. Social psychologists were aware of this problem. It therefore became customary for an article to report several studies. Bem (2011), for example, reported 10 studies, 9 of which had a significant result (with alpha = 5%, one-tailed). It is extremely unlikely that luck repeats itself so many times, so luck alone cannot explain the high success rate in Bem's and other multiple-study articles (Schimmack, 2012). To obtain 6 or more successes when the probability of success is only 50%, researchers have to give luck a helping hand. There is a range of data-collection and data-analysis practices that artificially increase the probability of success (John, Loewenstein, & Prelec, 2012). What these questionable practices have in common is that more results are produced than are reported: either entire studies go unreported, or only those analyses are reported that produced a significant result. Some social psychologists have openly admitted using these questionable practices in their research (e.g., Inzlicht, 2015).

There is thus a simple explanation for the large discrepancy between the reported success rate in social psychology journals and the low replication rate in the reproducibility project: social psychologists run far more statistical tests than are reported in the journals, but only the tests that confirm a hypothesis get reported. One does not need to be a philosopher of science to see that publication bias is a problem, yet at least US American social psychologists have talked themselves into believing that the selection of significant results is not a problem. Bem (2010, p. 5) wrote "Let's err on the side of discovery," and this chapter was used in many methods courses to teach doctoral students research methods.

Are there other explanations?

Ironically, the public reaction of some social psychologists to the results of the replication project can be explained well by psychological theories of repression (see Figure 1). The word "publication bias" hardly appears in statements by social psychologists, such as the official statement of the Deutsche Gesellschaft für Psychologie (DGPs). The uncomfortable truth that the credibility of many findings is in doubt apparently seems too threatening to be dealt with openly. Yet dealing with it openly is necessary so that the next generation of social psychologists does not repeat the mistakes of their advisors. In a series of case studies, we point out the flaws in the arguments of social psychologists who apparently do not want to acknowledge selection bias.

 


Figure 1. Non-significant results are repressed.

Case study 1: The 25% success rate cannot be interpreted (Alison Ledgerwood)

Alison Ledgerwood is a prominent US American social psychologist who has published on the credibility of social psychology (Ledgerwood & Sherman, 2012). She also wrote a blog post about the results of the replication project and claims that the replication rate of 36% cannot be interpreted ("36, it turned out, was the answer. It's just not quite clear what the question was"). Her main argument is that it is unclear how many successful replications one could have expected. Certainly not 100%. Perhaps it is more realistic to expect only 25% successful replications for social psychology. In that case the actual success rate matches the expected success rate perfectly; a one hundred percent success. But why should we expect a success rate of 25%? Why not 10%? Then the actual success rate would even be 150% higher than the expected one. That would be even better. It is, after all, an old piece of wisdom that low expectations increase happiness, so it makes sense for the well-being of social psychologists to lower expectations. However, this low expectation is incompatible with the nearly perfect success rate in the journals. Alison Ledgerwood ignores the discrepancy between the public success rate and the implicit true success rate in social psychology laboratories.

Ledgerwood further claims that the replication studies had low power and that one therefore could not expect a high success rate. She overlooks, however, that many replication studies had larger samples than the original studies, which means that the power of the original studies was on average lower than the power of the replication studies. It therefore remains unclear how the original studies, with less power, could achieve a much higher success rate.

Case study 2: Failed replications are normal (Lisa Feldman Barrett)

In a commentary in the New York Times, Lisa Feldman Barrett wrote that it is normal for a replication study not to replicate an original finding. According to her, the results of the replication project merely show that social psychology checks the credibility of its findings and corrects errors. This argument ignores the fact that the selective publication of significant results increases the error rate. While results are reported as if the probability of a false effect is at most 5% (i.e., one expects 5% significant results if the null hypothesis is always true and all results are reported), the true error probability is considerably higher. Statistics courses teach that researchers should plan studies so that a significant effect will be observed with 80% probability if the alternative hypothesis is true. Running studies with 25% power and then reporting only those results that became significant with the help of chance/sampling error is not scientific. The 25% replication rate is therefore not a sign that everything is fine in social psychology. The colleagues in classic cognitive psychology (not in neuropsychology) at least manage 50%. Even 50% is not particularly good. The renowned psychologists Tversky and Kahneman (1971) described a power of 50% as "ridiculously low." The authors go even further and question the scientific seriousness of researchers who knowingly run studies with less than 50% power ("We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis," p. 110).

Case study 3: The true success rate is 68% (Stroebe & Hewstone)

In a commentary for the Times Higher Education, Stroebe and Hewstone (2015) claim that the 25% success rate is not particularly informative. At the same time, they point out that it is possible to conduct a meta-analysis of the original studies and the replication studies. This analysis was already carried out in the original Science publication (OSC, 2015) and yields an estimated replication rate of 68%. Stroebe and Hewstone find this remarkable and interpret it as the better estimate of the replicability of social psychological findings ("In other words, two-thirds of the results could be replicated when evaluated with a simple meta-analysis that was based on both original and replication studies"). It is, however, not possible to combine the success rate of the original studies with the replication studies in this way to estimate the replicability of original studies in social psychology, because the selection bias in the original studies is not corrected; the effect sizes remain inflated, which leads to an overestimation of replicability. The replication studies have no selection bias because they were conducted precisely to examine the replicability of original studies in psychology. The replication rate of the replication studies can therefore be interpreted directly as an estimate of replicability. The result for social psychology is a rate of 25%, not 68%.

Case study 4: The results of the replication project do not show that social psychology is untrustworthy (official statement of the DGPs)

Apparently in response to critical articles in the media, the DGPs board members felt compelled to publish an official statement. This statement was criticized by some DGPs members, which led to a public, moderated discussion. The official statement claims that the replication rate of 36% (for cognitive and social psychology combined) is no reason to question the credibility of psychological research.

"When media coverage sometimes puts the number '36%' at the center and uses it as evidence of the poor replicability of psychological effects, this does not mean that the results reported in the original studies are wrong or untrustworthy. The authors of the SCIENCE article emphasize this as well." (DGPs, 2015)

It is indeed important to distinguish between two interpretations of a replication study with a non-significant result. One interpretation is that the replication study shows that an effect does not exist. Another interpretation is that the replication study shows that the original study provides no or insufficient evidence for an effect, even if the effect exists. It is possible that the media and the public interpreted the 36% success rate to mean that 64% of the original studies provided false evidence for an effect that does not exist. This interpretation is wrong, because it is impossible to show that an effect does not exist; it is only possible to show that it is very unlikely that an effect of a certain size exists. For example, the sampling error for the comparison of two means with 1,600 participants is .05 standard deviations (Cohen's d = .05). If the dependent variable is standardized, the 95% confidence interval around 0 ranges from -.10 to +.10. If the difference between the means lies within this interval, one can conclude that there is at most a weak effect. Because the samples in the replication studies were often too small to rule out weak effects, the results say nothing about the number of false findings in the original studies. This does not mean, however, that the results of the original studies are credible or trustworthy. Since many results were not replicated, it remains unclear whether these effects exist. The question of how often the original studies correctly predicted the direction of a true mean difference must therefore be distinguished from the replication rate, and the replication rate is 25%, even if further studies with larger samples might achieve a higher success rate.
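The numerical example above can be checked in a few lines (a sketch of our own; the approximation SE = sqrt(4/N) assumes two groups of equal size and a standardized outcome):

```python
import numpy as np

n_total = 1600
se_d = np.sqrt(4 / n_total)          # approximate standard error of a standardized mean difference
ci_half_width = 1.96 * se_d
print(round(se_d, 2), round(ci_half_width, 2))  # 0.05 and ~0.10: the 95% CI around zero spans -.10 to +.10
```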

Case study 5: The reduction of the success rate is a normal statistical phenomenon (Klaus Fiedler)

In the DGPs discussion forum, Klaus Fiedler offered another explanation for the low replication rate. (The entire discussion can be retrieved at https://dl.dropboxusercontent.com/u/3670118/DGPs_Diskussionsforum.pdf.)

Klaus Fiedler referred in particular to a figure in the Science article that plots the effect sizes of the replication studies as a function of the effect sizes in the original studies. The figure shows that the effect sizes of the replication studies are on average lower than the effect sizes in the original studies. The article reports that the average effect size dropped from r = .40 (d = 1.10) to r = .20 (d = .42). The published effect sizes therefore overestimate the true effect sizes by more than 100%.

Klaus Fiedler claims that this is not an empirical phenomenon but merely reflects the well-known statistical phenomenon of regression to the mean ("On a-priori-grounds, to the extent that the reliability of the original results is less than perfect, it can be expected that replication studies regress toward weaker effect sizes. This is very common knowledge").

Klaus Fiedler further claims that effect sizes can shrink even when there is no selection bias.

The only necessary and sufficient condition for regression (to the mean or toward less pronounced values) is a correlation less than zero [Fiedler probably meant to write less than one]. This was nicely explained and proven by Furby (1973). We all “learned” that lesson in the first semester, but regression remains a counter-intuitive thing.

We were surprised to read that regression to the mean can occur even without selection. This would mean, for example, that we only need to measure body weight with imprecise instruments and the average weight would be lower at the second measurement. If that were the case, we could use the "regression diet" to lose a few kilos. Unfortunately, this is just wishful thinking, just as Klaus Fiedler wishes that the 25% replication rate were not a problem for social psychology. We looked up Klaus Fiedler's source and found that Furby's example and proof explicitly presuppose selection (Furby, 1973, p. 173): "Now let us choose a certain aggression level at Time 1 (any level other than the mean)" (emphasis added by the authors). Furby (1973) thus shows exactly the opposite of what Fiedler cited as support for explaining the results solely by regression to the mean.

For the sake of completeness, we refute here once more the claim that a correlation of less than 1 between the effect sizes of the original studies and the replication studies is by itself sufficient to explain the results of the reproducibility project. Let us first use the definition of regression to the mean given, for example, by Shepard and Finison (but see also Maraun, 2011, for a comprehensive treatment). The amount of regression to the mean is given by (1 - r) x (M - µ), where r is the correlation between the first and second measurement, µ is the mean effect size in the population, and M is the mean in the selected group, here the average effect size of the original studies. See Shepard and Finison (1983, p. 308): "The term in square brackets, the product of two factors, is the estimated reduction in BP [blood pressure] due to regression." Is a correlation of less than 1 between the observed effect sizes of the original studies and those from the reproducibility project, then, a necessary and sufficient condition, as Fiedler wrote? Propositional logic gives us the following definitions:

Necessary:

~p -> ~q

where "~" denotes negation.

Applied to the mathematical definition of regression to the mean given above, this means:

If r is not less than 1, regression to the mean does not occur. This statement is true, as can be seen from the formula above.

Sufficient:

p -> q

If r is less than 1, regression to the mean occurs. This statement is false, as can again be seen from the formula above. On this point we also wrote in the DGPs forum: "If, for example, r = .80 (i.e., less than one, as Fiedler presupposed) and the mean of the selected group equals the population mean, M = µ, for example M = µ = .40, then no regression effect occurs, because (1 - .80) x (.40 - .40) = .20 x 0 = 0. Consequently, the condition r < 1 is a necessary but not a sufficient condition for regression to the mean. Only if r < 1 and M is unequal to µ does this effect occur."
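The formula can also be evaluated directly. The following sketch (our own illustration, using the same numbers as in the quoted forum post) shows that a correlation below 1 produces no regression without selection:

```python
def regression_to_mean(r, m_selected, mu):
    """Expected shrinkage (1 - r) * (M - mu) of a selected group mean toward the population mean."""
    return (1 - r) * (m_selected - mu)

print(round(regression_to_mean(r=0.80, m_selected=0.40, mu=0.40), 2))  # 0.0  -> no selection, no regression
print(round(regression_to_mean(r=0.80, m_selected=0.40, mu=0.20), 2))  # 0.04 -> regression only with selection
```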

Fiedler's regression argument is therefore in perfect agreement with our explanation of the low success rate in the replication project. The high success rate in the original studies rests on a selection of significant results that became significant with the help of chance. Without the help of chance, the effect sizes regress toward the true mean and the success rate drops. What is astonishing and troubling is only how strong the selection effect is and how low the true success rate in social psychology is.

Case study 6: The authors of the replication project were incompetent (Gilbert)

Recently, Daniel Gilbert and colleagues published a critique of the OSF Science article (Gilbert, King, Pettigrew, & Wilson, 2016). In the Harvard Gazette, Gilbert claims that the replication project made serious mistakes and that the negative implications for the credibility of social psychology are entirely unjustified ("the OSC made some serious mistakes that make its pessimistic conclusion completely unwarranted") (Reuell, March 2016). Gilbert et al. present their own calculations and claim that the results are compatible with a true success rate of 100% ("When this error is taken into account, the number of failures in their data is no greater than one would expect if all 100 of the original findings had been true.").

Gilbert et al. (2016) advance three arguments to call the results of the replication project into question:

The first argument is that the authors analyzed the data incorrectly. This argument fails for two reasons. First, Gilbert et al. avoid mentioning the 25% success rate. This result requires no deep knowledge of statistical methods; counting is enough, and the Science article reports the correct success rate of 25% for social psychology. To distract from this clear result, Gilbert et al. focus their critique on a comparison of the effect sizes in the original studies and the replication studies. This comparison is not very informative, however, because the confidence intervals of the original studies are very wide. If a study reports an effect size of .8 standard deviations and the finding is only just significant (with alpha = 5%), the 95% confidence interval ranges from a little above zero to 1.6 standard deviations. Even if the replication study showed an effect of zero, this result would not differ significantly from the result of the original study, because the effect size of the replication study also has sampling error and its confidence interval overlaps with that of the original study. If one applies this method, even a true null result counts as a successful replication of a strong original effect. This makes no sense, whereas it is entirely reasonable to question an original finding when a replication study cannot reproduce it. In any case, the comparison of confidence intervals changes nothing about the fact that the success rate shrank from close to 100% to 25%.
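A small sketch (our own illustration, assuming a just-significant two-sided test) shows how wide the confidence interval of such an original result is and why even an effect of zero falls inside it:

```python
d_original = 0.8
se_original = d_original / 1.96      # "just significant" two-tailed test implies d / SE = 1.96
ci = (d_original - 1.96 * se_original, d_original + 1.96 * se_original)
print([round(x, 2) for x in ci])     # [0.0, 1.6]: a replication estimate of 0 lies inside this interval
```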

The second argument is that the replication studies had too little power to test the replicability of the original studies. As already mentioned, the power of the replication studies was on average higher than the power of the original studies, so the replication studies had a better chance of replicating the original results than the original studies themselves. The low replication rate of 25% therefore cannot be attributed to insufficient power in the replication studies. Instead, the high success rate in the original studies can be explained by selection bias. Gilbert et al., however, avoid mentioning selection bias and explaining how the original studies achieved their significant results.

The third argument carries somewhat more weight. Gilbert et al. questioned the quality of the replication studies. The OSF authors had stated that they worked closely with the authors of the original studies when planning the replication studies and that the original authors approved the replication protocol ("The replication protocol articulated the process of … contacting the original authors for study materials, … obtaining review of the protocol by the original authors, …", p. 349). Gilbert et al. found, however, that some studies were not reviewed by the original authors or that the original authors had concerns. Gilbert et al. also found several examples in which the replication study was conducted in a different language, which raises questions about the equivalence of the studies. This raises the question of whether the different success rates can be attributed to a lack of equivalence. We examine this question more closely in the next case study. Overall, however, Gilbert et al.'s arguments are weak. Two of them are simply wrong, and the difficulty of conducting exact replications in psychology does not mean that the low success rate of 25% can simply be ignored. Many real findings, such as the anchoring effect, replicate well even when the study is run in different countries (Klein et al., 2014). Moreover, the need for strict equivalence of experimental conditions does not increase the credibility of social psychological studies. If these results depend heavily on the experimental conditions, it is unclear under which conditions the findings can be replicated at all. Since the participants are often students at the university where a social psychologist happens to work, it remains unclear whether the findings can also be replicated at other universities or with participants who are not students. Even if the 25% success rate underestimates the success rate of strict replications, it remains troubling that it is so difficult to reproduce original findings.

Case study 7: Replication studies cannot be interpreted (Strack)

In a highly cited article, Fritz Strack and Wolfgang Stroebe (2014) questioned the value of replication studies. The article was published before the results of the OSF replication project were known, but the project was known to the authors. The authors first question whether social psychology has a replication crisis at all and conclude that there is not enough evidence to speak of a crisis ("We would argue that such a conclusion is premature," p. 60). The replication project has now delivered the evidence. However, Strack and Stroebe claim that these results can be ignored because the researchers made the mistake of replicating the original studies as exactly as possible (the exact opposite of Gilbert's argument that the studies were too different).

Strack and Stroebe argue that social psychology primarily tests general theories. If, however, a general theory is always tested only under the same conditions, it is unclear whether the theory is really valid ("A finding may be eminently reproducible and yet constitute a poor test of a theory," p. 60). That is true, but the problem of social psychology is precisely that original results could not be replicated even under conditions that were kept as similar as possible. And if the replication project had changed the experiments, these changes would presumably have been blamed for the low replication rate (see case study 6). This criticism of exact replications is therefore highly illogical.

The authors even confirm this themselves when they point out that replications are valuable when a study produces very new and unexpected findings ("Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework," p. 61). They cite a famous priming study as an example ("It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper," p. 61). Indeed, Bargh et al. (1996) reported the results of two identical studies with 30 participants each. Both studies show a significant result, which makes it very unlikely that it was a chance finding: whereas one study has an error probability of 5% (1/20), the probability for two studies is much smaller, 0.25% (1/400). If these results are not limited to the specific conditions in Bargh's laboratory in the years 1990 to 1995, however, further studies should also show the effect. But when other scientists did not find the effect, this was interpreted as a failure of the replication study ("It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996)," p. 62). It is, however, just as possible that Bargh did not report all results of his five-year research program and that selection bias contributed to the significant results reported in the original article. This possibility is not mentioned by Strack and Stroebe (2014), as if selection bias did not exist.

The repression of selection bias leads to further questionable claims. Strack and Stroebe claim that a non-significant result in a replication study must be interpreted as an interaction effect ("In the ongoing discussion, 'failures to replicate' are typically taken as a threat to the existence of the phenomenon. Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted"). This claim is simply wrong, and the authors should know this from their own research. In the classic 2 x 2 design of social psychology, one can only speak of an interaction when the interaction is statistically significant. If, by contrast, two groups show a significant difference and two other groups show no significant difference, this can also be a chance event: either the significant difference is a Type I error or the non-significant difference is a Type II error. It is therefore important to show with a significance test that chance is an unlikely explanation for the divergent results. In the replication project, however, the differences between original and replication results are often not significant.

Strack and Stroebe's line of argument would mean that sampling error does not exist and that every mean difference is therefore meaningful. This line of reasoning leads to the absurd conclusion that there is no sampling error and that social psychological results are 100% correct. That is true as far as sample means are concerned, but the real question is whether an experimental manipulation is responsible for the difference or whether the difference is pure chance. It is therefore not possible to regard the results of original studies as unassailable truths with eternal validity. Especially when selection bias is large, it is possible that many published findings are not replicable.

Concluding remarks

Many articles have been written about how the credibility of psychological research can be increased. We want to make just one suggestion, which is very simple and yet very hard to implement. Psychologists simply need to report all results, significant or non-significant, in the journals (Schimmack, 2012). The selective reporting of success stories is incompatible with the goals of science. Wishful thinking and error are human, but precisely in the social sciences it is important to minimize these human errors. The crisis of social psychology shows, however, how hard it is to remain objective when one's own motives come into play. It is therefore necessary to create clear rules that reduce the influence of these motives on science. The most important rule is that scientists cannot pick and choose which results they report. Bem's (2011) article on extrasensory perception showed clearly how meaningless the scientific method becomes when it is abused. We therefore welcome all initiatives that make the research and publication process in psychology more open and transparent.

References

Barrett, L. F. (2015, September 1). Psychology is not in crisis. The New York Times. Retrieved from http://www.nytimes.com/2015/09/01/opinion/psychology-is-not-in-crisis.html

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Cohen, J. (1962). Statistical power of abnormal–social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. doi:10.1037/h0045186

DGPs (2015). Replikationen von Studien sichern Qualität in der Wissenschaft und bringen die Forschung voran. Retrieved from https://www.dgps.de/index.php?id=143&tx_ttnews%5Btt_news%5D=1630&cHash=6734f2c28f16dbab9de4871525b29a06

Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9

Fiedler, K. (2015). Retrieved from https://dl.dropboxusercontent.com/u/3670118/DGPs_Diskussionsforum.pdf

Furby, L. (1973). Interpreting regression toward mean in developmental research. Developmental Psychology, 8, 172-179.

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012).  Correcting the Past: Failures to Replicate Psi.  Journal of Personality and Social Psychology, 103, 933-948.  DOI: 10.1037/a0029709

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science”. Science, 351(6277), 1037.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Ledgerwood, A. (2016). 36 is the new 42. Retrieved from http://incurablynuanced.blogspot.ca/2016/02/36-is-new-42.html

Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and problematic? The rise of the short report in psychological science. Perspectives on Psychological Science, 7, 60–66. doi:10.1177/1745691611427304

Maraun, M. D., Gabriel, S., & Martin, J. (2011). The mythologization of regression towards the mean. Theory & Psychology, 21(6), 762-784. doi: 10.1177/0959354310384910

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?  Perspectives on Psychological Science, 7, 528-530.  DOI: 10.1177/1745691612465253

Retraction Watch. (2015). Diederik Stapel now has 58 retractions. Retrieved from http://retractionwatch.com/2015/12/08/diederik-stapel-now-has-58-retractions/

Reuell, P. (2016). Study that undercut psych research got it wrong. Harvard Gazette. Retrieved from http://news.harvard.edu/gazette/story/2016/03/study-that-undercut-psych-research-got-it-wrong/

Schimmack, U. (2012).  The ironic effect of significant results on the credibility of multiple-study articles.  Psychological Methods, 17, 551–566.

Schimmack, U. (2015a). The test of insufficient variance (TIVA). Retrieved from https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Schimmack, U. (2015b). Introduction to the Replicability Index. Retrieved from https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index/

Schimmack, U. (2015c). Replicability report for Psychological Science. Retrieved from https://replicationindex.wordpress.com/2015/08/15/replicability-report-for-psychological-science/

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316. doi:10.1037/0033-2909.105.2.309

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers.  Psychological Bulletin, 76, 105-110.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice-versa. American Statistician, 49, 108–112. doi:10.2307/2684823

Stroebe, W., & Hewstone, M. (2015). What have we learned from the Reproducibility Project? Times Higher Education. Retrieved from https://www.timeshighereducation.com/opinion/reproducibility-project-what-have-we-learned

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71.

Yong, E. (2012, October 3). Nobel laureate challenges psychologists to clean up their act: Social-priming research needs “daisy chain” of replication. Nature. Retrieved from http://www.nature.com/news/nobel-laureate-challenges-psychologists-to-clean-up-their-act-1.11535

Open Ego-Depletion Replication Initiative

Dear Drs. Baumeister and Vohs,

Perspectives on Psychological Science published the results of “A Multi-Lab Pre-Registered Replication of the Ego-Depletion Paradigm Reported in Sripada, Kessler, and Jonides (2014).” The main finding of this replication project was a failure to demonstrate the ego-depletion effect across multiple labs with a large combined sample size.

You wrote a response to this finding (Baumeister & Vohs, in press).   In your response, you highlight several problems with the replication studies and conclude that the results only show that the specific experimental procedure used for the replication studies failed to demonstrate ego-depletion.

At the same time, you maintain that ego-depletion is a robust phenomenon that has been demonstrated repeatedly for two decades: “for two decades we have conducted studies of ego depletion carefully and honestly, following the field’s best practices, and we find the effect over and over.”

It is regrettable that the recent RRR project failed to show any effect for ego-depletion because the researchers used a paradigm that you never approved and never used in your own two decades of successful ego-depletion research.

I would like to conduct my own replication studies using paradigms that have reliably produced ego-depletion effects in your laboratories. As a single paradigm may fail for unknown reasons, I would like to ask you kindly to identify three paradigms that, based on your own experience, have reliably produced the ego-depletion effect in your own laboratory and that can be expected to produce the effect in other laboratories.

To plan sample sizes for my replication studies, it is also important that you provide an estimate of the effect size. A meta-analysis by Hagger et al. (2010) suggested that the average ego-depletion effect size is d = .6, but a bias-corrected estimate suggests that the effect size may be as small as d = .2 (Carter & McCullough, 2014). I would hate to end up with a non-significant result simply because my replication studies were underpowered. What effect size would you expect based on your two decades of successful studies?

Sincerely,
Dr. Ulrich Schimmack

Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136, 495–525. doi:10.1037/a0019486

Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, 823. doi:10.3389/fpsyg.2014.00823

Sripada, C., Kessler, D., & Jonides, J. (2014). Methylphenidate blocks effort-induced depletion of regulatory control in healthy volunteers. Psychological Science, 25, 1227-1234. doi: 10.1177/0956797614526415

Estimating Replicability of Psychological Science: 35% or 50%

Examining the Basis of Bakker, van Dijk, and Wicherts’s 35% Estimate.

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554.

BDW’s article starts with the observation that psychological journals publish mostly significant results, but that most studies lack the statistical power to produce so many significant results (Sterling, 1959; Sterling et al., 1995). The heading for the paragraph that makes this claim is “Authors Are Lucky!”

“Sterling (1959) and Sterling, Rosenbaum, and Weinkam (1995) showed that in 97% (in 1958) and 96% (in 1986–1987) of psychological studies involving the use of NHST, H0 was rejected at α = .05” (p. 543).

“The abundance of positive outcomes is striking because effect sizes (ESs) in psychology are typically not large enough to be detected by the relatively small samples used in most studies (i.e., studies are often underpowered; Cohen, 1990).” (p. 543).

It is true that power is an important determinant of the rate of significant results that a series of experiments will produce. However, power is defined as the probability of obtaining a significant result when an effect is present; it is not defined when the null-hypothesis is true. As a result, average power sets the maximum rate of significant results that can be expected, and this maximum is reached only when the null-hypothesis is always false (Sterling et al., 1995).

Although it has been demonstrated that publication bias exists and that publication bias contributes to the high success rate in psychology journals, it has been more difficult to estimate the actual rate of significant results that one would expect without publication bias.

BDW provides an estimate and the point of this blog post is to examine their method of obtaining an unbiased estimate of statistical power, which sets an upper limit for the success rate of psychological studies published in psychology journals.

BDW begin with the observation that statistical power is a function of (a) the criterion for statistical significance (alpha), which is typically p < .05 (two-tailed), (b) sampling error, which decreases with increasing sample size, and (c) the population effect size.

The nominal significance level and sample size are known parameters. BDW suggest that the typical sample size in psychology is N = 40.

“According to Marszalek, Barber, Kohlhart, and Holmes (2011), the median total sample size in four representative psychological journals (Journal of Abnormal Psychology, Journal of Applied Psychology, Journal of Experimental Psychology: Human Perception and Performance, and Developmental Psychology) was 40. This finding is corroborated by Wetzels et al. (2011), who found a median cell size of 24 in both between- and within-subjects designs in their large sample of t tests from Psychonomic Bulletin & Review and Journal of Experimental Psychology: Learning, Memory and Cognition.”

The N = 40 estimate has two problems. First, it is not based on a representative sample of studies across all areas of psychology: sample sizes are often smaller than N = 40 in animal studies and larger in personality psychology (Fraley & Vazire, 2014). Second, the research design also influences sampling error. In a one-sample t-test, N = 40 implies a sampling error of 1/sqrt(40) = .158, and an effect size of d = .33 would be significant, t(39) = .33/.158 = 2.09, p = .043. In contrast, sampling error in a between-subject design is 2/sqrt(40) = .316, and an effect size of d = .65 is needed to obtain a significant result, t(38) = .65/.316 = 2.06, p = .047. Thus, power calculations have to take into account what research design was used. N = 40 can be adequate to study moderate effect sizes (d = .5) with a one-sample design, but not with a between-subject design.
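To make the role of the design concrete, the short sketch below (my own illustration, not code from BDW or the original article) computes the smallest effect size that reaches p < .05 (two-tailed) with a total N of 40 in each design.

```python
from scipy import stats

N = 40  # total sample size

# One-sample (or within-subject) design: the standard error of d is 1/sqrt(N)
se_one = 1 / N ** 0.5
t_crit_one = stats.t.ppf(0.975, N - 1)        # two-tailed, alpha = .05
d_min_one = t_crit_one * se_one               # smallest just-significant effect size

# Between-subject design with 20 per group: the standard error of d is about 2/sqrt(N)
se_between = 2 / N ** 0.5
t_crit_between = stats.t.ppf(0.975, N - 2)
d_min_between = t_crit_between * se_between

print(f"one-sample design:      d >= {d_min_one:.2f}")      # about .32
print(f"between-subject design: d >= {d_min_between:.2f}")  # about .64
```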

The major problem for power estimation is that the population effect size is unknown. BDW rely on meta-analyses to obtain an estimate of the typical effect size in psychological research. There are two problems with this approach. First, meta-analyses often fail to correct for publication bias; as a result, meta-analytic estimates can be inflated. Second, meta-analyses may focus on research questions with small effect sizes because large effects are so obvious that they do not require a meta-analysis to examine whether they are real. With these caveats in mind, meta-analyses are likely to provide some valid information about the typical population effect size in psychology. BDW arrive at an estimate of d = .50, which Cohen considered a medium effect size.

“The average ES found in meta-analyses in psychology is around d = 0.50 (Anderson, Lindsay, & Bushman, 1999; Hall, 1998; Lipsey & Wilson, 1993; Meyer et al., 2001; Richard, Bond, & Stokes-Zoota, 2003; Tett, Meyer, & Roese, 1994).”

Based on a sample size of N = 40 and a typical effect size of d = .50, the authors arrive at an estimate of 35% power; that is, a typical study has only a 35% probability of producing a significant result, and a published significant result has only a 35% probability of being significant again in an exact replication study (with the same sample size and power as the original study). The problem with this estimate is that BDW assume that all studies use the low-power, between-subject (BS) design.

“The typical power in our field will average around 0.35 in a two independent samples comparison, if we assume an ES of d = 0.50 and a total sample size of 40” (p. 544).
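The 35% figure can be closely reproduced with the noncentral t-distribution (it comes out at about .34 in an exact calculation). The sketch below (my own illustration) also shows how much higher power would be for the same total N and effect size in a one-sample (within-subject) design, which underlines why the design assumption matters.

```python
from scipy import stats

def power_two_sided(ncp, df, alpha=0.05):
    """Power of a two-tailed t-test for a given noncentrality parameter and df."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    noncentral_t = stats.nct(df, ncp)
    return (1 - noncentral_t.cdf(t_crit)) + noncentral_t.cdf(-t_crit)

d, N = 0.50, 40

# Between-subject design, n = 20 per group: ncp = d * sqrt(n / 2)
n = N // 2
power_between = power_two_sided(d * (n / 2) ** 0.5, df=N - 2)

# One-sample (within-subject) design: ncp = d * sqrt(N)
power_within = power_two_sided(d * N ** 0.5, df=N - 1)

print(f"between-subject: {power_between:.2f}")  # about 0.34
print(f"one-sample:      {power_within:.2f}")   # about 0.87
```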

The authors nevertheless generalize from this between-subject scenario to all areas of research.

“This low power in common psychological research raises the possibility of a file drawer (Rosenthal, 1979) containing studies with negative or inconclusive results.” (p. 544).

Unfortunately, the authors ignore important work that contradicts their conclusions. Most importantly, Cohen (1962) provided the first estimate of statistical power in psychological research. He did not conduct an explicit meta-analysis of psychological research, but he suggested that an effect size of half a standard deviation is a moderate effect size. This standardized effect size measure was later named after him (Cohen’s d). As it turns out, the effect size used by BDW, Cohen’s d = .50, is the same effect size that Cohen used for his power analysis (he also proposed similar criteria for other effect size measures). Cohen (1962) arrived at a median power estimate of 50% to detect a moderate effect size. This estimate was replicated by Sedlmeier and Gigerenzer (1989), who also conducted a meta-analysis of power estimates and found that power in some other research areas was higher, with an average of 60% power to detect a moderate effect size.

One major factor that contributes to the discrepancy between BDW’s estimate of 35% power and other power estimates in the range from 50 to 60% power is that BDW estimated sample sizes on the basis of journals that use within-subject designs, but conducted the power analysis with a between-subject design. In contrast, Cohen and others used the actual designs of studies to estimate power. This approach is more labor-intensive, but provides more accurate estimates than an approach that assumes that all studies use between-subject designs.

CONCLUSION

In conclusion, the 35% estimate underestimates the typical power in psychological studies. Given that BDW and Cohen made the same assumption about the typical population effect size, Cohen’s approach of taking the actual research designs into account is more accurate, and estimates based on his method should be preferred. These estimates are closer to 50% power.

However, even the 50% estimate is just an estimate that requires further validation research. One limitation is that the accuracy of the meta-analytic estimation method is unknown. Another problem is that power assumes that an effect is present, but in some studies the null-hypothesis is true. Thus, even if the typical power of studies were 50%, the actual success rate would be lower.

Unless better estimates become available, it is reasonable to assume that at best 50% of published significant results will replicate in an exact replication study. With reported success rates close to 100%, this means that researchers routinely obtain non-significant results in studies that would have been published if they had produced significant results. This large file-drawer of unreported studies inflates reported effect sizes, increases the risk of false-positive results, and wastes resources.

MY JOURNEY TOWARDS ESTIMATION OF REPLICABILITY OF PSYCHOLOGICAL RESEARCH

BACKGROUND

About 10 years ago, I became disillusioned with psychology; mostly social psychology broadly defined, which is the main area of psychology in which I specialized. Articles published in top journals became longer and longer, with more and more studies, and more and more incredible findings that made no sense to me and that I could not replicate in my own lab.

I also became more familiar with Jacob Cohen’s criticism of psychology and the concept of power. At some point during these dark years, I found a short article in the American Statistician that changed my life (Sterling et al., 1995). The article presented a simple formula and explained that the high success rate in psychology journals (over 90% of reported results confirm authors’ predictions) are incredible, unbelievable, or unreal. Of course, I was aware that publication bias contributed to these phenomenal success rates, but Sterling’s article suggested that there is a way to demonstrate this with statistical methods.

Cohen (1962) estimated that a single study in psychology has only 50% power. This means that a paper with two studies has only a 25% probability of confirming the authors’ predictions in both studies (.50^2 = .25), and an article with four studies has a probability of less than 10% of doing so (.50^4 ≈ .06). Thus, it was clear that many of these multiple-study articles in top journals had to be produced by means of selective reporting of significant results.

I started doing research with large samples and I started ignoring research based on these made-up results. However, science is a social phenomenon and questionable theories about unconscious emotions and attitudes became popular in psychology. Sometimes being right and being popular are different things. I started trying to educate my colleagues about the importance of power, and I regularly questioned speakers at our colloquium about their small sample sizes. For a while using the word power became a running joke in our colloquium, but research practices did not change.

Then came the year 2011. At the end of 2010, psychology departments all over North America were talking about the Bem paper. An article in press at the top journal JPSP was going to present evidence for extrasensory perception. In 9 out of 10 statistical tests, undergraduate students appeared to have precognition of random future events. I was eager to participate in the discussion group at the University of Toronto to point out that these findings are unbelievable, not because we know ESP does not exist, but because it is extremely unlikely to obtain 9 out of 10 significant results without very high power in each study. Using Sterling’s logic, it was clear to me that Bem’s article was not credible.

When I made this argument, I was surprised that some participants in the discussion doubted the logic of my argument more than Bem’s results. I decided to use Bem’s article to make my case in a published article. I was not alone. In 2011 and 2012 numerous articles appeared that pointed out problems with the way psychologists (ab)use the scientific method. Although there are many problems, the key problem is publication bias. Once researchers can select which results they report, it is no longer clear how many reported results are false positive results (Sterling et al., 1995).

When I started writing my article, I wanted to develop a test that reveals selective reporting so that this unscientific practice can be detected and punished, just like a doping test for athletes. Many psychologists do not like to use punishment and think carrots are better than sticks. However, athletes do not get medals for not taking doping and taxpayers do not get a reward for filing their taxes. If selective reporting of results violates the basic principle of science, scientists should not do it, and they do not deserve a reward for doing what they are supposed to be doing.

THE INCREDIBILITY INDEX

In June 2011, I submitted my manuscript to Psychological Methods. After one and a half years and three rounds of reviews my manuscript finally appeared in print (Schimmack, 2012). Meanwhile, Greg Francis had developed a similar method that also used statistical power to reveal bias. Psychologists were not very enthusiastic about the introduction of our doping test.

This is understandable because the use of scientific doping was a widely accepted practice and there was no formal ban on selective reporting of results. Everybody was doing it, so when Greg Francis used the method to target a specific article, the authors felt attacked. Why me? You could have attacked any other article and found the same result.

When Greg Francis did analyze all articles published in the top journal Psychological Science for a specific time period, he found, indeed, that over 80% showed positive signs of bias. So, selective reporting of results is a widely used practice, and it makes no sense to single out a specific article: most articles are produced with selective reporting of results. When Greg Francis submitted his findings to Psychological Science, the article was rejected. It was not rejected because it was flawed; after all, it merely confirmed what everybody already knew, namely that all researchers report only the results that support their theory. It was probably rejected because it was undesirable to document this widely used practice scientifically and to show how common selective reporting is. It was probably more desirable to maintain the illusion that psychology is an exact science with excellent theories that make accurate predictions that are confirmed when they are submitted to an empirical test. In truth, it is unclear how many of these success stories are false and would fail if they were replicated without the help of selective reporting.

ESTIMATION OF REPLICABILITY

After the publication of my 2012 paper, I continued to work on the issue of publication bias. In 2013 I met Jerry Brunner in the statistics department. As a former, disillusioned social psychologist, who got a second degree in statistics, he was interested in my ideas. Like many statisticians, he was skeptical (to say the least) about my use of post-hoc power to reveal publication bias.  However, he kept an open mind and we have been working together on statistical methods for the estimation of power. As this topic has been largely neglected by statisticians, we were able to make some new discoveries and we developed the first method that can estimate power under difficult conditions when publication bias is present and when power is heterogeneous (varies across studies).

In 2015, I learned programming in R and wrote software to extract statistical results from journal articles (PDFs converted into text files). After downloading all articles from 105 journals for a specific time period (2010-2015) with the help of Andrew, I was able to apply the method to over 1 million statistical tests reported in psychology journals. The beauty of using all articles is that the results do not suffer from selection bias (cherry-picking). Of course, the extraction method misses some tests (e.g., tests reported in figures or tables), and the average across journals depends on the selection of journals. But the result for a single journal is based on all tests that can be automatically extracted.
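The extraction step works roughly as follows (a minimal sketch in Python for illustration; the actual software is written in R and handles many more reporting formats): find t- and F-statistics in the article text with regular expressions, compute their two-tailed p-values, and convert them to absolute z-scores as a common metric of strength of evidence.

```python
import re
from scipy import stats

text = "The effect was significant, t(39) = 2.09, p = .043, and F(1, 73) = 4.51, p = .037."

results = []

# t-tests reported as t(df) = value
for df, t in re.findall(r"t\((\d+)\)\s*=\s*([\d.]+)", text):
    p = 2 * stats.t.sf(abs(float(t)), int(df))        # two-tailed p-value
    results.append(("t", float(t), p))

# F-tests reported as F(df1, df2) = value
for df1, df2, f in re.findall(r"F\((\d+),\s*(\d+)\)\s*=\s*([\d.]+)", text):
    p = stats.f.sf(float(f), int(df1), int(df2))      # F-tests are one-tailed
    results.append(("F", float(f), p))

# Convert p-values to absolute z-scores as a common metric of evidence strength
for kind, value, p in results:
    z = stats.norm.isf(p / 2)
    print(f"{kind} = {value:.2f}, p = {p:.3f}, |z| = {z:.2f}")
```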

It is important to realize the advantage of this method compared to typical situations where researchers rely on samples to estimate population parameters. For example, the OSF-reproducibility project selected three journals and a single statistical test from only 100 articles (Open Science Collaboration, 2015). Not surprisingly, the results of the project have been criticized as not being representative of psychology in general or even of the subject areas represented in the three journals. Similarly, psychologists routinely collect data from students at their local university, but assume that the results generalize to other populations. It would be easy to dismiss these results as invalid, simply because they are not based on a representative sample. However, most psychologists are willing to accept theories based on these small and unrepresentative samples until somebody demonstrates that the results cannot be replicated in other populations (or they still accept the theory because they dismiss the failed replication). None of these sampling problems plague research that obtains data for the total population.

When the data were available and the method had been validated in simulation studies, I started using it to estimate the replicability of results in psychological journals. I also used it for individual researchers and for departments. The estimates were in the range from 40% to 70%. This estimate was broadly in line with estimates obtained using Cohen’s (1962) method, which results in power estimates of 50-60% (Sedlmeier & Gigerenzer, 1989). Estimates in this range were consistent with the well-known fact that reported success rates in journals of over 90% are inflated by publication bias (Sterling et al., 1995). It would also be unreasonable to assume that all reported results are false positives, which would imply an estimate of 5% replicability because false positive results have only a 5% probability of being significant again in a replication study. Clearly, psychology has produced some reliable findings that can be replicated every year in simple class-room demonstrations. Thus, an estimate somewhere in the middle of the extremes between nihilism (nothing in psychology is true) and naive optimism (everything is true) seems reasonable and is consistent across estimation methods.

My journal rankings also correctly predicted the ranking of journals in the OSF-reproducibility project, where articles published in JEP:General were most replicable, followed by Psychological Science, and then JPSP. There is even a direct causal link between the actual replication rate and power because cognitive psychologists use more powerful designs, and power determines the replicability in an exact replication study (Sterling et al., 1995).

I was excited to share my results in blogs and in a Facebook discussion group because I believed (and still believe) that these results provide valuable information about the replicability of psychological research, a topic that has been hotly debated since Bem’s (2011) article appeared.

The lack of reliable and valid information fuels this debate because opponents in the debate do not agree about the extent of the crisis. Some people assume that most published results are replicable (Gilbert, Wilson), whereas others suggest that the majority of published results are false (Ioannidis). Surprisingly, this debate rarely mentions Cohen’s seminal estimate of 50%. I was hoping that my results would provide some much needed objective estimates of the replicability of psychological research based on a comprehensive analysis of published results.

At present, there exist about five different estimates of the replicability of psychological research, ranging from less than 20% to 95%.

Less than 20%: Button et al. (2014) used meta-analyses in neuroscience, broadly defined, to suggest that power is only 20% and their method did not even correct for inflated effect sizes due to publication bias.

About 40%: A project that replicated 100 studies from social and cognitive psychology yielded about 40% successful replications; that is, they reproduced a significant result in the replication study. This estimate is slightly inflated because the replication studies sometimes used larger samples, which increased the probability of obtaining a significant result, but it can also be attenuated because replication studies were not carried out by the same researchers using the same population.

About 50%: Cohen (1962) and subsequent articles estimated that the typical power in psychology is about 50% to detect a moderate effect size of d = .5, which is slightly higher than the average effect size found in meta-analyses of social psychology.

50-80%: The average replicability for my rankings is 70% for journals and 60% for departments. The discrepancy is likely due to the fact that journals that publish more statistical results per article (e.g., a six-study article in JPSP) have lower replicability. There is variability across journals and departments, but few analyses have produced values below 50% or over 80%. If I had to pick a single number, I would pick 60%, the average for psychology departments. 60% is also the estimate for new, open access journals that publish thousands of articles a year compared to small quarterly journals that publish fewer than one hundred articles a year.

If we simply use the median of these five estimates, Cohen’s estimate of 50% provides the best estimate that we currently have. The average estimate for 51 psychology departments is 60%. The discrepancy may be explained by the fact that Cohen focused on theoretically important tests. In contrast, an automatic extraction of statistical results retrieves all statistical tests that are reported in articles. It is unfortunate that psychologists often report hypothesis tests even when they are meaningless (e.g., the positive stimuli were rated as more positive (M = 6.00, SD = 1.00) than the negative stimuli (M = 2.00, SD = 1.00), d = 4.00, p < .001). Eventually, it may be possible to develop algorithms that exclude these statistical tests, but while they are included, replicability estimates include the probability of rejecting the null-hypothesis for these obvious hypotheses. Taking this into account, estimates of 60% are likely to overestimate the replicability of theoretically important tests, which may explain the discrepancy between Cohen’s estimate and the results in my rankings.

CONCERNS ABOUT MY RANKINGS

Since I started publishing my rankings, some psychologists have raised some concerns about my rankings. In this blog post, I address these concerns.

#1 CONCERN: Post-Hoc Power is not the same as Replicability

Some researchers have argued that only actual replication studies can be used to measure replicability. This argument has two problems. First, actual replication studies do not provide a gold standard for estimating replicability. The reason is that there are many reasons why an actual replication study may fail, and there is no shortage of examples where researchers have questioned the validity of actual replication studies. Thus, even the success rate of actual replication studies is only an estimate of the replicability of original studies.

Second, original studies would already provide an estimate of replicability if no publication bias were present. If a set of original studies produced 60% significant results, an exact replication of these studies is also expected to produce 60% significant results, within the margins of sampling error. The reason is that the success rate of any set of studies is determined by the average power of the studies (Sterling et al., 1995), and the average power of identical sets of studies is the same. The problem with using published success rates as estimates of replicability is that published success rates are inflated by selection bias (selective reporting of results that support a theoretical prediction).

The main achievement of Brunner and Schimmack’s statistical estimation method was to correct for selection bias so that reported statistical results can be used to estimate replicability. The estimate produced by this method is an estimate of the success rate in an unbiased set of exact replication studies.

#2 CONCERN: Post-Hoc Power does not predict replicability.

In the OSF-project, observed power predicts actual replication success with a correlation of r = .23. This may be interpreted as evidence that post-hoc power is a poor predictor of actual replicability. However, the problem with this argument is that statisticians have warned repeatedly about the use of post-hoc power for a single statistical result (Hoenig & Heisey, 2001). The problem is that the confidence interval around the estimate is so wide that only extremely high power (> 99%) leads to accurate predictions that a study will replicate. For most studies, the confidence interval around the point-estimate is too wide to make accurate predictions.

However, this does not mean that post-hoc power cannot predict replicability for larger sets of studies. The reason is that precision of the estimate increases as the number of tests increases. So, when my rankings are based on hundreds or thousands of tests published in a journal, the estimates are sufficiently precise to be useful. Moreover, Brunner and Schimmack developed a bootstrap method that estimates 95% confidence intervals that provide information about the precision of estimates and these confidence intervals can be used to compare whether differences in ranks are statistically meaningful differences.
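A minimal sketch of the bootstrap idea follows (my own simplified illustration with simulated data, not the Brunner and Schimmack implementation, which additionally corrects for selection bias): resample the extracted test statistics with replacement, recompute the average power estimate for each resample, and take the middle 95% of the resulting estimates as the confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical absolute z-scores of significant results extracted from one journal
z_scores = rng.normal(2.8, 1.0, size=500)
z_scores = z_scores[z_scores > 1.96]

def observed_power(z, alpha=0.05):
    """Naive observed power: treat each z-score as the true noncentrality parameter."""
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - z) + stats.norm.cdf(-z_crit - z)

point_estimate = observed_power(z_scores).mean()

# Percentile bootstrap confidence interval for the average
boot = np.array([
    observed_power(rng.choice(z_scores, size=z_scores.size, replace=True)).mean()
    for _ in range(500)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"average observed power: {point_estimate:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")
```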

#3 CONCERN: INSUFFICIENT EVIDENCE OF VALIDITY

I have used the OSF-reproducibility project (Open Science Collaboration, 2015) to validate my rankings of journals. My method correctly predicted that results from JEP:General would be more replicable than those from Psychological Science, followed by JPSP. My estimate based on extraction of all test statistics from the three journals was 64%, whereas the actual replication success rate was 36%. The replicability estimate based on all tests overestimates the replicability of theoretically important tests, and the actual success rate underestimates replicability because of problems in conducting exact replication studies. The average of the two estimates is 50%, close to the best estimate of replicability.

It has been suggested that a single comparison is insufficient to validate my method. However, this argument ignores that the OSF-project was the first attempt at replicating a representative set of psychological studies and that the study received a lot of praise for doing so. So, N = 1 is all we have to compare my estimates to estimates based on actual replication studies. When more projects of this nature become available, I will use the evidence to validate my rankings and, if there are discrepancies, use this information to increase the validity of my rankings.

Meanwhile, it is simply false that a single data point is insufficient to validate an estimate. There is only one Earth, so any estimate of global temperature has to be validated with just one data point. We cannot wait for validation of this method on 199 other planets to decide whether estimates of global temperature are valid.

To use an example from psychology, if a psychologist wants to validate a method that presents stimuli subliminally and lets participants guess whether a stimulus was presented or not, the method is valid if participants are correct 50% of the time. If the percentage is 55%, the method is invalid because participants are able to guess above chance.

Also, validity is not an either or construct. Validity comes in degrees. The estimate based on rankings does not perfectly match the OSF-results or Cohen’s method. None of these methods are perfect. However, they converge on the conclusion that the glass is half full and half empty. The consensus across methods is encouraging. Future research has to examine why the methods differ.

In conclusion, the estimates underlying my replicability rankings are broadly consistent with two other methods of estimating replicability: Cohen’s method of estimating post-hoc power for medium effect sizes and the actual replication rate in the OSF-project. The replicability rankings are likely to overestimate the replicability of focal tests by about 10 percentage points because they include statistical tests of manipulation checks and covariates that are theoretically less important. This bias may also not be constant across journals, which could affect the rankings to some extent, but it is unknown whether this is actually the case and how much the rankings would be affected by it. Pointing out that this potential bias could reduce the validity of the rankings does not lead to the conclusion that they are invalid.

#4 RANKINGS HAVE TO PASS PEER-REVIEW

Some researchers have suggested that I should wait with publishing my results until this methodology has passed peer-review. In my experience, this would probably take a couple of years. Maybe that would have been an option when I started as a scientist in the late 1980s, when articles were printed, photocopied, and sent by mail if the local library did not have a journal. However, this is 2016, when information is shared at lightning speed and articles are already critiqued on Twitter or PubPeer before they are officially published.

I learned my lesson when the Bem (2011) article appeared and it took one and a half years for my article to be published. By that time, numerous articles had been published and Greg Francis had published a critique of Bem using a similar method. I was too slow.

In the meantime, Uri Simonsohn gave two SPSP symposia on p-curve before the actual p-curve article was published in print, and he had a pcurve.com website. When Uri presented the method for the first time (I was not there), it was met with an angry response from Norbert Schwarz. Nobody cares about Norbert’s response anymore; p-curve is widely accepted, and version 4.0 looks very different from the original version. Angry and skeptical responses are to be expected when somebody does something new, important, and disruptive, but this is part of innovation.

Second, I am not the first one to rank journals or departments or individuals. Some researchers get awards suggesting that their work is better than the work of those who do not get awards. Journals with more citations are more prestigious, and departments are ranked in terms of popularity among peers. Who has validated these methods of evaluation and how valid are they? Are they more valid than my replicability rankings?

At least my rankings are based on solid statistical theory and predict correctly that cognitive psychology is more replicable than social psychology. The fact that mostly social psychologists have raised concerns about my method may reveal more about social psychologists than about the validity of my method. Social psychologists also conveniently ignore that the OSF replicability estimate of 36% is an average across areas and that the estimate for social psychology was an abysmal 25% and that my journal rankings place many social psychology journals at the bottom of the ranking. One would only have to apply social psychological theories about heuristics and biases in cognitive processes to explain social psychologists’ concerns about my rankings.

CONCLUSION

In conclusion, the actual replication rate for a set of exact replication studies is identical to the true average power of the studies. Average power can be estimated on the basis of reported test statistics, and Brunner and Schimmack’s method can produce valid estimates when power is heterogeneous and when selection bias is present. When this method is applied to all statistics in the population (all journals, all articles by an author, etc.), rankings are not affected by selection bias (cherry-picking). When the set of statistics includes all statistical tests, as, for example, from an automated extraction of test statistics, the estimate is an estimate of the replicability of a randomly picked statistically significant result from a journal. This may be a manipulation check or a theoretically important test. It is likely that this estimate overestimates the replicability of critically important tests, especially those that are just significant, because selection bias has a stronger impact on results with weak evidence. The estimates are broadly consistent with other estimation methods, and more data from actual replication studies are needed to further validate the rankings. Nevertheless, the rankings provide the first objective estimate of replicability for different journals and departments.

The main result of this first attempt at estimating replicability provides clear evidence that selective reporting undermines the validity of published success rates. Whereas published success rates are over 90%, the actual success rate for studies that end up being published when they produce a desirable result is closer to 50%. The negative consequences of selection bias are well known. Reliable information about actual replicability and selection bias is needed to increase the replicability, credibility, and trustworthiness of psychological research. It is also needed to demonstrate to consumers of psychological research that psychologists are improving the replicability of research. Whereas rankings will always show differences, all psychologists are responsible for increasing the average. Real improvement would produce an increase in replicability on all three estimation methods (actual replications, Cohen’s method, and Brunner and Schimmack’s method). It is an interesting empirical question when and how much replicability estimates will increase in the future. My replicability rankings will play an important role in answering this question.

Reported Success Rates, Actual Success Rates, and Publication Bias In Psychology: Honoring Sterling et al. (1995)

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). “Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa”

When I discovered Sterling et al.’s (1995) article, it changed my life forever. I always had the suspicion that some articles reported results that are too good to be true. I also had my fair share of experiences where I tried to replicate an important finding to build on it, but only found out that I couldn’t replicate the original finding. Already skeptical by nature, I became increasingly uncertain what findings I could actually believe. Discovering the article by Sterling et al. (1995) helped me to develop statistical tests that make it possible to distinguish credible from incredible (a.k.a., not trustworthy) results (Schimmack, 2012), the R-Index (Schimmack, 2014), and Powergraphs (Schimmack, 2015).

In this post, I give a brief summary of the main points made in Sterling et al. (1995) and then show how my method builds on their work to provide an estimate of the actual success rate in psychological laboratories.

“Research studies from 11 major journals demonstrate the existence of biases that favor studies that observe effects that, on statistical evaluation, have a low probability of erroneously rejecting the so-called null hypothesis (Ho). This practice makes the probability of erroneously rejecting Ho different for the reader than for the investigator. It introduces two biases in the interpretation of the scientific literature: one due to multiple repetition of studies with false hypothesis, and one due to failure to publish smaller and less significant outcomes of tests of a true hypothesis” (Sterling et al., 1995, p. 108).

The main point of the article was to demonstrate that published results are biased. Several decades earlier, Sterling (1959) observed that psychology journals nearly exclusively publish support for theoretical predictions. The 1995 article showed that nothing had changed. It also showed that medical journals were more willing to publish non-significant results. The authors pointed out that publication bias has two negative consequences on scientific progress. First, false results that were published cannot be corrected because non-significant results are not published. However, when a false effect produces another false positive it can be published and it appears as if the effect was successfully replicated. As a result, false results can accumulate and science cannot self-correct and weed out false positives. Thus, a science that publishes only significant results is like a gambler who only remembers winning nights. It appears successful, but it is bankrupting itself in the process.

The second problem is that non-significant results do not necessarily mean that an effect does not exist. It is also possible that the study had insufficient statistical power to rule out chance as an explanation. If these non-significant results were published, they could be used by future researchers to conduct more powerful studies or to conduct meta-analyses. A meta-analysis uses the evidence from many small studies to combine the information into evidence from one large study, which makes it possible to detect small effects. However, if publication bias is present, a meta-analysis can falsely conclude that an effect is present because significant results are more likely to be included.

Both problems are important, but for psychology the first problem appeared to be the bigger problem because psychology published nearly exclusively significant results. There was no mechanism for psychology to correct itself until a recent movement started to question the credibility of published results in psychology.

The authors examined how often authors reported that a critical hypothesis test confirmed a theoretical prediction. The Table shows the results for the years 1986-87 and for the year 1958. The 1958 results are based on Sterling (1959).

 

Journal                                    1986-87   1958
Journal of Experimental Psychology            93%      99%
Comparative & Physiological Psychology        97%      97%
Consulting & Clinical Psychology              98%      95%
Personality and Social Psychology             96%      97%
 

The authors use the term “proportion of studies rejecting H0” to refer to the percentages of studies with significant results. I call it the success rate. A researcher who plans a study to confirm a theoretical prediction has a success when the study produces a significant result. When the result is not significant, the researcher cannot claim support for the hypothesis and the study is a failure. Failure does not mean that the study was not useful and that the result should not be published. It just means that the study does not provide sufficient support for a prediction.

Sterling et al. (1995) distinguish between the proportion of published studies rejecting H0 and the proportion of all conducted studies rejecting H0. I use the terms reported success rate and actual success rate. Without publication bias, the reported success rate and the actual success rate are the same. However, when publication bias or reporting bias is present, the reported success rate exceeds the actual success rate. A gambler might win on 45% of trips to the casino, but he may tell his friends that he wins 90% of the time. This discrepancy reveals a reporting bias. Similarly, a researcher may have a success rate of 40% of studies (or statistical analyses, if multiple analyses are conducted with one data set), but the published studies show a 95% success rate. The difference shows the effect of reporting bias.

A reported success rate of 95% in psychology journals seems high, but it does not automatically imply that there is publication bias. To make claims about the presence of publication bias, it is necessary to find out what the actual success rate of psychological researchers is. When researchers run a statistical analysis that will be used for a publication and look up the p-value, how often is this p-value below the critical .05 value? How often do researchers go “Yeah” and “Got it” versus “S***” and move on to another significance test? [I’ve been there; I have done it.]

Sterling et al. (1995) provide a formula that can be used to predict the actual success rate of a researcher. Actually, Sterling et al. (1995) predicted the failure rate, but the formula can be easily modified to predict the success rate. I first present the original formula for the prediction of failure, but I spare readers the Greek notation.

FR = proportion of studies accepting H0
%H0 = proportion of studies where H0 is true
B = average type-II error probability (type-II error = non-significant result when H0 is false)
C = Criterion value for significance (typically p < .05, two-tailed, also called alpha)

FR = %H0 * (1 – C) + (1 – %H0) * B

The corresponding formula for the success rate is

SR = %H0 * C + (1 – %H0) * (1 – B)

In this equation, (1 – B) is the average probability to obtain a significant effect when an effect is present, which is known as statistical power (P). Substituting (1 – B) with P gives the formula

SR = %H0 * C + (1 – %H0) * P

This formula says that the success rate is a function of the criterion for significance (C), the proportion of studies where the null-hypothesis is true (%H0) and the average statistical power of studies when an effect is present.
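For example, with C = .05, a 10% share of true null-hypotheses, and an average power of .60, the expected success rate would be .10 * .05 + .90 * .60 = .545. A short sketch of the formula (the input values are illustrative choices of my own):

```python
def success_rate(prop_h0_true, power, alpha=0.05):
    """Expected rate of significant results: SR = %H0 * C + (1 - %H0) * P."""
    return prop_h0_true * alpha + (1 - prop_h0_true) * power

print(success_rate(prop_h0_true=0.10, power=0.60))  # 0.545
print(success_rate(prop_h0_true=0.00, power=0.60))  # 0.60: SR equals P when H0 is never true
```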

The problem with this formula is that the proportion of true null-effects is unknown or even unknowable. It is unknowable because the null-hypothesis is a point prediction of an effect size and even the smallest deviation from this point prediction invalidates H0 and H1 is true. H0 is true if the effect is exactly zero, but H1 is true if the effect is 0.00000000000000000000000000000001. And even if it were possible to demonstrate that the effect is smaller than 0.00000000000000000000000000000001, it is possible that the effect is 0.000000000000000000000000000000000000000001 and the null-hypothesis would still be false.

Fortunately, it is not necessary to know the proportion of true null-hypotheses to use Sterling et al.’s (1995) formula. Sterling et al. (1995) make the generous assumption that H0 is always false. Researchers may be wrong about the direction of a predicted effect, but the two-tailed significance test helps to correct this false prediction by showing a significant result in the opposite direction (a one-tailed test would not be able to do this). Thus, H0 is only true when the effect size is exactly zero, and it has been proposed that this is very unlikely. Eating a jelly bean a day may not have a noticeable effect on life expectancy, but can we be sure a priori that the effect is exactly 0? Maybe it extends or shortens life-expectancy by 5 seconds. This would not matter to lovers of jelly beans, but the null-hypothesis that the effect is zero would be false.

Even if the null-hypothesis is true in some cases, it is irrelevant because the assumption that it is always false is the best case scenario for a researcher, which makes the use of the formula conservative. The actual success rate can only be lower than the estimated success rate based on the assumption that all null-hypotheses are false. With this assumption, the formula is reduced to

SR = P

This means that the actual success rate is a function of the average power of studies. The formula also implies that an unbiased sample of studies provides an estimate of the average power of studies.

P = SR

Sterling et al. (1995) contemplate what a 95% success rate would mean if no publication bias were present.

“If we take this formula at face value, it suggests that only studies with high power are performed and that the investigators formulate only true hypothesis.”

In other words, a 95% success rate in the journals can only occur if the null-hypothesis is always false and researchers conduct studies with 95% power, or if the null-hypothesis is true in 5% of the studies and true power is 100%.

Most readers are likely to agree with Sterling et al. (1995) that “common experience tells us that such is unlikely.” Thus, publication bias is most likely to contribute to the high success rate. The really interesting question is whether it is possible to (a) estimate the actual success rate and (b) estimate the extent of publication bias.

Sterling et al. (1995) use post-hoc power analysis to obtain some estimate of the actual success rate.

“Now alpha is usually .05 or less, and beta, while unknown and variable, is frequently .15–.75 (Hedges 1984). For example, if alpha = .05 and we take B = .2 as a conservative estimate, then the proportion of studies that should accept Ho is between .20 and .95. Thus even if the null hypothesis is always false, we would expect about 20% of published studies to be unable to reject Ho.”

To translate: even with a conservative estimate that the type-II error rate is 20% (i.e., average power is 80%), 20% of published studies should report a non-significant result. Thus, the 95% reported success rate is inflated by publication bias by at least 15 percentage points.

One limitation of Sterling et al.’s (1995) article is that they do not provide a more precise estimate of the actual success rate.

There are essentially three methods to obtain estimates of the actual success rate. One could conduct a survey of researchers and ask them to report how often they obtain significant results in statistical analyses that are conducted for the purpose of publication. Nobody has tried to use this approach. I only heard some informal rumor that a psychologist compared his success rate to batting averages in baseball and was proud of a 33% success rate (a 33% batting average is a good average for hitting a small ball that comes at you at over 80mph).

The second approach would be to take a representative sample of theoretically relevant statistical tests (i.e., excluding statistical tests of manipulation checks or covariates) from published articles and to replicate these studies as closely as possible. The success rate in the replication studies provides an estimate of the actual success rate in psychology because the replication studies do not suffer from publication bias.

This approach was taken by the Open Science Collaboration (2015), although with a slight modification. The replication studies tried to replicate the original studies as closely as possible, but sample sizes differed from the original studies. As sample size has an influence on power, the success rate in the replication studies is not directly comparable to the actual success rate of the original studies. However, sample sizes were often increased and they were usually only decreased if the original study appeared to have very high power. As a result, the average power of the replication studies was higher than the average power of the original studies and the result can be considered an optimistic estimate of the actual success rate.

The study produced a success rate of 35% (95%CI = 25% to 45%). The study also showed different success rates for cognitive psychology (SR = 50%, 95%CI = 35% to 65%) and social psychology (SR = 25%, 95%CI = 14% to 36%).
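These confidence intervals reflect simple binomial sampling error. A quick check of the overall figure (using a normal approximation; my own illustration):

```python
from math import sqrt

def binomial_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return round(p - z * se, 3), round(p + z * se, 3)

print(binomial_ci(35, 100))    # about (0.26, 0.44), close to the reported 25% to 45%
print(binomial_ci(350, 1000))  # hypothetical: ten times as many replications would narrow the interval
```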

The actual replication approach has a number of strengths and weaknesses. The strength is that actual replications do not only estimate the actual success rate of the original studies, but also test how robust these results are when an experiment is repeated. A replication study is never an exact replication of the original study, but a study that reproduces the core aspects of the original study should be able to reproduce the same result. A weakness of actual replication studies is that they may have failed to reproduce core aspects of the original experiment. Thus, it is possible to attribute non-significant results to problems with the replication study. If 20% of the replication studies suffered from this problem, the actual success rate in psychology would increase from 36% to 45% (.36/.80 = .45). The problem with this adjustment is that it is arbitrary, because it is impossible to know whether a replication study successfully reproduced the core aspects of an original experiment, and using significant results as the criterion would lead to a circular argument: a replication was successful only if it produced a significant result, which would lead to the absurd implication that only the 36% of replications that produced significant results were good replications and the actual success rate is back to 100%.

A third approach is to use the results of the original article to estimate the actual success rate. The advantage of this method is that it uses the very same results that were used to report a 97% success rate. Thus, no mistakes in data collection can explain discrepancies.  The problem is to find a statistical method to correct for publication bias.  There have been a number of attempts to correct for publication bias in meta-analyses of a set of studies (see Schimmack, 2014, for a review). However, a shared limitation of these methods is that they assume that all studies have the same power or effect size. This assumption is obviously violated when the set of studies spans different designs and disciplines.

Brunner and Schimmack (forthcoming) developed a method that can estimate the average power for a heterogeneous set of results while controlling for publication bias. The method first transforms the reported statistical result into an absolute z-score. This z-score represents the strength of evidence against the null-hypothesis. If non-significant results are reported, they are excluded from the analysis because publication bias makes the reporting of non-significant results unreliable. In the OSF-reproducibility project nearly all reported results were significant and two were marginally significant; therefore this step is irrelevant here. The next step is to reproduce the observed distribution of significant z-scores as a function of several non-centrality parameters and weights. The non-centrality parameters are then converted into power, and the weighted average of power is the estimate of the actual average power of the studies.
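The following sketch conveys the general logic with simulated data; it is a simplified illustration of my own, not the Brunner and Schimmack implementation. Significant z-scores are modeled as a mixture over a fixed grid of noncentrality parameters, each component truncated at the significance threshold; the weights are fitted with EM, and the fitted weights are converted into an average power estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
z_crit = stats.norm.isf(0.025)        # 1.96 for alpha = .05, two-tailed

# Simulate heterogeneous studies and apply selection for significance
true_ncp = rng.choice([0.5, 1.5, 2.5, 3.5], size=5000, p=[0.3, 0.3, 0.2, 0.2])
z = rng.normal(true_ncp, 1.0)
z_sig = z[z > z_crit]                 # only significant results get "published"

# Model: mixture over a fixed grid of noncentrality parameters
grid = np.arange(0, 6.5, 0.5)
power_grid = stats.norm.sf(z_crit - grid)                 # power of each component
# Density of an observed significant z under each component (truncated at z_crit)
dens = stats.norm.pdf(z_sig[:, None] - grid[None, :]) / power_grid[None, :]

# EM algorithm for the mixture weights (the components themselves are fixed)
w = np.full(grid.size, 1 / grid.size)
for _ in range(500):
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)               # responsibilities
    w = resp.mean(axis=0)                                  # updated weights

estimated_power = (w * power_grid).sum()
true_power = stats.norm.sf(z_crit - true_ncp[z > z_crit]).mean()
print(f"estimated average power of the significant results: {estimated_power:.2f}")
print(f"true average power of the significant results:      {true_power:.2f}")
```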

In this method the distinction between true null hypothesis and true effects is irrelevant because true null effects cannot be distinguished from studies with very low power to detect small effects. As a result, the success rate is equivalent to the average power estimate. The figure below shows the distribution of z-scores for the replicated studies (on the right side).

[Figure: Powergraph for the OSF-Reproducibility Project]

The estimated actual success rate is 54%. A 95% confidence interval is obtained by running 500 bootstrap analyses. The 95%CI ranges from 37% to 67%. This confidence interval overlaps with the confidence interval for the success rate in the replication studies of the reproducibility project. Thus, the two methods produce convergent evidence that the actual success rate in psychological laboratories is somewhere between 30% and 60%. This estimate is also consistent with post-hoc power analyses for moderate effect sizes (Cohen, 1962).

It is important to note that this success rate only applies to statistical tests that are included in a publication when these tests produce a significant result. The selection of significant results also favors studies that actually had higher power and larger effect sizes, but researchers do not know a priori how much power their study has because the effect size is unknown. Thus, the power of all studies that are being conducted is even lower than the power estimated for the studies that produced significant results and were published. The powergraph analysis also estimates power for all studies, including the estimated file-drawer of non-significant results. This estimate is 30%, with a 95%CI ranging from 10% to 57%.

The figure on the left addresses another problem of actual replications. A sample of 100 studies is a small sample and may not be representative because researchers focused on studies that are easy to replicate. The statistical analysis of original results does not have this problem. The figure on the left side used all statistical tests that were reported in the three target journals in 2008, the year that was used to sample studies for the reproducibility project.

Average power is in the same ballpark, but it is 10 percentage points higher than for the sample of replication studies, and the confidence interval does not overlap with the 95%CI for the success rate in the actual replication studies. There are two explanations for this discrepancy. One explanation is that the power of tests of critical hypotheses is lower than the power of all statistical tests, which include manipulation checks and covariates. Another explanation could be that the sample of replicated studies was not representative. Future research may help to explain the discrepancy.

Despite some inconsistencies, these results show that different methods can provide broadly converging evidence about the actual success rate in psychological laboratories. In stark contrast to reported success rates over 90%, the actual success rate is much lower and likely to be less than 60%. Moreover, this average glosses over differences in actual success rates in cognitive and social psychology. The success rate in social psychology is likely to be less than 50%.

CONCLUSION

Reported success rates in journals provide no information about the actual success rate when researchers conduct studies because publication bias dramatically inflates reported success rates. Sterling et al. (1995) showed that the actual success rate is equivalent to the average power of studies when the null-hypothesis is always false. As a result, the success rate in an unbiased set of studies is an estimate of average power, and average power after correcting for publication bias is an estimate of the actual success rate before publication bias. The OSF-reproducibility project obtained an actual success rate of 36%. A bias-corrected estimate of the average power of the original studies produced an estimate of 54%. Given the small sample size of 100 studies, the confidence intervals overlap, and both methods provide converging evidence that the actual success rate in psychology laboratories is much lower than the reported success rate. The ability to estimate actual success rates from published results makes it possible to measure reporting bias, which may help to reduce it.