Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY
In empirical studies with random error variance, replicability refers to the probability that a study with a significant result will produce a significant result again in an exact replication of the first study using the same sample size and significance criterion.
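A minimal simulation can illustrate this definition: because an exact replication is an independent draw from the same design, the probability of replicating a significant result equals the study's statistical power. The sketch below is illustrative only; the effect size (d = 0.5) and sample size (n = 50 per group) are hypothetical values, and numpy and scipy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical design: two-sample t-test, d = 0.5, n = 50 per group
d, n, alpha, n_sim = 0.5, 50, 0.05, 20_000

def significant():
    """Run one study and report whether it reached p < .05 (two-tailed)."""
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    return stats.ttest_ind(a, b).pvalue < alpha

# Keep only "original" studies that were significant, then run an exact replication
replications = [significant() for _ in range(n_sim) if significant()]
replicability = np.mean(replications)
print(round(replicability, 2))  # close to the design's true power (~0.70 here)
```

The conditioning on a significant original result does not change the replication outcome, which is why replicability equals power for exact replications.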

REPLICABILITY REPORTS:  Examining the replicability of research topics
RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?
RR No3. (September 4, 2017) The power of the pen paradigm: A replicability analysis

Featured Blog of the Month (November, 2018):
Replicability Rankings of Eminent Social Psychologists
–  no significant correlation between Eminence (H-Index) and Replicability (R-Index)
–  most p-values between .05 and .01 are no longer significant after correcting for selection for significance and questionable research practices
–  replicability varies from 22% to 81%



1.  Preliminary 2017  Replicability Rankings of 104 Psychology Journals
Rankings of 104 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2017, and a comparison of psychological disciplines (cognitive, clinical, social, developmental, biological).

2.  Introduction to Z-Curve with R-Code
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.


4.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After test results are converted into z-scores, the z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
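The procedure described above can be sketched in a few lines of Python (this is an illustrative reimplementation, not the original R code). The ten just-significant p-values are invented for illustration; scipy is assumed.

```python
import numpy as np
from scipy import stats

def tiva(p_values):
    """Test of Insufficient Variance: convert two-tailed p-values to absolute
    z-scores and test whether their variance is below the expected value of 1
    with a left-tailed chi-square test."""
    z = stats.norm.isf(np.asarray(p_values) / 2)  # two-tailed p -> |z|
    k = len(z)
    var = np.var(z, ddof=1)
    # Under H0 (sigma^2 = 1), (k-1) * s^2 follows a chi-square with k-1 df
    p_left = stats.chi2.cdf((k - 1) * var, df=k - 1)
    return var, p_left

# Hypothetical set of ten just-significant results, as expected after selection
ps = [.049, .032, .041, .028, .046, .039, .021, .044, .035, .048]
var, p = tiva(ps)
print(round(var, 3), p)  # variance far below 1 -> evidence of selection
```

A set of honestly reported studies would show a variance near 1; a cluster of p-values squeezed just below .05 produces a much smaller variance and a tiny left-tailed p-value.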

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”  The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.


Replicability Audit of Roy F. Baumeister

“Trust is good, but control is better”  


Sir Ronald Fisher emphasized that a significant p-value is not sufficient evidence for a scientific claim. Other scientists should be able to replicate the study and reproduce a significant result most of the time.  Neyman and Pearson formalized this idea when they distinguished type-I errors (a false positive result) and type-II errors (a false negative result).  Good experiments should have a low risk of a type-I and type-II error.

To reduce type-II errors, researchers need to conduct studies with a good signal to noise ratio.  That is, the population effect size needs to be considerably larger than sampling error so that the observed signal to noise ratio in a sample is large enough to exceed the criterion value for statistical significance.  In practice, the significance criterion of p < .05 (two-tailed) corresponds roughly to a signal to noise ratio of 2:1 (z = 1.96).  With a 3:1 ratio of the population effect size over sampling error, the probability of a type-II error is only 15% and the chance of replicating a significant result in an exact replication study is 1-0.15 = 85%.
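The arithmetic in this paragraph can be checked directly (assuming scipy is available):

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.isf(alpha / 2)   # critical z for p < .05 two-tailed: 1.96
snr = 3.0                       # population effect size / sampling error

# If the expected z-score is 3, power is the probability that the observed
# z-score exceeds the 1.96 criterion.
power = norm.sf(z_crit - snr)
print(round(z_crit, 2), round(power, 2))  # 1.96 0.85
```

With 85% power, the type-II error probability is 15%, and an exact replication of a significant result succeeds with probability .85, as stated above.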

Unfortunately, psychologists are not trained to conduct formal power analyses and often conduct studies with low power (Cohen, 1962).  The most direct consequence of this practice is that researchers often fail to find significant results and replication studies fail to confirm original discoveries.  However, these replication failures often remain hidden because psychologists use a number of questionable research practices to avoid reporting them.  To this day, the use of these practices is not considered a violation of research ethics, even though they clearly undermine the validity of published results.  As Sterling (1959) pointed out, once published results are selected for significance, reporting p < .05 is meaningless because the risk of type-I errors can be much higher than 5%.

In short, the replicability of published results in psychology journals is unknown.  Since 2011, a number of publications have suggested that many findings in experimental social psychology have low replicability.  The Open Science Collaboration conducted actual replication studies and found that only 25% of experiments in social psychology could be replicated.  The success rate for the typical between-subject experiment was only 4%.

Brunner and Schimmack (2018) developed a statistical tool to estimate replicability called z-curve.  When I applied z-curve to a representative sample of between-subject experimental social psychology (BS-ESP) results, I obtained an estimate of 32% with a 95% CI ranging from 23% to 39% (Schimmack, 2018).  This estimate implies that experimental social psychologists are using questionable research practices to inflate their success rate in published articles from about 30% to 95% (Sterling et al., 1995).  Thus, the evidence for many claims in social psychology textbooks and popular books (e.g., Bargh, 2017; Kahneman, 2011) is much weaker than the published literature suggests.

Z-curve makes it possible to use the results of published articles to estimate the replicability of published results.  This allows a reexamination of the published literature to estimate actual type-I and type-II error rates in experimental social psychology.  Using z-curve, I posted replicability rankings of eminent social psychologists (Schimmack, 2018).  Although these results have heuristic value, they are still overly optimistic because they are based on all published test statistics that were automatically extracted from articles.  The average replicability estimate was 62%, which is considerably higher than the 30% estimate for focal hypothesis tests in Motyl et al.’s dataset.  Thus, a thorough investigation of replicability requires hand-coding of focal hypothesis tests.  Because z-curve assumes independence of test statistics, the most focal hypothesis test (MFHT) has to be identified for each study.  This blog post reports the results of the first replicability analysis based on MFHTs in the most important articles of an eminent social psychologist.


I call a z-curve analysis of authors’ MFHTs an audit. The term audit is apt because published results are based on authors’ statistical analyses.  It is assumed that researchers conducted these analyses properly without the use of questionable practices.  In the same way, tax returns are completed by tax payers or their tax lawyers and it is assumed that they followed tax laws in doing so.  While trust is good, control is better and tax agencies randomly select some tax returns to check that tax payers followed the rules.  Just imagine what tax returns would look like, if they were not audited.  Until recently,  this was the case for scientific publications in psychology.  Researchers could use questionable research practices to inflate effect sizes and the percentage of successes without any concerns that their numbers would be audited. Z-curve makes it possible to audit psychologists without access to the actual data.

I chose Roy F. Baumeister for my first audit for several reasons.  Most important, Baumeister is objectively the most eminent social psychologist with an H-index of 100.  Just like it is more interesting to audit Donald Trump’s than Mike Pence’s tax returns, it is more interesting to learn about the replicability of Roy Baumeister’s results than the results of, for example, Harry Reis.

Another reason is that Roy Baumeister is best known for his ego-depletion theory of self-control and a major replication study failed to replicate the ego-depletion effect.  In addition, a meta-analysis showed that published ego-depletion studies reported vastly inflated effect sizes.  I also conducted a z-curve analysis of focal hypothesis tests in the ego-depletion literature and found evidence that questionable research practices were used to produce evidence for ego depletion (Schimmack, 2016).  Taken together, these findings raise concerns about the research practices that were used to provide evidence for ego-depletion, and it is possible that similar research practices were used to provide evidence for other claims made in Baumeister’s articles.  Thus, it seemed worthwhile to conduct an audit of Baumeister’s most important articles.


I used WebofScience to identify the most cited articles by Roy F. Baumeister (datafile ).   I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 69 empirical articles (H-Index = 69).  The 69 articles reported 241 studies (average 3.5 studies per article).  The total number of participants was 22,576 with a mean of 94 and a median of 58 participants per study.   For each study, I identified the most focal hypothesis test (MFHT).  The result of the test was converted into an exact p-value and the p-value was then converted into a z-score.   The 241 z-scores were submitted to a z-curve analysis to estimate mean power of the 222 results that were significant at p < .05 (two-tailed). The remaining 19 results were interpreted as evidence with lower standards of significance. Thus, the success rate for the 241 studies was 100% and not a single study reported a failure to support a prediction, implying a phenomenal type-II error probability of zero.
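The conversion of a coded test result into an exact p-value and then into an absolute z-score follows standard distribution functions. The sketch below uses a hypothetical focal test, t(58) = 2.20, purely for illustration; scipy is assumed.

```python
from scipy.stats import norm, t

# Hypothetical focal test reported in an article: t(58) = 2.20
t_value, df = 2.20, 58

p = 2 * t.sf(t_value, df=df)   # exact two-tailed p-value
z = norm.isf(p / 2)            # p-value converted into an absolute z-score
print(round(p, 4), round(z, 2))
```

The resulting z-scores (here just above 2, i.e., just significant) are the inputs to the z-curve analysis of the 241 coded studies.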


The z-curve estimate of actual replicability is 20% with a 95%CI ranging from 10% to 33%.  The complementary interpretation of this result is that the actual type-II error rate is 80% compared to the 0% failure rate in the published articles.

The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results.  The large area under the grey curve is an estimate of the file drawer of studies that would need to be conducted to achieve 100% successes with just 20% average power.  It is unlikely that dropping studies with non-significant results was the only questionable research practice that was used.  Thus, the actual file drawer is likely to be smaller.  Nevertheless, the figure makes it clear that the reported results are just the tip of the iceberg of empirical attempts to produce significant results that appear to support theoretical predictions.

Z-curve is under development and offers additional information beyond the replicability of significant results.   One new feature is an estimate of the maximum number of false positive results.  This estimate is a maximum because it is empirically impossible to distinguish true false positives (effect size is zero) from true positives with negligible effect sizes (effect size is 0.000001).  To estimate the maximum false discovery rate, z-curve is fitted with a fixed percentage of false positives and the fit of this model is compared to the unconstrained model.  If fit is very similar, it is possible that the set of results contains the specified amount of false positives.  The estimate for Roy F. Baumeister’s most important results is that up to 70% of published results could be false positives or true positives with tiny effect sizes.  This suggests that the observed z-scores could be a mixture of 70% false positives and 30% results with a mean power of 55%, which reproduces the estimate of 20% average power (.70 × .05 + .30 × .55 = .20).
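The mixture arithmetic behind this estimate can be verified in one line: false positives replicate at the alpha level (5%), true positives at their mean power (55%), and the weighted average recovers the overall 20% estimate.

```python
# Weighted average of replication probabilities in the assumed mixture:
# 70% false positives (power = alpha = .05) and 30% true positives (power = .55)
mean_power = 0.70 * 0.05 + 0.30 * 0.55
print(round(mean_power, 2))  # 0.2
```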

Z-curve also provides estimates of mean power for different intervals on the x-axis.  As power increases with the observed evidence against the null-hypothesis (decreasing p-values, increasing z-scores), mean power increases.  For high z-scores of 6 or more, power is essentially 1, and we would expect any result with an observed z-score greater than 6 to replicate in an exact replication study.   As can be seen below the x-axis of the figure, z-scores from 2 to 2.5 have a mean power of only 14%, and z-scores between 2.5 and 3 have a mean power of only 17%.  Only z-scores greater than 4 have at least 50% power, and z-scores greater than 5 are needed to reach the recommended level of 80% power (Cohen, 1988).  Only 11 out of 241 tests yielded a z-score greater than 4, and only 6 yielded a z-score greater than 5.

In conclusion, the replicability audit of Roy F. Baumeister shows that published results were obtained with a low probability to produce a significant result.  As a result, exact replication studies also have a low probability to reproduce a significant result.  As noted a long time ago by Sterling (1959),  statistical significance loses its meaning when results are selected for significance.  Given the low replicability estimates and the high risk of false positive results, the significant results in Baumeister’s article provide no empirical evidence for his claims because the type-I and type-II error risks are too high. The only empirical evidence that was provided in these 69 articles are the 6 or 11 results with z-scores greater than 5 or 4, respectively.


Unlike tax audits by revenue agencies, my replicability audits have no real consequences when questionable research practices are discovered. Roy F. Baumeister followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud. He might even argue that he was better at playing the game social psychologists were playing, which is producing as many significant results as possible without worrying about replicability.  This prevalent attitude among social psychologists was most clearly expressed by another famous social psychologist, who produced incredible and irreproducible results.

“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?’” (Daryl J. Bem, in Engber, 2017)

Not everybody may be as indifferent to replicability.  For consumers interested in replicable empirical findings it is surely interesting to know how replicable published results are. For example, Nobel Laureate Daniel Kahneman might not have featured Roy Baumeister’s results in his popular book “Thinking, Fast and Slow,” if he had seen these results.  Maybe some readers of this blog also find these results informative. I know firsthand that at least some of my undergraduate students who invested time and resources in studying psychology find these results interesting and shocking.


It is nearly certain that I made some mistakes in the coding of Roy Baumeister’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential mistakes that would alter the main conclusions of this audit. However, control is better than trust, and everybody can audit this audit.  The data are openly available, and everybody who has access to the original articles can do their own analysis of Roy Baumeister or any other author, including myself.  The z-curve code is also openly available.  Thus, I hope that this seminal and fully open publication of a replicability audit motivates other psychologists or researchers in other disciplines to conduct replicability audits.  And last, but not least, I don’t hate Roy Baumeister; I love science.

Replicability Rankings of Eminent Social Psychologists

Social psychology has a replication problem.  The reason is that social psychologists used questionable research practices to increase their chances of reporting significant results. The consequence is that the real risk of a false positive result is higher than the stated 5% level in publications. In other words, p < .05 no longer means that at most 5% of published results are false positives (Sterling, 1959). Another problem is that selection for significance with low power produces inflated effect size estimates. Estimates suggest that published effect sizes are, on average, inflated by 100% (OSC, 2015). These problems have persisted for decades (Sterling, 1959), but only now are psychologists recognizing that published results provide weak evidence and might not be replicable even if the same study were replicated exactly.

How should consumers of empirical social psychology (textbook writers, undergraduate students, policy planners) respond to the fact that published results cannot be trusted at face value? Jerry Brunner and I have been working on ways to correct published results for the inflation introduced by selection for significance and questionable practices.  Z-curve estimates the mean power of studies selected for significance.  Here I applied the method to automatically extracted test statistics from social psychology journals.  I computed z-curves for 70+ eminent social psychologists (H-index > 35).

The results can be used to evaluate the published results reported by individual researchers.  The main information provided in the table is (a) the replicability of all published p-values, (b) the replicability of just significant p-values (defined as p-values between .05 and 2*(1-pnorm(2.5)) = .0124), and (c) the replicability of p-values with moderate evidence against the null-hypothesis (.0124 > p > .0027). More detailed information is provided in the z-curve plots (powergraphs) that are linked to researchers’ names. An index of less than 50% suggests that these p-values are no longer significant after adjusting for selection for significance.  As can be seen in the table, most just significant results are no longer significant after correction for bias.
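The p-value boundaries of these z-score bands follow directly from the two-tailed normal distribution; a quick check (assuming scipy):

```python
from scipy.stats import norm

# Two-tailed p-values at the z-score cutoffs used in the table
p_z25 = 2 * norm.sf(2.5)   # lower bound of the "just significant" band
p_z30 = 2 * norm.sf(3.0)   # lower bound of the "moderate evidence" band
print(round(p_z25, 4), round(p_z30, 4))  # 0.0124 0.0027
```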

Caveat: Interpret with Care

The results should not be overinterpreted. They are estimates based on an objective statistical procedure, but no statistical method can compensate perfectly for the various practices that led to the observed distribution of p-values (transformed into z-scores).  However, in the absence of any information which results can be trusted, these graphs provide some information.  How this information is used by consumers depends ultimately on consumers’ subjective beliefs.  Information about the average replicability of researchers’ published results may influence these beliefs.

It is also important to point out that a low replicability index does not mean researchers were committing scientific misconduct.  There are no clear guidelines about acceptable and unacceptable statistical practices in psychology.  Z-curve is not designed to detect scientific fraud. In fact, it assumes that researchers collect data but conduct analyses in a way that increases the chances of producing a significant result.  The bias introduced by selection for significance is well known and considered acceptable in psychological science.

There are also many factors that can bias results in favor of researchers’ hypotheses without researchers’ awareness. The bias evident in many graphs therefore does not imply that researchers intentionally manipulated data to support their claims, and I attribute it to unidentified researcher influences.  It is not important to know how bias occurred; it is only important to detect biases and to correct for them.

It is necessary to do so for individual researchers because bias varies across researchers.  For example, the R-Index for all results ranges from 22% to 81%.  It would be unfair to treat all social psychologists alike when their research practices are a reliable moderator of replicability.  Providing personalized information about replicability allows consumers of social psychological research to avoid stereotyping social psychologists and to take individual differences in research practices into account.

Finally, it should be said that producing replicability estimates is itself subject to biases and errors.  Researchers may differ in their selection of the hypotheses that they report. A more informative analysis would require hand-coding of researchers’ focal hypothesis tests.  At the moment, R-Index does not have the resources to code all published results in social psychology, let alone other areas of psychology.  This is an important task for the future.  For now, automatically extracted results have some heuristic value.

One unintended and unfortunate consequence of making this information available is that some researchers’ reputations might be negatively affected by a low replicability score.  This cost has to be weighed against the benefit to the public and the scientific community of obtaining information about the robustness of published results.  In this regard, the replicability rankings are no different from actual replication studies that fail to replicate an original finding.  The only difference is that replicability rankings use all published results, whereas actual replication studies are often limited to a single study or a few studies.  While replication failures in a single study are ambiguous, replicability estimates based on hundreds of published results are more diagnostic of researchers’ practices.

Nevertheless, statistical estimates provide no definitive answer about the reproducibility of a published result.  Ideally, eminent researchers would conduct their own replication studies to demonstrate that their most important findings can be replicated under optimal conditions.

It is also important to point out that researchers have responded differently to the replication crisis that became apparent in 2011.   It may be unfair to generalize from past practices to new findings for researchers who changed their practices.  If researchers preregistered their studies and followed a well-designed registered research protocol, new results may be more robust than a researcher’s past record suggests.

Finally, the results show evidence of good replicability for some social psychologists.  Thus, the rankings avoid the problem of selectively targeting researchers with low replicability, which can lead to a negative bias in evaluations of social psychology.  The focus on researchers with a high H-index means that the results are representative of the field.

If you believe that you should not be listed as an eminent social psychologist, please contact me so that I can remove you from the list.

If you think you are an eminent social psychologist and want to be included in the ranking, please contact me so that I can add you to the list.

If you have any suggestions or comments how I can make these rankings more informative, please let me know in the comments section.

***   ***   ***    ***    ***

[sorted by R-Index for all tests from highest to lowest rank]

Rank  Name  R-Index (all)  R-Index (2.0-2.5)  R-Index (2.5-3.0)  #P-vals  H-Index  #Pub  #cit(*1000)
1 Steven J. Heine 81 44 55 197 41 83 10
2 James J. Gross 80 35 58 360 82 413 34
3 Constantine Sedikides 76 45 53 884 52 263 10
4 Bertram Gawronski 73 29 51 1717 37 113 6
5 Kathleen D. Vohs 73 36 53 452 49 158 11
6 Paul Rozin 73 39 57 155 65 218 16
7 Alice H. Eagly 72 42 50 384 61 161 18
8 Anthony G. Greenwald 72 28 51 273 64 175 26
9 David Dunning 72 30 50 674 40 105 8
10 Richard E. Nisbett 72 57 59 190 62 119 20
11 Shinobu Kitayama 70 37 44 545 35 103 15
12 Timothy D. Wilson 70 30 52 327 44 85 14
13 Mahzarin R. Banaji 68 52 53 651 60 133 22
14 Marilynn B. Brewer 68 43 46 193 49 107 15
15 Patricia G. Devine 67 44 52 1098 35 83 9
16 Susan T. Fiske 67 26 41 419 66 213 22
17 Thomas Gilovich 67 23 44 754 44 104 8
18 Daniel T. Gilbert 66 26 46 357 43 107 8
19 Hazel R. Markus 66 48 53 348 38 96 12
20 Mark P. Zanna 66 26 49 565 57 167 11
21 Wendy Wood 66 33 40 373 42 112 7
22 Brad J. Bushman 65 33 50 247 48 227 13
23 E. Tory Higgins 65 27 44 887 72 274 25
24 Jeff Greenberg 65 30 42 679 60 154 13
25 Nira Liberman 64 25 37 1578 46 115 10
26 Caryl E. Rusbult 63 27 34 171 36 68 9
27 Dacher Keltner 63 38 43 903 60 159 15
28 Harry Reis 63 21 41 470 37 83 6
29 John F. Dovidio 63 28 37 2323 56 206 12
30 John T. Cacioppo 63 23 41 256 101 422 41
31 Nalini Ambady 63 24 42 675 51 171 10
32 Philip R. Shaver 63 24 42 675 60 184 13
33 Richard E. Petty 63 30 37 1428 69 190 20
34 Robert B. Cialdini 63 45 49 258 51 121 11
35 Michael Ross 63 38 51 631 42 245 7
36 Lee Ross 62 23 44 952 49 218 12
37 Roy F. Baumeister 62 32 44 1015 100 363 46
38 S. Alexander Haslam 62 34 41 289 52 234 9
39 Tom Pyszczynski 62 33 40 1101 60 149 13
40 Philip E. Tetlock 62 29 39 158 58 189 11
41 Arie W. Kruglanski 61 23 42 1140 50 228 13
42 Galen V. Bodenhausen 61 18 43 465 39 80 8
43 Norbert Schwarz 61 36 42 2524 49 138 13
44 Jonathan Haidt 60 16 37 98 42 84 14
45 Shelly Chaiken 60 15 36 288 46 86 10
46 Ap Dijksterhuis 59 17 36 456 42 118 8
47 Eddie Harmon-Jones 59 25 37 343 57 212 10
48 Fritz Strack 59 22 43 588 51 149 11
49 Joseph P. Forgas 59 29 43 176 37 142 4
50 Yaacov Trope 59 20 39 1957 57 135 12
51 Charles M. Judd 59 28 33 666 37 142 6
52 Craig A. Anderson 59 26 34 265 51 117 11
53 C. Nathan DeWall 58 30 41 1099 37 135 7
54 Eli J. Finkel 58 24 35 1921 38 109 4
55 Jeffry A. Simpson 58 18 28 261 42 95 7
56 Peter M. Gollwitzer 58 29 44 1711 45 158 11
57 Mario Mikulincer 58 29 34 104 68 272 15
58 Russell H. Fazio 57 23 29 464 51 134 14
59 Thomas Mussweiler 57 24 38 1128 35 85 5
60 Daniel M. Wegner 56 27 36 699 53 130 14
61 John A. Bargh 56 20 36 755 61 140 22
62 Robert S. Wyer 56 33 44 283 40 203 8
63 Carol S. Dweck 54 17 29 458 59 147 20
64 Michael Inzlicht 54 32 36 156 36 124 5
65 John T. Jost 52 21 33 249 46 132 11
66 Shelly E. Taylor 51 15 28 198 73 169 28
67 Claude M. Steele 48 38 42 376 29 49 13
68 Gerald L. Clore 45 20 35 200 37 84 9
69 Adam D. Galinsky 43 24 28 585 57 198 11
70 Robert Zajonc 39 11 26 67 40 114 16
71 Jennifer Crocker 22 5 5 99 38 68 7


The Misattribution Error in the Alpha Wars about Significance Criteria

Preprint. Draft.  Comments are welcome.

A year ago, a group of 71 scientists published a commentary in the journal Nature Human Behaviour (Benjamin et al., 2017).   Several of the authors are prominent members of a revolutionary movement that aims to change the way behavioral scientists do research (Brian A. Nosek, E.-J. Wagenmakers, Kenneth A. Bollen, Christopher D. Chambers, Andy P. Field, Donald P. Green, Anthony Greenwald, Larry V. Hedges, John P. A. Ioannidis, Scott E. Maxwell, Felix D. Schönbrodt, & Simine Vazire).

The main argument made in this article is that the standard criterion for statistical significance, a 5% risk of reporting a false positive result (i.e., the type-I error probability in Neyman-Pearson’s framework), is too high.  The authors recommend lowering the false-positive risk from 5% (p < .05) to 0.5% (p < .005).

This recommendation is based on the authors’ shared belief that “a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. (p. 1)”

In contrast, others, including myself, have argued that the main cause of low reproducibility is that researchers conduct studies with a low probability of producing p-values less than .05 even when the null-hypothesis is false (i.e., a high type-II error probability in Neyman-Pearson’s framework) (Open Science Collaboration, 2015; Schimmack, 2012).

The probability of obtaining a true positive result is called statistical power.   The main problem of low power is that many studies produce inconclusive, non-significant results (Cohen, 1962).  However, another problem is that low power also produces significant results that are difficult to reproduce because significance can only be obtained if sampling error boosts observed effect sizes and test statistics.  Replication studies do not reproduce the same random sampling error and are likely to produce non-significant results.
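This selection effect is easy to demonstrate with a small Monte Carlo sketch. The sketch below is in Python (the blog’s own code is in R), and the effect size, sample size, and number of simulations are arbitrary choices for illustration: with a true effect of d = .3 and n = 20 per cell, power is well below 50%, and the significant results that do occur carry observed effect sizes inflated well above the true effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d, n, alpha, sims = 0.3, 20, 0.05, 10_000

obs_d_sig = []  # observed effect sizes of the significant studies
for _ in range(sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    if p < alpha and t > 0:
        obs_d_sig.append(treatment.mean() - control.mean())

print(f"power (share of significant studies): {len(obs_d_sig) / sims:.2f}")
print(f"mean observed d among significant results: {np.mean(obs_d_sig):.2f}")
```

Because only samples in which random sampling error inflated the effect pass the significance filter, the average observed effect among significant results is far larger than the true d = .3, and exact replications, which do not inherit that lucky error, mostly fail.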

The problem of low power is amplified by the use of questionable research practices, such as selective publishing of results that support a hypothesis. At least in 2011, most psychologists did not consider these practices problematic or unethical (John, Loewenstein, & Prelec, 2012).


The problem with these practices is that replication studies no longer weed out false positives that passed the significance filter in an original discovery.  These practices explain why psychologists often report multiple successful replications of their original study, even if the statistical power to do so is low (Schimmack, 2012).

Benjamin et al. (2017) dismiss this explanation for low reproducibility.

“There has been much progress toward documenting and addressing several causes of this lack of reproducibility (for example, multiple testing, P-hacking, publication bias and under-powered studies).”   

Notably, the authors provide no references for the claim that low power and questionable research practices have been documented, let alone addressed in the behavioral sciences.

The Open Science Collaboration documented that these problems contribute to replication failures (OSC, 2015) and there is no evidence that these practices have changed.

Even if some problems have been addressed, nobody really knows how researchers produce more significant results than statistical power predicts. As these factors remain unidentified, I will refer to them from now on as “unidentified researcher influences” (URIs).

Because Benjamin et al. ignore URIs in their comment, they fail to make the most persuasive argument in favor of lowering the significance criterion from .05 to .005; namely, that the significance criterion influences how much URIs contribute to discoveries. This was shown in a set of simulation studies by Simmons, Nelson, and Simonsohn (2011; see Table 1 of their article).

The most dramatic simulation of questionable research practices shows that the type-I error risk increases from 5% to 81.5% for marginally significant results, p < .10 (two-tailed).  The actual type-I error risk is still 60.7% with the current standard of p < .05 (two-tailed). However, it drops to “just” 21.5% with a more conservative criterion of p < .01 (two-tailed).  It would be even lower for the proposed criterion of p < .005.
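The mechanics of this inflation are easy to reproduce. The following Python sketch is not Simmons et al.’s original simulation, and the sample sizes are arbitrary; it implements just one questionable research practice (optional stopping with a single additional batch of observations) under a true null hypothesis, and this alone already pushes the actual type-I error rate above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sims, alpha = 5_000, 0.05
false_pos = 0
for _ in range(sims):
    # the null hypothesis is true: both groups come from the same population
    a, b = rng.normal(size=25), rng.normal(size=25)
    _, p = stats.ttest_ind(a, b)
    if p >= alpha:
        # QRP: not significant, so collect 15 more observations per cell and retest
        a = np.concatenate([a, rng.normal(size=15)])
        b = np.concatenate([b, rng.normal(size=15)])
        _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_pos += 1

print(f"actual type-I error rate: {false_pos / sims:.3f}")
```

Stacking several such practices, as Simmons et al. did, drives the error rate far higher still.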

Thus, a simple solution to the problem of URIs is to lower the significance criterion.  Unfortunately, lowering the significance criterion for everybody has the negative effect of increasing costs for researchers who minimize the influence of URIs and conduct a priori power analyses to plan their studies.

This can be easily seen by computing statistical power for different levels of statistical significance, assuming a small, medium, or large effect size with alpha = .05 versus alpha = .005 in a power-hungry between-subjects design (independent t-test).
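A minimal sketch of such a power analysis in Python (using the noncentral t distribution; the helper names are mine) shows the extra observations required to maintain 80% power when alpha is lowered from .05 to .005:

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n, alpha):
    """Power of a two-sided independent-samples t-test with n per group."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)                 # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

def n_per_group_for_power(d, alpha, target=0.80):
    """Smallest n per group that reaches the target power."""
    n = 2
    while power_two_sample_t(d, n, alpha) < target:
        n += 1
    return n

for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    n05 = n_per_group_for_power(d, 0.05)
    n005 = n_per_group_for_power(d, 0.005)
    print(f"{label} effect (d = {d}): n = {n05} per group at alpha = .05, "
          f"n = {n005} at alpha = .005")
```

For a medium effect, the required sample rises from about 64 per group at alpha = .05 to well over 100 at alpha = .005, a substantial increase in resources for researchers who already plan their studies with a priori power analyses.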


This is the reason why I think minimizing URIs and honest reporting of replication results is the most effective way to solve the reproducibility problem in the behavioral sciences.  This is also the reason why I developed statistical tests that can reveal URIs in published data and that could be used by editors and reviewers to reduce the risk of publishing false positive discoveries.

Should we lower alpha even if the problem of URIs were addressed?

Benjamin et al. (2017) claim that a 5% false positive risk is still too high even if URIs were no longer a problem.  I think their argument ignores the importance of statistical power.  The percentage of false discoveries among all statistically significant results is a function of the type-I error probability and the type-II error probability.  This can be easily seen by examining a few simple scenarios.  The scenarios assume that a high percentage (50%) of tested hypotheses are false.


With 20% power and alpha = .05, 20% of significant results would be false positives (1 out of 5).   This seems, indeed, unreasonably high. However, nobody should conduct studies with 20% power. Tversky and Kahneman (1971) suggested that reasonable scientists would have at least 50% power in their studies.  Now the risk of a false positive is 1 out of 11 significant results.  Even 50% power is low, and the most widely accepted standard for statistical power is Cohen’s (1988) recommendation to plan for 80% power.  Now, the risk of a false positive is reduced to 1 out of 17 significant results.

Most important, the scenario assumes that only a single study is being conducted.  With each honestly reported replication study, the percentage of false positives decreases exponentially.  For example, for a pair of an original study and a replication study, each with 80% power, only 1 out of 257 pairs of significant results would be a false positive, while a non-significant result in a replication study would flag the original result as a potential false positive.
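These numbers follow directly from Bayes’ rule. A few lines of Python (the 50% prior and the power levels are the scenario’s assumptions, and the function name is mine) reproduce the 1-in-5, 1-in-11, 1-in-17, and 1-in-257 figures, plus the corresponding figure for a single study at the proposed alpha of .005:

```python
def false_discovery_risk(alpha, power, prior_false=0.5, k=1):
    """P(H0 is true | k honestly reported studies are all significant)."""
    fp = prior_false * alpha ** k          # all k studies significant, H0 true
    tp = (1 - prior_false) * power ** k    # all k studies significant, H1 true
    return fp / (fp + tp)

# one significant result out of how many is a false positive?
print(round(1 / false_discovery_risk(0.05, 0.20)))        # 5   (20% power)
print(round(1 / false_discovery_risk(0.05, 0.50)))        # 11  (50% power)
print(round(1 / false_discovery_risk(0.05, 0.80)))        # 17  (80% power)
print(round(1 / false_discovery_risk(0.05, 0.80, k=2)))   # 257 (significant pair)
print(round(1 / false_discovery_risk(0.005, 0.80)))       # 161 (alpha = .005)
```

A single study at alpha = .005 (1 in 161) thus buys less protection than an honestly reported significant replication at the conventional alpha = .05 (1 in 257).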

The table also shows that lowering the significance criterion reduces the percentage of false positives. However, this is achieved at the cost of using more resources for a single study.  It is important to consider these trade-offs.  Sometimes, it might be beneficial to demonstrate significant results in two conceptual replication studies rather than in a single study that tests a hypothesis with one specific paradigm.  It might even be beneficial to relax the alpha level for a first exploratory study to 20% and to require a larger sample and stronger evidence, with an alpha level of 5% or 0.5%, for a confirmatory replication study.

While these are important questions to consider in the planning of studies, the balancing of type-I and type-II errors according to the specific question being asked is at the core of Neyman-Pearson’s statistical framework.  Whether lowering alpha to a fixed level of .005 is always the best option can be debated.

However, I don’t think we should have a debate about URIs.  The goal of empirical science is to reduce error in human observations wherever possible.  One might even define science as the practice of observing things with the least amount of human error.  This also seems to be a widely held view of scientific activity.  Unfortunately, science is a human activity and the results reported in scientific journals can be biased by a myriad of URIs.

As Fiske and Taylor (1984) described human information processing: “Instead of a naive scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (p. 88).

What separates charlatans from scientists is the proper use of scientific methods and the honest reporting of all data and all the assumptions that were made in the inferential steps from data to conclusions.  Unfortunately, scientists are human and the human motivation to be right can distort the collective human effort to understand the world and themselves.

Thus, I think there are no trade-offs when it comes to URIs. URIs need to be minimized as much as possible because they undermine the collective goal of science, waste resources, and undermine the credibility of scientists to inform the public about important issues.

If you agree, you can say so in the comment section and maybe become an author of another multiple-author comment on the replication crisis that calls for clear guidelines about scientific integrity that all behavioral scientists need to follow with clear consequences for violating these standards.  Researchers who violate this code should not receive public money to support their research.


In conclusion, I argued that Benjamin et al. (2017) made an attribution error when they blamed the traditional significance criterion for the reproducibility crisis.  The real culprits are unidentified researcher influences (URIs) that increase the false positive risk and inflate effect sizes.  One solution to this problem is to lower alpha, but this approach requires that more resources are spent on demonstrating true findings. A better approach is to ensure that researchers minimize unintended researcher influences in their labs and that scientific organizations provide clear guidelines about best practices. Most important, it is not acceptable to suppress conceptual or direct replication studies that failed to support an original discovery.  Nobody should have to trust original discoveries if researchers do not replicate their work or if their self-replications cannot be trusted.















Estimating Reproducibility of Psychology (No. 107): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could or could not be replicated.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

The article “Nonconscious goal pursuit in novel environments: The case of implicit learning”  was published in the journal Psychological Science.


The article has been cited 41 times, with few citations in recent years.

Study 1

51 students participated in Study 1.  The experiment was a social priming study, where participants were presented with achievement words or words not related to achievement in the control condition.  In a supposedly unrelated task, participants worked on a hypothetical management task (of a sugar factory).  The authors report that achievement primes significantly enhanced performance in the managerial task, t(45) = 2.1.

Study 2 

The study chosen for the replication attempt was Study 2 with 93 participants.  Once more, achievement-primed participants showed better learning of the managerial task, t(84) = 2.09.

Replication Study 

The replication study had a larger sample size (N = 158).  Nevertheless, it failed to reproduce a significant result, t(156) = 1.32 (mean difference in the opposite direction).

Importantly, both studies reported just-significant results, which is statistically unlikely for two independent samples.  Thus, the replication failure may be due to inflated effect sizes in the original studies that were produced by selection for significance.
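The two reported test statistics can be converted into two-sided p-values with a couple of lines of Python (scipy is used here merely for the t distribution):

```python
from scipy import stats

# two-sided p-values implied by the reported test statistics
p1 = 2 * stats.t.sf(2.10, 45)   # Study 1: t(45) = 2.1
p2 = 2 * stats.t.sf(2.09, 84)   # Study 2: t(84) = 2.09
print(f"Study 1: p = {p1:.3f}")
print(f"Study 2: p = {p2:.3f}")
```

Both p-values fall in the narrow just-significant band between .05 and .01, which is improbable for two independent studies unless results were selected for significance.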

Moreover, other social priming studies have also failed to replicate, and social priming has been named “the poster child” of the replication crisis in social psychology.

Thus, the replication failure is not surprising.


Estimating Reproducibility of Psychology (No. 118): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could or could not be replicated.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article


The article presents one study with a 2 x 2 between-subjects design with 120 participants (n = 30 per cell).   One experimental factor manipulated the intake of sugar: a lemonade was sweetened either with sugar or with Splenda.  The second factor manipulated attention regulation.  While watching an interview, words were displayed at the bottom of the screen. Half of the participants were instructed not to look at the words. The other half were given no instructions about their attentional focus.  The dependent variable was a hypothetical decision task.

The authors used a focal contrast analysis that compared the Splenda and attention-regulation condition against the other three conditions.  This contrast was statistically significant, F(1,111) = 5.31.
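For readers unfamiliar with focal contrasts, the sketch below shows how such a 1-df contrast (weights +3, -1, -1, -1 for the focal cell versus the other three) can be computed by hand; the simulated cell means are hypothetical and not taken from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# four cells of a 2 x 2 design, n = 30 per cell; means are hypothetical
cells = [rng.normal(loc=m, scale=1.0, size=30) for m in (1.5, 0.0, 0.0, 0.0)]
weights = np.array([3, -1, -1, -1])   # focal cell against the other three

means = np.array([c.mean() for c in cells])
ns = np.array([len(c) for c in cells])
df_error = int(sum(len(c) - 1 for c in cells))
ms_error = sum(((c - c.mean()) ** 2).sum() for c in cells) / df_error

contrast = (weights * means).sum()
se = np.sqrt(ms_error * (weights ** 2 / ns).sum())
t_val = contrast / se
F_val = t_val ** 2     # a 1-df contrast F is the squared contrast t
print(f"F(1,{df_error}) = {F_val:.2f}")
```

The contrast concentrates the hypothesis test into a single degree of freedom, which is why the article reports an F statistic with 1 numerator degree of freedom.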

Replication Study

The replication study followed the same procedure with a slightly larger sample (N = 169).  The same statistical procedure produced a non-significant result, F(1,158) = 0.38.  The replication authors mention that the original study was carried out in Florida and that the replication study was carried out in Virginia.


The replication study failed to replicate the original result.  This is not surprising, given other replication failures for glucose effects and statistical problems of original glucose studies (Schimmack, 2012).







An Introduction to Z-Curve: A method for estimating mean power after selection for significance (replicability)

Since 2015, Jerry Brunner and I have been working on a statistical tool that can estimate mean (statistical) power for a set of studies with heterogeneous sample sizes and effect sizes (heterogeneity in non-centrality parameters and true power).   This method corrects for the inflation in mean observed power that is introduced by the selection for statistical significance.   Knowledge about mean power makes it possible to predict the success rate of exact replication studies.   For example, if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect 60% of the replication studies to produce a significant result again.
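This logic can be illustrated with a small Python simulation (the z-curve code itself is in R, and the mixture of true powers below is hypothetical): after selecting only significant results, the mean true power of the selected studies matches the success rate of exact replications of those studies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, sims = 0.05, 20_000
z_crit = stats.norm.ppf(1 - alpha / 2)

# hypothetical heterogeneous true power across studies
powers = rng.uniform(0.2, 0.9, size=sims)
# noncentrality that yields each study's power (normal approximation)
ncp = z_crit + stats.norm.ppf(powers)

z_obs = rng.normal(ncp, 1.0)
sig = z_obs > z_crit                      # selection for significance
mean_power_after_selection = powers[sig].mean()

z_rep = rng.normal(ncp[sig], 1.0)         # exact replications of significant studies
replication_success = (z_rep > z_crit).mean()

print(f"mean true power after selection: {mean_power_after_selection:.2f}")
print(f"replication success rate:        {replication_success:.2f}")
```

Note that selection raises mean power above the unselected average, which is why naive estimates based on observed power are inflated and need the correction that z-curve provides.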

Our latest manuscript is a revision of an earlier manuscript that received a revise-and-resubmit decision from the free, open-peer-review journal Meta-Psychology.  We consider it the most authoritative introduction to z-curve; it should be used to learn about z-curve, to critique it, or as a citation for studies that use z-curve.

Cite as “submitted for publication”.

Final.Revision.874-Manuscript in PDF-2236-1-4-20180425 mva final (002)

Feel free to ask questions, provide comments, and critique our manuscript in the comments section.  We are proud to be an open-science lab and consider criticism an opportunity to improve z-curve and our understanding of power estimation.

Latest R-Code to run Z.Curve (Z.Curve.Public.18.10.28).
[updated 18/11/17]   [35 lines of code]
Call the function:  mean.power = zcurve(pvalues, Plot=FALSE, alpha=.05, bw=.05)[1]

Z-Curve related Talks
Presentation on Z-curve and application to BS Experimental Social Psychology and (Mostly) WS-Cognitive Psychology at U Waterloo (November 2, 2018)
[Powerpoint Slides]