Dr. Ulrich Schimmack’s Blog about Replicability

"For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication" (Cohen, 1994).

DEFINITION OF REPLICABILITY
In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication with the same sample size and significance criterion.

REPLICABILITY REPORTS:  Examining the replicability of research topics
RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?
RR No3. (September 4, 2017) The power of the pen paradigm: A replicability analysis

Featured Blog of the Month (November, 2018):
Replicability Rankings of Eminent Social Psychologists
–  no significant correlation between Eminence (H-Index) and Replicability (R-Index)
–  most p-values between .05 and .01 are not significant after correcting for selection for significance and questionable research practices
–  replicability varies from 22% to 81%



1.  Preliminary 2017  Replicability Rankings of 104 Psychology Journals
Rankings of 104 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2017, and a comparison of psychological disciplines (cognitive, clinical, social, developmental, biological).

2.  Introduction to Z-Curve with R-Code
This post presents the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post explains the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores with a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.
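The first step of this approach, converting reported two-sided p-values into absolute z-scores, can be sketched in a few lines of Python. This is only an illustration of the conversion (the p-values below are hypothetical); the z-curve model itself, which fits a mixture model to the resulting z-score distribution, is not shown.

```python
from statistics import NormalDist

def p_to_abs_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score.

    A two-sided p-value p corresponds to |z| = Phi^-1(1 - p/2),
    where Phi is the standard normal CDF.
    """
    return NormalDist().inv_cdf(1 - p / 2)

# Hypothetical p-values from a set of published significant results.
p_values = [0.049, 0.020, 0.003, 0.0001]
z_scores = [p_to_abs_z(p) for p in p_values]
# p = .05 maps to z = 1.96, the two-tailed significance threshold.
```

The distribution of these absolute z-scores above 1.96 is the input to the z-curve estimation.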


3. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
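Given the formula stated above (R-Index = Observed Median Power – Inflation), a minimal sketch of the computation might look as follows. Python is used for illustration, the z-scores are hypothetical, and observed power is approximated with a normal model; the published method should be consulted for the exact procedure.

```python
from statistics import NormalDist, median

def observed_power(z: float, crit: float = 1.959964) -> float:
    """Post-hoc power: probability of a significant result if the
    observed z-score were the true population value (normal model)."""
    return 1 - NormalDist().cdf(crit - z)

# Hypothetical set of published results, all just significant.
z_scores = [2.0, 2.2, 2.5, 2.1]
success_rate = 1.0  # all reported results are significant

med_power = median(observed_power(z) for z in z_scores)
inflation = success_rate - med_power          # excess of successes over power
r_index = med_power - inflation               # = 2 * median power - success rate
```

A success rate of 100% combined with median observed power around 60% signals inflation, and the R-Index discounts the power estimate accordingly.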


4.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, the z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
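For the two-study case the test can be sketched with the standard library alone, because the chi-square CDF with one degree of freedom reduces to erf(sqrt(x/2)). The z-scores below are hypothetical; for more than two studies a general chi-square CDF would be needed.

```python
from math import erf, sqrt
from statistics import variance

def tiva_two_studies(z1: float, z2: float) -> float:
    """Left-tailed TIVA p-value for two independent z-scores.

    Under H0 the z-scores have variance 1, so (k - 1) * var follows a
    chi-square distribution with k - 1 = 1 degree of freedom, whose CDF
    is erf(sqrt(x / 2)).  A small p-value means the observed variance
    is suspiciously small.
    """
    var = variance([z1, z2])   # sample variance, denominator k - 1 = 1
    x = 1 * var                # (k - 1) * var / expected variance of 1
    return erf(sqrt(x / 2))    # P(chi2_1 <= x)

# Two just-significant results with suspiciously similar z-scores.
p_suspicious = tiva_two_studies(2.00, 2.05)   # well below .05
p_unremarkable = tiva_two_studies(1.00, 3.00) # consistent with variance 1
```

Two nearly identical just-significant z-scores already yield a significant left-tailed p-value, flagging insufficient variance.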

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking fast and slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.
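The relation between effect size, sample size, and power that GPower computes can be approximated in a few lines. This sketch uses a one-sample z-test approximation (ignoring the negligible chance of significance in the wrong direction) rather than GPower's exact t-based calculation, so the numbers differ slightly from GPower's output.

```python
from statistics import NormalDist

def power_two_sided(effect: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided one-sample z-test.

    The expected z-score is the standardized effect size times sqrt(n);
    power is the probability that it exceeds the critical value.
    """
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    expected_z = effect * n ** 0.5
    return 1 - NormalDist().cdf(crit - expected_z)

# A small effect (d = .2) with n = 50 has low power: the type-II error
# rate (failing to detect a real effect) is around 70%.
low = power_two_sided(0.2, 50)     # roughly .29
high = power_two_sided(0.2, 200)   # roughly .81
```

Quadrupling the sample size moves power from under 30% to about 80%, which is why underpowered studies mainly produce significant results when they capitalize on chance.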


8.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
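The dependence of Bayes-Factors on the alternative hypothesis can be illustrated with simple point alternatives instead of the Cauchy prior used by default Bayes-Factors; the numbers are hypothetical, but the qualitative point is the same.

```python
from math import exp, sqrt, pi

def normal_pdf(x: float, mu: float = 0.0) -> float:
    """Standard-deviation-1 normal density."""
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)

def bf01_point_alternative(z: float, d_alt: float, n: int) -> float:
    """Bayes-Factor for H0 against a point alternative with effect d_alt.

    Likelihoods are evaluated for the observed z-score of a one-sample
    design with n observations (normal approximation); under the
    alternative the expected z-score is d_alt * sqrt(n).
    """
    mu_alt = d_alt * sqrt(n)
    return normal_pdf(z) / normal_pdf(z, mu_alt)

# A small true effect (d = .2, n = 100) produces z-scores around 2.
z_obs = 2.0
bf_vs_large = bf01_point_alternative(z_obs, d_alt=1.0, n=100)  # H1: d = 1
bf_vs_small = bf01_point_alternative(z_obs, d_alt=0.2, n=100)  # H1: d = .2
# Against the unrealistic large-effect alternative, the same data
# "support" H0 (bf_vs_large >> 1); against a realistic small-effect
# alternative, they favor H1 (bf_vs_small < 1).
```

The same observed z-score thus yields overwhelming "evidence" for the null against one alternative and evidence against the null against another, which is the point made above.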

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.


Can Students Trust Social Psychology Textbooks?

Humanistic psychologists have a positive image of human nature.  Given the right environment, humans would act in the interest of the greater good.  Similarly, academia was founded on the ideal of a shared understanding of the world based on empirical facts.  Prominent representatives of psychological science still present this naive image of science.

“Our field has always encouraged — required, really — peer critiques.”
(Susan T. Fiske, 2016)

The notion of peer criticism is naive because scientific peers are both active players and referees.  If you don’t think that the World Cup final could be refereed by the players, you should also not believe that scientists can be objective when they have to critique their own science.  This is not news to social psychologists, who teach about motivated biases in their classes, but suddenly these rules of human behavior don’t apply to social psychologists as if they were meta-humans.

Should active researchers write introductory textbooks?

It can be difficult to be objective in the absence of strong empirical evidence.  Thus, disagreements among scientists are part of the normal scientific process of searching for a scientific answer to an important question. However, textbooks are supposed to introduce a new generation of students to fundamental facts that serve as the foundation for further discoveries.  There is no excuse for self-serving biases in introductory textbooks.    

Some textbooks are written by professional textbook writers.  However, other textbooks are written by active and often eminent researchers.  Everything we know about human behavior predicts that they will be unable to present criticism of their field objectively. And the discussion of the replication crisis in social psychology in Gilovich, Keltner, Chen, and Nisbett (2019) confirms this prediction.

The Replication Crisis in Social Psychology in a Social Psychology Textbook

During the past decade, social psychology has been rocked by scandals ranging from outright fraud to replication failures of some of its most celebrated textbook findings, such as unconscious priming of social behavior (Bargh) and ego-depletion (Baumeister), and by the finding that a representative sample of replication studies failed to replicate 75% of published results in social psychology (OSC, 2015).

The forthcoming 5th edition of this social psychology textbook does mention the influential OSC reproducibility project.  However, the presentation is severely biased and fails to inform students that many findings in social psychology were obtained with questionable research practices and may not replicate.   

How to Whitewash Replication Failures in Social Psychology

The textbook starts with the observation that replication failures generate controversy, but ends with the optimistic conclusion that scientists then reach a consensus about the reasons why a replication failure occurred.  

“These debates usually result in a consensus about whether a particular finding should be accepted or not. In this way, science is self-correcting” 

This rosy picture of science is contradicted by the authors’ own response to the replication failures in the Open Science Reproducibility Project.  There is no consensus about the outcome of the reproducibility project, and social psychologists’ views are very different from outsiders’ interpretations of these results.

“In 2015, Brian Nosek and dozens of other psychologists published an article in the journal Science reporting on attempts to replicate  [attempts to replicate!!!]  100 psychological studies (Open Science Collaboration, 2015).  They found that depending on the criterion used, 36-47 percent of the original studies were successfully replicated.” 

They leave out that the article also reported different success rates for social psychology, the focus of the textbook, and cognitive psychology.  The success rate for social psychology was only 25%, but this also included some personality studies. The success rate for the classic between-subject experiment in social psychology was only 4%!  This information is missing, although (or because?) it would make undergraduate students wonder about the robustness of the empirical studies in their textbook. 

Next students are informed that they should not trust the results of this study.  

“The findings received heavy criticism from some quarters (Gilbert, King, Pettigrew, & Wilson, 2016).” 

No mention is made of who these critics are or that Wilson is a student of textbook author Nisbett.

“The most prominent concern was that many of the attempted replications utilized procedures that differed substantially from the original studies and thus weren’t replications at all.” 

What is “many” and what is a “substantial” difference?  Students are basically told that the replication project was carried out in the most incompetent way (the replication studies weren’t replications) and that the editor of the most prominent journal for all sciences didn’t realize this.  This is the way social psychologists often create their facts: with a stroke of a pen and without empirical evidence to back it up.

Students are then told that other studies have produced much higher estimates of replicability that reassure students that textbook findings are credible.

“Other systematic efforts to reproduce the results of findings reported in behavioral science journals have yielded higher replication rates, on the order of 75-85 percent (Camerer et al., 2016; Klein et al., 2014). “

I have been following the replication crisis since its beginning and I have never seen success rates of this magnitude.  Thus, I fact checked these estimates that are presented to undergraduate students as the “real” replication rates of psychology, presumably including social psychology. 

The Camerer et al. (2016) article is titled “Evaluating replicability of laboratory experiments in economics.”  Economics!  Even if the success rate in this article were 75%, it would have no relevance for the majority of studies reported in a social psychology textbook.  Maybe telling students that replicability in economics is much better than in psychology would make some students switch to economics.

The Klein et al. (2014) article did report on the replicability of studies in psychology.   However, it only replicated 13 studies and the studies were not a representative sample of studies, which makes it impossible to generalize the success rate to a population of studies like the studies in a social psychology textbook. 

We all know the saying: there are lies, damn lies, and statistics. The 75-85% success rate in “good” replication studies is a damn lie with statistics.  It misrepresents the extent of the replication crisis in social psychology.  An analysis of a representative set of hundreds of original results leads to the conclusion that no more than 50% of replication studies would reproduce a significant result even if the studies could be replicated exactly (How replicable is psychological science).  Telling students otherwise is misleading.

The textbook authors do acknowledge that failed replication studies can sometimes reveal shoddy work by original researchers.

“In those cases, investigators who report failed attempts to replicate do a great service to everyone for setting the record straight.” 

They also note that social psychologists are slowly changing research practices to reduce the number of significant results that are obtained with “shoddy practices” that do not replicate. 

“Foremost among these changes has been an increase in the sample sizes generally used in research.” 

One wonders why these changes are needed if success rates are already 75% or higher. 

The discussion of the replication crisis ends with the reassurance that probably most of the reported results in the article are credible and that evidence is presented objectively.

“In this textbook we have tried to be scrupulous about noting when the evidence about a given point is mixed.”  

How credible is this claim when the authors misrepresent the OSC (2015) article as a collection of amateur studies that can be ignored and then cite a study of economics to claim that social psychology is replicable? 

Moreover, the authors have a conflict of interest because they have a monetary incentive to present social psychology in the most positive light so that students take social psychology courses and buy social psychology textbooks. 

A more rigorous audit of this and other social psychology textbooks by independent investigators is needed because we cannot trust social psychologists to be objective in the assessment of their field.  After all, they are human. 

Why Wagenmakers is Wrong

The crisis of confidence in psychological science started with Bem’s (2011) article in the Journal of Personality and Social Psychology.   The article made the incredible claim that extraverts can foresee future random events (e.g., the location of an erotic picture) above chance. 

Rather than demonstrating some superhuman abilities, the article revealed major problems in the way psychologists conduct research and report their results. 

Wagenmakers and colleagues were given the opportunity to express their concerns in a commentary that was published along with the original article, which is highly unusual (Wagenmakers et al., 2011).  

Wagenmakers used this opportunity to attribute the problems in psychological science to the use of p-values.  The claim that the replication crisis in psychology follows from the use of p-values has been repeated several times, most recently in a special issue that promotes Bayes Factors as an alternative statistical approach. 

“the edifice of NHST appears to show subtle signs of decay. This is arguably due to the recent trials and tribulations collectively known as the ‘crisis of confidence’ in psychological research, and indeed, in empirical research more generally (e.g., Begley & Ellis, 2012; Button et al., 2013; Ioannidis, 2005; John, Loewenstein, & Prelec, 2012; Nosek & Bar-Anan, 2012; Nosek, Spies, & Motyl, 2012; Pashler & Wagenmakers, 2012; Simmons, Nelson, & Simonsohn, 2011). This crisis of confidence has stimulated a methodological reorientation away from the current practice of p value NHST” (Wagenmakers et al., 2018, Psychonomic Bulletin and Review).

In short, Bem used NHST and p-values, Bem’s claims are false, therefore NHST and p-values are false. 

However, it does not follow from Bem’s use of p-values that NHST is flawed or caused the replication crisis in experimental social psychology, just as it does not follow from the fact that Bem is a man and his claims were false that all claims made by men are false.

The key problem with Bem’s article is that he used questionable and some would argue fraudulent research practices to produce incredible p-values (Francis, 2012; Schimmack, 2012).  For example, he combined several smaller studies with promising trends into a single dataset to report a p-value less than .05 (Schimmack, 2018).  This highly problematic practice violates the assumption that the observations in a dataset are drawn from a representative sample.  It is not clear how any statistical method could produce valid results when its basic assumptions are violated. 

So, we have two competing accounts of the replication crisis in psychology.  Wagenmakers argues that even proper use of NHST produces questionable results that are difficult to replicate. In contrast, I argue that proper use of NHST produces credible p-values that can be replicated and only questionable research practices and abuse of NHST produce incredible p-values that cannot be replicated. 

Who is right? 

The answer is simple. Wagenmakers et al. (2011) engaged in a questionable research practice to demonstrate the superiority of Bayes-Factors when they examined Bem’s results with Bayesian statistics.  They analyzed each study individually to show that each study alone produced fairly weak evidence for extraverts’ miraculous extrasensory abilities.  However, they did not report the results of a meta-analysis of all studies.

The weak evidence in each single study is not important because JPSP would not have accepted Bem’s manuscript for publication if he had presented only a single study with a significant result. In 2011, social psychologists were well aware that a single p-value less than .05 provides only suggestive evidence and does not warrant publication in a top journal (Kerr, 1998).  Most articles in JPSP report four or more studies.  Bem reported 9 studies. Thus, the crucial statistical question is how strong the combined evidence of all 9 studies is.  This question is best addressed by means of a meta-analysis of the evidence.  Wagenmakers et al. (2011) are well aware of this fact, but avoided reporting the results of a Bayesian meta-analysis.

In this article, we have assessed the evidential impact of Bem’s (2011) experiments in isolation. It is certainly possible to combine the information across experiments, for instance by means of a meta-analysis (Storm, Tressoldi, & Di Risio, 2010; Utts, 1991). We are ambivalent about the merits of meta-analyses in the context of psi: One may obtain a significant result by combining the data from many experiments, but this may simply reflect the fact that some proportion of these experiments suffer from experimenter bias and excess exploration (Wagenmakers et al., 2011) 

I believe the real reason they did not report the results of a Bayesian meta-analysis is that it would have shown that p-values and Bayes-Factors lead to the same inference: Bem’s data are inconsistent with the null-hypothesis.  After all, Bayes-Factors and p-values are mere transformations of a test statistic into a different metric.  Holding sample size constant, p-values and Bayes-Factors in favor of the null-hypothesis decrease as the test statistic (e.g., a t-value) increases.  This is shown below with Bem’s data.

Bayesian Meta-Analysis of Bem

Bem reported a mean effect size of d = .22 based on 9 studies with a total of 1170 participants.  A better measure of effect size is the weighted average, which is slightly smaller, d = .197.  The effect size can be tested against an expected value of 0 (no ESP) with a one-sample t-test with a sampling error of 1 / sqrt(1170) = 0.029.  The t-value is .197/.029 = 6.73.  The corresponding z-score is 6.66 (cf. Bem, Utts, & Johnson, 2011).   

The p-value for t(1169) = 6.73 is 2.65e-11 or 0.00000000003.   

I used Rouder’s online app to compute the default Bayes-Factor. 

To obtain the BF in favor of the null-hypothesis, which is more comparable to a p-value that expresses evidence against the null-hypothesis, we invert it and obtain a BF with 8 zeros after the decimal point, BF01 = 1/139,075,597 = 7.19e-09 or 0.000000007.
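The meta-analytic arithmetic can be reproduced in a few lines, using the numbers reported in this post. A normal approximation to the t-distribution with 1169 degrees of freedom is used here, so the p-value comes out slightly smaller than the exact t-based value; this is a sketch, not a substitute for the Bayes-Factor computation from Rouder's app.

```python
from math import sqrt, erfc

# Meta-analytic summary of Bem's 9 studies, as reported in the post.
n = 1170                 # total participants
d = 0.197                # weighted mean effect size
se = 1 / sqrt(n)         # sampling error of d, about 0.029
t = d / se               # about 6.7

# With 1169 degrees of freedom the t-distribution is close to normal,
# so a two-sided p-value can be approximated from the normal tail:
# p = 2 * P(Z > t) = erfc(t / sqrt(2)).
p_two_sided = erfc(t / sqrt(2))   # on the order of 1e-11
```

The point of the computation is not the exact digits but the magnitude: the combined evidence against the null hypothesis is extreme by any conventional standard.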

Given the data, it is reasonable to reject the null-hypothesis using p-values or Bayes-Factors. Thus, the problem is the high t-value and not the transformation of the t-value into a p-value.   

The problem with the t-value becomes clear when we consider that particle physicists (a.k.a. real scientists) use z-values greater than 5 to rule out chance findings.  Thus, Bem’s evidence meets the same strict criterion that was used to celebrate the discovery of the Higgs boson in physics (cf. Schimmack, 2012).


The problem with Bem’s article is not that he used p-values.  He could also have used Bayesian statistics to support his incredible claims.  The problem is that Bem engaged in highly questionable research practices and was not transparent in reporting these practices.  Holding p-values accountable for his behavior would be like holding cars responsible for drunk drivers.  

Wagenmakers’ railing against p-values is akin to Don Quixote’s railing against windmills.  It is not uncommon for a group of scientists to vehemently push an agenda. In fact, the incentive structure in science seems to promote self-promoters.  However, it is disappointing that a peer-reviewed journal uncritically accepted his questionable claim that p-values caused the replication crisis.  There is ample evidence that questionable research practices are being used to produce too many significant results (John et al., 2012; Schimmack, 2012).  Disregarding this evidence to make false, self-serving attributions is just as questionable as other questionable research practices that impede scientific progress.

The biggest danger with Wagenmakers and colleagues’ agenda is that it distracts from the key problems that need to be fixed.  Curbing the use of questionable research practices and increasing the statistical power of studies to produce strong evidence (i.e., high t-values) is paramount to improving psychological science. However, there is little evidence that psychologists have changed their practices since 2011, with the exception of some social psychologists (Schimmack, 2017).

Thus, it is important to realize that Wagenmakers’ attribution of the replication crisis to the use of NHST is a fundamental attribution error in meta-psychology that is rooted in a motivated bias to find some useful application for Bayes-Factors.   Contrary to Wagenmakers et al.’s claim that “Psychologists need to change the way they analyze their data” they actually need to change the way they obtain their data.  With good data, the differences between p-values and Bayes-Factors are of minor importance.



Thinking Too Fast About Life-Satisfaction Judgments

In 2002, Daniel Kahneman was awarded the Nobel Prize for Economics.   He received the award for his groundbreaking work on human irrationality in collaboration with Amos Tversky in the 1970s. 

In 1999, Daniel Kahneman was the lead editor of the book “Well-Being: The foundations of Hedonic Psychology.”   Subsequently, Daniel Kahneman conducted several influential studies on well-being. 

The aim of the book was to draw attention to hedonic or affective experiences as an important, if not the sole, contributor to human happiness.  He called for a return to Bentham’s definition of a good life as a life filled with pleasure and devoid of pain, a.k.a. displeasure.

The book was co-edited by Norbert Schwarz and Ed Diener, who both contributed chapters to the book.  These chapters make contradictory claims about the usefulness of life-satisfaction judgments as an alternative measure of a good life.  Ed Diener is famous for his conception of well-being in terms of a positive hedonic balance (lots of pleasure, little pain) and high life-satisfaction.  In contrast, Schwarz is known as a critic of life-satisfaction judgments.  In fact, Schwarz and Strack’s contribution to the book ended with the claim that “most readers have probably concluded that there is little to be learned from self-reports of global well-being” (p. 80).

To a large part, Schwarz and Strack’s pessimistic view is based on their own studies that seemed to show that life-satisfaction judgments are influenced by transient factors such as current mood or priming effects.

“the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information” (Schwarz & Strack, p. 79).

There is only one problem with this claim; it is only true for a few studies conducted by Schwarz and Strack.  Studies by other researchers have produced much weaker and often not statistically reliable context effects (see Schimmack & Oishi, 2005, for a meta-analysis). 
In fact, a recent attempt to replicate Schwarz and Strack’s results in a large sample of over 7,000 participants failed to show the effect and even found a small, but statistically significant effect in the opposite direction (ManyLabs2).   

When Daniel Kahneman wrote his popular book “Thinking Fast and Slow,” published in 2011, it was clear that Schwarz and Strack’s claims in the 1999 book were not representative of the broader literature on well-being.  However, Chapter 9 relies exclusively on one of Schwarz and Strack’s studies that failed to replicate.

A survey of German students is one of the best examples of substitution. The survey that the young participants completed included the following two questions:

How happy are you these days?
How many dates did you have last month?

The experimenters were interested in the correlation between the two answers. Would the students who reported many dates say that they were happier than those with fewer dates?

Surprisingly, no: the correlation between the answers was about zero. Evidently, dating was not what came first to the students’ minds when they were asked to assess their happiness.

Another group of students saw the same two questions, but in reverse order:

How many dates did you have last month?
How happy are you these days?

The results this time were completely different. In this sequence, the correlation between the number of dates and reported happiness was about as high as correlations between psychological measures can get.

What happened? The explanation is straightforward, and it is a good example of substitution. Dating was apparently not the center of these students’ life (in the first survey, happiness and dating were uncorrelated), but when they were asked to think about their romantic life, they certainly had an emotional reaction. The students who had many dates were reminded of a happy aspect of their life, while those who had none were reminded of loneliness and rejection. The emotion aroused by the dating question was still on everyone’s mind when the query about general happiness came up.

Kahneman did inform his readers that he is biased against life-satisfaction judgments: “Having come to the topic of well-being from the study of the mistaken memories of colonoscopies and painfully cold hands, I was naturally suspicious of global satisfaction with life as a valid measure of well-being” (Kindle Locations 6796-6798). Later on, he even admits his mistake: “Life satisfaction is not a flawed measure of their experienced well-being, as I thought some years ago. It is something else entirely” (Kindle Locations 6911-6912).

However, he does not inform his readers about scientific evidence that these judgments are much more valid than the unrepresentative study by Schwarz and Strack suggests. 

How can we explain the biased presentation of life-satisfaction judgments in Kahneman’s book?  The explanation is simple. Scientists, even Nobel Laureates, are human, and humans are not always rational, which was exactly the point of Kahneman’s early work.  Scientists are supposed to engage in slow information processing that uses all of the available evidence and integrates it in the most systematic and objective way possible. However, scientific thinking is slow and effortful.  Inevitably, scientists sometimes revert to everyday human information processing that is faster but more error-prone.

To write about life-satisfaction judgments, Kahneman could have done a literature search and retrieved all relevant studies or a meta-analysis of these studies (Schimmack & Oishi, 2005).  However, a faster way to report about life-satisfaction judgments was to rely on memory, which led to the retrieval of Schwarz and Strack’s sensational finding.  In this way, the reliance on a single study is a good example of substitution.  What should be answered based on an objective assessment of a whole literature was answered based on a single study because it was falsely assumed that the result was representative of the literature. 

For researchers like Ed Diener and his students, including myself, it has been frustrating to see that the word of an eminent Nobel Laureate has trumped scientific evidence.  The recent failure to replicate Schwarz and Strack’s findings even with over 7,000 participants may help to correct the false belief that item-order effects are pervasive and that life-satisfaction judgments are invalid.  Even Daniel Kahneman does not believe this anymore. 

So Chapter 9 in Daniel Kahneman’s book “Thinking Fast and Slow” is as, or even more, disappointing than Chapter 4, which reported about social priming studies that failed to replicate and provide no empirical evidence for the claims made in that chapter.   

Before I end, I have to make clear that my review of Chapters 4 and 9 should not be generalized to other chapters. I do believe that my criticism of these chapters is valid, but these chapters are not a representative sample of chapters.  The scientific validity of the other chapters needs to be assessed chapter by chapter, and that takes time and effort.  The reason for the focus on Chapter 9 is that I use life-satisfaction judgments in my research (which may make me biased in the opposite direction) and because the key finding featured in Chapter 9 just failed to replicate in a definitive replication study with over 7,000 participants.  I think readers who bought the book might be interested to know about this replication failure.


Kahneman, Daniel. Thinking, Fast and Slow. Doubleday Canada. Kindle Edition.

Replicability Audit of Susan T. Fiske

“Trust is good, but control is better”  


Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated.  However, the replicability of published results in social psychology is doubtful.

Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly.  In a replicability audit, I am applying z-curve to the most cited articles of psychologists to estimate the replicability of their studies.

Susan T. Fiske

Susan T. Fiske is an eminent social psychologist (H-Index in WebofScience = 66).   She also is a prominent figure in meta-psychology.   Her most important contribution to meta-psychology was a guest column in the APS Observer (Fiske, 2016), titled “A Call to Change Science’s Culture of Shaming”: “Our field has always encouraged — required, really — peer critiques. But the new media (e.g., blogs, Twitter, Facebook) can encourage a certain amount of uncurated, unfiltered denigration. In the most extreme examples, individuals are finding their research programs, their careers, and their personal integrity under attack.”

In her article, she refers to researchers who examine the replicability of published results as “self-appointed data police,” which is relatively mild in comparison to the term “method terrorist” that she used in a leaked draft of her article. 

She accuses meta-psychologists of speculating about the motives of researchers who use questionable research practices, but she never examines the motives of meta-psychologists. Why are they devoting their time and resources to meta-psychology and publishing their results on social media rather than advancing their careers by publishing original research in peer-reviewed journals?  One possible reason is that meta-psychologists recognize deep and fundamental problems in the way social psychologists conduct research and are trying to improve it.   

Instead, Fiske denies that psychological science has a problem and claims that the Association for Psychological Science (APS) is a leader in promoting good scientific practices: 

What’s more, APS has been a leader in encouraging robust methods: transparency, replication, power analysis, effect-size reporting, and data access. 

She also dismisses meta-psychological criticism of social psychology as unfounded.

But some critics do engage in public shaming and blaming, often implying dishonesty on the part of the target and other innuendo based on unchecked assumptions. 

In this blog post, I am applying z-curve to Susan T. Fiske’s results to examine whether she used questionable research practices to report mostly significant results that support her predictions, and to examine how replicable her published results are.  The statistical method, z-curve, makes assumptions that have been validated in simulation studies (Brunner & Schimmack, 2018).  


I used WebofScience to identify the most cited articles by Susan T. Fiske (datafile).  I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 41 empirical articles (H-Index = 41).  The 41 articles reported 76 studies (average 1.9 studies per article).  The total number of participants was 21,298 with a median of 54 participants per study.  For each study, I identified the most focal hypothesis test (MFHT).  The result of the test was converted into an exact p-value and the p-value was then converted into a z-score.  The z-scores were submitted to a z-curve analysis to estimate mean power of the 65 results that were significant at p < .05 (two-tailed). Six studies did not test a hypothesis or predicted a non-significant result. The remaining 5 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 70 reported hypothesis tests was 100%.
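The conversion from test results to z-scores described above can be sketched with the standard normal quantile function. This is a minimal illustration; the p-value below is hypothetical, not taken from the audit:

```python
from statistics import NormalDist

# Hypothetical example: convert a two-tailed p-value from a focal
# hypothesis test into the absolute z-score that z-curve models.
p = 0.02  # illustrative p-value, not a value from the coded articles
z = NormalDist().inv_cdf(1 - p / 2)
print(round(z, 2))  # 2.33
```

Any value above the criterion z = 1.96 (two-tailed p < .05) counts as a significant result in the analysis.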

The z-curve estimate of replicability is 59% with a 95%CI ranging from 42% to 77%.  The complementary interpretation of this result is that the actual type-II error rate is 41% compared to the 0% of non-significant results reported in the articles. 

The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results.  The area under the grey curve is an estimate of the file drawer of studies that would need to be conducted to achieve 100% successes with 59% average power.  The ratio of the area of non-significant results to the area of all significant results (including z-scores greater than 6) is called the File Drawer Ratio.  Although this is just a projection, and other questionable practices may have been used, the file drawer ratio of 1.63 and the figure make it clear that the reported results were selected to support theoretical predictions.
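A toy version of the file-drawer logic, under the simplifying (and unrealistic) assumption that every study has the same power: the expected ratio of non-significant to significant studies is (1 − power) / power. The z-curve ratio of 1.63 is larger than this toy value because power varies across studies, and the average power of all attempted studies is lower than the 59% estimated for the subset that reached significance:

```python
# Toy file-drawer calculation under homogeneous power (an assumed
# simplification, not the z-curve model): expected number of
# non-significant studies per published significant result.
power = 0.59
file_drawer_ratio = (1 - power) / power
print(round(file_drawer_ratio, 2))  # 0.69
```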

Z-curve is under development and offers additional information beyond the replicability of significant results.   One new feature is an estimate of the maximum number of false positive results. The maximum percentage of false positive results is estimated to be 30% (95%CI = 10% to 60%).  This estimate means that a z-curve with a fixed percentage of 30% false positives fits the data nearly as well as a z-curve without restrictions on the percentage of false positives.  Given the relatively small number of studies, the estimate is not very precise and the upper limit goes as high as 60%.  It is unlikely that there are 60% false positives, but the point of empirical research is to reduce the risk of false positives to an acceptable level of 5%.  Thus, the actual risk is unacceptably high.

A 59% replicability estimate is actually very high for a social psychologist. However, it would be wrong to apply this estimate to all studies.  The estimate is an average, and replicability varies as a function of the strength of evidence against the null-hypothesis (the magnitude of a z-score).  This is shown with the replicability estimates for segments of z-scores below the x-axis. For just significant results with z-scores from 2 to 2.5 (~ p < .05 & p > .01),  replicability is only 33%.  This means that these results are less likely to replicate, and results of actual replication studies show very low success rates for studies with just significant results.   Without selection bias, significant results have an average replicability greater than 50%.  However, with selection for significance, this is no longer the case. For Susan T. Fiske’s data, the criterion value to achieve 50% average replicability is a z-score greater than 3 (as opposed to 1.96 without selection).  56 reported results meet this criterion.  This is a high percentage of credible results for a social psychologist (see links to other replicability audits at the end of this post).


Although Susan T. Fiske’s work has not been the target of criticism by meta-psychologists, she has been a vocal critic of meta-psychologists.  This audit shows that her work is more replicable than the work by other empirical social psychologists.  One explanation for Fiske’s defense of social psychology could be the false consensus effect, which is a replicable social psychological phenomenon.  In the absence of hard evidence, humans tend to believe that others are more similar to them than they actually are.  Maybe Susan Fiske assumed that social psychologists who have been criticized for their research practices were conducting research like herself.  A comparison of different audits (see below) shows that this is not the case. I wonder what Fiske thinks about the research practices of her colleagues that produce replicability estimates well below 50%.  I believe that a key contributor to the conflict between experimental social psychologists and meta-psychologists is the lack of credible information about the extent of the crisis.  Actual replication studies and replicability reports provide much needed objective facts.  The question is whether social psychologists like Susan Fiske are willing to engage in a scientific discussion about these facts or whether they continue to ignore these facts to maintain the positive illusion that social psychological results can be trusted. 


It is nearly certain that I made some mistakes in the coding of Susan T. Fiske’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit.  The data are openly available and the z-curve code is also openly available.  Thus, this replicability audit is fully transparent and open to revision.

If you found this audit interesting, you might also be interested in other replicability audits.
Roy F. Baumeister (20%) 
Fritz Strack (38%)
Timothy D. Wilson (41%)

Replicability Audit of Fritz Strack

“Trust is good, but control is better”  


Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated.  However, the replicability of published results in social psychology is doubtful.

Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly.  In a replicability audit, I am applying z-curve to the most cited articles of psychologists to estimate the replicability of their studies.

Fritz Strack

Fritz Strack is an eminent social psychologist (H-Index in WebofScience = 51).

Fritz Strack also made two contributions to meta-psychology.

First, he volunteered his facial-feedback study for a registered replication report; a major effort to replicate a published result across many labs.  The study failed to replicate the original finding.  In response, Fritz Strack argued that the replication study introduced cameras as a confound or that the replication team actively tried to find no effect (reverse p-hacking).

Second, Strack co-authored an article that tried to explain replication failures as a result of problems with direct replication studies (Strack & Stroebe, 2014).  This is a concern when replicability is examined with actual replication studies.  However, this concern does not apply when replicability is examined on the basis of test statistics published in original articles.  Using z-curve, we can estimate how replicable these studies would be if they could be replicated exactly, even if exact replication is not possible in practice.

Given Fritz Strack’s skepticism about the value of actual replication studies, he may be particularly interested in estimates based on his own published results.


I used WebofScience to identify the most cited articles by Fritz Strack (datafile).  I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 42 empirical articles (H-Index = 42).  The 42 articles reported 117 studies (average 2.8 studies per article).  The total number of participants was 8,029 with a median of 55 participants per study.  For each study, I identified the most focal hypothesis test (MFHT).  The result of the test was converted into an exact p-value and the p-value was then converted into a z-score.  The z-scores were submitted to a z-curve analysis to estimate mean power of the 103 results that were significant at p < .05 (two-tailed). Three studies did not test a hypothesis or predicted a non-significant result. The remaining 11 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 114 reported hypothesis tests was 100%.


The z-curve estimate of replicability is 38% with a 95%CI ranging from 26% to 51%.  The complementary interpretation of this result is that the actual type-II error rate is 62% compared to the 0% failure rate in the published articles.

The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results.  The area under the grey curve is an estimate of the file drawer of studies that would need to be conducted to achieve 100% successes with 38% average power.  Although this is just a projection, the figure makes it clear that Strack and collaborators used questionable research practices to report only significant results.

Z-curve is under development and offers additional information beyond the replicability of significant results.   One new feature is an estimate of the maximum number of false positive results. The maximum percentage of false positive results is estimated to be 35% (95%CI = 10% to 73%).  Given the relatively small number of studies, the estimate is not very precise and the upper limit goes as high as 73%.  It is unlikely that there are 73% false positives, but the point of empirical research is to reduce the risk of false positives to an acceptable level of 5%.  Thus, the actual risk is unacceptably high.

Based on the low overall replicability it would be difficult to identify results that provided credible evidence.  However, replicability varies with the strength of evidence against the null-hypothesis;  that is, with increasing z-values on the x-axis.  Z-curve provides estimates of replicability for different segments of tests.  For just significant results with z-scores from 2 to 2.5 (~ p < .05 & p > .01),  replicability is just 23%.  These studies can be considered preliminary and require verification with confirmatory studies that need much higher sample sizes to have sufficient power to detect an effect (I would not call these studies mere replication studies because the outcome of these studies is uncertain).  For z-scores between 2.5 and 3, replicability is still below average with 28%.  The nominal type-I error probability of .05 is reached when mean power is above 50%.  This is the case only for z-scores greater than 4.0. Thus, after correcting for the use of questionable research practices, only p-values less than 0.00005 allow rejecting the null-hypothesis with a 5% false positive criterion.  Only 11 results meet this criterion (see data file for the actual studies and hypothesis tests).
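The mapping between z-scores and two-tailed p-values used in this analysis follows directly from the standard normal distribution; as a sketch, z = 1.96 corresponds to p ≈ .05, and z = 4.0 corresponds to p ≈ .00006, the order of magnitude of the corrected criterion mentioned above:

```python
from math import erfc, sqrt

def two_tailed_p(z):
    """Two-tailed p-value for an absolute z-score under the null hypothesis."""
    return erfc(z / sqrt(2))

print(round(two_tailed_p(1.96), 3))  # 0.05
print(round(two_tailed_p(4.0), 5))   # 0.00006
```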


The analysis of Fritz Strack’s published results provides clear evidence that questionable research practices were used and that published significant results could be false positives in two ways.  First, the risk of a classic type-I error is much higher than 5%, and second, many results do not meet a corrected level of significance that takes selection for significance into account.

It is important to emphasize that Fritz Strack and colleagues followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud. Moreover, non-significant results in replication studies do not mean that the theoretical predictions are wrong. It merely means that the published results provide insufficient evidence for the empirical claim.


It is nearly certain that I made some mistakes in the coding of Fritz Strack’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit.  The data are openly available and the z-curve code is also openly available.  Thus, this replicability audit is fully transparent and open to revision.

If you found this audit interesting, you might also be interested in other replicability audits.
Roy F. Baumeister
Timothy D. Wilson

Self-Audit: How replicable is my research?


Since 2014, I have posted statistical information about researchers’ replicability (Schnall; Baumeister).   I have used this information to explain why social psychology experiments often fail to replicate (Open Science Collaboration, 2015).

Some commentators have asked me to examine my own research, and I finally did it.  I used the new format of a replicability audit (Baumeister, Wilson).  A replicability audit picks the most cited articles until the number of articles exceeds the number of citations (H-Index).  For each study, the most focal hypothesis test is selected. The test statistic is converted into a p-value and then into a z-score.  The z-scores are analysed with z-curve (Brunner & Schimmack, 2018) to estimate replicability.


I used WebofScience to identify my most cited articles (datafile).  I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 27 empirical articles (H-Index = 27).  The 27 articles reported 64 studies  (average 2.4 studies per article).  45 of the 64 studies reported a hypothesis test.  The total number of participants in these 45 studies was 350,148 with a median of 136 participants per statistical test.  42 of the 45 tests were significant with alpha = .05 (two-tailed).  The remaining 3 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 45 reported hypothesis tests was 100%.



The z-curve plot shows evidence of a file drawer.  Counting marginally significant results, the success rate is 100%, but power to produce significant results is estimated to be only 71%.  Power for the set of significant results (excluding marginally significant ones) is estimated to be 76%. The maximum false discovery rate is estimated to be 5%. Thus, even if some results would not replicate with the same sample size, studies with much larger sample sizes are expected to produce a significant result in the same direction.

These results are comparable to the results in cognitive psychology, where replicability estimates are around 80% and the maximum false discovery rate is also low.  In contrast, results in experimental social psychology are a lot worse with replicability estimates below 50%; both in statistical estimates with z-curve and in estimates based on actual replication studies (Open Science Collaboration, 2015).  The reason is that experimental social psychologists conducted between-subject experiments with small samples. In contrast, most of my studies are correlational studies with large samples or experiments with within-subject designs and many repeated trials.  These studies have high power and tend to replicate fairly well (Open Science Collaboration, 2015).


Actual replication studies have produced many replication failures and created the impression that results published in psychology journals are not credible.  This is unfortunate because these replication projects have focused on between-subject paradigms in experimental social psychology.  It is misleading to generalize these results to all areas of psychology.

Psychologists who want to demonstrate that their work is replicable do not have to wait for somebody to replicate their study. They can conduct a self-audit using z-curve and demonstrate that their results are different from experimental social psychology.  Even just reporting the number of observations rather than number of participants may help to signal that a study had good power to produce a significant result.  A within-subject study with 8 participants and 100 repetitions has 800 observations, which is 10 times more than the number of observations in the typical between-subject study with 80 participants and one observation per participant.
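The arithmetic in the example above is worth spelling out; the designs are the hypothetical ones from the paragraph:

```python
# Hypothetical designs from the paragraph above: the total number of
# observations, not the number of participants, drives statistical precision.
obs_within = 8 * 100   # within-subject: 8 participants, 100 trials each
obs_between = 80 * 1   # between-subject: 80 participants, 1 trial each
print(obs_within, obs_between, obs_within // obs_between)  # 800 80 10
```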






How Replicable is Psychological Science?

Since 2011, psychologists have been wondering about the replicability of the results in their journals. Until then, psychologists blissfully ignored that success rates of 95% in their journals are eerily high and more akin to outcomes of totalitarian elections than to actual success rates of empirical studies (Sterling, 1959).

The Open Science Framework (OSF) has conducted several stress-tests of psychological science by means of empirical replication studies.   There have been several registered replication reports of individual studies of interest, a couple of Many Labs projects, and the reproducibility project.

Although these replication projects have produced many interesting results, headlines often focus on the percentage of successful replications.  Typically, the criterion for a successful replication is a statistically significant (p < .05, two-tailed) result in the replication study.

The reproducibility project produced a success rate of 36%, and the just released results of Many Labs 2 showed a success rate of 50%.   While it is crystal clear what these numbers are (they are a simple count of p-values less than .05 divided by the number of tests), it is much less clear what these numbers mean.




As Alison Ledgerwood pointed out, everybody knows the answer, but nobody seems to know the question.


Descriptive Statistics in Convenience Samples are Meaningless

The first misinterpretation of the number is that the percentage tells us something about the “reproducibility of psychological science” (Open Science Collaboration, 2015).  As the actual article explains, the reproducibility project selected studies from three journals that publish mostly social and cognitive psychological experiments.  To generalize from these two areas of psychology to all areas of psychological research is only slightly less questionable than generalizing from 40 undergraduate students at Harvard University to the world population.   The problem is that replicability can vary across disciplines.  In fact, the reproducibility project found differences between the two disciplines that were examined.  While cognitive psychology achieved a success rate of at least 50%, the success rate for social psychology was only 25%.

If researchers had a goal to provide an average estimate of all areas of psychology, it would be necessary to define the population of results in psychology (which journals should be included) and to draw a representative sample from this population.

Without a sampling plan, the percentage is only valid for the sample of studies.  The importance of sampling is well understood in most social sciences, but psychologists may have ignored this aspect because they routinely make unwarranted generalized claims based on convenience samples.

In their defense, OSC authors may argue that their results are at least representative of results published in the three journals that were selected for replication attempts; that is, the Journal of Personality and Social Psychology, the Journal of Experimental Psychology: Learning, Memory, and Cognition, and Psychological Science.   Replication teams could pick any article that was published in the year 2008.  As there is no reason to assume that 2008 was a particularly good or bad year, the results are likely to generalize to other years.


However, the sets of studies in Many Labs projects were picked ad hoc without any reference to a particular population. They were simply studies of interest that were sufficiently simple to be packaged into a battery of studies that could be completed within an hour to allow for massive replication across many labs.  The percentage of successes in these projects is practically meaningless, just like the average height of the next 20 customers at McDonald’s is a meaningless number.  Thus, 50% successes in Many Labs 2 is a very clear answer to a very unclear question because the percentage depends on an unclear selection mechanism.  The next project that selects another 20 studies of interest could produce 20%, 50%, or 80% successes.  Moreover, adding up meaningless success rates doesn’t improve things, just like averaging data from Harvard and Kentucky State University does not address the problem that the samples are not representative of the population.


Thus, the only meaningful result of empirical replicability estimation studies is that the success rate in social psychology is estimated to be 25% and the success rate in cognitive psychology is estimated to be 50%.  No other areas have been investigated.

Success Rates are Arbitrary

Another problem with the headline finding is that success rates of replication studies depend on the sample size of replication studies, unless an original finding was a false positive result (in this case, the success rate matches the significance criterion, i.e. 5%).

The reason is that statistical significance is a function of sampling error and sampling error decreases as sample sizes increase.  Thus, replication studies can produce fewer successes if sample sizes are lowered (as they were for some cognitive studies in the reproducibility project) and they increase when sample sizes are increased (as they are in the ManyLabs projects).
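This dependence can be made concrete with an approximate power calculation. The sketch below treats a two-group comparison as a z-test; the effect size (d = 0.4) and sample sizes are hypothetical values chosen for illustration, not numbers from any of the replication projects:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample z-test for a standardized effect d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)               # expected z-score given the effect
    return 1 - NormalDist().cdf(z_crit - ncp)

print(round(approx_power(0.4, 50), 2))   # 0.52
print(round(approx_power(0.4, 200), 2))  # 0.98
```

With the same true effect, quadrupling the sample size turns a coin-flip success rate into a near-certain one, which is why success rates from projects with very different sample sizes are not directly comparable.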

It is therefore a fallacy to compare success rates in the reproducibility project with success rates in the Many Labs projects.  The 37% success rate of the reproducibility project conflates social and cognitive psychology while keeping sample sizes fairly similar to the original studies, whereas Many Labs 2 focused mostly on social psychology but increased sample sizes by a factor of 64 (median original N = 112, replication N = 7,157).  A more reasonable comparison would focus on social psychology and compute successes for replication studies on the basis of the effect size and sample size of the original study (see Table 5 of the Many Labs 2 report).  Two studies are then no longer significant, and the success rate is reduced from 50% to 43%.

Estimates in Small Samples are Variable

The 43% success rate in Many Labs 2 has to be compared to the 25% success rate for social psychology in the reproducibility project, which suggests a somewhat higher success rate. However, estimates in small samples are not very precise, and this difference is not statistically significant.  Thus, there is no empirical evidence for the claim that Many Labs 2 was more successful than the reproducibility project.  Given the lack of representative sampling in the Many Labs projects, the estimate could be ignored, but assuming that there was no major bias in the selection process, the two estimates can be combined to produce an estimate somewhere in the middle. This would suggest that we should expect only one-third of results in social psychology to replicate.

Convergent Validation with Z-Curve

Empirical replicability estimation has major problems because not all studies can be easily replicated.  Jerry Brunner and I have developed a statistical approach to estimate replicability based on published empirical results.  I have applied this approach to Motyl et al.’s (2017) representative set of statistical results in social psychology journals (PSPB, JPSP, JESP).



The statistical approach predicts that exact replication studies with the same sample sizes would produce 44% successes.  This estimate is a bit higher than the actual success rate.  There are a number of possible reasons for this discrepancy.  First, the estimate of the statistical approach is more precise because it is based on a much larger number of results than the set of actually replicated studies.  Second, the statistical model may not be correct, leading to an overestimation of the actual success rate.  Third, it is difficult to conduct exact replication studies and effect sizes might be weaker in replication studies.  Finally, the set of actual replication studies is not representative because pragmatic decisions influenced which studies were actually replicated.

Aside from these differences, it is noteworthy that both estimation methods produce converging evidence that less than 50% of published results in social psychology journals can be expected to reproduce a success, even if the study could be replicated exactly.

What Do Replication Failures Mean?

The refined conclusion about replicability in psychological science is that only social psychology has been examined and that replicability in social psychology is likely to be less than 50%.  However, the question still remains unclear.

In fact, there are two questions that are often confused.   One question is how many successes in social psychology are false positives; that is, contrary to the reported finding, the effect size in the (not well-defined) population is zero or even in the opposite direction of the original study.  The other question is how much statistical power studies in social psychology have to discover true effects.

The problem is that replication failures do not provide a clear answer to the first question.  A replication failure can reveal a type-I error in the original study (the original finding was a fluke) or a type-II error (the replication study failed to detect a true effect).

To estimate the percentage of false positives, it is not possible to simply count non-significant replication studies.  As every undergraduate student learns (and may forget), a non-significant result does not prove that the null-hypothesis is true.  To answer the question about false positives, it is necessary to define a region of interest around zero and to demonstrate that the population effect size is unlikely to fall outside the region. If the region is reasonably small, sample sizes have to be at least as large as those of the Many Labs projects. However, Many Labs projects are not representative samples. Thus, the percentage of false positives in social psychology remains unknown.  I personally believe that the search for false positives is not very fruitful.

Thus, the real question that replicability estimates can answer is how much power studies in social psychology, on average, have.  The results suggest that successful studies in social psychology only have about 20 to 40 percent power.  As Tversky and Kahneman (1971) pointed out, no sane researcher would invest in studies that have a greater chance to fail than to succeed.  Thus, we can conclude from the empirical finding that the actual power is less than 50% that social psychologists are either insane or unaware that they are conducting studies with less than 50% power most of the time.

If social psychologists are sane, replicability estimates are very useful because they inform social psychologists that they need to change their research practices to increase statistical power.  Cohen (1962) tried to tell them this, as did Tversky and Kahneman (1971), and Sedlmeier & Gigerenzer (1989), and Maxwell (2004), and myself (Schimmack, 2012).   Maybe it was necessary to demonstrate low replicability with actual replication studies because statistical arguments alone were not powerful enough.  Hopefully, the dismal success rates in actual replication studies provide the necessary wake-up call for social psychologists to finally change their research practices.  That would be welcome evidence that science is self-correcting and social psychologists are sane.


The average replicability of results in social psychology journals is less than 50%.  The reason is that original studies have low statistical power.  To improve replicability, social psychologists need to conduct realistic a priori power calculations and honestly report non-significant results when they occur.