
2015 Replicability Ranking of 100+ Psychology Journals

Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability of producing a significant result in an exact replication study and a lower probability of being false-positive results.

The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be interpreted as evidence for an effect and against the null hypothesis. Published non-significant results are useful for meta-analyses and follow-up studies, but they provide insufficient information to draw statistical inferences.

The average power across the 105 psychology journals used for this ranking is 70%. This means that exact replications of a representative sample of significant results are expected to produce 70% significant results. The rankings for 2015 show variability across journals, with average power estimates ranging from 84% down to 54%. A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.

The journal names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and estimates of power that include non-significant results even if these were not reported (the file drawer).

Rank   Journal   2010-2014   2015
1   Social Indicators Research   81   84
2   Journal of Happiness Studies   81   83
3   Journal of Comparative Psychology   72   83
4   International Journal of Psychology   80   81
5   Journal of Cross-Cultural Psychology   78   81
6   Child Psychiatry and Human Development   75   81
7   Psychonomic Bulletin and Review   72   80
8   Journal of Personality   72   79
9   Journal of Vocational Behavior   79   78
10   British Journal of Developmental Psychology   75   78
11   Journal of Counseling Psychology   72   78
12   Cognitive Development   69   78
13   JPSP: Personality Processes and Individual Differences   65   78
14   Journal of Research in Personality   75   77
15   Depression & Anxiety   74   77
16   Asian Journal of Social Psychology   73   77
17   Personnel Psychology   78   76
18   Personality and Individual Differences   74   76
19   Personal Relationships   70   76
20   Cognitive Science   77   75
21   Memory and Cognition   73   75
22   Early Human Development   71   75
23   Journal of Sexual Medicine   76   74
24   Journal of Applied Social Psychology   74   74
25   Journal of Experimental Psychology: Learning, Memory & Cognition   74   74
26   Journal of Youth and Adolescence   72   74
27   Social Psychology   71   74
28   Journal of Experimental Psychology: Human Perception and Performance   74   73
29   Cognition and Emotion   72   73
30   Journal of Affective Disorders   71   73
31   Attention, Perception and Psychophysics   71   73
32   Evolution & Human Behavior   68   73
33   Developmental Science   68   73
34   Schizophrenia Research   66   73
35   Archives of Sexual Behavior   76   72
36   Pain   74   72
37    Acta Psychologica   72   72
38   Cognition   72   72
39   Journal of Experimental Child Psychology   72   72
40   Aggressive Behavior   72   72
41   Journal of Social Psychology   72   72
42   Behaviour Research and Therapy   70   72
43   Frontiers in Psychology   70   72
44   Journal of Autism and Developmental Disorders   70   72
45   Child Development   69   72
46   Epilepsy & Behavior   75   71
47   Journal of Child and Family Studies   72   71
48   Psychology of Music   71   71
49   Psychology and Aging   71   71
50   Journal of Memory and Language   69   71
51   Journal of Experimental Psychology: General   69   71
52   Psychotherapy   78   70
53   Developmental Psychology   71   70
54   Behavior Therapy   69   70
55   Judgment and Decision Making   68   70
56   Behavioral Brain Research   68   70
57   Social Psychology and Personality Science   62   70
58   Political Psychology   75   69
59   Cognitive Psychology   74   69
60   Organizational Behavior and Human Decision Processes   69   69
61   Appetite   69   69
62   Motivation and Emotion   69   69
63   Sex Roles   68   69
64   Journal of Experimental Psychology: Applied   68   69
65   Journal of Applied Psychology   67   69
66   Behavioral Neuroscience   67   69
67   Psychological Science   67   68
68   Emotion   67   68
69   Developmental Psychobiology   66   68
70   European Journal of Social Psychology   65   68
71   Biological Psychology   65   68
72   British Journal of Social Psychology   64   68
73   JPSP: Attitudes & Social Cognition   62   68
74   Animal Behavior   69   67
75   Psychophysiology   67   67
76   Journal of Child Psychology and Psychiatry and Allied Disciplines   66   67
77   Journal of Research on Adolescence   75   66
78   Journal of Educational Psychology   74   66
79   Clinical Psychological Science   69   66
80   Consciousness and Cognition   69   66
81   The Journal of Positive Psychology   65   66
82   Hormones & Behavior   64   66
83   Journal of Clinical Child and Adolescent Psychology   62   66
84   Journal of Gerontology: Series B   72   65
85   Psychological Medicine   66   65
86   Personality and Social Psychology   64   64
87   Infancy   61   64
88   Memory   75   63
89   Law and Human Behavior   70   63
90   Group Processes & Intergroup Relations   70   63
91   Journal of Social and Personal Relationships   69   63
92   Cortex   67   63
93   Journal of Abnormal Psychology   64   63
94   Journal of Consumer Psychology   60   63
95   Psychology of Violence   71   62
96   Psychoneuroendocrinology   63   62
97   Health Psychology   68   61
98   Journal of Experimental Social Psychology   59   61
99   JPSP: Interpersonal Relationships and Group Processes   60   60
100   Social Cognition   65   59
101   Journal of Consulting and Clinical Psychology   63   58
102   European Journal of Personality   72   57
103   Journal of Family Psychology   60   57
104   Social Development   75   55
105   Annals of Behavioral Medicine   65   54
106   Self and Identity   63   54

Replicability-Ranking of 100 Social Psychology Departments

Please see the new post on rankings of psychology departments that is based on all areas of psychology and covers the years from 2010 to 2015 with separate information for the years 2012-2015.


Old post on rankings of social psychology research at 100 Psychology Departments

This post provides the first analysis of replicability for individual departments. The table focuses on social psychology, and the results cannot be generalized to other research areas in the same department. An explanation of the rationale and methodology of replicability rankings follows in the text below the table.

Department 2010-2014
Macquarie University 91
New Mexico State University 82
The Australian National University 81
University of Western Australia 74
Maastricht University 70
Erasmus University Rotterdam 70
Boston University 69
KU Leuven 67
Brown University 67
University of Western Ontario 67
Carnegie Mellon 67
Ghent University 66
University of Tokyo 64
University of Zurich 64
Purdue University 64
University College London 63
Peking University 63
Tilburg University 63
University of California, Irvine 63
University of Birmingham 62
University of Leeds 62
Victoria University of Wellington 62
University of Kent 62
Princeton 61
University of Queensland 61
Pennsylvania State University 61
Cornell University 59
University of California at Los Angeles 59
University of Pennsylvania 59
University of New South Wales (UNSW) 59
Ohio State University 58
National University of Singapore 58
Vanderbilt University 58
Humboldt Universität Berlin 58
Radboud University 58
University of Oregon 58
Harvard University 56
University of California, San Diego 56
University of Washington 56
Stanford University 55
Dartmouth College 55
SUNY Albany 55
University of Amsterdam 54
University of Texas, Austin 54
University of Hong Kong 54
Chinese University of Hong Kong 54
Simon Fraser University 54
Ruprecht-Karls-Universität Heidelberg 53
University of Florida 53
Yale University 52
University of California, Berkeley 52
University of Wisconsin 52
University of Minnesota 52
Indiana University 52
University of Maryland 52
University of Toronto 51
Northwestern University 51
University of Illinois at Urbana-Champaign 51
Nanyang Technological University 51
University of Konstanz 51
Oxford University 50
York University 50
Freie Universität Berlin 50
University of Virginia 50
University of Melbourne 49
Leiden University 49
University of Colorado, Boulder 49
Universität Würzburg 49
New York University 48
McGill University 48
University of Kansas 48
University of Exeter 47
Cardiff University 46
University of California, Davis 46
University of Groningen 46
University of Michigan 45
University of Kentucky 44
Columbia University 44
University of Chicago 44
Michigan State University 44
University of British Columbia 43
Arizona State University 43
University of Southern California 41
Utrecht University 41
University of Iowa 41
Northeastern University 41
University of Waterloo 40
University of Sydney 40
University of Bristol 40
University of North Carolina, Chapel Hill 40
University of California, Santa Barbara 40
University of Arizona 40
Cambridge University 38
SUNY Buffalo 38
Duke University 37
Florida State University 37
Washington University, St. Louis 37
Ludwig-Maximilians-Universität München 36
University of Missouri 34
London School of Economics 33

Replicability scores of 50% or less are considered inadequate (grade F). The reason is that fewer than 50% of the published results are expected to produce a significant result in a replication study, and when fewer than half of the results replicate, the most rational approach is to treat all results as false because it is unclear which results would replicate and which would not.


University rankings have become increasingly important in science. Top-ranked universities use these rankings to advertise their status. The availability of a single number signaling quality and distinction pressures scientists to meet the criteria used for these rankings. One key criterion is the number of articles published in top-ranked scientific journals, under the assumption that the impact factors of these journals track the quality of the research. However, top-ranked journals place a heavy premium on novelty without ensuring that novel findings are actually true discoveries, and many high-profile discoveries fail to replicate in actual replication studies. The reason for the high rate of replication failures is that scientists are rewarded for successful studies, while there is no incentive to publish failures. The problem is that many of these successful studies are obtained with the help of luck or questionable research methods; for example, scientists do not report studies that fail to support their theories. The problem of bias in published results has been known for a long time (Sterling, 1959), but few researchers were aware of the extent of the problem. New evidence suggests that more than half of published results provide false or extremely biased evidence. When more than half of published results are not credible, a science loses its credibility because it is not clear which results can be trusted and which provide false information.

The credibility and replicability of published findings vary across scientific disciplines (Fanelli, 2010). More credible sciences are more willing to conduct replication studies and to revise original evidence. Thus, it is inappropriate to make generalized claims about the credibility of science as a whole. Even within a scientific discipline, credibility and replicability can vary across sub-disciplines. For example, results from cognitive psychology are more replicable than results from social psychology. The replicability of social psychological findings is extremely low: despite increases in sample size, which make it easier to obtain a significant result in a replication study, only 5 out of 38 replication studies produced a significant result. If the replication studies had used the same sample sizes as the original studies, only 3 out of 38 results would have replicated, that is, produced a significant result in the replication study. Thus, most published results in social psychology are not trustworthy.

There have been mixed reactions by social psychologists to the replication crisis in social psychology. On the one hand, prominent leaders of the field have defended the status quo with the following arguments.

1 – The experimenters who conducted the replication studies are incompetent (Bargh, Schnall, Gilbert).

2 – A mysterious force makes effects disappear over time (Schooler).

3 – A statistical artifact (regression to the mean) will always make it harder to find significant results in a replication study (Fiedler).

4 – It is impossible to repeat social psychological studies exactly and a replication study is likely to produce different results than an original study (the hidden moderator) (Schwarz, Strack).

These arguments can be easily dismissed because they do not explain why cognitive psychology and other scientific disciplines have higher rates of successful replication. The real reason for the low replicability of social psychology is that social psychologists conduct many relatively cheap studies that often fail to produce the expected results. They then conduct exploratory data analyses to find unexpected patterns in the data, or they simply discard the study and publish only the studies whose data are consistent with a theory (Bem). This hazardous approach to science can produce false-positive results. For example, it allowed Bem (2011) to publish 9 significant results that seemed to show that humans can foresee unpredictable outcomes in the future. Some prominent social psychologists defend this approach to science.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

The lack of rigorous scientific standards also allowed Diederik Stapel, a prominent social psychologist to fabricate data, which led to over 50 retractions of scientific articles. The commission that investigated Stapel came to the conclusion that he was only able to publish so many fake articles because social psychology is a “sloppy science,” where cute findings and sexy stories count more than empirical facts.

Social psychology faces a crisis of confidence. While social psychology has tried hard to convince the general public that it is a real science, it has failed to follow standard norms of science that ensure that social psychological theories are based on objective, replicable findings. Social psychology therefore needs to reform its practices if it wants to be taken seriously as a scientific field that can provide valuable insights into important questions about human nature and human behavior.

There are many social psychologists who want to improve scientific standards. For example, the head of the OSF Reproducibility Project, Brian Nosek, is a trained social psychologist. Mickey Inzlicht published a courageous self-analysis that revealed problems in some of his most highly cited articles and changed the way his lab conducts studies to improve social psychology. Incoming editors of social psychology journals are implementing policies to increase the credibility of results published in their journals (Simine Vazire; Roger Giner-Sorolla). One problem for social psychologists willing to improve their science is that the current incentive structure does not reward replicability: it is easy to count the number of articles and the number of citations, but difficult to quantify replicability and scientific integrity.

To address this problem, Jerry Brunner and I developed a quantitative measure of replicability. The replicability score uses published statistical results (p-values) and transforms them into absolute z-scores. The distribution of these z-scores provides information about the statistical power of the underlying studies given their sample size, design, and observed effect size. Most importantly, the method takes publication bias into account and can estimate the true typical power of published results. It also reveals the presence of a file drawer of unpublished failed studies if the published studies contain more significant results than the actual power of the studies allows. The method is illustrated in the following figure, which is based on t- and F-tests published in the most important journals that publish social psychology research.
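As a minimal sketch of the conversion step, a reported two-tailed p-value maps onto an absolute z-score through the inverse normal CDF (the function name below is my own, not part of the original method):

```python
from statistics import NormalDist

def p_to_abs_z(p: float) -> float:
    """Convert a two-tailed p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

# A just-significant result (p = .05, two-tailed) corresponds to |z| = 1.96
print(round(p_to_abs_z(0.05), 2))  # -> 1.96
```

The criterion value of 1.96 used throughout these analyses is exactly this conversion applied to p = .05.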

[Figure: PHP-Curve for social psychology journals]

The green curve in the figure illustrates the distribution of z-scores that would be expected if a set of studies had 54% power. Random sampling error sometimes inflates and sometimes deflates the observed effect size in a sample relative to the population effect size. With 54% power, there would be 46% (1 – .54 = .46) non-significant results because the studies had insufficient power to demonstrate an effect that actually exists. The graph shows that the green curve fails to describe the distribution of observed z-scores. On the one hand, there are more extremely high z-scores than predicted, which reveals that the set of studies is heterogeneous: some studies had more than 54% power and others had less. On the other hand, there are fewer non-significant results than the green curve predicts. This discrepancy reveals that non-significant results are omitted from the published reports.
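The green curve corresponds to a simple homogeneous model in which every observed z-score is drawn from a normal distribution (SD = 1) around a single noncentrality parameter. Under that assumption (a sketch, not the original code; the helper name is mine), a power of 54% pins down both the noncentrality and the expected share of non-significant results:

```python
from statistics import NormalDist

nd = NormalDist()
crit = nd.inv_cdf(0.975)  # significance criterion, ~1.96

def ncp_for_power(power: float) -> float:
    # Solve power = P(N(ncp, 1) > crit) for the noncentrality parameter ncp
    return crit + nd.inv_cdf(power)

ncp = ncp_for_power(0.54)
nonsig = nd.cdf(crit - ncp)  # expected share of non-significant results
print(round(ncp, 2), round(nonsig, 2))  # about 2.06 and 0.46
```

The 46% share of expected non-significant results is what the missing left tail of the observed distribution is compared against.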

Given the heterogeneity of true power, the red curve is more appropriate. It provides the best fit to the observed significant z-scores (z > 2); it does not model z-scores below 2 because non-significant results are not reported. The red curve gives a lower estimate of power and shows a much larger file drawer.

I limit the power analysis to z-scores in the range from 2 to 4. The reason is that z-scores greater than 4 imply very high power (> 99%), and in fact many of these results tend to replicate well. However, many theoretically important findings are published with z-scores below 4 as their only evidence, and these results do not replicate well. If social psychology wants to improve its replicability, social psychologists need to conduct fewer studies with more statistical power that yield stronger evidence, and they need to publish all studies to reduce the file drawer.

To provide an incentive to increase scientific standards in social psychology, I computed the replicability score (homogeneous model for z-scores between 2 and 4) for different journals. Journal editors can use the replicability rankings to demonstrate that their journal publishes replicable results. Here I report the first rankings of social psychology departments. To rank departments, I searched the database of articles published in social psychology journals for the affiliations of the articles’ authors. The rankings are based on the z-scores of these articles published in the years 2010 to 2014. I also conducted an analysis for the year 2015. However, the 2015 replicability scores were uncorrelated with those for 2010-2014 (r = .01). This means that the 2015 results are unreliable because the analysis is based on too few observations. As a result, the replicability rankings of social psychology departments cannot reveal recent changes in scientific practices. Nevertheless, they provide a first benchmark to track the replicability of psychology departments. This benchmark can be used by departments to monitor improvements in scientific practices and can serve as an incentive for departments to create policies and reward structures that reward scientific integrity over quantitative indicators of publication output and popularity. Replicability is only one aspect of high-quality research, but it is a necessary one. Without sound empirical evidence that supports a theoretical claim, discoveries are not real discoveries.
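The homogeneous model behind the replicability score can be sketched as a maximum-likelihood fit of a single noncentrality parameter to z-scores truncated to the 2-4 window; the grid search and the function name below are my own simplifications, not the original implementation:

```python
import math
from statistics import NormalDist

nd = NormalDist()
CRIT, LO, HI = nd.inv_cdf(0.975), 2.0, 4.0

def estimate_power(zs):
    """Fit one noncentrality to z-scores in (LO, HI); return implied power."""
    best_ncp, best_ll = 0.0, float("-inf")
    for i in range(401):                              # grid: ncp = 0.00 .. 4.00
        ncp = i / 100
        window = nd.cdf(HI - ncp) - nd.cdf(LO - ncp)  # truncation constant
        ll = sum(math.log(nd.pdf(z - ncp) / window) for z in zs)
        if ll > best_ll:
            best_ncp, best_ll = ncp, ll
    return 1 - nd.cdf(CRIT - best_ncp)  # P(significant result on replication)

# z-scores clustered in the middle of the window imply fairly high power
print(round(estimate_power([3.0] * 10), 2))  # prints 0.85
```

Applied to the z-scores harvested from a journal or department, this single number is the replicability score reported in the rankings above.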

Replicability Report for the journal SOCIAL PSYCHOLOGY

The journal SOCIAL PSYCHOLOGY is published by Hogrefe. Until 2007, the journal was published in German under the title Zeitschrift für Sozialpsychologie. This replicability report covers the years since 2008.

SCImago rankings of all psychology journals ranked SOCIAL PSYCHOLOGY #286 with an SJR-Impact-Factor of 0.9 in 2014.

At present, the replicability-report is based on articles published from 2008 to 2015. During this time, SOCIAL PSYCHOLOGY published 293 articles. The Replicability-Report is based on 223 articles that reported one or more t- or F-tests in the text (results reported in figures or tables are not included). The test statistics were converted into z-scores to estimate post-hoc power. The analysis is based on 1,064 z-scores in the range from 2 (just above the 1.96 criterion value for p < .05, two-tailed) to 4.

[Figure: PHP-Curve ZSP]

Based on the distribution of z-scores in the range between 2 and 4, the average power for significant results in this range is estimated to be 61% with a homogeneous model, which is currently being used for the replicability ranking. The average power assuming heterogeneity is 56%. This estimate suggests that 56% of the published results with z-scores in this range would yield significant results in an exact replication study with the same sample size and power (results with z > 4 are expected to replicate with nearly 100% probability).

The same method was used to estimate power for individual years.

[Figure: PHP-Trend ZSP]

The results show a flat time trend.

Replicability-Report for the journal JUDGMENT AND DECISION MAKING

JUDGMENT AND DECISION MAKING (JDM) is an open-access journal published by the Society for Judgment and Decision Making.

SCImago rankings of all psychology journals ranked JDM #149 with an SJR-Impact-Factor of 1.3 in 2014.

The journal started publishing articles in 2006. The replicability-report is based on all articles posted from 2006 to 2015. During this time, JDM published 425 articles. The Replicability-Report is based on 305 articles that reported one or more t- or F-tests in the text (results reported in figures or tables are not included). The test statistics were converted into z-scores to estimate post-hoc power. The analysis is based on 1,082 z-scores in the range from 2 (just above the 1.96 criterion value for p < .05, two-tailed) to 4.

[Figure: PHP-Curve JDM]

Based on the distribution of z-scores in the range between 2 and 4, the average power for significant results in this range is estimated to be 64% with a homogeneous model, which is currently being used for the replicability ranking. The average power assuming heterogeneity is slightly lower at 59%. This estimate suggests that about 60% of the published results with z-scores in this range would yield significant results in an exact replication study with the same sample size and power (results with z > 4 are expected to replicate with nearly 100% probability).

The same method was used to estimate power for individual years.


The results show a positive trend in post-hoc power, with the highest estimates obtained in the past two years. The replicability score for 2015 is 70%, one of the highest scores for psychology journals. Based on the rate of actual successful replications in cognitive psychology, the actual rate of successful replications is expected to be about 50%. This is much higher than the actual rate of successful replications in social psychology journals with higher impact factors. Based on the present results, I recommend JDM as a more credible source of scientific evidence.

Replicability-Report for the journal JCCP



SCImago rankings of all psychology journals ranked JCCP #191 with an SJR-Impact-Factor of 1.2 in 2014.

At present, the replicability-report is based on articles published from 2000 to 2015. During this time, JCCP published 881 articles. The Replicability-Report is based on 591 articles that reported one or more t- or F-tests in the text (results reported in figures or tables are not included). The test statistics were converted into z-scores to estimate post-hoc power. The analysis is based on 2,193 z-scores in the range from 2 (just above the 1.96 criterion value for p < .05, two-tailed) to 4.

[Figure: PHP-Curve JCCP]

Based on the distribution of z-scores in the range between 2 and 4, the average power for significant results in this range is estimated to be 68% with a homogeneous model, which is currently being used for the replicability ranking. The average power assuming heterogeneity is 58%. This estimate suggests that only about half of the published results with z-scores in this range would yield significant results in an exact replication study with the same sample size and power (results with z > 4 are expected to replicate with nearly 100% probability).

The same method was used to estimate power for individual years.


The results show a flat time trend. Due to the relatively small number of observations per year, annual estimates vary considerably, but the average estimate in 2015 is close to the historic average of JCCP. A replicability score of 65% in 2015 places JCCP in the top third of psychology journals. In the OSF-Reproducibility Project, the actual rate of successful replications is likely to be about 20% lower than the statistically predicted power. Thus, about 1/3 of the published results are expected to produce significant results in studies that aim to reproduce the original studies.

Replicability-Report for SOCIAL COGNITION

SOCIAL COGNITION is published by Guilford Press.

SCImago rankings of all psychology journals ranked SOCIAL COGNITION #178 with an SJR-Impact-Factor of 1.2 in 2014.

At present, the replicability-report is based on articles published from 1995 to 2015. During this time, SOCIAL COGNITION published 550 articles. The Replicability-Report is based on 450 articles that reported one or more t- or F-tests in the text (results reported in figures or tables are not included). The test statistics were converted into z-scores to estimate post-hoc power. The analysis is based on 5,331 z-scores in the range from 2 (just above the 1.96 criterion value for p < .05, two-tailed) to 4.

[Figure: PHP-Curve SocialCognition]

Based on the distribution of z-scores in the range between 2 and 4, the average power for significant results in this range is estimated to be 55% with a homogeneous model, which is currently being used for the replicability ranking. The average power assuming heterogeneity is 46%. This estimate suggests that fewer than half of the published results with z-scores in this range would yield significant results in an exact replication study with the same sample size and power (results with z > 4 are expected to replicate with nearly 100% probability).

The same method was used to estimate power for individual years.

[Figure: PHP-Trend SocialCognition]

The results show a decreasing trend, and the estimate for the current year is only 35%. This estimate could still increase as more articles from 2015 are published. However, the replicability score for SOCIAL COGNITION is low and raises concerns about the replicability of results published in this journal. The same method produced a replicability score of 32% for social psychology results in the OSF-Reproducibility Project. The actual rate of successful replications, including z-scores greater than 4, was 8% when sample size was held constant. Thus, the replicability score of 35% for articles published in 2015 suggests that few of the theoretically important results published in SOCIAL COGNITION would replicate in an actual replication study.

Examining the Replicability of 66,212 Published Results in Social Psychology: A Post-Hoc-Power Analysis Informed by the Actual Success Rate in the OSF-Reproducibility Project

The OSF-Reproducibility Project examined the replicability of 99 statistical results published in three psychology journals. The journals covered mostly research in cognitive psychology and social psychology. An article in Science reported that only 35% of the results were successfully replicated (i.e., produced a statistically significant result in the replication study).

I have conducted more detailed analyses of the replication studies in social psychology and cognitive psychology. Cognitive psychology had a notably higher success rate (50%, 19 out of 38) than social psychology (8%, 3 out of 38). The main reason for this discrepancy is that social psychologists and cognitive psychologists use different designs. Whereas cognitive psychologists typically use within-subject designs with many repeated measurements of the same individual, social psychologists typically assign participants to different groups and compare behavior on a single measure. This so-called between-subject design makes it difficult to detect small experimental effects because it does not control for other factors that influence participants’ behavior (e.g., personality dispositions, mood, etc.). To detect small effects in these noisy data, between-subject designs require large sample sizes.

It has been known for a long time that sample sizes in between-subject designs in psychology are too small to provide a reasonable chance of detecting an effect (less than a 50% chance of finding an effect that is actually there) (Cohen, 1962; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989). As a result, many studies fail to find statistically significant results, and these studies are not submitted for publication. Thus, only studies that achieved statistical significance with the help of chance (the difference between two groups is inflated by uncontrolled factors such as personality) are reported in journals. The selective reporting of lucky results creates a bias in the published literature that gives a false impression of the replicability of published results. The OSF results for social psychology make it possible to estimate the consequences of publication bias on the replicability of results published in social psychology journals.

A naïve estimate of the replicability of studies would rely on the actual success rate in journals. If journals published significant and non-significant results alike, this would be a reasonable approach. However, journals tend to publish exclusively significant results. As a result, the success rate in journals (over 90% significant results; Sterling, 1959; Sterling et al., 1995) gives a drastically inflated estimate of replicability.

A somewhat better estimate of replicability can be obtained by computing post-hoc power based on the observed effect sizes and sample sizes of published studies. Statistical power is the long-run probability that a series of exact replication studies with the same sample size would produce significant results. Cohen (1962) estimated that the typical power of psychological studies is about 60%. Thus, even for 100 studies that all reported significant results, only 60 are expected to produce a significant result again in the replication attempt.
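Under a normal approximation (a sketch; the function name is mine), post-hoc power for a two-sample comparison follows directly from the observed standardized effect size and the group sizes:

```python
from statistics import NormalDist

nd = NormalDist()

def posthoc_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sample t-test via the normal distribution."""
    crit = nd.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality for two equal groups
    return 1 - nd.cdf(crit - ncp)

# A 'medium' effect (d = .5) with 20 participants per group has low power:
print(round(posthoc_power(0.5, 20), 2))  # -> 0.35
```

This is the kind of calculation behind Cohen's estimate: plugging typical published effect sizes and sample sizes into the power function.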

The problem with Cohen’s (1962) estimate of replicability is that post-hoc power analysis uses the reported effect sizes as estimates of the effect sizes in the population. However, due to the selection bias in journals, the reported effect sizes and the resulting power estimates are inflated. In collaboration with Jerry Brunner, I have developed an improved method to estimate the typical power of reported results that corrects for this inflation. I applied this method to results from 38 social psychology articles included in the OSF-Reproducibility Project and obtained a replicability estimate of 35%.

The OSF-reproducibility project provides another opportunity to estimate the replicability of results in social psychology. The OSF project selected a representative set of studies from two journals and tried to reproduce the same experimental conditions as closely as possible. This should produce unbiased results, and the success rate provides an estimate of replicability. The advantage of this method is that it does not rely on statistical assumptions. The disadvantage is that the success rate depends on the ability to exactly recreate the conditions of the original studies. Any differences between studies (e.g., recruiting participants from different populations) can change the success rate. The OSF replication studies also often changed the sample size of the replication study, which will also change the success rate. If sample sizes in a replication study are larger, power increases and the success rate can no longer be used as an estimate of the typical replicability of social psychology. To address this problem, it is possible to apply a statistical adjustment and use the success rate that would have occurred with the original sample sizes. I found that 5 out of 38 studies (13%) produced significant results; after correcting for the increase in sample size, replicability was only 8% (3 out of 38).
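The sample-size adjustment can be approximated by rescaling z-scores, because the expected z-score grows with the square root of N. This is a sketch of one common approximation with made-up numbers, not necessarily the exact method used in the analysis:

```python
def rescale_z(z_rep, n_rep, n_orig):
    """Project a replication z-score back to the original sample size,
    assuming the expected z-score grows with sqrt(N)."""
    return z_rep * (n_orig / n_rep) ** 0.5

# Hypothetical example: a replication that doubled the sample (N = 80)
# and found z = 2.2 corresponds to an expected z of about 1.56 at the
# original N = 40, which would no longer be significant (z < 1.96).
print(round(rescale_z(2.2, 80, 40), 2))
```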

One important question is how representative the 38 results from the OSF project are for social psychology in general. Unfortunately, it is practically impossible and too expensive to conduct a large number of exact replication studies. In comparison, it is relatively easy to apply post-hoc power analysis to a large number of statistical results reported in social psychology. Thus, I examined the representativeness of the OSF-reproducibility results by comparing my post-hoc power analysis based on the 38 results in the OSF to a post-hoc power analysis of a much larger number of results reported in major social psychology journals.

I downloaded articles from 12 social psychology journals, which are the primary outlets for publishing experimental social psychology research: Basic and Applied Social Psychology, British Journal of Social Psychology, European Journal of Social Psychology, Journal of Experimental Social Psychology, Journal of Personality and Social Psychology: Attitudes and Social Cognition, Journal of Personality and Social Psychology: Interpersonal Relationships and Group Processes, Journal of Social and Personal Relationships, Personal Relationships, Personality and Social Psychology Bulletin, Social Cognition, Social Psychology and Personality Science, Social Psychology.

I converted pdf files into text files and searched for all reports of t-tests or F-tests, converting the reported test statistics into exact two-tailed p-values. The two-tailed p-values were then converted into z-scores by finding the z-score corresponding to the probability 1 − p/2, with p equal to the two-tailed p-value. The total number of z-scores included in the analysis is 134,929.
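The conversion described above is z = Φ⁻¹(1 − p/2), where Φ⁻¹ is the inverse of the standard normal cumulative distribution. A minimal sketch in Python:

```python
from statistics import NormalDist

def p_to_z(p_two_tailed):
    """Convert a two-tailed p-value into the corresponding absolute z-score."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)

print(round(p_to_z(0.05), 2))   # the familiar significance criterion, 1.96
print(round(p_to_z(0.01), 2))   # 2.58
```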

I limited my estimate of power to z-scores in the range between 2 and 4. Z-scores below 2 are not statistically significant (z = 1.96, p = .05). Sometimes these results are reported as marginal evidence for an effect, sometimes they are reported as evidence that an effect is not present, and sometimes they are reported without an inference about the population effect. It is more important to determine the replicability of results that are reported as statistically significant support for a prediction. Z-scores greater than 4 were excluded because they imply that the test had very high statistical power (> 99%). Many of these results replicated successfully in the OSF project. Thus, a simple rule is to assign a success rate of 100% to these findings. The Figure below shows the distribution of z-scores in the range from z = 0 to 6, but the power estimate is applied to z-scores in the range between 2 and 4 (n = 66,212).

[Figure: Post-hoc power curve (PHP-Curve) for social psychology journals]

The power estimate based on the post-hoc power curve for z-scores between 2 and 4 is 46%. It is important to realize that this estimate is based on the 70% of all reported significant results that fall in this range. As z-scores greater than 4 essentially have a power of 100%, the overall power estimate for all reported statistical tests is .46 × .70 + 1.00 × .30 = .62. It is also important to keep in mind that this analysis uses all statistical tests that were reported, including manipulation checks (e.g., pleasant pictures were rated as more pleasant than unpleasant pictures). For this reason, the range of z-scores is limited to values between 2 and 4, which is much more likely to reflect tests of focal hypotheses.
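The arithmetic behind the 62% overall estimate: 70% of the significant results fall between z = 2 and z = 4 (estimated power 46%), and the remaining 30% lie above z = 4 (power treated as 100%):

```python
share_2_to_4 = 0.70    # proportion of significant results with 2 < z < 4
power_2_to_4 = 0.46    # estimated power for that range
power_above_4 = 1.00   # z > 4 is treated as essentially 100% power

# weighted average across the two groups of significant results
overall = power_2_to_4 * share_2_to_4 + power_above_4 * (1 - share_2_to_4)
print(round(overall, 2))   # 0.62
```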

The 46% power estimate for z-scores between 2 and 4 is higher than the estimate for the 38 studies in the OSF-reproducibility project (35%). This suggests that the estimated replicability based on the OSF results underestimates the true replicability. The discrepancy between observed and predicted replicability in social psychology (8% vs. 35%) and cognitive psychology (50% vs. 75%) suggests that the rate of actual successful replications is about 20 to 30 percentage points lower than the success rate based on statistical prediction. Thus, the present analysis suggests that actual replication attempts of results in social psychology would produce significant results in about a quarter of all attempts (46% − 20% = 26%).

The large sample of test results makes it possible to make more detailed predictions for results with different strength of evidence. To provide estimates of replicability for different levels of evidence, I conducted post-hoc power analysis for intervals of half a standard deviation (z = .5). The power estimates are:

Strength of Evidence      Power
2.0 to 2.5                 33%
2.5 to 3.0                 46%
3.0 to 3.5                 58%
3.5 to 4.0                 72%


These estimates are important for researchers who aim to replicate a published study in social psychology. Reported effect sizes are inflated, and a replication study with the same sample size has a low chance of producing a significant result even if a smaller effect exists. To conduct a properly powered replication study, researchers have to increase sample sizes. To illustrate, imagine that a study demonstrated a significant difference between two groups with 40 participants (20 in each cell) and a z-score of 2.3 (p = .02, two-tailed). The observed power for this result is 65%, which would suggest that a slightly larger sample of N = 60 is sufficient to achieve 80% power (an 80% chance to get a significant result). However, after correcting for bias, the true power is more likely to be just 33% (see table above), and power for a study with N = 60 would still be only 50%. To achieve 80% power, the replication study would need a sample size of 130 participants. Sample sizes would need to be even larger taking into account that the actual probability of a successful replication is lower than the probability based on post-hoc power analysis. In the OSF project, only 1 out of 30 studies with an original z-score between 2 and 3 was successfully replicated.
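The sample-size reasoning in this example can be reproduced with a normal-approximation sketch. The bisection helper `delta_for_power` is a hypothetical utility, and the calculation ignores the t-distribution, so the numbers match the text only approximately:

```python
from statistics import NormalDist

nd = NormalDist()
CRIT = nd.inv_cdf(0.975)   # 1.96, two-tailed alpha = .05

def power(delta):
    """Power of a two-tailed z-test with expected z-score (noncentrality) delta."""
    return 1 - nd.cdf(CRIT - delta) + nd.cdf(-CRIT - delta)

def delta_for_power(target, lo=0.0, hi=10.0):
    """Find the noncentrality that yields the target power (bisection; hypothetical helper)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if power(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# If the bias-corrected power at N = 40 is 33%, the true expected z is about 1.5.
# Scaling that expected z with sqrt(N) gives the power at larger sample sizes.
true_delta = delta_for_power(0.33)
for n in (40, 60, 130):
    scaled = true_delta * (n / 40) ** 0.5
    print(n, round(power(scaled), 2))   # roughly .33, .46, .78
```

Under this approximation N = 130 reaches roughly 78% power, close to the 80% target in the text; the exact figures differ slightly because the original calculation presumably used the t-distribution.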


The results also have implications for the way social psychologists should conduct and evaluate new research. The main reason why z-scores between 2 and 3 provide untrustworthy evidence for an effect is that they are obtained with underpowered studies and publication bias. As a result, it is likely that the strength of evidence is inflated. If, however, the same z-scores were obtained in studies with high power, a z-score of 2.5 would provide more credible evidence for an effect. The strength of evidence in a single study would still be subject to random sampling error, but it would no longer be subject to systematic bias. Therefore, the evidence would be more likely to reveal a true effect and less likely to be a false positive. This implies that z-scores should be interpreted in the context of other information about the likelihood of selection bias. For example, a z-score of 2.5 in a pre-registered study provides stronger evidence for an effect than the same z-score in a study where researchers may have had a chance to conduct multiple studies and to select the most favorable results for publication.

The same logic can also be applied to journals and labs. A z-score of 2.5 in a journal with an average z-score of 2.3 is less trustworthy than a z-score of 2.5 in a journal with an average z-score of 3.5. In the former journal, a z-score of 2.5 is likely to be inflated, whereas in the latter journal a z-score of 2.5 is more likely to be negatively biased by sampling error. For example, currently a z-score of 2.5 is more likely to reveal a true effect if it is published in a cognitive journal than a social journal (see ranking of psychology journals).

The same logic applies even more strongly to labs because labs have a distinct research culture (modus operandi). Some labs conduct many underpowered studies and publish only the studies that worked. Other labs may conduct fewer studies with high power. A z-score of 2.5 is more trustworthy if it comes from a lab with high average power than from a lab with low average power. Thus, providing information about the post-hoc power of individual researchers can help readers evaluate the strength of evidence of individual studies in the context of the typical strength of evidence obtained in a specific lab. This will create an incentive to publish results with strong evidence rather than fishing for significant results, because a low replicability index raises the criterion at which results from a lab provide evidence for an effect.