All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible f to distinguish studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

“Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerg Giegerenzer

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen’s method to a variety of different journals. The table below shows that power estimates vary by journal assuming that the effect size was medium according to Cohen’s criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen’s method. I also included the results of Sedlmeier and Giegerenzer’s power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Social and Abnormal Psychology was split into Journal of Abnormal Psychology and Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

JOURNAL TITLE YEAR Power%
Journal of Marketing Research 1981 89
American Sociological Review 1974 84
Journalism Quarterly, The Journal of Broadcasting 1976 76
American Journal of Educational Psychology 1972 72
Journal of Research in Teaching 1972 71
Journal of Applied Psychology 1976 67
Journal of Communication 1973 56
The Research Quarterly 1972 52
Journal of Abnormal Psychology 1984 50
Journal of Abnormal and Social Psychology 1962 48
American Speech and Hearing Research & Journal of Communication Disorders 1975 44
Counseler Education and Supervision 1973 37

 

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of  37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%).  The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.

The average for the three core psychology journals (JASP & JAbnPsy,  JAP, AJEduPsy) is 67% (median = 63%) is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general rather than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, a multi-factorial design produces an exponential number of statistical comparisons. With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With 5 conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions would not be significant if this test is part of a series of tests.

Sedlmeier and Giegerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen’s results. However, many articles presented results using a more stringent criterion of significance. If the criterion used by authors would have been used for the power analysis, power decreased further. About 50% of all articles used an adjusted criterion value and if the adjusted criterion value was used power was only 37%.

Sedlmeier and Giegerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Giegerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long-run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to conduct a more powerful study that examines gender differences in a more powerful study with more female and male participants. However, this is not what researchers do. Rather, multiple study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely likely in multiple studies where studies have only 50% power to get a significant result in a single attempt. For each additional attempt, the probability to obtain only significant results decreases exponentially (1 Study, 50%, 2 Studies 25%, 3 Studies 12.5%, 4 Studies 6.75%).

The fact that researchers only publish studies that worked is well-known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager that only publishes the return rate of funds that gained money and excludes funds with losses. Would you trust this manager to take care of your retirement? It is also akin to a gambler that only remembers winnings. Would you marry a gambler who believes that gambling is ok because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present 5 studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in this field, of this researcher, or in the journal that published the results.

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power-ranking of scientific journals and the first temporal analyses of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased power to ensure that their studies could actually provide significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are always welcome.

Advertisements

Distinguishing Questionable Research Practices from Publication Bias

It is well-known that scientific journals favor statistically significant results (Sterling, 1959). This phenomenon is known as publication bias. Publication bias can be easily detected by comparing the observed statistical power of studies with the success rate in journals. Success rates of 90% or more would only be expected if most theoretical predictions are true and empirical studies have over 90% statistical power to produce significant results. Estimates of statistical power range from 20% to 50% (Button et al., 2015, Cohen, 1962). It follows that for every published significant result an unknown number of non-significant results has occurred that remained unpublished. These results linger in researchers proverbial file-drawer or more literally in unpublished data sets on researchers’ computers.

The selection of significant results also creates an incentive for researchers to produce significant results. In rare cases, researchers simply fabricate data to produce significant results. However, scientific fraud is rare. A more serious threat to the integrity of science is the use of questionable research practices. Questionable research practices are all research activities that create a systematic bias in empirical results. Although systematic bias can produce too many or too few significant results, the incentive to publish significant results suggests that questionable research practices are typically used to produce significant results.

In sum, publication bias and questionable research practices contribute to an inflated success rate in scientific journals. So far, it has been difficult to examine the prevalence of questionable research practices in science. One reason is that publication bias and questionable research practices are conceptually overlapping. For example, a research article may report the results of a 2 x 2 x 2 ANOVA or a regression analysis with 5 predictor variables. The article may only report the significant results and omit detailed reporting of the non-significant results. For example, researchers may state that none of the gender effects were significant and not report the results for main effects or interaction with gender. I classify these cases as publication bias because each result tests a different hypothesis., even if the statistical tests are not independent.

Questionable research practices are practices that change the probability of obtaining a specific significant result. An example would be a study with multiple outcome measures that would support the same theoretical hypothesis. For example, a clinical trial of an anti-depressant might include several depression measures. In this case, a researcher can increase the chances of a significant result by conducting tests for each measure. Other questionable research practices would be optional stopping once a significant result is obtained, selective deletion of cases based on the results after deletion. A common consequence of these questionable practices is that they will produce results that meet the significance criterion, but deviate from the distribution that is expected simply on the basis of random sampling error.

A number of articles have tried to examine the prevalence of questionable research practices by comparing the frequency of p-values above and below the typical criterion of statistical significance, namely a p-value less than .05. The logic is that random error would produce a nearly equal amount of p-values just above .05 (e.g., p = .06) and below .05 (e.g., p = .04). According to this logic, questionable research practices are present, if there are more p-values just below the criterion than p-values just above the criterion (Masicampo & Lalande, 2012).

Daniel Lakens has pointed out some problems with this approach. The most crucial problem is that publication bias alone is sufficient to predict a lower frequency of p-values below the significance criterion. After all, these p-values imply a non-significant result and non-significant results are subject to publication bias. The only reason why p-values of .06 are reported with higher frequency than p-values of .11 is that p-values between .05 and .10 are sometimes reported as marginally significant evidence for a hypothesis. Another problem is that many p-values of .04 are not reported as p = .04, but are reported as p < .05. Thus, the distribution of p-values close to the criterion value provides unreliable information about the prevalence of questionable research practices.

In this blog post, I introduce an alternative approach to the detection of questionable research practices that produce just significant results. Questionable research practices and publication bias have different effects on the distribution of p-values (or corresponding measures of strength of evidence). Whereas publication bias will produce a distribution that is consistent with the average power of studies, questionable research practice will produce an abnormal distribution with a peak just below the significance criterion. In other words, questionable research practices produce a distribution with too few non-significant results and too few highly significant results.

I illustrate this test of questionable research practices with post-hoc-power analysis of three journals. One journal shows neither signs of publication bias, nor significant signs of questionable research practices. The second journal shows clear evidence of publication bias, but no evidence of questionable research practices. The third journal illustrates the influence of publication bias and questionable research practices.

Example 1: A Relatively Unbiased Z-Curve

The first example is based on results published during the years 2010-2014 in the Journal of Experimental Psychology: Learning, Memory, and Cognition. A text-mining program searched all articles for publications of F-tests, t-tests, correlation coefficients, regression coefficients, odds-ratios, confidence intervals, and z-tests. Due to the inconsistent and imprecise reporting of p-values (p = .02 or p < .05), p-values were not used. All statistical tests were converted into absolute z-scores.

The program found 14,800 tests. 8,423 tests were in the critical interval between z = 2 and z = 6 that is used for estimation of 4 non-centrality parameters and 4 weights that are used to model the distribution of z-values between 2 and 6 and to estimate the distribution in the range from 0 to 2. Z-values greater than 6 are not used because they correspond to Power close to 1. 11% of all tests fall into this region of z-scores that are not shown.

PHP-Curve JEP-LMCThe histogram and the blue density distribution show the observed data. The green curve shows the predicted distribution based on the post-hoc power analysis. Post-hoc power analysis suggests that the average power of significant results is 67%. Power for all statistical tests is estimated to be 58% (including 11% of z-scores greater than 6, power is .58*.89 + .11 = 63%). More important is the predicted distribution of z-scores. The predicted distribution on the left side of the criterion value matches the observed distribution rather well. This shows that there are not a lot of missing non-significant results. In other words, there does not appear to be a file-drawer of studies with non-significant results. There is also only a very small blip in the observed data just at the level of statistical significance. The close match between the observed and predicted distributions suggests that results in this journal are relatively free of systematic bias due to publication bias or questionable research practices.

Example 2: A Z-Curve with Publication Bias

The second example is based on results published in the Attitudes & Social Cognition Section of the Journal of Personality and Social Psychology. The text-mining program retrieved 5,919 tests from articles published between 2010 and 2014. 3,584 tests provided z-scores in the range from 2 to 6 that is being used for model fitting.

PHP-Curve JPSP-ASC

The average power of significant results in JPSP-ASC is 55%. This is significantly less than the average power in JEP-LMC, which was used for the first example. The estimated power for all statistical tests, including those in the estimated file drawer, is 35%. More important is the estimated distribution of z-values. On the right side of the significance criterion the estimated curve shows relatively close fit to the observed distribution. This finding shows that random sampling error alone is sufficient to explain the observed distribution. However, on the left side of the distribution, the observed z-scores drop off steeply. This drop is consistent with the effect of publication bias that researchers do not report all non-significant results. There is only a slight hint that questionable research practices are also present because observed z-scores just above the criterion value are a bit more frequent than the model predicts. However, this discrepancy is not conclusive because the model could increase the file drawer, which would produce a steeper slope. The most important characteristic of this z-curve is the steep cliff on the left side of the criterion value and the gentle slope on the right side of the criterion value.

Example 3: A Z-Curve with Questionable Research Practices.

Example 3 uses results published in the journal Aggressive Behavior during the years 2010 to 2014. The text mining program found 1,429 results and 863 z-scores in the range from 2 to 6 that were used for the post-hoc-power analysis.

PHP-Curve for AggressiveBeh 2010-14

 

The average power for significant results in the range from 2 to 6 is 73%, which is similar to the power estimate in the first example. The power estimate that includes non-significant results is 68%. The power estimate is similar because there is no evidence of a file drawer with many underpowered studies. In fact, there are more observed non-significant results than predicted non-significant results, especially for z-scores close to zero. This outcome shows some problems of estimating the frequency of non-significant results based on the distribution of significant results. More important, the graph shows a cluster of z-scores just above and below the significance criterion. The step cliff to the left of the criterion might suggest publication bias, but the whole distribution does not show evidence of publication bias. Moreover, the steep cliff on the right side of the cluster cannot be explained with publication bias. Only questionable research practices can produce this cliff because publication bias relies on random sampling error which leads to a gentle slope of z-scores as shown in the second example.

Prevalence of Questionable Research Practices

The examples suggest that the distribution of z-scores can be used to distinguish publication bias and questionable research practices. Based on this approach, the prevalence of questionable research practices would be rare. The journal Aggressive Behavior is exceptional. Most journals show a pattern similar to Example 2, with varying sizes of the file drawer. However, this does not mean that questionable research practices are rare because it is most likely that the pattern observed in Example 2 is a combination of questionable research practices and publication bias. As shown in Example 2, the typical power of statistical tests that produce a significant result is about 60%. However, researchers do not know which experiments will produce significant results. Slight modifications in experimental procedures, so-called hidden moderators, can easily change an experiment with 60% power into an experiment with 30% power. Thus, the probability of obtaining a significant result in a replication study is less than the nominal power of 60% that is implied by post-hoc-power analysis. With only 30% to 60% power, researchers will frequently encounter results that fail to produce an expected significant result. In this case, researchers have two choices to avoid reporting a non-significant result. They can put the study in the file-drawer or they can try to salvage the study with the help of questionable research practices. It is likely that researchers will do both and that the course of action depends on the results. If the data show a trend in the right direction, questionable research practices seem an attractive alternative. If the data show a trend in the opposite direction, it is more likely that the study will be terminated and the results remain unreported.

Simons et al. (2011) conducted some simulation studies and found that even extreme use of multiple questionable research practices (p-hacking) will produce a significant result in at most 60% of cases, when the null-hypothesis is true. If such extreme use of questionable research practices were widespread, z-curve would produce corrected power estimates well-below 50%. There is no evidence that extreme use of questionable research practices is prevalent. In contrast, there is strong evidence that researchers conduct many more studies than they actually report and that many of these studies have a low probability of success.

Implications of File-Drawers for Science

First, it is clear that researchers could be more effective if they would use existing resources more effectively. An fMRI study with 20 participants costs about $10,000. Conducting a study that costs $10,000 that has only a 50% probability of producing a significant result is wasteful and should not be funded by taxpayers. Just publishing the non-significant result does not fix this problem because a non-significant result in a study with 50% power is inconclusive. Even if the predicted effect exists, one would expect a non-significant result in ever second study.   Instead of wasting $10,000 on studies with 50% power, researchers should invest $20,000 in studies with higher power (unfortunately, power does not increase proportional to resources). With the same research budget, more money would contribute to results that are being published. Thus, without spending more money, science could progress faster.

Second, higher powered studies make non-significant results more relevant. If a study had 80% power, there is only a 20% chance to get a non-significant result if an effect is present. If a study had 95% power, the chance of a non-significant result would be just as low as the chance of a false positive result. In this case, it is noteworthy that a theoretical prediction was not confirmed. In a set of high-powered studies, a post-hoc power analysis would show a bimodal distribution with clusters of z-scores around 0 for true null-hypothesis and a cluster of z-scores of 3 or higher for clear effects. Type-I and Type-II errors would be rare.

Third, Example 3 shows that the use of questionable research practices becomes detectable in the absence of a file drawer and that it would be harder to publish results that were obtained with questionable research practices.

Finally, the ability to estimate the size of file-drawers may encourage researchers to plan studies more carefully and to invest more resources into studies to keep their file drawers small because a large file-drawer may harm reputation or decrease funding.

In conclusion, post-hoc power analysis of large sets of data can be used to estimate the size of the file drawer based on the distribution of z-scores on the right side of a significance criterion. As file-drawers harm science, this tool can be used as an incentive to conduct studies that produce credible results and thus reducing the need for dishonest research practices. In this regard, the use of post-hoc power analysis complements other efforts towards open science such as preregistration and data sharing.

2015 Replicability Ranking of 54 Psychology Journals

INTRODUCTION

This post shows the latest replicability rankings of psychological journals.

The following improvements have been made.

1 – The number of journals included in the ranking has doubled from 27 to 54 journals.

2 – The list of journals covers a broader range of psychological disciplines. It now includes journals for cognitive psychology (COG), social psychology (SOC), personality psychology (PER), developmental psychology (DEV), and clinical psychology (CLI).

3 – Replicability scores are now based on F-tests, t-tests, z-tests, correlation and regression coefficients, and confidence intervals.

4 – Replicability scores are reported with the number of tests that were used to compute replicability scores as well as 95% CI based on bootstrap analysis with 500 samples.

5 – The range of z-scores that is used for the computation of replicability scores has been expanded to a limit of z = 6.

6 – The more appropriate heterogeneous model is being used to estimate replicability scores. The heterogeneous model allows for variation the population effect sizes and true power across different analysis.

For more details on the computation of replicability scores see the text below the table and related posts (1, 2, 3, 4).

 

Rank Journal Area 2010-2014 Est. Low High 2015 Est. Low High  Grade
1 Cognitive Psychology COG 1733 0.72 0.69 0.75 186 0.75 0.63 0.79 B|C-B
2 Journal of Cross-Cultural Psychology GEN 2277 0.65 0.61 0.69 238 0.74 0.66 0.79 B|C-B
3 Journal of Memory & Language COG 4729 0.67 0.64 0.69 409 0.74 0.67 0.78 B|C-B
4 Developmental Psychology DEV 4083 0.65 0.62 0.67 358 0.72 0.64 0.77 B|C-B↑
5 JEP: LMC COG 7730 0.7 0.68 0.71 1323 0.71 0.68 0.75 B|C-B
6 JEP: Human Percepetion and Performance COG 10342 0.72 0.7 0.74 1064 0.71 0.67 0.73 B|C-B
7 Journal of Experimental Psychology: General GEN 5045 0.65 0.62 0.67 708 0.71 0.65 0.74 B|C-B?
8 Judgment and Decision Making COG 978 0.67 0.6 0.7 110 0.71 0.6 0.78 B|C-B↑
9 Social Psychology SOC 1558 0.62 0.58 0.66 146 0.71 0.62 0.78 B|C-B↑
10 Archives of Sexual Behavior GEN 1271 0.7 0.66 0.74 226 0.7 0.61 0.75 B|C-B
11 Journal of Research in Personality PER 2366 0.64 0.61 0.67 230 0.7 0.63 0.77 B|D-B↑
12 JPSP: Personality Process & Individual Differences PER 1735 0.59 0.54 0.62 179 0.7 0.53 0.76 B|D-B↑
13 Personality and Individual Differences PER 5506 0.62 0.59 0.64 853 0.7 0.66 0.74 B|C-B↑
14 Cognition & Emotion EMO 3723 0.67 0.64 0.69 773 0.69 0.64 0.73 C|C-B
15 Depression & Anxiety CLI 878 0.65 0.59 0.7 227 0.68 0.6 0.74 C|C-B
16 Aggressive Behavior GEN 810 0.6 0.53 0.66 81 0.67 0.54 0.77 C|D-B↑
17 Personal Relationships SOC 1320 0.54 0.5 0.58 126 0.67 0.56 0.75 C|D-B↑
18 Behavior Research & Therapy CLI 3116 0.64 0.62 0.67 479 0.66 0.6 0.72 C|C-B
19 JPSP: Attitude & Social Cognition SOC 3550 0.54 0.5 0.58 354 0.66 0.58 0.72 C|D-B↑
20 Psychology and Aging DEV 4422 0.71 0.69 0.73 184 0.66 0.58 0.74 C|D-B↓
21 Psychology of Music COG 693 0.64 0.59 0.69 136 0.66 0.56 0.72 C|D-B
22 Appetite GEN 2164 0.58 0.55 0.6 384 0.64 0.6 0.72 C|C-B↑
23 Cognition COG 7744 0.65 0.63 0.67 1826 0.64 0.61 0.66 C|C-C
24 Journal of Applied Psychology APP 662 0.64 0.55 0.69 173 0.64 0.54 0.72 C|D-B
25 Social Psychology Personality Science SOC 2631 0.53 0.49 0.56 174 0.64 0.56 0.74 C|D-B↑
26 Journal of Positive Psychology GEN 518 0.63 0.57 0.68 125 0.62 0.5 0.7 C|D-B
27 British Journal of Social Psychology SOC 1357 0.54 0.49 0.59 119 0.61 0.5 0.73 C|D-B↑
28 Developmental Science DEV 4211 0.58 0.54 0.61 318 0.61 0.52 0.67 C|D-C
29 Journal of Personality PER 1338 0.52 0.48 0.55 249 0.6 0.47 0.68 C|F-C↑
30 Psychological Science GEN 7975 0.57 0.55 0.59 444 0.6 0.5 0.66 C|D-C
31 Law and Human Behavior APP 891 0.67 0.63 0.71 180 0.6 0.49 0.7 C|F-B↓
32 Child Development DEV 4707 0.62 0.58 0.63 493 0.59 0.49 0.64 D|F-C
33 Infancy DEV 1399 0.53 0.49 0.57 730 0.59 0.53 0.64 D|D-C↑
34 Journal of Experimental Social Psychology SOC 8720 0.52 0.48 0.54 1182 0.59 0.5 0.62 D|D-C↑
35 Motivation and Emotion EMO 2410 0.63 0.6 0.65 547 0.58 0.53 0.66 D|D-C↓
36 Psychophysiology BIO 8568 0.61 0.59 0.64 1346 0.58 0.54 0.63 D|D-C↓
37 Social Development DEV 1244 0.63 0.59 0.66 218 0.58 0.48 0.66 D|F-C↓
38 European Journal of Personality PER 400 0.67 0.61 0.72 71 0.56 0.37 0.71 D|F-B↓
39 Health Psychology APP 1503 0.56 0.52 0.6 121 0.56 0.43 0.67 D|F-C
40 Journal of Educational Psychology APP 1550 0.68 0.63 0.71 295 0.56 0.47 0.64 D|F-C↓
41 European Journal of Social Psychology SOC 3353 0.58 0.55 0.61 350 0.55 0.47 0.62 D|F-C
42 Journal of Social & Personal Relationships SOC 646 0.63 0.56 0.68 148 0.55 0.42 0.67 D|F-C↓
43 JPSP:Interpersonal Relationships & Group Processes SOC 4047 0.54 0.5 0.56 468 0.55 0.47 0.62 D|F-C
44 Organizational Behavior and Decision Processes APP 3373 0.62 0.59 0.65 608 0.55 0.49 0.61 D|F-C↓
45 Personality & Social Psychology Bulletin SOC 8737 0.54 0.52 0.56 1238 0.55 0.46 0.58 D|F-D
46 Journal of Child Psychology & Psychiatry DEV 1823 0.51 0.47 0.57 445 0.54 0.47 0.59 D|F-D
47 Journal of Consulting and Clinical Psychology CLI 1390 0.55 0.5 0.59 304 0.54 0.45 0.62 D|F-C
48 Personnel Psychology APP 287 0.62 0.53 0.66 72 0.54 0.33 0.68 D|F|C
49 Journal of Abnormal Psychology CLI 2034 0.55 0.52 0.59 258 0.53 0.41 0.61 D|F-C
50 Emotion EMO 5695 0.62 0.6 0.64 376 0.52 0.46 0.59 D|F-D↓
51 Evolution & Human Behavior GEN 450 0.55 0.48 0.63 56 0.51 0.29 0.72 D|F|B
52 Social Cognition SOC 1830 0.57 0.52 0.6 141 0.49 0.36 0.59 F|F-D↓
53 Group Processes & Intergroup Relations SOC 2059 0.53 0.51 0.57 352 0.46 0.39 0.54 F|F-D↓
54 Journal of Youth and Adolescence DEV 1724 0.6 0.56 0.65 282 0.46 0.33 0.5 F|F-D↓

Notes:   ↑↓ upward/downward trend if 2015 estimate is outside 2010-2014 confidence interval.
Estimate = Point estimate of bias-corrected power, low/high = 95% confidence interval limits based on 500 bootstrap analyses.

 

DEFINITION OF KEY TERMS

REPLICABILITY: I define replicability as the probability that an exact replication study with the same population effect size and sample size (and therewith the same true power) produces a significant result. This definition of replicability equates replicability with statistical power. Importantly, a replication study that does not replicate a significant result cannot be interpreted as evidence that an effect does not exist. The non-significant result can occur for several reasons. When replicability is estimated for a heterogeneous set of statistical tests (e.g., tests from different articles in a journal), replicability is the percentage of significant results that can be expected if all of the tests were carried out again in a new set of exact replication studies.

REPORTING BIAS: If journals would publish significant and non-significant results, the percentage of significant results in journals would provide a simple and valid measure of replicability (1, 2). However, psychological journals are much more likely to publish significant results (Sterling, 1959). As a result, published success rates provide biased and inflated estimates of replicability. Replicability rankings use a novel statistical method to correct for reporting bias and estimate the true replicability of significant results published in psychological journals.

DATA

The data were obtained from online published articles in the years from 2010-2015 using 2015 articles that were published online on October 1.  The results will be updated when all articles from 2015 are published.

The analysis are based on z-scores in the range from 2 (just above significance criterion of 1.96) and 6.  An additional 12% of z-scores with z > 6 were excluded.   A z-score of 6 has a probability of 1 in 10 billion to occur when the null-hypothesis is true.  Thus, any z-score greater than 6 can be considered to be a real effect and a failed replication (z < 2) would suggest that the two experiments are not exact replications.

The analysis is based on 19,133 articles.  Articles were scanned for reports of F-tests, t-tests, z-tests, correlations, regression coefficients, and confidence intervals.  The results of these tests were converted into absolute z-scores.  The search yielded 182,413 z-scores in the critical range from 2 to 6.

The majority of reported tests were F-tests (n = 100,014) followed by t-tests (n = 59,644), correlations (n = 18,679), z-tests (n = 3278), confidence intervals (n = 540), and regression coefficients with standard errors (n = 258).

The data were submitted to a post-hoc-power analysis. The method relies on the fact that a set of studies with a distribution of true powers produces a characteristic curve of observed z-scores. The shape of this distribution can be used to estimate the average true power based even for a truncated set of observed z-scores.  The method is illustrated below for the full set of 273,494 z-scores in the range form 0 to 6 (excluding 12% of data with z > 6), but only the 182,413 z-scores in the range from 2 to 6 are used to estimate average true power.

 

54journals

The blue line shows the observed z-scores and the green line shows simulated z-scores of the best fitting model. The best fitting model assigns weights to a set of true power values and uses the weighted average as an estimate of average true power.  The estimate is 61%.  A bootstrap analysis with 500 trials provides a confidence interval.  With over 100,000 observations the interval is small and ranges from 60 to 62%.

The green curve also estimates the number of non-significant results that were obtained but not reported.  To estimate the size of the proverbial file-drawer,  non-significant results are assigned a single z-score of 1, which corresponds to 17% power.  The weight assigned to these non-significant results is determined by the difference between the inflated power and the corrected power estimate.  The greater the inflation, the more non-significant results are hidden in file drawers.

Dr. R’s comment on the Official Statement by the Board of the German Psychological Association (DGPs) about the Results of the OSF-Reproducibility Project published in Science.

Thanks to social media, geography is no longer a barrier for scientific discourse. However, language is still a barrier. Fortunately, I understand German and I can respond to the official statement of the board of the German Psychological Association (DGPs), which was posted on the DGPs website (in German).

BACKGROUND

On September 1, 2015, Prof. Dr. Andrea Abele-Brehm, Prof. Dr. Mario Gollwitzer, and Prof. Dr. Fritz Strack published an official response to the results of the OSF-Replication Project – Psychology (in German) that was distributed to public media in order to correct potentially negative impressions about psychology as a science.

Numerous members of DGPs felt that this official statement did not express their views and noticed that members were not consulted about the official response of their organization. In response to this criticism, DGfP opened a moderated discussion page, where members could post their personal views (mostly in German).

On October 6, 2015, the board closed the discussion page and posted some final words (Schlussbeitrag). In this blog, I provide a critical commentary on these final words.

BOARD’S RESPONSE TO COMMENTS

The board members provide a summary of the core insights and arguments of the discussion from their (personal/official) perspective.

„Wir möchten nun die aus unserer Sicht zentralen Erkenntnisse und Argumente der unterschiedlichen Forumsbeiträge im Folgenden zusammenfassen und deutlich machen, welche vorläufigen Erkenntnisse wir im Vorstand aus ihnen ziehen.“

1. 68% success rate?

The first official statement suggested that the replication project showed that 68% of studies. This number is based on significance in a meta-analysis of the original and replication study. Critics pointed out that this approach is problematic because the replication project showed clearly that the original effect sizes were inflated (on average by 100%). Thus, the meta-analysis is biased and the 68% number is inflated.

In response to this criticism, the DGPs board states that “68% is the maximum [größtmöglich] optimistic estimate.” I think the term “biased and statistically flawed estimate” is a more accurate description of this estimate.   It is common practice to consider fail-safe-N or to correct meta-analysis for publication bias. When there is clear evidence of bias, it is unscientific to report the biased estimate. This would be like saying that the maximum optimistic estimate of global warming is that global warming does not exist. This is probably a true statement about the most optimistic estimate, but not a scientific estimate of the actual global warming that has been taking place. There is no place for optimism in science. Optimism is a bias and the aim of science is to remove bias. If DGPs wants to represent scientific psychology, the board should post what they consider the most accurate estimate of replicability in the OSF-project.

2. The widely cited 36% estimate is negative.

The board members then justify the publication of the maximally optimistic estimate as a strategy to counteract negative perceptions of psychology as a science in response to the finding that only 36% of results were replicated. The board members felt that these negative responses misrepresent the OSF-project and psychology as a scientific discipline.

„Dies wird weder dem Projekt der Open Science Collaboration noch unserer Disziplin insgesamt gerecht. Wir sollten jedoch bei der konstruktiven Bewältigung der Krise Vorreiter innerhalb der betroffenen Wissenschaften sein.“

However, reporting the dismal 36% replication rate of the OSF-replication project is not a criticism of the OSF-project. Rather, it assumes that the OSF-replication project was a rigorous and successful attempt to provide an estimate of the typical replicability of results published in top psychology journals. The outcome could have been 70% or 35%. The quality of the project does not depend on the result. The result is also not a negatively biased perception of psychology as a science. It is an objective scientific estimate of the probability that a reported significant result in a journal would produce a significant result again in a replication study.   Whether 36% is acceptable or not can be debated, but it seems problematic to post a maximally optimistic estimate to counteract negative implications of an objective estimate.

3. Is 36% replicability good or bad?

Next, the board ponders the implications of the 36% success rate. “How should we evaluate this number?” The board members do not know.  According to their official conclusion, this question is complex as divergent contributions on the discussion page suggest.

„Im Science-Artikel wurde die relative Häufigkeit der in den Replikationsstudien statistisch bedeutsamen Effekte mit 36% angegeben. Wie ist diese Zahl zu bewerten? Wie komplex die Antwort auf diese Frage ist, machen die Forumsbeiträge von Roland Deutsch, Klaus Fiedler, Moritz Heene (s.a. Heene & Schimmack) und Frank Renkewitz deutlich.“

To help the board members to understand the number, I can give a brief explanation of replicability. Although there are several ways to define replicability, one plausible definition of replicability is to equate it with statistical power. Statistical power is the probability that a study will produce a significant result. A study with 80% power has an 80% probability to produce a significant result. For a set of 100 studies, one would expect roughly 80 significant results and 20 non-significant results. For 100 studies with 36% power, one would expect roughly 36 significant results and 64 non-significant results. If researchers would publish all studies, the percentage of published significant results would provide an unbiased estimate of the typical power of studies.   However, it is well known that significant results are more likely to be written up, submitted for publication, and accepted for publication. These reporting biases explain why psychology journals report over 90% significant results, although the actual power of studies is less than 90%.

In 1962, Jacob Cohen provided the first attempt to estimate replicability of psychological results. His analysis suggested that psychological studies have approximately 50% power. He suggested that psychologists should increase power to 80% to provide robust evidence for effects and to avoid wasting resources on studies that cannot detect small, but practically important effects. For the next 50 years, psychologists have ignored Cohen’s warning that most studies are underpowered, despite repeated reminders that there are no signs of improvement, including reminders by prominent German psychologists like Gerg Giegerenzer, director of a Max Planck Institute (Sedlmeier & Giegerenzer, 1989; Maxwell, 2004; Schimmack, 2012).

The 36% success rate for an unbiased set of 100 replication studies, suggest that the actual power of published studies in psychology journals is 36%.  The power of all studies conducted is even lower because the p < .05 selection criterion favors studies with higher power.  Does the board think 36% power is an acceptable amount of power?

4. Psychologists should improve replicability in the future

On a positive note, the board members suggest that, after careful deliberation, psychologists need to improve replicability so that it can be demonstrated in a few years that replicability has increased.

„Wir müssen nach sorgfältiger Diskussion unter unseren Mitgliedern Maßnahmen ergreifen (bei Zeitschriften, in den Instituten, bei Förderorganisationen, etc.), die die Replikationsquote im temporalen Vergleich erhöhen können.“

The board members do not mention a simple solution to the replicabilty problem that was advocated over 50 years ago by Jacob Cohen. To increase replicability, psychologists have to think about the strength of the effects that they are investigating and they have to conduct studies that have a realistic chance to distinguish these effects from variation due to random error.   This often means investing more resources (larger samples, repeated trials, etc.) in a single study.   Unfortunately, the leaders of German psychologists appear to be unaware of this important and simple solution to the replication crisis. They neither mention power as a cause of the problem, nor do they recommend increasing power to increase replicability in the future.

5. Do the Results Reveal Fraud?

The DGPs board members then discuss the possibility that the OSF-reproducibilty results reveal fraud, like the fraud committed by Stapel. The board points out that the OSF-results do not imply that psychologists commit fraud because failed replications can occur for various reasons.

„Viele Medien (und auch einige Kolleginnen und Kollegen aus unserem Fach) nennen die Befunde der Science-Studie im gleichen Atemzug mit den Betrugsskandalen, die unser Fach in den letzten Jahren erschüttert haben. Diese Assoziation ist unserer Meinung nach problematisch: sie suggeriert, die geringe Replikationsrate sei auf methodisch fragwürdiges Verhalten der Autor(inn)en der Originalstudien zurückzuführen.“

It is true that the OSF-results do not reveal fraud. However, the board members confuse fraud with questionable research practices. Fraud is defined as fabricating data that were never collected. Only one of the 100 studies in the OSF-replication project (by Jens Förster, a former student of Fritz Strack, one of the board members) is currently being investigated for fraud by the University of Amsterdam.  Despite very strong results in the original study, it failed to replicate.

The more relevant question is how much questionable research practices contributed to the results. Questionable research practices are practices where data are being collected, but statistical results are only being reported if they produce a significant result (studies, conditions, dependent variables, data points that do not produce significant results are excluded from the results that are being submitted for publication. It has been known for over 50 years that these practices produce a discrepancy between the actual power of studies and the rate of significant results that are published in psychology journals (Sterling, 1959).

Recent statistical developments have made it possible to estimate the true power of studies after correcting for publication bias.   Based on these calculations, the true power of the original studies in the OSF-project was only 50%.   Thus a large portion of the discrepancy between nearly 100% reported significant results and a replication success rate of 36% is explained by publication bias (see R-Index blogs for social psychology and cognitive psychology).

Other factors may contribute to the discrepancy between the statistical prediction that the replication success rate would be 50% and the actual success rate of 36%. Nevertheless, the lion share of the discrepancy can be explained by the questionable practice to report only evidence that supports a hypothesis that a researcher wants to support. This motivated bias undermines the very foundations of science. Unfortunately, the board ignores this implication of the OSF results.

6. What can we do?

The board members have no answer to this important question. In the past four years, numerous articles have been published that have made suggestions how psychology can improve its credibility as a science. Yet, the DPfP board seems to be unaware of these suggestions or unable to comment on these proposals.

„Damit wären wir bei der Frage, die uns als Fachgesellschaft am stärksten beschäftigt und weiter beschäftigen wird. Zum einen brauchen wir eine sorgfältige Selbstreflexion über die Bedeutung von Replikationen in unserem Fach, über die Bedeutung der neuesten Science-Studie sowie der weiteren, zurzeit noch im Druck oder in der Phase der Auswertung befindlichen Projekte des Center for Open Science (wie etwa die Many Labs-Studien) und über die Grenzen unserer Methoden und Paradigmen“

The time for more discussion has passed. After 50 years of ignoring Jacob Cohen’s recommendation to increase statistical power it is time for action. If psychologists are serious about replicability, they have to increase the power of their studies.

The board then discusses the possibility of measuring and publishing replication rates at the level of departments or individual scientists. They are not in favor of such initiatives, but they provide no argument for their position.

„Datenbanken über erfolgreiche und gescheiterte Replikationen lassen sich natürlich auch auf der Ebene von Instituten oder sogar Personen auswerten (wer hat die höchste Replikationsrate, wer die niedrigste?). Sinnvoller als solche Auswertungen sind Initiativen, wie sie zurzeit (unter anderem) an der LMU an der LMU München implementiert wurden (siehe den Beitrag von Schönbrodt und Kollegen).“

The question is why replicability should not be measured and used to evaluate researchers. If the board really valued replicability and wanted to increase replicability in a few years, wouldn’t it be helpful to have a measure of replicability and to reward departments or researchers who invest more resources in high powered studies that can produce significant results without the need to hide disconfirming evidence in file-drawers?   A measure of replicability is also needed because current quantitative measures of scientific success are one of the reasons for the replicability crisis. The most successful researchers are those who publish the most significant results, no matter how these results were obtained (with the exception of fraud). To change this unscientific practice of significance chasing, it is necessary to have an alternative indicator of scientific quality that reflects how significant results were obtained.

Conclusion

The board makes some vague concluding remarks that are not worthwhile repeating here. So let me conclude with my own remarks.

The response of the DGPs board is superficial and does not engage with the actual arguments that were exchanged on the discussion page. Moreover, it ignores some solid scientific insights into the causes of the replicability crisis and it makes no concrete suggestions how German psychologists should change their behaviors to improve the credibility of psychology as a science. Not once do they point out that the results of the OSF-project were predictable based on the well-known fact that psychological studies are underpowered and that failed studies are hidden in file-drawers.

I received my education in Germany all the way to the Ph.D at the Free University in Berlin. I had several important professors and mentors that educated me about philosophy of science and research methods (Rainer Reisenzein, Hubert Feger, Hans Westmeyer, Wolfgang Schönpflug). I was a member of DGPs for many years. I do not believe that the opinion of the board members represent a general consensus among German psychologists. I hope that many German psychologists recognize the importance of replicability and are motivated to make changes to the way psychologists conduct research.  As I am no longer a member of DGfP, I have no direct influence on it, but I hope that the next election will elect a candidate that will promote open science, transparency, and above all scientific integrity.

Replicability Ranking of 27 Psychology Journals (2015)

Click on this link to see the latest rankings for over 100 Psychology Journals for 2015.

The replicability rankings below are based on post-hoc power analyses of published results. The method is explained in more detail elsewhere.  More detailed results and time trends can be found by clicking on the hyperlink of a journal.  The ranking for the average replicability score in 2010-2014 and 2015 is r = .66, indicating that there are reliable differences in replicability between journals.  Movements by more than 10 percentage points are marked with an arrow.

Rank Journal Area 2010-2014 2015 Grade
1 Developmental Psychology DEV 0.63 0.76 B↑
2 Cognitive Psychology COG 0.72 0.74 B
3 JEP: Human Percpetion and Performance COG 0.72 0.71 B
4 Judgment and Decision Making COG 0.66 0.70 B
5 J. Experimental Psych: Learning, Memory, Cognition COG 0.69 0.70 B-
6 JPSP: Personality Process & Individual Differences PER 0.56 0.70 B↑
7 Journal of Memory & Language COG 0.67 0.69 C
8 Social Psychology Personality Science SOC 0.51 0.68 C↑
9 Journal of Experimental Psychology: General GEN 0.63 0.67 C
10 Cognition & Emotion EMO 0.64 0.67 C
11 Social Psychology SOC 0.61 0.66 C
12 Journal of Cross-Cultural Psychology GEN 0.69 0.65 C
13 Journal of Positive Psychology GEN 0.53 0.63 C
14 Psychology and Aging DEV 0.67 0.59 D
15 Child Development DEV 0.63 0.58 D
16 Journal of Experimental Social Psychology SOC 0.48 0.55 D
17 Psychological Science GEN 0.56 0.54 D
18 Developmental Science DEV 0.58 0.53 D
19 European Journal of Social Psychology SOC 0.55 0.52 D
20 Emotion EMO 0.61 0.52 D↓
21 Personal Relationships SOC 0.52 0.52 D
22 JPSP: Attitude & Social Cognition SOC 0.51 0.51 D
23 JPSP:Interpersonal Relationships & Group Processes SOC 0.48 0.50 D
24 British Journal of Social Psychology SOC 0.48 0.50 D
25 Personality & Social Psychology Bulletin SOC 0.50 0.46 F
26 Journal of Social & Personal Relationships SOC 0.56 0.39 F
27 Social Cognition SOC 0.54 0.35 F

Replicability-Ranking of 100 Social Psychology Departments

Please see the new post on rankings of psychology departments that is based on all areas of psychology and covers the years from 2010 to 2015 with separate information for the years 2012-2015.

===========================================================================

Old post on rankings of social psychology research at 100 Psychology Departments

This post provides the first analysis of replicability for individual departments. The table focuses on social psychology and the results cannot be generalized to other research areas in the same department. An explanation of the rational and methodology of replicability rankings follows in the text below the table.

Department 2010-2014
Macquarie University 91
New Mexico State University 82
The Australian National University 81
University of Western Australia 74
Maastricht University 70
Erasmus University Rotterdam 70
Boston University 69
KU Leuven 67
Brown University 67
University of Western Ontario 67
Carnegie Mellon 67
Ghent University 66
University of Tokyo 64
University of Zurich 64
Purdue University 64
University College London 63
Peking University 63
Tilburg University 63
University of California, Irvine 63
University of Birmingham 62
University of Leeds 62
Victoria University of Wellington 62
University of Kent 62
Princeton 61
University of Queensland 61
Pennsylvania State University 61
Cornell University 59
University of California at Los Angeles 59
University of Pennsylvania 59
University of New South Wales (UNSW) 59
Ohio State University 58
National University of Singapore 58
Vanderbilt University 58
Humboldt Universit„ät Berlin 58
Radboud University 58
University of Oregon 58
Harvard University 56
University of California, San Diego 56
University of Washington 56
Stanford University 55
Dartmouth College 55
SUNY Albany 55
University of Amsterdam 54
University of Texas, Austin 54
University of Hong Kong 54
Chinese University of Hong Kong 54
Simone Fraser University 54
Ruprecht-Karls-Universitaet Heidelberg 53
University of Florida 53
Yale University 52
University of California, Berkeley 52
University of Wisconsin 52
University of Minnesota 52
Indiana University 52
University of Maryland 52
University of Toronto 51
Northwestern University 51
University of Illinois at Urbana-Champaign 51
Nanyang Technological University 51
University of Konstanz 51
Oxford University 50
York University 50
Freie Universit„ät Berlin 50
University of Virginia 50
University of Melbourne 49
Leiden University 49
University of Colorado, Boulder 49
Univeritä„t Würzburg 49
New York University 48
McGill University 48
University of Kansas 48
University of Exeter 47
Cardiff University 46
University of California, Davis 46
University of Groningen 46
University of Michigan 45
University of Kentucky 44
Columbia University 44
University of Chicago 44
Michigan State University 44
University of British Columbia 43
Arizona State University 43
University of Southern California 41
Utrecht University 41
University of Iowa 41
Northeastern University 41
University of Waterloo 40
University of Sydney 40
University of Bristol 40
University of North Carolina, Chapel Hill 40
University of California, Santa Barbara 40
University of Arizona 40
Cambridge University 38
SUNY Buffalo 38
Duke University 37
Florida State University 37
Washington University, St. Louis 37
Ludwig-Maximilians-Universit„ät München 36
University of Missouri 34
London School of Economics 33

Replicability scores of 50% and less are considered inadequate (grade F). The reason is that less than 50% of the published results are expected to produce a significant result in a replication study, and with less than 50% successful replications, the most rational approach is to treat all results as false because it is unclear which results would replicate and which results would not replicate.

RATIONALE AND METHODOLOGY

University rankings have become increasingly important in science. Top ranking universities use these rankings to advertise their status. The availability of a single number of quality and distinction creates pressures on scientists to meet criteria that are being used for these rankings. One key criterion is the number of scientific articles that are being published in top ranking scientific journals under the assumption that these impact factors of scientific journals track the quality of scientific research. However, top ranking journals place a heavy premium on novelty without ensuring that novel findings are actually true discoveries. Many of these high-profile discoveries fail to replicate in actual replication studies. The reason for the high rate of replication failures is that scientists are rewarded for successful studies, while there is no incentive to publish failures. The problem is that many of these successful studies are obtained with the help of luck or questionable research methods. For example, scientists do not report studies that fail to support their theories. The problem of bias in published results has been known for a long time (Sterling, 1959). However, few researchers were aware of the extent of the problem.   New evidence suggests that more than half of published results provide false or extremely biased evidence. When more than half of published results are not credible, a science loses its credibility because it is not clear which results can be trusted and which results provide false information.

The credibility and replicability of published findings varies across scientific disciplines (Fanelli, 2010). More credible sciences are more willing to conduct replication studies and to revise original evidence. Thus, it is inappropriate to make generalized claims about the credibility of science. Even within a scientific discipline credibility and replicability can vary across sub-disciplines. For example, results from cognitive psychology are more replicable than results from social psychology. The replicability of social psychological findings is extremely low. Despite an increase in sample size, which makes it easier to obtain a significant result in a replication study, only 5 out of 38 replication studies produced a significant result. If the replication studies had used the same sample sizes as the original studies, only 3 out of 38 results would have replicated, that is, produced a significant result in the replication study. Thus, most published results in social psychology are not trustworthy.

There have been mixed reactions by social psychologists to the replication crisis in social psychology. On the one hand, prominent leaders of the field have defended the status quo with the following arguments.

1 – The experiments who conducted the replication studies are incompetent (Bargh, Schnall, Gilbert).

2 – A mysterious force makes effects disappear over time (Schooler).

3 – A statistical artifact (regression to the mean) will always make it harder to find significant results in a replication study (Fiedler).

4 – It is impossible to repeat social psychological studies exactly and a replication study is likely to produce different results than an original study (the hidden moderator) (Schwarz, Strack).

These arguments can be easily dismissed because they do not explain why cognitive psychologists and other scientific disciplines have more successful replications and more failed results.   The real reason for the low replicability of social psychology is that social psychologists conduct many, relatively cheap studies that often fail to produce the expected results. They then conduct exploratory data analyses to find unexpected patterns in the data or they simply discard the study and publish only studies that support a theory that is consistent with the data (Bem). This hazardous approach to science can produce false positive results. For example, it allowed Bem (2011) to publish 9 significant results that seemed to show that humans can foresee unpredictable outcomes in the future. Some prominent social psychologists defend this approach to science.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister,)

The lack of rigorous scientific standards also allowed Diederik Stapel, a prominent social psychologist to fabricate data, which led to over 50 retractions of scientific articles. The commission that investigated Stapel came to the conclusion that he was only able to publish so many fake articles because social psychology is a “sloppy science,” where cute findings and sexy stories count more than empirical facts.

Social psychology faces a crisis of confidence. While social psychology tried hard to convince the general public that it is a real science, it actually failed to follow standard norms of science to ensure that social psychological theories are based on objective replicable findings. Social psychology therefore needs to reform its practices if it wants to be taken serious as a scientific field that can provide valuable insights into important question about human nature and human behavior.

There are many social psychologists who want to improve scientific standards. For example, the head of the OSF-reproducibility project, Brian Nosek, is a trained social psychologist. Mickey Inzlicht published a courageous self-analysis that revealed problems in some of his most highly cited articles and changed the way his lab is conducting studies to improve social psychology. Incoming editors of social psychology journals are implementing policies to increase the credibility of results published in their journals (Simine Vazire; Roger Giner-Sorolla). One problem for social psychologists willing to improve their science is that the current incentive structure does not reward replicability. The reason is that it is possible to count number of articles and number of citations, but it seems difficult to quantify replicability and scientific integrity.

To address this problem, Jerry Brunner and I developed a quantitative measure of replicability. The replicability-score uses published statistical results (p-values) and transforms them into absolute z-scores. The distribution of z-scores provides information about the statistical power of a study given the sample size, design, and observed effect size. Most important, the method takes publication bias into account and can estimate the true typical power of published results. It also reveals the presence of a file-drawer of unpublished failed studies, if the published studies contain more significant results than the actual power of studies allows. The method is illustrated in the following figure that is based on t- and F-tests published in the most important journals that publish social psychology research.

PHP-Curve Social Journals

The green curve in the figure illustrates the distribution of z-scores that would be expected if a set of studies had 53% power. That is, random sampling error will sometimes inflate the observed effect size and sometimes deflate the observed effect size in a sample relative to the population effect size. With 54% power, there would be 46% (1 – .54 = .46) non-significant results because the study had insufficient power to demonstrate an effect that actually exists. The graph shows that the green curve fails to describe the distribution of observed z-scores. On the one hand, there are more extremely high z-scores. This reveals that the set of studies is heterogeneous. Some studies had more than 54% power and others had less than 54% power. On the other hand, there are fewer non-significant results than the green curve predicts. This discrepancy reveals that non-significant results are omitted from the published reports.

Given the heterogeneity of true power, the red curve is more appropriate. It provides the best fit to the observed z-scores that are significant (z-scores > 2). It does not model the z-scores below 2 because non-significant z-scores are not reported.   The red-curve gives a lower estimate of power and shows a much larger file-drawer.

I limit the power analysis to z-scores in the range from 2 to 4. The reason is that z-scores greater than 4 imply very high power (> 99%). In fact, many of these results tend to replicate well. However, many theoretically important findings are published with z-scores less than 4 as evidence. These z-scores do not replicate well. If social psychology wants to improve its replicability, social psychologists need to conduct fewer studies with more statistical power that yield stronger evidence and they need to publish all studies to reduce the file-drawer.

To provide an incentive to increase the scientific standards in social psychology, I computed the replicability-score (homogeneous model for z-scores between 2 and 4) for different journals. Journal editors can use the replicability rankings to demonstrate that their journal publishes replicable results. Here I report the first rankings of social psychology departments.   To rank departments, I searched the database of articles published in social psychology journals for the affiliation of articles’ authors. The rankings are based on the z-scores of these articles published in the years 2010 to 2014. I also conducted an analysis for the year 2015. However, the replicability scores were uncorrelated with those in 2010-2014 (r = .01). This means that the 2015 results are unreliable because the analysis is based on too few observations. As a result, the replicability rankings of social psychology departments cannot reveal recent changes in scientific practices. Nevertheless, they provide a first benchmark to track replicability of psychology departments. This benchmark can be used by departments to monitor improvements in scientific practices and can serve as an incentive for departments to create policies and reward structures that reward scientific integrity over quantitative indicators of publication output and popularity. Replicabilty is only one aspect of high-quality research, but it is a necessary one. Without sound empirical evidence that supports a theoretical claim, discoveries are not real discoveries.