Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, 21(7), 1014-1020.

This post was submitted as a comment to the R-Index Bulletin, but posting in the comment section of a blog reduces visibility. Therefore, I am reposting this contribution as a post. It is a good demonstration that article-based metrics can predict replication failures. Please consider submitting similar analyses to the R-Index Bulletin or sending me an email to have your findings posted anonymously or with author credit.

=================================================================

Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, 21(7), 1014-1020.

Preliminary note:
The test statistics of the t-tests on p. 1016 (t(48) = 2.0, p < .05 and t(48) = 2.36, p < .03) were excluded from the following analyses because they served only as manipulation checks. The t-test reported on p. 1017 (t(39) = 3.07, p < .01) was also excluded because the analysis of mean differences in self-efficacy was merely exploratory.

One statistical test reported a significant finding with F(2, 48) = 3.16, p < .05. However, computing the p-value with R gives a p-value of 0.051, which is above the criterion value of .05. For this analysis, the critical p-value was set to p = .055 to be consistent with the interpretation of the test as significant evidence in favor of the authors’ hypothesis.
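For readers who want to check this, the exact p-value can be reproduced with R's F-distribution function (a minimal sketch; the degrees of freedom are taken from the reported test):

pf(3.16, df1 = 2, df2 = 48, lower.tail = FALSE)   # returns ~ 0.051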

R-Index analysis:
Success rate = 1
Mean observed power = 0.5659
Median observed power = 0.537
Inflation rate = 0.4341
R-Index = 0.1319
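The reported values appear to combine as follows (a sketch of the arithmetic; note that this particular report bases the inflation rate on the mean rather than the median observed power):

success_rate   <- 1
mean_obs_power <- 0.5659
inflation      <- success_rate - mean_obs_power   # 0.4341
mean_obs_power - inflation                        # R-Index ~ 0.1318, i.e., the 0.1319 reported above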

Note that, according to http://www.r-index.org/uploads/3/5/6/7/3567479/r-index_manual.pdf (p.7):
“An R-Index of 22% is consistent with a set of studies in which the null-hypothesis is true and a researcher reported only significant results”.

Furthermore, the test of insufficient variance (TIVA) was conducted.
Note that a variance of z-values smaller than 1 suggests bias. The chi^2 test evaluates the null hypothesis that the variance equals 1.
Results:
Variance = 0.1562
Chi^2(7) = 1.094; p = .007

Thus, the insufficient variance of the z-scores (0.156) suggests that the reported results very likely overestimate the population effect size and the replicability of the reported studies.
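A minimal sketch of how the reported TIVA statistics can be reproduced from the variance of the z-scores, assuming 8 focal tests (k - 1 = 7 degrees of freedom, as reported above):

k     <- 8                       # number of focal tests (implied by the reported df)
var_z <- 0.1562                  # reported variance of the z-scores
chi2  <- var_z * (k - 1)         # test statistic ~ 1.094
pchisq(chi2, df = k - 1)         # left-tail p-value ~ .007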

It should be noted that the present analysis is consistent with earlier claims that these results are too good to be true based on Francis’s Test of Excessive Significance (Francis et al., 2014; http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114255).

Finally, the study results were analyzed using p-curve (http://p-curve.com/):

Statistical Inference on p-curve:
Studies contain evidential value:
chisq(16) = 10.745; p = .825
Note that a significant p-value indicates that the p-curve is right-skewed, which indicates evidential value.

Studies lack evidential value:
chisq(16) = 36.16; p = .003
Note that a significant p-value indicates that the p-curve is flatter than one would expect if studies were powered at 33%, which indicates that the results have no evidential value.

Studies lack evidential value and were intensely p-hacked:
chisq(16) = 26.811; p = .044
Note that a significant p-value indicates that the p-curve is left-skewed, which indicates p-hacking/selective reporting.

All bias tests suggest that the reported results are biased. Consistent with these statistical results, a replication study failed to reproduce the original findings (see https://osf.io/fsadm/).

Because all studies were conducted by the same team of researchers, the bias cannot be attributed to publication bias. Thus, it appears probable that questionable research practices were used to produce the observed significant results. A possible explanation is that the authors ran multiple studies and reported only those that produced significant results.

In conclusion, researchers should be suspicious about the power of superstition or at least keep their fingers crossed when they attempt to replicate the reported findings.


A Revised Introduction to the R-Index

A draft of this manuscript was posted in December, 2014 as a pdf file on http://www.r-index.org.   I have received several emails about the draft.  This revised manuscript does not include a comparison of different bias tests.  The main aim is to provide an introduction to the R-Index and to correct some misconceptions of the R-Index that have become apparent over the past year.

Please cite this post as:  Schimmack, U. (2016). The Replicability-Index: Quantifying Statistical Research Integrity.  https://wordpress.com/post/replication-index.wordpress.com/920

Author’s Note. I would like to thank Gregory Francis, Julia McNeil, Amy Muise, Michelle Martel, Elizabeth Page-Gould, Geoffrey MacDonald, Brent Donnellan, David Funder, Michael Inzlicht, and the Social-Personality Research Interest Group at the University of Toronto for valuable discussions, suggestions, and encouragement.

Abstract

Researchers are competing for positions, grant money, and status. In this competition, researchers can gain an unfair advantage by using questionable research practices (QRPs) that inflate effect sizes and increase the chances of obtaining stunning and statistically significant results. To ensure fair competition that benefits the greater good, it is necessary to detect and discourage the use of QRPs. To this aim, I introduce a doping test for science: the replicability index (R-Index). The R-Index is a quantitative measure of research integrity that can be used to evaluate the statistical replicability of a set of studies (e.g., journals, individual researchers' publications). A comparison of the R-Index for the Journal of Abnormal and Social Psychology in 1960 and the Attitudes and Social Cognition section of the Journal of Personality and Social Psychology in 2011 shows an increase in the use of QRPs. Like doping tests in sports, the availability of a scientific doping test should deter researchers from engaging in practices that advance their careers at the expense of everybody else. Demonstrating replicability should become an important criterion of research excellence that can be used by funding agencies and other stakeholders to allocate resources to research that advances science.
Keywords: Power, Publication Bias, Significance, Credibility, Sample Size, Questionable Research Methods, Replicability, Statistical Methods

INTRODUCTION

It has been known for decades that published results are likely to be biased in favor of authors' theoretical inclinations (Sterling, 1959). The strongest scientific evidence for publication bias stems from a comparison of the rate of significant results in psychological journals with the statistical power of published studies. Statistical power is the long-run probability of obtaining a significant result when the null-hypothesis is false (Cohen, 1988). The typical statistical power of psychological studies has been estimated to be around 60% (Sedlmeier & Gigerenzer, 1989). However, the rate of significant results in psychological journals is over 90% (Sterling, 1959; Sterling et al., 1995). The discrepancy between these estimates reveals that published studies are biased; some findings may simply be false positive results, whereas other studies report inflated effect size estimates.

It has been overlooked that estimates of statistical power are themselves inflated by the use of questionable research methods. Thus, the commonly reported estimate that typical power in psychological studies is 60% is an inflated estimate of true power (Schimmack, 2012). If the actual power is less than 50%, a typical study in psychology has a larger probability of failing (producing a false negative result) than of succeeding (rejecting a false null-hypothesis). Conducting such low-powered studies is extremely wasteful. Moreover, few researchers have the resources to discard 50% of their empirical output. As a result, the incentive to use questionable research practices that inflate effect sizes is strong.

Not surprisingly, the use of questionable research practices is common (John et al., 2012). More than 50% of anonymous respondents reported selective reporting of dependent variables, dropping experimental conditions, or not reporting studies that did not support theoretical predictions. The widespread use of QRPs undermines the fundamental assumption of science that scientific theories have been subjected to rigorous empirical tests. In violation of this assumption, QRPs allow researchers to find empirical support for hypotheses even when these hypotheses are false.

The most dramatic example was Bem's (2011) infamous evidence for time-reversed causality (e.g., studying after a test can improve test performance). Although Bem reported nine successful studies, subsequent studies failed to replicate this finding and raised concerns about the integrity of Bem's studies (Schimmack, 2012). One explanation for false positive results could be that a desirable outcome occurred by chance and a researcher mistook this fluke finding for evidence that the prediction was true. However, a fluke finding is unlikely to repeat itself in a series of studies. Statistically, it is highly improbable that Bem's results are simple type-I errors because the chance of obtaining 9 out of 10 type-I errors with a probability of .05 is less than 1 out of 53 billion (1 / 53,610,771,049). This probability is much smaller than the probability of winning the lottery (1 / 14 million). It is also unlikely that Bem simply failed to report studies with non-significant results, because he would have needed 180 studies (9 x 20) to obtain 9 significant results; a type-I error rate of 5% implies that a significant result will occur, on average, once in every 20 studies. With sample sizes of about 100 participants in the reported studies, this would imply that Bem tested 18,000 participants. It is therefore reasonable to conclude that Bem used questionable research methods to produce his implausible and improbable results.
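The binomial calculation behind the "less than 1 out of 53 billion" figure can be sketched in R as follows:

p <- pbinom(8, size = 10, prob = 0.05, lower.tail = FALSE)   # P(at least 9 of 10 type-I errors)
p       # ~ 1.87e-11
1 / p   # ~ 5.36e10, i.e., about 1 in 53.6 billion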

Although the publication of Bem's article in a flagship journal of psychology was a major embarrassment for psychologists, it provided an opportunity to highlight fundamental problems in the way psychologists produce and publish empirical results. There have been many valuable suggestions and initiatives to increase the integrity of psychological science (e.g., Asendorpf et al., 2012). In this manuscript, I propose another solution to the problem of QRPs: I suggest that scientific organizations ban the use of questionable research practices, just like sports organizations ban the use of performance-enhancing substances. At present, scientific organizations only ban and punish outright manipulation of original data. However, excessive use of QRPs can produce fake results without fake data. Because the ultimate products of an empirical science are the results of statistical analyses, it does not matter whether fake results were obtained with fake data or with questionable statistical analyses. The use of QRPs therefore violates the code of ethics in science that a researcher should base conclusions on an objective and unbiased analysis of empirical data. Dropping studies or dependent variables that do not support a hypothesis violates this code of scientific integrity.

Unfortunately, the world of professional sports also shows that doping bans are ineffective unless they are enforced by regular doping tests. Thus, a ban of questionable research practices needs to be accompanied by objective tests that can reveal the use of questionable research practices. The main purpose of this article is to introduce a statistical test that reveals the use of questionable research practices that can be used to enforce a ban of such practices. This quantitative index of research integrity can be used by readers, editors, and funding agencies to ensure that only rigorous empirical studies are published or funded.

The Replicability-index

The R-Index is based on power theory (Cohen, 1988). Statistical power is defined as the long-run probability of obtaining statistically significant results in a series of studies (see Schimmack, 2016, for more details). In 100 attempts, a study with 50% power is expected to produce 50 significant and 50 non-significant results. In the short run, the actual number of significant results can underestimate or overestimate the true power of a study, but in an unbiased set of studies, the long-run percentage of significant results provides an unbiased estimate of average power (see Schimmack, 2016, for details on meta-analysis of power). Importantly, in smaller sets of studies, underestimation is as likely as overestimation. However, Sterling (1959) was the first to observe that scientific journals report more significant results than the actual power of the studies justifies. In other words, a simple count of significant results provides an inflated estimate of the actual power of published studies.

A simple count of the percentage of significant results in journals would suggest that psychological studies have over 90% statistical power to reject the null-hypothesis. However, studies of the typical power in psychology, based on sample sizes and a moderate effect size, suggest that the typical power of statistical tests in psychology is around 60% (Sedlmeier & Gigerenzer, 1989; see also Schimmack, 2016).

The discrepancy between these estimates of power reveals a systematic bias because the two estimates should converge in the long run. Discrepancies between the two estimates of power can be tested for significance. Schimmack (2012) developed the incredibility index to examine whether a set of studies reports too many significant results. For example, the probability that 10 studies with 60% power produce 90% significant results (9 significant and 1 non-significant) is p = .046 (the binomial probability of at least 9 successes in 10 trials). The incredibility index uses 1 – p, so that higher values indicate that the reported success rate is incredible because more non-significant results should have been observed. In this example, the incredibility index is 1 – .046 = .954. This result suggests that the reported results were selected to provide stronger evidence for a hypothesis than the full set of results would have provided; in other words, questionable research practices were used to produce the reported results.
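A minimal sketch of this binomial calculation in R:

p_success <- pbinom(8, size = 10, prob = 0.60, lower.tail = FALSE)   # P(9 or 10 significant results) ~ .046
1 - p_success                                                        # incredibility index ~ .954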

Some critics have argued that the incredibility index is flawed because it relies on observed effect sizes to estimate power. These power estimates are called observed power or post-hoc power, and statisticians have warned against the computation of observed power (Hoenig & Heisey, 2001). However, this objection is misguided because Hoenig and Heisey (2001) only examined the usefulness of computing observed power for a single statistical test (Schimmack, 2016). The problem with an observed power estimate for a single statistical test is that the confidence interval around the estimate is so large that it often covers the full range of possible values, from 0 (or more accurately, the alpha criterion of significance) to 1 (Schimmack, 2015). Such an estimate is not fundamentally flawed, but it is uninformative. However, in a meta-analysis of power estimates, sampling error decreases, the confidence interval around the power estimate shrinks, and the power estimate becomes more accurate and useful. Thus, a meta-analysis of studies can be used to estimate power and to compare the success rate (percentage of significant results) to the power estimate.

The incredibility index computes a power estimate for each study and then averages these power estimates to obtain an estimate of average observed power. A binomial probability test is then used to compute the probability that a set of studies reported too few non-significant results.

The R-Index builds on the incredibility index. One problem of the incredibility index is that probabilities provide no information about effect sizes. An incredibility index of 99% can be obtained with 10 studies that produced 10 significant results with an average observed power of 60%, or with 100 studies that produced 100% significant results with an average observed power of 95%. Evidently, average observed power of 95% is very high, and the fact that one would expect only 95 significant results while 100 significant results were reported suggests only a small bias. In contrast, the discrepancy between 60% observed power and 100% reported successes is large. The fact that the same incredibility index can be obtained for different amounts of bias is nothing special. Probabilities are always a function of the magnitude of an effect (here, the discrepancy) and the amount of sampling error, which is inversely related to sample size. For this reason, it is important to complement information about probabilities with an effect size measure. For the incredibility index, the effect size is the difference between the success rate and the observed power estimate. In this example, the effect sizes are 100 – 60 = 40 vs. 100 – 95 = 5. This difference is called the inflation rate because it quantifies how much the reported success rate exceeds observed power.
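The two probabilities can be verified with a one-line calculation each (a sketch; both scenarios assume that every reported result is significant):

1 - 0.60^10    # 10 of 10 significant results with 60% average observed power: ~ .994
1 - 0.95^100   # 100 of 100 significant results with 95% average observed power: ~ .994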

In large sets of studies (e.g., an entire volume of a journal), the incredibility index is of little use because it will merely reveal the well-known presence of publication bias and QRPs, and the p-value is influenced by the number of tests in a journal. A journal with more articles and statistical tests would have a higher incredibility index even if its studies, on average, have more power and are less biased. The inflation rate provides a better measure of the integrity of the results reported in a journal.

Another problem of the incredibility index is that power is not normally or symmetrically distributed. As a result, the average observed power estimate is a biased estimate of the average true power (Yuan & Maxwell, 2005; Schimmack, 2015). For example, when true power is close to the upper limit of 100%, observed power is more likely to underestimate than to overestimate true power. To overcome this problem, the R-Index uses the median to estimate true power. The median is unbiased because in each study it is equally likely that the observed effect size underestimates or overestimates the true effect size, and therefore equally likely that a power estimate underestimates or overestimates true power. While the amounts of underestimation and overestimation are not symmetrically distributed, the direction of the error is equally likely to fall on either side of true power. Simulations confirm that the median provides an unbiased estimate of true power even when power is high.

Thus, the formula for the inflation in a set of studies is

Inflation = Success Rate (Percentage of Significant Results) – Median Observed Power

Median observed power is an unbiased estimate of power in an unbiased set of studies. However, if the set of studies is distorted by publication bias, median observed power is inflated. Bias can still be detected because the success rate increases faster than median observed power. For example, if true power is 50% but only significant results are reported (100% success rate), median observed power increases to 75% (Schimmack, 2015).

The amount of inflation is inversely related to the actual power of a set of studies. When the set of studies includes only significant results (100% success rate), inflation is necessarily greater than 0 because power is never 100%. However, median observed power of 95% implies only a small amount of inflation (5%), and the actual power is close to the median observed power (94%). In contrast, median observed power of 70% with a 100% success rate implies a large amount of bias; true power is only 30%. As a result, the true power of a set of studies increases with median observed power and decreases with the amount of inflation. The R-Index combines these two indicators by subtracting the inflation rate from median observed power.
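The 75%, 70%, and 30% figures in the last two paragraphs can be checked with a small simulation (my own sketch, not part of the R-Index software; each result is treated as a z-test and the negligible probability of significance in the wrong direction is ignored):

set.seed(1)
sim_mop <- function(true_power, k = 1e6, z_crit = qnorm(.975)) {
  ncp <- z_crit + qnorm(true_power)    # noncentrality that yields the desired true power
  z   <- rnorm(k, mean = ncp, sd = 1)  # observed z-values of k studies
  z   <- z[z > z_crit]                 # publication bias: keep only significant results
  pnorm(median(z) - z_crit)            # median observed power of the published studies
}
sim_mop(0.50)   # ~ .75, as stated above
sim_mop(0.30)   # ~ .70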

R-Index = Median Observed Power – Inflation

As Inflation = Success Rate – Median Observed Power, the R-Index can also be expressed as a function of Success Rate and Median Observed Power

R-Index = Median Observed Power – (Success Rate – Median Observed Power)

or

R-Index = 2 * Median Observed Power – Success Rate

The R-Index can range from 0 to 1. A value of 0 is obtained when median observed power is 50% and the success rate is 100%. However, this outcome should not occur with real data because significant results have a minimum observed power of 50%. To obtain a median of exactly 50% observed power, all studies would have to produce exactly 50% observed power, but sampling error should produce variation in observed power estimates. A fixed value or restricted variance is another indicator of bias (Schimmack, 2015). A more realistic lower limit for the R-Index is a value of 22%. This value is obtained when the null-hypothesis is true (the population effect size is zero) and only significant results are reported (success rate = 100%). In this case, median observed power is 61%, the inflation rate is 39%, and the R-Index is 61 – 39 = 22. The maximum of 100 would be obtained if studies have practically 100% power and the success rate is 100%.
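The 22% lower limit can be derived directly (a sketch; it uses the fact that, when the null-hypothesis is true, the significant two-tailed p-values are uniformly distributed between 0 and .05):

median_p <- 0.025                        # median p-value among significant results when H0 is true
z_obs    <- qnorm(1 - median_p / 2)      # z-value implied by the median p-value
mop      <- pnorm(z_obs - qnorm(.975))   # median observed power ~ .61
2 * mop - 1                              # R-Index ~ .22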

It is important to note that the R-Index is not an estimate of power.  It is monotonically related to power, but an R-Index of 22% does not imply that a set of studies has 22% power.  As noted earlier, an R-Index of 22% is obtained when the null-hypothesis is true which produces only 5% significant results if the significance criterion is 5%.  When power is less than 50%, the R-Index is conservative and the Index values are higher than true power. When power is more than 50%, the R-Index values are lower than true power.   However, for comparisons of journals, authors, etc., rankings with the R-Index will reflect the ranking in terms of true power.  Moreover, an R-Index below 50% implies that true power is less than 50%, which can be considered inadequate power for most research questions.

Example 1:  Bem’s Feeling the Future

The first example uses Bem's (2011) article to demonstrate the usefulness of computing an R-Index.

N     d      Observed Power   Success
100   0.25   0.79             1
150   0.20   0.78             1
100   0.26   0.82             1
100   0.23   0.73             1
100   0.22   0.70             1
150   0.15   0.57             1
150   0.14   0.52             1
200   0.09   0.35             0
100   0.19   0.59             1
50    0.42   0.88             1

The median observed power is 71%. The success rate is 90%. Accordingly, the inflation rate is 90 – 71 = 19%, and the R-Index is 71 – 19 = 52. An R-Index of 52 is higher than the 22% that is expected from a set of studies in which the null-hypothesis is true and only significant results are reported. However, it is not clear how questionable research practices influence the R-Index, so the R-Index should not be used to infer from values greater than 22% that an effect is present. The R-Index does suggest that Bem's studies did not have the 80% power that he assumed in the planning of his studies. It also suggests that the nominal median effect size of d = .21 is inflated and that future studies should expect a lower effect size. These predictions were confirmed in a set of replication studies (Galak et al., 2013). In short, an R-Index of about 50% raises concerns about the robustness of empirical results and shows that impressive success rates of 90% or more do not necessarily provide strong evidence for the existence of an effect.
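A minimal sketch of the R-Index computation for the values in the table above:

obs_pow <- c(.79, .78, .82, .73, .70, .57, .52, .35, .59, .88)
success <- c(1, 1, 1, 1, 1, 1, 1, 0, 1, 1)
mop <- median(obs_pow)    # ~ .72
sr  <- mean(success)      # .90
mop - (sr - mop)          # R-Index ~ .53, or 52 when using the rounded values reported above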

Example 2: The Many Labs Project

In the wake of the replicability crisis, the Open Science Framework has been used to examine the replicability of psychological research with replication studies that reproduce the original studies as closely as possible. The first results emerged from the Many Labs project, in which an international team of researchers replicated 13 psychological studies in several laboratories. The main finding of this project was that 10 of the 13 studies were successfully replicated in several labs, a success rate of 77%. I computed the R-Index for the original studies. One study provided insufficient information to compute observed power, leaving 12 studies to be analyzed. The success rate for the original studies was 100% (one study had a marginally significant effect, p < .10, two-tailed). Median observed power was 86%. The inflation rate is 100 – 86 = 14, and the R-Index is 86 – 14 = 72. Thus, an R-Index of 72 suggests that these studies have a high probability of replicating. Of course, a higher R-Index would be even better.

It is important to note that success in the Many Labs project was defined as a significant result in a meta-analysis across all labs with over 3,000 participants. The success rate would be lower if replication success were defined as a significant result in an exact replication study with the same statistical power (sample size) as the original study. Nevertheless, many of the results were replicated even with smaller sample sizes because the original studies examined large effects, had large samples, or both.

Conclusion

It has been widely recognized that questionable research practices are threatening the foundations of science. This manuscript introduces the R-Index as a statistical tool to assess the replicability of published results. Results are replicable if the original studies had sufficient power to produce significant results. A study with 80% power is likely to produce a significant result in 80% of all attempts without the need for questionable research practices. In contrast, a study with 20% power can only produce significant results with the help of inflated effect sizes. In 20% of all attempts, luck alone will be sufficient to inflate the observed effect size. In all other cases, researchers have to hide failed attempts in file drawers or use questionable statistical practices to inflate effect sizes. The R-Index reveals the presence of questionable research practices when observed power is lower than the rate of significant results. The R-Index has two components. It increases with observed power because studies with high power are more likely to replicate. The second component is the discrepancy between the percentage of significant results and observed power. The greater the discrepancy, the more questionable research practices have contributed to the reported successes and the more observed power overestimates true power.

References

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425. doi: 10.1037/a0021524

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2013). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology.

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. American Statistician, 55(1), 19-24. doi: 10.1198/000313001300339897

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532. doi: 10.1177/0956797611430953

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. doi: 10.1037/a0029487

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309-316.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. American Statistician, 49(1), 108-112.

2015 Replicability Ranking of 100+ Psychology Journals

Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability of producing a significant result in an exact replication study and a lower probability of being false-positive results.

The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be used to interpret a result as evidence for an effect and against the null-hypothesis.  Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.

The average power across the 105 psychology journals used for this ranking is 70%. This means that a representative sample of significant results in exact replication studies is expected to produce 70% significant results. The rankings for 2015 show variability across journals with average power estimates ranging from 84% to 54%.  A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.

The Journal Names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that estimate power including non-significant results even if these are not reported (the file-drawer).

Rank   Journal   2010-2014   2015
1   Social Indicators Research   81   84
2   Journal of Happiness Studies   81   83
3   Journal of Comparative Psychology   72   83
4   International Journal of Psychology   80   81
5   Journal of Cross-Cultural Psychology   78   81
6   Child Psychiatry and Human Development   75   81
7   Psychonomic Bulletin and Review   72   80
8   Journal of Personality   72   79
9   Journal of Vocational Behavior   79   78
10   British Journal of Developmental Psychology   75   78
11   Journal of Counseling Psychology   72   78
12   Cognitive Development   69   78
13   JPSP: Personality Processes and Individual Differences   65   78
14   Journal of Research in Personality   75   77
15   Depression & Anxiety   74   77
16   Asian Journal of Social Psychology   73   77
17   Personnel Psychology   78   76
18   Personality and Individual Differences   74   76
19   Personal Relationships   70   76
20   Cognitive Science   77   75
21   Memory and Cognition   73   75
22   Early Human Development   71   75
23   Journal of Sexual Medicine   76   74
24   Journal of Applied Social Psychology   74   74
25   Journal of Experimental Psychology: Learning, Memory & Cognition   74   74
26   Journal of Youth and Adolescence   72   74
27   Social Psychology   71   74
28   Journal of Experimental Psychology: Human Perception and Performance   74   73
29   Cognition and Emotion   72   73
30   Journal of Affective Disorders   71   73
31   Attention, Perception and Psychophysics   71   73
32   Evolution & Human Behavior   68   73
33   Developmental Science   68   73
34   Schizophrenia Research   66   73
35   Archives of Sexual Behavior   76   72
36   Pain   74   72
37    Acta Psychologica   72   72
38   Cognition   72   72
39   Journal of Experimental Child Psychology   72   72
40   Aggressive Behavior   72   72
41   Journal of Social Psychology   72   72
42   Behaviour Research and Therapy   70   72
43   Frontiers in Psychology   70   72
44   Journal of Autism and Developmental Disorders   70   72
45   Child Development   69   72
46   Epilepsy & Behavior   75   71
47   Journal of Child and Family Studies   72   71
48   Psychology of Music   71   71
49   Psychology and Aging   71   71
50   Journal of Memory and Language   69   71
51   Journal of Experimental Psychology: General   69   71
52   Psychotherapy   78   70
53   Developmental Psychology   71   70
54   Behavior Therapy   69   70
55   Judgment and Decision Making   68   70
56   Behavioral Brain Research   68   70
57   Social Psychological and Personality Science   62   70
58   Political Psychology   75   69
59   Cognitive Psychology   74   69
60   Organizational Behavior and Human Decision Processes   69   69
61   Appetite   69   69
62   Motivation and Emotion   69   69
63   Sex Roles   68   69
64   Journal of Experimental Psychology: Applied   68   69
65   Journal of Applied Psychology   67   69
66   Behavioral Neuroscience   67   69
67   Psychological Science   67   68
68   Emotion   67   68
69   Developmental Psychobiology   66   68
70   European Journal of Social Psychology   65   68
71   Biological Psychology   65   68
72   British Journal of Social Psychology   64   68
73   JPSP: Attitudes & Social Cognition   62   68
74   Animal Behavior   69   67
75   Psychophysiology   67   67
76   Journal of Child Psychology and Psychiatry and Allied Disciplines   66   67
77   Journal of Research on Adolescence   75   66
78   Journal of Educational Psychology   74   66
79   Clinical Psychological Science   69   66
80   Consciousness and Cognition   69   66
81   The Journal of Positive Psychology   65   66
82   Hormones & Behavior   64   66
83   Journal of Clinical Child and Adolescent Psychology   62   66
84   Journal of Gerontology: Series B   72   65
85   Psychological Medicine   66   65
86   Personality and Social Psychology Bulletin   64   64
87   Infancy   61   64
88   Memory   75   63
89   Law and Human Behavior   70   63
90   Group Processes & Intergroup Relations   70   63
91   Journal of Social and Personal Relationships   69   63
92   Cortex   67   63
93   Journal of Abnormal Psychology   64   63
94   Journal of Consumer Psychology   60   63
95   Psychology of Violence   71   62
96   Psychoneuroendocrinology   63   62
97   Health Psychology   68   61
98   Journal of Experimental Social Psychology   59   61
99   JPSP: Interpersonal Relationships and Group Processes   60   60
100   Social Cognition   65   59
101   Journal of Consulting and Clinical Psychology   63   58
102   European Journal of Personality   72   57
103   Journal of Family Psychology   60   57
104   Social Development   75   55
105   Annals of Behavioral Medicine   65   54
106   Self and Identity   63   54

Is the N-pact Factor (NF) a Reasonable Proxy for Statistical Power and Should the NF be Used to Rank Journals’ Reputation and Replicability? A Critical Review of Fraley and Vazire (2014)

Link to open access PlosOne article:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0109019

Information about typical sample sizes is informative for a number of reasons. Most important, sampling error is related to sample size. Everything else being equal, larger samples have less sampling error. Studies with less sampling error (a) are more likely to produce statistically significant evidence for an effect when an effect is present, (b)  can produce more precise estimates of effect sizes, and (c) are more likely to produce replicable results.

Fraley and Vazire (2014) proposed that typical sample sizes (median N) in journals can be used to evaluate the replicability of results published in these journals. They called this measure the N-pact Factor (NF). The authors propose that the N-pact Factor (NF) is a reasonable proxy for statistical power; that is, the probability of obtaining a statistically significant result that is real rather than a simple fluke finding in a particular study.

“The authors evaluate the quality of research reported in major journals in social-personality psychology by ranking those journals with respect to their N-pact Factors (NF)—the statistical power of the empirical studies they publish to detect” (Abstract, p. 1).

The article also contains information about the typical sample size in six psychology journals for the years 2006 to 2010. The numbers are fairly consistent across years and the authors present a combined NF for the total time period.


Journal Name                                   NF (median N)   Power (d = .41)   Power (d = .50)
Journal of Personality                         178             .78               .91
Journal of Research in Personality             129             .64               .80
Personality and Social Psychology Bulletin      95             .50               .67
Journal of Personality and Social Psychology    90             .49               .65
Journal of Experimental Social Psychology       87             .47               .64
Psychological Science                           73             .40               .56

The results show that median sample sizes range from 73 to 178. The authors also examined the relationship between NF and the impact factor of a journal. They found a negative correlation of r = -.48, 95%CI = -.93, +.54.   Based on this non-significant correlation in a study with a rather low NF of 6, the authors suggest that “journals that have the highest impact also tend to publish studies that have smaller samples” (p. 8).

In their conclusions, the authors suggest that “journals that have a tendency to publish higher power studies should be held in higher regard than journals that publish lower powered studies—a quality we indexed using the N-pact Factor.” (p. 8).   According to their NF-Index, the Journal of Personality should be considered the best of the six journals. In contrast, the journal with the highest impact factor, Psychological Science, should be considered the worst journal because the typical sample size is the smallest.

The authors also make some more direct claims about statistical power. To make inferences about the typical power of statistical tests in a journal, the authors assume that “Statistical power is a function of three ingredients: α, N, and the population effect size” (p. 6).

Consistent with previous post-hoc power analyses, the authors set the criterion to p < .05 (two-tailed). The sample size is provided by the median sample size in a journal which is equivalent to the N-pact factor (NF). The only missing information is the median population effect size. The authors rely on a meta-analysis by Richard et al. (2001) to estimate the population effect size as d = .41. This value is the median effect size in a meta-analysis of over 300 meta-analyses in social psychology that covers the entire history of social psychology. Alternatively, they could have used d = .50, which is a moderate effect size according to Cohen. This value has been used in previous studies of statistical power of journals (Sedlmeier & Gigerenzer, 1989). The table above shows both power estimates. Accordingly, the Journal of Personality has good power (Cohen recommended 80% power) and the journal Psychological Science would have low power to produce significant results.
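These power values can be approximately reproduced with R's built-in power.t.test function, under the assumption of a two-sample between-subject design with the N-pact Factor split evenly across two cells (a sketch; the journal abbreviations are mine):

nf <- c(JP = 178, JRP = 129, PSPB = 95, JPSP = 90, JESP = 87, PsychScience = 73)
round(sapply(nf, function(N) power.t.test(n = N / 2, delta = 0.41)$power), 2)   # ~ .78 down to .40
round(sapply(nf, function(N) power.t.test(n = N / 2, delta = 0.50)$power), 2)   # ~ .91 down to .56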

In the end, the authors suggest that the N-pact factor can be used to evaluate journals and that journal editors should strive towards a high NF. “One of our goals is to encourage journals (and their editors, publishers, and societies that sponsor them) to pay attention to and strive to improve their NFs” (p. 11). The authors further suggest that NF provides “an additional heuristic to use when deciding which journal to submit to, what to read, what to believe, or where to look to find studies to publicize” (p. 11).

Before I present my criticism of the N-pact Factor, I want to emphasize that I agree on several points with the authors. First, I believe that statistical power is important (Schimmack, 2012). Second, I believe that quantitative indicators that provide information about the typical statistical power of studies in a journal are valuable. Third, I agree with the authors that everything else being equal, statistical power increases with sample size.

My first concern is that sample sizes can provide misleading information about power because researchers often conduct analyses on subsamples of their data. For example, with 20 participants per cell, a typical 2 x 2 ANOVA design has a total sample size of N = 80. The ANOVA with all participants is often followed by post-hoc tests that aim to test differences between two theoretically important means. For example, after showing an interaction between men and women, the post-hoc tests are used to show that there is a significant increase for men and a significant decrease for women. Although the interaction effect can have high power because the pattern in the two groups goes in opposite directions (cross-over interaction), the comparisons within gender with N = 40 have considerably less power. A comparison of sample sizes and degrees of freedom in Psychological Science shows that many tests have smaller df than N (e.g., 37/76, 65/131, 62/130, 66/155, 57/182 for the first five articles in 2010 in alphabetical order). This problem could be addressed by using information about df to compute the median N of statistical tests.

A more fundamental concern is the use of sample size as a proxy for statistical power. This would only be true if all studies had the same effect size and used the same research design. These restrictive conditions are clearly violated when the goal is to provide information about the typical statistical power of diverse articles in a scientific journal. Some research areas produce larger effects than others. For example, animal studies make it easier to control variables, which reduces sampling error. Perception studies can often gather hundreds of observations in a one-hour session, whereas social psychologists may end up with a single behavior in a carefully staged deception study. The use of a single effect size for all journals benefits journals that use large samples to study small effects and punishes journals that publish carefully controlled studies that produce large effects. At a minimum, one would expect the information about sample sizes to be complemented with information about the median effect size in a journal. The authors did not consider this option, presumably because effect sizes are much harder to obtain than sample sizes, but this information is essential for power estimation.

A related concern is that sample size can only be used to estimate power for a simple between-subject design. Estimating statistical power for more complex designs is more difficult and often not possible without information that is not reported. Applying the simple formula for between-subject designs to these studies can severely underestimate statistical power. A within-subject design with many repeated trials can produce more power than a between-subject design with 200 participants. If the NF were used to evaluate journals or researchers, it would favor researchers who use inefficient between-subject designs rather than efficient designs, which would incentivize waste of research funds. It would be like evaluating cars based on their gasoline consumption rather than on their mileage.

AN EMPIRICAL TEST OF NF AS A MEASURE OF POWER

The problem of equating sample size with statistical power is apparent in the results of the OSF-reproducibility project. In this project, a team of researchers conducted exact replication studies of 97 statistically significant results published in three prominent psychology journals. Only 36% of the replication studies were significant. The authors examined several predictors of replication success (p < .05 in the replication study), including sample size.

Importantly, they found a negative relationship between sample size of the original studies and replication success (r = -.15). One might argue that a more appropriate measure of power would be the sample size of the replication studies, but even this measure failed to predict replication success  (r = -.09).

The reason for this failure of the NF is that the OSF-reproducibility project mixed studies from the cognitive literature, which often use powerful within-subject designs with small samples, with studies from social psychology, which often use the less powerful between-subject design. Although sample sizes are larger in the social psychology studies, the cognitive studies with smaller samples are more powerful and tended to replicate at a higher rate.

This example illustrates that the focus on sample size is misleading and that the N-pact Factor would have led to the wrong conclusion about the replicability of research in social versus cognitive psychology.

CONCLUSION

Everything else being equal, studies with larger samples have more statistical power to demonstrate real effects, and statistical power is monotonically related to sample size. Everything else being equal, larger samples are better because more statistical power is better. However, in real life everything else is not equal, and rewarding sample size without taking effect sizes and design features of a study into account creates a false incentive structure. In other words, bigger samples are not necessarily better.

To increase replicability and to reward journals for publishing replicable results it would be better to measure the typical statistical power of studies than to use sample size as a simple, but questionable proxy.

_____________________________________________________________

P.S. The authors briefly discuss the possibility of using observed power, but reject this option based on a common misinterpretation of Hoenig and Heisey (2001). Hoenig and Heisey (2001) pointed out that observed power is a useless statistic when an observed effect size is used to estimate the power of this particular study. Their critique does not invalidate the use of observed power for a set of studies or a meta-analysis of studies. In fact, the authors used a meta-analytically derived effect size to compute observed power for their median sample sizes. They could also have computed a meta-analytic effect size for each journal and used this effect size for a power analysis. One may be concerned about the effect of publication bias on effect sizes published in journals, but this concern applies equally to the meta-analytic results by Richard et al. (2001).

P.P.S. Conflict of Interest. I am working on a statistical method that provides estimates of power. I am personally motivated to find reasons to like my method better than the N-pact Factor, which may have influenced my reasoning and my presentation of the facts.

 

On the Definition of Statistical Power

D1: In plain English, statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down (first hit on Google)

D2: The power or sensitivity of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. (Wikipedia)

D3: The probability of not committing a Type II error is called the power of a hypothesis test. (Stat Trek)

The concept of statistical power arose from Neyman and Pearson’s approach to statistical inference. Neyman and Pearson distinguished between two types of errors that can occur when a researcher draws conclusions about a population from observations in a sample. The first error (type-I error) is to infer a systematic relationship (in tests of causality, an effect) when no relationship (no effect) exists. This error is also known as a false positive, as when a pregnancy test shows a positive result (pregnant) although the woman is not pregnant. The second error (type-II error) is to fail to detect a systematic relationship that actually exists. This error is also known as a false negative, as when a pregnancy test shows a negative result (not pregnant) although the woman is actually pregnant.

Ideally, researchers would never make type-I or type-II errors, but it is inevitable that researchers will make both types of mistakes. However, researchers have some control over the probability of making these two mistakes. Statistical power is simply the probability of not making a type-II mistake; that is, the probability of avoiding a negative result when an effect is present.

Many definitions of statistical power imply that the probability of avoiding a type-II error is equivalent to the long-run frequency of statistically significant results because statistical significance is used to decide whether an effect is present or not. By definition, statistically non-significant results are negative results when an effect exists in the population. However, it does not automatically follow that all significant results are positive results when an effect is present. Significant results and positive results are only identical in one-sided hypothesis tests. For example, if the hypothesis is that men are taller than women and a one-sided statistical test is used, only results that show a greater mean for men than for women can become significant. A study that shows a large difference in the opposite direction would not produce a significant result, no matter how large the difference is.

The equivalence between significant results and positive results no longer holds in the more commonly used two-tailed tests of statistical significance. In this case, the relationship in the population is either positive or negative; it cannot be both. Only significant results that also show the correct direction of the effect (either the direction that was correctly predicted or the direction that is consistent with the majority of other significant results) are positive results. Significant results in the other direction are false positive results in that they show a false effect, which only becomes visible in a two-tailed test when the sign of the effect is taken into account.

How important is the distinction between the rate of positive results and the rate of significant results in a two-tailed test? Actually, it is not very important. The largest number of false positive results is obtained when no effect exists at all. If the 5% significance criterion is used, no more than 5% of tests will produce false positive results. It will also become apparent after some time that there is no effect because half the studies will show a positive effect and the other half will show a negative effect. The inconsistency in the sign of the effect shows that the significant results are not caused by a systematic relation. As the power of a test increases, more and more significant results will have the correct sign and fewer and fewer results will be false positives. The picture on top shows an example with 13% power. As can be seen, most of this percentage comes from the fat right tail of the blue distribution. However, a small portion comes from the left tail that is more extreme than the criterion for significance (the green line).
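To illustrate the split between the two tails, here is a sketch with an arbitrary small effect chosen to give roughly 13% two-tailed power (the z-test approximation and the value of the noncentrality parameter are my own assumptions, not taken from the figure):

ncp    <- 0.83                        # illustrative expected z-value of the true effect
z_crit <- qnorm(.975)
correct_sign <- pnorm(ncp - z_crit)   # significant and in the correct direction: ~ .13
wrong_sign   <- pnorm(-z_crit - ncp)  # significant but in the wrong direction: ~ .003
c(correct_sign, wrong_sign, total = correct_sign + wrong_sign)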

For a study with 50% power, the probability of producing a true positive result (a significant result with the correct sign) is 50%. The probability of a false positive result (a significant result with the wrong sign) is 0 to the second decimal, but not exactly zero (~0.05%). In other words, even in studies with modest power, false positive results have a negligible effect. A much bigger concern is that 50% of the results are expected to be false negative results.

In conclusion, the sign of an effect matters. Two-tailed significance testing ignores the sign of an effect. Power is the long-run probability of obtaining a significant result with the correct sign. This probability is identical to the probability of a statistically significant result in a one-tailed test. It is not identical to the probability of a statistically significant result in a two-tailed test, but for practical purposes the difference is negligible. Nevertheless, it is probably most accurate to use a definition that is equally applicable to one-tailed and two-tailed tests.

D4: Statistical power is the probability of drawing the correct conclusion from a statistically significant result when an effect is present. If the effect is positive, the correct inference is that a positive effect exists. If an effect is negative, the correct inference is that a negative effect exists. When the inference is that the effect is negative (positive), but the effect is positive (negative), a statistically significant result does not count towards the power of a statistical test.

This definition differs from other definitions of power because it distinguishes between true positive and false positive results. Other definitions of power treat all non-negative results (false positive and true positive) as equivalent.

 

The Abuse of Hoenig and Heisey: A Justification of Power Calculations with Observed Effect Sizes

In 2001, Hoenig and Heisey wrote an influential article titled “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.” The article has been cited over 500 times, commonly as a reference for the claim that it is a fallacy to use observed effect sizes to compute statistical power.

In this post, I provide a brief summary of Hoenig and Heisey’s argument. The summary shows that Hoenig and Heisey were concerned with the practice of assessing the statistical power of a single test based on the observed effect size for this effect. I agree that it is often not informative to do so (unless the result is power = .999). However, the article is often cited to suggest that the use of observed effect sizes in power calculations is fundamentally flawed. I show that this statement is false.

The abstract of the article makes it clear that Hoenig and Heisey focused on the estimation of power for a single statistical test. “There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result” (page 1). The abstract informs readers that this practice is fundamentally flawed. “This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic” (p. 1).

Given that method articles can be difficult to read, it is possible that the misinterpretation of Hoenig and Heisey is the result of relying on the term “fundamentally flawed” in the abstract. However, some passages in the article are also ambiguous. In the Introduction Hoenig and Heisey write “we describe the flaws in trying to use power calculations for data-analytic purposes” (p. 1). It is not clear what purposes are left for power calculations if they cannot be used for data-analytic purposes. Later on, they write more forcefully “A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed.” (p. 2). So readers cannot be blamed entirely if they believed that calculations of observed power are fundamentally flawed. This conclusion is often implied in Hoenig and Heisey’s writing, which is influenced by their broader dislike of hypothesis testing  in general.

The main valid argument that Hoenig and Heisey make is that power analysis is based on the unknown population effect size and that effect sizes in a particular sample are contaminated with sampling error.  As p-values and power estimates depend on the observed effect size, they are also influenced by random sampling error.

In the special case in which true power is 50% and the observed effect size happens to equal the true effect size, the p-value exactly matches the significance criterion. If sampling error leads to an underestimation of the true effect size, the p-value will be non-significant and the power estimate will be less than 50%. When sampling error inflates the observed effect size, the p-value will be significant and observed power will be above 50%.

It is therefore impossible to find scenarios where observed power is high (80%) and a result is not significant, p > .05, or where observed power is low (20%) and a result is significant, p < .05.  As a result, it is not possible to use observed power to decide whether a non-significant result was obtained because power was low or because power was high but the effect does not exist.

In fact, a simple mathematical formula can be used to transform p-values into observed power and vice versa (I actually got the idea of using p-values to estimate power from Hoenig and Heisey’s article).  Given this perfect dependence between the two statistics, observed power cannot add additional information to the interpretation of a p-value.
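A sketch of this transformation, using a z-test approximation for a two-tailed test with alpha = .05:

p_to_obs_power <- function(p, alpha = .05) {
  z_obs  <- qnorm(1 - p / 2)       # z-value implied by the two-tailed p-value
  z_crit <- qnorm(1 - alpha / 2)   # significance threshold
  pnorm(z_obs - z_crit)            # observed power
}
p_to_obs_power(.05)    # exactly .50
p_to_obs_power(.005)   # ~ .80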

This central argument is valid and it does mean that it is inappropriate to use the observed effect size of a statistical test to draw inferences about the statistical power of a significance test for the same effect (N = 1). Similarly, one would not rely on a single data point to draw inferences about the mean of a population.

However, it is common practice to aggregate original data points or to aggregate effect sizes of multiple studies to obtain more precise estimates of the mean in a population or of the mean effect size, respectively. Thus, the interesting question is whether Hoenig and Heisey’s (2001) article contains any arguments that would undermine the aggregation of power estimates to obtain an estimate of the typical power for a set of studies. The answer is no. Hoenig and Heisey do not consider a meta-analysis of observed power, and their discussion of observed power contains no arguments that would undermine the validity of a meta-analysis of post-hoc power estimates.

A meta-analysis of observed power can be extremely useful to check whether researchers’ a priori power analyses provide reasonable estimates of the actual power of their studies.
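
For example, one could convert the p-values reported in a set of studies into observed power estimates and summarize them. The sketch below is my own illustration: the p-values are made up, and the far-tail term of the power calculation is ignored for simplicity.

p_reported <- c(.01, .04, .03, .049, .002)     # hypothetical published p-values
z_obs      <- qnorm(1 - p_reported / 2)        # z-scores implied by the p-values
obs_power  <- pnorm(z_obs - qnorm(.975))       # observed power of each result
c(mean = mean(obs_power), median = median(obs_power))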

Assume that researchers in a particular field have to demonstrate that their studies have 80% power to produce significant results when an important effect is present because conducting studies with less power would be a waste of resources (although some granting agencies require power analyses, these power analyses are rarely taken seriously, so I consider this a hypothetical example).

Assume that researchers comply and submit a priori power analyses with effect sizes that are considered to be sufficiently meaningful. For example, an effect of half a standard deviation (Cohen’s d = .50) might look reasonably large to be meaningful. Researchers submit their grant applications with an a priori power analysis that produces 80% power for an effect size of d = .50. Based on this power analysis, researchers request funding for 128 participants per study. A researcher plans four studies and needs $50 for each participant. The total budget is $25,600.
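
As a check on these numbers, the a priori power analysis can be reproduced in R; this assumes a two-group design with equal cell sizes and a two-sided test at alpha = .05.

power.t.test(delta = .50, sd = 1, sig.level = .05, power = .80)
# n is ~64 per group, i.e., 128 participants per study; 4 studies * 128 * $50 = $25,600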

When the research project is completed, all four studies have produced non-significant results. The observed standardized effect sizes were 0, .20, .25, and .15. Is it really impossible to estimate the realized power in these studies based on the observed effect sizes? No. It is common practice to conduct a meta-analysis of observed effect sizes to get a better estimate of the (average) population effect size. In this example, the average effect size across the four studies is d = .15. It is also possible to show that this average effect size is significantly different from the effect size that was used for the a priori power calculation (M1 = .15, M2 = .50, Mdiff = .35, SE = 2/sqrt(512) = .088, t = .35 / .088 = 3.96, p < .001). Using the more realistic effect size estimate that is based on actual empirical data rather than wishful thinking, a post-hoc power analysis yields a power estimate of 13%. The probability of obtaining non-significant results in all four studies is then 57%. Thus, it is not surprising that the studies produced non-significant results.

In this example, a post-hoc power analysis with observed effect sizes provides valuable information for the planning of future studies in this line of research. Either effect sizes of this magnitude are not important enough and the research program should be abandoned, or effect sizes of this magnitude still have important practical implications and future studies should be planned on the basis of a priori power analyses with more realistic effect sizes.
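
The numbers in this example can be verified with a short R sketch. It assumes, as above, that each study compared two groups of 64 participants with a two-sided test at alpha = .05, and it uses the standard approximation SE(d) ≈ 2/sqrt(N) for a two-group design.

d_obs  <- c(0, .20, .25, .15)
d_meta <- mean(d_obs)                            # pooled estimate: .15

se_study <- 2 / sqrt(128)                        # approximate SE of d in one study
se_meta  <- se_study / sqrt(4)                   # SE of the mean of four studies, ~.088
z <- (d_meta - .50) / se_meta                    # test against the planned d = .50
c(z = z, p = 2 * pnorm(-abs(z)))                 # z ~ -3.96, p < .001

post_hoc <- power.t.test(n = 64, delta = d_meta, sd = 1, sig.level = .05)
post_hoc$power                                   # ~ .13
(1 - post_hoc$power)^4                           # ~ .57: chance that all four studies are non-significant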

Another valuable application of observed power analysis is the detection of publication bias and questionable research practices (Ioannidis and Trikalinos, 2007; Schimmack, 2012) and the estimation of the replicability of statistical results published in scientific journals (Schimmack, 2015).

In conclusion, the article by Hoenig and Heisey is often used as a reference to argue that observed effect sizes should not be used for power analysis. This post clarifies that this practice is indeed not meaningful for a single statistical test, but that power estimates can be meaningfully aggregated across larger sets of studies.

Klaus Fiedler: “it is beyond the scope of this article to discuss whether publication bias actually exists”

Urban Dictionary: Waffle

A Critical Examination of “Research Practices That Can Prevent an Inflation of False-Positive Rates” by Murayama, Pekrun, and Fiedler (2014) in Personality and Social Psychology Review.

The article by Murayama, Pekrun, and Fiedler (MPK) discusses the probability of false positive results (evidence for an effect when no effect is present, also known as a type-I error) in multiple-study articles. When researchers conduct a single study, the nominal probability of obtaining a significant result without a real effect (a type-I error) is typically set to 5% (p < .05, two-tailed). Thus, when no effect exists, one would expect 19 non-significant results for every significant result. A false-positive finding (type-I error) would be followed by several failed replications, so replication studies could quickly correct false discoveries. Or so one would like to believe. However, journals traditionally reported only significant results. Thus, false positive results remained uncorrected in the literature because failed replications were not published.

In the 1990s, experimental psychologists who run relatively cheap studies found a solution to this problem: journals demanded that researchers replicate their findings in a series of studies that were then published in a single article.

MPK point out that the probability of a type-I error decreases exponentially as the number of studies increases. With two studies, the probability is less than 1% (.05 * .05 = .0025). It is easier to see the exponential effect in terms of ratios (1 out of 20, 1 out of 400, 1 out of 8,000, etc.). In top journals of experimental social psychology, a typical article contains four studies. The probability that all four studies produce a type-I error is only 1 out of 160,000. Converting this two-tailed probability into a z-score gives z = 4.52; that is, the strength of evidence is 4.5 standard deviations away from zero, the value that represents the absence of an effect. In particle physics, a value of z = 5 is used to rule out false positives. Thus, getting 4 out of 4 significant results in four independent tests of an effect provides strong evidence for an effect.
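
The arithmetic behind this argument is easy to reproduce. The sketch below assumes k independent studies, each tested two-tailed at alpha = .05, and converts the combined type-I error probability into a z-score.

k <- 1:4
p_all_type1  <- .05^k                         # 1/20, 1/400, 1/8,000, 1/160,000
z_equivalent <- qnorm(1 - p_all_type1 / 2)    # two-tailed conversion to a z-score
round(z_equivalent, 2)                        # 1.96, 3.02, 3.84, 4.52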

I am in full agreement with MPK, and I made the same point in Schimmack (2012). The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in each of 2 conditions for a total of N = 40 per study) and a single study with the same total number of participants (N = 160). A real effect will produce stronger evidence for an effect as the sample size increases. Getting four significant results at the 5% level is not more impressive than getting a single significant result at the p < .00001 level.

However, the strength of evidence from multiple-study articles depends on one crucial condition. This condition is so elementary and self-evident that it is usually not even stated explicitly: the researcher honestly reports all results. Four significant results are only impressive when a researcher went into the lab, conducted four studies, and obtained significant results in all of them. Similarly, 4 made free throws are only impressive when there were only 4 attempts. Making 4 out of 20 free throws is not impressive, and 4 out of 80 attempts is horrible. Thus, the absolute number of successes is not important; what matters is the relative frequency of successes across all attempts that were made.

Schimmack (2012) developed the incredibility index to examine whether a set of significant results is based on honest reporting or whether it was obtained by omitting non-significant results or by using questionable statistical practices to produce significant results. Evidence for dishonest reporting of results would undermine the credibility of the published results.
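
A stripped-down version of the underlying logic (my own illustration with made-up numbers, not the exact procedure in Schimmack, 2012) asks how probable an unbroken string of significant results is, given an estimate of the studies’ power.

k_studies       <- 4
mean_obs_power  <- .55                          # hypothetical mean observed power of the studies
p_all_successes <- mean_obs_power^k_studies     # probability that all k studies are significant
p_all_successes                                 # ~ .09
1 - p_all_successes                             # ~ .91: probability of at least one non-significant result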

MPK have the following to say about dishonest reporting of results.

“On a related note, Francis (2012a, 2012b, 2012c, 2012d; see also Schimmack, 2012) recently published a series of analyses that indicated the prevalence of publication bias (i.e., file-drawer problem) in multi-study papers in the psychological literature.” (p. 111).   They also note that Francis used a related method to reveal that many multiple-study articles show statistical evidence of dishonest reporting. “Francis argued that there may be many cases in which the findings reported in multi-study papers are too good to be true” (p. 111).

In short, Schimmack and Francis argued that multiple-study articles can be misleading because they provide the illusion of replicability (a researcher was able to demonstrate the effect again, and again, and again, therefore it must be a robust effect), but in reality it is not clear how robust the effect is because the results were not obtained in the way the studies are described in the article (first we did Study 1, then we did Study 2, etc., and voilà, all of the studies worked and showed the effect).

One objection to Schimmack and Francis would be to find a problem with their method of detecting bias. However, MPK do not comment on the method at all. They sidestep this issue when they write that “it is beyond the scope of this article to discuss whether publication bias actually exists in these articles or how prevalent it is in general” (p. 111).

After sidestepping the issue, MPK are faced with a dilemma or paradox. Do multiple-study articles strengthen the evidence because the combined type-I error probability decreases, or do they weaken the evidence because it becomes more likely that researchers did not report the results of their research program honestly? “Should multi-study findings be regarded as reliable or shaky evidence?” (p. 111).

MPK solve this paradox with a semantic trick. First, they point out that dishonest reporting has undesirable effects on effect size estimates.

“A publication bias, if it exists, leads to overestimation of effect sizes because some null findings are not reported (i.e., only studies with relatively large effect sizes that produce significant results are reported). The overestimation of effect sizes is problematic” (p. 111).

They do not explain why researchers should be allowed to omit studies with non-significant results from an article, given that this practice leads to the undesirable consequence of inflated effect sizes. Accurate estimates of effect sizes would be obtained if researchers published all of their results. In fact, Schimmack (2012) suggested that researchers report all results and then conduct a meta-analysis of their own set of studies to examine how strong the combined evidence is. Such a meta-analysis would provide an unbiased estimate of the true effect size and unbiased evidence about the probability that the results of all studies were obtained in the absence of an effect.
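
A minimal sketch of this idea, with made-up p-values, combines all reported studies (significant or not) with Stouffer’s method to gauge the overall strength of evidence; this is one simple way to do such a meta-analysis, not necessarily the exact procedure Schimmack (2012) had in mind.

p_two_tailed <- c(.04, .30, .15, .60)        # hypothetical p-values of four studies
direction    <- c(1, 1, 1, -1)               # hypothetical sign of each observed effect
z <- qnorm(1 - p_two_tailed / 2) * direction # convert to signed z-scores
z_combined <- sum(z) / sqrt(length(z))       # Stouffer's combined z
p_combined <- 2 * pnorm(-abs(z_combined))    # overall two-tailed p-value
c(z = z_combined, p = p_combined)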

The semantic trick occurs when the authors suggest that dishonest reporting practices are only a problem for effect size estimates, but not for the question whether an effect actually exists.

“However, the presence of publication bias does not necessarily mean that the effect is absent (i.e., that the findings are falsely positive).” (p. 111) and “Publication bias simply means that the effect size is overestimated—it does not necessarily imply that the effect is not real (i.e., falsely positive).” (p. 112).

This statement is true because it is practically impossible to demonstrate false positives, which would require demonstrating that the true effect size is exactly 0.   The presence of bias does not warrant the conclusion that the effect size is zero and that reported results are false positives.

However, this is not the point of revealing dishonest practices. The point is that dishonest reporting of results undermines the credibility of the evidence that was used to claim that an effect exists. The issue is the lack of credible evidence for an effect, not credible evidence for the lack of an effect. These two issues are distinct, and MPK use the correct observation that bias does not demonstrate the absence of an effect to suggest that we can ignore whether bias undermines the evidence for the presence of the effect.

Finally, MPK present a scenario of a multiple-study article with 8 studies that all produced significant results. They state that it is “unrealistic that as many as eight statistically significant results were produced by a non-existent effect” (p. 112).

This rosy view of multiple-study articles ignores the fact that the replication crisis in psychology was triggered by Bem’s (2011) infamous article, which contained 9 out of 9 statistically significant results (one marginal result was attributed to methodological problems; see Schimmack, 2012, for details) that supposedly demonstrated humans’ ability to foresee the future and to influence the past (e.g., learning after a test increased performance on a test that was taken before studying for it). Schimmack (2012) used this article to demonstrate how important it can be to evaluate the credibility of multiple-study articles, and the incredibility index correctly predicted that these results would not replicate. So it is simply naïve to assume that articles with more studies automatically strengthen the evidence for the existence of an effect and that 8 significant results cannot occur in the absence of a true effect (maybe MPK believe in ESP).

It is also not clear why researchers should have to wonder about the credibility of results in multiple-study articles. A simple solution to the paradox is to report all results honestly. If an honest set of studies provides evidence for an effect, it is not clear why researchers would prefer to engage in dishonest reporting practices. MPK provide no explanation for these practices and make no recommendation to increase honesty in the reporting of results as a simple solution to the replicability crisis in psychology.

They write, “the researcher may have conducted 10, or even 20, experiments until he/she obtained 8 successful experiments, but far more studies would have been needed had the effect not existed at all”. This is true, but we do not know how many studies a researcher conducted or what else a researcher did to the data unless all of this information is reported. If the combined evidence of 20 studies with 8 significant results shows that an effect is present, a researcher could just publish all 20 studies. What is the reason to hide over 50% of the evidence?

In the end, MPK assure readers that they “do not intend to defend underpowered studies” and they do suggest that “the most straightforward solution to this paradox is to conduct studies that have sufficient statistical power” (p. 112). I fully agree with these recommendations because powerful studies can provide real evidence for an effect and decrease the incentive to engage in dishonest practices.

It is discouraging that this article was published in a major review journal in social psychology. It is difficult to see how social psychology can regain trust if social psychologists believe they can simply continue to engage in dishonest reporting of results. Unfortunately, social psychologists continue to downplay the replication crisis and the shaky foundations of many textbook claims.