
How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

Over the past five years, psychological science has been in a crisis of confidence.  For decades, psychologists have assumed that published significant results provide strong evidence for theoretically derived predictions, especially when authors presented multiple studies with internal replications within a single article (Schimmack, 2012). However, even multiple significant results provide little empirical evidence, when journals only publish significant results (Sterling, 1959; Sterling et al., 1995).  When published results are selected for significance, statistical significance loses its ability to distinguish replicable effects from results that are difficult to replicate or results that are type-I errors (i.e., the theoretical prediction was false).

The crisis of confidence led to several initiatives to conduct independent replications. The most informative replication initiative was conducted by the Open Science Collaboration (Science, 2015). It replicated close to 100 significant results published in three high-ranked psychology journals. Only 36% of the replication studies replicated a statistically significant result. The replication success rate varied by journal. The journal “Psychological Science” achieved a success rate of 42%.

The low success rate raises concerns about the empirical foundations of psychology as a science.  Without further information, a success rate of 42% implies that it is unclear which published results provide credible evidence for a theory and which findings may not replicate.  It is impossible to conduct actual replication studies for all published studies.  Thus, it is highly desirable to identify replicable findings in the existing literature.

One solution is to estimate replicability for sets of studies based on the published test statistics (e.g., F-statistic, t-values, etc.).  Schimmack and Brunner (2016) developed a statistical method, Powergraphs, that estimates the average replicability of a set of significant results.  This method has been used to estimate replicability of psychology journals using automatic extraction of test statistics (2016 Replicability Rankings, Schimmack, 2017).  The results for Psychological Science produced estimates in the range from 55% to 63% for the years 2010-2016 with an average of 59%.   This is notably higher than the success rate for the actual replication studies, which only produced 42% successful replications.

There are two explanations for this discrepancy.  First, actual replication studies are not exact replication studies and differences between the original and the replication studies may explain some replication failures.  Second, the automatic extraction method may overestimate replicability because it may include non-focal statistical tests. For example, significance tests of manipulation checks can be highly replicable, but do not speak to the replicability of theoretically important predictions.

To address the concern about automatic extraction of test statistics, I estimated replicability of focal hypothesis tests in Psychological Science with hand-coded, focal hypothesis tests.  I used three independent data sets.

Study 1

For Study 1, I hand-coded focal hypothesis tests of all studies in the 2008 Psychological Science articles that were used for the OSC reproducibility project (Science, 2015).

[Figure: OSC.PS powergraphs]

The powergraphs show the well-known effect of publication bias in that most published focal hypothesis tests report a significant result (p < .05, two-tailed, z > 1.96) or at least a marginally significant result (p < .10, two-tailed or p < .05, one-tailed, z > 1.65). Powergraphs estimate the average power of studies with significant results on the basis of the density distribution of significant z-scores. Average power is an estimate of replicability for a set of exact replication studies. The left graph uses all significant results. The right graph uses only z-scores greater than 2.4 because questionable research practices may produce many just-significant results and lead to biased estimates of replicability. However, both estimation methods produce similar estimates of replicability (57% & 61%). Given the small number of statistics, the 95%CI is relatively wide (left graph: 44% to 73%). These results are compatible with the low success rate for actual replication studies (42%) and the estimate based on automated extraction (59%).
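The z-score cutoffs quoted above follow from the standard normal distribution. The following is a minimal sketch using only the Python standard library; the function name `z_threshold` is made up for this illustration and is not part of the powergraph method.

```python
from statistics import NormalDist


def z_threshold(alpha: float, two_tailed: bool = True) -> float:
    """Return the z-score a result must exceed to be significant at alpha."""
    # A two-tailed test splits alpha across both tails of the distribution.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


print(f"p < .05, two-tailed: z > {z_threshold(0.05):.3f}")
print(f"p < .10, two-tailed: z > {z_threshold(0.10):.3f}")
print(f"p < .05, one-tailed: z > {z_threshold(0.05, two_tailed=False):.3f}")
```

The two marginal criteria (p < .10 two-tailed and p < .05 one-tailed) yield the same cutoff of about 1.645, which is conventionally reported as 1.65.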

Study 2

The second dataset was provided by Motyl et al. (JPSP, in press), who coded a large number of articles from social psychology journals and Psychological Science. Importantly, they coded a representative sample of Psychological Science studies from the years 2003, 2004, 2013, and 2014. That is, they did not only code social psychology articles published in Psychological Science. The dataset included 281 test statistics from Psychological Science.

[Figure: PS.Motyl powergraphs]

The powergraph looks similar to the powergraph in Study 1.  More important, the replicability estimates are also similar (57% & 52%).  The 95%CI for Study 1 (44% to 73%) and Study 2 (left graph: 49% to 65%) overlap considerably.  Thus, two independent coding schemes and different sets of studies (2008 vs. 2003-2004/2013/2014) produce very similar results.

Study 3

Study 3 was carried out in collaboration with Sivaani Sivaselvachandran, who hand-coded articles from Psychological Science published in 2016. The replicability rankings showed a slight positive trend based on automatically extracted test statistics. The goal of this study was to examine whether hand-coding would also show an increase in replicability. An increase was expected based on an editorial by D. Stephen Lindsay, incoming editor in 2015, who aimed to increase replicability of results published in Psychological Science by introducing badges for open data and preregistered hypotheses. However, the results failed to show a notable increase in average replicability.

[Figure: PS.2016 powergraphs]

The replicability estimate was similar to those in the first two studies (59% & 59%).  The 95%CI ranged from 49% to 70%. These wide confidence intervals make it difficult to notice small improvements, but the histogram shows that just significant results (z = 2 to 2.2) are still the most prevalent results reported in Psychological Science and that non-significant results that are to be expected are not reported.

Combined Analysis 

Given the similar results in all three studies, it made sense to pool the data to obtain the most precise estimate of replicability of results published in Psychological Science. With 479 significant test statistics, replicability was estimated at 58% with a 95%CI ranging from 51% to 64%. This result is in line with the estimate based on automated extraction of test statistics (59%). The reason for the close match between hand-coded and automated results could be that Psychological Science publishes short articles and authors may report mostly focal results because space does not allow for extensive reporting of other statistics. The hand-coded data confirm that replicability in Psychological Science is likely to be above 50%.

[Figure: PS.combined powergraph]

It is important to realize that the 58% estimate is an average. Powergraphs also show average replicability for segments of z-scores. Here we see that replicability for just-significant results (z < 2.5 ~ p > .01) is only 35%. Even for z-scores between 2.5 and 3.0 (~ p > .001), replicability is only 47%. Once z-scores are greater than 3, average replicability is above 50%, and with z-scores greater than 4, replicability is greater than 80%. For any single study, p-values can vary greatly due to sampling error, but in general a published result with a p-value < .001 is much more likely to replicate than a result with a p-value > .01 (see also OSC, Science, 2015).
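To relate the z-score segments to more familiar p-values, a z-score can be converted to a two-tailed p-value. This is a small sketch with the Python standard library; the helper name `two_tailed_p` is an assumption made for the example.

```python
from statistics import NormalDist


def two_tailed_p(z: float) -> float:
    """Two-tailed p-value for an absolute z-score under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Segment boundaries mentioned in the text.
for z in (1.96, 2.5, 3.0, 4.0):
    print(f"z = {z:.2f}  ->  two-tailed p = {two_tailed_p(z):.5f}")
```

The conversion shows that the segment boundaries are only rough matches to round p-values: z = 2.5 corresponds to p of about .012 and z = 3.0 to p of about .003.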

Conclusion

This blog post used hand-coding of test statistics published in Psychological Science, the flagship journal of the Association for Psychological Science, to estimate replicability of published results. Three datasets produced convergent evidence that the average replicability of exact replication studies is 58% +/- 7%. This result is consistent with estimates based on automatic extraction of test statistics. It is considerably higher than the success rate of actual replication studies in the OSC reproducibility project (42%). One possible reason for this discrepancy is that actual replication studies are never exact replication studies, which makes it more difficult to obtain statistical significance if the original studies are selected for significance. For example, the original study may have had an outlier in the experimental group that helped to produce a significant result. Not removing this outlier is not considered a questionable research practice, but an exact replication study will not reproduce the same outlier and may fail to reproduce a just-significant result. More broadly, any deviation from the assumptions underlying the computation of test statistics will increase the bias that is introduced by selecting significant results. Thus, the 58% estimate is an optimistic estimate of the maximum replicability under ideal conditions.

At the same time, it is important to point out that 58% replicability for Psychological Science does not mean psychological science is rotten to the core (Motyl et al., in press) or that most reported results are false (Ioannidis, 2005).  Even results that did not replicate in actual replication studies are not necessarily false positive results.  It is possible that more powerful studies would produce a significant result, but with a smaller effect size estimate.

Hopefully, these analyses will spur further efforts to increase replicability of published results in Psychological Science and in other journals.  We are already near the middle of 2017 and can look forward to the 2017 results.

 

 

 


Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes-Factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size.  We present an alternative Bayesian multi-level approach that avoids the pitfalls of Bayesian Evidence Synthesis, and show that the original set of studies produced at best weak evidence for an effect of social norms on reusing of towels.

Replicability Ranking of Psychology Departments

Evaluations of individual researchers, departments, and universities are common and arguably necessary as science is becoming bigger. Existing rankings are based to a large extent on peer-evaluations. A university is ranked highly if peers at other universities perceive it to produce a steady stream of high-quality research. At present the most widely used objective measures rely on the quantity of research output and on the number of citations. These quantitative indicators of research quality are also heavily influenced by peers because peer-review controls what gets published, especially in journals with high rejection rates, and peers decide what research they cite in their own work. The social mechanisms that regulate peer-approval are unavoidable in a collective enterprise like science that does not have a simple objective measure of quality (e.g., customer satisfaction ratings, or accident rates of cars). Unfortunately, it is well known that social judgments are subject to many biases due to conformity pressure, self-serving biases, confirmation bias, motivated biases, etc. Therefore, it is desirable to complement peer-evaluations with objective indicators of research quality.

Some aspects of research quality are easier to measure than others. Replicability rankings focus on one aspect of research quality that can be measured objectively, namely the replicability of a published significant result. In many scientific disciplines such as psychology, a successful study reports a statistically significant result. A statistically significant result is used to minimize the risk of publishing evidence for an effect that does not exist (or even goes in the opposite direction). For example, a psychological study that shows effectiveness of a treatment for depression would have to show that the effect in the study reveals a real effect that can be observed in other studies and in real patients if the treatment is used for the treatment of depression.

In a science that produces thousands of results a year, it is inevitable that some of the published results are fluke findings (even Toyotas break down sometimes). To minimize the risk of false results entering the literature, psychology, like many other sciences, adopted a 5% error rate. By using 5% as the criterion, psychologists ensured that no more than 5% of results are fluke findings. With thousands of results published each year, this still means that more than 50 false results enter the literature each year. However, this is acceptable because a single study does not have immediate consequences. Only if these results are replicated in other studies do findings become the foundation of theories and influence practical decisions in therapy or in other applications of psychological findings (at work, in schools, or in policy). Thus, to outside observers it may appear safe to trust published results in psychology and to report about these findings in newspaper articles, popular books, or textbooks.

Unfortunately, it would be a mistake to interpret a significant result in a psychology journal as evidence that the result is probably true. The reason is that the published success rate in journals has nothing to do with the actual success rate in psychological laboratories. All insiders know that it is common practice to report only results that support a researcher’s theory. While outsiders may think of scientists as neutral observers (judges), insiders play the game of lobbyists, advertisers, and self-promoters. The game is to advance one’s theory, publish more than others, get more citations than others, and win more grant money than others. Honest reporting of failed studies does not advance this agenda. As a result, the fact that psychological studies report nearly exclusively success stories (Sterling, 1959; Sterling et al., 1995) tells outside observers nothing about the replicability of a published finding, and the true rate of fluke findings could be 100%.

This problem has been known for over 50 years (Cohen, 1962; Sterling, 1959). So it would be wrong to call the selective reporting of successful studies an acute crisis. However, what changed is that some psychologists have started to criticize the widely accepted practice of selective reporting of successful studies (Asendorpf et al., 2012; Francis, 2012; Simonsohn et al., 2011; Schimmack, 2012; Wagenmakers et al., 2011). Over the past five years, psychologists, particularly social psychologists, have been engaged in heated arguments over the so-called “replication crisis.”

One group argues that selective publishing of successful studies occurred, but without real consequences on the trustworthiness of published results. The other group argues that published results cannot be trusted unless they have been successfully replicated. The problem is that neither group has objective information about the replicability of published results.  That is, there is no reliable estimate of the percentage of studies that would produce a significant result again, if a representative sample of significant results published in psychology journals were replicated.

Evidently, it is not possible to conduct exact replication studies of all studies that have been published in the past 50 years. Fortunately, it is not necessary to conduct exact replication studies to obtain an objective estimate of replicability. The reason is that replicability of exact replication studies is a function of the statistical power of studies (Sterling et al., 1995). Without selective reporting of results, a 95% success rate is an estimate of the statistical power of the studies that achieved this success rate. Vice versa, a set of studies with average power of 50% is expected to produce a success rate of 50% (Sterling, et al., 1995).
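The link between power and expected success rate can be made concrete with a two-tailed z-test. The sketch below is only an illustration under stated assumptions: the expected z-scores are made-up values, the function name `power` is hypothetical, and the code is not the estimation method referred to in this post.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # criterion for p < .05, two-tailed


def power(expected_z: float) -> float:
    """Probability of a significant z-test (either tail) given the expected z-score."""
    upper_tail = 1 - STD_NORMAL.cdf(Z_CRIT - expected_z)
    lower_tail = STD_NORMAL.cdf(-Z_CRIT - expected_z)
    return upper_tail + lower_tail


# Hypothetical expected z-scores for five studies (illustrative values only).
expected_z_scores = [0.5, 1.0, 1.96, 2.8, 3.5]
study_power = [power(z) for z in expected_z_scores]

# Without selective reporting, the expected success rate of exact
# replications equals the mean power of the studies.
expected_success_rate = mean(study_power)
print(f"expected success rate: {expected_success_rate:.1%}")
```

A study whose expected z-score sits exactly at the significance criterion has about 50% power, and a study of a true null effect has 5% "power" (the type-I error rate).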

Although selection bias renders success rates uninformative, the actual statistical results provide valuable information that can be used to estimate the unbiased statistical power of published results. Although selection bias inflates effect sizes and power, Brunner and Schimmack (forthcoming) developed and validated a method that can correct for selection bias. This method makes it possible to estimate the replicability of published significant results on the basis of the original reported results. This statistical method was used to estimate the replicability of research published by psychology departments in the years from 2010 to 2015 (see Methodology for details).

The averages for the 2010-2012 period (M = 59) and the 2013-2015 period (M = 61) show only a small difference, indicating that psychologists have not changed their research practices in accordance with recommendations to improve replicability in 2011  (Simonsohn et al., 2011). For most of the departments the confidence intervals for the two periods overlap (see attached powergraphs). Thus, the more reliable average across all years is used for the rankings, but the information for the two time periods is presented as well.

There are no obvious predictors of variability across departments. Private universities are at the top (#1, #2, #8), the middle (#24, #26), and at the bottom (#44, #47). European universities can also be found at the top (#4, #5), middle (#25) and bottom (#46, #51). So are Canadian universities (#9, #15, #16, #18, #19, #50).

There is no consensus on an optimal level of replicability. Cohen recommended that researchers should plan studies with 80% power to detect real effects. If 50% of studies tested real effects with 80% power and the other 50% tested a true null-hypothesis (no effect = 2.5% probability to replicate a false result again), the estimated power for significant results would be 78%. The effect on average power is so small because most of the false predictions produce a non-significant result. As a result, only a few studies with low replication probability dilute the average power estimate. Thus, a value greater than 70 can be considered broadly in accordance with Cohen’s recommendations.
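The 78% figure can be reproduced with a few lines of arithmetic. The sketch below uses the numbers from the paragraph above; the variable names are chosen for this illustration only.

```python
share_true = 0.5    # proportion of studies that test a real effect
power_true = 0.80   # power of those studies
power_null = 0.025  # chance of a significant result in the predicted direction under the null

# Proportion of all studies that end up significant, by type of study.
significant_true = share_true * power_true
significant_null = (1 - share_true) * power_null

# Average power of the significant studies: each study's replication
# probability weighted by its share among the significant results.
average_power_significant = (
    significant_true * power_true + significant_null * power_null
) / (significant_true + significant_null)

print(f"share of significant results that are false positives: "
      f"{significant_null / (significant_true + significant_null):.1%}")
print(f"average power of significant results: {average_power_significant:.1%}")
```

Only about 3% of the significant results come from null effects, which is why the average drops only slightly below 80%.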

It is important to point out that the estimates are very optimistic estimates of the success rate in actual replications of theoretically important effects. For a representative set of 100 studies (OSC, Science, 2015), Brunner and Schimmack’s statistical approach predicted a success rate of 54%, but the success rate in actual replication studies was only 37%. One reason for this discrepancy could be that the statistical approach assumes that the replication studies are exact, but actual replications always differ in some ways from the original studies, and this uncontrollable variability in experimental conditions poses another challenge for replicability of psychological results. Until further validation research has been completed, the estimates can only be used as a rough estimate of replicability. However, the absolute accuracy of estimates is not relevant for the relative comparison of psychology departments.

And now, without further ado, the first objective rankings of 51 psychology departments based on the replicability of published significant results. More departments will be added to these rankings as the results become available.

Rank University 2010-2015 2010-2012 2013-2015
1 U Penn 72 69 75
2 Cornell U 70 67 72
3 Purdue U 69 69 69
4 Tilburg U 69 71 66
5 Humboldt U Berlin 67 68 66
6 Carnegie Mellon 67 67 67
7 Princeton U 66 65 67
8 York U 66 63 68
9 Brown U 66 71 60
10 U Geneva 66 71 60
11 Northwestern U 65 66 63
12 U Cambridge 65 66 63
13 U Washington 65 70 59
14 Carleton U 65 68 61
15 Queen’s U 63 57 69
16 U Texas – Austin 63 63 63
17 U Toronto 63 65 61
18 McGill U 63 72 54
19 U Virginia 63 61 64
20 U Queensland 63 66 59
21 Vanderbilt U 63 61 64
22 Michigan State U 62 57 67
23 Harvard U 62 64 60
24 U Amsterdam 62 63 60
25 Stanford U 62 65 58
26 UC Davis 62 57 66
27 UCLA 61 61 61
28 U Michigan 61 63 59
29 Ghent U 61 58 63
30 U Waterloo 61 65 56
31 U Kentucky 59 58 60
32 Penn State U 59 63 55
33 Radboud U 59 60 57
34 U Western Ontario 58 66 50
35 U North Carolina Chapel Hill 58 58 58
36 Boston University 58 66 50
37 U Mass Amherst 58 52 64
38 U British Columbia 57 57 57
39 The University of Hong Kong 57 57 57
40 Arizona State U 57 57 57
41 U Missouri 57 55 59
42 Florida State U 56 63 49
43 New York U 55 55 54
44 Dartmouth College 55 68 41
45 U Heidelberg 54 48 60
46 Yale U 54 54 54
47 Ohio State U 53 58 47
48 Wake Forest U 51 53 49
49 Dalhousie U 50 45 55
50 U Oslo 49 54 44
51 U Kansas 45 45 44

 

“Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerd Gigerenzer

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen’s method to a variety of different journals. The table below shows that power estimates vary by journal assuming that the effect size was medium according to Cohen’s criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen’s method. I also included the results of Sedlmeier and Gigerenzer’s power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Abnormal and Social Psychology was split into Journal of Abnormal Psychology and Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

JOURNAL TITLE YEAR Power%
Journal of Marketing Research 1981 89
American Sociological Review 1974 84
Journalism Quarterly, The Journal of Broadcasting 1976 76
American Journal of Educational Psychology 1972 72
Journal of Research in Teaching 1972 71
Journal of Applied Psychology 1976 67
Journal of Communication 1973 56
The Research Quarterly 1972 52
Journal of Abnormal Psychology 1984 50
Journal of Abnormal and Social Psychology 1962 48
American Speech and Hearing Research & Journal of Communication Disorders 1975 44
Counselor Education and Supervision 1973 37

 

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of  37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%).  The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.
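The summary statistics can be recomputed directly from the table. The sketch below copies the twelve power estimates from the table; the variable name is chosen for this illustration only.

```python
from statistics import mean, median

# Power estimates (in %) copied from the table, in table order.
power_by_journal = {
    "Journal of Marketing Research": 89,
    "American Sociological Review": 84,
    "Journalism Quarterly, The Journal of Broadcasting": 76,
    "American Journal of Educational Psychology": 72,
    "Journal of Research in Teaching": 71,
    "Journal of Applied Psychology": 67,
    "Journal of Communication": 56,
    "The Research Quarterly": 52,
    "Journal of Abnormal Psychology": 50,
    "Journal of Abnormal and Social Psychology": 48,
    "American Speech and Hearing Research & Journal of Communication Disorders": 44,
    "Counselor Education and Supervision": 37,
}

estimates = list(power_by_journal.values())
print(f"mean power:   {mean(estimates):.1f}%")
print(f"median power: {median(estimates):.1f}%")
```

With an even number of journals, the median is the midpoint of the two middle values (56 and 67), that is 61.5.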

The average for the three core psychology journals (JASP & JAbnPsy, JAP, AJEduPsy) is 63% (median = 67%), which is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, the number of pairwise comparisons grows rapidly with the number of conditions in a design. With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With 5 conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions may not be significant if this test is part of a series of tests.
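The comparison counts above follow from the formula k(k - 1)/2 for k conditions. The sketch below also shows how a Bonferroni correction, used here only as one common example of a correction (the article does not say which corrections the coded studies used), lowers the criterion for each individual test.

```python
from math import comb


def pairwise_comparisons(n_conditions: int) -> int:
    """Number of distinct pairs of conditions: k * (k - 1) / 2."""
    return comb(n_conditions, 2)


ALPHA = 0.05  # overall significance criterion

for k in (2, 3, 5):
    m = pairwise_comparisons(k)
    # Bonferroni: divide the overall criterion by the number of tests.
    per_test_alpha = ALPHA / m
    print(f"{k} conditions -> {m} comparisons, per-test alpha = {per_test_alpha:.4f}")
```

A stricter per-test criterion means that each comparison needs a larger test statistic to be significant, which is why power decreases.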

Sedlmeier and Gigerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen’s results. However, many articles presented results using a more stringent criterion of significance. If the criterion used by the authors had been used for the power analysis, power would have decreased further. About 50% of all articles used an adjusted criterion value, and if the adjusted criterion value was used, power was only 37%.

Sedlmeier and Gigerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Gigerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to conduct a more powerful study with more female and male participants to examine gender differences. However, this is not what researchers do. Rather, multiple-study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely unlikely in multiple studies where studies have only 50% power to get a significant result in a single attempt. For each additional attempt, the probability of obtaining only significant results decreases exponentially (1 study: 50%; 2 studies: 25%; 3 studies: 12.5%; 4 studies: 6.25%).
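The probability that every study in a set is significant is the product of the individual power values. The sketch below assumes independent studies with equal power; the function name `total_power` is chosen for this illustration, after the concept of total power mentioned above.

```python
def total_power(power: float, n_studies: int) -> float:
    """Probability that all n independent studies with equal power are significant."""
    return power ** n_studies


for n in range(1, 6):
    print(f"{n} studies with 50% power: "
          f"{total_power(0.5, n):.2%} chance of only significant results")
```

With 50% power per study, four significant results in a row are expected in only 6.25% of attempts, and five in a row in about 3%.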

The fact that researchers only publish studies that worked is well-known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager that only publishes the return rate of funds that gained money and excludes funds with losses. Would you trust this manager to take care of your retirement? It is also akin to a gambler that only remembers winnings. Would you marry a gambler who believes that gambling is ok because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present 5 studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in this field, of this researcher, or in the journal that published the results.

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power-ranking of scientific journals and the first temporal analyses of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased power to ensure that their studies could actually provide significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are welcome.

Distinguishing Questionable Research Practices from Publication Bias

It is well known that scientific journals favor statistically significant results (Sterling, 1959). This phenomenon is known as publication bias. Publication bias can be easily detected by comparing the observed statistical power of studies with the success rate in journals. Success rates of 90% or more would only be expected if most theoretical predictions are true and empirical studies have over 90% statistical power to produce significant results. Estimates of statistical power range from 20% to 50% (Button et al., 2015; Cohen, 1962). It follows that for every published significant result an unknown number of non-significant results has occurred that remained unpublished. These results linger in researchers’ proverbial file-drawer or, more literally, in unpublished data sets on researchers’ computers.

The selection of significant results also creates an incentive for researchers to produce significant results. In rare cases, researchers simply fabricate data to produce significant results. However, scientific fraud is rare. A more serious threat to the integrity of science is the use of questionable research practices. Questionable research practices are all research activities that create a systematic bias in empirical results. Although systematic bias can produce too many or too few significant results, the incentive to publish significant results suggests that questionable research practices are typically used to produce significant results.

In sum, publication bias and questionable research practices contribute to an inflated success rate in scientific journals. So far, it has been difficult to examine the prevalence of questionable research practices in science. One reason is that publication bias and questionable research practices are conceptually overlapping. For example, a research article may report the results of a 2 x 2 x 2 ANOVA or a regression analysis with 5 predictor variables. The article may only report the significant results and omit detailed reporting of the non-significant results. For example, researchers may state that none of the gender effects were significant and not report the results for main effects or interaction with gender. I classify these cases as publication bias because each result tests a different hypothesis, even if the statistical tests are not independent.

Questionable research practices are practices that change the probability of obtaining a specific significant result. An example would be a study with multiple outcome measures that would support the same theoretical hypothesis. For example, a clinical trial of an anti-depressant might include several depression measures. In this case, a researcher can increase the chances of a significant result by conducting tests for each measure. Other questionable research practices would be optional stopping once a significant result is obtained and selective deletion of cases based on the results after deletion. A common consequence of these questionable practices is that they will produce results that meet the significance criterion, but deviate from the distribution that is expected simply on the basis of random sampling error.
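
The effect of testing several outcome measures can be sketched with a short calculation. The sketch assumes that the k measures are statistically independent and that the null-hypothesis is true for all of them. Real depression measures are correlated, so the actual inflation would be smaller. The variable names are illustrative.

```r
# Probability of at least one significant result (p < .05) when the
# null-hypothesis is true and k independent outcome measures are tested.
# Independence is a simplifying assumption, not a claim about real trials.
alpha <- .05
k <- 1:5
p_any_significant <- 1 - (1 - alpha)^k
round(data.frame(k = k, p_any_significant = p_any_significant), 3)
```

Under these assumptions, five measures raise the chance of at least one false positive from 5% to about 23%.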

A number of articles have tried to examine the prevalence of questionable research practices by comparing the frequency of p-values above and below the typical criterion of statistical significance, namely a p-value less than .05. The logic is that random error would produce a nearly equal number of p-values just above .05 (e.g., p = .06) and below .05 (e.g., p = .04). According to this logic, questionable research practices are present, if there are more p-values just below the criterion than p-values just above the criterion (Masicampo & Lalande, 2012).

Daniel Lakens has pointed out some problems with this approach. The most crucial problem is that publication bias alone is sufficient to predict a lower frequency of p-values just above the significance criterion. After all, these p-values imply a non-significant result and non-significant results are subject to publication bias. The only reason why p-values of .06 are reported with higher frequency than p-values of .11 is that p-values between .05 and .10 are sometimes reported as marginally significant evidence for a hypothesis. Another problem is that many p-values of .04 are not reported as p = .04, but are reported as p < .05. Thus, the distribution of p-values close to the criterion value provides unreliable information about the prevalence of questionable research practices.

In this blog post, I introduce an alternative approach to the detection of questionable research practices that produce just significant results. Questionable research practices and publication bias have different effects on the distribution of p-values (or corresponding measures of strength of evidence). Whereas publication bias will produce a distribution that is consistent with the average power of studies, questionable research practices will produce an abnormal distribution with a peak just below the significance criterion. In other words, questionable research practices produce a distribution with too few non-significant results and too few highly significant results.

I illustrate this test of questionable research practices with post-hoc-power analysis of three journals. One journal shows neither signs of publication bias, nor significant signs of questionable research practices. The second journal shows clear evidence of publication bias, but no evidence of questionable research practices. The third journal illustrates the influence of publication bias and questionable research practices.

Example 1: A Relatively Unbiased Z-Curve

The first example is based on results published during the years 2010-2014 in the Journal of Experimental Psychology: Learning, Memory, and Cognition. A text-mining program searched all articles for publications of F-tests, t-tests, correlation coefficients, regression coefficients, odds-ratios, confidence intervals, and z-tests. Due to the inconsistent and imprecise reporting of p-values (p = .02 or p < .05), p-values were not used. All statistical tests were converted into absolute z-scores.
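
A minimal sketch of such a conversion is shown below. The post does not describe the exact conversion rules of the text-mining program, so the sketch assumes a common approach: compute the p-value of each test and find the absolute z-score that has the same two-tailed p-value. The function names are hypothetical.

```r
# Convert test statistics into absolute z-scores via their p-values.
# Assumption: this mirrors a common approach; it is not the author's program.
p_to_z <- function(p) qnorm(p / 2, lower.tail = FALSE)

t_to_z <- function(t, df) p_to_z(2 * pt(abs(t), df, lower.tail = FALSE))

# An F-test with 1 numerator df is equivalent to a squared t-test.
f_to_z <- function(f, df1, df2) p_to_z(pf(f, df1, df2, lower.tail = FALSE))

t_to_z(2.5, 40)      # t(40) = 2.5
f_to_z(6.25, 1, 40)  # F(1, 40) = 6.25, the same test expressed as F
```

Working with upper-tail probabilities (lower.tail = FALSE) avoids the loss of precision that 1 - p would cause for very small p-values.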

The program found 14,800 tests. 8,423 tests were in the critical interval between z = 2 and z = 6 that is used for estimation of 4 non-centrality parameters and 4 weights that are used to model the distribution of z-values between 2 and 6 and to estimate the distribution in the range from 0 to 2. Z-values greater than 6 are not used because they correspond to Power close to 1. 11% of all tests fall into this region of z-scores that are not shown.

PHP-Curve JEP-LMC

The histogram and the blue density distribution show the observed data. The green curve shows the predicted distribution based on the post-hoc power analysis. Post-hoc power analysis suggests that the average power of significant results is 67%. Power for all statistical tests is estimated to be 58% (including 11% of z-scores greater than 6, power is .58*.89 + .11 = 63%). More important is the predicted distribution of z-scores. The predicted distribution on the left side of the criterion value matches the observed distribution rather well. This shows that there are not a lot of missing non-significant results. In other words, there does not appear to be a file-drawer of studies with non-significant results. There is also only a very small blip in the observed data just at the level of statistical significance. The close match between the observed and predicted distributions suggests that results in this journal are relatively free of systematic bias due to publication bias or questionable research practices.
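
The combined estimate in parentheses is a weighted average. As a quick check, assuming that tests with z-scores greater than 6 have power of essentially 1:

```r
# Weighted average of power across two sets of tests.
power_z_below_6 <- .58  # estimated power for tests with z < 6
prop_z_above_6  <- .11  # proportion of tests with z > 6, power assumed to be 1
power_all <- power_z_below_6 * (1 - prop_z_above_6) + 1 * prop_z_above_6
round(power_all, 2)
```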

Example 2: A Z-Curve with Publication Bias

The second example is based on results published in the Attitudes & Social Cognition Section of the Journal of Personality and Social Psychology. The text-mining program retrieved 5,919 tests from articles published between 2010 and 2014. 3,584 tests provided z-scores in the range from 2 to 6 that is being used for model fitting.

PHP-Curve JPSP-ASC

The average power of significant results in JPSP-ASC is 55%. This is significantly less than the average power in JEP-LMC, which was used for the first example. The estimated power for all statistical tests, including those in the estimated file drawer, is 35%. More important is the estimated distribution of z-values. On the right side of the significance criterion the estimated curve shows relatively close fit to the observed distribution. This finding shows that random sampling error alone is sufficient to explain the observed distribution. However, on the left side of the distribution, the observed z-scores drop off steeply. This drop is consistent with the effect of publication bias that researchers do not report all non-significant results. There is only a slight hint that questionable research practices are also present because observed z-scores just above the criterion value are a bit more frequent than the model predicts. However, this discrepancy is not conclusive because the model could increase the file drawer, which would produce a steeper slope. The most important characteristic of this z-curve is the steep cliff on the left side of the criterion value and the gentle slope on the right side of the criterion value.
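
The cliff on the left and the gentle slope on the right can be reproduced with a small simulation. The sketch assumes a single non-centrality parameter for all studies and an arbitrary publication rate of 20% for non-significant results. Both numbers are illustrative and are not estimates for JPSP-ASC.

```r
# Simulate publication bias: all significant results are published,
# but only a fraction of non-significant results.
set.seed(1)
crit <- qnorm(.975)                       # significance criterion, z = 1.96
ncp  <- crit + qnorm(.35)                 # non-centrality for roughly 35% power (illustrative)
z    <- abs(rnorm(100000, mean = ncp))    # absolute z-scores of all studies
significant <- z > crit
published   <- significant | (runif(length(z)) < .20)  # assumed 20% publication rate

# Compare counts in narrow bins on both sides of the criterion.
just_below <- sum(published & z > crit - .2 & z <= crit)
just_above <- sum(published & z > crit & z <= crit + .2)
c(just_below = just_below, just_above = just_above)
```

Before selection the two bins hold similar numbers of results, so the gap between them in the published set is produced by selection alone, while the results above the criterion still follow the shape implied by random sampling error.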

Example 3: A Z-Curve with Questionable Research Practices

Example 3 uses results published in the journal Aggressive Behavior during the years 2010 to 2014. The text mining program found 1,429 results and 863 z-scores in the range from 2 to 6 that were used for the post-hoc-power analysis.

PHP-Curve for AggressiveBeh 2010-14

The average power for significant results in the range from 2 to 6 is 73%, which is similar to the power estimate in the first example. The power estimate that includes non-significant results is 68%. The power estimate is similar because there is no evidence of a file drawer with many underpowered studies. In fact, there are more observed non-significant results than predicted non-significant results, especially for z-scores close to zero. This outcome shows some problems of estimating the frequency of non-significant results based on the distribution of significant results. More important, the graph shows a cluster of z-scores just above and below the significance criterion. The steep cliff to the left of the criterion might suggest publication bias, but the whole distribution does not show evidence of publication bias. Moreover, the steep cliff on the right side of the cluster cannot be explained with publication bias. Only questionable research practices can produce this cliff because publication bias relies on random sampling error which leads to a gentle slope of z-scores as shown in the second example.

Prevalence of Questionable Research Practices

The examples suggest that the distribution of z-scores can be used to distinguish publication bias and questionable research practices. Based on this approach, the prevalence of questionable research practices would be rare. The journal Aggressive Behavior is exceptional. Most journals show a pattern similar to Example 2, with varying sizes of the file drawer. However, this does not mean that questionable research practices are rare because it is most likely that the pattern observed in Example 2 is a combination of questionable research practices and publication bias. As shown in Example 2, the typical power of statistical tests that produce a significant result is about 60%. However, researchers do not know which experiments will produce significant results. Slight modifications in experimental procedures, so-called hidden moderators, can easily change an experiment with 60% power into an experiment with 30% power. Thus, the probability of obtaining a significant result in a replication study is less than the nominal power of 60% that is implied by post-hoc-power analysis. With only 30% to 60% power, researchers will frequently encounter results that fail to produce an expected significant result. In this case, researchers have two choices to avoid reporting a non-significant result. They can put the study in the file-drawer or they can try to salvage the study with the help of questionable research practices. It is likely that researchers will do both and that the course of action depends on the results. If the data show a trend in the right direction, questionable research practices seem an attractive alternative. If the data show a trend in the opposite direction, it is more likely that the study will be terminated and the results remain unreported.

Simmons et al. (2011) conducted some simulation studies and found that even extreme use of multiple questionable research practices (p-hacking) will produce a significant result in at most 60% of cases, when the null-hypothesis is true. If such extreme use of questionable research practices were widespread, z-curve would produce corrected power estimates well below 50%. There is no evidence that extreme use of questionable research practices is prevalent. In contrast, there is strong evidence that researchers conduct many more studies than they actually report and that many of these studies have a low probability of success.

Implications of File-Drawers for Science

First, it is clear that researchers could be more effective if they used existing resources more effectively. An fMRI study with 20 participants costs about $10,000. Conducting a study that costs $10,000 that has only a 50% probability of producing a significant result is wasteful and should not be funded by taxpayers. Just publishing the non-significant result does not fix this problem because a non-significant result in a study with 50% power is inconclusive. Even if the predicted effect exists, one would expect a non-significant result in every second study. Instead of wasting $10,000 on studies with 50% power, researchers should invest $20,000 in studies with higher power (unfortunately, power does not increase in proportion to resources). With the same research budget, more money would contribute to results that are being published. Thus, without spending more money, science could progress faster.

Second, higher-powered studies make non-significant results more relevant. If a study had 80% power, there is only a 20% chance to get a non-significant result if an effect is present. If a study had 95% power, the chance of a non-significant result would be just as low as the chance of a false positive result. In this case, it is noteworthy that a theoretical prediction was not confirmed. In a set of high-powered studies, a post-hoc power analysis would show a bimodal distribution with a cluster of z-scores around 0 for true null-hypotheses and a cluster of z-scores of 3 or higher for clear effects. Type-I and Type-II errors would be rare.

Third, Example 3 shows that the use of questionable research practices becomes detectable in the absence of a file drawer and that it would be harder to publish results that were obtained with questionable research practices.

Finally, the ability to estimate the size of file-drawers may encourage researchers to plan studies more carefully and to invest more resources into studies to keep their file drawers small because a large file-drawer may harm reputation or decrease funding.

In conclusion, post-hoc power analysis of large sets of data can be used to estimate the size of the file drawer based on the distribution of z-scores on the right side of a significance criterion. As file-drawers harm science, this tool can be used as an incentive to conduct studies that produce credible results and thus reduce the need for dishonest research practices. In this regard, the use of post-hoc power analysis complements other efforts towards open science such as preregistration and data sharing.

A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

Cumming (2014) wrote an article “The New Statistics: Why and How” that was published in the prestigious journal Psychological Science.   On his website, Cumming uses this article to promote his book “Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.”

The article clearly states the conflict of interest. “The author declared that he earns royalties on his book (Cumming, 2012) that is referred to in this article.” Readers are therefore warned that the article may at least inadvertently give an overly positive account of the new statistics and an overly negative account of the old statistics. After all, why would anybody buy a book about new statistics when the old statistics are working just fine?

This blog post critically examines Cumming’s claim that his “new statistics” can solve endemic problems in psychological research that have created a replication crisis and that the old statistics are the cause of this crisis.

Like many other statisticians who are using the current replication crisis as an opportunity to sell their statistical approach, Cumming blames null-hypothesis significance testing (NHST) for the low credibility of research articles in Psychological Science (Francis, 2013).

In a nutshell, null-hypothesis significance testing entails 5 steps. First, researchers conduct a study that yields an observed effect size. Second, the sampling error of the design is estimated. Third, the ratio of the observed effect size and sampling error (signal-to-noise ratio) is computed to create a test-statistic (t, F, chi-square). Fourth, the test-statistic is used to compute the probability of obtaining the observed test-statistic or a larger one under the assumption that the true effect size in the population is zero (there is no effect or systematic relationship). Fifth, this probability (p-value) is compared to a criterion value. If the p-value is less than the criterion value (typically 5%), the null-hypothesis is rejected and it is concluded that an effect was present.
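
The five steps can be traced in a short R sketch. The data are simulated, and the sample size, effect size, and seed are arbitrary choices for illustration. R's built-in t.test() is used only to check the hand computation.

```r
# NHST step by step for a two-group comparison with simulated data.
set.seed(123)
n <- 50
control   <- rnorm(n, mean = 0,  sd = 1)
treatment <- rnorm(n, mean = .5, sd = 1)

# Step 1: observed effect size (here the raw mean difference)
effect <- mean(treatment) - mean(control)
# Step 2: sampling error (standard error of the mean difference, pooled variance)
pooled_var <- (var(treatment) + var(control)) / 2   # valid for equal n per group
se <- sqrt(pooled_var * 2 / n)
# Step 3: test statistic as signal-to-noise ratio
t_value <- effect / se
# Step 4: probability of a test statistic this extreme if the null is true
df <- 2 * n - 2
p_value <- 2 * pt(abs(t_value), df, lower.tail = FALSE)
# Step 5: compare the p-value to the criterion value
reject_null <- p_value < .05

check <- t.test(treatment, control, var.equal = TRUE)
c(t = t_value, p = p_value, t_check = unname(check$statistic), p_check = check$p.value)
```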

Cumming (2014) claims that we need a new way to analyze data because there is “renewed recognition of the severe flaws of null-hypothesis significance testing (NHST)” (p. 7). His new statistical approach has “no place for NHST” (p. 7). His advice is to “whenever possible, avoid using statistical significance or p values” (p. 8).

So what is wrong with NHST?

The first argument against NHST is that Ioannidis (2005) wrote an influential article with the eye-catching title “Why most published research findings are false” and most research articles use NHST to draw inferences from the observed results. Thus, NHST seems to be a flawed method because it produces mostly false results. The problem with this argument is that Ioannidis (2005) did not provide empirical evidence that most research findings are false, nor is this a particularly credible claim for all areas of science that use NHST, including particle physics.

The second argument against NHST is that researchers can use questionable research practices to produce significant results. This is not really a criticism of NHST, because researchers under pressure to publish are motivated to meet any criteria that are used to select articles for publication. A simple solution to this problem would be to publish all submitted articles in a single journal. As a result, there would be no competition for limited publication space in more prestigious journals. However, better studies would be cited more often and researchers will present their results in ways that lead to more citations. It is also difficult to see how psychology can improve its credibility by lowering standards for publication. A better solution would be to ensure that researchers are honestly reporting their results and report credible evidence that can provide a solid empirical foundation for theories of human behavior.

Cumming agrees. “To ensure integrity of the literature, we must report all research conducted to a reasonable standard, and reporting must be full and accurate” (p. 9). If a researcher conducted five studies with only a 20% chance to get a significant result and would honestly report all five studies, p-values would provide meaningful evidence about the strength of the evidence, namely most p-values would be non-significant and show that the evidence is weak. Moreover, post-hoc power analysis would reveal that the studies had indeed low power to test a theoretical prediction. Thus, I agree with Cumming that honesty and research integrity are important, but I see no reason to abandon NHST as a systematic way to draw inferences from a sample about the population because researchers have failed to disclose non-significant results in the past.

Cumming then cites a chapter by Kline (2014) that “provided an excellent summary of the deep flaws in NHST and how we use it” (p. 11). Apparently, the summary is so excellent that readers are better off reading the actual chapter because Cumming does not explain what these deep flaws are. He then observes that “very few defenses of NHST have been attempted” (p. 11). He doesn’t even list a single reference. Here is one by a statistician: “In defence of p-values” (Murtaugh, 2014). In a response, Gelman agrees that the problem is more with the way p-values are used rather than with the p-value and NHST per se.

Cumming then states a single problem of NHST, namely that it forces researchers to make a dichotomous decision. If the signal-to-noise ratio is above a criterion value, the null-hypothesis is rejected and it is concluded that an effect is present. If the signal-to-noise ratio is below the criterion value, the null-hypothesis is not rejected. If Cumming has a problem with decision making, it would be possible to simply report the signal-to-noise ratio or simply to report the effect size that was observed in a sample. For example, mortality in an experimental Ebola drug trial was 90% in the control condition and 80% in the experimental condition. As this is the only evidence, it is not necessary to compute sampling error, signal-to-noise ratios, or p-values. Given all of the available evidence, the drug seems to improve survival rates. But wait. Now a dichotomous decision is made based on the observed mean difference and there is no information about the probability that the results in the drug trial generalize to the population. Maybe the finding was a chance finding and the drug actually increases mortality. Should we really make a life-and-death decision based on the fact that 8 out of 10 patients died in one condition and 9 out of 10 patients died in the other condition?

Even in a theoretical research context decisions have to be made. Editors need to decide whether they accept or reject a submitted manuscript and readers of published studies need to decide whether they want to incorporate new theoretical claims in their theories or whether they want to conduct follow-up studies that build on a published finding. It may not be helpful to have a fixed 5% criterion, but some objective information about the probability of drawing the right or wrong conclusions seems useful.

Based on this rather unconvincing critique of p-values, Cumming (2014) recommends that “the best policy is, whenever possible, not to use NHST at all” (p. 12).

So what is better than NHST?

Cumming then explains how his new statistics overcome the flaws of NHST. The solution is simple. What is astonishing about this new statistic is that it uses the exact same components as NHST, namely the observed effect size and sampling error.

NHST uses the ratio of the effect size and sampling error. When the ratio reaches a value of 2, p-values reach the criterion value of .05 and are considered sufficient to reject the null-hypothesis.

The new statistical approach is to multiply the standard error by a factor of 2 and to add and subtract this value from the observed mean. The interval from the lower value to the higher value is called a confidence interval. The factor of 2 was chosen to obtain a 95% confidence interval. However, drawing a confidence interval alone is not sufficient to draw conclusions from the data. Whether we describe the results in terms of a ratio, .5/.2 = 2.5, or in terms of a 95%CI = .5 +/- .4 or CI = .1 to .9, is not a qualitative difference. These are simply different ways to provide information about the effect size and sampling error. Moreover, it is arbitrary to multiply the standard error by a factor of 2. It would also be possible to multiply it by a factor of 1, 3, or 5. A factor of 2 is used to obtain a 95% confidence interval rather than a 20%, 50%, 80%, or 99% confidence interval. A 95% confidence interval is commonly used because it corresponds to a 5% error rate (100 - 95 = 5!). A 95% confidence interval is as arbitrary as a p-value of .05.
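
That both summaries carry the same information can be checked in a few lines. The sketch uses an effect size of .5 and a standard error of .2 as illustrative values, and the approximate factor of 2 from the text instead of the exact value of 1.96.

```r
# The signal-to-noise ratio and the confidence interval are built from
# the same two numbers: the effect size and its standard error.
effect <- .5
se     <- .2
ratio  <- effect / se                     # NHST: compare to a criterion of 2
ci     <- effect + c(-1, 1) * 2 * se      # approximate 95% confidence interval
# The interval excludes zero exactly when the ratio exceeds 2 in absolute value.
excludes_zero <- ci[1] > 0 | ci[2] < 0
list(ratio = ratio, ci = ci, excludes_zero = excludes_zero, significant = abs(ratio) > 2)
```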

So, how can a p-value be fundamentally wrong and how can a confidence interval be the solution to all problems if they provide the same information about effect size and sampling error? In particular how do confidence intervals solve the main problem of making inferences from an observed mean in a sample about the mean in a population?

To sell confidence intervals, Cumming uses a seductive example.

“I suggest that, once freed from the requirement to report p values, we may appreciate how simple, natural, and informative it is to report that “support for Proposition X is 53%, with a 95% CI of [51, 55],” and then interpret those point and interval estimates in practical terms” (p 14).

Support for proposition X is a rather unusual dependent variable in psychology. However, let us assume that Cumming refers to an opinion poll among psychologists whether NHST should be abandoned. The response format is a simple yes/no format. The average in the sample is 53%. The null-hypothesis is 50%. The observed mean of 53% in the sample shows more responses in favor of the proposition. To compute a significance test or to compute a confidence interval, we need to know the standard error. The confidence interval ranges from 51% to 55%. As the 95% confidence interval is defined by the observed mean plus/minus two standard errors, it is easy to see that the standard error is SE = (53-51)/2 = 1% or .01. The formula for the standard error in a one sample test with a dichotomous dependent variable is sqrt(p * (1-p) / n). Solving for n yields a sample size of N = 2,491. This is not surprising because public opinion polls often use large samples to predict election outcomes because small samples would not be informative. Thus, Cumming’s example shows how easy it is to draw inferences from confidence intervals when sample sizes are large and confidence intervals are tight. However, it is unrealistic to assume that psychologists can and will conduct every study with samples of N = 1,000. Thus, the real question is how useful confidence intervals are in a typical research context, when researchers do not have sufficient resources to collect data from hundreds of participants for a single hypothesis test.
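
The implied sample size can be recovered from the reported interval. The sketch follows the text in treating the 95% confidence interval as the observed proportion plus or minus two standard errors.

```r
# Recover the sample size implied by "53%, 95% CI [51, 55]".
p     <- .53
lower <- .51
se    <- (p - lower) / 2          # half-width divided by the factor of 2
n     <- p * (1 - p) / se^2       # from se = sqrt(p * (1 - p) / n)
round(n)
```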

For example, sampling error for a between-subject design with N = 100 (n = 50 per cell) is SE = 2 / sqrt(100) = .2. Thus, the lower and upper limit of the 95%CI are 4/10 of a standard deviation away from the observed mean and the full width of the confidence interval covers 8/10th of a standard deviation. If the true effect size is small to moderate (d = .3) and a researcher happens to obtain the true effect size in a sample, the confidence interval would range from d = -.1 to d = .7. Does this result support the presence of a positive effect in the population? Should this finding be published? Should this finding be reported in newspaper articles as evidence for a positive effect? To answer this question, it is necessary to have a decision criterion.

One way to answer this question is to compute the signal-to-noise ratio, .3/.2 = 1.5, and to compute the probability that the positive effect in the sample could have occurred just by chance, t(98) = .3/.2 = 1.5, p = .14 (two-tailed). Given this probability, we might want to see stronger evidence. Moreover, a researcher is unlikely to be happy with this result. Evidently, it would have been better to conduct a study that could have provided stronger evidence for the predicted effect, say a confidence interval of d = .25 to .35, but that would have required a sample size of about N = 6,400 participants.
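
The two calculations in this paragraph can be sketched in R. The sketch assumes, as the text does, that the standard error of d is 2 / sqrt(N) and that the 95% confidence interval spans two standard errors on each side of the observed effect.

```r
# Two-tailed p-value for an observed effect of d = .3 with N = 100.
N  <- 100
se <- 2 / sqrt(N)                 # standard error of d for two equal groups
t_value <- .3 / se
p_value <- 2 * pt(t_value, df = N - 2, lower.tail = FALSE)

# Total sample size needed for a confidence interval of d = .25 to .35.
half_width  <- .05
se_required <- half_width / 2     # interval is +/- 2 standard errors
N_required  <- (2 / se_required)^2
c(t = t_value, p = round(p_value, 3), N_required = N_required)
```

Using the exact multiplier of 1.96 instead of 2 would give a somewhat smaller required sample size.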

A wide confidence interval can also suggest that more evidence is needed, but the important question is how much more evidence is needed and how narrow a confidence interval should be before it can give confidence in a result. NHST provides a simple answer to this question. The evidence should be strong enough to reject the null-hypothesis with a specified error rate. Cumming’s new statistics provides no answer to the important question. The new statistics is descriptive, whereas NHST is an inferential statistic. As long as researchers merely want to describe their data, they can report their results in several ways, including reporting of confidence intervals, but when they want to draw conclusions from their data to support theoretical claims, it is necessary to specify what information constitutes sufficient empirical evidence.

One solution to this dilemma is to use confidence intervals to test the null-hypothesis. If the 95% confidence interval does not include 0, the ratio of effect size / sampling error is greater than 2 and the p-value would be less than .05. This is the main reason why many statistics programs report 95% confidence intervals rather than 33% or 66% confidence intervals. However, the use of 95% confidence intervals to test significance is hardly a new statistical approach that justifies the proclamation of a new statistic that will save empirical scientists from NHST. It is NHST! Not surprisingly, Cumming states that “this is my least preferred way to interpret a confidence interval” (p. 17).

However, he does not explain how researchers should interpret a 95% confidence interval that does include zero. Instead, he thinks it is not necessary to make a decision. “We should not lapse back into dichotomous thinking by attaching any particular importance to whether a value of interest lies just inside or just outside our CI.”

Does an experimental treatment for Ebola work? CI = -.3 to .8. Let’s try it. Let’s do nothing and do more studies forever. The benefit of avoiding making any decisions is that one can never make a mistake. The cost is that one can also never claim that an empirical claim is supported by evidence. Anybody who is worried about dichotomous thinking might ponder the fact that modern information processing is built on the simple dichotomy of 0/1 bits of information and that it is common practice to decide the fate of undergraduate students on the basis of scoring multiple choice tests in terms of True or False answers.

In my opinion, the solution to the credibility crisis in psychology is not to move away from dichotomous thinking, but to obtain better data that provide more conclusive evidence about theoretical predictions. A simple way to do so is to reduce sampling error. As sampling error decreases, confidence intervals get smaller and are less likely to include zero when an effect is present, and the signal-to-noise ratio increases so that p-values get smaller and smaller when an effect is present. Thus, less sampling error also means fewer decision errors.

The question is how small should sampling error be to reduce decision error and at what point are resources being wasted because the signal-to-noise ratio is clear enough to make a decision.

Power Analysis

Cumming does not distinguish between Fisher’s and Neyman-Pearson’s use of p-values. The main difference is that Fisher advocated the use of p-values without strict criterion values for significance testing. This approach would treat p-values just like confidence intervals as continuous statistics that do not imply an inference. A p-value of .03 is significant with a criterion value of .05, but it is not significant with a criterion value of .01.

Neyman-Pearson introduced the concept of a fixed criterion value to draw conclusions from observed data. A criterion value of p = .05 has a clear interpretation. It means that a test of 1,000 true null-hypotheses is expected to produce about 50 significant results (type-I errors). A lower error rate can be achieved by lowering the criterion value (p < .01 or p < .001).
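
The long-run meaning of the criterion value can be illustrated with a simulation. The sketch assumes z-tests in which every null-hypothesis is true. The seed and the number of tests are arbitrary.

```r
# Simulate 1,000 tests of true null-hypotheses and count type-I errors.
set.seed(2014)
z <- rnorm(1000)                         # test statistics when the null is true
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
type1_errors <- sum(p < .05)
type1_errors                             # expected value is 1000 * .05 = 50
```

The count in any single run varies around 50 because of sampling error.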

Importantly, Neyman-Pearson also considered the alternative problem that the p-value may fail to reach the critical value when an effect is actually present. They called this probability the type-II error. Unfortunately, social scientists have ignored this aspect of Neyman-Pearson Significance Testing (NPST). Researchers can avoid making type-II errors by reducing sampling error. The reason is that a reduction of sampling error increases the signal-to-noise ratio.

For example, the following p-values were obtained from simulating studies with 95% power. The graph only shows p-values greater than .001 to make the distribution of p-values more prominent. As a result, 62.5% of the data are missing because these p-values are below .001. The histogram of p-values has been popularized by Simonsohn et al. (2013) as a p-curve. The p-curve shows that p-values are heavily skewed towards low p-values. Thus, the studies provide consistent evidence that an effect is present, even though p-values can vary dramatically from one study (p = .0001) to the next (p = .02). The variability of p-values is not a problem for NPST as long as the p-values lead to the same conclusion because the magnitude of a p-value is not important in Neyman-Pearson hypothesis testing.

CumFig1

The next graph shows p-values for studies with 20% power. P-values vary just as much, but now the variation covers both sides of the significance criterion, p = .05. As a result, the evidence is often inconclusive and 80% of studies fail to reject the false null-hypothesis.

CumFig2

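R-Code
# Seed derived from the length of a string so that the simulation is reproducible
seed <- nchar("Cumming'sDancingP-Values")
set.seed(seed)
power <- .20       # the first graph uses 95% power
low_limit <- .000  # the first graph only shows p-values greater than .001
up_limit <- .10
# Simulate 2,500 z-scores with the noncentrality implied by the chosen power
z <- rnorm(2500, qnorm(.975) + qnorm(power), 1)
p <- (1 - pnorm(abs(z))) * 2  # two-tailed p-values; abs() keeps them in [0, 1]
hist(p, breaks = 1000, freq = FALSE, ylim = c(0, 100), xlim = c(low_limit, up_limit))
abline(v = .05, col = "red")  # significance criterion
# Proportion of p-values below the lower limit of the plot
percent_below_lower_limit <- mean(p < low_limit)
percent_below_lower_limit
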
If a study is designed to test a qualitative prediction (an experimental manipulation leads to an increase on an observed measure), power analysis can be used to plan a study so that it has a high probability of providing evidence for the hypothesis if the hypothesis is true. It does not matter whether the hypothesis is tested with p-values or with confidence intervals by showing that the confidence interval does not include zero.

Thus, power analysis seems useful even for the new statistics. However, Cumming is “ambivalent about statistical power” (p. 23). First, he argues that it has “no place when we use the new statistics” (p. 23), presumably because the new statistics never make dichotomous decisions.

Cumming’s next argument against power is that power is a function of the type-I error criterion. If the type-I error probability is set to 5% and power is only 33% (e.g., d = .5, between-group design N = 40), it is possible to increase power by increasing the type-I error probability. If the type-I error rate is set to 50%, power is 80%. Cumming thinks that this is an argument against power as a statistical concept, but raising alpha to 50% is equivalent to reducing the width of the confidence interval by computing a 50% confidence interval rather than a 95% confidence interval. Moreover, researchers who adjust alpha to 50% are essentially saying that a true null-hypothesis would produce a significant result in every other study. If an editor finds this acceptable and wants to publish the results, neither power analysis nor the reported results are problematic. It is true that there was a good chance to get a significant result when a moderate effect is present (d = .5, 80% probability) and when no effect is present (d = 0, 50% probability). Power analysis provides accurate information about the type-I and type-II error rates. In contrast, the new statistics provides no information about error rates in decision making because it is merely descriptive and does not make decisions.
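The dependence of power on the type-I error criterion in this example can be reproduced with base R's power.t.test. The sketch assumes a two-tailed, two-sample t-test with equal group sizes (n = 20 per group), which is how the N = 40 between-group design is read here.

```r
# Power for d = .5 with n = 20 per group (N = 40) under two alpha levels.
n <- 20
power_05 <- power.t.test(n = n, delta = .5, sd = 1, sig.level = .05)$power
power_50 <- power.t.test(n = n, delta = .5, sd = 1, sig.level = .50)$power
round(c(alpha_05 = power_05, alpha_50 = power_50), 2)
```

The two values should come out close to the 33% and 80% quoted in the text; small differences reflect rounding.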

Cumming then points out that “power calculations have traditionally been expected [by granting agencies], but these can be fudged” (p. 23). The problem with fudging power analysis is that the requested grant money may be sufficient to conduct the study, but insufficient to produce a significant result. For example, a researcher may be optimistic and expect a strong effect, d = .80, when the true effect size is only a small effect, d = .20. The researcher conducts a study with N = 52 participants to achieve 80% power. In reality the study has only 11% power and the researcher is likely to end up with a non-significant result. In the new statistics world this is apparently not a problem because the researcher can report the results with a wide confidence interval that includes zero, but it is not clear why a granting agency should fund studies that cannot even provide information about the direction of an effect in the population.
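The numbers in this example can be checked with base R's power.t.test. The sketch assumes a two-tailed, two-sample t-test with alpha = .05 and equal group sizes, which the text does not state explicitly.

```r
# Step 1: sample size planned under the optimistic assumption d = .80.
planned <- power.t.test(delta = .8, sd = 1, sig.level = .05, power = .80)
n_per_group <- ceiling(planned$n)   # round up to whole participants
N_total <- 2 * n_per_group

# Step 2: actual power of that design if the true effect is only d = .20.
actual_power <- power.t.test(n = n_per_group, delta = .2, sd = 1,
                             sig.level = .05)$power
c(N_total = N_total, actual_power = round(actual_power, 2))
```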

Cumming then points out that “one problem is that we never know true power, the probability that our experiment will yield a statistically significant result, because we do not know the true effect size; that is why we are doing the experiment!” (p. 24). The exclamation mark indicates that this is the final nail in the coffin of power analysis. Power analysis is useless because it makes assumptions about effect sizes when we can just do an experiment to observe the effect size. It is that easy in the world of new statistics. The problem is that we do not know the true effect sizes after an experiment either. We never know the true effect size because we can never determine a population parameter, just like we can never prove the null-hypothesis. It is only possible to estimate population parameters. However, before we estimate a population parameter, we may simply want to know whether an effect exists at all. Power analysis can help in planning studies so that the sample mean shows the same sign as the population mean with a specified error rate.

Determining Sample Sizes in the New Statistics

Although Cumming does not find power analysis useful, he gives some information about sample sizes. Studies should be planned to have a specified level of precision. Cumming gives an example for a between-subject design with n = 50 per cell (N = 100). He chose to present confidence intervals for unstandardized coefficients. In this case, there is no fixed value for the width of the confidence interval because the sampling variance influences the standard error. However, for standardized coefficients like Cohen’s d, sampling variance will produce variation in standardized coefficients, while the standard error is approximately constant. The standard error is simply 2 / sqrt(N), which equals SE = .2 for N = 100. This value needs to be multiplied by about 2 (1.96) to get the confidence interval, and the 95%CI = d +/- .4. Thus, it is known before the study is conducted that the confidence interval will span 8/10 of a standard deviation and that an observed effect size of d > .4 is needed to exclude 0 from the confidence interval and to state with 95% confidence that the observed effect size would not have occurred if the true effect size were 0 or in the opposite direction.
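The arithmetic can be written out in a few lines of R. This is a sketch that relies on the large-sample approximation SE = 2 / sqrt(N) used in the text and on the normal quantile 1.96 instead of the rounded multiplier 2.

```r
# Approximate standard error and 95% confidence interval width for Cohen's d
# in a between-subject design with N = 100 (n = 50 per cell).
N <- 100
se_d <- 2 / sqrt(N)                # standard error of d
half_width <- qnorm(.975) * se_d   # margin of error around the observed d
ci_width <- 2 * half_width         # total span of the 95% confidence interval
round(c(se = se_d, half_width = half_width, width = ci_width), 2)
```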

The problem is that Cumming provides no guidelines about the level of precision that a researcher should achieve. Is 8/10 of a standard deviation precise enough? Should researchers aim for 1/10 of a standard deviation? So when he suggests that funding agencies should focus on precision, it is not clear what criterion should be used to fund research.
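To make the trade-off concrete, the same approximation (SE = 2 / sqrt(N)) can be inverted to get the total sample size needed for a target width of the 95% confidence interval. The target widths below are illustrative assumptions; Cumming does not propose them.

```r
# Total N needed so that the 95% CI for Cohen's d spans a target width w:
# w = 2 * 1.96 * 2 / sqrt(N), hence N = (4 * 1.96 / w)^2.
target_width <- c(.8, .4, .2, .1)   # in standard deviation units (assumption)
N_required <- ceiling((4 * qnorm(.975) / target_width)^2)
data.frame(target_width, N_required)
```

Halving the width of the interval requires roughly four times as many participants.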

One obvious criterion would be to ensure that precision is sufficient to exclude zero so that the results can be used to state that the direction of the observed effect is the same as the direction of the effect in the population that a researcher wants to generalize to. However, as soon as effect sizes are used in the planning of the precision of a study, precision planning is equivalent to power analysis. Thus, the main novel aspect of the new statistics is to ignore effect sizes in the planning of studies, but without providing guidelines about desirable levels of precision. Researchers should be aware that N = 100 in a between-subject design gives a confidence interval that spans 8/10 of a standard deviation. Is that precise enough?

Problem of Questionable Research Practices, Publication Bias, and Multiple Testing

A major problem for any statistical method is the assumption that random sampling error is the only source of error. However, the current replication crisis has demonstrated that reported results are also systematically biased. A major challenge for any statistical approach, old or new, is to deal effectively with systematically biased data.

It is impossible to detect bias in a single study. However, when more than one study is available, it becomes possible to examine whether the reported data are consistent with the statistical assumption that each sample is an independent sample and that the results in each sample are a function of the true effect size and random sampling error. In other words, there is no systematic error that biases the results. Numerous statistical methods have been developed to examine whether data are biased or not.

Cumming (2014) does not mention a single method for detecting bias (Funnel Plot, Egger regression, Test of Excessive Significance, Incredibility-Index, P-Curve, Test of Insufficient Variance, Replicability-Index, P-Uniform). He merely mentions a visual inspection of forest plots and suggests that “if for example, a set of studies is distinctly too homogeneous – it shows distinctly less bouncing around than we would expect from sampling variability… we can suspect selection or distortion of some kind” (p. 23). However, he provides no criteria that explain how variability of observed effect sizes should be compared against predicted variability and how the presence of bias influences the interpretation of a meta-analysis. Thus, he concludes that “even so [biases may exist], meta-analysis can give the best estimates justified by research to date, as well as the best guidance for practitioners” (p. 23). Thus, the new statistics would suggest that extrasensory perception is real because a meta-analysis of Bem’s (2011) infamous Journal of Personality and Social Psychology article shows an effect with a tight confidence interval that does not include zero. In contrast, other researchers have demonstrated with old statistical tools and with the help of post-hoc power that Bem’s results are not credible (Francis, 2012; Schimmack, 2012).

Research Integrity

Cumming also advocates research integrity. His first point is that psychological science should “promote research integrity: (a) a public research literature that is complete and trustworthy and (b) ethical practice, including full and accurate reporting of research” (p. 8). However, his own article falls short of this ideal. His article does not provide a complete, balanced, and objective account of the statistical literature. Rather, Cumming (2014) cherry-picks references that support his claims and does not cite references that are inconvenient for his claims. I give one clear example of bias in his literature review.

He cites Ioannidis’s 2005 paper to argue that p-values and NHST are flawed and should be abandoned. However, he does not cite Ioannidis and Trikalinos (2007). This article introduces a statistical approach that can detect biases in meta-analysis by comparing the success rate (percentage of significant results) to the observed power of the studies. As power determines the success rate in an honest set of studies, a success rate that is higher than the observed power reveals publication bias. Cumming not only fails to mention this article; he also warns readers to “beware of any power statement that does not state an ES; do not use post hoc power.” Without further elaboration, this would imply that readers should ignore evidence for bias with the Test of Excessive Significance because it relies on post-hoc power. To support this claim, he cites Hoenig and Heisey (2001) to claim that “post hoc power can often take almost any value, so it is likely to be misleading” (p. 24). This statement is misleading because post-hoc power is no different from any other statistic that is influenced by sampling error. In fact, Hoenig and Heisey (2001) show that post-hoc power in a single study is monotonically related to p-values. Their main point is that post-hoc power provides no other information than p-values. However, like p-values, post-hoc power becomes more informative, the higher it is. A study with 99% post-hoc power is likely to be a high powered study, just like extremely low p-values, p < .0001, are unlikely to be obtained in low powered studies or in studies when the null-hypothesis is true. So, post-hoc power is informative when it is high. Cumming (2014) further ignores that variability of post-hoc power estimates decreases in a meta-analysis of post-hoc power and that post-hoc power has been used successfully to reveal bias in published articles (Francis, 2012; Schimmack, 2012). Thus, his statement that researchers should ignore post-hoc power analyses is not supported by an unbiased review of the literature, and his article does not provide a complete and trustworthy account of the public research literature.
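The monotonic relation between p-values and post-hoc power can be sketched in R. The helper observed_power below is a hypothetical function written for this illustration, not code from Hoenig and Heisey (2001); it assumes a two-tailed z-test and ignores the negligible probability of a significant result in the opposite direction.

```r
# Post-hoc (observed) power implied by a two-tailed p-value, treating the
# observed z-score as if it were the true noncentrality parameter.
observed_power <- function(p, alpha = .05) {
  z <- qnorm(1 - p / 2)             # z-score implied by the p-value
  pnorm(z - qnorm(1 - alpha / 2))   # probability of exceeding the criterion
}

p_values <- c(.05, .01, .001, .0001)   # illustrative values (assumption)
obs_power <- observed_power(p_values)
data.frame(p_values, obs_power = round(obs_power, 2))
```

A just-significant result (p = .05) corresponds to 50% post-hoc power, and post-hoc power rises steadily as p-values get smaller.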

Conclusion

I cannot recommend Cumming’s new statistics. I routinely report confidence intervals in my empirical articles, but I do not consider them a new statistical tool. In my opinion, the root cause of the credibility crisis is that researchers conduct underpowered studies that have a low chance to produce the predicted effect and then use questionable research practices to boost power and to hide non-significant results that could not be salvaged. A simple solution to this problem is to conduct more powerful studies that can produce significant results when the predicted effect exists. I do not claim that this is a new insight. Rather, Jacob Cohen tried his whole life to educate psychologists about the importance of statistical power.

Here is what Jacob Cohen had to say about the new statistics in 1994 using time-travel to comment on Cumming’s article 20 years later.

“Everyone knows” that confidence intervals contain all the information to be found in significance tests and much more. They not only reveal the status of the trivial nil hypothesis but also about the status of non-nil null hypotheses and thus help remind researchers about the possible operation of the crud factor. Yet they are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large! But their sheer size should move us toward improving our measurement by seeking to reduce the unreliable and invalid part of the variance in our measures (as Student himself recommended almost a century ago). Also, their width provides us with the analogue of power analysis in significance testing—larger sample sizes reduce the size of confidence intervals as they increase the statistical power of NHST” (p. 1002).

If you are looking for a book on statistics, I recommend Cohen’s old statistics over Cumming’s new statistics, p < .05.

Conflict of Interest: I do not have a book to sell (yet), but I strongly believe that power analysis is an important tool for all scientists who have to deal with uncontrollable variance in their data. Therefore I am strongly opposed to Cumming’s push for a new statistics that provides no guidelines for researchers on how they can optimize the use of their resources to obtain credible evidence for effects that actually exist and no guidelines on how science can correct false positive results.