
Estimating the False Discovery Risk of Psychology Science

Abstract

Since 2011, the credibility of psychological science has been in doubt. A major concern is that questionable research practices could have produced many false positive results, and it has been suggested that most published results are false. Here we present an empirical estimate of the false discovery risk using a z-curve analysis of randomly selected p-values from a broad range of journals that span most disciplines in psychology. The results suggest that no more than a quarter of published results could be false positives. We also show that the false positive risk can be reduced to less than 5% by using alpha = .01 as the criterion for statistical significance. This remedy can restore confidence in the direction of published effects. However, published effect sizes cannot be trusted because the z-curve analysis shows clear evidence of selection for significance that inflates effect size estimates.

Introduction

Several events in the early 2010s led to a credibility crisis in psychology. When journals selectively publish only statistically significant results, statistical significance loses its, well, significance. Every published focal hypothesis test will be statistically significant, and it is unclear which of these results are true positives and which are false positives.

A key article that contributed to the credibility crisis was Simmons, Nelson, and Simonsohn’s (2011) article “False Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”

The title made a bold claim: it is easy to obtain statistically significant results even when the null-hypothesis is true. This led to concerns that many, if not most, published results are indeed false positives. Many meta-psychological articles, including my own 2012 article, quoted Simmons et al. (2011) to suggest that there is a high risk or even a high rate of false positive results in the psychological literature.

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

The Appendix lists citations from influential meta-psychological articles that imply a high false positive risk in the psychological literature. Only one article suggested that fears about high false positive rates may be unwarranted (Stroebe & Strack, 2014). In contrast, other articles have suggested that false positive rates might be as high as 50% or more (Szucs & Ioannidis, 2017).

There have been two noteworthy attempts at estimating the false discovery rate in psychology. Szucs and Ioannidis (2017) automatically extracted p-values from five psychology journals and estimated the average power of extracted t-tests. They then used this power estimate in combination with the assumption that psychologists test one true, non-zero effect for every 13 true null-hypotheses to suggest that the false discovery rate in psychology exceeds 50%. The problem with this estimate is that it relies on the questionable assumption that psychologists test only a very small percentage of true hypotheses.

The other article tried to estimate the false positive rate based on 70 of the 100 studies that were replicated in the Open Science Collaboration project (Open Science Collaboration, 2015). The statistical model estimated that psychologists test 93 true null-hypotheses for every 7 true effects (true positives), and that true effects are tested with 75% power (Johnson et al., 2017). This yields a false positive rate of about 50%. The main problem with this study is the reliance on a small, unrepresentative sample of studies that focused heavily on experimental social psychology, a field that triggered concerns about the credibility of psychology in general (Schimmack, 2020). Another problem is that point estimates based on a small sample are unreliable.
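The arithmetic behind such estimates is straightforward: the false discovery rate follows directly from the share of true null-hypotheses, the power for true effects, and the alpha level. A minimal sketch, using the 93:7 ratio and 75% power reported above (the function name is mine, not from either article):

```python
def false_discovery_rate(prop_null, power, alpha=.05):
    """Expected share of significant results that are false positives,
    given the proportion of true nulls, power for true effects, and alpha."""
    prop_true = 1 - prop_null
    false_pos = prop_null * alpha   # significant results from true null-hypotheses
    true_pos = prop_true * power    # significant results from true effects
    return false_pos / (false_pos + true_pos)

# Johnson et al.'s scenario: 93 true nulls per 7 true effects, 75% power
fdr = false_discovery_rate(prop_null=.93, power=.75, alpha=.05)
print(round(fdr, 2))  # 0.47, i.e., "about 50%"
```

This also shows why the base-rate assumption drives the conclusion: with a 1:1 ratio of nulls to true effects, the same power yields a false discovery rate of only about 6%.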

To provide new and better information about the false positive risk in psychology, we conducted a new investigation that addresses three limitations of the previous studies. First, we used hand-coding of focal hypothesis tests, rather than automatic extraction of all test-statistics. Second, we sampled from a broad range of journals that cover all areas of psychology rather than focusing narrowly on experimental psychology. Third, we used a validated method to estimate the false discovery risk based on an estimate of the expected discovery rate (Bartos & Schimmack, 2021). In short, the false discovery risk decreases as a monotonic function of the number of discoveries (i.e., p-values below .05) (Soric, 1989).
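Soric’s bound makes this monotonic relationship concrete: given an (expected) discovery rate DR and significance criterion alpha, the maximum false discovery risk is (1/DR − 1) × alpha/(1 − alpha). A short illustration:

```python
def soric_fdr_max(discovery_rate, alpha=.05):
    """Soric's (1989) upper bound on the false discovery risk
    implied by a given discovery rate and alpha level."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

# A 5% discovery rate at alpha = .05 allows for 100% false positives;
# higher discovery rates imply lower maximum false discovery risks.
print(round(soric_fdr_max(.05), 2))  # 1.0
print(round(soric_fdr_max(.20), 2))  # 0.21
```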

Z-curve relies on the observation that false positives and true positives produce different distributions of p-values. To fit a model to distributions of significant p-values, z-curve transforms p-values into absolute z-scores. We illustrate z-curve with two simulation studies. The first simulation is based on Simmons et al.’s (2011) scenario in which the combination of four questionable research practices inflates the false positive risk from 5% to 60%. In our simulation, we assumed an equal number of true null-hypotheses (effect size d = 0) and true hypotheses with small to moderate effect sizes (d = .2 to .5). The use of questionable research practices also increases the chances of getting a significant result for true hypotheses. In our simulation, the probability of obtaining significance was 58% when H0 was true and 93% when H1 was true. Given the 1:1 ratio of tested H0 and H1, this yields a false discovery rate of 39%.

Figure 1 shows that questionable research practices produce a steeply declining z-curve. Based on this shape, z-curve estimates an expected discovery rate of 5%, with a 95%CI ranging from 5% to 10%. This translates into an estimate of the false discovery risk of 100%, with a 95%CI ranging from 46% to 100% (Soric, 1989). The reason why z-curve provides a conservative estimate of the false discovery risk is that p-hacking changes the shape of the distribution in a way that produces even more z-values just above 1.96 than mere selection for significance would. In other words, p-hacking destroys evidential value even when true hypotheses are being tested. It is not necessary to simulate scenarios in which even more true null-hypotheses are tested because this would make the z-curve even steeper. Thus, Figure 1 provides a prediction for our z-curve analyses of actual data, if psychologists heavily rely on Simmons et al.’s recipe to produce significant results.

Figure 2 is based on a simulation of Johnson et al.’s (2017) scenario with a 9% discovery rate (9 discoveries for every 100 hypothesis tests), a false discovery rate of 50%, and 75% power to detect true effects. Johnson et al. did not assume or model p-hacking.

The z-curve for this scenario also shows a steep decline that can be attributed to the high percentage of false positive results. However, there is also a notable tail with z-values greater than 3 that reflects the influence of true hypotheses tested with adequate power. In this scenario, the expected discovery rate is higher, with a 95%CI ranging from 7% to 20%. This translates into a 95%CI for the false discovery risk ranging from 21% to 71% (Soric, 1989). This interval contains the true value of 50%, although the point estimate, 34%, underestimates the true value. Thus, we recommend using the upper limit of the 95%CI as an estimate of the maximum false discovery rate that is consistent with the data.

We now turn to real data. Figure 3 shows a z-curve analysis of Kühberger, Fritz, and Scherndl’s (2014) data. The authors conducted an audit of psychological research by randomly sampling 1,000 English-language articles published in the year 2007 that were listed in PsycInfo. This audit produced 344 significant p-values that could be subjected to a z-curve analysis. The results differ notably from the simulated scenarios. The expected discovery rate is higher and implies a much smaller false discovery risk of only 9%. However, due to the small set of studies, the confidence interval is wide and allows for nearly 50% false positive results.

To produce a larger set of test-statistics, my students and I have hand-coded over 1,000 randomly selected articles from a broad range of journals (Schimmack, 2021). These data were combined with Motyl et al.’s (2017) coding of social psychology journals. The time period spans the years 2008 to 2014, with a focus on the years 2009 and 2010. This dataset produced 1,715 significant p-values. The estimated false discovery risk is similar to the estimate for Kühberger et al.’s (2014) studies. Although the point estimate of the false discovery risk is a bit higher, 12%, the upper bound of the 95%CI is lower because the confidence interval is tighter.

Given the similarity of the results, we combined the two datasets to obtain an even more precise estimate of the false discovery risk based on 2,059 significant p-values. The upper limit of the 95%CI decreased only slightly, from 30% to 26%.

The most important conclusion from these findings is that concerns about false positive results rest on exaggerated assumptions about their prevalence in psychology journals. The present results suggest that at most a quarter of published results are false positives and that actual z-curves look very different from those implied by the influential simulation studies of Simmons et al. (2011). Our empirical results show no evidence that massive p-hacking is a common practice.

However, a false positive risk of 25% is still unacceptably high. Fortunately, there is an easy solution to this problem because the false discovery rate depends on the significance threshold. Based on their pessimistic estimates, Johnson et al. (2017) suggested lowering alpha to .005 or even .001. However, these stringent criteria would render most published results statistically non-significant. We suggest lowering alpha to .01. Figure 6 shows the rationale for this recommendation by fitting z-curve with alpha = .01 (i.e., the red vertical line that represents the significance criterion is moved from 1.96 to 2.58).

Lowering alpha to .01 lowers the percentage of significant results from 83% (not counting marginally significant, p < .1, results) to 53%. Thus, the expected discovery rate decreases, but the more stringent criterion for significance lowers the false discovery risk to 4%, and even the upper limit of the 95%CI is just 4%.

It is likely that discovery rates vary across journals and disciplines (Schimmack, 2021). In the future, it may be possible to make more specific recommendations for different disciplines or journals based on their discovery rates. Journals that publish riskier hypotheses tests or studies with modest power would need a more stringent significance criterion to maintain an acceptable false discovery risk.

An alpha level of .01 is also recommended by Simmons et al.’s (2011) simulation studies of p-hacking. Massive p-hacking that inflates the false positive risk from 5% to 61% at alpha = .05 produces only 22% false positives with alpha = .01. Milder forms of p-hacking produce only an 8% probability of obtaining a p-value below .01. Ideally, open science practices like pre-registration will curb the use of questionable practices in the future. Increasing sample sizes will also help to lower the false positive risk. A z-curve analysis of new studies can be used to estimate the current false discovery risk and may show that even the traditional alpha level of .05 is able to maintain a false discovery risk below 5%.

While the present results may be considered good news relative to the scenario that most published results cannot be trusted, they do not change the fact that some areas of psychology have a replication crisis (Open Science Collaboration, 2015). The z-curve results show clear evidence of selection for significance, which leads to inflated effect size estimates. Studies suggest that effect sizes are often inflated by more than 100% (Open Science Collaboration, 2015). Thus, published effect size estimates cannot be trusted even if p-values below .01 show the correct sign of an effect. The present results also imply that effect size meta-analyses that did not correct for publication bias produce inflated effect size estimates. For these reasons, many meta-analyses have to be reexamined with statistical tools that correct for publication bias.

Appendix

“Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals” (Button et al., 2013; 3,316 citations).

“In a theoretical analysis, Ioannidis estimated that publishing and analytic practices make it likely that more than half of research results are false and therefore irreproducible” (Open Science Collaboration, 2015, aac4716-1)

“There is increasing concern that most current published research findings are false. (Ioannidis, 2005, abstract)” (Cumming, 2014, p. 7, 1,633 citations).

“In a recent article, Simmons, Nelson, and Simonsohn (2011) showed how, due to the misuse of statistical tools, significant results could easily turn out to be false positives (i.e., effects considered significant whereas the null hypothesis is actually true).” (Leys et al., 2013, p. 765, 1,406 citations)

“During data analysis it can be difficult for researchers to recognize P-hacking or data dredging because confirmation and hindsight biases can encourage the acceptance of outcomes that fit expectations or desires as appropriate, and the rejection of outcomes that do not as the result of suboptimal designs or analyses. Hypotheses may emerge that fit the data and are then reported without indication or recognition of their post hoc origin. This, unfortunately, is not scientific discovery, but self-deception. Uncontrolled, it can dramatically increase the false discovery rate” (Munafò et al., 2017, p. 2, 1,010 citations)

“Just how dramatic these effects can be was demonstrated by Simmons, Nelson, and Simonsohn (2011) in a series of experiments and simulations that showed how greatly QRPs increase the likelihood of finding support for a false hypothesis.” (John et al., 2012, p. 524, 877 citations).

“Simonsohn’s simulations have shown that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%” (Nuzzo, 2014, 799 citations).

“the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011)” (Pashler & Wagenmakers, 2012, p. 528, 736 citations)

“Even seemingly conservative levels of p-hacking make it easy for researchers to find statistically significant support for nonexistent effects. Indeed, p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011).” (Simonsohn, Nelson, & Simmons, 2014, p. 534, 656 citations)

“Recent years have seen intense interest in the reproducibility of scientific results and the degree to which some problematic, but common, research practices may be responsible for high rates of false findings in the scientific literature, particularly within psychology but also more generally” (Poldrack et al., 2017, p. 115, 475 citations)

“especially in an environment in which multiple comparisons or researcher dfs (Simmons, Nelson, & Simonsohn, 2011) make it easy for researchers to find large and statistically significant effects that could arise from noise alone” (Gelman & Carlin, 2014)

“In an influential recent study, Simmons and colleagues demonstrated that even a moderate amount of flexibility in analysis choice—for example, selecting from among two DVs or optionally including covariates in a regression analysis—could easily produce false-positive rates in excess of 60%, a figure they convincingly argue is probably a conservative estimate (Simmons et al., 2011).” (Yarkoni & Westfall, 2017, p. 1103, 457 citations)

“In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides access to a Pandora’s box of tricks that can be used to achieve any desired result (e.g., John et al., 2012; Simmons, Nelson, & Simonsohn, 2011)” (Wagenmakers et al., 2012, p. 633, 425 citations)

“Simmons et al. (2011) illustrated how easy it is to inflate Type I error rates when researchers employ hidden degrees of freedom in their analyses and design of studies (e.g., selecting the most desirable outcomes, letting the sample size depend on results of significance tests).” (Bakker et al., 2012, p. 545, 394 citations).

“Psychologists have recently become increasingly concerned about the likely overabundance of false positive results in the scientific literature. For example, Simmons, Nelson, and Simonsohn (2011) state that “In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not” (p. 1359)” (Maxwell, Lau, & Howard, 2015, p. 487)

“More-over, the highest impact journals famously tend to favor highly surprising results; this makes it easy to see how the proportion of false positive findings could be even higher in such journals.” (Pashler & Harris, 2012, p. 532, 373 citations)

“There is increasing concern that many published results are false positives [1,2] (but see [3]).” (Head et al., 2015, p. 1, 356 citations)

“Quantifying p-hacking is important because publication of false positives hinders scientific progress” (Head et al., 2015, p. 2, 356 citations).

“To be sure, methodological discussions are important for any discipline, and both fraud and dubious research procedures are damaging to the image of any field and potentially undermine confidence in the validity of social psychological research findings. Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology.” (Stroebe & Strack, 2014, p. 60, 291 citations)

“Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature” (Szucs & Ioannidis, 2017, p. 1, 269 citations)

“Notably, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias” (Szucs & Ioannidis, 2017, p. 12, 269 citations)

“In all, the combination of low power, selective reporting, and other biases and errors that have been well documented suggest that high FRP can be expected in cognitive neuroscience and psychology. For example, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias.” (Szucs & Ioannidis, 2017, p. 15, 269 citations)

“Many prominent researchers believe that as much as half of the scientific literature—not only in medicine, but also in psychology and other fields—may be wrong [11,13–15]” (Smaldino & McElreath, 2016, p. 2, 251 citations).

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

“A more recent article compellingly demonstrated how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates (Simmons, Nelson, & Simonsohn, 2011).” (Dick et al., 2015, p. 43, 208 citations)

“In 2011, we wrote “False-Positive Psychology” (Simmons et al. 2011), an article reporting the surprisingly severe consequences of selectively reporting data and analyses, a practice that we later called p-hacking. In that article, we showed that conducting multiple analyses on the same data set and then reporting only the one(s) that obtained statistical significance (e.g., analyzing multiple measures but reporting only one) can dramatically increase the likelihood of publishing a false-positive finding. Independently and nearly simultaneously, John et al. (2012) documented that a large fraction of psychological researchers admitted engaging in precisely the forms of p-hacking that we had considered. Identifying these realities—that researchers engage in p-hacking and that p-hacking makes it trivially easy to accumulate significant evidence for a false hypothesis—opened psychologists’ eyes to the fact that many published findings, and even whole literatures, could be false positive.” (Nelson, Simmons, & Simonsohn, 2018, 204 citations).

“As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—“it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359)” (Earp & Trafimow, 2015, p. 4, 200 citations)

“The second, related set of events was the publication of articles by a series of authors (Ioannidis 2005, Kerr 1998, Simmons et al. 2011, Vul et al. 2009) criticizing questionable research practices (QRPs) that result in grossly inflated false positive error rates in the psychological literature” (Shrout & Rodgers, 2018, p. 489, 195 citations).

“Let us add a new dimension, which was brought up in a seminal publication of Simmons, Nelson & Simonsohn (2011). They stated that researchers actually have so much flexibility in deciding how to analyse their data that this flexibility allows them to coax statistically significant results from nearly any data set” (Forstmeier, Wagenmakers, & Parker, 2017, p. 1945, 173 citations)

“Publication bias (Ioannidis, 2005) and flexibility during data analyses (Simmons, Nelson, & Simonsohn, 2011) create a situation in which false positives are easy to publish, whereas contradictory null findings do not reach scientific journals (but see Nosek & Lakens, in press)” (Lakens & Evers, 2014, p. 278, 139 citations)

“Recent reports hold that allegedly common research practices allow psychologists to support just about any conclusion (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011).” (Koole & Lakens, 2012, p. 608, 139 citations)

“Researchers then may be tempted to write up and concoct papers around the significant results and send them to journals for publication. This outcome selection seems to be widespread practice in psychology [12], which implies a lot of false positive results in the literature and a massive overestimation of ES, especially in meta-analyses” (

“Researcher df, or researchers’ behavior directed at obtaining statistically significant results (Simonsohn, Nelson, & Simmons, 2013), which is also known as p-hacking or questionable research practices in the context of null hypothesis significance testing (e.g., O’Boyle, Banks, & Gonzalez-Mulé, 2014), results in a higher frequency of studies with false positives (Simmons et al., 2011) and inflates genuine effects (Bakker et al., 2012).” (van Assen, van Aert, & Wicherts, p. 294, 133 citations)

“The scientific community has witnessed growing concern about the high rate of false positives and unreliable results within the psychological literature, but the harmful impact of false negatives has been largely ignored” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Much of the debate has concerned habits (such as “p-hacking” and the file-drawer effect) which can boost the prevalence of false positives in the published literature (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Simmons, Nelson, & Simonsohn, 2011).” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Simmons, Nelson, and Simonsohn (2011) showed that researchers without scruples can nearly always find a p < .05 in a data set if they set their minds to it.” (Crandall & Sherman, 2014, p. 96, 114 citations)

Before we can balance false positives and false negatives, we have to publish false negatives.

Ten years ago, a stunning article by Bem (2011) triggered a crisis of confidence about psychology as a science. The article presented nine studies that seemed to show time-reversed causal effects of subliminal stimuli on human behavior. Hardly anybody believed the findings, but everybody wondered how Bem was able to produce significant results for effects that do not exist. This triggered a debate about research practices in social psychology.

Over the past decade, most articles on the replication crisis in social psychology pointed out problems with existing practices, but some articles tried to defend the status quo (cf. Schimmack, 2020).

Finkel, Eastwick, and Reis (2015) contributed to the debate with a plea to balance false positives and false negatives.

Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science

I argue that the main argument in this article is deceptive, but before I do so it is important to elaborate a bit on the use of the word deceptive. Psychologists make a distinction between self-deception and other-deception. Other-deception is easy to explain. For example, a politician may spread a lie for self-gain knowing full well that it is a lie. The meaning of self-deception is also relatively clear. Here, individuals spread false information because they are unaware that the information is false. The main problem for psychologists is to distinguish between self-deception and other-deception. For example, it is unclear whether Donald Trump and his followers have defence mechanisms so strong that they really believe the election was stolen without any evidence to support this belief, or whether they are merely using a lie for political gain. Similarly, it is unclear whether Finkel et al. were deceiving themselves when they characterized the research practices of relationship researchers as an error-balanced approach. But the distinction between self-deception and other-deception is irrelevant here: self-deception also leads to the spreading of misinformation that needs to be corrected.

In short, my main thesis is that Finkel et al. misrepresent research practices in psychology and that they draw false conclusions about the status quo and the need for change based on a false premise.

Common Research Practices in Psychology

Psychological research practices follow a number of simple steps.

1. Researchers formulate a hypothesis that two variables are related (e.g., height is related to weight; dieting leads to weight loss).

2. They find ways to measure or manipulate a potential causal factor (height, dieting) and find a way to measure the effect (weight).

3. They recruit a sample of participants (e.g., N = 40).

4. They compute a statistic that reflects the strength of the relationship between the two variables (e.g., height and weight correlate r = .5).

5. They determine the amount of sampling error given their sample size.

6. They compute a test-statistic (t-value, F-value, z-score) that reflects the ratio of the effect size to the sampling error (e.g., r(40) = .5; t(38) = 3.56).

7. They use the test-statistic to decide whether the relationship in the sample (e.g., r = .5) is strong enough to reject the nil-hypothesis that the relationship in the population is zero (p = .001).
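The conversion in step 6 can be sketched with the numbers from the example; testing a correlation uses the standard formula t = r·√(N − 2)/√(1 − r²):

```python
import math

def t_from_r(r, n):
    """t-statistic for testing a correlation r against zero in a sample of size n."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Example from the text: r = .5 with N = 40 participants
t = t_from_r(.5, 40)
print(round(t, 2))  # 3.56, with df = 38
```

Looking up this t-value in a t-distribution with 38 degrees of freedom then yields the two-tailed p-value of about .001 mentioned in step 7.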

The important question is what researchers do after they compute a p-value. Here critics of the status quo (the evidential value movement) and Finkel et al. make divergent assumptions.

The Evidential Value Movement

The main assumption of the EVM is that psychologists, including relationship researchers, have interpreted p-values incorrectly. For the most part, the use of p-values in psychology follows Fisher’s original suggestion to use a fixed criterion value of .05 to decide whether a result is statistically significant. In our example of a correlation of r = .5 with N = 40 participants, a p-value of .001 is below .05 and therefore it is sufficiently unlikely that the correlation could have emerged by chance if the real correlation between height and weight was zero. We therefore can reject the nil-hypothesis and infer that there is indeed a positive correlation.

However, if a correlation is not significant (e.g., r = .2, p > .05), the results are inconclusive because we cannot infer from a non-significant result that the nil-hypothesis is true. This creates an asymmetry in the value of significant results. Significant results can be used to claim a discovery (a diet produces weight loss), but non-significant results cannot be used to claim that there is no relationship (a diet has no effect on weight).

This asymmetry explains why most published results in psychology are statistically significant (Sterling, 1959; Sterling et al., 1995). As significant results are more conclusive, journals find it more interesting to publish studies with significant results.


As Sterling (1959) pointed out, if only significant results are published, statistical significance no longer provides valuable information, and as Rosenthal (1979) warned, in theory journals could be filled with significant results even if most results are false positives (i.e., the nil-hypothesis is actually true).

Importantly, Fisher did not prescribe conducting studies only once and publishing only significant results. Fisher clearly stated that results should only be considered credible if replication studies confirm the original results most of the time (say, 8 out of 10 replication studies also produce p < .05). However, this important criterion of credibility was ignored by social psychologists, especially in resource-intensive research areas like relationship research.

To conclude, the main concern among critics of research practices in psychology is that selective publishing of significant results produces results that have a high risk of being false positives (cf. Schimmack, 2020).

The Error Balanced Approach

Although Finkel et al. (2015) do not mention Neyman and Pearson, their error-balanced approach is rooted in Neyman and Pearson’s approach to the interpretation of p-values. This approach is rather different from Fisher’s approach, and it is well documented that Fisher and Neyman-Pearson were in a bitter fight over this issue. Neyman and Pearson introduced the distinction between type-I errors, also called false positives, and type-II errors, also called false negatives.


The type-I error is the same error that one could make in Fisher’s approach: a significant result, p < .05, is falsely interpreted as evidence for a relationship when there is no relationship between the two variables in the population and the observed relationship was produced by sampling error alone.

So, what is a type-II error? It only occurred to me yesterday that most explanations of type-II errors are based on a misunderstanding of Neyman-Pearson’s approach. A simplistic explanation of a type-II error is the inference that there is no relationship when a relationship actually exists. A pregnancy test provides a common illustration: a type-II error occurs when the test suggests that a pregnant woman is not pregnant.

This explains conceptually what a type-II error is, but it does not explain how psychologists could ever make a type-II error. To actually make type-II errors, researchers would have to approach research entirely differently than psychologists actually do. Most importantly, they would need to specify a theoretically expected effect size. For example, researchers could test the nil-hypothesis that the relationship between height and weight is r = 0 against the alternative hypothesis that the relationship is r = .4. They would then need to compute the probability of obtaining a non-significant result under the assumption that the correlation is r = .4. This probability is known as the type-II error probability (beta). Only then can a non-significant result be used to reject the alternative hypothesis that the effect size is .4 or larger with a pre-determined error rate beta. If this suddenly sounds very unfamiliar, the reason is that neither training nor published articles follow this approach. Thus, psychologists never make type-II errors because they never specify a priori effect sizes, and they never use p-values greater than .05 to infer that population effect sizes are smaller than a specified effect size.
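To make concrete what such a beta computation would involve, here is a sketch using the Fisher r-to-z approximation. The alternative r = .4 and N = 40 come from the example above; the critical t of 2.024 (two-tailed, df = 38) is a standard table value, and the approximation is one common way to do this calculation, not a prescribed procedure:

```python
import math

def beta_for_correlation(r_alt, n, t_crit=2.024):
    """Approximate type-II error probability for testing H0: r = 0
    against a specified alternative r_alt, via the Fisher z approximation."""
    df = n - 2
    # correlation that just reaches two-tailed significance at this criterion
    r_crit = t_crit / math.sqrt(t_crit ** 2 + df)
    se = 1 / math.sqrt(n - 3)                        # SE of Fisher's z
    z = (math.atanh(r_crit) - math.atanh(r_alt)) / se
    power = 0.5 * (1 - math.erf(z / math.sqrt(2)))   # P(observed r exceeds r_crit)
    return 1 - power

beta = beta_for_correlation(r_alt=.4, n=40)
print(round(beta, 2))  # roughly .27
```

In other words, with N = 40 a non-significant result would still miss a true correlation of .4 about a quarter of the time, and only a researcher who specified r = .4 in advance could attach this error rate to the non-significant result.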

However, psychologists often seem to believe that they are following Neyman-Pearson because statistics is often taught as a convoluted, incoherent mishmash of the two approaches (Gigerenzer, 1993). It also seems that Finkel et al. (2015) falsely assumed that psychologists follow Neyman-Pearson’s approach and carefully weigh the risks of type-I and type-II errors. For example, they write:

Psychological scientists typically set alpha (the theoretical possibility of a false positive) at .05, and, following Cohen (1988), they frequently set beta (the theoretical possibility of a false negative) at .20.

It is easy to show that this is not the case. To set the probability of a type-II error at 20%, psychologists would need to specify an effect size that gives them an 80% probability (power) to reject the nil-hypothesis, and they would then report the results with the conclusion that the population effect size is less than their a priori specified effect size. I have read more than 1,000 research articles in psychology and I have never seen an article that followed this approach. Moreover, it has been noted repeatedly that sample sizes are determined on an ad hoc basis with little concern about low statistical power (Cohen, 1962; Sedlmeier & Gigerenzer, 1989; Schimmack, 2012; Sterling et al., 1995). Thus, the claim that psychologists are concerned about beta (type-II errors) is delusional, even if many psychologists believe it.

Finkel et al. (2015) suggest that an optimal approach to research would balance the risk of false positive results against the risk of false negative results. However, once more they ignore that false negatives can only be defined relative to a clearly specified effect size.

Estimates of false positive and false negative rates in situations like these would go a long way toward helping scholars who work with large datasets to refine their confirmatory and exploratory hypothesis testing practices to optimize the balance between false-positive and false-negative error rates.

Moreover, they are blissfully unaware that false positive rates are abstract entities because it is practically impossible to verify that the relationship between two variables in a population is exactly zero. Thus, neither false positives nor false negatives are clearly defined and therefore cannot be counted to compute rates of their occurrences.

Without any information about the actual rate of false positives and false negatives, it is of course difficult to say whether current practices produce too many false positives or false negatives. A simple recommendation would be to increase sample sizes because higher statistical power reduces the risk of false negatives and the risk of false positives. So, it might seem like a win-win. However, this is not what Finkel et al. considered to be best practices.

As discussed previously, many policy changes oriented toward reducing false-positive rates will exacerbate false-negative rates

This statement is blatantly false and ignores recommendations to test fewer hypotheses in larger samples (Cohen, 1990; Schimmack, 2012).

They further make unsupported claims about the difficulty of correcting false positive results and false negative results. The evidential value critics have pointed out that current research practices in psychology make it practically impossible to correct a false positive result. Classic findings that failed to replicate are often cited and replications are ignored. The reason is that p < .05 is treated as strong evidence, whereas p > .05 is treated as inconclusive, following Fisher’s approach. If p > .05 was considered evidence against a plausible hypothesis, there would be no reason not to publish it (e.g., a diet does not decrease weight by more than .3 standard deviations in a study with 95% power, p < .05).

We are especially concerned about the evidentiary value movement’s relative neglect of false negatives because, for at least two major reasons, false negatives are much less likely to be the subject of replication attempts. First, researchers typically lose interest in unsuccessful ideas, preferring to use their resources on more “productive” lines of research (i.e., those that yield evidence for an effect rather than lack of evidence for an effect). Second, others in the field are unlikely to learn about these failures because null results are rarely published (Greenwald, 1975). As a result, false negatives are unlikely to be corrected by the normal processes of reconsideration and replication. In contrast, false positives appear in the published literature, which means that, under almost all circumstances, they receive more attention than false negatives. Correcting false positive errors is unquestionably desirable, but the consequences of increasingly favoring the detection of false positives relative to the detection of false negatives are more ambiguous.

This passage makes no sense. As the authors themselves acknowledge, the key problem with existing research practices is that non-significant results are rarely published (“because null-results are rarely published”). In combination with low statistical power to detect small effect sizes, this selection implies that researchers will often obtain non-significant results that are not published. However, it also means that published significant results often inflate the effect size because the true population effect size alone is too weak to produce a significant result. Only with the help of sampling error, the observed relationship is strong enough to be significant. So, many correlations that are r = .2 will be published as correlations of r = .5. The risk of false negatives is also reduced by publication bias. Because researchers do not know that a hypothesis was tested and produced a non-significant result, they will try again. Eventually, a study will produce a significant result (green jelly beans cause acne, p < .05), and the effect size estimate will be dramatically inflated. When follow-up studies fail to replicate this finding, these replication results are again not published because non-significant results are considered inconclusive. This means that current research practices in psychology never produce type-II errors, only produce type-I errors, and type-I errors are not corrected. This fundamentally flawed approach to science has created the replication crisis.
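This selection effect is easy to demonstrate with a short simulation (a hedged Python sketch with illustrative numbers, not data from any study discussed here): many studies of a weak true effect are run, only significant results are "published," and the average published result greatly overstates the truth.

```python
import random

random.seed(1)
TRUE_Z = 1.0   # weak true effect: expected z-score of 1 (about 17% power)
studies = [random.gauss(TRUE_Z, 1) for _ in range(100_000)]
published = [z for z in studies if z > 1.96]   # selection for significance

print(round(sum(studies) / len(studies), 2))       # all studies: ~1.0
print(round(sum(published) / len(published), 2))   # published only: ~2.5
```

The same inflation occurs on the effect-size metric: with selection for significance, a population correlation of r = .2 can easily appear as r = .5 in the published record.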

In short, while evidential value critics and Finkel agree that statistical significance drives editorial decisions, they draw fundamentally different conclusions from this practice. Finkel et al. falsely label non-significant results in small samples as false negative results, but they are not false negatives in Neyman-Pearson’s approach to significance testing. They are, however, inconclusive results, and the best practice to avoid inconclusive results would be to increase statistical power and to specify type-II error probabilities for reasonable effect sizes.

Finkel et al. (2015) are less concerned about calls for higher statistical power. They are more concerned with the introduction of badges for materials sharing, data sharing, and preregistration as “quick-and-dirty indicator of which studies, and which scholars, have strong research integrity” (p. 292).

Finkel et al. (2015) might therefore welcome cleaner and more direct indicators of research integrity that my colleagues and I have developed over the past decade that are related to some of their key concerns about false negative and false positive results (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012; Schimmack, 2020). To illustrate this approach, I am using Eli J. Finkel’s published results.

I first downloaded published articles from major social and personality journals (Schimmack, 2020). I then converted these pdf files into text files and used R-code to find statistical results that were reported in the text. I then used a separate R-code to search these articles for the name “Eli J. Finkel,” excluding thank-you notes. I then selected the subset of test statistics that appeared in publications by Eli J. Finkel. The extracted test statistics are available in the form of an Excel file (data). The file contains 1,638 useable test statistics (z-scores between 0 and 100).

A z-curve analysis converts all published test statistics into p-values. The p-values are then converted into z-scores on a standard normal distribution. Because the sign of an effect does not matter, all z-scores are positive. The higher a z-score, the stronger is the evidence against the null-hypothesis. Z-scores greater than 1.96 (red line in the plot) are significant with the standard criterion of p < .05 (two-tailed). Figure 1 shows a histogram of the z-scores between 0 and 6; 143 z-scores exceed the upper value. They are included in the calculations, but not shown.
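The conversion step can be sketched in a few lines of Python (a minimal stdlib illustration; the actual analyses reported here use the zcurve R-package):

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-tailed p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))   # 1.96, the standard significance criterion
print(round(p_to_z(0.005), 2))  # 2.81, the stricter criterion used below
```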

The first notable observation in Figure 1 is that the peak (mode) of the distribution is just to the right side of the significance criterion. It is also visible that there are more results just to the right (p < .05) than to the left (p > .05) around the peak. This pattern is common and reflects the well-known tendency for journals to favor significant results.

The advantage of a z-curve analysis is that it is possible to quantify the amount of publication bias. To do so, we can compare the observed discovery rate with the expected discovery rate. The observed discovery rate is simply the percentage of published results that are significant. Finkel published 1,031 significant results out of 1,638 test statistics, an observed discovery rate of 63%.

The expected discovery rate is based on a statistical model. The statistical model is fitted to the distribution of significant results. To produce the distribution of significant results in Figure 1, we assume that they were selected from a larger set of tests that produced significant and non-significant results. Based on the mean power of these tests, we can estimate the full distribution before selection for significance. Simulation studies show that these estimates match simulated true values reasonably well (Bartos & Schimmack, 2020).

The expected discovery rate is 26%. This estimate implies that the average power of statistical tests conducted by Finkel is low. With over 1,000 significant test statistics, it is possible to obtain a fairly tight confidence interval around this estimate, 95%CI = 11% to 44%. The confidence interval does not include 50%, showing that the average power is below 50%, which is often considered a minimum value for good science (Tversky & Kahneman, 1971). The 95% confidence interval also does not include the observed discovery rate of 63%. This shows the presence of publication bias. These results are by no means unique to Finkel. I was displeased to see that a z-curve analysis of my own articles produced similar results (ODR = 74%, EDR = 25%).

The EDR estimate is not only useful to examine publication bias. It can also be used to estimate the maximum false discovery rate (Soric, 1989). That is, although it is impossible to specify how many published results are false positives, it is possible to quantify the worst case scenario. Finkel’s EDR estimate of 26% implies a maximum false discovery rate of 15%. Once again, this is an estimate and it is useful to compute a confidence interval around it. The 95%CI ranges from 7% to 43%. On the one hand, this makes it possible to reject Ioannidis’ claim that most published results are false. On the other hand, we cannot rule out that some of Finkel’s significant results were false positives. Moreover, given the evidence that publication bias is present, we cannot rule out the possibility that non-significant results that failed to replicate a significant result are missing from the published record.
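Soric’s bound has a simple closed form: the maximum false discovery rate is (1/EDR − 1) × alpha/(1 − alpha). A quick Python check (a sketch of the bound as used in z-curve analyses) reproduces the 15% figure:

```python
def soric_max_fdr(edr, alpha=0.05):
    """Soric's (1989) maximum false discovery rate implied by an
    expected discovery rate (EDR) at significance level alpha."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.26), 2))   # 0.15, the 15% reported above
```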

A major problem for psychologists is the reliance on p-values to evaluate research findings. Some psychologists even falsely assume that p < .05 implies that 95% of significant results are true positives. As we see here, the risk of false positives can be much higher, but significance does not tell us which p-values below .05 are credible. One solution to this problem is to focus on the false discovery rate as a criterion. This approach has been used in genomics to reduce the risk of false positive discoveries. The same approach can also be used to control the risk of false positives in other scientific disciplines (Jager & Leek, 2014).

To reduce the false discovery rate, we need to lower the criterion to declare a finding a discovery. A team of researchers suggested lowering alpha from .05 to .005 (Benjamin et al., 2017). Figure 2 shows the results if this criterion is applied to Finkel’s published results. We now see that the number of significant results is only 579, but that is still a lot of discoveries. We see that the observed discovery rate decreased to 35%. The reason is that many of the just significant results with p-values between .05 and .005 are no longer considered to be significant. We also see that the expected discovery rate increased! This requires some explanation. Figure 2 shows that there is an excess of significant results between .05 and .005. These results are not fitted to the model. The justification for this is that these results are likely to have been obtained with questionable research practices. By disregarding them, the remaining significant results below .005 are more credible and the observed discovery rate is in line with the expected discovery rate.

The results look different if we do not assume that questionable practices were used. In this case, the model can be fitted to all p-values below .05.

If we assume that p-values are simply selected for significance, the decrease of p-values from .05 to .005 implies that there is a large file-drawer of non-significant results and the expected discovery rate with alpha = .005 is only 11%. This translates into a high maximum false discovery rate of 44%, but the 95%CI is wide and ranges from 14% to 100%. In other words, the published significant results provide no credible evidence for the discoveries that were made. It is therefore charitable to attribute the peak of just significant results to questionable research practices so that p-values below .005 provide some empirical support for the claims in Finkel’s articles.

Discussion

Ultimately, science relies on trust. For too long, psychologists have falsely assumed that most if not all significant results are discoveries. Bem’s (2011) article made many psychologists realize that this is not the case, but this awareness created a crisis of confidence. Which significant results are credible and which ones are false positives? Are most published results false positives? During times of uncertainty, cognitive biases can have a strong effect. Some evidential value warriors saw false positive results everywhere. Others wanted to believe that most published results are credible. These extreme positions are not supported by evidence. The reproducibility project showed that some results replicate and others do not (Open Science Collaboration, 2015). To learn from the mistakes of the past, we need solid facts. Z-curve analyses can provide these facts. They can also help to separate more credible p-values from less credible p-values. Here, I showed that about half of Finkel’s discoveries can be salvaged from the wreckage of the replication crisis in social psychology by using p < .005 as a criterion for a discovery.

However, researchers may also have different risk preferences. Maybe some are more willing to build on a questionable, but intriguing finding than others. Z-curve analysis can accommodate personalized risk-preferences as well. I shared the data here and an R-package is available to fit z-curve with different alpha levels and selection thresholds.

Aside from these practical implications, this blog post also made a theoretical observation. The term type-II error, or false negative, is often used loosely and incorrectly. Until yesterday, I also made this mistake. Finkel et al. (2015) use the term false negative to refer to all non-significant results where the nil-hypothesis is false. They then worry that there is a high risk of false negatives that needs to be counterbalanced against the risk of false positives. However, not every trivial deviation from zero is meaningful. For example, a diet that reduces weight by 0.1 pounds is not worth studying. A real type-II error is made when researchers specify a meaningful effect size, conduct a high-powered study to find it, and then falsely conclude that an effect of this magnitude does not exist. To make a type-II error, it is necessary to conduct studies with high power. Otherwise, beta is so high that it makes no sense to draw a conclusion from the data. As average power in psychology in general and in Finkel’s studies in particular is low, it is clear that they did not make any type-II errors. Thus, I recommend increasing power to finally achieve a balance between type-I and type-II errors, which requires making some type-II errors some of the time.

References

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum, Inc.

Replicability 101: How to interpret the results of replication studies

Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015).  This article gives a basic introduction to the interpretation of statistical results within the Neyman Pearson approach to statistical inferences.

I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures.  First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence).  Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent.  Second, I  point out that publication bias undermines the credibility of significant results in original studies.  When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.

DEFINITION OF REPLICATING A STATISTICAL RESULT

Replicating something means to get the same result.  If I make the first free throw, replicating this outcome means to also make the second free throw.  When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”

If we conduct psychological studies, we can control many factors, but some factors are not under our control.  Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other.  It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.

To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study.  The most common outcome in psychological studies is a significant or non-significant result.  The goal of a study is to produce a significant result and for this reason a significant result is often called a success.  A successful replication study is a study that also produces a significant result.  Obtaining two significant results is akin to making two free throws.  This is one of the few agreements between Maxwell and me.

“Generally speaking, a published  original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)

The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).

I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical.  Otherwise, the replication study is not really a replication study.

There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures. The probability of these outcomes depends on the significance criterion (the type-I error probability) when the null-hypothesis is true and on the statistical power of a study when the null-hypothesis is false.

Table 1 shows the probability of the outcomes in two studies.  The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts.  This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.

Table 1

             2 sig, 0 ns    1 sig, 1 ns          0 sig, 2 ns
H0 is true   alpha^2        2*alpha*(1-alpha)    (1-alpha)^2
H1 is true   (1-beta)^2     2*(1-beta)*beta      beta^2
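The cell probabilities follow from treating the two studies as independent Bernoulli trials, where the success probability is alpha under H0 and power (1 − beta) under H1. A small Python sketch reproduces the numbers used below:

```python
def pair_outcome_probs(p_sig):
    """Probabilities of (2, 1, 0) significant results in two independent
    studies; p_sig is alpha under H0 and power (1 - beta) under H1."""
    return (p_sig ** 2, 2 * p_sig * (1 - p_sig), (1 - p_sig) ** 2)

print(pair_outcome_probs(0.05))   # under H0: (~.0025, ~.095, ~.9025)
print(pair_outcome_probs(0.40))   # with 40% power: (~.16, ~.48, ~.36)
```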

The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability).  High power is needed to get significant results in a pair of studies (an original study and a replication study).  For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012).  Even with conventionally acceptable power of 80%, only 2/3 (64%) of replication attempts would produce this outcome.  However, studies in psychology do not have 80% power and estimates of power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce two significant results in no more than 16 out of 100 attempts.   Although successful replications of true effects with low power are unlikely, they are still much more likely than two significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1).  It is therefore reasonable to infer from two significant results that the null-hypothesis is false.

If the null-hypothesis is true, it is extremely likely that both studies produce a non-significant result (.95^2 = 90.25%).  In contrast, it is unlikely that even a study with modest power would produce two non-significant results.  For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%.  This means, it is much more likely (22.5 : 1) that the null-hypothesis is true than that the alternative hypothesis is true.  This does not mean that the null-hypothesis is true in an absolute sense because power depends on the effect size.  For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5,  two non-significant results would suggest that the effect size is smaller than .5, but it does not warrant the conclusion that H0 is true and the effect size is exactly 0.  Once more, it is important to distinguish between the absence of evidence for an effect and the evidence of absence of an effect.

The most controversial scenario assumes that the two studies produced inconsistent outcomes.  Although theoretically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low;  .05 * (1 - .05) = .0475.  The same probability exists for the reverse pattern that a non-significant result is followed by a significant one.  A probability of 4.75% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, the low probability is mostly due to the low probability of obtaining a significant result in the first study, while the replication failure is extremely likely.

Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false.  The probability of this outcome depends on statistical power.  A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result.  The probability of this rare event can be as low as, or lower than, the probability with a true null effect; .95 * (1 - .95) = .0475.  Thus, an inconsistent result provides little information about the probability of a type-I or type-II error and is difficult to interpret.

In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false.  Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high.  When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.

Meta-Analysis 

The counting of successes and failures is an old way to integrate information from multiple studies.  This approach has low power and is no longer used.  A more powerful approach is effect size meta-analysis.  Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project.  Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.

“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)

I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies.  A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.

Maxwell et al. (2015) introduced an example of a between-subject (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size).  This study has 59% power to obtain a significant result.  Thus, it is quite likely that a pair of studies produces inconsistent results (48.38%).   However, a pair of studies with N = 80 each has the power of a total sample size of N = 160, which means a fixed-effects meta-analysis will produce a significant result in 88% of all attempts.  Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%).   Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.
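These power figures can be checked with a normal approximation to the two-sample test (a stdlib Python sketch; exact t-test power differs slightly from the values below):

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, z_crit=1.96):
    """Approximate two-tailed power of a two-sample z-test for Cohen's d."""
    ncp = d / sqrt(2 / n_per_group)          # expected z-score under H1
    return 1 - NormalDist().cdf(z_crit - ncp)

print(round(approx_power(0.5, 40), 2))   # ~0.61 (exact t-test: ~0.59)
print(round(approx_power(0.5, 80), 2))   # ~0.89 for the pooled N = 160
```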

A more interesting scenario is inconsistent results when the null-hypothesis is true.  I turned to simulations to examine this scenario more closely.   The simulation showed that a meta-analysis of inconsistent studies produced a significant result in 34% of all cases.  The percentage varies slightly as a function of sample size.  With a small sample of N = 40, the percentage is 35%. With a large sample of 1,000 participants it is 33%.  This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis based on a significant original study.  Thus, if an original study produced a false-positive result, a failed replication study corrects this error in 2 out of 3 cases.  Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.
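The simulation can be sketched with the standard library (a simplified version that combines raw z-scores with equal weights, Stouffer-style; the exact percentage depends on the simulation details, but it lands near one third):

```python
import random
from math import sqrt

random.seed(2)
pairs = sig_meta = 0
while pairs < 50_000:
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)  # both studies under H0
    if abs(z1) > 1.96 and abs(z2) <= 1.96:           # sig original, ns replication
        pairs += 1
        z_meta = (z1 + z2) / sqrt(2)                 # equal-weight combination
        sig_meta += abs(z_meta) > 1.96
print(round(sig_meta / pairs, 2))   # roughly one third remain significant
```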

In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies.  If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.

DIFFERENCES IN SAMPLE SIZES

We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990).  This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers.  If original researchers conducted studies with adequate power,  an exact replication study with the same sample size would also have adequate power.  If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size.  As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.

From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size.  Studies with larger sample sizes are given more weight than studies with smaller samples.  Thus, researchers who invest more resources are rewarded by giving their studies more weight.  Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same.  Nevertheless, failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magical number about the size of a replication study to be useful.

I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5).  In this simulation, only 21% of meta-analyses produced a significant result.  This is 13 percentage points lower than in the simulation with equal sample sizes (34%).  If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.
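A hedged sketch of this scenario (again with raw z-scores; studies are weighted by the square root of their sample sizes, as in a fixed-effects combination of z-scores, so exact percentages differ somewhat from the numbers above):

```python
import random
from math import sqrt

def remaining_false_positives(n_orig, n_rep, trials=50_000, seed=3):
    """Fraction of sig-original / ns-replication pairs (both under H0)
    that remain significant after a sample-size weighted combination."""
    rng = random.Random(seed)
    w1, w2 = sqrt(n_orig), sqrt(n_rep)
    kept = sig = 0
    while kept < trials:
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        if abs(z1) > 1.96 and abs(z2) <= 1.96:
            kept += 1
            z_meta = (w1 * z1 + w2 * z2) / sqrt(n_orig + n_rep)
            sig += abs(z_meta) > 1.96
    return sig / kept

print(remaining_false_positives(80, 200))   # lower than with equal samples
print(remaining_false_positives(80, 800))   # lower still
```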

The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings.  Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase sample sizes of replication studies to make replication failures meaningful.  Given unlimited resources, larger replications are better, but these analyses show that large replication studies are not necessary.  A replication study with the same sample size as the original study is more valuable than no replication study at all.

CONFUSING ABSENCE OF EVIDENCE WITH EVIDENCE OF ABSENCE

One problem in Maxwell et al.’s (2015) article is that it conflates two possible goals of replication studies.  One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive result, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis.  This finding would mean that evidence for an effect is absent.  The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result, and the burden of proof shifts back to proponents of the hypothesis to provide unbiased, credible evidence for it.

Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i.e., humans do not have supernatural abilities).  Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect.  They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.

This asymmetry makes it easy to publish (potentially false) positive results and very hard to remove false positive results from the literature.  It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the one in the original study.  The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported.  To avoid the problem that replication researchers have to invest large amounts of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence in original articles.  If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.

THE DIRTY BIG SECRET

The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias.  They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling, 1959; Sterling et al., 1995).

It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance  (Schimmack, 2012).

I used Bem’s infamous article on supernatural abilities as an example, which appeared to show 8 successful replications of supernatural abilities.  Ironically, Maxwell et al. (2015) also cite Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.

“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”

This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP.  They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation.”  There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.

More important, Maxwell et al. ignore the broader context of these studies.  In Schimmack (2012), I discussed many questionable practices in Bem’s original studies and presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices.  Given this wider context, it is entirely reasonable to favor the experimental artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies while ignoring that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

Maxwell et al. (2015) were not aware that in the same year, the OSC (2015) reproducibility project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%.  The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that all failed replications are false positives, but it does mean that it is not clear which findings are false positives and which findings are not.  Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing.  In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results.  This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence, and the large number of replication failures means that more replication studies with larger samples are needed to see which hypotheses predict real effects with practical significance.

DOES PSYCHOLOGY HAVE A REPLICATION CRISIS?

Maxwell et al.’s (2015) answer to this question is captured in this sentence: “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis, we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496).  The most important part of this quote is “raising doubt”; the rest is Orwellian double-talk.

The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern.  As I have pointed out, this conclusion is based on misconceptions about the purpose of replication studies and on blissful ignorance about publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spent time and resources on demonstrating that unbiased replication studies fail.

The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated.  In my opinion that is not only a crisis, but a scandal, because psychologists routinely apply for funding with power analyses that claim 80% power.  The reproducibility project shows that the true power to obtain significant results in original and replication studies is much lower than this and that the 90% success rate is no more meaningful than 90% votes for a candidate in communist elections.

In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.”  They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”

As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), (c) the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of outcome, and (f) the replication study was preregistered.

We can only speculate why the American Psychologist published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail.  Fortunately, APA can no longer control what is published, because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticizing peer-reviewed articles in open post-publication peer review on social media.

Long live the replicability revolution!

REFERENCES

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312. http://dx.doi.org/10.1037/0003-066X.45.12.1304

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487


Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings  are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of the simulations that Ioannidis used to support his claim.

This blog post shows that these simulations rest on questionable assumptions and demonstrates with empirical data that Ioannidis’s simulations are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

TP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the proportion of true hypotheses (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error probability (beta).

(2) TP = PTH * Power = PTH * (1 – beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))
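Equation (4) can be checked numerically with a few lines of code. This is a minimal sketch; the function name is mine:

```python
def ppv(pth, power, alpha=0.05):
    """Positive predictive value: share of significant results that are true positives.
    pth = proportion of true hypotheses; the rest (1 - pth) are false hypotheses."""
    pfh = 1.0 - pth
    true_pos = pth * power        # equation (2)
    false_pos = pfh * alpha       # equation (3)
    return true_pos / (true_pos + false_pos)  # equation (4)

# Example: half of all hypotheses true, 50% power, alpha = .05
print(round(ppv(0.5, 0.5), 3))  # 0.909
```

With these plausible inputs, over 90% of significant results are true positives, which foreshadows why strong additional assumptions are needed to push the PPV below 50%.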

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true for more than 50% of the published results that rejected it; that is, most published significant results would be false positives.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplified to the inequality

(6) alpha > PTH/PFH * power

We can rearrange formula (6) and substitute PFH with (1-PTH) to determine the maximum proportion of true hypotheses at which more than 50% of positive results are false.

(7a) alpha = PTH/(1-PTH) * power

(7b) alpha*(1-PTH) = PTH * power

(7c) alpha – PTH*alpha = PTH * power

(7d) alpha = PTH*alpha + PTH*power

(7e) alpha = PTH*(alpha + power)

(7f) PTH = alpha/(power + alpha)


Table 1 shows the results.

Power     PTH / PFH
90%         5 / 95
80%         6 / 94
70%         7 / 93
60%         8 / 92
50%         9 / 91
40%        11 / 89
30%        14 / 86
20%        20 / 80
10%        33 / 67
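The entries in Table 1 follow directly from formula (7f); a quick check (my own sketch):

```python
alpha = 0.05
thresholds = {}
for power in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
    # (7f): smallest proportion of true hypotheses at which the PPV reaches 50%
    thresholds[power] = alpha / (power + alpha)
    print(f"{power:.0%} power: PTH = {thresholds[power]:.0%}")
```

Rounding the computed thresholds to whole percentages reproduces the PTH column of Table 1.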

Even if researchers conducted studies with only 20% power to discover true positive results, more than 50% of published positive results would be false only if fewer than 20% of tested hypotheses were true. This makes it rather implausible that most published results could be false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced by various questionable research practices that help researchers to report significant results. The main effect of these practices is to increase the probability that a false hypothesis produces a significant result.

Simmons et al. (2011) showed that massive use of several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an inflated alpha of 60%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power     PTH / PFH
90%        40 / 60
80%        43 / 57
70%        46 / 54
60%        50 / 50
50%        55 / 45
40%        60 / 40
30%        67 / 33
20%        75 / 25
10%        86 / 14
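The entries in Table 2 match formula (7f) with the inflated alpha set to .60, the false positive rate that Simmons et al. (2011) reported; a quick check (my own sketch):

```python
alpha = 0.60  # inflated type-I error risk after p-hacking (Simmons et al., 2011)
rows = []
for power in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
    pth = alpha / (power + alpha)  # formula (7f)
    rows.append(round(100 * pth))
print(rows)  # [40, 43, 46, 50, 55, 60, 67, 75, 86]
```

Substituting a smaller inflated alpha would shift all thresholds down, so the table represents a worst-case level of bias.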

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypotheses with 50% power and 50% of tested hypotheses were false.

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true and false hypotheses. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. In effect, Ioannidis’s assumption implies that bias increases the proportion of false positive results much more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypotheses produce a significant result. With a bias factor of .4, however, 40% of the false negatives become significant as well, adding another .4*.5 = 20 percentage points of true positive results. This gives a total of 70% positive results, a 40% increase over the number that would have been obtained without bias. This increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38 percentage points of false positive results. Instead of 5%, the rate of false positive results is now 43%, an increase of 760%. Thus, the effect of bias is not symmetric: a 40% increase of false positives has a much stronger impact on the PPV than a 40% increase of true positives. Ioannidis provides no rationale for this bias model.
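The arithmetic in this example can be verified directly (a sketch; the variable names are mine):

```python
power, alpha, bias = 0.50, 0.05, 0.40

# True hypotheses: bias converts 40% of the false negatives into positives.
true_pos = power + bias * (1 - power)   # 0.50 + 0.40 * 0.50 = 0.70
# False hypotheses: bias converts 40% of the true negatives into false positives.
false_pos = alpha + bias * (1 - alpha)  # 0.05 + 0.40 * 0.95 = 0.43

print(true_pos, false_pos)
print(f"{false_pos / alpha - 1:.0%}")  # relative increase in false positives
```

The same 40% bias factor raises true positives by 40% but multiplies false positives by a factor of 8.6, which is why this bias model depresses the PPV so strongly.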

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000* .30*.10*.20 = 6 true positives results compared to 1000* .30*.90*.95 = 265 false positive results (i.e., 44:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.” (e124).

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’s claims rest on the opposite assumption that most hypotheses are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, with 50% power even massive bias would still produce more true than false positive results, as long as the null-hypothesis is false in more than 55% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

10 years after the publication of “Why Most Published Research Findings Are False,”  it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.
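Converting a reported two-sided p-value into a z-value only requires the inverse normal distribution function; here is a minimal sketch using the Python standard library (an illustration of the conversion, not the z-curve code):

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-sided p-value into the corresponding absolute z-value."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))   # 1.96
print(round(p_to_z(0.001), 2))  # 3.29
```

Test statistics such as t- or F-values are first converted into p-values and then into z-values, so that results from different designs can be pooled in one histogram.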

Figure 1 illustrates the distribution of z-values that is expected under Ioannidis’s model for an “adequately powered exploratory epidemiological study” (Simulation 6 in Table 4). Ioannidis assumes that for every true hypothesis there are 10 false hypotheses (R = 1:10), that studies have 80% power to detect a true positive, and that bias is 30%.

[Figure 1. Powergraph simulating Ioannidis’s “adequately powered exploratory epidemiological study” (ioannidis-fig6)]

A 30% bias implies that for every 100 false hypotheses, there would be 33.5 (100*[.30*.95 + .05]) rather than 5 false positive results. The effect on true hypotheses is much smaller: with 80% power, bias raises the rate of true positives from 80 to 86 per 100 (100*[.80 + .30*.20]). Given the assumed 1:10 ratio of true to false hypotheses, the simulation assumed that researchers tested 100,000 false hypotheses, yielding 33,500 false positive results, and 10,000 true hypotheses, yielding 8,600 true positive results. Bias was modeled by increasing the number of attempts to produce a significant result until these predicted proportions of true and false positives were reached.
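Under these assumptions, the expected PPV and the average replicability of the significant results can be computed directly. This is my own sketch of the mixture calculation, not the simulation code:

```python
power, alpha, bias = 0.80, 0.05, 0.30
ratio_false_to_true = 10  # R = 1:10

# Probability of a significant result, with bias, per hypothesis type.
sig_true = power + bias * (1 - power)   # 0.86
sig_false = alpha + bias * (1 - alpha)  # 0.335

true_pos = 1 * sig_true
false_pos = ratio_false_to_true * sig_false
ppv = true_pos / (true_pos + false_pos)

# Average replicability: true positives replicate with probability = power,
# false positives replicate only at the alpha level.
replicability = (true_pos * power + false_pos * alpha) / (true_pos + false_pos)

print(f"{ppv:.2f} {replicability:.2f}")  # 0.20 0.20
```

Both the PPV and the expected replicability come out at about 20%, which is the benchmark that the z-curve estimate in Figure 1 is compared against.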

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values fall in the range between 1.96 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would reproduce a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%; the z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows the replicability of studies that produced an observed z-score greater than 3 (p < .001); the estimate is an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than .001, even if they were reported as significant with p < .05.

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

[Figure 2. Powergraph simulating Ioannidis’s Simulation 1, PPV = .85 (ioannidis-fig1)]

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false.  The maximum estimate would be obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 52.5%. As power is much lower than 100%, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from true positives obtained with extremely low power. In contrast, if data look more like those in Figure 2, the evidence contradicts Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicability that could be obtained with 50% false positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%.  However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power, 100% bias, and an equal percentage of true and false hypotheses. The true replicability for this scenario is .05*.50 + .90*.50 = 47.5%. Z-curve slightly overestimates replicability and produced an estimate of 51%.  Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false.  Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positives and true positives with high power, the replicability estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggests that a high proportion of studies produced true positive results with low power.

[Figure 3. Powergraph simulating 90% power, 100% bias, and 50% true hypotheses (ioannidis-fig3)]

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicability Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all non-significant results were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.

[Powergraph for Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP-LMC3.g)]

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

[Powergraph for Psychological Science (PsySci3.g)]

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

[Powergraph for Journal of Personality and Social Psychology: Attitudes and Social Cognition (JPSP-ASC3.g)]

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on test results that tested an original, novel, or an important finding, whereas articles also often report significance tests for other effects. For example, an intervention study may show a strong decrease in depression, when only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.