“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)
In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.
The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results. “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results. ” (Bem, 2011, p. 407).
The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.
In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.
The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt about Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem which was based on over 1,000 participants. A more informative replication attempt was made by Galek et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not.” (p. 941).
Galek et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value, if the reported data and analyses were not influenced by the result. If, however, data or analyzes were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis, if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntary (other-deception). Both, self-deception and other-deception introduce bias in the scientific record.
Replication studies cannot provide evidence about bias in original studies. A replication study only tells us that other scientists were unable to replicate original findings, but they do not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.
I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely. Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.
The simplest and most parsimonious explanation for a stunning original result that cannot be replicate is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.
If Bem or other scientists would encounter one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies will not replicate the effect.
Attributing Bem’s results to luck would be possible, if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt.
The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.
Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is evidence that strengthens beliefs in a phenomenon that actually does not exist.
2. Questionable Research Practices
The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies. This finding suggests that questionable research practices were used, but they do not provide information about the actual research practices that were used. John et al. listed a number of questionable research practices that might explain Bem’s findings.
2.1. Multiple Dependent Variables
One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.
2.2. Failure to report all conditions
This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.
2.3 Generous Rounding
Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test-statistics or by redoing the statistical analysis from original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a p-value falls into the range between .05 and .055 and that this could happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.
Hypothesizing after results are known (Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except Study 1. The studies build on a meta-analysis of prior studies and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing can also not explain Bem’s findings.
2.5 Excluding of Data
Sometimes non-significant results are caused by an an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 participants was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023 into a non-significant result. Across all studies, only 4 participants out of 1075 participants were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.
2.6 Stopping Data Collection Early
Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.
In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.
2.7 Optional Stopping/Snooping
Collecting more data, if the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The increase in the risk of a false positive results depends on how frequently and how often researchers check results. If researchers use optional stopping, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes after a long time. Second, sample size would be negatively correlated with observed effect sizes. The reason is that larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion because data collection is terminated as soon as p-values reach the criterion value.
The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).
Despite these warning signs, optional stopping cannot explain Bem’s results, if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to produce significant results. In a simulation study, with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.
2.8 Selective Reporting
The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Fraonco et al., 2014).
Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.
Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.
As a strong believer in ESP, Bem may have paid $90,000 dollars to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies, would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology.
“The integrity of the scientific enterprise requires the reporting of disconfirming results.”
In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.
3. The Decline Effect and a New Questionable Research Practice
When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased. This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).
Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.
The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.
Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants none of the 10 tests is statistically significant. A meta-analysis across the 10 studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1. P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.
|EXPERIMENT||S 1-50||S 51-100|
|EXP1||p = .004||p = .194|
|EXP2||p = .096||p = .170|
|EXP3||p = .039||p = .100|
|EXP4||p = .033||p = .067|
|EXP5||p = .013||p = .069|
|EXP6a||p = .412||p = .126|
|EXP5b||p = .023||p = .410|
|EXP7||p = .020||p = .338|
|EXP8||p = .010||p = .318|
|EXP9||p = .003||NA|
There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continuous to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated a researcher to collect more data is sufficient to maintain statistical significance because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).
In this scenario, researchers are deceiving themselves that slight experimental manipulations apparently have huge effects on their dependent variable because sampling error in small samples is very large. Observed effect sizes in small samples can range from 1 to -1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the believe that “the success of replications in psychological research often depends on subtle and unknown factors.” (p. 422). How could Bem reconcile this believe with the reporting of 9 out of 10 successes? The most plausible explanation is that successes are a selected set of findings out of many attempts that were not reported.
There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection. In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.
“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”
Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they did not peek at the data and terminated studies when the results were not promising.
Additional hints come from an interview with Engber (2017).
“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.
In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result, then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practices capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.
Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one. Schooler (2011) proposed that something more intriguing may cause decline effects.
“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.”
Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these processes is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiment a plausible explanation for replication failures. If a researcher collects too many data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effect may not be observable all the time. Thus, sometimes initial trials will also not show the effect.
I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.
Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.
The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).
The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality. The main problem appears to be a practice to peek at results in small samples with different procedures and to attribute differences in results to the experimental procedures, while ignoring the influence of sampling error.
Conceptual Replications and Hidden Moderators
In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At least, exact replications should be successful more often than fail (Tversky and Kahneman, 1971).
Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).
The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear, why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and a failure in the other studies are called hidden moderator.
Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This belief sustains beliefs in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.
The sensitivity of social psychological effects to small changes in experimental procedures also justifies that it is necessary to conduct many studies that are expected to fail, just like medieval alchemists expected many failures in their attempts to make gold. These failures are not important. They are simply needed to find the conditions that produce the desired outcome; a significant result that supports researchers’ predictions.
The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in the search for a predicted outcome without realizing that a few successes among many failures are expected simply due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologist to deceive themselves and others for decades.
Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.
Another important article for the replication crisis in psychology examined the probability that questionable research practices can produce false positive results (Simmons, Nelson, & Simonsohn, 2011). The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used. They find that a single questionable practice can increase the chances of obtaining a false positive result from the nominal 5% to 12.6%. A combination of four questionable research practices increased the risk to 60.7%. The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky. But it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results, (p = .69 = 1%).
The analysis of Bem’s data suggest that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies). If the moderator is not known, it is still important to report this finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.
To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).
Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct. Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.
Should the Journal of Personality and Social Psychology Retract Bem’s Article?
Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted (Francis, 2014). This seems excessive.
However, Bem’s article is unique in many ways, and the new analyses of original data presented here suggest that bad research practices, inadvertently or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem as erroneous. Unless the article is retracted, uniformed researchers will continue to cite the article as evidence for supernatural phenomena like time-reversed causality.
“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).
Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator that could explain inconsistent findings that require further investigation. A retraction would make it clear that there are no inconsistent findings because Bem’s findings do not provide credible evidence for the effect. Thus, it is unnecessary and maybe unethical to recruit human participants to further replication studies of Bem’s paradigms.
This does not mean that future research on paranormal phenomena should be banned. However, future studies cannot be based on Bem’s paradigms or results to plan future studies. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem’s effect sizes, but these effect sizes are not trustworthy and cannot be used to plan future studies.
A main objection to retractions is that Bem’s study made an inadvertent important contribution to the history of social psychology that triggered a method revolution and changes in the way social psychologist conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses because the bias in Bem’s studies also biases all meta-analytic findings that include Bem’s studies (Bem, Ressoldi, Rabeyron, & Duggan (2015).
[edited January, 8, 2018]
It is not clear how Bem (2011) thinks about his article these days, but one quote in Enbger’s article suggests that Bem realizes now that he provided false evidence for a phenomenon that does not exist.
When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”
If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.
Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bem, D.J., Tressoldi, P., Rabeyron, T. & Duggan, M. (2015) Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events, F1000 Research, 4, 1–33.
Engber, D. (2017). Daryl Bem proved ESP Is real: Which means science is broken. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html
Francis, G. (2012). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9
Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, Issue 6203, 502-1505, DOI: 10.1126/science.1255484
Galak, J., Leboeuf, R.A., Nelson, L. D., & Simmons, J.P. (2012). Journal of Personality and Social Psychology, 103, 933-948, doi: 10.1037/a0029709.
John, L. K. Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532. DOI: 10.1177/0956797611430953
RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 2012, 17, 551–566.
Schimmack, U. (2015). The Test of Insufficient Variance: A New Tool for the Detection of Questionable Research Practices. https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746
Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137
Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.
In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).
Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.
This blog post shows that these simulations make questionable assumptions and shows with empirical data that Ioannidis’s simulations are inconsistent with actual data.
Critical Examination of Ioannidis’s Simulations
First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.
How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.
Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.
(1) PPV = TP/(TP + FP)
PTP = True Positive Results, FP = False Positive Results
The proportion of true positive results (TP) depends on the percentage of true hypothesis (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error (beta).
(2) TP = PTH * Power = PTH * (1 – beta)
The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).
(3) FP = PFH * alpha
This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.
(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))
Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true in more than 50% of published results that falsely rejected it.
(5) (PTH*power) / ((PTH*power) + (PFH * alpha)) < .50
Equation (5) can be simplied to the inequality equation
(6) alpha > PTH/PFH * power
We can rearrange formula (6) and substitute PFH with (1-PHT) to determine the maximum proportion of true hypotheses to produce over 50% false positive results.
(7a) = alpha = PTH/(1-PTH) * power
(7b) = alpha*(1-PTH) = PTH * power
(7c) = alpha – PTH*alpha = PTH * power
(7d) = alpha = PTH*alpha + PTH*power
(7e) = alpha = PTH(alpha + power)
(7f) = alpha/(power + alpha) = PTH
Table 1 shows the results.
Power PTH / PFH
90% 5 / 95
80% 6 / 94
70% 7 / 93
60% 8 / 92
50% 9 / 91
40% 11 / 89
30% 14 / 86
20% 20 / 80
10% 33 / 67
Even if researchers would conduct studies with only 20% power to discover true positive results, we would only obtain more than 50% false positive results if only 20% of hypothesis were true. This makes it rather implausible that most published results could be false.
To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced due to various questionable research practices that help researchers to report significant results. The main effect of these practices is that the probability of a false positive result to become significant increases.
Simmons et al. (2011) showed that massive use several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an assumed alpha of 50%, fewer false hypotheses are needed to produce more false than true positives (Table 2).
90% 40 / 60
80% 43 / 57
70% 46 / 54
60% 50 / 50
50% 55 / 45
40% 60 / 40
30% 67 / 33
20% 75 / 25
10% 86 / 14
If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypothesis with 50% power and 50% of tested hypothesis are false.
However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.
Ioannidis recognizes this, but he assumes that bias has the same effect for true hypothesis and false hypothesis. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis’s assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.
For example, if power is 50%, only 50% of true hypothesis produce a significant result. However, with a bias factor of .4, another 40% of the false negative results will become significant, adding another .4*.5 = 20% true positive results to the number of true positive results. This gives a total of 70% positive results, which is a 40% increase over the number of positive results that would have been obtained without bias. However, this increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38% of false positive results. So instead of 5% false positive results, bias increases the percentage of false positive results from 5% to 43%, an increase by 760%. Thus, the effect of bias on the PPV is not equal. A 40% increase of false positives has a much stronger impact on the PPV than a 40% increase of true positives. Ioannidis provides no rational for this bias model.
A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000* .30*.10*.20 = 6 true positives results compared to 1000* .30*.90*.95 = 265 false positive results (i.e., 44:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.” (e124).
Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’ claims rest on the opposite assumption that most hypothesis are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.
Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results, if the null-hypothesis is false in no more than 50% of all statistical tests.
In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.
Testing Ioannidis’s Simulations
10 years after the publication of “Why Most Published Research Findings Are False,” it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.
Figure 1 illustrates the distribution of z-values that is expected for Ioanndis’s model for “adequately powered exploratory epidemiological study” (Simulation 6 in Figure 4). Ioannidis assumes that for every true positive, there are 10 false positives (R = 1:10). He also assumed that studies have 80% power to detect a true positive. In addition, he assumed 30% bias.
A 30% bias implies that for every 100 false hypotheses, there would be 33 (100*[.30*.95+.05]) rather than 5 false positive results (.95*.30+.05)/.95). The effect on false negatives is much smaller (100*[.30*.20 + .80]). Bias was modeled by increasing the number of attempts to produce a significant result so that proportion of true and false hypothesis matched the predicted proportions. Given an assumed 1:10 ratio of true to false hypothesis, the ratio is 335 false hypotheses to 86 true hypotheses. The simulation assumed that researchers tested 100,000 false hypotheses and observed 35000 false positive results and that they tested 10,000 true hypotheses and observed 8,600 true positive results. Bias was simulated by increasing the number of tests to produce the predicted ratio of true and false positive results.
Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.95 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.
Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).
Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false. The maximum value that could be obtained is obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 55%. As power is much lower than 100%, the real maximum value is below 50%.
The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from false negatives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.
The maximum replicabiltiy that could be obtained with 50% false-positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%. However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power and 100% bias and an equal percentage of true and false hypotheses. The true replicabilty for this scenario is .05*.50 + .90 * .50 = 47.5%. z-curve slightly overestimates replicabilty and produced an estimate of 51%. Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false. Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicabilty estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggest that there is a high proportion of studies that produced true positive results with low power.
I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicabilty Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.
The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all non-significant results were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.
The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.
Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.
The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.
The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.
One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on test results that tested an original, novel, or an important finding, whereas articles also often report significance tests for other effects. For example, an intervention study may show a strong decrease in depression, when only the interaction with treatment is theoretically relevant.
I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.
Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.
Do Deceptive Reporting Practices in Social Psychology Harm Social Psychology?
A Critical Examination of “Research Practices That Can Prevent an Inflation of False-Positive Rates” by Murayama, Pekrun, and Fiedler (2014).
The article by Murayama, Pekrun, and Fiedler (MPK) discusses the probability of false positive results (evidence for an effect when no effect is present also known as type-I error) in multiple study articles. When researchers conduct a single study the nominal probability of obtaining a significant result without a real effect (a type-I error) is typically set to 5% (p < .05, two-tailed). Thus, for every significant result one would expect 19 non-significant results. A false-positive finding (type-I error) would be followed by several failed replications. Thus, replication studies can quickly correct false discoveries. Or so, one would like to believe. However, traditionally journals reported only significant results. Thus, false positive results remained uncorrected in the literature because failed replications were not published.
In the 1990s, experimental psychologists that run relatively cheap studies found a solution to this problem. Journals demanded that researchers replicate their findings in a series of studies that were then published in a single article.
MPK point out that the probability of a type-I error decreases exponentially as the number of studies increases. With two studies, the probability is less than 1% (.05 * .05 = .0025). It is easier to see the exponential effect in terms or ratios (1 out of 20, 1 out of 400, 1 out of 8000, etc. In top journals of experimental social psychology, a typical article contains four studies. The probability that all four studies produce a type-I error is only 1 out of 160,000. The corresponding value on a standard normal distribution is z = 4.52, which means the strength of evidence is 4.5 standard deviations away from 0, which represents the absence of an effect. In particle physics a value of z = 5 is used to rule out false-positives. Thus, getting 4 out of 4 significant results in four independent tests of an effect provides strong evidence for an effect.
I am in full agreement with MPK and I made the same point in Schimmack (2012). The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) or a single study with the total number of participants (N = 160). A real effect will produce stronger evidence for an effect as sample size increase. Getting four significant results at the 5% level is not more impressive than getting a single significant result at the p < .00001 level.
However, the strength of evidence from multiple study articles depends on one crucial condition. This condition is so elementary and self-evidence that it is not even mentioned in statistics. The condition is that a researcher honestly reports all results. 4 significant results is only impressive when a researcher went into the lab, conducted four studies, and obtained significant results in all studies. Similarly, 4 free throws are only impressive when there were only 4 attempts. 4 out of 20 free-throws is not that impressive and 4 out of 80 attempts is horrible. Thus, the absolute number of successes is not important. What matters is the relative frequency of successes for all attempts that were made.
Schimmack (2012) developed the incredibility index to examine whether a set of significant results is based on honest reporting or whether it was obtained by omitting non-significant results or by using questionable statistical practices to produce significant results. Evidence for dishonest reporting of results would undermine the credibility of the published results.
MPK have the following to say about dishonest reporting of results.
“On a related note, Francis (2012a, 2012b, 2012c, 2012d; see also Schimmack, 2012) recently published a series of analyses that indicated the prevalence of publication bias (i.e., file-drawer problem) in multi-study papers in the psychological literature.” (p. 111). They also note that Francis used a related method to reveal that many multiple-study articles show statistical evidence of dishonest reporting. “Francis argued that there may be many cases in which the findings reported in multi-study papers are too good to be true” (p. 111).
In short, Schimmack and Francis argued that multiple study articles can be misleading because the provide the illusion of replicability (a researcher was able to demonstrate the effect again, and again, and again, therefore it must be a robust effect), but in reality it is not clear how robust the effect is because the results were not obtain in the way as the studies are described in the article (first we did Study 1, then we did Study 2, etc. and voila all of the studies worked and showed the effect).
One objection to Schimmack and Francis would be to find a problem with their method of detecting bias. However, MPK do not comment on the method at all. They sidestep this issue when they write “it is beyond the scope of this article to discuss whether publication bias actually exists in these articles or. or how prevalent it is in general” (p. 111).
After sidestepping the issue, MPK are faced with a dilemma or paradox. Do multiple study articles strengthen the evidence because the combined type-I error probability decreases or do multiple study articles weaken the evidence because the probability that researchers did not report the results of their research program honestly? “Should multi-study findings be regarded as reliable or shaky evidence?” (p. 111).
MPK solve this paradox with a semantic trick. First, they point out that dishonest reporting has undesirable effects on effect size estimates.
“A publication bias, if it exists, leads to overestimation of effect sizes because some null findings are not reported (i.e., only studies with relatively large effect sizes that produce significant results are reported). The overestimation of effect sizes is problematic” (p. 111).
They do not explain why researchers should be allowed to omit studies with non-significant results from an article, given that this practice leads to the undesirable consequences of inflated effect sizes. Accurate estimates of effect sizes would be obtained if researchers published all of their results. In fact, Schimmack (2012) suggested that researchers report all results and then conduct a meta-analysis of their set of studies to examine how strong the evidence of a set of studies is. This meta-analysis would provide an unbiased measure of the true effect size and unbiased evidence about the probability that the results of all studies were obtained in the absence of an effect.
The semantic trick occurs when the authors suggest that dishonest reporting practices are only a problem for effect size estimates, but not for the question whether an effect actually exists.
“However, the presence of publication bias does not necessarily mean that the effect is absent (i.e., that the findings are falsely positive).” (p. 111) and “Publication bias simply means that the effect size is overestimated—it does not necessarily imply that the effect is not real (i.e., falsely positive).” (p. 112).
This statement is true because it is practically impossible to demonstrate false positives, which would require demonstrating that the true effect size is exactly 0. The presence of bias does not warrant the conclusion that the effect size is zero and that reported results are false positives.
However, this is not the point of revealing dishonest practices. The point is that dishonest reporting of results undermines the credibility of the evidence that was used to claim that an effect exists. The issue is the lack of credible evidence for an effect, not credible evidence for the lack of an effect. These two statements are distinct and MPK use the truth of the second statement to suggest that we can ignore whether the first statement is true.
Finally, MPK present a scenario of a multiple study article with 8 studies that all produced significant results. The state that it is “unrealistic that as many as eight statistically significant results were produced by a non-existent effect” (p. 112).
This blue-eyed view of multiple study articles ignores the fact that the replication crisis in psychology was triggered by Bem’s (2011) infamous article that contained 9 out of 9 statistically significant results (one marginal result was attributed to methodological problems, see Schimmack, 2012, for details) that supposedly demonstrated humans ability to foresee the future and to influence the past (e.g., learning after a test increased performance on a test that was taken before learning for the test). Schimmack (2012) used this article to demonstrate how important it can be to evaluate the credibility of multiple study articles and the incredibility index predicted correctly that these results would not replicate. So, it is simply naïve to assume that articles with more studies automatically strengthen evidence for the existence of an effect and that 8 significant results cannot occur in the absence of a true effect (maybe MPK believe in ESP).
It is also not clear why researchers should wonder about the credibility of results in multiple study articles. A simple solution to the paradox is to reported all results honestly. If an honest set of studies provides evidence for an effect, it is not clear why researchers would prefer to engage in dishonest reporting practices. MPK provide no explanation for this practices and make no recommendation to increase honesty in reporting of results as a simple solution to the replicability crisis in psychology.
They write, “the researcher may have conducted 10, or even 20, experiments until he/she obtained 8 successful experiments, but far more studies would have been needed had the effect not existed at all”. This is true, but we do not know how many studies a researcher conducted or what else a researcher did to the data unless all of this information is reported. If the combined evidence of 20 studies with 8 significant results shows that an effect is present, a researcher could just publish all 20 studies. What is the reason to hide over 50% of the evidence?
In the end, MPK assure readers that they “do not intend to defend underpowered studies” and they do suggest that “the most straightforward solution to this paradox is to conduct studies that have sufficient statistical power” (p. 112). I fully agree with these recommendations because powerful studies can provide real evidence for an effect and decrease the incentive to engage in dishonest practices.
It is discouraging that this article was published in a major review journal in social psychology. It is difficult to see how social psychology can regain trust, if social psychologists believe they can simply continue to engaging in dishonest reporting of results.
Fortunately, numerous social psychologists have responded to the replication crisis by demanding more honest research practices and by increasing statistical power of studies. The article by MPK should not be considered representative of the response by all social psychologists and I hope MPK will agree that honest reporting of results is vital for a healthy science.
It is well-known that scientific journals favor statistically significant results (Sterling, 1959). This phenomenon is known as publication bias. Publication bias can be easily detected by comparing the observed statistical power of studies with the success rate in journals. Success rates of 90% or more would only be expected if most theoretical predictions are true and empirical studies have over 90% statistical power to produce significant results. Estimates of statistical power range from 20% to 50% (Button et al., 2015, Cohen, 1962). It follows that for every published significant result an unknown number of non-significant results has occurred that remained unpublished. These results linger in researchers proverbial file-drawer or more literally in unpublished data sets on researchers’ computers.
The selection of significant results also creates an incentive for researchers to produce significant results. In rare cases, researchers simply fabricate data to produce significant results. However, scientific fraud is rare. A more serious threat to the integrity of science is the use of questionable research practices. Questionable research practices are all research activities that create a systematic bias in empirical results. Although systematic bias can produce too many or too few significant results, the incentive to publish significant results suggests that questionable research practices are typically used to produce significant results.
In sum, publication bias and questionable research practices contribute to an inflated success rate in scientific journals. So far, it has been difficult to examine the prevalence of questionable research practices in science. One reason is that publication bias and questionable research practices are conceptually overlapping. For example, a research article may report the results of a 2 x 2 x 2 ANOVA or a regression analysis with 5 predictor variables. The article may only report the significant results and omit detailed reporting of the non-significant results. For example, researchers may state that none of the gender effects were significant and not report the results for main effects or interaction with gender. I classify these cases as publication bias because each result tests a different hypothesis., even if the statistical tests are not independent.
Questionable research practices are practices that change the probability of obtaining a specific significant result. An example would be a study with multiple outcome measures that would support the same theoretical hypothesis. For example, a clinical trial of an anti-depressant might include several depression measures. In this case, a researcher can increase the chances of a significant result by conducting tests for each measure. Other questionable research practices would be optional stopping once a significant result is obtained, selective deletion of cases based on the results after deletion. A common consequence of these questionable practices is that they will produce results that meet the significance criterion, but deviate from the distribution that is expected simply on the basis of random sampling error.
A number of articles have tried to examine the prevalence of questionable research practices by comparing the frequency of p-values above and below the typical criterion of statistical significance, namely a p-value less than .05. The logic is that random error would produce a nearly equal amount of p-values just above .05 (e.g., p = .06) and below .05 (e.g., p = .04). According to this logic, questionable research practices are present, if there are more p-values just below the criterion than p-values just above the criterion (Masicampo & Lalande, 2012).
Daniel Lakens has pointed out some problems with this approach. The most crucial problem is that publication bias alone is sufficient to predict a lower frequency of p-values below the significance criterion. After all, these p-values imply a non-significant result and non-significant results are subject to publication bias. The only reason why p-values of .06 are reported with higher frequency than p-values of .11 is that p-values between .05 and .10 are sometimes reported as marginally significant evidence for a hypothesis. Another problem is that many p-values of .04 are not reported as p = .04, but are reported as p < .05. Thus, the distribution of p-values close to the criterion value provides unreliable information about the prevalence of questionable research practices.
In this blog post, I introduce an alternative approach to the detection of questionable research practices that produce just significant results. Questionable research practices and publication bias have different effects on the distribution of p-values (or corresponding measures of strength of evidence). Whereas publication bias will produce a distribution that is consistent with the average power of studies, questionable research practice will produce an abnormal distribution with a peak just below the significance criterion. In other words, questionable research practices produce a distribution with too few non-significant results and too few highly significant results.
I illustrate this test of questionable research practices with post-hoc-power analysis of three journals. One journal shows neither signs of publication bias, nor significant signs of questionable research practices. The second journal shows clear evidence of publication bias, but no evidence of questionable research practices. The third journal illustrates the influence of publication bias and questionable research practices.
Example 1: A Relatively Unbiased Z-Curve
The first example is based on results published during the years 2010-2014 in the Journal of Experimental Psychology: Learning, Memory, and Cognition. A text-mining program searched all articles for publications of F-tests, t-tests, correlation coefficients, regression coefficients, odds-ratios, confidence intervals, and z-tests. Due to the inconsistent and imprecise reporting of p-values (p = .02 or p < .05), p-values were not used. All statistical tests were converted into absolute z-scores.
The program found 14,800 tests. 8,423 tests were in the critical interval between z = 2 and z = 6 that is used for estimation of 4 non-centrality parameters and 4 weights that are used to model the distribution of z-values between 2 and 6 and to estimate the distribution in the range from 0 to 2. Z-values greater than 6 are not used because they correspond to Power close to 1. 11% of all tests fall into this region of z-scores that are not shown.
The histogram and the blue density distribution show the observed data. The green curve shows the predicted distribution based on the post-hoc power analysis. Post-hoc power analysis suggests that the average power of significant results is 67%. Power for all statistical tests is estimated to be 58% (including 11% of z-scores greater than 6, power is .58*.89 + .11 = 63%). More important is the predicted distribution of z-scores. The predicted distribution on the left side of the criterion value matches the observed distribution rather well. This shows that there are not a lot of missing non-significant results. In other words, there does not appear to be a file-drawer of studies with non-significant results. There is also only a very small blip in the observed data just at the level of statistical significance. The close match between the observed and predicted distributions suggests that results in this journal are relatively free of systematic bias due to publication bias or questionable research practices.
Example 2: A Z-Curve with Publication Bias
The second example is based on results published in the Attitudes & Social Cognition Section of the Journal of Personality and Social Psychology. The text-mining program retrieved 5,919 tests from articles published between 2010 and 2014. 3,584 tests provided z-scores in the range from 2 to 6 that is being used for model fitting.
The average power of significant results in JPSP-ASC is 55%. This is significantly less than the average power in JEP-LMC, which was used for the first example. The estimated power for all statistical tests, including those in the estimated file drawer, is 35%. More important is the estimated distribution of z-values. On the right side of the significance criterion the estimated curve shows relatively close fit to the observed distribution. This finding shows that random sampling error alone is sufficient to explain the observed distribution. However, on the left side of the distribution, the observed z-scores drop off steeply. This drop is consistent with the effect of publication bias that researchers do not report all non-significant results. There is only a slight hint that questionable research practices are also present because observed z-scores just above the criterion value are a bit more frequent than the model predicts. However, this discrepancy is not conclusive because the model could increase the file drawer, which would produce a steeper slope. The most important characteristic of this z-curve is the steep cliff on the left side of the criterion value and the gentle slope on the right side of the criterion value.
Example 3: A Z-Curve with Questionable Research Practices.
Example 3 uses results published in the journal Aggressive Behavior during the years 2010 to 2014. The text mining program found 1,429 results and 863 z-scores in the range from 2 to 6 that were used for the post-hoc-power analysis.
The average power for significant results in the range from 2 to 6 is 73%, which is similar to the power estimate in the first example. The power estimate that includes non-significant results is 68%. The power estimate is similar because there is no evidence of a file drawer with many underpowered studies. In fact, there are more observed non-significant results than predicted non-significant results, especially for z-scores close to zero. This outcome shows some problems of estimating the frequency of non-significant results based on the distribution of significant results. More important, the graph shows a cluster of z-scores just above and below the significance criterion. The step cliff to the left of the criterion might suggest publication bias, but the whole distribution does not show evidence of publication bias. Moreover, the steep cliff on the right side of the cluster cannot be explained with publication bias. Only questionable research practices can produce this cliff because publication bias relies on random sampling error which leads to a gentle slope of z-scores as shown in the second example.
Prevalence of Questionable Research Practices
The examples suggest that the distribution of z-scores can be used to distinguish publication bias and questionable research practices. Based on this approach, the prevalence of questionable research practices would be rare. The journal Aggressive Behavior is exceptional. Most journals show a pattern similar to Example 2, with varying sizes of the file drawer. However, this does not mean that questionable research practices are rare because it is most likely that the pattern observed in Example 2 is a combination of questionable research practices and publication bias. As shown in Example 2, the typical power of statistical tests that produce a significant result is about 60%. However, researchers do not know which experiments will produce significant results. Slight modifications in experimental procedures, so-called hidden moderators, can easily change an experiment with 60% power into an experiment with 30% power. Thus, the probability of obtaining a significant result in a replication study is less than the nominal power of 60% that is implied by post-hoc-power analysis. With only 30% to 60% power, researchers will frequently encounter results that fail to produce an expected significant result. In this case, researchers have two choices to avoid reporting a non-significant result. They can put the study in the file-drawer or they can try to salvage the study with the help of questionable research practices. It is likely that researchers will do both and that the course of action depends on the results. If the data show a trend in the right direction, questionable research practices seem an attractive alternative. If the data show a trend in the opposite direction, it is more likely that the study will be terminated and the results remain unreported.
Simons et al. (2011) conducted some simulation studies and found that even extreme use of multiple questionable research practices (p-hacking) will produce a significant result in at most 60% of cases, when the null-hypothesis is true. If such extreme use of questionable research practices were widespread, z-curve would produce corrected power estimates well-below 50%. There is no evidence that extreme use of questionable research practices is prevalent. In contrast, there is strong evidence that researchers conduct many more studies than they actually report and that many of these studies have a low probability of success.
Implications of File-Drawers for Science
First, it is clear that researchers could be more effective if they would use existing resources more effectively. An fMRI study with 20 participants costs about $10,000. Conducting a study that costs $10,000 that has only a 50% probability of producing a significant result is wasteful and should not be funded by taxpayers. Just publishing the non-significant result does not fix this problem because a non-significant result in a study with 50% power is inconclusive. Even if the predicted effect exists, one would expect a non-significant result in ever second study. Instead of wasting $10,000 on studies with 50% power, researchers should invest $20,000 in studies with higher power (unfortunately, power does not increase proportional to resources). With the same research budget, more money would contribute to results that are being published. Thus, without spending more money, science could progress faster.
Second, higher powered studies make non-significant results more relevant. If a study had 80% power, there is only a 20% chance to get a non-significant result if an effect is present. If a study had 95% power, the chance of a non-significant result would be just as low as the chance of a false positive result. In this case, it is noteworthy that a theoretical prediction was not confirmed. In a set of high-powered studies, a post-hoc power analysis would show a bimodal distribution with clusters of z-scores around 0 for true null-hypothesis and a cluster of z-scores of 3 or higher for clear effects. Type-I and Type-II errors would be rare.
Third, Example 3 shows that the use of questionable research practices becomes detectable in the absence of a file drawer and that it would be harder to publish results that were obtained with questionable research practices.
Finally, the ability to estimate the size of file-drawers may encourage researchers to plan studies more carefully and to invest more resources into studies to keep their file drawers small because a large file-drawer may harm reputation or decrease funding.
In conclusion, post-hoc power analysis of large sets of data can be used to estimate the size of the file drawer based on the distribution of z-scores on the right side of a significance criterion. As file-drawers harm science, this tool can be used as an incentive to conduct studies that produce credible results and thus reducing the need for dishonest research practices. In this regard, the use of post-hoc power analysis complements other efforts towards open science such as preregistration and data sharing.
Meta-Analysis of Observed Power
Citation: Dr. R (2015). Meta-analysis of observed power. R-Index Bulletin, Vol(1), A2.
In a previous blog post, I presented an introduction to the concept of observed power. Observed power is an estimate of the true power on the basis of observed effect size, sampling error, and significance criterion of a study. Yuan and Maxwell (2005) concluded that observed power is a useless construct when it is applied to a single study, mainly because sampling error in a single study is too large to obtain useful estimates of true power. However, sampling error decreases as the number of studies increases and observed power in a set of studies can provide useful information about the true power in a set of studies.
This blog post introduces various methods that can be used to estimate power on the basis of a set of studies (meta-analysis). I then present simulation studies that compare the various estimation methods in terms of their ability to estimate true power under a variety of conditions. In this blog post, I examine only unbiased sets of studies. That is, the sample of studies in a meta-analysis is a representative sample from the population of studies with specific characteristics. The first simulation assumes that samples are drawn from a population of studies with fixed effect size and fixed sampling error. As a result, all studies have the same true power (homogeneous). The second simulation assumes that all studies have a fixed effect size, but that sampling error varies across studies. As power is a function of effect size and sampling error, this simulation models heterogeneity in true power. The next simulations assume heterogeneity in population effect sizes. One simulation uses a normal distribution of effect sizes. Importantly, a normal distribution has no influence on the mean because effect sizes are symmetrically distributed around the mean effect size. The next simulations use skewed normal distributions. This simulation provides a realistic scenario for meta-analysis of heterogeneous sets of studies such as a meta-analysis of articles in a specific journal or articles on different topics published by the same author.
Observed Power Estimation Method 1: The Percentage of Significant Results
The simplest method to determine observed power is to compute the percentage of significant results. As power is defined as the long-range percentage of significant results, the percentage of significant results in a set of studies is an unbiased estimate of the long-term percentage. The main limitation of this method is that the dichotomous measure (significant versus insignificant) is likely to be imprecise when the number of studies is small. For example, two studies can only show observed power values of 0, 25%, 50%, or 100%, even if true power were 75%. However, the percentage of significant results plays an important role in bias tests that examine whether a set of studies is representative. When researchers hide non-significant results or use questionable research methods to produce significant results, the percentage of significant results will be higher than the percentage of significant results that could have been obtained on the basis of the actual power to produce significant results.
Observed Power Estimation Method 2: The Median
Schimmack (2012) proposed to average observed power of individual studies to estimate observed power. Yuan and Maxwell (2005) demonstrated that the average of observed power is a biased estimator of true power. It overestimates power when power is less than 50% and it underestimates true power when power is above 50%. Although the bias is not large (no more than 10 percentage points), Yuan and Maxwell (2005) proposed a method that produces an unbiased estimate of power in a meta-analysis of studies with the same true power (exact replication studies). Unlike the average that is sensitive to skewed distributions, the median provides an unbiased estimate of true power because sampling error is equally likely (50:50 probability) to inflate or deflate the observed power estimate. To avoid the bias of averaging observed power, Schimmack (2014) used median observed power to estimate the replicability of a set of studies.
Observed Power Estimation Method 3: P-Curve’s KS Test
Another method is implemented in Simonsohn’s (2014) pcurve. Pcurve was developed to obtain an unbiased estimate of a population effect size from a biased sample of studies. To achieve this goal, it is necessary to determine the power of studies because bias is a function of power. The pcurve estimation uses an iterative approach that tries out different values of true power. For each potential value of true power, it computes the location (quantile) of observed test statistics relative to a potential non-centrality parameter. The best fitting non-centrality parameter is located in the middle of the observed test statistics. Once a non-central distribution has been found, it is possible to assign each observed test-value a cumulative percentile of the non-central distribution. For the actual non-centrality parameter, these percentiles have a uniform distribution. To find the best fitting non-centrality parameter from a set of possible parameters, pcurve tests whether the distribution of observed percentiles follows a uniform distribution using the Kolmogorov-Smirnov test. The non-centrality parameter with the smallest test statistics is then used to estimate true power.
Observed Power Estimation Method 4: P-Uniform
van Assen, van Aert, and Wicherts (2014) developed another method to estimate observed power. Their method is based on the use of the gamma distribution. Like the pcurve method, this method relies on the fact that observed test-statistics should follow a uniform distribution when a potential non-centrality parameter matches the true non-centrality parameter. P-uniform transforms the probabilities given a potential non-centrality parameter with a negative log-function (-log[x]). These values are summed. When probabilities form a uniform distribution, the sum of the log-transformed probabilities matches the number of studies. Thus, the value with the smallest absolute discrepancy between the sum of negative log-transformed percentages and the number of studies provides the estimate of observed power.
Observed Power Estimation Method 5: Averaging Standard Normal Non-Centrality Parameter
In addition to these existing methods, I introduce to novel estimation methods. The first new method converts observed test statistics into one-sided p-values. These p-values are then transformed into z-scores. This approach has a long tradition in meta-analysis that was developed by Stouffer et al. (1949). It was popularized by Rosenthal during the early days of meta-analysis (Rosenthal, 1979). Transformation of probabilities into z-scores makes it easy to aggregate probabilities because z-scores follow a symmetrical distribution. The average of these z-scores can be used as an estimate of the actual non-centrality parameter. The average z-score can then be used to estimate true power. This approach avoids the problem of averaging power estimates that power has a skewed distribution. Thus, it should provide an unbiased estimate of true power when power is homogenous across studies.
Observed Power Estimation Method 6: Yuan-Maxwell Correction of Average Observed Power
Yuan and Maxwell (2005) demonstrated a simple average of observed power is systematically biased. However, a simple average avoids the problems of transforming the data and can produce tighter estimates than the median method. Therefore I explored whether it is possible to apply a correction to the simple average. The correction is based on Yuan and Maxwell’s (2005) mathematically derived formula for systematic bias. After averaging observed power, Yuan and Maxwell’s formula for bias is used to correct the estimate for systematic bias. The only problem with this approach is that bias is a function of true power. However, as observed power becomes an increasingly good estimator of true power in the long run, the bias correction will also become increasingly better at correcting the right amount of bias.
The Yuan-Maxwell correction approach is particularly promising for meta-analysis of heterogeneous sets of studies such as sets of diverse studies in a journal. The main advantage of this method is that averaging of power makes no assumptions about the distribution of power across different studies (Schimmack, 2012). The main limitation of averaging power was the systematic bias, but Yuan and Maxwell’s formula makes it possible to reduce this systematic bias, while maintaining the advantage of having a method that can be applied to heterogeneous sets of studies.
Homogeneous Effect Sizes and Sample Sizes
The first simulation used 100 effect sizes ranging from .01 to 1.00 and 50 sample sizes ranging from 11 to 60 participants per condition (Ns = 22 to 120), yielding 5000 different populations of studies. The true power of these studies was determined on the basis of the effect size, sample size, and the criterion p < .025 (one-tailed), which is equivalent to .05 (two-tailed). Sample sizes were chosen so that average power across the 5,000 studies was 50%. The simulation drew 10 random samples from each of the 5,000 populations of studies. Each sample of a study simulated a between-subject design with the given population effect size and sample size. The results were stored as one-tailed p-values. For the meta-analysis p-values were converted into z-scores. To avoid biases due to extreme outliers, z-scores greater than 5 were set to 5 (observed power = .999).
The six estimation methods were then used to compute observed power on the basis of samples of 10 studies. The following figures show observed power as a function of true power. The green lines show the 95% confidence interval for different levels of true power. The figure also includes red dashed lines for a value of 50% power. Studies with more than 50% observed power would be significant. Studies with less than 50% observed power would be non-significant. The figures also include a blue line for 80% true power. Cohen (1988) recommended that researchers should aim for a minimum of 80% power. It is instructive how accurate estimation methods are in evaluating whether a set of studies met this criterion.
The histogram shows the distribution of true power across the 5,000 populations of studies.
The histogram shows that the simulation covers the full range of power. It also shows that high-powered studies are overrepresented because moderate to large effect sizes can achieve high power for a wide range of sample sizes. The distribution is not important for the evaluation of different estimation methods and benefits all estimation methods equally because observed power is a good estimator of true power when true power is close to the maximum (Yuan & Maxwell, 2005).
The next figure shows scatterplots of observed power as a function of true power. Values above the diagonal indicate that observed power overestimates true power. Values below the diagonal show that observed power underestimates true power.
Visual inspection of the plots suggests that all methods provide unbiased estimates of true power. Another observation is that the count of significant results provides the least accurate estimates of true power. The reason is simply that aggregation of dichotomous variables requires a large number of observations to approximate true power. The third observation is that visual inspection provides little information about the relative accuracy of the other methods. Finally, the plots show how accurate observed power estimates are in meta-analysis of 10 studies. When true power is 50%, estimates very rarely exceed 80%. Similarly, when true power is above 80%, observed power is never below 50%. Thus, observed power can be used to examine whether a set of studies met Cohen’s recommended guidelines to conduct studies with a minimum of 80% power. If observed power is 50%, it is nearly certain that the studies did not have the recommended 80% power.
To examine the relative accuracy of different estimation methods quantitatively, I computed bias scores (observed power – true power). As bias can overestimate and underestimate true power, the standard deviation of these bias scores can be used to quantify the precision of various estimation methods. In addition, I present the mean to examine whether a method has large sample accuracy (i.e. the bias approaches zero as the number of simulations increases). I also present the percentage of studies with no more than 20% points bias. Although 20% bias may seem large, it is not important to estimate power with very high precision. When observed power is below 50%, it suggests that a set of studies was underpowered even if the observed power estimate is an underestimation.
The quantitative analysis also shows no meaningful differences among the estimation methods. The more interesting question is how these methods perform under more challenging conditions when the set of studies are no longer exact replication studies with fixed power.
Homogeneous Effect Size, Heterogeneous Sample Sizes
The next simulation simulated variation in sample sizes. For each population of studies, sample sizes were varied by multiplying a particular sample size by factors of 1 to 5.5 (1.0, 1.5,2.0…,5.5). Thus, a base-sample-size of 40 created a range of sample sizes from 40 to 220. A base-sample size of 100 created a range of sample sizes from 100 to 2,200. As variation in sample sizes increases the average sample size, the range of effect sizes was limited to a range from .004 to .4 and effect sizes were increased in steps of d = .004. The histogram shows the distribution of power in the 5,000 population of studies.
The simulation covers the full range of true power, although studies with low and very high power are overrepresented.
The results are visually not distinguishable from those in the previous simulation.
The quantitative comparison of the estimation methods also shows very similar results.
In sum, all methods perform well even when true power varies as a function of variation in sample sizes. This conclusion may not generalize to more extreme simulations of variation in sample sizes, but more extreme variations in sample sizes would further increase the average power of a set of studies because the average sample size would increase as well. Thus, variation in effect sizes poses a more realistic challenge for the different estimation methods.
Heterogeneous, Normally Distributed Effect Sizes
The next simulation used a random normal distribution of true effect sizes. Effect sizes were simulated to have a reasonable but large variation. Starting effect sizes ranged from .208 to 1.000 and increased in increments of .008. Sample sizes ranged from 10 to 60 and increased in increments of 2 to create 5,000 populations of studies. For each population of studies, effect sizes were sampled randomly from a normal distribution with a standard deviation of SD = .2. Extreme effect sizes below d = -.05 were set to -.05 and extreme effect sizes above d = 1.20 were set to 1.20. The first histogram of effect sizes shows the 50,000 population effect sizes. The histogram on the right shows the distribution of true power for the 5,000 sets of 10 studies.
The plots of observed and true power show that the estimation methods continue to perform rather well even when population effect sizes are heterogeneous and normally distributed.
The quantitative comparison suggests that puniform has some problems with heterogeneity. More detailed studies are needed to examine whether this is a persistent problem for puniform, but given the good performance of the other methods it seems easier to use these methods.
Heterogeneous, Skewed Normal Effect Sizes
The next simulation puts the estimation methods to a stronger challenge by introducing skewed distributions of population effect sizes. For example, a set of studies may contain mostly small to moderate effect sizes, but a few studies examined large effect sizes. To simulated skewed effect size distributions, I used the rsnorm function of the fGarch package. The function creates a random distribution with a specified mean, standard deviation, and skew. I set the mean to d = .2, the standard deviation to SD = .2, and skew to 2. The histograms show the distribution of effect sizes and the distribution of true power for the 5,000 sets of studies (k = 10).
This time the results show differences between estimation methods in the ability of various estimation methods to deal with skewed heterogeneity. The percentage of significant results is unbiased, but is imprecise due to the problem of averaging dichotomous variables. The other methods show systematic deviations from the 95% confidence interval around the true parameter. Visual inspection suggests that the Yuan-Maxwell correction method has the best fit.
This impression is confirmed in quantitative analyses of bias. The quantitative comparison confirms major problems with the puniform estimation method. It also shows that the median, p-curve, and the average z-score method have the same slight positive bias. Only the Yuan-Maxwell corrected average power shows little systematic bias.
To examine biases in more detail, the following graphs plot bias as a function of true power. These plots can reveal that a method may have little average bias, but has different types of bias for different levels of power. The results show little evidence of systematic bias for the Yuan-Maxwell corrected average of power.
The following analyses examined bias separately for simulation with less or more than 50% true power. The results confirm that all methods except the Yuan-Maxwell correction underestimate power when true power is below 50%. In contrast, most estimation methods overestimate true power when true power is above 50%. The exception is puniform which still underestimated true power. More research needs to be done to understand the strange performance of puniform in this simulation. However, even if p-uniform could perform better, it is likely to be biased with skewed distributions of effect sizes because it assumes a fixed population effect size.
This investigation introduced and compared different methods to estimate true power for a set of studies. All estimation methods performed well when a set of studies had the same true power (exact replication studies), when effect sizes were homogenous and sample sizes varied, and when effect sizes were normally distributed and sample sizes were fixed. However, most estimation methods were systematically biased when the distribution of effect sizes was skewed. In this situation, most methods run into problems because the percentage of significant results is a function of the power of individual studies rather than the average power.
The results of these analyses suggest that the R-Index (Schimmack, 2014) can be improved by simply averaging power and then applying the Yuan-Maxwell correction. However, it is important to realize that the median method tends to overestimate power when power is greater than 50%. This makes it even more difficult for the R-Index to produce an estimate of low power when power is actually high. The next step in the investigation of observed power is to examine how different methods perform in unrepresentative (biased) sets of studies. In this case, the percentage of significant results is highly misleading. For example, Sterling et al. (1995) found percentages of 95% power, which would suggest that studies had 95% power. However, publication bias and questionable research practices create a bias in the sample of studies that are being published in journals. The question is whether other observed power estimates can reveal bias and can produce accurate estimates of the true power in a set of studies.