Category Archives: Publication Bias

Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem

“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)

In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.

The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results.  “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results. ” (Bem, 2011, p. 407).

The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.

In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.

The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt about Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem which was based on over 1,000 participants. A more informative replication attempt was made by Galek et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings).  The authors found “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not.” (p. 941).

Galek et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value, if the reported data and analyses were not influenced by the result. If, however, data or analyzes were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis, if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntary (other-deception). Both, self-deception and other-deception introduce bias in the scientific record.

Replication studies cannot provide evidence about bias in original studies. A replication study only tells us that other scientists were unable to replicate original findings, but they do not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.

I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely.  Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.

1. Luck

The simplest and most parsimonious explanation for a stunning original result that cannot be replicate is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.

If Bem or other scientists would encounter one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies will not replicate the effect.

Attributing Bem’s results to luck would be possible, if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt.
The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.

Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is evidence that strengthens beliefs in a phenomenon that actually does not exist.

2. Questionable Research Practices

The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies.  This finding suggests that questionable research practices were used, but they do not provide information about the actual research practices that were used.  John et al. listed a number of questionable research practices that might explain Bem’s findings.

2.1. Multiple Dependent Variables

One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.

2.2. Failure to report all conditions

This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.

2.3 Generous Rounding

Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test-statistics or by redoing the statistical analysis from original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a p-value falls into the range between .05 and .055 and that this could happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.

2.4 HARKing

Hypothesizing after results are known (Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except Study 1. The studies build on a meta-analysis of prior studies and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing can also not explain Bem’s findings.

2.5 Excluding of Data

Sometimes non-significant results are caused by an an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 participants was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023 into a non-significant result. Across all studies, only 4 participants out of 1075 participants were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.

2.6 Stopping Data Collection Early

Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.

In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.

2.7 Optional Stopping/Snooping

Collecting more data, if the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The increase in the risk of a false positive results depends on how frequently and how often researchers check results. If researchers use optional stopping, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes after a long time. Second, sample size would be negatively correlated with observed effect sizes. The reason is that larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion because data collection is terminated as soon as p-values reach the criterion value.

The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).

Despite these warning signs, optional stopping cannot explain Bem’s results, if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to produce significant results. In a simulation study, with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.

2.8 Selective Reporting

The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Fraonco et al., 2014).

Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.

Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.

As a strong believer in ESP, Bem may have paid $90,000 dollars to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies, would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology.

“The integrity of the scientific enterprise requires the reporting of disconfirming results.”

2.9 Conclusion

In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.

3. The Decline Effect and a New Questionable Research Practice

When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased.  This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).

Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.

Figure1

 

The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.

Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants none of the 10 tests is statistically significant. A meta-analysis across the 10 studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1.  P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.

EXPERIMENT S 1-50 S 51-100
EXP1 p = .004 p = .194
EXP2 p = .096 p = .170
EXP3 p = .039 p = .100
EXP4 p = .033 p = .067
EXP5 p = .013 p = .069
EXP6a p = .412 p = .126
EXP5b p = .023 p = .410
EXP7 p = .020 p = .338
EXP8 p = .010 p = .318
EXP9 p = .003 NA

 

There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continuous to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated a researcher to collect more data is sufficient to maintain statistical significance because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).

In this scenario, researchers are deceiving themselves that slight experimental manipulations apparently have huge effects on their dependent variable because sampling error in small samples is very large. Observed effect sizes in small samples can range from 1 to -1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes.  Bem (2011), and reviewers of his article, seem to share the believe that “the success of replications in psychological research often depends on subtle and unknown factors.” (p. 422).  How could Bem reconcile this believe with the reporting of 9 out of 10 successes? The most plausible explanation is that successes are a selected set of findings out of many attempts that were not reported.

There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection.  In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.

“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”

Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they did not peek at the data and terminated studies when the results were not promising.

Additional hints come from an interview with Engber (2017).

“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.

In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result, then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practices capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.

Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one.  Schooler (2011) proposed that something more intriguing may cause decline effects.

“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.” 

Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these processes is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiment a plausible explanation for replication failures. If a researcher collects too many data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effect may not be observable all the time. Thus, sometimes initial trials will also not show the effect.

I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.

Discussion

Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.

The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).

The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality.  The main problem appears to be a practice to peek at results in small samples with different procedures and to attribute differences in results to the experimental procedures, while ignoring the influence of sampling error.

Conceptual Replications and Hidden Moderators

In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At least, exact replications should be successful more often than fail (Tversky and Kahneman, 1971).

Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).

The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear, why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and a failure in the other studies are called hidden moderator.

Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This belief sustains beliefs in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.

The sensitivity of social psychological effects to small changes in experimental procedures also justifies that it is necessary to conduct many studies that are expected to fail, just like medieval alchemists expected many failures in their attempts to make gold. These failures are not important. They are simply needed to find the conditions that produce the desired outcome; a significant result that supports researchers’ predictions.

The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in the search for a predicted outcome without realizing that a few successes among many failures are expected simply due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologist to deceive themselves and others for decades.

Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.

P-Hacking

Another important article for the replication crisis in psychology examined the probability that questionable research practices can produce false positive results (Simmons, Nelson, & Simonsohn, 2011).  The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used.  They find that a single questionable practice can increase the chances of obtaining a false positive result from the nominal 5% to 12.6%.  A combination of four questionable research practices increased the risk to 60.7%.  The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky.  But it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results,  (p = .6= 1%).

The analysis of Bem’s data suggest that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies).  If the moderator is not known, it is still important to report this finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.

To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).

Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct.  Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.

Should the Journal of Personality and Social Psychology Retract Bem’s Article?

Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted (Francis, 2014). This seems excessive.

However, Bem’s article is unique in many ways, and the new analyses of original data presented here suggest that bad research practices, inadvertently or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem as erroneous. Unless the article is retracted, uniformed researchers will continue to cite the article as evidence for supernatural phenomena like time-reversed causality.

“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).

Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator that could explain inconsistent findings that require further investigation. A retraction would make it clear that there are no inconsistent findings because Bem’s findings do not provide credible evidence for the effect. Thus, it is unnecessary and maybe unethical to recruit human participants to further replication studies of Bem’s paradigms.

This does not mean that future research on paranormal phenomena should be banned. However, future studies cannot be based on Bem’s paradigms or results to plan future studies. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem’s effect sizes, but these effect sizes are not trustworthy and cannot be used to plan future studies.

A main objection to retractions is that Bem’s study made an inadvertent important contribution to the history of social psychology that triggered a method revolution and changes in the way social psychologist conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses because the bias in Bem’s studies also biases all meta-analytic findings that include Bem’s studies (Bem, Ressoldi, Rabeyron, & Duggan (2015).

[edited January, 8, 2018]
It is not clear how Bem (2011) thinks about his article these days, but one quote in Enbger’s article suggests that Bem realizes now that he provided false evidence for a phenomenon that does not exist.

When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”

If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.

REFERENCES

Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Bem, D.J., Tressoldi, P., Rabeyron, T. & Duggan, M. (2015) Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events, F1000 Research, 4, 1–33.

Engber, D. (2017). Daryl Bem proved ESP Is real: Which means science is broken. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html

Francis, G. (2012). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9

Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, Issue 6203, 502-1505, DOI: 10.1126/science.1255484

Galak, J., Leboeuf, R.A., Nelson, L. D., & Simmons, J.P. (2012). Journal of Personality and Social Psychology, 103, 933-948, doi: 10.1037/a0029709.

John, L. K. Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532. DOI: 10.1177/0956797611430953

RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/

Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 2012, 17, 551–566.

Schimmack, U. (2015). The Test of Insufficient Variance: A New Tool for the Detection of Questionable Research Practices. https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746

Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137

Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

 

Advertisements

Hidden Figures: Replication Failures in the Stereotype Threat Literature

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016).  The replication crisis is often considered a new phenomenon, but failed replications are not entirely new.  Sometimes these studies have simply been ignored.  These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies. The chance of reporting a false positive results in a multiple study article is low because the risk of a false positive decreases exponentially (Schimmack, 2012). However, the low risk of a false positive is illusory if authors only publish studies that worked. In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP study by Bem (2011).  As a result, publication bias undermines the reporting of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959) and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests.  This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).”  “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.”  “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not
ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype-treat on test performance in a real-world testing situation.  These studies had large samples and were not limited to students at Stanford. One study was reported in a College Board Report (Stricker and Ward, 1998).   Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004).  This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context.  One group of participants were asked about their gender or ethnicity before the text, the other group did not receive these questions.  As noted in the abstract, neither the inquiry about race, nor about gender, had a significant effect on test performance. In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on
ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests-the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests-in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 2 1, 1997), in contrast to the
substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquires about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat.  However, if this were a real moderator, it would be important to document this fact and it is not clear why this finding obtained in an earlier study by Steele remained unpublished. Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation. A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity. Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate). In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance does not influence performance. This finding is not widely publicized, but is important for a better understanding of performance differences in real world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study.  However, exact replication studies are difficult and costly.  An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black.  The study also had a 2 x 3 design, which leaves less than 20 participants per condition.   The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1,107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19).   However, a specific contrast showed significant differences between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82.  The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800).  In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. Participants were 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1,35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic group, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other.  The consistent finding across the two studies that supports stereotype-treat theory is the finding that merely changing the description of an assessment task changes Black participants performance, as revealed by significant differences between the diagnostic and non-diagnostic condition in both studies.  The problem is that both studies had small numbers of Black participants and that small samples have low power to produce significant results. As a result, it is unlikely that a pair of studies would produce significant results in both studies.

Observed power  in the two studies is .81 and .62 with median observed power of .71. Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate. Moreover, when inflation is evident, median observed power is also inflated. To correct for this inflation, the Replicability-Index (R-Index) subtracts inflation from median observed power, which yields an R-Index of 42.  Any value below 50 is considered unacceptably low and I give it a letter grade F, just like students at American Universities receive an F for exams with less than 50% correct answers.  This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies. It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black’s performance on academic tests.

The Test of Insufficient Variance (TIVA) provides another way to examine published results.  Test statistics like t-values vary considerably from study to study even if the exact same study is conducted twice (or if one larger sample is randomly split into two sub-samples).  When test-statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1.  If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample. Often the selection is a result of publication bias.  Applying TIVA to the pair of studies, yields a variance of Var(z) = 0.15.  As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices.  Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3.  This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task.  The sample size of 68 participants  (35 Black, 33 White) implied that only 11 or 12 participants were assigned to one of the six cells in a 2 (race) by 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2,61) = 3.30, p = .044.  In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53,

Study 4.  This study used inquiry about race to induce stereotype-threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic).  The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to the four conditions.  The degrees of freedom indicated that cell frequencies were even lower. The study produced a significant interaction effect, F(1,39) = 7.82, p = .008.  The study also produced a significant contrast between Blacks in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not a performance measure.  If stereotype threat mediates the effect of task characteristics and performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance.  The key aim of stereotype threat theory is to explain differences in performance.  With a focus on performance outcomes, it is possible to examine the R-Index and TIVA of Studies 1, 2, and 4.  All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study Test Statistic p-value z-score obs.pow
Study 1 t(107) = 2.88 0.005 2.82 0.81
Study 2 t(35)=2.38 0.023 2.28 0.62
Study 4 t(39) = 2.43 0.020 2.33 0.64

Median observed power is 64 and the R-Index is well below 50, 64 – 36 = 28 (F).  The variance in z-scores is Var(z) = 0.09, p = .086.  These results cast doubt about the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.

Conclusion

Racial stereotypes and racial disparities are an important social issue.  Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies that can inform our understanding of these issues.  In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data.  Just like it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth.  Unfortunately, social psychologists have not followed the examples of astronomers and the numbers do not add up.

The three African American women, features in this years movie “Hidden Figures”***,  Katherine Johnson, Dorothy Vaughan, and Mary Jackson might not approve of the casual way social psychologists use numbers in their research, especially the wide-spread practice of hiding numbers that do not match expectations.  No science that wants to make a real-world contribution can condone this practice.  It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype threatening situations.  Yet, they prevailed and they paved the way for future generations of stereotyped groups.  Understanding racial and gender bias and performance differences remains an important issue and that is the reason why it is important to shed a light on hidden numbers and put simplistic theories under the microscope. Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities.  It is time for social psychologists to live up to its promises by tackling real world issues with research designs that have real world significance that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.

 

Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis. Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the proportion of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is proportional to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

inflated-mop

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significant is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Locken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50 to 3050 to 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

inflated-effect-sizes

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science,

355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N – 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) – sqrt(N – 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = “Inflated Observed Effect Size”

title = “Effect of Selection for Significance on Observed Effect Size”

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type=”l”,xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab=”Sample Size”,ylab=”Median Observed Effect Size After Selection for Significance”,lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0(“Rel = “,format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, generate the predicted median observed power and then pick the true power value with the smallest discrepancy between median observed power and simulated inflated power estimates. This approach is essentially the same as the approach used by pcurve and puniform, which only
differ in the criterion that is being minimized.

Here is the r-code for the conversion of true.power into the predicted observed power after selection for significance.

true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

inflated-mop

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
Obs.power = pnorm(qnorm(1-p/2),z.crit)

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.

P.S.  It is possible to proof the formula that transforms true power into median observed power.  Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
med.z.sig = median(z.sim[z.sim > z.crit])
obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings  are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations make questionable assumptions and shows with empirical data that Ioannidis’s simulations are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

PTP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the percentage of true hypothesis (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error (beta).

(2) TP = PTH * Power = PTH * (1 – beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true in more than 50% of published results that falsely rejected it.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplied to the inequality equation

(6) alpha > PTH/PFH * power

We can rearrange formula (6) and substitute PFH with (1-PHT) to determine the maximum proportion of true hypotheses to produce over 50% false positive results.

(7a)  =  alpha = PTH/(1-PTH) * power

(7b) = alpha*(1-PTH) = PTH * power

(7c) = alpha – PTH*alpha = PTH * power

(7d) =  alpha = PTH*alpha + PTH*power

(7e) = alpha = PTH(alpha + power)

(7f) =  alpha/(power + alpha) = PTH

 

Table 1 shows the results.

Power                  PTH / PFH             
90%                       5  / 95
80%                       6  / 94
70%                       7  / 93
60%                       8  / 92
50%                       9  / 91
40%                      11 / 89
30%                       14 / 86
20%                      20 / 80
10%                       33 / 67                     

Even if researchers would conduct studies with only 20% power to discover true positive results, we would only obtain more than 50% false positive results if only 20% of hypothesis were true. This makes it rather implausible that most published results could be false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced due to various questionable research practices that help researchers to report significant results. The main effect of these practices is that the probability of a false positive result to become significant increases.

Simmons et al. (2011) showed that massive use several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an assumed alpha of 50%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power                 PTH/PFH             
90%                     40 / 60
80%                     43 / 57
70%                     46 / 54
60%                     50 / 50
50%                     55 / 45
40%                     60 / 40
30%                     67 / 33
20%                     75 / 25
10%                      86 / 14                    

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypothesis with 50% power and 50% of tested hypothesis are false.

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypothesis and false hypothesis. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis’s assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypothesis produce a significant result. However, with a bias factor of .4, another 40% of the false negative results will become significant, adding another .4*.5 = 20% true positive results to the number of true positive results. This gives a total of 70% positive results, which is a 40% increase over the number of positive results that would have been obtained without bias. However, this increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38% of false positive results. So instead of 5% false positive results, bias increases the percentage of false positive results from 5% to 43%, an increase by 760%. Thus, the effect of bias on the PPV is not equal. A 40% increase of false positives has a much stronger impact on the PPV than a 40% increase of true positives. Ioannidis provides no rational for this bias model.

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000* .30*.10*.20 = 6 true positives results compared to 1000* .30*.90*.95 = 265 false positive results (i.e., 44:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.” (e124).

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’ claims rest on the opposite assumption that most hypothesis are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results, if the null-hypothesis is false in no more than 50% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

10 years after the publication of “Why Most Published Research Findings Are False,”  it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.

Figure 1 illustrates the distribution of z-values that is expected for Ioanndis’s model for “adequately powered exploratory epidemiological study” (Simulation 6 in Figure 4). Ioannidis assumes that for every true positive, there are 10 false positives (R = 1:10). He also assumed that studies have 80% power to detect a true positive. In addition, he assumed 30% bias.

ioannidis-fig6

A 30% bias implies that for every 100 false hypotheses, there would be 33 (100*[.30*.95+.05]) rather than 5 false positive results (.95*.30+.05)/.95). The effect on false negatives is much smaller (100*[.30*.20 + .80]). Bias was modeled by increasing the number of attempts to produce a significant result so that proportion of true and false hypothesis matched the predicted proportions. Given an assumed 1:10 ratio of true to false hypothesis, the ratio is 335 false hypotheses to 86 true hypotheses. The simulation assumed that researchers tested 100,000 false hypotheses and observed 35000 false positive results and that they tested 10,000 true hypotheses and observed 8,600 true positive results. Bias was simulated by increasing the number of tests to produce the predicted ratio of true and false positive results.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.95 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

ioannidis-fig1

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false.  The maximum value that could be obtained is obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 55%. As power is much lower than 100%, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from false negatives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicabiltiy that could be obtained with 50% false-positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%.  However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power and 100% bias and an equal percentage of true and false hypotheses. The true replicabilty for this scenario is .05*.50 + .90 * .50 = 47.5%. z-curve slightly overestimates replicabilty and produced an estimate of 51%.  Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false.  Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicabilty estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggest that there is a high proportion of studies that produced true positive results with low power.

ioannidis-fig3

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicabilty Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all non-significant results were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.

Powergraphs for JEP-LMC3.g

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

Powergraphs for PsySci3.g

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

Powergraphs for JPSP-ASC3.g

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on test results that tested an original, novel, or an important finding, whereas articles also often report significance tests for other effects. For example, an intervention study may show a strong decrease in depression, when only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.

 

A comparison of The Test of Excessive Significance and the Incredibility Index

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias.  The most prominent tests are funnel plots (Light & Pillemer, 1984) and Eggert regression (Eggert et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes.  As a result, observed effect sizes in a representative set of studies should also be independent of sample size.  However, publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result.  The main problem with these bias tests is that other factors may produce heterogeneity in population effect sizes that can also produce variation in observed effect sizes and the variation in population effect sizes may be related to sample sizes.  In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes.  A power analysis would suggest that researchers use larger samples to study smaller effects and smaller samples to study large effects.  This makes it problematic to draw strong inferences from negative correlations between effect sizes and sample sizes about the presence of publication bias.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation.  The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies.  If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50 studies.  Publication bias will lead to an inflation in the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results.  Sterling et al. (1995) found that several journals reported over 90% of significant results. Based on some conservative estimates of power, he concluded that this high success rate can only be explained with publication bias.  Sterling et al. (1995), however, did not develop a method that would make it possible to estimate power.

Ioannidis and Trikalonis (2007) proposed the first test for publication bias based on power analysis.  They call it “An exploratory test for an excess of significant results.” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis to examine publication bias.  The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated.  ETSESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalonis (2007).  The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous.  Again, the authors are careful to point out this limitation of ETSER. “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article. “Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting.” (p. 252).   Given these limitations, it would be desirable to develop a test that that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibilty Index (Schimmack, 2012).  The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces a non-significant result as the number of studies increases.  For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip.  Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again.  Probability theory shows that this outcome becomes very unlikely even after just a few coin tosses as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.1.25% and so on.  Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to be suspicious that the coin is fair, especially if it always falls on the side that benefits the person who is throwing the coin. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypothesis at least 90% of the time.  Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05).  It therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.

To avoid the requirement of a fixed effect size, the incredibility index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2006).  However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined.  The original incredibility indices used the mean to estimate averaged power, but Yuan and Maxwell (2006) demonstrated that the mean of observed power is a biased estimate of average (true) power.  In further developments of my method, I changed the method and I am now using median observed power (Schimmack, 2016).  The median of observed power is an unbiased estimator of power (Schimmack, 2015).

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, J., Pillemer, D. B.  (1984). Summing up: The Science of Reviewing Research. Cambridge, Massachusetts.: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test”. BMJ 315 (7109): 629–634. doi:10.1136/bmj.315.7109.629.

Ioannidis and Trikalinos (2007).  An exploratory test for an excess of significant findings. Clinical Trials, 4 245-253.

Schimmack (2012). The Ironic effect of significant results on the credibility of multiple study articles. Psychological Methods, 17, 551-566.

Schimmack, U. (2016). A revised introduction o the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Stering, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa, The American Statistician, 49, 108-112.

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 141–167

Replicability Ranking of Psychology Departments

Evaluations of individual researchers, departments, and universities are common and arguably necessary as science is becoming bigger. Existing rankings are based to a large extent on peer-evaluations. A university is ranked highly if peers at other universities perceive it to produce a steady stream of high-quality research. At present the most widely used objective measures rely on the quantity of research output and on the number of citations. These quantitative indicators of research quality work are also heavily influenced by peers because peer-review controls what gets published, especially in journals with high rejection rates, and peers decide what research they cite in their own work. The social mechanisms that regulate peer-approval are unavoidable in a collective enterprise like science that does not have a simple objective measure of quality (e.g., customer satisfaction ratings, or accident rates of cars). Unfortunately, it is well known that social judgments are subject to many biases due to conformity pressure, self-serving biases, confirmation bias, motivated biases, etc. Therefore, it is desirable to complement peer-evaluations with objective indicators of research quality.

Some aspects of research quality are easier to measure than others. Replicability rankings focus on one aspect of research quality that can be measured objectively, namely the replicability of a published significant result. In many scientific disciplines such as psychology, a successful study reports a statistically significant result. A statistically significant result is used to minimize the risk of publishing evidence for an effect that does not exist (or even goes in the opposite direction). For example, a psychological study that shows effectiveness of a treatment for depression would have to show that the effect in the study reveals a real effect that can be observed in other studies and in real patients if the treatment is used for the treatment of depression.

In a science that produces thousands of results a year, it is inevitable that some of the published results are fluke findings (even Toyota’s break down sometimes). To minimize the risk of false results entering the literature, psychology like many other sciences, adopted a 5% error rate. By using a 5% as the criterion, psychologists ensured that no more than 5% of results are fluke findings. With thousands of results published in each year, this still means that more than 50 false results enter the literature each year. However, this is acceptable because a single study does not have immediate consequences. Only if these results are replicated in other studies, findings become the foundation of theories and may influence practical decisions in therapy or in other applications of psychological findings (at work, in schools, or in policy). Thus, to outside observers it may appear safe to trust published results in psychology and to report about these findings in newspaper articles, popular books, or textbooks.

Unfortunately, it would be a mistake to interpret a significant result in a psychology journal as evidence that the result is probably true.  The reason is that the published success rate in journals has nothing to do with the actual success rate in psychological laboratories. All insiders know that it is common practice to report only results that support a researcher’s theory. While outsiders may think of scientists as neutral observers (judges), insiders play the game of lobbyist, advertisers, and self-promoters. The game is to advance one’s theory, publish more than others, get more citations than others, and win more grant money than others. Honest reporting of failed studies does not advance this agenda. As a result, the fact that psychological studies report nearly exclusively success stories (Sterling, 1995; Sterling et al., 1995) tells outside observers nothing about the replicability of a published finding and the true rate of fluke findings could be 100%.

This problem has been known for over 50 years (Cohen, 1962; Sterling, 1959). So it would be wrong to call the selective reporting of successful studies an acute crisis. However, what changed is that some psychologists have started to criticize the widely accepted practice of selective reporting of successful studies (Asendorpf et al., 2012; Francis, 2012; Simonsohn et al., 2011; Schimmack, 2012; Wagenmakers et al., 2011). Over the past five years, psychologists, particularly social psychologists, have been engaged in heated arguments over the so-called “replication crisis.”

One group argues that selective publishing of successful studies occurred, but without real consequences on the trustworthiness of published results. The other group argues that published results cannot be trusted unless they have been successfully replicated. The problem is that neither group has objective information about the replicability of published results.  That is, there is no reliable estimate of the percentage of studies that would produce a significant result again, if a representative sample of significant results published in psychology journals were replicated.

Evidently, it is not possible to conduct exact replication studies of all studies that have been published in the past 50 years. Fortunately, it is not necessary to conduct exact replication studies to obtain an objective estimate of replicability. The reason is that replicability of exact replication studies is a function of the statistical power of studies (Sterling et al., 1995). Without selective reporting of results, a 95% success rate is an estimate of the statistical power of the studies that achieved this success rate. Vice versa, a set of studies with average power of 50% is expected to produce a success rate of 50% (Sterling, et al., 1995).

Although selection bias renders success rates uninformative, the actual statistical results provide valuable information that can be used to estimate the unbiased statistical power of published results. Although selection bias inflates effect sizes and power, Brunner and Schimmack (forcecoming) developed and validated a method that can correct for selection bias. This method makes it possible to estimate the replicability of published significant results on the basis of the original reported results. This statistical method was used to estimate the replicabilty of research published by psychology departments in the years from 2010 to 2015 (see Methodology for details).

The averages for the 2010-2012 period (M = 59) and the 2013-2015 period (M = 61) show only a small difference, indicating that psychologists have not changed their research practices in accordance with recommendations to improve replicability in 2011  (Simonsohn et al., 2011). For most of the departments the confidence intervals for the two periods overlap (see attached powergraphs). Thus, the more reliable average across all years is used for the rankings, but the information for the two time periods is presented as well.

There are no obvious predictors of variability across departments. Private universities are at the top (#1, #2, #8), the middle (#24, #26), and at the bottom (#44, #47). European universities can also be found at the top (#4, #5), middle (#25) and bottom (#46, #51). So are Canadian universities (#9, #15, #16, #18, #19, #50).

There is no consensus on an optimal number of replicability.  Cohen recommended that researchers should plan studies with 80% power to detect real effects. If 50% of studies tested real effects with 80% power and the other 50% tested a null-hypothesis (no effect = 2.5% probability to replicate a false result again), the estimated power for significant results would be 78%. The effect on average power is so small because most of the false predictions produce a non-significant result. As a result, only a few studies with low replication probability dilute the average power estimate. Thus, a value greater than 70 can be considered broadly in accordance with Cohen’s recommendations.

It is important to point out that the estimates are very optimistic estimates of the success rate in actual replications of theoretically important effects. For a representative set of 100 studies (OSC, Science, 2015), Brunner and Schimmack’s statistical approach predicted a success rate of 54%, but the success rate in actual replication studies was only 37%. One reason for this discrepancy could be that the statistical approach assumes that the replication studies are exact, but actual replications always differ in some ways from the original studies, and this uncontrollable variability in experimental conditions posses another challenge for replicability of psychological results.  Before further validation research has been completed, the estimates can only be used as a rough estimate of replicability. However, the absolute accuracy of estimates is not relevant for the relative comparison of psychology departments.

And now, without further ado, the first objective rankings of 51 psychology departments based on the replicability of published significant results. More departments will be added to these rankings as the results become available.

Rank University 2010-2015 2010-2012 2013-2015
1 U Penn 72 69 75
2 Cornell U 70 67 72
3 Purdue U 69 69 69
4 Tilburg U 69 71 66
5 Humboldt U Berlin 67 68 66
6 Carnegie Mellon 67 67 67
7 Princeton U 66 65 67
8 York U 66 63 68
9 Brown U 66 71 60
10 U Geneva 66 71 60
11 Northwestern U 65 66 63
12 U Cambridge 65 66 63
13 U Washington 65 70 59
14 Carleton U 65 68 61
15 Queen’s U 63 57 69
16 U Texas – Austin 63 63 63
17 U Toronto 63 65 61
18 McGill U 63 72 54
19 U Virginia 63 61 64
20 U Queensland 63 66 59
21 Vanderbilt U 63 61 64
22 Michigan State U 62 57 67
23 Harvard U 62 64 60
24 U Amsterdam 62 63 60
25 Stanford U 62 65 58
26 UC Davis 62 57 66
27 UCLA 61 61 61
28 U Michigan 61 63 59
29 Ghent U 61 58 63
30 U Waterloo 61 65 56
31 U Kentucky 59 58 60
32 Penn State U 59 63 55
33 Radboud U 59 60 57
34 U Western Ontario 58 66 50
35 U North Carolina Chapel Hill 58 58 58
36 Boston University 58 66 50
37 U Mass Amherst 58 52 64
38 U British Columbia 57 57 57
39 The University of Hong Kong 57 57 57
40 Arizona State U 57 57 57
41 U Missouri 57 55 59
42 Florida State U 56 63 49
43 New York U 55 55 54
44 Dartmouth College 55 68 41
45 U Heidelberg 54 48 60
46 Yale U 54 54 54
47 Ohio State U 53 58 47
48 Wake Forest U 51 53 49
49 Dalhousie U 50 45 55
50 U Oslo 49 54 44
51 U Kansas 45 45 44