Added January 30, 2018: A formal letter to the editor of JPSP, calling for a retraction of the article (Letter).
“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)
In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time-reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.
The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results. “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results.” (Bem, 2011, p. 407).
The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.
In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.
The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt on Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem, which was based on over 1,000 participants. A more informative replication attempt was made by Galak et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not” (p. 941).
Galak et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value if the reported data and analyses were not influenced by the result. If, however, data or analyses were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntarily (other-deception). Both self-deception and other-deception introduce bias in the scientific record.
Replication studies cannot provide evidence about bias in original studies. A replication study only tells us that other scientists were unable to replicate original findings; it does not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.
I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely. Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.
1. Luck
The simplest and most parsimonious explanation for a stunning original result that cannot be replicated is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.
If Bem or other scientists encountered one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies would not replicate the effect.
Attributing Bem’s results to luck would be possible, if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt.
The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.
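The figure above can be checked as a binomial tail probability: at least 9 significant results in 10 tests, each with a 5% chance of a false positive. The following is a minimal Python sketch, assuming the 10 tests are independent; the function name is illustrative and not from Bem's or any other published analysis.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent attempts."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 9 or more significant results in 10 tests, with alpha = .05 and no true effect
p_luck = prob_at_least(9, 10, 0.05)
print(f"P(at least 9 of 10 significant) = {p_luck:.2e}")
print(f"about 1 in {1 / p_luck:,.0f}")
```

Under these assumptions the result is about 1.9e-11, or roughly 1 in 53.6 billion.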
Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is, evidence that strengthens beliefs in a phenomenon that actually does not exist.
2. Questionable Research Practices
The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies. This finding suggests that questionable research practices were used, but it does not provide information about the actual research practices that were used. John et al. listed a number of questionable research practices that might explain Bem’s findings.
2.1 Multiple Dependent Variables
One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.
2.2 Failure to Report All Conditions
This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.
2.3 Generous Rounding
Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test-statistics or by redoing the statistical analysis from original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a p-value falls into the range between .05 and .055 and that this could happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.
2.4 HARKing
Hypothesizing after results are known (HARKing; Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except Study 1. The studies build on a meta-analysis of prior studies and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing also cannot explain Bem’s findings.
2.5 Exclusion of Data
Sometimes non-significant results are caused by an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 participants was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023 into a non-significant result. Across all studies, only 4 participants out of 1075 participants were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.
2.6 Stopping Data Collection Early
Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.
In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.
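The drop in power from N = 100 to N = 50 can be illustrated with a short calculation. This is a sketch under stated assumptions: a one-tailed, one-sample test at alpha = .05, a normal approximation to the t-test, and a planning effect size of d = 0.25. The value of d is an assumption chosen because it gives about 80% power at N = 100; it is not stated in the text above.

```python
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    """One-tailed power of a one-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)          # critical value for alpha
    return 1 - z.cdf(z_crit - d * n ** 0.5)

ASSUMED_D = 0.25  # assumption: planning effect size, not given in the text
for n in (100, 50):
    print(f"N = {n:3d}: power = {approx_power(ASSUMED_D, n):.2f}")
```

With these inputs the approximation gives roughly 80% for N = 100 and roughly 55% for N = 50. An exact calculation based on the t-distribution would come out slightly lower, which is consistent with the 54% reported above.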
2.7 Optional Stopping/Snooping
Collecting more data when the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The increase in the risk of a false positive result depends on how often researchers check results. Optional stopping leaves three traces in the data. First, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes after a long time. Second, sample size would be negatively correlated with observed effect sizes. The reason is that larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion because data collection is terminated as soon as p-values reach the criterion value.
The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly negatively correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).
Despite these warning signs, optional stopping cannot explain Bem’s results, if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to produce significant results. In a simulation study, with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.
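A simulation of this kind can be sketched in a few lines of Python. The details of the simulation mentioned above are not given, so everything here is an assumption: scores are drawn from a standard normal distribution with a true effect of zero, a one-sided z-test with known variance is run after every participant from N = 50 to N = 200, and the study stops at the first significant result.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.95)  # one-sided criterion for alpha = .05

def optional_stopping_hit(rng, n_min=50, n_max=200):
    """Return True if a null study reaches significance at any peek."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)           # true effect is zero
        if n >= n_min and total / n ** 0.5 > Z_CRIT:
            return True                         # stop as soon as p < .05
    return False

def false_positive_rate(n_sims=2000, seed=1):
    """Share of simulated null studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(optional_stopping_hit(rng) for _ in range(n_sims))
    return hits / n_sims

rate = false_positive_rate()
print(f"share of null studies reaching significance: {rate:.2f}")
```

The exact rate depends on the test, on how often results are checked, and on whether the test is one-sided, so this sketch need not reproduce the 30% figure. The qualitative point is the same: the rate is well above the nominal 5% but far below what would be needed to make 9 out of 10 studies significant.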
2.8 Selective Reporting
The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Franco et al., 2014).
Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.
Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.
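The arithmetic behind these figures is simple enough to write out. The sketch below only restates the numbers given above (a 5% false-positive rate, 9 significant results, a modal sample size of 100, and $5 per participant); the variable names are illustrative.

```python
ALPHA = 0.05               # chance of a false positive per study
N_SIGNIFICANT = 9          # significant results that need to be explained
MODAL_N = 100              # typical sample size per study
PAY_PER_PARTICIPANT = 5    # dollars paid per participant

# On average, one significant result is expected per 1 / ALPHA null studies.
studies_per_hit = round(1 / ALPHA)
expected_studies = N_SIGNIFICANT * studies_per_hit
participants = expected_studies * MODAL_N
cost = participants * PAY_PER_PARTICIPANT

print(f"{expected_studies} studies, {participants:,} participants, ${cost:,}")
```

Note that 180 is an expected value; the actual number of attempts needed to obtain 9 chance findings would vary around it.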
As a strong believer in ESP, Bem may have paid $90,000 to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology:
“The integrity of the scientific enterprise requires the reporting of disconfirming results.”
In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.
3. The Decline Effect and a New Questionable Research Practice
When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased. This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).
Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.
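The cumulative effect size plotted in such a figure can be computed as sketched below. The code is an illustration, not the analysis behind Figure 1: it assumes a one-sample design in which d is the mean score divided by the standard deviation, and it uses simulated scores with a population effect size of d = .20 rather than Bem's data.

```python
import random
from statistics import mean, stdev

def cumulative_effect_sizes(scores, start=5):
    """Effect size d (mean / SD) after each participant from `start` onward."""
    return [mean(scores[:n]) / stdev(scores[:n])
            for n in range(start, len(scores) + 1)]

POP_D = 0.20                      # population effect size of the simulation
rng = random.Random(2011)         # fixed seed so the run is repeatable
simulated = [rng.gauss(POP_D, 1.0) for _ in range(100)]

trajectory = cumulative_effect_sizes(simulated)
print(f"d after   5 participants: {trajectory[0]:.2f}")
print(f"d after 100 participants: {trajectory[-1]:.2f}")
```

Repeating this for many simulated studies and plotting the range of trajectories would produce a funnel like the grey area: wide for the first few participants and narrowing toward the population effect size as the sample grows.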
The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.
Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For participants 51 to 100, none of the tests is statistically significant. A meta-analysis across the studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance shows that these results vary less than sampling error alone would produce, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1. P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.
Participants 1 to 50 || Participants 51 to 100
p = .004 || p = .194
p = .096 || p = .170
p = .039 || p = .100
p = .033 || p = .067
p = .013 || p = .069
p = .412 || p = .126
p = .023 || p = .410
p = .020 || p = .338
p = .010 || p = .318
p = .003 ||
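The Test of Insufficient Variance can be sketched as follows. One-sided p-values are converted to z-scores, and the variance of these z-scores is compared with the value of 1 expected from sampling error alone, using a left-tailed chi-square test with k - 1 degrees of freedom. The input is the nine p-values in Table 1 that appear to belong to participants 51 to 100; treating them this way is an assumption about the table layout (nine values would be consistent with Study 9, N = 50, having no participants beyond the first 50). The unweighted Stouffer combination is also an assumption, since the meta-analytic method is not named above.

```python
import math
from statistics import NormalDist, variance

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the lower incomplete gamma function."""
    a, h = df / 2, x / 2
    if h <= 0:
        return 0.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-15 * total:
        term *= h / (a + n)
        total += term
        n += 1
    return total * math.exp(-h + a * math.log(h) - math.lgamma(a))

def to_z(p_values):
    """Convert one-sided p-values to z-scores."""
    return [NormalDist().inv_cdf(1 - p) for p in p_values]

def tiva(p_values):
    """Return (variance of z-scores, left-tail p-value against variance 1)."""
    z = to_z(p_values)
    df = len(z) - 1
    var_z = variance(z)
    return var_z, chi2_cdf(var_z * df, df)

def stouffer_z(p_values):
    """Unweighted Stouffer combination of one-sided p-values."""
    z = to_z(p_values)
    return sum(z) / math.sqrt(len(z))

# Assumed to be the Table 1 values for participants 51 to 100
second_half = [.194, .170, .100, .067, .069, .126, .410, .338, .318]
var_z, p_tiva = tiva(second_half)
print(f"Var(z) = {var_z:.2f}, TIVA p = {p_tiva:.3f}")
print(f"Stouffer z = {stouffer_z(second_half):.2f}")
```

Under these assumptions the sketch gives a variance of about 0.22 and a left-tail p-value of about .013, in line with the values reported above, and a combined z of about 2.8, which is significant even though no single test is.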
There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continues to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated a researcher to collect more data is sufficient to maintain statistical significance because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).
In this scenario, researchers are deceiving themselves that slight experimental manipulations apparently have huge effects on their dependent variable because sampling error in small samples is very large. Observed effect sizes in small samples can range from 1 to -1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the belief that “the success of replications in psychological research often depends on subtle and unknown factors.” (p. 422). How could Bem reconcile this belief with the reporting of 9 out of 10 successes? The most plausible explanation is that successes are a selected set of findings out of many attempts that were not reported.
There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection. In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.
“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”
Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they neither peeked at the data nor terminated studies whose early results were not promising.
Additional hints come from an interview with Engber (2017).
“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.
In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result, then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practice capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.
Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one. Schooler (2011) proposed that something more intriguing may cause decline effects.
“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.”
Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these processes is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiments a plausible explanation for replication failures. If a researcher collects too much data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effects may not be observable all the time. Thus, sometimes initial trials will also not show the effect.
I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.
Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.
The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers’ a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).
The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality. The main problem appears to be the practice of peeking at results in small samples with different procedures and attributing differences in results to the experimental procedures, while ignoring the influence of sampling error.
Conceptual Replications and Hidden Moderators
In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At the very least, exact replications should succeed more often than they fail (Tversky and Kahneman, 1971).
Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).
The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and a failure in the other studies are called hidden moderators.
Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This belief sustains beliefs in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.
The sensitivity of social psychological effects to small changes in experimental procedures is also used to justify conducting many studies that are expected to fail, just as medieval alchemists expected many failures in their attempts to make gold. These failures are not important. They are simply needed to find the conditions that produce the desired outcome: a significant result that supports researchers’ predictions.
The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in the search for a predicted outcome without realizing that a few successes among many failures are expected due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologists to deceive themselves and others for decades.
Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.
Another important article for the replication crisis in psychology examined the probability that questionable research practices can produce false positive results (Simmons, Nelson, & Simonsohn, 2011). The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used. They find that a single questionable practice can increase the chances of obtaining a false positive result from the nominal 5% to 12.6%. A combination of four questionable research practices increased the risk to 60.7%. The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky. But it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results (p = .6^9 ≈ 1%).
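The probability of a run of false positives can be computed directly from the per-study rates quoted above. This is a minimal sketch, assuming the 9 studies are independent and that each one has the same chance of a false positive; the function name is illustrative.

```python
def prob_all_false_positives(per_study_rate, n_studies=9):
    """Chance that every one of n independent null studies is significant."""
    return per_study_rate ** n_studies

# Nominal rate, one questionable practice, four practices combined
for rate in (0.05, 0.126, 0.607):
    print(f"per-study rate {rate:.3f}: "
          f"P(9 of 9) = {prob_all_false_positives(rate):.2e}")
```

Even with the most aggressive combination of practices, the chance of 9 false positives in a row stays close to 1%.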
The analysis of Bem’s data suggests that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies). If the moderator is not known, it is still important to report this finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.
To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).
Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct. Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.
Should the Journal of Personality and Social Psychology Retract Bem’s Article?
Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted. This seems excessive.
However, Bem’s article is unique in many ways, and the new analyses of original data presented here suggest that bad research practices, inadvertently or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem (2011) as erroneous. Unless the article is retracted, uninformed researchers will continue to cite the article as evidence for supernatural phenomena like time-reversed causality.
“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).
Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for a hidden moderator that could explain the inconsistent findings and that requires further investigation. A retraction would make it clear that there are no inconsistent findings because Bem’s findings do not provide credible evidence for the effect. Thus, it is unnecessary and maybe unethical to recruit human participants for further replication studies of Bem’s paradigms.
This does not mean that future research on paranormal phenomena should be banned. However, future studies cannot rely on Bem’s paradigms or results for their planning. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem’s effect sizes, but these effect sizes are not trustworthy and cannot be used to plan future studies.
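The dependence of sample-size planning on the assumed effect size can be made concrete with a rough power calculation. The sketch below does not reproduce Vernon's (2017) design or analysis. It assumes a one-sample, one-sided test at alpha = .05 and uses a normal approximation. As stand-in effect sizes it uses the mean effect size reported by Bem (d = 0.22) and the average effect size in the replication studies by Galak et al. (d = 0.04). The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(d, n, alpha=0.05):
    """Approximate power of a one-sided, one-sample test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(d * sqrt(n) - z_crit)


def required_n(d, power=0.80, alpha=0.05):
    """Approximate sample size needed to reach the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for d in (0.22, 0.04):  # assumed effect sizes, see the text above
        print(f"d = {d}: power with N = 107 is {approx_power(d, 107):.2f}, "
              f"N for 80% power is {required_n(d)}")
```

Under these assumptions, a sample of 107 participants gives power of roughly .74 if the true effect size is d = 0.22, but only about .11 if it is d = 0.04, where several thousand participants would be needed to reach 80% power. A study planned around an inflated effect size is therefore very likely to be inconclusive.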
A main objection to a retraction is that Bem’s study made an inadvertent but important contribution to the history of social psychology: it triggered a method revolution and changed the way social psychologists conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses, because the bias in Bem’s studies also biases all meta-analytic findings that include them (Bem, Tressoldi, Rabeyron, & Duggan, 2015).
[edited January 8, 2018]
It is not clear how Bem thinks about his 2011 article these days, but one quote in Engber’s (2017) article suggests that Bem now realizes that he provided false evidence for a phenomenon that does not exist.
When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”
If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.
References
Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bem, D. J., Tressoldi, P., Rabeyron, T., & Duggan, M. (2015). Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events. F1000Research, 4, 1–33.
Engber, D. (2017). Daryl Bem proved ESP is real: Which means science is broken. Slate. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9
Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180–1187.
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. doi:10.1126/science.1255484
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103, 933–948. doi:10.1037/a0029709
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953
RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2015). The Test of Insufficient Variance: A New Tool for the Detection of Questionable Research Practices. https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24–27. doi:10.3758/BF03192746
Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137
Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.