# Replicability 101: How to interpret the results of replication studies

Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015).  This article gives a basic introduction to the interpretation of statistical results within the Neyman-Pearson approach to statistical inference.

I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures.  First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence).  Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent.  Second, I  point out that publication bias undermines the credibility of significant results in original studies.  When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.

DEFINITION OF REPLICATING A STATISTICAL RESULT

Replicating something means to get the same result.  If I make the first free throw, replicating this outcome means to also make the second free throw.  When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”

If we conduct psychological studies, we can control many factors, but some factors are not under our control.  Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other.  It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.

To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study.  The most common outcome in psychological studies is a significant or non-significant result.  The goal of a study is to produce a significant result and for this reason a significant result is often called a success.  A successful replication study is a study that also produces a significant result.  Obtaining two significant results is akin to making two free throws.  This is one of the few agreements between Maxwell and me.

“Generally speaking, a published  original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)

The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).

I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical.  Otherwise, the replication study is not really a replication study.

There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures.  The probability of these outcomes depends on the significance criterion (the type-I error probability) when the null-hypothesis is true and on the statistical power of a study when the null-hypothesis is false.

Table 1 shows the probability of the outcomes in two studies.  The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts.  This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.

| | 2 sig, 0 ns | 1 sig, 1 ns | 0 sig, 2 ns |
|---|---|---|---|
| H0 is True | alpha^2 | 2*alpha*(1-alpha) | (1-alpha)^2 |
| H1 is True | (1-beta)^2 | 2*(1-beta)*beta | beta^2 |

The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability).  High power is needed to get significant results in a pair of studies (an original study and a replication study).  For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012).  Even with conventionally acceptable power of 80%, only about two thirds (64%) of pairs would produce this outcome.  However, studies in psychology do not have 80% power and estimates of power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce two significant results in no more than 16 out of 100 attempts.   Although successful replications of true effects with low power are unlikely, they are still much more likely than two significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1).  It is therefore reasonable to infer from two significant results that the null-hypothesis is false.

If the null-hypothesis is true, it is extremely likely that both studies produce a non-significant result (.95^2 = 90.25%).  In contrast, it is unlikely that even a pair of studies with modest power would produce two non-significant results.  For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%.  This means that two non-significant results are much more likely (about 22.5 : 1) when the null-hypothesis is true than when the alternative hypothesis is true and power is 80%.  This does not mean that the null-hypothesis is true in an absolute sense because power depends on the effect size.  For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5,  two non-significant results would suggest that the effect size is smaller than .5, but it does not warrant the conclusion that H0 is true and the effect size is exactly 0.  Once more, it is important to distinguish between the absence of evidence for an effect and the evidence of absence of an effect.
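The arithmetic behind these comparisons can be checked with a few lines of Python. This is only an illustrative sketch: the function name `pair_outcomes` is made up for the example, and it assumes two independent studies that share the same probability of a significant result (alpha when the null-hypothesis is true, power when it is false).

```python
def pair_outcomes(p_sig):
    """Outcome probabilities for two independent studies.

    p_sig is the chance that a single study is significant:
    alpha if the null-hypothesis is true, power if it is false.
    """
    return {
        "2 sig": p_sig ** 2,
        "1 sig, 1 ns": 2 * p_sig * (1 - p_sig),
        "2 ns": (1 - p_sig) ** 2,
    }

h0 = pair_outcomes(0.05)        # null-hypothesis true, alpha = .05
h1_low = pair_outcomes(0.40)    # true effect studied with 40% power
h1_high = pair_outcomes(0.80)   # true effect studied with 80% power

# Two significant results: 40% power versus a true null-hypothesis
ratio_two_sig = h1_low["2 sig"] / h0["2 sig"]
# Two non-significant results: true null-hypothesis versus 80% power
ratio_two_ns = h0["2 ns"] / h1_high["2 ns"]

print(f"two significant results favor H1 by {ratio_two_sig:.1f} : 1")
print(f"two non-significant results favor H0 by {ratio_two_ns:.1f} : 1")
```

The two ratios correspond to the 64 : 1 and roughly 22.5 : 1 comparisons made in the text. Changing the power values shows how quickly the second ratio shrinks when studies are underpowered.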

The most controversial scenario assumes that the two studies produced inconsistent outcomes.  Although theoretically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure  (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low;  .05 * (1-.05) = .0475.  The same probability exists for the reverse pattern that a non-significant result is followed by a significant one.  A probability of 4.75% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, the low probability is mostly due to the low probability of obtaining a significant result in the first study, while the replication failure is extremely likely.

Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false.  The probability of this outcome depends on statistical power.  A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result.  The probability of this rare event can be as low as, or lower than, the probability with a true null effect; .95 * (1-.95) = .0475.  Thus, an inconsistent result provides little information about the probability of a type-I or type-II  error and is difficult to interpret.
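The symmetry can be verified directly. The following sketch (the helper name `sig_then_ns` is hypothetical) computes the probability that the first study is significant and the second is not, for any per-study probability of a significant result.

```python
def sig_then_ns(p_sig):
    """P(first study significant, second study non-significant)."""
    return p_sig * (1 - p_sig)

# p_sig = .05 stands for a true null-hypothesis tested with alpha = .05;
# the other values stand for true effects studied with increasing power.
for p_sig in (0.05, 0.20, 0.50, 0.80, 0.95):
    print(f"p(sig) = {p_sig:.2f}  ->  p(sig, then ns) = {sig_then_ns(p_sig):.4f}")
```

The product is the same at .05 and at .95 and is largest at 50% power, which is why the pattern by itself cannot distinguish a false positive in the first study from a false negative in the second.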

In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false.  Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high.  When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.

META-ANALYSIS

The counting of successes and failures is an old way to integrate information from multiple studies.  This approach has low power and is no longer used.  A more powerful approach is effect size meta-analysis.  Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project.  Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.

“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)

I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies.  A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.

Maxwell et al. (2015) introduced an example of a between-subject (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size).  This study has 59% power to obtain a significant result.  Thus, it is quite likely that a pair of studies produces inconsistent results (48.38%).   However, a pair of studies with N = 80 has the power of a total sample size of N = 160, which means a fixed-effects meta-analysis will produce a significant result in 88% of all attempts.  Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%).   Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.
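These power figures can be approximated without special software. The sketch below uses a normal approximation to the two-sample test, which is an assumption made for this example; a t-based calculation, which the figures above presumably rest on, gives slightly lower values. The function name `approx_power` is made up for the illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(d, n_per_group, alpha=0.05):
    """Two-sided power of a two-group comparison (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    ncp = d * math.sqrt(n_per_group / 2)   # expected value of the z-statistic
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)

single = approx_power(0.5, 40)             # one study, n = 40 per group
pooled = approx_power(0.5, 80)             # two studies pooled, n = 80 per group
inconsistent = 2 * single * (1 - single)   # exactly one of two studies significant

print(f"single study: {single:.2f}, pooled: {pooled:.2f}, "
      f"inconsistent pair: {inconsistent:.2f}")
```

The approximation lands within a couple of percentage points of the values in the text, and it makes the key point visible: pooling two modestly powered studies yields a well-powered test.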

A more interesting scenario is inconsistent results when the null-hypothesis is true.  I turned to simulations to examine this scenario more closely.   The simulation showed that a meta-analysis of inconsistent studies produced a significant result in 34% of all cases.  The percentage slightly varies as a function of sample size.  With a small sample of N = 40, the percentage is 35%. With a large sample of  1,000 participants it is 33%.  This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis based on a significant original study.  Thus, if an original study produced a false-positive result, a failed replication study corrects this error in 2 out of 3 cases.  Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.
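A simulation of this kind can be sketched in a few lines. The code below is not the simulation used for the numbers above, whose details are not reported here; it is a simplified version that draws z-statistics directly instead of raw data, assumes equal sample sizes, and combines the two studies with a fixed-effect (Stouffer-type) z. Because it works with z-statistics, it cannot reproduce the small dependence on sample size. The function name and the seed are arbitrary.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)   # two-sided alpha = .05

def meta_sig_after_failed_replication(n_pairs=400_000, seed=2015):
    """Share of significant meta-analyses among pairs with a significant
    original and a non-significant replication, when H0 is true."""
    rng = random.Random(seed)
    inconsistent = meta_significant = 0
    for _ in range(n_pairs):
        z_orig = rng.gauss(0.0, 1.0)   # under H0 test statistics are pure noise
        z_rep = rng.gauss(0.0, 1.0)
        if abs(z_orig) > Z_CRIT and abs(z_rep) <= Z_CRIT:
            inconsistent += 1
            # equal sample sizes: the combined z is the sum divided by sqrt(2)
            z_meta = (z_orig + z_rep) / math.sqrt(2)
            if abs(z_meta) > Z_CRIT:
                meta_significant += 1
    return meta_significant / inconsistent

rate = meta_sig_after_failed_replication()
print(f"meta-analysis still significant in {rate:.0%} of inconsistent pairs")
```

Under these assumptions the share should come out at roughly one third, in line with the result described above.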

In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies.  If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.

DIFFERENCES IN SAMPLE SIZES

We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990).  This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers.  If original researchers conducted studies with adequate power,  an exact replication study with the same sample size would also have adequate power.  If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size.  As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.

From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size.  Studies with larger sample sizes are given more weight than studies with smaller samples.  Thus, researchers who invest more resources are rewarded by giving their studies more weight.  Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same.  Nevertheless, failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magical number about the size of a replication study to be useful.

I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5).  In this simulation, only 21% of meta-analyses produced a significant result.  This is 13 percentage points lower than in the simulation with equal sample sizes (34%).  If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.
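The effect of a larger replication sample can also be worked out numerically. The sketch below is an illustration under a normal approximation, not the simulation reported above: it weights each study's z-statistic by the square root of its sample size (the fixed-effect weighting for a common error variance) and integrates over significant original results. The function name and the integration settings are choices made for this example, and the resulting percentages need not match the simulated values exactly.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)   # two-sided alpha = .05

def remaining_false_positive_rate(n_orig, n_rep, steps=4000, z_max=8.0):
    """P(meta-analysis significant | original significant,
    replication non-significant), when the null-hypothesis is true."""
    w_orig, w_rep = math.sqrt(n_orig), math.sqrt(n_rep)
    w_total = math.sqrt(n_orig + n_rep)
    dz = (z_max - Z_CRIT) / steps
    numerator = 0.0
    # By symmetry it is enough to integrate over positive original z-values.
    for i in range(steps):
        z_orig = Z_CRIT + (i + 0.5) * dz          # midpoint rule
        # replication z needed for the combined z to exceed the criterion
        z_needed = (Z_CRIT * w_total - w_orig * z_orig) / w_rep
        lower = max(z_needed, -Z_CRIT)
        p_rep = STD_NORMAL.cdf(Z_CRIT) - STD_NORMAL.cdf(lower)
        numerator += STD_NORMAL.pdf(z_orig) * p_rep * dz
    denominator = (1 - STD_NORMAL.cdf(Z_CRIT)) * (2 * STD_NORMAL.cdf(Z_CRIT) - 1)
    return numerator / denominator

rates = {n_rep: remaining_false_positive_rate(80, n_rep) for n_rep in (80, 200, 800)}
for n_rep, rate in rates.items():
    print(f"original N = 80, replication N = {n_rep}: {rate:.0%} still significant")
```

The direction is what matters: the larger the failed replication relative to the original study, the smaller the share of false positives that survives the meta-analysis.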

The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings.  Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase sample sizes of replication studies to make replication failures meaningful.  Given unlimited resources, larger replications are better, but these analyses show that large replication studies are not necessary.  A replication study with the same sample size as the original study is more valuable than no replication study at all.

CONFUSING ABSENCE OF EVIDENCE WITH EVIDENCE OF ABSENCE

One problem in Maxwell et al.’s (2015) article is that it conflates two possible goals of replication studies.  One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive result, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis.  This finding would mean that evidence for an effect is absent.  The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result and the burden of proof is shifted back to proponents of the hypothesis to provide unbiased credible evidence for it.

Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i.e., humans do not have supernatural abilities).  Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect.  They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.

This double standard makes it easy to produce (potentially false) positive results and very hard to remove false positive results from the literature.   It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the effect size in an original study.  The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported.  To avoid this problem that replication researchers have to invest large amounts of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles.  If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.

THE DIRTY BIG SECRET

The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias.  They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling, 1959; Sterling et al., 1995).

It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance  (Schimmack, 2012).

I used Bem’s infamous article on supernatural abilities as an example, which appeared to show 8 successful replications of supernatural abilities.  Ironically, Maxwell et al. (2015) also cite Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.

“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”

This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP.  They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation.”  There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.

More important, Maxwell et al. ignore the broader context of these studies.  Schimmack (2012) discussed many questionable practices in Bem’s original studies and I presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices.  Given this wider context, it is entirely reasonable to favor the experimental artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies and ignored that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

Maxwell et al. (2015) were not aware that in the same year, the OSC (2015) reproducibility project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%.  The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that all original results that failed to replicate are false positives, but it does mean that it is not clear which findings are false positives and which findings are not.  Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing.  In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results.  This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence and the large number of replication failures means that more replication studies with larger samples are needed to see which hypotheses predict real effects with practical significance.

DOES PSYCHOLOGY HAVE A REPLICATION CRISIS?

Maxwell et al.’s (2015) answer to this question is captured in this sentence: “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis, we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496).  The most important part of this quote is “raising doubts”; the rest is Orwellian double-talk.

The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern.  As I have pointed out, this conclusion is based on some misconceptions about the purpose of replication studies and on blissful ignorance about publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spend time and resources on demonstrating that unbiased replication studies fail.

The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated.  In my opinion that is not only a crisis, but a scandal because psychologists routinely apply for funding with power analyses that claim 80% power.  The reproducibility project shows that the true power to obtain significant results in original and replication studies is much lower than this and that the 90% success rate is no more meaningful than 90% votes for a candidate in communist elections.

In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.”  They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”

As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), (c) the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of outcome, and (f) the replication study was preregistered.

We can only speculate why the American Psychologist published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail.  Fortunately, APA can no longer control what is published because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticizing peer-reviewed articles in open post-publication peer review on social media.

Long live the replicability revolution!!!

REFERENCES

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312. http://dx.doi.org/10.1037/0003-066X.45.12.1304

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

# My email correspondence with Daryl J. Bem about the data for his 2011 article “Feeling the future”

In 2015, Daryl J. Bem shared the datafiles for the 9 studies reported in the 2011 article “Feeling the Future” with me.  In a blog post, I reported an unexplained decline effect in the data.  In an email exchange with Daryl Bem, I asked for some clarifications about the data, comments on the blog post, and permission to share the data.

Today, Daryl J. Bem granted me permission to share the data.  He declined to comment on the blog post and did not provide an explanation for the decline effect.  He also did not comment on my observation that the article did not mention that “Experiment 5” combined two experiments with N = 50 and that “Experiment 6” combined three datasets with Ns = 91, 19, and 40.  It is highly unusual to combine studies and this practice contradicts Bem’s claim that sample sizes were determined a priori based on power analysis.

Footnote on p. 409. “I set 100 as the minimum number of participants/sessions for each of the experiments reported in this article because most effect sizes (d) reported in the psi literature range between 0.2 and 0.3. If d = 0.25 and N = 100, the power to detect an effect significant at .05 by a one-tail, one-sample t test is .80 (Cohen, 1988).”

The undisclosed concoction of datasets is another questionable research practice that undermines the scientific integrity of significance tests reported in the original article. At a minimum, Bem should issue a correction that explains how the nine datasets were created and what decision rules were used to stop data collection.

I am sharing the datafiles so that other researchers can conduct further analyses of the data.

Datafiles: EXP1   EXP2   EXP3   EXP4   EXP5   EXP6   EXP7   EXP8   EXP9

Below is the complete email correspondence with Daryl J. Bem.

—————————————————————————————————————————————–

To: Daryl J. Bem
From:  Ulrich Schimmack
Sent: Thursday,  January 25, 2018 5:23 PM

Dear Dr. Bem,

I find the enthusiasm explanation less plausible than you.  More important, it doesn’t explain the lack of a decline effect in studies with significant results.

I just finished the analysis of the 6 studies with N > 100 by Maier that are also included in the meta-analysis (see Figure below).

Given the lack of a plausible explanation for your data, I think JPSP should retract your article or at least issue an expression of concern because the published results are based on abnormally strong effect sizes in the beginning of each study. Moreover, Study 5 is actually two studies of N = 50 and the pattern is repeated at the beginning of the two datasets.

I also noticed that the meta-analysis included one more study by you, an underpowered study with N = 42 that surprisingly produced yet another significant result.  As I pointed out in my article that you reviewed, this success makes it even more likely that some non-significant (pilot) studies were omitted.  Your success record is simply too good to be true (Francis, 2012).  Have you conducted any other studies since 2012?  A non-significant result is overdue.

Regarding the meta-analysis itself, most of these studies are severely underpowered and there is still evidence for publication bias after excluding your studies.

When I used puniform to control for publication bias and limited the dataset to studies with N > 90 and excluded your studies (as we agree, N < 90 is low power) the p-value was not significant, and even if it were less than .05, it would not be convincing evidence for an effect.  In addition, I computed t-values using the effect size that you assumed in 2011, d = .2, and found significant evidence against the null-hypothesis that the ESP effect size could be as large as d = .2.  This means, even studies with N = 100 are underpowered.   Any serious test of the hypothesis requires much larger sample sizes.

However, the meta-analysis and the existence of ESP are not my concern.  My concern is the way (social) psychologists have conducted research in the past and are responding to the replication crisis.  We need to understand how researchers were able to produce seemingly convincing evidence like your 9 studies in JPSP that are difficult to replicate.  How can original articles have success rates of 90% or more and replications produce only a success rate of 30% or less?  You are well aware that your 2011 article was published with reservations and concerns about the way social psychologists conducted research.   You can make a real contribution to the history of psychology by contributing to the understanding of the research process that led to your results.  This is independent of any future tests of PSI with more rigorous studies.

Best, Dr. Schimmack

To: Ulrich Schimmack
From: Daryl J. Bem
Sent: Thursday,  January 25, 2018 4:45 PM

Dear Dr. Schimmack,

You reference Schooler who has documented the decline effect in several areas—not just in psi research—and has advanced some hypotheses about its possible causes.  The hypothesis that strikes me as most plausible is that it is an experimenter effect whereby experimenters and their assistants begin with high expectations and enthusiasm begin to get bored after conducting a lot of sessions.  This increasing lack of enthusiasm gets transmitted to the participants during the sessions.  I also refer you to Bob Rosenthal’s extensive work with experimenter effects—which show up even in studies with maze-running rats.

Most of Galak’s sessions were online, thereby diminishing this factor.  Now that I am retired and no longer have a laboratory with access to student assistants and participants, I, too, am shifting to online administration, so it will provide a rough test of this hypothesis.

Were you planning to publish our latest exchange concerning the meta-analysis?  I would not like to leave your blog followers with only your statement that it was “contaminated” by my own studies when, in fact, we did a separate meta-analysis on the non-Bem replications, as I noted in my previous email to you.

Best,
Daryl Bem

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Thursday, January 25, 2018 12:05 PM

Dear Dr. Bem,

I now started working on the meta-analysis.
I see another study by you listed (Bem, 2012, N = 42).
Can you please send me the original data for this study?

Best, Dr. Schimmack

To: Ulrich Schimmack
From: Daryl J. Bem
Sent: Thursday,  January 25, 2018 4:45 PM

Dear Dr. Shimmack,

I was not able to figure out how to leave a comment on your blog post at the website. (I kept being asked to register a site of my own.)  So, I thought I would simply write you a note.  You are free to publish it as my response to your most recent post if you wish.

In reading your posts on my precognitive experiments, I kept puzzling over why you weren’t mentioning the published Meta-analysis of 90 “Feeling the Future” studies that I published in 2015 with Tessoldi, Rabeyron, & Duggan. After all, the first question we typically ask when controversial results are presented is  “Can Independent researchers replicate the effect(s)?”  I finally spotted a fleeting reference to our meta-analysis in one of your posts, in which you simply dismissed it as irrelevant because it included my own experiments, thereby “contaminating” it.

But in the very first Table of our analysis, we presented the results for both the full sample of 90 studies and, separately, for the 69 replications conducted by independent researchers (from 33 laboratories in 14 countries on 10,000 participants).

These 69 (non-Bem-contaminated) independent replications yielded a z score of 4.16, p =1.2 x E-5.  The Bayes Factor was 3.85—generally considered large enough to provide “Substantial Evidence” for the experimental hypothesis.

Of these 69 studies, 31 were exact replications in that the investigators used my computer programs for conducting the experiments, thereby controlling the stimuli, the number of trials, all event timings, and automatic data recording. The data were also encrypted to ensure that no post-experiment manipulations were made on them by the experimenters or their assistants. (My own data were similarly encrypted to prevent my own assistants from altering them.) The remaining 38 “modified” independent replications variously used investigator-designed computer programs, different stimuli, or even automated sessions conducted online.

Both exact and modified replications were statistically significant and did not differ from one another.  Both peer reviewed and non-peer reviewed replications were statistically significant and did not differ from one another. Replications conducted prior to the publication of my own experiments and those conducted after their publication were each statistically significant and did not differ from one another.

We also used the recently introduced p-curve analysis to rule out several kinds of selection bias (file drawer problems) and p-hacking, and to estimate “true” effect sizes. There was no evidence of p-hacking in the database, and the effect size for the non-Bem replications was 0.24, somewhat higher than the average effect size of my 11 original experiments (0.22).  (This is also higher than the mean effect size of 0.21 achieved by Presentiment experiments in which indices of participants’ physiological arousal “precognitively” anticipate the random presentation of an arousing stimulus.)

For various reasons, you may not find our meta-analysis any more persuasive than my original publication, but your website followers might.

Best,
Daryl J. Bem

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 20, 2018 6:48 PM

Dear Dr. Bem,

Thank you for your final response.   It answers all of my questions.

I am sorry if you felt bothered by my emails, but I am confident that many psychologists are interested in your answers to my questions.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Saturday, January 20, 2018 5:56 PM

Dear Dr. Schimmack,

I hereby grant you permission to be the conduit for making my data available to those requesting them. Most of the researchers who contributed to our 2015/16 meta-analysis of 90 retroactive “feeling-the-future” experiments have already received the data they required for replicating my experiments.

At the moment, I am planning to follow up our meta-analysis of 90 experiments by setting up pre-registered studies. That seems to me to be the most profitable response to the methodological, statistical, and reporting critiques that have emerged since I conducted my original experiments more than a decade ago.  To respond to your most recent request, I am not planning at this time to write any commentary to your posts.  I am happy to let replications settle the matter.

(One minor point: I did not spend \$90,000 to conduct my experiments.  Almost all of the participants in my studies at Cornell were unpaid volunteers taking psychology courses that offered (or required) participation in laboratory experiments.  Nor did I discard failed experiments or make decisions on the basis of the results obtained.)

What I did do was spend a lot of time and effort preparing and discarding early versions of written instructions, stimulus sets and timing procedures.  These were pretested primarily on myself and my graduate assistants, who served repeatedly as pilot subjects. If instructions or procedures were judged to be too time consuming, confusing, or not arousing enough, they were changed before the formal experiments were begun on “real” participants.  Changes were not made on the basis of positive or negative results because we were only testing the procedures on ourselves.

When I did decide to change a formal experiment after I had started it, I reported it explicitly in my article. In several cases I wrote up the new trials as a modified replication of the prior experiment.  That’s why there are more experiments than phenomena in my article:  2 approach/avoidance experiments, 2 priming experiments, 3 habituation experiments, & 2 recall experiments.

In some cases the literature suggested that some parameters would be systematically related to the dependent variables in nonlinear fashion—e.g., the number of subliminal presentations used in the familiarity-produces-increased liking effect, which has a curvilinear relationship.  In that case, I incorporated the variable as a systematic independent variable. That is also reported in the article.

It took you approximately 3 years to post your responses to my experiments after I sent you the data.  Understandable for a busy scholar.  But it is a bit unseemly for you to then send me near-daily reminders over the past 3 weeks to respond to you (as Schumann commands in the first movement of his Piano Sonata in G minor) “so schnell wie möglich!” [“as fast as possible!”]  And then, a page later, “Schneller!” [“Faster!”]

Such impertinence!   If I may say so.

Daryl J.  Bem
Professor Emeritus of Psychology

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 20, 2018, 1.06 PM

Dear Dr. Bem,

I want to post my blog about Study 6 tomorrow. If you want to comment on it before I post it, please do so today.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Monday, January 15, 2018, 10.35 PM

You are correct:  Experiment 8, the first Retroactive Recall experiment was conducted in 2007 and its replication (Experiment 9) was conducted in 2009.

The Avoidance of Negative Stimuli (Study/Experiment 2)  was conducted (and reported as a single experiment with 150 sessions) in 2008.  More later.

Best,
Daryl Bem

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Monday, January 15, 2018, 8.52 PM

Dear Dr. Bem,

Thank you for your table.  I think we are mostly in agreement (sorry if I confused you by calling studies datasets). The numbers are supposed to correspond to the experiment numbers in your table.

The only remaining inconsistency is that the datafile for study 8 shows year 2007, while you have 2008 in your table.

Best, Dr. Schimmack

Study    Sample    Year       N             Experiment
5              1              2002       50           #5: Retroactive Habituation I (Neg only)
5              2              2002       50           #5: Retroactive Habituation I (Neg only)
6              1              2002       91           #6: Retroactive Habituation II (Neg & Erot)
6              2              2002       19           #6: Retroactive Habituation II (Neg & Erot)
6              3              2002       40           #6: Retroactive Habituation II (Neg & Erot)
7              1              2005       200         #7: Retroactive Induction of Boredom
1              1              2006       40           #1: Precognitive Detection of Erotic Stimuli
1              2              2006       60           #1: Precognitive Detection of Erotic Stimuli
2              1              2008       100         #2: Precognitive Avoidance of Negative Stimuli
2              2              2008       50           #2: Precognitive Avoidance of Negative Stimuli
3              1              2007       100         #3: Retroactive Priming I
4              1              2008       100         #4: Retroactive Priming  II
8?           1              2007/08  100         #8: Retroactive Facilitation of Recall I
9              1              2009       50           #9: Retroactive Facilitation of Recall II
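
As a quick consistency check, the rows of the table above can be tallied by experiment. The following Python sketch is an illustration only; the variable and function names are hypothetical, and the rows are copied from the table (the year for Experiment 8 is kept as the string "2007/08" because it was still unresolved at this point).

```python
from collections import defaultdict

# Rows copied from the table above: (experiment, sample, year, n).
rows = [
    (5, 1, "2002", 50), (5, 2, "2002", 50),
    (6, 1, "2002", 91), (6, 2, "2002", 19), (6, 3, "2002", 40),
    (7, 1, "2005", 200),
    (1, 1, "2006", 40), (1, 2, "2006", 60),
    (2, 1, "2008", 100), (2, 2, "2008", 50),
    (3, 1, "2007", 100),
    (4, 1, "2008", 100),
    (8, 1, "2007/08", 100),
    (9, 1, "2009", 50),
]


def n_per_experiment(table_rows):
    """Sum the sample sizes of all samples that belong to the same experiment."""
    totals = defaultdict(int)
    for experiment, _sample, _year, n in table_rows:
        totals[experiment] += n
    return dict(totals)


totals = n_per_experiment(rows)
for experiment in sorted(totals):
    print(f"Experiment {experiment}: N = {totals[experiment]}")
print("Total sessions:", sum(totals.values()))
```

The totals for Experiments 5 and 6 (100 and 150) agree with the sample sizes quoted from the article elsewhere in this exchange.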

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Monday, January 15, 2018, 4.17 PM

Dear Dr. Schimmack,

Here is my analysis of your Table.  I will try to get to the rest of your commentary in the coming week.

Attached Word document:

Dear Dr. Schimmack,

In looking at your table, I wasn’t sure from your numbering of Datasets & Samples which studies corresponded to those reported in my Feeling the Future article.  So I have prepared my own table in the same ordering you have provided and added a column identifying the phenomenon under investigation  (It is on the next page)

Unless I have made a mistake in identifying them, I find agreement between us on most of the figures.  I have marked in red places where we seem to disagree, which occur on Datasets identified as 3 & 8.  You have listed the dates for both as 2007, whereas my datafiles have 2008 listed for all participant sessions which describe the Precognitive Avoidance experiment and its replication.  Perhaps I have misidentified the two Datasets.  The second discrepancy is that you have listed Dataset 8 as having 100 participants, whereas I ran only 50 sessions with a revised method of selecting the negative stimulus for each trial.  As noted in the article, this did not produce a significant difference in the size of the effect, so I included all 150 sessions in the write-up of that experiment.

I do find it useful to identify the Datasets & Samples with their corresponding titles in the article.  This permits readers to read the method sections along with the table.  Perhaps it will also identify the discrepancy between our Tables.  In particular, I don’t understand the separation in your table between Datasets 8 & 9.  Perhaps you have transposed Datasets 4 & 8.

If so, then Datasets 4 & 9 would each comprise 50 sessions.

More later.

Dataset Sample    Year       N
5              1              2002       50
5              2              2002       50
6              1              2002       91
6              2              2002       19
6              3              2002       40
7              1              2005       200
1              1              2006       40
1              2              2006       60
3              1              2007       100
8              1              2007       100
2              1              2008       100
2              2              2008       50
4              1              2008       100
9              1              2009       50

My Table:

Dataset Sample    Year       N             Experiment
5              1              2002       50           #5: Retroactive Habituation I (Neg only)
5              2              2002       50           #5: Retroactive Habituation I (Neg only)
6              1              2002       91          #6: Retroactive Habituation II (Neg & Erot)
6              2              2002       19           #6: Retroactive Habituation II (Neg & Erot)
6              3              2002       40          #6: Retroactive Habituation II (Neg & Erot)
7              1              2005       200         #7: Retroactive Induction of Boredom
1              1              2006       40           #1: Precognitive Detection of Erotic Stimuli
1              2              2006       60           #1: Precognitive Detection of Erotic Stimuli
3              1              2008       100         #2: Precognitive Avoidance of Negative Stimuli
8?           1              2008       50           #2: Precognitive Avoidance of Negative Stimuli
2              1              2007       100         #3: Retroactive Priming I
2              2              2008       100         #4: Retroactive Priming  II
4?           1              2008       100         #8: Retroactive Facilitation of Recall I
9              1              2009       50           #9: Retroactive Facilitation of Recall II

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Monday, January 15, 2018 10.46 AM

Dear Dr. Bem,

I am sorry to bother you with my requests. It would be helpful if you could let me know if you are planning to respond to my questions and if so, when you will be able to do so?

Best regards,
Dr. Ulrich Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 13, 2018 3.53 PM

Dear Dr. Bem,

I put together a table that summarizes when studies were done and how they were combined into datasets.

Please confirm that this is accurate or let me know if there are any mistakes.

Best, Dr. Schimmack

Dataset Sample    Year       N
5              1              2002       50
5              2              2002       50
6              1              2002       91
6              2              2002       19
6              3              2002       40
7              1              2005       200
1              1              2006       40
1              2              2006       60
3              1              2007       100
8              1              2007       100
2              1              2008       100
2              2              2008       50
4              1              2008       100
9              1              2009       50

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 13, 2018 2.42 PM

Dear Dr. Bem,

Also, other researchers are interested in looking at the data and I still need to hear from you how to share the datafiles.

Best, Dr. Schimmack

[Attachment: Draft of Blog Post]

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 7.47 PM

Dear Dr. Bem,

Also, is it ok for me to share your data in public or would you rather post them in public?

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 7.01 PM

Dear Dr. Bem,

Now that my question about Study 6 has been answered, I would like to hear your thoughts about my blog post. How do you explain the decline effect in your data? That is, effect sizes decrease over the course of each experiment, and when two experiments are combined into a single dataset, the decline effect seems to repeat at the beginning of the new study.  Study 6, your earliest study, doesn’t show the effect, but most other studies show this pattern.  As I pointed out on my blog, I think there are two explanations (see also Schooler, 2011).  Either unpublished studies with negative results were omitted or measurement of PSI makes the effect disappear.  What is probably most interesting is to know what you did when you encountered a promising pilot study.  Did you then start collecting new data with this promising procedure, or did you continue collecting data and retain the pilot data?

Best, Dr. Schimmack
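
One simple way to look for the pattern described in this email is to split the sessions of an experiment into consecutive blocks, in the order in which they were run, and compare the mean effect per block. The Python sketch below does this for made-up hit rates. The function name, the block size, and the data are assumptions for illustration and are not the analysis reported on the blog.

```python
from statistics import mean


def block_means(values, block_size):
    """Mean of each consecutive block of sessions, in chronological order."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        mean(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


# Made-up percentage hit rates for 12 sessions, ordered by session date.
hit_rates = [62.5, 56.25, 56.25, 50.0, 56.25, 50.0,
             50.0, 43.75, 50.0, 50.0, 43.75, 50.0]

for i, m in enumerate(block_means(hit_rates, 4), start=1):
    print(f"Block {i}: mean hit rate = {m:.2f}%")
```

A falling sequence of block means is only descriptive. By itself it does not distinguish between the two explanations mentioned in the email.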

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Friday, January 12, 2018 2.17 PM

Dear Dr. Schimmack,

You are correct that I calculated all hit rates against a fixed null of 50%.

You are also correct that the first 91 participants (Spring semester of 2002) were exposed to 48 trials: 16 Negative images, 16 Erotic images, and 16 Neutral images.

We continued with that same protocol in the Fall semester of 2002 for 19 additional sessions, sessions 51-91.

At this point, it was becoming clear from post-session debriefings of participants that the erotic pictures from the International Affective Picture System (IAPS) were much too mild, especially for male participants.

(Recall that this was chronologically my first experiment and also the first one to use erotic materials.  The observation that mild erotic stimuli are insufficiently arousing, at least for college students, was later confirmed in our 2016 meta-analysis, which found that Wagenmakers’ attempt to replicate my Experiment #1 (Which of two curtains hides an erotic picture?) using only mild erotic pictures was the only replication failure out of 11 replication attempts of that protocol in our database.)  In all my subsequent experiments with erotic materials, I used the stronger images and permitted participants to choose which kind of erotic images (same-sex vs. opposite-sex erotica) they would be seeing.

For this reason, I decided to introduce more explicit erotic pictures into this attempted replication of the habituation protocol.

In particular, Sessions 92-110 (19 sessions) also consisted of 48 trials, but they were divided into 12 Negative trials, 12 highly Erotic trials, & 24 Neutral trials.

Finally, Sessions 111-150 (40 sessions) increased the number of trials to 60:  15 Negative trials, 15 Highly Erotic trials, & 30 Neutral trials.  With the stronger erotic materials, we felt we needed to have relatively more neutral stimuli interspersed with the stronger erotic materials.

Best,
Daryl Bem
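
The trial counts described in this email are consistent with the decimals seen in the data file. A hit rate expressed as a percentage of n trials can only take certain fractional parts: with 16 trials per stimulus type the percentages end in .00, .25, .50, or .75, whereas with 12 or 15 trials they end in .00, .33, or .67. The short Python sketch below (the function name is hypothetical) enumerates these possibilities.

```python
from fractions import Fraction


def percent_decimals(n_trials):
    """Two-digit fractional parts that a percentage hit rate can take
    when it is computed from n_trials trials."""
    parts = set()
    for hits in range(n_trials + 1):
        pct = Fraction(100 * hits, n_trials)           # exact percentage
        frac = pct - (pct.numerator // pct.denominator)  # drop the integer part
        parts.add(round(float(frac), 2))
    return sorted(parts)


for n in (16, 12, 15):
    print(n, "trials:", percent_decimals(n))
```

Exact fractions are used so that the result does not depend on floating-point rounding of values such as 100/3.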

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 11.08 AM

Dear Dr. Bem,

I conducted further analyses and I figured out why I obtained discrepant results for Study 6.

I computed difference scores with the control condition, but the article reports results for a one-sample t-test of the hit rates against an expected value of 50%.

I also figured out that the first 91 participants were exposed to 16 critical trials and participants 92 to 150 were exposed to 30 critical trials. Can you please confirm this?

Best, Dr. Schimmack
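
The two analyses mentioned in this email answer different questions, so they need not give the same t-value. A one-sample t-test compares the mean hit rate with a fixed value of 50%, whereas a paired t-test (equivalently, a one-sample t-test on difference scores) compares each participant's hit rate with the same participant's control hit rate. A minimal Python sketch with made-up data and hypothetical function names:

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """t statistic for testing whether the mean of values differs from mu0."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / sqrt(n))


def paired_t(x, y):
    """Paired t statistic: a one-sample t-test on the difference scores x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    return one_sample_t(diffs, 0.0)


# Made-up percentage hit rates for six participants.
negative = [56.25, 50.0, 62.5, 43.75, 56.25, 50.0]
control = [50.0, 56.25, 50.0, 43.75, 62.5, 43.75]

print("t against 50%:", round(one_sample_t(negative, 50.0), 3))
print("t against control:", round(paired_t(negative, control), 3))
```

Both statistics have n - 1 degrees of freedom, but the numerator and the standard error differ, which is why the same data file can yield different t-values under the two analyses.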

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Thursday, January 11, 2018 10.53 PM

I’ll check them tomorrow to see where the problems are.

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 10, 2018 5.41 PM

Dear Dr. Bem,

I just double checked the data you sent me today and they match the data you sent me in 2015.

This means neither of these datasets reproduces the results reported in your 2011 article.

This means your article reported two more significant results (Study 6, Negative and Erotic) than the data support.

This raises further concerns about the credibility of your published results, in addition to the decline effect that I found in your data (except in Study 6, which also produced non-significant results).

Do you still believe that your 2011 studies provided credible information about time-reversed causality or do you think that you may have capitalized on chance by conducting many pilot studies?

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 10, 2018 5:03 PM

Dear Dr. Bem,

Frequencies of male and female in dataset 5.

> table(bem5\$Participant.Sex)

Female   Male
63     37

Article “One hundred Cornell undergraduates, 63 women and 37 men,
were recruited through the Psychology Department’s”

Analysis of dataset 5

One Sample t-test
data:  bem5\$N.PC.C.PC[b:e]
t = 2.7234, df = 99, p-value = 0.007639
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.137678 7.245655
sample estimates:
mean of  x
4.191667

Article “t(99) =  2.23, p = .014”

Conclusion:
Gender of participants matches.
t-values do not match, but both are significant.

Frequencies of male and female in dataset 6.

> table(bem6\$Participant.Sex)

Female   Male
87     63

Article: Experiment 6: Retroactive Habituation II
One hundred fifty Cornell undergraduates, 87 women and 63
men,

Negative

Paired t-test
data:  bem6\$NegHits.PC and bem6\$ControlHits.PC
t = 1.4057, df = 149, p-value = 0.1619
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8463098  5.0185321
sample estimates:
mean of the differences
2.086111

Erotic

Paired t-test
data:  bem6\$EroticHits.PC and bem6\$ControlHits.PC
t = -1.3095, df = 149, p-value = 0.1924
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.2094289  0.8538733
sample estimates:
mean of the differences
-1.677778

Article

Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, 51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041, thereby providing a successful replication of Experiment 5. On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, 48.2%, t(149) = -1.77, p = .039, d = 0.14, binomial z = -1.74, p = .041.

Conclusion:
t-values do not match. The article reports significant results, but the data you shared show non-significant results, although the gender composition matches the article.

I will double check the datafiles that you sent me in 2015 against the one you are sending me now.

Let’s first understand what is going on here before we discuss other issues.

Best, Dr. Schimmack
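
When comparing the R output in this email with the article, note that the R output reports two-tailed p-values (the alternative hypothesis is "not equal to 0"), whereas the p-values quoted from the article, for example t(149) = 1.80, p = .037, correspond to one-tailed tests. The Python sketch below computes both from a t-value using only the standard library. The function names are hypothetical, and the tail area is obtained by numerically integrating the t density with Simpson's rule, which is one of several possible approaches.

```python
from math import exp, lgamma, log, pi


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def upper_tail(t, df, steps=2000):
    """P(T >= |t|) via Simpson's rule on the density between 0 and |t|.

    steps must be even for Simpson's rule.
    """
    t = abs(t)
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def p_values(t, df):
    """Return (one-tailed, two-tailed) p-values for a t statistic.

    The one-tailed value assumes the effect is in the predicted direction.
    """
    one = upper_tail(t, df)
    return one, 2 * one


for t, df in [(1.80, 149), (1.4057, 149), (-1.3095, 149), (2.7234, 99)]:
    one, two = p_values(t, df)
    print(f"t({df}) = {t}: one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
```

The discrepancy discussed in this email concerns the t-values themselves, so the choice of one-tailed versus two-tailed tests does not explain it. It only matters when judging whether a given t-value counts as significant.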

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Wednesday, January 10, 2018 4:42 PM

Dear Dr. Schimmack,

Sorry for the delay.  I have been busy re-programming my new experiments so they can be run online, requiring me to relearn the programming language.

The confusion you have experienced arises because the data from Experiments 5 and 6 in my article were split differently for exposition purposes. If you read the report of those two experiments in the article, you will see that Experiment 5 contained 100 participants experiencing only negative (and control) stimuli.  Experiment 6 contained 150 participants who experienced negative, erotic, and control stimuli.

I started Experiment 5 (my first precognitive experiment) in the Spring semester of 2002. I ran the pre-planned 100 sessions, using only negative and control stimuli.  During that period, I was alerted to the 2002 publication by Dijksterhuis & Smith in the journal Emotion, in which they claimed to demonstrate the reverse of the standard “familiarity-promotes-liking” effect, showing that people also adapt to stimuli that are initially very positive and hence become less attractive as the result of multiple exposures.

So after completing my 100 sessions, I used what remained of the Spring semester to design and run a version of my own retroactive experiment that included erotic stimuli in addition to the negative and control stimuli.  I was able to run 50 sessions before the Spring semester ended, and I resumed that extended version of the experiment in the following Fall semester when student-subjects again became available until I had a total of 150 sessions of this extended version.  For purposes of analysis and exposition, I then divided the experiments as described in the article:  100 sessions with only negative stimuli and 150 sessions with negative and erotic stimuli.  No subjects or sessions have been added or omitted, just re-assembled to reflect the change in protocol.

I don’t remember how I sent you the original data, so I am attaching a comma-delimited file (which will open automatically in Excel if you simply double or right click it).  It contains all 250 sessions ordered by dates.  The fields provided are:  Session number (numbered from 1 to 250 in chronological order),  the date of the session, the sex of the participant, % of hits on negative stimuli, % of hits on erotic stimuli (which is blank for the 100 subjects in Experiment 5) and % of hits on neutral stimuli.

Let me know if you need additional information.

I hope to get to your blog post soon.

Best,
Daryl Bem

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 6, 2018 11:43 AM

Dear Dr. Bem,

Please reply as soon as possible to my email.  Other researchers are interested in analyzing the data and if I submit my analyses some journals want me to provide data or an explanation why I cannot share the data.  I hope to hear from you by the end of this week.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 6, 2018 11:43 AM

Dear Dr. Bem,

Meanwhile I posted a blog post about your 2011 article.  It has been well received by the scientific community.  I would like to encourage you to comment on it.

https://replicationindex.wordpress.com/2018/01/05/why-the-journal-of-personality-and-social-psychology-should-retract-article-doi-10-1037-a0021524-feeling-the-future-experimental-evidence-for-anomalous-retroactive-influences-on-cognition-a/

Best,
Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 3, 2018 4:12 PM

Dear Dr. Bem,

I am finally writing up the results of my reanalyses of your ESP studies.

I encountered one problem with the data for Study 6.

I cannot reproduce the test results reported in the article.

The article:

Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, 51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041, thereby providing a successful replication of Experiment 5. On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, 48.2%, t(149) = -1.77, p = .039, d = 0.14, binomial z = -1.74, p = .041.

I obtain

(negative)
t = 1.4057, df = 149, p-value = 0.1619

(erotic)
t = -1.3095, df = 149, p-value = 0.1924

Also, I wonder why the first 100 cases often produce decimals of .25 and the last 50 cases produce decimals of .33.

It would be nice if you could look into this and let me know what could explain the discrepancy.

Best,
Uli Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Wednesday, February 25, 2015 2:47 AM

Dear Dr. Schimmack,

Attached is a folder of the data from my nine “Feeling the Future” experiments.  The files are plain text files, one line for each session, with variables separated by tabs.  The first line of each file is the list of variable names, also separated by tabs. I have omitted participants’ names but supplied their sex and age.

You should consult my 2011 article for the descriptions and definitions of the dependent variables for each experiment.

Most of the files contain the following variables: Session#, Date, StartTime, Session Length, Participant’s Sex, Participant’s Age, Experimenter’s Sex,  [the main dependent variable or variables], Stimulus Seeking score (from 1 to 5).

For the priming experiments (#3 & #4), the dependent variables are LnRT Forward and LnRT Retro, where Ln is the natural log of Response Times. As described in my 2011 publication, each response time (RT) is transformed by taking the natural log before being entered into calculations.  The software subtracts the mean transformed RT for congruent trials from the mean Transformed RT for incongruent trials, so positive values of LnRT indicate that the person took longer to respond to incongruent trials than to congruent trials.  Forward refers to the standard version of affective priming and Retro refers to the time-reversed version.  In the article, I show the results for both the Ln transformation and the inverse transformation (1/RT) for two different outlier definitions.  In the attached files, I provide the results using the Ln transformation and the definition of a too-long RT outlier as 2500 ms.

Subjects who made too many errors (> 25%) in judging the valence of the target picture were discarded. Thus, 3 subjects were discarded from Experiment #3 (hence N = 97) and 1 subject was discarded from Experiment #4 (hence N  = 99).  Their data do not appear in the attached files.

Note that the habituation experiment #5 used only negative and control (neutral) stimuli.

Habituation experiment #6 used negative, erotic, and control (neutral) stimuli.

Retro Boredom experiment #7 used only neutral stimuli.

In Experiment #8, the first  Retro Recall, the first 100 sessions are experimental sessions.  The last 25 sessions are no-practice control sessions.  The type of session is the second variable listed.

In Experiment #9, the first 50 sessions are the experimental sessions and the last 25 are no-practice control sessions.   Be sure to exclude the control sessions when analyzing the main experimental sessions. The summary measure of psi performance is the Precog% Score (DR%) whose definition you will find on page 419 of my article.

Let me know if you encounter any problems or want additional data.

Sincerely,
Daryl J.  Bem
Professor Emeritus of Psychology
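
The LnRT score described in this email for the priming experiments (#3 and #4) can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the sketch assumes that trials slower than the 2500 ms cutoff are simply dropped before averaging, which the email does not spell out.

```python
from math import log
from statistics import mean


def ln_rt_score(congruent_ms, incongruent_ms, max_rt_ms=2500):
    """Mean ln(RT) on incongruent trials minus mean ln(RT) on congruent trials.

    Assumption: response times above max_rt_ms are excluded as outliers.
    """
    def mean_ln(rts):
        kept = [rt for rt in rts if rt <= max_rt_ms]
        if not kept:
            raise ValueError("no valid trials left after outlier exclusion")
        return mean(log(rt) for rt in kept)

    return mean_ln(incongruent_ms) - mean_ln(congruent_ms)


# Made-up response times in milliseconds for one participant.
congruent = [500, 600]
incongruent = [700, 800, 3000]  # 3000 ms exceeds the cutoff and is dropped

score = ln_rt_score(congruent, incongruent)
print(f"LnRT = {score:.3f}")
```

A positive score means that the participant responded more slowly on incongruent trials than on congruent trials, which is the direction described in the email.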

—————————————————————————————————————————————–

A day ago

# ‘Before you know it’ by John A. Bargh: A quantitative book review

Nov 28, 2017 12:41 PM

# (Preprint) Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies (Schimmack & Brunner, 2017)

Nov 16, 2017 3:46 PM

# Preliminary 2017 Replicability Rankings of 104 Psychology Journals

Oct 24, 2017 2:56 PM

# P-REP (2005-2009): Reexamining the experiment to replace p-values with the probability of replicating an effect

Sep 19, 2017 12:01 PM

# The Power of the Pen Paradigm: A Replicability Analysis

Sep 4, 2017 8:48 PM

# What would Cohen say? A comment on p < .005

Aug 2, 2017 11:46 PM

# How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

May 15, 2017 9:24 PM

# How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press).

May 4, 2017 8:41 PM

# Hidden Figures: Replication Failures in the Stereotype Threat Literature

Apr 7, 2017 10:18 AM

# Personalized Adjustment of p-values for publication bias

Mar 13, 2017 2:31 PM

# Meta-Psychology: A new discipline and a new journal (draft)

Mar 5, 2017 2:35 PM

# 2016 Replicability Rankings of 103 Psychology Journals

Mar 1, 2017 12:01 AM

# An Attempt at Explaining Null-Hypothesis Testing and Statistical Power with 1 Figure and 1,500 Words

Feb 26, 2017 10:35 AM

# Random measurement error and the replication crisis: A statistical analysis

Feb 23, 2017 7:38 AM

# How Selection for Significance Influences Observed Power

Feb 21, 2017 10:38 AM

# Reconstruction of a Train Wreck: How Priming Research Went off the Rails

Feb 2, 2017 11:33 AM

# Are Most Published Results in Psychology False? An Empirical Study

Jan 15, 2017 1:42 PM

# Reexamining Cunningham, Preacher, and Banaji’s Multi-Method Model of Racism Measures

Jan 8, 2017 7:50 PM

# Validity of the Implicit Association Test as a Measure of Implicit Attitudes

Jan 5, 2017 5:20 PM

# Replicability Review of 2016

Dec 31, 2016 5:25 PM

# Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Dec 12, 2016 5:35 PM

# How did Diedrik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Dec 6, 2016 12:27 PM

# A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

Dec 3, 2016 9:31 PM

# A replicability analysis of”I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dec 3, 2016 3:15 PM

# Bayesian Meta-Analysis: The Wrong Way and The Right Way

Nov 28, 2016 3:25 PM

# Peer-Reviews from Psychological Methods

Nov 18, 2016 6:33 PM

# How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Sep 17, 2016 5:44 PM

# A Critical Review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”

Sep 16, 2016 11:20 AM

# Dr. R responds to Finkel, Eastwick, & Reis (FER)’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

Sep 13, 2016 10:10 PM

# The decline effect in social psychology: Evidence and possible explanations

Sep 4, 2016 9:26 AM

# Fritz Strack’s self-serving biases in his personal account of the failure to replicate his most famous study.

Aug 29, 2016 12:10 PM

# How Can We Interpret Inferences with Bayesian Hypothesis Tests?

Aug 9, 2016 4:16 PM

# Bayes Ratios: A Principled Approach to Bayesian Hypothesis Testing

Jul 25, 2016 12:51 PM

# How Does Uncertainty about Population Effect Sizes Influence the Probability that the Null-Hypothesis is True?

Jul 16, 2016 11:53 AM

# Subjective Bayesian T-Test Code

Jul 5, 2016 6:58 PM

# Wagenmakers’ Default Prior is Inconsistent with the Observed Results in Psychologial Research

Jun 30, 2016 10:15 AM

# A comparison of The Test of Excessive Significance and the Incredibility Index

Jun 18, 2016 11:52 AM

# R-Code for (Simplified) Powergraphs with StatCheck Dataset

Jun 17, 2016 6:32 PM

# Replicability Report No.2: Do Mating Primes have a replicable effects on behavior?

May 21, 2016 9:37 AM

# Subjective Priors: Putting Bayes into Bayes-Factors

May 18, 2016 12:22 PM

# Who is Your Daddy? Priming women with a disengaged father increases their willingness to have sex without a condom

May 11, 2016 4:15 PM

# Bayes-Factors Do Not Solve the Credibility Problem in Psychology

May 9, 2016 6:03 PM

# Die Verdrängung des selektiven Publizierens: 7 Fallstudien von prominenten Sozialpsychologen

Apr 20, 2016 10:05 PM

# Replicability Report No. 1: Is Ego-Depletion a Replicable Effect?

Apr 18, 2016 7:28 PM

# Open Ego-Depletion Replication Initiative

Mar 26, 2016 10:10 AM

# Estimating Replicability of Psychological Science: 35% or 50%

Mar 12, 2016 11:56 AM

# MY JOURNEY TOWARDS ESTIMATION OF REPLICABILITY OF PSYCHOLOGICAL RESEARCH

Mar 4, 2016 10:11 AM

# Replicability Ranking of Psychology Departments

Mar 2, 2016 5:01 PM

# Reported Success Rates, Actual Success Rates, and Publication Bias In Psychology: Honoring Sterling et al. (1995)

Feb 26, 2016 5:50 PM

# Are You Planning a 10-Study Article? You May Want to Read This First

Feb 10, 2016 5:24 PM

# Dr. R Expresses Concerns about Results in Latest Psycholgical Science Article by Yaacov Trope and colleagues

Feb 9, 2016 11:41 AM

# Dr. R’s Blog about Replicability

Feb 5, 2016 5:32 PM

# A Scathing Review of “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science”

Feb 3, 2016 10:32 AM

# Keep your Distance from Questionable Results

Jan 31, 2016 5:05 PM

# Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, (21)7, p.1014-1020

Jan 31, 2016 5:01 PM

# A Revised Introduction to the R-Index

Jan 31, 2016 3:50 PM

# 2015 Replicability Ranking of 100+ Psychology Journals

Jan 26, 2016 12:17 PM

# Is the N-pact Factor (NF) a Reasonable Proxy for Statistical Power and Should the NF be used to Rank Journals’ Reputation and Replicability? A Critical Review of Fraley and Vazir (2014)

Jan 16, 2016 7:56 PM

# On the Definition of Statistical Power

Jan 14, 2016 3:36 PM

# The Abuse of Hoenig and Heisey: A Justification of Power Calculations with Observed Effect Sizes

Jan 14, 2016 2:06 PM

# Do Deceptive Reporting Practices in Social Psychology Harm Social Psychology?

Jan 13, 2016 12:24 PM

# “Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerg Giegerenzer

Jan 12, 2016 1:51 PM

# Distinguishing Questionable Research Practices from Publication Bias

Dec 8, 2015 6:00 PM

# 2015 Replicability Ranking of 54 Psychology Journals

Oct 27, 2015 7:05 PM

# Dr. R’s comment on the Official Statement by the Board of the German Psychological Association (DGPs) about the Results of the OSF-Reproducibility Project published in Science.

Oct 10, 2015 12:01 PM

# Replicability Ranking of 27 Psychology Journals (2015)

Sep 28, 2015 7:42 AM

# Replicability Report for JOURNAL OF SOCIAL AND PERSONAL RELATIONSHIPS

Sep 28, 2015 7:02 AM

# Replicability Report for PERSONAL RELATIONSHIPS

Sep 27, 2015 5:40 PM

# Replicability Report for DEVELOPMENTAL SCIENCE

Sep 27, 2015 3:16 PM

# Replicability Report for JOURNAL OF POSITIVE PSYCHOLOGY

Sep 26, 2015 8:08 PM

# Replicability-Report for CHILD DEVELOPMENT

Sep 26, 2015 10:37 AM

# Replicability-Report for PSYCHOLOGY & AGING

Sep 25, 2015 5:47 PM

# Replicability-Report for JOURNAL OF EXPERIMENTAL PSYCHOLOGY: HUMAN PERCEPTION AND PERFORMANCE

Sep 25, 2015 12:33 PM

# “THE STATISTICAL POWER OF ABNORMAL-SOCIAL PSYCHOLOGICAL RESEARCH: A REVEW” BY JACOB COHEN

Sep 22, 2015 7:11 AM

# Replicability Report for the BRITISH JOURNAL OF SOCIAL PSYCHOLOGY

Sep 18, 2015 6:40 PM

# Replicability-Ranking of 100 Social Psychology Departments

Sep 15, 2015 12:49 PM

# Replicability Report for the journal SOCIAL PSYCHOLOGY

Sep 13, 2015 4:54 PM

# Replicability-Report for the journal JUDGMENT AND DECISION MAKING

Sep 13, 2015 1:08 PM

# Replicability-Report for JOURNAL OF CROSS-CULTURAL PSYCHOLOGY

Sep 12, 2015 12:32 PM

# Replicability-Report for SOCIAL COGNITION

Sep 11, 2015 6:43 PM

# Examining the Replicability of 66,212 Published Results in Social Psychology: A Post-Hoc-Power Analysis Informed by the Actual Success Rate in the OSF-Reproducibilty Project

Sep 7, 2015 9:20 AM

# The Replicability of Cognitive Psychology in the OSF-Reproducibility-Project

Sep 5, 2015 2:47 PM

# The Replicability of Social Psychology in the OSF-Reproducibility Project

Sep 3, 2015 8:18 PM

# Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommeding a 4-Sigma Rule

Aug 30, 2015 11:04 AM

# Predictions about Replication Success in OSF-Reproducibility Project

Aug 26, 2015 9:54 PM

# Replicability-Report for EUROPEAN JOURNAL OF SOCIAL PSYCHOLOGY

Aug 22, 2015 8:00 PM

# Replicability-Report for JOURNAL OF EXPERIMENTAL SOCIAL PSYCHOLOGY

Aug 22, 2015 5:30 PM

# Replicability-Report for JOURNAL OF MEMORY AND LANGUAGE

Aug 22, 2015 3:22 PM

# Replicability-Report for JOURNAL OF EXPERIMENTAL PSYCHOLOGY: GENERAL

Aug 21, 2015 5:52 PM

# Replicability-Report for DEVELOPMENTAL PSYCHOLOGY

Aug 21, 2015 4:36 PM

# Replicability-Report for JOURNAL OF EXPERIMENTAL PSYCHOLOGY: LEARNING, MEMORY, AND COGNITION

Aug 19, 2015 10:21 AM

# Replicability-Report for COGNITIVE PSYCHOLOGY

Aug 18, 2015 10:03 AM

# Replicability-Report for COGNITION & EMOTION

Aug 18, 2015 8:37 AM

# Replicability-Report for SOCIAL PSYCHOLOGY AND PERSONALITY SCIENCE

Aug 18, 2015 7:35 AM

# Replicability-Report for EMOTION

Aug 17, 2015 7:53 PM

# Replicability Report for PERSONALITY AND SOCIAL PSYCHOLOGY BULLETIN

Aug 17, 2015 9:59 AM

# Replicability-Report for JPSP: INTERPERSONAL RELATIONSHIPS & GROUP PROCESSES

Aug 16, 2015 9:51 AM

# Replicability-Report for JPSP: ATTITUDES & SOCIAL COGNITION

Aug 16, 2015 6:12 AM

# Replicability-Report for JPSP: Personality Processes and Individual Differences

Aug 15, 2015 8:30 PM

# Replicability Report for PSYCHOLOGICAL SCIENCE

Aug 15, 2015 5:51 PM

# REPLICABILITY RANKING OF 26 PSYCHOLOGY JOURNALS

Aug 13, 2015 4:16 PM

# Using the R-index to detect questionable research practices in SSRI studies

Aug 5, 2015 2:34 PM

# R-Index predicts lower replicability of “subliminal” studies than “attribution” studies in JESP

Jul 7, 2015 9:37 PM

# Post-Hoc-Power Curves of Social Psychology in Psychological Science, JESP, and Social Cognition

Jul 7, 2015 7:07 PM

# Post-Hoc Power Curves: Estimating the typical power of statistical tests (t, F) in Psychological Science and Journal of Experimental Social Psychology

Jun 27, 2015 10:48 PM

# When Exact Replications Are Too Exact: The Lucky-Bounce-Test for Pairs of Exact Replication Studies

May 27, 2015 11:30 AM

# The Association for Psychological Science Improves Success Rate from 95% to 100% by Dropping Hypothesis Testing: The Sample Mean is the Sample Mean, Type-I Error 0%

May 21, 2015 1:40 PM

# R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

May 19, 2015 5:43 PM

# A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

May 18, 2015 6:41 PM

# Power Analysis for Bayes-Factor: What is the Probability that a Study Produces an Informative Bayes-Factor?

May 16, 2015 4:41 PM

# A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

May 16, 2015 7:51 AM

# The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

May 13, 2015 12:40 PM

# Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior

May 9, 2015 10:43 AM

# Replacing p-values with Bayes-Factors: A Miracle Cure for the Replicability Crisis in Psychological Science

Apr 30, 2015 1:48 PM

# Further reflections on the linearity in Dr. Förster’s Data

Apr 21, 2015 7:57 AM

# The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)

Apr 20, 2015 6:04 PM

# Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

Apr 9, 2015 12:26 PM

# Meta-Analysis of Observed Power: Comparison of Estimation Methods

Apr 1, 2015 9:49 PM

# An Introduction to Observed Power based on Yuan and Maxwell (2005)

Mar 24, 2015 7:24 PM

# R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

Feb 6, 2015 11:20 AM

# Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

Feb 2, 2015 10:45 AM

# Questionable Research Practices: Definition, Detect, and Recommendations for Better Practices

Jan 24, 2015 8:40 AM

# Further reflections on the linearity in Dr. Förster’s Data

Jan 14, 2015 5:47 PM

# Why are Stereotype-Threat Effects on Women’s Math Performance Difficult to Replicate?

Jan 6, 2015 12:01 PM

# How Power Analysis Could Have Prevented the Sad Story of Dr. Förster

Jan 2, 2015 12:22 PM

# A Playful Way to Learn about Power, Publication Bias, and the R-Index: Simulate questionable research methods and see what happens.

Dec 31, 2014 3:25 PM

# The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

Dec 30, 2014 10:22 PM

# Christmas Special: R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility”

Dec 24, 2014 1:54 PM

# The R-Index of Ego-Depletion Studies with the Handgrip Paradigm

Dec 21, 2014 2:21 PM

# The R-Index of Nicotine-Replacement-Therapy Studies: An Alternative Approach to Meta-Regression

Dec 17, 2014 3:52 PM

# The R-Index of Simmons et al.’s 21 Word Solution

Dec 17, 2014 7:14 AM

# The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)

Dec 13, 2014 10:42 AM

# Do it yourself: R-Index Spreadsheet and Manual is now available.

Dec 7, 2014 11:01 AM

# Nature Neuroscience: R-Index

Dec 5, 2014 11:44 AM

# Dr. Schnall’s R-Index

Dec 4, 2014 11:57 AM

# Roy Baumeister’s R-Index

Dec 1, 2014 6:29 PM

# The Replicability-Index (R-Index): Quantifying Research Integrity

Nov 30, 2014 10:19 PM

# Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem

Added January 30, 2018: A formal letter to the editor of JPSP, calling for a retraction of the article (Letter).

“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)

In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time-reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.

The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results. “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results.” (Bem, 2011, p. 407).

The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.

In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.

The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt on Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem, which was based on over 1,000 participants. A more informative replication attempt was made by Galak et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not.” (p. 941).

Galak et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value if the reported data and analyses were not influenced by the result. If, however, data or analyses were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntarily (other-deception). Both self-deception and other-deception introduce bias in the scientific record.

Replication studies cannot provide evidence about bias in original studies. A replication study only tells us that other scientists were unable to replicate original findings; it does not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.

I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely.  Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.

1. Luck

The simplest and most parsimonious explanation for a stunning original result that cannot be replicated is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.

If Bem or other scientists encountered one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies will not replicate the effect.

Attributing Bem’s results to luck would be possible if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt. The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.
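The figure above can be checked with a binomial calculation. The sketch below assumes that the 10 tests are independent and that "9 out of 10" means at least 9 significant results at a 5% criterion; the function and variable names are illustrative only.

```python
from math import comb

ALPHA = 0.05   # significance criterion per test
N_TESTS = 10   # number of reported tests


def prob_at_least(k, n, p):
    """Binomial probability of k or more successes in n independent tries."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_luck = prob_at_least(9, N_TESTS, ALPHA)
print(f"P(at least 9 of 10 significant by chance) = {p_luck:.2e}")
print(f"That is about 1 in {1 / p_luck:,.0f}")
```

Under these assumptions the result agrees with the probability quoted in the text.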

Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is, evidence that strengthens beliefs in a phenomenon that actually does not exist.

2. Questionable Research Practices

The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies. This finding suggests that questionable research practices were used, but it does not provide information about the actual research practices that were used. John et al. listed a number of questionable research practices that might explain Bem’s findings.

2.1 Multiple Dependent Variables

One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.

2.2 Failure to report all conditions

This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.

2.3 Generous Rounding

Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test-statistics or by redoing the statistical analysis from original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a p-value falls into the range between .05 and .055 and that this could happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.

2.4 HARKing

Hypothesizing after results are known (Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except Study 1. The studies build on a meta-analysis of prior studies and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing cannot explain Bem’s findings either.

2.5 Excluding of Data

Sometimes non-significant results are caused by an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 participants was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023 into a non-significant result. Across all studies, only 4 participants out of 1075 participants were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.

2.6 Stopping Data Collection Early

Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.
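The drop in power from 80% to 54% can be approximated as follows. This is a rough sketch rather than Bem’s own power analysis: it assumes a one-tailed test at the 5% level, uses a normal approximation to the one-sample test, and back-solves the effect size that would give 80% power at N = 100, because the planned effect size is not stated here.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05                      # one-tailed criterion (assumed)
norm = NormalDist()               # standard normal distribution
z_crit = norm.inv_cdf(1 - ALPHA)  # critical z-value, about 1.645


def power(d, n):
    """Approximate power of a one-tailed one-sample z-test."""
    return 1 - norm.cdf(z_crit - d * sqrt(n))


# Effect size implied by 80% power with N = 100 (back-solved, hypothetical)
d_implied = (z_crit + norm.inv_cdf(0.80)) / sqrt(100)

power_100 = power(d_implied, 100)
power_50 = power(d_implied, 50)
print(f"implied d = {d_implied:.3f}")
print(f"power with N = 100: {power_100:.2f}")
print(f"power with N = 50:  {power_50:.2f}")
```

With these assumptions, halving the sample size to N = 50 gives power close to the 54% mentioned above.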

In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.

2.7 Optional Stopping/Snooping

Collecting more data when the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The increase in the risk of a false positive result depends on how frequently researchers check results. Optional stopping leaves three traces in the data. First, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes after a long time. Second, sample size would be negatively correlated with observed effect sizes. The reason is that larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion because data collection is terminated as soon as p-values reach the criterion value.

The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly negatively correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).

Despite these warning signs, optional stopping cannot explain Bem’s results, if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to produce significant results. In a simulation study, with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.
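A simulation of this kind can be sketched as follows. It is not the simulation reported above: the checking rule (a one-tailed z-test with known standard deviation after every participant from N = 50 to N = 200), the number of simulated studies, and the seed are all assumptions. The resulting rate depends on how often results are checked and on the test used, so it need not match the 30% figure.

```python
import random
from math import sqrt

random.seed(2011)          # arbitrary seed, for reproducibility only
N_MIN, N_MAX = 50, 200     # earliest and latest possible stopping points
Z_CRIT = 1.645             # one-tailed 5% criterion
N_SIMS = 2000              # number of simulated studies


def study_reaches_significance():
    """Simulate one study with no true effect and optional stopping."""
    total = 0.0
    for n in range(1, N_MAX + 1):
        total += random.gauss(0.0, 1.0)   # null effect: population mean is 0
        # Peek after every participant once the minimum sample is reached
        if n >= N_MIN and total / sqrt(n) > Z_CRIT:
            return True
    return False


hits = sum(study_reaches_significance() for _ in range(N_SIMS))
false_positive_rate = hits / N_SIMS
print(f"Share of null studies reaching significance: {false_positive_rate:.2f}")
```

The rate is inflated above the nominal 5%, yet it stays far below the success rate needed for nearly every study in a set of 9 to come out significant.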

2.8 Selective Reporting

The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Franco et al., 2014).

Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.

Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants \$5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid \$90,000 out of pocket.
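The arithmetic behind these figures is short enough to spell out. The sketch below uses only the numbers given in the text (a 5% criterion, 9 significant results, a modal sample size of 100, and 5 dollars per participant); the variable names are illustrative.

```python
ALPHA = 0.05              # chance of a significant result without a real effect
N_SIGNIFICANT = 9         # significant results to be explained
MODAL_N = 100             # typical sample size per study
PAY_PER_PARTICIPANT = 5   # dollars paid per participant

# Expected number of attempts needed for 9 chance successes
expected_studies = round(N_SIGNIFICANT / ALPHA)
total_participants = expected_studies * MODAL_N
total_cost = total_participants * PAY_PER_PARTICIPANT

print(expected_studies, total_participants, total_cost)
```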

As a strong believer in ESP, Bem may have paid \$90,000 to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology:

“The integrity of the scientific enterprise requires the reporting of disconfirming results.”

2.9 Conclusion

In conclusion, none of the questionable research practices that have been identified by John et al. seems to be a plausible explanation for Bem’s results.

3. The Decline Effect and a New Questionable Research Practice

When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased.  This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).

Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.

The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.

Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants, none of the 9 available tests is statistically significant (Experiment 9 has no data beyond the first 50 participants). A meta-analysis across the studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).

Table 1. P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.

| Experiment | Participants 1-50 | Participants 51-100 |
|---|---|---|
| EXP1 | p = .004 | p = .194 |
| EXP2 | p = .096 | p = .170 |
| EXP3 | p = .039 | p = .100 |
| EXP4 | p = .033 | p = .067 |
| EXP5 | p = .013 | p = .069 |
| EXP6a | p = .412 | p = .126 |
| EXP6b | p = .023 | p = .410 |
| EXP7 | p = .020 | p = .338 |
| EXP8 | p = .010 | p = .318 |
| EXP9 | p = .003 | NA |
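The variance test can be reproduced from the nine p-values reported in Table 1 for participants 51 to 100. The sketch below follows the general logic of the Test of Insufficient Variance as I understand it: one-sided p-values are converted to z-scores, and the observed variance is compared with the expected variance of 1 using a left-tailed chi-square test. The closed-form chi-square formula used here is valid only for even degrees of freedom, and the Stouffer combination stands in for the meta-analysis as an assumption.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist, variance

# One-sided p-values for participants 51-100 (Table 1; EXP9 has no data)
p_values = [.194, .170, .100, .067, .069, .126, .410, .338, .318]

norm = NormalDist()
z_scores = [norm.inv_cdf(1 - p) for p in p_values]

# Stouffer's method: combined z-score across the nine tests
stouffer_z = sum(z_scores) / sqrt(len(z_scores))

# Variance of z-scores; without selection it should be close to 1
var_z = variance(z_scores)
df = len(z_scores) - 1          # 8 degrees of freedom
chi_square = var_z * df


def chi2_cdf_even_df(x, df):
    """Chi-square CDF, closed form for even degrees of freedom."""
    half = x / 2
    return 1 - exp(-half) * sum(half**j / factorial(j) for j in range(df // 2))


tiva_p = chi2_cdf_even_df(chi_square, df)   # left tail: too little variance
print(f"Stouffer z = {stouffer_z:.2f}")
print(f"Var(z) = {var_z:.2f}, p = {tiva_p:.3f}")
```

Run on the tabled p-values, this sketch gives a variance and p-value that round to the values reported in the text, and the combined z-score exceeds the one-tailed criterion of 1.645.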

There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continues to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated a researcher to collect more data is sufficient to maintain statistical significance because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).
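The mechanism described in this paragraph can be illustrated with a small simulation. Everything specific in it is an assumption made for illustration: no true effect, a peek after the first 10 participants, continuation only when the observed effect size is at least d = 0.5, a final sample of 100, and a one-tailed z-test with known standard deviation.

```python
import random
from math import sqrt

random.seed(1)              # arbitrary seed, for reproducibility only
N_PEEK, N_FINAL = 10, 100   # peek point and final sample size (hypothetical)
PROMISING_D = 0.5           # effect size treated as "promising" (hypothetical)
Z_CRIT = 1.645              # one-tailed 5% criterion
N_ATTEMPTS = 20000          # number of studies started

early_ds, final_ds, significant = [], [], 0
for _ in range(N_ATTEMPTS):
    first = [random.gauss(0.0, 1.0) for _ in range(N_PEEK)]  # null effect
    d_early = sum(first) / N_PEEK
    if d_early < PROMISING_D:
        continue            # unpromising "pilot" is abandoned, never reported
    rest = [random.gauss(0.0, 1.0) for _ in range(N_FINAL - N_PEEK)]
    d_final = (sum(first) + sum(rest)) / N_FINAL
    early_ds.append(d_early)
    final_ds.append(d_final)
    if d_final * sqrt(N_FINAL) > Z_CRIT:
        significant += 1

n_continued = len(early_ds)
mean_early = sum(early_ds) / n_continued
mean_final = sum(final_ds) / n_continued
sig_rate = significant / n_continued
print(f"continued studies: {n_continued} of {N_ATTEMPTS}")
print(f"mean d after {N_PEEK} participants: {mean_early:.2f}")
print(f"mean d after {N_FINAL} participants: {mean_final:.2f}")
print(f"significant among continued studies: {sig_rate:.2f}")
```

In this sketch the continued studies start with a large effect size that shrinks as participants are added, and they reach significance more often than the nominal 5%, even though the population effect is zero.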

In this scenario, researchers are deceiving themselves that slight experimental manipulations apparently have huge effects on their dependent variable because sampling error in small samples is very large. Observed effect sizes in small samples can range from 1 to -1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the belief that “the success of replications in psychological research often depends on subtle and unknown factors.” (p. 422). How could Bem reconcile this belief with the reporting of 9 out of 10 successes? The most plausible explanation is that successes are a selected set of findings out of many attempts that were not reported.

There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection.  In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.

“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”

Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they neither peeked at the data nor terminated studies whose early results were not promising.

Additional hints come from an interview with Engber (2017).

“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.

In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result, then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practice capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.

Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one.  Schooler (2011) proposed that something more intriguing may cause decline effects.

“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.”

Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these processes is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiment a plausible explanation for replication failures. If a researcher collects too many data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effects may not be observable all the time. Thus, sometimes initial trials will also not show the effect.

I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.

Discussion

Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.

The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers’ a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).

The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality.  The main problem appears to be the practice of peeking at results in small samples with different procedures and attributing differences in results to the experimental procedures, while ignoring the influence of sampling error.

Conceptual Replications and Hidden Moderators

In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At the very least, exact replications should succeed more often than they fail (Tversky and Kahneman, 1971).

Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).

The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and a failure in the other studies are called hidden moderators.

Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This conviction sustains belief in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.

The presumed sensitivity of social psychological effects to small changes in experimental procedures is also used to justify conducting many studies that are expected to fail, just as medieval alchemists expected many failures in their attempts to make gold. These failures are not important. They are simply needed to find the conditions that produce the desired outcome: a significant result that supports researchers’ predictions.

The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in the search for a predicted outcome without realizing that a few successes among many failures are expected due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologists to deceive themselves and others for decades.
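How many chance successes to expect is simple arithmetic. The sketch below assumes that every tested hypothesis is false, that the studies are independent, and that each study uses the conventional alpha of .05; the number of 20 studies is a hypothetical example.

```python
def chance_successes(n_studies, alpha=0.05):
    """Expected number of false positives and the chance of at least one,
    assuming independent studies in which the null hypothesis is true."""
    expected = n_studies * alpha
    at_least_one = 1 - (1 - alpha) ** n_studies
    return expected, at_least_one


expected, at_least_one = chance_successes(20)
print(f"expected significant results: {expected:.1f}")
print(f"probability of at least one: {at_least_one:.2f}")
```

With 20 attempts, one significant result is expected by chance, and the probability of obtaining at least one is about 64%. A researcher who reports only that one study has a publishable success without any true effect.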

Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.

P-Hacking

Another important article for the replication crisis in psychology examined the probability that questionable research practices can produce false positive results (Simmons, Nelson, & Simonsohn, 2011).  The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used.  They find that a single questionable practice can increase the chances of obtaining a false positive result from the nominal 5% to 12.6%.  A combination of four questionable research practices increased the risk to 60.7%.  The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky.  But it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results (p = .6^9 ≈ 1%).
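The probability in parentheses is the 60.7% false positive rate multiplied by itself nine times. The check below assumes that the nine studies are independent and that the same four practices are used in every study; the variable names are illustrative.

```python
def all_false_positives(rate, n_studies):
    """Probability that every one of n independent studies is a false positive."""
    return rate ** n_studies


p_nominal = all_false_positives(0.05, 9)   # no p-hacking, alpha = .05
p_hacked = all_false_positives(0.607, 9)   # four questionable practices combined

print(f"without p-hacking: {p_nominal:.1e}")
print(f"with p-hacking:    {p_hacked:.3f}")
```

Even with a 60.7% chance of success in each study, the probability of nine successes in a row is only about 1%. P-hacking alone is therefore an unlikely explanation for a perfect series of nine significant results.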

The analysis of Bem’s data suggests that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies).  If the moderator is not known, it is still important to report this finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.

To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).
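A minimal sketch of such a meta-analysis is shown below. It uses Stouffer’s unweighted method of combining z-scores, which is one of several possible methods and assumes independent studies of similar size. The six z-scores are hypothetical and are not taken from any real article.

```python
import math


def stouffer_z(z_scores):
    """Combine independent z-scores with Stouffer's unweighted method."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def one_tailed_p(z):
    """Upper-tail probability of a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical z-scores of all six studies, including the "failed" ones.
all_studies = [2.1, 0.4, 1.2, 1.9, 0.8, 1.5]

# Count the studies that are significant on their own (one-tailed, alpha = .05).
n_significant = sum(z > 1.645 for z in all_studies)

combined_z = stouffer_z(all_studies)
combined_p = one_tailed_p(combined_z)

print(f"individually significant: {n_significant} of {len(all_studies)}")
print(f"combined z = {combined_z:.2f}, one-tailed p = {combined_p:.4f}")
```

In this example only two of the six studies are significant on their own, but the combined test across all six is clearly significant. The combined result is credible because no study was excluded on the basis of its outcome.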

Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct.  Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal of obtaining significant results in all studies.

Should the Journal of Personality and Social Psychology Retract Bem’s Article?

Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted. This seems excessive.

However, Bem’s article is unique in many ways, and the new analyses of original data presented here suggest that bad research practices, inadvertently or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem as erroneous. Unless the article is retracted, uninformed researchers will continue to cite the article as evidence for supernatural phenomena like time-reversed causality.

“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).

Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator that could explain inconsistent findings that require further investigation. A retraction would make it clear that there are no inconsistent findings because Bem’s findings do not provide credible evidence for the effect. Thus, it is unnecessary and maybe unethical to recruit human participants for further replication studies of Bem’s paradigms.

This does not mean that future research on paranormal phenomena should be banned. However, future studies should not be based on Bem’s paradigms or results. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem’s effect sizes, but these effect sizes are not trustworthy and cannot be used to plan future studies.

A main objection to a retraction is that Bem’s article inadvertently made an important contribution to the history of social psychology: it triggered a method revolution and changes in the way social psychologists conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses because the bias in Bem’s studies also biases all meta-analytic findings that include Bem’s studies (Bem, Tressoldi, Rabeyron, & Duggan, 2015).

[edited January 8, 2018]
It is not clear how Bem thinks about his 2011 article these days, but one quote in Engber’s article suggests that Bem realizes now that he provided false evidence for a phenomenon that does not exist.

When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”

If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.

REFERENCES

Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Bem, D. J., Tressoldi, P., Rabeyron, T., & Duggan, M. (2015). Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events. F1000Research, 4, 1–33.

Engber, D. (2017). Daryl Bem proved ESP Is real: Which means science is broken. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html

Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9

Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. doi:10.1126/science.1255484

Galak, J., Leboeuf, R.A., Nelson, L. D., & Simmons, J.P. (2012). Journal of Personality and Social Psychology, 103, 933-948, doi: 10.1037/a0029709.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/

Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2015). The Test of Insufficient Variance: A New Tool for the Detection of Questionable Research Practices. https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24–27. doi:10.3758/BF03192746

Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137

Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.