Dr. R’s Blog about Replicability


For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication (Cohen, 1994).


DEFINITION OF REPLICABILITY:  In empirical studies with random error variance, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication of the first study with the same sample size and significance criterion.
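A minimal simulation sketch of this definition (the effect size, sample size, and number of simulated studies are assumptions for illustration): the replicability of a significant result is simply the true power of the design, because an exact replication repeats the same sampling process with the same sample size and alpha.

# Minimal sketch: replicability of a significant result equals the true power of the design.
set.seed(123)
n.sim <- 20000     # number of simulated original studies
n     <- 50        # per-group sample size (assumed)
d     <- 0.4       # true standardized effect size (assumed)
alpha <- .05

p.orig <- replicate(n.sim, t.test(rnorm(n, d), rnorm(n))$p.value)   # original studies
p.rep  <- replicate(n.sim, t.test(rnorm(n, d), rnorm(n))$p.value)   # exact replications

sig <- p.orig < alpha
mean(p.rep[sig] < alpha)                # empirical replicability of the significant originals
power.t.test(n = n, delta = d)$power    # theoretical power of the design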


2017 Blog Posts:

(October 24, 2017)
Preliminary 2017 Replicability Rankings of 104 Psychology Journals

(September 19, 2017)
Reexamining the experiment to replace p-values with the probability of replicating an effect

(September 4, 2017)
The Power of the Pen Paradigm: A Replicability Analysis

(August 2, 2017)
What would Cohen say: A comment on p < .005 as the new criterion for significance

(April 7, 2017)
Hidden Figures: Replication failures in the stereotype threat literature

(February 2, 2017)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails

REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?
RR No3. (September 4, 2017) The power of the pen paradigm: A replicability analysis




1.  Preliminary 2017  Replicability Rankings of 104 Psychology Journals
Rankings of 104 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2017, and a comparison of psychological disciplines (cognitive, clinical, social, developmental, biological).


2.  Z-Curve: Estimating replicability for sets of studies with heterogeneous power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal.  The method estimates observed power from the distribution of test statistics converted into absolute z-scores.  It has since been developed further to estimate power for a wider range of z-scores with a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.
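As an illustration of this conversion step, a minimal sketch (not the published z-curve code; the test statistics and degrees of freedom below are made up) of how reported t- and F-statistics are turned into absolute z-scores:

# Minimal sketch: convert reported test statistics into two-tailed p-values
# and then into absolute z-scores (values below are hypothetical).
t.vals <- c(2.10, 2.45, 3.20); t.dfs <- c(38, 56, 120)
F.vals <- c(4.80, 9.10);       F.dfs <- c(45, 80)      # F-tests with 1 numerator df

p.t <- 2 * pt(abs(t.vals), df = t.dfs, lower.tail = FALSE)   # two-tailed p from t
p.F <- pf(F.vals, df1 = 1, df2 = F.dfs, lower.tail = FALSE)  # p from F (already two-sided)

z <- -qnorm(c(p.t, p.F) / 2)    # absolute z-scores; z > 1.96 corresponds to p < .05
z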


3. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
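A minimal sketch of this computation in R (assuming inflation is defined as the success rate minus median observed power; the z-scores are made up):

# Minimal sketch of the R-Index: observed median power minus the inflation
# implied by the reported success rate (all inputs are hypothetical).
r.index <- function(z, z.crit = qnorm(.975)) {
  obs.power  <- pnorm(z, mean = z.crit)    # observed power implied by each |z|
  median.pow <- median(obs.power)
  success    <- mean(z > z.crit)           # proportion of significant results
  inflation  <- success - median.pow
  median.pow - inflation                   # R-Index
}

r.index(c(2.05, 2.21, 1.99, 2.60, 2.35))   # a set of just-significant z-scores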


4.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, the z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
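A minimal sketch of TIVA along these lines (the p-values are made up):

# Minimal sketch of TIVA: convert two-tailed p-values to z-scores and test
# whether their variance is significantly smaller than the expected value of 1.
tiva <- function(p) {
  z <- -qnorm(p / 2)                       # absolute z-scores
  k <- length(z)
  v <- var(z)                              # observed variance
  p.left <- pchisq((k - 1) * v / 1, df = k - 1)   # left-tailed chi-square test
  list(variance = v, p.value = p.left)
}

tiva(c(.049, .031, .044, .027, .038))      # p-values that all cluster just below .05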

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.
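The same calculation can be done without GPower; a minimal sketch in base R (the effect size and sample size are assumed for illustration):

# Minimal sketch of the power logic: alpha fixes the type-I error rate,
# power is 1 minus the type-II error rate for an assumed effect size.
alpha <- .05
d     <- 0.5    # assumed standardized effect size
n     <- 30     # per-group sample size

out <- power.t.test(n = n, delta = d, sd = 1, sig.level = alpha)
out$power       # probability of a significant result if d = 0.5 is true
1 - out$power   # type-II error rate for this design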


8.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.


Z-curve vs. P-curve: Breakdown of an attempt to resolve disagreement in private.

Background:   In a tweet that I can no longer find because Uri Simonsohn blocked me from his Twitter account, Uri suggested that it would be good if scientists could discuss controversial issues in private before they start fighting on social media.  I was just about to submit a manuscript that showed some problems with his p-curve approach to power estimation and demonstrated that z-curve works better in some situations, namely when there is substantial variation in statistical power across studies. So, I thought I would give it a try and sent him the manuscript so that we could try to find agreement in a private email exchange.

The outcome of this attempt was that we could not reach agreement on this topic.  At best, Uri admitted that p-curve is biased when some extreme test statistics (e.g., F(1,198) = 40, or t(48) = 5.00) are included in the dataset.  He likes to call these values outliers. I consider them part of the data that influence the variability and distribution of test statistics.

For the most part, Uri disagreed with my conclusions and considers the simulation results that support my claims unrealistic.  Meanwhile, Uri published a blog post with simulations that use only small amounts of heterogeneity to claim that p-curve works even better than z-curve when there is heterogeneity.

The reason for the discrepancy between his results and my results is different assumptions about what constitutes realistic variability in the strength of evidence against the null-hypothesis, as reflected in absolute z-scores (two-tailed p-values from t-tests or F-tests are converted into z-scores by means of z = -qnorm(p/2)).

To give everybody an opportunity to examine the arguments that were exchanged during our discussion of p-curve versus z-curve, I am sharing the email exchange.  I hope that more statisticians will examine the properties of p-curve and z-curve and add to the discussion.  To facilitate this, I will make the r-code to run simulation studies of p-curve and z-curve available in a separate blog post.

P.S.  P-curve is available as an online app that provides power estimates without any documentation of how p-curve behaves in simulation studies or any warning that datasets with large test statistics can produce inflated estimates of average power.

My email correspondence with Uri Simonsohn – RE: p-curve and heterogeneity

From:    URI
To:          ULI
Date:     11/24/2017

Hi Uli,

I think email is better at this point.

Ok I am behind a ton of stuff and have a short workday today so cannot look in detail at your z-curve paper right now.

I did a quick search for “osf”, “http” and “code” and could not find the R code; that may facilitate things if you can share it. Mostly, I would like the code that shows p-curve is biased, especially looking at how the population parameter being estimated is being defined.

I then did a search for “p-curve” and found this

Quick reactions:

1)            For power estimation p-curve does not assume homogeneity of effect size, indeed, if anything it assumes homogeneity of power and allows each study to have a different effect size, but it is not really assuming a single power, it is asking what single power best fits the data, which is a different thing. It is computing an average. All average computations ask “what single value best fits the data” but that’s not the same as saying “I think all values are identical, and identical to the average”

2)            We do report a few tests of the impact of heterogeneity on p-curve, maybe you have something else in mind. But here they go just in case:

Figure 2C in our POPS paper, has d~N(x,sd=.2)

[Clarification: This Figure shows estimation of effect sizes. It does not show estimation of power.]

Supplement 2

[Again. It does not show simulations for power estimation.]

A key thing to keep in mind is the population parameter of interest. P-curve does not estimate the population effect size or power of all studies attempted, published, reported, etc. It does so for the set of studies included in p-curve. So note, for example, in the figure S2C above that when half of the studies are .5 and half are .3 among the attempted, p-curve estimates the average included study accurately but differently from .4. The truth is .48 for included studies, p-curve says .47, and the average attempted study is .4.

[This is not the issue. Replicability implies conditioning on significance. We want to predict the success rate of studies that replicate significant results. Of course it is meaningful to do follow up studies on non-significant results. But the goal here is not to replicate another inconclusive non-significant result.]

Happy to discuss of course, Uri


From     ULI
To           URI
Date      11/24/2017

Hi Uri,

I will change the description of your p-curve code for power.

Honestly, I am not fully clear about what the code does or what the underlying assumptions are.

So, thanks for clarifying.

I agree with you that pcurve (also puniform) are surprisingly robust estimates of effect sizes even with heterogeneity (I have pointed that out in comments in the Facebook Discussion group), but that doesn’t mean it works well for power.   If you have published any simulation tests for the power estimation function, I am happy to cite them.

Attached is a single R code file that contains (a) my shortened version of your p-curve code, (b) the z-curve code, (c) the code for the simulation studies.

The code shows the cumulative results. You don’t have to run all 5,000 replications before you see the means stabilizing.

Best, Uli


From     URI
To           ULI
Date      11/27/2017

Hi Uli,

Thanks for sending the code, I am trying to understand it.  I am a little confused about how the true power is being generated. I think you are drawing “noncentrality” parameters  (ncp) that are skewed, and then turning those into power, rather than drawing directly skewedly distributed power, correct? (I am not judging that as good or bad, I am just verifying).

[Yes that is correct]

In any case, I created a histogram of the distribution of true power implied by the ncp’s that you are drawing (I think, not 100% sure I am getting that right).

For scenario 3.1 it looks like this:




For scenario 3.3 it looks like this:



(the only code I added was to turn all the true power values into a vector before averaging it, and then plotting a histogram for that vector; if interested, you can copy-paste this into the line of code that just reads “tp” in your code and you will reproduce my histogram)


power.i = pnorm(z, z.crit)[obs.z > z.crit]   # line added by Uri Simonsohn to look at the distribution

# summary statistics shown in the plot label
mean.pi   = mean(power.i)
median.pi = median(power.i)
sd.pi     = sd(power.i)

hist(power.i, xlab = 'true power of each study')
mtext(side = 3, line = 0, paste0("mean=", mean.pi, "   median=", median.pi, "   sd=", sd.pi))

I wanted to make sure

1)            I am correctly understanding this variable as being the true power of the observed studies, the average/median of which we are trying to estimate

2)            Those distributions are the distributions you intended to generate

[Yes, that is correct. To clarify, 90% power for p < .05 (two-tailed) is obtained with a z-score of  qnorm(.90, 1.96)  = 3.24.   A z-score of 4 corresponds to 97.9% power.  So, in the literature with adequately powered studies, we would expect studies to bunch up at the upper limit of power, while some studies may have very low power because the theory made the wrong prediction and effect sizes are close to zero and power is close to alpha (5%).]
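[These values can be verified with two lines of R:]

qnorm(.90, mean = 1.96)   # non-centrality needed for 90% power: 3.24
pnorm(4, mean = 1.96)     # power implied by a true (non-central) z of 4: .979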

Thanks, Uri


From     ULI
To           URI
Date      11/27/2017

Hi Uri,

Thanks for getting back to me so quickly.   You are right, it would be more accurate to describe the distribution as the distribution of the non-centrality parameters rather than power.

The distribution of power is also skewed but given the limit of 1,  all high power studies will create a spike at 1.  The same can happen at the lower end and you can easily get U-shaped distributions.

So, what you see is something that you would also see in actual datasets.  Actually, the dataset minimizes skew because I only used non-centrality parameters from 0 to 6.

I did this because z-curve only models z-values between 0 and 6 and treats all observed z-scores greater than 6 as having a true power of 1.  That reduces the pile on the right side.

You could do the same to improve performance of p-curve, but it will still not work as well as z-curve, as the simulations with z-scores below 6 show.

Best, Uli


From     URI
To           ULI
Date      11/27/2017

OK, yes, probably worth clarifying that.

Ok, now I am trying to make sure I understand the function you use to estimate power with z-curve.

If I  see p-values, say c(.001,.002,.003,.004,.005) and I wanted to estimate true power for them via z-curve, I would run:

p= c(.001,.002,.003,.004,.005)

z= -qnorm(p/2)


And estimate true power to be 85%, correct?



From     ULI
To           URI
Date      11/27/2017



From     URI
To           ULI
Date      11/27/2017

Hi Uli,

To make sure I understood z-curve’s function I run a simple simulation.
I am getting somewhat biased results with z-curve, do you want to take a look and see if I may be doing something wrong?

I am attaching the code, I tried to make it clear but it is sometimes hard to convey what one is trying to do, so feel free to ask any questions.



From     ULI
To           URI
Date      11/27/2017

Hi Uri,

What is the k in these simulations?   (z-curve requires somewhat large k because the smoothing of the density function can distort things)

You may also consult this paper (the smallest k was 15 in this paper).


In this paper, we implemented pcurve differently, so you can ignore the p-curve results.

If you get consistent underestimation with z-curve, I would like to see how you simulate the data.

I haven’t seen this behavior in z-curve in my simulations.

Best, Uli


From     URI
To           ULI
Date      11/27/2017

Hi Uli,

I don’t know where “k” is set, I am using the function you sent me and it does not have k as a parameter

I am running this:

fun.zcurve = function(z.val.input, z.crit = 1.96, Int.End=6, bw=.05) {…

Where would k be set?

Into the function you have this

### resolution of density function (doesn’t seem to matter much)

bars = 500

Is that k?



From     ULI
To           URI
Date      11/27/2017

I mean the number of test statistics that you submit to z-curve.



From     ULI
To           URI
Date      11/27/2017

I just checked with k = 20, the z-curve code I sent you underestimates fixed power of 80 as 72.

The paper I sent you shows a similar trend with true power of 75.

k             15     25    50    100  250
Z-curve 0.704 0.712 0.717 0.723 0.728

[Clarification: This is from the Brunner & Schimmack, 2016, article]


From     URI
To           ULI
Date      11/30/2017

Hi Uli,

Sorry for disappearing, got distracted with other things.

I looked a bit more at the apparent bias downwards that z-curve has on power estimates.

First, I added p-curve’s estimates to the chart I had sent, I know p-curve performs well for that basic setup so I used it as a way to diagnose possible errors in my simulations, but p-curve did correctly recover power, so I conclude the simulations are fine.

If you spot a problem with them, however, let me know.


From     ULI
To           URI
Date      11/30/2017

Hi Uri,

I am also puzzled why z-curve underestimates power in the homogeneous case even with large N.  This is clearly an undesirable behavior and I am going to look for solutions to the problem.

However, in real data that I analyze, this is not a problem because there is heterogeneity.

When there is heterogeneity, z-curve performs very well, no matter what the distribution of power/non-centrality parameters is. That is the point of the paper.  Any comments on comparisons in the heterogeneous case?

Best, Uli


From     URI
To           ULI
Date      11/30/2017

Hey Uli,

I have something with heterogeneity but want to check my work and am almost done for the day, will try tomorrow.


[Remember: I supplied Uri with r-code to rerun the simulations of heterogeneity and he ran them to show what the distribution of power looks like.  So at this point we could discuss the simulation results that are presented in the manuscript.]


From     ULI
To           URI
Date      11/30/2017

I ran simulations with t-distributions and N = 40.

The results look the same for me.

Mean estimates for 500 simulations

32, 48, 75

As you can see, p-curve also has bias when t-values are converted into z-scores and then analyzed with p-curve.

This suggests that with small N,  the transformation from t to z introduces some bias.

The simulations by Jerry Brunner showed less bias because we used the sample sizes in Psych Science for the simulation (median N ~ 80).

So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.


From     URI
To           ULI
Date      11/30/2017

Hi Uli,

The fact that p-curve is also biased when you convert to z-scores suggests to me that approximation is indeed part of the problem.

[Clarification: I think URI means z-curve]

Fortunately p-curve analysis does not require that transformation and one of the reasons we ask in the app to enter test-statistics is to avoid unnecessary transformations.

I guess it would also be true that if you added .012 to p-values p-curve would get it wrong, but p-curve does not require one to add .012 to p-values.

You write “So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.”

Only partial agreement, because the statement implies that for larger N and larger k z-curve is not biased; I believe it is also biased for large k and large N. Here, for instance, is the chart with n=50 per cell (N=100 total) and 50 studies total.

Today I modified the code I sent you so that it would accommodate any power distribution in the submitted studies, not just a fixed level. (attached)

I then used the new montecarlo function to play around with heterogeneity and skewness.

The punchline is that p-curve continues to do well, and z-curve continues to be biased downward.

I also noted, by computing the standard deviation of estimates across simulations, that p-curve has slightly less random error.

My assessment is that z-curve and p-curve are very similar and will generally agree, but that z-curve is more biased and has more variance.

In any case, let’s get to the simulations. Below I show 8 scenarios sorted by the ex-post average true power for the sets of studies.

[Note, N = 20 per cell.  As I pointed out earlier, with these small sample sizes the t to z-transformation is a factor. Also k = 20 is a small set of studies that makes it difficult to get good density distributions.  So, this plot is p-hacked to show that p-curve is perfect and z-curve consistently worse.  The results are not wrong, but they do not address the main question. What happens when we have substantial heterogeneity in true power?  Again, Uri has the data, he has the r-code, and he has the results that show p-curve starts overestimating.  However, he ignores this problem and presents simulations that are most favorable for p-curve.]


From     ULI
To           URI
Date      12/1/2017

Hi Uri,

I really do not care so much about bias in the homogeneous case. I just fixed the problem by first doing a test of the variance and, if the variance is small, using a fixed-effects model.

[Clarification:  This is not yet implemented in z-curve and was not done for the manuscript submitted for publication which just acknowledges that p-curve is superior when there is no heterogeneity.]

The main point of the manuscript is really about data that I actually encounter in the literature (see demonstrations in the manuscript, including power posing) where there is considerable heterogeneity.

In this case, p-curve overestimates as you can see in the simulations that I sent you.   That is really the main point of the paper and any comments from you about p-curve and heterogeneity would be welcome.

And, I did not mean to imply that pcurve needs transformation. I just found it interesting that transformation is a problem when N is small (as N gets bigger t approaches z and the transformation has less influence).

So, we are in agreement that pcurve does very well when there is little variability in the true power across studies.  The question is whether we are in agreement about heterogeneity in power?

Best, Uli


From     ULI
To           URI
Date      12/1/2017

Hi Uri,

Why not simulate scenarios that match onto real data.

[I attached data from my focal hypothesis analysis of Bargh’s book “Before you know it” ]



From     ULI
To           URI
Date      12/1/2017


Also, my simulations show that z-curve OVERestimates when true power is below 50%.   Do you find this as well?

This is important because power posing estimates are below 50%, so estimation problems with small k and N would mean that z-curve estimate is inflated rather than suggesting that p-curve estimate is correct.

Best, Uli


From     URI
To           ULI
Date      12/2/2017

Hi Uli,

The results I sent show substantial heterogeneity and p-curve does well, do you disagree?



From     URI
To           ULI
Date      12/2/2017

Not sure what you mean here. What aspect of real data would you like to add to the simulations? I did what I did to address the concerns you had that p-curve may not handle heterogeneity and skewed distributions of power, and it seems to do well with very substantial skew and heterogeneity.

What aspect are the simulations abstracting away from that you worry may lead p-curve to break down with real data?



From     ULI
To           URI
Date      12/2/2017

Hi Uri,

I think you are not simulating sufficient heterogeneity to see that p-curve is biased in these situations.

Let’s focus on one example (simulation 2.3) in the r-code I sent you: High true power (.80) and heterogeneity.

This is the distribution of the non-centrality parameters.

And this is the distribution of true power for p < .05 (two-tailed, |z| >= 1.96).

[Clarification: this is not true power, it is the distribution of observed absolute z-scores]

More important, the variance of the observed significant (z > 1.96) z-scores is 2.29.

[Clarification: In response to this email exchange, I added the variance of significant z-scores to the manuscript as a measure of heterogeneity.  Due to the selection for significance, variance with low power can be well below 1.   A variance of 2.29 is large heterogeneity. ]
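[To make this measure concrete, here is a minimal simulation sketch; the sd of 1.5 for the non-centrality parameters is an assumption for illustration:]

# Variance of significant |z|-scores as a heterogeneity measure. With a fixed
# non-centrality parameter, selection for significance truncates the distribution
# and the variance falls below 1; heterogeneity in true power pushes it above 1.
set.seed(1)
z.crit <- qnorm(.975)

z.fix <- rnorm(1e5, mean = 2.8)              # homogeneous case (about 80% power)
var(z.fix[z.fix > z.crit])                   # well below 1

ncp   <- rnorm(1e5, mean = 2.8, sd = 1.5)    # heterogeneous non-centrality parameters
z.het <- rnorm(1e5, mean = ncp)
var(z.het[z.het > z.crit])                   # clearly above 1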

In comparison the variance for the fixed model (non-central z = 2.80) is 0.58.

So, we can start talking about heterogeneity in quantitative terms. How much variance do your simulated observed p-values have when you convert them into z-scores?

The whole point of the paper is that the performance of p-curve suffers the greater the heterogeneity of true power is.  As sampling error is constant for z-scores, the variance of observed z-scores has a maximum of 1 if true power is constant. It is lower than 1 due to selection for significance, which is more severe the lower the power is.

The question is whether my simulations use some unrealistic, large amount of heterogeneity.   I attached some Figures for the Journal of Judgment and Decision Making.

As you can see, heterogeneity can be even larger than the heterogeneity simulated in scenario 2.3 (with a normal distribution around z = 2.75).

In conclusion, I don’t doubt that you can find scenarios where p-curve does well with some heterogeneity.  However, the point of the paper is that it is possible to find scenarios where there is heterogeneity and p-curve does not do well.   What your simulations suggest is that z-curve can also be biased in some situations, namely with low variability, small N (so that the transformation to z-scores matters), and a small number of studies.

I am already working on a solution for this problem, but I see it as a minor problem because most datasets that I have examined (like the ones that I used for the demonstrations in the ms) do not match this scenario.

So, if I can acknowledge that p-curve outperforms z-curve in some situations, I wonder whether you can do the same and acknowledge that z-curve outperforms p-curve when power is relatively high (50%+) and there is substantial heterogeneity?

Best, Uli


From     ULI
To           URI
Date      12/2/2017

What surprises me is that I sent you r-code with 5 simulations that showed when p-curve is breaking down (starting with normally distributed variability of non-central z-scores and 50% power (sim 2.2), followed by higher power (80%) and skewed distributions (sim 3.1, 3.2, 3.3)).  Do you find a problem with these simulations or is there some other reason why you ignore these simulation studies?


From     ULI
To           URI
Date      12/2/2017

I tried "power = runif(n.sim)*.4 + .58" with k = 100.

Now pcurve starts to overestimate and zcurve is unbiased.

So, k makes a difference.  Even if pcurve does well with k = 20,  we also have to look for larger sets of studies.

Results of 500 simulations with k = 100


From     ULI
To           URI
Date      12/2/2017

Even with k = 40,  pcurve overestimates as much as zcurve underestimates.

zcurve           pcurve
Min.   :0.5395   Min.   :0.5600
1st Qu.:0.7232   1st Qu.:0.7900
Median :0.7898   Median :0.8400
Mean   :0.7817   Mean   :0.8246
3rd Qu.:0.8519   3rd Qu.:0.8700
Max.   :0.9227   Max.   :0.9400


From     ULI
To           URI
Date      12/2/2017

Hi Uri,

This is what I find with systematic variation of number of studies (k) and the maximum heterogeneity for a uniform distribution of power and average power of 80% after selection for significance.

power = runif(n.sim)*.4 + .58

                zcurve   pcurve
k = 20           77.5     81.2
k = 40           78.2     82.5
k = 100          79.3     82.7
k = 10000        80.2     81.7   (1 run)

If we are going to look at k = 20, we also have to look at k = 100.

Best, Uli


From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Why did you truncate the beta distributions so that they start at 50% power?

Isn’t it realistic to assume that some studies have less than 50% power, including false positives (power = alpha = 5%)?

How about trying this beta distribution?


80% true power after selection for significance.

Best, Uli


From     URI
To           ULI
Date      12/2/2017

Hi Uli,

I know I have a few emails from you, thanks.

My plan is to get to them on Monday or Tuesday. OK?



Hi Uli,

We have a blogpost going up tomorrow and have been distracted with that; made some progress with z- vs p- but am not ready yet.

Sorry Uri


From     URI
To           ULI
Date      12/2/2017

Hi Uli,

Ok, finally I have time to answer your emails from over the weekend.

Why I run something different?

First, you asked why I ran simulations that were different from those you have in your paper (scenarios 2.1 and 3.1).

The answer is that I tried to simulate what I thought you were describing in the text: heterogeneity in power that was skewed.

When I saw you had run simulations that led to a power distribution that looked like this:

I assumed that was not what was intended.

First, that’s not skewed

Second, that seems unrealistic, you are simulating >30% of studies powered above 90%.

[Clarification:  If studies were powered at 80%,  33% of studies would be above 90% :


It is important to remember that we are talking only about studies that produced a significant result. Even if many null-hypothesis are tested, relatively few of these would make it into the set of studies that produced a significant result.  Most important, this claim ignores the examples in the paper and my calculations of heterogeneity that can be used to compare simulations of heterogeneity with real data.]

Third, when one has extremely bimodal data, central tendency measures are less informative/important (e.g., the average human wears half a bra). So if indeed power was distributed that way, I don’t think I would like to estimate average power anyway. And if it did, saying the average is 60% or 80% is almost irrelevant, hardly any studies are in that range in reality (like say the average person wears .65 bras, that’s wrong, but inconsequentially worse than .5 bras).

Fourth, if indeed 30% of studies have >90% power, we don’t need p-curve or z-curve. Stuff is gonna be obviously true to naked eye.

But below I will ignore these reservations and stick to that extreme bimodal distribution you propose that we focus our attention on.

The impact of null findings

Actually, before that, let me acknowledge I think you raised a very valid point about the importance of adding null findings to the simulations. I don’t think the extreme bimodal you used is the way to do it, but I do think power=5% in the mix does make sense.

We had not considered p-curve’s performance there and we should have.

Prompted by this exchange I did that, and I am comfortable with how p-curve handles power=5% in the mix.

For example, I considered 40 studies, starting with all 40 null, and then having an increasing number drawn from U(40%-80%) power. Looks fine.

Why p-curve overshoots?

Ok. So having discussed the potential impact of null findings on estimates, and leaving aside my reservations with defining the extreme bimodal distribution of power as something we should worry about, let’s try to understand why p-curve over-estimates and z-curve does not.

Your paper proposes it is because p-curve assumes homogeneity.

It doesn’t. p-curve does not assume homogeneity of power any more than computing average height involves assuming homogeneity of height. It is true that p-curve does not estimate heterogeneity in power, but averaging height also does not compute the SD(height). P-curve does not assume it is zero, in fact, one could use p-curve results to estimate heterogeneity.

But in any case, is z-curve handling the extreme bimodal better thanks to its mixture of distributions, as you propose in the paper, or due to something else?

Because power is nonlinearly related to ncp, I assumed it had to do with the censoring of high z-values you did rather than the mixture (though I did not actually look into the mixture in any detail at all).

To look into that I censored t-values going into p-curve. Not as a proposal for a modification but to make the discussion concrete. I censored at t < 3.5 so that any t > 3.5 is replaced by 3.5 before being entered into p-curve.  I did not spend much time fine-tuning it and I am definitely not proposing that if one were to censor t-values in p-curve they should be censored at 3.5.


Ok, so I run p-curve with censored t-values for the rbeta() distribution you sent and for various others of the same style.

We see that censored p-curve behaves very similarly to z-curve (which is censored also).

I also tried adding more studies, running rbeta(3,1) and (1,3), etc.. Across the board, I find that if there is a high share of extremely high powered studies, censored p-curve and z-curve look quite similar.

If we knew nothing else, we would be inclined to censor p-curve going forward, or to use z-curve instead. But censored p-curve, and especially z-curve, give worse answers when the set of studies does not include many extremely high-powered ones, and in real life we don’t have many extremely high-powered studies. So z-curve and censored p-curve make gains in a world that I don’t think exists, and exhibit losses in one that I do think exists.

In particular, z-curve estimates power to be about 10% when the null is true, instead of 5% (censored p-curve actually gets this one right; the null is estimated at 5%).

Also, z-curve underestimates power in most scenarios not involving an extreme bimodal distribution (see charts I sent in my previous email). In addition, z-curve tends to have higher variance than p-curve.

As indicated in my previous email, z-curve and p-curve agree most of the time, their differences will typically be within sampling error. It is a low stakes decision to use p-curve vs z-curve, especially compared to the much more important issue of which studies are selected and which tests are selected within studies.

Thanks for engaging in this conversation.

We don’t have to converge to agreement to gain from discussing things.

Btw, we will write a blog post on the repeated and incorrect claim that p-curve assumes homogeneity and does not deal with heterogeneity well. We will send you a draft when we do, but it could be several weeks till we get to that. I don’t anticipate it being a contentious post from your point of view but figured I would tell you about it now.



From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Now that we are on the same page, the only question is what is realistic.

First, your blog post on outliers already shows what is realistic. A single outlier in the power pose studies increases the p-curve estimate by more than 10 percentage points.

You can fix this now, but p-curve as it existed did not do this.   I would also describe this as a case of heterogeneity. Clearly the study with z = 7 is different from studies with z = 2.

This is in the manuscript that I asked you to evaluate and you haven’t commented on it at all, while writing a blog post about it.

The paper contains several other examples that are realistic because they are based on real data.

I mainly present them as histograms of z-scores rather than histograms of p-values or observed power because I find the distribution of the z-scores more informative (e.g., where is the mode, is the distribution roughly normal, etc.), but if you convert the z-scores into power you get distributions like the one shown below (U-shaped), which is not surprising because power is bounded at alpha and 1.  So, that is a realistic scenario, whereas your simulations of truncated distributions are not.

I think we can end the discussion here.  You have not shown any flaws with my analyses. You have shown that under very limited and unrealistic situations p-curve performs better than z-curve, which is fine because I already acknowledged in the paper that p-curve does better in the homogeneous case.

I will change the description of the assumption underlying p-curve, but leave everything else as is.

If you think there is an error, let me know, but I have been waiting patiently for you to comment on the paper, and I have examined your new simulations.

Best, Uli


Hi Uri,

What about the real world of power posing?

A few z-scores greater than 4 mess up p-curve as you just pointed out in your outlier blog.

I have presented several real world data to you that you continue to ignore.

Please provide one REAL dataset where p-curve gets it right and z-curve underestimates.

Best, Uli

From     URI
To           ULI
Date      12/6/2017

Hi Uli,

With real datasets you don’t know true power so you don’t know what’s right and wrong.

The point of our post today is that there is no point statistically analyzing the studies that Cuddy et al put together, with p-curve or any other tool.

I personally don’t think we ever observe true power with enough granularity to make z- vs p-curve prediction differences consequential.

But I don’t think we, you and I, should debate this aspect (is this bias worth that bias). Let’s stick to debating basic facts such as whether or not p-curve assumes homogeneity, or z-curve differs from p-curve because of homogeneity assumption or because of censoring, or how big bias is with this or that assumption. Then when we write we present those facts as transparently as possible to our readers, and they can make an educated decision about it based on their priors and preferences.



From     ULI
To           URI
Date      12/6/2017

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values.



z-curve uses multiple parameters, which improves prediction when there is substantial heterogeneity?



In many cases, the differences are small and not consequential.



When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates.

(see simulations in our manuscript)



I want to submit the manuscript by end of the week.

Best, Uli


From     ULI
To           URI
Date      12/6/2017

Going through the manuscript one more time, I found this.

To examine the robustness of estimates against outliers, we also obtained estimates for a subset of studies with z-scores less than 4 (k = 49).  Excluding the four studies with extreme scores had relatively little effect on z-curve; replicability estimate = 34%.  In contrast, the p-curve estimate dropped from 44% to 5%, while the 90%CI of p-curve ranged from 13% to 30% and did not include the point estimate.

Any comments on this, I mean point estimate is 5% and 90%CI is 13 to 30%,

Best, Uli

[Clarification:  this was a mistake. I confused point estimate and lower bound of CI in my output]


From     URI
To           ULI
Date      12/7/2017

Hi Uli.

See below:

From: Ulrich Schimmack [mailto:ulrich.schimmack@utoronto.ca]

Sent: Wednesday, December 6, 2017 10:44 PM

To: Simonsohn, Uri <uws@wharton.upenn.edu>

Subject: RE: Its’ about censoring i think

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values. 


z-curve uses multiple parameters,

Agree I don’t know the details of how z-curve works, but I suspect you do and are correct.

which improves prediction when there is substantial heterogeneity?


Few fronts.

1)            I don’t think heterogeneity per-se is the issue, but extremity of the values. P-curve is accurate with very substantial heterogeneity. In your examples what causes the trouble are those extremely high power values. Even with minimal heterogeneity you will get over-estimation if you use such values.

2)            I also don’t know that it is the extra parameters in z-curve that are helping, because p-curve with censoring does just as well. So I suspect it is the censoring and not the multiple parameters. That’s also consistent with z-curve under-estimating almost everywhere; the multiple parameters should not lead to that, I don’t think.

In many cases, the differences are small and not consequential.

Agree, mostly. I would not state that in an unqualified way out of context.

For example, my personal assessment, which I realize you probably don’t share, is that z-curve does worse in contexts that matter a bit more, and that are vastly more likely to be observed.

When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates. 

(see simulations in our manuscript)


You can have very substantial heterogeneity and very high power and p-curve is accurate (z-curve under-estimates).

For example, for the blogpost on heterogeneity and p-curve I figured that rather than simulating power directly it made more sense to simulate n and d distributions, over which people have better intuitions, and then see what happened to power (rather than simulating power or ncp directly).

Here is one example. Sets of 20 studies, drawn with n and d from the first two panels, with the implied true power and its estimate in the 3rd panel.

I don’t mention this in the post, but z-curve in this simulation under-estimates power, 86% instead of 93%

The parameters are



What you need for p-curve to over-estimate and for z-curve to not under-estimate is a substantial share of studies at both extremes: many null, many with power > 95%.

In general very high power leads to over-estimation, but it is trivial in the absence of many very low power studies that lower the average enough that it matters.

That’s the combination I find unlikely, 30%+ with >90% power and at the same time 15% of null findings (approx., going off memory here).

I don’t generically find high power with heterogeneity unlikely, I find the figure above super plausible for instance.

NOTE: For the post I hope to gain more insight on the precise boundary conditions for over-estimation, I am not sure I totally get it just yet.

I want to submit the manuscript by end of the week.

Hope that helps.  Good luck.

Best, Uli


From     URI
To           ULI
Date      12/7/2017

Hi Uli,

First, I had not read your entire paper and only now I realize you analyze the Cuddy et al paper; that’s an interesting coincidence. For what it’s worth, we worked on the post before you and I had this exchange (the post was written in November and we first waited for Thanksgiving and then over 10 days for them to reply). And moreover, our post is heavily based on the peer review Joe wrote when reviewing this paper, nearly a year ago, and which was unfortunately largely ignored by the authors.

In terms of the results, I am not sure I understand. Are you saying you get an estimate of 5% with a confidence interval between 13 and 30?

That’s not what I get.


From     ULI
To           URI
Date      12/7/2017

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

As you can see in your output, the numbers are switched (I should label columns in output).

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

Best, Uli


Hi Uli,

See below

From: Ulrich Schimmack [mailto:ulrich.schimmack@utoronto.ca]

Sent: Friday, December 8, 2017 12:39 AM

To: Simonsohn, Uri <uws@wharton.upenn.edu>

Subject: RE: one more question

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

*I figured

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

*Happens to the best of us

As you can see in your output, the numbers are switched (I should label columns in output).

*I figured

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

*The tone is a bit accusatorial “admit”, but yes, in my blog post I will talk about it. My goal is to present facts in a way that lets readers decide with the same information I am using to decide.

It’s not always feasible to achieve that goal, but I strive for it. I prefer people making right inferences than relying on my work to arrive at them.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

*I don’t think that’s for us to decide. We can ‘fight’ about how to present the facts to readers, they decide which is more realistic.

I am not ignoring your simulation results.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

*I would prefer if you don’t speak on my behalf either way, our conversation is for each of us to learn from the other, then you speak for yourself.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

*I haven’t tried to reproduce your simulations, but I did indicate in our emails that if you run the rbeta(n,.35,.5)*.95+.05 p-curve over-estimates, I also explained why I don’t find that particularly worrisome. But you are not publishing a report on our email exchange, you are proposing a new tool. Our exchange hopefully helped make that paper clearer.

Please don’t quote any aspect of our exchange. You can say you discussed matters with me, but please do not quote me. This is a private email exchange. You can quote from my work and posts. The heterogeneity blog post may be up in a week or two.






The Deductive Fallacy in (some) Bayesian Inductive Inferences

I learned about Bayes theorem in the 1990s and I used Bayes’s famous formula in my first JPSP article (Schimmack & Reisenzein, 1997).  When Wagenmakers et al. (2011) published their criticism of Bem (2011), I did not know about Bayesian statistics. I have since learned more about Bayesian statistics and I am aware that there are many different approaches to using priors in statistical inferences.  This post is about a single Bayesian statistical approach, namely Bayesian Null-Hypothesis Testing (BNHT), which has been attributed to Jeffreys, introduced into psychology by Rouder, Speckman, and Sun (2009), and used by Wagenmakers et al. (2011) to suggest that Bem’s evidence for ESP was obtained by using flawed p-values, whereas Bayes-Factors showed no evidence for ESP, although they did not show evidence for the absence of ESP, either. Since then, I have learned more about Bayes-Factors, in part from reading blog posts by Jeff Rouder, including R-code to run my own simulation studies, and from discussions with Jeff Rouder on social media.  I am not an expert on Bayesian modeling, but I understand the basic logic underlying Bayes-Factors.

Rouder et al.’s (2009) article has been cited over 800 times and was cited over 200 times in 2016 and 2017.  An influential article like this cannot be ignored.  Like all other inferential statistical methods, JBFs (Jeffreys Bayes-Factors, or Jeff’s Bayes-Factors) examine statistical properties of data (effect size, sampling error) in relation to sampling distributions of test statistics. Rouder et al. (2009) focused on t-distributions that are used for the comparison of means by means of t-tests.  Although most research articles in psychology continue to use traditional significance testing, the use of Bayes-Factors is on the rise.  It is therefore important to critically examine how Bayes-Factors are being used and whether inferences based on Bayes-Factors are valid.

Inferences about Sampling Error as Causes of Observed Effects.
The main objective of inferential statistics in psychological research is to rule out the possibility that an observed effect is merely a statistical fluke.  If the evidence obtained in a study is strong enough given some specified criterion value, researchers are allowed to reject the hypothesis that an observed effect was merely produced by chance (a false positive effect) and interpret the result as being caused by some effect.  Although Bayes-Factors could be reported without drawing conclusions (just like t-values or p-values could be reported without drawing inferences), most empirical articles that use Bayes-Factors use them to draw inferences about effects.  Thus, the aim of this blog post is to examine whether empirical researchers use JBFs correctly.

Two Types of Errors
Inferential statistics are error prone.  The outcome of empirical studies is not deterministic and results from samples may not generalize to populations.  There are two possible errors that can occur, the so-called type-I errors and type-II errors. Type-I errors are false positive results. A false positive result occurs when there is no real effect in the population, but the results of a study led to the rejection of the null-hypothesis that sampling error alone caused the observed mean differences.  The second error is the false inference that sampling error alone caused an observed difference in a sample, while a test of the entire population would show that there is an actual effect.  This is called a false negative result.  The main problem in assessing type-II errors (false negatives) is that the probability of a type-II error depends on the magnitude of the effect.  Large effects can be easily observed even in small samples and the risk of a type-II error is small.  However, as effect sizes become smaller and approach zero, it becomes harder and harder to distinguish the effect from pure sampling error.  Once effect sizes become really small (say 0.0000001 percent of a standard deviation), it is practically impossible to distinguish results of a study with a tiny real effect from results of a study with no effect at all.

False Negatives
For reasons that are irrelevant here, psychologists have ignored type-II errors. A type-II error can only be made when researchers conclude that an effect is absent.  However, empirical psychologists were trained to ignore non-significant results as inconclusive rather than drawing the inferences that an effect is absent, and risking making a type-II error.  This led to the belief that p-values cannot be used to test the hypothesis that an effect is absent.  This was not much of a problem because most of the time psychologists made predictions that an effect should occur (reward should increase behavior; learning should improve recall, etc.).  However, it became a problem when Bem (2011) claimed to demonstrate that subliminal priming can influence behavior even if the prime is presented AFTER the behavior occurred.  Wagenmakers et al. (2011) and others found this hypothesis implausible and the evidence for it unbelievable.  However, rather than being satisfied with demonstrating that the evidence is flawed, it would be even better to demonstrate that this implausible effect does not exist.  Traditional statistical methods that focus on rejecting the null-hypothesis did not allow this.  Wagenmakers et al. (2011) suggested that Bayes-Factors solve this problem because they can be used to test the plausible hypothesis that time-reversed priming does not exist (the effect size is truly zero).  Many subsequent articles have used JBFs for exactly this purpose; that is, to provide empirical evidence in support of the null-hypothesis that an observed mean difference is entirely due to sampling error.  Like all inductive inferences, inferences in favor of H0 can be false.   While psychologists have traditionally ignored type-II errors because they did not make inferences in favor of H0, the rise of inferences in favor of H0 by means of JBFs makes it necessary to examine the validity of these inferences.

Deductive Fallacy
The main problem of using JBFs to provide evidence for the null-hypothesis is that Bayes-Factors are ratios of two hypotheses.  The data can be more or less compatible with each of the two hypotheses.  Say the data have a likelihood of .2 under H0 and a likelihood of .1 under H1; the ratio of the two likelihoods is .2/.1 = 2.  The greater the likelihood in favor of H0, the more likely it is that an observed mean difference is purely sampling error.  As JBFs are ratios of two likelihoods, they depend on the specification of H1. For t-tests with continuous variables, H1 is specified as a weighted distribution of effect sizes. Although H1 covers all possible effect sizes, it is possible to create an infinite number of alternative hypotheses (H1.1, H1.2, H1.3….H1.∞).  The Bayes-Factor changes as a function of the way H1 is specified.  Thus, while one specific H1 may produce a JBF of 1000000:1 in favor of H0, another one may produce a JBF of 1:1.   It is therefore a logical fallacy to infer from a specific JBF for one particular H1 that H0 is supported, true, or that there is evidence for the absence of an effect.  The logically correct inference is that, with extremely high probability, the alternative hypothesis is false, but that does not justify the inverse inference that H0 is true, because H0 and H1 do not specify the full event space of all possible hypotheses that could be tested.  It is easy to overlook this because every H1 covers the full range of effect sizes, but these effect sizes can be used to create an infinite number of alternative hypotheses.
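To make the dependence on the specification of H1 concrete, here is a minimal sketch (not Rouder et al.'s code) that computes a Bayes-Factor for a one-sample t-test by numerical integration under two different Cauchy priors on the effect size; the data (t = 1, N = 100) are made up.

# Minimal sketch: the same data produce different Bayes-Factors for H0,
# depending on how the alternative hypothesis H1 is specified.
bf01 <- function(t, n, scale) {
  df <- n - 1
  # marginal likelihood under H1: average the noncentral t density over a
  # Cauchy prior on the standardized effect size d with the given scale
  m1 <- integrate(function(d) dt(t, df, ncp = d * sqrt(n)) * dcauchy(d, scale = scale),
                  lower = -Inf, upper = Inf)$value
  m0 <- dt(t, df)      # likelihood under H0 (d = 0)
  m0 / m1              # BF01: evidence for H0 relative to this particular H1
}

bf01(t = 1, n = 100, scale = 1)    # wide H1 (much mass on large effects): favors H0
bf01(t = 1, n = 100, scale = 0.1)  # H1 concentrated on small effects: much weaker support for H0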

To make it simple, let’s forget about sampling distributions and likelihoods.  Using JBFs to claim that the data support the null-hypothesis in some absolute sense is like a guessing game where you can pick any number you want; I guess that you picked 7 (because people like the number 7), you say it was not 7, and I now infer that you must have picked 0, as if 7 and 0 were the only options.  If you think this is silly, you are right, and it is equally silly to infer from rejecting one out of an infinite number of possible H1s that H0 must be true.

So, a correct use of JBFs would be to state conclusions in terms of the H1 that was specified to compute the JBF.  For example, in Wagenmakers et al.’s analyses of Bem’s data, the authors specified H1 as a hypothesis that allocated 25% probability to effect sizes of d less than -1 (the opposite of the predicted effect) and 25% probability to effect sizes of d greater than 1 (a very strong effect, similar to gender differences in height).  Even if the JBF had strongly favored H0, which it did not, it would not justify the inference that time-reversed priming does not exist.  It would merely justify the inference that the effect size is unlikely to be greater than 1, one way or the other.  However, if Wagenmakers et al. (2011) had presented their results correctly, nobody would have bothered to take notice of such a trivial inference.  It was only the incorrect presentation of JBFs as a test of the null-hypothesis that led to the false belief that JBFs can provide evidence for the absence of an effect (e.g., that the true effect size in Bem’s studies is zero).  In fact, Wagenmakers played the game where H1 guessed that the effect size is 1, H1 was wrong, leading to the conclusion that the effect size must be 0.  This is an invalid inference because there are still an infinite number of plausible effect sizes between 0 and 1.

There is nothing inherently wrong with calculating likelihood ratios and using them to test competing predictions. However, the use of Bayes-Factors as a way to provide evidence for the absence of an effect is misguided because it is logically impossible to provide evidence for one specific effect size out of an infinite set of possible effect sizes. It does not matter whether the specific effect size is 0 or any other value. A likelihood ratio can only compare two hypotheses out of an infinite set of hypotheses. If one hypothesis is rejected, this does not justify inferring that the other hypothesis is true. This is the reason why we can falsify H0: when we reject H0, we do not infer that one specific effect size is true; we merely infer that it is not 0, leaving open the possibility that it is any one of the other infinite number of effect sizes. We cannot reverse this because we cannot test the hypothesis that the effect is zero against a single hypothesis that covers all other effect sizes. JBFs can only test H0 against all other effect sizes by assigning weights to them. As there is an infinite number of ways to weight effect sizes, there is an infinite set of alternative hypotheses. Thus, we can reject the hypothesis that sampling error alone produced an effect, but practically we can never demonstrate that sampling error alone caused the outcome of a study.

To demonstrate that an effect does not exist it is necessary to specify a region of effect sizes around zero.  The smaller the region, the more resources are needed to provide evidence that the effect size is at best very small.  One negative consequence of the JBF approach has been that small samples were used to claim support for the point null-hypothesis, with a high probability that this conclusion was a false negative result. Researchers should always report the 95% confidence interval around the observed effect size. If this interval includes effect sizes of .2 standard deviations, the inference in favor of a null-result is questionable because many effect sizes in psychology are small.  Confidence intervals (or Bayesian credibility intervals with plausible priors) are more useful for claims about the absence of an effect than misleading statistics that pretend to provide strong evidence in favor of a false null-hypothesis.
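As a rough illustration of this recommendation, the following sketch computes an approximate 95% confidence interval around an observed standardized mean difference and checks whether it excludes effects of .2 standard deviations. The numbers are invented, and the standard error uses the common large-sample approximation for Cohen's d; it is meant as a back-of-the-envelope check, not a full equivalence test.

```python
# Rough check: does the 95% CI around an observed effect size rule out
# small effects (|d| >= .2)?  Hypothetical numbers; the standard error uses the
# common large-sample approximation for Cohen's d with two independent groups.
import math

d_obs = 0.05          # hypothetical observed effect size
n1 = n2 = 50          # hypothetical group sizes

se = math.sqrt((n1 + n2) / (n1 * n2) + d_obs**2 / (2 * (n1 + n2)))
lower, upper = d_obs - 1.96 * se, d_obs + 1.96 * se
print(f"95% CI for d: [{lower:.2f}, {upper:.2f}]")

if -0.2 < lower and upper < 0.2:
    print("The CI excludes small effects; the effect is at most trivial.")
else:
    print("The CI includes |d| = .2; the data cannot rule out a small effect.")
```

With 50 participants per group, the interval is far too wide to rule out a small effect, which is exactly why claims about the absence of an effect based on small samples are questionable.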

Conduct Your Own Replicability Analysis

You can download an excel spreadsheet to conduct your own replicability analysis.


The most important columns are L (which.statistical.test), O (df1), P (df2), Q (test.statistic), and R (Success).

L (which statistical test):
Enter F, t, z, or chisq.

O (df1):
Enter the numerator (experimental) df for F; enter 1 or leave blank for the other tests.

P (df2):
Enter the participant (denominator) df for F and t, and the df for chi^2.

Q (test statistic):
Enter the actual F, t, z, or chi^2 value.

R (Success):
Enter whether the result was counted as a success or not.
(Entering a marginally significant result as a success produces a negative R-Index for this study.)
(Entering a marginally significant result as a failure produces a very high R-Index that is probably not justified.)
(If in doubt, do not enter marginally significant results, or enter them with 0.5 for success.)
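For readers who prefer code to spreadsheets, here is a minimal Python sketch of the kind of computation behind these columns, assuming the usual conversion of each test statistic to a two-tailed p-value, then to an absolute z-score, and then to the observed power to reach p < .05 again. The example entries and the simple median are invented for illustration; treat the spreadsheet as the authoritative tool.

```python
# Minimal sketch (not the actual spreadsheet): convert reported test statistics
# to two-tailed p-values, then to absolute z-scores and observed power at
# alpha = .05, and combine them into an R-Index (median observed power minus
# the inflation implied by the success rate).
from scipy import stats

def observed_power(test, value, df1=1, df2=None):
    """Test statistic -> two-tailed p -> |z| -> power to get |z| > 1.96 again."""
    if test == "t":
        p = 2 * stats.t.sf(abs(value), df2)
    elif test == "F":
        p = stats.f.sf(value, df1, df2)
    elif test == "chisq":
        p = stats.chi2.sf(value, df1)
    elif test == "z":
        p = 2 * stats.norm.sf(abs(value))
    else:
        raise ValueError(test)
    z = stats.norm.isf(p / 2)          # absolute z-score
    return stats.norm.sf(1.96 - z)     # ignores the negligible opposite tail

# Hypothetical entries: (test, statistic, df1, df2, success)
studies = [("t", 2.20, 1, 58, 1), ("F", 4.50, 1, 80, 1), ("chisq", 5.10, 1, None, 1)]
powers = sorted(observed_power(t, v, d1, d2) for t, v, d1, d2, s in studies)
median_power = powers[len(powers) // 2]          # simple median (odd number of studies)
success_rate = sum(s for *_, s in studies) / len(studies)
inflation = success_rate - median_power
print(f"median observed power = {median_power:.2f}, R-Index = {median_power - inflation:.2f}")
```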

Feel free to post questions in the comment section or email me.



Why most Multiple-Study Articles are False: An Introduction to the Magic Index

In 2011 I wrote a manuscript in response to Bem's (2011) unbelievable and flawed evidence for extroverts' supernatural abilities. It took nearly two years for the manuscript to get published in Psychological Methods. While I was proud to have published in this prestigious journal without formal training in statistics or a grasp of Greek notation, I now realize that Psychological Methods was not the best outlet for the article, which may explain why even some established replication revolutionaries do not know it (comment: I read your blog, but I didn't know about this article). So, I decided to publish an abridged (it is still long), lightly edited (I have learned a few things since 2011), and commented (comments are in […]) version here.

I also learned a few things about titles. So the revised version has a new title.

Finally, I can now disregard the request from the editor, Scott Maxwell, on behalf of reviewer Daryl Bem, to change the name of my statistical index from magic index to incredibility index (that is the advantage of publishing without the credentials, but also without the censorship, of peer review).

For readers not familiar with experimental social psychology, it is also important to understand what a multiple-study article is. Most sciences are happy with one empirical study per article. However, social psychologists didn't trust the results of a single study with p < .05. Therefore, they wanted to see internal conceptual replications of phenomena. Magically, Bem was able to provide evidence for supernatural abilities in not just 1 or 2 or 3 studies, but 8 conceptual replication studies with 9 successful tests. The chance of obtaining 9 false positive results in 9 statistical tests is smaller than the significance criterion used to claim the discovery of the Higgs boson particle, which was a big discovery in physics. So, readers in 2011 had a difficult choice to make: either supernatural phenomena are real or multiple-study articles are unreal. My article shows that the latter is likely to be true, as did an article by Greg Francis.

Aside from Alcock's demonstration of a nearly perfect negative correlation between effect sizes and sample sizes and my demonstration of insufficient variance in Bem's p-values, Francis's article and my article remain the only articles that question the validity of Bem's original findings. Other articles have shown that the results cannot be replicated, but I showed that the original results were already too good to be true. This blog post explains how I did it.

Why most multiple-study articles are false: An Introduction to the Magic Index
(the article formerly known as “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”)

Cohen (1962) pointed out the importance of statistical power for psychology as a science, but statistical power of studies has not increased, while the number of studies in a single article has increased. It has been overlooked that multiple studies with modest power have a high probability of producing nonsignificant results because power decreases as a function of the number of statistical tests that are being conducted (Maxwell, 2004). The discrepancy between the expected number of significant results and the actual number of significant results in multiple-study articles undermines the credibility of the reported
results, and it is likely that questionable research practices have contributed to the reporting of too many significant results (Sterling, 1959). The problem of low power in multiple-study articles is illustrated using Bem’s (2011) article on extrasensory perception and Gailliot et al.’s (2007) article on glucose and self-regulation. I conclude with several recommendations that can increase the credibility of scientific evidence in psychological journals. One major recommendation is to pay more attention to the power of studies to produce positive results without the help of questionable research practices and to request that authors justify sample sizes with a priori predictions of effect sizes. It is also important to publish replication studies with nonsignificant results if these studies have high power to replicate a published finding.

Keywords: power, publication bias, significance, credibility, sample size


Less is more, except of course for sample size. (Cohen, 1990, p. 1304)

In 2011, the prestigious Journal of Personality and Social Psychology published an article that provided empirical support for extrasensory perception (ESP; Bem, 2011). The publication of this controversial article created vigorous debates in psychology
departments, the media, and science blogs. In response to this debate, the acting editor and the editor-in-chief felt compelled to write an editorial accompanying the article. The editors defended their decision to publish the article by noting that Bem’s (2011) studies were performed according to standard scientific practices in the field of experimental psychology and that it would seem inappropriate to apply a different standard to studies of ESP (Judd & Gawronski, 2011).

Others took a less sanguine view. They saw the publication of Bem’s (2011) article as a sign that the scientific standards guiding publication decisions are flawed and that Bem’s article served as a glaring example of these flaws (Wagenmakers, Wetzels, Borsboom,
& van der Maas, 2011). In a nutshell, Wagenmakers et al. (2011) argued that the standard statistical model in psychology is biased against the null hypothesis; that is, only findings that are statistically significant are submitted and accepted for publication.

This bias leads to the publication of too many positive (i.e., statistically significant) results. The observation that scientific journals, not only those in psychology,
publish too many statistically significant results is by no means novel. In a seminal article, Sterling (1959) noted that selective reporting of statistically significant results can produce literatures that “consist in substantial part of false conclusions” (p. 30).

Three decades later, Sterling, Rosenbaum, and Weinkam (1995) observed that the “practice leading to publication bias have [sic] not changed over a period of 30 years” (p. 108). Recent articles indicate that publication bias remains a problem in psychological
journals (Fiedler, 2011; John, Loewenstein, & Prelec, 2012; Kerr, 1998; Simmons, Nelson, & Simonsohn, 2011; Strube, 2006; Vul, Harris, Winkielman, & Pashler, 2009; Yarkoni, 2010).

Other sciences have the same problem (Yong, 2012). For example, medical journals have seen an increase in the percentage of retracted articles (Steen, 2011a, 2011b), and there is the concern that a vast number of published findings may be false (Ioannidis, 2005).

However, a recent comparison of different scientific disciplines suggested that the bias is stronger in psychology than in some of the older and harder scientific disciplines at the top of a hierarchy of sciences (Fanelli, 2010).

It is important that psychologists use the current crisis as an opportunity to fix problems in the way research is being conducted and reported. The proliferation of eye-catching claims based on biased or fake data can have severe negative consequences for a
science. A New Yorker article warned the public that “all sorts of  well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable” (Lehrer, 2010, p. 1).

If students who read psychology textbooks and the general public lose trust in the credibility of psychological science, psychology loses its relevance because
objective empirical data are the only feature that distinguishes psychological science from other approaches to the understanding of human nature and behavior. It is therefore hard to exaggerate the seriousness of doubts about the credibility of research findings published in psychological journals.

In an influential article, Kerr (1998) discussed one source of bias, namely, hypothesizing after the results are known (HARKing). The practice of HARKing may be attributed to the
high costs of conducting a study that produces a nonsignificant result that cannot be published. To avoid this negative outcome, researchers can design more complex studies that test multiple hypotheses. Chances increase that at least one of the hypotheses
will be supported, if only because Type I error increases (Maxwell, 2004). As noted by Wagenmakers et al. (2011), generations of graduate students were explicitly advised that this questionable research practice is how they should write scientific manuscripts
(Bem, 2000).

It is possible that Kerr’s (1998) article undermined the credibility of single-study articles and added to the appeal of multiple-study articles (Diener, 1998; Ledgerwood & Sherman, 2012). After all, it is difficult to generate predictions for significant effects
that are inconsistent across studies. Another advantage is that the requirement of multiple significant results essentially lowers the chances of a Type I error, that is, the probability of falsely rejecting the null hypothesis. For a set of five independent studies,
the requirement to demonstrate five significant replications essentially shifts the probability of a Type I error from p < .05 for a single study to p < .0000003 (i.e., .05^5) for a set of five studies.

This is approximately the same stringent criterion that is being used in particle physics to claim a true discovery (Castelvecchi, 2011). It has been overlooked, however, that researchers have to pay a price to meet more stringent criteria of credibility. To demonstrate significance at a more stringent criterion of significance, it is
necessary to increase sample sizes to reduce the probability of making a Type II error (failing to reject the null hypothesis). This probability is called beta. The inverse probability (1 – beta) is called power. Thus, to maintain high statistical power to demonstrate an effect with a more stringent alpha level requires an
increase in sample sizes, just as physicists had to build a bigger collider to have a chance to find evidence for smaller particles like the Higgs boson particle.

Yet there is no evidence that psychologists are using bigger samples to meet more stringent demands of replicability (Cohen, 1992; Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). This raises the question of how researchers are able to replicate findings in multiple-study articles despite modest power to demonstrate significant effects even within a single study. Researchers can use questionable research
practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result. Moreover, a survey of researchers indicated that these
practices are common (John et al., 2012), and the prevalence of these practices has raised concerns about the credibility of psychology as a science (Yong, 2012).

An implicit assumption in the field appears to be that the solution to these problems is to further increase the number of positive replication studies that need to be presented to ensure scientific credibility (Ledgerwood & Sherman, 2012). However, the assumption that many replications with significant results provide strong evidence for a hypothesis is an illusion that is akin to the Texas sharpshooter fallacy (Milloy, 1995). Imagine a Texan farmer named Joe. One day he invites you to his farm and shows you a target with nine shots in the bull’s-eye and one shot just outside the bull’s-eye. You are impressed by his shooting abilities until you find out that he cannot repeat this performance when you challenge him to do it again.

[So far, well-known Texan sharpshooters in experimental social psychology have carefully avoided demonstrating their sharp shooting abilities in open replication studies to avoid the embarrassment of not being able to do it again].

Over some beers, Joe tells you that he first fired 10 shots at the barn and then drew the targets after the shots were fired. One problem in science is that reading a research
article is a bit like visiting Joe’s farm. Readers only see the final results, without knowing how the final results were created. Is Joe a sharpshooter who drew a target and then fired 10 shots at the target? Or was the target drawn after the fact? The reason why multiple-study articles are akin to a Texan sharpshooter is that psychological studies have modest power (Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). Assuming
60% power for a single study, the probability of obtaining 10 significant results in 10 studies is less than 1% (.6^10 = 0.6%).

I call the probability to obtain only significant results in a set of studies total power. Total power parallels Maxwell’s (2004) concept of all-pair power for multiple comparisons in analysis-of variance designs. Figure 1 illustrates how total power decreases with the number of studies that are being conducted. Eventually, it becomes extremely unlikely that a set of studies produces only significant results. This is especially true if a single study has modest power. When total power is low, it is incredible that a set
of studies yielded only significant results. To avoid the problem of incredible results, researchers would have to increase the power of studies in multiple-study articles.

Table 1 shows how the power of individual studies has to be adjusted to maintain 80% total power for a set of studies. For example, to have 80% total power for five replications, the power of each study has to increase to 96%.

Table 1 also shows the sample sizes required to achieve 80% total power, assuming a simple between-group design, an alpha level of .05 (two-tailed), and Cohen’s
(1992) guidelines for a small (d = .2), moderate (d = .5), and strong (d = .8) effect.

[To demonstrate a small effect 7 times would require more than 10,000 participants.]
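[For readers who want to check these numbers, here is a small sketch (in Python, using statsmodels) that computes the per-study power needed for 80% total power and the implied sample sizes, assuming a two-group design, alpha = .05 (two-tailed), and Cohen's benchmarks. Exact values may differ slightly from Table 1 depending on rounding and the power routine used.]

```python
# Sketch: per-study power and sample size needed for 80% total power
# (the probability that ALL k studies are significant), assuming a two-group
# design, alpha = .05 (two-tailed), and Cohen's small/moderate/strong effects.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for k in (2, 5, 7, 10):
    per_study_power = 0.8 ** (1 / k)      # because total power = per-study power ** k
    for d in (0.2, 0.5, 0.8):
        n_per_group = analysis.solve_power(effect_size=d, alpha=0.05,
                                           power=per_study_power,
                                           alternative='two-sided')
        print(f"k={k:2d}  d={d}  per-study power={per_study_power:.2f}  "
              f"N per study={2 * n_per_group:6.0f}  total N={2 * n_per_group * k:7.0f}")
```

[For seven demonstrations of a small effect this comes to a total of roughly 10,000 participants, consistent with the comment above.]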

In sum, my main proposition is that psychologists have falsely assumed that increasing the number of replications within an article increases credibility of psychological science. The problem of this practice is that a truly programmatic set of multiple studies
is very costly and few researchers are able to conduct multiple studies with adequate power to achieve significant results in all replication attempts. Thus, multiple-study articles have intensified the pressure to use questionable research methods to compensate for low total power and may have weakened rather than strengthened
the credibility of psychological science.

[I believe this is one reason why the replication crisis has hit experimental social psychology the hardest. Other psychologists could use HARKing to tell a false story about a single study, but experimental social psychologists had to manipulate the data to get significance all the time. Experimental cognitive psychologists also have multiple-study articles, but they tend to use more powerful within-subject designs, which makes it more credible to get significant results multiple times. The multiple-study BS (between-subjects) design made it impossible to do so, which resulted in the publication of BS results.]

What Is the Allure of Multiple-Study Articles?

One apparent advantage of multiple-study articles is to provide stronger evidence against the null hypothesis (Ledgerwood & Sherman, 2012). However, the number of studies is irrelevant because the strength of the empirical evidence is a function of the
total sample size rather than the number of studies. The main reason why aggregation across studies reduces randomness as a possible explanation for observed mean differences (or correlations) is that p values decrease with increasing sample size. The
number of studies is mostly irrelevant. A study with 1,000 participants has as much power to reject the null hypothesis as a meta-analysis of 10 studies with 100 participants if it is reasonable to assume a common effect size for the 10 studies. If true effect sizes vary across studies, power decreases because a random-effects model may be more appropriate (Schmidt, 2010; but see Bonett, 2009). Moreover, the most logical approach to reduce concerns about Type I error is to use more stringent criteria for significance (Mudge, Baker, Edge, & Houlahan, 2012). For controversial or very important research findings, the significance level could be set to p < .001 or, as in particle physics, to p < .0000003.

[Ironically, five years later we have a debate about p < .05 versus p < .005, without even thinking about p < .0000005 or any mention that even a pair of studies with p < .05 in each study effectively have an alpha less than p < .005, namely .0025 to be exact.]  

It is therefore misleading to suggest that multiple-study articles are more credible than single-study articles. A brief report with a large sample (N = 1,000) provides more credible evidence than a multiple-study article with five small studies (N = 40, total
N = 200).

The main appeal of multiple-study articles seems to be that they can address other concerns (Ledgerwood & Sherman, 2012). For example, one advantage of multiple studies could be to test the results across samples from diverse populations (Henrich, Heine, & Norenzayan, 2010). However, many multiple-study articles are based on samples drawn from a narrowly defined population (typically, students at the local university). If researchers were concerned about generalizability across a wider range of individuals, multiple-study articles should examine different populations. However, it is not clear why it would be advantageous to conduct multiple independent studies with different populations. To compare populations, it would be preferable to use the same procedures and to analyze the data within a single statistical model with population as a potential moderating factor. Moreover, moderator tests often have low power. Thus, a single study with a large sample and moderator variables is more informative than articles that report separate analyses with small samples drawn from different populations.

Another attraction of multiple-study articles appears to be the ability to provide strong evidence for a hypothesis by means of slightly different procedures. However, even here, single studies can be as good as multiple-study articles. For example, replication across different dependent variables in different studies may mask the fact that studies included multiple dependent variables and researchers picked dependent variables that produced significant results (Simmons et al., 2011). In this case, it seems preferable to
demonstrate generalizability across dependent variables by including multiple dependent variables within a single study and reporting the results for all dependent variables.

One advantage of a multimethod assessment in a single study is that the power to
demonstrate an effect increases for two reasons. First, while some dependent variables may produce nonsignificant results in separate small studies due to low power (Maxwell, 2004), they may all show significant effects in a single study with the total sample size
of the smaller studies. Second, it is possible to increase power further by constraining coefficients for each dependent variable or by using a latent-variable measurement model to test whether the effect is significant across dependent variables rather than for each one independently.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating
a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions.  Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe
truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005).

[This is my takedown of social psychologists' claim that multiple conceptual replications test theories; Stroebe & Strack, 2014]

The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect. In this way, empirical studies no longer test theoretical hypotheses because they can only produce two results: Either they support the theory (p < .05) or the manipulation did not work (p > .05). It is therefore worrisome that Bem noted that “like most  social psychological experiments, the experiments reported here required extensive pilot testing” (Bem, 2011, p. 421). If Joe is a sharpshooter, who can hit the bull’s-eye from different angles and with different guns, why does he need extensive training before he can perform the critical shot?

The freedom of researchers to discount null findings leads to the paradox that conceptual replications across multiple studies give the impression that an effect is robust followed by warnings that experimental findings may not replicate because they depend “on subtle and unknown factors” (Bem, 2011, p. 422).

If experimental results were highly context dependent, it would be difficult to explain how studies reported in research articles nearly always produce the expected results. One possible explanation for this paradox is that sampling error in small samples creates the illusion that effect sizes vary systematically, although most of the variation is random. Researchers then pick studies that randomly produced inflated effect sizes and may further inflate them by using questionable research methods to achieve significance (Simmons et al., 2011).

[I was polite when I said “may”.  This appears to be exactly what Bem did to get his supernatural effects.]

The final set of studies that worked is then published and gives a false sense of the effect size and replicability of the effect (you should see the other side of Joe’s barn). This may explain why research findings initially seem so impressive, but when other researchers try to build on these seemingly robust findings, it becomes increasingly uncertain whether a phenomenon exists at all (Ioannidis, 2005; Lehrer, 2010).

At this point, a lot of resources have been wasted without providing credible evidence for an  effect.

[And then Stroebe and Strack in 2014 suggest that real replication studies that let the data determine the outcome are a waste of resources.]

To increase the credibility of reported findings, it would be better to use all of the resources for one powerful study. For example, the main dependent variable in Bem’s (2011) study of ESP was the percentage of correct predictions of future events.
Rather than testing this ability 10 times with N = 100 participants, it would have been possible to test the main effect of ESP in a single study with 10 variations of experimental procedures and use the experimental conditions as a moderating factor. By testing one
main effect of ESP in a single study with N = 1,000, power would be greater than 99.9% to demonstrate an effect with Bem’s a priori effect size.

At the same time, the power to demonstrate significant moderating effects would be much lower. Thus, the study would lead to the conclusion that ESP does exist but that it is unclear whether the effect size varies as a function of the actual experimental
paradigm. This question could then be examined in follow-up studies with more powerful tests of moderating factors.

In conclusion, it is true that a programmatic set of studies is superior to a brief article that reports a single study if both articles have the same total power to produce significant results (Ledgerwood & Sherman, 2012). However, once researchers use questionable research practices to make up for insufficient total power, multiple-study articles lose their main advantage over single-study articles, namely, to demonstrate generalizability across different experimental manipulations or other extraneous factors.

Moreover, the demand for multiple studies counteracts the demand for more
powerful studies (Cohen, 1962; Maxwell, 2004; Rossi, 1990) because limited resources (e.g., subject pool of PSY100 students) can only be used to increase sample size in one study or to conduct more studies with small samples.

It is therefore likely that the demand for multiple studies within a single article has eroded rather than strengthened the credibility of published research findings
(Steen, 2011a, 2011b), and it is problematic to suggest that multiple-study articles solve the problem that journals publish too many positive results (Ledgerwood & Sherman, 2012). Ironically, the reverse may be true because multiple-study articles provide a
false sense of credibility.

Joe the Magician: How Many Significant Results Are Too Many?

Most people enjoy a good magic show. It is fascinating to see something and to know at the same time that it cannot be real. Imagine that Joe is a well-known magician. In front of a large audience, he fires nine shots from impossible angles, blindfolded, and seemingly through the body of an assistant, who miraculously does not bleed. You cannot figure out how Joe pulled off the stunt, but you know it was a stunt. Similarly, seeing Joe hit the bull’s-eye 1,000 times in a row raises concerns about his abilities as a sharpshooter and suggests that some magic is contributing to this miraculous performance. Magic is fun, but it is not science.

[Before Bem's article appeared, Steve Heine gave a talk at the University of Toronto where he presented multiple studies with manipulations of absurdity (absurdity like Monty Python's “Biggles: Pioneer Air Fighter”; cf. Proulx, Heine, & Vohs, PSPB, 2010). Each absurd manipulation was successful. I didn't have my magic index then, but I did understand the logic of Sterling et al.'s (1995) argument. So, I did ask whether there were also manipulations that did not work and the answer was affirmative. It was rude at the time to ask about a file drawer before 2011, but a recent twitter discussion suggests that it wouldn't be rude in 2018. Times are changing.]

The problem is that some articles in psychological journals appear to be more magical than one would expect on the basis of the normative model of science (Kerr, 1998). To increase the credibility of published results, it would be desirable to have a diagnostic tool that can distinguish between credible research findings and those that are likely to be based on questionable research practices. Such a tool would also help to
counteract the illusion that multiple-study articles are superior to single-study articles without leading to the erroneous reverse conclusion that single-study articles are more trustworthy.

[I need to explain why I targeted multiple-study articles in particular. Even the personality section of JPSP started to demand multiple studies because they created the illusion of being more rigorous, e.g., the crazy glucose article was published in that section. At that time, I was still trying to publish as many articles as possible in JPSP and I was not able to compete with crazy science.]

Articles should be evaluated on the basis of their total power to demonstrate consistent evidence for an effect. As such, a single-study article with 80% (total) power is superior to a multiple-study article with 20% total power, but a multiple-study article with 80% total power is superior to a single-study article with 80% power.

The Magic Index (formerly known as the Incredibility Index)

The idea to use power analysis to examine bias in favor of theoretically predicted effects and against the null hypothesis was introduced by Sterling et al. (1995). Ioannidis and Trikalinos (2007) provided a more detailed discussion of this approach for the detection of bias in meta-analyses. Ioannidis and Trikalinos's exploratory test estimates the probability of the number of reported significant results given the average power of the reported studies. Low p values indicate that there are too many significant results, which suggests that questionable research methods contributed to the reported results. In contrast, the inverse inference is not justified because high p values do not justify the inference that questionable research practices did not contribute to the results. To emphasize this asymmetry in inferential strength, I suggest reversing the exploratory test, focusing on the probability of obtaining more nonsignificant results than were reported in a multiple-study article and calling this index the magic index.

Higher values indicate that there is a surprising lack of nonsignificant results (a.k.a., shots that missed the bull’s eye). The higher the magic index is, the more incredible the observed outcome becomes.

Too many significant results could be due to faking, fudging, or fortune. Thus, the statistical demonstration that a set of reported findings is magical does not prove that questionable research methods contributed to the results in a multiple-study article. However, even when questionable research methods did not contribute to the results, the published results are still likely to be biased because fortune helped to inflate effect sizes and produce more significant results than total power justifies.

Computation of the Magic Index

To understand the basic logic of the M-index, it is helpful to consider a concrete example. Imagine a multiple-study article with 10 studies with an average observed effect size of d = .5 and 84 participants in each study (42 in two conditions, total N = 840) and all studies producing a significant result. At first sight, these 10 studies seem to provide strong support against the null hypothesis. However, a post hoc power analysis with the average effect size of d = .5 as estimate of the true effect size reveals that each study had
only 60% power to obtain a significant result. That is, even if the true effect size were d = .5, only six out of 10 studies should have produced a significant result.

The M-index quantifies the probability of the actual outcome (10 out of 10 significant results) given the expected value (six out of 10 significant results) using binomial
probability theory. From the perspective of binomial probability theory, the scenario
is analogous to an urn problem with replacement with six green balls (significant) and four red balls (nonsignificant). The binomial probability to draw at least one red ball in 10 independent draws is 99.4%. (Stat Trek, 2012).

That is, 994 out of 1,000 multiple-study articles with 10 studies and 60% average power
should have produced at least one nonsignificant result in one of the 10 studies. It is therefore incredible if an article reports 10 significant results because only six out of 1,000 attempts would have produced this outcome simply due to chance alone.
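[The arithmetic of this example can be checked with a few lines of Python; power is computed with statsmodels for d = .5 and 42 participants per condition, so the result may come out a percentage point or two away from the rounded 60% used in the text.]

```python
# Worked example: 10 studies, d = .5, n = 42 per condition, all significant.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.5, nobs1=42, alpha=0.05,
                              alternative='two-sided')
total_power = power ** 10      # probability that all 10 studies are significant
m_index = 1 - total_power      # probability of at least one nonsignificant result
print(f"per-study power = {power:.2f}")     # about .6, as in the text
print(f"total power     = {total_power:.3f}")
print(f"M-index         = {m_index:.3f}")   # close to the 99.4% in the text
```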

[I now realize that observed power of 60% would imply that the null-hypothesis is true because observed power is also inflated by selecting for significance.  As 50% observed power is needed to achieve significance and chance cannot produce the same observed power each time, the minimum observed power is 62%!]

One of the main problems for power analysis in general and the computation of the M-index in particular is that the true effect size is unknown and has to be estimated. There are three basic approaches to the estimation of true effect sizes. In rare cases, researchers provide explicit a priori assumptions about effect sizes (Bem, 2011). In this situation, it seems most appropriate to use an author's stated assumptions about effect sizes to compute power with the sample sizes of each study. A second approach is to average reported effect sizes either by simply computing the mean value or by weighting effect sizes by their sample sizes. Averaging of effect sizes has the advantage that post hoc effect size estimates of single studies tend to have large confidence intervals. The confidence intervals shrink when effect sizes are aggregated across studies. However, this approach has two drawbacks. First, averaging of effect sizes makes strong assumptions about the sampling of studies and the distribution of effect sizes (Bonett, 2009). Second, this approach assumes that all studies have the same effect size, which is unlikely if a set of studies used different manipulations and dependent variables to demonstrate the generalizability of an effect. Ioannidis and Trikalinos (2007) were careful to warn readers that “genuine heterogeneity may be mistaken for bias.”

[I did not know about Ioannidis and Trikalinos's (2007) article when I wrote the first draft. Maybe that is a good thing because I might have followed their approach. However, my approach is different from their approach and solves the problem of pooling effect sizes. Claiming that my method is the same as Ioannidis and Trikalinos's method is like confusing random-effects meta-analysis with fixed-effect meta-analysis.]

To avoid the problems of average effect sizes, it is promising to consider a third option. Rather than pooling effect sizes, it is possible to conduct post hoc power analysis for each study. Although each post hoc power estimate is associated with considerable sampling error, sampling errors tend to cancel each other out, and the M-index for a set of studies becomes more accurate without having to assume equal effect sizes in all studies.

Unfortunately, this does not guarantee that the M-index is unbiased because power is a nonlinear function of effect sizes. Yuan and Maxwell (2005) examined the implications of this nonlinear relationship. They found that averaging post hoc power estimates may produce inflated estimates of average power, especially in small samples where observed effect sizes vary widely around the true effect size. Because inflated power estimates make a string of significant results look less surprising than it actually is, the M-index is conservative when power is low and magic had to be used to create significant results.

In sum, it is possible to use reported effect sizes to compute post hoc power and to use post hoc power estimates to determine the probability of obtaining a significant result. The post hoc power values can be averaged and used as the probability for a successful
outcome. It is then possible to use binomial probability theory to determine the probability that a set of studies would have produced equal or more nonsignificant results than were actually reported.  This probability is [now] called the M-index.
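[A sketch of this binomial step in Python: given post hoc power estimates for each study and the number of nonsignificant results that were actually reported, the M-index is the probability of obtaining more nonsignificant results than that. The power values below are invented for illustration.]

```python
# Sketch of the binomial step: average the post hoc power estimates and ask how
# probable it is to observe MORE nonsignificant results than were reported.
from scipy.stats import binom

post_hoc_power = [0.55, 0.62, 0.48, 0.70, 0.58,
                  0.65, 0.60, 0.52, 0.67, 0.59]   # invented values for 10 studies
reported_nonsignificant = 0                       # e.g., an article with only significant results

n = len(post_hoc_power)
p_nonsig = 1 - sum(post_hoc_power) / n            # average probability of a nonsignificant result
m_index = binom.sf(reported_nonsignificant, n, p_nonsig)   # P(X > reported_nonsignificant)
print(f"M-index = {m_index:.3f}")
```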

[Meanwhile, I have learned that it is much easier to compute observed power based on reported test statistics like t, F, and chi-square values because observed power is determined by these statistics.]

Example 1: Extrasensory Perception (Bem, 2011)

I use Bem’s (2011) article as an example because it may have been a tipping point for the current scientific paradigm in psychology (Wagenmakers et al., 2011).

[I am still waiting for EJ to return the favor and cite my work.]

The editors explicitly justified the publication of Bem’s article on the grounds that it was subjected to a rigorous review process, suggesting that it met current standards of scientific practice (Judd & Gawronski, 2011). In addition, the editors hoped that the publication of Bem’s article and Wagenmakers et al.’s (2011) critique would stimulate “critical further thoughts about appropriate methods in research on social cognition and attitudes” (Judd & Gawronski, 2011, p. 406).

A first step in the computation of the M-index is to define the set of effects that are being examined. This may seem trivial when the M-index is used to evaluate the credibility of results in a single article, but multiple-study articles contain many results and it is not always obvious that all results should be included in the analysis (Maxwell, 2004).

[Same here.  Maxwell accepted my article, but apparently doesn’t think it is useful to cite when he writes about the replication crisis.]

[deleted minute details about Bem’s study here.]

Another decision concerns the number of hypotheses that should be examined. Just as multiple studies reduce total power, tests of multiple hypotheses within a single study also reduce total power (Maxwell, 2004). Francis (2012b) decided to focus only on the
hypothesis that ESP exists, that is, that the average individual can foresee the future. However, Bem (2011) also made predictions about individual differences in ESP. Therefore, I used all 19 effects reported in Table 7 (11 ESP effects and eight personality effects).

[I deleted the section that explains alternative approaches that rely on effect sizes rather than observed power here.]

I used G*Power 3.1.2 to obtain post hoc power on the basis of effect sizes and sample sizes (Faul, Erdfelder, Buchner, & Lang, 2009).

The M-index is more powerful when a set of studies contains only significant results. In this special case, the M-index is simply the complement of total power (i.e., 1 minus the probability that all studies produce a significant result).

[An article by Fabrigar and Wegener misrepresents my article and confuses the M-Index with total power. When articles do report non-significant results and honestly report them as failures to reject the null-hypothesis (not marginal significance), it is necessary to compute the binomial probability to get the M-Index.]

[Again, I deleted minute computations for Bem’s results.]

Using the highest magic estimates produces a total Magic-Index of 99.97% for Bem's 17 results. Thus, it is very unlikely that Bem (2011) simply conducted 10 studies, ran 19 statistical tests of planned hypotheses, and obtained 14 statistically significant results.

Yet the editors felt compelled to publish the manuscript because “we can only take the author at his word that his data are in fact genuine and that the reported findings have not been taken from a larger set of unpublished studies showing null effects” (Judd & Gawronski, 2011, p. 406).

[It is well known that authors excluded disconfirming evidence and that editors sometimes even asked authors to engage in this questionable research practice. However, this quote implies that the editors asked Bem about failed studies and that he assured them that there are no failed studies, which may have been necessary to publish these magical results in JPSP.  If Bem did not disclose failed studies on request and these studies exist, it would violate even the lax ethical standards of the time that mostly operated on a “don’t ask don’t tell” basis. ]

The M-index provides quantitative information about the credibility of this assumption and would have provided the editors with objective information to guide their decision. More importantly, awareness about total power could have helped Bem to plan fewer studies with higher total power to provide more credible evidence for his hypotheses.

Example 2: Sugar High—When Rewards Undermine Self-Control

Bem’s (2011) article is exceptional in that it examined a controversial phenomenon. I used another nine-study article that was published in the prestigious Journal of Personality and Social Psychology to demonstrate that low total power is also a problem
for articles that elicit less skepticism because they investigate less controversial hypotheses. Gailliot et al. (2007) examined the relation between blood glucose levels and self-regulation. I chose this article because it has attracted a lot of attention (142 citations in Web of Science as of May 2012; an average of 24 citations per year) and it is possible to evaluate the replicability of the original findings on the basis of subsequent studies by other researchers (Dvorak & Simons, 2009; Kurzban, 2010).

[If anybody needs evidence that citation counts are a silly indicator of quality, here it is: the article has been cited 80 times in 2014, 64 times in 2015, 63 times in 2016, and 61 times in 2017.  A good reason to retract it, if JPSP and APA cares about science and not just impact factors.]

Sample sizes were modest, ranging from N = 12 to 102. Four studies had sample sizes of N < 20, which Simmons et al. (2011) considered to require special justification.  The total N is 359 participants. Table 1 shows that this total sample
size is sufficient to have 80% total power for four large effects or two moderate effects and is insufficient to demonstrate a [single] small effect. Notably, Table 4 shows that all nine reported studies produced significant results.

The M-Index for these 9 studies was greater than 99%. This indicates that from a statistical point of view, Bem’s (2011) evidence for ESP is more credible
than Gailliot et al.'s (2007) evidence for a role of blood glucose in self-regulation.

A more powerful replication study with N = 180 participants provides more conclusive evidence (Dvorak & Simons, 2009). This study actually replicated Gailliot et al.'s (2007) findings in Study 1. At the same time, the study failed to replicate the results for Studies 3–6 in the original article. Dvorak and Simons (2009) did not report the correlation, but the authors were kind enough to provide this information. The correlation was not significant in the experimental group, r(90) = .10, and the control group, r(90) =
.03. Even in the total sample, it did not reach significance, r(180) = .11. It is therefore extremely likely that the original correlations were inflated because a study with a sample of N = 90 has 99.9% power to produce a significant effect if the true effect
size is r = .5. Thus, Dvorak and Simons’s results confirm the prediction of the M-index that the strong correlations in the original article are incredible.
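[The 99.9% figure is easy to verify with a quick Fisher-z approximation; this is my own back-of-the-envelope check, not part of the original article.]

```python
# Quick check via the Fisher z-transformation: power of a two-tailed test of
# the null hypothesis r = 0 with n = 90 if the true correlation is r = .5.
import math
from scipy.stats import norm

r_true, n, alpha = 0.5, 90, 0.05
z_r = math.atanh(r_true)              # Fisher z of the true correlation
se = 1 / math.sqrt(n - 3)
z_crit = norm.isf(alpha / 2)          # 1.96
power = norm.sf(z_crit - z_r / se)    # ignoring the negligible opposite tail
print(f"power = {power:.4f}")         # about .999
```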

In conclusion, Gailliot et al. (2007) had limited resources to examine the role of blood glucose in self-regulation. By attempting replications in nine studies, they did not provide strong evidence for their theory. Rather, the results are incredible and difficult to replicate, presumably because the original studies yielded inflated effect sizes. A better solution would have been to test the three hypotheses in a single study with a large sample. This approach also makes it possible to test additional hypotheses, such as mediation (Dvorak & Simons, 2009). Thus, Example 2 illustrates that
a single powerful study is more informative than several small studies.

General Discussion

Fifty years ago, Cohen (1962) made a fundamental contribution to psychology by emphasizing the importance of statistical power to produce strong evidence for theoretically predicted effects. He also noted that most studies at that time had only sufficient power to provide evidence for strong effects. Fifty years later, power
analysis remains neglected. The prevalence of studies with insufficient power hampers scientific progress in two ways. First, there are too many Type II errors that are often falsely interpreted as evidence for the null hypothesis (Maxwell, 2004). Second, there
are too many false-positive results (Sterling, 1959; Sterling et al., 1995). Replication across multiple studies within a single article has been considered a solution to these problems (Ledgerwood & Sherman, 2012). The main contribution of this article is to point
out that multiple-study articles do not provide more credible evidence simply because they report more statistically significant results. Given the modest power of individual studies, it is even less credible that researchers were able to replicate results repeatedly in a series of studies than that they obtained a significant effect in a single study.

The demonstration that multiple-study articles often report incredible results might help to reduce the allure of multiple-study articles (Francis, 2012a, 2012b). This is not to say that multiple-study articles are intrinsically flawed or that single-study articles are superior. However, more studies are only superior if total power is held constant, yet limited resources create a trade-off between the number of studies and total power of a set of studies.

To maintain credibility, it is better to maximize total power rather than the number of studies. In this regard, it is encouraging that some editors no longer consider the number of studies as a selection criterion for publication (Smith, 2012).

[Over the past years, I have been disappointed by many psychologists that I admired or respected. I loved ER Smith’s work on exemplar models that influenced my dissertation work on frequency estimation of emotion.  In 2012, I was hopeful that he would make real changes, but my replicability rankings show that nothing changed during his term as editor of the JPSP section that published Bem’s article. Five wasted years and nobody can say he couldn’t have known better.]

Subsequently, I first discuss the puzzling question of why power continues to be ignored despite the crucial importance of power to obtain significant results without the help of questionable research methods. I then discuss the importance of paying more attention to total power to increase the credibility of psychology as a science. Due to space limitations, I will not repeat many other valuable suggestions that have been made to improve the current scientific model (Schooler, 2011; Simmons et al., 2011; Spellman, 2012; Wagenmakers et al., 2011).

In my discussion, I will refer to Bem’s (2011) and Gailliot et al.’s (2007) articles, but it should be clear that these articles merely exemplify flaws of the current scientific
paradigm in psychology.

Why Do Researchers Continue to Ignore Power?

Maxwell (2004) proposed that researchers ignore power because they can use a shotgun approach. That is, if Joe sprays the barn with bullets, he is likely to hit the bull’s-eye at least once. For example, experimental psychologists may use complex factorial
designs that test multiple main effects and interactions to obtain at
least one significant effect (Maxwell, 2004).

Psychologists who work with many variables can test a large number of correlations
to find a significant one (Kerr, 1998). Although studies with small samples have modest power to detect all significant effects (low total power), they have high power to detect at least one significant effect (Maxwell, 2004).

The shotgun model is unlikely to explain incredible results in multiple-study articles because the pattern of results in a set of studies has to be consistent. This has been seen as the main strength of multiple-study articles (Ledgerwood & Sherman, 2012).

However, low total power in multiple-study articles makes it improbable that all studies produce significant results and increases the pressure on researchers to use questionable research methods to comply with the questionable selection criterion that
manuscripts should report only significant results.

A simple solution to this problem would be to increase total power to avoid
having to use questionable research methods. It is therefore even more puzzling why the requirement of multiple studies has not resulted in an increase in power.

One possible explanation is that researchers do not care about effect sizes. Researchers may not consider it unethical to use questionable research methods that inflate effect sizes as long as they are convinced that the sign of the reported effect is consistent
with the sign of the true effect. For example, the theory that implicit attitudes are malleable is supported by a positive effect of experimental manipulations on the implicit association test, no matter whether the effect size is d = .8 (Dasgupta & Greenwald,
2001) or d = .08 (Joy-Gaba & Nosek, 2010), and the influence of blood glucose levels on self-control is supported by a strong correlation of r = .6 (Gailliot et al., 2007) and a weak correlation of r = .1 (Dvorak & Simons, 2009).

The problem is that in the real world, effect sizes matter. For example, it matters whether exercising for 20 minutes twice a week leads to a weight loss of one
pound or 10 pounds. Unbiased estimates of effect sizes are also important for the integrity of the field. Initial publications with stunning and inflated effect sizes produce underpowered replication studies even if subsequent researchers use a priori power analysis.

As failed replications are difficult to publish, inflated effect sizes are persistent and can bias estimates of true effect sizes in meta-analyses. Failed replication studies in file drawers also waste valuable resources (Spellman, 2012).

In comparison to one small (N = 40) published study with an inflated effect size and
nine replication studies with nonsignificant replications in file drawers (N = 360), it would have been better to pool the resources of all 10 studies for one strong test of an important hypothesis (N = 400).

A related explanation is that true effect sizes are often likely to be small to moderate and that researchers may not have sufficient resources for unbiased tests of their hypotheses. As a result, they have to rely on fortune (Wegner, 1992) or questionable research
methods (Simmons et al., 2011; Vul et al., 2009) to report inflated observed effect sizes that reach statistical significance in small samples.

Another explanation is that researchers prefer small samples to large samples because small samples have less power. When publications do not report effect sizes, sample sizes become an imperfect indicator of effect sizes because only strong effects
reach significance in small samples. This has led to the flawed perception that effect sizes in large samples have no practical significance because even effects without practical significance can reach statistical significance (cf. Royall, 1986). This line of
reasoning is fundamentally flawed and confounds credibility of scientific evidence with effect sizes.

The most probable and banal explanation for ignoring power is poor statistical training at the undergraduate and graduate levels. Discussions with colleagues and graduate students suggest that power analysis is mentioned, but without a sense of importance.

[I have been preaching about power for years in my department and it became a running joke for students to mention power in their presentations without having any effect on research practices until 2011. Fortunately, Bem unintentionally made it possible to convince some colleagues that power is important.]

Research articles also reinforce the impression that power analysis is not important as sample sizes vary seemingly at random from study to study or article to article. As a result, most researchers probably do not know how risky their studies are and how lucky they are when they do get significant and inflated effects.

I hope that this article will change this and that readers take total power into account when they read the next article with five or more studies and 10 or more significant results and wonder whether they have witnessed a sharpshooter or have seen a magic show.

Finally, it is possible that researchers ignore power simply because they follow current practices in the field. Few scientists are surprised that published findings are too good to be true. Indeed, a common response to presentations of this work has been that the M-index only shows the obvious. Everybody knows that researchers use a number of questionable research practices to increase their chances of reporting significant results, and a high percentage of researchers admit to using these practices, presumably
because they do not consider them to be questionable (John et al., 2012).

[Even in 2014, Stroebe and Strack claim that it is not clear which practices should be considered questionable, whereas my undergraduate students have no problem realizing that hiding failed studies undermines the purpose of doing an empirical study in the first place.]

The benign view of current practices is that successful studies provide all of the relevant information. Nobody wants to know about all the failed attempts of alchemists to turn base metals into gold, but everybody would want to know about a process that
actually achieves this goal. However, this logic rests on the assumption that successful studies were really successful and that unsuccessful studies were really flawed. Given the modest power of studies, this conclusion is rarely justified (Maxwell, 2004).

To improve the status of psychological science, it will be important to elevate the scientific standards of the field. Rather than pointing to limited resources as an excuse,
researchers should allocate resources more wisely (spend less money on underpowered studies) and conduct more relevant research that can attract more funding. I think it would be a mistake to excuse the use of questionable research practices by pointing out that false discoveries in psychological research have less dramatic consequences than drugs with little benefits, huge costs, and potential side effects.

Therefore, I disagree with Bem’s (2000) view that psychologists should “err on the side of discovery” (p. 5).

[Yup, he wrote that in a chapter that was used to train graduate students in social psychology in the art of magic.]

Recommendations for Improvement

Use Power in the Evaluation of Manuscripts

Granting agencies often ask that researchers plan studies with adequate power (Fritz & MacKinnon, 2007). However, power analysis is ignored when researchers report their results. The reason is probably that (a priori) power analysis is only seen as a way to ensure that a study produces a significant result. Once a significant finding has been found, low power no longer seems to be a problem. After all, a significant effect was found (in one condition, for male participants, after excluding two outliers, p =
.07, one-tailed).

One way to improve psychological science is to require researchers to justify sample sizes in the method section. For multiple-study articles, researchers should be asked to compute total power.

[This is something nobody has even started to discuss.  Although there are more and more (often questionable) a priori power calculations in articles, they tend to aim for 80% power for a single hypothesis test, yet these articles often report multiple studies or multiple hypothesis tests. The power to get two significant results when each test has 80% power is only 64%.]
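To make the arithmetic concrete, here is a minimal sketch in Python; the numbers are the ones mentioned above, and the function name is mine.

```python
# Minimal sketch: the probability that all k independent hypothesis tests in an
# article reach significance ("total power") is the product of the individual powers.
def total_power(power_per_test: float, k: int) -> float:
    return power_per_test ** k

print(total_power(0.80, 2))  # 0.64: two tests at 80% power each
print(total_power(0.80, 5))  # ~0.33: five tests at 80% power each
```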

If a study has 80% total power, researchers should also explain how they would deal with the possible outcome of a nonsignificant result. Maybe it would change the perception of research contributions when a research article reports 10 significant
results, although power was only sufficient to obtain six. Implementing this policy would be simple. Thus, it is up to editors to realize the importance of statistical power and to make power an evaluation criterion in the review process (Cohen, 1992).
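As a hedged illustration of how implausible such a perfect record is, the sketch below computes the chance of ten out of ten significant results when average power is only 60% (power sufficient for six successes); the scenario matches the example above, everything else is assumption.

```python
from scipy.stats import binom

# With 60% average power, ten studies are expected to yield about six significant
# results; ten out of ten without any selection is extremely unlikely.
power, k = 0.60, 10
print(k * power)               # 6.0 expected significant results
print(binom.pmf(k, k, power))  # ~0.006 probability that all ten are significant
```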

Implementing this policy could change the hierarchy of psychological
journals. Top journals would no longer be the journals with the most inflated effect sizes but, rather, the journals with the most powerful studies and the most credible scientific evidence.

[Based on this idea, I started developing my replicability rankings of journals. And they show that impact factors still do not take replicability into account.]

Reward Effort Rather Than Number of Significant Results

Another recommendation is to pay more attention to the total effort that went into an empirical study rather than the number of significant p values. The requirement to have multiple studies with no guidelines about power encourages a frantic empiricism in
which researchers will conduct as many cheap and easy studies as possible to find a set of significant results.

[And if power is taken into account, researchers now do six cheap Mturk studies. Although this is better than six questionable studies, it does not correct the problem that good research often requires a lot of resources.]

It is simply too costly for researchers to invest in studies that observe real behavior, have high ecological validity, or use longitudinal assessments that take time and may produce a nonsignificant result.

Given the current environmental pressures, a low-quality/high-quantity strategy is more adaptive and will ensure survival (publish or perish) and reproductive success (more graduate students who pursue a low-quality/high-quantity strategy).

[It doesn’t help to become a meta-psychologist. Which smart undergraduate student would risk the prospect of a career by becoming a meta-psychologist?]

A common misperception is that multiple-study articles should be rewarded because they required more effort than a single study. However, the number of studies is often a function of the difficulty of conducting research. It is therefore extremely problematic to
assume that multiple studies are more valuable than single studies.

A single longitudinal study can be costly but can answer questions that multiple cross-sectional studies cannot answer. For example, one of the most important developments in psychological measurement has been the development of the implicit association test
(Greenwald, McGhee, & Schwartz, 1998). A widespread belief about the implicit association test is that it measures implicit attitudes that are more stable than explicit attitudes (Gawronski, 2009), but there exist hardly any longitudinal studies of the stability of implicit attitudes.

[I haven’t checked, but I don’t think this has changed much. Cross-sectional Mturk studies can still produce sexier results than a study that simply estimates the stability of the same measure over time.  Social psychologists tend to be impatient creatures (e.g., Bem).]

A simple way to change the incentive structure in the field is to undermine the false belief that multiple-study articles are better than single-study articles. Often multiple studies are better combined into a single study. For example, one article published four studies that were identical “except that the exposure duration—suboptimal (4 ms)
or optimal (1 s)—of both the initial exposure phase and the subsequent priming phase was orthogonally varied” (Murphy, Zajonc, & Monahan, 1995, p. 589). In other words, the four studies were four conditions of a 2 x 2 design. It would have been more efficient and
informative to combine the information of all studies in a single study. In fact, after reporting each study individually, the authors reported the results of a combined analysis. “When all four studies are entered into a single analysis, a clear pattern emerges” (Murphy et al., 1995, p. 600). Although this article may be the most extreme example of unnecessary multiplicity, other multiple-study articles could also be more informative by reducing the number of studies in a single article.

Apparently, readers of scientific articles are aware of the limited information gain provided by multiple-study articles because citation counts show that multiple-study articles do not have more impact than single-study articles (Haslam et al., 2008). Thus, editors should avoid using number of studies as a criterion for accepting articles.

Allow Publication of Nonsignificant Results

The main point of the M-index is to alert researchers, reviewers, editors, and readers of scientific articles that a series of studies that produced only significant results is neither a cause for celebration  nor strong evidence for the demonstration of a scientific discovery; at least not without a power analysis that shows the results are credible.

Given the typical power of psychological studies, nonsignificant findings should be obtained regularly, and the absence of nonsignificant results raises concerns about the credibility of published research findings.

Most of the time, biases may be benign and simply produce inflated effect sizes, but occasionally, it is possible that biases may have more serious consequences (e.g.,
demonstrate phenomena that do not exist).

A perfectly planned set of five studies, where each study has 80% power, is expected to produce one nonsignificant result. It is not clear why editors sometimes ask researchers to remove studies with nonsignificant results. Science is not a beauty contest, and a
nonsignificant result is not a blemish.
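A small sketch of the expectation stated above, assuming five independent studies:

```python
from scipy.stats import binom

# Five independent studies, each with 80% power: on average one study should fail.
n, power = 5, 0.80
print(n * (1 - power))  # 1.0 expected nonsignificant result
for failures in range(n + 1):
    # probability of exactly this many nonsignificant results
    print(failures, round(binom.pmf(n - failures, n, power), 3))
# All five significant has a probability of only ~0.33.
```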

This wisdom is captured in the Japanese concept of wabi-sabi, in which beautiful objects are designed to have a superficial imperfection as a reminder that nothing is perfect. On the basis of this conception of beauty, a truly perfect set of studies is one that echoes the imperfection of reality by including failed studies or studies that did not produce significant results.

Even if these studies are not reported in great detail, it might be useful to describe failed studies and explain how they informed the development of studies that produced significant results. Another possibility is to honestly report that a study failed to produce a significant result with a sample size that provided 80% power and that the researcher then added more participants to increase power to 95%. This is different from snooping (looking at the data until a significant result has been found), especially if it is stated clearly that the sample size was increased because the effect was not significant with the originally planned sample size and the significance test has been adjusted to take into account that two significance tests were performed.
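The paragraph above describes a two-stage design with an adjusted significance test. The sketch below shows one simple, conservative way to implement such an adjustment (a Bonferroni split of alpha across the two looks); it is my own illustration of the general idea, not the specific procedure recommended here, and all data and numbers are hypothetical.

```python
import numpy as np
from scipy import stats

# Hedged sketch: a two-look one-sample t-test that corrects for testing twice by
# splitting alpha across the two looks (Bonferroni). Group-sequential boundaries
# (e.g., Pocock) would be less conservative.
def two_look_test(stage1, stage2, alpha=0.05):
    alpha_per_look = alpha / 2
    p1 = stats.ttest_1samp(stage1, 0.0).pvalue
    if p1 < alpha_per_look:
        return "significant at first look", p1
    p2 = stats.ttest_1samp(np.concatenate([stage1, stage2]), 0.0).pvalue
    return ("significant at second look" if p2 < alpha_per_look else "not significant"), p2

# Hypothetical data: an originally planned sample, then added participants.
rng = np.random.default_rng(1)
stage1 = rng.normal(0.3, 1.0, size=100)
stage2 = rng.normal(0.3, 1.0, size=60)
print(two_look_test(stage1, stage2))
```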

The M-index rewards honest reporting of results because reporting of null findings renders the number of significant results more consistent with the total power of the studies. In contrast, a high M-index can undermine the allure of articles that report more significant results than the power of the studies warrants. In this
way, post-hoc power analysis could have the beneficial effect that researchers finally start paying more attention to a priori power.

Limited resources may make it difficult to achieve high total power. When total power is modest, it becomes important to report nonsignificant results. One way to report nonsignificant results would be to limit detailed discussion to successful studies but to
include studies with nonsignificant results in a meta-analysis. For example, Bem (2011) reported a meta-analysis of all studies covered in the article. However, he also mentioned several pilot studies and a smaller study that failed to produce a significant
result. To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results.

This overall effect is functionally equivalent to the test of the hypothesis in a single
study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results.
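One common way to pool significant and nonsignificant studies into a single overall test is Stouffer's weighted z-test. The sketch below uses made-up study values and is only meant to illustrate that the pooled test uses all of the evidence; it is not a method prescribed in this article.

```python
import numpy as np
from scipy.stats import norm

# Stouffer's method: combine study-level z-scores, weighting by the square root of N.
def stouffer(z_scores, sample_sizes):
    w = np.sqrt(np.asarray(sample_sizes, dtype=float))
    z = np.asarray(z_scores, dtype=float)
    z_combined = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    p_combined = 2 * norm.sf(abs(z_combined))  # two-tailed p-value
    return z_combined, p_combined

# Hypothetical set of studies: three significant and two failed studies included.
z_comb, p_comb = stouffer([2.3, 2.1, 2.5, 0.4, -0.2], [100, 100, 120, 80, 50])
print(round(z_comb, 2), round(p_comb, 4))
```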

[Since then, several articles have proposed and given tutorials on mini meta-analyses without citing my article, without clarifying that these meta-analyses are only useful if all evidence is included, and without clarifying that bias tests like the M-Index can reveal whether all relevant evidence was included.]

It is also important that top journals publish failed replication studies. The reason is that top journals are partially responsible for the contribution of questionable research practices to published research findings. These journals look for novel and groundbreaking studies that will garner many citations to solidify their position as top journals. As everywhere else (e.g., investing), the higher payoff comes with a higher risk. In this case, the risk is publishing false results. Moreover, the incentives for researchers to get published in top journals or get tenure at Ivy League universities increase the probability that questionable research practices contribute to articles in the top journals (Ledford, 2010). Stapel faked data to get a publication in Science, not to get a publication in Psychological Reports.

There are positive signs that some journal editors are recognizing their responsibility for publication bias (Dirnagl & Lauritzen, 2010). The medical journal Journal of Cerebral Blood Flow and Metabolism created a section that allows researchers to publish studies with disconfirmatory evidence so that this evidence is published in the same journal. One major advantage of having this section in top journals is that it may change the evaluation criteria of journal editors toward a more careful assessment of Type I error when they accept a manuscript for publication. After all, it would be quite embarrassing to publish numerous articles that erred on the side of discovery if subsequent issues reveal that these discoveries were illusory.

[After some pressure from social media, JPSP did publish failed replications of Bem, and it now has a replication section (online only).  Maybe somebody can dig up some failed replications of glucose studies (I know they exist), or do one more study to publish in JPSP showing that, just like ESP, the glucose effect is a myth.]

It could also reduce the use of questionable research practices by researchers eager to publish in prestigious journals if there were a higher likelihood that the same journal would publish failed replications by independent researchers. It might also motivate more researchers to conduct rigorous replication studies if they can bet against a finding and hope to get a publication in a prestigious journal.

The M-index can be helpful in putting pressure on editors and journals to curb the proliferation of false-positive results because it can be used to evaluate editors and journals in terms of the credibility of the results that are published in these journals.

As everybody knows, the value of a brand rests on trust, and it is easy to destroy this value when consumers lose that trust. Journals that continue to publish incredible results and suppress contradictory replication studies are not going to survive, especially given the fact that the Internet provides an opportunity for authors of repressed replication studies to get their findings out (Spellman, 2012).

[I wrote this in the third revision when I thought the editor would not want to see the manuscript again.]

[I deleted the section where I pick on Ritchie’s failed replications of Bem because three small studies with N = 50 are underpowered and can be dismissed as false negatives. Replication studies should have at least the sample size of the original studies, which was N = 100 for most of Bem’s studies.]

Another solution would be to ignore p values altogether and to focus more on effect sizes and confidence intervals (Cumming & Finch, 2001). Although it is impossible to demonstrate that the true effect size is exactly zero, it is possible to estimate
true effect sizes with very narrow confidence intervals. For example, a sample of N = 1,100 participants would be sufficient to demonstrate that the true effect size of ESP is zero with a narrow confidence interval of plus or minus .05.

If an even more stringent criterion is required to claim a null effect, sample sizes would have to increase further, but there is no theoretical limit to the precision of effect size estimates. No matter whether the focus is on p values or confidence intervals, Cohen’s recommendation that bigger is better, at least for sample sizes, remains true because large samples are needed to obtain narrow confidence intervals (Goodman & Berlin, 1994).
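To show where a figure of roughly this size can come from, here is a rough sketch that treats the effect size as a correlation around r = 0 and uses the Fisher-z approximation; the choice of a 90% confidence interval is my assumption for illustration.

```python
import math
from scipy.stats import norm

# Half-width of the confidence interval around r = 0, Fisher-z approximation.
def ci_half_width(n, confidence=0.90):
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    return z_crit / math.sqrt(n - 3)

print(round(ci_half_width(1100), 3))        # ~0.05 with a 90% interval
print(round(ci_half_width(1100, 0.95), 3))  # ~0.06 with a 95% interval
```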


Changing paradigms is a slow process. It took decades to unsettle the stronghold of behaviorism as the main paradigm in psychology. Despite Cohen’s (1962) important contribution to the field 50 years ago and repeated warnings about the problems of underpowered studies, power analysis remains neglected (Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). I hope the M-index can make a small contribution toward the goal of improving the scientific standards of psychology as a science.

Bem’s (2011) article is not going to be a dagger in the heart of questionable research practices, but it may become the historic marker of a paradigm shift.

There are positive signs in the literature on meta-analysis (Sutton & Higgins, 2008), the search for better statistical methods (Wagenmakers, 2007)*, the call for more open access to data (Schooler, 2011), changes in publication practices of journals (Dirnagl & Lauritzen, 2010), and increasing awareness of the damage caused by questionable research practices (Francis, 2012a, 2012b; John et al., 2012; Kerr, 1998; Simmons et al., 2011) that give reason to be hopeful that a paradigm shift may be underway.

[Another sad story. I did not understand Wagenmakers’s use of Bayesian methods at the time, and I honestly thought this work might make a positive contribution. However, in retrospect I realize that Wagenmakers is more interested in selling his statistical approach at any cost and disregards criticisms of his approach that have become evident in recent years. And, yes, I do understand how the method works and why it will not solve the replication crisis (see commentary by Carlsson et al., 2017, in Psychological Science).]

Even the Stapel debacle (Heatherton, 2010), where a prominent psychologist admitted to faking data, may have a healthy effect on the field.

[Heatherton emailed me, and I thought he was going to congratulate me on my nice article or thank me for citing him, but he was mainly concerned that quoting him in the context of Stapel might give the impression that he committed fraud.]

After all, faking increases Type I error by 100% and is clearly considered unethical. If questionable research practices can increase Type I error by up to 60% (Simmons et al., 2011), it becomes difficult to maintain that these widely used practices are questionable but not unethical.

[I guess I was a bit optimistic here. Apparently, you can hide as many studies as you want, but you cannot change one data point because that is fraud.]

During the reign of a paradigm, it is hard to imagine that things will ever change. However, for most contemporary psychologists, it is also hard to imagine that there was a time when psychology was dominated by animal research and reinforcement schedules. Older psychologists may have learned that the only constant in life is change.

[Again, too optimistic. Apparently, many old social psychologists still believe things will remain the same as they always were.  Insert head in the sand cartoon here.]

I have been fortunate enough to witness historic moments of change such as the fall of the Berlin Wall in 1989 and the end of behaviorism when Skinner gave his last speech at the convention of the American Psychological Association in 1990. In front of a packed auditorium, Skinner compared cognitivism to creationism. There was dead silence, made more audible by a handful of grey-haired members in the audience who applauded.

[Only I didn’t realize that research in 1990 had other problems. Nowadays I still think that Skinner was just another professor with a big ego and some published #me_too allegations to his name, but he was right in his concern that (social) cognitivism is not much more scientific than creationism.]

I can only hope to live long enough to see the time when Cohen’s valuable contribution to psychological science will gain the prominence that it deserves. A better understanding of the need for power will not solve all problems, but it will go a long way toward improving the quality of empirical studies and the credibility of results published in psychological journals. Learning about power not only empowers researchers to conduct studies that can show real effects without the help of questionable research practices but also empowers them to be critical consumers of published research findings.

Knowledge about power is power.


References

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide
to publishing in psychological journals (pp. 3–16). Cambridge, England:
Cambridge University Press. doi:10.1017/CBO9780511807862.002

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality
and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Bonett, D. G. (2009). Meta-analytic interval estimation for standardized
and unstandardized mean differences. Psychological Methods, 14, 225–
238. doi:10.1037/a0016619

Castelvecchi, D. (2011). Has the Higgs been discovered? Physicists gear up
for watershed announcement. Scientific American. Retrieved from http://

Cohen, J. (1962). Statistical power of abnormal–social psychological research:
A review. Journal of Abnormal and Social Psychology, 65,
145–153. doi:10.1037/h0045186

Cohen, J. (1990). Things I have learned (so far). American Psychologist,
45, 1304–1312. doi:10.1037/0003-066X.45.12.1304

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.

Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic
attitudes: Combating automatic prejudice with images of admired and
disliked individuals. Journal of Personality and Social Psychology, 81,
800–814. doi:10.1037/0022-3514.81.5.800

Diener, E. (1998). Editorial. Journal of Personality and Social Psychology,
74, 5–6. doi:10.1037/h0092824

Dirnagl, U., & Lauritzen, M. (2010). Fighting publication bias: Introducing
the Negative Results section. Journal of Cerebral Blood Flow and
Metabolism, 30, 1263–1264. doi:10.1038/jcbfm.2010.51

Dvorak, R. D., & Simons, J. S. (2009). Moderation of resource depletion
in the self-control strength model: Differing effects of two modes of
self-control. Personality and Social Psychology Bulletin, 35, 572–583.

Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power
analysis program. Behavior Research Methods, 28, 1–11. doi:10.3758/

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the
sciences. PLoS One, 5, Article e10068. doi:10.1371/journal.pone

Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical
power analyses using G*Power 3.1: Tests for correlation and regression
analyses. Behavior Research Methods, 41, 1149–1160. doi:10.3758/

Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A
flexible statistical power analysis program for the social, behavioral, and
biomedical sciences. Behavior Research Methods, 39, 175–191. doi:

Fiedler, K. (2011). Voodoo correlations are everywhere—not only in
neuroscience. Perspectives on Psychological Science, 6, 163–171. doi:

Francis, G. (2012a). The same old New Look: Publication bias in a study
of wishful seeing. i-Perception, 3, 176–178. doi:10.1068/i0519ic

Francis, G. (2012b). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9

Fritz, M. S., & MacKinnon, D. P. (2007). Required sample size to detect
the mediated effect. Psychological Science, 18, 233–239. doi:10.1111/

Gailliot, M. T., Baumeister, R. F., DeWall, C. N., Maner, J. K., Plant,
E. A., Tice, D. M., & Schmeichel, B. J. (2007). Self-control relies on
glucose as a limited energy source: Willpower is more than a metaphor.
Journal of Personality and Social Psychology, 92, 325–336. doi:

Gawronski, B. (2009). Ten frequently asked questions about implicit
measures and their frequently supposed, but not entirely correct answers.
Canadian Psychology/Psychologie canadienne, 50, 141–150. doi:

Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence
intervals when planning experiments and the misuse of power when
interpreting results. Annals of Internal Medicine, 121, 200–206.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring
individual differences in implicit cognition: The implicit association test.
Journal of Personality and Social Psychology, 74, 1464–1480. doi:

Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J.,
& Wilson, S. (2008). What makes an article influential? Predicting
impact in social and personality psychology. Scientometrics, 76, 169–
185. doi:10.1007/s11192-007-1892-8

Heatherton, T. (2010). Official SPSP communiqué on the Diederik Stapel
debacle. Retrieved from http://danaleighton.edublogs.org/2011/09/13/

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in
the world? Behavioral and Brain Sciences, 33, 61–83. doi:10.1017/

Ioannidis, J. P. A. (2005). Why most published research findings are false.
PLoS Medicine, 2(8), Article e124. doi:10.1371/journal.pmed.0020124

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an
excess of significant findings. Clinical Trials, 4, 245–253. doi:10.1177/

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence
of questionable research practices with incentives for truth telling.
Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability
of implicit racial evaluations. Social Psychology, 41, 137–146.

Judd, C. M., & Gawronski, B. (2011). Editorial comment. Journal of
Personality and Social Psychology, 100, 406. doi:10.1037/0022789

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known.
Personality and Social Psychology Review, 2, 196–217. doi:10.1207/

Kurzban, R. (2010). Does the brain consume additional glucose during
self-control tasks? Evolutionary Psychology, 8, 244–259.

Ledford, H. (2010, August 17). Harvard probe kept under wraps. Nature,
466, 908–909. doi:10.1038/466908a

Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and problematic?
The rise of the short report in psychological science. Perspectives on Psychological Science, 7, 60–66. doi:10.1177/1745691611427304

Lehrer, J. (2010). The truth wears off. The New Yorker. Retrieved from

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological
research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147

Milloy, J. S. (1995). Science without sense: The risky business of public
health research. Washington, DC: Cato Institute.

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting
an optimal α that minimizes errors in null hypothesis significance tests.
PLoS One, 7(2), Article e32734. doi:10.1371/journal.pone.0032734

Murphy, S. T., Zajonc, R. B., & Monahan, J. L. (1995). Additivity of
nonconscious affect: Combined effects of priming and exposure. Journal
of Personality and Social Psychology, 69, 589–602. doi:10.1037/0022-

Ritchie, S. J., Wiseman, R., & French, C. C. (2012a). Failing the future:
Three unsuccessful attempts to replicate Bem’s “retroactive facilitation
of recall” effect. PLoS One, 7(3), Article e33423. doi:10.1371/

Rossi, J. S. (1990). Statistical power of psychological research: What have
we gained in 20 years? Journal of Consulting and Clinical Psychology,
58, 646–656. doi:10.1037/0022-006X.58.5.646

Royall, R. M. (1986). The effect of sample size on the meaning of
significance tests. American Statistician, 40, 313–315. doi:10.2307/

Schmidt, F. (2010). Detecting and correcting the lies that data tell. Perspectives
on Psychological Science, 5, 233–242. doi:10.1177/

Schooler, J. (2011, February 23). Unpublished results hide the decline
effect. Nature, 470, 437. doi:10.1038/470437a

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power
have an effect on the power of studies? Psychological Bulletin, 105,
309–316. doi:10.1037/0033-2909.105.2.309

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: Undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22,
1359–1366. doi:10.1177/0956797611417632

Smith, E. R. (2012). Editorial. Journal of Personality and Social Psychology,
102, 1–3. doi:10.1037/a0026676

Spellman, B. A. (2012). Introduction to the special section: Data, data,
everywhere . . . especially in my file drawer. Perspectives on Psychological
Science, 7, 58–59. doi:10.1177/1745691611432124

Stat Trek. (2012). Binomial calculator: Online statistical table. Retrieved
from http://stattrek.com/tables/binomial.aspx

Steen, R. G. (2011a). Retractions in the scientific literature: Do authors
deliberately commit research fraud? Journal of Medical Ethics, 37,
113–117. doi:10.1136/jme.2010.038125

Steen, R. G. (2011b). Retractions in the scientific literature: Is the incidence
of research fraud increasing? Journal of Medical Ethics, 37,
249–253. doi:10.1136/jme.2010.040923

Sterling, T. D. (1959). Publication decisions and their possible effects on
inferences drawn from tests of significance— or vice versa. Journal of
the American Statistical Association, 54(285), 30–34. doi:10.2307/

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication
decisions revisited: The effect of the outcome of statistical tests on the
decision to publish and vice-versa. American Statistician, 49, 108–112.

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746

Sutton, A. J., & Higgins, J. P. T. (2008). Recent developments in meta-analysis.
Statistics in Medicine, 27, 625–650. doi:10.1002/sim.2934

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high
correlations in fMRI studies of emotion, personality, and social cognition.
Perspectives on Psychological Science, 4, 274–290. doi:10.1111/

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems
of p values. Psychonomic Bulletin & Review, 14, 779–804. doi:10.3758/

Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J.
(2011). Why psychologists must change the way they analyze their data:
The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100, 426–432. doi:10.1037/a0022790

Wegner, D. M. (1992). The premature demise of the solo experiment.
Personality and Social Psychology Bulletin, 18, 504–508. doi:10.1177/

Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations
reflect low statistical power—Commentary on Vul et al. (2009).
Perspectives on Psychological Science, 4, 294–298. doi:10.1111/j.1745-

Yong, E. (2012, May 16). Bad copy. Nature, 485, 298–300. doi:10.1038/

Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean
differences. Journal of Educational and Behavioral Statistics, 30, 141–
167. doi:10.3102/10769986030002141

Received May 30, 2011
Revision received June 18, 2012
Accepted June 25, 2012
Further Revised February 18, 2018


A critique of Stroebe and Strack’s Article “The Alleged Crisis and the Illusion of Exact Replication”

The article by Stroebe and Strack (2014) [henceforth S&S] illustrates how experimental social psychologists responded to replication failures at the beginning of the replicability revolution.  The response is a classic example of repressive coping: Houston, we do not have a problem. Even in 2014, problems with the way experimental social psychologists had conducted research for decades were obvious (Bem, 2011; Wagenmakers et al., 2011; John et al., 2012; Francis, 2012; Schimmack, 2012; Hasher & Wagenmakers, 2012).  S&S’s article is an attempt to dismiss these concerns as misunderstandings and empirically unsupported criticism.

“In contrast to the prevalent sentiment, we will argue that the claim of a replicability crisis is greatly exaggerated” (p. 59).  

Although the article was well received by prominent experimental social psychologists (see citations in appendix), future events proved S&S wrong and vindicated critics of research methods in experimental social psychology. Only a year later, the Open Science Collaboration (2015) reported that only 25% of studies in social psychology could be replicated successfully.  A statistical analysis of focal hypothesis tests in social psychology suggests that roughly 50% of original studies could be replicated successfully if these studies were replicated exactly (Motyl et al., 2017).  Ironically, one of S&S’s points is that exact replication studies are impossible. As a result, the 50% estimate is an optimistic estimate of the success rate for actual replication studies, suggesting that the actual replicability of published results in social psychology is less than 50%.

Thus, even if S&S had reasons to be skeptical about the extent of the replicability crisis in experimental social psychology, it is now clear that experimental social psychology has a serious replication problem. Many published findings in social psychology textbooks may not replicate and many theoretical claims in social psychology rest on shaky empirical foundations.

What explains the replication problem in experimental social psychology?  The main reason for replication failures is that social psychology journals mostly published significant results.  The selective publishing of significant results is called publication bias. Sterling pointed out that publication bias in psychology is rampant: he found that psychology journals publish over 90% significant results (Sterling, 1959; Sterling et al., 1995).  Given new estimates that the actual success rate of studies in experimental social psychology is less than 50%, only publication bias can explain why over 90% of published results confirm theoretical predictions.

It is not difficult to see that reporting only studies that confirm predictions undermines the purpose of empirical tests of theoretical predictions.  If studies that do not confirm predictions are hidden, it is impossible to obtain empirical evidence that a theory is wrong.  In short, for decades experimental social psychologists have engaged in a charade that pretends that theories are empirically tested, but publication bias ensured that theories would never fail.  This is rather similar to Volkswagen’s emission tests that were rigged to pass because emissions were never subjected to a real test.

In 2014, there were ample warning signs that publication bias and other dubious practices inflated the success rate in social psychology journals.  However, S&S claim that (a) there is no evidence for the use of questionable research practices and (b) that it is unclear which practices are questionable or not.

“Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology. In fact, the discipline still needs to reach an agreement about the conditions under which these practices are unacceptable” (p. 60).

Scientists like to hedge their statements so that they are immune to criticism. S&S may argue that the evidence in 2014 was not “solid” and surely there was and still is no agreement about good research practices. However, this is irrelevant. What is important is that success rates in social psychology journals were and still are inflated by suppressing disconfirming evidence and biasing empirical tests of theories in favor of positive outcomes.

Although S&S’s main claims are not based on empirical evidence, it is instructive to examine how they tried to shield published results and established theories from the harsh light of open replication studies that report results without selection for significance and subject social psychological theories to real empirical tests for the first time.

Failed Replication of Between-Subject Priming Studies

S&S discuss failed replications of two famous priming studies in social psychology: Bargh’s elderly priming study and Dijksterhuis’s professor priming studies.  Both seminal articles reported several successful tests of the prediction that a subtle priming manipulation would influence behavior without participants even noticing the priming effect.  In 2012, Doyen et al. failed to replicate elderly priming. Shanks et al. (2013) failed to replicate professor priming effects, and more recently a large registered replication report also provided no evidence for professor priming.  For naïve readers it is surprising that the original studies had a 100% success rate and the replication studies had a 0% success rate.  However, S&S are not surprised at all.

“as in most sciences, empirical findings cannot always be replicated” (p. 60). 

Apparently, S&S know something that naïve readers do not know.  The difference between naïve readers and experts in the field is that experts have access to unpublished information about failed replications in their own labs and in the labs of their colleagues. Only they know how hard it sometimes was to get the successful outcomes that were published. With the added advantage of insider knowledge, it makes perfect sense to expect replication failures, although maybe not a 0% success rate.

The problem is that S&S give the impression that replication failures are to be expected, but that this expectation cannot be based on the objective scientific record, which hardly ever reports results that contradict theoretical predictions.  Replication failures occur all the time, but they remain unpublished. Doyen et al.’s and Shanks et al.’s articles only violated the code to publish only supportive evidence.

Kahneman’s Train Wreck Letter

S&S also comment on Kahneman’s letter to Bargh that compared priming research to a train wreck.  In response S&S claim that

“priming is an entirely undisputed method that is widely used to test hypotheses about associative memory (e.g., Higgins, Rholes, & Jones, 1977; Meyer & Schvaneveldt, 1971; Tulving & Schacter, 1990).” (p. 60).  

This argument does not stand the test of time.  Since S&S published their article, researchers have distinguished more clearly between highly replicable priming effects in cognitive psychology with repeated measures and within-subject designs and difficult-to-replicate between-subject social priming studies with subtle priming manipulations and a single outcome measure (BS social priming).  With regard to BS social priming, it is unclear which of these effects can be replicated, and leading social psychologists have been reluctant to demonstrate the replicability of their famous studies by conducting self-replications, as they were encouraged to do in Kahneman’s letter.

S&S also point to empirical evidence for robust priming effects.

“A meta-analysis of studies that investigated how trait primes influence impression formation identified 47 articles based on 6,833 participants and found overall effects to be statistically highly significant (DeCoster & Claypool, 2004).” (p. 60). 

The problem with this evidence is that this meta-analysis did not take publication bias into account; in fact, it does not even mention publication bias as a possible problem.  A meta-analysis of studies that were selected for significance is itself biased by selection for significance.

Several years after Kahneman’s letter, it is widely agreed that past research on social priming is a train wreck.  Kahneman published a popular book that celebrated social priming effects as a major scientific discovery in psychology.  Nowadays, he agrees with critics that the existing evidence is not credible.  It is also noteworthy that none of the researchers in this area have followed Kahneman’s advice to replicate their own findings to show the world that these effects are real.

It is all a big misunderstanding

S&S suggest that “the claim of a replicability crisis in psychology is based on a major misunderstanding.” (p. 60). 

Apparently, lay people, trained psychologists, and a Nobel laureate are mistaken in their interpretation of replication failures.  S&S suggest that failed replications are unimportant.

“the myopic focus on “exact” replications neglects basic epistemological principles” (p. 60).  

To make their argument, they introduce the notion of exact replications and suggest that exact replication studies are uninformative.

 “a finding may be eminently reproducible and yet constitute a poor test of a theory.” (p. 60).

The problem with this line of argument is that we are supposed to assume that a finding is eminently reproducible, which presumably means it has been successfully replicated many times.  It seems sensible that further studies of gender differences in height are unnecessary to convince us that there is a gender difference in height. However, results in social psychology are not like gender differences in height.  By S&S’s own account earlier, “empirical findings cannot always be replicated” (p. 60). And if journals only publish significant results, it remains unknown which results are eminently reproducible and which results are not.  S&S ignore publication bias and pretend that the published record suggests that all findings in social psychology are eminently reproducible. Apparently, they would suggest that even Bem’s finding that people have supernatural abilities is eminently reproducible.  These days, few social psychologists are willing to endorse this naïve interpretation of the scientific record as a credible body of empirical facts.

Exact Replication Studies are Meaningful if they are Successful

Ironically, S&S next suggest that exact replication studies can be useful.

Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework. Thus, the fact that priming individuals with the stereotype of the elderly resulted in a reduction of walking speed was a finding that was unexpected. Furthermore, even though it was consistent with existing theoretical knowledge, there was no consensus about the processes that mediate the impact of the prime on walking speed. It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper.

Similarly, Dijksterhuis and van Knippenberg (1998) conducted four studies in which they replicated the priming effects. Three of these studies contained conditions that were exact replications.

Because it is standard practice in publications of new effects, especially of effects that are surprising, to publish one or two exact replications, it is clearly more conducive to the advancement of psychological knowledge to conduct conceptual replications rather than attempting further duplications of the original study.

Given these citations, it is problematic that S&S’s article is often cited to claim that exact replications are impossible or unnecessary.  The argument that S&S are making here is rather different.  They are suggesting that original articles already provide sufficient evidence that results in social psychology are eminently reproducible because original articles report multiple studies and some of these studies are often exact replication studies.  At face value, S&S have a point.  An honest series of statistically significant results makes it practically impossible that an effect is a false positive result (Schimmack, 2012).  The problem is that multiple-study articles are not honest reports of all replication attempts.  Francis (2014) found that at least 80% of multiple-study articles showed statistical evidence of questionable research practices.  Given the pervasive influence of selection for significance, exact replication studies in original articles provide no information about the replicability of these results.

What made the failed replications by Doyen et al. and Shanks et al. so powerful was that these studies were the first real empirical tests of BS social priming effects because the authors were willing to report successes or failures.  The problem for social psychology is that many textbook findings that were obtained with selection for significance cannot be reproduced in honest empirical tests of the predicted effects.  This means that the original effects were either dramatically inflated or may not exist at all.

Replication Studies are a Waste of Resources

S&S want readers to believe that replication studies are a waste of resources.

“Given that both research time and money are scarce resources, the large scale attempts at duplicating previous studies seem to us misguided” (p. 61).

This statement sounds a bit like a plea to spare social psychology from the embarrassment of actual empirical tests that reveal the true replicability of textbook findings. After all, according to S&S it is impossible to duplicate original studies (i.e., to conduct exact replication studies) because replication studies always differ in some way from the original studies and may not reproduce the original results.  So, none of the failed replication studies is an exact replication.  Doyen et al. replicated in Belgium Bargh’s study that was conducted in New York City, and Shanks et al. replicated in the United States Dijksterhuis’s studies from the Netherlands.  The finding that the replication studies could not reproduce the original results does not imply that the original findings were false positives, but it does imply that these findings may be unique to some unspecified specifics of the original studies.  This is noteworthy when original results are used in textbooks as evidence for general theories and not as historical accounts of what happened in one specific socio-cultural context during a specific historic period. As social situations and human behavior are never exact replications of the past, social psychological results need to be permanently replicated, and doing so is not a waste of resources.  Suggesting that replication is a waste of resources is like suggesting that measuring GDP or unemployment every year is a waste of resources because we can just use last year’s numbers.

As S&S ignore publication bias and selection for significance, they also ignore that publication bias leads to a massive waste of resources.  First, running empirical tests of theories that are not reported is a waste of resources.  Second, publishing only significant results is also a waste of resources because researchers design new studies based on the published record. When the published record is biased, many new studies will fail, just as airplanes designed on the basis of flawed science would drop from the sky.  Thus, a biased literature creates a massive waste of resources.

Ultimately, a science that publishes only significant results wastes all resources because the outcome of the published studies is a foregone conclusion: the prediction was supported, p < .05. Social psychologists might as well publish purely theoretical articles, just as philosophers in the old days used “thought experiments” to support their claims. An empirical science is only a real science if theoretical predictions are subjected to tests that can fail.  By this simple criterion, experimental social psychology is not (yet) a science.

Should Psychologists Conduct Exact Replications or Conceptual Replications?

Stroebe and Strack next cite Pashler and Harris (2012) to claim that critics of experimental social psychology have dismissed the value of so-called conceptual replications.

“The main criticism of conceptual replications is that they are less informative than exact replications (e.g., Pashler & Harris, 2012).”

Before I examine S&S’s counterargument, it is important to realize that S&S misrepresented, and maybe misunderstood, Pashler and Harris’s main point. Here is the relevant quote from Pashler and Harris’s article.

We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling “pathological science” cases that embarrassed the natural sciences at several points in the 20th century (e.g., Polywater; Rousseau & Porto, 1970; and cold fusion; Taubes, 1993).

The problem for S&S is that they cannot address the problem of publication bias and therefore carefully avoid talking about it.  As a result, they misrepresent Pashler and Harris’s critique of conceptual replications in combination with publication bias as a criticism of conceptual replication studies, which is absurd and not what Pashler and Harris intended to say or actually said. The following quote from their article makes this crystal clear.

However, what kept faith in cold fusion alive for some time (at least in the eyes of some onlookers) was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications). This suggests that one important hint that a controversial finding is pathological may arise when defenders of a controversial effect disavow the initial methods used to obtain an effect and rest their case entirely upon later studies conducted using other methods. Of course, productive research into real phenomena often yields more refined and better ways of producing effects. But what should inspire doubt is any situation where defenders present a phenomenon as a “moving target” in terms of where and how it is elicited (cf. Langmuir, 1953/1989). When this happens, it would seem sensible to ask, “If the finding is real and yet the methods used by the original investigators are not reproducible, then how were these investigators able to uncover a valid phenomenon with methods that do not work?” Again, the unavoidable conclusion is that a sound assessment of a controversial phenomenon should focus first and foremost on direct replications of the original reports and not on novel variations, each of which may introduce independent ambiguities.

I am confident that unbiased readers will recognize that Pashler and Harris did not suggest that conceptual replication studies are bad.  Their main point is that a few successful conceptual replication studies can be used to keep theories alive in the face of a string of many replication failures. The problem is not that researchers conduct successful conceptual replication studies. The problem is dismissing or outright hiding of disconfirming evidence in replication studies. S&S misconstrue Pashler and Harris’s claim to avoid addressing this real problem of ignoring and suppressing failed studies to support an attractive but false theory.

The illusion of exact replications.

S&S’s next argument is that replication studies are never exact.

If one accepts that the true purpose of replications is a (repeated) test of a theoretical hypothesis rather than an assessment of the reliability of a particular experimental procedure, a major problem of exact replications becomes apparent: Repeating a specific operationalization of a theoretical construct at a different point in time and/or with a different population of participants might not reflect the same theoretical construct that the same procedure operationalized in the original study.

The most important word in this quote is “might.”  Ebbinghaus’s memory curve MIGHT not replicate today because he was his own subject.  Bargh’s elderly priming study MIGHT not work today because Florida is no longer associated with the elderly, and Dijksterhuis’s priming study MIGHT no longer work because students no longer think that professors are smart or that hooligans are dumb.

Just because there is no certainty in inductive inferences doesn’t mean we can just dismiss replication failures because something MIGHT have changed.  It is also possible that the published results MIGHT be false positives because significant results were obtained by chance, with QRPs, or outright fraud.  Most people think that outright fraud is unlikely, but the Stapel debacle showed that we cannot rule it out.  So, we can argue forever about hypothetical reasons why a particular study was successful or a failure. These arguments are futile and have nothing to do with scientific arguments and objective evaluation of facts.

This means that every study, whether it is a groundbreaking success or a replication failure, needs to be evaluated in terms of the objective scientific facts. There is no blanket immunity for seminal studies that protects them from disconfirming evidence.  No study is an exact replication of another study. That is a truism, and S&S’s article is often cited for this simple fact.  It is as true as it is irrelevant for understanding the replication crisis in social psychology.

Exact Replications Are Often Uninformative

S&S contradict themselves in their use of the term exact replication.  First, exact replications are said to be impossible; then, they are said to be uninformative.  I agree with S&S that exact replication studies are impossible. So, we can simply drop the term “exact” and examine why S&S believe that some replication studies are uninformative.

First, they give an elaborate, long, and hypothetical explanation for Doyen et al.’s failure to replicate Bargh’s pair of elderly priming studies. After considering some possible explanations, they conclude:

It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996).  

Once more the realm of hypothetical conjectures has to rescue seminal findings. Just as it is possible that S&S are right, it is also possible that Bargh faked his data. To be sure, I do not believe that he faked his data, and I apologized for a Facebook comment that gave the wrong impression that I did. I am only raising this possibility here to make the point that everything is possible. Maybe Bargh just got lucky.  The probability of this is 1 out of 1,600 attempts: the probability of getting the predicted effect twice by chance with alpha = .05 two-tailed (!) is .025^2 = .000625. Not very likely, but also not impossible.

No matter what the reason for the discrepancy between Bargh and Doyen’s findings is, the example does not support S&S’s claim that replication studies are uninformative. The failed replication raised concerns about the robustness of BS social priming studies and stimulated further investigation of the robustness of social priming effects. In the short span of six years, the scientific consensus about these effects has shifted dramatically, and the first publication of a failed replication is an important event in the history of social psychology.

S&S’s critique of Shanks et al.’s replication studies is even weaker.  First, they have to admit that the professor prime probably still primes intelligence more than the soccer-hooligan prime does. To rescue the original finding, S&S propose

“the priming manipulation might have failed to increase the cognitive representation of the concept “intelligence.” 

S&S also think that

another LIKELY reason for their failure could be their selection of knowledge items.

Meanwhile, a registered replication report with a design that was approved by Dijksterhuis failed to replicate the effect.  Although it is possible to come up with further reasons for these failures, real scientific creativity is revealed in creating experimental paradigms that produce replicable results, not in generating post-hoc explanations for replication failures.

Ironically, S&S even agree with my criticism of their argument.

 “To be sure, these possibilities are speculative”  (p. 62). 

In contrast, S&S fail to consider the possibility that published significant results are false positives, even though there is actual evidence for publication bias. The strong bias against published failures may be rooted in a long history of dismissing unpublished failures that social psychologists routinely encounter in their own laboratories.  To avoid the self-awareness that hiding disconfirming evidence is unscientific, social psychologists made themselves believe that minute changes in experimental procedures can ruin a study (Stapel).  Unfortunately, a science that dismisses replication failures as procedural hiccups is fated to fail because it removes the mechanism that makes science self-correcting.

Failed Replications are Uninformative

S&S next suggest that “nonreplications are uninformative unless one can demonstrate that the theoretically relevant conditions were met” (p. 62).

This reverses the burden of proof.  Original researchers pride themselves on innovative ideas and groundbreaking discoveries.  Like famous rock stars, they are often not the best musicians, nor is it impossible for other musicians to play their songs. They get rewarded because they came up with something original. Take the Implicit Association Test as an example. The idea to use cognitive switching tasks to measure attitudes was original and Greenwald deserves recognition for inventing this task. The IAT did not revolutionize attitude research because only Tony Greenwald could get the effects. It did so because everybody, including my undergraduate students, could replicate the basic IAT effect.

However, let’s assume that the IAT effect could not have been replicated. Is it really the job of researchers who merely duplicated a study to figure out why it did not work and develop a theory under which circumstances an effect may occur or not?  I do not think so. Failed replications are informative even if there is no immediate explanation why the replication failed.  As Pashler and Harris’s cold fusion example shows there may not even be a satisfactory explanation after decades of research. Most probably, cold fusion never really worked and the successful outcome of the original study was a fluke or a problem of the experimental design.  Nevertheless, it was important to demonstrate that the original cold fusion study could not be replicated.  To ask for an explanation why replication studies fail is simply a way to make replication studies unattractive and to dismiss the results of studies that fail to produce the desired outcome.

Finally, S&S ignore that there is a simple explanation for replication failures in experimental social psychology: publication bias.  If original studies have low statistical power (e.g., Bargh’s studies with N = 30) to detect small effects, only vastly inflated effect sizes reach significance.  An open replication study without inflated effect sizes is unlikely to produce a successful outcome.  Statistical analyses of original studies show that this explanation accounts for a large proportion of replication failures.  Thus, publication bias provides a parsimonious explanation for replication failures.
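The inflation argument can be made concrete with a small sketch. Assuming N = 30 means two groups of 15 (my assumption for illustration), only observed effects of roughly d = .75 or larger can reach p < .05, so any significant result from such a study greatly overestimates a small true effect.

```python
from math import sqrt
from scipy.stats import t

# Smallest observed Cohen's d that reaches p < .05 (two-tailed) in a
# two-group design with 15 participants per group (N = 30 total).
n1 = n2 = 15
t_crit = t.ppf(0.975, n1 + n2 - 2)
d_min = t_crit * sqrt(1 / n1 + 1 / n2)
print(round(d_min, 2))  # ~0.75
```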

Conceptual Replication Studies are Informative

S&S cite Schmidt (2009) to argue that conceptual replication studies are informative.

With every difference that is introduced the confirmatory power of the replication increases, because we have shown that the phenomenon does not hinge on a particular operationalization but “generalizes to a larger area of application” (p. 93).

S&S continue

“An even more effective strategy to increase our trust in a theory is to test it using completely different manipulations.”

This is of course true, as long as conceptual replication studies are successful. However, it is not clear why a conceptual replication study that tries a completely different manipulation for the first time should be expected to succeed.  As I pointed out in my 2012 article, reading multiple-study articles that contain only successful conceptual replication studies is a bit like watching a magic show.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005). The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect.

I don’t know whether S&S are impressed by Bem’s article with 9 conceptual replication studies that successfully demonstrated supernatural abilities.  According to their line of argument, they should be.  However, even most social psychologists found it impossible to accept that time-reversed subliminal priming works. Unfortunately, this also means that successful conceptual replication studies are meaningless if only successful results are published.  Once more, S&S cannot address this problem because they ignore the simple fact that selection for significance undermines the purpose of empirical research, which is to test theoretical predictions.

Exact Replications Contribute Little to Scientific Knowledge

Without providing much evidence for their claims, S&S conclude

one reason why exact replications are not very interesting is that they contribute little to scientific knowledge.

Ironically, one year later Science published 100 replication studies with the sole goal of estimating the replicability of psychology, with a focus on social psychology.  That article has already been cited 640 times, while S&S’s criticism of replication studies has been cited (only) 114 times.

Although the article did nothing more than report the outcomes of replication studies, it made a tremendous empirical contribution to psychology because it reported results of studies without the filter of publication bias.  Suddenly the success rate plummeted from over 90% to 37% overall, and to 25% for social psychology.  While S&S could claim in 2014 that “Thus far, however, no solid data exist on the prevalence of such [questionable] research practices in either social or any other area of psychology,” the reproducibility project revealed that these practices dramatically inflated the percentage of successful studies reported in psychology journals.

The article has been celebrated by scientists in many disciplines as a heroic effort and a sign that psychologists are trying to improve their research practices. S&S may disagree, but I consider the reproducibility project a big contribution to scientific knowledge.

Why null findings are not always that informative

To fully appreciate the absurdity of S&S’s argument, I let them speak for themselves.

One reason is that not all null findings are interesting.  For example, just before his downfall, Stapel published an article on how disordered contexts promote stereotyping and discrimination. In this publication, Stapel and Lindenberg (2011) reported findings showing that litter or a broken-up sidewalk and an abandoned bicycle can increase social discrimination. These findings, which were later retracted, were judged to be sufficiently important and interesting to be published in the highly prestigious journal Science. Let us assume that Stapel had actually conducted the research described in this paper and failed to support his hypothesis. Such a null finding would have hardly merited publication in the Journal of Articles in Support of the Null Hypothesis. It would have been uninteresting for the same reason that made the positive result interesting, namely, that (a) nobody expected a relationship between disordered environments and prejudice and (b) there was no previous empirical evidence for such a relationship. Similarly, if Bargh et al. (1996) had found that priming participants with the stereotype of the elderly did not influence walking speed or if Dijksterhuis and van Knippenberg (1998) had reported that priming participants with “professor” did not improve their performance on a task of trivial pursuit, nobody would have been interested in their findings.

Notably, all of the examples are null findings in original studies. Thus, they have absolutely no relevance for the importance of replication studies. As Stroebe and Strack noted earlier:

“Thus, null findings are interesting only if they contradict a central hypothesis derived from an established theory and/or are discrepant with a series of earlier studies.” (p. 65)

Bem (2011) reported 9 significant results to support unbelievable claims about supernatural abilities.  However, several failed replication studies allowed psychologists to dismiss these findings and to ignore claims about time-reversed priming effects. So, while not all null results are important, null results in replication studies are important because they can correct false positive results in original articles. Without this correction mechanism, science loses its ability to correct itself.

Failed Replications Do Not Falsify Theories

S&S state that failed replications do not falsify theories

“The nonreplications published by Shanks and colleagues (2013) cannot be taken as a falsification of that theory, because their study does not explain why previous research was successful in replicating the original findings of Dijksterhuis and van Knippenberg (1998).” (p. 64)

I am unaware of any theory in psychology that has been falsified. The reason is not that failed replication studies are uninformative; the reason is that, until recently, theories were protected by hiding failed replication studies. Only in recent years have social psychologists started to contemplate the possibility that some theories in social psychology might be false.  The most prominent example is ego-depletion theory, one of the first theories to be put under the microscope of open science without the protection of questionable research practices. While ego-depletion theory is not entirely dead, few people still believe in the simple version that 20 Stroop trials deplete individuals’ willpower.  Falsification is hard, but falsification without disconfirming evidence is impossible.

Inconsistent Evidence

S&S argue that replication failures have to be evaluated in the context of replication successes.

Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis. 

Earlier S&S wrote

in social psychology, as in most sciences, empirical findings cannot always be replicated (this was one of the reasons for the development of meta-analytic methods). 

Indeed. Unless studies have very high statistical power, inconsistent results are inevitable, which is one reason why publishing only significant results is a sign of low credibility (Schimmack, 2012). Meta-analysis is the only way to make sense of these inconsistent findings.  However, it is well known that publication bias renders meta-analytic results meaningless (e.g., meta-analyses show very strong evidence for supernatural abilities).  Thus, it is important that all tests of a theoretical prediction are reported in order to produce meaningful meta-analyses.  If social psychologists took S&S seriously and continued to suppress non-significant results because they are deemed uninformative, meta-analyses would continue to provide biased results that support even false theories.
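For readers who want to see what such a meta-analysis looks like in practice, here is a minimal random-effects (DerSimonian-Laird) sketch in Python; the effect sizes and standard errors are made-up numbers for illustration. The point in the text stands: if the nonsignificant rows are deleted before this step, the pooled estimate is biased upward, no matter how sophisticated the model is.

```python
# Minimal DerSimonian-Laird random-effects meta-analysis of a set of
# hypothetical, inconsistent effect size estimates (made-up numbers).
import numpy as np

d  = np.array([0.45, 0.10, -0.05, 0.30, 0.02, 0.55])   # hypothetical study effect sizes
se = np.array([0.20, 0.18,  0.22, 0.25, 0.15, 0.21])   # hypothetical standard errors

w = 1 / se**2                                  # fixed-effect (inverse-variance) weights
d_fe = np.sum(w * d) / np.sum(w)

k = len(d)                                     # DerSimonian-Laird estimate of tau^2
Q = np.sum(w * (d - d_fe) ** 2)
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / C)

w_re = 1 / (se**2 + tau2)                      # random-effects weights
d_re = np.sum(w_re * d) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
print(f"pooled d = {d_re:.2f} +/- {1.96 * se_re:.2f} (tau^2 = {tau2:.3f})")
```

Rerunning the same code after dropping the nonsignificant entries from d and se shows immediately how selective reporting shifts the pooled estimate.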

Failed Replications are Uninformative II

Sorry that this is getting really long, but S&S keep making the same arguments, and the editor of their article apparently did not ask them to shorten it. Here they repeat the argument that failed replications are uninformative.

One reason why null findings are not very interesting is because they tell us only that a finding could not be replicated but not why this was the case. This conflict can be resolved only if researchers develop a theory that could explain the inconsistency in findings.  

A related claim is that failed replications never demonstrate that original findings were false, because the inconsistency is always attributed to some third variable: a hidden moderator.

“Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted.” (p. 64)

These statements reveal a fundamental misunderstanding of statistical inference.  A significant result never proves that the null hypothesis is false.  The inference that a real effect rather than sampling error caused the observed result can be a mistake. This mistake is called a false positive or a type-I error. S&S seem to believe that type-I errors do not exist; accordingly, Bem’s significant results show real supernatural abilities.  If that were the case, it would be meaningless to report statistical significance tests. The only possible error would be a false negative, or type-II error: the theory makes the correct prediction, but a study fails to produce a significant result. And if theoretical predictions are always correct, it is also not necessary to subject theories to empirical tests, because these tests either correctly show that a prediction was confirmed or falsely fail to confirm a prediction.
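A few lines of simulation make the type-I error point tangible (the sample size is an arbitrary assumption): even when there is no effect at all, about 5% of studies come out significant, and with selective publication these are exactly the results that enter the literature.

```python
# Minimal sketch: false positives occur even when the null hypothesis is true.
# Illustrative assumption: n = 50 per group, no true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sims, n = 10_000, 50
p = np.array([stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
              for _ in range(sims)])
print(f"significant results under the null: {np.mean(p < .05):.3f}")   # close to .05
```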

S&S’s belief in published results has a religious quality.  Apparently we know nothing about the world, but once a significant result is published in a social psychology journal, ideally JPSP, it becomes a holy truth that defies any evidence that non-believers may produce under the misguided assumption that further inquiry is necessary. Elderly priming is real, amen.

More Confusing Nonsense

At some point, I was no longer surprised by S&S’s claims, but I did start to wonder about the reviewers and editors who allowed this manuscript to be published, apparently with little or no editing.  Why would a self-respecting journal publish a sentence like this?

As a consequence, the mere coexistence of exact replications that are both successful and unsuccessful is likely to leave researchers helpless about what to conclude from such a pattern of outcomes.

Didn’t S&S claim that exact replication studies do not exist? Didn’t they tell readers that every inconsistent finding has to be interpreted as an interaction effect?  And where do they see inconsistent results if journals never publish non-significant results?

Aside from these inconsistencies, inconsistent results do not leave researchers in a state of helpless paralysis. As S&S themselves suggested, researchers can conduct a meta-analysis. Are S&S suggesting that we need to spare researchers from inconsistent results to protect them from helpless confusion? Is this their justification for publishing only significant results?

Even Massive Replication Failures in Registered Replication Reports are Uninformative

In response to the replication crisis, some psychologists started to invest time and resources in large-scale replication projects known as Many Labs studies or registered replication reports.  A single study is replicated in many labs.  The combined sample size across labs gives these studies high precision in estimating the average effect size and even makes it possible to demonstrate that an effect size is close to zero, which suggests that the null hypothesis may be true.  These projects have failed to find evidence for classic social psychology findings, including Strack’s facial feedback studies. S&S suggest that even these results are uninformative.

Conducting exact replications in a registered and coordinated fashion by different laboratories does not remove the described shortcomings. This is also the case if exact replications are proposed as a means to estimate the “true size” of an effect. As the size of an experimental effect always depends on the specific error variance that is generated by the context, exact replications can assess only the efficiency of an intervention in a given situation but not the generalized strength of a causal influence.

Their argument does not make sense to me.  First, it is not clear what S&S mean by “the size of an experimental effect always depends on the specific error variance.”  Neither unstandardized nor standardized effect sizes depend on sampling error variance. This is easy to see because sampling error variance depends on the sample size, whereas effect sizes do not.  So it makes no sense to claim that effect sizes depend on error variance.
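As a quick illustration of the distinction I am drawing, here is a small simulation sketch (the true effect of d = .4 is an assumed number): the average effect size estimate is essentially the same for small and large samples, whereas the sampling error around that estimate shrinks as the sample grows.

```python
# Minimal sketch: the expected effect size estimate does not change with sample
# size, but the sampling error around it does. Assumed true effect: d = .4.
import numpy as np

rng = np.random.default_rng(3)

def simulate_d(n, true_d=0.4, sims=20_000):
    x = rng.normal(true_d, 1, (sims, n))
    y = rng.normal(0.0, 1, (sims, n))
    sd = np.sqrt((x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2)
    return (x.mean(axis=1) - y.mean(axis=1)) / sd

for n in (20, 200):
    d_hat = simulate_d(n)
    print(f"n = {n:3d} per group: mean d-hat = {d_hat.mean():.2f}, sampling SD = {d_hat.std():.2f}")
# the mean stays around .40-.41 in both cases; the sampling SD drops from roughly .33 to roughly .10
```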

Second, it is not clear what S&S mean by error variance “that is generated by the context.”  I simply cannot address this argument because the notion of context-generated error variance is not a statistical construct, and S&S do not explain what they mean by it.

Finally, it is not clear why meta-analysis of replication studies cannot be used to estimate the “generalized strength of a causal influence,” which I take to mean an effect size.  Earlier, S&S alluded to meta-analysis as a way to resolve inconsistencies in the literature, but now they seem to suggest that meta-analysis cannot be used.
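To give a sense of the precision that is at stake here, the following rough sketch compares the confidence interval of a single small study with that of a pooled registered replication report. The sample sizes (20 per cell for the original study, 20 labs of 100 participants for the replication project) are assumptions for illustration, and the formula is the standard large-sample approximation for the sampling variance of a standardized mean difference.

```python
# Rough sketch: precision of a single small study versus a pooled multi-lab
# replication project. Sample sizes are assumed for illustration.
import math

def se_d(n1, n2, d=0.0):
    # large-sample approximation for the standard error of Cohen's d
    return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

# a typical original study: N = 40 (20 per cell)
print(f"single study, 95% CI half-width:   +/- {1.96 * se_d(20, 20):.2f}")           # about .62

# a registered replication report: 20 labs x 100 participants (50 per cell)
N = 20 * 100
print(f"pooled project, 95% CI half-width: +/- {1.96 * se_d(N // 2, N // 2):.2f}")   # about .09
```

With a confidence interval of roughly plus or minus .09, an estimate close to zero is hard to reconcile with the effect sizes reported in the original literature.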

If S&S really want to imply that meta-analyses are useless, it is unclear how they would make sense of inconsistent findings.  The only viable solution seems to be to avoid inconsistencies by suppressing non-significant results in order to give the impression that every theory in social psychology is correct because theoretical predictions are always confirmed.  Although this sounds absurd, it is the inevitable logical consequence of S&S’s claim that non-significant results are uninformative, even if over 20 labs independently and in combination fail to provide evidence for a theoretically predicted effect.

The Great History of Social Psychological Theories

S&S next present über-social psychologist Leon Festinger as an example of why theories are good and failed studies are bad.  The argument is that good theories make correct predictions, even if bad studies fail to show the effect.

“Although their theoretical analysis was valid, it took a decade before researchers were able to reliably replicate the findings reported by Festinger and Carlsmith (1959).”

As a former student, I was surprised by this statement because I had learned that Festinger’s dissonance theory was challenged by Bem’s self-perception theory and that social psychologists had been unable to resolve which of the two theories was correct.  Couldn’t some of these replication failures be explained by the fact that Festinger’s theory sometimes made the wrong prediction?

It is also not surprising that researchers had a hard time replicating Festinger and Carlsmith’s original findings.  The reason is that the original study had low statistical power, and replication failures are expected even if the theory is correct. Finally, I have been around social psychologists long enough to have heard some rumors about Festinger and Carlsmith’s original studies.  Accordingly, some of Festinger’s graduate students also tried and failed to get the effect. Carlsmith was the ‘lucky’ one who got the effect, in one study with p < .05, and he became co-author of one of the most cited articles in the history of social psychology. Naturally, Festinger did not publish the failed studies of his other graduate students, because surely they must have done something wrong. As I said, this is a rumor.  Even if the rumor is not true and Carlsmith got lucky on the first try, luck played a role, and nobody should expect a study to replicate simply because a single published study reported a p-value less than .05.
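A back-of-the-envelope power calculation shows why such replication failures are unremarkable. The effect sizes and sample sizes below are assumptions for illustration (the original papers' exact numbers are not at issue here), and the calculation uses a simple normal approximation:

```python
# Rough power calculation (normal approximation) for a small two-group study.
# Effect sizes and sample sizes are illustrative assumptions.
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    nc = d * (n_per_group / 2) ** 0.5     # approximate noncentrality of the test statistic
    return norm.sf(z_crit - nc) + norm.cdf(-z_crit - nc)

for d in (0.3, 0.5):
    for n in (10, 20):
        print(f"d = {d}, n = {n} per group: power ~ {approx_power(d, n):.2f}")
# even with d = .5 and n = 20 per group, power is only about one third,
# so most exact replications of such a study are expected to fail
```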

Failed Replications Did Not Influence Social Psychological Theories

Argument quality reaches a new low with the next argument against replication studies.

 “If we look at the history of social psychology, theories have rarely been abandoned because of failed replications.”

This is true, but it reveals the lack of progress in theory development in social psychology rather than the futility of replication studies.  From an evolutionary perspective, theory development requires selection pressure, but publication bias protects bad theories from failure.

The short history of open science shows how weak social psychological theories are: even the most basic predictions cannot be confirmed in open replication studies that do not selectively report significant results.  So, even if it is true that failed replications have played a minor role in social psychology’s past, they are going to play a much bigger role in its future.

The Red Herring: Fraud

S&S imply that Roediger suggested using replication studies as a fraud-detection tool.

if others had tried to replicate his [Stapel’s] work soon after its publication, his misdeeds might have been uncovered much more quickly

S&S dismiss this idea in part on the basis of Stroebe’s research on fraud detection.

To their own surprise, Stroebe and colleagues found that replications hardly played any role in the discovery of these fraud cases.

Now this is actually not surprising, because failed replications were hardly ever published.  And if there is no variance in a predictor variable (significance), we cannot observe a correlation between the predictor variable and an outcome (fraud).  Although failed replication studies may help to detect fraud in the future, this is neither their primary purpose nor necessary to make them valuable. Replication studies also do not bring world peace or end global warming.

For some inexplicable reason S&S continue to focus on fraud. For example, they also argue that meta-analyses are poor fraud detectors, which is as true as it is irrelevant.

They conclude their discussion with an observation by Stapel, who famously faked 50+ articles in social psychology journals.

As Stapel wrote in his autobiography, he was always pleased when his invented findings were replicated: “What seemed logical and was fantasized became true” (Stapel, 2012). Thus, neither can failures to replicate a research finding be used as indicators of fraud, nor can successful replications be invoked as indication that the original study was honestly conducted.

I am not sure why S&S spend so much time talking about fraud, but it is the only questionable research practice that they openly address.  In contrast, they do not discuss other questionable research practices, such as suppressing failed studies, which are much more prevalent and much more important for understanding the replication crisis in social psychology than fraud.  The term “publication bias” is not mentioned once in the article. Sometimes what is hidden is more significant than what is being published.


The conclusion section correctly predicts that the results of the reproducibility project will make social psychology look bad and that social psychology will look worse than other areas of psychology.

But whereas it will certainly be useful to be informed about studies that are difficult to replicate, we are less confident about whether the investment of time and effort of the volunteers of the Open Science Collaboration is well spent on replicating studies published in three psychology journals. The result will be a reproducibility coefficient that will not be greatly informative, because of justified doubts about whether the “exact” replications succeeded in replicating the theoretical conditions realized in the original research.

As social psychologists, we are particularly concerned that one of the outcomes of this effort will be that results from our field will be perceived to be less “reproducible” than research in other areas of psychology. This is to be expected because for the reasons discussed earlier, attempts at “direct” replications of social psychological studies are less likely than exact replications of experiments in psychophysics to replicate the theoretical conditions that were established in the original study.

Although psychologists should not be complacent, there seem to be no reasons to panic the field into another crisis. Crises in psychology are not caused by methodological flaws but by the way people talk about them (Kruglanski & Stroebe, 2012).

S&S attribute the foreseen (how did they know?) bad outcome in the reproducibility project to the difficulty of replicating social psychological studies, but they fail to explain why social psychology journals publish as many successes as other disciplines.

The results of the reproducibility project provide an answer to this question.  Social psychologists use designs with lower statistical power, which have a lower chance of producing a significant result. Selection for significance ensures that the published success rate is equally high in all areas of psychology, but lower power makes these successes less replicable.
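The argument is easy to verify with a small simulation. The power values (.25 for a hypothetical low-power field, .60 for a hypothetical high-power field) are illustrative assumptions: after selection for significance, both fields report a near-perfect success rate, but the replication rate of the published findings tracks the underlying power.

```python
# Minimal sketch: selection for significance equalizes published success rates,
# while replicability tracks the underlying power. Power values are assumed.
import numpy as np

rng = np.random.default_rng(11)
n_studies = 100_000

for label, power in (("low-power field", 0.25), ("high-power field", 0.60)):
    significant = rng.random(n_studies) < power      # outcomes of the original studies
    published = significant                          # nonsignificant results stay in the file drawer
    replicated = rng.random(n_studies) < power       # outcomes of independent exact replications
    print(f"{label}: overall success rate = {significant.mean():.2f}, "
          f"published success rate = {significant[published].mean():.2f}, "
          f"replication rate of published results = {replicated[published].mean():.2f}")
```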

To avoid further embarrassments in an increasingly open science, social psychologists must improve the statistical power of their studies. Which social psychological theories will survive actual empirical tests in the new world of open science is unclear.  In this regard, I think it makes more sense to compare social psychology to a shipwreck than to a train wreck.  Somewhere down on the ocean floor there is some gold, but it will take deep diving and many failed attempts to find it.  Good luck!


S&S’s article was published in a “prestigious” psychology journal and has already garnered 114 citations. It ranks #21 in my importance rankings of articles in meta-psychology.  So, I was curious why the article gets cited.  The appendix lists 51 citing articles with the relevant citation and the reason for citing S&S’s article.   The table shows the reasons for citations in decreasing order of frequency.

S&S are most frequently cited for the claim that exact replications are impossible, followed by the reason given for this claim, namely that effects in psychological research are sensitive to the unique context in which a study is conducted.  The next two reasons for citing the article are that only conceptual replications (CR) test theories, whereas the results of exact replications (ER) are uninformative.  The problem is that, by S&S’s own account, every study is a conceptual replication because exact replications are impossible. So even if exact replications were uninformative, this claim has no practical relevance because there are no exact replications.  Some articles cite S&S without attaching a specific claim to the citation.  Only two articles cite them for the claim that there is no replication crisis, and only one cites S&S for the claim that there is no evidence about the prevalence of QRPs.

In short, the article is mostly cited for the uncontroversial and inconsequential claim that exact replications are impossible and that effect sizes in psychological studies can vary as a function of unique features of a particular sample or study.  This observation is inconsequential because it is unclear how unknown unique characteristics of studies influence results.  Its main implication is that study results will be more variable than we would expect from a set of exact replication studies. For this reason, meta-analysts often use random-effects models, because fixed-effects meta-analysis assumes that all studies are exact replications.

Reason for citation (number of citing articles):
ER impossible: 11
Contextual Sensitivity: 8
CR test theory: 8
ER uninformative: 7
Mention: 6
ER/CR Distinction: 2
No replication crisis: 2
Disagreement: 1
CR Definition: 1
ER informative: 1
ER useful for applied research: 1
ER cannot detect fraud: 1
No evidence about prevalence of QRP: 1
Contextual sensitivity greater in social psychology: 1

Below is a list of the most influential citing articles and the relevant citations.  I haven’t had time to do a formal content analysis, but the article is mostly cited to say that (a) exact replications are impossible, (b) conceptual replications are valuable, and (c) social psychological findings are harder to replicate.  Few articles cite the article to claim that the replication crisis is overblown or that failed replications are uninformative.  Thus, even though the article is cited a lot, it is not cited for the main points S&S tried to make.  The high number of citations therefore does not mean that S&S’s claims have been widely accepted.

The value of replication studies.

Simons, D. J.
“In this commentary, I challenge these claims.”

(ER/CR Distinction)
Bilingualism and cognition.

Valian, V.
“A host of methodological issues should be resolved. One is whether the field should undertake exact replications, conceptual replications, or both, in order to determine the conditions under which effects are reliably obtained (Paap, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(Contextual Sensitivity)
Is Psychology Suffering From a Replication Crisis? What Does “Failure to Replicate” Really Mean?
Maxwell et al. (2015)
A particular replication may fail to confirm the results of an original study for a variety of reasons, some of which may include intentional differences in procedures, measures, or samples as in a conceptual replication (Cesario, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(ER impossible)
The Chicago face database: A free stimulus set of faces and norming data 

Debbie S. Ma, Joshua Correll, & Bernd Wittenbrink.
The CFD will also make it easier to conduct exact replications, because researchers can use the same stimuli employed by other researchers (but see Stroebe & Strack, 2014).”

(Contextual Sensitivity)
“Contextual sensitivity in scientific reproducibility”
vanBavel et al. (2015)
“Many scientists have also argued that the failure to reproduce results might reflect contextual differences—often termed “hidden moderators”—between the original research and the replication attempt”

(Contextual Sensitivity)
Editorial Psychological Science

As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence do not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).

(ER impossible)
Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science

Finkel et al. (2015).
“Nevertheless, many scholars believe that direct replications are impossible in the human sciences—S&S (2014) call them “an illusion”— because certain factors, such as a moment in historical time or the precise conditions under which a sample was obtained and tested, that may have contributed to a result can never be reproduced identically.”

Conceptualizing and evaluating the replication of research results
Fabrigar and Wegener (2016)
(CR test theory)
“Traditionally, the primary presumed strength of conceptual replications has been their ability to address issues of construct validity (e.g., Brewer & Crano, 2014; Schmidt, 2009; Stroebe & Strack, 2014). “

(ER impossible)
“First, it should be recognized that an exact replication in the strictest sense of the term can never be achieved as it will always be impossible to fully recreate the contextual factors and participant characteristics present in the original experiment (see Schmidt (2009); S&S (2014)).”

(Contextual Sensitivity)
“S&S (2014) have argued that there is good reason to expect that many traditional and contemporary experimental manipulations in social psychology would have different psychological properties and effects if used in contexts or populations different from the original experiments for which they were developed. For example, classic dissonance manipulations and fear manipulations or more contemporary priming procedures might work very differently if used in new contexts and/or populations. One could generate many additional examples beyond those mentioned by S&S.”

(ER impossible)
“Another important point illustrated by the above example is that the distinction between exact and conceptual replications is much more nebulous than many discussions of replication would suggest. Indeed, some critics of the exact/conceptual replication distinction have gone so far as to argue that the concept of exact replication is an “illusion” (Stroebe & Strack, 2014). Though we see some utility in the exact/conceptual distinction (especially regarding the goal of the researcher in the work), we agree with the sentiments expressed by S&S. Classifying studies on the basis of the exact/conceptual distinction is more difficult than is often appreciated, and the presumed strengths and weaknesses of the approaches are less straightforward than is often asserted or assumed.”

(Contextual Sensitivity)
“Furthermore, assuming that these failed replication experiments have used the same operationalizations of the independent and dependent variables, the most common inference drawn from such failures is that confidence in the existence of the originally demonstrated effect should be substantially undermined (e.g., see Francis (2012); Schimmack (2012)). Alternatively, a more optimistic interpretation of such failed replication experiments could be that the failed versus successful experiments differ as a function of one or more unknown moderators that regulate the emergence of the effect (e.g., Cesario, 2014; Stroebe & Strack, 2014).”

Replicating Studies in Which Samples of Participants Respond to Samples of Stimuli.
(CR Definition)
Westfall et al. (2015).
Nevertheless, the original finding is considered to be conceptually replicated if it can be convincingly argued that the same theoretical constructs thought to account for the results of the original study also account for the results of the replication study (Stroebe & Strack, 2014). Conceptual replications are thus “replications” in the sense that they establish the reproducibility of theoretical interpretations.”

“Although establishing the generalizability of research findings is undoubtedly important work, it is not the focus of this article (for opposing viewpoints on the value of conceptual replications, see Pashler & Harris, 2012; Stroebe & Strack, 2014).“

Introduction to the Special Section on Advancing Our Methods and Practices
Ledgerwood, A.
We can and surely should debate which problems are most pressing and which solutions most suitable (e.g., Cesario, 2014; Fiedler, Kutzner, & Krueger, 2012; Murayama, Pekrun, & Fiedler, 2013; Stroebe & Strack, 2014). But at this point, most can agree that there are some real problems with the status quo.

***Theory Building, Replication, and Behavioral Priming: Where Do We Need to Go From Here?
Locke, EA
(ER impossible)
As can be inferred from Table 1, I believe that the now popular push toward “exact” replication (e.g., see Simons, 2014) is not the best way to go. Everyone agrees that literal replication is impossible (e.g., Stroebe & Strack, 2014), but let us assume it is as close as one can get. What has been achieved?

The War on Prevention: Bellicose Cancer Metaphors Hurt (Some) Prevention Intentions
(CR test theory)
David J. Hauser and Norbert Schwarz
“As noted in recent discussions (Stroebe & Strack, 2014), consistent effects of multiple operationalizations of a conceptual variable across diverse content domains are a crucial criterion for the robustness of a theoretical approach.”

Doyen et al.
(CR test theory)
In contrast, social psychologists assume that the primes activate culturally and situationally contextualized representations (e.g., stereotypes, social norms), meaning that they can vary over time and culture and across individuals. Hence, social psychologists have advocated the use of “conceptual replications” that reproduce an experiment by relying on different operationalizations of the concepts under investigation (Stroebe & Strack, 2014). For example, in a society in which old age is associated not with slowness but with, say, talkativeness, the outcome variable could be the number of words uttered by the subject at the end of the experiment rather than walking speed.”

***Welcome back Theory
Ap Dijksterhuis
(ER uninformative)
“it is unavoidable, and indeed, this commentary is also about replication—it is done against the background of something we had almost forgotten: theory! S&S (2014, this issue) argue that focusing on the replication of a phenomenon without any reference to underlying theoretical mechanisms is uninformative”

On the scientific superiority of conceptual replications for scientific progress
Christian S. Crandall, Jeffrey W. Sherman
(ER impossible)
But in matters of social psychology, one can never step in the same river twice—our phenomena rely on culture, language, socially primed knowledge and ideas, political events, the meaning of questions and phrases, and an ever-shifting experience of participant populations (Ramscar, 2015). At a certain level, then, all replications are “conceptual” (Stroebe & Strack, 2014), and the distinction between direct and conceptual replication is continuous rather than categorical (McGrath, 1981). Indeed, many direct replications turn out, in fact, to be conceptual replications. At the same time, it is clear that direct replications are based on an attempt to be as exact as possible, whereas conceptual replications are not.

***Are most published social psychological findings false?
Stroebe, W.
(ER uninformative)
This near doubling of replication success after combining original and replication effects is puzzling. Because these replications were already highly powered, the increase is unlikely to be due to the greater power of a meta-analytic synthesis. The two most likely explanations are quality problems with the replications or publication bias in the original studies. An evaluation of the quality of the replications is beyond the scope of this review and should be left to the original authors of the replicated studies. However, the fact that all replications were exact rather than conceptual replications of the original studies is likely to account to some extent for the lower replication rate of social psychological studies (Stroebe & Strack, 2014). There is no evidence either to support or to reject the second explanation.”

(ER impossible)
“All four projects relied on exact replications, often using the material used in the original studies. However, as I argued earlier (Stroebe & Strack, 2014), even if an experimental manipulation exactly replicates the one used in the original study, it may not reflect the same theoretical variable.”

(CR test theory)
“Gergen’s argument has important implications for decisions about the appropriateness of conceptual compared to exact replication. The more a phenomenon is susceptible to historical change, the more conceptual replication rather than exact replication becomes appropriate (Stroebe & Strack, 2014).”

(CR test theory)
“Moonesinghe et al. (2007) argued that any true replication should be an exact replication, “a precise processwhere the exact same finding is reexamined in the same way”. However, conceptual replications are often more informative than exact replications, at least in studies that are testing theoretical predictions (Stroebe & Strack, 2014). Because conceptual replications operationalize independent and/or dependent variables in a different way, successful conceptual replications increase our trust in the predictive validity of our theory.”

There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance
Anderson & Maxwell
“It is important to note some caveats regarding direct (exact) versus conceptual replications. While direct replications were once avoided for lack of originality, authors have recently urged the field to take note of the benefits and importance of direct replication. According to Simons (2014), this type of replication is “the only way to verify the reliability of an effect” (p. 76). With respect to this recent emphasis, the current article will assume direct replication. However, despite the push toward direct replication, some have still touted the benefits of conceptual replication (Stroebe & Strack, 2014). Importantly, many of the points and analyses suggested in this paper may translate well to conceptual replication.”

Reconceptualizing replication as a sequence of different studies: A replication typology
Joachim Hüffmeier, Jens Mazei, Thomas Schultze
(ER impossible)
The first type of replication study in our typology encompasses exact replication studies conducted by the author(s) of an original finding. Whereas we must acknowledge that replications can never be “exact” in a literal sense in psychology (Cesario, 2014; Stroebe & Strack, 2014), exact replications are studies that aspire to be comparable to the original study in all aspects (Schmidt, 2009). Exact replications—at least those that are not based on questionable research practices such as the arbitrary exclusion of critical outliers, sampling or reporting biases (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)—serve the function of protecting against false positive effects (Type I errors) right from the start.

(ER informative)
Thus, this replication constitutes a valuable contribution to the research process. In fact, already some time ago, Lykken (1968; see also Mummendey, 2012) recommended that all experiments should be replicated  before publication. From our perspective, this recommendation applies in particular to new findings (i.e., previously uninvestigated theoretical relations), and there seems to be some consensus that new findings should be replicated at least once, especially when they were unexpected, surprising, or only loosely connected to existing theoretical models (Stroebe & Strack, 2014; see also Giner-Sorolla, 2012; Murayama et al., 2014).”

Although there is currently some debate about the epistemological value of close replication studies (e.g., Cesario, 2014; LeBel & Peters, 2011; Pashler & Harris, 2012; Simons, 2014; Stroebe & Strack, 2014), the possibility that each original finding can—in principal—be replicated by the scientific community represents a cornerstone of science (Kuhn, 1962; Popper, 1992).”

(CR test theory)
So far, we have presented “only” the conventional rationale used to stress the importance of close replications. Notably, however, we will now add another—and as we believe, logically necessary—point originally introduced by S&S (2014). This point protects close replications from being criticized (cf. Cesario, 2014; Stroebe & Strack, 2014; see also LeBel & Peters, 2011). Close replications can be informative only as long as they ensure that the theoretical processes investigated or at least invoked by the original study are shown to also operate in the replication study.

(CR test theory)
The question of how to conduct a close replication that is maximally informative entails a number of methodological choices. It is important to both adhere to the original study proceedings (Brandt et al., 2014; Schmidt, 2009) and focus on and meticulously measure the underlying theoretical mechanisms that were shown or at least proposed in the original studies (Stroebe & Strack, 2014). In fact, replication attempts are most informative when they clearly demonstrate either that the theoretical processes have unfolded as expected or at which point in the process the expected results could no longer be observed (e.g., a process ranging from a treatment check to a manipulation check and [consecutive] mediator variables to the dependent variable). Taking these measures is crucial to rule out that a null finding is simply due to unsuccessful manipulations or changes in a manipulation’s meaning and impact over time (cf. Stroebe & Strack, 2014). “

(CR test theory)
Conceptual replications in laboratory settings are the fourth type of replication study in our typology. In these replications, comparability to the original study is aspired to only in the aspects that are deemed theoretically relevant (Schmidt, 2009; Stroebe & Strack, 2014). In fact, most if not all aspects may differ as long as the theoretical processes that have been studied or at least invoked in the original study are also covered in a conceptual replication study in the laboratory.”

(ER useful for applied research)
For instance, conceptual replications may be less important for applied disciplines that focus on clinical phenomena and interventions. Here, it is important to ensure that there is an impact of a specific intervention and that the related procedure does not hurt the members of the target population (e.g., Larzelere et al., 2015; Stroebe & Strack, 2014).”

From intrapsychic to ecological theories in social psychology: Outlines of a functional theory approach
Klaus Fiedler
(ER uninformative)
Replicating an ill-understood finding is like repeating a complex sentence in an unknown language. Such a “replication” in the absence of deep understanding may appear funny, ridiculous, and embarrassing to a native speaker, who has full control over the foreign language. By analogy, blindly replicating or running new experiments on an ill-understood finding will rarely create real progress (cf. Stroebe & Strack, 2014). “

Into the wild: Field research can increase both replicability and real-world impact
Jon K. Maner
(CR test theory)
Although studies relying on homogeneous samples of laboratory or online participants might be highly replicable when conducted again in a similar homogeneous sample of laboratory or online participants, this is not the key criterion (or at least not the only criterion) on which we should judge replicability (Westfall, Judd & Kenny, 2015; see also Brandt et al., 2014; Stroebe & Strack, 2014). Just as important is whether studies replicate in samples that include participants who reflect the larger and more diverse population.”

Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives?
Shanks et al.
(ER impossible)
There is no such thing as an “exact” replication (Stroebe & Strack, 2014) and hence it must be acknowledged that the published studies (notwithstanding the evidence for p-hacking and/or publication bias) may have obtained genuine effects and that undetected moderator variables explain why the present studies failed to obtain priming.   Some of the experiments reported here differed in important ways from those on which they were modeled (although others were closer replications and even these failed to obtain evidence of reliable romantic priming).

(CR test theory)
As S&S (2014) point out, what is crucial is not so much exact surface replication but rather identical operationalization of the theoretically relevant variables. In the present case, the crucial factors are the activation of romantic motives and the appropriate assessment of consumption, risk-taking, and other measures.”

A Duty to Describe: Better the Devil You Know Than the Devil You Don’t
Brown, Sacha D et al.
Ioannidis (2005) has been at the forefront of researchers identifying factors interfering with self-correction. He has claimed that journal editors selectively publish positive findings and discriminate against study replications, permitting errors in data and theory to enjoy a long half-life (see also Ferguson & Brannick, 2012; Ioannidis, 2008, 2012; Shadish, Doherty, & Montgomery, 1989; Stroebe & Strack, 2014). We contend there are other equally important, yet relatively unexplored, problems.

A Room with a Viewpoint Revisited: Descriptive Norms and Hotel Guests’ Towel Reuse Behavior
(Contextual Sensitivity)
Bohner, Gerd; Schlueter, Lena E.
On the other hand, our pilot participants’ estimates of towel reuse rates were generally well below 75%, so we may assume that the guests participating in our experiments did not perceive the normative messages as presenting a surprisingly low figure. In a more general sense, the issue of greatly diverging baselines points to conceptual issues in trying to devise a ‘‘direct’’ replication: Identical operationalizations simply may take on different meanings for people in different cultures.

***The empirical benefits of conceptual rigor: Systematic articulation of conceptual hypotheses can reduce the risk of non-replicable results (and facilitate novel discoveries too)
Mark Schaller
(Contextual Sensitivity)
Unless these subsequent studies employ methods that exactly replicate the idiosyncratic context in which the effect was originally detected, these studies are unlikely to replicate the effect. Indeed, because many psychologically important contextual variables may lie outside the awareness of researchers, even ostensibly “exact” replications may fail to create the conditions necessary for a fragile effect to emerge (Stroebe & Strack, 2014)

A Concise Set of Core Recommendations to Improve the Dependability of Psychological Research
David A. Lishner
(CR test theory)
The claim that direct replication produces more dependable findings across replicated studies than does conceptual replication seems contrary to conventional wisdom that conceptual replication is preferable to direct replication (Dijksterhuis, 2014; Neulip & Crandall, 1990, 1993a, 1993b; Stroebe & Strack, 2014).
(CR test theory)
However, most arguments advocating conceptual replication over direct replication are attempting to promote the advancement or refinement of theoretical understanding (see Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014). The argument is that successful conceptual replication demonstrates a hypothesis (and by extension the theory from which it derives) is able to make successful predictions even when one alters the sampled population, setting, operations, or data analytic approach. Such an outcome not only suggests the presence of an organizing principle, but also the quality of the constructs linked by the organizing principle (their theoretical meanings). Of course this argument assumes that the consistency across the replicated findings is not an artifact of data acquisition or data analytic approaches that differ among studies. The advantage of direct replication is that regardless of how flexible or creative one is in data acquisition or analysis, the approach is highly similar across replication studies. This duplication ensures that any false finding based on using a flexible approach is unlikely to be repeated multiple times.

(CR test theory)
Does this mean conceptual replication should be abandoned in favor of direct replication? No, absolutely not. Conceptual replication is essential for the theoretical advancement of psychological science (Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014), but only if dependability in findings via direct replication is first established (Cesario, 2014; Simons, 2014). Interestingly, in instances where one is able to conduct multiple studies for inclusion in a research report, one approach that can produce confidence in both dependability of findings and theoretical generalizability is to employ nested replications.

(ER cannot detect fraud)
A second advantage of direct replications is that they can protect against fraudulent findings (Schmidt, 2009), particularly when different research groups conduct direct replication studies of each other’s research. S&S (2014) make a compelling argument that direct replication is unlikely to prove useful in detection of fraudulent research. However, even if a fraudulent study remains unknown or undetected, its impact on the literature would be lessened when aggregated with nonfraudulent direct replication studies conducted by honest researchers.

***Does cleanliness influence moral judgments? Response effort moderates the effect of cleanliness priming on moral judgments.
(ER uninformative)
Indeed, behavioral priming effects in general have been the subject of increased scrutiny (see Cesario, 2014), and researchers have suggested different causes for failed replication, such as measurement and sampling errors (Stanley and Spence,2014), variation in subject populations (Cesario, 2014), discrepancy in operationalizations (S&S, 2014), and unidentified moderators (Dijksterhuis,2014).

Daniel C. Molden
(ER uninformative)
Therefore, some greater emphasis on direct replication in addition to conceptual replication is likely necessary to maximize what can be learned from further research on priming (but see Stroebe and Strack, 2014, for costs of overemphasizing direct replication as well).

On the automatic link between affect and tendencies to approach and avoid: Chen and Bargh (1999) revisited
Mark Rotteveel et al.
(no replication crisis)
Although opinions differ with regard to the extent of this “replication crisis” (e.g., Pashler and Harris, 2012; S&S, 2014), the scientific community seems to be shifting its focus more toward direct replication.

(ER uninformative)
Direct replications not only affect one’s confidence about the veracity of the phenomenon under study, but they also increase our knowledge about effect size (see also Simons, 2014; but see also S&S, 2014).

Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability
McShane and Bockenholt
(ER impossible)
The purpose of meta-analysis is to synthesize a set of studies of a common phenomenon. This task is complicated in behavioral research by the fact that behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999).

(ER impossible)
Further, because behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999), our SPM methodology estimates and accounts for heterogeneity, which has been shown to be important in a wide variety of behavioral research settings (Hedges and Pigott 2001; Klein et al. 2014; Pigott 2012).

A Closer Look at Social Psychologists’ Silver Bullet: Inevitable and Evitable Side   Effects of the Experimental Approach
Herbert Bless and Axel M. Burger
(ER/CR Distinction)
Given the above perspective, it becomes obvious that in the long run, conceptual replications can provide very fruitful answers because they address the question of whether the initially observed effects are potentially caused by some perhaps unknown aspects of the experimental procedure (for a discussion of conceptual versus direct replications, see e.g., Stroebe & Strack, 2014; see also Brandt et al., 2014; Cesario, 2014; Lykken, 1968; Schwarz & Strack, 2014).  Whereas conceptual replications are adequate solutions for broadening the sample of situations (for examples, see Stroebe & Strack, 2014), the present perspective, in addition, emphasizes that it is important that the different conceptual replications do not share too much overlap in general aspects of the experiment (see also Schwartz, 2015, advocating for  conceptual replications)

Men in red: A reexamination of the red-attractiveness effect
Vera M. Hesslinger, Lisa Goldbach, & Claus-Christian Carbon
(ER impossible)
As Brandt et al. (2014) pointed out, a replication in psychological research will never be absolutely exact or direct (see also, Stroebe & Strack, 2014), which is, of course, also the case in the present research.

***On the challenges of drawing conclusions from p-values just below 0.05
Daniel Lakens
(no evidence about QRP)
In recent years, researchers have become more aware of how flexibility during the data-analysis can increase false positive results (e.g., Simmons, Nelson & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated, for example because researchers analyze their data until a p-value smaller than 0.05 is observed, the robustness of scientific knowledge can substantially decrease. However, as Stroebe & Strack (2014, p. 60) have pointed out: ‘Thus far, however, no solid data exist on the prevalence of such research practices.’

***Does Merely Going Through the Same Moves Make for a ‘‘Direct’’ Replication? Concepts, Contexts, and Operationalizations
Norbert Schwarz and Fritz Strack
(Contextual Sensitivity)
In general, meaningful replications need to realize the psychological conditions of the original study. The easier option of merely running through technically identical procedures implies the assumption that psychological processes are context insensitive and independent of social, cultural, and historical differences (Cesario, 2014; Stroebe & Strack, 2014). Few social (let alone cross-cultural) psychologists would be willing to endorse this assumption with a straight face. If so, mere procedural equivalence is an insufficient criterion for assessing the quality of a replication.

The Replication Paradox: Combining Studies can Decrease Accuracy of Effect Size Estimates
(ER uninformative)
Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts
Replications with nonsignificant results are easily dismissed with the argument that the replication might contain a confound that caused the null finding (Stroebe & Strack, 2014).

Retro-priming, priming, and double testing: psi and replication in a test-retest design
Rabeyron, T
Bem’s paper spawned numerous attempts to replicate it (see e.g., Galak et al., 2012; Bem et al., submitted) and reflections on the difficulty of direct replications in psychology (Ritchie et al., 2012). This aspect has been associated more generally with debates concerning the “decline effect” in science (Schooler, 2011) and a potential “replication crisis” (S&S, 2014) especially in the fields of psychology and medical sciences (De Winter and Happee, 2013).

Do p Values Lose Their Meaning in Exploratory Analyses? It Depends How You Define the Familywise Error Rate
Mark Rubin
(ER impossible)
Consequently, the Type I error rate remains constant if researchers simply repeat the same test over and over again using different samples that have been randomly drawn from the exact same population. However, this first situation is somewhat hypothetical and may even be regarded as impossible in the social sciences because populations of people change over time and location (e.g., Gergen, 1973; Iso-Ahola, 2017; Schneider, 2015; Serlin, 1987; Stroebe & Strack, 2014). Yesterday’s population of psychology undergraduate students from the University of Newcastle, Australia, will be a different population to today’s population of psychology undergraduate students from the University of Newcastle, Australia.

***Learning and the replicability of priming effects
Michael Ramscar
(ER uninformative)
In the limit, this means that in the absence of a means for objectively determining what the information that produces a priming effect is, and for determining that the same information is available to the population in a replication, all learned priming effects are scientifically unfalsifiable. (Which also means that in the absence of an account of what the relevant information is in a set of primes, and how it produces a specific effect, reports of a specific priming result — or failures to replicate it — are scientifically uninformative; see also [Stroebe & Strack, 2014.)

***Evaluating Psychological Research Requires More Than Attention to the N: A Comment on Simonsohn’s (2015) “Small Telescopes”
Norbert Schwarz and Gerald L. Clore
(CR test theory)
Simonsohn’s decision to equate a conceptual variable (mood) with its manipulation (weather) is compatible with the logic of clinical trials, but not with the logic of theory testing. In clinical trials, which have inspired much of the replicability debate and its statistical focus, the operationalization (e.g., 10 mg of a drug) is itself the variable of interest; in theory testing, any given operationalization is merely one, usually imperfect, way to realize the conceptual variable. For this reason, theory tests are more compelling when the results of different operationalizations converge (Stroebe & Strack, 2014), thus ensuring, in the case in point, that it is not “the weather” but indeed participants’ (sometimes weather-induced) mood that drives the observed effect.

Internal conceptual replications do not increase independent replication success
Kunert, R
(Contextual Sensitivity)
According to the unknown moderator account of independent replication failure, successful internal replications should correlate with independent replication success. This account suggests that replication failure is due to the fact that psychological phenomena are highly context-dependent, and replicating seemingly irrelevant contexts (i.e. unknown moderators) is rare (e.g., Barrett, 2015; DGPS, 2015; Fleming Crim, 2015; see also Stroebe & Strack, 2014; for a critique, see Simons, 2014). For example, some psychological phenomenon may unknowingly be dependent on time of day.

(Contextual Sensitivity greater in social psychology)
When the chances of unknown moderator influences are greater and replicability is achieved (internal, conceptual replications), then the same should be true when chances are smaller (independent, direct replications). Second, the unknown moderator account is usually invoked for social psychological effects (e.g. Cesario, 2014; Stroebe & Strack, 2014). However, the lack of influence of internal replications on independent replication success is not limited to social psychology. Even for cognitive psychology a similar pattern appears to hold.

On Klatzky and Creswell (2014): Saving Social Priming Effects But Losing Science as We Know It?
Barry Schwartz
(ER uninformative)
The recent controversy over what counts as “replication” illustrates the power of this presumption. Does “conceptual replication” count? In one respect, conceptual replication is a real advance, as conceptual replication extends the generality of the phenomena that were initially discovered. But what if it fails? Is it because the phenomena are unreliable, because the conceptual equivalency that justified the new study was logically flawed, or because the conceptual replication has permitted the intrusion of extraneous variables that obscure the original phenomenon? This ambiguity has led some to argue that there is no substitute for strict replication (see Pashler & Harris, 2012; Simons, 2014, and Stroebe & Strack, 2014, for recent manifestations of this controversy). A significant reason for this view, however, is less a critique of the logic of conceptual replication than it is a comment on the sociology (or politics, or economics) of science. As Pashler and Harris (2012) point out, publication bias virtually guarantees that successful conceptual replications will be published whereas failed conceptual replications will live out their lives in a file drawer.  I think Pashler and Harris’ surmise is probably correct, but it is not an argument for strict replication so much as it is an argument for publication of failed conceptual replication.

Commentary and Rejoinder on Lynott et al. (2014)
Lawrence E. Williams
(CR test theory)
On the basis of their investigations, Lynott and colleagues (2014) conclude ‘‘there is no evidence that brief exposure to warm therapeutic packs induces greater prosocial responding than exposure to cold therapeutic packs’’ (p. 219). This conclusion, however, does not take into account other related data speaking to the connection between physical warmth and prosociality. There is a fuller body of evidence to be considered, in which both direct and conceptual replications are instructive. The former are useful if researchers particularly care about the validity of a specific phenomenon; the latter are useful if researchers particularly care about theory testing (Stroebe & Strack, 2014).

The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?
(no replication crisis)
Motyl et al. (2017) “The claim of a replicability crisis is greatly exaggerated.” Wolfgang Stroebe and Fritz Strack, 2014

Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology
Harry T. Reis, Karisa Y. Lee
(ER impossible)
Much of the current debate, however, is focused narrowly on direct or exact replications—whether the findings of a given study, carried out in a particular way with certain specific operations, would be repeated. Although exact replications are surely desirable, the papers by Fabrigar and by Crandall and Sherman remind us that in an absolute sense they are fundamentally impossible in social–personality psychology (see also S&S, 2014).

Show me the money
(Contextual Sensitivity)
Of course, it is possible that additional factors, which varied or could have varied among our studies and previously published studies (e.g., participants’ attitudes toward money) or among the online studies and laboratory study in this article (e.g., participants’ level of distraction), might account for these apparent inconsistencies. We did not aim to conduct a direct replication of any specific past study, and therefore we encourage special care when using our findings to evaluate existing ones (Doyen, Klein, Simons, & Cleeremans, 2014; Stroebe & Strack, 2014).

***From Data to Truth in Psychological Science. A Personal Perspective.
(ER uninformative)
In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating noneffects (S&S, 2014). But this may require expertise and effort.


Replicability 101: How to interpret the results of replication studies

Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015).  This article gives a basic introduction to the interpretation of statistical results within the Neyman-Pearson approach to statistical inference.

I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures.  First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence).  Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent.  Second, I  point out that publication bias undermines the credibility of significant results in original studies.  When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.


Replicating something means to get the same result.  If I make the first free throw, replicating this outcome means to also make the second free throw.  When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”

If we conduct psychological studies, we can control many factors, but some factors are not under our control.  Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other.  It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.

To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study.  The most common outcome in psychological studies is a significant or non-significant result.  The goal of a study is to produce a significant result and for this reason a significant result is often called a success.  A successful replication study is a study that also produces a significant result.  Obtaining two significant results is akin to making two free throws.  This is one of the few agreements between Maxwell and me.

“Generally speaking, a published  original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)

The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).

I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical.  Otherwise, the replication study is not really a replication study.

There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures.  The probability of these outcomes depends on the significance criterion (the type-I error probability) when the null-hypothesis is true and on the statistical power of a study when the null-hypothesis is false.

Table 1 shows the probability of the outcomes in two studies.  The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts.  This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.

Table 1

                 2 sig, 0 ns     1 sig, 1 ns          0 sig, 2 ns
H0 is True       alpha^2         2*alpha*(1-alpha)    (1-alpha)^2
H1 is True       (1-beta)^2      2*(1-beta)*beta      beta^2
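
These formulas are easy to evaluate.  The following sketch (mine, not part of the original post) simply plugs a conventional alpha and a few power values into the Table 1 formulas, so the percentages discussed in the next paragraphs can be checked.

# Outcome probabilities for a pair of independent significance tests.
# Under H0, each study is significant with probability alpha;
# under H1, each study is significant with probability equal to its power.
outcome_probs <- function(p_sig) {
  c("2 sig, 0 ns" = p_sig^2,
    "1 sig, 1 ns" = 2 * p_sig * (1 - p_sig),
    "0 sig, 2 ns" = (1 - p_sig)^2)
}

round(outcome_probs(.05), 4)     # H0 true (alpha = .05): .0025 .0950 .9025

# H1 true with 40%, 50%, 80%, and 95% power
sapply(c(`40%` = .40, `50%` = .50, `80%` = .80, `95%` = .95), outcome_probs)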

The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability).  High power is needed to obtain significant results in both studies of a pair (an original study and a replication study).  For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012).  Even with conventionally acceptable power of 80%, only about two-thirds (64%) of replication attempts would produce this outcome.  However, studies in psychology do not have 80% power and estimates of power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce two significant results in no more than 16 out of 100 attempts.   Although successful replications of true effects with low power are unlikely, they are still much more likely than two significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1).  It is therefore reasonable to infer from two significant results that the null-hypothesis is false.

If the null-hypothesis is true, it is extremely likely that both studies produce a non-significant result (.95^2 = 90.25%).  In contrast, it is unlikely that a pair of studies with even modest power would produce two non-significant results.  For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%.  This means that, given two non-significant results, the null-hypothesis is much more likely to be true than this alternative hypothesis (a likelihood ratio of 90.25 : 4, or about 22.5 : 1).  This does not mean that the null-hypothesis is true in an absolute sense, because power depends on the effect size.  For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5,  two non-significant results would suggest that the effect size is smaller than .5, but they would not warrant the conclusion that H0 is true and the effect size is exactly 0.  Once more, it is important to distinguish between the absence of evidence for an effect and evidence of the absence of an effect.

The most controversial scenario assumes that the two studies produced inconsistent outcomes.  Although theoretically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low:  .05 * (1 - .05) = .0475.  The same probability exists for the reverse pattern, in which a non-significant result is followed by a significant one.  A probability of 4.75% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, the low probability is mostly due to the low probability of obtaining a significant result in the first study; the subsequent replication failure is extremely likely.

Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false.  The probability of this outcome depends on statistical power.  A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result.  The probability of this rare event can be as low as, or lower than, the probability with a true null effect: .95 * (1 - .95) = .0475.  Thus, an inconsistent result provides little information about the probability of a type-I or type-II  error and is difficult to interpret.

In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false.  Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high.  When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.


The counting of successes and failures is an old way to integrate information from multiple studies.  This approach has low power and is no longer used.  A more powerful approach is effect size meta-analysis.  Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project.  Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.

“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)

I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies.  A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.

Maxwell et al. (2015) introduced an example of a between-subjects (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size).  Such a study has 59% power to obtain a significant result.  Thus, it is quite likely that a pair of these studies produces inconsistent results (2 * .59 * .41 = 48.38%).   However, a fixed-effects meta-analysis of a pair of studies with N = 80 each is as informative as a single study with N = 160, which means the meta-analysis will produce a significant result in 88% of all attempts.  Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%).   Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.
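
These power values can be checked with R’s built-in power.t.test function.  The sketch below is mine and simply assumes the same design as the example: a two-sided, two-sample t-test with n = 40 per group, d = .5, and equal variances.

# Power of one study: two-sample t-test, n = 40 per group, d = .5
p1 <- power.t.test(n = 40, delta = .5, sd = 1, sig.level = .05)$power
round(p1, 2)                    # ~ .60 (the 59% quoted above)

# Probability that a pair of such studies is inconsistent (one sig, one ns)
round(2 * p1 * (1 - p1), 4)     # ~ .48

# A fixed-effects meta-analysis of two equal studies has the precision of
# one study with twice the sample size (n = 80 per group, N = 160 in total)
round(power.t.test(n = 80, delta = .5, sd = 1, sig.level = .05)$power, 2)   # ~ .88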

A more interesting scenario is inconsistent results when the null-hypothesis is true.  I turned to simulations to examine this scenario more closely.   The simulation showed that a meta-analysis of two inconsistent studies produced a significant result in 34% of all cases.  The percentage varies slightly as a function of sample size: with a small sample of N = 40 it is 35%, and with a large sample of 1,000 participants it is 33%.  This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis that was based on a significant original study.  Thus, if an original study produced a false-positive result, a failed replication study corrects this error in 2 out of 3 cases.  Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.
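
The logic of this simulation is easy to sketch.  The following minimal R version is my own approximation, not the exact code behind the numbers above: it simulates pairs of two-group studies under a true null, keeps the pairs in which the original study is significant and the replication is not, and then tests the inverse-variance-weighted (fixed-effects) meta-analytic estimate of the mean difference.  With n = 40 per group in both studies it lands in the same ballpark as the ~34% reported above.

# Fixed-effects meta-analysis of "original significant, replication ns" pairs
# simulated under a true null hypothesis (two-group design, true effect = 0).
sim_meta <- function(n_orig = 40, n_rep = 40, nsim = 20000, alpha = .05) {
  one_study <- function(n) {
    x <- rnorm(n); y <- rnorm(n)                         # both groups from N(0, 1)
    d <- mean(x) - mean(y)                               # mean difference
    v <- var(x) / n + var(y) / n                         # squared standard error
    c(d = d, v = v, p = 2 * pnorm(-abs(d) / sqrt(v)))    # normal-approximation p-value
  }
  res <- replicate(nsim, {
    orig <- one_study(n_orig)
    repl <- one_study(n_rep)
    if (orig["p"] < alpha && repl["p"] >= alpha) {        # inconsistent pair only
      w    <- 1 / c(orig["v"], repl["v"])                 # inverse-variance weights
      d_fe <- sum(w * c(orig["d"], repl["d"])) / sum(w)   # fixed-effects estimate
      abs(d_fe) * sqrt(sum(w)) > qnorm(1 - alpha / 2)     # meta-analysis significant?
    } else NA
  })
  mean(res, na.rm = TRUE)   # share of inconsistent pairs still significant
}

set.seed(123)
sim_meta(n_orig = 40, n_rep = 40)   # should come out near the ~34% reported above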

In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies.  If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.


We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990).  This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers.  If original researchers conducted studies with adequate power,  an exact replication study with the same sample size would also have adequate power.  If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size.  As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.

From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size.  Studies with larger sample sizes are given more weight than studies with smaller samples.  Thus, researchers who invest more resources are rewarded by giving their studies more weight.  Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same.  Failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magic sample size that a replication study must reach to be useful.

I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5).  In this simulation, only 21% of meta-analyses produced a significant result.  This is 13 percentage points lower than in the simulation with equal sample sizes (34%).  If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.
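
Using the hypothetical sim_meta sketch from above (which takes per-group sample sizes, so total N = 80 corresponds to n = 40), these scenarios correspond roughly to the following calls.  The comments give the values reported in this post; the exact output will depend on the details of the simulation.

set.seed(123)
sim_meta(n_orig = 40, n_rep = 100)   # replication 2.5 times larger: ~21% reported above
sim_meta(n_orig = 40, n_rep = 400)   # replication 10 times larger:  ~10% reported above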

The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings.  Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase the sample size of a replication study to make a replication failure meaningful.  Given unlimited resources, larger replications are better, but these analyses show that large replication studies are not necessary.  A replication study with the same sample size as the original study is more valuable than no replication study at all.


One problem with Maxwell et al.’s (2015) article is that it conflates two possible goals of replication studies.  One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis.  This finding would mean that evidence for an effect is absent.  The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result, and the burden of proof shifts back to proponents of the hypothesis to provide unbiased, credible evidence for it.

Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i.e., humans do not have supernatural abilities).  Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect.  They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.

This double standard makes it easy to produce (potentially false) positive results and very hard to remove false positive results from the literature.   It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the effect size in the original study.  The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported.  To avoid the problem that replication researchers have to invest large amounts of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles.  If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.


The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias.  They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling, 1959; Sterling et al., 1995).

It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was the action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance (Schimmack, 2012).

I used Bem’s infamous article on supernatural abilities as an example, an article that appeared to show 8 successful demonstrations of such abilities.  Ironically, Maxwell et al. (2015) also cite Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.

“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”

This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP.  They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation.”  There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.

More important, Maxwell et al. ignore the broader context of these studies.  In Schimmack (2012), I discussed many questionable practices in Bem’s original studies and presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices.  Given this wider context, it is entirely reasonable to favor the experimental-artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies while ignoring that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

Maxwell et al. (2015) were not aware that in the same year the OSC (2015) reproducibility project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%.  The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that all results that failed to replicate are false positives, but it does mean that it is not clear which findings are false positives and which are not.  Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing.  In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results.  This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence, and the large number of replication failures means that more replication studies with larger samples are needed to see which hypotheses predict real effects with practical significance.


Maxwell et al.’s (2015) answer to this question is captured in this sentence: “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis, we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496).  The most important part of this quote is “raising doubts”; the rest is Orwellian double-talk.

The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern.  As I have pointed out, this conclusion rests on misconceptions about the purpose of replication studies and on blissful ignorance of publication bias and questionable research practices, which made it possible to publish successful replications of supernatural phenomena while discrediting authors who spend time and resources on demonstrating that unbiased replication studies fail.

The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated.  In my opinion, that is not only a crisis but a scandal, because psychologists routinely apply for funding with power analyses that claim 80% power.  The reproducibility project shows that the true power to obtain significant results in original and replication studies is much lower, and that the 90% success rate is no more meaningful than 90% of votes for a candidate in communist elections.

In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.”  They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”

As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), (c) the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of outcome, and (f) the replication study was preregistered.

We can only speculate why the American Psychologist published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail.  Fortunately, APA can no longer control what is published, because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticizing peer-reviewed articles in open post-publication peer review on social media.

Long live the replicability revolution!!!


Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.


Maxwell, S.E, Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487


My email correspondence with Daryl J. Bem about the data for his 2011 article “Feeling the future”

In 2015, Daryl J. Bem shared the datafiles for the 9 studies reported in the 2011 article “Feeling the Future” with me.  In a blog post, I reported an unexplained decline effect in the data.  In an email exchange with Daryl Bem, I asked for some clarifications about the data, comments on the blog post, and permission to share the data.

Today, Daryl J. Bem granted me permission to share the data.  He declined to comment on the blog post and did not provide an explanation for the decline effect.  He also did not comment on my observation that the article did not mention that “Experiment 5” combined two experiments with N = 50 and that “Experiment 6” combined three datasets with Ns = 91, 19, and 40.  It is highly unusual to combine studies and this practice contradicts Bem’s claim that sample sizes were determined a priori based on power analysis.

Footnote on p. 409: “I set 100 as the minimum number of participants/sessions for each of the experiments reported in this article because most effect sizes (d) reported in the psi literature range between 0.2 and 0.3. If d = 0.25 and N = 100, the power to detect an effect significant at .05 by a one-tail, one-sample t test is .80 (Cohen, 1988).”
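
The power claim in this footnote is easy to verify with R’s built-in power calculator (a quick check, assuming, as the footnote states, a one-tailed, one-sample t-test with d = 0.25 and N = 100):

# Power of a one-sample, one-tailed t-test with d = .25 and N = 100
power.t.test(n = 100, delta = 0.25, sd = 1, sig.level = .05,
             type = "one.sample", alternative = "one.sided")$power   # ~ .80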

The undisclosed concoction of datasets is another questionable research practice that undermines the scientific integrity of significance tests reported in the original article. At a minimum, Bem should issue a correction that explains how the nine datasets were created and what decision rules were used to stop data collection.

I am sharing the datafiles so that other researchers can conduct further analyses of the data.

Datafiles: EXP1   EXP2   EXP3   EXP4   EXP5   EXP6   EXP7   EXP8   EXP9

Below is the complete email correspondence with Daryl J. Bem.


To: Daryl J. Bem
From:  Ulrich Schimmack
Sent: Thursday,  January 25, 2018 5:23 PM

Dear Dr. Bem,

I am going to share your comments on the blog.

I find the enthusiasm explanation less plausible than you.  More important, it doesn’t explain the lack of a decline effect in studies with significant results.

I just finished the analysis of the 6 studies with N > 100 by Maier that are also included in the meta-analysis (see Figure below).

Given the lack of a plausible explanation for your data, I think JPSP should retract your article or at least issue an expression of concern because the published results are based on abnormally strong effect sizes in the beginning of each study. Moreover, Study 5 is actually two studies of N = 50 and the pattern is repeated at the beginning of the two datasets.

I also noticed that the meta-analysis included one more underpowered study by you (N = 42) that surprisingly produced yet another significant result.  As my article that you reviewed points out, this success makes it even more likely that some non-significant (pilot) studies were omitted.  Your success record is simply too good to be true (Francis, 2012).  Have you conducted any other studies since 2012?  A non-significant result is overdue.

Regarding the meta-analysis itself, most of these studies are severely underpowered and there is still evidence for publication bias after excluding your studies.

[Figure: Maier ESP studies]

When I used puniform to control for publication bias and limited the dataset to studies with N > 90 and excluded your studies (as we agree, N < 90 is low power), the p-value was not significant, and even if it were less than .05, it would not be convincing evidence for an effect.  In addition, I computed t-values using the effect size that you assumed in 2011, d = .2, and found significant evidence against the null-hypothesis that the ESP effect size could be as large as d = .2.  This means that even studies with N = 100 are underpowered.   Any serious test of the hypothesis requires much larger sample sizes.

However, the meta-analysis and the existence of ESP are not my concern.  My concern is the way (social) psychologists have conducted research in the past and are responding to the replication crisis.  We need to understand how researchers were able to produce seemingly convincing evidence like your 9 studies in JPSP that are difficult to replicate.  How can original articles have success rates of 90% or more and replications produce only a success rate of 30% or less?  You are well aware that your 2011 article was published with reservations and concerns about the way social psychologists conducted research.   You can make a real contribution to the history of psychology by contributing to the understanding of the research process that led to your results.  This is independent of any future tests of PSI with more rigorous studies.

Best, Dr. Schimmack

Ulrich Schimmack
Daryl J. Bem
Sent: Thursday,  January 25, 2018 4:45 PM

Dear Dr. Schimmack,

You reference Schooler who has documented the decline effect in several areas—not just in psi research—and has advanced some hypotheses about its possible causes.  The hypothesis that strikes me as most plausible is that it is an experimenter effect whereby experimenters and their assistants begin with high expectations and enthusiasm but begin to get bored after conducting a lot of sessions.  This increasing lack of enthusiasm gets transmitted to the participants during the sessions.  I also refer you to Bob Rosenthal’s extensive work with experimenter effects—which show up even in studies with maze-running rats.

Most of Galak’s sessions were online, thereby diminishing this factor.  Now that I am retired and no longer have a laboratory with access to student assistants and participants, I, too, am shifting to online administration, so it will provide a rough test of this hypothesis.

Were you planning to publish our latest exchange concerning the meta-analysis?  I would not like to leave your blog followers with only your statement that it was “contaminated” by my own studies when, in fact, we did a separate meta-analysis on the non-Bem replications, as I noted in my previous email to you.

Daryl Bem


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Thursday, January 25, 2018 12:05 PM

Dear Dr. Bem,

I now started working on the meta-analysis.
I see another study by you listed (Bem, 2012, N = 42).
Can you please send me the original data for this study?

Best, Dr. Schimmack


Ulrich Schimmack
Daryl J. Bem
Sent: Thursday,  January 25, 2018 4:45 PM

Dear Dr. Shimmack,

I was not able to figure out how to leave a comment on your blog post at the website. (I kept being asked to register a site of my own.)  So, I thought I would simply write you a note.  You are free to publish it as my response to your most recent post if you wish.

In reading your posts on my precognitive experiments, I kept puzzling over why you weren’t mentioning the published Meta-analysis of 90 “Feeling the Future” studies that I published in 2015 with Tessoldi, Rabeyron, & Duggan. After all, the first question we typically ask when controversial results are presented is  “Can Independent researchers replicate the effect(s)?”  I finally spotted a fleeting reference to our meta-analysis in one of your posts, in which you simply dismissed it as irrelevant because it included my own experiments, thereby “contaminating” it.

But in the very first Table of our analysis, we presented the results for both the full sample of 90 studies and, separately, for the 69 replications conducted by independent researchers (from 33 laboratories in 14 countries on 10,000 participants).

These 69 (non-Bem-contaminated) independent replications yielded a z score of 4.16, p =1.2 x E-5.  The Bayes Factor was 3.85—generally considered large enough to provide “Substantial Evidence” for the experimental hypothesis.

Of these 69 studies, 31 were exact replications in that the investigators used my computer programs for conducting the experiments, thereby controlling the stimuli, the number of trials, all event timings, and automatic data recording. The data were also encrypted to ensure that no post-experiment manipulations were made on them by the experimenters or their assistants. (My own data were similarly encrypted to prevent my own assistants from altering them.) The remaining 38 “modified” independent replications variously used investigator-designed computer programs, different stimuli, or even automated sessions conducted online.

Both exact and modified replications were statistically significant and did not differ from one another.  Both peer reviewed and non-peer reviewed replications were statistically significant and did not differ from one another. Replications conducted prior to the publication of my own experiments and those conducted after their publication were each statistically significant and did not differ from one another.

We also used the recently introduced p-curve analysis to rule out several kinds of selection bias (file drawer problems), p-hacking, and to estimate “true” effect sizes.
There was no evidence of p-hacking in the database, and the effect size for the non-bem replications was 0.24, somewhat higher than the average effect size of my 11 original experiments (0.22.)  (This is also higher than the mean effect size of 0.21 achieved by Presentiment experiments in which indices of participants’ physiological arousal “precognitively” anticipate the random presentation of an arousing stimulus.)

For various reasons, you may not find our meta-analysis any more persuasive than my original publication, but your website followers might.

Daryl J.  Bem

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 20, 2018 6:48 PM

Dear Dr. Bem,

Thank you for your final response.   It answers all of my questions.

I am sorry if you felt bothered by my emails, but I am confident that many psychologists are interested in your answers to my questions.

Best, Dr. Schimmack


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Saturday, January 20, 2018 5:56 PM

Dear Dr. Schimmack,

I hereby grant you permission to be the conduit for making my data available to those requesting them. Most of the researchers who contributed to our 2015/16 meta-analysis of 90 retroactive “feeling-the-future” experiments have already received the data they required for replicating my experiments.

At the moment, I am planning to follow up our meta-analysis of 90 experiments by setting up pre-registered studies. That seems to me to be the most profitable response to the methodological, statistical, and reporting critiques that have emerged since I conducted my original experiments more than a decade ago.  To respond to your most recent request, I am not planning at this time to write any commentary to your posts.  I am happy to let replications settle the matter.

(One minor point: I did not spend $90,000 to conduct my experiments.  Almost all of the participants in my studies at Cornell were unpaid volunteers taking psychology courses that offered (or required) participation in laboratory experiments.  Nor did I discard failed experiments or make decisions on the basis of the results obtained.)

What I did do was spend a lot of time and effort preparing and discarding early versions of written instructions, stimulus sets and timing procedures.  These were pretested primarily on myself and my graduate assistants, who served repeatedly as pilot subjects. If instructions or procedures were judged to be too time consuming, confusing, or not arousing enough, they were changed before the formal experiments were begun on “real” participants.  Changes were not made on the basis of positive or negative results because we were only testing the procedures on ourselves.

When I did decide to change a formal experiment after I had started it, I reported it explicitly in my article. In several cases I wrote up the new trials as a modified replication of the prior experiment.  That’s why there are more experiments than phenomena in my article:  2 approach/avoidance experiments, 2 priming experiments, 3 habituation experiments, & 2 recall experiments.)

In some cases the literature suggested that some parameters would be systematically related to the dependent variables in nonlinear fashion—e.g., the number of subliminal presentations used in the familiarity-produces-increased liking effect, which has a curvilinear relationship.  In that case, I incorporated the variable as a systematic independent variable. That is also reported in the article.

It took you approximately 3 years to post your responses to my experiments after I sent you the data.  Understandable for a busy scholar.  But a bit unziemlich [unseemly] for you to then send me near-daily reminders the past 3 weeks to respond back to you (as Schumann commands in the first movement of his piano Sonata in g Minor) “so schnell wie möglich!” [as fast as possible!]  And then a page later, “Schneller!” [Faster!]

Solche Unverschämtheit!   Wenn ich es sage.  [Such impudence! When I say so.]

Daryl J.  Bem
Professor Emeritus of Psychology


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 20, 2018, 1.06 PM

Dear Dr. Bem,

Please let me know by tomorrow how your data should be made public.

I want to post my blog about Study 6 tomorrow. If you want to comment on it before I post it, please do so today.

Best, Dr. Schimmack


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Monday, January 15, 2018, 10.35 PM

You are correct:  Experiment 8, the first Retroactive Recall experiment was conducted in 2007 and its replication (Experiment 9) was conducted in 2009.

The Avoidance of Negative Stimuli (Study/Experiment 2)  was conducted (and reported as a single experiment with 150 sessions) in 2008.  More later.

Daryl Bem


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Monday, January 15, 2018, 8.52 PM

Dear Dr. Bem,

Thank you for your table.  I think we are mostly in agreement (sorry if I confused you by calling studies datasets; the numbers are supposed to correspond to the experiment numbers in your table).

The only remaining inconsistency is that the datafile for study 8 shows year 2007, while you have 2008 in your table.

Best, Dr. Schimmack

Study    Sample    Year       N             Experiment
5              1              2002       50           #5: Retroactive Habituation I (Neg only)
5              2              2002       50           #5: Retroactive Habituation I (Neg only)
6              1              2002       91           #6: Retroactive Habituation II (Neg & Erot)
6              2              2002       19           #6: Retroactive Habituation II (Neg & Erot)
6              3              2002       40           #6: Retroactive Habituation II (Neg & Erot)
7              1              2005       200         #7: Retroactive Induction of Boredom
1              1              2006       40           #1: Precognitive Detection of Erotic Stimuli
1              2              2006       60           #1: Precognitive Detection of Erotic Stimuli
2              1              2008       100         #2: Precognitive Avoidance of Negative Stimuli
2              2              2008       50           #2: Precognitive Avoidance of Negative Stimuli
3              1              2007       100         #3: Retroactive Priming I
4              1              2008       100         #4: Retroactive Priming  II
8?           1              2007/08  100         #8: Retroactive Facilitation of Recall I
9              1              2009       50           #9: Retroactive Facilitation of Recall II


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Monday, January 15, 2018, 4.17 PM

Dear Dr. Schimmack,

Here is my analysis of your Table.  I will try to get to the rest of your commentary in the coming week.

Attached Word document:

Dear Dr. Schimmack,

In looking at your table, I wasn’t sure from your numbering of Datasets & Samples which studies corresponded to those reported in my Feeling the Future article.  So I have prepared my own table in the same ordering you have provided and added a column identifying the phenomenon under investigation  (It is on the next page)

Unless I have made a mistake in identifying them, I find agreement between us on most of the figures.  I have marked in red places where we seem to disagree, which occur on Datasets identified as 3 & 8.  You have listed the dates for both as 2007, whereas my datafiles have 2008 listed for all participant sessions which describe the Precognitive Avoidance experiment and its replication.  Perhaps I have misidentified the two Datasets.  The second discrepancy is that you have listed Dataset 8 as having 100 participants, whereas I ran only 50 sessions with a revised method of selecting the negative stimulus for each trial.  As noted in the article, this did not produce a significant difference in the size of the effect, so I included all 150 sessions in the write-up of that experiment.

I do find it useful to identify the Datasets & Samples with their corresponding titles in the article.  This permits readers to read the method sections along with the table.  Perhaps it will also identify the discrepancy between our Tables.  In particular, I don’t understand the separation in your table between Datasets 8 & 9.  Perhaps you have transposed Datasets 4 & 8.

If so, then Datasets 4 & 9 would each comprise 50 sessions.

More later.

Your Table:

Dataset Sample    Year       N
5              1              2002       50
5              2              2002       50
6              1              2002       91
6              2              2002       19
6              3              2002       40
7              1              2005       200
1              1              2006       40
1              2              2006       60
3              1              2007       100
8              1              2007       100
2              1              2008       100
2              2              2008       50
4              1              2008       100
9              1              2009       50

My Table:

Dataset Sample    Year       N             Experiment
5              1              2002       50           #5: Retroactive Habituation I (Neg only)
5              2              2002       50           #5: Retroactive Habituation I (Neg only)
6              1              2002       91          #6: Retroactive Habituation II (Neg & Erot)
6              2              2002       19           #6: Retroactive Habituation II (Neg & Erot)
6              3              2002       40          #6: Retroactive Habituation II (Neg & Erot)
7              1              2005       200         #7: Retroactive Induction of Boredom
1              1              2006       40           #1: Precognitive Detection of Erotic Stimuli
1              2              2006       60           #1: Precognitive Detection of Erotic Stimuli
3              1              2008       100         #2: Precognitive Avoidance of Negative Stimuli
8?           1              2008       50           #2: Precognitive Avoidance of Negative Stimuli
2              1              2007       100         #3: Retroactive Priming I
2              2              2008       100         #4: Retroactive Priming  II
4?           1              2008       100         #8: Retroactive Facilitation of Recall I
9              1              2009       50           #9: Retroactive Facilitation of Recall II


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Monday, January 15, 2018 10.46 AM

Dear Dr. Bem,

I am sorry to bother you with my requests. It would be helpful if you could let me know if you are planning to respond to my questions and if so, when you will be able to do so?

Best regards,
Dr. Ulrich Schimmack


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 13, 2018 3.53 PM

Dear Dr. Bem,

I put together a table that summarizes when studies were done and how they were combined into datasets.

Please confirm that this is accurate or let me know if there are any mistakes.

Best, Dr. Schimmack

Dataset   Sample   Year   N
5         1        2002   50
5         2        2002   50
6         1        2002   91
6         2        2002   19
6         3        2002   40
7         1        2005   200
1         1        2006   40
1         2        2006   60
3         1        2007   100
8         1        2007   100
2         1        2008   100
2         2        2008   50
4         1        2008   100
9         1        2009   50


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 13, 2018 2.42 PM

Dear Dr. Bem,

I wrote another blog post about Study 6.  If you have any comments about this blog post or the earlier blog post, please let me know.

Also, other researchers are interested in looking at the data and I still need to hear from you how to share the datafiles.

Best, Dr. Schimmack

[Attachment: Draft of Blog Post]


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 7.47 PM

Dear. Dr. Bem,

Also, is it ok for me to share your data in public or would you rather post them in public?

Best, Dr. Schimmack


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 7.01 PM

Dear Dr. Bem,

Now that my question about Study 6 has been answered, I would like to hear your thoughts about my blog post. How do you explain the decline effect in your data; that is effect sizes decrease over the course of each experiment and when two experiments are combined into a single dataset, the decline effect seems to repeat at the beginning of the new study.   Study 6, your earliest study, doesn’t show the effect, but most other studies show this pattern.  As I pointed out on my blog, I think there are two explanations (see also Schooler, 2011).  Either unpublished studies with negative results were omitted or measurement of PSI makes the effect disappear.  What is probably most interesting is to know what you did when you encountered a promising pilot study.  Did you then start collecting new data with this promising procedure or did you continue collecting data and retained the pilot data?

Best, Dr. Schimmack


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Friday, January 12, 2018 2.17 PM

Dear Dr. Schimmack,

You are correct that I calculated all hit rates against a fixed null of 50%.

You are also correct that the first 91 participants (Spring semester of 2002) were exposed to 48 trials: 16 Negative images, 16, Erotic images, and 16 Neutral Images.

We continued with that same protocol in the Fall semester of 2002 for 19 additional sessions, sessions 51-91.

At this point, it was becoming clear from post-session debriefings of participants that the erotic pictures from the Affective Picture System (IAPS) were much too mild, especially for male participants.

(Recall that this was chronologically my first experiment and also the first one to use erotic materials.  The observation that mild erotic stimuli are insufficiently arousing, at least for college students, was later confirmed in our 2016 meta-analysis, which found that Wagenmakers attempt to replicate my Experiment #1 (Which of two curtains hides an erotic picture?) using only mild erotic pictures was the only replication failure out of 11 replication attempts of that protocol in our database.)  In all my subsequent experiments with erotic materials, I used the stronger images and permitted participants to choose which kind of erotic images (same-sex vs. opposite-sex erotica) they would be seeing.

For this reason, I decided to introduce more explicit erotic pictures into this attempted replication of the habituation protocol.

In particular, Sessions 92-110 (19 sessions) also consisted of 48 trials, but they were divided into 12 Negative trials, 12 highly Erotic trials, & 24 Neutral trials.

Finally, Sessions 111-150 (40 sessions) increased the number of trials to 60:  15 Negative trials, 15 Highly Erotic trials, & 30 Neutral trials.  With the stronger erotic materials, we felt we needed to have relatively more neutral stimuli interspersed with the stronger erotic materials.

Daryl Bem


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 11.08 AM

Dear Dr. Bem,

I conducted further analyses and I figured out why I obtained discrepant results for Study 6.

I computed difference scores with the control condition, but the article reports results for a one-sample t-test of the hit rates against an expected value of 50%.

I also figured out that the first 91 participants were exposed to 16 critical trials and participants 92 to 150 were exposed to 30 critical trials. Can you please confirm this?

Best, Dr. Schimmack


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Thursday, January 11, 2018 10.53 PM

I’ll check them tomorrow to see where the problems are.


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 10, 2018 5.41 PM

Dear Dr. Bem,

I just double checked the data you sent me today and they match the data you sent me in 2015.

This means neither of these datasets reproduces the results reported in your 2011 article.

This means your article reported two more significant results (Study 6, Negative and Erotic) than the data support.

This raises further concerns about the credibility of your published results, in addition to the decline effect that I found in your data (except in Study 6, which also produced non-significant results).

Do you still believe that your 2011 studies provided credible information about time-reversed causality or do you think that you may have capitalized on chance by conducting many pilot studies?

Best, Dr. Schimmack


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 10, 2018 5:03 PM

Dear Dr. Bem,

Frequencies of male and female in dataset 5.

> table(bem5$Participant.Sex)

Female   Male
63     37

Article “One hundred Cornell undergraduates, 63 women and 37 men,
were recruited through the Psychology Department’s”

Analysis of dataset 5

One Sample t-test
data:  bem5$N.PC.C.PC[b:e]
t = 2.7234, df = 99, p-value = 0.007639
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.137678 7.245655
sample estimates:
mean of  x

Article “t(99) =  2.23, p = .014”

Gender of participants matches.
t-values do not match, but both are significant.

Frequencies of male and female in dataset 6.

> table(bem6$Participant.Sex)

Female   Male
87     63

Article: Experiment 6: Retroactive Habituation II
One hundred fifty Cornell undergraduates, 87 women and 63


Paired t-test
data:  bem6$NegHits.PC and bem6$ControlHits.PC
t = 1.4057, df = 149, p-value = 0.1619
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8463098  5.0185321
sample estimates:
mean of the differences


Paired t-test
data:  bem6$EroticHits.PC and bem6$ControlHits.PC
t = -1.3095, df = 149, p-value = 0.1924
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.2094289  0.8538733
sample estimates:
mean of the differences


Article: “Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, 51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041, thereby providing a successful replication of Experiment 5. On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, 48.2%, t(149) = -1.77, p = .039, d = 0.14, binomial z = -1.74, p = .041.”

The t-values do not match: the article reports significant results, but the data you shared show non-significant results, even though the gender composition matches the article.

I will double-check the data files that you sent me in 2015 against the one you are sending me now.

Let’s first understand what is going on here before we discuss other issues.

Best, Dr. Schimmack


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Wednesday, January 10, 2018 4:42 PM

Dear Dr. Schimmack,

Sorry for the delay.  I have been busy re-programming my new experiments so they can be run online, requiring me to relearn the programming language.

The confusion you have experienced arises because the data from Experiments 5 and 6 in my article were split differently for exposition purposes. If you read the report of those two experiments in the article, you will see that Experiment 5 contained 100 participants experiencing only negative (and control) stimuli.  Experiment 6 contained 150 participants who experienced negative, erotic, and control stimuli.

I started Experiment 5 (my first precognitive experiment) in the Spring semester of 2002. I ran the pre-planned 100 sessions, using only negative and control stimuli.  During that period, I was alerted to the 2002 publication by Dijksterhuis & Smith in the journal Emotion, in which they claimed to demonstrate the reverse of the standard “familiarity-promotes-liking” effect, showing that people also adapt to stimuli that are initially very positive and hence become less attractive as the result of multiple exposures.

So after completing my 100 sessions, I used what remained of the Spring semester to design and run a version of my own retroactive experiment that included erotic stimuli in addition to the negative and control stimuli.  I was able to run 50 sessions before the Spring semester ended, and I resumed that extended version of the experiment in the following Fall semester, when student-subjects again became available, until I had a total of 150 sessions of this extended version.  For purposes of analysis and exposition, I then divided the experiments as described in the article:  100 sessions with only negative stimuli and 150 sessions with negative and erotic stimuli.  No subjects or sessions have been added or omitted; they were just re-assembled to reflect the change in protocol.

I don’t remember how I sent you the original data, so I am attaching a comma-delimited file (which will open automatically in Excel if you simply double-click or right-click it).  It contains all 250 sessions ordered by date.  The fields provided are:  Session number (numbered from 1 to 250 in chronological order), the date of the session, the sex of the participant, % of hits on negative stimuli, % of hits on erotic stimuli (which is blank for the 100 subjects in Experiment 5), and % of hits on neutral stimuli.
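[For readers following along: the 250 chronologically ordered sessions described here can be split back into the two experiments as reported in the article. A minimal R sketch, with an illustrative file name:]

all250 <- read.csv("bem_habituation_sessions.csv")   # hypothetical file name
exp5 <- all250[1:100, ]     # first 100 sessions: negative and control stimuli only
exp6 <- all250[101:250, ]   # remaining 150 sessions: negative, erotic, and control stimuli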

Let me know if you need additional information.

I hope to get to your blog post soon.

Daryl Bem


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 6, 2018 11:43 AM

Dear Dr. Bem,

Please reply to my email as soon as possible.  Other researchers are interested in analyzing the data, and if I submit my analyses, some journals will require me to provide the data or an explanation of why I cannot share them.  I hope to hear from you by the end of this week.

Best, Dr. Schimmack


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 6, 2018 11:43 AM

Dear Dr. Bem,

Meanwhile I posted a blog post about your 2011 article.  It has been well received by the scientific community.  I would like to encourage you to comment on it.


Dr. Schimmack


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 3, 2018 4:12 PM

Dear Dr. Bem,

I am finally writing up the results of my reanalyses of your ESP studies.

I encountered one problem with the data for Study 6.

I cannot reproduce the test results reported in the article.

The article states:

“Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, 51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041, thereby providing a successful replication of Experiment 5. On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, 48.2%, t(149) = -1.77, p = .039, d = 0.14, binomial z = -1.74, p = .041.”

I obtain

t = 1.4057, df = 149, p-value = 0.1619

t = -1.3095, df = 149, p-value = 0.1924

Also, I wonder why the first 100 cases often produce decimals of .25 and the last 50 cases produce decimals of .33.

It would be nice if you could look into this and let me know what could explain the discrepancy.

Uli Schimmack
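[The decimal pattern noted above is informative because each hit rate is a whole number of hits divided by the number of critical trials. A quick illustration in R, assuming 16 and 30 critical trials per session (the counts suggested elsewhere in the exchange):]

round(100 * (0:16) / 16, 2)   # multiples of 6.25  -> decimals .00, .25, .50, .75
round(100 * (0:30) / 30, 2)   # multiples of 3.33  -> decimals .00, .33, .67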


From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Wednesday, February 25, 2015 2:47 AM

Dear Dr. Schimmack,

Attached is a folder of the data from my nine “Feeling the Future” experiments.  The files are plain text files, one line for each session, with variables separated by tabs.  The first line of each file is the list of variable names, also separated by tabs. I have omitted participants’ names but supplied their sex and age.

You should consult my 2011 article for the descriptions and definitions of the dependent variables for each experiment.

Most of the files contain the following variables: Session#, Date, StartTime, Session Length, Participant’s Sex, Participant’s Age, Experimenter’s Sex,  [the main dependent variable or variables], Stimulus Seeking score (from 1 to 5).
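[Such a tab-delimited file can be read directly into R. A minimal sketch with an illustrative file name; the column name Participant.Sex matches the console output shown above:]

bem5 <- read.delim("Experiment5.txt")   # tab-separated, first line contains the variable names
table(bem5$Participant.Sex)             # check the gender composition against the article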

For the priming experiments (#3 & #4), the dependent variables are LnRT Forward and LnRT Retro, where Ln is the natural log of the response times. As described in my 2011 publication, each response time (RT) is transformed by taking the natural log before being entered into calculations.  The software subtracts the mean transformed RT for congruent trials from the mean transformed RT for incongruent trials, so positive values of LnRT indicate that the person took longer to respond to incongruent trials than to congruent trials.  Forward refers to the standard version of affective priming, and Retro refers to the time-reversed version.  In the article, I show the results for both the Ln transformation and the inverse transformation (1/RT) under two different outlier definitions.  In the attached files, I provide the results using the Ln transformation and the definition of a too-long RT outlier as any RT exceeding 2500 ms.
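[A minimal R sketch of the LnRT score described here for a single session, with hypothetical vectors rt (response times in ms) and congruent (logical); dropping too-long RTs is assumed as the outlier handling, and this is an illustration, not Bem’s software:]

LnRT_diff <- function(rt, congruent) {
  keep <- rt <= 2500                      # apply the 2500 ms outlier cutoff (exclusion assumed)
  lnrt <- log(rt[keep])                   # natural-log transform of the remaining RTs
  cong <- congruent[keep]
  mean(lnrt[!cong]) - mean(lnrt[cong])    # incongruent minus congruent mean ln(RT)
}                                         # positive values = slower on incongruent trials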

Subjects who made too many errors (> 25%) in judging the valence of the target picture were discarded. Thus, 3 subjects were discarded from Experiment #3 (hence N = 97) and 1 subject was discarded from Experiment #4 (hence N  = 99).  Their data do not appear in the attached files.

Note that the habituation experiment #5 used only negative and control (neutral) stimuli.

Habituation experiment #6 used Negative, erotic, and Control (neutral) stimuli.

Retro Boredom experiment #7 used only neutral stimuli.

In Experiment #8, the first Retro Recall experiment, the first 100 sessions are experimental sessions.  The last 25 sessions are no-practice control sessions.  The type of session is the second variable listed.

In Experiment #9, the first 50 sessions are the experimental sessions and the last 25 are no-practice control sessions.  Be sure to exclude the control sessions when analyzing the main experimental sessions.  The summary measure of psi performance is the Precog% Score (DR%), whose definition you will find on page 419 of my article.
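[A minimal sketch of that exclusion in R, assuming the sessions are in order and the data frames are named bem8 and bem9 (illustrative names):]

bem8_main <- bem8[1:100, ]   # Experiment #8: first 100 sessions are experimental
bem9_main <- bem9[1:50, ]    # Experiment #9: first 50 sessions are experimental
# the remaining 25 rows of each file are the no-practice control sessions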

Let me know if you encounter any problems or want additional data.

Daryl J.  Bem
Professor Emeritus of Psychology