
Z-curve vs. P-curve: Breakdown of an attempt to resolve disagreement in private.

Background:   In a tweet that I can no longer find because Uri Simonsohn blocked me from his Twitter account, Uri suggested that it would be good if scientists could discuss controversial issues in private before they start fighting on social media.  I was just about to submit a manuscript that showed some problems with his p-curve approach to power estimation and demonstrated that z-curve works better in some situations, namely when there is substantial variation in statistical power across studies. So, I thought I would give it a try and sent him the manuscript so that we could try to find agreement in a private email exchange.

The outcome of this attempt was that we could not reach agreement on this topic.  At best, Uri admitted that p-curve is biased when some extreme test statistics (e.g., F(1,198) = 40, or t(48) = 5.00) are included in the dataset.  He likes to call these values outliers. I consider them part of the data that influence the variability and distribution of test statistics.

For the most part, Uri disagreed with my conclusions and considers the simulation results that support my claims unrealistic.   Meanwhile, Uri published a blog post with simulations that have only small heterogeneity to claim that p-curve works even better than z-curve when there is heterogeneity.

The reason for the discrepancy between his results and my results is different assumptions about what constitutes realistic variability in the strength of evidence against the null-hypothesis, as reflected in absolute z-scores (two-tailed p-values from t-tests or F-tests converted into z-scores by means of z = -qnorm(p/2)).
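As a minimal illustration (not the p-curve or z-curve code itself, which I will share in the separate blog post mentioned below), the conversion can be done in one line of R:

p.2t <- c(.05, .01, .001)     # two-tailed p-values from t-tests or F-tests
z    <- -qnorm(p.2t / 2)      # absolute z-scores: 1.96, 2.58, 3.29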

To give everybody an opportunity to examine the arguments that were exchanged during our discussion of p-curve versus z-curve, I am sharing the email exchange.  I hope that more statisticians will examine the properties of p-curve and z-curve and add to the discussion.  To facilitate this, I will make the r-code to run simulation studies of p-curve and z-curve available in a separate blog post.

P.S.  P-curve is available as an online app that provides power estimates without any documentation of how p-curve behaves in simulation studies or any warning that datasets with large test statistics can produce inflated estimates of average power.

My email correspondence with Uri Simonsohn – RE: p-curve and heterogeneity

From:    URI
To:          ULI
Date:     11/24/2017

Hi Uli,

I think email is better at this point.

Ok I am behind on a ton of stuff and have a short workday today so I cannot look in detail at your z-curve paper right now.

I did a quick search for "osf", "http" and "code" and could not find the R code; it may facilitate things if you can share it. Mostly, I would like the code that shows p-curve is biased, especially looking at how the population parameter being estimated is being defined.

I then did a search for “p-curve” and found this

Quick reactions:

1)            For power estimation, p-curve does not assume homogeneity of effect size; indeed, if anything it assumes homogeneity of power and allows each study to have a different effect size. But it is not really assuming a single power; it is asking what single power best fits the data, which is a different thing. It is computing an average. All average computations ask "what single value best fits the data," but that's not the same as saying "I think all values are identical, and identical to the average."

2)            We do report a few tests of the impact of heterogeneity on p-curve, maybe you have something else in mind. But here they go just in case:

Figure 2C in our POPS paper has d ~ N(x, sd = .2)

[Clarification: This Figure shows estimation of effect sizes. It does not show estimation of power.]

Supplement 2

[Again. It does not show simulations for power estimation.]

A key thing to keep in mind is the population parameter of interest. P-curve does not estimate the population effect size or power of all studies attempted, published, reported, etc. It does so for the set of studies included in p-curve. So note, for example, in Figure S2C above, that when half of the attempted studies are .5 and half are .3, p-curve estimates the average included study accurately but differently from .4. The truth is .48 for included studies, p-curve says .47, and the average attempted study is .4.

[This is not the issue. Replicability implies conditioning on significance. We want to predict the success rate of studies that replicate significant results. Of course it is meaningful to do follow up studies on non-significant results. But the goal here is not to replicate another inconclusive non-significant result.]

Happy to discuss of course, Uri


From     ULI
To           URI
Date      11/24/2017

Hi Uri,

I will change the description of your p-curve code for power.

Honestly, I am not fully clear about what the code does or what the underlying assumptions are.

So, thanks for clarifying.

I agree with you that pcurve (also puniform) provides surprisingly robust estimates of effect sizes even with heterogeneity (I have pointed that out in comments in the Facebook Discussion group), but that doesn't mean it works well for power.   If you have published any simulation tests for the power estimation function, I am happy to cite them.

Attached is a single R code file that contains (a) my shortened version of your p-curve code, (b) the z-curve code, (c) the code for the simulation studies.

The code shows the cumulative results. You don’t have to run all 5,000 replications before you see the means stabilizing.

Best, Uli


From     URI
To           ULI
Date      11/27/2017

Hi Uli,

Thanks for sending the code, I am trying to understand it.  I am a little confused about how the true power is being generated. I think you are drawing "noncentrality" parameters (ncp) that are skewed, and then turning those into power, rather than directly drawing skewed power values, correct? (I am not judging that as good or bad, I am just verifying).

[Yes that is correct]

In any case, I created a histogram of the distribution of true power implied by the ncp’s that you are drawing (I think, not 100% sure I am getting that right).

For scenario 3.1 it looks like this:




For scenario 3.3 it looks like this:



(the only code I added was to turn all the true power values into a vector before averaging it, and then plotting a histogram for that vector; if interested, you can copy-paste this into the line of code that just reads "tp" in your code and you will reproduce my histogram)


power.i=pnorm(z,z.crit)[obs.z > z.crit]                # line added by Uri Simonsohn to look at the distribution

hist(power.i, xlab='true power of each study')

mtext(side=3, line=0, paste0("mean=", mean.pi, "   median=", median.pi, "   sd=", sd.pi))

I wanted to make sure

1)            I am correctly understanding this variable as being the true power of the observed studies, the average/median of which we are trying to estimate

2)            Those distributions are the distributions you intended to generate

[Yes, that is correct. To clarify, 90% power for p < .05 (two-tailed) is obtained with a z-score of  qnorm(.90, 1.96)  = 3.24.   A z-score of 4 corresponds to 97.9% power.  So, in the literature with adequately powered studies, we would expect studies to bunch up at the upper limit of power, while some studies may have very low power because the theory made the wrong prediction and effect sizes are close to zero and power is close to alpha (5%).]
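[Illustration: these numbers can be checked with two lines of R, using the two-tailed critical value of 1.96 as the mean of the sampling distribution of the z-scores.]

qnorm(.90, mean = 1.96)    # non-centrality needed for 90% power: 3.24
pnorm(4, mean = 1.96)      # power implied by a non-centrality of 4: about .979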

Thanks, Uri


From     ULI
To           URI
Date      11/27/2017

Hi Uri,

Thanks for getting back to me so quickly.   You are right, it would be more accurate to describe the distribution as the distribution of the non-centrality parameters rather than power.

The distribution of power is also skewed but given the limit of 1,  all high power studies will create a spike at 1.  The same can happen at the lower end and you can easily get U-shaped distributions.

So, what you see is something that you would also see in actual datasets.  Actually, the dataset minimizes skew because I only used non-centrality parameters from 0 to 6.

I did this because z-curve only models z-values between 0 and 6 and treats all observed z-scores greater than 6 as having a true power of 1.  That reduces the pile on the right side.

You could do the same to improve performance of p-curve, but it will still not work as well as z-curve, as the simulations with z-scores below 6 show.
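[Illustration: a minimal sketch of the censoring step described above; the variable names and the way the two parts are averaged are mine, not the actual z-curve code.]

z.obs     <- c(2.1, 2.6, 3.5, 7.2)                   # observed absolute z-scores (made-up values)
extreme   <- z.obs > 6                               # z-scores above 6 are treated as true power = 1
est.below <- 0.62                                    # placeholder for the model-based estimate from z-scores <= 6
mean(extreme) * 1 + (1 - mean(extreme)) * est.below  # overall estimate combines both parts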

Best, Uli


From     URI
To           ULI
Date      11/27/2017

OK, yes, probably worth clarifying that.

Ok, now I am trying to make sure I understand the function you use to estimate power with z-curve.

If I  see p-values, say c(.001,.002,.003,.004,.005) and I wanted to estimate true power for them via z-curve, I would run:

p= c(.001,.002,.003,.004,.005)

z= -qnorm(p/2)


And estimate true power to be 85%, correct?
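[Illustration: the conversion for these five p-values, together with the naive average of observed power; this simple average is not the z-curve estimate itself, which fits a mixture model to the z-scores, but it gives a similar number in this homogeneous example.]

p <- c(.001, .002, .003, .004, .005)
z <- -qnorm(p / 2)                # 3.29, 3.09, 2.97, 2.88, 2.81
mean(pnorm(z, mean = 1.96))       # naive average observed power, about .85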



From     ULI
To           URI
Date      11/27/2017



From     URI
To           ULI
Date      11/27/2017

Hi Uli,

To make sure I understood z-curve's function I ran a simple simulation.
I am getting somewhat biased results with z-curve; do you want to take a look and see if I may be doing something wrong?

I am attaching the code, I tried to make it clear but it is sometimes hard to convey what one is trying to do, so feel free to ask any questions.



From     ULI
To           URI
Date      11/27/2017

Hi Uri,

What is the k in these simulations?   (z-curve requires somewhat large k because the smoothing of the density function can distort things)

You may also consult this paper (the smallest k was 15 in this paper).

In this paper, we implemented pcurve differently, so you can ignore the p-curve results.

If you get consistent underestimation with z-curve, I would like to see how you simulate the data.

I haven’t seen this behavior in z-curve in my simulations.

Best, Uli


From     URI
To           ULI
Date      11/27/2017

Hi Uli,

I don’t know where “k” is set, I am using the function you sent me and it does not have k as a parameter

I am running this:

fun.zcurve = function(z.val.input, z.crit = 1.96, Int.End=6, bw=.05) {…

Where would k be set?

Into the function you have this

### resolution of density function (doesn’t seem to matter much)

bars = 500

Is that k?



From     ULI
To           URI
Date      11/27/2017

I mean the number of test statistics that you submit to z-curve.



From     ULI
To           URI
Date      11/27/2017

I just checked with k = 20, the z-curve code I sent you underestimates fixed power of 80 as 72.

The paper I sent you shows a similar trend with true power of 75.

k             15     25    50    100  250
Z-curve 0.704 0.712 0.717 0.723 0.728

[Clarification: This is from the Brunner & Schimmack, 2016, article]


From     URI
To           ULI
Date      11/30/2017

Hi Uli,

Sorry for disappearing, got distracted with other things.

I looked a bit more at the apparent bias downwards that z-curve has on power estimates.

First, I added p-curve’s estimates to the chart I had sent, I know p-curve performs well for that basic setup so I used it as a way to diagnose possible errors in my simulations, but p-curve did correctly recover power, so I conclude the simulations are fine.

If you spot a problem with them, however, let me know.


From     ULI
To           URI
Date      11/30/2017

Hi Uri,

I am also puzzled why z-curve underestimates power in the homogeneous case even with large N.  This is clearly an undesirable behavior and I am going to look for solutions to the problem.

However, in real data that I analyze, this is not a problem because there is heterogeneity.

When there is heterogeneity, z-curve performs very well, no matter what the distribution of power/non-centrality parameters is. That is the point of the paper.  Any comments on comparisons in the heterogeneous case?

Best, Uli


From     URI
To           ULI
Date      11/30/2017

Hey Uli,

I have something with heterogeneity but want to check my work and am almost done for the day, will try tomorrow.


[Remember: I supplied Uri with r-code to rerun the simulations of heterogeneity and he ran them to show what the distribution of power looks like.  So at this point we could discuss the simulation results that are presented in the manuscript.]


From     ULI
To           URI
Date      11/30/2017

I ran simulations with t-distributions and N = 40.

The results look the same for me.

Mean estimates for 500 simulations

32, 48, 75

As you can see, p-curve also has bias when t-values are converted into z-scores and then analyzed with p-curve.

This suggests that with small N,  the transformation from t to z introduces some bias.
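[Illustration: a minimal sketch of the t-to-z conversion in question, using made-up values; with df = 38 (roughly N = 40 in a two-group design) the converted z-score is noticeably smaller than the t-value, and the difference shrinks as N grows.]

t <- 2.5; df <- 38                 # made-up test statistic and degrees of freedom
p <- 2 * (1 - pt(t, df))           # two-tailed p-value
z <- -qnorm(p / 2)                 # converted z-score, about 2.39 here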

The simulations by Jerry Brunner showed less bias because we used the sample sizes in Psych Science for the simulation (median N ~ 80).

So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.


From     URI
To           ULI
Date      11/30/2017

Hi Uli,

The fact that p-curve is also biased when you convert to z-scores suggests to me that approximation is indeed part of the problem.

[Clarification: I think URI means z-curve]

Fortunately p-curve analysis does not require that transformation and one of the reasons we ask in the app to enter test-statistics is to avoid unnecessary transformations.

I guess it would also be true that if you added .012 to p-values p-curve would get it wrong, but p-curve does not require one to add .012 to p-values.

You write “So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.”

Only partial agreement, because the statement implies that for larger N and larger K z-curve is not biased, I believe it is also biased for large k and large N. Here, for instance, is the chart with n=50 per cell (N=100 total) and 50 studies total.

Today I modified the code I sent you so that it would accommodate any power distribution in the submitted studies, not just a fixed level. (attached)

I then used the new montecarlo function to play around with heterogeneity and skewness.

The punchline is that p-curve continues to do well, and z-curve continues to be biased downward.

I also noted, by computing the standard deviation of estimates across simulations, that p-curve has slightly less random error.

My assessment is that z-curve and p-curve are very similar and will generally agree, but that z-curve is more biased and has more variance.

In any case, let's get to the simulations. Below I show 8 scenarios sorted by the ex-post average true power for the sets of studies.

[Note, N = 20 per cell.  As I pointed out earlier, with these small sample sizes the t to z-transformation is a factor. Also k = 20 is a small set of studies that makes it difficult to get good density distributions.  So, this plot is p-hacked to show that p-curve is perfect and z-curve consistently worse.  The results are not wrong, but they do not address the main question. What happens when we have substantial heterogeneity in true power?  Again, Uri has the data, he has the r-code, and he has the results that show p-curve starts overestimating.  However, he ignores this problem and presents simulations that are most favorable for p-curve.]


From     ULI
To           URI
Date      12/1/2017

Hi Uri,

I really do not care so much about bias in the homogeneous case. I just fixed the problem by first doing a test of the variance and, if the variance is small, using a fixed-effects model.

[Clarification:  This is not yet implemented in z-curve and was not done for the manuscript submitted for publication which just acknowledges that p-curve is superior when there is no heterogeneity.]

The main point of the manuscript is really about data that I actually encounter in the literature (see demonstrations in the manuscript, including power posing) where there is considerable heterogeneity.

In this case, p-curve overestimates as you can see in the simulations that I sent you.   That is really the main point of the paper and any comments from you about p-curve and heterogeneity would be welcome.

And, I did not mean to imply that pcurve needs transformation. I just found it interesting that transformation is a problem when N is small (as N gets bigger t approaches z and the transformation has less influence).

So, we are in agreement that pcurve does very well when there is little variability in the true power across studies.  The question is whether we are in agreement about heterogeneity in power?

Best, Uli


From     ULI
To           URI
Date      12/1/2017

Hi Uri,

Why not simulate scenarios that match onto real data?

[I attached data from my focal hypothesis analysis of Bargh’s book “Before you know it” ]


From     ULI
To           URI
Date      12/1/2017


Also, my simulations show that z-curve OVERestimates when true power is below 50%.   Do you find this as well?

This is important because power posing estimates are below 50%, so estimation problems with small k and N would mean that z-curve estimate is inflated rather than suggesting that p-curve estimate is correct.

Best, Uli


From     URI
To           ULI
Date      12/2/2017

Hi Uli,

The results I sent show substantial heterogeneity and p-curve does well, do you disagree?



From     URI
To           ULI
Date      12/2/2017

Not sure what you mean here. What aspect of real data would you like to add to the simulations? I did what I did to address the concerns you had that p-curve may not handle heterogeneity and skewed distributions of power, and it seems to do well with very substantial skew and heterogeneity.

What aspect are the simulations abstracting away from that you worry may lead p-curve to break down with real data?



From     ULI
To           URI
Date      12/2/2017

Hi Uri,

I think you are not simulating sufficient heterogeneity to see that p-curve is biased in these situations.

Let’s focus on one example (simulation 2.3) in the r-code I sent you: High true power (.80) and heterogeneity.

This is the distribution of the non-centrality parameters.

And this is the distribution of true power for p < .05 (two-tailed, |z| >= 1.96).

[Clarification: this is not true power, it is the distribution of observed absolute z-scores]

More important, the variance of the observed significant (z > 1.96) z-scores is 2.29.

[Clarification: In response to this email exchange, I added the variance of significant z-scores to the manuscript as a measure of heterogeneity.  Due to the selection for significance, variance with low power can be well below 1.   A variance of 2.29 is large heterogeneity. ]

In comparison the variance for the fixed model (non-central z = 2.80) is 0.58.
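[Illustration: the fixed-model value can be checked with a quick simulation; a non-central z of 2.80 corresponds to roughly 80% power at the two-tailed 5% level.]

set.seed(1)
z.obs <- rnorm(1e6, mean = 2.80)    # observed z-scores for a fixed non-central z of 2.80
var(z.obs[z.obs > 1.96])            # variance after selection for significance, about 0.58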

So, we can start talking about heterogeneity in quantitative terms. How much variance do your simulated observed p-values have when you convert them into z-scores?

The whole point of the paper is that the performance of p-curve suffers the greater the heterogeneity of true power is.  As sampling error is constant for z-scores, the variance of observed z-scores has a maximum of 1 if true power is constant. It is lower than 1 due to selection for significance, which is more severe the lower the power is.

The question is whether my simulations use some unrealistically large amount of heterogeneity.   I attached some figures for the Journal of Judgment and Decision Making.

As you can see, heterogeneity can be even larger than the heterogeneity simulated in scenario 2.3 (with a normal distribution around z = 2.75).

In conclusion, I don't doubt that you can find scenarios where p-curve does well with some heterogeneity.  However, the point of the paper is that it is possible to find scenarios where there is heterogeneity and p-curve does not do well.   What your simulations suggest is that z-curve can also be biased in some situations, namely with low variability, small N (so that the transformation to z-scores matters) and a small number of studies.

I am already working on a solution for this problem, but I see it as a minor problem because most datasets that I have examined (like the ones that I used for the demonstrations in the ms) do not match this scenario.

So, if I can acknowledge that p-curve outperforms z-curve in some situations, I wonder whether you can do the same and acknowledge that z-curve outperforms p-curve when power is relatively high (50%+) and there is substantial heterogeneity?

Best, Uli


From     ULI
To           URI
Date      12/2/2017

What surprises me is that I sent you R-code with 5 simulations that showed when p-curve is breaking down (starting with normally distributed variability of non-central z-scores and 50% power (sim 2.2), followed by higher power (80%) and all skewed distributions (sim 3.1, 3.2, 3.3)).  Do you find a problem with these simulations or is there some other reason why you ignore these simulation studies?


From     ULI
To           URI
Date      12/2/2017

I tried "power = runif(n.sim)*.4 + .58" with k = 100.

Now pcurve starts to overestimate and zcurve is unbiased.

So, k makes a difference.  Even if pcurve does well with k = 20,  we also have to look for larger sets of studies.

Results of 500 simulations with k = 100


From     ULI
To           URI
Date      12/2/2017

Even with k = 40,  pcurve overestimates as much as zcurve underestimates.

zcurve           pcurve
Min.   :0.5395   Min.   :0.5600
1st Qu.:0.7232   1st Qu.:0.7900
Median :0.7898   Median :0.8400
Mean   :0.7817   Mean   :0.8246
3rd Qu.:0.8519   3rd Qu.:0.8700
Max.   :0.9227   Max.   :0.9400


From     ULI
To           URI
Date      12/2/2017

Hi Uri,

This is what I find with systematic variation of number of studies (k) and the maximum heterogeneity for a uniform distribution of power and average power of 80% after selection for significance.

power = runif(n.sim)*.4 + .58

             zcurve    pcurve
k = 20        77.5      81.2
k = 40        78.2      82.5
k = 100       79.3      82.7
k = 10000     80.2      81.7

(1 run)
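[Illustration: the "average power of 80% after selection for significance" can be checked with a quick simulation; a study enters the significant set with probability equal to its power, so the post-selection average is a power-weighted average.]

set.seed(1)
power <- runif(1e6) * .4 + .58          # uniform true power between .58 and .98
sig   <- rbinom(1e6, 1, power) == 1     # each study is significant with probability equal to its power
mean(power[sig])                        # average true power of the significant studies, about .80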

If we are going to look at k = 20, we also have to look at k = 100.

Best, Uli


From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Why did you truncate the beta distributions so that they start at 50% power?

Isn’t it realistic to assume that some studies have less than 50% power, including false positives (power = alpha = 5%)?

How about trying this beta distribution?


80% true power after selection for significance.

Best, Uli


From     URI
To           ULI
Date      12/2/2017

Hi Uli,

I know I have a few emails from you, thanks.

My plan is to get to them on Monday or Tuesday. OK?



Hi Uli,

We have a blogpost going up tomorrow and have been distracted with that; made some progress with z- vs p- but am not ready yet.

Sorry Uri


From     URI
To           ULI
Date      12/2/2017

Hi Uli,

Ok, finally I have time to answer your emails from over the weekend.

Why did I run something different?

First, you asked why I ran simulations that were different from those you have in your paper (scenarios 2.1 and 3.1).

The answer is that I tried to simulate what I thought you were describing in the text: heterogeneity in power that was skewed.

When I saw you had run simulations that led to a power distribution that looked like this:

I assumed that was not what was intended.

First, that’s not skewed

Second, that seems unrealistic, you are simulating >30% of studies powered above 90%.

[Clarification:  If studies were powered at 80%, 33% of studies would have observed power above 90%:


It is important to remember that we are talking only about studies that produced a significant result. Even if many null-hypotheses are tested, relatively few of these would make it into the set of studies that produced a significant result.  Most importantly, this claim ignores the examples in the paper and my calculations of heterogeneity that can be used to compare simulations of heterogeneity with real data.]
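[Illustration: one way to reproduce the 33% figure; with a fixed true power of 80%, observed power exceeds 90% whenever the observed z-score exceeds 1.96 + qnorm(.90).]

ncp  <- qnorm(.80) + 1.96          # non-centrality corresponding to 80% power: 2.80
z.90 <- qnorm(.90) + 1.96          # observed z-score at which observed power reaches 90%: 3.24
1 - pnorm(z.90, mean = ncp)        # share of studies with observed power above 90%: about .33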

Third, when one has extremely bimodal data, central tendency measures are less informative/important (e.g., the average human wears half a bra). So if indeed power was distributed that way, I don't think I would like to estimate average power anyway. And if it were, saying the average is 60% or 80% is almost irrelevant; hardly any studies are in that range in reality (like saying the average person wears .65 bras: that's wrong, but inconsequentially worse than .5 bras).

Fourth, if indeed 30% of studies have >90% power, we don't need p-curve or z-curve. Stuff is gonna be obviously true to the naked eye.

But below I will ignore these reservations and stick to that extreme bimodal distribution you propose that we focus our attention on.

The impact of null findings

Actually, before that, let me acknowledge I think you raised a very valid point about the importance of adding null findings to the simulations. I don’t think the extreme bimodal you used is the way to do it, but I do think power=5% in the mix does make sense.

We had not considered p-curve’s performance there and we should have.

Prompted by this exchange I did that, and I am comfortable with how p-curve handles power=5% in the mix.

For example, I considered 40 studies, starting with all 40 null, and then having an increasing number drawn from U(40%-80%) power. Looks fine.

Why p-curve overshoots?

Ok. So having discussed the potential impact of null findings on estimates, and leaving aside my reservations about defining the extreme bimodal distribution of power as something we should worry about, let's try to understand why p-curve over-estimates and z-curve does not.

Your paper proposes it is because p-curve assumes homogeneity.

It doesn’t. p-curve does not assume homogeneity of power any more than computing average height involves assuming homogeneity of height. It is true that p-curve does not estimate heterogeneity in power, but averaging height also does not compute the SD(height). P-curve does not assume it is zero, in fact, one could use p-curve results to estimate heterogeneity.

But in any case, is z-curve handling the extreme bimodal better thanks to its mixture of distributions, as you propose in the paper, or due to something else?

Because power is nonlinearly related to ncp, I assumed it had to do with the censoring of high z-values you did rather than the mixture (though I did not actually look into the mixture in any detail at all).

To look into that I censored t-values going into p-curve. Not as a proposal for a modification but to make the discussion concrete. I censored at t < 3.5 so that any t > 3.5 is replaced by 3.5 before being entered into p-curve.  I did not spend much time fine-tuning it and I am definitely not proposing that if one were to censor t-values in p-curve they should be censored at 3.5.
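[Illustration: a minimal sketch of the censoring Uri describes, using made-up t-values.]

t.obs  <- c(2.2, 2.8, 5.1, 7.3)    # made-up t-values
t.cens <- pmin(t.obs, 3.5)         # any t > 3.5 is replaced by 3.5 before being entered into p-curve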


Ok, so I ran p-curve with censored t-values for the rbeta() distribution you sent and for various others of the same style.

We see that censored p-curve behaves very similarly to z-curve (which is censored also).

I also tried adding more studies, running rbeta(3,1) and (1,3), etc. Across the board, I find that if there is a high share of extremely high-powered studies, censored p-curve and z-curve look quite similar.

If we knew nothing else, we would be inclined to censor p-curve going forward, or to use z-curve instead. But censored p-curve, and especially z-curve, give worse answers when the set of studies does not include many extremely high-powered ones, and in real life we don't have many extremely high-powered studies. So z-curve and censored p-curve make gains in a world that I don't think exists, and exhibit losses in one that I do think exists.

In particular, z-curve estimates power to be about 10% when the null is true, instead of 5% (censored p-curve actually gets this one right; the null is estimated at 5%).

Also, z-curve underestimates power in most scenarios not involving an extreme bimodal distribution (see charts I sent in my previous email). In addition, z-curve tends to have higher variance than p-curve.

As indicated in my previous email, z-curve and p-curve agree most of the time, their differences will typically be within sampling error. It is a low stakes decision to use p-curve vs z-curve, especially compared to the much more important issue of which studies are selected and which tests are selected within studies.

Thanks for engaging in this conversation.

We don’t have to converge to agreement to gain from discussing things.

Btw, we will write a blog post on the repeated and incorrect claim that p-curve assumes homogeneity and does not deal with heterogeneity well. We will send you a draft when we do, but it could be several weeks till we get to that. I don't anticipate it being a contentious post from your point of view but figured I would tell you about it now.



From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Now that we are on the same page, the only question is what is realistic.

First, your blog post on outliers already shows what is realistic. A single outlier in the power pose study increases the p-curve estimate by more than 10% points.

You can fix this now, but p-curve as it existed did not do this.   I would also describe this as a case of heterogeneity. Clearly the study with z = 7 is different from studies with z = 2.

This is in the manuscript that I asked you to evaluate and you haven’t commented on it at all, while writing a blog post about it.

The paper contains several other examples that are realistic because they are based on real data.

I mainly present them as histograms of z-scores rather than histograms of p-values or observed power because I find the distribution of the z-scores more informative (e.g., where is the mode, is the distribution roughly normal, etc.), but if you convert the z-scores into power you get distributions like the one shown below (U-shaped), which is not surprising because power is bounded at alpha and 1.  So, that is a realistic scenario, whereas your simulations of truncated distributions are not.

I think we can end the discussion here.  You have not shown any flaws with my analyses. You have shown that under very limited and unrealistic situations p-curve performs better than z-curve, which is fine because I already acknowledged in the paper that p-curve does better in the homogeneous case.

I will change the description of the assumption underlying p-curve, but leave everything else as is.

If you think there is an error let me know, but I have been waiting patiently for you to comment on the paper, and I have examined your new simulations.

Best, Uli


Hi Uri,

What about the real world of power posing?

A few z-scores greater than 4 mess up p-curve as you just pointed out in your outlier blog.

I have presented several real world data to you that you continue to ignore.

Please provide one REAL dataset where p-curve gets it right and z-curve underestimates.

Best, Uli

From     URI
To           ULI
Date      12/6/2017

Hi Uli,

With real datasets you don't know true power so you don't know what's right and wrong.

The point of our post today is that there is no point statistically analyzing the studies that Cuddy et al put together, with p-curve or any other tool.

I personally don’t think we ever observe true power with enough granularity to make z- vs p-curve prediction differences consequential.

But I don’t think we, you and I, should debate this aspect (is this bias worth that bias). Let’s stick to debating basic facts such as whether or not p-curve assumes homogeneity, or z-curve differs from p-curve because of homogeneity assumption or because of censoring, or how big bias is with this or that assumption. Then when we write we present those facts as transparently as possible to our readers, and they can make an educated decision about it based on their priors and preferences.



From     ULI
To           URI
Date      12/6/2017

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values.



z-curve uses multiple parameters, which improves prediction when there is substantial heterogeneity?



In many cases, the differences are small and not consequential.



When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates.

(see simulations in our manuscript)
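[Illustration: a schematic sketch of the contrast summarized above; this is not the actual p-curve or z-curve code, and the two component values (1 and 3) are arbitrary. The single-parameter fit asks which one non-centrality value best fits the significant z-scores; the mixture fit estimates weights for several values.]

set.seed(1)
ncp <- sample(c(1, 3), 500, replace = TRUE)    # heterogeneous true non-centrality values
z   <- rnorm(500, mean = ncp)                  # observed z-scores
z   <- z[z > 1.96]                             # selection for significance

# single parameter: one non-centrality value for all studies (truncated normal likelihood)
ll.one <- function(d) sum(dnorm(z, d, log = TRUE) - pnorm(1.96, d, lower.tail = FALSE, log.p = TRUE))
optimize(function(d) -ll.one(d), c(0, 6))$minimum

# multiple parameters: a two-component mixture with an estimated weight
ll.mix <- function(w) sum(log(w * dnorm(z, 1) / pnorm(1.96, 1, lower.tail = FALSE) +
                        (1 - w) * dnorm(z, 3) / pnorm(1.96, 3, lower.tail = FALSE)))
optimize(function(w) -ll.mix(w), c(0, 1))$minimum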



I want to submit the manuscript by end of the week.

Best, Uli


From     ULI
To           URI
Date      12/6/2017

Going through the manuscript one more time, I found this.

To examine the robustness of estimates against outliers, we also obtained estimates for a subset of studies with z-scores less than 4 (k = 49).  Excluding the four studies with extreme scores had relatively little effect on z-curve; replicability estimate = 34%.  In contrast, the p-curve estimate dropped from 44% to 5%, while the 90%CI of p-curve ranged from 13% to 30% and did not include the point estimate.

Any comments on this? I mean, the point estimate is 5% and the 90%CI is 13 to 30%.

Best, Uli

[Clarification:  this was a mistake. I confused point estimate and lower bound of CI in my output]


From     URI
To           ULI
Date      12/7/2017

Hi Uli.

See below:

From: Ulrich Schimmack []

Sent: Wednesday, December 6, 2017 10:44 PM

To: Simonsohn, Uri <>

Subject: RE: Its’ about censoring i think

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values. 


z-curve uses multiple parameters,

Agree. I don't know the details of how z-curve works, but I suspect you do and are correct.

which improves prediction when there is substantial heterogeneity?


Few fronts.

1)            I don't think heterogeneity per se is the issue, but extremity of the values. P-curve is accurate with very substantial heterogeneity. In your examples what causes the trouble are those extremely high power values. Even with minimal heterogeneity you will get over-estimation if you use such values.

2)            I also don't know that it is the extra parameters in z-curve that are helping, because p-curve with censoring does just as well, so I suspect it is the censoring and not the multiple parameters. That's also consistent with z-curve under-estimating almost everywhere; the multiple parameters should not lead to that, I don't think.

In many cases, the differences are small and not consequential.

Agree, mostly. I would not state that in an unqualified way out of context.

For example, my personal assessment, which I realize you probably don't share, is that z-curve does worse in contexts that matter a bit more, and that are vastly more likely to be observed.

When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates. 

(see simulations in our manuscript)


You can have very substantial heterogeneity and very high power and p-curve is accurate (z-curve under-estimates).

For example, for the blogpost on heterogeneity and p-curve I figured that rather than simulating power directly it made more sense to simulate n and d distributions, over which people have better intuitions, and then see what happened to power (rather than simulating power or ncp directly).
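[Illustration: a minimal sketch of this approach with made-up n and d distributions (not the ones Uri used), using the normal approximation for a two-sample t-test.]

set.seed(1)
n <- sample(20:100, 20, replace = TRUE)          # per-cell sample sizes (made-up)
d <- rnorm(20, mean = .5, sd = .2)               # true effect sizes (made-up)
power <- pnorm(d * sqrt(n / 2) - qnorm(.975))    # implied true power of each study (approximation)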

Here is one example. Sets of 20 studies, drawn with n and d from the first two panels, with the implied true power and its estimate in the 3rd panel.

I don’t mention this in the post, but z-curve in this simulation under-estimates power, 86% instead of 93%

The parameters are



What you need for p-curve to over-estimate and for z-curve to not under-estimate is a substantial share of studies at both extremes: many null, many with power > 95%.

In general very high power leads to over-estimation, but it is trivial in the absence of many very low power studies that lower the average enough that it matters.

That’s the combination I find unlikely, 30%+ with >90% power and at the same time 15% of null findings (approx., going off memory here).

I don’t generically find high power with heterogeneity unlikely, I find the figure above super plausible for instance.

NOTE: For the post I hope to gain more insight on the precise boundary conditions for over-estimation, I am not sure I totally get it just yet.

I want to submit the manuscript by end of the week.

Hope that helps.  Good luck.

Best, Uli


From     URI
To           ULI
Date      12/7/2017

Hi Uli,

First, I had not read your entire paper and only now I realize you analyze the Cuddy et al paper; that's an interesting coincidence. For what it's worth, we worked on the post before you and I had this exchange (the post was written in November and we first waited for Thanksgiving and then over 10 days for them to reply). And moreover, our post is heavily based off the peer review Joe wrote when reviewing this paper, nearly a year ago, and which was largely ignored by the authors unfortunately.

In terms of the results. I am not sure I understand. Are you saying you get an estimate of 5% with a confidence interval between 13 and 30?

That’s not what I get.


From     ULI
To           URI
Date      12/7/2017

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

As you can see in your output, the numbers are switched (I should label columns in output).

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

Best, Uli


Hi Uli,

See below

From: Ulrich Schimmack []

Sent: Friday, December 8, 2017 12:39 AM

To: Simonsohn, Uri <>

Subject: RE: one more question

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

*I figured

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

*Happens to the best of us

As you can see in your output, the numbers are switched (I should label columns in output).

*I figured

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

*The tone is a bit accusatorial “admit”, but yes, in my blog post I will talk about it. My goal is to present facts in a way that lets readers decide with the same information I am using to decide.

It’s not always feasible to achieve that goal, but I strive for it. I prefer people making right inferences than relying on my work to arrive at them.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

*I don’t think that’s for us to decide. We can ‘fight’ about how to present the facts to readers, they decide which is more realistic.

I am not ignoring your simulation results.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

*I would prefer if you don’t speak on my behalf either way, our conversation is for each of us to learn from the other, then you speak for yourself.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

*I haven’t tried to reproduce your simulations, but I did indicate in our emails that if you run the rbeta(n,.35,.5)*.95+.05 p-curve over-estimates, I also explained why I don’t find that particularly worrisome. But you are not publishing a report on our email exchange, you are proposing a new tool. Our exchange hopefully helped make that paper clearer.

Please don’t quote any aspect of our exchange. You can say you discussed matters with me, but please do not quote me. This is a private email exchange. You can quote from my work and posts. The heterogeneity blog post may be up in a week or two.







The Deductive Fallacy in (some) Bayesian Inductive Inferences

I learned about Bayes' theorem in the 1990s and I used Bayes's famous formula in my first JPSP article (Schimmack & Reisenzein, 1997).  When Wagenmakers et al. (2011) published their criticism of Bem (2011), I did not know about Bayesian statistics. I have since learned more about Bayesian statistics and I am aware that there are many different approaches to using priors in statistical inferences.  This post is about a single Bayesian statistical approach, namely Bayesian Null-Hypothesis Testing (BNHT), which has been attributed to Jeffreys, was introduced into psychology by Rouder, Speckman, and Sun (2009), and was used by Wagenmakers et al. (2011) to suggest that Bem's evidence for ESP was obtained by using flawed p-values, whereas Bayes-Factors showed no evidence for ESP, although they did not show evidence for the absence of ESP, either. Since then, I have learned more about Bayes-Factors, in part from reading blog posts by Jeff Rouder, including R-Code to run my own simulation studies, and from discussions with Jeff Rouder on social media.  I am not an expert on Bayesian modeling, but I understand the basic logic underlying Bayes-Factors.

Rouder et al.'s (2009) article has been cited over 800 times and was cited over 200 times in 2016 and 2017.  An influential article like this cannot be ignored.  Like all other inferential statistical methods, JBFs (Jeffreys's Bayes Factors, or Jeff's Bayes Factors) examine statistical properties of data (effect size, sampling error) in relation to sampling distributions of test statistics. Rouder et al. (2009) focused on t-distributions that are used for the comparison of means by means of t-tests.  Although most research articles in psychology continue to use traditional significance testing, the use of Bayes-Factors is on the rise.  It is therefore important to critically examine how Bayes-Factors are being used and whether inferences based on Bayes-Factors are valid.

Inferences about Sampling Error as Causes of Observed Effects.
The main objective of inferential statistics in psychological research is to rule out the possibility that an observed effect is merely a statistical fluke.  If the evidence obtained in a study is strong enough given some specified criterion value, researchers are allowed to reject the hypothesis that an observed effect was merely produced by chance (a false positive effect) and interpret the result as being caused by some effect.  Although Bayes-Factors could be reported without drawing conclusions (just like t-values or p-values could be reported without drawing inferences), most empirical articles that use Bayes-Factors use them to draw inferences about effects.  Thus, the aim of this blog post is to examine whether empirical researchers use JBFs correctly.

Two Types of Errors
Inferential statistics are error prone.  The outcome of empirical studies is not deterministic and results from samples may not generalize to populations.  There are two possible errors that can occur, the so-called type-I errors and type-II errors. Type-I errors are false positive results. A false positive result occurs when there is no real effect in the population, but the results of a study led to the rejection of the null-hypothesis that sampling error alone caused the observed mean differences.  The second error is the false inference that sampling error alone caused an observed difference in a sample, while a test of the entire population would show that there is an actual effect.  This is called a false negative result.  The main problem in assessing type-II errors (false negatives) is that the probability of a type-II error depends on the magnitude of the effect.  Large effects can be easily observed even in small samples and the risk of a type-II error is small.  However, as effect sizes become smaller and approach zero, it becomes harder and harder to distinguish the effect from pure sampling error.  Once effect sizes become really small (say 0.0000001 percent of a standard deviation), it is practically impossible to distinguish results of a study with a tiny real effect from results of a study with no effect at all.

False Negatives
For reasons that are irrelevant here, psychologists have ignored type-II errors. A type-II error can only be made when researchers conclude that an effect is absent.  However, empirical psychologists were trained to ignore non-significant results as inconclusive rather than drawing the inferences that an effect is absent, and risking making a type-II error.  This led to the belief that p-values cannot be used to test the hypothesis that an effect is absent.  This was not much of a problem because most of the time psychologists made predictions that an effect should occur (reward should increase behavior; learning should improve recall, etc.).  However, it became a problem when Bem (2011) claimed to demonstrate that subliminal priming can influence behavior even if the prime is presented AFTER the behavior occurred.  Wagenmakers et al. (2011) and others found this hypothesis implausible and the evidence for it unbelievable.  However, rather than being satisfied with demonstrating that the evidence is flawed, it would be even better to demonstrate that this implausible effect does not exist.  Traditional statistical methods that focus on rejecting the null-hypothesis did not allow this.  Wagenmakers et al. (2011) suggested that Bayes-Factors solve this problem because they can be used to test the plausible hypothesis that time-reversed priming does not exist (the effect size is truly zero).  Many subsequent articles have used JBFs for exactly this purpose; that is, to provide empirical evidence in support of the null-hypothesis that an observed mean difference is entirely due to sampling error.  Like all inductive inferences, inferences in favor of H0 can be false.   While psychologists have traditionally ignored type-II errors because they did not make inferences in favor of H0, the rise of inferences in favor of H0 by means of JBFs makes it necessary to examine the validity of these inferences.

Deductive Fallacy
The main problem of using JBFs to provide evidence for the null-hypothesis is that Bayes-Factors are ratios of the likelihoods of the data under two hypotheses.  The data can be more or less compatible with each of the two hypotheses.  Say, if the data favor H0 with a likelihood of .2 and H1 with a likelihood of .1, the ratio of the two likelihoods is .2/.1 = 2.  The greater the likelihood in favor of H0, the more likely it is that an observed mean difference is purely sampling error.  As JBFs are ratios of two likelihoods, they depend on the specification of H1. For t-tests with continuous variables, H1 is specified as a weighted distribution of effect sizes. Although H1 covers all possible effect sizes, it is possible to create an infinite number of alternative hypotheses (H1.1, H1.2, H1.3, …, H1.∞).  The Bayes-Factor changes as a function of the way H1 is specified.  Thus, while one specific H1 may produce a JBF of 1,000,000:1 in favor of H0, another one may produce a JBF of 1:1.   It is therefore a logical fallacy to infer from a specific JBF for one particular H1 that H0 is supported, true, or that there is evidence for the absence of an effect.  The logically correct inference is that, with extremely high probability, the alternative hypothesis is false, but that does not justify the inverse inference that H0 is true because H0 and H1 do not specify the full event space of all possible hypotheses that could be tested.  It is easy to overlook this because every H1 covers the full range of effect sizes, but these effect sizes can be used to create an infinite number of alternative hypotheses.
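To make this dependence concrete, here is a minimal R sketch; the data (t = 1.5, df = 48, n = 25 per cell) and the two priors are illustrative choices of mine, not values from any published analysis. The same data produce different Bayes-Factors under a wide and a narrow specification of H1.

t.obs <- 1.5; df <- 48; n <- 25                   # made-up result of a two-sample t-test
lik.H0  <- dt(t.obs, df)                          # likelihood of the data under H0 (d = 0)
lik.H1a <- integrate(function(d) dt(t.obs, df, ncp = d * sqrt(n / 2)) * dcauchy(d, 0, 1), -Inf, Inf)$value
lik.H1b <- integrate(function(d) dt(t.obs, df, ncp = d * sqrt(n / 2)) * dnorm(d, 0, .2), -Inf, Inf)$value
lik.H0 / lik.H1a                                  # Bayes-Factor in favor of H0 under a wide Cauchy prior
lik.H0 / lik.H1b                                  # Bayes-Factor in favor of H0 under a narrow normal prior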

To make it simple, let’s forget about sampling distributions and likelihoods.  Using JBFs to claim that the data support the null-hypothesis in some absolute sense, is like a guessing game where you can pick any number you want, I guess that you picked 7 (because people like the number 7), you say it was not 7, and I now infer that you must have picked 0, as if 7 and 0 were the only options.  If you think this is silly, you are right, and it is equally silly to infer from rejecting one out of an infinite number of possible H1s that H0 must be true.

So, a correct use of JBFs would be to state conclusions in terms of the H1 that was specified to compute the JBF.  For example, in Wagenmakers et al.'s analyses of Bem's data, the authors specified H1 as a hypothesis that allocated 25% probability to effect sizes of d less than -1 (the opposite of the predicted effect) and 25% probability to a d greater than 1 (a very strong effect similar to gender differences in height).  Even if the JBF would strongly favor H0, which it did not, it would not justify the inference that time-reversed priming does not exist.  It would merely justify the inference that the effect size is unlikely to be greater than 1, one way or the other.  However, if Wagenmakers et al. (2011) had presented their results correctly, nobody would have bothered to take notice of such a trivial inference.  It was only the incorrect presentation of JBFs as a test of the null-hypothesis that led to the false belief that JBFs can provide evidence for the absence of an effect (e.g., the true effect size in Bem's studies is zero).  In fact, Wagenmakers played the game where H1 guessed that the effect size is 1, H1 was wrong, leading to the conclusion that the effect size must be 0.  This is an invalid inference because there are still an infinite number of plausible effect sizes between 0 and 1.
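The 25% figures follow from the prior used for H1; assuming the default Cauchy distribution with a scale of 1, they can be checked in R:

pcauchy(-1, location = 0, scale = 1)      # prior probability of d < -1: .25
1 - pcauchy(1, location = 0, scale = 1)   # prior probability of d > 1: .25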

There is nothing inherently wrong in calculating likelihood ratios and using them to test competing predictions.  However, the use of Bayes-Factors as a way to provide evidence for the absence of an effect is misguided because it is logically impossible to provide evidence for one specific effect size out of an infinite set of possible effect sizes. It doesn't matter whether the specific effect size is 0 or any other value.  A likelihood ratio can only compare two hypotheses out of an infinite set of hypotheses. If one hypothesis is rejected, it does not justify inferring that the other hypothesis is true.  This is the reason why we can falsify H0 because when we reject H0 we do not infer that one specific effect size is true; we merely infer that it is not 0, leaving open the possibility that it is any one of the other infinite number of effect sizes.  We cannot reverse this because we cannot test the hypothesis that the effect is zero against a single hypothesis that covers all other effect sizes. JBFs can only test H0 against all other effect sizes by assigning weights to them. As there is an infinite number of ways to weight effect sizes, there is an infinite set of alternative hypotheses.  Thus, we can reject the hypothesis that sampling error alone produced an effect but practically we can never demonstrate that sampling error alone caused the outcome of a study.

To demonstrate that an effect does not exist it is necessary to specify a region of effect sizes around zero.  The smaller the region, the more resources are needed to provide evidence that the effect size is at best very small.  One negative consequence of the JBF approach has been that small samples were used to claim support for the point null-hypothesis, with a high probability that this conclusion was a false negative result. Researchers should always report the 95% confidence interval around the observed effect size. If this interval includes effect sizes of .2 standard deviations, the inference in favor of a null-result is questionable because many effect sizes in psychology are small.  Confidence intervals (or Bayesian credibility intervals with plausible priors) are more useful for claims about the absence of an effect than misleading statistics that pretend to provide strong evidence in favor of a false null-hypothesis.
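As a minimal sketch of this recommendation (with made-up summary statistics), one can compute the standardized mean difference, its approximate 95% confidence interval, and check whether the interval excludes small effects of |d| >= .2:

m1 <- 0.10; m2 <- 0.00; sd1 <- 1; sd2 <- 1; n1 <- 50; n2 <- 50   # made-up summary statistics
d    <- (m1 - m2) / sqrt((sd1^2 + sd2^2) / 2)                    # standardized mean difference
se.d <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))      # approximate standard error of d
ci   <- d + c(-1.96, 1.96) * se.d                                # approximate 95% confidence interval
all(abs(ci) < .2)                                                # TRUE only if small effects can be ruled out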


Conduct Your Own Replicability Analysis

You can download an Excel spreadsheet to conduct your own replicability analysis.


The most important columns are L (which.statistical.test), O (df1), P(df2), Q(test.statistic), and R (Success).

L (which statistical test):
Enter F, t, z, or chisq.

O (df1):
Enter the numerator df for F; enter 1 or leave blank for other tests.

P (df2):
Enter the denominator (error) df for F and t, and the df for chi^2.

Q (test statistic):
Enter the actual F, t, z, or chi^2 value.

R (Success):
Enter whether the result was counted as a success or not.
(Entering a marginally significant result as a success results in a negative R-Index for this study.)
(Entering a marginally significant result as a failure results in a very high R-Index that is probably not justified.)
(If in doubt, do not enter marginally significant results, or enter them with 0.5 for success.)
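For readers who prefer R over Excel, here is a rough sketch of the kind of computation the spreadsheet automates (the spreadsheet itself may differ in details); the test statistics are made-up values, and the R-Index is computed as median observed power minus the inflation (success rate minus median observed power).

t.vals  <- c(2.20, 2.05, 2.45)                   # made-up t-values, each with df = 38
p.vals  <- 2 * (1 - pt(t.vals, 38))              # two-tailed p-values
z.vals  <- -qnorm(p.vals / 2)                    # converted to absolute z-scores
obs.pow <- pnorm(z.vals, mean = 1.96)            # observed power for p < .05 (two-tailed)
success <- mean(p.vals < .05)                    # success rate (here all three are significant)
median(obs.pow) - (success - median(obs.pow))    # R-Index = observed power minus inflation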

Feel free to post questions in the comment section or email me.



Why most Multiple-Study Articles are False: An Introduction to the Magic Index

In 2011 I wrote a manuscript in response to Bem's (2011) unbelievable and flawed evidence for extroverts' supernatural abilities.  It took nearly two years for the manuscript to get published in Psychological Methods. While I was proud to have published in this prestigious journal without formal training in statistics or a grasp of Greek notation, I now realize that Psychological Methods was not the best outlet for the article, which may explain why even some established replication revolutionaries do not know it (comment: I read your blog, but I didn't know about this article). So, I decided to publish an abridged (it is still long), lightly edited (I have learned a few things since 2011), and commented (comments are in […]) version here.

I also learned a few things about titles. So the revised version has a new title.

Finally, I can now disregard the request from the editor, Scott Maxwell, on behalf of reviewer Daryl Bem, to change the name of my statistical index from magic index to incredibility index.  (That is the advantage of publishing without the credentials and censorship of peer review.)

For readers not familiar with experimental social psychology, it is also important to understand what a multiple-study article is.  Most sciences are happy with one empirical study per article.  However, social psychologists didn't trust the results of a single study with p < .05. Therefore, they wanted to see internal conceptual replications of phenomena.  Magically, Bem was able to provide evidence for supernatural abilities in not just 1 or 2 or 3 studies, but 8 conceptual replication studies with 9 successful tests.  The chance of obtaining 9 false positive results in 9 statistical tests is smaller than the chance of finding evidence for the Higgs boson particle, which was a big discovery in physics.  So, readers in 2011 had a difficult choice to make: either supernatural phenomena are real or multiple-study articles are unreal.  My article shows that the latter is likely to be true, as did an article by Greg Francis.
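The arithmetic behind this comparison, plus a preview of the core argument, is easy to check in R (assuming a 5% criterion per test, the conventional five-sigma criterion for the Higgs discovery, and, for the last line, an assumed 80% power per study):

.05^9        # chance that all 9 tests are significant when no effect exists: about 2e-12
pnorm(-5)    # one-tailed five-sigma criterion used for the Higgs discovery: about 3e-7
.80^9        # chance that 9 studies with 80% power each are all significant: about .13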

Aside from Alcock's demonstration of a nearly perfect negative correlation between effect sizes and sample sizes and my demonstration of insufficient variance in Bem's p-values, Francis's article and my article remain the only articles that question the validity of Bem's original findings. Other articles have shown that the results cannot be replicated, but I showed that the original results were already too good to be true. This blog post explains how I did it.

Why most multiple-study articles are false: An Introduction to the Magic Index
(the article formerly known as “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”)

Cohen (1962) pointed out the importance of statistical power for psychology as a science, but statistical power of studies has not increased, while the number of studies in a single article has increased. It has been overlooked that multiple studies with modest power have a high probability of producing nonsignificant results because power decreases as a function of the number of statistical tests that are being conducted (Maxwell, 2004). The discrepancy between the expected number of significant results and the actual number of significant results in multiple-study articles undermines the credibility of the reported results, and it is likely that questionable research practices have contributed to the reporting of too many significant results (Sterling, 1959). The problem of low power in multiple-study articles is illustrated using Bem's (2011) article on extrasensory perception and Gailliot et al.'s (2007) article on glucose and self-regulation. I conclude with several recommendations that can increase the credibility of scientific evidence in psychological journals. One major recommendation is to pay more attention to the power of studies to produce positive results without the help of questionable research practices and to request that authors justify sample sizes with a priori predictions of effect sizes. It is also important to publish replication studies with nonsignificant results if these studies have high power to replicate a published finding.

Keywords: power, publication bias, significance, credibility, sample size


Less is more, except of course for sample size. (Cohen, 1990, p. 1304)

In 2011, the prestigious Journal of Personality and Social Psychology published an article that provided empirical support for extrasensory perception (ESP; Bem, 2011). The publication of this controversial article created vigorous debates in psychology
departments, the media, and science blogs. In response to this debate, the acting editor and the editor-in-chief felt compelled to write an editorial accompanying the article. The editors defended their decision to publish the article by noting that Bem’s (2011) studies were performed according to standard scientific practices in the field of experimental psychology and that it would seem inappropriate to apply a different standard to studies of ESP (Judd & Gawronski, 2011).

Others took a less sanguine view. They saw the publication of Bem’s (2011) article as a sign that the scientific standards guiding publication decisions are flawed and that Bem’s article served as a glaring example of these flaws (Wagenmakers, Wetzels, Borsboom,
& van der Maas, 2011). In a nutshell, Wagenmakers et al. (2011) argued that the standard statistical model in psychology is biased against the null hypothesis; that is, only findings that are statistically significant are submitted and accepted for publication.

This bias leads to the publication of too many positive (i.e., statistically significant) results. The observation that scientific journals, not only those in psychology,
publish too many statistically significant results is by no means novel. In a seminal article, Sterling (1959) noted that selective reporting of statistically significant results can produce literatures that “consist in substantial part of false conclusions” (p.

Three decades later, Sterling, Rosenbaum, and Weinkam (1995) observed that the “practice leading to publication bias have [sic] not changed over a period of 30 years” (p. 108). Recent articles indicate that publication bias remains a problem in psychological
journals (Fiedler, 2011; John, Loewenstein, & Prelec, 2012; Kerr, 1998; Simmons, Nelson, & Simonsohn, 2011; Strube, 2006; Vul, Harris, Winkielman, & Pashler, 2009; Yarkoni, 2010).

Other sciences have the same problem (Yong, 2012). For example, medical journals have seen an increase in the percentage of retracted articles (Steen, 2011a, 2011b), and there is the concern that a vast number of published findings may be false (Ioannidis, 2005).

However, a recent comparison of different scientific disciplines suggested that the bias is stronger in psychology than in some of the older and harder scientific disciplines at the top of a hierarchy of sciences (Fanelli, 2010).

It is important that psychologists use the current crisis as an opportunity to fix problems in the way research is being conducted and reported. The proliferation of eye-catching claims based on biased or fake data can have severe negative consequences for a
science. A New Yorker article warned the public that “all sorts of  well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable” (Lehrer, 2010, p. 1).

If students who read psychology textbooks and the general public lose trust in the credibility of psychological science, psychology loses its relevance because
objective empirical data are the only feature that distinguishes psychological science from other approaches to the understanding of human nature and behavior. It is therefore hard to exaggerate the seriousness of doubts about the credibility of research findings published in psychological journals.

In an influential article, Kerr (1998) discussed one source of bias, namely, hypothesizing after the results are known (HARKing). The practice of HARKing may be attributed to the
high costs of conducting a study that produces a nonsignificant result that cannot be published. To avoid this negative outcome, researchers can design more complex studies that test multiple hypotheses. Chances increase that at least one of the hypotheses
will be supported, if only because Type I error increases (Maxwell, 2004). As noted by Wagenmakers et al. (2011), generations of graduate students were explicitly advised that this questionable research practice is how they should write scientific manuscripts
(Bem, 2000).

It is possible that Kerr’s (1998) article undermined the credibility of single-study articles and added to the appeal of multiple-study articles (Diener, 1998; Ledgerwood & Sherman, 2012). After all, it is difficult to generate predictions for significant effects
that are inconsistent across studies. Another advantage is that the requirement of multiple significant results essentially lowers the chances of a Type I error, that is, the probability of falsely rejecting the null hypothesis. For a set of five independent studies,
the requirement to demonstrate five significant replications essentially shifts the probability of a Type I error from p < .05 for a single study to p < .0000003 (i.e., .05^5) for a set of five studies.

This is approximately the same stringent criterion that is being used in particle physics to claim a true discovery (Castelvecchi, 2011). It has been overlooked, however, that researchers have to pay a price to meet more stringent criteria of credibility. To demonstrate significance at a more stringent criterion of significance, it is
necessary to increase sample sizes to reduce the probability of making a Type II error (failing to reject the null hypothesis). This probability is called beta. The inverse probability (1 – beta) is called power. Thus, to maintain high statistical power to demonstrate an effect with a more stringent alpha level requires an
increase in sample sizes, just as physicists had to build a bigger collider to have a chance to find evidence for smaller particles like the Higgs boson particle.
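
To make this arithmetic concrete, here is a minimal R sketch (a hedged illustration, assuming a two-group design with a moderate effect of d = .5 and a target of 80% power; base R only, no packages):

# Joint Type I error when k independent studies must all be significant at .05.
alpha <- 0.05
alpha^(1:5)   # for k = 5: .05^5 = 3.125e-07, i.e., roughly .0000003

# Sample size per group needed for 80% power with d = .5 at the conventional
# alpha versus the more stringent five-study criterion.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,   power = 0.8)$n  # ~64 per group
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05^5, power = 0.8)$n  # roughly 285 per group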

Yet there is no evidence that psychologists are using bigger samples to meet more stringent demands of replicability (Cohen, 1992; Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). This raises the question of how researchers are able to replicate findings in multiple-study articles despite modest power to demonstrate significant effects even within a single study. Researchers can use questionable research
practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result. Moreover, a survey of researchers indicated that these
practices are common (John et al., 2012), and the prevalence of these practices has raised concerns about the credibility of psychology as a science (Yong, 2012).

An implicit assumption in the field appears to be that the solution to these problems is to further increase the number of positive replication studies that need to be presented to ensure scientific credibility (Ledgerwood & Sherman, 2012). However, the assumption that many replications with significant results provide strong evidence for a hypothesis is an illusion that is akin to the Texas sharpshooter fallacy (Milloy, 1995). Imagine a Texan farmer named Joe. One day he invites you to his farm and shows you a target with nine shots in the bull’s-eye and one shot just outside the bull’s-eye. You are impressed by his shooting abilities until you find out that he cannot repeat this performance when you challenge him to do it again.

[So far, well-known Texan sharpshooters in experimental social psychology have carefully avoided demonstrating their sharp shooting abilities in open replication studies to avoid the embarrassment of not being able to do it again].

Over some beers, Joe tells you that he first fired 10 shots at the barn and then drew the targets after the shots were fired. One problem in science is that reading a research
article is a bit like visiting Joe’s farm. Readers only see the final results, without knowing how the final results were created. Is Joe a sharpshooter who drew a target and then fired 10 shots at the target? Or was the target drawn after the fact? The reason why multiple-study articles are akin to a Texan sharpshooter is that psychological studies have modest power (Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). Assuming
60% power for a single study, the probability of obtaining 10 significant results in 10 studies is less than 1% (.6^10 = 0.6%).

I call the probability of obtaining only significant results in a set of studies total power. Total power parallels Maxwell’s (2004) concept of all-pair power for multiple comparisons in analysis-of-variance designs. Figure 1 illustrates how total power decreases with the number of studies that are being conducted. Eventually, it becomes extremely unlikely that a set of studies produces only significant results. This is especially true if a single study has modest power. When total power is low, it is incredible that a set
of studies yielded only significant results. To avoid the problem of incredible results, researchers would have to increase the power of studies in multiple-study articles.
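
A minimal R sketch of this pattern (the decline that Figure 1 illustrates), assuming independent studies that all have the same power:

# Total power: the probability that every study in a set of k independent
# studies produces a significant result when each study has the same power.
total_power <- function(power, k) power^k

total_power(0.6, 10)              # ~.006, i.e., the 0.6% from the example above
round(total_power(0.8, 1:10), 2)  # steady decline as the number of studies grows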

Table 1 shows how the power of individual studies has to be adjusted to maintain 80% total power for a set of studies. For example, to have 80% total power for five replications, the power of each study has to increase to 96%.

Table 1 also shows the sample sizes required to achieve 80% total power, assuming a simple between-group design, an alpha level of .05 (two-tailed), and Cohen’s
(1992) guidelines for a small (d = .2), moderate (d = .5), and strong (d = .8) effect.

[To demonstrate a small effect 7 times would require more than 10,000 participants.]
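
A hedged R sketch of the kind of numbers reported in Table 1 (assuming a two-group design, alpha = .05, two-tailed, base R power.t.test; the helper function is mine):

# Per-study power needed so that k studies jointly reach 80% total power,
# and the resulting sample size per study and for the whole set.
table1_row <- function(k, d) {
  per_study_power <- 0.8^(1 / k)                 # e.g., k = 5 requires ~.96
  n_group <- ceiling(power.t.test(delta = d, sd = 1, sig.level = .05,
                                  power = per_study_power)$n)
  c(per_study_power = round(per_study_power, 2),
    N_per_study = 2 * n_group,
    N_total = k * 2 * n_group)
}

table1_row(5, 0.5)  # five studies of a moderate effect
table1_row(7, 0.2)  # seven studies of a small effect: total N exceeds 10,000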

In sum, my main proposition is that psychologists have falsely assumed that increasing the number of replications within an article increases credibility of psychological science. The problem of this practice is that a truly programmatic set of multiple studies
is very costly and few researchers are able to conduct multiple studies with adequate power to achieve significant results in all replication attempts. Thus, multiple-study articles have intensified the pressure to use questionable research methods to compensate for low total power and may have weakened rather than strengthened
the credibility of psychological science.

[I believe this is one reason why the replication crisis has hit experimental social psychology the hardest.  Other psychologists could use HARKing to tell a false story about a single study, but experimental social psychologists had to manipulate the data to get significance all the time.  Experimental cognitive psychologists also have multiple-study articles, but they tend to use more powerful within-subject designs, which makes it more credible to get significant results multiple times. The multiple-study between-subjects (BS) design made it impossible to do so, which resulted in the publication of BS results.]

What Is the Allure of Multiple-Study Articles?

One apparent advantage of multiple-study articles is to provide stronger evidence against the null hypothesis (Ledgerwood & Sherman, 2012). However, the number of studies is irrelevant because the strength of the empirical evidence is a function of the
total sample size rather than the number of studies. The main reason why aggregation across studies reduces randomness as a possible explanation for observed mean differences (or correlations) is that p values decrease with increasing sample size. The
number of studies is mostly irrelevant. A study with 1,000 participants has as much power to reject the null hypothesis as a meta-analysis of 10 studies with 100 participants if it is reasonable to assume a common effect size for the 10 studies. If true effect sizes vary across studies, power decreases because a random-effects model may be more appropriate (Schmidt, 2010; but see Bonett, 2009). Moreover, the most logical approach to reduce concerns about Type I error is to use more stringent criteria for significance (Mudge, Baker, Edge, & Houlahan, 2012). For controversial or very important research findings, the significance level could be set to p < .001 or, as in particle physics, to an even more stringent criterion (approximately p < .0000003, as noted above).

[Ironically, five years later we have a debate about p < .05 versus p < .005, without even thinking about p < .0000005, or any mention that even a pair of studies with p < .05 in each study effectively has an alpha below .005, namely .0025 to be exact.]

It is therefore misleading to suggest that multiple-study articles are more credible than single-study articles. A brief report with a large sample (N = 1,000) provides more credible evidence than a multiple-study article with five small studies (N = 40, total
N = 200).

The main appeal of multiple-study articles seems to be that they can address other concerns (Ledgerwood & Sherman, 2012). For example, one advantage of multiple studies could be to test the results across samples from diverse populations (Henrich, Heine, & Norenzayan, 2010). However, many multiple-study articles are based on samples drawn from a narrowly defined population (typically, students at the local university). If researchers were concerned about generalizability across a wider range of individuals, multiple-study articles should examine different populations. However, it is not clear why it would be advantageous to conduct multiple independent studies with different populations. To compare populations, it would be preferable to use the same procedures and to analyze the data within a single statistical model with population as a potential moderating factor. Moreover, moderator tests often have low power. Thus, a single study with a large sample and moderator variables is more informative than articles that report separate analyses with small samples drawn from different populations.

Another attraction of multiple-study articles appears to be the ability to provide strong evidence for a hypothesis by means of slightly different procedures. However, even here, single studies can be as good as multiple-study articles. For example, replication across different dependent variables in different studies may mask the fact that studies included multiple dependent variables and researchers picked dependent variables that produced significant results (Simmons et al., 2011). In this case, it seems preferable to
demonstrate generalizability across dependent variables by including multiple dependent variables within a single study and reporting the results for all dependent variables.

One advantage of a multimethod assessment in a single study is that the power to
demonstrate an effect increases for two reasons. First, while some dependent variables may produce nonsignificant results in separate small studies due to low power (Maxwell, 2004), they may all show significant effects in a single study with the total sample size
of the smaller studies. Second, it is possible to increase power further by constraining coefficients for each dependent variable or by using a latent-variable measurement model to test whether the effect is significant across dependent variables rather than for each one independently.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating
a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions.  Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe
truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005).

[This is my takedown of social psychologists’ claim that multiple conceptual replications test theories (Stroebe & Strack, 2014).]

The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect. In this way, empirical studies no longer test theoretical hypotheses because they can only produce two results: Either they support the theory (p < .05) or the manipulation did not work (p > .05). It is therefore worrisome that Bem noted that “like most  social psychological experiments, the experiments reported here required extensive pilot testing” (Bem, 2011, p. 421). If Joe is a sharpshooter, who can hit the bull’s-eye from different angles and with different guns, why does he need extensive training before he can perform the critical shot?

The freedom of researchers to discount null findings leads to the paradox that conceptual replications across multiple studies give the impression that an effect is robust followed by warnings that experimental findings may not replicate because they depend “on subtle and unknown factors” (Bem, 2011, p. 422).

If experimental results were highly context dependent, it would be difficult to explain how studies reported in research articles nearly always produce the expected results. One possible explanation for this paradox is that sampling error in small samples creates the illusion that effect sizes vary systematically, although most of the variation is random. Researchers then pick studies that randomly produced inflated effect sizes and may further inflate them by using questionable research methods to achieve significance (Simmons et al., 2011).

[I was polite when I said “may”.  This appears to be exactly what Bem did to get his supernatural effects.]

The final set of studies that worked is then published and gives a false sense of the effect size and replicability of the effect (you should see the other side of Joe’s barn). This may explain why research findings initially seem so impressive, but when other researchers try to build on these seemingly robust findings, it becomes increasingly uncertain whether a phenomenon exists at all (Ioannidis, 2005; Lehrer, 2010).

At this point, a lot of resources have been wasted without providing credible evidence for an  effect.

[And then Stroebe and Strack in 2014 suggest that real replication studies that let the data determine the outcome are a waste of resources.]

To increase the credibility of reported findings, it would be better to use all of the resources for one powerful study. For example, the main dependent variable in Bem’s (2011) study of ESP was the percentage of correct predictions of future events.
Rather than testing this ability 10 times with N = 100 participants, it would have been possible to test the main effect of ESP in a single study with 10 variations of experimental procedures and use the experimental conditions as a moderating factor. By testing one
main effect of ESP in a single study with N = 1,000, power would be greater than 99.9% to demonstrate an effect with Bem’s a priori effect size.
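
As a rough check of this claim, here is a minimal R sketch; d = .2 is an illustrative small effect size, not necessarily Bem’s exact a priori value:

# Power of a single one-sample test with N = 1,000 for a small effect (d = .2).
# Even this conservative assumption yields power above 99.9%.
power.t.test(n = 1000, delta = 0.2, sd = 1, sig.level = .05,
             type = "one.sample")$power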

At the same time, the power to demonstrate significant moderating effects would be much lower. Thus, the study would lead to the conclusion that ESP does exist but that it is unclear whether the effect size varies as a function of the actual experimental
paradigm. This question could then be examined in follow-up studies with more powerful tests of moderating factors.

In conclusion, it is true that a programmatic set of studies is superior to a brief article that reports a single study if both articles have the same total power to produce significant results (Ledgerwood & Sherman, 2012). However, once researchers use questionable research practices to make up for insufficient total power, multiple-study articles lose their main advantage over single-study articles, namely, to demonstrate generalizability across different experimental manipulations or other extraneous factors.

Moreover, the demand for multiple studies counteracts the demand for more
powerful studies (Cohen, 1962; Maxwell, 2004; Rossi, 1990) because limited resources (e.g., subject pool of PSY100 students) can only be used to increase sample size in one study or to conduct more studies with small samples.

It is therefore likely that the demand for multiple studies within a single article has eroded rather than strengthened the credibility of published research findings
(Steen, 2011a, 2011b), and it is problematic to suggest that multiple-study articles solve the problem that journals publish too many positive results (Ledgerwood & Sherman, 2012). Ironically, the reverse may be true because multiple-study articles provide a
false sense of credibility.

Joe the Magician: How Many Significant Results Are Too Many?

Most people enjoy a good magic show. It is fascinating to see something and to know at the same time that it cannot be real. Imagine that Joe is a well-known magician. In front of a large audience, he fires nine shots from impossible angles, blindfolded, and seemingly through the body of an assistant, who miraculously does not bleed. You cannot figure out how Joe pulled off the stunt, but you know it was a stunt. Similarly, seeing Joe hit the bull’s-eye 1,000 times in a row raises concerns about his abilities as a sharpshooter and suggests that some magic is contributing to this miraculous performance. Magic is fun, but it is not science.

[Before Bem’s article appeared, Steve Heine gave a talk at the University of Toronto where he presented multiple studies with manipulations of absurdity (absurd stimuli like Monty Python’s “Biggles: Pioneer Air Fighter”; cf. Proulx, Heine, & Vohs, PSPB, 2010).  Each absurd manipulation was successful.  I didn’t have my magic index then, but I did understand the logic of Sterling et al.’s (1995) argument. So, I did ask whether there were also manipulations that did not work, and the answer was affirmative.  Before 2011, it was rude to ask about a file drawer, but a recent twitter discussion suggests that it wouldn’t be rude in 2018. Times are changing.]

The problem is that some articles in psychological journals appear to be more magical than one would expect on the basis of the normative model of science (Kerr, 1998). To increase the credibility of published results, it would be desirable to have a diagnostic tool that can distinguish between credible research findings and those that are likely to be based on questionable research practices. Such a tool would also help to
counteract the illusion that multiple-study articles are superior to single-study articles without leading to the erroneous reverse conclusion that single-study articles are more trustworthy.

[I need to explain why I targeted multiple-study articles in particular. Even the personality section of JPSP started to demand multiple studies because they created the illusion of being more rigorous, e.g., the crazy glucose article was published in that section. At that time, I was still trying to publish as many articles as possible in JPSP and I was not able to compete with crazy science.]

Articles should be evaluated on the basis of their total power to demonstrate consistent evidence for an effect. As such, a single-study article with 80% (total) power is superior to a multiple-study article with 20% total power, but a multiple-study article with 80% total power is superior to a single-study article with 80% power.

The Magic Index (formerly known as the Incredibility Index)

The idea of using power analysis to examine bias in favor of theoretically predicted effects and against the null hypothesis was introduced by Sterling et al. (1995). Ioannidis and Trikalinos (2007) provided a more detailed discussion of this approach for the detection of bias in meta-analyses. Ioannidis and Trikalinos’s exploratory test estimates the probability of the number of reported significant results given the average power of the reported studies. Low p values indicate that there are too many significant results, which suggests that questionable research methods contributed to the reported results. In contrast, the inverse inference is not justified: high p values do not warrant the conclusion that questionable research practices did not contribute to the results. To emphasize this asymmetry in inferential strength, I suggest reversing the exploratory test, focusing on the probability of obtaining more nonsignificant results than were reported in a multiple-study article and calling this index the magic index.

Higher values indicate that there is a surprising lack of nonsignificant results (a.k.a., shots that missed the bull’s eye). The higher the magic index is, the more incredible the observed outcome becomes.

Too many significant results could be due to faking, fudging, or fortune. Thus, the statistical demonstration that a set of reported findings is magical does not prove that questionable research methods contributed to the results in a multiple-study article. However, even when questionable research methods did not contribute to the results, the published results are still likely to be biased because fortune helped to inflate effect sizes and produce more significant results than total power justifies.

Computation of the Incredibility Index

To understand the basic logic of the M-index, it is helpful to consider a concrete example. Imagine a multiple-study article with 10 studies with an average observed effect size of d = .5 and 84 participants in each study (42 in two conditions, total N = 840) and all studies producing a significant result. At first sight, these 10 studies seem to provide strong support against the null hypothesis. However, a post hoc power analysis with the average effect size of d = .5 as estimate of the true effect size reveals that each study had
only 60% power to obtain a significant result. That is, even if the true effect size were d = .5, only six out of 10 studies should have produced a significant result.

The M-index quantifies the probability of the actual outcome (10 out of 10 significant results) given the expected value (six out of 10 significant results) using binomial
probability theory. From the perspective of binomial probability theory, the scenario
is analogous to an urn problem with replacement, with six green balls (significant) and four red balls (nonsignificant). The binomial probability of drawing at least one red ball in 10 independent draws is 99.4% (Stat Trek, 2012).

That is, 994 out of 1,000 multiple-study articles with 10 studies and 60% average power
should have produced at least one nonsignificant result in one of the 10 studies. It is therefore incredible if an article reports 10 significant results because only six out of 1,000 attempts would have produced this outcome simply due to chance alone.

[I now realize that an average observed power of 60% would imply that the null hypothesis is true, because observed power is also inflated by selection for significance.  As 50% observed power is needed just to achieve significance, and chance cannot produce the same observed power each time, the minimum average observed power of a set of significant results is 62%!]
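
A minimal simulation sketch of the point in this note, assuming the null hypothesis is true in every study and only significant results are retained (the exact value depends on the selection model, so the output is illustrative):

# Observed power of two-tailed z tests selected for significance when the null
# hypothesis is true (true power = alpha = 5%). Averaging the observed power of
# only the significant results gives a value far above 50%, which is why low
# average observed power cannot be taken at face value after selection.
set.seed(123)
z <- rnorm(1e6)                              # test statistics under H0
z_sig <- abs(z[abs(z) > qnorm(.975)])        # keep only the significant results
obs_power <- pnorm(z_sig - qnorm(.975)) + pnorm(-z_sig - qnorm(.975))
mean(obs_power)                              # roughly .6 to .65, far above .05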

One of the main problems for power analysis in general and the computation of the IC-index in particular is that the true effect size is unknown and has to be estimated. There are three basic approaches to the estimation of true effect sizes. In rare cases, researchers provide explicit a priori assumptions about effect sizes (Bem, 2011). In this situation, it seems most appropriate to use an author’s stated assumptions about effect sizes to compute power with the sample sizes of each study. A second approach is to average reported effect sizes either by simply computing the mean value or by weighting effect sizes by their sample sizes. Averaging of effect sizes has the advantage that post hoc effect size estimates of single studies tend to have large confidence intervals. The confidence intervals shrink when effect sizes are aggregated across
studies. However, this approach has two drawbacks. First, averaging of effect sizes makes strong assumptions about the sampling of studies and the distribution of effect sizes (Bonett, 2009). Second, this approach assumes that all studies have the same effect
size, which is unlikely if a set of studies used different manipulations and dependent variables to demonstrate the generalizability of an effect. Ioannidis and Trikalinos (2007) were careful to warn readers that “genuine heterogeneity may be mistaken for bias” (p.

[I did not know about  Ioannidis and Trikalinos’s (2007) article when I wrote the first draft. Maybe that is a good thing because I might have followed their approach. However, my approach is different from their approach and solves the problem of pooling effect sizes. Claiming that my method is the same as Trikalinos’s method is like confusing random effects meta-analysis with fixed-effect meta-analysis]   

To avoid the problems of average effect sizes, it is promising to consider a third option. Rather than pooling effect sizes, it is possible to conduct post hoc power analysis for each study. Although each post hoc power estimate is associated with considerable sampling error, sampling errors tend to cancel each other out, and the M-index for a set of studies becomes more accurate without having to assume equal effect sizes in all studies.

Unfortunately, this does not guarantee that the M-index is unbiased because power is a nonlinear function of effect sizes. Yuan and Maxwell (2005) examined the implications of this nonlinear relationship. They found that power estimates based on observed effect sizes may be inflated, especially in small samples where observed effect sizes vary widely around the true effect size. Thus, the M-index is conservative when power is low and magic had to be used to create significant results.

In sum, it is possible to use reported effect sizes to compute post hoc power and to use post hoc power estimates to determine the probability of obtaining a significant result. The post hoc power values can be averaged and used as the probability for a successful
outcome. It is then possible to use binomial probability theory to determine the probability that a set of studies would have produced more nonsignificant results than were actually reported. This probability is [now] called the M-index.
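
A hedged R sketch of this computation (the function name and interface are mine; with zero reported nonsignificant results it reduces to 1 minus total power, matching the worked example above):

# M-index: probability that a set of studies would have produced more
# nonsignificant results than were actually reported, using the average of the
# post hoc power estimates as the binomial success probability.
m_index <- function(post_hoc_power, nonsig_reported = 0) {
  k <- length(post_hoc_power)
  p_nonsig <- 1 - mean(post_hoc_power)   # average probability of a nonsignificant result
  # P(X > nonsig_reported) with X ~ Binomial(k, p_nonsig)
  pbinom(nonsig_reported, size = k, prob = p_nonsig, lower.tail = FALSE)
}

m_index(rep(0.6, 10))     # reproduces the 99.4% from the worked example
m_index(rep(0.6, 10), 2)  # a set that honestly reports two nonsignificant results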

[Meanwhile, I have learned that it is much easier to compute observed power based on reported test statistics like t, F, and chi-square values because observed power is determined by these statistics.]
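
For example, one way to compute observed power directly from a reported t value is to treat the observed t as the noncentrality parameter of the noncentral t distribution (a common post hoc power convention; the helper below is illustrative):

# Observed (post hoc) power of a two-tailed t test, computed from the reported
# t value and its degrees of freedom.
observed_power_t <- function(t_obs, df, alpha = .05) {
  t_crit <- qt(1 - alpha / 2, df)
  pt(t_crit, df, ncp = abs(t_obs), lower.tail = FALSE) +
    pt(-t_crit, df, ncp = abs(t_obs))
}

observed_power_t(t_obs = 2.5, df = 82)  # a hypothetical reported result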

Example 1: Extrasensory Perception (Bem, 2011)

I use Bem’s (2011) article as an example because it may have been a tipping point for the current scientific paradigm in psychology (Wagenmakers et al., 2011).

[I am still waiting for EJ to return the favor and cite my work.]

The editors explicitly justified the publication of Bem’s article on the grounds that it was subjected to a rigorous review process, suggesting that it met current standards of scientific practice (Judd & Gawronski, 2011). In addition, the editors hoped that the publication of Bem’s article and Wagenmakers et al.’s (2011) critique would stimulate “critical further thoughts about appropriate methods in research on social cognition and attitudes” (Judd & Gawronski, 2011, p. 406).

A first step in the computation of the M-index is to define the set of effects that are being examined. This may seem trivial when the M-index is used to evaluate the credibility of results in a single article, but multiple-study articles contain many results and it is not always obvious that all results should be included in the analysis (Maxwell, 2004).

[Same here.  Maxwell accepted my article, but apparently doesn’t think it is useful to cite when he writes about the replication crisis.]

[deleted minute details about Bem’s study here.]

Another decision concerns the number of hypotheses that should be examined. Just as multiple studies reduce total power, tests of multiple hypotheses within a single study also reduce total power (Maxwell, 2004). Francis (2012b) decided to focus only on the
hypothesis that ESP exists, that is, that the average individual can foresee the future. However, Bem (2011) also made predictions about individual differences in ESP. Therefore, I used all 19 effects reported in Table 7 (11 ESP effects and eight personality effects).

[I deleted the section that explains alternative approaches that rely on effect sizes rather than observed power here.]

I used G*Power 3.1.2 to obtain post hoc power on the basis of effect sizes and sample sizes (Faul, Erdfelder, Buchner, & Lang, 2009).

The M-index is more powerful when a set of studies contains only significant results. In this special case, the M-index is simply the complement of total power (i.e., 1 minus the probability that all studies produce significant results).

[An article by Fabrigar and Wegener misrepresents my article and confuses the M-Index with total power.  When articles do report nonsignificant results and honestly describe them as failures to reject the null hypothesis (rather than as marginally significant), it is necessary to compute the binomial probability to get the M-Index.]

[Again, I deleted minute computations for Bem’s results.]

Using the highest magic estimates produces a total Magic-Index of 99.97% for Bem’s 17 results.  Thus, it is unlikely that Bem (2011) conducted 10 studies, ran 19 statistical tests of planned hypotheses, and obtained 14 statistically significant results.

Yet the editors felt compelled to publish the manuscript because “we can only take the author at his word that his data are in fact genuine and that the reported findings have not been taken from a larger set of unpublished studies showing null effects” (Judd & Gawronski, 2011, p. 406).

[It is well known that authors excluded disconfirming evidence and that editors sometimes even asked authors to engage in this questionable research practice. However, this quote implies that the editors asked Bem about failed studies and that he assured them that there are no failed studies, which may have been necessary to publish these magical results in JPSP.  If Bem did not disclose failed studies on request and these studies exist, it would violate even the lax ethical standards of the time that mostly operated on a “don’t ask don’t tell” basis. ]

The M-index provides quantitative information about the credibility of this assumption and would have provided the editors with objective information to guide their decision. More importantly, awareness about total power could have helped Bem to plan fewer studies with higher total power to provide more credible evidence for his hypotheses.

Example 2: Sugar High—When Rewards Undermine Self-Control

Bem’s (2011) article is exceptional in that it examined a controversial phenomenon. I used another nine-study article that was published in the prestigious Journal of Personality and Social Psychology to demonstrate that low total power is also a problem
for articles that elicit less skepticism because they investigate less controversial hypotheses. Gailliot et al. (2007) examined the relation between blood glucose levels and self-regulation. I chose this article because it has attracted a lot of attention (142 citations in Web of Science as of May 2012; an average of 24 citations per year) and it is possible to evaluate the replicability of the original findings on the basis of subsequent studies by other researchers (Dvorak & Simons, 2009; Kurzban, 2010).

[If anybody needs evidence that citation counts are a silly indicator of quality, here it is: the article has been cited 80 times in 2014, 64 times in 2015, 63 times in 2016, and 61 times in 2017.  A good reason to retract it, if JPSP and APA cares about science and not just impact factors.]

Sample sizes were modest, ranging from N = 12 to 102. Four studies had sample sizes of N < 20, which Simmons et al. (2011) considered to require special justification.  The total N is 359 participants. Table 1 shows that this total sample
size is sufficient to have 80% total power for four large effects or two moderate effects and is insufficient to demonstrate a [single] small effect. Notably, Table 4 shows that all nine reported studies produced significant results.

The M-Index for these 9 studies was greater than 99%. This indicates that, from a statistical point of view, Bem’s (2011) evidence for ESP is more credible than Gailliot et al.’s (2007) evidence for a role of blood glucose in self-regulation.

A more powerful replication study with N = 180 participants provides more conclusive evidence (Dvorak & Simons, 2009). This study actually replicated Gailliot et al.’s (2007) findings in Study 1. At the same time, the study failed to replicate the results for Studies 3–6 in the original article. Dvorak and Simons (2009) did not report the correlation, but the authors were kind enough to provide this information. The correlation was not significant in the experimental group, r(90) = .10, or in the control group, r(90) =
.03. Even in the total sample, it did not reach significance, r(180) = .11. It is therefore extremely likely that the original correlations were inflated because a study with a sample of N = 90 has 99.9% power to produce a significant effect if the true effect
size is r = .5. Thus, Dvorak and Simons’s results confirm the prediction of the M-index that the strong correlations in the original article are incredible.
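
A hedged check of this power claim, using the Fisher z approximation for the two-tailed test of a correlation (the helper function is illustrative):

# Approximate power of a two-tailed test of H0: rho = 0 (alpha = .05),
# based on Fisher's z transformation of the correlation.
power_r <- function(r, n, alpha = .05) {
  z_crit <- qnorm(1 - alpha / 2)
  ncp <- atanh(r) * sqrt(n - 3)
  pnorm(ncp - z_crit) + pnorm(-ncp - z_crit)
}

power_r(r = 0.5, n = 90)  # > .999, in line with the 99.9% figure in the text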

In conclusion, Gailliot et al. (2007) had limited resources to examine the role of blood glucose in self-regulation. By attempting replications in nine studies, they did not provide strong evidence for their theory. Rather, the results are incredible and difficult to replicate, presumably because the original studies yielded inflated effect sizes. A better solution would have been to test the three hypotheses in a single study with a large sample. This approach also makes it possible to test additional hypotheses, such as mediation (Dvorak & Simons, 2009). Thus, Example 2 illustrates that
a single powerful study is more informative than several small studies.

General Discussion

Fifty years ago, Cohen (1962) made a fundamental contribution to psychology by emphasizing the importance of statistical power to produce strong evidence for theoretically predicted effects. He also noted that most studies at that time had only sufficient power to provide evidence for strong effects. Fifty years later, power
analysis remains neglected. The prevalence of studies with insufficient power hampers scientific progress in two ways. First, there are too many Type II errors that are often falsely interpreted as evidence for the null hypothesis (Maxwell, 2004). Second, there
are too many false-positive results (Sterling, 1959; Sterling et al., 1995). Replication across multiple studies within a single article has been considered a solution to these problems (Ledgerwood & Sherman, 2012). The main contribution of this article is to point
out that multiple-study articles do not provide more credible evidence simply because they report more statistically significant results. Given the modest power of individual studies, it is even less credible that researchers were able to replicate results repeatedly in a series of studies than that they obtained a significant effect in a single study.

The demonstration that multiple-study articles often report incredible results might help to reduce the allure of multiple-study articles (Francis, 2012a, 2012b). This is not to say that multiple-study articles are intrinsically flawed or that single-study articles are superior. However, more studies are only superior if total power is held constant, yet limited resources create a trade-off between the number of studies and total power of a set of studies.

To maintain credibility, it is better to maximize total power rather than the number of studies. In this regard, it is encouraging that some editors no longer consider the number of studies as a selection criterion for publication (Smith, 2012).

[Over the past years, I have been disappointed by many psychologists that I admired or respected. I loved ER Smith’s work on exemplar models that influenced my dissertation work on frequency estimation of emotion.  In 2012, I was hopeful that he would make real changes, but my replicability rankings show that nothing changed during his term as editor of the JPSP section that published Bem’s article. Five wasted years and nobody can say he couldn’t have known better.]

Subsequently, I first discuss the puzzling question of why power continues to be ignored despite the crucial importance of power to obtain significant results without the help of questionable research methods. I then discuss the importance of paying more attention to total power to increase the credibility of psychology as a science. Due to space limitations, I will not repeat many other valuable suggestions that have been made to improve the current scientific model (Schooler, 2011; Simmons et al., 2011; Spellman, 2012; Wagenmakers et al., 2011).

In my discussion, I will refer to Bem’s (2011) and Gailliot et al.’s (2007) articles, but it should be clear that these articles merely exemplify flaws of the current scientific
paradigm in psychology.

Why Do Researchers Continue to Ignore Power?

Maxwell (2004) proposed that researchers ignore power because they can use a shotgun approach. That is, if Joe sprays the barn with bullets, he is likely to hit the bull’s-eye at least once. For example, experimental psychologists may use complex factorial
designs that test multiple main effects and interactions to obtain at
least one significant effect (Maxwell, 2004).

Psychologists who work with many variables can test a large number of correlations
to find a significant one (Kerr, 1998). Although studies with small samples have modest power to detect all significant effects (low total power), they have high power to detect at least one significant effect (Maxwell, 2004).

The shotgun model is unlikely to explain incredible results in multiple-study articles because the pattern of results in a set of studies has to be consistent. This has been seen as the main strength of multiple-study articles (Ledgerwood & Sherman, 2012).

However, low total power in multiple-study articles makes it improbable that all studies produce significant results and increases the pressure on researchers to use questionable research methods to comply with the questionable selection criterion that
manuscripts should report only significant results.

A simple solution to this problem would be to increase total power to avoid
having to use questionable research methods. It is therefore even more puzzling why the requirement of multiple studies has not resulted in an increase in power.

One possible explanation is that researchers do not care about effect sizes. Researchers may not consider it unethical to use questionable research methods that inflate effect sizes as long as they are convinced that the sign of the reported effect is consistent
with the sign of the true effect. For example, the theory that implicit attitudes are malleable is supported by a positive effect of experimental manipulations on the implicit association test, no matter whether the effect size is d = .8 (Dasgupta & Greenwald,
2001) or d = .08 (Joy-Gaba & Nosek, 2010), and the influence of blood glucose levels on self-control is supported by a strong correlation of r = .6 (Gailliot et al., 2007) and a weak correlation of r = .1 (Dvorak & Simons, 2009).

The problem is that in the real world, effect sizes matter. For example, it matters whether exercising for 20 minutes twice a week leads to a weight loss of one
pound or 10 pounds. Unbiased estimates of effect sizes are also important for the integrity of the field. Initial publications with stunning and inflated effect sizes produce underpowered replication studies even if subsequent researchers use a priori power analysis.

As failed replications are difficult to publish, inflated effect sizes are persistent and can bias estimates of true effect sizes in meta-analyses. Failed replication studies in file drawers also waste valuable resources (Spellman, 2012).

In comparison to one small (N = 40) published study with an inflated effect size and
nine replication studies with nonsignificant replications in file drawers (N = 360), it would have been better to pool the resources of all 10 studies for one strong test of an important hypothesis (N = 400).

A related explanation is that true effect sizes are often likely to be small to moderate and that researchers may not have sufficient resources for unbiased tests of their hypotheses. As a result, they have to rely on fortune (Wegner, 1992) or questionable research
methods (Simmons et al., 2011; Vul et al., 2009) to report inflated observed effect sizes that reach statistical significance in small samples.

Another explanation is that researchers prefer small samples to large samples because small samples have less power. When publications do not report effect sizes, sample sizes become an imperfect indicator of effect sizes because only strong effects
reach significance in small samples. This has led to the flawed perception that effect sizes in large samples have no practical significance because even effects without practical significance can reach statistical significance (cf. Royall, 1986). This line of
reasoning is fundamentally flawed and confounds credibility of scientific evidence with effect sizes.

The most probable and banal explanation for ignoring power is poor statistical training at the undergraduate and graduate levels. Discussions with colleagues and graduate students suggest that power analysis is mentioned, but without a sense of importance.

[I have been preaching about power for years in my department, and it became a running joke for students to mention power in their presentations, without any effect on research practices until 2011. Fortunately, Bem unintentionally made it possible to convince some colleagues that power is important.]

Research articles also reinforce the impression that power analysis is not important as sample sizes vary seemingly at random from study to study or article to article. As a result, most researchers probably do not know how risky their studies are and how lucky they are when they do get significant and inflated effects.

I hope that this article will change this and that readers take total power into account when they read the next article with five or more studies and 10 or more significant results and wonder whether they have witnessed a sharpshooter or have seen a magic show.

Finally, it is possible that researchers ignore power simply because they follow current practices in the field. Few scientists are surprised that published findings are too good to be true. Indeed, a common response to presentations of this work has been that the M-index only shows the obvious. Everybody knows that researchers use a number of questionable research practices to increase their chances of reporting significant results, and a high percentage of researchers admit to using these practices, presumably
because they do not consider them to be questionable (John et al., 2012).

[Even in 2014, Stroebe and Strack claim that it is not clear which practices should be considered questionable, whereas my undergraduate students have no problem realizing that hiding failed studies undermines the purpose of doing an empirical study in the first place.]

The benign view of current practices is that successful studies provide all of the relevant information. Nobody wants to know about all the failed attempts of alchemists to turn base metals into gold, but everybody would want to know about a process that
actually achieves this goal. However, this logic rests on the assumption that successful studies were really successful and that unsuccessful studies were really flawed. Given the modest power of studies, this conclusion is rarely justified (Maxwell, 2004).

To improve the status of psychological science, it will be important to elevate the scientific standards of the field. Rather than pointing to limited resources as an excuse,
researchers should allocate resources more wisely (spend less money on underpowered studies) and conduct more relevant research that can attract more funding. I think it would be a mistake to excuse the use of questionable research practices by pointing out that false discoveries in psychological research have less dramatic consequences than drugs with little benefits, huge costs, and potential side effects.

Therefore, I disagree with Bem’s (2000) view that psychologists should “err on the side of discovery” (p. 5).

[Yup, he wrote that in a chapter that was used to train graduate students in social psychology in the art of magic.]

Recommendations for Improvement

Use Power in the Evaluation of Manuscripts

Granting agencies often ask that researchers plan studies with adequate power (Fritz & MacKinnon, 2007). However, power analysis is ignored when researchers report their results. The reason is probably that (a priori) power analysis is only seen as a way to ensure that a study produces a significant result. Once a significant finding has been found, low power no longer seems to be a problem. After all, a significant effect was found (in one condition, for male participants, after excluding two outliers, p =
.07, one-tailed).

One way to improve psychological science is to require researchers to justify sample sizes in the method section. For multiple-study articles, researchers should be asked to compute total power.

[This is something nobody has even started to discuss.  Although there are more and more (often questionable) a priori power calculations in articles, they tend to aim for 80% power for a single hypothesis test, but these articles often report multiple studies or multiple hypothesis tests in a single article.  The power to get two significant results when each test has 80% power is only 64%.]

If a study has 80% total power, researchers should also explain how they would deal with the possible outcome of a nonsignificant result. Maybe it would change the perception of research contributions when a research article reports 10 significant
results, although power was only sufficient to obtain six. Implementing this policy would be simple. Thus, it is up to editors to realize the importance of statistical power and to make power an evaluation criterion in the review process (Cohen, 1992).

Implementing this policy could change the hierarchy of psychological
journals. Top journals would no longer be the journals with the most inflated effect sizes but, rather, the journals with the most powerful studies and the most credible scientific evidence.

[Based on this idea, I started developing my replicability rankings of journals. And they show that impact factors still do not take replicability into account.]

Reward Effort Rather Than Number of Significant Results

Another recommendation is to pay more attention to the total effort that went into an empirical study rather than the number of significant p values. The requirement to have multiple studies with no guidelines about power encourages a frantic empiricism in
which researchers will conduct as many cheap and easy studies as possible to find a set of significant results.

[And if power is taken into account, researchers now do six cheap Mturk studies. Although this is better than six questionable studies, it does not correct the problem that good research often requires a lot of resources.]

It is simply too costly for researchers to invest in studies with observation of real behaviors, high ecological validity, or longitudinal assessments that take
time and may produce a nonsignificant result.

Given the current environmental pressures, a low-quality/high-quantity strategy is
more adaptive and will ensure survival (publish or perish) and reproductive success (more graduate students who pursue a low-quality/high-quantity strategy).

[It doesn’t help to become a meta-psychologist. Which smart undergraduate student would risk their career prospects by becoming a meta-psychologist?]

A common misperception is that multiple-study articles should be rewarded because they required more effort than a single study. However, the number of studies is often a function of the difficulty of conducting research. It is therefore extremely problematic to
assume that multiple studies are more valuable than single studies.

A single longitudinal study can be costly but can answer questions that multiple cross-sectional studies cannot answer. For example, one of the most important developments in psychological measurement has been the development of the implicit association test
(Greenwald, McGhee, & Schwartz, 1998). A widespread belief about the implicit association test is that it measures implicit attitudes that are more stable than explicit attitudes (Gawronski, 2009), but there exist hardly any longitudinal studies of the stability of implicit attitudes.

[I haven’t checked but I don’t think this has changed much. Cross-sectional Mturk studies can still produce sexier results than a study that simply estimates the stability of the same measure over time.  Social psychologists tend to be impatient creatures (e.g., Bem)]

A simple way to change the incentive structure in the field is to undermine the false belief that multiple-study articles are better than single-study articles. Often multiple studies are better combined into a single study. For example, one article published four studies that were identical “except that the exposure duration—suboptimal (4 ms)
or optimal (1 s)—of both the initial exposure phase and the subsequent priming phase was orthogonally varied” (Murphy, Zajonc, & Monahan, 1995, p. 589). In other words, the four studies were four conditions of a 2 x 2 design. It would have been more efficient and
informative to combine the information of all studies in a single study. In fact, after reporting each study individually, the authors reported the results of a combined analysis. “When all four studies are entered into a single analysis, a clear pattern emerges” (Murphy et al., 1995, p. 600). Although this article may be the most extreme example of unnecessary multiplicity, other multiple-study articles could also be more informative by reducing the number of studies in a single article.

Apparently, readers of scientific articles are aware of the limited information gain provided by multiple-study articles because citation counts show that multiple-study articles do not have more impact than single-study articles (Haslam et al., 2008). Thus, editors should avoid using number of studies as a criterion for accepting articles.

Allow Publication of Nonsignificant Results

The main point of the M-index is to alert researchers, reviewers, editors, and readers of scientific articles that a series of studies that produced only significant results is neither a cause for celebration  nor strong evidence for the demonstration of a scientific discovery; at least not without a power analysis that shows the results are credible.

Given the typical power of psychological studies, nonsignificant findings should be obtained regularly, and the absence of nonsignificant results raises concerns about the credibility of published research findings.

Most of the time, biases may be benign and simply produce inflated effect sizes, but occasionally, it is possible that biases may have more serious consequences (e.g.,
demonstrate phenomena that do not exist).

A perfectly planned set of five studies, where each study has 80% power, is expected to produce one nonsignificant result. It is not clear why editors sometimes ask researchers to remove studies with nonsignificant results. Science is not a beauty contest, and a
nonsignificant result is not a blemish.

This wisdom is captured in the Japanese concept of wabi-sabi, in which beautiful objects are designed to have a superficial imperfection as a reminder that nothing is perfect. On the basis of this conception of beauty, a truly perfect set of studies is one that echoes the imperfection of reality by including failed studies or studies that did not produce significant results.

Even if these studies are not reported in great detail, it might be useful to describe failed studies and explain how they informed the development of studies that produced significant results. Another possibility is to honestly report that a study failed to produce a significant result with a sample size that provided 80% power and that the researcher then added more participants to increase power to 95%. This is different from snooping (looking at the data until a significant result has been found), especially if it is stated clearly that the sample size was increased because the effect was not significant with the originally planned sample size and the significance test has been adjusted to take into account that two significance tests were performed.

The M-index rewards honest reporting of results because reporting of null findings renders the number of significant results more consistent with the total power of the studies. In contrast, a high M-index can undermine the allure of articles that report more significant results than the power of the studies warrants. In this
way, post-hoc power analysis could have the beneficial effect that researchers finally start paying more attention to a priori power.

Limited resources may make it difficult to achieve high total power. When total power is modest, it becomes important to report nonsignificant results. One way to report nonsignificant results would be to limit detailed discussion to successful studies but to
include studies with nonsignificant results in a meta-analysis. For example, Bem (2011) reported a meta-analysis of all studies covered in the article. However, he also mentioned several pilot studies and a smaller study that failed to produce a significant
result. To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results.

This overall effect is functionally equivalent to the test of the hypothesis in a single
study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results.

[Since then, several articles have proposed meta-analyses and given tutorials on mini-meta-analyses without citing my article, without clarifying that these meta-analyses are only useful if all evidence is included, and without clarifying that bias tests like the M-Index can reveal whether all relevant evidence was included.]

It is also important that top journals publish failed replication studies. The reason is that top journals are partially responsible for the contribution of questionable research practices to published research findings. These journals look for novel and groundbreaking studies that will garner many citations to solidify their position as top journals. As everywhere else (e.g., investing), the higher payoff comes with a higher risk. In this case, the risk is publishing false results. Moreover, the incentives for researchers to get published in top journals or get tenure at Ivy League universities increase the probability that questionable research practices contribute to articles in the top journals (Ledford, 2010). Stapel faked data to get a publication in Science, not to get a publication in Psychological Reports.

There are positive signs that some journal editors are recognizing their responsibility for publication bias (Dirnagl & Lauritzen, 2010). The medical journal Journal of Cerebral Blood Flow and Metabolism created a section that allows researchers to publish studies with disconfirmatory evidence so that this evidence is published in the same journal. One major advantage of having this section in top journals is that it may change the evaluation criteria of journal editors toward a more careful assessment of Type I error when they accept a manuscript for publication. After all, it would be quite embarrassing to publish numerous articles that erred on the side of discovery if subsequent issues reveal that these discoveries were illusory.

[After some pressure from social media, JPSP did publish failed replications of Bem, and it now has a replication section (online only). Maybe somebody can dig up some failed replications of glucose studies (I know they exist), or run one more study to show in JPSP that, just like ESP, glucose is a myth.]

It could also reduce the use of questionable research practices by researchers eager to publish in prestigious journals if there were a higher likelihood that the same journal would publish failed replications by independent researchers. It might also motivate more researchers to conduct rigorous replication studies if they can bet against a finding and hope to get a publication in a prestigious journal.

The M-index can be helpful in putting pressure on editors and journals to curb the proliferation of false-positive results because it can be used to evaluate editors and journals in terms of the credibility of the results that are published in these journals.

As everybody knows, the value of a brand rests on trust, and it is easy to destroy this value when consumers lose that trust. Journals that continue to publish incredible results and suppress contradictory replication studies are not going to survive, especially given the fact that the Internet provides an opportunity for authors of suppressed replication studies to get their findings out (Spellman, 2012).

[I wrote this in the third revision when I thought the editor would not want to see the manuscript again.]

[I deleted the section where I pick on Ritchie's failed replications of Bem because three small studies with N = 50 each are underpowered, and their nonsignificant results can be dismissed as potential false negatives. Replication studies should have at least the sample size of the original studies, which was N = 100 for most of Bem's studies.]

Another solution would be to ignore p values altogether and to focus more on effect sizes and confidence intervals (Cumming & Finch, 2001). Although it is impossible to demonstrate that the true effect size is exactly zero, it is possible to estimate
true effect sizes with very narrow confidence intervals. For example, a sample of N = 1,100 participants would be sufficient to demonstrate that the true effect size of ESP is zero with a narrow confidence interval of plus or minus .05.

If an even more stringent criterion is required to claim a null effect, sample sizes would have to increase further, but there is no theoretical limit to the precision of effect size estimates. No matter whether the focus is on p values or confidence intervals, Cohen’s recommendation that bigger is better, at least for sample sizes, remains true because large samples are needed to obtain narrow confidence intervals (Goodman & Berlin, 1994).
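
As a rough illustration of how precision scales with sample size, the following R sketch computes the approximate half-width of a 95% confidence interval for an effect size in the correlation metric (the metric is my assumption for the example; other metrics and designs give somewhat different numbers):

# Approximate 95% CI half-width for a correlation near zero (Fisher z approximation)
ci_halfwidth <- function(N) 1.96 / sqrt(N - 3)
round(ci_halfwidth(c(100, 400, 1600, 6400)), 3)
# ~0.199 0.098 0.049 0.025: the half-width roughly halves every time N quadruples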


Changing paradigms is a slow process. It took decades to unsettle the stronghold of behaviorism as the main paradigm in psychology. Despite Cohen’s (1962) important contribution to the field 50 years ago and repeated warnings about the problems of underpowered studies, power analysis remains neglected (Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). I hope the M-index can make a small contribution toward the goal of improving the scientific standards of psychology as a science.

Bem’s (2011) article is not going to be a dagger in the heart of questionable research practices, but it may become the historic marker of a paradigm shift.

There are enough positive signs in the literature on meta-analysis (Sutton & Higgins, 2008), the search for better statistical methods (Wagenmakers, 2007)*, the call for more open access to data (Schooler, 2011), changes in publication practices of journals (Dirnagl & Lauritzen, 2010), and increasing awareness of the damage caused by questionable research practices (Francis, 2012a, 2012b; John et al., 2012; Kerr, 1998; Simmons et al., 2011) to be hopeful that a paradigm shift may be underway.

[Another sad story. I did not understand Wagenmakers's use of Bayesian methods at the time, and I honestly thought this work might make a positive contribution. However, in retrospect I realize that Wagenmakers is more interested in selling his statistical approach at any cost and disregards criticisms of his approach that have become evident in recent years. And, yes, I do understand how the method works and why it will not solve the replication crisis (see commentary by Carlsson et al., 2017, in Psychological Science).]

Even the Stapel debacle (Heatherton, 2010), where a prominent psychologist admitted to faking data, may have a healthy effect on the field.

[Heatherton emailed me, and I thought he was going to congratulate me on my nice article or thank me for citing him, but he was mainly concerned that quoting him in the context of Stapel might give the impression that he committed fraud.]

After all, faking increases Type I error by 100% and is clearly considered unethical. If questionable research practices can increase Type I error by up to 60% (Simmons et al., 2011), it becomes difficult to maintain that these widely used practices are questionable but not unethical.

[I guess I was a bit optimistic here. Apparently, you can hide as many studies as you want, but you cannot change one data point because that is fraud.]

During the reign of a paradigm, it is hard to imagine that things will ever change. However, for most contemporary psychologists, it is also hard to imagine that there was a time when psychology was dominated by animal research and reinforcement schedules. Older psychologists may have learned that the only constant in life is change.

[Again, too optimistic. Apparently, many old social psychologists still believe things will remain the same as they always were.  Insert head in the sand cartoon here.]

I have been fortunate enough to witness historic moments of change such as the falling of the Berlin Wall in 1989 and the end of behaviorism when Skinner gave his last speech at the convention of the American Psychological Association in 1990. In front of a packed auditorium, Skinner compared cognitivism to creationism. There was dead silence, made more audible by a handful of grey-haired members in the audience who applauded.

[Only I didn’t realize that research in 1990 had other problems. Nowadays I still think that Skinner was just another professor with a big ego and some published #me_too allegations to his name, but he was right in his concerns about (social) cognitivism as not much more scientific than creationism.]

I can only hope to live long enough to see the time when Cohen’s valuable contribution to psychological science will gain the prominence that it deserves. A better understanding of the need for power will not solve all problems, but it will go a long way toward improving the quality of empirical studies and the credibility of results published in psychological journals. Learning about power not only empowers researchers to conduct studies that can show real effects without the help of questionable research practices but also empowers them to be critical consumers of published research findings.

Knowledge about power is power.

References
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide
to publishing in psychological journals (pp. 3–16). Cambridge, England:
Cambridge University Press. doi:10.1017/CBO9780511807862.002

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality
and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Bonett, D. G. (2009). Meta-analytic interval estimation for standardized
and unstandardized mean differences. Psychological Methods, 14, 225–
238. doi:10.1037/a0016619

Castelvecchi, D. (2011). Has the Higgs been discovered? Physicists gear up
for watershed announcement. Scientific American. Retrieved from http://

Cohen, J. (1962). Statistical power of abnormal–social psychological research:
A review. Journal of Abnormal and Social Psychology, 65,
145–153. doi:10.1037/h0045186

Cohen, J. (1990). Things I have learned (so far). American Psychologist,
45, 1304–1312. doi:10.1037/0003-066X.45.12.1304

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.

Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic
attitudes: Combating automatic prejudice with images of admired and
disliked individuals. Journal of Personality and Social Psychology, 81,
800–814. doi:10.1037/0022-3514.81.5.800

Diener, E. (1998). Editorial. Journal of Personality and Social Psychology,
74, 5–6. doi:10.1037/h0092824

Dirnagl, U., & Lauritzen, M. (2010). Fighting publication bias: Introducing
the Negative Results section. Journal of Cerebral Blood Flow and
Metabolism, 30, 1263–1264. doi:10.1038/jcbfm.2010.51

Dvorak, R. D., & Simons, J. S. (2009). Moderation of resource depletion
in the self-control strength model: Differing effects of two modes of
self-control. Personality and Social Psychology Bulletin, 35, 572–583.

Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power
analysis program. Behavior Research Methods, 28, 1–11. doi:10.3758/

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the
sciences. PLoS One, 5, Article e10068. doi:10.1371/journal.pone

Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical
power analyses using G*Power 3.1: Tests for correlation and regression
analyses. Behavior Research Methods, 41, 1149–1160. doi:10.3758/

Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A
flexible statistical power analysis program for the social, behavioral, and
biomedical sciences. Behavior Research Methods, 39, 175–191. doi:

Fiedler, K. (2011). Voodoo correlations are everywhere—not only in
neuroscience. Perspectives on Psychological Science, 6, 163–171. doi:

Francis, G. (2012a). The same old New Look: Publication bias in a study
of wishful seeing. i-Perception, 3, 176–178. doi:10.1068/i0519ic

Francis, G. (2012b). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9

Fritz, M. S., & MacKinnon, D. P. (2007). Required sample size to detect
the mediated effect. Psychological Science, 18, 233–239. doi:10.1111/

Gailliot, M. T., Baumeister, R. F., DeWall, C. N., Maner, J. K., Plant,
E. A., Tice, D. M., & Schmeichel, B. J. (2007). Self-control relies on
glucose as a limited energy source: Willpower is more than a metaphor.
Journal of Personality and Social Psychology, 92, 325–336. doi:

Gawronski, B. (2009). Ten frequently asked questions about implicit
measures and their frequently supposed, but not entirely correct answers.
Canadian Psychology/Psychologie canadienne, 50, 141–150. doi:

Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence
intervals when planning experiments and the misuse of power when
interpreting results. Annals of Internal Medicine, 121, 200–206.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring
individual differences in implicit cognition: The implicit association test.
Journal of Personality and Social Psychology, 74, 1464–1480. doi:

Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J.,
& Wilson, S. (2008). What makes an article influential? Predicting
impact in social and personality psychology. Scientometrics, 76, 169–
185. doi:10.1007/s11192-007-1892-8

Heatherton, T. (2010). Official SPSP communiqué on the Diederik Stapel
debacle. Retrieved from

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in
the world? Behavioral and Brain Sciences, 33, 61–83. doi:10.1017/

Ioannidis, J. P. A. (2005). Why most published research findings are false.
PLoS Medicine, 2(8), Article e124. doi:10.1371/journal.pmed.0020124

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an
excess of significant findings. Clinical Trials, 4, 245–253. doi:10.1177/

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence
of questionable research practices with incentives for truth telling.
Psychological Science, 23, 524–532. doi:10.1177/0956797611430953

Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability
of implicit racial evaluations. Social Psychology, 41, 137–146.

Judd, C. M., & Gawronski, B. (2011). Editorial comment. Journal of
Personality and Social Psychology, 100, 406. doi:10.1037/0022789

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known.
Personality and Social Psychology Review, 2, 196–217. doi:10.1207/

Kurzban, R. (2010). Does the brain consume additional glucose during
self-control tasks? Evolutionary Psychology, 8, 244–259.

Ledford, H. (2010, August 17). Harvard probe kept under wraps. Nature,
466, 908–909. doi:10.1038/466908a

Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and problematic?
The rise of the short report in psychological science. Perspectives on Psychological Science, 7, 60–66. doi:10.1177/1745691611427304

Lehrer, J. (2010). The truth wears off. The New Yorker. Retrieved from

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological
research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147

Milloy, J. S. (1995). Science without sense: The risky business of public
health research. Washington, DC: Cato Institute.

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting
an optimal α that minimizes errors in null hypothesis significance tests.
PLoS One, 7(2), Article e32734. doi:10.1371/journal.pone.0032734

Murphy, S. T., Zajonc, R. B., & Monahan, J. L. (1995). Additivity of
nonconscious affect: Combined effects of priming and exposure. Journal
of Personality and Social Psychology, 69, 589–602. doi:10.1037/0022-

Ritchie, S. J., Wiseman, R., & French, C. C. (2012a). Failing the future:
Three unsuccessful attempts to replicate Bem’s “retroactive facilitation
of recall” effect. PLoS One, 7(3), Article e33423. doi:10.1371/

Rossi, J. S. (1990). Statistical power of psychological research: What have
we gained in 20 years? Journal of Consulting and Clinical Psychology,
58, 646–656. doi:10.1037/0022-006X.58.5.646

Royall, R. M. (1986). The effect of sample size on the meaning of
significance tests. American Statistician, 40, 313–315. doi:10.2307/

Schmidt, F. (2010). Detecting and correcting the lies that data tell. Perspectives
on Psychological Science, 5, 233–242. doi:10.1177/

Schooler, J. (2011, February 23). Unpublished results hide the decline
effect. Nature, 470, 437. doi:10.1038/470437a

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power
have an effect on the power of studies? Psychological Bulletin, 105,
309–316. doi:10.1037/0033-2909.105.2.309

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: Undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22,
1359–1366. doi:10.1177/0956797611417632

Smith, E. R. (2012). Editorial. Journal of Personality and Social Psychology,
102, 1–3. doi:10.1037/a0026676

Spellman, B. A. (2012). Introduction to the special section: Data, data,
everywhere . . . especially in my file drawer. Perspectives on Psychological
Science, 7, 58–59. doi:10.1177/1745691611432124

Stat Trek. (2012). Binomial calculator: Online statistical table. Retrieved

Steen, R. G. (2011a). Retractions in the scientific literature: Do authors
deliberately commit research fraud? Journal of Medical Ethics, 37,
113–117. doi:10.1136/jme.2010.038125

Steen, R. G. (2011b). Retractions in the scientific literature: Is the incidence
of research fraud increasing? Journal of Medical Ethics, 37,
249–253. doi:10.1136/jme.2010.040923

Sterling, T. D. (1959). Publication decisions and their possible effects on
inferences drawn from tests of significance— or vice versa. Journal of
the American Statistical Association, 54(285), 30–34. doi:10.2307/

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication
decisions revisited: The effect of the outcome of statistical tests on the
decision to publish and vice-versa. American Statistician, 49, 108–112.

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746

Sutton, A. J., & Higgins, J. P. I. (2008). Recent developments in meta-analysis.
Statistics in Medicine, 27, 625–650. doi:10.1002/sim.2934

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high
correlations in fMRI studies of emotion, personality, and social cognition.
Perspectives on Psychological Science, 4, 274–290. doi:10.1111/

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems
of p values. Psychonomic Bulletin & Review, 14, 779–804. doi:10.3758/

Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J.
(2011). Why psychologists must change the way they analyze their data:
The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100, 426–432. doi:10.1037/a0022790

Wegner, D. M. (1992). The premature demise of the solo experiment.
Personality and Social Psychology Bulletin, 18, 504–508. doi:10.1177/

Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations
reflect low statistical power—Commentary on Vul et al. (2009).
Perspectives on Psychological Science, 4, 294–298. doi:10.1111/j.1745-

Yong, E. (2012, May 16). Bad copy. Nature, 485, 298–300. doi:10.1038/

Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean
differences. Journal of Educational and Behavioral Statistics, 30, 141–
167. doi:10.3102/10769986030002141

Received May 30, 2011
Revision received June 18, 2012
Accepted June 25, 2012
Further Revised February 18, 2018


‘Before you know it’ by John A. Bargh: A quantitative book review

November 28, Open Draft/Preprint (Version 1.0)
[Please provide comments and suggestions]

In this blog post, I present a quantitative review of John A. Bargh's book "Before you know it: The unconscious reasons we do what we do." A quantitative book review is different from a traditional book review. The goal of a quantitative review is to examine the strength of the scientific evidence that is provided to support the ideas in the book. Readers of a popular science book written by an eminent scientist expect that these ideas are based on solid scientific evidence. However, the strength of scientific evidence in psychology, especially social psychology, has been questioned. I use statistical methods to examine how strong the evidence actually is.

One problem in psychological publishing is a bias in favor of studies that support theories, so-called publication bias. The reason for publication bias is that scientific journals can publish only a fraction of the results that scientists produce. This leads to heavy competition among scientists to produce publishable results, and journals like to publish statistically significant results; that is, studies that provide evidence for an effect (e.g., "eating green jelly beans cures cancer" rather than "eating red jelly beans does not cure cancer"). Statisticians have pointed out that publication bias undermines the meaning of statistical significance, just like counting only hits would undermine the meaning of batting averages: everybody would have an incredible batting average of 1.00.

For a long time it was assumed that publication bias is just a minor problem. Maybe researchers conducted 10 studies and reported only the 8 significant results while not reporting the remaining two studies that did not produce a significant result. However, in the past five years it has become apparent that publication bias, at least in some areas of the social sciences, is much more severe, and that there are more unpublished studies with non-significant results than published studies with significant results.

In 2012, Daniel Kahneman raised doubts about the credibility of priming research in an open email letter addressed to John A. Bargh, the author of "Before you know it." Daniel Kahneman is a big name in psychology; he won a Nobel Prize in economics in 2002. He also wrote a popular book that features John Bargh's priming research (see review of Chapter 4). Kahneman wrote: "As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research."

Kahneman is not an outright critic of priming research. In fact, he was concerned about the future of priming research and made some suggestions how Bargh and colleagues could alleviate doubts about the replicability of priming results.  He wrote:

“To deal effectively with the doubts you should acknowledge their existence and confront them straight on, because a posture of defiant denial is self-defeating. Specifically, I believe that you should have an association, with a board that might include prominent social psychologists from other fields. The first mission of the board would be to organize an effort to examine the replicability of priming results.”

However, prominent priming researchers have been reluctant to replicate their old studies. At the same time, other scientists have conducted replication studies and failed to replicate classic findings. One example is Ap Dijksterhuis's claim that showing words related to intelligence before taking a test can increase test performance. Shanks and colleagues tried to replicate this finding in 9 studies and came up empty in all 9 studies. More recently, a team of over 100 scientists conducted 24 replication studies of Dijksterhuis's professor priming study. Only 1 study successfully replicated the original finding, but with a 5% error rate, 1 out of 20 studies is expected to produce a statistically significant result by chance alone. This result validates Shanks' failures to replicate and strongly suggests that the original result was a statistical fluke (i.e., a false positive result).
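
A quick R sketch of this reasoning, assuming independent tests and a 5% false-positive rate when there is no effect:

# 24 replication studies with a 5% false-positive rate if the effect does not exist
n_studies <- 24
alpha <- 0.05
n_studies * alpha             # expected number of significant results by chance: 1.2
1 - (1 - alpha)^n_studies     # probability of at least one significant result by chance: ~0.71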

Proponents of priming research like Dijksterhuis "argue that social-priming results are hard to replicate because the slightest change in conditions can affect the outcome" (Abbott, 2013, Nature News). Many psychologists consider this response inadequate. The hallmark of a good theory is that it predicts the outcome of a good experiment. If the outcome depends on unknown factors and replication attempts fail more often than not, a scientific theory lacks empirical support. For example, Kahneman wrote in an email that the apparent "refusal to engage in a legitimate scientific conversation … invites the interpretation that the believers are afraid of the outcome" (Abbott, 2013, Nature News).

It is virtually impossible to check on all original findings by conducting extensive and expensive replication studies. Moreover, proponents of priming research can always find problems with actual replication studies to dismiss replication failures. Fortunately, there is another way to examine the replicability of priming research. This alternative approach, z-curve, uses a statistical approach to estimate replicability based on the results reported in original studies. Most important, this approach examines how replicable and credible original findings were based on the results reported in the original articles. Therefore, original researchers cannot use inadequate methods or slight variations in contextual factors to dismiss replication failures. Z-curve can reveal that the original evidence was not as strong as dozens of published studies may suggest because it takes into account that published studies were selected to provide evidence for priming effects.

My colleagues and I used z-curve to estimate the average replicability of priming studies that were cited in Kahneman’s chapter on priming research.  We found that the average probability of a successful replication was only 14%. Given the small number of studies (k = 31), this estimate is not very precise. It could be higher, but it could also be even lower. This estimate would imply that for each published significant result, there are  9 unpublished non-significant results that were omitted due to publication bias. Given these results, the published significant results provide only weak empirical support for theoretical claims about priming effects.  In a response to our blog post, Kahneman agreed (“What the blog gets absolutely right is that I placed too much faith in underpowered studies”).

Our analysis of Kahneman's chapter on priming provided a blueprint for this quantitative book review of Bargh's book "Before you know it." I first checked the notes for sources and then linked the sources to the corresponding references in the reference section. If the reference was an original research article, I downloaded the original research article and looked for the most critical statistical test of a study. If an article contained multiple studies, I chose one test from each study. I found 168 usable original articles that reported a total of 400 studies. I then converted all test statistics into absolute z-scores and analyzed them with z-curve to estimate replicability (see Excel file for coding of studies).
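
For readers who want to see the conversion step, here is a minimal R sketch (assuming two-tailed p-values; the z-curve estimation itself is a separate procedure that is not shown here, and the test statistics below are hypothetical examples):

# Convert reported test statistics to absolute z-scores via their two-tailed p-values
p_from_t <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)
p_from_F <- function(Fval, df1, df2) pf(Fval, df1, df2, lower.tail = FALSE)
z_from_p <- function(p) qnorm(1 - p / 2)      # absolute z-score for a two-tailed p-value

z_from_p(p_from_t(2.50, 38))      # hypothetical t(38) = 2.50 -> z of about 2.4
z_from_p(p_from_F(9.00, 1, 98))   # hypothetical F(1, 98) = 9.00 -> z of about 2.9
qnorm(1 - 0.05 / 2)               # 1.96, the threshold for statistical significance
qnorm(1 - 0.10 / 2)               # about 1.65, the threshold for marginal significance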

Figure 1 shows the distribution of absolute z-scores. 90% of test statistics were statistically significant (z > 1.96) and 99% were at least marginally significant (z > 1.65), meaning they passed a less stringent statistical criterion to claim a success. This is not surprising because supporting evidence requires statistical significance. The more important question is how many studies would produce a statistically significant result again if all 400 studies were replicated exactly. The estimated success rate in Figure 1 is less than half (41%). Although there is some uncertainty around this estimate, the 95% confidence interval only just reaches 50%, suggesting that the true value is below 50%. There is no clear criterion for inadequate replicability, but Tversky and Kahneman (1971) suggested a minimum of 50%. Professors also routinely give students who score below 50% on a test an F. I therefore decided to use the grading scheme at my university as a grading scheme for replicability scores. The overall grade for the replicability of the studies cited by Bargh to support the ideas in his book is an F.



This being said, 41% replicability is a lot more than we would expect by chance alone, namely 5%. Clearly some of the results mentioned in the book are replicable. The question is which findings are replicable and which ones are difficult to replicate or even false positive results. The problem with 41% replicable results is that we do not know which results we can trust. Imagine you are interviewing 100 eyewitnesses and only 41 of them are reliable. Would you be able to identify a suspect?

It is also possible to analyze subsets of studies. Figure 2 shows the results of all experimental studies that randomly assigned participants to two or more conditions. If a manipulation has an effect, it produces mean differences between the groups. Social psychologists like these studies because they allow for strong causal inferences and make it possible to disguise the purpose of a study. Unfortunately, this design requires large samples to produce replicable results, and social psychologists often used rather small samples in the past (the rule of thumb was 20 per group). As Figure 2 shows, the replicability of these studies is lower than the replicability of all studies. The average replicability is only 24%. This means that for every significant result there are at least three non-significant results that have not been reported due to the pervasive influence of publication bias.
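
The file-drawer arithmetic behind this statement can be written in one line, assuming the replicability estimate reflects the average power of all conducted studies, published or not:

# Implied number of nonsignificant studies per significant study when average power is p
file_drawer_ratio <- function(p) (1 - p) / p
file_drawer_ratio(0.24)   # ~3.2 nonsignificant studies for every significant one
file_drawer_ratio(0.41)   # ~1.4, the corresponding figure for the full set of studies in Figure 1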


If 24% doesn’t sound bad enough, it is important to realize that this estimate assumes that the original studies can be replicated exactly.  However, social psychologists have pointed out that even minor differences between studies can lead to replication failures.  Thus, the success rate of actual replication studies is likely to be even less than 24%.

In conclusion, the statistical analysis of the evidence cited in Bargh’s book confirms concerns about the replicability of social psychological studies, especially experimental studies that compared mean differences between two groups in small samples. Readers of the book should be aware that the results reported in the book might not replicate in a new study under slightly different conditions and that numerous claims in the book are not supported by strong empirical evidence.

Replicability of Chapters

I also estimated the replicability separately for each of the 10 chapters to examine whether some chapters are based on stronger evidence than others. Table 1 shows the results. Seven chapters scored an F, two chapters scored a D, and one chapter earned a C-.   Although there is some variability across chapters, none of the chapters earned a high score, but some chapters may contain some studies with strong evidence.

Table 1. Chapter Report Card

Chapter      Replicability (%)   Grade
Chapter 1    28                  F
Chapter 2    40                  F
Chapter 3    13                  F
Chapter 4    47                  F
Chapter 5    50                  D-
Chapter 6    57                  D+
Chapter 7    24                  F
Chapter 8    19                  F
Chapter 9    31                  F
Chapter 10   62                  C-

Credible Findings in the Book

Unfortunately, it is difficult to determine the replicability of individual studies with high precision. Nevertheless, studies with high z-scores are more replicable. Particle physicists use a criterion value of z > 5 to minimize the risk that the result of a single study is a false positive. I found that psychological studies with a z-score greater than 4 had an 80% chance of being replicated in actual replication studies. Using this rule as a rough guide, I was also able to identify credible claims in the book. Highlighting these claims does not mean that the other claims are wrong; it simply means that they are not supported by strong evidence.
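
For reference, the two-tailed p-values implied by these z-score thresholds can be computed directly (a quick check in R):

# Two-tailed p-values implied by absolute z-score thresholds
two_tailed_p <- function(z) 2 * pnorm(-abs(z))
two_tailed_p(1.96)   # ~.05, conventional significance
two_tailed_p(4)      # ~6e-5
two_tailed_p(5)      # ~6e-7, roughly the particle-physics criterion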

Chapter 1:    

According to Chapter 1, there seems "to be a connection between the strength of the unconscious physical safety motivation and a person's political attitudes." The notes list a number of articles to support this claim. The only conclusive evidence in these studies is that a self-report measure of political attitudes (right-wing authoritarianism) correlates with self-reported beliefs that the world is dangerous (Duckitt et al., JPSP, 2002, 2 studies, z = 5.42, 6.93). A correlation between two self-report measures is hardly evidence for unconscious physical safety motives.

Another claim is that "our biological mandate to reproduce can have surprising manifestations in today's world." This claim is linked to a study that examined the influence of physical attractiveness on callbacks for a job interview. In a large field experiment, researchers mailed resumes (N = 11,008) in response to real job ads and found that both men and women were more likely to be called for an interview if the application included a picture of a highly attractive applicant rather than a less attractive applicant (Busetta et al., 2013, z = 19.53). Although this is an interesting and important finding, it is not clear that the human resource offices' preference for attractive applicants was driven by a "biological mandate to reproduce."

Chapter 2: 

Chapter 2 introduces the idea that there is a fundamental connection between physical sensations and social relationships. "… why today we still speak so easily of a warm friend, or a cold father. We always will. Because the connection between physical and social warmth, and between physical and social coldness, is hardwired into the human brain." Only one z-score surpassed the 4-sigma threshold. This z-score comes from a brain imaging study that found increased sensorimotor activation in response to hand-washing products (soap) after participants had lied in a written email, but not after they had lied verbally (Schaefer et al., 2015, z = 4.65). There are two problems with this supporting evidence. First, z-scores in fMRI studies require a higher threshold than z-scores in other studies because brain imaging studies allow for multiple comparisons that increase the risk of a false positive result (Vul et al., 2009). More important, even if this finding could be replicated, it does not provide support for the claim that these neurological connections are hard-wired into humans' brains.

The second noteworthy claim in Chapter 2 is that infants "have a preference for their native language over other languages, even though they don't yet understand a word." This claim is not very controversial given ample evidence that humans prefer familiar over unfamiliar stimuli (Zajonc, 1968, also cited in the book). However, it is not so easy to study infants' preferences (after all, they are not able to tell us). Developmental researchers use a visual attention task to infer preferences: if an infant looks longer at one of two stimuli, this indicates a preference for that stimulus. Kinzler et al. (PNAS, 2007) reported six studies. In five studies, z-scores ranged from 1.85 to 2.92, which is insufficient evidence to draw strong conclusions. However, Study 6 provided convincing evidence (z = 4.61) that 5-year-old children in Boston preferred a native speaker to a child with a French accent. The effect was so strong that 8 children were sufficient to demonstrate it. However, a study with 5-year-olds hardly provides evidence for infants' preferences. In addition, the design of this study holds all other features constant. Thus, it is not clear how strong this effect is in the real world, where many other factors can influence the choice of a friend.

Chapter 3

Chapter 3 introduces the concept of priming. "Primes are like reminders, whether we are aware of the reminding or not." It uses two examples to illustrate priming with and without awareness. One example implies that people can be aware of the primes that influenced their behavior: if you are in the airport, smell Cinnabon, and suddenly find yourself in front of the Cinnabon counter, you are likely to know that the smell made you think about Cinnabon and decide to eat one. The second example introduces the idea that primes can influence behavior without awareness: if you were cut off in traffic, you may respond with more hostility to a transgression of a co-worker without being aware that the earlier experience in traffic influenced your reaction. The supporting references contain two noteworthy (z > 4) findings that show how primes can be used effectively as reminders (Rogers & Milkman, 2016, Psychological Science, Study 2a (N = 920, z = 5.45) and Study 5 (N = 305, z = 5.50)). In Study 2a, online participants were presented with the following instruction:

"In this survey, you will have an opportunity to support a charitable organization called Gardens for Health that provides lasting agricultural solutions to address the problem of chronic childhood malnutrition. On the 12th page of this survey, please choose answer "A" for the last question on that page, no matter your opinion. The previous page is Page 1. You are now on Page 2. The next page is Page 3. The picture below will be on top of the NEXT button on the 12th page. This is intended to remind you to select answer "A" for the last question on that page. If you follow these directions, we will donate $0.30 to Gardens for Health."


On pages 2-11, participants saw either a series of distinct animals or a series of different elephants.

Participants in the distinct-animal condition were more likely to choose the answer that triggered the donation than participants who saw a variety of elephants (z = 5.45).

Study 5 examined whether respondents would be willing to pay for a reminder. They were offered 60 cents of extra payment for responding with "E" to the last question, and they could either pay 3 cents to get an elephant reminder or not. 53% of participants were willing to pay for the reminder, which the authors compared to 0% (z = 2 × 10^9). This finding implies that participants are not only aware of the prime when they respond in the primed way, but are also aware of this link ahead of time and are willing to pay for it.

In short, Chapter 3 introduces the idea of unconscious or automatic priming, but the only solid evidence in the reference section supports the notion that we can also be consciously aware of priming effects and use them to our advantage.

Chapter 4

Chapter 4 introduces the concept of arousal transfer; the idea that arousal from a previous event can linger and influence how we react to another event. The book reports in detail a famous experiment by Dutton and Aron (1974).

“In another famous demonstration of the same effect, men who had just crossed a rickety pedestrian bridge over a deep gorge were found to be more attracted to a woman they met while crossing that bridge. How do we know this? Because they were more likely to call that woman later on (she was one of the experimenters for the study and had given these men her number after they filled out a survey for her) than were those who met the same woman while crossing a much safer bridge. The men in this study reported that their decision to call the woman had nothing to do with their experience of crossing the scary bridge. But the experiment clearly showed they were wrong about that, because those in the scary-bridge group were more likely to call the woman than were those who had just crossed the safe bridge.”

First, it is important to correct the impression that the men were asked about their reasons for calling. The original article does not report any questions about motives. This is the complete section of the results that mentions the call back:

"Female interviewer. In the experimental group, 18 of the 23 subjects who agreed to the interview accepted the interviewer's phone number. In the control group, 16 out of 22 accepted (see Table 1). A second measure of sexual attraction was the number of subjects who called the interviewer. In the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ2 = 5.7, p < .02). Taken in conjunction with the sexual imagery data, this finding suggests that subjects in the experimental group were more attracted to the interviewer."

A second concern is that the sample size was small and the evidence for the effect was not very strong: in the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ2 = 5.7, p < .02) [z = 2.4].

Finally, the authors mention a possible confound in this field study. It is possible that men who dared to cross the suspension bridge differ from men who crossed the safe bridge, and it has been shown that risk-taking men are more likely to engage in casual sex. Study 3 addressed this problem with a less colorful, but more rigorous, experimental design.

Male students were led to believe that they were participants in a study on electric shock and learning. An attractive female confederate (a student working with the experimenter but pretending to be a participant) was also present. The study had four conditions: male participants were told that they would receive either weak or strong shock, and they were told that the female confederate would receive either weak or strong shock. They were then asked to fill out a questionnaire before the study would start; in fact, the study ended after participants completed the questionnaire, and they were then told about the real purpose of the study.

The questionnaire contained two questions about the attractive female confederate. “How much would you like to kiss her?” and “How much would you like to ask her out on a date?”  Participants who were anticipating strong shock had much higher average ratings than those who anticipated weak shock, z = 4.46.

Although this is a strong finding, we also have a large literature on emotions and arousal that suggests frightening your date may not be the best way to get to second base (Reisenzein, 1983; Schimmack, 2005). It is also not clear whether arousal transfer is a conscious or unconscious process. One study cited in the book found that exercise did not influence sexual arousal right away, presumably because participants attributed their increased heart rate to the exercise. This suggests that arousal transfer is not entirely an unconscious process.

Chapter 4 also brings up global warming. An unusually warm winter day in Canada often makes people talk about global warming. A series of studies examined the link between weather and beliefs about global warming more scientifically. "What is fascinating (and sadly ironic) is how opinions regarding this issue fluctuate as a function of the very climate we're arguing about. In general, what Weber and colleagues found was that when the current weather is hot, public opinion holds that global warming is occurring, and when the current weather is cold, public opinion is less concerned about global warming as a general threat. It is as if we use "local warming" as a proxy for "global warming." Again, this shows how prone we are to believe that what we are experiencing right now in the present is how things always are, and always will be in the future. Our focus on the present dominates our judgments and reasoning, and we are unaware of the effects of our long-term and short-term past on what we are currently feeling and thinking."

One of the four studies produced strong evidence (z = 7.05).  This study showed a correlation between respondents’ ratings of the current day’s temperature and their estimate of the percentage of above average warm days in the past year.  This result does not directly support the claim that we are more concerned about global warming on warm days for two reasons. First, response styles can produce spurious correlations between responses to similar questions on a questionnaire.  Second, it is not clear that participants attributed above average temperatures to global warming.

A third credible finding (z = 4.62) is from another classic study (Ross & Sicoly, 1974, JPSP, Study 2a).  “You will have more memories of yourself doing something than of your spouse or housemate doing them because you are guaranteed to be there when you do the chores. This seems pretty obvious, but we all know how common those kinds of squabbles are, nonetheless. (“I am too the one who unloads the dishwasher! I remember doing it last week!”)”   In this study, 44 students participated in pairs. They were given separate pieces of information and exchange information to come up with a joint answer to a set of questions.  Two days later, half of the participants were told that they performed poorly, whereas the other half was told that they performed well. In the success condition, participants were more likely to make self-attributions (i.e., take credit) than expected by chance.

Chapter 5

In Chapter 5, John Bargh tells us about work by his supervisor Robert Zajonc (1968). "Bob was doing important work on the mere exposure effect, which is, basically, our tendency to like new things more, the more often we encounter them. In his studies, he repeatedly showed that we like them more just because they are shown to us more often, even if we don't consciously remember seeing them." The 1968 classic article contains two studies with strong evidence (Study 2, z = 6.84; Study 3, z = 5.81). Even though the sample sizes were small, this was not a problem because the studies presented many stimuli at different frequencies to all participants. This makes it easy to spot reliable patterns in the data.


Chapter 5 also introduces the concept of affective priming. Affective priming refers to the tendency to respond emotionally to a stimulus even if the task demands that we ignore it. We simply cannot help feeling good or bad; we cannot just turn our emotions off. The experimental way to demonstrate this is to present an emotional stimulus quickly followed by a second emotional stimulus. Participants have to respond to the second stimulus and ignore the first. It is easier to perform the task when the two stimuli have the same valence, suggesting that the valence of the first stimulus was processed even though participants had to ignore it. Bargh et al. (1996, JESP) reported that this happens even when the task is simply to pronounce the second word (Study 1, z = 5.42; Study 2, z = 4.13; Study 3, z = 3.97).

The book does not inform readers that we have to distinguish two types of affective priming effects. Affective priming is a robust finding when participants' task is to report on the valence (is it good or bad?) of the second stimulus following the prime. However, this finding has been interpreted by some researchers as an interference effect, similar to the Stroop effect. This explanation would not predict effects on a simple pronunciation task. However, there are fewer studies with the pronunciation task, and some of these have failed to replicate Bargh et al.'s original findings, despite the strong evidence observed in their studies. First, Klauer and Musch (2001) failed to replicate Bargh et al.'s finding that affective priming influences pronunciation of target words in three studies with good statistical power. Second, De Houwer et al. (2001) were able to replicate it with degraded primes, but failed to replicate the effect with the visible primes that were used by Bargh et al. In conclusion, affective priming is a robust effect when participants have to report on the valence of the second stimulus, but this finding does not necessarily imply that primes unconsciously activate related content in memory.

Chapter 5 also reports some surprising associations between individuals' names, or rather their initials, and the places they live, their professions, and their partners. These correlations are relatively small, but they are based on large datasets and are very unlikely to be just statistical flukes (z-scores ranging from 4.65 to 49.44). The causal process underlying these correlations is less clear. One possible explanation is that we have unconscious preferences that influence our choices. However, experimental studies that tried to study this effect in the laboratory are less convincing. Moreover, Hodson and Olson failed to find a similar effect across a variety of domains such as liking of animals (Alicia is not more likely to like ants than Samantha), foods, or leisure activities. They found a significant correlation for brand names (p = .007), but this finding requires replication. More recently, Kooti, Magno, and Weber (2014) examined name effects on social media. They found significant effects for some brand comparisons (Sega vs. Nintendo), but not for others (Pepsi vs. Coke). However, they also found that Twitter users were more likely to follow other Twitter users with the same first name. Taken together, these results suggest that individuals' names predict some choices, but it is not clear when or why this is the case.

The chapter ends with a not very convincing article (z = 2.39, z = 2.22) claiming that it is actually very easy to resist or override unwanted priming effects. According to this article, simply being told that somebody is a team member can make automatic prejudice go away. If it were so easy to control unwanted feelings, it is not clear why racism is still a problem 50 years after the civil rights movement started.

In conclusion, Chapter 5 contains a mix of well-established findings with strong support (mere-exposure effects, affective priming) and several less supported ideas. One problem is that priming is sometimes presented as an unconscious process that is difficult to control, while at other times these effects seem to be easily controllable. The chapter does not illuminate under which conditions priming influences our behavior in ways we do not notice or cannot control, and when we notice these influences and have the ability to control them.

Chapter 6

Chapter 6 deals with the thorny problem in psychological science that most theories make correct predictions sometimes. Even a broken clock tells the time right twice a day. The problem is to know in which context a theory makes correct predictions and when it does not.

"Entire books—bestsellers—have appeared in recent years that seem to give completely conflicting advice on this question: can we trust our intuitions (Blink, by Malcolm Gladwell), or not (Thinking, Fast and Slow, by Daniel Kahneman)? The answer lies in between. There are times when you can and should, and times when you can't and shouldn't [trust your gut]."

Bargh then proceeds to make 8 evidence-based recommendations about when it is advantageous to rely on intuition (gut feelings) without effortful deliberation.

Rule #1: supplement your gut impulse with at least a little conscious reflection, if you have the time to do so.

Rule #2: when you don't have the time to think about it, don't take big chances for small gains going on your gut alone.

Rule #3: when you are faced with a complex decision involving many factors, and especially when you don’t have objective measurements (reliable data) of those important factors, take your gut feelings seriously.

Rule #4: be careful what you wish for, because your current goals and needs will color what you want and like in the present.

Rule #5: when our initial gut reaction to a person of a different race or ethnic group is negative, we should stifle it.

Rule #6: we should not trust our appraisals of others based on their faces alone, or on photographs, before we’ve had any interaction with them.

Rule #7: (it may be the most important one of all): You can trust your gut about other people—but only after you have seen them in action.

Rule #8: it is perfectly fine for attraction be one part of the romantic equation, but not so fine to let it be the only, or even the main, thing.

Unfortunately, the credible evidence in this chapter (z > 4) is only vaguely related to these rules and insufficient to claim that these rules are based on solid scientific evidence.

Morewedge and Norton (2009) provide strong evidence that people in different cultures (US z = 4.52, South Korea z = 7.18, India z = 6.78) believe that dreams provide meaningful information about themselves. Study 3 used a hypothetical scenario to examine whether people would change their behavior in response to a dream. Participants were more likely to say that they would change a flight after dreaming about a plane crash the night before the flight than if they had merely thought about a plane crash the evening before, and dreams influenced behavior about as much as hearing about an actual plane crash (z = 10.13). In a related article, Morewedge and colleagues (2014) asked participants to rate types of thoughts (e.g., dreams, problem solving) in terms of spontaneity or deliberation. A second rating asked about the extent to which the type of thought would generate self-insight or merely reflect the current situation. They found that spontaneous thoughts were considered to generate more self-insight (Study 1, z = 5.32; Study 2, z = 5.80). In Study 5, they also found that more spontaneous recollection of a recent positive or negative experience with a romantic partner predicted hypothetical behavioral intention ratings ("To what extent might recalling the experience affect your likelihood of ending the relationship, if it came to mind when you tried to remember it?") (z = 4.06). These studies suggest that people find spontaneous, non-deliberate thoughts meaningful and that they are willing to use them in decision making. The studies do not tell us under which circumstances listening to dreams and other spontaneous thoughts (gut feelings) is beneficial.

Inbar, Cone, and Gilovich (2010) created a set of 25 choice problems (e.g., choosing an entree, choosing a college).  They found that “the more a choice was seen as objectively evaluable, the more a rational approach was seen as the appropriate choice strategy” (Study 1a, z = 5.95).  In a related study, they found “the more participants
thought the decision encouraged sequential rather than holistic processing, the more they thought it should be based on rational analysis” (Study 1b, z = 5.02).   These studies provide some insight into people’s beliefs about optimal decision rules, but they do not tell us whether people’s beliefs are right or wrong, which would require to examine people’s actual satisfaction with their choices.

Frederick (2005) examined personality differences in the processing of simple problems (e.g., "A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?"). The quick answer is 10 cents, but the correct answer is 5 cents. In this case, the gut response is false. A sample of over 3,000 participants answered several similar questions. Participants who performed above average were more willing to delay gratification (get $3,800 in a month rather than $3,400 now) than participants with below-average performance (z > 5). If we consider the bigger reward the better choice, these results imply that it is not good to rely on gut responses when it is possible to use deliberation to get the right answer.
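
For readers who want to check the arithmetic, the algebra is trivial (a one-line R sketch):

# bat = ball + 1.00 and bat + ball = 1.10 imply 2 * ball + 1.00 = 1.10
ball <- (1.10 - 1.00) / 2   # 0.05
bat  <- ball + 1.00         # 1.05
ball + bat                  # 1.10, as required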

Two studies by Wilson and Schooler (1991) are used to support the claim that we can overthink choices.

“In their first study, they had participants judge the quality of different brands of jam, then compared their ratings with those of experts. They found that the participants who were asked to spend time consciously analyzing the jam had preferences that differed further from those of the experts, compared to those who responded with just the “gut” of their taste buds.”  The evidence in this study with a small sample is not very strong and requires replication  (N = 49, z = 2.36).

“In Wilson and Schooler’s second study, they interviewed hundreds of college students about the quality of a class. Once again, those who were asked to think for a moment about their decisions were further from the experts’ judgments than were those who just went with their initial feelings.”


The description in the book does not match the actual study. There were three conditions. In the control condition, participants were asked to read the information about the courses carefully. In the reasons condition, participants were asked to write down their reasons, and in the rate-all condition participants were asked to rate all pieces of information, no matter how important, in terms of their effect on their choices. The study showed that considering all pieces of information increased the likelihood of choosing a poorly rated course (a bad choice), but had a much smaller effect on ratings of highly rated courses (z = 4.14 for the interaction effect). All conditions asked for some reflection, and it remains unclear how students would respond if they went with their initial feelings, as described in the book. Nevertheless, the study suggests that good choices require focusing on important factors and that paying attention to trivial factors can lead to suboptimal choices. For example, real estate agents in hot markets use interior design to drive up prices even though the design is not part of the sale.

We are born sensitive to violations of fair treatment and with the ability to detect those who are causing harm to others, and assign blame and responsibility to them. Recent research has shown that even children three to five years old are quite sensitive to fairness in social exchanges. They preferred to throw an extra prize (an eraser) away than to give more to one child than another—even when that extra prize could have gone to themselves. This is not an accurate description of the studies. Study 1 (z > 5) found that 6- to 8-year-old children preferred to give 2 erasers to one kid and 2 erasers to another kid and to throw the fifth eraser away to maintain equality (20 out of 20, p < .0001). However, “the 3-to 5-year-olds showed no preference to throw a resource away (14 out of 24, p = .54)” (p. 386). Subsequent studies used only 6- to 8-year-old children. Study 4 examined how children would respond if erasers were divided between themselves and another kid. 17 out of 20 (p = .003, z = 2.97) preferred to throw the eraser away rather than getting one more for themselves. However, in a related article, Shaw and Olson (2012b) found that children preferred favoritism (getting more erasers) when receiving more erasers was introduced as winning a contest (Study 2, z = 4.65). These studies are quite interesting, but they do not support the claim that equality norms are inborn, nor do they help us to figure out when we should or should not listen to our gut or whether it is better for us to be equitable or selfish.

The last, but in my opinion most interesting and relevant, piece of evidence in Chapter 6 is a large (N = 16,624) survey study of relationship satisfaction (Cacioppo et al., 2013, PNAS, z = 6.58). Respondents reported their relationship satisfaction and how they had met their partner. Respondents who had met their partner online were slightly more satisfied than respondents who had met their partner offline. There were also differences among the various ways of meeting. Respondents who met their partner in a bar had one of the lowest average levels of satisfaction. The study did not reveal why online dating is slightly more successful, but both forms of dating probably involve a combination of deliberation and “gut” reactions.

In conclusion, Chapter 6 provides some interesting insights into the way people make choices. However, the evidence does not provide a scientific foundation for recommendations about when it is better to follow your instinct and when it is better to rely on logical reasoning and deliberation. Either the evidence of the reviewed studies is too weak or the studies do not use actual choice outcomes as the outcome variable. The comparison of online and offline dating is a notable exception.

Chapter 7

Chapter 7 uses an impressive field experiment to support the idea that “our mental representations of concepts such as politeness and rudeness, as well as innumerable other behaviors such as aggression and substance abuse, become activated by our direct perception of these forms of social behavior and emotion, and in this way are contagious.”   Keizer et al. (2008) conducted the study in an alley in Groningen, a city in the Netherlands.  In one condition, bikes were parked in front of a wall with graffiti, despite an anti-graffiti sign.  In the other condition, the wall was clean.  Researchers attached fliers to the bikes and recorded how many users would simply throw the fliers on the ground.  They recorded the behaviors of 77 bike riders in each condition. In the graffiti condition, 69% of riders littered. In the clean condition, only 33% of riders littered (z = 4.51).


In Study 2, the researchers put up a fence in front of the entrance to a car park that required car owners to walk an extra 200m to get to their car, but they left a gap that allowed car owners to avoid the detour. There was also a sign that forbade locking bikes to the fence. In one condition, bikes were not locked to the fence. In the experimental condition, the norm was violated and four bikes were locked to the fence. 41 car owners’ behaviors were observed in each condition. In the experimental condition, 82% of car owners stepped through the gap. In the control condition, only 27% of car owners stepped through the gap (z = 5.27).


It is unlikely that bike riders or car owners in these studies consciously processed the graffiti or the locked bikes.  Thus, these studies support the hypothesis that our environment can influence behavior in subtle ways without our awareness.  Moreover, these studies show these effects with real-world behavior.

Another noteworthy study in Chapter 7 examined happiness in social networks (Fowler & Christakis, 2008). The authors used data from the Framingham Heart Study, a unique project in which most inhabitants of a small town, Framingham, participated. Researchers collected many measures, including a measure of happiness. They also mapped the social relationships among participants. Fowler and Christakis used sophisticated statistical methods to examine whether people who were connected in the social network (e.g., spouses, friends, neighbors) had similar levels of happiness. They did (z = 9.09). I may be more likely to believe these findings because I have found this in my own research on married couples (Schimmack & Lucas, 2010). Spouses are not only more similar to each other at one moment in time, they also change in the same direction over time. However, the causal mechanism underlying this effect is more elusive. Maybe happiness is contagious and can spread through social networks like a disease. However, it is also possible that related members of social networks are exposed to similar environments. For example, spouses share a common household income, and money buys some happiness. It is even less clear whether these effects occur outside of people’s awareness or not.

Chapter 7 ends with the positive message that a single person can change the world because his or her actions influence many people. “The effect of just one act, multiplies and spreads to influence many other people. A single drop becomes a wave.” This rosy conclusion overlooks the fact that the impact of one person decreases exponentially as it spreads over social networks. If you are kind to a neighbor, the neighbor may be slightly more likely to be kind to the pizza delivery man, but your effect on the pizza delivery man is already barely noticeable. This may be a good thing when it comes to the spreading of negative behaviors. Even if the friend of a friend is engaging in immoral behaviors, it doesn’t mean that you are more likely to commit a crime. To really change society it is important to change social norms and increase individuals’ reliance on these norms even when situational influences tell them otherwise. The more people have a strong norm not to litter, the less it matters whether there are graffiti on the wall or not.

Chapter 8

Chapter 8 examines dishonesty and suggests that dishonesty is a general human tendency. “When the goal of achievement and high performance is active, people are more likely to bend the rules in ways they’d normally consider  dishonest and immoral, if doing so helps them attain their performance goal”

Of course, not all people cheat in all situations even if they think they can get away with it.  So, the interesting scientific question is who will be dishonest in which context?

Mazar et al. (2008) examined situational effects on dishonesty. In Study 2 (z = 4.33) students were given an opportunity to cheat in order to receive a higher reward. The study had three conditions: a control condition that did not allow students to cheat, a cheating condition, and a cheating condition with an honor pledge. In the honor pledge condition, the test started with the sentence “I understand that this short survey falls under MIT’s [Yale’s] honor system.” This manipulation eliminated cheating. However, even in the cheating condition “participants cheated only 13.5% of the possible average magnitude.” Thus, MIT/Yale students are rather honest, or the incentive was too small to tempt them (an extra $2). Study 3 found that students were more likely to cheat if they were rewarded with tokens rather than money, even though they could later exchange the tokens for money. The authors suggest that cheating merely for tokens rather than real money made it seem less like “real” cheating (z = 6.72).

Serious immoral acts cannot be studied experimentally in a psychology laboratory. Therefore, research on this topic has to rely on self-reports and correlations. Pryor (1987) developed a questionnaire to study “Sexual Harassment Proclivities in Men.” The questionnaire asks men to imagine being in a position of power and to indicate whether they would take advantage of their power to obtain sexual favors if they knew they could get away with it. To validate the scale, Pryor showed that it correlated with a scale that measures how much men buy into rape myths (r = .40, z = 4.47). Self-reports on these measures have to be taken with a grain of salt, but the results suggest that some men are willing to admit that they would abuse power to gain sexual favors, at least in anonymous questionnaires.

Another noteworthy study found that even prisoners are not always dishonest. Cohn et al. (2015) used a gambling task to study dishonesty in 182 prisoners in a maximum security prison. Participants were given the opportunity to flip 10 coins and to keep all coins that showed heads. Importantly, the coin tosses were not observed. As it is possible, although unlikely, that all 10 coins show heads by chance, inmates could keep all coins and hide behind chance. The randomness of the outcome makes it impossible to accuse a particular prisoner of dishonesty. Nevertheless, the task makes it possible to measure dishonesty of the group (collective dishonesty) because the percentage of coin tosses reported as heads should be close to chance (50%). If it is significantly higher than chance, some prisoners must have been dishonest. On average, prisoners reported 60% heads, which reveals some dishonesty, but even convicted criminals were more likely to respond honestly than not (the percentage increased from 60% to 66% when they were primed with their criminal identity, z = 2.39).
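To make the group-level logic concrete, here is a minimal sketch in R; the assumption that all 182 prisoners completed 10 tosses, and therefore the exact counts, are mine and not taken from the original study.

# With honest reporting, about 50% of tosses should be reported as heads.
# Assumed totals: 182 prisoners x 10 tosses each, 60% reported as heads.
n.tosses <- 182 * 10
n.heads  <- round(0.60 * n.tosses)
binom.test(n.heads, n.tosses, p = 0.5)   # tests collective honesty of the group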

I see some parallels between the gambling task and the world of scientific publishing, at least in psychology.  The outcome of a study is partially determined by random factors. Even if a scientist does everything right, a study may produce a non-significant result due to random sampling error. The probability of observing a non-significant result is called a type-II error. The probability of observing a significant result is called statistical power.  Just like in a coin toss experiment, the observed percentage of significant results should match the expected percentage based on average power.  Numerous studies have shown that researchers report more significant results than the power of their studies justifies. As in the coin toss experiment, it is not possible to point the finger at a single outcome because chance might have been in a researcher’s favor, but in the long run the odds “cannot be always in your favor” (Hunger Games).  Psychologists disagree whether the excess of significant results in psychology journals should be attributed to dishonesty.  I think it is and it fits Bargh’s observation that humans, and most scientists are humans, have a tendency to bend the rules when doing so helps them to reach their goal, especially when the goal is highly relevant (e.g., get a job, get a grant, get tenure). Sadly, the extent of over-reporting significant results is considerably larger than the 10 to 15% overreporting of heads in the prisoner study.

Chapter 9

Chapter 9 introduces readers to Metcalfe’s work on insight problems (e.g., how to put 27 animals into 4 pens so that there is an odd number of animals in all four pens). Participants had to predict quickly whether they would be able to solve a problem. They then got 5 minutes to actually solve it. Participants were not able to predict accurately which insight problems they would solve. Metcalfe concluded that the solution to insight problems comes during a moment of sudden illumination that is not predictable. Bargh adds, “This is because the solver was working on the problem unconsciously, and when she reached a solution, it was delivered to her fully formed and ready for use.” In contrast, people are able to predict memory performance on a recognition test, even when they are not able to recall the answer immediately. This phenomenon, known as the tip-of-the-tongue effect (z = 5.02), shows that we have access to our memory even before we can recall the final answer. It is similar to the feeling of familiarity that is created by mere exposure (Zajonc, 1968). We often know a face is familiar without being able to recall specific memories of where we encountered it.

The only other noteworthy study in Chapter 9 was a study of sleep quality (Fichten et al., 2001).  “The researchers found that by far, the most common type of thought that kept them awake, nearly 50 percent of them, was about the future, the short-term events coming up in the next day or week. Their thoughts were about what they needed to get done the following day, or in the next few days.”   It is true that 48% thought about future short-term events, but only 1% described these thoughts as worries, and 57% of these thoughts were positive.  It is not clear, however, whether this category distinguished good and poor sleepers.  What distinguished good sleepers from poor sleepers, especially those with high distress, was the frequency of negative thoughts (z = 5.59).

Chapter 10

Chapter 10 examines whether it is possible to control automatic impulses. Ample research by personality psychologists suggests that controlling impulses is easier for some people than others.  The ability to exert self-control is often measured with self-report measures that predict objective life outcomes.

However, the book adds a twist to self-control. “The most effective self-control is not through willpower and exerting effort to stifle impulses and unwanted behaviors. It comes from effectively harnessing the unconscious powers of the mind to much more easily do the self-control for you.”

There is a large body of strong evidence that some individuals, those with high impulse control and conscientiousness, perform better academically or at work (Tangney et al., 2004; Study 1 z = 5.90, Galla & Duckworth, Studies 1, 4, & 6, Ns = 488, 7.62, 5.18). Correlations between personality measures and outcomes do not reveal the causal mechanism that leads to these positive outcomes. Bargh suggests that individuals who score high on self-control measures are “the ones who do the good things less consciously, more automatically, and more habitually. And you can certainly do the same.” This may be true, but empirical work to demonstrate it is hard to find. At the end of the chapter, Bargh cites a recent study by Milyavskaya and Inzlicht that suggested avoiding temptations is more important than being able to exert self-control in the face of temptation, whether willfully or unconsciously.


The book “Before You Know It: The Unconscious Reasons We Do What We Do” is based on the author’s personal experiences, studies he has conducted, and studies he has read. The author is a scientist, and I have no doubt that he shares with his readers insights that he believes to be true. However, this does not automatically make them true. John Bargh is well aware that many psychologists are skeptical about some of the findings that are used in the book. Famously, some of Bargh’s own studies have been difficult to replicate. One response to concerns about replicability could have been new demonstrations that important unconscious priming effects can be replicated. In an interview, Tom Bartlett (January 2013) suggested this to John Bargh.

“So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway. Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.”

Beliefs are subjective. Readers of the book have their own beliefs and may find parts of the book interesting and may be willing to change some of their beliefs about human behavior. Not that there is anything wrong with this, but readers should also be aware that it is reasonable to treat the ideas presented in this book with a healthy dose of skepticism. In 2011, Daniel Kahneman wrote, “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” Five years later, it is pretty clear that Kahneman is more skeptical about the state of priming research and the results of experiments with small samples in general. Unfortunately, it is not clear which studies we can believe until replication studies distinguish real effects from statistical flukes. So, until we have better evidence, we are still free to believe what we want about the power of unconscious forces on our behavior.


(Preprint) Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies (Schimmack & Brunner, 2017)

Update:  March 20, 2018
An earlier version included a reference to my role as editor of Meta-Psychology. I apologize for including this reference. The journal has nothing to do with this blog post, and the tone of this blog post reflects only my personal frustration with traditional peer-reviews. Some readers should be warned that the tone of this blog post is rude. Some people think this is inappropriate. I consider it an open and transparent depiction of what really goes on in academia, where scientists’ egos are often more important than an objective search for the truth. And yes, I have an ego, too, and I think the only way to deal with it is an open and frank exchange of arguments and critical examination of all arguments. Reviews that simply dismiss alternative ideas are not helpful and cannot advance psychology as a science.

In this PDF document, Jerry Brunner and I would like to share our latest manuscript on z-curve, a method that estimates the average power of a set of studies selected for significance. We call this estimate replicability because average power determines the success rate if the set of original studies were replicated exactly.

We welcome all comments and criticism as we plan to submit this manuscript to a peer-reviewed journal by December 1.


Comparison of P-curve and Z-Curve in Simulation studies

Estimate of average replicability in Cuddy et al.’s (2017) P-curve analysis of power posing with z-curve (30% for z-curve vs. 44% for p-curve).

Estimating average replicability in psychology based on over 500,000 significant test statistics.

Comparing automated extraction of test statistics and focal hypothesis tests using Motyl et al.’s (2016) replicability analysis of social psychology.


UPDATE:   17-Mar-2018

The manuscript was rejected. Here you can read the editor’s reasons and the reviews (2 anonymous and 1 signed by Leif Nelson) and make up your own mind about whether these reviews contain valid criticism. Importantly, nobody questions the key findings of the simulation studies, which show that our method is unbiased whereas p-curve, which is already being used as a statistical tool, can provide inflated estimates in realistic scenarios when power varies across studies. We think the decision not to publish a method that improves on an existing method that is being used is somewhat strange for a journal that calls itself ADVANCES IN METHODS AND PRACTICES.


Dear Dr. Schimmack:

Thank you for submitting your manuscript (AMPPS-17-0114) entitled “Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies” to Advances in Methods and Practices in Psychological Science (AMPPS). First, my apologies for the overly long review process. I initially struggled to find reviewers for the paper and I also had to wait for the final review. In the end, I received guidance from three expert reviewers whose comments appear at the end of this message.

Reviewers 1 and 2 chose to remain anonymous and Reviewer 3 is Leif Nelson (signed review). Reviewers 1 and 2 were both strongly negative and recommended rejection. Nelson was more positive about the goals of the paper and approach, although he wasn’t entirely convinced by the approach and evidence. I read the paper independently of the reviews, both before sending it out and again before reading the reviews (given that it had been a while). My take was largely consistent with that of the reviewers.

Although the issue of estimating replicability from published results is an important one, I was less convinced about the method and felt that the paper does not do enough to define the approach precisely, and it did not adequately demonstrate its benefits and limits relative to other meta-analytic bias correction techniques. Based on the comments of the reviewers and my independent evaluation, I found these issues to be substantial enough that I have decided to decline the manuscript.

The reviews are extensive and thoughtful, and I won’t rehash all of the details in my letter. I would like to highlight what I see as the key issues, but many of the other comments are important and substantive. I hope you will find the comments useful as you continue to develop this approach (which I do think is a worthwhile enterprise).

All three reviews raised concerns about the clarity of the paper and the figures as well as the lack of grounding for a number of strong claims and conclusions (they each quote examples). They also note the lack of specificity for some of the simulations and question the datasets used for the analyses.

I agreed that the use of some of the existing data sets (e.g., the scraped data, the Cuddy data, perhaps the Motyl data) is not an ideal way to demonstrate the usefulness of this tool. Simulations in which you know and can specify the ground truth seem more helpful in demonstrating the advantages and constraints of this approach.

Reviewers 1 and 2 both questioned the goal of estimating average power. Reviewer 2 presents the strongest case against doing so. Namely, average power is a weird quantity to estimate in light of a) decades of research on meta-analytic approaches to estimating the average effect size in the face of selection, and b) the fact that average power is a transformation of effect size. To demonstrate that Z-curve is a valid measure and an improvement over existing approaches, it seems critical to test it against other established meta-analytic models.

p-curve is relatively new, and as reviewer 2 notes, it has not been firmly established as superior to other more formal meta-analytic approaches (it might well be better in some contexts and worse in others). When presenting a new method like Z-curve, it is important to establish it against well-grounded methods or at least to demonstrate how accurate, precise, and biased it is under a range of realistic scenarios. In the context of this broader literature on bias correction, the comparison only to p-curve seems narrow, and a stronger case would involve comparing the ability of Z-curve to recover average effect size against other models of bias correction (or power if you want to adapt them to do that).

[FYI:  p-curve is the only other method that aims to estimate average power of studies selected for significance.  Other meta-analytic tools aim to estimate effect sizes, which are related to power but not identical. ]

Nelson notes that other analyses show p-curve to be robust to heterogeneity and argues that you need to more clearly specify why and when Z-curve does better or worse. I would take that as a constructive suggestion that is worth pursuing (e.g., he and the other reviewers are right that you need to provide more specificity about the nature of the heterogeneity you are modeling).

I thought Nelson’s suggestions for ways to explain the discrepant results of these two approaches were constructive, and they might help to explain when each approach does better, which would be a useful contribution. Just to be clear, I know that the datacolada post that Nelson cites was posted after your paper was submitted and I’m not factoring your paper’s failure to anticipate it into my decision (after all, Bem was wrong).

[That blog post was posted after I shared our manuscript with Uri and tried to get him to comment on z-curve. In a long email exchange he came up with scenarios in which p-curve did better, but he never challenged the results of my simulations showing that it performs a lot worse when there is heterogeneity. To refer to this self-serving blog post as a reason for rejection is problematic at best, especially if the simulation results in the manuscript are ignored.]

Like Reviewer 2 and Nelson, I was troubled by the lack of a data model for Z curve (presented around page 11-12). As best I can tell, it is a weighted average of 7 standard normal curves with different means. I could see that approach being useful, and it might well turn out to be optimal for some range of cases, but it seems  arbitrary and isn’t suitably justified. Why 7? Why those 7? Is there some importance to those choices?

The data model (I had never heard the term “data model”) was specified, and it is so simple that the editorial letter even characterizes it correctly. The observed density distribution is modeled as a weighted average of the density distributions of 7 normal distributions (SD = 1) with different means, and yes, 7 is arbitrary, but the exact number has very little influence on the results.
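For concreteness, here is a minimal sketch in R of the mixture idea as described here. It is not the authors’ z-curve code; the fitting grid, the least-squares criterion against a kernel density, and the softmax parameterization of the weights are my own assumptions, while the 7 component means 0 to 6 follow the description in the correspondence.

crit  <- qnorm(.975)                     # two-tailed significance criterion (about 1.96)
means <- 0:6                             # the 7 fixed component means

# density of |Z| for Z ~ N(mu, 1), truncated to the significant region |Z| > crit
dz <- function(z, mu) {
  (dnorm(z - mu) + dnorm(z + mu)) / (pnorm(mu - crit) + pnorm(-mu - crit))
}

# power of a two-sided z-test when the true mean of the z-statistic is mu
pow <- function(mu) pnorm(mu - crit) + pnorm(-mu - crit)

# fit the 7 weights by least squares against a kernel density of the observed |z|
fit_zcurve <- function(z.sig, grid = seq(crit, 6, .01)) {
  dens <- density(z.sig, from = crit, to = 6)
  obs  <- approx(dens$x, dens$y, xout = grid)$y
  comp <- sapply(means, function(mu) dz(grid, mu))    # grid x 7 matrix of component densities
  loss <- function(par) {
    w <- exp(par) / sum(exp(par))                     # keep weights on the simplex
    sum((comp %*% w - obs)^2)
  }
  w <- exp(optim(rep(0, length(means)), loss)$par)
  w <- w / sum(w)
  sum(w * pow(means))                                 # weighted average power = replicability estimate
}

# usage with simulated heterogeneous studies selected for significance
set.seed(42)
z.all <- rnorm(5000, mean = rep(c(1, 2, 3), length.out = 5000), sd = 1)
fit_zcurve(abs(z.all)[abs(z.all) > crit])             # rough estimate of average power after selection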

Do they reflect some underlying principle or are they a means to an end? If the goal is to estimate only the end output from these weights, how do we know that those are the right weights to use?

Because the simulation results show that the model recovers the simulated average power correctly within 3 percentage points?

If the discrete values themselves are motivated by a model, then the fact that the weight estimates for each component are not accurate even with k=10000 seems worrisome.

No, it is not worrisome, because the end goal is the average, not the individual weights.

If they aren’t motivated by a more formal model, how were they selected and what aspects of the data do they capture? Similarly, doesn’t using absolute value mean that your model can’t handle sign errors for significant results? And, how are your results affected by an arbitrary ceiling at z=6?

There are no sign errors in analyses of significant results that cover different research questions. Heck, there is not even a reasonable way to speak about signs. Is neuroticism a negative predictor of well-being, or is emotional stability a positive predictor of well-being?

Finally, the paper comments near the end that this approach works well if k=100, but that doesn’t inform the reader about whether it works for k=15 or k=30 as would be common for meta-analysis in psychology.

The editor doesn’t even seem to understand that this method is not intended to be used for a classic effect size meta-analysis. We do have statistical methods for that. Believe me, I know that. But how would we apply these methods to estimate the replicability of social psychology? And why would we use k = 30 to do so, when we can use k = 1,000?

To show that this approach is useful in practice, it would be good to show how it fares with sets of results that are more typical in scale in psychology. What are the limits of its usefulness? That could be demonstrated more fully with simulations in which the ground truth is known.

No bias correction method that relies on only significant results provides meaningful results with k = 30. We provide 95% confidence intervals, and they are huge with k = 30.

I know you worked hard on the preparation of this manuscript, and that you will be disappointed by this outcome. I hope that you will find the reviewer comments helpful in further developing this work and that the outcome for this submission will not discourage you from submitting future manuscripts to AMPPS.

Daniel J. Simons, Editor
Advances in Methods and Practices in Psychological Science (AMPPS) Psychology

Unfortunately, I don’t think the editor worked hard on reading the manuscript, and he missed the main point of the contribution. So, no, I am not planning on wasting more time sending my best work to this journal. I had hopes that AMPPS was serious about improving psychology as a science. Now I know better. I will also no longer review for your journal. Good luck with your efforts to do actual replication studies. I will look elsewhere to publish my work that makes original research more credible to start with.

Ulrich  Schimmack


Reviewer: 1

The authors of this manuscript introduce a new statistical method, the z-curve, for estimating the average replicability of empirical studies. The authors evaluate the method via simulation methods and via select empirical examples; they also compare it to an alternative approach (p-curve). The authors conclude that the z-curve approach works well, and that it may be superior to the p-curve in cases where there is substantial heterogeneity in the effect sizes of the studies being examined. They also conclude, based on applying the z-curve to specific cases, that the average power of studies in some domains (e.g., power posing research and social psychology) is low.

One of the strengths of this manuscript is that it addresses an important issue: How can we evaluate the replicability of findings reported in the literature based on properties inherent to the studies (or their findings) themselves?  In addition, the manuscript approaches the issue with a variety of demonstrations, including simulated data, studies based on power posing, scraped statistics from psychology journals, and social psychological studies.

After reading the manuscript carefully, however, I’m not sure I understand how the z-curve works or how it is supposed to solve potential problems faced by other approaches for evaluating the power of studies published in the empirical literature.

That is too bad.  We provided detailed annotated R-Code to make it possible for quantitative psychologists to understand how z-curve works.  It is unfortunate that you were not able to understand the code.  We would have been happy to answer questions.

I realize the authors have included links to more technical discussions of the z-curve on their websites, but I think a manuscript like this should be self-contained–especially when it is billed as an effort “to introduce and evaluate a new statistical method” (p. 25, line 34).

We think that extensive code is better provided in a supplement. However, this is really an editorial question and not a comment on the quality or originality of our work.

Some additional comments, questions, and suggestions:

1. One of my concerns is that the approach appears to be based on using “observed power” (or observed effect sizes) as the basis for the calculations. And, although the authors are aware of the problems with this (e.g., published effect sizes are over-estimates of true effect sizes; p. 9), they seem content with using observed effect sizes when multiple effect sizes from diverse studies are considered. I don’t understand how averaging values that are over-estimates of true values can lead to anything other than an inflated average. Perhaps this can be explained better.

Again, we are sorry that you did not understand how our method achieves this goal, but that is surely not a reason to recommend rejection.

2. Figure 2 is not clear. What is varying on the x-axis? Why is there a Microsoft-style spelling error highlighted in the graph?

The Figure was added after an exchange with Uri Simonsohn and reproduces the simulation that they did for effect sizes (see text).  The x-axis shows d-values.

3. Figure 4 shows a z-curve for power posing research. But the content of the graph isn’t explained. What does the gray, dotted line represent? What does the solid blue line represent? (Is it a smoothed density curve?) What do the hashed red vertical lines represent? In short, without guidance, this graph is impossible to understand.

Thank you for your suggestion. We will revise the manuscript to make it easier to read the figures.

4. I’m confused on how Figure 2 is relevant to the discussion (see p. 19, line 22)

Again, we are sorry about the confusion that we caused. The Figure shows that p-curve overestimates power in some scenarios (d = .8, SD = .2), which was not apparent when Simonsohn did these simulations to estimate effect sizes.

5. Some claims are made without any explanation or rationale. For example, on page 19 the authors write “Random sampling error cannot produce this drop” when commenting on the distribution of z-scores in the power posing data. But no explanation is offered for how this conclusion is reached.

The standard deviation of z-scores due to random sampling error is one. So, we should see a lot of values next to the mode of a distribution. A steep drop cannot be produced by random sampling error. The same observation has been made repeatedly about a string of just-significant p-values. If you can get .04, .03, .02, again and again, why do you not get .06 or .11?
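The point is easy to check with a short simulation in R; the true mean z-score of 2.2 is an arbitrary assumption used only for illustration.

# Sampling error alone produces a smooth decline around the mode of the
# distribution of significant |z| values, not a cliff just above 1.96.
set.seed(1)
z <- abs(rnorm(1e5, mean = 2.2, sd = 1))   # assumed true mean z of 2.2
hist(z[z > qnorm(.975)], breaks = 50, freq = FALSE, xlab = "|z|",
     main = "No steep drop from sampling error alone")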

6. I assume the authors are re-analyzing the data collected by Motyl and colleagues for Demonstration 3? This isn’t stated explicitly; one has to read between the lines to reach this conclusion.

You read correctly between the lines.

7. Figure 6 contains text which states that the estimated replicability is 67%. But the narrative states that the estimated replicability using the z-curve approach is 46% (p. 24, line 8). Is the figure using a different method than the z-curve method?

This is a real problem. This was the wrong Figure.  Thank you for pointing it out. The estimate in the text is correct.

8.  p. 25. Unclear why Figure 4 is being referenced here.

Another typo. Thanks for pointing it out.

9. The authors write that “a study with 80% power is expected to produce 4 out of 5 significant results in the long run.” (p. 6). This is only true when the null hypothesis is false. I assume the authors know this, but it would be helpful to be precise when describing concepts that most psychologists don’t “really” understand.

If a study has 80% power, it is implied that the null-hypothesis is false. A study in which the null-hypothesis is true produces significant results with a probability of alpha.

10. I am not sure I understand the authors’ claim that, “once we take replicability into account, the distinction between false positives and true positives with low power becomes meaningless.” (p. 7).

We are saying in the article that there is no practical difference between a study with power = alpha (5%) where the null-hypothesis is true and a study with very low power (6%) where the null-hypothesis is false. Maybe it helps to think about effect sizes. d = 0 means the null is true and power is 5%, and d = 0.000000000000001 means the null-hypothesis is false and power is 5.000000001%. In terms of the probability of replicating a significant result, both studies have a very low probability of doing so.
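A small numeric illustration of this point; the per-group sample size of 50 and the z-test approximation are my assumptions.

# Power of a two-sided z-test for a standardized mean difference d with n per group.
crit <- qnorm(.975)
pow  <- function(d, n) {
  ncp <- d * sqrt(n / 2)                  # noncentrality of the z-statistic
  pnorm(ncp - crit) + pnorm(-ncp - crit)  # two-sided rejection probability
}
pow(0,      50)   # 0.050: "false positive", power equals alpha
pow(0.0001, 50)   # 0.050...: "true positive", but practically indistinguishable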

11. With respect to the second demonstration: The authors should provide a stronger justification for examining all reported test statistics. It seems that the z-curve’s development is mostly motivated by debates concerning the on-going replication crisis. Presumably, that crisis concerns the evaluation of specific hypotheses in the literature (e.g., power posing effects of hormone levels) and not a hodge-podge of various test results that could be relevant to manipulation checks, age differences, etc. I realize it requires more work to select the tests that are actually relevant to each research article than to scrape all statistics robotically from a manuscript, but, without knowing whether the tests are “relevant” or not, it seems pointless to analyze them and draw conclusions about them.

This is a criticism of one dataset, not a criticism of the method.

12. Some of the conclusions that the authors reach, such as “our results suggest that the majority of studies in psychology fail to meet the minimum standard of a good study . . . and even more studies fail to meet the well-known and accepted norm that studies should have 80% power” have been reached by other authors too.

But what methodology did these authors use to come to this conclusion? Did they validate their method with simulation studies?

This leads me to wonder whether the z-curve approach represents an incremental advance over other approaches. (I’m nitpicking when I say this, of course. But, ultimately, the “true power” of a collection of studies is not really a “thing.”

What does it mean that the true power of studies is not a “thing”? Researchers conduct (many) significance tests and see whether they get a publishable significant result. The average percentage of times they get a significant result is the true average power of the population of statistical tests that are being conducted. Of course, we can only estimate this true value, but who says that other estimates that we use every day are any better than the z-curve estimates?

It is a useful fiction, of course, but getting a more precise estimate of it might be overkill.)

Sure, let’s not get too precise. Why don’t we settle for 50% +/- 50% and call it a day?

Perhaps the authors can provide a stronger justification for the need of highly precise, but non-transparent, methods for estimating power in published research?

Just because you don’t understand the method doesn’t mean it is not transparent, and maybe it could be useful to know that social psychologists conduct studies with 30% power and only publish results that fit their theories and became significant with the help of luck. Maybe we have had 6 years of talk about a crisis without any data except the OSC results in 2015, which are limited to 2008 and three journals. But maybe we just don’t care, because it is 2018 and it is time to get on with business as usual. Glad you were able to review for a new journal that was intended to Advance Methods and Practices in Psychological Science. Clearly, estimating the typical power of studies in psychology is not important for this goal in your opinion. Again, sorry for submitting such a difficult manuscript and wasting your time.


Reviewer: 2

The authors present a new methodology (“z-curve”) that purports to estimate the average power of a set of studies included in a meta-analysis which is subject to publication bias (i.e., statistically significant studies are over-represented among the set meta-analyzed). At present, the manuscript is not suitable for publication largely for three major reasons.

[1] Average Power: The authors propose to estimate the average power of the set of prior historical studies included in a meta-analysis. This is a strange quantity: meta-analytic research has for decades focused on estimating effect sizes. Why are the authors proposing this novel quantity? This needs ample justification. I for one see no reason why I would be interested in such a quantity (for the record, I do not believe it says much at all about replicability).

Why three reasons, if the first reason is that we are doing something silly? Who gives a fuck about the power of studies? Clearly, knowing how powerful studies are is as irrelevant as knowing the number of potholes in Toronto. Thank you for your opinion, which unfortunately was shared by the editor and largely by the first reviewer.

There is another reason why this quantity is strange, namely it is redundant. In particular, under a homogeneous effect size, average power is a simple transformation of the effect size; under heterogeneous effect sizes, it is a simple transformation of the effect size distribution (if normality is assumed for the effect size distribution, then a simple transformation of the average effect size and the heterogeneity variance parameter; if a more complicated mixture distribution is assumed as here then a somewhat more complicated transformation). So, since it is just a transformation, why not stick with what meta-analysts have focused on for decades!

You should have stopped when things were going well. Now you are making silly comments that show your prejudice and ignorance. The whole point of the paper is to present a method that estimates average power when there is heterogeneity (if this is too difficult for you, let’s call it variability, or even better, you know, bro, power is not always the same in each study). If you missed this, you clearly didn’t read the manuscript for more than two minutes. So, your clever remark about redundancy is just a waste of my time and the time of readers of this blog, because things are no longer so simple when there is heterogeneity. But maybe you even know this and just wanted to be a smart ass.

[2] No Data Model / Likelihood: On pages 10-13, the authors heuristically propose a model but never write down the formal data model or likelihood. This is simply unacceptable in a methods paper: we need to know what assumptions your model is making about the observed data!

We provided R-code that not only makes clear how z-curve works but also was available for reviewers to test. The assumptions are made clear and are simple. This is not some fancy Bayesian model with 20 unproven priors. We simply estimate a single population parameter from the observed distribution of z-scores, and we make this pretty clear. It is simple, makes minimal assumptions, and it works. Take that!

Further, what are the model parameters? It is unclear whether they are mixture weights as well as means, just mixture weights, etc. Further, if it is just the weights, and your setting the means to 0, 1, …, 6 is not just an example but embedded in your method, this is sheer ad hockery.

Again, it works. What is your problem?

It is quite clear from your example (Page 12-13) the model cannot recover the weights correctly even with 10,000 (whoa!) studies! This is not good. I realize your interest is on the average power that comes out of the model and not the weights themselves (these are a means to an end) but I would nonetheless be highly concerned—especially as 20-100 studies would be much more common than 10,000.

Unlike some statisticians, we do not pretend that we can estimate something that cannot be estimated without making strong and unproven assumptions. We are content with estimating what we can estimate, and that is average power, which, of course, you think is useless. If average power is useless, why would it be better if we could estimate the weights?

[3] Model Validation / Comparison: The authors validate their z-curve by comparing it to an ad hoc improvised method known as the p-curve (“p-Curve and effect size”, Perspectives on Psychological Science, 2014). The p-curve method was designed to estimate effect sizes (as per the title of the paper) and is known to perform extremely poorly at this task (particularly under effect size heterogeneity); there is no work validating how well it performs at estimating this rather curious average power quantity (but likely it would do poorly given that it is poor at estimating effect sizes and average power is a transformation of the effect size). Thus, knowing the z-curve performs better than the p-curve at estimating average power tells me next to nothing: you cannot validate your model against a model that has no known validation properties! Please find a compelling way to validate your model estimates (some suggested in the paragraphs below) whether that is via theoretical results, comparison to other models known to perform well, etc. etc.

No, we are not validating z-curve with p-curve. We are validating z-curve with simulation studies that show z-curve produces good estimates of the simulated true power. We only included p-curve to show that this method produces biased estimates when there is considerable variability in power.

At the same time, we disagree with the claim that p-curve is not a good tool to estimate average effect sizes from a set of studies that are selected for significance. It is actually surprisingly good at estimating the average effect size for the set of studies that were selected for significance (as is p-uniform).

It is not a good tool to estimate the effect size for the population of studies before selection for significance, but this is irrelevant in this context because we focus on replicability which implies that an original study produced a significant result and we want to know how likely it is that a replication study will produce a significant result again.

Relatedly, the results in Table 1 are completely inaccessible. I have no idea what you are presenting here and this was not made clear either in the table caption or in the main text. Here is what we would need to see at minimum to understand how well the approach performs—at least in an absolute sense.

[It shows the estimates (mean, SD) by the various models for our 3 x 3 design of the simulation study. But who cares; the objective is useless, so you probably spent 5 seconds trying to understand the Table.]

First, and least important, we need results around bias: what is the bias in each of the simulation scenarios (these are implicitly in the Table 1 results I believe)? However, we also need a measure of accuracy, say RMSE, a metric the authors should definitely include for each simulation setting. Finally, we need to know something about standard errors or confidence intervals so we can know the precision of individual estimates. What would be nice to report is the coverage percentage of your 95% confidence intervals and the average width of these intervals in each simulation setting.

There are many ways to present results about accuracy. Too bad we didn’t pick the right way, but would it matter to you? You don’t really think it is useful anyway.

This would allow us to, if not compare methods in a relative way, to get an absolute assessment of model performance. If, for example, in some simulation you have a bias of 1% and an RMSE of 3% and coverage percentage of 94% and average width of 12% you would seem to be doing well on all metrics*; on the other hand, if you have a bias of 1% and an RMSE of 15% and coverage percentage of 82% and average width of 56%, you would seem to be doing poorly on all metrics but bias (this is especially the case for RMSE and average width bc average power is bounded between 0% and 100%).

* Of course, doing well versus poorly is in the eye of the beholder and for the purposes at hand, but I have tried to use illustrative values for the various metrics that for almost all tasks at hand would be good / poor performance.

For this reason, we presented the Figure that showed how often the estimates were outside +/- 10%, because we think estimates of power do not need to be more precise than that. No need to make a big deal out of 33% vs. 38% power, but 30% vs. 80% matters.
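For readers who want to compute the metrics the reviewer lists, here is a minimal sketch in R; the function name and the made-up numbers in the usage example are hypothetical and not from the manuscript.

# Summarize simulation output with bias, RMSE, CI coverage, and CI width.
# est, lo, hi: vectors of power estimates and 95% CI bounds; truth: simulated true value.
summarize_sim <- function(est, lo, hi, truth) {
  c(bias     = mean(est - truth),
    rmse     = sqrt(mean((est - truth)^2)),
    coverage = mean(lo <= truth & truth <= hi),
    width    = mean(hi - lo))
}

# usage with made-up numbers
set.seed(123)
truth <- 0.60
est   <- truth + rnorm(1000, 0, 0.05)
summarize_sim(est, est - 0.10, est + 0.10, truth)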

I have many additional comments. These are not necessarily minor at all (some are; some aren’t) but they are minor relative to the above three:

[a] Page 4: a priori effect sizes: You dismiss these hastily, which is a shame. You should give them more treatment, and especially discuss the compelling use of them by Gelman and Carlin here:

This paper is absolutely irrelevant for the purpose of z-curve, which is to estimate the actual power that researchers achieve in their studies.

[b] Page 5: What does “same result” and “successful replication” mean? You later define this in terms of statistical significance. This is obviously a dreadful definition as it is subject to all the dichotomization issues intrinsic to the outmoded null hypothesis significance paradigm. You should not rely on dichotomization and NHST so strongly.

What is obvious to you is not the scientific consensus. The most widely used criterion for a successful replication study is to get a significant result again. Of course, we could settle for getting the same sign again and a 50% type-I error probability, but hey, as a reviewer you get to say whatever you want without accountability.

Further, throughout please replace “significant” by “statistically significant” and related terms when it is the latter you mean.

[F… You]

[c] Page 6: Your discussion regarding if studies had 80% then up to 80% of results would be successful is not quite right: this would depend on the prior probability of “non-null” studies.

[that is why we wrote UP TO]

[d] Page 7: I do not think 50% power is at all “good”. I would be appalled in fact to trust my scientific results to a mere coin toss. You should drop this or justify why coin tosses are the way we should be doing science.

We didn’t say it is all good. We used it as a minimum; less than that is all bad, but that doesn’t mean 50% is all good. But hey, you don’t care anyway, so what the heck.

[e] Page 10: Taking absolute values of z-statistics seems wrong as the sign provides information about the sign of the effect. Why do you do this?

It is only wrong if you are thinking about a meta-analysis of studies that test the same hypothesis. However, if I want to examine the replicability of more than one specific hypothesis, all results have to be coded so that a significant result implies support for the hypothesis in the direction of significance.
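A minimal sketch of this coding step in R; the test statistics below are hypothetical examples, not values from the manuscript.

# Convert two-tailed p-values from different tests into absolute z-scores,
# so that every significant result is coded as support in the tested direction.
p.2t <- c(2 * pt(-abs(2.50), df = 38),          # hypothetical t(38) = 2.50
          pf(9.0, 1, 100, lower.tail = FALSE))  # hypothetical F(1, 100) = 9.0
z <- qnorm(p.2t / 2, lower.tail = FALSE)        # two-tailed p -> |z|
round(z, 2)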

[f] Page 13 and throughout: There are ample references to working papers and blog posts in this paper. That really is not going to cut it. Peer review is far from perfect but these cited works do not even reach that low bar.

Well, that is better than quoting hearsay rumors from blog posts, in a peer review of a methods paper, that the coding in some dataset is debatable.

[g] Page 16: What was the “skewed distribution”? More details about this and all simulation settings are necessary. You need to be explicit about what you are doing so readers can evaluate it.

We provided the R-code to recreate the distributions or change them. It doesn’t matter; the conclusions remain the same.

[h] Page 15, Figure 2: Why plot the median and not the mean? Where are SEs or CIs on this figure?

Why do you need a CI or SE for simulations, and why do you need one to see that there is a difference between 0 and 80%?

[i] Page 14: p-curve does NOT provide good estimates of effect sizes!

Wrong. You don’t know what you are talking about.  It does provide a good estimate of average effect sizes for the set of studies selected for significance, which is the relevant set here.  

[j] You find p-curve is biased upwards for average power under heterogeneity; this seems to follow directly from the fact that it is biased upwards for effect size under heterogeneity (“Adjusting for Publication Bias in Meta-analysis”, Perspectives on Psychological Science, 2016) and the simple mapping between effect size and average power discussed above.

Wrong again. You are confusing estimates of average effect size for the studies before selection and after selection for significance.

[k] Page 20: Can z-curve estimate heterogeneity (the answer is yes)? You should probably provide such estimates.

We do not claim that z-curve estimates heterogeneity. Maybe some misunderstanding.

[l] Page 21-23: I don’t think the concept of the “replicability of all of psychology” is at all meaningful*. You are mixing apples and oranges in terms of areas studies as well as in terms of tests (focal tests vs manipulation checks). I would entirely cut this.

Of course, we can look for moderators but that is not helpful to you because you don’t  think the concept of power is useful.

* Even if it were, it seems completely implausible that the way to estimate it would be to combine all the studies in a single meta-analysis as here.


[m] Page 23-25: I also don’t think the concept of the “replicability of all of social psychology” is at all meaningful. Note also there has been much dispute about the Motyl coding of the data so it is not necessarily reliable.

Of course you don’t, but why should I care about your personal preferences.

Further, why do you exclude large sample, large F, and large df1 studies? This seems unjustified. 

They are not representative, but it doesn’t make a difference.  

[n] Page 25: You write “47% average power implies that most published results are not false positives because we would expect 52.5% replicability if 50% of studies were false positives and the other 50% of studies had 100% power.” No, I think this will depend on the prior probability.

Wrong again. If 50% of studies were false positives, their power would be 5%. If the other 50% of studies had the maximum of 100% power, we would see a clearly visible bimodal distribution of z-scores and an average power estimate of p(H0) * 5 + (1 - p(H0)) * 100 = 52.5%. You are a smart boy (sorry for assuming the reviewer is male); you can figure it out.
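The arithmetic behind the 52.5% figure, in two lines of R:

# Two-point mixture: 50% false positives (power = alpha = .05), 50% at 100% power.
p.H0 <- 0.5
p.H0 * 0.05 + (1 - p.H0) * 1.00    # = 0.525, i.e., the 52.5% mentioned above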

[o] Page 25: What are the analogous z-curve results if those extreme outliers are excluded? You give them for p-curve but not z-curve.

We provided that information, but you would need to care enough to look for it.

[p] Page 27: You say the z-curve limitations are not a problem when there are 100 or more studies and some heterogeneity. The latter is fine to assume as heterogeneity is rife in psychological research but seldom do we have 100+ studies. Usually 100 is an upper bound so this poses problems for your method.

It doesn’t mean our method doesn’t work with smaller sets of studies. Moreover, the goal is not to conduct an effect size meta-analysis, but apparently you missed that because you don’t really care about the main objective, which is to estimate replicability. Not sure why you agreed to review a paper that is titled “A Method for Estimating Replicability.”

Final comment: Thanks for nothing. 


Reviewer: 3

This review was conducted by Leif Nelson

[Thank you for signing your review.]

Let me begin by apologizing for the delay in my review; the process has been delayed because of me and not anyone else in the review team. 

Not sure why the editor waited for your review. He could have rejected the paper after reading the first two reviews, which claim that the whole objective, which you and I think is meaningful, is irrelevant for advancing psychological science. Sorry for the unnecessary trouble.

Part of the delay was because I spent a long time working on the review (as witnessed by the cumbersome length of this document). The paper is dense, makes strong claims, and is necessarily technical; evaluating it is a challenge.

I commend the authors for developing a new statistical tool for such an important topic. The assessment of published evidence has always been a crucial topic, but in the midst of the current methodological renaissance, it has gotten a substantial spotlight.

Furthermore, the authors are technically competent and the paper articulates a clear thesis. A new and effective tool for identifying the underlying power of studies could certainly be useful, and though I necessarily have a positive view of p-curve, I am open to the idea that a new tool could be even better.

Ok, enough with the politeness. Let’s get to it.

I am not convinced that Z-curve is that tool. To be clear, it might be, but this paper does not convince me of that.

As expected,…. p < .01.  So let’s hear why the simulation results and the demonstration of inflated estimates in real datasets do not convince you.

I have a list of concerns, but a quick summary might save someone from the long slog through the 2500 words that follow:

  1. The authors claim that, relative to Z-curve, p-curve fails under heterogeneity and do not report, comment on, or explain analyses showing exactly the opposite of that assertion.

Wow. Let me parse this sentence. The authors claim p-curve fails under heterogeneity (yes) and do not report … analyses showing … the opposite of that assertion.

Yes, that is correct. We do not show results opposite to our assertion. We show results that confirm our assertion in Figures 1 and 2. We show, in simulations with the R-code that we provided (and that you could have used to run your own simulations), that z-curve provides very good estimates of average power when there is heterogeneity and that p-curve tends to overestimate average power. That is the key point of this paper. Now how much time did you spend on this review, exactly?

2. The authors do show that Z-curve gives better average estimates under certain circumstances, but they neither explain why, nor clarify what those circumstances look like in some easy to understand way, nor argue that those circumstances are representative of published results.

Our understanding was that technical details are handled in the supplement that we provided.  The editor asked us to supply R-code again for a reviewer but it is not clear to us which reviewer actually used the provided R-code to answer technical questions like this.  The main point is made clear in the paper. When the true power (or z-values) varies across studies,  p-curve tends to overestimate.  Not sure the claims of being open are very credible if this main point is ignored.

3. They attempt to demonstrate the validity of the Z-curve with three sets of clearly invalid data.

No, we do not attempt to validate z-curve with real datasets. That would imply that we already know the average power in real data, which we do not. We used simulations to validate z-curve and to show that p-curve estimates are biased. We used real data only to show that the differences in estimates have real-world implications. For example, when we use the Motyl et al. (JPSP) data to examine replicability, z-curve gives a reasonable estimate of 46% (in line with the reported R-Index estimates in the JPSP article), while p-curve gives an estimate of 72% power. This is not a demonstration of validity; it is a demonstration that p-curve would overestimate the replicability of social psychological findings in a way that most readers would consider practically meaningful.

I think that any one of those would make me an overall negative evaluator; the combination only more so. Despite that, I could see a version which clarified the “heterogeneity” differences, acknowledged the many circumstances where Z-curve is less accurate than p-curve, and pointed out why Z-curve performs better under certain circumstances. Those might not be easy adjustments, but they are possible, and I think that these authors could be the right people to do it. (the demonstrations should simply be removed, or if the authors are motivated, replaced with valid sets).

We already point out when p-curve does better. When there is minimal variability or actually identical power, the precision of p-curve is 2-3 percentage points better.

Brief elaboration on the first point: In the initial description of p-curve the authors seem to imply that it should/does/might have "problems when the true power is heterogeneous". I suppose that is an empirical question, but it is one that has been answered. In the original paper, Simonsohn et al. report results showing how p-curve behaves under some types of heterogeneity. Furthermore, and more recently, we have reported how p-curve responds under other different and severe forms of heterogeneity (DataColada[67]). Across all of those simulations, p-curve does indeed seem to perform fine. If the authors want to claim that it doesn't perform well enough (with some quantifiable statement about what that means), or perhaps that there are some special conditions in which it performs worse, that would be entirely reasonable to articulate. However, to say "the robustness of p-curve has not been tested" is not even slightly accurate and quite misleading.

These are totally bogus and cherry-picked simulations that were conducted after I shared a preprint of this manuscript with Uri. I don't agree with Reviewer 2 that we shouldn't use blogs, but the content of a blog post needs to be accurate and scientific. The simulations in this blog post are not. The variation of power is very small. In contrast, we examine p-curve and z-curve in a fair comparison with varying amounts of heterogeneity, as found in real data sets. In these simulations p-curve again does slightly better when there is no heterogeneity, but it does a lot worse when there is considerable variability.

To ignore the results in the manuscript and to claim that the blog post shows something different is not scientific.  It is pure politics. The good news is that simulation studies have a real truth and the truth is that when you simulate large variability in power,  p-curve starts overestimating average power.  We explain that this is due to the use of a single parameter model that cannot model heterogeneity. If we limit z-curve to a single parameter it has the same problem. The novel contribution of z-curve is to use multiple (3 or 7 doesn’t matter much) parameters to model heterogeneity.  Not surprisingly, a model that is more consistent with the data produces better estimates.

Brief elaboration on the second point: The paper claims (and shows) that p-curve performs worse than Z-curve with more heterogeneity. DataColada[67] claims (and shows) that p-curve performs better than Z-curve with more heterogeneity.

p-curve does not perform better with more heterogeneity. I had a two-week email exchange with Uri when he came up with simulations that showed better performance of p-curve. For example, the transformation to z-scores is an approximation, and when you use t-values with small N (all studies have N = 20), the approximation leads to suboptimal estimates. Also, smaller k is an issue because z-curve estimates density distributions. So, I am well aware of limited, specialized situations where p-curve can do better by up to 10 percentage points, but that doesn't change the fact that it does a lot worse when p-curve is applied to real heterogeneous data like I have been analyzing for years (ego-depletion replicability report, Motyl focal hypothesis tests, etc.).

I doubt neither set of simulations. That means that the difference – barring an error or similar – must lie in the operational definition of “heterogeneity.” Although I have a natural bias in interpretation (I assisted in the process of generating different versions of heterogeneity to then be tested for the DataColada post), I accept that the Z-curve authors may have entirely valid thinking here as well. So a few suggestions: 1. Since there is apparently some disagreement about how to operationalize heterogeneity, I would recommend not talking about it as a single monolithic construct.

How is variability in true power not a single construct? We have a parameter and it can vary from alpha to 1. Or we have a population effect size and a specific amount of sampling error, and that gives us a ratio that reflects the deviation of a test statistic from 0. I understand the aim of saving p-curve, but in the end p-curve in its current form is unable to handle larger amounts of heterogeneity. You provide no evidence to the contrary.

Instead clarify exactly how it will be operationalized and tested and then talk about those. 2. When running simulations, rather than only reporting the variance or the skewness, simply show the distribution of power in the studies being submitted to Z-curve (as in DataColada[67]). Those distributions, at the end of the day, will convey what exactly Z-curve (or p-curve) is estimating. 3. To the extent possible, figure out why the two differ. What are the cases where one fails and the other succeeds? It is neither informative (nor accurate) to describe Z-curve as simply “better”. If it were better in every situation then I might say, “hey, who cares why?”. But it is not. So then it becomes a question of identifying when it will be better.

Again, I had a frustrating email correspondence with Uri and the issues are all clear and do not change the main conclusion of our paper.  When there is large heterogeneity, modeling this heterogeneity leads to unbiased estimates of average power, whereas a single component model tends to produce biased estimates.

Brief elaboration on the third point: Cuddy et al. selected incorrect test statistics from problematic studies. Motyl et al. selected lots and lots of incorrect tests. Scraping test statistics is not at all relevant to an assessment of the power of the studies where they came from. These are all unambiguously invalid. Unfortunately, one cannot therefore learn anything about the performance of Z-curve in assessing them.

I really don’t care about Cuddy. What I do care about is that they used p-curve as if it can produce accurate estimates of average power and reported an estimate to readers that suggested they had the right estimate, when p-curve again overestimated average power.

The claims about Motyl are false. I have done my own coding of these studies and, despite a few inconsistencies in coding some studies, I get the same results with my coding. Please provide your own coding of these studies and I am sure the results will be the same. Unless you have coded Motyl et al.'s studies, you should not make unfounded claims about this dataset or the results that are based on it.

OK, with those in mind, I list below concerns I have with specifics in the paper. These are roughly ordered based on where they occur in the paper:

Really, I would love to stop here, but I am a bit obsessive-compulsive, although readers might already have enough information to draw their own conclusions.

* The paper contains a number of statements of fact that seem too certain. Just one early example, "the most widely used criterion for a successful replication is statistical significance (Killeen, 2005)." That is a common definition and it may be the most common, but that is hardly a certainty (even with a citation). It would be better to simply identify that definition as common and then consider its limitations (while also considering others).

Aside from being the most common, it is also the most reasonable. How else would we compare the results of a study that claimed the effect is positive, 95% CI d = .03 to 1.26, to the results of a replication study? Would we say, wow, replication d = .05, this is consistent with the original study, therefore we have a successful replication?

* The following statement seems incorrect (and I think that the authors would be the first to agree with me): “Exact replications of the original study should also produce significant results; at least we should observe more successful than failed replications if the hypothesis is true.” If original studies were all true, but all powered at 25%, then exact (including sample size) replications would be significant 25% of the time. I assume that I am missing the argument, so perhaps I am merely suggesting a simple clarification. (p. 6)

You misinterpret the intention here. We are stating that a good study should be replicable and are implying that a study with 25% power is not a good study. At a minimum we would expect a good study to be more often correct than incorrect, which happens when power is over 50%.

* I am not sure that I completely understand the argument about the equivalence of low power and false positives (e.g., “Once we take replicability into account, the distinction between false positives and true positives with low power becomes meaningless, and it is more important to distinguish between studies with good power that are replicable and studies with low power or false positives that are difficult to replicate.”) It seems to me that underpowered original studies may, in the extreme case, be true hypotheses, but they lack meaningful evidence. Alternatively, false positives are definitionally false hypotheses that also, definitionally, lack meaningful evidence. If a replicator were to use a very large sample size, they would certainly care about the difference. Note that I am hardly making a case in support of the underpowered original – I think the authors’ articulations of the importance of statistical power is entirely reasonable – but I think the statement of functional equivalence is a touch cavalier.

Replicability is a property of the original study.  If the original study had 6% power it is a bad study, even if a subsequent study with 10 times the sample size is able to show a significant result with much more power.

* I was surprised that there was no discussion of the Simonsohn Small Telescopes perspective in the statistical evaluation of replications. That offers a well-cited and frequently discussed definition of replicability that talks about many of the same issues considered in this introduction. If the authors think that work isn’t worth considering, that is fine, but they might anticipate that other readers would at least wonder why it was not.

The paper is about the replicability of published findings, not about sample size planning for replication studies. Average power predicts what would happen in a study with the same sample sizes, not what would happen if sample sizes were increased. So, the Small Telescopes paper is not relevant.

* The consideration of the Reproducibility Project struck me as lacking nuance. It takes the 36% estimate too literally, despite multiple articles and blog posts which have challenged that cut-and-dried interpretation. I think that it would be reasonable to at least give some voice to the Gilbert et al. criticisms which point out that, given the statistical imprecision of the replication studies, a more positive estimate is justifiable. (again, I am guessing that many people – including me – share the general sense of pessimism expressed by the authors, but a one-sided argument will not be persuasive).

Are you nuts? Gilbert may have had one or two points about specific replication studies, but his broader claims about the OSC results are utter nonsense, even if they were published as a commentary in Science.  It is a trivial fact that the success rate in a set of studies that is not selected for significance is an estimate of average power.  If we didn’t have a file drawer, we could just count the percentage of significant results to know how low power actually is. However, we do have file drawers, and therefore we need a statistical tool like z-curve to estimate average power if that is a desirable goal.  If you cannot see that the OSC data are the best possible dataset to evaluate bias-correction methods with heterogeneous data, you seem to lack the most fundamental understanding of statistical power and how it relates to success rates in significance tests.
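To illustrate the point about success rates in unselected studies, here is a small R sketch with hypothetical power values (one-sided for simplicity; this is not the z-curve code):

# Without selection for significance, the observed success rate estimates average power.
set.seed(123)
power.i <- runif(10000, .05, .95)                 # heterogeneous true power across studies
z.crit  <- qnorm(.975)
z.obs   <- rnorm(10000, qnorm(power.i) + z.crit)  # observed z-scores implied by true power
mean(z.obs > z.crit)                              # success rate
mean(power.i)                                     # average true power; the two values agree closely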

* The initial description of Z-Curve is generally clear and brief. That is great. On the other hand I think that a reasonable standard should be that readers would need neither to download and run the R-code nor go and read the 2016 paper in order to understand the machinery of the algorithm. Perhaps a few extra sentences to clarify before giving up and sending readers to those other sources.

This is up to the editor. We are happy to move content from the Supplement to the main article or do anything else that can improve clarity and communication.  But first we need to be given an opportunity to do so.

* I don’t understand what is happening on pages 11-13. I say that with as much humility as possible, because I am sure that the failing is with me. Nevertheless, I really don’t understand. Is this going to be a telling example? Or is it the structure of the underlying computations? What was the data generating function that made the figure? What is the goal?

* Figure 1: (A few points). The caption mentions "…how Z-curve models…" I am sure that it does, but it doesn't make sense to me. Perhaps it would be worth clarifying what the inputs are, what the outputs are, what the inferences are, and in general, what the point of the figure is. The authors have spent far more time in creating this figure than anyone else who simply reads it, so I do not doubt that it is a good representation of something, but I am honestly indicating that I do not know what that is. Furthermore, the authors say "the dotted black line in Figure 1." I found it eventually, but it is really hard to see. Perhaps make the other lines a very light gray and the critical line a pure and un-dashed black?

It is a visual representation of the contribution of each component of the model to the total density.

* The authors say that they turn every Z-score of >6 into 6. How consequential is that decision? The explanation that those are all powered at 100% is not sufficient. If there are two results entered into Z-curve one with Z = 7 and one with Z = 12, Z-curve would treat them identically to each other and identically as if they were both Z = 6, right? Is that a strength? (without clarification, it sounds like a weakness). Perhaps it would be worth some sentences and some simulations to clarify the consequences of the arbitrary cutoff. Quite possibly the consequences are zero, but I can’t tell. 

Z-curve could also fit components here, but there are few z-scores, and if you convert a z-score of 6 into power it is pnorm(6, 1.96) = .99997 or 99.997%. So does it matter? No, it doesn't, which is the reason why we are doing it. If it made a difference, we wouldn't be doing it.
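To make this concrete, a short R sketch (an illustration, not the z-curve code) shows that the power implied by z = 6 is already indistinguishable from 1, so capping larger z-scores has no practical consequence:

# Power implied by a noncentrality of z for a two-sided test at alpha = .05.
power.from.z <- function(z) pnorm(z - qnorm(.975)) + pnorm(-z - qnorm(.975))
power.from.z(6)    # ~0.99997, the 99.997% mentioned above
power.from.z(12)   # ~1.00000, so treating z = 12 like z = 6 changes essentially nothing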

* On p. 13 the authors say, “… the average power estimate was 50% demonstrating large sample accuracy.” That seems like a good solid conclusion inference, but I didn’t understand how they got to it. One possible approach would be to start a bit earlier with clarifying the approach. Something that sounded like, “Our goal was to feed data from a 50% power distribution and then assess the accuracy of Z-curve by seeing whether or not it returned an average estimate of 50%.” From there, perhaps, it might be useful to explain in conversational language how that was conducted.

The main simulations are done later.  This is just an example.  So we can just delete the claim about large sample accuracy here.

* To reiterate, I simply cannot follow what the authors are doing. I accept that as my fault, but let’s assume that a different reader might share some of my shortcomings. If so, then some extra clarification would be helpful.

Thanks, but if you don't understand what we are doing, why are you an expert reviewer for our paper? I did ask that Uri not be picked as a reviewer because he ignored all reasonable arguments when I sent him a preprint, but that didn't mean that some other proponent of p-curve with less statistical background should be the reviewer.

* The authors say that p-curve generates an estimate of 76% for this analysis and that is bad. I believe them. Unfortunately, as I have indicated in a few places, I simply do not understand what the authors did, and so cannot assess the different results.

We used the R-code for the p-curve app, submitted the data, and read the output. And yes, we agree, it is bad that a tool is in the public domain without any warning about bias when there is heterogeneity, when the tool can overestimate average power by 25 percentage points. What are you going to do about it?

So clarification would help. Furthermore, the authors then imply that this is due to p-curve’s failure with heterogeneity. That sounds unlikely, given the demonstrations of p-curve’s robustness to heterogeneity (i.e., DataColada[67]), but let’s assume that they are correct.

Uri simulated minimal heterogeneity to save p-curve from embarrassment. So there is nothing surprising here. Uri essentially p-hacked p-curve results to get the results he wanted.

It then becomes absolutely critical for the authors to explain why that particular version is so far off. Based on lengthy exchanges between Uli and Uri, and as referenced in the DataColada post, across large and varied forms of heterogeneity, Z-curve performs worse than p-curve. What is special about this case? Is it one that exists frequently in nature?

Enough already.  That p-hacked post is not worth the bytes on the hosting server.

* I understand Figure 2. That is great. 

Do we have badges for reviewers who actually understand something in a paper?

* The authors imply that p-curve does worse at estimating high powered studies because of heterogeneity. Is there evidence for that causal claim? It would be great if they could identify the source of the difference.

The evidence is in the fucking paper you were supposed to review and evaluate.

* Uri, in the previous exchanges with Uli (and again, described in the blog post), came to the conclusion that Z-curve did better than p-curve when there were many very extreme (100% power) observations in the presence of other very low powered observations. The effect seemed to be carried by how Z-curve handles those extreme cases. I believe – and truly I am not sure here – that the explanation had to do with the fact that with Z-curve, extreme cases are capped at some upper bound. If that is true then (a) it is REALLY important for that to be described, clarified, and articulated. In addition, (b) it needs to be clearly justified. Is that what we want the algorithm to do? It seems important and potentially persuasive that Z-curve does better with certain distributions, but it clearly does worse with others. Given that, (c) it seems like the best case positioning for Z-curve would be if it could lay out the conditions under which it would perform better (e.g., one in which there are many low powered studies, but the mode was nevertheless >99.99% power), while acknowledging those in which it performs worse (e.g., all of the scenarios laid out in DataColada[67]).

I can read and I read the blog post. I didn’t know these p-hacked simulations would be used against me in the review process. 

* Table 1: Rather than presenting these findings in tabular form, I think it would be informative if there were histograms of the studies being entered into Z-curve (as in DataColada[67]). That allows someone to see what is being assessed rather than relying on their intuitive grasp of skewness, for example.

Of course we can add those, but that doesn’t change anything about the facts.

The Demonstrations:

* the Power Posing Meta-Analysis. I think it is interesting to look at how Z-curve evaluates a set of studies. I don't think that one can evaluate the tool in this way (because we do not know the true power of Power Posing studies), but it is interesting to see. I would make some suggestions though. (a) In a different DataColada post, we looked carefully at the Cuddy, Shultz, & Fosse p-curve and identified that the authors had selected demonstrably incorrect tests from demonstrably problematic studies. I can't imagine anyone debating either contention (indeed, no one has, though the Z-curve authors might think the studies and tests selected were perfect. That would be interesting to add to this paper.). Those tests were also the most extreme (all >99% power). Without reading this section I would say, "well, no analysis should be run on those test statistics since they are meaningless. On the other hand, since they are extreme in the presence of other very low powered studies, this sounds like exactly the scenario where Z-curve will generate a different estimate from p-curve". [again, the authors simply cite "heterogeneity" as the explanation and again, that is not informative]. I think that a better comparison might be on the original power-posing p-curve (Simmons & Simonsohn, 2017). Since those test statistics were coded by two authors of the original p-curve, that part is not going to be casually contested. I have no idea what that comparison will look like, but I would be interested.

I don’t care about the silly power-posing research. I can take this out, but it just showed that p-curve is used without understanding its limitations, which have been neglected by the developers of p-curve (not sure how much you were involved). 

* The scraping of 995,654 test statistics. I suppose one might wonder “what is the average power of any test reported in psychology between 2010 and 2017?” So long as that is not seen as even vaguely relevant to the power of the studies in which they were reported, then OK. But any implied relevance is completely misleading. The authors link the numbers (68% or 83%) to the results of the Reproducibility Project. That is exactly the type of misleading reporting I am referring to. I would strongly encourage this demonstration to be removed from the paper.

How do we know what the replicability in developmental psychology is? How do we know what the replicability in clinical psychology is? The only information that we have comes from social and experimental cognitive research with simple paradigms. Clearly we cannot generalize to all areas of psychology. Surely an analysis of focal and non-focal tests has some problems that we discuss, but it clearly serves as an upper limit and can be used for temporal and cross-discipline comparisons without taking the absolute numbers too seriously. But this can only be done with a method that is unbiased, not a method that estimates 96% power when power is 75%.

* The Motyl et al p-curve. This is a nice idea, but the data set being evaluated is completely unreasonable to use. In yet another DataColada post, we show that the Motyl et al. researchers selected a number of incorrect tests. Many omnibus tests and many manipulation checks. I honestly think that those authors made a sincere effort but there is no way to use those data in any reasonable fashion. It is certainly no better (and possibly worse) than simply scraping every p-value from each of the included studies. I wish the Motyl et al. study had been very well conducted and that the data were usable. They are not. I recommend that this be removed from the analysis or, time permitting, the Z-curve authors could go through the set of papers and select and document the correct tests themselves.

You are wrong, and I haven't seen you posting a corrected data set. Give me a corrected data set and I bet you $1,000 that p-curve will again produce a higher estimate than z-curve.

* Since it is clearly relevant with the above, I will mention that the Z-curve authors do not mention how tests should be selected. Experientially, p-curve users infrequently make mistakes with the statistical procedure, but they frequently make mistakes in the selection of test statistics. I think that if the authors want their tool to be used correctly they would be well served by giving serious consideration to how tests should be selected and then carefully explaining that.

Any statistical method depends on the data you supply. Like when Uri p-hacked simulations to show that p-curve does well with heterogeneity.

Final conclusion:

Dear reader, if you made it this far, please let me know in the comments section which of the following you take away from all of this:

1. The Motyl data are ok, p-curve overestimates, and that is because p-curve doesn't handle realistic amounts of heterogeneity well.

2. The Motyl data are ok, p-curve overestimates, but this only happens with the Motyl data.

3. The Motyl data are ok, p-curve overestimates, but that is because we didn't use p-curve properly.

4. The Motyl data are not ok, our simulations are p-hacked, and p-curve does well with heterogeneity.


Preliminary 2017 Replicability Rankings of 104 Psychology Journals

The table shows the preliminary 2017 rankings of 104 psychology journals. A description of the methodology and analyses by discipline and over time are reported below the table.

Rank   Journal 2017 2016 2015 2014 2013 2012 2011 2010
1 European Journal of Developmental Psychology 93 88 67 83 74 71 79 65
2 Journal of Nonverbal Behavior 93 72 66 74 81 73 64 70
3 Behavioral Neuroscience 86 67 71 70 69 71 68 73
4 Sex Roles 83 83 75 71 73 78 77 74
5 Epilepsy & Behavior 82 82 82 85 85 81 87 77
6 Journal of Anxiety Disorders 82 77 73 77 76 80 75 77
7 Attention, Perception and Psychophysics 81 71 73 77 78 80 75 73
8 Cognitive Development 81 73 82 73 69 73 67 65
9 Judgment and Decision Making 81 79 78 78 67 75 70 74
10 Psychology of Music 81 80 72 73 77 72 81 86
11 Animal Behavior 80 74 71 72 72 71 70 78
12 Early Human Development 80 92 86 83 79 70 64 81
13 Journal of Experimental Psychology – Learning, Memory & Cognition 80 80 79 80 77 77 71 81
14 Journal of Memory and Language 80 84 81 74 77 73 80 76
15 Memory and Cognition 80 75 79 76 77 78 76 76
16 Social Psychological and Personality Science 80 67 61 65 61 58 63 55
17 Journal of Positive Psychology 80 70 72 72 64 64 73 81
18 Archives of Sexual Behavior 79 79 81 80 83 79 78 87
19 Consciousness and Cognition 79 71 69 73 67 70 73 74
20 Journal of Applied Psychology 79 80 74 76 69 74 72 73
21 Journal of Experimental Psychology – Applied 79 67 68 75 68 74 74 72
22 Journal of Experimental Psychology – General 79 75 73 73 76 69 74 69
23 Journal of Experimental Psychology – Human Perception and Performance 79 78 76 77 76 78 78 75
24 Journal of Personality 79 75 72 68 72 75 73 82
25 JPSP-Attitudes & Social Cognition 79 57 75 69 50 62 61 61
26 Personality and Individual Differences 79 79 79 78 78 76 74 73
27 Social Development 79 78 66 75 73 72 73 75
28 Appetite 78 74 69 66 75 72 74 77
29 Cognitive Behavioral Therapy 78 82 76 65 72 82 71 62
30 Journal of Comparative Psychology 78 77 76 83 83 75 69 64
31 Journal of Consulting and Clinical Psychology 78 71 68 65 66 66 69 68
32 Neurobiology of Learning and Memory 78 72 75 72 71 70 75 73
33 Psychonomic Bulletin and Review 78 79 82 79 82 72 71 78
34 Acta Psychologica 78 75 73 78 76 75 77 75
35 Behavior Therapy 77 74 71 75 76 78 64 76
36 Journal of Affective Disorders 77 85 84 77 83 82 76 76
37 Journal of Child and Family Studies 77 76 69 71 76 71 76 77
38 Journal of Vocational Behavior 77 85 84 69 82 79 86 74
39 Motivation and Emotion 77 64 67 66 67 65 79 68
40 Psychology and Aging 77 79 78 80 74 78 78 74
41 Psychophysiology 77 77 70 69 68 70 80 78
42 British Journal of Social Psychology 76 65 66 62 64 60 72 63
43 Cognition 76 74 75 75 77 76 73 73
44 Cognitive Psychology 76 80 74 76 79 72 82 75
45 Developmental Psychology 76 77 77 75 71 68 70 70
46 Emotion 76 72 69 69 72 70 70 73
47 Frontiers in Behavioral Neuroscience 76 70 71 68 71 72 73 70
48 Frontiers in Psychology 76 75 73 73 72 72 70 82
49 Journal of Autism and Developmental Disorders 76 77 73 67 73 70 70 72
50 Journal of Social and Personal Relationships 76 82 60 63 69 67 79 83
51 Journal of Youth and Adolescence 76 88 81 82 79 76 79 74
52 Cognitive Therapy and Research 75 71 72 62 77 75 70 66
53 Depression & Anxiety 75 78 73 76 82 79 82 84
54 Journal of Child Psychology and Psychiatry and Allied Disciplines 75 63 66 66 72 76 58 66
55 Journal of Occupational and Organizational Psychology 75 85 84 71 77 77 74 67
56 Journal of Social Psychology 75 75 74 67 65 80 71 75
57 Political Psychology 75 81 75 72 75 74 51 70
58 Social Cognition 75 68 68 73 62 78 71 60
59 British Journal of Developmental Psychology 74 77 74 63 61 85 77 79
60 Evolution & Human Behavior 74 81 75 79 67 77 78 68
61 Journal of Research in Personality 74 77 82 80 79 73 74 71
62 Memory 74 79 66 83 73 71 76 78
63 Psychological Medicine 74 83 71 79 79 68 79 75
64 Psychopharmacology 74 75 73 73 71 73 73 71
65 Psychological Science 74 69 70 64 65 64 62 63
66 Behavioural Brain Research 73 69 75 69 71 72 73 74
67 Behaviour Research and Therapy 73 74 76 77 74 77 68 71
68 Journal of Cross-Cultural Psychology 73 75 80 78 78 71 76 76
69 Journal of Experimental Child Psychology 73 73 78 74 74 72 72 76
70 Personality and Social Psychology Bulletin 73 71 65 65 61 61 62 61
71 Social Psychology 73 75 72 74 69 64 75 74
72 Developmental Science 72 68 68 66 71 68 68 66
73 Journal of Cognition and Development 72 78 68 64 69 62 66 70
74 Law and Human Behavior 72 76 76 61 76 76 84 72
75 Perception 72 78 79 74 78 85 94 91
76 Journal of Applied Social Psychology 71 81 69 72 71 80 74 75
77 Journal of Experimental Social Psychology 71 68 63 61 58 56 58 57
78 Annals of Behavioral Medicine 70 70 62 71 71 77 75 71
79 Frontiers in Human Neuroscience 70 74 73 74 75 75 75 72
80 Health Psychology 70 63 68 69 68 63 70 72
81 Journal of Abnormal Child Psychology 70 74 70 74 78 78 68 78
82 Journal of Counseling Psychology 70 69 74 75 76 78 67 80
83 Journal of Educational Psychology 70 74 73 76 76 78 78 84
84 Journal of Family Psychology 70 68 75 71 73 66 68 69
85 JPSP-Interpersonal Relationships and Group Processes 70 74 64 62 66 58 60 56
86 Child Development 69 72 72 71 69 75 72 75
87 European Journal of Social Psychology 69 76 64 72 67 59 69 66
88 Group Processes & Intergroup Relations 69 67 73 68 70 66 68 61
89 Organizational Behavior and Human Decision Processes 69 73 70 70 72 70 71 65
90 Personal Relationships 69 72 71 70 68 74 60 69
91 Journal of Pain 69 79 71 81 73 78 74 72
92 Journal of Research on Adolescence 68 78 69 68 75 76 84 77
93 Self and Identity 66 70 56 73 71 72 70 73
94 Developmental Psychobiology 65 69 67 69 70 69 71 66
95 Infancy 65 61 57 65 70 67 73 57
96 Hormones & Behavior 64 68 66 66 67 64 68 67
97 Journal of Abnormal Psychology 64 67 71 64 71 67 73 70
98 JPSP-Personality Processes and Individual Differences 64 74 70 70 72 71 71 64
99 Psychoneuroendocrinology 64 68 66 65 65 62 66 63
100 Cognition and Emotion 63 69 75 72 76 76 76 76
101 European Journal of Personality 62 78 66 81 70 74 74 78
102 Biological Psychology 61 68 70 66 65 62 70 70
103 Journal of Happiness Studies 60 78 79 72 81 78 80 83
104 Journal of Consumer Psychology 58 56 69 66 61 62 61 66

Download PDF of this ggplot representation of the table courtesy of David Lovis-McMahon.

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result.  In the past five years, there have been concerns about a replication crisis in psychology.  Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011).  The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated  (OSC, 2015).

The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability. However, the reliance on actual replication studies has a number of limitations. First, actual replication studies are expensive, time-consuming, and sometimes impossible (e.g., a longitudinal study spanning 20 years). This makes it difficult to rely on actual replication studies to assess the replicability of psychological results, produce replicability rankings of journals, and track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate the average replicability of a set of published results based on the test statistics reported in published articles. This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies: (a) replicability can be assessed in real time, (b) it can be estimated for all published results rather than a small sample of studies, and (c) it can be applied to studies that are impossible to reproduce. Finally, actual replication studies can be criticized for deviating from the original studies (Gilbert, King, Pettigrew, & Wilson, 2016); estimates of replicability based on the original results do not have this problem because they are based on the results reported in the original articles.

Z-curve has been validated with simulation studies and can be used with heterogeneous sets of studies that vary across statistical methods, sample sizes, and effect sizes  (Brunner & Schimmack, 2016).  I have applied this method to articles published in psychology journals to create replicability rankings of psychology journals in 2015 and 2016.  This blog post presents preliminary rankings for 2017 based on articles that have been published so far. The rankings will be updated in 2018, when all 2017 articles are available.

For the 2016 rankings, I used z-curve to obtain annual replicability estimates for 103 journals from 2010 to 2016. Analyses of time trends showed no changes from 2010 to 2015. However, in 2016 there were first signs of an increase in replicability. Additional analyses suggested that social psychology journals contributed most to this trend. The preliminary 2017 rankings provide an opportunity to examine whether there is a reliable increase in replicability in psychology and whether such a trend is limited to social psychology.


Journals were mainly selected based on impact factor. The preliminary replicability rankings for 2017 are based on 104 journals. Several new journals were added to increase the number of journals specializing in five disciplines: social (24), cognitive (13), developmental (15), clinical/medical (18), and biological (13). The other 24 journals were broad journals (e.g., Psychological Science) or from other disciplines. More journals will be added to the final rankings for 2017.

Data Preparation

All PDF versions of published articles were downloaded and converted into text files using the conversion program pdfzilla. Text files were searched for reports of statistical results using a custom R program. Only F-tests, t-tests, and z-tests were used for the rankings because they can be reliably extracted from diverse journals. t-values that were reported without degrees of freedom were treated as z-values, which leads to a slight inflation in replicability estimates. However, the bulk of the test statistics were F-values and t-values with degrees of freedom. Test statistics were converted into exact p-values, and exact p-values were converted into absolute z-scores as a measure of the strength of evidence against the null-hypothesis.
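The following R sketch illustrates the kind of conversion described here (an assumed illustration, not the actual extraction program):

# Convert reported test statistics into two-tailed p-values and absolute z-scores.
p.from.t <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)
p.from.F <- function(f, df1, df2) pf(f, df1, df2, lower.tail = FALSE)
z.from.p <- function(p) qnorm(1 - p / 2)   # absolute z-score (strength of evidence)
z.from.p(p.from.t(2.50, 48))               # e.g., t(48) = 2.50
z.from.p(p.from.F(8.10, 1, 98))            # e.g., F(1, 98) = 8.10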

Data Analysis

The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016). Z-curve provides a replicability estimate. In addition, it generates a Powergraph. A Powergraph is essentially a histogram of absolute z-scores. Visual inspection of Powergraphs can be used to examine publication bias. A drop of z-values on the left side of the significance criterion (p < .05, two-tailed, z = 1.96) shows that non-significant results are underrepresented. A further drop may be visible at z = 1.65 because values between z = 1.65 and z = 1.96 are sometimes reported as marginally significant support for a hypothesis. The critical values z = 1.65 and z = 1.96 are marked by vertical red lines in the Powergraphs.
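A minimal sketch of such a plot in R (with made-up z-scores; not the actual Powergraph code):

# Histogram of absolute z-scores with the two critical values marked in red.
z <- abs(rnorm(1000, mean = 2, sd = 1.5))   # placeholder data
hist(z, breaks = 50, xlim = c(0, 6), xlab = "absolute z-score", main = "Powergraph (sketch)")
abline(v = c(1.65, 1.96), col = "red")      # marginal and conventional significance criteria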

Replicability rankings rely only on statistically significant results (z > 1.96). The aim of z-curve is to estimate the average probability that an exact replication of a study that produced a significant result produces a significant result again. As replicability estimates rely only on significant results, journals are not being punished for publishing non-significant results. The key criterion is how strong the evidence against the null-hypothesis is when an article publishes results that lead to the rejection of the null-hypothesis.

Statistically, replicability is the average statistical power of the set of studies that produced significant results. As power is the probability of obtaining a significant result, the average power of the original studies is equivalent to the average power of a set of exact replication studies. Thus, the average power of the original studies is an estimate of replicability.
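This equivalence can be verified with a small simulation (a sketch with hypothetical power values, one-sided for simplicity; not the z-curve code):

# After selection for significance, the expected success rate of exact replications
# equals the average true power of the selected studies.
set.seed(1)
power.i <- runif(10000, .05, .95)                 # heterogeneous true power
z.crit  <- qnorm(.975)
z.orig  <- rnorm(10000, qnorm(power.i) + z.crit)  # original studies
sig     <- z.orig > z.crit                        # selection for significance
mean(power.i[sig])                                # average power after selection
z.rep   <- rnorm(sum(sig), qnorm(power.i[sig]) + z.crit)  # exact replications
mean(z.rep > z.crit)                              # replication success rate matches the line above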

Links to Powergraphs for all journals and years are provided in the ranking table. These Powergraphs provide additional information that is not used for the rankings. The only information that is being used is the replicability estimate based on the distribution of significant z-scores.


The replicability estimates for each journal and year (104 * 8 = 832 data points) served as the raw data for the following statistical analyses. I fitted a growth model to examine time trends and variability across journals and disciplines using MPLUS 7.4.

I compared several models. Model 1 assumed no mean level changes and stable variability across journals (significant variance in the intercept/trait factor). Model 2 assumed no change from 2010 to 2015 and allowed for mean level changes in 2016 and 2017 as well as stable differences between journals. Model 3 was identical to Model 2 but additionally allowed for random variability in the slope factor.

Model 1 did not have acceptable fit (RMSEA = .109, BIC = 5198). Model 2 improved fit (RMSEA = .063, BIC = 5176). Model 3 did not improve model fit further (RMSEA = .063, BIC = 5180), the variance of the slope factor was not significant, and BIC favored the more parsimonious Model 2. The parameter estimates suggested that replicability estimates increased from 72 for the years 2010 to 2015 by 2 percentage points to 74 in 2016-2017 (z = 3.70, p < .001).
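The analyses were run in MPLUS; as a rough illustration of the model comparison, an analogous specification in R with the lavaan package might look like this (an assumed translation with simulated placeholder data, not the actual analysis):

# Latent intercept plus a fixed mean shift for 2016-2017 (Model 2); freeing the
# variance of s (and its covariance with i) would give Model 3.
library(lavaan)
set.seed(2)
wide <- as.data.frame(matrix(round(rnorm(104 * 8, 72, 5)), 104, 8))
names(wide) <- paste0("y", 2010:2017)
model2 <- '
  i =~ 1*y2010 + 1*y2011 + 1*y2012 + 1*y2013 + 1*y2014 + 1*y2015 + 1*y2016 + 1*y2017
  s =~ 0*y2010 + 0*y2011 + 0*y2012 + 0*y2013 + 0*y2014 + 0*y2015 + 1*y2016 + 1*y2017
  s ~~ 0*s
  i ~~ 0*s
'
fit2 <- growth(model2, data = wide)
fitMeasures(fit2, c("rmsea", "bic"))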

The standardized loadings of individual years on the latent intercept factor ranged from .57 to .61.  This implies that about one-third of the variance is stable, while the remaining two-thirds of the variance is due to fluctuations in estimates from year to year.

The average of 72% replicability is notably higher than the estimate of 62% reported in the 2016 rankings. The difference is due to a computational error in the 2016 rankings that affected mainly the absolute values, but not the relative ranking of journals. The R-code for the 2016 rankings miscalculated the percentage of extreme z-scores (z > 6), which is used to adjust the z-curve estimates that are based on z-scores between 1.96 and 6, because all z-scores greater than 6 essentially have 100% power. For the 2016 rankings, I erroneously computed the percentage of extreme z-scores out of all z-scores rather than limiting it to the set of statistically significant results. This error became apparent during new simulation studies that produced wrong estimates.
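In R terms, the difference between the erroneous and the corrected computation looks roughly like this (placeholder data; not the actual ranking code):

# The proportion of extreme z-scores (z > 6) must be computed among significant results only.
z     <- abs(rnorm(10000, mean = 2.5, sd = 1.5))   # placeholder absolute z-scores
z.sig <- z[z > 1.96]
mean(z.sig > 6)   # corrected: denominator = significant results only
mean(z > 6)       # 2016 error: denominator = all z-scores, which understates the proportion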

Although the previous analysis failed to find significant variability for the slope (change) factor, this could be due to the low power of this statistical test. The next models included disciplines as predictors of the intercept (Model 4) or the intercept and slope (Model 5). Model 4 had acceptable fit (RMSEA = .059, BIC = 5175). Model 5 improved fit (RMSEA = .036, BIC = 5178), although BIC favored the more parsimonious Model 4; because the Bayesian Information Criterion rewards parsimony, its preference for the simpler model cannot be interpreted as evidence for the absence of an effect. Model 5 showed two significant (p < .05) effects, for social and developmental psychology. In Model 6 I included only social and developmental psychology as predictors of the slope factor. BIC favored this model over the other models (RMSEA = .029, BIC = 5164). The model results showed improvements for social psychology (increase by 4.48 percentage points, z = 3.46, p = .001) and developmental psychology (increase by 3.25 percentage points, z = 2.65, p = .008). Whereas the improvement for social psychology was expected based on the 2016 results, the increase for developmental psychology was unexpected and requires replication in the 2018 rankings.

The only significant predictors of the intercept were social psychology (-4.92 percentage points, z = 4.12, p < .001) and cognitive psychology (+2.91, z = 2.15, p = .032). The strong negative effect (standardized effect size d = 1.14) for social psychology confirms earlier findings that social psychology was most strongly affected by the replication crisis (OSC, 2015). It is encouraging to see that social psychology is also the discipline with the strongest evidence for improvement in response to the replication crisis. With an increase of 4.48 points, the replicability of social psychology is now at the same level as the other disciplines, with the exception of cognitive psychology, which remains a bit more replicable than all other disciplines.

In conclusion, the results confirm that social psychology had lower replicability than other disciplines, but they also show that social psychology has significantly improved in replicability over the past couple of years.

Analysis of Individual Journals

The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2010-2015 (0) with 2016-2017 (1). This analysis produced 10 significant increases with p < .01 (one-tailed), when only about 1 out of 100 would be expected by chance.
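A sketch of this regression for a single journal (simulated estimates, not the actual data):

# Regress yearly estimates on a dummy contrasting 2010-2015 (0) with 2016-2017 (1).
years    <- 2010:2017
estimate <- c(63, 64, 62, 65, 64, 66, 72, 74)   # hypothetical values for one journal
period   <- as.numeric(years >= 2016)
fit      <- lm(estimate ~ period)
summary(fit)$coefficients["period", ]           # slope = average 2016-2017 gain and its (two-tailed) p-value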

Five of the 10 journals (50% vs. 20% in the total set of journals) were from social psychology (SPPS + 13, JESP + 11, JPSP-IRGP + 11, PSPB + 10, Sex Roles + 8).  The remaining journals were from developmental psychology (European J. Dev. Psy + 17, J Cog. Dev. + 9), clinical psychology (J. Cons. & Clinical Psy + 8, J. Autism and Dev. Disorders + 6), and the Journal of Applied Psychology (+7).  The high proportion of social psychology journals provides further evidence that social psychology has responded most strongly to the replication crisis.



Although z-curve provides very good absolute estimates of replicability in simulation studies, the absolute values in the rankings have to be interpreted with a big grain of salt for several reasons. Most important, the rankings are based on all test statistics that were reported in an article. Only a few of these statistics test theoretically important hypotheses. Others may be manipulation checks or other incidental analyses. For the OSC (2015) studies the replicability estimate was 69% when the actual success rate was only 37%. Moreover, comparisons of the automated extraction method used for the rankings and hand-coding of focal hypotheses in the same articles also show a 20-percentage-point difference. Thus, a posted replicability of 70% may imply only 50% replicability for a critical hypothesis test. Second, the estimates are based on the ideal assumptions underlying statistical test distributions. Violations of these assumptions (outliers) are likely to reduce actual replicability. Third, actual replication studies are never exact replication studies, and minor differences between the studies are also likely to reduce replicability. There are currently not sufficient actual replication studies to correct for these factors, but the average is likely to be less than 72%. It is also likely to be higher than 37% because this estimate is heavily influenced by social psychology, while cognitive psychology had a success rate of 50%. Thus, a plausible range for the typical replicability of psychology is somewhere between 40% and 60%. We might say the glass is half full and half empty, while there is systematic variation around this average across journals.


It has been 55 years since Cohen (1962) pointed out that psychologists conduct many studies that produce non-significant results (type-II errors). For decades there was no sign of improvement. The preliminary rankings of 2017 provide the first empirical evidence that psychologists are waking up to the replication crisis caused by selective reporting of significant results from underpowered studies. Right now, social psychologists appear to respond most strongly to concerns about replicability. However, it is possible that other disciplines will follow in the future as the open science movement is gaining momentum. Hopefully, replicability rankings can provide an incentive to consider replicability as one of several criteria for publication. A study with z = 2.20 and another study with z = 3.85 are both significant (z > 1.96), but the study with z = 3.85 has a higher chance of being replicable. Everything else being equal, editors should favor studies with stronger evidence, that is, higher z-scores (i.e., lower p-values). By taking the strength of evidence into account, psychologists can move away from treating all significant results (p < .05) as equal and take type-II errors and power into account.
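To make the last point concrete, here is the replication probability implied by these two z-scores, taking each observed value at face value as the noncentrality (a rough illustration that ignores selection bias and the negligible lower tail):

# Implied power of an exact replication at alpha = .05 (two-sided criterion z = 1.96).
pnorm(2.20 - qnorm(.975))   # ~.59
pnorm(3.85 - qnorm(.975))   # ~.97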