Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

The Replicability Revolution: My commentary to Zwaan et al.’s BBS article “Making replication mainstream”

Ulrich Schimmack
Department of Psychology, University of Toronto, Mississauga, ON L5L 1C6
doi:10.1017/S0140525X18000833, e147

“It is therefore inevitable that the ongoing correction of the scientific record damages the reputation of researchers, if this reputation was earned by selective publishing of significant results.”

Abstract: Psychology is in the middle of a replicability revolution. High-profile
replication studies have produced a large number of replication
failures. The main reason why replication studies in psychology often fail
is that original studies were selected for significance. If all studies were
reported, original studies would fail to produce significant results as
often as replication studies. Replications would be less contentious if
original results were not selected for significance.


The history of psychology is characterized by revolutions. This
decade is marked by the replicability revolution. One prominent
feature of the replicability revolution is the publication of replication
studies with nonsignificant results.

The publication of several high-profile replication failures has triggered a confidence crisis. Zwaan et al. have been active participants in the replicability revolution. Their target article addresses criticisms of direct replication.

One concern is the difficulty of re-creating original studies, which may explain replication failures, particularly in social psychology. This argument fails on three counts. First, it does not explain why published studies have an apparent success rate
greater than 90%. If social psychological studies were difficult to replicate, the success rate should be lower. Second, it is not clear why it would be easier to conduct conceptual replication studies that vary crucial aspects of a successful original study.

If social priming effects were, indeed, highly sensitive to contextual variations, conceptual replication studies would be even more likely to fail than direct replication studies; however, miraculously they always seem to work. The third problem with this argument is that it ignores selection for significance. It treats successful conceptual
replication studies as credible evidence, but bias tests reveal that these studies have been selected for significance and that many original studies that failed are simply not reported (Schimmack 2017; Schimmack et al. 2017).

A second concern about direct replications is that they are less informative than conceptual replications (Crandall & Sherman 2016). This argument is misguided because it assumes a successful outcome. If a conceptual replication study is successful, it increases the probability that the original finding was true and it expands the range of conditions under which an effect can be observed. However, the advantage of a conceptual replication study becomes a disadvantage when a study fails. For example,
if the original study showed that eating green jelly beans increases happiness and a conceptual replication study with red jelly beans does not show this effect, it remains unclear whether green jelly beans make people happier or not. Even the non-significant finding with red jelly beans is inconclusive because the result could be a false negative. Meanwhile, a failure to replicate the green jelly bean effect in a direct replication study is informative because it casts doubt on the original finding.

In fact, a meta-analysis of the original and replication study might produce a non-significant result and reverse the initial inference that green jelly beans make people happy. Crandall and Sherman’s argument rests on the false assumption that only significant studies are informative. This assumption is flawed because selection for significance renders significance uninformative (Sterling 1959).
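The jelly bean point can be illustrated with a minimal sketch in R. The two z-scores are hypothetical values I chose for illustration, and the combination uses Stouffer's method with equal weights, which assumes that the original and the replication study have similar sample sizes.

```r
# Hypothetical z-scores: a just-significant original and a null replication
z_original    <- 2.10   # two-sided p of about .036
z_replication <- 0.00   # direct replication finds no effect

# Stouffer's method with equal weights: sum of z-scores divided by sqrt(k)
z <- c(z_original, z_replication)
z_combined <- sum(z) / sqrt(length(z))
p_combined <- 2 * pnorm(-abs(z_combined))

round(c(z_combined = z_combined, p_combined = p_combined), 3)
```

With these assumed numbers the original study is significant on its own, but the combined evidence is not.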

A third argument against direct replication studies is that there are multiple ways to compare the results of original and replication studies. I believe the discussion of this point also benefits from taking publication bias into account. Selection for significance
explains why the reproducibility project obtained only 36% significant results in direct replications of original studies with significant results (Open Science Collaboration 2015). As a result, the significant results of original studies are less credible than the nonsignificant results in direct replication studies. This generalizes to all comparisons of original studies and direct replication studies.

Once there is suspicion or evidence that selection for significance occurred, the results of original studies are less credible, and more weight should be given to replication studies that are not biased by selection for significance. Without selection for significance,
there is no reason why replication studies should be more likely to fail than original studies. If replication studies correct mistakes in original studies and use larger samples, they are actually more likely to produce a significant result than original studies.
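A small simulation can make this argument concrete. The sketch below is only an illustration under strong assumptions: every study is a two-sided z-test with the same true power of 36% (a value chosen to echo the reproducibility project, not an estimate), and replications are exact copies of the original studies.

```r
set.seed(1)
n_studies <- 100000
alpha  <- .05
z_crit <- qnorm(1 - alpha / 2)

# Assumed true power of 36% for every study (hypothetical)
ncp <- z_crit + qnorm(.36)

z_original    <- rnorm(n_studies, mean = ncp)
z_replication <- rnorm(n_studies, mean = ncp)   # exact replications

sig_original    <- abs(z_original) > z_crit
sig_replication <- abs(z_replication) > z_crit

# All studies reported: originals and replications succeed equally often
rate_all_original    <- mean(sig_original)
rate_all_replication <- mean(sig_replication)

# Only significant originals are published: the published record looks
# perfect, but replications of these studies succeed at the true power
rate_published_original       <- mean(sig_original[sig_original])
rate_replication_of_published <- mean(sig_replication[sig_original])

round(c(all_original = rate_all_original,
        all_replication = rate_all_replication,
        published_original = rate_published_original,
        replication_of_published = rate_replication_of_published), 2)
```

Under these assumptions, the gap between original and replication success rates appears only after the original studies are selected for significance.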

Selection for Significance Explains Reputational Damage of Replication Failures 

Selection for significance also explains why replication failures are damaging to the reputation of researchers. The reputation of researchers is based on their publication record, and this record is biased in favor of successful studies. Thus, researchers’ reputations are inflated by selection for significance. Once an unbiased replication
produces a nonsignificant result, the unblemished record is tainted, and it is apparent that a perfect published record is illusory and not the result of research excellence (a.k.a. flair). Thus, unbiased failed replication studies not only provide new evidence; they also
undermine the credibility of existing studies. Although positive illusions may be beneficial for researchers’ eminence, they have no place in science. It is therefore inevitable that the ongoing correction of the scientific record damages the reputation of researchers, if this reputation was earned by selective publishing of significant results.
In this way direct replication studies complement statistical tools that can reveal selective publishing of significant results with statistical tests of original studies (Schimmack 2012; 2014; Schimmack & Brunner submitted for publication).


Schimmack, U. (2012) The ironic effect of significant results on the credibility of
multiple-study articles. Psychological Methods 17:551–56. [US]

Schimmack, U. (2014) The test of insufficient variance (TIVA): A new tool for the
detection of questionable research practices. Working paper. Available at:

Schimmack, U. (2017) ‘Before you know it’ by John A. Bargh: A quantitative book
review. Available at:

Schimmack, U. & Brunner, J. (submitted for publication) Z-Curve: A method for
estimating replicability based on test statistics in original studies. Submitted for
publication. [US]

Schimmack, U., Heene, M. & Kesavan, K. (2017) Reconstruction of a train wreck:
How priming research went off the rails. Blog post. Available at: https://replicationindex.


Statistics Wars and Submarines

I am not the first to describe the fight among statisticians for eminence as a war (Mayo Blog).   The statistics war is as old as modern statistics itself.  The main parties in this war are the Fisherians, Bayesians, and Neymanians (or Neyman-Pearsonians).

Fisherians use p-values as evidence to reject the null-hypothesis; the smaller the p-value the better.

Neymanians distinguish between type-I and type-II errors and use regions of test statistics to reject null-hypotheses or alternative hypotheses.  They also use confidence intervals to obtain interval estimates of population parameters.

Bayesians differ from the Fisherians and Neymanians in that their inferences combine information obtained from data with prior information. Bayesians sometimes fight with each other about the proper prior information. Some prefer subjective priors that are ideally based on prior knowledge. Others prefer objective priors that do not require any prior knowledge and can be applied to all statistical problems (Jeffreysians).  Although they fight with each other, they are united in their fight against Fisherians and Neymanians, whom they call Frequentists.

The statistics war has been going on for over 80 years and there has been no winner.  Unlike empirical sciences, there are no new data that could resolve scientific controversies.  Thus, the statistics war is more like wars in philosophy where philosophers are still fighting over the right way to define fundamental concepts like justice or happiness.

For applied researchers these statistics wars can be very confusing because a favorite weapon of statisticians is propaganda.  In this blog post, I examine the Bayesian Submarine (Morey et al., 2016), which aims to sink the ship of Neymanian confidence intervals.

The Bayesian Submarine 

Submarines are fascinating and are currently making major discoveries about sea life.  The Bayesian submarine is rather different.  It is designed to convince readers that confidence intervals provide no meaningful information about population parameters and should be abandoned in favor of Bayesian interval estimation.

Example 1: The lost submarine
In this section, we present an example taken from the confidence interval literature (Berger and Wolpert, 1988; Lehmann, 1959; Pratt, 1961; Welch, 1939) designed to bring into focus how CI theory works. This example is intentionally simple; unlike many demonstrations of CIs, no simulations are needed, and almost all results can be derived by readers with some training in probability and geometry. We have also created interactive versions of our figures to aid readers in understanding the example; see the figure captions for details.

A 10-meter-long research submersible with several people on board has lost contact with its surface support vessel. The submersible has a rescue hatch exactly halfway along
its length, to which the support vessel will drop a rescue line. Because the rescuers only get one rescue attempt, it is crucial that when the line is dropped to the craft in the deep water that the line be as close as possible to this hatch. The researchers on the support vessel do not know where the submersible is, but they do know that it forms two distinctive bubbles. These bubbles could form anywhere along the craft’s length, independently, with equal probability, and float to the surface where they can be seen by the support vessel.

The situation is shown in Fig. 1a. The rescue hatch is the unknown location θ, and the bubbles can rise from anywhere with uniform probability between θ − 5 meters (the
bow of the submersible) to θ+5 meters (the stern of the submersible). 


Let’s translate this example into a standard statistical problem.  It is uncommon to have a uniform distribution of observed data around a population parameter.  Most commonly, we assume that observations are more likely to cluster closer to the population parameter and that deviations between the population parameter and an observed value reflect some random process.  However, a bounded uniform distribution also allows us to compute the standard deviation of the randomly occurring data.

round(sd(runif(100000,0,10)),2)  # approximately 2.89

We only have two data points to construct a confidence interval.  Evidently, the standard error based on a sample size of n = 2 is large: 1/sqrt(2) = .71, or 71% of a standard deviation.  We can use the typical formula for sampling error, SD/sqrt(N), to estimate the sampling error as 2.89/sqrt(2) = 2.04.

To construct a 95% confidence interval, we have to multiply the sampling error by the critical t-value for a probability of .975, which leaves .025 for the error region in each tail. Multiplying this by 2 gives a two-tailed error probability of .025 × 2 = 5%.  That is, 5% of observations could be more extreme than the boundaries of the confidence interval just by chance alone.  With 1 degree of freedom, we get a value of 12.71.

n = 2; alpha = .05; qt(1-alpha/2,n-1)

The width of the CI is determined by the standard deviation and sample size.  So, the information is sufficient to say that the 95%CI is the observed mean +/- 2.04m * 12.71, or about 25.9m.

Hopefully it is obvious that this 95%CI covers 100% of all possible values because the length of the submarine is limited to 10m.
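For readers who want to retrace these steps, here is a minimal sketch in R that strings the calculations together. It uses the theoretical standard deviation of a uniform distribution (range divided by the square root of 12) instead of the simulated value, so the last digits can differ slightly from the rounded numbers in the text; the variable names are my own.

```r
# Bubbles are uniform over the 10 m hull: SD of a uniform distribution
hull_length <- 10
sd_uniform  <- hull_length / sqrt(12)          # about 2.89

n  <- 2
se <- sd_uniform / sqrt(n)                     # about 2.04

alpha  <- .05
t_crit <- qt(1 - alpha / 2, df = n - 1)        # about 12.71

margin <- t_crit * se                          # half-width of the 95% CI
round(c(sd = sd_uniform, se = se, t_crit = t_crit, margin = margin), 2)

# The hatch can never be more than 5 m from the mean of the two bubbles,
# so a half-width larger than 5 m rules out nothing
margin > hull_length / 2
```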

In short, two data points provide very little information and make it impossible to say anything with confidence about the location of the hatch.  Even without these calculations we can say with 100% confidence that the hatch cannot be further from the mean of the two bubbles than 5 meters because the maximum distance is limited by the length of the submarine.

The submarine problem is also strange because the width of confidence intervals is irrelevant for the rescue operation. With just one rescue line, the optimal place to drop it is the mean of the two bubbles (see Figure, all intervals are centered on the same point).  So, the statisticians do not have to argue, because they all agree on the location where to drop the rescue line.

How is the Bayesian submarine supposed to sink confidence intervals? 

The rescuers first note that from observing two bubbles, it is easy to rule out all values except those within five meters of both bubbles because no bubble can occur further than 5 meters from the hatch.
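This observation is easy to sketch in R. The helper name feasible_range() and the bubble positions are hypothetical; the only input from the example is that no bubble can be more than 5 meters from the hatch.

```r
# Every bubble lies within 5 m of the hatch, so the hatch lies within 5 m
# of every bubble. feasible_range() is a hypothetical helper name.
feasible_range <- function(bubbles, half_length = 5) {
  c(lower = max(bubbles) - half_length,
    upper = min(bubbles) + half_length)
}

close_bubbles <- c(4.8, 5.2)   # hypothetical positions in meters
far_bubbles   <- c(0.5, 9.5)

feasible_range(close_bubbles)  # wide range: bubbles close together say little
feasible_range(far_bubbles)    # narrow range: bubbles far apart pin the hatch down
```

Both ranges are centered on the mean of the two bubbles, so the place to drop the rescue line is the same in either case.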

Importantly, this only works for dependent variables with bounded values. For example, on an 11-point scale ranging from 0 to 10, it is obvious that any population mean cannot deviate from the middle of the scale (5) by more than 5 points.  Even there it is not very relevant because the goal of research is not to find the middle of the scale, but to estimate the actual population parameter that could be anywhere between 0 and 10. Thus, the submarine example does not map onto any empirical problem of interval estimation.

1. A procedure based on the sampling distribution of the mean
The first statistician suggests building a confidence procedure using the sampling distribution of the mean M . The sampling distribution of  M has a known triangular distribution with θ as the mean. With this sampling distribution, there is a 50 % probability that  M will differ from θ by less than 5 − 5/ √2, or about 1.46m.  

This leads to the confidence procedure M ± (5 − 5/√2)

which we call the “sampling distribution” procedure. This procedure also has the familiar form x̄ ± C × SE, where here the standard error (that is, the standard deviation of the estimate M) is known to be 2.04.

It is important to note that the authors use a 50% CI.  In this special case, the half-width of the confidence interval equals the standard error because, with 1 degree of freedom, the critical t-value is 1 and the standard error is simply multiplied by 1.

n = 2; alpha = .50; qt(1-alpha/2,n-1)

The choice of a 50%CI is also not typical in actual research settings. It is not clear why we should accept such a high error rate, especially when the survival of the crew members is at stake.  Imagine that the submarine had an emergency system that releases bubbles from the hatch, but the bubbles do not go straight to the surface. Yet there are hundreds of bubbles. Would we compute a 50% confidence interval or would we want to get a 99% confidence interval to bring the rescue line as close to the hatch as possible?

We still haven’t seen how the Bayesian submarine sinks confidence intervals.  To make their case, the Bayesian soldiers compute several possible confidence intervals and show how they lead to different conclusions (see Figure). They suggest that this is a fundamental problem for confidence intervals.

It is clear, first of all, why the fundamental confidence fallacy is a fallacy. 

The Bayesians are happy to join forces with the Fisherians in their attack on Neymanian confidence intervals, even though they usually attack Fisher for his use of p-values.

As Fisher pointed out in the discussion of CI theory mentioned above, for any given problem — as for this one — there are many possible confidence procedures. These confidence procedures will lead to different confidence intervals. In the case of our submersible confidence procedures, all confidence intervals are centered around M, and so the intervals will be nested within one another.

If we mistakenly interpret these observed intervals as having a 50 % probability of containing the true value, a logical problem arises. 

However, shortly after the authors bring up this fundamental problem for confidence intervals, they mention that Neyman solved this logical problem.

There are, as far as we know, only two general strategies for eliminating the threat of contradiction from relevant subsets: Neyman’s strategy of avoiding any assignment of probabilities to particular intervals, and the Bayesian strategy of always conditioning on the observed data, to be discussed subsequently.

Importantly, Neyman’s solution to the problem does not support the Bayesians’ conclusion that he advised against making probabilistic statements based on confidence intervals. Instead, he argued that we should apply the long-run success rate to make probability judgments based on confidence intervals.  This use of the term probability can be illustrated with the submarine example. A simple simulation of the submarine problem shows that the 50% confidence interval contains the population parameter 50% of the time.
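A simulation along these lines might look like the following sketch in R. It is not necessarily the simulation I ran for this post; the true hatch location of 0 is an arbitrary choice, and the two procedures are the “sampling distribution” interval from the article and a standard Student's t interval with n = 2.

```r
set.seed(123)
n_sim <- 100000
theta <- 0                       # true hatch location (arbitrary choice)

# Two bubbles per rescue attempt, uniform within 5 m of the hatch
x1 <- runif(n_sim, theta - 5, theta + 5)
x2 <- runif(n_sim, theta - 5, theta + 5)
m  <- (x1 + x2) / 2

# Procedure 1: fixed half-width from the sampling distribution of the mean
half_sd  <- 5 - 5 / sqrt(2)
cover_sd <- abs(m - theta) < half_sd

# Procedure 2: Student's t interval with n = 2; qt(.75, 1) equals 1,
# so the half-width is the estimated standard error
s       <- abs(x1 - x2) / sqrt(2)      # sample SD of two observations
half_t  <- qt(.75, df = 1) * s / sqrt(2)
cover_t <- abs(m - theta) < half_t

round(c(sampling_dist = mean(cover_sd), student_t = mean(cover_t)), 3)
```

Both procedures should cover the hatch in about half of the simulated rescue attempts, which is all that a 50% confidence procedure promises.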

It is therefore reasonable to place relatively modest confidence in the belief that the hatch of the submarine is within the confidence interval.  To be more confident, it would be necessary to lower the error rate, but this makes the interval wider. The only way to be confident with a narrow interval is to collect more data.

Confidence intervals do have exactly the properties that Neyman claimed they have and there is no logical inconsistency between the statement that we cannot quantify the probability of singular events, while we can use long-run outcomes of similar events to make claims about the probability of being right or wrong in a particular event.

Neyman compares this to gambling where it is impossible to say anything about the probability of a particular bet unless we know the long-run probability of similar bets. Researchers who use confidence intervals are no different from people who drive their cars with confidence because they never had an accident or who order a particular pizza because they ordered it many times before and liked it.  Without any other relevant information, the use of long-run frequencies to assign probabilities to individual events is not a fallacy.


So, does the Bayesian submarine sink the confidence interval ship?  Does the example show that interpreting confidence intervals as probabilities is a fallacy and a misinterpretation of Neyman?  I don’t think so.

The probability of winning a coin toss (with a fair coin) is 50%.  What is the probability that I win any specific game?  It is not defined.  It is 100% if I win and 0% if I don’t win. This is trivial and Neyman made it clear that he was not using the term probability in this sense.  He also made it clear that he used the term probability to refer to the long-term proportion of correct decisions and most people would feel very confident in their beliefs and decision making if the odds of winning were 95%.  Bayesians do not deny that 95% confidence intervals give the right answer 95% of the time. They just object to the phrase, “There is a 95% probability that the confidence interval includes the population parameter” when a researcher uses a 95% confidence interval.  Similarly, they would object to somebody saying “There is a 99.9% chance that I am pregnant” when a pregnancy test with a 0.1% false positive rate shows a positive result.  The woman is either pregnant or she is not, but we don’t know this until she repeats the test several times or an ultrasound shows it.  As long as there is uncertainty about the actual truth, the long-run frequency of false positives quantifies the rational belief in being pregnant or not.
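The long-run property that nobody disputes can be checked with a short simulation. The sketch below assumes normally distributed data and a sample size of 20, and it draws a different population mean and standard deviation for every simulated study to mimic a researcher who works on many different estimation problems; all of these settings are my own choices.

```r
set.seed(42)
n_sim <- 20000
n <- 20

covered <- replicate(n_sim, {
  mu    <- runif(1, -10, 10)     # a different unknown parameter each time
  sigma <- runif(1, 0.5, 5)
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = .95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

mean(covered)   # long-run proportion of intervals that contain the parameter
```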

What should applied researchers do?  They should use confidence intervals with confidence.  If Bayesians want to argue with them, all they need to say is that they are using a procedure that has a 95% probability of giving the right answer and it is not possible to say whether a particular result is one of the few errors.  The best way to address this question is not to argue about semantics, but to do a replication study.  And that is the good news.  While statisticians are busy fighting with each other, empirical scientists can make actual progress by collecting more informative data.

In conclusion, the submarine problem does not convince me for many reasons.  Most important, it is not even necessary to create any intervals to decide on the best action.  Absent any other information, the best bet is to drop the rescue line right in the middle of the two bubbles.  This is very fortunate for the submarine crew because otherwise the statisticians would still be arguing about the best course of action, while the submarine is running out of air.

The Fallacy of Placing Confidence in Bayesian Salvation

Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and Eric-Jan Wagenmakers (2016), henceforth psycho-Bayesians, have a clear goal.  They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.”   In this article, the authors claim that everybody, including textbook writers in statistics, misunderstood Neyman’s classic article on interval estimation.   What are the prior odds that after 80 years, a group of psychologists discovers a fundamental flaw in the interpretation of confidence intervals (H1) versus that a few psychologists are either unable or unwilling to understand Neyman’s article (H2)?

Underlying this quest for change in statistical practices lies the ultimate attribution error that Fisher’s p-values or Neyman-Pearson significance testing, with or without confidence intervals, are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature.  The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates and the long-run probabilities of false positives are not 5%, as stated in each article, but could be 100% (Rosenthal, 1979; Sterling, 1959; Simmons, Nelson, & Simonsohn, 2011).
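To see how undisclosed multiple testing inflates error rates, consider the following sketch in R. It assumes a true null effect, two groups of 20 participants, and five independent outcome measures of which only the best one is reported; these numbers are hypothetical and only serve to illustrate the mechanism.

```r
set.seed(2011)
n_sim <- 5000
n_per_group <- 20
n_dvs <- 5            # hypothetical number of outcome measures tried

# Smallest p-value across several independent outcomes when H0 is true
min_p <- replicate(n_sim, {
  p <- replicate(n_dvs,
                 t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
  min(p)
})

false_positive_rate <- mean(min_p < .05)
expected_rate <- 1 - (1 - .05)^n_dvs   # about .23 for five independent tests

round(c(simulated = false_positive_rate, expected = expected_rate), 3)
```

With more flexibility in the analysis (more outcomes, optional stopping, covariates), the rate of false positives moves even further away from the nominal 5%.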

You will not find a single article by Psycho-Bayesians that will acknowledge the contribution of unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics are fundamentally flawed and misunderstood because they are used in industry to produce smartphones and other technology that requires tight error control in mass production. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, but warned readers that confidence intervals provide no useful information after the data are collected.

Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”

This is utter nonsense. Of course, Neyman was asking us to interpret confidence intervals after we collected data because we need a sample to compute a confidence interval. It is hard to believe that this could have passed peer-review in a statistics journal and it is not clear who was qualified to review this paper for Psychonomic Bullshit Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the following statements.  In fact, Neyman was arguing against Bayesian attempts to estimate probabilities that can be applied to a single event.

It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M  to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.  

The parameter M is an unknown constant and no probability statement concerning its value may be made, that is except for the hypothetical and trivial ones P{1 < M < 2} = 1 if 1 < M < 2, or 0 if either M < 1 or 2 < M, which we have decided not to consider.

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem.  We don’t even need to observe data and compute a confidence interval.  The statement that a specific unknown number is between two other numbers (1 and 2) or not is either TRUE (P = 1) or FALSE (P = 0).  To imply that this trivial observation leads to the conclusion that we cannot make post-data inferences based on confidence intervals is ridiculous.

Neyman continues.

The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “ successes ” in a long series of future “ games ” is approximately known. On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval.  Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval.  This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data.  That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations),  the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.
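The distinction between planning the width and computing the interval can be sketched in R. The helper name planned_half_width(), the assumed population standard deviation of 1, and the simulated sample are all hypothetical choices for illustration.

```r
# Planned half-width of a 95% CI for a mean, before any data are collected.
# planned_sd is an assumed population SD; planned_half_width() is a
# hypothetical helper name.
planned_half_width <- function(n, planned_sd = 1, conf_level = .95) {
  alpha <- 1 - conf_level
  qt(1 - alpha / 2, df = n - 1) * planned_sd / sqrt(n)
}

sample_sizes <- c(10, 20, 50, 100, 400)
round(sapply(sample_sizes, planned_half_width), 3)

# After data collection, the sample fixes the location (and the SD estimate)
set.seed(7)
x <- rnorm(50, mean = 0.3, sd = 1)   # simulated data for illustration
t.test(x)$conf.int
```

The first part needs only an assumption about the standard deviation; the second part cannot be computed without a sample.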

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician?   The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics.  His methods are taught in hundreds of textbooks, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen).  What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of the psycho-statisticians’ deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals.  For example, they cite Jeffreys’s (1961) theory of probability, which gives the impression that Jeffreys’s work followed Neyman’s work. However, an accurate representation of Neyman’s work reveals that Jeffreys’s work preceded Neyman’s work and that Neyman discussed some of the problems with Jeffreys’s approach in great detail.  Neyman’s critical article was even “communicated” by Jeffreys (these were different times, when scientists had open conflict with honor and integrity and actually engaged in scientific debates).


Given that Jeffreys’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffreys’s approach. Neyman first gives a thorough account of Jeffreys’s approach (those were the days).


Neyman then offers his critique of Jeffreys’s approach.

It is known that, as far as we work with the conception of probability as adopted in
this paper, the above theoretically perfect solution may be applied in practice only
in quite exceptional cases, and this for two reasons. 

Importantly, he does not challenge the theory. He only points out that the theory is not practical because it requires knowledge that is often not available. That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters. This is the problem that has plagued subjective Bayesian approaches.

Neyman then discusses Jeffreys’s approach to solving this problem. I do not claim to have the statistical expertise to decide whether Neyman or Jeffreys is right. Even statisticians have been unable to resolve these issues, and I believe the consensus is that Bayesian credibility intervals and Neyman’s confidence intervals are both mathematically viable approaches to interval estimation with different strengths and weaknesses.


I am only trying to point out to unsuspecting readers of the fallacy article that both approaches are as old as statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity. Using a selective quote by Neyman to dismiss confidence intervals and then omitting Neyman’s critique of Bayesian credibility intervals is deceptive and shows an unwillingness or inability to engage in open scientific examination of the arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers’ effort to convert psychologists into Bayesian statisticians is similar to Bem’s (2011) attempt to convert psychologists into believers in parapsychology; or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers is misrepresenting the way classic statistics works and ignoring the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show. Wagenmakers used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception. However, a rebuttal by Bem showed that Bayesian statistics also showed support for extra-sensory perception with different and more reasonable priors. Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem’s article is not the way he “analyzed” the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data. It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification. The problem with Bem’s data has been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2018). There has been no rebuttal from Bem, and he admits to the use of practices that invalidate the published p-values. So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics. The problem is abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are used rarely. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).

In conclusion, it is not a fallacy to interpret confidence intervals as a method for interval estimation of unknown parameters. It would be a fallacy to cite Morey et al.’s article as a valid criticism of confidence intervals. This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman’s or Jeffreys’s approach to interval estimation.

P.S.  More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.

I have to admit some Schadenfreude when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior. While Bayesians are still fighting over the right priors, practical researchers may be better off using statistical methods that do not require priors, like, hm, confidence intervals?

P.P.S.  Science requires trust.  At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.











A Clarification of P-Curve Results: The Presence of Evidence Does Not Imply the Absence of Questionable Research Practices

This post is not a criticism of p-curve.  The p-curve authors have been very clear in their writing that p-curve is not designed to detect publication bias.  However, numerous articles make the surprising claim that they used p-curve to test publication bias.  The purpose of this post is to simply correct a misunderstanding of p-curve.

Questionable Research Practices and Excessive Significance

Sterling (1959) pointed out that psychology journals have a surprisingly high success rate. Over 90% of articles reported statistically significant results in support of authors’ predictions. This success rate would be surprising, even if most predictions in psychology are true. The reason is that the results of a study are not only influenced by cause-effect relationships. Another factor that influences the outcome of a study is sampling error. Even if researchers are nearly always right in their predictions, some studies will fail to provide sufficient evidence for the predicted effect because sampling error makes it impossible to detect the effect. The ability of a study to show a true effect is called power. Just as bigger telescopes are needed to detect more distant stars with a weaker signal, bigger sample sizes are needed to detect small effects (Cohen, 1962; 1988). Sterling et al. (1995) pointed out that the typical power of studies in psychology does not justify the high success rate in psychology journals. In other words, the success rate was too good to be true. This means that published articles are selected for significance.
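To make the link between sample size and power concrete, here is a minimal R sketch using `power.t.test` from the base stats package. The effect size (d = 0.2), the two-group design, and the sample sizes are assumptions chosen only for illustration; they are not taken from Cohen or Sterling.

```r
# Power of a two-sample t-test (two-sided, alpha = .05) for a small
# standardized effect (d = 0.2) at several per-group sample sizes.
# All numbers are illustrative assumptions.
n.per.group <- c(20, 50, 100, 200, 400)
pw <- sapply(n.per.group, function(n) {
  power.t.test(n = n, delta = 0.2, sd = 1, sig.level = .05)$power
})
round(data.frame(n.per.group, power = pw), 2)
```

If every prediction were true, the expected success rate of a literature would equal the average power of its studies, so a success rate above 90% would require power above 90% across the board.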

The bias in favor of significant results is typically called publication bias (Rosenthal, 1979).  However, the term publication bias does not explain the discrepancy between estimates of statistical power and success rates in psychology journals.  John et al. (2012) listed a number of questionable research practices that can inflate the percentage of significant results in published articles.

One mechanism is simply not to report non-significant results. Rosenthal (1979) suggested that non-significant results end up in the proverbial file-drawer. That is, a whole data set remains unpublished. The other possibility is that researchers use multiple exploratory analyses to find a significant result and do not disclose their fishing expedition. These practices are now widely known as p-hacking.

Unlike John et al. (2012), the p-curve authors make a big distinction between not disclosing an entire dataset (publication bias) and not disclosing all statistical analyses of a dataset (p-hacking).

QRP = Publication Bias + P-Hacking

We Don’t Need Tests of Publication Bias

The p-curve authors assume that publication bias is unavoidable.

“Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results.”  (Simonsohn, Nelson, Simmons, 2014).

“By the Way, of Course There is Publication Bias. Virtually all published studies are significant (see, e.g., Fanelli, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995), and most studies are underpowered (see, e.g., Cohen, 1962). It follows that a considerable number of unpublished failed studies must exist. With this knowledge already in hand, testing for publication bias on paper after paper makes little sense” (Simonsohn, 2012, p. 597).

“Yes, p-curve ignores p>.05 because it acknowledges that we observe an unknowably small and non-random subset of p-values >.05.”  (personal email, January 18, 2015).

I hope these quotes make it crystal clear that p-curve is not designed to examine publication bias because the authors assume that selection for significance is unavoidable. On this assumption, any statistical test that reveals no evidence of publication bias must be a false negative result, because the sample size was not large enough to detect it.

Another concern by Uri Simonsohn is that bias tests may reveal statistically significant bias that has no practical consequences.

Consider a literature with 100 studies, all with p < .05, but where the implied statistical power is “just” 97%. Three expected failed studies are missing. The test from the critiques would conclude there is statistically significant publication bias; its magnitude, however, is trivial. (Simonsohn, 2012, p. 598).

k.sig = 100; k.studies = 100; power = .97; pbinom(k.studies-k.sig,k.studies,1-power) = 0.048

This is a valid criticism that applies to all p-values. A p-value only provides information about the contribution of random sampling error. A p-value of .048 suggests that it is unlikely to observe only significant results, even if 100 studies have 97% power to show a significant result. However, with 97% observed power, the 100 studies provide credible evidence for an effect and even the inflation of the average effect size is minimal.

A different conclusion would follow from a p-value less than .05 in a set of 7 studies that all show significant results.

k.sig = 7; k.studies = 7; power = .62; pbinom(k.studies-k.sig,k.studies,1-power) = 0.035

Rather than showing small bias with a large set of studies, this finding shows large bias with a small set of studies.  P-values do not distinguish between these two scenarios. Both outcomes are equally unlikely.  Thus, information about the probability of an event should always be interpreted in the context of the effect.  The effect size is simply the difference between the expected and observed rate of significant results.  In Simonsohn’s example, the effect size is small (1 – .97 = .03).  In the second example, the discrepancy is large (1 – .62 = .38).
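The contrast between the two scenarios can be laid out side by side. The following R sketch wraps the `pbinom` call used above in a small helper; the function name `bias.check` is hypothetical and only serves this illustration.

```r
# Probability that all k studies are significant if each has the stated power,
# shown next to the discrepancy between observed and expected success rates.
bias.check <- function(k, power) {
  p.value <- pbinom(0, k, 1 - power)  # zero failures in k studies = power^k
  discrepancy <- 1 - power            # observed rate (100%) minus expected rate
  c(k = k, power = power, p.value = p.value, discrepancy = discrepancy)
}

results <- rbind(simonsohn.example = bias.check(100, .97),
                 second.example    = bias.check(7, .62))
round(results, 3)
```

The two p-values are similar (about .048 and .035), while the discrepancies (.03 versus .38) differ by more than a factor of ten.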

The previous scenarios assume that only significant results are reported. However, in sciences that use preregistration to reduce deceptive publishing practices (e.g., medicine), non-significant results are more common. When non-significant results are reported, bias tests can be used to assess the extent of bias.

For example, a literature may report 10 studies with only 4 significant results and the median observed power is 30%.  In this case, the bias is small (.40 – .30 = .10) and a conventional meta-analysis would produce only slightly inflated estimates of the average effect size.  In contrast, p-curve would discard over 50% of the studies because it assumes that the non-significant results are not trustworthy.  This is an unnecessary loss of information that could be avoided by testing for publication bias.
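When non-significant results are reported, the same binomial logic applies to the upper tail. The sketch below uses the numbers from the example above and treats the median observed power as if it were the true average power, which is a simplifying assumption made only for illustration.

```r
# Chance of observing at least k.sig significant results in k.studies
# when the assumed average power is `power`.
k.studies <- 10
k.sig <- 4
power <- .30   # median observed power, treated here as true power (assumption)

observed.rate <- k.sig / k.studies
inflation <- observed.rate - power
p.value <- pbinom(k.sig - 1, k.studies, power, lower.tail = FALSE)

round(c(observed.rate = observed.rate, inflation = inflation, p.value = p.value), 3)
```

With these inputs, four or more significant results are not surprising (the tail probability is roughly .35), which matches the conclusion that the bias in this hypothetical literature is small.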

In short, p-curve assumes that publication bias is unavoidable. Hence, tests of publication bias are unnecessary and non-significant results should always be discarded.

Why Do P-Curve Users Think P-Curve is a Publication Bias Test?

Example 1

I conducted a literature search for studies that used p-curve, and I was surprised by numerous claims that p-curve is a test of publication bias.

Simonsohn, Nelson, and Simmons (2014a, 2014b, 2016) and Simonsohn, Simmons, and Nelson (2015) introduced pcurve as a method for identifying publication bias (Steiger & Kühberger, 2018, p. 48).   

However, the authors do not explain how p-curve detects publication bias. Later on, they correctly point out that p-curve is a method that can correct for publication bias.

P-curve is a good method to correct for publication bias, but it has drawbacks. (Steiger & Kühberger, 2018, p. 48).   

Thus, the authors seem to confuse detection of publication bias with correction for publication bias.  P-curve corrects for publication bias, but it does not detect publication bias; it assumes that publication bias is present and a correction is necessary.

Example 2

An article in the medical journal JAMA Psychiatry also claimed that they used p-curve and other methods to assess publication bias.

Publication bias was assessed across all regions simultaneously by visual inspection of funnel plots of SEs against regional residuals and by using the excess significance test,  the P-curve method, and a multivariate analogue of the Egger regression test (Bruger & Howes, 2018, p. 1106).  

After reporting the results of several bias tests, the authors report the p-curve results.

P-curve analysis indicated evidential value for all measures (Bruger & Howes, 2018, p. 1106).

The authors seem to confuse presence of evidential value with absence of publication bias. As discussed above,  publication bias can be present even if studies have evidential value.

Example 3

To assess publication bias, we considered multiple indices. Specifically, we evaluated Duval and Tweedie’s Trim and Fill Test, Egger’s Regression Test, Begg and Mazumdar Rank Correlation Test, Classic Fail-Safe N, Orwin’s Fail-Safe N, funnel plot symmetry, P-Curve Tests for Right-Skewness, and Likelihood Ratio Test of Vevea and Hedges Weight-Function Model.

As in the previous example, the authors confuse evidence for evidential value (a significantly right-skewed p-curve) with evidence for the absence of publication bias.

Example 4

The next example even claims that p-curve can be used to quantify the presence of bias.

Publication bias was investigated using funnel plots and the Egger regression asymmetry test. Both the trim and fill technique (Duval & Tweedie, 2000) and p-curve (Simonsohn, Nelson, & Simmons, 2014a, 2014b) technique were used to quantify the presence of bias (Korrel et al., 2017, p. 642).

The actual results section only reports that the p-curve is right skewed.

The p-curve for the remaining nine studies (p < .025) was significantly right skewed (binomial test: p = .002; continuous test full curve: Z = -9.94, p < .0001, and half curve Z = -9.01, p < .0001) (Korrel et al., 2017, p. 642)

These results do not assess or quantify publication bias. One might consider the reported z-scores a quantitative measure of evidential value, as larger z-scores are less probable under the nil-hypothesis that all significant results are false positives. Nevertheless, strong evidential value (e.g., 100 studies with 97% power) does not imply that publication bias is absent, nor does it mean that publication bias is small.

A set of 1000 studies with 10% power is expected to produce 900 non-significant results and 100 significant results.  Removing the non-significant results produces large publication bias, but a p-curve analysis shows strong evidence against the nil-hypothesis that all studies are false positives.

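Z = rnorm(1000,qnorm(.10,1.96))    # 1,000 z-scores from studies with 10% power
Z.sig = Z[Z > 1.96]                # keep only the significant results
Stouffer.Z = sum(Z.sig-1.96)/sqrt(length(Z.sig))
# Stouffer.Z = 4.89 in the run reported here; the value varies with the random draw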

The reason is that p-curve is a meta-analysis and the results depend on the strength of evidence in individual studies and the number of studies.  Strong evidence can be result of many studies with weak evidence or a few studies with strong evidence.  Thus, p-curve is a meta-analytic method that combines information from several small studies to draw inferences about a population parameter.  The main difference to older meta-analytic methods is that older methods assumed that publication bias is absent, whereas p-curve assumes that publication bias is present. Neither method assesses whether publication bias is present, nor do they quantify the amount of publication bias.

Example 5

Sala and Gobet (2017) explicitly make the mistake of equating evidence for evidential value with evidence against publication bias.

Finally, a p-curve analysis was run with all the p values < .05 related to positive effect sizes (Simonsohn, Nelson, & Simmons, 2014). The results showed evidential values (i.e., no evidence of publication bias), Z(9) = -3.39, p = .003.  (p. 676).

As discussed in detail before, this is not a valid inference.

Example 6

Ironically, the interpretation of p-curve results as evidence that there is no publication bias contradicts the fundamental assumption of p-curve that we can safely assume that publication bias is always present.

The danger is that misuse of p-curve as a test of publication bias may give the false impression that psychological scientists are reporting their results honestly, while actual bias tests show that this is not the case.

It is therefore problematic if authors in high impact journals (not necessarily high quality journals) claim that they found evidence for the absence of publication bias based on a p-curve analysis.

To check whether this research field suffers from publication bias, we conducted p-curve analyses (Simonsohn, Nelson, & Simmons, 2014a, 2014b) on the most extended data set of the current meta-analysis (i.e., psychosocial correlates of the dark triad traits), using an on-line application ( As can be seen in Figure 2, for each of the dark triad traits, we found an extremely right-skewed p-curve, with statistical tests indicating that the studies included in our meta-analysis, indeed, contained evidential value (all ps < .001) and did not point in the direction of inadequate evidential value (all ps non-significant). Thus, it is unlikely that the dark triad literature is affected by publication bias (Muris, Merckelbach, Otgaar, & Meijer, 2017).

Once more, presence of evidential value does not imply absence of publication bias!

Evidence of P-Hacking  

Publication bias is not the only reason for the high success rates in psychology.  P-hacking will also produce more significant results than the actual power of studies warrants. In fact, the whole purpose of p-hacking is to turn non-significant results into significant ones.  Most bias tests do not distinguish between publication bias and p-hacking as causes of bias.  However, the p-curve authors make this distinction and claim that p-curve can be used to detect p-hacking.

Apparently, we should not assume that p-hacking is just as prevalent as publication bias, which would make testing for p-hacking just as irrelevant.

The problem is that it is a lot harder to distinguish p-hacking from publication bias than the p-curve authors imply, and that their p-curve test of p-hacking will only work under very limited conditions. Most of the time, the p-curve test of p-hacking will fail to provide evidence for p-hacking, and this result can be misinterpreted as evidence that results were obtained without p-hacking, which is a logical fallacy.

This mistake was made by Winternitz, Abbate, Huchard, Havlicek, & Garamszegi (2017).

Fourth and finally, as bias for publications with significant results can rely more on the P-value than on the effect size, we used the Pcurve method to test whether the distribution of significant P-values, the ‘P-curve’, indicates that our studies have evidential value and are free from ‘p-hacking’ (Simonsohn et al. 2014a, b).

The problem is that the p-curve test of p-hacking only works when evidential value is very low and for some specific forms of p-hacking. For example, researchers can p-hack by testing many dependent variables. Selecting significant dependent variables is no different from running many studies with a single dependent variable and selecting entire studies with significant results; it is just more efficient.  The p-curve would not show the left-skewed p-curve that is considered diagnostic of p-hacking.

Even a flat p-curve would merely show lack of evidential value, but it would be wrong to assume that p-hacking was not used. To demonstrate this, I submitted the results from Bem’s (2011) infamous “feeling the future” article to a p-curve analysis.

[Figure: p-curve analysis of Bem (2011); pcurve.bem.png]

The p-curve analysis shows a flat p-curve.  This shows lack of evidential value under the assumption that questionable research practices were used to produce 9 out of 10 significant (p < .05, one-tailed) results.  However, there is no evidence that the results are p-hacked if we were to rely on a left-skewed p-curve as evidence for p-hacking.

One possibility would be that Bem did not p-hack his studies. However, this would imply that he ran 20 studies for each significant result. With sample sizes of 100 participants per study, this would imply that he tested 20,000 participants. This seems unrealistic, and Bem states that he reported all studies that were conducted. Moreover, analyses of the raw data showed peculiar patterns that suggest some form of p-hacking was used. Thus, this example shows that p-curve is not very effective in revealing p-hacking.

It is also interesting that the latest version of p-curve, p-curve4.06, no longer tests for left-skewedness of distributions and doesn’t mention p-hacking.  This change in p-curve suggests that the authors realized the ineffectiveness of p-curve in detecting p-hacking (I didn’t ask the authors for comments, but they are welcome to comment here or elsewhere on this change in their app).

It is problematic if meta-analysts assume that p-curve can reveal p-hacking and infer from a flat or right-skewed p-curve that the data are not p-hacked.  This inference is not warranted because absence of evidence is not the same as evidence of absence.


P-curve is a family of statistical tests for meta-analyses of sets of studies.  One version is an effect size meta-analysis; others test the nil-hypothesis that the population effect size is zero.  The novel feature of p-curve is that it assumes that questionable research practices undermine the validity of traditional meta-analyses that assume no selection for significance. To correct for the assumed bias, observed test statistics are corrected for selection bias (i.e., p-values between .05 and 0 are multiplied by 20 to produce p-values between 0 and 1 that can be analyzed like unbiased p-values).  Just like regular meta-analysis, the main result of a p-curve analysis is a combined test-statistic or effect size estimate that can be used to test the nil-hypothesis.  If the nil-hypothesis can be rejected, p-curve analysis suggests that some effect was observed.  Effect size p-curve also provides an effect size estimate for the set of studies that produced significant results.
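The rescaling step described above can be sketched in a few lines of R. This is a simplified illustration under stated assumptions, not the code of the p-curve app: the helper name `pcurve.stouffer` and the simulated studies are hypothetical, and the app's additional tests (e.g., the half curve) are not reproduced.

```r
# Rescale significant p-values to the unit interval and combine them.
pcurve.stouffer <- function(p, alpha = .05) {
  p.sig <- p[p < alpha]      # p-curve only uses significant p-values
  pp <- p.sig / alpha        # same as multiplying by 20 when alpha = .05
  z <- qnorm(pp)             # uniform pp-values are standard normal under H0
  stouffer.z <- sum(z) / sqrt(length(z))
  # A strongly negative Z indicates right skew, that is, evidential value
  list(k = length(p.sig), Z = stouffer.z, p = pnorm(stouffer.z))
}

# Hypothetical literature: 1,000 two-sided z-tests with about 10% power
set.seed(1)
z.obs <- rnorm(1000, mean = qnorm(.10, 1.96))
p.obs <- 2 * pnorm(-abs(z.obs))
res <- pcurve.stouffer(p.obs)
res
```

Note that the non-significant results never enter the calculation, so the output says nothing about how many of them exist or were left unpublished.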

Just like regular meta-analyses, p-curve is not a bias test. It does not test whether publication bias exists and it fails as a test of p-hacking under most circumstances. Unfortunately, users of p-curve seem to be confused about the purpose of p-curve or make the logical mistake of inferring from the presence of evidence that questionable research practices (publication bias; p-hacking) are absent. This is a fallacy. To examine the presence of publication bias, researchers should use existing and validated bias tests.





















Can the Bayesian Mixture Model Estimate the Percentage of False Positive Results in Psychology Journals?

A method revolution is underway in psychological science. In 2011, an article published in JPSP-ASC made it clear that experimental social psychologists were publishing misleading p-values because researchers violated basic principles of significance testing (Schimmack, 2012; Wagenmakers et al., 2011). Deceptive reporting practices led to the publication of mostly significant results, while many non-significant results were not reported. This selective publishing of results dramatically increases the risk of a false positive result above the nominal level of 5% that is typically claimed in publications that report significance tests (Sterling, 1959).

Although experimental social psychologists think that these practices are defensible, no statistician would agree with them. In fact, Sterling (1959) already pointed out that the success rate in psychology journals is too high and claims about statistical significance are meaningless. Similar concerns were raised again within psychology (Rosenthal, 1979), but deceptive practices remain acceptable to this day (Kitayama, 2018). As a result, most published results in social psychology do not replicate and cannot be trusted (Open Science Collaboration, 2015).

For non-methodologists it can be confusing to make sense of the flood of method papers that have been published in the past years.  It is therefore helpful to provide a quick overview of methodological contributions concerned with detection and correction of biases.

First, some methods focus on effect sizes (pcurve2.0; puniform), whereas others focus on strength of evidence (Test of Excessive Significance; Incredibility Index; R-Index, Pcurve2.1; Pcurve4.06; Zcurve).

Another important distinction is between methods that assume a fixed parameter and methods that assume heterogeneity. If all studies have a common effect size or the same strength of evidence, it is relatively easy to demonstrate bias and to correct for bias (Pcurve2.1; Puniform; TES). However, heterogeneity in effect sizes or sampling error produces challenges. Relatively few methods have been developed for this challenging, yet realistic scenario. For example, Ioannidis and Trikalinos (2005) developed a method to reveal publication bias that assumes a fixed effect size across studies, while allowing for variation in sampling error, but this method can be biased if there is heterogeneity in effect sizes. In contrast, I developed the Incredibility-Index (also called Magic Index) to allow for heterogeneity in effect sizes and sampling error (Schimmack, 2012).

Following my work on bias detection in heterogeneous sets of studies, I started working with Jerry Brunner on methods that can estimate average power of a heterogeneous set of studies that are selected for significance.  I first published this method on my blog in June 2015, when I called it post-hoc power curves.   These days, the term Zcurve is used more often to refer to this method.  I illustrated the usefulness of Zcurve in various posts in the Psychological Methods Discussion Group.

In September 2015, I posted replicability rankings of social psychology departments using this method. The post generated a lot of discussion and a question about the method. Although the details were still unpublished, I described the main approach of the method. To deal with heterogeneity, the method uses a mixture model.


In 2016, Jerry Brunner and I submitted a manuscript for publication that compared four methods for estimating average power of heterogeneous studies selected for significance (Puniform1.1; Pcurve2.1; Zcurve & a Maximum Likelihood Method). In this article, the mixture model, Zcurve, outperformed other methods, including a maximum-likelihood method developed by Jerry Brunner. The manuscript was rejected from Psychological Methods.

In 2017, Gronau, Duizer, Bakker, and Wagenmakers published an article titled “Bayesian Mixture Modeling of Significant p Values: A Meta-Analytic Method to Estimate the Degree of Contamination From H0” in the Journal of Experimental Psychology: General. The article did not mention z-curve, presumably because it was not published in a peer-reviewed journal.

Although a reference to our mixture model would have been nice, the Bayesian Mixture Model differs in several ways from Zcurve. This blog post examines the similarities and differences between the two mixture models. It shows that BMM fails to provide useful estimates with simulations and social priming studies, and it explains why BMM fails. It also shows that Zcurve can provide useful information about replicability of social priming studies, while the BMM estimates are uninformative.


The Bayesian Mixture Model (BMM) and Zcurve have different aims.  BMM aims to estimate the percentage of false positives (significant results with an effect size of zero). This percentage is also called the False Discovery Rate (FDR).

FDR = False Positives / (False Positives + True Positives)

Zcurve aims to estimate the average power of studies selected for significance. Importantly, Brunner and Schimmack use the term power to refer to the unconditional probability of obtaining a significant result and not the common meaning of power as being conditional on the null-hypothesis being false. As a result, Zcurve does not distinguish between false positives with a 5% probability of producing a significant result (when alpha = .05) and true positives with an average probability between 5% and 100% of producing a significant result.

Average unconditional power is simply the percentage of false positives times alpha plus the percentage of true positives times the average conditional power of true positive results (Sterling et al., 1995).

Unconditional Power = False Positives * Alpha + True Positives * Mean(1 – Beta)

Zcurve therefore avoids the thorny issue of defining false positives and trying to distinguish between false positives and true positives with very small effect sizes and low power.


BMM and zcurve use p-values as input.  That is, they ignore the actual sampling distribution that was used to test statistical significance.  The only information that is used is the strength of evidence against the null-hypothesis; that is, how small the p-value actually is.

The problem with p-values is that they have a specified sampling distribution only when the null-hypothesis is true. When the null-hypothesis is true, p-values have a uniform sampling distribution.  However, this is not useful for a mixture model, because a mixture model assumes that the null-hypothesis is sometimes false and the sampling distribution for true positives is not defined.

Zcurve solves this problem by using the inverse normal distribution to convert all p-values into absolute z-scores (abs(z) = -qnorm(p/2)). Absolute z-scores are used because F-tests or two-sided t-tests do not have a sign and a test score of 0 corresponds to a probability of 1. Thus, the results do not say anything about the direction of an effect, while the size of the p-value provides information about the strength of evidence.

BMM also transforms p-values. The only difference is that BMM uses the full normal distribution with positive and negative z-scores  (z = qnorm(p)). That is, a p-value of .5 corresponds to a z-score of zero; p-values greater than .5 would be positive, and p-values less than .5 are assigned negative z-scores.  However, because only significant p-values are selected, all z-scores are negative in the range from -1.65 (p = .05, one-tailed) to negative infinity (p = 0).
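The two transformations, as described above, can be compared directly in R. The three p-values are arbitrary illustrative inputs.

```r
p <- c(.05, .01, .001)

z.zcurve <- -qnorm(p / 2)  # absolute z-scores from two-sided p-values
z.bmm    <- qnorm(p)       # signed z-scores; negative whenever p < .5

round(cbind(p, z.zcurve, z.bmm), 2)
```

For p = .05 this gives 1.96 on the Zcurve scale and -1.65 (rounded) on the BMM scale, so the two models work with differently scaled inputs even though both start from the same p-values.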

The non-centrality parameter (i.e., the true parameter that generates the sampling distribution) is simply the mean of the normal distribution. For the null-hypothesis and false positives, the mean is zero.

Zcurve and BMM differ in the modeling of studies with true positive results that are heterogeneous.  Zcurve uses several normal distributions with a standard deviation of 1 that reflects sampling error for z-tests.  Heterogeneity in power is modeled by varying means of normal distributions, where power increases with increasing means.
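The idea of modeling heterogeneity with several unit-variance normal components can be sketched as follows. This is not the Zcurve implementation; the component means and weights are hypothetical, the weights are taken to be shares among significant results, and the helper names are made up for this illustration.

```r
crit <- qnorm(.975)  # 1.96 for two-sided alpha = .05

# Probability that |z| exceeds crit when z ~ N(mean, 1)
comp.power <- function(mean) {
  pnorm(crit, mean, 1, lower.tail = FALSE) + pnorm(-crit, mean, 1)
}

# Density of a significant absolute z-score for one component
# (folded normal, truncated at crit and renormalized)
dsig <- function(z, mean) {
  ifelse(z > crit, (dnorm(z, mean, 1) + dnorm(-z, mean, 1)) / comp.power(mean), 0)
}

# Mixture density of significant absolute z-scores
dmix <- function(z, means, weights) {
  dens <- numeric(length(z))
  for (i in seq_along(means)) dens <- dens + weights[i] * dsig(z, means[i])
  dens
}

means   <- c(0, 2, 4)     # hypothetical component means
weights <- c(.2, .5, .3)  # hypothetical weights, must sum to 1

# Sanity check: the mixture density integrates to 1 above the criterion
total.area <- integrate(dmix, crit, Inf, means = means, weights = weights)$value

# Average power implied by this mixture
avg.power <- sum(weights * comp.power(means))
```

In this sketch the component with a mean of zero has 5% power, and larger means translate into higher power, which is how varying means capture heterogeneity.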

BMM uses a single normal distribution with varying standard deviation.  A wider distribution is needed to predict large observed z-scores.

The main difference between Zcurve and BMM is that Zcurve either does not have fixed means (Brunner & Schimmack, 2016) or has fixed means, but does not interpret the weight assigned to a mean of zero as an estimate of false positives (Schimmack & Brunner, 2018).  The reason is that the weights attached to individual components are not very reliable estimates of the weights in the data-generating model.  Importantly, this is not relevant for the goal of Zcurve to estimate average power because the weighted average of the components of the model is a good estimate of the average true power in the data-generating model, even if the weights do not match the weights of the data-generating model.

For example, Zcurve does not care whether 50% average power is produced by a mixture of 50% false positives and 50% true positives with 95% power or 50% of studies with 20% power and 50% studies with 80% power. If all of these studies were exactly replicated, they are expected to produce 50% significant results.
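The arithmetic behind this example is easy to check. The sketch below assumes a significance criterion of .05, so that a false positive has a 5% chance of producing a significant result in an exact replication.

```r
# Scenario A: 50% false positives (5% success rate) and 50% true positives with 95% power
avg_a <- .5 * .05 + .5 * .95

# Scenario B: 50% of studies with 20% power and 50% of studies with 80% power
avg_b <- .5 * .20 + .5 * .80

# Both mixtures have the same average power
c(scenario_a = avg_a, scenario_b = avg_b)
```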

BMM uses the weights assigned to the standard normal with a mean of zero as an estimate of the percentage of false positive results.  It does not estimate the average power of true positives or average unconditional power.

Given my simulation studies with Zcurve, I was surprised that BMM would solve a problem that I had found to be intractable: the weights of individual components cannot be reliably estimated because the same distribution of p-values can be produced by many mixture models with different weights.  The next section examines how BMM tries to estimate the percentage of false positives from the distribution of p-values.

A Bayesian Approach

Another difference between BMM and Zcurve is that BMM uses prior distributions, whereas Zcurve does not.  Whereas Zcurve makes no assumptions about the percentage of false positives, BMM uses a uniform distribution with values from 0 to 1 (100%) as a prior.  That is, it is equally likely that the percentage of false positives is 0%, 100%, or any value in between.  A uniform prior is typically justified as being agnostic; that is, no subjective assumptions bias the final estimate.

For the mean of the true positives, the authors use a truncated normal prior, which they also describe as a folded standard normal.  They justify this prior as reasonable based on extensive simulation studies.

Most important, however, is the parameter for the standard deviation.  The prior for this parameter was a uniform distribution with values between 0 and 1.   The authors argue that larger values would produce too many p-values close to 1.

“implausible prediction that p values near 1 are more common under H1 than under H0” (p 1226). 

But why would this be implausible?  If there are very few false positives and many true positives with low power, more p-values close to 1 would be the result of true positives (H1) than of false positives (H0).
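A small calculation illustrates the counting argument. The numbers are assumptions chosen only for illustration (5% false positives, true positives with a mean z-score of 0.1, two-sided p-values), and the sketch does not reproduce the BMM specification.

```r
# Hypothetical mixture: few false positives, many low-powered true positives
prop_h0 <- .05
prop_h1 <- .95
ncp <- 0.1   # assumed mean z-score of the true positives

# A two-sided p-value above .90 corresponds to |z| < qnorm(.55)
z_cut <- qnorm(1 - .90 / 2)

# Probability of p > .90 under H0 and under H1
p_near1_h0 <- pnorm(z_cut) - pnorm(-z_cut)
p_near1_h1 <- pnorm(z_cut - ncp) - pnorm(-z_cut - ncp)

# Expected share of all studies with p > .90 that comes from each hypothesis
share_h0 <- prop_h0 * p_near1_h0
share_h1 <- prop_h1 * p_near1_h1
share_h1 / (share_h0 + share_h1)
```

Under these assumptions, the large majority of p-values above .90 stem from H1 simply because H1 studies are far more numerous.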

Thus, one way BMM is able to estimate the false discovery rate is by setting the standard deviation in a way that there is a limit to the number of low z-scores that are predicted by true positives (H1).

Although understanding priors and how they influence results is crucial for meaningful use of Bayesian statistics, the choice of priors is not crucial for Bayesian estimation models with many observations because the influence of the priors diminishes as the number of observations increases.  Thus, the ability of BMM to estimate the percentage of false positives in large samples cannot be explained by the use of priors. It is therefore still not clear how BMM can distinguish between false positives and true positives with low power.

Simulation Studies

The authors report several simulation studies that suggest BMM estimates are close to the true values and robust across many scenarios.

“The online supplemental material presents a set of simulation studies that highlight that the model is able to accurately estimate the quantities of interest under a relatively broad range of circumstances”  (p. 1226).

The first set of simulations uses a sample size of N = 500 (n = 250 per condition).  Heterogeneity in effect sizes is simulated with a truncated normal distribution with a standard deviation of .10 (truncated at 2*SD) and effect sizes of d = .45, .30, and .15.  The lowest values are .35, .20, and .05.  With N = 500, these values correspond to  97%, 61%, and 8% power respectively.

d = c(.35,.20,.05); 1-pt(qt(.975,500-2),500-2,d*sqrt(500)/2)

The number of studies was k = 5,000 with half of the studies being false positives (H0) and half being true positives (H1).

Figure 1 shows the Zcurve plot for the simulation with high power (d = .45, power >  97%; median true power = 99.9%).


The graph shows a bimodal distribution with clear evidence of truncation: the steep drop at z = 1.96 (p = .05, two-tailed) is inconsistent with the distribution of significant z-scores.  The sharp drop from z = 1.96 to 3 shows that many studies with non-significant results are missing.  The estimate of unconditional power (called replicability = expected success rate in exact replication studies) is 53%.  This estimate is consistent with the simulation of 50% studies with a probability of success of 5% and 50% of studies with a success probability of 99.9% (.5 * .05 + .5 * .999 = .5245, or about 52.5%).
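The expected success rate can be reproduced with the same power formula that was used above for the lowest effect sizes. This sketch uses the mean effect size of d = .45 rather than the full truncated normal distribution, so it is only an approximation of the simulated scenario.

```r
N <- 500    # total sample size (n = 250 per condition)
d <- .45    # mean effect size of the true positives

# Power of a two-sided t-test (upper tail), as in the formula used earlier
power_h1 <- 1 - pt(qt(.975, N - 2), N - 2, ncp = d * sqrt(N) / 2)

# Expected success rate: half false positives (5%), half true positives
expected_rate <- .5 * .05 + .5 * power_h1

round(c(power_h1 = power_h1, expected_rate = expected_rate), 3)
```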

The values below the x-axis show average power for  specific z-scores. A z-score of 2 corresponds roughly to p = .05 and 50% power without selection for significance. Due to selection for significance, the average power is only 9%. Thus the observed power of 50% provides a much inflated estimate of replicability.  A z-score of 3.5 is needed to achieve significance with p < .05, although the nominal p-value for z = 3.5 is p = .0002.  Thus, selection for significance renders nominal p-values meaningless.

The sharp change in power from Z = 3 to Z = 3.5 is due to the extreme bimodal distribution.  While most Z-scores below 3 are from the sampling distribution of H0 (false positives), most Z-scores of 3.5 or higher come from H1 (true positives with high power).

Figure 2 shows the results for the simulation with d = .30.  The results are very similar because d = .30 still gives 92% power.  As a result, replicability is nearly as high as in the previous example.



The most interesting scenario is the simulation with low powered true positives. Figure 3 shows the Zcurve for this scenario with an unconditional average power of only 23%.


It is no longer possible to recognize two sampling distributions and average power increases rather gradually from 18% for z = 2, to 35% for z = 3.5.  Even with this challenging scenario, BMM performed well and correctly estimated the percentage of false positives.   This is surprising because it is easy to generate a similar Zcurve without false positives.

Figure 4 shows a simulation with a mixture distribution in which the false positives (d = 0) have been replaced by true positives (d = .06), while the mean for the heterogeneous studies was reduced from d = .15 to d = .11.  These values were chosen to produce the same average unconditional power (replicability) of 23%.


I transformed the z-scores into (two-sided) p-values and submitted them to the online BMM app at .  I used only k = 1,500 p-values because the server timed me out several times with k = 5,000 p-values.  The estimated percentage of false positives was 24%, with a wide 95% credibility interval ranging from 0% to 48%.   These results suggest that BMM has problems distinguishing between false positives and true positives with low power.   BMM appears to be able to estimate the percentage of false positives correctly when most low z-scores are sampled from H0 (false positives). However, when these z-scores are due to studies with low power, BMM cannot distinguish between false positives and true positives with low power. As a result, the credibility interval is wide and the point estimates are misleading.


With k = 1,500, the influence of the priors is negligible.  However, with smaller sample sizes, the priors do have an influence on results and may lead to overestimation and misleading credibility intervals.  A simulation with k = 200 produced a point estimate of 34% false positives with a very wide CI ranging from 0% to 63%. The authors suggest a sensitivity analysis by changing model parameters. The most crucial parameter is the standard deviation.  Increasing the standard deviation to 2 increases the upper limit of the 95%CI to 75%.  Thus, without good justification for a specific standard deviation, the data provide very little information about the percentage of false positives underlying this Zcurve.



For simulations with k = 100, the prior started to bias the results and the CI no longer included the true value of 0% false positives.


In conclusion, these simulation results show that BMM promises more than it can deliver.  It is very difficult to distinguish p-values sampled from H0 (mean z = 0) and those sampled from H1 with weak evidence (e.g., mean z = 0.1).
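A quick calculation shows how little separates these two cases. The sketch compares the probability of a significant result, and the share of significant results with an absolute z-score above 3, for a mean z-score of 0 and of 0.1; the cutoff of 3 is my own arbitrary choice.

```r
crit <- qnorm(.975)   # two-sided criterion for p < .05

# Probability that |z| exceeds a cutoff when z ~ N(m, 1)
tail_prob <- function(m, cutoff) pnorm(-cutoff - m) + 1 - pnorm(cutoff - m)

m <- c(h0 = 0, h1 = 0.1)

# Probability of a significant result under each hypothesis
sig_rate <- tail_prob(m, crit)

# Among significant results, the share with |z| > 3
share_above_3 <- tail_prob(m, 3) / sig_rate

round(rbind(sig_rate, share_above_3), 4)
```

Both the success rates and the shapes of the two distributions of significant z-scores are nearly identical, which is why the data contain so little information to tell them apart.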

In the Challenges and Limitations section, the authors pretty much agree with this assessment of BMM (Gronau et al., 2017, p. 1230).

The procedure does come with three important caveats.

First, estimating the parameters of the mixture model is an inherently difficult statistical problem. ..  and consequently a relatively large number of p values are required for the mixture model to provide informative results. 

A second caveat is that, even when a reasonable number of p values are available, a change in the parameter priors might bring about a noticeably different result.

The final caveat is that our approach uses a simple parametric form to account for the distribution of p values that stem from H1. Such simplicity comes with the risk of model-misspecification.

Practical Implications

Despite the limitations of BMM, the authors applied BMM to several real data sets.  The most interesting application selected focal hypothesis tests from social priming studies.  Social priming studies have come under attack as a research area with sloppy research methods as well as fraud (Stapel).  Bias tests show clear evidence that published results were obtained with questionable scientific practices (Schimmack, 2017a, 2017b).

The authors analyzed 159 social priming p-values.  The 95%CI for the percentage of false positives ranged from 48% to 88%.  When the standard deviation was increased to 2, the 95%CI increased slightly to 56% to 91%.  However, when the standard deviation was halved, the 95%CI ranged from only 10% to 75%.  These results confirm the authors’ warning that estimates in small sets of studies (k < 200) are highly sensitive to the specification of priors.

What inferences can be drawn from these results about the social priming literature?  A false positive percentage of 10% doesn’t sound so bad.  A false positive percentage of 88% sounds terrible. A priori, the percentage is somewhere between 0 and 100%. After looking at the data, uncertainty about the percentage of false positives in the social priming literature remains large.  Proponents will focus on the 10% estimate and critics will use the 88% estimate.  The data simply do not resolve inconsistent prior assumptions about the credibility of discoveries in social priming research.

In short, BMM promises that it can estimate the percentage of false positives in a set of studies, but in practice these estimates are too imprecise and too dependent on prior assumptions to be very useful.

A Zcurve of Social Priming Studies (k = 159)

It is instructive to compare the BMM results to a Zcurve analysis of the same data.


The zcurve graph shows a steep drop and very few z-scores greater than 4, which tend to have a high success rate in actual replication attempts (OSC, 2015).  The average estimated replicability is only 27%.  This is consistent with the more limited analysis of social priming studies in Kahneman’s Thinking Fast and Slow book (Schimmack, 2017a).

More important than the point estimate is that the 95%CI ranges from 15% to a maximum of 39%.  Thus, even a sample size of 159 studies is sufficient to provide conclusive evidence that these published studies have a low probability of replicating even if it were possible to reproduce the exact conditions again.

These results show that it is not very useful to distinguish between false positives with a replicability of 5% and true positives with a replicability of 6, 10, or 15%.  Good research provides evidence that can be replicated at least with a reasonable degree of statistical power.  Tversky and Kahneman (1971) suggested a minimum of 50% and most social priming studies fail to meet this minimal standard and hardly any studies seem to have been planned with the typical standard of 80% power.

The power estimates below the x-axis show that a nominal z-score of 4 or higher is required to achieve 50% average power and an actual false positive risk of 5%. Thus, after correcting for deceptive publication practices, most of the seemingly statistically significant results are actually not significant with the common criterion of a 5% risk of a false positive.

The difference between BMM and Zcurve is captured in the distinction between evidence of absence and absence of evidence.  BMM aims to provide evidence of absence (false positives). In contrast, Zcurve has the more modest goal of demonstrating absence (or presence) of evidence.  It is unknown whether any social priming studies could produce robust and replicable effects and under what conditions these effects occur or do not occur.  However, it is not possible to conclude from the poorly designed studies and the selectively reported results that social priming effects are zero.


Zcurve and BMM are both mixture models, but they use different statistical approaches and have different aims.  They also differ in their ability to provide useful estimates.  Zcurve is designed to estimate average unconditional power to obtain significant results without distinguishing between true positives and false positives.  False positives reduce average power, just like low powered studies, and in reality it can be difficult or impossible to distinguish between a false positive with an effect size of zero and a true positive with an effect size that is negligibly different from zero.

The main problem of BMM is that it treats the nil-hypothesis as an important hypothesis that can be accepted or rejected.  However, this is a logical fallacy.  It is possible to reject implausible effect sizes (e.g., the nil-hypothesis is probably false if the 95%CI ranges from .8 to 1.2), but it is not possible to accept the nil-hypothesis because there are always values close to 0 that are also consistent with the data.

The problem of BMM is that it contrasts the point-nil-hypothesis with all other values, even if these values are very close to zero.  The same problem plagues the use of Bayes-Factors that compare the point-nil-hypothesis with all other values (Rouder et al., 2009).  A Bayes-Factor in favor of the point nil-hypothesis is often interpreted as if all the other effect sizes are inconsistent with the data.  However, this is a logical fallacy because data that are inconsistent with a specific H1 can be consistent with an alternative H1.  Thus, a BF in favor of H0 can only be interpreted as evidence against a specific H1, but never as evidence that the nil-hypothesis is true.

To conclude, I have argued that it is more important to estimate the replicability of published results than to estimate the percentage of false positives.  A literature with 100% true positives and average power of 10% is no more desirable than a literature with 50% false positives and 50% true positives with 20% power.  Ideally, researchers should conduct studies with 80% power and honest reporting of statistics and failed replications should control the false discovery rate.  The Zcurve for social priming studies shows that priming researchers did not follow these basic and old principles of good science.  As a result, decades of research are worthless and Kahneman was right to compare social priming research to a train wreck because the conductors ignored all warning signs.




An Even Better P-curve

It is my pleasure to post the first guest post on the R-Index blog.  The blog post is written by my colleague and partner in “crime”-detection, Jerry Brunner.  I hope we will see many more guest posts by Jerry in the future.


Jerry Brunner
Department of Statistical Sciences
University of Toronto

First, my thanks to the mysterious Dr. R for the opportunity to do this guest post. At issue are the estimates of population mean power produced by the online p-curve app. The current version is 4.06, available at As the p-curve team (Simmons, Nelson, and Simonsohn) observe in their blog post entitled “P-curve handles heterogeneity just fine” at, the app does well on average as long as there is not too much heterogeneity in power. They show in one of their examples that it can over-estimate mean power when there is substantial heterogeneity.

Heterogeneity in power is produced by heterogeneity in effect size and heterogeneity in sample size. In the simulations reported at, sample size varies over a fairly narrow range — as one might expect from a meta-analysis of small-sample studies. What if we wanted to estimate mean power for sets of studies with large heterogeneity in sample sizes or an entire discipline, or sub-areas, or journals, or psychology departments? Sample size would be much more variable.

This post gives an example in which the p-curve app consistently over-estimates population mean power under realistic heterogeneity in sample size. To demonstrate that heterogeneity in sample size alone is a problem for the online pcurve app, population effect size was held constant.

In 2016, Brunner and Schimmack developed an alternative p-curve method (p-curve 2.1), which performs much better than the online app p-curve 4.06. P-curve 2.1 is fully documented and evaluated in Brunner and Schimmack (2018). This is the most recent version of the notorious and often-rejected paper mentioned in It has been re-written once again, and submitted to Meta-psychology. It will shortly be posted during the open review process, but in the meantime I have put a copy on my website at

P-curve 2.1 is based on Simonsohn, Nelson and Simmons’ (2014) p-curve estimate of effect size. It is designed specifically for the situation where there is heterogeneity in sample size, but just a single fixed effect size. P-curve 2.1 is a simple, almost trivial application of p-curve 2.0. It first uses the p-curve 2.0 method to estimate a common effect size. It then combines that estimated effect size and the observed sample sizes to calculate an estimated power for each significance test in the sample. The sample mean of the estimated power values is the p-curve 2.1 estimate.
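The second step of this procedure can be sketched in a few lines of R. This is not the heteroNpcurveCHI function; the effect size estimate and the sample sizes are hypothetical, the estimation of the effect size itself is not shown, and the non-centrality parameter is taken to be sample size times effect size, following the convention of the code at the end of this post.

```r
# Hypothetical inputs: an estimated common effect size and observed sample sizes
es_hat <- 0.1
n <- c(20, 50, 100, 250, 500)

# Chi-squared tests with 5 degrees of freedom, alpha = .05
dfree <- 5
crit <- qchisq(.95, dfree)

# Estimated power for each significance test
power_i <- 1 - pchisq(crit, dfree, ncp = n * es_hat)

# The mean of the estimated power values is the estimate of mean power
round(power_i, 3)
mean(power_i)
```

Because every test gets its own power value based on its own sample size, heterogeneity in sample size is built into the estimate.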

One of the virtues of p-curve is that it allows for publication bias, using only significant test statistics as input. The population mean power being estimated is the mean power of the sub-population of tests that happened to be significant. To compare the performance of p-curve 4.06 to p-curve 2.1, I simulated samples of significant test statistics with a single effect size, and realistic heterogeneity in sample size.

Here’s how I arrived at the “realistic” sample sizes. In another project, Uli Schimmack had harvested a large number of t and F statistics from the journal Psychological Science, from the years 2001-2015. I used N = df + 2 to calculate implied total sample sizes. I then eliminated all sample sizes less than 20 and greater than 500, and randomly sampled 5,000 of the remaining numbers. These 5,000 numbers will be called the “Psychological Science urn.” They are available at, and can be read directly into R with the scan function.

The numbers in the Psychological Science urn are not exactly sample sizes and they are not a true random sample. In particular, truncating the distribution at 500 makes them less heterogeneous than real sample sizes, since web surveys with enormous sample sizes are eliminated. Still, I believe the numbers in the Psychological Science urn may be fairly reflective of the sample sizes in psychology journals. Certainly, they are better than anything I would be able to make up. Figure 1 shows a histogram, which is right skewed as one might expect.


By sampling with replacement from the Psychological Science urn, one could obtain a random sample of sample sizes, similar to sampling without replacement from a very large population of studies. However, that’s not what I did. Selection for significance tends to select larger sample sizes, because tests based on smaller sample sizes have lower power and so are less likely to be significant. The numbers in the Psychological Science urn come from studies that passed the filter of publication bias. It is the distribution of sample size after selection for significance that should match Figure 1.

To take care of this issue, I constructed a distribution of sample size before selection and chose an effect size that yielded (a) population mean power after selection equal to 0.50, and (b) a population distribution of sample size after selection that exactly matched the relative frequencies in the Psychological Science urn. The fixed effect size, in a metric of Cohen (1988, p. 216) was w = 0.108812. This is roughly Cohen’s “small” value of w = 0.10. If you have done any simulations involving literal selection for significance, you will realize that getting the numbers to come out just right by trial and error would be nearly impossible. I got the job done by using a theoretical result from Brunner and Schimmack (2018). Details are given at the end of this post, after the results.

I based the simulations on k=1,000 significant chi-squared tests with 5 degrees of freedom. This large value of k (the number of studies, or significance tests on which the estimates are based) means that estimates should be very accurate. To calculate the estimates for p-curve 4.06, it was easy enough to get R to write input suitable for pasting into the online app. For p-curve 2.1, I used the function heteroNpcurveCHI, part of a collection developed for the Brunner and Schimmack paper. The code for all the functions is available at Within R, the functions can be defined with source(""). Then to see a list of functions, type functions() at the R prompt.

Recall that population mean power after selection is 0.50. The first time I ran the simulation, the p-curve 4.06 estimate was 0.64, with a 95% confidence interval from 0.61 to 0.66. The p-curve 2.1 estimate was 0.501. Was this a fluke? The results of five more independent runs are given in the table below. Again, the true value of mean power after selection for significance is 0.50.

P-curve 2.1    P-curve 4.06    P-curve 4.06 Confidence Interval
0.510          0.64            0.61 to 0.67
0.497          0.62            0.59 to 0.65
0.502          0.62            0.59 to 0.65
0.509          0.64            0.61 to 0.67
0.487          0.61            0.57 to 0.64

It is clear that the p-curve 4.06 estimates are consistently too high, while p-curve 2.1 is on the money. One could argue that an error of around twelve percentage points is not too bad (really?), but certainly an error of one percentage point is better. Also, eliminating sample sizes greater than 500 substantially reduced the heterogeneity in sample size. If I had left the huge sample sizes in, the p-curve 4.06 estimates would have been ridiculously high.

Why did p-curve 4.06 fail? The answer is that even with complete homogeneity in effect size, the Psychological Science urn was heterogeneous enough to produce substantial heterogeneity in power. Figure 2 is a histogram of the true (not estimated) power values.


Figure 2 shows that even under homogeneity in effect size, a sample size distribution matching the Psychological Science urn can produce substantial heterogeneity in power, with a mode near one even though the mean is 0.50. In this situation, p-curve 4.06 fails. P-curve 2.1 is clearly preferable, because it specifically allows for heterogeneity in sample size.

Of course p-curve 2.1 does assume homogeneity in effect size. What happens when effect size is heterogeneous too? The paper by Brunner and Schimmack (2018) contains a set of large-scale simulation studies comparing estimates of population mean power from p-curve, p-uniform, maximum likelihood and z-curve, a new method dreamed up by Schimmack. The p-uniform method is based on van Assen, van Aert, and Wicherts (2014), extended to power estimation as in p-curve 2.1. The p-curve method we consider in the paper is p-curve 2.1. It does okay as long as heterogeneity in effect size is modest. Other methods may be better, though. To summarize, maximum likelihood is most accurate when its assumptions about the distribution of effect size are satisfied or approximately satisfied. When effect size is heterogeneous and the assumptions of maximum likelihood are not satisfied, z-curve does best.

I would not presume to tell the p-curve team what to do, but I think they should replace p-curve 4.06 with something like p-curve 2.1. They are free to use my heteroNpcurveCHI and heteroNpcurveF functions if they wish. A reference to Brunner and Schimmack (2018) would be appreciated.

Details about the simulations

Before selection for significance, there is a bivariate distribution of sample size and effect size. This distribution is affected by the selection process, because tests with higher effect size or sample size (or especially, both) are more likely to be significant. The question is, exactly how does selection affect the joint distribution? The answer is in Brunner and Schimmack (2018). This paper is not just a set of simulation studies. It also has a set of “Principles” relating the population distribution of power before selection to its distribution after selection. The principles are actually theorems, but I did not want it to sound too mathematical. Anyway, Principle 6 says that to get the probability of a (sample size, effect size) pair after selection, take the probability before selection, multiply by the power calculated from that pair, and divide by the population mean power before selection.

In the setting we are considering here, there is just a single effect size, so it’s even simpler. The probability of a (sample size, effect size) pair is just the probability of the sample size. Also, we know the probability distribution of sample size after selection. It’s the relative frequencies of the Psychological Science urn. Solving for the probability of sample size before selection yields this rule: the probability of sample size before selection equals the probability of sample size after selection, divided by the power for that sample size, and multiplied by population mean power before selection.
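A toy example may help to make the rule concrete. The three sample sizes, their probabilities, and the effect size below are made up for illustration; the sketch applies Principle 6 in the forward direction and then inverts it to recover the distribution before selection.

```r
# Hypothetical distribution of sample size BEFORE selection
n <- c(20, 100, 500)
p_before <- c(.5, .3, .2)
es <- 0.1                      # single fixed effect size (illustrative)

# Power of a chi-squared test with 5 df for each sample size
crit <- qchisq(.95, 5)
pow <- 1 - pchisq(crit, 5, ncp = n * es)

# Principle 6: probability after selection is proportional to
# probability before selection times power
mean_pow_before <- sum(p_before * pow)
p_after <- p_before * pow / mean_pow_before

# Inversion: divide by power, then rescale so the probabilities sum to one
unscaled <- p_after / pow
mean_pow_recovered <- 1 / sum(unscaled)
p_recovered <- unscaled * mean_pow_recovered

round(rbind(p_before, p_after, p_recovered), 3)
```

Selection shifts probability toward the larger sample sizes, and the rescaling constant in the inversion equals population mean power before selection, which is what the quantity EG computes in the code below.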

This formula will work for any fixed effect size. That is, for any fixed effect size, there is a probability distribution of sample size before selection that makes the distribution of sample size after selection exactly match the Psychological Science frequencies in Figure 1. Effect size can be anything. So, choose the effect size that makes expected (that is, population mean) power after selection equal to some nice value like 0.50.

Here’s the R code. First, we read the Psychological Science urn and make a table of probabilities.


options(scipen=999) # To avoid scientific notation

source(""); functions()

PsychScience = scan("")

hist(PsychScience, xlab='Sample size',breaks=100, main = 'Figure 1: The Psychological Science Urn')

# A handier urn, for some purposes

nvals = sort(unique(PsychScience)) # There are 397 rather than 8000 values

nprobs = table(PsychScience)/sum(table(PsychScience))

# sum(nvals*nprobs) = 81.8606 = mean(PsychScience)

For any given effect size, the frequencies from the Psychological Science urn can be used to calculate expected power after selection. Minimizing the (squared) difference between this value and the desired mean power yields the required effect size.

# Minimize this function to find effect size giving desired power 

# after selection for significance.

fun = function(es,wantpow,dfreedom) 

    {

    alpha = 0.05; cv=qchisq(1-alpha,dfreedom)

    epow = sum( (1-pchisq(cv,df=dfreedom,ncp=nvals*es))*nprobs ) 

    # cat("es = ",es," Expected power = ",epow,"\n")

    (epow-wantpow)^2 # Squared difference from the desired mean power

    } # End of all the fun

# Find needed effect size for chi-square with df=5 and desired 

# population mean power AFTER selection.

popmeanpower = 0.5 # Change this value if you wish

EffectSize = nlminb(start=0.01, objective=fun,lower=0,df=5,wantpow=popmeanpower)$par

EffectSize # 0.108812

Calculate the probability distribution of sample size before selection.

# The distribution of sample size before selection is proportional to the

# distribution after selection divided by power, term by term.

crit = qchisq(0.95,5)

powvals = 1-pchisq(crit,5,ncp=nvals*EffectSize)

Pn = nprobs/powvals 

EG = 1/sum(Pn)

cat("Expected power before selection = ",EG,"\n")

Pn = Pn*EG # Probability distribution of n before selection

Generate test statistics before selection.

nsim = 50000 # Initial number of simulated statistics. This is over-kill. Change the value if you wish.


# For repeated simulations, execute the rest of the code repeatedly.

nbefore = sample(nvals,size=nsim,replace=TRUE,prob=Pn)

ncpbefore = nbefore*EffectSize

powbefore = 1-pchisq(crit,5,ncp=ncpbefore)

Ybefore = rchisq(nsim,5,ncp=ncpbefore)

Select for significance.

sigY = Ybefore[Ybefore>crit]

sigN = nbefore[Ybefore>crit]

sigPOW = 1-pchisq(crit,5,ncp=sigN*EffectSize)

hist(sigPOW, xlab='Power',breaks=100,freq=F ,main = 'Figure 2: Power After Selection for Significance')

Estimate mean power both ways.

# Two estimates of expected power before selection

c( length(sigY)/nsim , mean(powbefore) ) 

c(popmeanpower, mean(sigPOW)) # Golden


k = 1000 # Select 1,000 significant results.

Y = sigY[1:k]; n = sigN[1:k]; TruePower = sigPOW[1:k]

# Estimate with p-curve 2.1

heteroNpcurveCHI(Y=Y,dfree=5,nn=n) # 0.5058606 the first time.

# Write out chi-squared statistics for pasting into the online app

for(j in 1:k) cat("chi2(5) =",Y[j],"\n")


Brunner, J. and Schimmack, U. (2018). Estimating population mean power under conditions of heterogeneity and selection for significance. Under review. Available at

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Edition), Hillsdale, New Jersey: Erlbaum.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681.

van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293-309.