
Statistics Wars and Submarines

I am not the first to describe the fight among statisticians for eminence as a war (Mayo Blog).   The statistics war is as old as modern statistics itself.  The main parties in this war are the Fisherians, Bayesians, and Neymanians (or Neyman-Pearsonians).

Fisherians use p-values as evidence to reject the null-hypothesis; the smaller the p-value the better.

Neymanians distinguish between type-I and type-II errors and use regions of test statistics to reject null-hypotheses or alternative hypotheses.  They also use confidence intervals to obtain interval estimates of population parameters.

Bayesians differ from the Fisherians and Neymanians in that their inferences combine information obtained from data with prior information. Bayesians sometimes fight with each other about the proper prior information. Some prefer subjective priors that are ideally based on prior knowledge. Others prefer objective priors that do not require any prior knowledge and can be applied to all statistical problems (Jeffreysians).  Although they fight with each other, they are united in their fight against Fisherians and Neymanians, whom they collectively call frequentists.

The statistics war has been going on for over 80 years, and there has been no winner.  Unlike in the empirical sciences, there are no new data that could resolve the controversy.  Thus, the statistics war is more like the wars in philosophy, where philosophers are still fighting over the right way to define fundamental concepts like justice or happiness.

For applied researchers these statistics wars can be very confusing because a favorite weapon of statisticians is propaganda.  In this blog post, I examine the Bayesian Submarine (Morey et al., 2016), which aims to sink the ship of Neymanian confidence intervals.

The Bayesian Submarine 

Submarines are fascinating and are currently making major discoveries about sea life.  The Bayesian submarine is rather different.  It is designed to convince readers that confidence intervals provide no meaningful information about population parameters and should be abandoned in favor of Bayesian interval estimation.

Example 1: The lost submarine
In this section, we present an example taken from the confidence interval literature (Berger and Wolpert, 1988; Lehmann, 1959; Pratt, 1961; Welch, 1939) designed to bring into focus how CI theory works. This example is intentionally simple; unlike many demonstrations of CIs, no simulations are needed, and almost all results can be derived by readers with some training in probability and geometry. We have also created interactive versions of our figures to aid readers in understanding the example; see the figure captions for details.

A 10-meter-long research submersible with several people on board has lost contact with its surface support vessel. The submersible has a rescue hatch exactly halfway along its length, to which the support vessel will drop a rescue line. Because the rescuers only get one rescue attempt, it is crucial that when the line is dropped to the craft in the deep water that the line be as close as possible to this hatch. The researchers on the support vessel do not know where the submersible is, but they do know that it forms two distinctive bubbles. These bubbles could form anywhere along the craft’s length, independently, with equal probability, and float to the surface where they can be seen by the support vessel.

The situation is shown in Fig. 1a. The rescue hatch is the unknown location θ, and the bubbles can rise from anywhere with uniform probability between θ − 5 meters (the bow of the submersible) to θ + 5 meters (the stern of the submersible).


Let’s translate this example into a standard statistical problem.  It is uncommon to have a uniform distribution of observed data around a population parameter.  More commonly, we assume that observations are more likely to cluster close to the population parameter and that deviations between the population parameter and an observed value reflect some random process.  However, a bounded uniform distribution also allows us to compute the standard deviation of the randomly occurring data.

round(sd(runif(100000,0,10)),2) = 2.89

We only have two data points to construct a confidence interval.  Evidently, the standard error based on a sample size of n = 2 is large: 1/sqrt(2) = .71, or 71% of a standard deviation.  We can use the typical formula for sampling error, SD/sqrt(N), to estimate the sampling error as 2.89/1.41 = 2.04.
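The simulated value matches the analytic result: the standard deviation of a uniform distribution on [a, b] is (b − a)/√12. A quick check in R:

```r
# Analytic SD of a uniform distribution on [0, 10]: (b - a) / sqrt(12)
sd_uniform <- (10 - 0) / sqrt(12)
round(sd_uniform, 2)        # 2.89, matching the simulation above

# Standard error of the mean for n = 2 observations (the two bubbles)
se <- sd_uniform / sqrt(2)
round(se, 2)                # 2.04
```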

To construct a 95% confidence interval, we have to multiply the sampling error by the critical t-value for a probability of .975, which leaves .025 for the error region. Multiplying this by 2 gives a two-tailed error probability of .025 × 2 = 5%.  That is, 5% of observations could be more extreme than the boundaries of the confidence interval just by chance alone.  With 1 degree of freedom, we get a critical value of 12.71.

n = 2; alpha = .05; qt(1-alpha/2,n-1)

The width of the CI is determined by the standard deviation and the sample size.  So, the information is sufficient to say that the 95%CI is the observed mean +/- 2.04m × 12.71 ≈ +/- 25.9m.
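The whole computation can be strung together in a few lines of R:

```r
# 95% CI half-width for the mean of two bubbles
n      <- 2
sd_pop <- 10 / sqrt(12)            # SD of a uniform(0, 10) distribution
se     <- sd_pop / sqrt(n)         # sampling error, ~2.04
t_crit <- qt(1 - .05 / 2, n - 1)   # critical t with 1 df, ~12.71
round(se * t_crit, 1)              # ~25.9 meters on each side of the mean
```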

Hopefully it is obvious that this 95%CI covers 100% of all possible values because the length of the submarine is limited to 10m.

In short, two data points provide very little information and make it impossible to say anything with confidence about the location of the hatch.  Even without these calculations we can say with 100% confidence that the hatch cannot be further from the mean of the two bubbles than 5 meters because the maximum distance is limited by the length of the submarine.

The submarine problem is also strange because the width of the confidence interval is irrelevant for the rescue operation. With just one rescue line, the optimal place to drop it is the mean of the two bubbles (see Figure; all intervals are centered on the same point).  So, the statisticians do not have to argue, because they all agree on where to drop the rescue line.

How is the Bayesian submarine supposed to sink confidence intervals? 

The rescuers first note that from observing two bubbles, it is easy to rule out all values except those within five meters of both bubbles because no bubble can occur further than 5 meters from the hatch.

Importantly, this only works for dependent variables with bounded values. For example, on an 11-point scale ranging from 0 to 10, it is obvious that no population mean can deviate from the middle of the scale (5) by more than 5 points.  Even then the bound is not very relevant, because the goal of research is not to find the middle of the scale but to estimate the actual population parameter, which could be anywhere between 0 and 10. Thus, the submarine example does not map onto any empirical problem of interval estimation.
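The rescuers' reasoning can be made concrete in a few lines of R. The bubble locations below are made up for illustration; any value within 5 meters of both bubbles remains possible, so the 100%-certain interval runs from max(y1, y2) − 5 to min(y1, y2) + 5:

```r
# Hypothetical bubble locations (the true hatch location theta is unknown)
y1 <- 3.5
y2 <- 6.0

# theta must lie within 5 m of BOTH bubbles
lower <- max(y1, y2) - 5   # 1.0
upper <- min(y1, y2) + 5   # 8.5

# The width of this 100%-certain interval shrinks as the bubbles spread apart
upper - lower              # 7.5, i.e., 10 - abs(y1 - y2)
```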

1. A procedure based on the sampling distribution of the mean
The first statistician suggests building a confidence procedure using the sampling distribution of the mean M. The sampling distribution of M has a known triangular distribution with θ as the mean. With this sampling distribution, there is a 50% probability that M will differ from θ by less than 5 − 5/√2, or about 1.46m.

This leads to the confidence procedure M ± (5 − 5/√2)

which we call the “sampling distribution” procedure. This procedure also has the familiar form ¯x ± C × SE, where here the standard error (that is, the standard deviation of the estimate M) is known to be 2.04.

It is important to note that the authors use a 50% CI.  In this special case, the width of the confidence interval is equivalent to the standard error because the standard error is multiplied by a critical t-value of 1 to determine the width of the confidence interval.

n = 2; alpha = .50; qt(1-alpha/2,n-1)

The choice of a 50%CI is also not typical in actual research settings. It is not clear why we should accept such a high error rate, especially when the survival of the crew members is at stake.  Imagine that the submarine had an emergency system that releases bubbles from the hatch, but the bubbles do not go straight to the surface. Yet there are hundreds of bubbles. Would we compute a 50% confidence interval, or would we want a 99% confidence interval to bring the rescue line as close to the hatch as possible?
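The 50% half-width of 5 − 5/√2 follows from the triangular sampling distribution of M, for which P(|M − θ| ≤ c) = 1 − (1 − c/5)². Solving this equation for c shows how quickly the interval widens at more conventional confidence levels:

```r
# Half-width of the "sampling distribution" interval at a given confidence
# level, derived from P(|M - theta| <= c) = 1 - (1 - c/5)^2
half_width <- function(conf) 5 * (1 - sqrt(1 - conf))

round(half_width(.50), 2)   # 1.46 m, the paper's 5 - 5/sqrt(2)
round(half_width(.95), 2)   # 3.88 m
round(half_width(.99), 2)   # 4.50 m
```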

We still haven’t seen how the Bayesian submarine sinks confidence intervals.  To make their case, the Bayesian soldiers compute several possible confidence intervals and show how they lead to different conclusions (see Figure). They suggest that this is a fundamental problem for confidence intervals.

It is clear, first of all, why the fundamental confidence fallacy is a fallacy. 

They are happy to join forces with the Fisherians in their attack on Neymanian confidence intervals, while they usually attack Fisher for his use of p-values.

As Fisher pointed out in the discussion of CI theory mentioned above, for any given problem — as for this one — there are many possible confidence procedures. These confidence procedures will lead to different confidence intervals. In the case of our submersible confidence procedures, all confidence intervals are centered around M, and so the intervals will be nested within one another.

If we mistakenly interpret these observed intervals as having a 50 % probability of containing the true value, a logical problem arises. 

However, shortly after the authors bring up this fundamental problem for confidence intervals, they mention that Neyman solved this logical problem.

There are, as far as we know, only two general strategies for eliminating the threat of contradiction from relevant subsets: Neyman’s strategy of avoiding any assignment of probabilities to particular intervals, and the Bayesian strategy of always conditioning on the observed data, to be discussed subsequently.

Importantly, Neyman’s solution to the problem does not support the Bayesians’ claim that he warned against making probabilistic statements based on confidence intervals. Instead, he argued that we should apply the long-run success rate to make probability judgments based on confidence intervals.  This use of the term probability can be illustrated with the submarine example. A simple simulation of the submarine problem shows that the 50% confidence interval contains the population parameter 50% of the time.
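Such a simulation takes only a few lines of R:

```r
# Long-run coverage of the 50% "sampling distribution" interval
set.seed(123)
theta   <- 0                             # true hatch location (arbitrary)
n_sims  <- 100000
bubble1 <- runif(n_sims, theta - 5, theta + 5)
bubble2 <- runif(n_sims, theta - 5, theta + 5)
m       <- (bubble1 + bubble2) / 2       # mean of the two bubbles
hw      <- 5 - 5 / sqrt(2)               # 50% half-width, ~1.46 m
mean(abs(m - theta) <= hw)               # ~0.50
```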

It is therefore reasonable to place relatively modest confidence in the belief that the hatch of the submarine is within the confidence interval.  To be more confident, it would be necessary to lower the error rate, but this makes the interval wider. The only way to be confident with a narrow interval is to collect more data.

Confidence intervals have exactly the properties that Neyman claimed they have, and there is no logical inconsistency in saying that we cannot quantify the probability of a singular event while using long-run outcomes of similar events to make claims about the probability of being right or wrong in a particular case.

Neyman compares this to gambling where it is impossible to say anything about the probability of a particular bet unless we know the long-run probability of similar bets. Researchers who use confidence intervals are no different from people who drive their cars with confidence because they never had an accident or who order a particular pizza because they ordered it many times before and liked it.  Without any other relevant information, the use of long-run frequencies to assign probabilities to individual events is not a fallacy.


So, does the Bayesian submarine sink the confidence interval ship?  Does the example show that interpreting confidence intervals as probabilities is a fallacy and a misinterpretation of Neyman?  I don’t think so.

The probability of winning a coin toss (with a fair coin) is 50%.  What is the probability that I win any specific game?  It is not defined.  It is 100% if I win and 0% if I don’t win. This is trivial, and Neyman made it clear that he was not using the term probability in this sense.  He also made it clear that he used the term probability to refer to the long-run proportion of correct decisions, and most people would feel very confident in their beliefs and decisions if the odds of winning were 95%.  Bayesians do not deny that 95% confidence intervals give the right answer 95% of the time. They just object to the phrase “There is a 95% probability that the confidence interval includes the population parameter” when a researcher uses a 95% confidence interval.  Similarly, they would object to somebody saying “There is a 99.9% chance that I am pregnant” when a pregnancy test with a 0.1% false positive rate shows a positive result.  The woman is either pregnant or she is not, but we don’t know this until she repeats the test several times or an ultrasound shows it.  As long as there is uncertainty about the actual truth, the long-run frequency of false positives quantifies the rational belief in being pregnant or not.

What should applied researchers do?  They should use confidence intervals with confidence.  If Bayesians want to argue with them, all they need to say is that they are using a procedure that has a 95% probability of giving the right answer and that it is not possible to say whether a particular result is one of the few errors.  The best way to address this question is not to argue about semantics, but to do a replication study.  And that is the good news.  While statisticians are busy fighting with each other, empirical scientists can make actual progress by collecting more informative data.

In conclusion, the submarine problem does not convince me for many reasons.  Most important, it is not even necessary to create any intervals to decide on the best action.  Absent any other information, the best bet is to drop the rescue line right in the middle of the two bubbles.  This is very fortunate for the submarine crew because otherwise the statisticians would still be arguing about the best course of action, while the submarine is running out of air.


The Fallacy of Placing Confidence in Bayesian Salvation

Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and Eric-Jan Wagenmakers (2016), henceforth psycho-Bayesians, have a clear goal.  They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.”   In this article, the authors claim that everybody, including textbook writers in statistics, misunderstood Neyman’s classic article on interval estimation.   What are the prior odds that, after 80 years, a group of psychologists discovered a fundamental flaw in the interpretation of confidence intervals (H1) versus that a few psychologists are either unable or unwilling to understand Neyman’s article (H2)?

Underlying this quest for change in statistical practices lies the ultimate attribution error: that Fisher’s p-values or Neyman-Pearson significance testing, with or without confidence intervals, are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature.  The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates and the long-run probabilities of false positives are not 5%, as stated in each article, but could be 100% (Rosenthal, 1979; Sterling, 1959; Simmons, Nelson, & Simonsohn, 2011).
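The inflation is easy to demonstrate. In the sketch below, a hypothetical researcher measures four independent outcome variables under the null hypothesis and reports a "success" whenever any of them reaches p < .05; the nominal 5% error rate roughly quadruples (the sample size and number of variables are illustrative assumptions, not taken from any particular study):

```r
# Undisclosed multiple testing: four independent DVs, report any p < .05
set.seed(1)
n_sims <- 5000
k      <- 4    # number of dependent variables tried per "study"
hit <- replicate(n_sims, {
  p <- replicate(k, t.test(rnorm(20), rnorm(20))$p.value)
  any(p < .05)                # "success" if any test is significant
})
mean(hit)                     # ~0.19 instead of the nominal .05
```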

You will not find a single article by the psycho-Bayesians that acknowledges the contribution of the unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics are fundamentally flawed and misunderstood when they are used in industry to produce smartphones and other products that require tight error control in mass production. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, but warned readers that confidence intervals provide no useful information after the data are collected.

Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”

This is utter nonsense. Of course, Neyman was asking us to interpret confidence intervals after we collected data, because we need a sample to compute a confidence interval. It is hard to believe that this could have passed peer review in a statistics journal, and it is not clear who was qualified to review this paper for Psychonomic Bullshit Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the statements that follow.  In fact, Neyman was arguing against Bayesian attempts to estimate probabilities that can be applied to a single event.

It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M  to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.  

The parameter M is an unknown constant and no probability statement concerning its value may be made, that is except for the hypothetical and trivial ones P{1 < M < 2}) = 1 if 1 < M < 2) or 0 if either M < 1 or 2 < M) ,  which we have decided not to consider. 

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem.  We don’t even need to observe data and compute a confidence interval.  The statement that a specific unknown number is, or is not, between two other numbers (1 and 2) is either TRUE (P = 1) or FALSE (P = 0).  To imply that this trivial observation leads to the conclusion that we cannot make post-data inferences based on confidence intervals is ridiculous.

Neyman continues.

The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “ successes ” in a long series of future “ games ” is approximately known. On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval.  Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval.  This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data.  That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations),  the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician?   The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics.  His methods are taught in hundreds of text books, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen).  What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of the psycho-statisticians’ deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals.  For example, they cite Jeffreys’s (1961) Theory of Probability, which gives the impression that Jeffreys’s work followed Neyman’s work. However, an accurate representation of Neyman’s work reveals that Jeffreys’s work preceded Neyman’s and that Neyman discussed some of the problems with Jeffreys’s approach in great detail.  Neyman’s critical article was even “communicated” by Jeffreys (these were different times, when scientists had open conflicts with honor and integrity and actually engaged in scientific debates).


Given that Jeffreys’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffreys’s approach. Neyman first gives a thorough account of Jeffreys’s approach (those were the days).


Neyman then offers his critique of Jeffreys’s approach.

It is known that, as far as we work with the conception of probability as adopted in this paper, the above theoretically perfect solution may be applied in practice only in quite exceptional cases, and this for two reasons.

Importantly, he does not challenge the theory.  He only points out that the theory is not practical because it requires knowledge that is often not available.  That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters.   This is the problem that has plagued subjective Bayesian approaches.

Neyman then discusses Jeffreys’s approach to solving this problem.  I am not claiming to be a statistical expert who can decide whether Neyman or Jeffreys is right. Even statisticians have been unable to resolve these issues, and I believe the consensus is that Bayesian credibility intervals and Neyman’s confidence intervals are both mathematically viable approaches to interval estimation with different strengths and weaknesses.


I am only trying to point out to unassuming readers of the fallacy article that both approaches are as old as modern statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity.   Using a selective quote by Neyman to dismiss confidence intervals and then omitting Neyman’s critique of Bayesian credibility intervals is deceptive and shows an unwillingness or inability to engage in open scientific examination of the arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers’ effort to convert psychologists into Bayesian statisticians is similar to Bem’s (2011) attempt to convert psychologists into believers in parapsychology, or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers is misrepresenting the way classic statistics works and ignoring the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show.  Wagenmakers used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception.  However, a rebuttal by Bem showed that Bayesian statistics also supported extra-sensory perception with different and more reasonable priors.  Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem’s article is not the way he “analyzed” the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data.  It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification.   The problem with Bem’s data has been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2018). There has been no rebuttal from Bem, and he admits to the use of practices that invalidate the published p-values.  So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics.  The problem is the abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are rarely used. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).

In conclusion, it is not a fallacy to interpret confidence intervals as a method for interval estimation of unknown population parameters. It would be a fallacy to cite Morey et al.’s article as a valid criticism of confidence intervals.  This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman’s or Jeffreys’s approach to interval estimation.

P.S.  More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.

I have to admit some Schadenfreude when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior.  While Bayesians are still fighting over the right priors, practical researchers may be better off using statistical methods that do not require priors, like, hm, confidence intervals?

P.P.S.  Science requires trust.  At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.











A Clarification of P-Curve Results: The Presence of Evidence Does Not Imply the Absence of Questionable Research Practices

This post is not a criticism of p-curve.  The p-curve authors have been very clear in their writing that p-curve is not designed to detect publication bias.  However, numerous articles make the surprising claim that they used p-curve to test publication bias.  The purpose of this post is to simply correct a misunderstanding of p-curve.

Questionable Research Practices and Excessive Significance

Sterling (1959) pointed out that psychology journals have a surprisingly high success rate. Over 90% of articles reported statistically significant results in support of authors’ predictions.  This success rate would be surprising even if most predictions in psychology were true.  The reason is that the results of a study are not only influenced by cause-effect relationships.  Another factor that influences the outcome of a study is sampling error.  Even if researchers are nearly always right in their predictions, some studies will fail to provide sufficient evidence for the predicted effect because sampling error makes it impossible to detect the effect.  The probability that a study detects a true effect is called power.  Just as bigger telescopes are needed to detect more distant stars with a weaker signal, bigger sample sizes are needed to detect small effects (Cohen, 1962, 1988).  Sterling et al. (1995) pointed out that the typical power of studies in psychology does not justify the high success rate in psychology journals.  In other words, the success rate was too good to be true.  This means that published articles are selected for significance.
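To make the power problem concrete, here is an illustration with base R's power.t.test, assuming a two-group design with 20 participants per cell and a medium effect size (d = .5); these values are assumptions chosen to reflect the kind of studies Cohen (1962) surveyed, not figures from any specific paper:

```r
# Power of a two-sample t-test with n = 20 per group and d = .5
pwr <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = .05)$power
round(pwr, 2)   # ~0.34: far too low to justify a >90% success rate
```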

The bias in favor of significant results is typically called publication bias (Rosenthal, 1979).  However, the term publication bias does not explain the discrepancy between estimates of statistical power and success rates in psychology journals.  John et al. (2012) listed a number of questionable research practices that can inflate the percentage of significant results in published articles.

One mechanism is simply not to report non-significant results.  Rosenthal (1979) suggested that non-significant results end up in the proverbial file drawer.  That is, a whole data set remains unpublished.  The other possibility is that researchers use multiple exploratory analyses to find a significant result and do not disclose their fishing expedition.  These practices are now widely known as p-hacking.

Unlike John et al. (2012), the p-curve authors draw a sharp distinction between not disclosing an entire dataset (publication bias) and not disclosing all statistical analyses of a dataset (p-hacking).

QRP = Publication Bias + P-Hacking

We Don’t Need Tests of Publication Bias

The p-curve authors assume that publication bias is unavoidable.

“Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results.”  (Simonsohn, Nelson, Simmons, 2014).

“By the Way, of Course There is Publication Bias. Virtually all published studies are significant (see, e.g., Fanelli, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995), and most studies are underpowered (see, e.g., Cohen, 1962). It follows that a considerable number of unpublished failed studies must exist. With this knowledge already in hand, testing for publication bias on paper after paper makes little sense” (Simonsohn, 2012, p. 597).

“Yes, p-curve ignores p>.05 because it acknowledges that we observe an unknowably small and non-random subset of p-values >.05.”  (personal email, January 18, 2015).

I hope these quotes make it crystal clear that p-curve is not designed to examine publication bias: the authors assume that selection for significance is unavoidable.  From this perspective, any statistical test that reveals no evidence of publication bias is a false negative; the sample size was simply not large enough to detect the bias.

Another concern by Uri Simonsohn is that bias tests may reveal statistically significant bias that has no practical consequences.

Consider a literature with 100 studies, all with p < .05, but where the implied statistical power is “just” 97%. Three expected failed studies are missing. The test from the critiques would conclude there is statistically significant publication bias; its magnitude, however, is trivial. (Simonsohn, 2012, p. 598).

k.sig = 100; k.studies = 100; power = .97; pbinom(k.studies-k.sig, k.studies, 1-power) = 0.048

This is a valid criticism that applies to all p-values.  A p-value only provides information about the contribution of random sampling error.  A p-value of .048 suggests that it is unlikely to observe only significant results, even if all 100 studies had 97% power to show a significant result.  However, with 97% observed power, the 100 studies provide credible evidence for an effect, and even the inflation of the average effect size is minimal.

A different conclusion would follow from a p-value less than .05 in a set of 7 studies that all show significant results.

k.sig = 7; k.studies = 7; power = .62; pbinom(k.studies-k.sig, k.studies, 1-power) = 0.035

Rather than showing small bias with a large set of studies, this finding shows large bias with a small set of studies.  P-values do not distinguish between these two scenarios. Both outcomes are equally unlikely.  Thus, information about the probability of an event should always be interpreted in the context of the effect.  The effect size is simply the difference between the expected and observed rate of significant results.  In Simonsohn’s example, the effect size is small (1 – .97 = .03).  In the second example, the discrepancy is large (1 – .62 = .38).
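The two binomial probabilities can be checked without R: when all reported studies are significant, the probability of observing zero non-significant results is simply power raised to the number of studies (a quick sketch in Python):

```python
# P(0 non-significant results out of k studies) = power ** k
p_large = 0.97 ** 100   # 100 studies, 97% power: small bias, large k
p_small = 0.62 ** 7     # 7 studies, 62% power: large bias, small k

print(round(p_large, 3))  # 0.048
print(round(p_small, 3))  # 0.035
```

Both p-values are nearly identical, even though the discrepancy between expected and observed success rates (.03 vs. .38) differs by an order of magnitude.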

The previous scenarios assume that only significant results are reported. However, in sciences that use preregistration to reduce deceptive publishing practices (e.g., medicine), non-significant results are more common.  When non-significant results are reported, bias tests can be used to assess the extent of bias.

For example, a literature may report 10 studies with only 4 significant results and a median observed power of 30%.  In this case, the bias is small (.40 - .30 = .10) and a conventional meta-analysis would produce only slightly inflated estimates of the average effect size.  In contrast, p-curve would discard over 50% of the studies because it assumes that the non-significant results are not trustworthy.  This is an unnecessary loss of information that could be avoided by testing for publication bias.

In short, p-curve assumes that publication bias is unavoidable. Hence, tests of publication bias are unnecessary and non-significant results should always be discarded.

Why Do P-Curve Users Think P-Curve is a Publication Bias Test?

Example 1

I conducted a literature search on studies that used p-curve and was surprised by numerous claims that p-curve is a test of publication bias.

Simonsohn, Nelson, and Simmons (2014a, 2014b, 2016) and Simonsohn, Simmons, and Nelson (2015) introduced pcurve as a method for identifying publication bias (Steiger & Kühberger, 2018, p. 48).   

However, the authors do not explain how p-curve detects publication bias. Later on, they correctly point out that p-curve is a method that can correct for publication bias.

P-curve is a good method to correct for publication bias, but it has drawbacks. (Steiger & Kühberger, 2018, p. 48).   

Thus, the authors seem to confuse detection of publication bias with correction for publication bias.  P-curve corrects for publication bias, but it does not detect publication bias; it assumes that publication bias is present and a correction is necessary.

Example 2

An article in the medical journal JAMA Psychiatry also claimed that they used p-curve and other methods to assess publication bias.

Publication bias was assessed across all regions simultaneously by visual inspection of funnel plots of SEs against regional residuals and by using the excess significance test,  the P-curve method, and a multivariate analogue of the Egger regression test (Bruger & Howes, 2018, p. 1106).  

After reporting the results of several bias tests, the authors report the p-curve results.

P-curve analysis indicated evidential value for all measures (Bruger & Howes, 2018, p. 1106).

The authors seem to confuse presence of evidential value with absence of publication bias. As discussed above,  publication bias can be present even if studies have evidential value.

Example 3

To assess publication bias, we considered multiple indices. Specifically, we evaluated Duval and Tweedie’s Trim and Fill Test, Egger’s Regression Test, Begg and Mazumdar Rank Correlation Test, Classic Fail-Safe N, Orwin’s Fail-Safe N, funnel plot symmetry, P-Curve Tests for Right-Skewness, and Likelihood Ratio Test of Vevea and Hedges Weight-Function Model.

As in the previous example, the authors confuse evidence for evidential value (a significant right-skewed p-curve) with evidence for the absence of publication bias.

Example 4

The next example even claims that p-curve can be used to quantify the presence of bias.

Publication bias was investigated using funnel plots and the Egger regression asymmetry test. Both the trim and fill technique (Duval & Tweedie, 2000) and p-curve (Simonsohn, Nelson, & Simmons, 2014a, 2014b) technique were used to quantify the presence of bias (Korrel et al., 2017, p. 642).

The actual results section only reports that the p-curve is right skewed.

The p-curve for the remaining nine studies (p < .025) was significantly right skewed (binomial test: p = .002; continuous test full curve: Z = -9.94, p < .0001, and half curve Z = -9.01, p < .0001) (Korrel et al., 2017, p. 642)

These results do not assess or quantify publication bias.  One might consider the reported z-scores a quantitative measure of evidential value, as larger z-scores are less probable under the nil-hypothesis that all significant results are false positives. Nevertheless, strong evidential value (e.g., 100 studies with 97% power) does not imply that publication bias is absent, nor does it mean that publication bias is small.

A set of 1000 studies with 10% power is expected to produce 900 non-significant results and 100 significant results.  Removing the non-significant results produces large publication bias, but a p-curve analysis shows strong evidence against the nil-hypothesis that all studies are false positives.

Z = rnorm(1000, qnorm(.10, 1.96))  # 10% power: P(Z > 1.96) = .10
Z.sig = Z[Z > 1.96]                # select the ~100 significant results
Stouffer.Z = sum(Z.sig - 1.96)/sqrt(length(Z.sig))
Stouffer.Z  # approximately 4.89

The reason is that p-curve is a meta-analysis, and the results depend on the strength of evidence in individual studies and the number of studies.  Strong evidence can be the result of many studies with weak evidence or a few studies with strong evidence.  Thus, p-curve is a meta-analytic method that combines information from several small studies to draw inferences about a population parameter.  The main difference from older meta-analytic methods is that older methods assumed that publication bias is absent, whereas p-curve assumes that publication bias is present. Neither approach assesses whether publication bias is present, nor does it quantify the amount of publication bias.

Example 5

Sala and Gobet (2017) explicitly make the mistake of equating evidence for evidential value with evidence against publication bias.

Finally, a p-curve analysis was run with all the p values < .05 related to positive effect sizes (Simonsohn, Nelson, & Simmons, 2014). The results showed evidential values (i.e., no evidence of publication bias), Z(9) = -3.39, p = .003.  (p. 676).

As discussed in detail before, this is not a valid inference.

Example 6

Ironically, the interpretation of p-curve results as evidence that there is no publication bias contradicts the fundamental assumption of p-curve that publication bias is always present.

The danger is that misuse of p-curve as a test of publication bias may give the false impression that psychological scientists are reporting their results honestly, while actual bias tests show that this is not the case.

It is therefore problematic if authors in high impact journals (not necessarily high quality journals) claim that they found evidence for the absence of publication bias based on a p-curve analysis.

To check whether this research field suffers from publication bias, we conducted p-curve analyses (Simonsohn, Nelson, & Simmons, 2014a, 2014b) on the most extended data set of the current meta-analysis (i.e., psychosocial correlates of the dark triad traits), using an on-line application. As can be seen in Figure 2, for each of the dark triad traits, we found an extremely right-skewed p-curve, with statistical tests indicating that the studies included in our meta-analysis, indeed, contained evidential value (all ps < .001) and did not point in the direction of inadequate evidential value (all ps non-significant). Thus, it is unlikely that the dark triad literature is affected by publication bias (Muris, Merckelbach, Otgaar, & Meijer, 2017).

Once more, presence of evidential value does not imply absence of publication bias!

Evidence of P-Hacking  

Publication bias is not the only reason for the high success rates in psychology.  P-hacking will also produce more significant results than the actual power of studies warrants. In fact, the whole purpose of p-hacking is to turn non-significant results into significant ones.  Most bias tests do not distinguish between publication bias and p-hacking as causes of bias.  However, the p-curve authors make this distinction and claim that p-curve can be used to detect p-hacking.

Apparently, we should not assume that p-hacking is just as prevalent as publication bias; if we did, testing for p-hacking would be just as irrelevant as testing for publication bias.

The problem is that it is a lot harder to distinguish p-hacking from publication bias than the p-curve authors imply, and their p-curve test of p-hacking works only under very limited conditions.  Most of the time, the p-curve test of p-hacking will fail to provide evidence for p-hacking, and this result can be misinterpreted as evidence that results were obtained without p-hacking, which is a logical fallacy.

This mistake was made by Winternitz, Abbate, Huchard, Havlicek, & Gramszegi (2017).

Fourth and finally, as bias for publications with significant results can rely more on the P-value than on the effect size, we used the Pcurve method to test whether the distribution of significant P-values, the ‘P-curve’, indicates that our studies have evidential value and are free from ‘p-hacking’ (Simonsohn et al. 2014a, b).

The problem is that the p-curve test of p-hacking only works when evidential value is very low and only for some specific forms of p-hacking. For example, researchers can p-hack by testing many dependent variables. Selecting significant dependent variables is no different from running many studies with a single dependent variable and selecting entire studies with significant results; it is just more efficient.  In this case, the p-curve would not show the left skew that is considered diagnostic of p-hacking.

Even a flat p-curve would merely show lack of evidential value, and it would be wrong to conclude that p-hacking was not used.  To demonstrate this, I submitted the results from Bem's (2011) infamous "feeling the future" article to a p-curve analysis.

[Figure: p-curve of Bem's (2011) results]

The p-curve analysis shows a flat p-curve.  Under the assumption that questionable research practices were used to produce 9 out of 10 significant (p < .05, one-tailed) results, this shows a lack of evidential value.  However, if we relied on a left-skewed p-curve as evidence for p-hacking, we would find no evidence that the results were p-hacked.
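A small simulation illustrates why this form of p-hacking escapes detection. When all effects are null and researchers report whichever dependent variables happen to be significant, the selected p-values are uniform between 0 and .05: a flat p-curve, not a left-skewed one (a sketch under the assumption of two-sided z-tests and five null dependent variables per study):

```python
import math
import random

def p_two_sided(z):
    # two-sided p-value for a z-statistic
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
selected = []
for study in range(20000):
    # five null dependent variables per study; report every significant one
    pvals = [p_two_sided(random.gauss(0, 1)) for _ in range(5)]
    selected.extend(p for p in pvals if p < 0.05)

# Under the null, significant p-values are uniform on (0, .05), so the
# p-curve is flat; a left-skewed curve would need an excess near .05.
frac_below_025 = sum(p < 0.025 for p in selected) / len(selected)
print(round(frac_below_025, 2))  # close to 0.5: flat, not left-skewed
```

The selected results look exactly like honestly reported null results that survived publication bias, so the left-skew test has nothing to detect.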

One possibility would be that Bem did not p-hack his studies. However, this would imply that he ran 20 studies for each significant result. With sample sizes of 100 participants per study, this would imply that he tested 20,000 participants.  This seems unrealistic, and Bem states that he reported all studies that were conducted.  Moreover, analyses of the raw data showed peculiar patterns that suggest some form of p-hacking was used.  Thus, this example shows that p-curve is not very effective in revealing p-hacking.

It is also interesting that the latest version of p-curve, p-curve 4.06, no longer tests for left-skewness of distributions and does not mention p-hacking.  This change suggests that the authors realized the ineffectiveness of p-curve in detecting p-hacking (I did not ask the authors for comments, but they are welcome to comment here or elsewhere on this change in their app).

It is problematic if meta-analysts assume that p-curve can reveal p-hacking and infer from a flat or right-skewed p-curve that the data are not p-hacked.  This inference is not warranted because absence of evidence is not the same as evidence of absence.


Conclusion

P-curve is a family of statistical tests for meta-analyses of sets of studies.  One version is an effect size meta-analysis; others test the nil-hypothesis that the population effect size is zero.  The novel feature of p-curve is the assumption that questionable research practices undermine the validity of traditional meta-analyses, which assume no selection for significance. To correct for the assumed bias, observed test statistics are adjusted for selection (i.e., p-values between 0 and .05 are multiplied by 20 to produce values between 0 and 1 that can be analyzed like unbiased p-values).  Just like a regular meta-analysis, the main result of a p-curve analysis is a combined test statistic or effect size estimate that can be used to test the nil-hypothesis.  If the nil-hypothesis can be rejected, the p-curve analysis suggests that some effect was observed.  Effect size p-curve also provides an effect size estimate for the set of studies that produced significant results.

Just like regular meta-analyses, p-curve is not a bias test. It does not test whether publication bias exists, and it fails as a test of p-hacking under most circumstances. Unfortunately, users of p-curve seem to be confused about its purpose or make the logical mistake of inferring from the presence of evidence that questionable research practices (publication bias, p-hacking) are absent. This is a fallacy.  To examine the presence of publication bias, researchers should use existing and validated bias tests.

An Even Better P-curve

It is my pleasure to post the first guest post on the R-Index blog.  The blog post is written by my colleague and partner in “crime”-detection, Jerry Brunner.  I hope we will see many more guest posts by Jerry in the future.


Jerry Brunner
Department of Statistical Sciences
University of Toronto

First, my thanks to the mysterious Dr. R for the opportunity to do this guest post. At issue are the estimates of population mean power produced by the online p-curve app; the current version is 4.06. As the p-curve team (Simmons, Nelson, and Simonsohn) observe in their blog post entitled "P-curve handles heterogeneity just fine", the app does well on average as long as there is not too much heterogeneity in power. They show in one of their examples that it can over-estimate mean power when there is substantial heterogeneity.

Heterogeneity in power is produced by heterogeneity in effect size and heterogeneity in sample size. In the simulations reported there, sample size varies over a fairly narrow range, as one might expect from a meta-analysis of small-sample studies. What if we wanted to estimate mean power for sets of studies with large heterogeneity in sample sizes, or for an entire discipline, or sub-areas, or journals, or psychology departments? Sample size would be much more variable.

This post gives an example in which the p-curve app consistently over-estimates population mean power under realistic heterogeneity in sample size. To demonstrate that heterogeneity in sample size alone is a problem for the online pcurve app, population effect size was held constant.

In 2016, Brunner and Schimmack developed an alternative p-curve method (p-curve 2.1), which performs much better than the online app p-curve 4.06. P-curve 2.1 is fully documented and evaluated in Brunner and Schimmack (2018). This is the most recent version of the notorious and often-rejected paper; it has been re-written once again and submitted to Meta-Psychology. It will shortly be posted during the open review process, but in the meantime I have put a copy on my website.

P-curve 2.1 is based on Simonsohn, Nelson and Simmons’ (2014) p-curve estimate of effect size. It is designed specifically for the situation where there is heterogeneity in sample size, but just a single fixed effect size. P-curve 2.1 is a simple, almost trivial application of p-curve 2.0. It first uses the p-curve 2.0 method to estimate a common effect size. It then combines that estimated effect size and the observed sample sizes to calculate an estimated power for each significance test in the sample. The sample mean of the estimated power values is the p-curve 2.1 estimate.
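The second step of p-curve 2.1 is easy to sketch. Given the common effect size from step one (the p-curve 2.0 estimation machinery, which is not reproduced here) and the observed sample sizes, one computes the power of each test and averages. The sketch below is hypothetical and uses two-sided z-tests for simplicity; the actual functions handle chi-squared and F statistics:

```python
import math

def normal_cdf(x):
    # standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2))

def pcurve21_mean_power(effect_size, sample_sizes, z_crit=1.959964):
    """Average power across studies for a common standardized effect size."""
    powers = []
    for n in sample_sizes:
        ncp = effect_size * math.sqrt(n)  # noncentrality of the z-test
        power = 1 - normal_cdf(z_crit - ncp) + normal_cdf(-z_crit - ncp)
        powers.append(power)
    return sum(powers) / len(powers)

# with a zero effect, "power" reduces to the alpha level
print(round(pcurve21_mean_power(0.0, [50, 100, 200]), 3))  # 0.05
```

Because power is computed separately for each sample size, heterogeneity in sample size is handled directly rather than being absorbed into a single aggregate estimate.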

One of the virtues of p-curve is that it allows for publication bias, using only significant test statistics as input. The population mean power being estimated is the mean power of the sub-population of tests that happened to be significant. To compare the performance of p-curve 4.06 to p-curve 2.1, I simulated samples of significant test statistics with a single effect size, and realistic heterogeneity in sample size.

Here’s how I arrived at the “realistic” sample sizes. In another project, Uli Schimmack had harvested a large number of t and F statistics from the journal Psychological Science for the years 2001-2015. I used N = df + 2 to calculate implied total sample sizes. I then eliminated all sample sizes less than 20 and greater than 500, and randomly sampled 5,000 of the remaining numbers. These 5,000 numbers will be called the “Psychological Science urn.” They are available online and can be read directly into R with the scan function.

The numbers in the Psychological Science urn are not exactly sample sizes and they are not a true random sample. In particular, truncating the distribution at 500 makes them less heterogeneous than real sample sizes, since web surveys with enormous sample sizes are eliminated. Still, I believe the numbers in the Psychological Science urn may be fairly reflective of the sample sizes in psychology journals. Certainly, they are better than anything I would be able to make up. Figure 1 shows a histogram, which is right skewed as one might expect.


By sampling with replacement from the Psychological Science urn, one could obtain a random sample of sample sizes, similar to sampling without replacement from a very large population of studies. However, that’s not what I did. Selection for significance tends to select larger sample sizes, because tests based on smaller sample sizes have lower power and so are less likely to be significant. The numbers in the Psychological Science urn come from studies that passed the filter of publication bias. It is the distribution of sample size after selection for significance that should match Figure 1.

To take care of this issue, I constructed a distribution of sample size before selection and chose an effect size that yielded (a) population mean power after selection equal to 0.50, and (b) a population distribution of sample size after selection that exactly matched the relative frequencies in the Psychological Science urn. The fixed effect size, in the metric of Cohen (1988, p. 216), was w = 0.108812. This is roughly Cohen's "small" value of w = 0.10. If you have done any simulations involving literal selection for significance, you will realize that getting the numbers to come out just right by trial and error would be nearly impossible. I got the job done by using a theoretical result from Brunner and Schimmack (2018). Details are given at the end of this post, after the results.

I based the simulations on k = 1,000 significant chi-squared tests with 5 degrees of freedom. This large value of k (the number of studies, or significance tests on which the estimates are based) means that estimates should be very accurate. To calculate the estimates for p-curve 4.06, it was easy enough to get R to write input suitable for pasting into the online app. For p-curve 2.1, I used the function heteroNpcurveCHI, part of a collection developed for the Brunner and Schimmack paper. The code for all the functions is available online; within R, the functions can be defined with the source function. Then, to see a list of functions, type functions() at the R prompt.

Recall that population mean power after selection is 0.50. The first time I ran the simulation, the p-curve 4.06 estimate was 0.64, with a 95% confidence interval from 0.61 to 0.66. The p-curve 2.1 estimate was 0.501. Was this a fluke? The results of five more independent runs are given in the table below. Again, the true value of mean power after selection for significance is 0.50.

P-curve 2.1   P-curve 4.06   P-curve 4.06 95% Confidence Interval
0.510         0.64           0.61 - 0.67
0.497         0.62           0.59 - 0.65
0.502         0.62           0.59 - 0.65
0.509         0.64           0.61 - 0.67
0.487         0.61           0.57 - 0.64

It is clear that the p-curve 4.06 estimates are consistently too high, while p-curve 2.1 is on the money. One could argue that an error of around twelve percentage points is not too bad (really?), but certainly an error of one percentage point is better. Also, eliminating sample sizes greater than 500 substantially reduced the heterogeneity in sample size. If I had left the huge sample sizes in, the p-curve 4.06 estimates would have been ridiculously high.

Why did p-curve 4.06 fail? The answer is that even with complete homogeneity in effect size, the Psychological Science urn was heterogeneous enough to produce substantial heterogeneity in power. Figure 2 is a histogram of the true (not estimated) power values.


Figure 2 shows that even under homogeneity in effect size, a sample size distribution matching the Psychological Science urn can produce substantial heterogeneity in power, with a mode near one even though the mean is 0.50. In this situation, p-curve 4.06 fails. P-curve 2.1 is clearly preferable, because it specifically allows for heterogeneity in sample size.

Of course, p-curve 2.1 does assume homogeneity in effect size. What happens when effect size is heterogeneous too? The paper by Brunner and Schimmack (2018) contains a set of large-scale simulation studies comparing estimates of population mean power from p-curve, p-uniform, maximum likelihood, and z-curve, a new method dreamed up by Schimmack. The p-uniform method is based on van Assen, van Aert, and Wicherts (2014), extended to power estimation as in p-curve 2.1. The p-curve method we consider in the paper is p-curve 2.1. It does okay as long as heterogeneity in effect size is modest. Other methods may be better, though. To summarize, maximum likelihood is most accurate when its assumptions about the distribution of effect size are satisfied or approximately satisfied. When effect size is heterogeneous and the assumptions of maximum likelihood are not satisfied, z-curve does best.

I would not presume to tell the p-curve team what to do, but I think they should replace p-curve 4.06 with something like p-curve 2.1. They are free to use my heteroNpcurveCHI and heteroNpcurveF functions if they wish. A reference to Brunner and Schimmack (2018) would be appreciated.

Details about the simulations

Before selection for significance, there is a bivariate distribution of sample size and effect size. This distribution is affected by the selection process, because tests with higher effect size or sample size (or especially, both) are more likely to be significant. The question is, exactly how does selection affect the joint distribution? The answer is in Brunner and Schimmack (2018). This paper is not just a set of simulation studies. It also has a set of “Principles” relating the population distribution of power before selection to its distribution after selection. The principles are actually theorems, but I did not want it to sound too mathematical. Anyway, Principle 6 says that to get the probability of a (sample size, effect size) pair after selection, take the probability before selection, multiply by the power calculated from that pair, and divide by the population mean power before selection.

In the setting we are considering here, there is just a single effect size, so it’s even simpler. The probability of a (sample size, effect size) pair is just the probability of the sample size. Also, we know the probability distribution of sample size after selection. It’s the relative frequencies of the Psychological Science urn. Solving for the probability of sample size before selection yields this rule: the probability of sample size before selection equals the probability of sample size after selection, divided by the power for that sample size, and multiplied by population mean power before selection.
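This reweighting rule is easy to verify on a toy example (a sketch with made-up numbers, not the Psychological Science urn):

```python
# After-selection distribution of sample size, and power at each n (made up):
p_after = {20: 0.5, 200: 0.5}
power = {20: 0.2, 200: 0.8}

# Invert Principle 6: before-selection probability is proportional to the
# after-selection probability divided by power.
unnorm = {n: p_after[n] / power[n] for n in p_after}
total = sum(unnorm.values())
p_before = {n: unnorm[n] / total for n in unnorm}
print(p_before)  # {20: 0.8, 200: 0.2}: small studies dominate before selection

# Forward check: applying Principle 6 to p_before recovers p_after.
mean_power_before = sum(p_before[n] * power[n] for n in p_before)
recovered = {n: p_before[n] * power[n] / mean_power_before for n in p_before}
print(recovered)  # {20: 0.5, 200: 0.5}
```

The same inversion, applied to every sample size in the urn, is what the R code below does with the vector Pn.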

This formula will work for any fixed effect size. That is, for any fixed effect size, there is a probability distribution of sample size before selection that makes the distribution of sample size after selection exactly match the Psychological Science frequencies in Figure 1. Effect size can be anything. So, choose the effect size that makes expected (that is, population mean) power after selection equal to some nice value like 0.50.

Here’s the R code. First, we read the Psychological Science urn and make a table of probabilities.


options(scipen=999) # To avoid scientific notation

source(""); functions()

PsychScience = scan("")

hist(PsychScience, xlab='Sample size',breaks=100, main = 'Figure 1: The Psychological Science Urn')

# A handier urn, for some purposes

nvals = sort(unique(PsychScience)) # There are 397 rather than 8000 values

nprobs = table(PsychScience)/sum(table(PsychScience))

# sum(nvals*nprobs) = 81.8606 = mean(PsychScience)

For any given effect size, the frequencies from the Psychological Science urn can be used to calculate expected power after selection. Minimizing the (squared) difference between this value and the desired mean power yields the required effect size.

# Minimize this function to find effect size giving desired power 

# after selection for significance.

fun = function(es,wantpow,dfreedom)
    {
    alpha = 0.05; cv = qchisq(1-alpha,dfreedom)
    epow = sum( (1-pchisq(cv,df=dfreedom,ncp=nvals*es))*nprobs )
    # cat("es = ",es," Expected power = ",epow,"\n")
    (epow-wantpow)^2 # Squared distance from the desired mean power
    } # End of all the fun

# Find needed effect size for chi-square with df=5 and desired 

# population mean power AFTER selection.

popmeanpower = 0.5 # Change this value if you wish

EffectSize = nlminb(start=0.01, objective=fun,lower=0,df=5,wantpow=popmeanpower)$par

EffectSize # 0.108812

Calculate the probability distribution of sample size before selection.

# The distribution of sample size before selection is proportional to the

# distribution after selection divided by power, term by term.

crit = qchisq(0.95,5)

powvals = 1-pchisq(crit,5,ncp=nvals*EffectSize)

Pn = nprobs/powvals 

EG = 1/sum(Pn)

cat("Expected power before selection = ",EG,"\n")

Pn = Pn*EG # Probability distribution of n before selection

Generate test statistics before selection.

nsim = 50000 # Initial number of simulated statistics. This is over-kill. Change the value if you wish.


# For repeated simulations, execute the rest of the code repeatedly.

nbefore = sample(nvals,size=nsim,replace=TRUE,prob=Pn)

ncpbefore = nbefore*EffectSize

powbefore = 1-pchisq(crit,5,ncp=ncpbefore)

Ybefore = rchisq(nsim,5,ncp=ncpbefore)

Select for significance.

sigY = Ybefore[Ybefore>crit]

sigN = nbefore[Ybefore>crit]

sigPOW = 1-pchisq(crit,5,ncp=sigN*EffectSize)

hist(sigPOW, xlab='Power',breaks=100,freq=F ,main = 'Figure 2: Power After Selection for Significance')

Estimate mean power both ways.

# Two estimates of expected power before selection

c( length(sigY)/nsim , mean(powbefore) ) 

c(popmeanpower, mean(sigPOW)) # Golden


k = 1000 # Select 1,000 significant results.

Y = sigY[1:k]; n = sigN[1:k]; TruePower = sigPOW[1:k]

# Estimate with p-curve 2.1

heteroNpcurveCHI(Y=Y,dfree=5,nn=n) # 0.5058606 the first time.

# Write out chi-squared statistics for pasting into the online app

for(j in 1:k) cat("chi2(5) =",Y[j],"\n")


Brunner, J., & Schimmack, U. (2018). Estimating population mean power under conditions of heterogeneity and selection for significance. Under review.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Edition), Hillsdale, New Jersey: Erlbaum.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681.

van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293-309.


Lies, Damn Lies, and Abnormal Psychological Science (APS)

Everybody knows the saying "Lies, damn lies, and statistics."  But it is not the statistics; it is the ab/users of statistics who are distorting the truth.  The Association for Psychological Science (APS) is trying to hide the truth that experimental psychologists are not using scientific methods in the way they are supposed to be used.  These abnormal practices are known as questionable research practices (QRPs).  Surveys show that researchers are aware that these practices have negative consequences, but they also show that these practices are being used because they can advance researchers' careers (John et al., 2012).  Before 2011, it was also no secret that these practices were used, and psychologists might even brag about the use of QRPs to get results ("it took me 20 attempts to find this significant result").

However, some scandals in social psychology (Stapel, Bem) changed the perception of these practices.  Hiding studies, removing outliers selectively, or not disclosing dependent variables that failed to show the predicted result was no longer something anybody would admit doing in public (except a few people who paid dearly for it; e.g. Wansink).

Unfortunately for abnormal psychological scientists, some researchers, including myself, have developed statistical methods that can reveal the use of questionable research practices and applications of these methods show the use of QRPs in numerous articles (Greg Francis; Schimmack, 2012).  Francis (2014) showed that 80% or more of articles in the flagship journal of APS used QRPs to report successful studies.  He was actually invited by the editor of Psychological Science to audit the journal, but when he submitted the results of his audit for publication, the manuscript was rejected. Apparently, it was not significant enough to tell readers of Psychological Science that most of the published articles in Psychological Science are based on abnormal psychological science.  Fortunately, the results were published in another peer-reviewed journal.

Another major embarrassment for APS was the result of a major replication project of studies published in Psychological Science, the main APS journal, as well as two APA (American Psychological Association) journals (Open Science Collaboration, 2015).  The results showed that only 36% of significant results in original articles could be replicated. The “success rate” for social psychology was even lower, at 25%.  The main response to this stunning failure rate has been attempts to discredit the replication studies or to normalize replication failures as a normal outcome of science.

In several blog posts and manuscripts, I have pointed out that the failure rate of social psychology is not the result of normal science.  Instead, replication failures are the result of abnormal scientific practices in which researchers use QRPs to produce significant results.  My colleague Jerry Brunner and I developed a statistical method, z-curve, that reveals this fact. We have tried to publish our statistical method in an APA journal (Psychological Methods) and the APS journal Perspectives on Psychological Science, where it was desk-rejected by Sternberg, who needed journal space to publish his own editorials [he resigned after a revolt from APS members, including former editor Bobbie Spellman].

Each time, our manuscript was rejected without any criticism of our statistical method.  The reason given was that it was not interesting to estimate the replicability of psychological science.  This argument makes little sense because the OSC reproducibility article from 2015 has already been cited over 500 times in peer-reviewed journals (Web of Science).

The argument that our work is not interesting is further undermined by a recent article published in the new APS journal Advances in Methods and Practices in Psychological Science with the title “The Prior Odds of Testing a True Effect in Cognitive and Social Psychology.”  The article was accepted by the main editor, Daniel J. Simons, who also rejected our article as irrelevant (see rejection letter).  Ironically, the article presents very similar analyses of the OSC data and required a method that could estimate average power, but the authors used an ad-hoc approach to do so.  The article even cites our pre-print, but the authors did not contact us or run the R code that we shared to estimate average power.  This behavior is like eyeballing a scatter plot rather than using a formula to quantify the correlation between two variables.  It is contradictory to claim that our method is not useful and then accept a paper that could have benefited from using it.

Why would an editor reject a paper that provides an estimation method for a parameter that an accepted paper needs to estimate?

One possible explanation is that the accepted article normalizes replication failures, while we showed that these replication failures are at least partially explained by QRPs.  The first evidence for the normalization of abnormal science is that the article does not cite Francis (2014), Schimmack (2012), or John et al.’s (2012) survey about questionable research practices.  The article also does not mention Sterling’s work on abnormally high success rates in psychology journals (Sterling, 1959; Sterling et al., 1995). It does not mention Simmons, Nelson, and Simonsohn’s (2011) False-Positive Psychology article that discussed the harmful consequences of abnormal psychological science.  The article simply never mentions the term questionable research practices. Nor does it mention the “replication crisis,” although it mentions that the OSC project replicated only 25% of findings in social psychology.  Apparently, this is neither abnormal nor symptomatic of a crisis, but just how well social psychological science works.

So, how does this article explain the low replicability of social psychology as normal science?  The authors point out that replicability is a function of the percentage of true null-hypotheses that are being tested.  As researchers conduct empirical studies to find out which predictions are true and which are not, it is normal science to sometimes predict effects that do not exist (true null-hypotheses), and inferential statistics will sometimes lead to the wrong conclusion (type-I errors / false positives).  It is therefore unavoidable that empirical scientists will sometimes make mistakes.

The question is how often they make these mistakes and how they correct them.  How many false positives end up in the literature depends on several other factors, including (a) the percentage of null-hypotheses that are being tested and (b) questionable research practices.

The key argument in the article is that social psychologists are risk-takers who test many false hypotheses.  As a result, they end up with many false positive results. Replication studies are needed to show which findings are true and which are false. So, doing risky exploratory studies followed by replication studies is good science. In contrast, cognitive psychologists are not risk-takers and test hypotheses that have a high probability of being true. Thus, they have fewer false positives, but that doesn’t mean they are better scientists or social psychologists are worse scientists.  In the happy place of APS journals, all psychological scientists are good scientists.

Conceivably, social psychologists place higher value on surprising findings—that is, findings that reflect a departure from what is already known—than cognitive  psychologists do.

There is only one problem with this happy story of psychological scientists working hard to find the truth using the best possible scientific methods.  It is not true.

How Many Point-Nil-Hypotheses Are True?

How often is the null-hypothesis true?  To answer this question, it is important to define the null-hypothesis.  A null-hypothesis can be any point or range of effect sizes.  However, psychologists often wrongly use the term null-hypothesis to refer to the point-nil-hypothesis (cf. Cohen, 1994) that there is absolutely no effect (e.g., the effect of studying for a test AFTER the test has already happened; Bem, 2011).  We can then distinguish two sets of studies: studies with an effect of any magnitude and studies without an effect.

The authors argue correctly that testing many null effects will result in more false positives and lower replicability.  This is easy to see when all significant results are false positives (Bem, 2011).  The probability that any single replication study produces a significant result is then simply alpha (5%), and in a set of replication studies only 5% are expected to produce a significant result. This is the worst-case scenario (Rosenthal, 1979; Sterling et al., 1995).

Importantly, this does not only apply to replication studies; it also applies to original studies. If all studies have a true effect size of zero, only 5% of studies should produce a significant result.  However, it is well known that the success rate in psychology journals is above 90% (Sterling, 1959; Sterling et al., 1995).  Thus, it is not clear how social psychology can test many risky hypotheses that are often false and still report over 90% successes in its journals, or even within a single article (Schimmack, 2012). The only way to achieve this high success rate while most hypotheses are false is to report only successful studies (like a gambling addict who only counts wins and ignores losses; Rosenthal, 1979) or to make up hypotheses after randomly finding a significant result (Kerr, 1998).
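This worst-case arithmetic is easy to verify by simulation. Below is a minimal sketch in Python (rather than the R used elsewhere in this post); the two-group design with n = 20 per cell is just an illustrative assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# 10,000 two-group studies (n = 20 per cell) in which the true
# effect size is exactly zero in every study.
n_studies, n_per_cell = 10_000, 20
a = rng.normal(0, 1, (n_studies, n_per_cell))
b = rng.normal(0, 1, (n_studies, n_per_cell))
_, p = stats.ttest_ind(a, b, axis=1)

# With only true null hypotheses, the share of significant results
# converges to alpha, not to the >90% seen in journals.
success_rate = np.mean(p < .05)
print(success_rate)  # roughly 0.05
```

A journal success rate above 90% is therefore incompatible with honest reporting of many tests of true nulls.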

To my knowledge, Sterling et al. (1995) were the first to relate the expected failure rate (without QRPs) to alpha, power, and the percentage of studies with and without an effect.


Sterling et al.’s formula implies that we should never have expected the 100% success rate of the original studies in the Open Science Collaboration project.  The 25% success rate of the replication studies is shockingly low, but at least more believable than 100%.  The article neither mentions Sterling’s statistical contribution nor the implication for the expected success rate in original studies.

The main aim of the authors is to separate the effects of power and the proportion of studies without effects on the success rate, that is, the percentage of studies with significant results.

For example, a 25% success rate for social psychology could be produced by 25% of studies having 85% power and 75% of studies having no effect (and a 5% chance of producing a significant result), or by 100% of studies having an average of 25% power, or by any mixture in between.
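The relation behind this example is the one Sterling et al. pointed out: the expected success rate is a weighted mixture of power and alpha. A small sketch (illustrative function name) shows that both mixtures above produce the same observable success rate:

```python
def expected_success_rate(share_true, power, alpha=.05):
    """Expected share of significant results when `share_true` of all
    studies test a real effect (detected with the given power) and the
    rest test true nulls (significant at rate alpha)."""
    return share_true * power + (1 - share_true) * alpha

# 25% of studies with 85% power and 75% true nulls ...
print(round(expected_success_rate(.25, .85), 2))   # 0.25
# ... produce the same success rate as 100% of studies at 25% power.
print(round(expected_success_rate(1.00, .25), 2))  # 0.25
```

Because very different mixtures yield identical success rates, the success rate alone cannot identify the share of true effects.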

As pointed out by Brunner and Schimmack (2017), it is impossible to obtain a precise estimate of this percentage because different mixtures of studies can produce the same success rate.  I was therefore surprised when the abstract claimed that “we found that R was lower for the social-psychology studies than for the cognitive-psychology studies.”  How were the authors able to quantify and compare the proportions of studies with an effect in social versus cognitive psychology? The answer is provided in the following quote.

Using binary logic for the time being, we assume that the observed proportion of studies yielding effects in the same direction as originally observed, ω, is equal to the proportion of true effects, PPV, plus half of the remaining 1 – PPV noneffects, which would be expected to yield effects in the same direction as originally observed 50% of the time by chance.

To clarify, a null result is equally likely to produce a positive or a negative effect size by chance.  A sign reversal in a replication study is used to infer that the original result was a false positive.  However, these sign reversals capture only half of the false positives because random chance is equally likely to reproduce the original sign (head-tail is as probable as head-head).  Using this logic, the percentage of sign reversals times two is an estimate of the percentage of false positives in the original studies.

Based on the finding that 25.5% of social replication studies showed a sign reversal, the authors conclude that 51% of the original significant results were false positives.  This would imply that every other significant result that is published in social psychology journals is a false positive.
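Under the authors' binary logic, the conversion from observed sign reversals to estimated false positives is a one-liner. A sketch (hypothetical function name):

```python
def estimated_false_positive_share(sign_reversal_rate):
    # A false positive reverses sign in a replication half the time,
    # so doubling the observed reversal rate estimates the share of
    # false positives among the original significant results.
    return 2 * sign_reversal_rate

# 25.5% sign reversals -> 51% estimated false positives (PPV = 49%).
print(estimated_false_positive_share(.255))  # 0.51
```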

One problem with this approach is that sign reversals can also occur for true positive studies with low power (Gelman & Carlin, 2014).  Thus, the percentage of sign reversals is at best a rough estimate of false positive results.
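How big is this problem? For a two-sided z-test, the probability that a replication of a true effect lands on the wrong side of zero follows directly from the normal distribution. A sketch under that simplifying assumption:

```python
from scipy import stats

def sign_reversal_prob(power, alpha=.05):
    """Probability that a replication of a TRUE effect shows the wrong
    sign, assuming a two-sided z-test with the given power."""
    # Effect size in standard-error units implied by the power.
    z_crit = stats.norm.ppf(1 - alpha / 2)
    mu = z_crit - stats.norm.ppf(1 - power)
    return stats.norm.cdf(-mu)

# With 20% power, about 13% of replications of a real effect come out
# with the wrong sign; with 80% power, fewer than 1% do.
print(round(sign_reversal_prob(.20), 2))  # 0.13
print(round(sign_reversal_prob(.80), 3))  # 0.003
```

If many original studies tested small true effects with low power, a substantial share of sign reversals would occur even without any false positives.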

However, low power can be the result of small effect sizes and many of these effect sizes might be so close to zero that they can be considered false positives if the null-hypothesis is defined as a range of effect sizes close to zero.

So, I will just use the authors' estimate of 50% false positive results as a reasonable estimate of the percentage of false positives that are reported in social psychology journals.

Are Social Psychologists Testing Riskier Hypotheses? 

The authors claim that social psychologists have more false positive results than cognitive psychologists because they test more false hypotheses. That is, they are risk-takers:

Maybe watching a Beyoncé video reduces implicit bias? Let’s try it (with n = 20 per cell in a between-subject design).  It doesn’t and the study produced a non-significant result.  Let’s try something else.  After trying many other manipulations, finally a significant result is observed and published.  Unfortunately, this manipulation also had no effect and the published result is a false positive.  Another researcher replicates the study and obtains a significant result with a sign reversal. The original result gets corrected and the search for a true effect continues.


To make claims about the ratio of studies with effects to studies without effects (or with negligible effects) that are being tested, the authors use the formula shown above.  Here the ratio (R) of studies with an effect over studies without an effect is a function of alpha (the criterion for significance), beta (the type-II error probability), and the PPV, the positive predictive value, which is simply the percentage of true positives among the published significant results.

As noted before, the PPV for social psychology was estimated to be 49%. This leaves two unknowns needed to make claims about R: alpha and beta.  The authors' approach to estimating alpha and beta is questionable and undermines their main conclusion.
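For reference, the standard relation is PPV = R*power / (R*power + alpha), where R is the prior odds that a tested hypothesis is true; solving for R reproduces the article's estimates. A sketch (assuming this is what the article's Equation 2 computes, which its published numbers are consistent with):

```python
def prior_odds(ppv, power, alpha=.05):
    """Solve PPV = R*power / (R*power + alpha) for R, the prior odds
    that a tested hypothesis is true."""
    return (ppv * alpha) / ((1 - ppv) * power)

# The article's inputs reproduce its estimates:
print(round(prior_odds(.49, .50), 2))  # 0.1  (social, odds ~ 1 to 10)
print(round(prior_odds(.81, .80), 2))  # 0.27 (cognitive, odds ~ 1 to 4)
```

Note that the R estimate depends directly on the alpha and power values plugged in, which is why the next two sections matter.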

Estimating Alpha

The authors use the nominal alpha level as the probability that a study without a real effect produces a false positive result.

Social and cognitive psychology generally follow the same convention for their alpha level (i.e., p < .05), so the difference in that variable likely does not explain the difference in PPV. 

However, this is a highly questionable assumption when researchers use questionable research practices.  As Simmons et al. (2011) demonstrated, p-hacking can be used to bypass the significance filter, and the risk of reporting a false positive result with a nominal alpha of 5% can be over 50%.  That is, the actual risk of reporting a false positive result is not 5%, as stated, but much higher.  This has clear implications for the presence of false positive results in the literature.  While it would require 20 risky hypotheses to observe one false positive result with a significance filter of 5%, p-hacking makes it possible to turn every other test of a false hypothesis into a significant result.  Thus, massive p-hacking could explain a high percentage of false positive results in social psychology just as well as honest testing of risky hypotheses.
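A quick simulation illustrates how even one mild QRP inflates the real alpha. Here I assume a researcher measures three independent outcome variables for a null effect and reports whichever p-value is smallest; the design numbers are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# 10,000 null studies; in each, the researcher collects three
# independent DVs (n = 20 per cell) and reports the smallest p-value.
n_sim, n_dvs, n_per_cell = 10_000, 3, 20
a = rng.normal(0, 1, (n_sim, n_dvs, n_per_cell))
b = rng.normal(0, 1, (n_sim, n_dvs, n_per_cell))
_, p = stats.ttest_ind(a, b, axis=2)
fp_rate = np.mean(p.min(axis=1) < .05)

# The nominal alpha is 5%, but the realized false-positive rate is
# about 1 - .95**3, i.e., roughly 14%; stacking several QRPs
# (optional stopping, covariates, outlier removal) pushes it higher.
print(fp_rate)
```

Simmons et al. (2011) showed that combining several such practices drives the realized alpha above 50%.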

The authors simply ignore this possibility when they use the nominal alpha level as the factual probability of a false positive result and neither the reviewers nor the editor seemed to realize that p-hacking could explain replication failures.

Is there any evidence that p-hacking rather than risk-taking explains the results? Indeed, there is lots of evidence.  As I pointed out in 2012, it is easy to see that social psychologists are using QRPs because they typically report multiple conceptual replication studies in a single article. Many of the studies in the replication project were selected from multiple-study articles.  A multiple-study article essentially lowers alpha from .05 for a single study to .05 raised to the power of the number of studies. Even with just two studies, the risk of repeating a false positive result is just .05^2 = .0025.  And none of these multiple-study articles report replication failures, even if the tested hypothesis is ridiculous (Bem, 2011).  There are only two explanations for the high success rate in social psychology: either researchers are testing true hypotheses and the estimate of 50% false positive results is wrong, or they are using p-hacking and the risk of a false positive result in a single study is greater than the nominal alpha.  Either explanation invalidates the authors' conclusions about R: either their PPV estimates are wrong or their assumptions about the real alpha criterion are wrong.

Estimating Beta

Beta or the type-II error is the risk of obtaining a non-significant result when an effect exists.  Power is the complementary probability of getting a significant result when an effect is present (a true positive result).  The authors note that social psychologists might end up with more false positive results because they conduct studies with lower power.

To illustrate, imagine that social psychologists run 100 studies with an average power of 50% and 250 studies without an effect, and due to QRPs 20% of the latter studies produce a significant result despite a nominal alpha of p < .05.  In this case, there are 50 true positive results (100 * .50 = 50) and 50 false positive results (250 * .20 = 50).  In contrast, cognitive psychologists conduct studies with 80% power, while everything else is the same. In this case, there would be 80 true positive results (100 * .80 = 80) and also 50 false positive results.  The percentage of false positives would be 50% for social psychology, but only 50/(50+80) = 38% for cognitive psychology.  In this example, R and alpha are held constant, but the PPVs differ simply as a function of power.  If we additionally assume that cognitive psychologists use less severe p-hacking, there could be even fewer false positives (250 * .10 = 25), and the percentage of false positives for cognitive psychology would be only 25/(25+80) = 24%.  [The actual estimate in the article is 19%.]
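The arithmetic of this example can be checked directly (Python sketch; the function name and study counts just encode the illustration above):

```python
def false_positive_percentage(n_true, power, n_null, real_alpha):
    """Share of false positives among all significant results."""
    true_pos = n_true * power
    false_pos = n_null * real_alpha
    return false_pos / (true_pos + false_pos)

print(round(false_positive_percentage(100, .50, 250, .20), 2))  # 0.5  social
print(round(false_positive_percentage(100, .80, 250, .20), 2))  # 0.38 cognitive
print(round(false_positive_percentage(100, .80, 250, .10), 2))  # 0.24 less p-hacking
```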

Thus, to make claims about differences between social and cognitive psychologists, it is necessary to estimate beta or power (1 – beta), and because power varies across the diverse studies in the OSC project, they have to estimate average power.  Moreover, because only significant studies are relevant, they need to estimate the average power after selection for significance.  The problem is that no published peer-reviewed method for doing so exists.  The reason is that editors have rejected our manuscripts, which evaluated four different methods of estimating average power after selection for significance and showed that z-curve is the best method.

How do the authors estimate average power after selection for significance without z-curve?  They use p-curve plots and compare them by visual inspection against simulations of data with fixed power to obtain rough estimates of 50% average power for social psychology and 80% average power for cognitive psychology.

It is interesting that the authors used p-curve plots but did not use the p-curve online app to estimate average power.  The online p-curve app also provides power estimates. However, as we pointed out in the rejected manuscript, this method can severely overestimate average power. In fact, when the online p-curve app is used, it produces estimates of 96% average power for social psychology and 98% for cognitive psychology. These estimates are implausible, and this is the reason why the authors created their own ad-hoc method of power estimation rather than using the estimates provided by the p-curve app.

We used the p-curve app and also got really high power estimates that seemed implausible, so we used ballpark estimates from the Simonsohn et al. (2014) paper instead (Brent Wilson, email communication, May 7, 2018). 


Based on their visual inspection of the graphs, they conclude that the average power in social psychology is about 50% and the average power in cognitive psychology is about 80%.

Putting it all together 

After estimating PPV, alpha, and beta in the way described above, the authors used the formula to estimate R.

If we set PPV to .49, alpha to .05, and 1 – β (i.e., power) to .50 for the social-psychology studies and we set the corresponding values to .81, .05, and .80 for the cognitive-psychology studies, Equation 2 shows that R is .10 (odds = 1 to 10) for social psychology and .27 (odds = 1 to ~4) for cognitive psychology.

Now the authors make another mistake.  The power estimate obtained from p-curve applies to ALL p-values, including those from false positives.  Of course, the average power estimate is lower for a set of studies that contains more false positive results.

To end up with 50% average power when 50% of the significant results are false positives, the power of the studies that are not false positives can be computed with the following formula.

Avg.Power = FP*alpha + TP*power   <=>   power = (Avg.Power - FP*alpha)/TP

With 49% true positives (TP), 51% false positives (FP), alpha = .05, and average power = .50 for social psychology, the estimated average power of studies with an effect is 97%.

alpha = .05; avg.power = .50; TP = .49; FP = 1-TP; (avg.power - FP*alpha)/TP  # R: returns ~.97

With 81% true positives and 80% average power for cognitive psychology, the estimated average power of studies with an effect in cognitive psychology  is 98%.

Thus, there is actually no difference in power between social and cognitive psychology: the percentage of false positive results alone explains the difference in the estimates of average power for all studies.


alpha = .05; PPV = .49; power = .96; alpha*PPV/(power * (1-PPV))  # R: social, returns ~.05
alpha = .05; PPV = .81; power = .97; alpha*PPV/(power * (1-PPV))  # R: cognitive, returns ~.22

With these corrected estimates of power for studies with true effects, the estimate of R is .05 for social psychology and .22 for cognitive psychology.  This means that social psychologists test 20 false hypotheses for every true hypothesis, while cognitive psychologists test 4.55 false hypotheses for every true hypothesis, assuming the authors' other assumptions are correct.


The authors make some questionable assumptions and some errors to arrive at the conclusion that social psychologists conduct many studies with no real effect. In their account, all of these studies are run with a high level of power. When a non-significant result is obtained, researchers discard the hypothesis and move on to testing another one.  The significance filter keeps most of the false hypotheses out of the literature, but because there are so many false hypotheses, 50% of the published results end up being false positives.  Unfortunately, social psychologists failed to conduct actual replication studies, and a large pile of false positive results accumulated in the literature until social psychologists realized in 2011 that they need to replicate findings.

Although this is not really a flattering description of social psychology, the truth is worse.  Social psychologists have been replicating findings for a long time. However, they never reported studies that failed to replicate earlier findings and when possible they used statistical tricks to produce empirical findings that supported their conclusions with a nominal error rate of 5%, while the true error rate was much higher.  Only scandals in 2011 led to honest reporting of replication failures. However, these replication studies were conducted by independent investigators, while researchers with influential theories tried to discredit these replication failures.  Nobody is willing to admit that abnormal scientific practices may explain why many famous findings in social psychology textbooks were suddenly no longer replicable after 2011, especially when hypotheses and research protocols were preregistered and prevented the use of questionable research practices.

Ultimately, the truth will be published in peer-reviewed journals. APS does not control all journals.  When the truth becomes apparent, APS will look bad because it did nothing to enforce normal scientific practices, and it will look worse because it tried to cover up the truth.  Thank God, former APS president Susan Fiske reminded her colleagues that real scientists should welcome humiliation when their mistakes come to light because the self-correcting forces of science are more important than researchers' feelings. So far, APS leaders seem to prefer repressive coping over open acknowledgment of past mistakes. I wonder what the most famous psychologists of all time would have to say about this.

Estimating Reproducibility of Psychology (No. 52): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles.  These predictions will only be accurate if the replication studies were close replications of the original studies.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The replication crisis has split psychologists and disrupted social networks.  I respected Jerry Clore as an emotion researcher when I started my career in emotion research.  His work on appraisal theories of emotions made an important contribution and influenced my thinking about emotions.  I enjoyed participating in Jerry’s lab meetings when I was a post-doctoral student of Ed Diener at Illinois.  However, I was never a big fan of Jerry’s most famous article on the effect of mood on life-satisfaction judgments.  Working with Ed Diener convinced me that life-satisfaction judgments are more stable and more strongly based on chronically accessible information than the mood as information model suggested (Anusic & Schimmack, 2016; Schimmack & Oishi, 2005).  Nevertheless, I had a positive relationship with Jerry and I am grateful that he wrote recommendation letters for me when I was on the job market.

When researchers started doing replication studies after 2011, some of Jerry’s articles failed to replicate, and one reason for these replication failures is that the original studies used questionable research practices.  Importantly, nobody considered these practices unethical and it was not a secret that these methods were used. Books even taught students that the use of these practices is good science.  The problem is that Jerry didn’t acknowledge that questionable practices could at least partially explain replication failures.  Maybe he did it to protect students like Simone Schnall. Maybe he had other reasons.  Personally, I was disappointed by this response to replication failures, but I guess that is life.

Summary of Original Article


In five studies, the authors crossed the priming of happy and sad concepts with affective experiences. In all studies, the expected interaction was significant. Coherence between affective concepts and affective experiences led to better recall of a story than in the incoherent conditions.

Study 1

56 students were assigned to six conditions (n ~ 10) of a 2 x 3 design. Three priming conditions with a scrambled sentence task were crossed with a manipulation of flexing or extending one arm. This manipulation is supposed to create an approach or avoidance motivation (Cacioppo et al., 1993).  The expected interaction was significant, F(2, 50) = 3.50, p = .038.

Study 2

75 students participated in Study 2, which was a replication study with two changes:  the arm position manipulation was paired with the priming task and half the participants rated their mood before the measurement of the DV.  The ANOVA result was marginally significant; F(2, 69) = 2.81, p = .067.

Study 3

Study 3 (58 students) used the same priming procedure but used music as the mood manipulation.  The neutral priming condition was dropped (n ~ 15 per cell).  The interaction effect was marginally significant, F(1, 54) = 3.48, p = .068.

Study 4

132 students participated in Study 4.  The study changed the priming task to a subliminal priming manipulation (although the 60ms presentation time may not be fully subliminal).  Affect was manipulated by asking participants to hold a happy or sad facial expression.  The interaction was significant, F(1, 128) = 3.97, p = .048.

Study 5 

133 students participated in Study 5.  Study 5 combined the scrambled sentence priming manipulation from Studies 1-3 with the facial expression manipulation from Study 4.  The interaction effect was significant, F(1, 129) = 5.21, p = .024.

Replicability Analysis

Although all five studies showed support for the predicted two-way interaction, the p-values in the five studies are surprisingly similar (ps = .038, .067, .068, .048, .024). The probability of obtaining such small (or even smaller) variability in p-values is p = .002 (TIVA).  This suggests that QRPs were used to produce (marginally) significant results in five studies with low power (Schimmack, 2012).
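For readers who want to check this, TIVA boils down to comparing the variance of the z-scores implied by the p-values against its expected value of 1, using the left tail of a chi-square distribution. A minimal Python sketch of that logic (using Study 5's reported p = .024):

```python
import numpy as np
from scipy import stats

def tiva(p_values):
    """Test of Insufficient Variance (sketch): convert two-sided
    p-values to z-scores, whose variance should be about 1 under
    honest reporting; a left-tail chi-square test flags a set of
    results that is too similar to be credible."""
    z = stats.norm.ppf(1 - np.asarray(p_values) / 2)
    k = len(p_values)
    chi2 = (k - 1) * np.var(z, ddof=1)  # ~ chi-square with df = k - 1
    return stats.chi2.cdf(chi2, df=k - 1)

# The five p-values reported across the five studies.
print(round(tiva([.038, .067, .068, .048, .024]), 3))  # 0.002
```

Five independent studies with roughly 50% power should produce much more variable p-values than this.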

A small set of studies provides limited information about QRPs.  It is helpful to look at these p-values in the context of other results reported in articles with Jerry Clore as co-author.


The plot shows a large file-drawer (missing studies with non-significant results) that is implied by a large number of just-significant results.  Either many studies were run to obtain a just-significant result or other QRPs were used.  This analysis supports the conclusion that QRPs contributed to the reported results in the original article.

Replication Study

The replication project attempted a replication of Study 5.  However, the authors did not pick the 2 x 2 interaction as the target finding.  Instead, they used the finding in a “repeated measures ANOVA with condition (coherent vs. incoherent) and story prompt (tree vs. house vs. car) produced a significant linear trend for the interaction of Condition X Story, F(1, 131) = 5.79, p < .02, η2 = .04” (Centerbar et al., 2008, p. 573).  The replication study did not find this trend, F(2, 110) = .759, p = .471.  However, the difference in degrees of freedom shows that the replication analysis had less power because it did not test the linear contrast. Moreover, the replication report states that the replication study showed a trend regarding the main effect of affective coherence on the percentage of causal words used, F(1, 111) = 3.172, p = .078.  This makes it difficult to evaluate whether the replication study was really a failure.

I used the posted data to test the interaction for the total number of words produced. It was not significant, F(1,126) = 0.602, p = .439.

In conclusion, the reported significant interaction failed to replicate.


The replication study of this 2 x 2 between-subject social psychology experiment failed to replicate the original result.  Bias tests suggest that the replication failure was at least partially caused by the use of questionable research practices in the original study.








Confused about Effect Sizes? Read more Cohen (and less Loken & Gelman)

*** Background.  The Loken and Gelman article “Measurement Error and the Replication Crisis” created a lot of controversy in the Psychological Methods Discussion Group. I believe the article is confusing and potentially misleading. For example, the authors do not clearly distinguish between unstandardized and standardized effect size measures, although random measurement error has different consequences for each.  I think a blog post by Gelman makes clear what the true purpose of the article is.

We talked about why traditional statistics are often counterproductive to research in the human sciences.

This explains why the article tries to construct one more fallacy in the use of traditional statistics, but fails to point out a simple solution to avoid this fallacy.  Moreover, I argue in the blog post that Loken and Gelman committed several fallacies on their own in an attempt to discredit t-values and significance testing.

I asked Gelman to clarify several statements that made no sense to me. 

 “It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high”  (Loken and Gelman)

Ulrich Schimmack
Would you say that there is no meaningful difference between a z-score of 2 and a z-score of 4? These z-scores are significantly different from each other. Why would we not say that a study with a z-score of 4 provides stronger evidence for an effect than a study with a z-score of 2?

  • Andrew says:


    Sure, fair enough. The z-score provides some information. I guess I’d just say it provides less information than people think.


I believe that the article contains many more statements that are misleading and do not inform readers how t-values and significance testing works.  Maybe the article is not as bad as I think it is, but I am pretty sure that it provides less information than people think.

In contrast, Jacob Cohen has  provided clear and instructive recommendations for psychologists to improve their science.  If psychologists had listened to him, we wouldn’t have a replication crisis.

The main points to realize about random measurement error and replicability are:

1.  Neither population nor sample mean differences (or covariances) are effect sizes. They are statistics that provide some information about effects and the magnitude of effects.  The main problem in psychology has been the interpretation of mean differences in small samples as “observed effect sizes.”  Effects cannot be observed.

2.  Point estimates of effect sizes vary from sample to sample.  It is incorrect to interpret a point estimate as information about the size of an effect in a sample or a population. To avoid this problem, researchers should always report a confidence interval of plausible effect sizes. In small samples with just significant results, these intervals are wide and often close to zero.  Thus, no researcher should interpret a moderate to large point estimate when effect sizes close to zero are also consistent with the data.

3.  Random measurement error creates more uncertainty about effect sizes.  It has no systematic effect on unstandardized effect sizes, but it systematically lowers standardized effect sizes (correlations, Cohen’s d, amount of explained variance).

4.  Selection for significance inflates standardized and unstandardized effect size estimates.  Replication studies may fail if original studies were selected for significance, depending on the amount of bias introduced by selection for significance (this is essentially regression to the mean).

5. As random measurement error attenuates standardized effect sizes,  selection for significance partially corrects for this attenuation.  Applying a correction formula (Spearman) to estimates after selection for significance would produce even more inflated effect size estimates.

6.  The main cause of the replication crisis is undisclosed selection for significance.  Random measurement error has nothing to do with the replication crisis because random measurement error has the same effect on original and replication studies. Thus, it cannot explain why an original study was significant and a replication study failed to be significant.

Questionable Claims in Loken and Gelman’s  Backpack article. 

If you learned that a friend had run a mile in 5 minutes, you would be respectful; if you learned she had done it while carrying a heavy backpack, you would be awed. The obvious inference is that she would have been even faster without the backpack.

This makes sense. We assume that our friend’s ability is relatively fixed, that everybody is slower with a heavy backpack, that the distance is really a mile, that the clock was working properly, and that no magic potion or tricks are involved.  As a result, we expect very little variability in our friend’s performance and an even faster time without the backpack.

But should the same intuition always be applied to research findings? Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise?

How do we translate this analogy?  Let’s say running 1 mile in 5 minutes corresponds to statistical significance. Any time below 5 minutes is significant and any time longer than 5 minutes is not significant.  The friend’s ability is the sample size. The larger the sample size, the easier it is to get a significant result.  Finally, the backpack is measurement error.  Just like a heavy backpack makes it harder to run 1 mile in 5 minutes, more measurement error makes it harder to get significance.

The question is whether it follows that the “associated effects” (mean differences or regression coefficients that are used to estimate effect sizes) would have been stronger without random measurement error.

The answer is no.  This may not be obvious, but it directly follows from basic introductory statistics, like the formula for the t-statistic.

t-value  =  (mean difference / SD) * sqrt(N)/2

The SD reflects the variability of a construct in the population plus additional variability due to measurement error.  So, measurement error increases the SD component of the t-value, but it has no effect on the mean difference in the numerator (the unstandardized effect size).
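This can be checked with a small simulation (my own sketch, not from either article): two groups with a true mean difference of 1 and a true SD of 10 are measured once perfectly and once with a measure of 25% reliability, which doubles the observed SD.

```python
# Hypothetical simulation: a true mean difference of 1 with a true SD of 10,
# measured perfectly and with 25% reliability (error variance = 3x true variance).
import random
import statistics

random.seed(1)
N = 50_000  # per group; large, so sample statistics approximate the population

def simulate(error_sd):
    # each observed score = true score + independent measurement error
    g1 = [random.gauss(0, 10) + random.gauss(0, error_sd) for _ in range(N)]
    g2 = [random.gauss(1, 10) + random.gauss(0, error_sd) for _ in range(N)]
    diff = statistics.mean(g2) - statistics.mean(g1)   # unstandardized "effect"
    sd = statistics.pstdev(g1)                         # observed SD
    t = diff / (sd * (2 / N) ** 0.5)                   # t = diff / SE
    return diff, sd, t

diff0, sd0, t0 = simulate(0)             # perfect measure
diff1, sd1, t1 = simulate(300 ** 0.5)    # 25% reliability doubles the SD

# the mean difference is unaffected, the SD doubles, and the t-value is halved
```

The simulation simply restates the formula: error adds to the SD in the denominator while leaving the numerator alone.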

We caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger. 

With all due respect for trying to make statistics accessible, there is a trade-off between accessibility and sensibility.   First, statistical significance cannot be made stronger. A finding is either significant or it is not.  Surely a test statistic like a t-value can be made stronger or weaker depending on changes in its components.  If we interpret “that which does not kill” as “obtaining a significant result with a lot of random measurement error,” it is correct to expect a larger t-value and stronger evidence against the null-hypothesis in a study with a more reliable measure.  This follows directly from the effect of random error on the standard deviation in the denominator of the formula. So how can it be a fallacy to assume something that can be deduced from a mathematical formula? Maybe the authors are not talking about t-values.

It is understandable, then, that many researchers have the intuition that if they manage to achieve statistical significance under noisy conditions, the observed effect would have been even larger in the absence of noise.  As with the runner, they assume that without the burden—that is, uncontrolled variation—their effects would have been even larger.

Although this statement makes it clear that the authors are not talking about t-values, it is not clear why researchers should have the intuition that a study with a more reliable measure should produce larger effect sizes.  As shown above, random measurement error adds to the variability of observations, but it has no systematic effect on the mean difference or regression coefficient.

Now the authors introduce a second source of bias. Unlike random measurement error, this error is systematic and can lead to inflated estimates of effect sizes.

The reasoning about the runner with the backpack fails in noisy research for two reasons. First, researchers typically have so many “researcher degrees of freedom”—unacknowledged choices in how they prepare, analyze, and report their data—that statistical significance is easily found even in the absence of underlying effects and even without multiple hypothesis testing by researchers. In settings with uncontrolled researcher degrees of freedom, the attainment of statistical significance in the presence of noise is not an impressive feat.

The main reason for inferential statistics is to generalize results from a sample to a wider population.  The problem with these inductive inferences is that results vary from sample to sample. This variation is called sampling error.  Sampling error is separate from measurement error: even studies with perfect measures have sampling error, and sampling error is inversely related to the square root of sample size (the standard error of a mean difference is roughly SD * 2/sqrt(N)).  Sampling error alone is again unbiased; it can produce larger or smaller mean differences.  However, if studies are split into significant and non-significant studies, the mean differences of significant results are inflated, and the mean differences of non-significant results are deflated, estimates of the population mean difference.  So, effect size estimates in studies that are selected for significance are inflated. This is true even in studies with perfectly reliable measures.
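A quick simulation (my own illustration of the significance filter, not from the article) shows the inflation even with a perfectly reliable measure:

```python
# Hypothetical simulation of the significance filter: many small studies of a
# true mean difference of 1 (SD = 10, n = 100 per group) are run, and only the
# significant ones are "published".
import random
import statistics

random.seed(2)
n, true_diff, sd = 100, 1.0, 10.0
se = sd * (2 / n) ** 0.5          # ~1.41; using the known SD for simplicity
published = []
for _ in range(5000):
    g1 = [random.gauss(0, sd) for _ in range(n)]
    g2 = [random.gauss(true_diff, sd) for _ in range(n)]
    diff = statistics.mean(g2) - statistics.mean(g1)
    if diff / se > 1.96:          # selected for significance
        published.append(diff)

inflated = statistics.mean(published)
# the mean published difference lands far above the true value of 1
```

The inflation here is pure regression to the mean: only the lucky draws pass the filter, so their average overstates the population difference.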

In a study with noisy measurements and small or moderate sample size, standard errors will be high and statistically significant estimates will therefore be large, even if the underlying effects are small.

To give an example, assume there were a height difference of 1 cm between brown-eyed and blue-eyed individuals.  The standard deviation of height is 10 cm.  A study with 400 participants has a sampling error of 10 cm * 2/sqrt(400) = 1 cm.  To achieve significance, the mean difference has to be about twice as large as the sampling error (t = 2 ~ p = .05).  Thus, a significant result requires a mean difference of 2 cm, which is 100% larger than the population mean difference in height.

Another researcher uses an unreliable measure (25% reliability) of height that quadruples the variance (100 cm^2 vs. 400 cm^2) and doubles the standard deviation (10cm vs. 20cm).  The sampling error also doubles to 2 cm, and now a mean difference of 4 cm is needed to achieve significance with the same t-value of 2 as in the study with the perfect measure.

The mean difference is two times larger than before and four times larger than the mean difference in the population.

The fallacy would be to look at this difference of 4 cm and to believe that an even larger difference could have been obtained with a more reliable measure.   This is a fallacy, but not for the reasons the authors suggest.  The fallacy is to assume that random measurement error in the measure of height reduced the estimate of 4 cm and that an even bigger difference would be obtained with a more reliable measure.  This is a fallacy because random measurement error does not influence the mean difference of 4 cm.  Instead, it increased the standard deviation; with a more reliable measure the sampling error would be smaller (1 cm rather than 2 cm), and the mean difference of 4 cm would have a t-value of 4 rather than 2, which is significantly stronger evidence for an effect.
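The arithmetic of the height example can be checked directly (a sketch, using t ≈ 2 in place of the exact critical value):

```python
# Minimum mean difference needed for significance (t ~ 2) with 400 participants
# (200 per group), for the perfect and the noisy height measure.
def min_significant_diff(sd, n_total, t_crit=2.0):
    se = sd * 2 / n_total ** 0.5      # standard error of the mean difference
    return t_crit * se

perfect = min_significant_diff(10, 400)   # SE = 1 cm, so 2 cm is needed
noisy = min_significant_diff(20, 400)     # SE = 2 cm, so 4 cm is needed
```

Doubling the SD doubles the standard error, and therefore doubles the smallest mean difference that can reach significance.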

How can the authors overlook that random measurement error has no influence on mean differences?  The reason is that they do not clearly distinguish between standardized and unstandardized estimates of effect sizes.

Spearman famously derived a formula for the attenuation of observed correlations due to unreliable measurement. 

Spearman’s formula applies to correlation coefficients, and correlation coefficients are standardized measures of effect sizes because the covariance is divided by the standard deviations of both variables.  Similarly, Cohen’s d is a standardized coefficient because the mean difference is divided by the pooled standard deviation of the two groups.

Random measurement error clearly influences standardized effect size estimates because the standard deviation is used to standardize effect sizes.
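Spearman's attenuation formula itself is a one-liner; a small sketch with hypothetical numbers:

```python
# Spearman's attenuation formula: the observed correlation equals the true
# correlation times the square root of the product of the two reliabilities.
def attenuated_r(true_r, rel_x, rel_y):
    return true_r * (rel_x * rel_y) ** 0.5

# a true correlation of .5, one measure perfect, the other with 25% reliability
observed = attenuated_r(0.5, 1.0, 0.25)   # .5 * sqrt(.25) = .25
```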

The true population mean difference of 1 cm divided by the population standard deviation  of 10 cm yields a Cohen’s d = .10; that is one-tenth of a standard deviation difference.

In the example, the mean difference for a just significant result with a perfect measure was 2 cm, which yields a Cohen’s d = 2 cm divided by 10 cm = .20, two-tenths of a standard deviation.

The mean difference for a just significant result with a noisy measure was 4 cm, which yields a standardized effect size of 4 cm divided by 20 cm = .20, also two-tenths of a standard deviation.

Thus, the inflation of the mean difference is proportional to the increase in the standard deviation.  As a result, the standardized effect size is the same for the perfect measure and the unreliable measure.

Compared to the true mean difference of one-tenth of a standard deviation, the standardized effect sizes are both inflated by the same amount (d = .20 vs. d = .10, 100% inflation).

This example shows the main point the authors are trying to make.  Standardized effect size estimates are attenuated by random measurement error. At the same time, random measurement error increases sampling error, and the mean difference has to be inflated to get significance.  This inflation already compensates for the attenuation of standardized effect sizes, and any additional correction for unreliability with the Spearman formula would inflate effect size estimates rather than correcting for attenuation.
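A simulation (my own sketch with hypothetical numbers) makes the point: after selection for significance, applying the Spearman correction on top of the already-inflated published estimates overshoots badly.

```python
# Hypothetical simulation: studies with an unreliable measure (25% reliability,
# observed SD = 2 instead of 1) are selected for significance; the Spearman
# correction is then applied to the published standardized effect sizes.
import random
import statistics

random.seed(3)
n = 100
true_diff = 0.10            # true d = .10 with a perfect measure (SD = 1)
obs_sd = 2.0                # 25% reliability doubles the observed SD
reliability = 0.25
se = obs_sd * (2 / n) ** 0.5
published_d = []
for _ in range(5000):
    g1 = [random.gauss(0.0, obs_sd) for _ in range(n)]
    g2 = [random.gauss(true_diff, obs_sd) for _ in range(n)]
    diff = statistics.mean(g2) - statistics.mean(g1)
    if diff / se > 1.96:                      # significance filter
        published_d.append(diff / obs_sd)     # observed standardized d

mean_d = statistics.mean(published_d)         # already inflated by selection
corrected = mean_d / reliability ** 0.5       # Spearman "correction" doubles it
# corrected ends up far above the true d of .10
```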

This would have been a noteworthy observation, but the authors suggest that random measurement error can even have paradoxical effects on effect size estimates.

But in the small-N setting, this will not hold; the observed correlation can easily be larger in the presence of measurement error (see the figure, middle panel).

This statement is confusing because the most direct effect of measurement error on standardized effect sizes is attenuation.  In the height example, any observed mean difference is divided by 20 rather than 10, reducing the standardized effect sizes by 50%. The variability of these standardized effect sizes is simply a function of sample size and therefore equal.  Thus, it is not clear how a study with more measurement error can produce larger standardized effect sizes.  As demonstrated above, the inflation produced by the significance filter at most compensates for the attenuation due to random measurement error.  There is simply no paradox by which researchers obtain stronger evidence (larger t-values or larger standardized effect sizes) with noisier measures, even if results are selected for significance.

Our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies. If it really were true that effect sizes were always attenuated by measurement error, then it would be all the more impressive to have achieved significance.

This makes no sense. If random measurement error attenuates effect sizes, it cannot be used to justify surprisingly large mean differences.  Either we are talking about unstandardized effect sizes, which are not influenced by measurement error, or we are talking about standardized effect sizes, which are attenuated by measurement error, so obtaining large mean differences is surprising.  If the true mean difference is 1 cm and an effect of 4 cm is needed to get significance with SD = 20 cm, it is surprising to get significance because the power to do so is only about 7%.  Of course, it is only surprising if we knew that the population mean difference is only 1 cm, but the main point is that we cannot use random measurement error to justify large effect sizes because random measurement error always attenuates standardized effect size estimates.

More confusing claims follow.

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

As explained above, random measurement error makes t-values weaker not stronger. It therefore makes no sense to attribute strong t-values to random measurement error as a potential source of variance.  The most likely explanation for strong effect sizes in studies with large sampling error is selection for significance, not random measurement error.

After all of these confusing claims the authors end with a key point.

A key point for practitioners is that surprising results from small studies should not be defended by saying that they would have been even better with improved measurement.

This is true, but it is not an argument researchers actually make.  The bigger problem is that researchers do not realize that the significance filter makes it necessary to find moderate to large effects and that sampling error in small samples alone can produce these effect sizes, especially when questionable research practices are being used.  No claims about hypothetically larger effect sizes are necessary or regularly made.

Next the authors simply make random statements about significance testing that reveal their ideological bias rather than adding to the understanding of t-values.

“It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high.”

Of course, the t-value is a measure of the strength of evidence against the null-hypothesis, typically the hypothesis that the data were obtained without a mean difference in the population.  The larger the t-value, the less likely it is that the observed t-value could have been obtained without a population mean difference in the direction of the mean difference in the sample.  And with t-values of 4 or higher, published results also have a high probability of replicating a significant result in a replication study (Open Science Collaboration, 2015).  It can be debated whether a t-value of 2 is weak, moderate or strong evidence, but it is not debatable whether t-values provide information that can be used for inductive inferences.  Even Bayes-Factors rely on t-values.  So, the authors’ criticism of t-values makes little sense from any statistical perspective.
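As a rough sketch (my own normal approximation, not a claim from either article), one can treat the observed t-value as the expected value of the test statistic in an exact replication and ask how often the replication would be significant:

```python
# Replication probability under a normal approximation: the replication
# test statistic is treated as Normal(t_observed, 1), and significance
# requires a value above 1.96.
import math

def replication_probability(t_observed, crit=1.96):
    # P(z > crit) when z ~ Normal(t_observed, 1)
    return 0.5 * math.erfc((crit - t_observed) / math.sqrt(2))

p_t2 = replication_probability(2.0)   # about .50 for a just-significant result
p_t4 = replication_probability(4.0)   # about .98
```

On this sketch, a t-value of 4 predicts a much higher replication probability than a t-value of 2, which is exactly why the t-value carries information.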

It is also a mistake to assume that the observed effect size would have been even larger if not for the burden of measurement error. Intuitions that are appropriate when measurements are precise are sometimes misapplied in noisy and more probabilistic settings.

Once more these broad claims are false and misleading.  Everything else equal, estimates of standardized effect sizes are attenuated by random measurement error and would be larger if a more reliable measure had been used.  Once selection for significance is present,  the inflation introduced by selection for significance inflates standardized effect size estimates for perfect measures and it starts to disattenuate standardized effect size estimates with unreliable measures.

In the end, the authors try to link their discussion of random measurement error to the replication crisis.

The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin. We would all run faster without a backpack on our backs. But when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.

This is confusing. Replicability is a function of power and power is a function of the population mean difference and the sampling error of the design of a study.  Random measurement error increases sampling error, which reduces standardized effect sizes, power, and replicability.  As a result, studies with unreliable measure are less likely to produce significant results in original studies and in replication studies.

The only reason for surprising replication failures (e.g., 100% significant original studies and 25% significant replication studies for social psychology; OSC, 2015) is questionable practices that inflate the percentage of significant results in original studies.  It is irrelevant whether the original result was produced with a small population mean difference and a reliable measure or with a moderate population mean difference and an unreliable measure.  All that matters is how large the mean difference is relative to the standard deviation of the measure that was used.  That is, replicability is the same for a height difference of 1 cm with a perfect measure and a standard deviation of 10 cm as for a height difference of 2 cm with a noisy measure and a standard deviation of 20 cm.  However, the chance of obtaining a significant result if the mean difference is 1 cm and the SD is 20 cm is lower because the noisy measure reduces the standardized effect size to Cohen’s d = 1 cm / 20 cm = 0.05.
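A rough power calculation (my own normal-approximation sketch) makes the three scenarios concrete:

```python
# Power (normal approximation, critical z = 1.96) for the three height
# scenarios with N = 400: only the ratio of mean difference to SD matters.
import math

def power(mean_diff, sd, n_total, crit=1.96):
    expected_t = (mean_diff / sd) * n_total ** 0.5 / 2   # expected test statistic
    return 0.5 * math.erfc((crit - expected_t) / math.sqrt(2))

a = power(1, 10, 400)   # d = .10, perfect measure
b = power(2, 20, 400)   # d = .10, noisy measure with a larger true difference
c = power(1, 20, 400)   # d = .05, noisy measure, same true difference
# a and b are identical (~17%); c is clearly lower (~7%)
```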


Loken and Gelman wrote a very confusing article about measurement error.  Although confusion about statistics is the norm among social scientists, it is surprising that a statistician has problems explaining basic statistical concepts and how they relate to the outcomes of original and replication studies.

The most probable explanation for the confusion is that the authors seem to believe that the combination of random measurement error and large sampling error creates a novel problem that has been overlooked.

Measurement error and selection bias thus can combine to exacerbate the replication crisis.

In the large-N scenario, adding measurement error will almost always reduce the observed correlation.  Take these scenarios and now add selection on statistical significance… for smaller N, a fraction of the observed effects exceeds the original. 

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small.

The quotes suggest that the authors believe something extraordinary is happening in studies with large random measurement error and small samples.  However, this is not the case. Random measurement error attenuates t-values and selection for significance inflates them and these two effects are independent.  There is no evidence to suggest that random measurement error suddenly inflates effect size estimates in small samples with or without selection for significance.

Recommendations for Researchers 

It is also disconcerting that the authors fail to give recommendations for how researchers can avoid these fallacies, even though such recommendations have been made before and would easily fix the problems associated with the interpretation of effect sizes in studies with noisy measures and small samples.

The main problem in noisy studies is that point estimates of effect sizes are not a meaningful statistic.   This is not necessarily a problem. Many exploratory studies in psychology aim to examine whether there is an effect at all and whether this effect is positive or negative.  A statistically significant result only allows researchers to infer that a positive or negative effect contributed to the outcome of the study (because the extreme t-value falls into a range of values that are unlikely without an effect). So, conclusions should be limited to a discussion of the sign of the effect.

Unfortunately, psychologists have misinterpreted Jacob Cohen’s work and started to interpret the standardized coefficients, like correlation coefficients or Cohen’s d, that they observed in their samples.  To make matters worse, these coefficients are sometimes called observed effect sizes, as in the article by Loken and Gelman.

This might have been a reasonable term for trained statisticians, but for poorly trained psychologists it suggested that this number tells them something about the magnitude of the effect they were studying.  After all, this seems a reasonable interpretation of the term “observed effect size.”  They then used Cohen’s book to interpret these values as evidence that they obtained a small, moderate, or large effect.  In small studies, the effects have to be moderate (2 groups, n = 20, p = .05 => d = .64) to reach significance.
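The arithmetic behind the n = 20 figure can be sketched directly (my own illustration):

```python
# The smallest Cohen's d that reaches p < .05 with n = 20 per group,
# using the two-tailed critical t for df = 38 (about 2.024).
t_crit = 2.024
n = 20
min_d = t_crit * (2 / n) ** 0.5   # d = t * sqrt(1/n1 + 1/n2)
# roughly .64 -- "moderate to large" by Cohen's guidelines
```

In other words, a just-significant result in such a small study must show a moderate to large observed d, whatever the true effect is.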

However, Cohen explicitly warned against this use of effect sizes. He developed standardized effect size measures to help researchers plan studies that can provide meaningful tests of hypotheses.  A small effect size requires a large sample: if researchers think an effect is small, they shouldn’t run a study with 40 participants, because such a study is so noisy that it is likely to fail.  So, standardized effect sizes were intended to be assumptions about unobservable population parameters, not descriptions of observed results.

However, psychologists ignored Cohen’s guidelines for the planning of studies. Instead they used his standardized effect sizes to examine how strong the “observed effects” in their studies were.  This misinterpretation of Cohen is partially responsible for the replication crisis because researchers ignored the significance filter and were happy to report that they consistently observed moderate to large effect sizes.

However, they also consistently observed replication failures in their labs.  This was puzzling because moderate to large effects should be easy to replicate.  Without training in statistics, social psychologists found an explanation for this variability of observed effect sizes as well: surely, the variability in observed effect sizes (!) from study to study meant that their results were highly dependent on context.  I still remember joking with some other social psychologists that effects even depended on the color of research assistants’ shirts.  Only after reading Cohen did I understand what was really happening.  In studies with large sampling error, the “observed effect sizes” move around a lot because they are not observations of effects.  Most of the variation in mean differences from study to study is purely random sampling error.

At the end of his career, Cohen seemed to have lost faith in psychology as a science.  He wrote a dark and sarcastic article titled “The Earth Is Round (p < .05).”  In this article, he proposes a simple solution for the misinterpretation of “observed effect sizes” in small samples.  The abstract of this article is more informative and valuable than Loken and Gelman’s entire article.

Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication.

The key point is that any sample statistic like an “effect size estimate” (not an observed effect size) has to be considered in the context of the precision of the estimate.  Nobody would take a public opinion poll seriously if it were conducted with 40 respondents and showed a 55% share for a candidate, if that result came with the information that the 95% CI ranges from 40% to 70%.
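The poll numbers check out with the usual normal approximation for a proportion:

```python
# The poll example: 55% support among 40 respondents, with the standard
# normal-approximation 95% confidence interval for a proportion.
p, n = 0.55, 40
se = (p * (1 - p) / n) ** 0.5                 # standard error of the proportion
lower, upper = p - 1.96 * se, p + 1.96 * se   # roughly .40 to .70
```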

The same is true for the tons of articles that reported effect size estimates without confidence intervals.  For studies with just significant results this is not a problem because significance translates into a confidence interval that does not contain the value specified by the null-hypothesis, typically zero.  For a just significant result this means that the boundary of the CI is close to zero.  So, researchers are justified in interpreting the result as evidence about the sign of an effect, but the effect size is uncertain.  Nobody would rush to buy stock in a drug company that reported its new drug extends life expectancy by anywhere from 1 day to 3 years.  But if we are misled into focusing on an observed effect size of 1.5 years, we might be foolish enough to invest in the company and lose money.

In short, noisy studies with unreliable measures and wide confidence intervals cannot be used to make claims about effect sizes.   The reporting of standardized effect size measures can be useful for meta-analysis or to help future research in the planning of their studies, but researchers should never interpret their point estimates as observed effect sizes.

Final Conclusion

Although mathematics and statistics are fundamental sciences for all quantitative, empirical sciences, each scientific discipline has its own history, terminology, and unique challenges.  Political science differs from psychology in many ways.  On the one hand, political science has access to large representative samples because there is a lot of interest in those kinds of data and a lot of money is spent on collecting them.  These data make it possible to obtain relatively precise estimates. The downside is that many data are unique to a historic context. The 2016 election in the United States cannot be replicated.

Psychology is different.  Research budgets and ethics often limit sample sizes.  However, within-subject designs with many repeated measures can increase power, something political scientists cannot do.  In addition, studies in psychology can be replicated because the results are less sensitive to a particular historic context (and yes, there are many replicable findings in psychology that generalize across time and culture).

Gelman knows about as much about psychology as I know about political science. Maybe his article is more useful for political scientists, but psychologists would be better off if they finally recognized the important contributions of one of their own methodologists.

To paraphrase Cohen: Sometimes reading less is more, except for Cohen.