
The Replicability Revolution: My commentary to Zwaan et al.’s PPS article “Making replication mainstream”

Ulrich Schimmack
Department of Psychology, University of Toronto, Mississauga, ON L5L 1C6, Canada
doi:10.1017/S0140525X18000833, e147

“It is therefore inevitable that the ongoing correction of the scientific record damages the reputation of researchers, if this reputation was earned by selective publishing of significant results.”

Abstract: Psychology is in the middle of a replicability revolution. High-profile
replication studies have produced a large number of replication
failures. The main reason why replication studies in psychology often fail
is that original studies were selected for significance. If all studies were
reported, original studies would fail to produce significant results as
often as replication studies. Replications would be less contentious if
original results were not selected for significance.


The history of psychology is characterized by revolutions. This
decade is marked by the replicability revolution. One prominent
feature of the replicability revolution is the publication of replication
studies with nonsignificant results.

The publication of several high-profile replication failures has triggered a confidence crisis. Zwaan et al. have been active participants in the replicability revolution. Their target article addresses criticisms of direct replication.

One concern is the difficulty of re-creating original studies, which may explain replication failures, particularly in social psychology. This argument fails on three counts. First, it does not explain why published studies have an apparent success rate
greater than 90%. If social psychological studies were difficult to replicate, the success rate should be lower. Second, it is not clear why it would be easier to conduct conceptual replication studies that vary crucial aspects of a successful original study.

If social priming effects were, indeed, highly sensitive to contextual variations, conceptual replication studies would be even more likely to fail than direct replication studies; however, miraculously they always seem to work. The third problem with this argument is that it ignores selection for significance. It treats successful conceptual
replication studies as credible evidence, but bias tests reveal that these studies have been selected for significance and that many original studies that failed are simply not reported (Schimmack 2017; Schimmack et al. 2017).

A second concern about direct replications is that they are less informative than conceptual replications (Crandall & Sherman 2016). This argument is misguided because it assumes a successful outcome. If a conceptual replication study is successful, it increases the probability that the original finding was true and it expands the range of conditions under which an effect can be observed. However, the advantage of a conceptual replication study becomes a disadvantage when a study fails. For example,
if the original study showed that eating green jelly beans increases happiness and a conceptual replication study with red jelly beans does not show this effect, it remains unclear whether green jelly beans make people happier or not. Even the non-significant finding with red jelly beans is inconclusive because the result could be a false negative. Meanwhile, a failure to replicate the green jelly bean effect in a direct replication study is informative because it casts doubt on the original finding.

In fact, a meta-analysis of the original and replication study might produce a non-significant result and reverse the initial inference that green jelly beans make people happy. Crandall and Sherman’s argument rests on the false assumption that only significant studies are informative. This assumption is flawed because selection for significance renders significance uninformative (Sterling 1959).
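The meta-analytic point can be illustrated with a toy calculation (the z-scores and sample sizes below are hypothetical, chosen only for illustration): pooling a just-significant original study with a larger nonsignificant direct replication by the weighted Stouffer method yields a nonsignificant combined result.

```python
import math

def stouffer(zs, ns):
    """Combine z-scores with weights proportional to sqrt(n) (fixed-effect Stouffer method)."""
    num = sum(math.sqrt(n) * z for z, n in zip(zs, ns))
    return num / math.sqrt(sum(ns))

# Hypothetical numbers: a just-significant original study (z = 2.10, n = 40)
# and a larger direct replication that finds nothing (z = 0.20, n = 160).
pooled_z = stouffer([2.10, 0.20], [40, 160])
print(round(pooled_z, 2))  # 1.12 -- below 1.96, so no longer significant
```

The pooled evidence falls below the conventional significance threshold, reversing the inference drawn from the original study alone.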

A third argument against direct replication studies is that there are multiple ways to compare the results of original and replication studies. I believe the discussion of this point also benefits from taking publication bias into account. Selection for significance explains why the reproducibility project obtained only 36% significant results in direct replications of original studies with significant results (Open Science Collaboration 2015). As a result, the significant results of original studies are less credible than the nonsignificant results in direct replication studies. This generalizes to all comparisons of original studies and direct replication studies.

Once there is suspicion or evidence that selection for significance occurred, the results of original studies are less credible, and more weight should be given to replication studies that are not biased by selection for significance. Without selection for significance,
there is no reason why replication studies should be more likely to fail than original studies. If replication studies correct mistakes in original studies and use larger samples, they are actually more likely to produce a significant result than original studies.

Selection for Significance Explains Reputational Damage of Replication Failures 

Selection for significance also explains why replication failures are damaging to the reputation of researchers. The reputation of researchers is based on their publication record, and this record is biased in favor of successful studies. Thus, researchers’ reputations are inflated by selection for significance. Once an unbiased replication
produces a nonsignificant result, the unblemished record is tainted, and it becomes apparent that a perfect published record is illusory and not the result of research excellence (a.k.a. flair). Thus, unbiased failed replication studies not only provide new evidence; they also
undermine the credibility of existing studies. Although positive illusions may be beneficial for researchers’ eminence, they have no place in science. It is therefore inevitable that the ongoing correction of the scientific record damages the reputation of researchers, if this reputation was earned by selective publishing of significant results.
In this way direct replication studies complement statistical tools that can reveal selective publishing of significant results in original studies (Schimmack 2012; 2014; Schimmack & Brunner, submitted for publication).


Schimmack, U. (2012) The ironic effect of significant results on the credibility of
multiple-study articles. Psychological Methods 17:551–56. [US]

Schimmack, U. (2014) The test of insufficient variance (TIVA): A new tool for the
detection of questionable research practices. Working paper. Available at:

Schimmack, U. (2017) ‘Before you know it’ by John A. Bargh: A quantitative book
review. Available at:

Schimmack, U. & Brunner, J. (submitted for publication) Z-Curve: A method for
estimating replicability based on test statistics in original studies. [US]

Schimmack, U., Heene, M. & Kesavan, K. (2017) Reconstruction of a train wreck:
How priming research went off the rails. Blog post. Available at: https://replicationindex.



Statistics Wars and Submarines

I am not the first to describe the fight among statisticians for eminence as a war (Mayo Blog).   The statistics war is as old as modern statistics itself.  The main parties in this war are the Fisherians, Bayesians, and Neymanians (or Neyman-Pearsonians).

Fisherians use p-values as evidence to reject the null-hypothesis; the smaller the p-value the better.

Neymanians distinguish between type-I and type-II errors and use regions of test statistics to reject null-hypotheses or alternative hypotheses.  They also use confidence intervals to obtain interval estimates of population parameters.

Bayesians differ from the Fisherians and Neymanians in that their inferences combine information obtained from data with prior information. Bayesians sometimes fight with each other about the proper prior information. Some prefer subjective priors that are ideally based on prior knowledge. Others prefer objective priors that do not require any prior knowledge and can be applied to all statistical problems (Jeffreysians).  Although they fight with each other, they are united in their fight against Fisherians and Neymanians, whom they collectively call frequentists.

The statistics war has been going on for over 80 years, and there has been no winner.  Unlike in the empirical sciences, there are no new data that could resolve the controversies.  Thus, the statistics war is more like wars in philosophy, where philosophers are still fighting over the right way to define fundamental concepts like justice or happiness.

For applied researchers these statistics wars can be very confusing because a favorite weapon of statisticians is propaganda.  In this blog post, I examine the Bayesian Submarine (Morey et al., 2016), which aims to sink the ship of Neymanian confidence intervals.

The Bayesian Submarine 

Submarines are fascinating and are currently making major discoveries about sea life.  The Bayesian submarine is rather different.  It is designed to convince readers that confidence intervals provide no meaningful information about population parameters and should be abandoned in favor of Bayesian interval estimation.

Example 1: The lost submarine
In this section, we present an example taken from the confidence interval literature (Berger and Wolpert, 1988; Lehmann, 1959; Pratt, 1961;Welch, 1939) designed to bring into focus how CI theory works. This example is intentionally simple; unlike many demonstrations of CIs, no simulations are needed, and almost all results can be derived by readers with some training in probability and geometry. We have also created interactive versions of our figures to aid readers in understanding the example; see the figure captions for details.

A 10-meter-long research submersible with several people on board has lost contact with its surface support vessel. The submersible has a rescue hatch exactly halfway along
its length, to which the support vessel will drop a rescue line. Because the rescuers only get one rescue attempt, it is crucial that when the line is dropped to the craft in the deep water that the line be as close as possible to this hatch. The researchers on the support vessel do not know where the submersible is, but they do know that it forms two distinctive bubbles. These bubbles could form anywhere along the craft’s length, independently, with equal probability, and float to the surface where they can be seen by the support vessel.

The situation is shown in Fig. 1a. The rescue hatch is the unknown location θ, and the bubbles can rise from anywhere with uniform probability between θ − 5 meters (the
bow of the submersible) to θ+5 meters (the stern of the submersible). 


Let’s translate this example into a standard statistical problem.  It is uncommon to have a uniform distribution of observed data around a population parameter.  Most commonly, we assume that observations are more likely to cluster close to the population parameter and that deviations between the population parameter and an observed value reflect some random process.  However, a bounded uniform distribution also allows us to compute the standard deviation of the randomly occurring data.

round(sd(runif(100000, 0, 10)), 2)  # ≈ 2.89 (theoretical value: 10/sqrt(12) = 2.89)

We only have two data points to construct a confidence interval.  Evidently, the standard error based on a sample size of n = 2 is large (1/sqrt(2) = .71, or 71% of a standard deviation).  We can use the typical formula for the standard error, SD/sqrt(N), to estimate it as 2.89/1.41 = 2.04.
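As a sanity check on these numbers, a short simulation (in Python rather than R, purely for illustration) reproduces both the standard deviation of bubble locations along a 10-meter hull and the standard error of the mean of two bubbles:

```python
import random
import statistics

random.seed(1)
N = 100_000

# Standard deviation of a single bubble location, uniform over a 10-meter hull
singles = [random.uniform(0, 10) for _ in range(N)]
sd_single = statistics.pstdev(singles)   # ≈ 2.89 (theory: 10/sqrt(12))

# Standard error of the mean of two bubbles
means = [(random.uniform(0, 10) + random.uniform(0, 10)) / 2 for _ in range(N)]
se_mean = statistics.pstdev(means)       # ≈ 2.04 (theory: 2.89/sqrt(2))
```

Both simulated values agree with the analytic figures used in the text.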

To construct a 95% confidence interval, we have to multiply the standard error by the critical t-value for a probability of .975, which leaves .025 for the error region. Multiplying this by 2 gives a two-tailed error probability of .025 × 2 = 5%.  That is, 5% of observations could be more extreme than the boundaries of the confidence interval just by chance alone.  With 1 degree of freedom, we get a critical value of 12.71.

n = 2; alpha = .05; qt(1 - alpha/2, n - 1)  # returns 12.7062

The width of the CI is determined by the standard deviation and sample size.  So, the information is sufficient to say that the 95% CI is the observed mean +/- 2.04m × 12.71 = 25.93m.
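These numbers can be checked without statistical tables: with n = 2 the t distribution has one degree of freedom, which is the standard Cauchy distribution, so its quantile function has a closed form (this identity is the only ingredient beyond the figures above):

```python
import math

def t_quantile_1df(p):
    # t with 1 df is the standard Cauchy, whose quantile function is tan(pi*(p - 1/2))
    return math.tan(math.pi * (p - 0.5))

crit = t_quantile_1df(0.975)    # ≈ 12.71, matching qt(.975, 1) in R
half_width = 2.04 * crit        # ≈ 25.93 m, far wider than the 10 m submarine
```

The half-width of roughly 26 meters dwarfs the 10-meter hull, which is the point made in the text.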

Hopefully it is obvious that this 95%CI covers 100% of all possible values because the length of the submarine is limited to 10m.

In short, two data points provide very little information and make it impossible to say anything with confidence about the location of the hatch.  Even without these calculations we can say with 100% confidence that the hatch cannot be further from the mean of the two bubbles than 5 meters because the maximum distance is limited by the length of the submarine.

The submarine problem is also strange because the width of confidence intervals is irrelevant for the rescue operation. With just one rescue line, the optimal place to drop it is the mean of the two bubbles (see Figure; all intervals are centered on the same point).  So, the statisticians do not have to argue, because they all agree on the location where to drop the rescue line.

How is the Bayesian submarine supposed to sink confidence intervals? 

The rescuers first note that from observing two bubbles, it is easy to rule out all values except those within five meters of both bubbles because no bubble can occur further than 5 meters from the hatch.

Importantly, this only works for dependent variables with bounded values. For example, on an 11-point scale ranging from 0 to 10, it is obvious that no population mean can deviate from the middle of the scale (5) by more than 5 points.  Even there it is not very relevant because the goal of research is not to find the middle of the scale, but to estimate the actual population parameter, which could be anywhere between 0 and 10. Thus, the submarine example does not map onto any empirical problem of interval estimation.

1. A procedure based on the sampling distribution of the mean
The first statistician suggests building a confidence procedure using the sampling distribution of the mean M . The sampling distribution of  M has a known triangular distribution with θ as the mean. With this sampling distribution, there is a 50 % probability that  M will differ from θ by less than 5 − 5/ √2, or about 1.46m.  

This leads to the confidence procedure M ± (5 − 5/√2),

which we call the “sampling distribution” procedure. This procedure also has the familiar form ¯x ± C × SE, where here the standard error (that is, the standard deviation of the estimate M) is known to be 2.04.

It is important to note that the authors use a 50% CI.  In this special case, the confidence interval is equivalent to the standard error because the standard error is multiplied by a critical value of 1 to determine the width of the confidence interval.

n = 2; alpha = .50; qt(1 - alpha/2, n - 1)  # returns 1

The choice of a 50% CI is also not typical in actual research settings. It is not clear why we should accept such a high error rate, especially when the survival of the crew members is at stake.  Imagine that the submarine had an emergency system that releases bubbles from the hatch, but the bubbles do not go straight to the surface. Yet there are hundreds of bubbles. Would we compute a 50% confidence interval, or would we want a 99% confidence interval to bring the rescue line as close to the hatch as possible?
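A back-of-the-envelope sketch shows how quickly more bubbles would tighten the interval (using the normal approximation, which is reasonable once n is large; the bubble counts are hypothetical):

```python
import math
from statistics import NormalDist

sd = 10 / math.sqrt(12)               # SD of a uniform over a 10 m hull, ≈ 2.89
z99 = NormalDist().inv_cdf(0.995)     # two-sided 99% critical value, ≈ 2.58

for n in (2, 20, 200):
    half = z99 * sd / math.sqrt(n)    # half-width of the 99% CI
    print(n, "bubbles -> +/-", round(half, 2), "m")
```

With a couple hundred bubbles, even a 99% interval pins the hatch down to about half a meter, which is what a rescue operation would actually want.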

We still haven’t seen how the Bayesian submarine sinks confidence intervals.  To make their case, the Bayesian soldiers compute several possible confidence intervals and show how they lead to different conclusions (see Figure). They suggest that this is a fundamental problem for confidence intervals.

It is clear, first of all, why the fundamental confidence fallacy is a fallacy. 

They are happy to join forces with the Fisherians in their attack on Neymanian confidence intervals, even though they usually attack Fisher for his use of p-values.

As Fisher pointed out in the discussion of CI theory mentioned above, for any given problem — as for this one — there are many possible confidence procedures. These confidence procedures will lead to different confidence intervals. In the case of our submersible confidence procedures, all confidence intervals are centered around M, and so the intervals will be nested within one another.

If we mistakenly interpret these observed intervals as having a 50 % probability of containing the true value, a logical problem arises. 

However, shortly after the authors bring up this fundamental problem for confidence intervals, they mention that Neyman solved this logical problem.

There are, as far as we know, only two general strategies for eliminating the threat of contradiction from relevant subsets: Neyman’s strategy of avoiding any assignment of probabilities to particular intervals, and the Bayesian strategy of always conditioning on the observed data, to be discussed subsequently.

Importantly, Neyman’s solution to the problem does not support the Bayesians’ conclusion that we should not make probabilistic statements based on confidence intervals. Instead, he argued that we should apply the long-run success rate of the procedure to make probability judgments based on confidence intervals.  This use of the term probability can be illustrated with the submarine example. A simple simulation of the submarine problem shows that the 50% confidence interval contains the population parameter 50% of the time.
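Such a simulation is easy to run. The sketch below (Python, with an arbitrarily chosen true hatch location) draws two uniform bubbles, builds the 50% "sampling distribution" interval M ± (5 − 5/√2), and counts how often it covers θ:

```python
import math
import random

random.seed(0)
theta = 3.7                      # true (unknown) hatch location; arbitrary choice
half = 5 - 5 / math.sqrt(2)      # ≈ 1.46 m, half-width of the 50% procedure
trials = 100_000
hits = 0
for _ in range(trials):
    # Two bubbles rise independently from anywhere along the 10 m hull
    b1 = random.uniform(theta - 5, theta + 5)
    b2 = random.uniform(theta - 5, theta + 5)
    m = (b1 + b2) / 2
    if m - half <= theta <= m + half:
        hits += 1
coverage = hits / trials          # ≈ 0.50
```

The empirical coverage comes out at 50%, exactly the property Neyman claimed for the procedure.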

It is therefore reasonable to place relatively modest confidence in the belief that the hatch of the submarine is within the confidence interval.  To be more confident, it would be necessary to lower the error rate, but this makes the interval wider. The only way to be confident with a narrow interval is to collect more data.

Confidence intervals do have exactly the properties that Neyman claimed they have, and there is no logical inconsistency in stating that we cannot quantify the probability of a singular event while using the long-run outcomes of similar events to make claims about the probability of being right or wrong in a particular case.

Neyman compares this to gambling where it is impossible to say anything about the probability of a particular bet unless we know the long-run probability of similar bets. Researchers who use confidence intervals are no different from people who drive their cars with confidence because they never had an accident or who order a particular pizza because they ordered it many times before and liked it.  Without any other relevant information, the use of long-run frequencies to assign probabilities to individual events is not a fallacy.


So, does the Bayesian submarine sink the confidence interval ship?  Does the example show that interpreting confidence intervals as probabilities is a fallacy and a misinterpretation of Neyman?  I don’t think so.

The probability of winning a coin toss (with a fair coin) is 50%.  What is the probability that I win any specific game?  It is not defined.  It is 100% if I win and 0% if I do not. This is trivial, and Neyman made it clear that he was not using the term probability in this sense.  He also made it clear that he used the term probability to refer to the long-run proportion of correct decisions, and most people would feel very confident in their beliefs and decisions if the odds of winning were 95%.  Bayesians do not deny that 95% confidence intervals give the right answer 95% of the time. They just object to the phrase, “There is a 95% probability that the confidence interval includes the population parameter” when a researcher uses a 95% confidence interval.  Similarly, they would object to somebody saying “There is a 99.9% chance that I am pregnant” when a pregnancy test with a 0.1% false positive rate shows a positive result.  The woman is either pregnant or she is not, but we do not know this until she repeats the test several times or an ultrasound shows it.  As long as there is uncertainty about the actual truth, the long-run frequency of false positives quantifies the rational belief in being pregnant or not.

What should applied researchers do?  They should use confidence intervals with confidence.  If Bayesians want to argue with them, all they need to say is that they are using a procedure that has a 95% probability of giving the right answer and that it is not possible to say whether a particular result is one of the few errors.  The best way to address this question is not to argue about semantics, but to do a replication study.  And that is the good news.  While statisticians are busy fighting with each other, empirical scientists can make actual progress by collecting more informative data.

In conclusion, the submarine problem does not convince me for many reasons.  Most important, it is not even necessary to create any intervals to decide on the best action.  Absent any other information, the best bet is to drop the rescue line right in the middle of the two bubbles.  This is very fortunate for the submarine crew because otherwise the statisticians would still be arguing about the best course of action, while the submarine is running out of air.

The Fallacy of Placing Confidence in Bayesian Salvation

Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and
Eric-Jan Wagenmakers (2016), henceforth psycho-Bayesians, have a clear goal.  They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.”   In this article, the authors claim that everybody, including textbook writers in statistics, misunderstood Neyman’s classic article on interval estimation.   What are the prior odds that, after 80 years, a group of psychologists discovered a fundamental flaw in the interpretation of confidence intervals (H1) versus that a few psychologists are either unable or unwilling to understand Neyman’s article (H2)?

Underlying this quest for change in statistical practices lies the ultimate attribution error: that Fisher’s p-values or Neyman-Pearson significance testing, with or without confidence intervals, are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature.  The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates; the long-run probability of a false positive is not 5%, as stated in each article, but can approach 100% (Rosenthal, 1979; Sterling, 1959; Simmons, Nelson, & Simonsohn, 2011).
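The inflation is easy to demonstrate. In this minimal sketch, a researcher measures two outcomes (both truly null, no real effect) and reports whichever is significant, one of the simplest questionable practices; the nominal 5% error rate nearly doubles:

```python
import random
from statistics import NormalDist

random.seed(42)
crit = NormalDist().inv_cdf(0.975)   # two-sided 5% cutoff, ≈ 1.96
trials = 100_000
false_pos = 0
for _ in range(trials):
    # Two independent outcome measures under the null hypothesis
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    # Report a "significant" result if either test crosses the threshold
    if abs(z1) > crit or abs(z2) > crit:
        false_pos += 1
rate = false_pos / trials             # ≈ 0.0975, nearly double the nominal 5%
```

With more outcomes, covariates, and optional stopping, the same logic drives the false positive rate far higher, which is the point made by Simmons, Nelson, and Simonsohn (2011).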

You will not find a single article by Psycho-Bayesians that will acknowledge the contribution of unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics are fundamentally flawed and misunderstood when they are used in industry to produce smartphones and other technology that requires tight error control in mass production. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, but warned readers that confidence intervals provide no useful information after the data are collected.

Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”

This is utter nonsense. Of course, Neyman was asking us to interpret confidence intervals after we collected data, because we need a sample to compute a confidence interval. It is hard to believe that this could have passed peer review in a statistics journal, and it is not clear who was qualified to review this paper for Psychonomic Bulletin & Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the statements that follow.  In fact, Neyman was arguing against Bayesian attempts to estimate probabilities that can be applied to a single event.

It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M  to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.  

The parameter M is an unknown constant and no probability statement concerning its value may be made, that is, except for the hypothetical and trivial ones P{1 < M < 2} = 1 if 1 < M < 2, or 0 if either M < 1 or 2 < M, which we have decided not to consider.

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem.  We don’t even need to observe data and compute a confidence interval.  The statement that a specific unknown number is between two other numbers (1 and 2) or not is either TRUE (P = 1) or FALSE (P = 0).  To imply that this trivial observation leads to the conclusion that we cannot make post-data inferences based on confidence intervals is ridiculous.

Neyman continues.

The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “ successes ” in a long series of future “ games ” is approximately known. On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval.  Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval.  This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data.  That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations),  the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician?   The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics.  His methods are taught in hundreds of text books, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen).  What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of the psycho-statisticians’ deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals.  For example, the authors cite Jeffreys’s (1961) theory of probability, which gives the impression that Jeffreys’s work followed Neyman’s. However, an accurate account of the history reveals that Jeffreys’s work preceded Neyman’s and that Neyman discussed some of the problems with Jeffreys’s approach in great detail.  Neyman’s critical article was even “communicated” by Jeffreys (these were different times, when scientists had open conflicts with honor and integrity and actually engaged in scientific debates).


Given that Jeffreys’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffreys’s approach. Neyman first gives a thorough account of Jeffreys’s approach (those were the days).


Neyman then offers his critique of Jeffreys’s approach.

It is known that, as far as we work with the conception of probability as adopted in
this paper, the above theoretically perfect solution may be applied in practice only
in quite exceptional cases, and this for two reasons. 

Importantly, he does not challenge the theory.  He only points out that the theory is not practical because it requires knowledge that is often not available.  That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters.   This is the problem that has plagued subjective Bayesian approaches.

Neyman then discusses Jeffreys's approach to solving this problem. I am not claiming to be a statistical expert who can decide whether Neyman or Jeffreys is right. Even statisticians have been unable to resolve these issues, and I believe the consensus is that Bayesian credibility intervals and Neyman's confidence intervals are both mathematically viable approaches to interval estimation, with different strengths and weaknesses.


I am only trying to point out to unassuming readers of the fallacy article that both approaches are as old as statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity. Using a selective quote from Neyman to dismiss confidence intervals and then omitting Neyman's critique of Bayesian credibility intervals is deceptive and shows an unwillingness or inability to engage in open scientific examination of the arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers' effort to convert psychologists into Bayesian statisticians is similar to Bem's (2011) attempt to convert psychologists into believers in parapsychology; or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers is misrepresenting the way classic statistics works and ignoring the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show. Wagenmakers used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception. However, a rebuttal by Bem showed that Bayesian statistics also supported extra-sensory perception with different and more reasonable priors. Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem's article is not the way he "analyzed" the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data. It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification. The problems with Bem's data have been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2018). There has been no rebuttal from Bem, and he admits to the use of practices that invalidate the published p-values. So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics. The problem is the abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are rarely used. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).

In conclusion, it is not a fallacy to interpret confidence intervals as a method for interval estimation of unknown parameters. It would be a fallacy to cite Morey et al.'s article as a valid criticism of confidence intervals. This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman's or Jeffreys's approach to interval estimation.

P.S.  More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.

I have to admit some Schadenfreude when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior. While Bayesians are still fighting over the right priors, practical researchers may be better off using statistical methods that do not require priors, like, hm, confidence intervals?

P.P.S.  Science requires trust.  At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.











A Clarification of P-Curve Results: The Presence of Evidence Does Not Imply the Absence of Questionable Research Practices

This post is not a criticism of p-curve.  The p-curve authors have been very clear in their writing that p-curve is not designed to detect publication bias.  However, numerous articles make the surprising claim that they used p-curve to test publication bias.  The purpose of this post is to simply correct a misunderstanding of p-curve.

Questionable Research Practices and Excessive Significance

Sterling (1959) pointed out that psychology journals have a surprisingly high success rate: over 90% of articles reported statistically significant results in support of authors' predictions. This success rate would be surprising even if most predictions in psychology were true. The reason is that the results of a study are not only influenced by cause-effect relationships. Another factor that influences the outcome of a study is sampling error. Even if researchers are nearly always right in their predictions, some studies will fail to provide sufficient evidence for the predicted effect because sampling error makes it impossible to detect the effect. The ability of a study to show a true effect is called power. Just as bigger telescopes are needed to detect more distant stars with a weaker signal, bigger sample sizes are needed to detect small effects (Cohen, 1962, 1988). Sterling et al. (1995) pointed out that the typical power of studies in psychology does not justify the high success rate in psychology journals. In other words, the success rate was too good to be true. This means that published articles are selected for significance.
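The link between effect size, sample size, and power can be sketched with a simple normal approximation (a rough illustration, not the exact t-test computation; the effect sizes and sample sizes below are hypothetical examples):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(d, n_per_group, alpha=0.05):
    # Normal approximation to the power of a two-sided two-sample t-test:
    # under the alternative, the test statistic is roughly N(d * sqrt(n/2), 1).
    z_crit = 1.959964  # two-sided 5% critical value
    noncentrality = d * math.sqrt(n_per_group / 2)
    return 1 - normal_cdf(z_crit - noncentrality)

# A "medium" effect (d = .5) with a small sample mostly fails...
print(round(approx_power(0.5, 20), 2))   # ~0.35
# ...while Cohen's recommended sample size reaches the conventional 80%
print(round(approx_power(0.5, 64), 2))   # ~0.81
```

With typical power well below 90%, a journal success rate above 90% is only possible if results are selected for significance.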

The bias in favor of significant results is typically called publication bias (Rosenthal, 1979).  However, the term publication bias does not explain the discrepancy between estimates of statistical power and success rates in psychology journals.  John et al. (2012) listed a number of questionable research practices that can inflate the percentage of significant results in published articles.

One mechanism is simply not to report non-significant results. Rosenthal (1979) suggested that non-significant results end up in the proverbial file drawer. That is, a whole data set remains unpublished. The other possibility is that researchers use multiple exploratory analyses to find a significant result and do not disclose their fishing expedition. The latter practices are now widely known as p-hacking.

Unlike John et al. (2012), the p-curve authors make a big distinction between not disclosing an entire dataset (publication bias) and not disclosing all statistical analyses of a dataset (p-hacking).

QRP = Publication Bias + P-Hacking

We Don’t Need Tests of Publication Bias

The p-curve authors assume that publication bias is unavoidable.

“Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results.”  (Simonsohn, Nelson, Simmons, 2014).

“By the Way, of Course There is Publication Bias. Virtually all published studies are significant (see, e.g., Fanelli, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam,
1995), and most studies are underpowered (see, e.g., Cohen, 1962). It follows that a considerable number of unpublished failed studies must exist. With this knowledge already in hand, testing for publication bias on paper after paper makes little
sense” (Simonsohn, 2012, p. 597).

“Yes, p-curve ignores p>.05 because it acknowledges that we observe an unknowably small and non-random subset of p-values >.05.”  (personal email, January 18, 2015).

I hope these quotes make it crystal clear that p-curve is not designed to examine publication bias because the authors assume that selection for significance is unavoidable.  Any statistical test that reveals no evidence of publication bias is a false negative result because the sample size was not large enough to detect it.

Another concern by Uri Simonsohn is that bias tests may reveal statistically significant bias that has no practical consequences.

Consider a literature with 100 studies, all with p < .05, but where the implied statistical
power is “just” 97%. Three expected failed studies are missing. The test from the critiques would conclude there is statistically significant publication bias; its magnitude, however, is trivial. (Simonsohn, 2012, p. 598). 

k.sig = 100; k.studies = 100; power = .97; pbinom(k.studies-k.sig, k.studies, 1-power) = 0.048

This is a valid criticism that applies to all p-values. A p-value only provides information about the contribution of random sampling error. A p-value of .048 suggests that it is unlikely to observe only significant results, even if all 100 studies have 97% power to produce a significant result. However, with 97% observed power, the 100 studies provide credible evidence for an effect, and even the inflation of the average effect size is minimal.

A different conclusion would follow from a p-value less than .05 in a set of 7 studies that all show significant results.

k.sig = 7; k.studies = 7; power = .62; pbinom(k.studies-k.sig, k.studies, 1-power) = 0.035

Rather than showing small bias with a large set of studies, this finding shows large bias with a small set of studies.  P-values do not distinguish between these two scenarios. Both outcomes are equally unlikely.  Thus, information about the probability of an event should always be interpreted in the context of the effect.  The effect size is simply the difference between the expected and observed rate of significant results.  In Simonsohn’s example, the effect size is small (1 – .97 = .03).  In the second example, the discrepancy is large (1 – .62 = .38).
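The two scenarios can also be reproduced outside of R; here is a minimal Python rendering of the pbinom one-liners above (the function name is mine):

```python
from math import comb

def excess_significance_p(n_studies, n_sig, power):
    # P(at least n_sig significant results in n_studies attempts): the
    # binomial tail behind pbinom(n_studies - n_sig, n_studies, 1 - power)
    return sum(comb(n_studies, k) * power ** k * (1 - power) ** (n_studies - k)
               for k in range(n_sig, n_studies + 1))

# Nearly identical p-values, very different bias effect sizes:
print(round(excess_significance_p(100, 100, 0.97), 3))  # 0.048, bias = .03
print(round(excess_significance_p(7, 7, 0.62), 3))      # 0.035, bias = .38
```

The p-values are almost the same, but the discrepancy between expected and observed success rates differs by an order of magnitude.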

The previous scenarios assume that only significant results are reported. However, in sciences that use preregistration to reduce deceptive publishing practices (e.g., medicine), non-significant results are more common. When non-significant results are reported, bias tests can be used to assess the extent of bias.

For example, a literature may report 10 studies with only 4 significant results and a median observed power of 30%. In this case, the bias is small (.40 – .30 = .10), and a conventional meta-analysis would produce only slightly inflated estimates of the average effect size. In contrast, p-curve would discard over 50% of the studies because it assumes that the non-significant results are not trustworthy. This is an unnecessary loss of information that could be avoided by testing for publication bias.

In short, p-curve assumes that publication bias is unavoidable. Hence, tests of publication bias are unnecessary and non-significant results should always be discarded.

Why Do P-Curve Users Think P-Curve is a Publication Bias Test?

Example 1

I conducted a literature research on studies that used p-curve and I was surprised by numerous claims that p-curve is a test of publication bias.

"Simonsohn, Nelson, and Simmons (2014a, 2014b, 2016) and Simonsohn, Simmons, and Nelson (2015) introduced pcurve as a method for identifying publication bias" (Steiger & Kühberger, 2018, p. 48).

However, the authors do not explain how p-curve detects publication bias. Later on, they correctly point out that p-curve is a method that can correct for publication bias.

"P-curve is a good method to correct for publication bias, but it has drawbacks" (Steiger & Kühberger, 2018, p. 48).

Thus, the authors seem to confuse detection of publication bias with correction for publication bias.  P-curve corrects for publication bias, but it does not detect publication bias; it assumes that publication bias is present and a correction is necessary.

Example 2

An article in the medical journal JAMA Psychiatry also claimed that they used p-curve and other methods to assess publication bias.

Publication bias was assessed across all regions simultaneously by visual inspection of funnel plots of SEs against regional residuals and by using the excess significance test,  the P-curve method, and a multivariate analogue of the Egger regression test (Bruger & Howes, 2018, p. 1106).  

After reporting the results of several bias tests, the authors report the p-curve results.

P-curve analysis indicated evidential value for all measures (Bruger & Howes, 2018, p. 1106).

The authors seem to confuse presence of evidential value with absence of publication bias. As discussed above,  publication bias can be present even if studies have evidential value.

Example 3

To assess publication bias, we considered multiple indices. Specifically, we evaluated Duval and Tweedie’s Trim and Fill Test, Egger’s Regression Test, Begg and Mazumdar Rank Correlation Test, Classic Fail-Safe N, Orwin’s Fail-Safe N, funnel plot symmetry, P-Curve Tests for Right-Skewness, and Likelihood Ratio Test of Vevea and Hedges Weight-Function Model.

As in the previous example, the authors confuse evidence for evidential value (a significant right-skewed p-curve) with evidence for the absence of publication bias.

Example 4

The next example even claims that p-curve can be used to quantify the presence of bias.

Publication bias was investigated using funnel plots and the Egger regression asymmetry test. Both the trim and fill technique (Duval & Tweedie, 2000) and p-curve (Simonsohn, Nelson, & Simmons, 2014a, 2014b) technique were used to quantify the presence of bias (Korrel et al., 2017, p. 642).

The actual results section only reports that the p-curve is right skewed.

The p-curve for the remaining nine studies (p < .025) was significantly right skewed
(binomial test: p = .002; continuous test full curve: Z = -9.94, p < .0001, and half curve Z = -9.01, p < .0001) (Korrel et al., 2017, p. 642)

These results do not assess or quantify publication bias. One might consider the reported z-scores a quantitative measure of evidential value, as larger z-scores are less probable under the nil-hypothesis that all significant results are false positives. Nevertheless, strong evidential value (e.g., 100 studies with 97% power) does not imply that publication bias is absent, nor does it mean that publication bias is small.

A set of 1000 studies with 10% power is expected to produce 900 non-significant results and 100 significant results.  Removing the non-significant results produces large publication bias, but a p-curve analysis shows strong evidence against the nil-hypothesis that all studies are false positives.

Z = rnorm(1000, qnorm(.10, 1.96))  # 10% power: P(Z > 1.96) = .10
Z.sig = Z[Z > 1.96]  # the significant results that get "published"
Stouffer.Z = sum(Z.sig - 1.96) / sqrt(length(Z.sig))
Stouffer.Z  # = 4.89 in the run reported here

The reason is that p-curve is a meta-analysis, and the results depend on the strength of evidence in individual studies and the number of studies. Strong evidence can be the result of many studies with weak evidence or a few studies with strong evidence. Thus, p-curve is a meta-analytic method that combines information from several small studies to draw inferences about a population parameter. The main difference from older meta-analytic methods is that older methods assume that publication bias is absent, whereas p-curve assumes that publication bias is present. Neither method assesses whether publication bias is present, nor do they quantify the amount of publication bias.

Example 5

Sala and Gobet (2017) explicitly make the mistake of equating evidence for evidential value with evidence against publication bias.

Finally, a p-curve analysis was run with all the p values < .05 related to positive effect sizes (Simonsohn, Nelson, & Simmons, 2014). The results showed evidential values (i.e., no evidence of publication bias), Z(9) = -3.39, p = .003.  (p. 676).

As discussed in detail before, this is not a valid inference.

Example 6

Ironically, the interpretation of p-curve results as evidence that there is no publication bias contradicts the fundamental assumption of p-curve that publication bias is always present.

The danger is that misuse of p-curve as a test of publication bias may give the false impression that psychological scientists are reporting their results honestly, while actual bias tests show that this is not the case.

It is therefore problematic if authors in high impact journals (not necessarily high quality journals) claim that they found evidence for the absence of publication bias based on a p-curve analysis.

To check whether this research field suffers from publication bias, we conducted p-curve analyses (Simonsohn, Nelson, & Simmons, 2014a, 2014b) on the most extended data set of the current meta-analysis (i.e., psychosocial correlates of the dark triad traits), using an on-line application ( As can be seen in Figure 2, for each of the dark triad traits, we found an extremely right-skewed p-curve, with statistical tests indicating that the studies included in our meta-analysis, indeed, contained evidential value (all ps < .001) and did not point in the direction of inadequate evidential value (all ps non-significant). Thus, it is unlikely that the dark triad literature is affected by publication bias (Muris, Merckelbach, Otgaar, & Meijer, 2017).

Once more, presence of evidential value does not imply absence of publication bias!

Evidence of P-Hacking  

Publication bias is not the only reason for the high success rates in psychology.  P-hacking will also produce more significant results than the actual power of studies warrants. In fact, the whole purpose of p-hacking is to turn non-significant results into significant ones.  Most bias tests do not distinguish between publication bias and p-hacking as causes of bias.  However, the p-curve authors make this distinction and claim that p-curve can be used to detect p-hacking.

Apparently, we should not assume that p-hacking is just as prevalent as publication bias, which would make testing for p-hacking just as irrelevant.

The problem is that it is a lot harder to distinguish p-hacking from publication bias than the p-curve authors imply, and their p-curve test of p-hacking will only work under very limited conditions. Most of the time, the p-curve test of p-hacking will fail to provide evidence of p-hacking, and this result can be misinterpreted as evidence that results were obtained without p-hacking, which is a logical fallacy.

This mistake was made by Winternitz, Abbate, Huchard, Havlicek, & Gramszegi (2017).

Fourth and finally, as bias for publications with significant results can rely more on the P-value than on the effect size, we used the Pcurve method to test whether the distribution of significant P-values, the ‘P-curve’, indicates that our studies have evidential value and are free from ‘p-hacking’ (Simonsohn et al. 2014a, b).

The problem is that the p-curve test of p-hacking only works when evidential value is very low and only for some specific forms of p-hacking. For example, researchers can p-hack by testing many dependent variables. Selecting significant dependent variables is no different from running many studies with a single dependent variable and selecting entire studies with significant results; it is just more efficient. In this case, the p-curve would not show the left-skewed shape that is considered diagnostic of p-hacking.

Even a flat p-curve would merely show lack of evidential value, but it would be wrong to assume that p-hacking was not used. To demonstrate this, I submitted the results from Bem's (2011) infamous "feeling the future" article to a p-curve analysis.

[Figure: pcurve.bem.png — p-curve of Bem's (2011) results]

The p-curve analysis shows a flat p-curve. This shows lack of evidential value, under the assumption that questionable research practices were used to produce 9 out of 10 significant (p < .05, one-tailed) results. However, there is no evidence that the results are p-hacked if we rely on a left-skewed p-curve as evidence of p-hacking.

One possibility is that Bem did not p-hack his studies. However, this would imply that he ran 20 studies for each significant result. With sample sizes of 100 participants per study, this would imply that he tested roughly 20,000 participants. This seems unrealistic, and Bem states that he reported all studies that were conducted. Moreover, analyses of the raw data showed peculiar patterns suggesting that some form of p-hacking was used. Thus, this example shows that p-curve is not very effective at revealing p-hacking.

It is also interesting that the latest version of p-curve, p-curve 4.06, no longer tests for left-skewness of distributions and does not mention p-hacking. This change suggests that the authors realized the ineffectiveness of p-curve in detecting p-hacking (I didn't ask the authors for comments, but they are welcome to comment here or elsewhere on this change in their app).

It is problematic if meta-analysts assume that p-curve can reveal p-hacking and infer from a flat or right-skewed p-curve that the data are not p-hacked.  This inference is not warranted because absence of evidence is not the same as evidence of absence.


P-curve is a family of statistical tests for meta-analyses of sets of studies.  One version is an effect size meta-analysis; others test the nil-hypothesis that the population effect size is zero.  The novel feature of p-curve is that it assumes that questionable research practices undermine the validity of traditional meta-analyses that assume no selection for significance. To correct for the assumed bias, observed test statistics are corrected for selection bias (i.e., p-values between .05 and 0 are multiplied by 20 to produce p-values between 0 and 1 that can be analyzed like unbiased p-values).  Just like regular meta-analysis, the main result of a p-curve analysis is a combined test-statistic or effect size estimate that can be used to test the nil-hypothesis.  If the nil-hypothesis can be rejected, p-curve analysis suggests that some effect was observed.  Effect size p-curve also provides an effect size estimate for the set of studies that produced significant results.
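The correction described above, rescaling significant p-values into "pp-values," can be sketched as follows. This is only a schematic of the continuous full-curve test, with a home-made inverse normal CDF; the actual app's half-curve test and sign conventions differ in detail:

```python
import math

def normal_ppf(q):
    # Inverse standard normal CDF by bisection (stdlib only)
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def pcurve_stouffer_z(p_values, alpha=0.05):
    # Under selection, a significant p is uniform on (0, alpha) if H0 is true,
    # so pp = p / alpha (i.e., multiply by 20 for alpha = .05) is uniform on (0, 1).
    pps = [p / alpha for p in p_values if p < alpha]
    # Stouffer combination: convert each pp to a z-score and pool
    zs = [normal_ppf(1 - pp) for pp in pps]
    return sum(zs) / math.sqrt(len(zs))

# Three modestly significant results carry little evidential value:
print(round(pcurve_stouffer_z([0.001, 0.02, 0.04]), 2))  # ~0.85, well below 1.65
```

A pooled z above the one-sided critical value would reject the nil-hypothesis that all significant results are false positives; here it does not.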

Just like regular meta-analyses, p-curve is not a bias test. It does not test whether publication bias exists and it fails as a test of p-hacking under most circumstances. Unfortunately, users of p-curve seem to be confused about the purpose of p-curve or make the logical mistake to infer from the presence of evidence that questionable research practices (publication bias; p-hacking) are absent. This is a fallacy.  To examine the presence of publication bias, researchers should use existing and validated bias tests.





















An Even Better P-curve

It is my pleasure to post the first guest post on the R-Index blog.  The blog post is written by my colleague and partner in “crime”-detection, Jerry Brunner.  I hope we will see many more guest posts by Jerry in the future.


Jerry Brunner
Department of Statistical Sciences
University of Toronto

First, my thanks to the mysterious Dr. R for the opportunity to do this guest post. At issue are the estimates of population mean power produced by the online p-curve app. The current version is 4.06, available at As the p-curve team (Simmons, Nelson, and Simonsohn) observe in their blog post entitled “P-curve handles heterogeneity just fine” at, the app does well on average as long as there is not too much heterogeneity in power. They show in one of their examples that it can over-estimate mean power when there is substantial heterogeneity.

Heterogeneity in power is produced by heterogeneity in effect size and heterogeneity in sample size. In the simulations reported at, sample size varies over a fairly narrow range, as one might expect from a meta-analysis of small-sample studies. What if we wanted to estimate mean power for sets of studies with large heterogeneity in sample size, such as an entire discipline, sub-areas, journals, or psychology departments? Sample size would be much more variable.

This post gives an example in which the p-curve app consistently over-estimates population mean power under realistic heterogeneity in sample size. To demonstrate that heterogeneity in sample size alone is a problem for the online pcurve app, population effect size was held constant.

In 2016, Brunner and Schimmack developed an alternative p-curve method (p-curve 2.1), which performs much better than the online app p-curve 4.06. P-curve 2.1 is fully documented and evaluated in Brunner and Schimmack (2018). This is the most recent version of the notorious and often-rejected paper mentioned in It has been re-written once again, and submitted to Meta-psychology. It will shortly be posted during the open review process, but in the meantime I have put a copy on my website at

P-curve 2.1 is based on Simonsohn, Nelson and Simmons’ (2014) p-curve estimate of effect size. It is designed specifically for the situation where there is heterogeneity in sample size, but just a single fixed effect size. P-curve 2.1 is a simple, almost trivial application of p-curve 2.0. It first uses the p-curve 2.0 method to estimate a common effect size. It then combines that estimated effect size and the observed sample sizes to calculate an estimated power for each significance test in the sample. The sample mean of the estimated power values is the p-curve 2.1 estimate.
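The second step of this recipe can be sketched as follows (the effect-size estimation step from p-curve 2.0 is not reproduced here; the d_hat value and sample sizes are hypothetical, and a simple normal approximation stands in for the exact power computation):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def pcurve21_mean_power(d_hat, total_ns, alpha=0.05):
    # Given a single effect-size estimate d_hat (from the p-curve effect-size
    # method), compute approximate power for each study's total N and average.
    z_crit = 1.959964
    powers = []
    for n in total_ns:
        ncp = d_hat * math.sqrt(n / 4)  # two equal groups, total N = n
        powers.append(1 - normal_cdf(z_crit - ncp))
    return sum(powers) / len(powers)

# One fixed effect size, heterogeneous sample sizes:
print(round(pcurve21_mean_power(0.4, [30, 60, 120, 400]), 2))  # ~0.53
```

Because power is computed per study, heterogeneity in sample size is handled explicitly rather than averaged away.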

One of the virtues of p-curve is that it allows for publication bias, using only significant test statistics as input. The population mean power being estimated is the mean power of the sub-population of tests that happened to be significant. To compare the performance of p-curve 4.06 to p-curve 2.1, I simulated samples of significant test statistics with a single effect size, and realistic heterogeneity in sample size.

Here’s how I arrived at the “realistic” sample sizes. In another project, Uli Schimmack had harvested a large number of t and F statistics from the journal Psychological Science, from the years 2001-2015. I used N = df + 2 to calculate implied total sample sizes. I then eliminated all sample sizes less than 20 and greater than 500, and randomly sampled 5,000 of the remaining numbers. These 5,000 numbers will be called the “Psychological Science urn.” They are available at, and can be read directly into R with the scan function.
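The N = df + 2 conversion and the truncation can be sketched as follows (for t statistics only; F statistics would need their own df bookkeeping, and the function name and example dfs are mine):

```python
def implied_sample_sizes(error_dfs, lo=20, hi=500):
    # Total N implied by the error df of a two-sample t-test (df = N - 2),
    # keeping only values in the [lo, hi] range described above
    return [df + 2 for df in error_dfs if lo <= df + 2 <= hi]

print(implied_sample_sizes([38, 6, 118, 2500]))  # [40, 120]
```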

The numbers in the Psychological Science urn are not exactly sample sizes and they are not a true random sample. In particular, truncating the distribution at 500 makes them less heterogeneous than real sample sizes, since web surveys with enormous sample sizes are eliminated. Still, I believe the numbers in the Psychological Science urn may be fairly reflective of the sample sizes in psychology journals. Certainly, they are better than anything I would be able to make up. Figure 1 shows a histogram, which is right skewed as one might expect.


By sampling with replacement from the Psychological Science urn, one could obtain a random sample of sample sizes, similar to sampling without replacement from a very large population of studies. However, that’s not what I did. Selection for significance tends to select larger sample sizes, because tests based on smaller sample sizes have lower power and so are less likely to be significant. The numbers in the Psychological Science urn come from studies that passed the filter of publication bias. It is the distribution of sample size after selection for significance that should match Figure 1.

To take care of this issue, I constructed a distribution of sample size before selection and chose an effect size that yielded (a) population mean power after selection equal to 0.50, and (b) a population distribution of sample size after selection that exactly matched the relative frequencies in the Psychological Science urn. The fixed effect size, in the metric of Cohen (1988, p. 216), was w = 0.108812. This is roughly Cohen’s “small” value of w = 0.10. If you have done any simulations involving literal selection for significance, you will realize that getting the numbers to come out just right by trial and error would be nearly impossible. I got the job done by using a theoretical result from Brunner and Schimmack (2018). Details are given at the end of this post, after the results.

I based the simulations on k=1,000 significant chi-squared tests with 5 degrees of freedom. This large value of k (the number of studies, or significance tests on which the estimates are based) means that estimates should be very accurate. To calculate the estimates for p-curve 4.06, it was easy enough to get R to write input suitable for pasting into the online app. For p-curve 2.1, I used the function heteroNpcurveCHI, part of a collection developed for the Brunner and Schimmack paper. The code for all the functions is available at Within R, the functions can be defined with source(""). Then to see a list of functions, type functions() at the R prompt.

Recall that population mean power after selection is 0.50. The first time I ran the simulation, the p-curve 4.06 estimate was 0.64, with a 95% confidence interval from 0.61 to 0.66. The p-curve 2.1 estimate was 0.501. Was this a fluke? The results of five more independent runs are given in the table below. Again, the true value of mean power after selection for significance is 0.50.

P-curve 2.1   P-curve 4.06   P-curve 4.06 95% CI
0.510         0.64           [0.61, 0.67]
0.497         0.62           [0.59, 0.65]
0.502         0.62           [0.59, 0.65]
0.509         0.64           [0.61, 0.67]
0.487         0.61           [0.57, 0.64]

It is clear that the p-curve 4.06 estimates are consistently too high, while p-curve 2.1 is on the money. One could argue that an error of around twelve percentage points is not too bad (really?), but certainly an error of one percentage point is better. Also, eliminating sample sizes greater than 500 substantially reduced the heterogeneity in sample size. If I had left the huge sample sizes in, the p-curve 4.06 estimates would have been ridiculously high.

Why did p-curve 4.06 fail? The answer is that even with complete homogeneity in effect size, the Psychological Science urn was heterogeneous enough to produce substantial heterogeneity in power. Figure 2 is a histogram of the true (not estimated) power values.


Figure 2 shows that even under homogeneity in effect size, a sample size distribution matching the Psychological Science urn can produce substantial heterogeneity in power, with a mode near one even though the mean is 0.50. In this situation, p-curve 4.06 fails. P-curve 2.1 is clearly preferable because it specifically allows for heterogeneity in sample size.

Of course, p-curve 2.1 does assume homogeneity in effect size. What happens when effect size is heterogeneous too? The paper by Brunner and Schimmack (2018) contains a set of large-scale simulation studies comparing estimates of population mean power from p-curve, p-uniform, maximum likelihood, and z-curve, a new method dreamed up by Schimmack. The p-uniform method is based on van Assen, van Aert, and Wicherts (2014), extended to power estimation as in p-curve 2.1. The p-curve method we consider in the paper is p-curve 2.1. It does okay as long as heterogeneity in effect size is modest. Other methods may be better, though. To summarize, maximum likelihood is most accurate when its assumptions about the distribution of effect size are satisfied or approximately satisfied. When effect size is heterogeneous and the assumptions of maximum likelihood are not satisfied, z-curve does best.

I would not presume to tell the p-curve team what to do, but I think they should replace p-curve 4.06 with something like p-curve 2.1. They are free to use my heteroNpcurveCHI and heteroNpcurveF functions if they wish. A reference to Brunner and Schimmack (2018) would be appreciated.

Details about the simulations

Before selection for significance, there is a bivariate distribution of sample size and effect size. This distribution is affected by the selection process, because tests with higher effect size or sample size (or especially, both) are more likely to be significant. The question is, exactly how does selection affect the joint distribution? The answer is in Brunner and Schimmack (2018). This paper is not just a set of simulation studies. It also has a set of “Principles” relating the population distribution of power before selection to its distribution after selection. The principles are actually theorems, but I did not want it to sound too mathematical. Anyway, Principle 6 says that to get the probability of a (sample size, effect size) pair after selection, take the probability before selection, multiply by the power calculated from that pair, and divide by the population mean power before selection.
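Stated as a formula (my notation, not necessarily the paper's): writing pow(n, d) for the power of the test with sample size n and effect size d, Principle 6 reads:

```latex
P_{\text{after}}(n,d) \;=\; \frac{P_{\text{before}}(n,d)\cdot \mathrm{pow}(n,d)}{E_{\text{before}}[\mathrm{pow}]}
```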

In the setting we are considering here, there is just a single effect size, so it’s even simpler. The probability of a (sample size, effect size) pair is just the probability of the sample size. Also, we know the probability distribution of sample size after selection. It’s the relative frequencies of the Psychological Science urn. Solving for the probability of sample size before selection yields this rule: the probability of sample size before selection equals the probability of sample size after selection, divided by the power for that sample size, and multiplied by population mean power before selection.

This formula will work for any fixed effect size. That is, for any fixed effect size, there is a probability distribution of sample size before selection that makes the distribution of sample size after selection exactly match the Psychological Science frequencies in Figure 1. Effect size can be anything. So, choose the effect size that makes expected (that is, population mean) power after selection equal to some nice value like 0.50.

Here’s the R code. First, we read the Psychological Science urn and make a table of probabilities.


options(scipen=999) # To avoid scientific notation

source(""); functions()

PsychScience = scan("")

hist(PsychScience, xlab='Sample size',breaks=100, main = 'Figure 1: The Psychological Science Urn')

# A handier urn, for some purposes

nvals = sort(unique(PsychScience)) # There are 397 rather than 8000 values

nprobs = table(PsychScience)/sum(table(PsychScience))

# sum(nvals*nprobs) = 81.8606 = mean(PsychScience)

For any given effect size, the frequencies from the Psychological Science urn can be used to calculate expected power after selection. Minimizing the (squared) difference between this value and the desired mean power yields the required effect size.

# Minimize this function to find effect size giving desired power 

# after selection for significance.

fun = function(es,wantpow,dfreedom)

    {

    alpha = 0.05; cv = qchisq(1-alpha,dfreedom)

    epow = sum( (1-pchisq(cv,df=dfreedom,ncp=nvals*es))*nprobs )

    # cat("es = ",es," Expected power = ",epow,"\n")

    (epow-wantpow)^2 # Return the squared difference to be minimized

    } # End of all the fun

# Find needed effect size for chi-square with df=5 and desired 

# population mean power AFTER selection.

popmeanpower = 0.5 # Change this value if you wish

EffectSize = nlminb(start=0.01, objective=fun,lower=0,df=5,wantpow=popmeanpower)$par

EffectSize # 0.108812

Calculate the probability distribution of sample size before selection.

# The distribution of sample size before selection is proportional to the

# distribution after selection divided by power, term by term.

crit = qchisq(0.95,5)

powvals = 1-pchisq(crit,5,ncp=nvals*EffectSize)

Pn = nprobs/powvals 

EG = 1/sum(Pn)

cat("Expected power before selection = ",EG,"\n")

Pn = Pn*EG # Probability distribution of n before selection

Generate test statistics before selection.

nsim = 50000 # Initial number of simulated statistics. This is overkill. Change the value if you wish.


# For repeated simulations, execute the rest of the code repeatedly.

nbefore = sample(nvals,size=nsim,replace=TRUE,prob=Pn)

ncpbefore = nbefore*EffectSize

powbefore = 1-pchisq(crit,5,ncp=ncpbefore)

Ybefore = rchisq(nsim,5,ncp=ncpbefore)

Select for significance.

sigY = Ybefore[Ybefore>crit]

sigN = nbefore[Ybefore>crit]

sigPOW = 1-pchisq(crit,5,ncp=sigN*EffectSize)

hist(sigPOW, xlab='Power',breaks=100,freq=F ,main = 'Figure 2: Power After Selection for Significance')

Estimate mean power both ways.

# Two estimates of expected power before selection

c( length(sigY)/nsim , mean(powbefore) ) 

c(popmeanpower, mean(sigPOW)) # Golden


k = 1000 # Select 1,000 significant results.

Y = sigY[1:k]; n = sigN[1:k]; TruePower = sigPOW[1:k]

# Estimate with p-curve 2.1

heteroNpcurveCHI(Y=Y,dfree=5,nn=n) # 0.5058606 the first time.

# Write out chi-squared statistics for pasting into the online app

for(j in 1:k) cat("chi2(5) =",Y[j],"\n")


Brunner, J. and Schimmack, U. (2018). Estimating population mean power under conditions of heterogeneity and selection for significance. Under review. Available at

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Edition), Hillsdale, New Jersey: Erlbaum.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681.

van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293-309.


Lies, Damn Lies, and Abnormal Psychological Science (APS)

Everybody knows the saying “Lies, damn lies, and statistics.”  But it is not the statistics; it is the ab/users of statistics who are distorting the truth.  The Association for Psychological Science (APS) is trying to hide the truth that experimental psychologists are not using scientific methods in the way they are supposed to be used.  These abnormal practices are known as questionable research practices (QRPs).  Surveys show that researchers are aware that these practices have negative consequences, but they also show that these practices are being used because they can advance researchers’ careers (John et al., 2012).  Before 2011, it was also no secret that these practices were used, and psychologists might even brag about the use of QRPs to get results (“it took me 20 attempts to find this significant result”).

However, some scandals in social psychology (Stapel, Bem) changed the perception of these practices.  Hiding studies, removing outliers selectively, or not disclosing dependent variables that failed to show the predicted result was no longer something anybody would admit doing in public (except a few people who paid dearly for it; e.g. Wansink).

Unfortunately for abnormal psychological scientists, some researchers, including myself, have developed statistical methods that can reveal the use of questionable research practices and applications of these methods show the use of QRPs in numerous articles (Greg Francis; Schimmack, 2012).  Francis (2014) showed that 80% or more of articles in the flagship journal of APS used QRPs to report successful studies.  He was actually invited by the editor of Psychological Science to audit the journal, but when he submitted the results of his audit for publication, the manuscript was rejected. Apparently, it was not significant enough to tell readers of Psychological Science that most of the published articles in Psychological Science are based on abnormal psychological science.  Fortunately, the results were published in another peer-reviewed journal.

Another major embarrassment for APS was the result of a major replication project of studies published in Psychological Science, the main APS journal, as well as two APA (American Psychological Association) journals (Open Science Collaboration, 2015).  The results showed that only 36% of significant results in original articles could be replicated. The “success rate” for social psychology was even lower at 25%.  The main response to this stunning failure rate has been attempts to discredit the replication studies or to normalize replication failures as a normal outcome of science.

In several blog posts and manuscripts I have pointed out that the failure rate of social psychology is not the result of normal science.  Instead, replication failures are the result of abnormal scientific practices where researchers use QRPs to produce significant results.  My colleague Jerry Brunner developed a statistical method, z-curve, that reveals this fact. We have tried to publish our statistical method in an APA journal (Psychological Methods) and the APS journal Perspectives on Psychological Science, where it was desk-rejected by Sternberg, who needed journal space to publish his own editorials [he resigned after a revolt from APS members, including former editor Bobbie Spellman].

Each time our manuscript was rejected without any criticism of our statistical method.  The stated reason was that it was not interesting to estimate the replicability of psychological science.  This argument makes little sense because the OSC reproducibility article from 2015 has already been cited over 500 times in peer-reviewed journals (Web of Science).

The argument that our work is not interesting is further undermined by a recent article published in the new APS journal Advances in Methods and Practices in Psychological Science with the title “The Prior Odds of Testing a True Effect in Cognitive and Social Psychology.”  The article was accepted by the main editor Daniel J. Simons, who also rejected our article as irrelevant (see rejection letter).  Ironically, the article presents very similar analyses of the OSC data and required a method that could estimate average power, but the authors used an ad-hoc approach to do so.  The article even cites our pre-print, but the authors did not contact us or run the R-code that we shared to estimate average power.  This behavior would be like eyeballing a scatter plot rather than using a formula to quantify the correlation between two variables.  It is contradictory to claim that our method is not useful and then accept a paper that could have benefited from using our method.

Why would an editor reject a paper that provides an estimation method for a parameter that an accepted paper needs to estimate?

One possible explanation is that the accepted article normalizes replication failures, while we showed that these replication failures are at least partially explained by QRPs.  The first evidence for the normalization of abnormal science is that the article does not cite Francis (2014), Schimmack (2012), or John et al.’s (2012) survey about questionable research practices.  The article also does not mention Sterling’s work on abnormally high success rates in psychology journals (Sterling, 1959; Sterling et al., 1995). It does not mention Simmons, Nelson, and Simonsohn’s (2011) False-Positive Psychology article that discussed the harmful consequences of abnormal psychological science.  The article simply never mentions the term questionable research practices. Nor does it mention the “replication crisis,” although it mentions that the OSC project replicated only 25% of findings in social psychology.  Apparently, this is neither abnormal nor symptomatic of a crisis, but just how well social psychological science works.

So, how does this article explain the low replicability of social psychology as normal science?  The authors point out that replicability is a function of the percentage of true null-hypotheses that are being tested.  As researchers conduct empirical studies to find out which predictions are true and which are not, it is normal science to sometimes predict effects that do not exist (true null-hypotheses), and inferential statistics will sometimes lead to the wrong conclusion (type-I errors / false positives).  It is therefore unavoidable that empirical scientists will sometimes make mistakes.

The question is how often they make these mistakes and how they correct them.  How many false-positives end up in the literature depends on several other factors, including (a) the percentage of null-hypothesis that are being tested and (b) questionable research practices.

The key argument in the article is that social psychologists are risk-takers and test many false hypotheses.  As a result, they end up finding many false positive results. Replication studies are needed to show which findings are true and which findings are false. So, doing risky exploratory studies followed by replication studies is good science. In contrast, cognitive psychologists are not risk-takers and test hypotheses that have a high probability of being true. Thus, they have fewer false positives, but that doesn’t mean they are better scientists or social psychologists are worse scientists.  In the happy place of APS journals, all psychological scientists are good scientists.

Conceivably, social psychologists place higher value on surprising findings—that is, findings that reflect a departure from what is already known—than cognitive  psychologists do.

There is only one problem with this happy story of psychological scientists working hard to find the truth using the best possible scientific methods.  It is not true.

How Many Point-Nil-Hypotheses Are True?

How often is the null-hypothesis true?  To answer this question it is important to define the null-hypothesis.  A null-hypothesis can be any point or a range of effect sizes.  However, psychologists often wrongly use the term null-hypothesis to refer to the point-nil-hypothesis (cf. Cohen, 1994) that there is absolutely no effect (e.g., the effect of studying for a test AFTER the test has already happened; Bem, 2011).  We can then distinguish two sets of studies: studies with an effect of any magnitude and studies without an effect.

The authors argue correctly that testing many null-effects will result in more false positives and lower replicability.  This is easy to see, if all significant results are false positives (Bem, 2011).  The probability that any single replication study produces a significant result is simply alpha (5%) and for a set of studies only 5% of studies are expected to produce a significant result. This is the worst case scenario (Rosenthal, 1979; Sterling et al., 1995).

Importantly, this does not only apply to replication studies. It also applies to original studies. If all studies have a true effect size of zero, only 5% of studies should produce a significant result.  However, it is well known that the success rate in psychology journals is above 90% (Sterling, 1959; Sterling et al., 1995).  Thus, it is not clear how social psychology can test many risky hypotheses that are often false and still report over 90% successes in its journals or even within a single article (Schimmack, 2012). The only way to achieve this high success rate while most hypotheses are false is to report only successful studies (like a gambling addict who only counts wins and ignores losses; Rosenthal, 1979) or to make up hypotheses after randomly finding a significant result (Kerr, 1998).

To my knowledge, Sterling et al. (1995) were the first to relate the expected failure rate (without QRPs) to alpha, power, and the percentage of studies with and without an effect.


Sterling et al.’s formula implies that we should never have expected 100% of the published results in the Open Science Collaboration project to be significant; the 25% success rate in the replication studies is shockingly low, but at least more believable than the 100% success rate of the original studies.  The article neither mentions Sterling’s statistical contribution, nor the implication for the expected success rate in original studies.

The main aim of the authors is to separate the effects of power and the proportion of studies without effects on the success rate; that is the percentage of studies with significant results.

For example, a 25% success rate for social psychology could be produced by 25 studies with 85% power and 75 studies without an effect (and a 5% chance of producing a significant result), or it could be produced by 100 studies with an average of 25% power, or any other percentage of studies with an effect between 25% and 100%.
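As a quick sanity check, both mixtures can be computed directly (a sketch using the numbers from the example above):

```r
# Two mixtures of studies that produce the same 25% overall success rate
alpha = 0.05
mix1  = (25*0.85 + 75*alpha) / 100  # 25 studies with 85% power, 75 null studies
mix2  = (100*0.25) / 100            # 100 studies with 25% average power
c(mix1, mix2)                       # both equal 0.25
```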

As pointed out by Brunner and Schimmack (2018), it is impossible to obtain a precise estimate of this percentage because different mixtures of studies can produce the same success rate.  I was therefore surprised when the abstract claimed that “we found that R was lower for the social-psychology studies than for the cognitive-psychology studies.”  How were the authors able to quantify and compare the proportions of studies with an effect in social psychology versus cognitive psychology? The answer is provided in the following quote.

Using binary logic for the time being, we assume that the observed proportion of studies yielding effects in the same direction as originally observed, ω, is equal to the proportion of true effects, PPV, plus half of the remaining 1 – PPV noneffects, which would be expected to yield effects in the same direction as originally observed 50% of the time by chance.

To clarify,  a null-result is equally likely to produce a positive or a negative effect size by chance.  A sign reversal in a replication study is used to infer that the original result was a false positive.  However, these sign reversals are only half of the false positives because random chance is equally likely to produce the same sign (head-tail is equally probable as head-head).  Using this logic, the percentage of sign reversals times two is an estimate of the percentage of false positives in the original studies.

Based on the finding that 25.5% of social replication studies showed a sign reversal, the authors conclude that 51% of the original significant results were false positives.  This would imply that every other significant result that is published in social psychology journals is a false positive.
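The doubling logic can be illustrated with a small simulation (my sketch, not the authors' code):

```r
# Sketch: if an original significant result is a false positive (true effect
# of zero), the replication estimate has a random sign, so it reverses the
# sign of the original (positive) result about half the time.
set.seed(123)
rep_estimates = rnorm(100000, mean = 0)  # replication effects under the null
reversal_rate = mean(rep_estimates < 0)  # close to 0.50
# The authors' estimator doubles the observed reversal rate:
2 * 0.255                                # 0.51, the claimed false positive rate
```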

One problem with this approach is that sign reversals can also occur for true positive studies with low power (Gelman & Carlin, 2014).  Thus, the percentage of sign reversals is at best a rough estimate of false positive results.

However, low power can be the result of small effect sizes and many of these effect sizes might be so close to zero that they can be considered false positives if the null-hypothesis is defined as a range of effect sizes close to zero.

So, I will just use the authors’ estimate of roughly 50% false positive results as a reasonable estimate of the percentage of false positive results that are reported in social psychology journals.

Are Social Psychologists Testing Riskier Hypotheses? 

The authors claim that social psychologists have more false positive results than cognitive psychologists because they test more false hypotheses. That is, they are risk takers:

Maybe watching a Beyoncé video reduces implicit bias? Let’s try it (with n = 20 per cell in a between-subject design).  It doesn’t and the study produced a non-significant result.  Let’s try something else.  After trying many other manipulations, finally a significant result is observed and published.  Unfortunately, this manipulation also had no effect and the published result is a false positive.  Another researcher replicates the study and obtains a significant result with a sign reversal. The original result gets corrected and the search for a true effect continues.


To make claims about the ratio of studies with effects and studies without effects (or negligible effects) that are being tested, the authors use the formula shown above.  Here the ratio (R) of studies with an effect over studies without an effect is a function of alpha (the criterion for significance), beta (the type-II error probability), and PPV, the positive predictive value, which is simply the percentage of true positive significant results in the published literature.
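Written out (consistent with the computations later in this post), the relation between PPV and R is presumably:

```latex
\mathrm{PPV} \;=\; \frac{R\,(1-\beta)}{R\,(1-\beta) + \alpha}
\qquad\Longleftrightarrow\qquad
R \;=\; \frac{\alpha\,\mathrm{PPV}}{(1-\beta)\,(1-\mathrm{PPV})}
```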

As noted before, the PPV for social psychology was estimated to be 49%. This leaves two unknowns to make claims about R: alpha and beta.  The authors’ approach to estimating alpha and beta is questionable and undermines their main conclusion.

Estimating Alpha

The authors use the nominal alpha level as the probability that a study without a real effect produces a false positive result.

Social and cognitive psychology generally follow the same convention for their alpha level (i.e., p < .05), so the difference in that variable likely does not explain the difference in PPV. 

However, this is a highly questionable assumption when researchers use questionable research practices.  As Simmons et al. (2011) demonstrated, p-hacking can be used to bypass the significance filter, and the risk of reporting a false positive result with a nominal alpha of 5% can be over 50%.  That is, the actual risk of reporting a false positive result is not 5% as stated, but much higher.  This has clear implications for the presence of false positive results in the literature.  While it would require, on average, 20 risky hypotheses to observe a false positive result with a significance filter of 5%, p-hacking makes it possible to report every other false positive result as significant.  Thus, massive p-hacking could explain a high percentage of false positive results in social psychology just as well as honest testing of risky hypotheses.

The authors simply ignore this possibility when they use the nominal alpha level as the factual probability of a false positive result and neither the reviewers nor the editor seemed to realize that p-hacking could explain replication failures.

Is there any evidence that p-hacking rather than risk-taking explains the results? Indeed, there is lots of evidence.  As I pointed out in 2012, it is easy to see that social psychologists are using QRPs because they typically report multiple conceptual replication studies in a single article. Many of the studies in the replication project were selected from multiple-study articles.  A multiple-study article essentially lowers alpha from .05 in a single study to .05 raised to the power of the number of studies. Even with just two studies, the risk of repeating a false positive result is just .05^2 = .0025.  And none of these multiple-study articles report replication failures, even if the tested hypothesis is ridiculous (Bem, 2011).  There are only two explanations for the high success rate in social psychology.  Either they are testing true hypotheses and the estimate of 50% false positive results is wrong, or they are using p-hacking and the risk of a false positive result in a single study is greater than the nominal alpha.  Either explanation invalidates the authors’ conclusions about R. Either their PPV estimates are wrong or their assumptions about the real alpha criterion are wrong.

Estimating Beta

Beta or the type-II error is the risk of obtaining a non-significant result when an effect exists.  Power is the complementary probability of getting a significant result when an effect is present (a true positive result).  The authors note that social psychologists might end up with more false positive results because they conduct studies with lower power.

To illustrate, imagine that social psychologists run 100 studies with an average power of 50% and 250 studies without an effect, and due to QRPs 20% of these null studies produce a significant result with a nominal alpha of p < .05.  In this case, there are 50 true positive results (100 * .50 = 50) and 50 false positive results (250 * .20 = 50).   In contrast, cognitive psychologists conduct studies with 80% power, while everything else is the same. In this case, there would be 80 true positive results (100 * .8 = 80) and also 50 false positive results.  The percentage of false positives would be 50% for social, but only 50/(50+80) = 38% false positives for cognitive psychology.  In this example, R and alpha are held constant, but the PPVs differ simply as a function of power.  If we assume that cognitive psychologists use less severe p-hacking, there could be even fewer false positives (250 * .10 = 25), and the percentage of false positives for cognitive psychology would be only 24%.  [actual estimate in the article is 19%]
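The arithmetic of this example can be verified directly (a sketch; the 20% and 10% p-hacked false positive rates are the assumptions from the paragraph above):

```r
# Worked example: same R and alpha, different power and p-hacking severity
tp_social    = 100 * 0.50   # true positives with 50% power
tp_cognitive = 100 * 0.80   # true positives with 80% power
fp           = 250 * 0.20   # false positives with a p-hacked 20% rate
fp / (fp + tp_social)       # 0.50 false positives for social psychology
fp / (fp + tp_cognitive)    # ~0.38 for cognitive psychology
fp_mild = 250 * 0.10        # milder p-hacking
fp_mild / (fp_mild + tp_cognitive)  # ~0.24
```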

Thus, to make claims about differences between social psychologists and cognitive psychologists, it is necessary to estimate beta or power (1 – beta) and because power varies across the diverse studies in the OSC project, they have to estimate average power.  Moreover, because only significant studies are relevant, they need to estimate the average power after selection for significance.  The problem is that there exists no published peer-reviewed method to do this.  The reason why no published peer-reviewed method exists is that editors have rejected our manuscripts that have evaluated four different methods of estimating average power after selection for significance and shown that z-curve is the best method.

How do the authors estimate average power after selection for significance without z-curve?  They use p-curve plots and visual inspection of the plots against simulations of data with fixed power to obtain rough estimates of 50% average power for social psychology and 80% average power for cognitive psychology.

It is interesting that the authors used p-curve plots, but did not use the p-curve online app to estimate average power.  The online p-curve app also provides power estimates. However, as we pointed out in the rejected manuscript, this method can severely overestimate average power. In fact, when the online p-curve app is used, it produces estimates of 96% average power for social psychology and 98% for cognitive psychology. These estimates are implausible, and this is the reason why the authors created their own ad-hoc method of power estimation rather than using the estimates provided by the p-curve app.

We used the p-curve app and also got really high power estimates that seemed implausible, so we used ballpark estimates from the Simonsohn et al. (2014) paper instead (Brent Wilson, email communication, May 7, 2018). 


Based on their visual inspection of the graphs they conclude that the average power in social psychology is about 50% and the average power in cognitive psychology is about 80%.

Putting it all together 

After estimating PPV, alpha, and beta in the way described above, the authors used the formula to estimate R.

If we set PPV to .49, α to .05, and 1 – β (i.e., power) to .50 for the social-psychology studies and we set the corresponding values to .81, .05, and .80 for the cognitive-psychology studies, Equation 2 shows that R is .10 (odds = 1 to 10) for social psychology and .27 (odds = 1 to ~4) for cognitive psychology.
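The quoted values can be reproduced directly (a sketch; the formula in the comment is the one used for the calculations at the end of this post):

```r
# Check of the quoted Equation 2 estimates: R = alpha*PPV / ((1-beta)*(1-PPV))
alpha = 0.05
R_social    = alpha * 0.49 / (0.50 * (1 - 0.49))
R_cognitive = alpha * 0.81 / (0.80 * (1 - 0.81))
round(c(R_social, R_cognitive), 2)  # 0.10 and 0.27
```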

Now the authors make another mistake.  The power estimate obtained from p-curve applies to ALL p-values, including the false positive ones.  Of course, the average estimate of power is lower for a set of studies that contains more false positive results.

To end up with 50% average power with 50% false positive results,  the power of the studies that are not false positives can be computed with the following formula.

Avg.Power = FP*alpha + TP*power   <=>  power = (Avg.Power – FP*alpha)/TP

With 49% true positives (TP), 51% false positives (FP), alpha = .05, and average power = .50 for social psychology, the estimated average power of studies with an effect is 97%.

alpha = .05; avg.power = .50; TP = .49; FP = 1-TP;  (avg.power – FP*alpha)/TP

With 81% true positives and 80% average power for cognitive psychology, the estimated average power of studies with an effect in cognitive psychology  is 98%.
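Both back-calculations follow from the formula above (a sketch restating the inline computation for both fields):

```r
# Back-calculating the power of the studies with a true effect:
# power = (Avg.Power - FP*alpha) / TP
alpha = 0.05
pow_social    = (0.50 - 0.51*alpha) / 0.49  # ~0.97 for social psychology
pow_cognitive = (0.80 - 0.19*alpha) / 0.81  # ~0.98 for cognitive psychology
round(c(pow_social, pow_cognitive), 2)
```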

Thus, there is actually no difference in power between social and cognitive psychology because the percentage of false positive results alone explains the differences in the estimates of average power for all studies.


alpha = .05; PPV = .49; power = .96; alpha*PPV/(power * (1-PPV))
alpha = .05; PPV = .81; power = .97; alpha*PPV/(power * (1-PPV))

With these correct estimates of power for studies with true effects, the estimate for social psychology is .05 and the estimate for cognitive psychology is .22.  This means that social psychologists test 20 false hypotheses for every true hypothesis, while cognitive psychologists test 4.55 false hypotheses for every correct hypothesis, assuming the authors’ assumptions are correct.


The authors make some questionable assumptions and some errors to arrive at the conclusion that social psychologists are conducting many studies with no real effect. All of these studies are run with a high level of power. When a non-significant result is obtained, they discard the hypothesis and move on to testing another one.  The significance filter keeps most of the false hypotheses out of the literature, but because there are so many false hypotheses, 50% of the published results end up being false positives.  Unfortunately, social psychologists failed to conduct actual replication studies, and a large pile of false positive results accumulated in the literature until social psychologists realized that they needed to replicate findings in 2011.

Although this is not really a flattering description of social psychology, the truth is worse.  Social psychologists have been replicating findings for a long time. However, they never reported studies that failed to replicate earlier findings and when possible they used statistical tricks to produce empirical findings that supported their conclusions with a nominal error rate of 5%, while the true error rate was much higher.  Only scandals in 2011 led to honest reporting of replication failures. However, these replication studies were conducted by independent investigators, while researchers with influential theories tried to discredit these replication failures.  Nobody is willing to admit that abnormal scientific practices may explain why many famous findings in social psychology textbooks were suddenly no longer replicable after 2011, especially when hypotheses and research protocols were preregistered and prevented the use of questionable research practices.

Ultimately, the truth will be published in peer-reviewed journals. APS does not control all journals.  When the truth becomes apparent, APS will look bad because it did nothing to enforce normal scientific practices, and it will look worse because it tried to cover up the truth.  Thank god, former APS president Susan Fiske reminded her colleagues that real scientists should welcome humiliation when their mistakes come to light because the self-correcting forces of science are more important than researchers’ feelings. So far, APS leaders seem to prefer repressive coping over open acknowledgment of past mistakes. I wonder what the most famous psychologists of all time would have to say about this.

Estimating Reproducibility of Psychology (No. 52): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles.  These predictions will only be accurate if the replication studies were close replications of the original studies.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The replication crisis has split psychologists and disrupted social networks.  I respected Jerry Clore as an emotion researcher when I started my career in emotion research.  His work on appraisal theories of emotions made an important contribution and influenced my thinking about emotions.  I enjoyed participating in Jerry’s lab meetings when I was a post-doctoral student of Ed Diener at Illinois.  However, I was never a big fan of Jerry’s most famous article on the effect of mood on life-satisfaction judgments.  Working with Ed Diener convinced me that life-satisfaction judgments are more stable and more strongly based on chronically accessible information than the mood-as-information model suggested (Anusic & Schimmack, 2016; Schimmack & Oishi, 2005).  Nevertheless, I had a positive relationship with Jerry and I am grateful that he wrote recommendation letters for me when I was on the job market.

When researchers started doing replication studies after 2011, some of Jerry’s articles failed to replicate, and one reason for these replication failures is that the original studies used questionable research practices.  Importantly, nobody considered these practices unethical at the time, and it was no secret that they were used; textbooks even taught students that using these practices was good science.  The problem is that Jerry didn’t acknowledge that questionable practices could at least partially explain the replication failures.  Maybe he did it to protect students like Simone Schnall; maybe he had other reasons.  Personally, I was disappointed by this response to replication failures, but I guess that is life.

Summary of Original Article


In five studies, the authors crossed the priming of happy and sad concepts with affective experiences. In all studies, the expected interaction was significant: coherence between affective concepts and affective experiences led to better recall of a story than incoherence.

Study 1

56 students were assigned to six conditions (n ~ 10) of a 2 x 3 design. Three priming conditions with a scrambled sentence task were crossed with a manipulation of flexing or extending one arm. This manipulation is supposed to create an approach or avoidance motivation (Cacioppo et al., 1993).  The expected interaction was significant, F(2, 50) = 3.50, p = .038.

Study 2

75 students participated in Study 2, which was a replication study with two changes: the arm-position manipulation was paired with the priming task, and half the participants rated their mood before the measurement of the DV.  The ANOVA result was marginally significant, F(2, 69) = 2.81, p = .067.

Study 3

Study 3 (58 students) used the same priming procedure, but used music as the mood manipulation.  The neutral priming condition was dropped (n ~ 15 per cell).  The interaction effect was marginally significant, F(1, 54) = 3.48, p = .068.

Study 4

132 students participated in Study 4.  The study changed the priming task to a subliminal priming manipulation (although the 60-ms presentation time may not be fully subliminal).  Affect was manipulated by asking participants to hold a happy or sad facial expression.  The interaction was significant, F(1, 128) = 3.97, p = .048.

Study 5 

133 students participated in Study 5.  Study 5 combined the scrambled sentence priming manipulation from Studies 1-3 with the facial expression manipulation from Study 4.  The interaction effect was significant, F(1, 129) = 5.21, p = .024.

Replicability Analysis

Although all five studies showed support for the predicted two-way interaction, the p-values in the five studies are surprisingly similar (ps = .038, .067, .068, .048, .024). The probability of observing so little (or even less) variability in p-values is p = .002 (Test of Insufficient Variance, TIVA).  This suggests that QRPs were used to produce (marginally) significant results in five studies with low power (Schimmack, 2012).
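For readers who want to check this number, here is a minimal, stdlib-only Python sketch of the TIVA logic (my reconstruction, not the original script): two-tailed p-values are converted to z-scores; under honest reporting the z-scores should have a variance of about 1, so (k − 1) × Var(z) can be compared against a chi-square distribution with k − 1 degrees of freedom. The closed-form CDF used below only covers even degrees of freedom, which suffices for five studies.

```python
import math
from statistics import NormalDist, variance

def tiva(pvalues):
    """Test of Insufficient Variance (TIVA), reconstructed sketch.

    Two-tailed p-values are converted to z-scores. If all studies were
    reported honestly, the z-scores should have a variance of about 1,
    so (k - 1) * Var(z) follows a chi-square distribution with k - 1
    degrees of freedom. Returns the left-tail probability: a small
    value means the p-values vary less than chance allows.
    """
    z = [NormalDist().inv_cdf(1 - p / 2) for p in pvalues]
    df = len(z) - 1
    chi2 = df * variance(z)  # variance() uses the k-1 denominator
    # Closed-form chi-square CDF, valid for even df only
    assert df % 2 == 0, "this helper only handles even df"
    series = sum((chi2 / 2) ** i / math.factorial(i) for i in range(df // 2))
    return 1 - math.exp(-chi2 / 2) * series

# The five p-values reported in the original article
print(round(tiva([0.038, 0.067, 0.068, 0.048, 0.024]), 3))  # 0.002
```

Running this on the five reported p-values reproduces the p = .002 figure.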

A small set of studies provides limited information about QRPs.  It is helpful to look at these p-values in the context of other results reported in articles with Jerry Clore as co-author.


The plot shows a large file-drawer (missing studies with non-significant results) that is produced by a large number of just-significant results.  Either many studies were run to obtain a just-significant result, or other QRPs were used.  This analysis supports the conclusion that QRPs contributed to the reported results in the original article.

Replication Study

The replication project attempted a replication of Study 5.  However, the authors did not pick the 2 x 2 interaction as the target finding.  Instead, they used the finding that a “repeated measures ANOVA with condition (coherent vs. incoherent) and story prompt (tree vs. house vs. car) produced a significant linear trend for the interaction of Condition X Story, F(1, 131) = 5.79, p < .02, η2 = .04” (Centerbar et al., 2008, p. 573).  The replication study did not find this trend, F(2, 110) = 0.759, p = .471.  However, the difference in degrees of freedom shows that the replication analysis had less power because it tested the omnibus interaction rather than the linear contrast. Moreover, the replication report states that the replication study showed a trend for the main effect of affective coherence on the percentage of causal words used, F(1, 111) = 3.172, p = .078.  This makes it difficult to evaluate whether the replication study was really a failure.

I used the posted data to test the interaction for the total number of words produced. It was not significant, F(1,126) = 0.602, p = .439.
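As a quick sanity check on F-statistics like this one, a two-tailed p for an F with 1 numerator degree of freedom can be approximated with nothing but the standard library: F(1, df2) is the square of a t(df2) statistic, and for denominator degrees of freedom well above 100 the t distribution is close to standard normal. A hedged sketch (the function name is mine, and the normal approximation slightly underestimates p for smaller samples):

```python
from math import sqrt
from statistics import NormalDist

def p_from_f1(f_stat):
    """Approximate two-tailed p for F(1, df2) with large df2.

    F(1, df2) is the square of a t(df2) statistic, and for large df2
    the t distribution is approximately standard normal, so the
    two-tailed p is 2 * (1 - Phi(sqrt(F))).
    """
    t = sqrt(f_stat)
    return 2 * (1 - NormalDist().cdf(t))

print(round(p_from_f1(0.602), 3))  # 0.438 -- close to the reported .439
```

The small gap between the approximation (.438) and the exact F-test value (.439) is the cost of the normal approximation; either way, the interaction is clearly not significant.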

In conclusion, the reported significant interaction failed to replicate.


The replication study of this 2 x 2 between-subjects social psychology experiment failed to replicate the original result.  Bias tests suggest that the replication failure was at least partially caused by the use of questionable research practices in the original study.