A method revolution is underway in psychological science. In 2011, an article published in JPSP-ASC made it clear that experimental social psychologists were publishing misleading p-values because researchers violated basic principles of significance testing (Schimmack, 2012; Wagenmakers et al., 2011). Deceptive reporting practices led to the publication of mostly significant results, while many non-significant results were not reported. This selective publishing of results dramatically increases the risk of a false positive result above the nominal level of 5% that is typically claimed in publications that report significance tests (Sterling, 1959).
Although experimental social psychologists think that these practices are defensible, no statistician would agree with them. In fact, Sterling (1959) already pointed out that the success rate in psychology journals is too high and that claims about statistical significance are therefore meaningless. Similar concerns were raised again within psychology (Rosenthal, 1979), but deceptive practices remain acceptable to this day (Kitayama, 2018). As a result, most published results in social psychology do not replicate and cannot be trusted (Open Science Collaboration, 2015).
For non-methodologists it can be confusing to make sense of the flood of method papers that have been published in the past years. It is therefore helpful to provide a quick overview of methodological contributions concerned with detection and correction of biases.
First, some methods focus on effect sizes (pcurve2.0; puniform), whereas others focus on strength of evidence (Test of Excessive Significance; Incredibility Index; R-Index, Pcurve2.1; Pcurve4.06; Zcurve).
Another important distinction is between methods that assume a fixed parameter and methods that assume heterogeneity. If all studies have a common effect size or the same strength of evidence, it is relatively easy to demonstrate bias and to correct for bias (Pcurve2.1; Puniform; TES). However, heterogeneity in effect sizes or sampling error produces challenges. Relatively few methods have been developed for this challenging, yet realistic scenario. For example, Ioannidis and Trikalinos (2005) developed a method to reveal publication bias that assumes a fixed effect size across studies, while allowing for variation in sampling error, but this method can be biased if there is heterogeneity in effect sizes. In contrast, I developed the Incredibility Index (also called Magic Index) to allow for heterogeneity in effect sizes and sampling error (Schimmack, 2012).
Following my work on bias detection in heterogeneous sets of studies, I started working with Jerry Brunner on methods that can estimate average power of a heterogeneous set of studies that are selected for significance. I first published this method on my blog in June 2015, when I called it post-hoc power curves. These days, the term Zcurve is used more often to refer to this method. I illustrated the usefulness of Zcurve in various posts in the Psychological Methods Discussion Group.
In September 2015, I posted replicability rankings of social psychology departments using this method. The post generated a lot of discussion and a question about the method. Although the details were still unpublished, I described the main approach of the method. To deal with heterogeneity, the method uses a mixture model.
In 2016, Jerry Brunner and I submitted a manuscript for publication that compared four methods for estimating average power of heterogeneous studies selected for significance (Puniform1.1; Pcurve2.1; Zcurve & a Maximum Likelihood Method). In this article, the mixture model, Zcurve, outperformed other methods, including a maximum-likelihood method developed by Jerry Brunner. The manuscript was rejected from Psychological Methods.
In 2017, Gronau, Duizer, Bakker, and Wagenmakers published an article titled “Bayesian Mixture Modeling of Significant p Values: A Meta-Analytic Method to Estimate the Degree of Contamination From H0” in the Journal of Experimental Psychology: General. The article did not mention Zcurve, presumably because it was not published in a peer-reviewed journal.
Although a reference to our mixture model would have been nice, the Bayesian Mixture Model differs in several ways from Zcurve. This blog post examines the similarities and differences between the two mixture models, shows with simulations and social priming studies that BMM fails to provide useful estimates, and explains why BMM fails. It also shows that Zcurve can provide useful information about replicability of social priming studies, while the BMM estimates are uninformative.
The Bayesian Mixture Model (BMM) and Zcurve have different aims. BMM aims to estimate the percentage of false positives (significant results with an effect size of zero). This percentage is also called the False Discovery Rate (FDR).
FDR = False Positives / (False Positives + True Positives)
Zcurve aims to estimate the average power of studies selected for significance. Importantly, Brunner and Schimmack use the term power to refer to the unconditional probability of obtaining a significant result and not the common meaning of power as being conditional on the null-hypothesis being false. As a result, Zcurve does not distinguish between false positives with a 5% probability of producing a significant result (when alpha = .05) and true positives with an average probability between 5% and 100% of producing a significant result.
Average unconditional power is simply the percentage of false positives times alpha plus the average conditional power of true positive results (Sterling et al., 1995).
Unconditional Power = False Positives * Alpha + True Positives * Mean(1 – Beta)
Zcurve therefore avoids the thorny issue of defining false positives and trying to distinguish between false positives and true positives with very small effect sizes and low power.
BMM and zcurve use p-values as input. That is, they ignore the actual sampling distribution that was used to test statistical significance. The only information that is used is the strength of evidence against the null-hypothesis; that is, how small the p-value actually is.
The problem with p-values is that they have a specified sampling distribution only when the null-hypothesis is true. When the null-hypothesis is true, p-values have a uniform sampling distribution. However, this is not useful for a mixture model, because a mixture model assumes that the null-hypothesis is sometimes false and the sampling distribution for true positives is not defined.
Zcurve solves this problem by using the inverse normal distribution to convert all p-values into absolute z-scores (abs(z) = -qnorm(p/2)). Absolute z-scores are used because F-tests or two-sided t-tests do not have a sign and a test score of 0 corresponds to a probability of 1. Thus, the results do not say anything about the direction of an effect, while the size of the p-value provides information about the strength of evidence.
BMM also transforms p-values. The only difference is that BMM uses the full normal distribution with positive and negative z-scores (z = qnorm(p)). That is, a p-value of .5 corresponds to a z-score of zero; p-values greater than .5 are assigned positive z-scores, and p-values less than .5 are assigned negative z-scores. However, because only significant p-values are selected, all z-scores are negative in the range from -1.65 (p = .05, one-tailed) to negative infinity (p = 0).
The non-centrality parameter (i.e., the true parameter that generates the sampling distribution) is simply the mean of the normal distribution. For the null-hypothesis and false positives, the mean is zero.
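The two transformations can be compared side by side in R. This is only a minimal sketch: the vector `p` is made up for illustration, and the p-values are assumed to be two-sided.

```r
# Illustrative two-sided p-values (hypothetical, not from any data set).
p <- c(.05, .01, .001)

# Zcurve: absolute z-scores; the p-value is split across both tails.
z_abs <- -qnorm(p / 2)

# BMM as described above: signed z-scores on the full normal distribution.
z_bmm <- qnorm(p)

print(round(cbind(p, z_abs, z_bmm), 2))
```

For p = .05, the first transformation gives an absolute z-score of about 1.96, whereas the second gives a negative z-score of about -1.64.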
Zcurve and BMM differ in the modeling of studies with true positive results that are heterogeneous. Zcurve uses several normal distributions with a standard deviation of 1 that reflects sampling error for z-tests. Heterogeneity in power is modeled by varying means of normal distributions, where power increases with increasing means.
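The link between the mean of a component and its power can be sketched in a few lines of R. The function name `power_from_mean` and the grid of means are assumptions made for illustration; they are not taken from the Zcurve code.

```r
# Two-sided power of a z-test with alpha = .05 when the sampling
# distribution is normal with mean `m` and standard deviation 1.
power_from_mean <- function(m, alpha = .05) {
  crit <- qnorm(1 - alpha / 2)
  pnorm(-crit - m) + (1 - pnorm(crit - m))
}

means <- 0:6  # hypothetical component means, chosen for illustration
print(round(power_from_mean(means), 3))
```

A mean of zero gives a power equal to alpha, and power rises toward 1 as the mean moves away from zero.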
BMM uses a single normal distribution with varying standard deviation. A wider distribution is needed to predict large observed z-scores.
The main difference between Zcurve and BMM is that Zcurve either does not have fixed means (Brunner & Schimmack, 2016) or has fixed means, but does not interpret the weight assigned to a mean of zero as an estimate of false positives (Schimmack & Brunner, 2018). The reason is that the weights attached to individual components are not very reliable estimates of the weights in the data-generating model. Importantly, this is not relevant for the goal of Zcurve to estimate average power because the weighted average of the components of the model is a good estimate of the average true power in the data-generating model, even if the weights do not match the weights of the data-generating model.
For example, Zcurve does not care whether 50% average power is produced by a mixture of 50% false positives and 50% true positives with 95% power or 50% of studies with 20% power and 50% studies with 80% power. If all of these studies were exactly replicated, they are expected to produce 50% significant results.
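The arithmetic behind this example is easy to check. The sketch below simply applies the formula for unconditional power to the two scenarios; the variable names are illustrative.

```r
alpha <- .05

# Scenario 1: 50% false positives (success probability = alpha)
# and 50% true positives with 95% power.
scenario_1 <- sum(c(.5, .5) * c(alpha, .95))

# Scenario 2: 50% of studies with 20% power and 50% with 80% power.
scenario_2 <- sum(c(.5, .5) * c(.20, .80))

print(c(scenario_1 = scenario_1, scenario_2 = scenario_2))
```

Both scenarios yield an average unconditional power of .50, although the first contains 50% false positives and the second contains none.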
BMM uses the weights assigned to the standard normal with a mean of zero as an estimate of the percentage of false positive results. It does not estimate the average power of true positives or average unconditional power.
Given my simulation studies with Zcurve, I was surprised that BMM solved the problem that the weights of individual components cannot be reliably estimated because the same distribution of p-values can be produced by many mixture models with different weights. The next section examines how BMM tries to estimate the percentage of false positives from the distribution of p-values.
A Bayesian Approach
Another difference between BMM and Zcurve is that BMM uses prior distributions, whereas Zcurve does not. Whereas Zcurve makes no assumptions about the percentage of false positives, BMM uses a uniform distribution with values from 0 to 1 (100%) as a prior. That is, it is equally likely that the percentage of false positives is 0%, 100%, or any value in between. A uniform prior is typically justified as being agnostic; that is, no subjective assumptions bias the final estimate.
For the mean of the true positives, the authors use a truncated normal prior, which they also describe as a folded standard normal. They justify this prior as reasonable based on extensive simulation studies.
Most important, however, is the parameter for the standard deviation. The prior for this parameter was a uniform distribution with values between 0 and 1. The authors argue that larger values would produce too many p-values close to 1.
“implausible prediction that p values near 1 are more common under H1 than under H0” (p 1226).
But why would this be implausible? If there are very few false positives and many true positives with low power, more p-values close to 1 would be the result of true positives (H1) than of false positives (H0).
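A small calculation illustrates this point. The mixture below is hypothetical: the weights of 10% and 90%, the H1 mean of 0.5, and the cutoff of p > .90 are assumptions chosen for illustration, and the calculation refers to all studies before selection for significance.

```r
# Hypothetical mixture: 10% H0 (mean z = 0) and 90% H1 with mean z = 0.5.
w_h0 <- .10
w_h1 <- .90
mean_h1 <- 0.5

# A two-sided p-value above .90 means |z| is below this cutoff.
cutoff <- qnorm(1 - .90 / 2)

# Probability of |z| < cutoff under each component (SD = 1).
pr_h0 <- pnorm(cutoff) - pnorm(-cutoff)
pr_h1 <- pnorm(cutoff - mean_h1) - pnorm(-cutoff - mean_h1)

# Share of all p-values above .90 that come from H1.
share_h1 <- w_h1 * pr_h1 / (w_h1 * pr_h1 + w_h0 * pr_h0)
print(round(c(pr_h0 = pr_h0, pr_h1 = pr_h1, share_h1 = share_h1), 3))
```

In this sketch, a single H1 study is slightly less likely than a single H0 study to produce a p-value above .90, yet most of these p-values still come from H1 because H1 studies dominate the mixture.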
Thus, one way BMM is able to estimate the false discovery rate is by setting the standard deviation in a way that there is a limit to the number of low z-scores that are predicted by true positives (H1).
Although understanding priors and how they influence results is crucial for meaningful use of Bayesian statistics, the choice of priors is not crucial for Bayesian estimation models with many observations because the influence of the priors diminishes as the number of observations increases. Thus, the ability of BMM to estimate the percentage of false positives in large samples cannot be explained by the use of priors. It is therefore still not clear how BMM can distinguish between false positives and true positives with low power.
The authors report several simulation studies that suggest BMM estimates are accurate and robust across many scenarios.
“The online supplemental material presents a set of simulation studies that highlight that the model is able to accurately estimate the quantities of interest under a relatively broad range of circumstances” (p. 1226).
The first set of simulations uses a sample size of N = 500 (n = 250 per condition). Heterogeneity in effect sizes is simulated with a truncated normal distribution with a standard deviation of .10 (truncated at 2*SD) and effect sizes of d = .45, .30, and .15. The lowest values are .35, .20, and .05. With N = 500, these values correspond to 97%, 61%, and 8% power respectively.
d = c(.35,.20,.05); 1-pt(qt(.975,500-2),500-2,d*sqrt(500)/2)
The number of studies was k = 5,000 with half of the studies being false positives (H0) and half being true positives (H1).
Figure 1 shows the Zcurve plot for the simulation with high power (d = .45, power > 97%; median true power = 99.9%).
The graph shows a bimodal distribution with clear evidence of truncation: the steep drop at z = 1.96 (p = .05, two-tailed) is inconsistent with the distribution of significant z-scores. The sharp drop from z = 1.96 to 3 shows that many studies with non-significant results are missing. The estimate of unconditional power (called replicability = expected success rate in exact replication studies) is 53%. This estimate is consistent with the simulation of 50% studies with a probability of success of 5% and 50% of studies with a success probability of 99.9% (.5 * .05 + .5 * .999 = .5245, or about 52.5%).
The values below the x-axis show average power for specific z-scores. A z-score of 2 corresponds roughly to p = .05 and 50% power without selection for significance. Due to selection for significance, the average power is only 9%. Thus the observed power of 50% provides a much inflated estimate of replicability. A z-score of 3.5 is needed to achieve significance with p < .05, although the nominal p-value for z = 3.5 is p = .0002. Thus, selection for significance renders nominal p-values meaningless.
The sharp change in power from Z = 3 to Z = 3.5 is due to the extreme bimodal distribution. While most Z-scores below 3 are from the sampling distribution of H0 (false positives), most Z-scores of 3.5 or higher come from H1 (true positives with high power).
Figure 2 shows the results for the simulation with d = .30. The results are very similar because d = .30 still gives 92% power. As a result, replicability is nearly as high as in the previous example.
The most interesting scenario is the simulation with low powered true positives. Figure 3 shows the Zcurve for this scenario with an unconditional average power of only 23%.
It is no longer possible to recognize two sampling distributions and average power increases rather gradually from 18% for z = 2, to 35% for z = 3.5. Even with this challenging scenario, BMM performed well and correctly estimated the percentage of false positives. This is surprising because it is easy to generate a similar Zcurve without false positives.
Figure 4 shows a simulation with a mixture distribution in which the false positives (d = 0) have been replaced by true positives (d = .06), while the mean for the heterogeneous studies was reduced from d = .15 to d = .11. These values were chosen to produce the same average unconditional power (replicability) of 23%.
I transformed the z-scores into (two-sided) p-values and submitted them to the online BMM app at https://qfgronau.shinyapps.io/bmmsp/ . I used only k = 1,500 p-values because the server timed me out several times with k = 5,000 p-values. The estimated percentage of false positives was 24%, with a wide 95% credibility interval ranging from 0% to 48%. These results suggest that BMM has problems distinguishing between false positives and true positives with low power. BMM appears to be able to estimate the percentage of false positives correctly when most low z-scores are sampled from H0 (false positives). However, when these z-scores are due to studies with low power, BMM cannot distinguish between false positives and true positives with low power. As a result, the credibility interval is wide and the point estimates are misleading.
With k = 1,500, the influence of the priors is negligible. However, with smaller sample sizes, the priors do have an influence on results and may lead to overestimation and false credibility intervals. A simulation with k = 200 produced a point estimate of 34% false positives with a very wide CI ranging from 0% to 63%. The authors suggest a sensitivity analysis by changing model parameters. The most crucial parameter is the standard deviation. Increasing the standard deviation to 2 increases the upper limit of the 95%CI to 75%. Thus, without good justification for a specific standard deviation, the data provide very little information about the percentage of false positives underlying this Zcurve.
For simulations with k = 100, the prior started to bias the results and the CI no longer included the true value of 0% false positives.
In conclusion, these simulation results show that BMM promises more than it can deliver. It is very difficult to distinguish p-values sampled from H0 (mean z = 0) and those sampled from H1 with weak evidence (e.g., mean z = 0.1).
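To see how little information a single significant z-score carries about these two hypotheses, one can compare their densities after selection for significance. The function `dens_sig` below is a hypothetical helper written for this illustration. It treats H1 as a single normal distribution with a mean of 0.1 and a standard deviation of 1, which is a simplification and not the BMM specification.

```r
# Density of an absolute z-score, given that it is significant at
# alpha = .05 (two-sided), for a normal sampling distribution with
# mean `m` and standard deviation 1.
dens_sig <- function(z, m, alpha = .05) {
  crit <- qnorm(1 - alpha / 2)
  power <- pnorm(-crit - m) + (1 - pnorm(crit - m))
  ifelse(z >= crit, (dnorm(z - m) + dnorm(-z - m)) / power, 0)
}

z <- c(2, 2.5, 3, 4)  # illustrative significant z-scores
ratio <- dens_sig(z, m = 0.1) / dens_sig(z, m = 0)
print(round(ratio, 3))
```

The ratios stay close to 1 across this range, so a very large number of observations would be needed before the data favor one of the two means over the other.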
In the Challenges and Limitations section, the authors pretty much agree with this assessment of BMM (Gronau et al., 2017, p. 1230).
The procedure does come with three important caveats.
First, estimating the parameters of the mixture model is an inherently difficult statistical problem. .. and consequently a relatively large number of p values are required for the mixture model to provide informative results.
A second caveat is that, even when a reasonable number of p values are available, a change in the parameter priors might bring about a noticeably different result.
The final caveat is that our approach uses a simple parametric form to account for the distribution of p values that stem from H1. Such simplicity comes with the risk of model-misspecification.
Despite the limitations of BMM, the authors applied BMM to several real data sets. The most interesting application selected focal hypothesis tests from social priming studies. Social priming studies have come under attack as a research area with sloppy research methods as well as fraud (Stapel). Bias tests show clear evidence that published results were obtained with questionable scientific practices (Schimmack, 2017a, 2017b).
The authors analyzed 159 social priming p-values. The 95%CI for the percentage of false positives ranged from 48% to 88%. When the standard deviation was increased to 2, the 95%CI shifted slightly upward and ranged from 56% to 91%. However, when the standard deviation was halved, the 95%CI ranged from only 10% to 75%. These results confirm the authors’ warning that estimates in small sets of studies (k < 200) are highly sensitive to the specification of priors.
What inferences can be drawn from these results about the social priming literature? A false positive percentage of 10% doesn’t sound so bad. A false positive percentage of 88% sounds terrible. A priori, the percentage is somewhere between 0 and 100%. After looking at the data, uncertainty about the percentage of false positives in the social priming literature remains large. Proponents will focus on the 10% estimate and critics will use the 88% estimate. The data simply do not resolve inconsistent prior assumptions about the credibility of discoveries in social priming research.
In short, BMM promises that it can estimate the percentage of false positives in a set of studies, but in practice these estimates are too imprecise and too dependent on prior assumptions to be very useful.
A Zcurve of Social Priming Studies (k = 159)
It is instructive to compare the BMM results to a Zcurve analysis of the same data.
The Zcurve graph shows a steep drop and very few z-scores greater than 4, which tend to have a high success rate in actual replication attempts (OSC, 2015). The average estimated replicability is only 27%. This is consistent with the more limited analysis of social priming studies in Kahneman’s book Thinking, Fast and Slow (Schimmack, 2017a).
More important than the point estimate is that the 95%CI ranges from 15% to a maximum of 39%. Thus, even a sample size of 159 studies is sufficient to provide conclusive evidence that these published studies have a low probability of replicating even if it were possible to reproduce the exact conditions again.
These results show that it is not very useful to distinguish between false positives with a replicability of 5% and true positives with a replicability of 6, 10, or 15%. Good research provides evidence that can be replicated at least with a reasonable degree of statistical power. Tversky and Kahneman (1971) suggested a minimum of 50% and most social priming studies fail to meet this minimal standard and hardly any studies seem to have been planned with the typical standard of 80% power.
The power estimates below the x-axis show that a nominal z-score of 4 or higher is required to achieve 50% average power and an actual false positive risk of 5%. Thus, after correcting for deceptive publication practices, most of the seemingly statistically significant results are actually not significant with the common criterion of a 5% risk of a false positive.
The difference between BMM and Zcurve is captured in the distinction between evidence of absence and absence of evidence. BMM aims to provide evidence of absence (false positives). In contrast, Zcurve has the more modest goal of demonstrating absence (or presence) of evidence. It is unknown whether any social priming studies could produce robust and replicable effects and under what conditions these effects occur or do not occur. However, it is not possible to conclude from the poorly designed studies and the selectively reported results that social priming effects are zero.
Zcurve and BMM are both mixture models, but they have different statistical approaches and different aims. They also differ in their ability to provide useful estimates. Zcurve is designed to estimate average unconditional power to obtain significant results without distinguishing between true positives and false positives. False positives reduce average power, just like low powered studies, and in reality it can be difficult or impossible to distinguish between a false positive with an effect size of zero and a true positive with an effect size that is negligibly different from zero.
The main problem of BMM is that it treats the nil-hypothesis as an important hypothesis that can be accepted or rejected. However, this is a logical fallacy. It is possible to reject an implausible effect size (e.g., the nil-hypothesis is probably false if the 95%CI ranges from .8 to 1.2), but it is not possible to accept the nil-hypothesis because there are always values close to 0 that are also consistent with the data.
The problem of BMM is that it contrasts the point-nil-hypothesis with all other values, even if these values are very close to zero. The same problem plagues the use of Bayes-Factors that compare the point-nil-hypothesis with all other values (Rouder et al., 2009). A Bayes-Factor in favor of the point nil-hypothesis is often interpreted as if all the other effect sizes are inconsistent with the data. However, this is a logical fallacy because data that are inconsistent with a specific H1 can be consistent with an alternative H1. Thus, a BF in favor of H0 can only be interpreted as evidence against a specific H1, but never as evidence that the nil-hypothesis is true.
To conclude, I have argued that it is more important to estimate the replicability of published results than to estimate the percentage of false positives. A literature with 100% true positives and average power of 10% is no more desirable than a literature with 50% false positives and 50% true positives with 20% power. Ideally, researchers should conduct studies with 80% power and honest reporting of statistics and failed replications should control the false discovery rate. The Zcurve for social priming studies shows that priming researchers did not follow these basic and old principles of good science. As a result, decades of research are worthless and Kahneman was right to compare social priming research to a train wreck because the conductors ignored all warning signs.