Category Archives: Publication Bias


Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman's Science article "Measurement error and the replication crisis." Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).

Although we agree with Loken and Gelman's general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote "In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance" (p. 584). We both read this sentence as suggesting that, under the specified conditions, random error may produce even more inflated estimates than perfectly reliable measures. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes obtained with unreliable measures are also always attenuated. We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between the two measures. Random error also increases sampling error. As the non-central t-value is the ratio of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.
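To make the attenuation of correlations and power concrete, here is a minimal R sketch. The population correlation, reliabilities, and sample size are assumed example values (not taken from Loken and Gelman's simulation); the power calculation uses the same approximation as the supplemental R-code below.

### attenuation of the observed correlation by unreliability (classical test theory)
r.true = .30    ### assumed population correlation between true scores
rel.x = .70     ### assumed reliability of measure X
rel.y = .70     ### assumed reliability of measure Y
N = 100         ### assumed sample size
r.obs = r.true*sqrt(rel.x*rel.y)   ### attenuated population correlation (.21)

### power of a two-tailed test of r = 0 (same approximation as in the supplement below)
power.r = function(r,N) {
ncp = r/((1-r^2)/sqrt(N-2))
pt(ncp,N-2,qt(.975,N-2))
}
power.r(r.true,N)   ### power with perfectly reliable measures (about .90)
power.r(r.obs,N)    ### lower power with unreliable measures (about .58)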

[Figure 1: Median observed power after selection for significance as a function of true power]

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power. It is straightforward to transform inflated median observed power into median observed effect sizes. We applied this approach to Loken and Gelman's simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50–3050 to 25–1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear: in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data, and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

[Figure 2: Median observed effect sizes after selection for significance as a function of sample size and reliability]

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis. Consistent with classical test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result. The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance. Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585. doi: 10.1126/science.aal3618

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N - 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 - 2) - sqrt(N - 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = "Inflated Observed Effect Size"

title = "Effect of Selection for Significance on Observed Effect Size"

### line colors for the five reliability levels (assumed; any five distinct colors work)
col = c("black","blue","red","purple","darkgreen")

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab="Median Observed Effect Size After Selection for Significance",lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0("Rel = ",format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################


How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no simple formula that corrects observed power for inflation and recovers true power. This was part of the reason why I created the R-Index, which is an index of true power but not an estimate of true power. This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance. To use this method for real data, where only the median observed power of significant results is available, one can simply generate a range of true power values, compute the predicted median observed power for each value, and then pick the true power value with the smallest discrepancy between the observed and the predicted estimates. This approach is essentially the same as the approach used by p-curve and p-uniform, which only differ in the criterion that is being minimized.

Here is the r-code for the conversion of true.power into the predicted observed power after selection for significance.

z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
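As a minimal sketch of the grid-search approach described above (the observed median power of 75% is just an assumed example value), the vectors defined in the code above can be used directly:

### recover true power from the median observed power of significant results
observed.median.power = .75    ### assumed example value
true.power[which.min(abs(obs.pow - observed.median.power))]   ### returns .50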

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

[Figure: Inflated median observed power after selection for significance as a function of true power]

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

### p = exact two-tailed p-value of the reported test statistic
p = .02   ### assumed example value
z.crit = qnorm(.975)
Obs.power = pnorm(qnorm(1-p/2),z.crit)

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.

P.S. It is possible to prove the formula that transforms true power into median observed power. Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
med.z.sig = median(z.sim[z.sim > z.crit])
obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

Are Most Published Results in Psychology False? An Empirical Study

"Why Most Published Research Findings Are False" by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title "Why Most Published Research Findings Are False." The article starts with the observation that "there is increasing concern that most current published research findings are false" (e124). Later on, however, the concern becomes a fact: "It can be proven that most claimed research findings are false" (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations in Web of Science, 399 of them in 2016 alone).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations rest on questionable assumptions and that they are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

TP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the proportion of true hypotheses (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error probability (beta).

(2) TP = PTH * power = PTH * (1 - beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))

Ioannidis translates his claim that most published findings are false into a PPV below 50%. A PPV below 50% means that the null-hypothesis is true in more than 50% of the published significant results; that is, more than half of the published positive results falsely rejected a true null-hypothesis.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplified to the inequality

(6) alpha > PTH/PFH * power

We can rearrange formula (6), treat the boundary case as an equality, and substitute PFH with (1-PTH) to determine the maximum proportion of true hypotheses at which more than 50% of positive results are false.

(7a) alpha = PTH/(1-PTH) * power

(7b) alpha * (1-PTH) = PTH * power

(7c) alpha - PTH*alpha = PTH * power

(7d) alpha = PTH*alpha + PTH*power

(7e) alpha = PTH * (alpha + power)

(7f) PTH = alpha/(alpha + power)

 

Table 1 shows the results.

Power                  PTH / PFH             
90%                       5  / 95
80%                       6  / 94
70%                       7  / 93
60%                       8  / 92
50%                       9  / 91
40%                      11 / 89
30%                       14 / 86
20%                      20 / 80
10%                       33 / 67                     
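The thresholds in Table 1 follow directly from equation (7f) and can be reproduced with a few lines of R; the same formula with an inflated alpha reproduces Table 2 below.

### maximum proportion of true hypotheses (PTH) at which most positive results are false
power = seq(.9,.1,-.1)
alpha = .05
PTH = alpha/(alpha + power)
round(cbind(power,PTH,PFH = 1-PTH),2)   ### reproduces Table 1

### Table 2 below uses the same formula with alpha inflated by bias to .60
round(cbind(power,PTH = .60/(.60 + power)),2)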

Even if researchers conducted studies with only 20% power to discover true positive results, more than 50% of positive results would be false positives only if fewer than 20% of the tested hypotheses were true. This makes it rather implausible that most published results could be false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced by various questionable research practices that help researchers to report significant results. The main effect of these practices is that the probability that a false hypothesis yields a significant result increases.

Simmons et al. (2011) showed that the combined use of several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an effective alpha of 60%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power                 PTH/PFH             
90%                     40 / 60
80%                     43 / 57
70%                     46 / 54
60%                     50 / 50
50%                     55 / 45
40%                     60 / 40
30%                     67 / 33
20%                     75 / 25
10%                      86 / 14                    

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypotheses with 50% power and 50% of the tested hypotheses were false.

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypotheses and false hypotheses. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis's assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypotheses produce a significant result. However, with a bias factor of .4, another 40% of the false negative results will become significant, adding another .4*.5 = 20% true positive results to the number of true positive results. This gives a total of 70% positive results, which is a 40% increase over the number of positive results that would have been obtained without bias. However, this increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38% false positive results. So instead of 5% false positive results, bias increases the percentage of false positive results from 5% to 43%, an increase of 760%. Thus, the effect of bias is not symmetric: the same bias factor increases the rate of false positives much more than the rate of true positives, which lowers the PPV. Ioannidis provides no rationale for this bias model.

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000*.30*.10*.20 = 6 true positive results compared to 1000*.30*.90*.95 = 265 false positive results (i.e., 44:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that "simulations show that for most study designs and settings, it is more likely for a research claim to be false than true." (e124).
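A short R sketch reproduces these PPV values using the bias model described above, in which a bias factor u converts a fraction u of the otherwise non-significant results into significant ones; the parameter values are the ones stated in the text.

### positive predictive value with bias factor u
PPV = function(PTH,power,alpha,u) {
TP = PTH*(power + u*(1-power))          ### true positives per tested hypothesis
FP = (1-PTH)*(alpha + u*(1-alpha))      ### false positives per tested hypothesis
TP/(TP + FP)
}
PPV(PTH = 1/11, power = .80, alpha = .05, u = 0)     ### no bias: about .62
PPV(PTH = 1/11, power = .80, alpha = .05, u = .30)   ### 30% bias: about .20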

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis's claims rest on the opposite assumption that most tested hypotheses are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results if the null-hypothesis is true in no more than 50% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

10 years after the publication of “Why Most Published Research Findings Are False,”  it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.

Figure 1 illustrates the distribution of z-values that is expected for Ioannidis's model of an "adequately powered exploratory epidemiological study" (Simulation 6 in Table 4). Ioannidis assumes that for every true hypothesis that is tested, there are 10 false hypotheses (R = 1:10). He also assumes that studies have 80% power to detect a true positive. In addition, he assumes 30% bias.

[Figure 1: Simulated distribution of significant z-values for Ioannidis's Simulation 6 (adequately powered exploratory epidemiological study)]

A 30% bias implies that for every 100 false hypotheses, there would be 33 false positive results (100*[.30*.95 + .05]) rather than 5. The effect on true hypotheses is much smaller: for every 100 true hypotheses, there would be 86 true positive results (100*[.80 + .30*.20]) rather than 80. Bias was modeled by increasing the number of attempts to produce a significant result until the proportions of true and false positive results matched these predicted proportions. Given the assumed 1:10 ratio of true to false hypotheses, the expected ratio is 335 false positive results (per 1,000 false hypotheses) to 86 true positive results (per 100 true hypotheses). The simulation assumed that researchers tested 100,000 false hypotheses and observed 33,500 false positive results and that they tested 10,000 true hypotheses and observed 8,600 true positive results.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.96 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

[Figure 2: Simulated distribution of significant z-values for Ioannidis's Simulation 1 (PPV = .85)]

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false. The maximum value would be obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 52.5%. As power is much lower than 100% in practice, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis's claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from true positives that were obtained with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis's bold and unsupported claim that most published results are false.

The maximum replicability that could be obtained with 50% false positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%. However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power, 100% bias, and an equal percentage of true and false hypotheses. The true replicability for this scenario is .05*.50 + .90*.50 = 47.5%. z-curve slightly overestimates replicability and produced an estimate of 51%. Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis's hypothesis that most published positive results are false. Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicability estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggests that there is a high proportion of studies that produced true positive results with low power.

[Figure 3: Simulated distribution of significant z-values for a scenario with 50% true hypotheses, 90% power, and 100% bias]
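The expected replicability in these scenarios is just a weighted average of the replication probability of false positives (taken to be .05, as in the text) and the power of the true positives; a minimal sketch:

### expected replicability of significant results as a mixture
replicability = function(PPV,power) PPV*power + (1-PPV)*.05
replicability(PPV = .20, power = .80)   ### Simulation 6 scenario: about 20%
replicability(PPV = .50, power = 1.00)  ### 50% false positives, 100% power: 52.5%
replicability(PPV = .50, power = .90)   ### 50% false positives, 90% power: 47.5%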

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicability Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all of the original studies that failed to replicate were false positives. Thus, the results show that Ioannidis's claim that most published results are false does not apply to results published in this journal.

[Powergraph for the Journal of Experimental Psychology: Learning, Memory, and Cognition]

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

[Powergraph for Psychological Science]

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitudes and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications, and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

[Powergraph for the Journal of Personality and Social Psychology: Attitudes and Social Cognition]

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on tests of an original, novel, or important hypothesis, whereas articles often also report significance tests for other effects. For example, an intervention study may show a strong overall decrease in depression, even though only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.

 

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias. The most prominent tests are funnel plots (Light & Pillemer, 1984) and Egger regression (Egger et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes. As a result, observed effect sizes in a representative set of studies should also be independent of sample size. However, publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result. The main problem with these bias tests is that other factors may produce heterogeneity in population effect sizes, which also produces variation in observed effect sizes, and this variation in population effect sizes may be related to sample sizes. In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes. A power analysis would suggest that researchers use larger samples to study smaller effects and smaller samples to study large effects. This makes it problematic to draw strong inferences from negative correlations between effect sizes and sample sizes about the presence of publication bias.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation. The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies. If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50 studies. Publication bias will lead to an inflation in the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results. Sterling et al. (1995) found that several journals reported over 90% significant results. Based on some conservative estimates of power, they concluded that this high success rate can only be explained with publication bias. Sterling et al. (1995), however, did not develop a method that would make it possible to estimate power.

Ioannidis and Trikalinos (2007) proposed the first test for publication bias based on power analysis. They call it "an exploratory test for an excess of significant findings" (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis to examine publication bias. The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated. ETESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalinos (2007). The test works well "If it can be safely assumed that the effect is the same in all studies on the same question" (p. 246). In other words, the test may not work well when effect sizes are heterogeneous. Again, the authors are careful to point out this limitation of ETESR: "In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised" (p. 246).

The authors repeat this limitation at the end of the article: "Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting" (p. 252). Given these limitations, it would be desirable to develop a test that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibility Index (Schimmack, 2012). The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces a non-significant result as the number of studies increases. For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip. Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again. Probability theory shows that this outcome becomes very unlikely even after just a few coin tosses, as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.125%, and so on. Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to raise suspicion that the coin is not fair, especially if it always falls on the side that benefits the person who is throwing the coin. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers' hypotheses at least 90% of the time. Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05). It is therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.
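A minimal sketch of this coin-toss logic in R (the power value and the numbers of studies are assumed examples): the binomial probability of observing a given number of significant results, or more, quantifies how incredible a set of results is.

### probability of observing at least x significant results in k studies with 50% power
power = .50
1 - pbinom(4,5,power)    ### all 5 studies significant: about .03
1 - pbinom(8,10,power)   ### 9 or more out of 10 significant: about .01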

To avoid the requirement of a fixed effect size, the incredibility index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2005). However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined. The original incredibility indices used the mean to estimate average power, but Yuan and Maxwell (2005) demonstrated that the mean of observed power is a biased estimate of average (true) power. In further developments of the method, I now use median observed power (Schimmack, 2016). The median of observed power is an unbiased estimator of power (Schimmack, 2015).

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, J., Pillemer, D. B.  (1984). Summing up: The Science of Reviewing Research. Cambridge, Massachusetts.: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629-634. doi:10.1136/bmj.315.7109.629

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245-253.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-566.

Schimmack, U. (2016). A revised introduction to the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108-112.

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 141–167


Replicability Ranking of Psychology Departments

Evaluations of individual researchers, departments, and universities are common and arguably necessary as science is becoming bigger. Existing rankings are based to a large extent on peer-evaluations. A university is ranked highly if peers at other universities perceive it to produce a steady stream of high-quality research. At present the most widely used objective measures rely on the quantity of research output and on the number of citations. These quantitative indicators of research quality are also heavily influenced by peers because peer-review controls what gets published, especially in journals with high rejection rates, and peers decide what research they cite in their own work. The social mechanisms that regulate peer-approval are unavoidable in a collective enterprise like science that does not have a simple objective measure of quality (e.g., customer satisfaction ratings, or accident rates of cars). Unfortunately, it is well known that social judgments are subject to many biases due to conformity pressure, self-serving biases, confirmation bias, motivated biases, etc. Therefore, it is desirable to complement peer-evaluations with objective indicators of research quality.

Some aspects of research quality are easier to measure than others. Replicability rankings focus on one aspect of research quality that can be measured objectively, namely the replicability of a published significant result. In many scientific disciplines such as psychology, a successful study reports a statistically significant result. A statistically significant result is used to minimize the risk of publishing evidence for an effect that does not exist (or even goes in the opposite direction). For example, a psychological study that shows the effectiveness of a treatment for depression would have to show that the effect observed in the study reflects a real effect that can also be observed in other studies and in real patients when the treatment is used in practice.

In a science that produces thousands of results a year, it is inevitable that some of the published results are fluke findings (even Toyotas break down sometimes). To minimize the risk of false results entering the literature, psychology, like many other sciences, adopted a 5% error rate. By using 5% as the criterion, psychologists ensured that no more than 5% of results are fluke findings. With thousands of results published each year, this still means that more than 50 false results enter the literature each year. However, this is acceptable because a single study does not have immediate consequences. Only if these results are replicated in other studies do findings become the foundation of theories and influence practical decisions in therapy or in other applications of psychological findings (at work, in schools, or in policy). Thus, to outside observers it may appear safe to trust published results in psychology and to report about these findings in newspaper articles, popular books, or textbooks.

Unfortunately, it would be a mistake to interpret a significant result in a psychology journal as evidence that the result is probably true. The reason is that the published success rate in journals has nothing to do with the actual success rate in psychological laboratories. All insiders know that it is common practice to report only results that support a researcher's theory. While outsiders may think of scientists as neutral observers (judges), insiders play the game of lobbyists, advertisers, and self-promoters. The game is to advance one's theory, publish more than others, get more citations than others, and win more grant money than others. Honest reporting of failed studies does not advance this agenda. As a result, the fact that psychological studies report nearly exclusively success stories (Sterling, 1995; Sterling et al., 1995) tells outside observers nothing about the replicability of a published finding and the true rate of fluke findings could be 100%.

This problem has been known for over 50 years (Cohen, 1962; Sterling, 1959). So it would be wrong to call the selective reporting of successful studies an acute crisis. However, what changed is that some psychologists have started to criticize the widely accepted practice of selective reporting of successful studies (Asendorpf et al., 2012; Francis, 2012; Simonsohn et al., 2011; Schimmack, 2012; Wagenmakers et al., 2011). Over the past five years, psychologists, particularly social psychologists, have been engaged in heated arguments over the so-called “replication crisis.”

One group argues that selective publishing of successful studies occurred, but without real consequences on the trustworthiness of published results. The other group argues that published results cannot be trusted unless they have been successfully replicated. The problem is that neither group has objective information about the replicability of published results.  That is, there is no reliable estimate of the percentage of studies that would produce a significant result again, if a representative sample of significant results published in psychology journals were replicated.

Evidently, it is not possible to conduct exact replication studies of all studies that have been published in the past 50 years. Fortunately, it is not necessary to conduct exact replication studies to obtain an objective estimate of replicability. The reason is that replicability of exact replication studies is a function of the statistical power of studies (Sterling et al., 1995). Without selective reporting of results, a 95% success rate is an estimate of the statistical power of the studies that achieved this success rate. Vice versa, a set of studies with average power of 50% is expected to produce a success rate of 50% (Sterling, et al., 1995).

Although selection bias renders success rates uninformative, the actual statistical results provide valuable information that can be used to estimate the unbiased statistical power of published results. Although selection bias inflates effect sizes and power, Brunner and Schimmack (forthcoming) developed and validated a method that can correct for selection bias. This method makes it possible to estimate the replicability of published significant results on the basis of the original reported results. This statistical method was used to estimate the replicability of research published by psychology departments in the years from 2010 to 2015 (see Methodology for details).

The averages for the 2010-2012 period (M = 59) and the 2013-2015 period (M = 61) show only a small difference, indicating that psychologists have not changed their research practices in accordance with recommendations to improve replicability in 2011  (Simonsohn et al., 2011). For most of the departments the confidence intervals for the two periods overlap (see attached powergraphs). Thus, the more reliable average across all years is used for the rankings, but the information for the two time periods is presented as well.

There are no obvious predictors of variability across departments. Private universities are at the top (#1, #2, #8), the middle (#24, #26), and at the bottom (#44, #47). European universities can also be found at the top (#4, #5), middle (#25) and bottom (#46, #51). So are Canadian universities (#9, #15, #16, #18, #19, #50).

There is no consensus on an optimal level of replicability. Cohen recommended that researchers should plan studies with 80% power to detect real effects. If 50% of studies tested real effects with 80% power and the other 50% tested a null-hypothesis (no effect = 2.5% probability to replicate a false result again), the estimated power for significant results would be 78%. The effect on average power is so small because most of the false predictions produce a non-significant result. As a result, only a few studies with low replication probability dilute the average power estimate. Thus, a value greater than 70 can be considered broadly in accordance with Cohen's recommendations.

It is important to point out that the estimates are very optimistic estimates of the success rate in actual replications of theoretically important effects. For a representative set of 100 studies (OSC, Science, 2015), Brunner and Schimmack's statistical approach predicted a success rate of 54%, but the success rate in actual replication studies was only 37%. One reason for this discrepancy could be that the statistical approach assumes that the replication studies are exact, but actual replications always differ in some ways from the original studies, and this uncontrollable variability in experimental conditions poses another challenge for replicability of psychological results. Until further validation research has been completed, the estimates can only be used as a rough estimate of replicability. However, the absolute accuracy of estimates is not relevant for the relative comparison of psychology departments.

And now, without further ado, the first objective rankings of 51 psychology departments based on the replicability of published significant results. More departments will be added to these rankings as the results become available.

Rank University 2010-2015 2010-2012 2013-2015
1 U Penn 72 69 75
2 Cornell U 70 67 72
3 Purdue U 69 69 69
4 Tilburg U 69 71 66
5 Humboldt U Berlin 67 68 66
6 Carnegie Mellon 67 67 67
7 Princeton U 66 65 67
8 York U 66 63 68
9 Brown U 66 71 60
10 U Geneva 66 71 60
11 Northwestern U 65 66 63
12 U Cambridge 65 66 63
13 U Washington 65 70 59
14 Carleton U 65 68 61
15 Queen’s U 63 57 69
16 U Texas – Austin 63 63 63
17 U Toronto 63 65 61
18 McGill U 63 72 54
19 U Virginia 63 61 64
20 U Queensland 63 66 59
21 Vanderbilt U 63 61 64
22 Michigan State U 62 57 67
23 Harvard U 62 64 60
24 U Amsterdam 62 63 60
25 Stanford U 62 65 58
26 UC Davis 62 57 66
27 UCLA 61 61 61
28 U Michigan 61 63 59
29 Ghent U 61 58 63
30 U Waterloo 61 65 56
31 U Kentucky 59 58 60
32 Penn State U 59 63 55
33 Radboud U 59 60 57
34 U Western Ontario 58 66 50
35 U North Carolina Chapel Hill 58 58 58
36 Boston University 58 66 50
37 U Mass Amherst 58 52 64
38 U British Columbia 57 57 57
39 The University of Hong Kong 57 57 57
40 Arizona State U 57 57 57
41 U Missouri 57 55 59
42 Florida State U 56 63 49
43 New York U 55 55 54
44 Dartmouth College 55 68 41
45 U Heidelberg 54 48 60
46 Yale U 54 54 54
47 Ohio State U 53 58 47
48 Wake Forest U 51 53 49
49 Dalhousie U 50 45 55
50 U Oslo 49 54 44
51 U Kansas 45 45 44

 


"Do Studies of Statistical Power Have an Effect on the Power of Studies?" by Peter Sedlmeier and Gerd Gigerenzer

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen's method to a variety of different journals. The table below shows that power estimates vary by journal, assuming that the effect size was medium according to Cohen's criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen's method. I also included the results of Sedlmeier and Gigerenzer's power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Abnormal and Social Psychology was split into the Journal of Abnormal Psychology and the Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

JOURNAL TITLE YEAR Power%
Journal of Marketing Research 1981 89
American Sociological Review 1974 84
Journalism Quarterly, The Journal of Broadcasting 1976 76
American Journal of Educational Psychology 1972 72
Journal of Research in Teaching 1972 71
Journal of Applied Psychology 1976 67
Journal of Communication 1973 56
The Research Quarterly 1972 52
Journal of Abnormal Psychology 1984 50
Journal of Abnormal and Social Psychology 1962 48
American Speech and Hearing Research & Journal of Communication Disorders 1975 44
Counselor Education and Supervision 1973 37

 

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of  37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%).  The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.

The average for the three core psychology journals (JASP & JAbnPsy, JAP, AJEduPsy) is 67% (median = 63%), which is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, a design with several conditions produces a rapidly growing number of pairwise comparisons. With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With five conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions may not be significant if this test is part of a series of tests.

Sedlmeier and Gigerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen's results. However, many articles presented results using a more stringent criterion of significance. If the criterion actually used by the authors had been used for the power analysis, power estimates would have been even lower. About 50% of all articles used an adjusted criterion value, and when the adjusted criterion value was used, power was only 37%.

Sedlmeier and Gigerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Gigerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but the studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to examine gender differences in a more powerful study with more female and male participants. However, this is not what researchers do. Rather, multiple-study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely unlikely when the individual studies have only 50% power to obtain a significant result in a single attempt. For each additional attempt, the probability of obtaining only significant results decreases exponentially (1 study, 50%; 2 studies, 25%; 3 studies, 12.5%; 4 studies, 6.25%).
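The decline in total power is easy to verify in R (assuming 50% power for each study, as in the example above):

### total power: probability that all k studies in an article are significant
power = .50
k = 1:4
power^k   ### 0.5 0.25 0.125 0.0625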

The fact that researchers only publish studies that worked is well known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager who only reports the returns of funds that gained money and omits funds with losses. Would you trust this manager with your retirement savings? It is also akin to a gambler who only remembers winning. Would you marry a gambler who believes that gambling is fine because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present five studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question, it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in a field, for a researcher, or in the journal that published the results.
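
For a single reported result, an estimate of observed power can be obtained from the absolute z-score of the test statistic. The small R function below is a minimal sketch of this calculation for a two-tailed test with alpha = .05; the function name and the example values are my own.

observed_power <- function(z, alpha = .05) {
  crit <- qnorm(1 - alpha / 2)
  pnorm(z - crit) + pnorm(-z - crit)   # probability that |Z| exceeds the critical value
}
observed_power(1.96)   # about .50: a just-significant result implies about 50% observed power
observed_power(2.80)   # about .80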

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power ranking of scientific journals and the first analysis of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased statistical power to ensure that their studies could actually produce significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are welcome.


Distinguishing Questionable Research Practices from Publication Bias

It is well known that scientific journals favor statistically significant results (Sterling, 1959). This phenomenon is known as publication bias. Publication bias can be easily detected by comparing the observed statistical power of studies with the success rate in journals. Success rates of 90% or more would only be expected if most theoretical predictions are true and empirical studies have over 90% statistical power to produce significant results. Estimates of statistical power range from 20% to 50% (Button et al., 2015; Cohen, 1962). It follows that for every published significant result an unknown number of non-significant results has occurred that remained unpublished. These results linger in researchers' proverbial file-drawers or, more literally, in unpublished data sets on researchers' computers.
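
A back-of-the-envelope calculation shows how implausible such success rates are. The R snippet below (my own example, with an assumed average power of 35% and 100 studies) computes the probability of observing 90 or more significant results if every attempt were reported.

avg_power <- .35
n_studies <- 100
1 - pbinom(89, size = n_studies, prob = avg_power)   # probability of 90+ significant results: essentially zero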

The selection of significant results also creates an incentive for researchers to produce significant results. In the most extreme cases, researchers simply fabricate data to produce significant results. However, outright scientific fraud is rare. A more serious threat to the integrity of science is the use of questionable research practices. Questionable research practices are all research activities that create a systematic bias in empirical results. Although systematic bias can produce too many or too few significant results, the incentive to publish significant results suggests that questionable research practices are typically used to produce significant results.

In sum, publication bias and questionable research practices contribute to an inflated success rate in scientific journals. So far, it has been difficult to examine the prevalence of questionable research practices in science. One reason is that publication bias and questionable research practices are conceptually overlapping. For example, a research article may report the results of a 2 x 2 x 2 ANOVA or a regression analysis with 5 predictor variables. The article may report only the significant results and omit detailed reporting of the non-significant results. For instance, researchers may state that none of the gender effects were significant without reporting the results for main effects or interactions involving gender. I classify these cases as publication bias because each result tests a different hypothesis, even if the statistical tests are not independent.

Questionable research practices are practices that change the probability of obtaining a specific significant result. An example would be a study with multiple outcome measures that would all support the same theoretical hypothesis. For example, a clinical trial of an anti-depressant might include several depression measures. In this case, a researcher can increase the chances of a significant result by conducting a test for each measure. Other questionable research practices include optional stopping once a significant result is obtained, or selective deletion of cases based on whether the results become significant after deletion. A common consequence of these questionable practices is that they produce results that meet the significance criterion but deviate from the distribution that is expected simply on the basis of random sampling error.
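
A short simulation illustrates the first of these practices. The R sketch below (my own example; the sample size, the four correlated outcome measures, and the correlation of .5 are arbitrary assumptions) simulates studies with no true effect in which only the best of several outcome measures is reported. The rate of nominally significant studies rises well above the nominal 5%.

set.seed(123)
library(MASS)   # for mvrnorm()

simulate_study <- function(n = 40, k = 4, r = .5) {
  Sigma <- matrix(r, k, k); diag(Sigma) <- 1
  group <- factor(rep(c("control", "treatment"), each = n / 2))   # no true effect
  y <- mvrnorm(n, mu = rep(0, k), Sigma = Sigma)                  # k correlated outcome measures
  pvals <- apply(y, 2, function(m) t.test(m ~ group)$p.value)
  min(pvals)                                                      # report only the best outcome
}

mean(replicate(5000, simulate_study()) < .05)   # roughly .13 in this setup, instead of the nominal .05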

A number of articles have tried to examine the prevalence of questionable research practices by comparing the frequency of p-values above and below the typical criterion of statistical significance, namely a p-value less than .05. The logic is that random sampling error alone would produce a nearly equal number of p-values just above .05 (e.g., p = .06) and just below .05 (e.g., p = .04). According to this logic, questionable research practices are present if there are more p-values just below the criterion than just above it (Masicampo & Lalande, 2012).

Daniel Lakens has pointed out some problems with this approach. The most crucial problem is that publication bias alone is sufficient to predict a lower frequency of p-values just above the significance criterion. After all, these p-values imply a non-significant result, and non-significant results are subject to publication bias. The only reason why p-values of .06 are reported with higher frequency than p-values of .11 is that p-values between .05 and .10 are sometimes reported as marginally significant evidence for a hypothesis. Another problem is that many p-values of .04 are not reported as p = .04 but as p < .05. Thus, the distribution of p-values close to the criterion value provides unreliable information about the prevalence of questionable research practices.
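
Lakens' first point is easy to demonstrate with a simulation. The R sketch below (my own example, assuming two-sample studies with roughly 50% power and assuming that only 10% of non-significant results get published) contains no p-hacking at all, yet "just significant" p-values end up far more frequent than "just missed" p-values in the published set.

set.seed(456)
p <- replicate(20000, t.test(rnorm(25, mean = .55), rnorm(25))$p.value)   # roughly 50% power
published <- c(p[p < .05],                                                # significant results are published
               sample(p[p >= .05], size = round(sum(p >= .05) * .1)))     # only 10% of the rest
sum(published > .03 & published <= .05)   # many just-significant p-values
sum(published > .05 & published <= .07)   # far fewer just-missed p-values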

In this blog post, I introduce an alternative approach to the detection of questionable research practices that produce just-significant results. Questionable research practices and publication bias have different effects on the distribution of p-values (or corresponding measures of the strength of evidence). Whereas publication bias produces a distribution that is consistent with the average power of studies, questionable research practices produce an abnormal distribution with a peak just below the significance criterion. In other words, questionable research practices produce a distribution with too few non-significant results and too few highly significant results relative to the number of just-significant results.

I illustrate this test of questionable research practices with post-hoc power analysis of three journals. The first journal shows neither signs of publication bias nor signs of questionable research practices. The second journal shows clear evidence of publication bias, but no evidence of questionable research practices. The third journal illustrates the influence of both publication bias and questionable research practices.

Example 1: A Relatively Unbiased Z-Curve

The first example is based on results published during the years 2010-2014 in the Journal of Experimental Psychology: Learning, Memory, and Cognition. A text-mining program searched all articles for reports of F-tests, t-tests, correlation coefficients, regression coefficients, odds ratios, confidence intervals, and z-tests. Because p-values are reported inconsistently and imprecisely (p = .02 or p < .05), p-values themselves were not used. Instead, all statistical tests were converted into absolute z-scores.
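
The conversion works by computing the exact two-tailed p-value from the reported test statistic and its degrees of freedom and then transforming that p-value into the corresponding absolute value of a standard normal deviate. A minimal R sketch (my own helper functions, not the text-mining program itself):

p_from_t <- function(t, df) 2 * pt(abs(t), df, lower.tail = FALSE)      # two-tailed p from a t-test
p_from_F <- function(f, df1, df2) pf(f, df1, df2, lower.tail = FALSE)   # p from an F-test
z_from_p <- function(p) qnorm(1 - p / 2)                                # absolute z-score

z_from_p(p_from_t(2.5, df = 40))             # about 2.4
z_from_p(p_from_F(6.25, df1 = 1, df2 = 40))  # same z, because F(1, df) = t(df)^2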

The program found 14,800 tests. 8,423 tests fell into the critical interval between z = 2 and z = 6 that is used to estimate four non-centrality parameters and four weights. The fitted model describes the distribution of z-values between 2 and 6 and is then used to project the distribution of z-values in the range from 0 to 2. Z-values greater than 6 are not used for fitting because they correspond to power close to 1; 11% of all tests fall into this region and are not shown in the figure.
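
The following R code is a deliberately simplified sketch of this kind of model, not the program used for the analyses reported here. It fits a mixture of folded normal distributions, truncated to the window from 2 to 6, over an arbitrary grid of four non-centrality parameters, and it treats the fitted weights as the weights of the significant results when averaging power; both simplifications are my own.

fit_zcurve <- function(z, lo = 2, hi = 6, ncp = c(0, 1.5, 3, 4.5)) {
  z <- z[z > lo & z < hi]   # only z-scores inside the fitting window

  # density of |N(m, 1)| truncated to the window (lo, hi)
  trunc_dens <- function(x, m) {
    (dnorm(x - m) + dnorm(x + m)) /
      ((pnorm(hi - m) - pnorm(lo - m)) + (pnorm(hi + m) - pnorm(lo + m)))
  }

  # negative log-likelihood of the mixture; weights parameterized via softmax
  nll <- function(par) {
    w <- exp(c(0, par)); w <- w / sum(w)
    dens <- sapply(ncp, function(m) trunc_dens(z, m))
    -sum(log(dens %*% w))
  }

  fit <- optim(rep(0, length(ncp) - 1), nll)
  w <- exp(c(0, fit$par)); w <- w / sum(w)

  # power of each component for a two-tailed test with alpha = .05
  pow <- pnorm(ncp - qnorm(.975)) + pnorm(-ncp - qnorm(.975))

  # weighted average power of the significant results
  c(avg_power_significant = sum(w * pow))
}

# usage: fit_zcurve(abs_z), where abs_z is a vector of absolute z-scores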

PHP-Curve JEP-LMC

The histogram and the blue density distribution show the observed data. The green curve shows the predicted distribution based on the post-hoc power analysis. Post-hoc power analysis suggests that the average power of significant results is 67%. Power for the statistical tests in the fitting window is estimated to be 58%; including the 11% of z-scores greater than 6, which have power close to 1, the estimate is .58 * .89 + .11 = 63%. More important is the predicted distribution of z-scores. The predicted distribution on the left side of the criterion value matches the observed distribution rather well. This shows that there are not many missing non-significant results; in other words, there does not appear to be a file drawer of studies with non-significant results. There is also only a very small blip in the observed data just at the level of statistical significance. The close match between the observed and predicted distributions suggests that results in this journal are relatively free of systematic bias due to publication bias or questionable research practices.

Example 2: A Z-Curve with Publication Bias

The second example is based on results published in the Attitudes & Social Cognition Section of the Journal of Personality and Social Psychology. The text-mining program retrieved 5,919 tests from articles published between 2010 and 2014. 3,584 tests provided z-scores in the range from 2 to 6 that is used for model fitting.

PHP-Curve JPSP-ASC

The average power of significant results in JPSP-ASC is 55%. This is significantly less than the average power in JEP-LMC, which was used for the first example. The estimated power for all statistical tests, including those in the estimated file drawer, is 35%. More important is the estimated distribution of z-values. On the right side of the significance criterion, the estimated curve fits the observed distribution relatively closely. This finding shows that random sampling error alone is sufficient to explain the observed distribution of significant results. On the left side of the criterion, however, the observed z-scores drop off steeply. This drop is consistent with publication bias: researchers do not report all non-significant results. There is only a slight hint that questionable research practices are also present, because observed z-scores just above the criterion value are a bit more frequent than the model predicts. However, this discrepancy is not conclusive, because the model could assume a larger file drawer, which would produce a steeper slope. The most important characteristic of this z-curve is the steep cliff on the left side of the criterion value and the gentle slope on the right side.
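
Taken at face value, an average pre-selection power of 35% implies a sizable file drawer. A back-of-the-envelope R calculation (my own, using the estimate quoted above):

avg_power_all <- .35
(1 - avg_power_all) / avg_power_all   # about 1.9 non-significant tests expected for every significant one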

Example 3: A Z-Curve with Questionable Research Practices

Example 3 uses results published in the journal Aggressive Behavior during the years 2010 to 2014. The text-mining program found 1,429 results, of which 863 z-scores in the range from 2 to 6 were used for the post-hoc power analysis.

PHP-Curve for AggressiveBeh 2010-14

The average power for significant results in the range from 2 to 6 is 73%, which is similar to the power estimate in the first example. The power estimate that includes non-significant results is 68%. The two estimates are similar because there is no evidence of a file drawer with many underpowered studies. In fact, there are more observed non-significant results than predicted non-significant results, especially for z-scores close to zero. This outcome reveals some problems with estimating the frequency of non-significant results from the distribution of significant results alone. More important, the graph shows a cluster of z-scores just above and below the significance criterion. The steep cliff to the left of this cluster might suggest publication bias, but the distribution as a whole does not show evidence of publication bias. Moreover, the steep cliff on the right side of the cluster cannot be explained by publication bias. Only questionable research practices can produce this cliff, because publication bias acts on results produced by random sampling error, which yields the gentle slope of z-scores shown in the second example.

Prevalence of Questionable Research Practices

The examples suggest that the distribution of z-scores can be used to distinguish publication bias from questionable research practices. Based on this approach alone, questionable research practices would appear to be rare: the journal Aggressive Behavior is exceptional, and most journals show a pattern similar to Example 2, with file drawers of varying size. However, this does not mean that questionable research practices are rare, because the pattern observed in Example 2 is most likely a combination of questionable research practices and publication bias. As shown in Example 2, the typical power of statistical tests that produce a significant result is about 60%. However, researchers do not know in advance which experiments will produce significant results. Slight modifications in experimental procedures, so-called hidden moderators, can easily turn an experiment with 60% power into an experiment with 30% power. Thus, the probability of obtaining a significant result in a replication study is less than the nominal power of 60% that is implied by post-hoc power analysis. With only 30% to 60% power, researchers will frequently encounter results that fail to produce an expected significant result. In this case, researchers have two ways to avoid reporting a non-significant result: they can put the study in the file drawer, or they can try to salvage the study with the help of questionable research practices. It is likely that researchers do both and that the course of action depends on the results. If the data show a trend in the right direction, questionable research practices are an attractive option. If the data show a trend in the opposite direction, it is more likely that the study will be terminated and the results will remain unreported.

Simmons et al. (2011) conducted simulation studies and found that even extreme use of multiple questionable research practices (p-hacking) produces a significant result in at most about 60% of cases when the null hypothesis is true. If such extreme use of questionable research practices were widespread, z-curve would produce corrected power estimates well below 50%. There is no evidence that extreme use of questionable research practices is prevalent. In contrast, there is strong evidence that researchers conduct many more studies than they actually report and that many of these studies have a low probability of success.

Implications of File-Drawers for Science

First, it is clear that researchers could be more productive if they used existing resources more effectively. An fMRI study with 20 participants costs about $10,000. Conducting a $10,000 study that has only a 50% probability of producing a significant result is wasteful and should not be funded by taxpayers. Just publishing the non-significant result does not fix this problem, because a non-significant result in a study with 50% power is inconclusive: even if the predicted effect exists, one would expect a non-significant result in every second study. Instead of spending $10,000 on studies with 50% power, researchers should invest $20,000 in studies with higher power (unfortunately, power does not increase in proportion to resources). With the same total research budget, more of the money would contribute to results that are actually published. Thus, without spending more money, science could progress faster.
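
A quick R illustration of the point about resources (my own numbers; a two-sample design with an assumed effect size of d = .4): doubling the per-group sample size roughly moves a study from 50% power to 80% power, a worthwhile but less-than-proportional gain.

power.t.test(n = 50,  delta = .4, sd = 1, sig.level = .05)$power   # about .51
power.t.test(n = 100, delta = .4, sd = 1, sig.level = .05)$power   # about .80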

Second, higher powered studies make non-significant results more informative. If a study has 80% power, there is only a 20% chance of a non-significant result when the effect is present. If a study has 95% power, the chance of a non-significant result is just as low as the chance of a false positive result. In this case, it is noteworthy when a theoretical prediction is not confirmed. In a set of high-powered studies, a post-hoc power analysis would show a bimodal distribution, with a cluster of z-scores around 0 for true null hypotheses and a cluster of z-scores of 3 or higher for clear effects. Type-I and Type-II errors would be rare.

Third, Example 3 shows that the use of questionable research practices becomes detectable in the absence of a large file drawer, which would make it harder to publish results that were obtained with questionable research practices.

Finally, the ability to estimate the size of file drawers may encourage researchers to plan studies more carefully and to invest more resources in each study to keep their file drawers small, because a large file drawer may harm their reputation or decrease funding.

In conclusion, post-hoc power analysis of large sets of statistical results can be used to estimate the size of the file drawer based on the distribution of z-scores on the right side of a significance criterion. As file drawers harm science, this tool can serve as an incentive to conduct studies that produce credible results and thus reduce the need for dishonest research practices. In this regard, post-hoc power analysis complements other efforts towards open science, such as preregistration and data sharing.