# Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis. Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the proportion of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is proportional to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significant is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Locken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50 to 3050 to 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science,

355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N – 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) – sqrt(N – 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = “Inflated Observed Effect Size”

title = “Effect of Selection for Significance on Observed Effect Size”

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type=”l”,xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab=”Sample Size”,ylab=”Median Observed Effect Size After Selection for Significance”,lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0(“Rel = “,format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

# How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, generate the predicted median observed power and then pick the true power value with the smallest discrepancy between median observed power and simulated inflated power estimates. This approach is essentially the same as the approach used by pcurve and puniform, which only
differ in the criterion that is being minimized.

Here is the r-code for the conversion of true.power into the predicted observed power after selection for significance.

true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
Obs.power = pnorm(qnorm(1-p/2),z.crit)

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.

P.S.  It is possible to proof the formula that transforms true power into median observed power.  Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
med.z.sig = median(z.sim[z.sim > z.crit])
obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

# A comparison of The Test of Excessive Significance and the Incredibility Index

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias.  The most prominent tests are funnel plots (Light & Pillemer, 1984) and Eggert regression (Eggert et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes.  As a result, observed effect sizes in a representative set of studies should also be independent of sample size.  However, publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result.  The main problem with these bias tests is that other factors may produce heterogeneity in population effect sizes that can also produce variation in observed effect sizes and the variation in population effect sizes may be related to sample sizes.  In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes.  A power analysis would suggest that researchers use larger samples to study smaller effects and smaller samples to study large effects.  This makes it problematic to draw strong inferences from negative correlations between effect sizes and sample sizes about the presence of publication bias.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation.  The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies.  If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50 studies.  Publication bias will lead to an inflation in the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results.  Sterling et al. (1995) found that several journals reported over 90% of significant results. Based on some conservative estimates of power, he concluded that this high success rate can only be explained with publication bias.  Sterling et al. (1995), however, did not develop a method that would make it possible to estimate power.

Ioannidis and Trikalonis (2007) proposed the first test for publication bias based on power analysis.  They call it “An exploratory test for an excess of significant results.” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis to examine publication bias.  The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated.  ETSESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalonis (2007).  The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous.  Again, the authors are careful to point out this limitation of ETSER. “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article. “Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting.” (p. 252).   Given these limitations, it would be desirable to develop a test that that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibilty Index (Schimmack, 2012).  The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces a non-significant result as the number of studies increases.  For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip.  Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again.  Probability theory shows that this outcome becomes very unlikely even after just a few coin tosses as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.1.25% and so on.  Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to be suspicious that the coin is fair, especially if it always falls on the side that benefits the person who is throwing the coin. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypothesis at least 90% of the time.  Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05).  It therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.

To avoid the requirement of a fixed effect size, the incredibility index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2006).  However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined.  The original incredibility indices used the mean to estimate averaged power, but Yuan and Maxwell (2006) demonstrated that the mean of observed power is a biased estimate of average (true) power.  In further developments of my method, I changed the method and I am now using median observed power (Schimmack, 2016).  The median of observed power is an unbiased estimator of power (Schimmack, 2015).

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, J., Pillemer, D. B.  (1984). Summing up: The Science of Reviewing Research. Cambridge, Massachusetts.: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test”. BMJ 315 (7109): 629–634. doi:10.1136/bmj.315.7109.629.

Ioannidis and Trikalinos (2007).  An exploratory test for an excess of significant findings. Clinical Trials, 4 245-253.

Schimmack (2012). The Ironic effect of significant results on the credibility of multiple study articles. Psychological Methods, 17, 551-566.

Schimmack, U. (2016). A revised introduction o the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Stering, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa, The American Statistician, 49, 108-112.

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 141–167

# Dr. Ulrich Schimmack’s Blog about Replicability

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication (Cohen, 1994).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY:  In empirical studies with random error variance replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2017 Blog Posts:

(October, 24, 2017)
Preliminary 2017 Replicability Rankings of 104 Psychology Journals

(September 4, 2017)
The Power of the Pen Paradigm: A Replicability Analysis

(February, 2, 2017)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect?
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?
RR No3. (September 4, 2017) The power of the pen paradigm: A replicability analysis

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# TOP TEN LIST

1.  Preliminary 2017  Replicability Rankings of 104 Psychology Journals
Rankings of 104 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2017, and a comparison of psychological disciplines (cognitive, clinical, social, developmental, biological).

2.  Z-Curve: Estimating replicability for sets of studies with heterogeneous power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.

3. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

4.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

5.  MOST VIEWED POST (with comment by Noble Laureate Daniel Kahneman)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking fast and slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.

8.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

# The Abuse of Hoenig and Heisey: A Justification of Power Calculations with Observed Effect Sizes

In 2001, Hoenig and Heisey wrote an influential article, titled “The Abuse of Power: The Persuasive Fallacy of Power Calculations For Data Analysis.”  The article has been cited over 500 times and it is commonly cited as a reference to claim that it is a fallacy to use observed effect sizes to compute statistical power.

In this post, I provide a brief summary of Hoenig and Heisey’s argument. The summary shows that Hoenig and Heisey were concerned with the practice of assessing the statistical power of a single test based on the observed effect size for this effect. I agree that it is often not informative to do so (unless the result is power = .999). However, the article is often cited to suggest that the use of observed effect sizes in power calculations is fundamentally flawed. I show that this statement is false.

The abstract of the article makes it clear that Hoenig and Heisey focused on the estimation of power for a single statistical test. “There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result” (page 1). The abstract informs readers that this practice is fundamentally flawed. “This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic” (p. 1).

Given that method articles can be difficult to read, it is possible that the misinterpretation of Hoenig and Heisey is the result of relying on the term “fundamentally flawed” in the abstract. However, some passages in the article are also ambiguous. In the Introduction Hoenig and Heisey write “we describe the flaws in trying to use power calculations for data-analytic purposes” (p. 1). It is not clear what purposes are left for power calculations if they cannot be used for data-analytic purposes. Later on, they write more forcefully “A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed.” (p. 2). So readers cannot be blamed entirely if they believed that calculations of observed power are fundamentally flawed. This conclusion is often implied in Hoenig and Heisey’s writing, which is influenced by their broader dislike of hypothesis testing  in general.

The main valid argument that Hoenig and Heisey make is that power analysis is based on the unknown population effect size and that effect sizes in a particular sample are contaminated with sampling error.  As p-values and power estimates depend on the observed effect size, they are also influenced by random sampling error.

In a special case, when true power is 50%, the p-value matches the significance criterion. If sampling error leads to an underestimation of the true effect size, the p-value will be non-significant and the power estimate will be less than 50%. When sampling error inflates the observed effect size, p-values will be significant and power will be above 50%.

It is therefore impossible to find scenarios where observed power is high (80%) and a result is not significant, p > .05, or where observed power is low (20%) and a result is significant, p < .05.  As a result, it is not possible to use observed power to decide whether a non-significant result was obtained because power was low or because power was high but the effect does not exist.

In fact, a simple mathematical formula can be used to transform p-values into observed power and vice versa (I actually got the idea of using p-values to estimate power from Hoenig and Heisey’s article).  Given this perfect dependence between the two statistics, observed power cannot add additional information to the interpretation of a p-value.

This central argument is valid and it does mean that it is inappropriate to use the observed effect size of a statistical test to draw inferences about the statistical power of a significance test for the same effect (N = 1). Similarly, one would not rely on a single data point to draw inferences about the mean of a population.

However, it is common practice to aggregate original data points or to aggregated effect sizes of multiple studies to obtain more precise estimates of the mean in a population or the mean effect size, respectively. Thus, the interesting question is whether Hoenig and Heisey’s (2001) article contains any arguments that would undermine the aggregation of power estimates to obtain an estimate of the typical power for a set of studies. The answer is no. Hoenig and Heisey do not consider a meta-analysis of observed power in their discussion and their discussion of observed power does not contain arguments that would undermine the validity of a meta-analysis of post-hoc power estimates.

A meta-analysis of observed power can be extremely useful to check whether researchers’ a priori power analysis provide reasonable estimates of the actual power of their studies.

Assume that researchers in a particular field have to demonstrate that their studies have 80% power to produce significant results when an important effect is present because conducting studies with less power would be a waste of resources (although some granting agencies require power analyses, these power analyses are rarely taken seriously, so I consider this a hypothetical example).

Assume that researchers comply and submit a priori power analysis with effect sizes that are considered to be sufficiently meaningful. For example, an effect of half-a-standard deviation (Cohen’s d = .50) might look reasonable large to be meaningful. Researchers submit their grant applications with a prior power analysis that produce 80% power with an effect size of d = .50. Based on the power analysis, researchers request funding for 128 participants. A researcher plans four studies and needs \$50 for each participant. The total budget is \$25,600.

When the research project is completed, all four studies produced non-significant results. The observed standardized effect sizes were 0, .20, .25, and .15. Is it really impossible to estimate the realized power in these studies based on the observed effect sizes? No. It is common practice to conduct a meta-analysis of observed effect sizes to get a better estimate of the (average) population effect size. In this example, the average effect size across the four studies is d = .15. It is also possible to show that the average effect size in these four studies is significantly different from the effect size that was used for the a priori power calculation (M1 = .15, M2 = .50, Mdiff = .35, SE = 1/sqrt(512) = .044, t = .35 / .044 = 7.92, p < 1e-13). Using the more realistic effect size estimate that is based on actual empirical data rather than wishful thinking, the post-hoc power analysis yields a power estimate of 13%. The probability of obtaining non-significant results in all four studies is 57%. Thus, it is not surprising that the studies produced non-significant results.  In this example, a post-hoc power analysis with observed effect sizes provides valuable information about the planning of future studies in this line of research. Either effect sizes of this magnitude are not important enough and research should be abandoned or effect sizes of this magnitude still have important practical implications and future studies should be planned on the basis of a priori power analysis with more realistic effect sizes.

Another valuable application of observed power analysis is the detection of publication bias and questionable research practices (Ioannidis and Trikalinos; 2007), Schimmack, 2012) and for estimating the replicability of statistical results published in scientific journals (Schimmack, 2015).

In conclusion, the article by Hoenig and Heisey is often used as a reference to argue that observed effect sizes should not be used for power analysis.  This post clarifies that this practice is not meaningful for a single statistical test, but that it can be done for larger samples of studies.

# “Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerg Giegerenzer

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen’s method to a variety of different journals. The table below shows that power estimates vary by journal assuming that the effect size was medium according to Cohen’s criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen’s method. I also included the results of Sedlmeier and Giegerenzer’s power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Social and Abnormal Psychology was split into Journal of Abnormal Psychology and Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

JOURNAL TITLE YEAR Power%
Journal of Marketing Research 1981 89
American Sociological Review 1974 84
Journalism Quarterly, The Journal of Broadcasting 1976 76
American Journal of Educational Psychology 1972 72
Journal of Research in Teaching 1972 71
Journal of Applied Psychology 1976 67
Journal of Communication 1973 56
The Research Quarterly 1972 52
Journal of Abnormal Psychology 1984 50
Journal of Abnormal and Social Psychology 1962 48
American Speech and Hearing Research & Journal of Communication Disorders 1975 44
Counseler Education and Supervision 1973 37

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of  37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%).  The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.

The average for the three core psychology journals (JASP & JAbnPsy,  JAP, AJEduPsy) is 67% (median = 63%) is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general rather than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, a multi-factorial design produces an exponential number of statistical comparisons. With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With 5 conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions would not be significant if this test is part of a series of tests.

Sedlmeier and Giegerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen’s results. However, many articles presented results using a more stringent criterion of significance. If the criterion used by authors would have been used for the power analysis, power decreased further. About 50% of all articles used an adjusted criterion value and if the adjusted criterion value was used power was only 37%.

Sedlmeier and Giegerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Giegerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long-run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to conduct a more powerful study that examines gender differences in a more powerful study with more female and male participants. However, this is not what researchers do. Rather, multiple study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely likely in multiple studies where studies have only 50% power to get a significant result in a single attempt. For each additional attempt, the probability to obtain only significant results decreases exponentially (1 Study, 50%, 2 Studies 25%, 3 Studies 12.5%, 4 Studies 6.75%).

The fact that researchers only publish studies that worked is well-known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager that only publishes the return rate of funds that gained money and excludes funds with losses. Would you trust this manager to take care of your retirement? It is also akin to a gambler that only remembers winnings. Would you marry a gambler who believes that gambling is ok because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present 5 studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in this field, of this researcher, or in the journal that published the results.

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power-ranking of scientific journals and the first temporal analyses of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased power to ensure that their studies could actually provide significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are always welcome.

# Distinguishing Questionable Research Practices from Publication Bias

It is well-known that scientific journals favor statistically significant results (Sterling, 1959). This phenomenon is known as publication bias. Publication bias can be easily detected by comparing the observed statistical power of studies with the success rate in journals. Success rates of 90% or more would only be expected if most theoretical predictions are true and empirical studies have over 90% statistical power to produce significant results. Estimates of statistical power range from 20% to 50% (Button et al., 2015, Cohen, 1962). It follows that for every published significant result an unknown number of non-significant results has occurred that remained unpublished. These results linger in researchers proverbial file-drawer or more literally in unpublished data sets on researchers’ computers.

The selection of significant results also creates an incentive for researchers to produce significant results. In rare cases, researchers simply fabricate data to produce significant results. However, scientific fraud is rare. A more serious threat to the integrity of science is the use of questionable research practices. Questionable research practices are all research activities that create a systematic bias in empirical results. Although systematic bias can produce too many or too few significant results, the incentive to publish significant results suggests that questionable research practices are typically used to produce significant results.

In sum, publication bias and questionable research practices contribute to an inflated success rate in scientific journals. So far, it has been difficult to examine the prevalence of questionable research practices in science. One reason is that publication bias and questionable research practices are conceptually overlapping. For example, a research article may report the results of a 2 x 2 x 2 ANOVA or a regression analysis with 5 predictor variables. The article may only report the significant results and omit detailed reporting of the non-significant results. For example, researchers may state that none of the gender effects were significant and not report the results for main effects or interaction with gender. I classify these cases as publication bias because each result tests a different hypothesis., even if the statistical tests are not independent.

Questionable research practices are practices that change the probability of obtaining a specific significant result. An example would be a study with multiple outcome measures that would support the same theoretical hypothesis. For example, a clinical trial of an anti-depressant might include several depression measures. In this case, a researcher can increase the chances of a significant result by conducting tests for each measure. Other questionable research practices would be optional stopping once a significant result is obtained, selective deletion of cases based on the results after deletion. A common consequence of these questionable practices is that they will produce results that meet the significance criterion, but deviate from the distribution that is expected simply on the basis of random sampling error.

A number of articles have tried to examine the prevalence of questionable research practices by comparing the frequency of p-values above and below the typical criterion of statistical significance, namely a p-value less than .05. The logic is that random error would produce a nearly equal amount of p-values just above .05 (e.g., p = .06) and below .05 (e.g., p = .04). According to this logic, questionable research practices are present, if there are more p-values just below the criterion than p-values just above the criterion (Masicampo & Lalande, 2012).

Daniel Lakens has pointed out some problems with this approach. The most crucial problem is that publication bias alone is sufficient to predict a lower frequency of p-values below the significance criterion. After all, these p-values imply a non-significant result and non-significant results are subject to publication bias. The only reason why p-values of .06 are reported with higher frequency than p-values of .11 is that p-values between .05 and .10 are sometimes reported as marginally significant evidence for a hypothesis. Another problem is that many p-values of .04 are not reported as p = .04, but are reported as p < .05. Thus, the distribution of p-values close to the criterion value provides unreliable information about the prevalence of questionable research practices.

In this blog post, I introduce an alternative approach to the detection of questionable research practices that produce just significant results. Questionable research practices and publication bias have different effects on the distribution of p-values (or corresponding measures of strength of evidence). Whereas publication bias will produce a distribution that is consistent with the average power of studies, questionable research practice will produce an abnormal distribution with a peak just below the significance criterion. In other words, questionable research practices produce a distribution with too few non-significant results and too few highly significant results.

I illustrate this test of questionable research practices with post-hoc-power analysis of three journals. One journal shows neither signs of publication bias, nor significant signs of questionable research practices. The second journal shows clear evidence of publication bias, but no evidence of questionable research practices. The third journal illustrates the influence of publication bias and questionable research practices.

Example 1: A Relatively Unbiased Z-Curve

The first example is based on results published during the years 2010-2014 in the Journal of Experimental Psychology: Learning, Memory, and Cognition. A text-mining program searched all articles for publications of F-tests, t-tests, correlation coefficients, regression coefficients, odds-ratios, confidence intervals, and z-tests. Due to the inconsistent and imprecise reporting of p-values (p = .02 or p < .05), p-values were not used. All statistical tests were converted into absolute z-scores.

The program found 14,800 tests. 8,423 tests were in the critical interval between z = 2 and z = 6 that is used for estimation of 4 non-centrality parameters and 4 weights that are used to model the distribution of z-values between 2 and 6 and to estimate the distribution in the range from 0 to 2. Z-values greater than 6 are not used because they correspond to Power close to 1. 11% of all tests fall into this region of z-scores that are not shown.

The histogram and the blue density distribution show the observed data. The green curve shows the predicted distribution based on the post-hoc power analysis. Post-hoc power analysis suggests that the average power of significant results is 67%. Power for all statistical tests is estimated to be 58% (including 11% of z-scores greater than 6, power is .58*.89 + .11 = 63%). More important is the predicted distribution of z-scores. The predicted distribution on the left side of the criterion value matches the observed distribution rather well. This shows that there are not a lot of missing non-significant results. In other words, there does not appear to be a file-drawer of studies with non-significant results. There is also only a very small blip in the observed data just at the level of statistical significance. The close match between the observed and predicted distributions suggests that results in this journal are relatively free of systematic bias due to publication bias or questionable research practices.

Example 2: A Z-Curve with Publication Bias

The second example is based on results published in the Attitudes & Social Cognition Section of the Journal of Personality and Social Psychology. The text-mining program retrieved 5,919 tests from articles published between 2010 and 2014. 3,584 tests provided z-scores in the range from 2 to 6 that is being used for model fitting.

The average power of significant results in JPSP-ASC is 55%. This is significantly less than the average power in JEP-LMC, which was used for the first example. The estimated power for all statistical tests, including those in the estimated file drawer, is 35%. More important is the estimated distribution of z-values. On the right side of the significance criterion the estimated curve shows relatively close fit to the observed distribution. This finding shows that random sampling error alone is sufficient to explain the observed distribution. However, on the left side of the distribution, the observed z-scores drop off steeply. This drop is consistent with the effect of publication bias that researchers do not report all non-significant results. There is only a slight hint that questionable research practices are also present because observed z-scores just above the criterion value are a bit more frequent than the model predicts. However, this discrepancy is not conclusive because the model could increase the file drawer, which would produce a steeper slope. The most important characteristic of this z-curve is the steep cliff on the left side of the criterion value and the gentle slope on the right side of the criterion value.

Example 3: A Z-Curve with Questionable Research Practices.

Example 3 uses results published in the journal Aggressive Behavior during the years 2010 to 2014. The text mining program found 1,429 results and 863 z-scores in the range from 2 to 6 that were used for the post-hoc-power analysis.

The average power for significant results in the range from 2 to 6 is 73%, which is similar to the power estimate in the first example. The power estimate that includes non-significant results is 68%. The power estimate is similar because there is no evidence of a file drawer with many underpowered studies. In fact, there are more observed non-significant results than predicted non-significant results, especially for z-scores close to zero. This outcome shows some problems of estimating the frequency of non-significant results based on the distribution of significant results. More important, the graph shows a cluster of z-scores just above and below the significance criterion. The step cliff to the left of the criterion might suggest publication bias, but the whole distribution does not show evidence of publication bias. Moreover, the steep cliff on the right side of the cluster cannot be explained with publication bias. Only questionable research practices can produce this cliff because publication bias relies on random sampling error which leads to a gentle slope of z-scores as shown in the second example.

Prevalence of Questionable Research Practices

The examples suggest that the distribution of z-scores can be used to distinguish publication bias and questionable research practices. Based on this approach, the prevalence of questionable research practices would be rare. The journal Aggressive Behavior is exceptional. Most journals show a pattern similar to Example 2, with varying sizes of the file drawer. However, this does not mean that questionable research practices are rare because it is most likely that the pattern observed in Example 2 is a combination of questionable research practices and publication bias. As shown in Example 2, the typical power of statistical tests that produce a significant result is about 60%. However, researchers do not know which experiments will produce significant results. Slight modifications in experimental procedures, so-called hidden moderators, can easily change an experiment with 60% power into an experiment with 30% power. Thus, the probability of obtaining a significant result in a replication study is less than the nominal power of 60% that is implied by post-hoc-power analysis. With only 30% to 60% power, researchers will frequently encounter results that fail to produce an expected significant result. In this case, researchers have two choices to avoid reporting a non-significant result. They can put the study in the file-drawer or they can try to salvage the study with the help of questionable research practices. It is likely that researchers will do both and that the course of action depends on the results. If the data show a trend in the right direction, questionable research practices seem an attractive alternative. If the data show a trend in the opposite direction, it is more likely that the study will be terminated and the results remain unreported.

Simons et al. (2011) conducted some simulation studies and found that even extreme use of multiple questionable research practices (p-hacking) will produce a significant result in at most 60% of cases, when the null-hypothesis is true. If such extreme use of questionable research practices were widespread, z-curve would produce corrected power estimates well-below 50%. There is no evidence that extreme use of questionable research practices is prevalent. In contrast, there is strong evidence that researchers conduct many more studies than they actually report and that many of these studies have a low probability of success.

Implications of File-Drawers for Science

First, it is clear that researchers could be more effective if they would use existing resources more effectively. An fMRI study with 20 participants costs about \$10,000. Conducting a study that costs \$10,000 that has only a 50% probability of producing a significant result is wasteful and should not be funded by taxpayers. Just publishing the non-significant result does not fix this problem because a non-significant result in a study with 50% power is inconclusive. Even if the predicted effect exists, one would expect a non-significant result in ever second study.   Instead of wasting \$10,000 on studies with 50% power, researchers should invest \$20,000 in studies with higher power (unfortunately, power does not increase proportional to resources). With the same research budget, more money would contribute to results that are being published. Thus, without spending more money, science could progress faster.

Second, higher powered studies make non-significant results more relevant. If a study had 80% power, there is only a 20% chance to get a non-significant result if an effect is present. If a study had 95% power, the chance of a non-significant result would be just as low as the chance of a false positive result. In this case, it is noteworthy that a theoretical prediction was not confirmed. In a set of high-powered studies, a post-hoc power analysis would show a bimodal distribution with clusters of z-scores around 0 for true null-hypothesis and a cluster of z-scores of 3 or higher for clear effects. Type-I and Type-II errors would be rare.

Third, Example 3 shows that the use of questionable research practices becomes detectable in the absence of a file drawer and that it would be harder to publish results that were obtained with questionable research practices.

Finally, the ability to estimate the size of file-drawers may encourage researchers to plan studies more carefully and to invest more resources into studies to keep their file drawers small because a large file-drawer may harm reputation or decrease funding.

In conclusion, post-hoc power analysis of large sets of data can be used to estimate the size of the file drawer based on the distribution of z-scores on the right side of a significance criterion. As file-drawers harm science, this tool can be used as an incentive to conduct studies that produce credible results and thus reducing the need for dishonest research practices. In this regard, the use of post-hoc power analysis complements other efforts towards open science such as preregistration and data sharing.