Post-hoc-power (PHP) curves are used to evaluate the replicability of published results. At present, PHP-curves are based on t-tests and F-tests that are automatically extracted from text files of journal articles. All test results are converted into z-scores, and the PHP-curves are fitted to the density of the histogram of these z-scores.
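As an illustration of the conversion step, the sketch below maps a two-sided p-value onto an absolute z-score. It assumes that the p-value of each t-test or F-test has already been computed from the test statistic and its degrees of freedom. The function name `p_to_z` is hypothetical, and this is not necessarily the exact routine used to build the curves.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def p_to_z(p_two_sided):
    """Return the absolute z-score whose two-sided p-value equals p_two_sided."""
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Work in the lower tail (p / 2) and flip the sign; this is numerically
    # safer for very small p-values than computing inv_cdf(1 - p / 2).
    return -STD_NORMAL.inv_cdf(p_two_sided / 2.0)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        print(f"p = {p:<6} -> z = {p_to_z(p):.2f}")
```

A p-value of .05 corresponds to a z-score of about 1.96, which is why the significant results in the histogram start at roughly z = 2.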

It is well known that non-significant results are less likely to be published and instead end up in the proverbial file drawer. To overcome this problem, PHP-curves are fitted only to significant results; non-significant results are excluded from the estimation of typical power (Simonsohn et al., 2013, 2014).
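The idea of estimating power from significant results alone can be sketched with a truncated normal model. The example below is a simplified illustration, not the fitting procedure behind the PHP-curves: it assumes a single true noncentrality value (`delta`), one-sided selection at z > 1.96, and simulated data. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
CRIT = 1.96  # assumed significance criterion on the z-scale


def power(delta):
    """Probability that z = delta + standard normal noise exceeds CRIT."""
    return 1.0 - STD_NORMAL.cdf(CRIT - delta)


def fit_delta(sig_z):
    """Grid-search maximum-likelihood estimate of delta from significant z-scores.

    Each z-score follows a normal density centred on delta and truncated at
    CRIT, so the log-likelihood (up to a constant) is
    sum(-(z - delta)^2 / 2) - n * log(power(delta)).
    """
    n = len(sig_z)
    sum_z = sum(sig_z)
    sum_z2 = sum(z * z for z in sig_z)
    best_delta, best_ll = 0.0, -math.inf
    for step in range(501):  # candidate deltas 0.00, 0.01, ..., 5.00
        delta = step / 100.0
        sq_dev = sum_z2 - 2.0 * delta * sum_z + n * delta * delta
        ll = -0.5 * sq_dev - n * math.log(power(delta))
        if ll > best_ll:
            best_delta, best_ll = delta, ll
    return best_delta


def simulate_published(true_delta, n_studies, seed=1):
    """Simulate z-scores and keep only the significant ones (the file drawer)."""
    rng = random.Random(seed)
    z_all = (rng.gauss(true_delta, 1.0) for _ in range(n_studies))
    return [z for z in z_all if z > CRIT]


if __name__ == "__main__":
    published = simulate_published(true_delta=2.5, n_studies=5000)
    naive = sum(published) / len(published)
    est = fit_delta(published)
    print(f"naive mean z of significant results: {naive:.2f}")
    print(f"truncation-adjusted estimate of delta: {est:.2f}")
    print(f"implied power: {power(est):.2f}")
```

The naive mean of the significant z-scores overstates the true noncentrality because selection removes the low values. Modelling the truncation explicitly is what allows power to be estimated without the missing non-significant results.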

Another problem in the estimation of typical power is that power varies across tests. Heterogeneity of power leads to more variation in observed z-scores than a homogeneous model would assume (see comparison of variances in the figures below). PHP-curves address this problem by fitting a model with multiple true power values to the observed data. The fit for non-significant results is not expected to be good because of the file-drawer problem. In fact, the gap between the observed and predicted densities of non-significant results can be considered a rough estimate of the size of the file drawer.
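The effect of heterogeneity on the spread of z-scores can be checked with a small simulation. This is a minimal sketch under assumed values: the noncentrality values 1.0, 2.0, and 3.0 and the helper `simulate_z` are made up for illustration and are not taken from the analyzed journals.

```python
import random
from statistics import mean, pvariance


def simulate_z(deltas, n_studies, seed=1):
    """Draw z-scores; each study picks its true noncentrality at random from deltas."""
    rng = random.Random(seed)
    return [rng.gauss(rng.choice(deltas), 1.0) for _ in range(n_studies)]


if __name__ == "__main__":
    # Both sets have the same average noncentrality (2.0).
    homogeneous = simulate_z([2.0], 20000)
    heterogeneous = simulate_z([1.0, 3.0], 20000)
    for label, zs in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
        print(f"{label:>13}: mean = {mean(zs):.2f}, variance = {pvariance(zs):.2f}")
```

With homogeneous power the variance of z-scores stays close to 1, the sampling variance of a single z-score. Mixing studies with different true power adds the variance of the noncentrality values on top of that, so a variance well above 1 signals heterogeneity.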

For heterogeneous data, estimated power depends on the set of results that is being analyzed. The reason is that low z-scores are more likely to be obtained in studies with low power, whereas high z-scores are more likely to be the result of high-powered studies. The figures below show estimated power for z-scores in the range from 2 to 6. The mode of the red heterogeneous curve shows that power for all tests, including non-significant ones, would be considerably lower. However, non-significant results are typically not interpreted or are even excluded from published studies. Thus, replicability is better indexed by the typical power of significant results.
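The dependence of power on the selected set can be worked through with a small calculation. A study with true power p produces a significant result with probability p, so high-powered studies are overrepresented among significant results. The sketch below uses made-up power values (0.2 and 0.8) and hypothetical function names.

```python
def mean_power_all(powers, weights):
    """Average true power across all studies, significant or not."""
    total = sum(weights)
    return sum(w * p for p, w in zip(powers, weights)) / total


def mean_power_significant(powers, weights):
    """Average true power among studies that produced a significant result.

    Each study is selected with probability equal to its power, so every
    power value is reweighted by itself before averaging.
    """
    selected = sum(w * p for p, w in zip(powers, weights))
    return sum(w * p * p for p, w in zip(powers, weights)) / selected


if __name__ == "__main__":
    powers = [0.2, 0.8]   # assumed true power of two types of studies
    weights = [0.5, 0.5]  # assumed share of each type among all studies
    print(f"all studies:         {mean_power_all(powers, weights):.2f}")
    print(f"significant studies: {mean_power_significant(powers, weights):.2f}")
```

In this example the average power of all studies is 0.50, whereas the average power of the studies that reached significance is 0.68. The second number is the one that predicts how often a published significant result would replicate.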

The power estimates for all JESP articles and for social psychology articles in Psychological Science are very similar (47% and 45%, respectively). Power for Social Cognition in the years from 2010 to present is estimated to be higher (60%). Older issues could not be analyzed because text recognition did not work. In comparison, the estimated power for memory-related articles in Psychological Science is higher still (66%).

The average can be a misleading statistic for skewed distributions. The figures show that the majority of significant results are closer to the lower limit (z = 2) than to the upper limit (z = 6) of the test interval. Thus, the median power is lower than the average power of 45-60%.

It is important to realize that post-hoc power is only meaningful when it is based on a large set of studies. A z-score of 4 is more likely to be based on a high-powered study than a z-score of 2, but a single z-score of 2 could be based on a high-powered study or it could be a Type I error. The purpose of PHP-curves is to evaluate journals, areas of research, and other meaningful sets of studies. Hopefully, recent attempts to increase the replicability of social psychology will increase power. PHP-curves, the R-Index, and estimates of typical power can be used to document improvements in future years.