In 1962, Jacob Cohen wrote an important article about a fundamental problem in the way psychologists conduct their studies. He noted that psychologists pay a lot of attention to the problem that a statistically significant result might be a false-positive finding (a so-called type-I error). At the same time, psychologists mostly ignored the complementary error that a study might fail to reject the null-hypothesis when the predicted effect actually exists (a so-called type-II error).
It is surprising that researchers would not be concerned with type-II errors, because studies are often designed to demonstrate effects, and a study is considered a failure if it does not provide evidence for the predicted effect. Most researchers do not even bother to write up these studies. Given the high investment of researchers in their theories, one would expect that they do everything they can to minimize the risk of failure.
Cohen noted that the type-II error is inversely related to sample size. As sample sizes increase, sampling error decreases, and it becomes easier to demonstrate that an observed effect is a real effect rather than just a random event due to sampling error. Thus, researchers can control the probability of a type-II error by conducting studies with reasonable sample sizes. However, Cohen observed that researchers in the early 1960s used ad-hoc rules (e.g., n = 20 per cell) to plan sample sizes. He concluded that “these nonrational bases for setting sample size must often result in investigations being undertaken which have little chance of success despite the actual falsity of the null hypothesis” (p. 145).
Many psychologists at that time were not aware of more rational ways to determine sample sizes. To address this fundamental problem in psychological research, Cohen provided the first power analysis of psychological research. To conduct this analysis, he examined the statistical results in Volume 61 of the Journal of Abnormal and Social Psychology (which later was split into the Journal of Abnormal Psychology and the Journal of Personality and Social Psychology).
The main purpose of Cohen’s seminal article was to answer the question: “What kind of chance did these investigators have of rejecting false null hypotheses?” (Cohen, 1962, p. 145).
The probability of avoiding a type-II error (i.e., obtaining a statistically significant result when a predicted effect exists) is called statistical power (Power = 1 – Type-II Error Probability). The main problem in estimating statistical power is that power depends on the size of the effect in the population. Psychological theories rarely make precise predictions about effect sizes. Thus, it is not clear what size should be used for a power analysis. Another problem is that the size of an effect depends on the unit of measurement. The effect of sex (men vs. women) on height depends on whether height is measured in centimeters, meters, or inches. To conduct a power analysis of published results, Cohen had to find a way to quantify effect sizes across studies whose measures had different measurement units.
To the best of my knowledge, Cohen (1962) invented standardized effect sizes to deal with this problem. For example, the mean difference in height between men and women would be a much smaller number if it were measured in meters than if it were measured in centimeters. However, the unit of measurement has the same effect on the standard deviation, which measures the variability of observed scores in meters, centimeters, or inches. Cohen realized that one can obtain a standardized measure of the effect size by dividing the mean difference (in meters or centimeters) by the standard deviation (in meters or centimeters). This ratio of the mean difference to the standard deviation is no longer influenced by the original unit of measurement.
In Cohen’s honor, the mean difference divided by the pooled standard deviation is called Cohen’s d, where d stands for a difference score.
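As a minimal sketch (my illustration, not Cohen's own procedure), Cohen's d for two independent groups can be computed by dividing the mean difference by the pooled standard deviation. The made-up height data below shows that the same d results whether heights are recorded in centimeters or meters:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Hypothetical heights; the numeric mean difference changes with the unit,
# but the standardized effect size does not.
men_cm = np.array([178.0, 182.0, 175.0, 180.0, 177.0])
women_cm = np.array([165.0, 169.0, 162.0, 167.0, 164.0])
d_cm = cohens_d(men_cm, women_cm)
d_m = cohens_d(men_cm / 100, women_cm / 100)  # same data in meters
```

Because numerator and denominator are rescaled by the same factor, `d_cm` and `d_m` are identical, which is exactly the unit invariance Cohen needed.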
Cohen also introduced a broad classification of standardized effect sizes into three ordinal categories: statistically small (hard to detect), medium, and large (easy to detect). Nowadays it is often overlooked that the distinction between small and large effects refers to the implications of the effect size for statistical power, and it is sometimes argued that small effects lack practical importance or significance. However, statistically small effects can have huge practical importance. For example, relative to the normal variation in temperature in Toronto, from a daily high of -10 degrees in winter to 40 degrees in summer, a 4 degree difference may seem small. However, if global warming raised the average temperature on each day by 4 degrees, it could have huge practical implications. Whether an effect is theoretically or practically important is not a statistical question that can be answered by computing a standardized effect size. The primary purpose of standardized effect sizes was to allow researchers to estimate and control type-II errors.
For a mean difference between two groups (a typical scenario in experimental psychology), Cohen called an effect size of d = .25 small, an effect size of d = .50 medium, and an effect size of d = 1 large (he later revised these values to d = .2, d = .5, and d = .8). He illustrated the implications of these values with IQ scores, which are standardized scores where the average performance on an intelligence test is set to 100. Nowadays, the standard deviation is 15 IQ points, but Cohen used the Stanford-Binet test with a standard deviation of 16 points as an example. He pointed out that conducting a study with an expected population effect size of half a standard deviation (d = .5, medium) is equivalent to a study where a researcher expects an 8 point difference in IQ.
After creating these valuable statistical tools, Cohen (1962) conducted the laborious task of extracting statistical information from the articles published in Volume 61 of the Journal of Abnormal and Social Psychology. There were 78 articles in this volume, but 6 articles did not include statistical tests. The remaining 72 articles reported 4,829 statistical tests. However, not all of these tests were testing a theoretically important hypothesis. Cohen (1962) was concerned that a power analysis of all tests might underestimate power of theoretically important tests. Therefore, he limited his analysis to 2,088 statistical tests of theoretically important hypotheses. Two articles did not report tests of a theoretical hypothesis and were not included, leaving 70 articles.
For each reported result, Cohen computed the power given the design and sample size of the statistical test, assuming a small, medium, or large effect size. That is, Cohen did not use the actually observed effect sizes for his power calculations. His power estimates are hypothetical estimates for three scenarios with different effect sizes.
He then computed the mean power for each article. The results showed that studies had very low power to detect small effects (18% power), modest power to detect medium effects (48% power), and good power to detect large effects (83%). When the analysis was repeated with all statistical tests, very similar results were obtained (20%, 50%, and 83%). The actual power of these studies depends on the unknown effect sizes. Assuming that effect sizes might range from small to large with an average effect size of d = .5, the results would suggest that published results are based on studies with about 50% power.
Cohen next observed that, despite modest power, most articles did report significant results. This might suggest that these studies did examine large effects (d = 1), but Cohen pointed out that “this argument rests on the implicit assumption that the research which is published is representative of the research undertaken in this area” (p. 151). He then pointed out that this assumption is likely to be false. “It seems obvious that investigators are less likely to submit for publication unsuccessful than successful research, to say nothing of a similar editorial bias in accepting research for publication.”
Given reporting bias and publication bias, the percentage of reported significant results (versus non-significant ones) is a poor indicator of the probability that a replication study will also produce a significant result. If the typical effect size were d = .5, only 50% of studies would replicate even if the published literature reports 100% significant results.
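A small simulation (my sketch, not from the article) makes this concrete: if the true effect is d = .5 and each study uses n = 34 per group, only about half of exact replications reach p < .05, no matter how uniformly significant the published record looks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1962)
n, d, n_sims = 34, 0.5, 10_000
significant = 0
for _ in range(n_sims):
    # Two groups whose population means differ by d standard deviations.
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        significant += 1

# The fraction of significant results tracks the ~52% power of this design.
replication_rate = significant / n_sims
```

With these assumptions the replication rate hovers around .52, far below the near-100% significance rate of the published literature.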
Moreover, Cohen assumed that d = .5 would be a medium effect size for the relationship between two constructs, or between an experimental manipulation and a perfectly reliable measure, but statistical tests are based on observed effects with unreliable measures. Based on these considerations, Cohen suggested that researchers would be optimistic if they conducted power analyses with an effect size of d = .5.
Cohen also noticed that most studies had small samples. The average sample size for the test of a theoretically important effect was N = 68. Given the skewed distribution of sample sizes, the median would be even less. Cohen points out that a between-subject design with n = 34 in each cell (N = 68 total) has 52% power to detect a medium effect size of d = .5. This would suggest that for every published study, there is an unpublished study that produced a non-significant result.
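Cohen's 52% figure can be checked numerically. The sketch below (my illustration, using scipy's noncentral t distribution rather than Cohen's tables) computes the power of a two-sided independent-samples t-test with n = 34 per cell:

```python
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided independent-samples t-test for effect size d."""
    df = 2 * n_per_group - 2
    nc = d * (n_per_group / 2) ** 0.5          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical value
    # Probability of exceeding the critical value under the alternative;
    # the opposite tail contributes almost nothing for positive d.
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

power_small = two_sample_power(0.2, 34)    # small effect: very low power
power_medium = two_sample_power(0.5, 34)   # roughly .52, matching Cohen
power_large = two_sample_power(0.8, 34)    # large effect: high power
```

For d = .5 the function returns approximately .52, confirming Cohen's hand-computed value for the typical published design.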
Cohen then points out several ways to increase power. One solution is to increase sample size, but when this is not possible, researchers could use a number of other strategies to increase power. “Other means for increasing power: improving experimental design efficiency and/or experimental control, and renouncing a slavish adherence to a standard Type I level, usually .05.” (p. 153). This means that it is irrational to maintain a 5% type-I error probability in a study that has an 80% type-II error probability because the study examines a small effect with limited resources. In this case, it would make sense to increase the type-I error probability, say to 20%, in order to have a reasonable chance to obtain a statistically significant result. Otherwise, it would be best not to do the study.
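Cohen's point about relaxing the .05 convention can be quantified. For a hypothetical small-effect design (d = .2 with n = 34 per cell; these numbers are my own illustration, not Cohen's), raising the type-I error rate substantially raises power:

```python
from scipy import stats

def two_sample_power(d, n_per_group, alpha):
    """Power of a two-sided independent-samples t-test (noncentral t)."""
    df = 2 * n_per_group - 2
    nc = d * (n_per_group / 2) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

# A small effect with a small sample: power is poor at alpha = .05 ...
low = two_sample_power(0.2, 34, alpha=0.05)
# ... and improves substantially if the type-I error rate is relaxed to .20.
high = two_sample_power(0.2, 34, alpha=0.20)
```

Under these assumptions power roughly triples, though it remains far from adequate; this is why Cohen treated a larger alpha as a fallback, not a substitute for larger samples.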
Cohen then points out the high cost of conducting studies with small samples and low power. The most likely outcome of these studies is that they produce a non-significant result and remain unpublished. As a result, valuable resources were wasted on studies that were unlikely to produce a meaningful result.
Cohen concludes with the following recommendation.
“Since power is a direct monotonic function of sample size, it is recommended that investigators use larger sample sizes than they customarily do. It is further recommended that research plans be routinely subjected to power analysis, using as conventions the criteria of population effect size employed in this survey” (p. 153).
One can only imagine where psychology would be today, more than half a century later, if psychologists had taken Cohen’s advice and increased the statistical power of their studies. Instead, thousands of subject hours have been wasted on underpowered studies that ended up with inconclusive results. Moreover, published studies often achieved significance only because observed effect sizes dramatically overestimated the true population effect sizes. As a result, published results are often not credible and difficult to replicate. The current replication crisis in psychology would not have surprised Cohen. His power analysis from 1962 predicted the dismal success rate in the OSF-reproducibility project (36% success rate, which implies only 36% power in original studies that reported 97% significant results). If psychology doesn’t want to waste another half-century, it is time to honor Cohen and to start taking type-II errors and power more seriously.