# What would Cohen say? A comment on p < .005

Most psychologists are trained in Fisherian statistics, which has become known as Null-Hypothesis Significance Testing (NHST).  NHST compares an observed effect size against a hypothetical effect size. The hypothetical effect size is typically zero; that is, the hypothesis is that there is no effect.  The deviation of the observed effect size from zero relative to the amount of sampling error provides a test statistic (test statistic = effect size / sampling error).  The test statistic can then be compared to a criterion value. The criterion value is typically chosen so that only 5% of test statistics would exceed the criterion value by chance alone.  If the test statistic exceeds this value, the null-hypothesis is rejected in favor of the inference that an effect greater than zero was present.

One major problem of NHST is that non-significant results are not considered.  To address this limitation, Neyman and Pearson extended Fisherian statistic and introduced the concepts of type-I (alpha) and type-II (beta) errors.  A type-I error occurs when researchers falsely reject a true null-hypothesis; that is, they infer from a significant result that an effect was present, when there is actually no effect.  The type-I error rate is fixed by the criterion for significance, which is typically p < .05.  This means, that a set of studies cannot produce more than 5% false-positive results.  The maximum of 5% false positive results would only be observed if all studies have no effect. In this case, we would expect 5% significant results and 95% non-significant results.

The important contribution by Neyman and Pearson was to consider the complementary type-II error.  A type-II error occurs when an effect is present, but a study produces a non-significant result.  In this case, researchers fail to detect a true effect.  The type-II error rate depends on the size of the effect and the amount of sampling error.  If effect sizes are small and sampling error is large, test statistics will often be too small to exceed the criterion value.

Neyman-Pearson statistics was popularized in psychology by Jacob Cohen.  In 1962, Cohen examined effect sizes and sample sizes (as a proxy for sampling error) in the Journal of Abnormal and Social Psychology and concluded that there is a high risk of type-II errors because sample sizes are too small to detect even moderate effect sizes and inadequate to detect small effect sizes.  Over the next decades, methodologists have repeatedly pointed out that psychologists often conduct studies with a high risk to fail; that is, to provide empirical evidence for real effects (Sedlemeier & Gigerenzer, 1989).

The concern about type-II errors has been largely ignored by empirical psychologists.  One possible reason is that journals had no problem filling volumes with significant results, while rejecting 80% of submissions that also presented significant results.  Apparently, type-II errors were much less common than methodologists feared.

However, in 2011 it became apparent that the high success rate in journals was illusory. Published results were not representative of studies that were conducted. Instead, researchers used questionable research practices or simply did not report studies with non-significant results.  In other words, the type-II error rate was as high as methodologists suspected, but selection of significant results created the impression that nearly all studies were successful in producing significant results.  The influential “False Positive Psychology” article suggested that it is very easy to produce significant results without an actual effect.  This led to the fear that many published results in psychology may be false positive results.

Doubt about the replicability and credibility of published results has led to numerous recommendations for the improvement of psychological science.  One of the most obvious recommendations is to ensure that published results are representative of the studies that are actually being conducted.  Given the high type-II error rates, this would mean that journals would be filled with many non-significant and inconclusive results.  This is not a very attractive solution because it is not clear what the scientific community can learn from an inconclusive result.  A better solution would be to increase the statistical power of studies. Statistical power is simply the inverse of a type-II error (power = 1 – beta).  As power increases, studies with a true effect have a higher chance of producing a true positive result (e.g., a drug is an effective treatment for a disease). Numerous articles have suggested that researchers should increase power to increase replicability and credibility of published results (e.g., Schimmack, 2012).

In a recent article, a team of 72 authors proposed another solution. They recommended that psychologists should reduce the probability of a type-I error from 5% (1 out of 20 studies) to 0.5% (1 out of 200 studies).  This recommendation is based on the belief that the replication crisis in psychology reflects a large number of type-I errors.  By reducing the alpha criterion, the rate of type-I errors will be reduced from a maximum of 10 out of 200 studies to 1 out of 200 studies.

I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.  Keeping resources and sampling error constant, reducing the type-I error rate increases the type-II error rate. This is undesirable because the actual type-II error is already large.

For example, a between-subject comparison of two means with a standardized effect size of d = .4 and a sample size of N = 100 (n = 50 per cell) has a 50% risk of a type-II error.  The risk of a type-II error raises to 80%, if alpha is reduced to .005.  It makes no sense to conduct a study with an 80% chance of failure (Tversky & Kahneman, 1971).  Thus, the call for a lower alpha implies that researchers will have to invest more resources to discover true positive results.  Many researchers may simply lack the resources to meet this stringent significance criterion.

My suggestion is exactly opposite to the recommendation of a more stringent criterion.  The main problem for selection bias in journals is that even the existing criterion of p < .05 is too stringent and leads to a high percentage of type-II errors that cannot be published.  This has produced the replication crisis with large file-drawers of studies with p-values greater than .05,  the use of questionable research practices, and publications of inflated effect sizes that cannot be replicated.

To avoid this problem, researchers should use a significance criterion that balances the risk of a type-I and type-II error.  For example, with an expected effect size of d = .4 and N = 100, researchers should use p < .20 for significance, which reduces the risk of a type -II error to 20%.  In this case, type-I and type-II error are balanced.  If the study produces a p-value of, say, .15, researchers can publish the result with the conclusion that the study provided evidence for the effect. At the same time, readers are warned that they should not interpret this result as strong evidence for the effect because there is a 20% probability of a type-I error.

Given this positive result, researchers can then follow up their initial study with a larger replication study that allows for a stricter type-I error control, while holding power constant.   With d = 4, they now need N = 200 participants to have 80% power and alpha = .05.  Even if the second study does not produce a significant result (the probability that two studies with 80% power are significant is only 64%, Schimmack, 2012), researchers can combine the results of both studies and with N = 300, the combined studies have 80% power with alpha = .01.

The advantage of starting with smaller studies with a higher alpha criterion is that researchers are able to test risky hypothesis with a smaller amount of resources.  In the example, the first study used “only” 100 participants.  In contrast, the proposal to require p < .005 as evidence for an original, risky study implies that researchers need to invest a lot of resources in a risky study that may provide inconclusive results if it fails to produce a significant result.  A power analysis shows that a sample size of N = 338 participants is needed to have 80% power for an effect size of d = .4 and p < .005 as criterion for significance.

Rather than investing 300 participants into a risky study that may produce a non-significant and uninteresting result (eating green jelly beans does not cure cancer), researchers may be better able and willing to start with 100 participants and to follow up an encouraging result with a larger follow-up study.  The evidential value that arises from one study with 300 participants or two studies with 100 and 200 participants is the same, but requiring p < .005 from the start discourages risky studies and puts even more pressure on researchers to produce significant results if all of their resources are used for a single study.  In contrast, lowering alpha reduces the need for questionable research practices and reduces the risk of type-II errors.

In conclusion, it is time to learn Neyman-Pearson statistic and to remember Cohen’s important contribution that many studies in psychology are underpowered.  Low power produces inconclusive results that are not worthwhile publishing.  A study with low power is like a high-jumper that puts the bar too high and fails every time. We learned nothing about the jumpers’ ability. Scientists may learn from high-jump contests where jumpers start with lower and realistic heights and then raise the bar when they succeeded.  In the same manner, researchers should conduct pilot studies or risky exploratory studies with small samples and a high type-I error probability and lower the alpha criterion gradually if the results are encouraging, while maintaining a reasonably low type-II error.

Evidently, a significant result with alpha = .20 does not provide conclusive evidence for an effect.  However, the arbitrary p < .005 criterion also fails short of demonstrating conclusively that an effect exists.  Journals publish thousands of results a year and some of these results may be false positives, even if the error rate is set at 1 out of 200. Thus, p < .005 is neither defensible as a criterion for a first exploratory study, nor conclusive evidence for an effect.  A better criterion for conclusive evidence is that an effect can be replicated across different laboratories and a type-I error probability of less than 1 out of a billion (6 sigma).  This is by no means an unrealistic target.  To achieve this criterion with an effect size of d = .4, a sample size of N = 1,000 is needed.  The combined evidence of 5 labs with N = 200 per lab would be sufficient to produce conclusive evidence for an effect, but only if there is no selection bias.  Thus, the best way to increase the credibility of psychological science is to conduct studies with high power and to minimize selection bias.

This is what I believe Cohen would have said, but even if I am wrong about this, I think it follows from his futile efforts to teach psychologists about type-II errors and statistical power.