J Pers Soc Psychol. 2015 Feb;108(2):275-97. doi: 10.1037/pspi0000007.
Best research practices in psychology: Illustrating epistemological and pragmatic considerations with the case of relationship science.
Finkel EJ, Eastwick PW, Reis HT.
The article “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science” examines how social psychologists should respond to the crisis of confidence in the wake of scandals that rocked social psychology in 2011 (i.e., the Stapel debacle and the Bem bust).
The article is written by prolific relationship researchers, Finkel, Eastwick, and Reis (FER), and is directed primarily at relationship researchers, but their article also has implications for social psychology in general. In this blog post, I critically examine FER’s recommendations for “best research practices.”
FER and I are in general agreement about the problem. The goal of empirical science is to obtain objective evidence that can be used to test theoretical predictions. If the evidence supports a theoretical prediction, a theory that made this prediction gets to live another day. If the evidence does not support the prediction, the theory is being challenged and may need to be revised. The problem is that scientists are not disinterested observers of empirical phenomena. Rather, they often have a vested interest in providing empirical support for a theory. Moreover, scientists have no obligation to report all of their data or statistical analyses. As a result, the incentive structure encourages self-serving selection of supportive evidence. While data fabrication is a punishable academic offense, dishonest reporting practices have been and are still being tolerated.
The 2011 scandals led to numerous calls to curb dishonest reporting practices and to encourage or enforce honest reporting of all relevant materials and results. FER use the term “evidential value movement” to refer to researchers who have proposed changes to research practices in social psychology.
FER credit the evidential value movement with changes in research practices such as (a) reporting how sample sizes were determined to have adequate power to demonstrate predicted effects, (b) avoiding the use of dishonest research practices that inflate the strength of evidence and effect sizes, and (c) encouraging publications of replication studies independent of the outcome (i.e., a study may actually fail to provide support for a hypothesis).
FER propose that these changes are not necessarily to the benefit of social psychology. To make their point, they introduce Neyman-Pearson’s distinction between type-I errors (a.k.a. false positives) and type-II errors (a.k.a. false negatives). A type-I error occurs when a researcher draws the conclusion that an effect exists, but an effect does not exist (e.g., a cold remedy shows a statistically significant result in a clinical trial, but it has no real effect). A type-II error occurs when an effect exists, but a study fails to show a statistically significant result (e.g., a cold remedy does reduce cold symptoms, but a clinical trial fails to show a statistically significant result).
By convention, the type-I error rate in social psychology is set at 5%. This means that in the long run no more than 5% of independent tests produce false-positive results, and the maximum of 5% is only reached if all studies tested false hypotheses (i.e., they predicted an effect when no effect exists). As the number of true predictions increases, the actual rate of false-positive results decreases. If all hypotheses are true (the null-hypothesis that there is no effect is always false), the false-positive rate is 0 because it is impossible to make a type-I error. A maximum of 5% false-positive results has assured generations of social psychologists that most published results are likely to be true.
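The arithmetic behind this paragraph can be sketched in a few lines. The function name and the example proportions are illustrative assumptions, not taken from FER's article; the sketch only assumes that a type-I error is possible solely when the null hypothesis is true.

```python
def false_positive_share(alpha: float, prop_true_hypotheses: float) -> float:
    """Long-run share of all tests that end in a false positive.

    A type-I error can only occur when the null hypothesis is true,
    i.e. in the (1 - prop_true_hypotheses) share of tests where the
    researcher's prediction is wrong.
    """
    return alpha * (1.0 - prop_true_hypotheses)


# Hypothetical proportions of true hypotheses: none, half, all.
for prop_true in (0.0, 0.5, 1.0):
    share = false_positive_share(alpha=0.05, prop_true_hypotheses=prop_true)
    print(f"true hypotheses: {prop_true:.0%} -> false positives: {share:.1%}")
```

The share is highest (equal to alpha) when every tested hypothesis is false and drops to zero when every hypothesis is true.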
Unlike the type-I error probability that is set by convention, the type-II error probability is unknown because it depends on the unknown size of an effect. However, meta-analyses of actual studies can be used to estimate the typical type-II error probability in social psychology. In a seminal article, Cohen (1962) estimated that the type-II error rate is 50% for studies with a medium effect size. Power for studies of larger effects is higher and power for studies with smaller effects is lower. Actual power would depend on the distribution of small, medium, and large effects, but 50% is a reasonable estimate. Cohen (1962) also proposed that a type-II error rate of 50% is unacceptably high and suggested that researchers should plan studies to reduce the type-II error rate to 20%. A common term for the complementary probability of avoiding a type-II error is power (Power = 1 – Prob. Type-II Error) and Cohen suggested that psychologists plan studies with 80% power to detect effects that actually exist.
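To make the relation between sample size, power, and the type-II error rate concrete, here is a rough sketch. It uses a normal approximation to a two-sided, two-group comparison; the function name and the per-group sample sizes (31 and 63) are assumptions chosen for illustration and do not come from Cohen (1962).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(effect_size_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # critical value for alpha
    ncp = effect_size_d * sqrt(n_per_group / 2)  # expected z-score of the effect
    # Probability that the observed z-score lands beyond either critical value.
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


# Medium effect (d = 0.5) with two hypothetical sample sizes per group.
for n in (31, 63):
    power = approx_power(0.5, n)
    print(f"n per group = {n}: power = {power:.2f}, type-II error = {1 - power:.2f}")
```

Under these assumptions, roughly doubling the sample size moves a study of a medium effect from about 50% power to about 80% power.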
WHAT ARE THE TYPE-I AND TYPE-II ERROR RATES IN PSYCHOLOGY?
Assuming that researchers follow Cohen’s recommendation (a questionable assumption), FER write “the field has, in principle, been willing to accept false positives 5% of the time and false negatives 20% of the time.” They then state in parentheses that the “de facto false-positive and false-negative rates almost certainly have been higher than these nominal levels”.
In this parenthetical remark, FER hide the real problem that created the evidential value movement. The main point of the evidential value movement is that a type-I error probability of 5% does not tell us much about the false positive rate (how many false-positive results are being published) when dishonest reporting practices are allowed (Sterling, 1959).
For example, if a researcher conducts 10 tests of a hypothesis and only one test obtains a significant result and only the significant result is published, the probability of a false-positive result increases from 5% to about 40% (1 – .95^10, assuming independent tests). Moreover, readers would be appropriately skeptical about a discovery that is matched by 9 failures to discover the same effect. In contrast, if readers only see the significant result, it seems as if the actual success rate is 100% rather than 10%. When only significant results are being reported, the 5% criterion no longer sets an upper limit and the real rate of false positive results could be 100% (Sterling, 1959).
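A small sketch of the 10-test example. It assumes that the tests are independent and that no effect exists in any of them; under those assumptions the chance of at least one significant result is 1 − (1 − alpha)^k, which is about 40% for 10 tests. The simple sum 10 × 5% = 50% is an upper bound on this probability. The function name is illustrative.

```python
def prob_at_least_one_significant(alpha: float, n_tests: int) -> float:
    """Chance of one or more significant results when no effect exists.

    Assumes the tests are independent, so the chance that all of them
    stay non-significant is (1 - alpha) ** n_tests.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


for k in (1, 5, 10):
    p = prob_at_least_one_significant(alpha=0.05, n_tests=k)
    print(f"{k:2d} tests: P(at least one false positive) = {p:.3f}")
```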
The main goal of the evidential value movement is to curb dishonest reporting practices. A major theme in the evidential value movement is that editors and reviewers should be more tolerant of non-significant results, especially in multiple study articles that contain several tests of a theory (Schimmack, 2012). For example, in a multiple study paper with five studies and 80% power, one of the five studies is expected to produce a type-II error if the effect exists in all five studies. If power is only 50%, 2 or 3 studies should fail to provide statistically significant support for a hypothesis on their own.
Traditionally, authors excluded these studies from their multi-study articles and all studies provided support for their hypothesis. To reduce this dishonest reporting practice, editors should focus on the total evidence and allow for non-significant results in one or two studies. If four out of five studies produce a significant result, there is strong evidence for a theory and the evidence is stronger if all five studies are reported honestly.
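The expected numbers for a five-study paper follow from simple binomial arithmetic. The sketch below assumes that the effect exists in every study and that the studies are independent with equal power; the function names are illustrative.

```python
def expected_nonsignificant(n_studies: int, power: float) -> float:
    """Expected number of studies that miss significance (type-II errors)."""
    return n_studies * (1.0 - power)


def prob_all_significant(n_studies: int, power: float) -> float:
    """Chance that every study is significant, assuming independent studies."""
    return power ** n_studies


for power in (0.8, 0.5):
    misses = expected_nonsignificant(5, power)
    clean_sweep = prob_all_significant(5, power)
    print(f"power = {power:.0%}: expected non-significant = {misses:.1f}, "
          f"P(all five significant) = {clean_sweep:.3f}")
```

Even with 80% power, a set of five significant results out of five is the less likely outcome, which is why a perfect record in a multi-study article invites the question of what was left out.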
Surprisingly, FER write that this change in editorial policy will “not necessarily alter the ratio of false positive to false negative errors” (p. ). This statement makes no sense because reporting of non-significant results that were previously hidden in file-drawers would reduce the percentage of type-I errors (relative to all published results) and increase the percentage of type-II errors that are being reported (because many non-significant results in underpowered studies are type-II errors). Thus, more honest reporting of results would increase the percentage of reported type-II errors and FER are confusing readers if they suggest that this is not the case.
Even more problematic is FER’s second scenario. In this scenario, researchers continue to conduct studies with low power (50%) and submit manuscripts with multiple studies, where half the studies show statistically significant results and the other half do not, and editors reject these articles because they do not provide strong support for the hypothesis in all studies. FER anticipate that we would “see a marked decline in journal acceptance rates”. However, FER fail to mention a simple solution to this problem. Researchers could (and should) combine the resources that were needed to produce five studies with 50% power to conduct one study that has a high probability of being successful (Schimmack, 2012). As a result, the type-I error rate and the type-II error rate would decrease. The type-I error rate would decrease because fewer tests are being conducted (e.g., conducting 10 studies to get 5 significant results doubles the probability that a significant result was obtained even if no effect exists). The type-II error rate would decrease because researchers have more power to show the predicted effect without the use of dishonest research practices.
Alternatively, researchers can continue to conduct and report multiple underpowered studies, but abandon the elusive goal of finding significant results in each study. Instead, they could ignore significance tests of individual studies and conduct inferential statistical tests in a meta-analysis of all studies (Schimmack, 2012). The consequences for type-I and type-II error rates are the same as if researchers had conducted a single, more powerful study. Both approaches reduce type-I and type-II error rates because they reduce the number of statistical tests.
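The gain from pooling can be illustrated with a rough calculation. The sketch assumes equally sized studies of the same effect, a two-sided test, and a normal approximation that ignores the negligible chance of a significant result in the wrong direction; the function name is hypothetical. Under these assumptions, one study with five times the sample size (or a fixed-effect meta-analysis of the five studies) multiplies the expected z-score by the square root of 5.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def pooled_power(single_study_power: float, n_studies: int, alpha: float = 0.05) -> float:
    """Approximate power after pooling n_studies equally sized studies.

    Normal approximation; the far tail of the two-sided test is ignored.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected z-score implied by the power of a single study.
    ncp_single = z_crit + STD_NORMAL.inv_cdf(single_study_power)
    # Pooling k equal samples scales the expected z-score by sqrt(k).
    ncp_pooled = ncp_single * sqrt(n_studies)
    return STD_NORMAL.cdf(ncp_pooled - z_crit)


print(f"one study:          power = {pooled_power(0.5, 1):.2f}")
print(f"five studies pooled: power = {pooled_power(0.5, 5):.3f}")
```

In this approximation, five studies with 50% power each yield a pooled test with power above 99%, based on a single significance test rather than five.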
Based on their flawed reasoning, FER come to the wrong conclusion when they state “our point here is not that heightened stringency regarding false-positive rates is bad, but rather that it will almost certainly increase false-negative rates, which renders it less than an unmitigated scientific good.”
As demonstrated above, this statement is false because a reduction in the number of statistical tests and an increase in the power of each individual test reduce the risk of a type-I error and decrease the probability of making a type-II error (i.e., a false negative result).
WHAT IS AN ERROR BALANCED APPROACH?
Because FER start from a false premise, the recommendations for best practices that they derive from it are questionable. In fact, it is not even clear what their recommendations are when they introduce their error balanced approach that is supposed to have three principles.
The first principle is that both false positives and false negatives undermine the superordinate goals of science.
This principle is hardly controversial. It is problematic if a study shows that a drug is effective when the drug is actually not effective and it is problematic if an underpowered study fails to show that a drug is actually effective. FER fail to mention a long list of psychologists, including Jacob Cohen, who have tried to change the indifferent attitude of psychologists to non-significant results and the persistent practice of conducting underpowered studies that provide ample opportunity for multiple statistical tests so that at least one statistically significant result will emerge that can be used for a publication.
As noted earlier, the type-I error probability for a single statistical test is set at a maximum of 5%, but estimates of the type-II error probability are around 50%, a ten-fold difference. Cohen and others have advocated increasing power to 80%, which would reduce the type-II error risk to 20%. This would still imply that type-I errors are considered four times more harmful than type-II errors (5% vs. 20%).
Yet, FER do not recommend increasing statistical power, which would imply that the type-II error rate remains at 50%. The only other way to balance the two error rates would be to increase the type-I error rate. For example, one could increase the type-I error rate to 20%. Because power increases when the significance criterion becomes more liberal, this approach would also decrease the risk of type-II errors. The type-II error rate decreases when alpha is raised because results that were not significant are now significant. The risk is that more of these significant results are false-positives. In a between-subject design with alpha = 5% (type-I error probability) and 50% power, power increases to 76% if alpha is raised to 20% and the two error probabilities are roughly matched (20% vs. 24%).
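The effect of a more liberal alpha on power can be checked with a rough normal approximation. The sketch assumes a two-sided test and ignores the negligible chance of a significant result in the wrong direction; the function name is illustrative. It gives a value of roughly 75%, close to the 76% quoted above; the small gap is presumably due to the t distribution at a particular sample size, which the approximation does not model.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_at_new_alpha(power_old: float, alpha_old: float, alpha_new: float) -> float:
    """Approximate power of the same study under a different alpha.

    Normal approximation to a two-sided test; the far tail is ignored.
    """
    z_crit_old = STD_NORMAL.inv_cdf(1 - alpha_old / 2)
    z_crit_new = STD_NORMAL.inv_cdf(1 - alpha_new / 2)
    # Expected z-score implied by the original power and criterion.
    ncp = z_crit_old + STD_NORMAL.inv_cdf(power_old)
    return STD_NORMAL.cdf(ncp - z_crit_new)


new_power = power_at_new_alpha(power_old=0.5, alpha_old=0.05, alpha_new=0.20)
print(f"type-I error = 20%, type-II error = {1 - new_power:.0%}")
```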
In sum, although I agree with FER that type-I and type-II errors are important, FER fail to mention how researchers should balance error rates and ignore the fact that the most urgent course of action is to increase power of individual studies.
FER’s second principle is that neither type of error is “uniformly a greater threat to validity than the other type.”
Again, this is not controversial. In the early days of AIDS research, researchers and patients were willing to take greater risks in the hope that some medicine might work even if the probability of a false positive result in a clinical trial was high. When it comes to saving money in the supply of drinking water, a false negative result that the cheaper water is as healthy as the more expensive water is costly (of course, it is worse if it is well known that the cheaper water is toxic and politicians poison a population with toxic water).
A simple solution to this problem is to set the criterion value for an effect based on the implications of a type-I or a type-II error. However, in basic research no immediate actions have to be taken. The most common conclusion of a scientific article is that further research is needed. Moreover, researchers themselves can often conduct further research by conducting a follow-up study with more power. Therefore, it is understandable that the research community has been reluctant to increase the criterion for statistical significance from 5% to 20%.
An interesting exception might be a multiple study article where a 5% criterion for each study makes it very difficult to obtain significant results in each study (Schimmack, 2012). One could adopt a more lenient 20% criterion for individual studies. A two study paper would already have only a 4% probability to produce a type-I error if both studies yielded a significant result (.20 * .20 = .04).
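The multiplication above generalizes to any number of studies. The sketch assumes that the studies are independent and that no effect exists in any of them; the function name is illustrative.

```python
def joint_type1_probability(alpha_per_study: float, n_studies: int) -> float:
    """Chance that all studies are significant when no effect exists.

    Assumes independent studies that each use the same alpha.
    """
    return alpha_per_study ** n_studies


for k in (1, 2, 3):
    p = joint_type1_probability(alpha_per_study=0.20, n_studies=k)
    print(f"{k} studies at alpha = .20: joint type-I probability = {p:.3f}")
```

With a per-study criterion of 20%, two significant studies already fall below the conventional 5% level taken together.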
In sum, FER’s second principle about type-I and type-II errors is not controversial, but FER do not explain how the importance of type-I and type-II errors should influence the way researchers conduct their research and report their results. Most important, they do not explain why it would be problematic to report all results honestly.
FER’s third principle is that “any serious consideration of optimal scientific practice must contend with both types of error simultaneously.”
I have a hard time distinguishing between principle I and principle III. Type-I and Type-II errors are both a problem and the problem of type-II errors in underpowered studies has been emphasized in a large literature on power with Jacob Cohen as the leading figure, but FER seem to be unaware of this literature or have another reason not to cite it, which reflects poorly on their scholarship. The simple solution to this problem has been outlined by Cohen: conduct fewer statistical tests with higher statistical power. FER have nothing to add to this simple statistical truth. A researcher who spends his whole life collecting data and at the end of his career conducts a single statistical test, and finds a significant result with p < .0001, is likely to have made a real discovery and has a low probability of reporting a false positive result. In contrast, a researcher who publishes 100 statistical tests a year based on studies with low power will produce many false negative results and many false positive results.
This simple statistical truth implies that researchers have to make a choice. Do they want to invest their time and resources in many underpowered studies with many false positive and false negative results or do they want to invest their time and resources in a few high powered studies with few false positive and few false negative results?
Cohen advocated a slow and reliable approach when he said “less is more except for sample size.” FER fail to state where they stand because they started with the false premise that researchers can only balance the two types of errors without noticing that researchers can reduce both types of errors by conducting carefully planned studies with adequate power.
WHAT ABOUT HONESTY?
The most glaring omission in FER’s article is the lack of a discussion of dishonest reporting practices. Dishonest research practices are also called questionable research practices or p-hacking. Dishonest research practices make it difficult to distinguish researchers who conduct carefully planned studies with high power from those who conduct many underpowered studies. If researchers reported all of their results honestly, it would be easy to tell these two types of researchers apart. However, dishonest research practices allow researchers with underpowered studies to hide their false-negative results. As a result, the published record shows mostly significant results for both types of researchers, but this published record does not provide relevant information about the actual type-I and type-II errors being committed by the two researchers. The researcher with few, high powered studies has fewer unpublished non-significant results and a lower rate of published false positive results. The researcher with many underpowered studies has a large file-drawer filled with non-significant results that contains many false-negative results (discoveries that could have been made but were not made because the resources were spread too thin) and a higher rate of false-positive results in the published record.
The problem is that a system that tolerates dishonest reporting of results benefits researchers with many underpowered studies because they can publish more (true or false) discoveries and the number of (true or false) discoveries is used to reward researchers with positions, raises, awards, and grant money.
The main purpose of open science is to curb dishonest reporting practices. Preregistration makes it difficult to present an unexpected significant result as the prediction of a theory that was invented post hoc, after the results were known. Sharing of data sets makes it possible to check whether alternative analyses would have produced non-significant results. And rules about disclosing all measures make it difficult to report only measures that produced a desired outcome. The common theme of all of these initiatives is to increase honesty. Rules that encourage or enforce honest reporting of all the evidence (good or bad) are assumed to be a guiding principle in science, but they are not being enforced, and reporting only 3 studies with significant results when 15 studies were conducted is not considered a violation of scientific integrity.
What has been changing in the past years is a growing sense of awareness that dishonest reporting practices are harmful. Of course, it would have been difficult for FER to make a case for dishonest reporting practices and they do not make a positive case for dishonest reporting practices. However, they do present questionable arguments against recommendations that would curb questionable research practices and encourage honest reporting of results with the false argument that more honesty would increase the risk of type-II errors.
This argument is flawed because honest reporting of all results would provide an incentive for researchers to conduct more powerful studies that provide real support for a theory that can be reported honestly. Requirements to report all results honestly would also benefit researchers who conduct carefully planned studies with high power, which would reduce type-I and type-II error rates in the published literature. One might think everybody wins, but that is not the case. The losers in this new game would be researchers who have benefited from dishonest reporting practices.
FER’s article misrepresents the aims and consequences of the evidential value movement and fails to address the fundamental problem of allowing researchers to pick and choose the results that they want to report. The consequences of tolerating dishonest reporting practices became visible in the scandals that rocked social psychology in 2011: the Stapel debacle and the Bem bust. Social psychology has been called a sloppy science. If social psychology wants to (re)gain respect from other psychologists, scientists, and the general public, it is essential that social psychologists enforce a code of conduct that requires honest reporting of results.
It is telling that FER’s article appeared in the Interpersonal Relations and Group Processes section of the Journal of Personality and Social Psychology. In the 2015 rankings of 106 psychology journals, JPSP:IRGP can be found at the bottom of the rankings with a rank of 99. If relationship researchers take FER’s article as an excuse to resist changes in reporting practices, researchers may look towards other sciences (sociology) or other journals to learn about social relationships.
FER also fail to mention that new statistical developments have made it possible to distinguish between researchers who conduct high-powered studies and those who use low-powered studies and report only significant results. These tools predict failures of replication in actual replication studies. As a result, the incentive structure is gradually changing and it is becoming more rewarding to conduct carefully planned studies that can actually produce predicted results or, in other words, to be a scientist.
It is 2016, five years after the 2011 scandals that started the evidential value movement. I did not expect to see so much change in such a short time. The movement is gaining momentum and researchers in 2016 have to make a choice. They can be part of the solution or they can remain part of the problem.
VERY FINAL WORDS
Some psychologists do not like the idea that the new world of social media allows me to write a blog that has not been peer-reviewed. I think that social media have liberated science and encourage real debate. I can only imagine what would have happened if I had submitted this blog as a manuscript to JPSP:IRGP for peer-review. I am happy to respond to comments by FER or other researchers and I am happy to correct any mistakes that I have made in the characterization of FER’s article or in my arguments about power and error rates. Comments can be posted anonymously.