Dr. R responds to the article by Finkel, Eastwick, and Reis (FER): "Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach"
My response is organized as a commentary on key sections of the article. Key passages are quoted directly to give readers quick and easy access to FER's arguments and conclusions, and each quotation is followed by my comments. The quotations are printed in bold.
**Here, we extend FER2015’s analysis to suggest that much of the discussion of best research practices since 2011 has focused on a single feature of high-quality science—replicability—with insufficient sensitivity to the implications of recommended practices for other features, like discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.**
I see replicability as equivalent to the concept of reliability in psychological measurement. Reliability is necessary for validity: a measure needs to be reliable to produce valid results, and this includes internal validity and external validity. And valid results are needed to create a solid body of research that provides the basis for a cumulative science.
Take life-satisfaction judgments as an example. In a review article, Schwarz and Strack (1999) claimed that life-satisfaction judgments are unreliable and extremely sensitive to context, and that responses can change dramatically as a function of characteristics of the survey questions. Do we think a measure with low reliability can be used to study well-being and to build a cumulative science of well-being? No. It seems self-evident that reliable measures are better than unreliable measures.
The reason why some measures are not reliable is that scores on the measure are influenced by factors that are difficult or too expensive to control. As a result, these factors have an undesirable effect on responses. The effect is not systematic, or it is too difficult to study the systematic effects, and therefore results will randomly change when the same measure is used again and again. We can assess the influence of these random factors by administering the same measurement procedure again and again and see how much scores change (in the absence of real change).
The same logic applies to replicability. Replicability means that we get the same result if we repeat a study again and again. Just like scores on a psychological measure can change, the results of even exact replication studies will not be the same. The reason is the same: random factors that are outside the control of the experimenter influence the results that are obtained in a single study. Hence, we cannot expect that exact replication studies will always produce the same results. For example, the gender ratio in a psychology class will not be the same year after year, even if there is no real change in the gender ratio of psychology students over time.
So what does it even mean for a result to be replicable; that is, for a replication study to produce the same result as the original study? It depends on the interpretation of the results of the original study. A professor interested in the gender composition of psychology students could compute the gender ratio for each year. The exact number would vary from year to year. However, the researcher could also compute a 95% confidence interval around these numbers. This interval specifies the amount of variability that is expected by chance. We may then say that a study is replicable if subsequent studies produce results that are compatible with the 95% confidence interval of the original study. In contrast, low replicability would mean that results vary dramatically from study to study. For example, in one year the gender ratio is 70% female (+/- 10%, 95% CI), in the next year it is 25% female (again +/- 10%), and the following year it is 99% female (+/- 10%). In this case, the gender ratio jumps around dramatically; the result from one study cannot be used to predict gender ratios in other years and provides no solid empirical foundation for theories of the effect of gender on interest in psychology.
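This compatibility criterion is easy to make concrete. Here is a minimal sketch in Python; the class sizes and counts are hypothetical, chosen so that the interval width matches the +/- 10% in the example above:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Wald 95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# hypothetical original year: 56 of 80 students (70%) are female
lo, hi = proportion_ci(56, 80)        # roughly (0.60, 0.80), i.e. 70% +/- 10%

# a later year with 60 of 80 (75%) falls inside the interval: compatible,
# so by this criterion the result replicated
replication_p = 60 / 80
print(lo <= replication_p <= hi)      # True
```

A 25% or 99% gender ratio in a later year would fall far outside this interval and would count as a failure to replicate under the same criterion.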
Using this criterion of replicability, many results in psychology are highly replicable. The problem is that, by this criterion, many results in psychology are also not very informative, because effect sizes tend to be small relative to the width of confidence intervals (Cohen, 1994). With a standardized effect size of d = .4 and a typical confidence interval width of about 1 (se = .25), the typical finding in psychology is that the effect size ranges from d = -.1 to d = .9. This means the typical result is consistent with a small effect in the opposite direction from the one in the sample (chocolate eating leads to weight gain, even if my study shows that chocolate eating leads to weight loss) and with very large effects in the same direction (chocolate eating is a highly effective way of losing weight). Most importantly, the result is also consistent with the null hypothesis (chocolate eating has no effect on weight, which in this case would be a sensational and important finding that would make Willy Wonka very happy). I hope this example makes the point that it is not very informative to conduct studies of small effect sizes with wide confidence intervals, because we do not learn much from these studies. Mostly, we are not more informed about a research question after looking at the data than we were before we looked at the data.
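The arithmetic behind this example is the usual normal-theory interval, d plus or minus 1.96 standard errors; a quick check with the d and se values from the text:

```python
d, se = 0.4, 0.25                        # values from the example above
lo, hi = d - 1.96 * se, d + 1.96 * se
print(round(lo, 2), round(hi, 2))        # -0.09 0.89, i.e. roughly -.1 to .9
```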
Not surprisingly, psychology journals do not publish findings like d = .2 +/- .8. The typical criterion for reporting a newsworthy result is that the confidence interval falls into one of two regions: the region of effect sizes less than zero or the region of effect sizes greater than zero. If the 95% CI falls in one of these two regions, it is possible to say that there is only a maximum error rate of 5% when we infer from a confidence interval in the positive region that the actual effect size is positive, and from a confidence interval in the negative region that the actual effect size is negative. In other words, it wasn’t just random factors that produced a positive effect in a sample when the actual effect size is 0 or negative, and it wasn’t just random factors that produced a negative effect when the actual effect size is 0 or positive. To examine whether the results of a study provide sufficient information to claim that an effect is real and not just due to random factors, researchers compute p-values and check whether the p-value is less than 5%.
If the original study reported a significant result to make inferences about the direction of an effect, and replicability is defined as obtaining the same result, then replicability means that we obtain a significant result again in the replication study. The famous statistician Sir Ronald Fisher made replicability a criterion for a good study: “A properly designed experiment rarely fails to give … significance” (Fisher, 1926, p. 504).
What are the implications of replication studies that do not replicate a significant result? These studies are often called failed replication studies, but this term is unfortunate because the study was not a failure. We might want to call these studies unsuccessful replication studies, although I am not sure this term is much better. The problem with unsuccessful replication studies is that there is a host of reasons why a replication study might fail. This means additional research is needed to uncover why the original study and the replication study produced different results. In contrast, if a series of studies produces significant results, it is highly likely that the result is a real finding and can be used as an empirical foundation for theories. For example, the gender ratio in my PSY230 course is always significantly different from the 50/50 split that we might expect if both genders were equally interested in psychology. This shows that my study, which recorded the gender of students and compared the ratio of men and women against a fixed probability of 50%, meets at least one criterion of a properly designed experiment, namely that it rarely fails to reject the null hypothesis.
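The gender-ratio test described here is simply an exact binomial test against p = .5. A minimal sketch (the class sizes are hypothetical):

```python
from math import comb

def binom_p_two_sided(k, n):
    """Exact two-sided binomial test against H0: p = .5.
    Under this null the distribution is symmetric, so doubling one tail is exact."""
    k = max(k, n - k)                                    # fold onto the larger tail
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(binom_p_two_sided(56, 80) < 0.05)   # 70% female in a class of 80: rejects 50/50
print(binom_p_two_sided(44, 80) < 0.05)   # 55% female: does not reject 50/50
```

With a large enough and stable enough gender gap, this test rejects the null hypothesis year after year, which is exactly the sense in which the result "rarely fails to give significance."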
In short, it is hard to argue with the proposition that replicability is an important criterion for a good study. If study results cannot be replicated, it is not clear whether a phenomenon exists, and if it is not clear whether a phenomenon exists, it is impossible to make theoretical predictions about other phenomena based on this phenomenon. For example, we cannot predict gender differences in professions that require a psychology degree if we do not have replicable evidence that there is a gender difference in psychology students.
**The present analysis extends FER2015’s “error balance” logic to emphasize tradeoffs among features of a high-quality science (among scientific desiderata). When seeking to optimize the quality of our science, scholars must consider not only how a given research practice influences replicability, but also how it influences other desirable features.**
A small group of social relationship researchers (Finkel, Eastwick, & Reis; henceforth FER) are concerned about the recent shift in psychology from a scientific discipline that ignored replicability entirely to a field that actually cares about the replicability of results published in original research articles. Although methodologists have criticized psychology for a long time, it was only after Bem (2011) published extraordinarily unbelievable results that psychologists finally started to wonder how replicable published results actually are. In response to this new focus on replicability, several projects have conducted replication studies with shocking results. In FER’s research area, replicability is estimated to be as low as 25%. That is, three-quarters of published results are not replicable and require further research efforts to examine why original studies and replication studies produced inconsistent results. In a large-scale replication study, one of the authors’ original findings failed to replicate, and the replication studies cast doubt on theoretical assumptions about the determinants of forgiveness in close relationships.
FER ask “Should Scientists Consistently Prioritize Replicability Above Other Core Features?”
As FER are substantive researchers with little background in research methodology, it may be understandable that they do not mention important contributions by methodologists like Jacob Cohen. Cohen’s answer is clear. Less is more, except for sample size. This statement makes it clear that replicability is necessary for a good study. According to Cohen a study design can be perfect in many ways (e.g., elaborate experimental manipulation of real-world events with highly valid outcome measures), but if the sample size is small (e.g., N = 3), the study simply cannot produce results that can be used as an empirical foundation for theories. If a study cannot reject the null-hypothesis with some degree of confidence, it is impossible to say whether there is a real effect or whether the result was just caused by random factors.
Unfazed by their lack of knowledge about research methodology, FER take a different view.
**In our view, the field’s discussion of best research practices should revolve around how we prioritize the various features of a high-quality science and how those priorities may shift across our discipline’s many subfields and research contexts.**
**Similarly, requiring very large sample sizes increases replicability by reducing false-positive rates and increases cumulativeness by reducing false-negative rates, but it also reduces the number of studies that can be run with the available resources, so conceptual replications and real-world extensions may remain unconducted.**
So, who is right? Should researchers follow Cohen’s advice and conduct a small number of studies with large samples, or is it better to conduct a large number of studies with small samples? If resources are limited and a researcher can collect data from 500 participants in one year, should the researcher conduct one study with N = 500, five studies with N = 100, or 25 studies with N = 20? FER suggest that we face a trade-off between replicability and discoveries.
**Also, large sample size norms and requirements may limit the feasibility of certain sorts of research, thereby reducing discovery.**
This is true, if we count both true and false discoveries as discoveries (FER do not make a distinction). Bem (2011) discovered that human minds can time travel. This was a fascinating discovery, yet it was a false discovery. Bem (2001) himself advocated the view that all discoveries are valuable, even false discoveries (“Let’s err on the side of discovery.”). Maybe FER learned about research methods from Bem’s chapter. Most scientists and lay people, however, value true discoveries over false discoveries. Many people would feel cheated if the Moon landing had actually been faked, for example, or if billions spent on cancer drugs were not helping to fight cancer (it really was just eating garlic). So, the real question is whether many studies with small samples produce more true discoveries than a single study with a large sample.
This question was examined by LeBel, Campbell, and Loving (2015), who concluded largely in favor of Cohen’s recommendation: a slow approach with fewer studies and high replicability is advantageous for a cumulative science.
**For example, LCL2016’s Table 3 shows that the N-per-true discovery decreases from N=1,742 when the original research is statistically powered at 25% to N=917 when the original research is statistically powered at 95%.**
FER criticize that LCL focused on the efficient use of resources for replication studies and ignored the efficient use of resources for original research. As many researchers do more than one study on a particular research question, the distinction between original researchers and replication researchers is artificial. Ultimately, researchers may conduct a number of studies. The studies can be totally new, conceptual replications of previous studies, or exact replications of previous studies. A significant result will always be used to claim a discovery. When a non-significant result contradicts a previous significant result, additional research is needed to examine whether the original result was a false discovery or whether the replication result was a false negative.
FER observe that **“original researchers will be more efficient (smaller N-per-true discovery) when they prioritize lower-powered studies. That is, when assuming that an original researcher wishes to spend her resources efficiently to unearth many true effects, plans never to replicate her own work, and is insensitive to the resources required to replicate her studies, she should run many weakly powered studies.”**
FER may have discovered why some researchers, including themselves, pursue a strategy of conducting many studies with relatively low power: it produces many discoveries that can be published. It also produces many non-significant results that do not lead to a discovery. But the absolute number of true discoveries is still likely to be greater than the one true discovery by a researcher who conducted only one study. The problem is that these researchers are also likely to make more false discoveries than the researcher who conducts only one study. They just make more discoveries, true and false, and replication studies are needed to examine which results are true discoveries and which are false discoveries. When other researchers conduct replication studies and fail to replicate an effect, further resources are needed to examine why the original study and the replication study produced inconsistent results. However, this is not a problem for discoverers who are only in the business of testing new and original hypotheses, reporting those that produced a significant result, and leaving it to other researchers to examine which of these discoveries are true or false. These researchers were rewarded handsomely in the years before Bem (2011) because nobody wanted to be in the business of conducting replication studies. As a result, all discoveries produced by original researchers were treated as if they would replicate, and researchers with a high number of discoveries were treated as researchers with more true discoveries. There just was no distinction between true and false discoveries, and it made sense to err on the side of discovery.
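This trade-off can be made concrete with a back-of-the-envelope calculation. The numbers below are assumptions for illustration: a budget of 500 participants (as in the example above), d = .4, a 50/50 mix of true and null hypotheses, and power computed with a normal approximation:

```python
from statistics import NormalDist

nd = NormalDist()

def power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed two-sample test (normal approximation)."""
    se = (2 / n_per_group) ** 0.5
    return 1 - nd.cdf(nd.inv_cdf(1 - alpha / 2) - d / se)

BUDGET, D, ALPHA = 500, 0.4, 0.05        # assumed: 500 participants, d = .4

def expected_discoveries(n_per_study):
    """Expected (true, false) significant results from spending the budget
    on studies of a given size; half the tested hypotheses are real effects."""
    studies = BUDGET // n_per_study
    true_hits = studies / 2 * power(D, n_per_study / 2, ALPHA)
    false_hits = studies / 2 * ALPHA
    return true_hits, false_hits

print(expected_discoveries(500))   # one large study: few but trustworthy hits
print(expected_discoveries(20))    # 25 small studies: more hits, many of them false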
**Given the conflicting efficiency goals between original researchers and replicators, whose goals shall we prioritize?**
This is a bizarre question. The goal of science is to uncover the truth and to create theories that rest on a body of replicable, empirical findings. Apparently, this is not the goal of original researchers. Their goal is to make as many discoveries as possible and to leave it to replicators to test which of these discoveries are replicable. This division of labor is not very appealing; few scientists want to be the maids of original scientists and clean up the mess when they do cooking experiments in the kitchen. Original researchers should routinely replicate their own results, and when they do so with small studies, they suddenly face the replicators’ problem: they end up with non-significant results and have to conduct further studies to uncover the reasons for these discrepancies. FER seem to agree.
**We must prioritize the field’s efficiency goals rather than either the replicator’s or the original researcher’s in isolation. The solid line in Figure 2 illustrates N-per-true-discovery from the perspective of the field—when the original researcher’s 5,000 participants are added to the pool of participants used by the replicator. This line forms a U-shaped pattern, suggesting that the field will be more efficient (smaller N-per-true-discovery) when original researchers prioritize moderately powered studies.**
This conclusion is already implied in Cohen’s power calculations. The reason is that studies with very low power have a low chance of producing a significant result. As a result, resources are wasted on these studies, and it would have been better not to conduct them, especially when we take into account that each study requires a new ethics approval, training of personnel, data-analysis time, etc. All of these costs multiply with the number of studies that are conducted to get a significant result. At the other extreme, power increases as a log-function of sample size. This means that once power has reached a certain level, it requires more and more resources to increase power even further. Moreover, 80% power means that 8 out of 10 studies are significant, and 90% power means that 9 out of 10 studies are significant. The extra costs of increasing power to 90% may not warrant the increase in success rate from 8 to 9 studies. For this reason, Cohen did not suggest that infinite sample sizes are optimal. Instead, he suggested that researchers should aim for 80% power; that is, 4 out of 5 studies that examine a real effect show a significant result.
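The diminishing returns can be illustrated with the standard normal-approximation sample-size formula (a sketch; exact t-test numbers differ slightly):

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf

def n_per_group(d, power, alpha=0.05):
    """Approximate n per group for a two-tailed two-sample test
    (normal approximation to the power function)."""
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

for p in (0.50, 0.80, 0.90, 0.95):
    print(p, n_per_group(0.4, p))
```

Under this approximation, for d = .4, 50% power needs about 49 participants per group, 80% about 99, 90% about 132, and 95% about 163. Going from 50% to 80% power roughly doubles the sample; going from 80% to 90% buys only one extra significant study in ten at the cost of a further increase of roughly one third in N.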
However, FER’s simulations come to a different conclusion. Their Figure suggests that studies with 30% power are just as good as studies with 70% power and could be even better than studies with 80% power.
**For example, if a hypothesis is 75% likely to be true, which might be the case if the finding had a strong theoretical foundation, the most efficient use of field-wide N appears to favor power of ~25% for d=.41 and ~40% for d=.80.**
The problem with taking these results seriously is that the criterion N-per-true-discovery does not take into account the costs of a type-I error. Conducting many studies with small samples and low power can produce a larger number of significant results than a smaller number of studies with large samples, simply due to the larger number of studies. However, it also implies a higher rate of false positives. Thus, it is important to take the seriousness of a type-I error or a type-II error into account.
So, let’s use a scenario where original results need to be replicated. In fact, many journals require at least two, if not more, significant results to provide evidence for an effect. The researcher who conducts many studies with low power has a problem, because the probability of obtaining two significant results in a row is only power squared. Even if a single significant result is reported, other researchers need to replicate this finding, and many of these replication studies will fail until eventually a replication study with a significant result corroborates the original finding.
In a simulation with d = .4 and an equal proportion of null-hypothesis and real effects, a researcher with 80% power (N = 200, d = .4, alpha = .05, two-tailed), needs about 900 participants for every discovery. A researcher with 20% power (N = 40, d = .4, alpha = .05, two-tailed) needs about 1800 participants for every discovery.
When the rate of true null-results decreases, the number of true discoveries increases and it is easier to make true discoveries. Nevertheless, the advantage of high-powered studies remains. High-powered studies take about half as many participants per true discovery as low-powered studies (N = 665 vs. 1157).
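The kind of calculation behind these numbers can be sketched analytically rather than by simulation. The bookkeeping below is an assumption on my part: a replication is attempted only after an initial significant result, and a true discovery requires both studies to be significant; power uses a normal approximation, so the outputs are close to, but not identical with, the simulated numbers above:

```python
from statistics import NormalDist

nd = NormalDist()

def power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed two-sample test (normal approximation)."""
    se = (2 / n_per_group) ** 0.5
    return 1 - nd.cdf(nd.inv_cdf(1 - alpha / 2) - d / se)

def n_per_true_discovery(d, n_total, prop_true=0.5, alpha=0.05):
    """Expected participants per true discovery when a discovery requires
    an original AND a replication significant result."""
    pwr = power(d, n_total / 2, alpha)
    cost_true = n_total * (1 + pwr)          # replication run only after a hit
    cost_null = n_total * (1 + alpha)
    expected_cost = prop_true * cost_true + (1 - prop_true) * cost_null
    expected_discoveries = prop_true * pwr ** 2
    return expected_cost / expected_discoveries

print(n_per_true_discovery(0.4, 200))   # ~80% power: most efficient
print(n_per_true_discovery(0.4, 40))    # ~24% power: far more N per true discovery
```

Once the two-significant-results criterion is imposed, the efficiency advantage flips decisively toward the high-powered strategy.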
The reason for the discrepancy between my results and FER’s results is that they do not take replicability into account. This is ironic, because their title suggests that they are going to write about replicability, when they actually ignore that results from small studies with low power have low replicability. That is, if we only try to get a result once, it can be more efficient to do so with small, underpowered studies, because random sampling error will often dramatically inflate effect sizes and produce a significant result. However, this inflation is not replicable, and replication studies are likely to produce non-significant results and cast doubt on the original finding. In other words, FER ignore the key characteristic of replicability: replication studies of the same effect should again produce significant results. Their argument is fundamentally flawed because it ignores the very concept it is named after. Low-powered studies are less replicable, and original studies that are not replicable make it impossible to create a cumulative science.
The problems of underpowered studies increase exponentially in a research environment that rewards publication of discoveries, whether they are true or false, and provides no incentives for researchers to publish non-significant results, even if these non-significant results challenge the significant results of an original article. Rather than treating these unsuccessful replications as a warning sign that the original results might have been false positives, the non-significant result is treated as evidence that the replication study must have been flawed; after all, the original study found the effect, and the replication study might just have had low power. As a result, false-positive results can poison theory development, because theories have to explain findings that are actually false positives, and researchers continue to conduct unsuccessful replication studies because they are unaware that other researchers have already failed to replicate an original false-positive result. These problems have been discussed at length in recent years, but FER blissfully ignore these arguments and discussions.
**Since 2011, psychological science has witnessed major changes in its standard operating procedures—changes that hold great promise for bolstering the replicability of our science. We have come a long way, we hope, from the era in which editors routinely encouraged authors to jettison studies or variables with ambiguous results, the file drawer received only passing consideration, and p<.05 was the statistical holy of holies. We remain, as in FER2015, enthusiastic about such changes. Our goal is to work alongside other meta-scientists to generate an empirically grounded, tradeoff-based framework for improving the overall quality of our science.**
That sounds good, but it is not clear what FER bring to the table.
**We must focus greater attention on establishing which features are most important in a given research context, the extent to which a given research practice influences the alignment of a collective knowledge base with each of the relevant features, and, all things considered, which research practices are optimal in light of the various tradeoffs involved. Such an approach will certainly prioritize replicability, but it will also prioritize other features of a high-quality science, including discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.**
What is lacking here is a demonstration that it is possible to prioritize internal validity, external validity, consequentiality, and cumulativeness without replicability. How do we build on results that emerge only in one out of two, three, or five studies, let alone 1 out of 10 studies? FER create the illusion that we can make more true discoveries by conducting many small studies with low power. This is true only in the limited sense of needing fewer participants for an initial discovery. But their own criterion of cumulativeness implies that we are not interested in a single finding that may or may not replicate. To build on original findings, others should be able to redo a study and get a significant result again. This is what Fisher had in mind and what Neyman and Pearson formalized into power analysis.
FER also overlook a much simpler solution for balancing the rate of original discovery and replicability. Namely, researchers can increase the type-I error rate from the conventional 5% criterion to 20% (or more). As the type-I error rate increases, power increases. At the same time, readers are properly warned that the results are only suggestive, definitely require further research, and cannot be treated as evidence that needs to be incorporated in a theory. Meanwhile, researchers with large samples do not have to waste their resources on rejecting H0 with alpha = .05 and 99.9% power. They can use their resources to make more definitive statements about their data and reject H0 with a p-value that corresponds to 5 standard deviations of a standard normal distribution (the 5-sigma rule in particle physics).
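Both halves of this suggestion are easy to quantify. A sketch with assumed numbers (d = .4 and n = 50 per group for the alpha comparison; power again via a normal approximation):

```python
from statistics import NormalDist

nd = NormalDist()

def power(d, n_per_group, alpha):
    """Approximate power of a two-tailed two-sample test (normal approximation)."""
    se = (2 / n_per_group) ** 0.5
    return 1 - nd.cdf(nd.inv_cdf(1 - alpha / 2) - d / se)

# relaxing alpha from .05 to .20 buys a sizable power gain in a small study
print(power(0.4, 50, 0.05))   # ~.52
print(power(0.4, 50, 0.20))   # ~.76

# the particle-physics criterion: two-tailed p-value at 5 sigma
print(2 * (1 - nd.cdf(5)))    # ~5.7e-07
```

In this example, the relaxed criterion lifts power from roughly one in two to roughly three in four significant results for a real effect, at the cost of a clearly labeled, four-times-higher false-positive rate.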
No matter what the solution to the replicability crisis in psychology is, the solution cannot be a continuation of the old practice of conducting numerous statistical tests on a small sample and then reporting only the results that are statistically significant at p < .05. It is unfortunate that FER’s article can easily be misunderstood as suggesting that using small samples and testing for significance with p < .05 and low power can be a viable research strategy in some specific contexts. I think they failed to make their case and to demonstrate in which research context this strategy benefits psychology.