Here is my advice for researchers who are planning to write a 10-study article. Don’t do it.
And here is my reason why.
Schimmack (2012) pointed out a problem with conducting multiple studies to test a set of related hypotheses in a single article (a.k.a. multiple-study articles). The problem is that even a single study in psychology tends to have only modest power to produce empirical support for a correct hypothesis (p < .05, two-tailed). This probability, called statistical power, is estimated to be 50% to 60% on average. When researchers conduct multiple hypothesis tests, the probability of obtaining significant results in all of them decreases exponentially. With 50% power, the probability that all tests produce a significant result halves with each additional study (.500, .250, .125, .063, etc.).
Schimmack (2012) used the term total power for the probability that a set of related hypothesis tests all produce significant results. Few researchers who plan multiple-study articles consider total power in the planning of their studies, and multiple-study articles do not explain how researchers dealt with the likely outcome of a non-significant result. The most common practice is simply to ignore non-significant results and to report only the studies that produced significant results. The problem with this approach is that the reported results overstate the empirical support for a theory, reported effect sizes are inflated, and researchers who want to build on these published findings are likely to end up with a surprising failure to replicate the original findings. A failed replication is surprising because the authors of the original article appeared able to obtain significant results in every study. However, the reported success rate is deceptive and does not reveal the actual probability of a successful replication.
A number of statistical methods (TIVA, R-Index, P-Curve) have been developed to provide a more realistic impression of the strength and credibility of published results in multiple-study articles. In this post, I used these tests to examine the evidence in a 10-study article in Psychological Science by Adam A. Galinski (Columbia Business School, Columbia University). I chose this article because it has the most studies of any article in Psychological Science.
All 10 studies reported statistically significant results in support of the authors’ theoretical predictions. An a priori power analysis shows that authors who aim to present evidence for a theory in 10 studies need 98% power in each study (.80 raised to the power of 1/10) to have an 80% probability of obtaining significant results in all of them.
Each study reported several statistical results. I focused on the first focal hypothesis test in each study to obtain statistical results for the examination of bias and evidential value. The p-values for each statistical test were converted into z-scores using the inverse normal distribution, z = inverse.normal(1 − p/2).
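This conversion can be sketched with the standard library's normal distribution; a minimal example for a two-tailed p-value:

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-tailed p-value into an absolute z-score: z = Phi^-1(1 - p/2)."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))  # → 1.96
```

A p-value of exactly .05 maps to the familiar critical value z = 1.96, which is a quick sanity check on the conversion.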
The Test of Insufficient Variance (TIVA) was used to examine whether the variation in z-scores is consistent with the amount of sampling error expected for a set of independent studies (Var = 1). The variance in the z-scores, Var(z) = .27, is less than one would expect from a set of 10 independent studies. The probability that this reduction in variance occurred just by chance is p = .02.
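A minimal sketch of this test, assuming the standard formulation in which (k − 1) × Var(z) follows a chi-square distribution with k − 1 degrees of freedom under honest reporting, with the left tail flagging insufficient variance (the chi-square CDF is implemented here with a stdlib gamma series to stay self-contained):

```python
import math

def chi2_cdf(x, df):
    """Lower-tail chi-square CDF via the regularized incomplete gamma series."""
    a, t = df / 2.0, x / 2.0
    term = t ** a * math.exp(-t) / math.gamma(a + 1)
    total, n = term, 1
    while term > 1e-12:
        term *= t / (a + n)
        total += term
        n += 1
    return total

# Values taken from the text: 10 studies, observed Var(z) = .27.
k, var_z = 10, 0.27
p = chi2_cdf((k - 1) * var_z, df=k - 1)
print(round(p, 2))  # → 0.02
```

With these inputs the left-tail probability reproduces the p = .02 reported above.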
Thus, there is evidence that the perfect 10-for-10 rate of significant results was obtained by means of dishonest reporting practices. Either failed studies were not reported, or significant results were obtained with undisclosed research methods. For example, given the wide variation in sample sizes, optional stopping may have been used to obtain significant results. Consistent with this hypothesis, there is a strong correlation between sampling error (se = 2/sqrt[N]) and effect size (Cohen’s d = t × se) across the 10 studies, r(10) = .88.
The median observed power for the 10 studies is 71%. Not a single study had the 98% observed power needed for 80% total power. Moreover, the 71% estimate is itself inflated because the success rate (100%) exceeds the observed power (71%). After correcting for the inflation rate (100 − 71 = 29), the R-Index is 43%.
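The R-Index arithmetic is simple enough to sketch directly; the inputs below are the rounded values reported in the text (the 43% in the text presumably reflects unrounded power estimates):

```python
# R-Index (sketch): median observed power minus the inflation implied by
# a success rate that exceeds it.
success_rate = 1.00            # 10 of 10 studies significant
median_observed_power = 0.71   # median observed power from the text
inflation = success_rate - median_observed_power   # 0.29
r_index = median_observed_power - inflation
print(round(r_index * 100))    # → 42 with these rounded inputs
```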
An R-Index of 43% is below 50%, suggesting that the true power of the studies is below 50%. Researchers who conduct an exact replication study are therefore more likely to end up with a failed replication than a successful one, despite the original authors’ apparent ability to obtain significant results in every reported study.
A p-curve analysis shows that the results have evidential value, p = .02, using the conventional criterion of p < .05. That is, it is unlikely that these 10 significant results were obtained without a real effect in at least one of the ten studies. However, excluding the highest-powered test (Study 9) renders the p-curve results inconclusive, p = .11; that is, the hypothesis that the remaining 9 results were obtained without a real effect cannot be rejected at the conventional level of significance (p < .05).
These results show that the empirical evidence in this article is weak despite the impressive number of studies. The reason is that neither the absolute number of significant results nor the reported rate of significant results is an indicator of the strength of evidence when non-significant results are not reported.
The statistical examination of this 10-study article reveals that the reported results are less robust than the 100% success rate suggests, and that they are unlikely to provide a complete account of the research program that generated them. Most likely, the researchers used optional stopping to increase their chances of obtaining significant results.
It is important to note that optional stopping is not necessarily a bad or questionable research practice. It is only problematic when its use is not disclosed. The reason is that optional stopping leads to biased effect size estimates and increases the type-I error probability, which invalidates the claim that results were significant at the nominal level that limits type-I error rates to 5%.
The results also highlight that the researchers were too ambitious in their goal to produce significant results in 10 studies. Even though their sample sizes are sometimes larger than the typical sample size in Psychological Science (N ~ 80), much larger samples would have been needed to produce significant results in all 10 studies.
It is also important to note that the article was published in 2013, when it was common practice to exclude studies that failed to produce supporting evidence and to present results without full disclosure of the research methods used to produce them. Thus, the authors did not violate the ethical standards of scientific integrity of their time.
However, publication standards are changing. When journals require full disclosure of data and methods, researchers need to change the way they plan their studies. There are several options for researchers to change their research practices.
First, they can reduce the number of studies so that each study has a larger sample size and higher power to produce significant results. Authors who wish to report results from multiple studies need to take total power into account. Eighty percent power in a single study is insufficient to produce significant results in multiple studies, and the power of each study needs to be adjusted accordingly (Total Power = Power ^ k; required per-study Power = Total Power ^ (1/k); ^ = raised to the power of, k = number of studies).
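The relation between per-study power and total power can be tabulated directly; a short sketch for an 80% target:

```python
# Total power (sketch): the probability that all k studies produce a
# significant result is power ** k; inverting gives the per-study power
# needed for a target total power of 80%.
power, target = 0.80, 0.80
for k in (1, 2, 5, 10):
    print(k, round(power ** k, 3), round(target ** (1 / k), 3))
# k  total power at 80% each  per-study power needed for 80% total
# 1  0.8    0.8
# 2  0.64   0.894
# 5  0.328  0.956
# 10 0.107  0.978
```

Ten studies at 80% power each would all reach significance only about 11% of the time, which is why the per-study requirement climbs to roughly 98%.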
Second, researchers can increase power by relaxing the standard for statistical significance in each individual study. For example, it may be sufficient to claim support for a theory if 5 studies all produced significant results with alpha = .20 (a 20% type-I error rate per study), because the probability that all studies produce false-positive results decreases with the number of studies (total alpha = alpha ^ k). Researchers can also conduct a meta-analysis of their individual studies to examine the total evidence across studies.
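The combined error rate in this example works out as follows; a minimal sketch:

```python
# Combined type-I error (sketch): if support requires all k studies to
# be significant, the probability that all k are false positives is
# alpha ** k. With alpha = .20 per study and k = 5 studies:
alpha, k = 0.20, 5
print(round(alpha ** k, 5))  # → 0.00032
```

So five lenient tests at alpha = .20 jointly bound the all-false-positive probability far below the conventional .05.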
Third, researchers can specify a priori how many non-significant results they are willing to obtain and report. For example, researchers who plan 5 studies with 80% power can state that they expect one non-significant result. An honest set of results will typically produce variance in accordance with sampling theory (Var(z) = 1), a median observed power of 80%, and no inflation (expected success rate of 80% minus expected median power of 80% equals 0). Thus, the R-Index would be 80 − 0 = 80.
In conclusion, there are many ways to obtain and report the results of empirical studies. There is only one way that is no longer an option, namely selective reporting of results that support theoretical predictions. Statistical tests like TIVA, the R-Index, and P-Curve can reveal these practices and undermine the apparent value of articles that report many, and only, significant results. As a result, the incentive structure is changing (again*), and researchers need to think hard about the amount of resources they really need to produce empirical results in multiple studies.
* The multiple-study article is a unique phenomenon that emerged in experimental psychology in the 1990s. It was supposed to provide more solid evidence and to protect against the type-I errors of single-study articles that presented exploratory results as if they confirmed theoretical predictions (HARKing). However, dishonest reporting practices made it possible to produce impressive results without increased rigor. At the same time, the allure of multiple-study articles crowded out research that took time or required extensive resources to conduct even a single study. As a result, multiple-study articles often report studies that are quick (taking less than an hour to complete) and cheap (Mturk participants are paid less than $1) or free (undergraduate students receive course credit). Given the lack of real benefits and the detrimental effects on the quality of empirical studies, I expect a decline in the number of studies per article and an increase in the quality of individual studies.