A previous blog examined how and why Dr. Förster’s data showed incredibly improbable linearity.
The main hypothesis was that two experimental manipulations have opposite effects on a dependent variable.
Assuming that the average effect size of a single manipulation is similar to effect sizes in social psychology, a single manipulation is expected to have an effect size of d = .5 (change by half a standard deviation). As the two manipulations are expected to have opposite effects, the mean difference between the two experimental groups should be one standard deviation (0.5 + 0.5 = 1). With N = 40, and d = 1, a study has 87% power to produce a significant effect (p < .05, two-tailed). With power of this magnitude, it would not be surprising to get significant results in 12 comparisons (Table 1).
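The 87% power figure can be checked with a quick Monte Carlo sketch. This is an illustration, not the analysis behind the original numbers. It assumes normally distributed scores with equal variances, a pooled-variance t-test, and a hard-coded two-tailed critical value of 2.0244 for df = 38. The function names and defaults are made up for this example.

```python
import random

# Two-tailed .05 critical value of t for df = 38 (assumed constant).
T_CRIT_DF38 = 2.0244

def t_stat(a, b):
    """Pooled-variance t statistic for mean(a) - mean(b), equal group sizes."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    return (ma - mb) / ((va + vb) / n) ** 0.5

def simulated_power(d=1.0, n=20, reps=5000, seed=1):
    """Share of simulated two-group studies with |t| above the critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        high = [rng.gauss(d, 1) for _ in range(n)]  # group shifted by d SDs
        low = [rng.gauss(0, 1) for _ in range(n)]   # reference group
        hits += abs(t_stat(high, low)) > T_CRIT_DF38
    return hits / reps

print(f"Estimated power for d = 1, N = 40: {simulated_power():.2f}")
```

The estimate varies slightly with the seed and the number of replications, but it should land close to the analytic value cited above.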
The R-Index for the comparison of the two experimental groups in Table 1 is Ř = 87% (Success Rate = 100%, Median Observed Power = 94%, Inflation Rate = 6%).
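For readers who want to see how these three numbers fit together, here is a minimal sketch of the R-Index arithmetic: inflation is the success rate minus median observed power, and the R-Index is median observed power minus inflation. The z-scores in the usage line are hypothetical placeholders, not the values from Table 1, and the function name is an assumption made for this example.

```python
from statistics import NormalDist, median

def r_index(z_scores, alpha=0.05):
    """Return (success rate, median observed power, inflation, R-Index)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05, two-tailed
    # Observed power: probability of a significant result if the true
    # z-value equals the observed one (two-tailed test).
    powers = [norm.cdf(abs(z) - z_crit) + norm.cdf(-abs(z) - z_crit)
              for z in z_scores]
    success_rate = sum(abs(z) > z_crit for z in z_scores) / len(z_scores)
    mop = median(powers)
    inflation = success_rate - mop
    return success_rate, mop, inflation, mop - inflation

# Hypothetical z-scores for illustration only (NOT the Table 1 values).
example_z = [2.4, 3.1, 3.6, 2.9, 4.0, 3.3]
success, mop, inflation, r = r_index(example_z)
print(f"Success = {success:.2f}, MOP = {mop:.2f}, "
      f"Inflation = {inflation:.2f}, R-Index = {r:.2f}")
```

Plugging the rounded values reported above into the same formula gives 94% - 6% = 88%, so the reported 87% presumably reflects unrounded inputs.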
The Test of Insufficient Variance (TIVA) shows that the variance in z-scores is less than 1, but the probability of this occurring by chance is about 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096.
Thus, the results for the two experimental groups are perfectly consistent with real empirical data and the large effect size could be the result of two moderately strong manipulations with opposite effects.
The problem for Dr. Förster started when he included a control condition and wanted to demonstrate in each study that the two experimental groups also differed significantly from the control group. As already pointed out in the original post, samples of 20 participants per condition do not provide sufficient power to demonstrate effect sizes of d = .5 consistently.
To make matters worse, the three-group design has even less power than two independent studies because the same control group is used in a three-group comparison. When sampling error inflates the mean in the control group (e.g., true mean = 33, estimated mean = 36), it benefits the comparison for the experimental group with the lower mean, but it hurts the comparison for the experimental group with the higher mean (e.g., M = 27, M = 33, M = 39 vs. M = 27, M = 36, M = 39). When sampling error leads to an underestimation of the true mean in the control group (e.g., true mean = 33, estimated mean = 30), it benefits the comparison of the higher experimental group with the control group, but it hurts the comparison of the lower experimental group and the control group.
Thus, total power to produce significant results for both comparisons is even lower than for two independent studies.
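The cost of sharing one control group can be illustrated with a small simulation. The sketch below rests on assumptions of my own choosing: normal scores with SD = 1, true means of -0.5, 0, and +0.5, n = 20 per cell, pooled-variance t-tests, and a hard-coded critical value of 2.0244 for df = 38. A comparison counts as a success only if it is significant in the predicted direction. It compares the chance that both contrasts succeed when they share a control group with the chance when each contrast gets its own independent control group.

```python
import random

# Two-tailed .05 critical value of t for df = 38 (assumed constant).
T_CRIT_DF38 = 2.0244

def t_stat(a, b):
    """Pooled-variance t statistic for mean(a) - mean(b), equal group sizes."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    return (ma - mb) / ((va + vb) / n) ** 0.5

def both_significant(reps=5000, n=20, d=0.5, seed=1):
    """Return (P(both significant | shared control), P(both | separate controls))."""
    rng = random.Random(seed)

    def draw(mu):
        return [rng.gauss(mu, 1) for _ in range(n)]

    shared = independent = 0
    for _ in range(reps):
        low, high = draw(-d), draw(d)
        ctrl_a, ctrl_b = draw(0), draw(0)
        # Control vs. low group, significant in the predicted direction.
        low_sig = t_stat(ctrl_a, low) > T_CRIT_DF38
        # High group tested against the SAME control group ...
        shared += low_sig and t_stat(high, ctrl_a) > T_CRIT_DF38
        # ... versus against a second, independent control group.
        independent += low_sig and t_stat(high, ctrl_b) > T_CRIT_DF38
    return shared / reps, independent / reps

shared, independent = both_significant()
print(f"Both significant, shared control:    {shared:.3f}")
print(f"Both significant, separate controls: {independent:.3f}")
```

Under these assumptions the shared-control rate should come out clearly below the separate-controls rate, because any sampling error in the control mean that helps one contrast necessarily hurts the other.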
It follows that the problem for a researcher with real data was the control group. Most studies would have produced significant results for the comparison of the two experimental groups, but failed to show significant differences between one of the experimental groups and the control group.
At this point, it is unclear how Jens Förster achieved significant results under the contested assumption that real data were collected. However, it seems most plausible that questionable research practices (QRPs) were used to move the mean of the control group to the center so that both experimental groups show a significant difference. When this was impossible, the control group could be dropped, which may explain why 3 studies in Table 1 did not report results for a control group.
The influence of QRPs on the control group can be detected by examining the variation of means in Table 1 across the 12(9) studies. Sampling error should randomly increase or decrease means relative to the overall mean of an experimental condition. Thus, there is no reason to expect a correlation in the pattern of means. Consistent with this prediction, the means of the two experimental groups are unrelated, r(12) = .05, p = .889; r(9) = .36, p = .347. In contrast, the means of the control group are correlated with the means of the two experimental groups, r(9) = .73, r(9) = .71. If the means in the control group are the result of the unbiased means in the experimental groups, it makes sense to predict the means in the control group from the means in the two experimental groups. A regression equation shows that 77% of the variance in the means of the control group is explained by the variation in the means in the experimental groups, R = .88, F(2,6) = 10.06, p = .01.
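The link between the explained variance and the reported F-test can be verified by hand. The sketch below uses the standard conversion F = (R²/k) / ((1 - R²)/(n - k - 1)), with k = 2 predictors and n = 9 studies taken from the text above. The function name is made up for this example.

```python
def f_from_r_squared(r2, n_obs, n_predictors):
    """Convert R-squared from a multiple regression into its F statistic."""
    df1 = n_predictors              # numerator df: number of predictors
    df2 = n_obs - n_predictors - 1  # denominator df: residual df
    f = (r2 / df1) / ((1 - r2) / df2)
    return f, df1, df2

# 77% explained variance, 9 studies with a control group, 2 predictors.
f, df1, df2 = f_from_r_squared(0.77, n_obs=9, n_predictors=2)
print(f"R = {0.77 ** 0.5:.2f}, F({df1},{df2}) = {f:.2f}")
```

With the rounded value of 77%, this gives R = .88 and F(2,6) of about 10.04. The small gap to the reported 10.06 is what one would expect if the published F was computed from the unrounded R².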
This analysis clarifies the source of the unusual linearity in the data. Studies with n = 20 per condition have very low power to demonstrate significant differences between a control group and opposite experimental groups because sampling error in the control group is likely to move the mean of the control group too close to one of the experimental groups to produce a significant difference.
This problem of low power may lead researchers to use QRPs to move the mean of the control group to the center. The problem for users of QRPs is that this statistical boost of power leaves a trace in the data that can be detected with various bias tests: the pattern of the three means will be too linear; there will be insufficient variance in the effect sizes, p-values, and observed power in the comparisons of experimental groups and control groups; the success rate will exceed median observed power; and, as shown here, the means in the control group will be correlated with the means in the experimental groups across studies.
In a personal email, Dr. Förster did not comment on the statistical analyses because his background in statistics is insufficient to follow the analyses. However, he rejected this scenario as an account for the unusual linearity in his data: “I never changed any means.” Another problem for this account of what could have happened is that dropping cases from the middle group would lower the sample size of this group, but the sample size is always close to n = 20. Moreover, oversampling and dropping of cases would be a QRP that Dr. Förster would remember and could report. Thus, I now agree with the conclusion of the LOWI commission that the data cannot be explained by using QRPs, mainly because Dr. Förster denies having used any plausible QRPs that could have produced his results.
Some readers may be confused about this conclusion because it may appear to contradict my first blog. However, my first blog merely challenged the claim by the LOWI commission that linearity cannot be explained by QRPs. I found a plausible way in which QRPs could have produced linearity, and these new analyses still suggest that secretive and selective dropping of cases from the middle group could be used to show significant contrasts. Depending on the strength of the original evidence, this use of QRPs would be consistent with the widespread use of QRPs in the field and would not be considered scientific misconduct. As Roy F. Baumeister, a prominent social psychologist, put it, “this is just how the field works.” However, unlike Roy Baumeister, who explained improbable results with the use of QRPs, Dr. Förster denies any use of QRPs that could potentially explain the improbable linearity in his results.
In conclusion, the following facts have been established with sufficient certainty:
(a) the reported results are too improbable to reflect just true effects and sampling error; they are not credible.
(b) the main problem for a researcher to obtain valid results is the low power of multiple-study articles and the difficulty of demonstrating statistical differences between one control group and two opposite experimental groups.
(c) to avoid reporting non-significant results, a researcher must drop failed studies and selectively drop cases from the middle group to move the mean of the middle group to the middle.
(d) Dr. Förster denies the use of QRPs and he denies data manipulation.
Evidently, the facts do not add up.
The new analyses suggest that there is one simple way for Dr. Förster to show that his data have some validity. The comparison of the two experimental groups shows an R-Index of 87%, which implies that there is nothing statistically improbable about this comparison. If these reported results are based on real data, a replication study is highly likely to replicate the mean difference between the two experimental groups. With n = 20 in each cell (N = 40), it would be relatively easy to conduct a preregistered and transparent replication study. However, without further credible evidence, the published results lack scientific credibility, and it would be prudent to retract all articles that show unusual statistical patterns that cannot be explained by the author.