This morning a tweet by Jeff Rouder suggested taking a closer look at an online-first article published in Psychological Science.
When the Spatial and Ideological Collide
Metaphorical Conflict Shapes Social Perception
http://pss.sagepub.com/content/early/2016/02/01/0956797615624029.abstract
The senior author of the article is Yaacov Trope from New York University. The powergraph for Yaacov Trope suggests that the average significant result reported in his articles is based on a study with 52% power in the years from 2000 to 2012 and 43% power in the more recent years from 2013 to 2015. The difference is probably not reliable, but the results show no evidence that Yaacov Trope has changed his research practices in response to the criticism of psychological research practices over the past five years.
An average of 50% power for statistically significant results would suggest that every other test of a theoretical prediction produces a non-significant result. If, however, articles typically report that the results confirmed the theoretical predictions, it is clear that dishonest reporting practices (excluding non-significant results or using undisclosed statistical methods like optional stopping) were used to present results that confirm theoretical predictions.
Moreover, the 50% estimate is an average. Power varies as a function of the strength of evidence, and power for just-significant results is lower than 50%. The range of z-scores from 2 to 2.6 approximately covers p-values in the range from .05 to .01 (just-significant results). Average power for p-values in this range can be estimated by examining the contributions of the red (less than 20% power), black (50% power), and green (85% power) densities. In both graphs the density in this area is fully covered by the red and black lines, which implies that power is a mixture of 20% and 50% power, and therefore less than 50%. In the more reliable powergraph on the left, the red line (less than 20% power) covers a large portion of the area under the curve, suggesting that power for p-values between .05 and .01 is less than 33%.
The powergraph suggests that statistically significant results are obtained only with the help of random sampling error, that reported effect sizes are inflated, and that the probability of false-positive results is high because in underpowered studies the ratio of true positives to false positives is low.
In the article, Trope and colleagues report four studies. Casual inspection would suggest that the authors conducted a rigorous program of research. They had relatively large samples (Ns = 239 to 410) and reported a priori power analyses that suggested they had 80% power to detect the predicted effects.
However, closer inspection with modern statistical methods that examine the robustness of results in multiple-study articles shows that the reported results cannot be interpreted at face value. To maintain statistical independence, I picked the first focal hypothesis test from each of the four studies.
| Study | N | Test statistic | p | z | Observed power |
|---|---|---|---|---|---|
| 1 | 239 | t(237) = 2.06 | .04 | 2.05 | .54 |
| 2 | 391 | t(389) = 2.33 | .02 | 2.33 | .64 |
| 3 | 410 | t(407) = 2.13 | .03 | 2.17 | .58 |
| 4 | 327 | t(325) = 2.59 | .01 | 2.58 | .73 |
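The z-scores and observed-power estimates in the table follow directly from the reported two-tailed p-values. As a minimal sketch (my own Python, not the authors' analysis code), they can be reproduced with scipy; the negligible probability of a significant result in the opposite direction is ignored.

```python
# Minimal sketch: reproduce the z-scores and observed-power estimates in the table
# from the reported two-tailed p-values.
from scipy.stats import norm

p_values = [0.04, 0.02, 0.03, 0.01]        # reported two-tailed p-values
z_crit = norm.ppf(1 - 0.05 / 2)            # two-tailed significance criterion, ~1.96

for study, p in enumerate(p_values, start=1):
    z = norm.ppf(1 - p / 2)                # two-tailed p-value converted to a z-score
    obs_power = 1 - norm.cdf(z_crit - z)   # P(significant result | true effect = observed z)
    print(f"Study {study}: z = {z:.2f}, observed power = {obs_power:.2f}")
```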
TIVA (Test of Insufficient Variance)
TIVA examines whether a set of statistical results is consistent with the expected amount of sampling error. When test statistics are converted into z-scores, sampling error should produce a variance of 1. However, the observed variance in the four z-scores is Var(z) = .05. Even with just four observations, a left-tailed chi-square test shows that this reduction in variance would rarely occur by chance, p = .02. This finding is consistent with the powergraph that shows reduced variance in z-scores, either because non-significant results that would be expected given the power of the studies are not reported or because significant results were obtained by violating sampling assumptions (e.g., undisclosed optional stopping).
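For readers who want to check these numbers, here is a minimal sketch of the TIVA computation (my own reimplementation in Python, not an official package):

```python
# Minimal sketch of the TIVA logic: under honest reporting, z-scores should vary
# with a sampling variance of 1; a left-tailed chi-square test asks how unlikely
# a variance this small would be by chance.
from statistics import variance
from scipy.stats import chi2, norm

z = [norm.ppf(1 - p / 2) for p in (0.04, 0.02, 0.03, 0.01)]  # z-scores from the table
k = len(z)
var_z = variance(z)                                           # sample variance, ~.05

# (k - 1) * s^2 / sigma^2 follows a chi-square distribution with k - 1 df when sigma^2 = 1
p_left = chi2.cdf((k - 1) * var_z, df=k - 1)
print(f"Var(z) = {var_z:.2f}, left-tailed p = {p_left:.2f}")  # ~ .05 and ~ .02
```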
R-Index
The table also shows that median observed power is only 61%, indicating that the a priori power analyses systematically overestimated power because they assumed effect sizes that were larger than the observed effect sizes. Moreover, the success rate in the four studies is 100%. When the success rate is higher than median observed power, actual power is even lower than observed power. To correct for this inflation in observed power, the R-Index subtracts the amount of inflation (100 – 61 = 39) from observed power. The R-Index is 61 – 39 = 22. Simulation studies show that an R-Index of 22 is obtained when the null-hypothesis is true (the predicted effect does not exist) and only significant results are being reported.
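A minimal sketch of the R-Index arithmetic (again my own Python, rounding to whole percentage points as in the text above):

```python
# Minimal sketch of the R-Index calculation:
# R-Index = median observed power - (success rate - median observed power).
from statistics import median

obs_power = [0.54, 0.64, 0.58, 0.73]   # observed power from the table
success_rate = 100                      # all four focal tests were significant (4 out of 4)

median_power = round(100 * median(obs_power))   # 61
inflation = success_rate - median_power          # 100 - 61 = 39
r_index = median_power - inflation               # 61 - 39 = 22
print(f"Median observed power = {median_power}%, inflation = {inflation}, R-Index = {r_index}")
```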
As it takes 20 studies on average to get 1 significant result by chance when the null-hypothesis is true, this model would imply that Trope and colleagues conducted another 4 * 20 – 4 = 76 studies with an average of about 342 participants each (a total of 25,973 participants) to obtain the significant results in their article. This is very unlikely. It is much more likely that Trope et al. used optional stopping to produce significant results.
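The file-drawer arithmetic behind this estimate is easy to verify (a quick sketch under the stated assumption that only 1 in 20 attempts succeeds by chance):

```python
# Quick check of the file-drawer arithmetic: with alpha = .05 and a true null,
# one in twenty attempts yields a significant result by chance.
alpha = 0.05
k_reported = 4
sample_sizes = [239, 391, 410, 327]

attempts_needed = k_reported / alpha               # 80 attempts to get 4 chance hits
unreported = attempts_needed - k_reported          # 76 unreported studies
mean_n = sum(sample_sizes) / len(sample_sizes)     # 341.75 participants per study
print(f"{unreported:.0f} unreported studies, ~{unreported * mean_n:,.0f} participants")
```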
Although the R-Index cannot reveal how the reported results were obtained, it strongly suggests that the reported results will not be replicable. That is, other researchers who conduct the same studies with the same sample sizes are unlikely to obtain significant results, although Trope and colleagues reported obtaining significant results 4 out of 4 times.
P-Curve
TIVA and the R-Index show that the reported results cannot be trusted at face value and that the reported effect sizes are inflated. However, these tests do not examine whether the data provide useful empirical evidence. P-Curve examines whether the data provide evidence against the null-hypothesis after taking into account that the reported results are biased. P-Curve shows that the results in this article do not contain evidential value (p = .69); that is, after correcting for bias, the results do not reject the null-hypothesis at the conventional p < .05 level.
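The p = .69 above comes from a p-curve analysis of the reported test statistics. As a rough illustration of the underlying logic only (a simplified Stouffer-style test on pp-values rather than the p-curve app itself, so the exact p-value differs), consider this sketch:

```python
# Simplified sketch of the p-curve logic (not the p-curve app): under the null,
# p-values that are significant at .05 are uniform on (0, .05), so pp = p / .05
# is uniform on (0, 1); evidential value shows up as pp-values piling up near zero.
# A Stouffer-style combination tests for that skew.
from scipy.stats import norm

p_values = [0.04, 0.02, 0.03, 0.01]                  # reported (rounded) p-values
pp = [p / 0.05 for p in p_values]                     # conditional pp-values under H0

z_stouffer = sum(norm.ppf(x) for x in pp) / len(pp) ** 0.5
p_evidential = norm.cdf(z_stouffer)                   # small p would indicate evidential value
print(f"Z = {z_stouffer:.2f}, p = {p_evidential:.2f}")  # here far from significance
```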
Conclusion
Statisticians have warned psychologists for decades that reporting only significant results that support theoretical predictions is not science (Sterling, 1959). However, generations of psychologists have been trained to conduct research by looking for and reporting significant results that they can explain. In the past five years, a growing number of psychologists have realized the damage this pseudo-scientific method does to the advancement of our understanding of human behavior.
It is unfortunate that many well-established researchers have been unable to change the way they conduct research and that the very same established researchers, in their roles as reviewers and editors, continue to let this type of research be published. It is even more unfortunate that these well-established researchers do not recognize the harm they are causing to younger researchers, who end up with publications that tarnish their reputation.
After five years of discussion about questionable research practices, ignorance is no longer an excuse for engaging in them. If optional stopping was used, it has to be declared in the description of the sampling strategy. An article in a top journal is no longer a sure ticket to an academic job if a statistical analysis reveals that the results are biased and do not contain evidential value.
Nobody benefits from empirical publications without evidential value. Why is it so hard to stop this nonsense?