Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers (2015) published the results of a preregistered adversarial collaboration. This article has been considered a model of conflict resolution among scientists.
The study examined the effect of eye-movements on memory. Drs. Nieuwenhuis and Slagter assumed that horizontal eye-movements improve memory. Drs. Matzke, van Rijn, and Wagenmakers did not believe that horizontal eye-movements improve memory. That is, they assumed the null-hypothesis to be true. Van der Molen acted as a referee to resolve conflict about procedural questions (e.g., should some participants be excluded from analysis?).
The study was a between-subject design with three conditions: horizontal eye movements, vertical eye movements, and no eye movement.
The researchers collected data from 81 participants and agreed to exclude 2 participants, leaving 79 participants for analysis. As a result there were 27 or 26 participants per condition.
The hypothesis that horizontal eye-movements improve performance can be tested in several ways.
An overall F-test can compare the means of the three groups against the hypothesis that they are all equal. This test has low power because nobody predicted differences between vertical eye-movements and no eye-movements.
A second alternative is to compare the horizontal condition against the combined alternative groups. This can be done with a simple t-test. Given the directed hypothesis, a one-tailed test can be used.
Power analysis with the free software program GPower shows that this design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.
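These power values can be roughly reproduced without GPower. The sketch below is a normal approximation to the one-tailed two-sample t-test, not GPower's exact noncentral-t calculation, so the results can differ by about a percentage point. The group sizes are an assumption: 27 participants in the horizontal condition against 52 in the two comparison conditions combined (the text only says 27 or 26 per condition). The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_power(d, n1, n2, alpha=0.05):
    """Approximate power of a one-tailed two-sample t-test.

    Uses the normal approximation: the test statistic is treated as
    normal with mean d * sqrt(n1 * n2 / (n1 + n2)) and unit variance.
    """
    z = NormalDist()
    noncentrality = d * sqrt(n1 * n2 / (n1 + n2))
    z_crit = z.inv_cdf(1 - alpha)  # one-tailed critical value
    return 1 - z.cdf(z_crit - noncentrality)


if __name__ == "__main__":
    # Assumed split: 27 (horizontal) vs. 52 (vertical + no eye-movements).
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: power ~ {one_tailed_power(d, 27, 52):.2f}")
```

Under these assumptions the approximation gives about 21%, 68%, and 96% power, close to the values reported above.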
Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.
What does an effect size of d = .8 mean? It means that memory performance is boosted by .8 standard deviations. For example, if students take a multiple-choice exam with an average of 66% correct answers and a standard deviation of 15 percentage points, they could boost their performance by 12 percentage points (15 * 0.8 = 12) from an average of 66% (C) to 78% (B+) by moving their eyes horizontally while thinking about a question.
The article mentions neither power analysis nor the implicit assumption that the effect size has to be large to avoid biasing the experiment in favor of the critics.
Instead the authors used Bayesian statistics, a type of statistics that most empirical psychologists understand even less than standard statistics. Bayesian statistics somehow magically appears to be able to draw inferences from small samples. The problem is that Bayesian statistics requires researchers to specify a clear alternative to the null-hypothesis. If the alternative is d = .8, small samples can be sufficient to decide whether an observed effect size is more consistent with d = 0 or d = .8. However, with more realistic assumptions about effect sizes, small samples are unable to reveal whether an observed effect size is more consistent with the null-hypothesis or a small to moderate effect.
So what were the actual results?
Condition                   Mean     SD
Horizontal Eye-Movements    10.88    4.32
Vertical Eye-Movements      12.96    5.89
No Eye-Movements            15.29    6.38
The results provide no evidence for a benefit of horizontal eye movements. In a comparison of the two a priori theories (d = 0 vs. d > 0), the Bayes-Factor strongly favored the null-hypothesis. However, this does not mean that Bayesian statistics has magical powers. The reason is that the empirical data actually showed a strong effect in the opposite direction: participants in the no-eye-movement condition performed better than participants in the horizontal-eye-movement condition (d = -.81). A Bayes-Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.
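The effect size of d = -.81 follows directly from the reported means and standard deviations. The sketch below assumes equal weighting of the two group variances when pooling, which is reasonable here because the groups have 27 or 26 participants each; the function name is illustrative.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d with an equal-weight pooled standard deviation.

    Assumes (near-)equal group sizes, so the two variances are
    simply averaged before taking the square root.
    """
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Horizontal eye-movements vs. no eye-movements, values from the table.
d = cohens_d(10.88, 4.32, 15.29, 6.38)
print(round(d, 2))  # -0.81
```

The negative sign indicates that the horizontal condition scored lower than the no-eye-movement condition, the opposite of the predicted direction.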
In conclusion, a small study surprisingly showed a mean difference in the direction opposite to what previous studies had shown. This finding is noteworthy and shows that the effects of eye-movements on memory retrieval are poorly understood. As such, the results of this study are simply one more example of the replicability crisis in psychology.
However, it is unfortunate that this study is published as a model of conflict resolution, especially as the empirical results failed to resolve the conflict. A key aspect of a decisive study is to plan a study with adequate power to detect an effect. As such, it is essential that proponents of a theory clearly specify the effect size of their predicted effect and that the decisive experiment matches type-I and type-II error. With the common 5% Type-I error this means that a decisive experiment must have 95% power (1 – type II error). Bayesian statistics does not provide a magical solution to the problem of too much sampling error in small samples.
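To give a sense of what 95% power demands, the sketch below estimates the sample size per group for a one-tailed comparison of two equally sized groups. It is an illustration under stated assumptions, not the design of the original study: it uses the normal approximation n = 2 * ((z_alpha + z_beta) / d)^2, which slightly underestimates the exact t-test requirement, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.95):
    """Approximate n per group for a one-tailed two-sample comparison.

    Normal approximation with two equally sized groups:
    n = 2 * ((z_alpha + z_beta) / d) ** 2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # one-tailed type-I error
    z_beta = z.inv_cdf(power)       # power = 1 - type-II error
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions, a large effect (d = .8) needs about 34 participants per group, a moderate effect (d = .5) about 87, and a small effect (d = .2) about 542.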
Bayesian statisticians may ignore power analysis because it was developed in the context of null-hypothesis testing. However, Bayesian inferences are also influenced by sample size, and studies with small samples will often produce inconclusive results. Thus, it is more important that psychologists change the way they collect data than that they change the way they analyze their data. It is time to allocate more resources to fewer studies with less sampling error rather than to waste resources on many studies with large sampling error; or as Cohen said: Less is more.