Category Archives: science

REPLICABILITY RANKING OF 26 PSYCHOLOGY JOURNALS

THEORETICAL BACKGROUND

Neyman & Pearson (1933) developed the theory of type-I and type-II errors in statistical hypothesis testing.

A type-I error is defined as the probability of rejecting the null-hypothesis (i.e., the effect size is zero) when the null-hypothesis is true.

A type-II error is defined as the probability of failing to reject the null-hypothesis when the null-hypothesis is false (i.e., there is an effect).

A common application of statistics is to provide empirical evidence for a theoretically predicted relationship between two variables (cause-effect or covariation). The results of an empirical study can produce two outcomes. Either the result is statistically significant or it is not statistically significant. Statistically significant results are interpreted as support for a theoretically predicted effect.

Statistically non-significant results are difficult to interpret because the prediction may be false (the null-hypothesis is true) or a type-II error occurred (the theoretical prediction is correct, but the results fail to provide sufficient evidence for it).

To avoid type-II errors, researchers can design studies that reduce the type-II error probability. The probability of avoiding a type-II error when a predicted effect exists is called power. It could also be called the probability of success because a significant result can be used to provide empirical support for a hypothesis.

Ideally researchers would want to maximize power to avoid type-II errors. However, powerful studies require more resources. Thus, researchers face a trade-off between the allocation of resources and their probability to obtain a statistically significant result.

Jacob Cohen dedicated a large portion of his career to help researchers with the task of planning studies that can produce a successful result, if the theoretical prediction is true. He suggested that researchers should plan studies to have 80% power. With 80% power, the type-II error rate is still 20%, which means that 1 out of 5 studies in which a theoretical prediction is true would fail to produce a statistically significant result.

Cohen (1962) examined the typical effect sizes in psychology and found that the typical effect size for the mean difference between two groups (e.g., men and women or experimental vs. control group) is about half-of a standard deviation. The standardized effect size measure is called Cohen’s d in his honor. Based on his review of the literature, Cohen suggested that an effect size of d = .2 is small, d = .5 moderate, and d = .8. Importantly, a statistically small effect size can have huge practical importance. Thus, these labels should not be used to make claims about the practical importance of effects. The main purpose of these labels is that researchers can better plan their studies. If researchers expect a large effect (d = .8), they need a relatively small sample to have high power. If researchers expect a small effect (d = .2), they need a large sample to have high power.   Cohen (1992) provided information about effect sizes and sample sizes for different statistical tests (chi-square, correlation, ANOVA, etc.).

Cohen (1962) conducted a meta-analysis of studies published in a prominent psychology journal. Based on the typical effect size and sample size in these studies, Cohen estimated that the average power in studies is about 60%. Importantly, this also means that the typical power to detect small effects is less than 60%. Thus, many studies in psychology have low power and a high type-II error probability. As a result, one would expect that journals often report that studies failed to support theoretical predictions. However, the success rate in psychological journals is over 90% (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). There are two explanations for discrepancies between the reported success rate and the success probability (power) in psychology. One explanation is that researchers conduct multiple studies and only report successful studies. The other studies remain unreported in a proverbial file-drawer (Rosenthal, 1979). The other explanation is that researchers use questionable research practices to produce significant results in a study (John, Loewenstein, & Prelec, 2012). Both practices have undesirable consequences for the credibility and replicability of published results in psychological journals.

A simple solution to the problem would be to increase the statistical power of studies. If the power of psychological studies in psychology were over 90%, a success rate of 90% would be justified by the actual probability of obtaining significant results. However, meta-analysis and method articles have repeatedly pointed out that psychologists do not consider statistical power in the planning of their studies and that studies continue to be underpowered (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Giegerenzer, 1989).

One reason for the persistent neglect of power could be that researchers have no awareness of the typical power of their studies. This could happen because observed power in a single study is an imperfect indicator of true power (Yuan & Maxwell, 2005). If a study produced a significant result, the observed power is at least 50%, even if the true power is only 30%. Even if the null-hypothesis is true, and researchers publish only type-I errors, observed power is dramatically inflated to 62%, when the true power is only 5% (the type-I error rate). Thus, Cohen’s estimate of 60% power is not very reassuring.

Over the past years, Schimmack and Brunner have developed a method to estimate power for sets of studies with heterogeneous designs, sample sizes, and effect sizes. A technical report is in preparation. The basic logic of this approach is to convert results of all statistical tests into z-scores using the one-tailed p-value of a statistical test.  The z-scores provide a common metric for observed statistical results. The standard normal distribution predicts the distribution of observed z-scores for a fixed value of true power.   However, for heterogeneous sets of studies the distribution of z-scores is a mixture of standard normal distributions with different weights attached to various power values. To illustrate this method, the histograms of z-scores below show simulated data with 10,000 observations with varying levels of true power: 20% null-hypotheses being true (5% power), 20% of studies with 33% power, 20% of studies with 50% power, 20% of studies with 66% power, and 20% of studies with 80% power.

RepRankSimulation

The plot shows the distribution of absolute z-scores (there are no negative effect sizes). The plot is limited to z-scores below 6 (N = 99,985 out of 10,000). Z-scores above 6 standard deviations from zero are extremely unlikely to occur by chance. Even with a conservative estimate of effect size (lower bound of 95% confidence interval), observed power is well above 99%. Moreover, quantum physics uses Z = 5 as a criterion to claim success (e.g., discovery of Higgs-Boson Particle). Thus, Z-scores above 6 can be expected to be highly replicable effects.

Z-scores below 1.96 (the vertical dotted red line) are not significant for the standard criterion of (p < .05, two-tailed). These values are excluded from the calculation of power because these results are either not reported or not interpreted as evidence for an effect. It is still important to realize that true power of all experiments would be lower if these studies were included because many of the non-significant results are produced by studies with 33% power. These non-significant results create two problems. Researchers wasted resources on studies with inconclusive results and readers may be tempted to misinterpret these results as evidence that an effect does not exist (e.g., a drug does not have side effects) when an effect is actually present. In practice, it is difficult to estimate power for non-significant results because the size of the file-drawer is difficult to estimate.

It is possible to estimate power for any range of z-scores, but I prefer the range of z-scores from 2 (just significant) to 4. A z-score of 4 has a 95% confidence interval that ranges from 2 to 6. Thus, even if the observed effect size is inflated, there is still a high chance that a replication study would produce a significant result (Z > 2). Thus, all z-scores greater than 4 can be treated as cases with 100% power. The plot also shows that conclusions are unlikely to change by using a wider range of z-scores because most of the significant results correspond to z-scores between 2 and 4 (89%).

The typical power of studies is estimated based on the distribution of z-scores between 2 and 4. A steep decrease from left to right suggests low power. A steep increase suggests high power. If the peak (mode) of the distribution were centered over Z = 2.8, the data would conform to Cohen’s recommendation to have 80% power.

Using the known distribution of power to estimate power in the critical range gives a power estimate of 61%. A simpler model that assumes a fixed power value for all studies produces a slightly inflated estimate of 63%. Although the heterogeneous model is correct, the plot shows that the homogeneous model provides a reasonable approximation when estimates are limited to a narrow range of Z-scores. Thus, I used the homogeneous model to estimate the typical power of significant results reported in psychological journals.

DATA

The results presented below are based on an ongoing project that examines power in psychological journals (see results section for the list of journals included so far). The set of journals does not include journals that primarily publish reviews and meta-analysis or clinical and applied journals. The data analysis is limited to the years from 2009 to 2015 to provide information about the typical power in contemporary research. Results regarding historic trends will be reported in a forthcoming article.

I downloaded pdf files of all articles published in the selected journals and converted the pdf files to text files. I then extracted all t-tests and F-tests that were reported in the text of the results section searching for t(df) or F(df1,df2). All t and F statistics were converted into one-tailed p-values and then converted into z-scores.

RepRankAll

The plot above shows the results based on 218,698 t and F tests reported between 2009 and 2015 in the selected psychology journals. Unlike the simulated data, the plot shows a steep drop for z-scores just below the threshold of significance (z = 1.96). This drop is due to the tendency not to publish or report non-significant results. The heterogeneous model uses the distribution of non-significant results to estimate the size of the file-drawer (unpublished non-significant results). However, for the present purpose the size of the file-drawer is irrelevant because power is estimated only for significant results for Z-scores between 2 and 4.

The green line shows the best fitting estimate for the homogeneous model. The red curve shows fit of the heterogeneous model. The heterogeneous model is doing a much better job at fitting the long tail of highly significant results, but for the critical interval of z-scores between 2 and 4, the two models provide similar estimates of power (55% homogeneous & 53% heterogeneous model).   If the range is extended to z-scores between 2 and 6, power estimates diverge (82% homogenous, 61% heterogeneous). The plot indicates that the heterogeneous model fits the data better and that the 61% estimate is a better estimate of true power for significant results in this range. Thus, the results are in line with Cohen (1962) estimate that psychological studies average 60% power.

REPLICABILITY RANKING

The distribution of z-scores between 2 and 4 was used to estimate the average power separately for each journal. As power is the probability to obtain a significant result, this measure estimates the replicability of results published in a particular journal if researchers would reproduce the studies under identical conditions with the same sample size (exact replication). Thus, even though the selection criterion ensured that all tests produced a significant result (100% success rate), the replication rate is expected to be only about 50%, even if the replication studies successfully reproduce the conditions of the published studies. The table below shows the replicability ranking of the journals, the replicability score, and a grade. Journals are graded based on a scheme that is similar to grading schemes for undergraduate students (below 50 = F, 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B, 90+ = A).

ReplicabilityRanking

The average value in 2000-2014 is 57 (D+). The average value in 2015 is 58 (D+). The correlation for the values in 2010-2014 and those in 2015 is r = .66.   These findings show that the replicability scores are reliable and that journals differ systematically in the power of published studies.

LIMITATIONS

The main limitation of the method is that focuses on t and F-tests. The results might change when other statistics are included in the analysis. The next goal is to incorporate correlations and regression coefficients.

The second limitation is that the analysis does not discriminate between primary hypothesis tests and secondary analyses. For example, an article may find a significant main effect for gender, but the critical test is whether gender interacts with an experimental manipulation. It is possible that some journals have lower scores because they report more secondary analyses with lower power. To address this issue, it will be necessary to code articles in terms of the importance of statistical test.

The ranking for 2015 is based on the currently available data and may change when more data become available. Readers should also avoid interpreting small differences in replicability scores as these scores are likely to fluctuate. However, the strong correlation over time suggests that there are meaningful differences in the replicability and credibility of published results across journals.

CONCLUSION

This article provides objective information about the replicability of published findings in psychology journals. None of the journals reaches Cohen’s recommended level of 80% replicability. Average replicability is just about 50%. This finding is largely consistent with Cohen’s analysis of power over 50 years ago. The publication of the first replicability analysis by journal should provide an incentive to editors to increase the reputation of their journal by paying more attention to the quality of the published data. In this regard, it is noteworthy that replicability scores diverge from traditional indicators of journal prestige such as impact factors. Ideally, the impact of an empirical article should be aligned with the replicability of the empirical results. Thus, the replicability index may also help researchers to base their own research on credible results that are published in journals with a high replicability score and to avoid incredible results that are published in journals with a low replicability score. Ultimately, I can only hope that journals will start competing with each other for a top spot in the replicability rankings and as a by-product increase the replicability of published findings and the credibility of psychological science.

Advertisements

R-Index predicts lower replicability of “subliminal” studies than “attribution” studies in JESP

PHP_subliminal_vs_attribution

This post compares articles in the Journal of Experimental Social Psychology that contained the keyword “subliminal” to articles that contained the word “attribution”.

PHP-curves based on t-tests and F-tests in these articles are compared.  Both sets of articles show signs of publication bias (fewer non-significant studies are reported than predicted based on post-hoc power).

The shape of the histogram shows clear evidence of heterogeneity (the red curve fits the data better than the green curve).

The estimated power of studies with z-scores between 2 and 4 for subliminal articles is 31%.

The estimated power of studies with z-scores between 2 and 3 for attribution articles is 42%.

The R-Index for subliminal articles is 39%, whereas the R-Index for attribution articles is 49%.

The values for subliminal articles are also lower than the values for the whole set of articles in JESP.

In conclusion, these results suggest that subliminal priming studies are less replicable than other findings in social psychology and should be the target of high-powered replication studies.  These replication studies need to take into account that reported effect sizes are inflated to achieve high power.

Post-Hoc-Power Curves of Social Psychology in Psychological Science, JESP, and Social Cognition

Post-hoc-power curves are used to evaluate the replicability of published results.  At present, PHP curves are based on t-tests and F-tests that are automatically extracted from text files of journal articles.  All test results are converted into z-scores.  PHP-curves are fitted to the density of the histogram of z-scores.

It is well known that non-significant results are less likely to be published and end up in the proverbial file-drawer.  To overcome this problem, PHP curves are fitted to the data by excluding non-significant results from the estimation of typical power (Simonsohn et al., 2013, 2014).

Another problem in the estimation of typical power is that power varies across tests. Heterogeneity of power leads to more variation in observed z-scores than a homogeneous model would assume (see comparison of variances in the figures below).  PHP-curves address this problem by fitting a model with multiple true power values to the observed data. Fit for the non-significant results is not expected to be good due to the file drawer problem. In fact, the gap between actual and predicted data can be considered a rough estimate of the size of the file-drawer.

For heterogeneous data, power depends on the set of results that is being analyzed. The reason is that low z-scores are more likely to be obtained in studies with low power, whereas high z-scores are more likely to be the result of high powered studies. The figures below estimated power for z-scores in the range from 2 to 6.  The mode of the red heterogenous curve shows that power for all tests would be considerably lower.  However, non-significant results are typically not interpreted or even excluded from published studies. Thus, replicability is better indexed by the typical power of significant results.

The power estimates for all JESP articles and for social psychology articles in Psychological Science are very similar (47%, and 45%). Power for Social Cognition in the years from 2010 to present is estimated to be higher (60%).  Older issues could not be analyzed because text recognition did not work.  In comparison, the estimated power for Memory related articles in Psychological Science is higher (66%).

The average can be a misleading statistic for skewed distributions. The figures show that the majority of significant results are closer to the lower limit (z = 2) than to the upper limit (z = 6) of the test interval. Thus, the median power is lower than the average power of 45-60%.

It is important to realize that post-hoc power is meaningful when it is based on a large set of studies.  A z-score of 4 is more likely to be based on a highly powered study than a z-score of 2, but a single z-score of 2 could be based on a high-powered study or it could be a type-I error. The purpose of PHP-curves is to evaluate journals, areas of research, and other meaningful sets of studies.  Hopefully, recent attempts to increase the replicability of social psychology will increase power.  PHP-curves, the R-Index, and estimates of typical power can be used to document improvements in future years.

PHPsocialpsychology

When Exact Replications Are Too Exact: The Lucky-Bounce-Test for Pairs of Exact Replication Studies

Imagine an NBA player has an 80% chance to make one free throw. What is the chance that he makes both free throws? The correct answer is 64% (80% * 80%).

Now consider the possibility that it is possible to distinguish between two types of free throws. Some free throws are good; they don’t touch the rim and make a swishing sound when they go through the net (all net). The other free throws bounce of the rim and go in (rattling in).

What is the probability that an NBA player with an 80% free throw percentage makes a free throw that is all net or rattles in? It is more likely that an NBA player with an 80% free throw average makes a perfect free throw because a free throw that rattles in could easily have bounded the wrong way, which would lower the free throw percentage. To achieve an 80% free throw percentage, most free throws have to be close to perfect.

Let’s say the probability of hitting the rim and going in is 30%. With an 80% free throw average, this means that the majority of free throws are in the close-to-perfect category (20% misses, 30% rattle-in, 50% close-to-perfect).

What does this have to do with science? A lot!

The reason is that the outcome of a scientific study is a bit like throwing free throws. One factor that contributes to a successful study is skill (making correct predictions, avoiding experimenter errors, and conducting studies with high statistical power). However, another factor is random (a lucky or unlucky bounce).

The concept of statistical power is similar to an NBA players’ free throw percentage. A researcher who conducts studies with 80% statistical power is going to have an 80% success rate (that is, if all predictions are correct). In the remaining 20% of studies, a study will not produce a statistically significant result, which is equivalent to missing a free throw and not getting a point.

Many years ago, Jacob Cohen observed that researchers often conduct studies with relatively low power to produce a statistically significant result. Let’s just assume right now that a researcher conducts studies with 60% power. This means, researchers would be like NBA players with a 60% free-throw average.

Now imagine that researchers have to demonstrate an effect not only once, but also a second time in an exact replication study. That is researchers have to make two free throws in a row. With 60% power, the probability to get two significant results in a row is only 36% (60% * 60%). Moreover, many of the freethrows that are made rattle in rather than being all net. The percentages are about 40% misses, 30% rattling in and 30% all net.

One major difference between NBA players and scientists is that NBA players have to demonstrate their abilities in front of large crowds and TV cameras, whereas scientists conduct their studies in private.

Imagine an NBA player could just go into a private room, throw two free throws and then report back how many free throws he made and the outcome of these free throws determine who wins game 7 in the playoff finals. Would you trust the player to tell the truth?

If you would not trust the NBA player, why would you trust scientists to report failed studies? You should not.

It can be demonstrated statistically that scientists are reporting more successes than the power of their studies would justify (Sterling et al., 1995; Schimmack, 2012). Amongst scientists this fact is well known, but the general public may not fully appreciate the fact that a pair of exact replication studies with significant results is often just a selection of studies that included failed studies that were not reported.

Fortunately, it is possible to use statistics to examine whether the results of a pair of studies are likely to be honest or whether failed studies were excluded. The reason is that an amateur is not only more likely to miss a free throw. An amateur is also less likely to make a perfect free throw.

Based on the theory of statistical power developed by Nyman and Pearson and popularized by Jacob Cohen, it is possible to make predictions about the relative frequency of p-values in the non-significant (failure), just significant (rattling in), and highly significant (all net) ranges.

As for made-free-throws, the distinction between lucky and clear successes is somewhat arbitrary because power is continuous. A study with a p-value of .0499 is very lucky because p = .501 would have been not significant (rattled in after three bounces on the rim). A study with p = .000001 is a clear success. Lower p-values are better, but where to draw the line?

As it turns out, Jacob Cohen’s recommendation to conduct studies with 80% power provides a useful criterion to distinguish lucky outcomes and clear successes.

Imagine a scientist conducts studies with 80% power. The distribution of observed test-statistics (e.g. z-scores) shows that this researcher has a 20% chance to get a non-significant result, a 30% chance to get a lucky significant result (p-value between .050 and .005), and a 50% chance to get a clear significant result (p < .005). If the 20% failed studies are hidden, the percentage of results that rattled in versus studies with all-net results are 37 vs. 63%. However, if true power is just 20% (an amateur), 80% of studies fail, 15% rattle in, and 5% are clear successes. If the 80% failed studies are hidden, only 25% of the successful studies are all-net and 75% rattle in.

One problem with using this test to draw conclusions about the outcome of a pair of exact replication studies is that true power is unknown. To avoid this problem, it is possible to compute the maximum probability of a rattling-in result. As it turns out, the optimal true power to maximize the percentage of lucky outcomes is 66% power. With true power of 66%, one would expect 34% misses (p > .05), 32% lucky successes (.050 < p < .005), and 34% clear successes (p < .005).

LuckyBounceTest

For a pair of exact replication studies, this means that there is only a 10% chance (32% * 32%) to get two rattle-in successes in a row. In contrast, there is a 90% chance that misses were not reported or that an honest report of successful studies would have produced at least one all-net result (z > 2.8, p < .005).

Example: Unconscious Priming Influences Behavior

I used this test to examine a famous and controversial set of exact replication studies. In Bargh, Chen, and Burrows (1996), Dr. Bargh reported two exact replication studies (studies 2a and 2b) that showed an effect of a subtle priming manipulation on behavior. Undergraduate students were primed with words that are stereotypically associated with old age. The researchers then measured the walking speed of primed participants (n = 15) and participants in a control group (n = 15).

The two studies were not only exact replications of each other; they also produced very similar results. Most readers probably expected this outcome because similar studies should produce similar results, but this false belief ignores the influence of random factors that are not under the control of a researcher. We do not expect lotto winners to win the lottery again because it is an entirely random and unlikely event. Experiments are different because there could be a systematic effect that makes a replication more likely, but in studies with low power results should not replicate exactly because random sampling error influences results.

Study 1: t(28) = 2.86, p = .008 (two-tailed), z = 2.66, observed power = 76%
Study 2: t(28) = 2.16, p = .039 (two-tailed), z = 2.06, observed power = 54%

The median power of these two studies is 65%. However, even if median power were lower or higher, the maximum probability of obtaining two p-values in the range between .050 and .005 remains just 10%.

Although this study has been cited over 1,000 times, replication studies are rare.

One of the few published replication studies was reported by Cesario, Plaks, and Higgins (2006). Naïve readers might take the significant results in this replication study as evidence that the effect is real. However, this study produced yet another lucky success.

Study 3: t(62) = 2.41, p = .019, z = 2.35, observed power = 65%.

The chances of obtaining three lucky successes in a row is only 3% (32% *32% * 32*). Moreover, with a median power of 65% and a reported success rate of 100%, the success rate is inflated by 35%. This suggests that the true power of the reported studies is considerably lower than the observed power of 65% and that observed power is inflated because failed studies were not reported.

The R-Index corrects for inflation by subtracting the inflation rate from observed power (65% – 35%). This means the R-Index for this set of published studies is 30%.

This R-Index can be compared to several benchmarks.

An R-Index of 22% is consistent with the null-hypothesis being true and failed attempts are not reported.

An R-Index of 40% is consistent with 30% true power and all failed attempts are not reported.

It is therefore not surprising that other researchers were not able to replicate Bargh’s original results, even though they increased statistical power by using larger samples (Pashler et al. 2011, Doyen et al., 2011).

In conclusion, it is unlikely that Dr. Bargh’s original results were the only studies that they conducted. In an interview, Dr. Bargh revealed that the studies were conducted in 1990 and 1991 and that they conducted additional studies until the publication of the two studies in 1996. Dr. Bargh did not reveal how many studies they conducted over the span of 5 years and how many of these studies failed to produce significant evidence of priming. If Dr. Bargh himself conducted studies that failed, it would not be surprising that others also failed to replicate the published results. However, in a personal email, Dr. Bargh assured me that “we did not as skeptics might presume run many studies and only reported the significant ones. We ran it once, and then ran it again (exact replication) in order to make sure it was a real effect.” With a 10% probability, it is possible that Dr. Bargh was indeed lucky to get two rattling-in findings in a row. However, his aim to demonstrate the robustness of an effect by trying to show it again in a second small study is misguided. The reason is that it is highly likely that the effect will not replicate or that the first study was already a lucky finding after some failed pilot studies. Underpowered studies cannot provide strong evidence for the presence of an effect and conducting multiple underpowered studies reduces the credibility of successes because the probability of this outcome to occur even when an effect is present decreases with each study (Schimmack, 2012). Moreover, even if Bargh was lucky to get two rattling-in results in a row, others will not be so lucky and it is likely that many other researchers tried to replicate this sensational finding, but failed to do so. Thus, publishing lucky results hurts science nearly as much as the failure to report failed studies by the original author.

Dr. Bargh also failed to realize how lucky he was to obtain his results, in his response to a published failed-replication study by Doyen. Rather than acknowledging that failures of replication are to be expected, Dr. Bargh criticized the replication study on methodological grounds. There would be a simple solution to test Dr. Bargh’s hypothesis that he is a better researcher and that his results are replicable when the study is properly conducted. He should demonstrate that he can replicate the result himself.

In an interview, Tom Bartlett asked Dr. Bargh why he didn’t conduct another replication study to demonstrate that the effect is real. Dr. Bargh’s response was that “he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.” The problem for Dr. Bargh is that there is no reason to believe his original results, either. Two rattling-in results alone do not constitute evidence for an effect, especially when this result could not be replicated in an independent study. NBA players have to make free-throws in front of a large audience for a free-throw to count. If Dr. Bargh wants his findings to count, he should demonstrate his famous effect in an open replication study. To avoid embarrassment, it would be necessary to increase the power of the replication study because it is highly unlikely that even Dr. Bargh can continuously produce significant results with samples of N = 30 participants. Even if the effect is real, sampling error is simply too large to demonstrate the effect consistently. Knowledge about statistical power is power. Knowledge about post-hoc power can be used to detect incredible results. Knowledge about a priori power can be used to produce credible results.

Swish!

R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

R-Index Bulletin

The world of scientific publishing is changing rapidly and there is a growing need to share scientific information as fast and as cheap as possible.

Traditional journals with pre-publication peer-review are too slow and focussed on major ground-breaking discoveries.

Open-access journals can be expensive.

R-Index Bulletin offers a new opportunity to share results with the scientific community quickly and free of charge.

R-Index Bulletin also avoids the problems of pre-publication peer-review by moving to a post-publication peer-review process. Readers are welcome to comment on posted contributions and to post their own analyses. This process ensures that scientific disputes and their resolution are open and part of the scientific process.

For the time being, submissions can be uploaded as comments to this blog. In the future, R-Index Bulletin may develop into a free online journal.

A submission should contain a brief description of the research question (e.g., what is the R-Index of studies on X, by X, or in the journal X?), the main statistical results (median observed power, success rate, inflation rate, R-Index) and a brief discussion of the implications of the analysis. There is no page restriction and analyses of larger data sets can include moderator analysis. Inclusion of other bias tests (Egger’s regression, TIVA, P-Curve, P-Uniform) is also welcome.

If you have conducted an R-Index analysis, please submit it to R-Index Bulletin to share your findings.

Submissions can be made anonymously or with an author’s name.

Go ahead and press the “Leave a comment” or “Leave a reply” button or scroll to the bottom of the page and past your results in the “Leave a reply” box.

Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers (2015) published the results of a preregistered adversarial collaboration. This article has been considered a model of conflict resolution among scientists.

The study examined the effect of eye-movements on memory. Drs. Nieuwenhuis and Slagter assume that horizontal eye-movements improve memory. Drs. Matzke, van Rijn, and Wagenmakers did not believe that horizontal-eye movements improve memory. That is, they assumed the null-hypothesis to be true. Van der Molen acted as a referee to resolve conflict about procedural questions (e.g., should some participants be excluded from analysis?).

The study was a between-subject design with three conditions: horizontal eye movements, vertical eye movements, and no eye movement.

The researchers collected data from 81 participants and agreed to exclude 2 participants, leaving 79 participants for analysis. As a result there were 27 or 26 participants per condition.

The hypothesis that horizontal eye-movements improve performance can be tested in several ways.

An overall F-test can compare the means of the three groups against the hypothesis that they are all equal. This test has low power because nobody predicted differences between vertical eye-movements and no eye-movements.

A second alternative is to compare the horizontal condition against the combined alternative groups. This can be done with a simple t-test. Given the directed hypothesis, a one-tailed test can be used.

Power analysis with the free software program GPower shows that this design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.

Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.

What does an effect size of d = .8 mean? It means that memory performance is boosted by .8 standard deviations. For example, if students take a multiple-choice exam with an average of 66% correct answers and a standard deviation of 15%, they could boost their performance by 12% points (15 * 0.8 = 12) from an average of 66% (C) to 78% (B+) by moving their eyes horizontally while thinking about a question.

The article makes no mention of power-analysis and the implicit assumption that the effect size has to be large to avoid biasing the experiment in favor of the critiques.

Instead the authors used Bayesian statistics; a type of statistics that most empirical psychologists understand even less than standard statistics. Bayesian statistics somehow magically appears to be able to draw inferences from small samples. The problem is that Bayesian statistics requires researchers to specify a clear alternative to the null-hypothesis. If the alternative is d = .8, small samples can be sufficient to decide whether an observed effect size is more consistent with d = 0 or d = .8. However, with more realistic assumptions about effect sizes, small samples are unable to reveal whether an observed effect size is more consistent with the null-hypothesis or a small to moderate effect.

Actual Results

So what where the actual results?

Condition                                          Mean     SD         

Horizontal Eye-Movements          10.88     4.32

Vertical Eye-Movements               12.96     5.89

No Eye Movements                       15.29     6.38      

The results provide no evidence for a benefit of horizontal eye movements. In a comparison of the two a priori theories (d = 0 vs. d > 0), the Bayes-Factor strongly favored the null-hypothesis. However, this does not mean that Bayesian statistics has magical powers. The reason was that the empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81).   A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.

Conclusion

In conclusion, a small study surprisingly showed a mean difference in the opposite prediction than previous studies had shown. This finding is noteworthy and shows that the effects of eye-movements on memory retrieval are poorly understood. As such, the results of this study are simply one more example of the replicability crisis in psychology.

However, it is unfortunate that this study is published as a model of conflict resolution, especially as the empirical results failed to resolve the conflict. A key aspect of a decisive study is to plan a study with adequate power to detect an effect.   As such, it is essential that proponents of a theory clearly specify the effect size of their predicted effect and that the decisive experiment matches type-I and type-II error. With the common 5% Type-I error this means that a decisive experiment must have 95% power (1 – type II error). Bayesian statistics does not provide a magical solution to the problem of too much sampling error in small samples.

Bayesian statisticians may ignore power analysis because it was developed in the context of null-hypothesis testing. However, Bayesian inferences are also influenced by sample size and studies with small samples will often produce inconclusive results. Thus, it is more important that psychologists change the way they collect data than to change the way they analyze their data. It is time to allocate more resources to fewer studies with less sampling error than to waste resources on many studies with large sampling error; or as Cohen said: Less is more.