+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication (Cohen, 1994).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY: In empirical studies with random error variance **replicability** refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Latest Blog Posts:

(August, 2, 2017)

What would Cohen say: A comment on p < .005 as the new criterion for significance

(April, 7, 2017)

Hidden Figures: Replication failures in the stereotype threat literature

(March 1, 2017)

2016 Replicability Rankings of 103 Psychology Journals

(February, 2, 2017)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

**REPLICABILITY REPORTS**: Examining the replicability of research topics

RR No1. (April 19, 2016) Is ego-depletion a replicable effect?

RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# TOP TEN LIST

1. 2016 Replicability Rankings of 103 Psychology Journals

Rankings of 103 Psychology Journals according to the average replicability of a pulished significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2016 and a comparison of psychological disciplines (cognitive, clinical, social, developmental, personality, biological, applied).

2. Z-Curve: Estimating replicability for sets of studies with heterogeneous power (e.g., Journals, Departments, Labs)

This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal. The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores. The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.

3. Replicability-Rankings of Psychology Departments

This blog presents rankings of psychology departments on the basis of the replicability of significant results published in 105 psychology journals (see the journal rankings for a list of journals). Reported success rates in psychology journals are over 90%, but this percentage is inflated by selective reporting of significant results. After correcting for selection bias, replicability is 60%, but there is reliable variation across departments.

4. An Introduction to the R-Index

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

5. The Test of Insufficient Variance (TIVA)

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

6. Validation of Meta-Analysis of Observed (post-hoc) Power

This post examines the ability of various estimation methods to estimate power of a set of studies based on the reported test statistics in these studies. The results show that most estimation methods work well when all studies have the same effect size (homogeneous) or if effect sizes are heterogeneous and symmetrically distributed (heterogeneous). However, most methods fail when effect sizes are heterogeneous and have a skewed distribution. The post does not yet include the more recent method that uses the distribution of z-scores (powergraphs) to estimate observe power because this method was developed after this blog was posted.

7. Roy Baumeister’s R-Index

Roy Baumeister was a reviewer of my 2012 article that introduced the Incredibiliy Index to detect publication bias and dishonest reporting practices. In his review and in a subsequent email exchange, Roy Baumeister admitted that his published article excluded studies that failed to produce results in support of his theory that blood-glucose is important for self-regulation (a theory that is now generally considered to be false), although he disagrees that excluding these studies was dishonest. The R-Index builds on the incredibility index and provides an index of the strength of evidence that corrects for the influence of dishonest reporting practices. This post reports the R-Index for Roy Baumeister’s most cited articles. The R-Index is low and does not justify the nearly perfect support for empirical predictions in these articles. At the same time, the R-Index is similar to R-Indices for other sets of studies in social psychology. This suggests that dishonest reporting practices are the norm in social psychology and that published articles exaggerate the strength of evidence in support of social psychological theories.

8. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

9. The R-Index for 18 Multiple-Study Psychology Articles in the Journal SCIENCE.

Francis (2014) demonstrated that nearly all multiple-study articles by psychology researchers that were published in the prestigious journal SCIENCE showed evidence of dishonest reporting practices (disconfirmatory evidence was missing). Francis (2014) used a method similar to the incredibility index. One problem of this method is that the result is a probability that is influenced by the amount of bias and the number of results that were available for analysis. As a result, an article with 9 studies and moderate bias is treated the same as an article with 4 studies and a lot of bias. The R-Index avoids this problem by focusing on the amount of bias (inflation) and the strength of evidence. This blog post shows the R-Index of the 18 studies and reveals that many articles have a low R-Index.

10. The Problem with Bayesian Null-Hypothesis Testing

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect). They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist. This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1). As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2). A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.