
2016 Replicability Rankings of 103 Psychology Journals

The rankings are posted at the top.  Detailed information and the statistical analyses are provided below the table.  You can click on a journal title to see Powergraphs for each year.

Rank   Journal Change 2016 2015 2014 2013 2012 2011 2010 Mean
1 Social Indicators Research 10 90 70 65 75 65 72 73 73
2 Psychology of Music -13 81 59 67 61 69 85 84 72
3 Journal of Memory and Language 11 79 76 65 71 64 71 66 70
4 British Journal of Developmental Psychology -9 77 52 61 54 82 74 69 67
5 Journal of Occupational and Organizational Psychology 13 77 59 69 58 61 65 56 64
6 Journal of Comparative Psychology 13 76 71 77 74 68 61 66 70
7 Cognitive Psychology 7 75 73 72 69 66 74 66 71
8 Epilepsy & Behavior 5 75 72 79 70 68 76 69 73
9 Evolution & Human Behavior 16 75 57 73 55 38 57 62 60
10 International Journal of Intercultural Relations 0 75 43 70 75 62 67 62 65
11 Pain 5 75 70 75 67 64 65 74 70
12 Psychological Medicine 4 75 57 66 70 58 72 61 66
13 Annals of Behavioral Medicine 10 74 50 63 62 62 62 51 61
14 Developmental Psychology 17 74 72 73 67 61 63 58 67
15 Judgment and Decision Making -3 74 59 68 56 72 66 73 67
16 Psychology and Aging 6 74 66 78 65 74 66 66 70
17 Aggressive Behavior 16 73 70 66 49 60 67 52 62
18 Journal of Gerontology-Series B 3 73 60 65 65 55 79 59 65
19 Journal of Youth and Adolescence 13 73 66 82 67 61 57 66 67
20 Memory 5 73 56 79 70 65 64 64 67
21 Sex Roles 6 73 67 59 64 72 68 58 66
22 Journal of Experimental Psychology – Learning, Memory & Cognition 4 72 74 76 71 71 67 72 72
23 Journal of Social and Personal Relationships -6 72 51 57 55 60 60 75 61
24 Psychonomic Bulletin & Review 8 72 79 62 78 66 62 69 70
25 European Journal of Social Psychology 5 71 61 63 58 50 62 67 62
26 Journal of Applied Social Psychology 4 71 58 69 59 73 67 58 65
27 Journal of Experimental Psychology – Human Perception and Performance -4 71 68 72 69 70 78 72 71
28 Journal of Research in Personality 9 71 75 47 65 51 63 63 62
29 Journal of Child and Family Studies 0 70 60 63 60 56 64 69 63
30 Journal of Cognition and Development 5 70 53 62 54 50 61 61 59
31 Journal of Happiness Studies -9 70 64 66 77 60 74 80 70
32 Political Psychology 4 70 55 64 66 71 35 75 62
33 Cognition 2 69 68 70 71 67 68 67 69
34 Depression & Anxiety -6 69 57 66 71 77 77 61 68
35 European Journal of Personality 2 69 61 75 65 57 54 77 65
36 Journal of Applied Psychology 6 69 58 71 55 64 59 62 63
37 Journal of Cross-Cultural Psychology -4 69 74 69 76 62 73 79 72
38 Journal of Psychopathology and Behavioral Assessment -13 69 67 63 77 74 77 79 72
39 JPSP-Interpersonal Relationships and Group Processes 15 69 64 56 52 54 59 50 58
40 Social Psychology 3 69 70 66 61 64 72 64 67
41 Archives of Sexual Behavior -2 68 70 78 73 69 71 74 72
42 Journal of Affective Disorders 0 68 64 54 66 70 60 65 64
43 Journal of Experimental Child Psychology 2 68 71 70 65 66 66 70 68
44 Journal of Educational Psychology -11 67 61 66 69 73 69 76 69
45 Journal of Experimental Social Psychology 13 67 56 60 52 50 54 52 56
46 Memory and Cognition -3 67 72 69 68 75 66 73 70
47 Personality and Individual Differences 8 67 68 67 68 63 64 59 65
48 Psychophysiology -1 67 66 65 65 66 63 70 66
49 Cognitive Development 6 66 78 60 65 69 61 65 66
50 Frontiers in Psychology -8 66 65 67 63 65 60 83 67
51 Journal of Autism and Developmental Disorders 0 66 65 58 63 56 61 70 63
52 Journal of Experimental Psychology – General 5 66 69 67 72 63 68 61 67
53 Law and Human Behavior 1 66 69 53 75 67 73 57 66
54 Personal Relationships 19 66 59 63 67 66 41 48 59
55 Early Human Development 0 65 52 69 71 68 49 68 63
56 Attention, Perception and Psychophysics -1 64 69 70 71 72 68 66 69
57 Consciousness and Cognition -3 64 65 67 57 64 67 68 65
58 Journal of Vocational Behavior 5 64 78 66 78 71 74 57 70
59 The Journal of Positive Psychology 14 64 65 79 51 49 54 59 60
60 Behaviour Research and Therapy 7 63 73 73 66 69 63 60 67
61 Child Development 0 63 66 62 65 62 59 68 64
62 Emotion -1 63 61 56 66 62 57 65 61
63 JPSP-Personality Processes and Individual Differences 1 63 56 56 59 68 66 51 60
64 Schizophrenia Research 1 63 65 68 64 61 70 60 64
65 Self and Identity -4 63 52 61 62 50 55 71 59
66 Acta Psychologica -6 63 66 69 69 67 68 72 68
67 Behavioral Brain Research -3 62 67 61 62 64 65 67 64
68 Child Psychiatry and Human Development 5 62 72 83 73 50 82 58 69
69 Journal of Child Psychology and Psychiatry and Allied Disciplines 10 62 62 56 66 64 45 55 59
70 Journal of Consulting and Clinical Psychology 0 62 56 50 54 59 58 57 57
71 Journal of Counseling Psychology -3 62 70 60 74 72 56 72 67
72 Behavioral Neuroscience 1 61 66 63 62 65 58 64 63
73 Developmental Science -5 61 62 60 62 66 65 65 63
74 Journal of Experimental Psychology – Applied -4 61 61 65 53 69 57 69 62
75 Journal of Social Psychology -11 61 56 55 55 74 70 63 62
76 Social Psychology and Personality Science -5 61 42 56 59 59 65 53 56
77 Cognitive Therapy and Research 0 60 68 54 67 70 62 58 63
78 Hormones & Behavior -1 60 55 55 54 55 60 58 57
79 Motivation and Emotion 1 60 60 57 57 51 73 52 59
80 Organizational Behavior and Human Decision Processes 3 60 63 65 61 68 67 51 62
81 Psychoneuroendocrinology 5 60 58 58 56 53 59 53 57
82 Social Development -10 60 50 66 62 65 79 57 63
83 Appetite -10 59 57 57 65 64 66 67 62
84 Biological Psychology -6 59 60 55 57 57 65 64 60
85 Journal of Personality 17 59 59 60 62 69 37 45 56
86 Psychological Science 6 59 63 60 63 59 55 56 59
87 Asian Journal of Social Psychology 0 58 76 67 56 71 64 64 65
88 Behavior Therapy 0 58 63 66 69 66 52 65 63
89 British Journal of Social Psychology 0 58 57 44 59 51 59 55 55
90 Social Influence 18 58 72 56 52 33 59 46 54
91 Developmental Psychobiology -9 57 54 61 60 70 64 62 61
92 Journal of Research on Adolescence 2 57 59 61 82 71 75 40 64
93 Journal of Abnormal Psychology -5 56 52 57 58 55 66 55 57
94 Social Cognition -2 56 54 52 54 62 69 46 56
95 Personality and Social Psychology Bulletin 2 55 57 58 55 53 56 54 55
96 Cognition and Emotion -14 54 66 61 62 76 69 69 65
97 Health Psychology -4 51 67 56 72 54 69 56 61
98 Journal of Clinical Child and Adolescent Psychology 1 51 66 61 74 64 58 54 61
99 Journal of Family Psychology -7 50 52 63 61 57 64 55 57
100 Group Processes & Intergroup Relations -5 49 53 68 64 54 62 55 58
101 Infancy -8 47 44 60 55 48 63 51 53
102 Journal of Consumer Psychology -5 46 57 55 51 53 48 61 53
103 JPSP-Attitudes & Social Cognition -3 45 69 62 39 54 54 62 55

Notes.
1. Change scores are the unstandardized regression weights from a regression with the replicability estimates as the outcome variable and year as the predictor variable (see the sketch after these notes).  Year was coded from 0 for 2010 to 1 for 2016 so that the regression coefficient reflects change over the full seven-year period. This method is preferable to a simple difference score because estimates in individual years are variable and a difference score is likely to overestimate change.
2. Rich E. Lucas, Editor of JRP, noted that many articles in JRP do not report t or F values in the text and that replicability estimates based on these statistics may not be representative of the bulk of results reported in this journal.  Hand-coding of articles is required to address this problem, and the ranking of JRP, and of other journals, should be interpreted with caution (see further discussion of these issues below).
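The change score for a single journal can be reproduced with a simple linear regression. The following is a minimal sketch in R with hypothetical annual estimates; the variable names and values are illustrative and not the original code.

# Hypothetical annual replicability estimates for one journal (2010-2016)
estimate <- c(62, 63, 61, 67, 73, 72, 74)
year <- 2010:2016

# Rescale year to run from 0 (2010) to 1 (2016) so that the slope reflects
# the estimated change over the full seven-year period
year01 <- (year - 2010) / 6

fit <- lm(estimate ~ year01)
coef(fit)["year01"]  # unstandardized change score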

Introduction

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result.  In the past five years, it has become increasingly clear that psychology suffers from a replication crisis. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011).  The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated  (OSC, 2015).

The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability.  However, the reliance on actual replication studies has a number of limitations.  First, actual replication studies are expensive or impossible (e.g., for a longitudinal study spanning 20 years).  Second, studies selected for replication may not be representative because the replication team lacks the expertise to replicate some studies. Finally, replication studies take time, and the replicability of recent studies may not be known for several years. This makes it difficult to rely on actual replication studies to rank journals and to track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate the average replicability of a set of published results based on the original results in published articles.  This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies.  Replicability can be assessed in real time, it can be estimated for all published results, and it can be used for expensive studies that are impossible to reproduce.  Finally, actual replication studies can be criticized, for example for deviating from the original procedures (Gilbert, King, Pettigrew, & Wilson, 2016). Estimates of replicability based on original studies do not have this problem because they are based on the results reported in the original articles.

Z-curve has been validated with simulation studies and can be used when replicability varies across studies and when there is selection for significance, and is superior to similar statistical methods that correct for publication bias (Brunner & Schimmack, 2016).  I use this method to estimate the average replicability of significant results published in 103 psychology journals. Separate estimates were obtained for the years from 2010, one year before the start of the replication crisis, to 2016 to examine whether replicability increased in response to discussions about replicability.  The OSC estimate of replicability was based on articles published in 2008 and it was limited to three journals.  I posted replicability estimates based on z-curve for the year 2015 (2015 replicability rankings).  There was no evidence that replicability had increased during this time period.

The main empirical question was whether the 2016 rankings show some improvement in replicability and whether some journals or disciplines have responded more strongly to the replication crisis than others.

A second empirical question was whether replicability varies across disciplines.  The OSC project provided first evidence that traditional cognitive psychology is more replicable than social psychology.  Replicability estimates with z-curve confirmed this finding.  In the 2015 rankings, the Journal of Experimental Psychology: Learning, Memory, and Cognition ranked 25th with a replicability estimate of 74, whereas the two social psychology sections of the Journal of Personality and Social Psychology ranked 73rd and 99th (68% and 60% replicability estimates).  For this post, I conducted more extensive analyses of disciplines.

Journals

The 103 journals that are included in these rankings were mainly chosen based on impact factors.  The list covers diverse areas of psychology, including cognitive, developmental, social, personality, clinical, biological, and applied psychology.  The 2015 list included some new journals that started after 2010.  These journals were excluded from the 2016 rankings to avoid missing values in statistical analyses of time trends.  A few journals were added, and the results may change as more journals are included.

The journals were classified into 9 categories: social (24), cognitive (12), developmental (15), clinical/medical (19), biological (8), personality (5), and applied (I/O, education) (8).  Two journals were classified as general (Psychological Science, Frontiers in Psychology). The last category included topical, interdisciplinary journals (e.g., emotion, positive psychology).

Data 

All PDF versions of published articles were downloaded and converted into text files. The 2015 rankings were based on conversions with the free program pdf2text pilot.  The 2016 rankings used a superior conversion program, pdfzilla.  Text files were searched for reports of statistical results using my own R-code (z-extraction). Only F-tests, t-tests, and z-tests were used for the rankings. t-values that were reported without degrees of freedom were treated as z-values, which leads to a slight inflation of replicability estimates. However, the bulk of test statistics were F-values and t-values with degrees of freedom.  A comparison of the 2015 rankings using the old and the new extraction method shows that the extraction method has a notable influence on replicability estimates (r = .56). One reason for the low correlation is that replicability estimates have a relatively small range (50-80%), which lowers retest correlations. Thus, even small changes can have notable effects on rankings. For this reason, time trends in replicability have to be examined at the aggregate level of journals or over longer time intervals. The change score of a single journal from 2015 to 2016 is not a reliable measure of improvement.
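The conversion of reported test statistics into z-scores can be illustrated with a few lines of R. This is a simplified sketch, not the original z-extraction code, and the example values are only for illustration.

# Convert a t-test or F-test into an absolute z-score via its two-tailed p-value
t_to_z <- function(t, df) {
  p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # two-tailed p-value of the t-test
  qnorm(1 - p / 2)                             # absolute z-score with the same p-value
}
F_to_z <- function(F, df1, df2) {
  p <- pf(F, df1, df2, lower.tail = FALSE)     # p-value of the F-test
  qnorm(1 - p / 2)
}
# t-values reported without degrees of freedom are simply treated as z-values
t_to_z(2.04, 111)      # about 2.02
F_to_z(10.40, 1, 111)  # about 3.15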

Data Analysis

The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016).  The results of the individual analyses are presented in Powergraphs. Powergraphs for each journal and year are provided as links to the journal names in the table with the rankings.  Powergraphs convert test statistics into absolute z-scores as a common metric for the strength of evidence against the null-hypothesis.  Absolute z-scores greater than 1.96 (p < .05, two-tailed) are considered statistically significant. The distribution of z-scores greater than 1.96 is used to estimate the average true power (not observed power) of the set of significant studies. This estimate is an estimate of replicability for a set of exact replication studies because average power determines the percentage of statistically significant results.  Powergraphs provide additional information about replicability for different ranges of z-scores (z-values between 2 and 2.5 are less replicable than those between 4 and 4.5).  However, for the replicability rankings only the overall replicability estimate is used.
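The logic of estimating average power from the distribution of significant z-scores can be conveyed with a strongly simplified sketch in R. The actual z-curve method fits a mixture of normal distributions to allow for heterogeneity in power; the version below assumes a single power level for all studies and is meant only to illustrate the idea.

# Fit a single non-centrality parameter to z-scores above the significance
# criterion and convert it into an estimate of average power
estimate_power <- function(z, crit = qnorm(.975)) {
  z <- abs(z[abs(z) > crit])                       # keep only significant results
  negloglik <- function(ncp) {
    dens <- dnorm(z - ncp) + dnorm(z + ncp)        # folded normal density
    sel  <- pnorm(ncp - crit) + pnorm(-ncp - crit) # P(|Z| > crit), i.e., power
    -sum(log(dens / sel))                          # likelihood truncated at crit
  }
  ncp <- optimize(negloglik, interval = c(0, 10))$minimum
  pnorm(ncp - crit) + pnorm(-ncp - crit)           # estimated average power
}

# Simulated studies with 50% true power yield an estimate close to .50
set.seed(1)
estimate_power(rnorm(5000, mean = qnorm(.975)))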

Results

Table 1 shows the replicability estimates sorted by replicability in 2016.

The data were analyzed with a growth model to examine time trends and variability across journals and disciplines using MPLUS 7.4.  I compared three models. Model 1 assumed no mean-level change but allowed for variability across journals. Model 2 assumed a linear increase. Model 3 assumed no change from 2010 to 2015 and allowed for an increase in 2016.

Model 1 had acceptable fit (RMSEA = .043, BIC = 5004). Model 2 improved fit (RMSEA = .029, BIC = 5005), but BIC slightly favored the more parsimonious Model 1. Model 3 had the best fit (RMSEA = .000, BIC = 5001).  These results reproduce the finding of the 2015 analysis that there was no improvement from 2010 to 2015, but they provide some evidence that replicability increased in 2016.  Adding a variance component for the slope in Model 3 produced an unidentified model. Subsequent analyses show that this is due to insufficient power to detect variation across journals in changes over time.

The standardized loadings of individual years on the latent intercept factor ranged from .49 to .58.  This shows high variability in replicability estimates from year to year. Most of the rank changes can therefore be attributed to random factors.  A better way to compare journals is to average across years.  A moving average of five years provides more reliable information and still allows for improvement over time.  The reliability of the 5-year average for the years 2012 to 2016 is 68%.

Figure 1 shows the annual averages with 95% confidence intervals, relative to the average over the full 7-year period.

[Figure 1: Average replicability by year]

A paired t-test confirmed that average replicability in 2016 (M = 65, SD = 8) was significantly higher than in the previous years (M = 63, SD = 8), t(101) = 2.95, p = .004.  This is the first evidence that psychological scientists are responding to the replicability crisis by publishing slightly more replicable results.  Of course, this positive result has to be tempered by the small effect size.  But if this trend continues or even accelerates, replicability could reach 80% in 10 years.
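A minimal sketch of this comparison in R, using a simulated placeholder data set (one row per journal, one column per year); the data frame and column names are hypothetical and the output will not match the reported values.

# Hypothetical layout: 103 journals x 7 annual replicability estimates
set.seed(2)
rankings <- as.data.frame(matrix(round(rnorm(103 * 7, mean = 63, sd = 8)),
                                 nrow = 103,
                                 dimnames = list(NULL, paste0("rep", 2010:2016))))

# Compare 2016 with the average of the six preceding years
prior <- rowMeans(rankings[, paste0("rep", 2010:2015)])
t.test(rankings$rep2016, prior, paired = TRUE)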

The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2016 with the previous years.  This analysis produced only 7 significant increases at p < .05 (one-tailed), which is only 2 more significant results than would be expected by chance alone. Thus, the analysis failed to identify particular journals that contribute to the improvement in the average.  Figure 2 compares the observed distribution of t-values to the predicted distribution based on the null-hypothesis (no change).

[Figure 2: Distribution of t-values for the 2016 contrast across journals]

The blue line shows the observed density distribution, which is shifted slightly to the right, but there is no subset of journals with notably larger t-values.  A more sustained and larger increase in replicability is needed to detect variability in change scores.
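The journal-level analysis can be sketched as follows in R, reusing the hypothetical rankings data frame from the sketch above; this illustrates the analysis logic rather than the original code.

# For each journal, contrast 2016 with the six previous years and collect
# the t-value of the dummy variable
dummy2016 <- c(rep(0, 6), 1)
t_values <- apply(rankings[, paste0("rep", 2010:2016)], 1, function(est) {
  summary(lm(est ~ dummy2016))$coefficients["dummy2016", "t value"]
})

# Count significant increases (one-tailed, p < .05) and compare the observed
# t-values with the distribution expected under the null hypothesis
sum(pt(t_values, df = 5, lower.tail = FALSE) < .05)
plot(density(t_values)); curve(dt(x, df = 5), add = TRUE, lty = 2)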

The next analyses examine stable differences between disciplines.  The first analysis compared cognitive journals to social journals.  No statistical tests are needed to see that cognitive journals publish more replicable results than social journals. This finding confirms the results with actual replications of studies published in 2008 (OSC, 2015). The Figure suggests that the improvement in 2016 is driven more by social journals, but only 2017 data can tell whether there is a real improvement in social psychology.

[Figure: Replicability of cognitive versus social psychology journals by year]

The next figure shows the results for the 5 personality journals.  The large confidence intervals show that there is considerable variability among personality journals. The figure shows the averages for cognitive and social psychology as horizontal lines. The average for personality psychology is only slightly above the average for social psychology and, like social psychology, personality psychology shows an upward trend.  In conclusion, personality and social psychology look very similar.  This may be due to considerable overlap between the two disciplines, which is also reflected in shared journals.  Larger differences may be visible for specialized journals that focus on experimental social psychology.

[Figure: Replicability of personality journals by year]

The results for developmental journals show no clear time trend and the average is just about in the middle between cognitive and social psychology.  The wide confidence intervals suggest that there is considerable variability among developmental journals. Table 1 shows Developmental Psychology ranks 14 / 103 and Infancy ranks 101/103. The low rank for Infancy may be due to the great difficulty of measuring infant behavior.

[Figure: Replicability of developmental journals by year]

The clinical/medical journals cover a wide range of topics from health psychology to special areas of psychiatry.  There has been some concern about replicability in medical research (Ioannidis, 2005). The results for clinical journals are similar to those for developmental journals. Replicability is lower than for cognitive psychology and higher than for social psychology.  This may seem surprising because patient samples tend to be small. However, randomized controlled intervention studies often use pre-post designs to boost power, whereas social and personality psychologists rely on comparisons across individuals, which require large samples to reduce sampling error.

 

[Figure: Replicability of clinical/medical journals by year]

The set of biological journals is small and very heterogeneous; it includes neuroscience and classic peripheral physiology.  Despite wide confidence intervals, replicability for biological journals is significantly lower than replicability for cognitive psychology. There is no notable time trend. The average is slightly above the average for social journals.

[Figure: Replicability of biological journals by year]

 

The last category comprises applied journals. One journal focuses on education; the other journals focus on industrial and organizational psychology.  Confidence intervals are wide, but replicability is generally lower than for cognitive psychology. There is no notable time trend for this set of journals.

[Figure: Replicability of applied journals by year]

Given the stability of replicability, I averaged replicability estimates across years. The last figure shows a comparison of disciplines based on these averages.  The figure shows that social psychology is significantly below average and cognitive psychology is significantly above average with the other disciplines falling in the middle.  All averages are significantly above 50% and below 80%.

Discussion

The most exciting finding is that replicability appears to have increased in 2016. This increase is remarkable because the averages in the preceding years consistently hovered around 63.  The increase by 2 percentage points in 2016 is not large, but it may represent a first response to the replication crisis.

The increase is particularly remarkable because statisticians have been sounding the alarm about low power and publication bias for over 50 years (Cohen, 1962; Sterling, 1959), but these warnings have had no effect on research practices. Sedlmeier and Gigerenzer (1989) noted that studies of statistical power had no effect on the statistical power of studies.  The present results provide the first empirical evidence that psychologists are finally starting to change their research practices.

However, the results also suggest that most journals continue to publish articles with low power.  The replication crisis has affected social psychology more than other disciplines, with fierce debates in journals and on social media (Schimmack, 2016).  The comparisons of disciplines support the impression that social psychology has a bigger replicability problem than other disciplines. However, the differences between disciplines are small. With the exception of cognitive psychology, other disciplines are not a lot more replicable than social psychology.  The main reason for the focus on social psychology is probably that these studies are easier to replicate and that there have been more replication studies in social psychology in recent years.  The replicability rankings predict that other disciplines would also see a large number of replication failures if they subjected important findings to actual replication attempts.  Only empirical data will tell.

Limitations

The main limitation of the replicability rankings is that the automatic extraction method does not distinguish between theoretically important hypothesis tests and other statistical tests.  Although this is a problem for the interpretation of the absolute estimates, it is less important for comparisons over time.  Any changes in research practices that reduce sampling error (e.g., larger samples, more reliable measures) will not only strengthen the evidence for focal hypothesis tests, but also increase the strength of evidence for non-focal hypothesis tests.

Schimmack and Brunner (2016) compared replicability estimates with actual success rates in the OSC (2015) replication studies.  They found that the statistical method overestimates replicability by about 20%.  Thus, the absolute estimates should be interpreted as very optimistic estimates.  There are several reasons for this overestimation.  One reason is that the estimation method assumes that all results with a p-value below .05 are equally likely to be published. If there are further selection mechanisms that favor smaller p-values, the method overestimates replicability.  For example, researchers sometimes correct for multiple comparisons and need to meet a more stringent significance criterion.  Only careful hand-coding of research articles can provide more accurate estimates of replicability.  Schimmack and Brunner (2016) hand-coded the articles that were included in the OSC (2015) article and still found that the method overestimated replicability.  Thus, the absolute values need to be interpreted with great caution, and success rates of actual replication studies are expected to be at least 10% lower than these estimates.

Implications

Power and replicability have been ignored for over 50 years.  A likely reason is that replicability is difficult to measure.  A statistical method for the estimation of replicability changes this. Replicability estimates of journals make it possible for editors to compete with other journals in the replicability rankings. Flashy journals with high impact factors may publish eye-catching results, but if a journal has a reputation for publishing results that do not replicate, its findings are not very likely to have a big impact.  Science is built on trust, and trust has to be earned and can easily be lost.  Eventually, journals that publish replicable results may also increase their impact because more researchers are going to build on replicable results published in these journals.  In this way, replicability rankings can provide a much-needed correction to the current incentive structure in science, which rewards publishing as many articles as possible without any concern about the replicability of the results. This reward structure is undermining science.  It is time to change it. It is no longer sufficient to publish a significant result if this result cannot be replicated in other labs.

Many scientists feel threatened by changes in the incentive structure and by the negative consequences of replication failures for their reputation. However, researchers have control over their reputation.  Researchers often carry out many conceptually related studies. In the past, it was acceptable to publish only the studies that worked (p < .05). This selection for significance by researchers is the key factor in the replication crisis. The researchers who conduct the studies are fully aware that it was difficult to get a significant result, but the selective reporting of these successes produces inflated effect size estimates and an illusion of high replicability that inevitably leads to replication failures.  To avoid these embarrassing replication failures, researchers need to report the results of all studies or conduct fewer studies with high power.  The 2016 rankings suggest that some researchers have started to change, but we will have to wait until 2017 to see whether the positive trend in the 2016 rankings replicates.


Replicability Report No. 2: Do Mating Primes Have Replicable Effects on Behavior?

In 2000, APA declared the following decade the decade of behavior.  The current decade may be considered the decade of replicability or rather the lack thereof.  The replicability crisis started with the publication of Bem’s (2011) infamous “Feeling the future” article.  In response, psychologists have started the painful process of self-examination.

Preregistered replication reports and systematic studies of reproducibility have demonstrated that many published findings are difficult to replicate and that, when they can be replicated, actual effect sizes are about 50% smaller than the effect sizes reported in the original articles (OSC, Science, 2015).

To examine which studies in psychology produced replicable results, I created ReplicabilityReports.  Replicability reports use statistical tools that can detect publication bias and questionable research practices to examine the replicability of research findings in a particular research area.  The first replicability report examined the large literature of ego-depletion studies and found that only about a dozen studies may have produced replicable results.

This replicability report focuses on a smaller literature that used mating primes (images of potential romantic partners / imagining a romantic scenario) to test evolutionary theories of human behavior.  Most studies use the typical priming design, where participants are randomly assigned to one or more mating prime conditions or a control condition. After the priming manipulation the effect of activating mating-related motives and thoughts on a variety of measures is examined.  Typically, an interaction with gender is predicted with the hypothesis that mating primes have stronger effects on male participants. Priming manipulations vary from subliminal presentations to instructions to think about romantic scenarios for several minutes; sometimes with the help of visual stimuli.  Dependent variables range from attitudes towards risk-taking to purchasing decisions.

Shanks et al. (2015) conducted a meta-analysis of a subset of mating priming studies that focus on consumption and risk-taking.  A funnel plot showed clear evidence of bias in the published literature.  The authors also conducted several replication studies. The replication studies failed to produce any significant results. Although this outcome might be due to low power to detect small effects, a meta-analysis of all replication studies also produced no evidence for reliable priming effects (average d = .00, 95% CI [-.12, .11]).

This replicability report aims to replicate and extend Shanks et al.'s findings in three ways.  First, I expanded the database by including all articles that mentioned the term "mating primes" in a full-text search of social psychology journals.  This expanded the set of articles from 15 to 36 and the set of studies from 42 to 92. Second, I used a novel and superior bias test.  Shanks et al. used funnel plots and Egger's regression of effect sizes on sampling error to examine bias. The problem with this approach is that heterogeneity in effect sizes can produce a negative correlation between effect sizes and sample sizes.  Power-based bias tests do not suffer from this problem (Schimmack, 2014).  A set of studies with an average power of 60% cannot produce more than 60% significant results in the long run (Sterling et al., 1995).  Thus, a discrepancy between observed power and the reported success rate provides clear evidence of selection bias. Powergraphs also make it possible to estimate the actual power of studies after correcting for publication bias and questionable research practices.  Finally, replicability reports use bias tests that can be applied to small sets of studies.  This makes it possible to find studies with replicable results even if most studies have low replicability.

DESCRIPTIVE STATISTICS

The dataset consists of 36 articles and 92 studies. The median sample size of a study was N = 103 and the total number of participants was N = 11,570. The success rate including marginally significant results, z > 1.65, was 100%.  The success rate excluding marginally significant results, z > 1.96, was 90%.  Median observed power for all 92 studies was 66%.  This discrepancy shows that the published results are biased towards significance.  When bias is present, median observed power overestimates actual power.  To correct for this bias, the R-Index subtracts the inflation rate from median observed power.  The R-Index is 66 – 34 = 32.  An R-Index below 50% implies that most studies will not replicate a significant result in an exact replication study with the same sample size and power as the original studies.  The R-Index for the 15 studies included in Shanks et al. was 34% and the R-Index for the additional studies was 36%.  This shows that convergent results were obtained for two independent samples based on different sampling procedures and that Shanks et al.’s limited sample was representative of the wider literature.
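The R-Index computation described here can be sketched in a few lines of R, assuming a vector of absolute z-scores for the focal tests. This is an illustration, not the original code; the success rate can be supplied separately when marginally significant results are counted as successes.

# R-Index = median observed power - inflation (success rate - median observed power)
r_index <- function(z, success = mean(abs(z) > qnorm(.975))) {
  obs_power <- pnorm(abs(z) - qnorm(.975))  # observed power, ignoring the opposite tail
  mop <- median(obs_power)                  # median observed power
  inflation <- success - mop                # inflation of the success rate
  mop - inflation                           # R-Index
}

# With a median observed power of .66 and a success rate of 1.00 (as reported
# above), inflation is .34 and the R-Index is .66 - .34 = .32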

POWERGRAPH

For each study, a focal hypothesis test was identified and the result of the statistical test was converted into an absolute z-score.  These absolute z-scores can vary as a function of random sampling error or differences in power and should follow a mixture of normal distributions.  Powergraphs find the best mixture model that minimizes the discrepancy between observed and predicted z-scores.

Powergraph for Romance Priming (Focal Tests)

 

The histogram of z-scores shows clear evidence of selection bias. The steep cliff on the left side of the criterion for significance (z = 1.96) shows a lack of non-significant results.  The few non-significant results are all in the range of marginal significance and were reported as evidence for an effect.

The histogram also shows evidence of the use of questionable research practices. Selection bias alone would produce a cliff to the left of the significance criterion but a smooth mixture-normal distribution to the right of it. However, the graph also shows a second cliff around z = 2.8.  This cliff can be explained by questionable research practices that inflate effect sizes to produce significant results.  These questionable research practices are much more likely to produce z-scores in the range between 2 and 3 than z-scores greater than 3.

The large number of z-scores in the range between 1.96 and 2.8 makes it impossible to distinguish between real effects with modest power and questionable effects with much lower power that will not replicate.  To obtain a robust estimate of power, power is estimated only for z-scores greater than 2.8 (k = 17).  The resulting power estimate is 73%. This estimate suggests that some studies may have reported real effects that can be replicated.

The grey curve shows the predicted distribution for a set of studies with 73% power.  As can be seen, there are too many observed z-scores in the range between 1.96 and 2.8 and too few z-scores in the range between 0 and 1.96 compared to the predicted distribution based on z-scores greater than 2.8.

The powergraph analysis confirms and extends Shanks et al.’s (2015) findings. First, the analysis provides strong evidence that selection bias and questionable research practices contribute to the high success rate in the mating-prime literature.  Second, the analysis suggests that a small portion of studies may actually have reported true effects that can be replicated.

REPLICABILITY OF INDIVIDUAL ARTICLES

The replicability of results published in individual articles was examined with the Test of Insufficient Variance (TIVA) and the Replicability-Index.  TIVA tests bias by comparing the variance of the observed z-scores against the variance that is expected based on sampling error.  As the sampling variance of z-scores is 1, observed z-scores should have a variance of at least 1. If there is heterogeneity, the variance can be even greater, but it cannot be smaller than 1.  TIVA uses the chi-square test for variances to compute the probability that a variance less than 1 occurred simply by chance.  A p-value less than .10 is used to flag an article as questionable.
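TIVA can be sketched in a few lines of R, assuming a vector of z-scores from independent focal tests; this is a minimal illustration rather than the original implementation.

# Test of Insufficient Variance: is the variance of the z-scores
# significantly smaller than the value of 1 expected from sampling error?
tiva <- function(z) {
  k   <- length(z)
  v   <- var(z)                    # observed variance of the z-scores
  chi <- (k - 1) * v / 1           # chi-square statistic for a test against variance 1
  p   <- pchisq(chi, df = k - 1)   # probability of a variance this small by chance
  c(variance = v, p = p)
}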

The Replicability-Index (R-Index) uses observed power to test for bias. Z-scores are converted into a measure of observed power, and median observed power is used as an estimate of power.  The success rate (percentage of significant results) should match observed power.  The difference between the success rate and median observed power shows how much the success rate is inflated.  The R-Index subtracts this inflation from median observed power.  A value of 50% is used as the minimum criterion for replicability.

Articles that pass both tests are examined in more detail to identify studies with high replicability.  Only three articles passed this test.

1. Greitemeyer, Kastenmüller, and Fischer (2013) [R-Index = .80]

The article with the highest R-Index reported 4 studies.  The high R-Index for this article is due to Studies 2 to 4.  Studies 3 and 4 used a 2 x 3 between-subjects design with gender and three priming conditions. Both studies produced strong evidence for an interaction effect, Study 3: F(2,111) = 12.31, z = 4.33, Study 4: F(2,94) = 7.46, z = 3.30.  The pattern of the interaction is very similar in the two studies.  For women, the means are very similar and not significantly different from each other.  For men, the two mating prime conditions are very similar and significantly different from the control condition.  The standardized effect sizes for the difference between the combined mating prime conditions and the control condition are large, Study 3: t(110) = 6.09, p < .001, z = 5.64, d = 1.63; Study 4: t(94) = 5.12, d = 1.30.

Taken at face value, these results are highly replicable, but there are some concerns about the reported results. The means in conditions that are not predicted to differ from each other are very similar.  I tested the probability of this pattern occurring by chance using TIVA, comparing the means of the two mating prime conditions for men and for women in the two studies.  The four z-scores were z = 0.53, 0.08, 0.09, and -0.40.  The variance should be 1, but the observed variance is only Var(z) = 0.14.  The probability of obtaining such a small variance by chance is p = .056.  Thus, even though the overall R-Index for this article is high and the reported effect sizes are very large, it is likely that an actual replication study will produce weaker effects and may not replicate the original findings.

Study 2 also produced strong evidence for a priming x gender interaction, F(1,81) = 11.23, z = 3.23.  In contrast to Studies 3 and 4, this was a cross-over interaction with opposite effects of primes for males and females.  However, there is some concern about the reliability of this interaction because the post-hoc tests for males and females were both just significant, males: t(40) = 2.61, d = .82; females: t(41) = 2.10, d = .63.  As these post-hoc tests are essentially two independent studies, it is possible to use TIVA to test whether these results are too similar, Var(z) = 0.11, p = .25.  The R-Index for this set of tests is low, R-Index = .24 (MOP = .62).  Thus, a replication study may replicate the interaction effect, but the chances of replicating significant results for males or females separately are lower.

Importantly, Shanks et al. (2015) conducted two close replications of Greitemeyer et al.’s studies with risky driving, gambling, and sexual risk taking as dependent variables.  Study 5 examined the effect of short-term mate primes on risky driving.  Although the sample size was small, the large effect size in the original study implies that this study had high power to replicate the effect, but it did not, t(77) = -0.85, p = .40, z = -0.85.  The negative sign indicates that the pattern of means was reversed, but not significantly so.  Study 6 failed to replicate the interaction effect for sexual risk taking reported by Greitemeyer et al., F(1, 93) = 1.15, p = .29.  The means for male participants were in the opposite direction, showing a decrease in risk taking after mating priming.  The study also failed to replicate the significant decrease in risk taking for female participants.  Study 6 also produced non-significant results for gambling and substance risk taking.  These failed replication studies raise further concerns about the replicability of the original results with extremely large effect sizes.

2. Jon K. Maner, Matthew T. Gailliot, D. Aaron Rouby, and Saul L. Miller (JPSP, 2007) [R-Index = .62]

This article passed TIVA only due to the low power of TIVA for a set of three studies, TIVA: Var(z) = 0.15, p = .14.  In Study 1, male and female participants were randomly assigned to a sexual-arousal priming condition or a happiness control condition. Participants also completed a measure of socio-sexual orientation (i.e., interest in casual and risky sex) and were classified into groups of unrestricted and restricted participants. The dependent variable was performance on a dot-probe task.  In a dot-probe task, participants have to respond to a dot that appears in the location of one of two stimuli that compete for visual attention.  In theory, participants are faster to respond to the dot if it appears in the location of a stimulus that attracts more attention.  Stimuli were pictures of very attractive or less attractive members of the same or opposite sex.  The time between the presentation of the pictures and the dot was also manipulated.  The authors reported that they predicted a three-way interaction between priming condition, target picture, and stimulus-onset time.  The authors did not predict an interaction with gender.  The ANOVA showed a significant three-way interaction, F(1,111) = 10.40, p = .002, z = 3.15.  A follow-up two-way ANOVA showed an interaction between priming condition and target for unrestricted participants, F(1,111) = 7.69, p = .006, z = 2.72.

Study 2 replicated Study 1 with a sentence-unscrambling task, which is used as a subtler priming manipulation.  The study closely replicated the results of Study 1. The three-way interaction was significant, F(1,153) = 9.11, and the follow-up two-way interaction for unrestricted participants was also significant, F(1,153) = 8.22, z = 2.75.

Study 3 changed the primes to jealousy or anxiety/frustration.  Jealousy is a mating-related negative emotion and was predicted to influence participants like the mating primes.  In this study, participants were classified into groups with high or low sexual vigilance based on a jealousy scale.  The predicted three-way interaction was significant, F(1,153) = 5.74, p = .018, z = 2.37.  The follow-up two-way interaction for participants high in sexual vigilance was also significant, F(1,153) = 8.13, p = .005, z = 2.81.

A positive feature of this set of studies is that the manipulation of targets within subjects reduces within-cell variability and increases the power to produce significant results.  However, a problem is that the authors also report analyses for specific targets and do not mention whether they used reaction times to the other targets as a covariate. These analyses have low power due to the high variability in reaction times across participants.  Surprisingly, each study still produced the predicted significant result.

Study 1: “Planned analyses clarified the specific pattern of hypothesized effects. Multiple regression evaluated the hypothesis that priming would interact with participants’ sociosexual orientation to increase attentional adhesion to attractive opposite-sex targets. Attention to those targets was regressed on experimental condition, SOI, participant sex, and their centered interactions (nonsignificant interactions were dropped). Results confirmed the hypothesized interaction between priming condition and SOI, beta = .19, p < .05 (see Figure 1).”
I used r = .19 and N = 113 and obtained t(111) = 2.04, p = .043, z = 2.02.

Study 2: “Planned analyses clarified the specific pattern of hypothesized effects. Regression evaluated the hypothesis that the mate-search prime would interact with sociosexual orientation to increase attentional adhesion to attractive opposite-sex targets. Attention to these targets was regressed on experimental condition, SOI score, participant sex, and their centered interactions (nonsignificant interactions were dropped). As in Study 1, results revealed the predicted interaction between priming condition and sociosexual orientation, beta = .15, p = .04, one-tailed (see Figure 2)”
I used r = .15 and N = 155 and obtained t(153) = 1.88, p = .06 (two-tailed!), z = 1.86.

Study 3: “We also observed a significant main effect of intrasexual vigilance, beta = .25, p < .001, partial r = .26, and, more important, the hypothesized two-way interaction between priming condition and level of intrasexual vigilance, beta = .15, p < .05, partial r = .16 (see Figure 3).”
I used r = .16 and N = 155 and obtained t(153) = 2.00, p = .047, z = 1.99.
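These conversions can be reproduced with a short R function that treats the reported standardized coefficients as correlations, as was done above; this is a sketch for illustration only.

# Convert a correlation r and sample size N into t, two-tailed p, and z
r_to_z <- function(r, N) {
  t <- r * sqrt(N - 2) / sqrt(1 - r^2)                  # implied t-value, df = N - 2
  p <- 2 * pt(abs(t), df = N - 2, lower.tail = FALSE)   # two-tailed p-value
  z <- qnorm(1 - p / 2)                                 # corresponding absolute z-score
  round(c(t = t, p = p, z = z), 2)
}
r_to_z(.19, 113)  # Study 1: t = 2.04, p = .04, z = 2.02
r_to_z(.15, 155)  # Study 2: t = 1.88, z = 1.86
r_to_z(.16, 155)  # Study 3: t = 2.00, z = 1.99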

The problem is that the results of these three independent analyses are too similar, z = 2.02, 1.86, 1.99; Var(z) = 0.007, p = .007.
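This can be verified directly from the three z-scores with the chi-square test for variances sketched earlier; the snippet below is self-contained.

z <- c(2.02, 1.86, 1.99)
v <- var(z)                                        # about 0.007
pchisq((length(z) - 1) * v, df = length(z) - 1)    # about .007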

In conclusion, there are some concerns about the replicability of these results and even if the results replicate they do not provide support for the hypothesis that mating primes have a hard-wired effect on males. Only one of the three studies produced a significant two-way interaction between priming and target (F-value not reported), and none of the three studies produced a significant three-way interaction between priming, target, and gender.  Thus, the results are inconsistent with other studies that found either main effects of mating primes or mating prime by gender interactions.

3. Bram Van den Bergh and Siegfried Dewitte (Proc. R. Soc. B, 2006) [R-index = .58]

This article reports three studies that examined the influence of mating primes on behavior in the ultimatum game.

Study 1 had a small sample size of 40 male participants who were randomly assigned to seeing pictures of non-nude female models or landscapes.  The study produced a significant main effect, F(1,40) = 4.75, p = .035, z = 2.11, and a significant interaction with finger digit ratio, F(1,40) = 4.70, p = .036, z = 2.10.  I used the main effect for analysis because it is theoretically more important than the interaction effect, but the results are so similar that it does not matter which effect is used.

Study 2 used ratings of women’s t-shirts or bras as the manipulation. The study produced strong evidence that mating primes (rating bras) lead to lower minimum acceptance rates in the ultimatum game than the control condition (rating t-shirts), F(1,33) = 8.88, p = .005, z = 2.78.  Once more, the study also produced a significant interaction with finger digit ratio, F(1,33) = 8.76, p = .006, z = 2.77.

Study 3 had three experimental conditions, namely non-sexual pictures of older and young women, and pictures of young non-nude female models.  The study produced a significant effect of condition, F(2,87) = 5.49, p = .006, z = 2.77.  Once more the interaction with finger-digit ratio was also significant, F(2,87) = 5.42.

This article barely passed the test of insufficient variance in the primary analysis that uses one focal test per study, Var(z) = 0.15, p = .14.  However, the main effect and the interaction effects are statistically independent and it is possible to increase the power of TIVA by using the z-scores for the three main effects and the three interactions.  This test produces significant evidence for bias, Var(z) = 0.12, p = .01.

In conclusion, it is unlikely that the results reported in this article will replicate.

CONCLUSION

The replicability crisis in psychology has created doubt about the credibility of published results.  Numerous famous priming studies have failed to replicate in large replication studies.  Shanks et al. (2015) reported problems with the specific literature on romantic and mating priming.  This replicability report provides further evidence that the mating prime literature is not credible.  Using an expanded set of 92 studies, analyses with powergraphs, the Test of Insufficient Variance, and the Replicability-Index showed that many significant results were obtained with the help of questionable research practices that inflate observed effect sizes and provide misleading evidence about the strength and replicability of published results.  Only three articles passed the tests with TIVA and the R-Index, and detailed examination of these articles also showed statistical problems with their evidence.  Thus, this replicability analysis of 36 articles failed to identify a single credible article.  The lack of credible evidence is consistent with Shanks et al.’s failure to produce significant results in 15 independent replication studies.

Of course, these results do not imply that evolutionary theory is wrong or that sexual stimuli have no influence on human behavior.  For example, in my own research I have demonstrated that sexually arousing opposite-sex pictures capture men’s and women’s attention (Schimmack, 2005).  However, these responses occurred in response to specific stimuli and not as carry-over effects of a priming manipulation. Thus, the problem with mating prime studies is probably that priming effects are weak and may have no notable influence on unrelated behaviors like consumer behavior or risk taking in investments.  Given the replication problems with other priming studies, it seems necessary to revisit the theoretical assumptions underlying this paradigm.  For example, Shanks et al. (2015) pointed out that behavioral priming effects are theoretically implausible because these predictions contradict well-established theories according to which behavior is guided by the cognitive appraisal of the situation at hand rather than by unconscious residual information from previous situations. This makes evolutionary sense because behavior has to respond to the adaptive problem at hand to ensure survival and reproduction.

I recommend that textbook writers, journalists, and aspiring social psychologists treat claims about human behavior based on mating priming studies with a healthy dose of skepticism.  The results reported in these articles may reveal more about the motives of researchers than their participants.

Do Deceptive Reporting Practices in Social Psychology Harm Social Psychology?

A Critical Examination of “Research Practices That Can Prevent an Inflation of False-Positive Rates” by Murayama, Pekrun, and Fiedler (2014).

The article by Murayama, Pekrun, and Fiedler (MPK) discusses the probability of false positive results (evidence for an effect when no effect is present, also known as a type-I error) in multiple-study articles. When researchers conduct a single study, the nominal probability of obtaining a significant result without a real effect (a type-I error) is typically set to 5% (p < .05, two-tailed). Thus, in the absence of an effect, one would expect 19 non-significant results for every significant result. A false-positive finding (type-I error) would be followed by several failed replications. Thus, replication studies can quickly correct false discoveries. Or so one would like to believe. However, journals traditionally reported only significant results. Thus, false positive results remained uncorrected in the literature because failed replications were not published.

In the 1990s, experimental psychologists who run relatively cheap studies found a solution to this problem. Journals demanded that researchers replicate their findings in a series of studies that were then published in a single article.

MPK point out that the probability of a type-I error decreases exponentially as the number of studies increases. With two studies, the probability is less than 1% (.05 * .05 = .0025). It is easier to see the exponential effect in terms of ratios (1 out of 20, 1 out of 400, 1 out of 8,000, etc.). In top journals of experimental social psychology, a typical article contains four studies. The probability that all four studies produce a type-I error is only 1 out of 160,000. The corresponding value on a standard normal distribution is z = 4.52, which means the strength of evidence is 4.5 standard deviations away from 0, the value that represents the absence of an effect. In particle physics, a value of z = 5 is used to rule out false positives. Thus, getting 4 out of 4 significant results in four independent tests of an effect provides strong evidence for an effect.
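These numbers can be reproduced in a few lines of R, converting the combined probability into a z-score in the same way the article converts two-tailed p-values.

p_all4 <- .05^4          # probability that four independent tests are all type-I errors
1 / p_all4               # 1 out of 160,000
qnorm(1 - p_all4 / 2)    # about 4.52 on the standard normal scale (two-tailed conversion)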

I am in full agreement with MPK, and I made the same point in Schimmack (2012). The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) and a single study with the total number of participants (N = 160). A real effect will produce stronger evidence for an effect as sample size increases. Getting four significant results at the 5% level is not more impressive than getting a single significant result at the p < .00001 level.

However, the strength of evidence from multiple-study articles depends on one crucial condition. This condition is so elementary and self-evident that it is usually not even mentioned in statistics. The condition is that a researcher honestly reports all results. Four significant results are only impressive when a researcher went into the lab, conducted four studies, and obtained significant results in all of them. Similarly, 4 free throws are only impressive when there were only 4 attempts. Making 4 out of 20 free throws is not that impressive, and 4 out of 80 attempts is horrible. Thus, the absolute number of successes is not important. What matters is the relative frequency of successes across all attempts that were made.

Schimmack (2012) developed the incredibility index to examine whether a set of significant results is based on honest reporting or whether it was obtained by omitting non-significant results or by using questionable statistical practices to produce significant results. Evidence for dishonest reporting of results would undermine the credibility of the published results.

MPK have the following to say about dishonest reporting of results.

“On a related note, Francis (2012a, 2012b, 2012c, 2012d; see also Schimmack, 2012) recently published a series of analyses that indicated the prevalence of publication bias (i.e., file-drawer problem) in multi-study papers in the psychological literature.” (p. 111).   They also note that Francis used a related method to reveal that many multiple-study articles show statistical evidence of dishonest reporting. “Francis argued that there may be many cases in which the findings reported in multi-study papers are too good to be true” (p. 111).

In short, Schimmack and Francis argued that multiple-study articles can be misleading because they provide the illusion of replicability (a researcher was able to demonstrate the effect again, and again, and again, therefore it must be a robust effect), but in reality it is not clear how robust the effect is because the results were not obtained in the way the studies are described in the article (first we did Study 1, then we did Study 2, etc., and voila, all of the studies worked and showed the effect).

One objection to Schimmack and Francis would be to find a problem with their method of detecting bias. However, MPK do not comment on the method at all. They sidestep this issue when they write “it is beyond the scope of this article to discuss whether publication bias actually exists in these articles or how prevalent it is in general” (p. 111).

After sidestepping the issue, MPK are faced with a dilemma or paradox. Do multiple-study articles strengthen the evidence because the combined type-I error probability decreases, or do they weaken the evidence because of the probability that researchers did not report the results of their research program honestly? “Should multi-study findings be regarded as reliable or shaky evidence?” (p. 111).

MPK solve this paradox with a semantic trick. First, they point out that dishonest reporting has undesirable effects on effect size estimates.

“A publication bias, if it exists, leads to overestimation of effect sizes because some null findings are not reported (i.e., only studies with relatively large effect sizes that produce significant results are reported). The overestimation of effect sizes is problematic” (p. 111).

They do not explain why researchers should be allowed to omit studies with non-significant results from an article, given that this practice leads to the undesirable consequence of inflated effect sizes. Accurate estimates of effect sizes would be obtained if researchers published all of their results. In fact, Schimmack (2012) suggested that researchers report all results and then conduct a meta-analysis of their own set of studies to examine how strong the combined evidence is. This meta-analysis would provide an unbiased estimate of the true effect size and unbiased evidence about the probability that the results of all studies were obtained in the absence of an effect.

The semantic trick occurs when the authors suggest that dishonest reporting practices are only a problem for effect size estimates, but not for the question whether an effect actually exists.

“However, the presence of publication bias does not necessarily mean that the effect is absent (i.e., that the findings are falsely positive).” (p. 111) and “Publication bias simply means that the effect size is overestimated—it does not necessarily imply that the effect is not real (i.e., falsely positive).” (p. 112).

This statement is true because it is practically impossible to demonstrate false positives, which would require demonstrating that the true effect size is exactly 0.   The presence of bias does not warrant the conclusion that the effect size is zero and that reported results are false positives.

However, this is not the point of revealing dishonest practices. The point is that dishonest reporting of results undermines the credibility of the evidence that was used to claim that an effect exists. The issue is the lack of credible evidence for an effect, not credible evidence for the lack of an effect. These two statements are distinct and MPK use the truth of the second statement to suggest that we can ignore whether the first statement is true.

Finally, MPK present a scenario of a multiple-study article with 8 studies that all produced significant results. They state that it is “unrealistic that as many as eight statistically significant results were produced by a non-existent effect” (p. 112).

This rosy view of multiple-study articles ignores the fact that the replication crisis in psychology was triggered by Bem’s (2011) infamous article that contained 9 out of 9 statistically significant results (one marginal result was attributed to methodological problems; see Schimmack, 2012, for details) that supposedly demonstrated humans’ ability to foresee the future and to influence the past (e.g., learning after a test increased performance on a test that was taken before learning for the test). Schimmack (2012) used this article to demonstrate how important it can be to evaluate the credibility of multiple-study articles, and the incredibility index correctly predicted that these results would not replicate. So, it is simply naïve to assume that articles with more studies automatically strengthen the evidence for the existence of an effect and that 8 significant results cannot occur in the absence of a true effect (maybe MPK believe in ESP).

It is also not clear why researchers should have to wonder about the credibility of results in multiple-study articles.  A simple solution to the paradox is to report all results honestly.  If an honest set of studies provides evidence for an effect, it is not clear why researchers would prefer to engage in dishonest reporting practices. MPK provide no explanation for this practice and make no recommendation to increase honesty in the reporting of results as a simple solution to the replicability crisis in psychology.

They write, “the researcher may have conducted 10, or even 20, experiments until he/she obtained 8 successful experiments, but far more studies would have been needed had the effect not existed at all”. This is true, but we do not know how many studies a researcher conducted or what else a researcher did to the data unless all of this information is reported. If the combined evidence of 20 studies with 8 significant results shows that an effect is present, a researcher could just publish all 20 studies. What is the reason to hide over 50% of the evidence?

In the end, MPK assure readers that they “do not intend to defend underpowered studies” and they do suggest that “the most straightforward solution to this paradox is to conduct studies that have sufficient statistical power” (p. 112). I fully agree with these recommendations because powerful studies can provide real evidence for an effect and decrease the incentive to engage in dishonest practices.

It is discouraging that this article was published in a major review journal in social psychology. It is difficult to see how social psychology can regain trust if social psychologists believe they can simply continue engaging in dishonest reporting of results.

Fortunately, numerous social psychologists have responded to the replication crisis by demanding more honest research practices and by increasing statistical power of studies.  The article by MPK should not be considered representative of the response by all social psychologists and I hope MPK will agree that honest reporting of results is vital for a healthy science.


R-Index predicts lower replicability of “subliminal” studies than “attribution” studies in JESP

[Figure: PHP-curves for “subliminal” vs. “attribution” articles in JESP]

This post compares articles in the Journal of Experimental Social Psychology that contained the keyword “subliminal” to articles that contained the word “attribution”.

PHP-curves based on t-tests and F-tests in these articles are compared.  Both sets of articles show signs of publication bias (fewer non-significant studies are reported than predicted based on post-hoc power).

The shape of the histogram shows clear evidence of heterogeneity (the red curve fits the data better than the green curve).

The estimated power of studies with z-scores between 2 and 4 for subliminal articles is 31%.

The estimated power of studies with z-scores between 2 and 3 for attribution articles is 42%.

The R-Index for subliminal articles is 39%, whereas the R-Index for attribution articles is 49%.

The values for subliminal articles are also lower than the values for the whole set of articles in JESP.

In conclusion, these results suggest that subliminal priming studies are less replicable than other findings in social psychology and should be the target of high-powered replication studies.  These replication studies need to take into account that reported effect sizes are inflated to achieve high power.