Meta-Psychology: A new discipline and a new journal (draft)

Ulrich Schimmack and Rickard Carlsson

Psychology is a relatively young science that is just over 100 years old.  During its 100 years of existence, it has seen major changes in the way psychologists study the mind and behavior.  The first laboratories used a mix of methods and studied a broad range of topics. In the 1950s, behaviorism started to dominate psychology with studies of animal behavior. Then cognitive psychology took over, and computerized studies with reaction-time tasks started to dominate. In the 1990s, neuroscience took off, and today no top-ranked psychology department can function without one or more MRI magnets. Theoretical perspectives have also seen major changes.  In the 1960s, personality traits were declared non-existent. In the 1980s, twin studies were used to argue that everything is highly heritable, and nowadays gene-environment interactions and epigenetics dominate theoretical perspectives on the nature-nurture debate. These shifts in methods and perspectives are often called paradigm shifts.

It is hard to keep up with all of these paradigm shifts in a young science like psychology. Moreover, many psychology researchers are busy just keeping up with developments in their own paradigm. However, the pursuit of advancing research within a paradigm can be costly for researchers and for the science as a whole because this research may become obsolete after a paradigm shift. One senior psychologist once expressed regret that he had been a prisoner of a paradigm. To avoid a similar fate, it is helpful to have a broader perspective on developments in the field and to understand how progress in one area of psychology fits into the broader goal of understanding human minds and behaviors.  This is the aim of meta-psychology.  Meta-psychology is the scientific investigation of psychology as a science.  It questions the basic assumptions that underpin research paradigms and monitors the progress of psychological science as a whole.

Why We Need a Meta-Psychology Journal 

Most scientific journals focus on publishing original research articles or review articles (meta-analyses) of studies on a particular topic.  This makes it difficult to publish meta-psychological articles.  As publishing in peer-reviewed journals is used to evaluate researchers, few researchers dedicated time and energy to meta-psychology, and those who did often had difficulties finding an outlet for their work.

In 2006, Ed Diener created Perspectives on Psychological Science (PPS), published by the Association for Psychological Science.  The journal aims to publish an “eclectic mix of provocative reports and articles, including broad integrative reviews, overviews of research programs, meta-analyses, theoretical statements, and articles on topics such as the philosophy of science, opinion pieces about major issues in the field, autobiographical reflections of senior members of the field, and even occasional humorous essays and sketches.”  Not all of the articles in PPS are meta-psychology, but PPS created a home for meta-psychological articles.  We carefully examined articles in PPS to identify content areas of meta-psychology.

We believe that Meta-Psychology (MP) can fulfill an important role among the growing number of psychology journals.  Most importantly, PPS can only publish a small number of articles.  For-profit journals like PPS pride themselves on their high rejection rates.  We believe that high rejection rates create a problem and give editors and reviewers too much power to shape the scientific discourse and direction of psychology.  The power of editors is itself an important topic in meta-psychology.  In contrast to PPS, MP is an online journal with no strict page limits.  We will let the quality of published articles rather than rejection rates determine the prestige of our journal.

PPS is a for-profit journal, and published content is hidden behind paywalls. We think this is a major problem and does not serve the interests of scientists.  All articles published in MP will be open access.  One problem with some open-access journals is that they charge authors high fees to get their work published.  This gives authors from rich countries and authors with grants a competitive advantage. MP will not charge any fees.

In short, while we appreciate the contribution PPS has made to the development of meta-psychology, we see MP as a modern journal that meets psychology's need for an outlet that is dedicated to publishing meta-psychological articles without high rejection rates and without high costs to authors and readers.

Content Areas of Meta-Psychology 

1. Critical reflections on the process of data collection.

1.1.  Sampling

Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?
By: Buhrmester, Michael; Kwang, Tracy; Gosling, Samuel D.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 6   Issue: 1   Pages: 3-5   Published: JAN 2011

1.2.  Experimental Paradigms

Using Smartphones to Collect Behavioral Data in Psychological Science: Opportunities, Practical Considerations, and Challenges
By: Harari, Gabriella M.; Lane, Nicholas D.; Wang, Rui; et al.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 11   Issue: 6   Pages: 838-854   Published: NOV 2016

1.3. Validity

What Do Implicit Measures Tell Us? Scrutinizing the Validity of Three Common Assumptions
By: Gawronski, Bertram; Lebel, Etienne P.; Peters, Kurt R.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 2   Issue: 2   Pages: 181-193   Published: JUN 2007

 

2.  Critical reflections on statistical methods / tutorials on best practices

2.1.  Philosophy of Statistics

Bayesian Versus Orthodox Statistics: Which Side Are You On?
By: Dienes, Zoltan
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 6   Issue: 3   Pages: 274-290   Published: MAY 2011

2.2. Tutorials

Sailing From the Seas of Chaos Into the Corridor of Stability Practical Recommendations to Increase the Informational Value of Studies
By: Lakens, Daniel; Evers, Ellen R. K.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 9   Issue: 3   Pages: 278-292   Published: MAY 2014

3. Critical reflections on published results / replicability

3.1.  Fraud

Scientific Misconduct and the Myth of Self-Correction in Science
By: Stroebe, Wolfgang; Postmes, Tom; Spears, Russell
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 7   Issue: 6   Pages: 670-688   Published: NOV 2012

3.2. Publication Bias

Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition
By: Vul, Edward; Harris, Christine; Winkielman, Piotr; et al.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 4   Issue: 3   Pages: 274-290   Published: MAY 2009

3.3. Quality of Peer-Review

The Air We Breathe: A Critical Look at Practices and Alternatives in the Peer-Review Process
By: Suls, Jerry; Martin, Rene
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 4   Issue: 1   Pages: 40-50   Published: JAN 2009

4. Critical reflections on Paradigms and Paradigm Shifts

4.1  History

Sexual Orientation Differences as Deficits: Science and Stigma in the History of American Psychology
By: Herek, Gregory M.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 5   Issue: 6   Pages: 693-699   Published: NOV 2010

4.2. Topics

Domain Denigration and Process Preference in Academic Psychology
By: Rozin, Paul
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 1   Issue: 4   Pages: 365-376   Published: DEC 2006

4.3 Incentives

Giving Credit Where Credit’s Due: Why It’s So Hard to Do in Psychological Science
By: Simonton, Dean Keith
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 11   Issue: 6   Pages: 888-892   Published: NOV 2016

4.4 Politics

Political Diversity in Social and Personality Psychology
By: Inbar, Yoel; Lammers, Joris
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 7   Issue: 5   Pages: 496-503   Published: SEP 2012

4.5. Paradigms

Why the Cognitive Approach in Psychology Would Profit From a Functional Approach and Vice Versa
By: De Houwer, Jan
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 6   Issue: 2   Pages: 202-209   Published: MAR 2011

5. Critical reflections on teaching and dissemination of research

5.1  Teaching

Teaching Replication
By: Frank, Michael C.; Saxe, Rebecca
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 7   Issue: 6   Pages: 600-604   Published: NOV 2012

5.2. Coverage of research in textbooks

N.A.

5.3  Coverage of psychology in popular books

N.A.

5.4  Popular Media Coverage of Psychology

N.A.

5.5. Social Media and Psychology

N.A.

 

Vision and Impact Statement

Currently, PPS ranks number 7 among all psychology journals, with an Impact Factor of 6.08. The broad appeal of meta-psychology accounts for this relatively high impact factor. We believe that many articles published in MP will also achieve high citation rates, but we do not compete for the highest ranking.  A journal that publishes only 1 article a year will get a higher ratio of citations per article than a journal that publishes 10 articles a year.  We recognize that it is difficult to predict which articles will become citation classics, and we would rather publish one gem and nine so-so articles than miss out on publishing the gem. We anticipate that MP will publish many gems that PPS rejected, and we will be happy to give these articles a home.

This does not mean that MP will publish everything. We will harness the wisdom of crowds, and we encourage authors to share their manuscripts on pre-publication sites or on social media for critical commentary.  In addition, reviewers will help authors to improve their manuscripts, while authors can be assured that investing in major revisions will be rewarded with a better publication rather than an ultimate rejection that requires further changes to please editors at another journal.

 

 

 

 


2016 Replicability Rankings of 103 Psychology Journals

I post the rankings at the top.  Detailed information and statistical analyses are provided below the table.  You can click on a journal title to see the Powergraphs for each year.

Rank   Journal Change 2016 2015 2014 2013 2012 2011 2010 Mean
1 Social Indicators Research 10 90 70 65 75 65 72 73 73
2 Psychology of Music -13 81 59 67 61 69 85 84 72
3 Journal of Memory and Language 11 79 76 65 71 64 71 66 70
4 British Journal of Developmental Psychology -9 77 52 61 54 82 74 69 67
5 Journal of Occupational and Organizational Psychology 13 77 59 69 58 61 65 56 64
6 Journal of Comparative Psychology 13 76 71 77 74 68 61 66 70
7 Cognitive Psychology 7 75 73 72 69 66 74 66 71
8 Epilepsy & Behavior 5 75 72 79 70 68 76 69 73
9 Evolution & Human Behavior 16 75 57 73 55 38 57 62 60
10 International Journal of Intercultural Relations 0 75 43 70 75 62 67 62 65
11 Pain 5 75 70 75 67 64 65 74 70
12 Psychological Medicine 4 75 57 66 70 58 72 61 66
13 Annals of Behavioral Medicine 10 74 50 63 62 62 62 51 61
14 Developmental Psychology 17 74 72 73 67 61 63 58 67
15 Judgment and Decision Making -3 74 59 68 56 72 66 73 67
16 Psychology and Aging 6 74 66 78 65 74 66 66 70
17 Aggressive Behavior 16 73 70 66 49 60 67 52 62
18 Journal of Gerontology-Series B 3 73 60 65 65 55 79 59 65
19 Journal of Youth and Adolescence 13 73 66 82 67 61 57 66 67
20 Memory 5 73 56 79 70 65 64 64 67
21 Sex Roles 6 73 67 59 64 72 68 58 66
22 Journal of Experimental Psychology – Learning, Memory & Cognition 4 72 74 76 71 71 67 72 72
23 Journal of Social and Personal Relationships -6 72 51 57 55 60 60 75 61
24 Psychonomic Bulletin and Review 8 72 79 62 78 66 62 69 70
25 European Journal of Social Psychology 5 71 61 63 58 50 62 67 62
26 Journal of Applied Social Psychology 4 71 58 69 59 73 67 58 65
27 Journal of Experimental Psychology – Human Perception and Performance -4 71 68 72 69 70 78 72 71
28 Journal of Research in Personality 9 71 75 47 65 51 63 63 62
29 Journal of Child and Family Studies 0 70 60 63 60 56 64 69 63
30 Journal of Cognition and Development 5 70 53 62 54 50 61 61 59
31 Journal of Happiness Studies -9 70 64 66 77 60 74 80 70
32 Political Psychology 4 70 55 64 66 71 35 75 62
33 Cognition 2 69 68 70 71 67 68 67 69
34 Depression & Anxiety -6 69 57 66 71 77 77 61 68
35 European Journal of Personality 2 69 61 75 65 57 54 77 65
36 Journal of Applied Psychology 6 69 58 71 55 64 59 62 63
37 Journal of Cross-Cultural Psychology -4 69 74 69 76 62 73 79 72
38 Journal of Psychopathology and Behavioral Assessment -13 69 67 63 77 74 77 79 72
39 JPSP-Interpersonal Relationships and Group Processes 15 69 64 56 52 54 59 50 58
40 Social Psychology 3 69 70 66 61 64 72 64 67
41 Archives of Sexual Behavior -2 68 70 78 73 69 71 74 72
42 Journal of Affective Disorders 0 68 64 54 66 70 60 65 64
43 Journal of Experimental Child Psychology 2 68 71 70 65 66 66 70 68
44 Journal of Educational Psychology -11 67 61 66 69 73 69 76 69
45 Journal of Experimental Social Psychology 13 67 56 60 52 50 54 52 56
46 Memory and Cognition -3 67 72 69 68 75 66 73 70
47 Personality and Individual Differences 8 67 68 67 68 63 64 59 65
48 Psychophysiology -1 67 66 65 65 66 63 70 66
49 Cognitive Development 6 66 78 60 65 69 61 65 66
50 Frontiers in Psychology -8 66 65 67 63 65 60 83 67
51 Journal of Autism and Developmental Disorders 0 66 65 58 63 56 61 70 63
52 Journal of Experimental Psychology – General 5 66 69 67 72 63 68 61 67
53 Law and Human Behavior 1 66 69 53 75 67 73 57 66
54 Personal Relationships 19 66 59 63 67 66 41 48 59
55 Early Human Development 0 65 52 69 71 68 49 68 63
56 Attention, Perception and Psychophysics -1 64 69 70 71 72 68 66 69
57 Consciousness and Cognition -3 64 65 67 57 64 67 68 65
58 Journal of Vocational Behavior 5 64 78 66 78 71 74 57 70
59 The Journal of Positive Psychology 14 64 65 79 51 49 54 59 60
60 Behaviour Research and Therapy 7 63 73 73 66 69 63 60 67
61 Child Development 0 63 66 62 65 62 59 68 64
62 Emotion -1 63 61 56 66 62 57 65 61
63 JPSP-Personality Processes and Individual Differences 1 63 56 56 59 68 66 51 60
64 Schizophrenia Research 1 63 65 68 64 61 70 60 64
65 Self and Identity -4 63 52 61 62 50 55 71 59
66 Acta Psychologica -6 63 66 69 69 67 68 72 68
67 Behavioral Brain Research -3 62 67 61 62 64 65 67 64
68 Child Psychiatry and Human Development 5 62 72 83 73 50 82 58 69
69 Journal of Child Psychology and Psychiatry and Allied Disciplines 10 62 62 56 66 64 45 55 59
70 Journal of Consulting and Clinical Psychology 0 62 56 50 54 59 58 57 57
71 Journal of Counseling Psychology -3 62 70 60 74 72 56 72 67
72 Behavioral Neuroscience 1 61 66 63 62 65 58 64 63
73 Developmental Science -5 61 62 60 62 66 65 65 63
74 Journal of Experimental Psychology – Applied -4 61 61 65 53 69 57 69 62
75 Journal of Social Psychology -11 61 56 55 55 74 70 63 62
76 Social Psychological and Personality Science -5 61 42 56 59 59 65 53 56
77 Cognitive Therapy and Research 0 60 68 54 67 70 62 58 63
78 Hormones & Behavior -1 60 55 55 54 55 60 58 57
79 Motivation and Emotion 1 60 60 57 57 51 73 52 59
80 Organizational Behavior and Human Decision Processes 3 60 63 65 61 68 67 51 62
81 Psychoneuroendocrinology 5 60 58 58 56 53 59 53 57
82 Social Development -10 60 50 66 62 65 79 57 63
83 Appetite -10 59 57 57 65 64 66 67 62
84 Biological Psychology -6 59 60 55 57 57 65 64 60
85 Journal of Personality 17 59 59 60 62 69 37 45 56
86 Psychological Science 6 59 63 60 63 59 55 56 59
87 Asian Journal of Social Psychology 0 58 76 67 56 71 64 64 65
88 Behavior Therapy 0 58 63 66 69 66 52 65 63
89 British Journal of Social Psychology 0 58 57 44 59 51 59 55 55
90 Social Influence 18 58 72 56 52 33 59 46 54
91 Developmental Psychobiology -9 57 54 61 60 70 64 62 61
92 Journal of Research on Adolescence 2 57 59 61 82 71 75 40 64
93 Journal of Abnormal Psychology -5 56 52 57 58 55 66 55 57
94 Social Cognition -2 56 54 52 54 62 69 46 56
95 Personality and Social Psychology Bulletin 2 55 57 58 55 53 56 54 55
96 Cognition and Emotion -14 54 66 61 62 76 69 69 65
97 Health Psychology -4 51 67 56 72 54 69 56 61
98 Journal of Clinical Child and Adolescent Psychology 1 51 66 61 74 64 58 54 61
99 Journal of Family Psychology -7 50 52 63 61 57 64 55 57
100 Group Processes & Intergroup Relations -5 49 53 68 64 54 62 55 58
101 Infancy -8 47 44 60 55 48 63 51 53
102 Journal of Consumer Psychology -5 46 57 55 51 53 48 61 53
103 JPSP-Attitudes & Social Cognition -3 45 69 62 39 54 54 62 55

Notes.
1. Change scores are the unstandardized regression weights from a regression with the replicability estimates as the outcome variable and year as the predictor variable.  Year was coded 0 for 2010 up to 1 for 2016 so that the regression coefficient reflects the change over the full 7-year period (see the sketch after these notes). This method is preferable to a simple difference score because estimates in individual years are variable, and a difference score is likely to overestimate change.
2. Rich E. Lucas, Editor of JRP, noted that many articles in JRP do not report t or F values in the text and that the replicability estimates based on these statistics may not be representative of the bulk of results reported in this journal.  Hand-coding of articles is required to address this problem, and the rankings of JRP and other journals should be interpreted with caution (see further discussion of these issues below).
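To make the change-score computation concrete, here is a minimal R sketch for a single journal; the replicability values are hypothetical and only illustrate the coding of year from 0 (2010) to 1 (2016), with the unstandardized slope serving as the change score.

rep.est = c(62, 65, 60, 64, 66, 63, 70)    # hypothetical replicability estimates, 2010-2016
year = seq(0, 1, length.out = 7)           # 2010 coded as 0, 2016 coded as 1
fit = lm(rep.est ~ year)
coef(fit)["year"]                          # unstandardized slope = the change score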

Introduction

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result.  In the past five years, it has become increasingly clear that psychology suffers from a replication crisis. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011).  The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated  (OSC, 2015).

The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability.  However, the reliance on actual replication studies has a number of limitations.  First, actual replication studies can be expensive or impossible (e.g., for a longitudinal study spanning 20 years).  Second, studies selected for replication may not be representative because the replication team lacks the expertise to replicate some studies. Finally, replication studies take time, and the replicability of recent studies may not be known for several years. This makes it difficult to rely on actual replication studies to rank journals and to track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate the average replicability of a set of published results based on the original results in published articles.  This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies.  Replicability can be assessed in real time, it can be estimated for all published results, and it can be used for expensive studies that are impossible to reproduce.  Finally, actual replication studies can be criticized for failing to recreate the conditions of the original study (Gilbert, King, Pettigrew, & Wilson, 2016). Estimates of replicability based on original studies do not have this problem because they are based on the results reported in the original articles.

Z-curve has been validated with simulation studies; it can be used when replicability varies across studies and when there is selection for significance, and it is superior to similar statistical methods that correct for publication bias (Brunner & Schimmack, 2016).  I use this method to estimate the average replicability of significant results published in 103 psychology journals. Separate estimates were obtained for the years from 2010, one year before the start of the replication crisis, to 2016 to examine whether replicability increased in response to discussions about replicability.  The OSC estimate of replicability was based on articles published in 2008 and was limited to three journals.  I posted replicability estimates based on z-curve for the year 2015 (2015 replicability rankings).  There was no evidence that replicability had increased up to that point.

The main empirical question was whether the 2016 rankings show some improvement in replicability and whether some journals or disciplines have responded more strongly to the replication crisis than others.

A second empirical question was whether replicability varies across disciplines.  The OSC project provided first evidence that traditional cognitive psychology is more replicable than social psychology.  Replicability estimates with z-curve confirmed this finding.  In the 2015 rankings, the Journal of Experimental Psychology: Learning, Memory, and Cognition ranked 25th with a replicability estimate of 74%, whereas the two social psychology sections of the Journal of Personality and Social Psychology ranked 73rd and 99th (68% and 60% replicability estimates).  For this post, I conducted more extensive analyses of disciplines.

Journals

The 103 journals included in these rankings were chosen mainly on the basis of impact factors.  The list covers diverse areas of psychology, including cognitive, developmental, social, personality, clinical, biological, and applied psychology.  The 2015 list included some new journals that started after 2010.  These journals were excluded from the 2016 rankings to avoid missing values in the statistical analyses of time trends.  A few journals were added to the list, and the results may change as more journals are added.

The journals were classified into 9 categories: social (24), cognitive (12), developmental (15), clinical/medical (19), biological (8), personality (5), and applied (I/O, education) (8).  Two journals were classified as general (Psychological Science, Frontiers in Psychology). The last category included topical, interdisciplinary journals (e.g., emotion, positive psychology).

Data 

All PDF versions of published articles were downloaded and converted into text files. The 2015 rankings were based on conversions with the free program pdf2text pilot.  The 2016 rankings used a superior conversion program, pdfzilla.  Text files were searched for reports of statistical results using my own R code (z-extraction). Only F-tests, t-tests, and z-tests were used for the rankings. t-values that were reported without degrees of freedom were treated as z-values, which leads to a slight inflation of replicability estimates; however, the bulk of the test statistics were F-values and t-values with degrees of freedom.  A comparison of the 2015 rankings based on the old and the new extraction method shows that the extraction method has some influence on replicability estimates (r = .56). One reason for this low correlation is that replicability estimates have a relatively small range (50-80%), which limits retest correlations. Thus, even small changes can have notable effects on rankings. For this reason, time trends in replicability have to be examined at the aggregate level of journals or over longer time intervals. The change score of a single journal from 2015 to 2016 is not a reliable measure of improvement.
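The z-extraction code itself is not reproduced here, but the following minimal R sketch illustrates the general idea of pulling t-tests and F-tests out of article text; the example sentence and the simplified regular expressions are hypothetical stand-ins, not the actual extraction code.

txt = "The effect was significant, t(48) = 2.31, p = .025, and F(2, 96) = 4.50, p = .014."
t.pattern = "t\\(\\s*\\d+\\s*\\)\\s*=\\s*-?\\d+\\.?\\d*"              # simplified pattern for t-tests
F.pattern = "F\\(\\s*\\d+\\s*,\\s*\\d+\\s*\\)\\s*=\\s*\\d+\\.?\\d*"   # simplified pattern for F-tests
unlist(regmatches(txt, gregexpr(t.pattern, txt)))                     # extracted t-test reports
unlist(regmatches(txt, gregexpr(F.pattern, txt)))                     # extracted F-test reports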

Data Analysis

The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016).  The results of the individual analyses are presented in Powergraphs. Powergraphs for each journal and year are linked to the journal names in the ranking table.  Powergraphs convert test statistics into absolute z-scores as a common metric for the strength of evidence against the null hypothesis.  Absolute z-scores greater than 1.96 (p < .05, two-tailed) are considered statistically significant. The distribution of z-scores greater than 1.96 is used to estimate the average true power (not observed power) of the set of significant studies. This estimate is an estimate of replicability for a set of exact replication studies because average power determines the percentage of statistically significant results.  Powergraphs provide additional information about replicability for different ranges of z-scores (z-values between 2 and 2.5 are less replicable than those between 4 and 4.5).  However, for the replicability rankings only the overall replicability estimate is used.
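As an illustration of the first step, converting test statistics into absolute z-scores via their two-tailed p-values, here is a minimal R sketch; the test statistics are hypothetical, and the z-curve estimation step itself is not shown.

t.val = 2.31; t.df = 48                              # hypothetical t-test
F.val = 4.50; F.df1 = 2; F.df2 = 96                  # hypothetical F-test
p.t = 2 * pt(abs(t.val), t.df, lower.tail = FALSE)   # two-tailed p-value of the t-test
p.F = pf(F.val, F.df1, F.df2, lower.tail = FALSE)    # p-value of the F-test
z = qnorm(1 - c(p.t, p.F) / 2)                       # convert p-values into absolute z-scores
z
z > 1.96                                             # statistically significant (p < .05, two-tailed)?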

Results

Table 1 shows the replicability estimates sorted by replicability in 2016.

The data were analyzed with a growth model to examine time trends and variability across journals and disciplines using MPLUS 7.4.  I compared three models. Model 1 assumed no mean-level change and variability across journals. Model 2 assumed a linear increase. Model 3 assumed no change from 2010 to 2015 and allowed for an increase in 2016.

Model 1 had acceptable fit (RMSEA = .043, BIC = 5004). Model 2 improved fit (RMSEA = .029, BIC = 5005), but BIC slightly favored the more parsimonious Model 1. Model 3 had the best fit (RMSEA = .000, BIC = 5001).  These results reproduce the 2015 analysis in showing no improvement from 2010 to 2015, but there is some evidence that replicability increased in 2016.  Adding a variance component to the slope in Model 3 produced an unidentified model. Subsequent analyses show that this is due to insufficient power to detect variation across journals in changes over time.

The standardized loadings of individual years on the latent intercept factor ranged from .49 to .58.  This shows high variability in replicability estimates from year to year. Most of the rank changes can therefore be attributed to random factors.  A better way to compare journals is to average across years.  A moving average of five years provides more reliable information while still allowing for improvement over time.  The reliability of the 5-year average for the years 2012 to 2016 is 68%.

Figure 1 shows the annual averages with 95% confidence intervals relative to the average over the full 7-year period.

[Figure 1: Average replicability estimates by year]

A paired t-test confirmed that average replicability in 2016 (M = 65, SD = 8) was significantly higher than in the previous years (M = 63, SD = 8), t(101) = 2.95, p = .004.  This is the first evidence that psychological scientists are responding to the replicability crisis by publishing slightly more replicable results.  Of course, this positive result has to be tempered by the small effect size.  But if this trend continues or even accelerates, replicability could reach 80% in 10 years.
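For readers who want to rerun this kind of comparison with the estimates in Table 1, here is a minimal R sketch of the paired t-test; the data are simulated with made-up values and simply stand in for each journal's 2016 estimate and its 2010-2015 average.

set.seed(1)
rep.prior = rnorm(102, mean = 63, sd = 8)              # hypothetical 2010-2015 averages for 102 journals
rep.2016  = rep.prior + rnorm(102, mean = 2, sd = 7)   # hypothetical 2016 estimates
t.test(rep.2016, rep.prior, paired = TRUE)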

The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2016 with the previous years.  This analysis produced only 7 significant increases at p < .05 (one-tailed), which is only 2 more significant results than would be expected by chance alone. Thus, the analysis failed to identify particular journals that contribute to the improvement in the average.  Figure 2 compares the observed distribution of t-values to the predicted distribution based on the null hypothesis (no change).

[Figure 2: Observed versus predicted distribution of t-values]

The blue line shows the observed density distribution, which is shifted slightly to the right, but there is no set of journals with notably larger t-values.  A more sustained and larger increase in replicability is needed to detect variability in change scores.

The next analyses examine stable differences between disciplines.  The first analysis compared cognitive journals to social journals.  No statistical tests are needed to see that cognitive journals publish more replicable results than social journals. This finding confirms the results of actual replications of studies published in 2008 (OSC, 2015). The figure suggests that the improvement in 2016 is driven more by social journals, but only the 2017 data will tell whether there is a real improvement in social psychology.

[Figure 3: Replicability of cognitive versus social journals]

The next figure shows the results for the 5 personality journals.  The large confidence intervals show that there is considerable variability among personality journals. The figure shows the averages for cognitive and social psychology as horizontal lines. The average for personality is only slightly above the average for social psychology and, like social psychology, personality shows an upward trend.  In conclusion, personality and social psychology look very similar.  This may be due to the considerable overlap between the two disciplines, which is also reflected in shared journals.  Larger differences may be visible for specialized social journals that focus on experimental social psychology.

[Figure 4: Replicability of personality journals]

The results for developmental journals show no clear time trend, and the average is about midway between cognitive and social psychology.  The wide confidence intervals suggest that there is considerable variability among developmental journals. Table 1 shows that Developmental Psychology ranks 14/103 and Infancy ranks 101/103. The low rank for Infancy may be due to the great difficulty of measuring infant behavior.

[Figure 5: Replicability of developmental journals]

The clinical/medical journals cover a wide range of topics from health psychology to special areas of psychiatry.  There has been some concern about replicability in medical research (Ioannidis, 2005). The results for clinical journals are similar to those for developmental journals: replicability is lower than for cognitive psychology and higher than for social psychology.  This may seem surprising because patient populations are limited and samples tend to be small. However, randomized controlled intervention studies use pre-post designs to boost power, whereas social and personality psychologists typically rely on comparisons across individuals, which require large samples to reduce sampling error.

 

[Figure 6: Replicability of clinical/medical journals]

The set of biological journals is very heterogeneous and small. It includes neuroscience and classic peripheral physiology.  Despite wide confidence intervals, replicability for biological journals is significantly lower than replicability for cognitive psychology. There is no notable time trend. The average is slightly above the average for social journals.

[Figure 7: Replicability of biological journals]

 

The last category is applied journals. One journal focuses on education; the other journals focus on industrial and organizational psychology.  Confidence intervals are wide, but replicability is generally lower than for cognitive psychology. There is no notable time trend for this set of journals.

[Figure 8: Replicability of applied journals]

Given the stability of replicability, I averaged the replicability estimates across years. The last figure shows a comparison of disciplines based on these averages.  Social psychology is significantly below average and cognitive psychology is significantly above average, with the other disciplines falling in between.  All averages are significantly above 50% and below 80%.

Discussion

The most exciting finding is that replicability appears to have increased in 2016. This increase is remarkable because the averages in the preceding years consistently tracked the overall average of 63%.  The increase by 2 percentage points in 2016 is not large, but it may represent a first response to the replication crisis.

The increase is particularly remarkable because statisticians have been sounding alarm bells about low power and publication bias for over 50 years (Cohen, 1962; Sterling, 1959), but these warnings had no effect on research practices. Sedlmeier and Gigerenzer (1989) noted that studies of statistical power had no effect on the statistical power of studies.  The present results provide the first empirical evidence that psychologists are finally starting to change their research practices.

However, the results also suggest that most journals continue to publish articles with low power.  The replication crisis has affected social psychology more than other disciplines, with fierce debates in journals and on social media (Schimmack, 2016).  On the one hand, the comparison of disciplines supports the impression that social psychology has a bigger replicability problem than other disciplines. On the other hand, the differences between disciplines are small; with the exception of cognitive psychology, the other disciplines are not much more replicable than social psychology.  The main reason for the focus on social psychology is probably that its studies are easier to replicate and that there have been more replication studies in social psychology in recent years.  The replicability rankings predict that other disciplines would also see a large number of replication failures if they subjected important findings to actual replication attempts.  Only empirical data will tell.

Limitations

The main limitation of the replicability rankings is that the automatic extraction method does not distinguish between theoretically important hypothesis tests and other statistical tests.  Although this is a problem for the interpretation of the absolute estimates, it is less important for the comparison over time.  Any changes in research practices that reduce sampling error (e.g., larger samples, more reliable measures) will not only strengthen the evidence for focal hypothesis tests, but also increase the strength of evidence for non-focal hypothesis tests.

Schimmack and Brunner (2016) compared replicability estimates with actual success rates in the OSC (2015) replication studies.  They found that the statistical method overestimates replicability by about 20%.  Thus, the absolute estimates can be interpreted as very optimistic estimates.  There are several reasons for this overestimation.  One reason is that the estimation method assumes that all results with a p-value greater than .05 are equally likely to be published. If there are further selection mechanisms that favor smaller p-values, the method overestimates replicability.  For example, sometimes researchers correct for multiple comparisons and need to meet a more stringent significance criterion.  Only careful hand-coding of research articles can provide more accurate estimates of replicability.  Schimmack and Brunner (2016) hand-coded the articles that were included in the OSC (2015) article and still found that the method overestimated replicability.  Thus, the absolute values need to be interpreted with great caution and success rates of actual replication studies are expected to be at least 10% lower than these estimates.

Implications

Power and replicability have been ignored for over 50 years.  A likely reason is that replicability is difficult to measure.  A statistical method for the estimation of replicability changes this. Replicability estimates of journals make it possible for editors to compete with other journals in the replicability rankings. Flashy journals with high impact factors may publish eye-catching results, but if a journal has a reputation for publishing results that do not replicate, it is unlikely to have a big impact in the long run.  Science is built on trust, and trust has to be earned and can easily be lost.  Eventually, journals that publish replicable results may also increase their impact because more researchers will build on the replicable results published in these journals.  In this way, replicability rankings can provide a much needed correction to the current incentive structure in science, which rewards publishing as many articles as possible without any concern for the replicability of the results. This reward structure is undermining science.  It is time to change it. It is no longer sufficient to publish a significant result if this result cannot be replicated in other labs.

Many scientists feel threatened by changes in the incentive structure and by the negative consequences of replication failures for their reputation. However, researchers have control over their reputation.  Researchers often carry out many conceptually related studies. In the past, it was acceptable to publish only the studies that worked (p < .05). This selection for significance by researchers is the key factor in the replication crisis. The researchers who conduct the studies are fully aware that it was difficult to get a significant result, but the selective reporting of these successes produces inflated effect size estimates and an illusion of high replicability that inevitably leads to replication failures.  To avoid these embarrassing replication failures, researchers need to report the results of all studies or conduct fewer studies with high power.  The 2016 rankings suggest that some researchers have started to change, but we will have to wait until 2017 to see whether the positive trend in the 2016 rankings can be replicated.

 

 

 

 

An Attempt at Explaining Null-Hypothesis Testing and Statistical Power with 1 Figure and 1,500 Words

Is a Figure worth 1,500 words?

[Figure: sampling distributions of z-scores under the null hypothesis (red curve) and under an alternative hypothesis (blue curve)]

Created with G*Power: http://www.gpower.hhu.de/en.html

Significance Testing

1. The red curve shows the sampling distribution of the signal/noise ratio (z-score) if there is no effect. Most results will give a signal/noise ratio close to 0 because there is no effect (0/1 = 0).

2. Sometimes sampling error can produce large signals, but these events are rare.

3. To be confident that we have a real signal, we can choose a high criterion for deciding that there was an effect (rejecting H0). Normally, we use a 2:1 ratio (z > 2) to do so, but we could use a higher or lower criterion value.  This value is shown by the green vertical line in the figure.

4. A z-score greater than 2 cuts off only 2.5% of the red distribution. This means we would expect only 2.5% of outcomes with z-scores greater than 2 if there is no effect. If we used the same criterion for negative effects, we would get another 2.5% in the lower tail of the red distribution. Combined, we would have 5% of cases where we have a false positive, that is, where we decide that there is an effect when there is no effect. This is why we say p < .05 to call a result significant. The probability (p) of a false positive result is no greater than 5% if we keep repeating studies and using z > 2 as the criterion to claim an effect. If there is never an effect in any of the studies we are doing, we end up with 5% false positive results. A false positive is also called a type-I error: we make the mistake of inferring from our study that an effect is present when there is no effect.
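A quick R check of these tail areas (using the exact two-tailed criterion of 1.96 rather than the rounded value of 2):

qnorm(.975)                          # criterion z-value for alpha = .05, two-tailed (about 1.96)
pnorm(2, lower.tail = FALSE)         # upper-tail area beyond z = 2 (about 2.3%)
2 * pnorm(2, lower.tail = FALSE)     # both tails combined (about 4.6%)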

Statistical Power

5. Now that you understand significance testing (LOL), we can introduce the concept of statistical power. Effects can be large or small. For example, gender differences in height are large; gender differences in the number of sexual partners are small.  Also, studies can have a lot of sampling error or very little sampling error.  A study of 10 men and 10 women may accidentally include 2 women who are on the basketball team; a study of 1,000 men and women is likely to be more representative of the population.  Based on the effect size in the population and the sample size, the true signal (effect size in the population) to noise (sampling error) ratio can differ.  The higher the signal-to-noise ratio is, the further to the right the sampling distribution of the real data (the blue curve) will be.  In the figure, the population effect size and sampling error produce an expected z-score of 2.8, but an actual sample will virtually never produce exactly this value. Sampling error will again produce z-scores above or below the expected value of 2.8.  Most samples will produce values close to 2.8, but some samples will produce more extreme deviations.  Samples that overestimate the expected value of 2.8 are not a problem because these values are all greater than the criterion for statistical significance. So, in all of these samples we will make the right decision to infer that an effect is present when an effect is present, a so-called true positive result.  Even if sampling error leads to a small underestimation of the expected value of 2.8, the values can still be above the criterion for statistical significance and we get a true positive result.

6. When sampling error leads to a more extreme underestimation of the expected value of 2.8, samples may produce results with a z-score less than 2.  Now the result is no longer statistically significant. These cases are called false negatives or type-II errors: we fail to infer that an effect is present when there actually is an effect (think of a faulty pregnancy test that fails to detect that a woman is pregnant).  It does not matter whether we actually infer that there is no effect or remain undecided about the presence of an effect; we did a study where an effect exists and we failed to provide sufficient evidence for it.

7. The figure shows the probability of making a type-II error as the area of the blue curve to the left of the green line.  In this example, 20% of the blue curve is to the left of the green line. This means 20% of all samples with an expected value of 2.8 will produce false negative results.

8. We can also focus on the area of the blue curve to the right of the green line.  If 20% of the area is on the left side, 80% of the area must be on the right side.  This means we have an 80% probability of obtaining a true positive result, that is, a statistically significant result where the observed z-score is greater than the criterion z-score of 2.   This probability is called statistical power.  A study with high power has a high probability of discovering real effects by producing z-scores greater than the criterion value. A study with low power has a high probability of producing a false negative result by producing z-scores below the criterion value.
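In R, both areas can be computed directly from the expected z-score and the criterion value (using the exact criterion of 1.96):

z.crit = qnorm(.975)                     # criterion z-score (about 1.96)
z.expected = 2.8                         # expected z-score of the blue curve
pnorm(z.crit, mean = z.expected)         # type-II error rate (about 20%)
1 - pnorm(z.crit, mean = z.expected)     # statistical power (about 80%)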

9. Power depends on the criterion value and the expected value.  We could reduce the type-II error and increase power in the figure by moving the green line to the left.  As we lower the criterion for claiming an effect, we reduce the area of the blue curve on the left side of the line. We are now less likely to encounter false negative results when an effect is present.  However, there is a catch.  By moving the green line to the left, we are increasing the area of the red curve on the right side of the green line. This means we are increasing the probability of a false positive result.  To avoid this problem, we can keep the green line where it is and move the expected value of the blue curve to the right.  By shifting the blue curve to the right, a smaller area of the blue curve will be on the left side of the green line.

10. In order to move the blue curve to the right, we need to increase the effect size or reduce sampling error.  In experiments, it may be possible to use more powerful manipulations to increase effect sizes.  However, increasing effect sizes is often not an option.  How would you increase the effect size of gender on the number of sexual partners?  Therefore, your best option is usually to reduce sampling error.  As sampling error decreases, the blue curve moves further to the right and statistical power increases.

Practical Relevance: The Hunger Games of Science: With high power the odds are always in your favor

11. Learning about statistical power is important because the outcome of your studies does not just depend on your expertise. It also depends on factors that are not under your control. Sampling error can sometimes help you to get significance by giving you z-scores higher than the expected value, but these z-scores will not replicate because sampling error can also be your enemy and lower your z-scores.  In this way, each study that you do is a bit like playing the lottery or opening a box of chocolates: you never know how much sampling error you will get.  The good news is that you are in charge of the number of winning tickets in the lottery.  A study with 20% power has only 20% winning tickets; the other 80% say, “please play again.”  A study with 80% power has 80% winning tickets.  You have a high chance of getting a significant result, and you or others will be able to redo the study and again have a high chance of replicating your original result.  It can be embarrassing when somebody conducts a replication study of your significant result and ends up with a failure to replicate your finding.  You can avoid this outcome by conducting studies with high statistical power.

12. Of course, there is a price to pay. Reducing sampling error often requires more time and participants. Unfortunately, the costs increase steeply: it is easier to increase statistical power from 20% to 50% than from 50% to 80%, and it is even more costly to increase it from 80% to 90%.  This is what economists call diminishing marginal utility: initially you get a lot of bang for your buck, but eventually the costs for any real gains become too high.  For this reason, Cohen (1988) recommended that researchers aim for 80% power in their studies.  This means that 80% of your initial attempts to demonstrate an effect will succeed when your hard work in planning and conducting a study produced a real effect.  For the remaining 20% of studies, you may either give up or try again to see whether your first study produced a true negative result (there is no effect) or a false negative result (you did everything correctly, but sampling error handed you a losing ticket).  Failure is part of life, but you have some control over the amount of failure that you encounter.
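As a rough illustration of these costs, assuming a two-sample t-test with a medium standardized effect size of d = .5 (an assumption for this sketch, not a value from the text), base R's power.t.test shows how the required sample size per group grows with the desired power:

# sample size per group for 50%, 80%, and 90% power (d = .5, alpha = .05, two-sided)
sapply(c(.50, .80, .90), function(pow)
  ceiling(power.t.test(delta = 0.5, sd = 1, sig.level = .05, power = pow)$n))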

13. The End. You are now ready to learn how to conduct power analyses for actual studies and take control of your fate.  Be a winner, not a loser.

 

Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis.” Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, observed effect sizes must be inflated to reach significance; thus, all significant effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5% [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval instead. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).
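The 62.5% figure can be verified with a short R calculation under the normal approximation: with 80% power, the expected z-score sits at the 50th percentile of the sampling distribution while significant results make up 80% of it, so .5/.8 of the significant results overestimate the expected value.

z.crit = qnorm(.975)                                   # significance criterion
mu = z.crit + qnorm(.80)                               # expected z-score for 80% power (about 2.80)
p.over = pnorm(mu, mean = mu, lower.tail = FALSE)      # P(observed z > expected z) = .50
p.sig  = pnorm(z.crit, mean = mu, lower.tail = FALSE)  # P(significant result) = .80
p.over / p.sig                                         # P(overestimation | significance) = .625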

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that, under the specified conditions, random error may produce even more inflated estimates than a perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always attenuates observed effect sizes, even when results are selected for significance. We demonstrate this fact with a simple equation showing that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, this monotonic relationship implies that observed effect sizes based on unreliable measures are also always attenuated.  We provide the formula and R code in a Supplement. Here we just give a brief description of the steps involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between the two measures. Random error also increases sampling error. As the non-central t-value is the ratio of these two quantities, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power: with high power, most results are significant and inflation is small; with low power, most results are non-significant and inflation is large.
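For reference, the attenuation of the population correlation follows the classic formula r_observed = r_true * sqrt(rel_X * rel_Y); a one-line R check with the population correlation used below (r = .15) and reliabilities of .80 for both measures (an illustrative assumption) gives:

true.pop.r = .15
rel.x = .80; rel.y = .80
true.pop.r * sqrt(rel.x * rel.y)   # attenuated population correlation = .12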

[Figure 1: Median observed power after selection for significance as a function of true power]

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Loken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50–3,050 to 25–1,000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear, and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data, and it is irrelevant whether the noise is due to measurement error or to unexplained reliable variance.

[Figure 2: Median observed effect sizes after selection for significance as a function of sample size and reliability]

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585. doi: 10.1126/science.aal3618

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes
N = seq(25,500,5)

### true population correlation
true.pop.r = .15

### reliability of the measures
rel = 1-seq(0,.9,.20)

### line colors, one per reliability level
col = rainbow(length(rel))

### create matrix of population correlations between measures X and Y
obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes
N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values
ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N - 2)))

### compute true power
true.power = pt(ncp.t,N-2,qt(.975,N-2))

### get inflated observed power after selection for significance
inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### transform into inflated observed t-values
inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### transform inflated observed t-values into inflated observed effect sizes
inf.obs.es = (sqrt(N + 4*inf.obs.t^2 - 2) - sqrt(N - 2))/(2*inf.obs.t)

### set parameters for the figure
x.min = 0
x.max = 500
y.min = 0.10
y.max = 0.45
ylab = "Inflated Observed Effect Size"
title = "Effect of Selection for Significance on Observed Effect Size"

### create the figure (one curve per reliability level, with a simple legend drawn inside the plot region)
for (i in 1:length(rel)) {
print(i)
plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab="Median Observed Effect Size After Selection for Significance",lwd=3,main=title)
segments(x0 = 350,y0 = y.max-.05-i*.02, x1 = 400,col=col[i], lwd=5)
text(450,y.max-.05-i*.02,paste0("Rel = ",format(rel[i],nsmall=1)))
par(new=TRUE)
}

### dashed horizontal line at the true population correlation
abline(h = .15,lty=2)

##################### THE END #################################

How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people understand the concepts of true power and observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no closed-form formula to correct observed power for inflation and solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To apply this method to real data, where only the median observed power of the significant results is known, one can simply generate a range of true power values, compute the predicted median observed power for each value, and then pick the true power value with the smallest discrepancy between the observed and the predicted median observed power.  This approach is essentially the same as the approach used by p-curve and p-uniform, which differ only in the criterion that is being minimized.

Here is the R code for the conversion of true power into the predicted median observed power after selection for significance.

z.crit = qnorm(.975)   # critical z-value for alpha = .05, two-tailed
true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.
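These values can be checked against the formula above:

check = c(.05, .50, .80)     # true power equal to alpha, 50%, and 80%
round(pnorm(qnorm(check/2 + (1-check), qnorm(check, z.crit)), z.crit), 2)   # inflated median observed power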

[Figure: Inflated median observed power as a function of true power]

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
obs.power = pnorm(qnorm(1 - p/2), z.crit)   # p = the reported two-tailed p-value

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.
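
A minimal sketch of this grid-search conversion (assuming alpha = .05, two-tailed, a single homogeneous level of true power, and selection for significance as the only bias; the input value .75 is a hypothetical observed median power):

z.crit = qnorm(.975)
true.power = seq(.01, .99, .01)
predicted.mop = pnorm(qnorm(true.power/2 + (1 - true.power),
                            qnorm(true.power, z.crit)), z.crit)

est.true.power = function(observed.mop) {
  # pick the true power whose predicted median observed power is closest
  true.power[which.min(abs(predicted.mop - observed.mop))]
}

est.true.power(.75)   # returns .50, matching the example above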

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.

P.S.  It is possible to prove the formula that transforms true power into median observed power. Another way to verify the formula is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000                       # number of simulated studies per power level
z.crit = qnorm(.975)                 # two-tailed significance criterion
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
  # simulate z-values, keep only significant ones, and record the observed power
  # implied by the median significant z-value
  z.sim = rnorm(n.sim, qnorm(true.power[i], z.crit))
  med.z.sig = median(z.sim[z.sim > z.crit])
  obs.pow.sim = c(obs.pow.sim, pnorm(med.z.sig, z.crit))
}
obs.pow.sim

# compare the simulated values with the formula-based predictions
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

Authors:  Ulrich Schimmack, Moritz Heene, and Kamini Kesavan

 

Abstract:
We computed the R-Index for studies cited in Chapter 4 of Kahneman’s book “Thinking Fast and Slow.” This chapter focuses on priming studies, starting with John Bargh’s study that led to Kahneman’s open email.  The results are eye-opening and jaw-dropping.  The chapter cites 12 articles and 11 of the 12 articles have an R-Index below 50.  The combined analysis of 31 studies reported in the 12 articles shows 100% significant results with average (median) observed power of 57% and an inflation rate of 43%.  The R-Index is 14. This result confirms Kahneman’s prediction that priming research is a train wreck and readers of his book “Thinking Fast and Slow” should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.

Introduction

In 2011, Nobel Laureate Daniel Kahneman published a popular book, “Thinking Fast and Slow”, about important findings in social psychology.

In the same year, questions about the trustworthiness of social psychology were raised.  A Dutch social psychologist had fabricated data. Eventually over 50 of his articles would be retracted.  Another social psychologist published results that appeared to demonstrate the ability to foresee random future events (Bem, 2011). Few researchers believed these results and statistical analysis suggested that the results were not trustworthy (Francis, 2012; Schimmack, 2012).  Psychologists started to openly question the credibility of published results.

At the beginning of 2012, Doyen and colleagues published a failure to replicate a prominent study by John Bargh that was featured in Daniel Kahneman’s book. A few months later, Daniel Kahneman distanced himself from Bargh’s research in an open email addressed to John Bargh (Young, 2012):

“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.”

Five years later, Kahneman’s concerns have been largely confirmed. Major studies in social priming research have failed to replicate and the replicability of results in social psychology is estimated to be only 25% (OSC, 2015).

Looking back, it is difficult to understand the uncritical acceptance of social priming as a fact.  In “Thinking Fast and Slow” Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Yet, Kahneman could have seen the train wreck coming. In 1971, he co-authored an article about scientists’ “exaggerated confidence in the validity of conclusions based on small samples” (Tversky & Kahneman, 1971, p. 105). Nevertheless, many of the studies described in Kahneman’s book had small samples. For example, Bargh’s priming study used only 30 undergraduate students to demonstrate the effect.

Replicability Index

Small samples can be sufficient to detect large effects. However, small effects require large samples. The probability of replicating a published finding is a function of sample size and effect size. The Replicability Index (R-Index) uses information in published results to predict how replicable those results are.

Every reported test-statistic can be converted into an estimate of power, called observed power. For a single study, this estimate is useless because it is not very precise. However, for sets of studies, the estimate becomes more precise.  If we have 10 studies and the average power is 55%, we would expect approximately 5 to 6 studies with significant results and 4 to 5 studies with non-significant results.

If we observe 100% significant results with an average power of 55%, it is likely that studies with non-significant results are missing (Schimmack, 2012). There are too many significant results. This is especially true because average observed power is itself inflated when researchers report only significant results. Consequently, true power is even lower than average observed power. If we observe 100% significant results with 55% average observed power, true power is likely to be less than 50%.
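
A minimal sketch of this reasoning: if each of 10 independent studies truly had 55% power, the expected number of significant results is 5.5, and the probability of observing 10 out of 10 significant results is very small.

power = .55
k = 10
k * power                           # expected number of significant results: 5.5
dbinom(k, size = k, prob = power)   # probability of 10/10 significant: ~ .0025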

This is unacceptable. Tversky and Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.”

To correct for this inflation, the R-Index uses the inflation rate: the difference between the success rate and average observed power. For example, if all studies are significant (a 100% success rate) and average observed power is 75%, the inflation rate is 25 percentage points. The R-Index subtracts the inflation rate from average observed power. So, with 100% significant results and average observed power of 75%, the R-Index is 50% (75% – 25% = 50%). The R-Index is not a direct estimate of true power. It is a conservative estimate of true power when the R-Index is below 50%. Thus, an R-Index below 50% suggests that significant results were obtained by capitalizing on chance, although it is difficult to quantify by how much.
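
The calculation can be sketched in a few lines of R (the two-tailed p-values below are hypothetical examples; the inflation rate is computed as the success rate minus median observed power, which is how MOP, Inflation, and R-Index are reported in the analyses below):

p = c(.008, .039, .021, .047)             # hypothetical significant results
z.crit = qnorm(.975)
obs.pow = pnorm(qnorm(1 - p/2), z.crit)   # observed power of each result
mop = median(obs.pow)                     # median observed power (MOP)
success.rate = 1                          # all reported results are significant
inflation = success.rate - mop
r.index = mop - inflation
round(c(MOP = mop, Inflation = inflation, R.Index = r.index), 2)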

How Replicable are the Social Priming Studies in “Thinking Fast and Slow”?

Chapter 4: The Associative Machine

4.1.  Cognitive priming effect

In the 1980s, psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked.

[no reference provided]

4.2.  Priming of behavior without awareness

Another major advance in our understanding of memory was the discovery that priming is not restricted to concepts and words. You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not even aware.

“In an experiment that became an instant classic, the psychologist John Bargh and his collaborators asked students at New York University—most aged eighteen to twenty-two—to assemble four-word sentences from a set of five words (for example, “finds he it yellow instantly”). For one group of students, half the scrambled sentences contained words associated with the elderly, such as Florida, forgetful, bald, gray, or wrinkle. When they had completed that task, the young participants were sent out to do another experiment in an office down the hall. That short walk was what the experiment was about. The researchers unobtrusively measured the time it took people to get from one end of the corridor to the other.”

“As Bargh had predicted, the young people who had fashioned a sentence from words with an elderly theme walked down the hallway significantly more slowly than the others. […] walking slowly, which is associated with old age.”

“All this happens without any awareness. When they were questioned afterward, none of the students reported noticing that the words had had a common theme, and they all insisted that nothing they did after the first experiment could have been influenced by the words they had encountered. The idea of old age had not come to their conscious awareness, but their actions had changed nevertheless.“

[John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action,” Journal of Personality and Social Psychology 71 (1996): 230–44.]

Test statistic     p       z      Observed power
t(28)=2.86         0.008   2.66   0.76
t(28)=2.16         0.039   2.06   0.54

MOP = .65, Inflation = .35, R-Index = .30
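
Each row in these tables lists the reported test statistic, its p-value, the corresponding z-value, and the observed power implied by that z-value. A minimal sketch (assuming the standard alpha = .05, two-tailed criterion) reproduces the first row of the table above:

t.val = 2.86; df = 28
p = 2 * pt(abs(t.val), df, lower.tail = FALSE)   # two-tailed p-value: ~ .008
z = qnorm(1 - p/2)                               # corresponding z-value: ~ 2.66
obs.pow = pnorm(z, qnorm(.975))                  # observed power: ~ .76
round(c(p = p, z = z, obs.power = obs.pow), 2)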

4.3.  Reversed priming: Behavior primes cognitions

“The ideomotor link also works in reverse. A study conducted in a German university was the mirror image of the early experiment that Bargh and his colleagues had carried out in New York.”

“Students were asked to walk around a room for 5 minutes at a rate of 30 steps per minute, which was about one-third their normal pace. After this brief experience, the participants were much quicker to recognize words related to old age, such as forgetful, old, and lonely.”

“Reciprocal priming effects tend to produce a coherent reaction: if you were primed to think of old age, you would tend to act old, and acting old would reinforce the thought of old age.”

Test statistic     p       z      Observed power
t(18)=2.10         0.050   1.96   0.50
t(35)=2.10         0.043   2.02   0.53
t(31)=2.50         0.018   2.37   0.66

MOP = .53, Inflation = .47, R-Index = .06

4.4.  Facial-feedback hypothesis (smiling makes you happy)

“Reciprocal links are common in the associative network. For example, being amused tends to make you smile, and smiling tends to make you feel amused….”

“College students were asked to rate the humor of cartoons from Gary Larson’s The Far Side while holding a pencil in their mouth. Those who were “smiling” (without any awareness of doing so) found the cartoons funnier than did those who were “frowning.”

[Fritz Strack, Leonard L. Martin, and Sabine Stepper, “Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis,” Journal of Personality and Social Psychology 54 (1988): 768–77.]

The authors used the more liberal and unconventional criterion of p < .05 (one-tailed), corresponding to z = 1.65. Accordingly, we adjusted the R-Index analysis and used 1.65 as the criterion value.

Test statistic     p       z      Observed power
t(89)=1.85         0.034   1.83   0.57
t(75)=1.78         0.034   1.83   0.57

MOP = .57, Inflation = .43, R-Index = .14

These results could not be replicated in a large replication effort with 17 independent labs. Not a single lab produced a significant result and even a combined analysis failed to show any evidence for the effect.

4.5. Automatic Facial Responses

In another experiment, people whose face was shaped into a frown (by squeezing their eyebrows together) reported an enhanced emotional response to upsetting pictures—starving children, people arguing, maimed accident victims.

[Ulf Dimberg, Monika Thunberg, and Sara Grunedal, “Facial Reactions to

Emotional Stimuli: Automatically Controlled Emotional Responses,” Cognition and Emotion, 16 (2002): 449–71.]

The description in the book does not match any of the three studies reported in this article. The first two studies examined facial muscle movements in response to pictures of facial expressions (smiling or frowning faces). The third study used emotional pictures of snakes and flowers. We might consider the snake pictures as being equivalent to pictures of starving children or maimed accident victims. Participants were also asked to frown or to smile while looking at the pictures. However, the dependent variable was not how they felt in response to pictures of snakes, but rather how their facial muscles changed. Aside from a strong effect of instructions, the study also found that the emotional pictures had an automatic effect on facial muscles. Participants frowned more when instructed to frown and looking at a snake picture than when instructed to frown and looking at a picture of a flower. “This response, however, was larger to snakes than to flowers as indicated by both the Stimulus factor, F(1, 47) = 6.66, p < .02, and the Stimulus × Interval factor, F(1, 47) = 4.30, p < .05” (p. 463). The evidence for smiling was stronger. “The zygomatic major muscle response was larger to flowers than to snakes, which was indicated by both the Stimulus factor, F(1, 47) = 18.03, p < .001, and the Stimulus × Interval factor, F(1, 47) = 16.78, p < .001.” No measures of subjective experiences were included in this study. Therefore, the results of this study provide no evidence for Kahneman’s claim in the book, and they are not included in our analysis.

4.6.  Effects of Head-Movements on Persuasion

“Simple, common gestures can also unconsciously influence our thoughts and feelings.”

“In one demonstration, people were asked to listen to messages through new headphones. They were told that the purpose of the experiment was to test the quality of the audio equipment and were instructed to move their heads repeatedly to check for any distortions of sound. Half the participants were told to nod their head up and down while others were told to shake it side to side. The messages they heard were radio editorials.”

“Those who nodded (a yes gesture) tended to accept the message they heard, but those who shook their head tended to reject it. Again, there was no awareness, just a habitual connection between an attitude of rejection or acceptance and its common physical expression.”

Test statistic     p       z      Observed power
F(2,66)=44.70      0.000   7.22   1.00

MOP = 1.00, Inflation = .00,  R-Index = 1.00

[Gary L. Wells and Richard E. Petty, “The Effects of Overt Head Movements on Persuasion: Compatibility and Incompatibility of Responses,” Basic and Applied Social Psychology, 1, (1980): 219–30.]

4.7   Location as Prime

“Our vote should not be affected by the location of the polling station, for example, but it is.”

“A study of voting patterns in precincts of Arizona in 2000 showed that the support for propositions to increase the funding of schools was significantly greater when the polling station was in a school than when it was in a nearby location.”

“A separate experiment showed that exposing people to images of classrooms and school lockers also increased the tendency of participants to support a school initiative. The effect of the images was larger than the difference between parents and other voters!”

[Jonah Berger, Marc Meredith, and S. Christian Wheeler, “Contextual Priming: Where People Vote Affects How They Vote,” PNAS 105 (2008): 8846–49.]

Test statistic     p       z      Observed power
z = 2.10           0.036   2.10   0.56
p = .05            0.050   1.96   0.50

MOP = .53, Inflation = .47, R-Index = .06

4.8  Money Priming

“Reminders of money produce some troubling effects.”

“Participants in one experiment were shown a list of five words from which they were required to construct a four-word phrase that had a money theme (“high a salary desk paying” became “a high-paying salary”).”

“Other primes were much more subtle, including the presence of an irrelevant money-related object in the background, such as a stack of Monopoly money on a table, or a computer with a screen saver of dollar bills floating in water.”

“Money-primed people become more independent than they would be without the associative trigger. They persevered almost twice as long in trying to solve a very difficult problem before they asked the experimenter for help, a crisp demonstration of increased self-reliance.”

“Money-primed people are also more selfish: they were much less willing to spend time helping another student who pretended to be confused about an experimental task. When an experimenter clumsily dropped a bunch of pencils on the floor, the participants with money (unconsciously) on their mind picked up fewer pencils.”

“In another experiment in the series, participants were told that they would shortly have a get-acquainted conversation with another person and were asked to set up two chairs while the experimenter left to retrieve that person. Participants primed by money chose to stay much farther apart than their nonprimed peers (118 vs. 80 centimeters).”

“Money-primed undergraduates also showed a greater preference for being alone.”

[Kathleen D. Vohs, “The Psychological Consequences of Money,” Science 314 (2006): 1154–56.]

Test statistic     p       z      Observed power
F(2,49)=3.73       0.031   2.16   0.58
t(35)=2.03         0.050   1.96   0.50
t(37)=2.06         0.046   1.99   0.51
t(42)=2.13         0.039   2.06   0.54
F(2,32)=4.34       0.021   2.30   0.63
t(38)=2.13         0.040   2.06   0.54
t(33)=2.37         0.024   2.26   0.62
F(2,58)=4.04       0.023   2.28   0.62
chi^2(2)=10.10     0.006   2.73   0.78

MOP = .58, Inflation = .42, R-Index = .16

4.9  Death Priming

“The evidence of priming studies suggests that reminding people of their mortality increases the appeal of authoritarian ideas, which may become reassuring in the context of the terror of death.”

The cited article does not directly examine this question. The abstract states that “three experiments were conducted to test the hypothesis, derived from terror management theory, that reminding people of their mortality increases attraction to those who consensually validate their beliefs and decreases attraction to those who threaten their beliefs” (p. 308). Study 2 found no general effect of death priming. Rather, the effect was qualified by authoritarianism: “Mortality salience enhanced the rejection of dissimilar others in Study 2 only among high authoritarian subjects” (p. 314), based on a three-way interaction, F(1,145) = 4.08, p = .045. We used the three-way interaction for the computation of the R-Index. Study 1 reported opposite effects for ratings of Christian targets, t(44) = 2.18, p = .034, and Jewish targets, t(44) = 2.08, p = .043. As these tests are dependent, only one test could be used, and we chose the slightly stronger result. Similarly, Study 3 reported significantly more liking of a positive interviewee and less liking of a negative interviewee, t(51) = 2.02, p = .049 and t(49) = 2.42, p = .019, respectively. We chose the stronger effect.

[Jeff Greenberg et al., “Evidence for Terror Management Theory II: The Effect of Mortality Salience on Reactions to Those Who Threaten or Bolster the Cultural Worldview,” Journal of Personality and Social Psychology 58 (1990): 308–18.]

Test statistic     p       z      Observed power
t(44)=2.18         0.035   2.11   0.56
F(1,145)=4.08      0.045   2.00   0.52
t(49)=2.42         0.019   2.34   0.65

MOP = .56, Inflation = .44, R-Index = .12

4.10  The “Lady Macbeth Effect”

“For example, consider the ambiguous word fragments W_ _ H and S_ _ P. People who were recently asked to think of an action of which they are ashamed are more likely to complete those fragments as WASH and SOAP and less likely to see WISH and SOUP.”

“Furthermore, merely thinking about stabbing a coworker in the back leaves people more inclined to buy soap, disinfectant, or detergent than batteries, juice, or candy bars. Feeling that one’s soul is stained appears to trigger a desire to cleanse one’s body, an impulse that has been dubbed the “Lady Macbeth effect.”

[“Lady Macbeth effect”: Chen-Bo Zhong and Katie Liljenquist, “Washing Away Your Sins: Threatened Morality and Physical Cleansing,” Science 313 (2006): 1451–52.]

Test statistic     p       z      Observed power
F(1,58)=4.26       0.044   2.02   0.52
F(1,25)=6.99       0.014   2.46   0.69

MOP = .61, Inflation = .39, R-Index = .22

The article reports two more studies that are not explicitly mentioned, but are used as empirical support for the Lady Macbeth effect. As the results of these studies were similar to those in the mentioned studies, including these tests in our analysis does not alter the conclusions.

Test statistic     p       z      Observed power
chi^2(1)=4.57      0.033   2.14   0.57
chi^2(1)=5.02      0.025   2.24   0.61

MOP = .59, Inflation = .41, R-Index = .18

4.11  Modality Specificity of the “Lady Macbeth Effect”

“Participants in an experiment were induced to “lie” to an imaginary person, either on the phone or in e-mail. In a subsequent test of the desirability of various products, people who had lied on the phone preferred mouthwash over soap, and those who had lied in e-mail preferred soap to mouthwash.”

[Spike Lee and Norbert Schwarz, “Dirty Hands and Dirty Mouths: Embodiment of the Moral-Purity Metaphor Is Specific to the Motor Modality Involved in Moral Transgression,” Psychological Science 21 (2010): 1423–25.]

The results are presented as significant based on one-tailed tests. “As shown in Figure 1a, participants evaluated mouthwash more positively after lying in a voice mail (M = 0.21, SD = 0.72) than after lying in an e-mail (M = –0.26, SD = 0.94), F(1, 81) = 2.93, p = .03 (one-tailed), d = 0.55 (simple main effect), but evaluated hand sanitizer more positively after lying in an e-mail (M = 0.31, SD = 0.76) than after lying in a voice mail (M = –0.12, SD = 0.86), F(1, 81) = 3.25, p = .04 (one-tailed), d = 0.53 (simple main effect).” We adjusted the significance criterion for the R-Index accordingly.

Test statistic     p       z      Observed power
F(1,81)=2.93       0.045   1.69   0.52
F(1,81)=3.25       0.038   1.78   0.55

MOP = .54, Inflation = .46, R-Index = .08

4.12   Eyes on You

“On the first week of the experiment (which you can see at the bottom of the figure), two wide-open eyes stare at the coffee or tea drinkers, whose average contribution was 70 pence per liter of milk. On week 2, the poster shows flowers and average contributions drop to about 15 pence. The trend continues. On average, the users of the kitchen contributed almost three times as much in ’eye weeks’ as they did in ’flower weeks.’ ”

[Melissa Bateson, Daniel Nettle, and Gilbert Roberts, “Cues of Being Watched Enhance Cooperation in a Real-World Setting,” Biology Letters 2 (2006): 412–14.]

Test statistic     p       z      Observed power
F(1,7)=11.55       0.011   2.53   0.72

MOP = .72, Inflation = .28, R-Index = .44

Combined Analysis

We then combined the results from the 31 studies mentioned above. While the R-Index for small sets of studies may underestimate replicability, the R-Index for a large set of studies is more accurate. Median Observed Power for all 31 studies is only 57%. It is incredible that 31 studies with 57% power could produce 100% significant results (Schimmack, 2012). Thus, there is strong evidence that the studies provide an overly optimistic image of the robustness of social priming effects. Moreover, median observed power overestimates true power if studies were selected to be significant. After correcting for inflation, the R-Index is well below 50%. This suggests that the studies have low replicability. In fact, it is possible that some of the reported results are false positive results. Just as the large-scale replication of the facial feedback studies failed to provide any support for the original findings, other studies may fail to show any effects in large replication projects. As a result, readers of “Thinking Fast and Slow” should be skeptical about the reported results and they should disregard Kahneman’s statement that “you have no choice but to accept that the major conclusions of these studies are true.” Our analysis actually leads to the opposite conclusion: you should not accept any of the conclusions of these studies as true.

k = 31,  MOP = .57, Inflation = .43, R-Index = .14,  Grade: F for Fail

[Figure: Powergraph of the 31 studies cited in Chapter 4 of “Thinking Fast and Slow”]

Schimmack and Brunner (2015) developed an alternative method for the estimation of replicability.  This method takes into account that power can vary across studies. It also provides 95% confidence intervals for the replicability estimate.  The results of this method are presented in the Figure above. The replicability estimate is similar to the R-Index, with 14% replicability.  However, due to the small set of studies, the 95% confidence interval is wide and includes values above 50%. This does not mean that we can trust the published results, but it does suggest that some of the published results might be replicable in larger replication studies with more power to detect small effects.  At the same time, the graph shows clear evidence for a selection effect.  That is, published studies in these articles do not provide a representative picture of all the studies that were conducted.  The powergraph shows that there should have been a lot more non-significant results than were reported in the published articles.  The selective reporting of studies that worked is at the core of the replicability crisis in social psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012).  To clean up their act and to regain trust in published results, social psychologists have to conduct studies with larger samples that have more than 50% power (Tversky & Kahneman, 1971) and they have to stop reporting only significant results.  We can only hope that social psychologists will learn from the train wreck of social priming research and improve their research practices.

Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings  are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact: “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning conclusion has received a lot of attention (2,199 citations in Web of Science, 399 of them in 2016 alone).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations rest on questionable assumptions and presents empirical data that are inconsistent with Ioannidis’s simulations.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

TP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the proportion of true hypotheses (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error probability (beta).

(2) TP = PTH * power = PTH * (1 - beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true for more than 50% of published significant results; that is, more than half of published positive results are false positives.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50
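
A small sketch of equation (4) makes it easy to explore when the PPV drops below 50% (the parameter values in the two calls are illustrative):

ppv = function(pth, power, alpha = .05) {
  # positive predictive value: proportion of positive results that are true positives
  (pth * power) / (pth * power + (1 - pth) * alpha)
}
ppv(pth = .50, power = .50)   # .91: far more true than false positives
ppv(pth = .10, power = .80)   # .64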

Equation (5) can be simplified to the inequality

(6) alpha > PTH/PFH * power

We can rearrange inequality (6) and substitute PFH with (1 - PTH) to determine the maximum proportion of true hypotheses at which more than 50% of positive results are false.

(7a) alpha = PTH/(1-PTH) * power

(7b) alpha * (1-PTH) = PTH * power

(7c) alpha - PTH * alpha = PTH * power

(7d) alpha = PTH * alpha + PTH * power

(7e) alpha = PTH * (alpha + power)

(7f) PTH = alpha/(alpha + power)

 

Table 1 shows the maximum proportion of true hypotheses (PTH) at which most positive results would be false, as a function of statistical power (alpha = .05).

Power     PTH / PFH
90%        5 / 95
80%        6 / 94
70%        7 / 93
60%        8 / 92
50%        9 / 91
40%       11 / 89
30%       14 / 86
20%       20 / 80
10%       33 / 67

Even if researchers conducted studies with only 20% power to detect true effects, more than 50% of positive results would be false only if fewer than 20% of tested hypotheses were true. This makes it rather implausible that most published results are false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced by various questionable research practices that help researchers report significant results. The main effect of these practices is to increase the probability that a false hypothesis produces a significant result.

Simmons et al. (2011) showed that the massive use of several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an effective alpha of 60%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power     PTH / PFH
90%       40 / 60
80%       43 / 57
70%       46 / 54
60%       50 / 50
50%       55 / 45
40%       60 / 40
30%       67 / 33
20%       75 / 25
10%       86 / 14

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypotheses with 50% power and 50% of the tested hypotheses were false.
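
The thresholds in Tables 1 and 2 follow directly from equation (7f), PTH = alpha/(alpha + power). A minimal sketch that reproduces both tables (using alpha = .05 and the assumed effective alpha of .60, respectively):

power = seq(.9, .1, by = -.1)
round(100 * .05 / (.05 + power))   # Table 1 (alpha = .05):  5  6  7  8  9 11 14 20 33
round(100 * .60 / (.60 + power))   # Table 2 (alpha = .60): 40 43 46 50 55 60 67 75 86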

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained increases the chances of obtaining a significant result regardless of whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypotheses and false hypotheses. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis’s assumption implies that bias increases the proportion of false positive results much more than the proportion of true positive results.

For example, if power is 50%, only 50% of studies testing true hypotheses produce a significant result. With a bias factor of .4, another 40% of the false negative results become significant, adding .4*.5 = 20 percentage points of true positive results. This gives a total of 70% positive results for true hypotheses, a 40% increase over the rate without bias. However, this increase in true positive results pales in comparison to the effect that the same bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38 percentage points of false positive results. So bias increases the percentage of false positive results from 5% to 43%, an increase of 760%. Thus, the effect of bias on the PPV is not symmetric: a 40% increase in false positives has a much stronger impact on the PPV than a 40% increase in true positives. Ioannidis provides no rationale for this bias model.
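
A short sketch of this arithmetic (power = 50%, alpha = .05, bias = 40%, following the bias model described above):

power = .50; alpha = .05; bias = .40
tp.rate = power + bias * (1 - power)   # true positive rate:  .50 -> .70
fp.rate = alpha + bias * (1 - alpha)   # false positive rate: .05 -> .43
(tp.rate - power) / power              # relative increase in true positives: 0.4 (40%)
(fp.rate - alpha) / alpha              # relative increase in false positives: 7.6 (760%)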

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices. For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that, for every 1,000 studies, bias adds only 1000*.30*.10*.20 = 6 true positive results compared to 1000*.30*.90*.95 ≈ 257 false positive results (roughly a 43:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true” (e124).
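
The same bias model can be used to reproduce these two PPV values (a sketch; the parameter values are the ones stated above, with PTH = 1/11):

pth = 1/11; power = .80; alpha = .05; u = .30    # u = bias
ppv.no.bias = (pth * power) / (pth * power + (1 - pth) * alpha)
tp = pth * (power + u * (1 - power))             # true positive rate with bias
fp = (1 - pth) * (alpha + u * (1 - alpha))       # false positive rate with bias
ppv.bias = tp / (tp + fp)
round(c(no.bias = ppv.no.bias, with.bias = ppv.bias), 2)   # ~ .62 and ~ .20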

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’s claims rest on the opposite assumption that most hypotheses are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results if the null-hypothesis is true in no more than 50% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

Ten years after the publication of “Why Most Published Research Findings Are False,” it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is, first and foremost, a histogram of these z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce larger z-values.

Figure 1 illustrates the distribution of z-values that is expected under Ioannidis’s model of an “adequately powered exploratory epidemiological study” (Simulation 6 in Table 4 of his article). Ioannidis assumes that for every true hypothesis there are 10 false hypotheses (R = 1:10), that studies have 80% power to detect a true effect, and that there is 30% bias.

[Figure 1: Simulated distribution of z-values for Ioannidis’s “adequately powered exploratory epidemiological study” scenario]

A 30% bias implies that for every 100 false hypotheses there would be about 33 false positive results (100 * [.30*.95 + .05]) rather than 5. The effect on true hypotheses is much smaller: about 86 rather than 80 true positive results per 100 true hypotheses (100 * [.30*.20 + .80]). Bias was modeled by increasing the number of attempts to produce a significant result until the proportions of true and false positive results matched these predicted rates. Given the assumed 1:10 ratio of true to false hypotheses, the simulation assumed that researchers tested 100,000 false hypotheses and obtained 35,000 false positive results, and that they tested 10,000 true hypotheses and obtained 8,600 true positive results.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.95 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

[Figure 2: Simulated distribution of z-values for Ioannidis’s scenario with PPV = .85 (Simulation 1 in Table 4)]

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false. The maximum estimate that could be obtained with a PPV of 50% and 100% power for the true positive results is .05*.50 + 1*.50 = 52.5%. As power is much lower than 100%, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from true positives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicability that could be obtained with 50% false positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%. However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power, 100% bias, and an equal percentage of true and false hypotheses. The true replicability for this scenario is .05*.50 + .90*.50 = 47.5%. Z-curve slightly overestimates replicability and produced an estimate of 51%. Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false. Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicability estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggests that a high proportion of studies produced true positive results with low power.
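
A minimal sketch of the mixture calculation behind these numbers (expected replicability = proportion of false positives times alpha plus proportion of true positives times their power):

replicability = function(ppv, power, alpha = .05) {
  (1 - ppv) * alpha + ppv * power
}
replicability(ppv = .20, power = .80)    # Figure 1 scenario: .20
replicability(ppv = .50, power = 1.00)   # 50% false positives, 100% power: .525
replicability(ppv = .50, power = .90)    # Figure 3 scenario: .475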

[Figure 3: Simulated distribution of z-values for 50% true hypotheses, 90% power, and 100% bias]

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicability Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory, and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of its significant results produced a significant result in a replication study. It is unlikely that all of the original studies that failed to replicate were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.

[Figure: Powergraph for the Journal of Experimental Psychology: Learning, Memory, and Cognition]

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

[Figure: Powergraph for Psychological Science]

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

[Figure: Powergraph for the Journal of Personality and Social Psychology: Attitudes and Social Cognition]

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim applies only to tests of an original, novel, or important hypothesis, whereas articles often also report significance tests for other effects. For example, an intervention study may show a strong overall decrease in depression, when only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.