Estimating Reproducibility of Psychology (No. 124): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to understand why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

The article “Loving Those Who Justify Inequality: The Effects of System Threat on Attraction to Women Who Embody Benevolent Sexist Ideals” is a Short Report in the journal Psychological Science.  The single-study article is based on Study 3 of a doctoral dissertation supervised by the senior author, Steven J. Spencer.


The article has been cited 32 times overall. It was not cited in 2017 but has one citation in 2018 so far.

Study 

The authors aim to provide further evidence for system-justification theory (Jost, Banaji, & Nosek, 2004).  A standard experimental paradigm is to experimentally manipulate beliefs in the fairness of the existing political system.  According to the theory, individuals are motivated to maintain positive views of the current system and will respond to threats to this belief in a defensive manner.

In this specific study, the authors predicted that male participants whose faith in the political system was threatened would show greater romantic interest in women who embody benevolent sexist ideals than in women who do not embody these ideals.

The study used a classic 2 x 2 design with system threat as a between-subjects factor and type of woman (embodying benevolent sexist ideals or not) as a within-subjects factor.

Stimuli were fake dating profiles.  Dating profiles of women who embody benevolent sexist ideals were based on the three dimensions of benevolent sexism: vulnerable, pure, and ideal for making a man feel complete (Glick & Fiske, 1996). The other women were described as career oriented, party seeking, active in social causes, or athletic.

A total of 36 male students participated in the study.

The article reports a significant interaction effect, F(1, 34) = 5.89.  This interaction effect was due to a significant difference between the two groups in ratings of women who embody benevolent sexist ideals, F(1, 34) = 4.53.
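The post above lists only the F-values. The corresponding p-values can be recovered from the reported test statistics and degrees of freedom; here is a minimal sketch in Python (for illustration only):

```python
from scipy.stats import f

# p-values implied by the reported F-tests with df = (1, 34)
p_interaction = f.sf(5.89, 1, 34)   # upper-tail probability, roughly .02
p_simple      = f.sf(4.53, 1, 34)   # roughly .04
print(round(p_interaction, 3), round(p_simple, 3))
```

Both values fall only just below .05, which is consistent with the conclusion below that the result was only just significant.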

Replication Study 

The replication study was conducted in Germany.

It failed to replicate the significant interaction effect, F(1,68) = 0.08, p = .79.

Conclusion

The sample size of the original study was very small, and the result was only just significant.  It is not surprising that the replication study failed to reproduce this just-significant result despite a somewhat larger sample size.


Estimating Reproducibility of Psychology (No. 61): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to understand why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “Poignancy: Mixed emotional experience in the face of meaningful endings” was published in the Journal of Personality and Social Psychology.  The senior author is Laura L. Carstensen, who is best known for her socioemotional selectivity theory (Carstensen, 1999, American Psychologist).  This article has been cited only 83 times and ranks only #43 among Laura Carstensen's most cited articles, although it contributes to her current H-Index of 49.


The main hypothesis is derived from Carstensen’s socioemotional selectivity theory.  The prediction is that endings (e.g., of student life, of life in general) elicit more mixed emotions.  This hypothesis was tested in two experiments.

Study 1

Sixty young (~20 years) and 60 older (~80 years) participants took part in Study 1.  The experimental procedure was a guided imagery task designed to evoke emotions.  In one condition, participants were asked to imagine being in their favorite location in 4 months' time.  In the other condition, they were given the same instruction but were also told to imagine that this would be the last time they could visit this location.  The dependent variables were intensity ratings on an emotion questionnaire on a scale from 0 = not at all to 7 = extremely.

The intensity of mixed feelings was assessed by taking the minimum value of a positive and a negative emotion (Schimmack, 2001).
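As an illustration of this scoring rule, here is a minimal sketch in Python; the ratings are hypothetical and only the MIN scoring rule is taken from the article:

```python
import numpy as np

# Hypothetical intensity ratings on the 0 = not at all to 7 = extremely scale
happiness = np.array([6, 5, 1, 3])
sadness   = np.array([1, 4, 6, 3])

# MIN index: the intensity of a mixed feeling is the weaker of the two ratings
mixed_feelings = np.minimum(happiness, sadness)   # -> [1, 4, 1, 3]
```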

The analysis showed no age main effect or interactions and no differences in two control conditions.  For the critical imagery condition,  intensity of mixed feelings was higher in the last-time condition (M ~ 3.6, SD ~ 2.3) than in the next-visit condition (M ~ 2, SD ~ 2.3), d ~ .7,  t(118) ~ 3.77.
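These numbers can be roughly checked against each other, assuming about 60 participants per condition (as implied by the reported degrees of freedom) and the approximate means and standard deviation given above:

```python
import numpy as np

n1 = n2 = 60                           # assumed cell sizes, consistent with df = 118
m_last, m_next, sd = 3.6, 2.0, 2.3     # approximate values reported above

d = (m_last - m_next) / sd             # ~ .70
t = d / np.sqrt(1 / n1 + 1 / n2)       # ~ 3.8, close to the reported t(118) ~ 3.77
print(round(d, 2), round(t, 2))
```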

Study 2

Study 2 examined mixed feelings in the context of a naturalistic event.  It builds on a previous study by Larsen, McGraw, and Cacioppo (2001) that demonstrated mixed feelings on graduation day.  Study 2 aimed to replicate and extend this finding by adding an experimental manipulation that either emphasized the ending of university life or did not.

110 students participated in the study.

In the control condition (N = 59), participants were given the following instructions: “Keeping in mind your current experiences, please rate the degree to which you feel each of the following emotions,” and were then presented with the list of 19 emotions. In the limited-time condition (n = 51), in which emphasis was placed on the ending that they were experiencing, participants were given the following instructions: “As a graduating senior, today is the last day that you will be a student at Stanford. Keeping that in mind, please rate the degree to which you feel each of the following emotions,”

The key finding was significantly higher means in the experimental condition than in the control condition, t(108) = 2.34, p = .021.

Replication Study

Recruiting participants on graduation day is not easy.  The replication study recruited participants over a 3-year period to achieve a sample size of N = 222 participants, more than double the sample size of the original study (2012 N = 95; 2013 N = 78; 2014 N = 49).

Despite the larger sample size, the study failed to replicate the effect of the experimental manipulation, t(220) = 0.07, p = .94.

Conclusion

While reports of mixed feelings in conflicting situations are a robust phenomenon (Study 1), experimental manipulations of the intensity of mixed feelings are relatively rare. The key novel contribution of Study 2 was the demonstration that focusing on the ending of an event increases sadness and mixed feelings. However, the evidence for this effect was weak and could not be replicated in a larger sample. In combination, the evidence does not suggest that this is an effective way to manipulate the intensity of mixed feelings.

The original article summarized the findings as follows:

In Study 1, participants repeatedly imagined being in a meaningful location. Participants in the experimental condition imagined being in the meaningful location for the final time. Only participants who imagined “last times” at meaningful locations experienced more mixed emotions. In Study 2, college seniors reported their emotions on graduation day. Mixed emotions were higher when participants were reminded of the ending that they were experiencing. Findings suggest that poignancy is an emotional experience associated with meaningful endings.

 

Estimating Reproducibility of Psychology (No. 165): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to understand why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “The Value Heuristic in Judgments of Relative Frequency” was published as a Short Report in Psychological Science.   The article has been cited only 25 times overall and was not cited at all in 2017.


The authors suggest that they have identified a new process by which people judge the relative frequency of objects.

Estimating the relative frequency of a class of objects or events is fundamental in subjective probability assessments and decision making (Estes, 1976), and research has long shown that people rely on heuristics for making these judgments (Gilovich, Griffin, & Kahneman, 2002). In this report, we identify a novel heuristic for making these judgments, the value heuristic: People judge the frequency of a class of objects on the basis of the subjective value of the objects.

As my dissertation was about frequency judgments of emotions, I am familiar with the frequency estimation literature, especially the estimation of valued objects like positive and negative emotions.  My reading of the literature suggests that this hypothesis is inconsistent with prior research because frequency judgments are often made on the basis of a fast, automatic, parallel search of episodic memory (e.g., Hintzman, 1988). Thus, value might only indirectly influence frequency estimates if it influences the accessibility of exemplars.

The authors present a single experiment to support their hypothesis.

Experiment

68 students participated in this study.  5 were excluded for a final N of 63 students.

During the learning phase of the study, participants were exposed to 57 pictures of birds and 57 pictures of flowers.

Participants were then told that they would receive 2 cents for each picture from one of the two categories. The experimental manipulation was whether participants would be rewarded for bird or flower pictures.

The dependent variables were frequency estimates of the number of pictures in each category; specifically, whether participants gave a higher, equal, or lower estimate for the rewarded category.

When flowers were rewarded, 12 participants had higher estimates for flowers and 15 had higher estimates for birds.

When birds were rewarded, 21 participants had higher estimates for birds and 8 had higher estimates for flowers.

A chi-square test showed a just significant effect that was driven by the condition that rewarded birds, presumably because there was also a main effect of birds vs. flowers (birds are more memorable and accessible).

Chi2(1, N = 56) = 4.51,  p = .037.

Replication 

81 students participated in the replication study.  After exclusion of 4 participants the final sample size was N = 77.

When flowers were rewarded, 16 participants had higher estimates for flowers and 11 had higher estimates for birds.

When birds were rewarded, 10 participants had higher estimates for birds and 14 had higher estimates for flowers.

The remaining 26 participants gave equal estimates for both categories (ties) and did not enter the test.

The chi-square test was not significant.

Chi2(1, N = 51) = 1.57,  p = .21.
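The reported chi-square values can be reproduced from the counts above if the test is run on the 2 x 2 table of reward condition by whether the higher estimate went to the rewarded or the non-rewarded category (participants with tied estimates are excluded, which is presumably why N is smaller than the final sample). A minimal sketch in Python:

```python
from scipy.stats import chi2_contingency

# Rows: reward condition (flowers rewarded, birds rewarded)
# Columns: higher estimate for the rewarded category, higher estimate for the other category
original    = [[12, 15], [21, 8]]    # N = 56
replication = [[16, 11], [10, 14]]   # N = 51

for label, table in [("original", original), ("replication", replication)]:
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(label, round(chi2, 2), round(p, 3))
# original: chi2 ~ 4.5, p ~ .03; replication: chi2 ~ 1.6, p ~ .21 (close to the reported values)
```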

Conclusion

The original article tested a novel and controversial hypothesis that was not grounded in the large cognitive literature on frequency estimation.  The article relied on a just significant result in a single study as evidence.  It is not surprising that a replication study failed to replicate the finding.  The article had very little impact. In hindsight, this study does not meet the high bar for acceptance into a high impact journal like Psychological Science.  However, hindsight is 20/20 and it is well known that the foresight of traditional peer-review is an imperfect predictor of replicability and relevance.


Estimating Reproducibility of Psychology (No. 151): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to understand why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 151 “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” by Simone Schnall and colleagues has been the subject of heated debates among social psychologists.  The main finding of the article failed to replicate in an earlier replication attempt (Johnson, Cheung, & Donnellan, 2012).  In response to the replication failure, Simone Schnall suggested that the replication study was flawed and stood by her original findings.  This response led me to publish my first R-Index blog post, which suggested the original results were not as credible as they seemed because Simone Schnall was trained to use questionable research practices that produce significant results with low replicability. She was simply not aware of the problems of using these methods. However, Simone Schnall was not happy with my blog post, and when I refused to take it down, she complained to the University of Toronto about it. UofT found that the blog post did not violate ethical standards.

The background is important because the OSC replication study was one of the replication studies that were published earlier and criticized by Schnall. Thus, it is necessary to revisit Schnall’s claim that the replication failure can be attributed to problems with the replication study.

Summary of Original Article 

The article “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” was published in Psychological Science. The article has been cited 197 times overall and 20 times in 2017.


The article extends previous research that suggested a connection between feelings of disgust and moral judgments.  The article reports two experiments that test the complementary hypothesis that thoughts of purity make moral judgments less severe.  Study 1 used a priming manipulation. Study 2 evoked disgust followed by self-purification. Results in both studies confirmed this prediction.

Study 1

Forty undergraduate students (n = 20 per cell) participated in Study 1.

Half of the participants were primed with a scrambled sentence task that contained cleanliness words (e.g. pure, washed).  The other half did a scrambled sentence task with neutral words.

Right after the priming procedure, participants rated how morally wrong an action was in a series of six moral dilemmas.

The ANOVA showed a marginally significant mean difference, F(1,38) = 3.63, p = .064.  The result was reported with p-rep = .90, an experimental statistic used in Psychological Science from 2005 to 2009 that was partially motivated by an attempt to soften the strict distinction between p-values just above or below .05.  Although a p-value of .064 is not meaningfully different from a p-value of .04, neither p-value suggests that a result is highly replicable. A p-value of .05 corresponds to 50% replicability (with large uncertainty around this point estimate), and the estimate is inflated if questionable research methods were used to produce it.

Study 2

Study 2 could have followed up the weak evidence of Study 1 with a larger sample to increase statistical power.  However, the sample size in Study 2 was nearly the same (N = 44).

Participants first watched a disgusting film clip.  Half (n = 21) of the participants then washed their hands before rating moral dilemmas.  The other half (n = 22) did not wash their hands.

The ANOVA showed a significant difference between the two conditions, F(1,41) = 7.81, p = .008.

Replicability Analysis 

Study      N    Test              p       z      Observed Power
Study 1    40   F(1,38) = 3.63    .064    1.85   .58*
Study 2    44   F(1,41) = 7.81    .008    2.66   .76

*  using p < .10 as criterion for power analysis

With two studies it is difficult to predict replicability because observed power in a single study is strongly influenced by sampling error.  Individually, Study 1 has a low replicability index because the success (p < .10) was achieved with only 58% power. The inflation index (100 – 58 = 42) is high and the R-Index, 58 – 42 = 16, is low.

Combining both studies still produces a low R-Index (Median Observed Power = 67, Inflation = 33, R-Index = 67 – 33 = 34).
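For readers who want to follow the arithmetic, here is a minimal sketch of the observed-power and R-Index calculations behind the table above. Observed power is computed from the reported p-values, and Study 1 is counted as a success at the p < .10 criterion:

```python
from scipy.stats import norm

def observed_power(p_two_tailed, alpha=0.05):
    """Power implied by the observed p-value, relative to the significance criterion alpha."""
    z_obs = norm.isf(p_two_tailed / 2)   # observed z-score
    z_crit = norm.isf(alpha / 2)         # criterion z (1.96 for alpha = .05, 1.64 for .10)
    return norm.cdf(z_obs - z_crit)

op1 = observed_power(0.064, alpha=0.10)   # Study 1: ~ .58
op2 = observed_power(0.008, alpha=0.05)   # Study 2: ~ .76

median_op = (op1 + op2) / 2               # median of two values, ~ .67
success_rate = 1.0                        # both studies were counted as successes
inflation = success_rate - median_op      # ~ .33
r_index = median_op - inflation           # ~ .34
```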

My original blog post pointed out that we can predict replicability based on a researcher's typical R-Index.  If a researcher typically conducts studies with high power, a p-value of .04 will sometimes occur due to bad luck, but the replication study is likely to be successful with a lower p-value because bad luck does not repeat itself.

In contrast, if a researcher conducts low powered studies, a p-value of .04 is a lucky outcome and the replication study is unlikely to be lucky again and therefore more likely to produce a non-significant result.

Since I published the blog post, Jerry Brunner and I have developed a new statistical method that allows meta-psychologists to take a researcher’s typical research practices into account. This method is called z-curve.

The figure below shows the z-curve for automatically extracted test statistics from articles by Simone Schnall from 2003 to 2017.  Trend analysis showed no major changes over time.

 

For some help with reading these plots check out this blog post.

The Figure shows a few things. First, it shows that the peak (mode) of the distribution is at z = 1.96, which corresponds to the criterion for significance (p < .05, two-tailed).  The steep drop on the left is not explained by normal sampling error and reveals the influence of QRPs (this is not unique to Schnall; the plot is similar for other social psychologists).  The grey line is a rough estimate of the proportion of non-significant results that would be expected given the distribution of significant results.  The discrepancy between the proportion of actual non-significant results and the grey line shows the extent of the influence of QRPs.

[Figure: z-curve of automatically extracted test statistics from Simone Schnall's articles, 2003-2017]

Once QRPs are present, observed power of significant results is inflated. The average estimate is 48%. However, actual power varies.  The estimates below the x-axis show power estimates for different ranges of z-scores.  Even z-scores between 2.5 and 3 have only an average power estimate of 38%.  This implies that the z-score of 2.66 in Study 2 has a bias-corrected observed power of less than 50%. And as 50% power corresponds to p = .05, this implies that a bias-corrected p-value is not significant.

A new way of using z-curve is to fit z-curve with different proportions of false positive results and to compare the fit of these models.

[Figure: z-curve models assuming different proportions of false positive results]

The plot shows that models with 0 or 20% false positives fit the data about equally well, but a model with 40% false positives leads to notably worse model fit.  Although this new feature is still in development, the results suggest that few of Schnall’s results are strictly false positives, but that many of her results may be difficult to replicate because QRPs produced inflated effect sizes and much larger samples might be needed to produce significant results (e.g., N > 700 is needed for 80% power with a small effect size, d = .2).
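The sample-size figure can be checked with a standard power calculation for a two-group comparison; here is a sketch using the normal approximation (the exact number depends on the design and test):

```python
from scipy.stats import norm

d, alpha, power = 0.2, 0.05, 0.80
z_alpha = norm.isf(alpha / 2)                      # 1.96
z_beta  = norm.isf(1 - power)                      # 0.84

n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2    # ~ 393 per group
total_n = 2 * n_per_group                          # ~ 786 in total, i.e., N > 700
print(round(n_per_group), round(total_n))
```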

In conclusion, given the evidence for the presence of QRPs and the weak evidence for the cleanliness hypothesis, it is unlikely that equally underpowered studies would replicate the effect. At the same time, larger studies might produce significant results with weaker effect sizes.  Given the large sampling error in small samples, it is impossible to say how small the effects would be and how large samples would have to be to detect them with high power.

Actual Replication Study

The replication study was carried out by Johnson, Cheung, and Donnellan.

Johnson et al. conducted replication studies of both studies with considerably larger samples.

Study 1 was replicated with 208 participants (vs. 40 in original study).

Study 2 was replicated with 126 participants (vs. 44 in original study).

Even if some changes in experimental procedures had slightly lowered the true effect size, the larger samples would have compensated for this by reducing sampling error.

However, neither replication produced a significant result.

Study 1: F(1, 206) = 0.004, p = .95

Study 2: F(1, 124) = 0.001, p = .97.

Just like two p-values of .05 and .07 are unlikely, it is also unlikely to obtain two p-values of .95 and .97 even if the null-hypothesis is true, because sampling error produces spurious mean differences.  When the null-hypothesis is true, p-values have a uniform distribution, and we would expect 10% of p-values between .9 and 1. To observe this event twice in a row has a probability of .10 * .10 = .01.  Unusual events do sometimes happen by chance, but defenders of the original research could use this observation to suggest “reverse p-hacking,” a term coined by Fritz Strack to insinuate that it can be in the interest of replication researchers to make original effects go away.  Although I do not believe that this was the case here, it would be unscientific to ignore the surprising similarity of these two p-values.
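The arithmetic can be illustrated with a small simulation of null effects (a sketch; the group sizes are arbitrary):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Simulate many two-group experiments in which the null hypothesis is exactly true
n_sims, n_per_group = 20000, 100
a = rng.standard_normal((n_sims, n_per_group))
b = rng.standard_normal((n_sims, n_per_group))
p = ttest_ind(a, b, axis=1).pvalue

print(round((p > 0.9).mean(), 2))                    # ~ .10: p-values are uniform under the null
pairs = p.reshape(-1, 2)                             # treat consecutive simulations as pairs of studies
print(round((pairs > 0.9).all(axis=1).mean(), 3))    # ~ .01: both p-values above .9
```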

The authors conducted two more replication studies. These studies also produced non-significant results, with p = .31 and p = .27.  Thus, the similarity of the first two p-values was just a statistical fluke, just like some suspiciously similar  p-values of .04 are sometimes just a chance finding.

Schnall’s Response 

In a blog post, Schnall comments on the replication failure.  She starts with the observation that publishing failed replications is breaking with old traditions.

One thing, though, with the direct replications, is that now there can be findings where one gets a negative result, and that’s something we haven’t had in the literature so far, where one does a study and then it doesn’t match the earlier finding. 

Schnall is concerned that a failed replication could damage the reputation of the original researcher, if the failure is attributed either to a lack of competence or a lack of integrity.

Some people have said that well, that is not something that should be taken personally by the researcher who did the original work, it’s just science. These are usually people outside of social psychology because our literature shows that there are two core dimensions when we judge a person’s character. One is competence—how good are they at whatever they’re doing. And the second is warmth or morality—how much do I like the person and is it somebody I can trust.

Schnall believes that direct replication studies were introduced as a crime-control measure in response to the revelation that Diederik Stapel had made up data in over 50 articles.  This violation of research integrity is called fabrication.  However, direct replication studies are not an effective way to detect fabrication (Stroebe & Strack, 2014).

In social psychology we had a problem a few years ago where one highly prominent psychologist turned out to have defrauded and betrayed us on an unprecedented scale. Diederik Stapel had fabricated data and then some 60-something papers were retracted… This is also when this idea of direct replications was developed for the first time where people suggested that to be really scientific we should do what the clinical trials do rather our regular [publish conceptual replication studies that work] way of replication that we’ve always done.

Schnall overlooks that another reason for direct replications was concern about falsification.

Falsification is manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record (The Office of Research Integrity)

In 2011/2012 numerous articles suggested that falsification is a much bigger problem than fabrication and direct replications were used to examine whether falsified evidence also produced false positive results that could not be replicated.  Failures in direct replications are at least in part due to the use of questionable research practices that inflate effect sizes and success rates.

Today it is no longer a secret that many studies failed to replicate because original studies reported inflated effect sizes (OSC, 2015).  Given the widespread use of QRPs, especially in experimental social psychology, replication failures are the norm.  In this context, it makes sense that individual researchers feel attacked if one of their studies is replicated.

There’s been a disproportional number of studies that have been singled out simply because they’re easy to conduct and the results are surprising to some people outside of the literature

Why me?  However, the OSC (2015) project did not single out individual researchers. It put any study that was published in JPSP or Psychological Science in 2008 up for replication.  Maybe the ease of replication was a factor.

Schnall’s next complaint is that failures to replicate are treated as more credible than successful original studies.

Often the way these replications are interpreted is as if one single experiment disproves everything that has come before. That’s a bit surprising, especially when a finding is negative, if an effect was not confirmed. 

This argument ignores two things. First, it ignores that original researchers have a motivated bias to show a successful result.  Researchers who conduct direct replication studies are open to finding a positive or a negative result.  Second, Schnall ignores sample size.  Her original Study 1 had a sample size of N = 40.  The replication study had a sample size of N = 208.  Studies with larger samples have less sampling error and are more robust to violations of statistical assumptions underlying significance tests.  Thus, there are good reasons to believe the results of the failed replication studies more than the results of Schnall’s small original study.

Her next concern was that a special issue published a failed replication without peer review.  This led to some controversy, but it is not the norm.  More important, Schnall overstates the importance of traditional, anonymous, pre-publication peer-review.

It may not seem like a big deal but peer review is one of our laws; these are our publication ethics to ensure that whatever we declare as truth is unbiased. 

Pre-publication peer-review does not ensure that published results are unbiased. The OSC (2015) results clearly show that published results were biased in favor of supporting researchers’ hypotheses. Traditional peer-review does not check whether researchers used QRPs or not.  Peer-review does not end once a result is published.  It is possible to evaluate the results of original studies or replication studies even after the results are published.

And this is what Schnall did. She looked at the results and claimed that there was a mistake in the replication study.

I looked at their data, looked at their paper and I found what I consider a statistical problem.

However, others looked at the data and did not agree with her.  This led Schnall to consider replications a form of bullying.

“One thing I pointed to was this idea of this idea of replication bullying, that now if a finding doesn’t replicate, people take to social media and declare that they “disproved” an effect, and make inappropriate statements that go well beyond the data.”

It is of course ridiculous to think of failed replication studies as a form of bullying. We would not need to conduct empirical studies if only successful replication studies were allowed to be published.  Apparently some colleagues tried to point this out to Schnall.

Interestingly, people didn’t see it that way. When I raised the issue, some people said yes, well, it’s too bad she felt bullied but it’s not personal and why can’t scientists live up to the truth when their finding doesn’t replicate?

Schnall could not see it this way.  According to her, there are only two reasons why a replication study may fail.

If my finding is wrong, there are two possibilities. Either I didn’t do enough work and/or reported it prematurely when it wasn’t solid enough or I did something unethical.

In reality there are many more reasons for a replication failure. One possible explanation is that the original result was an honest false positive finding.  The very notion of significance testing implies that some published findings can be false positives and that only future replication studies can tell us which published findings are false positives.  So a simple response to a failed replication is simply to say that it probably was a false positive result and that is the end of the story.

But Schnall does not believe that it is a false positive result ….

because so far I don’t know of a single person who failed to replicate that particular finding that concerned the effect of physical cleanliness and moral cleanliness. In fact, in my lab, we’ve done some direct replications, not conceptual replications, so repeating the same method. That’s been done in my lab, that’s been done in a different lab in Switzerland, in Germany, in the United States and in Hong Kong; all direct replications. As far as I can tell it is a solid effect.

The problem with this version of the story is that it is impossible to get significant results again and again with small samples, even if the effect is real.  So, it is not credible that Schnall was able to obtain significant results in many unpublished studies and never obtained a contradictory result (Schimmack, 2012).

Despite many reasonable comments about the original study and the replication studies (e.g., sample size, QRPs, etc.), Schnall cannot escape the impression that replication researchers have an agenda to tear down good research.

Then the quality criteria are oftentimes not nearly as high as for the original work. The people who are running them sometimes have motivations to not necessarily want to find an effect as it appears.

This accusation motivated me to publish my first blog post and to elaborate on this study from the OSC reproducibility project.  There is ample evidence that QRPs contributed to replication failures. In contrast, there is absolutely no empirical evidence that replication researchers deliberately produced non-significant results, and as far as I know Schnall has not yet apologized for her unfounded accusation.

One reason for her failure to apologize is probably that many social psychologists expressed support for Schnall either in public or mostly in private.

I raised these concerns about the special issue, I put them on a blog, thinking I would just put a few thoughts out there. That blog had some 17,000 hits within a few days. I was flooded with e-mails from the community, people writing to me to say things like “I’m so glad that finally somebody’s saying something.” I even received one e-mail from somebody writing to me anonymously, expressing support but not wanting to reveal their name. Each and every time I said: “Thank you for your support. Please also speak out. Please say something because we need more people to speak out openly. Almost no one did so.”

Schnall overlooks a simple solution to the problem.  Social psychologists who feel attacked by failed replications could simply preregister their own direct replications with large samples and show that their results do replicate.  This solution was suggested by Daniel Kahneman in 2012 in response to a major replication failure of a study by John Bargh that cast doubt on social priming effects.

What social psychology needs to do as a field is to consider our intuitions about how we make judgments, about evidence, about colleagues, because some of us have been singled out again and again and again. And we’ve been put under suspicion; whole areas of research topics such as embodied cognition and priming have been singled out by people who don’t work on the topics. False claims have been made about replication findings that in fact are not as conclusive as they seem. As a field we have to set aside our intuitions and move ahead with due process when we evaluate negative findings. 

However, what is most telling is the complete absence of direct replications by experimental social psychologists to demonstrate that their published results can be replicated.  The first major replication attempt by Vohs and Schmeichel just failed to replicate ego-depletion in a massive self-replication attempt.

In conclusion, it is no longer a secret that experimental social psychologists have used questionable research practices to produce more significant results than unbiased studies would produce.  The response to this crisis of confidence has been denial.


Robert Sternberg’s Rise to Fame

Robert Sternberg is a psychologist interested in being famous (Am I famous yet?).  Wellbeing theories predict that he is also dissatisfied because discrepancies between resources (being a psychologist) and goals (wanting to be famous) lead to dissatisfaction (Diener & Fujita, 1995).

Ask any undergraduate about famous psychologists and they typically can name two: Freud and Skinner.

Robert Sternberg is also smart. So, he realized that just publishing more good research is not going to increase his standing in the APA fame rankings from his current rank of 60 out of 99 (APA). (In an alternative ranking by Ed Diener, who is not on the previous list but ranks #10 on his own list, Sternberg also ranks 60.)

The problem is that being a good psychologist is simply not a recipe for fame.  So there is a need to think outside the box.  For example, a Google Search retrieves 40,000 hits for Diederik Stapel and only 30,000 for David Funder (we will address the distinction between good and bad creativity later on).

It looks like Robert Sternberg has found a way to become famous. More people are talking about him right now at least within psychology circles than ever before.  The trick was to turn the position of editor of the APS journal Perspectives on Psychological Science into a tool for self-promotion.

After all, if we equate psychological science with the activities of the most eminent psychologists, reflecting on psychological science means reflecting on the activities of eminent psychologists, and if you are the editor of this journal you need to do self-reflection.  Thus, Perspectives on Psychological Science necessarily has to publish mostly auto-meta-psychological self-reflections of Robert Sternberg. These scientific self-reflections should not be confused with the paradigmatic example of Narcissus, who famously fell in love with himself, which led to a love triangle between me, myself, and I.

Some envious second-stringers do not realize the need to focus on eminent psychologists who have made valuable contributions to psychological science and to stop the self-destructive, negative talk about a crisis in psychological science that was fueled by the long-forgotten previous editor of Perspectives on Psychological Science (her name escapes me right now).  Ironically, their petty complaints backfired and increased Robert Sternberg’s fame; not unlike the petty criticism of Donald Trump by the leftist-biased media helped him win the election in 2016.

I was pleased to see that Robert Sternberg remains undeterred in his mission to make Psychology great again and to ensure that eminent psychologists receive the recognition they deserve.

I am also pleased to share the highlights of his Introduction and Postscript to the forthcoming “Symposium on Modern Trends in Psychological Science: Good, Bad, or Indifferent?”  [the other contributions are not that important]

Introduction

Robert Sternberg’s Introduction takes a historical perspective on psychological science, which, not coincidentally, overlaps to a large extent with Robert Sternberg’s career.

“I was 25 years old, a first-year assistant professor. I came to believe that my faculty mentor, the late Wendell Garner, had a mistaken idea about the structure of perceptual stimuli. I did a study to show Garner wrong: It worked—or at least I thought it did!  Garner told me he did not think much of the study.  I did, however.  I presented the work as a colloquium at Bell Labs.  My namesake, Saul Sternberg (no relation), was in the audience.  He asked what appeared to be a simple question.  The simple question demolished my study.  I found myself wishing that a hole would open up in the ground and swallow me up.  But I was actually lucky: The study was not published.  What if it had been? I went back to Yale and told Professor Garner that the study was a bad misfire.  I expected him to be angry; but he was not.  He said something like: “You learned a valuable lesson from the experience.  You are judged in this field by the positive contributions you make, not by the negative ones.”  Garner was intending to say, I think, that the most valuable contributions are those that build things up rather than tear things down.

The most valuable lesson from these formative years of psychological science is:

You are judged largely by the positive contributions you make, much more so than by the negative ones. 

The implications of this insight are clear and have been formalized by another eminent Cornell researcher (no, not Wansink), Daryl Bem, in a contribution to one of Sternberg’s great books (“Let’s err on the side of discovery”).

If you are judged by your successes, eminence is achieved by making lots of positive contributions (p < .05).  It doesn’t matter whether some second-stringer replicators later show that some of your discoveries are false positives. Surely some will be true positives, and you will be remembered forever for these positive contributions.  The negative contributions don’t count and don’t hurt your rise to fame or eminence (unless you fake it, Stapel).

For a long time even false positives were not a problem because nobody actually bothered to examine whether discoveries were true or false.  So just publishing as many positives as possible was the best way to become famous; nobody noticed that it was false fame.

This is no longer the case and replication failures are threatening the eminence of some psychologists. However, driven people know how to turn a crisis into an opportunity; for example, an opportunity for more self-citations.

Sternberg ponders deep questions about the replication revolution in psychology.

So replication generally is good, but is it always good, and how good?  Is there a danger that young scientists who might have gone on to creative careers pushing the boundaries of science (Sternberg, Kaufman, & Pretz, 2002) will instead become replicators, spending their time replacing insights about new ideas and phenomena (Sternberg & Davidson, 1982, 1983) with repetitions of old ideas and tired old phenomena?  

Or is replication and, generally, repeating what others have done before, one of many forms of creativity (Frank & Saxe, 2012; Niu & Sternberg, 2003; Sternberg, 2005; Sternberg, Kaufman, & Pretz, 2002; Zhang & Sternberg, 1998) that in the past has been undervalued?  Moreover, is anyone truly just a “replicator”?

These are important questions that require 7 self-citations because Sternberg has made numerous important contributions to meta-psychology.

There is also strong evidence that researchers should focus on positive contributions rather than trying to correct others’ mistakes.  After all, why should anybody be bothered by Bem’s (2011) demonstration that students can improve their exam grades by studying AFTER taking the exam, but only at Cornell U.?

In my own experience, my critiques (Sternberg, 1985a, 1986) have had much less impact than my positive contributions (Sternberg, 1981, 1984, 1997a, 1997b; Sternberg & Grigorenko, 2004; Sternberg & Hedlund, 2003; Sternberg & Smith, 1985), and I always thought this was generally true, but maybe that’s just my own limitation. [9 self-citations]

Be grateful for faculty mentors who are not only brilliant but also wise and kind—there are not so many of them.

An influential study also found that academics are more likely to be brilliant than wise or kind (Sternberg, 2016).  This is a problem in the age of social media, because some academics use their unkind brilliance to damage the reputation of researchers who are just trying to be famous.

The advent of social media, wherein essays, comments, and commentaries are not formally refereed, has led to much more aggressive language than those of us socialized in the latter years of the twentieth century ever were accustomed to.  Sometimes, attacks have become personal, not just professional.  And sometimes, replies to un-refereed critiques can look more like echo chambers than like genuinely critical responses to points that have been made.   

Sternberg himself shows wisdom and kindness in his words for graduate students and post-doctoral students.

How can one navigate a field in rapid transition?  I believe the answers to that question are the same as they always have been.  First, do the very best work of which you are capable.  I never worried too much about all the various crises the field went through as I grew up in it—I focused on doing my best work.  Second, remember that the most eminent scientists usually are not the crowd-followers but rather the crowd-defiers (Sternberg, 2003; Sternberg, Fiske, & Foss, 2016; Sternberg & Lubart, 1995) and the ones who can defy the current Zeitgeist (Sternberg, 2018).  So if you are not doing what everyone else is doing and following every trend everyone else is following, you may well end up being better off. 

In one word, don’t worry and just be like Sternberg [4 self-citations]

Of course, even an eminent scholar cannot do it all alone and Robert Sternberg does acknowledge the contribution of several people who helped him polish this brilliant and wise contribution to the current debate about the future of psychological science and it would be unkind if I didn’t mention their names (Brad Bushman, Alexandra Freund, June Gruber, Diane Halpern, Alex Holcombe, James Kaufman, Roddy Roediger, and Dan Simons).

Well done everybody.  Good to know that psychological science can build on solid foundations and a new generation of psychologists can stand on the broad shoulders of Robert Sternberg.

Postscript

Robert Sternberg’s brilliance also shines in the concluding statements that bring together the valuable contributions of select, eminent, contributors to this “symposium.”  He points out his valuable contribution to publishing in psychological science journals.

In a book I edited on submitting papers to psychology journals, Bem (2000) wrote: There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b).  

Bem’s advice reflected the state of the field in 1975, when I received my PhD, in 2000, when he wrote the article, and even more recently.  Today, such “HARKing” (Hypothesizing After the Results are Known) would likely be viewed with great suspicion. Both p-hacking and HARKing require a certain degree of creativity. 

Many professors and students, not only when I was in graduate school, but also throughout the world have built their careers on practices once considered both creative and perfectly legitimate but that today might be viewed as dubious.  What this fact highlights is that scientific creativity—indeed, any form of creativity—can be understood only in context (Csikszentmihalyi, 1988, 2013; Plucker, 2017; Simonton, 1994, 2004; Sternberg, 2018).

Sternberg self-critically points out that his book may have contributed to the replication crisis in psychology by featuring Bem’s creative approach to science.

I would argue that in science as well as in society, we too often have valued creativity without considering whether the creativity we are valuing is positive or negative (or neutral).  In science, we can get so caught up in achieving eminence or merely the next step on a promotion ladder that we fail to consider whether the creativity we are exhibiting is truly positive.

He recognizes that falsely positive contributions are ultimately not advancing science.  He also acknowledges that it can be difficult to correct false positives.

Scholars sometimes have taken to social media because they have felt their potential contributions to refereed journals have been blocked. At the same time, it is likely that many scholars who post critiques on social media have never even tried to have their work published.  It just is easier to bypass peer review, which can be a lengthy and sometimes frustrating process.

But he is also keenly aware that social media can be abused by terrorists and totalitarian governments.

Such media initially probably seemed like a wholly good idea. The inventors of various forms of social media presumably did not think through how social media might be used to undermine free elections, to spread hateful propaganda, to serve as a platform for cyberbullying, or even to undermine careers. 

In the good old days, scientific criticism was vetted by constructive and selfless peer reviewers, which helped critics avoid making embarrassing mistakes in public.

At one time, if a scientist wished publicly to criticize another’s work, he or she had to pass the critique through peer reviewers. These reviewers often saved scientists from saying foolish and even destructive things.

Nowadays, fake news and fake criticism can spread through echo-chambers on social media.

With social media, the push of a button can bypass the need for peer reviewers.  Echo chambers of like-minded people then may reinforce what is said, no matter how obnoxious or simply wrong it may be. 

Sternberg is painfully aware that social media can be used for good or bad and he provides a brilliant solution to the problem of distinguishing good blogs with valid criticism from evil blogs that have no merit.

I believe there is, and that the principles for distinguishing positive from negative creativity, whether in the short or the long run, are the same principles that have contributed to wisdom over the ages:  honesty, transparency, sincerity, following of the Golden Rule (of acting toward others the way one would have them act toward oneself), and of course deep analysis of the consequences of one’s actions. 

Again, if we just followed his example and leadership as editor of Perspectives and the convener of this Symposium, psychology could be improved or at least be restored to its former greatness.  Let’s follow the golden rule and act towards Sternberg as he would act to himself.

Last but not least, Robert Sternberg acknowledges the contributors to this symposium, although their contributions are overshadowed by the brilliant Introduction and Postscript by Sternberg himself.

These principles, or at least some of them, are exactly what current trends in psychological science are trying to achieve (see Frankenhuis, this issue; Grand et al., this issue; Wagenmakers, Dutilh, & Sarafoglou, this issue).  This is all to the good.  But as Brainerd and Reyna (this issue), Fiedler (this issue), Kaufman and Glaveanu (this issue), and Vazire (this issue) as well as some other contributors point out. 

Too bad that space limitations did not allow him to name all contributors, and the lesser ones were just mentioned as “other contributors,” but space was already tight and there were more important things about Sternberg to say.

For example, Sternberg recognizes that some of the creativity in the old days was bad.

Our field has not done an adequate job of emphasizing the analytical skills we need to ensure that our results in psychological science are sound, or at least as sound as we can make them. We did not satisfactorily police ourselves.  

Yet he recognizes that too much self-control can deplete creative people.

But I worry that our societal emphasis on promoting people up the advancement ladder by standardized tests of analytical skills may create a generation of researchers who place more and more emphasis on what they find easy—analysis—at the expense of creativity, which they (most others) may find quite a bit harder.  And when they stall in their creativity, they may fall back on critique and analysis.  This idea is not new, as I made this point first in the mid-nineteen eighties (Sternberg, 1981, 1985a, 1985c).  Given the way our students are taught and then assessed for memory and analysis, it sometimes has been difficult to make them feel comfortable thinking creatively—are we risking the possibility that an emphasis on replication will make it even harder (Sternberg, 1988, 1997a, 1997b, 2016)?

Developing creativity in students means instilling certain attitudes toward life and work in those students (Sternberg, 2000): willingness to defy the crowd, defy oneself and one’s past beliefs, defy the ongoing Zeitgeist (Sternberg, 2018), overcome obstacles, believe in oneself in the face of severe criticism, realize that one’s expertise can get in the way of one’s creativity (Sternberg & Lubart, 1995).  What would it mean to develop positive creativity?

The real danger is that the replication crisis will lead to standardized scientific practices that stifle creativity.

Increasing emphasis on replication, preregistration procedures, and related practices undoubtedly will do much good for psychological science.  Too many studies have been published that have proven to be based on remarkably flimsy data or post hoc theorizing presented as a priori.  But we in psychological science need to ensure that we do not further shift an educational system that already heavily emphasizes analytic (SAT-like and ACT-like) skills at the expense of positive creative skills.

My Humble Opinion

It is difficult to criticize a giant in the field of psychology and just like young Sternberg was wrong when he tried to find a flaw with his mentor’s theory, I am probably wrong when I am trying to find a mistake in Sternberg’s brilliant analysis of the replication crisis.

However, fully aware that I am risking public humiliation, I am going to try.  Ironically, the starting point for my critique is Sternberg’s own brilliant insight that “scientific creativity—indeed, any form of creativity—can be understood only in context (Csikszentmihalyi, 1988, 2013; Plucker, 2017; Simonton, 1994, 2004; Sternberg, 2018).”

And I think he fails to recognize the new emerging creativity in psychological science because the context (paradigm) has changed.  What looks like a threat in the old context looks like good creativity for young people who have a new perspective on psychological science.

He wrongly blames social media for cutting down creative people.

Being creative is uncomfortable—it potentially involves defying the crowd, defying oneself, and defying the Zeitgeist (Sternberg, 2018).  People always have been afraid of being creative, lest they fall prey to the “tall poppy” phenomenon, whereby they end up as the tall poppy that gets cut down, (today) by social media or by whatever means, to nothing more than the size of the other poppies.

But from the new perspective on psychological science, social media and other recent inventions are exactly the good creative forces that are needed. Eminent tall poppies, who have created an addiction to questionable research practices that make psychologists feel good about false discoveries, need to be cut down.

The internet is changing psychological science and the most creative and disruptive innovations in psychological science are happening in response to the ability to exchange information in real time with minimal costs.

First, psychologists no longer rely so heavily on undergraduate students as participants.  Larger and more diverse samples can be recruited cheaply thanks to the Internet.  Initiatives like Project Implicit are only possible due to the Internet.

Open science initiatives like data sharing or preregistration are only possible due to the Internet.

Sharing of pre-prints is only possible on the Internet. More important, the ability to publish critical articles and failed replications in peer-reviewed journals has increased thanks to the creation of online-only journals. Some of these journals, like Meta-Psychology, are even free for authors and readers, unlike the for-profit journal Perspectives on Psychological Science.

The Internet also makes it possible to write scientific blog posts without peer-review.  This can be valuable because for-profit journals have limited pages and little interest in publishing criticism or failed replications. The reason is that these articles (a) are not cited a lot and (b) can reduce citations of the articles that were criticized. No good capitalist would be interested in publishing articles that undermine the reputation of a brand and its profitability.

And last but not least, the Internet enables researchers from all over the world, including countries that are typically ignored by US American WEIRD psychologists, to participate in psychological science for free.  For example, the Psychological Methods Discussion Group on Facebook has thousands of active members from all over the world.

In conclusion, Robert Sternberg’s contributions to this Symposium demonstrate his eminence, brilliance, wisdom, and kindness, but ironically he fails to see where positive innovation and creativity in psychological science live these days. They do not live in American organizations like APA or APS or in Perspectives on Psychological Science behind a paywall. They live in the new, wild, chaotic, and creative world of 24/7 free communication; and this blog post is one example.

This blog has a comment section, and Robert Sternberg is welcome to comment there. However, it is unlikely that he will do so because comments on this blog do not count towards his publications, and self-citations in the comments do not count towards his citation count.

Implicit Racism, Starbucks, and the Failure of Experimental Social Psychology

Implicit racism is in the news again (CNN).   A manager of a Starbucks in Philadelphia called 911 to ask police to remove two Black men from the coffee shop because they had not purchased anything.  The problem is that many White customers frequent Starbucks without purchasing anything and the police are not called.  The incident caused widespread protests, and Starbucks announced that it would close all of its stores for “implicit bias training.”

Derrick Johnson, president and CEO of the NAACP, explained the need for company-wide training in this quote.

“The Starbucks situation provides dangerous insight regarding the failure of our nation to take implicit bias seriously,” said the group’s president and CEO Derrick Johnson in a statement. “We refuse to believe that our unconscious bias –the racism we are often unaware of—can and does make its way into our actions and policies.”

But was it implicit bias? It does not matter. Johnson could have talked about racism without changing what happened or the need for training.

“The Starbucks situation provides dangerous insight regarding the failure of our nation to take racism seriously,” said the group’s president and CEO Derrick Johnson in a statement. “We refuse to believe that we are racists and that racism can and does make its way into our actions and policies.”

We have not heard from the store manager why she called the police. This post is not about a single incident at Starbucks because psychological science can rarely provide satisfactory answers to single events.  However, the call for training thousands of Starbucks’ employees is not a single event.  It implies that social psychologists have developed scientific ways to measure “implicit bias” and ways to change it. This is the topic of this post.

What is implicit bias and what can be done to reduce it?

The term “implicit” has a long history in psychology, but it rose to prominence in the early 1990s when computers became more widely used in psychological research.  Computers made it possible to present stimuli on screens rather than on paper and to measure reaction times rather than self-ratings.  Computerized tasks were first used in cognitive psychology to demonstrate that people have associations that can influence their behaviors.  For example, participants are faster to determine that “doctor” is a word if the word is presented after a related word like “hospital” or “nurse.”

The term implicit is used for effects like this because the effect occurs without participants’ intention, conscious reflection, or deliberation. Participants respond this way whether they want to or not.  Implicit effects can occur with or without awareness, but they are generally uncontrollable.

After a while, social psychologists started to use computerized tasks that were developed by cognitive psychologists to study social topics like prejudice.  Most studies used White participants to demonstrate prejudice with implicit tasks. For example, the association task described above can be easily modified by showing traditionally White or Black names (in the beginning computers could not present pictures) or faces.

Given the widespread prevalence of stereotypes about African Americans, many of these studies demonstrated that White participants respond differently to Black or White stimuli.  Nobody doubts these effects.  However, there remain two unanswered questions about these effects.

What (the fuck) is Implicit Racial Bias?

First, do responses in this implicit task with racial stimuli measure a specific form of prejudice?  That is, do implicit tasks measure plain old prejudice with a new measure or do they actually measure a new form of prejudice?  The main problem is that psychologists are not very good at distinguishing constructs and measures.  This goes back to the days when psychologists equated measures and constructs.  For example, to answer the difficult question whether IQ tests measure intelligence, it was simply postulated that intelligence is what IQ tests measure.  Similarly, there is no clear definition of implicit racial bias.  In social psychology implicit racism is essentially whatever leads to different responses to Black and White stimuli in an implicit task.

The main problem with this definition is that different implicit tasks show low convergent validity.  Somebody can take two different “implicit tests” (e.g., the popular Implicit Association Test, IAT, or the Affect Misattribution Procedure) and get different results.  The correlations between two different tests range from 0 to .3, which means that the tests disagree with each other more than they agree.

Twenty years after the first implicit tasks were used to study prejudice, we still do not know whether implicit bias even exists or how it could be measured, despite the fact that these tests have been made available to the public to “test their racial bias.”  These tests do not meet the standards of real psychological tests, and nobody should take their test scores too seriously.  A brief moment of self-reflection is likely to provide better evidence about your own feelings towards different social groups.  How would you feel if somebody from this group moved in next door? How would you feel if somebody from this group married your son or daughter?  Responses to questions like these have been used for over 100 years, and they still show that most people have a preference for their own group over most other groups.  The main concern is that respondents may not answer these survey questions honestly.  But if you answer them in private and are honest with yourself, you will know better how prejudiced you are towards different groups than you would by taking an implicit test.

What was the Starbucks’ manager thinking or feeling when she called 911? The answer to this question would be more informative than giving her an implicit bias test.

Is it possible to Reduce Implicit Bias?

Any scientific answer to this question requires measuring implicit bias.  The ideal study to examine the effectiveness of any intervention is a randomized controlled trial.  In this case it is easy to do so because many White Americans who are prejudiced do not want to be prejudiced. They learned to be prejudiced from parents, friends, school, or media. Racism has been part of American culture for a long time, and even individuals who do not want to be prejudiced respond differently to White and African Americans.  So, there is no ethical problem in subjecting participants to an anti-racism training program. It is like asking smokers who want to quit smoking to participate in a test of a new treatment for nicotine addiction.

Unfortunately, social psychologists are not trained in running well-controlled intervention studies.  They are mainly trained to do experiments that examine the immediate effects of an experimental manipulation on some measure of interest.  Another problem is that published articles typically report only successful experiments.  This publication bias leads to the wrong impression that it may be easy to change implicit bias.

For example, one of the leading social psychologists on implicit bias published an article with the title “On the Malleability of Automatic Attitudes: Combating Automatic Prejudice With Images of Admired and Disliked Individuals” (Dasgupta & Greenwald, 2001).  The title makes two (implicit) claims: implicit attitudes can change (they are malleable), and this article introduces a method that successfully reduced them (combating them).  This article was published 17 years ago and has been cited 537 times so far.

Dasgupta.png

Study 1

The first experiment relied on a small sample of university students (N = 48).  The study had three experimental conditions with n = 18, 15, and 15 for each condition.  It is now recognized that studies with fewer than n = 20 participants per condition are questionable (Simmons et al., 2011).

The key finding in this study was that scores on the Implicit Association Test (IAT) were lower when participants were exposed to positive examples of African Americans (e.g., Denzel Washington) and negative examples of European Americans (e.g., Jeffrey Dahmer, a serial killer) than in the control condition, F(1, 31) = 5.23, p = .023.

The observed mean difference is d = .80.  This is considered a large effect. For an intervention to increase IQ, it would imply an increase of 80% of a standard deviation, or 12 IQ points.  However, in small samples, these effect size estimates vary a lot.  To get an impression of the range of variability, it is useful to compute the 95% confidence interval (CI) around the observed effect size. It ranges from d = .10 to 1.49. This means that the actual effect size could be just 10% of a standard deviation, which in the IQ analogy would imply an increase of just 1.5 IQ points.  Essentially, the results merely suggest that there is a positive effect, but they provide no information about the size of the effect. It could be very small or it could be very large.
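For readers who want to check these numbers, the calculation can be reproduced approximately from the reported F-value alone.  The sketch below (in Python) uses a common approximation for converting a two-group F-test into Cohen’s d and its confidence interval; the assumed group sizes (18 vs. 15) and the large-sample standard-error formula are my assumptions, so the output only roughly matches the values reported above.

```python
import numpy as np
from scipy import stats

def d_with_ci(F, n1, n2, alpha=0.05):
    """Approximate Cohen's d and its CI from a two-group F-test with 1 numerator df."""
    t = np.sqrt(F)                                   # F(1, df) = t(df)^2
    d = t * np.sqrt(1 / n1 + 1 / n2)                 # convert t to a standardized mean difference
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))  # large-sample SE of d
    z = stats.norm.ppf(1 - alpha / 2)
    return d, d - z * se, d + z * se

# Original Study 1: F(1, 31) = 5.23; assumed group sizes n = 18 and n = 15
print(d_with_ci(5.23, 18, 15))   # roughly d = .80, 95% CI [.09, 1.51]
```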

Unusual for social psychology experiments, the authors brought participants back 24 hours after the manipulation to see whether the brief exposure to positive examples had a lasting effect on IAT scores.  Because the results were published, we already know that it did. The only question is how strong the evidence was.

The result remained just significant, F(1, 31) = 4.16, p = .04999. A p-value greater than .05 would be non-significant, meaning the study provided insufficient evidence for a lasting change.  More troublesome is that the 95%CI around the observed mean difference of d = .73 ranged from d = .01 to 1.45.  This means it is possible that the actual effect size is just 1% of a standard deviation or 0.15 IQ points.  The small sample size simply makes it impossible to say how large the effect really is.

Study 2

Study 1 provided encouraging results in a small sample.  A logical extension for Study 2 would be to replicate the results of Study 1 with a larger sample in order to get a better sense of the size of the effect.  Another possible extension would be to see whether repeated presentations of positive examples over a longer time period have effects that last longer than 24 hours.  However, multiple-study articles in social psychology are rarely programmatic in this way (Schimmack, 2012). Instead, they are more of a colorful mosaic of studies that were selected to support a good story like “it is possible to combat implicit bias.”

The sample size in Study 2 was reduced from 48 to 26 participants.  This is a terrible decision because the results in Study 1 were barely significant and reducing sample sizes increases the risk of a false negative result (the intervention actually works, but the study fails to show it).

The purpose of Study 2 was to generalize the results of racial bias to aging bias.  Instead of African and European Americans, participants were exposed to positive and negative examples of young and old people and performed an age-IAT (old vs. young).

The statistical analysis again showed a significant mean difference, F(1, 24) = 5.13, p = .033.  However, the 95% CI again showed a wide range of possible effect sizes, from d = .11 to 1.74.  Thus, the study provides no reliable information about the size of the effect.

Moreover, it has to be noted that Study 2 did not report whether a 24-hour follow-up was conducted.  Thus, there is no replication of the finding in Study 1 that a small intervention can have an effect that lasts 24 hours.

Publication Bias: Another Form of Implicit Bias [the bias researchers do not want to talk about in public]

Significance tests are only valid if the data are based on a representative sample of possible observations.  However, it is well known that most journals, including social psychology journals, publish only successful studies (p < .05) and that researchers use questionable research practices to meet this requirement.  Even two studies are sufficient to examine whether the reported results are representative or not.

The Test of Insufficient Variance (TIVA) examines whether reported p-values are more similar to each other than we would expect based on a representative sample of data.  Selection for significance reduces variability in p-values because p-values greater than .05 are missing.

This article reported a p-value of .023 in Study 1 and .033 in Study 2.   These p-values were converted into z-values, 2.27 and 2.13, respectively. The variance of these two z-scores is 0.01.  Given the small sample sizes, it was necessary to run simulations to estimate the expected variance for two independent p-values in studies with 24 and 31 degrees of freedom. The expected variance is 0.875.  The probability of observing a variance of 0.01 or less when the expected variance is 0.875 is p = .085.  This finding raises concerns about the assumption that the reported results are based on a representative sample of observations.
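The basic logic of this test can be sketched in a few lines of Python.  The sketch below converts the two-sided p-values to z-scores and compares their observed variance to the expected variance with a chi-square test; the expected variance of 0.875 is taken from the simulation described above (the textbook default of 1 would apply with large degrees of freedom).  This is a minimal illustration of the logic, not the exact simulation code behind the reported numbers.

```python
import numpy as np
from scipy import stats

# Test of Insufficient Variance: are these p-values suspiciously similar?
p_values = np.array([0.023, 0.033])          # Study 1 and Study 2
z = stats.norm.ppf(1 - p_values / 2)         # two-sided p -> z: about 2.27 and 2.13
obs_var = np.var(z, ddof=1)                  # about 0.01

k = len(z)
expected_var = 0.875                         # from the simulation for df = 31 and 24
chi2_stat = (k - 1) * obs_var / expected_var # scaled variance follows a chi-square with k - 1 df
p_tiva = stats.chi2.cdf(chi2_stat, df=k - 1) # left tail: probability of a variance this small or smaller
print(round(obs_var, 3), round(p_tiva, 3))   # approximately 0.01 and .085
```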

In conclusion, the widely cited article with the promising title, which suggests that scores on implicit bias measures are malleable and that it is possible to combat implicit bias, provided only very preliminary results that by no means establish that merely presenting a few positive examples of African Americans reduces prejudice.

A Large-Scale Replication Study 

Nine years later, Joy-Gaba and Nosek (2010) examined whether the results reported by Dasgupta and Greenwald could be replicated.  The title of the article “The Surprisingly Limited Malleability of Implicit Racial Evaluations” foreshadows the results.

Abstract
“Implicit preferences for Whites compared to Blacks can be reduced via exposure to admired Black and disliked White individuals (Dasgupta & Greenwald, 2001). In four studies (total N = 4,628), while attempting to clarify the mechanism, we found that implicit preferences for Whites were weaker in the “positive Blacks” exposure condition compared to a control condition (weighted average d = .08). This effect was substantially smaller than the original demonstration (Dasgupta & Greenwald, 2001; d = .82).”

On the one hand, the results can be interpreted as a successful replication because the study with 4,628 participants again rejected the null hypothesis that the intervention has absolutely no effect.  However, the mean difference in the replication study is only d = .08, which corresponds to an effect of 1.2 IQ points if the study had tried to raise IQ.  Moreover, it is clear that the original study was only able to report a significant result because the observed mean difference in that study was inflated by roughly a factor of ten.

Study 1

Participants in Study 1 were Canadian students (N = 1,403). The study differed from the original in that it separated exposure to positive Black examples from exposure to negative White examples.  Ideally, real-world training programs would aim to increase liking of African Americans rather than make people think about White people as serial killers.  So, the use of only positive examples of African Americans makes an additional contribution by examining a positive intervention without negative examples of Whites.  The study also included an age condition to replicate Study 2 of the original article.

Like US Americans, Canadian students also showed a preference for Whites over Blacks on the Implicit Association Test. So failures to replicate the intervention effect are not due to a lack of racism in Canada.

A focused analysis of the race condition showed no effect of exposure to positive Black examples, t(670) = .09, p = .93.  The 95%CI of the mean difference in this study ranged from -.15 to .16.  This means that with a maximum error probability of 5%, it is possible to rule out effect sizes greater than .16.  This finding is not entirely inconsistent with the original article because the original study was inconclusive about effect sizes.

The replication study is able to provide a more precise estimate of the effect size and the results show that the effect size could be 0, but it could not be d = .2, which is typically used as a reference point for a small effect.

Study 2a

Study 2a reintroduced the original manipulation that exposed participants to positive examples of African Americans and negative examples of European Americans.  This study showed a significant difference between the intervention condition and a control condition that exposed participants to flowers and insects, t(589) = 2.08, p = .038.  The 95%CI for the effect size estimate ranged from d = .02 to .35.

It is difficult to interpret this result in combination with the result from Study 1.  First, the results of the two studies are not significantly different from each other, so it is not possible to conclude that manipulations with negative examples of Whites are more effective than those that show only positive examples of Blacks.  Second, in combination, the results of Studies 1 and 2a are not significant, meaning it is not clear whether the intervention has any effect at all (see the sketch below).  Nevertheless, the significant result in Study 2a suggests that presenting negative examples of Whites may influence responses on the race IAT.
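The post does not spell out how the two studies were combined, so the sketch below shows one reasonable way to check both claims: combining the two t-values with Stouffer’s method and testing whether the two effect sizes differ.  The approximate group sizes and the choice of method are my assumptions, not the original analysis.

```python
import numpy as np
from scipy import stats

# Study 1: t(670) = 0.09;  Study 2a: t(589) = 2.08 (two-group comparisons)
t_vals = np.array([0.09, 2.08])
df = np.array([670, 589])
n = df + 2                                   # approximate total N, assuming two equal groups

# approximate d and its standard error for each study
d = 2 * t_vals / np.sqrt(n)
se = np.sqrt(4 / n + d**2 / (2 * n))

# (1) combined evidence via Stouffer's method (t is essentially z at these df)
z_comb = t_vals.sum() / np.sqrt(len(t_vals))
print(round(2 * (1 - stats.norm.cdf(z_comb)), 3))   # about .13: not significant

# (2) difference between the two effect sizes
z_diff = (d[1] - d[0]) / np.sqrt(se[0]**2 + se[1]**2)
print(round(2 * (1 - stats.norm.cdf(z_diff)), 3))   # about .15: not significantly different
```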

Study 2b

Study 2b was an exact replication of Study 2a.  It also produced a significant mean difference between participants exposed to positive Black and negative White examples and the control condition, t(788) = 1.99, p = .047 (reported as p = .05). The 95% CI ranges from d = .002 to d = .28.

The problem is that three studies have now produced significant results with exposure to positive Black and negative White examples (original Study 1; replication Studies 2a and 2b), and all three had just-significant p-values (p = .023, p = .038, p = .047). This is unlikely without selection of data to attain significance.

Study 3

The main purpose of Study 3 was to compare an online sample, an online student sample, and a lab student sample. None of the three samples showed a significant mean difference.

Online sample: t(999) = .96, p = .34

Online student sample: t(93) = 0.51, p = .61

Lab student sample: t(75) = 0.70, p = .48

The non-significant results for the student samples are not surprising because the sample sizes are too small to detect small effects.  The non-significant result for the large online sample is more interesting.  It confirms that the two p-values in Studies 2a and 2b were too similar.  Study 3 produces the greater variability in p-values that is expected, and given the small effect size, variability was increased by a non-significant result rather than a highly significant one.

Conclusion

In conclusion, there is no reliable evidence that merely presenting a few positive Black examples alters responses on the Implicit Association Test.   There is some suggestive evidence that presenting negative White examples may reduce prejudice, presumably by decreasing favorable responses to Whites, but even this effect is very weak and may not last more than a few minutes or hours.

The large replication study shows that the highly cited original article provided misleading evidence that responses on implicit bias measures can be easily and dramatically changed by presenting positive examples of African Americans. If it were this easy to reduce prejudice, racism wouldn’t be the problem that it still is.

Newest Evidence

In a major effort, Lai et al. (2016) examined several interventions that might be used to combat racism.  The first problem with the article is that the literature review fails to mention Joy-Gaba and Nosek’s finding that interventions were rather ineffective, or evidence that implicit racism measures show little natural variation over time (Cunningham et al., 2001). Instead, the authors suggest that the “dominant view has changed over the past 15 years to one of implicit malleability” [by which they mean malleability of responses on implicit tasks with racial stimuli].  While this may accurately reflect changes in social psychologists’ opinions, it ignores the fact that there is no credible evidence that implicit attitude measures are malleable.

More importantly, the study also failed to find evidence that a brief manipulation could change performance on the IAT a day or more later, despite a sample size large enough to detect even small lasting effects.  However, some manipulations produced immediate effects on IAT scores.  The strongest effect was observed for a manipulation that required vivid imagination.

Vivid counterstereotypic scenario.

Participants in this intervention read a vivid second-person story in which they are the protagonist. The participant imagines walking down a street late at night after drinking at a bar. Suddenly, a White man in his forties assaults the participant, throws him/her into the trunk of his car, and drives away. After some time, the White man opens the trunk and assaults the participant again. A young Black man notices the second assault and knocks out the White assailant, saving the day.  After reading the story, participants are told the next task (i.e., the race IAT) was supposed to affirm the associations: White = Bad, Black = Good. Participants were instructed to keep the story in mind during the IAT.

When participants were given this instruction, pro-White bias on the IAT was reduced.  However, one day later (Study 2) or two or three days later (Study 1), IAT performance was not significantly different from the control condition.

In conclusion, social psychologists have found out something that most people already know: changing attitudes, including prejudice, is hard because attitudes are stable, even when people want to change them.  A simple, 5-minute manipulation is not an intervention, and it will not produce lasting changes in attitudes.

General Discussion

Social psychology has failed Black people, who would like to be treated with the same respect as White people, and it has failed White people who do not want to be racist.

Since Martin Luther King gave his “I Have a Dream” speech, America has made progress towards the goal of racial equality without the help of social psychologists. Racial bias remains a problem, but social psychologists are too busy with sterile experiments that have no application to the real world (no, Starbucks’ employees should not have to imagine being abducted by White sociopaths in order to avoid calling 911 on Black patrons of their stores). Performance on an implicit bias test is only relevant if it predicts behavior, and it does not do that very well.

The whole notion of implicit bias is a creation of social psychologists without scientific foundations, but 911 calls that kill Black people are real.  Maybe Starbucks could fund some real racism research at Howard University, because the mostly White professors at elite universities seem unable to develop and test real interventions that can influence real behavior.

And last but not least, don’t listen to self-proclaimed White experts.

Nosek.png


Social psychologists who have failed to validate measures and failed to conduct real intervention studies that might actually work are not experts.  It doesn’t take a Ph.D. to figure out some simple things that can be taught in a one-day workshop for Starbucks’ employees.  After all, the goal is just to get employees to treat all customers equally, which doesn’t even require a change in attitudes.

Here is one simple rule.  If you are ready to call 911 to remove somebody from your coffee shop and the person is Black, ask yourself before you dial whether you would do the same if the person were White and looked like you or your brother or sister. If so, go ahead. If not, don’t touch that dial.  Let them sit at a table like you let dozens of other people sit at their tables, because you make most of your money from people on the go anyway. Or buy them a coffee, or do something, but think twice or three times before you call the police.

And so what if it is just a PR campaign?  It is a good one. I am sure there are a few people who would celebrate a nation-wide racism training day for police (maybe without shutting down all police stations).

Real change comes from real people who protest.  Don’t wait for academics to figure out how to combat automatic prejudice.  They are more interested in citations and further research than in providing real solutions to real problems.  Trust me, I know. I am (was?) a White social psychologist myself.