All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Estimating Reproducibility of Psychology (No. 151): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.  One key finding was that out of 97 attempts to reproduce a significant result, only 36% succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to understand why a particular result did or did not replicate.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 151 “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” by Simone Schnall and colleagues has been the subject of heated debates among social psychologists.  The main finding of the article failed to replicate in an earlier replication attempt (Johnson, Cheung, & Donnellan, 2012).  In response to the replication failure, Simone Schnall suggested that the replication study was flawed and stood by her original findings.  This response led me to publish my first R-Index blog post, which suggested that the original results were not as credible as they seemed because Simone Schnall was trained to use questionable research practices that produce significant results with low replicability. She was simply not aware of the problems of using these methods. However, Simone Schnall was not happy with my blog post and when I refused to take it down, she complained to the University of Toronto about it. UofT found that the blog post did not violate ethical standards.

The background is important because the OSC replication study was one of the replication studies that were published earlier and criticized by Schnall. Thus, it is necessary to revisit Schnall’s claim that the replication failure can be attributed to problems with the replication study.

Summary of Original Article 

The article “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” was published in Psychological Science. The article has been cited 197 times overall and 20 times in 2017.


The article extends previous research that suggested a connection between feelings of disgust and moral judgments.  The article reports two experiments that test the complementary hypothesis that thoughts of purity make moral judgments less severe.  Study 1 used a priming manipulation. Study 2 evoked disgust followed by self-purification. Results in both studies confirmed this prediction.

Study 1

Forty undergraduate students (n = 20 per cell) participated in Study 1.

Half of the participants were primed with a scrambled sentence task that contained cleanliness words (e.g. pure, washed).  The other half did a scrambled sentence task with neutral words.

Right after the priming procedure, participants rated how morally wrong an action was in a series of six moral dilemmas.

The ANOVA showed a marginally significant mean difference, F(1,38) = 3.63, p = .064.  The result was reported with p-rep = .90, an experimental statistic used in Psychological Science from 2005 to 2009 that was partially motivated by an attempt to soften the strict distinction between p-values just above or below .05.  Although a p-value of .064 is not meaningfully different from a p-value of .04, neither p-value suggests that a result is highly replicable. A p-value of .05 corresponds to 50% replicability (with large uncertainty around this point estimate), and the estimate is inflated if questionable research methods were used to produce it.

Study 2

Study 2 could have followed up the weak evidence of Study 1 with a larger sample to increase statistical power.  However, the sample size in Study 2 was nearly the same (N = 44).

Participants first watched a disgusting film clip.  Half (n = 21) of the participants then washed their hands before rating moral dilemmas.  The other half (n = 22) did not wash their hands.

The ANOVA showed a significant difference between the two conditions, F(1,41) = 7.81, p = .008.

Replicability Analysis 

Study      N    Test           p      z     Observed Power
Study 1    40   F(1,38)=3.63   .064   1.85  .58*
Study 2    44   F(1,41)=7.81   .008   2.66  .76

*  using p < .10 as criterion for power analysis

With two studies it is difficult to predict replicability because observed power in a single study is strongly influenced by sampling error.  Individually, Study 1 has a low replicability index because the success (p < .10) was achieved with only 58% power. The inflation index (100 – 58 = 42) is high and the R-Index, 58 – 42 = 16, is low.

Combining both studies still produces a low R-Index (Median Observed Power = 67, Inflation = 33, R-Index = 67 – 33 = 34).
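For readers who want to verify these numbers, here is a minimal sketch of the arithmetic in Python (assuming scipy is available); this is not the code used for this blog, just the standard normal-approximation formula for observed power:

from scipy.stats import norm

def observed_power(p_two_tailed, alpha=.05):
    # Convert the two-tailed p-value to a z-score and compute the probability
    # that a study with this z-score as its true signal would again pass the
    # significance criterion (the normal-approximation "observed power").
    z = norm.ppf(1 - p_two_tailed / 2)
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - z)

op1 = observed_power(.064, alpha=.10)   # Study 1, counted as a success at p < .10 -> ~.58
op2 = observed_power(.008, alpha=.05)   # Study 2 -> ~.76

median_op = 100 * (op1 + op2) / 2       # median of two values equals their mean -> ~67
inflation = 100 - median_op             # success rate (100%) minus median observed power -> ~33
r_index = median_op - inflation         # -> ~34

The same function also shows why a p-value of exactly .05 corresponds to 50% observed power: observed_power(.05, alpha=.05) returns 0.5.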

My original blog post pointed out that we can predict replicability based on a researcher’s typical R-Index.  If a researcher typically conducts studies with high power, a p-value of .04 will sometimes occur due to bad luck, but the replication study is likely to be successful with a lower p-value because bad luck does not repeat itself.

In contrast, if a researcher conducts low powered studies, a p-value of .04 is a lucky outcome and the replication study is unlikely to be lucky again and therefore more likely to produce a non-significant result.

Since I published the blog post, Jerry Brunner and I have developed a new statistical method that allows meta-psychologists to take a researcher’s typical research practices into account. This method is called z-curve.

The figure below shows the z-curve for automatically extracted test statistics from articles by Simone Schnall from 2003 to 2017.  Trend analysis showed no major changes over time.


For some help with reading these plots check out this blog post.

The Figure shows a few things. First, it shows that the peak (mode) of the distribution is at z = 1.96, which corresponds to the criterion for significance (p < .05, two-tailed).  The steep drop on the left is not explained by normal sampling error and reveals the influence of QRPs (this is not unique to Schnall; the plot is similar for other social psychologists).  The grey line is a rough estimate of the proportion of non-significant results that would be expected given the distribution of significant results.  The discrepancy between the proportion of actual non-significant results and the grey line shows the extent of the influence of QRPs.

[Figure: z-curve plot of test statistics automatically extracted from Simone Schnall’s articles, 2003 to 2017 (Simone.Schnall.2.png)]

Once QRPs are present, observed power of significant results is inflated. The average estimate is 48%. However, actual power varies.  The estimates below the x-axis show power estimates for different ranges of z-scores.  Even z-scores between 2.5 and 3 have only an average power estimate of 38%.  This implies that the z-score of 2.66 in Study 2 has a bias-corrected observed power of less than 50%. And as 50% power corresponds to p = .05, this implies that a bias-corrected p-value is not significant.

A new way of using z-curve is to fit z-curve with different proportions of false positive results and to compare the fit of these models.

[Figure: z-curve models fitted with different assumed proportions of false positives (Simone.Schnall.3.png)]

The plot shows that models with 0 or 20% false positives fit the data about equally well, but a model with 40% false positives leads to notably worse model fit.  Although this new feature is still in development, the results suggest that few of Schnall’s results are strictly false positives, but that many of her results may be difficult to replicate because QRPs produced inflated effect sizes and much larger samples might be needed to produce significant results (e.g., N > 700 is needed for 80% power with a small effect size, d = .2).
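As a rough check on the N > 700 figure, here is a minimal normal-approximation sketch of the required sample size for a two-group comparison (an exact t-test power analysis gives a very similar number):

from scipy.stats import norm

def n_per_group(d, power=.80, alpha=.05):
    # Standard normal-approximation sample-size formula for comparing two means.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

print(round(n_per_group(0.2)))   # ~392 per group, i.e. a total N close to 800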

In conclusion, given the evidence for the presence of QRPs and the weak evidence for the cleanliness hypothesis, it is unlikely that equally underpowered studies would replicate the effect. At the same time, larger studies might produce significant results with weaker effect sizes.  Given the large sampling error in small samples, it is impossible to say how small the effects would be and how large samples would have to be to have high power to detect them.

Actual Replication Study

The replication study was carried out by Johnson, Cheung, and Donnellan.

Johnson et al. conducted replication studies of both studies with considerably larger samples.

Study 1 was replicated with 208 participants (vs. 40 in original study).

Study 2 was replicated with 126 participants (vs. 44 in original study).

Even if some changes in experimental procedures would have slightly lowered the true effect size, the larger samples would have compensated for this by reducing sampling error.

However, neither replication produced a significant result.

Study 1: F(1, 206) = 0.004, p = .95

Study 2: F(1, 124) = 0.001, p = .97.

Just like two p-values of .05 and .07 are unlikely, it is also unlikely to obtain two p-values of .95 and .97 even if the null-hypothesis is true, because sampling error produces spurious mean differences.  When the null-hypothesis is true, p-values have a uniform distribution, and we would expect 10% of p-values between .9 and 1. To observe this event twice in a row has a probability of .10 * .10 = .01.  Unusual events do sometimes happen by chance, but defenders of the original research could use this observation to suggest “reverse p-hacking,” a term coined by Fritz Strack to insinuate that it can be in the interest of replication researchers to make original effects go away.  Although I do not believe that this was the case here, it would be unscientific to ignore the surprising similarity of these two p-values.
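The .01 figure is easy to verify with a quick simulation, assuming two independent tests of a true null hypothesis (p-values uniform between 0 and 1):

import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(size=(100_000, 2))       # pairs of p-values under the null
print((p > 0.9).all(axis=1).mean())      # ~.01: both p-values fall between .9 and 1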

The authors conducted two more replication studies. These studies also produced non-significant results, with p = .31 and p = .27.  Thus, the similarity of the first two p-values was just a statistical fluke, just like some suspiciously similar  p-values of .04 are sometimes just a chance finding.

Schnall’s Response 

In a blog post, Schnall comments on the replication failure.  She starts with the observation that publishing failed replications is breaking with old traditions.

One thing, though, with the direct replications, is that now there can be findings where one gets a negative result, and that’s something we haven’t had in the literature so far, where one does a study and then it doesn’t match the earlier finding. 

Schnall is concerned that a failed replication could damage the reputation of the original researcher, if the failure is attributed either to a lack of competence or a lack of integrity.

Some people have said that well, that is not something that should be taken personally by the researcher who did the original work, it’s just science. These are usually people outside of social psychology because our literature shows that there are two core dimensions when we judge a person’s character. One is competence—how good are they at whatever they’re doing. And the second is warmth or morality—how much do I like the person and is it somebody I can trust.

Schnall believes that direct replication studies were introduced as a crime control measure in response to the revelation that Diederik Stapel had made up data in over 50 articles.  This violation of research integrity is called fabrication.  However, direct replication studies are not an effective way to detect fabrication (Stroebe & Strack, 2014).

In social psychology we had a problem a few years ago where one highly prominent psychologist turned out to have defrauded and betrayed us on an unprecedented scale. Diederik Stapel had fabricated data and then some 60-something papers were retracted… This is also when this idea of direct replications was developed for the first time where people suggested that to be really scientific we should do what the clinical trials do rather our regular [publish conceptual replication studies that work] way of replication that we’ve always done.

Schnall overlooks that another reason for direct replications were concerns about falsification.

Falsification is manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record (The Office of Research Integrity)

In 2011/2012 numerous articles suggested that falsification is a much bigger problem than fabrication and direct replications were used to examine whether falsified evidence also produced false positive results that could not be replicated.  Failures in direct replications are at least in part due to the use of questionable research practices that inflate effect sizes and success rates.

Today it is no longer a secret that many studies failed to replicate because original studies reported inflated effect sizes (OSC, 2015).  Given the widespread use of QRPs, especially in experimental social psychology, replication failures are the norm.  In this context, it makes sense that individual researchers feel attacked if one of their studies is replicated.

There’s been a disproportional number of studies that have been singled out simply because they’re easy to conduct and the results are surprising to some people outside of the literature

Why me?  However, the OSC (2015) project did not single out individual researchers. It put any study that was published in JPSP or Psychological Science in 2008 up for replication.  Maybe the ease of replication was a factor.

Schnall’s next complaint is that failures to replicate are treated as more credible than successful original studies.

Often the way these replications are interpreted is as if one single experiment disproves everything that has come before. That’s a bit surprising, especially when a finding is negative, if an effect was not confirmed. 

This argument ignores two things. First, it ignores that original researchers have a motivated bias to show a successful result.  Researchers who conduct direct replication studies are open to finding a positive or a negative result.  Second, Schnall ignores sample size.  Her original Study 1 had a sample size of N = 40.  The replication study had a sample size of N = 208.  Studies with larger samples have less sampling error and are more robust to violations of statistical assumptions underlying significance tests.  Thus, there are good reasons to believe the results of the failed replication studies more than the results of Schnall’s small original study.

Her next issue was that a special issue published a failed replication without peer review.  This led to some controversy, but it is not the norm.  More important, Schnall overstates the importance of traditional, anonymous, pre-publication peer-review.

It may not seem like a big deal but peer review is one of our laws; these are our publication ethics to ensure that whatever we declare as truth is unbiased. 

Pre-publication peer-review does not ensure that published results are unbiased. The OSC (2015) results clearly show that published results were biased in favor of supporting researchers’ hypotheses. Traditional peer-review does not check whether researchers used QRPs or not.  Peer-review does not end once a result is published.  It is possible to evaluate the results of original studies or replication studies even after the results are published.

And this is what Schnall did. She looked at the results and claimed that there was a mistake in the replication study.

I looked at their data, looked at their paper and I found what I consider a statistical problem.

However, others looked at the data and didn’t agree with her.  This led Schnall to consider replications a form of bullying.

“One thing I pointed to was this idea of this idea of replication bullying, that now if a finding doesn’t replicate, people take to social media and declare that they “disproved” an effect, and make inappropriate statements that go well beyond the data.”

It is of course ridiculous to think of failed replication studies as a form of bullying. We would not need to conduct empirical studies if only successful replication studies were allowed to be published.  Apparently some colleagues tried to point this out to Schnall.

Interestingly, people didn’t see it that way. When I raised the issue, some people said yes, well, it’s too bad she felt bullied but it’s not personal and why can’t scientists live up to the truth when their finding doesn’t replicate?

Schnall could not see it this way.  According to her, there are only two reasons why a replication study may fail.

If my finding is wrong, there are two possibilities. Either I didn’t do enough work and/or reported it prematurely when it wasn’t solid enough or I did something unethical.

In reality there are many more reasons for a replication failure. One possible explanation is that the original result was an honest false positive finding.  The very notion of significance testing implies that some published findings are false positives and that only future replication studies can tell us which published findings they are.  So a simple response to a failed replication is to say that it probably was a false positive result, and that is the end of the story.

But Schnall does not believe that it is a false positive result ….

because so far I don’t know of a single person who failed to replicate that particular finding that concerned the effect of physical cleanliness and moral cleanliness. In fact, in my lab, we’ve done some direct replications, not conceptual replications, so repeating the same method. That’s been done in my lab, that’s been done in a different lab in Switzerland, in Germany, in the United States and in Hong Kong; all direct replications. As far as I can tell it is a solid effect.

The problem with this version of the story is that it is impossible to get significant results again and again with small samples, even if the effect is real.  So, it is not credible that Schnall was able to get significant results in many unpublished studies and never obtained a contradictory result (Schimmack, 2012).

Despite many reasonable comments about the original study and the replication studies (e.g., sample size, QRPs, etc.), Schnall cannot escape the impression that replication researchers have an agenda to tear down good research.

Then the quality criteria are oftentimes not nearly as high as for the original work. The people who are running them sometimes have motivations to not necessarily want to find an effect as it appears.

This accusation motivated me to publish my first blog post and to elaborate on this study from the OSC reproducibility project.  There is ample evidence that QRPs contributed to replication failures. In contrast, there is absolutely no empirical evidence that replication researchers deliberately produced non-significant results, and as far as I know Schnall has not yet apologized for her unfounded accusation.

One reason for her failure to apologize is probably that many social psychologists expressed support for Schnall either in public or mostly in private.

I raised these concerns about the special issue, I put them on a blog, thinking I would just put a few thoughts out there. That blog had some 17,000 hits within a few days. I was flooded with e-mails from the community, people writing to me to say things like “I’m so glad that finally somebody’s saying something.” I even received one e-mail from somebody writing to me anonymously, expressing support but not wanting to reveal their name. Each and every time I said: “Thank you for your support. Please also speak out. Please say something because we need more people to speak out openly. Almost no one did so.”

Schnall overlooks a simple solution to the problem.  Social psychologists who feel attacked by failed replications could simply preregister their own direct replications with large samples and show that their results do replicate.  This solution was suggested by Daniel Kahneman in 2012 in response to a major replication failure of a study by John Bargh that cast doubt on social priming effects.

What social psychology needs to do as a field is to consider our intuitions about how we make judgments, about evidence, about colleagues, because some of us have been singled out again and again and again. And we’ve been put under suspicion; whole areas of research topics such as embodied cognition and priming have been singled out by people who don’t work on the topics. False claims have been made about replication findings that in fact are not as conclusive as they seem. As a field we have to set aside our intuitions and move ahead with due process when we evaluate negative findings. 

However, what is most telling is the complete absence of direct replications by experimental social psychologists to demonstrate that their published results can be replicated.  The first major attempt, a massive self-replication by Vohs and Schmeichel, just failed to replicate ego-depletion.

In conclusion, it is no longer a secret that experimental social psychologists have used questionable research practices to produce more significant results than unbiased studies would produce.  The response to this crisis of confidence has been denial.



Robert Sternberg’s Rise to Fame

Robert Sternberg is a psychologist interested in being famous (Am I famous yet?).  Wellbeing theories predict that he is also dissatisfied because discrepancies between resources (being a psychologist) and goals (wanting to be famous) lead to dissatisfaction (Diener & Fujita, 1995).

Ask any undergraduate about famous psychologists and they typically can name two: Freud and Skinner.

Robert Sternberg is also smart. So, he realized that just publishing more good research is not going to increase his standing in the APA fame rankings from his current rank of 60 out of 99 (APA). (In an alternative ranking by Ed Diener, who is not on the APA list but ranks #10 on his own list, Sternberg also ranks 60.)

The problem is that being a good psychologist is simply not a recipe for fame.  So there is a need to think outside the box.  For example, a Google Search retrieves 40,000 hits for Diederik Stapel and only 30,000 for David Funder (we will address the distinction between good and bad creativity later on).

It looks like Robert Sternberg has found a way to become famous. More people are talking about him right now at least within psychology circles than ever before.  The trick was to turn the position of editor of the APS journal Perspectives on Psychological Science into a tool for self-promotion.

After all, if we equate psychological science with the activities of the most eminent psychologists, reflecting on psychological science means reflecting on the activities of eminent psychologists, and if you are the editor of this journal you need to do self-reflection.  Thus, Perspectives on Psychological Science necessarily has to publish mostly auto-meta-psychological self-reflections of Robert Sternberg. These scientific self-reflections should not be confused with the paradigmatic example of Narcissus, who famously fell in love with himself, which led to a love-triangle between me, myself, and I.

Some envious second-stringers do not realize the need to focus on eminent psychologists who have made valuable contributions to psychological science and to stop the self-destructive, negative talk about a crisis in psychological science that was fueled by the long-forgotten previous editor of Perspectives on Psychological Science (her name escapes me right now).  Ironically, their petty complaints backfired and increased Robert Sternberg’s fame; not unlike how petty criticism of Donald Trump by the leftist biased media helped him to win the election in 2016.

I was pleased to see that Robert Sternberg remains undeterred in his mission to make Psychology great again and to ensure that eminent psychologists receive the recognition they deserve.

I am also pleased to share the highlights of his Introduction and Postscript to the forthcoming “Symposium on Modern Trends in Psychological Science: Good, Bad, or Indifferent?”  [the other contributions are not that important]

Introduction

Robert Sternberg’s Introduction takes a historic perspective on psychological science, which overlaps not coincidentally to a large extent with Robert Sternberg’s career.

“I was 25 years old, a first-year assistant professor. I came to believe that my faculty mentor, the late Wendell Garner, had a mistaken idea about the structure of perceptual stimuli. I did a study to show Garner wrong: It worked—or at least I thought it did!  Garner told me he did not think much of the study.  I did, however.  I presented the work as a colloquium at Bell Labs.  My namesake, Saul Sternberg (no relation), was in the audience.  He asked what appeared to be a simple question.  The simple question demolished my study.  I found myself wishing that a hole would open up in the ground and swallow me up.  But I was actually lucky: The study was not published.  What if it had been? I went back to Yale and told Professor Garner that the study was a bad misfire.  I expected him to be angry; but he was not.  He said something like: “You learned a valuable lesson from the experience.  You are judged in this field by the positive contributions you make, not by the negative ones.”  Garner was intending to say, I think, that the most valuable contributions are those that build things up rather than tear things down.

The most valuable lesson from these formative years of psychological science is:

You are judged largely by the positive contributions you make, much more so than by the negative ones. 

The implications of this insight are clear and have been formalized by another eminent Cornell researcher (no not Wansink), Daryl Bem, in a contribution to one of Sternberg’s great books (“Let’s err on the side of discovery”).

If you are judged by your successes, eminence is achieved by making lots of positive contributions (p < .05).  It doesn’t matter whether some second-stringer replicators later show that some of your discoveries are false positives. Surely some will be true positives and you will be remembered forever for these positive contributions.  The negative contributions don’t count and don’t hurt your rise to fame or eminence (unless you fake it, Stapel).

For a long time even false positives were not a problem because nobody actually bothered to examine whether discoveries were true or false.  So just publishing as many positives as possible was the best way to become famous; nobody noticed that it was false fame.

This is no longer the case and replication failures are threatening the eminence of some psychologists. However, driven people know how to turn a crisis into an opportunity; for example, an opportunity for more self-citations.

Sternberg ponders deep questions about the replication revolution in psychology.

So replication generally is good, but is it always good, and how good?  Is there a danger that young scientists who might have gone on to creative careers pushing the boundaries of science (Sternberg, Kaufman, & Pretz, 2002) will instead become replicators, spending their time replacing insights about new ideas and phenomena (Sternberg & Davidson, 1982, 1983) with repetitions of old ideas and tired old phenomena?  

Or is replication and, generally, repeating what others have done before, one of many forms of creativity (Frank & Saxe, 2012; Niu & Sternberg, 2003; Sternberg, 2005; Sternberg, Kaufman, & Pretz, 2002; Zhang & Sternberg, 1998) that in the past has been undervalued?  Moreover, is anyone truly just a “replicator”?

These are important questions that require 7 self-citations because Sternberg has made numerous important contributions to meta-psychology.

There is also strong evidence that researchers should focus on positive contributions rather than trying to correct others’ mistakes.  After all, why should anybody be bothered by Bem’s (2011) demonstration that students can improve their exam grades by studying AFTER taking the exam, but only at Cornell U.

In my own experience, my critiques (Sternberg, 1985a, 1986) have had much less impact than my positive contributions (Sternberg, 1981, 1984, 1997a, 1997b; Sternberg & Grigorenko, 2004; Sternberg & Hedlund, 2003; Sternberg & Smith, 1985), and I always thought this was generally true, but maybe that’s just my own limitation. [9 self-citations]

Be grateful for faculty mentors who are not only brilliant but also wise and kind—there are not so many of them.

An influential study also found that academics are more likely to be brilliant than wise or kind (Sternberg, 2016).  This is a problem in the age of social media, because some academics use their unkind brilliance to damage the reputation of researchers who are just trying to be famous.

The advent of social media, wherein essays, comments, and commentaries are not formally refereed, has led to much more aggressive language than those of us socialized in the latter years of the twentieth century ever were accustomed to.  Sometimes, attacks have become personal, not just professional.  And sometimes, replies to un-refereed critiques can look more like echo chambers than like genuinely critical responses to points that have been made.   

Sternberg himself shows wisdom and kindness in his words for graduate students and post-doctoral students.

How can one navigate a field in rapid transition?  I believe the answers to that question are the same as they always have been.  First, do the very best work of which you are capable.  I never worried too much about all the various crises the field went through as I grew up in it—I focused on doing my best work.  Second, remember that the most eminent scientists usually are not the crowd-followers but rather the crowd-defiers (Sternberg, 2003; Sternberg, Fiske, & Foss, 2016; Sternberg & Lubart, 1995) and the ones who can defy the current Zeitgeist (Sternberg, 2018).  So if you are not doing what everyone else is doing and following every trend everyone else is following, you may well end up being better off. 

In one word, don’t worry and just be like Sternberg [4 self-citations]

Of course, even an eminent scholar cannot do it all alone and Robert Sternberg does acknowledge the contribution of several people who helped him polish this brilliant and wise contribution to the current debate about the future of psychological science and it would be unkind if I didn’t mention their names (Brad Bushman, Alexandra Freund, June Gruber, Diane Halpern, Alex Holcombe, James Kaufman, Roddy Roediger, and Dan Simons).

Well done everybody.  Good to know that psychological science can build on solid foundations and a new generation of psychologists can stand on the broad shoulders of Robert Sternberg.

Postscript

Robert Sternberg’s brilliance also shines in the concluding statements that bring together the valuable contributions of select, eminent, contributors to this “symposium.”  He points out his valuable contribution to publishing in psychological science journals.

In a book I edited on submitting papers to psychology journals, Bem (2000) wrote: There are two possible articles you can write: (a) the article you planned to write when you designed your study or (b) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (b).  

Bem’s advice reflected the state of the field in 1975, when I received my PhD, in 2000, when he wrote the article, and even more recently.  Today, such “HARKing” (Hypothesizing After the Results are Known) would likely be viewed with great suspicion. Both p-hacking and HARKing require a certain degree of creativity. 

Many professors and students, not only when I was in graduate school, but also throughout the world have built their careers on practices once considered both creative and perfectly legitimate but that today might be viewed as dubious.  What this fact highlights is that scientific creativity—indeed, any form of creativity—can be understood only in context (Csikszentmihalyi, 1988, 2013; Plucker, 2017; Simonton, 1994, 2004; Sternberg, 2018).

Sternberg self-critically points out that his book may have contributed to the replication crisis in psychology by featuring Bem’s creative approach to science.

I would argue that in science as well as in society, we too often have valued creativity without considering whether the creativity we are valuing is positive or negative (or neutral).  In science, we can get so caught up in achieving eminence or merely the next step on a promotion ladder that we fail to consider whether the creativity we are exhibiting is truly positive.

He recognizes that falsely positive contributions are ultimately not advancing science.  He also acknowledges that it can be difficult to correct false positives.

Scholars sometimes have taken to social media because they have felt their potential contributions to refereed journals have been blocked. At the same time, it is likely that many scholars who post critiques on social media have never even tried to have their work published.  It just is easier to bypass peer review, which can be a lengthy and sometimes frustrating process.

But he is also keenly aware that social media can be abused by terrorists and totalitarian governments.

Such media initially probably seemed like a wholly good idea. The inventors of various forms of social media presumably did not think through how social media might be used to undermine free elections, to spread hateful propaganda, to serve as a platform for cyberbullying, or even to undermine careers. 

In the good old days, scientific criticism was vetted by constructive and selfless peer reviewers, who helped critics avoid making embarrassing mistakes in public.

At one time, if a scientist wished publicly to criticize another’s work, he or she had to pass the critique through peer reviewers. These reviewers often saved scientists from saying foolish and even destructive things.

Nowadays, fake news and fake criticism can spread through echo-chambers on social media.

With social media, the push of a button can bypass the need for peer reviewers.  Echo chambers of like-minded people then may reinforce what is said, no matter how obnoxious or simply wrong it may be. 

Sternberg is painfully aware that social media can be used for good or bad and he provides a brilliant solution to the problem of distinguishing good blogs with valid criticism from evil blogs that have no merit.

I believe there is, and that the principles for distinguishing positive from negative creativity, whether in the short or the long run, are the same principles that have contributed to wisdom over the ages:  honesty, transparency, sincerity, following of the Golden Rule (of acting toward others the way one would have them act toward oneself), and of course deep analysis of the consequences of one’s actions. 

Again, if we just followed his example and leadership as editor of Perspectives and the convener of this Symposium, psychology could be improved or at least be restored to its former greatness.  Let’s follow the golden rule and act towards Sternberg as he would act to himself.

Last but not least, Robert Sternberg acknowledges the contributors to this symposium, although their contributions are overshadowed by the brilliant Introduction and Postscript by Sternberg himself.

These principles, or at least some of them, are exactly what current trends in psychological science are trying to achieve (see Frankenhuis, this issue; Grand et al., this issue; Wagenmakers, Dutilh, & Sarafoglou, this issue).  This is all to the good.  But as Brainerd and Reyna (this issue), Fiedler (this issue), Kaufman and Glaveanu (this issue), and Vazire (this issue) as well as some other contributors point out. 

Too bad that space limitations did not allow him to name all contributors, and the lesser ones were just mentioned as “other contributors,” but space was already tight and there were more important things about Sternberg to say.

For example, Sternberg recognizes that some of the creativity in the old days was bad.

Our field has not done an adequate job of emphasizing the analytical skills we need to ensure that our results in psychological science are sound, or at least as sound as we can make them. We did not satisfactorily police ourselves.  

Yet he recognizes that too much self-control can deplete creative people.

But I worry that our societal emphasis on promoting people up the advancement ladder by standardized tests of analytical skills may create a generation of researchers who place more and more emphasis on what they find easy—analysis—at the expense of creativity, which they (most others) may find quite a bit harder.  And when they stall in their creativity, they may fall back on critique and analysis.  This idea is not new, as I made this point first in the mid-nineteen eighties (Sternberg, 1981, 1985a, 1985c).  Given the way our students are taught and then assessed for memory and analysis, it sometimes has been difficult to make them feel comfortable thinking creatively—are we risking the possibility that an emphasis on replication will make it even harder (Sternberg, 1988, 1997a, 1997b, 2016)?

Developing creativity in students means instilling certain attitudes toward life and work in those students (Sternberg, 2000): willingness to defy the crowd, defy oneself and one’s past beliefs, defy the ongoing Zeitgeist (Sternberg, 2018), overcome obstacles, believe in oneself in the face of severe criticism, realize that one’s expertise can get in the way of one’s creativity (Sternberg & Lubart, 1995).  What would it mean to develop positive creativity?

The real danger is that the replication crisis will lead to standardized scientific practices that stifle creativity.

Increasing emphasis on replication, preregistration procedures, and related practices undoubtedly will do much good for psychological science.  Too many studies have been published that have proven to be based on remarkably flimsy data or post hoc theorizing presented as a priori.  But we in psychological science need to ensure that we do not further shift an educational system that already heavily emphasizes analytic (SAT-like and ACT-like) skills at the expense of positive creative skills.

My Humble Opinion

It is difficult to criticize a giant in the field of psychology and just like young Sternberg was wrong when he tried to find a flaw with his mentor’s theory, I am probably wrong when I am trying to find a mistake in Sternberg’s brilliant analysis of the replication crisis.

However, fully aware that I am risking public humiliation, I am going to try.  Ironically, the starting point for my critique is Sternberg’s own brilliant insight that “scientific creativity—indeed, any form of creativity—can be understood only in context (Csikszentmihalyi, 1988, 2013; Plucker, 2017; Simonton, 1994, 2004; Sternberg, 2018).”

And I think he fails to recognize the new emerging creativity in psychological science because the context (paradigm) has changed.  What looks like a threat in the old context looks like good creativity for young people who have a new perspective on psychological science.

He wrongly blames social media for cutting down creative people.

Being creative is uncomfortable—it potentially involves defying the crowd, defying oneself, and defying the Zeitgeist (Sternberg, 2018).  People always have been afraid of being creative, lest they fall prey to the “tall poppy” phenomenon, whereby they end up as the tall poppy that gets cut down, (today) by social media or by whatever means, to nothing more than the size of the other poppies.

But from the new perspective on psychological science, social media and other recent inventions are exactly the good creative forces that are needed. Eminent tall poppies who have created an addiction to questionable research practices that make psychologists feel good about false discoveries need to be cut down.

The internet is changing psychological science and the most creative and disruptive innovations in psychological science are happening in response to the ability to exchange information in real time with minimal costs.

First, psychologists are no longer relying so heavily on undergraduate students as participants.  Larger and more diverse samples can be recruited cheaply thanks to the Internet.  Initiatives like Project Implicit are only possible due to the Internet.

Open science initiatives like data sharing or preregistration are only possible due to the Internet.

Sharing of pre-prints is only possible on the Internet. More important, the ability to publish critical articles and failed replications in peer-reviewed journals has increased thanks to the creation of online only journals. Some of these journals like Meta-Psychology are even free for authors and readers, unlike the for-profit journal Perspectives on Psychological Science.

The Internet also makes it possible to write scientific blog posts without peer-review.  This can be valuable because for-profit journals have limited pages and little interest in publishing criticism or failed replications. The reason is that these articles (a) are not cited a lot and (b) can reduce citations of the articles that were criticized. No good capitalist would be interested in publishing articles that undermine the reputation of a brand and its profitability.

And last but not least, the Internet enables researchers from all over the world, including countries that are typically ignored by US American WEIRD psychologists to participate in psychological science for free.  For example, the Psychological Methods Discussion Group on Facebook has thousands of active members from all over the world.

In conclusion, Robert Sternberg’s contributions to this Symposium demonstrate his eminence, brilliance, wisdom, and kindness, but ironically he fails to see where positive innovation and creativity in psychological science live these days. They don’t live in American organizations like APA or APS or in Perspectives on Psychological Science behind a paywall. They live in the new, wild, chaotic, and creative world of 24/7 free communication; and this blog post is one example of it.

This blog has a comment section and Robert Sternberg is welcome to comment there. However, it is unlikely that he will do so because comments on this blog will not count towards his publications, and self-citations in the comments do not count towards his citation count.


Implicit Racism, Starbucks, and the Failure of Experimental Social Psychology

Implicit racism is in the news again (CNN).  A manager of a Starbucks in Philadelphia called 911 to ask police to remove two Black men from the coffee store because they had not purchased anything.  The problem is that many White customers frequent Starbucks without purchasing anything, and the police are not called.  The incident caused widespread protests, and Starbucks announced that it would close all of its stores for “implicit bias training.”

NAACP president and CEO Derrick Johnson explained the concern about implicit bias in this quote.

“The Starbucks situation provides dangerous insight regarding the failure of our nation to take implicit bias seriously,” said the group’s president and CEO Derrick Johnson in a statement. “We refuse to believe that our unconscious bias –the racism we are often unaware of—can and does make its way into our actions and policies.”

But was it implicit bias? It does not matter. Derrick Johnson could have talked about racism without changing what happened or the need for training.

“The Starbucks situation provides dangerous insight regarding the failure of our nation to take racism seriously,” said the group’s president and CEO Derrick Johnson in a statement. “We refuse to believe that we are racists and that racism can and does make its way into our actions and policies.”

We have not heard from the store manager why she called the police. This post is not about a single incident at Starbucks because psychological science can rarely provide satisfactory answers to single events.  However, the call for training of thousands of Starbucks employees is not a single event.  It implies that social psychologists have developed scientific ways to measure “implicit bias” and ways to change it. This is the topic of this post.

What is implicit bias and what can be done to reduce it?

The term “implicit” has a long history in psychology, but it rose to prominence in the early 1990s when computers became more widely used in psychological research.  Computers made it possible to present stimuli on screens rather than on paper and to measure reaction times rather than self-ratings.  Computerized tasks were first used in cognitive psychology to demonstrate that people have associations that can influence their behaviors.  For example, participants are faster to determine that “doctor” is a word if the word is presented after a related word like “hospital” or “nurse.”

The term implicit is used for effects like this because the effect occurs without participants’ intention, conscious reflection, or deliberation. They do not want to respond this way, but they do, whether they want to or not.  Implicit effects can occur with or without awareness, but they are generally uncontrollable.

After a while, social psychologists started to use computerized tasks that were developed by cognitive psychologists to study social topics like prejudice.  Most studies used White participants to demonstrate prejudice with implicit tasks. For example, the association task described above can be easily modified by showing traditionally White or Black names (in the beginning computers could not present pictures) or faces.

Given the widespread prevalence of stereotypes about African Americans, many of these studies demonstrated that White participants respond differently to Black or White stimuli.  Nobody doubts these effects.  However, there remain two unanswered questions about these effects.

What (the fuck) is Implicit Racial Bias?

First, do responses in this implicit task with racial stimuli measure a specific form of prejudice?  That is, do implicit tasks measure plain old prejudice with a new measure or do they actually measure a new form of prejudice?  The main problem is that psychologists are not very good at distinguishing constructs and measures.  This goes back to the days when psychologists equated measures and constructs.  For example, to answer the difficult question whether IQ tests measure intelligence, it was simply postulated that intelligence is what IQ tests measure.  Similarly, there is no clear definition of implicit racial bias.  In social psychology implicit racism is essentially whatever leads to different responses to Black and White stimuli in an implicit task.

The main problem with this definition is that different implicit tasks show low convergent validity.  Somebody can take two different “implicit tests” (e.g., the popular Implicit Association Test, IAT, and the Affect Misattribution Procedure) and get different results.  The correlations between two different tests range from 0 to .3, which means that the tests disagree with each other more than they agree.

Twenty years after the first implicit tasks were used to study prejudice, we still do not know whether implicit bias even exists or how it could be measured, despite the fact that these tests are made available to the public to “test their racial bias.”  These tests do not meet the standards of real psychological tests and nobody should take their test scores too seriously.  A brief moment of self-reflection is likely to provide better evidence about your own feelings towards different social groups.  How would you feel if somebody from this group moved in next door? How would you feel if somebody from this group married your son or daughter?  Responses to questions like this have been used for over 100 years and they still show that most people have a preference for their own group over most other groups.  The main concern is that respondents may not answer these survey questions honestly.  But if you do so in private for yourself and you are honest with yourself, you will know better how prejudiced you are towards different groups than by taking an implicit test.

What was the Starbucks’ manager thinking or feeling when she called 911? The answer to this question would be more informative than giving her an implicit bias test.

Is it possible to Reduce Implicit Bias?

Any scientific answer to this question requires measuring implicit bias.  The ideal study to examine the effectiveness of any intervention is a randomized controlled trial.  In this case it is easy to do so because many White Americans who are prejudiced do not want to be prejudiced. They learned to be prejudiced through parents, friends, school, or media. Racism has been part of American culture for a long time and even individuals who do not want to be prejudiced respond differently to White and African Americans.  So, there is no ethical problem in subjecting participants to an anti-racism training program. It is like asking smokers who want to quit smoking to participate in a test of a new treatment for nicotine addiction.

Unfortunately, social psychologists are not trained in running well-controlled intervention studies.  They are mainly trained to do experiments that examine the immediate effects of an experimental manipulation on some measure of interest.  Another problem is that published articles typically report only successful experiments.  This publication bias leads to the wrong impression that it may be easy to change implicit bias.

For example, one of the leading social psychologists working on implicit bias published an article with the title “On the Malleability of Automatic Attitudes: Combating Automatic Prejudice With Images of Admired and Disliked Individuals” (Dasgupta & Greenwald, 2001).  The title makes two (implicit) claims: implicit attitudes can change (they are malleable) and this article introduces a method that successfully reduced them (combating prejudice).  This article was published 17 years ago and has been cited 537 times so far.


Study 1

The first experiment relied on a small sample of university students (N = 48).  The study had three experimental conditions with n = 18, 15, and 15 for each condition.  It is now recognized that studies with fewer than n = 20 participants per condition are questionable (Simmons et al., 2011).

The key finding in this study was that scores on the Implicit Association Test (IAT) were lower when participants were exposed to positive examples of African Americans (e.g., Denzel Washington) and negative examples of European Americans (e.g., Jeffrey Dahmer, a serial killer) than in the control condition, F(1, 31) = 5.23, p = .023.

The observed mean difference is d = .80.  This is considered a large effect. For an intervention to increase IQ, it would imply an increase of 80% of a standard deviation, or 12 IQ points.  However, in small samples these estimates of effect size vary a lot.  To get an impression of the range of variability, it is useful to compute the 95%CI around the observed effect size. It ranges from d = .10 to 1.49. This means that the actual effect size could be just 10% of a standard deviation, which in the IQ analogy would imply an increase of just 1.5 points.  Essentially, the results merely suggest that there is a positive effect, but they do not provide any information about the size of the effect. It could be very small or it could be very large.
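As a rough check, here is a sketch of this kind of interval using the common normal approximation to the sampling distribution of Cohen’s d; the interval in the text may have been computed with a slightly different method, so the bounds only match approximately:

import numpy as np
from scipy.stats import norm

def d_ci(d, n1, n2, level=.95):
    # Approximate confidence interval for Cohen's d (normal approximation).
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = norm.ppf(1 - (1 - level) / 2)
    return d - z * se, d + z * se

print(d_ci(.80, 18, 15))   # roughly (.09, 1.51), close to the reported .10 to 1.49

The same approximation applied to the 24-hour follow-up below (d = .73 with the same group sizes) gives roughly .02 to 1.44, again close to the reported interval.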

Unusual for social psychology experiments, the authors brought participants back 24 hours after the manipulation to see whether the brief exposure to positive examples had a lasting effect on IAT scores.  As the results were published, we already know that it did. The only question is how strong the evidence was.

The result remained just significant, F(1, 31) = 4.16, p = .04999. A p-value greater than .05 would be non-significant, meaning the study provided insufficient evidence for a lasting change.  More troublesome is that the 95%CI around the observed mean difference of d = .73 ranged from d = .01 to 1.45.  This means it is possible that the actual effect size is just 1% of a standard deviation or 0.15 IQ points.  The small sample size simply makes it impossible to say how large the effect really is.

Study 2

Study 1 provided encouraging results in a small sample.  A logical extension for Study 2 would be to replicate the results of Study 1 with a larger sample in order to get a better sense of the size of the effect.  Another possible extension would be to see whether repeated presentations of positive examples over a longer time period can have effects that last longer than 24 hours.  However, multiple-study articles in social psychology are rarely programmatic in this way (Schimmack, 2012). Instead, they are more of a colorful mosaic of studies that were selected to support a good story like “it is possible to combat implicit bias.”

The sample size in Study 2 was reduced from 48 to 26 participants.  This is a terrible decision because the results in Study 1 were barely significant and reducing sample sizes increases the risk of a false negative result (the intervention actually works, but the study fails to show it).

The purpose of Study 2 was to generalize the results of racial bias to aging bias.  Instead of African and European Americans, participants were exposed to positive and negative examples of young and old people and performed an age-IAT (old vs. young).

The statistical analysis showed again a significant mean difference, F(1, 24) = 5.13, p = .033.  However, the 95%CI again showed a wide range of possible effect sizes from d = .11 to 1.74.  Thus, the study provides no reliable information about the size of the effect.

Moreover, it has to be noted that Study 2 did not report whether a 24-hour follow-up was conducted.  Thus, there is no replication of the finding in Study 1 that a small intervention can have an effect that lasts 24 hours.

Publication Bias: Another Form of Implicit Bias [the bias researchers do not want to talk about in public]

Significance tests are only valid if the data are based on a representative sample of possible observations.  However, it is well known that most journals, including social psychology journals, publish only successful studies (p < .05) and that researchers use questionable research practices to meet this requirement.  Even two studies are sufficient to examine whether the results are representative or not.

The Test of Insufficient Variance examines whether reported p-values are more similar than we would expect based on a representative sample of data.  Selection for significance reduces variability in p-values because p-values greater than .05 are missing.

This article reported a p-value of .023 in Study 1 and .033 in Study 2.  These p-values were converted into z-values, 2.27 and 2.13, respectively. The variance of these two z-scores is 0.01.  Given the small sample sizes, it was necessary to run simulations to estimate the expected variance for two independent p-values in studies with 24 and 31 degrees of freedom. The expected variance is 0.875.  The probability of observing a variance of 0.01 or less when the expected variance is 0.875 is p = .085.  This finding raises concerns about the assumption that the reported results were based on a representative sample of observations.
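A minimal sketch of this comparison (the chi-square version of the Test of Insufficient Variance); the expected variance of 0.875 is simply plugged in from the simulation reported above rather than derived by the code:

import numpy as np
from scipy.stats import norm, chi2

p_values = np.array([.023, .033])
z = norm.ppf(1 - p_values / 2)            # ~2.27 and ~2.13
observed_var = np.var(z, ddof=1)          # ~0.01

expected_var = 0.875                      # simulated expectation for df = 24 and 31
k = len(z)                                # number of p-values
p_tiva = chi2.cdf(observed_var * (k - 1) / expected_var, df=k - 1)
print(round(p_tiva, 3))                   # ~.085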

In conclusion, the widely cited article with the promising title that scores on implicit bias measures are malleable and that it is possible to combat implicit bias provided very preliminary results that by no means provide conclusive evidence that merely presenting a few positive examples of African Americans reduces prejudice.

A Large-Scale Replication Study 

Nine years later, Joy-Gaba and Nosek (2010) examined whether the results reported by Dasgupta and Greenwald could be replicated.  The title of the article “The Surprisingly Limited Malleability of Implicit Racial Evaluations” foreshadows the results.

Abstract
“Implicit preferences for Whites compared to Blacks can be reduced via exposure to admired Black and disliked White individuals (Dasgupta & Greenwald, 2001). In four studies (total N = 4,628), while attempting to clarify the mechanism, we found that implicit preferences for Whites were weaker in the “positive Blacks” exposure condition compared to a control condition (weighted average d = .08). This effect was substantially smaller than the original demonstration (Dasgupta & Greenwald, 2001; d = .82).”

On the one hand, the results can be interpreted as a successful replication because the study with 4,628 participants again rejected the null-hypothesis that the intervention has absolutely no effect.  However, the mean difference in the replication study is only d = .08, which corresponds to an effect size of 1.2 IQ points if the study had tried to raise IQ.  Moreover, it is clear that the original study was only able to report a significant result because the observed mean difference was inflated by roughly a factor of ten.

Study 1

Participants in Study 1 were Canadian students (N = 1,403). The study differed from the original in that it separated exposure to positive Black examples from exposure to negative White examples.  Ideally, real-world training programs would aim to increase liking of African Americans rather than make people think about White people as serial killers.  So, the use of only positive examples of African Americans makes an additional contribution by examining a positive intervention without negative examples of Whites.  The study also included an age condition to replicate Study 2 of the original article.

Like US Americans, Canadian students showed a preference for Whites over Blacks on the Implicit Association Test. So failures to replicate the intervention effect are not due to a lack of racism in Canada.

A focused analysis of the race condition showed no effect of exposure to positive Black examples, t(670) = .09, p = .93.  The 95%CI of the mean difference in this study ranged from -.15 to .16.  This means that with a maximum error probability of 5%, it is possible to rule out effect sizes greater than .16.  This finding is not entirely inconsistent with the original article because the original study was inconclusive about effect sizes.

The replication study provides a more precise estimate of the effect size, and the results show that the effect size could be 0, but it could not be as large as d = .2, which is typically used as a reference point for a small effect.

Study 2a

Study 2a reintroduced the original manipulation that exposed participants to positive examples of African Americans and negative examples of European Americans.  This study showed a significant difference between the intervention condition and a control condition that exposed participants to flowers and insects, t(589) = 2.08, p = .038.  The 95%CI for the effect size estimate ranged from d = .02 to .35.

It is difficult to interpret this result in combination with the result from Study 1.  The results of the two studies are not significantly different from each other, so it is not possible to conclude that manipulations with negative examples of Whites are more effective than those that just show positive examples of Blacks.  In combination, the results of Study 1 and 2a are not significant, meaning it is not clear whether the intervention has any effect at all.  Nevertheless, the significant result in Study 2a suggests that presenting negative examples of Whites may influence responses on the race IAT.
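One simple way to combine the two studies is Stouffer's method; the sketch below assumes both effects are coded in the same direction:

z1 <- qnorm(1 - .93 / 2)          # Study 1,  t(670) = 0.09, p = .93
z2 <- qnorm(1 - .038 / 2)         # Study 2a, t(589) = 2.08, p = .038
z.combined <- (z1 + z2) / sqrt(2)
2 * pnorm(-z.combined)            # about .13, not significant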

Study 2b

Study 2b is an exact replication of Study 2a.  It also replicated a significant mean difference between participants exposed to positive Black and negative White examples and the control condition, t(788) = 1.99, p = .047 (reported as p = .05). The 95%CI ranges  from d = .002 to d = .28.

The problem is that now three studies produced significant results with exposure to positive Black and negative White examples (Original Study 1; replication Study 2a & 2b) and all three studies had just significant p-values (p = .023, p = .038, p = .047). This is unlikely without selection of data to attain significance.
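A rough TIVA-style check of these three p-values (illustrative only, as the three studies come from two different articles) supports this impression:

p <- c(.023, .038, .047)
z <- qnorm(1 - p / 2)                                   # 2.27, 2.08, 1.99
pchisq((length(z) - 1) * var(z), df = length(z) - 1)    # about .02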

Study 3

The main purpose of Study 3 was to compare an online sample, an online student sample, and a lab student sample. None of the three samples showed a significant mean difference.

Online sample: t(999) = .96, p = .34

Online student sample: t(93) = 0.51, p = .61

Lab student sample: t(75) = 0.70, p = .48

The non-significant results for the student samples are not surprising because the sample sizes are too small to detect small effects.  The non-significant result for the large online sample is more interesting.  It confirms that the two p-values in Studies 2a and 2b were too similar. Study 3 produces the greater variability in p-values that is expected, and given the small effect size, variability was increased by a non-significant result rather than a highly significant one.

Conclusion

In conclusion, there is no reliable evidence that merely presenting a few positive Black examples alters responses on the Implicit Association Test.  There is some suggestive evidence that presenting negative White examples may reduce prejudice, presumably by decreasing favorable responses to Whites, but even this effect is very weak and may not last more than a few minutes or hours.

The large replication study shows that the highly cited original article provided misleading evidence that responses on implicit bias measures can be easily and dramatically changed by presenting positive examples of African Americans. If it were this easy to reduce prejudice, racism wouldn’t be the problem that it still is.

Newest Evidence

In a major effort, Lai et al. (2016) examined several interventions that might be used to combat racism.  The first problem with the article is that the literature review fails to mention Joy-Gaba and Nosek's finding that interventions were rather ineffective, or the evidence that implicit racism measures show little natural variation over time (Cunningham et al., 2001). Instead they suggest that the "dominant view has changed over the past 15 years to one of implicit malleability" [by which they mean malleability of responses on implicit tasks with racial stimuli].  While this may accurately reflect changes in social psychologists’ opinions, it ignores that there is no credible evidence to suggest that implicit attitude measures are malleable.

More important, the study also failed to find evidence that a brief manipulation could change performance on the IAT a day or more later, despite a sample size large enough to detect even small lasting effects.  However, some manipulations produced immediate effects on IAT scores.  The strongest effect was observed for a manipulation that required vivid imagination.

Vivid counterstereotypic scenario.

Participants in this intervention read a vivid second-person story in which they are the
protagonist. The participant imagines walking down a street late at night after drinking at a bar. Suddenly, a White man in his forties assaults the participant, throws him/her into the trunk of his car, and drives away. After some time, the White man opens the trunk and assaults the participant again. A young Black man notices the second assault and knocks out the White assailant, saving the day.  After reading the story, participants are told the next task (i.e., the race IAT) was supposed to affirm the associations: White = Bad, Black = Good. Participants were instructed to keep the story in mind during the IAT.

When given this instruction, the pro-White bias in the IAT was reduced.  However, one day later (Study 2) or two or three days later (Study 1) IAT performance was not significantly different from a control condition.

In conclusion, social psychologists have found out something that most people already know.  Changing attitudes, including prejudice, is hard because attitudes are stable, even when participants want to change them.  A simple, 5-minute manipulation is not an intervention, and it will not produce lasting changes in attitudes.

General Discussion

Social psychology has failed Black people who would like to be treated with the same respect as White people and White people who do not want to be racist.

Since Martin Luther King gave his dream speech, America has made progress towards the goal of racial equality without the help of social psychologists. Racial bias remains a problem, but social psychologists are too busy with sterile experiments that have no application to the real world (No! Starbucks’ employees should not imagine being abducted by White sociopaths to avoid calling 911 on Black patrons of their stores). And performance on an implicit bias test is only relevant if it predicts behavior, which it does not do very well.

The whole notion of implicit bias is a creation by social psychologists without scientific foundations, but 911 calls that kill Black people are real.  Maybe Starbucks could fund some real racism research at Howard University, because the mostly White professors at elite universities seem to be unable to develop and test real interventions that can influence real behavior.

And last but not least, don’t listen to self-proclaimed White experts.

Nosek.png

 

Social psychologists who have failed to validate measures and failed to conduct real intervention studies that might actually work are not experts.  It doesn’t take a Ph.D. to figure out some simple things that can be taught in a one-day workshop for Starbucks’ employees.  After all, the goal is just to get employees to treat all customers equally, which doesn’t even require a change in attitudes.

Here is one simple rule.  If you are ready to call 911 to remove somebody from your coffee shop and the person is Black, ask yourself before you dial whether you would do the same if the person were White and looked like you or your brother or sister. If so, go ahead. If not, don’t touch that dial.  Let them sit at a table like you let dozens of other people sit at their table because you make most of your money from people on the go anyways. Or buy them a coffee, or do something, but think twice or three times before you call the police.

And so what if it is just a PR campaign.  It is a good one. I am sure there are a few people who would celebrate a nation-wide racism training day for police (maybe without shutting down all police stations).

Real change comes from real people who protest.  Don’t wait for the academics to figure out how to combat automatic prejudice.  They are more interested in citations and further research than to provide real solutions to real problems.  Trust me, I know. I am (was?) a White social psychologist myself.

Estimating Reproducibility of Psychology (No. 68): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an ideographic (study-centered) perspective.

The conclusions of these ideographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “Why People Are Reluctant to Tempt Fate” by Risen and Gilovich examined magical thinking in six experiments.  The evidence suggests that individuals are reluctant to tempt fate because it increases the accessibility of thoughts about negative outcomes. The article has been cited 58 times so far and it was cited 10 times in 2017, although the key finding failed to replicate in the OSC (Science, 2015) replication study.

Risen.png

Study 1

Study 1 demonstrated the basic phenomenon.  62 students read a scenario about a male student who applied to a prestigious university.  His mother sent him a t-shirt with the logo of the university. In one condition, he decided to wear the t-shirt. In the other condition, he stuffed it in the bottom drawer.  Participants rated how likely it would be that the student would be accepted.  Participants thought it more likely that the student would be accepted if he did not wear the t-shirt (wearing it M = 5.19, SD = 1.35; stuffed away M = 6.13, SD = 1.02), t(60) = 3.01, p = .004, d = 0.78.

Study 2

120 students participated in Study 2 (n = 30 per cell). Study 2 manipulated whether participants imagined themselves or somebody else in a scenario. The scenario was about the probability of a professor picking a student to answer a question.  The experimental factor was whether students had done the reading or not. Not having done the reading was considered tempting fate.

The ANOVA results showed a significant main effect for tempting fate (not prepared M = 3.43, SD = 2.34; prepared M = 2.53, SD = 2.24), F(1, 116) = 4.60, p = .034, d = 0.39.

Study 3

Study 3 examined whether tempting fate increases the accessibility of thoughts about negative outcomes with 211 students.  Accessibility was measured with reaction times to two scenarios matching those from Studies 1 and 2.  Participants had to indicate as quickly as possible whether the ending of a story matched the beginning of the story.

Analyses were carried out separately for each story.  Participants were faster to judge that not getting into a prestigious university was a reasonable ending after reading that a student tempted fate by wearing a t-shirt with the university logo (wearing t-shirt M = 2,671 ms, SD = 1,113) than after reading that he stuffed the shirt in the drawer (M = 3,176 ms, SD = 1,573), F(1, 171) = 11.01, p = .001, d = 0.53.

The same result was obtained for judgments of tempting fate by not doing the readings for a class, (not prepared M = 2,879 ms, SD = 1,149; prepared M = 3,112 ms, SD 1,226), F(1, 184) = 7.50, p = .007, d = 0.26.

Study 4 

Study 4 aimed to test the mediation hypothesis. Notably the sample size is much smaller than in Study 3 (N = 96 vs. N = 211).

The study used the university application scenario. For half the participants the decision was acceptance and for the other half it was rejection.

The reaction time ANOVA showed a significant interaction, F(1, 87) = 15.43.

As in Study 3, participants were faster to respond to a rejection after wearing the shirt than after not wearing it (wearing M = 3,196 ms, SD = 1,348; not wearing M = 4,324 ms, SD = 2,194), F(1, 41) = 9.13, p = .004, d = 0.93.   Surprisingly, the effect size was twice as large as in Study 3.

The novel finding was that participants were faster to respond to an acceptance decision after not wearing the shirt than after wearing it (not wearing M = 2,995 ms, SD = 1,175;  wearing M = 3,551 ms, SD = 1,432),  F(1, 45) = 6.07, p = .018, d = 0.73.

Likelihood results also showed a significant interaction, F(1, 92) = 10.49, p = .002.

As in Study 2, in the rejection condition participants believed that a rejection was more likely after wearing the shirt than after putting it away (M = 5.79, SD = 1.53; M = 4.79, SD = 1.56), t(46) = 2.24, p = .030, d = 0.66.  In the new acceptance condition, participants thought that an acceptance was less likely after wearing the shirt than after putting it away (wore shirt M = 5.88, SD = 1.51;  did not wear shirt M = 6.83, SD = 1.31), t(46) = 2.35, p = .023, d = 0.69.  [The two p-values are surprisingly similar]

The mediation hypothesis was tested separately for the rejection and acceptance conditions.  For the rejection condition, the Sobel test was significant, z = 1.96, p = .05. For the acceptance condition, the result was considered to be “supported by a marginally significant Sobel (1982) test,” z = 1.91, p = .057.  [It is unlikely that two independent statistical tests produce p-values of .05 and .057]

Study 5

Study 5 is the icing on the cake. It aimed to manipulate accessibility by means of a subliminal priming manipulation.  [This was 2008 when subliminal priming was considered a plausible procedure]

Participants were 111 students.

The main story was about a woman who did or did not (tempt fate) bring an umbrella when the forecast predicted rain.  The ending of the story was that it started to rain hard.

For the reaction times, the interaction between subliminal priming and the manipulation of tempting fate (the protagonist brought an umbrella or not) was significant, F(1, 85) = 5.89.

In the control condition with a nonsense prime, participants were faster to respond to the ending that it would rain if the protagonist did not bring an umbrella than when she did (no umbrella M = 2,694 ms, SD = 876; umbrella M = 3,957 ms, SD = 2,112), F(1, 43) = 15.45, p = .0003, d = 1.19.  This finding conceptually replicated Studies 3 and 4.

In the priming condition, no significant effect of tempting fate was observed (no umbrella M = 2,749 ms, SD = 971, umbrella M = 2,770 ms, SD = 1,032).

For the likelihood judgments, the interaction was only marginally significant, F(1, 86) = 3.62, p = .06.

However, in the control condition with nonsense primes, the typical tempt fate effect was significant (no umbrella M = 6.96, SD = 1.31; M = 6.15, SD = 1.46), t(44) = 2.00, p = .052 (reported as p = .05), d = 0.58.

The tempt fate effect was not observed in the priming condition when participants were subliminally primed with rain (no umbrella M = 7.11, SD = 1.56; M = 7.16, SD = 1.41).

As in Study 4, “the mediated relation was supported by a marginally significant Sobel (1982) test,” z = 1.88, p = .06.  It is unlikely to get p = .05, p = .06, and p = .06 in three independent mediation tests.

Study 6

Study 6 is the last study and the study that was chosen for the replication attempt.

122 students participated.  Study 6 used the scenario of being called on by a professor either prepared or not prepared (tempting fate).  The novel feature was a cognitive load manipulation.

The interaction between load manipulation and tempting fate manipulation was significant, F(1, 116) = 4.15, p = .044.

The no-load condition was a replication of Study 2 and replicated a significant effect of tempting fate (not prepared M = 2.93, SD = 2.16; prepared M = 1.90, SD = 1.42), t(58) = 2.19, p = .033, d = 0.58.

Under the load condition, the effect was even more pronounced (not prepared M = 5.27, SD = 2.36; prepared M = 2.70, SD = 2.17), t(58) = 4.38, p = .00005, d = 1.15.

A comparison of participants in the tempting fate condition showed a significant difference between the load and the no-load condition, t(58) = 3.99, p = .0002, d = 0.98.

Overall the results suggest that some questionable research practices were used (e.g., mediation tests p = .05, .06, .06).  The interaction effect in Study 6 with the load condition was also just significant and may not replicate.  However, the main effect of the tempting fate manipulation on likelihood judgments was obtained in all studies and might replicate.

Replication Study 

The replication study used an Mturk sample. The sample size was larger than in the original study (N = 226 vs. 122).

The load manipulation led to higher likelihood estimates of being called on, suggesting that the load manipulation was effective even with Mturk participants, F(1,122) = 10.28.

However, the study did not replicate the interaction effect, F(1, 122) = 0.002.  More surprisingly, it also failed to show a main effect for the tempting-fate manipulation, F(1,122) = 0.50, p = .480.

One possible reason for the failure to replicate the tempting fate effect in this study could be the use of a school/university scenario (being called on by a professor) with Mturk participants who are older.

However, the results for the same scenario in the original article are not very strong.

In Study 2, the p-value was p = .034, and in the no-load condition of Study 6 the p-value was p = .033.  Thus, neither the interaction with load nor the main effect of the tempting fate manipulation is strongly supported in the original article.

Conclusion

Although it is never possible to show definitively that QRPs were used, it is possible that the use of QRPs in the original article explains the replication failure; other explanations are also possible.  The most plausible alternative explanation would be the use of an Mturk sample.  A replication study with a student sample or a replication of one of the other scenarios would be desirable.

Klaus Fiedler’s Response to the Replication Crisis: In/actions speak louder than words

Klaus Fiedler is a prominent experimental social psychologist.  Aside from his empirical articles, Klaus Fiedler has contributed to meta-psychological articles.  He is one of several authors of a highly cited article that suggested numerous improvements in response to the replication crisis: Recommendations for Increasing Replicability in Psychology (Asendorpf, Conner, De Fruyt, De Houwer, Denissen, K. Fiedler, S. Fiedler, Funder, Kliegl, Nosek, Perugini, Roberts, Schmitt, van Aken, Weber, & Wicherts, 2013).

The article makes several important contributions.  First, it recognizes that success rates (p < .05) in psychology journals are too high (although a reference to Sterling, 1959, is missing). Second, it carefully distinguishes reproducibility, replicability, and generalizability. Third, it recognizes that future studies need to decrease sampling error to increase replicability.  Fourth, it points out that reducing sampling error increases replicability because studies with less sampling error have more statistical power and reduce the risk of false negative results that often remain unpublished.  The article also points out problems with articles that present results from multiple underpowered studies.

“It is commonly believed that one way to increase replicability is to present multiple studies. If an effect can be shown in different studies, even though each one may be underpowered, many readers, reviewers, and editors conclude that it is robust and replicable. Schimmack (2012), however, has noted that the opposite can be true. A study with low power is, by definition, unlikely to obtain a significant result with a given effect size.” (p. 111)

If we assume that co-authorship implies knowledge of the content of an article, we can infer that Klaus Fiedler was aware of the problem of multiple-study articles in 2013. It is therefore disconcerting to see that Klaus Fiedler is the senior author of an article published in 2014 that illustrates the problem of multiple study articles (T. Krüger,  K. Fiedler, Koch, & Alves, 2014).

I came across this article in a response by Jens Forster to a failed replication of Study 1 in Forster, Liberman, and Kuschel (2008).  Forster cites the Krüger et al. (2014) article as evidence that their findings have been replicated, in order to discredit the failed replication in the Open Science Collaboration replication project (Science, 2015).  However, a bias analysis suggests that Krüger et al.’s five studies had low power and a surprisingly high success rate of 100%.

No        N    Test            p.val   z      OP
Study 1   44   t(41) = 2.79    0.009   2.61   0.74
Study 2   80   t(78) = 2.81    0.006   2.73   0.78
Study 3   65   t(63) = 2.06    0.044   2.02   0.52
Study 4   66   t(64) = 2.30    0.025   2.25   0.61
Study 5   170  t(168) = 2.23   0.027   2.21   0.60

z <- -qnorm(p.val / 2);  OP <- pnorm(z, mean = 1.96)   # z from two-sided p-values; OP = observed power

Median observed power is only 61%, but the success rate (p < .05) is 100%. Using the incredibility index from Schimmack (2012), we find that the binomial probability of obtaining at least one non-significant result with median power of 61% is 92%.  Thus, the absence of non-significant results in the set of five studies is unlikely.
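A minimal sketch of this computation, using the observed-power values from the table above:

OP <- c(0.74, 0.78, 0.52, 0.61, 0.60)   # observed power of the five studies
median(OP)                              # 0.61
pbinom(4, 5, median(OP))                # P(at least one of five tests non-significant) = 1 - .61^5 = .92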

As Klaus Fiedler was aware of the incredibility index by the time this article was published, the authors could have computed the incredibility of their results before publication (as Micky Inzlicht blogged, “check yourself, before you wreck yourself“).

Meanwhile, other bias tests have been developed.  The Test of Insufficient Variance (TIVA) compares the observed variance of p-values converted into z-scores to the expected variance of independent z-scores, which is 1. The observed variance is much smaller, var(z) = 0.089, and the probability of obtaining such small variation or less by chance is p = .014.  Thus, TIVA corroborates the conclusion from the incredibility index that the reported results are too good to be true.
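A minimal sketch of the TIVA computation for the five z-scores in the table above (the variance differs slightly from .089 because of rounding):

z <- c(2.61, 2.73, 2.02, 2.25, 2.21)
k <- length(z)
var(z)                                     # about 0.087
# under independence the expected variance of z-scores is 1
pchisq((k - 1) * var(z) / 1, df = k - 1)   # about .014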

Another new method is z-curve. Z-curve fits a model to the density distribution of significant z-scores.  The aim is not to show bias, but to estimate the true average power after correcting for bias.  The figure shows that the point estimate of 53% is high, but the 95%CI ranges from 5% (all 5 significant results are false positives) to 100% (all 5 results are perfectly replicable).  In other words, the data provide no empirical evidence despite five significant results.  The reason is that selection bias introduces uncertainty about the true values and the data are too weak to reduce this uncertainty.

Fiedler4

The plot also shows visually how unlikely the pile of z-scores between 2 and 2.8 is. Given normal sampling error there should be some non-significant results and some highly significant (p < .005, z > 2.8) results.

In conclusion, Krüger et al.’s multiple-study article cannot be used by Forster et al. as evidence that their findings have been replicated with credible evidence by independent researchers because the article contains no empirical evidence.

The evidence of low power in a multiple study article also shows a dissociation between Klaus Fiedler’s  verbal endorsement of the need to improve replicability as co-author of the Asendorpf et al. article and his actions as author of an incredible multiple-study article.

There is little excuse for the use of small samples in Krüger et al.’s set of five studies. Participants in all five studies were recruited from Mturk and it would have been easy to conduct more powerful and credible tests of the key hypotheses in the article. Whether these tests would have supported the predictions or not remains an open question.

Automated Analysis of Time Trends

It is very time-consuming to carefully analyze individual articles. However, it is possible to use automated extraction of test statistics to examine time trends.  I extracted test statistics from social psychology articles that included Klaus Fiedler as an author. All test statistics were converted into absolute z-scores as a common metric of the strength of evidence against the null-hypothesis.  Because only significant results can be used as empirical support for predictions of an effect, I limited the analysis to significant results (z > 1.96).  I computed the median z-score and plotted it as a function of publication year.
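The conversion itself is straightforward; a minimal sketch for t- and F-statistics is shown below (the helper functions are illustrative, not the actual extraction pipeline):

t.to.z <- function(t, df) {
  p <- 2 * pt(-abs(t), df)                  # two-sided p-value
  qnorm(1 - p / 2)                          # absolute z-score
}
F.to.z <- function(F, df1, df2) {
  p <- pf(F, df1, df2, lower.tail = FALSE)
  qnorm(1 - p / 2)
}
t.to.z(2.81, 78)                            # about 2.7, consistent with the table above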

Fiedler.png

The plot shows a slight increase in strength of evidence (annual increase = 0.009 standard deviations), which is not statistically significant, t(16) = 0.30.  Visual inspection shows no notable increase after 2011 when the replication crisis started or 2013 when Klaus Fiedler co-authored the article on ways to improve psychological science.

Given the lack of evidence for improvement,  I collapsed the data across years to examine the general replicability of Klaus Fiedler’s work.

Fiedler2.png

The estimate of 73% replicability suggests that a randomly drawn published result from one of Klaus Fiedler’s articles has a 73% chance of being replicated if the study and analysis were repeated exactly.  The 95%CI ranges from 68% to 77%, showing relatively high precision in this estimate.   This is a respectable estimate that is consistent with the overall average of psychology and higher than the average of social psychology (Replicability Rankings).   The average for some social psychologists can be below 50%.

Despite this somewhat positive result, the graph also shows clear evidence of publication bias. The vertical red line at 1.96 indicates the boundary for significant results on the right and non-significant results on the left. Values between 1.65 and 1.96 are often published as marginally significant (p < .10) and interpreted as weak support for a hypothesis. Thus, the reporting of these results is not an indication of honest reporting of non-significant results.  Given the distribution of significant results, we would expect more (grey line) non-significant results than are actually reported.  The aim of reforms such as those recommended by Fiedler himself in the 2013 article is to reduce the bias in favor of significant results.

There is also clear evidence of heterogeneity in strength of evidence across studies. This is reflected in the average power estimates for different segments of z-scores.  Average power for z-scores between 2 and 2.5 is estimated to be only 45%, which also implies that after bias-correction the corresponding p-values are no longer significant because 50% power corresponds to p = .05.  Even z-scores between 2.5 and 3 average only 53% power.  All of the z-scores from the 2014 article are in the range between 2 and 2.8 (p < .05 & p > .005).  These results are unlikely to replicate.  However, other results show strong evidence and are likely to replicate. In fact, a study by Klaus Fiedler was successfully replicated in the OSC replication project.  This was a cognitive study with a within-subject design and a z-score of 3.54.

The next Figure shows the model fit for models with a fixed percentage of false positive results.

Fiedler3.png

Model fit starts to deteriorate notably with false positive rates of 40% or more.  This suggests that the majority of published results by Klaus Fiedler are true positives. However, selection for significance can inflate effect size estimates. Thus, observed effect sizes estimates should be adjusted.

Conclusion

In conclusion, it is easier to talk about improving replicability in psychological science, particularly experimental social psychology, than to actually implement good practices. Even prominent researchers like Klaus Fiedler have responsibilities to their students to publish as much as possible.  As long as reputation is measured in terms of number of publications and citations, this will not change.

Fortunately, it is now possible to quantify replicability and to use these measures to reward research that requires more resources but provides replicable and credible evidence without the use of questionable research practices.  Based on these metrics, the article by Krüger et al. is not the norm for publications by Klaus Fiedler, and Klaus Fiedler’s replicability index of 73 is higher than the index of other prominent experimental social psychologists.

An easy way to improve it further would be to retract the weak T. Krüger et al. article. This would not be a costly retraction because the article has not been cited in Web of Science so far (no harm, no foul).  In contrast, the Asendorpf et al. (2013) article has been cited 245 times and is Klaus Fiedler’s second most cited article in Web of Science.

The message is clear.  Psychology is not in the year 2010 anymore. The replicability revolution is changing psychology as we speak.  Before 2010, the norm was to treat all published significant results as credible evidence and nobody asked how stars were able to report predicted results in hundreds of studies. Those days are over. Nobody can look at a series of p-values of .02, .03, .049, .01, and .05 and be impressed by this string of statistically significant results.  Time to change the saying “publish or perish” to “publish real results or perish.”

 

Estimating Reproducibility of Psychology (No. 64): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an ideographic (study-centered) perspective.

The conclusions of these ideographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 68 “The Effect of Global Versus Local Processing Styles on Assimilation Versus Contrast in Social Judgment” is no ordinary article.  The first author, Jens Forster, has been under investigation for scientific misconduct and it is not clear whether published results in some articles are based on real or fabricated data.  Some articles that build on the same theory and used similar methods as this article have been retracted.  Scientific fraud would be one reason why an original study cannot be replicated.

Summary of Original Article

The article uses the first author’s model of global/local processing style model (GLOMO) to examine assimilation and contrast effects in social judgment. The article reports five experiments that showed that processing styles elicited in one task can carry over to other tasks and influence social judgments.

Study 1 

This study was chosen for the replication project.

Participants were 88 students.  Processing styles were manipulated by projecting a city map on a screen and asking participants to either (a) focus on the broader shape of the city or (b) to focus on specific details on the map. The study also included a control condition.  This task was followed by a scrambled sentence task with neutral or aggressive words.  The main dependent variable were aggression ratings in a person perception task.

With 88 participants and six conditions, there are only about 13 to 15 participants per condition.

The ANOVA results showed a highly significant interaction between the processing style and priming manipulations, F(2, 76) = 21.57, p < .0001.

We can think about the 2 x 3 design as three priming experiments for each of the three processing style conditions.

The global processing condition showed a strong assimilation effect: aggression ratings were higher after aggression priming (M = 6.53, SD = 1.21) than after nonaggression priming (M = 4.15, SD = 1.25), t(26) = 5.10, p = .000007, d = 1.94.

In the control processing condition, priming also produced an assimilation effect: ratings were higher after aggression priming (M = 5.63, SD = 1.25) than after nonaggression priming (M = 4.29, SD = 1.23), t(25) = 2.79, p = .007, d = 1.08.

The local processing condition showed a significant contrast effect: ratings were lower after aggression priming (M = 2.86, SD = 1.15) than after nonaggression priming (M = 4.62, SD = 1.16), t(25) = 3.96, p = .0005, d = -1.52.
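For reference, effect sizes of this magnitude follow directly from the reported t-values; the sketch below uses a small helper function and assumes two independent cells of 13 to 14 participants each, as implied by the degrees of freedom:

d.from.t <- function(t, n1, n2) t * sqrt(1 / n1 + 1 / n2)   # d for two independent groups
d.from.t(5.10, 14, 14)   # about 1.93 (reported d = 1.94)
d.from.t(3.96, 14, 13)   # about 1.53 (reported contrast effect, d = -1.52)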

Although the reported results appear to provide strong evidence, the extremely large effect sizes raise concerns.  After all, these are not the first studies that have examined priming effects on person perception.  The novel contribution was to demonstrate that these effects change (are moderated) as a function of processing styles.  What is surprising is that processing styles also appear to have magnified the typical effects without any theoretical explanation for this magnification.

The article was cited by Isbell, Rovenpor, and Lair (2016), who used the map manipulation in combination with a mood manipulation. Their article reports a significant interaction between processing and mood, F(1,73) = 6.33, p = .014.  In the global condition, more abstract statements in an open-ended task were recorded in the angry mood condition, but the effect was not significant and much smaller than in Forster’s studies, F(1,73) = 3.21, p = .077, d = .55.  In the local condition, sad participants listed more abstract statements, but again the effect was not significant and smaller than in Forster et al.’s studies, F(1,73) = 3.20, p = .078, d = .67.  As noted before, these results are also questionable because it is unlikely to get p = .077 and p = .078 in two independent statistical tests.

In conclusion, the effect sizes reported by Forster et al. in Study 1 are unbelievable because they are much larger than would be expected from previous priming research.

Study 2

Study 2 was a replication and extension of a study by Mussweiler and Strack (2000). Participants were 124 students from the same population.  This study used a standard processing style manipulation (Navon, 1977) that presented global letters composed of several smaller, different letters (e.g., a large E made up of several small n's).  The main dependent variable was judgments of drug use.   The design had two between-subject factors: 3 (processing styles) x 2 (high vs. low comparison standard). Thus, there were about 20 to 21 participants per condition.  The study also had a within-subject factor (subjective vs. objective rating).

The ANOVA shows a 3-way interaction, F(2, 118) = 5.51, p = .005.

Once more, the 3 x 2 design can be treated as 3 independent studies of comparison standards. Because subjective and objective ratings are not independent, I focus on the objective ratings that produced stronger effects.

In the global condition, the high standard produced higher reports of drug use than the low standard (M = 0.66, SD = 1.13 vs. M = -0.47, SD = 0.57), t(39) = 4.04, p = .0004, d = 1.26.

In the control condition, a similar pattern was observed but it was not significant (M = 0.07, SD = 0.79 vs. M = -0.45, SD = 0.98), t(39) = 1.87, p = .07, d = 0.58.

In the local condition, the pattern is reversed (M = -0.41, SD = 0.83 vs. M = 0.60, SD = 0.99), t(39) = 3.54, p = .001, d = -1.11.

As the basic paradigm was a replication of Mussweiler and Strack’s (2000) Study 4, it is possible to compare the effect sizes in this study with the effect size in the original study.   The effect size in the original study was d = .31; 95%CI = -0.24, 1.01.  The effect is not significant, but the interaction effect for objective and subjective judgments was, F(1,30) = 4.49, p = .04.  The effect size is comparable to the control condition, but the  effect sizes for the global and local processing conditions are unusually large.

Study 3

132 students from the same population took part in Study 3.  This study was another replication and extension of Mussweiler and Strack (2000).  In this study, participants made ratings of their athletic abilities.  The extension was the addition of a manipulation of temporal distance (imagine being in an athletic competition today or in one year).  The design was a 3 (temporal distance: distant future vs. near future vs. control) by 2 (high vs. low standard) between-subjects design with objective vs. subjective ratings as a within-subject factor.

The three-way interaction was significant, F(2, 120) = 4.51, p = .013.

In the distant future condition, objective ratings were higher with the high standard than with the low standard (high M = 0.56, SD = 1.04; low M = -0.58, SD = .51), t(41) = 4.56, p = .0001, d = 1.39.

In the control condition,  objective ratings of athletic ability were higher after the high standard than after the low standard (high M = 0.36, SD = 1.08; low M = -0.36, SD = 0.77), t(38) = 2.44, p = .02, d = 0.77.

In the near condition, the opposite pattern was reported (high M = -0.35, SD = 0.33, vs. low M = 0.36, SD = 1.29), t(41) = 2.53; p = .02,  d = -.75.

In the original study by Mussweiler and Strack the effect size was smaller and not significant (high M = 5.92, SD = 1.88; low M = 4.89, SD = 2.37),  t(34) =  1.44, p = .15, d = 0.48.

Once more the reported effect sizes by Forster et al. are surprisingly large.

Study 4

120 students from the same population participated in Study 4.  The main novel feature of Study 4 was the inclusion of a lexical decision task and the use of reaction times as the dependent variable.   It is important to realize that most of the variance in lexical decision tasks is random noise and fixed individual differences in reaction times.  This makes it difficult to observe large effects in between-subject comparisons and it is common to use within-subject designs to increase statistical power.  However, this study used a between-subject design.  The ANOVA showed the predicted four-way interaction, F(1,108) = 26.17.

The four-way interaction was explained by a three-way interaction for self-primes, F(1, 108) = 39.65, and no significant effects for control primes.

For moderately high standards, reaction times to athletic words were slower after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

For unathletic words, the reverse pattern was observed.

For moderately high standards, reaction times were faster after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

In sum, Study 4 reported reaction time differences as a function of global versus local processing styles that were surprisingly large.

Study 5

Participants in Study 5 were 128 students.  The main novel contribution of Study 5 was the inclusion of a line-bisection task that is supposed to measure asymmetries in brain activation.  The authors predicted that local processing induces more activation of the left-side of the brain and global processing induces more activation of the right side of the brain.  The comparisons of the local and global condition with the control condition showed the predicted mean differences, t(120) = 1.95, p = .053 (reported as p = .05) and t(120) = 2.60, p = .010.   Also as predicted, the line-bisection measure was a significant mediator, z = 2.24, p = .03.

The Replication Study 

The replication project called for replication of the last study, but the replication team in the US found that it was impossible to do so because the main outcome measure of Study 5 was alcohol consumption and drug use (just like Study 2) and pilot studies showed that incidence rates were much lower than in the German sample.  Therefore the authors replicated the aggression priming study of Study 1.

The focal test of the replication study was the interaction between processing condition and priming condition. As noted earlier, this interaction was very strong,  F(2, 76) = 21.57, p < .0001, and therefore seemingly easy to replicate.

Fortunately, the replication team dismissed the outcome of a post-hoc power analysis, which suggested that only 32 participants would be needed, and instead used the same sample size as the original study.

The processing manipulation was changed from a map of the German city of Oldenburg to a state map of South Carolina.  This map was provided by the original authors. The replication report emphasizes that “all changes were endorsed by the first author of the original study.”

The actual sample size was a bit smaller (N = 74) and after exclusion of 3 suspicious participants data analyses were based on 71 (vs. 80 in original study) participants.

The ANOVA failed to replicate a significant interaction effect, F(2, 65) = .865, p = .426.

The replication study also included questions about the effectiveness of the processing style manipulation.  Only 32 participants indicated that they followed instructions.  Thus, one possible explanation for the replication failure is that the replication study did not successfully manipulate processing styles. However, the original study did not include a similar question and it is not clear why participants in the original study were more compliant.

More troublesome is that the replication study did not replicate the simple priming effect in the control condition or the global condition, which should have produced the effect with or without a successful manipulation of processing styles.

In the control condition, the mean was lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.27, SD = 1.29, neutral M = 7.00, SD = 1.30), t(22) = 1.38, p = .179, d = -.56.

In the global condition, the mean was also lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.38, SD = 1.75, neutral M = 7.23, SD = 1.46), t(22) = 1.29, p = .207, d = -.53.

In the local condition, the means were nearly identical (aggression M = 7.77, SD = 1.16, neutral M = 7.67, SD = 1.27), t(22) = 0.20, p = .842, d = .08.

The replication report points out that the priming task was introduced by Higgins, Rholes, and Jones (1977).   Careful reading of this article shows that the original article also did not show immediate effects of priming.  The study obtained target ratings immediately and 10 to 14 days later.  The ANOVA showed an interaction with time, F(1,36) = 4.04, p = .052 (reported as p < .05).

“A further analysis of the above Valence x Time interaction indicated that the difference in evaluation under positive and negative conditions was small and nonsignificant on the immediate measure (M = .8 and .3 under positive and negative conditions, respectively), t(38)= 0.72, p > .25 two-tailed; but was substantial and significant on the delayed measure.” (Higgins et al., 1977).

Conclusion

There are serious concerns about the strong effects in the original article by Forster et al. (2008).   Similar results have raised concerns about data collected by Jens Forster. Although investigations have yielded no clear answers about the research practices, some articles have been retracted (Retraction Watch).  Social priming effects have also proven to be difficult to replicate (R-Index).

The strong effects reported by Forster et al. are not the result of typical questionable research practices that produce just-significant results.  Thus, statistical methods that predict replicability falsely predict that Forster’s results would be easy to replicate; only actual replication studies or forensic analyses of the original data might reveal that the reported results are not trustworthy.  Statistical predictions of replicability are therefore likely to overestimate replicability because they do not detect all questionable practices or fraud.