A critique of Stroebe and Strack’s Article “The Alleged Crisis and the Illusion of Exact Replication”

The article by Stroebe and Strack (2014) [henceforth S&S] illustrates how experimental social psychologists responded to replication failures at the beginning of the replicability revolution. The response is a classic example of repressive coping: Houston, we do not have a problem. Even in 2014, problems with the way experimental social psychologists had conducted research for decades were obvious (Bem, 2011; Wagenmakers et al., 2011; John et al., 2012; Francis, 2012; Schimmack, 2012; Pashler & Wagenmakers, 2012). S&S’s article is an attempt to dismiss these concerns as misunderstandings and empirically unsupported criticism.

“In contrast to the prevalent sentiment, we will argue that the claim of a replicability crisis is greatly exaggerated” (p. 59).  

Although the article was well received by prominent experimental social psychologists (see citations in appendix), future events proved S&S wrong and vindicated critics of research methods in experimental social psychology. Only a year later, the Open Science Collaboration (2015) reported that only 25% of studies in social psychology could be replicated successfully. A statistical analysis of focal hypothesis tests in social psychology suggests that roughly 50% of original studies could be replicated successfully if these studies were replicated exactly (Motyl et al., 2017). Ironically, one of S&S’s points is that exact replication studies are impossible. As a result, the 50% estimate is an optimistic estimate of the success rate for actual replication studies, suggesting that the actual replicability of published results in social psychology is less than 50%.

Thus, even if S&S had reasons to be skeptical about the extent of the replicability crisis in experimental social psychology, it is now clear that experimental social psychology has a serious replication problem. Many published findings in social psychology textbooks may not replicate and many theoretical claims in social psychology rest on shaky empirical foundations.

What explains the replication problem in experimental social psychology?  The main reason for replication failures is that social psychology journals mostly published significant results.  The selective publishing of significant results is called publication bias. Sterling pointed out that publication bias in psychology is rampant.  He found that psychology journals publish over 90% significant results (Sterling, 1959; Sterling et al., 1995).  Given new estimates that the actual success rate of studies in experimental social psychology is less than 50%, only publication bias can explain why journals publish over 90% results that confirm theoretical predictions.

It is not difficult to see that reporting only studies that confirm predictions undermines the purpose of empirical tests of theoretical predictions.  If studies that do not confirm predictions are hidden, it is impossible to obtain empirical evidence that a theory is wrong.  In short, for decades experimental social psychologists have engaged in a charade that pretends that theories are empirically tested, but publication bias ensured that theories would never fail.  This is rather similar to Volkswagen’s emission tests that were rigged to pass because emissions were never subjected to a real test.

In 2014, there were ample warning signs that publication bias and other dubious practices inflated the success rate in social psychology journals.  However, S&S claim that (a) there is no evidence for the use of questionable research practices and (b) that it is unclear which practices are questionable or not.

“Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology. In fact, the discipline still needs to reach an agreement about the conditions under which these practices are unacceptable” (p. 60).

Scientists like to hedge their statements so that they are immune to criticism. S&S may argue that the evidence in 2014 was not “solid” and surely there was and still is no agreement about good research practices. However, this is irrelevant. What is important is that success rates in social psychology journals were and still are inflated by suppressing disconfirming evidence and biasing empirical tests of theories in favor of positive outcomes.

Although S&S’s main claims are not based on empirical evidence, it is instructive to examine how they tried to shield published results and established theories from the harsh light of open replication studies that report results without selection for significance and subject social psychological theories to real empirical tests for the first time.

Failed Replication of Between-Subject Priming Studies

S&S discuss failed replications of two famous priming studies in social psychology: Bargh’s elderly priming study and Dijksterhuis’s professor priming studies. Both seminal articles reported several successful tests of the prediction that a subtle priming manipulation would influence behavior without participants even noticing the priming effect. Doyen et al. (2012) failed to replicate elderly priming. Shanks et al. (2013) failed to replicate professor priming effects, and more recently a large registered replication report also provided no evidence for professor priming. For naïve readers it is surprising that original studies had a 100% success rate and replication studies had a 0% success rate. However, S&S are not surprised at all.

“as in most sciences, empirical findings cannot always be replicated” (p. 60). 

Apparently, S&S know something that naïve readers do not know. The difference between naïve readers and experts in the field is that experts have access to unpublished information about failed replications in their own labs and in the labs of their colleagues. Only they know how hard it sometimes was to get the successful outcomes that were published. With the added advantage of insider knowledge, it makes perfect sense to expect replication failures, although maybe not a success rate of 0%.

The problem is that S&S give the impression that replication failures are to be expected, but this expectation cannot be based on the objective scientific record, which hardly ever reports results that contradict theoretical predictions. Replication failures occur all the time, but they remained unpublished. Doyen et al.’s and Shanks et al.’s articles merely violated the code to publish only supportive evidence.

Kahneman’s Train Wreck Letter

S&S also comment on Kahneman’s letter to Bargh that compared priming research to a train wreck.  In response S&S claim that

“priming is an entirely undisputed method that is widely used to test hypotheses about associative memory (e.g., Higgins, Rholes, & Jones, 1977; Meyer & Schvaneveldt, 1971; Tulving & Schacter, 1990).” (p. 60).  

This argument does not stand the test of time. Since S&S published their article, researchers have distinguished more clearly between highly replicable priming effects in cognitive psychology with repeated measures and within-subject designs and difficult-to-replicate between-subject social priming studies with subtle priming manipulations and a single outcome measure (BS social priming). With regard to BS social priming, it is unclear which of these effects can be replicated, and leading social psychologists have been reluctant to demonstrate the replicability of their famous studies by conducting self-replications as they were encouraged to do in Kahneman’s letter.

S&S also point to empirical evidence for robust priming effects.

“A meta-analysis of studies that investigated how trait primes influence impression formation identified 47 articles based on 6,833 participants and found overall effects to be statistically highly significant (DeCoster & Claypool, 2004).” (p. 60). 

The problem with this evidence is that this meta-analysis did not take publication bias into account; in fact, it does not even mention publication bias as a possible problem. A meta-analysis of studies that were selected for significance is also biased by selection for significance.

Several years after Kahneman’s letter, it is widely agreed that past research on social priming is a train wreck. Kahneman published a popular book that celebrated social priming effects as a major scientific discovery in psychology. Nowadays, he agrees with critics that the existing evidence is not credible. It is also noteworthy that none of the researchers in this area have followed Kahneman’s advice to replicate their own findings to show the world that these effects are real.

It is all a big misunderstanding

S&S suggest that “the claim of a replicability crisis in psychology is based on a major misunderstanding.” (p. 60). 

Apparently, lay people, trained psychologists, and a Nobel laureate are mistaken in their interpretation of replication failures. S&S suggest that failed replications are unimportant.

“the myopic focus on “exact” replications neglects basic epistemological principles” (p. 60).  

To make their argument, they introduce the notion of exact replications and suggest that exact replication studies are uninformative.

 “a finding may be eminently reproducible and yet constitute a poor test of a theory.” (p. 60).

The problem with this line of argument is that we are supposed to assume that a finding is eminently reproducible, which probably means it has been successfully replicated many times. It seems sensible that further studies of gender differences in height are unnecessary to convince us that there is a gender difference in height. However, results in social psychology are not like gender differences in height. By S&S’s own account earlier, “empirical findings cannot always be replicated” (p. 60). And if journals only publish significant results, it remains unknown which results are eminently reproducible and which results are not. S&S ignore publication bias and pretend that the published record suggests that all findings in social psychology are eminently reproducible. Apparently, they would suggest that even Bem’s finding that people have supernatural abilities is eminently reproducible. These days, few social psychologists are willing to endorse this naïve interpretation of the scientific record as a credible body of empirical facts.

Exact Replication Studies are Meaningful if they are Successful

Ironically, S&S next suggest that exact replication studies can be useful.

Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework. Thus, the fact that priming individuals with the stereotype of the elderly resulted in a reduction of walking speed was a finding that was unexpected. Furthermore, even though it was consistent with existing theoretical knowledge, there was no consensus about the processes that mediate the impact of the prime on walking speed. It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper.

Similarly, Dijksterhuis and van Knippenberg (1998) conducted four studies in which they replicated the priming effects. Three of these studies contained conditions that were exact replications.

Because it is standard practice in publications of new effects, especially of effects that are surprising, to publish one or two exact replications, it is clearly more conducive to the advancement of psychological knowledge to conduct conceptual replications rather than attempting further duplications of the original study.

Given these citations, it is problematic that S&S’s article is often cited to claim that exact replications are impossible or unnecessary. The argument that S&S are making here is rather different. They are suggesting that original articles already provide sufficient evidence that results in social psychology are eminently reproducible because original articles report multiple studies and some of these studies are often exact replication studies. At face value, S&S have a point. An honest series of statistically significant results makes it practically impossible that an effect is a false positive result (Schimmack, 2012). The problem is that multiple-study articles are not honest reports of all replication attempts. Francis (2014) found that at least 80% of multiple-study articles showed statistical evidence of questionable research practices. Given the pervasive influence of selection for significance, exact replication studies in original articles provide no information about the replicability of these results.
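The logic behind this point can be put in numbers. The following is a minimal Python sketch of the general idea only, not Francis’s or Schimmack’s actual procedure; the function name and the power values are illustrative assumptions. If every study in a series has the same statistical power, the probability that all of them come out significant is that power raised to the number of studies.

```python
def prob_all_significant(power: float, n_studies: int) -> float:
    """Probability that n independent studies with the same power are all significant."""
    return power ** n_studies


# Illustrative (assumed) values; none of these numbers come from S&S or Francis.
# 0.05 is the rate of significant results when the true effect is zero.
for power in (0.05, 0.50, 0.80):
    print(power, prob_all_significant(power, 4))
```

With four studies and no true effect, an honest series of four significant results would occur far less often than once in 10,000 attempts, which is why an honest series would be convincing. But even with 50% power, a perfect record of four out of four is expected only about 6% of the time, so a literature full of perfect records points to selection.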

What made the failed replications by Doyen et al. and Shanks et al. so powerful was that these studies were the first real empirical tests of BS social priming effects because the authors were willing to report successes or failures. The problem for social psychology is that many textbook findings that were obtained with selection for significance cannot be reproduced in honest empirical tests of the predicted effects. This means that the original effects were either dramatically inflated or may not exist at all.

Replication Studies are a Waste of Resources

S&S want readers to believe that replication studies are a waste of resources.

“Given that both research time and money are scarce resources, the large scale attempts at duplicating previous studies seem to us misguided” (p. 61).

This statement sounds a bit like a plea to spare social psychology from the embarrassment of actual empirical tests that reveal the true replicability of textbook findings. After all, according to S&S it is impossible to duplicate original studies (i.e., conduct exact replication studies) because replication studies differ in some way from original studies and may not reproduce the original results. So, none of the failed replication studies is an exact replication. Doyen et al. replicated in Belgium a study that Bargh had conducted in New York City, and Shanks et al. replicated Dijksterhuis’s studies from the Netherlands in the United States. The finding that the original results could not be replicated does not imply that the original findings were false positives, but it does imply that these findings may be unique to some unspecified specifics of the original studies. This is noteworthy when original results are used in textbooks as evidence for general theories and not as historical accounts of what happened in one specific socio-cultural context during a specific historic period. As social situations and human behavior are never exact replications of the past, social psychological results need to be permanently replicated, and doing so is not a waste of resources. Suggesting that replication is a waste of resources is like suggesting that measuring GDP or unemployment every year is a waste of resources because we can just use last year’s numbers.

As S&S ignore publication bias and selection for significance, they are also ignoring that publication bias leads to a massive waste of resources. First, running empirical tests of theories that are not reported is a waste of resources. Second, publishing only significant results is also a waste of resources because researchers design new studies based on the published record. When the published record is biased, many new studies will fail, just like airplanes that are designed based on flawed science would drop from the sky. Thus, a biased literature creates a massive waste of resources.

Ultimately, a science that publishes only significant results wastes all resources because the outcome of the published studies is a foregone conclusion: the prediction was supported, p < .05. Social psychologists might as well publish purely theoretical articles, just like philosophers in the old days used “thought experiments” to support their claims. An empirical science is only a real science if theoretical predictions are subjected to tests that can fail. By this simple criterion, experimental social psychology is not (yet) a science.

Should Psychologists Conduct Exact Replications or Conceptual Replications?

Stroebe and Strack next cite Pashler and Harris (2012) to claim that critics of experimental social psychology have dismissed the value of so-called conceptual replications.

“The main criticism of conceptual replications is that they are less informative than exact replications (e.g., Pashler & Harris, 2012).”

Before I examine S&S’s counterargument, it is important to realize that S&S misrepresented, and maybe misunderstood, Pashler and Harris’s main point. Here is the relevant quote from Pashler and Harris’s article.

We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling “pathological science” cases that embarrassed the natural sciences at several points in the 20th century (e.g., Polywater; Rousseau & Porto, 1970; and cold fusion; Taubes, 1993).

The problem for S&S is that they cannot address the problem of publication bias and therefore carefully avoid talking about it. As a result, they misrepresent Pashler and Harris’s critique of conceptual replications in combination with publication bias as a criticism of conceptual replication studies, which is absurd and not what Pashler and Harris intended to say or actually said. The following quote from their article makes this crystal clear.

However, what kept faith in cold fusion alive for some time (at least in the eyes of some onlookers) was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications). This suggests that one important hint that a controversial finding is pathological may arise when defenders of a controversial effect disavow the initial methods used to obtain an effect and rest their case entirely upon later studies conducted using other methods. Of course, productive research into real phenomena often yields more refined and better ways of producing effects. But what should inspire doubt is any situation where defenders present a phenomenon as a “moving target” in terms of where and how it is elicited (cf. Langmuir, 1953/1989). When this happens, it would seem sensible to ask, “If the finding is real and yet the methods used by the original investigators are not reproducible, then how were these investigators able to uncover a valid phenomenon with methods that do not work?” Again, the unavoidable conclusion is that a sound assessment of a controversial phenomenon should focus first and foremost on direct replications of the original reports and not on novel variations, each of which may introduce independent ambiguities.

I am confident that unbiased readers will recognize that Pashler and Harris did not suggest that conceptual replication studies are bad.  Their main point is that a few successful conceptual replication studies can be used to keep theories alive in the face of a string of many replication failures. The problem is not that researchers conduct successful conceptual replication studies. The problem is dismissing or outright hiding of disconfirming evidence in replication studies. S&S misconstrue Pashler and Harris’s claim to avoid addressing this real problem of ignoring and suppressing failed studies to support an attractive but false theory.

The illusion of exact replications.

S&S’s next argument is that replication studies are never exact.

If one accepts that the true purpose of replications is a (repeated) test of a theoretical hypothesis rather than an assessment of the reliability of a particular experimental procedure, a major problem of exact replications becomes apparent: Repeating a specific operationalization of a theoretical construct at a different point in time and/or with a different population of participants might not reflect the same theoretical construct that the same procedure operationalized in the original study.

The most important word in this quote is “might.” Ebbinghaus’s memory curve MIGHT not replicate today because he was his own subject. Bargh’s elderly priming study MIGHT not work today because Florida is no longer associated with the elderly, and Dijksterhuis’s priming study MIGHT no longer work because students no longer think that professors are smart or that hooligans are dumb.

Just because there is no certainty in inductive inferences doesn’t mean we can just dismiss replication failures because something MIGHT have changed.  It is also possible that the published results MIGHT be false positives because significant results were obtained by chance, with QRPs, or outright fraud.  Most people think that outright fraud is unlikely, but the Stapel debacle showed that we cannot rule it out.  So, we can argue forever about hypothetical reasons why a particular study was successful or a failure. These arguments are futile and have nothing to do with scientific arguments and objective evaluation of facts.

This means that every study, whether it is a groundbreaking success or a replication failure, needs to be evaluated in terms of the objective scientific facts. There is no blanket immunity for seminal studies that protects them from disconfirming evidence. No study is an exact replication of another study. That is a truism, and S&S’s article is often cited for this simple fact. It is as true as it is irrelevant to understanding the replication crisis in social psychology.

Exact Replications Are Often Uninformative

S&S contradict themselves in the use of the term exact replication.  First it is impossible to do exact replications, but then they are uninformative.  I agree with S&S that exact replication studies are impossible. So, we can simply drop the term “exact” and examine why S&S believe that some replication studies are uninformative.

First they give an elaborate, long and hypothetical explanation for Doyen et al.’s failure to replicate Bargh’s pair of elderly priming studies. After considering some possible explanations, they conclude

It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996).  

Once more the realm of hypothetical conjectures has to rescue seminal findings. Just as it is possible that S&S are right, it is also possible that Bargh faked his data. To be sure, I do not believe that he faked his data and I apologized for a Facebook comment that gave the wrong impression that I did. I am only raising this possibility here to make the point that everything is possible. Maybe Bargh just got lucky. The probability of this is 1 out of 1,600 attempts (the probability of getting the predicted effect with .05 two-tailed (!) twice is .025^2). Not very likely, but also not impossible.
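The 1 in 1,600 figure is easy to check. Here is a minimal Python sketch of the arithmetic, assuming a true effect of zero and two independent studies; the function name is mine and purely illustrative. With a two-tailed alpha of .05, only half of the false positives fall in the predicted direction, so the per-study chance is .025.

```python
ALPHA_TWO_TAILED = 0.05


def prob_lucky_streak(n_studies: int, alpha: float = ALPHA_TWO_TAILED) -> float:
    """Chance that n independent studies of a null effect are all significant
    in the predicted direction (half of the two-tailed alpha per study)."""
    per_study = alpha / 2
    return per_study ** n_studies


p = prob_lucky_streak(2)
print(round(1 / p))  # number of attempts per expected lucky pair of studies
```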

No matter what the reason for the discrepancy between Bargh and Doyen’s findings is, the example does not support S&S’s claim that replication studies are uninformative. The failed replication raised concerns about the robustness of BS social priming studies and stimulated further investigation of the robustness of social priming effects. In the short span of six years, the scientific consensus about these effects has shifted dramatically, and the first publication of a failed replication is an important event in the history of social psychology.

S&S’s critique of Shanks et al.’s replication studies is even weaker. First, they have to admit that “professor” probably still primes intelligence more than “soccer hooligans.” To rescue the original finding, S&S propose

“the priming manipulation might have failed to increase the cognitive representation of the concept ‘intelligence.’”

S&S also think that

another LIKELY reason for their failure could be their selection of knowledge items.

Meanwhile a registered replication report with a design that was approved by Dijksterhuis failed to replicate the effect.  Although it is possible to come up with more possible reasons for these failures, real scientific creativity is revealed in creating experimental paradigms that produce replicable results, not in coming up with many post-hoc explanations for replication failures.

Ironically, S&S even agree with my criticism of their argument.

 “To be sure, these possibilities are speculative”  (p. 62). 

In contrast, S&S fail to consider the possibility that published significant results are false positives, even though there is actual evidence for publication bias. The strong bias against published failures may be rooted in a long history of dismissing unpublished failures that social psychologists routinely encounter in their own laboratory.  To avoid the self-awareness that hiding disconfirming evidence is unscientific, social psychologists made themselves believe that minute changes in experimental procedures can ruin a study (Stapel).  Unfortunately, a science that dismisses replication failures as procedural hiccups is fated to fail because it removed the mechanism that makes science self-correcting.

Failed Replications are Uninformative

S&S next suggest that “nonreplications are uninformative unless one can demonstrate that the theoretically relevant conditions were met” (p. 62).

This reverses the burden of proof.  Original researchers pride themselves on innovative ideas and groundbreaking discoveries.  Like famous rock stars, they are often not the best musicians, nor is it impossible for other musicians to play their songs. They get rewarded because they came up with something original. Take the Implicit Association Test as an example. The idea to use cognitive switching tasks to measure attitudes was original and Greenwald deserves recognition for inventing this task. The IAT did not revolutionize attitude research because only Tony Greenwald could get the effects. It did so because everybody, including my undergraduate students, could replicate the basic IAT effect.

However, let’s assume that the IAT effect could not have been replicated. Is it really the job of researchers who merely duplicated a study to figure out why it did not work and to develop a theory of the circumstances under which an effect may or may not occur? I do not think so. Failed replications are informative even if there is no immediate explanation why the replication failed. As Pashler and Harris’s cold fusion example shows, there may not even be a satisfactory explanation after decades of research. Most probably, cold fusion never really worked and the successful outcome of the original study was a fluke or a problem of the experimental design. Nevertheless, it was important to demonstrate that the original cold fusion study could not be replicated. To ask for an explanation why replication studies fail is simply a way to make replication studies unattractive and to dismiss the results of studies that fail to produce the desired outcome.

Finally, S&S ignore that there is a simple explanation for replication failures in experimental social psychology: publication bias. If original studies have low statistical power (e.g., Bargh’s studies with N = 30) to detect small effects, only vastly inflated effect sizes reach significance. An open replication study without inflated effect sizes is unlikely to produce a successful outcome. Statistical analyses of original studies show that this explanation accounts for a large proportion of replication failures. Thus, publication bias provides one explanation for replication failures.
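A small simulation illustrates how selection for significance inflates effect sizes in small samples. This is a minimal Python sketch under loudly stated assumptions: two groups of 15 participants (N = 30), an assumed true standardized effect of d = 0.2, and a simplified normal approximation for the sampling error of d. None of these values are estimates from Bargh’s data, and the function name is illustrative.

```python
import math
import random
import statistics


def simulate_selection(true_d=0.2, n_per_group=15, n_studies=20000, seed=1):
    """Simulate many small studies and 'publish' only the significant ones."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # approximate standard error of d (assumption)
    crit = 1.96 * se                 # smallest observed d with p < .05, two-tailed
    observed = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # Selection for significance: keep only significant results in the predicted direction.
    published = [d for d in observed if d > crit]
    return {
        "success_rate": len(published) / n_studies,
        "mean_all": statistics.mean(observed),
        "mean_published": statistics.mean(published),
        "critical_d": crit,
    }


result = simulate_selection()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, fewer than one in ten studies reaches significance, and every study that does must report an observed effect of more than about 0.7 standard deviations, several times the assumed true effect of 0.2. A replication that is reported regardless of outcome has the same low chance of success as any single original attempt.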

Conceptual Replication Studies are Informative

S&S cite Schmidt (2009) to argue that conceptual replication studies are informative.

With every difference that is introduced the confirmatory power of the replication increases, because we have shown that the phenomenon does not hinge on a particular operationalization but “generalizes to a larger area of application” (p. 93).

S&S continue

“An even more effective strategy to increase our trust in a theory is to test it using completely different manipulations.”

This is of course true as long as conceptual replication studies are successful. However, it is not clear why conceptual replication studies that for the first time try a completely different manipulation should be successful.  As I pointed out in my 2012 article, reading multiple-study articles with only successful conceptual replication studies is a bit like watching a magic show.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005). The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect.

I don’t know whether S&S are impressed by Bem’s article with 9 conceptual replication studies that successfully demonstrated supernatural abilities. According to their line of argument, they should be. However, even most social psychologists found it impossible to accept that time-reversed subliminal priming works. Unfortunately, this also means that successful conceptual replication studies are meaningless if only successful results are published. Once more, S&S cannot address this problem because they ignore the simple fact that selection for significance undermines the purpose of empirical research to test theoretical predictions.

Exact Replications Contribute Little to Scientific Knowledge

Without providing much evidence for their claims, S&S conclude

one reason why exact replications are not very interesting is that they contribute little to scientific knowledge.

Ironically, one year later Science published 100 replication studies with the only goal of estimating the replicability of psychology, with a focus on social psychology.  The article has already been cited 640 times, while S&S’s criticism of replication studies has been cited (only) 114 times.

Although the article did nothing other than report the outcome of replication studies, it made a tremendous empirical contribution to psychology because it reported results of studies without the filter of publication bias. Suddenly the success rate plummeted from over 90% to 37%, and for social psychology to 25%. While S&S could claim in 2014 that “Thus far, however, no solid data exist on the prevalence of such [questionable] research practices in either social or any other area of psychology,” the reproducibility project revealed that these practices dramatically inflated the percentage of successful studies reported in psychology journals.

The article has been celebrated by scientists in many disciplines as a heroic effort and a sign that psychologists are trying to improve their research practices. S&S may disagree, but I consider the reproducibility project a big contribution to scientific knowledge.

Why null findings are not always that informative

To fully appreciate the absurdity of S&S’s argument, I let them speak for themselves.

One reason is that not all null findings are interesting.  For example, just before his downfall, Stapel published an article on how disordered contexts promote stereotyping and discrimination. In this publication, Stapel and Lindenberg (2011) reported findings showing that litter or a broken-up sidewalk and an abandoned bicycle can increase social discrimination. These findings, which were later retracted, were judged to be sufficiently important and interesting to be published in the highly prestigious journal Science. Let us assume that Stapel had actually conducted the research described in this paper and failed to support his hypothesis. Such a null finding would have hardly merited publication in the Journal of Articles in Support of the Null Hypothesis. It would have been uninteresting for the same reason that made the positive result interesting, namely, that (a) nobody expected a relationship between disordered environments and prejudice and (b) there was no previous empirical evidence for such a relationship. Similarly, if Bargh et al. (1996) had found that priming participants with the stereotype of the elderly did not influence walking speed or if Dijksterhuis and van Knippenberg (1998) had reported that priming participants with “professor” did not improve their performance on a task of trivial pursuit, nobody would have been interested in their findings.

Notably, all of the examples are null-findings in original studies. Thus, they have absolutely no relevance for the importance of replication studies. As noted by Strack and Stroebe earlier

“Thus, null findings are interesting only if they contradict a central hypothesis derived from an established theory and/or are discrepant with a series of earlier studies.” (p. 65).

Bem (2011) reported 9 significant results to support unbelievable claims about supernatural abilities.  However, several failed replication studies allowed psychologists to dismiss these findings and to ignore claims about time-reversed priming effects. So, while not all null-results are important, null-results in replication studies are important because they can correct false positive results in original articles. Without this correction mechanism, science loses its ability to correct itself.

Failed Replications Do Not Falsify Theories

S&S state that failed replications do not falsify theories

“The nonreplications published by Shanks and colleagues (2013) cannot be taken as a falsification of that theory, because their study does not explain why previous research was successful in replicating the original findings of Dijksterhuis and van Knippenberg (1998).” (p. 64).

I am unaware of any theory in psychology that has been falsified. The reason for this is not that failed replication studies are not informative. The reason is that, until recently, theories were protected by hiding failed replication studies. Only in recent years have social psychologists started to contemplate the possibility that some theories in social psychology might be false.  The most prominent example is ego-depletion theory, one of the first prominent theories to be put under the microscope of open science without the protection of questionable research practices. While ego-depletion theory is not entirely dead, few people still believe in the simple theory that 20 Stroop trials deplete individuals’ willpower.  Falsification is hard, but falsification without disconfirming evidence is impossible.

Inconsistent Evidence

S&S argue that replication failures have to be evaluated in the context of replication successes.

Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis. 

Earlier S&S wrote

in social psychology, as in most sciences, empirical findings cannot always be replicated (this was one of the reasons for the development of meta-analytic methods). 

Indeed. Unless studies have very high statistical power, inconsistent results are inevitable, which is one reason why publishing only significant results is a sign of low credibility (Schimmack, 2012). Meta-analysis is the only way to make sense of these inconsistent findings.  However, it is well known that publication bias makes meta-analytic results meaningless (e.g., meta-analyses show very strong evidence for supernatural abilities).  Thus, it is important that all tests of a theoretical prediction are reported to produce meaningful meta-analyses.  If social psychologists took S&S seriously and continued to suppress non-significant results because they are uninformative, meta-analyses would continue to provide biased results that support even false theories.
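The arithmetic behind the first point is simple. As a minimal sketch, assume that every study tests a real effect, that all studies have the same power, and that they are independent. The function name is hypothetical, and the power values and the number of studies are illustrative assumptions, not estimates for any particular literature.

```python
def prob_all_significant(power: float, k: int) -> float:
    """Probability that k independent studies, each with the given power,
    all produce a significant result (assumes a real effect in every study)."""
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0 and 1")
    return power ** k


# Illustrative (assumed) scenarios: nine studies at three power levels.
for assumed_power in (0.5, 0.8, 0.95):
    p = prob_all_significant(assumed_power, 9)
    print(f"power = {assumed_power:.2f}: P(9 of 9 significant) = {p:.4f}")
```

Under these assumptions, even studies with 80% power would produce an unbroken series of nine significant results only about 13% of the time, so a literature without any reported failures should raise suspicion rather than confidence.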

Failed Replications are Uninformative II

Sorry that this is getting really long. But S&S keep on making the same arguments and the editor of this article didn’t tell them to shorten the article. Here they repeat the argument that failed replications are uninformative.

One reason why null findings are not very interesting is because they tell us only that a finding could not be replicated but not why this was the case. This conflict can be resolved only if researchers develop a theory that could explain the inconsistency in findings.  

A related claim is that failed replications never demonstrate that original findings were false because the inconsistency is always due to some third variable; a hidden moderator.

“Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted” (p. 64).

These statements reveal a fundamental misunderstanding of statistical inferences.  A significant result never proves that the null-hypothesis is false.  The inference that a real effect rather than sampling error caused the observed result can be a mistake. This mistake is called a false positive or a type-I error. S&S seem to believe that type-I errors do not exist. Accordingly, Bem’s significant results show real supernatural abilities.  If this were the case, it would be meaningless to report statistical significance tests. The only possible error that could be made would be a false negative or type-II error; the theory makes the correct prediction, but a study failed to produce a significant result. And if theoretical predictions are always correct, it is also not necessary to subject theories to empirical tests, because these tests either correctly show that a prediction was confirmed or falsely fail to confirm a prediction.

S&S’s belief in published results has a religious quality.  Apparently we know nothing about the world, but once a significant result is published in a social psychology journal, ideally JPSP, it becomes a holy truth that defies any evidence that non-believers may produce under the misguided assumption that further inquiry is necessary. Elderly priming is real, amen.

More Confusing Nonsense

At some point, I was no longer surprised by S&S’s claims, but I did start to wonder about the reviewers and editors who allowed this manuscript to be published apparently with light or no editing.  Why would a self-respecting journal publish a sentence like this?

As a consequence, the mere coexistence of exact replications that are both successful and unsuccessful is likely to leave researchers helpless about what to conclude from such a pattern of outcomes.

Didn’t S&S claim that exact replication studies do not exist? Didn’t they tell readers that every inconsistent finding has to be interpreted as an interaction effect?  And where do they see inconsistent results if journals never publish non-significant results?

Aside from these inconsistencies, inconsistent results do not lead to a state of helpless paralysis. As S&S suggested themselves, they conduct a meta-analysis. Are S&S suggesting that we need to spare researchers from inconsistent results to protect them from a state of helpless confusion? Is this their justification for publishing only significant results?

Even Massive Replication Failures in Registered Replication Reports are Uninformative

In response to the replication crisis, some psychologists started to invest time and resources in major replication studies called many lab studies or registered replication studies.  A single study was replicated in many labs.  The total sample size of many labs gives these studies high precision in estimating the average effect size and makes it even possible to demonstrate that an effect size is close to zero, which suggests that the null-hypothesis may be true.  These studies have failed to find evidence for classic social psychology findings, including Strack’s facial feedback studies. S&S suggest that even these results are uninformative.
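To illustrate why pooling many labs buys so much precision, here is a minimal sketch that uses the common large-sample approximation for the standard error of a standardized mean difference (Cohen's d) between two independent groups. The sample sizes are hypothetical, and treating 20 labs as one big sample ignores any between-lab heterogeneity, so this is a best-case illustration rather than a model of any actual registered replication report.

```python
from math import sqrt


def se_cohens_d(n1: int, n2: int, d: float = 0.0) -> float:
    """Large-sample approximation of the standard error of Cohen's d
    for two independent groups of size n1 and n2."""
    return sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))


# Hypothetical numbers: one lab with 50 participants per condition
# versus 20 such labs pooled (1,000 participants per condition).
single_lab = se_cohens_d(50, 50)
pooled = se_cohens_d(1000, 1000)

for label, se in (("single lab", single_lab), ("20 labs pooled", pooled)):
    print(f"{label}: SE = {se:.3f}, 95% CI half-width = {1.96 * se:.3f}")
```

With these assumed numbers, the half-width of the 95% confidence interval shrinks from about 0.39 to about 0.09 standard deviations. An interval that narrow around zero rules out all but very small effects, which is what makes these studies informative.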

Conducting exact replications in a registered and coordinated fashion by different laboratories does not remove the described shortcomings. This is also the case if exact replications are proposed as a means to estimate the “true size” of an effect. As the size of an experimental effect always depends on the specific error variance that is generated by the context, exact replications can assess only the efficiency of an intervention in a given situation but not the generalized strength of a causal influence.

Their argument does not make any sense to me.  First, it is not clear what S&S mean by “the size of an experimental effect always depends on the specific error variance.”  Neither unstandardized nor standardized effect sizes depend on the error variance. This is simple to see because error variance depends on the sample size and effect sizes do not depend on sample size.  So, it makes no sense to claim that effect sizes depend on error variance.

Second, it is not clear what S&S mean by specific error variance that is generated by the context.  I simply cannot address this argument because the notion of context generated specific error variance is not a statistical construct and S&S do not explain what they are talking about.

Finally, it is not clear why meta-analysis of replication studies cannot be used to estimate the generalized strength of a causal influence, which I believe to mean “an effect size”?  Earlier S&S alluded to meta-analysis as a way to resolve inconsistencies in the literature, but now they seem to suggest that meta-analysis cannot be used.

If S&S really want to imply that meta-analyses are useless, it is unclear how they would make sense of inconsistent findings.  The only viable solution seems to be to avoid inconsistencies by suppressing non-significant results in order to give the impression that every theory in social psychology is correct because theoretical predictions are always confirmed.  Although this sounds absurd, it is the inevitable logical consequence of S&S’s claim that non-significant results are uninformative, even if over 20 labs independently and in combination failed to provide evidence for a theoretically predicted effect.

The Great History of Social Psychological Theories

S&S next present Über-social psychologist, Leon Festinger, as an example of why theories are good and failed studies are bad.  The argument is that good theories make correct predictions, even if bad studies fail to show the effect.

“Although their theoretical analysis was valid, it took a decade before researchers were able to reliably replicate the findings reported by Festinger and Carlsmith (1959).”

As a former student, I was surprised by this statement because I had learned that Festinger’s theory was challenged by Bem’s theory and that social psychologists had been unable to resolve which of the two theories was correct.  Couldn’t some of these replication failures be explained by the fact that Festinger’s theory sometimes made the wrong prediction?

It is also not surprising that researchers had a hard time replicating Festinger and Carlsmith’s original findings.  The reason is that the original study had low statistical power and replication failures are expected even if the theory is correct. Finally, I have been around social psychologists long enough to have heard some rumors about Festinger and Carlsmith’s original studies.  Accordingly, some of Festinger’s graduate students also tried and failed to get the effect. Carlsmith was the ‘lucky’ one who got the effect, in one study p < .05, and he became the co-author of one of the most cited articles in the history of social psychology. Naturally, Festinger did not publish the failed studies of his other graduate students because surely they must have done something wrong. As I said, that is a rumor.  Even if the rumor is not true, and Carlsmith got lucky on the first try, luck played a factor and nobody should expect that a study replicates simply because a single published study reported a p-value less than .05.
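To make the link between low power and expected replication failures concrete, here is a rough sketch of a power calculation for a two-group comparison. It uses a normal approximation rather than the exact t-distribution, and the effect size and sample size are hypothetical values chosen for illustration; they are not taken from Festinger and Carlsmith (1959).

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison for a true
    standardized effect size d (normal approximation to the t-test)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    upper_tail = 1 - std_normal.cdf(z_crit - ncp)
    lower_tail = std_normal.cdf(-z_crit - ncp)
    return upper_tail + lower_tail


# Hypothetical scenario: a true medium effect (d = 0.5), 20 participants per cell.
print(f"approximate power = {approx_power(0.5, 20):.2f}")
```

Under these assumed values, a study has roughly a one-in-three chance of producing a significant result, so most exact replications would "fail" even though the effect is real.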

Failed Replications Did Not Influence Social Psychological Theories

Argument quality reaches a new low with the next argument against replication studies.

 “If we look at the history of social psychology, theories have rarely been abandoned because of failed replications.”

This is true, but it reveals the lack of progress in theory development in social psychology rather than the futility of replication studies.  From an evolutionary perspective, theory development requires selection pressure, but publication bias protects bad theories from failure.

The short history of open science shows how weak social psychological theories are and that even the most basic predictions cannot be confirmed in open replication studies that do not selectively report significant results.  So, even if it is true that failed replications have played a minor role in the past of social psychology, they are going to play a much bigger role in the future of social psychology.

The Red Herring: Fraud

S&S imply that Roediger suggested using replication studies as a fraud detection tool.

if others had tried to replicate his [Stapel’s] work soon after its publication, his misdeeds might have been uncovered much more quickly

S&S dismiss this idea in part on the basis of Stroebe’s research on fraud detection.

To their own surprise, Stroebe and colleagues found that replications hardly played any role in the discovery of these fraud cases.

Now this is actually not surprising because failed replications were hardly ever published.  And if there is no variance in a predictor variable (significance), we cannot see a correlation between the predictor variable and an outcome (fraud).  Although failed replication studies may help to detect fraud in the future, this is neither their primary purpose, nor necessary to make replication studies valuable. Replication studies also do not bring world peace or bring an end to global warming.

For some inexplicable reason S&S continue to focus on fraud. For example, they also argue that meta-analyses are poor fraud detectors, which is as true as it is irrelevant.

They conclude their discussion with an observation by Stapel, who famously faked 50+ articles in social psychology journals.

As Stapel wrote in his autobiography, he was always pleased when his invented findings were replicated: “What seemed logical and was fantasized became true” (Stapel, 2012). Thus, neither can failures to replicate a research finding be used as indicators of fraud, nor can successful replications be invoked as indication that the original study was honestly conducted.

I am not sure why S&S spend so much time talking about fraud, but it is the only questionable research practice that they openly address.  In contrast, they do not discuss other questionable research practices, including suppressing failed studies, that are much more prevalent and much more important for the understanding of the replication crisis in social psychology than fraud.  The term “publication bias” is not mentioned once in the article. Sometimes what is hidden is more significant than what is being published.

Conclusion

The conclusion section correctly predicts that the results of the reproducibility project will make social psychology look bad and that social psychology will look worse than other areas of psychology.

But whereas it will certainly be useful to be informed about studies that are difficult to replicate, we are less confident about whether the investment of time and effort of the volunteers of the Open Science Collaboration is well spent on replicating studies published in three psychology journals. The result will be a reproducibility coefficient that will not be greatly informative, because of justified doubts about whether the “exact” replications succeeded in replicating the theoretical conditions realized in the original research.

As social psychologists, we are particularly concerned that one of the outcomes of this effort will be that results from our field will be perceived to be less “reproducible” than research in other areas of psychology. This is to be expected because for the reasons discussed earlier, attempts at “direct” replications of social psychological studies are less likely than exact replications of experiments in psychophysics to replicate the theoretical conditions that were established in the original study.

Although psychologists should not be complacent, there seem to be no reasons to panic the field into another crisis. Crises in psychology are not caused by methodological flaws but by the way people talk about them (Kruglanski & Stroebe, 2012).

S&S attribute the foreseen (how did they know?) bad outcome in the reproducibility project to the difficulty of replicating social psychological studies, but they fail to explain why social psychology journals publish as many successes as other disciplines.

The results of the reproducibility project provide an answer to this question.  Social psychologists use designs with less statistical power that have a lower chance of producing a significant result. Selection for significance ensures that the success rate is equally high in all areas of psychology, but lower power makes these successes less replicable.
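The logic of this explanation can be sketched with a small simulation. It is a deliberately simplified toy model, not an analysis of the reproducibility project data: every effect is assumed to be real, all studies in a field share the same expected test statistic, only significant results in the predicted direction are published, and replications are exact. The function name and the two expected z-values are hypothetical.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = .05


def simulate_field(expected_z: float, n_studies: int = 20000, seed: int = 1):
    """Return (share of studies published, replication rate of published studies)
    when only significant original results are published."""
    rng = random.Random(seed)
    published = 0
    replicated = 0
    for _ in range(n_studies):
        z_original = rng.gauss(expected_z, 1.0)
        if z_original <= Z_CRIT:
            continue  # non-significant result stays in the file drawer
        published += 1
        z_replication = rng.gauss(expected_z, 1.0)  # exact replication, same power
        if z_replication > Z_CRIT:
            replicated += 1
    return published / n_studies, replicated / published


# Hypothetical fields: low power (about 17%) versus high power (about 80%).
for label, expected_z in (("low-power field", 1.0), ("high-power field", 2.8)):
    share_published, replication_rate = simulate_field(expected_z)
    print(f"{label}: published success rate = 100%, "
          f"studies published = {share_published:.0%}, "
          f"replication rate = {replication_rate:.0%}")
```

In both simulated fields, every published study is a success because non-significant results never reach the journals. The difference only becomes visible when the published studies are replicated without selection: the replication rate falls back to the power of the design.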

To avoid further embarrassments in an increasingly open science, social psychologists must improve the statistical power of their studies. Which social psychological theories will survive actual empirical tests in the new world of open science is unclear.  In this regard, I think it makes more sense to compare social psychology to a ship wreck than a train wreck.  Somewhere down on the floor of the ocean is some gold. But it will take some deep diving and many failed attempts to find it.  Good luck!

Appendix

S&S’s article was published in a “prestigious” psychology journal and has already garnered 114 citations. It ranks #21 in my importance rankings of articles in meta-psychology.  So, I was curious why the article gets cited.  The appendix lists 51 citing articles with the relevant citation and the reason for citing S&S’s article.   The table shows the reasons for citations in decreasing order of frequency.

S&S are most frequently cited for the claim that exact replications are impossible, followed by the reason for this claim that effects in psychological research are sensitive to the unique context in which a study is conducted.  The next two reasons for citing the article are that only conceptual replications (CR) test theories, whereas the results of exact replications (ER) are uninformative.  The problem is that every study is a conceptual replication because exact replications are impossible. So, even if exact replications were uninformative, this claim has no practical relevance because there are no exact replications.  Some articles cite S&S with no specific claim attached to the citation.  Only two articles cite them for the claim that there is no replication crisis and only one article cites S&S for the claim that there is no evidence about the prevalence of QRPs.   In short, the article is mostly cited for the uncontroversial and inconsequential claim that exact replications are impossible and that effect sizes in psychological studies can vary as a function of unique features of a particular sample or study.  This observation is inconsequential because it is unclear how unknown unique characteristics of studies influence results.  The main implication of this observation is that study results will be more variable than we would expect from a set of exact replication studies. For this reason, meta-analysts often use random-effects models because fixed-effects meta-analysis assumes that all studies are exact replications.
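The difference between the two models can be shown in a few lines. The sketch below uses the DerSimonian-Laird estimator of between-study variance, which is one common choice among several; the function name and the effect sizes and sampling variances in the example are made up for illustration.

```python
from math import sqrt


def fixed_and_random_effects(effects, variances):
    """Inverse-variance meta-analysis under a fixed-effects model and a
    random-effects model (DerSimonian-Laird estimate of tau squared)."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    fixed_se = sqrt(1.0 / sum_w)

    # Cochran's Q: observed variability relative to sampling error alone.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-study variance to each study.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    random_mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    random_se = sqrt(1.0 / sum_w_star)
    return {"fixed_mean": fixed_mean, "fixed_se": fixed_se, "tau2": tau2,
            "random_mean": random_mean, "random_se": random_se}


# Made-up effect sizes from four studies with equal sampling variance.
result = fixed_and_random_effects([0.1, 0.5, -0.2, 0.6], [0.04] * 4)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

When the studies vary more than sampling error alone would predict, the estimated between-study variance is positive and the random-effects standard error is larger than the fixed-effects one. That extra uncertainty is the practical consequence of the fact that replications are never exact.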

ER impossible 11
Contextual Sensitivity 8
CR test theory 8
ER uninformative 7
Mention 6
ER/CR Distinction 2
No replication crisis 2
Disagreement 1
CR Definition 1
ER informative 1
ER useful for applied research 1
ER cannot detect fraud 1
No evidence about prevalence of QRP 1
Contextual sensitivity greater in social psychology 1

Below are the most influential citing articles and the relevant citations.  I haven’t had time to do a content analysis, but the article is mostly cited to say (a) exact replications are impossible, (b) conceptual replications are valuable, and (c) social psychological findings are harder to replicate.  Few articles cite the article to claim that the replication crisis is overblown or that failed replications are uninformative.  Thus, even though the article is cited a lot, it is not cited for the main points S&S tried to make.  The high number of citations therefore does not mean that S&S’s claims have been widely accepted.

(Disagreement)
The value of replication studies.

Simons, DJ.
“In this commentary, I challenge these claims.”

(ER/CR Distinction)
Bilingualism and cognition.

Valian, V.
“A host of methodological issues should be resolved. One is whether the field should undertake exact replications, conceptual replications, or both, in order to determine the conditions under which effects are reliably obtained (Paap, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(Contextual Sensitivity)
Is Psychology Suffering From a Replication Crisis? What Does “Failure to Replicate” Really Mean?
Maxwell et al. (2015)
“A particular replication may fail to confirm the results of an original study for a variety of reasons, some of which may include intentional differences in procedures, measures, or samples as in a conceptual replication (Cesario, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(ER impossible)
The Chicago face database: A free stimulus set of faces and norming data 

Debbie S. Ma, Joshua Correll, & Bernd Wittenbrink.
“The CFD will also make it easier to conduct exact replications, because researchers can use the same stimuli employed by other researchers (but see Stroebe & Strack, 2014).”

(Contextual Sensitivity)
“Contextual sensitivity in scientific reproducibility”
Van Bavel et al. (2015)
“Many scientists have also argued that the failure to reproduce results might reflect contextual differences—often termed “hidden moderators”—between the original research and the replication attempt”

(Contextual Sensitivity)
Editorial Psychological Science

Lindsay
As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence do not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).

(ER impossible)
Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science

Finkel et al. (2015).
“Nevertheless, many scholars believe that direct replications are impossible in the human sciences—S&S (2014) call them “an illusion”— because certain factors, such as a moment in historical time or the precise conditions under which a sample was obtained and tested, that may have contributed to a result can never be reproduced identically.”

Conceptualizing and evaluating the replication of research results
Fabrigar and Wegener (2016)
(CR test theory)
“Traditionally, the primary presumed strength of conceptual replications has been their ability to address issues of construct validity (e.g., Brewer & Crano, 2014; Schmidt, 2009; Stroebe & Strack, 2014).”

(ER impossible)
“First, it should be recognized that an exact replication in the strictest sense of the term can never be achieved as it will always be impossible to fully recreate the contextual factors and participant characteristics present in the original experiment (see Schmidt (2009); S&S (2014)).”

(Contextual Sensitivity)
“S&S (2014) have argued that there is good reason to expect that many traditional and contemporary experimental manipulations in social psychology would have different psychological properties and effects if used in contexts or populations different from the original experiments for which they were developed. For example, classic dissonance manipulations and fear manipulations or more contemporary priming procedures might work very differently if used in new contexts and/or populations. One could generate many additional examples beyond those mentioned by S&S.”

(ER impossible)
“Another important point illustrated by the above example is that the distinction between exact and conceptual replications is much more nebulous than many discussions of replication would suggest. Indeed, some critics of the exact/conceptual replication distinction have gone so far as to argue that the concept of exact replication is an “illusion” (Stroebe & Strack, 2014). Though we see some utility in the exact/conceptual distinction (especially regarding the goal of the researcher in the work), we agree with the sentiments expressed by S&S. Classifying studies on the basis of the exact/conceptual distinction is more difficult than is often appreciated, and the presumed strengths and weaknesses of the approaches are less straightforward than is often asserted or assumed.”

(Contextual Sensitivity)
“Furthermore, assuming that these failed replication experiments have used the same operationalizations of the independent and dependent variables, the most common inference drawn from such failures is that confidence in the existence of the originally demonstrated effect should be substantially undermined (e.g., see Francis (2012); Schimmack (2012)). Alternatively, a more optimistic interpretation of such failed replication experiments could be that the failed versus successful experiments differ as a function of one or more unknown moderators that regulate the emergence of the effect (e.g., Cesario, 2014; Stroebe & Strack, 2014).”

Replicating Studies in Which Samples of Participants Respond to Samples of Stimuli.
(CR Definition)
Westfall et al. (2015).
“Nevertheless, the original finding is considered to be conceptually replicated if it can be convincingly argued that the same theoretical constructs thought to account for the results of the original study also account for the results of the replication study (Stroebe & Strack, 2014). Conceptual replications are thus “replications” in the sense that they establish the reproducibility of theoretical interpretations.”

(Mention)
“Although establishing the generalizability of research findings is undoubtedly important work, it is not the focus of this article (for opposing viewpoints on the value of conceptual replications, see Pashler & Harris, 2012; Stroebe & Strack, 2014).”

Introduction to the Special Section on Advancing Our Methods and Practices
(Mention)
Ledgerwood, A.
We can and surely should debate which problems are most pressing and which solutions most suitable (e.g., Cesario, 2014; Fiedler, Kutzner, & Krueger, 2012; Murayama, Pekrun, & Fiedler, 2013; Stroebe & Strack, 2014). But at this point, most can agree that there are some real problems with the status quo.

***Theory Building, Replication, and Behavioral Priming: Where Do We Need to Go From Here?
Locke, EA
(ER impossible)
As can be inferred from Table 1, I believe that the now popular push toward “exact” replication (e.g., see Simons, 2014) is not the best way to go. Everyone agrees that literal replication is impossible (e.g., Stroebe & Strack, 2014), but let us assume it is as close as one can get. What has been achieved?

The War on Prevention: Bellicose Cancer Metaphors Hurt (Some) Prevention Intentions
(CR test theory)
David J. Hauser and Norbert Schwarz
“As noted in recent discussions (Stroebe & Strack, 2014), consistent effects of multiple operationalizations of a conceptual variable across diverse content domains are a crucial criterion for the robustness of a theoretical approach.”

ON THE OTHER SIDE OF THE MIRROR: PRIMING IN COGNITIVE AND SOCIAL PSYCHOLOGY 
Doyen et al.
(CR test theory)
“In contrast, social psychologists assume that the primes activate culturally and situationally contextualized representations (e.g., stereotypes, social norms), meaning that they can vary over time and culture and across individuals. Hence, social psychologists have advocated the use of “conceptual replications” that reproduce an experiment by relying on different operationalizations of the concepts under investigation (Stroebe & Strack, 2014). For example, in a society in which old age is associated not with slowness but with, say, talkativeness, the outcome variable could be the number of words uttered by the subject at the end of the experiment rather than walking speed.”

***Welcome back Theory
Ap Dijksterhuis
(ER uninformative)
“it is unavoidable, and indeed, this commentary is also about replication—it is done against the background of something we had almost forgotten: theory! S&S (2014, this issue) argue that focusing on the replication of a phenomenon without any reference to underlying theoretical mechanisms is uninformative”

On the scientific superiority of conceptual replications for scientific progress
Christian S. Crandall, Jeffrey W. Sherman
(ER impossible)
But in matters of social psychology, one can never step in the same river twice—our phenomena rely on culture, language, socially primed knowledge and ideas, political events, the meaning of questions and phrases, and an ever-shifting experience of participant populations (Ramscar, 2015). At a certain level, then, all replications are “conceptual” (Stroebe & Strack, 2014), and the distinction between direct and conceptual replication is continuous rather than categorical (McGrath, 1981). Indeed, many direct replications turn out, in fact, to be conceptual replications. At the same time, it is clear that direct replications are based on an attempt to be as exact as possible, whereas conceptual replications are not.

***Are most published social psychological findings false?
Stroebe, W.
(ER uninformative)
“This near doubling of replication success after combining original and replication effects is puzzling. Because these replications were already highly powered, the increase is unlikely to be due to the greater power of a meta-analytic synthesis. The two most likely explanations are quality problems with the replications or publication bias in the original studies. An evaluation of the quality of the replications is beyond the scope of this review and should be left to the original authors of the replicated studies. However, the fact that all replications were exact rather than conceptual replications of the original studies is likely to account to some extent for the lower replication rate of social psychological studies (Stroebe & Strack, 2014). There is no evidence either to support or to reject the second explanation.”

(ER impossible)
“All four projects relied on exact replications, often using the material used in the original studies. However, as I argued earlier (Stroebe & Strack, 2014), even if an experimental manipulation exactly replicates the one used in the original study, it may not reflect the same theoretical variable.”

(CR test theory)
“Gergen’s argument has important implications for decisions about the appropriateness of conceptual compared to exact replication. The more a phenomenon is susceptible to historical change, the more conceptual replication rather than exact replication becomes appropriate (Stroebe & Strack, 2014).”

(CR test theory)
“Moonesinghe et al. (2007) argued that any true replication should be an exact replication, “a precise process where the exact same finding is reexamined in the same way”. However, conceptual replications are often more informative than exact replications, at least in studies that are testing theoretical predictions (Stroebe & Strack, 2014). Because conceptual replications operationalize independent and/or dependent variables in a different way, successful conceptual replications increase our trust in the predictive validity of our theory.”

There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance
Anderson & Maxwell
(Mention)
“It is important to note some caveats regarding direct (exact) versus conceptual replications. While direct replications were once avoided for lack of originality, authors have recently urged the field to take note of the benefits and importance of direct replication. According to Simons (2014), this type of replication is “the only way to verify the reliability of an effect” (p. 76). With respect to this recent emphasis, the current article will assume direct replication. However, despite the push toward direct replication, some have still touted the benefits of conceptual replication (Stroebe & Strack, 2014). Importantly, many of the points and analyses suggested in this paper may translate well to conceptual replication.”

Reconceptualizing replication as a sequence of different studies: A replication typology
Joachim Hüffmeier, Jens Mazei, Thomas Schultze
(ER impossible)
The first type of replication study in our typology encompasses exact replication studies conducted by the author(s) of an original finding. Whereas we must acknowledge that replications can never be “exact” in a literal sense in psychology (Cesario, 2014; Stroebe & Strack, 2014), exact replications are studies that aspire to be comparable to the original study in all aspects (Schmidt, 2009). Exact replications—at least those that are not based on questionable research practices such as the arbitrary exclusion of critical outliers, sampling or reporting biases (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)—serve the function of protecting against false positive effects (Type I errors) right from the start.

(ER informative)
Thus, this replication constitutes a valuable contribution to the research process. In fact, already some time ago, Lykken (1968; see also Mummendey, 2012) recommended that all experiments should be replicated  before publication. From our perspective, this recommendation applies in particular to new findings (i.e., previously uninvestigated theoretical relations), and there seems to be some consensus that new findings should be replicated at least once, especially when they were unexpected, surprising, or only loosely connected to existing theoretical models (Stroebe & Strack, 2014; see also Giner-Sorolla, 2012; Murayama et al., 2014).”

(Mention)
Although there is currently some debate about the epistemological value of close replication studies (e.g., Cesario, 2014; LeBel & Peters, 2011; Pashler & Harris, 2012; Simons, 2014; Stroebe & Strack, 2014), the possibility that each original finding can—in principal—be replicated by the scientific community represents a cornerstone of science (Kuhn, 1962; Popper, 1992).”

(CR test theory)
So far, we have presented “only” the conventional rationale used to stress the importance of close replications. Notably, however, we will now add another—and as we believe, logically necessary—point originally introduced by S&S (2014). This point protects close replications from being criticized (cf. Cesario, 2014; Stroebe & Strack, 2014; see also LeBel & Peters, 2011). Close replications can be informative only as long as they ensure that the theoretical processes investigated or at least invoked by the original study are shown to also operate in the replication study.

(CR test theory)
The question of how to conduct a close replication that is maximally informative entails a number of methodological choices. It is important to both adhere to the original study proceedings (Brandt et al., 2014; Schmidt, 2009) and focus on and meticulously measure the underlying theoretical mechanisms that were shown or at least proposed in the original studies (Stroebe & Strack, 2014). In fact, replication attempts are most informative when they clearly demonstrate either that the theoretical processes have unfolded as expected or at which point in the process the expected results could no longer be observed (e.g., a process ranging from a treatment check to a manipulation check and [consecutive] mediator variables to the dependent variable). Taking these measures is crucial to rule out that a null finding is simply due to unsuccessful manipulations or changes in a manipulation’s meaning and impact over time (cf. Stroebe & Strack, 2014). “

(CR test theory)
Conceptual replications in laboratory settings are the fourth type of replication study in our typology. In these replications, comparability to the original study is aspired to only in the aspects that are deemed theoretically relevant (Schmidt, 2009; Stroebe & Strack, 2014). In fact, most if not all aspects may differ as long as the theoretical processes that have been studied or at least invoked in the original study are also covered in a conceptual replication study in the laboratory.”

(ER useful for applied research)
For instance, conceptual replications may be less important for applied disciplines that focus on clinical phenomena and interventions. Here, it is important to ensure that there is an impact of a specific intervention and that the related procedure does not hurt the members of the target population (e.g., Larzelere et al., 2015; Stroebe & Strack, 2014).”

From intrapsychic to ecological theories in social psychology: Outlines of a functional theory approach
Klaus Fiedler
(ER uninformative)
Replicating an ill-understood finding is like repeating a complex sentence in an unknown language. Such a “replication” in the absence of deep understanding may appear funny, ridiculous, and embarrassing to a native speaker, who has full control over the foreign language. By analogy, blindly replicating or running new experiments on an ill-understood finding will rarely create real progress (cf. Stroebe & Strack, 2014). “

Into the wild: Field research can increase both replicability and real-world impact
Jon K. Maner
(CR test theory)
Although studies relying on homogeneous samples of laboratory or online participants might be highly replicable when conducted again in a similar homogeneous sample of laboratory or online participants, this is not the key criterion (or at least not the only criterion) on which we should judge replicability (Westfall, Judd & Kenny, 2015; see also Brandt et al., 2014; Stroebe & Strack, 2014). Just as important is whether studies replicate in samples that include participants who reflect the larger and more diverse population.”

Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives?
Shanks et al.
(ER impossible)
There is no such thing as an “exact” replication (Stroebe & Strack, 2014) and hence it must be acknowledged that the published studies (notwithstanding the evidence for p-hacking and/or publication bias) may have obtained genuine effects and that undetected moderator variables explain why the present studies failed to obtain priming.   Some of the experiments reported here differed in important ways from those on which they were modeled (although others were closer replications and even these failed to obtain evidence of reliable romantic priming).

(CR test theory)
As S&S (2014) point out, what is crucial is not so much exact surface replication but rather identical operationalization of the theoretically relevant variables. In the present case, the crucial factors are the activation of romantic motives and the appropriate assessment of consumption, risk-taking, and other measures.”

A Duty to Describe: Better the Devil You Know Than the Devil You Don’t
Brown, Sacha D et al.
(Mention)
Ioannidis (2005) has been at the forefront of researchers identifying factors interfering with self-correction. He has claimed that journal editors selectively publish positive findings and discriminate against study replications, permitting errors in data and theory to enjoy a long half-life (see also Ferguson & Brannick, 2012; Ioannidis, 2008, 2012; Shadish, Doherty, & Montgomery, 1989; Stroebe & Strack, 2014). We contend there are other equally important, yet relatively unexplored, problems.

A Room with a Viewpoint Revisited: Descriptive Norms and Hotel Guests’ Towel Reuse Behavior
(Contextual Sensitivity)
Bohner, Gerd; Schlueter, Lena E.
On the other hand, our pilot participants’ estimates of towel reuse rates were generally well below 75%, so we may assume that the guests participating in our experiments did not perceive the normative messages as presenting a surprisingly low figure. In a more general sense, the issue of greatly diverging baselines points to conceptual issues in trying to devise a ‘‘direct’’ replication: Identical operationalizations simply may take on different meanings for people in different cultures.

***The empirical benefits of conceptual rigor: Systematic articulation of conceptual hypotheses can reduce the risk of non-replicable results (and facilitate novel discoveries too)
Mark Schaller
(Contextual Sensitivity)
Unless these subsequent studies employ methods that exactly replicate the idiosyncratic context in which the effect was originally detected, these studies are unlikely to replicate the effect. Indeed, because many psychologically important contextual variables may lie outside the awareness of researchers, even ostensibly “exact” replications may fail to create the conditions necessary for a fragile effect to emerge (Stroebe & Strack, 2014).

A Concise Set of Core Recommendations to Improve the Dependability of Psychological Research
David A. Lishner
(CR test theory)
The claim that direct replication produces more dependable findings across replicated studies than does conceptual replication seems contrary to conventional wisdom that conceptual replication is preferable to direct replication (Dijksterhuis, 2014; Neulip & Crandall, 1990, 1993a, 1993b; Stroebe & Strack, 2014).
(CR test theory)
However, most arguments advocating conceptual replication over direct replication are attempting to promote the advancement or refinement of theoretical understanding (see Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014). The argument is that successful conceptual replication demonstrates a hypothesis (and by extension the theory from which it derives) is able to make successful predictions even when one alters the sampled population, setting, operations, or data analytic approach. Such an outcome not only suggests the presence of an organizing principle, but also the quality of the constructs linked by the organizing principle (their theoretical meanings). Of course this argument assumes that the consistency across the replicated findings is not an artifact of data acquisition or data analytic approaches that differ among studies. The advantage of direct replication is that regardless of how flexible or creative one is in data acquisition or analysis, the approach is highly similar across replication studies. This duplication ensures that any false finding based on using a flexible approach is unlikely to be repeated multiple times.

(CR test theory)
Does this mean conceptual replication should be abandoned in favor of direct replication? No, absolutely not. Conceptual replication is essential for the theoretical advancement of psychological science (Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014), but only if dependability in findings via direct replication is first established (Cesario, 2014; Simons, 2014). Interestingly, in instances where one is able to conduct multiple studies for inclusion in a research report, one approach that can produce confidence in both dependability of findings and theoretical generalizability is to employ nested replications.

(ER cannot detect fraud)
A second advantage of direct replications is that they can protect against fraudulent findings (Schmidt, 2009), particularly when different research groups conduct direct replication studies of each other’s research. S&S (2014) make a compelling argument that direct replication is unlikely to prove useful in detection of fraudulent research. However, even if a fraudulent study remains unknown or undetected, its impact on the literature would be lessened when aggregated with nonfraudulent direct replication studies conducted by honest researchers.

***Does cleanliness influence moral judgments? Response effort moderates the effect of cleanliness priming on moral judgments.
Huang
(ER uninformative)
Indeed, behavioral priming effects in general have been the subject of increased scrutiny (see Cesario, 2014), and researchers have suggested different causes for failed replication, such as measurement and sampling errors (Stanley and Spence, 2014), variation in subject populations (Cesario, 2014), discrepancy in operationalizations (S&S, 2014), and unidentified moderators (Dijksterhuis, 2014).

UNDERSTANDING PRIMING EFFECTS IN SOCIAL PSYCHOLOGY: AN OVERVIEW AND INTEGRATION
Daniel C. Molden
(ER uninformative)
Therefore, some greater emphasis on direct replication in addition to conceptual replication is likely necessary to maximize what can be learned from further research on priming (but see Stroebe and Strack, 2014, for costs of overemphasizing direct replication as well).

On the automatic link between affect and tendencies to approach and avoid: Chen and Bargh (1999) revisited
Mark Rotteveel et al.
(no replication crisis)
Although opinions differ with regard to the extent of this “replication crisis” (e.g., Pashler and Harris, 2012; S&S, 2014), the scientific community seems to be shifting its focus more toward direct replication.

(ER uninformative)
Direct replications not only affect one’s confidence about the veracity of the phenomenon under study, but they also increase our knowledge about effect size (see also Simons, 2014; but see also S&S, 2014).

Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability
McShane and Bockenholt
(ER impossible)
The purpose of meta-analysis is to synthesize a set of studies of a common phenomenon. This task is complicated in behavioral research by the fact that behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999).

(ER impossible)
Further, because behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999), our SPM methodology estimates and accounts for heterogeneity, which has been shown to be important in a wide variety of behavioral research settings (Hedges and Pigott 2001; Klein et al. 2014; Pigott 2012).

A Closer Look at Social Psychologists’ Silver Bullet: Inevitable and Evitable Side   Effects of the Experimental Approach
Herbert Bless and Axel M. Burger
(ER/CR Distinction)
Given the above perspective, it becomes obvious that in the long run, conceptual replications can provide very fruitful answers because they address the question of whether the initially observed effects are potentially caused by some perhaps unknown aspects of the experimental procedure (for a discussion of conceptual versus direct replications, see e.g., Stroebe & Strack, 2014; see also Brandt et al., 2014; Cesario, 2014; Lykken, 1968; Schwarz & Strack, 2014).  Whereas conceptual replications are adequate solutions for broadening the sample of situations (for examples, see Stroebe & Strack, 2014), the present perspective, in addition, emphasizes that it is important that the different conceptual replications do not share too much overlap in general aspects of the experiment (see also Schwartz, 2015, advocating for  conceptual replications)

Men in red: A reexamination of the red-attractiveness effect
Vera M. Hesslinger, Lisa Goldbach, & Claus-Christian Carbon
(ER impossible)
As Brandt et al. (2014) pointed out, a replication in psychological research will never be absolutely exact or direct (see also, Stroebe & Strack, 2014), which is, of course, also the case in the present research.

***On the challenges of drawing conclusions from p-values just below 0.05
Daniel Lakens
(no evidence about QRP)
In recent years, researchers have become more aware of how flexibility during the data-analysis can increase false positive results (e.g., Simmons, Nelson & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated, for example because researchers analyze their data until a p-value smaller than 0.05 is observed, the robustness of scientific knowledge can substantially decrease. However, as Stroebe & Strack (2014, p. 60) have pointed out: ‘Thus far, however, no solid data exist on the prevalence of such research practices.’

***Does Merely Going Through the Same Moves Make for a ‘‘Direct’’ Replication? Concepts, Contexts, and Operationalizations
Norbert Schwarz and Fritz Strack
(Contextual Sensitivity)
In general, meaningful replications need to realize the psychological conditions of the original study. The easier option of merely running through technically identical procedures implies the assumption that psychological processes are context insensitive and independent of social, cultural, and historical differences (Cesario, 2014; Stroebe & Strack, 2014). Few social (let alone cross-cultural) psychologists would be willing to endorse this assumption with a straight face. If so, mere procedural equivalence is an insufficient criterion for assessing the quality of a replication.

The Replication Paradox: Combining Studies can Decrease Accuracy of Effect Size Estimates
(ER uninformative)
Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts
Replications with nonsignificant results are easily dismissed with the argument that the replication might contain a confound that caused the null finding (Stroebe & Strack, 2014).

Retro-priming, priming, and double testing: psi and replication in a test-retest design
Rabeyron, T
(Mention)
Bem’s paper spawned numerous attempts to replicate it (see e.g., Galak et al., 2012; Bem et al., submitted) and reflections on the difficulty of direct replications in psychology (Ritchie et al., 2012). This aspect has been associated more generally with debates concerning the “decline effect” in science (Schooler, 2011) and a potential “replication crisis” (S&S, 2014) especially in the fields of psychology and medical sciences (De Winter and Happee, 2013).

Do p Values Lose Their Meaning in Exploratory Analyses? It Depends How You Define the Familywise Error Rate
Mark Rubin
(ER impossible)
Consequently, the Type I error rate remains constant if researchers simply repeat the same test over and over again using different samples that have been randomly drawn from the exact same population. However, this first situation is somewhat hypothetical and may even be regarded as impossible in the social sciences because populations of people change over time and location (e.g., Gergen, 1973; Iso-Ahola, 2017; Schneider, 2015; Serlin, 1987; Stroebe & Strack, 2014). Yesterday’s population of psychology undergraduate students from the University of Newcastle, Australia, will be a different population to today’s population of psychology undergraduate students from the University of Newcastle, Australia.

***Learning and the replicability of priming effects
Michael Ramscar
(ER uninformative)
In the limit, this means that in the absence of a means for objectively determining what the information that produces a priming effect is, and for determining that the same information is available to the population in a replication, all learned priming effects are scientifically unfalsifiable. (Which also means that in the absence of an account of what the relevant information is in a set of primes, and how it produces a specific effect, reports of a specific priming result — or failures to replicate it — are scientifically uninformative; see also Stroebe & Strack, 2014.)

***Evaluating Psychological Research Requires More Than Attention to the N: A Comment on Simonsohn’s (2015) “Small Telescopes”
Norbert Schwarz and Gerald L. Clore
(CR test theory)
Simonsohn’s decision to equate a conceptual variable (mood) with its manipulation (weather) is compatible with the logic of clinical trials, but not with the logic of theory testing. In clinical trials, which have inspired much of the replicability debate and its statistical focus, the operationalization (e.g., 10 mg of a drug) is itself the variable of interest; in theory testing, any given operationalization is merely one, usually imperfect, way to realize the conceptual variable. For this reason, theory tests are more compelling when the results of different operationalizations converge (Stroebe & Strack, 2014), thus ensuring, in the case in point, that it is not “the weather” but indeed participants’ (sometimes weather-induced) mood that drives the observed effect.

Internal conceptual replications do not increase independent replication success
Kunert, R
(Contextual Sensitivity)
According to the unknown moderator account of independent replication failure, successful internal replications should correlate with independent replication success. This account suggests that replication failure is due to the fact that psychological phenomena are highly context-dependent, and replicating seemingly irrelevant contexts (i.e. unknown moderators) is rare (e.g., Barrett, 2015; DGPS, 2015; Fleming Crim, 2015; see also Stroebe & Strack, 2014; for a critique, see Simons, 2014). For example, some psychological phenomenon may unknowingly be dependent on time of day.

(Contextual Sensitivity greater in social psychology)
When the chances of unknown moderator influences are greater and replicability is achieved (internal, conceptual replications), then the same should be true when chances are smaller (independent, direct replications). Second, the unknown moderator account is usually invoked for social psychological effects (e.g. Cesario, 2014; Stroebe & Strack, 2014). However, the lack of influence of internal replications on independent replication success is not limited to social psychology. Even for cognitive psychology a similar pattern appears to hold.

On Klatzky and Creswell (2014): Saving Social Priming Effects But Losing Science as We Know It?
Barry Schwartz
(ER uninformative)
The recent controversy over what counts as “replication” illustrates the power of this presumption. Does “conceptual replication” count? In one respect, conceptual replication is a real advance, as conceptual replication extends the generality of the phenomena that were initially discovered. But what if it fails? Is it because the phenomena are unreliable, because the conceptual equivalency that justified the new study was logically flawed, or because the conceptual replication has permitted the intrusion of extraneous variables that obscure the original phenomenon? This ambiguity has led some to argue that there is no substitute for strict replication (see Pashler & Harris, 2012; Simons, 2014, and Stroebe & Strack, 2014, for recent manifestations of this controversy). A significant reason for this view, however, is less a critique of the logic of conceptual replication than it is a comment on the sociology (or politics, or economics) of science. As Pashler and Harris (2012) point out, publication bias virtually guarantees that successful conceptual replications will be published whereas failed conceptual replications will live out their lives in a file drawer.  I think Pashler and Harris’ surmise is probably correct, but it is not an argument for strict replication so much as it is an argument for publication of failed conceptual replication.

Commentary and Rejoinder on Lynott et al. (2014)
Lawrence E. Williams
(CR test theory)
On the basis of their investigations, Lynott and colleagues (2014) conclude ‘‘there is no evidence that brief exposure to warm therapeutic packs induces greater prosocial responding than exposure to cold therapeutic packs’’ (p. 219). This conclusion, however, does not take into account other related data speaking to the connection between physical warmth and prosociality. There is a fuller body of evidence to be considered, in which both direct and conceptual replications are instructive. The former are useful if researchers particularly care about the validity of a specific phenomenon; the latter are useful if researchers particularly care about theory testing (Stroebe & Strack, 2014).

The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?
(no replication crisis)
Motyl et al. (2017) “The claim of a replicability crisis is greatly exaggerated.” Wolfgang Stroebe and Fritz Strack, 2014

Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology
Harry T. Reis, Karisa Y. Lee
(ER impossible)
Much of the current debate, however, is focused narrowly on direct or exact replications—whether the findings of a given study, carried out in a particular way with certain specific operations, would be repeated. Although exact replications are surely desirable, the papers by Fabrigar and by Crandall and Sherman remind us that in an absolute sense they are fundamentally impossible in social–personality psychology (see also S&S, 2014).

Show me the money
(Contextual Sensitivity)
Of course, it is possible that additional factors, which varied or could have varied among our studies and previously published studies (e.g., participants’ attitudes toward money) or among the online studies and laboratory study in this article (e.g., participants’ level of distraction), might account for these apparent inconsistencies. We did not aim to conduct a direct replication of any specific past study, and therefore we encourage special care when using our findings to evaluate existing ones (Doyen, Klein, Simons, & Cleeremans, 2014; Stroebe & Strack, 2014).

***From Data to Truth in Psychological Science. A Personal Perspective.
Strack
(ER uninformative)
In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating noneffects (S&S, 2014). But this may require expertise and effort.

 


Replicability 101: How to interpret the results of replication studies

Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015).  This article gives a basic introduction to the interpretation of statistical results within the Neyman-Pearson approach to statistical inference.

I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures.  First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence).  Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent.  Second, I  point out that publication bias undermines the credibility of significant results in original studies.  When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.

DEFINITION OF REPLICATING A STATISTICAL RESULT

Replicating something means to get the same result.  If I make the first free throw, replicating this outcome means to also make the second free throw.  When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”

If we conduct psychological studies, we can control many factors, but some factors are not under our control.  Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other.  It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.

To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study.  The most common outcome in psychological studies is a significant or non-significant result.  The goal of a study is to produce a significant result and for this reason a significant result is often called a success.  A successful replication study is a study that also produces a significant result.  Obtaining two significant results is akin to making two free throws.  This is one of the few agreements between Maxwell and me.

“Generally speaking, a published  original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)

The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).

I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical.  Otherwise, the replication study is not really a replication study.

There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures.  The probability of these outcomes depends on the significance criterion (the type-I error probability) when the null-hypothesis is true and on the statistical power of a study when the null-hypothesis is false.

Table 1 shows the probability of the outcomes in two studies.  The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts.  This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.

| | 2 sig, 0 ns | 1 sig, 1 ns | 0 sig, 2 ns |
|---|---|---|---|
| H0 is True | alpha^2 | 2*alpha*(1-alpha) | (1-alpha)^2 |
| H1 is True | (1-beta)^2 | 2*(1-beta)*beta | beta^2 |
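The entries of Table 1 can be checked with a minimal Python sketch. It assumes that the two studies are independent and have the same probability of producing a significant result (alpha when H0 is true, power when H1 is true). The function name `pair_outcome_probs` and the 80% power value are illustrative assumptions, not part of the original analysis.

```python
def pair_outcome_probs(p_sig):
    """Probabilities of (2 sig, 1 sig and 1 ns in either order, 2 ns)
    for two independent studies that each reach significance with
    probability p_sig."""
    return p_sig ** 2, 2 * p_sig * (1 - p_sig), (1 - p_sig) ** 2


alpha = 0.05  # chance of a significant result per study when H0 is true
power = 0.80  # chance of a significant result per study when H1 is true (illustrative)

for label, p in [("H0 is true", alpha), ("H1 is true", power)]:
    both, mixed, neither = pair_outcome_probs(p)
    print(f"{label}: 2 sig = {both:.4f}, 1 sig/1 ns = {mixed:.4f}, 2 ns = {neither:.4f}")
```

The three probabilities in each row sum to 1, which is a quick way to verify that no outcome has been left out.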

The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability).  High power is needed to get significant results in a pair of studies (an original study and a replication study).  For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012).  Even with conventionally acceptable power of 80%, only about two-thirds (64%) of pairs of studies would produce this outcome.  However, studies in psychology often do not have 80% power, and estimates of power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce two significant results in only 16 out of 100 attempts.   Although successful replications of true effects with low power are unlikely, they are still much more likely than two significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1).  It is therefore reasonable to infer from two significant results that the null-hypothesis is false.
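The 64:1 comparison is a ratio of two probabilities: the chance of two significant results under a true effect (power squared) divided by the chance under a true null-hypothesis (alpha squared). A short sketch, assuming independent studies with equal power; the function name `ratio_two_significant` is hypothetical.

```python
def ratio_two_significant(power, alpha=0.05):
    """How much more likely two significant results are when H1 is true
    (with the given power) than when H0 is true."""
    return power ** 2 / alpha ** 2


for power in (0.40, 0.50, 0.80):
    print(f"power = {power:.2f}: ratio = {ratio_two_significant(power):.0f} : 1")
```

Even with low power the ratio strongly favors a true effect, because alpha squared is so small.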

If the null-hypothesis is true, it is very likely that both studies produce a non-significant result (.95^2 = 90.25%).  In contrast, it is unlikely that a pair of studies with good power would produce two non-significant results.  For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%.  This means that two non-significant results are much more likely (about 22.6 : 1) when the null-hypothesis is true than when the alternative hypothesis is true and power is 80%.  This does not mean that the null-hypothesis is true in an absolute sense because power depends on the effect size.  For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5,  two non-significant results would suggest that the effect size is smaller than .5, but it does not warrant the conclusion that H0 is true and the effect size is exactly 0.  Once more, it is important to distinguish between the absence of evidence for an effect and the evidence of absence of an effect.
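The same logic applies to two non-significant results. The sketch below computes the ratio of (1 - alpha)^2 to (1 - power)^2, again assuming two independent studies with equal power; the function name `ratio_two_nonsignificant` is hypothetical.

```python
def ratio_two_nonsignificant(power, alpha=0.05):
    """How much more likely two non-significant results are when H0 is true
    than when H1 is true with the given power."""
    return (1 - alpha) ** 2 / (1 - power) ** 2


for power in (0.50, 0.80, 0.95):
    print(f"power = {power:.2f}: ratio = {ratio_two_nonsignificant(power):.1f} : 1")
```

The ratio grows quickly with power, which is why two non-significant results are informative only relative to the effect size the studies were powered to detect.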

The most controversial scenario assumes that the two studies produced inconsistent outcomes.  Although theoretically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure  (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low;  .05 * (1-.05) = .0475.  The same probability exists for the reverse pattern that a non-significant result is followed by a significant one.  A probability of 4.75% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, the low probability is mostly due to the low probability of obtaining a significant result in the first study, while the replication failure is extremely likely.

Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false.  The probability of this outcome depends on statistical power.  A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result.  The probability of this rare event can be as low as, or lower than, the probability with a true null effect; .95 * (1-.95) = .0475.  Thus, an inconsistent result provides little information about the probability of a type-I or type-II  error and is difficult to interpret.

In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false.  Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high.  When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.
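The three outcomes can be tabulated for any per-study probability of a significant result. The sketch below assumes two independent studies with the same probability of significance (alpha = .05 under the null-hypothesis, power otherwise); the function and key names are illustrative only.

```python
def pair_outcomes(p_sig: float) -> dict:
    """Outcome probabilities for two independent studies that each have
    probability p_sig of producing a significant result."""
    return {
        "both_significant": p_sig ** 2,
        "sig_then_nonsig": p_sig * (1 - p_sig),    # one specific order
        "either_order": 2 * p_sig * (1 - p_sig),   # any inconsistent pair
        "both_nonsignificant": (1 - p_sig) ** 2,
    }


scenarios = {
    "H0 true (alpha = .05)": 0.05,
    "power = .50": 0.50,
    "power = .80": 0.80,
    "power = .95": 0.95,
}
for label, p_sig in scenarios.items():
    out = pair_outcomes(p_sig)
    print(label, {name: round(value, 4) for name, value in out.items()})

# Two non-significant results: true null vs. true effect with 80% power.
null_vs_h1 = (pair_outcomes(0.05)["both_nonsignificant"]
              / pair_outcomes(0.80)["both_nonsignificant"])
print(f"both non-significant, H0 vs. 80% power: {null_vs_h1:.1f}:1")
```

The null-hypothesis and the 95% power scenario give the same probability for a significant result followed by a non-significant one, which is why an inconsistent pair by itself says little about which scenario is true.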

Meta-Analysis 

The counting of successes and failures is an old way to integrate information from multiple studies.  This approach has low power and is no longer used.  A more powerful approach is effect size meta-analysis.  Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project.  Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.

“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)

I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies.  A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.

Maxwell et al. (2015) introduced an example of a between-subject (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size).  This study has 59% power to obtain a significant result.  Thus, it is quite likely that a pair of studies produces inconsistent results (48.38%).   However, a pair of studies with N = 80 has the power of a total sample size of N = 160, which means a fixed-effects meta-analysis will produce a significant result in 88% of all attempts.  Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%).   Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.
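These power values can be approximated with the standard library alone. The sketch below uses a normal (z-test) approximation rather than the exact t-test, so it is an assumption-laden shortcut: the single-study value comes out a point or two above the 59% reported here, while the pooled value lands close to 88%. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_power_two_groups(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power of a two-group comparison with
    standardized effect size d, using a normal approximation."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)
    # Expected z-score for a difference between two equal-sized groups.
    expected_z = d * math.sqrt(n_per_group / 2)
    return (1 - nd.cdf(crit - expected_z)) + nd.cdf(-crit - expected_z)


single = approx_power_two_groups(0.5, 40)   # one study, N = 80
pooled = approx_power_two_groups(0.5, 80)   # two studies pooled, N = 160
inconsistent = 2 * single * (1 - single)    # one significant, one not

print(f"single study power (approx.): {single:.2f}")
print(f"P(inconsistent pair):         {inconsistent:.2f}")
print(f"pooled power (approx.):       {pooled:.2f}")
```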

A more interesting scenario is inconsistent results when the null-hypothesis is true.  I turned to simulations to examine this scenario more closely.   The simulation showed that a meta-analysis of inconsistent studies produced a significant result in 34% of all cases.  The percentage varies slightly as a function of sample size.  With a small sample of N = 40, the percentage is 35%. With a large sample of  1,000 participants it is 33%.  This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis based on a significant original study.  Thus, if an original study produced a false-positive result, a failed replication study corrects this error in 2 out of 3 cases.  Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.
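The simulation code is not shown in this post, so the following is a minimal sketch of one way to set it up, not a reconstruction of the original. It assumes z-tests instead of t-tests (so the result does not depend on sample size), equal sample sizes, two-sided alpha = .05, and a fixed-effect combination of two equally weighted z-scores (Stouffer's method). The function name and the seed are arbitrary.

```python
import math
import random
from statistics import NormalDist


def simulate_meta_after_failed_replication(n_pairs=400_000, alpha=0.05, seed=2015):
    """Under a true null-hypothesis, keep pairs in which the original study
    is significant and the replication is not, and return the share of those
    pairs whose combined z-score is still significant."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    kept = 0
    meta_significant = 0
    for _ in range(n_pairs):
        z_orig = rng.gauss(0.0, 1.0)
        z_rep = rng.gauss(0.0, 1.0)
        # Condition on the inconsistent pattern: success, then failure.
        if abs(z_orig) > crit and abs(z_rep) <= crit:
            kept += 1
            # Equal sample sizes: both z-scores get the same weight.
            z_meta = (z_orig + z_rep) / math.sqrt(2)
            if abs(z_meta) > crit:
                meta_significant += 1
    return meta_significant / kept, kept


rate, n_cases = simulate_meta_after_failed_replication()
print(f"{n_cases} inconsistent pairs, meta-analysis significant in {rate:.1%}")
```

Under these assumptions the share should come out at roughly one third, in line with the percentages reported above.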

In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies.  If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.

DIFFERENCES IN SAMPLE SIZES

We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990).  This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers.  If original researchers conducted studies with adequate power,  an exact replication study with the same sample size would also have adequate power.  If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size.  As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.

From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size.  Studies with larger sample sizes are given more weight than studies with smaller samples.  Thus, researchers who invest more resources are rewarded by giving their studies more weight.  Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same.  Nevertheless, failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magical number about the size of a replication study to be useful.

I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5).  In this simulation, only 21% of meta-analyses produced a significant result.  This is 13 percentage points lower than in the simulation with equal sample sizes (34%).  If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.
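The effect of unequal sample sizes can also be computed without simulation. The sketch below is an idealized z-test calculation, not the simulation reported above: it assumes a true null-hypothesis, two-sided alpha = .05, and a weighted combination of the two z-scores with weights proportional to the square root of each sample size. It shows the same downward pattern as the replication sample grows, but the percentages need not match the simulated values exactly. All names are illustrative.

```python
import math
from statistics import NormalDist


def meta_sig_given_failed_replication(n_orig, n_rep, alpha=0.05, steps=4000):
    """P(weighted meta-analysis is significant | original significant,
    replication non-significant) when the null-hypothesis is true."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)
    # Sample-size weights for combining two z-scores.
    w_orig = math.sqrt(n_orig / (n_orig + n_rep))
    w_rep = math.sqrt(n_rep / (n_orig + n_rep))

    # Midpoint rule over original z-scores above the critical value.
    # Significant negative originals give the same answer by symmetry.
    upper = 8.0
    width = (upper - crit) / steps
    joint = 0.0
    for i in range(steps):
        z_orig = crit + (i + 0.5) * width
        # The replication z must stay inside (-crit, crit) and still push
        # the combined z past crit. A combined z below -crit cannot occur,
        # because the replication z is bounded below by -crit.
        needed = (crit - w_orig * z_orig) / w_rep
        lower = max(needed, -crit)
        p_rep = max(0.0, nd.cdf(crit) - nd.cdf(lower))
        joint += nd.pdf(z_orig) * p_rep * width

    # Probability of the conditioning event on the positive side.
    p_condition = (alpha / 2) * (1 - alpha)
    return joint / p_condition


for n_rep in (80, 200, 800):
    share = meta_sig_given_failed_replication(80, n_rep)
    print(f"original N = 80, replication N = {n_rep}: {share:.1%} still significant")
```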

The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings.  Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase sample sizes of replication studies to make replication failures meaningful.  Given unlimited resources, larger replications are better, but these analyses show that large replication studies are not necessary.  A replication study with the same sample size as the original study is more valuable than no replication study at all.

CONFUSING ABSENCE OF EVIDENCE WITH EVIDENCE OF ABSENCE

One problem in Maxwell et al.’s (2015) article is that it conflates two possible goals of replication studies.  One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive result, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis.  This finding would mean that evidence for an effect is absent.  The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result and the burden of proof is shifted back to proponents of the hypothesis to provide unbiased credible evidence for it.

Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i.e., humans do not have supernatural abilities).  Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect.  They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.

This double standard makes it easy to produce (potentially false) positive results and very hard to remove false positive results from the literature.   It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the effect size in an original study.  The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported.  To avoid the problem that replication researchers have to invest large amounts of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles.  If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.

THE DIRTY BIG SECRET

The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias.  They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling, 1959; Sterling et al., 1995).

It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance  (Schimmack, 2012).

I used Bem’s infamous article on supernatural abilities as an example, which appeared to show 8 successful replications of supernatural abilities.  Ironically, Maxwell et al. (2015) also cite Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.

“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”

This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP.  They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation.”  There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.

More important, Maxwell et al. ignore the broader context of these studies.  Schimmack (2012) discussed many questionable practices in Bem’s original studies and I presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices.  Given this wider context, it is entirely reasonable to favor the experimental artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies while ignoring that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

Maxwell et al. (2015) were not aware that in the same year, the OSC (2015) reproducibility project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%.  The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that every original finding that failed to replicate is a false positive, but it does mean that it is not clear which findings are false positives and which findings are not.  Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing.  In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results.  This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence and the large number of replication failures means that more replication studies with larger samples are needed to see which hypotheses predict real effects with practical significance.

DOES PSYCHOLOGY HAVE A REPLICATION CRISIS?

Maxwell et al.’s (2015) answer to this question is captured in this sentence: “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis, we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496).  The most important part of this quote is “raising doubts”; the rest is Orwellian double-talk.

The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern.  As I have pointed out, this conclusion is based on some misconceptions about the purpose of replication studies and on blissful ignorance about publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spend time and resources on demonstrating that unbiased replication studies fail.

The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated.  In my opinion that is not only a crisis, but a scandal because psychologists routinely apply for funding with power analyses that claim 80% power.  The reproducibility project shows that the true power to obtain significant results in original and replication studies is much lower than this and that the 90% success rate is no more meaningful than 90% votes for a candidate in communist elections.

In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.”  They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”

As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), (c) the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of outcome, and (f) the replication study was preregistered.

We can only speculate why the American Psychologist published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail.  Fortunately, APA can no longer control what is published because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticizing peer-reviewed articles in open post-publication peer review on social media.

Long live the replicability revolution!!!

REFERENCES

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312. http://dx.doi.org/10.1037/0003-066X.45.12.1304

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press). 

Forthcoming article: 
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)

Brief Introduction

Since JPSP published incredible evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned.  There is talk of a crisis of confidence, a replication crisis, or a credibility crisis.  However, hard data on the credibility of empirical findings published in social psychology journals are scarce.

There have been two approaches to examine the credibility of social psychology.  One approach relies on replication studies.  Authors attempt to replicate original studies as closely as possible.  The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015), which replicated one study from each of 100 articles; 54 articles were classified as social psychology.   For original articles that reported a significant result, only a quarter produced a significant result in the replication studies.  This estimate of replicability suggests that researchers conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate. One concern about the OSC results is that replicating original studies can be difficult.  For example, a bilingual study in California may not produce the same results as a bilingual study in Canada.  It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of original studies.

A second approach is to estimate replicability of published results using statistical methods.  The advantage of this approach is that replicability estimates are predictions for exact replication studies of the original studies because the original studies provide the data for the replicability estimates.   This is the approach used by Motyl et al.

The authors sampled 30% of articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) from four major social psychology journals (JPSP, PSPB, JESP, and PS).  For each study, coders identified one focal hypothesis and recorded the statistical result.  The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs.  Only 19 statistics were z-tests.   The authors applied various statistical tests to the data that test for the presence of publication bias or whether the studies have evidential value (i.e., reject the null-hypothesis that all published results are false positives).  For the purpose of estimating replicability, the most important statistic is the R-Index.

The R-Index has two components.  First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly).  Second, it computes the percentage of studies with a significant result.  In an unbiased set of studies, median observed power and percentage of significant results should match.  Publication bias and questionable research practices will produce more significant results than predicted by median observed power.  The discrepancy is called the inflation rate.  The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present.  The R-Index is not a replicability estimate.  That is, an R-Index of 30% does not mean that 30% of studies will produce a significant result.  However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80.  An exception is an R-Index of 50, which is equivalent to a replicability estimate of 50%.  If the R-Index is below 50, one would expect more replication failures than successes.
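The two components can be written out in a short Python sketch. It follows the verbal definition above; the way observed power is computed from an absolute z-score (two-sided, alpha = .05) is an assumption, and the z-scores are made-up values for illustration, not data from Motyl et al.

```python
from statistics import NormalDist, median


def r_index(z_scores, alpha=0.05):
    """R-Index = median observed power minus inflation, where inflation is
    the success rate minus median observed power."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)
    abs_z = [abs(z) for z in z_scores]
    # Observed power: power implied if the observed z were the true mean z.
    observed_power = [(1 - nd.cdf(crit - z)) + nd.cdf(-crit - z) for z in abs_z]
    mop = median(observed_power)
    success_rate = sum(z > crit for z in abs_z) / len(abs_z)
    inflation = success_rate - mop
    return {
        "median_observed_power": mop,
        "success_rate": success_rate,
        "inflation": inflation,
        "r_index": mop - inflation,
    }


# Hypothetical focal test statistics, already converted to z-scores.
example_z = [2.1, 2.3, 2.0, 2.6, 1.7, 2.2, 3.1, 2.4]
result = r_index(example_z)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Because most of the hypothetical z-scores sit just above the significance criterion, the success rate exceeds median observed power and the R-Index falls below median observed power.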

Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”

I think this interpretation of the R-Index results is too harsh.  I consider an R-Index below 50 an F (fail).  An R-Index in the 50s is a D, and an R-Index in the 60s is a C.  An R-Index greater than 80 is considered an A.  So, clearly there is a replication crisis, but social psychology is not rotten to the core.

The R-Index is a simple tool, but it is not designed to estimate replicability.  Jerry Brunner and I developed a method that can estimate replicability, called z-curve.  All test-statistics are converted into absolute z-scores and a kernel density distribution is fitted to the histogram of z-scores.  Then a mixture model of normal distributions is fitted to the density distribution and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate;  that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly.   Motyl et al. did not have access to this statistical tool.  They kindly shared their data and I was able to estimate replicability with z-curve.  For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163).   The Figure shows two results.  The left figure uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right figure uses only z-scores greater than 2.4.  The reason is that just-significant results may be compromised by questionable research methods that may bias estimates.
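The last step of this description, turning component means into power values and averaging them by their weights, can be illustrated as follows. This is not the z-curve implementation: the density estimation and the fitting of the mixture model are omitted, the component means and weights are invented for illustration, and the power formula assumes a two-sided test with alpha = .05.

```python
from statistics import NormalDist


def power_from_mean_z(mean_z: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test when z-scores are normally distributed
    around mean_z with a standard deviation of 1."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)
    return (1 - nd.cdf(crit - mean_z)) + nd.cdf(-crit - mean_z)


def weighted_average_power(means, weights) -> float:
    """Average the power of the mixture components, weighted by their weights."""
    total = sum(weights)
    return sum(w * power_from_mean_z(m) for m, w in zip(means, weights)) / total


# Hypothetical mixture: component means on the z-scale and their weights.
component_means = [0.0, 2.0, 4.0]
component_weights = [0.2, 0.5, 0.3]

for m in component_means:
    print(f"mean z = {m:.1f} -> power = {power_from_mean_z(m):.2f}")
print(f"weighted average power = "
      f"{weighted_average_power(component_means, component_weights):.2f}")
```

A component with a mean of zero has "power" equal to alpha, which is how false positives enter the weighted average.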

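[Figure: z-curve analysis of Motyl et al.’s data. Left panel: estimation based on z-scores greater than 2. Right panel: estimation based on z-scores greater than 2.4.]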

The key finding is the replicability estimate.  Both estimations produce similar results (48% vs. 49%).  Even with over 1,000 observations there is uncertainty in these estimates and the 95%CI can range from 45 to 54% using all significant results.   Based on this finding, it is predicted that about half of these results would produce a significant result again in a replication study.

However, it is important to note that there is considerable heterogeneity in replicability across studies.  As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate.  This is shown with average power estimates for bands of z-scores at the bottom of the figure.   In the left figure,  z-scores between 2 and 2.5 (~ .01 < p < .05) have only a replicability of 31%, and even z-scores between 2.5 and 3 have a replicability below 50%.  It requires z-scores greater than 4 to reach a replicability of 80% or more.   Similar results are obtained for actual replication studies in the OSC reproducibility project.  Thus, researchers should take the strength of evidence of a particular study into account.  Studies with p-values in the .01 to .05 range are unlikely to replicate without boosting sample sizes.  Studies with p-values less than .001 are likely to replicate even with the same sample size.
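The correspondence between z-score bands and p-values follows directly from the standard normal distribution. A small sketch, assuming two-sided p-values (the function names are illustrative):

```python
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Two-sided p-value for an absolute z-score."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def abs_z_from_p(p: float) -> float:
    """Absolute z-score that corresponds to a two-sided p-value."""
    return NormalDist().inv_cdf(1 - p / 2)


# Boundaries of the z-score bands discussed in the text.
for z in (2.0, 2.5, 3.0, 4.0):
    print(f"z = {z:.1f} -> two-sided p = {two_sided_p(z):.5f}")

# z-scores that correspond to common p-value thresholds.
for p in (0.05, 0.01, 0.001):
    print(f"p = {p} -> |z| = {abs_z_from_p(p):.2f}")
```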

Independent Replication Study 

Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project.  For this purpose, I coded all studies in the OSC reproducibility project.  The actual replication project often picked one study from articles with multiple studies.  54 social psychology articles reported 173 studies.   The focal hypothesis test of each study was used to compute absolute z-scores that were analyzed with z-curve.

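[Figure: z-curve analysis of the 173 studies reported in the 54 social psychology articles of the OSC reproducibility project.]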

The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%).  The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap.  Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.

Automated Extraction Estimates

Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases. Often studies report more than one hypothesis test and it is not trivial to pick one of the tests for further analysis.  An alternative approach is to automatically extract all test statistics from articles.  This also makes it possible to base estimates on a much larger sample of test results.  The downside of automated extraction is that articles also report statistical analyses for trivial or non-critical tests (e.g., manipulation checks).  The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability.  I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings).   Further analyses replicated the results from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals.  The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging trend in 2016.  This estimate is about 10 percentage points above the estimates based on hand-coded focal hypothesis tests in the two datasets above.  This discrepancy can be due to the inclusion of less original and trivial statistical tests in the automated analysis.  However, a difference of 10 percentage points is not a dramatic difference.  Neither 50% nor 60% replicability justifies the claim that social psychology is rotten to the core, nor does either estimate meet the expectation that researchers should plan studies with 80% power to detect a predicted effect.

[Figure: replicability estimates for cognitive and social psychology journals, 2010 to 2016.]

Moderator Analyses

Motyl et al. (in press) did extensive coding of the studies.  This makes it possible to examine potential moderators (predictors) of higher or lower replicability.  As noted earlier, the strength of evidence is an important predictor.  Studies with higher z-scores (smaller p-values) are, on average, more replicable.  The strength of evidence is a direct function of statistical power.  Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.

It is well known that larger samples have less sampling error.  Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3).  I also examined the R-Index for different ranges of sample sizes.  The R-Index was the lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52) and further for sample sizes between 200 and 1,000 (R-Index = 69).  Interestingly, the R-Index for small samples with N < 40 was 70.  This is explained by the fact that research designs also influence replicability and that small samples often use more powerful within-subject designs.

A moderator analysis with design as moderator confirms this.  The R-Index for between-subject designs is the lowest (R-Index = 48), followed by mixed designs (R-Index = 61) and then within-subject designs (R-Index = 75).  This pattern is also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.

Another possibility is that articles with more studies package smaller and less replicable studies.  However,  number of studies in an article was not a notable moderator:  1 study R-Index = 53, 2 studies R-Index = 51, 3 studies R-Index = 60, 4 studies R-Index = 52, 5 studies R-Index = 53.

Conclusion 

Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals.  Their article complements results from the OSC reproducibility project that used actual replications, but a much smaller number of studies.  The two approaches produce different results.  Actual replication studies produced only 25% successful replications.  Statistical estimates of replicability are around 50%.   Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting the differences.  However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly.  This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result.  Some non-random, but also not reproducible, factor may have helped to produce a significant result in this study.  Statistical models assume that we can redo a study exactly and may therefore overestimate the success rate for actual replication studies.  Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly.  This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.

Hidden Figures: Replication Failures in the Stereotype Threat Literature

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016).  The replication crisis is often considered a new phenomenon, but failed replications are not entirely new.  Sometimes these studies have simply been ignored.  These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies. The chance of reporting a false positive result in a multiple-study article is low because the risk of a false positive decreases exponentially with the number of significant studies (Schimmack, 2012). However, the low risk of a false positive is illusory if authors only publish studies that worked. In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP study by Bem (2011).  As a result, publication bias undermines the reporting of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959) and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests.  This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).”  “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.”  “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype threat on test performance in a real-world testing situation. These studies had large samples and were not limited to students at Stanford. One study was reported in a College Board Report (Stricker and Ward, 1998). Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004). This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context. One group of participants was asked about their gender or ethnicity before the test; the other group did not receive these questions. As noted in the abstract, neither the inquiry about race nor the inquiry about gender had a significant effect on test performance. In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests-the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests-in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 21, 1997), in contrast to the substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquiries about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat. However, if this were a real moderator, it would be important to document this fact, and it is not clear why this finding obtained in an earlier study by Steele remained unpublished. Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation. A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity. Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate). In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance do not influence performance. This finding is not widely publicized, but it is important for a better understanding of performance differences in real-world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study.  However, exact replication studies are difficult and costly.  An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black.  The study also had a 2 x 3 design, which leaves less than 20 participants per condition.   The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1,107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19).   However, a specific contrast showed significant differences between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82.  The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800).  In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. Participants were 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1,35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic group, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other. The consistent finding across the two studies that supports stereotype threat theory is that merely changing the description of an assessment task changes Black participants’ performance, as revealed by significant differences between the diagnostic and non-diagnostic condition in both studies. The problem is that both studies had small numbers of Black participants and that small samples have low power to produce significant results. As a result, it is unlikely that a pair of such studies would both produce significant results.

Observed power in the two studies is .81 and .62, with median observed power of .71. Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate. Moreover, when inflation is evident, median observed power is also inflated. To correct for this inflation, the Replicability-Index (R-Index) subtracts inflation from median observed power, which yields an R-Index of 42 (71 - 29). Any value below 50 is considered unacceptably low and I give it a letter grade F, just like students at American universities receive an F for exams with less than 50% correct answers. This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies. It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black students’ performance on academic tests.
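The calculation above can be sketched in a few lines of R. This is a minimal sketch, assuming observed power is computed with the normal approximation used elsewhere on this blog (p-value to z-score to power); the function names obs_power_from_t and r_index are illustrative assumptions and do not come from a published package.

```r
# Minimal sketch: observed power and R-Index from reported t-statistics.
# Function names are illustrative assumptions, not part of any package.

obs_power_from_t <- function(t, df, alpha = .05) {
  p <- 2 * pt(abs(t), df, lower.tail = FALSE)   # exact two-tailed p-value
  z <- qnorm(1 - p / 2)                         # p-value converted to a z-score
  z.crit <- qnorm(1 - alpha / 2)                # 1.96 for alpha = .05
  list(p = p, z = z, power = pnorm(z, z.crit))  # observed power (normal approximation)
}

r_index <- function(power, success.rate = 1) {
  mop <- median(power)               # median observed power
  inflation <- success.rate - mop    # success rate minus median observed power
  mop - inflation                    # R-Index on a 0-1 scale
}

# Contrasts reported for Studies 1 and 2
res <- obs_power_from_t(t = c(2.88, 2.38), df = c(107, 35))
round(res$z, 2)
round(res$power, 2)
round(100 * r_index(res$power))
```

The results should be close to the values reported in the text; differences of about one point can arise because the text rounds intermediate values before subtracting.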

The Test of Insufficient Variance (TIVA) provides another way to examine published results. Test statistics like t-values vary considerably from study to study even if the exact same study is conducted twice (or if one larger sample is randomly split into two sub-samples). When test statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1. If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample. Often the selection is a result of publication bias. Applying TIVA to the pair of studies yields a variance of Var(z) = 0.15. As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices. Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.
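A minimal sketch of this test in R follows. It assumes that TIVA compares the observed variance of the z-scores against the expected variance of 1 with a left-tailed chi-square test on k - 1 degrees of freedom; this assumption reproduces the variance and p-value reported above, but the function name tiva is only illustrative.

```r
# Minimal sketch of the Test of Insufficient Variance (TIVA).
# Assumption: Var(z) * (k - 1) is compared to a chi-square distribution
# with k - 1 degrees of freedom (left tail = "too little variance").

tiva <- function(z) {
  k <- length(z)
  v <- var(z)                           # observed variance of the z-scores
  p <- pchisq(v * (k - 1), df = k - 1)  # probability of a variance this small or smaller
  list(var = v, p = p)
}

# z-scores of the contrasts in Studies 1 and 2
out <- tiva(c(2.82, 2.28))
round(out$var, 2)
round(out$p, 2)
```

The same function accepts any number of z-scores, so it can be applied to larger sets of studies without changes.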

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3.  This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task.  The sample size of 68 participants  (35 Black, 33 White) implied that only 11 or 12 participants were assigned to one of the six cells in a 2 (race) by 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2,61) = 3.30, p = .044.  In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53,

Study 4.  This study used inquiry about race to induce stereotype threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic).  The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to each of the four conditions.  The degrees of freedom indicate that cell frequencies were even lower. The study produced a significant interaction effect, F(1,39) = 7.82, p = .008.  The study also produced a significant contrast between Blacks in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not a performance measure.  If stereotype threat mediates the effect of task characteristics on performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance.  The key aim of stereotype threat theory is to explain differences in performance.  With a focus on performance outcomes, it is possible to examine the R-Index and TIVA of Studies 1, 2, and 4.  All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study      Test Statistic     p-value    z-score    obs.pow
Study 1    t(107) = 2.88      0.005      2.82       0.81
Study 2    t(35) = 2.38       0.023      2.28       0.62
Study 4    t(39) = 2.43       0.020      2.33       0.64

Median observed power is 64% and the R-Index is well below 50: 64 - 36 = 28 (F).  The variance in z-scores is Var(z) = 0.09, p = .086.  These results cast doubt on the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.

Conclusion

Racial stereotypes and racial disparities are an important social issue.  Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies that can inform our understanding of these issues.  In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data.  Just like it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth.  Unfortunately, social psychologists have not followed the examples of astronomers and the numbers do not add up.

The three African American women featured in this year’s movie “Hidden Figures”***, Katherine Johnson, Dorothy Vaughan, and Mary Jackson, might not approve of the casual way social psychologists use numbers in their research, especially the widespread practice of hiding numbers that do not match expectations.  No science that wants to make a real-world contribution can condone this practice.  It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype-threatening situations.  Yet, they prevailed and they paved the way for future generations of stereotyped groups.  Understanding racial and gender bias and performance differences remains an important issue, and that is the reason why it is important to shed light on hidden numbers and put simplistic theories under the microscope. Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities.  It is time for social psychologists to live up to their promises by tackling real-world issues with research designs that have real-world significance and that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.

 

Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis.” Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).
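The 62.5% figure can be checked with a short calculation. The sketch below is a simplification under stated assumptions: the test statistic is treated as normally distributed around its noncentrality parameter, only significant results in the predicted direction are counted, and the function name p_overestimate is hypothetical.

```r
# Minimal sketch: probability that a significant result overestimates the
# population effect size, as a function of true power.
# Assumption: the observed z-score is normal with mean ncp and SD 1, and only
# results with z > z.crit (significant in the predicted direction) are kept.

z.crit <- qnorm(.975)

p_overestimate <- function(power) {
  ncp <- qnorm(power, z.crit)   # noncentrality that yields this power
  # overestimation and significance both hold when z exceeds max(ncp, z.crit)
  pnorm(pmax(ncp, z.crit), ncp, lower.tail = FALSE) / power
}

p_overestimate(.80)   # 0.5 / 0.8
p_overestimate(.30)   # below 50% power every significant result is inflated
```

With power at or above 50%, the expression reduces to 0.5 divided by power, which gives 62.5% for 80% power.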

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measures. We show that this interpretation of their sentence would be incorrect and that random measurement error always attenuates observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the ratio of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

[Figure 1. Median observed power after selection for significance as a function of true power]

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Loken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from the original range of 50 to 3050 to a range of 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

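[Figure 2. Median observed effect size after selection for significance as a function of sample size, for different levels of reliability]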

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################
#### R-CODE ###
################################################################

### sample sizes
N = seq(25,500,5)

### true population correlation
true.pop.r = .15

### reliability
rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y
### (rows = sample sizes, columns = reliabilities)
obs.pop.r = matrix(true.pop.r*rel,length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes
N = matrix(N,length(N),length(rel))

### compute non-central t-values
ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N - 2)))

### compute true power
### (the critical value is passed as the non-centrality parameter, mirroring pnorm(ncp,z.crit))
true.power = pt(ncp.t,N-2,qt(.975,N-2))

### Get Inflated Observed Power After Selection for Significance
inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values
inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes
### (inverse of the formula used for ncp.t)
inf.obs.es = (sqrt(N + 4*inf.obs.t^2 - 2) - sqrt(N - 2))/(2*inf.obs.t)

### Set parameters for Figure
x.min = 0
x.max = 500
y.min = 0.10
y.max = 0.45
ylab = "Median Observed Effect Size After Selection for Significance"
title = "Effect of Selection for Significance on Observed Effect Size"

### line colors, one per reliability (not defined in the original code; any palette works)
col = rainbow(length(rel))

### Create Figure
for (i in 1:length(rel)) {
plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab=ylab,lwd=3,main=title)
### legend entry, placed relative to x.max so that it stays inside the plot region
segments(x0 = .60*x.max,y0 = y.max-.05-i*.02, x1 = .65*x.max,col=col[i], lwd=5)
text(.73*x.max,y.max-.05-i*.02,paste0("Rel = ",format(rel[i],nsmall=1)))
par(new=TRUE)
}

### dashed line at the true population correlation
abline(h = .15,lty=2)

##################### THE END #################################

How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, generate the predicted median observed power and then pick the true power value with the smallest discrepancy between median observed power and simulated inflated power estimates. This approach is essentially the same as the approach used by pcurve and puniform, which only differ in the criterion that is being minimized.

Here is the R code for the conversion of true.power into the predicted observed power after selection for significance.

z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

[Figure: median observed power after selection for significance as a function of true power]

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
Obs.power = pnorm(qnorm(1-p/2),z.crit)

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.
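The grid search described earlier in this post can be sketched as follows. This is a minimal sketch under the two assumptions just listed; the conversion formula is the one given above, while the function names predicted_mop and estimate_true_power are hypothetical.

```r
# Minimal sketch: estimate true power from the median observed power of
# significant results by searching a grid of true power values.
# Function names are illustrative assumptions.

z.crit <- qnorm(.975)

# predicted median observed power after selection for significance
predicted_mop <- function(true.power) {
  pnorm(qnorm(true.power / 2 + (1 - true.power), qnorm(true.power, z.crit)), z.crit)
}

# pick the true power value whose prediction is closest to the observed value
estimate_true_power <- function(obs.mop, grid = seq(.01, .99, .01)) {
  grid[which.min(abs(predicted_mop(grid) - obs.mop))]
}

estimate_true_power(.75)   # observed median power of 75% among significant results
estimate_true_power(.62)   # a value close to the floor implied by the null hypothesis
```

A finer grid gives a more precise estimate; because the relationship is monotonic, a root finder such as uniroot() would be an alternative to the grid.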

P.S.  It is possible to prove the formula that transforms true power into median observed power.  Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
med.z.sig = median(z.sim[z.sim > z.crit])
obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

Authors:  Ulrich Schimmack, Moritz Heene, and Kamini Kesavan

 

Abstract:
We computed the R-Index for studies cited in Chapter 4 of Kahneman’s book “Thinking Fast and Slow.” This chapter focuses on priming studies, starting with John Bargh’s study that led to Kahneman’s open email.  The results are eye-opening and jaw-dropping.  The chapter cites 12 articles and 11 of the 12 articles have an R-Index below 50.  The combined analysis of 31 studies reported in the 12 articles shows 100% significant results with average (median) observed power of 57% and an inflation rate of 43%.  The R-Index is 14. This result confirms Kahneman’s prediction that priming research is a train wreck and readers of his book “Thinking Fast and Slow” should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.

Introduction

In 2011, Nobel Laureate Daniel Kahneman published a popular book, “Thinking Fast and Slow”, about important findings in social psychology.

In the same year, questions about the trustworthiness of social psychology were raised.  A Dutch social psychologist had fabricated data. Eventually over 50 of his articles would be retracted.  Another social psychologist published results that appeared to demonstrate the ability to foresee random future events (Bem, 2011). Few researchers believed these results and statistical analysis suggested that the results were not trustworthy (Francis, 2012; Schimmack, 2012).  Psychologists started to openly question the credibility of published results.

In the beginning of 2012, Doyen and colleagues published a failure to replicate a prominent study by John Bargh that was featured in Daniel Kahneman’s book.  A few months later, Daniel Kahneman distanced himself from Bargh’s research in an open email addressed to John Bargh (Young, 2012):

“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.”

Five years later, Kahneman’s concerns have been largely confirmed. Major studies in social priming research have failed to replicate and the replicability of results in social psychology is estimated to be only 25% (OSC, 2015).

Looking back, it is difficult to understand the uncritical acceptance of social priming as a fact.  In “Thinking Fast and Slow” Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Yet, Kahneman could have seen the train wreck coming. In 1971, he co-authored an article about scientists’ “exaggerated confidence in the validity of conclusions based on small samples” (Tversky & Kahneman, 1971, p. 105).  Yet, many of the studies described in Kahneman’s book had small samples.  For example, Bargh’s priming study used only 30 undergraduate students to demonstrate the effect.

Replicability Index

Small samples can be sufficient to detect large effects. However, small effects require large samples.  The probability of replicating a published finding is a function of sample size and effect size.  The Replicability Index (R-Index) makes it possible to use information from published results to predict how replicable published results are.

Every reported test-statistic can be converted into an estimate of power, called observed power. For a single study, this estimate is useless because it is not very precise. However, for sets of studies, the estimate becomes more precise.  If we have 10 studies and the average power is 55%, we would expect approximately 5 to 6 studies with significant results and 4 to 5 studies with non-significant results.
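As a minimal sketch of this conversion, the Python snippet below treats a reported z-score as if it were the true effect and computes the probability that a study would cross the two-tailed significance criterion. The function names and the example z-score are our own illustrative assumptions, not part of the original analysis.

```python
from statistics import NormalDist

def observed_power(z, alpha=0.05):
    """Observed power of a two-tailed z-test, taking the observed z as the true effect."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    # Probability of landing beyond either critical value.
    return (1 - nd.cdf(crit - z)) + nd.cdf(-crit - z)

def expected_significant(k, average_power):
    """Expected number of significant results among k studies."""
    return k * average_power

print(round(observed_power(1.96), 2))   # a just-significant result
print(expected_significant(10, 0.55))   # the 10-study example from the text
```

A result that is just significant has an observed power of about 50%, which is why a set of barely significant results cannot be expected to replicate consistently.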

If we observe 100% significant results with an average power of 55%, it is likely that studies with non-significant results are missing (Schimmack, 2012).  There are too many significant results.  This is especially true because average power is also inflated when researchers report only significant results. Consequently, the true power is even lower than average observed power.  If we observe 100% significant results with 55% average observed power, true power is likely to be less than 50%.
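To see how unlikely a perfect record is, the following sketch computes the binomial probability of obtaining at least a given number of significant results. It assumes that the studies are independent and share the same power, which is a simplification; the function name is hypothetical.

```python
from math import comb

def prob_at_least(successes, k, power):
    """Binomial probability of at least `successes` significant results in k studies."""
    return sum(
        comb(k, s) * power**s * (1 - power) ** (k - s)
        for s in range(successes, k + 1)
    )

# Ten studies with 55% power that all turn out significant.
p_all = prob_at_least(10, 10, 0.55)
print(f"P(10 of 10 significant | power = .55) = {p_all:.4f}")
```

Under these assumptions the probability is well below 1%, so a set of ten significant results with 55% power is far more likely to reflect missing non-significant studies than good luck.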

This is unacceptable. Tversky and Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.”

To correct for the inflation in power, the R-Index uses the inflation rate, which is the difference between the percentage of significant results and average observed power. For example, if all studies are significant and average power is 75%, the inflation rate is 25 percentage points.  The R-Index subtracts the inflation rate from average power.  So, with 100% significant results and average observed power of 75%, the R-Index is 50% (75% – 25% = 50%).  The R-Index is not a direct estimate of true power. It is actually a conservative estimate of true power if the R-Index is below 50%.  Thus, an R-Index below 50% suggests that a significant result was obtained only by capitalizing on chance, although it is difficult to quantify by how much.
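The arithmetic can be written as a two-line function. This is only a sketch of the calculation described above; the function and argument names are our own.

```python
def r_index(observed_power, success_rate):
    """R-Index: observed power minus inflation (success rate minus observed power)."""
    inflation = success_rate - observed_power
    return observed_power - inflation

# The example from the text: all studies significant, observed power of 75%.
print(round(r_index(0.75, 1.00), 2))
```

Because the inflation is subtracted from observed power, every percentage point of missing power costs two points on the R-Index when all results are significant.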

How Replicable are the Social Priming Studies in “Thinking Fast and Slow”?

Chapter 4: The Associative Machine

4.1.  Cognitive priming effect

In the 1980s, psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked.

[no reference provided]

4.2.  Priming of behavior without awareness

Another major advance in our understanding of memory was the discovery that priming is not restricted to concepts and words. You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not even aware.

“In an experiment that became an instant classic, the psychologist John Bargh and his collaborators asked students at New York University—most aged eighteen to twenty-two—to assemble four-word sentences from a set of five words (for example, “finds he it yellow instantly”). For one group of students, half the scrambled sentences contained words associated with the elderly, such as Florida, forgetful, bald, gray, or wrinkle. When they had completed that task, the young participants were sent out to do another experiment in an office down the hall. That short walk was what the experiment was about. The researchers unobtrusively measured the time it took people to get from one end of the corridor to the other.”

“As Bargh had predicted, the young people who had fashioned a sentence from words with an elderly theme walked down the hallway significantly more slowly than the others. … walking slowly, which is associated with old age.”

“All this happens without any awareness. When they were questioned afterward, none of the students reported noticing that the words had had a common theme, and they all insisted that nothing they did after the first experiment could have been influenced by the words they had encountered. The idea of old age had not come to their conscious awareness, but their actions had changed nevertheless.”

[John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action,” Journal of Personality and Social Psychology 71 (1996): 230–44.]

t(28)=2.86 0.008 2.66 0.76
t(28)=2.16 0.039 2.06 0.54

MOP = .65, Inflation = .35, R-Index = .30
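Each row in these listings gives the reported test statistic, its p-value, the corresponding z-score, and observed power; MOP is the median observed power. The sketch below starts from the two z-scores listed for this article and recomputes the summary line. It assumes a two-tailed criterion of 1.96 and ignores the negligible opposite tail; the helper names are hypothetical.

```python
from statistics import NormalDist, median

CRIT = 1.96  # two-tailed criterion for p < .05

def power_from_z(z, crit=CRIT):
    """Observed power, ignoring the negligible opposite tail."""
    return NormalDist().cdf(z - crit)

def summarize(z_scores, success_rate=1.0, crit=CRIT):
    """Return median observed power, inflation, and R-Index for a set of z-scores."""
    powers = [power_from_z(z, crit) for z in z_scores]
    mop = median(powers)
    inflation = success_rate - mop
    return mop, inflation, mop - inflation

mop, inflation, r = summarize([2.66, 2.06])  # z-scores of the two tests listed above
print(f"MOP = {mop:.2f}, Inflation = {inflation:.2f}, R-Index = {r:.2f}")
```

The same function can be applied to the z-score column of any of the listings that follow, as long as the criterion value matches the one used by the original authors.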

4.3.  Reversed priming: Behavior primes cognitions

“The ideomotor link also works in reverse. A study conducted in a German university was the mirror image of the early experiment that Bargh and his colleagues had carried out in New York.”

“Students were asked to walk around a room for 5 minutes at a rate of 30 steps per minute, which was about one-third their normal pace. After this brief experience, the participants were much quicker to recognize words related to old age, such as forgetful, old, and lonely.”

“Reciprocal priming effects tend to produce a coherent reaction: if you were primed to think of old age, you would tend to act old, and acting old would reinforce the thought of old age.”

t(18)=2.10 0.050 1.96 0.50
t(35)=2.10 0.043 2.02 0.53
t(31)=2.50 0.018 2.37 0.66

MOP = .53, Inflation = .47, R-Index = .06

4.4.  Facial-feedback hypothesis (smiling makes you happy)

“Reciprocal links are common in the associative network. For example, being amused tends to make you smile, and smiling tends to make you feel amused….”

“College students were asked to rate the humor of cartoons from Gary Larson’s The Far Side while holding a pencil in their mouth. Those who were “smiling” (without any awareness of doing so) found the cartoons funnier than did those who were “frowning.”

[Fritz Strack, Leonard L. Martin, and Sabine Stepper, “Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis,” Journal of Personality and Social Psychology 54 (1988): 768–77.]

The authors used the more liberal and unconventional criterion of p < .05 (one-tailed), z = 1.65, to claim significance. Accordingly, we adjusted the R-Index analysis and used 1.65 as the criterion value.

t(89)=1.85 0.034 1.83 0.57
t(75)=1.78 0.034 1.83 0.57

MOP = .57, Inflation = .43, R-Index = .14
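The choice of criterion matters for observed power. As a rough illustration, the sketch below evaluates the listed z-score of 1.83 against both the conventional two-tailed criterion (1.96) and the one-tailed criterion (1.65) used here. The function name is our own, and the opposite tail is ignored.

```python
from statistics import NormalDist

def power_at_criterion(z, crit):
    """Observed power for a z-score given a criterion value (opposite tail ignored)."""
    return NormalDist().cdf(z - crit)

z = 1.83
two_tailed = power_at_criterion(z, 1.96)  # conventional criterion
one_tailed = power_at_criterion(z, 1.65)  # criterion used by the original authors

print(f"two-tailed: {two_tailed:.2f}, one-tailed: {one_tailed:.2f}")
```

With the conventional criterion, the result would not have been significant and observed power would fall below 50%. Using the authors' own criterion is therefore the more lenient choice for the original studies.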

These results could not be replicated in a large replication effort with 17 independent labs. Not a single lab produced a significant result and even a combined analysis failed to show any evidence for the effect.

4.5. Automatic Facial Responses

In another experiment, people whose face was shaped into a frown (by squeezing their eyebrows together) reported an enhanced emotional response to upsetting pictures—starving children, people arguing, maimed accident victims.

[Ulf Dimberg, Monika Thunberg, and Sara Grunedal, “Facial Reactions to Emotional Stimuli: Automatically Controlled Emotional Responses,” Cognition and Emotion, 16 (2002): 449–71.]

The description in the book does not match any of the three studies reported in this article. The first two studies examined facial muscle movements in response to pictures of facial expressions (smiling or frowning faces).  The third study used emotional pictures of snakes and flowers. We might consider the snake pictures as being equivalent to pictures of starving children or maimed accident victims.  Participants were also asked to frown or to smile while looking at the pictures. However, the dependent variable was not how they felt in response to pictures of snakes, but rather how their facial muscles changed.  Aside from a strong effect of instructions, the study also found that the emotional picture had an automatic effect on facial muscles.  Participants frowned more when instructed to frown and looking at a snake picture than when instructed to frown and looking at a picture of a flower. “This response, however, was larger to snakes than to flowers as indicated by both the Stimulus factor, F(1, 47) = 6.66, p < .02, and the Stimulus × Interval factor, F(1, 47) = 4.30, p < .05.”  (p. 463). The evidence for smiling was stronger. “The zygomatic major muscle response was larger to flowers than to snakes, which was indicated by both the Stimulus factor, F(1, 47) = 18.03, p < .001, and the Stimulus × Interval factor, F(1, 47) = 16.78, p < .001.”  No measures of subjective experiences were included in this study.  Therefore, the results of this study provide no evidence for Kahneman’s claim in the book and the results of this study are not included in our analysis.

4.6.  Effects of Head-Movements on Persuasion

“Simple, common gestures can also unconsciously influence our thoughts and feelings.”

“In one demonstration, people were asked to listen to messages through new headphones. They were told that the purpose of the experiment was to test the quality of the audio equipment and were instructed to move their heads repeatedly to check for any distortions of sound. Half the participants were told to nod their head up and down while others were told to shake it side to side. The messages they heard were radio editorials.”

“Those who nodded (a yes gesture) tended to accept the message they heard, but those who shook their head tended to reject it. Again, there was no awareness, just a habitual connection between an attitude of rejection or acceptance and its common physical expression.”

F(2,66)=44.70 0.000 7.22 1.00

MOP = 1.00, Inflation = .00,  R-Index = 1.00

[Gary L. Wells and Richard E. Petty, “The Effects of Overt Head Movements on Persuasion: Compatibility and Incompatibility of Responses,” Basic and Applied Social Psychology, 1, (1980): 219–30.]

4.7.  Location as Prime

“Our vote should not be affected by the location of the polling station, for example, but it is.”

“A study of voting patterns in precincts of Arizona in 2000 showed that the support for propositions to increase the funding of schools was significantly greater when the polling station was in a school than when it was in a nearby location.”

“A separate experiment showed that exposing people to images of classrooms and school lockers also increased the tendency of participants to support a school initiative. The effect of the images was larger than the difference between parents and other voters!”

[Jonah Berger, Marc Meredith, and S. Christian Wheeler, “Contextual Priming: Where People Vote Affects How They Vote,” PNAS 105 (2008): 8846–49.]

z = 2.10 0.036 2.10 0.56
p = .05 0.050 1.96 0.50

MOP = .53, Inflation = .47, R-Index = .06

4.8.  Money Priming

“Reminders of money produce some troubling effects.”

“Participants in one experiment were shown a list of five words from which they were required to construct a four-word phrase that had a money theme (“high a salary desk paying” became “a high-paying salary”).”

“Other primes were much more subtle, including the presence of an irrelevant money-related object in the background, such as a stack of Monopoly money on a table, or a computer with a screen saver of dollar bills floating in water.”

“Money-primed people become more independent than they would be without the associative trigger. They persevered almost twice as long in trying to solve a very difficult problem before they asked the experimenter for help, a crisp demonstration of increased self-reliance.”

“Money-primed people are also more selfish: they were much less willing to spend time helping another student who pretended to be confused about an experimental task. When an experimenter clumsily dropped a bunch of pencils on the floor, the participants with money (unconsciously) on their mind picked up fewer pencils.”

“In another experiment in the series, participants were told that they would shortly have a get-acquainted conversation with another person and were asked to set up two chairs while the experimenter left to retrieve that person. Participants primed by money chose to stay much farther apart than their nonprimed peers (118 vs. 80 centimeters).”

“Money-primed undergraduates also showed a greater preference for being alone.”

[Kathleen D. Vohs, “The Psychological Consequences of Money,” Science 314 (2006): 1154–56.]

F(2,49)=3.73 0.031 2.16 0.58
t(35)=2.03 0.050 1.96 0.50
t(37)=2.06 0.046 1.99 0.51
t(42)=2.13 0.039 2.06 0.54
F(2,32)=4.34 0.021 2.30 0.63
t(38)=2.13 0.040 2.06 0.54
t(33)=2.37 0.024 2.26 0.62
F(2,58)=4.04 0.023 2.28 0.62
chi^2(2)=10.10 0.006 2.73 0.78

MOP = .58, Inflation = .42, R-Index = .16

4.9.  Death Priming

“The evidence of priming studies suggests that reminding people of their mortality increases the appeal of authoritarian ideas, which may become reassuring in the context of the terror of death.”

The cited article does not directly examine this question.  The abstract states that “three experiments were conducted to test the hypothesis, derived from terror management theory, that reminding people of their mortality increases attraction to those who consensually validate their beliefs and decreases attraction to those who threaten their beliefs” (p. 308).  Study 2 found no general effect of death priming. Rather, the effect was qualified by authoritarianism. “Mortality salience enhanced the rejection of dissimilar others in Study 2 only among high authoritarian subjects” (p. 314), based on a three-way interaction with F(1,145) = 4.08, p = .045.  We used the three-way interaction for the computation of the R-Index.  Study 1 reported opposite effects for ratings of Christian targets, t(44) = 2.18, p = .034 and Jewish targets, t(44)= 2.08, p = .043. As these tests are dependent, only one test could be used, and we chose the slightly stronger result.  Similarly, Study 3 reported significantly more liking of a positive interviewee and less liking of a negative interviewee, t(51) = 2.02, p = .049 and t(49) = 2.42, p = .019, respectively. We chose the stronger effect.

[Jeff Greenberg et al., “Evidence for Terror Management Theory II: The Effect of Mortality Salience on Reactions to Those Who Threaten or Bolster the Cultural Worldview,” Journal of Personality and Social Psychology]

t(44)=2.18 0.035 2.11 0.56
F(1,145)=4.08 0.045 2.00 0.52
t(49)=2.42 0.019 2.34 0.65

MOP = .56, Inflation = .44, R-Index = .12

4.10.  The “Lady Macbeth Effect”

“For example, consider the ambiguous word fragments W_ _ H and S_ _ P. People who were recently asked to think of an action of which they are ashamed are more likely to complete those fragments as WASH and SOAP and less likely to see WISH and SOUP.”

“Furthermore, merely thinking about stabbing a coworker in the back leaves people more inclined to buy soap, disinfectant, or detergent than batteries, juice, or candy bars. Feeling that one’s soul is stained appears to trigger a desire to cleanse one’s body, an impulse that has been dubbed the “Lady Macbeth effect.”

[“Lady Macbeth effect”: Chen-Bo Zhong and Katie Liljenquist, “Washing Away Your Sins: Threatened Morality and Physical Cleansing,” Science 313 (2006): 1451–52.]

F(1,58)=4.26 0.044 2.02 0.52
F(1,25)=6.99 0.014 2.46 0.69

MOP = .61, Inflation = .39, R-Index = .22

The article reports two more studies that are not explicitly mentioned, but are used as empirical support for the Lady Macbeth effect. As the results of these studies were similar to those in the mentioned studies, including these tests in our analysis does not alter the conclusions.

chi^2(1)=4.57 0.033 2.14 0.57
chi^2(1)=5.02 0.025 2.24 0.61

MOP = .59, Inflation = .41, R-Index = .18

4.11.  Modality Specificity of the “Lady Macbeth Effect”

“Participants in an experiment were induced to “lie” to an imaginary person, either on the phone or in e-mail. In a subsequent test of the desirability of various products, people who had lied on the phone preferred mouthwash over soap, and those who had lied in e-mail preferred soap to mouthwash.”

[Spike Lee and Norbert Schwarz, “Dirty Hands and Dirty Mouths: Embodiment of the Moral-Purity Metaphor Is Specific to the Motor Modality Involved in Moral Transgression,” Psychological Science 21 (2010): 1423–25.]

The results are presented as significant with a one-sided t-test. “As shown in Figure 1a, participants evaluated mouthwash more positively after lying in a voice mail (M = 0.21, SD = 0.72) than after lying in an e-mail (M = –0.26, SD = 0.94), F(1, 81) = 2.93, p = .03 (one-tailed), d = 0.55 (simple main effect), but evaluated hand sanitizer more positively after lying in an e-mail (M = 0.31, SD = 0.76) than after lying in a voice mail (M = –0.12, SD = 0.86), F(1, 81) = 3.25, p = .04 (one-tailed), d = 0.53 (simple main effect).”  We adjusted the significance criterion for the R-Index accordingly.

F(1,81)=2.93 0.045 1.69 0.52
F(1,81)=3.25 0.038 1.78 0.55

MOP = .54, Inflation = .46, R-Index = .08

4.12.  Eyes on You

“On the first week of the experiment (which you can see at the bottom of the figure), two wide-open eyes stare at the coffee or tea drinkers, whose average contribution was 70 pence per liter of milk. On week 2, the poster shows flowers and average contributions drop to about 15 pence. The trend continues. On average, the users of the kitchen contributed almost three times as much in ’eye weeks’ as they did in ’flower weeks.’ ”

[Melissa Bateson, Daniel Nettle, and Gilbert Roberts, “Cues of Being Watched Enhance Cooperation in a Real-World Setting,” Biology Letters 2 (2006): 412–14.]

F(1,7)=11.55 0.011 2.53 0.72

MOP = .72, Inflation = .28, R-Index = .44

Combined Analysis

We then combined the results from the 31 studies mentioned above.  While the R-Index for small sets of studies may underestimate replicability, the R-Index for a large set of studies is more accurate.  Median Observed Power for all 31 studies is only 57%. It is incredible that 31 studies with 57% power could produce 100% significant results (Schimmack, 2012). Thus, there is strong evidence that the studies provide an overly optimistic image of the robustness of social priming effects.  Moreover, median observed power overestimates true power if studies were selected to be significant. After correcting for inflation, the R-Index is well below 50%.  This suggests that the studies have low replicability. Moreover, it is possible that some of the reported results are actually false positive results.  Just as the large-scale replication of the facial feedback studies failed to provide any support for the original findings, other studies may fail to show any effects in large replication projects. As a result, readers of “Thinking Fast and Slow” should be skeptical about the reported results and they should disregard Kahneman’s statement that “you have no choice but to accept that the major conclusions of these studies are true.”  Our analysis actually leads to the opposite conclusion. “You should not accept any of the conclusions of these studies as true.”

k = 31,  MOP = .57, Inflation = .43, R-Index = .14,  Grade: F for Fail

[Figure: Powergraph of Chapter 4]

Schimmack and Brunner (2015) developed an alternative method for the estimation of replicability.  This method takes into account that power can vary across studies. It also provides 95% confidence intervals for the replicability estimate.  The results of this method are presented in the Figure above. The replicability estimate is similar to the R-Index, with 14% replicability.  However, due to the small set of studies, the 95% confidence interval is wide and includes values above 50%. This does not mean that we can trust the published results, but it does suggest that some of the published results might be replicable in larger replication studies with more power to detect small effects.  At the same time, the graph shows clear evidence for a selection effect.  That is, published studies in these articles do not provide a representative picture of all the studies that were conducted.  The powergraph shows that there should have been a lot more non-significant results than were reported in the published articles.  The selective reporting of studies that worked is at the core of the replicability crisis in social psychology (Sterling, 1959; Sterling et al., 1995; Schimmack, 2012).  To clean up their act and to regain trust in published results, social psychologists have to conduct studies with larger samples that have more than 50% power (Tversky & Kahneman, 1971) and they have to stop reporting only significant results.  We can only hope that social psychologists will learn from the train wreck of social priming research and improve their research practices.