All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

A critique of Stroebe and Strack’s Article “The Alleged Crisis and the Illusion of Exact Replication”

The article by Stroebe and Strack (2014) [henceforth S&S] illustrates how experimental social psychologists responded to replication failures at the beginning of the replicability revolution.  The response is a classic example of repressive coping: Houston, we do not have a problem. Even in 2014, problems with the way experimental social psychologists had conducted research for decades were obvious (Bem, 2011; Wagenmakers et al., 2011; John et al., 2012; Francis, 2012; Schimmack, 2012; Pashler & Wagenmakers, 2012).  S&S's article is an attempt to dismiss these concerns as misunderstandings and empirically unsupported criticism.

“In contrast to the prevalent sentiment, we will argue that the claim of a replicability crisis is greatly exaggerated” (p. 59).  

Although the article was well received by prominent experimental social psychologists (see citations in appendix), future events proved S&S wrong and vindicated critics of research methods in experimental social psychology. Only a year later, the Open Science Collaboration (2015) reported that only 25% of studies in social psychology could be replicated successfully.  A statistical analysis of focal hypothesis tests in social psychology suggests that roughly 50% of original studies could be replicated successfully if these studies were replicated exactly (Motyl et al., 2017).  Ironically, one of S&S's points is that exact replication studies are impossible. As a result, the 50% estimate is an optimistic estimate of the success rate for actual replication studies, suggesting that the actual replicability of published results in social psychology is less than 50%.

Thus, even if S&S had reasons to be skeptical about the extent of the replicability crisis in experimental social psychology, it is now clear that experimental social psychology has a serious replication problem. Many published findings in social psychology textbooks may not replicate and many theoretical claims in social psychology rest on shaky empirical foundations.

What explains the replication problem in experimental social psychology?  The main reason for replication failures is that social psychology journals mostly publish significant results.  The selective publishing of significant results is called publication bias. Sterling pointed out that publication bias in psychology is rampant: psychology journals publish over 90% significant results (Sterling, 1959; Sterling et al., 1995).  Given new estimates that the actual success rate of studies in experimental social psychology is less than 50%, only publication bias can explain why over 90% of published results confirm theoretical predictions.
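The arithmetic of this filter is easy to simulate. In the hypothetical sketch below, the 50% true success rate and the 5% chance that a nonsignificant result gets published are assumed illustrative numbers, not estimates from the literature; conditioning publication on significance pushes the published success rate above 90% even though only half of all studies actually succeed:

```python
import random
import statistics

random.seed(1)

TRUE_SUCCESS_RATE = 0.5   # assumed: about half of all studies yield p < .05
P_PUBLISH_NONSIG = 0.05   # assumed: nonsignificant results are rarely published

# Each study either produces a significant result (True) or not (False).
studies = [random.random() < TRUE_SUCCESS_RATE for _ in range(100_000)]

# Publication filter: significant results are always published,
# nonsignificant results only occasionally.
published = [s for s in studies if s or random.random() < P_PUBLISH_NONSIG]

print(f"Success rate of all studies:       {statistics.mean(studies):.0%}")
print(f"Success rate of published studies: {statistics.mean(published):.0%}")
```

Under these assumptions the published success rate comes out around 95%, close to Sterling's observed 90%+, while the true success rate stays at 50%.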

It is not difficult to see that reporting only studies that confirm predictions undermines the purpose of empirical tests of theoretical predictions.  If studies that do not confirm predictions are hidden, it is impossible to obtain empirical evidence that a theory is wrong.  In short, for decades experimental social psychologists have engaged in a charade that pretends that theories are empirically tested, but publication bias ensured that theories would never fail.  This is rather similar to Volkswagen’s emission tests that were rigged to pass because emissions were never subjected to a real test.

In 2014, there were ample warning signs that publication bias and other dubious practices inflated the success rate in social psychology journals.  However, S&S claim that (a) there is no evidence for the use of questionable research practices and (b) that it is unclear which practices are questionable or not.

“Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology. In fact, the discipline still needs to reach an agreement about the conditions under which these practices are unacceptable” (p. 60).

Scientists like to hedge their statements so that they are immune to criticism. S&S may argue that the evidence in 2014 was not “solid” and surely there was and still is no agreement about good research practices. However, this is irrelevant. What is important is that success rates in social psychology journals were and still are inflated by suppressing disconfirming evidence and biasing empirical tests of theories in favor of positive outcomes.

Although S&S’s main claims are not based on empirical evidence, it is instructive to examine how they tried to shield published results and established theories from the harsh light of open replication studies that report results without selection for significance and subject social psychological theories to real empirical tests for the first time.

Failed Replication of Between-Subject Priming Studies

S&S discuss failed replications of two famous priming studies in social psychology: Bargh's elderly priming study and Dijksterhuis's professor priming studies.  Both seminal articles reported several successful tests of the prediction that a subtle priming manipulation would influence behavior without participants even noticing the priming effect.  In 2012, Doyen et al. failed to replicate elderly priming. Shanks et al. (2013) failed to replicate professor priming effects, and more recently a large registered replication report also provided no evidence for professor priming.  For naïve readers it is surprising that original studies had a 100% success rate and replication studies had a 0% success rate.  However, S&S are not surprised at all.

“as in most sciences, empirical findings cannot always be replicated” (p. 60). 

Apparently, S&S know something that naïve readers do not know.  The difference between naïve readers and experts in the field is that experts have access to unpublished information about failed replications in their own labs and in the labs of their colleagues. Only they know how hard it sometimes was to get the successful outcomes that were published. With the added advantage of insider knowledge, it makes perfect sense to expect some replication failures, although maybe not a 0% success rate.

The problem is that S&S give the impression that replication failures are to be expected, but that this expectation cannot be based on the objective scientific record, which hardly ever reports results that contradict theoretical predictions.  Replication failures occur all the time, but they remain unpublished. Doyen et al.'s and Shanks et al.'s articles merely violated the unwritten code to publish only supportive evidence.

Kahneman’s Train Wreck Letter

S&S also comment on Kahneman’s letter to Bargh that compared priming research to a train wreck.  In response S&S claim that

“priming is an entirely undisputed method that is widely used to test hypotheses about associative memory (e.g., Higgins, Rholes, & Jones, 1977; Meyer & Schvaneveldt, 1971; Tulving & Schacter, 1990).” (p. 60).  

This argument does not stand the test of time.  Since S&S published their article, researchers have distinguished more clearly between highly replicable priming effects in cognitive psychology, which use repeated measures and within-subject designs, and difficult-to-replicate between-subject social priming studies, which use subtle priming manipulations and a single outcome measure (BS social priming).  With regard to BS social priming, it is unclear which of these effects can be replicated, and leading social psychologists have been reluctant to demonstrate the replicability of their famous studies by conducting self-replications, as they were encouraged to do in Kahneman's letter.

S&S also point to empirical evidence for robust priming effects.

“A meta-analysis of studies that investigated how trait primes influence impression formation identified 47 articles based on 6,833 participants and found overall effects to be statistically highly significant (DeCoster & Claypool, 2004).” (p. 60). 

The problem with this evidence is that the meta-analysis did not take publication bias into account; in fact, it does not even mention publication bias as a possible problem.  A meta-analysis of studies that were selected for significance is itself biased by selection for significance.

Several years after Kahneman's letter, it is widely agreed that past research on social priming is a train wreck.  Kahneman published a popular book that celebrated social priming effects as a major scientific discovery in psychology.  Nowadays, he agrees with critics that the existing evidence is not credible.  It is also noteworthy that none of the researchers in this area have followed Kahneman's advice to replicate their own findings to show the world that these effects are real.

It is all a big misunderstanding

S&S suggest that “the claim of a replicability crisis in psychology is based on a major misunderstanding.” (p. 60). 

Apparently, lay people, trained psychologists, and a Nobel laureate are mistaken in their interpretation of replication failures.  S&S suggest that failed replications are unimportant.

“the myopic focus on “exact” replications neglects basic epistemological principles” (p. 60).  

To make their argument, they introduce the notion of exact replications and suggest that exact replication studies are uninformative.

 “a finding may be eminently reproducible and yet constitute a poor test of a theory.” (p. 60).

The problem with this line of argument is that we are supposed to assume that a finding is eminently reproducible, which presumably means it has been successfully replicated many times.  It seems sensible that further studies of gender differences in height are unnecessary to convince us that there is a gender difference in height. However, results in social psychology are not like gender differences in height.  By S&S's own earlier admission, "empirical findings cannot always be replicated" (p. 60). And if journals publish only significant results, it remains unknown which results are eminently reproducible and which are not.  S&S ignore publication bias and pretend that the published record shows that all findings in social psychology are eminently reproducible. Apparently, they would suggest that even Bem's findings that people have supernatural abilities are eminently reproducible.  These days, few social psychologists are willing to endorse this naïve interpretation of the scientific record as a credible body of empirical facts.

Exact Replication Studies are Meaningful if they are Successful

Ironically, S&S next suggest that exact replication studies can be useful.

Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework. Thus, the fact that priming individuals with the stereotype of the elderly resulted in a reduction of walking speed was a finding that was unexpected. Furthermore, even though it was consistent with existing theoretical knowledge, there was no consensus about the processes that mediate the impact of the prime on walking speed. It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper.

Similarly, Dijksterhuis and van Knippenberg (1998) conducted four studies in which they replicated the priming effects. Three of these studies contained conditions that were exact replications.

Because it is standard practice in publications of new effects, especially of effects that are surprising, to publish one or two exact replications, it is clearly more conducive to the advancement of psychological knowledge to conduct conceptual replications rather than attempting further duplications of the original study.

Given these quotes, it is problematic that S&S's article is often cited to claim that exact replications are impossible or unnecessary.  The argument that S&S are making here is rather different.  They are suggesting that original articles already provide sufficient evidence that results in social psychology are eminently reproducible, because original articles report multiple studies and some of these studies are often exact replication studies.  At face value, S&S have a point.  An honest series of statistically significant results makes it practically impossible that an effect is a false positive (Schimmack, 2012).  The problem is that multiple-study articles are not honest reports of all replication attempts.  Francis (2014) found that at least 80% of multiple-study articles showed statistical evidence of questionable research practices.  Given the pervasive influence of selection for significance, exact replication studies in original articles provide no information about the replicability of these results.

What made the failed replications by Doyen et al. and Shanks et al. so powerful was that these studies were the first real empirical tests of BS social priming effects, because the authors were willing to report successes or failures.  The problem for social psychology is that many textbook findings that were obtained with selection for significance cannot be reproduced in honest empirical tests of the predicted effects.  This means that the original effects were either dramatically inflated or may not exist at all.

Replication Studies are a Waste of Resources

S&S want readers to believe that replication studies are a waste of resources.

"Given that both research time and money are scarce resources, the large scale attempts at duplicating previous studies seem to us misguided" (p. 61).

This statement sounds a bit like a plea to spare social psychology the embarrassment of actual empirical tests that reveal the true replicability of textbook findings. After all, according to S&S it is impossible to duplicate original studies (i.e., to conduct exact replication studies) because replication studies differ in some way from the original studies and may not reproduce the original results.  So, none of the failed replication studies is an exact replication.  Doyen et al. replicated Bargh's study, conducted in New York City, in Belgium, and Shanks et al. replicated Dijksterhuis's studies from the Netherlands in the United States.  The finding that these studies could not reproduce the original results does not imply that the original findings were false positives, but it does imply that the findings may be unique to some unspecified specifics of the original studies.  This is noteworthy when original results are used in textbooks as evidence for general theories and not as historical accounts of what happened in one specific socio-cultural context during a specific historical period. As social situations and human behavior are never exact replications of the past, social psychological results need to be replicated continually, and doing so is not a waste of resources.  Suggesting that replication is a waste of resources is like suggesting that measuring GDP or unemployment every year is a waste of resources because we can just use last year's numbers.

As S&S ignore publication bias and selection for significance, they also ignore that publication bias leads to a massive waste of resources.  First, running empirical tests of theories that are never reported is a waste of resources.  Second, publishing only significant results is also a waste of resources because researchers design new studies based on the published record. When the published record is biased, many new studies will fail, just as airplanes designed on the basis of flawed science would drop from the sky.  Thus, a biased literature creates a massive waste of resources.

Ultimately, a science that publishes only significant results wastes all of its resources because the outcome of published studies is a foregone conclusion: the prediction was supported, p < .05. Social psychologists might as well publish purely theoretical articles, just as philosophers in the old days used "thought experiments" to support their claims. An empirical science is only a real science if theoretical predictions are subjected to tests that can fail.  By this simple criterion, experimental social psychology is not (yet) a science.

Should Psychologists Conduct Exact Replications or Conceptual Replications?

Stroebe and Strack next cite Pashler and Harris (2012) to claim that critics of experimental social psychology have dismissed the value of so-called conceptual replications.

"The main criticism of conceptual replications is that they are less informative than exact replications (e.g., Pashler & Harris, 2012)."

Before I examine S&S’s counterargument, it is important to realize that S&S misrepresented, and maybe misunderstood, Pashler and Harris’s main point. Here is the relevant quote from Pashler and Harris’s article.

We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling “pathological science” cases that embarrassed the natural sciences at several points in the 20th century (e.g., Polywater; Rousseau & Porto, 1970; and cold fusion; Taubes, 1993).

The problem for S&S is that they cannot address the problem of publication bias and therefore carefully avoid talking about it.  As a result, they misrepresent Pashler and Harris's critique of conceptual replications in combination with publication bias as a criticism of conceptual replication studies, which is absurd and not what Pashler and Harris intended to say or actually said. The following quote from their article makes this crystal clear.

However, what kept faith in cold fusion alive for some time (at least in the eyes of some onlookers) was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications). This suggests that one important hint that a controversial finding is pathological may arise when defenders of a controversial effect disavow the initial methods used to obtain an effect and rest their case entirely upon later studies conducted using other methods. Of course, productive research into real phenomena often yields more refined and better ways of producing effects. But what should inspire doubt is any situation where defenders present a phenomenon as a “moving target” in terms of where and how it is elicited (cf. Langmuir, 1953/1989). When this happens, it would seem sensible to ask, “If the finding is real and yet the methods used by the original investigators are not reproducible, then how were these investigators able to uncover a valid phenomenon with methods that do not work?” Again, the unavoidable conclusion is that a sound assessment of a controversial phenomenon should focus first and foremost on direct replications of the original reports and not on novel variations, each of which may introduce independent ambiguities.

I am confident that unbiased readers will recognize that Pashler and Harris did not suggest that conceptual replication studies are bad.  Their main point is that a few successful conceptual replication studies can be used to keep theories alive in the face of a string of many replication failures. The problem is not that researchers conduct successful conceptual replication studies. The problem is dismissing or outright hiding of disconfirming evidence in replication studies. S&S misconstrue Pashler and Harris’s claim to avoid addressing this real problem of ignoring and suppressing failed studies to support an attractive but false theory.

The Illusion of Exact Replications

S&S's next argument is that replication studies are never exact.

If one accepts that the true purpose of replications is a (repeated) test of a theoretical hypothesis rather than an assessment of the reliability of a particular experimental procedure, a major problem of exact replications becomes apparent: Repeating a specific operationalization of a theoretical construct at a different point in time and/or with a different population of participants might not reflect the same theoretical construct that the same procedure operationalized in the original study.

The most important word in this quote is "might."   Ebbinghaus's memory curve MIGHT not replicate today because he was his own subject.  Bargh's elderly priming study MIGHT not work today because Florida is no longer associated with the elderly, and Dijksterhuis's priming study MIGHT no longer work because students no longer think that professors are smart or that hooligans are dumb.

Just because there is no certainty in inductive inferences doesn't mean we can simply dismiss replication failures because something MIGHT have changed.  It is also possible that the published results MIGHT be false positives because significant results were obtained by chance, with QRPs, or with outright fraud.  Most people think that outright fraud is unlikely, but the Stapel debacle showed that we cannot rule it out.  So, we can argue forever about hypothetical reasons why a particular study succeeded or failed. These arguments are futile and have nothing to do with an objective, scientific evaluation of the facts.

This means that every study, whether it is a groundbreaking success or a replication failure, needs to be evaluated in terms of the objective scientific facts. There is no blanket immunity for seminal studies that protects them from disconfirming evidence.  No study is an exact replication of another study. That is a truism, and S&S's article is often cited for this simple fact.  It is as true as it is irrelevant for understanding the replication crisis in social psychology.

Exact Replications Are Often Uninformative

S&S contradict themselves in their use of the term exact replication.  First they argue that exact replications are impossible; then they argue that exact replications are uninformative.  I agree with S&S that exact replication studies are impossible. So, we can simply drop the term "exact" and examine why S&S believe that some replication studies are uninformative.

First, they give an elaborate, long, and hypothetical explanation for Doyen et al.'s failure to replicate Bargh's pair of elderly priming studies. After considering some possible explanations, they conclude

It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996).  

Once more, the realm of hypothetical conjectures has to rescue seminal findings. Just as it is possible that S&S are right, it is also possible that Bargh faked his data. To be sure, I do not believe that he faked his data, and I apologized for a Facebook comment that gave the wrong impression that I did. I am only raising this possibility here to make the point that everything is possible. Maybe Bargh just got lucky.  The probability of this is 1 out of 1,600 (the probability of obtaining the predicted effect twice with p < .05, two-tailed (!), is .025^2). Not very likely, but also not impossible.
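For readers who want to check the arithmetic behind the 1-in-1,600 figure:

```python
p_two_tailed = 0.05
# With a two-tailed test, only half of the 5% rejection region lies
# in the predicted direction.
p_predicted = p_two_tailed / 2     # .025
p_lucky_twice = p_predicted ** 2   # probability of two lucky successes in a row

print(p_lucky_twice)       # 0.000625
print(1 / p_lucky_twice)   # 1600.0
```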

No matter what the reason for the discrepancy between Bargh and Doyen’s findings is, the example does not support S&S’s claim that replication studies are uninformative. The failed replication raised concerns about the robustness of BS social priming studies and stimulated further investigation of the robustness of social priming effects. In the short span of six years, the scientific consensus about these effects has shifted dramatically, and the first publication of a failed replication is an important event in the history of social psychology.

S&S's critique of Shanks et al.'s replication studies is even weaker.  First, they have to admit that "professor" probably still primes intelligence more than "soccer hooligan" does. To rescue the original finding, S&S propose

“the priming manipulation might have failed to increase the cognitive representation of the concept “intelligence.” 

S&S also think that

another LIKELY reason for their failure could be their selection of knowledge items.

Meanwhile, a registered replication report with a design that was approved by Dijksterhuis failed to replicate the effect.  Although it is possible to come up with more possible reasons for these failures, real scientific creativity is revealed in creating experimental paradigms that produce replicable results, not in coming up with post-hoc explanations for replication failures.

Ironically, S&S even agree with my criticism of their argument.

 “To be sure, these possibilities are speculative”  (p. 62). 

In contrast, S&S fail to consider the possibility that published significant results are false positives, even though there is actual evidence for publication bias. The strong bias against published failures may be rooted in a long history of dismissing the unpublished failures that social psychologists routinely encounter in their own laboratories.  To avoid the self-awareness that hiding disconfirming evidence is unscientific, social psychologists made themselves believe that minute changes in experimental procedures can ruin a study (cf. Stapel).  Unfortunately, a science that dismisses replication failures as procedural hiccups is fated to fail because it removes the mechanism that makes science self-correcting.

Failed Replications are Uninformative

S&S next suggest that “nonreplications are uninformative unless one can demonstrate that the theoretically relevant conditions were met” (p. 62).

This reverses the burden of proof.  Original researchers pride themselves on innovative ideas and groundbreaking discoveries.  Like famous rock stars, they are often not the best musicians, and other musicians can play their songs; they get rewarded because they came up with something original. Take the Implicit Association Test as an example. The idea to use cognitive switching tasks to measure attitudes was original, and Greenwald deserves recognition for inventing this task. The IAT did not revolutionize attitude research because only Tony Greenwald could get the effects. It did so because everybody, including my undergraduate students, could replicate the basic IAT effect.

However, let's assume that the IAT effect could not be replicated. Is it really the job of researchers who merely duplicated a study to figure out why it did not work and to develop a theory of the circumstances under which an effect may or may not occur?  I do not think so. Failed replications are informative even if there is no immediate explanation for why the replication failed.  As Pashler and Harris's cold fusion example shows, there may not even be a satisfactory explanation after decades of research. Most probably, cold fusion never really worked, and the successful outcome of the original study was a fluke or a problem of the experimental design.  Nevertheless, it was important to demonstrate that the original cold fusion study could not be replicated.  To ask for an explanation of why replication studies fail is simply a way to make replication studies unattractive and to dismiss the results of studies that fail to produce the desired outcome.

Finally, S&S ignore that there is a simple explanation for replication failures in experimental social psychology: publication bias.  If original studies have low statistical power (e.g., Bargh's studies with N = 30) to detect small effects, only vastly inflated effect sizes reach significance.  An open replication study without inflated effect sizes is then unlikely to produce a successful outcome. Statistical analyses of original studies show that this explanation accounts for a large proportion of replication failures.
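A small simulation makes this mechanism concrete. The numbers below are illustrative assumptions rather than estimates from any particular study: a true effect of d = .3 and 15 participants per group (total N = 30, the sample size mentioned above), with a normal approximation standing in for the t-test:

```python
import math
import random
import statistics

random.seed(2)

d_true = 0.3                       # assumed small true effect size
n_per_group = 15                   # assumed design: total N = 30
se = math.sqrt(2 / n_per_group)    # approximate standard error of observed d

# Simulate many original studies (normal approximation to the t-test).
observed = [random.gauss(d_true, se) for _ in range(50_000)]
# Selection for significance in the predicted direction (|t| > 1.96).
significant = [d for d in observed if d / se > 1.96]

power = len(significant) / len(observed)
print(f"Power of the original design:   {power:.0%}")
print(f"True effect size:               d = {d_true:.2f}")
print(f"Mean published (significant) d: {statistics.mean(significant):.2f}")
# An exact replication with the same design succeeds with probability equal
# to the power, so most honest replications of such studies will fail.
```

Under these assumptions, only roughly one in eight original studies reaches significance, and the significant ones overstate the true effect roughly threefold; an honest replication with the same design is then far more likely to fail than to succeed.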

Conceptual Replication Studies are Informative

S&S cite Schmidt (2009) to argue that conceptual replication studies are informative.

With every difference that is introduced the confirmatory power of the replication increases, because we have shown that the phenomenon does not hinge on a particular operationalization but “generalizes to a larger area of application” (p. 93).

S&S continue

“An even more effective strategy to increase our trust in a theory is to test it using completely different manipulations.”

This is of course true as long as conceptual replication studies are successful. However, it is not clear why conceptual replication studies that for the first time try a completely different manipulation should be successful.  As I pointed out in my 2012 article, reading multiple-study articles with only successful conceptual replication studies is a bit like watching a magic show.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005). The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect.

I don't know whether S&S are impressed by Bem's article, with its 9 conceptual replication studies that successfully demonstrated supernatural abilities.  According to their line of argument, they should be.  However, even most social psychologists found it impossible to accept that time-reversed subliminal priming works. Unfortunately, this also means that successful conceptual replication studies are meaningless if only successful results are published.  Once more, S&S cannot address this problem because they ignore the simple fact that selection for significance undermines the purpose of empirical research: to test theoretical predictions.

Exact Replications Contribute Little to Scientific Knowledge

Without providing much evidence for their claims, S&S conclude

one reason why exact replications are not very interesting is that they contribute little to scientific knowledge.

Ironically, one year later Science published the results of 100 replication studies with the sole goal of estimating the replicability of psychological research, with a focus on social psychology.  The article has already been cited 640 times, while S&S's criticism of replication studies has been cited (only) 114 times.

Although the article did nothing other than report the outcomes of replication studies, it made a tremendous empirical contribution to psychology because it reported the results of studies without the filter of publication bias.  Suddenly the success rate plummeted from over 90% to 37% overall, and to 25% for social psychology.  While S&S could claim in 2014 that "Thus far, however, no solid data exist on the prevalence of such [questionable] research practices in either social or any other area of psychology," the reproducibility project revealed that these practices dramatically inflated the percentage of successful studies reported in psychology journals.

The article has been celebrated by scientists in many disciplines as a heroic effort and a sign that psychologists are trying to improve their research practices. S&S may disagree, but I consider the reproducibility project a big contribution to scientific knowledge.

Why null findings are not always that informative

To fully appreciate the absurdity of S&S’s argument, I let them speak for themselves.

One reason is that not all null findings are interesting.  For example, just before his downfall, Stapel published an article on how disordered contexts promote stereotyping and discrimination. In this publication, Stapel and Lindenberg (2011) reported findings showing that litter or a broken-up sidewalk and an abandoned bicycle can increase social discrimination. These findings, which were later retracted, were judged to be sufficiently important and interesting to be published in the highly prestigious journal Science. Let us assume that Stapel had actually conducted the research described in this paper and failed to support his hypothesis. Such a null finding would have hardly merited publication in the Journal of Articles in Support of the Null Hypothesis. It would have been uninteresting for the same reason that made the positive result interesting, namely, that (a) nobody expected a relationship between disordered environments and prejudice and (b) there was no previous empirical evidence for such a relationship. Similarly, if Bargh et al. (1996) had found that priming participants with the stereotype of the elderly did not influence walking speed or if Dijksterhuis and van Knippenberg (1998) had reported that priming participants with “professor” did not improve their performance on a task of trivial pursuit, nobody would have been interested in their findings.

Notably, all of these examples are null findings in original studies. Thus, they have no relevance for the importance of replication studies. As S&S noted earlier

Thus, null findings are interesting only if they contradict a central hypothesis derived from an established theory and/or are discrepant with a series of earlier studies. (p. 65)

Bem (2011) reported 9 significant results to support unbelievable claims about supernatural abilities.  However, several failed replication studies allowed psychologists to dismiss these findings and to ignore claims about time-reversed priming effects. So, while not all null results are important, null results in replication studies are important because they can correct false positive results in original articles. Without this correction mechanism, science loses its ability to correct itself.

Failed Replications Do Not Falsify Theories

S&S state that failed replications do not falsify theories

The nonreplications published by Shanks and colleagues (2013) cannot be taken as a falsification of that theory, because their study does not explain why previous research was successful in replicating the original findings of Dijksterhuis and van Knippenberg (1998). (p. 64)

I am unaware of any theory in psychology that has been falsified. The reason is not that failed replication studies are uninformative; the reason is that, until recently, theories were protected by hiding failed replication studies. Only in recent years have social psychologists started to contemplate the possibility that some theories in social psychology might be false.  The most prominent example is ego-depletion theory, one of the first prominent theories to be put under the microscope of open science without the protection of questionable research practices. While ego-depletion theory is not entirely dead, few people still believe in the simple claim that 20 Stroop trials deplete individuals’ willpower.  Falsification is hard, but falsification without disconfirming evidence is impossible.

Inconsistent Evidence

S&S argue that replication failures have to be evaluated in the context of replication successes.

Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis. 

Earlier S&S wrote

in social psychology, as in most sciences, empirical findings cannot always be replicated (this was one of the reasons for the development of meta-analytic methods). 

Indeed. Unless studies have very high statistical power, inconsistent results are inevitable, which is one reason why publishing only significant results is a sign of low credibility (Schimmack, 2012). Meta-analysis is the only way to make sense of these inconsistent findings.  However, it is well known that publication bias renders meta-analytic results meaningless (e.g., meta-analyses show very strong evidence for supernatural abilities).  Thus, it is important that all tests of a theoretical prediction are reported to produce meaningful meta-analyses.  If social psychologists took S&S seriously and continued to suppress non-significant results as uninformative, meta-analyses would continue to provide biased results that support even false theories.
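How selective publication poisons a meta-analysis can be shown with a small simulation of my own (the design is a hypothetical sketch, not anything from S&S or the cited studies): even when the true effect is exactly zero, averaging only the significant studies produces a large, entirely spurious effect.

```python
import random, statistics

random.seed(1)

def run_study(n=20, true_effect=0.0):
    """Simulate a two-group study; return the observed mean difference and significance."""
    treat = [random.gauss(true_effect, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(treat) - statistics.mean(ctrl)
    se = (2 / n) ** 0.5                  # standard error (population SD known to be 1)
    return diff, abs(diff / se) > 1.96   # two-sided z-test at alpha = .05

results = [run_study() for _ in range(2000)]
all_effects = [d for d, sig in results]
published = [d for d, sig in results if sig and d > 0]  # only positive significant results survive

print(round(statistics.mean(all_effects), 2))  # near 0: the true effect
print(round(statistics.mean(published), 2))    # a large, purely spurious "meta-analytic" effect
```

A random-effects model, funnel plot, or any other meta-analytic tool applied to the `published` list alone cannot recover the true value of zero; only access to all conducted studies can.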

Failed Replications are Uninformative II

Sorry that this is getting really long, but S&S keep making the same arguments, and the editor apparently did not ask them to shorten the article. Here they repeat the argument that failed replications are uninformative.

One reason why null findings are not very interesting is because they tell us only that a finding could not be replicated but not why this was the case. This conflict can be resolved only if researchers develop a theory that could explain the inconsistency in findings.  

A related claim is that failed replications never demonstrate that original findings were false because the inconsistency is always due to some third variable; a hidden moderator.

Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted. (p. 64)

These statements reveal a fundamental misunderstanding of statistical inference.  A significant result never proves that the null hypothesis is false.  The inference that a real effect rather than sampling error caused the observed result can be a mistake, called a false positive or type-I error. S&S seem to believe that type-I errors do not exist; accordingly, Bem’s significant results show real supernatural abilities.  If this were the case, it would be meaningless to report statistical significance tests. The only possible error would be false negatives, or type-II errors: the theory makes the correct prediction, but a study fails to produce a significant result. And if theoretical predictions are always correct, it is also not necessary to subject theories to empirical tests, because these tests either correctly confirm a prediction or falsely fail to confirm it.
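That type-I errors are built into significance testing, not an exotic possibility, is easy to verify with a sketch simulation (parameters are my own illustrative choices): when the null hypothesis is true, about 5% of studies are nevertheless “significant” by design.

```python
import random, statistics

random.seed(0)

def null_study(n=30):
    """One study in which the null hypothesis is TRUE; return whether p < .05."""
    treat = [random.gauss(0, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(treat) - statistics.mean(ctrl)
    se = (2 / n) ** 0.5               # standard error (population SD known to be 1)
    return abs(diff / se) > 1.96      # two-sided z-test at alpha = .05

false_positive_rate = sum(null_study() for _ in range(10000)) / 10000
print(false_positive_rate)  # close to 0.05: type-I errors cannot be eliminated, only bounded
```

If journals publish only the significant studies, these ~5% of false positives are exactly the ones that enter the literature, which is why independent replication is the correction mechanism.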

S&S’s belief in published results has a religious quality.  Apparently we know nothing about the world, but once a significant result is published in a social psychology journal, ideally JPSP, it becomes a holy truth that defies any evidence that non-believers may produce under the misguided assumption that further inquiry is necessary. Elderly priming is real, amen.

More Confusing Nonsense

At some point, I was no longer surprised by S&S’s claims, but I did start to wonder about the reviewers and editors who allowed this manuscript to be published apparently with light or no editing.  Why would a self-respecting journal publish a sentence like this?

As a consequence, the mere coexistence of exact replications that are both successful and unsuccessful is likely to leave researchers helpless about what to conclude from such a pattern of outcomes.

Didn’t S&S claim that exact replication studies do not exist? Didn’t they tell readers that every inconsistent finding has to be interpreted as an interaction effect?  And where do they see inconsistent results if journals never publish non-significant results?

Aside from these inconsistencies, inconsistent results do not lead to a state of helpless paralysis. As S&S themselves suggested, researchers can conduct a meta-analysis. Are S&S suggesting that we need to spare researchers from inconsistent results to protect them from a state of helpless confusion? Is this their justification for publishing only significant results?

Even Massive Replication Failures in Registered Replication Reports are Uninformative

In response to the replication crisis, some psychologists started to invest time and resources in large-scale replication projects such as Many Labs studies and Registered Replication Reports, in which a single study is replicated in many labs.  The combined sample size across labs gives these studies high precision in estimating the average effect size and even makes it possible to demonstrate that an effect size is close to zero, which suggests that the null hypothesis may be true.  These studies have failed to find evidence for classic social psychology findings, including Strack’s facial feedback studies. S&S suggest that even these results are uninformative.

Conducting exact replications in a registered and coordinated fashion by different laboratories does not remove the described shortcomings. This is also the case if exact replications are proposed as a means to estimate the “true size” of an effect. As the size of an experimental effect always depends on the specific error variance that is generated by the context, exact replications can assess only the efficiency of an intervention in a given situation but not the generalized strength of a causal influence.

Their argument does not make any sense to me.  First, it is not clear what S&S mean by “the size of an experimental effect always depends on the specific error variance.”  An effect size is a property of the effect, not of the noise around its estimate: it is the sampling error of an effect size estimate, which shrinks as the sample size grows, that depends on the error variance, not the effect size itself.  So it makes no sense to claim that effect sizes depend on error variance.
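The distinction between an effect size and its sampling error can be made concrete with a simulation (my own sketch, assuming a true mean difference of 0.5 with unit variance): across studies of increasing size, the average estimate stays at the true value while only its spread shrinks.

```python
import random, statistics

random.seed(42)

def study_estimates(n, true_diff=0.5, reps=2000):
    """Mean and spread of the estimated mean difference across many studies of size n."""
    ests = []
    for _ in range(reps):
        treat = [random.gauss(true_diff, 1) for _ in range(n)]
        ctrl = [random.gauss(0, 1) for _ in range(n)]
        ests.append(statistics.mean(treat) - statistics.mean(ctrl))
    return statistics.mean(ests), statistics.stdev(ests)

results = {n: study_estimates(n) for n in (20, 80, 320)}
for n, (mean_est, sampling_error) in results.items():
    print(n, round(mean_est, 2), round(sampling_error, 2))
# the average estimate stays near the true value 0.5 at every n;
# only the sampling error shrinks, tracking sqrt(2/n)
```

This is why a many-lab project with a huge combined sample can pin down an effect size precisely: pooling labs shrinks the sampling error, while the quantity being estimated does not change with n.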

Second, it is not clear what S&S mean by specific error variance that is generated by the context.  I simply cannot address this argument because “context-generated specific error variance” is not a statistical construct, and S&S do not explain what they are talking about.

Finally, it is not clear why meta-analyses of replication studies cannot be used to estimate the generalized strength of a causal influence, which I take to mean “an effect size.”  Earlier, S&S alluded to meta-analysis as a way to resolve inconsistencies in the literature, but now they seem to suggest that meta-analysis cannot be used.

If S&S really want to imply that meta-analyses are useless, it is unclear how they would make sense of inconsistent findings.  The only viable solution seems to be to avoid inconsistencies by suppressing non-significant results in order to give the impression that every theory in social psychology is correct because theoretical predictions are always confirmed.  Although this sounds absurd, it is the inevitable logical consequence of S&S’s claim that non-significant results are uninformative, even when over 20 labs independently and in combination fail to provide evidence for a theoretically predicted effect.

The Great History of Social Psychological Theories

S&S next present über-social psychologist Leon Festinger as an example of why theories are good and failed studies are bad.  The argument is that good theories make correct predictions even if bad studies fail to show the effect.

“Although their theoretical analysis was valid, it took a decade before researchers were able to reliably replicate the findings reported by Festinger and Carlsmith (1959).”

As a former student, I was surprised by this statement because I had learned that Festinger’s theory was challenged by Bem’s theory and that social psychologists had been unable to resolve which of the two theories was correct.  Couldn’t some of these replication failures be explained by the fact that Festinger’s theory sometimes made the wrong prediction?

It is also not surprising that researchers had a hard time replicating Festinger and Carlsmith’s original findings.  The reason is that the original study had low statistical power, and replication failures are expected even if the theory is correct. Finally, I have been around social psychologists long enough to have heard some rumors about Festinger and Carlsmith’s original studies.  Accordingly, some of Festinger’s graduate students also tried and failed to get the effect. Carlsmith was the ‘lucky’ one who got the effect, in one study with p < .05, and he became the co-author of one of the most cited articles in the history of social psychology. Naturally, Festinger did not publish the failed studies of his other graduate students, because surely they must have done something wrong. As I said, that is a rumor.  But even if the rumor is not true and Carlsmith got lucky on the first try, luck played a role, and nobody should expect a study to replicate simply because a single published study reported a p-value less than .05.

Failed Replications Did Not Influence Social Psychological Theories

Argument quality reaches a new low with the next argument against replication studies.

 “If we look at the history of social psychology, theories have rarely been abandoned because of failed replications.”

This is true, but it reveals the lack of progress in theory development in social psychology rather than the futility of replication studies.  From an evolutionary perspective, theory development requires selection pressure, but publication bias protects bad theories from failure.

The short history of open science shows how weak social psychological theories are and that even the most basic predictions cannot be confirmed in open replication studies that do not selectively report significant results.  So, even if it is true that failed replications have played a minor role in the past of social psychology, they are going to play a much bigger role in the future of social psychology.

The Red Herring: Fraud

S&S imply that Roediger suggested using replication studies as a fraud-detection tool.

if others had tried to replicate his [Stapel’s] work soon after its publication, his misdeeds might have been uncovered much more quickly

S&S dismiss this idea in part on the basis of Stroebe’s research on fraud detection.

To their own surprise, Stroebe and colleagues found that replications hardly played any role in the discovery of these fraud cases.

Now, this is actually not surprising, because failed replications were hardly ever published.  And if there is no variance in a predictor variable (significance), we cannot see a correlation between the predictor and an outcome (fraud).  Although failed replication studies may help to detect fraud in the future, this is neither their primary purpose nor necessary to make replication studies valuable. Replication studies also do not bring world peace or end global warming.

For some inexplicable reason S&S continue to focus on fraud. For example, they also argue that meta-analyses are poor fraud detectors, which is as true as it is irrelevant.

They conclude their discussion with an observation by Stapel, who famously faked 50+ articles in social psychology journals.

As Stapel wrote in his autobiography, he was always pleased when his invented findings were replicated: “What seemed logical and was fantasized became true” (Stapel, 2012). Thus, neither can failures to replicate a research finding be used as indicators of fraud, nor can successful replications be invoked as indication that the original study was honestly conducted.

I am not sure why S&S spend so much time talking about fraud, but it is the only questionable research practice that they openly address.  In contrast, they do not discuss other questionable research practices, including suppressing failed studies, that are much more prevalent and much more important for the understanding of the replication crisis in social psychology than fraud.  The term “publication bias” is not mentioned once in the article. Sometimes what is hidden is more significant than what is being published.

Conclusion

The conclusion section correctly predicts that the results of the reproducibility project will make social psychology look bad and that social psychology will look worse than other areas of psychology.

But whereas it will certainly be useful to be informed about studies that are difficult to replicate, we are less confident about whether the investment of time and effort of the volunteers of the Open Science Collaboration is well spent on replicating studies published in three psychology journals. The result will be a reproducibility coefficient that will not be greatly informative, because of justified doubts about whether the “exact” replications succeeded in replicating the theoretical conditions realized in the original research.

As social psychologists, we are particularly concerned that one of the outcomes of this effort will be that results from our field will be perceived to be less “reproducible” than research in other areas of psychology. This is to be expected because for the reasons discussed earlier, attempts at “direct” replications of social psychological studies are less likely than exact replications of experiments in psychophysics to replicate the theoretical conditions that were established in the original study.

Although psychologists should not be complacent, there seem to be no reasons to panic the field into another crisis. Crises in psychology are not caused by methodological flaws but by the way people talk about them (Kruglanski & Stroebe, 2012).

S&S attribute the foreseen (how did they know?) bad outcome in the reproducibility project to the difficulty of replicating social psychological studies, but they fail to explain why social psychology journals publish as many successes as other disciplines.

The results of the reproducibility project provide an answer to this question.  Social psychologists use designs with lower statistical power, which have a lower chance of producing a significant result. Selection for significance ensures that the published success rate is equally high in all areas of psychology, but lower power makes these successes less replicable.
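This mechanism can be demonstrated with a simulation (a sketch under my own assumptions: a true effect of 0.5 SD, z-tests with known variance): two literatures that both publish only significant results look equally successful in print, but the low-powered one replicates far less often.

```python
import random

random.seed(7)

def significant(n, true_diff=0.5):
    """One two-group study, summarized by its mean difference; True if p < .05."""
    se = (2 / n) ** 0.5                     # standard error of the mean difference
    observed = random.gauss(true_diff, se)  # sampling distribution of the estimate
    return abs(observed / se) > 1.96        # two-sided z-test

def replication_rate(n, attempts=5000):
    """Publish only significant originals, then replicate each one exactly."""
    originals = [significant(n) for _ in range(attempts)]
    published = sum(originals)              # in print, the success rate is 100%
    replicated = sum(significant(n) for ok in originals if ok)
    return replicated / published

low_n_rate = replication_rate(20)    # small samples: roughly 35% power
high_n_rate = replication_rate(100)  # large samples: roughly 94% power
print(round(low_n_rate, 2), round(high_n_rate, 2))
```

Because an exact replication succeeds with probability equal to the study's power, the replication rate directly exposes the power that selection for significance had hidden.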

To avoid further embarrassments in an increasingly open science, social psychologists must improve the statistical power of their studies. Which social psychological theories will survive actual empirical tests in the new world of open science is unclear.  In this regard, I think it makes more sense to compare social psychology to a ship wreck than a train wreck.  Somewhere down on the floor of the ocean is some gold. But it will take some deep diving and many failed attempts to find it.  Good luck!

Appendix

S&S’s article was published in a “prestigious” psychology journal and has already garnered 114 citations. It ranks #21 in my importance rankings of articles in meta-psychology.  So, I was curious why the article gets cited.  The appendix lists 51 citing articles with the relevant citation and the reason for citing S&S’s article.   The table shows the reasons for citations in decreasing order of frequency.

S&S are most frequently cited for the claim that exact replications are impossible, followed by the reason for this claim: that effects in psychological research are sensitive to the unique context in which a study is conducted.  The next two reasons for citing the article are that only conceptual replications (CR) test theories, whereas the results of exact replications (ER) are uninformative.  The problem is that every study is a conceptual replication because exact replications are impossible; so even if exact replications were uninformative, this claim has no practical relevance because there are no exact replications.  Some articles cite S&S with no specific claim attached to the citation.  Only two articles cite them for the claim that there is no replication crisis, and only one cites S&S for the claim that there is no evidence about the prevalence of QRPs.

In short, the article is mostly cited for the uncontroversial and inconsequential claim that exact replications are impossible and that effect sizes in psychological studies can vary as a function of unique features of a particular sample or study.  This observation is inconsequential because it is unclear how unknown unique characteristics of studies influence results.  The main implication is that study results will be more variable than we would expect from a set of exact replications. For this reason, meta-analysts often use random-effects models, because a fixed-effects meta-analysis assumes that all studies are exact replications.

ER impossible 11
Contextual Sensitivity 8
CR test theory 8
ER uninformative 7
Mention 6
ER/CR Distinction 2
No replication crisis 2
Disagreement 1
CR Definition 1
ER informative 1
ER useful for applied research 1
ER cannot detect fraud 1
No evidence about prevalence of QRP 1
Contextual sensitivity greater in social psychology 1

Below are the most influential citing articles and the relevant citations.  I haven’t had time to do a content analysis, but the article is mostly cited to say that (a) exact replications are impossible, (b) conceptual replications are valuable, and (c) social psychological findings are harder to replicate.  Few articles cite the article to claim that the replication crisis is overblown or that failed replications are uninformative.  Thus, even though the article is cited a lot, it is not cited for the main points S&S tried to make.  The high number of citations therefore does not mean that S&S’s claims have been widely accepted.

(Disagreement)
The value of replication studies.

Simons, D. J.
“In this commentary, I challenge these claims.”

(ER/CR Distinction)
Bilingualism and cognition.

Valian, V.
“A host of methodological issues should be resolved. One is whether the field should undertake exact replications, conceptual replications, or both, in order to determine the conditions under which effects are reliably obtained (Paap, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(Contextual Sensitivity)
Is Psychology Suffering From a Replication Crisis? What Does “Failure to Replicate” Really Mean?
Maxwell et al. (2015)
“A particular replication may fail to confirm the results of an original study for a variety of reasons, some of which may include intentional differences in procedures, measures, or samples as in a conceptual replication (Cesario, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(ER impossible)
The Chicago face database: A free stimulus set of faces and norming data 

Debbie S. Ma, Joshua Correll, & Bernd Wittenbrink.
“The CFD will also make it easier to conduct exact replications, because researchers can use the same stimuli employed by other researchers (but see Stroebe & Strack, 2014).”

(Contextual Sensitivity)
“Contextual sensitivity in scientific reproducibility”
vanBavel et al. (2015)
“Many scientists have also argued that the failure to reproduce results might reflect contextual differences—often termed “hidden moderators”—between the original research and the replication attempt”

(Contextual Sensitivity)
Editorial Psychological Science

Lindsay, D. S.
As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence do not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).

(ER impossible)
Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science

Finkel et al. (2015).
“Nevertheless, many scholars believe that direct replications are impossible in the human sciences—S&S (2014) call them “an illusion”— because certain factors, such as a moment in historical time or the precise conditions under which a sample was obtained and tested, that may have contributed to a result can never be reproduced identically.”

Conceptualizing and evaluating the replication of research results
Fabrigar and Wegener (2016)
(CR test theory)
“Traditionally, the primary presumed strength of conceptual replications has been their ability to address issues of construct validity (e.g., Brewer & Crano, 2014; Schmidt, 2009; Stroebe & Strack, 2014). “

(ER impossible)
“First, it should be recognized that an exact replication in the strictest sense of the term can never be achieved as it will always be impossible to fully recreate the contextual factors and participant characteristics present in the original experiment (see Schmidt (2009); S&S (2014).”

(Contextual Sensitivity)
“S&S (2014) have argued that there is good reason to expect that many traditional and contemporary experimental manipulations in social psychology would have different psychological properties and effects if used in contexts or populations different from the original experiments for which they were developed. For example, classic dissonance manipulations and fear manipulations or more contemporary priming procedures might work very differently if used in new contexts and/or populations. One could generate many additional examples beyond those mentioned by S&S.”

(ER impossible)
“Another important point illustrated by the above example is that the distinction between exact and conceptual replications is much more nebulous than many discussions of replication would suggest. Indeed, some critics of the exact/conceptual replication distinction have gone so far as to argue that the concept of exact replication is an “illusion” (Stroebe & Strack, 2014). Though we see some utility in the exact/conceptual distinction (especially regarding the goal of the researcher in the work), we agree with the sentiments expressed by S&S. Classifying studies on the basis of the exact/conceptual distinction is more difficult than is often appreciated, and the presumed strengths and weaknesses of the approaches are less straightforward than is often asserted or assumed.”

(Contextual Sensitivity)
“Furthermore, assuming that these failed replication experiments have used the same operationalizations of the independent and dependent variables, the most common inference drawn from such failures is that confidence in the existence of the originally demonstrated effect should be substantially undermined (e.g., see Francis (2012); Schimmack (2012)). Alternatively, a more optimistic interpretation of such failed replication experiments could be that the failed versus successful experiments differ as a function of one or more unknown moderators that regulate the emergence of the effect (e.g., Cesario, 2014; Stroebe & Strack, 2014).”

Replicating Studies in Which Samples of Participants Respond to Samples of Stimuli.
(CR Definition)
Westfall et al. (2015).
Nevertheless, the original finding is considered to be conceptually replicated if it can be convincingly argued that the same theoretical constructs thought to account for the results of the original study also account for the results of the replication study (Stroebe & Strack, 2014). Conceptual replications are thus “replications” in the sense that they establish the reproducibility of theoretical interpretations.”

(Mention)
“Although establishing the generalizability of research findings is undoubtedly important work, it is not the focus of this article (for opposing viewpoints on the value of conceptual replications, see Pashler & Harris, 2012; Stroebe & Strack, 2014).“

Introduction to the Special Section on Advancing Our Methods and Practices
(Mention)
Ledgerwood, A.
We can and surely should debate which problems are most pressing and which solutions most suitable (e.g., Cesario, 2014; Fiedler, Kutzner, & Krueger, 2012; Murayama, Pekrun, & Fiedler, 2013; Stroebe & Strack, 2014). But at this point, most can agree that there are some real problems with the status quo.

***Theory Building, Replication, and Behavioral Priming: Where Do We Need to Go From Here?
Locke, EA
(ER impossible)
As can be inferred from Table 1, I believe that the now popular push toward “exact” replication (e.g., see Simons, 2014) is not the best way to go. Everyone agrees that literal replication is impossible (e.g., Stroebe & Strack, 2014), but let us assume it is as close as one can get. What has been achieved?

The War on Prevention: Bellicose Cancer: Metaphors Hurt (Some) Prevention Intentions”
(CR test theory)
David J. Hauser and Norbert Schwarz
“As noted in recent discussions (Stroebe & Strack, 2014), consistent effects of multiple operationalizations of a conceptual variable across diverse content domains are a crucial criterion for the robustness of a theoretical approach.”

ON THE OTHER SIDE OF THE MIRROR: PRIMING IN COGNITIVE AND SOCIAL PSYCHOLOGY 
Doyen et al.
(CR test theory)
In contrast, social psychologists assume that the primes activate culturally and situationally contextualized representations (e.g., stereotypes, social norms), meaning that they can vary over time and culture and across individuals. Hence, social psychologists have advocated the use of “conceptual replications” that reproduce an experiment by relying on different operationalizations of the concepts under investigation (Stroebe & Strack, 2014). For example, in a society in which old age is associated not with slowness but with, say, talkativeness, the outcome variable could be the number of words uttered by the subject at the end of the experiment rather than walking speed.”

***Welcome back Theory
Ap Dijksterhuis
(ER uninformative)
“it is unavoidable, and indeed, this commentary is also about replication—it is done against the background of something we had almost forgotten: theory! S&S (2014, this issue) argue that focusing on the replication of a phenomenon without any reference to underlying theoretical mechanisms is uninformative”

On the scientific superiority of conceptual replications for scientific progress
Christian S. Crandall, Jeffrey W. Sherman
(ER impossible)
But in matters of social psychology, one can never step in the same river twice—our phenomena rely on culture, language, socially primed knowledge and ideas, political events, the meaning of questions and phrases, and an ever-shifting experience of participant populations (Ramscar, 2015). At a certain level, then, all replications are “conceptual” (Stroebe & Strack, 2014), and the distinction between direct and conceptual replication is continuous rather than categorical (McGrath, 1981). Indeed, many direct replications turn out, in fact, to be conceptual replications. At the same time, it is clear that direct replications are based on an attempt to be as exact as possible, whereas conceptual replications are not.

***Are most published social psychological findings false?
Stroebe, W.
(ER uninformative)
“This near doubling of replication success after combining original and replication effects is puzzling. Because these replications were already highly powered, the increase is unlikely to be due to the greater power of a meta-analytic synthesis. The two most likely explanations are quality problems with the replications or publication bias in the original studies. An evaluation of the quality of the replications is beyond the scope of this review and should be left to the original authors of the replicated studies. However, the fact that all replications were exact rather than conceptual replications of the original studies is likely to account to some extent for the lower replication rate of social psychological studies (Stroebe & Strack, 2014). There is no evidence either to support or to reject the second explanation.”

(ER impossible)
“All four projects relied on exact replications, often using the material used in the original studies. However, as I argued earlier (Stroebe & Strack, 2014), even if an experimental manipulation exactly replicates the one used in the original study, it may not reflect the same theoretical variable.”

(CR test theory)
“Gergen’s argument has important implications for decisions about the appropriateness of conceptual compared to exact replication. The more a phenomenon is susceptible to historical change, the more conceptual replication rather than exact replication becomes appropriate (Stroebe & Strack, 2014).”

(CR test theory)
“Moonesinghe et al. (2007) argued that any true replication should be an exact replication, “a precise process where the exact same finding is reexamined in the same way”. However, conceptual replications are often more informative than exact replications, at least in studies that are testing theoretical predictions (Stroebe & Strack, 2014). Because conceptual replications operationalize independent and/or dependent variables in a different way, successful conceptual replications increase our trust in the predictive validity of our theory.”

There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance
Anderson & Maxwell
(Mention)
“It is important to note some caveats regarding direct (exact) versus conceptual replications. While direct replications were once avoided for lack of originality, authors have recently urged the field to take note of the benefits and importance of direct replication. According to Simons (2014), this type of replication is “the only way to verify the reliability of an effect” (p. 76). With respect to this recent emphasis, the current article will assume direct replication. However, despite the push toward direct replication, some have still touted the benefits of conceptual replication (Stroebe & Strack, 2014). Importantly, many of the points and analyses suggested in this paper may translate well to conceptual replication.”

Reconceptualizing replication as a sequence of different studies: A replication typology
Joachim Hüffmeier, Jens Mazei, Thomas Schultze
(ER impossible)
The first type of replication study in our typology encompasses exact replication studies conducted by the author(s) of an original finding. Whereas we must acknowledge that replications can never be “exact” in a literal sense in psychology (Cesario, 2014; Stroebe & Strack, 2014), exact replications are studies that aspire to be comparable to the original study in all aspects (Schmidt, 2009). Exact replications—at least those that are not based on questionable research practices such as the arbitrary exclusion of critical outliers, sampling or reporting biases (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)—serve the function of protecting against false positive effects (Type I errors) right from the start.

(ER informative)
Thus, this replication constitutes a valuable contribution to the research process. In fact, already some time ago, Lykken (1968; see also Mummendey, 2012) recommended that all experiments should be replicated  before publication. From our perspective, this recommendation applies in particular to new findings (i.e., previously uninvestigated theoretical relations), and there seems to be some consensus that new findings should be replicated at least once, especially when they were unexpected, surprising, or only loosely connected to existing theoretical models (Stroebe & Strack, 2014; see also Giner-Sorolla, 2012; Murayama et al., 2014).”

(Mention)
Although there is currently some debate about the epistemological value of close replication studies (e.g., Cesario, 2014; LeBel & Peters, 2011; Pashler & Harris, 2012; Simons, 2014; Stroebe & Strack, 2014), the possibility that each original finding can—in principle—be replicated by the scientific community represents a cornerstone of science (Kuhn, 1962; Popper, 1992).

(CR test theory)
So far, we have presented “only” the conventional rationale used to stress the importance of close replications. Notably, however, we will now add another—and as we believe, logically necessary—point originally introduced by S&S (2014). This point protects close replications from being criticized (cf. Cesario, 2014; Stroebe & Strack, 2014; see also LeBel & Peters, 2011). Close replications can be informative only as long as they ensure that the theoretical processes investigated or at least invoked by the original study are shown to also operate in the replication study.

(CR test theory)
The question of how to conduct a close replication that is maximally informative entails a number of methodological choices. It is important to both adhere to the original study proceedings (Brandt et al., 2014; Schmidt, 2009) and focus on and meticulously measure the underlying theoretical mechanisms that were shown or at least proposed in the original studies (Stroebe & Strack, 2014). In fact, replication attempts are most informative when they clearly demonstrate either that the theoretical processes have unfolded as expected or at which point in the process the expected results could no longer be observed (e.g., a process ranging from a treatment check to a manipulation check and [consecutive] mediator variables to the dependent variable). Taking these measures is crucial to rule out that a null finding is simply due to unsuccessful manipulations or changes in a manipulation’s meaning and impact over time (cf. Stroebe & Strack, 2014).

(CR test theory)
Conceptual replications in laboratory settings are the fourth type of replication study in our typology. In these replications, comparability to the original study is aspired to only in the aspects that are deemed theoretically relevant (Schmidt, 2009; Stroebe & Strack, 2014). In fact, most if not all aspects may differ as long as the theoretical processes that have been studied or at least invoked in the original study are also covered in a conceptual replication study in the laboratory.”

(ER useful for applied research)
For instance, conceptual replications may be less important for applied disciplines that focus on clinical phenomena and interventions. Here, it is important to ensure that there is an impact of a specific intervention and that the related procedure does not hurt the members of the target population (e.g., Larzelere et al., 2015; Stroebe & Strack, 2014).”

From intrapsychic to ecological theories in social psychology: Outlines of a functional theory approach
Klaus Fiedler
(ER uninformative)
Replicating an ill-understood finding is like repeating a complex sentence in an unknown language. Such a “replication” in the absence of deep understanding may appear funny, ridiculous, and embarrassing to a native speaker, who has full control over the foreign language. By analogy, blindly replicating or running new experiments on an ill-understood finding will rarely create real progress (cf. Stroebe & Strack, 2014).

Into the wild: Field research can increase both replicability and real-world impact
Jon K. Maner
(CR test theory)
Although studies relying on homogeneous samples of laboratory or online participants might be highly replicable when conducted again in a similar homogeneous sample of laboratory or online participants, this is not the key criterion (or at least not the only criterion) on which we should judge replicability (Westfall, Judd & Kenny, 2015; see also Brandt et al., 2014; Stroebe & Strack, 2014). Just as important is whether studies replicate in samples that include participants who reflect the larger and more diverse population.”

Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives?
Shanks et al.
(ER impossible)
There is no such thing as an “exact” replication (Stroebe & Strack, 2014) and hence it must be acknowledged that the published studies (notwithstanding the evidence for p-hacking and/or publication bias) may have obtained genuine effects and that undetected moderator variables explain why the present studies failed to obtain priming. Some of the experiments reported here differed in important ways from those on which they were modeled (although others were closer replications and even these failed to obtain evidence of reliable romantic priming).

(CR test theory)
As S&S (2014) point out, what is crucial is not so much exact surface replication but rather identical operationalization of the theoretically relevant variables. In the present case, the crucial factors are the activation of romantic motives and the appropriate assessment of consumption, risk-taking, and other measures.”

A Duty to Describe: Better the Devil You Know Than the Devil You Don’t
Brown, Sacha D et al.
(Mention)
Ioannidis (2005) has been at the forefront of researchers identifying factors interfering with self-correction. He has claimed that journal editors selectively publish positive findings and discriminate against study replications, permitting errors in data and theory to enjoy a long half-life (see also Ferguson & Brannick, 2012; Ioannidis, 2008, 2012; Shadish, Doherty, & Montgomery, 1989; Stroebe & Strack, 2014). We contend there are other equally important, yet relatively unexplored, problems.

A Room with a Viewpoint Revisited: Descriptive Norms and Hotel Guests’ Towel Reuse Behavior
(Contextual Sensitivity)
Bohner, Gerd; Schlueter, Lena E.
On the other hand, our pilot participants’ estimates of towel reuse rates were generally well below 75%, so we may assume that the guests participating in our experiments did not perceive the normative messages as presenting a surprisingly low figure. In a more general sense, the issue of greatly diverging baselines points to conceptual issues in trying to devise a ‘‘direct’’ replication: Identical operationalizations simply may take on different meanings for people in different cultures.

***The empirical benefits of conceptual rigor: Systematic articulation of conceptual hypotheses can reduce the risk of non-replicable results (and facilitate novel discoveries too)
Mark Schaller
(Contextual Sensitivity)
Unless these subsequent studies employ methods that exactly replicate the idiosyncratic context in which the effect was originally detected, these studies are unlikely to replicate the effect. Indeed, because many psychologically important contextual variables may lie outside the awareness of researchers, even ostensibly “exact” replications may fail to create the conditions necessary for a fragile effect to emerge (Stroebe & Strack, 2014)

A Concise Set of Core Recommendations to Improve the Dependability of Psychological Research
David A. Lishner
(CR test theory)
The claim that direct replication produces more dependable findings across replicated studies than does conceptual replication seems contrary to conventional wisdom that conceptual replication is preferable to direct replication (Dijksterhuis, 2014; Neulip & Crandall, 1990, 1993a, 1993b; Stroebe & Strack, 2014).
(CR test theory)
However, most arguments advocating conceptual replication over direct replication are attempting to promote the advancement or refinement of theoretical understanding (see Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014). The argument is that successful conceptual replication demonstrates a hypothesis (and by extension the theory from which it derives) is able to make successful predictions even when one alters the sampled population, setting, operations, or data analytic approach. Such an outcome not only suggests the presence of an organizing principle, but also the quality of the constructs linked by the organizing principle (their theoretical meanings). Of course this argument assumes that the consistency across the replicated findings is not an artifact of data acquisition or data analytic approaches that differ among studies. The advantage of direct replication is that regardless of how flexible or creative one is in data acquisition or analysis, the approach is highly similar across replication studies. This duplication ensures that any false finding based on using a flexible approach is unlikely to be repeated multiple times.

(CR test theory)
Does this mean conceptual replication should be abandoned in favor of direct replication? No, absolutely not. Conceptual replication is essential for the theoretical advancement of psychological science (Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014), but only if dependability in findings via direct replication is first established (Cesario, 2014; Simons, 2014). Interestingly, in instances where one is able to conduct multiple studies for inclusion in a research report, one approach that can produce confidence in both dependability of findings and theoretical generalizability is to employ nested replications.

(ER cannot detect fraud)
A second advantage of direct replications is that they can protect against fraudulent findings (Schmidt, 2009), particularly when different research groups conduct direct replication studies of each other’s research. S&S (2014) make a compelling argument that direct replication is unlikely to prove useful in detection of fraudulent research. However, even if a fraudulent study remains unknown or undetected, its impact on the literature would be lessened when aggregated with nonfraudulent direct replication studies conducted by honest researchers.

***Does cleanliness influence moral judgments? Response effort moderates the effect of cleanliness priming on moral judgments.
Huang
(ER uninformative)
Indeed, behavioral priming effects in general have been the subject of increased scrutiny (see Cesario, 2014), and researchers have suggested different causes for failed replication, such as measurement and sampling errors (Stanley and Spence, 2014), variation in subject populations (Cesario, 2014), discrepancy in operationalizations (S&S, 2014), and unidentified moderators (Dijksterhuis, 2014).

UNDERSTANDING PRIMING EFFECTS IN SOCIAL PSYCHOLOGY: AN OVERVIEW AND INTEGRATION
Daniel C. Molden
(ER uninformative)
Therefore, some greater emphasis on direct replication in addition to conceptual replication is likely necessary to maximize what can be learned from further research on priming (but see Stroebe and Strack, 2014, for costs of overemphasizing direct replication as well).

On the automatic link between affect and tendencies to approach and avoid: Chen and Bargh (1999) revisited
Mark Rotteveel et al.
(no replication crisis)
Although opinions differ with regard to the extent of this “replication crisis” (e.g., Pashler and Harris, 2012; S&S, 2014), the scientific community seems to be shifting its focus more toward direct replication.

(ER uninformative)
Direct replications not only affect one’s confidence about the veracity of the phenomenon under study, but they also increase our knowledge about effect size (see also Simons, 2014; but see also S&S, 2014).

Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability
McShane and Bockenholt
(ER impossible)
The purpose of meta-analysis is to synthesize a set of studies of a common phenomenon. This task is complicated in behavioral research by the fact that behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999).

(ER impossible)
Further, because behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999), our SPM methodology estimates and accounts for heterogeneity, which has been shown to be important in a wide variety of behavioral research settings (Hedges and Pigott 2001; Klein et al. 2014; Pigott 2012).

A Closer Look at Social Psychologists’ Silver Bullet: Inevitable and Evitable Side   Effects of the Experimental Approach
Herbert Bless and Axel M. Burger
(ER/CR Distinction)
Given the above perspective, it becomes obvious that in the long run, conceptual replications can provide very fruitful answers because they address the question of whether the initially observed effects are potentially caused by some perhaps unknown aspects of the experimental procedure (for a discussion of conceptual versus direct replications, see e.g., Stroebe & Strack, 2014; see also Brandt et al., 2014; Cesario, 2014; Lykken, 1968; Schwarz & Strack, 2014). Whereas conceptual replications are adequate solutions for broadening the sample of situations (for examples, see Stroebe & Strack, 2014), the present perspective, in addition, emphasizes that it is important that the different conceptual replications do not share too much overlap in general aspects of the experiment (see also Schwartz, 2015, advocating for conceptual replications).

Men in red: A reexamination of the red-attractiveness effect
Vera M. Hesslinger, Lisa Goldbach, & Claus-Christian Carbon
(ER impossible)
As Brandt et al. (2014) pointed out, a replication in psychological research will never be absolutely exact or direct (see also, Stroebe & Strack, 2014), which is, of course, also the case in the present research.

***On the challenges of drawing conclusions from p-values just below 0.05
Daniel Lakens
(no evidence about QRP)
In recent years, researchers have become more aware of how flexibility during the data-analysis can increase false positive results (e.g., Simmons, Nelson & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated, for example because researchers analyze their data until a p-value smaller than 0.05 is observed, the robustness of scientific knowledge can substantially decrease. However, as Stroebe & Strack (2014, p. 60) have pointed out: ‘Thus far, however, no solid data exist on the prevalence of such research practices.’

***Does Merely Going Through the Same Moves Make for a ‘‘Direct’’ Replication? Concepts, Contexts, and Operationalizations
Norbert Schwarz and Fritz Strack
(Contextual Sensitivity)
In general, meaningful replications need to realize the psychological conditions of the original study. The easier option of merely running through technically identical procedures implies the assumption that psychological processes are context insensitive and independent of social, cultural, and historical differences (Cesario, 2014; Stroebe & Strack, 2014). Few social (let alone cross-cultural) psychologists would be willing to endorse this assumption with a straight face. If so, mere procedural equivalence is an insufficient criterion for assessing the quality of a replication.

The Replication Paradox: Combining Studies can Decrease Accuracy of Effect Size Estimates
(ER uninformative)
Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts
Replications with nonsignificant results are easily dismissed with the argument that the replication might contain a confound that caused the null finding (Stroebe & Strack, 2014).

Retro-priming, priming, and double testing: psi and replication in a test-retest design
Rabeyron, T
(Mention)
Bem’s paper spawned numerous attempts to replicate it (see e.g., Galak et al., 2012; Bem et al., submitted) and reflections on the difficulty of direct replications in psychology (Ritchie et al., 2012). This aspect has been associated more generally with debates concerning the “decline effect” in science (Schooler, 2011) and a potential “replication crisis” (S&S, 2014) especially in the fields of psychology and medical sciences (De Winter and Happee, 2013).

Do p Values Lose Their Meaning in Exploratory Analyses? It Depends How You Define the Familywise Error Rate
Mark Rubin
(ER impossible)
Consequently, the Type I error rate remains constant if researchers simply repeat the same test over and over again using different samples that have been randomly drawn from the exact same population. However, this first situation is somewhat hypothetical and may even be regarded as impossible in the social sciences because populations of people change over time and location (e.g., Gergen, 1973; Iso-Ahola, 2017; Schneider, 2015; Serlin, 1987; Stroebe & Strack, 2014). Yesterday’s population of psychology undergraduate students from the University of Newcastle, Australia, will be a different population to today’s population of psychology undergraduate students from the University of Newcastle, Australia.

***Learning and the replicability of priming effects
Michael Ramscar
(ER uninformative)
In the limit, this means that in the absence of a means for objectively determining what the information that produces a priming effect is, and for determining that the same information is available to the population in a replication, all learned priming effects are scientifically unfalsifiable. (Which also means that in the absence of an account of what the relevant information is in a set of primes, and how it produces a specific effect, reports of a specific priming result — or failures to replicate it — are scientifically uninformative; see also Stroebe & Strack, 2014.)

***Evaluating Psychological Research Requires More Than Attention to the N: A Comment on Simonsohn’s (2015) “Small Telescopes”
Norbert Schwarz and Gerald L. Clore
(CR test theory)
Simonsohn’s decision to equate a conceptual variable (mood) with its manipulation (weather) is compatible with the logic of clinical trials, but not with the logic of theory testing. In clinical trials, which have inspired much of the replicability debate and its statistical focus, the operationalization (e.g., 10 mg of a drug) is itself the variable of interest; in theory testing, any given operationalization is merely one, usually imperfect, way to realize the conceptual variable. For this reason, theory tests are more compelling when the results of different operationalizations converge (Stroebe & Strack, 2014), thus ensuring, in the case in point, that it is not “the weather” but indeed participants’ (sometimes weather-induced) mood that drives the observed effect.

Internal conceptual replications do not increase independent replication success
Kunert, R
(Contextual Sensitivity)
According to the unknown moderator account of independent replication failure, successful internal replications should correlate with independent replication success. This account suggests that replication failure is due to the fact that psychological phenomena are highly context-dependent, and replicating seemingly irrelevant contexts (i.e. unknown moderators) is rare (e.g., Barrett, 2015; DGPS, 2015; Fleming Crim, 2015; see also Stroebe & Strack, 2014; for a critique, see Simons, 2014). For example, some psychological phenomenon may unknowingly be dependent on time of day.

(Contextual Sensitivity greater in social psychology)
When the chances of unknown moderator influences are greater and replicability is achieved (internal, conceptual replications), then the same should be true when chances are smaller (independent, direct replications). Second, the unknown moderator account is usually invoked for social psychological effects (e.g. Cesario, 2014; Stroebe & Strack, 2014). However, the lack of influence of internal replications on independent replication success is not limited to social psychology. Even for cognitive psychology a similar pattern appears to hold.

On Klatzky and Creswell (2014): Saving Social Priming Effects But Losing Science as We Know It?
Barry Schwartz
(ER uninformative)
The recent controversy over what counts as “replication” illustrates the power of this presumption. Does “conceptual replication” count? In one respect, conceptual replication is a real advance, as conceptual replication extends the generality of the phenomena that were initially discovered. But what if it fails? Is it because the phenomena are unreliable, because the conceptual equivalency that justified the new study was logically flawed, or because the conceptual replication has permitted the intrusion of extraneous variables that obscure the original phenomenon? This ambiguity has led some to argue that there is no substitute for strict replication (see Pashler & Harris, 2012; Simons, 2014, and Stroebe & Strack, 2014, for recent manifestations of this controversy). A significant reason for this view, however, is less a critique of the logic of conceptual replication than it is a comment on the sociology (or politics, or economics) of science. As Pashler and Harris (2012) point out, publication bias virtually guarantees that successful conceptual replications will be published whereas failed conceptual replications will live out their lives in a file drawer.  I think Pashler and Harris’ surmise is probably correct, but it is not an argument for strict replication so much as it is an argument for publication of failed conceptual replication.

Commentary and Rejoinder on Lynott et al. (2014)
Lawrence E. Williams
(CR test theory)
On the basis of their investigations, Lynott and colleagues (2014) conclude ‘‘there is no evidence that brief exposure to warm therapeutic packs induces greater prosocial responding than exposure to cold therapeutic packs’’ (p. 219). This conclusion, however, does not take into account other related data speaking to the connection between physical warmth and prosociality. There is a fuller body of evidence to be considered, in which both direct and conceptual replications are instructive. The former are useful if researchers particularly care about the validity of a specific phenomenon; the latter are useful if researchers particularly care about theory testing (Stroebe & Strack, 2014).

The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?
(no replication crisis)
Motyl et al. (2017) quote S&S: “The claim of a replicability crisis is greatly exaggerated.” (Wolfgang Stroebe and Fritz Strack, 2014)

Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology
Harry T. Reis, Karisa Y. Lee
(ER impossible)
Much of the current debate, however, is focused narrowly on direct or exact replications—whether the findings of a given study, carried out in a particular way with certain specific operations, would be repeated. Although exact replications are surely desirable, the papers by Fabrigar and by Crandall and Sherman remind us that in an absolute sense they are fundamentally impossible in social–personality psychology (see also S&S, 2014).

Show me the money
(Contextual Sensitivity)
Of course, it is possible that additional factors, which varied or could have varied among our studies and previously published studies (e.g., participants’ attitudes toward money) or among the online studies and laboratory study in this article (e.g., participants’ level of distraction), might account for these apparent inconsistencies. We did not aim to conduct a direct replication of any specific past study, and therefore we encourage special care when using our findings to evaluate existing ones (Doyen, Klein, Simons, & Cleeremans, 2014; Stroebe & Strack, 2014).

***From Data to Truth in Psychological Science. A Personal Perspective.
Strack
(ER uninformative)
In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating noneffects (S&S, 2014). But this may require expertise and effort.

 


Replicability 101: How to interpret the results of replication studies

Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015).  This article gives a basic introduction to the interpretation of statistical results within the Neyman Pearson approach to statistical inferences.

I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures.  First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence).  Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent.  Second, I point out that publication bias undermines the credibility of significant results in original studies.  When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.

DEFINITION OF REPLICATING A STATISTICAL RESULT

Replicating something means to get the same result.  If I make the first free throw, replicating this outcome means to also make the second free throw.  When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”

If we conduct psychological studies, we can control many factors, but some factors are not under our control.  Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other.  It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.
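This point can be illustrated with a short simulation (a sketch, not from the article; it assumes a one-sample design with a true standardized effect of d = .5, n = 20 observations, and a two-tailed z-test with known variance; the function name `significant_study` is mine):

```python
import math
import random

random.seed(1)

def significant_study(d=0.5, n=20, z_crit=1.96):
    """One simulated study: n draws from N(d, 1), two-tailed z-test at alpha = .05."""
    mean = sum(random.gauss(d, 1.0) for _ in range(n)) / n
    z = mean * math.sqrt(n)  # standard error of the mean is 1/sqrt(n)
    return abs(z) > z_crit

# Run 2,000 pairs of "exact" replications of the very same design.
pairs = [(significant_study(), significant_study()) for _ in range(2000)]
both_sig = sum(a and b for a, b in pairs) / len(pairs)
mixed = sum(a != b for a, b in pairs) / len(pairs)
print(f"both significant: {both_sig:.2f}, one success / one failure: {mixed:.2f}")
```

Even though each pair consists of two exact copies of the same study, a success followed by a failure is a common outcome; sampling error alone produces it.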

To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study.  The most common outcome in psychological studies is a significant or non-significant result.  The goal of a study is to produce a significant result and for this reason a significant result is often called a success.  A successful replication study is a study that also produces a significant result.  Obtaining two significant results is akin to making two free throws.  This is one of the few agreements between Maxwell and me.

“Generally speaking, a published  original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)

The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).

I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical.  Otherwise, the replication study is not really a replication study.

There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures.  The probability of these outcomes depends on the significance criterion (the type-I error probability, alpha) when the null-hypothesis is true and on the statistical power (1 – beta) of a study when the null-hypothesis is false.

Table 1 shows the probability of the outcomes in two studies.  The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts.  This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.

                2 sig, 0 ns     1 sig, 1 ns           0 sig, 2 ns
H0 is True      alpha^2         2*alpha*(1-alpha)     (1-alpha)^2
H1 is True      (1-beta)^2      2*(1-beta)*beta       beta^2
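These probabilities are easy to verify with a few lines of code. A minimal sketch in Python, reproducing the numbers discussed in this section:

```python
# Outcome probabilities for a pair of independent studies, as in Table 1.
# p_sig is the probability that a single study is significant:
# alpha when H0 is true, statistical power when H1 is true.

def pair_outcomes(p_sig):
    """Return P(2 sig), P(1 sig + 1 ns), P(0 sig) for two independent studies."""
    return p_sig ** 2, 2 * p_sig * (1 - p_sig), (1 - p_sig) ** 2

alpha = 0.05
h0 = pair_outcomes(alpha)      # H0 true
h1_40 = pair_outcomes(0.40)    # H1 true, 40% power
h1_50 = pair_outcomes(0.50)    # H1 true, 50% power
h1_80 = pair_outcomes(0.80)    # H1 true, 80% power

print(h0[0])      # about .0025 -> 1 out of 400 pairs are double false positives
print(h1_50[0])   # about .25   -> two significant results with 50% power
print(h1_80[0])   # about .64   -> two significant results with 80% power

# Likelihood ratios discussed in the text:
print(h1_40[0] / h0[0])    # about 64  -> two significant results favor H1 (40% power) over H0
print(h0[2] / h1_80[2])    # about 22.6 -> two non-significant results favor H0 over H1 (80% power)
```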

The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability).  High power is needed to obtain significant results in both studies of a pair (an original study and a replication study).  For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012).  Even with conventionally acceptable power of 80%, only 64% of replication attempts would produce this outcome.  However, studies in psychology often do not have 80% power, and estimates of average power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce two significant results in no more than 16 out of 100 attempts.   Although successful replications of true effects with low power are unlikely, they are still much more likely than two significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1).  It is therefore reasonable to infer from two significant results that the null-hypothesis is false.

If the null-hypothesis is true, it is extremely likely that both studies produce a non-significant result (.95^2 = 90.25%).  In contrast, it is unlikely that even a pair of studies with modest power would produce two non-significant results.  For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%.  This means that two non-significant results make it much more likely (22.5 : 1) that the null-hypothesis is true than that the alternative hypothesis is true.  This does not mean that the null-hypothesis is true in an absolute sense, because power depends on the effect size.  For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5,  two non-significant results would suggest that the effect size is smaller than .5, but they would not warrant the conclusion that H0 is true and the effect size is exactly 0.  Once more, it is important to distinguish between the absence of evidence for an effect and evidence for the absence of an effect.

The most controversial scenario assumes that the two studies produced inconsistent outcomes.  Although statistically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure  (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low:  .05 * (1 – .05) = .0475.  The same probability exists for the reverse pattern, in which a non-significant result is followed by a significant one.  A probability of 4.75% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, this low probability is mostly due to the low probability of obtaining a significant result in the first study; the replication failure itself is extremely likely.

Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false.  The probability of this outcome depends on statistical power.  A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result.  The probability of this rare event can be as low as, or lower than, the probability with a true null effect: .95 * (1 – .95) = .0475.  Thus, an inconsistent result provides little information about the probability of a type-I or type-II  error and is difficult to interpret.

In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false.  Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high.  When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.

Meta-Analysis 

The counting of successes and failures is an old way to integrate information from multiple studies.  This approach has low power and is no longer used.  A more powerful approach is effect size meta-analysis.  Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project.  Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.

“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)

I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies.  A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.

Maxwell et al. (2015) introduced an example of a between-subject (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size).  This study has 59% power to obtain a significant result.  Thus, it is quite likely that a pair of such studies produces inconsistent results (48.38%).   However, a pair of studies with N = 80 each provides the power of a single study with a total sample size of N = 160, which means a fixed-effect meta-analysis will produce a significant result in 88% of all attempts.  Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%).   Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.
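These power figures can be reproduced with the noncentral t distribution. A sketch in Python (scipy is assumed to be available; the third decimal may differ slightly from the rounded values above):

```python
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for standardized effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # probability of landing in either rejection region under H1
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

p_single = power_two_sample(0.5, 40)   # original design: n = 40 per group, N = 80
p_pooled = power_two_sample(0.5, 80)   # pooled pair of studies: N = 160

print(round(p_single, 2))                       # about .60, i.e., the ~59% power above
print(round(p_pooled, 2))                       # about .88
print(round(2 * p_single * (1 - p_single), 2))  # about .48: chance of inconsistent results
```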

A more interesting scenario is inconsistent results when the null-hypothesis is true.  I turned to simulations to examine this scenario more closely.   The simulation showed that a meta-analysis of inconsistent studies produced a significant result in 34% of all cases.  The percentage varies only slightly as a function of sample size: with a small sample of N = 40, it is 35%; with a large sample of 1,000 participants, it is 33%.  This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis based on a significant original study.  Thus, if an original study produced a false-positive result, a failed replication study corrects this error in 2 out of 3 cases.  Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.
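A minimal version of such a simulation can be sketched with a normal approximation to the test statistics (a simplification, not the exact simulation behind the reported numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 1_000_000
z_orig = rng.standard_normal(n_pairs)   # z-scores of original studies under H0
z_rep = rng.standard_normal(n_pairs)    # z-scores of exact replications under H0

# Select the inconsistent pairs: significant original, non-significant replication.
inconsistent = (np.abs(z_orig) > 1.96) & (np.abs(z_rep) <= 1.96)

# With equal sample sizes, the fixed-effect meta-analytic z-score is the
# equally weighted combination of the two study z-scores.
z_meta = (z_orig + z_rep) / np.sqrt(2)
frac_sig = np.mean(np.abs(z_meta[inconsistent]) > 1.96)
print(frac_sig)   # about one third, in line with the ~34% reported above
```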

In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies.  If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.

DIFFERENCES IN SAMPLE SIZES

We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990).  This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers.  If original researchers conducted studies with adequate power,  an exact replication study with the same sample size would also have adequate power.  If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size.  As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.

From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size than the original study.  Studies with larger sample sizes are given more weight than studies with smaller samples.  Thus, researchers who invest more resources are rewarded by their studies receiving more weight.  Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same.  Failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magic sample size that a replication study has to reach to be useful.

I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5).  In this simulation, only 21% of meta-analyses produced a significant result.  This is 13 percentage points lower than in the simulation with equal sample sizes (34%).  If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.
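The unequal-N case can be sketched the same way, with inverse-variance weights (proportional to sample size) in the fixed-effect meta-analysis. A simplified simulation under a true null, using normal approximations (the exact percentages will vary slightly):

```python
import numpy as np

def remaining_false_positives(n_orig, n_rep, n_pairs=1_000_000, seed=2):
    """Fraction of sig-original / ns-replication pairs (under H0) in which a
    fixed-effect meta-analysis of the pair remains significant."""
    rng = np.random.default_rng(seed)
    z_orig = rng.standard_normal(n_pairs)
    z_rep = rng.standard_normal(n_pairs)
    sel = (np.abs(z_orig) > 1.96) & (np.abs(z_rep) <= 1.96)
    # fixed-effect weights are proportional to sample size; the combined z-score
    # weights each study's z by the square root of its share of the total weight
    z_meta = (np.sqrt(n_orig) * z_orig + np.sqrt(n_rep) * z_rep) / np.sqrt(n_orig + n_rep)
    return np.mean(np.abs(z_meta[sel]) > 1.96)

frac_250 = remaining_false_positives(80, 200)   # replication 2.5 times larger
frac_10x = remaining_false_positives(80, 800)   # replication 10 times larger
print(frac_250)   # about .2, in line with the 21% above
print(frac_10x)   # about .1
```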

The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings.  Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase the sample sizes of replication studies to make replication failures meaningful.  Given unlimited resources, larger replications are better, but these analyses show that large replication studies are not necessary.  A replication study with the same sample size as the original study is more valuable than no replication study at all.

CONFUSING ABSENCE OF EVIDENCE WITH EVIDENCE OF ABSENCE

One problem in Maxwell et al.’s (2015) article is that it conflates two possible goals of replication studies.  One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis.  This finding would mean that evidence for an effect is absent.  The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result, and the burden of proof shifts back to proponents of the hypothesis to provide unbiased, credible evidence for it.

Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i.e., humans do not have supernatural abilities).  Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect.  They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.

This asymmetry makes it easy to produce (potentially false) positive results and very hard to remove false positive results from the literature.   It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the effect size in the original study.  The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported.  To avoid the problem that replication researchers have to invest large amounts of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles.  If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.

THE DIRTY BIG SECRET

The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias.  They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling, 1959; Sterling et al., 1995).

It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance  (Schimmack, 2012).

I used Bem’s infamous article on supernatural abilities as an example, which appeared to show 8 successful replications of supernatural abilities.  Ironically, Maxwell et al. (2015) also cite Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.

“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”

This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP.  They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation.”  There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.

More important, Maxwell et al. ignore the broader context of these studies.  In Schimmack (2012), I discussed many questionable practices in Bem’s original studies and presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices.  Given this wider context, it is entirely reasonable to favor the experimental-artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies while ignoring that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

Maxwell et al. (2015) were not aware that in the same year the OSC (2015) reproducibility project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%.  The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provides strong evidence that psychology is suffering from a replication crisis. This does not mean that all findings that failed to replicate are false positives, but it does mean that it is not clear which findings are false positives and which are not.  Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing.  In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results.  This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence, and the large number of replication failures means that more replication studies with larger samples are needed to see which hypotheses predict real effects with practical significance.

DOES PSYCHOLOGY HAVE A REPLICATION CRISIS?

Maxwell et al.’s (2015) answer to this question is captured in this sentence: “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis, we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496).  The most important part of this quote is “raising doubts”; the rest is Orwellian double-talk.

The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern.  As I have pointed out, this conclusion is based on misconceptions about the purpose of replication studies and on blissful ignorance of the publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spent time and resources on demonstrating that unbiased replication studies fail.

The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated.  In my opinion, that is not only a crisis but a scandal, because psychologists routinely apply for funding with power analyses that claim 80% power.  The reproducibility project shows that the true power to obtain significant results in original and replication studies is much lower than this, and that the 90% success rate is no more meaningful than 90% of votes for a candidate in communist elections.

In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.”  They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”

As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), (c) the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of its outcome, and (f) the replication study was preregistered.

We can only speculate why the American Psychologist published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail.  Fortunately, APA can no longer control what is published, because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticizing peer-reviewed articles in open post-publication peer review on social media.

Long live the replicability revolution!

REFERENCES

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312. http://dx.doi.org/10.1037/0003-066X.45.12.1304

Maxwell, S.E, Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

My email correspondence with Daryl J. Bem about the data for his 2011 article “Feeling the future”

In 2015, Daryl J. Bem shared the datafiles for the 9 studies reported in the 2011 article “Feeling the Future” with me.  In a blog post, I reported an unexplained decline effect in the data.  In an email exchange with Daryl Bem, I asked for some clarifications about the data, comments on the blog post, and permission to share the data.

Today, Daryl J. Bem granted me permission to share the data.  He declined to comment on the blog post and did not provide an explanation for the decline effect.  He also did not comment on my observation that the article did not mention that “Experiment 5” combined two experiments with N = 50 and that “Experiment 6” combined three datasets with Ns = 91, 19, and 40.  It is highly unusual to combine studies and this practice contradicts Bem’s claim that sample sizes were determined a priori based on power analysis.

Footnote on p. 409: “I set 100 as the minimum number of participants/sessions for each of the experiments reported in this article because most effect sizes (d) reported in the psi literature range between 0.2 and 0.3. If d = 0.25 and N = 100, the power to detect an effect significant at .05 by a one-tail, one-sample t test is .80 (Cohen, 1988).”
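The power claim in this footnote is, in itself, correct; a quick check with the noncentral t distribution (a sketch in Python, scipy assumed to be available):

```python
from scipy import stats

d, n, alpha = 0.25, 100, 0.05
df = n - 1
ncp = d * n ** 0.5                    # noncentrality parameter of a one-sample t-test
t_crit = stats.t.ppf(1 - alpha, df)   # one-tailed critical value
power = 1 - stats.nct.cdf(t_crit, df, ncp)
print(round(power, 2))   # 0.8, as stated in the footnote
```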

The undisclosed concoction of datasets is another questionable research practice that undermines the scientific integrity of significance tests reported in the original article. At a minimum, Bem should issue a correction that explains how the nine datasets were created and what decision rules were used to stop data collection.

I am sharing the datafiles so that other researchers can conduct further analyses of the data.

Datafiles: EXP1   EXP2   EXP3   EXP4   EXP5   EXP6   EXP7   EXP8   EXP9

Below is the complete email correspondence with Daryl J. Bem.

—————————————————————————————————————————————–

To: Daryl J. Bem
From:  Ulrich Schimmack
Sent: Thursday,  January 25, 2018 5:23 PM

Dear Dr. Bem,

I am going to share your comments on the blog.

I find the enthusiasm explanation less plausible than you.  More important, it doesn’t explain the lack of a decline effect in studies with significant results.

I just finished the analysis of the 6 studies with N > 100 by Maier that are also included in the meta-analysis (see Figure below).

Given the lack of a plausible explanation for your data, I think JPSP should retract your article or at least issue an expression of concern because the published results are based on abnormally strong effect sizes in the beginning of each study. Moreover, Study 5 is actually two studies of N = 50 and the pattern is repeated at the beginning of the two datasets.

I also noticed that the meta-analysis included one more study by you, an underpowered study with N = 42 that surprisingly produced yet another significant result.  As I pointed out in my article that you reviewed, this success makes it even more likely that some non-significant (pilot) studies were omitted.  Your success record is simply too good to be true (Francis, 2012).  Have you conducted any other studies since 2012?  A non-significant result is overdue.

Regarding the meta-analysis itself, most of these studies are severely underpowered and there is still evidence for publication bias after excluding your studies.

[Image: Maier.ESP.png]

When I used puniform to control for publication bias and limited the dataset to studies with N > 90 and excluded your studies (as we agree, N < 90 is low power), the p-value was not significant, and even if it were less than .05, it would not be convincing evidence for an effect.  In addition, I computed t-values using the effect size that you assumed in 2011, d = .2, and found significant evidence against the null-hypothesis that the ESP effect size could be as large as d = .2.  This means that even studies with N = 100 are underpowered.   Any serious test of the hypothesis requires much larger sample sizes.

However, the meta-analysis and the existence of ESP are not my concern.  My concern is the way (social) psychologists have conducted research in the past and are responding to the replication crisis.  We need to understand how researchers were able to produce seemingly convincing evidence like your 9 studies in JPSP that are difficult to replicate.  How can original articles have success rates of 90% or more and replications produce only a success rate of 30% or less?  You are well aware that your 2011 article was published with reservations and concerns about the way social psychologists conducted research.   You can make a real contribution to the history of psychology by contributing to the understanding of the research process that led to your results.  This is independent of any future tests of PSI with more rigorous studies.

Best, Dr. Schimmack



To: Ulrich Schimmack
From: Daryl J. Bem
Sent: Thursday, January 25, 2018 4:45 PM

Dear Dr. Schimmack,

You reference Schooler, who has documented the decline effect in several areas—not just in psi research—and has advanced some hypotheses about its possible causes.  The hypothesis that strikes me as most plausible is that it is an experimenter effect whereby experimenters and their assistants begin with high expectations and enthusiasm but begin to get bored after conducting a lot of sessions.  This increasing lack of enthusiasm gets transmitted to the participants during the sessions.  I also refer you to Bob Rosenthal’s extensive work with experimenter effects—which show up even in studies with maze-running rats.

Most of Galak’s sessions were online, thereby diminishing this factor.  Now that I am retired and no longer have a laboratory with access to student assistants and participants, I, too, am shifting to online administration, so it will provide a rough test of this hypothesis.

Were you planning to publish our latest exchange concerning the meta-analysis?  I would not like to leave your blog followers with only your statement that it was “contaminated” by my own studies when, in fact, we did a separate meta-analysis on the non-Bem replications, as I noted in my previous email to you.

Best,
Daryl Bem


 

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Thursday, January 25, 2018 12:05 PM

Dear Dr. Bem,

I now started working on the meta-analysis.
I see another study by you listed (Bem, 2012, N = 42).
Can you please send me the original data for this study?

Best, Dr. Schimmack

 



To: Ulrich Schimmack
From: Daryl J. Bem
Sent: Thursday, January 25, 2018 4:45 PM

Dear Dr. Shimmack,

I was not able to figure out how to leave a comment on your blog post at the website. (I kept being asked to register a site of my own.)  So, I thought I would simply write you a note.  You are free to publish it as my response to your most recent post if you wish.

In reading your posts on my precognitive experiments, I kept puzzling over why you weren’t mentioning the published Meta-analysis of 90 “Feeling the Future” studies that I published in 2015 with Tessoldi, Rabeyron, & Duggan. After all, the first question we typically ask when controversial results are presented is  “Can Independent researchers replicate the effect(s)?”  I finally spotted a fleeting reference to our meta-analysis in one of your posts, in which you simply dismissed it as irrelevant because it included my own experiments, thereby “contaminating” it.

But in the very first Table of our analysis, we presented the results for both the full sample of 90 studies and, separately, for the 69 replications conducted by independent researchers (from 33 laboratories in 14 countries on 10,000 participants).

These 69 (non-Bem-contaminated) independent replications yielded a z score of 4.16, p =1.2 x E-5.  The Bayes Factor was 3.85—generally considered large enough to provide “Substantial Evidence” for the experimental hypothesis.

Of these 69 studies, 31 were exact replications in that the investigators used my computer programs for conducting the experiments, thereby controlling the stimuli, the number of trials, all event timings, and automatic data recording. The data were also encrypted to ensure that no post-experiment manipulations were made on them by the experimenters or their assistants. (My own data were similarly encrypted to prevent my own assistants from altering them.) The remaining 38 “modified” independent replications variously used investigator-designed computer programs, different stimuli, or even automated sessions conducted online.

Both exact and modified replications were statistically significant and did not differ from one another.  Both peer reviewed and non-peer reviewed replications were statistically significant and did not differ from one another. Replications conducted prior to the publication of my own experiments and those conducted after their publication were each statistically significant and did not differ from one another.

We also used the recently introduced p-curve analysis to rule out several kinds of selection bias (file drawer problems), p-hacking, and to estimate “true” effect sizes.
There was no evidence of p-hacking in the database, and the effect size for the non-bem replications was 0.24, somewhat higher than the average effect size of my 11 original experiments (0.22.)  (This is also higher than the mean effect size of 0.21 achieved by Presentiment experiments in which indices of participants’ physiological arousal “precognitively” anticipate the random presentation of an arousing stimulus.)

For various reasons, you may not find our meta-analysis any more persuasive than my original publication, but your website followers might.

Best,
Daryl J.  Bem


From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 20, 2018 6:48 PM

Dear Dr. Bem,

Thank you for your final response.   It answers all of my questions.

I am sorry if you felt bothered by my emails, but I am confident that many psychologists are interested in your answers to my questions.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Saturday, January 20, 2018 5:56 PM

Dear Dr. Schimmack,

I hereby grant you permission to be the conduit for making my data available to those requesting them. Most of the researchers who contributed to our 2015/16 meta-analysis of 90 retroactive “feeling-the-future” experiments have already received the data they required for replicating my experiments.

At the moment, I am planning to follow up our meta-analysis of 90 experiments by setting up pre-registered studies. That seems to me to be the most profitable response to the methodological, statistical, and reporting critiques that have emerged since I conducted my original experiments more than a decade ago.  To respond to your most recent request, I am not planning at this time to write any commentary to your posts.  I am happy to let replications settle the matter.

(One minor point: I did not spend $90,000 to conduct my experiments.  Almost all of the participants in my studies at Cornell were unpaid volunteers taking psychology courses that offered (or required) participation in laboratory experiments.  Nor did I discard failed experiments or make decisions on the basis of the results obtained.)

What I did do was spend a lot of time and effort preparing and discarding early versions of written instructions, stimulus sets and timing procedures.  These were pretested primarily on myself and my graduate assistants, who served repeatedly as pilot subjects. If instructions or procedures were judged to be too time consuming, confusing, or not arousing enough, they were changed before the formal experiments were begun on “real” participants.  Changes were not made on the basis of positive or negative results because we were only testing the procedures on ourselves.

When I did decide to change a formal experiment after I had started it, I reported it explicitly in my article. In several cases I wrote up the new trials as a modified replication of the prior experiment.  That’s why there are more experiments than phenomena in my article:  2 approach/avoidance experiments, 2 priming experiments, 3 habituation experiments, & 2 recall experiments.)

In some cases the literature suggested that some parameters would be systematically related to the dependent variables in nonlinear fashion—e.g., the number of subliminal presentations used in the familiarity-produces-increased liking effect, which has a curvilinear relationship.  In that case, I incorporated the variable as a systematic independent variable. That is also reported in the article.

It took you approximately 3 years to post your responses to my experiments after I sent you the data.  Understandable for a busy scholar.  But a bit unziemlich [unseemly] for you to then send me near-daily reminders the past 3 weeks to respond back to you (as Schumann commands in the first movement of his piano Sonata in G Minor) “so schnell wie möglich!” [“as fast as possible!”]  And then a page later, “Schneller!” [“Faster!”]

Solche Unverschämtheit!   Wenn ich es sage. [Such impudence! If I do say so.]

Daryl J.  Bem
Professor Emeritus of Psychology

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 20, 2018, 1:06 PM

Dear Dr. Bem,

Please let me know by tomorrow how your data should be made public.

I want to post my blog about Study 6 tomorrow. If you want to comment on it before I post it, please do so today.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Monday, January 15, 2018, 10:35 PM

You are correct: Experiment 8, the first Retroactive Recall experiment, was conducted in 2007, and its replication (Experiment 9) was conducted in 2009.

The Avoidance of Negative Stimuli (Study/Experiment 2)  was conducted (and reported as a single experiment with 150 sessions) in 2008.  More later.

Best,
Daryl Bem

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Monday, January 15, 2018, 8:52 PM

Dear Dr. Bem,

Thank you for your table.  I think we are mostly in agreement (sorry if I confused you by calling studies datasets). The numbers are supposed to correspond to the experiment numbers in your table.

The only remaining inconsistency is that the datafile for study 8 shows year 2007, while you have 2008 in your table.

Best, Dr. Schimmack

Study    Sample    Year       N             Experiment
5              1              2002       50           #5: Retroactive Habituation I (Neg only)
5              2              2002       50           #5: Retroactive Habituation I (Neg only)
6              1              2002       91           #6: Retroactive Habituation II (Neg & Erot)
6              2              2002       19           #6: Retroactive Habituation II (Neg & Erot)
6              3              2002       40           #6: Retroactive Habituation II (Neg & Erot)
7              1              2005       200         #7: Retroactive Induction of Boredom
1              1              2006       40           #1: Precognitive Detection of Erotic Stimuli
1              2              2006       60           #1: Precognitive Detection of Erotic Stimuli
2              1              2008       100         #2: Precognitive Avoidance of Negative Stimuli
2              2              2008       50           #2: Precognitive Avoidance of Negative Stimuli
3              1              2007       100         #3: Retroactive Priming I
4              1              2008       100         #4: Retroactive Priming  II
8?           1              2007/08  100         #8: Retroactive Facilitation of Recall I
9              1              2009       50           #9: Retroactive Facilitation of Recall II

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Monday, January 15, 2018, 4:17 PM

Dear Dr. Schimmack,

Here is my analysis of your Table.  I will try to get to the rest of your commentary in the coming week.

Attached Word document:

Dear Dr. Schimmack,

In looking at your table, I wasn’t sure from your numbering of Datasets & Samples which studies corresponded to those reported in my Feeling the Future article.  So I have prepared my own table in the same ordering you have provided and added a column identifying the phenomenon under investigation. (It is on the next page.)

Unless I have made a mistake in identifying them, I find agreement between us on most of the figures.  I have marked in red places where we seem to disagree, which occur on Datasets identified as 3 & 8.  You have listed the dates for both as 2007, whereas my datafiles have 2008 listed for all participant sessions which describe the Precognitive Avoidance experiment and its replication.  Perhaps I have misidentified the two Datasets.  The second discrepancy is that you have listed Dataset 8 as having 100 participants, whereas I ran only 50 sessions with a revised method of selecting the negative stimulus for each trial.  As noted in the article, this did not produce a significant difference in the size of the effect, so I included all 150 sessions in the write-up of that experiment.

I do find it useful to identify the Datasets & Samples with their corresponding titles in the article.  This permits readers to read the method sections along with the table.  Perhaps it will also identify the discrepancy between our Tables.  In particular, I don’t understand the separation in your table between Datasets 8 & 9.  Perhaps you have transposed Datasets 4 & 8.

If so, then Datasets 4 & 9 would each comprise 50 sessions.

More later.

Your Table:

Dataset Sample    Year       N
5              1              2002       50
5              2              2002       50
6              1              2002       91
6              2              2002       19
6              3              2002       40
7              1              2005       200
1              1              2006       40
1              2              2006       60
3              1              2007       100
8              1              2007       100
2              1              2008       100
2              2              2008       50
4              1              2008       100
9              1              2009       50

My Table:

Dataset Sample    Year       N             Experiment
5              1              2002       50           #5: Retroactive Habituation I (Neg only)
5              2              2002       50           #5: Retroactive Habituation I (Neg only)
6              1              2002       91          #6: Retroactive Habituation II (Neg & Erot)
6              2              2002       19           #6: Retroactive Habituation II (Neg & Erot)
6              3              2002       40          #6: Retroactive Habituation II (Neg & Erot)
7              1              2005       200         #7: Retroactive Induction of Boredom
1              1              2006       40           #1: Precognitive Detection of Erotic Stimuli
1              2              2006       60           #1: Precognitive Detection of Erotic Stimuli
3              1              2008       100         #2: Precognitive Avoidance of Negative Stimuli
8?           1              2008       50           #2: Precognitive Avoidance of Negative Stimuli
2              1              2007       100         #3: Retroactive Priming I
2              2              2008       100         #4: Retroactive Priming  II
4?           1              2008       100         #8: Retroactive Facilitation of Recall I
9              1              2009       50           #9: Retroactive Facilitation of Recall II

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Monday, January 15, 2018 10:46 AM

Dear Dr. Bem,

I am sorry to bother you with my requests. It would be helpful if you could let me know whether you are planning to respond to my questions and, if so, when you will be able to do so.

Best regards,
Dr. Ulrich Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 13, 2018 3:53 PM

Dear Dr. Bem,

I put together a table that summarizes when studies were done and how they were combined into datasets.

Please confirm that this is accurate or let me know if there are any mistakes.

Best, Dr. Schimmack

Dataset Sample Year N
5 1 2002 50
5 2 2002 50
6 1 2002 91
6 2 2002 19
6 3 2002 40
7 1 2005 200
1 1 2006 40
1 2 2006 60
3 1 2007 100
8 1 2007 100
2 1 2008 100
2 2 2008 50
4 1 2008 100
9 1 2009 50

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 13, 2018 2:42 PM

Dear Dr. Bem,

I wrote another blog post about Study 6.  If you have any comments about this blog post or the earlier blog post, please let me know.

Also, other researchers are interested in looking at the data and I still need to hear from you how to share the datafiles.

Best, Dr. Schimmack

[Attachment: Draft of Blog Post]

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 7:47 PM

Dear Dr. Bem,

Also, is it ok for me to share your data in public or would you rather post them in public?

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 7:01 PM

Dear Dr. Bem,

Now that my question about Study 6 has been answered, I would like to hear your thoughts about my blog post. How do you explain the decline effect in your data; that is, effect sizes decrease over the course of each experiment, and when two experiments are combined into a single dataset, the decline effect seems to repeat at the beginning of the new study.  Study 6, your earliest study, doesn’t show the effect, but most other studies show this pattern.  As I pointed out on my blog, I think there are two explanations (see also Schooler, 2011): either unpublished studies with negative results were omitted, or measurement of PSI makes the effect disappear.  What would probably be most interesting to know is what you did when you encountered a promising pilot study.  Did you then start collecting new data with this promising procedure, or did you continue collecting data and retain the pilot data?

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Friday, January 12, 2018 2:17 PM

Dear Dr. Schimmack,

You are correct that I calculated all hit rates against a fixed null of 50%.

You are also correct that the first 91 participants (beginning in the Spring semester of 2002) were exposed to 48 trials: 16 Negative images, 16 Erotic images, and 16 Neutral images.

We continued with that same protocol in the Fall semester of 2002 for 41 additional sessions, sessions 51-91.

At this point, it was becoming clear from post-session debriefings of participants that the erotic pictures from the International Affective Picture System (IAPS) were much too mild, especially for male participants.

(Recall that this was chronologically my first experiment and also the first one to use erotic materials.  The observation that mild erotic stimuli are insufficiently arousing, at least for college students, was later confirmed in our 2016 meta-analysis, which found that Wagenmakers’ attempt to replicate my Experiment #1 (Which of two curtains hides an erotic picture?) using only mild erotic pictures was the only replication failure out of 11 replication attempts of that protocol in our database.)  In all my subsequent experiments with erotic materials, I used the stronger images and permitted participants to choose which kind of erotic images (same-sex vs. opposite-sex erotica) they would be seeing.

For this reason, I decided to introduce more explicit erotic pictures into this attempted replication of the habituation protocol.

In particular, Sessions 92-110 (19 sessions) also consisted of 48 trials, but they were divided into 12 Negative trials, 12 highly Erotic trials, & 24 Neutral trials.

Finally, Sessions 111-150 (40 sessions) increased the number of trials to 60:  15 Negative trials, 15 Highly Erotic trials, & 30 Neutral trials.  With the stronger erotic materials, we felt we needed to have relatively more neutral stimuli interspersed with the stronger erotic materials.

Best,
Daryl Bem

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Friday, January 12, 2018 11:08 AM

Dear Dr. Bem,

I conducted further analyses and I figured out why I obtained discrepant results for Study 6.

I computed difference scores with the control condition, but the article reports results for a one-sample t-test of the hit rates against an expected value of 50%.

I also figured out that the first 91 participants were exposed to 16 critical trials and participants 92 to 150 were exposed to 30 critical trials. Can you please confirm this?

Best, Dr. Schimmack
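[The distinction raised in this email can be sketched in a few lines of Python. The hit rates below are hypothetical, not Bem's data; the point is only that a one-sample t-test of hit rates against the 50% chance level and a paired test of difference scores against the control condition are different analyses and can yield different t-values on the same sessions.]

```python
import math
import random

def t_one_sample(xs, mu):
    # t = (mean - mu) / (sd / sqrt(n)); a paired t-test is the same test
    # applied to per-session difference scores with mu = 0.
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return (m - mu) / math.sqrt(var / n)

random.seed(1)
# Hypothetical per-session hit rates (in percent) on critical and control trials.
critical = [random.gauss(51.8, 12.0) for _ in range(150)]
control = [random.gauss(50.5, 12.0) for _ in range(150)]

# Analysis 1: one-sample t-test of critical hit rates against the 50% chance level.
t_vs_chance = t_one_sample(critical, 50.0)

# Analysis 2: paired t-test, i.e. a one-sample test of the
# critical-minus-control difference scores against zero.
t_vs_control = t_one_sample([c - k for c, k in zip(critical, control)], 0.0)

print(round(t_vs_chance, 2), round(t_vs_control, 2))
```

Because the control hit rates add their own sampling noise (and their mean need not sit exactly at 50%), the two statistics generally differ, which is consistent with the discrepancy traced in this exchange.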

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Thursday, January 11, 2018 10:53 PM

I’ll check them tomorrow to see where the problems are.

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 10, 2018 5:41 PM

Dear Dr. Bem,

I just double checked the data you sent me today and they match the data you sent me in 2015.

This means neither of these datasets reproduces the results reported in your 2011 article.

This means your article reported two more significant results (Study 6, Negative and Erotic) than the data support.

This raises further concerns about the credibility of your published results, in addition to the decline effect that I found in your data (except in Study 6, which also produced non-significant results).

Do you still believe that your 2011 studies provided credible information about time-reversed causality, or do you think that you may have capitalized on chance by conducting many pilot studies?

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 10, 2018 5:03 PM

Dear Dr. Bem,

Frequencies of male and female in dataset 5.

> table(bem5$Participant.Sex)

Female   Male
63     37

Article: “One hundred Cornell undergraduates, 63 women and 37 men, were recruited through the Psychology Department’s”

Analysis of dataset 5

One Sample t-test
data:  bem5$N.PC.C.PC[b:e]
t = 2.7234, df = 99, p-value = 0.007639
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.137678 7.245655
sample estimates:
mean of  x
4.191667

Article “t(99) =  2.23, p = .014”

Conclusion:
Gender of participants matches.
t-values do not match, but both are significant.

Frequencies of male and female in dataset 6.

> table(bem6$Participant.Sex)

Female   Male
87     63

Article: Experiment 6: Retroactive Habituation II
One hundred fifty Cornell undergraduates, 87 women and 63 men,

Negative

Paired t-test
data:  bem6$NegHits.PC and bem6$ControlHits.PC
t = 1.4057, df = 149, p-value = 0.1619
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8463098  5.0185321
sample estimates:
mean of the differences
2.086111

Erotic

Paired t-test
data:  bem6$EroticHits.PC and bem6$ControlHits.PC
t = -1.3095, df = 149, p-value = 0.1924
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.2094289  0.8538733
sample estimates:
mean of the differences
-1.677778

Article

Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, 51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041, thereby providing a successful replication of Experiment 5. On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, 48.2%, t(149) = -1.77, p = .039, d = 0.14, binomial z = -1.74, p = .041.

Conclusion:
t-values do not match; the article reports significant results, but the data you shared show non-significant results, although the gender composition matches the article.

I will double check the datafiles that you sent me in 2015 against the one you are sending me now.

Let’s first understand what is going on here before we discuss other issues.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Wednesday, January 10, 2018 4:42 PM

Dear Dr. Schimmack,

Sorry for the delay.  I have been busy re-programming my new experiments so they can be run online, requiring me to relearn the programming language.

The confusion you have experienced arises because the data from Experiments 5 and 6 in my article were split differently for exposition purposes. If you read the report of those two experiments in the article, you will see that Experiment 5 contained 100 participants experiencing only negative (and control) stimuli.  Experiment 6 contained 150 participants who experienced negative, erotic, and control stimuli.

I started Experiment 5 (my first precognitive experiment) in the Spring semester of 2002. I ran the pre-planned 100 sessions, using only negative and control stimuli.  During that period, I was alerted to the 2002 publication by Dijksterhuis & Smith in the journal Emotion, in which they claimed to demonstrate the reverse of the standard “familiarity-promotes-liking” effect, showing that people also adapt to stimuli that are initially very positive and hence become less attractive as the result of multiple exposures.

So after completing my 100 sessions, I used what remained of the Spring semester to design and run a version of my own retroactive experiment that included erotic stimuli in addition to the negative and control stimuli.  I was able to run 50 sessions before the Spring semester ended, and I resumed that extended version of the experiment in the following Fall semester, when student-subjects again became available, until I had a total of 150 sessions of this extended version.  For purposes of analysis and exposition, I then divided the experiments as described in the article:  100 sessions with only negative stimuli and 150 sessions with negative and erotic stimuli.  No subjects or sessions have been added or omitted, just re-assembled to reflect the change in protocol.

I don’t remember how I sent you the original data, so I am attaching a comma-delimited file (which will open automatically in Excel if you simply double or right click it).  It contains all 250 sessions ordered by dates.  The fields provided are:  Session number (numbered from 1 to 250 in chronological order),  the date of the session, the sex of the participant, % of hits on negative stimuli, % of hits on erotic stimuli (which is blank for the 100 subjects in Experiment 5) and % of hits on neutral stimuli.

Let me know if you need additional information.

I hope to get to your blog post soon.

Best,
Daryl Bem

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 6, 2018 11:43 AM

Dear Dr. Bem,

Please reply as soon as possible to my email.  Other researchers are interested in analyzing the data, and if I submit my analyses, some journals want me to provide the data or an explanation of why I cannot share them.  I hope to hear from you by the end of this week.

Best, Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Saturday, January 6, 2018 11:43 AM

Dear Dr. Bem,

Meanwhile I posted a blog post about your 2011 article.  It has been well received by the scientific community.  I would like to encourage you to comment on it.

https://replicationindex.wordpress.com/2018/01/05/why-the-journal-of-personality-and-social-psychology-should-retract-article-doi-10-1037-a0021524-feeling-the-future-experimental-evidence-for-anomalous-retroactive-influences-on-cognition-a/

Best,
Dr. Schimmack

—————————————————————————————————————————————–

From: Ulrich Schimmack
To: Daryl J. Bem
Sent: Wednesday, January 3, 2018 4:12 PM

Dear Dr. Bem,

I am finally writing up the results of my reanalyses of your ESP studies.

I encountered one problem with the data for Study 6.

I cannot reproduce the test results reported in the article.

The article :

Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, 51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041, thereby providing a successful replication of Experiment 5. On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, 48.2%, t(149) = -1.77, p = .039, d = 0.14, binomial z = -1.74, p = .041.

I obtain

(negative)
t = 1.4057, df = 149, p-value = 0.1619

(erotic)
t = -1.3095, df = 149, p-value = 0.1924

Also, I wonder why the first 100 cases often produce decimals of .25 and the last 50 cases produce decimals of .33.

It would be nice if you could look into this and let me know what could explain the discrepancy.

Best,
Uli Schimmack

—————————————————————————————————————————————–

From: Daryl J. Bem
To: Ulrich Schimmack
Sent: Wednesday, February 25, 2015 2:47 AM

Dear Dr. Schimmack,

Attached is a folder of the data from my nine “Feeling the Future” experiments.  The files are plain text files, one line for each session, with variables separated by tabs.  The first line of each file is the list of variable names, also separated by tabs. I have omitted participants’ names but supplied their sex and age.

You should consult my 2011 article for the descriptions and definitions of the dependent variables for each experiment.

Most of the files contain the following variables: Session#, Date, StartTime, Session Length, Participant’s Sex, Participant’s Age, Experimenter’s Sex,  [the main dependent variable or variables], Stimulus Seeking score (from 1 to 5).

For the priming experiments (#3 & #4), the dependent variables are LnRT Forward and LnRT Retro, where Ln is the natural log of Response Times. As described in my 2011 publication, each response time (RT) is transformed by taking the natural log before being entered into calculations.  The software subtracts the mean transformed RT for congruent trials from the mean Transformed RT for incongruent trials, so positive values of LnRT indicate that the person took longer to respond to incongruent trials than to congruent trials.  Forward refers to the standard version of affective priming and Retro refers to the time-reversed version.  In the article, I show the results for both the Ln transformation and the inverse transformation (1/RT) for two different outlier definitions.  In the attached files, I provide the results using the Ln transformation and the definition of a too-long RT outlier as 2500 ms.

Subjects who made too many errors (> 25%) in judging the valence of the target picture were discarded. Thus, 3 subjects were discarded from Experiment #3 (hence N = 97) and 1 subject was discarded from Experiment #4 (hence N  = 99).  Their data do not appear in the attached files.
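[The LnRT score Bem describes above can be sketched as follows. This is an illustrative reconstruction from the email's description, not Bem's actual software: response times over the 2500 ms outlier cutoff are dropped, each remaining RT is ln-transformed, and the mean transformed RT for congruent trials is subtracted from the mean transformed RT for incongruent trials. The example data are hypothetical.]

```python
import math

def ln_rt_score(congruent_rts, incongruent_rts, cutoff_ms=2500):
    # Drop too-long outliers, ln-transform each response time, then take
    # mean(ln incongruent) - mean(ln congruent). Positive values indicate
    # the person took longer to respond on incongruent trials.
    def mean_ln(rts):
        kept = [math.log(rt) for rt in rts if rt <= cutoff_ms]
        return sum(kept) / len(kept)
    return mean_ln(incongruent_rts) - mean_ln(congruent_rts)

# Hypothetical per-trial response times (ms) for one participant;
# the 3100 ms trial exceeds the cutoff and is discarded.
congruent = [540, 610, 580, 700, 3100]
incongruent = [590, 650, 640, 760]

print(round(ln_rt_score(congruent, incongruent), 4))
```

The same function would compute both the Forward and the Retro score, since the two versions differ only in which trials feed into it.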

Note that the habituation experiment #5 used only negative and control (neutral) stimuli.

Habituation experiment #6 used Negative, erotic, and Control (neutral) stimuli.

Retro Boredom experiment #7 used only neutral stimuli.

In Experiment #8, the first  Retro Recall, the first 100 sessions are experimental sessions.  The last 25 sessions are no-practice control sessions.  The type of session is the second variable listed.

In Experiment #9, the first 50 sessions are the experimental sessions and the last 25 are no-practice control sessions.   Be sure to exclude the control sessions when analyzing the main experimental sessions. The summary measure of psi performance is the Precog% Score (DR%) whose definition you will find on page 419 of my article.

Let me know if you encounter any problems or want additional data.

Sincerely,
Daryl J.  Bem
Professor Emeritus of Psychology

—————————————————————————————————————————————–

A List of R-Index Blogs

In reverse chronological order, from January 6, 2018 back to November 30, 2014

(statistics recorded on January 6)

Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem

A day ago

‘Before you know it’ by John A. Bargh: A quantitative book review

Nov 28, 2017 12:41 PM

(Preprint) Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies (Schimmack & Brunner, 2017)

Nov 16, 2017 3:46 PM

Preliminary 2017 Replicability Rankings of 104 Psychology Journals

Oct 24, 2017 2:56 PM

P-REP (2005-2009): Reexamining the experiment to replace p-values with the probability of replicating an effect

Sep 19, 2017 12:01 PM

The Power of the Pen Paradigm: A Replicability Analysis

Sep 4, 2017 8:48 PM

What would Cohen say? A comment on p < .005

Aug 2, 2017 11:46 PM

How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

May 15, 2017 9:24 PM

How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press). 

May 4, 2017 8:41 PM

Hidden Figures: Replication Failures in the Stereotype Threat Literature

Apr 7, 2017 10:18 AM


Personalized Adjustment of p-values for publication bias

Mar 13, 2017 2:31 PM

Meta-Psychology: A new discipline and a new journal (draft)

Mar 5, 2017 2:35 PM

2016 Replicability Rankings of 103 Psychology Journals

Mar 1, 2017 12:01 AM

An Attempt at Explaining Null-Hypothesis Testing and Statistical Power with 1 Figure and 1,500 Words

Feb 26, 2017 10:35 AM

Random measurement error and the replication crisis: A statistical analysis

Feb 23, 2017 7:38 AM

How Selection for Significance Influences Observed Power

Feb 21, 2017 10:38 AM

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

Feb 2, 2017 11:33 AM

Are Most Published Results in Psychology False? An Empirical Study

Jan 15, 2017 1:42 PM

Reexamining Cunningham, Preacher, and Banaji’s Multi-Method Model of Racism Measures

Jan 8, 2017 7:50 PM

Validity of the Implicit Association Test as a Measure of Implicit Attitudes

Jan 5, 2017 5:20 PM

Replicability Review of 2016

Dec 31, 2016 5:25 PM

Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Dec 12, 2016 5:35 PM

How did Diedrik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Dec 6, 2016 12:27 PM

A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

Dec 3, 2016 9:31 PM

A replicability analysis of “I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dec 3, 2016 3:15 PM

Bayesian Meta-Analysis: The Wrong Way and The Right Way

Nov 28, 2016 3:25 PM

Peer-Reviews from Psychological Methods

Nov 18, 2016 6:33 PM

How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Sep 17, 2016 5:44 PM

A Critical Review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”

Sep 16, 2016 11:20 AM

Dr. R responds to Finkel, Eastwick, & Reis (FER)’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

Sep 13, 2016 10:10 PM

The decline effect in social psychology: Evidence and possible explanations

Sep 4, 2016 9:26 AM

Fritz Strack’s self-serving biases in his personal account of the failure to replicate his most famous study.

Aug 29, 2016 12:10 PM

How Can We Interpret Inferences with Bayesian Hypothesis Tests?

Aug 9, 2016 4:16 PM

Bayes Ratios: A Principled Approach to Bayesian Hypothesis Testing

Jul 25, 2016 12:51 PM

How Does Uncertainty about Population Effect Sizes Influence the Probability that the Null-Hypothesis is True?

Jul 16, 2016 11:53 AM

Subjective Bayesian T-Test Code

Jul 5, 2016 6:58 PM

Wagenmakers’ Default Prior is Inconsistent with the Observed Results in Psychological Research

Jun 30, 2016 10:15 AM

A comparison of The Test of Excessive Significance and the Incredibility Index

Jun 18, 2016 11:52 AM

R-Code for (Simplified) Powergraphs with StatCheck Dataset

Jun 17, 2016 6:32 PM

Replicability Report No. 2: Do Mating Primes have replicable effects on behavior?

May 21, 2016 9:37 AM

Subjective Priors: Putting Bayes into Bayes-Factors

May 18, 2016 12:22 PM

Who is Your Daddy? Priming women with a disengaged father increases their willingness to have sex without a condom

May 11, 2016 4:15 PM

Bayes-Factors Do Not Solve the Credibility Problem in Psychology

May 9, 2016 6:03 PM

Die Verdrängung des selektiven Publizierens: 7 Fallstudien von prominenten Sozialpsychologen [The Repression of Selective Publishing: 7 Case Studies of Prominent Social Psychologists]

Apr 20, 2016 10:05 PM

Replicability Report No. 1: Is Ego-Depletion a Replicable Effect?

Apr 18, 2016 7:28 PM

Open Ego-Depletion Replication Initiative

Mar 26, 2016 10:10 AM

Estimating Replicability of Psychological Science: 35% or 50%

Mar 12, 2016 11:56 AM

MY JOURNEY TOWARDS ESTIMATION OF REPLICABILITY OF PSYCHOLOGICAL RESEARCH

Mar 4, 2016 10:11 AM

Replicability Ranking of Psychology Departments

Mar 2, 2016 5:01 PM

Reported Success Rates, Actual Success Rates, and Publication Bias In Psychology: Honoring Sterling et al. (1995)

Feb 26, 2016 5:50 PM

Are You Planning a 10-Study Article? You May Want to Read This First

Feb 10, 2016 5:24 PM

Dr. R Expresses Concerns about Results in Latest Psychological Science Article by Yaacov Trope and colleagues

Feb 9, 2016 11:41 AM

Dr. R’s Blog about Replicability

Feb 5, 2016 5:32 PM

A Scathing Review of “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science”

Feb 3, 2016 10:32 AM

Keep your Distance from Questionable Results

Jan 31, 2016 5:05 PM

Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, (21)7, p.1014-1020

Jan 31, 2016 5:01 PM

A Revised Introduction to the R-Index

Jan 31, 2016 3:50 PM

2015 Replicability Ranking of 100+ Psychology Journals

Jan 26, 2016 12:17 PM

Is the N-pact Factor (NF) a Reasonable Proxy for Statistical Power and Should the NF be used to Rank Journals’ Reputation and Replicability? A Critical Review of Fraley and Vazire (2014)

Jan 16, 2016 7:56 PM

On the Definition of Statistical Power

Jan 14, 2016 3:36 PM

The Abuse of Hoenig and Heisey: A Justification of Power Calculations with Observed Effect Sizes

Jan 14, 2016 2:06 PM

Do Deceptive Reporting Practices in Social Psychology Harm Social Psychology?

Jan 13, 2016 12:24 PM

“Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerd Gigerenzer

Jan 12, 2016 1:51 PM

Distinguishing Questionable Research Practices from Publication Bias

Dec 8, 2015 6:00 PM

2015 Replicability Ranking of 54 Psychology Journals

Oct 27, 2015 7:05 PM

Dr. R’s comment on the Official Statement by the Board of the German Psychological Association (DGPs) about the Results of the OSF-Reproducibility Project published in Science.

Oct 10, 2015 12:01 PM

Replicability Ranking of 27 Psychology Journals (2015)

Sep 28, 2015 7:42 AM

Replicability Report for JOURNAL OF SOCIAL AND PERSONAL RELATIONSHIPS

Sep 28, 2015 7:02 AM

Replicability Report for PERSONAL RELATIONSHIPS

Sep 27, 2015 5:40 PM

Replicability Report for DEVELOPMENTAL SCIENCE

Sep 27, 2015 3:16 PM

Replicability Report for JOURNAL OF POSITIVE PSYCHOLOGY

Sep 26, 2015 8:08 PM

Replicability-Report for CHILD DEVELOPMENT

Sep 26, 2015 10:37 AM

Replicability-Report for PSYCHOLOGY & AGING

Sep 25, 2015 5:47 PM

Replicability-Report for JOURNAL OF EXPERIMENTAL PSYCHOLOGY: HUMAN PERCEPTION AND PERFORMANCE

Sep 25, 2015 12:33 PM

“THE STATISTICAL POWER OF ABNORMAL-SOCIAL PSYCHOLOGICAL RESEARCH: A REVIEW” BY JACOB COHEN

Sep 22, 2015 7:11 AM

Replicability Report for the BRITISH JOURNAL OF SOCIAL PSYCHOLOGY

Sep 18, 2015 6:40 PM

Replicability-Ranking of 100 Social Psychology Departments

Sep 15, 2015 12:49 PM

Replicability Report for the journal SOCIAL PSYCHOLOGY

Sep 13, 2015 4:54 PM

Replicability-Report for the journal JUDGMENT AND DECISION MAKING

Sep 13, 2015 1:08 PM

Replicability-Report for JOURNAL OF CROSS-CULTURAL PSYCHOLOGY

Sep 12, 2015 12:32 PM

Replicability-Report for SOCIAL COGNITION

Sep 11, 2015 6:43 PM

Examining the Replicability of 66,212 Published Results in Social Psychology: A Post-Hoc-Power Analysis Informed by the Actual Success Rate in the OSF-Reproducibilty Project

Sep 7, 2015 9:20 AM

The Replicability of Cognitive Psychology in the OSF-Reproducibility-Project

Sep 5, 2015 2:47 PM

The Replicability of Social Psychology in the OSF-Reproducibility Project

Sep 3, 2015 8:18 PM

Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommending a 4-Sigma Rule

Aug 30, 2015 11:04 AM

Predictions about Replication Success in OSF-Reproducibility Project

Aug 26, 2015 9:54 PM

Replicability-Report for EUROPEAN JOURNAL OF SOCIAL PSYCHOLOGY

Aug 22, 2015 8:00 PM

Replicability-Report for JOURNAL OF EXPERIMENTAL SOCIAL PSYCHOLOGY

Aug 22, 2015 5:30 PM

Replicability-Report for JOURNAL OF MEMORY AND LANGUAGE

Aug 22, 2015 3:22 PM

Replicability-Report for JOURNAL OF EXPERIMENTAL PSYCHOLOGY: GENERAL

Aug 21, 2015 5:52 PM

Replicability-Report for DEVELOPMENTAL PSYCHOLOGY

Aug 21, 2015 4:36 PM

Replicability-Report for JOURNAL OF EXPERIMENTAL PSYCHOLOGY: LEARNING, MEMORY, AND COGNITION

Aug 19, 2015 10:21 AM

Replicability-Report for COGNITIVE PSYCHOLOGY

Aug 18, 2015 10:03 AM

Replicability-Report for COGNITION & EMOTION

Aug 18, 2015 8:37 AM

Replicability-Report for SOCIAL PSYCHOLOGY AND PERSONALITY SCIENCE

Aug 18, 2015 7:35 AM

Replicability-Report for EMOTION

Aug 17, 2015 7:53 PM

Replicability Report for PERSONALITY AND SOCIAL PSYCHOLOGY BULLETIN

Aug 17, 2015 9:59 AM

Replicability-Report for JPSP: INTERPERSONAL RELATIONSHIPS & GROUP PROCESSES

Aug 16, 2015 9:51 AM

Replicability-Report for JPSP: ATTITUDES & SOCIAL COGNITION

Aug 16, 2015 6:12 AM

Replicability-Report for JPSP: Personality Processes and Individual Differences

Aug 15, 2015 8:30 PM

Replicability Report for PSYCHOLOGICAL SCIENCE

Aug 15, 2015 5:51 PM

REPLICABILITY RANKING OF 26 PSYCHOLOGY JOURNALS

Aug 13, 2015 4:16 PM

Using the R-index to detect questionable research practices in SSRI studies

Aug 5, 2015 2:34 PM

R-Index predicts lower replicability of “subliminal” studies than “attribution” studies in JESP

Jul 7, 2015 9:37 PM

Post-Hoc-Power Curves of Social Psychology in Psychological Science, JESP, and Social Cognition

Jul 7, 2015 7:07 PM

Post-Hoc Power Curves: Estimating the typical power of statistical tests (t, F) in Psychological Science and Journal of Experimental Social Psychology

Jun 27, 2015 10:48 PM

When Exact Replications Are Too Exact: The Lucky-Bounce-Test for Pairs of Exact Replication Studies

May 27, 2015 11:30 AM

The Association for Psychological Science Improves Success Rate from 95% to 100% by Dropping Hypothesis Testing: The Sample Mean is the Sample Mean, Type-I Error 0%

May 21, 2015 1:40 PM

R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

May 19, 2015 5:43 PM

A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

May 18, 2015 6:41 PM

Power Analysis for Bayes-Factor: What is the Probability that a Study Produces an Informative Bayes-Factor?

May 16, 2015 4:41 PM

A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

May 16, 2015 7:51 AM

The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

May 13, 2015 12:40 PM

Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior

May 9, 2015 10:43 AM

Replacing p-values with Bayes-Factors: A Miracle Cure for the Replicability Crisis in Psychological Science

Apr 30, 2015 1:48 PM

Further reflections on the linearity in Dr. Förster’s Data

Apr 21, 2015 7:57 AM

The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)

Apr 20, 2015 6:04 PM

Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

Apr 9, 2015 12:26 PM

Meta-Analysis of Observed Power: Comparison of Estimation Methods

Apr 1, 2015 9:49 PM

An Introduction to Observed Power based on Yuan and Maxwell (2005)

Mar 24, 2015 7:24 PM

R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

Feb 6, 2015 11:20 AM

Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

Feb 2, 2015 10:45 AM

Questionable Research Practices: Definition, Detect, and Recommendations for Better Practices

Jan 24, 2015 8:40 AM

Further reflections on the linearity in Dr. Förster’s Data

Jan 14, 2015 5:47 PM

Why are Stereotype-Threat Effects on Women’s Math Performance Difficult to Replicate?

Jan 6, 2015 12:01 PM

How Power Analysis Could Have Prevented the Sad Story of Dr. Förster

Jan 2, 2015 12:22 PM

A Playful Way to Learn about Power, Publication Bias, and the R-Index: Simulate questionable research methods and see what happens.

Dec 31, 2014 3:25 PM

The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

Dec 30, 2014 10:22 PM

Christmas Special: R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility”

Dec 24, 2014 1:54 PM

The R-Index of Ego-Depletion Studies with the Handgrip Paradigm

Dec 21, 2014 2:21 PM

The R-Index of Nicotine-Replacement-Therapy Studies: An Alternative Approach to Meta-Regression

Dec 17, 2014 3:52 PM

The R-Index of Simmons et al.’s 21 Word Solution

Dec 17, 2014 7:14 AM

The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)

Dec 13, 2014 10:42 AM

Do it yourself: R-Index Spreadsheet and Manual is now available.

Dec 7, 2014 11:01 AM

Nature Neuroscience: R-Index

Dec 5, 2014 11:44 AM

Dr. Schnall’s R-Index

Dec 4, 2014 11:57 AM

Roy Baumeister’s R-Index

Dec 1, 2014 6:29 PM

The Replicability-Index (R-Index): Quantifying Research Integrity

Nov 30, 2014 10:19 PM

Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem

Added January 30, 2018: A formal letter to the editor of JPSP, calling for a retraction of the article (Letter).

“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)

In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.

The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results.  “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results. ” (Bem, 2011, p. 407).

The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.

In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.

The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt on Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem, which was based on over 1,000 participants. A more informative replication attempt was made by Galak et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found that “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not” (p. 941).

Galak et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value if the reported data and analyses were not influenced by the result. If, however, data or analyses were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis, if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntarily (other-deception). Both self-deception and other-deception introduce bias into the scientific record.

Replication studies cannot provide evidence about bias in original studies. A replication study only tells us that other scientists were unable to replicate original findings, but they do not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.

I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely.  Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.

1. Luck

The simplest and most parsimonious explanation for a stunning original result that cannot be replicated is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.

If Bem or other scientists encountered one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies would not replicate the effect.

Attributing Bem’s results to luck would be possible if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt. The probability of this event is very small. To be exact, it is 0.000000000019, or 1 out of 53,612,565,445.
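This number follows directly from the binomial distribution. A quick sketch (assuming independent studies and the standard 5% criterion) reproduces it:

```python
from math import comb

p = 0.05      # significance criterion per study
n, k = 10, 9  # at least 9 significant results in 10 attempts

# Binomial tail: P(X >= 9) = C(10,9) * p^9 * (1-p) + p^10
prob = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
print(prob)             # ~1.9e-11
print(round(1 / prob))  # 1 out of ~54 billion
```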

Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to conclude that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is, evidence that strengthens beliefs in a phenomenon that does not actually exist.

2. Questionable Research Practices

The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies. This finding suggests that questionable research practices were used, but it does not reveal which practices were actually used. John et al. listed a number of questionable research practices that might explain Bem’s findings.

2.1. Multiple Dependent Variables

One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.

2.2. Failure to report all conditions

This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.

2.3 Generous Rounding

Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking, a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test statistics or by redoing the statistical analysis from the original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a single p-value falls into the range between .05 and .055, let alone that this could happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.

2.4 HARKing

Hypothesizing after results are known (Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except in Study 1. The studies built on a meta-analysis of prior studies, and nobody has questioned the paradigms Bem used to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing also cannot explain Bem’s findings.

2.5 Exclusion of Data

Sometimes non-significant results are caused by an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023 into a non-significant result. Across all studies, only 4 participants out of 1,075 were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.

2.6 Stopping Data Collection Early

Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.
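Bem’s power figures can be checked with a normal-approximation sketch. The planning effect size of d = 0.25 and the one-tailed 5% criterion are assumptions based on Bem’s stated design; the approximation agrees with the exact t-test to within about a percentage point:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_one_sample(d, n, z_crit=1.645):
    """Approximate power of a one-sample, one-tailed test at alpha = .05."""
    return phi(d * sqrt(n) - z_crit)

print(round(power_one_sample(0.25, 100), 2))  # ~0.80 with N = 100
print(round(power_one_sample(0.25, 50), 2))   # ~0.55 with N = 50 (exact t-test: ~54%)
```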

In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester, and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, and Bem simply got lucky that the results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.

2.7 Optional Stopping/Snooping

Collecting more data when the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). How much the risk increases depends on how often researchers check their results. Optional stopping leaves three signatures. First, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes only after a long time. Second, sample size will be negatively correlated with observed effect sizes, because larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion because data collection is terminated as soon as p-values reach the criterion value.

The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).

Despite these warning signs, optional stopping cannot explain Bem’s results, if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to produce significant results. In a simulation study, with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.
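The inflation of the false positive rate under optional stopping is easy to demonstrate by simulation. The sketch below is not Schimmack’s original simulation; the checking interval of 10 participants is an assumption, and different peeking schedules give different inflation rates, but any schedule pushes the rate well above the nominal 5%:

```python
import random
from math import erf, sqrt
from statistics import mean, stdev

def one_tailed_p(xs):
    """One-tailed p-value for a one-sample test against mu = 0 (normal approximation)."""
    z = mean(xs) / (stdev(xs) / sqrt(len(xs)))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

def optional_stopping(min_n=50, max_n=200, step=10, alpha=0.05):
    """Run a null study (true d = 0) with repeated significance checks."""
    data = [random.gauss(0, 1) for _ in range(min_n)]
    while True:
        if one_tailed_p(data) < alpha:
            return True   # 'significant' by chance; stop and report
        if len(data) >= max_n:
            return False  # give up at the maximum sample size
        data.extend(random.gauss(0, 1) for _ in range(step))

random.seed(1)
rate = sum(optional_stopping() for _ in range(2000)) / 2000
print(rate)  # well above the nominal .05, yet far below a guaranteed success
```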

2.8 Selective Reporting

The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Franco et al., 2014).

Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.
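The 180-studies figure is the expected value of a negative binomial distribution (9 successes / .05 success probability); a small simulation sketch confirms the arithmetic:

```python
import random

def attempts_until_k_hits(k=9, alpha=0.05):
    """Count null studies run until k of them reach p < alpha by chance alone."""
    attempts = hits = 0
    while hits < k:
        attempts += 1
        hits += random.random() < alpha  # each null study 'succeeds' with probability .05
    return attempts

random.seed(0)
mean_attempts = sum(attempts_until_k_hits() for _ in range(1000)) / 1000
print(round(mean_attempts))  # ~180 studies, i.e. ~18,000 participants at N = 100
```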

Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.

As a strong believer in ESP, Bem may have paid $90,000 to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology:

“The integrity of the scientific enterprise requires the reporting of disconfirming results.”

2.9 Conclusion

In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.

3. The Decline Effect and a New Questionable Research Practice

When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased.  This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).

Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.

Figure 1
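Curves like those in Figure 1 can be reproduced by computing the cumulative effect size of a simulated study. A minimal sketch, assuming a one-sample design with a true effect of d = .20:

```python
import random
from statistics import mean, stdev

def cumulative_d(data, start=5):
    """Cumulative Cohen's d (mean / SD) after each additional participant."""
    return [mean(data[:n]) / stdev(data[:n]) for n in range(start, len(data) + 1)]

random.seed(42)
study = [random.gauss(0.2, 1) for _ in range(100)]  # true d = .20
curve = cumulative_d(study)
# Early estimates bounce around wildly; by n = 100 the estimate settles near .20
print(round(curve[0], 2), round(curve[-1], 2))
```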

 

The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.

Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants none of the 10 tests is statistically significant. A meta-analysis across the 10 studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1.  P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.

EXPERIMENT   Participants 1-50   Participants 51-100
EXP1         p = .004            p = .194
EXP2         p = .096            p = .170
EXP3         p = .039            p = .100
EXP4         p = .033            p = .067
EXP5         p = .013            p = .069
EXP6a        p = .412            p = .126
EXP5b        p = .023            p = .410
EXP7         p = .020            p = .338
EXP8         p = .010            p = .318
EXP9         p = .003            NA
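The TIVA result reported above can be reproduced from the second column of Table 1. The test converts p-values to z-scores; for unbiased tests the z-scores have a variance of approximately 1, and (k-1)*Var(z) follows a chi-square distribution with k-1 degrees of freedom, so a significantly smaller variance indicates selection for significance. A sketch, assuming one-tailed p-values as in the table:

```python
from math import exp, factorial
from statistics import NormalDist, variance

# One-tailed p-values for participants 51-100 (Table 1; EXP9 has no second half)
p_values = [.194, .170, .100, .067, .069, .126, .410, .338, .318]

z = [NormalDist().inv_cdf(1 - p) for p in p_values]  # convert p-values to z-scores
var_z = variance(z)  # expected to be ~1 for an unbiased set of tests
df = len(z) - 1      # df = 8
chi2 = df * var_z    # test statistic

# Left-tail chi-square p-value (closed form for even df)
p_tiva = 1 - exp(-chi2 / 2) * sum((chi2 / 2)**i / factorial(i) for i in range(df // 2))

print(round(var_z, 2), round(p_tiva, 3))  # Var(z) = 0.22, p = .013
```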

 

There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, the researcher continues to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated the researcher to collect more data is sufficient to maintain statistical significance, because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure, and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).

In this scenario, researchers deceive themselves into believing that slight experimental manipulations have huge effects on their dependent variable, because sampling error in small samples is very large. Observed effect sizes in small samples can range from -1 to 1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the belief that “the success of replications in psychological research often depends on subtle and unknown factors” (p. 422). How could Bem reconcile this belief with the reporting of 9 out of 10 successes? The most plausible explanation is that the successes are a selected set of findings out of many attempts that were not reported.

There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection.  In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.

“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”

Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be reported if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, the actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results of a few trials and that the significant results are largely driven by inflated effect sizes in these promising initial trials. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they did not peek at the data and did not terminate studies when the results were not promising.

Additional hints come from an interview with Engber (2017).

“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.

In sum, a plausible explanation for Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result and then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practice capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.

Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one.  Schooler (2011) proposed that something more intriguing may cause decline effects.

“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.” 

Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these phenomena is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present finding of decline effects within Bem’s experiments a plausible explanation for replication failures. If a researcher collects too much data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effects may not be observable all the time. Thus, sometimes the initial trials will also not show the effect.

I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.

Discussion

Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.

The past seven years have also revealed that Bem’s article is not an exception. The only difference is that his results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because their claims were consistent with researchers’ a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).

The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality. The main problem appears to be the practice of peeking at results in small samples with different procedures and attributing differences in results to the experimental procedures, while ignoring the influence of sampling error.

Conceptual Replications and Hidden Moderators

In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At the very least, exact replications should succeed more often than they fail (Tversky & Kahneman, 1971).

Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).

The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and a failure in the others are called hidden moderators.

Social psychologists have convinced themselves that many of the phenomena they study are sensitive to minute changes in experimental protocols (Bem, 2011). This conviction sustains belief in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.

The sensitivity of social psychological effects to small changes in experimental procedures also seems to justify conducting many studies that are expected to fail, just like medieval alchemists expected many failed attempts to make gold. These failures are not considered important; they are simply needed to find the conditions that produce the desired outcome: a significant result that supports the researchers' predictions.

The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in search of a predicted outcome without realizing that a few successes among many failures are expected by chance alone. To avoid confronting the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologists to deceive themselves and others for decades.

Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.

P-Hacking

Another important article for the replication crisis in psychology examined the probability that questionable research practices produce false positive results (Simmons, Nelson, & Simonsohn, 2011). The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used. The authors find that a single questionable practice can increase the chance of obtaining a false positive result from the nominal 5% to 12.6%. A combination of four questionable research practices increased the risk to 60.7%. The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky, but it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results (.607^9 ≈ 1%).
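The logic of these simulations can be sketched with a short script. The code below is my own illustration, not Simmons et al.'s actual code: it simulates a single questionable practice (optional stopping: test 20 participants per group, and if the result is not significant, add 10 more per group and test again) under a true null effect, and it also shows why a run of nine p-hacked successes remains improbable even at a 60.7% per-study success rate.

```python
import math
import random

def two_sample_z_p(x, y):
    """Two-sided p-value for a two-sample z-test with known sd = 1."""
    se = math.sqrt(1 / len(x) + 1 / len(y))
    z = (sum(x) / len(x) - sum(y) / len(y)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def optional_stopping_sim(n_sims=10_000, seed=1):
    """Simulate one QRP under the null: test at n = 20 per group and,
    if p > .05, add 10 observations per group and test again."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(20)]
        y = [rng.gauss(0, 1) for _ in range(20)]
        if two_sample_z_p(x, y) < .05:
            hits += 1
            continue
        x += [rng.gauss(0, 1) for _ in range(10)]
        y += [rng.gauss(0, 1) for _ in range(10)]
        if two_sample_z_p(x, y) < .05:
            hits += 1
    return hits / n_sims

print(round(optional_stopping_sim(), 3))  # noticeably above the nominal .05
print(round(0.607 ** 9, 3))               # 0.011: chance of 9 p-hacked successes in a row
```

Even this single, seemingly innocent practice inflates the false-positive rate by roughly half again; stacking several such practices produces the 60.7% figure reported by Simmons et al.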

The analysis of Bem's data suggests that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (e.g., time-reversed causality works with erotic pictures, but not with pictures of puppies). If the moderator is not known, it is still important to report the non-significant results to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.

To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).

Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct.  Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.

Should the Journal of Personality and Social Psychology Retract Bem’s Article?

Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted. This seems excessive.

However, Bem's article is unique in many ways, and the new analyses of the original data presented here suggest that bad research practices, inadvertently or not, produced Bem's results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem (2011) as erroneous. Unless the article is retracted, uninformed researchers will continue to cite it as evidence for supernatural phenomena like time-reversed causality.

“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).

Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator that could explain the inconsistent findings and requires further investigation. A retraction would make it clear that there are no inconsistent findings because Bem's findings do not provide credible evidence for the effect. Thus, it is unnecessary and maybe unethical to recruit human participants for further replication studies of Bem's paradigms.

This does not mean that future research on paranormal phenomena should be banned. However, future studies cannot be built on Bem's paradigms or results. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem's effect sizes, but those effect sizes are not trustworthy and cannot be used to plan future studies.

A main objection to retractions is that Bem's study made an inadvertent but important contribution to the history of social psychology that triggered a method revolution and changes in the way social psychologists conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses, because the bias in Bem's studies also biases all meta-analytic findings that include them (Bem, Tressoldi, Rabeyron, & Duggan, 2015).

[edited January, 8, 2018]
It is not clear how Bem (2011) thinks about his article these days, but one quote in Engber's article suggests that Bem realizes now that he provided false evidence for a phenomenon that does not exist.

When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”

If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.

REFERENCES

Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Bem, D.J., Tressoldi, P., Rabeyron, T. & Duggan, M. (2015) Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events, F1000 Research, 4, 1–33.

Engber, D. (2017). Daryl Bem proved ESP Is real: Which means science is broken. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html

Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9

Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 502–505. doi:10.1126/science.1255484

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103, 933–948. doi:10.1037/a0029709

John, L. K. Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532. DOI: 10.1177/0956797611430953

RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/

Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS ONE, 7(3), e33423.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2015). The Test of Insufficient Variance: A New Tool for the Detection of Questionable Research Practices. https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24–27. doi:10.3758/BF03192746

Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137

Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

 

‘Before you know it’ by John A. Bargh: A quantitative book review

November 28, Open Draft/Preprint (Version 1.0)
[Please provide comments and suggestions]

In this blog post I present a quantitative review of John A. Bargh’s book “Before you know it: The unconscious reasons we do what we do.” A quantitative book review is different from a traditional book review. The goal of a quantitative review is to examine the strength of the scientific evidence that is provided to support the ideas in the book. Readers of a popular science book written by an eminent scientist expect that these ideas are based on solid scientific evidence. However, the strength of scientific evidence in psychology, especially social psychology, has been questioned. I use statistical methods to examine how strong the evidence actually is.

One problem in psychological publishing is a bias in favor of studies that support theories, so-called publication bias. The reason for publication bias is that scientific journals can publish only a fraction of the results that scientists produce. This leads to heavy competition among scientists to produce publishable results, and journals like to publish statistically significant results; that is, studies that provide evidence for an effect (e.g., “eating green jelly beans cures cancer” rather than “eating red jelly beans does not cure cancer”). Statisticians have pointed out that publication bias undermines the meaning of statistical significance, just like counting only the at-bats with hits would undermine the meaning of batting averages: everybody would have an incredible average of 1.000.
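The batting-average problem is easy to demonstrate by simulation. The sketch below is my own illustration with made-up parameters: many small two-group studies of a modest true effect are run, and only the significant ones are "published." The published record then shows a 100% success rate by construction, even though most studies failed, and the published effect sizes look much larger than the true effect.

```python
import math
import random

def published_record(true_d=0.2, n=20, n_studies=5000, seed=7):
    """Run n_studies two-group studies (true effect true_d, sd = 1) and
    keep only those that reach p < .05, as a biased literature would.
    Returns (true success rate, mean published effect size)."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        diff = (sum(rng.gauss(true_d, 1) for _ in range(n)) / n
                - sum(rng.gauss(0, 1) for _ in range(n)) / n)
        z = diff / math.sqrt(2 / n)  # known sd = 1 in each group
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < .05:
            published.append(abs(diff))
    return len(published) / n_studies, sum(published) / len(published)

power, mean_published_d = published_record()
print(round(power, 2))             # only a small minority of studies "work"...
print(round(mean_published_d, 2))  # ...but published effects dwarf the true 0.2
```

Conditioning on significance guarantees that every published effect cleared the significance threshold, which is exactly why a literature of only significant results cannot be taken at face value.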

For a long time it was assumed that publication bias is just a minor problem. Maybe researchers conducted 10 studies and reported only the 8 significant results while not reporting the remaining two studies that did not produce a significant result. However, in the past five years it has become apparent that publication bias, at least in some areas of the social sciences, is much more severe, and that there are more unpublished studies with non-significant results than published studies with significant results.

In 2012, Daniel Kahneman raised doubts about the credibility of priming research in an open email letter addressed to John A. Bargh, the author of “Before you know it.” Daniel Kahneman is a big name in psychology; he won a Nobel Prize in economics in 2002. He also wrote a popular book that features John Bargh’s priming research (see review of Chapter 4). Kahneman wrote: “As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.”

Kahneman is not an outright critic of priming research. In fact, he was concerned about the future of priming research and made some suggestions how Bargh and colleagues could alleviate doubts about the replicability of priming results.  He wrote:

“To deal effectively with the doubts you should acknowledge their existence and confront them straight on, because a posture of defiant denial is self-defeating. Specifically, I believe that you should have an association, with a board that might include prominent social psychologists from other fields. The first mission of the board would be to organize an effort to examine the replicability of priming results.”

However, prominent priming researchers have been reluctant to replicate their old studies. At the same time, other scientists have conducted replication studies and failed to replicate classic findings. One example is Ap Dijksterhuis’s claim that showing words related to intelligence before taking a test can increase test performance. Shanks and colleagues tried to replicate this finding in 9 studies and came up empty in all 9 studies. More recently, a team of over 100 scientists conducted 24 replication studies of Dijksterhuis’s professor priming study. Only 1 study successfully replicated the original finding, but with a 5% error rate, 1 out of 20 studies is expected to produce a statistically significant result by chance alone. This result validates Shanks’ failures to replicate and strongly suggests that the original result was a statistical fluke (i.e., a false positive result).

Proponents of priming research like Dijksterhuis “argue that social-priming results are hard to replicate because the slightest change in conditions can affect the outcome” (Abbott, 2013, Nature News). Many psychologists consider this response inadequate. The hallmark of a good theory is that it predicts the outcome of a good experiment. If the outcome depends on unknown factors and replication attempts fail more often than not, a scientific theory lacks empirical support. For example, Kahneman wrote in an email that the apparent “refusal to engage in a legitimate scientific conversation … invites the interpretation that the believers are afraid of the outcome” (Abbott, 2013, Nature News).

It is virtually impossible to check on all original findings by conducting extensive and expensive replication studies. Moreover, proponents of priming research can always find problems with actual replication studies to dismiss replication failures. Fortunately, there is another way to examine the replicability of priming research. This alternative approach, z-curve, uses a statistical approach to estimate replicability based on the results reported in original studies. Most important, this approach examines how replicable and credible original findings were based on the results reported in the original articles. Therefore, original researchers cannot use inadequate methods or slight variations in contextual factors to dismiss replication failures. Z-curve can reveal that the original evidence was not as strong as dozens of published studies may suggest, because it takes into account that published studies were selected to provide evidence for priming effects.

My colleagues and I used z-curve to estimate the average replicability of priming studies that were cited in Kahneman’s chapter on priming research. We found that the average probability of a successful replication was only 14%. Given the small number of studies (k = 31), this estimate is not very precise; it could be higher, but it could also be even lower. This estimate implies that for each published significant result there are roughly six unpublished non-significant results that were omitted due to publication bias. Given these results, the published significant results provide only weak empirical support for theoretical claims about priming effects. In a response to our blog post, Kahneman agreed (“What the blog gets absolutely right is that I placed too much faith in underpowered studies”).

Our analysis of Kahneman’s chapter on priming provided a blueprint for this quantitative book review of Bargh’s book “Before you know it.” I first checked the notes for sources and then linked the sources to the corresponding references in the reference section. If the reference was an original research article, I downloaded it and looked for the most critical statistical test of a study; if an article contained multiple studies, I chose one test from each study. I found 168 usable original articles that reported a total of 400 studies. I then converted all test statistics into absolute z-scores and analyzed them with z-curve to estimate replicability (see Excel file for the coding of studies).
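The conversion step can be sketched with Python's standard library (this is the basic transformation behind z-curve, not the full z-curve software): a reported two-sided p-value is mapped to the absolute z-score that would produce it under the standard normal distribution.

```python
from statistics import NormalDist

def p_to_abs_z(p_two_sided):
    """Convert a reported two-sided p-value into an absolute z-score,
    the common scale on which z-curve analyzes all test statistics."""
    return NormalDist().inv_cdf(1 - p_two_sided / 2)

print(round(p_to_abs_z(0.05), 2))  # 1.96, the conventional significance threshold
print(round(p_to_abs_z(0.10), 2))  # 1.64, the "marginal significance" threshold
print(round(p_to_abs_z(0.01), 2))  # 2.58
```

Test statistics reported as t or F values are first converted to their exact p-values (which requires the t or F distribution) and then passed through this transformation, so that results from studies with different designs land on one common scale.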

Figure 1 shows the distribution of absolute z-scores. 90% of the test statistics were statistically significant (z > 1.96) and 99% were at least marginally significant (z > 1.65), meaning they passed a less stringent criterion for claiming a success. This is not surprising because supporting evidence requires statistical significance. The more important question is how many studies would produce a statistically significant result again if all 400 studies were replicated exactly. The estimated success rate in Figure 1 is less than half (41%). Although there is some uncertainty around this estimate, the 95% confidence interval only just reaches 50%, suggesting that the true value is below 50%. There is no established criterion for inadequate replicability, but Tversky and Kahneman (1971) suggested a minimum of 50%, and professors are used to giving students who score below 50% on a test an F. I therefore applied the grading scheme at my university to these replicability scores. The overall grade for the replicability of the studies cited by Bargh to support the ideas in his book is an F.

 

[Figure 1: z-curve of all 400 studies (Before.You.Know.It.Final)]

This being said, 41% replicability is a lot more than we would expect by chance alone, namely 5%. Clearly some of the results mentioned in the book are replicable. The question is which findings are replicable and which ones are difficult to replicate or even false positive results. The problem with 41% replicability is that we do not know which results we can trust. Imagine you are interviewing 100 eyewitnesses and only 41 of them are reliable. Would you be able to identify a suspect?

It is also possible to analyze subsets of studies. Figure 2 shows the results of all experimental studies that randomly assigned participants to two or more conditions.  If a manipulation has an effect, it produces mean differences between the groups. Social psychologists like these studies because they allow for strong causal inferences and make it possible to disguise the purpose of a study.  Unfortunately, this design requires large samples to produce replicable results and social psychologists often used rather small samples in the past (the rule of thumb was 20 per group).  As Figure 2 shows, the replicability of these studies is lower than the replicability of all studies.  The average replicability is only 24%.  This means for every significant result there are at least three non-significant results that have not been reported due to the pervasive influence of publication bias.
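The file-drawer claim follows from simple arithmetic: if the average power of the published studies equals the replicability estimate, and only significant results are published, the expected ratio of hidden failures to published successes is (1 − power) / power. A minimal sketch:

```python
def implied_file_drawer(replicability):
    """Expected number of unpublished non-significant studies per published
    significant one, assuming mean power equals the replicability estimate
    and only significant results are published."""
    return (1 - replicability) / replicability

print(round(implied_file_drawer(0.24), 1))  # 3.2 failures per published success
print(round(implied_file_drawer(0.41), 1))  # 1.4 for the overall 41% estimate
```

This is why low replicability estimates imply a large file drawer: at 24% average power, roughly three attempts had to fail, on average, for every success that made it into print.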

[Figure 2: z-curve of the between-subjects experiments (Before.You.Know.It.BS.EXP.png)]

If 24% doesn’t sound bad enough, it is important to realize that this estimate assumes that the original studies can be replicated exactly.  However, social psychologists have pointed out that even minor differences between studies can lead to replication failures.  Thus, the success rate of actual replication studies is likely to be even less than 24%.

In conclusion, the statistical analysis of the evidence cited in Bargh’s book confirms concerns about the replicability of social psychological studies, especially experimental studies that compared mean differences between two groups in small samples. Readers of the book should be aware that the results reported in the book might not replicate in a new study under slightly different conditions and that numerous claims in the book are not supported by strong empirical evidence.

Replicability of Chapters

I also estimated the replicability separately for each of the 10 chapters to examine whether some chapters are based on stronger evidence than others. Table 1 shows the results. Seven chapters scored an F, two chapters scored a D, and one chapter earned a C-.   Although there is some variability across chapters, none of the chapters earned a high score, but some chapters may contain some studies with strong evidence.

Table 1. Chapter Report Card

Chapter 1: 28% (F)
Chapter 2: 40% (F)
Chapter 3: 13% (F)
Chapter 4: 47% (F)
Chapter 5: 50% (D-)
Chapter 6: 57% (D+)
Chapter 7: 24% (F)
Chapter 8: 19% (F)
Chapter 9: 31% (F)
Chapter 10: 62% (C-)
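The grades in Table 1 map onto the replicability estimates via my university's letter scale. The cutoffs below are my reconstruction of the standard University of Toronto undergraduate scheme (an assumption, but one consistent with every row of the table):

```python
def letter_grade(score):
    """Map a replicability estimate (0-100) to a letter grade, using
    the assumed University of Toronto undergraduate scale."""
    bands = [(85, "A"), (80, "A-"), (77, "B+"), (73, "B"), (70, "B-"),
             (67, "C+"), (63, "C"), (60, "C-"), (57, "D+"), (53, "D"),
             (50, "D-"), (0, "F")]
    for cutoff, grade in bands:
        if score >= cutoff:
            return grade

print(letter_grade(62))  # C- (Chapter 10)
print(letter_grade(57))  # D+ (Chapter 6)
print(letter_grade(28))  # F  (Chapter 1)
```

Anything below 50% replicability, i.e. a coin flip's worth of trust in an exact replication, earns an F on this scale.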

Credible Findings in the Book

Unfortunately, it is difficult to determine the replicability of individual studies with high precision. Nevertheless, studies with high z-scores are more replicable. Particle physicists use a criterion value of z > 5 to minimize the risk that the result of a single study is a false positive. I found that psychological studies with a z-score greater than 4 had an 80% chance of being replicated in actual replication studies. Using this rule as a rough estimate of replicability, I was also able to identify credible claims in the book. Highlighting these claims does not mean that the other claims are wrong; it simply means that they are not supported by strong evidence.
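A rough way to see why large z-scores matter: if the observed z-score happened to equal the true expected z-score, the probability that an exact replication is significant again in the same direction is 1 − Φ(1.96 − z). This is a naive upper bound of my own construction; it ignores selection for significance and regression to the mean, which is why observed z-scores around 4 correspond to only about 80% success in actual replication data rather than the 98% the formula gives.

```python
from statistics import NormalDist

def naive_replication_prob(observed_z, crit_z=1.96):
    """P(an exact replication again reaches z > crit_z in the same direction),
    under the naive assumption that the true expected z equals the observed z."""
    return 1 - NormalDist().cdf(crit_z - observed_z)

print(round(naive_replication_prob(2.80), 2))  # ~.80: the conventional 80%-power point
print(round(naive_replication_prob(4.00), 2))  # ~.98, before correcting for selection
```

The gap between the naive 98% and the observed 80% for z > 4 is a reminder that published z-scores are themselves inflated by selection for significance.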

Chapter 1:    

According to Chapter 1, there seems “to be a connection between the strength of the unconscious physical safety motivation and a person’s political attitudes.” The notes list a number of articles to support this claim. The only conclusive evidence in these studies is that self-reported political attitudes (a measure of right-wing authoritarianism) are correlated with self-reported beliefs that the world is dangerous (Duckitt et al., JPSP, 2002, 2 studies, z = 5.42, 6.93). A correlation between self-report measures is hardly evidence for unconscious physical safety motives.

Another claim is that “our biological mandate to reproduce can have surprising manifestations in today’s world.” This claim is linked to a study that examined the influence of physical attractiveness on callbacks for a job interview. In a large field experiment, researchers mailed resumes (N = 11,008) in response to real job ads and found that both men and women were more likely to be called for an interview if the application included a picture of a highly attractive applicant rather than a less attractive applicant (Busetta et al., 2013, z = 19.53). Although this is an interesting and important finding, it is not clear that the hiring offices’ preference for attractive applicants was driven by a “biological mandate to reproduce.”

Chapter 2: 

Chapter 2 introduces the idea that there is a fundamental connection between physical sensations and social relationships. “… why today we still speak so easily of a warm friend, or a cold father. We always will. Because the connection between physical and social warmth, and between physical and social coldness, is hardwired into the human brain.” Only one z-score surpassed the 4-sigma threshold. This z-score comes from a brain imaging study that found increased sensorimotor activation in response to hand-washing products (soap) after participants had lied in a written email, but not after they had lied verbally (Schaefer et al., 2015, z = 4.65). There are two problems with this supporting evidence. First, z-scores in fMRI studies require a higher threshold than z-scores in other studies because brain imaging studies allow for multiple comparisons that increase the risk of a false positive result (Vul et al., 2009). More important, even if this finding could be replicated, it would not support the claim that these neurological connections are hard-wired into the human brain.

The second noteworthy claim in Chapter 2 is that infants “have a preference for their native language over other languages, even though they don’t yet understand a word.” This claim is not very controversial given ample evidence that humans prefer familiar over unfamiliar stimuli (Zajonc, 1968, also cited in the book). However, it is not so easy to study infants’ preferences (after all, they are not able to tell us). Developmental researchers use a visual attention task to infer preferences: if an infant looks longer at one of two stimuli, this indicates a preference for that stimulus. Kinzler et al. (PNAS, 2007) reported six studies. For five studies, z-scores ranged from 1.85 to 2.92, which is insufficient evidence to draw strong conclusions. However, Study 6 provided convincing evidence (z = 4.61) that 5-year-old children in Boston preferred a native speaker to a child with a French accent. The effect was so strong that 8 children were sufficient to demonstrate it. However, a study with 5-year-olds hardly provides evidence about infants’ preferences. In addition, the design of this study holds all other features constant, so it is not clear how strong this effect is in the real world, where many other factors can influence the choice of a friend.

Chapter 3

Chapter 3 introduces the concept of priming. “Primes are like reminders, whether we are aware of the reminding or not.” It uses two examples to illustrate priming with and without awareness. One example implies that people can be aware of the primes that influenced their behavior: if you are in the airport, smell Cinnabon, and suddenly find yourself in front of the Cinnabon counter, you are likely to know that the smell made you think about Cinnabon and decide to eat one. The second example introduces the idea that primes can influence behavior without awareness: if you were cut off in traffic, you may respond with more hostility to a transgression of a co-worker without being aware that the earlier experience in traffic influenced your reaction. The supporting references contain two noteworthy (z > 4) findings that show how priming can be used effectively in the form of reminders (Rogers & Milkman, 2016, Psychological Science, Study 2a (N = 920, z = 5.45) and Study 5 (N = 305, z = 5.50)). In Study 2a, online participants were presented with the following instruction:

In this survey, you will have an opportunity to support a charitable organization called Gardens for Health that provides lasting agricultural solutions to address the problem of chronic childhood malnutrition. On the 12th page of this survey, please choose answer “A” for the last question on that page, no matter your opinion. The previous page is Page 1. You are now on Page 2. The next page is Page 3. The picture below will be on top of the NEXT button on the 12th page. This is intended to remind you to select answer “A” for the last question on that page. If you follow these directions, we will donate $0.30 to Gardens for Health.

[Image: elephant reminder shown above the NEXT button (Elefant.png)]

On pages 2–11, participants saw either distinct animals or other elephants.

[Image: example animal pictures shown on pages 2–11 (elefant2.png)]

 

Participants in the distinct animal condition were more likely to press the response that led to a donation than participants who saw a variety of elephants (z = 5.45).

Study 5 examined whether respondents would be willing to pay for a reminder.  They were offered 60 cents extra payment for responding with “E” to the last question.  They could either pay 3 cents to get an elephant reminder or not.  53% of participants were willing to pay for the reminder, which the authors compared to 0, z = 2 × 10^9.  This finding implies that participants are not only aware of the prime when they respond in the primed way, but are also aware of this link ahead of time and are willing to pay for it.

In short, Chapter 3 introduces the idea of unconscious or automatic priming, but the only solid evidence in the reference section supports the notion that we can also be consciously aware of priming effects and use them to our advantage.

Chapter 4

Chapter 4 introduces the concept of arousal transfer: the idea that arousal from a previous event can linger and influence how we react to a subsequent event. The book reports in detail a famous experiment by Dutton and Aron (1974).

“In another famous demonstration of the same effect, men who had just crossed a rickety pedestrian bridge over a deep gorge were found to be more attracted to a woman they met while crossing that bridge. How do we know this? Because they were more likely to call that woman later on (she was one of the experimenters for the study and had given these men her number after they filled out a survey for her) than were those who met the same woman while crossing a much safer bridge. The men in this study reported that their decision to call the woman had nothing to do with their experience of crossing the scary bridge. But the experiment clearly showed they were wrong about that, because those in the scary-bridge group were more likely to call the woman than were those who had just crossed the safe bridge.”

First, it is important to correct the impression that men were asked about their reasons to call back.  The original article does not report any questions about motives.  This is the complete section in the results that mentions the call back.

“Female interviewer. In the experimental group, 18 of the 23 subjects who agreed to the interview accepted the interviewer’s phone number. In the control group, 16 out of 22 accepted (see Table 1). A second measure of sexual attraction was the number of subjects who called the interviewer. In the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ² = 5.7, p < .02). Taken in conjunction with the sexual imagery data, this finding suggests that subjects in the experimental group were more attracted to the interviewer.”

A second concern is that the sample size was small and the evidence for the effect was not very strong. In the experimental group 9 out of 18 called; in the control group 2 out of 16 called (χ² = 5.7, p < .02) [z = 2.4].
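Readers can recompute the strength of this evidence from the reported call-back counts. The sketch below assumes a standard Pearson chi-square test for a 2×2 table without continuity correction; the original paper does not state its exact procedure, so the result only approximately matches the reported χ² = 5.7.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Expected counts under independence (row total * column total / n)
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Experimental group: 9 of 18 called; control group: 2 of 16 called
chi2 = chi_square_2x2(9, 9, 2, 14)
z = math.sqrt(chi2)  # with 1 degree of freedom, z = sqrt(chi-square)
print(round(chi2, 2), round(z, 2))  # roughly 5.4 and 2.3
```

A z-value of about 2.3 is just above the conventional significance threshold of 1.96, which is why the evidence is described as not very strong.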

Finally, the authors mention a possible confound in this field study. It is possible that men who dared to cross the suspension bridge differed from men who crossed the safe bridge, and it has been shown that risk-taking men are more likely to engage in casual sex. Study 3 addressed this problem with a less colorful, but more rigorous, experimental design.

Male students were led to believe that they were participants in a study on electric shock and learning. An attractive female confederate (a student working with the experimenter but pretending to be a participant) was also present. The study had four conditions. Male participants were told that they would receive weak or strong shocks, and they were told that the female confederate would receive weak or strong shocks. They were then asked to fill out a questionnaire before the study would start; in fact, the study ended after participants completed the questionnaire and they were told about the real purpose of the study.

The questionnaire contained two questions about the attractive female confederate. “How much would you like to kiss her?” and “How much would you like to ask her out on a date?”  Participants who were anticipating strong shock had much higher average ratings than those who anticipated weak shock, z = 4.46.

Although this is a strong finding, we also have a large literature on emotions and arousal that suggests frightening your date may not be the best way to get to second base (Reisenzein, 1983; Schimmack, 2005). It is also not clear whether arousal transfer is a conscious or unconscious process. One study cited in the book found that exercise did not influence sexual arousal right away, presumably because participants attributed their increased heart rate to the exercise. This suggests that arousal transfer is not an entirely unconscious process.

Chapter 4 also brings up global warming. An unusually warm winter day in Canada often makes people talk about global warming. A series of studies examined the link between weather and beliefs about global warming more scientifically. “What is fascinating (and sadly ironic) is how opinions regarding this issue fluctuate as a function of the very climate we’re arguing about. In general, what Weber and colleagues found was that when the current weather is hot, public opinion holds that global warming is occurring, and when the current weather is cold, public opinion is less concerned about global warming as a general threat. It is as if we use “local warming” as a proxy for “global warming.” Again, this shows how prone we are to believe that what we are experiencing right now in the present is how things always are, and always will be in the future. Our focus on the present dominates our judgments and reasoning, and we are unaware of the effects of our long-term and short-term past on what we are currently feeling and thinking.”

One of the four studies produced strong evidence (z = 7.05).  This study showed a correlation between respondents’ ratings of the current day’s temperature and their estimate of the percentage of above average warm days in the past year.  This result does not directly support the claim that we are more concerned about global warming on warm days for two reasons. First, response styles can produce spurious correlations between responses to similar questions on a questionnaire.  Second, it is not clear that participants attributed above average temperatures to global warming.

A third credible finding (z = 4.62) is from another classic study (Ross & Sicoly, 1979, JPSP, Study 2a). “You will have more memories of yourself doing something than of your spouse or housemate doing them because you are guaranteed to be there when you do the chores. This seems pretty obvious, but we all know how common those kinds of squabbles are, nonetheless. (“I am too the one who unloads the dishwasher! I remember doing it last week!”)” In this study, 44 students participated in pairs. They were given separate pieces of information and exchanged information to come up with a joint answer to a set of questions. Two days later, half of the participants were told that they had performed poorly, whereas the other half were told that they had performed well. In the success condition, participants were more likely to make self-attributions (i.e., take credit) than expected by chance.

Chapter 5

In Chapter 5, John Bargh tells us about work by his supervisor Robert Zajonc (1968). “Bob was doing important work on the mere exposure effect, which is, basically, our tendency to like new things more, the more often we encounter them. In his studies, he repeatedly showed that we like them more just because they are shown to us more often, even if we don’t consciously remember seeing them.” The 1968 classic article contains two studies with strong evidence (Study 2, z = 6.84; Study 3, z = 5.81). Even though the sample sizes were small, this was not a problem because the studies presented many stimuli at different frequencies to all participants. This makes it easy to spot reliable patterns in the data.

[Figure from Zajonc (1968)]

Chapter 5 also introduces the concept of affective priming. Affective priming refers to the tendency to respond emotionally to a stimulus even if the task demands that we ignore it. We simply cannot help feeling good or bad; we cannot turn our emotions off. The experimental way to demonstrate this is to present an emotional stimulus quickly followed by a second emotional stimulus. Participants have to respond to the second stimulus and ignore the first. It is easier to perform the task when the two stimuli have the same valence, suggesting that the valence of the first stimulus was processed even though participants had to ignore it. Bargh et al. (1996, JESP) reported that this even happens when the task is simply to pronounce the second word (Study 1, z = 5.42; Study 2, z = 4.13; Study 3, z = 3.97).

The book does not inform readers that we have to distinguish two types of affective priming effects. Affective priming is a robust finding when participants’ task is to report on the valence (is it good or bad?) of the second stimulus following the prime. However, this finding has been interpreted by some researchers as an interference effect, similar to the Stroop effect. This explanation would not predict effects on a simple pronunciation task. However, there are fewer studies with the pronunciation task, and some of these have failed to replicate Bargh et al.’s original findings, despite the strong evidence observed in their studies. First, Klauer and Musch (2001) failed to replicate Bargh et al.’s finding that affective priming influences pronunciation of target words in three studies with good statistical power. Second, De Houwer et al. (2001) were able to replicate the effect with degraded primes, but also failed to replicate it with the visible primes that were used by Bargh et al. In conclusion, affective priming is a robust effect when participants have to report on the valence of the second stimulus, but this finding does not necessarily imply that primes unconsciously activate related content in memory.

Chapter 5 also reports some surprising associations between individuals’ names, or rather their initials, and the places they live, their professions, and their partners. These correlations are relatively small, but they are based on large datasets and are very unlikely to be mere statistical flukes (z-scores ranging from 4.65 to 49.44). The causal process underlying these correlations is less clear. One possible explanation is that we have unconscious preferences that influence our choices. However, experimental studies that tried to study this effect in the laboratory are less convincing. Moreover, Hodson and Olson failed to find a similar effect across a variety of domains such as liking of animals (Alicia is not more likely to like ants than Samantha), foods, or leisure activities. They found a significant correlation for brand names (p = .007), but this finding requires replication. More recently, Kooti, Magno, and Weber (2014) examined name effects on social media. They found significant effects for some brand comparisons (Sega vs. Nintendo), but not for others (Pepsi vs. Coke). However, they found that Twitter users were more likely to follow other Twitter users with the same first name. Taken together, these results suggest that individuals’ names predict some choices, but it is not clear when or why this is the case.

The chapter ends with a not very convincing article (z = 2.39, z = 2.22) claiming that it is actually very easy to resist or override unwanted priming effects. According to this article, simply being told that somebody is a team member can make automatic prejudice go away. If it were so easy to control unwanted feelings, it is not clear why racism is still a problem 50 years after the civil rights movement started.

In conclusion, Chapter 5 contains a mix of well-established findings with strong support (mere-exposure effects, affective priming) and several less well-supported ideas. One problem is that priming is sometimes presented as an unconscious process that is difficult to control, while at other times these effects seem to be easily controllable. The chapter does not illuminate under which conditions we should expect priming to influence our behavior in ways we do not notice or cannot control, and when we notice these effects and have the ability to control them.

Chapter 6

Chapter 6 deals with the thorny problem in psychological science that most theories make correct predictions sometimes. Even a broken clock tells the time right twice a day. The problem is to know in which context a theory makes correct predictions and when it does not.

“Entire books—bestsellers—have appeared in recent years that seem to give completely conflicting advice on this question: can we trust our intuitions (Blink, by Malcolm Gladwell), or not (Thinking, Fast and Slow, by Daniel Kahneman)? The answer lies in between. There are times when you can and should, and times when you can’t and shouldn’t [trust your gut].”

Bargh then proceeds to make eight evidence-based recommendations about when it is advantageous to rely on intuition without effortful deliberation (gut feelings).

Rule #1: supplement your gut impulse with at least a little conscious reflection, if you have the time to do so.

Rule #2: when you don’t have the time to think about it, don’t take big chances for small gains going on your gut alone.

Rule #3: when you are faced with a complex decision involving many factors, and especially when you don’t have objective measurements (reliable data) of those important factors, take your gut feelings seriously.

Rule #4: be careful what you wish for, because your current goals and needs will color what you want and like in the present.

Rule #5: when our initial gut reaction to a person of a different race or ethnic group is negative, we should stifle it.

Rule #6: we should not trust our appraisals of others based on their faces alone, or on photographs, before we’ve had any interaction with them.

Rule #7: (it may be the most important one of all): You can trust your gut about other people—but only after you have seen them in action.

Rule #8: it is perfectly fine for attraction be one part of the romantic equation, but not so fine to let it be the only, or even the main, thing.

Unfortunately, the credible evidence in this chapter (z > 4) is only vaguely related to these rules and insufficient to claim that these rules are based on solid scientific evidence.

Morewedge and Norton (2009) provide strong evidence that people in different cultures (US z = 4.52, South Korea z = 7.18, India z = 6.78) believe that dreams provide meaningful information about themselves. Study 3 used a hypothetical scenario to examine whether people would change their behavior in response to a dream. Participants were more likely to say that they would change a flight after dreaming about a plane crash the night before the flight than after thinking about a plane crash the evening before, and dreams influenced behavior about as much as hearing about an actual plane crash (z = 10.13). In a related article, Morewedge and colleagues (2014) asked participants to rate types of thoughts (e.g., dreams, problem solving, etc.) in terms of spontaneity or deliberation. A second rating asked about the extent to which the type of thought would generate self-insight or merely reflect the current situation. They found that spontaneous thoughts were considered to generate more self-insight (Study 1, z = 5.32; Study 2, z = 5.80). In Study 5, they also found that more spontaneous recollection of a recent positive or negative experience with a romantic partner predicted hypothetical behavioral intention ratings (“To what extent might recalling the experience affect your likelihood of ending the relationship, if it came to mind when you tried to remember it?”) (z = 4.06). These studies suggest that people find spontaneous, non-deliberate thoughts meaningful and are willing to use them in decision making. The studies do not tell us under which circumstances listening to dreams and other spontaneous thoughts (gut feelings) is beneficial.

Inbar, Cone, and Gilovich (2010) created a set of 25 choice problems (e.g., choosing an entree, choosing a college). They found that “the more a choice was seen as objectively evaluable, the more a rational approach was seen as the appropriate choice strategy” (Study 1a, z = 5.95). In a related study, they found that “the more participants thought the decision encouraged sequential rather than holistic processing, the more they thought it should be based on rational analysis” (Study 1b, z = 5.02). These studies provide some insight into people’s beliefs about optimal decision rules, but they do not tell us whether people’s beliefs are right or wrong, which would require examining people’s actual satisfaction with their choices.

Frederick (2005) examined personality differences in the processing of simple problems (e.g., a bat and a ball cost $1.10; the bat costs $1.00 more than the ball; how much does the ball cost?). The quick answer is 10 cents, but the correct answer is 5 cents. In this case, the gut response is false. A sample of over 3,000 participants answered several similar questions. Participants who performed above average were more willing to delay gratification (get $3,800 in a month rather than $3,400 now) than participants with below average performance (z > 5). If we consider the bigger reward the better choice, these results imply that it is not good to rely on gut responses when it is possible to use deliberation to get the right answer.
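The correct answer follows from a line of algebra: if the ball costs x, the bat costs x + 1.00, so together they cost 2x + 1.00 = 1.10 and x = 0.05. A quick sketch (the variable names are mine):

```python
# Bat-and-ball problem: bat + ball = 1.10 and bat = ball + 1.00.
# Substituting gives (ball + 1.00) + ball = 1.10, so ball = 0.05.
total = 1.10
difference = 1.00
ball = (total - difference) / 2
bat = ball + difference
print(round(ball, 2), round(bat, 2))  # 0.05 1.05
```

The intuitive answer of 10 cents fails the check: a 10-cent ball would make the bat $1.10 and the total $1.20.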

Two studies by Wilson and Schooler (1991) are used to support the claim that we can overthink choices.

“In their first study, they had participants judge the quality of different brands of jam, then compared their ratings with those of experts. They found that the participants who were asked to spend time consciously analyzing the jam had preferences that differed further from those of the experts, compared to those who responded with just the “gut” of their taste buds.”  The evidence in this study with a small sample is not very strong and requires replication  (N = 49, z = 2.36).

“In Wilson and Schooler’s second study, they interviewed hundreds of college students about the quality of a class. Once again, those who were asked to think for a moment about their decisions were further from the experts’ judgments than were those who just went with their initial feelings.”


The description in the book does not match the actual study. There were three conditions. In the control condition, participants were asked to read the information about the courses carefully. In the reasons condition, participants were asked to write down their reasons, and in the rate-all condition participants were asked to rate all pieces of information, no matter how important, in terms of their effect on their choices. The study showed that considering all pieces of information increased the likelihood of choosing a poorly rated course (a bad choice), but had a much smaller effect on ratings of highly rated courses (z = 4.14 for the interaction effect). All conditions asked for some reflection, and it remains unclear how students would have responded if they had gone with their initial feelings, as described in the book. Nevertheless, the study suggests that good choices require focusing on important factors and that paying attention to trivial factors can lead to suboptimal choices. For example, real estate agents in hot markets use interior design to drive up prices even though the design is not part of the sale.

“We are born sensitive to violations of fair treatment and with the ability to detect those who are causing harm to others, and assign blame and responsibility to them. Recent research has shown that even children three to five years old are quite sensitive to fairness in social exchanges. They preferred to throw an extra prize (an eraser) away than to give more to one child than another—even when that extra prize could have gone to themselves.” This is not an accurate description of the studies. Study 1 (z > 5) found that 6- to 8-year-old children preferred to give 2 erasers to one kid and 2 erasers to another kid and to throw the fifth eraser away to maintain equality (20 out of 20, p < .0001). However, “the 3- to 5-year-olds showed no preference to throw a resource away (14 out of 24, p = .54)” (p. 386). Subsequent studies used only 6- to 8-year-old children. Study 4 examined how children would respond if erasers were divided between themselves and another kid. 17 out of 20 (p = .003, z = 2.97) preferred to throw the eraser away rather than get one more for themselves. However, in a related article, Shaw and Olson (2012b) found that children preferred favoritism (getting more erasers) when receiving more erasers was introduced as winning a contest (Study 2, z = 4.65). These studies are quite interesting, but they do not support the claim that equality norms are inborn, nor do they help us figure out when we should or should not listen to our gut, or whether it is better for us to be equitable or selfish.

The last, but in my opinion most interesting and relevant, piece of evidence in Chapter 6 is a large (N = 16,624) survey study of relationship satisfaction (Cacioppo et al., 2013, PNAS, z = 6.58). Respondents reported their relationship satisfaction and how they had met. Respondents who had met their partner online were slightly more satisfied than respondents who had met their partner offline. There were also differences among the various ways of meeting offline; respondents who met their partner in a bar had one of the lowest average levels of satisfaction. The study did not reveal why online dating is slightly more successful, but both forms of dating probably involve a combination of deliberation and “gut” reactions.

In conclusion, Chapter 6 provides some interesting insights into the way people make choices. However, the evidence does not provide a scientific foundation for recommendations about when it is better to follow your instincts and when it is better to rely on logical reasoning and deliberation. Either the evidence of the reviewed studies is too weak or the studies do not use actual choice outcomes as the outcome variable. The comparison of online and offline dating is a notable exception.

Chapter 7

Chapter 7 uses an impressive field experiment to support the idea that “our mental representations of concepts such as politeness and rudeness, as well as innumerable other behaviors such as aggression and substance abuse, become activated by our direct perception of these forms of social behavior and emotion, and in this way are contagious.”   Keizer et al. (2008) conducted the study in an alley in Groningen, a city in the Netherlands.  In one condition, bikes were parked in front of a wall with graffiti, despite an anti-graffiti sign.  In the other condition, the wall was clean.  Researchers attached fliers to the bikes and recorded how many users would simply throw the fliers on the ground.  They recorded the behaviors of 77 bike riders in each condition. In the graffiti condition, 69% of riders littered. In the clean condition, only 33% of riders littered (z = 4.51).


In Study 2, the researchers put up a fence in front of the entrance to a car park that required car owners to walk an extra 200m to get to their car, but they left a gap that allowed car owners to avoid the detour. There was also a sign that forbade locking bikes to the fence. In the control condition, no bikes were locked to the fence. In the experimental condition, the norm was violated and four bikes were locked to the fence. 41 car owners’ behaviors were observed in each condition. In the experimental condition, 82% of car owners stepped through the gap. In the control condition, only 27% of car owners stepped through the gap (z = 5.27).
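The z-values for both field experiments can be approximated with a standard two-proportion z-test. The counts below are inferred from the reported percentages (69% and 33% of 77 riders; 82% and 27% of 41 car owners), so this is a sketch rather than the authors' exact analysis; the second result only approximates the reported z = 5.27.

```python
import math

def two_prop_z(k1, n1, k2, n2):
    """Two-proportion z-test using the pooled standard error."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Study 1: littering, 69% (~53/77) vs. 33% (~25/77)
z1 = two_prop_z(53, 77, 25, 77)
# Study 2: stepping through the gap, 82% (~34/41) vs. 27% (~11/41)
z2 = two_prop_z(34, 41, 11, 41)
print(round(z1, 2), round(z2, 2))  # 4.51 and about 5.1
```

Both values are well above the z > 4 threshold used throughout this review to flag credible evidence.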


It is unlikely that bike riders or car owners in these studies consciously processed the graffiti or the locked bikes.  Thus, these studies support the hypothesis that our environment can influence behavior in subtle ways without our awareness.  Moreover, these studies show these effects with real-world behavior.

Another noteworthy study in Chapter 7 examined happiness in social networks (Fowler & Christakis, 2008).   The authors used data from the Framingham Heart Study, which is a unique study where most inhabitants of a small town, Framingham, participated in the study.   Researchers collected many measures, including a measure of happiness. They also mapped social relationships among them.  Fowler and Christakis used sophisticated statistical methods to examine whether people who were connected in the social network (e.g., spouses, friends, neighbors) had similar levels of happiness. They did (z = 9.09).  I may be more likely to believe these findings because I have found this in my own research on married couples (Schimmack & Lucas, 2010).  Spouses are not only more similar to each other at one moment in time, they also change in the same direction over time.  However, the causal mechanism underlying this effect is more elusive.  Maybe happiness is contagious and can spread through social networks like a disease. However, it is also possible that related members in social networks are exposed to similar environments.  For example, spouses share a common household income and money buys some happiness.  It is even less clear whether these effects occur outside of people’s awareness or not.

Chapter 7 ends with the positive message that a single person can change the world because his or her actions influence many people. “The effect of just one act, multiplies and spreads to influence many other people. A single drop becomes a wave.” This rosy conclusion overlooks that the impact of one person decreases exponentially as it spreads over social networks. If you are kind to a neighbor, the neighbor may be slightly more likely to be kind to the pizza delivery man, but your effect on the pizza delivery man is already barely noticeable. This may be a good thing when it comes to the spreading of negative behaviors. Even if the friend of a friend is engaging in immoral behaviors, it does not mean that you are more likely to commit a crime. To really change society, it is important to change social norms and increase individuals’ reliance on these norms even when situational influences tell them otherwise. The more people have a strong norm not to litter, the less it matters whether there are graffiti on the wall or not.

Chapter 8

Chapter 8 examines dishonesty and suggests that dishonesty is a general human tendency. “When the goal of achievement and high performance is active, people are more likely to bend the rules in ways they’d normally consider dishonest and immoral, if doing so helps them attain their performance goal.”

Of course, not all people cheat in all situations even if they think they can get away with it.  So, the interesting scientific question is who will be dishonest in which context?

Mazar et al. (2008) examined situational effects on dishonesty. In Study 2 (z = 4.33) students were given an opportunity to cheat in order to receive a higher reward. The study had three conditions: a control condition that did not allow students to cheat, a cheating condition, and a cheating condition with an honor pledge. In the honor pledge condition, the test started with the sentence “I understand that this short survey falls under MIT’s [Yale’s] honor system.” This manipulation eliminated cheating. However, even in the cheating condition “participants cheated only 13.5% of the possible average magnitude.” Thus, MIT/Yale students are rather honest, or the incentive was too small to tempt them (an extra $2). Study 3 found that students were more likely to cheat if they were rewarded with tokens rather than money, even though they could later exchange the tokens for money. The authors suggest that cheating merely for tokens rather than real money made it seem less like “real” cheating (z = 6.72).

Serious immoral acts cannot be studied experimentally in a psychology laboratory. Therefore, research on this topic has to rely on self-reports and correlations. Pryor (1987) developed a questionnaire to study “Sexual Harassment Proclivities in Men.” The questionnaire asks men to imagine being in a position of power and to indicate whether they would take advantage of their power to obtain sexual favors if they knew they could get away with it. To validate the scale, Pryor showed that it correlated with a scale that measures how much men buy into rape myths (r = .40, z = 4.47). Self-reports on these measures have to be taken with a grain of salt, but the results suggest that some men are willing to admit that they would abuse power to gain sexual favors, at least in anonymous questionnaires.

Another noteworthy study found that even prisoners are not always dishonest. Cohn et al. (2015) used a gambling task to study dishonesty in 182 prisoners in a maximum security prison. Participants were given the opportunity to flip 10 coins and to keep all coins that showed heads. Importantly, the coin tosses were not observed. As it is possible, although unlikely, that all 10 coins show heads by chance, inmates could keep all coins and hide behind chance. The randomness of the outcome makes it impossible to accuse a particular prisoner of dishonesty. Nevertheless, the task makes it possible to measure dishonesty of the group (collective dishonesty), because the percentage of coin tosses reported as heads should be close to chance (50%). If it is significantly higher than chance, it shows that some prisoners were dishonest. On average, prisoners reported 60% heads, which reveals some dishonesty, but even convicted criminals were more likely to respond honestly than not (the percentage increased from 60% to 66% when they were primed with their criminal identity, z = 2.39).
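The logic of this collective dishonesty measure is easy to make concrete. Assuming 182 prisoners each tossed 10 coins (1,820 tosses in total) and 60% heads were reported, a one-proportion z-test against the 50% chance value shows how unlikely this outcome is under full honesty. This is a sketch of the logic with assumed totals, not the authors' exact analysis.

```python
import math

def one_prop_z(observed_rate, null_rate, n):
    """One-proportion z-test of an observed rate against a null value."""
    se = math.sqrt(null_rate * (1 - null_rate) / n)
    return (observed_rate - null_rate) / se

# 182 prisoners x 10 tosses = 1,820 tosses; 60% heads reported vs. 50% chance
z = one_prop_z(0.60, 0.50, 182 * 10)
print(round(z, 2))  # about 8.5
```

Even though no single prisoner can be accused, the group-level deviation from chance is overwhelming.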

I see some parallels between the gambling task and the world of scientific publishing, at least in psychology. The outcome of a study is partially determined by random factors. Even if a scientist does everything right, a study may produce a non-significant result due to random sampling error. The probability of obtaining a non-significant result when an effect exists is the type-II error rate; the probability of obtaining a significant result is statistical power. Just as in a coin toss experiment, the observed percentage of significant results should match the expected percentage based on average power. Numerous studies have shown that researchers report more significant results than the power of their studies justifies. As in the coin toss experiment, it is not possible to point the finger at a single outcome, because chance might have been in a researcher’s favor, but in the long run the odds “cannot be always in your favor” (Hunger Games). Psychologists disagree about whether the excess of significant results in psychology journals should be attributed to dishonesty. I think it should, and it fits Bargh’s observation that humans, and most scientists are humans, have a tendency to bend the rules when doing so helps them reach their goal, especially when the goal is highly relevant (e.g., getting a job, a grant, or tenure). Sadly, the extent of over-reporting of significant results is considerably larger than the 10 to 15% over-reporting of heads in the prisoner study.
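This parallel can be expressed with the same test as the coin-toss analysis. If the average power of a set of studies is known or estimated, the expected percentage of significant results is simply that power, and an excess of observed significant results can be tested against it. A minimal sketch with made-up numbers (90 significant results out of 100 studies, assumed average power of 50%):

```python
import math

def excess_success_z(k_significant, n_studies, mean_power):
    """z-test for whether the observed success rate exceeds average power."""
    observed = k_significant / n_studies
    se = math.sqrt(mean_power * (1 - mean_power) / n_studies)
    return (observed - mean_power) / se

# 90 of 100 reported results significant, but average power is only 50%
z = excess_success_z(90, 100, 0.50)
print(round(z, 2))  # 8.0
```

As with the prisoners, no single significant result can be called dishonest, but the collective excess is statistically incredible.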

Chapter 9

Chapter 9 introduces readers to Metcalfe’s work on insight problems (e.g., how to put 27 animals into 4 pens so that there is an odd number of animals in all four pens). Participants had to predict quickly whether they would be able to solve a problem. They then got 5 minutes to actually solve it. Participants were not able to predict accurately which insight problems they would solve. Metcalfe concluded that the solution to insight problems comes during a moment of sudden illumination that is not predictable. Bargh adds, “This is because the solver was working on the problem unconsciously, and when she reached a solution, it was delivered to her fully formed and ready for use.” In contrast, people are able to predict memory performance on a recognition test, even when they are not able to recall the answer immediately. This phenomenon is known as the tip-of-the-tongue effect (z = 5.02). It shows that we have access to our memory even before we can recall the final answer, and it is similar to the feeling of familiarity that is created by mere exposure (Zajonc, 1968). We often know a face is familiar without being able to recall specific memories of where we encountered it.

The only other noteworthy study in Chapter 9 is a study of sleep quality (Fichten et al., 2001).  “The researchers found that by far, the most common type of thought that kept them awake, nearly 50 percent of them, was about the future, the short-term events coming up in the next day or week. Their thoughts were about what they needed to get done the following day, or in the next few days.”   It is true that 48% of thoughts were about short-term future events, but only 1% of these thoughts were described as worries, and 57% were positive.  It is also not clear whether this category distinguished good and poor sleepers.  What did distinguish good sleepers from poor sleepers, especially those with high distress, was the frequency of negative thoughts (z = 5.59).

Chapter 10

Chapter 10 examines whether it is possible to control automatic impulses. Ample research by personality psychologists suggests that controlling impulses is easier for some people than others.  The ability to exert self-control is often measured with self-report measures that predict objective life outcomes.

However, the book adds a twist to self-control. “The most effective self-control is not through willpower and exerting effort to stifle impulses and unwanted behaviors. It comes from effectively harnessing the unconscious powers of the mind to much more easily do the self-control for you.”

There is a large body of strong evidence that some individuals, those with high impulse control and conscientiousness, perform better academically or at work (Tangney et al., 2004, Study 1, z = 5.90; Galla & Duckworth, Studies 1, 4, & 6, Ns = 488, 7.62, 5.18).  However, correlations between personality measures and outcomes do not reveal the causal mechanism behind these positive outcomes.  Bargh suggests that individuals who score high on self-control measures are “the ones who do the good things less consciously, more automatically, and more habitually. And you can certainly do the same.”   This may be true, but empirical work demonstrating it is hard to find.  At the end of the chapter, Bargh cites a recent study by Marina Milyavskaya and Michael Inzlicht suggesting that avoiding temptations is more important than being able to exert self-control, willfully or unconsciously, in the face of temptation.

Conclusion

The book “Before you know it: The unconscious reasons we do what we do” is based on the author’s personal experiences, studies he has conducted, and studies he has read. The author is a scientist, and I have no doubt that he shares with his readers insights that he believes to be true.  However, this does not automatically make them true.  John Bargh is well aware that many psychologists are skeptical about some of the findings used in the book.  Famously, some of Bargh’s own studies have been difficult to replicate.  One response to concerns about replicability could have been new demonstrations that important unconscious priming effects can be replicated. In an interview, Tom Bartlett (January 2013) suggested this to John Bargh.

“So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway. Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.”

Beliefs are subjective.  Readers of the book have their own beliefs and may find parts of the book interesting and be willing to change some of their beliefs about human behavior.  Not that there is anything wrong with this, but readers should also be aware that it is reasonable to treat the ideas presented in this book with a healthy dose of skepticism.  In 2011, Daniel Kahneman wrote: “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”  Five years later, it is pretty clear that Kahneman has become more skeptical about the state of priming research and about the results of experiments with small samples in general.  Unfortunately, it is not clear which studies we can believe until replication studies distinguish real effects from statistical flukes. So, until we have better evidence, we are still free to believe what we want about the power of unconscious forces on our behavior.

 

(Preprint) Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies (Schimmack & Brunner, 2017)

In this PDF document, Jerry Brunner and I would like to share our latest manuscript on z-curve, a method that estimates the average power of a set of studies selected for significance.  We call this estimate replicability because average power determines the success rate if the set of original studies were replicated exactly.
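
To see why a selection-correcting estimator is needed at all, here is a toy simulation. This is not the actual z-curve mixture model described in the manuscript, and all numbers are hypothetical; it only illustrates that a naive power estimate computed from the significant results alone overestimates the true average power, because published z values are inflated by selection for significance.

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()
crit = 1.96  # two-sided critical z value for p < .05

true_power = 0.30  # assumed true average power of all studies that were run
# Mean z that yields 30% power: solve P(Z > crit) = 0.30 for Z ~ N(mu, 1).
mu = crit - nd.inv_cdf(1 - true_power)

n = 100_000
zs = [random.gauss(mu, 1.0) for _ in range(n)]
sig = [z for z in zs if z > crit]  # only significant results get "published"

# The overall success rate recovers the true average power ...
success_rate = len(sig) / n

# ... but a naive estimate based only on the published z values, averaging
# the power implied by each observed z, is inflated by selection.
naive = sum(1 - nd.cdf(crit - z) for z in sig) / len(sig)

print(round(success_rate, 2))  # ~0.30
print(round(naive, 2))         # well above 0.30
```

A selection-correcting method has to model the truncated distribution of significant z statistics directly rather than taking the observed values at face value, which is the problem z-curve is designed to solve.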

We welcome all comments and criticism as we plan to submit this manuscript to a peer-reviewed journal by December 1.

Highlights

Comparison of p-curve and z-curve in simulation studies.

Estimate of average replicability in Cuddy et al.’s (2017) p-curve analysis of power posing with z-curve (30% for z-curve vs. 44% for p-curve).

Estimate of average replicability in psychology based on over 500,000 significant test statistics.

Comparison of automated extraction of test statistics with focal hypothesis tests, using Motyl et al.’s (2016) replicability analysis of social psychology.