Fritz Strack asks “Have I done something wrong?”

Since 2011, experimental social psychology has been in crisis mode. It has become evident that social psychologists violated some implicit or explicit norms about science. Readers of scientific journals expect that the methods and results sections of a scientific article provide an objective description of the study, data analysis, and results. However, it is now clear that this is rarely the case. Experimental social psychologists have used various questionable research practices to report mostly results that supported their theories. As a result, it is now unclear which published results are replicable and which results are not.

In response to this crisis of confidence, a new generation of social psychologists has started to conduct replication studies. The most informative replication studies are published in a new type of article called registered replication reports (RRRs).

What makes RRRs so powerful is that they are not simple replication studies. An RRR is a collective effort to replicate an original study in multiple labs. This makes it possible to examine the generalizability of results across different populations and to combine the data in a meta-analysis. The pooling of data across multiple replication studies reduces sampling error, and it becomes possible to obtain fairly precise effect size estimates that can be used to provide positive evidence for the absence of an effect. If the effect size estimate is within a reasonably small interval around zero, the results suggest that the population effect size is so close to zero that it is theoretically irrelevant. In this way, an RRR can have three possible results: (a) it replicates an original result in most of the individual studies (e.g., with 80% power, it would replicate the result in 80% of the replication attempts); (b) it fails to replicate the result in most of the replication attempts (e.g., it replicates the result in only 20% of replication studies), but the effect size in the meta-analysis is significant; or (c) it fails to replicate the original result in most studies and the meta-analytic effect size estimate suggests the effect does not exist.

Another feature of RRRs is that original authors get an opportunity to publish a response. This blog post is about Fritz Strack’s response to the RRR of Strack et al.’s facial feedback study. Strack et al. (1988) reported two studies that suggested incidental movement of facial muscles influences amusement ratings of cartoons. The article is the second most cited article by Strack and his most cited empirical article. It is therefore likely that Strack cared about the outcome of the replication study.


So, it may have elicited some negative feelings when the results showed that none of the replication studies produced a significant result and the meta-analysis suggested that the original result was a false positive result; that is the population effect size is close to zero and the results of the original studies were merely statistical flukes.


Strack’s Response to the RRR

Strack’s first response to the RRR results was surprise because numerous replication studies of this fairly famous study had been conducted before, and the published studies typically, but not always, reported successful replications. Any naive reader of the literature, review articles, or textbooks is likely to have the same response. If an article has over 600 citations, it suggests that it made a solid contribution to the literature.

However, social psychology is not a normal psychological science. Even more famous effects like the ego-depletion effect or elderly priming have failed to replicate. A major replication attempt that was published in 2015 showed that only 25% of social psychological studies could be replicated (OSC, 2015), and Strack had commented on this result. Thus, I was a bit surprised by Strack’s surprise because the failure to replicate his results was in line with many other replication failures since 2011.

Despite concerns about the replicability of social psychology, Strack expected a positive result because he had conducted a meta-analysis of 20 studies that had been published in the past five years.

If 20 previous studies successfully replicated the effect and the 17 studies of the RRR all failed to replicate the effect, it suggests the presence of a moderator; that is some variation between these two sets of studies that explains why the nonRRR studies found the effect and the RRR studies did not.

Moderator 1

First, the authors have pointed out that the original study is “commonly discussed in introductory psychology courses and textbooks” (p. 918). Thus, a majority of psychology students was assumed to be familiar with the pen study and its findings.

As the study used deception, it makes sense that the study does not work if students know about the study.  However, this hypothesis assumes that all 17 samples in the RRR were recruited from universities in which the facial feedback hypothesis was taught before they participated in the study.  Moreover, it assumes that none of the samples in the successful nonRRR studies had the same problem.

However, Strack does not compare nonRRR studies to RRR studies.  Instead, he focuses on three RRR samples that did not use student samples (Holmes, Lynott, and Wagenmakers).  None of the three samples individually show a significant result. Thus, none of these studies replicated the original findings.  Strack conducts a meta-analysis of these three studies and finds an average mean difference of d = .16.

Table 1

Dataset       N     M-teeth   M-lips   SD-teeth   SD-lips   Cohen’s d
Holmes        130   4.94      4.79     1.14       1.3       0.12
Lynott        99    4.91      4.71     1.49       1.31      0.14
Wagenmakers   126   4.54      4.18     1.42       1.73      0.23
Pooled        355   4.79      4.55     1.81       2.16      0.12

The standardized effect size of d = .16 is the unweighted average of the three d-values in Table 1. The weighted average is d = .17. However, if the three studies are first combined into a single dataset and the standardized mean difference is computed from the combined dataset, the standardized mean difference is only d = .12.
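The difference between these averages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numbers are copied from Table 1, each d is recomputed from the reported means and standard deviations assuming equal group sizes (the per-condition sample sizes are not given in the table), and the weighted average uses the total N of each study as its weight. The function and variable names are mine.

```python
import math

# Values copied from Table 1: (N, M-teeth, M-lips, SD-teeth, SD-lips)
studies = {
    "Holmes": (130, 4.94, 4.79, 1.14, 1.3),
    "Lynott": (99, 4.91, 4.71, 1.49, 1.31),
    "Wagenmakers": (126, 4.54, 4.18, 1.42, 1.73),
}
pooled_row = (355, 4.79, 4.55, 1.81, 2.16)


def cohens_d(m_teeth, m_lips, sd_teeth, sd_lips):
    # Assumption: equal group sizes, so the pooled variance is the
    # simple mean of the two group variances.
    pooled_sd = math.sqrt((sd_teeth ** 2 + sd_lips ** 2) / 2)
    return (m_teeth - m_lips) / pooled_sd


d_values = {name: cohens_d(*row[1:]) for name, row in studies.items()}
total_n = sum(row[0] for row in studies.values())

# Unweighted average: every study counts the same.
unweighted = sum(d_values.values()) / len(d_values)
# Weighted average: every study counts in proportion to its N.
weighted = sum(studies[name][0] * d for name, d in d_values.items()) / total_n
# d computed from the "Pooled" row of the table.
pooled_d = cohens_d(*pooled_row[1:])

print(f"unweighted = {unweighted:.3f}, weighted = {weighted:.3f}, pooled = {pooled_d:.3f}")
```

Under these assumptions the two averages land between .16 and .17, while the pooled row gives a value close to .12. The averages differ so little that the choice among them does not change the conclusion.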

More importantly, the standard error for the pooled data is 2 / sqrt(355) = 0.106, which means the 95% confidence interval around any of these point estimates is 0.106 * 1.96 = .21 standard deviations wide on each side of the point estimate. Even with d = .17, the 95%CI (-.04 to .38) includes 0. At the same time, the effect size in the original study was approximately d = .5, suggesting that the original results were extreme outliers or that additional moderating factors account for the discrepancy.
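The interval arithmetic is easy to check. The sketch below uses the same large-sample approximation as the text (standard error of d is about 2 / sqrt(N) for two equal groups and a small effect), which is an approximation and not an exact formula. The variable names are mine.

```python
import math
from statistics import NormalDist

n_total = 355
se = 2 / math.sqrt(n_total)            # approximate standard error of d
z_crit = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval
margin = z_crit * se


def ci(d):
    """Approximate 95% confidence interval around a point estimate d."""
    return d - margin, d + margin


for d in (0.12, 0.16, 0.17):
    lower, upper = ci(d)
    print(f"d = {d:.2f}: 95% CI = ({lower:.2f}, {upper:.2f})")
```

With a margin of about .21 on each side, every one of the three point estimates produces an interval that includes zero.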

Strack does something different. He tests the three proper studies against the average effect size of the “improper” replication studies with student samples. These studies have an average effect size of d = -0.03. This analysis shows a significant difference (proper d = .16 and improper d = -.03), t(15) = 2.35, p = .033.

This is an interesting pattern of results. The significant moderation effect suggests that facial feedback effects were stronger in the 3 studies identified by Strack than in the remaining 14 studies. At the same time, the average effect size of the three proper replication studies is still not significant, despite a pooled sample size that is three times larger than the sample size in the original study.

One problem for this moderator hypothesis is that the research protocol made it clear that the study had to be conducted before the original study was covered in a course (Simons response to Strack).  However, maybe student samples fail to show the effect for another reason.

The best conclusion that can be drawn from these results is that the effect may be greater than zero, but that the effect size in the original studies was probably inflated.

Cartoons were not funny

Strack’s second concern is weaker and he is violating some basic social psychological rules about argument strength. Adding weak arguments increases persuasiveness if the audience is not very attentive, but they backfire with an attentive audience.

Second, and despite the obtained ratings of funniness, it must be asked if Gary Larson’s The Far Side cartoons that were iconic for the zeitgeist of the 1980s instantiated similar psychological conditions 30 years later. It is indicative that one of the four exclusion criteria was participants’ failure to understand the cartoons.  

One of the exclusion criteria was failure to understand the cartoons, but how many participants were excluded because of this criterion? Without this information, the point is not relevant. And if only participants who understood the cartoons were included in the analysis, it is not clear how a failure to understand the cartoons could explain the replication failure. Moreover, Strack just tried to claim that the proper studies did show the effect, which they could not show if the cartoons were not funny. Finally, the means clearly show that participants reported being amused by the cartoons.

Weak arguments like this undermine more reasonable arguments like the previous one.

Using A Camera 

Third, it should be noted that to record their way of holding the pen, the RRR labs deviated from the original study by directing a camera on the participants. Based on results from research on objective self-awareness, a camera induces a subjective self-focus that may interfere with internal experiences and suppress emotional responses.

This is a serious problem and in hindsight the authors of the RRR are probably regretting the decision to use cameras.  A recent article actually manipulated the presence or absence of a camera and found stronger effects without a camera, although the predicted interaction effect was not significant. Nevertheless, the study suggested that creating self-awareness with a camera could be a moderator and might explain why the RRR studies failed to replicate the original effect.

Reverse p-hacking 

Strack also noticed a statistical anomaly in the data. When he correlated the standardized mean differences (d-values) with the sample sizes, a non-significant positive correlation emerged, r(17) = .45, p = .069.   This correlation shows a statistical trend for larger samples to produce larger effect sizes.  Even if this correlation were significant, it is not clear what conclusions should be drawn from this observation.  Moreover, earlier Strack distinguished student samples and non-student samples as a potential moderator.  It makes sense to include this moderator in the analysis because it could be confounded with sample size.
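For readers who want to check the reported p-value, here is a minimal sketch of the usual t-test for a correlation. It assumes that the 17 in r(17) is the number of RRR labs, so that the test has 17 - 2 = 15 degrees of freedom. That reading is my assumption. The p-value is obtained by numerically integrating the t density with the standard library only, and the helper names are mine.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    return 2 * (0.5 - area)


r = 0.45
k = 17                             # assumed number of studies
df = k - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
p = two_tailed_p(t, df)
print(f"t({df}) = {t:.2f}, p = {p:.3f}")
```

Under this assumption the test statistic is just below 2 and the two-tailed p-value is close to .07, which matches the reported value and confirms that the correlation is not significant at the conventional .05 level.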

A regression analysis shows that the effect of proper sample is no longer significant, t(14) = 1.75, and the effect of sampling error (1/sqrt(N)) is also not significant, t(14) = 0.59.

This new analysis suggests that sample size is not a predictor of effect sizes, which makes sense because there is no reasonable explanation for such a positive correlation.

However, now Strack makes a mistake by implying that the weaker effect sizes in smaller samples could be a sign of  “reverse p-hacking.”

Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.

The rhetorical vehicle of “Without insinuating” can be used to backtrack from the insinuation that was never made, but Strack is well aware of priming research and the ironic effect of instructions “not to think about a white bear”  (you probably didn’t think about one until now and nobody was thinking about reverse p-hacking until Strack mentioned it).

Everybody understood what he was implying (reverse p-hacking = intentional manipulations of the data to make significance disappear and discredit original researchers in order to become famous with failures to replicate famous studies) (Crystal Prison Zone blog; No Hesitations blog).


The main mistake that Strack makes is that a negative correlation between sample size and effect size can suggest that researchers were p-hacking (inflating effect sizes in small samples to get significance), but a positive correlation does not imply reverse p-hacking (making significant results disappear). Reverse p-hacking also implies a negative trend line, because researchers with larger samples, like Wagenmakers (the 2nd largest sample in the set), who love to find non-significant results that Bayesian statistics treat as evidence for the absence of an effect, would have to report smaller effect sizes to avoid significance or Bayes factors in favor of an effect.

So, here is Strack’s fundamental error.   He correctly assumes that p-hacking results in a negative correlation, but he falsely assumes that reverse p-hacking would produce a positive correlation and then treats a positive correlation as evidence for reverse p-hacking. This is absurd and the only way to backtrack from this faulty argument is to use the “without insinuating” hedge (“I never said that this correlation implies reverse p-hacking, in fact I made clear that I am not insinuating any of this.”)

Questionable Research Practices as Moderator 

Although Strack mentions two plausible moderators (student samples, camera), there are other possible moderators that could explain the discrepancies between his original results and the RRR results.  One plausible moderator that Strack does not mention is that the original results were obtained with questionable research practices.

Questionable research practices is a broad term for a variety of practices that undermine the credibility of published results  (John et al., 2012), including fraud.

To be clear, I am not insinuating that Strack fabricated data and I have said so in public before. Let me be absolutely clear because the phrasing I used in the previous sentence is a stab at Strack’s reverse p-hacking quote, which may be understood as implying exactly what is not being insinuated. I positively do not believe that Fritz Strack faked data for the 1988 article or any other article.

One reason for my belief is that I don’t think anybody would fake data that produce only marginally significant results that some editors might reject as insufficient evidence. If you fake your data, why fake p = .06, if you can fake p = .04?

If you take a look at the Figure above, you see that the 95%CI of the original study includes zero.  That shows that the original study did not show a statistically significant result.  However, Strack et al. (1988) used a more liberal criterion of .10 (two-tailed) or .05 (one-tailed) to test significance and with this criterion the results were significant.

The problem is that it is unlikely for two independent studies to produce two marginally significant results in a row.  This is either an unlikely fluke or some other questionable research practices were used to get these results. So, yes, I am insinuating that questionable research practices may have inflated the effect sizes in Strack et al.’s studies and that this explains at least partially why the replication study failed.
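How unlikely? A rough sketch can put an upper bound on it. The code below treats each study as a simple two-tailed z-test (an approximation, because the original studies used t-tests) and asks how often the p-value falls between .05 and .10 for different true effects. The expected z-score `delta` and all names are mine and purely illustrative.

```python
from statistics import NormalDist

norm = NormalDist()
Z_05 = norm.inv_cdf(0.975)   # about 1.96, two-tailed p = .05
Z_10 = norm.inv_cdf(0.95)    # about 1.645, two-tailed p = .10


def prob_marginal(delta):
    """P(.05 < two-tailed p < .10) for a z-test with expected z-score delta."""
    upper_tail = norm.cdf(Z_05 - delta) - norm.cdf(Z_10 - delta)
    lower_tail = norm.cdf(-Z_10 - delta) - norm.cdf(-Z_05 - delta)
    return upper_tail + lower_tail


# Search over a grid of expected z-scores for the most favorable case.
grid = [i / 100 for i in range(0, 501)]
best_delta = max(grid, key=prob_marginal)
best = prob_marginal(best_delta)

print(f"no effect: {prob_marginal(0):.3f}")
print(f"best case: delta = {best_delta:.2f}, probability = {best:.3f}")
print(f"two in a row, best case: {best ** 2:.3f}")
```

Even in the most favorable scenario, a single study lands in the marginally significant range only about one time in eight, so two independent studies doing so in a row is expected less than 2% of the time under these assumptions.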

It is important to point out that in 1988, questionable research practices were not considered questionable. In fact, experimental social psychologists were trained and trained their students to use these practices to get significance (Stangor, 2012). Questionable research practices also explain why the reproducibility project could only replicate 25% of published results in social and personality psychology (OSC, 2015). Thus, it is plausible that QRPs also contributed to the discrepancy between Strack et al.’s original studies and the RRR results.

The OSC reproducibility project estimated that QRPs inflate effect sizes by 100%. Thus, an inflated effect size of d = .5 in Strack et al. (1988) might correspond to a true effect size of d = .25 (.25 real + .25 inflation = .50 observed). Moreover, the inflation increases as p-values get closer to .05. Thus, for a marginally significant result, inflation is likely to be greater than 100% and the true effect size is likely to be even smaller than .25. This suggests that the true effect size could be around d = .2, which is close to the effect size estimate of the “proper” RRR studies identified by Strack.

A meta-analysis of facial feedback studies also produced an average effect size estimate of d = .2, but this estimate is not corrected for publication bias, while the meta-analysis also showed evidence of publication bias.  Thus, the true average effect size is likely to be lower than .2 standard deviations.  Given the heterogeneity of studies in this meta-analysis it is difficult to know which specific paradigms are responsible for this average effect size and could be successfully replicated (Coles & Larsen, 2017).  The reason is that existing studies are severely underpowered to reliably detect effects of this magnitude (Replicability Report of Facial Feedback studies).

The existing evidence suggests that effect sizes of facial feedback studies are somewhere between 0 and .5, but it is impossible to say whether it is just slightly above zero with no practical significance or whether it is of sufficient magnitude so that a proper set of studies can reliably replicate the effect.  In short, 30 years and over 600 citations later, it is still unclear whether facial feedback effects exist and under which conditions this effect can be observed in the laboratory.

Did Fritz Strack use Questionable Research Practices?

It is difficult to demonstrate conclusively that a researcher used QRPs based on a couple of statistical results, but it is possible to examine this question with larger sets of data. Brunner & Schimmack (2018)  developed a method, z-curve, that can reveal the use of QRPs for large sets of statistical tests.  To obtain a large set of studies, I used automatic text extraction of test statistics from articles co-authored by Fritz Strack.  I then applied z-curve to the data.  I limited the analysis to studies before 2010 when social psychologists did not consider QRPs problematic, but rather a form of creativity and exploration.   Of course, I view QRPs differently, but the question whether QRPs are wrong or not is independent of the question whether QRPs were used.


This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t and F-values converted into z-scores) in Strack’s articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed). The high bar of z-scores between 1.8 and 2 shows marginally significant results, as does the next bar with z-scores between 1.6 and 1.8 (1.65 = .05 one-tailed). The drop from 2 to 1.6 is too steep to be compatible with sampling error.
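The correspondence between z-scores and p-values mentioned here can be verified with the standard normal distribution. This is a small sketch using only Python's standard library. The function names are mine, and it is not part of the z-curve method itself.

```python
from statistics import NormalDist

norm = NormalDist()


def p_two_tailed(z):
    """Two-tailed p-value for a z-score."""
    return 2 * (1 - norm.cdf(abs(z)))


def p_one_tailed(z):
    """One-tailed p-value for a z-score in the predicted direction."""
    return 1 - norm.cdf(z)


for z in (1.65, 1.96, 2.5):
    print(f"z = {z:.2f}: two-tailed p = {p_two_tailed(z):.3f}, "
          f"one-tailed p = {p_one_tailed(z):.3f}")
```

A z-score of 1.96 sits at the two-tailed .05 criterion, a z-score of about 1.65 sits at the one-tailed .05 (two-tailed .10) criterion, and a z-score of 2.5 corresponds to a two-tailed p-value of roughly .01.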

The grey line provides a vague estimate of the expected proportion of non-significant results. The so-called file drawer (non-significant results that are not reported) is very large and it is unlikely that so many studies were attempted and failed. For example, it is unlikely that Strack conducted 10 more studies with N = 100 participants and did not report the results of these studies because they produced non-significant results. Thus, it is likely that other QRPs were used that help to produce significance. It is impossible to say how significant results were produced, but the distribution of z-scores strongly suggests that QRPs were used.

The z-curve method provides an estimate of the average power of significant results. The estimate is 63%. This value means that a randomly drawn significant result from one of Strack’s articles has a 63% probability of producing a significant result again in an exact replication study with the same sample size.

This value can be compared with estimates for social psychology using the same method. For example, the average for the Journal of Experimental Social Psychology from 2010-2015 is 59%.  Thus, the estimate for Strack is consistent with general research practices in his field.

One caveat with the z-curve estimate of 63% is that the dataset includes theoretically important and less important tests.  Often the theoretically important tests have lower z-scores and the average power estimates for so-called focal tests are about 20-30 percentage points lower than the estimate for all tests.

A second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies.  The average for studies with just significant results (z = 2 to 2.5) is only 45%.  The estimate for marginally significant results is even lower.

As 50% power corresponds to the criterion for significance, it is possible to use z-curve as a way to adjust p-values for the inflation that is introduced by QRPs. Accordingly, only z-scores greater than 2.5 (~ p = .01) would be significant at the 5% level after correcting for QRPs. However, once QRPs are being used, bias-corrected values are just estimates and validation with actual replication studies is needed. Using this correction, Strack’s original results would not even meet the weak standard of marginal significance.
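The statement that 50% power corresponds to the significance criterion follows from the definition of power for a single test. The sketch below shows this for a two-tailed z-test. It is not an implementation of z-curve, which estimates power from a whole distribution of z-scores, and the function names are mine.

```python
from statistics import NormalDist

norm = NormalDist()
Z_CRIT = norm.inv_cdf(0.975)   # about 1.96, two-tailed alpha = .05


def power(expected_z):
    """Power of a two-tailed z-test when the expected z-score is expected_z."""
    return (1 - norm.cdf(Z_CRIT - expected_z)) + norm.cdf(-Z_CRIT - expected_z)


def expected_z_for_power(target, lo=0.0, hi=10.0):
    """Find the expected z-score that yields the target power (bisection)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if power(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(f"power at expected z = 0:      {power(0):.3f}")
print(f"power at expected z = Z_CRIT: {power(Z_CRIT):.3f}")
print(f"expected z for 63% power:     {expected_z_for_power(0.63):.2f}")
```

When the expected z-score equals the criterion, half of the sampling distribution falls above it, so power is 50%. By the same logic, a single test with 63% power has an expected z-score of about 2.3.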

Final Words

In conclusion, 30 years of research have failed to provide conclusive evidence for or against the facial feedback hypothesis. The lack of progress is largely due to a flawed use of the scientific method.  As journals published only successful outcomes, empirical studies failed to provide empirical evidence for or against a hypothesis derived from a theory.  Social psychologists only recognized this problem in 2011, when Bem was able to provide evidence for an incredible phenomenon of erotic time travel.

There is no simple answer to Fritz Strack’s question “Have I done something wrong?” because there is no objective standard to answer this question.

Did Fritz Strack fail to produce empirical evidence for the facial feedback hypothesis because he used the scientific method wrong?  I believe the answer is yes.

Did Fritz Strack do something morally wrong in doing so? I think the answer is no. He was trained and trained students in the use of a faulty method.

A more difficult question is whether Fritz Strack did something wrong in his response to the replication crisis and the results of the RRR.

We can all make honest mistakes and it is possible that I made some honest mistakes when I wrote this blog post. Science is hard and it is unavoidable to make some mistakes. As the German saying goes, “Wo gehobelt wird, fallen Späne,” which is equivalent to saying “you can’t make an omelette without breaking some eggs.”

Germans also have a saying that translates to “those who work a lot make a lot of mistakes, those who do nothing make no mistakes.” Clearly Fritz Strack did a lot for social psychology, so it is only natural that he also made mistakes. The question is how scientists respond to criticism and discovery of mistakes by other scientists.

The very famous social psychologist Susan Fiske (2015) encouraged her colleagues to welcome humiliation. This seems a bit much, especially for Germans who love perfection and hate making mistakes. However, the danger of a perfectionistic ideal is that criticism can be interpreted as a personal attack with unhealthy consequences. Nobody is perfect and the best way to deal with mistakes is to admit them.

Unfortunately, many eminent social psychologists seem to be unable to admit that they used QRPs and that replication failures of some of their famous findings are to be expected.  It doesn’t require rocket science to realize that p-hacked results do not replicate without p-hacking. So, why is it so hard to admit the truth that everybody knows anyways?

It seems to be human nature to cover up mistakes. Maybe this is an embodied reaction to shame like trying to cover up when a stranger sees us naked.  However, typically this natural response is worse and it is better to override it to avoid even more severe consequences. A good example is Donald Trump. Surely, having sex with a pornstar is a questionable behavior for a married man, but this is no longer the problem for Donald Trump.  His presidency may end early not because he had sex with Stormy Daniels, but because he lied about it.  As the saying goes, the cover-up is often worse than the crime. Maybe there is even a social psychological experiment to prove it, p = .04.

P.S. There is also a difference between not doing something wrong and doing something right. Fritz, you can still do the right thing and retract your questionable statement about reverse p-hacking and help new generations to avoid some of the mistakes of the past.




Charles Stangor’s Failed Attempt to Predict the Future


It is 2018, and 2012 is a faint memory. So much has happened in the world and in psychology over the past six years.

Two events rocked Experimental Social Psychology (ESP) in the year 2011 and everybody was talking about the implications of these events for the future of ESP.

First, Daryl Bem had published an incredible article that seemed to suggest humans, or at least extraverts, have the ability to anticipate random future events (e.g., where an erotic picture would be displayed).

Second, it was discovered that Diederik Stapel had fabricated data for several articles. Several years later, over 50 articles have been retracted.

Opinions were divided about the significance of these two events for experimental social psychology.  Some psychologists suggested that these events are symptomatic of a bigger crisis in social psychology.  Others considered these events as exceptions with little consequences for the future of experimental social psychology.

In February 2012, Charles Stangor tried to predict how these events would shape the future of experimental social psychology in an essay titled “Rethinking my Science.”

How will social and personality psychologists look back on 2011? With pride at having continued the hard work of unraveling the mysteries of human behavior, or with concern that the only thing that is unraveling is their discipline?

Stangor’s answer is clear.

“Although these two events are significant and certainly deserve our attention, they are flukes rather than game-changers.”

He describes Bem’s article as a “freak event” and Stapel’s behavior as a “fluke.”

“Some of us probably do fabricate data, but I imagine the numbers are relatively few.”

Stangor is confident that experimental social psychology is not really affected by these two events.

As shocking as they are, neither of these events create real problems for social psychologists

In a radical turn, Stangor then suggests that experimental social psychology will change, not in response to these events, but in response to three other articles.

But three other papers published over the past two years must completely change how we think about our field and how we must conduct our research within it. And each is particularly important for me, personally, because each has challenged a fundamental assumption that was part of my training as a social psychologist.

Student Samples

The first article is a criticism of experimental social psychology for relying too much on first-year college students as participants (Henrich, Heine, & Norenzayan, 2010). Looking back, there is no evidence that US American psychologists have become more global in their research interests. One reason is that social phenomena are sensitive to the cultural context and for Americans it is more interesting to study how online dating is changing relationships than to study arranged marriages in more traditional cultures. There is nothing wrong with a focus on a particular culture. It is not even clear that research articles on prejudice against African Americans were supposed to generalize to the world (how would this research apply to African countries where the vast majority of citizens are black?).

The only change that occurred was not in response to Henrich et al.’s (2010) article, but in response to technological changes that made it easier to conduct research and pay participants online. Many social psychologists now use the online service MTurk to recruit participants.

Thus, I don’t think this article significantly changed experimental social psychology.

Decline Effect 

The second article, titled “The Truth Wears Off,” was published in the weekly magazine The New Yorker. It made the ridiculous claim that true effects become weaker or may even disappear over time.

The basic phenomenon is that observed findings in the social and biological sciences weaken with time. Effects that are easily replicable at first become less so every day. Drugs stop working over time the same way that social psychological phenomena become more and more elusive. The “the decline effect” or “the truth wears off effect,” is not easy to dismiss, although perhaps the strength of the decline effect will itself decline over time.

The assumption that the decline effect applies to real effects is no more credible than Bem’s claims of time-reversed causality.   I am still waiting for the effect of eating cheesecake on my weight (a biological effect) to wear off. My bathroom scale tells me it is not.

Why would Stangor believe in such a ridiculous idea?  The answer is that he observed it many times in his own work.

Frankly I have difficulty getting my head around this idea (I’m guessing others do too) but it is nevertheless exceedingly troubling. I know that I need to replicate my effects, but am often unable to do it. And perhaps this is part of the reason. Given the difficulty of replication, will we continue to even bother? And what becomes of our research if we do even less replicating than we do now? This is indeed a problem that does not seem likely to go away soon. 

In hindsight, it is puzzling that Stangor misses the connection between Bem’s (2011) article and the decline effect. Bem published 9 successful results with p < .05. This is not a fluke. The probability of getting lucky 9 times in a row with a probability of just 5% for a single event is very, very small (less than 1 in a billion attempts). Bem also did not fabricate data like Stapel, but he falsified data to present results that are too good to be true (Definitions of Research Misconduct). Not surprisingly, neither he nor others can replicate these results in transparent studies that prevent the use of QRPs (just like paranormal phenomena like spoon bending cannot be replicated in transparent experiments that prevent fraud).
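The arithmetic behind this claim is simple. The sketch assumes nine independent studies, no true effect, and exactly a 5% chance of a significant result in each study. The variable names are mine.

```python
alpha = 0.05      # chance of a significant result in one study without a true effect
k = 9             # number of significant results in a row

p_all = alpha ** k
print(f"P(all {k} significant by luck alone) = {p_all:.2e}")
print(f"one in a billion would be            = {1e-9:.2e}")
```

The probability is on the order of 2 in a trillion, so “less than 1 in a billion” is, if anything, an understatement.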

The decline effect is real, but it is wrong to misattribute it to a decline in the strength of a true phenomenon. The decline effect occurs when researchers use questionable research practices (John et al., 2012) to fabricate statistically significant results. Questionable research practices inflate “observed effect sizes” [a misnomer because effects cannot be observed]; that is, the observed mean differences between groups in an experiment. Unfortunately, social psychologists do not distinguish between “observed effect sizes” and true or population effect sizes. As a result, they believe in a mysterious force that can reduce true effect sizes when sampling error moves mean differences in small samples around.
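A small simulation can illustrate how selection for significance alone produces a “decline effect.” The setup is hypothetical and mine: a true effect of d = .2, 20 participants per group, a normal approximation for the sampling distribution of d, and a literature that publishes only significant results in the predicted direction. It models selective publication only, not any other questionable research practice.

```python
import math
import random
from statistics import mean


def simulate(true_d=0.2, n_per_group=20, n_studies=20000, seed=1):
    """Return mean d of all studies, mean d of published studies, publication rate."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)          # approximate standard error of d
    observed = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # Only studies with z > 1.96 in the predicted direction get published.
    published = [d for d in observed if d / se > 1.96]
    return mean(observed), mean(published), len(published) / n_studies


all_mean, published_mean, share = simulate()
print(f"mean d across all studies:       {all_mean:.2f}")
print(f"mean d across published studies: {published_mean:.2f}")
print(f"share of studies published:      {share:.2f}")
```

In this setup the published studies show an average effect several times larger than the true effect, while honest replications would scatter around d = .2. Nothing wore off. The published estimates were inflated from the start.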

In conclusion, the truth does not wear off because there was no truth to begin with. Bem’s (2011) results did not show a real effect that wore off in replication studies. The effect was never there to begin with.


The third article mentioned by Stangor did change experimental social psychology. In this article, Simmons, Nelson, and Simonsohn (2011) demonstrate the statistical tricks experimental social psychologists have used to produce statistically significant results. They call these tricks p-hacking. All methods of p-hacking have one common feature. Researchers conduct multiple statistical analyses and check the results. When they find a statistically significant result, they stop analyzing the data and report the significant result. There is nothing wrong with this practice so far, but it essentially constitutes research misconduct when the result is reported without fully disclosing how many attempts were made to get it. The failure to disclose all attempts is deceptive because the reported result (p < .05) is only valid if a researcher collected data and then conducted a single test of a hypothesis (it does not matter whether this hypothesis was made before or after data collection). The point is that at the moment a researcher presses a mouse button or a key on a keyboard to see a p-value, a statistical test occurred. If this p-value is not significant and another test is run to look at another p-value, two tests are conducted and the risk of a type-I error is greater than 5%. It is no longer valid to claim p < .05 if more than one test was conducted. With extreme abuse of the statistical method (p-hacking), it is possible to get a significant result even with randomly generated data.
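The inflation of the type-I error rate is easy to demonstrate. The sketch below shows two things under assumptions of my own choosing: the error rate for k independent tests (a simple formula), and a simulation of one specific form of p-hacking, namely testing repeatedly as data come in and stopping at the first significant result. The data are pure noise, the looks at 20, 30, 40, and 50 participants per group are hypothetical, and a z-test with known variance is used to keep the code short.

```python
import math
import random


def familywise_error(k, alpha=0.05):
    """Chance of at least one significant result in k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k


def optional_stopping_rate(looks=(20, 30, 40, 50), n_sims=4000, seed=1):
    """Share of null studies that reach p < .05 at any of several looks at the data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        group_a, group_b = [], []
        for n in looks:
            # Collect more data until each group has n observations.
            while len(group_a) < n:
                group_a.append(rng.gauss(0, 1))
                group_b.append(rng.gauss(0, 1))
            diff = sum(group_a) / n - sum(group_b) / n
            z = diff / math.sqrt(2 / n)      # known variance of 1 in both groups
            if abs(z) > 1.96:
                hits += 1                    # stop and "report" the significant result
                break
    return hits / n_sims


rate = optional_stopping_rate()
print(f"two independent tests: {familywise_error(2):.3f}")
print(f"four looks at the same data: {rate:.3f}")
```

Two independent tests already push the error rate to nearly 10%, and peeking at the same accumulating data four times roughly doubles the nominal 5% rate in this setup, even though there is no effect at all.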

In 2010, the Publication Manual of the American Psychological Association advised researchers that “omitting troublesome observations from reports to present a more convincing story is also prohibited” (APA).  It is telling that Stangor does not mention this section as a game-changer, because it has been widely ignored by experimental psychologists until this day.  Even Bem’s (2011) article that was published in an APA journal violated this rule, but it has not been retracted or corrected so far.

The p-hacking article had a strong effect on many social psychologists, including Stangor.

Its fundamental assertions are deep and long-lasting, and they have substantially affected me. 

Apparently, social psychologists were not aware that some of their research practices undermined the credibility of their published results.

Although there are many ways that I take the comments to heart, perhaps most important to me is the realization that some of the basic techniques that I have long used to collect and analyze data – techniques that were taught to me by my mentors and which I have shared with my students – are simply wrong.

I don’t know about you, but I’ve frequently “looked early” at my data, and I think my students do too. And I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.” Over the years my students have asked me about these practices (“What do you recommend, Herr Professor?”) and I have routinely, but potentially wrongly, reassured them that in the end, truth will win out.

Although it is widely recognized that many social psychologists p-hacked and buried studies that did not work out, Stangor’s essay remains one of the few open admissions that these practices were used. They were not considered unethical, at least until 2010. In fact, social psychologists were taught that telling a good story was essential (Bem, 2001).

In short, this important paper will – must – completely change the field. It has shined a light on the elephant in the room, which is that we are publishing too many Type-1 errors, and we all know it.

Whew! What a year 2011 was – let’s hope that we come back with some good answers to these troubling issues in 2012.

In hindsight, Stangor was right about the p-hacking article. It has been cited over 1,000 times so far and the term p-hacking is widely used for methods that essentially constitute a violation of research ethics.  P-values are only meaningful if all analyses are reported, and failure to disclose analyses that produced inconvenient non-significant results in order to tell a more convincing story constitutes research misconduct according to the guidelines of APA and the HHS.

Charles Stangor’s Z-Curve

Stangor’s essay is valuable in many ways.  One important contribution is the open admission to the use of QRPs before the p-hacking article made Stangor realize that doing so was wrong.   I have been working on statistical methods to reveal the use of QRPs.  It is therefore interesting to see the results of this method when it is applied to data by a researcher who used QRPs.


This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t- and F-values converted into z-scores) in Stangor’s articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed).  The steep drop on the left shows that Stangor rarely reported marginally significant results (p = .05 to .10).  It also shows the use of questionable research practices because sampling error should produce a larger number of non-significant results than are actually observed. The grey line provides a vague estimate of the expected proportion of non-significant results. The so-called file-drawer (non-significant results that are not reported) is very large.  It is unlikely that so many studies were attempted and not reported. As Stangor mentions, he also used p-hacking to get significant results.  P-hacking can produce just significant results without conducting many studies.
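
The correspondence between two-tailed p-values and z-scores used in the figure can be checked with a short sketch. It relies only on Python’s standard library; the function names are illustrative, and the actual z-curve software may carry out the conversion differently.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_z(p_two_tailed):
    """Convert a two-tailed p-value into an absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)


def z_to_p(z):
    """Convert a z-score into a two-tailed p-value."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    for p in (0.10, 0.05, 0.01):
        print(f"p = {p:.2f} -> z = {p_to_z(p):.2f}")
```

Marginally significant results (p = .05 to .10) fall between z-scores of roughly 1.64 and 1.96, which is the region just to the left of the mode where the histogram drops steeply.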

In short, the graph is consistent with Stangor’s account that he used QRPs in his research, which was common practice and even encouraged, and did not violate any research ethics code of the times (Bem, 2001).

The graph also shows that the significant studies have an estimated average power of 71%.  This means any randomly drawn statistically significant result from Stangor’s articles has a 71% chance of producing a significant result again, if the study and the statistical test were replicated exactly (see Brunner & Schimmack, 2018, for details about the method).  This average is not much below the 80% value that is considered good power.

There are two caveats with the 71% estimate. One caveat is that this graph uses all statistical tests that are reported, but not all of these tests are interesting. Other datasets suggest that the average for focal hypothesis tests is about 20-30 percentage points lower than the estimate for all tests. Nevertheless, an average of 71% is above average for social psychology.

The second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies.  The average for studies with just significant results (z = 2 to 2.5) is only 49%.  It is possible to use the information from this graph to reexamine Stangor’s articles and to adjust nominal p-values.  According to this graph p-values in the range between .05 and .01 would not be significant because 50% power corresponds to a p-value of .05. Thus, all of the studies with a z-score of 2.5 or less (~ p > .01) would not be significant after correcting for the use of questionable research practices.
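
The statement that 50% power corresponds to a p-value of .05 can be illustrated with a small sketch. It assumes a two-tailed z-test and treats a z-score as the expected (true) strength of evidence of a study; it is not the z-curve estimation method itself, and the function name is illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_from_z(expected_z, alpha=0.05):
    """Probability of a significant two-tailed z-test when the
    expected z-score of the study is expected_z."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Significant in the predicted direction.
    upper = 1 - STD_NORMAL.cdf(crit - expected_z)
    # Significant in the opposite direction (negligible unless z is small).
    lower = STD_NORMAL.cdf(-crit - expected_z)
    return upper + lower


if __name__ == "__main__":
    for z in (0.0, 1.96, 2.5, 2.8):
        print(f"expected z = {z:.2f} -> power = {power_from_z(z):.2f}")
```

A study whose expected z-score equals the critical value of 1.96 has a power of about 50%, because sampling error pushes the observed z-score below the criterion about half of the time. A power of 80% requires an expected z-score of about 2.8.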

The main conclusion that can be drawn from this analysis is that the statistical analysis of Stangor’s reported results shows convergent validity with the description of his research practices.  If test statistics by other researchers show a similar (or worse) distribution, it is likely that they also used questionable research practices.

Charles Stangor’s Response to the Replication Crisis 

Stangor was no longer an active researcher when the replication crisis started. Thus, it is impossible to see changes in actual research practices.  However, Stangor co-edited a special issue for the Journal of Experimental Social Psychology on the replication crisis.

The Introduction mentions the p-hacking article.

At the same time, the empirical approaches adopted by social psychologists leave room for practices that distort or obscure the truth (Hales, 2016-in this issue; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)

and that

social psychologists need to do some serious housekeeping in order to progress as a scientific enterprise.

It quotes Dovidio to claim that social psychologists are

lucky to have the problem. Because social psychologists are rapidly developing new approaches and techniques, our publications will unavoidably contain conclusions that are uncertain, because the potential limitations of these procedures are not yet known. The trick then is to try to balance “new” with “careful.”

It also mentions the problem of fabricating stories by hiding unruly non-significant results.

The availability of cheap data has a downside, however, which is that there is little cost in omitting data that contradict our hypotheses from our manuscripts (John et al., 2012). We may bury unruly data because it is so cheap and plentiful. Social psychologists justify this behavior, in part, because we think conceptually. When a manipulation fails, researchers may simply argue that the conceptual variable was not created by that particular manipulation and continue to seek out others that will work. But when a study is eventually successful, we don’t know if it is really better than the others or if it is instead a Type I error. Manipulation checks may help in this regard, but they are not definitive (Sigall & Mills, 1998).

It also mentions file-drawers with unsuccessful studies like the one shown in the Figure above.

Unpublished studies likely outnumber published studies by an order of magnitude. This is wasteful use of research participants and demoralizing for social psychologists and their students.

It also mentions that governing bodies have failed to crack down on the use of p-hacking and other questionable practices, but it does not mention the APA guidelines.

There is currently little or no cost to publishing questionable findings

It foreshadows calls for a more stringent criterion of statistical significance, known as the p-value wars (alpha  = .05 vs. alpha = .005 vs. justify your alpha vs. abandon alpha)

Researchers base statistical analyses on the standard normal distribution but the actual tails are probably bigger than this approach predicts. It is clear that p < .05 is not enough to establish the credibility of an effect. For example, in the Reproducibility Project (Open Science Collaboration, 2015), only 18% of studies with a p-value greater than .04 replicated whereas 63% of those with a p-value less than .001 replicated. Perhaps we should require, at minimum, p < .01

It is not clear why we should settle for p < .01, if only 63% of results replicated with p < .001. Moreover, the proposal ignores that a more stringent criterion for significance also increases the risk of type-II error (Cohen).  It also ignores that only two studies are required to reduce the risk of a type-I error from .05 to .05*.05 = .0025.  As many articles in experimental social psychology are based on multiple cheap studies, the nominal type-I error rate is well below .001.  The real problem is that the reported results are not credible because QRPs are used (Schimmack, 2012).  A simple and effective way to improve experimental social psychology would be to enforce the APA ethics guidelines and hold violators of these rules accountable for their actions.  However, although no new rules would need to be created, experimental social psychologists are unable to police themselves and continue to use QRPs.
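
Both points can be made concrete with a rough sketch. It assumes independent two-tailed z-tests with honest reporting of all studies; the function names and the example study (a study with 80% power at alpha = .05, that is, an expected z-score of about 2.8) are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_false_positive_rate(alpha, k):
    """Chance that k independent, fully reported tests of a true null
    hypothesis are all significant."""
    return alpha ** k


def power(expected_z, alpha):
    """Power of a two-tailed z-test for a study with the given expected z-score."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(crit - expected_z)) + STD_NORMAL.cdf(-crit - expected_z)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"{k} significant studies: joint alpha = {joint_false_positive_rate(0.05, k):.6f}")
    for alpha in (0.05, 0.01, 0.005):
        # Type-II error risk is 1 - power.
        print(f"alpha = {alpha}: power = {power(2.8, alpha):.2f}")
```

With three honestly reported studies the joint type-I error risk is .05^3 = .000125, well below .001. In contrast, lowering alpha to .005 in a single study cuts the power of the example study from about 80% to about 50%, so the type-II error risk rises from about 20% to about 50%.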

The Introduction ignores this valid criticism of multiple-study articles and continues to give the misleading impression that more studies translate into more replicable results.  However, the Open-Science Collaboration reproducibility project showed no evidence that long, multiple-study articles reported more replicable results than shorter articles in Psychological Science.

In addition, replication concerns have mounted with the editorial practice of publishing short papers involving a single, underpowered study demonstrating counterintuitive results (e.g., Journal of Experimental Social Psychology; Psychological Science; Social Psychological and Personality Science). Publishing newsworthy results quickly has benefits, but also potential costs (Ledgerwood & Sherman, 2012), including increasing Type 1 error rates (Stroebe, 2016-in this issue).

Once more, the problem is dishonest reporting of results.  A risky study can be published, and a true type-I error rate of 20% informs readers that there is a high risk of a false positive result. In contrast, nine studies with a misleading nominal type-I error rate of 5% violate readers’ implicit assumption that a scientific research article reports the results of an objective test of a scientific question.

But things get worse.

We do, of course, understand the value of replication, and publications in the premier social-personality psychology journals often feature multiple replications of the primary findings. This is appropriate, because as the number of successful replications increases, our confidence in the finding also increases dramatically. However, given the possibility of p-hacking (Head, Holman, Lanfear, Kahn, & Jennions, 2015; Simmons et al., 2011) and the selective reporting of data, replication is a helpful but imperfect gauge of whether an effect is real.

Just as Stangor dismissed Bem’s multiple-study article in JPSP as a fluke that does not require further attention, he dismisses evidence that QRPs were used to p-hack other multiple-study articles (Schimmack, 2012).  Ignoring this evidence is just another violation of research ethics. The data that are being omitted here are articles that contradict the story that an author wants to present.

And it gets worse.

Conceptual replications have been the field’s bread and butter, and some authors of the special issue argue for the superiority of conceptual over exact replications (e.g. Crandall & Sherman, 2016-in this issue; Fabrigar and Wegener, 2016–in this issue; Stroebe, 2016-in this issue).  The benefits of conceptual replications are many within social psychology, particularly because they assess the robustness of effects across variation in methods, populations, and contexts. Constructive replications are particularly convincing because they directly replicate an effect from a prior study as exactly as possible in some conditions but also add other new conditions to test for generality or limiting conditions (Hüffmeier, 2016-in this issue).

Conceptual replication is a euphemism for storytelling or, as Sternberg calls it, creative HARKing (Sternberg, in press).  Stangor explained earlier how an article with several conceptual replication studies is constructed.

I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.”

This is how Bem advised generations of social psychologists to write articles and that is how he wrote his 2011 article that triggered awareness of the replicability crisis in social psychology.

There is nothing wrong with doing multiple studies and examining conditions that make an effect stronger or weaker.  However, it is pseudo-science if such a program of research reports only successful results because reporting only successes renders statistical significance meaningless (Sterling, 1959).

The miraculous conceptual replications of Bem (2011) are even more puzzling in the context of social psychologists’ conviction that their effects can decrease over time (Stangor, 2012) or change dramatically from one situation to the next.

Small changes in social context make big differences in experimental settings, and the same experimental manipulations create different psychological states in different times, places, and research labs (Fabrigar and Wegener, 2016–in this issue). Reviewers and editors would do well to keep this in mind when evaluating replications.

How can effects be sensitive to context when the success rate in published articles is 95%?

And it gets worse.

Furthermore, we should remain cognizant of the fact that variability in scientists’ skills can produce variability in findings, particularly for studies with more complex protocols that require careful experimental control (Baumeister, 2016-in this issue). 

Baumeister is one of the few other social psychologists who has openly admitted not disclosing failed studies.  He also pointed out that in 2008 this practice did not violate APA standards.  However, in 2016 a major replication project failed to replicate the ego-depletion effect that he first “demonstrated” in 1998.  In response to this failure, Baumeister claimed that he had produced the effect many times, suggesting that he has some capabilities that researchers who fail to show the effect lack (in his contribution to the special issue in JESP he calls this ability “flair”).  However, he failed to mention that many of his attempts failed to show the effect and that his high success rate in dozens of articles can only be explained by the use of QRPs.

While there is ample evidence for the use of QRPs, there is no empirical evidence for the claim that research expertise matters.  Moreover, most of the research is carried out by undergraduate students supervised by graduate students and the expertise of professors is limited to designing studies and not to actually carrying out studies.

In the end, the Introduction also comments on the process of correcting mistakes in published articles.

Correctors serve an invaluable purpose, but they should avoid taking an adversarial tone. As Fiske (2016–this issue) insightfully notes, corrective articles should also include their own relevant empirical results — themselves subject to

This makes no sense. If somebody writes an article and claims to find an interaction effect based on a significant result in one condition and a non-significant result in another condition, the article makes a statistical mistake (Gelman & Stern, 2005). If a pre-registration contains the statement that an interaction is predicted and a published article claims an interaction is not necessary, the article misrepresents the nature of the preregistration.  Correcting mistakes like this is necessary for science to be a science.  No additional data are needed to correct factual mistakes in original articles (see, e.g., Carlsson, Schimmack, Williams, & Bürkner, 2017).
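
The statistical mistake can be illustrated with a rough sketch of the logic in Gelman and Stern (2005). The p-values of .04 and .11 are purely illustrative and do not come from a specific article. The sketch assumes two independent effects in the same direction that were estimated with the same standard error; the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_p(p_two_tailed):
    """Absolute z-score that corresponds to a two-tailed p-value."""
    return STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)


def p_for_difference(p_a, p_b):
    """Two-tailed p-value for the difference between two independent effects.

    Assumes both effects have the same sign and the same standard error,
    so the difference between the two z-scores has a standard error of sqrt(2).
    """
    z_diff = (z_from_p(p_a) - z_from_p(p_b)) / math.sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(abs(z_diff)))


if __name__ == "__main__":
    # Hypothetical example: significant in one condition, not in the other.
    print(round(p_for_difference(0.04, 0.11), 2))
```

Under these assumptions, the test of the difference between a significant effect (p = .04) and a non-significant effect (p = .11) yields a p-value of about .75. A significant result in one condition and a non-significant result in another condition is therefore no evidence for an interaction; the interaction has to be tested directly.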

Moreover, Fiske has been inconsistent in her assessment of psychologists who have been motivated by the events of 2011 to improve psychological science.  On the one hand, she has called these individuals “method terrorists” (2016 review).  On the other hand, she suggests that psychologists should welcome humiliation that may result from the public correction of a mistake in a published article.


In 2012, Stangor asked “How will social and personality psychologists look back on 2011?” Six years later, it is possible to provide at least a temporary answer. There is no unified response.

The main response by older experimental social psychologists has been denial along the lines of Stangor’s initial response to Stapel and Bem.  Despite massive replication failures and criticism, including criticism by Nobel Laureate Daniel Kahneman, no eminent social psychologist has responded to the replication crisis with an admission of mistakes.  In contrast, the list of eminent social psychologists who stand by their original findings despite evidence for the use of QRPs and replication failures is long and is growing every day as replication failures accumulate.

The response by some younger social psychologists has been to nudge social psychologists slowly towards improving their research methods, mainly by handing out badges for preregistrations of new studies.  Although preregistration makes it more difficult to use questionable research practices, it is too early to see how effective preregistration is in making published results more credible.  Another initiative is to conduct replication studies. The problem with this approach is that the outcome of replication studies can be challenged and so far these studies have not resulted in a consensual correction in the scientific literature. Even articles that reported studies that failed to replicate continue to be cited at a high rate.

Finally, some extremists are asking for more radical changes in the way social psychologists conduct research, but these extremists are dismissed by most social psychologists.

It will be interesting to see how social psychologists, funding agencies, and the general public will look back on 2011 in 2021.  In the meantime, social psychologists have to ask themselves how they want to be remembered and new investigators have to examine carefully where they want to allocate their resources.  The published literature in social psychology is a minefield and nobody knows which studies can be trusted and which cannot.

I don’t know about you, but I am looking forward to reading the special issues in 2021 in celebration of the 10-year anniversary of Bem’s groundbreaking, or should I say earth-shattering, publication of “Feeling the Future.”

Estimating Reproducibility of Psychology (No. 111): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

The article examines anchoring effects.  When individuals are uncertain about a quantity (the price of a house, the height of Mount Everest), their estimates can be influenced by some arbitrary prior number.  The anchoring effect is a robust phenomenon that has been replicated in one of the first large multi-lab replication projects (Klein et al., 2014).

This article, titled “Precision of the anchor influences the amount of adjustment,” tested in six studies the hypothesis that the anchoring effect is larger (or the adjustment effect is smaller) if the anchor is precise than if it is rounded.


Study 1

43 students participated in this study that manipulated the type of anchor between subjects (rounded, precise over, precise under; n ≈ 14 per cell).

The main effect of the manipulation was significant, F(2,40) = 10.94.

Study 2

85 students participated in this study.  It manipulated anchor (rounded vs. precise under) and the range of plausible values (narrow vs. broad).

The study replicated a main effect of anchor, F(1,81) = 22.23.

Study 3

45 students participated in this study.

Study 3 added a condition with information that made the rounded anchor more credible.

The results were significant, F(2,42) = 23.07.  A follow up test showed that participants continued to be more influenced by a precise anchor than by a rounded anchor even with additional information that the rounded anchor was credible, F(1, 42) = 20.80.

Study 4a 

This study was picked for the replication attempt.

As the motivation to adjust increases and the number of units of adjustment increases correspondingly, the amount of adjustment on the coarse-resolution scale should increase at a faster rate than the amount of adjustment on the fine-resolution scale (i.e., motivation to adjust and scale resolution should interact).

The high-motivation-to-adjust condition was created by removing information from the scenarios used in Experiment 2 (the scenarios from Experiment 2 were used without alteration in the low-motivation-to-adjust condition). For example, sentences in the plasma-TV scenario that encouraged a slight adjustment (“items are priced very close to their actual cost . . . actual cost would be only slightly less than $5,000”) were replaced with a sentence that encouraged more adjustment (“What is your estimate of the TV’s actual cost?”).

The width of the scale unit was manipulated with the precision of the anchor (i.e., rounded anchor for broad width and precise anchor for narrow width).

Study 4a had 59 participants.

Study 4a was similar to Study 1, with the precision of the anchor serving as the manipulation of the width of the scale.

Study 4a showed an interaction between the motivation to adjust condition and the Scale Width manipulation, F(1, 55) = 6.88.

Study 4b

Study 4b  had 149 participants and also showed a significant result, F(1,145) = 4.01, p = .047.

Study 5 

Study 5 used home-sales data for 12,581 home sales.  The study found a significant effect of list-price precision on the sales price, F(1, 12577) = 23.88, with list price as a covariate.


In conclusion, all of the results showed strong statistical evidence against the null hypothesis except for the pair of studies 4a and 4b.  It is remarkable that this close replication study produced a just significant result with a much larger sample size than Study 4a (149 vs. 59).  This pattern of results suggests that the sample size is not independent of the result and that the evidence for this effect could be exaggerated by the use of optional stopping (collecting more data until p < .05).
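
The p-values for the two critical tests can be recomputed from the reported F-values with a minimal sketch. It works only for tests with one numerator degree of freedom, because then F = t^2 and the p-value can be obtained from the t-distribution; the density is integrated numerically with Simpson’s rule so that only Python’s standard library is needed. The function names and the number of integration steps are illustrative choices.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def p_from_f_1df(f_value, df_error, steps=2000):
    """Two-tailed p-value for F(1, df_error), using F = t**2.

    Integrates the t density from 0 to t with Simpson's rule
    (steps must be even) and converts the area into a p-value.
    """
    t = math.sqrt(f_value)
    h = t / steps
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    area = total * h / 3  # P(0 < T < t)
    return 1 - 2 * area


if __name__ == "__main__":
    # F-values and error degrees of freedom as reported for Studies 4a and 4b.
    for label, f_value, df_error in (("Study 4a", 6.88, 55), ("Study 4b", 4.01, 145)):
        print(label, round(p_from_f_1df(f_value, df_error), 3))
```

For Study 4b the sketch should reproduce the reported p = .047 up to rounding, and for Study 4a it gives a p-value of about .01. The tests with two numerator degrees of freedom (e.g., F(2,40) = 10.94 in Study 1) require the F-distribution and are not covered by this shortcut.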

Replication Study

The replication study did not use Study 5 which was not an experiment.  Study 4a was chosen over 4b because it used a more direct manipulation of motivation.

The replication report states the goal of the replication study as replicating two main effects and the interaction.

The results show a main effect of the motivation manipulation, F(1,116) = 71.06, a main effect of anchor precision, F(1,116) = 6.28, but no significant interaction, F < 1.

The data form shows the interaction effect as the main result in the original study, F(1, 55) = 6.88, but the effect is miscoded as a main effect.  The replication result is entered as the main effect for the anchor precision manipulation, F(1, 116) = 6.28 and this significant result is scored as a successful replication of the original study.

However, the key finding in the original article was the interaction effect.  No statistical tests of main effects are reported.

In Experiment 4a, there was a Motivation to Adjust x Scale Unit Width interaction, F(1, 55) = 6.88, prep = .947, omega2 = .02. The difference in the amount of adjustment between the rounded and precise-anchor conditions increased as the motivation to adjust went from low (Mprecise = -0.76, Mrounded = -0.23, Mdifference = 0.53), F(1, 55) = 15.76, prep = .994, omega2 = .06, to high (Mprecise = -0.04, Mrounded = 0.98, Mdifference = 1.02), F(1, 55) = 60.55, prep = .996, omega2 = .25.

This leads me to the conclusion that the successful replication of this study is a coding mistake. The critical interaction was not replicated.







Ethical Challenges for Psychological Scientists

Psychological scientists are human and like all humans they can be tempted to violate social norms (Fiske, 2015).  To help psychologists to conduct ethical research, professional organizations have developed codes of conduct (APA).  These rules are designed to help researchers to resist temptations to engage in unethical practices such as fabrication or falsification of data (Pain, Science, 2008).

Psychological science has ignored the problem of research integrity for a long time. The Association for Psychological Science (APS) still does not have formal guidelines about research misconduct (APS, 2016).

Two eminent psychologists recently edited a book with case studies that examine ethical dilemmas for psychological scientists (Sternberg & Fiske, 2015).  Unfortunately, this book lacks moral fiber and fails to discuss recent initiatives to address the lax ethical standards in psychology.

Many of the brief chapters in this book are concerned with unethical behaviors of students, ethics in clinical settings, or the ethics of conducting research with animals or human participants.  These chapters have no relevance for the current debates about improving psychological science.  Nevertheless, a few chapters do address these issues and these chapters show how poorly prepared eminent psychologists are to address an ethical crisis that threatens the foundation of psychological science.

Chapter 29
Desperate Data Analysis by a Desperate Job Candidate Jonathan Haidt

Pursuing a career in science is risky and getting an academic job is hard. After a two-year funded post-doc, I didn’t have a job for one year and I worked hard to get more publications.  Jonathan Haidt was in a similar situation.  He didn’t get an academic job after his first post-doc and was lucky to get a second post-doc, but he needed more publications.

He was interested in the link between feelings of disgust and moral judgments.  A common way to demonstrate causality in experimental social psychology is to use an incidental manipulation of the cause (disgust) and to show that the manipulation has an effect on a measure of the effect (moral judgments).

“I was looking for carry-over effects of disgust”

In the chapter, JH tells readers about the moral dilemma when he collected data and the data analysis showed the predicted pattern, but it was not statistically significant. This means the evidence was not strong enough to be publishable.  He carefully looked at the data and saw several outliers.  He came up with various reasons to exclude some. Many researchers have been in the same situation, but few have told their story in a book.

I knew I was doing this post hoc, and that it was wrong to do so. But I was so confident that the effect was real, and I had defensible justifications! I made a deal with myself: I would go ahead and write up the manuscript now, without the outliers, and while it was under review I would collect more data, which would allow me to get the result cleanly, including all outliers.

This account contradicts various assertions by psychological scientists that they did not know better or that questionable research practices just happen without intent. JH’s story is much more plausible. He needed publications to get a job. He had a promising dataset and all he was doing was eliminating a few outliers to beat an arbitrary criterion of statistical significance.  So what if the p-value was .11 with the three cases included? The difference between p = .04 and p = .11 is not statistically significant.  Plus, he was not going to rely on these results. He would collect more data.  Surely, there was a good reason to bend the rules slightly or, as Sternberg (2015) calls it, going a couple of miles over the speed limit.  Everybody does it.  JH realized that his behavior was unethical; it just was not significantly unethical (Sternberg, 2015).

“Decide That the Ethical Dimension Is Significant. If one observes a driver going one mile per hour over the speed limit on a highway, one is unlikely to become perturbed about the unethical behavior of the driver, especially if the driver is oneself.” (Sternberg, 2015).

So what if JH was speeding a little bit to get an academic job? He wasn’t driving 80 miles per hour in front of an elementary school like Diederik Stapel, who just made up data.  But that is not how this chapter ends.  JH tells us that he never published the results of this study.

Fortunately, I ended up recruiting more participants before finishing the manuscript, and the new data showed no trend whatsoever. So I dropped the whole study and felt an enormous sense of relief. I also felt a mix of horror and shame that I had so blatantly massaged my data to make it comply with my hopes.

What vexes me about this story is that Jonathan Haidt is known for his work on morality and disgust and published a highly cited (> 2,000 citations in WebofScience) article that suggested disgust does influence moral judgments.

Wheatley and Haidt (2001) manipulated somatic markers even more directly. Highly hypnotizable participants were given the suggestion, under hypnosis, that they would feel a pang of disgust when they saw either the word take or the word often.  Participants were then asked to read and make moral judgments about six stories that were designed to elicit mild to moderate disgust, each of which contained either the word take or the word often. Participants made higher ratings of both disgust and moral condemnation about the stories containing their hypnotic disgust word. This study was designed to directly manipulate the intuitive judgment link (Link 1), and it demonstrates that artificially increasing the strength of a gut feeling increases the strength of the resulting moral judgment (Haidt, 2001, Psychological Review). 

A more detailed report of these studies was published a few years later (Wheatley & Haidt, 2005).  Study 1 reported a significant difference between the disgust-hypnosis group and the control group, t(44) = 2.41, p = .020.  Study 2 produced a marginally significant result that was significant in a non-parametric test.

For the morality ratings, there were substantially more outliers (in both directions) than in Experiment 1 or for the other ratings in this experiment. As the paired-samples t test loses power in the presence of outliers, we used its non-parametric analogue, the Wilcoxon signed-rank test, as well (Hollander & Wolfe, 1999). Participants judged the actions to be more morally wrong when their hypnotic word was present (M = 73.4) than when it was absent (M = 69.6), t(62) = 1.74, p = .09, Wilcoxon Z = 2.18, p < .05.

Although JH’s account of his failed study suggests he acted ethically, the same story also reveals that he did have at least one study that failed to provide support for the moral disgust hypothesis and that was not mentioned in his Psychological Review article.  Disregarding an entire study that ultimately did not support a hypothesis is a questionable research practice, just as removing some outliers is (John et al., 2012; see also next section about Chapter 35).  However, JH seems to believe that he acted morally.

However, in 2015 social psychologists were well aware that hiding failed studies and other questionable practices undermine the credibility of published findings.  It is therefore particularly troubling that JH was a co-author of another article that failed to mention this study. Schnall, Haidt, Clore, and Jordan (2015) responded to a meta-analysis that suggested the effect of incidental disgust on moral judgments is not reliable and that there was evidence for publication bias (e.g., not reporting the failed study JH mentions in his contribution to the book on ethical challenges).  This would have been a good opportunity to admit that some studies failed to show the effect and that these studies were not reported.  However, the response is rather different.

“With failed replications on various topics getting published these days, we were pleased that Landy and Goodwin’s (2015) meta-analysis supported most of the findings we reported in Schnall, Haidt, Clore, and Jordan (2008). They focused on what Pizarro, Inbar and Helion (2011) had termed the amplification hypothesis of Haidt’s (2001) social intuitionist model of moral judgment, namely that “disgust amplifies moral evaluations—it makes wrong things seem even more wrong (Pizarro et al., 2011, p. 267, emphasis in original).” Like us, Landy and Goodwin (2015) found that the overall effect of incidental disgust on moral judgment is usually small or zero when ignoring relevant moderator variables.”

Somebody needs to go back in time and correct JH’s Psychological Review article and the hypnosis studies that reported main effects with moderate effect sizes and no moderator effects.  Apparently, even JH no longer believed in these effects in 2015, and so it was not important to mention failed studies. However, it might have been relevant to point out that the studies that did report main effects were false positives and what theoretical implications this would have.

More troubling is that the moderator effects are also not robust.  The moderator effects were shown in studies by Schnall and may be inflated by the use of questionable research practices.  In support of this interpretation of her results, a large replication study failed to replicate the results of Schnall et al.’s (2008) Study 3.  Neither the main effect of the disgust manipulation nor the interaction with the personality measure was significant (Johnson et al., 2016).

The fact that JH openly admits to hiding disconfirming evidence, while he would have considered selective deletion of outliers a moral violation, and was ashamed of even thinking about it, suggests that he does not consider hiding failed studies a violation of ethics (but see APA Guidelines, 6th edition, 2010).  This confirms Sternberg’s (2015) first observation about moral behavior.  A researcher needs to define an event as having an ethical dimension to act ethically.  As long as social psychologists do not consider hiding failed studies unethical, reported results cannot be trusted to be objective fact. Maybe it is time to teach social psychologists that hiding failed studies is a questionable research practice that violates scientific standards of research integrity.

Chapter 35
“Getting it Right” Can also be Wrong by Ronnie Janoff-Bulman 

This chapter provides the clearest introduction to the ethical dilemma that researchers face when they report the results of their research.  JB starts with a typical example that all empirical psychologists have encountered.  A study showed a promising result, but a second study failed to show the desired and expected result (p > .10).  She then did what many researchers do. She changed the design of the study (a different outcome measure) and collected new data.  There is nothing wrong with trying again because there are many reasons why a study may produce an unexpected result.  However, JB also makes it clear that the article would not include the non-significant results.

“The null-result of the intermediary experiment will not be discussed or mentioned, but will be ignored and forgotten.” 

The suppression of the failed study is called a questionable research practice (John et al., 2012).  The Publication Manual of APA considers this unethical reporting of research results.

JB makes it clear that hiding failed studies undermines the credibility of published results.

“Running multiple versions of studies and ignoring the ones that “didn’t work” can have far-reaching negative effects by contributing to the false positives that pervade our field and now pass for psychological knowledge. I plead guilty.”

JB also explains why it is wrong to neglect failed studies. Running study after study to get a successful outcome, “is likely capitalize on chance, noise, or situational factors and increase the likelihood of finding a significant (but unreliable) effect.”

This observation is by no means new. Sterling (1959) pointed out that publication bias (publishing only p-values below .05) essentially increases the risk of a false positive result from the nominal level of 5% to an actual level of 100%.  Even evidently false hypotheses will produce only significant results in the published literature if failures are not reported (Bem, 2011).
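Sterling's argument can be illustrated with a small simulation. The sketch below is mine, not Sterling's analysis. It simulates two-group studies in which the true effect is exactly zero, uses a z-test with a known standard deviation of 1 for simplicity, and then "publishes" only the significant results. The number of studies, the group size, and the random seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05
N_PER_GROUP = 20   # arbitrary group size for the illustration
N_STUDIES = 5000   # arbitrary number of simulated studies

def null_study_p_value(rng, n_per_group):
    """Simulate one two-group study with no true effect; return two-sided p."""
    group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    standard_error = (2 / n_per_group) ** 0.5  # known SD = 1 in both groups
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(1959)
p_values = [null_study_p_value(rng, N_PER_GROUP) for _ in range(N_STUDIES)]

# Publication bias: only significant results leave the file drawer.
published = [p for p in p_values if p < ALPHA]

share_significant_all = len(published) / len(p_values)
share_significant_published = sum(p < ALPHA for p in published) / len(published)

print(f"Significant among all studies:       {share_significant_all:.1%}")
print(f"Significant among published studies: {share_significant_published:.1%}")
```

Among all simulated studies, the rate of significant results stays close to the nominal 5%. Among the published ones it is 100% by construction, and every one of them is a false positive because the true effect was set to zero.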

JB asks what can be done about this.  Apparently, JB is not aware of recent developments in psychological science that range from statistical tests that reveal missing studies (like an X-ray for locked file drawers) to preregistration of studies that will be published without a significance filter.

Although utterly unlikely given current norms, reporting that we didn’t find the effect in a previous study (and describing the measures and manipulations used) would be broadly informative for the field and would benefit individual researchers conducting related studies. Certainly publication of replications by others would serve as a corrective as well.

It is not clear why publishing non-significant results is considered utterly unlikely in 2015, if the 2010 APA Publication Manual mandates publication of these studies.

Despite her pessimism about the future of psychological science, JB has a clear vision of how psychologists could improve their science.

A major, needed shift in research and publication norms is likely to be greatly facilitated by an embrace of open access publishing, where immediate feedback, open evaluations and peer reviews, and greater communication among researchers (including replications and null results) hold the promise of opening debate and discussion of findings. Such changes would help preclude false-positive effects from becoming prematurely reified as facts; but such changes, if they are to occur, will clearly take time.

The main message of this chapter is that researchers in psychology have been trained to chase significance because obtaining statistical significance by all means was considered a form of creativity and good research (Sternberg, 2018).  Unfortunately, this is wrong. Statistical significance is only meaningful if it is obtained the right way and in an open and transparent manner.

33 Commentary to Part V Susan T. Fiske

It was surprising to read Fiske’s (2015) statement that “contrary to human nature, we as scientists should welcome humiliation, because it shows that the science is working.”

In marked contrast to this quote, Fiske has attacked psychologists who are trying to correct some of the errors in published articles as “method terrorists.”

I personally find both statements problematic. Nobody should welcome humiliation and nobody who points out errors in published articles is a terrorist.  Researchers should simply realize that publications in peer-reviewed journals can still contain errors and that it is part of the scientific process to correct these errors.  The biggest problem in the past seven years was not that psychologists made mistakes, but that they resisted efforts to correct them, a resistance that arises from a flawed understanding of the scientific method.

36 Commentary to Part VI Susan T. Fiske

Social psychologists have justified not reporting failed studies (cf. the Jonathan Haidt example) by calling these studies pilot studies (Bem, 2011).  Bem pointed out that social psychologists have a lot of these pilot studies.  But a pilot study is not a study that tests the cause-effect relationship. A pilot study tests either whether a manipulation is effective or whether a measure is reliable and valid.  It is simply wrong to relabel a study that tested the effect of a manipulation on an outcome as a pilot study because the study did not work.

“However, few of the current proposals for greater transparency recommend describing each and every failed pilot study.”

The next statement makes it clear that Fiske conflates pilot studies with failed studies.

As noted, the reasons for failures to produce a given result are multiple, and supporting the null hypothesis is only one explanation. 

Yes, but it is one plausible explanation and not disclosing the failure renders the whole purpose of empirical hypothesis testing irrelevant (Sterling, 1959).

“Deciding when one has failed to replicate is a matter of persistence and judgment.”

No, it is not. Preregister the study, and if you are willing to use a significant result if you obtain it, you have to report the non-significant result if you do not. Everything else is not science and Susan Fiske seems to lack an understanding of the most basic reason for conducting an experiment.

What is an ethical scientist to do? One resolution is to treat a given result – even if it required fine-tuning to produce – as an existence proof: This result demonstrably can occur, at least under some circumstances. Over time, attempts to replicate will test generalizability.

This statement ignores that the observed pattern of results is heavily influenced by sampling error, especially in the typical between-subject design with small samples that is so popular in experimental social psychology.  A mean difference between two groups does not mean that anything happened in this study. It could just be sampling error.  But maybe the thought that most of the published results in experimental social psychology are just errors is too much to bear for somebody at the end of her career.

I have watched the replication crisis unfold over the past seven years since Bem (2011) published the eye-opening, ridiculous claims about feeling the future location of randomly displayed erotica. I cannot predict random events in the future, but I can notice trends and I do have a feeling that the future will not look kindly on those who tried to stand in the way of progress in psychological science. A new generation of psychologists is learning every day about replication failures and how to conduct better studies.  For old people there are only two choices. Step aside or help them to learn from the mistakes of the older generation.

P.S. I think there is a connection between morality and disgust but it (mainly) goes from immoral behaviors to disgust.  So let me tell you, psychological science, Uranus stinks.

Estimating Reproducibility of Psychology (No. 136): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The authors of this article are prominent figures in the replication crisis of social psychology.  Kathleen Vohs was co-author of a highly criticized article that suggested will-power depends on blood-glucose levels. The evidence supporting this claim has been challenged for methodological and statistical reasons (Kurzban, 2010; Schimmack, 2012).  She also co-authored numerous articles on ego-depletion that are difficult to replicate (Schimmack, 2016; Inzlicht, 2016).  In a response with Roy Baumeister titled “A misguided effort with elusive implications” she dismissed these problems, but her own replication project produced very similar results (SPSP, 2018). Some of her social priming studies also failed to replicate (Vadillo, Hardwicke, & Shanks, 2016).


The z-curve plot for Vohs shows clear evidence that her articles contain too many significant results (75%, including marginally significant ones, with only 55% average power).  The average probability of successfully replicating a randomly drawn finding from Vohs’ articles is 55%. However, this average is obtained with substantial heterogeneity.  Just significant z-scores (2 to 2.5) have only an average estimated replicability of 33%. Even z-scores in the range from 2.5 to 3 have only an average replicability of 40%.   This suggests that p-values in the range between .05 and .005 are unlikely to replicate in exact replication studies.
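The link between the z-score ranges and the p-value range in the last sentence follows from the standard normal distribution. The short sketch below only shows this conversion to two-sided p-values. It does not reproduce the z-curve estimates of replicability, which require the full set of published test statistics.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Boundaries of the z-score ranges discussed above.
for z in (2.0, 2.5, 3.0):
    print(f"z = {z:.1f} corresponds to p = {two_sided_p(z):.4f}")
```

The boundaries z = 2 and z = 3 correspond to two-sided p-values just below .05 and just below .005, respectively.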

In a controversial article with the title “The Truth is Wearing Off“, Jonathan Schooler even predicted that replication studies might often fail.  The article was controversial because Schooler suggested that all effects diminish over time (I wish this were true for the effect of eating chocolate on weight, but so far it hasn’t happened).  Schooler is also known for an influential article about “verbal overshadowing” in eyewitness identifications.  Francis (2012) demonstrated that the published results were too good to be true and the first Registered Replication Report failed to replicate one of the five studies and replicated another one only with a much smaller effect size.


The z-curve plot for Schooler looks very different. The average estimated power is higher.  However, there is a drop at z = 2.6 that is difficult to explain with a normal sampling distribution.

Based on this context information, predictions about replicability depend on the p-values of the actual studies.  Just significant p-values are unlikely to replicate, but smaller p-values (larger z-scores) might replicate.

Summary of Original Article

The article examines moral behavior.  The main hypothesis is that beliefs about free will vs. determinism influence cheating.  Whereas belief in free will encourages moral behavior,  beliefs that behavior is determined make it easier to cheat.

Study 1

30 students were randomly assigned to one of two conditions (n = 15 per condition).

Participants in the anti-free-will condition read a passage written by Francis Crick, a Nobel laureate, suggesting that free will is an illusion.  In the control condition, they read about consciousness.

Then participants were asked to work on math problems on a computer. They were given a cover story that the computer program had a glitch and would present the correct answers, but they could fix this problem by pressing the space bar as soon as the question appeared.  They were asked to do so and to try to solve the problems on their own.

The article does not mention whether participants were probed for suspicion; data from all participants were included in the analysis.

The main finding was that participants cheated more in the “no-free-will” condition than in the control condition, t(28) = 3.04, p = .005.

Study 2

Study 2 addressed several limitations of Study 1. Although the sample size was larger, the design included 5 conditions (n = 24/25 per condition).

The main dependent variable was the number of correct answers on 15 reading comprehension, mathematical, and logic problems that were used by Vohs in a previous study (Schmeichel, Vohs, & Baumeister, 2003).  For each correct answer, participants received $1.

Two conditions manipulated free will beliefs, but participants could not cheat. The comparison of these two conditions shows whether the manipulation influences actual performance, but there was no major difference (based on the figure, $7.50 in the control condition vs. $7 in the no-free-will condition).

In the cheating condition, experimenters received a fake phone call and told the participants that they had to leave and that the participant should continue, score their answers, and pay themselves.  Surprisingly, neither the free-will nor the neutral condition showed any signs of cheating ($7.20 and $7.30, respectively).  However, the determinism condition increased the average pay-out to $10.50.

One problem for the statistical analysis is that the researchers “did not have participants’ answer sheets in the three self-paid conditions; therefore, we divided the number of $1 coins taken by each group by the number of group members to arrive at an average self-payment” (p. 52).

The authors then report a significant ANOVA result, F(4, 114) = 5.68, p = .0003.

However, without information about the standard deviation in each cell, it is not possible to compute an Analysis of Variance.  This part of the analysis is not explained in the article.
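To illustrate why the cell means alone do not determine the F-value, the sketch below computes F from the five cell means for several within-cell standard deviations. The means are the approximate dollar values given above. The equal cell size of 24 and the three standard deviations are my assumptions for illustration only; the article does not report them.

```python
# Approximate cell means in dollars, as summarized above.
CELL_MEANS = [7.50, 7.00, 7.20, 7.30, 10.50]
N_PER_CELL = 24  # assumption: equal cells (the article gives 24 or 25)

def f_from_summary(means, n_per_cell, pooled_sd):
    """One-way ANOVA F computed from cell means, equal n, and a pooled SD."""
    k = len(means)
    grand_mean = sum(means) / k
    ss_between = n_per_cell * sum((m - grand_mean) ** 2 for m in means)
    ms_between = ss_between / (k - 1)
    ms_within = pooled_sd ** 2  # this term is unknown without cell-level data
    return ms_between / ms_within

# Hypothetical within-cell standard deviations (in dollars).
for sd in (2.0, 3.0, 4.0):
    f_value = f_from_summary(CELL_MEANS, N_PER_CELL, sd)
    print(f"pooled SD = ${sd:.2f} -> F(4, 115) = {f_value:.2f}")
```

The same set of means produces very different F-values depending on the assumed within-cell variability, which is why the reported F cannot be checked from the group averages alone. Under these assumptions, a pooled standard deviation of roughly $3 would give an F-value close to the reported one, but this is a back-calculation and not information from the article.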


The replication team also had some problems with Study 2.

We originally intended to carry out Study 2, following the Reproducibility Project’s system of working from the back of an article. However, on corresponding with the authors we discovered that it had arisen in post-publication correspondence about analytic methods that the actual effect size found was smaller than reported, although the overall conclusion remained the same. 

As a result, they decided to replicate Study 1.

The sample size of the replication study was nearly twice as large as the sample of the original study (N = 58 vs. 30).

The results did not replicate the significant result of the original study, t(56) = 0.77, p = .44.
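To compare the two results on a common scale, the t-values can be converted into standardized mean differences. The sketch below uses the standard conversion for two independent groups. The original group sizes (15 per condition) are reported above; the equal split of the replication sample into 29 per group is my assumption.

```python
def cohens_d_from_t(t_value, n1, n2):
    """Standardized mean difference implied by an independent-groups t-value."""
    return t_value * (1 / n1 + 1 / n2) ** 0.5

d_original = cohens_d_from_t(3.04, 15, 15)
d_replication = cohens_d_from_t(0.77, 29, 29)  # assumes 29 per group

print(f"Original study:    d = {d_original:.2f}")
print(f"Replication study: d = {d_replication:.2f}")
```

Under these assumptions the original effect size estimate is larger than one standard deviation, whereas the replication estimate is about a fifth of a standard deviation.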


Study 1 was underpowered.  Even nearly doubling the sample size was not sufficient to obtain significance in the replication study.   Study 2 was superior, but it was reported so poorly that the replication team could not replicate the study.
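A rough power calculation shows what underpowered means here. The sketch below uses a normal approximation to the two-group t-test, which slightly overstates power in small samples. The true effect size of d = .5 is a hypothetical value chosen for illustration, not an estimate from either study, and the split of the replication sample into 29 per group is also an assumption.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power of a two-group comparison (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    noncentrality = d * (n_per_group / 2) ** 0.5
    upper_tail = 1 - norm.cdf(z_crit - noncentrality)
    lower_tail = norm.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail

ASSUMED_D = 0.5  # hypothetical true effect size

power_original = approx_power(ASSUMED_D, 15)     # original study, n = 15 per group
power_replication = approx_power(ASSUMED_D, 29)  # replication, assuming 29 per group

print(f"Power with n = 15 per group: {power_original:.2f}")
print(f"Power with n = 29 per group: {power_replication:.2f}")
```

With this assumed effect size, both designs would have less than a 50% chance of producing a significant result.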






Estimating Reproducibility of Psychology (No. 124): An Open Post-Publication Peer-Review



Summary of Original Article

The article “Loving Those Who Justify Inequality: The Effects of System Threat on Attraction to Women Who Embody Benevolent Sexist Ideals”  is a Short Report in the journal Psychological Science.  The single study article is based on Study 3 of a doctoral dissertation supervised by the senior author Steven J. Spencer.


The article has been cited 32 times and has not been cited in 2017 (but has one citation in 2018 so far).


The authors aim to provide further evidence for system-justification theory (Jost, Banaji, & Nosek, 2004).  A standard experimental paradigm is to experimentally manipulate beliefs in the fairness of the existing political system.  According to the theory, individuals are motivated to maintain positive views of the current system and will respond to threats to this belief in a defensive manner.

In this specific study, the authors predicted that male participants whose faith in the political system was threatened would show greater romantic interest in women who embody benevolent sexist ideals than in women who do not embody these ideals.

The design of the study is a classic 2 x 2 design with system threat as between-subject factor and type of woman (embodying benevolent sexist ideals or not) as within-subject factor.

Stimuli were fake dating profiles.  Dating profiles of women who embody benevolent sexist ideals were based on the three dimensions of benevolent sexism: vulnerable, pure, and ideal for making a man feel complete (Glick & Fiske, 1996). The other women were described as career oriented, party seeking, active in social causes, or athletic.

A total of 36 male students participated in the study.

The article reports a significant interaction effect, F(1, 34) = 5.89.  This interaction effect was due to a significant difference between the two groups in ratings of women who embody benevolent sexist ideals, F(1, 34) = 4.53.

Replication Study 

The replication study was conducted in Germany.

It failed to replicate the significant interaction effect, F(1,68) = 0.08, p = .79.


The sample size of the original study was very small and the result was just significant.  It is not surprising that a replication study failed to replicate this just significant result despite a somewhat larger sample size.




Estimating Reproducibility of Psychology (No. 61): An Open Post-Publication Peer-Review



Summary of Original Article 

The article “Poignancy: Mixed emotional experience in the face of meaningful endings” was published in the Journal of Personality and Social Psychology.  The senior author is Laura L. Carstensen, who is best known for her socioemotional selectivity theory (Carstensen, 1999, American Psychologist).  This article has been cited (only) 83 times and is only #43 in the top cited articles of Laura Carstensen, although it contributes to her current H-Index of 49.


The main hypothesis is derived from Carstensen’s socioemotional selectivity theory.  The prediction is that endings (e.g., of student life, of life in general) elicit more mixed emotions.  This hypothesis was tested in two experiments.

Study 1

60 young (~ 20 years) and 60 older (~ 80 years) participants took part in Study 1.  The experimental procedure was a guided imagery to evoke emotions.   In one condition participants were asked to imagine being in their favorite location in four months’ time.  In the other condition they were given the same instruction, but were also told to imagine that this would be the last time they could visit this location.  The dependent variables were intensity ratings on an emotion questionnaire on a scale from 0 = not at all to 7 = extremely.

The intensity of mixed feelings was assessed by taking the minimum value of a positive and a negative emotion (Schimmack, 2001).
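The minimum index can be written in one line. The sketch below uses hypothetical ratings on the 0 to 7 scale described above; the values are made up for illustration and are not data from the study.

```python
def mixed_feelings_intensity(positive, negative):
    """Mixed-feelings score as the minimum of a positive and a negative rating."""
    return min(positive, negative)

# Hypothetical ratings on the 0 (not at all) to 7 (extremely) scale.
purely_happy = mixed_feelings_intensity(positive=6, negative=0)
happy_and_sad = mixed_feelings_intensity(positive=5, negative=4)

print(purely_happy, happy_and_sad)
```

The index is high only when both emotions are reported with some intensity. A strong emotion paired with the absence of the opposite emotion yields a score of zero.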

The analysis showed no age main effect or interactions and no differences in two control conditions.  For the critical imagery condition,  intensity of mixed feelings was higher in the last-time condition (M ~ 3.6, SD ~ 2.3) than in the next-visit condition (M ~ 2, SD ~ 2.3), d ~ .7,  t(118) ~ 3.77.

Study 2

Study 2 examined mixed feelings in the context of a naturalistic event.  It built on a previous study by Larsen, McGraw, & Cacioppo (2001) that demonstrated mixed feelings on graduation day.  Study 2 aimed to replicate and extend this finding.  To extend the finding, the authors added an experimental manipulation that either emphasized the ending of university or not.

110 students participated in the study.

In the control condition (N = 59), participants were given the following instructions: “Keeping in mind your current experiences, please rate the degree to which you feel each of the following emotions,” and were then presented with the list of 19 emotions. In the limited-time condition (n = 51), in which emphasis was placed on the ending that they were experiencing, participants were given the following instructions: “As a graduating senior, today is the last day that you will be a student at Stanford. Keeping that in mind, please rate the degree to which you feel each of the following emotions,”

The key finding was significantly higher means in the experimental condition than in the control condition, t(108) = 2.34, p = .021.

Replication Study

Recruiting participants on graduation day is not easy.  The replication study recruited participants over a 3-year period to achieve a sample size of N = 222 participants, more than double the sample size of the original study (2012 N = 95; 2013 N = 78; 2014 N = 49).

Despite the larger sample size, the study failed to replicate the effect of the experimental manipulation, t(220) = 0.07, p = .94.


While reports of mixed feelings in conflicting situations are a robust phenomenon (Study 1), experimental manipulations of the intensity of mixed feelings are relatively rare. The key novel contribution of Study 2 was the demonstration that focusing on the ending of an event increases sadness and mixed feelings. However, the evidence for this effect was weak and could not be replicated in a larger sample. In combination, the evidence does not suggest that this is an effective way to manipulate the intensity of mixed feelings.











In Study 1, participants repeatedly imagined being in a meaningful location. Participants in the experimental condition imagined being in the meaningful location for the final time. Only participants who imagined “last times” at meaningful locations experienced more mixed emotions. In Study 2, college seniors reported their emotions on graduation day. Mixed emotions were higher when participants were reminded of the ending that they were experiencing. Findings suggest that poignancy is an emotional experience associated with meaningful endings.