It is 2018, and 2012 is a faint memory. So much has happened in the world and in psychology over the past six years.
Two events rocked Experimental Social Psychology (ESP) in 2011, and everybody was talking about their implications for the future of the field.
First, Daryl Bem had published an incredible article that seemed to suggest humans, or at least extraverts, have the ability to anticipate random future events (e.g., where an erotic picture would be displayed).
Second, it was discovered that Diederik Stapel had fabricated data for several articles. Over 50 of his articles have since been retracted.
Opinions were divided about the significance of these two events for experimental social psychology. Some psychologists suggested that they were symptomatic of a bigger crisis in the field. Others considered them exceptions with few consequences for the future of experimental social psychology.
In February 2012, Charles Stangor tried to predict how these events would shape the future of experimental social psychology in an essay titled “Rethinking my Science.”
How will social and personality psychologists look back on 2011? With pride at having continued the hard work of unraveling the mysteries of human behavior, or with concern that the only thing that is unraveling is their discipline?
Stangor’s answer is clear.
“Although these two events are significant and certainly deserve our attention, they are flukes rather than game-changers.”
He describes Bem’s article as a “freak event” and Stapel’s behavior as a “fluke.”
“Some of us probably do fabricate data, but I imagine the numbers are relatively few.”
Stangor is confident that experimental social psychology is not really affected by these two events.
As shocking as they are, neither of these events create real problems for social psychologists
In a radical turn, Stangor then suggests that experimental social psychology will change, not in response to these events but in response to three other articles.
But three other papers published over the past two years must completely change how we think about our field and how we must conduct our research within it. And each is particularly important for me, personally, because each has challenged a fundamental assumption that was part of my training as a social psychologist.
The first article is a criticism of experimental social psychology for relying too much on first-year college students as participants (Henrich, Heine, & Norenzayan, 2010). Looking back, there is no evidence that US American psychologists have become more global in their research interests. One reason is that social phenomena are sensitive to the cultural context, and for Americans it is more interesting to study how online dating is changing relationships than to study arranged marriages in more traditional cultures. There is nothing wrong with a focus on a particular culture. It is not even clear that research articles on prejudice against African Americans were supposed to generalize to the world (how would this research apply to African countries where the vast majority of citizens are black?).
The only change that occurred was not in response to Henrich et al.’s (2010) article, but in response to technological changes that made it easier to conduct research and pay participants online. Many social psychologists now use the online service Mechanical Turk (MTurk) to recruit participants.
Thus, I don’t think this article significantly changed experimental social psychology.
The second article, “The Truth Wears Off,” was published in the weekly magazine The New Yorker. It made the ridiculous claim that true effects become weaker or may even disappear over time.
The basic phenomenon is that observed findings in the social and biological sciences weaken with time. Effects that are easily replicable at first become less so every day. Drugs stop working over time the same way that social psychological phenomena become more and more elusive. The “the decline effect” or “the truth wears off effect,” is not easy to dismiss, although perhaps the strength of the decline effect will itself decline over time.
The assumption that the decline effect applies to real effects is no more credible than Bem’s claims of time-reversed causality. I am still waiting for the effect of eating cheesecake on my weight (a biological effect) to wear off. My bathroom scale tells me it has not.
Why would Stangor believe in such a ridiculous idea? The answer is that he observed it many times in his own work.
Frankly I have difficulty getting my head around this idea (I’m guessing others do too) but it is nevertheless exceedingly troubling. I know that I need to replicate my effects, but am often unable to do it. And perhaps this is part of the reason. Given the difficulty of replication, will we continue to even bother? And what becomes of our research if we do even less replicating than we do now? This is indeed a problem that does not seem likely to go away soon.
In hindsight, it is puzzling that Stangor misses the connection between Bem’s (2011) article and the decline effect. Bem published 9 successful results with p < .05. This is not a fluke: the probability of getting lucky 9 times in a row, when a single event has a probability of just 5%, is vanishingly small (less than 1 in a billion). Bem did not fabricate data like Stapel, but he falsified data to present results that are too good to be true (Definitions of Research Misconduct). Not surprisingly, neither he nor others can replicate these results in transparent studies that prevent the use of QRPs (just like paranormal phenomena such as spoon bending cannot be replicated in transparent experiments that prevent fraud).
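The arithmetic behind this claim is easy to verify (a quick sketch; it assumes the nine studies are independent and that each has the nominal 5% chance of significance under the null hypothesis):

```python
# Probability of nine "lucky" significant results in a row, if each study
# has only a 5% chance of significance under the null and the studies
# are independent:
p_single = 0.05
p_all_nine = p_single ** 9
print(p_all_nine)          # ~1.95e-12
print(p_all_nine < 1e-9)   # True: far less than 1 in a billion
```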
The decline effect is real, but it is wrong to misattribute it to a decline in the strength of a true phenomenon. The decline effect occurs when researchers use questionable research practices (John et al., 2012) to fabricate statistically significant results. Questionable research practices inflate “observed effect sizes” [a misnomer because effects cannot be observed]; that is, the observed mean differences between groups in an experiment. Unfortunately, social psychologists do not distinguish between “observed effects sizes” and true or population effect sizes. As a result, they believe in a mysterious force that can reduce true effect sizes when sampling error moves mean differences in small samples around.
In conclusion, the truth does not wear off because there was no truth to begin with. Bem’s (2011) results did not show a real effect that wore off in replication studies. The effect was never there to begin with.
The third article mentioned by Stangor did change experimental social psychology. In this article, Simmons, Nelson, and Simonsohn (2011) demonstrate the statistical tricks experimental social psychologists have used to produce statistically significant results. They call these tricks p-hacking. All methods of p-hacking have one common feature. Researchers conduct multiple statistical analyses and check the results. When they find a statistically significant result, they stop analyzing the data and report the significant result. There is nothing wrong with this practice so far, but it essentially constitutes research misconduct when the result is reported without fully disclosing how many attempts were made to get it. The failure to disclose all attempts is deceptive because the reported result (p < .05) is only valid if a researcher collected data and then conducted a single test of a hypothesis (it does not matter whether this hypothesis was made before or after data collection). The point is that at the moment a researcher presses a mouse button or a key on a keyboard to see a p-value, a statistical test occurred. If this p-value is not significant and another test is run to look at another p-value, two tests are conducted and the risk of a type-I error is greater than 5%. It is no longer valid to claim p < .05 if more than one test was conducted. With extreme abuse of the statistical method (p-hacking), it is possible to get a significant result even with randomly generated data.
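The mechanics described above can be illustrated with a minimal simulation (my own sketch, not the authors’ code): every dataset below is pure noise, yet just two extra undisclosed “peeks” — testing again after adding participants, and trying a second dependent variable — push the false-positive rate well above the nominal 5%.

```python
# Simulation: how undisclosed extra analyses inflate the type-I error rate.
import random
from statistics import NormalDist, mean

random.seed(1)
norm = NormalDist()

def p_value(group_a, group_b):
    """Two-sided z-test for a difference in means, assuming known sd = 1."""
    n = len(group_a)
    z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))

def one_study(p_hacked):
    # Both groups are drawn from the SAME distribution: the null is true.
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    if p_value(a, b) < 0.05:
        return True
    if not p_hacked:
        return False
    # QRP 1: add 10 more participants per group and test again
    a += [random.gauss(0, 1) for _ in range(10)]
    b += [random.gauss(0, 1) for _ in range(10)]
    if p_value(a, b) < 0.05:
        return True
    # QRP 2: try a second, independent dependent variable
    a2 = [random.gauss(0, 1) for _ in range(30)]
    b2 = [random.gauss(0, 1) for _ in range(30)]
    return p_value(a2, b2) < 0.05

trials = 10_000
honest = sum(one_study(False) for _ in range(trials)) / trials
hacked = sum(one_study(True) for _ in range(trials)) / trials
print(f"single planned test:  {honest:.3f}")  # close to the nominal .05
print(f"with two extra peeks: {hacked:.3f}")  # well above .05
```

Simmons et al. (2011) report the same qualitative pattern: each additional undisclosed analysis compounds the chance of a spurious “significant” result.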
In 2010, the Publication Manual of the American Psychological Association advised researchers that “omitting troublesome observations from reports to present a more convincing story is also prohibited” (APA). It is telling that Stangor does not mention this section as a game-changer, because it has been widely ignored by experimental psychologists until this day. Even Bem’s (2011) article that was published in an APA journal violated this rule, but it has not been retracted or corrected so far.
The p-hacking article had a strong effect on many social psychologists, including Stangor.
Its fundamental assertions are deep and long-lasting, and they have substantially affected me.
Apparently, social psychologists were not aware that some of their research practices undermined the credibility of their published results.
Although there are many ways that I take the comments to heart, perhaps most important to me is the realization that some of the basic techniques that I have long used to collect and analyze data – techniques that were taught to me by my mentors and which I have shared with my students – are simply wrong.
I don’t know about you, but I’ve frequently “looked early” at my data, and I think my students do too. And I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.” Over the years my students have asked me about these practices (“What do you recommend, Herr Professor?”) and I have
routinely, but potentially wrongly, reassured them that in the end, truth will win out.
Although it is widely recognized that many social psychologists p-hacked and buried studies that did not work out, Stangor’s essay remains one of the few open admissions that these practices were used. These practices were not considered unethical, at least until 2010; in fact, social psychologists were trained that telling a good story was essential (Bem, 2001).
In short, this important paper will – must – completely change the field. It has shined a light on the elephant in the room, which is that we are publishing too many Type-1 errors, and we all know it.
Whew! What a year 2011 was – let’s hope that we come back with some good answers to these troubling issues in 2012.
In hindsight, Stangor was right about the p-hacking article. It has been cited over 1,000 times so far, and the term p-hacking is widely used for methods that essentially constitute a violation of research ethics. P-values are only meaningful if all analyses are reported, and failing to disclose analyses that produced inconvenient non-significant results in order to tell a more convincing story constitutes research misconduct according to the guidelines of the APA and the HHS.
Charles Stangor’s Z-Curve
Stangor’s essay is valuable in many ways. One important contribution is the open admission of using QRPs before the p-hacking article made Stangor realize that doing so was wrong. I have been working on statistical methods to reveal the use of QRPs. It is therefore interesting to see the results of this method when it is applied to the data of a researcher who admits to having used QRPs.
This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t and F-values converted into z-scores) in Stangor’s articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed). The steep drop on the left shows that Stangor rarely reported marginally significant results (p = .05 to .10). It also reveals the use of questionable research practices, because sampling error should produce a larger number of non-significant results than are actually observed. The grey line provides a rough estimate of the expected proportion of non-significant results. The so-called file drawer (non-significant results that are not reported) is very large. It is unlikely that so many studies were attempted and not reported. As Stangor mentions, he also used p-hacking to get significant results. P-hacking can produce just-significant results without conducting many studies.
In short, the graph is consistent with Stangor’s account that he used QRPs in his research, which was common practice and even encouraged, and did not violate any research ethics code of the times (Bem, 2001).
The graph also shows that the significant studies have an estimated average power of 71%. This means any randomly drawn statistically significant result from Stangor’s articles has a 71% chance of producing a significant result again, if the study and the statistical test were replicated exactly (see Brunner & Schimmack, 2018, for details about the method). This average is not much below the 80% value that is considered good power.
There are two caveats with the 71% estimate. One caveat is that this graph uses all statistical tests that are reported, but not all of these tests are interesting. Other datasets suggest that the average for focal hypothesis tests is about 20-30 percentage points lower than the estimate for all tests. Nevertheless, an average of 71% is above average for social psychology.
The second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies. The average for studies with just-significant results (z = 2 to 2.5) is only 49%. It is possible to use the information from this graph to reexamine Stangor’s articles and to adjust nominal p-values. According to this graph, p-values in the range between .05 and .01 would not be significant, because 50% power corresponds to a p-value of .05. Thus, all of the studies with a z-score of 2.5 or less (~ p > .01) would not be significant after correcting for the use of questionable research practices.
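For readers who want to check these cutoffs, the z-to-p conversion underlying the figure is just the two-tailed tail area of the standard normal distribution (a minimal sketch):

```python
# Converting between two-tailed p-values and z-scores with the
# standard normal distribution (Python stdlib only).
from statistics import NormalDist

norm = NormalDist()

def z_to_p(z):
    """Two-tailed p-value for a given z-score."""
    return 2 * (1 - norm.cdf(abs(z)))

def p_to_z(p):
    """z-score corresponding to a two-tailed p-value."""
    return norm.inv_cdf(1 - p / 2)

print(round(z_to_p(1.96), 3))  # 0.05: the conventional significance criterion
print(round(z_to_p(2.5), 3))   # ~0.012, so z <= 2.5 roughly means p >= .01
print(round(p_to_z(0.01), 2))  # 2.58
```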
The main conclusion that can be drawn from this analysis is that the statistical analysis of Stangor’s reported results shows convergent validity with the description of his research practices. If test statistics by other researchers show a similar (or worse) distribution, it is likely that they also used questionable research practices.
Charles Stangor’s Response to the Replication Crisis
Stangor was no longer an active researcher when the replication crisis started. Thus, it is impossible to see changes in actual research practices. However, Stangor co-edited a special issue for the Journal of Experimental Social Psychology on the replication crisis.
The Introduction mentions the p-hacking article.
At the same time, the empirical approaches adopted by social psychologists leave room for practices that distort or obscure the truth (Hales, 2016-in this issue; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)
social psychologists need to do some serious housekeeping in order to progress as a scientific enterprise.
It quotes Dovidio to claim that social psychologists are
lucky to have the problem. Because social psychologists are rapidly developing new approaches and techniques, our publications will unavoidably contain conclusions that are uncertain, because the potential limitations of these procedures are not yet known. The trick then is to try to balance “new” with “careful.”
It also mentions the problem of fabricating stories by hiding unruly non-significant results.
The availability of cheap data has a downside, however, which is that there is little cost in omitting data that contradict our hypotheses from our manuscripts (John et al., 2012). We may bury unruly data because it is so cheap and plentiful. Social psychologists justify this behavior, in part, because we think conceptually. When a manipulation fails, researchers may simply argue that the conceptual variable was not created by that particular manipulation and continue to seek out others that will work. But when a study is eventually successful, we don’t know if it is really better than the others or if it is instead a Type I error. Manipulation checks may help in this regard, but they are not definitive (Sigall & Mills, 1998).
It also mentions file drawers with unsuccessful studies like the one shown in the figure above.
Unpublished studies likely outnumber published studies by an order of magnitude. This is wasteful use of research participants and demoralizing for social psychologists and their students.
It also mentions that governing bodies have failed to crack down on the use of p-hacking and other questionable practices, although the APA guidelines are not mentioned.
There is currently little or no cost to publishing questionable findings
It foreshadows calls for a more stringent criterion of statistical significance, known as the p-value wars (alpha = .05 vs. alpha = .005 vs. justify your alpha vs. abandon alpha)
Researchers base statistical analyses on the standard normal distribution but the actual tails are probably bigger than this approach predicts. It is clear that p < .05 is not enough to establish the credibility of an effect. For example, in the Reproducibility Project (Open Science Collaboration, 2015), only 18% of studies with a p-value greater than .04 replicated whereas 63% of those with a p-value less than .001 replicated. Perhaps we should require, at minimum, p < .01
It is not clear why we should settle for p < .01 if only 63% of results replicated with p < .001. Moreover, this proposal ignores that a more stringent criterion for significance also increases the risk of type-II errors (Cohen). It also ignores that only two studies are required to reduce the risk of a type-I error from .05 to .05 × .05 = .0025. As many articles in experimental social psychology are based on multiple cheap studies, the nominal type-I error rate is well below .001. The real problem is that the reported results are not credible because QRPs are used (Schimmack, 2012). A simple and effective way to improve experimental social psychology would be to enforce the APA ethics guidelines and hold violators of these rules accountable for their actions. However, although no new rules would need to be created, experimental social psychologists have been unable to police themselves and continue to use QRPs.
The Introduction ignores this valid criticism of multiple-study articles and continues to give the misleading impression that more studies translate into more replicable results. However, the Open Science Collaboration’s reproducibility project showed no evidence that long, multiple-study articles reported more replicable results than shorter articles in Psychological Science.
In addition, replication concerns have mounted with the editorial practice of publishing short papers involving a single, underpowered study demonstrating counterintuitive results (e.g., Journal of Experimental Social Psychology; Psychological Science; Social Psychological and Personality Science). Publishing newsworthy results quickly has benefits,
but also potential costs (Ledgerwood & Sherman, 2012), including increasing Type 1 error rates (Stroebe, 2016-in this issue).
Once more, the problem is dishonest reporting of results. A risky study can be published, and a true type-I error rate of 20% informs readers that there is a high risk of a false-positive result. In contrast, 9 studies with a misleading type-I error rate of 5% violate the implicit assumption that readers can trust a scientific research article to report the results of an objective test of a scientific question.
But things get worse.
We do, of course, understand the value of replication, and publications in the premier social-personality psychology journals often feature multiple replications of the primary findings. This is appropriate, because as the number of successful replications increases, our confidence in the finding also increases dramatically. However, given the possibility
of p-hacking (Head, Holman, Lanfear, Kahn, & Jennions, 2015; Simmons et al., 2011) and the selective reporting of data, replication is a helpful but imperfect gauge of whether an effect is real.
Just like Stangor dismissed Bem’s multiple-study article in JPSP as a fluke that does not require further attention, he dismisses evidence that QRPs were used to p-hack other multiple-study articles (Schimmack, 2012). Ignoring this evidence is just another violation of research ethics: the data that are being omitted here are articles that contradict the story that an author wants to present.
And it gets worse.
Conceptual replications have been the field’s bread and butter, and some authors of the special issue argue for the superiority of conceptual over exact replications (e.g. Crandall & Sherman, 2016-in this issue; Fabrigar and Wegener, 2016–in this issue; Stroebe, 2016-in this issue). The benefits of conceptual replications are many within social psychology, particularly because they assess the robustness of effects across variation in methods, populations, and contexts. Constructive replications are particularly convincing because they directly replicate an effect from a prior study as exactly as possible in some conditions but also add other new conditions to test for generality or limiting conditions (Hüffmeier, 2016-in this issue).
Conceptual replication is a euphemism for storytelling or, as Sternberg calls it, creative HARKing (Sternberg, in press). Stangor explained earlier how an article with several conceptual replication studies is constructed.
I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.”
This is how Bem advised generations of social psychologists to write articles and that is how he wrote his 2011 article that triggered awareness of the replicability crisis in social psychology.
There is nothing wrong with doing multiple studies and examining conditions that make an effect stronger or weaker. However, it is pseudo-science if such a program of research reports only successful results, because reporting only successes renders statistical significance meaningless (Sterling, 1959).
The miraculous conceptual replications of Bem (2011) are even more puzzling in the context of social psychologists’ conviction that their effects can decrease over time (Stangor, 2012) or change dramatically from one situation to the next.
Small changes in social context make big differences in experimental settings, and the same experimental manipulations create different psychological states in different times, places, and research labs (Fabrigar and Wegener, 2016–in this issue). Reviewers and editors would do well to keep this in mind when evaluating replications.
How can effects be sensitive to context when the success rate in published articles is 95%?
And it gets worse.
Furthermore, we should remain cognizant of the fact that variability in scientists’ skills can produce variability in findings, particularly for studies with more complex protocols that require careful experimental control (Baumeister, 2016-in this issue).
Baumeister is one of the few other social psychologists who has openly admitted not disclosing failed studies. He also pointed out that in 2008 this practice did not violate APA standards. However, in 2016 a major replication project failed to replicate the ego-depletion effect that he first “demonstrated” in 1998. In response to this failure, Baumeister claimed that he had produced the effect many times, suggesting that he has some capabilities that researchers who fail to show the effect lack (in his contribution to the special issue in JESP he calls this ability “flair”). However, he failed to mention that many of his attempts failed to show the effect and that his high success rate in dozens of articles can only be explained by the use of QRPs.
While there is ample evidence for the use of QRPs, there is no empirical evidence for the claim that research expertise matters. Moreover, most of the research is carried out by undergraduate students supervised by graduate students, and the expertise of professors is limited to designing studies, not to actually carrying them out.
In the end, the Introduction also comments on the process of correcting mistakes in published articles.
Correctors serve an invaluable purpose, but they should avoid taking an adversarial tone. As Fiske (2016–this issue) insightfully notes, corrective articles should also
include their own relevant empirical results — themselves subject to
This makes no sense. If somebody writes an article and claims to find an interaction effect based on a significant result in one condition and a non-significant result in another condition, the article makes a statistical mistake (Gelman & Stern, 2005). If a pre-registration contains the statement that an interaction is predicted and a published article claims an interaction is not necessary, the article misrepresents the nature of the preregistration. Correcting mistakes like this is necessary for science to be a science. No additional data are needed to correct factual mistakes in original articles (see, e.g., Carlsson, Schimmack, Williams, & Bürkner, 2017).
Moreover, Fiske has been inconsistent in her assessment of psychologists who have been motivated by the events of 2011 to improve psychological science. On the one hand, she has called these individuals “method terrorists” (2016 review). On the other hand, she suggests that psychologists should welcome humiliation that may result from the public correction of a mistake in a published article.
In 2012, Stangor asked “How will social and personality psychologists look back on 2011?” Six years later, it is possible to provide at least a temporary answer. There is no unified response.
The main response by older experimental social psychologists has been denial, in line with Stangor’s initial response to Stapel and Bem. Despite massive replication failures and criticism, including criticism by Nobel Laureate Daniel Kahneman, no eminent social psychologist has responded to the replication crisis with an admission of mistakes. In contrast, the list of eminent social psychologists who stand by their original findings despite evidence for the use of QRPs and replication failures is long and growing every day as replication failures accumulate.
The response by some younger social psychologists has been to nudge social psychologists slowly towards improving their research methods, mainly by handing out badges for preregistrations of new studies. Although preregistration makes it more difficult to use questionable research practices, it is too early to see how effective preregistration is in making published results more credible. Another initiative is to conduct replication studies. The problem with this approach is that the outcome of replication studies can be challenged and so far these studies have not resulted in a consensual correction in the scientific literature. Even articles that reported studies that failed to replicate continue to be cited at a high rate.
Finally, some extremists are asking for more radical changes in the way social psychologists conduct research, but these extremists are dismissed by most social psychologists.
It will be interesting to see how social psychologists, funding agencies, and the general public will look back on 2011 in 2021. In the meantime, social psychologists have to ask themselves how they want to be remembered and new investigators have to examine carefully where they want to allocate their resources. The published literature in social psychology is a mine field and nobody knows which studies can be trusted or not.
I don’t know about you, but I am looking forward to reading the special issues in 2021 in celebration of the 10-year anniversary of Bem’s groundbreaking, or should I say earth-shattering, publication of “Feeling the Future.”