[please hold pencil (pen does not work) like this while reading this blog post]
In “Sad Face: Another classic finding in psychology—that you can smile your way to happiness—just blew up. Is it time to panic yet?” by Daniel Engber, Fritz Strack gets to tell his version of the importance of his original study and of what it means that a recent attempt to replicate his original results in 17 independent replication studies failed. In this blog post, I provide my commentary on Fritz Strack’s story to reveal inconsistencies, omissions of important facts, and false arguments used to discount the results of the replication studies.
PART I: Prior to the Replication of Strack et al. (1988)
In 2011, many psychologists lost confidence in social psychology as a science. One social psychologist had fabricated data at midnight in his kitchen. Another presented incredible results suggesting that people can foresee random events in the future. And finally, a researcher failed to replicate a famous study in which subtle reminders of elderly people made students walk more slowly. A New Yorker article captured the mood of the time. It wasn’t clear which findings one should believe and which would replicate under close scrutiny. In response, psychologists created a new initiative to replicate original findings across many independent labs. A first study produced encouraging results. Many classic findings in psychology (like the anchoring effect) replicated, sometimes even with stronger effect sizes than in the original study. However, some studies didn’t replicate. In particular, results from a small group of social psychologists who had built their careers around the idea that small manipulations can have strong effects on participants’ behavior without participants’ awareness (such as the elderly priming study) did not replicate well. The question was: which results from this group of social psychologists, who study unconscious or implicit processes, would replicate?
Quote “The experts were reluctant to step forward. In recent months their field had fallen into scandal and uncertainty: An influential scholar had been outed as a fraud; certain bedrock studies—even so-called “instant classics”—had seemed to shrivel under scrutiny. But the rigidity of the replication process felt a bit like bullying. After all, their work on social priming was delicate by definition: It relied on lab manipulations that had been precisely calibrated to elicit tiny changes in behavior. Even slight adjustments to their setups, or small mistakes made by those with less experience, could set the data all askew. So let’s say another lab—or several other labs—tried and failed to copy their experiments. What would that really prove? Would it lead anyone to change their minds about the science?”
The small group of social psychologists felt under attack. They had published hundreds of articles and become famous for demonstrating the influence of unconscious processes; processes that by definition are ignored by people when they try to understand their own behaviors because they operate in secrecy, undetected by conscious introspection. What if all of their amazing discoveries were not real? Of course, the researchers were aware that not all studies worked. After all, they often encountered failures to find these effects in their own labs. It often required several attempts to get the right conditions to produce results that could be published. If a group of researchers were to just go into the lab and do the study once, how would we know that they did everything right? Given ample evidence of failure in their own labs, nobody from this group wanted to step forward and subject their own study to a one-shot test.
Quote “Then on March 21, Fritz Strack, the psychologist in Wurzburg, sent a message to the guys. “Don’t get me wrong,” he wrote, “but I am not a particularly religious person and I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’ ” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.” So if the skeptics wanted something to examine—a test case to stand in for all of social-psych research—then let them try his work.”
Fritz Strack was not afraid of failure. He volunteered his most famous study for a replication project.
Quote “ In 1988, Strack had shown that movements of the face lead to movements of the mind. He’d proved that emotion doesn’t only go from the inside out, as Malcolm Gladwell once described it, but from the outside in.”
It is not exactly clear why Strack picked his 1988 article for replication. The article included two studies. The first study produced a result that is called marginally significant. That is, it did not meet the standard criterion of evidence, a p-value less than .05 (two-tailed). But the p-value was very close to .05 and less than .10 (or .05 one-tailed). This finding alone would not justify great confidence in the replicability of the original finding. Moreover, a small study with so much noise makes it impossible to estimate the true effect size with any precision. The observed effect size in the study was large, but this could have been due to luck (sampling error). In a replication study, the effect size could be a lot smaller, which would make it difficult to obtain a significant result.
The key finding of this study was that manipulating participants’ facial muscles appeared to influence their feelings of amusement in response to funny cartoons without participants’ awareness that their facial muscles contributed to the intensity of the experience. This finding made sense in the context of a long tradition of theories that assumed feedback from facial muscles plays an important role in the experience of emotions.
Strack seemed to be confident that his results would replicate because many other articles also reported results that seemed to support the facial feedback hypothesis. His study became famous because it used an elaborate cover story to ensure that the effect occurred without participants’ awareness.
Quote: “In lab experiments, facial feedback seemed to have a real effect…But Strack realized that all this prior research shared a fundamental problem: The subjects either knew or could have guessed the point of the experiments. When a psychologist tells you to smile, you sort of know how you’re expected to feel.”
Strack was not the first to do so.
Quote: “In the 1960s, James Laird, then a graduate student at the University of Rochester, had concocted an elaborate ruse: He told a group of students that he wanted to record the activity of their facial muscles under various conditions, and then he hooked silver cup electrodes to the corners of their mouths, the edges of their jaws, and the space between their eyebrows. The wires from the electrodes plugged into a set of fancy but nonfunctional gizmos… Subjects who had put their faces in frowns gave the cartoons an average rating of 4.4; those who put their faces in smiles judged the same set of cartoons as being funnier—the average jumped to 5.5.”
A change by 1.1 points on a rating scale is a huge effect and consistent results across different studies would suggest that the effect can be easily replicated. The point of Strack’s study was not to demonstrate the effect, but to improve the cover story that made it difficult for participants to guess the real purpose of the study.
Quote: “Laird’s subterfuge wasn’t perfect, though. For all his careful posturing, it wasn’t hard for the students to figure out what he was up to. Almost one-fifth of them said they’d figured out that the movements of their facial muscles were related to their emotions. Strack and Martin knew they’d have to be more crafty. At one point on the drive to Mardi Gras, Strack mused that maybe they could use thermometers. He stuck his finger in his mouth to demonstrate. Martin, who was driving, saw Strack’s lips form into a frown in the rearview mirror. That would be the first condition. Martin had an idea for the second one: They could ask the subjects to hold thermometers—or better, pens—between their teeth. This would be the stroke of genius that produced a classic finding in psychology.”
So in a way, Strack et al.’s study was a conceptual replication study of Laird’s study that used a different manipulation of facial muscles. And the replication study was successful.
Quote: “The results matched up with those from Laird’s experiment. The students who were frowning, with their pens balanced on their lips, rated the cartoons at 4.3 on average. The ones who were smiling, with their pens between their teeth, rated them at 5.1. What’s more, not a single subject in the study noticed that her face had been manipulated. If her frown or smile changed her judgment of the cartoons, she’d been totally unaware.”
However, even though the effect size was still large, a .8-point difference in ratings, the effect was only marginally significant. A second study by Strack et al. also produced only a marginally significant result. Thus, we may start to wonder why the researchers were not able to produce stronger evidence for the effect; evidence that would produce a significant result at the conventional criterion required for claiming a discovery, p < .05 (two-tailed). And why did this study become a classic without stronger evidence that the effect is real and that it is really as large as the reported effect sizes in these studies? The effect size may not matter for basic research studies that merely want to demonstrate that the effect exists, but it is important for applications to the real world. If an effect is large under strictly controlled laboratory conditions, the effect is going to be much smaller in real-world situations where many of the factors that are controlled in the laboratory also influence emotional experiences. This might also explain why people normally do not notice the contribution of their facial expressions to their experiences. Relative to their mood, the funniness of a joke, the presence of others, and a dozen more contextual factors that influence our emotional experiences, feedback from facial muscles may make a very small contribution to emotional experiences. Strack seems to agree.
Quote “It was theoretically trivial,” says Strack, but his procedure was both clever and revealing, and it seemed to show, once and for all, that facial feedback worked directly on the brain, without the intervention of the conscious mind. Soon he was fielding calls from journalists asking if the pen-in-mouth routine might be used to cure depression. He laughed them off. There are better, stronger interventions, he told them, if you want to make a person happy.”
Strack may have been confident that his study would replicate because other publications used his manipulation and also reported significant results. And researchers even proposed that the effect is strong enough to have practical implications in the real world. One study even suggested that controlling facial expressions can reduce prejudice.
Quote: “Strack and Martin’s method would eventually appear in a bewildering array of contexts—and be pushed into the realm of the practical. If facial expressions could influence a person’s mental state, could smiling make them better off, or even cure society’s ills? It seemed so. In 2006, researchers at the University of Chicago showed that you could make people less racist by inducing them to smile—with a pen between their teeth—while they looked at pictures of black faces.”
The result is so robust that replicating it is a piece of cake, a walk in the park, and works even in classroom demonstrations.
“Indeed, the basic finding of Strack’s research—that a facial expression can change your feelings even if you don’t know that you’re making it—has now been reproduced, at least conceptually, many, many times. (Martin likes to replicate it with the students in his intro to psychology class.)”
Finally, Strack may have been wrong when he laughed off questions about curing depression with controlling facial muscles. Apparently, it is much harder to commit suicide if you put a pen in your mouth to make yourself smile.
Quote: “In recent years, it has even formed the basis for the treatment of mental illness. An idea that Strack himself had scoffed at in the 1980s now is taken very seriously: Several recent, randomized clinical trials found that injecting patients’ faces with Botox to make their “frown lines” go away also helped them to recover from depression.”
So, here you have it. If you ignore publication bias and treat the mountain of confirmatory evidence with a 100% success rate in journals as credible evidence, there is little doubt that the results would replicate. Of course, by the same standard of evidence there is no reason to doubt that other priming studies would replicate, which they did until a group of skeptical researchers tried to replicate the results and failed to do so.
Quote: “Strack found himself with little doubt about the field. “The direct influence of facial expression on judgment has been demonstrated many, many times,” he told me. “I’m completely convinced.” That’s why he volunteered to help the skeptics in that email chain three years ago. “They wanted to replicate something, so I suggested my facial-feedback study,” he said. “I was confident that they would get results, so I didn’t know how interesting it would be, but OK, if they wanted to do that? It would be fine with me.”
PART II: THE REPLICATION STUDY
The replication project was planned by EJ Wagenmakers, who made his name as a critic of research practices in social psychology in response to Bem’s (2011) incredible demonstration of feelings that predict random future events. Wagenmakers believes that many published results are not credible because the studies failed to test theoretical predictions. Social psychologists would run many studies and publish the results when they found a significant result with p < .05 (at least one-tailed). When the results were not significant, the study was considered a failure and was not reported. This practice makes it difficult to predict which results are real and will replicate and which results are not real and will not replicate. Wagenmakers estimated that the facial feedback study had a 30% chance to replicate.
Quote “Personally, I felt that this one actually had a good chance to work,” he said. How good a chance? I gave it a 30-percent shot.” [Come again. A good chance is 30%?]
A 30% probability may be justified because a replication project by the Open Science Collaboration found that only 25% of social psychological results were successfully replicated. However, that project used only slightly larger samples than the original studies. In the replication of the facial feedback hypothesis, 17 labs with larger samples than the original studies and nearly 2,000 participants in total were going to replicate the original study. The increase in sample size increases the chances of producing a significant result even if the effect size of the original study was vastly inflated. If a result is not significant with 2,000 participants, it becomes possible to say that the effect may actually not exist, or that the effect size is so small as to be practically meaningless and to have no relevance for the treatment of depression. Thus, the prediction that there is only a 30% chance of success implies that Wagenmakers was very skeptical about the original results and expected a drastic reduction in the effect size.
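To see why nearly 2,000 participants change the game, consider a quick power simulation. This is a minimal sketch with illustrative numbers (the true effect of d = 0.2 and the cell sizes are my assumptions, not the actual RRR design):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def power_sim(d, n_per_group, n_sims=5000):
    """Estimate the power of a two-sample t-test by simulation."""
    hits = 0
    for _ in range(n_sims):
        smile = rng.normal(d, 1, n_per_group)  # condition with true effect d
        frown = rng.normal(0, 1, n_per_group)  # control condition
        _, p = stats.ttest_ind(smile, frown)
        hits += p < .05
    return hits / n_sims

# A small original-sized study rarely detects a small true effect,
# but the pooled replication sample almost always would.
print(power_sim(d=0.2, n_per_group=45))   # low power
print(power_sim(d=0.2, n_per_group=950))  # high power
```

With a small true effect, the original-sized study detects it only a small fraction of the time, while the pooled sample detects it almost always; that is why a null result with 2,000 participants is informative in a way a single small study never is.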
Quote “In a sense, he was being optimistic. Replication projects have had a way of turning into train wrecks. When researchers tried to replicate 100 psychology experiments from 2008, they interpreted just 39 of the attempts as successful. In the last few years, Perspectives on Psychological Science has been publishing “Registered Replication Reports,” the gold standard for this type of work, in which lots of different researchers try to re-create a single study so the data from their labs can be combined and analyzed in aggregate. Of the first four of these to be completed, three ended up in failure.”
There were good reasons to be skeptical. First, the facial feedback theory is controversial. There are two camps in psychology. One camp assumes that emotions are generated in the brain in direct response to cognitive appraisals of the environment. The other has argued that emotional experiences are based on bodily feedback. The controversy goes back to James versus Cannon and led to the famous Lazarus-Zajonc debate in the 1980s, at the beginning of modern emotion research. There is also the problem that it is statistically improbable that Strack et al. (1988) would get marginally significant results twice in a row in two independent studies. Sampling error makes p-values move around, and the chance of getting p < .10 and p > .05 twice in a row is slim. This suggests that the evidence was partially obtained with a healthy dose of sampling error and that a replication study would produce weaker effect sizes.
Quote: The work on facial feedback, though, had never been a target for the doubters; no one ever tried to take it down. Remember, Strack’s original study had confirmed (and then extended) a very old idea. His pen-in-mouth procedure worked in other labs.
Strack also hedged his bets about the outcome of the replication project, claiming that the original study did not produce a huge effect.
Quote “He acknowledged that the evidence from the paper wasn’t overwhelming—the effect he’d gotten wasn’t huge. Still, the main idea had withstood a quarter-century of research, and it hadn’t been disputed in a major, public way. “I am sure some colleagues from the cognitive sciences will manage to come up with a few nonreplications,” he predicted. But he thought the main result would hold.”
But that is wrong. The study did produce a surprisingly huge effect. It just didn’t produce strong evidence that this effect was caused by facial feedback rather than by problems with the randomized assignment of participants to conditions. His sample sizes were so small that the large effect was only a bit more than 1.5 times its standard error, which is just enough to claim a discovery with p < .05 one-tailed, but not 2 times its standard error, which is needed to claim a discovery with p < .05 two-tailed. So, the reported effect size was huge, but the strength of evidence was not. Taking the reported effect size at face value, one would predict that only every other study would produce a significant result and that the remaining studies would fail to replicate his results. So even if 17 laboratories successfully replicated his study and the true effect size was as large as the effect size reported by Strack et al., only about half of the labs would be able to claim a successful replication. As sample sizes were a bit larger in the replication studies, the percentage would be a bit higher, but clearly nobody should expect that all labs individually produce at least marginally significant results. In fact, it is improbable that Strack was able to get two (marginally) significant results in his two reported studies.
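The arithmetic behind "every other study" is straightforward. A sketch, again assuming a true effect of about 1.7 standard errors (an illustrative value for a result just short of p = .05 two-tailed) and using a z-test approximation:

```python
from scipy import stats

# Noncentrality: the true effect, assumed equal to the observed one,
# expressed in standard errors (~1.7, just short of z = 1.96)
ncp = 1.7
power_two_tailed = stats.norm.sf(1.96 - ncp) + stats.norm.cdf(-1.96 - ncp)
power_one_tailed = stats.norm.sf(1.645 - ncp)
print(power_two_tailed, power_one_tailed)  # roughly a coin flip
```

Even if the original effect size were the true one, an exact replication would reach two-tailed significance well under half the time, and one-tailed significance only about half the time.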
After several years of planning, collecting data, and analyzing the data, the results were reported. Not a single lab had produced a significant result. More important, even a combined analysis of data from close to 2,000 participants showed no effect; the effect size estimate was close to zero. In other words, there was no evidence that facial feedback had any influence on ratings of amusement in response to cartoons. This is what researchers call an epic fail. The study did not just fail in the sense that a smaller effect size estimate fell short of significance; the effect just doesn’t appear to be there at all, although even with 2,000 participants it is not possible to say that the effect is exactly zero. The results leave open the possibility that a very small effect exists, but an even larger sample would be needed to test this hypothesis. At the same time, the results are not inconsistent with the original results, because the original study had so much noise that the population effect size could have been close to zero.
PART III: Response to the Replication Failure
We might think that Strack was devastated by the failure to replicate the most famous result of his research career. However, he is rather unmoved by these results.
Quote: Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.
This is a bit odd, because earlier Strack assured us that he is not religious and trusts the scientific method: “I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.” So here we have two original studies with weak evidence for an effect and 17 studies with no evidence for the effect. If we combine the information from all 19 studies, we have no evidence for an effect. To believe in an effect even though 19 studies fail to provide scientific evidence for it seems a bit religious, although I would make a distinction between genuinely religious individuals, who know that they believe in something, and wanna-be scientists, who believe that they know something. How does Strack justify his belief in an effect that just failed to replicate? He refers to an article (a take-down) by himself that, according to his own account, shows fundamental problems with the idea that failed replication studies provide meaningful information. Apparently, only original studies provide meaningful information, and when replication studies fail to replicate the results of original studies, there must be a problem with the replication studies.
Quote: “Two years ago, while the replication of his work was underway, Strack wrote a takedown of the skeptics’ project with the social psychologist Wolfgang Stroebe. Their piece, called “The Alleged Crisis and the Illusion of Exact Replication,” argued that efforts like the RRR reflect an “epistemological misunderstanding,”
Accordingly, Bem (2011) did successfully demonstrate that humans (at least extraverted humans) can predict random events in the future and that learning after an exam can retroactively improve performance on the completed exam. The fact that replication studies failed to replicate these results only shows an epistemological misunderstanding: the idea that we can learn anything from replication studies by skeptics. So what is the problem with replication studies?
Quote: “Since it’s impossible to make a perfect copy of an old experiment. People change, times change, and cultures change, they said. No social psychologist ever steps in the same river twice. Even if a study could be reproduced, they added, a negative result wouldn’t be that interesting, because it wouldn’t explain why the replication didn’t work.”
We cannot reproduce exactly the same conditions of the original experiment. But why is that important? The same paradigm was allegedly used to reduce prejudice and cure depression, in studies that are wildly different from the original studies. It worked even then. So why did it not work when the original study was replicated as closely as possible? And why would we care in 2016 about a study that worked (marginally) for 92 undergraduate students at the University of Illinois in the 1980s? We don’t. For humans in 2016, the results of a study in 2015 are more relevant. Maybe it worked then, maybe it didn’t. We will never know, but we do now know that it typically didn’t work in 2015. Maybe it will work again in 2017. Who knows? But we cannot claim that the facial feedback theory has enjoyed good empirical support ever since Darwin came up with it.
But Strack goes further. When he looks at the results of the replication studies, he does not see what the authors of the replication studies see.
Quote: “So when Strack looks at the recent data he sees not a total failure but a set of mixed results.”
But 17 studies all find no effect, and all studies are consistent with the hypothesis that there is no effect; each 95% confidence interval includes 0, which is also true for Strack’s original two studies. How can somebody see mixed results in this consistent pattern?
Quote: Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed?
He simply divides the studies post hoc into studies that produced a positive result and studies that produced a negative result. There is no justification for this, because none of these studies are individually significantly different from each other, and the overall test shows that there is no heterogeneity; that is, the results are consistent with the hypothesis that the true population effect size is 0 and that all of the variability in effects across studies is just the random noise that is expected from studies with modest sample sizes.
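In fact, a 9-vs-8 split in the signs of the lab-level estimates is exactly what the null hypothesis predicts. A one-line sanity check, sketched with a simple binomial model of the labs' signs:

```python
from scipy import stats

# If the true effect is zero, each lab's estimate is a coin flip
# between positive and negative.  With 17 labs, a 9-vs-8 split is
# exactly the median outcome:
p_nine_or_more = stats.binom.sf(8, 17, 0.5)  # P(at least 9 positive labs)
print(p_nine_or_more)  # 0.5: a 9-8 split is as null-consistent as it gets
```

Far from hinting at hidden moderators, the observed split is the single most probable outcome under a true effect of zero.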
Quote: “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. How could he turn his back on all that evidence?”
And with this final quote, Strack leaves the realm of scientific discourse and proper interpretation of empirical facts. He is willing to disregard the results of a scientific test of the facial feedback hypothesis that he initially agreed to. It is now clear why he agreed to it: he never considered it a real test of his theory. No matter what the results would be, he would maintain his belief in his couple of marginally significant, statistically improbable results. Social psychologists have, of course, studied how humans respond to negative information that challenges their self-esteem and world views. Unlike facial feedback, those results are robust and not surprising. Humans are prone to dismiss inconvenient evidence and to construct sometimes ridiculous arguments in order to prop up cherished false beliefs. As such, Strack’s response to the failure of his most famous article is a successful demonstration that some findings in social psychology are replicable; it just so happens that Strack’s study is not one of them.
Strack comes up with several objections to the replication studies that show his ignorance of the whole project. For example, he claims that many participants may have guessed the purpose of the study because it is now a textbook finding. However, the researchers who conducted the replication studies made sure that the study was run before the finding was covered in class, and some universities do not cover it at all. Moreover, just like Laird, the replication teams excluded participants who guessed the purpose. Many more participants were excluded because they didn’t hold the pen properly. Of course, these exclusions should strengthen the effect, because the manipulation should not work when the wrong facial muscles are activated.
Strack even claims that the whole project lacked a research question.
Quote: “Strack had one more concern: “What I really find very deplorable is that this entire replication thing doesn’t have a research question.” It does “not have a specific hypothesis, so it’s very difficult to draw any conclusions,” he told me.”
This makes no sense. Participants were randomly allocated to two conditions and a dependent variable was measured. The hypothesis was that holding the pen in a way that elicits a smile leads to higher ratings of amusement than holding the pen in a way that elicits a frown. The empirical question was whether this manipulation would have an effect, and this was assessed with a standard test of statistical significance. The answer was that there was no evidence for the effect. The research question was the same as in the original study. If this is not a research question, then the original study also had no research question.
And finally, Strack makes the unscientific claim that it simply cannot be true that the reported studies all got it wrong.
Quote: The RRR provides no coherent argument, he said, against the vast array of research, conducted over several decades, that supports his original conclusion. “You cannot say these [earlier] studies are all p-hacked,” Strack continued, referring to the battery of ways in which scientists can nudge statistics so they work out in their favor. “You have to look at them and argue why they did not get it right.”
Scientific journals select studies that produced significant results. As a result, all prior studies were published because they produced a significant (or at least marginally significant) result. Given this selection for significance, there is no error control, and the number of successful replications in the published literature tells us nothing about the truth of a finding. We do not have to claim that all studies were p-hacked. We can just say that all published studies were selected to be significant, and that is true and well known. As a result, we do not know which results will replicate until we have conducted replication studies that do not select for significance. This is what the RRR did. As a result, it provides the first unbiased, real empirical test of the facial feedback hypothesis, and the test failed. That is science. Ignoring it is not.
Closer inspection of the original article by Daniel Engber shows further problems.
Quote: For the second version, Strack added a new twist. Now the students would have to answer two questions instead of one: First, how funny was the cartoon, and second, how amused did it make them feel? This was meant to help them separate their objective judgments of the cartoons’ humor from their emotional reactions. When the students answered the first question—“how funny is it?,” the same one that was used for Study 1—it looked as though the effect had disappeared. Now the frowners gave the higher ratings, by 0.17 points. If the facial feedback worked, it was only on the second question, “how amused do you feel?” There, the smilers scored a full point higher. (For the RRR, Wagenmakers and the others paired this latter question with the setup from the first experiment.) In effect, Strack had turned up evidence that directly contradicted the earlier result: Using the same pen-in-mouth routine, and asking the same question of the students, he’d arrived at the opposite answer. Wasn’t that a failed replication, or something like it?”
Strack dismisses this concern as well, but Daniel Engber is not convinced.
Quote: “Strack didn’t think so. The paper that he wrote with Martin called it a success: “Study 1’s findings … were replicated in Study 2.”… That made sense, sort of. But with the benefit of hindsight—or one could say, its bias—Study 2 looks like a warning sign. This foundational study in psychology contained at least some hairline cracks. It hinted at its own instability. Why didn’t someone notice?”
And nobody else should be convinced. Fritz Strack is a prototypical example of a small group of social psychologists that has ruined social psychology by engaging in a game of publishing results that were consistent with theories of strong and powerful effects of stimuli on people’s behavior outside their awareness. These results were attention-grabbing, just like annual returns of 20% would be eye-catching returns. Many people invested in these claims on the basis of flimsy evidence that doesn’t even withstand scrutiny by a science journalist. And to be clear, only a few of them went as far as to fabricate data. But many others fabricated facts by publishing only studies that supported their claims while hiding evidence from studies that failed to show the effect. Now we see what happens when these claims are subjected to real empirical tests that can succeed or fail. Many of them fail. For future generations it is not important why these researchers did what they did or how they feel about it now. What is important is that we realize that many results in textbooks are not based on solid evidence, and that social psychology needs to change the way it conducts research if it wants to become a real science that builds on empirically verifiable facts. Strack’s response to the RRR is what it is: a defensive reaction to evidence that his famous article was based on a false positive result.