Added January 30, 2018: A formal letter to the editor of JPSP, calling for a retraction of the article (Letter).
“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?’” (Daryl J. Bem, in Engber, 2017)
In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time-reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.
The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results: “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results” (Bem, 2011, p. 407).
The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.
In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.
The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt on Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem, which was based on over 1,000 participants. A more informative replication attempt was made by Galek et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found that “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not” (p. 941).
Galek et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value if the reported data and analyses were not influenced by the result. If, however, data or analyses were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntarily (other-deception). Both self-deception and other-deception introduce bias into the scientific record.
Replication studies cannot provide evidence about bias in original studies. They only tell us that other scientists were unable to replicate the original findings; they do not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.
I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely. Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.
1. Luck

The simplest and most parsimonious explanation for a stunning original result that cannot be replicated is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.
If Bem or other scientists encountered one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies would not replicate the effect.
Attributing Bem’s results to luck would be possible, if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt.
The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.
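This probability follows directly from the binomial distribution. A minimal check of the arithmetic, using the 5% criterion and the 9-out-of-10 success count from the text above:

```python
from math import comb

alpha = 0.05  # Bem's significance criterion
# probability of 9 or more significant results in 10 attempts by chance alone
prob = sum(comb(10, k) * alpha**k * (1 - alpha)**(10 - k) for k in (9, 10))
print(prob)      # about 1.9e-11
print(1 / prob)  # about 53.6 billion to one
```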
Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is, evidence that strengthens beliefs in a phenomenon that actually does not exist.
2. Questionable Research Practices
The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of his studies. This finding suggests that questionable research practices were used, but it does not reveal which practices were actually used. John et al. listed a number of questionable research practices that might explain Bem’s findings.
2.1. Multiple Dependent Variables
One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.
2.2. Failure to report all conditions
This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.
2.3 Generous Rounding
Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking, a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test statistics or by redoing the statistical analysis from original data. Bem reported his p-values to three digits. Moreover, it is very unlikely that a p-value would fall into the narrow range between .05 and .055 in a single study, let alone in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.
2.4 Hypothesizing After Results Are Known (HARKing)

Hypothesizing after results are known (HARKing; Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except Study 1. The studies built on a meta-analysis of prior studies, and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing also cannot explain Bem’s findings.
2.5 Excluding Data
Sometimes non-significant results are caused by an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023, into a non-significant result. Across all studies, only 4 participants out of 1,075 were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.
2.6 Stopping Data Collection Early
Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.
In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.
2.7 Optional Stopping/Snooping
Collecting more data when the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The size of this increase depends on how often and at what intervals researchers check their results. If researchers use optional stopping, three patterns are expected. First, sample sizes will vary, because sampling error will sometimes produce a significant result quickly and sometimes only after a long time. Second, sample size will be negatively correlated with observed effect sizes, because larger samples are needed to achieve significance with smaller observed effect sizes; if chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion, because data collection is terminated as soon as p-values reach the criterion value.
The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).
Despite these warning signs, optional stopping cannot explain Bem’s results if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to consistently produce significant results. In a simulation study with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.
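The limits of optional stopping can be illustrated with a small simulation. The sketch below assumes a researcher peeks after every 10 participants between N = 50 and N = 200 and uses a one-tailed test with a normal approximation to the t-test; the exact settings of the simulation mentioned above are not specified, so the numbers are merely illustrative:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

def optional_stopping_study(rng, n_min=50, n_max=200, step=10, alpha=0.05):
    """Simulate one study under the null (true effect d = 0): test after
    every `step` participants and stop as soon as the one-tailed p-value
    (normal approximation) drops below alpha."""
    data = [rng.gauss(0, 1) for _ in range(n_min)]
    while True:
        n = len(data)
        z = fmean(data) / (stdev(data) / sqrt(n))
        if 1 - NormalDist().cdf(z) < alpha:
            return True   # "significant" despite a true effect of zero
        if n >= n_max:
            return False
        data.extend(rng.gauss(0, 1) for _ in range(step))

rng = random.Random(2011)
rate = fmean(optional_stopping_study(rng) for _ in range(2000))
print(rate)  # well above the nominal 5%, but nowhere near 9 out of 10
```

Even with frequent peeking, the false-positive rate is inflated by a factor of two to three, which is far below the success rate in Bem’s article.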
2.8 Selective Reporting
The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Franco et al., 2014).
Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.
Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.
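The arithmetic behind these figures is straightforward (the modal N = 100 and the $5 payment are taken from the text; the expected number of attempts follows from the 5% criterion):

```python
alpha = 0.05                         # significance criterion: 1-in-20 chance hits
expected_studies = round(9 / alpha)  # studies needed on average for 9 hits: 180
total_n = expected_studies * 100     # modal sample size N = 100 -> 18,000
cost = total_n * 5                   # $5 per participant -> $90,000
print(expected_studies, total_n, cost)  # 180 18000 90000
```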
As a strong believer in ESP, Bem may have paid $90,000 to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology:
“The integrity of the scientific enterprise requires the reporting of disconfirming results.”
In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.
3. The Decline Effect and a New Questionable Research Practice
When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased. This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).
Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.
The selection effect is so strong that Bem could have stopped 9 of the 10 studies with a significant result after collecting at most 15 participants. The average sample size for these 9 studies would have been only 7.75 participants.
Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants none of the 10 tests is statistically significant. A meta-analysis across the 10 studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1. P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.
| Experiment | S 1–50 | S 51–100 |
|------------|----------|----------|
| EXP1 | p = .004 | p = .194 |
| EXP2 | p = .096 | p = .170 |
| EXP3 | p = .039 | p = .100 |
| EXP4 | p = .033 | p = .067 |
| EXP5 | p = .013 | p = .069 |
| EXP6a | p = .412 | p = .126 |
| EXP6b | p = .023 | p = .410 |
| EXP7 | p = .020 | p = .338 |
| EXP8 | p = .010 | p = .318 |
| EXP9 | p = .003 | NA |
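The Test of Insufficient Variance mentioned above can be reproduced from the second column of Table 1. Under unbiased sampling, p-values converted to z-scores should have a variance of approximately 1; the sketch below compares the observed variance to this expectation with a left-tailed chi-square test (a simplified version of the TIV logic in Schimmack, 2015):

```python
from math import exp, factorial
from statistics import NormalDist, variance

# one-tailed p-values for participants 51-100 (Table 1; EXP9 has no data)
ps = [.194, .170, .100, .067, .069, .126, .410, .338, .318]
zs = [NormalDist().inv_cdf(1 - p) for p in ps]

var_z = variance(zs)   # expected to be ~1 for an unbiased set of studies
df = len(zs) - 1
x = df * var_z         # df * Var(z) follows a chi-square(df) distribution
# left-tail chi-square CDF; closed-form series valid for even df
p_tiv = 1 - exp(-x / 2) * sum((x / 2)**i / factorial(i) for i in range(df // 2))
print(round(var_z, 2), round(p_tiv, 3))  # 0.22 0.013
```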
There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continues to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated the researcher to collect more data is sufficient to maintain statistical significance, because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure, and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).
In this scenario, researchers deceive themselves that slight experimental manipulations have huge effects on their dependent variable, because sampling error in small samples is very large. Observed effect sizes in small samples can range from -1 to 1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the belief that “the success of replications in psychological research often depends on subtle and unknown factors” (p. 422). How could Bem reconcile this belief with the reporting of 9 successes out of 10 attempts? The most plausible explanation is that the successes are a selected set of findings out of many attempts that were not reported.
There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection. In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.
“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”
Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they did not peek at the data and terminated studies when the results were not promising.
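This “promising pilot” mechanism can be demonstrated with a simulation under the assumption that the true effect is zero. The sketch below peeks after 15 participants (the cutoff suggested by Figure 1) and continues to N = 100 only when the peek is significant; a normal approximation stands in for the t-test, so the exact numbers are illustrative:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

def one_tailed_p(data):
    """One-tailed p-value for a one-sample test against zero (z approximation)."""
    z = fmean(data) / (stdev(data) / sqrt(len(data)))
    return 1 - NormalDist().cdf(z)

rng = random.Random(42)
d_pilot, d_final = [], []
for _ in range(5000):                              # the null is true: d = 0
    pilot = [rng.gauss(0, 1) for _ in range(15)]
    if one_tailed_p(pilot) >= 0.05:
        continue                                   # abandon the "failed pilot"
    full = pilot + [rng.gauss(0, 1) for _ in range(85)]
    d_pilot.append(fmean(pilot) / stdev(pilot))    # effect size, first 15 trials
    d_final.append(fmean(full) / stdev(full))      # effect size at N = 100

print(round(fmean(d_pilot), 2), round(fmean(d_final), 2))
```

The studies that survive the peek start with strongly inflated effect sizes that shrink by far more than half as the remaining participants are added, reproducing the decline pattern in Figure 1 and Table 1 without any real effect.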
Additional hints come from an interview with Engber (2017).
“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.
In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result and then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practice capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.
Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one. Schooler (2011) proposed that something more intriguing may cause decline effects.
“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.”
Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these phenomena is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiments a plausible explanation for replication failures. If a researcher collects too much data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effects may not be observable all the time. Thus, sometimes initial trials will also fail to show the effect.
I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.
Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.
The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers’ a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).
The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality. The main problem appears to be a practice to peek at results in small samples with different procedures and to attribute differences in results to the experimental procedures, while ignoring the influence of sampling error.
Conceptual Replications and Hidden Moderators
In response to the crisis of confidence in social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At the very least, exact replications should succeed more often than they fail (Tversky and Kahneman, 1971).
Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).
The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and failures in other studies are called hidden moderators.
Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This belief sustains beliefs in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.
The sensitivity of social psychological effects to small changes in experimental procedures also justifies that it is necessary to conduct many studies that are expected to fail, just like medieval alchemists expected many failures in their attempts to make gold. These failures are not important. They are simply needed to find the conditions that produce the desired outcome; a significant result that supports researchers’ predictions.
The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in search of a predicted outcome without realizing that a few successes among many failures are expected simply due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologists to deceive themselves and others for decades.
Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.
Another important article for the replication crisis in psychology examined the probability that questionable research practices can produce false positive results (Simmons, Nelson, & Simonsohn, 2011). The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used. They find that a single questionable practice can increase the chances of obtaining a false positive result from the nominal 5% to 12.6%. A combination of four questionable research practices increased the risk to 60.7%. The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky. But it is very unlikely that a researcher could p-hack a series of 9 studies to produce 9 false positive results (.607^9 ≈ 1%).
The analysis of Bem’s data suggests that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies). If the moderator is not known, it is still important to report the finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.
To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).
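To illustrate, such a meta-analysis can be computed with standard inverse-variance weights. The effect sizes and sample sizes below are hypothetical, and the variance formula is the common approximation for a one-sample d:

```python
from math import sqrt
from statistics import NormalDist

# hypothetical mix of significant and nonsignificant (pilot) studies: (d, N)
studies = [(0.25, 100), (0.05, 100), (0.30, 50), (-0.02, 150), (0.12, 100)]

# fixed-effect meta-analysis: inverse-variance weights,
# with var(d) approximated by 1/n + d^2/(2n) for a one-sample design
weights = [1 / (1 / n + d * d / (2 * n)) for d, n in studies]
d_meta = sum(w * d for w, (d, _) in zip(weights, studies)) / sum(weights)
se_meta = sqrt(1 / sum(weights))
p_meta = 1 - NormalDist().cdf(d_meta / se_meta)  # one-tailed pooled test
print(round(d_meta, 2), round(p_meta, 3))
```

The pooled effect can reach significance even though several individual studies did not, which is exactly why reporting the nonsignificant studies does not weaken a true effect; it only removes the bias.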
Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct. Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.
Should the Journal of Personality and Social Psychology Retract Bem’s Article?
Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted. This seems excessive.
However, Bem’s article is unique in many ways, and the new analyses of the original data presented here suggest that bad research practices, inadvertent or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem (2011) as erroneous. Unless the article is retracted, uninformed researchers will continue to cite it as evidence for supernatural phenomena like time-reversed causality.
“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).
Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator that requires further investigation. A retraction would make it clear that there are no inconsistent findings to explain, because Bem’s findings do not provide credible evidence for the effect. Thus, it is unnecessary, and perhaps unethical, to recruit human participants for further replication studies of Bem’s paradigms.
This does not mean that future research on paranormal phenomena should be banned. However, future studies cannot be planned on the basis of Bem’s paradigms or results. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient given Bem’s effect sizes, but those effect sizes are not trustworthy and cannot be used to plan future studies.
A main objection to retraction is that Bem’s study made an inadvertent but important contribution to the history of social psychology: it triggered a method revolution and changes in the way social psychologists conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses, because the bias in Bem’s studies also biases all meta-analytic findings that include them (Bem, Tressoldi, Rabeyron, & Duggan, 2015).
[edited January 8, 2018]
It is not clear how Bem thinks about his article these days, but one quote in Engber’s article suggests that Bem now realizes that he provided false evidence for a phenomenon that does not exist.
When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”
If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.
Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bem, D.J., Tressoldi, P., Rabeyron, T. & Duggan, M. (2015) Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events, F1000 Research, 4, 1–33.
Engber, D. (2017). Daryl Bem proved ESP Is real: Which means science is broken. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9
Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. doi:10.1126/science.1255484
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103, 933–948. doi:10.1037/a0029709
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi:10.1177/0956797611430953
RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS ONE, 7(3), e33423. doi:10.1371/journal.pone.0033423
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2015). The test of insufficient variance: A new tool for the detection of questionable research practices. https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences of premature and repeated null hypothesis testing. Behavior Research Methods, 38, 24–27. doi:10.3758/BF03192746
Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137
Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.
November 28, Open Draft/Preprint (Version 1.0)
[Please provide comments and suggestions]
In this blog post I present a quantitative review of John A. Bargh’s book “Before you know it: The unconscious reasons we do what we do.” A quantitative book review differs from a traditional book review: its goal is to examine the strength of the scientific evidence that is provided to support the ideas in the book. Readers of a popular science book written by an eminent scientist expect these ideas to rest on solid scientific evidence. However, the strength of scientific evidence in psychology, especially social psychology, has been questioned. I use statistical methods to examine how strong the evidence actually is.
One problem in psychological publishing is a bias in favor of studies that support theories, so-called publication bias. The reason for publication bias is that scientific journals can publish only a fraction of the results that scientists produce. This leads to heavy competition among scientists to produce publishable results, and journals like to publish statistically significant results; that is, studies that provide evidence for an effect (e.g., “eating green jelly beans cures cancer” rather than “eating red jelly beans does not cure cancer”). Statisticians have pointed out that publication bias undermines the meaning of statistical significance, just as counting only hits would undermine the meaning of batting averages: everybody would have an incredible batting average of 1.00.
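The batting-average logic can be made concrete with a short simulation. The numbers below are hypothetical (a small true effect of d = 0.2 and 20 participants per group, using a normal approximation to the t-test); the point is only that a low true success rate turns into a perfect published record when journals print only the hits.

```python
import random
import statistics
from statistics import NormalDist

# Hypothetical numbers: small true effect (d = 0.2), 20 per group,
# conventional two-sided 5% significance criterion.
random.seed(1)
n, d = 20, 0.2
crit = NormalDist().inv_cdf(0.975)  # 1.96

significant = []
for _ in range(10_000):
    g1 = [random.gauss(d, 1) for _ in range(n)]   # treatment group
    g2 = [random.gauss(0, 1) for _ in range(n)]   # control group
    se = (2 / n) ** 0.5                           # SE of the mean difference
    z = (statistics.mean(g1) - statistics.mean(g2)) / se
    significant.append(abs(z) > crit)

actual = sum(significant) / len(significant)
print(f"success rate across all studies run: {actual:.0%}")
print("success rate among published (significant-only) studies: 100%")
```

Only about one in ten of these simulated studies “succeeds,” yet a reader of the significant-only literature would see nothing but successes.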
For a long time it was assumed that publication bias is just a minor problem. Maybe researchers conducted 10 studies and reported only the 8 that produced significant results, while not reporting the remaining two. However, in the past five years it has become apparent that publication bias, at least in some areas of the social sciences, is much more severe, and that there are more unpublished studies with non-significant results than published studies with significant results.
In 2012, Daniel Kahneman raised doubts about the credibility of priming research in an open email letter addressed to John A. Bargh, the author of “Before you know it.” Kahneman is a big name in psychology; he won the Nobel Prize in economics in 2002. He also wrote a popular book that features John Bargh’s priming research (see review of Chapter 4). Kahneman wrote: “As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.”
Kahneman is not an outright critic of priming research. In fact, he was concerned about the future of priming research and made some suggestions about how Bargh and colleagues could alleviate doubts about the replicability of priming results. He wrote:
“To deal effectively with the doubts you should acknowledge their existence and confront them straight on, because a posture of defiant denial is self-defeating. Specifically, I believe that you should have an association, with a board that might include prominent social psychologists from other fields. The first mission of the board would be to organize an effort to examine the replicability of priming results.”
However, prominent priming researchers have been reluctant to replicate their old studies. At the same time, other scientists have conducted replication studies and failed to replicate classic findings. One example is Ap Dijksterhuis’s claim that showing words related to intelligence before a test can increase test performance. Shanks and colleagues tried to replicate this finding in 9 studies and came up empty in all 9. More recently, a team of over 100 scientists conducted 24 replication studies of Dijksterhuis’s professor-priming study. Only 1 study successfully replicated the original finding, but with a 5% error rate, 1 out of 20 studies is expected to produce a statistically significant result by chance alone. This result validates Shanks’s replication failures and strongly suggests that the original result was a statistical fluke (i.e., a false positive result).
Proponents of priming research like Dijksterhuis “argue that social-priming results are hard to replicate because the slightest change in conditions can affect the outcome” (Abbott, 2013, Nature News). Many psychologists consider this response inadequate. The hallmark of a good theory is that it predicts the outcome of a good experiment. If the outcome depends on unknown factors and replication attempts fail more often than not, a scientific theory lacks empirical support. For example, Kahneman wrote in an email that the apparent “refusal to engage in a legitimate scientific conversation … invites the interpretation that the believers are afraid of the outcome” (Abbott, 2013, Nature News).
It is virtually impossible to check on all original findings by conducting extensive and expensive replication studies. Moreover, proponents of priming research can always find problems with actual replication studies to dismiss replication failures. Fortunately, there is another way to examine the replicability of priming research. This alternative approach, z-curve, uses a statistical model to estimate replicability based on the results reported in original studies. Most important, this approach examines how replicable and credible original findings were based on the results reported in the original articles. Therefore, original researchers cannot use inadequate methods or slight variations in contextual factors to dismiss replication failures. Z-curve can show that the original evidence was weaker than dozens of published studies suggest, because it takes into account that published studies were selected to provide evidence for priming effects.
My colleagues and I used z-curve to estimate the average replicability of priming studies that were cited in Kahneman’s chapter on priming research. We found that the average probability of a successful replication was only 14%. Given the small number of studies (k = 31), this estimate is not very precise. It could be higher, but it could also be even lower. This estimate would imply that for each published significant result, there are 9 unpublished non-significant results that were omitted due to publication bias. Given these results, the published significant results provide only weak empirical support for theoretical claims about priming effects. In a response to our blog post, Kahneman agreed (“What the blog gets absolutely right is that I placed too much faith in underpowered studies”).
Our analysis of Kahneman’s chapter on priming provided a blueprint for this quantitative review of Bargh’s book “Before you know it.” I first checked the notes for sources and then linked the sources to the corresponding references in the reference section. If a reference was an original research article, I downloaded it and looked for the most critical statistical test of a study. If an article contained multiple studies, I chose one test from each study. I found 168 usable original articles that reported a total of 400 studies. I then converted all test statistics into absolute z-scores and analyzed them with z-curve to estimate replicability (see Excel file for coding of studies).
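The conversion step can be sketched as follows. This is a minimal illustration of turning a reported two-sided p-value into an absolute z-score; the actual coding worked from the reported test statistics themselves.

```python
from statistics import NormalDist

norm = NormalDist()

def z_from_p(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    return norm.inv_cdf(1 - p_two_sided / 2)

# The conventional significance criteria map onto familiar z values:
print(round(z_from_p(0.05), 2))   # significant (two-sided) -> 1.96
print(round(z_from_p(0.10), 2))   # marginally significant -> 1.64
```

These two values are the thresholds used in Figure 1 to classify results as significant (z > 1.96) or marginally significant (z > 1.65).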
Figure 1 shows the distribution of absolute z-scores. 90% of test statistics were statistically significant (z > 1.96) and 99% were at least marginally significant (z > 1.65), meaning they passed a less stringent statistical criterion to claim a success. This is not surprising, because supporting evidence requires statistical significance. The more important question is how many studies would produce a statistically significant result again if all 400 studies were replicated exactly. The estimated success rate in Figure 1 is less than half (41%). Although there is some uncertainty around this estimate, the 95% confidence interval only just reaches 50%, suggesting that the true value is below 50%. There is no clear criterion for inadequate replicability, but Tversky and Kahneman (1971) suggested a minimum of 50%. Professors are also used to giving students who score below 50% on a test an F, so I decided to apply the grading scheme at my university to replicability scores. The overall grade for the replicability of the studies cited by Bargh to support the ideas in his book is an F.
This being said, 41% replicability is much more than the 5% we would expect by chance alone. Clearly some of the results mentioned in the book are replicable. The question is which findings are replicable and which are difficult to replicate or even false positives. The problem with 41% replicability is that we do not know which results we can trust. Imagine you are interviewing 100 eyewitnesses and only 41 of them are reliable. Would you be able to identify a suspect?
It is also possible to analyze subsets of studies. Figure 2 shows the results for all experimental studies that randomly assigned participants to two or more conditions. If a manipulation has an effect, it produces mean differences between the groups. Social psychologists like these studies because they allow strong causal inferences and make it possible to disguise the purpose of a study. Unfortunately, this design requires large samples to produce replicable results, and social psychologists often used rather small samples in the past (the rule of thumb was 20 per group). As Figure 2 shows, the replicability of these studies is lower than the replicability of all studies: the average is only 24%. This means that for every significant result there are at least three non-significant results that have not been reported due to the pervasive influence of publication bias.
If 24% doesn’t sound bad enough, it is important to realize that this estimate assumes that the original studies can be replicated exactly. However, social psychologists have pointed out that even minor differences between studies can lead to replication failures. Thus, the success rate of actual replication studies is likely to be even less than 24%.
In conclusion, the statistical analysis of the evidence cited in Bargh’s book confirms concerns about the replicability of social psychological studies, especially experimental studies that compared mean differences between two groups in small samples. Readers of the book should be aware that the results reported in the book might not replicate in a new study under slightly different conditions and that numerous claims in the book are not supported by strong empirical evidence.
Replicability of Chapters
I also estimated the replicability separately for each of the 10 chapters to examine whether some chapters are based on stronger evidence than others. Table 1 shows the results. Seven chapters scored an F, two chapters scored a D, and one chapter earned a C-. Although there is some variability across chapters, none of the chapters earned a high score, but some chapters contain individual studies with strong evidence.
Table 1. Chapter Report Card
Credible Findings in the Book
Unfortunately, it is difficult to determine the replicability of individual studies with high precision. Nevertheless, studies with high z-scores are more replicable. Particle physicists use a criterion of z > 5 to minimize the risk that the result of a single study is a false positive. I found that psychological studies with a z-score greater than 4 had an 80% chance of being replicated in actual replication studies. Using this rule as a rough estimate of replicability, I was able to identify credible claims in the book. Highlighting these claims does not mean that the other claims are wrong; it simply means that they are not supported by strong evidence.
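The stringency of these thresholds is easy to quantify: the two-sided probability of observing a given |z| when there is no effect falls off steeply.

```python
from statistics import NormalDist

norm = NormalDist()

# Two-sided probability of observing at least |z| by chance alone
for z in (1.96, 4.0, 5.0):
    p = 2 * (1 - norm.cdf(z))
    print(f"|z| > {z}: p = {p:.1g}")
```

At z = 1.96 the chance probability is the familiar 5%; at z = 4 it is below 1 in 10,000, and at z = 5 below 1 in a million, which is why results beyond these thresholds are unlikely to be flukes.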
According to Chapter 1, there seems “to be a connection between the strength of the unconscious physical safety motivation and a person’s political attitudes.” The notes list a number of articles to support this claim. The only conclusive evidence in these studies is that self-reported political attitudes (a measure of right-wing authoritarianism) correlate with self-reported beliefs that the world is dangerous (Duckitt et al., JPSP, 2002, 2 studies, z = 5.42, 6.93). A correlation between two self-report measures is hardly evidence for unconscious physical safety motives.
Another claim is that “our biological mandate to reproduce can have surprising manifestations in today’s world.” This claim is linked to a study that examined the influence of physical attractiveness on callbacks for a job interview. In a large field experiment, researchers mailed resumes (N = 11,008) in response to real job ads and found that both men and women were more likely to be called for an interview if the application included a picture of a highly attractive applicant rather than a less attractive one (Busetta et al., 2013, z = 19.53). Although this is an interesting and important finding, it is not clear that the human resource offices’ preference for attractive applicants was driven by a “biological mandate to reproduce.”
Chapter 2 introduces the idea that there is a fundamental connection between physical sensations and social relationships. “… why today we still speak so easily of a warm friend, or a cold father. We always will. Because the connection between physical and social warmth, and between physical and social coldness, is hardwired into the human brain.” Only one z-score surpassed the 4-sigma threshold. This z-score comes from a brain imaging study that found increased sensorimotor activation in response to hand-washing products (soap) after participants had lied in a written email, but not after they had lied verbally (Schaefer et al., 2015, z = 4.65). There are two problems with this supporting evidence. First, z-scores in fMRI studies require a higher threshold than z-scores in other studies because brain imaging studies allow for multiple comparisons that increase the risk of a false positive result (Vul et al., 2009). More important, even if this finding could be replicated, it does not provide support for the claim that these neurological connections are hard-wired into humans’ brains.
The second noteworthy claim in Chapter 2 is that infants “have a preference for their native language over other languages, even though they don’t yet understand a word.” This claim is not very controversial, given ample evidence that humans prefer familiar over unfamiliar stimuli (Zajonc, 1968, also cited in the book). However, it is not easy to study infants’ preferences (after all, they cannot tell us). Developmental researchers use a visual attention task to infer preferences: if an infant looks longer at one of two stimuli, this indicates a preference for that stimulus. Kinzler et al. (PNAS, 2007) reported six studies. For five studies, z-scores ranged from 1.85 to 2.92, which is insufficient evidence to draw strong conclusions. However, Study 6 provided convincing evidence (z = 4.61) that 5-year-old children in Boston preferred a native speaker to a child with a French accent. The effect was so strong that 8 children were sufficient to demonstrate it. However, a study with 5-year-olds hardly provides evidence about infants’ preferences. In addition, the design of this study holds all other features constant, so it is not clear how strong the effect is in the real world, where many other factors can influence the choice of a friend.
Chapter 3 introduces the concept of priming. “Primes are like reminders, whether we are aware of the reminding or not.” It uses two examples to illustrate priming with and without awareness. One example implies that people can be aware of the primes that influenced their behavior: if you are in the airport, smell Cinnabon, and suddenly find yourself in front of the Cinnabon counter, you are likely to know that the smell made you think about Cinnabon and decide to eat one. The second example introduces the idea that primes can influence behavior without awareness: if you were cut off in traffic, you may respond with more hostility to a transgression by a co-worker without being aware that the earlier experience in traffic influenced your reaction. The supporting references contain two noteworthy (z > 4) findings that show how priming can be used effectively as reminders (Rogers & Milkman, 2016, Psychological Science, Study 2a, N = 920, z = 5.45, and Study 5, N = 305, z = 5.50). In Study 2a, online participants were presented with the following instruction:
“In this survey, you will have an opportunity to support a charitable organization called Gardens for Health that provides lasting agricultural solutions to address the problem of chronic childhood malnutrition. On the 12th page of this survey, please choose answer “A” for the last question on that page, no matter your opinion. The previous page is Page 1. You are now on Page 2. The next page is Page 3. The picture below will be on top of the NEXT button on the 12th page. This is intended to remind you to select answer “A” for the last question on that page. If you follow these directions, we will donate $0.30 to Gardens for Health.”
On pages 2–11, participants saw either distinct animals or other elephants. Participants in the distinct-animal condition were more likely to give the response that led to a donation than participants who saw a variety of elephants (z = 5.45).
Study 5 examined whether respondents would be willing to pay for a reminder. They were offered 60 cents extra payment for responding with “E” to the last question, and could either pay 3 cents to get an elephant reminder or not. 53% of participants were willing to pay for the reminder, which the authors compared to 0 (z = 2 × 10^9). This finding implies that participants are not only aware of the prime when they respond in the primed way, but are also aware of this link ahead of time and are willing to pay for it.
In short, Chapter 3 introduces the idea of unconscious or automatic priming, but the only solid evidence in the reference section supports the notion that we can also be consciously aware of priming effects and use them to our advantage.
Chapter 4 introduces the concept of arousal transfer: the idea that arousal from a previous event can linger and influence how we react to another event. The book reports in detail a famous experiment by Dutton and Aron (1974).
“In another famous demonstration of the same effect, men who had just crossed a rickety pedestrian bridge over a deep gorge were found to be more attracted to a woman they met while crossing that bridge. How do we know this? Because they were more likely to call that woman later on (she was one of the experimenters for the study and had given these men her number after they filled out a survey for her) than were those who met the same woman while crossing a much safer bridge. The men in this study reported that their decision to call the woman had nothing to do with their experience of crossing the scary bridge. But the experiment clearly showed they were wrong about that, because those in the scary-bridge group were more likely to call the woman than were those who had just crossed the safe bridge.”
First, it is important to correct the impression that men were asked about their reasons to call back. The original article does not report any questions about motives. This is the complete section in the results that mentions the call back.
“Female interviewer. In the experimental group, 18 of the 23 subjects who agreed to the interview accepted the interviewer’s phone number. In the control group, 16 out of 22 accepted (see Table 1). A second measure of sexual attraction was the number of subjects who called the interviewer. In the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ² = 5.7, p < .02). Taken in conjunction with the sexual imagery data, this finding suggests that subjects in the experimental group were more attracted to the interviewer.”
A second concern is that the sample size was small and the evidence for the effect was not very strong: in the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ² = 5.7, p < .02) [z = 2.4].
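The bracketed z follows directly from the reported chi-square: with one degree of freedom, χ² is the square of a standard normal deviate, so |z| = √χ². A quick check:

```python
import math
from statistics import NormalDist

chi2 = 5.7                            # value reported by Dutton and Aron
z = math.sqrt(chi2)                   # chi-square(1) is the square of a standard normal
p = 2 * (1 - NormalDist().cdf(z))     # implied two-sided p-value
print(f"z = {z:.2f}, p = {p:.3f}")    # consistent with the reported p < .02
```

A z of about 2.4 is only just past the significance threshold, which is why a single small study like this cannot carry much evidential weight.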
Finally, the authors mention a possible confound in this field study. Men who dared to cross the suspension bridge may differ from men who crossed the safe bridge, and risk-taking men have been shown to be more likely to engage in casual sex. Study 3 addressed this problem with a less colorful but more rigorous experimental design.
Male students were led to believe that they were participants in a study on electric shock and learning. An attractive female confederate (a student working with the experimenter but pretending to be a participant) was also present. The study had four conditions. Male participants were told that they would receive weak or strong shock, and they were told that the female confederate would receive weak or strong shock. They were then asked to fill out a questionnaire before the study would start; in fact, the study ended after participants completed the questionnaire, and they were told about the real purpose of the study.
The questionnaire contained two questions about the attractive female confederate. “How much would you like to kiss her?” and “How much would you like to ask her out on a date?” Participants who were anticipating strong shock had much higher average ratings than those who anticipated weak shock, z = 4.46.
Although this is a strong finding, we also have a large literature on emotions and arousal that suggests frightening your date may not be the best way to get to second base (Reisenzein, 1983; Schimmack, 2005). It is also not clear whether arousal transfer is a conscious or unconscious process. One study cited in the book found that exercise did not influence sexual arousal right away, presumably because participants attributed their increased heart rate to the exercise. This suggests that arousal transfer is not an entirely unconscious process.
Chapter 4 also brings up global warming. An unusually warm winter day in Canada often makes people talk about global warming. A series of studies examined the link between weather and beliefs about global warming more scientifically. “What is fascinating (and sadly ironic) is how opinions regarding this issue fluctuate as a function of the very climate we’re arguing about. In general, what Weber and colleagues found was that when the current weather is hot, public opinion holds that global warming is occurring, and when the current weather is cold, public opinion is less concerned about global warming as a general threat. It is as if we use “local warming” as a proxy for “global warming.” Again, this shows how prone we are to believe that what we are experiencing right now in the present is how things always are, and always will be in the future. Our focus on the present dominates our judgments and reasoning, and we are unaware of the effects of our long-term and short-term past on what we are currently feeling and thinking.”
One of the four studies produced strong evidence (z = 7.05). This study showed a correlation between respondents’ ratings of the current day’s temperature and their estimate of the percentage of above average warm days in the past year. This result does not directly support the claim that we are more concerned about global warming on warm days for two reasons. First, response styles can produce spurious correlations between responses to similar questions on a questionnaire. Second, it is not clear that participants attributed above average temperatures to global warming.
A third credible finding (z = 4.62) comes from another classic study (Ross & Sicoly, 1979, JPSP, Study 2a). “You will have more memories of yourself doing something than of your spouse or housemate doing them because you are guaranteed to be there when you do the chores. This seems pretty obvious, but we all know how common those kinds of squabbles are, nonetheless. (“I am too the one who unloads the dishwasher! I remember doing it last week!”)” In this study, 44 students participated in pairs. They were given separate pieces of information and exchanged information to arrive at a joint answer to a set of questions. Two days later, half of the participants were told that they had performed poorly, whereas the other half were told that they had performed well. In the success condition, participants were more likely to make self-attributions (i.e., take credit) than expected by chance.
In Chapter 5, John Bargh tells us about the work of his supervisor Robert Zajonc (1968). “Bob was doing important work on the mere exposure effect, which is, basically, our tendency to like new things more, the more often we encounter them. In his studies, he repeatedly showed that we like them more just because they are shown to us more often, even if we don’t consciously remember seeing them.” The 1968 classic article contains two studies with strong evidence (Study 2, z = 6.84; Study 3, z = 5.81). Even though the sample sizes were small, this was not a problem because the studies presented many stimuli at different frequencies to all participants, which makes it easy to spot reliable patterns in the data.
Chapter 5 also introduces the concept of affective priming. Affective priming refers to the tendency to respond emotionally to a stimulus even when a task demands that we ignore it. We simply cannot help feeling good or bad; we cannot just turn our emotions off. The experimental way to demonstrate this is to present an emotional stimulus quickly followed by a second emotional stimulus. Participants have to respond to the second stimulus and ignore the first. It is easier to perform the task when the two stimuli have the same valence, suggesting that the valence of the first stimulus was processed even though participants had to ignore it. Bargh et al. (1996, JESP) reported that this even happens when the task is simply to pronounce the second word (Study 1, z = 5.42; Study 2, z = 4.13; Study 3, z = 3.97).
The book does not inform readers that we have to distinguish two types of affective priming effects. Affective priming is a robust finding when participants’ task is to report on the valence (is it good or bad?) of the second stimulus following the prime. However, this finding has been interpreted by some researchers as an interference effect, similar to the Stroop effect. This explanation would not predict effects on a simple pronunciation task. However, there are fewer studies with the pronunciation task, and some of these have failed to replicate Bargh et al.’s original findings, despite the strong evidence observed in their studies. First, Klauer and Musch (2001) failed to replicate Bargh et al.’s finding that affective priming influences pronunciation of target words in three studies with good statistical power. Second, De Houwer et al. (2001) were able to replicate it with degraded primes, but also failed to replicate the effect with the visible primes that were used by Bargh et al. In conclusion, affective priming is a robust effect when participants have to report on the valence of the second stimulus, but this finding does not necessarily imply that primes unconsciously activate related content in memory.
Chapter 5 also reports some surprising associations between individuals’ names, or rather their initials, and the places they live, their professions, and their partners. These correlations are relatively small, but they are based on large datasets and are very unlikely to be mere statistical flukes (z-scores ranging from 4.65 to 49.44). The causal process underlying these correlations is less clear. One possible explanation is that we have unconscious preferences that influence our choices. However, experimental studies that tried to study this effect in the laboratory are less convincing. Moreover, Hodson and Olson failed to find a similar effect across a variety of domains such as liking of animals (Alicia is not more likely to like ants than Samantha), foods, or leisure activities. They found a significant correlation for brand names (p = .007), but this finding requires replication. More recently, Kooti, Magno, and Weber (2014) examined name effects on social media. They found significant effects for some brand comparisons (Sega vs. Nintendo), but not for others (Pepsi vs. Coke). However, they found that Twitter users were more likely to follow other Twitter users with the same first name. Taken together, these results suggest that individuals’ names predict some choices, but it is not clear when or why this is the case.
The chapter ends with an article that provides only weak evidence (z = 2.39, z = 2.22) for the claim that it is actually very easy to resist or override unwanted priming effects. According to this article, simply being told that somebody is a team member can make automatic prejudice go away. If it were so easy to control unwanted feelings, it is not clear why racism is still a problem 50 years after the civil rights movement started.
In conclusion, Chapter 5 contains a mix of well-established findings with strong support (mere-exposure effects, affective priming) and several less well-supported ideas. One problem is that priming is sometimes presented as an unconscious process that is difficult to control, while at other times priming effects seem to be easily controllable. The chapter does not illuminate under which conditions we should expect priming to influence our behavior in ways we don’t notice or cannot control, and under which conditions we notice these influences and have the ability to control them.
Chapter 6 deals with a thorny problem in psychological science: most theories make correct predictions sometimes. Even a broken clock is right twice a day. The problem is to know in which contexts a theory makes correct predictions and in which it does not.
“Entire books—bestsellers—have appeared in recent years that seem to give completely conflicting advice on this question: can we trust our intuitions (Blink, by Malcolm Gladwell), or not (Thinking, Fast and Slow, by Daniel Kahneman)? The answer lies in between. There are times when you can and should, and times when you can’t and shouldn’t [trust your gut].”
Bargh then proceeds to make eight evidence-based recommendations about when it is advantageous to rely on intuition without effortful deliberation (gut feelings).
Rule #1: supplement your gut impulse with at least a little conscious reflection, if you have the time to do so.
Rule #2: when you don’t have the time to think about it, don’t take big chances for small gains going on your gut alone.
Rule #3: when you are faced with a complex decision involving many factors, and especially when you don’t have objective measurements (reliable data) of those important factors, take your gut feelings seriously.
Rule #4: be careful what you wish for, because your current goals and needs will color what you want and like in the present.
Rule #5: when our initial gut reaction to a person of a different race or ethnic group is negative, we should stifle it.
Rule #6: we should not trust our appraisals of others based on their faces alone, or on photographs, before we’ve had any interaction with them.
Rule #7 (it may be the most important one of all): You can trust your gut about other people—but only after you have seen them in action.
Rule #8: it is perfectly fine for attraction to be one part of the romantic equation, but not so fine to let it be the only, or even the main, thing.
Unfortunately, the credible evidence in this chapter (z > 4) is only vaguely related to these rules and insufficient to claim that these rules are based on solid scientific evidence.
Morewedge and Norton (2009) provide strong evidence that people in different cultures (US z = 4.52, South Korea z = 7.18, India z = 6.78) believe that dreams provide meaningful information about themselves. Study 3 used a hypothetical scenario to examine whether people would change their behavior in response to a dream. Participants were more likely to say that they would change a flight after dreaming about a plane crash the night before the flight than after thinking about a plane crash the evening before, and dreams influenced behavior about as much as hearing about an actual plane crash (z = 10.13). In a related article, Morewedge and colleagues (2014) asked participants to rate types of thoughts (e.g., dreams, problem solving, etc.) in terms of spontaneity or deliberation. A second rating asked about the extent to which the type of thought would generate self-insight or merely reflect the current situation. They found that spontaneous thoughts were considered to generate more self-insight (Study 1, z = 5.32; Study 2, z = 5.80). In Study 5, they also found that more spontaneous recollection of a recent positive or negative experience with a romantic partner predicted hypothetical behavioral intention ratings (“To what extent might recalling the experience affect your likelihood of ending the relationship, if it came to mind when you tried to remember it”) (z = 4.06). These studies suggest that people find spontaneous, non-deliberate thoughts meaningful and that they are willing to use them in decision making. The studies do not tell us under which circumstances listening to dreams and other spontaneous thoughts (gut feelings) is beneficial.
Inbar, Cone, and Gilovich (2010) created a set of 25 choice problems (e.g., choosing an entree, choosing a college). They found that “the more a choice was seen as objectively evaluable, the more a rational approach was seen as the appropriate choice strategy” (Study 1a, z = 5.95). In a related study, they found that “the more participants thought the decision encouraged sequential rather than holistic processing, the more they thought it should be based on rational analysis” (Study 1b, z = 5.02). These studies provide some insight into people’s beliefs about optimal decision rules, but they do not tell us whether people’s beliefs are right or wrong, which would require examining people’s actual satisfaction with their choices.
Frederick (2005) examined personality differences in the processing of simple problems (e.g., A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?). The quick answer is 10 cents, but the correct answer is 5 cents. In this case, the gut response is false. A sample of over 3,000 participants answered several similar questions. Participants who performed above average were more willing to delay gratification (get $3,800 in a month rather than $3,400 now) than participants with below average performance (z > 5). If we consider the bigger reward a better choice, these results imply that it is not good to rely on gut responses when it is possible to use deliberation to get the right answer.
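The arithmetic behind the correct answer is easy to verify by solving the two stated constraints directly:

```python
# Bat-and-ball problem: bat + ball = 1.10 and bat - ball = 1.00.
# Adding the equations: 2 * bat = 2.10; subtracting: 2 * ball = 0.10.
total = 1.10
difference = 1.00
ball = (total - difference) / 2   # 0.05, not the intuitive 0.10
bat = ball + difference           # 1.05

print(round(ball, 2))  # 0.05
print(round(bat, 2))   # 1.05
```

The intuitive answer of 10 cents fails the check: a 10-cent ball would make the bat $1.10 and the total $1.20.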
Two studies by Wilson and Schooler (1991) are used to support the claim that we can overthink choices.
“In their first study, they had participants judge the quality of different brands of jam, then compared their ratings with those of experts. They found that the participants who were asked to spend time consciously analyzing the jam had preferences that differed further from those of the experts, compared to those who responded with just the “gut” of their taste buds.” The evidence in this study with a small sample is not very strong and requires replication (N = 49, z = 2.36).
“In Wilson and Schooler’s second study, they interviewed hundreds of college students about the quality of a class. Once again, those who were asked to think for a moment about their decisions were further from the experts’ judgments than were those who just went with their initial feelings.”
The description in the book does not match the actual study. There were three conditions. In the control condition, participants were asked to read the information about the courses carefully. In the reasons condition, participants were asked to write down their reasons, and in the rate-all condition participants were asked to rate all pieces of information, no matter how important, in terms of their effect on their choices. The study showed that considering all pieces of information increased the likelihood of choosing a poorly rated course (a bad choice), but had a much smaller effect on ratings of highly rated courses (z = 4.14 for the interaction effect). All conditions asked for some reflection, and it remains unclear how students would have responded if they had gone with their initial feelings, as described in the book. Nevertheless, the study suggests that good choices require focusing on important factors and that paying attention to trivial factors can lead to suboptimal choices. For example, real estate agents in hot markets use interior design to drive up prices even though the design is not part of the sale.
“We are born sensitive to violations of fair treatment and with the ability to detect those who are causing harm to others, and assign blame and responsibility to them. Recent research has shown that even children three to five years old are quite sensitive to fairness in social exchanges. They preferred to throw an extra prize (an eraser) away than to give more to one child than another—even when that extra prize could have gone to themselves.” This is not an accurate description of the studies. Study 1 (z > 5) found that 6- to 8-year-old children preferred to give 2 erasers to one kid and 2 erasers to another kid and to throw the fifth eraser away to maintain equality (20 out of 20, p < .0001). However, “the 3-to 5-year-olds showed no preference to throw a resource away (14 out of 24, p = .54)” (p. 386). Subsequent studies used only 6- to 8-year-old children. Study 4 examined how children would respond if erasers were divided between themselves and another kid. 17 out of 20 (p = .003, z = 2.97) preferred to throw the eraser away rather than getting one more for themselves. However, in a related article (Shaw & Olson, 2012b), children preferred favoritism (getting more erasers) when receiving more erasers was introduced as winning a contest (Study 2, z = 4.65). These studies are quite interesting, but they do not support the claim that equality norms are inborn, nor do they help us figure out when we should or should not listen to our gut or whether it is better for us to be equitable or selfish.
The last, but in my opinion most interesting and relevant, piece of evidence in Chapter 6 is a large (N = 16,624) survey study of relationship satisfaction (Cacioppo et al., 2013, PNAS, z = 6.58). Respondents reported their relationship satisfaction and how they had met. Respondents who had met their partner online were slightly more satisfied than respondents who had met their partner offline. There were also differences between different types of meeting online. Respondents who met their partner in a bar had one of the lowest average level of satisfaction. The study did not reveal why online dating is slightly more successful, but both forms of dating probably involve a combination of deliberation and “gut” reactions.
In conclusion, Chapter 6 provides some interesting insights into the way people make choices. However, the evidence does not provide a scientific foundation for recommendations when it is better to follow your instinct and when it is better to rely on logical reasoning and deliberation. Either the evidence of the reviewed studies is too weak or the studies do not use actual choice outcomes as outcome variable. The comparison of online and offline dating is a notable exception.
Chapter 7 uses an impressive field experiment to support the idea that “our mental representations of concepts such as politeness and rudeness, as well as innumerable other behaviors such as aggression and substance abuse, become activated by our direct perception of these forms of social behavior and emotion, and in this way are contagious.” Keizer et al. (2008) conducted the study in an alley in Groningen, a city in the Netherlands. In one condition, bikes were parked in front of a wall with graffiti, despite an anti-graffiti sign. In the other condition, the wall was clean. Researchers attached fliers to the bikes and recorded how many users would simply throw the fliers on the ground. They recorded the behaviors of 77 bike riders in each condition. In the graffiti condition, 69% of riders littered. In the clean condition, only 33% of riders littered (z = 4.51).
In Study 2, the researchers put up a fence in front of the entrance to a car park that required car owners to walk an extra 200m to get to their cars, but they left a gap that allowed car owners to avoid the detour. There was also a sign that forbade locking bikes to the fence. In the control condition, no bikes were locked to the fence. In the experimental condition, the norm was violated and four bikes were locked to the fence. 41 car owners’ behaviors were observed in each condition. In the experimental condition, 82% of car owners stepped through the gap. In the control condition, only 27% did (z = 5.27).
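Readers who want to check such numbers can approximate the reported z-scores with a standard two-proportion z-test. This is my own sketch; the original authors may have used a slightly different test, which would explain small discrepancies:

```python
from math import sqrt

def two_prop_z(x1, n1, x2, n2):
    """Two-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                     # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))    # pooled standard error
    return (p1 - p2) / se

# Study 1: 69% vs. 33% littering, 77 riders per condition
print(round(two_prop_z(round(0.69 * 77), 77, round(0.33 * 77), 77), 2))  # 4.51
# Study 2: 82% vs. 27% stepping through the gap, 41 car owners each
print(round(two_prop_z(round(0.82 * 41), 41, round(0.27 * 41), 41), 2))  # 5.1
```

The first value matches the reported z = 4.51 exactly; the second comes out near 5.1, close to the reported 5.27.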
It is unlikely that bike riders or car owners in these studies consciously processed the graffiti or the locked bikes. Thus, these studies support the hypothesis that our environment can influence behavior in subtle ways without our awareness. Moreover, these studies show these effects with real-world behavior.
Another noteworthy study in Chapter 7 examined happiness in social networks (Fowler & Christakis, 2008). The authors used data from the Framingham Heart Study, a unique study in which most inhabitants of the small town of Framingham participated. Researchers collected many measures, including a measure of happiness, and also mapped the social relationships among participants. Fowler and Christakis used sophisticated statistical methods to examine whether people who were connected in the social network (e.g., spouses, friends, neighbors) had similar levels of happiness. They did (z = 9.09). I may be more inclined to believe these findings because I have found the same pattern in my own research on married couples (Schimmack & Lucas, 2010). Spouses are not only more similar to each other at one moment in time; they also change in the same direction over time. However, the causal mechanism underlying this effect is more elusive. Maybe happiness is contagious and can spread through social networks like a disease. However, it is also possible that related members of social networks are exposed to similar environments. For example, spouses share a common household income, and money buys some happiness. It is even less clear whether these effects occur outside of people’s awareness or not.
Chapter 7 ends with the positive message that a single person can change the world because his or her actions influence many people. “The effect of just one act multiplies and spreads to influence many other people. A single drop becomes a wave.” This rosy conclusion overlooks that the impact of one person decreases exponentially as it spreads over social networks. If you are kind to a neighbor, the neighbor may be slightly more likely to be kind to the pizza delivery man, but your effect on the pizza delivery man is already barely noticeable. This may be a good thing when it comes to the spreading of negative behaviors. Even if the friend of a friend engages in immoral behaviors, it doesn’t mean that you are more likely to commit a crime. To really change society, it is important to change social norms and increase individuals’ reliance on these norms even when situational influences tell them otherwise. The more people have a strong norm not to litter, the less it matters whether there are graffiti on the wall or not.
Chapter 8 examines dishonesty and suggests that dishonesty is a general human tendency. “When the goal of achievement and high performance is active, people are more likely to bend the rules in ways they’d normally consider dishonest and immoral, if doing so helps them attain their performance goal.”
Of course, not all people cheat in all situations even if they think they can get away with it. So, the interesting scientific question is who will be dishonest in which context?
Mazar et al. (2008) examined situational effects on dishonesty. In Study 2 (z = 4.33), students were given an opportunity to cheat in order to receive a higher reward. The study had three conditions: a control condition that did not allow students to cheat, a cheating condition, and a cheating condition with an honor pledge. In the honor pledge condition, the test started with the sentence “I understand that this short survey falls under MIT’s [Yale’s] honor system.” This manipulation eliminated cheating. However, even in the cheating condition, “participants cheated only 13.5% of the possible average magnitude.” Thus, MIT/Yale students are rather honest, or the incentive was too small to tempt them (an extra $2). Study 3 found that students were more likely to cheat if they were rewarded with tokens rather than money, even though they could later exchange the tokens for money. The authors suggest that cheating merely for tokens rather than real money made it seem less like “real” cheating (z = 6.72).
Serious immoral acts cannot be studied experimentally in a psychology laboratory. Therefore, research on this topic has to rely on self-report and correlations. Pryor (1987) developed a questionnaire to study “Sexual Harassment Proclivities in Men.” The questionnaire asks men to imagine being in a position of power and to indicate whether they would take advantage of their power to incur sexual favors if they know they can get away with it. To validate the scale, Pryor showed that it correlated with a scale that measures how much men buy into rape myths (r = .40, z = 4.47). Self-reports on these measures have to be taken with a grain of salt, but the results suggest that some men are willing to admit that they would abuse power to gain sexual favors, at least in anonymous questionnaires.
Another noteworthy study found that even prisoners are not always dishonest. Cohn et al. (2015) used a gambling task to study dishonesty in 182 prisoners in a maximum security prison. Participants were given the opportunity to flip 10 coins and to keep all coins that showed heads. Importantly, the coin tosses were not observed. As it is possible, although unlikely, that all 10 coins show heads by chance, inmates could keep all coins and hide behind chance. The randomness of the outcome makes it impossible to accuse a particular prisoner of dishonesty. Nevertheless, the task makes it possible to measure the dishonesty of the group (collective dishonesty), because the percentage of reported heads should be close to chance (50%). If it is significantly higher than chance, some prisoners were dishonest. On average, prisoners reported 60% heads, which reveals some dishonesty, but even convicted criminals were more likely to respond honestly than not (the percentage increased from 60% to 66% when they were primed with their criminal identity, z = 2.39).
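The group-level logic can be made concrete with a one-proportion z-test against chance. This is a simplifying sketch (it assumes 182 prisoners each flipped 10 fair coins independently, which the published analysis may not assume), not a reproduction of the study's reported statistics:

```python
from math import sqrt

def binomial_z(heads, n, p0=0.5):
    """z-score for an observed proportion of heads against chance level p0."""
    p_hat = heads / n
    se = sqrt(p0 * (1 - p0) / n)   # standard error under the null hypothesis
    return (p_hat - p0) / se

# 182 prisoners x 10 coin flips = 1820 tosses, with 60% reported heads
n = 182 * 10
z = binomial_z(round(0.60 * n), n)
print(round(z, 2))  # 8.53
```

Even though no single prisoner can be accused, a 60% heads rate across 1,820 tosses is wildly improbable under honest reporting.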
I see some parallels between the gambling task and the world of scientific publishing, at least in psychology. The outcome of a study is partially determined by random factors. Even if a scientist does everything right, a study may produce a non-significant result due to random sampling error. The probability of obtaining a non-significant result despite a true effect is the type-II error probability; the complementary probability of obtaining a significant result is called statistical power. Just as in a coin toss experiment, the observed percentage of significant results should match the expected percentage based on average power. Numerous studies have shown that researchers report more significant results than the power of their studies justifies. As in the coin toss experiment, it is not possible to point the finger at a single outcome, because chance might have been in a researcher’s favor, but in the long run the odds “cannot be always in your favor” (Hunger Games). Psychologists disagree about whether the excess of significant results in psychology journals should be attributed to dishonesty. I think it should, and it fits Bargh’s observation that humans, and most scientists are humans, have a tendency to bend the rules when doing so helps them reach their goal, especially when the goal is highly relevant (e.g., getting a job, a grant, or tenure). Sadly, the extent of over-reporting of significant results is considerably larger than the 10 to 15% over-reporting of heads in the prisoner study.
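The coin-toss analogy can be quantified with a simple binomial calculation. This is my own illustration, not the exact method used in incredibility analyses: if each of ten independent studies had 50% power, a success rate of nine or more significant results would itself be a significant anomaly.

```python
from math import comb

def prob_at_least_k(k, n, power):
    """Binomial probability of at least k significant results in n studies,
    when each study has the given statistical power."""
    return sum(comb(n, i) * power**i * (1 - power)**(n - i)
               for i in range(k, n + 1))

# With 50% average power, the chance of 9+ significant results in 10 studies:
print(round(prob_at_least_k(9, 10, 0.5), 4))  # 0.0107
```

Just as 60% reported heads exposes collective cheating without identifying any cheater, a literature where nearly every reported study is significant exposes selective reporting without identifying any single questionable study.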
Chapter 9 introduces readers to Metcalfe’s work on insight problems (e.g., how to put 27 animals into 4 pens so that there is an odd number of animals in all four pens). Participants had to predict quickly whether they would be able to solve the problem. They then got 5 minutes to actually solve it. Participants were not able to predict accurately which insight problems they would solve. Metcalfe concluded that the solution for insight problems comes during a moment of sudden illumination that is not predictable. Bargh adds, “This is because the solver was working on the problem unconsciously, and when she reached a solution, it was delivered to her fully formed and ready for use.” In contrast, people are able to predict memory performance on a recognition test, even when they are not able to recall the answer immediately. This phenomenon, known as the tip-of-the-tongue effect (z = 5.02), shows that we have access to our memory even before we can recall the final answer. It is similar to the feeling of familiarity created by mere exposure (Zajonc, 1968): we often know a face is familiar without being able to recall specific memories of where we encountered it.
The only other noteworthy study in Chapter 9 was a study of sleep quality (Fichten et al., 2001). “The researchers found that by far, the most common type of thought that kept them awake, nearly 50 percent of them, was about the future, the short-term events coming up in the next day or week. Their thoughts were about what they needed to get done the following day, or in the next few days.” It is true that 48% thought about future short-term events, but only 1% described these thoughts as worries, and 57% of these thoughts were positive. It is not clear, however, whether this category distinguished good and poor sleepers. What distinguished good sleepers from poor sleepers, especially those with high distress, was the frequency of negative thoughts (z = 5.59).
Chapter 10 examines whether it is possible to control automatic impulses. Ample research by personality psychologists suggests that controlling impulses is easier for some people than others. The ability to exert self-control is often measured with self-report measures that predict objective life outcomes.
However, the book adds a twist to self-control. “The most effective self-control is not through willpower and exerting effort to stifle impulses and unwanted behaviors. It comes from effectively harnessing the unconscious powers of the mind to much more easily do the self-control for you.”
There is a large body of strong evidence that some individuals, those with high impulse control and conscientiousness, perform better academically or at work (Tangney et al., 2004, Study 1, z = 5.90; Galla & Duckworth, Studies 1, 4, & 6, zs = 4.88, 7.62, 5.18). Correlations between personality measures and outcomes do not reveal the causal mechanism that leads to these positive outcomes. Bargh suggests that individuals who score high on self-control measures are “the ones who do the good things less consciously, more automatically, and more habitually. And you can certainly do the same.” This may be true, but empirical work demonstrating it is hard to find. At the end of the chapter, Bargh cites a recent study by Marina Milyavskaya and Michael Inzlicht suggesting that avoiding temptations is more important than being able to exert self-control, willfully or unconsciously, in the face of temptation.
The book “Before You Know It: The Unconscious Reasons We Do What We Do” is based on the author’s personal experiences, studies he has conducted, and studies he has read. The author is a scientist, and I have no doubt that he shares with his readers insights that he believes to be true. However, this does not automatically make them true. John Bargh is well aware that many psychologists are skeptical about some of the findings that are used in the book. Famously, some of Bargh’s own studies have been difficult to replicate. One response to concerns about replicability could have been new demonstrations that important unconscious priming effects can be replicated. In a January 2013 interview, Tom Bartlett suggested this to John Bargh.
“So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway. Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.”
Beliefs are subjective. Readers of the book have their own beliefs and may find parts of the book interesting and may be willing to change some of their beliefs about human behavior. Not that there is anything wrong with this, but readers should also be aware that it is reasonable to treat the ideas presented in this book with a healthy dose of skepticism. In 2011, Daniel Kahneman wrote: “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” Five years later, it is pretty clear that Kahneman has become more skeptical about the state of priming research and about results of experiments with small samples in general. Unfortunately, it is not clear which studies we can believe until replication studies distinguish real effects from statistical flukes. So, until we have better evidence, we are still free to believe what we want about the power of unconscious forces on our behavior.
Update: March 20, 2018
An earlier version included a reference to my role as editor of Meta-Psychology. I apologize for including this reference. The journal has nothing to do with this blog post, and the tone of this blog post reflects only my personal frustration with traditional peer review. Some readers should be warned that the tone of this blog post is rude. Some people think this is inappropriate. I consider it an open and transparent depiction of what really goes on in academia, where scientists’ egos are often more important than an objective search for the truth. And yes, I have an ego, too, and I think the only way to deal with it is an open and frank exchange of arguments and critical examination of all arguments. Reviews that simply dismiss alternative ideas are not helpful and cannot advance psychology as a science.
In this PDF document, Jerry Brunner and I would like to share our latest manuscript on z-curve, a method that estimates the average power of a set of studies selected for significance. We call this estimate replicability because average power determines the success rate if the set of original studies were replicated exactly.
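To illustrate why such a correction is needed at all, here is a minimal simulation of my own (a sketch of the selection problem, not the z-curve algorithm itself): when only significant results are published, the observed z-values overestimate the true mean of the underlying distribution.

```python
import random
from statistics import mean, NormalDist

random.seed(1)
crit = NormalDist().inv_cdf(0.975)   # two-sided 5% criterion, z ~ 1.96

# Assume a homogeneous literature whose true mean z-value is 2.0
# (roughly 52% power); selection keeps only significant results.
true_mean_z = 2.0
observed = [random.gauss(true_mean_z, 1) for _ in range(100_000)]
significant = [z for z in observed if z > crit]

print(round(mean(observed), 2))     # close to the true mean of 2.0
print(round(mean(significant), 2))  # noticeably inflated above 2.0
```

A naive average of published z-values would therefore overstate replicability, which is the bias that selection-correcting methods like z-curve are designed to remove.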
We welcome all comments and criticism as we plan to submit this manuscript to a peer-reviewed journal by December 1.
Comparison of P-curve and Z-Curve in Simulation studies
Estimate of average replicability in Cuddy et al.’s (2017) P-curve analysis of power posing with z-curve (30% for z-curve vs. 44% for p-curve).
Estimating average replicability in psychology based on over 500,000 significant test statistics.
Comparing automated extraction of test statistics and focal hypothesis tests using Motyl et al.’s (2016) replicability analysis of social psychology.
The manuscript was rejected. Here you can read the reasons given by the editor and the reviews (2 anonymous and 1 by Leif Nelson) and make up your own mind about whether these reviews contain valid criticism. Importantly, nobody questions the key findings of the simulation studies, which show that our method is unbiased whereas p-curve, which is already being used as a statistical tool, can provide inflated estimates in realistic scenarios when power varies across studies. We think the decision not to publish a method that improves on an existing method that is being used is somewhat strange for a journal that calls itself ADVANCES IN METHODS AND PRACTICES.
Dear Dr. Schimmack:
Thank you for submitting your manuscript (AMPPS-17-0114) entitled “Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies” to Advances in Methods and Practices in Psychological Science (AMPPS). First, my apologies for the overly long review process. I initially struggled to find reviewers for the paper and I also had to wait for the final review. In the end, I received guidance from three expert reviewers whose comments appear at the end of this message.
Reviewers 1 and 2 chose to remain anonymous and Reviewer 3 is Leif Nelson (signed review). Reviewers 1 and 2 were both strongly negative and recommended rejection. Nelson was more positive about the goals of the paper and approach, although he wasn’t entirely convinced by the approach and evidence. I read the paper independently of the reviews, both before sending it out and again before reading the reviews (given that it had been a while). My take was largely consistent with that of the reviewers.
Although the issue of estimating replicability from published results is an important one, I was less convinced about the method and felt that the paper did not do enough to define the approach precisely or to adequately demonstrate its benefits and limits relative to other meta-analytic bias-correction techniques. Based on the comments of the reviewers and my independent evaluation, I found these issues to be substantial enough that I have decided to decline the manuscript.
The reviews are extensive and thoughtful, and I won’t rehash all of the details in my letter. I would like to highlight what I see as the key issues, but many of the other comments are important and substantive. I hope you will find the comments useful as you continue to develop this approach (which I do think is a worthwhile enterprise).
All three reviews raised concerns about the clarity of the paper and the figures as well as the lack of grounding for a number of strong claims and conclusions (they each quote examples). They also note the lack of specificity for some of the simulations and question the datasets used for the analyses.
I agreed that the use of some of the existing datasets (e.g., the scraped data, the Cuddy data, perhaps the Motyl data) is not an ideal way to demonstrate the usefulness of this tool. Simulations in which you know and can specify the ground truth seem more helpful in demonstrating the advantages and constraints of this approach.
Reviewers 1 and 2 both questioned the goal of estimating average power. Reviewer 2 presents the strongest case against doing so. Namely, average power is a weird quantity to estimate in light of a) decades of research on meta-analytic approaches to estimating the average effect size in the face of selection, and b) the fact that average power is a transformation of effect size. To demonstrate that Z-curve is a valid measure and an improvement over existing approaches, it seems critical to test it against other established meta-analytic models.
p-curve is relatively new, and as reviewer 2 notes, it has not been firmly established as superior to other more formal meta-analytic approaches (it might well be better in some contexts and worse in others). When presenting a new method like Z-curve, it is important to establish it against well-grounded methods or at least to demonstrate how accurate, precise, and biased it is under a range of realistic scenarios. In the context of this broader literature on bias correction, the comparison only to p-curve seems narrow, and a stronger case would involve comparing the ability of Z-curve to recover average effect size against other models of bias correction (or power if you want to adapt them to do that).
[FYI: p-curve is the only other method that aims to estimate average power of studies selected for significance. Other meta-analytic tools aim to estimate effect sizes, which are related to power but not identical. ]
Nelson notes that other analyses show p-curve to be robust to heterogeneity and argues that you need to more clearly specify why and when Z curve does better or worse. I would take that as a constructive suggestion that is worth pursuing (e.g., he and the other reviewers are right that you need to provide more specificity about the nature of the heterogeneity you’re modeling).
I thought Nelson’s suggestions for ways to explain the discrepant results of these two approaches were constructive, and they might help to explain when each approach does better, which would be a useful contribution. Just to be clear, I know that the datacolada post that Nelson cites was posted after your paper was submitted and I’m not factoring your paper’s failure to anticipate it into my decision (after all, Bem was wrong).
[That blog post was posted after I shared our manuscript with Uri and tried to get him to comment on z-curve. In a long email exchange he came up with scenarios in which p-curve did better, but he never challenged the results of my simulations showing that it performs a lot worse when there is heterogeneity. To refer to this self-serving blog post as a reason for rejection is problematic at best, especially if the simulation results in the manuscript are ignored.]
Like Reviewer 2 and Nelson, I was troubled by the lack of a data model for Z curve (presented around page 11-12). As best I can tell, it is a weighted average of 7 standard normal curves with different means. I could see that approach being useful, and it might well turn out to be optimal for some range of cases, but it seems arbitrary and isn’t suitably justified. Why 7? Why those 7? Is there some importance to those choices?
The data model (a term I had not encountered before) was specified, and it is so simple that the editorial letter characterizes it correctly: the observed density distribution is modeled with weighted averages of the density distributions of 7 standard normal distributions. And yes, 7 is arbitrary, because the exact number has very little influence on the results.
Do they reflect some underlying principle or are they a means to an end? If the goal is to estimate only the end output from these weights, how do we know that those are the right weights to use?
Because the simulation results show that the model recovers the simulated average power correctly within 3% points?
If the discrete values themselves are motivated by a model, then the fact that the weight estimates for each component are not accurate even with k=10000 seems worrisome.
No it is not worrisome because the end goal is the average, not the individual weights.
If they aren’t motivated by a more formal model, how were they selected and what aspects of the data do they capture? Similarly, doesn’t using absolute value mean that your model can’t handle sign errors for significant results? And, how are your results affected by an arbitrary ceiling at z=6?
There are no sign errors in analyses of significant results that cover different research questions. Heck, there is not even a reasonable way to speak about signs. Is neuroticism a negative predictor of well-being or is emotional stability a positive predictor of well-being?
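To make the mixture model discussed above concrete, here is a toy sketch of the density-fitting idea: significant |z| values are modeled as a weighted average of 7 folded normal distributions with means 0 to 6 (truncated at the significance threshold), and average power is the weighted average of the components’ power. This is an illustrative simplification, not our actual implementation; the kernel-density fit and least-squares loss are stand-ins for the real estimation details.

```python
import numpy as np
from scipy import stats, optimize

Z_CRIT = 1.96          # two-sided significance threshold
MEANS = np.arange(7)   # fixed component means 0..6 (the "7 standard normal curves")

def component_density(z, mu):
    """Density of |Z|, Z ~ N(mu, 1), truncated to the significant region |z| > Z_CRIT."""
    dens = stats.norm.pdf(z, mu, 1) + stats.norm.pdf(-z, mu, 1)       # folded normal
    tail = stats.norm.sf(Z_CRIT - mu) + stats.norm.sf(Z_CRIT + mu)    # P(|Z| > crit)
    return dens / tail

def component_power(mu):
    """Probability that a study with noncentrality mu yields |z| > Z_CRIT."""
    return stats.norm.sf(Z_CRIT - mu) + stats.norm.sf(Z_CRIT + mu)

def fit_zcurve(z_sig, grid=np.linspace(1.96, 6, 100)):
    """Estimate mixture weights by least-squares fit to a kernel density of |z| scores."""
    z_sig = np.minimum(np.abs(z_sig), 6)                 # censor extreme z-scores at 6
    kde = stats.gaussian_kde(z_sig)(grid)                # observed density
    comp = np.array([component_density(grid, m) for m in MEANS])

    def loss(theta):
        w = np.exp(theta) / np.exp(theta).sum()          # softmax keeps weights on simplex
        return np.sum((kde - w @ comp) ** 2)

    res = optimize.minimize(loss, np.zeros(len(MEANS)), method="Nelder-Mead",
                            options={"maxiter": 5000})
    w = np.exp(res.x) / np.exp(res.x).sum()
    return w @ component_power(MEANS)                    # estimated average power

# simulate: true power varies across studies, keep only the significant results
rng = np.random.default_rng(1)
true_mu = rng.choice([0.5, 2.0, 3.5], size=20000)
z = rng.normal(true_mu, 1)
z_sig = z[np.abs(z) > Z_CRIT]
print(fit_zcurve(z_sig))
```

Because studies are selected for significance, high-powered studies are over-represented in `z_sig`; the mixture fit recovers the average power of the selected set rather than of all studies conducted.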
Finally, the paper comments near the end that this approach works well if k=100, but that doesn’t inform the reader about whether it works for k=15 or k=30 as would be common for meta-analysis in psychology.
The editor doesn’t even seem to understand that this method is not intended to be used for a classic effect size meta-analysis. We do have statistical methods for that. Believe me I know that. But how would we apply these methods to estimate the replicability of social psychology? And why would we use k = 30 to do so, when we can use k = 1,000?
To show that this approach is useful in practice, it would be good to show how it fares with sets of results that are more typical in scale in psychology. What are the limits of its usefulness? That could be demonstrated more fully with simulations in which the ground truth is known.
No bias correction method that relies on only significant results provides meaningful results with k = 30. We provide 95%CI and they are huge with k = 30.
I know you worked hard on the preparation of this manuscript, and that you will be disappointed by this outcome. I hope that you will find the reviewer comments helpful in further developing this work and that the outcome for this submission will not discourage you from submitting future manuscripts to AMPPS.
Daniel J. Simons, Editor
Advances in Methods and Practices in Psychological Science (AMPPS) Psychology
Unfortunately, I don’t think the editor worked hard on reading the manuscript, and he missed the main point of the contribution. So, no, I am not planning to waste more time sending my best work to this journal. I had hopes that AMPPS was serious about improving psychology as a science. Now I know better. I will also no longer review for your journal. Good luck with your efforts to do actual replication studies. I will look elsewhere to publish my work that makes original research more credible to start with.
The authors of this manuscript introduce a new statistical method, the z-curve, for estimating the average replicability of empirical studies. The authors evaluate the method via simulation methods and via select empirical examples; they also compare it to an alternative approach (p-curve). The authors conclude that the z-curve approach works well, and that it may be superior to the p-curve in cases where there is substantial heterogeneity in the effect sizes of the studies being examined. They also conclude, based on applying the z-curve to specific cases, that the average power of studies in some domains (e.g., power posing research and social psychology) is low.
One of the strengths of this manuscript is that it addresses an important issue: How can we evaluate the replicability of findings reported in the literature based on properties inherent to the studies (or their findings) themselves? In addition, the manuscript approaches the issue with a variety of demonstrations, including simulated data, studies based on power posing, scraped statistics from psychology journals, and social psychological studies.
After reading the manuscript carefully, however, I’m not sure I understand how the z-curve works or how it is supposed to solve potential problems faced by other approaches for evaluating the power of studies published in the empirical literature.
That is too bad. We provided detailed annotated R-Code to make it possible for quantitative psychologists to understand how z-curve works. It is unfortunate that you were not able to understand the code. We would have been happy to answer questions.
I realize the authors have included links to more technical discussions of the z-curve on their websites, but I think a manuscript like this should be self-contained–especially when it is billed as an effort “to introduce and evaluate a new statistical method” (p. 25, line 34).
We think that extensive code is better provided in a supplement. However, this is really an editorial question and not a comment on the quality or originality of our work.
Some additional comments, questions, and suggestions:
1. One of my concerns is that the approach appears to be based on using “observed power” (or observed effect sizes) as the basis for the calculations. And, although the authors are aware of the problems with this (e.g., published effect sizes are over-estimates of true effect sizes; p. 9), they seem content with using observed effect sizes when multiple effect sizes from diverse studies are considered. I don’t understand how averaging values that are over-estimates of true values can lead to anything other than an inflated average. Perhaps this can be explained better.
Again, we are sorry that you did not understand how our method achieves this goal, but that is surely not a reason to recommend rejection.
2. Figure 2 is not clear. What is varying on the x-axis? Why is there a Microsoft-style spelling error highlighted in the graph?
The Figure was added after an exchange with Uri Simonsohn and reproduces the simulation that they did for effect sizes (see text). The x-axis shows d-values.
3. Figure 4 shows a z-curve for power posing research. But the content of the graph isn’t explained. What does the gray, dotted line represent? What does the solid blue line represent? (Is it a smoothed density curve?) What do the hashed red vertical lines represent? In short, without guidance, this graph is impossible to understand.
Thank you for your suggestion. We will revise the manuscript to make it easier to read the figures.
4. I’m confused on how Figure 2 is relevant to the discussion (see p. 19, line 22)
Again, we are sorry about the confusion that we caused. The Figure shows that p-curve overestimates power in some scenarios (d = .8, SD = .2), which was not apparent when Simonsohn ran these simulations to estimate effect sizes.
5. Some claims are made without any explanation or rationale. For example, on page 19 the authors write “Random sampling error cannot produce this drop” when commenting on the distribution of z-scores in the power posing data. But no explanation is offered for how this conclusion is reached.
The random sampling error of z-scores is 1 (a z-statistic has a standard deviation of 1). So we should see many values next to the mode of the distribution; a steep drop cannot be produced by random sampling error. The same observation has been made repeatedly about strings of just-significant p-values: if you get .04, .03, .02 again and again, why do you never get .06 or .11?
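The claim can be checked with a one-line simulation (illustrative only, with an arbitrary true noncentrality of 2): when z-statistics scatter with a standard deviation of 1 around their expected value, the density declines gently next to the mode, never in a cliff.

```python
import numpy as np

rng = np.random.default_rng(0)
# one million z-statistics from studies with true noncentrality 2 (sampling SD = 1)
z = rng.normal(2.0, 1.0, 1_000_000)
near_mode = np.mean((z >= 2.0) & (z < 2.2))   # share of z-scores right at the mode
next_bin  = np.mean((z >= 2.2) & (z < 2.4))   # share in the adjacent bin
print(next_bin / near_mode)                    # close to 1: a gentle decline, not a cliff
```

A histogram of published z-scores that collapses sharply just above 1.96 therefore points to selection for significance, not to chance.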
6. I assume the authors are re-analyzing the data collected by Motyl and colleagues for Demonstration 3? This isn’t stated explicitly; one has to read between the lines to reach this conclusion.
You read correctly between the lines.
7. Figure 6 contains text which states that the estimated replicability is 67%. But the narrative states that the estimated replicability using the z-curve approach is 46% (p. 24, line 8). Is the figure using a different method than the z-curve method?
This is a real problem. This was the wrong Figure. Thank you for pointing it out. The estimate in the text is correct.
8. p. 25. Unclear why Figure 4 is being referenced here.
Another typo. Thanks for pointing it out.
9. The authors write that “a study with 80% power is expected to produce 4 out of 5 significant results in the long run.” (p. 6). This is only true when the null hypothesis is false. I assume the authors know this, but it would be helpful to be precise when describing concepts that most psychologists don’t “really” understand.
If a study has 80% power it is assumed that the null-hypothesis is false. A study where the null-hypothesis is true has a power of alpha to produce significant results.
10. I am not sure I understand the authors’ claim that, “once we take replicability into account, the distinction between false positives and true positives with low power becomes meaningless.” (p. 7).
We are saying in the article that there is no practical difference between a study with power = alpha (5%) where the null hypothesis is true and a study with very low power (6%) where the null hypothesis is false. Maybe it helps to think about effect sizes: d = 0 means the null is true and power is 5%, and d = 0.000000000000001 means the null hypothesis is false and power is 5.000000001%. In terms of the probability of replicating a significant result, both studies have a very low probability of doing so.
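The point can be made concrete with a quick power calculation (a normal-approximation sketch for a two-sample comparison; the sample size of 50 per group is an arbitrary choice for illustration):

```python
import numpy as np
from scipy import stats

def power_two_sided(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample z-test for standardized effect size d."""
    ncp = d * np.sqrt(n_per_group / 2)        # noncentrality of the z-statistic
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - ncp) + stats.norm.sf(z_crit + ncp)

print(power_two_sided(0.0, 50))    # null exactly true: power equals alpha (0.05)
print(power_two_sided(1e-9, 50))   # null "false" by a hair: power is still ~0.05
```

The two studies are indistinguishable in their chance of producing a significant replication, which is why the false-positive vs. low-power distinction loses practical meaning at the boundary.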
11. With respect to the second demonstration: The authors should provide a stronger justification for examining all reported test statistics. It seems that the z-curve’s development is mostly motivated by debates concerning the on-going replication crisis. Presumably, that crisis concerns the evaluation of specific hypotheses in the literature (e.g., power posing effects on hormone levels) and not a hodge-podge of various test results that could be relevant to manipulation checks, age differences, etc. I realize it requires more work to select the tests that are actually relevant to each research article than to scrape all statistics robotically from a manuscript, but, without knowing whether the tests are “relevant” or not, it seems pointless to analyze them and draw conclusions about them.
This is a criticism of one dataset, not a criticism of the method.
12. Some of the conclusions that the authors reach, such as “our results suggest that the majority of studies in psychology fail to meet the minimum standard of a good study . . . and even more studies fail to meet the well-known and accepted norm that studies should have 80% power” have been reached by other authors too.
But what methodology did these authors use to come to this conclusion? Did they validate their method with simulation studies?
This leads me to wonder whether the z-curve approach represents an incremental advance over other approaches. (I’m nitpicking when I say this, of course. But, ultimately, the “true power” of a collection of studies is not really a “thing.”
What does that mean, the true power of studies is not a “thing”? Researchers conduct (many) significance tests and see whether they get a publishable significant result. The average percentage of times they get a significant result is the true average power of the population of statistical tests that are being conducted. Of course, we can only estimate this true value, but who says that other estimates that we use every day are any better than the z-curve estimates?
It is a useful fiction, of course, but getting a more precise estimate of it might be overkill.)
Sure, let’s not get too precise. Why don’t we settle for 50% +/- 50% and call it a day?
Perhaps the authors can provide a stronger justification for the need of highly precise, but non-transparent, methods for estimating power in published research?
Just because you don’t understand the method doesn’t mean it is not transparent. And maybe it could be useful to know that social psychologists conduct studies with 30% power and only publish results that fit their theories and reached significance with the help of luck. Maybe we have had 6 years of talk about a crisis without any data except the OSC results in 2015, which are limited to 2008 and three journals. But maybe we just don’t care because it is 2018 and it is time to get on with business as usual. Glad you were able to review for a new journal that was intended to Advance Methods and Practices in Psychological Science. Clearly, estimating the typical power of studies in psychology is not important for this goal in your opinion. Again, sorry for submitting such a difficult manuscript and wasting your time.
The authors present a new methodology (“z-curve”) that purports to estimate the average power of a set of studies included in a meta-analysis that is subject to publication bias (i.e., statistically significant studies are over-represented among the set meta-analyzed). At present, the manuscript is not suitable for publication largely for three major reasons.
 Average Power: The authors propose to estimate the average power of the set of prior historical studies included in a meta-analysis. This is a strange quantity: meta-analytic research has for decades focused on estimating effect sizes. Why are the authors proposing this novel quantity? This needs ample justification. I for one see no reason why I would be interested in such a quantity (for the record, I do not believe it says much at all about replicability).
Why three reasons, if the first reason is that we are doing something silly? Who gives a fuck about the power of studies? Clearly, knowing how powerful studies are is as irrelevant as knowing the number of potholes in Toronto. Thank you for your opinion, which unfortunately was shared by the editor and, mostly, the first reviewer.
There is another reason why this quantity is strange, namely it is redundant. In particular, under a homogeneous effect size, average power is a simple transformation of the effect size; under heterogeneous effect sizes, it is a simple transformation of the effect size distribution (if normality is assumed for the effect size distribution, then a simple transformation of the average effect size and the heterogeneity variance parameter; if a more complicated mixture distribution is assumed as here then a somewhat more complicated transformation). So, since it is just a transformation, why not stick with what meta-analysts have focused on for decades!
You should have stopped while things were going well. Now you are making silly comments that show your prejudice and ignorance. The whole point of the paper is to present a method that estimates average power when there is heterogeneity (if this is too difficult for you, let’s call it variability, or even better, you know, bro, power is not always the same in each study). If you missed this, you clearly didn’t read the manuscript for more than two minutes. So your clever remark about redundancy is just a waste of my time and the time of readers of this blog, because things are no longer so simple when there is heterogeneity. But maybe you even know this and just wanted to be a smart ass.
 No Data Model / Likelihood: On pages 10-13, the authors heuristically propose a model but never write down the formal data model or likelihood. This is simply unacceptable in a methods paper: we need to know what assumptions your model is making about the observed data!
We provided R code that not only makes it clear how z-curve works but also was available for reviewers to test it. The assumptions are made clear and are simple. This is not some fancy Bayesian model with 20 unproven priors. We simply estimate a single population parameter from the observed distribution of z-scores, and we make this pretty clear. It is simple, its assumptions are minimal, and it works. Take that!
Further, what are the model parameters? It is unclear whether they are mixture weights as well as means, just mixture weights, etc. Further, if it is just the weights and you are setting the means to 0, 1, …, 6 not merely as an example but as part of your method, this is sheer ad hockery.
Again, it works. What is your problem?
It is quite clear from your example (Pages 12-13) that the model cannot recover the weights correctly even with 10,000 (whoa!) studies! This is not good. I realize your interest is in the average power that comes out of the model and not the weights themselves (these are a means to an end) but I would nonetheless be highly concerned—especially as 20-100 studies would be much more common than 10,000.
Unlike some statisticians we do not pretend that we can estimate something that cannot be estimated without making strong and unproven assumptions. We are content with estimating what we can estimate and that is average power, which of course, you think is useless. If average power is useless, why would it be better if we could estimate the weights?
 Model Validation / Comparison: The authors validate their z-curve by comparing it to an ad hoc improvised method known as the p-curve (“p-Curve and effect size”, Perspectives on Psychological Science, 2014). The p-curve method was designed to estimate effect sizes (as per the title of the paper) and is known to perform extremely poorly at this task (particularly under effect size heterogeneity); there is no work validating how well it performs at estimating this rather curious average power quantity (but likely it would do poorly given that it is poor at estimating effect sizes and average power is a transformation of the effect size). Thus, knowing the z-curve performs better than the p-curve at estimating average power tells me next to nothing: you cannot validate your model against a model that has no known validation properties! Please find a compelling way to validate your model estimates (some suggested in the paragraphs below) whether that is via theoretical results, comparison to other models known to perform well, etc. etc.
No we are not validating z-curve with p-curve. We are validating z-curve with simulation studies that show z-curve produces good estimates of simulated true power. We only included p-curve to show that this method produces biased estimates when there is considerable variability in power.
At the same time, we disagree with the claim that p-curve is not a good tool to estimate average effect sizes from a set of studies that are selected for significance. It is actually surprisingly good at estimating the average effect size for the set of studies that were selected for significance (as is p-uniform).
It is not a good tool to estimate the effect size for the population of studies before selection for significance, but this is irrelevant in this context because we focus on replicability which implies that an original study produced a significant result and we want to know how likely it is that a replication study will produce a significant result again.
Relatedly, the results in Table 1 are completely inaccessible. I have no idea what you are presenting here and this was not made clear either in the table caption or in the main text. Here is what we would need to see at minimum to understand how well the approach performs—at least in an absolute sense.
[It shows the estimates (mean, SD) by the various models for our 3 x 3 design of the simulation study. But who cares; the objective is useless, so you probably spent 5 seconds trying to understand the Table.]
First, and least important, we need results around bias: what is the bias in each of the simulation scenarios (these are implicitly in the Table 1 results I believe)? However, we also need a measure of accuracy, say RMSE, a metric the authors should definitely include for each simulation setting. Finally, we need to know something about standard errors or confidence intervals so we can know the precision of individual estimates. What would be nice to report is the coverage percentage of your 95% confidence intervals and the average width of these intervals in each simulation setting.
There are many ways to present results about accuracy. Too bad we didn’t pick the right way, but would it matter to you? You don’t really think it is useful anyways.
This would allow us to, if not compare methods in a relative way, to get an absolute assessment of model performance. If, for example, in some simulation you have a bias of 1% and an RMSE of 3% and coverage percentage of 94% and average width of 12% you would seem to be doing well on all metrics*; on the other hand, if you have a bias of 1% and an RMSE of 15% and coverage percentage of 82% and average width of 56%, you would seem to be doing poorly on all metrics but bias (this is especially the case for RMSE and average width bc average power is bounded between 0% and 100%).
* Of course, doing well versus poorly is in the eye of the beholder and for the purposes at hand, but I have tried to use illustrative values for the various metrics that for almost all tasks at hand would be good / poor performance.
For this reason, we presented the Figure that showed how often the estimates were outside +/- 10%, where we think estimates of power do not need to be more precise than that. No need to make a big deal out of 33% vs. 38% power, but 30% vs. 80% matters.
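For what it is worth, the bias, RMSE, coverage, and width metrics the reviewer asks for are straightforward to compute from simulation output. A generic helper (not tied to our code; the example numbers are made up) might look like this:

```python
import numpy as np

def simulation_metrics(estimates, truth, ci_lower, ci_upper):
    """Summarize simulation performance: bias, RMSE, CI coverage, and average CI width."""
    est = np.asarray(estimates, dtype=float)
    lo = np.asarray(ci_lower, dtype=float)
    hi = np.asarray(ci_upper, dtype=float)
    return {
        "bias": np.mean(est - truth),                     # mean signed error
        "rmse": np.sqrt(np.mean((est - truth) ** 2)),     # accuracy
        "coverage": np.mean((lo <= truth) & (truth <= hi)),  # share of CIs covering truth
        "avg_width": np.mean(hi - lo),                    # precision of the intervals
    }

# e.g., two simulated estimates of a true average power of 0.60
print(simulation_metrics([0.50, 0.70], 0.60, [0.40, 0.55], [0.62, 0.85]))
```

Reporting these four numbers per simulation cell would give readers both the relative and the absolute assessment the reviewer describes.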
I have many additional comments. These are not necessarily minor at all (some are; some aren’t) but they are minor relative to the above three:
[a] Page 4: a prior effect size: You dismiss these hastily which is a shame. You should give them more treatment, and especially discuss the compelling use of them by Gelman and Carlin here:
This paper is absolutely irrelevant for the purpose of z-curve to estimate the actual power that researchers achieve in their studies.
[b] Page 5: What does “same result” and “successful replication” mean? You later define this in terms of statistical significance. This is obviously a dreadful definition as it is subject to all the dichotomization issues intrinsic to the outmoded null hypothesis significance paradigm. You should not rely on dichotomization and NHST so strongly.
What is obvious to you is not the scientific consensus. The most widely used criterion for a successful replication study is to get a significant result again. Of course, we could settle for getting the same sign again and a 50% type-I error probability, but hey, as a reviewer you get to say whatever you want without accountability.
Further, throughout please replace “significant” by “statistically significant” and related terms when it is the latter you mean.
[c] Page 6: Your discussion regarding if studies had 80% then up to 80% of results would be successful is not quite right: this would depend on the prior probability of “non-null” studies.
[that is why we wrote UP TO]
[d] Page 7: I do not think 50% power is at all “good”. I would be appalled in fact to trust my scientific results to a mere coin toss. You should drop this or justify why coin tosses are the way we should be doing science.
We didn’t say it is all good. We used it as a minimum: less than that is all bad, but that doesn’t mean 50% is all good. But hey, you don’t care anyway, so what the heck.
[e] Page 10: Taking absolute values of z-statistics seems wrong as the sign provides information about the sign of the effect. Why do you do this?
It is only wrong if you are thinking about a meta-analysis of studies that test the same hypothesis. However, if I want to examine the replicability of more than one specific hypothesis, all results have to be coded so that a significant result implies support for the hypothesis in the direction of significance.
[f] Page 13 and throughout: There are ample references to working papers and blog posts in this paper. That really is not going to cut it. Peer review is far from perfect but these cited works do not even reach that low bar.
Well, that is still better than a peer review of a methods paper that quotes hearsay rumors from blog posts claiming that the coding in some dataset is debatable.
[g] Page 16: What was the “skewed distribution”? More details about this and all simulation settings are necessary. You need to be explicit about what you are doing so readers can evaluate it.
We provided the R code to recreate the distributions or change them. It doesn’t matter; the conclusions remain the same.
[h] Page 15, Figure 2: Why plot median and no mean? Where are SEs or CIs on this figure?
Why do you need a CI or SE for simulations, and what more do you need to see that there is a difference between 0 and 80%?
[i] Page 14: p-curve does NOT provide good estimates of effect sizes!
Wrong. You don’t know what you are talking about. It does provide a good estimate of average effect sizes for the set of studies selected for significance, which is the relevant set here.
[j] You find p-curve is biased upwards for average power under heterogeneity; this seems to follow directly from the fact that it is biased upwards for effect size under heterogeneity (“Adjusting for Publication Bias in Meta-analysis”, Perspectives on Psychological Science, 2016) and the simply mapping between effect size and average power discussed above.
Wrong again. You are confusing estimates of average effect size for the studies before selection and after selection for significance.
[k] Page 20: Can z-curve estimate heterogeneity (the answer is yes)? You should probably provide such estimates.
We do not claim that z-curve estimates heterogeneity. Maybe some misunderstanding.
[l] Page 21-23: I don’t think the concept of the “replicability of all of psychology” is at all meaningful*. You are mixing apples and oranges in terms of areas studies as well as in terms of tests (focal tests vs manipulation checks). I would entirely cut this.
Of course, we can look for moderators but that is not helpful to you because you don’t think the concept of power is useful.
* Even if it were, it seems completely implausible that the way to estimate it would be to combine all the studies in a single meta-analysis as here.
[m] Page 23-25: I also don’t think the concept of the “replicability of all of social psychology” is at all meaningful. Note also there has been much dispute about the Motyl coding of the data so it is not necessarily reliable.
Of course you don’t, but why should I care about your personal preferences.
Further, why do you exclude large sample, large F, and large df1 studies? This seems unjustified.
They are not representative, but it doesn’t make a difference.
[n] Page 25: You write “47% average power implies that most published results are not false positives because we would expect 52.5% replicability if 50% of studies were false positives and the other 50% of studies had 100% power.” No, I think this will depend on the prior probability.
Wrong again. If 50% of studies were false positives, the power estimate for those studies would be 5%. If the other studies had the maximum of 100% power, we would see a clearly visible bimodal distribution of z-scores, and the average estimate would be p(H0) * 5 + (1 - p(H0)) * 100, here 52.5%. You are a smart boy (sorry for assuming your gender); you can figure it out.
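This mixture arithmetic (false positives replicate at the significance criterion alpha, true positives at their power) can be sketched in a few lines of Python; this is purely illustrative, not part of the manuscript's R-code:

```python
def expected_average_power(p_h0, power_true=1.0, alpha=0.05):
    """Average power of a mixture of false positives (which replicate at
    the rate alpha) and true positives with the given power."""
    return p_h0 * alpha + (1 - p_h0) * power_true

# 50% false positives, remaining studies at 100% power:
print(expected_average_power(0.5))  # 0.525, i.e., 52.5% expected replicability
```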
[o] Page 25: What are the analogous z-curve results if those extreme outliers are excluded? You give them for p-curve but not z-curve.
We provided that information, but you would need to care enough to look for it.
[p] Page 27: You say the z-curve limitations are not a problem when there are 100 or more studies and some heterogeneity. The latter is fine to assume as heterogeneity is rife in psychological research but seldom do we have 100+ studies. Usually 100 is an upper bound so this poses problems for your method.
It doesn't mean our method doesn't work with smaller N. Moreover, the goal is not to conduct an effect size meta-analysis, but apparently you missed that because you don't really care about the main objective: to estimate replicability. Not sure why you agreed to review a paper that is titled "A method for estimating replicability".
Final comment: Thanks for nothing.
This review was conducted by Leif Nelson
[Thank you for signing your review.]
Let me begin by apologizing for the delay in my review; the process has been delayed because of me and not anyone else in the review team.
Not sure why the editor waited for your review. He could have rejected the paper after reading the first two reviews, which declared that the whole objective, which you and I think is meaningful, is irrelevant for advancing psychological science. Sorry for the unnecessary trouble.
Part of the delay was because I spent a long time working on the review (as witnessed by the cumbersome length of this document). The paper is dense, makes strong claims, and is necessarily technical; evaluating it is a challenge.
I commend the authors for developing a new statistical tool for such an important topic. The assessment of published evidence has always been a crucial topic, but in the midst of the current methodological renaissance, it has gotten a substantial spotlight.
Furthermore, the authors are technically competent and the paper articulates a clear thesis. A new and effective tool for identifying the underlying power of studies could certainly be useful, and though I necessarily have a positive view of p-curve, I am open to the idea that a new tool could be even better.
Ok, enough with the politeness. Let’s get to it.
I am not convinced that Z-curve is that tool. To be clear, it might be, but this paper does not convince me of that.
As expected, … p < .01. So let's hear why the simulation results and the demonstration of inflated estimates in real datasets do not convince you.
I have a list of concerns, but a quick summary might save someone from the long slog through the 2500 words that follow:
1. The authors claim that, relative to Z-curve, p-curve fails under heterogeneity and do not report, comment on, or explain analyses showing exactly the opposite of that assertion.
Wow. Let me parse this sentence. The authors claim p-curve fails under heterogeneity (yes) and do not report … analyses showing … the opposite of that assertion.
Yes, that is correct. We do not show results opposite to our assertion. We show results that confirm our assertion in Figure 1 and 2. We show in simulations with R-code that we provided and you could have used to run your own simulations that z-curve provides very good estimates of average power when there is heterogeneity and that p-curve tends to overestimate average power. That is the key point of this paper. Now how much time did you spend on this review, exactly?
2. The authors do show that Z-curve gives better average estimates under certain circumstances, but they neither explain why, nor clarify what those circumstances look like in some easy-to-understand way, nor argue that those circumstances are representative of published results.
Our understanding was that technical details are handled in the supplement that we provided. The editor asked us to supply R-code again for a reviewer but it is not clear to us which reviewer actually used the provided R-code to answer technical questions like this. The main point is made clear in the paper. When the true power (or z-values) varies across studies, p-curve tends to overestimate. Not sure the claims of being open are very credible if this main point is ignored.
3. They attempt to demonstrate the validity of the Z-curve with three sets of clearly invalid data.
No. We do not attempt to validate z-curve with real datasets. That would imply that we already know the average power in real data, which we do not. We used simulations to validate z-curve and to show that p-curve estimates are biased. We used real data only to show that the differences in estimates have real-world implications. For example, when we use the Motyl et al. (JPSP) data to examine replicability, z-curve gives a reasonable estimate of 46% (in line with the reported R-Index estimates in the JPSP article), while p-curve gives an estimate of 72% power. This is not a demonstration of validity; it is a demonstration that p-curve would overestimate the replicability of social psychological findings in a way that most readers would consider practically meaningful.
I think that any one of those would make me an overall negative evaluator; the combination only more so. Despite that, I could see a version which clarified the “heterogeneity” differences, acknowledged the many circumstances where Z-curve is less accurate than p-curve, and pointed out why Z-curve performs better under certain circumstances. Those might not be easy adjustments, but they are possible, and I think that these authors could be the right people to do it. (the demonstrations should simply be removed, or if the authors are motivated, replaced with valid sets).
We already point out when p-curve does better: when there is minimal variability or actually identical power, p-curve's precision is 2-3 percentage points better.
Brief elaboration on the first point: In the initial description of p-curve the authors seem to imply that it should/does/might have "problems when the true power is heterogeneous". I suppose that is an empirical question, but it is one that has been answered. In the original paper, Simonsohn et al. report results showing how p-curve behaves under some types of heterogeneity. Furthermore, and more recently, we have reported how p-curve responds under other, different and severe forms of heterogeneity (datacolada.org/67). Across all of those simulations, p-curve does indeed seem to perform fine. If the authors want to claim that it doesn't perform well enough (with some quantifiable statement about what that means), or perhaps that there are some special conditions in which it performs worse, that would be entirely reasonable to articulate. However, to say "the robustness of p-curve has not been tested" is not even slightly accurate and quite misleading.
These are totally bogus and cherry-picked simulations that were conducted after I shared a preprint of this manuscript with Uri. I don't agree with Reviewer 2 that we shouldn't use blogs, but the content of a blog post needs to be accurate and scientific. The simulations in this blog post are not: the variation of power is very small. In contrast, we examine p-curve and z-curve in a fair comparison with varying amounts of heterogeneity of the size found in real data sets. In these simulations p-curve again does slightly better when there is no heterogeneity, but it does a lot worse when there is considerable variability.
To ignore the results in the manuscript and to claim that the blog post shows something different is not scientific; it is pure politics. The good news is that simulation studies have a real truth, and the truth is that when you simulate large variability in power, p-curve starts overestimating average power. We explain that this is due to the use of a single-parameter model that cannot model heterogeneity. If we limit z-curve to a single parameter, it has the same problem. The novel contribution of z-curve is to use multiple parameters (whether 3 or 7 doesn't matter much) to model heterogeneity. Not surprisingly, a model that is more consistent with the data produces better estimates.
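The core idea behind a multi-component model, that selection for significance re-weights each component by its power, can be sketched as follows. This is an illustration with hypothetical component values, not the manuscript's fitted z-curve code:

```python
import numpy as np
from scipy.stats import norm

def avg_power_after_selection(ncps, weights, crit=1.96):
    """Average power of the significant studies implied by a mixture of
    noncentrality parameters (hypothetical values, not a fitted z-curve
    model). Selection for significance re-weights each component by its
    probability of producing a significant result."""
    ncps = np.asarray(ncps, dtype=float)
    weights = np.asarray(weights, dtype=float)
    power = norm.sf(crit - ncps) + norm.sf(crit + ncps)  # two-sided power
    w = weights * power                                  # selection re-weighting
    return float((w / w.sum() * power).sum())

# One low-power and one high-power component: large heterogeneity in power.
print(avg_power_after_selection([0.5, 4.0], [0.7, 0.3]))
```

With a single component the function simply returns that component's power; heterogeneity is what makes the selection re-weighting matter.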
Brief elaboration on the second point: The paper claims (and shows) that p-curve performs worse than Z-curve with more heterogeneity. DataColada claims (and shows) that p-curve performs better than Z-curve with more heterogeneity.
p-curve does not perform better with more heterogeneity. I had a two-week email exchange with Uri when he came up with simulations that showed better performance of p-curve. For example, transformation to z-scores is an approximation, and when you use t-values with small N (all studies have N = 20), the approximation leads to suboptimal estimates. Also, smaller k is an issue because z-curve estimates density distributions. So, I am well aware of limited, specialized situations where p-curve can do better by up to 10 percentage points, but that doesn't change the fact that it does a lot worse when p-curve is applied to real heterogeneous data like I have been analyzing for years (ego-depletion replicability report, Motyl focal hypothesis tests, etc.).
I doubt neither set of simulations. That means that the difference – barring an error or similar – must lie in the operational definition of “heterogeneity.” Although I have a natural bias in interpretation (I assisted in the process of generating different versions of heterogeneity to then be tested for the DataColada post), I accept that the Z-curve authors may have entirely valid thinking here as well. So a few suggestions: 1. Since there is apparently some disagreement about how to operationalize heterogeneity, I would recommend not talking about it as a single monolithic construct.
How is variability in true power not a single construct? We have a parameter and it can vary from alpha to 1. Or we have a population effect size and a specific amount of sampling error, which gives us a ratio that reflects the deviation of a test statistic from 0. I understand the aim of saving p-curve, but in the end p-curve in its current form is unable to handle larger amounts of heterogeneity. You provide no evidence to the contrary.
Instead clarify exactly how it will be operationalized and tested and then talk about those. 2. When running simulations, rather than only reporting the variance or the skewness, simply show the distribution of power in the studies being submitted to Z-curve (as in DataColada). Those distributions, at the end of the day, will convey what exactly Z-curve (or p-curve) is estimating. 3. To the extent possible, figure out why the two differ. What are the cases where one fails and the other succeeds? It is neither informative (nor accurate) to describe Z-curve as simply “better”. If it were better in every situation then I might say, “hey, who cares why?”. But it is not. So then it becomes a question of identifying when it will be better.
Again, I had a frustrating email correspondence with Uri and the issues are all clear and do not change the main conclusion of our paper. When there is large heterogeneity, modeling this heterogeneity leads to unbiased estimates of average power, whereas a single component model tends to produce biased estimates.
Brief elaboration on the third point: Cuddy et al. selected incorrect test statistics from problematic studies. Motyl et al. selected lots and lots of incorrect tests. Scraping test statistics is not at all relevant to an assessment of the power of the studies they came from. These are all unambiguously invalid. Unfortunately, one cannot therefore learn anything about the performance of Z-curve in assessing them.
I really don't care about Cuddy. What I do care about is that they used p-curve as if it can produce accurate estimates of average power and reported an estimate that suggested to readers it was the right estimate, when p-curve again overestimated average power.
The claims about Motyl are false. I have done my own coding of these studies and, despite a few inconsistencies in coding some studies, I get the same results with my coding. Please provide your own coding of these studies and I am sure the results will be the same. Unless you have coded Motyl et al.'s studies, you should not make unfounded claims about this dataset or the results that are based on it.
OK, with those in mind, I list below concerns I have with specifics in the paper. These are roughly ordered based on where they occur in the paper:
Really, I would love to stop here, but I am a bit obsessive-compulsive. Readers may already have enough information to draw their own conclusions.
* The paper contains a number of statements of fact that seem too certain. Just one early example: "the most widely used criterion for a successful replication is statistical significance (Killeen, 2005)." That is a common definition and it may be the most common, but that is hardly a certainty (even with a citation). It would be better to simply identify that definition as common and then consider its limitations (while also considering others).
Aside from being the most common, it is also the most reasonable. How else would we compare the results of a study that claimed the effect is positive, 95% CI d = .03 to 1.26, with the results of a replication study? Would we say: wow, replication d = .05, this is consistent with the original study, therefore we have a successful replication?
* The following statement seems incorrect (and I think that the authors would be the first to agree with me): “Exact replications of the original study should also produce significant results; at least we should observe more successful than failed replications if the hypothesis is true.” If original studies were all true, but all powered at 25%, then exact (including sample size) replications would be significant 25% of the time. I assume that I am missing the argument, so perhaps I am merely suggesting a simple clarification. (p. 6)
You misinterpret the intention here. We are stating that a good study should be replicable, and we are implying that a study with 25% power is not a good study. At a minimum, we would expect a good study to be more often correct than incorrect, which happens when power is over 50%.
* I am not sure that I completely understand the argument about the equivalence of low power and false positives (e.g., “Once we take replicability into account, the distinction between false positives and true positives with low power becomes meaningless, and it is more important to distinguish between studies with good power that are replicable and studies with low power or false positives that are difficult to replicate.”) It seems to me that underpowered original studies may, in the extreme case, be true hypotheses, but they lack meaningful evidence. Alternatively, false positives are definitionally false hypotheses that also, definitionally, lack meaningful evidence. If a replicator were to use a very large sample size, they would certainly care about the difference. Note that I am hardly making a case in support of the underpowered original – I think the authors’ articulations of the importance of statistical power is entirely reasonable – but I think the statement of functional equivalence is a touch cavalier.
Replicability is a property of the original study. If the original study had 6% power, it is a bad study, even if a subsequent study with 10 times the sample size is able to show a significant result with much more power.
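To make the arithmetic concrete, here is a hedged Python sketch, assuming a simple z-test in which the noncentrality grows with the square root of the sample size; the 6% figure is the example from the response above:

```python
from scipy.stats import norm

crit = norm.isf(0.025)             # two-sided alpha = .05
ncp_orig = crit - norm.isf(0.06)   # noncentrality of a study with ~6% power
ncp_rep = ncp_orig * 10 ** 0.5     # ten times the original sample size
power_rep = norm.sf(crit - ncp_rep) + norm.sf(crit + ncp_rep)
print(round(power_rep, 2))         # still modest power despite 10x the data
```

Even ten times the sample size does not turn a 6%-power original into a well-powered test, which is the sense in which replicability is a property of the original study.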
* I was surprised that there was no discussion of the Simonsohn Small Telescopes perspective in the statistical evaluation of replications. That offers a well-cited and frequently discussed definition of replicability that talks about many of the same issues considered in this introduction. If the authors think that work isn’t worth considering, that is fine, but they might anticipate that other readers would at least wonder why it was not.
The paper is about the replicability of published findings, not about sample size planning for replication studies. Average power predicts what would happen in a study with the same sample size, not what would happen if sample sizes were increased. So the Small Telescopes paper is not relevant.
* The consideration of the Reproducibility Project struck me as lacking nuance. It takes the 36% estimate too literally, despite multiple articles and blog posts which have challenged that cut-and-dried interpretation. I think that it would be reasonable to at least give some voice to the Gilbert et al. criticisms which point out that, given the statistical imprecision of the replication studies, a more positive estimate is justifiable. (again, I am guessing that many people – including me – share the general sense of pessimism expressed by the authors, but a one-sided argument will not be persuasive).
Are you nuts? Gilbert may have had one or two points about specific replication studies, but his broader claims about the OSC results are utter nonsense, even if they were published as a commentary in Science. It is a trivial fact that the success rate in a set of studies that is not selected for significance is an estimate of average power. If we didn’t have a file drawer, we could just count the percentage of significant results to know how low power actually is. However, we do have file drawers, and therefore we need a statistical tool like z-curve to estimate average power if that is a desirable goal. If you cannot see that the OSC data are the best possible dataset to evaluate bias-correction methods with heterogeneous data, you seem to lack the most fundamental understanding of statistical power and how it relates to success rates in significance tests.
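The claim that the success rate of studies not selected for significance estimates average power is easy to verify with a toy simulation (made-up numbers, not the OSC data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# 100,000 two-sided z-tests with heterogeneous true power; no file drawer.
n = 100_000
true_power = rng.uniform(0.05, 0.95, size=n)
crit = norm.isf(0.025)                    # 1.96, two-sided alpha = .05
ncp = crit - norm.isf(true_power)         # noncentrality matching each power
z = rng.normal(ncp, 1.0)                  # observed test statistics
success_rate = np.mean(np.abs(z) > crit)  # fraction significant

# Without selection for significance, the observed success rate
# closely tracks the average true power of the studies.
print(round(success_rate, 2), round(true_power.mean(), 2))
```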
* The initial description of Z-Curve is generally clear and brief. That is great. On the other hand I think that a reasonable standard should be that readers would need neither to download and run the R-code nor go and read the 2016 paper in order to understand the machinery of the algorithm. Perhaps a few extra sentences to clarify before giving up and sending readers to those other sources.
This is up to the editor. We are happy to move content from the Supplement to the main article or do anything else that can improve clarity and communication. But first we need to be given an opportunity to do so.
* I don’t understand what is happening on pages 11-13. I say that with as much humility as possible, because I am sure that the failing is with me. Nevertheless, I really don’t understand. Is this going to be a telling example? Or is it the structure of the underlying computations? What was the data generating function that made the figure? What is the goal?
* Figure 1: (A few points). The caption mentions “…how Z-curve models…” I am sure that it does, but it doesn’t make sense to me. Perhaps it would be worth clarifying what the inputs are, what the outputs are, what the inferences are, and in general, what the point of the figure is. The authors have spent far more time in creating this figure than anyone else who simply reads it, so I do not doubt that it is a good representation of something, but I am honestly indicating that I do not know what that is. Furthermore, the authors’ say “the dotted black line in Figure 1.” I found it eventually, but it is really hard to see. Perhaps make the other lines a very light gray and the critical line a pure and un-dashed black?
It is a visual representation of the contribution of each component of the model to the total density.
* The authors say that they turn every Z-score of >6 into 6. How consequential is that decision? The explanation that those are all powered at 100% is not sufficient. If there are two results entered into Z-curve one with Z = 7 and one with Z = 12, Z-curve would treat them identically to each other and identically as if they were both Z = 6, right? Is that a strength? (without clarification, it sounds like a weakness). Perhaps it would be worth some sentences and some simulations to clarify the consequences of the arbitrary cutoff. Quite possibly the consequences are zero, but I can’t tell.
Z-curve could also fit components here, but there are few z-scores that extreme, and if you convert the z-score into power it is pnorm(6, 1.96) = .99997, or 99.997%. So does it matter? No, it doesn't, which is why we are doing it. If it made a difference, we wouldn't be doing it.
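For readers without R, the same computation in Python/scipy confirms the number quoted above (the second term is the negligible wrong-tail contribution of a two-sided test):

```python
from scipy.stats import norm

# Two-sided power implied by a noncentrality (z) of 6 at alpha = .05:
power = norm.sf(1.96 - 6) + norm.sf(1.96 + 6)
print(round(power, 5))  # 0.99997
```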
* On p. 13 the authors say, "… the average power estimate was 50%, demonstrating large sample accuracy." That seems like a good, solid inference, but I didn't understand how they got to it. One possible approach would be to start a bit earlier by clarifying the approach. Something like, "Our goal was to feed data from a 50% power distribution and then assess the accuracy of Z-curve by seeing whether or not it returned an average estimate of 50%." From there, perhaps, it might be useful to explain in conversational language how that was conducted.
The main simulations are done later. This is just an example. So we can just delete the claim about large sample accuracy here.
* To reiterate, I simply cannot follow what the authors are doing. I accept that as my fault, but let’s assume that a different reader might share some of my shortcomings. If so, then some extra clarification would be helpful.
Thanks, but if you don't understand what we are doing, why are you an expert reviewer for our paper? I did ask that Uri not be picked as a reviewer because he ignored all reasonable arguments when I sent him a preprint, but that didn't mean that some other proponent of p-curve with less statistical background should be the reviewer.
* The authors say that p-curve generates an estimate of 76% for this analysis and that is bad. I believe them. Unfortunately, as I have indicated in a few places, I simply do not understand what the authors did, and so cannot assess the different results.
We used the R-code for the p-curve app, submitted the data, and read the output. And yes, we agree, it is bad that a tool is in the public domain without any warning about bias when there is heterogeneity, when the tool can overestimate average power by 25 percentage points. What are you going to do about it?
So clarification would help. Furthermore, the authors then imply that this is due to p-curve’s failure with heterogeneity. That sounds unlikely, given the demonstrations of p-curve’s robustness to heterogeneity (i.e., DataColada), but let’s assume that they are correct.
Uri simulated minimal heterogeneity to save p-curve from embarrassment. So there is nothing surprising here. Uri essentially p-hacked p-curve results to get the results he wanted.
It then becomes absolutely critical for the authors to explain why that particular version is so far off. Based on lengthy exchanges between Uli and Uri, and as referenced in the DataColada post, across large and varied forms of heterogeneity, Z-curve performs worse than p-curve. What is special about this case? Is it one that exists frequently in nature?
Enough already. That p-hacked post is not worth the bytes on the hosting server.
* I understand Figure 2. That is great.
Do we have badges for reviewers who actually understand something in a paper?
* The authors imply that p-curve does worse at estimating high powered studies because of heterogeneity. Is there evidence for that causal claim? It would be great if they could identify the source of the difference.
The evidence is in the fucking paper you were supposed to review and evaluate.
* Uri, in the previous exchanges with Uli (and again, described in the blog post), came to the conclusion that Z-curve did better than p-curve when there were many very extreme (100% power) observations in the presence of other very low powered observations. The effect seemed to be carried by how Z-curve handles those extreme cases. I believe – and truly I am not sure here – that the explanation had to do with the fact that with Z-curve, extreme cases are capped at some upper bound. If that is true then (a) it is REALLY important for that to be described, clarified, and articulated. In addition, (b) it needs to be clearly justified. Is that what we want the algorithm to do? It seems important and potentially persuasive that Z-curve does better with certain distributions, but it clearly does worse with others. Given that, (c) it seems like the best case positioning for Z-curve would be if it could lay out the conditions under which it would perform better (e.g., one in which there are many low powered studies, but the mode was nevertheless >99.99% power), while acknowledging those in which it performs worse (e.g., all of the scenarios laid out in DataColada).
I can read and I read the blog post. I didn’t know these p-hacked simulations would be used against me in the review process.
* Table 1: Rather than presenting these findings in tabular form, I think it would be informative if there were histograms of the studies being entered into Z-curve (as in DataColada). That allows someone to see what is being assessed rather than relying on their intuitive grasp of skewness, for example.
Of course we can add those, but that doesn’t change anything about the facts.
* the Power Posing Meta-Analysis. I think it is interesting to look at how Z-curve evaluates a set of studies. I don’t think that one can evaluate the tool in this way (because we do not know the true power of Power Posing studies), but it is interesting to see. I would make some suggestions though. (a) In a different DataColada post (datacolada.org/66), we looked carefully at the Cuddy, Shultz, & Fosse p-curve and identified that the authors had selected demonstrably incorrect tests from demonstrably problematic studies. I can’t imagine anyone debating either contention (indeed, no one has, though the Z-curve authors might think the studies and tests selected were perfect. That would be interesting to add to this paper.). Those tests were also the most extreme (all >99% power). Without reading this section I would say, “well, no analysis should be run on those test statistics since they are meaningless. On the other hand, since they are extreme in the presence of other very low powered studies, this sounds like exactly the scenario where Z-curve will generate a different estimate from p-curve”. [again, the authors simply cite “heterogeneity” as the explanation and again, that is not informative]. I think that a better comparison might be on the original power-posing p-curve (Simmons & Simonsohn, 2017; datacolada.org/37). Since those test statistics were coded by two authors of the original p-curve, that part is not going to be casually contested. I have no idea what that comparison will look like, but I would be interested.
I don’t care about the silly power-posing research. I can take this out, but it just showed that p-curve is used without understanding its limitations, which have been neglected by the developers of p-curve (not sure how much you were involved).
* The scraping of 995,654 test statistics. I suppose one might wonder “what is the average power of any test reported in psychology between 2010 and 2017?” So long as that is not seen as even vaguely relevant to the power of the studies in which they were reported, then OK. But any implied relevance is completely misleading. The authors link the numbers (68% or 83%) to the results of the Reproducibility Project. That is exactly the type of misleading reporting I am referring to. I would strongly encourage this demonstration to be removed from the paper.
How do we know what the replicability in developmental psychology is? How do we know what the replicability in clinical psychology is? The only information that we have comes from social and experimental cognitive research with simple paradigms. Clearly we cannot generalize to all areas of psychology. Surely an analysis of focal and non-focal tests has some problems, which we discuss, but it clearly serves as an upper limit and can be used for temporal and cross-discipline comparisons without taking the absolute numbers too seriously. But this can only be done with a method that is unbiased, not a method that estimates 96% power when power is 75%.
* The Motyl et al p-curve. This is a nice idea, but the data set being evaluated is completely unreasonable to use. In yet another DataColada post (datacolada.org/60), we show that the Motyl et al. researchers selected a number of incorrect tests. Many omnibus tests and many manipulation checks. I honestly think that those authors made a sincere effort but there is no way to use those data in any reasonable fashion. It is certainly no better (and possibly worse) than simply scraping every p-value from each of the included studies. I wish the Motyl et al. study had been very well conducted and that the data were usable. They are not. I recommend that this be removed from the analysis or, time permitting, the Z-curve authors could go through the set of papers and select and document the correct tests themselves.
You are wrong, and I haven't seen you posting a corrected data set. Give me a corrected data set and I bet you $1,000 that p-curve will again produce a higher estimate than z-curve.
* Since it is clearly relevant with the above, I will mention that the Z-curve authors do not mention how tests should be selected. Experientially, p-curve users infrequently make mistakes with the statistical procedure, but they frequently make mistakes in the selection of test statistics. I think that if the authors want their tool to be used correctly they would be well served by giving serious consideration to how tests should be selected and then carefully explaining that.
Any statistical method depends on the data you supply. Like when Uri p-hacked simulations to show that p-curve does well with heterogeneity.
Dear reader: if you made it this far, please let me know in the comments section which of the following you take away from all of this.
1. The Motyl data are ok, p-curve overestimates, and that is because p-curve doesn't handle realistic amounts of heterogeneity well.
2. The Motyl data are ok, p-curve overestimates, but this only happens with the Motyl data.
3. The Motyl data are ok, p-curve overestimates, but that is because we didn't use p-curve properly.
4. The Motyl data are not ok, our simulations are p-hacked, and p-curve does well with heterogeneity.
The table shows the preliminary 2017 rankings of 104 psychology journals. A description of the methodology and analyses by discipline and time are reported below the table.
Download a PDF of this ggplot representation of the table, courtesy of David Lovis-McMahon.
I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result. In the past five years, there have been concerns about a replication crisis in psychology. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011). The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated (OSC, 2015).
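A toy simulation illustrates this selection mechanism (hypothetical numbers chosen for illustration, not estimates from real journals):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# 10,000 studies run with ~30% power; journals publish only significant ones.
crit = norm.isf(0.025)               # two-sided alpha = .05
ncp = crit - norm.isf(0.30)          # noncentrality giving roughly 30% power
z = rng.normal(ncp, 1.0, size=10_000)
published = np.abs(z) > crit         # the file drawer hides everything else

# Published results are 100% significant, but exact replications of the
# published results succeed at roughly the true power, not 95%+.
rep_z = rng.normal(ncp, 1.0, size=published.sum())
rep_rate = np.mean(np.abs(rep_z) > crit)
print(round(rep_rate, 2))
```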
The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability. However, the reliance on actual replication studies has a number of limitations. First, actual replication studies are expensive, time-consuming, and sometimes impossible (e.g., for a longitudinal study spanning 20 years). This makes it difficult to rely on actual replication studies to assess the replicability of psychological results, to produce replicability rankings of journals, and to track replicability over time.
Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate the average replicability of a set of published results based on the test statistics reported in published articles. This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies: (a) replicability can be assessed in real time, (b) it can be estimated for all published results rather than a small sample of studies, and (c) it can be applied to studies that are impossible to reproduce. Finally, whereas actual replication studies can be criticized for deviating from the original procedure (Gilbert, King, Pettigrew, & Wilson, 2016), estimates of replicability based on original studies do not have this problem because they are based on the results reported in the original articles.
Z-curve has been validated with simulation studies and can be used with heterogeneous sets of studies that vary across statistical methods, sample sizes, and effect sizes (Brunner & Schimmack, 2016). I have applied this method to articles published in psychology journals to create replicability rankings of psychology journals in 2015 and 2016. This blog post presents preliminary rankings for 2017 based on articles that have been published so far. The rankings will be updated in 2018, when all 2017 articles are available.
For the 2016 rankings, I used z-curve to obtain annual replicability estimates for 103 journals from 2010 to 2016. Analyses of time trends showed no changes from 2010 to 2015. However, in 2016 there were first signs of an increase in replicability. Additional analyses suggested that social psychology journals contributed most to this trend. The preliminary 2017 rankings provide an opportunity to examine whether there is a reliable increase in replicability in psychology and whether such a trend is limited to social psychology.
Journals were mainly selected based on their impact factor. The preliminary replicability rankings for 2017 are based on 104 journals. Several new journals were added to increase the number of journals specializing in five disciplines: social (24), cognitive (13), developmental (15), clinical/medical (18), and biological (13) psychology. The other 24 journals were broad journals (e.g., Psychological Science) or from other disciplines. More journals will be added to the final rankings for 2017.
All PDF versions of published articles were downloaded and converted into text files using the conversion program pdfzilla. Text files were searched for reports of statistical results using a self-created R program. Only F-tests, t-tests, and z-tests were used for the rankings because they can be reliably extracted from diverse journals. t-values that were reported without degrees of freedom were treated as z-values, which leads to a slight inflation in replicability estimates; however, the bulk of test statistics were F-values and t-values with degrees of freedom. Test statistics were converted into exact p-values, and exact p-values were converted into absolute z-scores as a measure of the strength of evidence against the null hypothesis.
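The conversion from test statistics to absolute z-scores can be sketched as follows (a Python sketch rather than the author's R program; the function names are mine): each test statistic is converted into its exact two-tailed p-value, and the p-value is then converted into the corresponding absolute z-score.

```python
from scipy import stats

def z_from_t(t_value, df):
    """Convert a t-test result into an absolute z-score
    via its exact two-tailed p-value."""
    p = 2 * stats.t.sf(abs(t_value), df)   # exact two-tailed p-value
    return stats.norm.isf(p / 2)           # absolute z-score with the same p-value

def z_from_f(f_value, df1, df2):
    """Convert an F-test result the same way (F(1, df) corresponds to t^2)."""
    p = stats.f.sf(f_value, df1, df2)
    return stats.norm.isf(p / 2)

# a t-value of 1.96 with very large df corresponds to z of about 1.96
print(round(z_from_t(1.96, 10000), 2))   # prints 1.96
```

Because an F-test with one numerator degree of freedom is the square of a t-test, both conversions agree exactly in that case.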
The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016). Z-curve provides a replicability estimate. In addition, it generates a Powergraph, which is essentially a histogram of absolute z-scores. Visual inspection of Powergraphs can be used to examine publication bias. A drop of z-values on the left side of the significance criterion (p < .05, two-tailed, z = 1.96) shows that non-significant results are underrepresented. A further drop may be visible at z = 1.65 because values between z = 1.65 and z = 1.96 are sometimes reported as marginally significant support for a hypothesis. The critical values z = 1.65 and z = 1.96 are marked by vertical red lines in the Powergraphs.
Replicability rankings rely only on statistically significant results (z > 1.96). The aim of z-curve is to estimate the average probability that an exact replication of a study that produced a significant result produces a significant result again. As replicability estimates rely only on significant results, journals are not punished for publishing non-significant results. The key criterion is how strong the evidence against the null hypothesis is when an article publishes results that lead to the rejection of the null hypothesis.
Statistically, replicability is the average statistical power of the set of studies that produced significant results. As power is the probability of obtaining a significant result, the average power of the original studies is equivalent to the average power of a set of exact replication studies. Thus, the average power of the original studies is an estimate of replicability.
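This equivalence can be illustrated with a small simulation (a hypothetical sketch, not the author's z-curve code): studies with heterogeneous true effects are simulated, the significant ones are selected, and the average true power of the selected studies is compared with the success rate of their exact replications.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies = 200_000

# heterogeneous true effects: the noncentrality (true z) varies across studies
ncp = rng.uniform(0, 4, n_studies)
z_orig = rng.normal(ncp, 1)       # observed z-scores in the original studies
sig = z_orig > 1.96               # the published, significant results

# true power of each study: probability of z > 1.96 given its noncentrality
power = stats.norm.sf(1.96 - ncp)
avg_power_sig = power[sig].mean()

# run one exact replication of each significant study
z_rep = rng.normal(ncp[sig], 1)
replication_rate = (z_rep > 1.96).mean()

print(round(avg_power_sig, 2), round(replication_rate, 2))
```

The two numbers agree within simulation error, and both are higher than the average power of all simulated studies because selecting for significance favors high-powered studies.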
Links to powergraphs for all journals and years are provided in the ranking table. These powergraphs provide additional information that is not used for the rankings. The only information that is being used is the replicability estimate based on the distribution of significant z-scores.
The replicability estimates for each journal and year (104 * 8 = 832 data points) served as the raw data for the following statistical analyses. I fitted a growth model to examine time trends and variability across journals and disciplines using MPLUS7.4.
I compared several models. Model 1 assumed no mean-level changes and stable variability across journals (significant variance in the intercept/trait factor). Model 2 assumed no change from 2010 to 2015 and allowed for mean-level changes in 2016 and 2017 as well as stable differences between journals. Model 3 was identical to Model 2 but additionally allowed for random variability in the slope factor.
Model 1 did not have acceptable fit (RMSEA = .109, BIC = 5198). Model 2 improved fit (RMSEA = .063, BIC = 5176). Model 3 did not improve model fit further (RMSEA = .063, BIC = 5180), the variance of the slope factor was not significant, and BIC favored the more parsimonious Model 2. The parameter estimates suggested that replicability estimates increased from 72% in the years 2010 to 2015 by 2 percentage points to 74% (z = 3.70, p < .001).
The standardized loadings of individual years on the latent intercept factor ranged from .57 to .61. This implies that about one-third of the variance is stable, while the remaining two-thirds of the variance is due to fluctuations in estimates from year to year.
The average of 72% replicability is notably higher than the estimate of 62% reported in the 2016 rankings. The difference is due to a computational error in the 2016 rankings that affected mainly the absolute values, but not the relative ranking of journals. The r-code for the 2016 rankings miscalculated the percentage of extreme z-scores (z > 6), which is used to adjust the z-curve estimates that are based on z-scores between 1.96 and 6, because all z-scores greater than 6 essentially have 100% power. For the 2016 rankings, I erroneously computed the percentage of extreme z-scores out of all z-scores rather than limiting it to the set of statistically significant results. This error became apparent when new simulation studies produced wrong estimates.
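The corrected adjustment can be sketched as follows (a hypothetical illustration, not the actual r-code used for the rankings): the share of extreme z-scores must be computed among significant results only, and that share is then combined with the z-curve estimate for the moderate z-scores.

```python
import numpy as np

def replicability_estimate(z_scores, zcurve_fit):
    """Combine a z-curve estimate (fitted to 1.96 < z < 6) with the share of
    extreme z-scores, computed among *significant* results only (the fix)."""
    z = np.abs(np.asarray(z_scores))
    sig = z[z > 1.96]
    p_extreme = np.mean(sig > 6)   # correct denominator: significant results
    # extreme z-scores essentially have 100% power
    return (1 - p_extreme) * zcurve_fit + p_extreme * 1.0

z = [0.5, 1.2, 2.3, 2.8, 3.1, 7.5]            # hypothetical z-scores
print(round(replicability_estimate(z, 0.60), 2))   # prints 0.7
```

With the erroneous denominator (all z-scores instead of significant ones), p_extreme would be 1/6 instead of 1/4, which is what deflated the 2016 estimates.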
Although the previous analysis failed to find significant variability in the slope (change) factor, this could be due to the low power of this statistical test. The next models included disciplines as predictors of the intercept (Model 4) or the intercept and slope (Model 5). Model 4 had acceptable fit (RMSEA = .059, BIC = 5175), and Model 5 improved fit further (RMSEA = .036, BIC = 5178), although BIC favored the more parsimonious model. Because the Bayesian Information Criterion penalizes additional parameters, a better BIC for the simpler model cannot be interpreted as evidence for the absence of an effect. Model 5 showed two significant (p < .05) effects on the slope, for social and developmental psychology. In Model 6, I therefore included only social and developmental psychology as predictors of the slope factor. BIC favored this model over the other models (RMSEA = .029, BIC = 5164). The model results showed improvements for social psychology (increase by 4.48 percentage points, z = 3.46, p = .001) and developmental psychology (increase by 3.25 percentage points, z = 2.65, p = .008). Whereas the improvement for social psychology was expected based on the 2016 results, the increase for developmental psychology was unexpected and requires replication in the 2018 rankings.
The only significant predictors of the intercept were social psychology (-4.92 percentage points, z = 4.12, p < .001) and cognitive psychology (+2.91, z = 2.15, p = .032). The strong negative effect (standardized effect size d = 1.14) for social psychology confirms earlier findings that social psychology was most strongly affected by the replication crisis (OSC, 2015). It is encouraging to see that social psychology is also the discipline with the strongest evidence of improvement in response to the replication crisis. With an increase of 4.48 points, the replicability of social psychology is now at the same level as other disciplines in psychology, with the exception of cognitive psychology, which is still a bit more replicable than all other disciplines.
In conclusion, the results confirm that social psychology had lower replicability than other disciplines, but they also show that social psychology has significantly improved in replicability over the past couple of years.
Analysis of Individual Journals
The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2010-2015 (0) with 2016-2017 (1). This analysis produced 10 significant increases at p < .01 (one-tailed), when only 1 out of 100 would be expected by chance.
Five of the 10 journals (50% vs. 20% in the total set of journals) were from social psychology (SPPS + 13, JESP + 11, JPSP-IRGP + 11, PSPB + 10, Sex Roles + 8). The remaining journals were from developmental psychology (European J. Dev. Psy + 17, J Cog. Dev. + 9), clinical psychology (J. Cons. & Clinical Psy + 8, J. Autism and Dev. Disorders + 6), and the Journal of Applied Psychology (+7). The high proportion of social psychology journals provides further evidence that social psychology has responded most strongly to the replication crisis.
Although z-curve provides very good absolute estimates of replicability in simulation studies, the absolute values in the rankings have to be interpreted with a big grain of salt for several reasons. Most important, the rankings are based on all test statistics that were reported in an article. Only a few of these statistics test theoretically important hypotheses; others may be manipulation checks or other incidental analyses. For the OSC (2015) studies, the replicability estimate was 69% when the actual success rate was only 37%. Moreover, comparisons of the automated extraction method used for the rankings with hand-coding of focal hypothesis tests in the same articles also show a 20-percentage-point difference. Thus, a posted replicability of 70% may imply only 50% replicability for a critical hypothesis test. Second, the estimates are based on the ideal assumptions underlying statistical test distributions. Violations of these assumptions (e.g., outliers) are likely to reduce actual replicability. Third, actual replication studies are never exact replication studies, and minor differences between the studies are also likely to reduce replicability. There are currently not enough actual replication studies to correct for these factors, but the average is likely to be less than 72%. It is also likely to be higher than 37% because this estimate is heavily influenced by social psychology, while cognitive psychology had a success rate of 50%. Thus, a plausible range for the typical replicability of psychology is somewhere between 40% and 60%. We might say the glass is half full and half empty, while there is systematic variation around this average across journals.
Fifty-five years ago, Cohen (1962) pointed out that psychologists conduct many studies that produce non-significant results (type-II errors), and for decades there was no sign of improvement. The preliminary rankings of 2017 provide the first empirical evidence that psychologists are waking up to the replication crisis caused by selective reporting of significant results from underpowered studies. Right now, social psychologists appear to respond most strongly to concerns about replicability. However, it is possible that other disciplines will follow as the open science movement gains momentum. Hopefully, replicability rankings can provide an incentive to consider replicability as one of several criteria for publication. A study with z = 2.20 and another study with z = 3.85 are both significant (z > 1.96), but the study with z = 3.85 has a higher chance of being replicable. Everything else being equal, editors should favor studies with stronger evidence, that is, higher z-scores (a.k.a. lower p-values). By taking the strength of evidence into account, psychologists can move away from treating all significant results (p < .05) as equal and take type-II errors and power into account.
In 2005, Psychological Science published an article titled “An Alternative to Null-Hypothesis Significance Tests” by Peter R. Killeen. The article proposed to replace p-values and significance testing with a new statistic: the probability of replicating an effect (p-rep). The article generated a lot of excitement, and for a period from 2006 to 2009, Psychological Science encouraged reporting p-rep. After some statistical criticism and after a new editor took over Psychological Science, interest in p-rep declined (see Figure).
It is ironic that only a few years later, psychological science would encounter a replication crisis, where several famous experiments did not replicate. Despite much discussion about replicability of psychological science in recent years, Killeen’s attempt to predict replication outcome has been hardly mentioned. This blog post reexamines p-rep in the context of the current replication crisis.
The abstract clearly defines p-rep as an estimate of “the probability of replicating an effect” (p. 345), which is the core meaning of replicability. Factories have high replicability (6 sigma) and produce virtually identical products that work with high probability. However, in empirical research it is not so easy to define what it means to get the same result. So, the first step in estimating replicability is to define the result of a study that a replication study aims to replicate.
“Traditionally, replication has been viewed as a second successful attainment of a significant effect” (Killeen, 2005, p. 349). Viewed from this perspective, p-rep would estimate the probability of obtaining a significant result (p < alpha) after observing a significant result in an original study.
Killeen proposes to change the criterion to the sign of the observed effect size. This implies that p-rep can only be applied to directional hypotheses (e.g., it does not apply to tests of explained variance). The criterion for a successful replication then becomes observing an effect size with the same sign as in the original study.
Although this may appear like a radical change from null-hypothesis significance testing, this is not the case. We can translate the sign criterion into an alpha level of 50% in a one-tailed t-test. For a one-tailed t-test, negative effect sizes have p-values ranging from 1 to .50 and positive effect sizes have p-values ranging from .50 to 0. So, a successful outcome is associated with a p-value below .50 (p < .50).
If we observe a positive effect size in the original study, we can compute the power of obtaining a positive result in a replicating study with a post-hoc power analysis, where we enter information about the standardized effect size, sample size, and alpha = .50, one-tailed.
Using R syntax, this can be achieved with the formula:

pow = pnorm(obs.es / se)

with obs.es being the observed standardized effect size (Cohen’s d), N = total sample size, and se = sampling error = 2/sqrt(N). (With alpha = .50, one-tailed, the critical value is qnorm(.50) = 0, so the power formula reduces to pnorm(obs.es/se).)
The similarity to p-rep is apparent when we look at the formula for p-rep:

p.rep = pnorm(obs.es / (sqrt(2) * se))
There are two differences. First, p-rep uses the standard normal distribution to estimate power. This is a simplification that ignores the degrees of freedom. The more accurate approach uses the non-central t-distribution, which takes the degrees of freedom (N - 2) into account. However, even with modest sample sizes of N = 40, this simplification has negligible effects on power estimates.
The second difference is that p-rep reduces the non-centrality parameter (effect size/sampling error) by a factor of square-root 2. Without going into the complex reasoning behind this adjustment, the end-result of the adjustment is that p-rep will be lower than the standard power estimate.
Using Killeen’s example on page 347 with d = .5 and N = 20, p-rep = .785. In contrast, the power estimate with alpha = .50 is .861.
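Both numbers are easy to check (a Python sketch rather than R, using the normal approximation and the definition se = 2/sqrt(N) from above; the p-rep value matches .785 exactly, while the normal-approximation power value comes out slightly higher than the .861 quoted in the text):

```python
from math import sqrt
from scipy import stats

d, N = 0.5, 20
se = 2 / sqrt(N)                              # sampling error of d, as defined above

p_rep = stats.norm.cdf(d / (se * sqrt(2)))    # Killeen's sqrt(2)-adjusted noncentrality
power50 = stats.norm.cdf(d / se)              # post-hoc power with alpha = .50, one-tailed

print(round(p_rep, 3))    # prints 0.785, matching the value in the text
print(round(power50, 3))
```

The only difference between the two lines is the sqrt(2) attenuation of the noncentrality parameter, which is exactly why p-rep is always lower than the corresponding power estimate.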
The comparison of p-rep with standard power analysis brings up an interesting and unexplored question. “Does p-rep really predict the probability of replication?” (p. 348). Killeen (2005) uses meta-analyses to answer this question. In one example, he found that 70% of studies showed a negative relation between heart rate and aggressive behaviors. The median value of p-rep over those studies was 71%. Two other examples are provided.
A better way to evaluate estimates of replicability is to conduct simulation studies where the true answer is known. For example, a simulation study can simulate 100,000 exact replications of Killeen’s example with d = .5 and N = 20, and we can observe how many studies show a positive observed effect size. In a single run of this simulation, 86,842 studies showed a positive sign. Median p-rep (.788) underestimates this actual success rate, whereas median observed power more closely predicts the observed success rate (.861).
This is not surprising. Power analysis is designed to predict the long-term success rate given a population effect size, a criterion value, and sampling error. The adjustment made by Killeen is unnecessary and leads to the wrong prediction.
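The simulation is easy to reproduce (a sketch with 100,000 simulated two-group studies; the seed and exact counts are mine and will differ from the run reported in the text):

```python
import numpy as np

rng = np.random.default_rng(42)
n_sim, n_per_group, d = 100_000, 10, 0.5   # total N = 20, population effect d = .5

# simulate group means for treatment (mean d) and control (mean 0) groups
treat = rng.normal(d, 1, (n_sim, n_per_group)).mean(axis=1)
ctrl = rng.normal(0.0, 1, (n_sim, n_per_group)).mean(axis=1)
positive_sign = (treat - ctrl) > 0

# long-run success rate of the sign criterion
print(round(positive_sign.mean(), 3))
```

The observed success rate falls close to the power estimate and clearly above the p-rep estimate of .785, consistent with the argument that Killeen's adjustment leads to the wrong prediction.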
P-rep applied to Single Studies
It is also peculiar to use meta-analyses to test the performance of p-rep because a meta-analysis implies that many studies have been conducted, whereas the goal of p-rep was to predict the outcome of a single replication study from the outcome of an original study.
This primary aim also explains the adjustment to the non-centrality parameter, which was based on the idea to add the sampling variances of the original and replication study. Finally, Killeen clearly states that the goal of p-rep is to ignore population effect sizes and to define replicability as “an effect of the same sign as that found in the original experiment” (p. 346). This is very different from power analysis, which estimates the probability of an effect of the same sign as the population effect size.
We can evaluate p-rep as a predictor of obtaining effect sizes with the same direction in two studies with another simulation study. Assume that the effect size is d = .20 and the total sample size is also small (N = 20). The median p-rep estimate is 62%.
The 2 x 2 table shows how often the effect sizes of the original study and the replication study match.
The table shows that the original and replication study match only 45% of the time when the sign also matches the population effect size. Another 11% of matches occur when both the original and the replication study show the wrong sign, in which case future replication studies are more likely to show the opposite sign. Although these cases meet the definition of replicability with the sign of the original study as criterion, it seems questionable to count a pair of studies that both show the wrong result as a successful replication. Furthermore, the median p-rep estimate of 62% is inconsistent with both the correctly matched cases (45%) and the total number of matched cases (45% + 11% = 56%). In conclusion, it is neither sensible to define replicability as consistency between pairs of exact replication studies, nor does p-rep estimate this probability very well.
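The cells of the 2 x 2 table can also be derived analytically (a sketch under the normal approximation with se = 2/sqrt(N), as defined above); the probabilities match the percentages quoted in the text:

```python
from scipy import stats

# d = .2, total N = 20, so se = 2/sqrt(20)
d, se = 0.2, 2 / 20**0.5
p_pos = stats.norm.cdf(d / se)   # P(a single study shows the correct, positive sign)

both_correct = p_pos ** 2        # original and replication both show the true sign
both_wrong = (1 - p_pos) ** 2    # both show the wrong sign (a "match" nonetheless)

print(round(both_correct, 2),                # prints 0.45
      round(both_wrong, 2),                  # prints 0.11
      round(both_correct + both_wrong, 2))   # prints 0.56
```

Because the two studies are independent, the matching probabilities are simply the squared single-study probabilities, which makes the mismatch with the median p-rep of 62% easy to see.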
Can we fix it?
The previous examination of p-rep showed that it is essentially an observed power estimate with alpha = 50% and an attenuated non-centrality parameter. Does this mean we can fix p-rep and turn it into a meaningful statistic? In other words, is it meaningful to compute the probability that future replication studies will reveal the direction of the population effect size by computing power with alpha = 50%?
For example, a researcher finds an effect size of d = .4 with a total sample size of N = 100. Using a standard t-test, the researcher can report the traditional p-value, p = .048.
The simulation results show that most pairs of studies show consistent signs that are also consistent with the population effect size. Median observed power, the new p-rep, is 98%. So, is a high p-rep value a good indicator that future studies will also produce a positive sign?
The main problem with observed power analysis is that it relies on the observed effect size as an estimate of the population effect size. However, in small samples, the difference between observed effect sizes and population effect sizes can be large, which leads to very variable estimates of p-rep. One way to alert readers to the variability in replicability estimates is to provide a confidence interval around the estimate. As p-rep is a function of the observed effect size, this is easily achieved by converting the lower and upper limit of the confidence interval around the effect size into a confidence interval for p-rep. With d = .4 and N = 100 (sampling error = 2/sqrt(100) = .20), the confidence interval of effect sizes ranges from d = .008 to d = .792. The corresponding p-rep values are 52% to 100%.
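Converting the effect-size confidence interval into a confidence interval for the replicability estimate can be sketched as follows (p_rep_new is my label for the alpha = .50 observed-power version discussed above, not Killeen's original statistic):

```python
from math import sqrt
from scipy import stats

d, N = 0.4, 100
se = 2 / sqrt(N)   # sampling error of d, as defined above

def p_rep_new(d_obs, se):
    """Observed power with alpha = .50, one-tailed (the 'fixed' p-rep)."""
    return stats.norm.cdf(d_obs / se)

# 95% confidence interval for the effect size, then transform its limits
lo, hi = d - 1.96 * se, d + 1.96 * se
print(round(p_rep_new(d, se), 2),    # prints 0.98 (point estimate)
      round(p_rep_new(lo, se), 2),   # prints 0.52 (lower bound)
      round(p_rep_new(hi, se), 2))   # prints 1.0  (upper bound)
```

Because p_rep_new is a monotonic function of the observed effect size, the interval limits transform directly; the lower bound of 52% sits barely above the coin-toss value of 50%.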
Importantly, a value of 50% is the lower bound for p-rep and corresponds to determining the direction of the effect by a coin toss. In other words, the point estimate of replicability can be highly misleading because the observed effect size may be considerably lower than the population effect size. This means that reporting the point-estimate of p-rep can give false assurance about replicability, while the confidence interval shows that there is tremendous uncertainty around this estimate.
Understanding Replication Failures
Killeen (2005) pointed out that it can be difficult to understand replication failures using the traditional criterion of obtaining a significant result in the replication study. For example, the original study may have reported a significant result with p = .04 and the replication study produced a non-significant p-value of p = .06. According to the criterion of obtaining a significant result in the replication study, this outcome is a disappointing failure. Of course, there is no meaningful difference between p = .04 and p = .06. It just so happens that they are on opposite sides of an arbitrary criterion value.
Killeen suggests that we can avoid this problem by reporting p-rep. However, p-rep just changes the arbitrary criterion value from p = .05 to d = 0. It is still possible that a replication study will fail because the effect sizes do not match. Whereas the effect size in an original study was d = .05, the effect size in the replication study was d = -.05. In small samples, this is not a meaningful difference in effect sizes, but the outcome constitutes a replication failure.
There is simply no way around making mistakes in inferential statistics; we can only try to minimize them by reducing sampling error. By setting alpha to 50%, we reduce type-II errors (failing to support a correct hypothesis) at the expense of increasing the risk of type-I errors (accepting a false hypothesis), but errors will be made.
P-rep and Publication Bias
Killeen (2005) points out another limitation of p-rep. “One might, of course, be misled by a value of prep that itself cannot be replicated. This can be caused by publication bias against small or negative effects.” (p. 350). Here we see the real problem of raising alpha to 50%. If there is no effect (d = 0), one out of two studies will produce a positive result that can be published. If 100 researchers test an interesting hypothesis in their lab, but only positive results will be published, approximately 50 articles will support a false conclusion, while 50 other articles that showed the opposite result will be hidden in file drawers. A stricter alpha criterion is needed to minimize the rate of false inferences, especially when publication bias is present.
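The file-drawer scenario can be illustrated with a simulation (hypothetical numbers: 10,000 labs, n = 50 per group, a true effect of d = 0):

```python
import numpy as np

rng = np.random.default_rng(7)
n_labs, n_per_group = 10_000, 50

# each lab tests a null effect (d = 0) with the sign criterion (alpha = .50)
diffs = (rng.normal(0, 1, (n_labs, n_per_group)).mean(axis=1)
         - rng.normal(0, 1, (n_labs, n_per_group)).mean(axis=1))

# share of labs that obtain a "positive" result they could publish
publishable = (diffs > 0).mean()
print(round(publishable, 2))
```

About half the labs obtain a publishable positive result even though the true effect is zero, which is exactly why an alpha of 50% combined with publication bias would fill journals with false conclusions.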
A counter-argument could be that researchers who find a negative result can also publish their results, because positive and negative results are equally publishable. However, this would imply that journals are filled with inconsistent results, and research areas with small effects and small samples would publish nearly equal numbers of studies with positive and negative results. Each article would draw a conclusion based on the results of a single study and try to explain inconsistent findings with potential moderator variables. By imposing a stricter criterion for sufficient evidence, published results are more consistent and more likely to reflect a true finding. This is especially true if studies have sufficient power to reduce the risk of type-II errors and if journals do not selectively report studies with positive results.
Does this mean estimating replicability is a bad idea?
Although Killeen’s (2005) main goal was to predict the outcome of a single replication study, he did explore how well median replicability estimates predicted the outcome of meta-analyses. As aggregation across studies reduces sampling error, replicability estimates based on sets of studies can be useful for predicting actual success rates (Sterling et al., 1995). The comparison of median observed power with actual success rates can be used to reveal publication bias (Schimmack, 2012), and median observed power is a valid predictor of future study outcomes in the absence of publication bias and for homogeneous sets of studies. More advanced methods even make it possible to estimate replicability when publication bias is present and when the set of studies is heterogeneous (Brunner & Schimmack, 2016). So, while p-rep has a number of shortcomings, the idea of estimating replicability deserves further attention.
The rise and fall of p-rep in the first decade of the 2000s tells an interesting story about psychological science. In hindsight, the popularity of p-rep is consistent with an era that focused more on discoveries than on error control. Ideally, every study, no matter how small, would be sufficient to support inferences about human behavior. The criterion to produce a p-value below .05 was deemed an “unfortunate historical commitment to significance testing” (p. 346), when psychologists were only interested in the direction of the observed effect size in their sample. Apparently, there was no need to examine whether the observed effect size in a small sample was consistent with a population effect size or whether the sign would replicate in a series of studies.
Although p-rep never replaced p-values (most published p-rep values convert into p-values below .05), the general principles of significance testing were ignored. Instead of increasing alpha, researchers found ways to lower p-values to meet the alpha = .05 criterion. A decade later, the consequences of this attitude towards significance testing are apparent. Many published findings do not hold up when they are subjected to an actual replication attempt by researchers who are willing to report successes and failures.
In this emerging new era, it is important to teach a new generation of psychologists how to navigate the inescapable problem of inferential statistics: you will make errors. Either you falsely claim a discovery of an effect or you fail to provide sufficient evidence for an effect that does exist. Errors are part of science. How many and what type of errors will be made depends on how scientists conduct their studies.
Authors: Ulrich Schimmack & Yue Chen
“Any man whose errors take ten years to correct is quite a man.” (J. Robert Oppenheimer)
More than a century ago, Charles Darwin proposed that facial expressions of emotions not only communicate emotional experiences to others, but play an integral role in the experience of emotions themselves (Darwin, 1872). This hypothesis later became known as the facial feedback hypothesis.
Nearly a century later, a review article concluded that empirical evidence for the facial feedback hypothesis was inconclusive and suffered from some methodological problems (Ross, 1980). Most important, positive results may have been due to demand effects. That is, participants may have been aware that the manipulation of their facial muscles was intended to induce a specific emotion and respond accordingly.
Strack, Martin, and Stepper (1988) invented the pen-in-mouth paradigm to overcome these limitations of prior studies. In this paradigm, participants are instructed to hold a pen in their mouth either with their lips or with their teeth. Holding the pen with the teeth is supposed to activate the muscles involved in smiling (zygomaticus major). Holding the pen with the lips prevents smiling. To ensure that participants are not aware of the purpose of the manipulation, the study is conducted as a between-subject study with participants being randomly assigned to either the teeth or the lips condition. Furthermore, they are given a cover story for holding the pen in the mouth.
“The study you are participating in has to do with psychomotoric coordination. More specifically, we are interested in people’s ability to perform various tasks with parts of their body that they would normally not use for such tasks…The tasks we would like you to perform are actually part of a pilot study for a more complicated experiment we are planning to do next semester to better understand this substitution process.” (p. 770).
In Study 1, participants were shown several cartoons and asked to rate how funny each cartoon was. According to FFH, inducing smiling by holding a pen with teeth should induce amusement and amplify the funniness of cartoons. The average rating of funniness was consistent with this prediction (teeth M = 5.14 vs. lips M = 4.33 on a 0 to 9 scale). A second study replicated the pen-in-mouth paradigm with amusement ratings as the dependent variable (M = 6.43 vs. 5.40).
Strack et al.’s article has been widely cited as conclusive evidence for FFH (cf. Wagenmakers et al., 2016) and the article has been featured prominently in textbooks (cf. Coles & Larsen, 2017) and popular psychology books (cf. Schimmack, 2017). However, in 2011 psychologists encountered a crisis of confidence after some classic findings could not be replicated and Nobel Laureate Daniel Kahneman asked for replications of classic studies (Kahneman, 2012).
Wagenmakers et al. (2016) answered this call using the newly established format of a Registered Replication Report (Simons & Holcombe, 2014). In this format, original authors, replication authors, and editors work together to design the replication study and the original study is replicated across several labs. Wagenmakers et al. (2016) reported the results of 17 preregistered replications of Strack et al.’s Study 1. The minimum sample size for each study was N = 50. Actual sample sizes ranged from N = 87 to 139. These sample sizes do not provide sufficient statistical power to replicate the effect in each study. However, a meta-analysis of all 17 studies ensures a high probability of replicating the original finding even with a statistically small effect size. Nevertheless, the replication study failed to provide evidence for FFH.
Some psychologists interpreted these results as challenging Darwin’s century old hypothesis that facial expressions play an important role in emotional experiences. After all, results based on the best test of the theory that were widely used to support FFH could not be replicated. However, some psychologists raised concerns about the replication study. Reber (2016) compared psychology to chemistry. For an experiment in chemistry to work as predicted, chemists need to use pure chemicals. Even small impurities may cause failures to demonstrate chemical processes. Reber suggested that the replication failure of the FFH could have been caused by “impurities” in the replication study. This line of argumentation is dangerous because it can lead to circular reasoning. That is, if a study provides evidence for a theoretically predicted effect, the study was pure, but if a study fails to provide evidence for the effect, the study was impure. Accordingly, a theoretical prediction can never be falsified.
It is also possible to question the results of the original study. Schimmack (2017) pointed out that both studies failed to reach the standard criterion of statistical significance in a two-tailed test and were only significant in a one-tailed test. These results are often called marginally significant. Two marginally significant results are suggestive, but do not provide conclusive evidence for an effect. Thus, these results were prematurely accepted as evidence for FFH, when additional evidence was needed.
It would be surprising if nobody had ever tried to replicate the pen-in-mouth paradigm given its prominence and theoretical importance. In fact, numerous published articles have used the paradigm to replicate and extend the original findings (see Appendix). We conducted a replicability analysis of studies that used the pen-in-mouth paradigm prior to the controversial registered replication report. If previous studies consistently found evidence for FFH, it suggests that the replication report studies were impure. However, if previous studies also had difficulties demonstrating the effect, it suggests that the pen-in-mouth paradigm does not reliably produce facial feedback effects.
A replicability analysis differs from conventional meta-analyses in two ways. First, the goal of a replicability analysis is not to estimate an effect size. Instead, the goal is to estimate the average replicability of a set of studies, where replicability is defined as the probability of obtaining a statistically significant result (Schimmack, 2014). Second, a replicability analysis examines whether a set of studies shows signs of publication bias by comparing the percentage of significant results to the average statistical power of the studies. In an unbiased set of studies, the success rate should match median observed power. However, if publication bias is present, the success rate exceeds what median observed power justifies (Schimmack, 2012, 2014).
Median observed power is only an estimate of average power, and the estimate is imprecise with small sets of studies. However, precision increases as the number of studies increases. For this reason, we conducted a cumulative replicability analysis in which studies are added in chronological order. The cumulative analysis shows how strong the evidence for FFH was in the beginning and how it changed over time.
We used three search strategies to retrieve original articles that used the pen-in-mouth paradigm. First, we conducted full-text searches of social psychology journals looking for the word pen. Second, we searched for articles that cited Strack et al.’s seminal study that introduced the pen-in-mouth paradigm. Third, articles that were found using the first two strategies were searched for references to additional studies. We found 12 published articles with 19 independent studies that used the pen-in-mouth paradigm, including the original pair of studies.
For each independent study, we converted reported test statistics into a z-score as a standardized measure of the strength of evidence for FFH. If the means were not in the predicted direction, the z-scores were negative. We then computed observed power with z = 1.96 (p < .05, two-tailed) as criterion, unless the authors interpreted a marginally significant result as evidence for FFH; in this case, we used z = 1.65 as criterion for significance. The formula to convert z-scores into observed power is simply
power = 1 - pnorm(criterion.z, obs.z), in R notation, where obs.z is the observed z-score (used as the mean of the normal distribution) and criterion.z is 1.96 or 1.65.
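This computation can also be sketched in Python using only the standard library (the function name is ours; `erf` is used to build the normal CDF that R’s `pnorm` provides directly):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def observed_power(obs_z, criterion_z=1.96):
    """Probability of a significant result in an exact replication,
    treating the observed z-score as the true population effect."""
    return 1 - norm_cdf(criterion_z - obs_z)

# A study that was just significant at the two-tailed .05 level
# (z = 1.96) has observed power of exactly 50%.
```

Note that a just-significant result implies only 50% observed power, which is why barely significant findings are weak evidence of replicability.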
The outcome of each study is dichotomous with 0 = not significant and 1 = significant. Averaging this outcome across studies yields the success rate.
We then computed an inflation index as the difference between the success rate and median observed power. In the long run, these two values should be equivalent if there is no publication bias. If publication bias is present, the inflation index is positive and reflects the amount of bias.
Finally, we computed the Replicability Index (R-Index; Schimmack, 2014). Because publication bias also inflates median observed power, we subtracted the inflation index from median observed power; the result is the R-Index. An R-Index of 50% or less suggests that it would be difficult to replicate a finding with the typical sample sizes in the set of studies.
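Putting these steps together, the R-Index computation can be sketched in stdlib Python (a minimal illustration; function names and the example z-scores are ours):

```python
from math import erf, sqrt
from statistics import median

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def r_index(z_scores, criterion_z=1.96):
    """R-Index = median observed power minus the inflation index,
    where inflation = success rate - median observed power."""
    powers = [1 - norm_cdf(criterion_z - z) for z in z_scores]
    mop = median(powers)
    success_rate = sum(z >= criterion_z for z in z_scores) / len(z_scores)
    return mop - (success_rate - mop)
```

For example, five just-significant results (z around 2.0) yield a 100% success rate but median observed power of only about 52%, so the R-Index falls to near zero despite a perfect success rate.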
Table 1 shows the results. The original studies were both successful with the weaker criterion value of z = 1.65 that was used by the authors. However, both studies barely met this criterion, which leads to a high inflation index and a low replicability index. As predicted by the low R-Index, the next study produced a non-significant result, which brought the success rate more in line with median observed power. However, the next three studies also failed to demonstrate the effect, and median observed power dropped to .07. From Study 9 to Study 19, median observed power stayed at this level, while the success rate remained above 30%, indicating the influence of publication bias.
For the total set of 19 studies, the median observed power of 7% implies a 1 - .07 = 93% probability of a non-significant result in each study. The probability of nevertheless obtaining 9 or more significant results (leaving only 10/19 = 53% non-significant results) is less than .01% (Schimmack, 2012). Thus, there is strong evidence of publication bias, even though the estimated median power is only 7%. Combining the very low estimate of median power with a positive inflation index yields a negative R-Index. Thus, it is not surprising that a set of preregistered studies without publication bias failed to replicate the original effect. This failure is entirely consistent with the cumulative evidence from previous studies once publication bias is taken into account. In fact, the cumulative analysis shows that there was never convincing evidence for the effect (R-Index < 50).
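The binomial logic behind this bias test can be sketched in stdlib Python (the 7% power estimate and the study counts come from the analysis above; the helper name is ours):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of observing 9 or more significant results in 19 studies
# when each study has only a 7% chance of producing a significant result.
p_excess = binom_tail(9, 19, 0.07)  # well below .0001
```

A success rate this far above what the estimated power allows is the signature of publication bias.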
Note. No. = Number of Study in Chronological Order; Year = Publication Year; A# = Article Number (see Appendix); S# = Study Number within Article; z = strength of evidence for or against FFH; OP = Observed Power; Sig = Significant (0 = No, 1 = Yes); MOP = Median Observed Power; SR = Success Rate; Inf. = Inflation (SR − MOP); R-Index = Replicability Index (MOP − Inf.).
Darwin was a great scientist. Since he published his influential theory of evolution in 1859, biology has made tremendous progress in understanding the process of evolution. The same cannot be said about Darwin’s theory of emotion. More than a hundred years later, psychologists are still debating the influence of facial feedback on emotional experiences. One reason for the slow progress in some areas of psychology is that original studies were often accepted as conclusive evidence without rigorous replication efforts. In addition, meta-analyses provided misleading results because they failed to take publication bias into account. The present replicability analysis showed that the pen-in-mouth paradigm never provided convincing evidence for facial feedback effects. Nevertheless, the original study was often cited as evidence for facial feedback effects. To make progress like other sciences, psychology needs to take empirical studies more seriously and ensure that important findings can be replicated before they become cornerstones of theories and textbook findings.
This replicability analysis is limited to the pen-in-mouth paradigm. Other paradigms may produce replicable results. However, the pen-in-mouth paradigm was developed precisely to address limitations of these other paradigms, such as demand effects. Thus, even if other paradigms were more successful, the underlying mechanism would be less clear. At present, the replicability analysis simply shows a lack of evidence for FFH, but it would be premature to conclude that facial feedback effects do not exist.
Buck, R. (1980). Nonverbal Behavior and the Theory of Emotion: The Facial Feedback Hypothesis. Journal of Personality and Social Psychology, 38, 811–824.
Coles, N. A., Larsen, J. T., & Lench, H. C. (2017). A meta-analysis of the facial feedback hypothesis literature. OSF-Preprint.
Darwin, C. (1872). The expression of emotions in man and animals. London: John Murray.
Kahneman, D. (2012). A proposal to deal with questions about priming effects.
Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis. Journal of Personality and Social Psychology. 54, 768–777.
Reber, R. (2016). Impure replications.
Schimmack, U. (2012). The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2014). A revised introduction to the R-Index.
Schimmack, U. (2017). Reconstruction of a Train Wreck: How Priming Research Went off the Rails. https://replicationindex.wordpress.com/category/thinking-fast-and-slow/
Simons, D. J., & Holcombe, A. O. (2014). Registered Replication Reports.
Wagenmakers, E.-J., et al. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.
Appendix: Articles used for Meta-Analysis
A1. Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis. Journal of Personality and Social Psychology. 54, 768–777.
A2. Soussignan, R. (2002). Duchenne Smile, Emotional Experience, and Autonomic Reactivity: A Test of the Facial Feedback Hypothesis. Emotion, 2, 52-74.
A3. Ito, T., Chiao, K. W., Devine, P. G., Lorig, T. S., & Cacioppo, J. T. (2006). The Influence of Facial Feedback on Race Bias. Psychological Science, 17, 256-261.
A4. Andreasson, P., & Dimberg, U. (2008). Emotional Empathy and Facial Feedback. Journal of Nonverbal Behavior, 32, 215-224.
A5. Wiswede, D., Münte, T. F., Krämer, U. M., & Rüsseler, J. (2009). Embodied Emotion Modulates Neural Signature of Performance Monitoring. PLoS ONE, 4, e5754, 1–6.
A6. Kraft, T. L., & Pressman, S. D. (2012). Grin and Bear It: The Influence of Manipulated Facial Expression on the Stress Response. Psychological Science, 23, 1372-1378.
A7. Paredes, B., Stavraki, M., Briñol, P., & Petty, R. E. (2013). Social Psychology, 44, 349-353.
A8. Marmolejo-Ramos, F. & Dunn, J. (2013). On the activation of sensorimotor systems during the processing of emotionally-laden stimuli. Universitas Psychologica, 12, 1511-1542.
A9. Rummer, R., Schweppe, J., Schlegelmilch, R., & Grice, M. (2014). Mood Is Linked to Vowel Type: The Role of Articulatory Movements. Emotion, 14, 246–250.
A10. Dzokoto, V., Wallace, D. S., Peters, L., & Bentsi-Enchill, E. (2014). Attention to Emotion and Non-Western Faces: Revisiting the Facial Feedback Hypothesis. The Journal of General Psychology, 141(2), 151–168.
A11. Arminjon, M., Preissmann, D., Chmetz, F., Duraku, A., Ansermet, F., & Magistretti, P. J. (2015). Embodied memory: Unconscious smiling modulates emotional evaluation of episodic memories, Frontiers in Psychology, 6, 650, 1-7.
A12. Epstein, N., Brendel, T., Hege, I., Ouellette, D. L., Schmidmaier, R., & Kiesewetter, J. (2016). The power of the pen: how to make physicians more friendly and patients more attractive. Medical Education, 50, 1214–1218.