

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

Authors:  Ulrich Schimmack, Moritz Heene, and Kamini Kesavan

 

Abstract:
We computed the R-Index for studies cited in Chapter 4 of Kahneman’s book “Thinking Fast and Slow.” This chapter focuses on priming studies, starting with John Bargh’s study that led to Kahneman’s open email.  The results are eye-opening and jaw-dropping.  The chapter cites 12 articles and 11 of the 12 articles have an R-Index below 50.  The combined analysis of 31 studies reported in the 12 articles shows 100% significant results with average (median) observed power of 57% and an inflation rate of 43%.  The R-Index is 14. This result confirms Kahneman’s prediction that priming research is a train wreck and readers of his book “Thinking Fast and Slow” should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.

Introduction

In 2011, Nobel Laureate Daniel Kahneman published a popular book, “Thinking Fast and Slow”, about important findings in social psychology.

In the same year, questions about the trustworthiness of social psychology were raised.  A Dutch social psychologist had fabricated data. Eventually over 50 of his articles would be retracted.  Another social psychologist published results that appeared to demonstrate the ability to foresee random future events (Bem, 2011). Few researchers believed these results and statistical analysis suggested that the results were not trustworthy (Francis, 2012; Schimmack, 2012).  Psychologists started to openly question the credibility of published results.

In the beginning of 2012, Doyen and colleagues published a failure to replicate a prominent study by John Bargh that was featured in Daniel Kahneman’s book.  A few months later, Daniel Kahneman distanced himself from Bargh’s research in an open email addressed to John Bargh (Young, 2012):

“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.”

Five years later, Kahneman’s concerns have been largely confirmed. Major studies in social priming research have failed to replicate and the replicability of results in social psychology is estimated to be only 25% (OSC, 2015).

Looking back, it is difficult to understand the uncritical acceptance of social priming as a fact.  In “Thinking Fast and Slow” Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Yet, Kahneman could have seen the train wreck coming. In 1971, he co-authored an article about scientists’ “exaggerated confidence in the validity of conclusions based on small samples” (Tversky & Kahneman, 1971, p. 105).  Yet, many of the studies described in Kahneman’s book had small samples.  For example, Bargh’s priming study used only 30 undergraduate students to demonstrate the effect.

Replicability Index

Small samples can be sufficient to detect large effects. However, small effects require large samples.  The probability of replicating a published finding is a function of sample size and effect size.  The Replicability Index (R-Index) makes it possible to use information from published results to predict how replicable published results are.

Every reported test-statistic can be converted into an estimate of power, called observed power. For a single study, this estimate is useless because it is not very precise. However, for sets of studies, the estimate becomes more precise.  If we have 10 studies and the average power is 55%, we would expect approximately 5 to 6 studies with significant results and 4 to 5 studies with non-significant results.

If we observe 100% significant results with an average power of 55%, it is likely that studies with non-significant results are missing (Schimmack, 2012).  There are too many significant results.  This is especially true because average power is also inflated when researchers report only significant results. Consequently, the true power is even lower than average observed power.  If we observe 100% significant results with 55% average observed power, true power is likely to be less than 50%.
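The "too many significant results" argument can be checked with a simple binomial calculation. The sketch below is only an illustration under simplifying assumptions: it treats the studies as independent and all with the same true power, which real study sets will not satisfy exactly. The function name `prob_at_least` is hypothetical.

```python
from math import comb

def prob_at_least(k, n, power):
    """Probability of k or more significant results in n independent
    studies that all have the same true power (binomial model)."""
    return sum(comb(n, i) * power**i * (1 - power)**(n - i)
               for i in range(k, n + 1))

n_studies = 10
power = 0.55

# Expected number of significant results: n * power (5.5 here).
expected_significant = n_studies * power

# Chance that all 10 studies are significant if true power is 55%.
p_all_significant = prob_at_least(n_studies, n_studies, power)

print(f"expected significant results: {expected_significant:.1f}")
print(f"P(10 of 10 significant): {p_all_significant:.4f}")
```

Under these assumptions, ten out of ten significant results would be a rare event, which is why a perfect success rate combined with modest observed power points to missing non-significant studies.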

This is unacceptable. Tversky and Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.”

To correct for the inflation in power, the R-Index uses the inflation rate, which is the difference between the success rate and average observed power. For example, if all studies are significant and average power is 75%, the inflation rate is 25 percentage points.  The R-Index subtracts the inflation rate from average power.  So, with 100% significant results and average observed power of 75%, the R-Index is 50% (75% – 25% = 50%).  The R-Index is not a direct estimate of true power. It is actually a conservative estimate of true power if the R-Index is below 50%.  Thus, an R-Index below 50% suggests that a significant result was obtained only by capitalizing on chance, although it is difficult to quantify by how much.
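The arithmetic of the R-Index can be sketched in a few lines. This is a minimal illustration, not the authors' own code: the function names are hypothetical, observed power is approximated from a z-score with a one-sided normal formula (the negligible opposite tail is ignored), and the median is used as the summary of observed power, as in the results reported in this post.

```python
from statistics import NormalDist, median

def observed_power(z, criterion=1.96):
    """Approximate observed power for an absolute z-score, treating the
    observed z as the true noncentrality (opposite tail ignored)."""
    return 1 - NormalDist().cdf(criterion - abs(z))

def r_index(z_scores, criterion=1.96):
    """Return (median observed power, inflation, R-Index)."""
    powers = [observed_power(z, criterion) for z in z_scores]
    mop = median(powers)
    success_rate = sum(abs(z) >= criterion for z in z_scores) / len(z_scores)
    inflation = success_rate - mop
    return mop, inflation, mop - inflation

# Two significant studies whose z-scores imply roughly 75% observed power,
# mirroring the worked example in the text (75% - 25% = 50%).
mop, inflation, r = r_index([2.634, 2.634])
print(f"MOP = {mop:.2f}, Inflation = {inflation:.2f}, R-Index = {r:.2f}")
```

When every result is significant, the success rate is 1, so the R-Index reduces to twice the median observed power minus 1.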

How Replicable are the Social Priming Studies in “Thinking Fast and Slow”?

Chapter 4: The Associative Machine

4.1.  Cognitive priming effect

In the 1980s, psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked.

[no reference provided]

4.2.  Priming of behavior without awareness

Another major advance in our understanding of memory was the discovery that priming is not restricted to concepts and words. You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not even aware.

“In an experiment that became an instant classic, the psychologist John Bargh and his collaborators asked students at New York University—most aged eighteen to twenty-two—to assemble four-word sentences from a set of five words (for example, “finds he it yellow instantly”). For one group of students, half the scrambled sentences contained words associated with the elderly, such as Florida, forgetful, bald, gray, or wrinkle. When they had completed that task, the young participants were sent out to do another experiment in an office down the hall. That short walk was what the experiment was about. The researchers unobtrusively measured the time it took people to get from one end of the corridor to the other.”

“As Bargh had predicted, the young people who had fashioned a sentence from words with an elderly theme walked down the hallway significantly more slowly than the others. walking slowly, which is associated with old age.”

“All this happens without any awareness. When they were questioned afterward, none of the students reported noticing that the words had had a common theme, and they all insisted that nothing they did after the first experiment could have been influenced by the words they had encountered. The idea of old age had not come to their conscious awareness, but their actions had changed nevertheless.”

[John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action,” Journal of Personality and Social Psychology 71 (1996): 230–44.]

t(28)=2.86 0.008 2.66 0.76
t(28)=2.16 0.039 2.06 0.54

MOP = .65, Inflation = .35, R-Index = .30
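The result rows in this post appear to list, in order, the reported test statistic, its two-tailed p-value, the corresponding z-score, and observed power. The sketch below shows one way to reproduce such a row for a t-test. It is an assumption about the conversion, not the authors' code: the p-value is obtained by numerically integrating the t density (Simpson's rule), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for a t statistic, integrating the t density
    from 0 to |t| with Simpson's rule (steps must be even)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3          # P(0 < T < |t|)
    return 1 - 2 * central

def p_to_z(p_two_tailed):
    """Absolute z-score with the same two-tailed p-value."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)

def observed_power(z, criterion=1.96):
    """Approximate observed power (opposite tail ignored)."""
    return 1 - NormalDist().cdf(criterion - z)

# First row of the listing above: t(28) = 2.86
p = t_two_tailed_p(2.86, 28)
z = p_to_z(p)
print(f"p = {p:.3f}, z = {z:.2f}, observed power = {observed_power(z):.2f}")
```

F-tests with one numerator degree of freedom can be handled the same way by taking the square root of F as the t value.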

4.3.  Reversed priming: Behavior primes cognitions

“The ideomotor link also works in reverse. A study conducted in a German university was the mirror image of the early experiment that Bargh and his colleagues had carried out in New York.”

“Students were asked to walk around a room for 5 minutes at a rate of 30 steps per minute, which was about one-third their normal pace. After this brief experience, the participants were much quicker to recognize words related to old age, such as forgetful, old, and lonely.”

“Reciprocal priming effects tend to produce a coherent reaction: if you were primed to think of old age, you would tend to act old, and acting old would reinforce the thought of old age.”

t(18)=2.10 0.050 1.96 0.50
t(35)=2.10 0.043 2.02 0.53
t(31)=2.50 0.018 2.37 0.66

MOP = .53, Inflation = .47, R-Index = .06

4.4.  Facial-feedback hypothesis (smiling makes you happy)

“Reciprocal links are common in the associative network. For example, being amused tends to make you smile, and smiling tends to make you feel amused….”

“College students were asked to rate the humor of cartoons from Gary Larson’s The Far Side while holding a pencil in their mouth. Those who were “smiling” (without any awareness of doing so) found the cartoons funnier than did those who were “frowning.”

[“Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis,” Journal of Personality and Social Psychology 54 (1988): 768–77.]

The authors used the more liberal and unconventional criterion of p < .05 (one-tailed), z = 1.65, as a criterion for significance. Accordingly, we adjusted the R-Index analysis and used 1.65 as the criterion value.

t(89)=1.85 0.034 1.83 0.57
t(75)=1.78 0.034 1.83 0.57

MOP = .57, Inflation = .43, R-Index = .14
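The adjustment for a one-tailed criterion can be illustrated as follows. This is a hedged sketch with hypothetical function names: a one-tailed p-value is converted to a z-score, and observed power is computed against the one-tailed criterion (about 1.65) instead of 1.96.

```python
from statistics import NormalDist, median

ND = NormalDist()

def one_tailed_p_to_z(p):
    """z-score corresponding to a one-tailed p-value."""
    return ND.inv_cdf(1 - p)

def observed_power(z, criterion):
    """Approximate observed power relative to a given significance criterion."""
    return 1 - ND.cdf(criterion - z)

one_tailed_criterion = ND.inv_cdf(0.95)   # about 1.645

# Both rows above report a one-tailed p-value of .034.
z_scores = [one_tailed_p_to_z(0.034), one_tailed_p_to_z(0.034)]
powers = [observed_power(z, one_tailed_criterion) for z in z_scores]

mop = median(powers)
inflation = 1.0 - mop          # all reported results were significant
print(f"z = {z_scores[0]:.2f}, MOP = {mop:.2f}, "
      f"Inflation = {inflation:.2f}, R-Index = {mop - inflation:.2f}")

# The same z-score evaluated against the conventional two-tailed criterion
# would have less than 50% observed power.
print(f"power at 1.96 criterion: {observed_power(z_scores[0], 1.96):.2f}")
```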

These results could not be replicated in a large replication effort with 17 independent labs. Not a single lab produced a significant result and even a combined analysis failed to show any evidence for the effect.

4.5. Automatic Facial Responses

In another experiment, people whose face was shaped into a frown (by squeezing their eyebrows together) reported an enhanced emotional response to upsetting pictures—starving children, people arguing, maimed accident victims.

[Ulf Dimberg, Monika Thunberg, and Sara Grunedal, “Facial Reactions to Emotional Stimuli: Automatically Controlled Emotional Responses,” Cognition and Emotion, 16 (2002): 449–71.]

The description in the book does not match any of the three studies reported in this article. The first two studies examined facial muscle movements in response to pictures of facial expressions (smiling or frowning faces).  The third study used emotional pictures of snakes and flowers. We might consider the snake pictures as being equivalent to pictures of starving children or maimed accident victims.  Participants were also asked to frown or to smile while looking at the pictures. However, the dependent variable was not how they felt in response to pictures of snakes, but rather how their facial muscles changed.  Aside from a strong effect of instructions, the study also found that the emotional picture had an automatic effect on facial muscles.  Participants frowned more when instructed to frown and looking at a snake picture than when instructed to frown and looking at a picture of a flower. “This response, however, was larger to snakes than to flowers as indicated by both the Stimulus factor, F(1, 47) = 6.66, p < .02, and the Stimulus × Interval factor, F(1, 47) = 4.30, p < .05.”  (p. 463). The evidence for smiling was stronger. “The zygomatic major muscle response was larger to flowers than to snakes, which was indicated by both the Stimulus factor, F(1, 47) = 18.03, p < .001, and the Stimulus × Interval factor, F(1, 47) = 16.78, p < .001.”  No measures of subjective experiences were included in this study.  Therefore, the results of this study provide no evidence for Kahneman’s claim in the book and the results of this study are not included in our analysis.

4.6.  Effects of Head-Movements on Persuasion

“Simple, common gestures can also unconsciously influence our thoughts and feelings.”

“In one demonstration, people were asked to listen to messages through new headphones. They were told that the purpose of the experiment was to test the quality of the audio equipment and were instructed to move their heads repeatedly to check for any distortions of sound. Half the participants were told to nod their head up and down while others were told to shake it side to side. The messages they heard were radio editorials.”

“Those who nodded (a yes gesture) tended to accept the message they heard, but those who shook their head tended to reject it. Again, there was no awareness, just a habitual connection between an attitude of rejection or acceptance and its common physical expression.”

F(2,66)=44.70 0.000 7.22 1.00

MOP = 1.00, Inflation = .00,  R-Index = 1.00

[Gary L. Wells and Richard E. Petty, “The Effects of Overt Head Movements on Persuasion: Compatibility and Incompatibility of Responses,” Basic and Applied Social Psychology, 1, (1980): 219–30.]

4.7   Location as Prime

“Our vote should not be affected by the location of the polling station, for example, but it is.”

“A study of voting patterns in precincts of Arizona in 2000 showed that the support for propositions to increase the funding of schools was significantly greater when the polling station was in a school than when it was in a nearby location.”

“A separate experiment showed that exposing people to images of classrooms and school lockers also increased the tendency of participants to support a school initiative. The effect of the images was larger than the difference between parents and other voters!”

[Jonah Berger, Marc Meredith, and S. Christian Wheeler, “Contextual Priming: Where People Vote Affects How They Vote,” PNAS 105 (2008): 8846–49.]

z = 2.10 0.036 2.10 0.56
p = .05 0.050 1.96 0.50

MOP = .53, Inflation = .47, R-Index = .06

4.8  Money Priming

“Reminders of money produce some troubling effects.”

“Participants in one experiment were shown a list of five words from which they were required to construct a four-word phrase that had a money theme (“high a salary desk paying” became “a high-paying salary”).”

“Other primes were much more subtle, including the presence of an irrelevant money-related object in the background, such as a stack of Monopoly money on a table, or a computer with a screen saver of dollar bills floating in water.”

“Money-primed people become more independent than they would be without the associative trigger. They persevered almost twice as long in trying to solve a very difficult problem before they asked the experimenter for help, a crisp demonstration of increased self-reliance.”

“Money-primed people are also more selfish: they were much less willing to spend time helping another student who pretended to be confused about an experimental task. When an experimenter clumsily dropped a bunch of pencils on the floor, the participants with money (unconsciously) on their mind picked up fewer pencils.”

“In another experiment in the series, participants were told that they would shortly have a get-acquainted conversation with another person and were asked to set up two chairs while the experimenter left to retrieve that person. Participants primed by money chose to stay much farther apart than their nonprimed peers (118 vs. 80 centimeters).”

“Money-primed undergraduates also showed a greater preference for being alone.”

[Kathleen D. Vohs, “The Psychological Consequences of Money,” Science 314 (2006): 1154–56.]

F(2,49)=3.73 0.031 2.16 0.58
t(35)=2.03 0.050 1.96 0.50
t(37)=2.06 0.046 1.99 0.51
t(42)=2.13 0.039 2.06 0.54
F(2,32)=4.34 0.021 2.30 0.63
t(38)=2.13 0.040 2.06 0.54
t(33)=2.37 0.024 2.26 0.62
F(2,58)=4.04 0.023 2.28 0.62
chi^2(2)=10.10 0.006 2.73 0.78

MOP = .58, Inflation = .42, R-Index = .16

4.9  Death Priming

“The evidence of priming studies suggests that reminding people of their mortality increases the appeal of authoritarian ideas, which may become reassuring in the context of the terror of death.”

The cited article does not directly examine this question.  The abstract states that “three experiments were conducted to test the hypothesis, derived from terror management theory, that reminding people of their mortality increases attraction to those who consensually validate their beliefs and decreases attraction to those who threaten their beliefs” (p. 308).  Study 2 found no general effect of death priming. Rather, the effect was qualified by authoritarianism. “Mortality salience enhanced the rejection of dissimilar others in Study 2 only among high authoritarian subjects” (p. 314), based on a three-way interaction with F(1,145) = 4.08, p = .045.  We used the three-way interaction for the computation of the R-Index.  Study 1 reported opposite effects for ratings of Christian targets, t(44) = 2.18, p = .034, and Jewish targets, t(44) = 2.08, p = .043. As these tests are dependent, only one test could be used, and we chose the slightly stronger result.  Similarly, Study 3 reported significantly more liking of a positive interviewee and less liking of a negative interviewee, t(51) = 2.02, p = .049 and t(49) = 2.42, p = .019, respectively. We chose the stronger effect.

[Jeff Greenberg et al., “Evidence for Terror Management Theory II: The Effect of Mortality Salience on Reactions to Those Who Threaten or Bolster the Cultural Worldview,” Journal of Personality and Social Psychology]

t(44)=2.18 0.035 2.11 0.56
F(1,145)=4.08 0.045 2.00 0.52
t(49)=2.42 0.019 2.34 0.65

MOP = .56, Inflation = .44, R-Index = .12

4.10  The “Lady Macbeth Effect”

“For example, consider the ambiguous word fragments W_ _ H and S_ _ P. People who were recently asked to think of an action of which they are ashamed are more likely to complete those fragments as WASH and SOAP and less likely to see WISH and SOUP.”

“Furthermore, merely thinking about stabbing a coworker in the back leaves people more inclined to buy soap, disinfectant, or detergent than batteries, juice, or candy bars. Feeling that one’s soul is stained appears to trigger a desire to cleanse one’s body, an impulse that has been dubbed the “Lady Macbeth effect.”

[“Lady Macbeth effect”: Chen-Bo Zhong and Katie Liljenquist, “Washing Away Your Sins: Threatened Morality and Physical Cleansing,” Science 313 (2006): 1451–52.]

F(1,58)=4.26 0.044 2.02 0.52
F(1,25)=6.99 0.014 2.46 0.69

MOP = .61, Inflation = .39, R-Index = .22

The article reports two more studies that are not explicitly mentioned, but are used as empirical support for the Lady Macbeth effect. As the results of these studies were similar to those in the mentioned studies, including these tests in our analysis does not alter the conclusions.

chi^2(1)=4.57 0.033 2.14 0.57
chi^2(1)=5.02 0.025 2.24 0.61

MOP = .59, Inflation = .41, R-Index = .18

4.11  Modality Specificity of the “Lady Macbeth Effect”

“Participants in an experiment were induced to “lie” to an imaginary person, either on the phone or in e-mail. In a subsequent test of the desirability of various products, people who had lied on the phone preferred mouthwash over soap, and those who had lied in e-mail preferred soap to mouthwash.”

[Spike Lee and Norbert Schwarz, “Dirty Hands and Dirty Mouths: Embodiment of the Moral-Purity Metaphor Is Specific to the Motor Modality Involved in Moral Transgression,” Psychological Science 21 (2010): 1423–25.]

The results are presented as significant with a one-sided t-test. “As shown in Figure 1a, participants evaluated mouthwash more positively after lying in a voice mail (M = 0.21, SD = 0.72) than after lying in an e-mail (M = –0.26, SD = 0.94), F(1, 81) = 2.93, p = .03 (one-tailed), d = 0.55 (simple main effect), but evaluated hand sanitizer more positively after lying in an e-mail (M = 0.31, SD = 0.76) than after lying in a voice mail (M = –0.12, SD = 0.86), F(1, 81) = 3.25, p = .04 (one-tailed), d = 0.53 (simple main effect).”  We adjusted the significance criterion for the R-Index accordingly.

F(1,81)=2.93 0.045 1.69 0.52
F(1,81)=3.25 0.038 1.78 0.55

MOP = .54, Inflation = .46, R-Index = .08

4.12   Eyes on You

“On the first week of the experiment (which you can see at the bottom of the figure), two wide-open eyes stare at the coffee or tea drinkers, whose average contribution was 70 pence per liter of milk. On week 2, the poster shows flowers and average contributions drop to about 15 pence. The trend continues. On average, the users of the kitchen contributed almost three times as much in ’eye weeks’ as they did in ’flower weeks.’ ”

[Melissa Bateson, Daniel Nettle, and Gilbert Roberts, “Cues of Being Watched Enhance Cooperation in a Real-World Setting,” Biology Letters 2 (2006): 412–14.]

F(1,7)=11.55 0.011 2.53 0.72

MOP = .72, Inflation = .28, R-Index = .44

Combined Analysis

We then combined the results from the 31 studies mentioned above.  While the R-Index for small sets of studies may underestimate replicability, the R-Index for a large set of studies is more accurate.  Median Observed Power for all 31 studies is only 57%. It is incredible that 31 studies with 57% power could produce 100% significant results (Schimmack, 2012). Thus, there is strong evidence that the studies provide an overly optimistic image of the robustness of social priming effects.  Moreover, median observed power overestimates true power if studies were selected to be significant. After correcting for inflation, the R-Index is well below 50%.  This suggests that the studies have low replicability. Moreover, it is possible that some of the reported results are actually false positive results.  Just like the large-scale replication of the facial feedback studies failed to provide any support for the original findings, other studies may fail to show any effects in large replication projects. As a result, readers of “Thinking Fast and Slow” should be skeptical about the reported results and they should disregard Kahneman’s statement that “you have no choice but to accept that the major conclusions of these studies are true.”  Our analysis actually leads to the opposite conclusion. “You should not accept any of the conclusions of these studies as true.”

k = 31,  MOP = .57, Inflation = .43, R-Index = .14,  Grade: F for Fail
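The combined summary can be retraced from the observed power values printed in the section listings above. The sketch below simply transcribes the last column of those rows. Note that the rows as printed in this post add up to 29 tests, while the text reports k = 31, so the list is an approximation of the authors' data set rather than a copy of it.

```python
from statistics import median

# Observed power values transcribed from the listings in sections 4.2 to 4.12
# (head-movement study included; section 4.5 reported no usable test).
observed_powers = [
    0.76, 0.54,                                            # 4.2
    0.50, 0.53, 0.66,                                      # 4.3
    0.57, 0.57,                                            # 4.4
    1.00,                                                  # 4.6
    0.56, 0.50,                                            # 4.7
    0.58, 0.50, 0.51, 0.54, 0.63, 0.54, 0.62, 0.62, 0.78,  # 4.8
    0.56, 0.52, 0.65,                                      # 4.9
    0.52, 0.69, 0.57, 0.61,                                # 4.10
    0.52, 0.55,                                            # 4.11
    0.72,                                                  # 4.12
]

success_rate = 1.0                     # every listed result was significant
mop = median(observed_powers)
inflation = success_rate - mop
r_index = mop - inflation

print(f"k = {len(observed_powers)}, MOP = {mop:.2f}, "
      f"Inflation = {inflation:.2f}, R-Index = {r_index:.2f}")
```

Even with two rows missing from the transcription, the median and the resulting R-Index agree with the summary line above.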

Powergraph of Chapter 4kfs

Schimmack and Brunner (2015) developed an alternative method for the estimation of replicability.  This method takes into account that power can vary across studies. It also provides 95% confidence intervals for the replicability estimate.  The results of this method are presented in the Figure above. The replicability estimate is similar to the R-Index, with 14% replicability.  However, due to the small set of studies, the 95% confidence interval is wide and includes values above 50%. This does not mean that we can trust the published results, but it does suggest that some of the published results might be replicable in larger replication studies with more power to detect small effects.  At the same time, the graph shows clear evidence for a selection effect.  That is, published studies in these articles do not provide a representative picture of all the studies that were conducted.  The powergraph shows that there should have been a lot more non-significant results than were reported in the published articles.  The selective reporting of studies that worked is at the core of the replicability crisis in social psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012).  To clean up their act and to regain trust in published results, social psychologists have to conduct studies with larger samples that have more than 50% power (Tversky & Kahneman, 1971) and they have to stop reporting only significant results.  We can only hope that social psychologists will learn from the train wreck of social priming research and improve their research practices.


Replicability Report No. 2: Do Mating Primes Have Replicable Effects on Behavior?

In 2000, APA declared the following decade the decade of behavior.  The current decade may be considered the decade of replicability or rather the lack thereof.  The replicability crisis started with the publication of Bem’s (2011) infamous “Feeling the future” article.  In response, psychologists have started the painful process of self-examination.

Preregistered replication reports and systematic studies of reproducibility have demonstrated that many published findings are difficult to replicate and that, when they can be replicated, actual effect sizes are about 50% smaller than the effect sizes reported in the original articles (OSC, 2015).

To examine which studies in psychology produced replicable results, I created ReplicabilityReports.  Replicability reports use statistical tools that can detect publication bias and questionable research practices to examine the replicability of research findings in a particular research area.  The first replicability report examined the large literature of ego-depletion studies and found that only about a dozen studies may have produced replicable results.

This replicability report focuses on a smaller literature that used mating primes (images of potential romantic partners / imagining a romantic scenario) to test evolutionary theories of human behavior.  Most studies use the typical priming design, where participants are randomly assigned to one or more mating prime conditions or a control condition. After the priming manipulation the effect of activating mating-related motives and thoughts on a variety of measures is examined.  Typically, an interaction with gender is predicted with the hypothesis that mating primes have stronger effects on male participants. Priming manipulations vary from subliminal presentations to instructions to think about romantic scenarios for several minutes; sometimes with the help of visual stimuli.  Dependent variables range from attitudes towards risk-taking to purchasing decisions.

Shanks et al. (2015) conducted a meta-analysis of a subset of mating priming studies that focus on consumption and risk-taking.  A funnel plot showed clear evidence of bias in the published literature.  The authors also conducted several replication studies. The replication studies failed to produce any significant results. Although this outcome might be due to low power to detect small effects, a meta-analysis of all replication studies also produced no evidence for reliable priming effects (average d = .00, 95% CI = -.12 to .11).

This replicability report aims to replicate and extend Shanks et al.’s findings in three ways.  First, I expanded the database by including all articles that mentioned the term “mating primes” in a full-text search of social psychology journals.  This expanded the set of articles from 15 to 36 and the set of studies from 42 to 92. Second, I used a novel and superior bias test.  Shanks et al. used funnel plots and Egger’s regression of effect sizes on sampling error to examine bias. The problem with this approach is that heterogeneity in effect sizes can produce a negative correlation between effect sizes and sample sizes.  Power-based bias tests do not suffer from this problem (Schimmack, 2014).  A set of studies with average power of 60% cannot be expected to produce more than 60% significant results (Sterling et al., 1995).  Thus, the discrepancy between observed power and reported success rate provides clear evidence of selection bias. Powergraphs also make it possible to estimate the actual power of studies after correcting for publication bias and questionable research practices.  Finally, replicability reports use bias tests that can be applied to small sets of studies.  This makes it possible to find studies with replicable results even if most studies have low replicability.

DESCRIPTIVE STATISTICS

The dataset consists of 36 articles and 92 studies. The median sample size of a study was N = 103 and the total number of participants was N = 11,570. The success rate including marginally significant results, z > 1.65, was 100%.  The success rate excluding marginally significant results, z > 1.96, was 90%.  Median observed power for all 92 studies was 66%.  This discrepancy shows that the published results are biased towards significance.  When bias is present, median observed power overestimates actual power.  To correct for this bias, the R-Index subtracts the inflation rate from median observed power.  The R-Index is 66 – 34 = 32.  An R-Index below 50% implies that most studies will not replicate a significant result in an exact replication study with the same sample size and power as the original studies.  The R-Index for the 15 articles included in Shanks et al. was 34% and the R-Index for the additional studies was 36%.  This shows that convergent results were obtained for two independent samples based on different sampling procedures and that Shanks et al.’s limited sample was representative of the wider literature.

POWERGRAPH

For each study, a focal hypothesis test was identified and the result of the statistical test was converted into an absolute z-score.  These absolute z-scores can vary as a function of random sampling error or differences in power and should follow a mixture of normal distributions.  Powergraphs find the best mixture model that minimizes the discrepancy between observed and predicted z-scores.

Powergraph for Romance Priming (Focal Tests)

 

The histogram of z-scores shows clear evidence of selection bias. The steep cliff on the left side of the criterion for significance (z = 1.96) shows a lack of non-significant results.  The few non-significant results are all in the range of marginal significance and were reported as evidence for an effect.

The histogram also shows evidence of the use of questionable research practices. Selection bias would only produce a cliff to the left of the significance criterion, but a mixture-normal distribution on the right side of the significance criterion. However, the graph also shows a second cliff around z = 2.8.  This cliff can be explained by questionable research practices that inflate effect sizes to produce significant results.  These questionable research practices are much more likely to produce z-scores in the range between 2 and 3 than z-scores greater than 3.

The large number of z-scores in the range between 1.96 and 2.8 makes it impossible to distinguish between real effects with modest power and questionable effects with much lower power that will not replicate.  To obtain a robust estimate of power, power is estimated only for z-scores greater than 2.8 (k = 17).  The resulting power estimate is 73%. This power estimate suggests that some studies may have reported real effects that can be replicated.

The grey curve shows the predicted distribution for a set of studies with 73% power.  As can be seen, there are too many observed z-scores in the range between 1.96 and 2.8 and too few z-scores in the range between 0 and 1.96 compared to the predicted distribution based on z-scores greater than 2.8.
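To see roughly what a predicted distribution for 73% power implies, the sketch below computes the expected share of absolute z-scores in three ranges. It is a deliberate simplification and not the powergraph method itself: it assumes a single normal component with unit variance whose mean is chosen to give the stated power, whereas the powergraph fits a mixture of normal distributions. The function name is hypothetical.

```python
from statistics import NormalDist

def expected_shares(power, criterion=1.96, cut=2.8):
    """Expected shares of |z| below the criterion, between the criterion
    and the cut, and above the cut, for one normal component with the
    given power (simplified single-component model)."""
    noncentrality = criterion + NormalDist().inv_cdf(power)
    dist = NormalDist(mu=noncentrality, sigma=1)

    def abs_cdf(x):
        # P(|z| < x) for z ~ N(noncentrality, 1)
        return dist.cdf(x) - dist.cdf(-x)

    below = abs_cdf(criterion)
    middle = abs_cdf(cut) - below
    above = 1 - abs_cdf(cut)
    return below, middle, above

below, middle, above = expected_shares(0.73)
print(f"|z| < 1.96: {below:.2f}, 1.96 to 2.8: {middle:.2f}, > 2.8: {above:.2f}")
```

Under this simplified model, roughly a quarter of all results would be expected to fall below the significance criterion, which is the region where the histogram shows almost no observations.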

The powergraph analysis confirms and extends Shanks et al.’s (2015) findings. First, the analysis provides strong evidence that selection bias and questionable research practices contribute to the high success rate in the mating-prime literature.  Second, the analysis suggests that a small portion of studies may actually have reported true effects that can be replicated.

REPLICABILITY OF INDIVIDUAL ARTICLES

The replicability of results published in individual articles was examined with the Test of Insufficient Variance (TIVA) and the Replicability-Index.  TIVA tests bias by comparing the variance of observed z-scores against the variance that is expected based on sampling error.  As sampling error for z-scores is 1, observed z-scores should have at least a variance of 1. If there is heterogeneity, variance can be even greater, but it cannot be smaller than 1.  TIVA uses the chi-square test for variances to compute the probability that a variance less than 1 was simply due to chance.  A p-value less than .10 is used to flag an article as questionable.

The Replicability-Index (R-Index) uses observed power to test bias. Z-scores are converted into a measure of observed power and median observed power is used as an estimate of power.  The success rate (percentage of significant results) should match observed power.  The difference between success rate and median power shows an inflated success rate.  The R-Index subtracts inflation from median observed power.  A value of 50% is used as the minimum criterion for replicability.
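The computation described above can be sketched as follows. This is a minimal illustration that assumes a two-tailed significance criterion of .05 and a normal approximation for observed power; the function names are mine and the example z-scores are made up for the demonstration.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()


def observed_power(z, alpha=0.05):
    """Power of a two-tailed z-test if the true effect equals the observed z."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z = abs(z)
    return (1 - STD_NORMAL.cdf(crit - z)) + STD_NORMAL.cdf(-crit - z)


def r_index(z_scores, alpha=0.05):
    """Return (R-Index, median observed power, success rate)."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    mop = median(observed_power(z, alpha) for z in z_scores)
    success_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    inflation = success_rate - mop
    return mop - inflation, mop, success_rate


# Hypothetical z-scores, all significant
r, mop, success = r_index([2.0, 2.5, 3.0])
print(f"MOP = {mop:.2f}, success rate = {success:.2f}, R-Index = {r:.2f}")
```

When all reported results are significant, the R-Index reduces to two times the median observed power minus 1. For example, a median observed power of .62 with a 100% success rate gives .62 - (1 - .62) = .24.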

Articles that pass both tests are examined in more detail to identify studies with high replicability.  Only three articles passed this test.

1. Greitemeyer, Kastenmüller, and Fischer (2013) [R-Index = .80]

The article with the highest R-Index reported 4 studies.  The high R-Index for this article is due to Studies 2 to 4.  Studies 3 and 4 used a 2 x 3 between-subjects design with gender and three priming conditions. Both studies produced strong evidence for an interaction effect, Study 3: F(2,111) = 12.31, z = 4.33, Study 4: F(2,94) = 7.46, z = 3.30.  The pattern of the interaction is very similar in the two studies.  For women, the means are very similar and not significantly different from each other.  For men, the two mating prime conditions are very similar and significantly different from the control condition.  The standardized effect sizes for the difference between the combined mating prime conditions and the control conditions are large, Study 3: t(110) = 6.09, p < .001, z = 5.64, d = 1.63; Study 4: t(94) = 5.12, d = 1.30.

Taken at face value, these results are highly replicable, but there are some concerns about the reported results. The means in conditions that are not predicted to differ from each other are very similar.  I tested the probability of this event to occur using TIVA and compared the means of the two mating prime conditions for men and women in the two studies.  The four z-scores were z = 0.53, 0.08, 0.09, and -0.40.  The variance should be 1, but the observed variance is only Var(z) = 0.14.  The probability of this reduction in variance to occur by chance is p = .056.  Thus, even though the overall R-Index for this article is high and the reported effect sizes are very high, it is likely that an actual replication study will produce weaker effects and may not replicate the original findings.

Study 2 also produced strong evidence for a priming x gender interaction, F(1,81) = 11.23, z = 3.23.  In contrast to Studies 3 and 4, this interaction was a cross-over interaction with opposite effects of primes for males and females.  However, there is some concern about the reliability of this interaction because the post-hoc tests for males and females were both just significant, males: t(40) = 2.61, d = .82, females: t(41) = 2.10, d = .63.  As these post-hoc tests are essentially two independent studies, it is possible to use TIVA to test whether these results are too similar, Var(z) = 0.11, p = .25.  The R-Index for this set of studies is low, R-Index = .24 (MOP = .62).  Thus, a replication study may replicate an interaction effect, but the chance of replicating significant results for males or females separately is lower.

Importantly, Shanks et al. (2016) conducted two close replications of Greitemeyer’s studies with risky driving, gambling, and sexual risk taking as dependent variables.  Study 5 compared the effects of short-term mate primes on risky driving.  Although the sample size was small, the large effect size in the original study implies that this study had high power to replicate the effect, but it did not, t(77) = -0.85, p = .40, z = -.85.  The negative sign indicates that the pattern of means was reversed, but not significantly so.  Study 6 failed to replicate the interaction effect for sexual risk taking reported by Greitemeyer et al., F(1, 93) = 1.15, p = .29.  The means for male participants were in the opposite direction, showing a decrease in risk taking after mating priming.  The study also failed to replicate the significant decrease in risk taking for female participants.  Study 6 also produced non-significant results for gambling and substance risk taking.  These failed replication studies raise further concerns about the replicability of the original results with extremely large effect sizes.

2. Jon K. Maner, Matthew T. Gailliot, D. Aaron Rouby, and Saul L. Miller (JPSP, 2007) [R-Index = .62]

This article passed TIVA only due to the low power of TIVA for a set of three studies, TIVA: Var(z) = 0.15, p = .14.  In Study 1, male and female participants were randomly assigned to a sexual-arousal priming condition or a happiness control condition. Participants also completed a measure of socio-sexual orientation (i.e., interest in casual and risky sex) and were classified into groups of unrestricted and restricted participants. The dependent variable was performance on a dot-probe task.  In a dot-probe task, participants have to respond to a dot that appears in the location of one of two stimuli that compete for visual attention.  In theory, participants are faster to respond to the dot if it appears in the location of a stimulus that attracts more attention.  Stimuli were pictures of very attractive or less attractive members of the same or opposite sex.  The time between the presentation of the pictures and the dot was also manipulated.  The authors reported that they predicted a three-way interaction between priming condition, target picture, and stimulus-onset time.  The authors did not predict an interaction with gender.  The ANOVA showed a significant three-way interaction, F(1,111) = 10.40, p = .002, z = 3.15.  A follow-up two-way ANOVA showed an interaction between priming condition and target for unrestricted participants, F(1,111) = 7.69, p = .006, z = 2.72.

Study 2 replicated Study 1 with a sentence unscrambling task, which is used as a subtler priming manipulation.  The study closely replicated the results of Study 1. The three-way interaction was significant, F(1,153) = 9.11, and the follow-up two-way interaction for unrestricted participants was also significant, F(1,153) = 8.22, z = 2.75.

Study 3 changed the primes to jealousy or anxiety/frustration.  Jealousy is a mating related negative emotion and was predicted to influence participants like mating primes.  In this study, participants were classified into groups with high or low sexual vigilance based on a jealousy scale.  The predicted three-way interaction was significant, F(1,153) = 5.74, p = .018, z = 2.37.  The follow-up two-way interaction only for participants high in sexual vigilance was also significant, F(1,153) = 8.13, p = .005, z = 2.81.

A positive feature of this set of studies is that the manipulation of targets within subjects reduces within-cell variability and increases power to produce significant results.  However, a problem is that the authors also report analyses for specific targets and do not mention that they used reaction times to other targets as a covariate. These analyses have low power due to the high variability in reaction times across participants.  Surprisingly, each study still produced the predicted significant result.

Study 1: “Planned analyses clarified the specific pattern of hypothesized effects. Multiple regression evaluated the hypothesis that priming would interact with participants’ sociosexual orientation to increase attentional adhesion to attractive opposite-sex targets. Attention to those targets was regressed on experimental condition, SOI, participant sex, and their centered interactions (nonsignificant interactions were dropped). Results confirmed the hypothesized interaction between priming condition and SOI, beta = .19, p < .05 (see Figure 1).”
I used r = .19 and N = 113 and obtained t(111) = 2.04, p = .043, z = 2.02.

Study 2: “Planned analyses clarified the specific pattern of hypothesized effects. Regression evaluated the hypothesis that the mate-search prime would interact with sociosexual orientation to increase attentional adhesion to attractive opposite-sex targets. Attention to these targets was regressed on experimental condition, SOI score, participant sex, and their centered interactions (nonsignificant interactions were dropped). As in Study 1, results revealed the predicted interaction between priming condition and sociosexual orientation, beta = .15, p = .04, one-tailed (see Figure 2)”
I used r = .15 and N = 155 and obtained t(153) = 1.88, p = .06 (two-tailed!), z = 1.86.

Study 3: “We also observed a significant main effect of intrasexual vigilance, beta = .25, p < .001, partial r = .26, and, more important, the hypothesized two-way interaction between priming condition and level of intrasexual vigilance, beta = .15, p < .05, partial r = .16 (see Figure 3).”
I used r = .16 and N = 155 and obtained t(153) = 2.00, p = .047, z = 1.99.
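The t-values that I obtained from the reported coefficients follow from the standard formula for testing a correlation, t = r * sqrt(df) / sqrt(1 - r^2) with df = N - 2. The sketch below shows this step, treating each reported coefficient as a correlation as I did above. The second helper converts a two-tailed p-value into an absolute z-score; note that feeding it a rounded p-value gives a slightly different z than working with the exact p-value. Both function names are mine.

```python
import math
from statistics import NormalDist


def r_to_t(r, n):
    """t-statistic for a correlation r based on n participants (df = n - 2)."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2), df


def p_to_z(p_two_tailed):
    """Absolute z-score that corresponds to a two-tailed p-value."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)


# Coefficients and sample sizes used for Studies 1 to 3
for r, n in [(0.19, 113), (0.15, 155), (0.16, 155)]:
    t, df = r_to_t(r, n)
    print(f"r = {r:.2f}, N = {n}: t({df}) = {t:.2f}")
```

Running the loop reproduces the three t-values reported above, t(111) = 2.04, t(153) = 1.88, and t(153) = 2.00.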

The problem is that the results of these three independent analyses are too similar, z = 2.02, 1.86, 1.99; Var(z) = 0.007, p = .007.

In conclusion, there are some concerns about the replicability of these results and even if the results replicate they do not provide support for the hypothesis that mating primes have a hard-wired effect on males. Only one of the three studies produced a significant two-way interaction between priming and target (F-value not reported), and none of the three studies produced a significant three-way interaction between priming, target, and gender.  Thus, the results are inconsistent with other studies that found either main effects of mating primes or mating prime by gender interactions.

3. Bram Van den Bergh and Siegfried Dewitte (Proc. R. Soc. B, 2006) [R-index = .58]

This article reports three studies that examined the influence of mating primes on behavior in the ultimatum game.

Study 1 had a small sample size of 40 male participants who were randomly assigned to seeing pictures of non-nude female models or landscapes.  The study produced a significant main effect, F(1,40) = 4.75, p = .035, z = 2.11, and a significant interaction with finger digit ratio, F(1,40) = 4.70, p = .036, z = 2.10.  I used the main effect for analysis because it is theoretically more important than the interaction effect, but the results are so similar that it does not matter which effect is used.

Study 2 used ratings of women’s t-shirts or bras as the manipulation. The study produced strong evidence that mating primes (rating bras) lead to lower minimum acceptance rates in the ultimatum game than the control condition (rating t-shirts), F(1,33) = 8.88, p = .005, z = 2.78.  Once more the study also produced a significant interaction with finger digit ratio, F(1,33) = 8.76, p = .006, z = 2.77.

Study 3 had three experimental conditions, namely non-sexual pictures of older and young women, and pictures of young non-nude female models.  The study produced a significant effect of condition, F(2,87) = 5.49, p = .006, z = 2.77.  Once more the interaction with finger-digit ratio was also significant, F(2,87) = 5.42.

This article barely passed the test of insufficient variance in the primary analysis that uses one focal test per study, Var(z) = 0.15, p = .14.  However, the main effect and the interaction effects are statistically independent and it is possible to increase the power of TIVA by using the z-scores for the three main effects and the three interactions.  This test produces significant evidence for bias, Var(z) = 0.12, p = .01.

In conclusion, it is unlikely that the results reported in this article will replicate.

CONCLUSION

The replicability crisis in psychology has created doubt about the credibility of published results.  Numerous famous priming studies have failed to replicate in large replication studies.  Shanks et al. (2016) reported problems with the specific literature of romantic and mating priming.  This replicability report provided further evidence that the mating prime literature is not credible.  Using an expanded set of 92 studies, analysis with powergraphs, the test of insufficient variance, and the replicability index showed that many significant results were obtained with the help of questionable research practices that inflate observed effect sizes and provide misleading evidence about the strength and replicability of published results.  Only three articles passed the test with TIVA and R-Index and detailed examination of these studies also showed statistical problems with the evidence in these articles.  Thus, this replicability analysis of 36 articles failed to identify a single credible article.  The lack of credible evidence is consistent with Shanks et al.’s failure to produce significant results in 15 independent replication studies.

Of course, these results do not imply that evolutionary theory is wrong or that sexual stimuli have no influence on human behavior.  For example, in my own research I have demonstrated that sexually arousing opposite-sex pictures capture men’s and women’s attention (Schimmack, 2005).  However, these responses occurred in response to specific stimuli and not as carry-over effects of a priming manipulation. Thus, the problem with mating prime studies is probably that priming effects are weak and may have no notable influence on unrelated behaviors like consumer behavior or risk taking in investments.  Given the replication problems with other priming studies, it seems necessary to revisit the theoretical assumptions underlying this paradigm.  For example, Shanks et al. (2016) pointed out that behavioral priming effects are theoretically implausible because these predictions contradict well-established theories that behavior is guided by the cognitive appraisal of the situation at hand rather than unconscious residual information from previous situations. This makes evolutionary sense because behavior has to respond to the adaptive problem at hand to ensure survival and reproduction.

I recommend that textbook writers, journalists, and aspiring social psychologists treat claims about human behavior based on mating priming studies with a healthy dose of skepticism.  The results reported in these articles may reveal more about the motives of researchers than their participants.