Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommending a 4-Sigma Rule

After several years and many hours of hard work by hundreds of psychologists, the results of the OSF-Reproducibility project are in. The project aimed to replicate a representative set of 100 studies from top journals in social and cognitive psychology. The replication studies aimed to reproduce the original studies as closely as possible, while increasing sample sizes somewhat to reduce the risk of type-II errors (failure to replicate a true effect).

The results have been widely publicized in the media. Overall, only 36% of studies were successfully replicated; that is, the replication study also produced a significant result. More detailed analysis shows that results from cognitive psychology had a higher success rate (50%) than results from social psychology (25%).

This post describes the 9 results from social psychology that were successfully replicated. 6 out of the 9 successfully replicated studies reported highly significant results with a z-score greater than 4 sigma (standard deviations) from 0 (p < .00003). Particle physics uses a 5-sigma rule to avoid false positives and industry has adopted a 6-sigma rule in quality control.
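The sigma thresholds mentioned above can be translated into p-values with the standard normal distribution. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the original analysis. It shows that for z = 4 the one-tailed p-value is about .00003 and the two-tailed p-value is about .00006.

```python
import math

def upper_tail_p(z: float) -> float:
    """One-tailed p-value: probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_tailed_p(z: float) -> float:
    """Two-tailed p-value: probability of a value at least |z| away from 0."""
    return math.erfc(abs(z) / math.sqrt(2))

# Compare the 4-, 5-, and 6-sigma thresholds discussed in the text.
for sigma in (4, 5, 6):
    print(f"{sigma} sigma: one-tailed p = {upper_tail_p(sigma):.2e}, "
          f"two-tailed p = {two_tailed_p(sigma):.2e}")
```

Each additional sigma shrinks the false-positive probability by roughly two orders of magnitude, which is why the choice of threshold matters so much.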

Based on my analysis of the OSF-results, I recommend a 4-sigma rule for textbook writers, journalists, and other consumers of scientific findings in social psychology to avoid dissemination of false information.

List of Studies in Decreasing Order of Strength of Evidence

1. Single Study, Self-Report, Between-Subject Analysis, Extremely Large Sample (N = 230,047), Highly Significant Result (z > 4 sigma)

CJ Soto, OP John, SD Gosling, J Potter (2008). The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. JPSP:PPID.

This article reported results of a psychometric analysis of self-reports of personality traits in a very large sample (N = 230,047). The replication study used the exact same method with participants from the same population (N = 455,326). Not surprisingly, the results were replicated. Unfortunately, it is not an option to conduct all studies with huge samples like this one.

2. Four Studies, Self-Report, Large Sample (N = 211), One-Sample Test, Highly Significant Result (z > 4 sigma)

JL Tracy, RW Robins (2008). The nonverbal expression of pride: Evidence for cross-cultural recognition. JPSP:PPID.

The replication project focussed on the main effect in Study 4. The main effect in question was whether raters (N = 211) would accurately recognize non-verbal displays of pride in six pictures that displayed pride. The recognition rates were high (range 70%–87%) and highly significant. The sample size of N = 211 is large for a one-sample test that compares a sample mean against a fixed value.

3. Five Studies, Self-Report, Moderate Sample Size (N = 153), Correlation, Highly Significant Result (z > 4 sigma)

EP Lemay, MS Clark (2008). How the head liberates the heart: Projection of communal responsiveness guides relationship promotion. JPSP:IRGP.

Study 5 examined accuracy and biases in perceptions of responsiveness (caring and support for a partner). Participants (N = 153) rated their own responsiveness and how responsive their partner was. Ratings of perceived responsiveness were regressed on self-ratings of responsiveness and targets’ self-ratings of responsiveness. The results revealed a highly significant projection effect; that is, perceptions of responsiveness were predicted by self-ratings of responsiveness. This study produced a highly significant result despite a moderate sample size because the effect size was large.
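To get a rough sense of how strong an association must be to clear 4 sigma with N = 153, the Fisher z-transformation can be used. This is only a sketch under a simplifying assumption: it treats the effect as a simple bivariate correlation, whereas the original analysis was a regression with two predictors. The function names are hypothetical.

```python
import math

def fisher_z_stat(r: float, n: int) -> float:
    """Approximate z-statistic for a correlation r in a sample of size n."""
    return math.atanh(r) * math.sqrt(n - 3)

def min_r_for_z(z: float, n: int) -> float:
    """Smallest correlation that reaches the given z-statistic with n participants."""
    return math.tanh(z / math.sqrt(n - 3))

# Assumption: simple correlation, N = 153, threshold z = 4.
threshold_r = min_r_for_z(4.0, 153)
print(f"r needed for z = 4 with N = 153: {threshold_r:.3f}")
```

Under this approximation, a correlation a little above .3 is enough to exceed 4 sigma at this sample size, whereas smaller samples would require a considerably stronger effect.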

4. Single Study, Behavior, Moderate Sample (N = 240), Highly Significant Result (z > 4 sigma)

N Halevy, G Bornstein, L Sagiv (2008). In-Group-Love and Out-Group-Hate as Motives for Individual Participation in Intergroup Conflict: A New Game Paradigm, Psychological Science.

This study had a sample size of N = 240. Participants were recruited in groups of six. The experiment had four conditions. The main dependent variable was how a monetary reward was allocated. One manipulation was that some groups had the option to allocate money to the in-group whereas others did not have this option. Naturally, the percentages of allocation to the in-group differed across these conditions. Another manipulation allowed some group-members to communicate whereas in the other condition players had to make decisions on their own. This study produced a highly significant interaction between the two experimental manipulations that was successfully replicated.

5. Single Study, Self-Report, Large Sample (N = 82), Within-Subject Analysis, Highly Significant Result (z > 4 sigma)

M Tamir, C Mitchell, JJ Gross (2008). Hedonic and instrumental motives in anger regulation. Psychological Science.

In this study, 82 participants were asked to imagine being in two types of situations: scenarios with a hypothetical confrontation and scenarios without a confrontation. They also listened to music that was designed to elicit an excited, angry, or neutral mood. Afterwards, participants rated how much they would like to listen to the music they heard if they were in the hypothetical situation. Each participant listened to all pairings of situation and music, and the data were analyzed within-subject. A highly significant interaction revealed a preference for angry music in confrontations and a dislike of angry music without a confrontation; this interaction was successfully replicated. A sample of 82 participants is large for a within-subject comparison of means across conditions.

6. Single Study, Self-Report, Large Sample (N = 124), One-Sample Test, Highly Significant Result (z > 4 sigma)

DA Armor, C Massey, AM Sackett (2008). Prescribed optimism: Is it right to be wrong about the future? Psychological Science.

In this study, participants (N = 124) were asked to read 8 vignettes that involved making decisions. Participants were asked to judge whether they would recommend making pessimistic, realistic, or optimistic predictions. The main finding was that the average recommendation was to be optimistic. The effect was highly significant. A sample of N = 124 is very large for a design that compares a sample mean to a fixed value.

7. Four Studies, Self-Report, Small Sample (N = 71), Experiment, Moderate Support (z = 2.97)

BK Payne, MA Burkley, MB Stokes (2008). Why do implicit and explicit attitude tests diverge? The role of structural fit. JPSP:ASC.

In this study, participants worked on the standard Affect Misattribution Paradigm (AMP). In the AMP, two stimuli are presented in brief succession. In this study, the first stimulus was a picture of a European or African American face. The second stimulus was a picture of a Chinese pictogram. In the standard paradigm, participants are asked to report how much they like the second stimulus (Chinese pictogram) and to ignore the first stimulus (Black or White face). The AMP is typically used to measure racial attitudes because racial attitudes can influence responses to the Chinese characters.

In this study, the standard AMP was modified by giving two different sets of instructions. One was the standard instruction to respond to the Chinese pictograms. The other was to respond directly to the faces. All participants (N = 71) completed both tasks. Participants were randomly assigned to one of two conditions. One condition made it easier to honestly report prejudice (low social pressure). The other condition emphasized that prejudice is socially undesirable (high social pressure). The results showed a significantly stronger correlation between the two tasks (ratings of Chinese pictograms and faces) in the low social pressure condition than in the high social pressure condition, and this difference was replicated in the replication study.

8. Two Studies, Self-Report, Moderate Sample (N = 119), Correlation, Weak Support (z = 2.27)

JT Larsen, AR McKibban (2008). Is happiness having what you want, wanting what you have, or both? Psychological Science.

In this study, participants (N = 124) received a list of 62 material items and were asked to check whether they had each item or not (e.g., a cell phone). They then rated how much they wanted each item. Based on these responses, the authors computed measures of (a) how much participants wanted what they had and (b) how much they had what they wanted. The main finding was that life-satisfaction was significantly predicted by wanting what one has while controlling for having what one wants. This finding was also obtained in Study 1 (N = 124) and successfully replicated in the OSF-project with a larger sample (N = 238).

9. Five Studies, Behavior, Small Sample (N = 28), Main Effect, Very Weak Support (z = 1.80)

SM McCrea (2008). Self-handicapping, excuse making, and counterfactual thinking: Consequences for self-esteem and future motivation. JPSP:ASC.

In this study, all participants (N = 28) first worked on a very difficult math task and received failure feedback. Participants were then randomly assigned to two groups. One group was given feedback that they had insufficient practice (self-handicap). The control group was not given an explanation for their failure. All participants then worked on a second math task. The main effect showed that performance on the second task was better (higher percentage of correct answers) in the control group than in the self-handicap condition. Although this difference was only marginally significant (p < .05, one-tailed) in the original study, it was significant in the replication study with a larger sample (N = 61).

Although the percentage of correct answers showed only a marginally significant effect, the number of attempted answers and the absolute number of correct answers showed significant effects. Thus, this study does not count as a publication of a null-result. Moreover, these results suggest that participants in the control group were more motivated to do well because they worked on more problems and got more correct answers.
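For the three studies with z-scores below 4, the reported values can be converted to p-values to see how they compare with conventional thresholds. The sketch below uses the z-scores from the headings above; the labels and the helper function are illustrative, not taken from the OSF data files.

```python
import math

def one_tailed_p(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# z-scores as reported in the headings for studies 7 to 9.
reported_z = {
    "7. Payne, Burkley, Stokes": 2.97,
    "8. Larsen, McKibban": 2.27,
    "9. McCrea": 1.80,
}

for label, z in reported_z.items():
    p1 = one_tailed_p(z)
    p2 = 2 * p1  # two-tailed p-value
    print(f"{label}: z = {z:.2f}, one-tailed p = {p1:.4f}, "
          f"two-tailed p = {p2:.4f}, meets 4-sigma rule: {z > 4}")
```

With z = 1.80, the one-tailed p-value falls below .05 while the two-tailed p-value does not, which matches the description of the McCrea result as only marginally significant.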


5 thoughts on “Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommending a 4-Sigma Rule”

  1. “[S]ocial psychology…reported highly significant results with a z-score greater than 4 sigma (standard deviations) from 0 (p < .00003). Particle physics uses a 5-sigma rule to avoid false positives and industry has adopted a 6-sigma rule in quality control." This is not an apples-to-apples comparison because there is a mathematical 1.5 sigma shift in the "Six Sigma" methods used by industry.


      1. It means that what industry calls “six sigma” [3.4 defects per million] is a 4.5 sigma value in math. Therefore, industry quality control, for those that use “six sigma” processes, is not more conservative than particle physics. So the order, from least conservative to most conservative, should be: social psychology and cognitive psychology, industry quality control, particle physics.
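The arithmetic behind this comment can be checked directly. The sketch below assumes the conventional 1.5 sigma shift described in the comment and computes the one-sided tail probability at 6 - 1.5 = 4.5 sigma, expressed as defects per million; the variable and function names are illustrative.

```python
import math

def upper_tail_p(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

SHIFT = 1.5                    # assumed long-term process shift used in "Six Sigma"
effective_sigma = 6.0 - SHIFT  # 4.5 sigma in the statistical sense
defects_per_million = upper_tail_p(effective_sigma) * 1e6

print(f"effective sigma: {effective_sigma}")
print(f"defects per million: {defects_per_million:.1f}")
```

The result is about 3.4 defects per million, so the industry standard sits between the proposed 4-sigma rule and the 5-sigma rule of particle physics.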

