
Estimating Reproducibility of Psychology (No. 151): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that, out of 97 attempts to reproduce a significant result, only 36% succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies that ask why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to examine all of the studies in detail. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This makes it important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original studies. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 151, “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” by Simone Schnall and colleagues, has been the subject of heated debates among social psychologists. The main finding of the article failed to replicate in an earlier replication attempt (Johnson, Cheung, & Donnellan, 2012). In response to the replication failure, Simone Schnall suggested that the replication study was flawed and stood by her original findings. This response led me to publish my first R-Index blog post, which suggested that the original results were not as credible as they seemed because Simone Schnall had been trained to use questionable research practices that produce significant results with low replicability; she was simply not aware of the problems with these methods. Simone Schnall was not happy with my blog post, and when I refused to take it down, she complained to the University of Toronto about it. UofT found that the blog post did not violate ethical standards.

This background is important because the OSC replication study was one of the replication studies that had been published earlier and criticized by Schnall. Thus, it is necessary to revisit Schnall’s claim that the replication failure can be attributed to problems with the replication study.

Summary of Original Article 

The article “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” was published in Psychological Science. The article has been cited 197 times overall and 20 times in 2017.


The article extends previous research suggesting a connection between feelings of disgust and moral judgments. It reports two experiments that test the complementary hypothesis that thoughts of purity make moral judgments less severe. Study 1 used a priming manipulation; Study 2 evoked disgust followed by self-purification. Results of both studies confirmed this prediction.

Study 1

Forty undergraduate students (n = 20 per cell) participated in Study 1.

Half of the participants were primed with a scrambled sentence task that contained cleanliness words (e.g. pure, washed).  The other half did a scrambled sentence task with neutral words.

Right after the priming procedure, participants rated how morally wrong an action was in a series of six moral dilemmas.

The ANOVA showed a marginally significant mean difference, F(1,38) = 3.63, p = .064. The result was reported with p-rep = .90, an experimental statistic used by Psychological Science from 2005 to 2009 that was partially motivated by an attempt to soften the strict distinction between p-values just above and just below .05. Although a p-value of .064 is not meaningfully different from a p-value of .04, neither p-value suggests that a result is highly replicable. A p-value of .05 corresponds to 50% replicability (with large uncertainty around this point estimate), and even this estimate is inflated if questionable research methods were used to produce it.
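As a quick check of the last claim, observed power can be computed directly from a p-value (a minimal sketch in base R; the conversion assumes a two-tailed test, as in the original article):

# a result exactly at the .05 criterion has an observed z equal to the critical z,
# so exactly half of its sampling distribution falls above the criterion
z.crit <- qnorm(1 - .05 / 2)    # 1.96
pnorm(z.crit - z.crit)          # 0.50, i.e., 50% observed power at p = .05
# Study 1 (p = .064) falls slightly below the criterion
z.obs <- qnorm(1 - .064 / 2)    # 1.85
pnorm(z.obs - z.crit)           # about 0.46 observed power with the .05 criterion

With the more lenient p < .10 criterion used in the table below, the same z-score yields the 58% estimate reported there.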

Study 2

Study 2 could have followed up the weak evidence of Study 1 with a larger sample to increase statistical power.  However, the sample size in Study 2 was nearly the same (N = 44).

Participants first watched a disgusting film clip.  Half (n = 21) of the participants then washed their hands before rating moral dilemmas.  The other half (n = 22) did not wash their hands.

The ANOVA showed a significant difference between the two conditions, F(1,41) = 7.81, p = .008.

Replicability Analysis 

No N Test p.val z OP
Study 1 40 F(1,38)=3.63 0.064 1.85 0.58*
Study 2 44 F(1,41)=7.81 0.008 2.66 0.76

*  using p < .10 as criterion for power analysis

With two studies it is difficult to predict replicability because observed power in a single study is strongly influenced by sampling error.  Individually, Study 1 has a low replicability index because the success (p < .10) was achieved with only 58% power. The inflation index (100 – 58 = 42) is high and the R-Index, 58 – 42 = 16, is low.

Combining both studies still produces a low R-Index (median observed power = 67%, inflation = 33, R-Index = 67 – 33 = 34).
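For readers who want to verify these numbers, here is a minimal sketch of the R-Index calculation in base R, using the two p-values from the table above and the success criteria noted there:

p    <- c(0.064, 0.008)              # reported two-tailed p-values
crit <- c(0.10, 0.05)                # success criteria (p < .10 for Study 1)
z    <- -qnorm(p / 2)                # z-scores: 1.85, 2.66
op   <- pnorm(z, -qnorm(crit / 2))   # observed power: 0.58, 0.76
mop  <- median(op)                   # median observed power = .67
mop - (1 - mop)                      # R-Index = .67 - .33 = .34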

My original blog post pointed out that we can predict replicability based on a researcher’s typical R-Index. If a researcher typically conducts studies with high power, a p-value of .04 will sometimes occur due to bad luck, but the replication study is likely to be successful with a lower p-value because bad luck does not repeat itself.

In contrast, if a researcher conducts low powered studies, a p-value of .04 is a lucky outcome and the replication study is unlikely to be lucky again and therefore more likely to produce a non-significant result.

Since I published the blog post, Jerry Brunner and I have developed a new statistical method that allows meta-psychologists to take a researcher’s typical research practices into account. This method is called z-curve.

The figure below shows the z-curve for automatically extracted test statistics from articles by Simone Schnall from 2003 to 2017.  Trend analysis showed no major changes over time.

 

For some help with reading these plots check out this blog post.

The figure shows a few things. First, the peak (mode) of the distribution is at z = 1.96, which corresponds to the criterion for statistical significance (p < .05, two-tailed). The steep drop to the left of this peak is not explained by normal sampling error and reveals the influence of questionable research practices (QRPs); this is not unique to Schnall, and the plot looks similar for other social psychologists. The grey line is a rough estimate of the proportion of non-significant results that would be expected given the distribution of significant results. The discrepancy between the proportion of actually reported non-significant results and the grey line shows the extent of the influence of QRPs.

[Figure: z-curve of automatically extracted test statistics from Simone Schnall’s articles, 2003-2017]

Once QRPs are present, the observed power of significant results is inflated. The average estimate is 48%, but actual power varies. The estimates below the x-axis show power estimates for different ranges of z-scores. Even z-scores between 2.5 and 3 have an average power estimate of only 38%. This implies that the z-score of 2.66 in Study 2 has a bias-corrected observed power of less than 50%, and because 50% power corresponds to p = .05, the bias-corrected p-value is no longer significant.

A new way of using z-curve is to fit models with different fixed proportions of false positive results and to compare the fit of these models.

[Figure: z-curve models with fixed proportions of false positive results]

The plot shows that models with 0% or 20% false positives fit the data about equally well, but a model with 40% false positives leads to notably worse model fit. Although this new feature is still in development, the results suggest that few of Schnall’s results are strictly false positives, but that many of her results may be difficult to replicate because QRPs produced inflated effect sizes and much larger samples might be needed to produce significant results (e.g., N > 700 is needed for 80% power with a small effect size of d = .2).
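The sample size claim at the end of the previous paragraph can be checked with base R’s power.t.test function (a sketch assuming a two-sided, two-sample t-test with alpha = .05):

power.t.test(delta = 0.2, sd = 1, sig.level = .05, power = .80)
# returns n of about 394 per group, i.e., a total N of roughly 788 (> 700)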

In conclusion, given the evidence for the presence of QRPs and the weak evidence for the cleanliness hypothesis, it is unlikely that equally underpowered studies would replicate the effect. At the same time, larger studies might produce significant results with weaker effect sizes. Given the large sampling error in small samples, it is impossible to say how small the effects would be and how large samples would have to be to detect them with high power.

Actual Replication Study

The replication study was carried out by Johnson, Cheung, and Donnellan.

Johnson et al. conducted replication studies of both studies with considerably larger samples.

Study 1 was replicated with 208 participants (vs. 40 in original study).

Study 2 was replicated with 126 participants (vs. 44 in original study).

Even if some changes in experimental procedures would have slightly lowered the true effect size, the larger samples would have compensated for this by reducing sampling error.

However, neither replication produced a significant result.

Study 1: F(1, 206) = 0.004, p = .95

Study 2: F(1, 124) = 0.001, p = .97.

Just as two p-values of .05 and .07 are unlikely, it is also unlikely to obtain two p-values of .95 and .97 even if the null-hypothesis is true, because sampling error produces spurious mean differences. When the null-hypothesis is true, p-values have a uniform distribution, and we would expect 10% of p-values to fall between .9 and 1. Observing this event twice in a row has a probability of .10 * .10 = .01. Unusual events do sometimes happen by chance, but defenders of the original research could use this observation to suggest “reverse p-hacking,” a term coined by Fritz Strack to insinuate that replication researchers may have an interest in making original effects go away. Although I do not believe that this was the case here, it would be unscientific to ignore the surprising similarity of these two p-values.
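The .01 figure follows directly from the uniform distribution of p-values under the null hypothesis; a short simulation in R illustrates the point (the number of simulated study pairs is arbitrary):

set.seed(1)                                # arbitrary seed for reproducibility
n.sim <- 1e5                               # simulated pairs of true null studies
p1 <- runif(n.sim); p2 <- runif(n.sim)     # p-values are uniform when H0 is true
mean(p1 > .9 & p2 > .9)                    # close to .1 * .1 = .01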

The authors conducted two more replication studies. These also produced non-significant results, with p = .31 and p = .27. Thus, the similarity of the first two p-values was a statistical fluke, just as suspiciously similar p-values of .04 are sometimes merely a chance finding.

Schnall’s Response 

In a blog post, Schnall comments on the replication failure.  She starts with the observation that publishing failed replications is breaking with old traditions.

One thing, though, with the direct replications, is that now there can be findings where one gets a negative result, and that’s something we haven’t had in the literature so far, where one does a study and then it doesn’t match the earlier finding. 

Schnall is concerned that a failed replication could damage the reputation of the original researcher, if the failure is attributed either to a lack of competence or a lack of integrity.

Some people have said that well, that is not something that should be taken personally by the researcher who did the original work, it’s just science. These are usually people outside of social psychology because our literature shows that there are two core dimensions when we judge a person’s character. One is competence—how good are they at whatever they’re doing. And the second is warmth or morality—how much do I like the person and is it somebody I can trust.

Schnall believes that direct replication studies were introduced as a crime-control measure in response to the revelation that Diederik Stapel had made up data in over 50 articles. This violation of research integrity is called fabrication. However, direct replication studies are not an effective way to detect fabrication (Stroebe & Strack, 2014).

In social psychology we had a problem a few years ago where one highly prominent psychologist turned out to have defrauded and betrayed us on an unprecedented scale. Diederik Stapel had fabricated data and then some 60-something papers were retracted… This is also when this idea of direct replications was developed for the first time where people suggested that to be really scientific we should do what the clinical trials do rather our regular [publish conceptual replication studies that work] way of replication that we’ve always done.

Schnall overlooks that another reason for direct replications was concern about falsification.

Falsification is manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record (The Office of Research Integrity)

In 2011 and 2012, numerous articles suggested that falsification is a much bigger problem than fabrication, and direct replications were used to examine whether falsified evidence also produced false positive results that could not be replicated. Failures in direct replications are at least in part due to the use of questionable research practices that inflate effect sizes and success rates.

Today it is no longer a secret that many studies failed to replicate because original studies reported inflated effect sizes (OSC, 2015). Given the widespread use of QRPs, especially in experimental social psychology, replication failures are the norm. In this context, it makes sense that individual researchers feel attacked if one of their studies is replicated.

There’s been a disproportional number of studies that have been singled out simply because they’re easy to conduct and the results are surprising to some people outside of the literature

Why me? However, the OSC (2015) project did not single out individual researchers. It put any study published in JPSP or Psychological Science in 2008 up for replication. Maybe the ease of replication was a factor.

Schnall’s next complaint is that failures to replicate are treated as more credible than successful original studies.

Often the way these replications are interpreted is as if one single experiment disproves everything that has come before. That’s a bit surprising, especially when a finding is negative, if an effect was not confirmed. 

This argument ignores two things. First, original researchers have a motivated bias to show a successful result, whereas researchers who conduct direct replication studies are open to finding either a positive or a negative result. Second, Schnall ignores sample size. Her original Study 1 had a sample size of N = 40; the replication study had a sample size of N = 208. Studies with larger samples have less sampling error and are more robust to violations of the statistical assumptions underlying significance tests. Thus, there are good reasons to trust the results of the failed replication studies more than the results of Schnall’s small original study.
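To make the sample size point concrete, the standard error of a standardized mean difference (d) for two equal groups is approximately sqrt(4/N) when d is small; a quick sketch in R compares the two sample sizes (the approximation ignores the small d^2/(2N) term):

se.d <- function(N) sqrt(4 / N)   # rough SE of d for two equal groups and small d
se.d(40)                          # about 0.32 for the original Study 1
se.d(208)                         # about 0.14 for the replication, less than half
                                  # the sampling error of the original study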

Her next issue was that a special issue published a failed replication without peer review.  This led to some controversy, but it is not the norm.  More important, Schnall overstates the importance of traditional, anonymous, pre-publication peer-review.

It may not seem like a big deal but peer review is one of our laws; these are our publication ethics to ensure that whatever we declare as truth is unbiased. 

Pre-publication peer-review does not ensure that published results are unbiased. The OSC (2015) results clearly show that published results were biased in favor of supporting researchers’ hypotheses. Traditional peer-review does not check whether researchers used QRPs. Moreover, peer-review does not end once a result is published: it is possible to evaluate the results of original studies or replication studies even after the results are published.

And this is what Schnall did. She looked at the results and claimed that there was a mistake in the replication study.

I looked at their data, looked at their paper and I found what I consider a statistical problem.

However, others looked at the data and did not agree with her. This led Schnall to consider replications a form of bullying.

“One thing I pointed to was this idea of this idea of replication bullying, that now if a finding doesn’t replicate, people take to social media and declare that they “disproved” an effect, and make inappropriate statements that go well beyond the data.”

It is of course ridiculous to think of failed replication studies as a form of bullying. We would not need to conduct empirical studies if only successful replication studies were allowed to be published. Apparently some colleagues tried to point this out to Schnall.

Interestingly, people didn’t see it that way. When I raised the issue, some people said yes, well, it’s too bad she felt bullied but it’s not personal and why can’t scientists live up to the truth when their finding doesn’t replicate?

Schnall could not see it this way.  According to her, there are only two reasons why a replication study may fail.

If my finding is wrong, there are two possibilities. Either I didn’t do enough work and/or reported it prematurely when it wasn’t solid enough or I did something unethical.

In reality there are many more reasons for a replication failure. One possible explanation is that the original result was an honest false positive finding. The very notion of significance testing implies that some published findings are false positives and that only future replication studies can tell us which published findings those are. So a simple response to a failed replication is to say that it probably was a false positive result, and that is the end of the story.

But Schnall does not believe that it is a false positive result ….

because so far I don’t know of a single person who failed to replicate that particular finding that concerned the effect of physical cleanliness and moral cleanliness. In fact, in my lab, we’ve done some direct replications, not conceptual replications, so repeating the same method. That’s been done in my lab, that’s been done in a different lab in Switzerland, in Germany, in the United States and in Hong Kong; all direct replications. As far as I can tell it is a solid effect.

The problem with this version of the story is that it is impossible to get significant results again and again with small samples, even if the effect is real. So it is not credible that Schnall was able to obtain significant results in many unpublished studies and never obtained a contradictory result (Schimmack, 2012).

Despite many reasonable comments about the original study and the replication studies (e.g., sample size, QRPs, etc.), Schnall cannot escape the impression that replication researchers have an agenda to tear down good research.

Then the quality criteria are oftentimes not nearly as high as for the original work. The people who are running them sometimes have motivations to not necessarily want to find an effect as it appears.

This accusation motivated me to publish my first blog post and to elaborate on this study from the OSC reproducibility project. There is ample evidence that QRPs contributed to replication failures. In contrast, there is absolutely no empirical evidence that replication researchers deliberately produced non-significant results, and as far as I know Schnall has not yet apologized for her unfounded accusation.

One reason for her failure to apologize is probably that many social psychologists expressed support for Schnall either in public or mostly in private.

I raised these concerns about the special issue, I put them on a blog, thinking I would just put a few thoughts out there. That blog had some 17,000 hits within a few days. I was flooded with e-mails from the community, people writing to me to say things like “I’m so glad that finally somebody’s saying something.” I even received one e-mail from somebody writing to me anonymously, expressing support but not wanting to reveal their name. Each and every time I said: “Thank you for your support. Please also speak out. Please say something because we need more people to speak out openly. Almost no one did so.”

Schnall overlooks a simple solution to the problem.  Social psychologists who feel attacked by failed replications could simply preregister their own direct replications with large samples and show that their results do replicate.  This solution was suggested by Daniel Kahneman in 2012 in response to a major replication failure of a study by John Bargh that cast doubt on social priming effects.

What social psychology needs to do as a field is to consider our intuitions about how we make judgments, about evidence, about colleagues, because some of us have been singled out again and again and again. And we’ve been put under suspicion; whole areas of research topics such as embodied cognition and priming have been singled out by people who don’t work on the topics. False claims have been made about replication findings that in fact are not as conclusive as they seem. As a field we have to set aside our intuitions and move ahead with due process when we evaluate negative findings. 

However, what is most telling is the complete absence of direct replications by experimental social psychologists demonstrating that their published results can be replicated. The first major attempt of this kind, a massive self-replication by Vohs and Schmeichel, just failed to replicate ego-depletion.

In conclusion, it is no longer a secret that experimental social psychologists have used questionable research practices to produce more significant results than unbiased studies would produce. The response to this crisis of confidence has been denial.

Estimating Reproducibility of Psychology (No. 68): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that, out of 97 attempts to reproduce a significant result, only 36% succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies that ask why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to examine all of the studies in detail. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This makes it important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original studies. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “Why People Are Reluctant to Tempt Fate” by Risen and Gilovich examined magical thinking in six experiments. The evidence suggests that individuals are reluctant to tempt fate because doing so increases the accessibility of thoughts about negative outcomes. The article has been cited 58 times so far, including 10 citations in 2017, although the key finding failed to replicate in the OSC (Science, 2015) replication study.


Study 1

Study 1 demonstrated the basic phenomenon. 62 students read a scenario about a male student who applied to a prestigious university. His mother sent him a t-shirt with the logo of the university. In one condition, he decided to wear the t-shirt; in the other condition, he stuffed it in the bottom drawer. Participants rated how likely it would be that the student would be accepted. Participants thought acceptance was more likely if the student did not wear the t-shirt (wearing it M = 5.19, SD = 1.35; stuffed away M = 6.13, SD = 1.02), t(60) = 3.01, p = .004, d = 0.78.

Study 2

120 students participated in Study 2 (n = 30 per cell). Study 2 manipulated whether participants imagined themselves or somebody else in a scenario. The scenario was about the probability of a professor picking a student to answer a question.  The experimental factor was whether students had done the reading or not. Not having done the reading was considered tempting fate.

The ANOVA results showed a significant main effect of tempting fate (not prepared M = 3.43, SD = 2.34; prepared M = 2.53, SD = 2.24), F(1, 116) = 4.60, p = .034, d = 0.39.

Study 3

Study 3 examined whether tempting fate increases the accessibility of thoughts about negative outcomes with 211 students. Accessibility was measured with reaction times to two scenarios matching those from Studies 1 and 2. Participants had to indicate as quickly as possible whether the ending of a story matched the beginning of the story.

Analyses were carried out separately for each story. Participants were faster to judge that not getting into a prestigious university was a reasonable ending after reading that a student tempted fate by wearing a t-shirt with the university logo (wearing t-shirt M = 2,671 ms, SD = 1,113) than after reading that he stuffed the shirt in the drawer (M = 3,176 ms, SD = 1,573), F(1, 171) = 11.01, p = .001, d = 0.53.

The same result was obtained for judgments of tempting fate by not doing the readings for a class (not prepared M = 2,879 ms, SD = 1,149; prepared M = 3,112 ms, SD = 1,226), F(1, 184) = 7.50, p = .007, d = 0.26.

Study 4 

Study 4 aimed to test the mediation hypothesis. Notably the sample size is much smaller than in Study 3 (N = 96 vs. N = 211).

The study used the university application scenario. For half the participants the decision was acceptance and for the other half it was rejection.

The reaction time ANOVA showed a significant interaction, F(1, 87) = 15.43.

As in Study 3, participants were faster to respond to a rejection after wearing the shirt than after not wearing it (wearing M = 3,196 ms, SD = 1,348; not wearing M = 4,324 ms, SD = 2,194), F(1, 41) = 9.13, p = .004, d = 0.93. Surprisingly, the effect size was twice as large as in Study 3.

The novel finding was that participants were faster to respond to an acceptance decision after not wearing the shirt than after wearing it (not wearing M = 2,995 ms, SD = 1,175;  wearing M = 3,551 ms, SD = 1,432),  F(1, 45) = 6.07, p = .018, d = 0.73.

Likelihood results also showed a significant interaction, F(1, 92) = 10.49, p = .002.

As in Study 2, in the rejection condition participants believed that a rejection was more likely after wearing the shirt than after putting it away (M = 5.79, SD = 1.53; M = 4.79, SD = 1.56), t(46) = 2.24, p = .030, d = 0.66.  In the new acceptance condition, participants thought that an acceptance was less likely after wearing the shirt than after putting it away (wore shirt M = 5.88, SD = 1.51;  did not wear shirt M = 6.83, SD = 1.31), t(46) = 2.35, p = .023, d = 0.69.  [The two p-values are surprisingly similar]

The mediation hypothesis was tested separately for the rejection and acceptance conditions. For the rejection condition, the Sobel test was significant, z = 1.96, p = .05. For the acceptance condition, the result was considered to be “supported by a marginally significant Sobel (1982) test, z = 1.91, p = .057.” [It is unlikely that two independent statistical tests produce p-values of .05 and .057]

Study 5

Study 5 is the icing on the cake. It aimed to manipulate accessibility by means of a subliminal priming manipulation.  [This was 2008 when subliminal priming was considered a plausible procedure]

Participants were 111 students.

The main story was about a woman who did or did not (tempt fate) bring an umbrella when the forecast predicted rain.  The ending of the story was that it started to rain hard.

For the reaction times, the interaction between subliminal priming and the manipulation of tempting fate (the protagonist brought an umbrella or not) was significant, F(1, 85) = 5.89.

In the control condition with a nonsense prime, participants were faster to respond to the ending that it would rain if the protagonist did not bring an umbrella than when she did (no umbrella M = 2,694 ms, SD = 876; umbrella M = 3,957 ms, SD = 2,112), F(1, 43) = 15.45, p = .0003, d = 1.19. This finding conceptually replicated Studies 3 and 4.

In the priming condition, no significant effect of tempting fate was observed (no umbrella M = 2,749 ms, SD = 971, umbrella M = 2,770 ms, SD = 1,032).

For the likelihood judgments, the interaction was only marginally significant, F(1, 86) = 3.62, p = .06.

However, in the control condition with nonsense primes, the typical tempt fate effect was significant (no umbrella M = 6.96, SD = 1.31; M = 6.15, SD = 1.46), t(44) = 2.00, p = .052 (reported as p = .05), d = 0.58.

The tempt fate effect was not observed in the priming condition when participants were subliminally primed with rain (no umbrella M = 7.11, SD = 1.56; M = 7.16, SD = 1.41).

As in Study 4, “the mediated relation was supported by a marginally significant Sobel (1982) test, z = 1.88, p = .06.” It is unlikely to get p = .05, p = .06, and p = .06 in three independent mediation tests.

Study 6

Study 6 is the last study and the one that was chosen for the replication attempt.

122 students participated.  Study 6 used the scenario of being called on by a professor either prepared or not prepared (tempting fate).  The novel feature was a cognitive load manipulation.

The interaction between load manipulation and tempting fate manipulation was significant, F(1, 116) = 4.15, p = .044.

The no-load condition was a replication of Study 2 and replicated the significant effect of tempting fate (not prepared M = 2.93, SD = 2.16; prepared M = 1.90, SD = 1.42), t(58) = 2.19, p = .033, d = 0.58.

Under the load condition, the effect was even more pronounced (not prepared M = 5.27, SD = 2.36; prepared M = 2.70, SD = 2.17), t(58) = 4.38, p = .00005, d = 1.15.

A comparison of participants in the tempting fate condition showed a significant difference between the load and the no-load condition, t(58) = 3.99, p = .0002, d = 0.98.

Overall the results suggest that some questionable research practices were used (e.g., mediation tests p = .05, .06, .06).  The interaction effect in Study 6 with the load condition was also just significant and may not replicate.  However, the main effect of the tempting fate manipulation on likelihood judgments was obtained in all studies and might replicate.

Replication Study 

The replication study used an Mturk sample. The sample size was larger than in the original study (N = 226 vs. 122).

The load manipulation led to higher likelihood estimates of being called on, suggesting that the manipulation was effective even with Mturk participants, F(1,122) = 10.28.

However, the study did not replicate the interaction effect, F(1, 122) = 0.002.  More surprisingly, it also failed to show a main effect for the tempting-fate manipulation, F(1,122) = 0.50, p = .480.

One possible reason for the failure to replicate the tempting fate effect in this study could be the use of a school/university scenario (being called on by a professor) with Mturk participants who are older.

However, the results for the same scenario in the original article are not very strong.

In Study 2, the p-value was p = .034, and in the no-load condition of Study 6 the p-value was p = .033. Thus, neither the interaction with load nor the main effect of the tempting fate manipulation was strongly supported in the original article.
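Applying the same p-value-to-power conversion used in the earlier posts to these two results makes this concrete (a sketch; as before, observed power is computed against the two-tailed .05 criterion and is itself inflated if QRPs were used):

p <- c(0.034, 0.033)      # key tempting-fate p-values in Study 2 and Study 6
z <- -qnorm(p / 2)        # about 2.12 and 2.13
pnorm(z, 1.96)            # observed power of roughly .56 for each test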

Conclusion

It is never possible to show definitively that QRPs were used, but it is possible that the use of QRPs in the original article explains the replication failure; other explanations are also possible. The most plausible alternative explanation would be the use of an Mturk sample. A replication study with a student sample or a replication of one of the other scenarios would be desirable.

Klaus Fiedler’s Response to the Replication Crisis: In/actions speak louder than words

Klaus Fiedler is a prominent experimental social psychologist. Aside from his empirical articles, Klaus Fiedler has contributed to meta-psychological articles. He is one of several authors of a highly cited article that suggested numerous improvements in response to the replication crisis: Recommendations for Increasing Replicability in Psychology (Asendorpf, Conner, De Fruyt, De Houwer, Denissen, K. Fiedler, S. Fiedler, Funder, Kliegel, Nosek, Perugini, Roberts, Schmitt, van Aken, Weber, & Wicherts, 2013).

The article makes several important contributions. First, it recognizes that success rates (p < .05) in psychology journals are too high (although a reference to Sterling, 1959, is missing). Second, it carefully distinguishes reproducibility, replicability, and generalizability. Third, it recognizes that future studies need to decrease sampling error to increase replicability. Fourth, it points out that reducing sampling error increases replicability because studies with less sampling error have more statistical power, which reduces the risk of false negative results that often remain unpublished. The article also points out problems with articles that present results from multiple underpowered studies.

“It is commonly believed that one way to increase replicability is to present multiple studies. If an effect can be shown in different studies, even though each one may be underpowered, many readers, reviewers, and editors conclude that it is robust and replicable. Schimmack (2012), however, has noted that the opposite can be true. A study with low power is, by definition, unlikely to obtain a significant result with a given effect size.” (p. 111)

If we assume that co-authorship implies knowledge of the content of an article, we can infer that Klaus Fiedler was aware of the problem of multiple-study articles in 2013. It is therefore disconcerting to see that Klaus Fiedler is the senior author of an article published in 2014 that illustrates the problem of multiple study articles (T. Krüger,  K. Fiedler, Koch, & Alves, 2014).

I came across this article in a response by Jens Forster to a failed replication of Study 1 in Forster, Liberman, and Kuschel (2008). Forster cites the Krüger et al. (2014) article as evidence that their findings have been replicated, in order to discredit the failed replication in the Open Science Collaboration replication project (Science, 2015). However, a bias analysis suggests that Krüger et al.’s five studies had low power and a surprisingly high success rate of 100%.

No N Test p.val z OP
Study 1 44 t(41)=2.79 0.009 2.61 0.74
Study 2 80 t(78)=2.81 0.006 2.73 0.78
Study 3 65 t(63)=2.06 0.044 2.02 0.52
Study 4 66 t(64)=2.30 0.025 2.25 0.61
Study 5 170 t(168)=2.23 0.027 2.21 0.60

z = -qnorm(p.val/2);  OP (observed power) = pnorm(z, 1.96)

Median observed power is only 61%, but the success rate (p < .05) is 100%. Using the incredibility index from Schimmack (2012), we find that the binomial probability of obtaining at least one non-significant result in five studies with a median power of 61% is 92%. Thus, the absence of non-significant results in this set of five studies is unlikely.
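A minimal sketch of this incredibility computation in base R:

k   <- 5       # number of reported studies, all significant
mop <- 0.61    # median observed power from the table above
1 - mop^k      # probability of at least one non-significant result: about .92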

As Klaus Fiedler was aware of the incredibility index by the time this article was published, the authors could have computed the incredibility of their results before publishing them (as Micky Inzlicht blogged, “check yourself, before you wreck yourself”).

Meanwhile, other bias tests have been developed. The Test of Insufficient Variance (TIVA) compares the observed variance of p-values converted into z-scores to the expected variance of independent z-scores, which is 1. The observed variance is much smaller, var(z) = 0.089, and the probability of obtaining such small variation or less by chance is p = .014. Thus, TIVA corroborates the conclusion from the incredibility index that the reported results are too good to be true.
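A base-R sketch of the TIVA computation for the five z-scores in the table above; the test compares (k - 1) times the observed variance to a chi-square distribution with k - 1 degrees of freedom, because independent z-scores should have a variance of 1:

z <- c(2.61, 2.73, 2.02, 2.25, 2.21)    # z-scores from the table above
k <- length(z)
var(z)                                  # observed variance, about 0.09
pchisq((k - 1) * var(z), df = k - 1)    # chance of such low variance: about .014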

Another new method is z-curve. Z-curve fits a model to the density distribution of significant z-scores.  The aim is not to show bias, but to estimate the true average power after correcting for bias.  The figure shows that the point estimate of 53% is high, but the 95%CI ranges from 5% (all 5 significant results are false positives) to 100% (all 5 results are perfectly replicable).  In other words, the data provide no empirical evidence despite five significant results.  The reason is that selection bias introduces uncertainty about the true values and the data are too weak to reduce this uncertainty.

[Figure: z-curve of the five significant results reported by Krüger et al. (2014)]

The plot also shows visually how unlikely the pile of z-scores between 2 and 2.8 is. Given normal sampling error there should be some non-significant results and some highly significant (p < .005, z > 2.8) results.
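Readers who want to run this kind of analysis themselves can use the zcurve package on CRAN; the sketch below assumes that its zcurve() function accepts a vector of significant z-scores and that summary() and plot() methods are available (check the package documentation for the exact interface). With only five values the estimates will of course be extremely uncertain, as the wide confidence interval above shows.

# install.packages("zcurve")            # CRAN package implementing z-curve
library(zcurve)
z <- c(2.61, 2.73, 2.02, 2.25, 2.21)    # the five significant z-scores from above
fit <- zcurve(z)                        # fit z-curve to the significant results
summary(fit)                            # bias-corrected power estimates with CIs
plot(fit)                               # density plot similar to the figure above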

In conclusion, Krüger et al.’s multiple-study article cannot be used by Forster et al. as evidence that their findings have been replicated with credible evidence by independent researchers because the article contains no empirical evidence.

The evidence of low power in a multiple study article also shows a dissociation between Klaus Fiedler’s  verbal endorsement of the need to improve replicability as co-author of the Asendorpf et al. article and his actions as author of an incredible multiple-study article.

There is little excuse for the use of small samples in Krüger et al.’s set of five studies. Participants in all five studies were recruited from Mturk and it would have been easy to conduct more powerful and credible tests of the key hypotheses in the article. Whether these tests would have supported the predictions or not remains an open question.

Automated Analysis of Time Trends

It is very time consuming to analyze individual articles carefully. However, it is possible to use automated extraction of test statistics to examine time trends. I extracted test statistics from social psychology articles that included Klaus Fiedler as an author. All test statistics were converted into absolute z-scores as a common metric of the strength of evidence against the null-hypothesis. Because only significant results can be used as empirical support for predictions of an effect, I limited the analysis to significant results (z > 1.96). I computed the median z-score for each year and plotted it as a function of publication year.

[Figure: median significant z-score in Klaus Fiedler’s articles by publication year]

The plot shows a slight increase in strength of evidence (annual increase = 0.009 standard deviations), which is not statistically significant, t(16) = 0.30.  Visual inspection shows no notable increase after 2011 when the replication crisis started or 2013 when Klaus Fiedler co-authored the article on ways to improve psychological science.
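For readers who want to reproduce this kind of trend analysis, here is a sketch of the computation, assuming a data frame dat with one row per automatically extracted test statistic and columns year and z (the data frame and its column names are hypothetical):

sig <- dat[dat$z > 1.96, ]                       # keep only significant results
med <- aggregate(z ~ year, data = sig, median)   # median significant z per year
summary(lm(z ~ year, data = med))                # linear trend of yearly medians
                                                 # (the post reports a slope of
                                                 # 0.009, t(16) = 0.30)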

Given the lack of evidence for improvement,  I collapsed the data across years to examine the general replicability of Klaus Fiedler’s work.

[Figure: z-curve of all significant results automatically extracted from Klaus Fiedler’s articles]

The estimate of 73% replicability suggests that a randomly drawn published result from one of Klaus Fiedler’s articles has a 73% chance of being replicated if the study and analysis were repeated exactly. The 95% CI ranges from 68% to 77%, showing relatively high precision for this estimate. This is a respectable estimate that is consistent with the overall average for psychology and higher than the average for social psychology (Replicability Rankings). The average for some social psychologists is below 50%.

Despite this somewhat positive result, the graph also shows clear evidence of publication bias. The vertical red line at 1.96 indicates the boundary for significant results on the right and non-significant results on the left. Values between 1.65 and 1.96 are often published as marginally significant (p < .10) and interpreted as weak support for a hypothesis. Thus, the reporting of these results is not an indication of honest reporting of non-significant results.  Given the distribution of significant results, we would expect more (grey line) non-significant results than are actually reported.  The aim of reforms such as those recommended by Fiedler himself in the 2013 article is to reduce the bias in favor of significant results.

There is also clear evidence of heterogeneity in strength of evidence across studies. This is reflected in the average power estimates for different segments of z-scores.  Average power for z-scores between 2 and 2.5 is estimated to be only 45%, which also implies that after bias-correction the corresponding p-values are no longer significant because 50% power corresponds to p = .05.  Even z-scores between 2.5 and 3 average only 53% power.  All of the z-scores from the 2014 article are in the range between 2 and 2.8 (p < .05 & p > .005).  These results are unlikely to replicate.  However, other results show strong evidence and are likely to replicate. In fact, a study by Klaus Fiedler was successfully replicated in the OSC replication project.  This was a cognitive study with a within-subject design and a z-score of 3.54.

The next Figure shows the model fit for models with a fixed percentage of false positive results.

[Figure: z-curve model fit for models with fixed proportions of false positive results]

Model fit starts to deteriorate notably with false positive rates of 40% or more. This suggests that the majority of published results by Klaus Fiedler are true positives. However, selection for significance can inflate effect size estimates. Thus, observed effect size estimates should be adjusted.

Conclusion

In conclusion, it is easier to talk about improving replicability in psychological science, particularly experimental social psychology, than to actually implement good practices. Even prominent researchers like Klaus Fiedler feel a responsibility to their students to publish as much as possible. As long as reputation is measured in terms of the number of publications and citations, this will not change.

Fortunately, it is now possible to quantify replicability and to use these measures to reward research that requires more resources to provide replicable and credible evidence without the use of questionable research practices. Based on these metrics, the article by Krüger et al. is not the norm for publications by Klaus Fiedler, and Klaus Fiedler’s replicability index of 73 is higher than that of other prominent experimental social psychologists.

An easy way to improve it further would be to retract the weak T. Krüger et al. article. This would not be a costly retraction because the article has not been cited in Web of Science so far (no harm, no foul). In contrast, the Asendorpf et al. (2013) article has been cited 245 times and is Klaus Fiedler’s second most cited article in Web of Science.

The message is clear.  Psychology is not in the year 2010 anymore. The replicability revolution is changing psychology as we speak.  Before 2010, the norm was to treat all published significant results as credible evidence and nobody asked how stars were able to report predicted results in hundreds of studies. Those days are over. Nobody can look at a series of p-values of .02, .03, .049, .01, and .05 and be impressed by this string of statistically significant results.  Time to change the saying “publish or perish” to “publish real results or perish.”

 

Estimating Reproducibility of Psychology (No. 64): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that, out of 97 attempts to reproduce a significant result, only 36% succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies that ask why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to examine all of the studies in detail. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This makes it important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original studies. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 68, “The Effect of Global Versus Local Processing Styles on Assimilation Versus Contrast in Social Judgment,” is no ordinary article. The first author, Jens Forster, has been under investigation for scientific misconduct, and it is not clear whether the published results in some articles are based on real or fabricated data. Some articles that build on the same theory and used similar methods as this article have been retracted. Scientific fraud would be one reason why an original study cannot be replicated.

Summary of Original Article

The article uses the first author’s global/local processing style model (GLOMO) to examine assimilation and contrast effects in social judgment. It reports five experiments showing that processing styles elicited in one task can carry over to other tasks and influence social judgments.

Study 1 

This study was chosen for the replication project.

Participants were 88 students. Processing styles were manipulated by projecting a city map on a screen and asking participants either (a) to focus on the broader shape of the city or (b) to focus on specific details of the map. The study also included a control condition. This task was followed by a scrambled sentence task with neutral or aggressive words. The main dependent variable was aggression ratings in a person perception task.

With 88 participants and six conditions, there are n = 13 participants per condition.

The ANOVA results showed a highly significant interaction between the processing style and priming manipulations, F(2, 76) = 21.57, p < .0001.

We can think of the design as three priming experiments, one for each of the three processing style conditions.

The global condition shows a strong assimilation effect, (prime M = 6.53, SD =1.21; no prime M =  4.15, SD = 1.25), t(26) = 5.10, p = .000007, d = 1.94.

In the control processing condition, priming also produced an assimilation effect: aggression ratings were higher after aggression priming (M = 5.63, SD = 1.25) than after nonaggression priming (M = 4.29, SD = 1.23), t(25) = 2.79, p = .007, d = 1.08.

The local processing condition showed a significant contrast effect: aggression ratings were lower after aggression priming (M = 2.86, SD = 1.15) than after nonaggression priming (M = 4.62, SD = 1.16), t(25) = 3.96, p = .0005, d = -1.52.

Although the reported results appear to provide strong evidence, the extremely large effect sizes raise concern about the reported results.  After all, these are not the first studies that have examined priming effects on person perception.  The novel contribution was to demonstrate that these effects change (are moderated) as a function of processing styles.  What is surprising is that processing styles also appear to have magnified the typical effects without any theoretical explanation for this magnification.

The article was cited by Isbell, Rovenpor, and Lair (2016), who used the map manipulation in combination with a mood manipulation. Their article reports a significant interaction between processing and mood, F(1,73) = 6.33, p = .014. In the global condition, more abstract statements in an open-ended task were recorded in the angry mood condition, but the effect was not significant and much smaller than in Forster’s studies, F(1,73) = 3.21, p = .077, d = .55. In the local condition, sad participants listed more abstract statements, but again the effect was not significant and smaller than in Forster et al.’s studies, F(1,73) = 3.20, p = .078, d = .67. As noted before, these results are also questionable because it is unlikely to get p = .077 and p = .078 in two independent statistical tests.

In conclusion, the effect sizes reported by Forster et al. in Study 1 are unbelievable because they are much larger than can be expected.

Study 2

Study 2 was a replication and extension of a study by Mussweiler and Strack (2000). Participants were 124 students from the same population. This study used a standard processing style manipulation (Navon, 1977) that presented global letters composed of several smaller, different letters (e.g., a large E made up of several small n's). The main dependent variable was judgments of drug use. The design had two between-subject factors: 3 (processing style) x 2 (high vs. low comparison standard). Thus, there were about 20 to 21 participants per condition. The study also had a within-subject factor (subjective vs. objective rating).

The ANOVA shows a 3-way interaction, F(2, 118) = 5.51, p = .005.

Once more, the 3 x 2 design can be treated as 3 independent studies of comparison standards. Because subjective and objective ratings are not independent, I focus on the objective ratings that produced stronger effects.

In the global condition, the high standard produced higher reports of drug use than the low standard (M = 0.66, SD = 1.13 vs. M = -0.47, SD = 0.57), t(39) = 4.04, p = .0004, d = 1.26.

In the control condition, a similar pattern was observed but it was not significant (M = 0.07, SD = 0.79 vs. M = -0.45, SD = 0.98), t(39) = 1.87, p = .07, d = 0.58.

In the local condition, the pattern is reversed (M = -0.41, SD = 0.83 vs. M = 0.60, SD = 0.99), t(39) = 3.54, p = .001, d = -1.11.

As the basic paradigm was a replication of Mussweiler and Strack’s (2000) Study 4, it is possible to compare the effect sizes in this study with the effect size in the original study.   The effect size in the original study was d = .31; 95%CI = -0.24, 1.01.  The effect is not significant, but the interaction effect for objective and subjective judgments was, F(1,30) = 4.49, p = .04.  The effect size is comparable to the control condition, but the  effect sizes for the global and local processing conditions are unusually large.

Study 3

132 students from the same population took part in Study 3.  This study was another replication and extension of Mussweiler and Strack (2000).  In this study, participants made ratings of their athletic abilities.  The extension was to add a manipulation of time (imagine being in an athletic competition today or in one year).  The design was a 3 (temporal distance: distant future vs. near future vs. control) by 2 (high vs. low standard) BS design with objective vs. subjective ratings as a within factor.

The three-way interaction was significant, F(2, 120) = 4.51, p = .013.

In the distant future condition, objective ratings were higher with the high standard than with the low standard (high M = 0.56, SD = 1.04; low M = -0.58, SD = .51), t(41) = 4.56, p = .0001, d = 1.39.

In the control condition,  objective ratings of athletic ability were higher after the high standard than after the low standard (high M = 0.36, SD = 1.08; low M = -0.36, SD = 0.77), t(38) = 2.44, p = .02, d = 0.77.

In the near condition, the opposite pattern was reported (high M = -0.35, SD = 0.33, vs. low M = 0.36, SD = 1.29), t(41) = 2.53; p = .02,  d = -.75.

In the original study by Mussweiler and Strack the effect size was smaller and not significant (high M = 5.92, SD = 1.88; low M = 4.89, SD = 2.37),  t(34) =  1.44, p = .15, d = 0.48.

Once more the reported effect sizes by Forster et al. are surprisingly large.

Study 4

120 students from the same population participated in Study 4.  The main novel feature of Study 4 was the inclusion of a lexical decision task and the use of reaction times as the dependent variable.   It is important to realize that most of the variance in lexical decision tasks is random noise and fixed individual differences in reaction times.  This makes it difficult to observe large effects in between-subject comparisons and it is common to use within-subject designs to increase statistical power.  However, this study used a between-subject design.  The ANOVA showed the predicted four-way interaction, F(1,108) = 26.17.

The four-way interaction was explained by a three-way interaction for self-primes, F(1, 108) = 39.65, and no significant effects for control primes.

For moderately high standards, reaction times to athletic words were slower after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

For unathletic words, the reverse pattern was observed.

For moderately high standards, reaction times were faster after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

In sum, Study 4 reported reaction time differences as a function of global versus local processing styles that were surprisingly large.

Study 5

Participants in Study 5 were 128 students.  The main novel contribution of Study 5 was the inclusion of a line-bisection task that is supposed to measure asymmetries in brain activation.  The authors predicted that local processing induces more activation of the left-side of the brain and global processing induces more activation of the right side of the brain.  The comparisons of the local and global condition with the control condition showed the predicted mean differences, t(120) = 1.95, p = .053 (reported as p = .05) and t(120) = 2.60, p = .010.   Also as predicted, the line-bisection measure was a significant mediator, z = 2.24, p = .03.

The Replication Study 

The replication project called for a replication of the last study, but the replication team in the US found that this was impossible because the main outcome measure of Study 5 was alcohol consumption and drug use (just like Study 2), and pilot studies showed that incidence rates were much lower than in the German sample. Therefore, the authors replicated the aggression priming study (Study 1).

The focal test of the replication study was the interaction between processing condition and priming condition. As noted earlier, this interaction was very strong,  F(2, 76) = 21.57, p < .0001, and therefore seemingly easy to replicate.

Fortunately, the replication team dismissed the outcome of a post-hoc power analysis, which suggested that only 32 participants would be needed, and used the same sample size as the original study.

The processing manipulation was changed from a map of the German city of Oldenburg to a state map of South Carolina.  This map was provided by the original authors. The replication report emphasizes that “all changes were endorsed by the first author of the original study.”

The actual sample size was a bit smaller (N = 74), and after exclusion of three suspicious participants, the analyses were based on 71 participants (vs. 80 in the original study).

The ANOVA failed to replicate a significant interaction effect, F(2, 65) = .865, p = .426.

The replication study also included questions about the effectiveness of the processing style manipulation.  Only 32 participants indicated that they followed instructions.  Thus, one possible explanation for the replication failure is that the replication study did not successfully manipulate processing styles. However, the original study did not include a similar question and it is not clear why participants in the original study were more compliant.

More troublesome is that the replication study did not replicate the simple priming effect in the control condition or the global condition, which should have shown the effect with or without a successful manipulation of processing styles.

In the control condition, the mean was lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.27, SD = 1.29, neutral M = 7.00, SD = 1.30), t(22) = 1.38, p = .179, d = -.56.

In the global condition, the mean was also lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.38, SD = 1.75, neutral M = 7.23, SD = 1.46), t(22) = 1.29, p = .207, d = -.53.

In the local condition, the means were nearly identical (aggression M = 7.77, SD = 1.16, neutral M = 7.67, SD = 1.27), t(22) = 0.20, p = .842, d = .08.

The replication report points out that the priming task was introduced by Higgins, Rholes, and Jones (1977).   Careful reading of that article shows that it also did not find immediate effects of priming.  The study obtained target ratings immediately and again 10 to 14 days later.  The ANOVA showed an interaction with time that just missed the conventional significance criterion, F(1, 36) = 4.04, p = .052 (reported as p < .05).

“A further analysis of the above Valence x Time interaction indicated that the difference in evaluation under positive and negative conditions was small and nonsignificant on the immediate measure (M = .8 and .3 under positive and negative conditions, respectively), t(38)= 0.72, p > .25 two-tailed; but was substantial and significant on the delayed measure.” (Higgins et al., 1977).

Conclusion

There are serious concerns about the surprisingly strong effects in the original article by Forster et al. (2008).   Similarly strong results have raised concerns about data collected by Jens Forster. Although investigations have yielded no clear answers about the research practices that produced these results, some articles have been retracted (Retraction Watch).  Social priming effects have also proven to be difficult to replicate (R-Index).

The strong effects reported by Forster et al. are not the result of the typical questionable research practices that produce just-significant results.  As a consequence, statistical methods that predict replicability falsely predict that Forster's results would be easy to replicate; only actual replication studies or forensic analyses of the original data might reveal that the reported results are not trustworthy.  Thus, statistical predictions of replicability are likely to overestimate replicability because they do not detect all questionable practices or fraud.

Shit Social Psychologists Say: “Hitler had High Self-Esteem”

A popular Twitter account called “Shit Academics Say” often posts funny commentary on academia.
 
I borrow the phrase “Shit Academics Say” for this post about the shit social psychologists say with a sense of authority and superiority.  Social psychologists see themselves as psychological “scientists” who study people and therefore believe that they know people better than you or me. However, their claims are often not based on credible scientific evidence and are merely personal opinions disguised as science.
 
For example, a popular undergraduate psychology textbook claims that “Hitler had high self-esteem,” citing an article in the journal “Psychological Science in the Public Interest” that has been cited over 500 times (although the title suggests the journal is written for the general public, it is mostly read by psychologists, and the title mainly creates the illusion that they are doing important work that serves the public interest).
 
At the end of the article with the title “Does High Self-Esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?” the authors write: 
 
“High self-esteem feels good and fosters initiative. It may still prove a useful tool to promote success and virtue, but it should be clearly and explicitly linked to desirable behavior. After all, Hitler had very high self-esteem and plenty of initiative, too, but those were hardly guarantees of ethical behavior.”
 
In the textbook, this quote is linked to boys who engage in sex at an “inappropriately young age,” which is not further specified (in Canada this would be 14, according to recent statistics).
 
“High self-esteem does have some benefits—it fosters initiative, resilience, and pleasant feelings (Baumeister & others, 2003). Yet teen males who engage in sexual activity at an “inappropriately young age” tend to have higher than average self-esteem. So do teen gang leaders, extreme ethnocentrists, terrorists, and men in prison for committing violent crimes (Bushman & Baumeister, 2002; Dawes, 1994, 1998). “Hitler had very high self-esteem,” note Baumeister and co-authors (2003).”  (Myers, 2011, Social Psychology, 12th edition)
 
Undergraduate students pay (if they pay; hopefully they do not) $200 to be informed that people with high self-esteem are like sexual deviants, terrorists, violent criminals, and Hitler (maybe we should add scientists with flair to the list).
 
The problem is that this is not even true. Students who work with me on fact checking the textbook found this quote in the original article.
 
“There was no [!] significant difference in self-esteem scores between violent offenders and non-offenders, Ms = 28.90 and 28.89, respectively, t(7653) = 0.02, p > .9, d = 0.0001.”
[Technical detail you can skip: Although the df of the t-test look impressive, the study compared 63 violent offenders to 7,590 unmatched, mostly undergraduate participants (gender not specified, probably mostly female). So the sampling error of this study is high, and the theoretical importance of comparing these two groups is questionable.]
 
How Many Correct Citations Could be False Positives? 
Of course, the example above is an exception.  Most of the time a cited reference contains an empirical finding that is consistent with the textbook claim.  However, this does not mean that textbook findings are based on credible and replicable evidence.  Even a Nobel Laureate was conned by flashy findings in small samples that could not be replicated (Train Wreck: Fast & Slow).
Until recently it was common to assume that statistical significance ensures that most published results are true positives (i.e., not false positive random findings).  However, this is only the case if all results are reported. It has been known since 1959 that this is not the case in psychology (Sterling, 1959). Psychologists selectively publish only results that support their theories.  This practice disables the significance filter that is supposed to keep false positives out of the literature.  The claim that results published in social psychology journals were obtained with rigorous research (Crandall et al., 2018) is as bogus as Volkswagen’s Diesel tests, and the future of social psychology may be as bleak as the future of Diesel engines.
Jerry Brunner and I developed a statistical tool that can be used to clean up the existing literature. Rather than actually redoing 50 years of research, we use the statistical results reported in original studies to apply a significance filter post-hoc.  Our tool is called zcurve.   Below I used zcurve to examine the replicability of studies that were used in the chapter that also included the comparison of sexually active teenagers with violent criminals, terrorists, and Hitler.
[Figure: Z-curve of statistical results cited in the textbook's chapter on the self (Chapter2.Self)]
More detailed information about the interpretation of the graph above is provided elsewhere (link).  In short, for each citation in the textbook chapter that is used as evidence for a claim, a team of undergraduate students retrieved the cited article and extracted the main statistical result that matches the textbook claim.  These statistical results are then converted into a z-score that reflects the strength of evidence for a claim.  Only significant results are important because non-significant results cannot support an empirical claim (although sometimes non-significant results are falsely used to support claims that there is no effect).
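For readers who want to see what this conversion looks like, here is a minimal sketch in R (this is not the actual zcurve code, and the p-value is hypothetical):

```r
# Convert a reported two-sided p-value into a z-score (strength of evidence)
# and compute the corresponding "observed power" with alpha = .05.
p_to_z <- function(p) qnorm(1 - p / 2)
observed_power <- function(z, alpha = .05) pnorm(z - qnorm(1 - alpha / 2))
z <- p_to_z(.02)                                     # e.g., a study reporting p = .02
round(c(z = z, obs.power = observed_power(z)), 2)    # z ~ 2.33, observed power ~ .64
```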
Zcurve fits a model to the (density) distribution of significant z-scores (z-scores > 1.96).  The shape of the density distribution provides information about the probability that a randomly drawn study from the set would replicate (i.e., reproduce a significant result).  The grey line shows the distribution predicted by zcurve; it matches the observed density in dark blue well. Simulation studies show good performance of zcurve. Zcurve estimates that the average replicability of studies in this chapter is 56%. This number would be reassuring if all studies actually had 56% power.  It would mean that all studies are true positives and that slightly more than every other replication attempt would be successful.
However, reality does not match this rosy scenario.  In reality, studies vary in replicability.  Studies with z-scores greater than 5 have 99% replicability (see numbers below x-axis).  However, studies with just significant results (z < 2.5) have only 21% replicability.  As you can see, there are a lot more studies with z < 2.5 than studies with z > 5.  So there are more studies with low replicability than studies with high replicability.
The next plot shows model fit (higher numbers = worse fit) for zcurve models with a fixed proportion of false positives.  If the data are inconsistent with a fixed proportion of false positives, model fit decreases (higher numbers).
[Figure: Model fit for zcurve models with a fixed proportion of false positives (Chapter2.Self.model.fit)]
The graph shows that models with 100%, 90%, or 80% false positives clearly do not fit the data as well as models with fewer false positives.  This shows that at least some textbook claims are based on solid empirical evidence.   However, model fit for models with 0% to 60% false positives looks very similar.  Thus, it is possible that the majority of claims in the self chapter of this textbook are false positives.
It is even more problematic that textbook claims are often based on a single study with a student sample at one university.  Social psychologists have warned repeatedly that their findings are very sensitive to minute variations in studies, which makes it difficult to replicate these effects even under very similar conditions (Van Bavel et al., 2016), and that it is impossible to reproduce exactly the same experimental conditions (Stroebe and Strack, 2014).  Thus, the zcurve estimate of 56% replicability is a wildly optimistic estimate of replicability in actual replication studies. In fact, the average replicability of studies in social psychology is only 25% (Open Science Collaboration, 2015).
Conclusion
While social psychologists are currently outraged about a psychologist with too many self-citations, they are silent about the crimes against science committed by social psychologists who produced pseudo-scientific comparisons of sexually active teenagers with Hitler and questionable claims that high self-esteem is a sign of pathology. Maybe social psychologists should spend less time criticizing others and more time reflecting on their own errors.
In official statements and editorials, social psychologists are talking the talk.

SPSP recently published a statement on scientific progress which began “Science advances largely by correcting errors, and scientific progress involves learning from mistakes. By eliminating errors in methods and theories, we provide a stronger evidentiary basis for science that allows us to better describe events, predict what will happen, and solve problems” (SPSP Board of Directors, 2016).  [Cited from Crandall et al., 2018, PSPB, Editorial]

However, they are still not walking the walk.  Seven years ago, Simmons et al. (2011) published an article called “False-Positive Psychology” that shocked psychologists and raised concerns about the credibility of textbook findings. One year later, Nobel Laureate Daniel Kahneman wrote an open letter to star social psychologist John Bargh urging him to clean up social psychology. Nothing happened. Instead, John Bargh published a popular book in 2017 that does not mention any of the concerns about the replicability of social psychology in general or his work in particular.  Denial is no longer acceptable. It is time to walk the walk and to get rid of pseudo-science in journals and in textbooks.
Hey, it's spring. What better time to get started with a major house cleaning?

On Power Posing, Power Analysis, Publication Bias, Peer-Review, P-curve, Pepsi, Porsche, and why Psychologists Hate Zcurve

Yesterday, Forbes Magazine was happy to tell its readers that “Power Posing Is Back: Amy Cuddy Successfully Refutes Criticism.”   I am not blaming a journalist for repeating a false claim that has been published in the premier journal of the American Psychological Society (APS).  Why should a journalist be responsible for correcting all the errors that reviewers and the editor, who are trained psychological scientists, missed?  Rather, the Forbes article highlights that APS is more concerned about the image of psychological science than about the scientific validity of the data and methods that are used to distinguish opinions from scientific facts.

This blog post shows first of all that power posing researchers have used questionable research practices to produce way more significant results than the weak effects and small samples justify.  Second, it shows that the meta-analysis used a statistical method that is flawed and overestimates evidence in favor of power-posing effects.

Finally, I point out how a better statistical tool shows that the power-posing literature does not provide credible evidence for replicable effects and cannot be used to make bold claims about the effectiveness of power posing as a way to enhance confidence and performance in stressful situations.

Before I start, I need to make something very clear. I don't care about power posing or Amy Cuddy, and I have a track record of attacking powerful men with flair, not power-posing women (so far, I have been threatened with legal actions by 4 men and 1 woman). So, this post has nothing to do with gender.  The primary goal is to show problems with the statistical method that Cuddy and colleagues used. This is not their fault.  The method has been heavily advertised, although it has never been subjected to peer-review or published in a peer-reviewed journal.  On top of this, Jerry Brunner and I have tried to publish our criticism of this method since 2016, and journals have rejected our manuscript because the finding was considered of insufficient importance.  The main point of this blog post is to show that the p-curve meta-analysis by Cuddy et al. is problematic because they used p-curve (which was developed by three men, not a woman). If Cuddy et al. had used z-curve, their conclusions would have been different.
Before I start, I also need to declare a conflict of interest. I am the co-inventor of z-curve and if I am right that z-curve is a lot better than p-curve, it would enhance my reputation and I might benefit from this in the future (so far, I have only suffered from rejections). So, I am presenting my case with a clear bias in favor of zcurve. Whether my arguments are strong or weak, is for the reader to decide.
P-Curve of Power-Posing
The main p-curve analysis includes 53 studies. 11 studies had a non-significant result and were not used because p-curve (and z-curve) only use significant results.  The Figure shows the pcurve results that were reported in the article.
[Figure: P-curve results for the power-posing studies as reported in the article (Pcurve.Power.Posing)]
The key finding in this figure is the reported average power estimate of 44%, with a 90%CI ranging from 23% to 63%.
The figure also includes the redundant information that the 90%CI implies that we can safely reject the null-hypothesis of no effect (p < .0001), but not the hypothesis that average power is 33%. After all, 33% falls between the lower bound of 23% and the upper bound of 63% of the 90%CI (p > alpha = 1 - .90 = .10).
The p-curve histogram shows that p-values below .01 are the most frequent p-values compared to p-values for the other four segments in the histogram. This visual information is also redundant with the information that average power is above .05 because any set of studies with average power that is not close to alpha (5%) will show this pattern.
These results are used to support the claim that power-posing has real effects (see quotes below).
[Image: Quotes from the article summarizing its conclusions about power-posing effects (Conclusions.Power.Posing)]
Z-Curve and Power-Posing
Z-curve produces different results and leads to different conclusions.  The results that I present here were included in a manuscript that was submitted to a new journal by the American Psychological Society (APS) that also publishes Psychological Science, in which Cuddy’s p-curve analysis was reported.  One of the reviewers was a co-author of p-curve, who has just as much a conflict of interest in favor of p-curve as I have in favor of z-curve.  So a biased peer-review is partially responsible for the fact that my criticism of p-curve is published as a blog post and not (yet) in a peer-reviewed journal.  The zcurve plot below was included in the rejected manuscript. So, readers can make up their own mind whether the rejection was justified or not.  Based on this plot, I challenge two claims about power-posing that were made by Cuddy et al. based on their p-curve analysis.
Z-Curve Shows Clear Evidence that Questionable Research Practices were Used by Power-Posing Researchers
Cuddy et al. write “no p-curve analysis by either set of authors yielded results that were left skewed or that suggested that the existing evidence was p-hacked.”
This statement is true, but it overlooks the fact that p-curve is often unable to detect p-hacking and other deceptive practices like hiding studies that failed to provide evidence for power posing.   In contrast, z-curve makes it easy to see the effect of deceptive practices that are euphemistically called “questionable research practices.”
[Figure: Z-curve plot for the power-posing meta-analysis]
Zcurve makes it easy to detect the influence of QRPs because the z-curve plot (a) includes the non-significant results (to the left of 1.96), (b) distinguishes marginally significant results that are often used as weak but sufficient support for a hypothesis (z > 1.65 & z < 1.96), and (c) differentiates a lot more among the significant results that p-curve lumps into one category with p < .01 (z > 2.6).
The z-curve plot reveals that we are not looking at a healthy body of evidence. Zcurve projects the model prediction into the range of non-significant results. Although these projections are more uncertain because they are not based on actual data, the area below the projected curve is very large. In addition, the steep drop from 1.96 to 1.65 in the histogram provides clear evidence that questionable research practices eliminated non-significant results that we would expect from a distribution with a mode at 1.96.
The influence of questionable research practices is also implied by the z-curve estimate of 30% average power for significant results.  In contrast, the success rates are 79% if marginally significant results are excluded and 91% if they are included.  As power is a predictor of the probability of obtaining a significant result, it is implausible that a set of studies with an average power of 30% could produce 79% or 91% demonstrations that power posing works.  In the best case scenario, the average would be a fixed effect that is true for all studies (each study has 30% power). In this case, we expect to see more than 2 non-significant results for every significant result (30% successes and 70% failures), and with 42 significant results there should be roughly 98 non-significant results. Even counting the marginally significant ones, we see only 11 non-significant results.
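The size of the implied file drawer is easy to compute. A minimal sketch (using the fixed 30% power scenario described above; this is not part of the published meta-analysis):

```r
# Expected number of non-significant (unreported) studies if every study had
# the same true power and only significant results were published.
expected_failures <- function(k_sig, power) k_sig * (1 - power) / power
expected_failures(k_sig = 42, power = .30)   # ~ 98 expected failures; only 11 were reported
```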
Thus, the claim that there is no evidence for p-hacking in power-posing research is totally inconsistent with the evidence. Not surprisingly, the authors also do not use a validated test of bias, the magic index, to test whether magic was used to produce 91% at least marginally significant results with just 30% power (Schimmack, 2012; also Francis, 2012).
In conclusion, p-curve is not a suitable statistical tool to examine the influence of p-hacking and other deceptive practices.   Demonstrating that p-curve finds no evidence for these practices is like pointing out that a polar bear does not notice when it snows during hibernation. It is true, but totally irrelevant for the question of whether it is snowing or not.
Power Estimation
The main aim of p-curve is to examine whether significant results provide evidence (e.g., for power posing effects) even if p-hacking or other deceptive methods were used.  Initially this was done by means of significance tests. If p < .05, a set of studies was said to provide evidence for real effects, whereas p > .05 showed that there was insufficient evidence to reject the hypothesis that questionable research practices alone explain the significant results. In other words, the null-hypothesis is that p-hacking alone produced 44 significant results without any real power posing effects.
The problem with relying exclusively on p-values is that p-values are sensitive to both the effect size (how strong the effect is) and sampling error (how much error there is in the estimated effect size).  As sample sizes increase, it gets easier and easier to show that at least some studies contain some evidence, even if the evidence is weak. To address this concern, it is important to complement p-values with information about effect sizes, and this can easily be done with confidence intervals.  The p-curve estimate of 44% average power tells us that the average strength of evidence is moderate.  It is not close to 5%, which would indicate that all studies are false positives, and it is not close to 100%, which would show that all studies are likely to replicate in an exact replication study.
The 90% confidence interval suggests that power could be lower than 44%, but is unlikely to be lower than 23%.  At the upper end, the 63% bound also tells us that it is unlikely that the average study had more than 60% power.  Thus, power posing studies fall considerably short of the criterion for well-designed studies, namely 80% power (Cohen, 1988).
It is therefore important to distinguish between two meanings of strong evidence.  Cuddy et al. (2018) are justified in claiming that a bias-corrected estimate of 44% average power in a set of 44 significant studies provides strong evidence against the null-hypothesis that all studies are false positive results. However,  average power of 44% also shows that each study individually has low power to detect power posing effects.
Like p-curve, z-curve aims to estimate the average power of studies that are selected for significance.  The main advantage of z-curve is that it allows for variation in power across studies (heterogeneity).  This seems a plausible assumption for a meta-analysis that includes manipulation checks of power feelings as well as actual outcome measures like risk taking. Evidently, we would expect stronger effects on feelings that are induced by a manipulation aimed at changing feelings than on an outcome like performance in public speaking or risk taking.
[Figure: Z-curve plot for the power-posing meta-analysis (repeated from above)]
The zcurve plot (the same figure as above, shown here again so that you do not have to scroll back and forth) provides clear evidence of heterogeneity. Most z-scores pile up close to significance (all z-scores < 2.6 imply p-values greater than .01). However, there are three studies with strong evidence, and the range information shows that there is even one z-score (not shown) with a value above 6 (highest value = 7.23).
In our rejected manuscript, we showed with simulation studies that pcurve has problems with strong effects (high z-scores): pcurve estimates of average power increase a lot more than they should when a few studies with very strong evidence are added to a dataset.  This estimation bias explains the discrepancy between the pcurve estimate of 44% average power and the zcurve estimate of 30% average power.
As I already pointed out in the rejected article, the bad behavior of pcurve is evident when the four studies with strong evidence are excluded: the p-curve estimate drops from 44% to 13%.  Of course, the average should decrease when the strongest evidence is excluded, but a drop of 31 percentage points is not plausible when only four studies are excluded. Going in reverse, if 4 studies with 100% power were added to 40 studies with 13% power, the new average power would be ((40*.13) + (4*1))/44 = 21% average power, not 44% power.
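This reverse calculation can be checked in a single line of R (the mixture is hypothetical, chosen to mirror the numbers above):

```r
# Average power after adding 4 studies with 100% power to 40 studies with 13% power.
((40 * .13) + (4 * 1)) / 44   # ~ 0.21, i.e., about 21% average power, not 44%
```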
In conclusion, the 44% average and the 23% lower bound of the 90% (alpha = 10% type-I error probability) confidence interval reported by Cuddy et al. are inflated because they used a biased tool to estimate average power.   Z-curve provides lower estimates of average power and the lower bound of the 95%CI is only 13%.
Even 13% average power in 44 studies would provide strong evidence against the null-hypothesis that all 44 studies are false positives,  and 30% average power clearly means that these studies were not all p-hacked false positives.  Although this may be considered good news, if we have a cynical view of psychological scientists,  the null-hypothesis is also a very low bar to cross.  We can reject the hypothesis that 100% of power posing results are p-hacked false positives, but we can also reject the hypothesis that most studies were well-designed studies with 80% power, which would yield a minimum of 40% average power (50% of studies with 80% power yields an average of 40% just for the well powered studies).
Heterogeneity of Power in Power Posing Studies
The presence of heterogeneity in power across power posing studies also has implications for the interpretation of average power. An average of 30% power can be obtained in many different ways. It could be that all studies have 30% power. In this case, all 44 studies with significant results that used different manipulations or outcome variables would be true positives. The 30% power estimate would only tell us that the studies had low power and that reported effect sizes are considerably inflated, but all studies would be expected to replicate if they were repeated with larger samples to increase power. In other words, there would be no need to be concerned about a false positive psychology in which most published results are false positives. All results would be true positives.
In contrast to this rosy and delusional interpretation of averages, it is also possible that the average is the result of a mixture of false and true positives. In the most extreme case, an average of 30% power could be obtained if roughly 32 out of 44 studies (about three-quarters) were false positives with only 5% power, while all other studies had 100% power. Even this is only an estimate that depends on numerous assumptions, and the percentage of false positives could be higher or lower.
It is also not clear which of the significant results are false positives and which results would be replicable in larger samples with higher power.  So, an average of 30% power tells us only that some of the significant results are true positives, but it does not tell us which studies produced true positives with meaningful effect sizes.  Only studies with 80% power or more can be expected to replicate with only slightly attenuated effect sizes.  But which power posing studies had 80% power?  The average of 30% does not tell us this.
[Figure: Z-curve plot for the power-posing meta-analysis with average power estimates for different z-score ranges (zcurve.Power.Posing2)]
Observed power (or the corresponding z-score) is correlated with true power.  The correlation is too weak to use observed power as a reliable indicator of true power for a single study, but in a set of studies, higher z-scores are more likely to reflect higher levels of true power.  Z-curve uses this correlation to estimate average power for different regions in the set of significant studies.  These estimates are displayed below the x-axis.  For z-scores between 2 and 2.5 (roughly .05 and .01), average power is only 22%.  However, for z-scores above 4, average power is above 50%.  This finding suggests that a small subset of power posing studies is replicable in exact replication studies, whereas the majority of studies has low power and the outcome of replication studies, even with larger samples, is uncertain because some of these studies may be false positives.
Thus, modeling heterogeneity has the advantage that it becomes possible to examine variability in true power to some extent. If all studies had the same power, all segments would have the same estimate.  As heterogeneity increases, the true power of just-significant results (p < .05 & p > .01) decreases and the power of studies with strong evidence (p < .001) increases.   For power-posing, a few studies with strong evidence have a strong influence on average power.
Another novel feature of z-curve (that still needs to be validated with extensive simulation studies) is the ability to fit models that make assumptions about the percentage of false positive results.  It is not possible to estimate the actual percentage of false positives, but it is possible to see what the worst case scenario would be.  To do so, a new (beta) version of zcurve fits models with 0 to 100% false positives and tries to optimize prediction of the observed distribution of z-scores as much as possible. A model that makes unrealistic assumptions will not fit the data well.
The plot below shows that a model with 100% false positive results does not fit the data. This is already implied by the 95%CI of average power, which does not include 5%.  The novel contribution of this plot is to see up to what point a model with a maximum number of false positives can still fit the observed distribution.  The scree plot below suggests that models with up to 40% false positives fit the data about as well as a model with 0% false positives.  So, it is possible that there are no false positives, but it is also possible that there are up to 40% false positives.  In that case, 60% of studies would have about 50% power (.6 * .5 = .30) and 40% of studies would have 5% power, the probability that a false positive produces a significant result with alpha = 5% (.40 * .05 = .02).
[Figure: Model fit for z-curve models with 0% to 100% false positive results (true.null.percentage.plot)]
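The mixture arithmetic behind these scenarios can be written out directly. A minimal sketch (the proportions are the hypothetical scenarios discussed above, not estimates from the data):

```r
# Average power implied by a mixture of false positives (power = alpha = .05)
# and true effects with a given power.
avg_power <- function(prop_false, power_true, alpha = .05) {
  prop_false * alpha + (1 - prop_false) * power_true
}
avg_power(.40, .50)      # 40% false positives, rest with 50% power -> ~ .32
avg_power(32/44, 1.00)   # most extreme case: ~75% false positives  -> ~ .31
```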
In conclusion, it is misleading to interpret an average power of 30% as strong evidence if a set of studies is heterogeneous.  The reason is that different studies with different manipulations or outcome variables produced different results, and the average does not apply to any individual study.  In this way, using the average to draw inferences about individual studies is like stereotyping.  Just because a study was drawn from a set with an average power of 30% does not mean that this study has 30% power.  At the level of individual studies, most of these studies produced evidence for power posing with the help of luck and questionable research practices, and exact replication studies are unlikely to be successful.  Thus, any strong conclusions about power posing based on these studies are not supported by strong evidence.
Again, this is not a problem of Cuddy’s analysis. The abstract correctly reports the results of their p-curve analysis.
“Several p-curve analyses based on a systematic review of the current scientific literature on adopting expansive postures reveal strong evidential value for postural-feedback (i.e., power-posing) effects and particularly robust evidential value for effects on emotional and affective states (e.g., mood and evaluations, attitudes, and feelings about the self).” (Cuddy et al., Psychological Science, 2018)

The problem is that the p-curve analysis is misleading because it does not reveal the strong influence of questionable research practices, it overestimates average power, and it ignores heterogeneity in the strength of evidence across studies.

Peer-Review 
Prominent representatives of the American Psychological Society (I am proud not to be a member) have warned about the increasing influence of bloggers, who were unkindly called method terrorists.   APS wants you to believe that closed and anonymous peer-review works as a quality control mechanism and that bloggers are frustrated, second-rate scientists who are unable to publish in top journals.
The truth is that peer-review is not working.  Peer-review in academia works about as well as asking one cat to make sure that the other cat doesn't eat the chicken while you are still fixing the salad (enjoy your vegetarian dinner).
The main points about p-curve in general and the power-posing p-curve in particular were made in a manuscript that Jerry Brunner and I submitted to a new journal of APS that claims to represent Open Science and aims to improve scientific standards in psychological science.  Given the conflict of interest, I requested that the main author of p-curve should not be a reviewer.  The editor responded to this request by making another p-curve author a reviewer, and this reviewer submitted a review that ignored major aspects of our criticism of p-curve (including simulation studies that prove our point) and objected to my criticism of the p-curve power posing meta-analysis.  The manuscript was rejected without an opportunity to respond to misleading reviewer comments.  The main reason for the rejection was that there was insufficient interest in p-curve or z-curve, while at the same time another APS journal had accepted the p-curve paper by Cuddy that is now cited as strong evidence for power posing effects.
Whether this was the right decision or not depends of course on the strength of the arguments that I presented here.  As I said, I can only believe that they are strong because I wouldn’t be writing this blog post if I thought they were weak.  So, I can only draw (possibly false) inferences about peer-review and APS based on the assumption that I presented strong arguments.  Based on this assumption, I feel justified in returning the favor for being branded a method terrorist and for being called out in another APS journal as a hateful blogger.
In response to the reaction by APS representatives to z-curve, I feel justified in calling some areas of psychology, mostly experimental social psychology, which I have examined thoroughly, a failed science, and APS (and APA) Errorist Organizations (not the t-word that APS representatives used to label critics like me) with no interest in examining the errors that psychological science has made.  I also believe that the rejection of manuscripts that show the validity of zcurve can be explained by fear of what this method may reveal.  Just like professional athletes who use performance enhancing substances are afraid of doping tests, scientists who used questionable research methods feel uneasy when a statistical method can reveal these practices.  Even if they no longer use doping these days, their past published work is akin to frozen urine samples that reveal massive doping at a time when doping tests were unable to detect these drugs. Although fear of the truth is just one possible explanation, I find it difficult to come up with alternative explanations for dismissing a method that can examine the credibility and replicability of published findings as uninteresting and irrelevant.
Pepsi and Porsche 
P-Curve and Z-Curve have the same objective (there is also an effect-size p-curve, but I am focusing on power here).  They both aim to estimate the average power of a set of studies that were selected for significance.  When average power is low (which also implies low heterogeneity), both methods produce similar results, and in some simulations p-curve performs slightly better (as we demonstrated in our own simulations).  So, one could think about pcurve and zcurve as two very similar products, like Pepsi and Coke.  Not exactly the same, but similar enough.  Competition between pcurve and zcurve would be mostly limited to marketing (pcurve has an online app, zcurve does not – yet).
However, I hope that I made some convincing arguments for why pcurve and zcurve are more like a car and a Porsche (made in Germany).  They both get you where you want to go most of the time, but a Porsche offers you a lot more.  Zcurve is the Porsche in this analogy, but it is also free (a free online app will be available soon).
Conclusion
My conclusion is that Zcurve is a great tool that makes it possible to examine the credibility of published results.  The tool can be applied to any set of studies, whether they are studies of a specific topic or a heterogeneous set of studies published in a journal.  It can even be used to estimate the replicability of psychology as a whole, based on thousands of articles and over a million test statistics, and it can reveal whether recent initiatives for rescuing psychological science are actually having an effect on the credibility and replicability of published results.
Whether this conclusion is right or wrong is not for me to decide.  This decision will be made by the process called science, but for this process to work, the arguments and the evidence need to be examined by scientists.  APS and APA made it clear that they do not want this to happen in their peer-reviewed, for-pay journals, but that will not stop me from exposing zcurve and my reputation to open and harsh criticism, and this blog post and many others on this site allow me to do this.
As always comments are welcome.

Open Discussion Forum: [67] P-curve Handles Heterogeneity Just Fine

UPDATE 3/27/2018:  Here is R-code to see how z-curve and p-curve work and to run the simulations used by Datacolada and to try other ones.  (R-Code download)

Introduction

The blog Datacolada is a joint blog by Uri Simonsohn, Leif Nelson, and Joe Simmons.  Like this blog, Datacolada blogs about statistics and research methods in the social sciences with a focus on controversial issues in psychology.  Unlike this blog, Datacolada does not have a comments section.  However, this shouldn't stop researchers from critically examining the content of Datacolada.  As I do have a comments section, I will first voice my concerns about blog post [67] and then open the discussion to anybody who cares about estimating the average power of studies that reported a “discovery” in a psychology journal.

Background: 

Estimating power is easy when all studies are honestly reported.  In this ideal world, average power can be estimated from the percentage of significant results or from the median observed power (Schimmack, 2015).  However, in reality not all studies are published, and researchers use questionable research practices that inflate success rates and observed power.  Currently, two methods promise to correct for these problems and to provide estimates of the average power of studies that yielded a significant result.
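
Both of these simple estimators are easy to compute when reporting is honest. A minimal sketch in R (the z-scores below are made up for illustration):

```r
# With honest reporting, average power can be gauged from the success rate or
# from the median observed power of a set of studies.
z <- c(0.8, 1.5, 1.9, 2.1, 2.4, 2.6, 3.2, 3.9)   # hypothetical z-scores
mean(abs(z) > qnorm(.975))                        # percentage of significant results
pnorm(median(abs(z)) - qnorm(.975))               # median observed power
```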

Uri Simonsohn's P-Curve has been in the public domain in the form of an app since January 2015.  Z-Curve has been used in blog posts since June 2015 to critique published studies and individual authors for low power.  Neither method has the stamp of approval of peer-review.  P-Curve has been developed from Version 3.0 to Version 4.6 without presenting any simulations showing that the method works. It is simply assumed that the method works because it is built on a peer-reviewed method for the estimation of effect sizes.  Jerry Brunner and I have developed four methods for the estimation of average power in a set of studies selected for significance, including z-curve and our own version of p-curve that estimates power rather than effect sizes.

We have carried out extensive simulation studies and asked numerous journals to examine the validity of our simulation results.  We also posted our results in a blog post and asked for comments.  The fact that our work is still not published in 2018 does not reflect problems with our results. The reasons for rejection were mostly that it is supposedly not relevant to estimate the average power of studies that have already been published.

Respondents to an informal poll in the Psychological Methods Discussion Group mostly disagree and so do we.

[Figure: Results of an informal poll in the Psychological Methods Discussion Group (Power.Estimation.Poll)]

There are numerous examples on this blog that show how this method can be used to predict that major replication efforts will fail (ego-depletion replicability report) or to show that claims about the way people (that is, you and I) think in a popular book for a general audience (Thinking: Fast and Slow) by a Nobel Laureate are based on results that were obtained with deceptive research practices.

The author, Daniel Kahneman, was as dismayed as I am by the realization that many published findings that are supposed to enlighten us have provided false facts and he graciously acknowledged this.

“I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited. What the blog gets absolutely right is that I placed too much faith in underpowered studies.” (Daniel Kahneman).

It is time to ensure that methods like p-curve and z-curve are vetted by independent statistical experts.  The traditional way of closed peer review has failed, in journals that need to reject good work because for-profit publishers and organizations like APS need to earn money from selling print copies of their journals.

Therefore we ask statisticians and methodologists from any discipline that uses significance testing to draw inferences from empirical studies to examine the claims in our manuscript and to help us to correct any errors.  If p-curve is the better tool for the job, so be it.

It is unfortunate that the comparison of p-curve and z-curve has become a public battle. In an idealistic world, scientists would not be attached to their ideas and would resolve conflicts in a calm exchange of arguments.  What better field to reach consensus than math or statistics where a true answer exists and can be revealed by means of mathematical proof or simulation studies.

However, the real world does not match the ideal world of science.  Just like Uri Simonsohn is proud of p-curve, I am proud of z-curve and I want z-curve to do better.  This explains why my attempt to resolve this conflict in private failed (see email exchange).

The main outcome of the failed attempt to find agreement in private was that Uri Simonsohn posted a blog on Datacolada with the bold claim “P-Curve Handles Heterogeneity Just Fine,” which contradicts the claims that Jerry and I made in the manuscript that I sent him before we submitted it for publication. So, not only did the private communication fail; our attempt to resolve the disagreement resulted in an open blog post that contradicted our claims.  A few months later, this blog post was cited by the editor of our manuscript as a minor reason for rejecting our comparison of p-curve and z-curve.

“Just to be clear, I know that the datacolada post that Nelson cites was posted after your paper was submitted and I'm not factoring your paper's failure to anticipate it into my decision (after all, Bem was wrong).” (Dan Simons, Editor of AMMPS)

Please remember,  I shared a document and R-Code with simulations that document the behavior of p-curve.  I had a very long email exchange with Uri Simonsohn in which I asked him to comment on our simulation results, which he never did.   Instead, he wrote his own simulations to convince himself that p-curve works.

The tweet below shows that Uri is aware of the problem that statisticians can use statistical tricks, p-hacking, to make their methods look better than they are.

[Image: Tweet by Uri Simonsohn (Uri.Tweet.Phacking)]

I will now demonstrate that Uri p-hacked his simulations to make p-curve look better than it is and to hide the fact that z-curve is the better tool for the job.

Critical Examination of Uri Simonsohn’s Simulation Studies

On the blog, Uri Simonsohn shows the Figure below, which was based on an example that I provided during our email exchange.   The Figure shows the simulated distribution of true power.  It also shows that the mean true power is 61%, whereas the p-curve estimate is 79%.  Uri Simonsohn does not show the z-curve estimate.  He also does not show what the distribution of observed t-values looks like. This is important because few readers are familiar with histograms of power and with the fact that it is normal for power to pile up at 1, because 1 is the upper limit for power.

 

[Figure 1: Simulated distribution of true power; mean true power = 61%, p-curve estimate = 79% (datacolada67.png)]

I used the R-Code posted on the Datacolada website to provide additional information about this example. Before I show the results, it is important to point out that Uri Simonsohn works with a different selection model than Jerry and I do.   We verified that this has no implications for the performance of p-curve or z-curve, but it does have implications for the distribution of true power that we would expect in real data.

Selection for Significance 1:   Jerry and I work with a simple model in which researchers conduct studies, test for significance, and then publish the significant results.  They may also publish the non-significant results, but these cannot be used to claim a discovery (of course, we can debate whether a significant result implies a discovery, but that is irrelevant here).   We use z-curve to estimate the average power of those studies that produced a significant result.  As power is the probability of obtaining a significant result, the average true power of significant results predicts the success rate in a set of exact replication studies. Therefore, we call this estimate an estimate of replicability.

Selection for Significance 2:   The Datacolada team famously coined the term p-hacking.  P-hacking refers to the massive use of questionable research practices in order to produce statistically significant results.   In an influential article, they created the impression that p-hacking allows researchers to obtain statistical significance in pretty much every study without a real effect (i.e., a false positive).  If this were the case, researchers would not have failed studies hidden away, as our selection model implies.
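
To make the difference between the two selection models concrete, here is a minimal sketch (not the code from our manuscript; the power distribution is hypothetical):

```r
# Contrast the two selection models for a set of studies with heterogeneous power.
set.seed(123)
k <- 1e5
z_crit     <- qnorm(.975)                    # 1.96
true_power <- .05 + rbeta(k, 2, 3) * .95     # hypothetical distribution of true power
ncp        <- qnorm(true_power) + z_crit     # non-centrality implied by true power
z_obs      <- rnorm(k, mean = ncp)           # one observed z-score per study

# Selection for Significance 1: run each study once, publish only if significant.
sig <- z_obs > z_crit
mean(true_power[sig])   # average true power (replicability) of the published results

# Selection for Significance 2 (p-hacking): every study is eventually made
# significant, so the published set keeps the power distribution of all studies.
mean(true_power)        # lower, because low-powered studies are not filtered out
```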

No File Drawers: Another Unsupported Claim by Datacolada

In the 2018 volume of Annual Review of Psychology (edited by Susan Fiske), the Datacolada team explicitly claims that psychology researchers do not have file drawers of failed studies.

There is an old, popular, and simple explanation for this paradox. Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes. 

They provide no evidence for this claim and ignore evidence to the contrary.  For example,  Bem (2011) pointed out that it is a common practice in experimental social psychology to conduct small studies so that failed studies can be dismissed as “pilot studies.”    In addition, some famous social psychologists have stated explicitly that they have a file drawer of studies that did not work.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

In response to replication failures,  Kathleen Vohs acknowledged that a couple of studies with non-significant results were excluded from the manuscript submitted for publication that was published with only significant results.

(2) With regard to unreported studies, the authors conducted two additional money priming studies that showed no effects, the details of which were shared with us.
(quote from Rohrer et al., 2015, who failed to replicate Vohs’s findings; see also Vadillo et al., 2016.)

Dan Gilbert and Timothy Wilson acknowledged that they did not publish non-significant results that they considered to be uninformative.

“First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the well-known fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing.  Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized.”  (Gilbert and Wilson).

Bias analyses show some problems with the evidence for stereotype threat effects.  In a radio interview, Michael Inzlicht reported that he had several failed studies that were not submitted for publication, and he is now skeptical about the entire stereotype threat literature (conflict of interest: Mickey Inzlicht is a friend and colleague of mine who remains the only social psychologist to have published a critical self-analysis of his own pre-2011 work and who is actively involved in reforming research practices in social psychology).

Steve Spencer also acknowledged that he has a file drawer with unsuccessful studies.  In 2016, he promised to open his file-drawer and  make the results available. 

By the end of the year, I will certainly make my whole file drawer available for any one who wants to see it. Despite disagreeing with some of the specifics of what Uli says and certainly with his tone I would welcome everyone else who studies stereotype threat to make their whole file drawer available as well.

Nearly two years later, he hasn’t followed through on this promise (how big can it be? LOL). 

Although this anecdotal evidence makes it clear that researchers have file drawers with non-significant results,  it remains unclear how large file-drawers are and how often researchers p-hacked null-effects to significance (creating false positive results).

The Influence of the Selection Model on the Distribution of True Power and Observed Test Statistics

Z-Curve, but not p-curve, can address this question to some extent because p-hacking influences the probability that a low-powered study will be published.  A simple selection model with alpha = .05 implies that only 1 out of 20 false positive results produces a significant result and will be included in the set of studies with significant results.  In contrast, extreme p-hacking implies that every false positive result (20 out of 20) will be included in the set of studies with significant results.

To illustrate the implications of selection for significance versus p-hacking, it is instructive to examine the distribution of observed significant results based on the simulated distribution of true power in Figure 1.

Figure 2 shows the distribution assuming that all studies are p-hacked to significance. P-hacking can influence the observed distribution, but I am assuming a simple p-hacking model that is statistically equivalent to optional stopping with small samples: just keep repeating the experiment (with minor variations that do not influence power, to deceive yourself that you are not p-hacking) and stop when you have a significant result.

[Figure 2: Distribution of observed t-values when all studies are p-hacked to significance (t.sig)]

 

The histogram of t-values looks very similar to a histogram of z-scores because t-values with df = 98 are approximately normally distributed.  As all studies were p-hacked, all studies are significant, with qt(.975, 98) = 1.98 as the criterion value.  However, some studies have strong evidence against the null-hypothesis, with t-values greater than 6.  The huge pile of t-values just above the criterion value of 1.98 occurs because all low-powered studies became significant.
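
A minimal sketch of this simple p-hacking model (hypothetical effect size; n = 50 per group so that df = 98, as in the figure):

```r
# Simple p-hacking model: repeat the two-group experiment until it "works" and
# keep only the significant t-value.
set.seed(42)
phack_once <- function(d, n = 50, alpha = .05) {
  crit <- qt(1 - alpha / 2, df = 2 * n - 2)
  repeat {
    t <- t.test(rnorm(n, mean = d), rnorm(n, mean = 0), var.equal = TRUE)$statistic
    if (t > crit) return(as.numeric(t))   # only the significant attempt is published
  }
}
t_sig <- replicate(500, phack_once(d = 0.3))
mean(t_sig < 2.5)   # a large share of t-values piles up just above the critical value
```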

The distribution in Figure 3 looks different than the distribution in Figure 2.

[Figure 3: Distribution of all observed t-values without p-hacking, including non-significant results (no.sel.t.png)]

Now there are numerous non-significant results and even a few significant results with the opposite sign of the true effect (t < -1.98).   For the estimation of replicability only the results that reached significance are relevant, if only for the reason that they are the only results that are published (success rates in psychology are above 90%; Sterling, 1959, see also real data later on).   To compare the distributions it is more instructive to select only the significant results in Figure 3 and to compare the densities in Figures 2 and 3.

[Figure 4: Densities of significant t-values under p-hacking (red) versus selection for significance (blue) (phack.vs.pubBias)]

The graph in Figure 4 shows that p-hacking produces more just-significant results with t-values between 2 and 2.5 than mere publication bias does. The reason is that the significance filter of alpha = .05 eliminates most false positives and low-powered true effects. As a result, the true power of studies that produced significant results is higher in the set of studies that were selected for significance.  The average true power of the honest significant results without p-hacking is 80%, whereas the average power of the p-hacked studies (in red) is 61%, the mean true power of all studies shown in Figure 1.

With real data, the distribution of true power is unknown. Thus, it is unknown how much p-hacking occurred.  For the reader of a journal that reports only significant results, it is also irrelevant whether p-hacking occurred.  A result may be reported because a single hypothesis was tested in 10 variations of the same study or because 10 conceptual replication studies produced 1 significant result.  In either scenario, the reported significant result provides weak evidence for an effect if the significant result was obtained with low power.

It is also important to realize (and it took Jerry and me some time to convince ourselves with simulations that this is actually true) that p-curve and z-curve estimates do not depend on the selection mechanism. The only information that matters is the true power of the studies, not how the studies were selected.  To illustrate this fact, I also used p-curve and z-curve to estimate the average power of the t-values without p-hacking (blue distribution in Figure 4).   P-Curve again overestimates true power.  While average true power is 80%, the p-curve estimate is 94%.

In conclusion,  the datacolada blog post did present one out of several examples that I provided and that were included in the manuscript that I shared with Uri.  The Datacolada post correctly showed that z-curve provides good estimates of the average true power and that p-curve produces inflated estimates.

I elaborated on this example by pointing out the distinction between p-hacking (all studies become significant) and selection for significance (e.g., due to publication bias or when assessing the replicability of published results).  I showed that z-curve produces correct estimates with and without p-hacking because the selection process does not matter.  The only consequence of p-hacking is that more low-powered studies become significant, because p-hacking undermines the function of the significance filter, which is to prevent studies with weak evidence from entering the literature.

In conclusion, the actual blog post shows that p-curve can be severely biased when data are heterogeneous, which contradicts the title that P-Curve handles heterogeneity just fine.

When The Shoe Doesn't Fit, Cut Off Your Toes

To rescue p-curve and to justify the title, Uri Simonsohn suggests that the example I provided is unrealistic and that p-curve performs as well as or better than z-curve in simulations that are more realistic.  He does not mention that I also provided real-world examples in my article that showed better performance of z-curve with real data.

So, the real issue is not whether p-curve handles heterogeneity well (it does not). The real issue is how much heterogeneity we should expect.

Figure 5 shows what Uri Simonsohn considers to be realistic data. The distribution of true power uses the same beta distribution as the distribution in Figure 1, but instead of scaling it from the lowest possible value (alpha = 5%) to the highest possible value (just below 1), it scales power from alpha to a maximum of 80%.  For readers less familiar with power, a value of 80% implies that researchers deliberately plan studies with a 20% probability of ending up with a false negative result (i.e., the effect exists, but the evidence is not strong enough, p > .05).

[Figure 5: Simulated distribution of true power truncated at 80% (datacolada67.fig2.png)]

The labeling in the graph implies that studies with more than the recommended 80% power, including 81% power, are considered to have extremely high power (again, with a 20% risk of a false negative result).   The graph also shows that p-curve provided an unbiased estimate of true average power despite (extreme) heterogeneity in true power between 5% and 80%.

[Figure 6: Distribution of observed t-values when all studies in Figure 5 are p-hacked to significance (t.sig.2)]

Figure 6 shows the histogram of observed t-values based on a simulation in which all studies in Figure 5 are p-hacked to significance.  As p-hacking inflates all t-values to meet the minimum value of 1.98, and truncation of power at 80% removes high t-values, 92% of t-values fall within the limited range from 1.98 to 4.  A crude measure of heterogeneity is the variance of the t-values, which is 0.51.  With N = 100, a t-distribution is just a little bit wider than the standard normal distribution, which has a standard deviation of 1. Thus, the small variance of 0.51 indicates that these data have low variability.

The histogram of observed t-values and their variance make it possible to quantify heterogeneity in true power.  In Figure 2, heterogeneity was high (Var(t) = 1.56) and p-curve overestimated average true power.  In Figure 6, heterogeneity is low (Var(t) = 0.51) and p-curve provided accurate estimates.  This suggests that estimation bias in p-curve is linked to the distribution and variance of the observed t-values, which reflect the distribution and variance of true power.

When the data are not simulated, test statistics can come from different tests with different degrees of freedom.  In this case, it is necessary to convert all test statistics into z-scores so that the strength of evidence is measured in a common metric.  In our manuscript, we used the variance of these z-scores to quantify heterogeneity and showed that p-curve overestimates average power when heterogeneity is high.

In conclusion, Uri Simonsohn demonstrated that p-curve can produce accurate estimates when the range of true power is arbitrarily limited to values below 80%.  He suggests that this restriction is reasonable because more than 80% power is extremely high power that is rarely achieved.

Thus, there is no disagreement between Uri Simonsohn and us about the statistical performance of p-curve and z-curve: p-curve overestimates average power when true power is not truncated at 80%.  The only disagreement concerns the amount of actual variability in real data.

What is realistic?

Jerry and I are both big fans of Jacob Cohen, who made invaluable contributions to psychology as a science, including his attempt to introduce psychologists to Neyman and Pearson's approach to statistical inference, which avoids many of the problems of Fisher's approach that still dominates statistics training in psychology.

The concept of statistical power requires researchers to formulate an alternative hypothesis, which in turn requires specifying an expected effect size.  To facilitate this task, Cohen developed standardized effect sizes. For example, Cohen's d standardizes a mean difference (e.g., the height difference between men and women in centimeters) by dividing it by the standard deviation.  As a result, the effect size is independent of the unit of measurement and is expressed as a percentage of a standard deviation.  Cohen also provided rough guidelines about the effect sizes one can expect in psychology.

It is now widely accepted that most effect sizes are in the range between 0 and 1 standard deviation.  It is common to refer to effect sizes of d = .2 (20% of a standard deviation) as small, d = .5 as medium, and d = .8 as large.

True power is a function of effect size and sampling error.  In a between-subject study, sampling error is a function of sample size, and most sample sizes in between-subject designs fall into the range from 40 to 200 participants, although sample sizes have increased somewhat in response to the replication crisis.  With N = 40 to 200, the sampling error of d ranges from 0.14 (2/sqrt(200)) to 0.32 (2/sqrt(40)).

The non-central t-values are simply the ratio of the standardized effect size to the sampling error of the standardized measure.  At the lowest end, an effect size of 0 yields a non-central t-value of 0 (0/.14 = 0; 0/.32 = 0).  At the upper end, a large effect size of .8 obtained in the largest sample (N = 200) yields a non-central t-value of .8/.14 = 5.71.   While non-central t-values smaller than 0 are not possible, larger non-central t-values can occur in some studies, either because the effect size is very large or because the sampling error is smaller.  Smaller sampling errors are especially likely when studies use covariates, within-subject designs, or one-sample t-tests.  For example, a moderate effect size (d = .5) in a within-subject design in which 90% of the error variance is fixed (r = .9) yields a non-central t-value of about 11.

A simple way to simulate data that are consistent with these well-known properties of results in psychology is to assume that the average effect size is half a standard deviation (d = .5) and to model variability in true effect sizes with a normal distribution with a standard deviation of SD = .2.  Accordingly, 95% of effect sizes would fall into the range from d = .1 to d = .9.  Sample sizes can be modeled with a simple uniform distribution (equal probability) from N = 40 to 200.

[Figure: distribution of non-centrality parameters implied by this simulation (ncp.png)]

Converting the non-centrality parameters to power (alpha = .05) shows that many values fall into the region from .80 to 1 that Uri Simonsohn called extremely high power.  The graph shows that it does not require extremely large effect sizes (d > 1) or large samples (N > 200) to conduct studies with 80% power or more.   Of course, the percentage of studies with at least 80% power depends on the distribution of effect sizes, but it seems questionable to assume that studies rarely have 80% power.

[Figure: distribution of true power implied by this simulation (power.d.png)]

The mean true power is 66% (I guess you see where this is going).

 

[Figure: distribution of observed t-values in this simulation (t.sig.d.png)]

This is the distribution of the observed t-values.  The variance is 1.21 and 23% of the t-values are greater than 4.   The z-curve estimate is 66% and the p-curve estimate is 83%.

In conclusion, a simulation that starts with what is known about effect sizes and sample sizes in psychological research shows that it is misleading to call 80% power or more extremely high power that is rarely achieved in actual studies.  Real datasets are likely to include studies with more than 80% power, and this will lead p-curve to overestimate average power.

A comparison of P-Curve and Z-Curve with Real Data

The point of fitting p-curve and z-curve to real data is not to validate the methods.  The methods have been validated in simulation studies that show good performance of z-curve and poor performance of p-curve when heterogeneity is high.

The only question that remains is how biased p-curve is with real data.  Of course, this depends on the nature of the data.  It is therefore important to remember that the Datacolada team proposed p-curve as an alternative to the approach used in Cohen's (1962) seminal study of power, which examined the 1960 issue of the Journal of Abnormal and Social Psychology.

“Estimating the publication-bias corrected estimate of the average power of a set of studies can be useful for at least two purposes. First, many scientists are intrinsically interested in assessing the statistical power of published research (see e.g., Button et al., 2013; Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989).”

There have been two recent attempts at estimating the replicability of results in psychology.   One project conducted 100 actual replication studies (Open Science Collaboration, 2015).  A more recent project examined the replicability of social psychology using a larger set of studies and statistical methods to assess replicability (Motyl et al., 2017).

The authors sampled articles from four journals (the Journal of Personality and Social Psychology, Personality and Social Psychology Bulletin, Journal of Experimental Psychology, and Psychological Science) and four years (2003, 2004, 2013, and 2014).  They randomly sampled 543 articles containing 1,505 studies. For each study, a coding team picked one statistical test of the main hypothesis.  The authors converted the test statistics into z-scores and showed separate histograms for 2003-2004 and 2013-2014 to examine changes over time; the results were similar.

[Figure: histograms of z-scores from Motyl et al. (2017) (Motyl.zcurve)]

The histograms show clear evidence that non-significant results are missing, either due to p-hacking or publication bias.  The authors did not use p-curve or z-curve to estimate average true power, so I used their data to examine the performance of z-curve and p-curve.  I selected only tests that were coded as ANOVAs (k = 751) or t-tests (k = 232).  Furthermore, I excluded cases with very large test statistics (> 100) and cases with experimenter (numerator) degrees of freedom of 10 or more. For participant (denominator) degrees of freedom, I excluded values below 10 and above 1000.  This left 889 test statistics, which were converted into z-scores.  The variance of the significant z-scores was 2.56.  However, this is due to a long tail of z-scores with a maximum value of 18.02; the variance of z-scores between 1.96 and 6 was 0.83.

[Figure: z-curve fit to the significant z-scores from Motyl et al. (2017) (Motyl.zcurve.png)]

Fitting z-curve to all significant z-scores yielded an estimate of 45% average true power.  The p-curve estimate was 78% (90% CI = 75% to 81%).  This finding is not surprising given the simulation results and the variance of the z-scores in the Motyl et al. data.

One possible solution to this problem is to modify p-curve in the same way: like z-curve, model only z-scores between 1.96 and 6 and treat all z-scores of 6 or more as having power = 1.   The estimate is then adjusted for the proportion of extreme z-scores:

average.true.power = z-curve.estimate * (1 – extreme) + extreme

Using the same approach with p-curve helps to reduce the bias in the p-curve estimate, but p-curve still produces a much higher estimate than z-curve, namely 63% (90% CI = 58% to 67%).  This is still nearly 20 percentage points higher than the z-curve estimate.

In response to these results, Leif Nelson argued that the problem is not with p-curve, but with the Motyl et al. data.

“They attempt to demonstrate the validity of the Z-curve with three sets of clearly invalid data.”

One of these “clearly invalid” datasets is the Motyl et al. data.  Nelson's claim is based on another Datacolada blog post about the Motyl et al. study with the title

[60] Forthcoming in JPSP: A Non-Diagnostic Audit of Psychological Research

A detailed examination of Datacolada 60 will be the subject of another open discussion about Datacolada.  Here it is sufficient to point out that Nelson's strong claim that Motyl et al.'s data are “clearly invalid” is not based on empirical evidence. It is based on disagreement about the coding of 10 out of over 1,500 tests (0.67%).  Moreover, it is wrong to label these disagreements mistakes because there is no right or wrong way to pick one test from a set of tests.

In conclusion, the Datacolada team has provided no evidence to support their claim that my simulations are unrealistic.  In contrast, I have demonstrated that their truncated simulation does not match reality.  Their only remaining defense is that I cherry-picked data that make z-curve look good.  However, a simulation with realistic assumptions about effect sizes and sample sizes also shows large heterogeneity, and p-curve again fails to provide reasonable estimates.

The fact that p-curve is sometimes unbiased is not particularly important because z-curve provides practically useful estimates in these scenarios as well. So the choice is between one method that gets it right sometimes and another method that gets it right all the time. Which method would you choose?

It is important to point out that z-curve shows a small systematic bias in some situations, typically about 2 percentage points.  We developed a conservative 95% confidence interval to address this problem and demonstrated that this interval has good coverage under these conditions and is conservative in situations where z-curve is unbiased.  The good performance of z-curve is the result of several years of development.  Not surprisingly, it works better than a method that has never been subjected to stress tests by its developers.

Future Directions

Z-curve has several additional advantages over p-curve.  First, z-curve is a model for heterogeneous data.  As a result, it is possible to develop extensions that quantify the amount of variability in power while correcting for selection bias.  Second, heterogeneity implies that power varies across studies. Because studies with higher power tend to produce larger z-scores, it is possible to provide corrected power estimates for subsets of z-values. For example, the average power of just significant results (z < 2.5) could be very low.

Although these new features are still under development, first tests show promising results.  For example, the local power estimates for the Motyl et al. data suggest that test statistics with z-scores below 2.5 (p > .012) have only 26% average power, and even those between 2.5 and 3.0 (p between .0027 and .012) have only 34% power. Moreover, test statistics between 1.96 and 3 account for two-thirds of all test statistics. This suggests that many published results in social psychology will be difficult to replicate.

The problem with fixed-effect models like p-curve is that the average may be falsely generalized to individual studies. For example, an average estimate of 45% might be misinterpreted as evidence that most findings are replicable and that replication studies with a little more power would be able to replicate most findings. However, this is not the case (OSC, 2015).  In reality, there are many studies with low power that are difficult to replicate and relatively few studies with very high power that are easy to replicate.  Averaging across these studies gives the wrong impression that all studies have moderate power.  Thus, p-curve estimates are easily misinterpreted because p-curve ignores heterogeneity in true power.

[Figure: z-curve results for the Motyl et al. data (Motyl.zcurve.with.png)]

Final Conclusion

In the Datacolada 67 blog post, the Datacolada team tried to defend p-curve against evidence that p-curve fails when data are heterogeneous.  It is understandable that authors are defensive about their methods.  In this comment on the blog post, I have tried to reveal the flaws in Uri's arguments and to show that z-curve is indeed the better tool for the job.  However, I am just as motivated to promote z-curve as the Datacolada team is to promote p-curve.

To address this problem of conflict of interest and motivated reasoning, it is time for third parties to weigh in.  Neither method has been vetted by traditional peer review, because editors did not see any merit in p-curve or z-curve, yet these methods are already being used to make claims about replicability.  It is time to make sure that they are used properly.  So please contribute to the discussion about p-curve and z-curve in the comments section.  Even if you simply have a clarification question, please post it.